Thanks for these videos; they have been fun to watch. I see the benefit of function composition. However, in practice (data science), when composing functions I have always had a whole slew of unique parameters and contexts to pass to each function along the chain. Is there an equally elegant solution to this problem?
Hi Robert, good question. I like using either closures for this or partial functions (from functools). For example with closures, you can define a function (with parameters, contexts, etc.) that returns another function, and then that's the function that's passed to the composition. In terms of the example at the end of this video, you could do the following, where n is an extra parameter and add_n is a closure that returns a function:

def add_n(n: int):
    def add(x: int):
        return x + n
    return add

...

compose(add_n(5), add_n(12), multiplyByTwo, ...)
@@ArjanCodes Robert, not sure whether @ArjanCodes would approve of this, but you could define a Callable ABC base class for your functions that implements a __rmul__ (or something like that) method which performs function composition on the __call__ methods, and initialize the instances with whatever parameters you want that are not part of the functional input data. And if you make the __call__ method accept and return a dict, you can also compose functions of different arities. See the sketch below.
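For illustration, a rough sketch of that idea (using __mul__ rather than __rmul__ for simplicity; all class and method names here are hypothetical, not from the video):

from abc import ABC, abstractmethod

class Composable(ABC):
    # Wraps one parameterized step; * composes two steps.
    @abstractmethod
    def __call__(self, data: dict) -> dict: ...

    def __mul__(self, other: "Composable") -> "Composable":
        outer, inner = self, other

        class _Composed(Composable):
            def __call__(self, data: dict) -> dict:
                return outer(inner(data))

        return _Composed()

class AddN(Composable):
    def __init__(self, n: int) -> None:
        self.n = n

    def __call__(self, data: dict) -> dict:
        return {**data, "x": data["x"] + self.n}

pipeline = AddN(3) * AddN(5)   # applies AddN(5) first, then AddN(3)
print(pipeline({"x": 0}))      # {'x': 8}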
Dear Arjan, I am a three-month rookie in Python (learned classes, function basics, etc.) and interested in data things, not development 🙀 Do you think that's a problem, career-wise and for finding a job? Thanks for your kind answers and advice 🙏
It's an interesting video, but I think it's actually misguided advice for Data Science/ML projects. Data Science projects have different dynamics from software engineering projects, hence the need for MLOps platforms. Tracking is needed in the experimentation stage, when things change quickly, and writing abstractions to become independent of a particular experiment tracking platform is not creating value for anyone. What's actually important is that the experimentation code is decoupled from the model code (which is why Tensorflow and LightGBM use callbacks… PyTorch doesn't, but PyTorch Lightning does, which is why I would always use PyTorch Lightning and not raw PyTorch). Moreover, where I feel abstractions are really powerful is for the model itself, because in order to do model selection I may have to apply a fair evaluation to models that utilize different frameworks (e.g. PyTorch vs LightGBM) or even different problem framings. The first point is what MLflow Models tries to accomplish.
27:07 yo dog I heard you like lambda functions, so I put a lambda function in your lambda function so you can function while you function. ...but really this function composition business is actually breaking my mind. I'll need to practice this one.
Hi Arjan, I write a lot of models and I wanted to ask if you have tips regarding what I imagine is a very simple issue. Version hell. I write code on multiple machines, using multiple styles: jupyter notebooks, org buffers, and of course scripts. Everything is almost always contained in a pipenv environment. But when I try to pipenv install on different machines I keep getting all sorts of version-related errors. I think I am missing some key insight here. There is no way python has such a sloppy design :D Any tips will be really appreciated!
How can the Tensorboard class do anything with the experiment tracker class, since you removed the inheritance? I can't see how the two classes are linked any more. What's the point of the experiment tracker class now?
That’s the whole idea of protocols. The relationship no longer exists between superclasses and subclasses, but you use protocols to define the interface at the place where it’s needed and Python’s structural typing system then does the type checks. So in this example, the goal of the experiment tracker protocol class is not to act as a superclass, but to act as an interface of the part of the code that uses it, here that’s the main file and the Runner class.
Overall, I find this gives more flexibility and offers a better separation of responsibilities. In this case, there are several responsibilities of the original abstract class: defining what the interface is between the experiment tracking and the rest of the code, keeping track of the experiment stage, and providing helper methods. I prefer to keep the single responsibility of the abstract class to define the interface and then use either inheritance or composition to provide the other features you need. For example here, I moved the set_stage implementation to the Tensorboard experiment tracker. Alternatively, if you want to be able to reuse the basic implementation of handling the experiment stage, you could create a subclass "BasicExperimentTracker" that provides that implementation, and then your more specific experiment trackers could inherit from that class.
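A minimal sketch of that reuse-through-inheritance option (names are illustrative, not the exact code from the video):

from enum import Enum, auto

class Stage(Enum):
    TRAIN = auto()
    TEST = auto()

class BasicExperimentTracker:
    # Reusable stage handling; concrete trackers inherit just this part.
    def __init__(self) -> None:
        self.stage = Stage.TRAIN

    def set_stage(self, stage: Stage) -> None:
        self.stage = stage

class TensorboardExperiment(BasicExperimentTracker):
    def add_epoch_metric(self, name: str, value: float, step: int) -> None:
        print(f"[{self.stage.name}] {name}={value} @ step {step}")  # stand-in for a SummaryWriter call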
💡 Here's my FREE 7-step guide to help you consistently design great software: arjancodes.com/designguide.
I think it's the data science, natural science, and non-IT engineering people who would actually benefit the most from your software-design-centric videos. I'm one of them, and we literally write spaghetti code on a daily basis without ever being taught the SOLID principles =). Thanks, and you're making those who listen better!
I'm a professional data scientist and I've been following the channel since the beginning. It has been essential in helping me become a better software engineer, even though that's not my main job requirement but rather my everyday tool...
Same here, data scientist greatly benefitting from this channel
Agreed - as a data scientist who is proficient in data wrangling, ML, etc., but definitely lacking in solid software development principles, I'd benefit a ton from more videos like these!
Thanks! It’s definitely an area I’d like to do more videos on in the future.
Same, very happy to see Arjan covering this topic, as it's what I was looking for a few months ago when I first discovered his channel
I am a senior data scientist, and I benefit from all your videos. Building architecture, productionizing and scaling up ML models is challenging. It requires good software engineering practices and a good understanding of the full software development stack. Good work as usual Arjan.
Thank you, glad you liked it!
Hello Tunapedia,
I came across your insightful comments on this video. I'm currently deepening my skills in data science and recently secured second place in an NLP competition on Zindi. I admire your expertise and would appreciate any guidance or insights you can provide on potential job opportunities in the field.
Thank you.
Yes PLEASE do more videos like this at the intersection of data science / ETL pipelines and software engineering. It's extremely helpful for those of us who have come into building software from another adjacent field and are now struggling with big messes of our own making :)
Thank you Zane, will do!
i second this request!
I'm a data scientist and machine learning researcher, and looking into code design and refactoring from your perspective is very helpful for me in terms of coding! Thanks a lot
This helper function to compose is a gold nugget. I think it should go into the functools module so we could simply import it. The idea is so intuitive that it wouldn't be a problem if it weren't explicitly defined in the codebase.
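For readers who haven't watched yet, a reconstruction of roughly what such a helper looks like (a sketch based on the reduce-based version discussed in this thread, not a verbatim copy of the video's code):

from functools import reduce
from typing import Callable

ComposableFunction = Callable[[float], float]

def compose(*functions: ComposableFunction) -> ComposableFunction:
    # compose(f, g)(x) == g(f(x)): functions apply left to right
    return reduce(lambda f, g: lambda x: g(f(x)), functions)

def add_three(x: float) -> float:
    return x + 3

def mul_two(x: float) -> float:
    return x * 2

print(compose(add_three, mul_two)(12))  # (12 + 3) * 2 = 30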
The "Unsatisfying cliffhanger" is me realizing I now have to go through a lot of refactoring because I've done this lazy single-variable function chains waaay too much... Great job as always, thank you Arjan !
I'm a simple man. I see Arjan post, I hit like button. As a DS student, this actually helps a bunch. Thanks brother!
All data science programming I've ever seen is usually written for a one-off experiment with very few principles applied, whether SOLID or reproducibility. The code is often not object-oriented and is more functional - written in declarative linear steps in one script. Even the code you are starting with here is in better shape. I'll be watching for sure to see these software development principles applied to that sort of programming style.
I always used coding as a tool to test my hypotheses. Your videos put into perspective why and how writing code is much more than that. I am not a trained software engineer but, professionally, a data scientist. I feel your videos are really helping me fill glaring gaps in the software design process while conceiving my data projects, and this is important for the data science community, as most of us are not from a software engineering background. Please make more videos in this series.
Godspeed.
Hi Arjun, thank you, I'll definitely continue in this direction. I think there are a lot of things to cover, so stay tuned!
How on earth does this man not have more subscribers? I mean, most people would benefit; it's their problem if they don't watch these lmao. I'm just glad I'm one of the first to hear his wisdom.
This is actually what I do at work - working in a Data Science team as a Software Engineer with some prior ML knowledge. I have to tell you that the code you received for refactoring here is actually what I would consider state-of-the-art design ;- ) No offence to Data Scientists, I totally understand how complex their world is!! Hopefully, as the discipline matures a bit more and, sadly, more projects fail due to quick & dirty solutions, we will all be in a better place. Thank you for your work.
You're most welcome and I absolutely agree with you - data science is a very complex field and it makes total sense that data science education programs have to spend all their time on data science concepts, leaving little room for software engineering practices!
Arjan, I'm in awe at your ease of reworking things just by looking at them. And it works every time! I recently followed all your advice in a program I'm developing, and it took me a day just to get the thing running again in the new format. We are incredibly lucky to have you teaching us this stuff. Most courses will repeat the design principles over and over, but getting to see them applied so naturally really makes them stick. Thank you so much.
I’m a JS guy but have learned so much from watching your videos. Thanks!
As a data scientist that was already watching your content, definitely looking forward to this series!
Thanks!
It's worth pointing out that those single-variable function calls are often preferred, because network composition is rarely purely sequential. In general, it is a DAG. For experimenting, it's important to be able to quickly access intermediate results of the network, and a chain of calls makes that much easier. In practice it's more important to detect repeatable and meaningful patterns in the network and split them into separate classes - e.g. a network may consist of a sequence of 12 layers, but it could be conceptually easier to view it as a sequence of 4 blocks of 3 layers each (see the sketch below).
tl;dr - don't refactor out all single-variable function calls right away
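A minimal PyTorch sketch of that block-grouping idea (the layer choices are illustrative):

import torch.nn as nn

def block(in_ch: int, out_ch: int) -> nn.Sequential:
    # one conceptual "block" = conv + norm + activation
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    )

# 12 layers, but read as 4 blocks of 3
net = nn.Sequential(block(3, 16), block(16, 32), block(32, 64), block(64, 64))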
Good to know, thanks!
In my experience, almost every new ML engineer starts the journey by solving a very simple problem like classification and implements a kind of "Trainer" object. There is a lot of inversion of control to adjust certain parts of the experiments. It seems like a stable framework, but it collapses pretty quickly when they try to do something more complicated.
There are a few popular frameworks that approach this a bit more maturely. I think it would be interesting to see an analysis and comparison of libraries like Keras, Ignite and PyTorch Lightning from the perspective of an experienced programmer. They all invent some kind of callback or hook mechanism to control data loading and model training (sketched below).
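The callback/hook idea these frameworks share, boiled down to a minimal sketch (not any particular framework's API):

from typing import Callable

Callback = Callable[[int, float], None]  # (epoch, loss) -> None

def train(n_epochs: int, on_epoch_end: Callback) -> None:
    for epoch in range(n_epochs):
        loss = 1.0 / (epoch + 1)   # stand-in for a real training step
        on_epoch_end(epoch, loss)  # the hook: caller decides what happens here

train(3, lambda epoch, loss: print(f"epoch {epoch}: loss={loss:.2f}"))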
Thank you for your content Arjan, I have intermediate python skills but have been learning a lot from your refactoring videos. Moving to OOP for my projects has been a steep but rewarding curve. Thanks again!
I've learned the most from your refactoring videos. Really enjoy them. Especially seeing SOLID principles in practice made them super easy to understand.
Great to hear, thanks!
You shouldn't make pure data science/machine learning content, because there is already plenty of that.
A sort of "Software design for data scientists [Dummies]" could be a great contribution!
100% agree with a series on Software Design for Data Scientists!
I agree - I also wouldn't feel very comfortable doing pure data science / ML stuff since that's not my main area of expertise. But I'll definitely think more about how design principles and patterns can be used in this setting!
@@peterdowdy174 Probably Kedro could be useful to combine notebook and code itself.
P.S. Kedro - open-source Python framework for creating reproducible, maintainable and modular data science code
@@peterdowdy174 Hey Peter, I have been struggling with this topic for a few years and ended up here: notebooks are great for local/quick/dirty experiments, but not for proper, production-grade code. For many, many reasons... Once I accepted this, my life became a happier place ;) Greetings and all the best!
Yes please, that is such an important content to have
Hi Arjan, I’m an astronomer learning to code more properly, and I work exactly with code like this often. This was so unbelievably helpful. Thank you for starting this series and I’m looking forward to more like it.
It’s difficult to prototype things in a Jupyter notebook, get it running, then refactor to something shareable and useable and understandable by others that may need to work with it. You’re teaching me a lot, keep it up!
I'm a proto-astronomer, going through the same process as you :D
Some feedback: While seeing your face is always a bright point of any day, I still felt that you would often cut to a fullscreen camera view of yourself while talking about the code you just cut away from, which made it a bit hard to follow the structure of the code.
Like, at 3:10 you said "You can see this happening here" during a cut where we literally can't see it happening, which caused a weird disconnect in my brain where I felt like I had to switch gears with each cut, trying to take in as much information as possible before the next cut would interrupt the reading.
It's an interesting video, but these cuts made it hard to follow.
I totally agree with this.
Yes, I also noticed this a bit too late. Will make sure this is better in the next videos.
@@ArjanCodes Your other videos, editing-wise, have excellent pace and I don't notice the cuts at all, making it easy to follow along. This one felt like the cat was standing on the "cut" key.
Haha, I did start working with a cat (read: video editor ;) ) a few weeks ago. It's clear we still need to fix a few things in the process, but I'm on it.
Absolutely - this was really stopping me from understanding the process. Stay in the small box if you are talking about specific code.
I remember dealing with MNIST data sets in college when I was learning Machine Learning. I was taking an OOP course at the same time, and my first ML (Machine Learning) assignment was a single-layered neural network with 10 perceptrons. Even though I went object-oriented with the assignment, it took forever to go through the training and testing data - 12+ hours of runtime in total. It wasn't that accurate either, like 75-80%. However, I redid the assignment, abandoning most, if not all, OOP principles and going towards something more procedural and mathematical (linear algebra, to be precise). There was a huge difference in my experience. The code was easier to read, easier to understand, and a lot faster: it went through the training and testing data in less than 1 second and reached 92-96% accuracy.
Worth pointing out that both sklearn's Pipeline and torch's Sequential compose _classes_ satisfying certain interfaces and return _classes_ (with possibly different capabilities). Which is a bit more complicated than function composition, but usually necessary in real-world situations where the aggregate process needs more capabilities than just being Callable.
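A small sketch of the scikit-learn side of this, to make the point concrete (standard estimators, illustrative choice of steps):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
# The composite is itself an estimator: it exposes fit/predict/get_params,
# not just __call__ - richer than plain function composition.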
I am really interested in design for data science applications. I used to be a programmer, but did other stuff for a lot of years; the reason I am back in programming is data science. But I find that the practices I am used to from application programming are lacking in the world of data science. So this is a great one!
I really liked this novel method of "Code Refactoring" & "Code Roast" - looking at things from software best practices and seeing how to correct these common mistakes. I would like to see more such videos.
Great vid, was looking forward to this for a while since you mentioned on Reddit that you had plans to approach ML/DS from a software engineering perspective. Much better to refactor a project that is a real-world scenario, rather than the simple hypothetical examples which are abundant.
I love this - this video and the comments save me a lot of time returning code reviews to data people over and over! Now I can just send them here to learn what non-spaghetti code looks like!
I love these refactoring series. So informative. Thanks, not only to Arjan, but to the people who submit their code to literally be picked apart and rebuilt.
I really loved this video. I work in Quantitative Finance, where we have to write a lot of code (usually in a scientific programming language, a.k.a. Python), and I've benefited a lot from these videos. A lot of the code I've encountered is spaghetti code, and just starting to approach problems from good design principles has really helped in increasing the flexibility, maintainability, and readability of my code. I always look forward to watching these videos! Hopefully you'll cover more advanced topics of Python and designing systems in the future.
Thanks, I'll definitely do more videos like this in the future!
I just love watching you delete lines of code, keep up the great and informative videos
I'm a newb in Python, and being experienced in other languages, it is hard to flip the switch to a new one. Arjan's videos have been crucial to my understanding of the "Pythonic" way. Thanks man! Keep 'em coming... I don't know if it is your focus here, but I would love to see you talk about a project using PyQt5 ;)
Thank you, glad you like the videos and good topic suggestion!
Wonderful video! 🙏
Among many other things, you've shown me three nice ways to compose a sequence of functions:
1) with a torch network
2) with a scikit-learn pipeline
3) with functools.reduce
I agree the third is very attractive. Some may find it a bit strange that the order of the functions switches, but that's not a defect in my eye.
I wish I had seen this video two years ago. I write this kind of project all the time. I learned the hard way to do it like this.
Fantastic video, I'm eager to watch the two next parts. From my PhD studies in AI I can tell the majority of research code in ML and AI is terribly written and barely readable, even with published works. The guidelines for clean ML code are just starting to emerge and at times I feel there's even more confusing ML config / scheduling / architecture tools released every day than confusing JS frontend tools (and there's a JS framework released almost every day lol). Good to see plain old good design being used in this context. Content like this is VERY valuable, hope to see more ML refactoring videos! All the best!
Thanks and glad to hear you enjoyed the video! Let me know what you think of the other two. I'll certainly revisit more data science oriented content focused on design. Doing this miniseries was a lot of fun.
@@ArjanCodes so I've already watched the other two and really enjoyed them as well. Very clean, understandable and applicable approach, and I think your channel really nicely fills a gap in intermediate to advanced programming topics. I really appreciate the references to Dijkstra, Hoare, SOLID, GRASP etc. - super rare to see that on YT. I've also watched your Hydra video and I really like how it complements this miniseries - Hydra is getting lots of interest in the community these days. Another tool that's growing in popularity and could also be interesting for a future video is PyTorch Lightning - it introduces an opinionated design into PyTorch and also aims to clean up some of the clutter which can be found in 90% of AI code.
Looking forward to part two! Learned a lot and will be rewatching
Thanks Jonathan, glad you liked it!
Using Sequential is one way, and it works nicely when the model has a linear flow. However, if you want to build a model with - for example - 2 outputs sitting at different levels of the model, you need to use the non-sequential way, and then the x for all the intermediate stages starts to make sense :)
In this case I would prefer to have a class for defining a Directed Acyclic Graph. Perhaps PyTorch also has this... I didn't check.
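For reference, a minimal sketch of the multi-output case described above, where the intermediate x assignments are exactly what make the branch possible (the architecture is illustrative):

import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.backbone = nn.Linear(8, 16)
        self.head_a = nn.Linear(16, 2)   # output taken mid-model
        self.tail = nn.Linear(16, 16)
        self.head_b = nn.Linear(16, 4)   # output taken at the end

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        x = torch.relu(self.backbone(x))
        out_a = self.head_a(x)           # branch off the intermediate x
        x = torch.relu(self.tail(x))
        out_b = self.head_b(x)
        return out_a, out_b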
Function composition is really cool and makes the code very concise and clean. However, I feel like we achieve that at the cost of readability, and it additionally makes it hard to debug intermediate calculations/steps if you suspect something is wrong (in reality this happens very often when there is a lot of math involved in the code). Some (picky) managers might not like it during code review/pull requests, for the reasons stated.
Exactly what I was thinking
I'm loving it! Please continue doing videos like this one :D
I'm learning a lot from it - your videos are one of the most valuable/useful ones I've seen for Python or software design in general
Glad to hear it, thank you!
I like to use multiple inheritance for string Enum classes. For example:
class MyEnum(str, Enum):
    RED = 'RED'
    BLUE = 'BLUE'
    GREEN = 'GREEN'
*Make sure the str comes first.
Then you can use the class like normal, MyEnum.RED, and you can also use a string literal. It avoids the need to use the 'name' attribute. Lastly you also get equality if you are comparing the enum to a string literal.
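A quick sketch of the equality behavior described (note: the str()/f-string output of mixin enums varies across Python versions, so only equality and .value are shown):

from enum import Enum

class MyEnum(str, Enum):
    RED = 'RED'
    BLUE = 'BLUE'
    GREEN = 'GREEN'

assert MyEnum.RED == 'RED'            # equality with plain string literals
assert isinstance(MyEnum.BLUE, str)   # members really are strings
assert MyEnum.GREEN.value == 'GREEN'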
maybe not when this came out, but now is a helluva time to start doing data science material
Thanks! I'm a Data Engineer and this helps a lot!
Thanks so much Bruno, glad it was helpful!
Once again, excellent stuff Arjan. Definitely going to work with the function composition!
Thanks so much Coert! :)
Very stimulating and educational video. Love the pace. Thank you.
I've been hunting for a nice way to do function composition in standard-library Python for a while, and this version with type hints is 👍
Really good video, very well explained, and I can see in the comments below that you have noted the jump cuts away from code. Fixing that will really make the videos perfect! Thank you
One point to consider from a data scientist: a lot of the time we like quick and dirty iterations towards our exploratory and predictive insights. Many times (especially under time constraints) quick and dirty is better than slow and beautiful. That's why I personally love notebooks. As long as a notebook is idempotent (it runs from start to end without issues) and the environment is containerized, it is reproducible. But I see the merit of both. There is a lot of power in writing scalable and reusable code in this space to organize the complex pipelines that supercharge society's solutions. This is why, over time, I have learned to use a hybrid of both - but maybe not in the most optimal or well-principled way.
Which leads to my suggestion! Would you be able to make a video on how you would use Jupyter Notebooks/Kaggle Kernel Notebooks/Google Colab Notebooks in tandem with an internal packaged-up repository, as you have it in the video, for DS projects? Maybe this means just maintaining your current directory structure as shown in this video but adding a "notebooks" folder to the root folder where all that type of analysis is done, since we can call your modules from that notebooks folder (not sure how this would be manifested; you probably have a better idea). You use .py scripts for most things, so that you can install these scripts as modules for use in other scripts or even notebooks, and that is what I have been doing to keep my notebooks cleaner. But I am sure your perspective on how to have fast iteration times to high-value insights, maintain a scalable pipeline, and yet keep everything reusable - maybe even some sort of generalized approach shown through a video example - would be invaluable. I think this would be a game changer for myself and a lot of people in DS and ML.
As for this video, your other content has been useful, but seeing it directly applied to the type of work I do on a regular basis brings your concepts to life for me. Please keep these software design principles applied to DS crossover content coming! Thank you for what you do :)
Thanks and great suggestion regarding the combination of notebooks with running python scripts in a repository. I'll look into it!
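For anyone who wants to try the hybrid setup described above right away, a minimal sketch (the paths and module names are purely illustrative):

# Layout (illustrative):
#   repo/
#     pyproject.toml       # makes the project pip-installable
#     src/my_project/      # reusable pipeline code (.py modules)
#     notebooks/           # exploratory notebooks importing my_project
#
# After running `pip install -e .` at the repo root, a notebook can do:
from my_project.dataset import load_data   # hypothetical module

df = load_data("data/raw/train.csv")       # iterate quickly on real pipeline code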
PLEASE give us more of exactly this content! Awesome videos, going to spread the word! : ]
Thanks! Will do!
The most important thing I've learned (I'm still learning) for writing good, cleaner, and reproducible data science code was the functional programming paradigm. R (with the tidyverse and tidymodels approach) and the Julia programming language made me code almost as if I were using "General System Theory" from Bertalanffy (ins -> transformations -> outs). With this approach, I can change the ins without breaking all the code, or I can change the functions (transformations, each one with its own rule) without breaking all the code logic. Since I use Python only for NLP tasks, I do not use a functional programming paradigm with it, but I know it is possible, maybe even easier, in Python (function composition was good to learn about). The OO paradigm that some data scientists use for Data Science does not make any sense to me - of course, I am not a professional programmer, so maybe I think that way for lack of a grounding in computer science. By the way, I'm learning a lot with you! Thank you very much!!!
Thanks Igor, glad you like the content! Using pure functions is certainly a great starting point. What OO programming brings to the table is that it provides a nice mechanism for structuring data representations via (data)classes and collection objects such as lists, dicts, and so on. Ideally, you'd have a marriage of both that provides a clear structure of the data, and has data manipulation pipelines with very limited coupling and side effects.
@@ArjanCodes Thank you! I will try to apply this approach to my NLP study codes, I know I have a lot to learn to be able to understand OO stuff, classes, dataclasses, but your videos are helping me a lot.
Really cool compose function. Going to use that.
This looks more like a deep-learning project than a data-science one (it uses Torch and Tensorboard to follow the network training, instead of something like Pandas), which is actually exactly what I need right now: I work a lot with PyTorch and PyTorch Lightning and I'm looking to improve my code.
The issue that I have with torch.nn.Sequential is that it's annoying to debug when you have an error in your network-building lego, but if you're sure that the lego is correct, it is cleaner to use Sequential.
Nice video 😊 Hope it reaches all my data scientist colleagues.
There are many similarities across machine learning projects; this makes me wonder why there are no custom design patterns for ML projects?
Thanks! I'll try to come up with a few ideas for this and cover that in future videos.
Great video!
IMHO, a simple loop over a list of functions is much easier to read:

x = 12
for func in (add_three, add_three, mul_two, mul_two):
    x = func(x)
Reiterating the others: a very useful video for data scientists!
I liked the idea of replacing the nested call with the compose function, but what about an "apply" function instead?

def apply_composition(x, *functions):
    for func in functions:
        x = func(x)
    return x

For me, this seems easier to read than the functools solution... and it's similar to the idea of the torch.nn.ModuleList container in PyTorch.
This way you are rebinding x to f(x) in the same fashion as the original implementation.
Once more, a very useful and nice video! Thank you!
Glad it was helpful!
Wow thanks Senpai! Will definitely share on my linkedin and with my data engineering team
Thank you, happy you like it!
As always a great video! The only suggestion I would add is maybe to turn off Intellisense for the video, because all the red squiggly lines are a bit overwhelming and actually useless because the code works!
Thanks for the tip! I might do that for future refactorings (at least in the beginning :) ).
Okay this video is gonna blow up imo
Great video. What do you think of folder refactoring? In some repos, I have seen people putting files/classes in a separate folder called "commons" for utility files that are used agnostically across the project. I think this would be a great idea to touch on in a future video. Nonetheless, the best python videos on youtube hands down! Keep up the great content!
Really nice video. When working on ML/DS problems, I always end up using ugly designs/hacks that get the job done. And then refactoring is such a pain. Thank you for this advice :D
Thank you Sergio, glad you liked it!
The one issue I have with this design is that it is based solely on PyTorch, so if you'd like to move to another framework such as TensorFlow, this will require quite a bit of refactoring (without even taking into account learning the new framework), most likely making breaking changes for consumers of the project.
In general, this is a really hard problem to solve. Especially since most frameworks like Pytorch, TensorFlow, etc. ask you to "marry" the framework and use their data types all over the place, which then makes it hard to replace the framework with something else. I'll look into this and try to come up with some ideas to do a video about this.
Next time, you should definitely do a tensorflow/keras project. Would love to see how you would go about cleaning up the code in a project like that. full disclosure: I've written a very convoluted DL project with tf.keras and I'm 100% positive it can be written better
Great suggestion! Feel free to submit your code as a Code Roast, and I'd be happy to take a look if it's something I can cover on the channel.
@@ArjanCodes I will try and see if I can package it up in a meaningful way. Right now it is split across two private github repos and trains on a rather large and proprietary image dataset
Thanks for making this video (and similar ones). They are very helpful and insightful.
Thank you Ingo, glad you liked the video!
Maybe already answered, but does Pandas have function composition (aka network or sequential)? IMO this is a huge benefit of using the R tidyverse: the %>% operator is called a "pipe", but it seems to work exactly like function composition and is extremely well supported and flexible.
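(For what it's worth, pandas does support this style via DataFrame.pipe, which chains much like %>%. A minimal sketch:)

import pandas as pd

def add_total(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(total=df["a"] + df["b"])

def drop_small(df: pd.DataFrame, cutoff: int) -> pd.DataFrame:
    return df[df["total"] > cutoff]

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
result = df.pipe(add_total).pipe(drop_small, cutoff=5)  # reads left to right, like %>%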
Great video! One thing that I did not quite understand: when you changed the ExperimentTracker from an abstract base class into a protocol then the TensorboardExperiment no longer inherits from ExperimentTracker. I do not see the connection between the two classes anymore. After the refactor, to me ExperimentTracker seems like an unused class. Or am I missing something?
After changing the ExperimentTracker to a Protocol class, the inheritance relationship between it and TensorboardExperiment is indeed gone. However, ExperimentTracker is used in the Runner class where it defines the interface that is expected for connecting the Runner with the experiment tracker. The result is that you can now create other experiment tracking classes that integrate seamlessly with the Runner class, as long as they implement the methods defined in ExperimentTracker.
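To make the structural-typing link concrete, a minimal sketch (the class and method names are illustrative, not the exact code from the video):

from typing import Protocol

class ExperimentTracker(Protocol):
    def add_epoch_metric(self, name: str, value: float, step: int) -> None:
        ...

class TensorboardExperiment:  # note: no inheritance needed
    def add_epoch_metric(self, name: str, value: float, step: int) -> None:
        print(f"[tb] {name}={value} @ step {step}")  # stand-in for a SummaryWriter call

def run_epoch(tracker: ExperimentTracker) -> None:
    # Any object with a matching add_epoch_metric method type-checks here.
    tracker.add_epoch_metric("accuracy", 0.9, step=1)

run_epoch(TensorboardExperiment())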
Overriding the 'forward' function in the Torch Model and updating the state (tensor) of the neural network at each step is actually the way recommended by PyTorch.
I’ve written a lot of spaghetti code to process scientific data. It’s usually so bad that it just stays as a notebook that’s copied over and laboriously edited for each new time I repurpose it. Really think this is useful content. More please.
Thanks for your great content.
However, I didn't find your custom composition function useful; PyTorch's Sequential or scikit-learn's Pipeline seem more appropriate.
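For example (a rough sketch; the layers and pipeline steps are made up):

import torch.nn as nn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# PyTorch: compose layers into a single callable model
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))

# scikit-learn: compose preprocessing and an estimator into one pipeline
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])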
Yes please! More of these
Hi Arjan. Love your stuff. Would you be able to create a comprehensive video about how to structure a bigger project? I've been tasked with creating a PySide2 application with at least 3 windows (Main, Settings, Results) and I'm not sure how to structure it so it's not all inside one file, because that's just too much chaos. How do I connect signals to functions, and where do I write them? Should each window's code be its own file? How do I connect everything, and how do I pass a variable from one window to another?
I would love to know about the vscode keyboard shortcuts you love the most
this video is amazing - can we please get another data science / ml pipeline refactor?
At 14:25 you decide to remove the protocol inheritance, making it implicit. There is no difference to the working of the code, though it does make life harder for anyone needing to change and understand this class, for it is not clear anymore that it should adhere to the protocol.
Awesome! I think there are few tutorials about software design topics for data science.
Except for using Protocol instead of ABC, your video is nice :)
Protocol makes things less clear.
Silly question: why do we need to avoid storing intermediate results in the same variable?
Yes to more Data Science!
I'm taking notes 📝
Would you ever consider overriding `__str__` on an Enum to return `self.name`? That would avoid having to add `stage.name` in all those f-strings. Feels neat to me from a code-repetition perspective, but it does violate the "explicit is better than implicit" guidance of the Zen of Python. I'd be really interested in your opinion.
Great suggestion, and I think it works really well in this particular case.
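A minimal sketch of the idea (the Stage members here are illustrative):

from enum import Enum

class Stage(Enum):
    TRAIN = "train"
    VAL = "val"
    TEST = "test"

    def __str__(self) -> str:
        return self.name

print(f"Running the {Stage.TRAIN} stage")  # prints "Running the TRAIN stage"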
I loved this video. It was the perfect moment to apply the SOLID design principles to data science, because I work with it on a daily basis. Could you apply the SOLID principles to the pandas library, since it's the most used library for data processing? Again, thank you very much!!
How do you handle errors if one of the composed functions raises an error?
Sponsored by 'basically'.
Just kidding, great content. Keep it up.
Where can I learn how to design professional machine learning projects? All I can find is Jupyter notebooks, but I want to do it more professionally.
Thanks for these videos, they have been fun to watch. I see the benefit of function composition; however, in practice (data science) when composing functions I have never not had a whole slew of unique parameters and contexts to pass to each function along the chain. Is there an equally elegant solution to this problem?
Hi Robert, good question. I like using either closures or partial functions (from functools) for this. For example, with closures you can define a function (with parameters, context, etc.) that returns another function, and that returned function is what gets passed to the composition. In terms of the example at the end of this video, you could do the following, where n is an extra parameter and add_n is a closure that returns a function:
def add_n(n: int):
    def add(x: int):
        return x + n
    return add

...
compose(add_n(5), add_n(12), multiplyByTwo, ...)
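Alternatively, with functools.partial (same idea, sketched with the names from above; compose and multiplyByTwo are the ones from the video):

from functools import partial

def add(x: int, n: int) -> int:
    return x + n

compose(partial(add, n=5), partial(add, n=12), multiplyByTwo)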
@@ArjanCodes Robert, not sure whether @ArjanCodes would approve of this, but you could define a Callable ABC base class for your functions that implements a __rmul__ (or something like that) method performing function composition of the __call__ methods, and initialize the instances with whatever parameters you want that are not part of the functional input data. And if you make the __call__ method accept and return a dict, you can also compose functions of different arities.
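Something along those lines might look like this (a rough sketch using __mul__ rather than __rmul__):

class Composable:
    # wraps a callable so that (f * g)(x) == f(g(x))
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

    def __mul__(self, other):
        return Composable(lambda x: self.fn(other(x)))

double = Composable(lambda x: x * 2)
increment = Composable(lambda x: x + 1)
print((double * increment)(3))  # double(increment(3)) == 8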
Dear Arjan,
I am a three-month rookie in Python (I've learned classes, basic functions, etc.)
and I'm interested in data things, not development 🙀
Do you think that's a problem, career-wise and for finding a job?
Thanks for your kind answers and advice
🙏
Great video, thanks!
See you next week
Thanks, glad you liked it!
Great topic!
It’s an interesting video, but I think it’s actually misguided advice for data science/ML projects. Data science projects have different dynamics than software engineering projects, hence the need for MLOps platforms. Tracking is needed in the experimentation stage, when things change quickly, and writing abstractions to become independent of a particular experiment tracking platform is not creating value for anyone.
What’s actually important is that the experimentation code is decoupled from the model code (which is why TensorFlow and LightGBM use callbacks… PyTorch doesn’t, but PyTorch Lightning does, which is why I would always use PyTorch Lightning and not raw PyTorch). Moreover, where I feel abstractions are really powerful is for the model itself, because in order to do model selection I may have to apply a fair evaluation to models that use different frameworks (e.g. PyTorch vs LightGBM) or even different problem framings. The first point is what MLflow Models tries to accomplish.
That's a great video! Thanks a lot!
27:07
yo dog I heard you like lambda functions, so I put a lambda function in your lambda function so you can function while you function.
...but really this function composition business is actually breaking my mind. I'll need to practice this one.
Do you have Kite installed for autocomplete?
Hi Arjan, I write a lot of models and I wanted to ask if you have tips regarding what I imagine is a very simple issue: version hell. I write code on multiple machines, using multiple styles: Jupyter notebooks, org buffers, and of course scripts. Everything is almost always contained in a pipenv environment, but when I try to pipenv install on different machines I keep getting all sorts of version-related errors. I think I am missing some key insight here. There is no way Python has such a sloppy design :D Any tips will be really appreciated!
DVC, MLflow, and/or Kedro will change your life. They changed mine :)
You are the Bob Ross of coding
Thanks Gercius, happy you’re enjoying the content!
Thanks Arjan!!
You're welcome Zeki, glad you liked the video!
23:50 Is yield also a good solution?
How can Tensorboard do anything using the experiment tracker class, since you removed the inheritance? I can't see how the two classes are linked any more. What's the point of the experiment tracker class now?
That’s the whole idea of protocols. The relationship no longer exists between superclasses and subclasses, but you use protocols to define the interface at the place where it’s needed and Python’s structural typing system then does the type checks. So in this example, the goal of the experiment tracker protocol class is not to act as a superclass, but to act as an interface of the part of the code that uses it, here that’s the main file and the Runner class.
@@ArjanCodes Ahh thank you
Debugging function composition is painful. It's much better to have variables with unique names between calls.
Great stuff
Thank you Vladimir!
Why do you switch from showing the code you're discussing, to showing yourself full screen and removing the code from view?
okay catching up on vids 😋
9:10 Why do you think abstract base classes should only have abstract methods and not attributes?
Overall, I find this gives more flexibility and offers a better separation of responsibilities. In this case, there are several responsibilities of the original abstract class: defining what the interface is between the experiment tracking and the rest of the code, keeping track of the experiment stage, and providing helper methods. I prefer to keep the single responsibility of the abstract class to define the interface and then use either inheritance or composition to provide the other features you need.
For example here, I moved the set_stage implementation to the Tensorboard experiment tracker. Alternatively, if you want to be able to reuse the basic implementation of handling the experiment stage, you could create a subclass "BasicExperimentTracker" that provides that implementation, and then your more specific experiment trackers could inherit from that class.
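A rough sketch of that alternative (the names are illustrative):

from abc import ABC, abstractmethod
from enum import Enum

class Stage(Enum):
    TRAIN = "train"
    VAL = "val"

class ExperimentTracker(ABC):
    # interface only: no state, no helper methods
    @abstractmethod
    def set_stage(self, stage: Stage) -> None:
        ...

class BasicExperimentTracker(ExperimentTracker):
    # reusable default implementation of stage handling
    def set_stage(self, stage: Stage) -> None:
        self.stage = stage

class TensorboardExperiment(BasicExperimentTracker):
    # inherits the basic stage handling, adds Tensorboard-specific behavior
    pass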
Thank you