NEVER Worry About Data Science Projects Configs Again

ArjanCodes

91,636 views

Published on 7 Jan 2025

Comments • 124

@ArjanCodes 3 years ago ⁺⁴⁷
Happy holidays everyone!
@HypnosisBear 3 years ago ⁺²
Happy Holidays!!
@top-notch-tech 3 years ago ⁺¹
Happy Holidays!
@cetilly 3 years ago ⁺¹⁹
Arjan, I really appreciate that you cover topics that no one else does and/or you do so in a very in depth way. The level of professionalism that you bring to these topics is unlike anything else out there and I’m loving it.
@Vanessa-vz5cw 2 years ago ⁺¹³
For those looking for a solution that doesn't require you to create the store, work with decorators, etc. while still allowing dynamic, nested configs, I would recommend looking more into OmegaConf. It's the library that Hydra is built on top of, and it allows for defaults, nested configurations, multiple files, etc., but you can then create your own method or class to customize how configs are read and passed around the program.
@kayb4490 8 months ago ⁺¹
I’ve seen a few people saying they think the overhead is more work than benefit. I would probably agree for some workflows but I have to say for machine learning or any kind of work with tons of configuration parameters and experimentation this library is a life saver. It cleans up the code a lot
@Xaelum 3 years ago ⁺¹³
Great video!
I agree with Hydra being a bit too convoluted, which is why I limit my projects to OmegaConf (which is also what Hydra uses under the hood). You should give it a try.
@niklase5901 3 years ago ⁺⁴⁷
I am usually on board with what you have to say. But here you define the configuration both in the YAML file and in the dataclass, which seems like a lot of added code to solve this problem.
@selimrbd 3 years ago ⁺¹⁰
I agree, I'd expect Hydra to create the dataclass definition automatically from the YAML file structure, removing the redundancy.
@EvanBoldt 3 years ago ⁺¹⁵
I think the concept is good, but the library seems to be doing maybe more harm than good. It does almost none of the heavy lifting since it basically just parses YAML and it adds some weird overhead like the decorator and storage setup.
I think it’s a good idea to pull config out into a file early on in development. I might consider doing this but parsing the YAML internally.
@danielepicone1480 3 years ago
I am fine with this subdivision, as it allows some validation to be done when the configuration schema is defined.
What I really dislike, though, is that whenever the config is manipulated internally (e.g. by transforming str into Path, as is done in this video), one needs to define a structure both for the parser and for the actual parameter usage within the project, which to me looks absurdly redundant.
@transatlant1c 2 years ago
@EvanBoldt agree, I would recommend looking into profig or decouple as a middle ground. Unless you need the extra features Hydra provides, either of these will likely be suitable in most scenarios.
@ilkerbishop4217 a year ago ⁺¹
hydra-zen exists exactly to solve this problem. With it, one can define configs only in the Python file and still use all Hydra features.
@kayakMike1000 3 years ago ⁺²
Excellent observation! I am a configuration engineer, honestly that is really just my title, but... yes indeed, configurations at the top push values down into lower base classes, which leads to functions with lots of args. This does reduce cohesion and can make your code a bit... noisy. Then again... some globals littered around everywhere introduce coupling...
@ArjanCodes 3 years ago ⁺²
Glad you liked it. I might do a separate, more in-depth video about this topic because I feel it got a bit overshadowed by the Hydra stuff.
@kayakMike1000 3 years ago
@ArjanCodes there may be a lack of appreciation for how complicated configurable modern software and scripts can be. I swear if one more brilliant developer suggests we pull any configuration value from an environment variable that will likely end up undocumented... How many times have you fixed something by grepping for "os.environ" or something similar?
@mrswats 3 years ago ⁺²³
Nice video as per usual!
One small thing which can be improved related to pathlib is the following syntactic sugar:
data_path = Path(root_dir) / data_file
this way you don't have to use f-strings and it's a bit clearer in my opinion.
3 years ago ⁺²
... and more platform-independent, in case the platform does not understand '/' as a path separator.
@danielriedl1419 2 years ago
You could also:
Path(root_dir).joinpath(data_file)
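The two spellings suggested in this thread can be compared in a minimal sketch. The values of root_dir and data_file are hypothetical stand-ins for the config entries, and the Pure*Path classes are used only to show the separator each platform flavour would produce:

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Hypothetical values standing in for the config entries from the video.
root_dir = "data"
data_file = "train.csv"

# The "/" operator and joinpath() build the same path object.
via_operator = Path(root_dir) / data_file
via_method = Path(root_dir).joinpath(data_file)

# The separator comes from the path flavour, not from a hand-written string.
posix_style = str(PurePosixPath(root_dir) / data_file)
windows_style = str(PureWindowsPath(root_dir) / data_file)
```

Note that joinpath() is called on a Path instance here, so root_dir may stay a plain string in the config.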
@frustbox 3 years ago ⁺⁴
When I saw the title I got excited to hopefully find something new, but I don't know if this video scratched that itch.
I'd have hoped for a comparison of different approaches. Putting configuration constants in a file is not great, but how about a global settings.py file with a Config object, possibly a dataclass, that you can import into each module? How about passing it as arguments via dependency injection? What specifically makes hydra preferable? I guess what I had hoped for is higher level concepts and design principles, instead we got a showcase for one very specific implementation.
What I currently use is a settings.py file that is user-editable (it could also be YAML or JSON files, it doesn't matter all that much). Then there's a config.py file with a config_factory that can create different objects for different configurations, e.g. production vs. development environments. It also uses python-dotenv to get "secrets" that should never end up in a code repository. The main project file then imports the config_factory and creates a global object that can be imported into every sub-module.
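A setup along these lines might look roughly like the sketch below. It is not the commenter's actual code: the class names, the field names and the APP_API_KEY environment variable are all hypothetical, and python-dotenv is only mentioned in a comment so the sketch stays stdlib-only.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Base settings shared by every environment (field names are hypothetical)."""
    debug: bool = False
    data_dir: str = "data"
    api_key: str = ""  # secret, injected from the environment, never committed


@dataclass(frozen=True)
class DevelopmentConfig(Config):
    debug: bool = True


@dataclass(frozen=True)
class ProductionConfig(Config):
    data_dir: str = "/srv/app/data"


_CONFIGS = {"development": DevelopmentConfig, "production": ProductionConfig}


def config_factory(env: str) -> Config:
    """Build the config object for the named environment."""
    try:
        config_cls = _CONFIGS[env]
    except KeyError:
        raise ValueError(f"unknown environment: {env!r}") from None
    # python-dotenv's load_dotenv() could populate os.environ before this call.
    return config_cls(api_key=os.environ.get("APP_API_KEY", ""))
```

The main module would call config_factory once and expose the result for other modules to import; freezing the dataclasses keeps that shared object read-only.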
@mauricepasternak6504 3 years ago ⁺³⁸
I'm a big fan of JSON config files coupled with Pydantic to cast the configuration into a type-friendly structure. As a previous video of yours emphasized, this also carries the ability for the config to be validated and/or cast (i.e. automatic conversion to Path types from strings).
@kayakMike1000 3 years ago ⁺³
JSON is YAML, isn't it?
@mbakr101 3 years ago ⁺²
I’ve been looking for an example that uses Pydantic for configuration files. I would appreciate it if you could show me an example.
@fackarov9412 2 years ago ⁺¹
@mbakr101 th-cam.com/video/Vj-iU-8_xLs/w-d-xo.html
@yeahjustlikethat 2 years ago
agreed -- or use YAML tags to generate (Pydantic-validated Base or dataclass) objects on the fly
@MichaelDanziger a year ago
100% agree, I use Pydantic BaseModel or BaseSettings for config. That way I have a single source of truth for what fields are expected, what their types are and a clear place to change if I want to add settings. Connecting to json/yaml/dicts is trivial. I don't understand how you can keep track of valid or invalid fields with hydra, everything just gets parsed and passed down.
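The idea running through this thread (one typed schema as the single source of truth, JSON as input, strings cast to Path on load, unexpected fields rejected) can be sketched with the standard library alone. This is deliberately not Pydantic: a Pydantic BaseModel would perform the coercion and validation automatically, whereas here it is written out by hand. The class and field names are hypothetical.

```python
import json
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class TrainingConfig:
    """Single source of truth for the expected fields (names are hypothetical)."""
    data_dir: Path
    epochs: int
    learning_rate: float

    def __post_init__(self) -> None:
        # Cast on load, so the rest of the program only ever sees a Path.
        self.data_dir = Path(self.data_dir)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.epochs, bool) or not isinstance(self.epochs, int):
            raise TypeError("epochs must be an integer")
        if self.epochs < 1:
            raise ValueError("epochs must be at least 1")
        self.learning_rate = float(self.learning_rate)


def load_config(text: str) -> TrainingConfig:
    """Parse JSON text and reject keys the schema does not declare."""
    raw = json.loads(text)
    known = {f.name for f in fields(TrainingConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown config fields: {sorted(unknown)}")
    return TrainingConfig(**raw)


config = load_config('{"data_dir": "data/raw", "epochs": 10, "learning_rate": 0.001}')
```

Adding a setting means adding one field to the dataclass, which is the "clear place to change" the comment above describes.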
@walterppk1989 3 years ago ⁺⁹
opinion: the most elegant solution is to put any config in a config.py file. Any (more dynamic) execution time variables could be passed via command-line arguments (e.g. using the fire lib)
@franzweitkamp 3 years ago ⁺¹⁶
to me it does not look very convenient to decorate a function just to have the configuration variables load in
it clutters the code and decorating also messes with the debug trace
@ArjanCodes 3 years ago ⁺⁴
I agree with that. In general, I'm not a big fan of decorators, except for maybe things like @property or @dataclass. I would have preferred if Hydra had chosen a more functional approach (I think I also mention this in the video somewhere). But still, it has quite a lot of useful features for dealing with configs so I can live with it.
@seren6453 3 years ago ⁺²
If you don't like decorators, Hydra has something called the Compose API, so you can use Hydra without decorators, though with some drawbacks (described in the Hydra docs).
@fuba44 3 years ago ⁺¹
Love your videos. And wow, that microphone is in a very very good mood!
@ArjanCodes 3 years ago ⁺²
Haha, yes it seems he's always very happy to see me.
@SarveshShah 3 years ago ⁺²
Hi Arjan, great video as always. The cuts when you speak seem a little jarring, more like a stop motion video from time to time. Maybe it's just me, but I thought I should point it out. Great content as always!
@traal 3 years ago ⁺²²
I'm disappointed that Pydantic isn't used or mentioned in this video. That seems like a perfect fit for configuration PLUS type checking.
Define your configuration structure as a Pydantic model, load a configparser, JSON or yaml file, and just go. Type hints are automatic. Seems obvious. 😊
@julius333333 2 years ago ⁺²
pydantic silently does type conversion (everywhere), which is bug prone & not something you can use safely in production IMHO or even in dev
@blackeadam 2 years ago
Thanks!
@ArjanCodes 2 years ago
Thank you Adam - glad you liked it!
@paul_devos 3 years ago ⁺³
Really appreciate you introducing Hydra. It does look a bit more involved for what you get out of it. I also do NOT like YAML. I have worked as an AWS Cloud Architect for 7 years and hate it. I prefer anything but YAML. It is unnecessary and just provides extra syntax to deal with. Many of the frameworks for clouds are now starting to create libraries that output the JSON and YAML so you don't have to write any YAML.
For now, I'm a python-dotenv, ConfigParser, and pathlib fan. I write a good README file that explains the parameters to configure/alter. And then I have a config.py file that handles and parses all my text/JSON files, which I then import into my files as needed in the project.
@kjeldgaard0 3 years ago ⁺¹²
Why not just use configparser? It comes with Python's standard library and is also structured.
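For comparison, a minimal configparser sketch. The section and key names are hypothetical, and the INI text is inlined with read_string() so the example is self-contained; a project would normally call read() on a file such as config.ini.

```python
import configparser

# Hypothetical INI content; in a project this would live in its own file.
INI_TEXT = """
[paths]
root_dir = data
data_file = train.csv

[training]
epochs = 10
learning_rate = 0.001
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

# Values are stored as strings; the typed getters convert on access.
epochs = parser.getint("training", "epochs")
learning_rate = parser.getfloat("training", "learning_rate")
root_dir = parser.get("paths", "root_dir")
```

The sections give a two-level structure, but deeper nesting and type declarations are not part of the format, which is one trade-off against YAML plus dataclasses.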
@Daniel_Zhu_a6f 3 years ago
Since configuration is essentially partial function application, I often use OOP version of partial application for configuring scripts.
Each distinct script part is a separate class with __init__ and __call__ methods only.
The __init__ method accepts an untyped dict or reads JSON/TOML/YAML from a specified location and assigns the data to typed fields.
The __call__ method then becomes an efficient partially applied function.
With such setup you can easily have multiple configurations for the same script, either from separate config files or created on the fly.
And if you need a new configuration approach, you just create a classmethod for it.
Hydra looks like a fancy and overcomplicated way to do that same thing.
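The pattern described above could be sketched as follows. The Scaler class and its factor and offset fields are hypothetical; the point is only the shape: configuration is bound once in __init__, extra configuration routes are classmethods, and __call__ is the partially applied function.

```python
import json
from typing import Any


class Scaler:
    """One script step: configured once in __init__, applied in __call__."""

    def __init__(self, config: dict[str, Any]) -> None:
        # Assign untyped config data to typed fields.
        self.factor: float = float(config["factor"])
        self.offset: float = float(config.get("offset", 0.0))

    @classmethod
    def from_json(cls, text: str) -> "Scaler":
        """Alternative configuration route, added as a classmethod."""
        return cls(json.loads(text))

    def __call__(self, values: list[float]) -> list[float]:
        return [v * self.factor + self.offset for v in values]


# Two configurations of the same step, one built on the fly, one from JSON text.
double = Scaler({"factor": 2})
shifted = Scaler.from_json('{"factor": 1, "offset": 0.5}')
```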
@it_is_ni 3 years ago ⁺⁶
Hail Hydra!
@deeplearningexplained 3 years ago
Cool didn't know about that library, thanks for the video!
@kevon217 a year ago
Exactly what i was looking for. Great tutorial and thanks!
@alexkelly757 3 years ago ⁺²
In Arjan's example, all parameters are pulled into the "main" function in main.py. What do you do if you have multiple Python files in the same project that need parameters from the parameter file? Do you just reimport the log file within that Python file, or is it best to pass the parameters when calling functions from main? Or is it one of those "it depends" cases?
@cameronball3998 2 years ago
Really awesome video; exactly what I was looking for. Thanks Arjan!
@tiagovla 3 years ago ⁺⁶
Shouldn't you be using a path.join method instead of constructing those with f-strings? Won't this break on Windows?
@selimrbd 3 years ago ⁺²
I agree, I think it's better to do something like: Path(str_1) / str_2
@pawelkubik 3 years ago ⁺⁴
The biggest issue with this approach is that it's usually impractical to define a schema in such configuration files. You need a lot of flexibility in DS projects, so many config fields are often determined on the fly. Say you choose a specific optimizer that works well with different learning rate schedules. Just go over the Keras/Torch schedules and see the variety of arguments they accept in their constructors. There is no way to build a consistent schema for all of them. One solution would be to have a different dataclass for each schedule, but then your main config needs some huge awkward Union.
It's much easier to just use dicts for all nested configurations and unpack them with double stars when calling constructors and factory methods.
@mohammedhelal5778 2 years ago
Also guilty of double starring the hell out of everything 😄
I actually think the bigger headache is making sure the model is packaged up nicely during inferencing. A lot of times the config/dict has all the relevant info for preprocessing and sometimes even model specific info needed for predictions. I usually just dump the config in the model somewhere but seems like there should be a better way.
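The double-star approach from this thread can be illustrated with a small sketch. The two schedule classes are hypothetical stand-ins, not Keras or Torch classes, and the "name"/"params" layout of the config dict is an assumption:

```python
from typing import Any


class StepSchedule:
    """Hypothetical stand-in for a framework learning-rate schedule."""

    def __init__(self, initial_lr: float, step_size: int, gamma: float = 0.1) -> None:
        self.initial_lr, self.step_size, self.gamma = initial_lr, step_size, gamma

    def __call__(self, epoch: int) -> float:
        return self.initial_lr * self.gamma ** (epoch // self.step_size)


class ConstantSchedule:
    """Second stand-in with a different constructor signature."""

    def __init__(self, initial_lr: float) -> None:
        self.initial_lr = initial_lr

    def __call__(self, epoch: int) -> float:
        return self.initial_lr


SCHEDULES = {"step": StepSchedule, "constant": ConstantSchedule}


def build_schedule(config: dict[str, Any]):
    """Look up the class by name and unpack its nested params with **."""
    return SCHEDULES[config["name"]](**config["params"])


schedule = build_schedule({"name": "step", "params": {"initial_lr": 0.1, "step_size": 10}})
```

No shared schema is needed because each constructor validates its own arguments; the cost is that a misspelled key only surfaces as a TypeError when the object is built.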
@Khushpich 3 years ago
Another great video Arjan, nice job
@nickeldan a year ago ⁺¹
How does Hydra work with pytest? That is, I'd like Hydra to provide a configuration object to a test without pytest thinking that the argument is supposed to be a fixture?
@hobe4576 3 years ago ⁺⁷
The Hydra approach has its drawbacks, especially when the Python solution is not started as a script (think of Flask apps or Prefect jobs, for example).
I completely agree with the statements on configuration in code versus parameter files (or, if necessary, databases, secret servers or whatever is appropriate).
But once it is clear that the input is JSON or YAML with a "Pythonic access layer built of dataclasses", I do not really see the advantage of Hydra.
Another set of, in my opinion, very useful tools in connection with configuration and dependency management:
- dependency_injector (especially when creating more complex access objects like sqlalchemy sessions): it allows you to use configuration from files and/or environment variables as well, and to maintain containers (say for test, prod, etc., in case you want to test with an in-memory SQLite database while production will be something else)
- Marshmallow and marshmallow-dataclasses for type-safe JSON parsing and painless input data validation (in hard circumstances you can even write your own fields)
@selimrbd 3 years ago ⁺¹
At 17:26 : Can we define multiple "nodes" in one ConfigStore instance ? (in the case we have 2 functions in the main script, each expecting in input a different configuration dataclass). In that case, how does hydra know which config dataclass to use for each function ? There is no reference to "mnist_config" in the dataclass call. Does it determine that through the typehint of the decorated function ?
@MatthiasStuebner 2 years ago ⁺⁴
You forgot to mention a very good reason to separate config from code: a change to config in code requires a release cycle, which causes needless work and wasted time.
@ArjanCodes 2 years ago
Yes, good point!
@DavidCSaint 2 years ago
Arjan thank you so much for your dope af videos
@joeybruce787 3 years ago ⁺¹
What are your thoughts on using strictyaml with pydantic for validation?
@GabrielCardoso95 3 years ago
At my work, mostly for legacy reasons and to avoid dependencies, I wrote some configuration management that works with XML. However, now I feel like trying Hydra.
@amr.sharaf 5 months ago
Thanks for this, really useful
@ArjanCodes 5 months ago
Glad it was helpful!
@astronemir 2 years ago ⁺¹
When I only have a few variables, I usually just use argparse with defaults. Then you can use argparse to change the inputs.
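A minimal sketch of that argparse approach, assuming a hypothetical training script with three settings. The defaults play the role of the config file, and command-line flags override them for a single run:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Defaults act as the configuration; flags override them per run."""
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--data-dir", default="data")
    return parser


# Passing an explicit list keeps the example independent of sys.argv.
defaults = build_parser().parse_args([])
overridden = build_parser().parse_args(["--epochs", "20"])
```

In a real script, parse_args() would be called without arguments so that it reads the actual command line.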
@fexofenadinaGenerica 3 years ago ⁺²
Thank you for your work!
@ArjanCodes 3 years ago ⁺¹
My pleasure!
@jrudy457 3 years ago ⁺⁴
What I learned: never include "references to explicit images" throughout the source code 2:12
@ArjanCodes 3 years ago ⁺²
Haha, I think I meant to say something else. But on the other hand, it is actually better to keep explicit images out of your Git repositories. ;)
@rtiodev 3 years ago
Awesome content. Keep doing this excellent job!
@yinmudino1 10 months ago
Arjan, you mention we don't need to enclose strings in quotes in YAML, but if I have a date in the format 20240201, Hydra will read it as an integer even though I declare it as a string in the dataclass. What is the best practice, i.e. should I put quotes around it in the YAML file, or convert it with str(xxx) in the Python program?
@mberlinger3 a year ago
I always define my conf files as toml. That adds structure and I can just pass the specific heading into the command that needs it
@ChrisBNisbet 3 years ago ⁺³
Eek. I'm not sure I'm keen on the hydra module forcing the user to include hydra-specific code (that working dir stuff) in the config file.
@menscheins125 a year ago
@arjan: Maybe it's worth comparing Hydra, OmegaConf, .env and pure config.py (and maybe more) as options to provide configuration variables to your project.
@zapy422 2 years ago
Is there an easy way to find hard-coded values (strings/ints) in code?
@kopytko998 3 years ago
Thanks for your video. Very useful
@hosseinafsharnia2048 2 years ago
Thank you so much for the explanation.
I wonder how to use such a separate config file in a non-hard-coded, more dynamic way (especially for hyperparameters while tuning them), because in the YAML file you were hard-coding them.
Also, you mentioned that a simple function can do the job as a cleaner way to deal with configuration. I do not understand exactly what is meant by a simple function. Is it a separate approach, or can we use such a function alongside Hydra and the YAML file? I wonder if you could provide an example to make it clearer; I guess using such a function is the key to the dynamic behavior I asked about in the previous paragraph. Is my line of thought correct?
@johnanih56 2 years ago
What extension do you use in VS Code that does the automatic import?
@pycz 3 years ago ⁺¹
I like the YAML + Pydantic with paths to config in envs approach.
@anton4075 a year ago ⁺¹
Hi, Arjan! Thank you for a fantastic video! Probably the video would be even better if you did not zoom in and zoom out code so frequently. It is painful for the eyes.
@ArjanCodes a year ago ⁺¹
Thanks for the idea!
@nahakuu 2 years ago
I am a little confused: isn't dotenv the same, with less coding?
@DelipRao a year ago
Hi @ArjanCodes, I loved this video not just for the content but also for how you create the transitions between the code and your video, the Picture in Picture setup you have when coding, and how the video focuses on relevant parts of your VSCode to show what you are doing. As a help to a university teacher forced to teach online, would you mind sharing the software you use in your video recording, editing, and production? Also, do you have two cameras pointing at you (one from the front and another from the side)? Thanks!
@ArjanCodes a year ago
Hi Delip, I used Final Cut Pro for the editing and I use two cameras to capture the different angles!
@omerorhan80 3 years ago
Can you do a DDD and event sourcing example with Kafka in Python, please? There are not many samples around in Python.
@sachinpb875 3 years ago
Is it okay to make a config class, set the parameters as class variables, and then import the config class in the main function?
@vmgustavo 3 years ago ⁺¹
That is an interesting way of defining configurations. The option to define dataclasses to store the configs is what I liked most. However, I use click to create Python CLI programs, and it seems to me that there could be a conflict between the CLI parameters and Hydra. What would you do to use both Click CLI and Hydra configs? Or is there another option for dealing with this scenario?
@mohammedhelal5778 2 years ago
I'm not sure how hydra does it, but I usually keep defaults in a config and then override them with any command line arguments.
@harshraj22_ 3 years ago ⁺¹
Thanks for the insightful video. Can we pass some of these params as command-line args as well? I mean, does Hydra have support for that as well?
@bkinard9 2 years ago
Yes. All config parameters in Hydra can be overridden at run time using command-line arguments.
@brainforest88 10 months ago
I would not recommend JSON as a config file format, simply because you cannot comment it.
Last week I went through a similar problem: having config parameters plus command-line parameters. I used a mix of Pydantic BaseSettings and BaseModel classes plus a TOML file, which I converted into JSON and loaded into a class inheriting from Pydantic's BaseSettings. It was a shitload of work, but in the end I had everything in one place, including value checks and default values.
@Shri 7 months ago
Late to the party but just wanted to leave this here. As far as the main function error thing goes, you can do something like this:
def main(cfg: MNISTConfig | None = None):
    assert cfg  # fails at runtime if no config was passed in
This way your main() won't show any squiggly line and you also are checking for existence of cfg during runtime.
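A self-contained sketch of this trick, with the Hydra decorator left out and a stand-in MNISTConfig whose single field is hypothetical. Optional[...] is used so the sketch also runs on Python versions older than 3.10; it means the same as MNISTConfig | None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MNISTConfig:
    """Stand-in for the video's config class; the field is hypothetical."""
    epochs: int = 10


def main(cfg: Optional[MNISTConfig] = None) -> int:
    # Narrow the optional type and fail fast if no config was injected.
    assert cfg is not None, "main() was called without a config"
    return cfg.epochs
```

Calling main() with no argument now type-checks cleanly, and a missing config shows up as an AssertionError rather than an attribute error further down. Keep in mind that assert statements are skipped when Python runs with the -O flag.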
@rk-zs5sy 2 years ago
Changing the definition of the data loaders seems bad. What about defining a @property on the config dataclasses?
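One way to read this suggestion, as a sketch: the dataclass keeps the string fields that mirror the YAML, and a property derives the Path so the data loaders do not have to change. The class and field names are hypothetical, and the sketch assumes the code holds an actual dataclass instance; whether the object Hydra passes to the decorated function exposes such a property is not something shown in the video.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Paths:
    """String fields mirror the YAML; names are hypothetical."""
    root_dir: str
    data_file: str

    @property
    def data_path(self) -> Path:
        # Derived value computed on access, so loaders keep receiving a Path.
        return Path(self.root_dir) / self.data_file


paths = Paths(root_dir="data", data_file="train.csv")
```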
@AyahuascaDataScientist a year ago ⁺¹
I think this is making something pretty simple into an unnecessarily complicated endeavor…
@kymosabe7807 3 years ago
Thank you Arjan for your valuable videos. Interesting approach to solving one of the issues when a project has a lot of parameters. IMHO, using upper case to denote CONSTANTS contributes to the readability and maintainability of the code.
I remember you saying in one of the videos that, for practical reasons, the number of parameters a function has should be minimal. However, in ML that may not be the case. How would Hydra accommodate passing groups of objects?
@tiagovla 3 years ago ⁺¹
BTW, to convert words to lowercase or uppercase, you could use viw+u and viw+U, respectively.
@tiagovla 3 years ago
It works for any selection, same goes for double-click/mouse-selection + u/U.
@traal 3 years ago
Heh. As a fluent Vim user I constantly see Arjan doing things less efficiently than he could. He never uses vi movement commands or Vim text objects with his verbs, for instance. I've learned to ignore it and just enjoy the video content. 😁
@traal 3 years ago
The correct way is to use Vim text objects, by the way. Then you can repeat your action with "." on the next row:
gUaw (Uppercase a word)
guaw (lowercase a word)
@denizm8590 3 years ago
Thank you for your time and for putting this video together, it was really helpful. I wanted to ask: what would be your preference if you have multiple environments and want to apply the same approach with Hydra? I mean, would you put all the params for each environment in the same YAML file or create a different YAML file for each environment?
@DS-tj2tu 3 years ago
Thank you!
@gshan994 2 years ago
AWS app config helps with CICD of config management.
@italo.buitron a year ago
What about setuptools?
@vincentpelletier1246 a year ago
It's a good video; however, I wish it had gone further and played with 'how to use configurations without using a function decorator'. It's good for a single configuration, such as a main for simple model training; however, when a single class needs to get these parameters, the documentation isn't crystal clear.
@atompotato 3 years ago
Why not just make your config as .toml and read parameters in a dict-like manner in the main script?
@Medan1993 3 years ago ⁺¹
TOML + dacite is way easier to manage and handle than doing this weird configuration stuff with hydra, which seems for me to be overkill for such task.
@THEMithrandir09 3 years ago ⁺¹
Why not use "viwgu" instead of double clicking and retyping to make something lowercase? I see you have vim installed but you're always in Insert mode in these videos.
@ArjanCodes 3 years ago ⁺⁴
Thanks for the suggestion - I'm still a Vim noob. I'm trying to balance being able to get work done and learning a couple of new commands every week. The main challenge at the moment for me with Vim is navigating through a code file. I have to get used to not automatically reaching for the mouse.
@hansenmarc 2 years ago
Great information! I’m curious about your thoughts regarding using HOCON for configuration, as opposed to JSON or YAML.
@ArjanCodes 2 years ago ⁺¹
Thanks for the suggestions, I've put it on the list.
@phystimn 4 months ago
The storyline: 1) having configs in multiple places is hard, 2) let's solve it, 3) let's store configs in many places (otherwise, how would you combine/batch/override them etc?), but with a 'fancy' framework. Well, ok.
The truth is that keeping the code or configs organized is not a matter of a framework, it's a matter of coding culture, so to say!
@qsykip 3 years ago
I don’t see why you wouldn’t just use a dict for this. It’s easy to deserialise YAML config files into dicts if you prefer having it in that format. Then just either pass the dictionary of parameters into each object or just access it as a global variable if you’re not too averse to that practice (it’s not that bad if you make it a point to make sure that this is the one and only global variable to use).
Honestly, for any serious data science project, I would just use Kedro nowadays.
@ryansoklaski8242 3 years ago ⁺¹
dicts aren't good for handling things like arbitrary numbers of positional args or positional-only args. Hydra also provides some runtime type-checking and powerful string-interpolation capabilities that dicts wouldn't handle. Lastly, dicts can't be composed via inheritance like dataclasses can.
@LyuboslavPetrov 7 months ago
Overkill, until it isn't, perhaps. I'll get back when I am overwhelmed with JSON (yes, can't stand this YAML mumbo jumbo). Until then :)
@mike4617 a year ago
'made easy'
@melancholicsmile5575 a month ago
Why not just use a JSON file and read it?
@FirstNameLastName-fv4eu 2 years ago ⁺¹
seriously man! seriously !!! somebody is getting paid to make some simple thing too complex like hell.
@stiv13451251 2 years ago ⁺¹
Don't like this. It looks so complicated without any reason
@hanabimock5193 2 years ago ⁺¹
Yet another unnecessary python project.
