If you guys want to learn more about data engineering, then sign up for my newsletter here seattledataguy.substack.com/
Beginner -
1. Requests (and sftp)
2. Psycopg2 and similar database libraries
3. Beautifulsoup and scrapy
4. Datetime
5. Virtualenv
Intermediate -
6. Airflow
7. Boto3 and similar libraries to interact with cloud
8. Flask/Django
Advanced (based on need to know) -
9. Pyspark
10. Pyarrow
Up
Warnings and logging too
Some other cool libraries from my side:
- Pandas - you've mentioned it, but I don't think you framed it as something one should know (vide the case from your Facebook interviews) - I think it's essential for any sort of data wrangling with Python.
- NumPy - essential stuff for any sort of algebra if you want to dive deeper into ML
- MyPy/Pydantic - for data validation & static typing
- Pytest - for testing
- matplotlib & seaborn - for data visualization in Python
- any sort of file libraries for specific file formats like json, csv, avro-python etc.
- ML libraries like scikit-learn
- FastAPI as an alternative to Django/Flask
- Selenium
- argparse for scripting
Although I haven't used most of these in my job on a regular basis - I think it doesn't hurt to know them :)
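For anyone curious what "argparse for scripting" looks like in practice, here is a minimal sketch. The script, its flags (`--date`, `--table`, `--dry-run`) and the function names are made up for illustration; only the `argparse` calls themselves are standard library.

```python
import argparse


def build_parser():
    # Hypothetical flags for an imaginary daily-load script.
    parser = argparse.ArgumentParser(description="Load one day of data.")
    parser.add_argument("--date", required=True, help="Run date, YYYY-MM-DD")
    parser.add_argument("--table", default="events", help="Target table name")
    parser.add_argument("--dry-run", action="store_true", help="Only print the plan")
    return parser


def describe_run(argv):
    # Passing argv explicitly keeps the function easy to test;
    # a real script would pass sys.argv[1:] here.
    args = build_parser().parse_args(argv)
    mode = "dry run" if args.dry_run else "real run"
    return f"{mode}: load {args.table} for {args.date}"


print(describe_run(["--date", "2024-01-01", "--dry-run"]))
```

Note that argparse turns `--dry-run` into the attribute `args.dry_run`, and the `store_true` action makes it default to `False` when the flag is absent.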
sympy is more of an algebra library. I think you meant numpy is a linear algebra library. This can be a good way of thinking about it for a beginner who wants to learn ML, but I find it gets used a lot for stuff where you want to represent continuous mathematics as closely as possible on a computer. For example, numpy would also be good for stuff like signal processing or creating a function of best fit for your data that can be plotted.
Psycho pg2 is how I've heard folks say it too!
Requests
Psycopg
Bigquery
Beautifulsoup & scrapy
Datetime
Boto 3
Flask
Virtualenv
Spark
Pyarrow
Pykafka
Snowflake
Thanks! I finally added in the agenda so these are now included.
I'm stuck in a "data engineer" position where all my boss will let me do is debug SQL scripts and it's killing me
how long have you been there?
QUIT
Leave if you can. You are doing yourself no favors by wasting years at a job you don’t like and especially one that isn’t improving your skills
Watching the premiere... expecting to hear about the tenacity library here xD
Great content as usual! I'd add json library to that
amazing thank you!
You're very welcome!
good list, but most of your psycopg2 stuff prob would have been easier with sqlalchemy
I have to use a shell script to execute mysql queries then pass the result as an argument in my python scripts >_< wish i could just use mysql connector
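For comparison, this is roughly what running the query from inside Python looks like. It is only a sketch: it uses the standard library's `sqlite3` as a stand-in so it runs without a server, and the `users` table, its columns and the function names are invented. MySQL connector libraries follow the same DB-API pattern (connect, cursor, execute, fetch), though the placeholder style differs between drivers.

```python
import sqlite3


def fetch_active_users(conn):
    # Run the query in Python and get the rows back directly,
    # instead of capturing shell output and passing it as an argument.
    cur = conn.cursor()
    cur.execute("SELECT name FROM users WHERE active = ? ORDER BY name", (1,))
    return [row[0] for row in cur.fetchall()]


def process(names):
    # Stand-in for whatever the Python script does with the result.
    return ", ".join(names)


# In-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ana", 1), ("bo", 0), ("cy", 1)],
)

result = process(fetch_active_users(conn))
print(result)
conn.close()
```

Keeping the query and the processing in one process avoids quoting and escaping problems that tend to show up when query results travel through shell arguments.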
How can you know pandas every which direction, but not understand a dictionary? You wouldn't know how to construct a dataframe from a dictionary of lists (often my approach when webscraping) or know how to use the map function to change categorical names. Wes McKinney (who created pandas) even says that a pandas series data structure is similar to an ordered dictionary.
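To make the dictionary connection concrete, here is a small sketch in plain Python (no pandas), with made-up product data. pandas accepts the same two shapes: `pd.DataFrame(scraped)` takes a dictionary of lists, and `Series.map` accepts a dictionary as the lookup.

```python
# Columns as a dictionary of lists, the shape a scraper often accumulates.
scraped = {
    "product": ["kettle", "mug", "lamp"],
    "category": ["kt", "kt", "lt"],
}

# A plain dictionary used as a lookup table for category names.
category_names = {"kt": "kitchen", "lt": "lighting"}

# Look each code up in the dictionary. Falling back to the original code
# when a key is missing is a choice made for this sketch.
scraped["category"] = [category_names.get(c, c) for c in scraped["category"]]

# Turn the column-oriented dictionary into row-oriented records.
rows = [dict(zip(scraped, values)) for values in zip(*scraped.values())]
print(rows)
```

If you are comfortable with this, constructing a dataframe from scraped columns and renaming categories with a mapping are the same ideas with a different container.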
You are awesome.
I've gone through possibly all python courses in Udemy but have never seen a course focused on Data Engineering and the good-to-know libraries. Sometimes there is one short chapter about one of them but nothing complete. Does anyone have any tips?
Regarding APIs I always thought we should learn how to pull from them, not actually create them. So where does Flask fit into all that?
Depends on what product is built on top of your db/dw. You might need to build an api on top of your warehouse to power your product.
@playea123 Cool. And do you know what kind of custom API could run over a DW? I could only think of such a case in an OLTP context...
@gabrielkolletalves493 depends on how you model your DW. If you want something similar to an OLTP, Snowflake rolled out hybrid tables a few months ago
hey! leave gcp libs alone 😂