Link to previous video on Dagster: th-cam.com/video/t8QADtYdWEI/w-d-xo.html&t
ETL with Python: th-cam.com/video/dfouoh9QdUw/w-d-xo.html&t
Love it! Dagster is my favorite tool for data orchestration and your video is very well built 🎉 need more on this topic :)
@BiInsightsInc, between 03:05 and 05:39 the requirements.txt magically appears in your etl folder. Makes it hard to follow along with your video...
You can clone the repo; that way you will have all the requirements and can follow along. All links are in the description. Here is the link to the repo:
github.com/hnawaz007/pythondataanalysis/tree/main/dagster-project/etl
Thanks, this is helpful. However, I do have a question: let's say I want to build an ELT pipeline and ingest an entire database into a data warehouse. Is it better for me to separate the tables into multiple data assets and ingest them one by one, or just use one data asset?
It's better to split each table into its own asset. Each source table should have an asset, then stage this data. After this step it depends on your data modeling strategy and how you want to model this data.
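A minimal sketch of the one-asset-per-table idea (hypothetical table and column names, not from the video): each source table gets its own extraction asset, and a staging asset depends on them.

import pandas as pd
from dagster import asset
from sqlalchemy import create_engine

# Hypothetical source database connection.
engine = create_engine("postgresql://user:password@localhost:5432/source_db")

@asset
def customers() -> pd.DataFrame:
    # Extract the customers table as-is.
    return pd.read_sql_table("customers", engine)

@asset
def orders() -> pd.DataFrame:
    # Extract the orders table as-is.
    return pd.read_sql_table("orders", engine)

@asset
def stg_orders(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    # Staging step: join and clean before downstream modeling.
    return orders.merge(customers, on="customer_id", how="left")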
@BiInsightsInc thank you for the input
Question: to implement an incremental load IO manager we need to use the 'append' argument instead of 'replace' with SQLAlchemy. Is it possible to send this parameter directly from the asset?
It is possible. I have seen an example of this on Stack Overflow, but it requires a little more configuration; link below. Another idea would be to have two versions of the IO Manager: one for incremental loads (append) and a second one for truncate and load (replace).
stackoverflow.com/questions/76173666/how-to-implement-io-manager-that-have-a-parameter-at-asset-level
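Here is a rough sketch of the metadata approach (assumptions: pandas to_sql over SQLAlchemy, and a Dagster version where OutputContext.metadata exposes the asset's metadata; newer releases may call it definition_metadata):

import pandas as pd
from dagster import IOManager, InputContext, OutputContext, asset, io_manager
from sqlalchemy import create_engine

class PostgresDfIOManager(IOManager):
    def __init__(self, connection_string: str):
        self.engine = create_engine(connection_string)

    def handle_output(self, context: OutputContext, obj: pd.DataFrame):
        table_name = context.asset_key.path[-1]
        # Read the write mode from the asset's metadata; default to full reload.
        if_exists = (context.metadata or {}).get("if_exists", "replace")
        obj.to_sql(table_name, self.engine, if_exists=if_exists, index=False)

    def load_input(self, context: InputContext) -> pd.DataFrame:
        return pd.read_sql_table(context.asset_key.path[-1], self.engine)

@io_manager(config_schema={"connection_string": str})
def postgres_df_io_manager(init_context):
    return PostgresDfIOManager(init_context.resource_config["connection_string"])

# Incremental asset: appended instead of replaced.
@asset(metadata={"if_exists": "append"}, io_manager_key="postgres_df_io_manager")
def orders_incremental() -> pd.DataFrame:
    return pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})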
@BiInsightsInc thanks a lot, I will check it :)
This is great, I've had similar issues. I want to query an API and APPEND the retrieved data to the existing asset.
A popular practice with BigQuery is to process data in stages where each stage is effectively a table. So you might have a raw table that takes all the raw data in, and then a pivot or aggregation process that would take the data from table A and write it to table B. I am trying to wrap my head around how to do this correctly with Dagster. The data would always live inside of BQ, never coming out into these python functions. Is there a best practice for this sort of thing? Effectively there is no IO, it is all remote, and Dagster would just be orchestrating the commands. Is this possible?
I think this is a standard ELT approach if you are building a data mart or database using SQL. dbt will be perfect for this use case. Your data lives in your database. You can transform it with SQL using dbt. You can have raw sources, build intermediate tables for transformation, and final dims and facts for analytics. Dagster can orchestrate the whole process ad hoc or on a schedule.
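A minimal sketch of Dagster orchestrating dbt (assuming a recent dagster-dbt integration and a hypothetical dbt project whose manifest.json was generated with dbt parse); the SQL runs inside BigQuery, Dagster only issues the commands:

from pathlib import Path
from dagster import AssetExecutionContext, Definitions
from dagster_dbt import DbtCliResource, dbt_assets

DBT_PROJECT_DIR = Path("my_dbt_project")  # hypothetical project path

@dbt_assets(manifest=DBT_PROJECT_DIR / "target" / "manifest.json")
def my_dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # Runs `dbt build`; raw, intermediate, and mart models show up as Dagster assets.
    yield from dbt.cli(["build"], context=context).stream()

defs = Definitions(
    assets=[my_dbt_models],
    resources={"dbt": DbtCliResource(project_dir=str(DBT_PROJECT_DIR))},
)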
If we need to process multiple .sav files, convert them into multiple CSV files, and do some modifications on them, how can we accomplish this using Dagster?
I saw your comment on the reference data ingestion video. You can borrow the code on how to ingest multiple files from there. You can easily convert the Python functions to ops and/or assets with the help of Dagster decorators.
I have covered how to convert a Python script to an "op" in this video here:
th-cam.com/video/t8QADtYdWEI/w-d-xo.html&t
Code to convert sav files (note: pd.read_spss requires the pyreadstat package):
import pandas as pd

# Read the SPSS file into a DataFrame and write it out as CSV.
df = pd.read_spss("input_file.sav")
df.to_csv("output_file.csv", index=False)
Hi @BiInsightsInc, thank you very much for posting this awesome content. Could you please create an ETL video or series that work with these tools and MongoDB?
I will try to add an IO Manager for MongoDB.
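In the meantime, a hypothetical sketch of what a pandas-to-MongoDB IO Manager could look like, using pymongo (collection names mirror the asset names; this is not from the video):

import pandas as pd
from dagster import IOManager, InputContext, OutputContext, io_manager
from pymongo import MongoClient

class MongoDfIOManager(IOManager):
    def __init__(self, uri: str, database: str):
        self.client = MongoClient(uri)
        self.db = self.client[database]

    def handle_output(self, context: OutputContext, obj: pd.DataFrame):
        collection = self.db[context.asset_key.path[-1]]
        collection.delete_many({})  # truncate-and-load behavior
        collection.insert_many(obj.to_dict("records"))

    def load_input(self, context: InputContext) -> pd.DataFrame:
        collection = self.db[context.asset_key.path[-1]]
        return pd.DataFrame(list(collection.find({}, {"_id": 0})))

@io_manager(config_schema={"uri": str, "database": str})
def mongo_df_io_manager(init_context):
    cfg = init_context.resource_config
    return MongoDfIOManager(cfg["uri"], cfg["database"])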