DE Zoomcamp 2.2.3 - ETL: API to Postgres
ฝัง
- เผยแพร่เมื่อ 14 ต.ค. 2024
- In this module, Matt talks about loading data from an API in a form of a compressed CSV file and loading it to a local Postgres database.
---
Free Data Engineering course: github.com/Dat...
Join DataTalks.Club: datatalks.club...
DTC events: datatalks.club...
Mage Github repo: github.com/mag...
Mage Slack community: www.mage.ai/chat
Mage Docs: docs.mage.ai/
This is so much fun to do! Mage really looks amazing.
If needed:
taxi_dtypes = {
'VendorID': pd.Int64Dtype(),
'passenger_count': pd.Int64Dtype(),
'trip_distance': float,
'RatecodeID': pd.Int64Dtype(),
'store_and_fwd_flag': str,
'PULocationID': pd.Int64Dtype(),
'DOLocationID': pd.Int64Dtype(),
'payment_type': pd.Int64Dtype(),
'fare_amount': float,
'extra': float,
'mta_tax': float,
'tip_amount': float,
'tolls_amount': float,
'improvement_surcharge': float,
'total_amount': float,
'congestion_surcharge': float
}
parse_dates = ['tpep_pickup_datetime', 'tpep_dropoff_datetime']
Thank you!
The hero we all need!
thank you!!
Took me three rewinds, but at about time stamp 11:45 he refers to "idempotence" GREAT call out Matt! 🙂
i think you mean at timestamp 11:26
@@farhanromli5862 Oops. You are correct. :-)
Same, and great piece of information
using double square brackets should work with f-strings
print(f"Preprocessing rows with zero passengers: { data[['passenger_count']].isin([0]).sum() }")
The issue is that he used both single quotation marks on both the passenger_count and the print statement.
Using the following will work (tested myself):
print(f"Preprocessing: rows with zero passengers {data['passenger_count'].isin([0]).sum()}")
@@DylanTan Correct. It worked for me, print(f'Preprocessing: rows with zero passengers: {(data["passenger_count"] == 0).sum()}')
At 7:37, there's no string highlighting in Mage yet, but if you notice in the Exception message, there's a white chapeau pointing at the source of the error (quotation issue). Thanks Matt for these tutorials anyways! 👏👏
looks like we wrote the data to "ny_taxi.yellow_cab_data", which is not the specified schema in the .env file.
had to confirm and check the "magic" schema. idk why it just gave me some sort of itch haha anyways thanks for the smooth tutorial :D
In transformer this statement would give you the correct number of excluded rows:
print(f"Rows with out passengers: {data['passenger_count'].fillna(0).isin([0]).sum() }")
Matt greate video. Thank you.
There is no "mage" search_path in PostgreSQL after data export block execution. In case of use pgcli - need to add schema name to FROM clause (ex: FROM mage.table_name) or set "mage" as an additional search_path value (it works only for current session and will reset after log out). How to work with it properly?
Works as a charm
Awesome!!
@8:38 I think the string formatting should have double quotes
After running data export to Postgres, I want to verify that my data is in Postgres using pgcli in my ubuntu terminal. How do I do that?
Hello! Why "float" instead of Float64Dtype() ?
Wow really easy and smooth to work with
what am I doing wrong, I downloaded the json file and moved it to my magezoomcamp folder but within mage and the test pipeline it says file note found... I literally copied its path how is that possible
i try to figure out what's wrong with the f print statement but it's just wanted a double quote to getting work 😅
When I try to export to postgres I get an error Keyrror: 'dev'. It does not find the config_profile = 'dev'. Any idea what can I do?
Iam getting the same error. I am running it in google cloud, activate the docker-compose before and no idea why I get the error
Perhaps you did not change io_config.yaml file. Just add 'dev' config at the bottom of file, save it and run again-
dev:
POSTGRES_CONNECT_TIMEOUT: 10
POSTGRES_DBNAME: "{{(env_var('POSTGRES_DBNAME'))}}"
POSTGRES_SCHEMA: "{{(env_var('POSTGRES_SCHEMA'))}}" # Optional
POSTGRES_USER: "{{(env_var('POSTGRES_USER'))}}"
POSTGRES_PASSWORD: "{{(env_var('POSTGRES_PASSWORD'))}}"
POSTGRES_HOST: "{{(env_var('POSTGRES_HOST'))}}"
POSTGRES_PORT: "{{(env_var('POSTGRES_PORT'))}}"
Nice catch, this is the right solution. Thanks!
Here is the exact timestamp where he does the configuration in another video: th-cam.com/video/pmhI-ezd3BE/w-d-xo.htmlfeature=shared&t=188
I refactored transform section this way:
cond = data['passenger_count'] > 0
print(f'Rides with zero passengers: {len(data[~cond])}')
return data[cond]
awesome!
Hi. I keep getting this error in my code for the first part of this video: ValueError: Missing column provided to 'parse_dates': 'lpep_dropoff_datetime, lpep_pickup_datetime'. I am confused because this is my code for the last two lines: parse_dates = ['lpep_pickup_datetime', 'lpep_dropoff_datetime']
return pd.read_csv(url, sep=",", dtype=taxi_dtypes, parse_dates=parse_dates)
He says tpep in video but I have lpep also in my csv, maybe they change a few column names download the file and run head filename.csv u might have ti unzip it first. check which is the right column names. Some csv had tpep and some lpep (I assume if he didnt get an error)
Hi,
I tried to pull from the website for NY taxi data and all appears to be in Parquet, when I try using read_parquet the kernel crashes, any suggestions
Also how do you parse dates in Parquet read
For anyone needing a solution:
ny taxi dataset has changed their filetypes from .csv to all in parquet.
since parquet files are stored in binary you require the apache arrow library to read them.
for this example, we use the url of csv.gz file backup shown in the datatalks de-zoomcamp repo.
Exporter need the table to have been previously created?
no it will create the schema and the table
I am getting an error when loading that the kernel memory is overloaded and must shut down? Any ideas?
Hello! Sounds like you're experiencing memory issues. This could be due to various reasons from high resource usage or memory leak. You could try 1 of the following:
- Check your resource usage
- Restart the kernel
- Check for memory leak
You can also join our community slack for further assistance: www.mage.ai/chat
@@mage_ai i have the same problem
@mage_ai i have the same problem. does it means my laptop 8gb memory is too small to execute it???
Good video
Thank you!
wow, thanks
en.wikipedia.org/wiki/Idempotence