Let's Build an Exploratory Data Analysis Project from Scratch | Python, Numpy, Pandas

Jovian

Views: 216,340


Published on Sep 5, 2024

Comments • 265

@decodewithabhishek 3 years ago ⁺¹⁰⁸
It was a good video. I like how he didn't cut out the parts where he's stuck on a problem. ⭐⭐⭐⭐⭐
@normalperson1130 3 years ago ⁺⁵³
Thank you Aakash for giving a raw walk-through. Apart from the usual documentation stuff, I think the ability to google and find answers to problems is a much more important skill in data science, apart from, of course, mathematical understanding. This walkthrough actually gave me more confidence in using pandas without worrying about typical syntax pitfalls.
@jovianhq 3 years ago ⁺³
Glad it was helpful!
@raihanhosain3374 1 year ago ⁺²
Best project tutorial on any YouTube channel. Everyone else makes videos and cuts the portions where they get stuck. You, on the other hand, show the real scenario and show us how to find a solution using Google as well. We want a similar video on a data science project.
Love from Bangladesh.
@shivsharma9153 2 years ago ⁺⁸
Do you know why I loved this video? You kept it raw and real. You clearly portray how a data analyst thinks and does the project, which I believe is more important than fancy coding. Syntax you can get easily, but analytical thinking requires real effort.
@snehaldamkondwar618 2 years ago
Hi Shiv, I'm from a non-technical background. I was doing the given project, but once we close all the tabs, how do we get back to the same notebook?
@shivsharma9153 2 years ago ⁺¹
@@snehaldamkondwar618 Hi, where did you save it locally? Which folder?
@snehaldamkondwar618 2 years ago
@@shivsharma9153 Do I need to save it??
@snehaldamkondwar618 2 years ago
@@shivsharma9153 I opened the notebook through Jovian and ran it on Colab with the Kaggle dataset. I wrote some lines of code in Colab, then closed all the tabs. Now how do I get back to the file where I wrote my code?
@shivsharma9153 2 years ago
@@snehaldamkondwar618 Try searching locally for the name of the notebook; you may be able to locate it.
@MuhammadAkbarAttamimi 3 years ago ⁺¹⁷
55:52 This dataset does contain New York accident data; there are around 10,000 records. I checked it using df[df['City'] == 'New York']
@garvitpoddar6947 3 years ago
Yes
@AakashNS 3 years ago ⁺¹
Thanks! Not sure why I missed it. Maybe I was using a different version of the dataset.
@claudiolb8552 3 years ago
@@AakashNS Not sure why, but "New York" in df.City always returns False. Try it with another city; it just doesn't work.
@tirthhihoriya690 3 years ago
@@claudiolb8552
Use:
a. >>> 'NY' in list(df.State) or
b. >>> 'New York' in list(df.City) or
c. >>> cities_by_accident['New York']
@shailjamishra9423 2 years ago
Yes, New York City is there in the dataset; the state which is missing is 'Alaska'.
@rajivgarg9480 1 year ago
I have seen only half the video. Couldn't stop myself from appreciating the good work. It couldn't have been done any better. Way better than the paid education platforms.
@freehappymeal 2 years ago ⁺³
Thank you for teaching us how to problem solve and the whole EDA process!
@jovianhq 2 years ago
Happy to help!
@sandeepmesa 2 years ago ⁺²
I like the way you google for help. Appreciate your time. Learnt new things on how to articulate our work. Thanks!
@jovianhq 2 years ago
Glad it was helpful!
@sabchillhai802 3 years ago ⁺¹²
Great work Jovian, we need more sessions like this. Thanks a lot.
@jovianhq 3 years ago ⁺²
Glad you liked it!
@aashisethiya4653 1 year ago
Aakash, you are one of the best teachers I have come across.
Coming from a hard-core medical background and pivoting into data analytics, I came across your pandas courses while preparing my foundation in Python before starting a master's in analytics in the US this Fall.
Hands down, you have given beginners like me a lot of handholding with your courses and videos!
@jovianhq 1 year ago
Thanks, I'm glad you found our course helpful! 😊 - Aakash
@aashisethiya4653 1 year ago
@@jovianhq I went through many teachers on YouTube and DataCamp, but truth be told, most are ludicrously formal in their teaching methods and have a slower, theoretical pace.
Is there any possibility to connect with Aakash to get some roadmap tips for a beginner who plans to venture into the US health business analytics domain?
@prisri5953 1 year ago ⁺¹
NY is in the state list. The missing states are AK (Alaska) and HI (Hawaii). It also counts DC as a state.
@shreyaskulkarni7612 2 years ago ⁺⁵
The current dataset has been updated.
A high percentage of accidents occur between 3 pm and 6 pm (probably people in a hurry to get home).
The next highest percentage is 6 am to 9 am.
Over 1,100 cities have reported just one accident (need to investigate).
On Sundays, the peak occurs between 11 am and 6 pm, unlike weekdays.
@jovianhq 2 years ago
Interesting analysis and insights Shreyas!
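Hourly percentages like the ones Shreyas reports can be computed from a timestamp column. The sketch below is only an illustration on made-up timestamps; it assumes the timestamps parse with `pd.to_datetime`, and it is not the code used in the video or in the comment above.

```python
import pandas as pd

# Made-up timestamps standing in for the accident start times (assumption).
start_time = pd.to_datetime(pd.Series([
    "2020-01-06 15:10:00",  # Monday
    "2020-01-06 16:40:00",  # Monday
    "2020-01-06 07:05:00",  # Monday
    "2020-01-05 12:00:00",  # Sunday
]))

# Percentage of accidents per hour of the day.
hour_share = start_time.dt.hour.value_counts(normalize=True).sort_index() * 100

# Hours of the accidents that happened on a Sunday (dayofweek: Monday=0, Sunday=6).
sunday_hours = start_time[start_time.dt.dayofweek == 6].dt.hour
```

Comparing `hour_share` for the Sunday subset against the weekday subset is one way to check a claim like "Sundays peak later than weekdays".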
@outinthebeach 3 years ago ⁺⁴
Great course Aakash, this and everything else you have put here. Thanks for your generosity in teaching this the way you have done it. Brilliant!!!
@eyesofdoriss 1 year ago ⁺¹
Great sharing. I've been looking for a full guide like this one for a while. Thank you!
@jovianhq 1 year ago
Glad you enjoyed it!
@nikunjdeeep 19 days ago
This EDA is so motivating to me... we all search on Google. I used to wonder why I can't recall all of those pandas functions...
@jeetthakkar2297 2 years ago ⁺⁷
Sir, the New York data is actually present in the given dataset.
We get the output False if we use:
'New York' in df['City']
And we get the output True if we use:
'New York' in df['City'].unique()
@jovianhq 2 years ago
Yes Jeet, you are correct. We found it later but didn't update the video, to show that this type of error can happen at any time while working on a project. Great work on finding it!
@harshucore 1 year ago ⁺¹
I used 'New York' in df.values and got True.
@anwoybarua8213 3 years ago ⁺¹
One of the best YouTube channels for data analysis learners❤️❤️
@vidyasagarbv89 3 years ago ⁺¹
Watching the master is how you learn... Thanks a lot for this...
@sivaramaguhans4002 3 years ago ⁺¹
I can't find an EDA explanation this clear in other videos... awesome 🎉
@ankitlakshya450 2 years ago ⁺¹
Bro, you were my senior in intermediate at Ascent Junior College, Vizag. Got clarity on EDA, btw.
@jovianhq 1 year ago
Glad you liked our tutorials!
@bane2256 1 year ago ⁺¹
This was excellent. I hope for more of these in 2023.
@jovianhq 1 year ago ⁺¹
Definitely!! Stay tuned, more interesting videos coming soon.
@bane2256 1 year ago
@@jovianhq Is this the type of project that is sufficient to be included in an analytics portfolio? Or does it need to be something more extensive?
@muralikumaar9456 2 years ago ⁺²
Great session on EDA. We need more such sessions on different datasets.
@jovianhq 2 years ago
This is just an example; we hope viewers will be able to make better EDA projects on different datasets after watching this video.
@neelajguhaneogi8348 9 hours ago
New York data is there. I manually checked the whole dataset to find the city, because the method you showed at 58 minutes mostly doesn't work because of spaces: some values contain unnecessary spaces, and that creates a problem.
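If stray whitespace really is present in some city names, as this comment suggests, an exact comparison would miss those rows. A minimal sketch on a made-up Series (the values are hypothetical, not taken from the dataset) shows how `Series.str.strip()` normalizes them before comparing:

```python
import pandas as pd

# Hypothetical city values with leading/trailing spaces and a missing entry.
city = pd.Series(["New York ", " New York", "Dublin", None])

# Exact comparison misses the padded values.
exact_matches = int((city == "New York").sum())

# Strip surrounding whitespace first, then compare.
cleaned = city.str.strip()
cleaned_matches = int((cleaned == "New York").sum())
```

Here `exact_matches` is 0 and `cleaned_matches` is 2; missing values stay missing after `str.strip()` and simply do not match.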
@piyushkumar-kb2jc 3 years ago ⁺¹
Concept is crystal clear by Anuj bhyia.
@jovianhq 1 year ago
Thanks!
@Mlksgf 1 year ago ⁺²
What a great tutorial! The df has obviously been updated and I cannot find the 'New York' value in Cities, BUT there is data in cities_by_accident: cities_by_accident['New York'] is equal to 7068.
@igordemetriusalencar5861 3 years ago ⁺⁵
Good class on Python pandas, but in R, exploratory and statistical analysis are way easier compared to Python. Example: data_frame %>% filter(City == "New York") bam!! Dataset filtered. Summarize numeric data => data_frame %>% summary() !! bam!! Done!! In a totally functional way.
@jovianhq 1 year ago ⁺¹
Both R and Python are great; you can use either one. Python is gaining more traction because it also has great packages for machine learning & deep learning.
@Phoenix_Bro1 1 year ago
This was a superb explanation of how to do EDA. Extremely helpful, Aakash!
@moymaya 2 years ago ⁺¹
Thank you Aakash. Really helpful. Liked the way we made mistakes and even learnt something new from them.
@jovianhq 2 years ago ⁺¹
Glad you liked it
@deepasarojam4425 3 years ago ⁺¹
This is the best video on EDA I have ever watched! Thanks Aakash :)
@jovianhq 3 years ago
Thanks for the kind feedback!
@tiwarirr 3 years ago ⁺¹
Best teacher for data science!
@unpatel1 2 years ago
This is a great project and I really enjoyed it. After finishing this video yesterday, I am working on other parameters to expand my analysis. I would love to see more projects from Aakash. Thank you.
@snehaldamkondwar618 2 years ago
Hi, do you know how to work on it again once we close all the tabs?
@sajjadabdullah 2 years ago ⁺¹
Perfect video. I was looking for a video like this. Thank you Sir.
@SeunOnSet 1 year ago
Thank you for sharing this! It was really insightful to see the analysis process from start to finish. It also answered a few questions I had.
@jovianhq 1 year ago
Glad it was helpful!
@architnangalia3426 3 years ago ⁺¹
56:02 The dataset does contain 'New York'.
cities_by_accident['New York'] gives us the output 10255.
@shailjamishra9423 2 years ago
Yes, but it does not show the values, just the count... strange!!
@jovianhq 1 year ago
Yes, the dataset does contain New York now.
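On why the lookup shows only a count: assuming `cities_by_accident` was built with `value_counts()` on the City column (an assumption based on how it is used in these comments), it is a Series of counts indexed by city name, so indexing it returns one number. To see the rows themselves, filter the DataFrame. A small sketch on made-up data:

```python
import pandas as pd

# Toy stand-in for the accidents DataFrame; the rows are made up.
df = pd.DataFrame({
    "City": ["New York", "Dublin", "New York"],
    "Severity": [2, 3, 4],
})

# Assumed construction: one count per city, indexed by city name.
cities_by_accident = df["City"].value_counts()

new_york_count = cities_by_accident["New York"]   # a single number
new_york_rows = df[df["City"] == "New York"]      # the matching rows
```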
@raghvendrasingh8037 1 year ago
Nice video, simple explanation, and the best part was that it was from scratch. Loved it.
@user-zj9pq5xc7x 3 months ago
Loved your freeCodeCamp course. Thank you so much.
@sarzilhossain5977 2 years ago ⁺²
"New York" in df.City returns False.
But "New York" in df.City.unique() returns True (which I have no explanation for).
And in fact, there are 4220 accident cases in the dataset which occurred in New York (the dataset could have been updated recently). I don't know if it has been updated. But since the accident records fall between the years 2016 and 2020, it would seem weird if new rows got added later on.
@jovianhq 2 years ago ⁺³
You are correct, "New York" was in the dataset before as well. df.City returns a Series, and the "in" operator on a Series searches the index labels, not the values. df.City.unique(), on the other hand, returns a NumPy array of the distinct values, and "New York" is searched within that array, so you were able to find "New York".
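The behaviour described in this reply can be reproduced on a tiny DataFrame. This is a minimal sketch with made-up rows, listing a few membership checks that look at the values rather than the index:

```python
import pandas as pd

# Toy stand-in for the accidents DataFrame; the rows are made up.
df = pd.DataFrame({"City": ["New York", "Dublin", "New York", "Houston"]})

# `in` on a Series tests the index labels (0, 1, 2, 3 here), not the values.
in_series = "New York" in df["City"]
in_index = 0 in df["City"]

# Checks that look at the values instead.
in_values = "New York" in df["City"].values
in_unique = "New York" in df["City"].unique()
any_match = (df["City"] == "New York").any()
row_count = int((df["City"] == "New York").sum())
```

`in_series` is False while `in_index` is True, which is why every city name appeared to be "missing" in the video. The comparison-based checks (`any_match`, `row_count`) also tell you how many rows match.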
@shubhamtalks9718 3 years ago ⁺¹
Very educational video. Please keep posting such videos.
@raminirakeshkumar8287 3 years ago ⁺¹
Thank you Aakash, great work
@scapri1000 3 years ago
Thank you. This is one of the best videos on EDA.
@gunngunn6763 3 years ago ⁺¹
Thank you...
Looking forward to your upcoming videos.
@InsaneRealityLeak 3 years ago ⁺²
Thank you so much. Definitely a very useful video. ✌🏽
@jovianhq 3 years ago
Glad it was helpful!
@imdadood5705 3 years ago ⁺¹
@36:30 We can also do df.describe().shape[1]
@54:40 I got the results for New York. I did cities_by_accident.loc["New York"]
@jovianhq 1 year ago
Yes, the dataset now contains information about New York
@jawedkhan8602 3 years ago ⁺²
You are doing a great job. Thank you
@jovianhq 3 years ago
Thank you!
@UCEAbhishekLokhande 1 year ago
Thank you very much. Learnt lots of things through this session.
@jovianhq 1 year ago ⁺¹
Glad to hear that!
@ytg6663 3 years ago
Big thank you for being here 👍👍
@jovianhq 3 years ago
Glad you liked it!
@pandabear6095 3 years ago
Thank you very much! This video was useful and easy to understand.
@amanpreetsinghgulati2475 2 years ago
Hi, at around 49:17, when you are checking whether we have 'New York' data or not, if we check for its existence with
if 'New York' in df.values - it will return True
And
if 'New York' in df.City - False
Also
if 'Dublin' in df.City - False (and the same for all the other cities)
So, in my opinion, we need to use df.values (it will check the whole dataset; yes, it might be time-consuming and require extra computation as well).
Please help us to improve this part.
Thanks
@jovianhq 2 years ago ⁺¹
Yes @Aman, you are correct, New York is indeed present in the dataframe. We've purposefully kept the video in its raw format instead of editing it. This shows that it's very common to get errors like these while working on your project; one has to be very careful before drawing a conclusion.
@amanpreetsinghgulati2475 2 years ago
@@jovianhq Yes sir, thanks for the session, learnt a lot from it. Basically, for "how" to do it there are ample resources available, but "what" to do in EDA is hard to find.
Thanks for that
@ShelloSongz 2 years ago
Wow, thank you for your concise explanations.
@jovianhq 2 years ago
Glad it was helpful!
@theforester_ 2 years ago
Awesome video! Big shout out from Brazil
@jovianhq 2 years ago
Hey Mauricio👋, thanks!
@TheHasanjafreee 1 year ago
This was great! Thank you for the video
@abhisarshrivastava4667 2 years ago
This is really helpful, thank you Jovian
@jovianhq 2 years ago
Glad you liked it!
@SarcasmWEB 2 years ago
Thank you so much! It was very educational
@moeid9935 3 years ago ⁺¹
I liked your naturalness
@muhammadshoaibfareed2577 3 years ago ⁺³
A great session indeed
@jovianhq 3 years ago
Glad you liked it!
@sandipansarkar9211 2 years ago ⁺¹
Finished watching
@mansigaikwad9 1 year ago
Idk if they have updated the dataset, but I just tried to find whether New York is there or not, and if yes, the number of records (referring to 56:00).
Used this code: len(df[df['City']=='New York']) and got the answer.
So, New York is there in the dataset and the number of accidents is 7068.
@jovianhq 1 year ago
You are correct! New York was indeed present in the dataset, but in the live session it got skipped due to a mistake in the code.
@sharkk2979 2 years ago ⁺¹
Aakash is as knowledgeable as the sky.
@jovianhq 2 years ago
Can't agree more!
- Jovian Team
@navyaagarwal5918 1 year ago
Among the top 100 cities by number of accidents, which states do they belong to most frequently? How do we solve this question?
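One possible way to approach this question, sketched on made-up data: count accidents per (City, State) pair, keep the top N pairs, then count how often each state appears among them. Grouping on the pair rather than on City alone is a design choice here, so that cities sharing a name in different states are not merged. The `City` and `State` column names follow the ones used elsewhere in these comments; `top_n` is a hypothetical parameter (100 in the question, 3 for the toy data).

```python
import pandas as pd

# Toy stand-in for the accidents DataFrame; the rows are made up.
df = pd.DataFrame({
    "City":  ["Houston"] * 4 + ["Dallas"] * 3 + ["Miami"] * 2 + ["Dublin"],
    "State": ["TX"] * 4 + ["TX"] * 3 + ["FL"] * 2 + ["OH"],
})

top_n = 3  # would be 100 for the question above

# Accident count per (City, State) pair, largest first.
top_pairs = (
    df.groupby(["City", "State"])
      .size()
      .sort_values(ascending=False)
      .head(top_n)
      .reset_index(name="Accidents")
)

# How many of the top cities fall in each state.
states_of_top_cities = top_pairs["State"].value_counts()
```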
@rohan30497 1 year ago ⁺¹
For personal use: 1:17:19
@abhishekkumar-qi3is 3 years ago
Please make a video on feature engineering and selection, and thanks for this content.
@jovianhq 3 years ago
Hey, have you tried our Machine Learning course? We have covered feature engineering/selection and lots of other interesting topics in that course.
View the course here -> zerotogbms.com
@bikrammajhi3020 1 year ago
Thank you so much Sir!!
@Carworld-s5l 2 years ago
Previously I felt I had to remember all the pandas methods, but you made me confident. Thank you Bhai❤❤
@jovianhq 2 years ago
Glad it was helpful! Check our other courses at jovian.ai/learn
@Griffindor21 1 year ago ⁺¹
Really great video!
Any chance I can get a copy of the Jupyter file?
@dilaraesmer 1 year ago
Thank you so much for all your efforts :)
@jovianhq 1 year ago ⁺¹
Thank you for the comment! Glad you like the videos
@hrittickdebnath35 3 years ago
You did a fantastic job buddy
@jovianhq 3 years ago
Glad you liked it!
@PinaColada65 2 years ago
tysm for this. This tutorial is a blessing
@jovianhq 2 years ago
You're so welcome!
@krazyhorse004 3 years ago ⁺²
df[df.City == 'New York']
^ shows New York in the dataset
@jovianhq 1 year ago
Yes, the dataset seems to have changed to include New York now
@jyothiramesh3450 1 year ago
Hey, I am getting an error while installing packages: "You may need to restart the kernel to use updated packages"
@aryanrana5658 1 year ago
It's a good video, but the dataset you uploaded is the updated one. We also want the raw, messy dataset which you used while handling missing values.
@jovianhq 1 year ago
Thank you. Unfortunately the dataset was updated on Kaggle, and we don't have access to the previous version of the dataset.
@lakhanpatel2702 3 years ago
Sir, I tried this code and it shows True for 'New York' city.
First I looked at df.values;
df.values shows all my data values in array form.
Then I wrote this code:
'New York' in df.values
This line of code shows True as the output.
@gitasaheru2386 2 years ago
Please sir, build a neural network algorithm with manual coding, without Keras, and use a case study.
@harshitsinghal8843 5 months ago
New York is present in the data. I think you used the wrong check for whether a value is present: 'New York' in df.City will always give a False result, whichever city you check for. The correct check is 'New York' in df['City'].values
@rishabgupta2733 1 year ago
At the data preparation step my data frame keeps crashing. What should I do now?
@raghavverma120 2 years ago
I read your exploratory analysis file for crop production analysis, and all the group by queries that you ran were wrong. Please look into it and rectify them.
@hydemi83 2 years ago
Great video 👏 Congrats on this awesome job
@jovianhq 2 years ago ⁺¹
Thank you very much!
@kshitizprajapati694 3 years ago ⁺¹
I have completed the Zero to Pandas course.
Can you please create content about an SQL integration project?
@jovianhq 3 years ago
Sure, we will definitely consider the topic for our upcoming courses.
@dc4617 2 years ago
Thank you🙂
@825sohambharambe9 2 years ago
In my case, when I read the file, the Jupyter notebook takes way too long.
What can I do?
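For a large CSV that is slow to load, two standard `pandas.read_csv` options may help: read only the columns and rows you need (`usecols`, `nrows`), or stream the file in pieces with `chunksize`. The sketch below uses a tiny in-memory CSV so it runs anywhere; the column names are made up, and with the real file you would pass its path instead of the `io.StringIO` object.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the large file (made-up rows).
csv_text = (
    "ID,City,State,Severity\n"
    "1,Houston,TX,2\n"
    "2,Dublin,OH,3\n"
    "3,Miami,FL,2\n"
    "4,Dallas,TX,4\n"
)

# Option 1: load only some columns and the first few rows.
preview = pd.read_csv(io.StringIO(csv_text), usecols=["City", "Severity"], nrows=2)

# Option 2: stream the file in chunks and aggregate as you go.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
```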
@NiviudPu 8 months ago
Shall I do this as my mini project???
@martialrasta 3 years ago
There is a difference between county and country!
@milanms4593 3 years ago
Thanks, I got the idea of doing EDA.
Can you teach us about web scraping?
@theo_riveroooo 3 years ago ⁺¹
Corey Schafer has posted some great videos about that
@jovianhq 3 years ago
Hi Milan, we are doing a workshop on web scraping next Thursday (April 15th) at 9 PM IST on our YouTube channel.
th-cam.com/video/RKsLLG-bzEY/w-d-xo.html
@atharvaparanjape9585 2 years ago
At 37:59, how did we get a plot without importing matplotlib??
@nishikantanayak7797 1 year ago
At 1:29:38 of the video, there was no data about which days the accidents occurred on, so how was the graph formed?
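On the day-of-week question: a dataset does not need a separate "day" column, because the weekday can be derived from a timestamp. The sketch below assumes a timestamp column named `Start_Time` (an assumption, as are the sample rows) and uses the pandas `.dt` accessor:

```python
import pandas as pd

# Made-up rows; `Start_Time` is an assumed column name.
df = pd.DataFrame({
    "Start_Time": [
        "2020-01-05 14:30:00",  # a Sunday
        "2020-01-06 08:15:00",  # a Monday
        "2020-01-06 17:45:00",  # a Monday
    ]
})

# Parse the strings into datetimes, then derive calendar fields.
start_time = pd.to_datetime(df["Start_Time"])
day_of_week = start_time.dt.dayofweek   # Monday=0 ... Sunday=6
hour = start_time.dt.hour
```

Counting `day_of_week` values (or plotting their distribution) gives a per-weekday chart of the kind asked about.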
@kamilmohammed9722 2 years ago
Trying it now
@atifshaik1156 2 years ago
Is it fine to google something while working on a project?? Like you did in the video??
@jovianhq 2 years ago ⁺¹
Yes, it's absolutely fine. You're not expected to know everything, and even if you do know, there can be a better way of implementing the same thing. So it's totally fine to google something.
@debojitmandal8670 3 years ago
Sir, here you have a column called Severity, and it tells the severity of the accident.
So what I am asking is: to find the cities with the highest number of accidents, can I use the groupby function and group based on the city and severity?
I.e. df.groupby('City). Siverity.sum().sort_values( ascending = False)
Because I feel this is a better approach than using unique values.
Please, please reply
@jovianhq 3 years ago
Yes, you can do that, but the code should be like this:
df.groupby('City')["Severity"].count().sort_values(ascending=False). Here the Severity column does not matter for getting the total number of accidents, so we are just counting the total number of rows in each city instead of using sum() on Severity.
For better assistance, post your question in the community: jovian.ai/forum
@debojitmandal8670 3 years ago
@@jovianhq But why doesn't it matter? Because if you read that column's description, it says the severity of the accidents.
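The difference between counting rows and summing Severity shows up clearly on a toy example. In the made-up data below, Dublin has more accidents but Houston has more severe ones, so the two rankings disagree:

```python
import pandas as pd

# Made-up rows: 2 severe accidents in Houston, 3 minor ones in Dublin.
df = pd.DataFrame({
    "City": ["Houston", "Houston", "Dublin", "Dublin", "Dublin"],
    "Severity": [4, 4, 1, 1, 1],
})

# Number of accidents per city: one row is one accident.
accident_counts = df.groupby("City")["Severity"].count().sort_values(ascending=False)

# Sum of severity per city: a different quantity (a severity-weighted total).
severity_totals = df.groupby("City")["Severity"].sum().sort_values(ascending=False)

# value_counts() gives the same per-city counts as the groupby count.
same_counts = df["City"].value_counts()
```

`accident_counts` ranks Dublin first (3 vs 2), while `severity_totals` ranks Houston first (8 vs 3). Summing Severity answers "where is the total severity highest", not "where do most accidents happen". Note that `count()` skips missing values in the counted column, whereas `value_counts()` on City counts every row with a city.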
@chizzy4109 2 years ago
Wow, why am I just seeing this channel now
@jovianhq 2 years ago
Better late than never!
@vishwaslad1810 2 years ago
Great video
@KushalSaini14 3 years ago
Thank you so much
@jovianhq 3 years ago
You're welcome!
@NSASANAPURIKAVYASRI 1 year ago
It is asking for permissions to use those datasets. What should I do?
@bhushanwagh7192 3 years ago
Awesome sir
@anupriyasharma9282 2 years ago
Hello Sir,
1. Can you please tell me how to handle missing observations for the following features?
FEATURE              SUM
Precipitation(in)    510527
Wind_Chill(F)        449288
Wind_Speed(mph)      128852
Humidity(%)          45506
Visibility(mi)       44206
Weather_Condition    44001
Temperature(F)       43030
Wind_Direction       41857
Pressure(in)         36270
Weather_Timestamp    30263
Airport_Code         4248
Timezone             2302
Zipcode              935
dtype: int64
I have removed the "Number" feature, as 70% of the data in that column was missing.
Can we use mean/median/mode, or is there any other technique?
2. For the univariate analysis, wouldn't it be very lengthy and time-consuming to study 47 features?
@karthikbs8457 2 years ago
I have seen people filling empty cells with median values.
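Median filling, as mentioned in this reply, can be sketched as follows. This is only one common option, not a recommendation from the video: the example measures the share of missing values per column, fills a numeric column with its median, and fills a categorical column with its most frequent value. The rows are made up; the column names are borrowed from the list in the question.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value in each column.
df = pd.DataFrame({
    "Temperature(F)": [60.0, np.nan, 80.0, 70.0],
    "Weather_Condition": ["Clear", None, "Rain", "Clear"],
})

# Percentage of missing values per column.
missing_pct = df.isna().mean().sort_values(ascending=False) * 100

filled = df.copy()

# Numeric column: fill with the median of the observed values.
temp_median = filled["Temperature(F)"].median()
filled["Temperature(F)"] = filled["Temperature(F)"].fillna(temp_median)

# Categorical column: fill with the most frequent value (the mode).
top_condition = filled["Weather_Condition"].mode()[0]
filled["Weather_Condition"] = filled["Weather_Condition"].fillna(top_condition)
```

Whether imputing is appropriate at all depends on the question being asked; for columns that are mostly empty, dropping the column (as the commenter did for "Number") is often the simpler choice.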
@CuriousCars 3 years ago
if we have two columns like Vote_now and Past_vote so id chose rjd in both the columns how to calculate percentage of that
@jovianhq 3 years ago
Your question is not clear. Can you please post your question in the forum with pictures of your DataFrame?
Link: jovian.ai/forum
@siddharthpunn10 3 years ago
Great session
@jovianhq 3 years ago
Glad you liked it!
@NishantSingh-zx3cd 3 years ago
36:26
I tried numerics, but an error was thrown saying 'numerics' is not defined. Googled the problem and found out that df.select_dtypes(include='number') works.
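The error message suggests that a variable called `numerics` was used before being defined in the commenter's notebook (that is an inference from the message, not something confirmed here). The `include='number'` form avoids needing such a variable at all. A small sketch on made-up columns:

```python
import pandas as pd

# Made-up rows mixing text and numeric columns.
df = pd.DataFrame({
    "City": ["Dublin", "Houston"],
    "Severity": [2, 3],
    "Temperature(F)": [61.0, 75.5],
})

# Keep only the numeric columns (integers and floats).
numeric_df = df.select_dtypes(include="number")
numeric_columns = list(numeric_df.columns)
num_numeric = len(numeric_columns)
```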
@nishasilori4222 3 years ago
In his project there are 49 columns, but now there are only 47 columns. Is this problem because of that?
@ganeshr3297 2 years ago
At 21:03, I couldn't load the data. What should I do?
@yashdhangar3261 11 months ago
Which algorithm is used?
@PratapO7O1 3 years ago
How did this work? 1:49:52
sample_df=df.sample(int(0.1*len(df)))
@jovianhq 3 years ago
In df.sample() we need to pass a value denoting the number of rows we want in the sample. 0.1 * len(df) means the number of rows we want is equal to 10% of the length of df, i.e. if len(df) is 100 we want a sample dataframe with 10 rows.
@gajanansawadadkar5003 1 year ago
Good session
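The same 10% sample can be requested either as a row count or as a fraction. A minimal sketch on a made-up 100-row DataFrame; `random_state` is added here only to make the example reproducible and is not part of the line from the video:

```python
import pandas as pd

# Made-up DataFrame with 100 rows.
df = pd.DataFrame({"Severity": range(100)})

# Row count: int(0.1 * 100) == 10 rows.
sample_by_count = df.sample(int(0.1 * len(df)), random_state=42)

# Fraction: ask for 10% of the rows directly.
sample_by_frac = df.sample(frac=0.1, random_state=42)
```

The `int(...)` call matters in the first form because `0.1 * len(df)` is a float, while the row count passed to `sample` is expected to be an integer.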
