Machine Learning & Data Science Project - 4 : Outlier Removal (Real Estate Price Prediction Project)

  • Published on 14 Oct 2024

Comments • 206

  • @codebasics
    @codebasics 2 years ago +3

    Check out our premium machine learning course with 2 Industry projects: codebasics.io/courses/machine-learning-for-data-science-beginners-to-advanced

  • @codingmadesimplified
    @codingmadesimplified 4 years ago +3

    I have been following your series for about 4-5 months now and found it one of the best for data science. Working on a real-world project like this is especially helpful for me.

    • @codebasics
      @codebasics 4 years ago +4

      Muhammad, I am really happy that this was helpful to you

  • @mohammedfaisal6714
    @mohammedfaisal6714 4 years ago +19

    Sir, you are our Anand Kumar.
    Hope all your students achieve their Dreams
    May the Almighty bless you with more Power, Health and Success.
    😃😃

  • @MrBunty85
    @MrBunty85 4 years ago +5

    Very interestingly and naturally described. I am loving it. Please continue with sharing your knowledge. Absolutely amazing presentation.

  • @jaganinfo
    @jaganinfo 4 years ago +2

    Hi Dhaval, I just subscribed to your channel after watching the previous three videos of the DS project. Your way of teaching programming is so simple. Thanks for spending your time on us.

    • @codebasics
      @codebasics 4 years ago +2

      Thanks and welcome

  • @nischalsingh3002
    @nischalsingh3002 1 year ago +6

    The inner for loops iterate over every bedroom-count group within the current location group (from the outer for loop).
    The first inner for loop stores the mean, std and number of data points (the number of rows in that bedroom group) in the dictionary created in the outer for loop, keyed by the bedroom count, i.e. bhk_stats[2] stores the stats for the 2-bedroom group.
    The second inner for loop performs the main work:
    stats = bhk_stats.get(bhk-1)
    Here it fetches the stats for the previous bedroom count.
    For example, for the 1-bedroom group it will be None, as nothing is stored for a 0-bedroom group, simply because there is no such value in the dataframe.
    Likewise, for the 3-bedroom group it fetches the stats for the 2-bedroom group (so that we can check the mean value).
    if stats and stats['count']>5:
    It checks that a dictionary is present (we didn't have one for the 1-bedroom group), because indexing a None value would throw an error. It also checks that the group has more than 5 values, because we cannot decide to discard something without comparing it against a substantial number of data points.
    exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft < (stats['mean'])].index.values)

  • @flamboyantperson5936
    @flamboyantperson5936 4 years ago +5

    Extremely helpful. I just love your way of teaching.

  • @lucasbegue8232
    @lucasbegue8232 4 years ago +13

    For outlier detection, shouldn't we be using a wider range of acceptance? In my statistics course we usually used mean +- 2*sigma or mean +- 3*sigma.

    • @codebasics
      @codebasics 4 years ago +7

      Yes, using 3 STD is a standard practice. Based on the situation here I used a different STD, but you can try 3 STD and let me know how the result looks.
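
A minimal sketch of the trade-off discussed in this thread, assuming the tutorial's `location` and `price_per_sqft` column names. The `k` parameter and the toy data are illustrative assumptions, not part of the video's code: `k=1.0` mimics a narrow one-sigma band and `k=3.0` the wider band suggested above.

```python
import pandas as pd

def remove_pps_outliers(df, k=1.0):
    """Keep rows whose price_per_sqft lies within mean +/- k*std of their location."""
    kept = []
    for _, subdf in df.groupby("location"):
        m = subdf["price_per_sqft"].mean()
        st = subdf["price_per_sqft"].std(ddof=0)  # population std, like np.std
        mask = (subdf["price_per_sqft"] > m - k * st) & (subdf["price_per_sqft"] <= m + k * st)
        kept.append(subdf[mask])
    return pd.concat(kept, ignore_index=True)

# Toy data (made up): one location with a single extreme listing
toy = pd.DataFrame({
    "location": ["A"] * 6,
    "price_per_sqft": [5000.0, 5200.0, 4800.0, 5100.0, 4900.0, 20000.0],
})
narrow = remove_pps_outliers(toy, k=1.0)  # drops the 20000 listing
wide = remove_pps_outliers(toy, k=3.0)    # keeps every row
```

On this toy frame the extreme value inflates the standard deviation so much that the three-sigma band no longer excludes it, which is one reason a narrower band can be a deliberate choice.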

  • @aasawaridikholkar7327
    @aasawaridikholkar7327 4 years ago +7

    Thank you for explaining the minor details in data preparation.
    Please explain the "stats and stats['count']>5" condition in the BHK outlier removal.

    • @codebasics
      @codebasics 4 years ago +8

      I am considering only cases where the number of apartments (for a given bhk) is greater than 5, because fewer than that would be too few samples to run any logic on. 5 is an arbitrarily chosen number; you can change it to something else that is reasonable.

    • @pattanayakswayanshu6161
      @pattanayakswayanshu6161 3 years ago +4

      Could you please explain why the -1 is there?
      stats = bhk_stats.get(bhk-1)

    • @jiyabyju565
      @jiyabyju565 3 years ago

      @codebasics thank you...I was on it....

    • @jiyabyju565
      @jiyabyju565 3 years ago +1

      @pattanayakswayanshu6161 here we get the dictionary values of 'mean', 'std' and 'count' for the bhk-1 group (if bhk is 2, then rows whose price_per_sqft is lower than the mean of the 1 bhk group are eliminated).

    • @Manojprapagar
      @Manojprapagar 2 years ago

      @jiyabyju565 what if the bhk value is 1?
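
The guard asked about in this thread can be sketched in isolation. The `bhk_stats` values below are made up, and `should_compare` is a hypothetical helper introduced only for illustration; the point is that `dict.get` returns `None` for a missing key, and `and` short-circuits so `stats["count"]` is never evaluated on `None`.

```python
# Hypothetical per-location statistics keyed by bedroom count (values made up)
bhk_stats = {
    1: {"mean": 4000.0, "std": 500.0, "count": 12},
    2: {"mean": 4500.0, "std": 600.0, "count": 3},
}

def should_compare(bhk):
    """True when the (bhk - 1) group exists and has more than 5 listings."""
    stats = bhk_stats.get(bhk - 1)  # None when there is no smaller group
    return bool(stats and stats["count"] > 5)

results = {bhk: should_compare(bhk) for bhk in (1, 2, 3)}
# bhk=1: no 0-bedroom group, so nothing is compared
# bhk=2: the 1-bedroom group has 12 listings, so the comparison runs
# bhk=3: the 2-bedroom group has only 3 listings, so it is skipped
```

So for bhk equal to 1 nothing is removed: the lookup for a 0-bedroom group yields `None` and the condition is false.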

  • @jaganinfo
    @jaganinfo 4 years ago +1

    The previous three videos were very easy to understand. This video's topic is the toughest and most interesting. Good feature engineering is very important for accuracy. I have a small doubt here: if we are doing this as a Kaggle project, how can we set the threshold as 300? Any suggestions?

    • @codebasics
      @codebasics 4 years ago +3

      Are you referring to 300 in this line? df5[df5.total_sqft/df5.bhk < 300]

  • @IRFANSAMS
    @IRFANSAMS 2 years ago

    Your videos are super awesome for someone who is doing self-study on ML.

  • @shubhamkanwal8977
    @shubhamkanwal8977 4 years ago

    Best youtube channel for data science

  • @lokeshakkireddy4671
    @lokeshakkireddy4671 4 years ago +3

    Sir, in the (10:40) to (11:10) part of the video, you said that for the same location (Rajaji Nagar) a 2 bhk priced higher than a 3 bhk is unusual, and removed it. But if the 2 bhk house is located near a highway and the 3 bhk is located far from it, then the 2 bhk's price will be higher than the 3 bhk's for the same location. What is your answer?

    • @niveshk936
      @niveshk936 4 years ago +4

      Yes, exactly. In fact, the price difference can also be because of the level of luxury they offer. I am personally not going to remove them, as removing them doesn't make sense.

  • @deependradeep9227
    @deependradeep9227 4 years ago +2

    Hello sir, when you remove the outlier using number of bathrooms (at 18:29), I think it should be df9 = df8[df8.bath

  • @suyashtambe1718
    @suyashtambe1718 2 years ago +1

    When you write the function for reducing the dataframe at 7:15, why do you write key, subdf in the for loop that groups by location? Why do you use the variable key?

  • @ismailkaracakaya260
    @ismailkaracakaya260 1 year ago +1

    Yes, but there are locations which are definitely more valuable than others, so a 2-bedroom in such a location can cost more than a 3-4 bedroom in another location. Wouldn't it be better to first separate locations by quality and create a new column holding 1 or 0, where 1 means the location is more expensive and 0 means it is normally priced? If a 2-bedroom is more expensive than a 3-bedroom, we check the new location column, and if it is 1 the price is considered normal. Deleting all those listings with fewer bedrooms and a higher price does not look like a solution. Please elaborate. Thank you.

  • @mevinodyou
    @mevinodyou 4 years ago +2

    Sir, thanks for the wonderful video. I need some clarification at 11:57: how did you get the mean, std and count for 1 bhk and 2 bhk? Are these assumed values or based on domain knowledge?

    • @jiyabyju565
      @jiyabyju565 3 years ago

      for bhk, bhk_df in location_df.groupby('bhk'):
          bhk_stats[bhk] = {
              'mean': np.mean(bhk_df.price_per_sqft),
              'std': np.std(bhk_df.price_per_sqft),
              'count': bhk_df.shape[0]
          }
      This for loop runs once for each bhk group and computes these values from the data; they are then used to filter rows out.

  • @abhishekmaharana5388
    @abhishekmaharana5388 1 year ago +1

    In my opinion there might be some cases where the property is a duplex or a multi-storey building, so sq. ft. values like 1000 sq. ft. with 4 bhk can be correct.

    • @kagguofficial
      @kagguofficial 1 year ago

      I was thinking the same. My house is 500 sqft and it has 3 rooms. We should find outliers according to bathrooms, as it is a bit unusual to have 5 baths even in a 4 bhk.

    • @abhishekmaharana5388
      @abhishekmaharana5388 1 year ago

      @kagguofficial 👍

  • @preethisriram8583
    @preethisriram8583 4 years ago +17

    thank you for your efforts.. this remove_bhk_outliers function is boggling my brain.. would someone be kind enough to explain the inner for loop?

    • @madhurjyadeka5569
      @madhurjyadeka5569 4 years ago +3

      Yeah it is troubling me too

    • @madhurjyadeka5569
      @madhurjyadeka5569 4 years ago +3

      Yeah, it's a pain

    • @KukaKaz
      @KukaKaz 4 years ago +2

      me too(

    • @swapnshah3234
      @swapnshah3234 4 years ago +46

      The inner for loops iterate over every bedroom-count group within the current location group (from the outer for loop).
      The first inner for loop stores the mean, std and number of data points (the number of rows in that bedroom group) in the dictionary created in the outer for loop, keyed by the bedroom count, i.e. bhk_stats[2] stores the stats for the 2-bedroom group.
      The second inner for loop performs the main work:
      stats = bhk_stats.get(bhk-1)
      Here it fetches the stats for the previous bedroom count.
      For example, for the 1-bedroom group it will be None, as nothing is stored for a 0-bedroom group, simply because there is no such value in the dataframe.
      Likewise, for the 3-bedroom group it fetches the stats for the 2-bedroom group (so that we can check the mean value).
      if stats and stats['count']>5:
      It checks that a dictionary is present (we didn't have one for the 1-bedroom group), because indexing a None value would throw an error. It also checks that the group has more than 5 values, because we cannot decide to discard something without comparing it against a substantial number of data points.
      exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft < (stats['mean'])].index.values)

    • @vidhyasagar7978
      @vidhyasagar7978 4 years ago +1

      @swapnshah3234 Thank you very much for the detailed explanation
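
For readers who want to step through the loops described in this thread, here is a runnable sketch assembled from the lines quoted in the comments. The toy data is made up, and `std(ddof=0)` is used on the assumption that it should match what `np.std` returns; treat it as an illustration rather than the exact notebook code.

```python
import numpy as np
import pandas as pd

def remove_bhk_outliers(df):
    exclude_indices = np.array([], dtype=int)
    for location, location_df in df.groupby("location"):
        # First pass: per-bedroom-count statistics for this location
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby("bhk"):
            bhk_stats[bhk] = {
                "mean": bhk_df.price_per_sqft.mean(),
                "std": bhk_df.price_per_sqft.std(ddof=0),
                "count": bhk_df.shape[0],
            }
        # Second pass: flag rows cheaper per sqft than the mean of the (bhk - 1) group
        for bhk, bhk_df in location_df.groupby("bhk"):
            stats = bhk_stats.get(bhk - 1)
            if stats and stats["count"] > 5:
                cheap = bhk_df[bhk_df.price_per_sqft < stats["mean"]]
                exclude_indices = np.append(exclude_indices, cheap.index.values)
    return df.drop(exclude_indices, axis="index")

# Toy data (made up): six 1 BHK rows with mean 4000, three 2 BHK rows
toy = pd.DataFrame({
    "location": ["A"] * 9,
    "bhk": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "price_per_sqft": [4000.0, 4100.0, 3900.0, 4050.0, 3950.0, 4000.0,
                       3500.0, 4200.0, 4600.0],
})
cleaned = remove_bhk_outliers(toy)  # the 2 BHK row at 3500 (index 6) is dropped
```

Only the 2 BHK row priced below the 1 BHK mean is removed; the 1 BHK rows are never compared against anything because no 0-bedroom group exists.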

  • @phamduy2251
    @phamduy2251 3 years ago +3

    I have a question, please explain it to me.
    When I use these 2 lines of code:
    df4[~(df4.total_sqft/df4.BHK < 300)] and df4[df4.total_sqft/df4.BHK >= 300], the second one gives me only 12456 records, so I lose more than 40 records.
    I don't know what the difference between the 2 cases is.
    Thank you so much.

    • @peeyush13
      @peeyush13 2 years ago

      This is due to NaN values in df4. They come from applying the function convert_sqft_to_max(x) to get dataframe df4 from df3. Thanks for pointing it out.
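
The NaN explanation above can be checked on a tiny frame. The data below is made up and reuses the `total_sqft` and `BHK` column names from the question; any comparison against NaN evaluates to False, so the two filters treat a missing ratio differently.

```python
import numpy as np
import pandas as pd

# Toy frame (made up): the last row has a missing total_sqft
df4 = pd.DataFrame({
    "total_sqft": [1200.0, 500.0, np.nan],
    "BHK": [2, 3, 2],
})
ratio = df4.total_sqft / df4.BHK  # 600.0, about 166.67, NaN

kept_by_negation = df4[~(ratio < 300)]  # NaN < 300 is False, so ~ keeps the NaN row
kept_by_ge = df4[ratio >= 300]          # NaN >= 300 is False, so the NaN row is dropped
```

The two results differ by exactly the rows whose ratio is NaN, which would account for a gap between the two record counts.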

  • @siddhesh201182
    @siddhesh201182 1 month ago

    Thanks for this tutorial.

  • @vipulagarwal9842
    @vipulagarwal9842 4 years ago +4

    Sir, instead of using the function for removing outliers at price per sqft, I did it manually by calculating mean and sd but I am getting a different result, can you help?

  • @manojkumaar8221
    @manojkumaar8221 8 months ago

    You cannot neglect the area_type; it is very crucial. A plot area of 600 sqft can easily have 8 bedrooms if it has been built up over 4 floors, so we cannot treat it as an outlier. Price prediction must be made separately for each area_type. area_type is the dependent variable here.

    • @S.K.IntelliCy
      @S.K.IntelliCy 3 months ago +1

      you are correct

  • @binodrt7
    @binodrt7 4 years ago +2

    First of all, thank you so much for uploading such a helpful video packed with lots of info.
    I have a doubt, kindly help me with it. How can we find, in such a large dataset, the few cases where the price of a 2 BHK apartment is more than that of a 3 BHK apartment?

    • @codebasics
      @codebasics 4 years ago

      You can find the average price of 2 bhk apartments and run a query on the dataframe that returns all 3 bhk apartments with a price less than the average 2 bhk price.
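
One way to phrase that query in pandas is sketched below. The toy data and the helper column `avg_2bhk_price` are assumptions for illustration, and the average is taken per location, which the reply does not spell out.

```python
import pandas as pd

# Toy data (made up), prices in arbitrary units
toy = pd.DataFrame({
    "location": ["A", "A", "A", "A", "B", "B"],
    "bhk": [2, 2, 3, 3, 2, 3],
    "price": [50.0, 70.0, 55.0, 90.0, 40.0, 80.0],
})

# Average 2 BHK price per location (A: 60.0, B: 40.0)
avg_2bhk = toy[toy.bhk == 2].groupby("location")["price"].mean()

# Attach that average to every 3 BHK row of the same location and compare
three_bhk = toy[toy.bhk == 3].copy()
three_bhk["avg_2bhk_price"] = three_bhk["location"].map(avg_2bhk)
suspicious = three_bhk[three_bhk.price < three_bhk.avg_2bhk_price]
```

Here only the 3 BHK listing at 55.0 in location A falls below its location's 2 BHK average.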

  • @waleadelkatan4498
    @waleadelkatan4498 3 years ago

    Thank you very much, and greetings for the effort. I have a small note: I think there is no need for the second loop in the code.

  • @basotra97
    @basotra97 4 years ago +1

    Learned a lot from this tutorial.

  • @TSGokhale
    @TSGokhale 9 months ago

    Hello @codebasics, how did you consider a 500 sqft 2 BHK apartment to consist of two 250 sqft bedrooms? You have miscalculated by neglecting the hall, kitchen and bath area, which are part of the total square feet.

  • @ranis1227
    @ranis1227 4 years ago +1

    Excellent tutorial thus far. I am running into an issue with the following:
    df5[df5.total_sqft/df5.bhk

    • @abdelazizhimmi5244
      @abdelazizhimmi5244 4 years ago

      It's because total_sqft is treated as str or 'object'. Try converting the values in total_sqft to numbers before you do that division. Good luck!!

    • @partharoy1733
      @partharoy1733 4 years ago

      @Abdelaziz Himmi - 'Total_sqft' is already float; it is 'bhk' that is 'str'. Please check at your end.
      @Ranis Lamberte - I got the same issue :P I checked and found bhk is the culprit here. Please try: df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))
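
A small sketch of the fix suggested in this thread, on made-up rows that assume the `size` column holds strings such as "2 BHK". Checking the dtypes before dividing shows which column is still text.

```python
import pandas as pd

# Toy rows (made up)
df3 = pd.DataFrame({
    "size": ["2 BHK", "4 Bedroom", "3 BHK"],
    "total_sqft": [1056.0, 2600.0, 1440.0],
})

# Take the leading number of the size string as an integer bedroom count
df3["bhk"] = df3["size"].apply(lambda x: int(x.split(" ")[0]))

# Both operands are numeric now, so the division works element-wise
both_numeric = (pd.api.types.is_numeric_dtype(df3["total_sqft"])
                and pd.api.types.is_numeric_dtype(df3["bhk"]))
sqft_per_bhk = df3.total_sqft / df3.bhk
```

If either column still had dtype `object`, the division would fail or misbehave, so `df3.dtypes` is a quick first check.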

  • @madhurjyadeka5569
    @madhurjyadeka5569 4 years ago +2

    Hello sir, could you please explain the "stats and stats['count']>5" condition in the BHK outlier removal?
    Why specifically 5?

    • @codebasics
      @codebasics 4 years ago +1

      I am considering only cases where the number of apartments (for a given bhk) is greater than 5, because fewer than that would be too few samples to run any logic on. 5 is an arbitrarily chosen number; you can change it to something else that is reasonable.

    • @utkarsh9926
      @utkarsh9926 4 years ago +1

      Thanks for asking, I have the same doubt.

  • @sumbulsultan4826
    @sumbulsultan4826 3 years ago

    For the second outlier removal, the one for bhk, did you use the mean, std and count from the markdown cell to get to about 7000 rows? When I did it, I got the same shape as before, which is (10241, 7). Everything up to this point has been the same, code and values, but I can't figure out how you dropped 3k rows. Please let me know. Thank you.

  • @mahaboobrashid3896
    @mahaboobrashid3896 4 years ago +13

    Hi, it is difficult to understand the remove_pps_outliers and remove_bhk_outliers functions. Could you make separate videos for both functions? It would be helpful.

    • @ramendrachaudhary9784
      @ramendrachaudhary9784 4 years ago +5

      The remove_pps_outliers function loops through the subgroups of locations. For example, a subdf could be all data points with "jayanagar" as the location. It calculates the mean and std of the rows in the jayanagar location, then selects all points that are within m-st and m+st for jayanagar and adds them to df_out.
      You can see how the looping works by just printing the subdf:
      def trying(df):
          df_out = pd.DataFrame()
          for key, subdf in df.groupby("location"):
              print(subdf[["location", "size"]], "\n------------------------------\n")
      This will help you visualise how the loop is working. Hope this helps!!

    • @jeehunkang5648
      @jeehunkang5648 2 years ago +1

      @ramendrachaudhary9784 this is an amazing explanation. Would you also be able to explain what "key" means in "for key, subdf in df.groupby("location"):"?

    • @Narendiranath
      @Narendiranath 2 years ago

      @jeehunkang5648 key is the actual location and subdf is the data (total_sqft, bath, price, bhk, price_per_sqft) corresponding to that location.
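
The role of `key` can be seen by iterating a grouped frame directly. The sketch below uses made-up rows: each item produced by `groupby` is a `(key, sub-frame)` tuple, which is why the loop unpacks two names.

```python
import pandas as pd

# Toy rows (made up)
toy = pd.DataFrame({
    "location": ["Hebbal", "Jayanagar", "Hebbal"],
    "size": ["2 BHK", "3 BHK", "3 BHK"],
})

rows_per_location = {}
for key, subdf in toy.groupby("location"):
    rows_per_location[key] = len(subdf)  # key is the location name

# Without unpacking, each item is the whole (key, subdf) tuple
first_item = next(iter(toy.groupby("location")))
```

Writing `for subdf in df.groupby("location")` therefore binds `subdf` to a tuple, and attribute access such as `subdf.price_per_sqft` fails on it.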

  • @gabrielsalazar7646
    @gabrielsalazar7646 3 years ago +1

    Any help on the
    if stats and stats['count']>5:
    I'm not quite understanding why there is a 5 there. Any help?

  • @vinaysingh-jv9zz
    @vinaysingh-jv9zz 2 years ago

    Hi Dhaval, you make really interesting and resourceful content and I learn a lot from your data science videos. I have one doubt here on the outlier removal part. Is it necessary to remove 3 BHK houses in a particular location which have a lower price and lower total_sqft? I am thinking of a situation where the 2 BHK houses are bigger, which is why they cost more than the 3 BHK ones. Please help me out, and thank you once again for your wonderful videos.

    • @danasharon4752
      @danasharon4752 2 years ago

      I was wondering the same. Maybe these are newer 2BHK homes, and if the sqft is the same for 2bhk/3bhk, wouldn't that mean more spacious 2bhk? thx!!

    • @hero4future
      @hero4future 1 year ago

      I agree, determining outliers based solely on the number of bedrooms is not a reasonable heuristic. This requires more understanding of the subject matter (here, which factors determine prices) to carry out anomaly detection; otherwise we are just chomping the dataset down smaller than it should be.

  • @izharkhankhattak
    @izharkhankhattak 3 years ago +1

    Pretty nice work, Sir!

    • @codebasics
      @codebasics 3 years ago

      Glad it was helpful!

  • @us12345
    @us12345 4 years ago

    Hello Sir, thanks for these tutorials.
    I have a query: is there any other way to handle these outliers?

  • @vikasyaduvanshi2222
    @vikasyaduvanshi2222 3 years ago

    This part is a little complicated. Thanks for sharing your knowledge with us.

    • @codebasics
      @codebasics 3 years ago +1

      Vikas, I will be uploading a few more projects. They will be easier to understand than this one.

    • @vikasyaduvanshi2222
      @vikasyaduvanshi2222 3 years ago

      @codebasics Thanks Sir, you really are a lifesaver. Namaste

    • @vikasyaduvanshi2222
      @vikasyaduvanshi2222 3 years ago

      @codebasics Sir, how can I write this type of function by myself? Can you please explain? I'm new to A.I. and don't have any experience in coding.

  • @ankitchatterjee1615
    @ankitchatterjee1615 2 years ago +1

    Hello Dhaval Sir, I have done exactly the same thing as you.
    But at 11:00 I am getting a very different plot in the case of Hebbal.
    Can you please help me?

  • @jajatisahoo3831
    @jajatisahoo3831 3 years ago

    Awesome tutorial

  • @santhoshinisantu6047
    @santhoshinisantu6047 3 months ago

    At 7:26, while removing outliers of price_per_sqft, can we use the IQR method instead of a function? Could you please clarify my doubt?

  • @shivamrai4052
    @shivamrai4052 4 years ago +1

    Sir, I am not able to understand why we are eliminating some values through remove_bhk_outliers, because there may be some 2 bhk whose price is greater than a 3 bhk's due to a prime location?

    • @ManindraKhandyana
      @ManindraKhandyana 1 year ago +1

      No, we are removing the data points at the exact same location and similar sqft where the 2 bhk price > 3 bhk price.

  • @ManojKumaryOyOmaDy
    @ManojKumaryOyOmaDy 4 years ago

    The most valuable things 👌🙌

  • @ruksharalam173
    @ruksharalam173 6 months ago

    Please do more such projects.

  • @sudikshapatil7022
    @sudikshapatil7022 2 years ago

    I have a small doubt: here we have compared the price of a 3 bhk with the price of a 2 bhk, that is, one bedroom less, but there can be a scenario where the price of a 3 bhk is less than the price of a 1 bhk, which is not checked. Correct me if I am wrong.
    In addition, I have another question: if we compare with the size of the original dataset, almost half of the dataset has been removed. Will this not affect the result, given that we have lost half of the dataset?

    • @hero4future
      @hero4future 1 year ago

      1) Yes, we have to take all bedroom counts into account, as focusing on just these 2 will affect the rest of the bedroom counts.
      2) Considering the original dataset is small (around 13300 rows), this cleaning process took out almost half, leaving 7300 rows. It is worrisome to say we have cleaned 'outliers', after already removing some columns in the previous video.

  • @codebasics
    @codebasics 4 years ago +3

    Complete machine learning tutorial playlist: th-cam.com/video/gmvvaobm7eQ/w-d-xo.html

    • @Arjun147gtk
      @Arjun147gtk 4 years ago

      Sir, how many times do we have to check for outliers? I have a dataset where, after removing the outliers and checking again, it still shows some outliers. Do we have to run it in a loop until there are no outliers?

    • @tejassutar4198
      @tejassutar4198 4 years ago

      Hello sir, I am new to DS and I have already done a forecasting project. Is it now worth doing an NLP project, given that I have 2 years of experience outside DS?

  • @flamboyantperson5936
    @flamboyantperson5936 4 years ago

    Extremely helpful tutorial

  • @Albert-ts2mu
    @Albert-ts2mu 4 years ago +1

    Thank you!
    However, the distribution of price per sqft is more like chi-squared, isn't it?

  • @syedrayyan8130
    @syedrayyan8130 4 years ago +2

    Hello sir,
    After calling the remove_pps_outliers(df6) function I am getting a shape of only (7,7) rows and columns.
    Please help me sir.

    • @premajha2798
      @premajha2798 3 years ago

      I am also getting the same error

    • @nimilithagurram234
      @nimilithagurram234 2 years ago

      Did you get the solution for this problem?

  • @abhishekbourai1832
    @abhishekbourai1832 3 years ago +1

    Please help: for key, subdf in df.groupby('location'):
    Why are we using the key variable? If I remove key I get the error message: AttributeError: 'tuple' object has no attribute 'price_per_sqft'

  • @souvikroy5
    @souvikroy5 2 years ago

    Is it good to keep creating a new dataframe if there are too many rows? Like you're doing with df5, df6: wouldn't we run out of memory?

    • @hero4future
      @hero4future 1 year ago +1

      1) The data is small, so the memory cost isn't significant.
      2) He's rerunning individual cells rather than the whole notebook, so that won't be a problem.

    • @hero4future
      @hero4future 1 year ago +1

      In case you want to see the actual memory cost, use memory_profiler to check.
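
Besides an external profiler, pandas can report the footprint of each frame itself. This is a minimal sketch on made-up data using `DataFrame.memory_usage(deep=True)`; the `df5` and `df6` names only mirror the naming pattern from the question.

```python
import pandas as pd

# Toy frame (made up), 3000 rows
df5 = pd.DataFrame({
    "location": ["A", "B", "C"] * 1000,
    "price": [1.0, 2.0, 3.0] * 1000,
})
df6 = df5[df5.price > 1.0]  # a filtered copy, as in the df5 -> df6 pattern

# deep=True also counts the Python string objects behind object columns
bytes_df5 = int(df5.memory_usage(deep=True).sum())
bytes_df6 = int(df6.memory_usage(deep=True).sum())
```

If an intermediate frame is no longer needed, `del df5` lets Python release it, though for a dataset of a few thousand rows the cost is small either way.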

    • @souvikroy5
      @souvikroy5 1 year ago

      @hero4future thank you

  • @KukaKaz
    @KukaKaz 4 years ago +8

    So hard to understand this video (( The first 3 videos were well explained, though.

  • @Chinarkashmirmusic
    @Chinarkashmirmusic 4 years ago

    I am facing a "from ipykernel import kernelapp as app" error in the last remove-outliers step. Can you please help me with this?

  • @ThaiNguyen-pr2vt
    @ThaiNguyen-pr2vt 4 years ago +1

    thanks for doing this data science project videos! it really helps with my learning.

    • @codebasics
      @codebasics 4 years ago +1

      I appreciate you leaving a comment of appreciation, Thai Nguyen.

  • @pavitran5495
    @pavitran5495 6 months ago

    Can we remove outliers in bath and bhk using the standard deviation method? Please answer.

  • @divyar8402
    @divyar8402 4 years ago

    Hi sir, beautifully done video.
    I was wondering why you didn't use groupby and aggregate to get what you achieved with the remove bhk outliers code.
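
As a sketch of the groupby-and-aggregate idea raised here, the per-location, per-bhk statistics can be computed in one call. The toy data is made up; `std` is passed as a lambda with `ddof=0` on the assumption that it should match `np.std`, since pandas defaults to `ddof=1`.

```python
import pandas as pd

# Toy data (made up)
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "bhk": [1, 1, 2, 2, 2],
    "price_per_sqft": [4000.0, 4200.0, 5000.0, 6000.0, 6400.0],
})

# One row of statistics per (location, bhk) pair
bhk_stats = (
    toy.groupby(["location", "bhk"])["price_per_sqft"]
       .agg(mean="mean", std=lambda s: s.std(ddof=0), count="count")
)
```

This only replaces the statistics-gathering loop; comparing each group against its `bhk - 1` neighbour would still need a join or a shift on top of this table.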

  • @veldibharathsrivardhan3559
    @veldibharathsrivardhan3559 3 years ago

    "if stats and stats['count']>5" returns true only if both conditions are true, right?
    Then what do we actually mean by stats being true?

  • @pranjalgupta9427
    @pranjalgupta9427 4 years ago +1

    Can we use price for removing outliers?

  • @adityap07-7
    @adityap07-7 1 month ago

    Explanation of Each Step (Disclaimer: ChatGPT was used for this explanation)

    Initialization:
        exclude_indices = np.array([])
    This initializes an empty NumPy array called exclude_indices to store indices of rows that will be excluded (i.e., identified as outliers).

    Group by Location:
        for location, location_df in df.groupby('location'):
    The DataFrame df is grouped by the 'location' column. location is the group key, and location_df is the subset DataFrame corresponding to that location.

    Calculate Statistics for Each BHK:
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
    Within each location group, the data is further grouped by 'bhk' (the number of bedrooms). For each BHK type, calculate:
    mean: The mean of price_per_sqft.
    std: The standard deviation of price_per_sqft.
    count: The number of rows for that BHK type.
    Store these statistics in a dictionary bhk_stats, where the key is the BHK value.

    Identify and Collect Outliers:
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count'] > 5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft < (stats['mean'])].index.values)
    For each BHK type within the location, look up the statistics for bhk-1 (i.e., the previous BHK type).
    If the statistics for bhk-1 exist and there are more than 5 entries for bhk-1, consider the rows for the current BHK type (bhk) where price_per_sqft is less than the mean price of bhk-1 as outliers.
    Append these indices to exclude_indices.

    Drop Outliers:
        return df.drop(exclude_indices, axis='index')
    After collecting all outlier indices, drop these rows from the original DataFrame and return the cleaned DataFrame.

    Summary
    Group: DataFrame is grouped by location and then bhk to calculate statistics for each combination.
    Calculate Statistics: Mean, standard deviation, and count are computed for price_per_sqft.
    Identify Outliers: Rows are flagged as outliers if their price_per_sqft is lower than the mean price of bhk-1 (when applicable) and the count of bhk-1 is greater than 5.
    Drop Outliers: Rows identified as outliers are removed from the DataFrame.
    This code aims to clean the dataset by removing outliers based on BHK pricing relative to the previous BHK type within the same location.

  • @followthepassion1530
    @followthepassion1530 4 years ago

    Thank you so much sir...

  • @bq_wang
    @bq_wang 3 years ago

    nice. thank you, sir.

    • @codebasics
      @codebasics 3 years ago

      You are most welcome

  • @amitthakur6707
    @amitthakur6707 1 year ago

    Hello sir, how do we calculate the accuracy of this project?
    And what is the accuracy of this model, can you please tell?

  • @sharkk2979
    @sharkk2979 3 years ago

    Thanks sir

  • @sudharsan1515
    @sudharsan1515 2 years ago

    Dear sir - Thanks a lot for all the tutorials. I have learnt a lot and am learning more by going through quite a few videos.
    Anyone who is going through this tutorial - I am looking for some help please.
    I am using PyCharm. I am encountering an error, TypeError: 'Series' object is not callable, when I use df4 = remove_pps_outliers(df3)
    Did anyone encounter this error? If yes, can you please advise how you were able to proceed further? Thanks.

    • @otrocomentariomas
      @otrocomentariomas 2 years ago

      You need to check the code in the function. Run the function first and see if it works; then you can find which line to fix.

    • @aishwaryadharmadhikari7165
      @aishwaryadharmadhikari7165 2 years ago

      I'm getting the same error too.

  • @anthonym9130
    @anthonym9130 4 years ago

    A 1 bedroom might cost more because it is a much larger room than the 2 bedrooms. Not so sure i would remove those.

  • @GAURAVKUMAR-xe9ce
    @GAURAVKUMAR-xe9ce 3 years ago

    At 12:56, why have we added the condition "if stats and stats['count']>5:"? I didn't get the logic behind the greater-than-5 condition.

  • @netherdrake436
    @netherdrake436 2 years ago

    The issue is that you are assuming the right prices for Bangalore. You should have used K-Means clustering to detect outliers and then removed them, not just done it according to assumptions.

    • @hero4future
      @hero4future 1 year ago

      Just offering some of what I encountered when I looked into k-means clustering for anomaly detection on this dataset:
      - There will always be 2 clusters, for 2-bedroom and 3-bedroom (what about other bedroom counts that aren't as significant but still have around 500 samples, since this will affect machine learning?)
      - Based on location (assuming you're plotting it after running the price per square feet removal function), some locations will have very few 2-3 bedroom listings, so we can only trim locations with significant sample sizes
      - You have to determine how far from the cluster center a data point must be to get removed

  • @nagendravishwamitra3652
    @nagendravishwamitra3652 3 years ago

    Sir, could you please make a video on forecasting sales values for future dates in time series analytics? Please help.

  • @lucasbegue8232
    @lucasbegue8232 4 years ago

    In the function at 13:00, you are only removing outliers where, for example, the price_per_sqft of a 2bhk is less than the mean price_per_sqft of 1bhk (at a given location). Based on this logic, why wouldn't you also remove rows where the price_per_sqft of a 2bhk is higher than the mean price_per_sqft of 3bhk (at a given location)?

  • @birajitnath4235
    @birajitnath4235 4 years ago

    Thank you sir for such an awesome tutorial, but I have a doubt about line 6 of the following code.
    The question is why you have used bitwise & instead of logical 'and'. Please explain sir.
    def remove_pps_outlier(df):
        output_df=pd.DataFrame()
        for loc, subdf in df.groupby('location'):
            m=np.mean(subdf.price_per_sqft)
            st=np.std(subdf.price_per_sqft)
            reduced_df=subdf[(subdf.price_per_sqft>(m-st)) & (subdf.price_per_sqft

    • @raihanhosain3374
      @raihanhosain3374 1 year ago

      df7 = df6[(df6['price_per_sqft'] > (price_per_sqft_mean - price_per_sqft_std)) &
                (df6['price_per_sqft'] < (price_per_sqft_mean + price_per_sqft_std))]
      df7.shape
      I used this instead of the given code. I was confused and couldn't understand the logic. I wish sir had explained these two functions clearly.
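
On the question of `&` versus `and` raised in this thread: `&` combines two boolean Series element by element, while `and` asks Python for a single truth value of each Series, which pandas refuses to give. A short sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series([1.0, 5.0, 9.0])

# Element-wise AND: one True/False per row
mask = (s > 2) & (s <= 9)

# `and` needs one truth value for the whole Series, so pandas raises ValueError
try:
    (s > 2) and (s <= 9)
    and_failed = False
except ValueError:
    and_failed = True
```

The parentheses around each comparison matter as well, because `&` binds more tightly than `>` and `<=`.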

  • @anushkasaxena5257
    @anushkasaxena5257 3 months ago

    Off topic, but 300 sqft can't be the starting threshold. 300 sqft is the size of a big master bedroom.

  • @sandeepsharma7405
    @sandeepsharma7405 3 years ago

    "Remove pps outliers": Sir, please help me with this type of program. How do you think through and write such a program? Please make another video on this, if possible.

  • @muslimummah2343
    @muslimummah2343 3 years ago

    At 13:47 the output is the same as df7.
    Sir, where is the problem?

  • @turanfair9364
    @turanfair9364 3 years ago

    Removing 2-bedroom apartments that cost more than 3-bedroom ones is OK. But I think one important feature is missing: the year the house was built. Maybe because of that a 2 bd is more expensive than a 3 bd.

  • @dhananjaykansal8097
    @dhananjaykansal8097 4 years ago +1

    Thank you sir.

  • @saritagupta7200
    @saritagupta7200 2 years ago +1

    For outliers, we can also use the formula of Quartiles
    Outliers > Q3 + 1.5 IQR
    Outliers < Q1 - 1.5 IQR

    • @brefat
      @brefat 2 years ago

      For sure. Thanks for reminding me.
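
The quartile rule quoted in this thread can be sketched as follows. The prices are made up, and `Series.quantile` uses linear interpolation by default, so other quartile conventions may give slightly different fences.

```python
import pandas as pd

# Toy price_per_sqft values (made up) with one extreme listing
prices = pd.Series([4800.0, 4900.0, 5000.0, 5100.0, 5200.0, 20000.0])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values inside the fences
inliers = prices[(prices >= lower) & (prices <= upper)]
```

Unlike the mean and standard deviation, the quartiles are barely moved by the single extreme value, so the 20000 listing falls outside the upper fence.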

  • @rashikrahman300
    @rashikrahman300 4 years ago

    "stats and stats['count']>5": it would be a great help if you could explain this condition.
    And why specifically 5?

    • @codebasics
      @codebasics 4 years ago +1

      I am considering only cases where the number of apartments (for a given bhk) is greater than 5, because fewer than that would be too few samples to run any logic on. 5 is an arbitrarily chosen number; you can change it to something else that is reasonable.

  • @jaychotalia7542
    @jaychotalia7542 1 year ago

    why is stats['count'] > 5 required in remove_bhk_outliers function?

  • @titashmaity5924
    @titashmaity5924 4 years ago

    How is area taken into consideration when you remove 2BHK flats priced below the mean price of 1BHK flats at the same location? Kindly answer. I will be waiting for your reply.

    • @codebasics
      @codebasics 4 years ago +1

      I think I do area-based data cleaning in a separate code block.

    • @titashmaity5924
      @titashmaity5924 4 years ago

      @codebasics Thanks Dhaval Sir.

  • @vishwasjajpura796
    @vishwasjajpura796 1 year ago

    What is that key in the function remove_pps_outliers?

  • @debaprasannbhoi230
    @debaprasannbhoi230 10 months ago

    That was funny: 'In Bangalore the houses more than 13 bathrooms'

  • @fawzifgff3020
    @fawzifgff3020 2 years ago

    Thank you very much. What should I ask Google to get something like the function "remove_pps_outliers(df)"?

  • @sharkk2979
    @sharkk2979 3 years ago

    But sir, if the flat is in a posh society, the price will be higher even if it has fewer bedrooms, right?

  • @nomanislam3157
    @nomanislam3157 4 years ago

    Respected Sir,
    Plot_scatter_chart(df8,"Hebbal")
    NameError Plot_scatter_chart not defined.

  • @ramprasadsapkota1013
    @ramprasadsapkota1013 3 years ago

    When I type df5.head(10),
    it says df5 is not defined, even though everything above runs properly.

  • @nikhilmishra8629
    @nikhilmishra8629 3 years ago

    Can someone please explain why he used stats and stats['count'] > 5 in the remove_bhk_outliers function? Can we use some other number like 0 or 1? It filters out more indices when I use them.

  • @sahil.sartaj
    @sahil.sartaj 4 years ago +1

    Sir, I did not get why you use negate (~) at 3:39.
    Please help, sir.

    • @partharoy1733
      @partharoy1733 4 years ago +1

      It's a shortcut to get the opposite value: ~ flips every True/False in a boolean mask, so you keep the rows that do NOT match the condition.
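
      A tiny illustration with made-up numbers. The column names and the 300 sqft per bedroom cut-off are assumed here, based on the threshold discussed elsewhere in this thread.

```python
import pandas as pd

df = pd.DataFrame({"total_sqft": [1000, 600, 1500], "bhk": [2, 3, 3]})

# True where the flat has less than 300 sqft per bedroom
too_small = df["total_sqft"] / df["bhk"] < 300

# ~ inverts the mask element-wise, so this keeps the rows that are NOT too small
kept = df[~too_small]
print(kept)
```

      Note that ~ is the element-wise operator for pandas boolean Series; the plain Python keyword not does not work on a whole Series.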

  • @hamsavardhinim1278
    @hamsavardhinim1278 1 year ago

    Any suggestions on how to remove a value error?

  • @sudarshanchougule2969
    @sudarshanchougule2969 1 year ago

    Sir, I don't understand the outlier removal part of your project.

  • @INDIAN_SHADHU
    @INDIAN_SHADHU 3 years ago

    UnboundLocalError : local variable 'excludes_indices' referenced before assignment
    How do I fix this error?

  • @zohrabatool1204
    @zohrabatool1204 4 years ago

    Why did you remove the higher-priced 2-bedroom apartments?

  • @ujjwalwadera6858
    @ujjwalwadera6858 2 years ago

    At 13:45 you talked about 2 BHK and 3 BHK, but then you do the outlier removal with 1 BHK and 2 BHK. Why?
    Why not 2 and 3 BHK?

    • @hero4future
      @hero4future 1 year ago

      What do you mean? The cell above is only for illustration. Read the code: he compares each group with the one that has one bedroom fewer, i.e. 2 bedroom relative to 1 bedroom, 3 bedroom relative to 2 bedroom, 4 bedroom relative to 3 bedroom, etc.

  • @sandeepsharma7405
    @sandeepsharma7405 3 years ago

    I have learned about for loops and def functions, but in real-world use I am unable to define functions myself.

  • @sejalanand23
    @sejalanand23 4 years ago

    Learning some new things which even paid courses didn't mention.

  • @jeehunkang5648
    @jeehunkang5648 2 years ago

    After applying the remove_bhk_outliers function to df7 (df8 = remove_bhk_outliers(df7)), does anyone get an error message saying that it reached the maximum recursion depth?

  • @nobodypodcasts
    @nobodypodcasts 1 year ago

    Half the dataset is outliers?

  • @RahulPandey-jo6rk
    @RahulPandey-jo6rk 2 years ago

    Why do we need key in the for loop?

    • @hero4future
      @hero4future 1 year ago

      If you loop over df.groupby('location'), each iteration gives you a (key, value) pair, where the key is the location name and the value is the sub-DataFrame of rows that match that key. To be clear, key is one location out of the set of locations.
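
      A small sketch showing what the loop yields. The DataFrame contents are made up, and the column names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["Hebbal", "Whitefield", "Hebbal"],
    "price_per_sqft": [6000, 5000, 7000],
})

means = {}
# Each iteration yields the group label and the rows belonging to it
for key, subdf in df.groupby("location"):
    means[key] = subdf["price_per_sqft"].mean()
    print(key, len(subdf), means[key])
```

      So key is only needed because groupby hands back a pair; if the name is not used inside the loop, it can be replaced with an underscore.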

  • @aishwaryadharmadhikari7165
    @aishwaryadharmadhikari7165 3 years ago

    Problem at 7:15: it says the DataFrame object is not passable.
    Need help.

    • @sudharsan1515
      @sudharsan1515 2 years ago

      Hi Aishwarya, I am stuck at the same place. Did you manage to move forward?

  • @DSlayer007
    @DSlayer007 1 year ago

    6:29 KeyError: location. I am stuck here, please help.

  • @annonymous.
    @annonymous. 1 year ago

    7:34 I can not understand the function here, can anyone explain it to me, please?

  • @androidgamez3639
    @androidgamez3639 3 years ago

    You just casually mention the outlier removal function? Many questions arise, but you ignore them all. Why are you choosing these thresholds, and why do you consider 1 standard deviation? Why not 2?

    • @codebasics
      @codebasics  3 years ago

      The general guideline for removing outliers is 3 standard deviations or above. In our specific case I used one standard deviation, but yes, in the majority of cases you would use 3. If the dataset follows a normal distribution, about 99.7% of data points fall within 3 standard deviations of the mean, and you can safely treat anything beyond that as an outlier.
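
      To compare the two cut-offs, here is a sketch of a per-location filter with the number of standard deviations as a parameter. It follows the idea discussed in this thread and is not the exact code from the video; the n_std parameter, the column names and the demo values are assumptions.

```python
import pandas as pd

def remove_pps_outliers(df, n_std=1.0):
    kept = []
    for key, subdf in df.groupby("location"):
        m = subdf["price_per_sqft"].mean()
        st = subdf["price_per_sqft"].std(ddof=0)  # population std of this location
        # Keep rows within n_std standard deviations of the location mean
        mask = (subdf["price_per_sqft"] > m - n_std * st) & (
            subdf["price_per_sqft"] <= m + n_std * st
        )
        kept.append(subdf[mask])
    return pd.concat(kept, ignore_index=True)

# Made-up data: one location with a single extreme listing
demo = pd.DataFrame({
    "location": ["A"] * 5,
    "price_per_sqft": [5000, 5100, 5200, 5300, 20000],
})
one_std = remove_pps_outliers(demo, n_std=1.0)
three_std = remove_pps_outliers(demo, n_std=3.0)
print(len(one_std), len(three_std))
```

      With only a few rows per location, a single extreme value inflates the standard deviation so much that a 3 standard deviation band no longer excludes it, which is one practical argument for a tighter band on small groups.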

    • @androidgamez3639
      @androidgamez3639 3 years ago

      @codebasics Yes, I checked the 99th percentile and there are no outliers, but beyond that there is a big jump in the values. That is why I wondered why you chose 1 std. Now I have a question about the total sqft per BHK value you chose.

  • @detacreations1999
    @detacreations1999 3 years ago

    How did you decide on the threshold?

    • @prashanthshetkar2350
      @prashanthshetkar2350 3 years ago

      It was just an assumed price-per-square-foot value. He said we have to get clarity on it by reaching out to someone who knows the domain, or it may be set by the management.