NO MORE NULLS - how to handle missing values in Spark DataFrames + Fabric (Day 10 of 30)

  • Published Oct 25, 2024

Comments • 2

  • @trityes6336 • 1 year ago

    Nice course. Is there a way to use a more complex imputer calculation, for instance to fill null values with the mean of each agent's sales instead of the mean of the whole dataset?

    • @LearnMicrosoftFabric • 1 year ago  +1

      Hey, thanks for your comment! I don't think there's an out-of-the-box way to do this in PySpark. You can achieve it, though, by grouping your data (or filtering out each category), fitting the Imputer on each subset, and then joining the results back together (see the sketch below).
      Another option is the Pandas on Spark API, which has much better support for that kind of groupby + fillna functionality - read more here: spark.apache.org/docs/3.2.0/api/python/reference/pyspark.pandas/api/pyspark.pandas.groupby.GroupBy.fillna.html
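
      A minimal sketch of the per-group Imputer approach described in the reply above. The DataFrame, the agent_id/sales column names, and the example values are hypothetical, used only to illustrate the pattern:

      ```python
      from functools import reduce

      from pyspark.sql import SparkSession, DataFrame
      from pyspark.ml.feature import Imputer

      spark = SparkSession.builder.getOrCreate()

      # Hypothetical example data: sales per agent, with some nulls
      df = spark.createDataFrame(
          [("A", 10.0), ("A", None), ("A", 30.0), ("B", 100.0), ("B", None)],
          "agent_id string, sales double",
      )

      imputer = Imputer(strategy="mean", inputCols=["sales"], outputCols=["sales_filled"])

      # Fit the Imputer on each agent's subset separately, then union the pieces back together
      agents = [row["agent_id"] for row in df.select("agent_id").distinct().collect()]
      parts = [
          imputer.fit(subset).transform(subset)
          for subset in (df.filter(df.agent_id == a) for a in agents)
      ]
      df_filled = reduce(DataFrame.unionByName, parts)

      df_filled.show()  # each null replaced by that agent's mean, not the global mean
      ```

      Note that this loop fits one Imputer per group, so with a very large number of categories a window-function approach (group mean via avg over a window, combined with coalesce) tends to scale better.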