Pyspark Scenarios 11 : how to handle double delimiter or multi delimiters in pyspark

  • Published on 13 Jan 2025

Comments • 30

  • @rajeshk1276
    @rajeshk1276 2 years ago +2

    Very well explained. Loved it.

  • @sravankumar1767
    @sravankumar1767 2 years ago +2

    Nice explanation 👌 👍 👏

  • @prabhakaranvelusamy
    @prabhakaranvelusamy 2 years ago +1

    Excellent explanation!

  • @tanushreenagar3116
    @tanushreenagar3116 2 years ago +1

    Nice

  • @gobinathmuralitharan1997
    @gobinathmuralitharan1997 2 years ago +1

    Clear explanation 👍👏 Thank you 🙂

    • @TRRaveendra
      @TRRaveendra 2 years ago +1

      Thank You Gobinath

  • @pokemongatcha122
    @pokemongatcha122 2 years ago +2

    Hi Ravi, I'm trying to split a column by delimiter where each cell has a different number of commas. Can you write code that splits into a new column at each occurrence of a comma? E.g. if row 1 has 4 commas it generates 4 columns, but row 2 has 10 commas, so it generates another 6 columns.

  • @gobinathmuralitharan1997
    @gobinathmuralitharan1997 2 years ago +1

    Subscribed 🔔

  • @fratkalkan7850
    @fratkalkan7850 2 years ago

    perfection

  • @penchalaiahnarakatla9396
    @penchalaiahnarakatla9396 2 years ago +1

    Hi, good video. One clarification: when I write DataFrame output to CSV, the leading zeros are missing. How do I handle this scenario? If possible, please make a video on this.

    • @TRRaveendra
      @TRRaveendra 2 years ago

      Thank you 👍

    • @penchalaiahnarakatla9396
      @penchalaiahnarakatla9396 2 years ago +1

      Hope the next video will be on this.

    • @fakrullahimran
      @fakrullahimran 2 years ago

      @penchalaiahnarakatla9396 Try adding option("quoteAll", True) and check once.

    • @penchalaiahnarakatla9396
      @penchalaiahnarakatla9396 2 years ago +1

      @fakrullahimran Thanks. I will try it and update you.

  • @snagendra5415
    @snagendra5415 2 years ago +1

    Could you explain the Spark small files problem using PySpark?
    Thank you in advance

    • @TRRaveendra
      @TRRaveendra 2 years ago +2

      Sure, I will do a video on the small files problem.

    • @snagendra5415
      @snagendra5415 2 years ago

      @TRRaveendra Thank you for your reply. Waiting for the video 🤩

  • @udaynayak4788
    @udaynayak4788 1 year ago

    Hi Ravi, I have a .txt file delimited by multiple spaces, e.g. accountID Acctnbm acctadd branch and so on. Can you please suggest an approach? I have almost 76 columns with multiple consecutive delimiters.
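
A minimal sketch of one possible approach to the question above, written in plain Python so it runs without a Spark session: treat any run of whitespace as a single delimiter. The sample line and the `split_on_whitespace` helper are hypothetical. Since `pyspark.sql.functions.split` also takes a regular expression, the same `\s+` pattern should carry over to a DataFrame column, assuming no field contains spaces itself.

```python
import re


def split_on_whitespace(line):
    """Split a line on runs of spaces or tabs, ignoring leading/trailing whitespace."""
    return re.split(r"\s+", line.strip())


# Hypothetical record with uneven spacing between fields.
sample = "A1001   12345  Hyderabad      BR01"
fields = split_on_whitespace(sample)
print(fields)

# PySpark equivalent for a string column named "value" (not run here):
#   from pyspark.sql.functions import col, split, trim
#   df = df.withColumn("parts", split(trim(col("value")), r"\s+"))
```

Once the column is an array, the individual fields can be pulled out by index in a loop, so 76 columns do not have to be written by hand.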

  • @JustForFun-oy8fu
    @JustForFun-oy8fu 2 years ago

    Hi Ravi, thanks. I have one doubt: how can we generalize the above logic?
    If we have a large number of columns after splitting the data, we obviously can't do it manually.
    What could our approach be in that case?
    Thanks,
    Anonymous

    • @DanishAnsari-hw7so
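      @DanishAnsari-hw7so 1 year ago

      # Case 1: the number of columns is known
      num_cols = 4
      # Create one column per element of the split array
      for i in range(num_cols):
          df_multi = df_multi.withColumn("sub" + str(i), df_multi["marks_split"][i])
      # Drop the original and intermediate columns
      df_1 = df_multi.drop("marks", "marks_split")
      display(df_1)  # display() is a Databricks notebook helper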

    • @DanishAnsari-hw7so
      @DanishAnsari-hw7so 1 year ago

      # Case 2: the number of columns is not known
      from pyspark.sql.functions import max as spark_max, size
      # Find the largest array length across all rows
      df_multi = df_multi.withColumn("marks_size", size("marks_split"))
      max_size = df_multi.select(spark_max("marks_size")).collect()[0][0]
      # Rows with fewer elements get null in the extra columns (with ANSI mode off)
      for j in range(max_size):
          df_multi = df_multi.withColumn("subject" + str(j), df_multi["marks_split"][j])
      df_2 = df_multi.drop("marks", "marks_split", "marks_size")
      display(df_2)  # display() is a Databricks notebook helper

  • @V-Barah
    @V-Barah 1 year ago

    This looks simple in the example, but in real projects we can't call withColumn for each column if there are 200-300 columns.
    Is there any other way?

    • @DanishAnsari-hw7so
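      @DanishAnsari-hw7so 1 year ago +1

      # Case 1: the number of columns is known
      num_cols = 4
      # Create one column per element of the split array
      for i in range(num_cols):
          df_multi = df_multi.withColumn("sub" + str(i), df_multi["marks_split"][i])
      # Drop the original and intermediate columns
      df_1 = df_multi.drop("marks", "marks_split")
      display(df_1)  # display() is a Databricks notebook helper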

    • @DanishAnsari-hw7so
      @DanishAnsari-hw7so 1 year ago +1

      # Case 2: the number of columns is not known
      from pyspark.sql.functions import max as spark_max, size
      # Find the largest array length across all rows
      df_multi = df_multi.withColumn("marks_size", size("marks_split"))
      max_size = df_multi.select(spark_max("marks_size")).collect()[0][0]
      # Rows with fewer elements get null in the extra columns (with ANSI mode off)
      for j in range(max_size):
          df_multi = df_multi.withColumn("subject" + str(j), df_multi["marks_split"][j])
      df_2 = df_multi.drop("marks", "marks_split", "marks_size")
      display(df_2)  # display() is a Databricks notebook helper

  • @vikrammore-y4t
    @vikrammore-y4t 1 year ago

    Spark 3.x supports multi-character delimiters when reading CSV, e.g. .option("delimiter", "||")
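
When a double delimiter has to be split inside a column instead of at read time, it helps to remember that `pyspark.sql.functions.split` treats its pattern as a regular expression, so each `|` has to be escaped. The sketch below shows the same idea with Python's `re` module; the sample line is hypothetical, and Java and Python regex syntax agree for this simple pattern.

```python
import re

delimiter = "||"
# Escape each '|' so it is matched literally instead of meaning "or".
pattern = re.escape(delimiter)

line = "1001||Ram||28||Java"  # hypothetical record
fields = re.split(pattern, line)
print(fields)

# PySpark equivalent for a string column named "value" (not run here):
#   from pyspark.sql.functions import col, split
#   df = df.withColumn("parts", split(col("value"), r"\|\|"))
```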

  • @NaveenKumar-kb2fm
    @NaveenKumar-kb2fm 2 years ago

    Very well explained. I have a scenario with the schema (id, name, age, technology) and all the data in a single row, like (1001|Ram|28|Java|1002|Raj|24|Database|1004|Jam|28|DotNet|1005|Kesh|25|Java), coming in a single CSV file.
    Can we turn it into multiple rows that follow the schema, as a single table like the one below?
    id,name,age,technology
    1001|Ram|28|Java
    1002|Raj|24|Database
    1004|Jam|28|DotNet
    1005|Kesh|25|Java

    • @mohitmotwani9256
      @mohitmotwani9256 1 year ago

      This data needs to be divided into multiple lines.
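
One way to do that division, sketched in plain Python under the assumption that every record has exactly as many fields as the schema: split the single line on `|` and cut the values into chunks of four. The `chunk_record` helper is hypothetical; the schema and the record are the ones from the question.

```python
def chunk_record(record, width, sep="|"):
    """Split one delimited record into rows of `width` fields each."""
    values = record.split(sep)
    if len(values) % width != 0:
        raise ValueError("number of values is not a multiple of the schema width")
    return [values[i:i + width] for i in range(0, len(values), width)]


schema = ["id", "name", "age", "technology"]
record = "1001|Ram|28|Java|1002|Raj|24|Database|1004|Jam|28|DotNet|1005|Kesh|25|Java"

rows = chunk_record(record, len(schema))
for row in rows:
    print(dict(zip(schema, row)))
```

The resulting list of rows could then be handed to `spark.createDataFrame(rows, schema)`, which is reasonable for a small file; for large inputs the chunking would need to happen inside Spark instead.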

  • @dinsan4044
    @dinsan4044 1 year ago

    Hi,
    Could you please create a video on dynamically combining the 3 CSV data files below into one DataFrame?

    File name: Class_01.csv
    StudentID  Student Name  Gender  Subject B  Subject C  Subject D
    1          Balbinder     Male    91         56         65
    2          Sushma        Female  90         60         70
    3          Simon         Male    75         67         89
    4          Banita        Female  52         65         73
    5          Anita         Female  78         92         57

    File name: Class_02.csv
    StudentID  Student Name  Gender  Subject A  Subject B  Subject C  Subject E
    1          Richard       Male    50         55         64         66
    2          Sam           Male    44         67         84         72
    3          Rohan         Male    67         54         75         96
    4          Reshma        Female  64         83         46         78
    5          Kamal         Male    78         89         91         90

    File name: Class_03.csv
    StudentID  Student Name  Gender  Subject A  Subject D  Subject E
    1          Mohan         Male    70         39         45
    2          Sohan         Male    56         73         80
    3          shyam         Male    60         50         55
    4          Radha         Female  75         80         72
    5          Kirthi        Female  60         50         55
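
A rough sketch of the merging logic behind this request, in plain Python with the standard `csv` module: take the union of all headers and fill the subjects a class does not have with `None`. It assumes the files are comma-separated, uses only the first data row of each file from the comment, and the `combine_csv_texts` helper and `source_file` column are hypothetical. In PySpark (3.1 or later), `DataFrame.unionByName(other, allowMissingColumns=True)` performs the same kind of merge on DataFrames.

```python
import csv
import io


def combine_csv_texts(named_texts):
    """Merge CSV texts with different headers into one list of row dicts."""
    columns = []  # union of all headers, in first-seen order
    parsed = []   # (source name, list of row dicts) per file
    for name, text in named_texts.items():
        reader = csv.DictReader(io.StringIO(text))
        for col in reader.fieldnames or []:
            if col not in columns:
                columns.append(col)
        parsed.append((name, list(reader)))

    combined = []
    for name, rows in parsed:
        for row in rows:
            full = {col: row.get(col) for col in columns}  # None where missing
            full["source_file"] = name
            combined.append(full)
    return columns + ["source_file"], combined


files = {
    "Class_01.csv": "StudentID,Student Name,Gender,Subject B,Subject C,Subject D\n"
                    "1,Balbinder,Male,91,56,65\n",
    "Class_02.csv": "StudentID,Student Name,Gender,Subject A,Subject B,Subject C,Subject E\n"
                    "1,Richard,Male,50,55,64,66\n",
    "Class_03.csv": "StudentID,Student Name,Gender,Subject A,Subject D,Subject E\n"
                    "1,Mohan,Male,70,39,45\n",
}

columns, combined = combine_csv_texts(files)
print(columns)
for row in combined:
    print(row)
```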