Handling corrupted records in Spark | PySpark | Databricks

  • Published on Apr 10, 2023
  • In this video I have talked about reading files with bad records in Spark, and about the read modes Spark provides; a hedged code sketch follows the sample data below.
    Directly connect with me on:- topmate.io/manish_kumar25
    Data:-
    id,name,age,salary,address,nominee
    1,Manish,26,75000,bihar,nominee1
    2,Nikita,23,100000,uttarpradesh,nominee2
    3,Pritam,22,150000,Bangalore,India,nominee3
    4,Prantosh,17,200000,Kolkata,India,nominee4
    5,Vikash,31,300000,,nominee5
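    A minimal PySpark sketch of the three read modes covered in the video, assuming the data above is saved as /FileStore/tables/employee_file.csv (an illustrative path):
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("corrupt-records").getOrCreate()

    # PERMISSIVE (default): keeps every row, nulling fields it cannot parse
    # DROPMALFORMED: silently drops rows that do not fit the schema
    # FAILFAST: raises an exception on the first malformed row
    for mode in ["PERMISSIVE", "DROPMALFORMED", "FAILFAST"]:
        try:
            df = spark.read.format("csv") \
                .option("header", "true") \
                .option("inferSchema", "true") \
                .option("mode", mode) \
                .load("/FileStore/tables/employee_file.csv")  # illustrative path
            print(mode)
            df.show(truncate=False)
        except Exception as err:  # FAILFAST raises on rows 3 and 4 above
            print(mode, "failed:", err)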
    For more queries, reach out to me on my social media handles below.
    Follow me on LinkedIn:- / manish-kumar-373b86176
    Follow Me On Instagram:- / competitive_gyan1
    Follow me on Facebook:- / manish12340
    My Second Channel -- / @competitivegyan1
    Interview series Playlist:- • Interview Questions an...

Comments • 148

  • @manish_kumar_1
    @manish_kumar_1  1 year ago +2

    Directly connect with me on:- topmate.io/manish_kumar25

  • @maheshsadhanagiri3636
    @maheshsadhanagiri3636 4 months ago +4

    You're simply super. I am an Azure Solution Architect, but now I would like to start my journey into Data Engineering. I am very lucky to have such a valuable and appreciable learning opportunity from your channel. You are really good, explaining concepts in an understandable way with execution. I am even recommending your videos to my colleagues and friends.

  • @bluerays5384
    @bluerays5384 4 months ago +2

    Very well explained, sir. Thank you. Keep educating and sharing your knowledge and experience ❤❤❤

  • @maboodahmad7289
    @maboodahmad7289 1 month ago

    Very well explained. The way you first ask the questions and then explain them simply is great. Thank you so much.

  • @Thakur3427
    @Thakur3427 1 year ago +2

    I am preparing for data engineer interviews with your videos 😊
    Thank you
    You are doing a great job

  • @kunalk3830
    @kunalk3830 1 year ago +1

    Manish bhai, thank you so much for this. The doubts I had are clearing up on their own; I really appreciate these videos, bhai.

  • @debojyotihazra9571
    @debojyotihazra9571 7 months ago +1

    Hello Manish, I have been following your course for the last couple of days, and so far I have covered 17 sessions of Spark theory and 7 sessions of Spark practicals. Thank you for all your efforts. Before this I purchased multiple courses on Python and PySpark, but I lost interest in each of them as they were monotonous. I'm actively looking for a job change, interviews are in the pipeline, and I gained confidence in PySpark after watching your videos. Thank You ❤.

  • @deepakchaudhary4118
    @deepakchaudhary4118 1 year ago

    Very good, Manish Bhai. I am eagerly waiting for the upcoming video ❤

  • @itsme-fe1vh
    @itsme-fe1vh 4 months ago

    Manish Bhai, please keep explaining concepts the way you are doing now. It makes everything crystal clear for everyone and is helpful for interviews. Overall excellent; thanks for your work and the effort you put into the content.

  • @anirbanadhikary7997
    @anirbanadhikary7997 1 year ago +6

    The best thing about this series is the potential interview questions. I can bet you will not find these on any channel other than this one in all of YouTube.
    Also, we can make a separate document consisting of these questions only, which will be greatly beneficial during interview preparation.

  • @rishav144
    @rishav144 1 year ago +1

    Thanks Manish, very well explained. This Spark series is top notch.

  • @sairamguptha9988
    @sairamguptha9988 1 year ago

    Thanks so much, Manish, for doing this amazing job.

  • @Watson22j
    @Watson22j 1 year ago

    Bhaiya, this approach of teaching through interview questions is really great! 🙏

  • @gchanakya2979
    @gchanakya2979 1 year ago

    I am consistently following your videos, and they are helping me a ton.

  • @priyamehandi99
    @priyamehandi99 6 months ago

    Nice explanation. All the Spark sessions are helpful for me. Thank you, Manish sir.

  • @ankitdhurbey7093
    @ankitdhurbey7093 1 year ago +1

    Great interview questions ❤

  • @hubspotvalley580
    @hubspotvalley580 10 months ago

    Manish bhai, you are the best teacher.

  • @kushchakraborty8013
    @kushchakraborty8013 2 months ago

    "Go and type it yourself; I'm not teaching you just to hit Ctrl + Enter." Thank you, sir!!! ❤

  • @user-rh1hr5cc1r
    @user-rh1hr5cc1r 3 months ago

    Thank you, bro. Big hug 🙂 So many things to learn.

  • @deepakchaudhary4118
    @deepakchaudhary4118 1 year ago

    Thanks Manish; please upload the upcoming video ASAP.

  • @dishant_22
    @dishant_22 10 months ago

    Nice tutorial!! I was trying to implement the same, but found that the "badRecordsPath" option is a Databricks-specific feature, and I was executing locally on my machine.
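    For reference, a minimal sketch of that Databricks-only option (Databricks runtime assumed; paths illustrative):
    df = spark.read.format("csv") \
        .option("header", "true") \
        .option("badRecordsPath", "/FileStore/tables/bad_records") \
        .load("/FileStore/tables/employee_file.csv")
    # On open-source Spark this option is unavailable; the closest substitute is
    # PERMISSIVE mode with a _corrupt_record column, filtered and written out manually.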

  • @ADESHKUMAR-yz2el
    @ADESHKUMAR-yz2el 3 months ago

    Loved it, bhaiya. Thanks!

  • @HanuamnthReddy
    @HanuamnthReddy 6 months ago

    Awesome, bhai.

  • @HanuamnthReddy
    @HanuamnthReddy 6 months ago +1

    Manish, please share the bad-record handling doc 😊

  • @adarsharora6097
    @adarsharora6097 7 months ago

    Thanks Manish! I study late at night after finishing office work. I will make the transition to DE soon. Thanks for the video!

    • @pritambiswas1023
      @pritambiswas1023 5 months ago

      Same here; cheers to DE 🤞

  • @omkarm7865
    @omkarm7865 1 year ago

    Addicted to this channel.

  • @AMITKUMAR-np7el
    @AMITKUMAR-np7el 9 months ago

    You are the uncrowned king of data engineering... please keep making videos like this.

    • @manish_kumar_1
      @manish_kumar_1  9 months ago +2

      I know very few things, bhai. Even now I probably don't know even 1%. There is a lot to learn and do in DE.

  • @Rakesh-if2tx
    @Rakesh-if2tx 1 year ago

    Thank you bhai...

  • @omkarm7865
    @omkarm7865 1 year ago

    You are the best.

  • @user-jv3lv2dn8k
    @user-jv3lv2dn8k 7 months ago

    Hi Manish, if we have an input CSV file and have not defined any manual schema, then in that case, to show corrupted records, do we have to manually define a schema with a corrupted-record column, or how should we handle that?
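    A hedged sketch of the usual answer: for CSV, the corrupt-record column only shows up if you pass an explicit schema that contains a string field for it (names and path illustrative):
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    emp_schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("salary", IntegerType(), True),
        StructField("address", StringType(), True),
        StructField("nominee", StringType(), True),
        StructField("_corrupt_record", StringType(), True),  # holds the raw bad row
    ])

    df = spark.read.format("csv") \
        .option("header", "true") \
        .option("mode", "PERMISSIVE") \
        .schema(emp_schema) \
        .load("/FileStore/tables/employee_file.csv")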

  • @user-rh1hr5cc1r
    @user-rh1hr5cc1r 3 months ago +1

    12:13 After printing corrupt records, in my case only nominee2 and nominee3 came under the column, while in the video the whole details of IDs 3 and 4 came. Is there any catch here, as I followed the same approach?

  • @shubhamagarwal5932
    @shubhamagarwal5932 3 months ago

    When I tried the same on Databricks, I got the other 3 records instead of the bad records.

  • @SantoshKumar-yr2md
    @SantoshKumar-yr2md 4 months ago

    Please also provide the dataset, sir, so we can get good hands-on practice.

  • @saifrahman1388
    @saifrahman1388 5 months ago

    Hi sir, your videos are really interesting and the way you teach makes things easy. One doubt: I couldn't find the file link in the description. Could you please provide the link? May God fulfill all your dreams. Thank you ☺

  • @SubhamKumar-or8vc
    @SubhamKumar-or8vc 10 months ago +2

    Hi Manish,
    When I run only this
    df = spark.read.format("csv") \
        .option("inferSchema", "true") \
        .option("header", "true") \
        .option("mode", "FAILFAST") \
        .load("/FileStore/tables/data.csv")
    and then run df.count(), it shows 5 records in all 3 modes. But when I run df.show(), it gives output as per your explanation. What can be the possible reasons for this behavior of the count function?

    • @udittiwari8420
      @udittiwari8420 5 months ago +1

      The count() function counts the number of rows in the DataFrame. It does not need to parse every column, so it does not specifically check for corrupted or malformed rows while counting. Even if there are corrupted rows in the DataFrame, count() can still return the total number of physical rows.
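      A short sketch of the behavior described above (path illustrative; exact counts can vary by Spark version):
      df = spark.read.format("csv") \
          .option("header", "true") \
          .option("mode", "DROPMALFORMED") \
          .load("/FileStore/tables/data.csv")

      print(df.count())  # may count raw lines: no column is parsed, so nothing is dropped
      df.show()          # parsing happens here, so the malformed rows disappear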

  • @lazycool0298
    @lazycool0298 3 months ago

    Hello Manish,
    We have been working with flight_csv so far; when should we switch to this employees_csv? Please guide.

  • @raviyadav-dt1tb
    @raviyadav-dt1tb 5 months ago

    Are corrupted data and complex data the same or different in Spark?

  • @AAKASHSINGH-jt9hp
    @AAKASHSINGH-jt9hp 5 months ago

    Manish ji, your concept delivery is very good, but please provide a document for revision, because rewatching the videos again and again takes time.

  • @mohitupadhayay1439
    @mohitupadhayay1439 1 month ago

    Can we do this when reading XML and JSON files?
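    A brief sketch for JSON, which supports the same mode options as CSV (path illustrative); XML is read via the separate spark-xml package, which exposes a similar mode option:
    df = spark.read.format("json") \
        .option("mode", "PERMISSIVE") \
        .option("columnNameOfCorruptRecord", "_corrupt_record") \
        .load("/FileStore/tables/data.json")  # illustrative path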

  • @ComedyXRoad
    @ComedyXRoad 3 months ago

    At 12:00, instead of creating the whole schema, can we create the new column using the withColumn function? Or do we need to create an explicit schema to handle bad records? Could you answer?

  • @bishantkumar9359
    @bishantkumar9359 23 hours ago

    Hi Manish, I am getting the same records in all three modes. Can you please help here?

  • @udittiwari8420
    @udittiwari8420 5 months ago

    Hello sir, are all these questions asked of freshers as well?

  • @sukritisachan5773
    @sukritisachan5773 3 months ago

    I had a question: let's say we have a CSV file where some data contains commas, say the address itself has commas. Can we apply some kind of text wrapping in PySpark? The data can be like "Six Street,Ontario", so how can we read this, given that it is not a corrupted record?
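    A short sketch: fields containing commas are conventionally wrapped in double quotes, which Spark's CSV reader honors by default through its quote option (shown explicitly here; path illustrative):
    df = spark.read.format("csv") \
        .option("header", "true") \
        .option("quote", '"') \
        .load("/FileStore/tables/addresses.csv")
    # A line like: 1,"Six Street,Ontario",nominee1  parses into three fields, not four.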

  • @rakeshcheedara268
    @rakeshcheedara268 3 months ago

    Bhai, can you please provide English subtitles too, just to understand things in a better way?

  • @lakshyagupta5688
    @lakshyagupta5688 1 month ago

    At 15:32, why are only 3 records shown when the mode is permissive? Shouldn't the query fetch all the records?

  • @HeenaKhan-lk3dg
    @HeenaKhan-lk3dg 2 months ago

    Hi Manish,
    Where can we find the employee file?

  • @yugantshekhar782
    @yugantshekhar782 1 month ago

    Hi Manish, nice solution,
    but what if I have 500+ columns in my table? How can I do it then?

  • @vishalupadhyay4356
    @vishalupadhyay4356 3 months ago

    Where can we get this CSV file (the employee data)?

  • @user-gt3pi6ir5u
    @user-gt3pi6ir5u 3 months ago

    Where can I get the CSV file used?

  • @vaibhavshendre1
    @vaibhavshendre1 3 months ago

    How will Spark know that they are bad records? Based on what conditions does it decide a record is bad?

  • @amanjha5422
    @amanjha5422 1 year ago

    Please upload your next videos, bhaiya 😊

  • @lazycool0298
    @lazycool0298 3 months ago

    How do we delete a table we created in Spark?

  • @mdreyaz5824
    @mdreyaz5824 9 months ago

    How do we handle corrupted data in a Parquet file?

  • @abinashsatapathy1711
    @abinashsatapathy1711 11 months ago

    Manish ji, I have one doubt: how is the corrupt_record column taking data from the beginning (i.e., the id column) of the row?

    • @kavitathorat4451
      @kavitathorat4451 4 months ago

      The column name should be "_corrupt_record".

  • @muskangupta735
    @muskangupta735 2 months ago

    In permissive mode, why didn't it create a new column?

  • @apurvsharma413
    @apurvsharma413 5 months ago

    Hi Manish,
    I have tried saving the corrupted records to a file, but I am unable to use %fs ls. It shows the error: UsageError: Line magic function `%fs` not found. Can you help here?
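    A possible workaround: %fs is a Databricks notebook magic, so outside Databricks you can list the directory with plain Python instead (path illustrative):
    import os
    print(os.listdir("/tmp/bad_records"))  # wherever badRecordsPath pointed
    # On Databricks itself, the programmatic equivalent of %fs ls is:
    # dbutils.fs.ls("/FileStore/tables/bad_records")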

  • @aakashkhandalkar9172
    @aakashkhandalkar9172 7 months ago

    I converted the text file to CSV, but when I printed that same CSV in Databricks, a new sixth column was generated with null values, so I am not getting 3 different tables for the 3 different modes. What could the possible error be?

    • @udittiwari8420
      @udittiwari8420 5 months ago

      Most probably your CSV was not created as needed. Check it once: when opened in Notepad it should look like this:
      id,name,age,salary,address,nominee
      1,Manish,26,75000,bihar,nominee1
      2,Udit,25,100000,indore,nominee2
      3,jiya,15,1500000,lomri , India,nominee2
      4,swati,19,200000,kota,nominee4
      5,ravi,25,300000,indore ,India,nominee5
      6,tanu,25,120000,,nominee6
      There should not be any records inside " "; then try.

  • @ajaysinha5996
    @ajaysinha5996 11 months ago

    Hello Manish Bhai, you haven't provided the CSV file in the description... could you please provide it?

    • @manish_kumar_1
      @manish_kumar_1  11 months ago +1

      Copy the data and save it as a CSV.

  • @kumarshivam8077
    @kumarshivam8077 5 months ago

    There is no CSV file in the description.

  • @amlansharma5429
    @amlansharma5429 1 year ago

    @Manish Kumar, in option are you using columnNameOfCorruptRecord? Otherwise it doesn't display corrupt records. I don't know how my previous comment got deleted; I searched for this option on the internet and it worked for me.

    • @manish_kumar_1
      @manish_kumar_1  1 year ago

      No, I didn't use columnNameOfCorruptRecord. I created a manual schema and read the schema from there.

    • @amlansharma5429
      @amlansharma5429 1 year ago +1

      @@manish_kumar_1 After creating the manual schema, for me it appears as an extra column and shows the "India" values for those rows.

    • @mukulraj1545
      @mukulraj1545 8 months ago

      Same for me.
      Have you found any solution for this?

  • @anaggarwal2512
    @anaggarwal2512 10 months ago

    Why is there 1 job in PERMISSIVE mode but 3 jobs in DROPMALFORMED and FAILFAST?

  • @ayushtiwari104
    @ayushtiwari104 5 months ago

    Where is the Spark documentation source?

  • @deeksha6514
    @deeksha6514 4 months ago

    badRecordsPath is not getting created in local mode.

  • @poojajoshi871
    @poojajoshi871 1 year ago

    IllegalArgumentException: If 'badRecordsPath' is specified, 'mode' is not allowed to set. mode: PermissiveMode
    This is the error while storing the bad records.
    Please advise.

  • @yogeshpatil186
    @yogeshpatil186 6 months ago

    Sir, I created the schema, but in my case the corrupted column shows "nominee". Why does it show like this?

    • @tanmaytandel2425
      @tanmaytandel2425 1 month ago

      Same for me. Did you find a solution?

  • @ajourney179
    @ajourney179 1 year ago

    Hi Manish. Where are you learning Scala from?

  • @user-ed6wk8in5d
    @user-ed6wk8in5d 5 months ago

    Hi @manish,
    After creating the extra column to store corrupt data, instead of getting the whole row I'm getting the extra values present in other columns, like 'nominee3'. While watching the video I was confused about how the whole corrupt row gets stored automatically.

    • @kavitathorat4451
      @kavitathorat4451 4 months ago +1

      The column name should be "_corrupt_record".

    • @Anonymous-qg5cw
      @Anonymous-qg5cw 4 months ago

      @@kavitathorat4451 Thanks, got it.
      I also saw another option where we can load the corrupt record into any column we want.

  • @user-dl3ck6ym4r
    @user-dl3ck6ym4r 4 months ago

    How can we reset our Databricks password? I am unable to reset it; please suggest.

  • @shitalkurkure1402
    @shitalkurkure1402 1 year ago

    What if we have 100 columns and we want to print the bad records? In that case it is not possible to create the schema manually. Is there any other option to print bad records in that case?

    • @manish_kumar_1
      @manish_kumar_1  1 year ago

      No idea

    • @soumyaranjanrout2843
      @soumyaranjanrout2843 7 months ago

      @shitalkurkure1402 Printing bad records is the same as storing them and then viewing them by converting to a DataFrame. Suppose you have 100 columns and it's not possible to create the schema manually; then store the bad records at the required path and view them from there.

    • @shitalkurkure1402
      @shitalkurkure1402 6 months ago

      @@soumyaranjanrout2843 Hey, thank you 😊 But for that too we first need to write the schema manually.
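      A hedged sketch of one way around hand-typing 100 fields: read the header once, build the StructType in a loop, and append the corrupt-record column (names and path illustrative; every field is read as string and can be cast later):
      from pyspark.sql.types import StructType, StructField, StringType

      path = "/FileStore/tables/wide_file.csv"  # illustrative
      header_cols = spark.read.format("csv").option("header", "true").load(path).columns
      schema = StructType(
          [StructField(c, StringType(), True) for c in header_cols]
          + [StructField("_corrupt_record", StringType(), True)]
      )
      df = spark.read.format("csv") \
          .option("header", "true") \
          .option("mode", "PERMISSIVE") \
          .schema(schema) \
          .load(path)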

  • @celebritysview007
    @celebritysview007 4 months ago

    Sir, what happens when our DF is emp_details, with 9 rows and 6 columns in total, after adding a _corrupt_record column to my_schema? How does it work (or not)? Please explain.
    Id,name,age,sal,add,nomine,
    1,raju,17,15000,india,nom1
    2,mani,19,21000,usa,nom2
    3,Mona,21,31000,usa,nom3
    4,rani,32,4100,ind,nom4
    5,Mira,25,5000,mum,ind,nom5
    6,yoo,21,510,mum,mh,IND,nom6
    7,mahi,27,611,hyd,TS,Ind,nom7
    8,nani,31,711,hyd,TS,ind,nom1,nom2
    9,om,21,911,Pune,mh,ind,nom1,nom2,nom3

  • @sachindubey4315
    @sachindubey4315 11 months ago

    I am doing it in a Jupyter notebook. I have run the code, but the data is not getting stored in the new column; in the new column I am only getting null, nominee1, and nominee2. What error or gap could be the reason for this?

    • @manish_kumar_1
      @manish_kumar_1  11 months ago

      Have you defined your own schema?

    • @sachindubey4315
      @sachindubey4315 11 months ago

      @@manish_kumar_1 Yes, I had.

  • @killerricky54
    @killerricky54 10 months ago

    Hi Manish, I'm facing a problem reading the CSV file I created myself. After running the code it only shows the output, the mode function is not working, and the headers are not reflected either, probably due to an incorrect CSV file format. Could you share your CSV file so that I can download it? That would help a lot. Thanks.

    • @manish_kumar_1
      @manish_kumar_1  10 months ago

      The data is already there in the description.

    • @chandanareddy8158
      @chandanareddy8158 5 months ago

      Copy the data and save it from Notepad++ as file_name.csv.

  • @ShubhamSinghBelhari
    @ShubhamSinghBelhari 23 days ago

    Where is the CSV file?

  • @user-kv3yb1jn3f
    @user-kv3yb1jn3f 2 months ago

    How do you trace why a record is corrupt, or capture the error while parsing?

    • @manish_kumar_1
      @manish_kumar_1  2 months ago

      You will have to check your source to see why corrupted data is being pushed.

  • @shivakrishna1743
    @shivakrishna1743 1 year ago

    Guys, where can I find the employee file? I have been following the series from the start, and I'm not sure where it is. I can't find the data in the description.

  • @amanjha5422
    @amanjha5422 1 year ago

    Bhaiya, when will the next video come?

  • @varsha2906ify
    @varsha2906ify 10 months ago

    Can you please tell me how to add comments?

    • @manish_kumar_1
      @manish_kumar_1  10 months ago +1

      Use the # key for a single-line comment and Ctrl+/ for multi-line comments.

  • @yogesh9992008
    @yogesh9992008 1 year ago

    The CSV data is not found in the description.

  • @amitkhandelwal8030
    @amitkhandelwal8030 3 months ago

    Hi, where is the CSV file link? I am not able to see it in the description.

    • @manish_kumar_1
      @manish_kumar_1  3 months ago

      The data is there; save it as a CSV file. YouTube doesn't have an option to attach files.

  • @kavitharaju7944
    @kavitharaju7944 2 months ago

    Can you release your videos in English?

  • @sanketraut8462
    @sanketraut8462 1 year ago

    I tried, but it doesn't show the seventh column.

  • @mmohammedsadiq2483
    @mmohammedsadiq2483 9 months ago

    Where is the data (the CSV file)?

  • @aakashrathore2287
    @aakashrathore2287 1 year ago

    Where is the CSV?

    • @manish_kumar_1
      @manish_kumar_1  1 year ago

      I didn't provide it. I will update it soon.

  • @gangaaramtech
    @gangaaramtech 3 months ago

    Hi @manish_kumar_1, I tried with the data you provided, but I am not able to see the corrupted records in the different modes, and when I created the employee schema and tried to view the corrupted records, it showed all the records under corrupted records.

  • @dakshitamishra7501
    @dakshitamishra7501 1 year ago +1

    I am not able to print corrupted records; when I print records it forms a new column with the nominee value. I am not able to understand what I am doing wrong.
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    emp_schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("salary", IntegerType(), True),
        StructField("address", StringType(), True),
        StructField("nominee", StringType(), True),
        StructField("corrrecod", StringType(), True)
    ])
    employee_df = spark.read.format("csv")\
        .option("header", "true")\
        .option("inferschema", "true")\
        .option("mode", "PERMISSIVE")\
        .schema(emp_schema)\
        .load("/FileStore/tables/EmployeeDetails.csv")
    employee_df.show(truncate=False)

    • @manish_kumar_1
      @manish_kumar_1  1 year ago

      emp_schema = StructType([
          StructField("id", IntegerType(), True),
          StructField("name", StringType(), True),
          StructField("age", IntegerType(), True),
          StructField("salary", IntegerType(), True),
          StructField("address", StringType(), True),
          StructField("nominee", StringType(), True),
          StructField("_corrupt_record", StringType(), True)
      ])
      Use this schema; yours was not correct (the corrupt-record field must be named _corrupt_record).

    • @KotlaMuraliKrishna
      @KotlaMuraliKrishna 10 months ago

      @@manish_kumar_1 Hi Manish, can you please explain the error in his schema? I was getting the same issue, but after copy-pasting your schema it worked for me, and I am not sure why. Thanks in advance.

    • @KotlaMuraliKrishna
      @KotlaMuraliKrishna 10 months ago

      @@manish_kumar_1 Got it: we should use _corrupt_record as the StructField name to get the complete record.

    • @ashishkumarak1783
      @ashishkumarak1783 8 หลายเดือนก่อน

      @@KotlaMuraliKrishna @dakshitamishra7501 @Mdkaleem__
      Manish sir has used column name as _corrupt_record in under StructField .
      If you want any other name of column then We have to add the columnNameOfCorruptRecord option as column name which we have given in schema.
      Like this:-
      emp_schema= StructType(
      [
      StructField("id",IntegerType(),True),
      StructField("name",StringType(),True),
      StructField("age",IntegerType(),True),
      StructField("salary",IntegerType(),True),
      StructField("address",StringType(),True),
      StructField("nominee",StringType(),True),
      StructField("any_name_of_corrupt_record",StringType(),True)
      ]
      )
      employee_df2=spark.read.format("csv")\
      .option("header","true")\
      .option("inferschema","false")\
      .option("mode","PERMISSIVE")\
      .schema(emp_schema)\
      .option("columnNameOfCorruptRecord", "any_name_of_corrupt_record")\
      .load("/FileStore/tables/employee_file.csv")
      employee_df2.show()
      You can go through this site:-
      medium.com/@sasidharan-r/how-to-handle-corrupt-or-bad-record-in-apache-spark-custom-logic-pyspark-aws-430ddec9bb41#:~:text=to%20add%20the-,columnNameOfCorruptRecord,-option%20as%20column

    • @swaroopsonu
      @swaroopsonu 7 months ago

      @@KotlaMuraliKrishna Did you find any solution for this issue?
      I'm also getting the same output as yours.

  • @swetasoni2914
    @swetasoni2914 4 months ago

    Why am I getting a corrupt record here? Also, the corrupt record value is only the nominee.
    emp_df = spark.read.format("csv")\
        .option("header", "true")\
        .option("inferSchema", "true")\
        .schema(emp_schema)\
        .option("badRecordsPath", "/FileStore/tables/bad_records")\
        .load("/FileStore/tables/employee_file.csv")
    emp_df.show(truncate=False)
    (1) Spark Jobs
    emp_df:pyspark.sql.dataframe.DataFrame = [id: integer, name: string ... 5 more fields]
    +---+--------+---+------+---------+-------+--------------+
    |id |name    |age|salary|address  |nominee|corrupt_record|
    +---+--------+---+------+---------+-------+--------------+
    |3  |Pritam  |22 |150000|Bangalore|India  |nominee3      |
    |4  |Prantosh|17 |200000|Kolkata  |India  |nominee4      |
    +---+--------+---+------+---------+-------+--------------+

    • @Anonymous-qg5cw
      @Anonymous-qg5cw 4 months ago

      I have the same doubt: even if we add the new column, the extra record values are placed in it and the other rows get null.

    • @Anonymous-qg5cw
      @Anonymous-qg5cw 4 months ago

      There are 2 options:
      1) keep the column name '_corrupt_record'
      2) .load("/FileStore/tables/employee_file.csv", columnNameOfCorruptRecord='corrupt_record')

  • @sansha3881
    @sansha3881 1 year ago

    Unable to find the Spark link, @manish kumar.

    • @manish_kumar_1
      @manish_kumar_1  1 year ago +1

      Yes, I hadn't added it. I will add it soon.

  • @amlansharma5429
    @amlansharma5429 1 year ago

    id,name,age,salary,address,nominee
    1,Manish,26,75000,bihar,nominee1
    2,Nikita,23,100000,uttarpradesh,nominee2
    3,Pritam,22,150000,Bangalore,India,nominee3
    4,Prantosh,17,200000,Kolkata,India,nominee4
    5,Vikash,31,300000,,nominee5
    Employee.csv

  • @Mdkaleem__
    @Mdkaleem__ 1 year ago +1

    Instead of the bad records it shows me the correct ones...
    bad_records_df=spark.read.format("json").load("/FileStore/tables/bad_records/20230610T072018/bad_records/")
    bad_records_df.show(truncate = False)
    |dbfs:/FileStore/tables/employee_df.csv|org.apache.spark.SparkRuntimeException: [MALFORMED_CSV_RECORD] Malformed CSV record: 1,Manish,26,75000,bihar,nominee1 |1,Manish,26,75000,bihar,nominee1 |
    |dbfs:/FileStore/tables/employee_df.csv|org.apache.spark.SparkRuntimeException: [MALFORMED_CSV_RECORD] Malformed CSV record: 2,Nikita,23,100000,uttarpradesh,nominee2|2,Nikita,23,100000,uttarpradesh,nominee2|
    |dbfs:/FileStore/tables/employee_df.csv|org.apache.spark.SparkRuntimeException: [MALFORMED_CSV_RECORD] Malformed CSV record: 5,Vikash,31,300000,,nominee5 |5,Vikash,31,300000,,nominee5 |

    • @Marcopronto
      @Marcopronto 1 year ago

      While creating the schema, use "_corrupt_record" as the field name.
