Thank you for this! The flatten hierarchy behavior in the ADF Copy data activity was not working, and this is such an elegant alternative solution in Databricks! Just what I needed. I appreciate you sharing it.
Thank you Chris!
Very educative; thanks for making this video.
I'm glad you found it helpful! Keep watching
Thanks!! This is exactly what I needed, such excellent work!!!
Thank you!
Excellent piece of information. I have the reverse scenario: is there a way to unflatten the data back to its original structure? Any pointer, reference, or sample code would be really helpful.
Hey Raja, thanks for your superb real-time interview series. I cleared the interview process for KPMG Global Services as an Azure data engineer.
Wow, what great news! Thanks for sharing.
All the best!!
Hi Atharva, can we connect? I am also preparing for the interview and need some help.
Let me know where we can connect.
Brilliant and awesome tutorial. That's what we need!
Thank you
Excellent and very neat explanation. Thanks for sharing your wonderful knowledge ❤❤
Glad you liked it!
Thank you for this video, this is wonderful. It's going to help many people.
Great video, man. That is what I was looking for.
Hi, thank you for your valuable teaching.
It would be helpful if you could share the notebooks and other resources.
Thank you so much. This function saves me lots of time.
Glad to know that it helps you!
You are an angel brother..
Thank you
Very useful content. Thank you!
Thanks for your comment!
This is excellent. Thank you very much for providing this JSON parsing solution, which will help in many JSON scenarios. 😍
Glad it was helpful!
Thanks very much for the tutorial :). I have a query regarding reading JSON files.
I have an array of structs where each struct has a different structure/schema.
Based on a certain property value of the struct, I apply a filter to get that nested struct. However, when I display it using printSchema, it contains fields that do not belong to that object but are somehow being associated with it from the schemas of the other structs. How can I possibly fix this issue?
This is so useful; thanks for sharing.
Great videos. They saved me a lot of time. Thank you so much. :)
Thank you Bhanu Teja
Thank you, thank you, thank you! You saved me!
Glad to know it helped you👍🏻
Hi Raja sir, I used this function to flatten a deeply nested JSON, but it is throwing the error "No such struct field com in ACID....". When I went through the schema of the nested JSON, I saw the cookies column (which is of struct type) has one field 'com.sx.attr': string (nullable = true). Please help.
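(A note for anyone hitting the same error: it usually means a field name itself contains dots, such as com.sx.attr, so the dot-separated reference built inside the flatten function gets split at the wrong place. A minimal sketch of the usual workaround, using getField so the dots are treated as part of the name; the cookies column and field name are taken from the comment above:)
from pyspark.sql.functions import col
# getField treats "com.sx.attr" as one literal field name, not as nesting
df = df.withColumn("cookies_com_sx_attr", col("cookies").getField("com.sx.attr"))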
Hi Raja, thank you so much for your videos. Really helpful. Can you also make a video on CI/CD integration of notebooks and promoting notebooks to higher environments?
Hi Suresh, sure, I will create a video on CI/CD soon.
Sir, I have a situation where I need to define the schema without using the explode, array, or flatten functions.
However, when applying it, I get a message saying a string is expected but the schema is an array. Would you help me?
root
|-- text: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Id: long (nullable = true)
| | |-- subtext: string (nullable = true)
| | |-- subtext2: string (nullable = true)
| | |-- subtext3: string (nullable = true)
| | |-- subtext4: string (nullable = true)
schema = StructType(
[
StructField("text", ArrayType(
StructType(
[
StructField("id", IntegerType(), True),
StructField("subtext", StringType(), True),
StructField("subtext2", StringType(), True),
StructField("subtext3", StringType(), True),
])
))
]
)
df1 = df.withColumn("text", from_json(col("text"), schema))
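(A note on the error above: from_json only accepts a string column, and the text column here has already been parsed into an array by the JSON reader, hence "expecting a string". If the goal is just to enforce a known schema without explode/flatten, one sketch under that assumption is to supply the schema at read time instead; the file path is a placeholder, and the field names follow the printSchema output above:)
from pyspark.sql.types import StructType, StructField, ArrayType, LongType, StringType

schema = StructType([
    StructField("text", ArrayType(StructType([
        StructField("Id", LongType(), True),
        StructField("subtext", StringType(), True),
        StructField("subtext2", StringType(), True),
        StructField("subtext3", StringType(), True),
        StructField("subtext4", StringType(), True),
    ])), True)
])

# the schema is applied while reading, so no from_json call is needed
df = spark.read.schema(schema).json("/path/to/file.json")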
Usually I don't comment on videos. Awesome work, bro.
Thanks bro👍🏻
Thank you Raja, you saved my day!!
Welcome Alex
Hi Raja, thanks for the solution. I have implemented the same; however, I am getting "Ambiguous reference to fields", i.e., the column names are identical. I have enabled the case-sensitive config, but it is not working.
@tejasnaik6724 It is not repeating; there is a "_" separating it. Use this function to remove the prefix. Raja's surprising us every day.
df1 = flatten(df)
column = [i if len(i)
Thanks for sharing
Great video! Very informative.
Thank you!
Hi Raja, this is a wonderful video. I have a quick question: can the JSON be flattened into multiple tables?
Hi Sharath, you can first create one dataframe by flattening the JSON, and then split that dataframe into multiple tables based on your business requirement (see the sketch below).
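(A minimal sketch of that split, assuming a flattened dataframe with hypothetical customer and order columns; flatten() is the function from the video:)
flat_df = flatten(df)

# carve the wide, flattened dataframe into subject-area tables
customers_df = flat_df.select("customer_id", "customer_name").dropDuplicates()
orders_df = flat_df.select("customer_id", "order_id", "order_amount")

customers_df.write.mode("overwrite").saveAsTable("customers")
orders_df.write.mode("overwrite").saveAsTable("orders")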
Thanks for the video, and for sharing the very useful function.
Thanks Vipin. Glad it was helpful!
Wow, thanks for the solution.
Glad it helped
Thank you for the great video! 👌
Thank you for watching
What if I create a schema and then try to bind it to the JSON data, provided I already know the JSON structure?
Thank you so much! Very useful function 👌
Glad it helped!
Hey Raja, I have a doubt.
I ran the same code in Google Colab and it parsed my JSON, but when I run the same code in an Azure Synapse notebook, it shows only one null row. Can you tell me why this happens?
This is not working for columns with null values while writing to a folder. Could you help me? I mean, if one of the column values in the JSON is null, then while writing the flattened JSON using
Df_flatten.write.json(path), the column that is null is not written. I have been trying to fix this issue for a long time.
You can compare the schema before writing the data. If columns are missing because of null values, you can populate them before writing.
Let's say you want 5 columns in your JSON file, such as emp_id, name, age, dob, doj. But in your dataframe, age is null, so it holds only 4 columns: emp_id, name, dob, doj. In this case, you need to populate the missing column before writing.
You can use this workaround:
Step 1: Create a helper function to populate missing columns
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType
def add_missing_cols(missing_df):
    for column in missing_columns:
        missing_df = missing_df.withColumn(column, lit(None).cast(StringType()))
    return missing_df
Step 2: Compare the actual and expected schemas and identify the missing columns
exp_schema = ["emp_id", "name", "age", "doj", "dob"]
actual_schema = df.columns
missing_columns = set(exp_schema).difference(actual_schema)
Step 3: Populate the missing columns using the function from step 1
df_full_schema = add_missing_cols(df)
Now you can write the dataframe df_full_schema, which should contain the null columns as well.
How can one substitute posexplode_outer for explode_outer in the code, in order to get both the position and the column value? I tried df=df.selectExpr("*", posexplode_outer(col_name).alias("position",col_name)) but am getting the error "TypeError: Column is not iterable".
The column on which you apply positional explode should be of either array or map type. I think the one you applied it to is neither an array nor a map field.
@@rajasdataengineering7585 I resolved it by using select with col(col_name) instead of selectExpr. Here is the code: df = df.select(col("*"), posexplode_outer(col(col_name)).alias(col_name + "_pos", col_name + "_values")).drop(col_name)
Thanks Raja for the suggestion.
Welcome
Really great. Is there anything available for converting tabular data to a complex JSON file?
Yes Akhilesh, we can do that as well using PySpark functions. If needed, I will post a video.
Hi Raja, thanks for this wonderful video. After flattening the dataframe, when I select from it, the records are duplicated. Is there any way I can solve this?
Thanks Jatin for your comment. I think you can use dropDuplicates or distinct.
Great work
How do you perform pivot operations on the nested columns of a dataframe created from a JSON file?
You can flatten the nested columns first and then pivot.
@@rajasdataengineering7585 The data is coming through an API, so I have to perform the operations dynamically.
Yes, you can read the data through the API in the first step, flatten it in the second step, and apply the pivot function in the third step (see the sketch below).
@@rajasdataengineering7585 Drop a video on this; it's a common problem when it comes to the JSON format.
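(A minimal sketch of those three steps, assuming a hypothetical REST endpoint; the URL and column names are assumptions, and flatten() is the function from the video:)
import requests

# Step 1: read the data from the API (hypothetical endpoint)
resp = requests.get("https://api.example.com/orders")
df = spark.read.json(spark.sparkContext.parallelize([resp.text]))

# Step 2: flatten the nested columns
flat_df = flatten(df)

# Step 3: pivot on one of the flattened columns (column names are assumptions)
pivoted = flat_df.groupBy("customer_id").pivot("order_status").count()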
Hi Raja, thanks for the solution. How do I flatten a field of map datatype? Here {Fname, Mname, Lname} are map keys and {abc, pqr, xyz} are map values.
Ex: Name = {"Fname": "abc", "Mname": "pqr", "Lname": "xyz"}
Hi Satya, the function explode can be used to flatten map fields
@@rajasdataengineering7585 Thanks Raja. Using explode, I am able to flatten the fields. I wanted to display them as separate columns along with the other columns in the dataframe, like below.
Company Name.Fname Name.Mname Name.LName
c1 abc pqr xyz
c2 sss qqq ooo
Hi Satya, after exploding, you need to split each column out, like exploded_column.nested_column, using the withColumn function (see the sketch below).
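(A minimal sketch of pulling fixed map keys out as separate columns; the Name column and key names come from the comment above. getItem works directly on map columns, so explode isn't strictly needed when the keys are known in advance:)
from pyspark.sql.functions import col

result = (df
          .withColumn("Name_Fname", col("Name").getItem("Fname"))
          .withColumn("Name_Mname", col("Name").getItem("Mname"))
          .withColumn("Name_Lname", col("Name").getItem("Lname"))
          .drop("Name"))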
@@rajasdataengineering7585 Thanks Raja for your reply. I have one more question, wondering if you can help.
I have two dataframes containing name (df1) and address (df2) details respectively. I would like to write the name and address details to a JSON file in the below format:
{"Name" : "ABC"}
{"Address" : "Ad1"}
I used the below code to merge both dataframes, followed by a write command to write the merged output to a JSON file:
val Mergedf_1_2:DataFrame = df1.unionByName(df2,true)
Mergedf_1_2.coalesce(1).write.mode("Overwrite").format("json").save("C:/Users/Test.json")
But I am getting output without indentation. I'd appreciate it if you could share some inputs.
{"Name" : "ABC"}
{"Address" : "Ad1"}
@@rajasdataengineering7585 I have a similar requirement where I need to convert the values of an array to columns. I guess you would need a for loop using the withColumn function and then drop the exploded column. How do we implement this in code?
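(A minimal sketch of that loop, assuming an array column named values with a known or maximum length; the column name and length are assumptions:)
from pyspark.sql.functions import col

n = 3  # known/maximum array length (assumption)
for i in range(n):
    df = df.withColumn("values_" + str(i), col("values").getItem(i))
df = df.drop("values")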
Great video and function! It works great; however, I am receiving an error when I try to pass in two dataframes to be flattened. The first works as expected, but when running the function on the second, it returns the error "'str' object is not callable" and points to this line: expanded = [col(col_name+'.'+k).alias(col_name+'_'+k) for k in [ n.name for n in complex_fields[col_name]]]. If I redefine the function between the two dataframes, it works as expected.
I think it has something to do with the expanded variable. Any thoughts? Thanks again!
When we have schema drift in Azure Databricks, how do we get an email alert notification? Please let me know, sir.
Hi Raja, would you please share the notebook code for all your videos? Please suggest where we can find it.
Nice explanation, but the user-defined function is a little bit complex to understand.
Excellent, bro, you are great!
Thanks, glad you liked it!
Good explanation.
The code is not visible in the notebook.
Thanks. I fixed this issue in later videos.
Hi Raja, we are extracting data from a REST API. From two endpoints, two nested JSON files are generated, and my manager asked me to convert them to CSV and also merge them into a single CSV. Is it possible to merge two different schemas and generate a single CSV?
Hi Sravan, yes it is possible. We need to compare the schemas and produce null values for the missing columns.
You can watch this video on comparing schemas: th-cam.com/video/BtUFleFkXMM/w-d-xo.html
@@rajasdataengineering7585 Thank you very much
Welcome
DF = spark.read.json('/json file path, I mean /Filestore/tables')
DF.show()
I would like to know why you used the flatten function here.
If a JSON file has a complex nested structure (let's say 20+ levels), it won't get parsed properly. That is one case.
In another case, while using structured streaming for Kafka integration in Databricks, the entire JSON payload sits as binary data in the value column. Without flattening the data, you can't read it (see the sketch below).
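(A minimal sketch of the Kafka case: cast the binary value column to a string, parse it with from_json, and then the result can be flattened. The bootstrap server, topic, and payload schema are assumptions:)
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

payload_schema = StructType([
    StructField("id", StringType(), True),
    StructField("event", StringType(), True),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "host:9092")  # assumption
       .option("subscribe", "events")                   # assumption
       .load())

# value arrives as binary; cast to string, parse, then expand the struct
parsed = (raw
          .select(from_json(col("value").cast("string"), payload_schema).alias("data"))
          .select("data.*"))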
Sir, can you please explain each step of the program? It will help us greatly.
Sure, I will make another video to explain each step in more detail.
Suuuuuper !!!!!! 🥇
Thank you!
Really good video! Can you share the JSON?
I will do some tests on my side.
Merci, thanks!
@rajasdataengineering7585 Sir, if the video is already uploaded, please provide the link; I am not finding where it is. An explanation of the function you have written would be really helpful for me and others too. Please consider this one, sir.
Hi Srinu, I couldn't find time yet to create an explanation video. I will make one soon and post it.
@@rajasdataengineering7585 Thank you, sir. We will wait some more time, no issue.
Hi Raja, the video was very good, but I am getting the error
"MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value='42'})" while displaying the data. Can you please help me out?
Bro, while I am using this UDF I am getting the error: name 'ArrayType' is not defined.
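(That error usually just means the type classes weren't imported before defining the function; a minimal fix, assuming the function checks ArrayType, StructType, and MapType as in the video:)
from pyspark.sql.types import ArrayType, StructType, MapType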
Awesome videos 👍
Thank you
For me, it's failing due to: dict object has no attribute key.
How do I pick all the contents in the JSON?
I want to take training; is it possible for you to provide it?
This is giving many-to-many relations. Not sure if this is the expected behavior. Any thoughts on this, please?
Yes, it will give many-to-many relationships depending on the nesting level.
@@rajasdataengineering7585 Thanks so much for the response, Raja.
Is it possible to customize the script by giving some static field names to avoid the many-to-many relations? Or if I want to go with one-to-one, what's the best way to customize it?
Can you share the dataset?
Can you share the file link, please?
Can we get this in Python instead of PySpark?
Yes, we can do it in Python as well, but the data would be in the form of nested arrays instead of a tabular format.
Because I'm working on this task... I have tried many ways but nothing helped. I liked this approach, but the requirement is fully Python code. Is there any other way to achieve this, in a pandas dataframe or something like that?
Can you make a fully Python approach for this logic?
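(A minimal pure-Python sketch using pandas.json_normalize, assuming the JSON is a list of records in a local file; the path and separator are assumptions, and array-of-struct fields may still need a pandas explode afterwards:)
import json
import pandas as pd

with open("data.json") as f:  # hypothetical path
    records = json.load(f)

# flatten nested dicts into columns; nested keys are joined with "_"
df = pd.json_normalize(records, sep="_")
print(df.head())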
Hello sir, I'm Brazilian and I don't speak English fluently. Could you enable subtitles like the other videos?
Sure, will enable subtitles for all videos
@@rajasdataengineering7585 This is the only one that doesn't have subtitles active.
Thanks a lot!
You're welcome!
The video was good, but it would be better if you could also explain the function that you have written. It's a bit difficult to understand.
Sir,
kindly share the notebook.
Hi Sree, sure, will do
Hi Raja, is there any code for flattening XML? Please help me.
Hi Gayatri, I don't have any code to flatten XML, but I will try to create one when I get some time.
wow.. just wow
Thanks
Please upload the sample JSON file.
Thank you
You're welcome
Glad I found this playlist, Thank you www.youtube.com/@rajasdataengineering7585
Glad it helps! Keep watching