Real and true; looking forward to seeing more videos.
Last approach was incredible. Did not know it was possible to subtract the columns to get the delta!!
Being a newbie to Spark, I find it very helpful, boss. Keep it up, brother. Looking forward to seeing more such videos from you.
Now you can use the unionByName() function as well.
df3 = df1.unionByName(df2, allowMissingColumns=True)
df3.show()
Excellent video Azarudeen, you helped me a lot! Thanks!
You are doing a great job and it's helping beginners a lot. Thanks.
Very clear and useful. Thank you very much
Thanks Azar for making such a nice scenario based question series with demo.
Thank you so much for the videos. They definitely increased my hope towards practical learning!!!
Thanks for your support 🙂
The tutorial is very lucid and clear
very nice approach and clear explanation! Thank you very much.
Awesome work man. Appreciated
Nice video.. informative.. ❤❤
Thanks for all your support
Thank you so much for these real time scenario videos brother
Eagerly waiting for more such videos.
All the best
Thanks for your support, please share with your friends as well :)
Very good explanation of each scenario .... Thanks a lot @Azarudeen Shahul... Keep it up
Thanks for your support.. 😊
Really, it's a nice help, friend.
Great work Azar. I used the automated technique for a data warehousing project.
Thanks for your support, share with your big data friends.
Hi Shahul,
Superb content. I have never seen such a clear coverage of all possible approaches on YouTube. Thanks a lot. Your videos are so helpful, not only for interviews but also for getting daily jobs done.
Superb bro 👌 👏
Great example and nice explanation.
Thanks for your support :-)
Good videos. Thank you.
One small note: in the "Automated Approach", if more than one column differs between the two DataFrames and the missing columns are not in alphabetical order, it won't work.
We need to sort the columns while performing the union operation, like below.
df_final = df_file1.select(sorted(df_file1.columns)).union(df_file2.select(sorted(df_file2.columns)))
Good video.. please keep us posted on new scenario-based questions.
Sure, more videos to come.
Great pyspark tutorial thanks
Boss, you are a beauty!!
Excellent. Thanks for sharing.
Can you make a video on reading data from multiple Parquet files with different schemas, using schema evolution?
Sure, you can expect the same soon 👍
Bro, please help me install Spark. Please share a doc of the steps; I have Windows 10.
Awesome Azarudeen, your videos are very helpful... Do you offer any online coaching?
How did the outer join work? We have the same columns in both DataFrames; which columns will it take?
I'm trying string (JSON-style) -> Parquet for merging DataFrames with different columns.
Thanks a lot bro!
Thanks for all your support 😊
@Azarudeen Shah - In the example the missing column is at the last for one of the dataframe. So with_column automatically adds at the end. What if the column is missing in middle of the table structure ? Thank you!!
Thanks for the question
Before merging, we can select the columns in the same order as the other DataFrame, like
df1.select(df2.columns)
Hope this helps you :)
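A minimal sketch of that reordering (assuming df1 and df2 end up with the same set of columns, just in a different order):

# Reorder df2's columns to match df1 before the positional union
df2_aligned = df2.select(df1.columns)
df_merged = df1.union(df2_aligned)
df_merged.show()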
@@AzarudeenShahul wow.. cool thanks Azar..
So can you help me fix it?
Can you check? I am ready to share my screen.
Dear, please help. I have learnt the theory part of Hadoop and Spark, but I'm not feeling confident because I have no good hands-on practice, since I have no environment.
Please mail me a screenshot of the error message and the steps you followed.. if needed we can check over screen sharing.
Hi Azarudeen. Thank you so much for this video. I have implemented the same question in spark scala but I am facing problem in implementing the automated approach in spark scala. Could you please help me on this and provide me solution for the same.
How can we get the code for all the scenarios in this playlist?
We have a GitHub link in the description of all recent videos. You can find notebooks for some of the scenario-based questions there.
For the same scenario, I added a monotonically_increasing_id column to both DataFrames and then did a left join.
Is that approach correct?
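For reference, a sketch of that approach with hypothetical df1/df2 names. One caveat worth knowing: monotonically_increasing_id values depend on partitioning, so the ids generated for two separate DataFrames are not guaranteed to line up row by row.

from pyspark.sql import functions as F

# Tag each DataFrame with a generated id, then join on it
df1_id = df1.withColumn("row_id", F.monotonically_increasing_id())
df2_id = df2.withColumn("row_id", F.monotonically_increasing_id())
joined = df1_id.join(df2_id, on="row_id", how="left")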
Can you help with merging two DataFrames with a date column and a bigint column? I am getting an error like "failed to merge".
When merging 5 files of different data formats, how will it work? Your answer will be helpful.
Can you also make some videos on Spark using Scala? All your videos are brilliant.
Hi.. your videos are really helpful... Could you please post a video on Spark incremental data load and merging that data with SCD Type 2 (using Scala)?
Nice!
Awesome, bro! If you can, please do a video on the same scenario using Scala.
Sure 👍
Hi Sir,
in the for loop we see df2 = df2.withColumn(i, lit("null"));
here we are able to update the DataFrame, but how is that possible if DataFrames are immutable?
DataFrames are immutable; that is exactly why we assign the result back to the variable. withColumn returns a new DataFrame rather than modifying the old one.
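A minimal sketch of that pattern, assuming diff_cols holds the columns missing from df2. Each withColumn call builds a brand-new DataFrame; reassigning to df2 only rebinds the variable name. Note that lit(None) produces a real NULL, whereas lit("null") as quoted above produces the literal string "null".

from pyspark.sql.functions import lit

for i in diff_cols:
    # withColumn returns a new DataFrame; df2 is rebound to it each iteration
    df2 = df2.withColumn(i, lit(None))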
Hi Azarudeen, it's Awanish. Your videos are really helpful.
Actually, I have installed Spark, but when I check on the command prompt by entering pyspark, it says the path is not specified,
even though I have made many corrections and checked the environment variables many times as well.
Hi bro, how do we achieve the same using Scala?
Can we do this using unionByName?
We can use unionByName in Scala.
How to compare two DataFrames, with matched records and unmatched record values?
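One possible sketch, reusing the subtract idea from the video (assuming df1 and df2 share an identical schema):

matched = df1.intersect(df2)        # distinct rows present in both DataFrames
only_in_df1 = df1.subtract(df2)     # rows in df1 but not in df2
only_in_df2 = df2.subtract(df1)     # rows in df2 but not in df1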
Can you please share the Scala code for the automated approach?
Thanks Azar for making real-time scenario-based videos.. How does the automated process work when both DataFrames have different column names?
Thanks for your support. Are you referring to the same data with different column names? If so, then the automated approach does not suit; try the schema method...
@@AzarudeenShahul Just that if the order of columns is not the same between the 2 DFs, then this will fail. In that case, we can use unionByName, or do df2 = df2.select(df1.columns) first and then apply union.
@@himanshujain2047 There is also an allowMissingColumns param in unionByName that does the same as this video.
Thank you, but in the automated approach, updating df2 in the for loop won't work in Java.
Whatever is changed inside the loop is not accessible outside of it... can you help me with how to handle it?
How can I achieve the same in Scala? I tried the following code but it's not working. Consider a and b as two DataFrames:
val diffCol = a.columns.diff(b.columns)
for (i <- diffCol) b = b.withColumn(i, lit(null))   // reassigning b fails if b is a val; use a var or foldLeft
Very nice explanation of the concepts. How can we achieve this in Scala? It would also be great if you explained some scenarios using Scala. Thank you.
How do I get your mail id?
Just to add a scenario: what if the columns are not in the same order in both DataFrames after the loop?
New columns arrive or some columns may disappear over time, but the merge/union should keep happening daily.
- We need to select the columns in the right order before doing the union (see the sketch below).
- We can use foldLeft instead of a loop (a more functional-programming way).
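A rough PySpark sketch of that daily merge, using functools.reduce as the Python analogue of Scala's foldLeft (df_history and df_today are hypothetical names):

from functools import reduce
from pyspark.sql.functions import lit

def align_and_union(left, right):
    # Fill in whichever columns are missing on either side
    for c in set(right.columns) - set(left.columns):
        left = left.withColumn(c, lit(None))
    for c in set(left.columns) - set(right.columns):
        right = right.withColumn(c, lit(None))
    # Union with a fixed column order so positions always line up
    cols = sorted(left.columns)
    return left.select(cols).union(right.select(cols))

df_merged = reduce(align_and_union, [df_history, df_today])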
From where is input1.csv fetched? Have you uploaded a CSV file there?
Yes Parnay, I created and uploaded the CSV file in my Databricks account.
Your methods will not work if both tables each have one extra column. For example:
TableA: name, age, salary
TableB: name,age,gender
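That symmetric case can still be handled. On Spark 3.1+, a one-line sketch (dfA/dfB as hypothetical DataFrames for TableA/TableB):

# Missing columns on either side (gender in TableA, salary in TableB) are filled with nulls
merged = dfA.unionByName(dfB, allowMissingColumns=True)

On older versions, the missing column has to be added to each side with lit(None) before the union, as in the foldLeft-style sketch above.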
Awesome content!! Please help me: I saved the output of df1.union(df2).show() to a new DataFrame as df, and then applying df.show() didn't work. Why?
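A likely cause, assuming the code was df = df1.union(df2).show(): show() only prints the rows and returns None, so df ends up holding None instead of a DataFrame. A sketch of the fix:

df = df1.union(df2)   # union returns the new DataFrame; keep this reference
df.show()             # show() prints and returns None, so never assign its result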
We can achieve this by using unionByName:
union_df = df1.unionByName(df2, allowMissingColumns=True)
Here we are discussing Spark below 3.1.
unionByName works when both DataFrames have the same columns, but in a different order. An optional parameter was also added in Spark 3.1 to allow unioning slightly different schemas.