I hope the example below best describes type safety in Datasets.
DataFrame:
Compile-time type safety: The DataFrame API does not provide compile-time safety, which limits you when manipulating data whose structure is not known. The following example compiles fine; however, you will get a runtime exception when executing this code.
Example:
case class Person(name: String, age: Int)
val dataframe = sqlContext.read.json("people.json")
dataframe.filter("salary > 10000").show()
// => throws AnalysisException: cannot resolve 'salary' given input columns: age, name
Dataset:
Type safety: The Dataset API provides the compile-time safety that is not available with DataFrames. In the example below, we can see how a Dataset can operate on domain objects with compiled lambda functions.
Example:
case class Person(name: String, age: Int)
val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
val personDF = sqlContext.createDataFrame(personRDD)
val ds: Dataset[Person] = personDF.as[Person]
ds.filter(p => p.age > 25)
ds.filter(p => p.salary > 25)
// error: value salary is not a member of Person -- this is caught at compile time
ds.rdd // returns RDD[Person]
Your understanding is right... You could give more examples, but this is good enough.
Hi Harjeet, thanks for your effort in making videos that make learning simple. I would like you to make videos on the following: 1) spark-submit and how it works
2) Difference between job, task, and stage in Spark
3) How to check the status of Spark jobs
4) Spark implicits
These questions were asked in an interview.
Thanks, Hemanshu, for the suggestions... I will create videos on these.
You can achieve the same thing in a DataFrame, as you explained, by using x.getAs[T](fieldName). What I understand from type safety is that the return type of a DataFrame operation is always sql.DataFrame, but with a Dataset you know the return type; e.g., in your case the return type will be Dataset[Person]. That's the advantage of Dataset over DataFrame.
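A minimal sketch of that difference (assuming Spark 2.x+ in a spark-shell, where `spark` and its implicits are in scope, and the Person case class from the example above):

case class Person(name: String, age: Int)
import spark.implicits._

val df = Seq(Person("A", 10), Person("B", 20)).toDF() // static type: DataFrame (= Dataset[Row])
val ds = df.as[Person]                                // static type: Dataset[Person]

// DataFrame: fields come out of a generic Row; getAs[T] compiles for any
// field name, so a typo surfaces only at runtime.
val agesFromDF = df.map(row => row.getAs[Int]("age"))

// Dataset: fields are checked by the compiler; p.agee would not compile.
val agesFromDS = ds.map(p => p.age)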
Can we have videos with AWS Glue and PySpark in combination?
Hi, could you share some banking use cases in Spark? I know the theory, but it's difficult to explain projects in an interview since I have not worked on one.
Yes, I am also waiting for this question. Thank you.
Thanks bro... :) Suggest more questions
We can achieve the same thing in a DataFrame without casting, like this: df.filter(col("age") > 15). Right? So why do we have to go for Dataset? Please explain.
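A sketch of both sides of this question (spark-shell assumed, with the Person case class from the example above): the column-expression style does work on a DataFrame, but the column name is still checked only at runtime.

import org.apache.spark.sql.functions.col

val df = Seq(Person("A", 10), Person("B", 20)).toDF()
df.filter(col("age") > 15).show()      // works fine

// A typo still compiles and only fails when the job runs:
// df.filter(col("agee") > 15).show()  // AnalysisException at runtime

val ds = df.as[Person]
// ds.filter(p => p.agee > 15)         // the same typo on a Dataset fails at compile time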
But when I mention toDF("name","age") while converting an RDD to a DataFrame, I can easily get the age column. When you write a case class for the Dataset, you have explicitly mentioned name and age in the case class; likewise, if we mention the column names explicitly in the DataFrame, we can run this program without a type cast...
Can you share the sample code that you have tried?
Can you make a video on pushdown filters and Tungsten?
Can you please zoom in on your screen while showing/executing the code, for all the videos in the channel, or paste the same code in the description/comment section? It will be useful for our reference.
Hi bro,
Please help me understand how lineage and the DAG worked on RDDs alone, i.e., when there were no DataFrames and Datasets, and how lineage and the DAG work on DataFrames and Datasets.
Great work, appreciated.
Could you please make more videos on the same topic, covering the benefits and encoders?
Sure, will create a video on encoders specifically.
Please give more of your suggestions... And subscribe to the channel.
When do you go for parallelize vs. makeRDD? Is there any significant difference between these?
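For reference, a short spark-shell sketch (`sc` in scope): makeRDD is essentially an alias for parallelize; the one thing it adds is an overload that takes preferred host locations per partition.

val rdd1 = sc.parallelize(Seq(1, 2, 3), numSlices = 3)
val rdd2 = sc.makeRDD(Seq(1, 2, 3), numSlices = 3)   // behaves the same as parallelize

// The overload unique to makeRDD: each element paired with preferred hosts
// ("host1"/"host2" are placeholder hostnames, not real machines):
val located = sc.makeRDD(Seq((1, Seq("host1")), (2, Seq("host2"))))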
Will create a video
I encountered a question in an interview; please make a video on this: how is it that the flatMap function is able to produce multiple outputs (rows)?
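The key point is that flatMap's function returns a collection per input element, and Spark flattens all those collections into one result, so one input row can become zero, one, or many output rows. A small spark-shell sketch (`sc` in scope):

val lines = sc.parallelize(Seq("a b c", "d e"))

val words = lines.flatMap(line => line.split(" "))
words.collect()   // Array(a, b, c, d, e): 2 inputs became 5 outputs

// map, by contrast, is strictly one-in/one-out:
lines.map(line => line.split(" ")).count()   // 2, each element is a whole Array[String]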
Thanks Harjeet!! This is of great help. 👍
Thanks a lot for the great explanation. Could you please add some more videos on Scala and Kafka interview questions as well?
Thanks Rajni... Will make videos on Kafka and Scala for sure
Harjeet, thanks for the nice explanation! What are the real-world scenarios where we should use a DataFrame as compared to a Dataset?
In the latest versions of Spark, it is always advisable to use Dataset... Dataset has evolved from DataFrame and is the more mature data structure...
Thanks a lot, one more nice video. Could you please do one on how to interactively develop Spark code, using Eclipse, notebooks, etc.? What's the best possible way?
Sure bro
Thank you bro... I am stuck with this issue; do you have any idea how to fix it? stackoverflow.com/questions/53442130/why-only-one-core-is-taking-all-the-load-how-to-make-other-29-cores-to-take-lo?noredirect=1#comment93902947_53442130
Can you please explain how to add a new column to an existing table? I faced this in a Cognizant interview, and I didn't get concise and clear knowledge from the internet.
Apply the transformation AS "column name here", i.e., select an expression aliased to the new column name; a sketch is below.
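A sketch of the two usual ways to do this (assuming Spark 2.x+ with spark-shell implicits in scope; column names are illustrative):

import org.apache.spark.sql.functions.{col, lit}

val df = Seq(("A", 10), ("B", 20)).toDF("name", "age")

// 1. withColumn: add a column derived from existing ones, or a literal.
val withBonus = df.withColumn("bonus", col("age") * 100)
val withDept  = df.withColumn("dept", lit("engineering"))

// 2. select with an alias, i.e. the "transformation AS column name" idea:
val viaSelect = df.select(col("*"), (col("age") * 100).as("bonus"))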
Could you please make a video on processing XML files with Spark?
Hi... the same can be achieved with a Spark DataFrame as well, using df.filter("age >= 30").show(). A bit of confusion here...
Manjunath Reddy, in a DataFrame you are writing the expression as a string, which means you can end up giving a wrong column name... There is no check at compile time... You will realize the issue only at runtime... If you use a Dataset, your code will not compile...
One more interesting thing: to make Datasets look exactly like DataFrames, there are still some functions on Dataset which are type-unsafe... That's a topic for another video though :)
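For example (a sketch assuming the Person case class from the example above): the string-based methods on Dataset bypass the compiler check, exactly like on a DataFrame.

val ds = Seq(Person("A", 10), Person("B", 20)).toDS()

ds.filter(p => p.age > 15)   // typed: a wrong field name fails at compile time

ds.filter("age > 15")        // untyped: the string is parsed at runtime, so
// ds.filter("salary > 15")  // this would compile fine and throw
                             // AnalysisException only when the job runs

ds.select("name")            // also untyped, and it returns a DataFrame, not Dataset[Person]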
case class empC(name: String, age: Int, salary: Double)
val data1 = Seq(empC("s1", 33, 11.1), empC("s2", 37, 135000.4), empC("s3", 33, 33.4))
sc.makeRDD(data1).filter(x => x.age == 37)
Sir, do you think a normal RDD is type safe?
Please try to add some Scala questions as well if possible. Thanks in advance.
I plan to cover that in future videos... Currently the focus is on Spark.
Harjeet, one suggestion: could you please reorder the Spark playlist according to the topics?
Yes... Let me know the list of topics... Will create this.
Will create the order and post here soon
Understood what happens through the video. But how do I explain this question in an interview?
You can take the same example during the interview and explain it.
What is lazy evaluation in Spark?
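In short: transformations only build up an execution plan; nothing runs until an action is called. A minimal spark-shell sketch (`sc` in scope):

val nums = sc.parallelize(1 to 1000000)
val doubled = nums.map(_ * 2)       // nothing executes yet, only lineage is recorded
val big = doubled.filter(_ > 10)    // still nothing executes

big.count()                         // the action triggers the whole pipeline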
Please make a video on an end-to-end real-time Spark project. Please, ASAP.
During a select using ds, I am getting "missing parameter type".
Can you share the sample code? I will have a look into this.
Please post the Python version of the code, and also please share your GitHub projects.
I faced this interview question and I was not able to answer it effectively:
You need to sort a 10 GB file sitting on a machine which has only 2 GB of memory. How do you sort this? Please explain.
There is no sorting technique for this; just compress the file.
Spark's operators spill data to disk if it does not fit in memory, allowing Spark to run well on data of any size. The only problem is that it will be slower, similar to MapReduce jobs, as there is a lot of reading from and writing to disk.
spark.apache.org/faq.html
Can you provide more info?
What is the cluster size, how many nodes, and what core size?
From your question I understood we have one cluster with one node, having one core with 2 GB RAM.
If this is the case then we can do a SORT but not an ORDER BY kind of sort.
Doing a SORT: split the file and sort the individual files. But this approach is not optimized if we have a 1:1:2GB resource.
Doing an ORDER BY kind of sort: not possible. HTH
This is possible using external quicksort. Please look into the topic.
@@rajasekharreddy1637 It is possible using external quicksort.
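For anyone looking for the shape of the external sort the replies describe: sort the file in chunks that fit in memory, spill each sorted chunk to disk, then k-way merge the chunk files. A rough plain-Scala sketch (no Spark; the file paths and chunk size are illustrative assumptions, and it sorts lines lexicographically):

import java.io.{File, PrintWriter}
import scala.io.Source

def externalSort(input: String, output: String, chunkLines: Int = 1000000): Unit = {
  // Phase 1: read fixed-size chunks, sort each in memory, spill to temp files.
  val chunkFiles = Source.fromFile(input).getLines().grouped(chunkLines).map { chunk =>
    val f = File.createTempFile("chunk", ".txt")
    val w = new PrintWriter(f)
    chunk.sorted.foreach(line => w.println(line))
    w.close()
    f
  }.toList

  // Phase 2: k-way merge, keeping only one buffered line per chunk in memory.
  val out = new PrintWriter(output)
  var heads = chunkFiles.map(f => Source.fromFile(f).getLines().buffered).filter(_.hasNext)
  while (heads.nonEmpty) {
    val smallest = heads.minBy(_.head)  // the iterator whose next line is smallest
    out.println(smallest.next())
    heads = heads.filter(_.hasNext)
  }
  out.close()
  chunkFiles.foreach(_.delete())
}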
How do you fit 5 GB of data on a 4 GB machine?
You are looking like Sundar Pichai in this video!
Hi there, your content is good but the video quality is very poor. It's very painful to watch!!
Apologies for the problem... What is the exact issue? Low volume or something else?
@@DataSavvy Yes, your voice is very low. Please improve your audio quality.
You are right... I have improved the voice quality in new videos... Please keep sharing your suggestions... Join the WhatsApp group using chat.whatsapp.com/80xV49NcVHyJrypBhQwRWz
Yes, the video quality is very poor; not useful for me, at least. Just feedback.
Hi Prabhat... Thanks for the feedback... YouTube is not allowing me to edit the existing video and improve the audio quality... I have changed this in new videos... The video and audio are better in the new videos uploaded.
Sorry, this video is not clear. Can you please make a new one with more information?
The quality of the practical video is not good in terms of visualization. I am not sure whether it is happening just for me or for everyone.