5.2 Why Spark Dataset is Type Safe | Spark Interview questions | Spark Tutorial

  • Published on Nov 1, 2024

Comments • 67

  • @thirupathi333
    @thirupathi333 5 years ago +9

    I hope the following is the best example to describe type safety in Datasets.
    DataFrame:
    Compile-time type safety: the DataFrame API does not support compile-time safety, which limits you from manipulating data when the structure is not known. The following example compiles fine; however, you get a runtime exception when executing this code.
    Example:
    case class Person(name: String, age: Int)
    val dataframe = sqlContext.read.json("people.json")
    dataframe.filter("salary > 10000").show
    => throws Exception: cannot resolve 'salary' given input columns age, name
    Dataset:
    Type safety: the Dataset API provides compile-time safety, which was not available in DataFrames. In the example below, we can see how a Dataset can operate on domain objects with compiled lambda functions.
    Example:
    case class Person(name: String, age: Int)
    val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
    val personDF = sqlContext.createDataFrame(personRDD)
    val ds: Dataset[Person] = personDF.as[Person]
    ds.filter(p => p.age > 25)
    ds.filter(p => p.salary > 25)
    // error: value salary is not a member of Person // this is caught at compile time
    ds.rdd // returns RDD[Person]

    • @DataSavvy
      @DataSavvy 5 years ago

      Your understanding is right... You can give more examples, but this is good enough

  • @himanshuranjan2518
    @himanshuranjan2518 6 years ago +2

    Hi Harjeet, thanks for your effort in making videos that make learning simple. I would like you to make videos on the following:
    1) Spark submit and how it works
    2) Difference between job, task and stage in Spark
    3) How to check the status of Spark jobs
    4) Spark implicits
    These questions were asked in an interview.

    • @DataSavvy
      @DataSavvy 6 years ago

      Thanks Himanshu for the suggestions... I will create videos on these

  • @TE1gamingmadness
    @TE1gamingmadness 6 years ago +4

    You can achieve the same in a DF as you explained by using x.getAs[T](fieldName). What I understand from type safety is that the return type of a DataFrame operation is always sql.DataFrame, but with a Dataset you know the return type. E.g. in your case the return type will be Dataset[Person]. That's the advantage of Dataset over DataFrame.
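A minimal sketch of the contrast described above, in Spark Scala. This is a hedged illustration, not code from the video: it assumes a SparkSession named `spark`, and all other names are illustrative.

```scala
// Hypothetical sketch: requires a SparkSession `spark` on the classpath.
import org.apache.spark.sql.Dataset
import spark.implicits._

case class Person(name: String, age: Int)
val ds = Seq(Person("A", 10), Person("B", 40)).toDS()

// Row-based access: getAs compiles even if the column name or type is wrong;
// a mistake only surfaces at runtime.
val agesFromRows = ds.toDF().map(r => r.getAs[Int]("age"))

// Typed access: the compiler checks the field, and the result keeps its type.
val ages: Dataset[Int] = ds.map(_.age) // Dataset[Int], compile-checked
```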

  • @siddharthmehta8932
    @siddharthmehta8932 3 years ago

    Can we have videos with AWS Glue and PySpark in combination?

  • @surabhipandey6439
    @surabhipandey6439 4 years ago

    Hi, could you share some banking use cases in Spark? I know the theory, but it's difficult to explain projects in an interview as I have not worked on them.

  • @ampolusantosh5350
    @ampolusantosh5350 6 years ago +1

    Yes, I am also waiting for this question. Thank you

    • @DataSavvy
      @DataSavvy 6 years ago

      Thanks bro... :) Suggest more questions

  • @saiabi19
    @saiabi19 3 years ago

    We can achieve the same in a DF without casting, like this: df.filter(col("age") > 15), right? Why do we have to go for Dataset, please explain

  • @nandinivetal8569
    @nandinivetal8569 6 years ago +1

    But when I mention toDF("name","age") while converting an RDD to a DF, I can easily get the age column. When you write a case class for the Dataset, you have explicitly mentioned name and age in the case class; likewise, if we explicitly mention column names in the DataFrame, we can run this program without a type cast...

    • @DataSavvy
      @DataSavvy 6 years ago

      Can you share the sample code that you have tried?

  • @Aryanhot1
    @Aryanhot1 5 years ago

    Can you make a video on push down filters and tungsten?

  • @santhoshkotha6411
    @santhoshkotha6411 5 years ago

    Can you please zoom in on your screen while showing/executing the code in all the videos on the channel, or paste the same code in the description/comment section? It would be useful for our reference.

  • @thanoojbharateeyudu3786
    @thanoojbharateeyudu3786 3 years ago

    Hi bro,
    please help me understand how lineage and the DAG worked on RDDs alone, I mean when there were no DFs and DSs,
    and how lineage and the DAG work on DFs and DSs.

  • @suryapratap4411
    @suryapratap4411 6 years ago +2

    Great work, much appreciated.
    Could you please make more videos on the same topic, covering the benefits and encoders.

    • @DataSavvy
      @DataSavvy 6 years ago

      Sure, will create a video on encoders specifically

    • @DataSavvy
      @DataSavvy 6 years ago

      Please give more of your suggestions... And subscribe to the channel

  • @bhargavhr1891
    @bhargavhr1891 6 years ago +1

    When do you go for parallelize vs makeRDD? Is there any significant difference between them?

    • @DataSavvy
      @DataSavvy 6 years ago +1

      Will create a video

  • @viraajsivaraju2329
    @viraajsivaraju2329 5 years ago

    I encountered a question in an interview, please make a video on this: how is it that the flatMap function is able to produce multiple output rows?
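For the question above, a minimal plain-Scala sketch of why flatMap can emit multiple rows per input; Spark's RDD/Dataset flatMap follows the same contract.

```scala
// flatMap maps each element to a *collection* of results and concatenates them,
// so one input row can become zero, one, or many output rows.
val lines = Seq("a b", "c", "")
val words = lines.flatMap(_.split("\\s+").filter(_.nonEmpty))
// words == Seq("a", "b", "c"): the first line produced two rows, the empty line none
```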

  • @gauravpathak7017
    @gauravpathak7017 5 years ago

    Thanks Harjeet!! This is of great help. 👍

  • @rajnimehta5156
    @rajnimehta5156 5 years ago +1

    Thanks a lot for the great explanation. Could you please add some more videos on Scala & Kafka interview questions as well?

    • @DataSavvy
      @DataSavvy 5 years ago

      Thanks Rajni... Will make videos on Kafka and Scala for sure

  • @yogeshguptain
    @yogeshguptain 6 years ago +1

    Harjeet, thanks for the nice explanation! What are real-world scenarios where we are supposed to use DataFrame as compared to Dataset?

    • @DataSavvy
      @DataSavvy 6 years ago

      In the latest versions of Spark it is always advisable to use Dataset... Dataset has evolved from DataFrames and is a more mature data structure...

  • @SpiritOfIndiaaa
    @SpiritOfIndiaaa 6 years ago

    Thanks a lot, one more nice video. Could you please do a video on how to interactively develop Spark code using Eclipse, notebooks etc. ... the best possible way

    • @DataSavvy
      @DataSavvy 6 years ago

      Sure bro

    • @SpiritOfIndiaaa
      @SpiritOfIndiaaa 6 years ago

      Thank you... bro, I am stuck with this issue, do you have any idea how to fix it... stackoverflow.com/questions/53442130/why-only-one-core-is-taking-all-the-load-how-to-make-other-29-cores-to-take-lo?noredirect=1#comment93902947_53442130

  • @shyamsundarnaik3545
    @shyamsundarnaik3545 5 years ago

    Can you please explain how to add a new column to an existing table? I faced this in a Cognizant interview. I didn't get concise and clear knowledge from the internet.
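For the question above, the usual approach in Spark is `withColumn`. A hedged sketch, assuming a DataFrame `df` that already has an `age` column (both names are illustrative):

```scala
import org.apache.spark.sql.functions.{col, lit}

// Add a constant column and a derived column to an existing DataFrame.
val withCols = df
  .withColumn("country", lit("IN"))          // new literal column
  .withColumn("age_plus_1", col("age") + 1)  // column derived from an existing one
```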

  • @5669ashish
    @5669ashish 5 years ago

    Could you please make a video on Spark processing with XML files?

  • @manjunathkv6234
    @manjunathkv6234 6 years ago +1

    Hi... the same can be achieved in a Spark DataFrame as well using df.filter("age >= 30").show() ... a bit of confusion here..

    • @DataSavvy
      @DataSavvy 6 years ago

      Manjunath Reddy, in a DataFrame you are writing the expression as a string, which means you can end up giving a wrong column name... There is no check at compile time... You will realize the issue only at runtime... If you use a Dataset, your code will not compile...

    • @DataSavvy
      @DataSavvy 6 years ago

      One more interesting thing: to make Datasets look exactly like DataFrames, there are still some functions in Dataset which are type-unsafe... That's a topic for another video though :)
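The difference described in the replies above can be sketched as follows. A hedged illustration, assuming a SparkSession named `spark`; the deliberate typo `agee` is the point of the example.

```scala
// Hypothetical sketch: requires a SparkSession `spark`.
import spark.implicits._

case class Person(name: String, age: Int)
val ds = Seq(Person("A", 10), Person("B", 40)).toDS()

ds.toDF().filter("agee > 30")   // typo compiles; fails only at runtime: cannot resolve 'agee'
// ds.filter(p => p.agee > 30)  // typo does not compile: value agee is not a member of Person
ds.filter(p => p.age > 30)      // checked at compile time
```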

  • @dalwindersingh9282
    @dalwindersingh9282 5 years ago

    case class EmpC(name: String, age: Int, salary: Double); val data1 = Seq(EmpC("s1", 33, 11.1), EmpC("s2", 37, 135000.4), EmpC("s3", 33, 33.4)); sc.makeRDD(data1).filter(x => x.age == 37); Sir, do you think a normal RDD is type safe?

  • @suryapratap4411
    @suryapratap4411 6 years ago +1

    Please try to add some Scala questions as well if possible, thanks in advance

    • @DataSavvy
      @DataSavvy 6 years ago

      I plan to cover that in future videos... Currently the focus is on Spark

  • @bhargavhr1891
    @bhargavhr1891 6 years ago +3

    Harjeet, one suggestion: could you please reorder the Spark playlist according to the topics?

    • @DataSavvy
      @DataSavvy 6 years ago +1

      Yes... Let me know the list of topics... Will create this

    • @bhargavhr1891
      @bhargavhr1891 6 years ago +2

      Will create the order and post here soon

  • @vyke0169
    @vyke0169 6 years ago +1

    Understood what happens through the video. How should I explain this question in an interview?

    • @DataSavvy
      @DataSavvy 6 years ago

      You can take the same example during the interview and explain it

  • @narendraraut3629
    @narendraraut3629 4 years ago

    What is lazy evaluation in Spark?
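For the question above, a plain-Scala analogy: Spark transformations (map, filter) only build a plan, and nothing executes until an action (count, collect) forces it. A lazy collection view behaves the same way.

```scala
// A view records the map without applying it, like a Spark transformation.
var evaluated = 0
val lazyPipeline = (1 to 3).view.map { x => evaluated += 1; x * 2 }
// At this point evaluated == 0: nothing has run yet.
val result = lazyPipeline.sum // forcing the view, like a Spark action
// Now evaluated == 3 and result == 12.
```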

  • @simbhu1002
    @simbhu1002 5 years ago

    Please make a video on an end-to-end real-time Spark project. Please, ASAP.

  • @kuttikids6138
    @kuttikids6138 6 years ago +1

    During a select using ds I am getting "missing parameter type"

    • @DataSavvy
      @DataSavvy 6 years ago

      Can you share sample code... I will have a look into this

  • @pradeepkumars2385
    @pradeepkumars2385 6 years ago

    Please post the Python version of the code, and please share your GitHub projects

  • @jayanthnanduri5603
    @jayanthnanduri5603 5 years ago

    I faced this interview question and I was not able to answer it effectively:
    you need to sort a 10 GB file sitting on a machine which has only 2 GB of memory. How do you sort this, please explain?

    • @rajasekharreddy1637
      @rajasekharreddy1637 5 years ago

      There are no sorting techniques for that, just compress the file

    • @sachin5611
      @sachin5611 4 years ago

      Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on data of any size. The only problem is that it will be slower, similar to MapReduce jobs, as there is a lot of reading from and writing to disk.
      spark.apache.org/faq.html

    • @sarada_rout
      @sarada_rout 4 years ago

      Can you provide more info?
      What is the cluster size, how many nodes, what core size?
      From your question I understood we have one cluster with one node having one core and 2 GB of RAM.
      If this is the case then we can do a SORT but not an ORDER BY kind of sort.
      Doing SORT: split the file and sort the individual files. But this approach is not optimized if we have 1:1:2GB resources.
      Doing an ORDER BY kind of sort: not possible. HTH

    • @anandbarnwal4425
      @anandbarnwal4425 3 years ago

      This is possible using an external quick sort. Please look into the topic.

    • @anandbarnwal4425
      @anandbarnwal4425 3 years ago

      @@rajasekharreddy1637 It is possible using an external quick sort.
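The external-sort idea suggested in the replies above (sort chunks that fit in memory, spill each sorted run to disk, then merge the runs) can be sketched in plain Scala. A toy-scale sketch with no Spark, assuming an input file of one integer per line; all names and the chunk size are illustrative.

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable
import scala.io.Source

object ExternalSort {
  // External merge sort: only `chunkSize` values plus one head per run
  // are ever held in memory, so a 10 GB file can be sorted in 2 GB of RAM.
  def externalSort(input: File, output: File, chunkSize: Int): Unit = {
    // Phase 1: sort fixed-size chunks in memory, spill each as a sorted run.
    val runs = Source.fromFile(input).getLines()
      .grouped(chunkSize)
      .map { chunk =>
        val tmp = File.createTempFile("run", ".txt")
        val w = new PrintWriter(tmp)
        chunk.map(_.toLong).sorted.foreach(w.println)
        w.close()
        tmp
      }.toList

    // Phase 2: k-way merge, keeping one head value per run in a min-heap.
    case class Head(value: Long, rest: Iterator[Long])
    val heap =
      mutable.PriorityQueue.empty[Head](Ordering.by[Head, Long](_.value).reverse)
    runs.foreach { f =>
      val it = Source.fromFile(f).getLines().map(_.toLong)
      if (it.hasNext) heap.enqueue(Head(it.next(), it))
    }
    val out = new PrintWriter(output)
    while (heap.nonEmpty) {
      val Head(v, rest) = heap.dequeue()
      out.println(v) // smallest remaining value across all runs
      if (rest.hasNext) heap.enqueue(Head(rest.next(), rest))
    }
    out.close()
  }
}
```

This is also roughly what Spark does internally when a sort does not fit in memory: it spills sorted runs to disk and merges them, which is why the job still completes, just more slowly.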

  • @lakshmanneelam1877
    @lakshmanneelam1877 5 years ago

    How do you fit 5 GB of data in a 4 GB machine?

  • @mustafabohra2070
    @mustafabohra2070 2 years ago

    You are looking like Sundar Pichai in this video!

  • @karamveersolanki138
    @karamveersolanki138 6 years ago +1

    Hi there, your content is good but the video quality is very poor. It's very painful to watch!!

    • @DataSavvy
      @DataSavvy 6 years ago

      Apologies for the problem... What is the exact issue... low volume or something else?

    • @jeevithat6038
      @jeevithat6038 5 years ago +1

      @@DataSavvy Yes, your voice is very low. Please improve your audio quality.

    • @DataSavvy
      @DataSavvy 5 years ago

      You are right... I have improved the voice quality in new videos... Please keep sharing your suggestions... Join the WhatsApp group using chat.whatsapp.com/80xV49NcVHyJrypBhQwRWz

    • @meetprabhatsharma
      @meetprabhatsharma 5 years ago +1

      Yes, the video quality is very poor, not useful for me at least, just feedback

    • @DataSavvy
      @DataSavvy 5 years ago

      Hi Prabhat... Thanks for the feedback... YouTube is not allowing me to edit the existing video and improve the audio quality... I have changed this in new videos... The video and audio are better in the newly uploaded videos

  • @akashputti
    @akashputti 5 years ago

    Sorry, this video is not clear; can you please make a new one with more information?

  • @SanjayKumar-oc8zs
    @SanjayKumar-oc8zs 3 years ago

    The quality of the practical video is not good in terms of visualization. I am not sure if it is happening just for me or for everyone.