PySpark Tutorial: Spark SQL & DataFrame Basics

  • Published on Dec 25, 2024

Comments • 98

  • @GregHogg
    @GregHogg 1 year ago +1

    Take my courses at mlnow.ai/!

  • @coemgeincraobhach236
    @coemgeincraobhach236 3 years ago +15

    Thanks so much Greg, great job!
    Paying thousands for a master's at university, and people like you consistently pump out tutorials of way better quality. It's madness.

    • @GregHogg
      @GregHogg 3 years ago +1

      Yup that's how it goes! Haha I'm really glad to have helped 😃

  • @TheALahiri
    @TheALahiri 2 years ago +4

    Many thanks Greg for opening up a new frontier!
    I had no idea Google Colab was so generous and allowed installation and practicing of Spark.
    Your tutorial packs an astonishing amount of information, that too in an engaging way, in a very short timeframe.
    You are now my Guru for Spark.

    • @GregHogg
      @GregHogg 2 years ago

      Yup, it does! 😃

  • @darrienjohnson9053
    @darrienjohnson9053 1 year ago +9

    Don’t know if you’ll see this but I got into data engineering thru my company. They provided me the opportunity to become a software engineer, I was previously a cable installer/field tech. Although they provided this opportunity, I’ve still had to do much of my learning on my own. Your channel is amazing. Videos like these make all the difference. I really appreciate you making content where you’re walking thru the code. Once I get this under my belt I plan on creating content as well. Thank you. 🙏🏾

    • @GregHogg
      @GregHogg 1 year ago +3

      Oh that's so great to hear!! Thank you for the kind words and I wish you all the best Darrien!!

    • @some90sKid
      @some90sKid 1 year ago +1

      🙌🙌

  • @jacobburt5424
    @jacobburt5424 2 years ago +3

    I appreciate you and your videos so much. In my data science classes we're expected to teach ourselves PySpark, DataFrames, pandas and a bunch of other technologies, and you've made the experience much more manageable.

    • @GregHogg
      @GregHogg 2 years ago +1

      Well I'm super happy to hear that Jacob, thanks for the kind words!

  • @andersborum9267
    @andersborum9267 1 year ago

    These introductory videos are pure gold; thanks for sharing.

    • @GregHogg
      @GregHogg 1 year ago

      Thank you greatly for the kind words, and for your support! It means a lot. 😊

  • @josephjoestar995
    @josephjoestar995 2 years ago +1

    Thanks

    • @GregHogg
      @GregHogg 2 years ago +1

      Wow, that was very nice of you Joseph! Thank you :)

  • @barmalini
    @barmalini 2 years ago +1

    In just 17 minutes I've learnt so much. Thanks!

    • @GregHogg
      @GregHogg 2 years ago

      Perfect, really glad to hear it :)

  • @gauravraichandani7722
    @gauravraichandani7722 3 years ago +4

    This was really amazing. Waiting for more uploads on pyspark.

    • @GregHogg
      @GregHogg 3 years ago +1

      Awesome! Did you catch the other 15 minute long one?

    • @gauravraichandani7722
      @gauravraichandani7722 3 years ago

      Yep, I have. Followed along both the videos.

    • @GregHogg
      @GregHogg 3 years ago

      @gauravraichandani7722 okay awesome!

  • @prithvib
    @prithvib 2 years ago +1

    This video deserves 1M views

    • @GregHogg
      @GregHogg 2 years ago

      Haha that would definitely be preferred, thanks so much for the kind words I really appreciate it!

  • @nishantbahikar5639
    @nishantbahikar5639 2 years ago

    Bro you have explained it so well.. keep going

    • @GregHogg
      @GregHogg 2 years ago

      Thanks, great to hear!

  • @ashutoshsingh7529
    @ashutoshsingh7529 2 years ago

    Thank you so much. Pretty much covers everything you need to get started with PySpark. I wish you had included merging as well.

  • @tkadado
    @tkadado 5 months ago

    Excellent presentation and to the point !!!

  • @arsheyajain7055
    @arsheyajain7055 3 years ago +1

    I was waiting for this one!

    • @GregHogg
      @GregHogg 3 years ago +1

      I've wanted to make this for a long time since the PySpark RDD video did so well. Enjoy!

  • @GregThatcher
    @GregThatcher 6 months ago

    Thanks!

    • @GregHogg
      @GregHogg 6 months ago

      From Greg to Greg, thank you so much!

  • @faizalshebliTheAIGuy
    @faizalshebliTheAIGuy 3 years ago

    Great video. Simple yet effective to comprehend.

    • @GregHogg
      @GregHogg 3 years ago

      I'm very glad to hear that Faizal, and I greatly appreciate your kind words!

  • @m_asare
    @m_asare 5 months ago

    This is amazing!! Thanks Greg.

  • @ramanantoaninaharintsoanan7752
    @ramanantoaninaharintsoanan7752 2 years ago

    Thanks for sharing your knowledge. Great video.

    • @GregHogg
      @GregHogg 2 years ago

      You're very welcome and glad to hear it!

  • @saketsrivastava84
    @saketsrivastava84 2 years ago

    Amazing content...please prepare more like these.. 👍🏻

  • @noushinbehboudi5694
    @noushinbehboudi5694 3 years ago +1

    Awesome. Please keep up the good work. Please make more videos in spark. Thank you

    • @GregHogg
      @GregHogg 3 years ago +2

      Awesome, thank you!

    • @noushinbehboudi5694
      @noushinbehboudi5694 3 years ago +1

      Could you please suggest any good video tutorial material on PySpark for a newbie?

    • @GregHogg
      @GregHogg 3 years ago +1

      @noushinbehboudi5694 Isn't that this one?

    • @noushinbehboudi5694
      @noushinbehboudi5694 3 years ago +1

      @GregHogg I started PySpark with your videos. But I only found 2 videos on your channel. Are you going to upload more?

    • @GregHogg
      @GregHogg 3 years ago +2

      @noushinbehboudi5694 Makes sense. Not for a while unfortunately, I would recommend doing the Databricks specialization on Coursera :)

  • @Buxussempervirens
    @Buxussempervirens 1 year ago

    This is so amazing 😍😍

    • @GregHogg
      @GregHogg 1 year ago

      Thanks so much!!

  • @mohamedelkhaldi1096
    @mohamedelkhaldi1096 2 years ago

    Thank you so much!!! Always great content

    • @GregHogg
      @GregHogg 2 years ago

      You're super welcome. Really glad to hear that.

  • @ronaldfungss
    @ronaldfungss 2 years ago

    This is amazing! Thanks Greg : ]

    • @GregHogg
      @GregHogg 2 years ago

      Awesome! You're very welcome 😄

  • @AlexFosterAI
    @AlexFosterAI 1 month ago

    Thanks for the vids bro. Really curious what you think of lakesail's pysail. Built on Rust and supposedly 4x the speed and 90% less hardware cost than Spark. Pretty recent project, but I would love some perspective on it.

  • @AkshayKumar-vd5wn
    @AkshayKumar-vd5wn 1 year ago

    Thank you for the video.
    I have a problem:
    When I convert a column from string to int and then run printSchema, it still shows string and not int.
    Is there a better way to convert a string column to int in PySpark and make it a permanent change?
    I use data uploaded locally, i.e. from my computer.
    Does this happen only with locally uploaded files? Will the conversion go smoothly when operating on online databases, i.e. through servers?

  • @SaffTechJourney
    @SaffTechJourney 2 years ago

    You're awesome man!

  • @kishanbsh
    @kishanbsh 2 years ago +1

    Can you expand more on the sql bit along with joins?

    • @GregHogg
      @GregHogg 2 years ago

      Not in the comments (not properly, anyway); a join merges two tables together by matching common rows. In PySpark, a join is essentially the exact same thing as it is in normal SQL. You'd have to learn what a join is first :)

  • @garyhampton3739
    @garyhampton3739 4 months ago

    Hi, Greg - just come across your page - great tutorials : Getting the following error :
    106
    107 if not os.path.isfile(conn_info_file):
    --> 108 raise Exception("Java gateway process exited before sending its port number")
    109
    110 with open(conn_info_file, "rb") as info:
    Any advice ? Tried with the latest hadoop distro without success.
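
A common cause of "Java gateway process exited before sending its port number" is that PySpark cannot find a working Java installation. Whether that applies here is an assumption, since the comment does not show the environment. The `java_diagnostics` helper below is a hypothetical stdlib-only check, not part of PySpark:

```python
import os
import shutil


def java_diagnostics():
    """Collect the basic facts PySpark needs to launch its JVM gateway."""
    java_home = os.environ.get("JAVA_HOME")
    java_on_path = shutil.which("java")
    # On Windows the binary is java.exe, so this last check can be a false negative there.
    home_binary_exists = bool(java_home) and os.path.isfile(
        os.path.join(java_home, "bin", "java")
    )
    return {
        "JAVA_HOME": java_home,
        "java_on_path": java_on_path,
        "home_binary_exists": home_binary_exists,
    }


info = java_diagnostics()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `java_on_path` is `None` and `JAVA_HOME` is unset or points to a missing directory, installing a JDK and setting `JAVA_HOME` before creating the SparkSession is the usual next step.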

  • @byronexaporriton318
    @byronexaporriton318 2 years ago +1

    how can we create a python DataFrame from an already existing table?

    • @GregHogg
      @GregHogg 2 years ago

      You'll need to read in the file using one of Spark's read functions

  • @AlexMar-r
    @AlexMar-r 11 months ago

    is this the same as Apache spark ?

  • @matattz
    @matattz 1 year ago

    I would like to hear your opinion on Ponder. Considering that you can now work with Ponder similarly to how you work with Spark, do you believe it is still necessary to learn PySpark? I'm interested in your perspective on this matter, and if you are aware of any downsides or differences between Ponder and Spark.

  • @ranasana9681
    @ranasana9681 2 years ago +1

    Thank you so much, sir. I have a problem converting a spark.dataframe to a pandas.df because I have a large amount of data. What can I do?

    • @GregHogg
      @GregHogg 2 years ago

      Isn't there a .toPandas() function?

  • @SoneGiant
    @SoneGiant 2 years ago

    I downloaded the train.csv file to my laptop's local hard drive, and tried to read it with titanic_df = spark.read.csv("c:\UserFiles\My Data\train.csv", header=True, inferSchema=True), but got an error message. Do you know what I did wrong?
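
The error message is not shown, so this is a guess: in a normal Python string, backslashes start escape sequences. `\U` begins a Unicode escape and `\t` becomes a tab, so the path literal as typed is not a valid string. A stdlib-only sketch of the problem and three ways around it:

```python
# A raw string (r"...") keeps every backslash exactly as typed.
raw_path = r"c:\UserFiles\My Data\train.csv"

# Doubling each backslash gives the same string without the r prefix.
escaped = "c:\\UserFiles\\My Data\\train.csv"

# Forward slashes avoid escaping altogether.
forward = "c:/UserFiles/My Data/train.csv"

# In a normal string, "\t" silently becomes a tab character.
tab_trap = "My Data\train.csv"

# "\U" starts a Unicode escape, so the original literal does not even compile.
source = r'path = "c:\UserFiles\My Data\train.csv"'
try:
    compile(source, "<demo>", "exec")
    literal_is_valid = True
except SyntaxError:
    literal_is_valid = False

print(repr(raw_path), repr(tab_trap), literal_is_valid)
```

Any of the first three forms can then be passed to `spark.read.csv(...)` in place of the original literal.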

  • @gerardolamasrosales9777
    @gerardolamasrosales9777 2 years ago

    Hi, how do I create a database with PySpark?

  • @soumyadeeppattanaik526
    @soumyadeeppattanaik526 2 years ago

    Hey Hogg, when I try to get the sum of sales grouped by state from the DataFrame, it gives unnecessary decimal places. If the sum should be 150.0, it shows something like 150.856743. Can you explain this?

  • @DavidClaxton-s2u
    @DavidClaxton-s2u 1 year ago

    Small annoyance, but does anyone know why when I run something, like spark.sql('select * from Movies'), for example, it gives me the datatypes instead of displaying the actual table data?

    • @GregHogg
      @GregHogg 1 year ago

      Empty table maybe?

  • @DanielWeikert
    @DanielWeikert 3 years ago

    Are there more than the 2 videos on PySpark? Thanks and great work

    • @GregHogg
      @GregHogg 3 years ago

      That's all I've got, sorry!

  • @malanshinde6814
    @malanshinde6814 2 years ago

    Awesome

  • @lord_voldemort44
    @lord_voldemort44 1 month ago

    So it's like a worse SQL?

  • @adeyemiadeniran2871
    @adeyemiadeniran2871 2 years ago

    I am getting an error message ' E: Unable to locate package open-jdk-8-jdk-headless'. What could be the reason plz?

    • @GregHogg
      @GregHogg 2 years ago

      I think pip install PySpark is enough to install

  • @theodoruswidhi8192
    @theodoruswidhi8192 3 years ago

    Bro, what is the difference between .limit(3) and .show(3)? I tried it on Databricks using Python on Spark 3.0.1. The show command displayed the rows of the CSV DataFrame, but the limit command did not display them.

    • @GregHogg
      @GregHogg 3 years ago

      I don't know, sorry.

  • @EzraSchroeder
    @EzraSchroeder 2 years ago

    i need to learn "the rest" of pyspark sql **fast** (& hardly know any sql at all). suggestions??? what are some good resources???

    • @GregHogg
      @GregHogg 2 years ago

      Honestly, the documentation is great.

  • @KradianKrad
    @KradianKrad 2 years ago

    what is the difference between filter and where

    • @GregHogg
      @GregHogg 2 years ago +1

      Nothing, they are the same :)

  • @shankarsr1
    @shankarsr1 3 years ago

    If we can use spark.sql, then we don't need DataFrame functions like filter, agg, etc.?

    • @GregHogg
      @GregHogg 3 years ago +1

      It's essentially a different way of doing exactly the same thing. Sometimes I mix and match depending on how comfortable I am with what I'm trying to do

  • @rjayanth
    @rjayanth 2 years ago

    Thanks Greg, it was clean and straightforward. I like it a lot. Could you suggest a course to learn Spark? In our company we are trying to build a data lake on Hadoop using Hive. We have a lot of complex stored procedures on an RDBMS, and I will be migrating all the logic into the data lake. Spark would be a great tool to accomplish this. I would really appreciate it if you could suggest some online courses.

    • @GregHogg
      @GregHogg 2 years ago

      No problem! Check out some of the big data courses on Coursera.

  • @kunjuperath
    @kunjuperath 3 years ago

    For installing pyspark, why didn't you just do `pip install pyspark`?
    I'm trying to use the pandas api that was introduced in 3.2 with this method but even if I wget and unzip the spark 3.2 tar file I can't import the module.
    Cool tutorial though!

    • @GregHogg
      @GregHogg 3 years ago +1

      That's a great question. I actually didn't know it was this easy in Colab at the time. Thanks!

  • @mihirgaming716
    @mihirgaming716 3 years ago

    Can anyone give the command to replace null value in age column with average age for each gender ?

    • @GregHogg
      @GregHogg 3 years ago

      This is a great exercise 😁

  • @ajaynayak4697
    @ajaynayak4697 3 years ago

    just wow.

  • @91horse
    @91horse 3 years ago

    Awesome ! (..)

    • @GregHogg
      @GregHogg 3 years ago

      Thank you!!

  • @yashsvidixit7169
    @yashsvidixit7169 3 years ago

    4:54 funny voice crack LOL

    • @GregHogg
      @GregHogg 3 years ago

      You're right, that is pretty funny 😂

    • @yashsvidixit7169
      @yashsvidixit7169 3 years ago

      @GregHogg the video was pretty amazing. Thanks.

    • @GregHogg
      @GregHogg 3 years ago

      @yashsvidixit7169 Really glad to hear that, and thanks a bunch!