Thank you, Bryan, for the series of videos on Databricks and Spark. I like the way you elaborate on and explain the concepts, which makes them easy to understand for beginners like me who are trying to get into data engineering.
Thanks again, and keep up the good work.
YW. Thanks for watching.
Extremely useful teaching approach and content, thank you so much! I've found lessons 22 and 23 to be especially relevant at this stage, but I listened to all of the preceding videos, which filled in a lot of holes I had in my understanding. Great stuff!
Great! Thanks for the feedback.
Thanks for the series of videos. The best of all that can be found on YouTube.
You are the best. We were eagerly waiting for this and are looking forward to more. Thanks 😀
Hello Bryan, thank you for your video again. Could you advise on the differences between using Spark SQL and using SQL with PySpark? To me, it seems that both can work with a Spark dataframe using SQL clauses. Is the only difference that Spark SQL runs natively in Spark, while PySpark interacts with Spark Core via the dataframe API? I would very much appreciate it if you could explain this.
You're welcome. Actually, when you execute SQL from PySpark you are calling the Spark SQL API, so you get the same performance as calling Spark SQL directly. Also, Spark SQL and the PySpark dataframe API are closely intertwined, and both go through the Spark performance optimizer. Either way, they both will generally perform well.
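To make the point above concrete, here is a minimal sketch that builds the same aggregation twice: once as SQL text passed to `spark.sql()` and once with dataframe methods. It assumes `spark` is an active SparkSession (Databricks notebooks predefine one under that name). The function name, the `sales` view, and the sample rows are hypothetical, invented only for illustration.

```python
def same_query_two_ways(spark):
    """Build one aggregation via spark.sql() and via the dataframe API.

    `spark` is assumed to be an active SparkSession.
    """
    # Hypothetical sample data, made up for this illustration.
    rows = [("east", 10), ("west", 25), ("east", 5)]
    df = spark.createDataFrame(rows, ["region", "amount"])

    # Register the dataframe so SQL can refer to it by name.
    df.createOrReplaceTempView("sales")

    # Route 1: SQL text handed to the Spark SQL API; the result is a dataframe.
    via_sql = spark.sql(
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )

    # Route 2: the same aggregation expressed with dataframe methods.
    via_api = (
        df.groupBy("region")
        .sum("amount")
        .withColumnRenamed("sum(amount)", "total")
    )

    return via_sql, via_api
```

In a notebook you could call `via_sql, via_api = same_query_two_ways(spark)` and then run `via_sql.explain()` and `via_api.explain()` to compare the plans Spark produces for each route.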
Bryan, the part about "Databricks is smart enough ... restate the query for visualization" is not clear to me. Can you please explain what that means?
If you click on the visual in the notebook, a grabber arrow appears in the lower right corner and you can drag to resize the visual.
For a Python library coded visual, there are configuration settings that can reduce the size Python renders.
See stackabuse.com/change-figure-size-in-matplotlib/
Hi Bryan - thanks for the video. I don't really understand why one would want to use PySpark SQL vs. just using Spark SQL. Are there use cases where it makes more sense? It seems like it would be significantly easier to just write and run the Spark SQL code in a very intuitive and familiar way, and then convert the result to a dataframe. Am I missing something?
They are the same. There actually is no SQL console, and languages like R and Python call Spark SQL via the function sql() or spark.sql(), but the SQL query is passed to the Spark SQL API. It's a great way to get a dataframe back, and Spark SQL persistence of data allows you to share data between different types of notebook cells like R, Python, and SQL. Make sense?
Just wondering if I am supposed to know pandas before embarking on this? I don't recall a prior lesson on pandas, but Bryan, you make references to pandas on more than one occasion!
Yes. Pandas is the de facto Python library for data analysis. Most other data wrangling libraries try to follow the pandas API. Learning pandas would be an excellent investment.
@BryanCafferky Bryan, Bryan, Bryan, you sending me down another rabbit hole now?
JK - I appreciate your reply and insight and have defly been learning from your videos!
@anandmahadevanFromTrivandrum Check out this awesome free book by the author of pandas. wesmckinney.com/book/
The teacher I need. Thanks, Guru Bryan.
What is the equivalent of the EXISTS and WITH clauses in Spark SQL?
I think it is this: docs.databricks.com/sql/language-manual/functions/exists.html
And what about the WITH clause? That is extremely useful when simplifying complex queries.
@rydmerlin Yes. I think I cover WITH and common table expressions in the Spark SQL topics.
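For readers following this exchange, here is a small sketch of a query that combines a WITH clause (common table expression) and an EXISTS subquery. To keep it runnable without a cluster, it uses Python's built-in sqlite3 module as a stand-in, not Spark. The tables, column names, and rows are hypothetical. Spark SQL supports both WITH and EXISTS subqueries, so the same query text should work with `spark.sql()` once `customers` and `orders` exist as tables or temp views, but that has not been verified here.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Spark SQL.
conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows, made up for this illustration.
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 120.0), (1, 30.0), (2, 15.0);
""")

# The WITH clause names an intermediate result (orders over 100);
# EXISTS then keeps only customers with at least one such order.
query = """
WITH big_orders AS (
    SELECT customer_id, amount
    FROM orders
    WHERE amount > 100
)
SELECT c.name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM big_orders b WHERE b.customer_id = c.id
)
ORDER BY c.name
"""

names = [row[0] for row in conn.execute(query)]
print(names)  # ['Ann']
```

Note that the page linked above documents the `exists` function, which may not be the same thing as the EXISTS subquery predicate used in this sketch.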
The session on the Spark DataFrame writer is not clear.
Ok. What was not clear? Can you be specific please?