Only few people have ability to teach in way that even novice can understand. Hats off to you.
Keep going !!!
Thank you for your encouraging words
can not agree more
Thank you for sharing your knowledge with us!
My pleasure! Thank you
Man you so good at what you do!
Thank you
I actually have a doubt: before running a PySpark model, if I cache the training data, would I still get the DAGScheduler large binary size warning?
You have very good way of explaining the concepts. Thank you!
Thank you Chetan
you are the real raja bro , super
Thank you bro
Your videos are making wonders!!
Thank you
your videos are the best
Good 👍
Thank you! Cheers!
Nice content sir
Thanks!
This is the explanation. Thank you for sharing the knowledge, sir 👏
Thanks and welcome
Best teacher!!! Thank you sir 🙏🏻
Thank you Turan
Great explanation 🎉
Glad it was helpful! Keep watching
Knowledge session
Thanks Kamal
I found many videos on YouTube regarding cache and persist, but nobody explains it the way you did...
Thank you Rahul
Raja, I really appreciate your explanation :)
Glad to hear that! Thanks for your comment
But where and how do we define these? Can you please add a short demo?
You explained it so simply...
I hope I will be able to explain it to the interviewer the same way you did 😅
Thank you! All the best!
Hi Raja. I have one doubt.
Cache: it stores the data in memory. Does that mean on-heap memory?
Persist: does it store the data in both on-heap and off-heap memory?
Is that correct?
Yes, that's correct. Cache always stores in memory, but persist has the flexibility of memory or disk.
@@rajasdataengineering7585 Memory here means on-heap, right? And disk means off-heap?
No. On-heap and off-heap are both memory; disk is different. I have already posted a video on on-heap vs off-heap. Please watch that video.
@@rajasdataengineering7585 thank you 😊
Hi Raja, you said that persist will use both memory and disk. Does memory here mean both on-heap and off-heap memory?
By default, it is cached in on-heap memory. But if off-heap memory is enabled and the JVM (on-heap) memory is full, off-heap memory would be used for caching the remaining partitions.
This is too good. Please keep going. Can you post a video on the small file processing problem in Spark?
Thanks 👍🏻
Sure will post a video for small file problem
Can you add the examples for creating persist in the description?
Hi, I was asked to prepare for Spark for my next role in the company where I am currently working. Is this learning series enough?
Hi, yes this is more than enough if you complete all these videos
I guess you have at least an M.Tech. + M.Ed. degrees.
Expert in Spark and Amazing Teacher.
Sir, you are great!
Thank you Pankaj! Hope you like the tutorial
@@rajasdataengineering7585, So far I have watched 9 out of the 22 videos in the "Databricks Performance Optimization" playlist. It is very detailed. Like it.
Glad you like it!
Great video, sir! One question: is disk storage the same as off-heap memory?
No, off-heap and disk are different. Off-heap memory is part of RAM. On-heap memory is controlled by the JVM, while off-heap memory is controlled by the OS itself.
Please make Video on Salting in Performance optimization
Sure will create a video on salting technique
A very good playlist that I have come across. Could you please provide a practical example? I was watching some videos on this, and I noticed that when we call df.cache(), the default storage level shows as MEMORY_AND_DISK_SER. There was no plain MEMORY_AND_DISK; it was always serialized. I would like to know the reason for this.
Hi Sir, we want a video on performance issues and their solutions while developing a notebook,
and on what issues come up.
Best explanation. But I have one question: is cache() a transformation or an action?
Cache is an action
@@rajasdataengineering7585 No, cache is not an action. It is a transformation; please do try it out.
Try to make videos under 10 mins sir
Sure, will do
How do we avoid duplicate rows while joining large datasets?
dropDuplicates() (alias drop_duplicates()) or distinct() can be used to remove duplicates.