Great Video Afaque - Wish we could get Rahul's input on this!
Thanks man, but who is Rahul?
Great explanation, please create one end-to-end project also.
Explained very well!
Great content!
Content is useful.
Please make more videos 😊
Appreciate it @HimanshuGupta-xq2td, thank you :)
Very informative video. Thanks for sharing!
Great explanation. Waiting for new videos.
Excellent content. Very Helpful.
Hey Afaque
Great tutorials.
You should consider doing a full end-to-end Spark project with a big volume of data so we can understand the challenges faced and how to tackle them.
Would be really helpful!
A full-fledged, in-depth project using Spark and the modern data stack is coming soon, stay tuned @mohitupadhayay1439 :)
Thanks for the videos... keep going
Kindly cover Apache Spark scenario-based questions also.
Great video! A question: what about the situation where you have to reuse several different DataFrames within one action? Let's say the whole ETL is structured so that you make transformations, and at the end there's only one action: dumping the data. So, let's say you first have DF1, which is reused 3 times for different transformations, and as a result you have DF_new_1, DF_new_2, DF_new_3. After that you union all three newly created DataFrames and you have DF_combined. And now you'd like to cache DF_combined because, similarly to the example above, it's reused in the next steps. So, in such a situation: 1) Should we add some additional action like count() after we create DF_combined to persist it, and right after persisting, unpersist DF1? Or how should we proceed when we have overlapping DataFrames in the whole flow, where we'd like to cache one first, later cache another, and unpersist the previous one? Any hints?
Hey @some_text, good question! I wouldn't recommend adding an action only for the purpose of making caching happen. You will naturally have an action in the script, e.g. a write to disk at the end of the computation of `DF_combined`, which will make the caching happen at the appropriate step. Secondly, if there are overlapping DataFrames in the whole flow, it's important to `unpersist` only once all usage of that DataFrame is completed; if you unpersist earlier, one or more of the downstream steps will end up re-computing that DataFrame all over again. If you're falling short of space, you can have a look at the storage-level options of `persist`, e.g. `MEMORY_AND_DISK`. Alternatively, checkpointing may also be an option if you want to trim the lineage and write the heavy computation result up to a certain point and then load it back. That way, the result that was going to be recomputed several times is now saved/checkpointed and will be loaded fresh.
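Not from the video, just a rough PySpark sketch of the flow described in the question above (the names `DF1`, `DF_new_1..3`, `DF_combined` come from the question; the source data and output path are made up for illustration):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-flow-sketch").getOrCreate()

# Hypothetical source standing in for the real DF1
DF1 = spark.range(1_000_000).withColumn("value", F.rand())
DF1.persist(StorageLevel.MEMORY_AND_DISK)           # reused 3 times below

# Three different transformations that all read DF1 (same schema so the union works)
DF_new_1 = DF1.filter(F.col("value") > 0.5)
DF_new_2 = DF1.filter(F.col("value") <= 0.5).withColumn("value", F.col("value") * 2)
DF_new_3 = DF1.withColumn("value", F.col("value") + 1)

DF_combined = DF_new_1.unionByName(DF_new_2).unionByName(DF_new_3)
DF_combined.persist(StorageLevel.MEMORY_AND_DISK)   # reused again in later steps

# The one natural action: the write runs the whole lineage and fills both caches
DF_combined.write.mode("overwrite").parquet("/tmp/df_combined")

DF1.unpersist()          # safe now: nothing downstream reads DF1 directly anymore
# ... later steps that reuse DF_combined ...
DF_combined.unpersist()  # release once the flow is done
```

If the lineage above `DF_combined` is very heavy, the checkpointing variant mentioned above would roughly be `spark.sparkContext.setCheckpointDir("/some/dir")` followed by `DF_combined = DF_combined.checkpoint()`, which materializes the result and cuts the lineage at that point.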
Commendable
Nice video. By the way, what device do you use to write on the screen for teaching, bro?
Thanks @reyazahmed4855, I use an iPad.
Can we persist any dataframe irrespective of the size of the data it has? Or are there any limitations in caching dataframes?
Thanks for sharing, small query:
Should we decide whether to cache based on the number of transformations being done on that DataFrame, or on whether we are doing more actions on (i.e. reusing) that DataFrame?
Thanks @gananjikumar5715, transformations are accumulated until an action is called, so it would be based on the number of actions. If you're performing several actions, it's better to cache the DataFrame first; otherwise Spark will re-run the whole DAG when executing each new action.
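A tiny illustrative sketch of that point (the read path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-before-actions").getOrCreate()

# Hypothetical expensive chain of transformations
df = spark.read.parquet("/data/events").filter("status = 'ok'")

df.cache()   # lazy: only marks the DataFrame for caching

df.count()                                            # action 1: runs the lineage and fills the cache
df.groupBy("country").count().show()                  # action 2: served from the cache, no recompute
df.write.mode("overwrite").parquet("/tmp/events_ok")  # action 3: also served from the cache

df.unpersist()   # free the memory once all actions are done
```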
If we do not explicitly unpersist, what would happen to the data? Would it be cleaned by the next GC cycle ? Also what is the best practice , explicitly unpersist or leave it to GC.
Hey @anirbansom6682, the data would be kept in memory until the Spark application ends or the context is stopped, or until it is evicted because Spark needs to free up memory to make room for other data. It may also be cleaned up during a later GC cycle. But this process is a little uncertain, as it depends completely on Spark's own memory management policies and the JVM's garbage collection process.
Leaving it to GC would be a passive approach over which you have less control, and it's much more of a black box unless you're well aware of its policies.
The best practice, however, is to explicitly unpersist cached DataFrames when they're no longer needed. This gives you more control over your application's memory usage and can help prevent memory issues in long-running Spark applications where different datasets are cached over time.
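For completeness, a small sketch of the explicit-cleanup pattern (the DataFrame here is just a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unpersist-sketch").getOrCreate()

df = spark.range(10).cache()   # placeholder cached DataFrame
df.count()                     # action that actually fills the cache

# Once the cached DataFrame is no longer needed, release it explicitly
df.unpersist()                  # default: asynchronous removal of the cached blocks
# df.unpersist(blocking=True)   # variant: wait until the blocks are actually freed

# In long-running applications you can also drop everything cached in the session
spark.catalog.clearCache()
```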
Want to understand better? Read this:
Spark Cache - LLM Response:
1. **Why Use Caching?**
- Caching can significantly improve performance by reusing persisted data instead of recomputing it
- It helps avoid redundant computations on the same dataset across multiple actions
2. **Lazy Evaluation and Caching**
- Apache Spark uses lazy evaluation, where transformations are not executed until an action is triggered
- Caching can help by materializing the result of a long sequence of transformations, avoiding recomputation
3. **Spark's Lineage Graph**
- Spark tracks the lineage of transformations using a Lineage Graph
- Caching lets Spark skip re-running the lineage up to the cached point, improving performance (unlike checkpointing, it does not actually truncate the lineage graph)
4. **Caching vs. No Caching**
- The demo shows a significant performance improvement when caching is used, as seen in the Spark UI
5. **Persist and Storage Levels**
- The `persist()` method is used for caching, with different storage levels available
- Storage levels like `MEMORY_ONLY`, `DISK_ONLY`, and combinations control memory/disk usage and replication
- Choose the appropriate storage level based on your requirements and cluster resources (see the sketch after this list)
6. **When to Cache?**
- Cache datasets that are reused multiple times, especially after a long sequence of transformations
- Cache intermediate datasets that are expensive to recompute
- Be mindful of cluster resources and cache judiciously
7. **Unpersist**
- Use `unpersist()` to remove cached data and free up resources when no longer needed
- Spark may automatically evict cached data if memory is needed
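As a rough illustration of point 5 above (the storage levels shown are standard PySpark constants; the DataFrame is a placeholder):

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("storage-levels-sketch").getOrCreate()
df = spark.range(1_000_000)   # placeholder DataFrame

df.cache()                                   # shorthand for persist() with the default DataFrame storage level
df.unpersist()

df.persist(StorageLevel.MEMORY_ONLY)         # fastest, but partitions that don't fit are recomputed
df.unpersist()

df.persist(StorageLevel.DISK_ONLY)           # survives memory pressure at the cost of disk I/O
df.unpersist()

df.persist(StorageLevel.MEMORY_AND_DISK_2)   # spills to disk and replicates blocks on 2 executors
df.unpersist()
```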
If you liked it, Upvote it.
Good summary :)