Thank you for sharing this video. Can you post how to configure a Spark cluster on Kubernetes and set up Zeppelin as the notebook service?
Sure Debasis. Will do it in one of the future videos.
Thanks for the info, quite liked it. I wish this channel would provide a video demo of running Spark on AWS EKS.
Sri Hari.. You must try GCP; they have managed Spark on Kubernetes. If you want to try it on your own, check the Kubernetes operator for Spark by GCP, which can run on any Kubernetes cluster.
Dependency management is really a pain point for PySpark running on older-generation Hadoop distributions like CDH or MapR. I personally went the conda environment route for an old PySpark 1.6 based platform stack. Kubernetes simplified a lot of these problems.
Yes Tanbir.. With more Python dependencies, the traditional conda approach is not ideal. I also have a detailed video with a demo on Spark on k8s.
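For anyone on a newer PySpark (3.1+), here is a minimal sketch of shipping a packed conda environment with the job, roughly following the conda-pack approach from the Spark docs; the environment name and archive path are hypothetical:

```python
import os
from pyspark.sql import SparkSession

# Assumes the environment was packed beforehand, e.g.:
#   conda pack -n pyspark_env -o pyspark_env.tar.gz   (hypothetical env name)

# Point the Python workers at the interpreter inside the unpacked archive
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    .appName("conda-packed-deps")
    # Ship the archive with the job; '#environment' is the unpack directory name.
    # On YARN the equivalent setting is spark.yarn.dist.archives.
    .config("spark.archives", "pyspark_env.tar.gz#environment")
    .getOrCreate()
)
```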
But what about data locality? I've been trying to migrate all our production jobs from YARN to Kubernetes, but I see a drastic change in job run time, and it is due to Spark not being aware of data locality. FYI, I'm running HDFS in the same k8s cluster as well. Is there any fix we can use so Spark handles the data locality issue?
Thipu.. Without data locality you will see around 2 to 8% performance degradation depending on the use case and data movement. Typically for long-running jobs this number is negligible. If you are running time-sensitive workloads, it is better to run Spark on YARN as of now; today there is no benefit to Kube for such workloads. In our case, for ML pipelines this performance degradation is fine, as we need extended compute and our YARN cluster had too many tenants.
Thanks for sharing great information, but I have one doubt.
Since you mention two independent clusters for data management and computation respectively: isn't it possible to create only one cluster with YARN, where we run Spark as an application and the database at the back, even though it won't be as great as a Kubernetes cluster?
Sumit.. Yes we can, but Spark alone cannot solve all use cases on the application side. YARN was supposedly created so anyone could implement on it, even applications, but with the simplicity of containers and Kubernetes no one used YARN except for data applications. You can still run both YARN and Kubernetes in the same cluster by allocating quotas on individual nodes, but that is admin overhead and again requires two different schedulers.
@@AIEngineeringLife
Thanks again.
I do agree with your suggestion that we can create a single cluster where some nodes run YARN for data management and the remaining nodes run Kubernetes for Spark applications.
But is there any way we can use Kubernetes containers for data management and the same Kubernetes cluster for Spark applications? I assume with this approach we can maintain homogeneity of the cluster while leveraging the gains of Kubernetes over YARN.
@@sumitchandak6131 .. Best is to move storage to some on-premise object storage solution and keep it outside of Kubernetes. Kube does have persistent volumes, but they are not really advisable for this.
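As a rough sketch of that pattern, reading from an S3-compatible on-premise object store (MinIO is just one example) over the s3a connector could look like this; the endpoint, credentials, bucket and package version are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("object-store-read")
    # hadoop-aws provides the s3a filesystem; version must match your Hadoop build
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1")
    # Endpoint and credentials of the on-premise S3-compatible store (hypothetical)
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.internal:9000")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    # Path-style access is usually required for non-AWS object stores
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/events/")  # hypothetical bucket/path
df.show(5)
```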
Hello :)
Really amazing videos, thank you. I have a few questions, and it would be great if you could help.
We are migrating towards Spark, and the primary use case is to perform data loads from and to SQL Server. I am using the JDBC connection with the partition column and min and max ranges to extract. But even after specifying 10 as the number of partitions, we only get 2 parallel SQLs running on SQL Server at a time. I have tried increasing the cores and the number of executors, but either it hangs/fails or we get only 2 parallel SQLs at most. Do we need a cluster setup for that? I am running a Windows installation in local mode.
It would be great if you could help.
Each table is 40 million plus rows. SQL Server to SQL Server is the use case.
Parallelism also depends on the number of executors, as well as what the database supports and the quota allocated. A distributed setup might help you increase it, or, depending on the number of cores you have, you can try increasing executors and see.
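For reference, a minimal sketch of a partitioned JDBC read from SQL Server; the connection details, table and bounds are hypothetical, and numPartitions is only an upper limit, since actual parallelism is capped by the available task slots (executors x cores):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sqlserver-extract")
    .master("local[8]")  # in local mode, parallelism is capped by the local cores
    .getOrCreate()
)

# Requires the SQL Server JDBC driver jar (mssql-jdbc) on the classpath
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")  # hypothetical
    .option("dbtable", "dbo.big_table")                               # hypothetical
    .option("user", "spark_user")
    .option("password", "secret")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    # These four options together produce parallel range queries:
    .option("partitionColumn", "id")   # numeric/date column used to split ranges
    .option("lowerBound", "1")
    .option("upperBound", "40000000")
    .option("numPartitions", "10")     # upper bound on concurrent JDBC connections
    .load()
)

print(df.rdd.getNumPartitions())
```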
@@AIEngineeringLife Thank you. Does it also depend on whether the scheduling option is set to FAIR? By default the scheduling pool in Spark is FIFO, so I am not sure if that has any impact. Also, I was struggling to get scheduling set to FAIR for the default pool groups. Any suggestions :)
@@rishuadams But I thought you are running on one node. Is the node used by others as well? If not, scheduling does not matter. But if you are multi-tenant, then FAIR is advisable.
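If you do end up multi-tenant, a minimal sketch of enabling FAIR scheduling; the allocation file path and pool name are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("fair-scheduling")
    .config("spark.scheduler.mode", "FAIR")
    # Optional: pools defined in a fairscheduler.xml (hypothetical path)
    .config("spark.scheduler.allocation.file", "/opt/spark/conf/fairscheduler.xml")
    .getOrCreate()
)

# Jobs submitted from this thread go into the named pool (hypothetical pool name)
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "etl_pool")
```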
How to run PySpark on Kubernetes?
Spark natively supports Kubernetes, or you can use Google's open-source Spark operator for Kubernetes and run it. You can install it via Helm charts, and the steps are in the git repo below:
github.com/GoogleCloudPlatform/spark-on-k8s-operator
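As a rough sketch of the native route (rather than the operator), a PySpark client-mode session pointed at a Kubernetes master could look like this; the API server address, image, namespace and executor count are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Client mode: the driver runs here and must be reachable from executor pods
    .master("k8s://https://kubernetes.example.com:6443")   # hypothetical API server
    .appName("pyspark-on-k8s")
    # Image with Spark and your Python dependencies baked in (hypothetical tag)
    .config("spark.kubernetes.container.image", "myregistry/pyspark:3.1.1")
    .config("spark.kubernetes.namespace", "spark")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

spark.range(1000).count()  # quick smoke test
```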
Thanks for sharing, sir. It really depends on the use case. It isn't viable to run Spark on Kubernetes for every data science use case.
What use case do you think would be a better fit for the Spark on Kubernetes scenario?
Now, I did not say it is for all data science use cases. It is ideal in case someone is already running DS on Spark, or even data engineering, and they want to decouple compute and storage: keep data in HDFS but expand compute in Kubernetes.
It is also a way to expand the cluster without worrying about data locality, by connecting to the existing kerberized HDFS from Kube.
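For the kerberized HDFS piece, a rough sketch of the Spark 3.x settings involved, applied from a PySpark session; the principal, keytab path, ConfigMap names and namenode address are all hypothetical:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("k8s-to-kerberized-hdfs")
    # Kerberos identity used by the job (hypothetical values)
    .config("spark.kerberos.principal", "spark-user@EXAMPLE.COM")
    .config("spark.kerberos.keytab", "/etc/security/keytabs/spark-user.keytab")
    # krb5.conf and Hadoop client configs mounted from ConfigMaps (hypothetical names)
    .config("spark.kubernetes.kerberos.krb5.configMapName", "krb5-conf")
    .config("spark.kubernetes.hadoop.configMapName", "hadoop-conf")
    .getOrCreate()
)

# Read directly from the existing HDFS cluster (hypothetical namenode address)
df = spark.read.parquet("hdfs://namenode.example.com:8020/data/events/")
```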
Demo???
Pedro.. I have still not done Spark on k8s, but you can check my Spark presentation where I talk about it - www.slideshare.net/srivatsan88/future-of-data-platform-in-cloud-native-world
And here is my complete free course on Spark - th-cam.com/play/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO.html
I will do a demo of Spark on k8s later in the year.