Thank you for your video, very useful
Thanks for all your support 😊
Hi.. Nice one.. Please make a video on Scala classes
Thanks a lot, nice.. How do we access broadcast variables inside UDFs?
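For anyone with the same question, a minimal sketch, assuming a SparkSession spark and a DataFrame df with a State_Code column: the UDF closes over the broadcast handle and reads it with .value on the executors.
import org.apache.spark.sql.functions.udf

// assumed lookup data; any serializable value can be broadcast
val states = Map("NY" -> "New York", "CA" -> "California")
val statesBc = spark.sparkContext.broadcast(states)

// the lambda captures the lightweight broadcast handle, not the full map;
// .value fetches the data lazily on each executor
val lookupName = udf((code: String) => statesBc.value.get(code))
df.withColumn("State_Name", lookupName(df("State_Code"))).show(false)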
Sometimes when creating a new cluster in Databricks it takes a long time; even after waiting a while, the cluster still is not created.
Tried terminating/deleting the cluster and creating a new one, same issue.
Bro, can broadcast be used only with a UDF? I tried it like below and it's not working, could you please have a look?
df.withColumn('City_Name', broad.value[State_Code]).show(5)
# NameError: name 'State_Code' is not defined
# (State_Code is a bare Python name here, not a column reference, and a plain
# dict lookup cannot run row by row on the executors without a UDF)
df.withColumn('City_Name', funcreg('State_Code')).show(5)
# (this works just fine)
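A UDF is the usual fix, but not the only one. A rough sketch of the same lookup without a UDF, written in Scala to match the snippet later in the thread; typedLit, the map contents, and the column names are my assumptions here:
import org.apache.spark.sql.functions.{col, typedLit}

val states = Map("NY" -> "New York", "CA" -> "California")
// typedLit embeds the small map in the plan as a literal map column,
// which can then be indexed by another column, no UDF needed
df.withColumn("City_Name", typedLit(states)(col("State_Code"))).show(5)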
Your videos are very good and help me a lot
Thanks for your support :)
Very well explained.
Thanks :)
What is the difference between destroy and unpersist? Do both remove the data from cache memory?
I hope you watched the video till the end.. unpersist removes the cached data from the executors, whereas destroy removes the data from the driver itself as well..
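In code, roughly, using the statesbc variable from the snippet further down as a stand-in:
statesbc.unpersist()  // drops the cached copies on the executors; the value is re-sent if used again
statesbc.destroy()    // removes it from the driver too; the broadcast variable cannot be used after this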
Informative video... can you please create one video on accumulators as well🙂
Thanks, made a video on accumulators. Hope it will be useful :)
@@AzarudeenShahul Thanks 😊
Why is broadcast useful in this scenario? I mean, we could directly add the state name in the input file.
I want to update the value of the broadcast variable after each iteration in the loop. Is it possible?
No. It is a read-only variable.
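A common workaround, as a rough sketch: since the broadcast variable itself cannot be mutated, drop the old one and broadcast a fresh value each iteration. loadLookup() here is a hypothetical helper that recomputes your lookup data:
var statesBc = spark.sparkContext.broadcast(loadLookup())  // loadLookup() is hypothetical
for (i <- 1 to 10) {
  // ... use statesBc.value in this iteration's job ...
  statesBc.unpersist()                                     // drop executor copies of the stale value
  statesBc = spark.sparkContext.broadcast(loadLookup())    // re-broadcast the refreshed data
}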
Broadcast variables never get fully copied over to executors' memory, right? What if my broadcast data is 1 GB and I have 10 executors, will that 1 GB get copied to all 10 executors? That means 1 GB is replicated into 10 GB, which does not seem like the right approach..
Broadcast variables are copied to executor memory only when required, and the entire data is not copied at once. Spark uses the TorrentBroadcast algorithm internally.
Instead of declaring and broadcasting a variable, if we use a case/when condition on that DataFrame to populate the full name, how different will the two approaches be?
e.g. withColumn("state_name", when(col("state") === "NY", "New York"))
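For comparison, a minimal sketch of that case/when route (column names assumed): the mapping lives in the query plan itself rather than in a broadcast variable, which reads fine for a handful of values but gets unwieldy for a big lookup:
import org.apache.spark.sql.functions.{col, when}

df.withColumn("state_name",
  when(col("state") === "NY", "New York")
    .when(col("state") === "CA", "California"))  // rows matching no condition come out as null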
How to populate a default value when there is no match?
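Two ways, sketched against the snippets in this thread ("Unknown" is just an assumed default): with the broadcast map, supply the fallback in the lookup itself; with case/when, chain .otherwise:
import org.apache.spark.sql.functions.{col, udf, when}

// broadcast-map route: getOrElse returns the default for unmatched codes
val lookupName = udf((code: String) => statesbc.value.getOrElse(code, "Unknown"))

// case/when route: .otherwise catches everything that did not match
when(col("state") === "NY", "New York").otherwise("Unknown")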
Hi Azarudeen... can you please share a similar video on broadcast join using IntelliJ sbt? Your videos are really helpful.
Got my answer... below is the code snippet. Had one question though: is it possible to broadcast a whole list of values (having multiple columns) from a file, or not?
val input_df = spark.read.option("header", "true").option("delimiter", "|").option("inferSchema", "true").csv("input/uspopulation.csv")
// lookup map of state code -> state name, shipped to the executors via broadcast
val states = Map("NY" -> "New York", "CA" -> "California", "FL" -> "Florida", "IL" -> "Illinois", "AZ" -> "Arizona", "TX" -> "Texas", "CO" -> "Colorado")
val statesbc = sc.broadcast(states)
// the UDF reads the broadcast value on the executor; Option covers codes missing from the map
val statesbcfunc = (x: String) => statesbc.value.get(x)
val statesbcudf = udf(statesbcfunc)
input_df.withColumn("state", statesbcudf(input_df("State_Code"))).show(false)
Yes, you can broadcast from a file: read the file as a DataFrame and broadcast the DataFrame.
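A sketch of that suggestion (the file name and its columns are assumptions): read the multi-column lookup into a DataFrame and use the broadcast() hint, so Spark ships the small side to every executor for a broadcast hash join:
import org.apache.spark.sql.functions.broadcast

// small lookup file, assumed to have State_Code plus any number of other columns
val statesDf = spark.read.option("header", "true").csv("input/states.csv")

// broadcast() marks statesDf as the small side of a broadcast join
val result = input_df.join(broadcast(statesDf), Seq("State_Code"), "left")
result.show(false)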
Bro, could you put up one video on reading an HBase table into a structured format..
Sure bro, I am setting up HBase locally; once done, will do one :) Thanks for your support
@AzarudeenShahul thank you bro. Also, please share a video on Kafka Spark streaming and dynamically handling nested JSON in a JSON file or Kafka topic.