Hi Raja, we really liked your solution. Your daily video content has become part of our DNA now. I really appreciate you taking the time to make such good videos. I pray God gives you good health and wealth to keep making videos like this. Thanks again 🙏
Hi Ashok, thank you for the nice comment and your kind words.
Hope these videos help you gain knowledge in Spark and Databricks!
Thank you Raja, this video is very useful for all data engineers.
Glad you liked it!
Dear Raja, I wanted to express my gratitude for your immensely helpful videos. Our learning experience from your channel has been exceptional. However, I noticed that a few videos are missing, disrupting the series' continuity. I kindly request you to consider uploading the remaining videos in the correct order. Your efforts in accommodating this request would be greatly appreciated. Thank you for your dedication to providing valuable content.
Hi Anand, thanks for your nice comment.
Those missing videos are part of the Azure Synapse Analytics series, and you can find them in the respective playlist.
Very informative, thanks for sharing
Glad it was helpful!
Thank you Raja, for the information.
Always welcome! Keep watching
Hi Raja,
Why don't we use filters to exclude the range in the given example? We can add a new column with a sequential index and then filter out the required rows. Can you please let me know what kind of issues we may face if we go with this approach?
In order to add a new column, the data needs to be pulled into the Spark environment first. When data is ingested into Spark, it is split into partitions, so it's not possible to reliably identify the first few records of the file using this method.
Hello Raja,
But you can set the partition count to 1 before reading the CSV.
This way your data won't get partitioned, and you can add a row number column and apply the filter.
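For reference, here is a minimal sketch of that idea (the file path and the skipped range 5-8 are just assumptions for illustration); note that coalescing to a single partition removes parallelism, so it only suits smaller files:
from pyspark.sql import Window
from pyspark.sql.functions import row_number, monotonically_increasing_id, col

# Read the CSV, then force it into a single partition so the file order is preserved (data.csv is a made-up path)
df = spark.read.option("header", True).csv("data.csv").coalesce(1)

# Add a sequential row number and drop the unwanted range (rows 5 to 8 here)
w = Window.orderBy(monotonically_increasing_id())
df_skipped = (df.withColumn("row_num", row_number().over(w))
                .filter(~col("row_num").between(5, 8))
                .drop("row_num"))
df_skipped.display()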
Why are DLT pipelines used when we can create notebooks and schedule them using ADF or Workflows?
DLT and ADF orchestration serve totally different purposes. ADF or Workflows just schedule and trigger notebooks, whereas DLT manages the pipeline itself, handling table dependencies, data quality expectations and incremental processing declaratively.
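As a rough illustration only (made-up table names and a hypothetical source path, not code from any video), a DLT table is declared in a notebook and the pipeline resolves dependencies and quality checks on its own:
import dlt
from pyspark.sql.functions import col

# Bronze table: DLT discovers and runs this, instead of ADF calling a notebook
@dlt.table(comment="Raw orders loaded from cloud storage")
def orders_bronze():
    return spark.read.format("json").load("/mnt/raw/orders")  # hypothetical path

# Silver table: DLT tracks the dependency on orders_bronze and enforces the expectation
@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def orders_silver():
    return dlt.read("orders_bronze").filter(col("amount").isNotNull())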
@rajasdataengineering7585 No new videos for so long 🙁
Hello Raja, would it be possible to create a video lesson explaining how to set up multiple Spark nodes on a local network? It could be two machines, for example.
Hi Welder, thanks for your request. Sure, will create a video on this requirement
Suppose I have an Excel file with multiple small tables within the same sheet, and I want to pick out the data and load it properly into a DataFrame. Can this be done?
Yes, it can be done. We need to mix in a plain Python approach to separate the tables before building the DataFrame.
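A minimal sketch of one possible approach, assuming the tables in the sheet are separated by completely blank rows and each table's first row holds its column names (the file and sheet names are made up for illustration):
import pandas as pd

# Read the whole sheet with pandas, without treating any row as a header
raw = pd.read_excel("tables.xlsx", sheet_name="Sheet1", header=None)

# Split the sheet into blocks wherever a fully blank row appears
blocks, current = [], []
for _, row in raw.iterrows():
    if row.isna().all():
        if current:
            blocks.append(pd.DataFrame(current))
            current = []
    else:
        current.append(row)
if current:
    blocks.append(pd.DataFrame(current))

# Promote each block's first row to the header and convert it to a Spark DataFrame
spark_dfs = []
for block in blocks:
    block.columns = block.iloc[0]
    block = block.iloc[1:].reset_index(drop=True)
    spark_dfs.append(spark.createDataFrame(block))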
Does this work only in Databricks? It's not skipping the values for me.
It works in Databricks and in any Spark environment.
Sir, currently I am attending Azure data engineer interviews. They mostly ask scenario-based questions. Do you provide those interview questions?
Sure, I can create a video on scenario based questions.
Could you share the list of questions asked in your interview, so that others in the community can benefit as well?
Hello Raja, thank you so much for the videos. I am planning to go through all the videos in your PySpark transformations series.
My question is: will this make me project-ready, and is this what we do in real time? If not, can you please suggest what more to do?
Hi Mohammed, I have covered a lot of PySpark concepts and also a few real-time scenarios. When you complete all the videos, you will be in a good position to handle any real-time project.
Thank you. People post simple PySpark videos, but you are posting content with real-time scenarios. Thank you so much, really appreciated.
Welcome
@rajasdataengineering7585 Let me know if you start any paid classes for real-time projects, I would like to join.
But what if I have a very big CSV file? What would the performance-optimized approach be?
We don't have a performance-optimised method for this requirement at the moment. If performance is a concern, the logic needs to be handled at the data producer level itself.
PySpark doesn't seem to have equivalents for the SQL operators BETWEEN, IN, LIKE, ...
They are all there in PySpark too.
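For example (df and the column names are just placeholders):
from pyspark.sql.functions import col

# BETWEEN: salary between 3000 and 5000 (inclusive)
df.filter(col("salary").between(3000, 5000)).display()

# IN: department is one of the listed values
df.filter(col("department").isin("HR", "Finance")).display()

# LIKE: names starting with 'Ra'
df.filter(col("name").like("Ra%")).display()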
Hi Raja, please make a video on the interview question of how Spark processes a 1 TB file, partition by partition.
Hi Bhaskar, please watch video no. 100. After that you can answer any kind of partition question.
th-cam.com/video/A80o9WGXK_I/w-d-xo.html
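As a rough rule of thumb (a general Spark default, not something specific to the video): file-based sources are split by spark.sql.files.maxPartitionBytes, which defaults to 128 MB, so a 1 TB splittable file lands in roughly 1 TB / 128 MB ≈ 8192 partitions. You can check the numbers like this (the path is a made-up example):
# Default partition size for file-based sources (128 MB unless overridden)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Number of partitions Spark actually created for the DataFrame
df = spark.read.option("header", True).csv("/mnt/raw/big_file.csv")  # hypothetical path
print(df.rdd.getNumPartitions())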
@rajasdataengineering7585 thanks 👍
Please create a video on schema_of_json
And higher-order SQL functions like filter (lambda), transform, etc.
Sure Sumit, these topics are on the list. I will make videos on them soon.
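As a small preview in the meantime (the JSON string and the array column below are made-up examples, and the Python lambda form of transform/filter needs Spark 3.1+):
from pyspark.sql import functions as F

# schema_of_json infers the DDL schema from a sample JSON string
spark.range(1).select(F.schema_of_json(F.lit('{"id":1,"tags":["a","b"]}')).alias("json_schema")).display()

# Higher-order functions on an array column called "numbers"
df = spark.createDataFrame([(1, [1, 2, 3, 4])], ["id", "numbers"])
df.select(
    F.transform("numbers", lambda x: x * 2).alias("doubled"),      # multiply each element by 2
    F.filter("numbers", lambda x: x % 2 == 0).alias("evens")       # keep only even elements
).display()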
@rajasdataengineering7585 Also for incremental data ingestion and Autoloader.
Superb explanation Raja 👌 👏 👍. How can we convert JSON to CSV, and nested JSON to CSV, using user-defined functions? Can you please make a video on that?
Thanks Sravan.
I have already posted a video on flattening complex JSON. You can refer to that video: th-cam.com/video/jD8JIw1FVVg/w-d-xo.html
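In the meantime, the usual pattern is to flatten the struct and array fields first and then write the flat DataFrame out as CSV, since CSV cannot hold nested types; here is a minimal sketch with a made-up nested schema and paths (no UDF is needed for simple cases):
from pyspark.sql.functions import col, explode

# Hypothetical input: each record has a customer struct and an array of orders
df = spark.read.option("multiline", True).json("/mnt/raw/customers.json")  # made-up path

flat_df = (df.select(
               col("customer.id").alias("customer_id"),
               col("customer.name").alias("customer_name"),
               explode(col("orders")).alias("order"))          # one row per array element
             .select("customer_id", "customer_name",
                     col("order.order_id"), col("order.amount")))

# Write only the flattened columns to CSV
flat_df.write.mode("overwrite").option("header", True).csv("/mnt/output/customers_csv")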
Please make a playlist on Unity Catalog.
Sure, will make soon
Hi Raja, can you share the PDF of this course?
Hi Raja, could you please make a couple of videos on Delta Live Tables?
Hi Lalith, yes, sure, I will create videos on Delta Live Tables.
import csv

def skip_records(csv_file, start_row, end_row):
    with open(csv_file, 'r') as file:
        reader = csv.reader(file)
        for row_number, row in enumerate(reader, start=1):
            if start_row <= row_number <= end_row:  # skip rows inside the given range
                continue
            yield row
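A possible usage sketch for that helper (the file name and range are assumptions), feeding the remaining rows into a Spark DataFrame:
rows = list(skip_records("data.csv", 5, 8))   # drop rows 5-8 of the file
header, data = rows[0], rows[1:]              # first remaining row is the header
df = spark.createDataFrame(data, schema=header)
df.display()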
Hi Raja bro, will my below logic work?
first_10_rows = df.limit(10)
after_20_rows = df.subtract(first_10_rows).orderBy('Id')
after_20_rows.show()
Hi Sabesan, you are subtracting the first 10 records from the entire dataset, so it's equivalent to skipping only the first 10 rows. It won't produce the expected result.
Also, when we use limit(10) on a DataFrame, there is no guarantee that it pulls out the first 10 records of the CSV file, even though they are the first 10 records of the DataFrame.
@rajasdataengineering7585 Oh ok Raja bro, let me look into your solution then, though it seems somewhat lengthy to answer in interviews 😂
Yes, it is lengthy. It was created with beginners in mind as well. Basically, you need to understand the concept and then answer in your own way, short and crisp.
from pyspark.sql.functions import monotonically_increasing_id, col

# Coalesce to a single partition so the generated ids follow the original file order
df1 = df.coalesce(1).select("*", monotonically_increasing_id().alias("pk"))
df1.display()

# Exclude the rows whose index falls between 4 and 7 (i.e. the 5th to 8th records)
df2 = df1.filter(~col('pk').between(4, 7))
df2.display()