113. Databricks | PySpark | Spark Reader: Skip Specific Range of Records While Reading CSV File

  • Published on 14 Jan 2025

Comments • 52

  • @AshokKumar-ji3cs
    @AshokKumar-ji3cs 1 year ago +5

    Hi Raja, we really liked your solution. Your daily video content has become part of our DNA now. I really appreciate you taking the time to make such good videos. I pray to God to give you good health and wealth to keep making videos like this. Thanks again 🙏

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Hi Ashok, thank you for the nice comment and your kind words.
      Hope these videos help you gain knowledge in Spark and Databricks!

  • @vantalakka9869
    @vantalakka9869 1 year ago +2

    Thank you Raja, this video is very useful for all data engineers.

  • @anandgupta7273
    @anandgupta7273 1 year ago +3

    Dear Raja, I wanted to express my gratitude for your immensely helpful videos. Our learning experience from your channel has been exceptional. However, I noticed that a few videos are missing, disrupting the series' continuity. I kindly request you to consider uploading the remaining videos in the correct order. Your efforts in accommodating this request would be greatly appreciated. Thank you for your dedication to providing valuable content.

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Hi Anand, thanks for your nice comment.
      Those missing videos are part of the Azure Synapse Analytics series and you can find them in the respective playlist.

  • @pratikraj06
    @pratikraj06 1 year ago +2

    Very informative, thanks for sharing

  • @passions9730
    @passions9730 1 year ago +1

    Thank you Raja for the information.

  • @munnysmile
    @munnysmile 1 year ago +1

    Hi Raja,
    Why don't we use filters to exclude the range in the given example? We can add a new column with sequential index data and then filter the required data. Can you please let me know what kind of issues we may face if we go with this approach?

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      In order to add a new column, the data needs to be pulled into the Spark environment first. When data is ingested into Spark, it is split into partitions, so it's not possible to reliably identify the first few records using this method.

    • @starmscloud
      @starmscloud 1 year ago

      Hello Raja,
      But you can set the number of partitions to 1 before reading the CSV.
      This way your data won't get split across partitions, and you can add a row number column and apply the filter (see the sketch below).
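
      A minimal sketch of that idea, assuming an existing SparkSession named spark and a hypothetical file path /tmp/data.csv (neither comes from the video):

      from pyspark.sql.functions import monotonically_increasing_id, col

      # Collapse to a single partition so the generated ids follow the file order
      df = spark.read.option("header", True).csv("/tmp/data.csv").coalesce(1)

      # Add a sequential index, then drop a chosen range (here rows 10-19, 0-based)
      indexed = df.withColumn("row_id", monotonically_increasing_id())
      result = indexed.filter(~col("row_id").between(10, 19)).drop("row_id")
      result.show()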

  • @sumitchandwani9970
    @sumitchandwani9970 1 year ago +1

    Why are DLT pipelines used when we can create notebooks and schedule them using ADF or Workflows?

  • @oiwelder
    @oiwelder 1 year ago +1

    Hello Raja, would it be possible to create a video lesson explaining how to create a multi-node Spark cluster on a local network? It could be two machines, for example.

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago +1

      Hi Welder, thanks for your request. Sure, I will create a video on this requirement.

  • @DebayanKar7
    @DebayanKar7 1 year ago +1

    Suppose I have an Excel file with multiple small tables within the same sheet, and I want to pick out the data and load it properly into a DataFrame. Can this be done?

  • @sailalithareddy9362
    @sailalithareddy9362 1 year ago +1

    Does this work only in Databricks? It's not skipping the values for me.

  • @tadojuvishwa2509
    @tadojuvishwa2509 1 year ago +1

    Sir, currently I am attending Azure data engineer interviews. They are mostly asking scenario-based questions. Do you provide those interview questions?

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Sure, I can create a video on scenario-based questions.
      Could you share the list of questions asked in your interview so that others in the community can benefit?

  • @mohammedmussadiq8934
    @mohammedmussadiq8934 1 year ago +1

    Hello Raja, thank you so much for the videos. I am planning to go through all the videos of your PySpark transformation series.
    My question is, will this make me project-ready, and is this what we do in real time? If not, can you please advise me further?

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Hi Mohammed, I have covered a lot of PySpark concepts and also a few real-time scenarios. When you complete all the videos, you will be in a good position to handle any real-time project.

    • @mohammedmussadiq8934
      @mohammedmussadiq8934 1 year ago +1

      Thank you. People post some simple PySpark videos, but you are posting content that covers real-time scenarios. Thank you so much, really appreciated.

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Welcome

    • @mohammedmussadiq8934
      @mohammedmussadiq8934 1 year ago

      @@rajasdataengineering7585 Let me know if you start any paid classes for real-time projects; I would like to join.

  • @mankaransingh981
    @mankaransingh981 1 year ago

    But what if I have a very big CSV file? What would be the performance-optimized approach?

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      We don't have a performance-optimised method for this requirement at the moment. If performance is a concern, the logic needs to be handled at the data producer level itself.

  • @smallgod100
    @smallgod100 1 year ago +1

    In PySpark we don't have concepts for SQL commands like BETWEEN, IN, LIKE...
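
    For reference, a minimal sketch showing that the PySpark Column API does provide equivalents of these SQL operators, assuming a DataFrame df with hypothetical columns id and name:

    from pyspark.sql.functions import col

    df.filter(col("id").between(5, 10))     # SQL: id BETWEEN 5 AND 10
    df.filter(col("id").isin(1, 2, 3))      # SQL: id IN (1, 2, 3)
    df.filter(col("name").like("Raj%"))     # SQL: name LIKE 'Raj%'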

  • @bhaskaravenkatesh6994
    @bhaskaravenkatesh6994 1 year ago +1

    Hi Raja, please make a video on the interview question of how Spark processes a 1 TB file partition by partition.

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Hi Bhaskar, please watch video no. 100. After that you can answer any kind of partition question.
      th-cam.com/video/A80o9WGXK_I/w-d-xo.html

    • @bhaskaravenkatesh6994
      @bhaskaravenkatesh6994 1 year ago

      @@rajasdataengineering7585 thanks 👍

  • @sumitchandwani9970
    @sumitchandwani9970 1 year ago +1

    Please create a video on schema_of_json
    and higher-order SQL functions like filter (lambda), transform, etc.

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago +1

      Sure Sumit, these topics are on the list. I will make videos on them soon.

    • @sumitchandwani9970
      @sumitchandwani9970 1 year ago

      @@rajasdataengineering7585 Also for incremental data ingestion and Auto Loader.

  • @sravankumar1767
    @sravankumar1767 1 year ago +1

    Superb explanation Raja 👌 👏 👍. How can we convert JSON to CSV, and nested JSON to CSV, using user-defined functions? Can you please make a video?

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago +2

      Thanks Sravan.
      I have already posted a video on flattening complex JSON. You can refer to that video: th-cam.com/video/jD8JIw1FVVg/w-d-xo.html

  • @krishnaji6541
    @krishnaji6541 1 year ago

    Please make a playlist on Unity Catalog.

  • @sreenathsree6771
    @sreenathsree6771 1 year ago

    Hi Raja, can you share the PDF of this course?

  • @lalithroy
    @lalithroy 1 year ago +1

    Hi Raja, could you please make a couple of videos on Delta Live Tables?

  • @CodeCrafter02
    @CodeCrafter02 1 year ago +1

    import csv

    def skip_records(csv_file, start_row, end_row):
        with open(csv_file, 'r') as file:
            reader = csv.reader(file)
            for row_number, row in enumerate(reader, start=1):
                # Yield only the rows that fall outside the skipped range
                if not (start_row <= row_number <= end_row):
                    yield row

  • @sabesanj5509
    @sabesanj5509 1 year ago +1

    Hi Raja bro, will my below logic work?
    first_10_rows = df.limit(10)
    after_20_rows = df.subtract(first_10_rows).orderBy('Id')
    after_20_rows.show()

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Hi Sabesan, you are trying to subtract the first 10 records from the entire dataset, so it's the equivalent of skipRows 10 and it won't produce the expected result of skipping a specific range.
      Also, when we use limit(10) on a dataframe, it does not guarantee that it is pulling out the first 10 records of the CSV file, even though it is the first 10 records of the dataframe (see the sketch below).
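
      A minimal sketch contrasting the two approaches discussed here, assuming a Databricks environment, an existing SparkSession named spark, and a hypothetical path /tmp/data.csv; skipRows is the Databricks CSV reader option mentioned above and may not be available in open-source Spark:

      # Skip the first 10 rows of the file at read time via the reader option
      df_skip = spark.read.option("skipRows", 10).csv("/tmp/data.csv")

      # The subtract-based alternative: limit(10) returns 10 rows of the dataframe,
      # but not necessarily the first 10 rows of the underlying CSV file
      df = spark.read.csv("/tmp/data.csv")
      df_alt = df.subtract(df.limit(10))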

    • @sabesanj5509
      @sabesanj5509 1 year ago +1

      @@rajasdataengineering7585 Oh ok Raja bro, let me look into your solution then, though it seems a somewhat lengthy one to answer in interviews 😂

    • @rajasdataengineering7585
      @rajasdataengineering7585  1 year ago

      Yes, it is lengthy. It was created with beginners in mind as well. Basically, you need to understand the concept and then answer in your own way, short and crisp.

  • @aravind5310
    @aravind5310 1 year ago

    from pyspark.sql.functions import monotonically_increasing_id, col

    # Collapse to a single partition so the generated ids follow the row order,
    # then add a sequential id column
    df1 = df.coalesce(1).select("*", monotonically_increasing_id().alias("pk"))
    df1.display()

    # Exclude the rows whose id falls in the range 4-7
    df2 = df1.filter(~col('pk').between(4, 7))
    df2.display()