Using PySpark on a Dataproc Hadoop Cluster to process a large CSV file

  • Published Sep 6, 2024

Comments • 18

  • @zramzscinece_tech5310 2 years ago

    Great work! Please make a few end-to-end GCP data engineering projects.

  • @abhishekchoudhary247 2 years ago +1

    Great quick tutorial. Thanks!

  • @figh761 5 months ago

    How do we load a CSV file from our local disk to GCP using PySpark?
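
    One common pattern (a sketch, not taken from the video): copy the file into a GCS bucket with gsutil, then read it from the Dataproc cluster with PySpark. The bucket name and paths below are hypothetical.

        # On any machine with the Cloud SDK installed:
        #   gsutil cp /path/to/local/file.csv gs://my-bucket/data/file.csv
        # Then, in a PySpark job on the Dataproc cluster:
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("load-csv").getOrCreate()
        df = spark.read.csv("gs://my-bucket/data/file.csv",
                            header=True, inferSchema=True)
        df.show(5)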

  • @snehalbhartiya6724 2 years ago

    This was helpful. Thanks, Codible.

  • @rodrigoayarza9397 11 months ago

    The files are in Parquet format now. Is that a problem?
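
    It likely isn't a problem: PySpark reads Parquet natively, so only the reader call changes. A minimal sketch, with a hypothetical GCS path:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("read-parquet").getOrCreate()
        # Parquet files embed their own schema, so no header or
        # inferSchema options are needed
        df = spark.read.parquet("gs://my-bucket/data/")
        df.printSchema()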

  • @shamimibneshahid706 3 years ago

    If it's not leveraging HDFS, what's the point? Why are the other, seemingly silly reasons for using a bucket over HDFS more important here?

  • @shamimibneshahid706 3 years ago

    In the first cell, why didn't it read files from HDFS? So, is the bucket the same as HDFS?

    • @kishanubhattacharya2473 3 years ago +2

      Hello buddy, HDFS is different from a GCS bucket. When we create a Dataproc cluster, it gives us the option to choose a disk type, HDD or SSD. That is the storage the Hadoop cluster uses as a staging area and to process data.
      A Google Cloud Storage bucket, on the other hand, is a separate space, distinct from that HDD or SSD. Google recommends using a GCS bucket over HDFS storage (SSD or HDD), as it performs better. Also, there are scenarios where we don't want the master and worker instances to run for a long time, and they need to be shut down. In that case, data on HDFS storage is deleted along with the cluster, whereas data in GCS remains as it is, and when you spin up a new cluster you can make use of it.
      Hope this answers your question :)
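
      To make the distinction concrete in code: from PySpark's point of view, HDFS and a GCS bucket differ only in the path scheme (Dataproc clusters ship with the GCS connector preinstalled, so gs:// paths work out of the box). A minimal sketch, with hypothetical paths:

          from pyspark.sql import SparkSession

          spark = SparkSession.builder.appName("storage-demo").getOrCreate()

          # Cluster-local HDFS: this data is lost once the cluster is deleted
          df_hdfs = spark.read.csv("hdfs:///data/file.csv", header=True)

          # GCS bucket: this data persists independently of the cluster
          df_gcs = spark.read.csv("gs://my-bucket/data/file.csv", header=True)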

  • @souravsardar 2 years ago

    Hi @Codible, do you provide GCP training?

  • @kishanubhattacharya2473 3 years ago +1

    Thanks for the video, buddy. However, why did you use the master node to download the data when we can run the same command from the Google Cloud CLI? Was the purpose just to show how HDFS can be accessed on the master node and how to perform operations on it? (See the sketch after this thread.)

    • @ujarneevan1823 2 years ago +1

      Hi, I have a use case in GCP. Could you help me with it, buddy? Please… 🙏

    • @kishanubhattacharya2473 2 years ago

      @ujarneevan1823 Sure, I will try my best.

    • @ujarneevan1823 2 years ago

      @kishanubhattacharya2473 Reply to me, bro.
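
    On the CLI point raised above: staging the data does not require SSHing into the master node; the upload can be run from Cloud Shell or any machine with the google-cloud-storage client installed. A hypothetical sketch (bucket and file names are made up):

        # pip install google-cloud-storage
        from google.cloud import storage

        client = storage.Client()
        bucket = client.bucket("my-bucket")
        # Upload a local file straight into the bucket; no cluster involved
        bucket.blob("data/file.csv").upload_from_filename("/path/to/file.csv")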

  • @SonuKumar-fn1gn 1 year ago

    Please make a playlist… 🙏

  • @234076virendra 2 years ago

    Do you have a list of tutorials?

  • @ujarneevan1823 2 years ago

    Hi, can you help me with my use case? 😩

  • @RishabhSingh-db4mq 3 years ago

    good

  • @zucbsivrtcpegapjzwrf2056 2 years ago

    text