Using PySpark on Dataproc Hadoop Cluster to process large CSV file

  • Published: 3 Dec 2024

Comments • 18

  • @abhishekchoudhary247
    @abhishekchoudhary247 2 years ago +1

    Great quick tutorial. Thanks.

  • @zramzscinece_tech5310
    @zramzscinece_tech5310 2 years ago

    Great work! Please make a few GCP data engineering projects end to end.

  • @snehalbhartiya6724
    @snehalbhartiya6724 2 years ago

    This was helpful. Thanks, Codible.

  • @rodrigoayarza9397
    @rodrigoayarza9397 1 year ago

    The files are in Parquet now. Is that a problem?
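
A minimal sketch, assuming the dataset now sits as Parquet in a GCS bucket (bucket and path names are placeholders): Parquet embeds its own schema, so the read is even simpler than for CSV.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-parquet-example").getOrCreate()

    # Parquet files carry their schema, so no header/inferSchema options are needed.
    df = spark.read.parquet("gs://your-bucket/path/to/parquet-data/")
    df.printSchema()
    df.show(5)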

  • @souravsardar
    @souravsardar 2 years ago

    Hi @Codible, do you provide GCP training?

  • @SonuKumar-fn1gn
    @SonuKumar-fn1gn 1 year ago

    Please make a playlist… 🙏

  • @figh761
    @figh761 8 months ago

    How do we load a CSV file from our local disk to GCP using PySpark?
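
A hedged sketch of one way to do this: Spark running on Dataproc cannot see a local disk, so the file is first uploaded to a GCS bucket (here with the google-cloud-storage client) and then read through the gs:// connector. Bucket, blob, and file names below are placeholders.

    from google.cloud import storage      # pip install google-cloud-storage
    from pyspark.sql import SparkSession

    # Upload the local CSV to a GCS bucket (placeholder names).
    client = storage.Client()
    bucket = client.bucket("your-bucket")
    bucket.blob("input/large_file.csv").upload_from_filename("/path/on/your/disk/large_file.csv")

    # Read it from the Dataproc cluster via the gs:// connector.
    spark = SparkSession.builder.appName("load-csv-example").getOrCreate()
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("gs://your-bucket/input/large_file.csv"))
    df.show(5)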

  • @234076virendra
    @234076virendra 2 years ago

    Do you have a list of the tutorials?

  • @Titu-z7u
    @Titu-z7u 3 years ago

    If it's not leveraging HDFS, what's the point? Why are the other, seemingly minor reasons for using a bucket over HDFS more important here?

  • @kishanubhattacharya2473
    @kishanubhattacharya2473 3 years ago +1

    Thanks for the video, buddy. However, why did you use the master node to download the data when we can run the same command from the Google Cloud CLI?
    Was the purpose just to show how HDFS can be accessed on the master node and how operations can be performed over it?

    • @ujarneevan1823
      @ujarneevan1823 2 years ago +1

      Hi, I have a use case in GCP. Could you help me with it, buddy, please… 🙏

    • @kishanubhattacharya2473
      @kishanubhattacharya2473 2 years ago

      @@ujarneevan1823 Sure, I will try my best.

    • @ujarneevan1823
      @ujarneevan1823 2 years ago

      @@kishanubhattacharya2473 Reply to me, bro.

  • @Titu-z7u
    @Titu-z7u 3 years ago

    In the first cell, why didn't it read files from HDFS? So, is the bucket the same as HDFS?

    • @kishanubhattacharya2473
      @kishanubhattacharya2473 3 years ago +2

      Hello buddy, HDFS is different from a GCS bucket. When we create a Dataproc cluster, it gives us the option to choose the disk type, which can be HDD or SSD. This is the storage that the Hadoop cluster uses as a staging area and to process data.
      Whereas,
      a Google Cloud Storage bucket is a separate space, distinct from that HDD or SSD. Google recommends using a GCS bucket over HDFS storage (SSD or HDD), as it performs better. Also, there are scenarios where we don't want the master and worker instances to run for a long time and they need to be shut down. In that case, if you are using HDFS storage, the data is deleted along with the cluster, whereas the data in GCS remains as it is, and when you spin up a new cluster you can make use of that data.
      Hope this answers your question :)
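
To make the distinction concrete, a minimal sketch (paths are placeholders) that writes the same DataFrame both to the cluster's HDFS, which is lost when the cluster is deleted, and to a GCS bucket, which persists after the cluster is shut down:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-vs-gcs-example").getOrCreate()
    df = spark.read.option("header", "true").csv("gs://your-bucket/input/large_file.csv")

    # Staging output on the cluster's HDFS: ephemeral, deleted with the cluster.
    df.write.mode("overwrite").parquet("hdfs:///tmp/staging/large_file_parquet")

    # Final output on Cloud Storage: persists after the cluster is torn down.
    df.write.mode("overwrite").parquet("gs://your-bucket/output/large_file_parquet")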

  • @RishabhSingh-db4mq
    @RishabhSingh-db4mq 3 years ago

    good

  • @ujarneevan1823
    @ujarneevan1823 2 years ago

    Hi, can you help me with my use case? 😩

  • @zucbsivrtcpegapjzwrf2056
    @zucbsivrtcpegapjzwrf2056 2 years ago

    text