Thanks for the video, buddy. However, why did you use the master node to download the data, when we can run the same command from the Google Cloud CLI? Was the purpose just to show how HDFS can be accessed on the master node and how to perform operations on it?
Hello buddy, so HDFS is different from a GCS bucket. When we create a Dataproc cluster, it gives us an option to choose the disk type, which can be HDD or SSD. This is the storage space the Hadoop cluster uses as a staging area and to process data. A Google Cloud Storage bucket, on the other hand, is a separate space, distinct from the cluster's HDD or SSD. Google recommends using a GCS bucket over HDFS storage (SSD or HDD), as it performs better. Also, there are scenarios where we don't want the master and worker instances to run for a long time and they need to be shut down. In that case, if you are using HDFS storage, the data is deleted along with the cluster, whereas the data in GCS remains as it is, and when you spin up a new cluster, you can make use of this data. Hope this answers your question :)
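A minimal sketch of the difference described above. The bucket name and file paths here are hypothetical, not from the video; the point is that in PySpark the read call looks the same either way, only the URI scheme decides whether the data outlives the cluster:

```python
# Sketch: contrast the two storage URIs a Dataproc Spark job can read.
# "my-bucket" and "data/sales.csv" are placeholder names for illustration.

def data_uri(path, use_gcs=True, bucket="my-bucket"):
    """Build the input URI: gs:// data persists past cluster deletion,
    hdfs:// data lives on the cluster's disks and is deleted with it."""
    if use_gcs:
        return f"gs://{bucket}/{path}"
    return f"hdfs:///{path}"

# With PySpark on a Dataproc cluster, reading is identical either way
# (Dataproc ships with the GCS connector preinstalled):
#   spark.read.csv(data_uri("data/sales.csv"), header=True)
print(data_uri("data/sales.csv"))                 # gs://my-bucket/data/sales.csv
print(data_uri("data/sales.csv", use_gcs=False))  # hdfs:///data/sales.csv
```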
Great quick tutorial. Thanks!
Great work! Please make a few GCP data engineering projects end to end.
This was helpful. Thanks Codible.
The files are in Parquet format now. Is that a problem?
Hi @Codible, do you provide GCP training?
Please make a playlist..🙏
How do I load a CSV file from local disk to GCP using PySpark?
Do you have a list of the tutorials?
If it's not leveraging HDFS, what's the point? Why are the other reasons for using a bucket over HDFS more important here?
Hi, I have a use case in GCP. Could you help me with it, buddy, please… 🙏
@@ujarneevan1823 Sure, I will try my best.
@@kishanubhattacharya2473 Please reply, bro.
In the first cell, why didn't it read the files from HDFS? So, is the bucket the same as HDFS?
good
Hi, can you help me with my use case? 😩