Hash Partitioning vs Range Partitioning | Spark Interview questions

  • Published: 17 Oct 2024

Comments • 48

  • @saikumarmora6409
    @saikumarmora6409 4 years ago +6

    While reading the data, it depends upon the file size. By default the partition size is 128 MB, so if we have an input file of 10*128 MB then it'll be divided into 10 partitions. Also, we can use spark.sql.files.maxPartitionBytes to set the partition size. Please correct me if I'm wrong.

    • @DataSavvy
      @DataSavvy  4 years ago +3

      Your understanding is right
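
A minimal PySpark sketch of the config mentioned in this thread; the input path and sizes are hypothetical, and the exact partition count also depends on spark.sql.files.openCostInBytes and the default parallelism:

```python
from pyspark.sql import SparkSession

# Hypothetical example: lower the max partition size from the 128 MB default to 64 MB,
# so the same input is split into roughly twice as many read partitions.
spark = (
    SparkSession.builder
    .appName("partition-size-demo")
    .config("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)  # 64 MB
    .getOrCreate()
)

df = spark.read.parquet("/data/events/")  # hypothetical path
print(df.rdd.getNumPartitions())          # roughly total_input_size / 64 MB
```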

  • @sadeeshkumar1484
    @sadeeshkumar1484 3 years ago

    If the file is from HDFS, then by default the number of partitions is based on the 128 MB block size. If it is from the local file system, then by default a 64 MB block size is taken, and the number of partitions is decided accordingly. Correct me if I'm wrong.

  • @sarfarazhussain6883
    @sarfarazhussain6883 4 years ago +10

    Default behaviour: if we use YARN, then the number of partitions = number of blocks. In local or standalone mode, the number of partitions is at most the number of cores available in the system.

    • @DataSavvy
      @DataSavvy  4 years ago +1

      Thanks :)

    • @kiranmudradi26
      @kiranmudradi26 4 years ago +1

      @@DataSavvy and @Sarfaraz: if Spark is reading from a non-distributed file system (i.e. other than HDFS), what would be the default/initial number of partitions and partition size?

    • @DataSavvy
      @DataSavvy  4 years ago +1

      In Spark 2.x I think it is 4 partitions... In Spark 3.x it is 6

    • @kiranmudradi26
      @kiranmudradi26 4 years ago

      @@DataSavvy thanks.
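
Rather than relying on a remembered number, these defaults can be checked directly on a given deployment. A small sketch, assuming PySpark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("defaults-check").getOrCreate()
sc = spark.sparkContext

# In local/standalone mode this is typically the number of cores;
# on YARN it is typically the total number of executor cores (minimum 2).
print(sc.defaultParallelism)

# Lower bound Spark uses when splitting files read through the RDD API.
print(sc.defaultMinPartitions)

# Partitions actually produced for a small in-memory dataset.
print(sc.parallelize(range(100)).getNumPartitions())
```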

  • @big-bang-movies
    @big-bang-movies 2 years ago

    Only high-level theory behind partitioning is covered; I was expecting some hands-on.

  • @veerap3878
    @veerap3878 2 years ago

    Is there a difference between reading data from Hive using the HiveContext and using the JDBC driver? When should we use the JDBC driver vs the HiveContext?

  • @srinivasasameer9615
    @srinivasasameer9615 4 years ago

    By default Spark chooses the HDFS block size, or the number of cores we pass through spark-submit in local[ ]. spark.sql.shuffle.partitions is 200 by default; it is not that the default number of partitions is 200. Hope I am right. Correct me if I am wrong.
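
To illustrate the distinction being drawn above: spark.sql.shuffle.partitions only controls post-shuffle partitions, not how a file is initially read. A sketch, assuming PySpark and hypothetical input/column names (with AQE enabled, Spark 3.x may coalesce the shuffle partitions further):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-partitions-demo").getOrCreate()

df = spark.read.csv("/data/sales.csv", header=True)  # hypothetical input
print(df.rdd.getNumPartitions())                      # driven by file size vs maxPartitionBytes

# Default is 200; the output of a groupBy/join gets this many partitions.
spark.conf.set("spark.sql.shuffle.partitions", "50")
grouped = df.groupBy("region").count()                # hypothetical column
print(grouped.rdd.getNumPartitions())                 # at most 50 after the shuffle
```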

  • @amitprasad8114
    @amitprasad8114 3 years ago

    Your explanation creates a very good mind map. Thank you!

  • @Laughrider
    @Laughrider 4 years ago +2

    Spark decides the number of partitions on the basis of block size. I am not sure, but please do answer this question; I have been asked it in an interview.

  • @pardeep657
    @pardeep657 3 years ago

    Is this a memory partitioning or a disk partitioning technique? Isn't memory partitioning costly in itself?

  • @mohankrishna4593
    @mohankrishna4593 3 years ago

    Hi Sir. All your channel videos are very helpful for us. Thanks a lot for the amazing content. Could you please answer this question?
    How many initial partitions does Spark create when we read a table/view from a data source like Oracle, Snowflake, SAP, etc.?
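
For plain JDBC sources the usual behaviour is a single partition unless partitioning options are supplied; vendor-specific connectors (Snowflake, SAP, etc.) have their own rules. A hedged sketch, assuming PySpark and a hypothetical Oracle table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-partitions-demo").getOrCreate()

# Without partitioning options, a JDBC read typically comes back as one partition.
single = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  # hypothetical
    .option("dbtable", "SALES.ORDERS")                       # hypothetical
    .option("user", "app").option("password", "secret")
    .load()
)
print(single.rdd.getNumPartitions())   # usually 1

# Supplying a numeric partition column splits the read into parallel queries.
parallel = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
    .option("dbtable", "SALES.ORDERS")
    .option("user", "app").option("password", "secret")
    .option("partitionColumn", "ORDER_ID")  # hypothetical numeric column
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .load()
)
print(parallel.rdd.getNumPartitions())  # 8
```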

  • @ahyanroboking9237
    @ahyanroboking9237 2 years ago

    In another session you mention that reading a large partition file can cause an OutOfMemory error in the executor, but in these discussions it is said that a 128 MB block is considered one partition while Spark is reading. Then how can a large partition file be the reason for an executor OutOfMemory error?

    • @ammeejurinaveenkumar6874
      @ammeejurinaveenkumar6874 6 months ago

      If you have a very large file and you're not explicitly repartitioning it in Spark, Spark will likely create only a few partitions to process the data. If these partitions are too large, they might not fit into the memory of individual executor nodes, leading to OutOfMemory errors.
      For example, if you have a 10 GB file and Spark decides to create only 2 partitions, each partition would be approximately 5 GB in size. If your executor nodes have limited memory (which is often the case in distributed environments), trying to process a 5 GB partition might exceed the memory capacity of the executor, leading to an OutOfMemory error.
      Here, in your case, if the large partitioned file has 30 GB of data and you allocated only 10 cores/tasks to run the partitions, loading the file into a DataFrame can cause an OOM error.
      Hope your doubt is resolved now. :)
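
A small sketch of the mitigation implied above, assuming PySpark and a hypothetical input: increase the number of partitions before the heavy work so each task only handles a slice that fits in executor memory.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oom-mitigation-demo").getOrCreate()

# Hypothetical large input; a non-splittable format (e.g. gzip) can yield very few,
# very large partitions even though the data is huge.
df = spark.read.json("/data/big/events.json.gz")
print(df.rdd.getNumPartitions())    # may be 1 for a single gzipped file

# Spread the data across many smaller partitions (a full shuffle) so that each
# task processes a few hundred MB instead of several GB.
df = df.repartition(256)
df.write.mode("overwrite").parquet("/data/big/events_parquet/")  # hypothetical output
```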

  • @anubhavkarelia9585
    @anubhavkarelia9585 3 years ago +1

    By default the partition size is 128 MB, so when a file is read in Spark it automatically calculates file_size // 128 MB and divides the data into partitions accordingly. We can also change the partition size in Spark by changing the config.

  • @yardy88
    @yardy88 2 years ago

    Very well explained! Thank you! 😊

  • @vamshi878
    @vamshi878 4 years ago +1

    Hi, partitionBy doesn't perform a shuffle. Will data move across nodes?

    • @DataSavvy
      @DataSavvy  4 years ago

      I meant when you repartition data
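
To make the distinction in this thread concrete: repartition() triggers a shuffle, while write.partitionBy() only controls the output directory layout of each task. A sketch, assuming PySpark and hypothetical paths/columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitionBy-vs-repartition").getOrCreate()
df = spark.read.parquet("/data/events/")  # hypothetical input

# repartition(): a shuffle, so rows do move across nodes.
by_country = df.repartition("country")    # hash-partitions rows by the 'country' column

# write.partitionBy(): no extra shuffle by itself; each task just writes its rows
# into country=<value>/ subdirectories.
(
    by_country.write
    .partitionBy("country")
    .mode("overwrite")
    .parquet("/data/events_by_country/")
)
```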

  • @Capt_X
    @Capt_X 4 years ago

    Thank you for making it so simple to understand!
    How can we distribute 8 GB of records evenly after filtering and joining with another dataset? I see a different number of partitions in different stages of the last job in the AppMaster when I perform the action of saving the df to CSV. Will this problem be solved by increasing the number of executors / executor memory / driver memory?

    • @DataSavvy
      @DataSavvy  4 years ago

      Hi... I'm sorry, I did not understand your question properly

  • @sachingajwa8839
    @sachingajwa8839 2 years ago

    Spark uses the default partitioning when it reads data from a file. The default partitioning splits the data based on the size of the file, creating one partition for each 128 MB of data.

  • @sreenivasmekala6198
    @sreenivasmekala6198 4 years ago

    Hi Sir, how is the hash code of a record decided in hash partitioning?

  • @hishailesh77
    @hishailesh77 3 years ago

    Spark decides the number of partitions based on a combination of factors, viz. the default parallelism (usually equal to the number of cores), the total number of files and the size of each file, and the max partition size (default 128 MB).
    Two scenarios are given below.
    a) 54 parquet files, 63 MB each, number of cores equal to 10, max partition size = 128 MB.
    Total partitions = 54, as split size = 63 MB + 4 MB (openCostInBytes) = 67 MB, so we can fit only one split into one partition.
    b) 54 parquet files, 38 MB each, number of cores equal to 10, max partition size = 128 MB.
    Total partitions = 18, as split size = 38 MB + 4 MB (openCostInBytes) = 42 MB, so we can fit three splits into one partition (128/42).
    Apart from this, if we set Spark's default parallelism very high, it will also affect the number of partitions, and we would get different numbers for the above scenarios (will do the math later).
    BTW, thanks for putting this series together, it's really helpful.
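
A sketch of the knobs behind those numbers, assuming PySpark; maxPartitionBytes, openCostInBytes and the default parallelism all feed into the split size Spark computes for file sources:

```python
from pyspark.sql import SparkSession

# Hypothetical settings mirroring the scenarios above: a 128 MB target partition size
# and a 4 MB per-file "open cost" that is added to every file when packing splits.
spark = (
    SparkSession.builder
    .appName("split-size-demo")
    .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)
    .config("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)
    .getOrCreate()
)

df = spark.read.parquet("/data/54_parquet_files/")  # hypothetical directory of 54 files
print(df.rdd.getNumPartitions())  # e.g. 54 for 63 MB files, ~18 for 38 MB files
```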

  • @aneksingh4496
    @aneksingh4496 4 years ago +1

    We have to provide the number of partitions manually, let's say in repartition(), and then invoke partitionBy; otherwise Spark will take the default from spark.sql.shuffle.partitions, which is 200.

    • @DataSavvy
      @DataSavvy  4 years ago

      These are ways to enforce a number manually. Otherwise Spark will create one partition per core when it is writing a new file. In case Spark is reading a new file, it will be based on HDFS blocks.

  • @rajdeepsinghborana2409
    @rajdeepsinghborana2409 4 years ago +1

    Sir, is there any free online lab (platform) for practicing big data / Hadoop? 👨🏻‍💻

    • @DataSavvy
      @DataSavvy  4 years ago +2

      You can use Databricks Community Edition for practice

  • @ishansharma4276
    @ishansharma4276 3 years ago

    Spark decides the number of partitions based on the key. So if there are 4 kinds of keys, let us say x_1, y_1, z_1, t_1, then there will be 4 partitions of the file.

  • @vijeandran
    @vijeandran 3 years ago

    Very informative...

  • @MrVivekc
    @MrVivekc 4 years ago +2

    Partitions while reading a file = (total file size) / (128 MB)

  • @aneksingh4496
    @aneksingh4496 4 years ago +1

    But how will Spark decide which partitioner to choose?

    • @DataSavvy
      @DataSavvy  4 years ago

      That depends on the nature of the transformation... You can also force Spark to prefer a certain transformation

    • @nandepusrinivas6746
      @nandepusrinivas6746 3 years ago

      @@DataSavvy Can you elaborate on how to force Spark to prefer a transformation? Do we have any docs to dig deeper into that?
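
One concrete way to steer the choice from the DataFrame API is to request the partitioning explicitly; a sketch assuming PySpark and hypothetical columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("choose-partitioner-demo").getOrCreate()
df = spark.read.parquet("/data/orders/")  # hypothetical input

# Hash partitioning: rows with the same customer_id land in the same partition,
# but the partitions themselves are not ordered.
hashed = df.repartition(8, "customer_id")

# Range partitioning: Spark samples the column, builds range boundaries, and
# assigns contiguous key ranges to partitions (useful before sorted writes).
ranged = df.repartitionByRange(8, "order_date")

print(hashed.rdd.getNumPartitions(), ranged.rdd.getNumPartitions())
```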

  • @adityakvs3529
    @adityakvs3529 1 month ago

    How is the hashcode decided?

    • @DataSavvy
      @DataSavvy  1 month ago

      The hashcode is calculated using a hash algorithm
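
A rough sketch of the idea, assuming PySpark: the RDD HashPartitioner sends each key to hash(key) mod numPartitions, and the DataFrame engine uses a Murmur3-based hash internally, which is what the built-in hash() function exposes.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hashcode-demo").getOrCreate()

df = spark.createDataFrame([("apple",), ("banana",), ("cherry",)], ["fruit"])

num_partitions = 4
# hash() is Spark's built-in Murmur3 hash; pmod keeps the partition id non-negative.
df.select(
    "fruit",
    F.hash("fruit").alias("hash_value"),
    F.expr(f"pmod(hash(fruit), {num_partitions})").alias("partition_id"),
).show()
```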

  • @stevehe5713
    @stevehe5713 3 years ago

    You didn't explain the context correctly. I think you meant the shuffle partitioning strategy.

  • @bhanukumarsingh6272
    @bhanukumarsingh6272 4 years ago +1

    Spark will decide the number of partitions based on the number of blocks of the files.

  • @vsandeep06
    @vsandeep06 4 years ago

    The number of partitions depends on the total number of cores in the worker nodes.

    • @DataSavvy
      @DataSavvy  4 years ago

      This statement is not always true... Rather, that only determines how many parallel tasks can get executed.

  • @2chandra
    @2chandra 4 years ago +1

    Spark partitioning depends on the number of cores.

    • @DataSavvy
      @DataSavvy  4 years ago +2

      Right... When Spark is writing data it depends on cores... What about when Spark is reading a new file?

    • @2chandra
      @2chandra 4 years ago

      @@DataSavvy Spark normally sets the partitions automatically based on the cluster. However, we can set the partitions manually.

    • @MrManish389
      @MrManish389 4 years ago +1

      @@DataSavvy While reading the data --> (file size / block size (128 MB)). Kindly correct me if I am wrong.

  • @arupdaw5193
    @arupdaw5193 3 years ago

    The WhatsApp group is full and kicked me out of the group.