While reading data, it depends on the file size. By default the partition size is 128 MB, so if we have an input file of 10*128 MB it'll be divided into 10 partitions. Also, we can use spark.sql.files.maxPartitionBytes to set the partition size. Please correct me if I'm wrong
Your understanding is right
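To sanity-check the arithmetic above, here is a tiny pure-Python sketch (not Spark itself) of the read-partition estimate, assuming partitions are simply file size divided by the spark.sql.files.maxPartitionBytes default of 128 MB:

```python
import math

# Default value of spark.sql.files.maxPartitionBytes: 128 MB
MAX_PARTITION_BYTES = 128 * 1024 * 1024

def estimated_read_partitions(file_size_bytes):
    """Rough estimate: one partition per 128 MB chunk of the input file."""
    return math.ceil(file_size_bytes / MAX_PARTITION_BYTES)

# A 10 * 128 MB file, as in the comment above
print(estimated_read_partitions(10 * 128 * 1024 * 1024))  # 10
```

This ignores details like openCostInBytes and multi-file packing, but it matches the simple single-large-file case discussed here.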
If the file is from HDFS, then by default the 128 MB block size determines the number of partitions. If it is from the local file system, then by default a 64 MB block size is taken, and the number of partitions is decided accordingly. Correct me if I'm wrong.
Default behaviour: if we use YARN, then the number of partitions = number of blocks. In local or standalone mode, the number of partitions can be at most the number of cores available on the system.
Thanks :)
@DataSavvy and @Sarfaraz: if Spark is reading from a non-distributed file system other than HDFS, what would be the default/initial number of partitions and partition size?
In Spark 2.x I think it is 4 partitions... In Spark 3.x it is 6
@DataSavvy thanks.
Only high-level theory covered behind partitioning; was expecting some hands-on.
Is there a difference between reading data from Hive using the HiveContext and using a JDBC driver? When should one use the JDBC driver vs. the HiveContext?
By default Spark chooses the HDFS block size; in local[ ] mode it is the number of cores we pass through spark-submit. spark.sql.shuffle.partitions is 200 by default, but that is the shuffle partition count, not the default read partition count. Hope I am right. Correct me if I am wrong.
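For reference, a minimal PySpark config sketch showing that these are two separate knobs. The setting names are real Spark configs; the session itself is just illustrative and requires a Spark installation to run:

```python
# Illustrative config fragment only (requires pyspark installed).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[4]")  # local mode: parallelism tied to the 4 cores requested
    .config("spark.sql.shuffle.partitions", "200")  # partitions AFTER a shuffle (default 200)
    .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))  # read split size
    .getOrCreate()
)

# Partitions of a freshly read file come from file splits / cores,
# NOT from the 200 above; the 200 only applies after a shuffle
# (e.g. groupBy, join).
```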
Your explanation creates a very good mind map. Thank you!
Spark decides the number of partitions on the basis of block size. I am not sure, so please do answer this question; I was asked it in an interview.
Is it a memory partitioning or a disk partitioning technique? Isn't memory partitioning costly in itself?
Hi Sir. All your channel videos are very helpful for us. Thanks a lot for the amazing content. Could you please answer this question?
How many initial partitions spark creates when we read table/view from some data source like Oracle, Snowflake,SAP etc?
In another session you mention that reading a large partition file can cause an OutOfMemory error in the executor, but in these discussions the 128 MB block size is considered the partition size while Spark reads it. Then how is a large partition file a reason for executor OutOfMemory?
If you have a very large file and you're not explicitly repartitioning it in Spark, Spark will likely create only a few partitions to process the data. If these partitions are too large, they might not fit into the memory of individual executor nodes, leading to OutOfMemory errors.
For example, if you have a 10 GB file and Spark decides to create only 2 partitions, each partition would be approximately 5 GB in size. If your executor nodes have limited memory (which is often the case in distributed environments), trying to process a 5 GB partition might exceed the memory capacity of the executor, leading to an OutOfMemory error.
Here in your case, if the large partition file has 30 GB of data and you allocated only 10 cores/tasks to run the partitions, loading the file into a dataframe can cause an OOM error.
Hope your doubt got resolved now. :)
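The arithmetic in this example can be checked with a quick pure-Python sketch; the 128 MB target below is an assumption matching the default maxPartitionBytes, not something Spark computes this way internally:

```python
import math

GB = 1024 ** 3

def partition_size_gb(file_size_gb, num_partitions):
    """Average size of each partition when a file is split N ways."""
    return file_size_gb / num_partitions

# The 10 GB / 2 partitions example above: ~5 GB per partition
print(partition_size_gb(10, 2))  # 5.0

def partitions_needed(file_size_gb, target_partition_mb=128):
    """Rough fix: enough partitions that each stays near the target size."""
    return math.ceil(file_size_gb * 1024 / target_partition_mb)

# 10 GB at ~128 MB per partition
print(partitions_needed(10))  # 80
```

The point of the second function is the usual remedy: explicitly repartitioning so no single partition outgrows executor memory.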
By default the partition size is 128 MB, so when a file is read, Spark automatically computes File_Size // 128 MB and divides the data into partitions accordingly. We can also change the partition size in Spark via config.
Very well explained! Thank you! 😊
Hi, partitionBy doesn't perform a shuffle. Will data move across nodes?
I meant when you repartition data.
Thank you for making it so simple to understand!
How can we distribute 8 GB of records evenly after filtering and joining it with another dataset? I see a different number of partitions in different stages of the last job in AppMaster when I perform the action of saving the df to CSV. Will this problem be solved by increasing the number of executors / executor memory / driver memory?
Hi... I'm sorry, I did not understand your question properly.
Spark uses default partitioning when it reads data from a file: it partitions the data based on the size of the file, creating a partition for each 128 MB of data.
Hi Sir, how is the hash code of a record decided in hash partitioning?
Spark decides the number of partitions based on a combination of factors: default parallelism (usually equal to the number of cores), the total number of files and the size of each file, and the max partition size (spark.sql.files.maxPartitionBytes, default 128 MB).
Given below are two scenarios.
a) 54 parquet files, 63 MB each, 10 cores, max partition size = 128 MB.
Total partitions = 54. Split size = 63 MB + 4 MB (openCostInBytes) = 67 MB, so only one split fits into one partition.
b) 54 parquet files, 38 MB each, 10 cores, max partition size = 128 MB.
Total partitions = 18. Split size = 38 MB + 4 MB (openCostInBytes) = 42 MB, so three splits fit into one partition (128/42).
Apart from this, if we set Spark's default parallelism very high, it will also affect the number of partitions and we would get different numbers for the above scenarios (will do the math later).
BTW, thanks for putting this series together, it's really helpful.
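A small pure-Python sketch reproducing the arithmetic in scenarios (a) and (b) above. This is a simplification of Spark's actual file-split packing, with openCostInBytes at its 4 MB default:

```python
import math

MAX_SPLIT_MB = 128  # spark.sql.files.maxPartitionBytes default, in MB
OPEN_COST_MB = 4    # spark.sql.files.openCostInBytes default, in MB

def estimated_partitions(num_files, file_size_mb):
    """How many partitions result from packing equal-sized files into
    128 MB partitions, charging 4 MB of 'open cost' per file."""
    split_mb = file_size_mb + OPEN_COST_MB          # effective cost of one file
    per_partition = max(1, MAX_SPLIT_MB // split_mb)  # files that fit in one partition
    return math.ceil(num_files / per_partition)

# Scenario a: 54 files of 63 MB -> 67 MB splits, 1 per partition
print(estimated_partitions(54, 63))  # 54
# Scenario b: 54 files of 38 MB -> 42 MB splits, 3 per partition
print(estimated_partitions(54, 38))  # 18
```

Real Spark also folds in default parallelism when computing the split size, so treat this as the back-of-envelope version of the two scenarios, not the full algorithm.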
We have to provide the number of partitions manually, say in repartition(), and then invoke partitionBy; otherwise Spark will take the default from spark.sql.shuffle.partitions, which is 200.
These are ways to enforce a number manually. Otherwise Spark will create one partition per core when it is writing a new file. In case Spark is reading a new file, it will be based on HDFS blocks.
Sir, is there any online lab (platform) for practicing big data / Hadoop for free? 👨🏻💻
You can use databricks community edition for practise
Spark decides the number of partitions based on the key. So if there are 4 kinds of keys, let us say x_1, y_1, z_1, t_1, then there will be 4 partitions of the file.
Very informative...
Partitions while reading a file = (total file size) / (128 MB)
But how will Spark decide which partitioner to choose?
That depends on the nature of the transformation... You can also force Spark to prefer a certain transformation.
@DataSavvy Can you elaborate on how to force Spark to prefer a transformation? Do we have any docs to dig deeper into that?
How is the hashcode decided?
The hashcode is calculated using a hash algorithm.
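A toy illustration of the idea: route each key to hash(key) % numPartitions. This is an assumed sketch; Spark's HashPartitioner uses the JVM hashCode (and Murmur3 for DataFrame hashing), not Python's hash():

```python
# Toy hash partitioner, NOT Spark's actual hash function.
def hash_partition(key, num_partitions):
    # Python's % already yields a nonnegative result for a positive modulus,
    # so negative hash values are handled automatically.
    return hash(key) % num_partitions

# Keys from the comment thread above, spread over 4 partitions
for k in ["x_1", "y_1", "z_1", "t_1"]:
    print(k, "-> partition", hash_partition(k, 4))
```

Note that 4 distinct keys do not guarantee 4 distinct partitions; two keys can hash to the same partition, which is exactly how key skew arises.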
You didn't explain the context correctly; I think you meant the shuffle partition strategy.
Spark will decide the number of partitions based on the number of blocks of the files.
The number of partitions depends on the total number of cores in the worker nodes.
This statement is not always true... Rather, that only represents how many parallel tasks can get executed.
Spark partitioning depends on the no. of cores.
Right... When Spark is writing data it depends on cores... What about when Spark is reading a new file?
@DataSavvy Spark normally sets the partitioning automatically based on the cluster. However, we can set the partitions manually.
@DataSavvy While reading the data --> (file size / block size (128 MB)). Kindly correct me if I am wrong.
The WhatsApp group is full and it kicked me out of the group.