Spark Optimization | Bucket Pruning in Spark with Demo | Session-3 | LearntoSpark

  • Published: 22 Dec 2024

Comments • 27

  • @vsandeep06
    @vsandeep06 4 years ago +2

    Thanks for the videos. What is the difference between bucketBy and partitionBy? When should we use bucketing and when partitioning? Can you create a video on Z-order optimization?

    • @AzarudeenShahul
      @AzarudeenShahul 4 years ago +2

      Both are optimization techniques. A minimal sketch of the contrast follows below.
      partitionBy -> creates folder-level parts; choose it when the column has a small, finite number of unique values, e.g., date, year, countryCode.
      bucketBy -> creates file-level parts; choose it for a column with an effectively unbounded number of unique values, e.g., id, customerId, productId.
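
      A minimal PySpark sketch of that contrast (names are illustrative, not from the video):

          from pyspark.sql import SparkSession
          from pyspark.sql import functions as F

          spark = SparkSession.builder.getOrCreate()
          df = (spark.range(0, 1000)   # "id" is high-cardinality
                    .withColumn("countryCode",
                                F.when(F.col("id") % 2 == 0, "US").otherwise("IN")))

          # partitionBy: one folder per distinct value; suits low-cardinality columns
          df.write.mode("overwrite").partitionBy("countryCode").parquet("/tmp/partition_demo")

          # bucketBy: a fixed number of files hashed on a high-cardinality column;
          # bucketBy requires saveAsTable rather than a plain path write
          (df.write.mode("overwrite")
              .bucketBy(4, "id")
              .sortBy("id")
              .saveAsTable("bucket_demo"))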

  • @venkatasai4293
    @venkatasai4293 2 years ago

    Let us assume we created 4 buckets for two tables, with the first bucketed table on one node and the second bucketed table on another. When we perform a join on the bucketed tables, will there still be a shuffle?
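
    One way to check this (a sketch, with illustrative table names): join two tables bucketed on the join key with the same bucket count, and look for an Exchange (shuffle) operator in the physical plan.

        t1 = spark.table("bucket_demo1")
        t2 = spark.table("bucket_demo2")
        # with matching bucketing on the join key, no Exchange should appear
        # on either side of the SortMergeJoin in the printed plan
        t1.join(t2, "id").explain()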

  • @SpiritOfIndiaaa
    @SpiritOfIndiaaa 4 years ago +1

    Thank you so much... how are you getting API help/prompts in the Databricks notebook?

    • @AzarudeenShahul
      @AzarudeenShahul 4 years ago +1

      For IntelliSense to work in Databricks, the cluster should be in a running state; press Tab for prompts.

    • @SpiritOfIndiaaa
      @SpiritOfIndiaaa 4 years ago

      @AzarudeenShahul Thanks a lot, but it is not giving full details like arguments/params.

  • @MrManish389
    @MrManish389 4 years ago +1

    How do you calculate the number of partitions required for 10 GB of data, and how does that relate to repartition and coalesce? Please help.
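
    A rough rule of thumb (an assumption, not from the video) is to target about 128 MB per partition, matching the default spark.sql.files.maxPartitionBytes:

        data_size_mb = 10 * 1024                      # 10 GB expressed in MB
        target_mb = 128                               # common target partition size
        num_partitions = data_size_mb // target_mb    # -> 80 partitions

        # repartition(n) does a full shuffle and can raise or lower the count;
        # coalesce(n) only merges existing partitions (no shuffle), so it can only lower it
        df = spark.range(0, 10_000_000)
        df = df.repartition(80).coalesce(10)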

  • @anilreddy3110
    @anilreddy3110 4 years ago

    Nice explanation. How can we insert data into a bucketed table?
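
    A hedged sketch of one way to append to an existing bucketed table, assuming it was created with the same bucket spec (new_rows and the table name are illustrative):

        new_rows = spark.range(1000, 2000)   # placeholder for the data to insert
        (new_rows.write
            .bucketBy(4, "id")
            .sortBy("id")
            .mode("append")
            .saveAsTable("bucket_demo"))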

  • @IlseZubieta
    @IlseZubieta 9 months ago

    Hi Azarudeen, great video as usual.
    On line df.coalesce(1).write.bucketBy(4,'id').sortBy('id').mode('overwrite').saveAsTable('bucket_demo1')
    I get the following error on Databricks Community Edition: AnalysisException: Operation not allowed: `Bucketing` is not supported for Delta tables.
    Could you please help me troubleshoot?

    • @dubakatriloknavya4736
      @dubakatriloknavya4736 9 months ago

      Hi @IlseZubieta, were you able to solve this? I am getting the same error. @Azarudeen, any help is appreciated. Thank you so much.
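
      A possible workaround (an assumption, not from the video): newer Databricks runtimes default saveAsTable to Delta, which rejects bucketBy, so forcing the Parquet format restores the behavior shown in the video.

          df = spark.range(0, 100)      # placeholder for the video's DataFrame
          (df.coalesce(1).write
              .format("parquet")        # override the Delta default
              .bucketBy(4, "id")
              .sortBy("id")
              .mode("overwrite")
              .saveAsTable("bucket_demo1"))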

  • @guptaashok121
    @guptaashok121 4 years ago +1

    What is the difference between saveAsTable and a temp view? We are using temp views extensively for joining and transforming. Is that recommended?

    • @AzarudeenShahul
      @AzarudeenShahul 4 years ago +1

      saveAsTable persists the data for future use: even after the SparkSession ends, the table data remains, whereas a temp view lasts only for that session. Temp views are mainly useful for people who know SQL well.
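
      A minimal sketch of that difference (names are illustrative):

          df = spark.range(0, 100)

          # saveAsTable writes the data to disk via the metastore; it survives the session
          df.write.mode("overwrite").saveAsTable("persisted_demo")

          # a temp view is just a named query bound to this SparkSession; nothing is written
          df.createOrReplaceTempView("temp_demo")
          spark.sql("SELECT COUNT(*) FROM temp_demo").show()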

    • @guptaashok121
      @guptaashok121 4 years ago

      @AzarudeenShahul How do we remove the table once we are done, if we use saveAsTable?

    • @AtifImamAatuif
      @AtifImamAatuif 4 years ago +1

      @guptaashok121 saveAsTable creates a physical table, not a view; it is created on disk, not in memory. So if your use case is to discard the table after the process is over, use a temp view instead.
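
      A short sketch of the cleanup options described above (names are illustrative):

          spark.sql("DROP TABLE IF EXISTS persisted_demo")   # drops a saveAsTable table from disk and the metastore
          spark.catalog.dropTempView("temp_demo")            # drops a temp view from the current session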

    • @AzarudeenShahul
      @AzarudeenShahul 4 years ago

      Thanks Atif; hope this answers your question, Ashok.

    • @guptaashok121
      @guptaashok121 4 years ago

      @AzarudeenShahul Yup, then we are doing the right thing.

  • @jittendrakumar3908
    @jittendrakumar3908 3 years ago

    Great

  • @Shiva-kz6tn
    @Shiva-kz6tn 4 years ago

    Does bucketBy work with file formats (CSV, Parquet) other than tables?

    • @AzarudeenShahul
      @AzarudeenShahul 4 years ago +1

      bucketBy is a method of DataFrameWriter, so since Spark 2.1.0 it is available for all file formats; note that it works only through saveAsTable, not a plain path-based save().
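
      A short sketch of that point (table name is illustrative): bucketBy accepts different underlying formats, but only through saveAsTable; a plain path-based save() raises an AnalysisException.

          df = spark.range(0, 100)      # placeholder DataFrame
          (df.write
              .format("csv")
              .option("header", "true")
              .bucketBy(4, "id")
              .mode("overwrite")
              .saveAsTable("bucket_csv_demo"))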

  • @vermad6233
    @vermad6233 4 years ago

    How to do this in Scala?

  • @saranyarasamani5245
    @saranyarasamani5245 3 years ago

    df1.coalesce(1).write.bucketBy(4,"id").sortBy("id").mode("overwrite").saveAsTable("buckettable")
    The above statement, which is the same as in your video, is throwing an error, bro.
    Error:
    AnalysisException: Cannot convert bucketing with sort columns to a transform: 4 buckets, bucket columns: [id], sort columns: [id]
    Please help me with this.

    • @IlseZubieta
      @IlseZubieta 9 months ago

      On the same command I got the following error: Operation not allowed: `Bucketing` is not supported for Delta tables

    • @dubakatriloknavya4736
      @dubakatriloknavya4736 9 months ago

      Hi @saranyarasamani5245, were you able to resolve this?

  • @data5508
    @data5508 4 years ago +1

    How come the partition count is 8?

    • @AzarudeenShahul
      @AzarudeenShahul 4 years ago +1

      By default, in Databricks with no worker nodes and only the driver running in the cluster (like local mode), you get 8 partitions. To decrease this, you need to provide the number of partitions while reading.
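
      A hedged way to see and control this yourself:

          print(spark.sparkContext.defaultParallelism)      # e.g. 8 on an 8-core driver-only cluster

          df = spark.range(0, 1_000_000, numPartitions=4)   # set the count at creation/read time
          print(df.rdd.getNumPartitions())                  # -> 4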

  • @iotmails9519
    @iotmails9519 4 years ago

    Please share the code with your video so we can do hands-on... and the videos are in 480p; please make them in 720p.

    • @AzarudeenShahul
      @AzarudeenShahul 4 years ago

      Hi, all my videos are in 1080p; you can change that in Settings -> Quality. If you play the video from Facebook, I guess the quality is limited to 480p. Try once with the YouTube app.