Apache Iceberg Tutorial for Beginners: Understanding Copy-on-write and Merge-on-read

  • Published: 29 Sep 2024

Comments • 17

  • @sukulmahadik0303
    @sukulmahadik0303 7 months ago +1

    [Notes Part 2]
    Setting the table for COW or MOR: when to use which write mode? (A sketch follows below.)
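
    A minimal sketch of choosing COW or MOR per write operation, assuming Spark SQL
    with an Iceberg catalog already configured; the catalog and table names here
    (my_catalog.db.events) are hypothetical:

        # Hypothetical names; assumes an existing SparkSession `spark` with an
        # Iceberg catalog configured.
        spark.sql("""
            ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
                'write.delete.mode' = 'merge-on-read',  -- DELETEs write delete files
                'write.update.mode' = 'merge-on-read',  -- UPDATEs write delete files plus new data files
                'write.merge.mode'  = 'copy-on-write'   -- MERGE rewrites the affected data files
            )
        """)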

  • @zayedet1637
    @zayedet1637 1 year ago +1

    What is actually done is append-on-write, not copy-on-write, because the file is written elsewhere and only the pointer changes to the file with the new row.

  • @kenhung8333
    @kenhung8333 3 months ago

    Awesome video!!
    At 3:18, when explaining the different delete formats, I have a question about the implementation:
    since the delete mode only accepts MOR or COW, how exactly do I specify whether a delete operation uses an equality delete or a positional delete?

    • @Dremio
      @Dremio  3 months ago +2

      It's mainly based on the engine; most engines will use position deletes, but streaming platforms like Flink will use equality deletes to keep write latency to a minimum.
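
      A quick way to check which delete format an engine actually produced is the
      table's delete_files metadata table, assuming Spark with Iceberg; the table
      name is hypothetical. In Iceberg metadata, content 1 marks position deletes
      and content 2 marks equality deletes:

          # Hypothetical table name; assumes an existing SparkSession `spark`.
          spark.sql("""
              SELECT content, file_path, record_count
              FROM my_catalog.db.events.delete_files
          """).show(truncate=False)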

  • @xabrielcollazomojica3939
    @xabrielcollazomojica3939 2 years ago +2

    Great explanation! Thank you for this video!

  • @cw5948
    @cw5948 2 years ago +2

    Very helpful! Thanks for also explaining the two types of delete files.

  • @SayedElhewihey
    @SayedElhewihey 7 months ago

    Thanks, Alex, for the great explanation.
    It is not clear to me what delete files contain when an UPDATE statement is issued against a table.
    Will the delete files have a post-image of the rows, for example, or what will happen?
    Thanks

    • @Dremio
      @Dremio  7 months ago +1

      For an update, the delete file will reference the deleted old version of the row. The new version of the row will be in a new data file.
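
      A minimal sketch illustrating that reply, assuming the table uses
      merge-on-read for updates; the table and column names are hypothetical:

          # Hypothetical names; assumes write.update.mode = 'merge-on-read'.
          spark.sql("UPDATE my_catalog.db.events SET status = 'closed' WHERE id = 42")
          # The new snapshot now holds:
          #   - a delete file marking the old version of row id = 42 as deleted
          #   - a new data file containing the new version of that row
          # Readers merge the two at query time.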

  • @ashmkrgao
    @ashmkrgao 1 year ago +1

    Which version of Spark supports delete files?

  • @shyjukoppayilthiruvoth6568
    @shyjukoppayilthiruvoth6568 1 year ago +1

    Hi Alex,
    Very good content and explanation.

  • @galeop
    @galeop 8 months ago

    1:40 Why do you say that, in Hive, updating a row would imply re-writing all the files composing the affected partition? Why not just the Parquet file that contains the updated row? I mean, why would the other Parquet files in the partition have to be re-written?

    • @Dremio
      @Dremio  8 months ago

      If you directly update the single file, that's fine, but the Hive metastore tracks tables and partitions, not single files. So if I run an update query against Hive, it isn't aware of the file that needs updating, just the partition, so it rewrites that partition and then swaps out the reference in the metastore to the location of the new version of the partition (see the sketch after this thread). - Alex

    • @galeop
      @galeop 8 months ago +1

      @@Dremio Thanks!
      Wow, I had not realised Hive was that inefficient! So if I update a single row, all the Parquet files composing the partition will be re-written, even though only one Parquet file should be affected. Correct?

    • @Dremio
      @Dremio  8 months ago +1

      @@galeop I wouldn't say it is inefficient; it just wasn't originally designed for the same purposes. Hive was mainly trying to figure out how to define a table for the SQL -> MapReduce functionality. A lot of the problems and bottlenecks didn't become apparent until later, which is why formats like Iceberg were invented.
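
      A sketch of the Hive-era pattern Alex describes above, where an "update" is
      simulated by rewriting the whole partition; the table, partition, and column
      names are hypothetical:

          # Hypothetical names; classic Hive-style static partition overwrite.
          spark.sql("""
              INSERT OVERWRITE TABLE db.events PARTITION (event_date = '2024-09-29')
              SELECT id,
                     CASE WHEN id = 42 THEN 'closed' ELSE status END AS status,
                     payload
              FROM db.events
              WHERE event_date = '2024-09-29'
          """)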

  • @peterconnolly3990
    @peterconnolly3990 1 year ago +1

    Thanks for putting this presentation together, it's a great overview.
    It's not clear from the video: how do we specify position versus equality deletes?

    • @AlexMercedCoder
      @AlexMercedCoder 1 year ago +1

      There isn't a particular way for Spark; it just uses position deletes. The only situation where I think you can use equality deletes currently is in Flink for streaming, which you'd then clean up via compaction.
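
      A sketch of that compaction step, assuming Spark with the Iceberg SQL
      extensions enabled; the catalog and table names are hypothetical.
      rewrite_data_files rewrites data files and drops the delete files that
      applied to them:

          # Hypothetical names; assumes Iceberg's Spark SQL extensions are enabled.
          spark.sql("""
              CALL my_catalog.system.rewrite_data_files(
                  table => 'db.events',
                  options => map('delete-file-threshold', '1')
              )
          """)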