Apache Spark vs. Apache Flink - Looking At How Different Companies Approach Spark And Flink

  • Published: 6 Jun 2024
  • As data increased in volume, velocity, and variety, so did the need for tools that could process and manage those larger data sets arriving at ever faster speeds.
    As a result, frameworks such as Apache Spark and Apache Flink became popular due to their ability to handle big data processing in a fast, efficient, and scalable manner.
    But it can often be difficult to understand which use cases are best suited for Spark, which for Flink, and which might be suited for both.
    In this article, we’ll discuss some of the unique benefits of both Spark and Flink, help you understand the differences between the two, and go over real use cases, including ones where the engineers were deciding between Spark and Flink.
    Also, thank you to this video's sponsor, DeltaStream. You can try them out for free here - www.deltastream.io/trial/?utm...
    If you enjoyed this video, check out some of my other top videos.
    Top Courses To Become A Data Engineer In 2022
    • Top Courses To Become ...
    What Is The Modern Data Stack - Intro To Data Infrastructure Part 1
    • What Is The Modern Dat...
    If you would like to learn more about data engineering, then check out Google's GCP certificate
    bit.ly/3NQVn7V
    If you'd like to read up on my updates about the data field, then you can sign up for our newsletter here.
    seattledataguy.substack.com/
    Or check out my blog
    www.theseattledataguy.com/
    And if you want to support the channel, then you can become a paid member of my newsletter
    seattledataguy.substack.com/s...
    Tags: Data engineering projects, Data engineer project ideas, data project sources, data analytics project sources, data project portfolio
    _____________________________________________________________
    Subscribe: / @seattledataguy
    _____________________________________________________________
    About me:
    I have spent my career focused on all forms of data. I have focused on developing algorithms to detect fraud, reduce patient readmissions, and redesign insurance provider policies to help reduce the overall cost of healthcare. I have also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. I privately consult on data science and engineering problems, both solo and with a company called Acheron Analytics. I have experience both working hands-on with technical problems and helping leadership teams develop strategies to maximize their data.
    *I do participate in affiliate programs. If a link has an "*" by it, then I may receive a small portion of the proceeds at no extra cost to you.

Comments • 19

  • @danhorus
    @danhorus 18 days ago +2

    13:03: In Spark, we avoid Python UDFs like the plague because they're much slower than native Spark code. I wonder if the same is true for Flink, given that it also runs on the JVM. A quick Google search indicates that vectorized UDFs are a thing in Flink too, so I assume the same limitations apply.
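
For anyone curious what the vectorized route looks like on the Flink side, here is a minimal PyFlink sketch of a pandas-based (vectorized) UDF. It assumes the Table API with the pandas/pyarrow extras installed; the table and column names are made up for illustration.

```python
import pandas as pd
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import col
from pyflink.table.udf import udf

# Vectorized (pandas) UDF: Flink passes whole pandas.Series batches
# instead of invoking the function once per row.
@udf(result_type=DataTypes.DOUBLE(), func_type="pandas")
def with_tax(amount: pd.Series) -> pd.Series:
    return amount * 1.1

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Small in-memory table used purely to exercise the UDF
orders = t_env.from_elements([(1, 10.0), (2, 25.5)], ["id", "amount"])
result = orders.select(col("id"), with_tax(col("amount")).alias("amount_with_tax"))
result.execute().print()
```

As with Spark's Pandas UDFs, the benefit comes from moving data between the JVM and the Python worker in Arrow batches rather than row by row.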

    • @SeattleDataGuy
      @SeattleDataGuy  18 days ago +1

      Thanks for the added context! It's much appreciated. Now I'm wondering if I have ever had a good experience with a UDF 🤣. I always remember touting them, but even in the one case where I do recall trying them out on SQL Server, we found them slow.

    • @danhorus
      @danhorus 18 days ago +1

      @@SeattleDataGuy With Spark, there are several ways to write transformations. By far, the best option is to use native Spark functions, as they compile to highly optimized and parallelized Java bytecode. The second-best option is to write UDFs in Scala or Java, as everything still runs in the same JVM. The third-best option, in case you want/need to use Python, is to write a vectorized UDF (also known as a Pandas UDF), which leverages Apache Arrow to move data between the JVM and the Python interpreter in batches. Finally, as a last resort, you can use regular Python UDFs; however, they're a lot slower because they basically compute results row by row rather than in big batches. If you have slow Spark jobs using Python UDFs, refactoring them is usually a good way to gain some performance. About this blog post, I'm not sure the author is aware of this limitation, but if they need this code to run very, very fast, they should probably avoid Python UDFs too.

    • @danhorus
      @danhorus 18 days ago +1

      @@SeattleDataGuy I wrote a long comment about the different types of UDFs in Spark, but apparently YouTube decided to delete it. Maybe you'll find it marked as spam, lol

    • @SeattleDataGuy
      @SeattleDataGuy  18 days ago

      @@danhorus Did you put a URL in it? That seems to be the main reason I have seen YouTube mark things as spam. I'll look

    • @danhorus
      @danhorus 18 days ago +2

      Not really, but let's try again, haha. In Spark, there are many ways to apply data transformations. By far the best option is to use native Spark functions, as they compile to highly optimized/parallelized Java bytecode. The second-best option to maximize performance is to use Scala or Java UDFs, as they run inside the JVM with a minor performance hit. The third option, if you want/need to use Python, is to write a vectorized UDF (also known as a Pandas UDF), which leverages Apache Arrow to transfer big batches of records to the Python interpreter and back to the JVM after processing. Finally, the last option you should consider is the regular Python UDF, as it basically transforms row by row and has much worse performance as a result. If you have a slow Spark job, refactoring Python UDFs can make it a lot faster. I'm not sure the authors of the blog post are aware of this, but they can probably make their code faster too
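
To make the ranking described in the thread above concrete, here is a minimal PySpark sketch that expresses the same transformation three ways: a native Spark function, a vectorized (Pandas) UDF, and a regular Python UDF. The DataFrame and column names are made up for illustration.

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-comparison").getOrCreate()
df = spark.range(1_000_000).withColumn("amount", F.rand() * 100)

# 1) Native Spark function: optimized by Catalyst, runs entirely in the JVM
native = df.withColumn("with_tax", F.col("amount") * 1.1)

# 2) Vectorized (Pandas) UDF: Arrow ships whole batches between the JVM and Python
@pandas_udf(DoubleType())
def with_tax_vectorized(amount: pd.Series) -> pd.Series:
    return amount * 1.1

vectorized = df.withColumn("with_tax", with_tax_vectorized("amount"))

# 3) Regular Python UDF: serializes and evaluates one row at a time, usually the slowest
@udf(DoubleType())
def with_tax_row(amount: float) -> float:
    return amount * 1.1

row_by_row = df.withColumn("with_tax", with_tax_row("amount"))
```

Options 1 and 2 stay in the JVM or move data in Arrow batches; option 3 pays a per-row serialization cost, which is exactly the overhead the comment warns about.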

  • @osoucy
    @osoucy 15 days ago +2

    To me, one of the main benefits of Spark Structured Streaming is that you can easily switch between near real-time (micro-batch) and scheduled batch processing without having to rewrite a single line of code. This is a very effective way of scaling up and down and balancing cost vs. latency.
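
A small PySpark sketch of what that switch can look like in practice: the transformation is written once, and only the trigger changes between continuous micro-batching and a run-until-caught-up batch job. The source path, schema, and sink locations below are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-or-batch").getOrCreate()

# Transformation logic written once and shared by both modes
def enrich(df):
    return df.withColumn("ingested_at", F.current_timestamp())

# Hypothetical JSON file source
events = (spark.readStream
          .schema("id LONG, value DOUBLE")
          .json("/data/events/"))

writer = (enrich(events).writeStream
          .format("parquet")
          .option("checkpointLocation", "/chk/events")
          .option("path", "/tables/events"))

# Near real-time: kick off a new micro-batch roughly every 30 seconds
# query = writer.trigger(processingTime="30 seconds").start()

# Scheduled batch: process whatever has accumulated, then stop (Spark 3.3+)
query = writer.trigger(availableNow=True).start()
query.awaitTermination()
```

Because the query definition is identical in both cases, the same job can run as an always-on stream or be launched on a schedule, which is the cost vs. latency trade-off described above.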

  • @richardmartin6605
    @richardmartin6605 9 days ago +2

    Would love to see article reviews!

  • @DataPains
    @DataPains 17 days ago +1

    Great video! Thank you for sharing!

  • @damien__j
    @damien__j 18 days ago +1

    Great video thanks!

  • @jace743
    @jace743 18 days ago +3

    I’d watch if you did live article reviews!

    • @SeattleDataGuy
      @SeattleDataGuy  18 days ago +2

      Yeah! From watching other creators do it, I think I really gotta slow down to do it well.

  • @knkootbaoat6759
    @knkootbaoat6759 18 days ago +5

    Gotta make things complex, otherwise we wouldn't get paid as much. I half joke. We don't make it complex; it's just that situations are inherently complex.