Get Rid of Traditional ETL, Move to Spark! (Bas Geerdink)

  • Published: 5 Nov 2024

Comments • 43

  • @lordlee6473
    @lordlee6473 3 years ago +1

    You can also do ETL with Apache Camel. And under the hood, they are probably very similar. You obviously need connectors to various data sources/destinations, and whatever in-memory transformations to filter/map/reduce the data.
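Whatever the engine, the transform step this comment describes is the same filter/map/reduce shape. A minimal plain-Python sketch; the records and field names are invented for illustration:

```python
from functools import reduce

# Hypothetical raw records, as an extract step might yield them
rows = [
    {"user": "a", "amount": "10.5", "valid": True},
    {"user": "b", "amount": "3.0", "valid": False},
    {"user": "a", "amount": "2.5", "valid": True},
]

# Transform: filter out invalid rows, map the string amounts to floats,
# then reduce to a single total (the "load" here is just a print)
valid = filter(lambda r: r["valid"], rows)
amounts = map(lambda r: float(r["amount"]), valid)
total = reduce(lambda acc, x: acc + x, amounts, 0.0)

print(total)  # 13.0
```

Camel routes and Spark jobs both wrap this shape in connectors and distribution; the transformation core stays the same.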

  • @EonPeon
    @EonPeon 7 years ago +42

    No. No. No. No. No. No. No. A Data Warehouse has NOT become a data lake! A Data Warehouse is an architecture, with the data cleansed and structured ready for Business Intelligence. That is NOT a data lake!

    • @MrMisteronly
      @MrMisteronly 7 years ago +2

      By a data warehouse he means the traditional one using a star/snowflake schema, i.e. the logical data modeling methodology... I think

    • @valekm
      @valekm 7 years ago +3

      The speaker was trying to draw parallels between the big data domain and the traditional BI/DW environment. He did not say that one BECOMES the other; it's a figure of speech, like saying John becomes a redhat and Peter becomes a wolf and then we play a game.

    • @vchandm23
      @vchandm23 6 years ago +2

      Hahah... to me the data lake is the ODS layer of a DWH. I agree 100% with you.

    • @sanchitkumar9862
      @sanchitkumar9862 5 years ago +2

      I was actually searching for the comment where someone could point out that Data Lake and Data Warehouse are two different things. And it's not surprising the 1st comment says it. Data lake is not Data Warehouse. Period.

  • @mmaxmmccann
    @mmaxmmccann 8 years ago +2

    As someone new to ETL, thank you for speaking!

  • @mgrajkumar1
    @mgrajkumar1 6 years ago +9

    He is a scientist and has no clue about corporate data and the different forms it exists in. You end up doing ETL whether you use Spark or not. It's a choice whether you write a few thousand lines of code or use a 3rd-party application.

  • @fredt3727
    @fredt3727 7 years ago +3

    You need to select the right design and architectural pattern for your data platform, to match your own environment with regard to information systems complexity, maturity, data volumes, etc. One architectural pattern certainly doesn't fit all.

  • @daemeonreiydelle4783
    @daemeonreiydelle4783 7 years ago +1

    Fascinating perspective of just the issues (ETL, ECTL) using MR, Spark, Datasets vs. historic IBM/Oracle/Talend/etc. from schema (and flat files) into BW & BI

  • @MohammadHeydari
    @MohammadHeydari 5 years ago +1

    It was a great presentation about a modern approach to ETL based on a modern tool.

    • @st3ppenwolf
      @st3ppenwolf 5 years ago +2

      MapReduce has been around the block for a while... there's no silver bullet, just hard work

    • @SuperBhavanishankar
      @SuperBhavanishankar 4 years ago

      @@st3ppenwolf is MapReduce now obsolete?

  • @jocalvo
    @jocalvo 6 years ago +13

    NO! The KEY point is DATA ARCHITECTURE, not the trendy, fancy new platform!

    • @billsomen7953
      @billsomen7953 4 years ago +2

      Yes. That's just a trick for people who don't want to use their brains and understand the business use cases!

  • @datasherlock
    @datasherlock 4 years ago +2

    I think something like AWS Glue is a sweet middle ground between sluggish GUI-driven ETL and a hyper-agile technology like Spark

  • @SpiritOfIndiaaa
    @SpiritOfIndiaaa 6 years ago +1

    Thank you, but how does one use this for data migration from Oracle to Cassandra?
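One way to approach the Oracle-to-Cassandra question with Spark, sketched under assumptions: the Oracle JDBC driver and the spark-cassandra-connector must be on the classpath, and every host, credential, table, and keyspace name below is a placeholder, not a real configuration:

```python
def oracle_jdbc_url(host, port, service):
    """Build an Oracle thin-driver JDBC URL (host/service names hypothetical)."""
    return f"jdbc:oracle:thin:@//{host}:{port}/{service}"


def migrate(table, keyspace):
    # Requires pyspark plus the Oracle JDBC driver and the
    # spark-cassandra-connector package on the Spark classpath.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("oracle-to-cassandra")
             .config("spark.cassandra.connection.host", "cassandra-host")  # assumed host
             .getOrCreate())

    # Extract: read the Oracle table over JDBC
    df = (spark.read.format("jdbc")
          .option("url", oracle_jdbc_url("oracle-host", 1521, "ORCLPDB"))
          .option("dbtable", table)
          .option("user", "scott").option("password", "tiger")  # placeholders
          .load())

    # Load: write straight into a Cassandra table via the connector
    (df.write.format("org.apache.spark.sql.cassandra")
       .options(table=table.lower(), keyspace=keyspace)
       .mode("append").save())
```

Any transformation (renames, type casts, denormalizing joins to fit Cassandra's query-first modeling) would go between the read and the write.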

  • @JD-xd3xp
    @JD-xd3xp 6 years ago

    Spark is for batch processing and cannot be a data warehouse. ETL is not dead, but we are and will be seeing different forms of ETL happening.

    • @EnriiBC
      @EnriiBC 6 years ago

      @JD Spark is for batch processing??? NO
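For reference on this exchange: Spark has handled unbounded data via Structured Streaming since the 2.x line, using the same DataFrame API as batch. A minimal sketch, adapted from the standard word-count demo; the socket source, host, and port are demo-only assumptions:

```python
def word_counts_stream():
    # Structured Streaming: the same DataFrame API, but over an unbounded source.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

    # Socket source is for demos only; host/port are assumptions
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999)
             .load())

    # Split each incoming line into words and keep a running count
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Continuously print updated counts to the console
    return (counts.writeStream.outputMode("complete")
            .format("console").start())
```

So "batch only" undersells it, though the point that Spark itself is not a warehouse still stands: it is an engine, not a storage-and-serving layer.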

  • @rajeevdas7506
    @rajeevdas7506 3 years ago +1

    When you said that if you borrow from 20 it becomes 19, I think the 2 becomes a 1, since there is a 0 in the middle

  • @st3ppenwolf
    @st3ppenwolf 5 years ago

    This approach is interesting, but I feel it looks at the problem way too simplistically. The need to move data around is not capricious.

  • @aryyzjfroxirisrirod
    @aryyzjfroxirisrirod 5 years ago +1

    That was an amazing presentation and of course a great job by Bas, well done. Thanks.

  • @LeCoolCroco
    @LeCoolCroco 5 years ago

    So what’s the e2e solution with Spark?

  • @chanukyahere
    @chanukyahere 7 years ago +3

    Actually, ETL tools evolved because of the difficulty of understanding/debugging long pieces of code written in PL/SQL (I remember those days, 10 years ago). The boxes, arrows, and clicks help developers/analysts understand the flow of the data from point to point, fragment to fragment, etc. Now that there are better algorithms for faster processing, we have MR and Spark, and we go back to the original way of doing ETL with a piece of code?! Yes, I agree that Spark data processing is much faster, but what if there is a continuous stream of data (let's say streaming data ingested into HDFS through Kafka, with ETL handled on top of it by joining with the existing data marts)? Any thoughts on whether we would lose that data-flow, fragment-by-fragment understanding? We have data continuously coming in through many systems (with IoT in place and all the tracking systems going digital...)

    • @clray123
      @clray123 6 years ago +2

      Flow charts have always been bs, since their inception in the 70s up to this day. And the simple reason is that once in a graphical hell you cannot easily perform most common editing tasks like copy-paste, diffs, patches, text-based version control, full-text search etc. The information density of a flow chart is also really low compared to a piece of text. So only morons use flow charts, but then that's whom these expensive tools are marketed to.
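The Kafka-to-HDFS scenario @chanukyahere raises above maps onto a stream-static join in Spark Structured Streaming: each micro-batch of the stream is joined against a batch-loaded data mart. A hedged sketch; the broker address, topic, paths, and column names are illustrative assumptions:

```python
def enrich_clickstream():
    # Join a Kafka stream with a static dimension table, then land to HDFS.
    # Requires pyspark plus the spark-sql-kafka package on the classpath.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-etl").getOrCreate()

    # Extract: consume the (assumed) "events" topic
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(key AS STRING) AS user_id",
                          "CAST(value AS STRING) AS payload"))

    # Existing data mart loaded as a static (batch) DataFrame
    users = spark.read.parquet("hdfs:///marts/users")

    # Transform: stream-static join enriches each micro-batch against the mart
    enriched = events.join(users, on="user_id", how="left")

    # Load: append enriched records to the lake, with checkpointing
    return (enriched.writeStream.format("parquet")
            .option("path", "hdfs:///lake/enriched")
            .option("checkpointLocation", "hdfs:///lake/_chk")
            .start())
```

Whether this code is more or less legible than a box-and-arrow canvas is exactly the dispute in this thread; it is at least diffable and version-controllable.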

  • @antonyrajarathinam9976
    @antonyrajarathinam9976 3 years ago

    Good one☝️

  • @alexdeguzman4273
    @alexdeguzman4273 6 years ago +1

    As someone, thank you!

  • @sahilbhange
    @sahilbhange 6 years ago +1

    No Ab Initio ETL tool in the list

  • @pankajbagzai5085
    @pankajbagzai5085 6 years ago

    Attend this webinar on Oct 17 to learn more www.streamanalytix.com/webinar/apache-spark-the-new-enterprise-backbone-for-etl-batch-and-real-time-streaming/?Apache+Spark-+The+New+Enterprise+Backbone+for+ETL%2C+Batch+and+Real-+time+Streaming

  • @eitanmizrahi7037
    @eitanmizrahi7037 7 years ago

    Very good lecture.
    Indeed, ETL is an old-fashioned methodology
    that does not fit the pace of technology growth and is too expensive and time-consuming.
    There are reasons why ETL exists, and one of them is that it can transform an old data architecture into other systems.
    The lecture doesn't show a solution for the old architecture; it replaces the old architecture with a brand new one, which is not possible in most cases. The old architecture will still be an old one, and Spark cannot change this easily.
    What I assume is that there will be a built-in structure within new versions of databases (like Oracle) that uses some of the methodology that Spark does.
    My company, where I am the CTO, invented a solution for that gap, one that makes ETL better even when using old DB methodologies (and more), but I also assume that many will try to get rid of the old-fashioned ETL methodology.

  • @33davethewave
    @33davethewave 5 years ago +1

    "ETL Hell" - CSV files are not type-safe. Interesting that your sample code writes the data to CSV files
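The type-safety point is easy to demonstrate: CSV carries no schema, so every value round-trips as a string (unlike, say, Parquet, which stores column types). A small stdlib-only illustration:

```python
import csv
import io

# Write a row containing an int and a bool...
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["amount", "valid"])
writer.writerow([42, True])

# ...and read it back: every value comes back as a plain string
buf.seek(0)
row = next(csv.DictReader(buf))

print(type(row["amount"]), row["valid"])  # <class 'str'> True
```

Reconstructing the original types is then the reader's problem, which is why typed columnar formats are the usual recommendation for intermediate Spark output.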

  • @KC-zn4gt
    @KC-zn4gt 6 years ago +4

    Just wasted 32:17 of my time watching this guy talking nonsense. He has no idea of what ETL is doing for Data Warehouses.

    • @jlcotton19681
      @jlcotton19681 5 years ago +1

      Well, you sat through the entire presentation, so obviously something kept you from leaving. It doesn't take a whole day to recognize sunshine, so to speak. :)

  • @smallclips4164
    @smallclips4164 4 years ago

    Lol!! So you are still doing ETL but don't want to admit it.
    So all that grudge is because you don't want to use an ETL tool?
    There is a reason why both ETL and ELT have a Transformation step. With columnar databases people now tend to do ELT, but I still believe that ETL can save you lots of money in storage and post-processing, plus give a huge advantage in ad-hoc reporting.
    ELT and ETL have their own use cases, so this video is crap.

  • @nguyen4so9
    @nguyen4so9 7 years ago +3

    This guy has no idea about Enterprise Data Warehouse & Business Intelligence inside a corporation. The integrity of that information is worth billions of US dollars. Hadoop/BigData/Spark, whatever, they are NOT for Enterprise Business Intelligence at production scale at all.

    • @tubephr34k
      @tubephr34k 7 years ago +2

      That's a rather broad and overreaching statement to make, in my opinion. That might be what is 'critical' for the companies you've worked for, but many industries and massive corporations have problems where they give up a fraction of a percent of accuracy for speed and focus on consistency instead.
      If that statement were accurate, no big business would use MongoDB or other non-ACID-compliant database management systems.