You can also do ETL with Apache Camel. And under the hood, they are probably very similar. You obviously need connectors to various data sources/destinations, and whatever in memory transformations to filter/map/reduce the data.
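To make the "connectors plus in-memory filter/map/reduce" point concrete, here is a minimal sketch in plain Python. It is not Camel or Spark code; the `extract` function stands in for a real connector, and the field names (`customer`, `amount`) are made up for illustration.

```python
from functools import reduce


def extract():
    # Hypothetical in-memory "source"; a real pipeline would call a connector here.
    return [
        {"customer": "a", "amount": "10.5"},
        {"customer": "b", "amount": "n/a"},
        {"customer": "a", "amount": "4.5"},
    ]


def is_valid(row):
    # Filter step: keep only rows whose amount parses as a number.
    try:
        float(row["amount"])
        return True
    except ValueError:
        return False


def to_typed(row):
    # Map step: convert the raw string amount into a float.
    return {"customer": row["customer"], "amount": float(row["amount"])}


def add_to_totals(totals, row):
    # Reduce step: accumulate a running total per customer.
    totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


def run_etl(rows):
    valid = filter(is_valid, rows)
    typed = map(to_typed, valid)
    return reduce(add_to_totals, typed, {})


totals = run_etl(extract())
print(totals)
```

Whatever the tool, the shape is the same: a source, a chain of filter/map/reduce steps, and a destination (here just a dictionary).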
As someone new to ETL, thank you for speaking!
No. No. No. No. No. No. No. A Data Warehouse has NOT become a data lake! A Data Warehouse is an architecture, with the data cleansed and structured ready for Business Intelligence. That is NOT a data lake!
By a data warehouse he means the traditional one using a star/snowflake schema; he means the logical data modeling methodology... I think.
The speaker was trying to draw parallels between the big data domain and the traditional BI/DW environment. He did not say that one BECOMES another, it's a verbal method to say that John becomes a redhat and Peter becomes a wolf and then we play a game.
Hahah... to me a data lake is the ODS layer of a DWH. I agree 100% with you.
I was actually searching for the comment where someone could point out that Data Lake and Data Warehouse are two different things. And it's not surprising the 1st comment says it. A data lake is not a data warehouse. Period.
It was a great presentation about a modern approach to ETL based on a modern tool.
MapReduce has been around the block for a while... there's no silver bullet, just hard work.
@st3ppenwolf Is MapReduce now obsolete?
Fascinating perspective of just the issues (ETL, ECTL) using MR, Spark, Datasets vs. historic IBM/Oracle/Talend/etc. from schema (and flat files) into BW & BI
Good one☝️
Thank you, but how would one use this for data migration from Oracle to Cassandra?
You need to select the right design and architectural pattern for your data platform to match your own environment with regard to information systems complexity, maturity, data volumes, etc. One architectural pattern certainly doesn't fit all.
As someone, thank you!
He is a scientist and has no clue about corporate data and the different forms it exists in. You end up doing ETL whether you use Spark or not. It's a choice whether you write a few thousand lines of code or use a 3rd party application.
That was an amazing presentation and of course a great job by Gas, well done. Thanks.
I think something like AWS Glue is a sweet middle ground between sluggish GUI driven ETL and a hyper agile technology like Spark
NO! The KEY point is DATA ARCHITECTURE, not the trendy, fancy new platform!
Yes. That's just a trick for people who don't want to use their brain and understand the business use cases!
So what’s the e2e solution with Spark?
You mean enterprise to enterprise?
When you said that if you borrow from 20 it becomes 19 but I think that the 2 becomes a 1 since there is 0 in the middle
Actually, ETL tools evolved because of the difficulty of understanding and debugging long pieces of code written in PL/SQL (I remember those days, 10 years ago). The boxes, arrows, and clicks help developers and analysts understand the flow of the data from point to point, fragment to fragment, etc. Now that there are better algorithms for faster processing, we have MR and Spark, and we are going back to the original way of doing ETL with a piece of code?! Yes, I agree that Spark data processing is much faster, but what if there is continuous streaming of data (let's say streaming data ingested into HDFS through Kafka, with the ETL handled on top of it by joining with the existing data marts)? Has there been any thought about whether we would lose that understanding of the data flow, fragment by fragment? We have data continuously coming through many systems (with IoT in place and all the tracking systems going digital...).
Flow charts have always been bs, since their inception in the 70s up to this day. And the simple reason is that once in a graphical hell you cannot easily perform most common editing tasks like copy-paste, diffs, patches, text-based version control, full-text search etc. The information density of a flow chart is also really low compared to a piece of text. So only morons use flow charts, but then that's whom these expensive tools are marketed to.
Spark is for batch processing and cannot be a data warehouse. ETL is not dead, but we are seeing, and will keep seeing, different forms of ETL.
J D Spark is for batch processing??? NO
This approach is interesting, but I feel it looks at the problem way too simplistically. The need to move data around is not capricious.
No Ab Initio ETL tool in the list
I also looked for Ab Initio in the list.
Very good lecture.
Indeed, ETL is an old phasion methodology
which does not fit the growth of technology, being too expensive and time consuming.
There are reasons ETL exists, and one of them is that it can transform an old data architecture into other systems.
The lecture doesn't show a solution for the old architecture, but replaces the old architecture with a brand new one, which is not possible in most cases. The old architecture will still be an old one, and Spark cannot change this easily.
What I assume is that there will be a built-in structure within new versions of databases (like Oracle) that uses some of the methodology that Spark does.
My company, of which I am the CTO, invented a solution for that gap that makes ETL better, even using old DB methodologies (and more), but I also assume that many will try to get rid of the old phasion ETL methodology.
what do you phasion for fashion? 🤔
Just wasted my 32:17 of time watching this guy talking nonsense. He has no idea of what ETL is doing for Data Warehouses.
Well, you sat through the entire presentation, so obviously something kept you from leaving. It doesn't take a whole day to recognize sunshine, so to speak. :)
"ETL Hell" - CSV files are not type-safe. Interesting that you write the data in your sample code to CSV files
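For anyone wondering what "not type-safe" means in practice, here is a small sketch using only Python's standard `csv` module (not the speaker's Spark code). The column names and the `SCHEMA` mapping are hypothetical; the point is only that every value comes back as a string unless you re-apply a schema yourself.

```python
import csv
import io

# Typed rows as they might exist in memory before being written out.
rows = [{"id": 1, "price": 9.99, "active": True}]

# Write to an in-memory CSV "file".
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["id", "price", "active"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the CSV format carries no type information,
# so every field is returned as a plain string.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back)

# Hypothetical schema that the reader has to supply out of band.
SCHEMA = {"id": int, "price": float, "active": lambda s: s == "True"}
typed = [{key: SCHEMA[key](value) for key, value in row.items()} for row in read_back]
print(typed)
```

The types only survive because the reader happens to know the schema; nothing in the file enforces it.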
Attend this webinar on Oct 17 to learn more www.streamanalytix.com/webinar/apache-spark-the-new-enterprise-backbone-for-etl-batch-and-real-time-streaming/?Apache+Spark-+The+New+Enterprise+Backbone+for+ETL%2C+Batch+and+Real-+time+Streaming
Lol !! So you are still doing ETL but don't want to admit it.
So all that gurge is because you don't want to use ETL tool?
There is a reason why ETL and ELT both have Transformation. With columnar databases people now tend to do ELT, but I still believe that ETL can save you lots of money in storage and post-processing, plus give a huge advantage on ad-hoc reporting.
ELT and ETL have their own use cases, so this video is crap.
This guy has no idea about Enterprise Data Warehouse & Business Intelligence inside a corporation. This is about the integrity of information worth billions of US dollars. Hadoop/Big Data/Spark, whatever they are, are NOT for production-scale Enterprise Business Intelligence at all.
That's a rather broad and overreaching statement to make, in my opinion. That might be what is 'critical' for the companies you've worked for, but many industries and massive corporations have problems where they give up a fraction of a percentage of accuracy for speed and instead focus on consistency.
If that statement were accurate, no big business would use MongoDB or other non-ACID-compliant database management systems.