Want to build a reliable, modern data architecture without the mess?
Here’s a free checklist to help you → bit.ly/kds-checklist
Man, came across your platform today and just find it so valuable.
From a Data Scientist curious to understand a little bit ELT, Pipelines and the backend.
Thank you 🙏🏽
Very elegantly explained. Very concise & straight to the point. Loved the visual showing the different silos of data for Billing & CRM!
Appreciate the comment! Thanks for watching
A nice and concise video, thanks! Would be interesting to hear about some best practices on doing custom data ingestion (EL) pipelines (that is not using Airbyte/Fivetran/Stitch) but writing actual python scripts (which libraries are commonly used, how to structure the project etc).
Great suggestion, thanks for watching!
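For anyone wondering in the meantime, a very rough sketch of a hand-rolled EL script could look like the following (the API endpoint and landing folder are made up; requests for extraction plus writing the raw payload untouched is one common minimal pattern, with the transform left for later in the warehouse):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

import requests  # a common choice for pulling data from REST APIs

API_URL = "https://api.example.com/v1/orders"  # hypothetical source endpoint
RAW_DIR = Path("raw/orders")                   # hypothetical landing zone


def extract() -> list[dict]:
    """Pull raw records from the source API without reshaping them."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    return response.json()


def load(records: list[dict]) -> Path:
    """Write the payload as-is; transformation happens later, in the warehouse."""
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    target = RAW_DIR / f"orders_{datetime.now(timezone.utc):%Y%m%dT%H%M%SZ}.json"
    target.write_text(json.dumps(records))
    return target


if __name__ == "__main__":
    print(f"Loaded raw data to {load(extract())}")
```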
Thank you for your high-quality videos! In our use case, we ingest a daily .zip file containing 3 .csv’s related to sales, inventory and orders from different shops (20-30) and CRMs (4-5; each with its own naming convention, dtypes, …).
How would you improve the following pipeline?
- Raw zip files are uploaded to a GCP bucket
- The upload triggers a Python GCP Cloud Function that transforms the data to apply a single naming/dtype convention and derive new columns (e.g. a timestamp by merging date + time)
- Transformed data is uploaded to MongoDB - 3 separate collections for sales, inventory and orders - and the raw .csv’s are written to a separate GCP bucket as Parquet files (1 folder for each CRM, with PoS as a subfolder)
- A Pub/Sub message posted by the function triggers another GCP Function that loads the processed data from MongoDB, applies ML models and stores results in separate collections (1 for each analysis type; e.g. forecast, anomaly detection, …)
- A Python web app directly reads ML output data from MongoDB
Thank you so much and love your videos; 🤗
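As a rough illustration of the normalization step described in that pipeline (all names, buckets and column maps below are made up), one common pattern inside the Cloud Function is a per-CRM rename map applied to each CSV in the zip before writing Parquet:

```python
import io
import zipfile

import pandas as pd  # pyarrow (or fastparquet) is needed for to_parquet

# Hypothetical per-CRM rename maps used to converge on one naming convention
COLUMN_MAPS = {
    "crm_a": {"SaleDate": "date", "SaleTime": "time", "Qty": "quantity"},
    "crm_b": {"fecha": "date", "hora": "time", "cantidad": "quantity"},
}


def normalize(zip_bytes: bytes, crm: str) -> dict[str, pd.DataFrame]:
    """Read the CSVs in the daily zip and apply a single naming/dtype convention."""
    frames = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".csv"):
                continue
            df = pd.read_csv(archive.open(name))
            df = df.rename(columns=COLUMN_MAPS[crm])
            # Derive the merged timestamp column once, centrally
            df["timestamp"] = pd.to_datetime(df["date"] + " " + df["time"])
            frames[name] = df
    return frames


# The function would then write each frame out as Parquet, e.g. (needs gcsfs):
# frames["sales.csv"].to_parquet("gs://my-processed-bucket/crm_a/sales.parquet")
```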
I get to move my company into the modern ELT approach, thanks for the information!
It takes time but it's a nice approach. Good luck!
@@KahanDataSolutions thank you! First time doing something like this so learning all over the place for me.
Thank you for explaining it, that's super easy to understand
Glad it was helpful!
I think ETL is the way to go; you need to know what you want or need before adding to a permanent source.
Another EL option is Airbyte
Is Airflow another ELT/ETL tool? I mean, can you create an entire data pipeline just with Talend/Fivetran/dbt, or how does Airflow enter the tool set?
Airflow is a "task orchestrator" that can be used to trigger other tools (like dbt) in your pipeline. Yes, you can definitely still have a data pipeline without Airflow, but it becomes helpful as your infrastructure become a little more complex and you need to trigger various tools in specific sequence. Airflow can help you manage and monitor it from a single place rather than across different tools. Hope that helps!
@@KahanDataSolutions thxx!!
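To make the "orchestrator" idea concrete, here is a minimal, hypothetical Airflow DAG (assuming Airflow 2.x; the commands and paths are placeholders). Airflow doesn't move the data itself, it just triggers the EL step and then dbt in a fixed order:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_load = BashOperator(
        task_id="extract_load",
        bash_command="python ingest.py",  # placeholder for an Airbyte/Fivetran sync or a custom script
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="dbt run --project-dir /opt/dbt",  # placeholder dbt project path
    )

    extract_load >> transform  # enforce the EL-then-T sequence
```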
Great video! Super helpful and clear about ELT being the best approach. Question…I see you prefer dbt but how do you feel about Matillion? Thanks!
Thanks! I have not used Matillion before so I can't comment on that one
Thanks for the response! Great vids and I’ll subscribe to your channel.
Matillion is Fivetran combined with dbt. Matillion has dbt components for customers that have existing dbt jobs. I’d love for you to give it a try while I am learning how to use dbt! Anyways, take care and keep up the great vids
@@woolfolkdoesthings-onemans9388 Appreciate the summary - I'll def check it out! And thanks for watching / subscribing! Take care
It seems you are suggesting that data of various types and formats be brought into a single platform, and then transformed there using that platform's tools.
Would you also call building data models from analytical event tables ETL? Or is it just abstracted as the T of ELT? Thanks for making the video.
How is ELT more scalable?
I don't understand how you can load the data into a "more permanent" table before you transform the data because many times when you transform the data by applying business logic, you are changing the grain and schema of the data. Am I missing something?
You can (and will) still change the grain, but the difference is "when" you choose to do that. By more permanent I mean you don't have to clear it out after each run, like you might do in an ETL process. This is possible b/c the cost of storing massive amounts of data on modern databases is now relatively cheap compared to 15-20 years ago. Computation is what's expensive.
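A tiny sketch of that idea (using sqlite3 as a stand-in for a real warehouse; the table names are made up): the raw rows land in a permanent table untouched, and the grain only changes afterwards, when a transformed table is derived inside the database:

```python
import sqlite3

# Raw data is loaded first, exactly as it arrives, into a "permanent" table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount REAL)")
con.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "acme", 10.0), (2, "acme", 15.0), (3, "globex", 7.5)],
)

# The "T" happens later, in the database, at a coarser grain
con.execute(
    """
    CREATE TABLE customer_totals AS
    SELECT customer, SUM(amount) AS total_amount
    FROM raw_orders
    GROUP BY customer
    """
)

print(con.execute("SELECT * FROM customer_totals ORDER BY customer").fetchall())
# [('acme', 25.0), ('globex', 7.5)]
```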
Very useful 💖🥀 new subscriber here
Thanks and welcome!
Sorry Michael, but you should have attended more CHUG meetups and learned something about Big Data and doing ETL.
There is no such thing as ELT. It's really ETL.
ETL is by all means dead - yes =)))