- Videos: 87
- Views: 115,397
Apostolos Athanasiou
Joined 31 Aug 2014
This channel is dedicated to Data Engineering tasks, exercises, and tutorials. We go through simple examples of data pipelines, integration methods, APIs, data models, and other cool stuff using Python, SQL, and Azure as our core tools. If you want to begin your journey as a Data Engineer, this channel is for you.
Videos
Databricks Quick Tips: Column Level Encryption | Protect your data
34 views · 1 day ago
In this video we see how to perform column level encryption in Databricks to protect your data. Links used in the demo: -www.databricks.com/notebooks/enforcing-column-level-encryption.html -cryptography.io/en/latest/fernet/ -learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/aes_encrypt Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119...
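The Fernet approach from the linked notebook can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the column name `email` and the in-memory rows are made up, and the third-party `cryptography` package is assumed to be installed.

```python
from cryptography.fernet import Fernet

# Generate a key; in a real workspace this would be stored in a secret
# scope (e.g. backed by Azure Key Vault), never hard-coded in a notebook.
key = Fernet.generate_key()
f = Fernet(key)

rows = [{"user_id": 1, "email": "alice@example.com"},
        {"user_id": 2, "email": "bob@example.com"}]

# Encrypt only the sensitive column; other columns stay queryable as-is.
encrypted = [{**r, "email": f.encrypt(r["email"].encode()).decode()} for r in rows]

# Authorized readers decrypt with the same key.
assert f.decrypt(encrypted[0]["email"].encode()).decode() == "alice@example.com"
```

In Databricks the same round trip would typically be wrapped in a UDF (or done with the built-in aes_encrypt/aes_decrypt functions linked above), with the key fetched from a secret scope.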
Databricks Quick Tips: Use Except in SQL to Exclude Columns in Query
41 views · 4 days ago
In this video we see how to use the Except keyword to exclude unnecessary columns in the select clause. Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00 - Intro 01:00 - Except Keyword
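For illustration only, the effect of the EXCEPT keyword can be mimicked in plain Python (the column names below are hypothetical, not from the video):

```python
row = {"id": 1, "name": "sensor-a", "raw_payload": "{...}", "ingest_ts": "2024-01-01"}

def select_except(record, excluded):
    """Return the record without the excluded columns, mirroring
    SELECT * EXCEPT (raw_payload, ingest_ts) in Databricks SQL."""
    return {k: v for k, v in record.items() if k not in set(excluded)}

print(select_except(row, ["raw_payload", "ingest_ts"]))
# → {'id': 1, 'name': 'sensor-a'}
```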
Databricks - How to re-order columns in Delta Tables
92 views · 6 days ago
In this video we see how to change the order of columns in Delta Tables. Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00 - Intro 01:35 - Column Re-ordering in Databricks Demo
Azure & Databricks - Collect Metrics from an App Service Plan in Databricks using Python
66 views · 11 days ago
In this video we see how to collect metrics from an Azure App Service Plan and build a data pipeline to create the Bronze and Silver layers. The code can be found here: github.com/apostolos1927/AppServicePlanDatabricks Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00 - Intro 01:00 - Collect Metrics and build ETL pipeline
Azure Log Analytics & Databricks - Analyze logs from Log Analytics in Databricks using Python
104 views · 13 days ago
In this video we see how to create a Log Analytics Workspace, connect it to Azure Monitor and then fetch and analyze the data in Databricks using Python. More information about Azure Log Analytics can be found: learn.microsoft.com/en-us/azure/azure-monitor/logs/log-analytics-workspace-overview The code can be found here: github.com/Azure/azure-sdk-for-python/blob/main/sdk/monitor/azure-monitor-...
Databricks: Build Reliable ETL Pipelines with Delta Live Tables and Confluent Kafka
192 views · 1 month ago
In this video we see how to build reliable ETL pipelines with the DLT framework. We ingest and consume data from a Kafka topic and then build the bronze, silver, and gold layers. You can find the Databricks Notebooks here: github.com/apostolos1927/DLT_data_pipeline Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00...
Databricks - Slowly Changing Dimension Type 2 (PySpark version)
412 views · 1 month ago
In this video we see how to apply a type 2 slowly changing dimension in Databricks using PySpark only. You can find the Databricks Notebooks here: github.com/apostolos1927/SCD2Pyspark Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00 - Intro 01:20 - SCD2 code walk-through
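The core SCD2 logic (expire the current row, insert a new current version) can be sketched dependency-free in plain Python. The column names (`key`, `value`, `valid_from`, `valid_to`, `is_current`) are illustrative, not necessarily those used in the notebook:

```python
def scd2_upsert(dim, key, value, ts):
    """Apply one change to a type-2 dimension held as a list of dicts."""
    for row in dim:
        if row["key"] == key and row["is_current"]:
            if row["value"] == value:
                return dim  # no change, nothing to do
            # Expire the existing current row instead of overwriting it.
            row["valid_to"] = ts
            row["is_current"] = False
    # Insert the new current version, keeping the full history.
    dim.append({"key": key, "value": value,
                "valid_from": ts, "valid_to": None, "is_current": True})
    return dim

dim = []
scd2_upsert(dim, 1, "Athens", "2024-01-01")
scd2_upsert(dim, 1, "London", "2024-06-01")  # change produces two versions
current = [r for r in dim if r["is_current"]]
```

In PySpark the same effect is achieved with a MERGE INTO against the Delta table, but the row-level bookkeeping is the same.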
Databricks - How to load historical data in Delta Tables (Batch processing)
258 views · 2 months ago
In this video we see how to perform historical loads in Delta Tables in Databricks. To get a better understanding, you need to play around with the code yourself, as it takes some hours to dissect and absorb. You can find the Databricks Notebooks here: github.com/apostolos1927/HistoricalLoadsDeltaTables Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa11...
Databricks - How to create your Data Quality Checks Notebooks
530 views · 2 months ago
In this video we create our own custom notebooks for data quality checks in Databricks, using a combination of Python and SQL. The code can be found here: github.com/apostolos1927/DataQualityChecks/ Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00 - Intro 01:20 - Data Quality Checks Notebooks
Databricks for Beginners - Mount Azure DataLake in Databricks the correct way
151 views · 2 months ago
In this video we see how to create mount points and use them to connect to an Azure Data Lake. We also use a Service Principal and Azure Key Vault in this example, as this is the proper way. Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/
Databricks for Beginners - Access & pass parameters to Databricks Notebook from Azure Data Factory
175 views · 2 months ago
In this video we see how to use Data Factory to pass parameters to a Databricks Notebook. Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/
Databricks - Cloudfiles Ingestion with Autoloader and Copy Into
164 views · 2 months ago
In this video we see how to use Auto Loader and the COPY INTO command to ingest data from cloud storage. You can find the Databricks Notebooks here: github.com/apostolos1927/Autoloader Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/
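Conceptually, Auto Loader discovers files in cloud storage and processes each one exactly once by tracking what it has already seen. A toy pure-Python sketch of that idea (the file paths and checkpoint structure are made up and are not Auto Loader's real internals):

```python
def ingest_new_files(available, checkpoint):
    """Ingest only files not yet recorded in the checkpoint,
    mimicking Auto Loader's incremental file discovery."""
    new_files = [f for f in available if f not in checkpoint]
    checkpoint.update(new_files)  # remember processed files
    return new_files

checkpoint = set()
batch1 = ingest_new_files(["2024/01/a.json", "2024/01/b.json"], checkpoint)
# A later run sees one old file and one new file; only the new one is ingested.
batch2 = ingest_new_files(["2024/01/a.json", "2024/01/c.json"], checkpoint)
```

COPY INTO gives similar idempotent, once-per-file semantics via SQL, while Auto Loader keeps its state in a checkpoint location and scales to streaming workloads.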
Databricks - Data Pipelines Best Practices (with Confluent Kafka and Delta Live Tables)
3.7K views · 3 months ago
In this video we go through the best practices to follow when building data pipelines in Databricks. You can find the Databricks Notebooks here: github.com/apostolos1927/DatabricksBestPractices Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00 - Databricks Best Practices Intro 01:00 - Data Pipeline Demo
Databricks - Slowly Changing Dimension & CDC with Delta Live Tables and Azure SQL Database
748 views · 3 months ago
In this video we see how to use Slowly Changing Dimensions with Databricks Delta Live Tables, using a CDC data feed from Azure SQL Database. You can find the Databricks Notebooks here: github.com/apostolos1927/SCD-CDC-DLT Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00 - SCD in Databricks Intro 01:00 - Slowly changing d...
Databricks - Storing PII Data Securely
266 views · 4 months ago
In this video we see how to store your passwords and sensitive data securely, using a salt with the natural keys and dynamic views. You can find the Databricks Notebooks here: github.com/apostolos1927/SaltingHashing Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/
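The salting-and-hashing idea can be sketched with the Python standard library. This is a simplified illustration (in practice the salt would live in a secret scope, and the dynamic-views part is not shown here):

```python
import hashlib
import secrets

# A salt, normally fetched from a secret scope and never stored with the data.
salt = secrets.token_hex(16)

def pseudonymize(natural_key: str, salt: str) -> str:
    """Hash a natural key with a salt so it remains usable as a join key
    but cannot be reversed with a plain rainbow-table attack."""
    return hashlib.sha256((salt + natural_key).encode()).hexdigest()

a = pseudonymize("alice@example.com", salt)
b = pseudonymize("alice@example.com", salt)
assert a == b        # deterministic: the hash still works as a join key
assert len(a) == 64  # sha256 hex digest
```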
How do I install it to use it with XAMPP? Can you help me, please?
How can I track lineage with apply changes? In my case the lineage gets broken.
Thanks for the video!
Great videos as always❤
Nice video!
Great video, thanks!
Thanks for the video. How often is the materialized view (goldT table) refreshed in the example? Is this a configurable setting?
Recently, when I used <databricks-instance>#secrets/createScope, Databricks did not redirect to the Create Scope screen. Any idea what the reason could be? I have admin privileges and all relevant permissions.
Big thank you! I can't give much, but I subscribed and liked.
Very comprehensive and covers the basics, thanks for the tutorial.
Love the dog in the thumbnail!!
Thank you, it is a very good explanation. I created it as per your instructions and it works... Thank you again!
How do we manage deletes? If the source table is huge (size = 200 GB), how do we identify whether some of its rows were deleted before inserting into the target table? I don't want to delete the row in the target table, just have a flag against it. The target table is in a DW which should keep the history of all data. CDC is not an option. Thanks
This is an amazing job. I wonder if I can use Databricks Community Edition for my pet project that will go live.
Superr....
It's a nice channel to learn DE. Can you make a roadmap for data engineering for absolute beginners, covering what one should get a grip on, like the tech stack, skills, and use cases for projects? Thank you in advance.
Thanks a lot for creating this video. For the orchestration, are ADF triggers the only option? With Airflow orchestration we can customize the scheduling part.
Nice!
Thank you for the video! It was very helpful. I have a dozen or so pipelines that I needed to execute in batch. I was able to create a parent pipeline using a ForEach loop that references the info from a database table and then executes all the pipelines.
In case anyone runs into the same error (maybe it's a Windows thing, or an update to Flask): download_name=f'{video.title}.mp3' is the correct argument for send_file, not attachment_name.
Thanks a lot, bro, for such an informative video.
In the Event Grid & Azure Function segment, you created the topic subscription for the Azure Function. But what about the publish part? How does the Storage Account container publish the event to that particular topic on a new file upload?
As a data enthusiast, this information is very valuable to me. Thanks for sharing your knowledge.
Superr video as always, keep doing....
I have an ETL process in place in ADF. In our team, we wanted to implement the table and view transformations with dbt Core. We were wondering if we could orchestrate dbt with Azure, and if so, how. One of the approaches I could think of was to use an Azure Managed Airflow instance, but will it allow us to install astronomer cosmos? I have never implemented dbt this way before, so I need to know if this would be the right approach or if there is anything else you would suggest.
Unfortunately I haven't tried this approach either, so I cannot tell. It seems astronomer cosmos works well with Apache Airflow (github.com/astronomer/astronomer-cosmos), so in theory it should work with an Azure Managed Airflow instance too. That being said, I haven't tried it; better to give it a try and see.
Love your content always, keep rocking.....
Great video, easy to understand. Please continue with more videos.....
very useful
Always great videos...
Any clue if there is an open source alternative to Autoloader?
No idea, never searched for it, there might be.
Superr video as always. You are back with more useful content....
Superr video, waiting for your next video....
Thank you mate! I took some time off to recharge. I will come back with a new video in the coming weeks.
Please could you provide a snapshot sample of what your reference file looks like? Thank you!
I understand how it's pretty easy to work with CDC & ADF for one table, but I don't understand how I would manage my pipeline easily if I have, let's say, 100 tables with CDC... What's the best practice in this case?
Unfortunately I don't think there is a workaround for it, unless you can somehow parameterize an ADF pipeline to do that at scale (I haven't tried it, so I don't know if it's feasible).
Do you know why the output of your DLT pipeline didn't give streaming tables? I wrote a similar pipeline where I read CDC from a SQL source, write into bronze using Autoloader in DLT, into a streaming table, then write into silver as SCD2 in DLT as well. Both bronze and silver tables are generated as streaming tables... In your example, you get normal Delta tables. I might be confusing concepts... Also, do you have a video covering streaming tables and materialized views? Thanks!
It depends on how you specify the tables. You can specify them as streaming tables or just simple live tables, which are materialized views. Streaming tables are continuously updated, while materialized views get updated every few minutes or when you manually refresh them. docs.databricks.com/en/delta-live-tables/index.html
@@AthanasiouApostolos so only streaming tables depend on checkpoint locations then, from what I gather. ATM my streaming tables are using scheduled DLT pipelines, so they are not continuous. Thanks for your replies btw! One of the best Databricks content creators for sure. 🤗
Yes, exactly, checkpoints are for streaming tables. Using scheduled pipelines is essentially like doing batch jobs, despite the fact that you are using streaming tables. You need to use continuous mode to get the actual benefits of streaming tables. That is, of course, if you have a good budget hahaha. And thank you, much appreciated.
I didn't really understand what the deduplication portion had to do with the security aspect of storing sensitive data, but I learned something, so I'm not gonna complain :) Great videos and good delivery. Thank you for sharing the code!
Me neither, mate, lol! But we try to follow the Databricks courses, and they combined deduplication with security for some unknown reason. Irrelevant in my opinion too, but this is how they have it. Go figure...
@@ApostolosAthanasiou-vlog which course is this from?
@@UltimaWeaponz The Databricks Academy courses for the Data Engineering certificates.
@@AthanasiouApostolos Thank you! Trying to run the notebook myself. Where is the dataset coming from? Part of the Advanced Data Engineering curriculum?
No, I am using IoT data I generated in the previous videos. A completely different dataset.
Quick question: if a record is dropped from the source table, i.e. a hard delete, how does apply_changes handle it?
When I deploy, I check the error logs, and it seems like the libraries aren't installed after deployment. Everything works fine when deploying from localhost through
Thank you for another awesome episode!
Would it be simpler to use AD authentication, as it guarantees single sign-on?
Hello, when I make some deletes and updates I get this in "users_cdc_clean" and "user_cdc_quarantine": "Detected a data update (for example part-00000-3795a007-fac7-4ed9-9a5a-f446b34caef9-c000.snappy.parquet) in the source table at version 3. This is currently not supported. If you'd like to ignore updates, set the option 'skipChangeCommits' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory." Why is that? How can I manage deletes and updates?
You can't update or delete on streaming tables and then restart. This is why we enabled CDC on the database: we have the update-type information that can be used for SCD2, if this is what you are asking.
@@AthanasiouApostolos I have the same problem implementing this. The pipeline can only run as a full refresh and cannot be updated when new CDC records come into the source tables. The 'middle' streaming table (users_cdc_clean in your example) fails due to changes in the source table (users_cdc_bronze in your example). After the initial run (or full refresh), making updates to the source table (which would be TEST123 in your example) and then running the pipeline update fails with this error: (org.apache.spark.sql.streaming.StreamingQueryException: [STREAM_FAILED] Query [id = 98392bb5-e5ee-419c-887a-514bbd0d4b8d, runId = 8fd2d375-9796-4ad0-a24f-eaf7bdd3f6f4] terminated with exception: [DELTA_SOURCE_TABLE_IGNORE_CHANGES] Detected a data update (for example part-00000-50168213-b0c8-4b5e-a20b-251aa77c3d4b-c000.snappy.parquet) in the source table at version 38. This is currently not supported. If you'd like to ignore updates, set the option 'skipChangeCommits' to 'true'). The updates that I did in TEST123 should only lead to appends in users_cdc_bronze, but somehow the DB runtime seems to think there are updates in this table. Even when setting the option skipChangeCommits to true for the users_cdc_bronze ingest, the pipeline fails on the second run.
"Efkaristo poli" (thank you very much), all the answers to the issue I had today.
Can we use databricks community edition for implementation?
Hi! Really good video, very easy to follow and understand. But could you clarify the following: if I have a million devices, do I have to create all of them in IoT Hub? (And manually, at that?) I saw that in your VS Code you added the device ID in the JSON, which added to my confusion. Please tell me that the answer to my question is no, and that the device you added in IoT Hub is actually a device "type" that you have configured to receive messages from many device IDs. Which of my understandings is true? 😊 Thank you so much
Hi mate, I am not sure I understand what you are looking for, but maybe it's this? learn.microsoft.com/en-us/answers/questions/884678/iot-hub-multiple-connections-on-single-device (module identities)
@@AthanasiouApostolos Well, basically I wanted to know whether I need as many IoT devices created in IoT Hub as there are physical devices.
@@SynonAnon-vi1ql No, you don't. However, using the same connection string multiple times would create issues; that's why I proposed module identities. That being said, you would need to do your own research on this topic, as I have never used more than one sensor in real-life applications.
@@AthanasiouApostolos understood. Thank you friend!
Thank you for sharing! Great explanation!
Superr video as always
I get "TypeError: 'JavaPackage' object is not callable" on this code, please help me Apostolos:
    'eventhubs.connectionString': sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(IOT_CS),
    "eventhubs.consumerGroup": '$Default'
}
json_schema = StructType([
    StructField("DeviceID", IntegerType(), True),
    StructField("DeviceNumber", IntegerType(), True)
])
# Enable auto compaction and optimized writes in Delta
Great presentation. You may want to move the mic away from your mouth to cut down on unnecessary noise.
Does it work with SQL Server on-prem as well?
++