- Videos: 87
- Views: 115,397
Apostolos Athanasiou
Joined 31 Aug 2014
This channel is dedicated to Data Engineering tasks, exercises, and tutorials. We go through simple examples of data pipelines, integration methods, APIs, data models, and other cool stuff using Python, SQL, and Azure as our core tools. If you want to begin your journey as a Data Engineer, this channel is for you.
Videos
Databricks Quick Tips: Column Level Encryption | Protect your data
34 views · 1 day ago
In this video we see how to perform column level encryption in Databricks to protect your data. Links used in the demo: -www.databricks.com/notebooks/enforcing-column-level-encryption.html -cryptography.io/en/latest/fernet/ -learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/aes_encrypt Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119...
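The Fernet approach from the linked notebook can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the column name `email` and the in-memory rows are made up, and the third-party `cryptography` package is assumed to be installed.

```python
from cryptography.fernet import Fernet

# Generate a key; in a real workspace this would be stored in a secret
# scope (e.g. backed by Azure Key Vault), never hard-coded in a notebook.
key = Fernet.generate_key()
f = Fernet(key)

rows = [{"user_id": 1, "email": "alice@example.com"},
        {"user_id": 2, "email": "bob@example.com"}]

# Encrypt only the sensitive column; other columns stay queryable as-is.
encrypted = [{**r, "email": f.encrypt(r["email"].encode()).decode()} for r in rows]

# Authorized readers decrypt with the same key.
assert f.decrypt(encrypted[0]["email"].encode()).decode() == "alice@example.com"
```

In Databricks the same round trip would typically be wrapped in a UDF (or done with the built-in aes_encrypt/aes_decrypt functions linked above), with the key fetched from a secret scope.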
Databricks Quick Tips: Use Except in SQL to Exclude Columns in Query
41 views · 4 days ago
In this video we see how to use the Except keyword to exclude unnecessary columns in the select clause. Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00 - Intro 01:00 - Except Keyword
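For illustration only, the effect of the EXCEPT keyword can be mimicked in plain Python (the column names below are hypothetical, not from the video):

```python
row = {"id": 1, "name": "sensor-a", "raw_payload": "{...}", "ingest_ts": "2024-01-01"}

def select_except(record, excluded):
    """Return the record without the excluded columns, mirroring
    SELECT * EXCEPT (raw_payload, ingest_ts) in Databricks SQL."""
    return {k: v for k, v in record.items() if k not in set(excluded)}

print(select_except(row, ["raw_payload", "ingest_ts"]))
# → {'id': 1, 'name': 'sensor-a'}
```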
Databricks - How to re-order columns in Delta Tables
92 views · 6 days ago
In this video we see how to change the order of columns in Delta Tables. Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00 - Intro 01:35 - Column Re-ordering in Databricks Demo
Azure & Databricks - Collect Metrics from an App Service Plan in Databricks using Python
66 views · 11 days ago
In this video we see how to collect metrics from an Azure App Service Plan and build a data pipeline to create the Bronze and Silver layers. The code can be found here: github.com/apostolos1927/AppServicePlanDatabricks Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00 - Intro 01:00 - Collect Metrics and build ETL pipeline
Azure Log Analytics & Databricks - Analyze logs from Log Analytics in Databricks using Python
104 views · 13 days ago
In this video we see how to create a Log Analytics Workspace, connect it to Azure Monitor and then fetch and analyze the data in Databricks using Python. More information about Azure Log Analytics can be found: learn.microsoft.com/en-us/azure/azure-monitor/logs/log-analytics-workspace-overview The code can be found here: github.com/Azure/azure-sdk-for-python/blob/main/sdk/monitor/azure-monitor-...
Databricks: Build Reliable ETL Pipelines with Delta Live Tables and Confluent Kafka
192 views · 1 month ago
In this video we see how to build reliable ETL pipelines with the DLT framework. We ingest and consume data from a Kafka topic and then build the bronze, silver, and gold layers. You can find the Databricks Notebooks here: github.com/apostolos1927/DLT_data_pipeline Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00...
Databricks - Slowly Changing Dimension Type 2 (PySpark version)
412 views · 1 month ago
In this video we see how to apply a type 2 slowly changing dimension in Databricks using PySpark only. You can find the Databricks Notebooks here: github.com/apostolos1927/SCD2Pyspark Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00 - Intro 01:20 - SCD2 code walk-through
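The core SCD2 logic (expire the current row, insert a new current version) can be sketched dependency-free in plain Python. The column names (`key`, `value`, `valid_from`, `valid_to`, `is_current`) are illustrative, not necessarily those used in the notebook:

```python
def scd2_upsert(dim, key, value, ts):
    """Apply one change to a type-2 dimension held as a list of dicts."""
    for row in dim:
        if row["key"] == key and row["is_current"]:
            if row["value"] == value:
                return dim  # no change, nothing to do
            # Expire the existing current row instead of overwriting it.
            row["valid_to"] = ts
            row["is_current"] = False
    # Insert the new current version, keeping the full history.
    dim.append({"key": key, "value": value,
                "valid_from": ts, "valid_to": None, "is_current": True})
    return dim

dim = []
scd2_upsert(dim, 1, "Athens", "2024-01-01")
scd2_upsert(dim, 1, "London", "2024-06-01")  # change produces two versions
current = [r for r in dim if r["is_current"]]
```

In PySpark the same effect is achieved with a MERGE INTO against the Delta table, but the row-level bookkeeping is the same.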
Databricks - How to load historical data in Delta Tables (Batch processing)
258 views · 2 months ago
In this video we see how to perform historical loads in Delta Tables in Databricks. To get a better understanding, you need to play around with the code yourself, as it takes some hours to dissect and absorb. You can find the Databricks Notebooks here: github.com/apostolos1927/HistoricalLoadsDeltaTables Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa11...
Databricks - How to create your Data Quality Checks Notebooks
530 views · 2 months ago
In this video we create our own custom notebooks for data quality checks in Databricks, using a combination of Python and SQL. The code can be found here: github.com/apostolos1927/DataQualityChecks/ Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00 - Intro 01:20 - Data Quality Checks Notebooks
Databricks for Beginners - Mount Azure DataLake in Databricks the correct way
151 views · 2 months ago
In this video we see how to create mount points and use them to connect to an Azure Data Lake. We also use a Service Principal and Azure Key Vault in this example, as this is the proper way. Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/
Databricks for Beginners - Access & pass parameters to Databricks Notebook from Azure Data Factory
175 views · 2 months ago
In this video we see how to use Data Factory to pass parameters to a Databricks Notebook. Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/
Databricks - Cloudfiles Ingestion with Autoloader and Copy Into
164 views · 2 months ago
In this video we see how to use Auto Loader and the COPY INTO command to ingest data from cloud storage. You can find the Databricks Notebooks here: github.com/apostolos1927/Autoloader Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/
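Conceptually, Auto Loader discovers files in cloud storage and processes each one exactly once by tracking what it has already seen. A toy pure-Python sketch of that idea (the file paths and checkpoint structure are made up and are not Auto Loader's real internals):

```python
def ingest_new_files(available, checkpoint):
    """Ingest only files not yet recorded in the checkpoint,
    mimicking Auto Loader's incremental file discovery."""
    new_files = [f for f in available if f not in checkpoint]
    checkpoint.update(new_files)  # remember processed files
    return new_files

checkpoint = set()
batch1 = ingest_new_files(["2024/01/a.json", "2024/01/b.json"], checkpoint)
# A later run sees one old file and one new file; only the new one is ingested.
batch2 = ingest_new_files(["2024/01/a.json", "2024/01/c.json"], checkpoint)
```

COPY INTO gives similar idempotent, once-per-file semantics via SQL, while Auto Loader keeps its state in a checkpoint location and scales to streaming workloads.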
Databricks - Data Pipelines Best Practices (with Confluent Kafka and Delta Live Tables)
3.7K views · 3 months ago
In this video we go through the best practices to follow when building data pipelines in Databricks. You can find the Databricks Notebooks here: github.com/apostolos1927/DatabricksBestPractices Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00 - Databricks Best Practices Intro 01:00 - Data Pipeline Demo
Databricks - Slowly Changing Dimension & CDC with Delta Live Tables and Azure SQL Database
748 views · 3 months ago
In this video we see how to use Slowly Changing Dimensions with Databricks Delta Live Tables, using a CDC data feed from Azure SQL Database. You can find the Databricks Notebooks here: github.com/apostolos1927/SCD-CDC-DLT Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/ 00:00 - SCD in Databricks Intro 01:00 - Slowly changing d...
Databricks - Storing PII Data Securely
266 views · 4 months ago
In this video we see how to store your passwords and sensitive data securely, using a salt with the natural keys and dynamic views. You can find the Databricks Notebooks here: github.com/apostolos1927/SaltingHashing Follow me on social media: LinkedIn: www.linkedin.com/in/apostolos-athanasiou-9a0baa119 GitHub: github.com/apostolos1927/
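The salting-and-hashing idea can be sketched with the Python standard library. This is a simplified illustration (in practice the salt would live in a secret scope, and the dynamic-views part is not shown here):

```python
import hashlib
import secrets

# A salt, normally fetched from a secret scope and never stored with the data.
salt = secrets.token_hex(16)

def pseudonymize(natural_key: str, salt: str) -> str:
    """Hash a natural key with a salt so it remains usable as a join key
    but cannot be reversed with a plain rainbow-table attack."""
    return hashlib.sha256((salt + natural_key).encode()).hexdigest()

a = pseudonymize("alice@example.com", salt)
b = pseudonymize("alice@example.com", salt)
assert a == b        # deterministic: the hash still works as a join key
assert len(a) == 64  # sha256 hex digest
```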
How do I install it to use it with XAMPP? Can you help me, please?
How can I track lineage with apply changes? In my case the lineage gets broken.
Thanks for the video!
Great videos as always❤
Nice video!
Great video, thanks!
Thanks for the video. How often is the materialized view (goldT table) refreshed in the example? Is this a configurable setting?
Recently, when I used <databricks-instance>#secrets/createScope, Databricks did not redirect to the Create Scope screen. Any idea what the reason could be? I have admin privileges and all relevant permissions.
Big thank you! I can't give much, but I subscribed and liked.
Very comprehensive and covers the basics, thanks for the tutorial.
Love the dog in the thumbnail!!
Thank you, it is a very good explanation. I created it as per your instructions and it works... Thank you again!
How do we manage deletes? If the source table is huge (size = 200 GB), how do we identify whether some of its rows were deleted before inserting into the target table? I don't want to delete the row in the target table, just have a flag against it. The target table is in a DW which should keep the history of all data. CDC is not an option. Thanks
This is an amazing job. I wonder if I can use Databricks Community Edition for my pet project that will go live.
Superr....
It's a nice channel to learn DE. Can you make a roadmap for data engineering for absolute beginners, covering what one should get a grip on, like the tech stack, skills, and use cases for projects? Thank you in advance.
Thanks a lot for creating this video. For the orchestration, are ADF triggers the only option? With Airflow orchestration we can customize the scheduling part.
Nice!
Thank you for the video! It was very helpful. I have a dozen or so pipelines that I needed to execute in batch. I was able to create a parent pipeline using a ForEach loop that references the info from a database table and then executes all the pipelines.
In case anyone runs into the same error (maybe it's a Windows thing, or an update to Flask): download_name=f'{video.title}.mp3' is the correct argument for send_file, not attachment_name.
Thanks a lot, bro, for such an informative video.
In the Event Grid & Azure Function segment, you created the topic subscription for the Azure Function. But what about the publish part? How does the Storage Account container publish the event to that particular topic on a new file upload?
As a data enthusiast, this information is very valuable to me. Thanks for sharing your knowledge.
Superr video as always, keep doing....
I have an ETL process in place in ADF. In our team, we wanted to implement the table and view transformations with dbt Core. We were wondering if we could orchestrate dbt with Azure, and if so, how. One of the approaches I could think of was to use an Azure Managed Airflow instance, but will it allow us to install astronomer cosmos? I have never implemented dbt this way before, so I need to know if this would be the right approach or if there is anything else you would suggest.
Unfortunately I haven't tried this approach either, so I cannot tell. It seems astronomer cosmos works well with Apache Airflow (github.com/astronomer/astronomer-cosmos), so in theory it should work with an Azure Managed Airflow instance too. That being said, I haven't tried it; better to give it a try and see.
Love your content always, keep rocking.....
Great video, easy to understand. Please continue with more videos.....
very useful
Always great videos...
Any clue if there is an open source alternative to Autoloader?
No idea, never searched for it, there might be.
Superr video as always. You are back with more useful content....
Superr video, waiting for your next video....
Thank you mate! I took some time off to recharge. I will come back with a new video in the coming weeks.
Please could you provide a snapshot sample of what your reference file looks like? Thank you!
I understand how it's pretty easy to work with CDC & ADF for one table, but I don't understand how I would manage my pipeline easily if I have, let's say, 100 tables with CDC... What's the best practice in this case?
Unfortunately I don't think there is a workaround for it, unless you can somehow parameterize an ADF pipeline to do that at scale (I haven't tried it, so I don't know if it's feasible).
Do you know why the output of your DLT pipeline didn't give streaming tables? I wrote a similar pipeline where I read CDC from a SQL source, write into bronze using Autoloader in DLT, into a streaming table, then write into silver as SCD2 in DLT as well. Both bronze and silver tables are generated as streaming tables... In your example, you get normal Delta tables. I might be confusing concepts... Also, do you have a video covering streaming tables and materialized views? Thanks!
It depends on how you specify the tables. You can specify them as streaming tables or just simple live tables, which are materialized views. Streaming tables are continuously updated, while materialized views get updated every few minutes or when you manually refresh them. docs.databricks.com/en/delta-live-tables/index.html
@@AthanasiouApostolos so only streaming tables depend on checkpoint locations then, from what I gather. ATM my streaming tables are using scheduled DLT pipelines, so they are not continuous. Thanks for your replies btw! One of the best Databricks content creators for sure. 🤗
Yes, exactly, checkpoints are for streaming tables. Using scheduled pipelines is essentially like doing batch jobs, despite the fact that you are using streaming tables. You need to use continuous mode to get the actual benefits of streaming tables. That is, of course, if you have a good budget hahaha. And thank you, much appreciated.
I didn't really understand what the deduplication portion had to do with the security aspect of storing sensitive data, but I learned something, so I'm not gonna complain :) Great videos and good delivery. Thank you for sharing the code!
Me neither, mate, lol! But we try to follow the Databricks courses, and they combined deduplication with security for some unknown reason. Irrelevant in my opinion too, but this is how they have it. Go figure...
@@ApostolosAthanasiou-vlog which course is this from?
@@UltimaWeaponz The Databricks Academy courses for the Data Engineering certificates.
@@AthanasiouApostolos Thank you! Trying to run the notebook myself. Where is the dataset coming from? Part of the Advanced Data Engineering curriculum?
No, I am using IoT data I generated in the previous videos. A completely different dataset.
Quick question: if a record is dropped from the source table, i.e. a hard delete, how does apply_changes handle it?
When I deploy, I check the error logs, and it seems like the libraries aren't installed after deployment. Everything works fine when deploying from localhost through
Thank you for another awesome episode!
Would it be simpler to use AD authentication, as it guarantees single sign-on?
Hello, when I make some deletes and updates I get this in "users_cdc_clean" and "user_cdc_quarantine": "Detected a data update (for example part-00000-3795a007-fac7-4ed9-9a5a-f446b34caef9-c000.snappy.parquet) in the source table at version 3. This is currently not supported. If you'd like to ignore updates, set the option 'skipChangeCommits' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory." Why is that? How can I manage deletes and updates?
You can't update or delete on streaming tables and then restart. This is why we enabled CDC on the database: we have the update-type information that can be used for SCD2, if this is what you are asking.
@@AthanasiouApostolos I have the same problem implementing this. The pipeline can only run as a full refresh and cannot be updated when new CDC records come into the source tables. The 'middle' streaming table (users_cdc_clean in your example) fails due to changes in the source table (users_cdc_bronze in your example). After the initial run (or full refresh), making updates to the source table (which would be TEST123 in your example) and then running the pipeline update fails with this error: (org.apache.spark.sql.streaming.StreamingQueryException: [STREAM_FAILED] Query [id = 98392bb5-e5ee-419c-887a-514bbd0d4b8d, runId = 8fd2d375-9796-4ad0-a24f-eaf7bdd3f6f4] terminated with exception: [DELTA_SOURCE_TABLE_IGNORE_CHANGES] Detected a data update (for example part-00000-50168213-b0c8-4b5e-a20b-251aa77c3d4b-c000.snappy.parquet) in the source table at version 38. This is currently not supported. If you'd like to ignore updates, set the option 'skipChangeCommits' to 'true'). The updates that I did in TEST123 should only lead to appends in users_cdc_bronze, but somehow the DB runtime seems to think there are updates in this table. Even when setting the option skipChangeCommits to true for the users_cdc_bronze ingest, the pipeline fails on the second run.
"Efkaristo poli" (thank you very much), all the answers to the issue I had today.
Can we use databricks community edition for implementation?
Hi! Really good video, very easy to follow and understand. But could you clarify the following: if I have a million devices, do I have to create all of them in IoT Hub? (And manually, at that?) I saw that in your VS Code you added the device ID in the JSON, which added to my confusion. Please tell me that the answer to my question is no, and that the device you added in IoT Hub is actually a device "type" that you have configured to receive messages from many device IDs. Which of my understandings is true? 😊 Thank you so much
Hi mate, I am not sure I understand what you are looking for, but maybe it's this? learn.microsoft.com/en-us/answers/questions/884678/iot-hub-multiple-connections-on-single-device (module identities)
@@AthanasiouApostolos Well, basically I wanted to know whether I need as many IoT devices created in IoT Hub as there are physical devices.
@@SynonAnon-vi1ql No, you don't. However, using the same connection string multiple times would create issues; that's why I proposed module identities. That being said, you would need to do your own research on this topic, as I have never used more than one sensor in real-life applications.
@@AthanasiouApostolos understood. Thank you friend!
Thank you for sharing! Great explanation!
Superr video as always
I get "TypeError: 'JavaPackage' object is not callable" on this code, please help me Apostolos:
    'eventhubs.connectionString': sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(IOT_CS),
    "eventhubs.consumerGroup": '$Default'
}
json_schema = StructType([
    StructField("DeviceID", IntegerType(), True),
    StructField("DeviceNumber", IntegerType(), True)
])
# Enable auto compaction and optimized writes in Delta
Great presentation. You may want to move the mic away from your mouth to cut down on unnecessary noise.
Does it work with SQL Server on-prem as well?
++