- Videos: 41
- Views: 72,973
dacort - Data Analytics
Added Feb 22, 2021
Damon has been building data pipelines for over a decade and enjoys moving data from one end of the Internet to the other.
He's built his own analytics startup, worked for 5 years on the EMR team at AWS, and creates many of his own data tools.
Damons Data Lake
A short little video that shows how I gather data from GitHub and YouTube using my custom container framework ( github.com/dacort/cargo-crates/ ).
I run containers in ECS on a scheduled basis, deployed via CDK ( github.com/dacort/damons-data-lake/tree/main/data_containers ) and pipe the output from the container into paths on S3 extracted from the JSON data using a custom tool called Forklift ( github.com/dacort/forklift ).
Hope you enjoy!
00:00 - Intro
01:18 - Cargo Crates
03:51 - Forklift
06:03 - Damons Data Lake
07:32 - Querying in Athena
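The "paths on S3 extracted from the JSON data" idea in the description can be illustrated with a small sketch. This is not Forklift's actual implementation or syntax; the template format, the field names (`source`, `date`), and the helper names below are all assumptions made for illustration. Each JSON line from a container is mapped to an S3 key built from its own fields, and lines that share a key are grouped together.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def render_key(template: str, record: dict) -> str:
    """Fill {field} placeholders in the template with values from one JSON record."""
    return template.format(**record)


def group_lines_by_key(lines: Iterable[str], template: str) -> Dict[str, List[str]]:
    """Group raw JSON lines by the S3 key that their fields map to."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the container output
        record = json.loads(line)
        groups[render_key(template, record)].append(line)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical container output: one JSON object per line.
    sample = [
        '{"source": "github", "date": "2021-02-22", "stars": 10}',
        '{"source": "youtube", "date": "2021-02-22", "views": 100}',
        '{"source": "github", "date": "2021-02-23", "stars": 12}',
    ]
    # Hypothetical key template; the placeholders must exist in every record.
    template = "raw/{source}/dt={date}/data.json"
    for key, rows in group_lines_by_key(sample, template).items():
        print(key, len(rows))
```

Each group could then be uploaded to its key with an S3 client. A record that lacks a field named in the template raises `KeyError`, which is one simple way to surface malformed input early.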
Views: 255
Videos
Remote Debugging with PyCharm and EMR
703 views, 10 months ago
Shows how to use PyCharm with EMR on EKS and EMR Serverless to interactively debug your Spark applications. 00:00 - Intro 01:16 - Getting Started 02:12 - CDK Deploy Output 03:43 - PyCharm Debugger 05:09 - Building Spark dependencies 06:52 - Running an EMR on EKS job 07:34 - Using the EMR CLI 08:45 - Enabling debugging on your Spark job 12:41 - Debugging EMR Serverless The CDK stack for this vid...
Amazon EMR and S3 Access Grants
1.2K views, 1 year ago
A demo that follows along with the blog post at aws.amazon.com/blogs/big-data/use-amazon-emr-with-s3-access-grants-to-scale-spark-access-to-amazon-s3/ 00:00 - Intro 01:34 - CloudFormation Stack Overview 03:07 - Create S3 Access Grants 05:04 - Create READ and READWRITE grants 06:32 - EMR on EC2 example 07:27 - Diving into data writer role permissions 11:16 - EMR Studio and EMR Serverless 16:50 -...
Generate real-time code suggestions in EMR Studio notebooks
427 views, 1 year ago
See how EMR Studio now integrates with Amazon CodeWhisperer to provide real-time code suggestions for Apache Spark. Docs: docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-codewhisperer.html 00:00 - Intro 00:33 - Create EMR Serverless application 01:19 - Connect EMR Studio to EMR Serverless 02:34 - First CodeWhisperer auto-complete
Amazon EMR - When to use EMR on EC2, EKS, and Serverless
3.3K views, 1 year ago
A high-level overview of the different deployment options for EMR including EMR on EC2, EMR on EKS, and EMR Serverless. 00:00 - Intro 00:30 - EMR on EC2 02:52 - EMR Serverless 05:24 - EMR on EKS
A Tour of the Amazon EMR Console
408 views, 1 year ago
In this video, we take a look at a few common tasks in the new Amazon EMR Console. Redesigned with ease of use and fewer clicks in mind, we hope you enjoy it! 00:00 - Intro 00:18 - Create cluster 02:30 - Cluster summary 03:13 - Add step 04:21 - Instances tab 04:51 - Collapsible notifications 05:18 - Cluster list with expandable rows 06:02 - Search bar usage 07:14 - Events page 07:29 - Block publ...
EMR Serverless Pre-initialized capacity overview
396 views, 1 year ago
It can be a little tough to understand how jobs use pre-initialized capacity in EMR Serverless. This video demonstrates how resources are provisioned, acquired, and refilled during the course of two jobs. 00:00 - Intro 00:35 - Job 1 01:05 - Job 2 01:33 - Refilling the pool 01:52 - Finishing Job 2
Running Spark jobs on Amazon EMR Serverless
10K views, 2 years ago
Get an overview of how to run Apache Spark jobs in EMR Serverless from the AWS Console, CLI, and using Amazon Managed Workflows for Apache Airflow (MWAA). Also see how to use the new CloudWatch Metrics to monitor EMR Serverless usage, Live Dashboard UI, and package your PySpark jobs with virtual environments. Table of Contents: 00:00 - Intro 02:01 - Create application in the console 02:47 - Pre...
Intro to Amazon EMR Toolkit
2.1K views, 2 years ago
See how to install and use the Amazon EMR Toolkit for VS Code. - Browse your EMR on EC2, EMR on EKS, and EMR Serverless resources. - Explore your Glue Data Catalog and view table details. - Create a local PySpark development container based on EMR. - Deploy PySpark jobs to EMR Serverless Table of Contents: 00:00 - Intro 00:40 - Installing the EMR Toolkit 01:22 - EMR Explorer 02:25 - Glue Data C...
Modern Data Lake Storage Layers
12K views, 2 years ago
An overview of Apache Hudi, Apache Iceberg, and Delta Lake. In this video, we talk about the basics of how Hudi, Iceberg, and Delta Lake work. You'll see how to insert, update, and delete data in your data lake and how each of these frameworks works behind the scenes. Blog post: dacort.dev/posts/modern-data-lake-storage-layers/ GitHub Repo with CloudFormation and Notebooks: github.com/dacort/mod...
Amazon EMR Studio - SQL Explorer
1.3K views, 2 years ago
With the new SQL Explorer in Amazon EMR Studio, you can now easily look at your database tables and run SQL queries without having to embed them in code. In this demo, we show how to browse your database and tables and execute queries in the SQL Editor. What's new post: aws.amazon.com/about-aws/whats-new/2022/01/introducing-sql-explorer-in-emr-studio/ Documentation: docs.aws.amazon.com/emr/late...
Amazon EMR Studio - Real-time Collaboration
1.7K views, 2 years ago
Real-time collaboration is a new feature in EMR Studio that allows multiple users to share a single notebook workspace. In this video, we’ll show you how you can both use the same workspace to collaborate on the same notebook. What's New post: aws.amazon.com/about-aws/whats-new/2022/01/real-time-collaborative-notebooks-emr-studio/ Documentation: docs.aws.amazon.com/emr/latest/ManagementGuide/em...
Running Hive and Spark jobs on Amazon EMR Serverless
6K views, 2 years ago
Now in preview, Amazon EMR Serverless allows you to run big data analytics without worrying about infrastructure. In this demo, we show how to instantly run both Spark and Hive jobs on EMR Serverless as well as how to debug the jobs in real-time using logs on Amazon S3 and the Spark History Server and Tez UI. 00:00 - Intro 01:10 - EMR Serverless Overview 02:10 - Running a Spark job 06:43 - Usin...
Getting Started with EMR Studio
2.1K views, 3 years ago
See how to create a new Amazon EMR Studio using IAM Authentication Mode. How to set up an Amazon EMR Studio: docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-set-up.html EMR Studio CloudFormation Templates: github.com/aws-samples/emr-studio-samples EMR Studio IAM Permissions: docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-iam-permissions-table.html Table of Contents: 00:00 -...
Query Athena from EMR Studio
1.3K views, 3 years ago
See how to install pyathena and query Athena from an EMR Studio notebook. Setting up EMR Studio: docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-set-up.html Example notebook: github.com/dacort/demo-code/blob/main/emr/studio/notebooks/emr-studio-athena.ipynb Table of Contents: 00:00 - Intro 00:51 - Install pyathena 01:25 - Query your data! 02:37 - Querying with SparkSQL
Using IAM Authentication with EMR Studio
1.1K views, 3 years ago
Interactive data processing on Amazon EMR from Amazon SageMaker
2.4K views, 3 years ago
dacort's Data Lab - Docker in EMR Studio
73 views, 3 years ago
EMR on EKS: Pod Templates in 60 seconds
372 views, 3 years ago
Amazon EMR Studio - Creating a new Studio Workspace
5K views, 3 years ago
Connecting to Git in Amazon EMR Studio
1.5K views, 3 years ago
Amazon EMR on Amazon EKS Demo - Job Creation
917 views, 3 years ago
where are you storing `pi.py`? thank you
Does anyone here know if it is possible to use Spark to select multiple Parquet files from an S3 bucket (all in the "ABC" folder) and combine them into one Parquet file in the "DEF" folder in the same location? If so, what is the code? Thanks.
Isn't it redundant to call out to an Athena cluster from within an EMR cluster? Is the reason for using Athena its integration with Glue's Catalog?
❤ thanks ❤
I am in video two and looks like it would be beneficial to provide details on how to create vcluster
Amazing job. Thank you!! What is the best way to read these Delta tables now? Data Catalog and then Athena? I would like to see this data in our QuickSight.
Hi Dacort, I can’t finish the “Reopen in Container” step, as after it downloads the docker image, an error pops up saying “yum doesn’t have enough cache data to continue” at step “RUN yum install -y sudo && “Hadoop ALL=(ALL) NOPASSWD:ALL”.. Wondering if you might know any reason behind? Thank you so much!
When reading "cargo crates", my mind immediately jumped into the rust ecosystem and the cargo package manager :) Great video man! Learning a ton from you especially when it comes to AWS and EMR! Please keep up doing the great work!
Hey @dacort, Thanks for the great video. - What about Glue? Can we say that Glue and EMR serverless do more or less the same thing? - Let's say we only have Spark jobs to run based on some triggers. Since it is a transient job, I should run it with EMR serverless. On the other hand, if I need a long-running cluster, I should go with EMR on EC2/EKS. Can I extract the formula like this :)
Thanks it was helpful, what about aws athena and aws s3 access grants? I am struggling with a PoC where data is in one aws account and athena in another account and I would like to use aws s3 access grants to manage permissions
I tried to work with notebooks. I successfully attached the serverless application, but it can't run; it always shows an error when I try to run. The AWS docs say nothing. For me it could be the role, maybe the "interactive role". What is it? Any advice?
You made EMR fun. Good luck with everything.
Really nice content. You should do more of such Developer tooling and productivity boosts while working with EMR.
@dacort Thanks for the video. In my case as well, I'm using EmrContainerOperator to submit jobs to an EMR on EKS cluster, which is working fine. Now, to track the cost of each job, I want to assign tags to each job executed in the EMR on EKS virtual cluster. When I assign tags to the EmrContainerOperator, the jobs are executed with the aws_default connection ID and ignore the aws_conn_id I've provided to my operator, but once I remove the tags from the operator it uses the custom aws_conn_id as expected. Any help here would be greatly appreciated. For your reference:
    emr_spark_submit = EmrContainerOperator(
        task_id="task_id",
        virtual_cluster_id="VIRTUAL_CLUSTER_ID",
        execution_role_arn="emr-on-eks-job-execution-role",
        release_label="emr-6.7.0-latest",
        job_driver=job_driver_arg,
        configuration_overrides=configuration_overrides_arg,
        name="submit_pyspark_job",
        aws_conn_id="emr-on-eks",
        tags={'job':'test'},
        dag=dag,
    )
FYI: I'm using "from airflow.providers.amazon.aws.operators.emr import EmrContainerOperator" and not "from emr_containers.operators.emr_containers import EMRContainerOperator".
Thanks for the video @dacort. I tried submitting the job to my EKS cluster but it failed. I can't even check the logs to see why it failed. Any help here would be really appreciated.
Hm, was it the job itself that failed? There's a lot of different reasons that could happen, so more details would be useful. :)
@@dacort The issue was with the permission granted to the role due to which it was unable to spin up the driver containers. Now the issues have been resolved, thanks :)
@@DE-YASH Sweet! Glad you got it figured out and sorry it took so long to reply. :)
Hi, thanks for the great video 😊 I am searching for an article or any direction to build a development pipeline on top of emr-eks from local/dev to production.
Sorry for the delay - this article here is for CI/CD with EMR Serverless, but could be adapted to EMR on EKS fairly easily.
the video talks about the advantages of using EMR on EC2 and EMR serverless, so what is benefit of using EMR on EKS?
EKS (Kubernetes) is great when you want to share your compute/memory resources across different variable workloads. Many orgs are adopting k8s, so EMR on EKS helps make it easier to run EMR workloads (like Spark and Flink) on top of EKS.
@@dacort Indeed, but one of the catches is that without quota or limit thresholds set at the k8s level, it's very easy for various teams/apps to cripple resources in the "emr" namespace for EMR containers. Anyway, great vid and thanks for the content!
If you use PySpark engine on EMR to read the data from S3, will the resultant dataframe be stored on EMR or S3?
Not entirely sure what you mean - when PySpark reads the dataframe it stores that in memory. You decide where to write it back out to.
why EMR serverless does not support Flink? and also why EMR on EKS does not support Hive?
Each deployment model of EMR has different use-cases and customer bases. In other words, "folks that tend to run a modern k8s environment, also run modern workloads like Spark or Flink, but not Hive."
Would this work with an EMR on EKS cluster?
Unfortunately not at this time.
Thanks for this great tutorial. Question: after I detach my notebook from a running cluster and stop my workspace (idle), am I going to be charged for anything else? (Of course, assuming I terminated my cluster but still want to keep my workspace in EMR Studio.) Thanks
Nope, no additional charge for notebooks/Studio. Just the underlying compute when you have a cluster or EMR serverless application running.
Nice!
Thanks for creating this nice tool. In the Glue catalog view, can we also see the partition keys of a table?
Unfortunately not yet!
Is there a way to install custom Java versions without creating custom images?
We now support Java 17 ( docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-java-runtime.html ). Unfortunately, there isn't another way to use custom Java versions without custom images.
Is there a way to run EMR serverless with GPU? I want to run pyspark jobs with NVIDIA RAPIDS
Not as of today. For that you'll still need EMR on EC2 or EMR on EKS.
@@dacort Ok. Thank you
What happens if the IP addresses have changed? Do we need to recreate the Studio?
great video
This was extremely helpful, thank you! One suggestion for future videos on EMR: It's sort of implied in this video, but I think it would be helpful to new people to really emphasize that this will NOT WORK if you are signed in as the root user. I imagine there's a long list of reasons why using root user is bad practice, but I also imagine it's very common for people just starting out, and cause many wasted hours. 😅
Thanks, Conor! I hadn't run into that before but will definitely keep it in mind!
Great video!! , Is there any way to run a dbt project using emr serverless?, I have seen that they have the Thrift option to connect to EMR on EC2, but I am not sure if it is possible to connect it to EMR serverless :(
Unfortunately not as of today. :(
Is there any way to run a dbt project using emr serverless?, I have seen that they have the Thrift option to connect to EMR on EC2, but I am not sure if it is possible to connect it to EMR serverless :(
Not as of today. :(
Hello, I am getting the following error while retrieving databases
Hi, I am getting USER_ERROR every time I use a custom image. Any solution for that?
Hi Adish - tough to say without more details. We do have a tool to help validate that image, perhaps that'll help? github.com/awslabs/amazon-emr-on-eks-custom-image-cli
Do any of these have a "vacuum" equivalent, or how do you do housekeeping / maintenance on these incremental data lakes?
Both Hudi and Iceberg have "maintenance" operations you can run, including compaction. See the docs for Iceberg ( iceberg.apache.org/docs/1.2.0/maintenance/#compact-data-files ) and Hudi ( hudi.apache.org/docs/compaction/ ).
That's an excellent demonstration
Hi Great video - can you please also show steps on how to install external libraries on EMR - bootstrap script replacement?
Assuming you're talking about EMR Serverless, there's a couple different options. You can use custom images ( docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-custom-image.html ) to install OS-level dependencies. If you're just talking about PySpark dependencies you can also bundle a virtual environment ( docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python-libraries.html ).
For PySpark dependencies like pandas or Kafka, how do I bundle a virtual environment? I'm new to Python; any help or suggestions are greatly appreciated.
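As a rough illustration of the virtual environment option mentioned above: the linked EMR Serverless documentation describes building a virtual environment on a compatible OS and Python version, installing your libraries plus venv-pack into it, packing it into an archive such as pyspark_venv.tar.gz, uploading the archive to S3, and then pointing Spark at it through job configuration. The sketch below only covers that last step and builds the configuration string. The Spark property names follow that documentation as I understand it, while the bucket path and the helper function names are hypothetical.

```python
from typing import Dict


def venv_spark_conf(archive_s3_uri: str, alias: str = "environment") -> Dict[str, str]:
    """Return Spark properties that make the driver and executors use a packed venv.

    archive_s3_uri is the S3 location of the packed archive (assumed to exist already).
    alias is the directory name the archive is unpacked into on each worker.
    """
    python_path = f"./{alias}/bin/python"
    return {
        "spark.archives": f"{archive_s3_uri}#{alias}",
        "spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON": python_path,
        "spark.emr-serverless.driverEnv.PYSPARK_PYTHON": python_path,
        "spark.executorEnv.PYSPARK_PYTHON": python_path,
    }


def to_spark_submit_parameters(conf: Dict[str, str]) -> str:
    """Render the properties as a single sparkSubmitParameters string."""
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


if __name__ == "__main__":
    # Hypothetical bucket and prefix; replace with your own archive location.
    conf = venv_spark_conf("s3://example-bucket/artifacts/pyspark_venv.tar.gz")
    print(to_spark_submit_parameters(conf))
```

The resulting string is what you would pass as the Spark submit parameters when starting the job. Check the linked documentation for the current property names before relying on this.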
Can you cover the setup in more detail, specifically the IAM roles, EMR service role, etc.?
Hi Dacort. Thank you for the video, but somehow I am unable to connect this with my AWS account as we use SSO (Single Sign-On) authentication. I am able to connect via the AWS Toolkit using the 'aws configure sso' command. Could you please help with what I can do in this case?
Amazing. Could you do a tutorial about using step function with EMR Serverless? Thanks.
EMR Serverless is not natively supported with Step Functions today, but there is a way to do it using Lambda functions. We have a blog post about it here, if it's helpful! aws.amazon.com/blogs/big-data/run-a-data-processing-job-on-amazon-emr-serverless-with-aws-step-functions/
nice work
great tutorial
Thanks for the video, Dacort. Please share how to connect to my own S3 from the Docker container. I tried to add AWS keys in the Dockerfile as ARG, but it did not work after rebuilding the container. Please advise.
Amazing video do you know if there is any chance to send parameters from airflow DAG to the called notebook? For example the DAG receives a random date&&number then when you trigger the DAG it send those parameters to the notebook. Thank you! :)
I didn't use notebooks in this video, but the EMR StartNotebookExecution API allows you to pass parameters to notebook runs. We have a blog post about that here: aws.amazon.com/blogs/big-data/orchestrating-analytics-jobs-on-amazon-emr-notebooks-using-amazon-mwaa/
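As a minimal sketch of passing parameters through that API: the request below follows the StartNotebookExecution request shape, where NotebookParams is a JSON-encoded string. The helper names, the IDs, the role name, and the parameter names (`run_date`, `batch_number`) are hypothetical placeholders. The client is passed in so the same function works with a real `boto3.client("emr")` or a stub.

```python
import json
from typing import Any, Dict


def build_notebook_execution_request(
    editor_id: str,
    relative_path: str,
    cluster_id: str,
    service_role: str,
    params: Dict[str, Any],
) -> Dict[str, Any]:
    """Build keyword arguments for the EMR StartNotebookExecution call."""
    return {
        "EditorId": editor_id,            # ID of the EMR notebook / workspace
        "RelativePath": relative_path,    # notebook file path within the editor
        "NotebookParams": json.dumps(params),  # parameters are sent as a JSON string
        "ExecutionEngine": {"Id": cluster_id, "Type": "EMR"},
        "ServiceRole": service_role,
    }


def start_notebook(emr_client: Any, **kwargs: Any) -> str:
    """Start the run and return its execution ID."""
    request = build_notebook_execution_request(**kwargs)
    response = emr_client.start_notebook_execution(**request)
    return response["NotebookExecutionId"]


# Usage from an Airflow task (all values below are placeholders):
#   import boto3
#   execution_id = start_notebook(
#       boto3.client("emr"),
#       editor_id="e-EXAMPLE",
#       relative_path="my_notebook.ipynb",
#       cluster_id="j-EXAMPLE",
#       service_role="EMR_Notebooks_DefaultRole",
#       params={"run_date": "2021-02-22", "batch_number": 7},
#   )
```

Inside the notebook, the parameters are typically picked up by a cell tagged for parameter injection; see the linked blog post for the full Airflow setup.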
This is great, Damon. Just a quick question: does this need a connection to AWS similar to the AWS Toolkit in Visual Studio? Somehow, even after having an AWS local profile and the necessary permissions, the explorers do not load.
It's not quite as robust as the AWS Toolkit authentication. It just uses the default profile from your environment. Do you get an error message or is there just nothing showing up? As long as you can run "aws emr list-clusters" or "aws emr-serverless list-applications" from your terminal, it should work in VS Code. There is also an "EMR: Select AWS Region" command to change regions.
@@dacort That's great. So the terminal commands used to work but not load up the explorer. But selecting the EMR:Select AWS Region solved the issue and now I can see all the explorers populated. Thank you so much. This is great :)
@@arjunshah6594 WOO HOO! Awesome. :) Thanks for giving it a try!
Amazing Demo!!!
Thanks for the concise demo!
This is a wonderful demo video and very helpful. We want to create and submit jobs from either Terraform or open source Airflow, but Terraform supports only application creation, whereas Airflow supports it from v5. Could you please share a list of ways to create and run jobs?
Can we customize pods other than driver and executor pods? I wish to mount files that need to be available on job-runner container
Some context - job runner container is the one that spins up driver pod
Have you configured EMR Studio for EMR on EKS? Please share the CloudFormation stack info and a YouTube video. Thanks!
There's a repository here that should help you get everything deployed you need: github.com/aws-samples/amazon-emr-on-eks-emr-studio
Amazing job, Thank you Dacort
I am trying to create Endpoint for EKS cluster to attach with EMR Studio. Endpoint requires Certificate. I have no clue how should I create certificate. I tried with certificate manager to create public certificate but not sure what is the domain I should provide there. Could you please explain about it ?
You can use a wildcard domain like *.ec2.internal and self-signed certificate. The instructions here specify how to do so: docs.aws.amazon.com/emr/latest/ManagementGuide/emr-encryption-enable.html#emr-encryption-certificates Once that certificate is created, you can import it into acm using a command like this: aws acm import-certificate --certificate fileb://trustedCertificates.pem --private-key fileb://privateKey.pem There's also a CDK example here that might be useful: github.com/aws-samples/aws-cdk-for-emr-on-eks