Amazing Demo!!!
great video
Amazing. Could you do a tutorial about using Step Functions with EMR Serverless? Thanks.
EMR Serverless is not natively supported with Step Functions today, but there is a way to do it using Lambda functions.
We have a blog post about it here, if it's helpful! aws.amazon.com/blogs/big-data/run-a-data-processing-job-on-amazon-emr-serverless-with-aws-step-functions/
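As a rough illustration of the Lambda approach mentioned above (a minimal sketch, not the code from the blog post): one function starts the job run and a second one reports its state, so a Step Functions state machine can call the first, wait, and poll the second. The event field names (`applicationId`, `executionRoleArn`, `entryPoint`, `arguments`) and the `client` parameter are assumptions made for this example.

```python
# Sketch of two Lambda-style handlers for driving EMR Serverless from Step Functions.
# The event keys used here are hypothetical; adapt them to your state machine input.

TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def _default_client():
    # Imported lazily so the handlers can be exercised with a stub client.
    import boto3
    return boto3.client("emr-serverless")


def start_job(event, context=None, client=None):
    """Start an EMR Serverless job run and return its identifiers."""
    client = client or _default_client()
    response = client.start_job_run(
        applicationId=event["applicationId"],
        executionRoleArn=event["executionRoleArn"],
        jobDriver={
            "sparkSubmit": {
                "entryPoint": event["entryPoint"],
                "entryPointArguments": event.get("arguments", []),
            }
        },
    )
    return {"applicationId": event["applicationId"], "jobRunId": response["jobRunId"]}


def check_job(event, context=None, client=None):
    """Look up the job run state so a Choice state can decide whether to keep waiting."""
    client = client or _default_client()
    response = client.get_job_run(
        applicationId=event["applicationId"],
        jobRunId=event["jobRunId"],
    )
    state = response["jobRun"]["state"]
    return {**event, "state": state, "done": state in TERMINAL_STATES}
```

In a state machine, the output of `start_job` would typically feed a Wait state followed by `check_job`, looping until `done` is true.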
Does anyone here know if it is possible to use Spark to read multiple Parquet files from an S3 bucket (all in the "ABC" folder) and combine them into one Parquet file ("DEF") in the same location? If so, what is the code? Thanks.
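One possible way to do this in PySpark, as a minimal sketch: read the whole folder as one DataFrame, then coalesce to a single partition before writing. The function name and the bucket name in the usage comment are made up for illustration. Note that Spark writes a directory containing one part file rather than a single file with a name you choose, and that `coalesce(1)` funnels all data through one task, so this only suits data that fits comfortably on one executor.

```python
def combine_parquet(spark, source_path, target_path):
    """Read every Parquet file under source_path and write it back as one part file."""
    # Pointing the reader at a folder picks up all Parquet files inside it.
    df = spark.read.parquet(source_path)
    # coalesce(1) reduces the output to a single partition, hence a single part file.
    df.coalesce(1).write.mode("overwrite").parquet(target_path)
    return target_path


# Example usage (hypothetical bucket name):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("combine-parquet").getOrCreate()
# combine_parquet(spark, "s3://my-bucket/ABC/", "s3://my-bucket/DEF/")
```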
Hi, great video. Can you please also show the steps to install external libraries on EMR, as a replacement for a bootstrap script?
Assuming you're talking about EMR Serverless, there are a couple of different options. You can use custom images ( docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-custom-image.html ) to install OS-level dependencies. If you're just talking about PySpark dependencies, you can also bundle a virtual environment ( docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python-libraries.html ).
For PySpark dependencies like pandas or Kafka, how do I bundle a virtual environment?
New to python, any help or suggestions are greatly appreciated.
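A rough sketch of the virtual environment approach, based on my understanding of the linked docs page: pack the environment into a tarball (for example with `venv-pack`), upload it to S3, then point Spark at it through spark-submit parameters. The helper below only builds that parameter string. The function name, the S3 URI, and the `environment` alias are assumptions for illustration, so check the configuration keys against the docs for your release.

```python
# Preparation, run on an Amazon Linux compatible machine or container (shown as comments):
#   python3 -m venv pyspark_venv
#   source pyspark_venv/bin/activate
#   pip install pandas venv-pack
#   venv-pack -f -o pyspark_venv.tar.gz
#   aws s3 cp pyspark_venv.tar.gz s3://my-bucket/artifacts/   # hypothetical bucket


def venv_spark_submit_parameters(archive_s3_uri, alias="environment"):
    """Build the --conf flags that tell the job to use the packed virtual environment."""
    python_bin = f"./{alias}/bin/python"
    confs = {
        # The archive is unpacked into a directory named after the alias.
        "spark.archives": f"{archive_s3_uri}#{alias}",
        "spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON": python_bin,
        "spark.emr-serverless.driverEnv.PYSPARK_PYTHON": python_bin,
        "spark.executorEnv.PYSPARK_PYTHON": python_bin,
    }
    return " ".join(f"--conf {key}={value}" for key, value in confs.items())
```

The returned string would go into the `sparkSubmitParameters` field of the job driver when starting the job run.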
Amazing video
Do you know if there is any way to send parameters from an Airflow DAG to the called notebook?
For example, the DAG receives a random date and number, and when you trigger the DAG it sends those parameters to the notebook.
Thank you! :)
I didn't use notebooks in this video, but the EMR StartNotebookExecution API allows you to pass parameters to notebook runs.
We have a blog post about that here: aws.amazon.com/blogs/big-data/orchestrating-analytics-jobs-on-amazon-emr-notebooks-using-amazon-mwaa/
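For illustration, passing parameters through that API might look something like the sketch below, which uses the boto3 `start_notebook_execution` call. The wrapper function and its argument names are invented for this example, and the blog post may structure things differently. In an Airflow task, the `params` dict could be taken from the DAG run configuration (for example `dag_run.conf`).

```python
import json


def start_notebook_with_params(emr_client, editor_id, relative_path,
                               cluster_id, service_role, params):
    """Start an EMR notebook execution, passing params as a JSON string."""
    response = emr_client.start_notebook_execution(
        EditorId=editor_id,
        RelativePath=relative_path,
        # NotebookParams is a JSON-formatted string, not a dict.
        NotebookParams=json.dumps(params),
        ExecutionEngine={"Id": cluster_id, "Type": "EMR"},
        ServiceRole=service_role,
    )
    return response["NotebookExecutionId"]


# Example usage (all identifiers are placeholders):
# import boto3
# emr = boto3.client("emr")
# start_notebook_with_params(
#     emr, "e-EXAMPLE", "my_notebook.ipynb", "j-EXAMPLE",
#     "EMR_Notebooks_DefaultRole", {"run_date": "2024-01-01", "number": 7},
# )
```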
Great video! Is there any way to run a dbt project using EMR Serverless? I have seen that there is a Thrift option to connect to EMR on EC2, but I am not sure if it is possible to connect it to EMR Serverless :(
Unfortunately not as of today. :(
Is there a way to install custom Java versions without creating custom images?
We now support Java 17 ( docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-java-runtime.html ). Unfortunately, there is no other way to use custom Java versions without custom images.
Is there a way to run EMR Serverless with GPUs? I want to run PySpark jobs with NVIDIA RAPIDS.
Not as of today. For that you'll still need EMR on EC2 or EMR on EKS.
@dacort Ok. Thank you