Coiled
  • 118 videos
  • 100,961 views
Schedule Python Jobs with Prefect and Coiled
Prefect makes it easy to write production workflows in Python. Getting started on a laptop usually takes just a few minutes.
Coiled makes it easy to deploy Prefect in the cloud. You might want to run a workflow, or a specific task within a workflow, in the cloud because:
- You want an always-on machine for long-running, or regularly scheduled, jobs
- You want to run close to your cloud-hosted data
- You want specific hardware (like GPUs, or lots of memory)
- You want many machines for running tasks in parallel
In this webinar, we deploy a Prefect workflow on the cloud with Coiled that processes a daily updated cloud dataset. This is a common pattern that shows up in fields like machine learning, ...
378 views

Videos

Churn Through Cloud Files in Parallel
153 views • 7 months ago
People often want to run the same function over many files. However, processing files in cloud storage is often slow and expensive due to transferring cloud data in and out of AWS/GCP/Azure. In this webinar recording we’ll show how to run this “same function on many files” pattern on the cloud with Coiled, so you can run existing code faster and cheaper with minimal changes. We’ll also highligh...
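The "same function on many files" pattern can be sketched locally with only the standard library (the filenames here are hypothetical). Coiled's serverless functions expose a similar map-style interface, but run each call on cloud VMs close to the data:

```python
from concurrent.futures import ThreadPoolExecutor

def process(filename: str) -> int:
    # Stand-in for downloading one object from cloud storage and parsing it.
    return len(filename)

filenames = [f"data/part-{i:04d}.parquet" for i in range(100)]

# Threads overlap the I/O-bound downloads; on the cloud, many VMs would
# each run `process` on their own subset of files.
with ThreadPoolExecutor(max_workers=8) as pool:
    sizes = list(pool.map(process, filenames))

print(sum(sizes))
```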
Analyzing the National Water Model with Xarray, Dask, and Coiled
426 views • 9 months ago
Mean weekly water table depth for US counties from 1979-2020. Water table depth fluctuates seasonally, decreasing with more precipitation in the winter and increasing with more periods of drought in the summer. 1m is optimal for many types of agriculture. Blog post: docs.coiled.io/blog/coiled-xarray.html Code: github.com/coiled/examples/tree/main/national-water-model
Dask DataFrame is Fast Now
1.2K views • 9 months ago
In this webinar, Patrick Höfler and Rick Zamora show how recent development efforts have driven performance improvements in Dask DataFrame. Key Moments 00:00 Intro 00:19 Dask DataFrame is fast now 02:06 Historical pain points 03:51 PyArrow-backed strings in Dask 06:04 Demo: PyArrow strings 08:53 Demo: Task-based shuffling is slow 11:11 Better performance with P2P shuffling 16:29 Sub-optimal que...
Spark, Dask, DuckDB, Polars: TPC-H Benchmarks at Scale
7K views • 10 months ago
We run the common TPC-H benchmark suite at 10 GB, 100 GB, 1 TB, and 10 TB scale, both on the cloud and on a local machine, and compare performance for common large dataframe libraries. No tool does universally well. We look at common bottlenecks and compare performance between the different systems. This talk was originally given at PyData NYC 2023. These results are preliminary, and come from only a couple ...
How do I Set Up Coiled?
357 views • 11 months ago
Set up Coiled to run Dask or other cloud processing APIs easily: 1. Create an account 2. Register an API token 3. Connect to your cloud. 00:00 Introduction 00:34 pip install coiled 00:51 Authenticate 01:25 Connect your Cloud 03:48 Add a Region 05:00 Hello, world! 06:25 Teams 07:11 Summary
Run Your Jupyter Notebooks in the Cloud
739 views • 11 months ago
When you're only processing 10-100GB of data, a hundred-worker cluster is probably overkill when a single, big VM will do. You can use Coiled notebooks to start a JupyterLab instance on any machine you’d like, whether that’s a better GPU or a single VM with hundreds of GBs of memory. Examples in our docs: docs.coiled.io/user_guide/usage/notebooks/index.html Get started with Coiled: coiled.io/st...
Coiled Overview
484 views • a year ago
Learn how to easily process data on the cloud with Coiled. This 15-minute video is an overview of many aspects of Coiled. For a more in-depth treatment, please consider the more topic-specific videos at youtube.com/@coiled 00:00 Introduction 01:14 API: CLI commands 02:41 API: Serverless Functions 03:40 API: Dask 06:25 API: Jupyter Notebooks 07:38 Management Dashboard 09:56 Architecture and Data Pri...
Run Python Scripts with Coiled Functions & Coiled Run
309 views • a year ago
Run a script or Python function in any cloud region on any hardware. Sometimes you don’t need a huge cluster for your workflows, and you just want to run your Python function on a VM in the cloud. In this webinar, we'll walk through these two APIs: Coiled Functions and Coiled Run. We'll see how to run a computation on a VM close to our data, train a PyTorch model on a GPU in the cloud, and scal...
Run Python Scripts in the Cloud with Coiled
747 views • a year ago
Sometimes you don’t need a huge cluster for your workflows, and you just want to run your Python function on a VM in the cloud. You might want to do this for a few reasons:
- You want a big machine
- You want a GPU
- You want to run close to your data
- You want to run the script many times while scaling out
With Coiled, you can run any Python function, script, or executable in your AWS or GCP account,...
How do I get my software onto cloud VMs? Automatic Package Synchronization with Coiled
152 views • a year ago
Getting your software onto cloud VMs is hard. Coiled makes it easy...mostly. This video talks about how Coiled manages software for Python development in the cloud, and methods to escape when things go wrong. More information available at docs.coiled.io/user_guide/software/ Blog posts: How many PEPs does it take to install a package? medium.com/coiled-hq/how-many-peps-does-it-take-to-install-a-...
Coiled Cluster Configuration
175 views • a year ago
Learn how to configure your Coiled resources, including selecting instance types, regions, and different hardware choices. Documentation at docs.coiled.io/user_guide/clusters/ More videos to help you set up Coiled ruclips.net/video/QXql9O8kSPk/видео.html ruclips.net/video/ukkOJPF2URY/видео.html ruclips.net/video/eXP-YuERvi4/видео.html Get started with Coiled for free: coiled.io/start
Jupyter Notebooks with Coiled
343 views • a year ago
Jupyter notebooks on large VMs in the cloud using Coiled. This approach synchronizes your local packages and files, giving a smooth Big Laptop experience. Check out this blog post for more details: medium.com/coiled-hq/coiled-notebooks-d4577596ff4a Key Moments 00:00 Intro 01:00 coiled notebook start 02:17 Cloud Notebook Starts 03:11 File sync 04:52 Summary Scale Your Python Workloads with Dask ...
Dask Futures Tutorial: Parallelize Python Code with Dask
1.7K views • a year ago
In this lesson, we'll parallelize a custom Python workflow that scrapes, parses, and cleans data from Stack Overflow. We'll get to: - Learn how to do arbitrary task scheduling using the Dask Futures API - Utilize blocking and non-blocking distributed calculations Notebook here: github.com/coiled/dask-tutorial/blob/main/1-Parallelize-your-python-code_Futures_API.ipynb Tutorial repo: github.com/c...
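A minimal version of the Futures API covered in the tutorial, assuming `dask.distributed` is installed; `submit` and `map` are non-blocking, and `result()` blocks until the value is ready:

```python
from dask.distributed import Client

def inc(x: int) -> int:
    return x + 1

if __name__ == "__main__":
    # In-process scheduler using threads; on Coiled this would be a cloud cluster.
    client = Client(processes=False, n_workers=1)

    futures = client.map(inc, range(10))  # non-blocking: returns futures at once
    total = client.submit(sum, futures)   # futures can be passed into new tasks
    print(total.result())                 # blocking call: waits for the answer
    client.close()
```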
Dask DataFrames Tutorial: Best practices for larger-than-memory dataframes
2.1K views • a year ago
Learn best practices for larger-than-memory dataframes. Investigate Uber/Lyft data and learn to do the following: - Manipulate Parquet files and optimize queries - Navigate inconvenient file sizes and data types - Tune Parquet storage, build features, and explore a challenging dataset with Pandas and Dask. Notebook here: github.com/coiled/dask-tutorial/blob/main/2-Get_better-at-dask-dataframes....
Databricks vs. Dask and Coiled
413 views • a year ago
Coiled Xarray Example
549 views • a year ago
Coiled Dashboard: Monitor Teams and Manage Costs Easily and Efficiently
188 views • a year ago
Dask + Pandas for Parallel ETL
1.2K views • a year ago
XGBoost and HyperParameter Optimization
861 views • a year ago
Dask Futures for General Parallelism
884 views • a year ago
Engineering a Technical Newsletter: A transparent analysis of the Coiled newsletter
57 views • a year ago
Six Coiled features for Dask users
433 views • a year ago
Dask Infrastructure with Coiled for Pangeo
377 views • a year ago
Dask on Single Machine with Coiled
378 views • a year ago
Dask and Optuna for Hyper Parameter Optimization
2.1K views • a year ago
Measuring the GIL | Does pandas release the GIL?
559 views • a year ago
High Performance Visualization | Parallel performance with Dask & Datashader
4.3K views • a year ago
Transforming Parquet Data at Scale on the Cloud with Dask & Coiled | NYC Taxi Uber/Lyft Data
475 views • a year ago
Scale Python with Dask and Coiled | Setting up a production environment in the cloud
1K views • a year ago

Comments

  • @fida47 • 3 days ago

    Can someone share the dataset link? Where can I download the 10 CSV files of the NYC flights dataset?

  • @Andikan4U • 9 days ago

    Thank you

  • @FabioRBelotto • a month ago

    If I run Dask without importing the client, it does not work on many workers ?

  • @FabioRBelotto • a month ago

    The source was only one big Parquet file? Did Dask set partitions by itself?

  • @FabioRBelotto • a month ago

    My main issue with Dask is the lack of community support (very different from pandas!)

  • @richerite • a month ago

    Great talk! What would you recommend for ingesting about 100-200GB of geospatial data on premise?

  • @mohitparwani4235 • 2 months ago

    CancelledError: ('mul-floordiv-3770c7fe5e6231d62ed3d68e48276fbd', 0)
    Traceback (most recent call last):
      File <timed eval>:2
      File c:\Users\mohit.parwani\.conda\envs\parApat\Lib\site-packages\dask_expr\_collection.py:476, in FrameBase.compute(self, fuse, **kwargs)
      File c:\Users\mohit.parwani\.conda\envs\parApat\Lib\site-packages\dask\base.py:375, in DaskMethodsMixin.compute(self, **kwargs)
      File c:\Users\mohit.parwani\.conda\envs\parApat\Lib\site-packages\dask\base.py:661, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
      File c:\Users\mohit.parwani\.conda\envs\parApat\Lib\site-packages\distributed\client.py:2235, in Client._gather(self, futures, errors, direct, local_worker)
    CancelledError: ('mul-floordiv-3770c7fe5e6231d62ed3d68e48276fbd', 0)

    I'm getting this error when I use the client. Can someone please help with a possible solution? I definitely need it, please!

  • @as978 • 3 months ago

    So happy to see this. Better late than never. Hopefully Dask gets the popularity it deserves and becomes a serious contender to Spark down the line.

  • @gemini_537 • 3 months ago

    Gemini 1.5 Pro: This video is an introduction to Dask DataFrames, covering when to use them, how to use them, and performance tips. The video explains that pandas is great for tabular data sets that fit into memory, while Dask is useful for data sets larger than your machine can handle: Dask cuts your big data set into smaller pieces and executes those parts in parallel. Key points covered:
    - When to use Dask DataFrames: use them if your data doesn't fit into memory and your computations are complex. Pandas may hit a memory error if the data is too large, but Dask can handle such large-scale computations comfortably.
    - Dask DataFrames vs. pandas DataFrames: Dask DataFrames implement a well-used portion of the pandas API, so a lot of Dask DataFrame code will look and feel familiar to pandas users. One key difference is that Dask DataFrames are lazy: they only build the task graph (a recipe for reaching the final result) and don't execute it until you explicitly call compute.
    - Working with partitions: Dask DataFrames are cut into small pieces called partitions, and each partition is just a pandas DataFrame, so you can perform pandas operations on those partitions.
    - Performance tips: call compute when you want to combine computations into a single task graph. When the task graphs for several results are merged, Dask only needs to read the data from the CSV file once instead of twice.
    The video concludes by mentioning that this is module two of the introduction to Dask tutorial, and that the next module covers processing array data with Dask Arrays.
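The "call compute once so the data is only read once" point can be illustrated with a small self-contained sketch using `dask.delayed` (a toy `load` stands in for the CSV read):

```python
import dask

@dask.delayed
def load() -> list[int]:
    # Stand-in for reading a CSV file; it should only run once below.
    return list(range(100))

@dask.delayed
def total(rows):
    return sum(rows)

@dask.delayed
def count(rows):
    return len(rows)

rows = load()
# A single dask.compute call merges both task graphs, so `load`
# executes once instead of once per result.
t, c = dask.compute(total(rows), count(rows))
print(t, c)
```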

  • @zapy422 • 4 months ago

    How is this setup solving dependencies for the Python code?

    • @MatthewRocklin • 4 months ago

      We scrape the local environment for package versions, move those to the target architecture, use mamba to solve and fill in any missing pieces, then we download the new packages on the fly onto each machine. It all happens seamlessly in the background. Users don't need to care about this detail (other than that it works)
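The first step Matthew describes, scraping the local environment for package versions, can be imitated with the standard library. This is only a toy sketch: the real package sync also handles pip/conda metadata, platform differences, and locally edited files.

```python
from importlib.metadata import distributions

# Snapshot the packages installed in the current environment.
env = {dist.metadata["Name"]: dist.version for dist in distributions()}

# A spec like this could then be shipped to the cloud VM, where a solver
# (e.g. mamba, per the reply above) fills in any missing pieces.
for name in sorted(n for n in env if n)[:5]:
    print(name, env[name])
```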

  • @maksimhajiyev7857 • 5 months ago

    The problem is that Rust-based tooling actually wins, and all the paid promotions just suck. The actual reason Rust-based tooling is somewhat suppressed is very simple: hyperscalers (big cloud tech) earn a lot of money, and if things are faster there are no huge bills for your Spark clusters 😊)) I was playing with Rust and huge datasets myself, without external benchmarks, because I don't trust all this market stuff. Rust-based EDA may be witchcraft, but this thing runs like a beast. Try it yourselves with huge datasets.

  • @carlostph • 5 months ago

    When you say "now", from what version are we talking about? To future-proof the video.

  • @manojjoshi4321 • 6 months ago

    It's a great introduction with very cool and easy-to-follow illustrations. Great job....!!

  • @kokizzu • 6 months ago

    Clickhouse ftw

  • @giselleandreaulloadelarosa1869 • 6 months ago

    Would you please share a link to the GitHub?

  • @henrywittler5046 • 7 months ago

    Great work 🙂 Dask will facilitate solving computational data analysis issues for many people

  • @snowaIker • 7 months ago

    How does delayed get around the GIL?

  • @wayne7936 • 7 months ago

    This is such a clear, simple, yet extremely powerful introduction. Alright, you convinced me to try coiled again.

    • @Coiled • 7 months ago

      Achievement unlocked! If you tried out Coiled more than a year ago then it's definitely worth trying again. Admittedly, the product was kinda bad early on. Now it is quite delightful.

  • @ravishmahajan9314 • 7 months ago

    But DuckDB is good if your data fits on one single machine. The benchmarks show a different story when data is distributed. What about that?

  • @henrywittler5046 • 7 months ago

    Thanks for this tutorial and the other material at Dask and Coiled, will help heaps in a large data project 🙂

  • @taylorpaskett3703 • 7 months ago

    What software did you use for generating / displaying your plots? It looked really nice

    • @taylorpaskett3703 • 7 months ago

      Nevermind, if I just kept watching you showed the GitHub where it says ibis and altair. Thanks!

  • @randywilliams7696 • 8 months ago

    Great video! Recently switched from Dask to Duckdb on my ~1TB workloads, interesting to see some of the same issues I found brought up here. One gotcha I've found is that it is REALLY easy to blunder your way into making non-performant queries in dask (things that end up shuffling, partitioning, etc. a lot behind the scenes). It was more straightforward for my use case to write performant SQL queries for duckdb since that is much more of a common, solved problem. The scale-out feature of Dask and Spark is interesting too, as we are considering the merits of a natively clustered solution vs just breaking up our queries into chunks that can fit on multiple single instances for duckdb.

    • @MatthewRocklin • 8 months ago

      Yup. Totally agreed. The query optimization in Dask Dataframe should handle what you ran into historically. The problem wasn't unique to you :)

    • @ravishmahajan9314 • 7 months ago

      But what about distributed databases? Is DuckDB able to query distributed databases? Is this technology replacing the Spark framework?

  • @rjv • 8 months ago

    Such a good video! So many good insights clearly communicated with proper data. Also love the interfaces you've built, very meaningful, clean and minimalistic. Have you got comparison benchmarks where cloud cost is the only constraint and the number of machines or their size and type (GPU machines with cudf) is not restricted?

  • @mooncop • 10 months ago

    you are most welcome (suffered well) worth it for the duck

  • @bbbbbbao • 10 months ago

    It's not clear to me if you can use autoscaling with coiled.

    • @Coiled • 9 months ago

      You can use autoscaling with Coiled. See the `coiled.Cluster.adapt` method.
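The same adaptive API exists on Dask's `LocalCluster`, which makes an easy local demo, assuming `dask.distributed` is installed; with Coiled you would call `.adapt()` on the `coiled.Cluster` object instead:

```python
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # In-process cluster; Coiled's cluster object exposes the same adapt() call.
    cluster = LocalCluster(processes=False, n_workers=1)
    cluster.adapt(minimum=1, maximum=4)  # scale between 1 and 4 workers on demand

    client = Client(cluster)
    squares = client.gather(client.map(lambda x: x * x, range(10)))
    print(sum(squares))  # 285
    client.close()
    cluster.close()
```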

  • @o0o0oo00oo00 • 10 months ago

    I don’t see DuckDB and Polars kicking Spark and Dask's ass at the 10 GB level in my practical usage. 😅 We can’t always trust TPC-H benchmarks.

  • @andrewm4894 • 10 months ago

    Great talk, thanks

  • @Amapramaadhy • 10 months ago

    Some people were meant to teach, and Matt is one of them! One piece of feedback: I know you have covered it elsewhere, but it might be helpful to talk about the graphs (like what a yellow vs. red block means). You have them up on the screen; they must be serving some purpose. Again, brilliant presentation.

  • @kamranpersianable • 11 months ago

    Thanks, this is amazing! I have tried integrating Optuna hyperparameter search with Dask and it works great, but I have noticed if I increase the number of iterations, at some point my system crashes due to insufficient memory. From what I can see dask keeps a copy of each iteration so it ends up consuming more memory than needed; any way I can release all the memory usages after each iteration?

    • @Coiled • 11 months ago

      The copy that Dask keeps is just the result of the objective function (scores, metrics). This should be pretty lightweight. That's not to say that there isn't some memory leak somewhere (XGBoost, Pandas, ...). If you're able to provide a reproducer to a Dask issue tracker that would be welcome. Alternatively if you run on Coiled infrastructure there's lots of measurement tools there that get run automatically that could help to diagnose.

    • @kamranpersianable • 11 months ago

      @@Coiled thanks, I will check further to see what is going wrong! From what I can see, for 500 iterations there is 9 GB of added material in memory.

  • @ButchCassidyAndSundanceKid • a year ago

    Does Dask Delayed use the GPU as well?

  • @UmmadikTas • a year ago

    I had an issue with parallelization and the random sampler for hyperparameter search. When I submit the optimize function in parallel, Optuna keeps repeating the same hyperparameters across all processes. I could not figure out how to reseed the sampler for different processes.

    • @Coiled • a year ago

      Are the different processes communicating hyperparameters with a central Optuna Storage object? This video shows using the DaskStorage, which helps all of the Optuna search functions coordinate and share results with each other using Dask. Other ways to do this include using things like a database (although we think that Dask is easier).

  • @ButchCassidyAndSundanceKid • a year ago

    What about Dask Bag and Dask Futures?

  • @irfams • a year ago

    Would you please share a link to the notebook?

  • @UmmadikTas • a year ago

    Thank you so much. This is very helpful with my research.

  • @chaitanyamadduri5826 • a year ago

    The video is very informative, and kudos to Richard for making it intuitive. Could you help me with the questions below?
    1. How can we perform a time series regression using Dask? I see we are breaking the huge dataset into chunks; how are we going to maintain time continuity between the chunks?
    2. You have used Coiled clusters, which I believe are external CPU clusters; how is Dask more powerful than PySpark in this case?
    3. So Dask can only be utilised for CPU execution, and it might be used in the case of parallel GPU execution, right?
    Share your comments on this. Thanks in advance

    • @Coiled • a year ago

      Thanks for the questions! First, you can always post more detailed questions on the Dask Forum dask.discourse.group/. For your question on a time series regression, you may find this example helpful examples.dask.org/applications/forecasting-with-prophet.html If you're curious to learn more about pros/cons of Dask vs. Spark, check out our blog post: www.coiled.io/blog/spark-vs-dask You can use Dask (and Coiled!) with GPU-enabled machines. Learn more in the Coiled docs.coiled.io/user_guide/clusters/gpu.html or Dask documentation docs.dask.org/en/stable/gpu.html

  • @Lemuz90 • a year ago

    This looks great! I remember trying to use coiled jobs to do something like this a while ago.

    • @Coiled • a year ago

      Thank you! Let us know how you end up using this!

  • @orlandogarcia885 • a year ago

    What are the upcoming features that Coiled plans to build?

    • @Coiled • a year ago

      We are working on lots of new things - check out Coiled Notebooks: ruclips.net/video/mibhDHYun0M/видео.html and our upcoming webinar about Coiled Functions and Jobs, which allow you to run any python function in the cloud: ruclips.net/video/JuBmG39zLY8/видео.html.

  • @thomasmoore3175 • a year ago

    great stuff, Matt !

  • @bvenkateshx • a year ago

    I have a use case to read data from an Oracle table, split it into files, zip them, and move them to S3. Would Dask be a benefit or an overhead for such a use case? (cx_Oracle is used; currently using multiprocessing on a 20-core server.)

    • @Coiled • a year ago

      Thanks for the question! It's hard to answer without more details on the size of your data, but feel free to post your question on the Dask Forum dask.discourse.group/

  • @Coiled • a year ago

    Update: pandas 2.0 has been released! See www.coiled.io/blog/pyarrow-strings-in-dask-dataframes for the latest on PyArrow strings improvements.

  • @user-be4vx5by8p • a year ago

    Thank you very much for this useful information

  • @billyblackburn864 • a year ago

    The one at 15 min is really nice... what is the cluster you're running it on?

  • @exeb1t_solopharm • a year ago

    Thank you so much! A great video series, keep up the good work!

  • @user-lx5gf4vd4c • a year ago

    Good video! Can you help me? Where can I find the notebook from this video?

  • @mikecmw8492 • a year ago

    This is a very good video. I have to ask because I am in the situation of setting up a Dask cluster that will be querying large weather datasets in AWS S3. I have never done it. Do you have a video on setting up the cluster? I have not explored your channel yet... thx

  • @pieter5466 • a year ago

    33:00 Surprising that there aren’t existing open-source solutions that support “marginal” arrays, so to speak... has this changed?

  • @francescos7361 • a year ago

    Thanks, interesting for oceanographic research.

  • @NajiShajarisales • a year ago

    Thanks for this video!! I am not sure how it is beneficial to have the Dask worker code inside the same process where the user code is called. After all, pinging the process that runs the user code does not need to happen often, and that way the GIL would not block the heartbeat being communicated to the scheduler. Am I missing something here? Any pointer is appreciated.

  • @floopybits8037 • a year ago

    Just one word WOW