Coiled
  • 118 videos
  • 100,961 views
Schedule Python Jobs with Prefect and Coiled
Prefect makes it easy to write production workflows in Python. Getting started on a laptop usually takes just a few minutes.
Coiled makes it easy to deploy Prefect in the cloud. You might want to run a workflow, or a specific task within a workflow, in the cloud because:
- You want an always-on machine for long-running, or regularly scheduled, jobs
- You want to run close to your cloud-hosted data
- You want specific hardware (like GPUs, or lots of memory)
- You want many machines for running tasks in parallel
In this webinar, we deploy a Prefect workflow on the cloud with Coiled that processes a daily updated cloud dataset. This is a common pattern that shows up in fields like machine learning, ...
378 views

Videos

Churn Through Cloud Files in Parallel
153 views • 7 months ago
People often want to run the same function over many files. However, processing files in cloud storage is often slow and expensive due to transferring cloud data in and out of AWS/GCP/Azure. In this webinar recording we’ll show how to run this “same function on many files” pattern on the cloud with Coiled, so you can run existing code faster and cheaper with minimal changes. We’ll also highligh...
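The "same function on many files" pattern can be sketched locally with only the standard library (the filenames here are hypothetical). Coiled's serverless functions expose a similar map-style interface, but run each call on cloud VMs close to the data:

```python
from concurrent.futures import ThreadPoolExecutor

def process(filename: str) -> int:
    # Stand-in for downloading one object from cloud storage and parsing it.
    return len(filename)

filenames = [f"data/part-{i:04d}.parquet" for i in range(100)]

# Threads overlap the I/O-bound downloads; on the cloud, many VMs would
# each run `process` on their own subset of files.
with ThreadPoolExecutor(max_workers=8) as pool:
    sizes = list(pool.map(process, filenames))

print(sum(sizes))
```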
Analyzing the National Water Model with Xarray, Dask, and Coiled
426 views • 9 months ago
Mean weekly water table depth for US counties from 1979-2020. Water table depth fluctuates seasonally, decreasing with more precipitation in the winter and increasing with more periods of drought in the summer. 1m is optimal for many types of agriculture. Blog post: docs.coiled.io/blog/coiled-xarray.html Code: github.com/coiled/examples/tree/main/national-water-model
Dask DataFrame is Fast Now
1.2K views • 9 months ago
In this webinar, Patrick Höfler and Rick Zamora show how recent development efforts have driven performance improvements in Dask DataFrame. Key Moments 00:00 Intro 00:19 Dask DataFrame is fast now 02:06 Historical pain points 03:51 PyArrow-backed strings in Dask 06:04 Demo: PyArrow strings 08:53 Demo: Task-based shuffling is slow 11:11 Better performance with P2P shuffling 16:29 Sub-optimal que...
Spark, Dask, DuckDB, Polars: TPC-H Benchmarks at Scale
7K views • 10 months ago
We run the common TPC-H benchmark suite at 10 GB, 100 GB, 1 TB, and 10 TB scale, both on the cloud and on a local machine, and compare performance for common large dataframe libraries. No tool does universally well. We look at common bottlenecks and compare performance between the different systems. This talk was originally given at PyData NYC 2023. These results are preliminary, and come from only a couple ...
How do I Set Up Coiled?
357 views • 11 months ago
Set up Coiled to run Dask or other cloud processing APIs easily: 1. Create an account 2. Register an API token 3. Connect to your cloud. 00:00 Introduction 00:34 pip install coiled 00:51 Authenticate 01:25 Connect your Cloud 03:48 Add a Region 05:00 Hello, world! 06:25 Teams 07:11 Summary
Run Your Jupyter Notebooks in the Cloud
739 views • 11 months ago
When you're only processing 10-100GB of data, a hundred-worker cluster is probably overkill when a single, big VM will do. You can use Coiled notebooks to start a JupyterLab instance on any machine you’d like, whether that’s a better GPU or a single VM with hundreds of GBs of memory. Examples in our docs: docs.coiled.io/user_guide/usage/notebooks/index.html Get started with Coiled: coiled.io/st...
Coiled Overview
484 views • a year ago
Learn how to easily process data on the cloud with Coiled. This 15-minute video is an overview of many aspects of Coiled. For a more in-depth treatment, please consider the more topic-specific videos at youtube.com/@coiled 00:00 Introduction 01:14 API: CLI commands 02:41 API: Serverless Functions 03:40 API: Dask 06:25 API: Jupyter Notebooks 07:38 Management Dashboard 09:56 Architecture and Data Pri...
Run Python Scripts with Coiled Functions & Coiled Run
309 views • a year ago
Run a script or Python function in any cloud region on any hardware. Sometimes you don’t need a huge cluster for your workflows, and you just want to run your Python function on a VM in the cloud. In this webinar, we'll walk through these two APIs: Coiled Functions and Coiled Run. We'll see how to run a computation on a VM close to our data, train a PyTorch model on a GPU in the cloud, and scal...
Run Python Scripts in the Cloud with Coiled
747 views • a year ago
Sometimes you don’t need a huge cluster for your workflows, and you just want to run your Python function on a VM in the cloud. You might want to do this for a few reasons:
- You want a big machine
- You want a GPU
- You want to run close to your data
- You want to run the script many times while scaling out
With Coiled, you can run any Python function, script, or executable in your AWS or GCP account,...
How do I get my software onto cloud VMs? Automatic Package Synchronization with Coiled
152 views • a year ago
Getting your software onto cloud VMs is hard. Coiled makes it easy...mostly. This video talks about how Coiled manages software for Python development in the cloud, and methods to escape when things go wrong. More information available at docs.coiled.io/user_guide/software/ Blog posts: How many PEPs does it take to install a package? medium.com/coiled-hq/how-many-peps-does-it-take-to-install-a-...
Coiled Cluster Configuration
175 views • a year ago
Learn how to configure your Coiled resources, including selecting instance types, regions, and different hardware choices. Documentation at docs.coiled.io/user_guide/clusters/ More videos to help you set up Coiled ruclips.net/video/QXql9O8kSPk/видео.html ruclips.net/video/ukkOJPF2URY/видео.html ruclips.net/video/eXP-YuERvi4/видео.html Get started with Coiled for free: coiled.io/start
Jupyter Notebooks with Coiled
343 views • a year ago
Jupyter notebooks on large VMs in the cloud using Coiled. This approach synchronizes your local packages and files, giving a smooth Big Laptop experience. Check out this blog post for more details: medium.com/coiled-hq/coiled-notebooks-d4577596ff4a Key Moments 00:00 Intro 01:00 coiled notebook start 02:17 Cloud Notebook Starts 03:11 File sync 04:52 Summary Scale Your Python Workloads with Dask ...
Dask Futures Tutorial: Parallelize Python Code with Dask
1.7K views • a year ago
In this lesson, we'll parallelize a custom Python workflow that scrapes, parses, and cleans data from Stack Overflow. We'll get to: - Learn how to do arbitrary task scheduling using the Dask Futures API - Utilize blocking and non-blocking distributed calculations Notebook here: github.com/coiled/dask-tutorial/blob/main/1-Parallelize-your-python-code_Futures_API.ipynb Tutorial repo: github.com/c...
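A minimal version of the Futures API covered in the tutorial, assuming `dask.distributed` is installed; `submit` and `map` are non-blocking, and `result()` blocks until the value is ready:

```python
from dask.distributed import Client

def inc(x: int) -> int:
    return x + 1

if __name__ == "__main__":
    # In-process scheduler using threads; on Coiled this would be a cloud cluster.
    client = Client(processes=False, n_workers=1)

    futures = client.map(inc, range(10))  # non-blocking: returns futures at once
    total = client.submit(sum, futures)   # futures can be passed into new tasks
    print(total.result())                 # blocking call: waits for the answer
    client.close()
```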
Dask DataFrames Tutorial: Best practices for larger-than-memory dataframes
2.1K views • a year ago
Learn best practices for larger-than-memory dataframes. Investigate Uber/Lyft data and learn to do the following: - Manipulate Parquet files and optimize queries - Navigate inconvenient file sizes and data types - Tune Parquet storage, build features, and explore a challenging dataset with Pandas and Dask. Notebook here: github.com/coiled/dask-tutorial/blob/main/2-Get_better-at-dask-dataframes....
Databricks vs. Dask and Coiled
413 views • a year ago
Coiled Xarray Example
549 views • a year ago
Coiled Dashboard: Monitor Teams and Manage Costs Easily and Efficiently
188 views • a year ago
Dask + Pandas for Parallel ETL
1.2K views • a year ago
XGBoost and HyperParameter Optimization
861 views • a year ago
Dask Futures for General Parallelism
884 views • a year ago
Engineering a Technical Newsletter: A transparent analysis of the Coiled newsletter
57 views • a year ago
Six Coiled features for Dask users
433 views • a year ago
Dask Infrastructure with Coiled for Pangeo
377 views • a year ago
Dask on Single Machine with Coiled
378 views • a year ago
Dask and Optuna for Hyper Parameter Optimization
2.1K views • a year ago
Measuring the GIL | Does pandas release the GIL?
559 views • a year ago
High Performance Visualization | Parallel performance with Dask & Datashader
4.3K views • a year ago
Transforming Parquet Data at Scale on the Cloud with Dask & Coiled | NYC Taxi Uber/Lyft Data
475 views • a year ago
Scale Python with Dask and Coiled | Setting up a production environment in the cloud
1K views • a year ago

Comments

  • @fida47 • 3 days ago

    Can someone share the dataset link? Where can I download the 10 CSV files of the NYC flights dataset?

  • @Andikan4U • 9 days ago

    Thank you

  • @FabioRBelotto • a month ago

    If I run Dask without importing the client, it does not work on many workers ?

  • @FabioRBelotto • a month ago

    The source was only one big Parquet file? Did Dask set partitions by itself?

  • @FabioRBelotto • a month ago

    My main issue with Dask is the lack of community support (very different from pandas!)

  • @richerite • a month ago

    Great talk! What would you recommend for ingesting about 100-200GB of geospatial data on premise?

  • @mohitparwani4235 • 2 months ago

    CancelledError: ('mul-floordiv-3770c7fe5e6231d62ed3d68e48276fbd', 0)
    Traceback (most recent call last):
      File <timed eval>:2
      File c:\Users\mohit.parwani\.conda\envs\parApat\Lib\site-packages\dask_expr\_collection.py:476, in FrameBase.compute(self, fuse, **kwargs)
      File c:\Users\mohit.parwani\.conda\envs\parApat\Lib\site-packages\dask\base.py:375, in DaskMethodsMixin.compute(self, **kwargs)
      File c:\Users\mohit.parwani\.conda\envs\parApat\Lib\site-packages\dask\base.py:661, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
      File c:\Users\mohit.parwani\.conda\envs\parApat\Lib\site-packages\distributed\client.py:2235, in Client._gather(self, futures, errors, direct, local_worker)
    CancelledError: ('mul-floordiv-3770c7fe5e6231d62ed3d68e48276fbd', 0)

    I'm getting this error when I use the client. Can someone please help with a possible solution? I definitely need it, please!

  • @as978 • 3 months ago

    So happy to see this. Better late than never. Hopefully Dask gets the popularity it deserves and becomes a serious contender to Spark down the line.

  • @gemini_537 • 3 months ago

    Gemini 1.5 Pro: This video is an introduction to Dask DataFrames, covering when to use them, how to use them, and performance tips. The video explains that pandas is great for tabular data sets that fit into memory, while Dask is useful for data sets larger than your machine can handle: Dask cuts your big data set into smaller pieces and executes those parts in parallel. Key points covered:
    - When to use Dask DataFrames: use them if your data doesn't fit into memory and your computations are complex. Pandas may hit a memory error if the data is too large, but Dask can handle such large-scale computations comfortably.
    - Dask DataFrames vs. pandas DataFrames: Dask DataFrames implement a well-used portion of the pandas API, so a lot of Dask DataFrame code will look and feel familiar to pandas users. One key difference is that Dask DataFrames are lazy: they only build the task graph (a recipe for reaching the final result) and don't execute it until you explicitly call compute.
    - Working with partitions: Dask DataFrames are cut into small pieces called partitions, and each partition is just a pandas DataFrame, so you can perform pandas operations on those partitions.
    - Performance tips: call compute when you want to combine computations into a single task graph. When the task graphs for several results are merged, Dask only needs to read the data from the CSV file once instead of twice.
    The video concludes by mentioning that this is module two of the introduction to Dask tutorial, and that the next module covers processing array data with Dask Arrays.
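The "call compute once so the data is only read once" point can be illustrated with a small self-contained sketch using `dask.delayed` (a toy `load` stands in for the CSV read):

```python
import dask

@dask.delayed
def load() -> list[int]:
    # Stand-in for reading a CSV file; it should only run once below.
    return list(range(100))

@dask.delayed
def total(rows):
    return sum(rows)

@dask.delayed
def count(rows):
    return len(rows)

rows = load()
# A single dask.compute call merges both task graphs, so `load`
# executes once instead of once per result.
t, c = dask.compute(total(rows), count(rows))
print(t, c)
```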

  • @zapy422 • 4 months ago

    How is this setup solving dependencies for the Python code?

    • @MatthewRocklin • 4 months ago

      We scrape the local environment for package versions, move those to the target architecture, use mamba to solve and fill in any missing pieces, then we download the new packages on the fly onto each machine. It all happens seamlessly in the background. Users don't need to care about this detail (other than that it works)
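The first step Matthew describes, scraping the local environment for package versions, can be imitated with the standard library. This is only a toy sketch: the real package sync also handles pip/conda metadata, platform differences, and locally edited files.

```python
from importlib.metadata import distributions

# Snapshot the packages installed in the current environment.
env = {dist.metadata["Name"]: dist.version for dist in distributions()}

# A spec like this could then be shipped to the cloud VM, where a solver
# (e.g. mamba, per the reply above) fills in any missing pieces.
for name in sorted(n for n in env if n)[:5]:
    print(name, env[name])
```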

  • @maksimhajiyev7857 • 5 months ago

    The problem is that Rust-based tooling actually wins, and all the paid promotions just suck. The actual reason Rust-based tooling is somewhat suppressed is very simple: hyperscalers (big cloud tech) earn a lot of money, and if things are faster there are no huge bills for your Spark clusters 😊)) I was playing with Rust and huge datasets myself, without external benchmarks, because I don't trust all this market stuff. Rust-based EDA may be witchcraft, but this thing runs like a beast. Try it yourselves with huge datasets.

  • @carlostph • 5 months ago

    When you say "now", from what version are we talking about? To future-proof the video.

  • @manojjoshi4321 • 6 months ago

    It's a great introduction with very cool and easy-to-follow illustrations. Great job....!!

  • @kokizzu • 6 months ago

    Clickhouse ftw

  • @giselleandreaulloadelarosa1869 • 6 months ago

    Would you please share a link to the GitHub?

  • @henrywittler5046 • 7 months ago

    Great work 🙂 Dask will facilitate solving computational data analysis issues for many people

  • @snowaIker • 7 months ago

    How does delayed get around the GIL?

  • @wayne7936 • 7 months ago

    This is such a clear, simple, yet extremely powerful introduction. Alright, you convinced me to try coiled again.

    • @Coiled • 7 months ago

      Achievement unlocked! If you tried out Coiled more than a year ago then it's definitely worth trying again. Admittedly, the product was kinda bad early on. Now it is quite delightful.

  • @ravishmahajan9314 • 7 months ago

    But DuckDB is good if your data fits on one single machine. The benchmarks show a different story when data is distributed. What about that?

  • @henrywittler5046 • 7 months ago

    Thanks for this tutorial and the other material at Dask and Coiled, will help heaps in a large data project 🙂

  • @taylorpaskett3703 • 7 months ago

    What software did you use for generating / displaying your plots? It looked really nice

    • @taylorpaskett3703 • 7 months ago

      Nevermind, if I just kept watching you showed the GitHub where it says ibis and altair. Thanks!

  • @randywilliams7696 • 8 months ago

    Great video! Recently switched from Dask to Duckdb on my ~1TB workloads, interesting to see some of the same issues I found brought up here. One gotcha I've found is that it is REALLY easy to blunder your way into making non-performant queries in dask (things that end up shuffling, partitioning, etc. a lot behind the scenes). It was more straightforward for my use case to write performant SQL queries for duckdb since that is much more of a common, solved problem. The scale-out feature of Dask and Spark is interesting too, as we are considering the merits of a natively clustered solution vs just breaking up our queries into chunks that can fit on multiple single instances for duckdb.

    • @MatthewRocklin • 8 months ago

      Yup. Totally agreed. The query optimization in Dask Dataframe should handle what you ran into historically. The problem wasn't unique to you :)

    • @ravishmahajan9314 • 7 months ago

      But what about distributed databases? Is DuckDB able to query distributed databases? Is this technology replacing the Spark framework?

  • @rjv • 8 months ago

    Such a good video! So many good insights clearly communicated with proper data. Also love the interfaces you've built, very meaningful, clean and minimalistic. Have you got comparison benchmarks where cloud cost is the only constraint and the number of machines or their size and type (GPU machines with cudf) is not restricted?

  • @mooncop • 10 months ago

    you are most welcome (suffered well) worth it for the duck

  • @bbbbbbao • 10 months ago

    It's not clear to me if you can use autoscaling with coiled.

    • @Coiled • 9 months ago

      You can use autoscaling with Coiled. See the `coiled.Cluster.adapt` method.
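The same adaptive API exists on Dask's `LocalCluster`, which makes an easy local demo, assuming `dask.distributed` is installed; with Coiled you would call `.adapt()` on the `coiled.Cluster` object instead:

```python
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # In-process cluster; Coiled's cluster object exposes the same adapt() call.
    cluster = LocalCluster(processes=False, n_workers=1)
    cluster.adapt(minimum=1, maximum=4)  # scale between 1 and 4 workers on demand

    client = Client(cluster)
    squares = client.gather(client.map(lambda x: x * x, range(10)))
    print(sum(squares))  # 285
    client.close()
    cluster.close()
```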

  • @o0o0oo00oo00 • 10 months ago

    I don’t see DuckDB and Polars kicking Spark and Dask's ass at the 10 GB level in my practical usage. 😅 We can’t always trust TPC-H benchmarks.

  • @andrewm4894 • 10 months ago

    Great talk, thanks

  • @Amapramaadhy • 10 months ago

    Some people were meant to teach, and Matt is one of them! One piece of feedback: I know you have covered it elsewhere, but it might be helpful to talk about the graphs (like what a yellow vs. red block means). You have them up on the screen; they must be serving some purpose. Again, brilliant presentation.

  • @kamranpersianable • 11 months ago

    Thanks, this is amazing! I have tried integrating Optuna hyperparameter search with Dask and it works great, but I have noticed if I increase the number of iterations, at some point my system crashes due to insufficient memory. From what I can see dask keeps a copy of each iteration so it ends up consuming more memory than needed; any way I can release all the memory usages after each iteration?

    • @Coiled • 11 months ago

      The copy that Dask keeps is just the result of the objective function (scores, metrics). This should be pretty lightweight. That's not to say that there isn't some memory leak somewhere (XGBoost, Pandas, ...). If you're able to provide a reproducer to a Dask issue tracker that would be welcome. Alternatively if you run on Coiled infrastructure there's lots of measurement tools there that get run automatically that could help to diagnose.

    • @kamranpersianable • 11 months ago

      @@Coiled thanks, I will check further to see what is going wrong! From what I can see, for 500 iterations there is 9 GB of added material in memory.

  • @ButchCassidyAndSundanceKid • a year ago

    Does Dask Delayed use the GPU as well?

  • @UmmadikTas • a year ago

    I had an issue with parallelization and the random sampler for hyperparameter search. When I submit the optimize function in parallel, Optuna keeps repeating the same hyperparameters across all processes. I could not figure out how to reseed the sampler for different processes.

    • @Coiled • a year ago

      Are the different processes communicating hyperparameters with a central Optuna Storage object? This video shows using the DaskStorage, which helps all of the Optuna search functions coordinate and share results with each other using Dask. Other ways to do this include using things like a database (although we think that Dask is easier).

  • @ButchCassidyAndSundanceKid • a year ago

    What about Dask Bag and Dask Futures?

  • @irfams • a year ago

    Would you please share a link to the notebook?

  • @UmmadikTas • a year ago

    Thank you so much. This is very helpful with my research.

  • @chaitanyamadduri5826 • a year ago

    The video is very informative, and kudos to Richard for making it intuitive. Could you help me with the questions below?
    1. How can we perform a time series regression using Dask? I see we are breaking the huge dataset into chunks; how are we going to maintain time continuity between the chunks?
    2. You have used Coiled clusters, which I believe are external CPU clusters; how is Dask more powerful than PySpark in this case?
    3. So Dask can only be utilised for CPU execution, and it might be used in the case of parallel GPU execution, right?
    Share your comments on this. Thanks in advance

    • @Coiled • a year ago

      Thanks for the questions! First, you can always post more detailed questions on the Dask Forum dask.discourse.group/. For your question on a time series regression, you may find this example helpful examples.dask.org/applications/forecasting-with-prophet.html If you're curious to learn more about pros/cons of Dask vs. Spark, check out our blog post: www.coiled.io/blog/spark-vs-dask You can use Dask (and Coiled!) with GPU-enabled machines. Learn more in the Coiled docs.coiled.io/user_guide/clusters/gpu.html or Dask documentation docs.dask.org/en/stable/gpu.html

  • @Lemuz90 • a year ago

    This looks great! I remember trying to use coiled jobs to do something like this a while ago.

    • @Coiled • a year ago

      Thank you! Let us know how you end up using this!

  • @orlandogarcia885 • a year ago

    What are the upcoming features that Coiled plans to build?

    • @Coiled • a year ago

      We are working on lots of new things - check out Coiled Notebooks: ruclips.net/video/mibhDHYun0M/видео.html and our upcoming webinar about Coiled Functions and Jobs, which allow you to run any python function in the cloud: ruclips.net/video/JuBmG39zLY8/видео.html.

  • @thomasmoore3175 • a year ago

    great stuff, Matt !

  • @bvenkateshx • a year ago

    I have a use case to read data from an Oracle table, split it into files, zip them, and move them to S3. Would Dask be a benefit or an overhead for such a use case? (cx_Oracle is used; currently using multiprocessing on a 20-core server.)

    • @Coiled • a year ago

      Thanks for the question! It's hard to answer without more details on the size of your data, but feel free to post your question on the Dask Forum dask.discourse.group/

  • @Coiled • a year ago

    Update: pandas 2.0 has been released! See www.coiled.io/blog/pyarrow-strings-in-dask-dataframes for the latest on PyArrow strings improvements.

  • @user-be4vx5by8p • a year ago

    Thank you very much for this useful information

  • @billyblackburn864 • a year ago

    The one at 15 min is really nice... what is the cluster you're running it on?

  • @exeb1t_solopharm • a year ago

    Thank you so much! A great video series, keep up the good work!

  • @user-lx5gf4vd4c • a year ago

    Good video! Can you help me? Where can I find the notebook from this video?

  • @mikecmw8492 • a year ago

    This is a very good video. I have to ask because I am in the situation of setting up a Dask cluster that will be querying large weather datasets in AWS S3. I have never done it. Do you have a video on setting up the cluster? I have not explored your channel yet... thx

  • @pieter5466 • a year ago

    33:00 Surprising that there aren’t existing open-source solutions that support “marginal” arrays, so to speak... has this changed?

  • @francescos7361 • a year ago

    Thanks, interesting for oceanographic research.

  • @NajiShajarisales • a year ago

    Thanks for this video!! I am not sure how it is beneficial to have the Dask worker code inside the same process where the user code is called. After all, pinging the process that runs the user code does not need to happen often, and that way the GIL would not block the heartbeat being communicated to the scheduler. Am I missing something here? Any pointer is appreciated.

  • @floopybits8037 • a year ago

    Just one word WOW