Dask Memory Management in 7 Minutes: An Introduction

  • Published: 11 Nov 2019
  • In this video, Matt Rocklin gives a brief introduction to Dask memory management. You will learn about computation and memory management in Dask. We will also cover laziness, the persist and compute methods, and futures; a short sketch of these ideas follows the notebook link below.
    Notebook: gist.github.com/mrocklin/11b6...
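    A minimal sketch of the difference between these ideas (the array shapes here are arbitrary, and a local Client is assumed):

    from dask.distributed import Client
    import dask.array as da

    client = Client()  # local scheduler and workers

    x = da.random.random((10000, 10000), chunks=(1000, 1000))
    y = (x + x.T).mean(axis=0)   # lazy: builds a task graph, runs nothing

    y = y.persist()              # starts computing on the workers; the results
                                 # stay distributed in worker memory
    result = y.compute()         # runs (or reuses) the computation and gathers
                                 # the small result into local memory

    future = client.compute(y)   # a Future: an asynchronous handle to a result
    result2 = future.result()    # block and fetch it when needed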
    What is Dask?
    Dask is a free and open-source library for parallel computing in Python. Dask is a community project maintained by developers and organizations.
    Dask helps you scale your data science and machine learning workflows. Dask also makes it easy to work with Numpy, pandas, and Scikit-Learn.
    Where is Dask used?
    Dask is used in climate science, energy, hydrology, meteorology, and satellite imaging. Oceanographers produce massive simulated datasets of the earth's oceans. They couldn't even look at the outputs before Dask!
    What do you think of Dask Memory Management?
    Share your feedback with us in the comments and let us know:
    - Did you find the video helpful in understanding Dask Memory Management?
    - Have you used Dask before?
    Interested in learning more about Dask?
    You can find documentation and more at dask.org
    If you're interested in more videos just like this one on Dask Memory Management, you can also check out our playlist called "Getting Started with Dask" here:
    • Getting Started with Dask

Comments • 6

  • @androiduser457
    @androiduser457 1 year ago

    Thank you for the explanation. It cleared up my confusion about compute() vs persist().

  • @kantjopi
    @kantjopi 3 years ago +1

    How do we, in principle, deal with data whose size is much larger than the maximum memory of a single machine? Should we do the computation without calling compute() and, at the end, write the final results to disk with to_csv(), for instance? Or is there a standard procedure for this situation?
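    The pattern the question describes is roughly the standard one: keep the whole pipeline lazy and write the final result straight to disk, so the full dataset never has to fit in any one machine's memory. A minimal sketch, with hypothetical file paths and column names:

    import dask.dataframe as dd

    ddf = dd.read_csv("data/input-*.csv")   # hypothetical path; stays lazy
    cleaned = ddf[ddf.value > 0]            # still lazy, still large

    # Writing triggers the computation partition by partition; the full
    # result streams to disk and never sits in memory all at once.
    cleaned.to_parquet("data/output/")      # hypothetical path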

  • @musabqamri4265
    @musabqamri4265 3 years ago

    How can we make compute() run on the main process rather than on a worker? My main/master has enough memory to hold the result, but I can't give every worker that much memory. Also, for some operations I want to compute with pandas.
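    For context: with a distributed Client, .compute() already runs the work on the workers and then gathers the finished result back to the client ("main") process as a plain pandas object, so only the result, not the intermediate data, needs to fit in the client's memory. A single call can also bypass the workers entirely via the scheduler= keyword. A rough sketch, with a hypothetical path and column name:

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client()
    ddf = dd.read_parquet("data/")   # hypothetical path

    # Runs on the workers; the finished (small) result is gathered back
    # to the client process as an ordinary pandas DataFrame.
    pdf = ddf.groupby("key").mean().compute()

    # Force this one computation to run in the local process instead:
    pdf_local = ddf.groupby("key").mean().compute(scheduler="synchronous")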

  • @riis08
    @riis08 4 years ago +1

    How do we get results for large data, if compute() is only meant for small data?

    • @Dask-dev
      @Dask-dev  4 years ago

      If you can fit the result into memory then use compute. If you can't then I'm not sure what outcome you're looking for. Perhaps you want to write the result to disk?

    • @riis08
      @riis08 4 years ago

      @@Dask-dev Basically, I am trying to do the following, but the issue is that it gets slower and slower.
      ###################################################
      from dask.distributed import Client, progress
      import dask.array as da
      from scipy.spatial import distance
      import numpy as np
      import dask
      import dask_distance
      import time
      from tqdm import tqdm

      client = Client()

      models = da.random.random((100000, 256), chunks=(100000, 256))
      labels = da.random.random((100000,), chunks=(100000,))

      def sim_score(embedding, metric='cosine'):
          score = 1 - dask_distance.cdist(models, embedding, metric=metric).flatten()
          index = score.argtopk(5)
          return labels[index].persist(), score[index].persist()

      for i in tqdm(range(100000)):
          # Append one new vector and label, then rechunk and persist everything
          models = da.vstack([models, [np.random.randn(256)]]).rechunk((100000, 256)).persist()
          labels = da.concatenate([labels, np.random.randn(1)]).rechunk(100000).persist()
          embedding = np.random.randn(256)
          user, score = sim_score([embedding])
          yy = user.compute()
          xx = score.compute()
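      One plausible reason a loop like this slows down: every iteration stacks, rechunks, and persists the entire 100000x256 array just to add a single row, and every query then builds a fresh task graph on top of it. A rough sketch of one alternative, which appends into a preallocated NumPy buffer and only wraps the filled rows in a Dask array when querying (names, sizes, and chunking here are hypothetical, and cosine similarity is written out with plain dask.array instead of dask_distance):

      import numpy as np
      import dask.array as da

      capacity = 200_000
      buf = np.empty((capacity, 256))   # room to grow; appending a row is cheap
      lab = np.empty(capacity)
      n = 100_000
      buf[:n] = np.random.randn(n, 256)
      lab[:n] = np.random.randn(n)

      def append(vec, label):
          global n
          buf[n] = vec
          lab[n] = label
          n += 1

      def query(embedding, k=5):
          # Build the Dask view over the filled rows only when querying
          models = da.from_array(buf[:n], chunks=(25_000, 256))
          # cosine similarity against a single query vector
          num = models @ embedding
          denom = da.sqrt((models ** 2).sum(axis=1)) * np.sqrt((embedding ** 2).sum())
          score = num / denom
          idx = da.argtopk(score, k).compute()   # k indices: a tiny result
          return lab[:n][idx], score[idx].compute()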