Dask Memory Management in 7 Minutes: An Introduction
- Published: 11 Nov 2019
- In this video, Matt Rocklin gives a brief introduction to Dask memory management. You will learn how computation and memory management work in Dask. We will also cover laziness, the persist and compute methods, and futures.
Notebook: gist.github.com/mrocklin/11b6...
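The three ideas the video covers can be sketched in a few lines (a minimal illustration, not taken from the notebook):

```python
import dask.array as da

x = da.ones((1000, 1000), chunks=(100, 100))  # lazy: no memory used yet
y = (x + 1).sum()                             # still lazy: just a task graph

total = y.compute()    # compute: run the graph, return a concrete result
print(total)           # 2000000.0

z = (x * 2).persist()  # persist: run the graph, keep the result as a Dask array
```

compute brings the full result back as an in-memory object, while persist executes the graph but leaves the result living as Dask collections, ready for further lazy operations.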
What is Dask?
Dask is a free and open-source library for parallel computing in Python. Dask is a community project maintained by developers and organizations.
Dask helps you scale your data science and machine learning workflows. Dask also makes it easy to work with Numpy, pandas, and Scikit-Learn.
Where is Dask used?
Dask is used in climate science, energy, hydrology, meteorology, and satellite imaging. Oceanographers produce massive simulated datasets of the earth's oceans. They couldn't even look at the outputs before Dask!
What do you think of Dask Memory Management?
Share your feedback with us in the comments and let us know:
- Did you find the video helpful in understanding Dask Memory Management?
- Have you used Dask before?
Interested in learning more about Dask?
You can find documentation and more at dask.org
If you're interested in more videos just like this one on Dask Memory Management, you can also check out our playlist called "Getting Started with Dask" here:
• Getting Started with Dask
Thank you for the explanation. It clears up my confusion about compute() vs persist().
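The compute() vs persist() distinction in a distributed setting can be sketched like this (a minimal illustration, not from the video; a small in-process cluster is assumed):

```python
from dask.distributed import Client
import dask.array as da

client = Client(processes=False)  # small in-process cluster for illustration

x = da.random.random((2000, 2000), chunks=(500, 500))
x = x.persist()           # persist: materialize the blocks in cluster memory

# Later reductions reuse the persisted blocks instead of recomputing them
m = x.mean().compute()    # compute: bring a small scalar back to the client
s = x.std().compute()

client.close()
```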
How do we, in principle, deal with data whose size is far larger than the memory of a single machine? Should we do the computation without calling compute() and finally write the results to disk with to_csv(), for instance? Or is there a standard procedure for this situation?
How can we make compute run on the main/master node rather than on a worker? My main/master has enough memory to hold the result, but I can't give all workers that amount of memory. Also, for some operations I want to compute with pandas.
How do we get the results for large data, if compute() is only meant for small results?
If you can fit the result into memory then use compute. If you can't then I'm not sure what outcome you're looking for. Perhaps you want to write the result to disk?
@Dask-dev Basically, I am trying to do something, but the issue is that it keeps getting slower and slower.
###################################################
from dask.distributed import Client
import dask.array as da
import dask_distance
import numpy as np
from tqdm import tqdm

client = Client()

models = da.random.random((100000, 256), chunks=(100000, 256))
labels = da.random.random((100000,), chunks=(100000,))

def sim_score(embedding, metric='cosine'):
    # Cosine similarity of the query embedding against all stored models
    score = 1 - dask_distance.cdist(models, embedding, metric=metric).flatten()
    # Indices of the 5 most similar rows
    index = score.argtopk(5)
    return labels[index].persist(), score[index].persist()

for i in tqdm(range(100000)):
    # Append a new row and label, then rechunk and persist the grown arrays
    models = da.vstack([models, [np.random.randn(256)]]).rechunk((100000, 256)).persist()
    labels = da.concatenate([labels, np.random.randn(1)]).rechunk(100000).persist()
    embedding = np.random.randn(256)
    user, score = sim_score([embedding])
    yy = user.compute()
    xx = score.compute()