Dask in 15 Minutes | Machine Learning & Data Science Open-source Spotlight #5

  • Published: 20 Jun 2024
  • Should you use Dask or PySpark for Big Data? 🤔
    Dask is a flexible library for parallel computing in Python.
    In this video I give a tutorial on how to use Dask for parallel computing, handling Big Data, and integrating with Deep Learning frameworks.
    I compare Dask to PySpark and list the advantages I see in making Dask your primary choice for Big Data handling. (A short taster sketch follows this description.)
    Link to Notebook:
    nbviewer.jupyter.org/github/d...
    With these weekly "Machine Learning & Data Science Open Source Spotlight" videos, my objective is to introduce game-changing libraries that I believe many people can benefit from.
    I would love to hear your feedback!
    Did this video teach you something new?
    Are there any open-source libraries you think deserve a spotlight?
    Let me know in the comments! 👇🏻
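
    For anyone who wants a taste before watching, here is a minimal sketch of Dask's pandas-like API (the file and column names are hypothetical, not from the notebook):

    import dask.dataframe as dd

    # Lazily point at a CSV that may be larger than RAM; nothing loads yet
    ddf = dd.read_csv("data.csv")

    # Operations build a task graph; .compute() executes it in parallel
    result = ddf.groupby("key")["value"].mean().compute()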

Comments • 87

  • @mwd6478
    @mwd6478 3 years ago +11

    Really liked how you showed how to pipeline the data to keras! None of the other Dask videos I've seen show those next steps, it's just pandas stuff

  • @julianurrea
    @julianurrea 4 years ago +1

    Good job man, I'm starting with Dask and I am excited about its capabilities! Thanks a lot

  • @ChaitanyaBelhekar
    @ChaitanyaBelhekar 3 years ago

    Thanks for a great introductory video on Dask!

  • @skyleryoung1010
    @skyleryoung1010 3 years ago +13

    Great video! Great use of examples and explained in a way that made Dask very approachable. Thanks!

  • @guideland1
    @guideland1 2 years ago +1

    Really great Dask introduction and the explanation is so easy to understand. That was useful. Thank you!

  • @sripadvantmuri89
    @sripadvantmuri89 2 years ago

    Great explanations for beginners!! Thanks for this...

  • @teslaonly2136
    @teslaonly2136 3 years ago

    Very nice explanation! Thanks!

  • @summerxia7474
    @summerxia7474 2 years ago

    Very nice video!!!! Thank you so much!!! Please make more hands-on videos like this! You explain things very clearly; it's really helpful!!!!

  • @Rafaelkenjinagao
    @Rafaelkenjinagao 3 years ago

    Many thanks Dan. Good job, sir!

  • @someirrelevantthings6198
    @someirrelevantthings6198 3 years ago

    Literally man, you saved me. Thanks a ton

  • @cristian-bull
    @cristian-bull 2 years ago

    I really appreciate the batch-on-the-fly example with keras.

  • @terraflops
    @terraflops 3 years ago

    Very helpful in understanding Dask; I think I will start using it on my projects

  • @the-data-queerie
    @the-data-queerie 3 years ago

    Thanks for this super informative video!

  • @sanketwebsites9764
    @sanketwebsites9764 3 years ago

    Awesome video. Thanks a ton!

  • @vireshsaini705
    @vireshsaini705 3 years ago

    Really good explanation of working with dask :)

  • @vitorruffo9431
    @vitorruffo9431 2 years ago

    Good work sir, your video has helped me to get started with Dask. Thank you very much.

  • @abrarfahim2042
    @abrarfahim2042 3 years ago

    Thank you so much. This video is really helpful to me.

  • @magicpotato1707
    @magicpotato1707 3 years ago

    Thanks man! Very informative

  • @apratimdey6118
    @apratimdey6118 11 months ago

    Hi Dan, this was a great introductory video. I am learning Dask and this was very helpful.

  • @alikoko1
    @alikoko1 2 years ago

    Great video! Thank you man 👍

  • @datadiggers_ru
    @datadiggers_ru 2 years ago

    Great intro. Thank you

  • @krocodilnaohote1412
    @krocodilnaohote1412 2 years ago

    Man, great video, thank you!

  • @haydn.murray
    @haydn.murray 3 years ago

    Fantastic video! Keep it up!

  • @arashkhosravi7611
    @arashkhosravi7611 3 years ago +1

    Great explanation. It was not long, and it is very interesting.

  • @dwivedys
    @dwivedys 2 years ago

    Excellent!

  • @priyabratasinha1478
    @priyabratasinha1478 3 years ago

    Thank you so much man... you saved a project... 🙂🙂❤❤🙏🙏

  • @user-wr5rc5pp8r
    @user-wr5rc5pp8r 2 years ago

    Amazing. Very interesting topic.

  • @_Apex
    @_Apex 3 years ago

    Amazing video!

  • @shandi1241
    @shandi1241 3 years ago

    Thanks man, you made my morning

  • @piotr780
    @piotr780 1 year ago

    Very good! Thank you! :)

  •  1 year ago

    Hi 😀 Dan.
    Thank you for this video.
    Do you have an example which uses the apply() function?
    I want to create a new column based on a data transformation.
    Thank you!
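
    No reply appears in the thread, but for reference, a minimal sketch of one way to do this (the file and column names are hypothetical): Series.apply() needs a meta hint so Dask knows the output schema, and map_partitions() is often the faster route.

    import dask.dataframe as dd

    ddf = dd.read_csv("data.csv")  # hypothetical file

    # apply() per element, with `meta` giving the new column's name and dtype
    ddf["new_col"] = ddf["old_col"].apply(lambda v: v * 2, meta=("new_col", "float64"))

    # map_partitions() runs a function on each underlying pandas DataFrame
    ddf = ddf.map_partitions(lambda pdf: pdf.assign(doubled=pdf["old_col"] * 2))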

  • @DanielWeikert
    @DanielWeikert 2 years ago +1

    Great. I would also like to see more on Dask and deep learning. How exactly would this generator be used in PyTorch, instead of the DataLoader? Thanks for the video(s)
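
    No reply in the thread; one plausible pattern for PyTorch is to wrap the Dask DataFrame in an IterableDataset that streams one partition at a time. A sketch under that assumption (file and column names are hypothetical):

    import dask.dataframe as dd
    import torch
    from torch.utils.data import DataLoader, IterableDataset

    class DaskIterable(IterableDataset):
        """Stream partitions so the full frame never sits in memory at once."""
        def __init__(self, ddf, label="label"):
            self.ddf, self.label = ddf, label

        def __iter__(self):
            for i in range(self.ddf.npartitions):
                pdf = self.ddf.get_partition(i).compute()  # one partition in RAM
                X = torch.tensor(pdf.drop(columns=[self.label]).values, dtype=torch.float32)
                y = torch.tensor(pdf[self.label].values, dtype=torch.float32)
                yield from zip(X, y)

    loader = DataLoader(DaskIterable(dd.read_csv("data.csv")), batch_size=32)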

  • @cherisykonstanz2807
    @cherisykonstanz2807 2 years ago

    Chapeau, well explained and healthy usage of memes!

  • @NitinPasumarthy
    @NitinPasumarthy 4 years ago

    Very well explained! Thank you.

    • @danbochman
      @danbochman  4 years ago

      Very happy to hear!
      If there's any topic in particular which interests you, feel free to suggest it!

    • @NitinPasumarthy
      @NitinPasumarthy 4 years ago

      @@danbochman Maybe how to build custom UDFs like in Spark?

    • @danbochman
      @danbochman  4 years ago

      @@NitinPasumarthy Hmm, to be honest I don't know much about this topic. But I just released a video about visualizing big data with Datashader + Bokeh (I remember you asked for that before):
      www.linkedin.com/posts/danbochman_datashader-in-15-minutes-machine-learning-activity-6639434661288263680-g62W
      Hope you'll like it!

  • @jodiesimkoff5778
    @jodiesimkoff5778 3 years ago +1

    Great video, looking forward to exploring more of your content! If you don’t mind, could you share any details about your jupyter notebook setup (extensions, etc)? It looks great.

    • @danbochman
      @danbochman  3 years ago

      Hey Jodie, really glad you liked it!
      It's actually not a Jupyter Notebook, it's Google Colaboratory, which is essentially very similar, but runs on Google's servers and not your local machine.
      I highly recommend it if you're doing work that is not hardware/storage demanding.

  • @adityaanvesh
    @adityaanvesh 4 years ago +1

    Great intro to Dask for a pandas guy like me...

  • @unathimatu
    @unathimatu 2 years ago

    This is really great

  • @-mwolf
    @-mwolf 11 months ago

    Awesome intro

  • @joseleonardosanchezvasquez1514

    Great

  • @jameswestwood1268
    @jameswestwood1268 3 years ago

    Amazing. Please expand and create a section on Dask for Machine Learning.

    • @danbochman
      @danbochman  3 years ago

      Thank you! Dask's machine learning support has gotten even better since I made this video.
      The Dask-ML package mimics the scikit-learn API and model selection, and the new SciKeras package wraps Keras models in the scikit-learn API, so the transition is pretty seamless.
      Is there anything specific you wish you could have more guidance on?
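
      For context, a minimal sketch of the SciKeras wrapper mentioned above (assuming a recent SciKeras version; the model architecture is made up for illustration):

      from scikeras.wrappers import KerasClassifier
      from tensorflow import keras

      def build_model():
          model = keras.Sequential([keras.layers.Dense(1, activation="sigmoid")])
          model.compile(loss="binary_crossentropy", optimizer="adam")
          return model

      # The wrapped model exposes scikit-learn's fit/predict/score interface,
      # so it can slot into scikit-learn-style model selection utilities
      clf = KerasClassifier(model=build_model, epochs=5)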

  • @FabioRBelotto
    @FabioRBelotto 1 year ago

    Thanks man. It's really hard to find information about dask

  • @luisalbertovivas5174
    @luisalbertovivas5174 4 years ago

    I liked it, Dan. I want to try how it works with Scikit-learn and all the models.

    • @danbochman
      @danbochman  4 years ago

      For Scikit-learn and other popular models such as XGBoost, it's even easier! Everything is already implemented in the dask-ml package:
      ml.dask.org/
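
      As an illustration of the dask-ml API, a minimal sketch on synthetic data:

      import dask.array as da
      from dask_ml.linear_model import LogisticRegression

      # Synthetic out-of-core data: 100k rows split into 10 chunks
      X = da.random.random((100_000, 10), chunks=(10_000, 10))
      y = (da.random.random(100_000, chunks=10_000) > 0.5).astype(int)

      clf = LogisticRegression()
      clf.fit(X, y)          # trains directly on dask arrays
      preds = clf.predict(X)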

  • @anandramm235
    @anandramm235 3 years ago

    This was awesome, more tutorials for Dask sir please!!

    • @danbochman
      @danbochman  3 years ago +1

      Thank you! Glad you liked it. What kind of Dask tutorials specifically would help you?

    • @anandramm235
      @anandramm235 3 years ago

      @@danbochman Data preprocessing; not sure if dummy encoding can be done through dask
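
      For the record, dask-ml does ship dummy encoding; a minimal sketch (the file and the "city" column are hypothetical):

      import dask.dataframe as dd
      from dask_ml.preprocessing import Categorizer, DummyEncoder

      ddf = dd.read_csv("data.csv")  # hypothetical file

      # Categorizer converts the column to a pandas categorical dtype;
      # DummyEncoder then one-hot encodes the categorical columns
      ddf = Categorizer(columns=["city"]).fit_transform(ddf)
      ddf = DummyEncoder(columns=["city"]).fit_transform(ddf)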

  • @FindMultiBagger
    @FindMultiBagger 3 years ago

    Thanks!
    Nice tutorial, thanks 🙌🏼
    Can we use any ML library on it?
    Like scikit-learn, PyCaret, etc.

    • @danbochman
      @danbochman  3 years ago +1

      To the best of my knowledge, not directly.
      You can use any ML/DL framework after you persist your dask array to memory. However, Dask has its own dask-ml package, in which contributors have migrated most of the common use cases from scikit-learn and PyCaret.
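
      To illustrate the persist-then-train pattern described above, a short sketch:

      import dask.array as da

      x = da.random.random((50_000, 100), chunks=(5_000, 100))

      x = x.persist()     # materialize all chunks in RAM, still a dask array
      x_np = x.compute()  # or collapse to one NumPy array for any ML library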

    • @FindMultiBagger
      @FindMultiBagger 3 years ago

      @@danbochman Can you make a simple use case? 🙂
      1) Integrating Dask with PyCaret would be so helpful, because I have used other approaches like Ray, Modin, RAPIDS, and PySpark, but all have some limitations.
      If you help integrate Dask with PyCaret it would help the open-source community a lot :)
      Looking forward to hearing from you :)
      Thanks

  • @emmaf4320
    @emmaf4320 3 years ago

    Hey Dan, I really like the video! I have a question with regard to the for-loop example. Why did it only save half of the time? To my understanding, delayed(inc) should have taken about 1s because of the parallel computing. What else took time to operate?

    • @danbochman
      @danbochman  3 years ago

      Hey Lyann,
      There's a limit to how much you can parallelize.
      It depends on how many cores you have and how many threads can be opened within each process, and in addition, how much of your computer's resources are available (e.g. whether several Chrome tabs are open).
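
      A small sketch of that limit: with four workers, eight one-second tasks take roughly two seconds, not one (time.sleep releases the GIL, so threads can run it in parallel):

      import time
      import dask
      from dask import delayed

      @delayed
      def inc(x):
          time.sleep(1)
          return x + 1

      tasks = [inc(i) for i in range(8)]

      # Wall time is roughly n_tasks / n_workers seconds, not 1 second
      results = dask.compute(*tasks, scheduler="threads", num_workers=4)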

    • @emmaf4320
      @emmaf4320 3 years ago

      @@danbochman Thank you!

  • @indian-inshorts5786
    @indian-inshorts5786 3 years ago

    It's really helpful for me, thank you! May I provide a link to your channel in my notebook on Kaggle?

    • @danbochman
      @danbochman  3 years ago

      Very happy to hear that you've found the video helpful!
      (About link) Sure! The whole purpose of these videos is to share them ;)

  • @vegaalej
    @vegaalej 1 year ago

    Many thanks for this excellent video! It is really clear and helpful!
    I just have one question. I tried to run the notebook, and it ran pretty well after some minor updates. Just the last line I was not able to get to run:
    # never run out-of-memory while training
    "model.fit_generator(generator=dask_data_generator(df_train), steps_per_epoch=100)"
    Gives me an error message:
    InvalidArgumentError: Graph execution error:
    TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
    Traceback (most recent call last):
    TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
    [[{{node PyFunc}}]]
    [[IteratorGetNext]] [Op:__inference_train_function_506]
    Any recommendation on how I should modify it to make it run?
    Thanks
    AG

  • @vegaalej
    @vegaalej 2 years ago +2

    Great video and explanation! Thank you very much for it! It is really helpful!
    I tried to run the notebook, and it ran pretty well after some minor updates. I just had problems running the last part, "never run out-of-memory while training"; it seems the generator or steps-per-epoch part is giving some problem I can't figure out how to solve.
    Any suggestions on how to fix the code?
    Thanks!
    InvalidArgumentError: Graph execution error:
    TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
    Traceback (most recent call last):
    TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.

    • @AlejandroGarcia-ib4kb
      @AlejandroGarcia-ib4kb 2 years ago +1

      Interesting, I have the same problem in the last part of the notebook. It seems to be related to the IDE; it needs an update.

  • @DontMansion
    @DontMansion 3 years ago

    Hello. Can I use Dask for Numpy without using "defs"? Thank you.
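
    No reply in the thread, but for reference: dask.array works with plain expressions, no function definitions needed. A minimal sketch:

    import numpy as np
    import dask.array as da

    x = da.from_array(np.arange(1_000_000), chunks=100_000)
    result = (x ** 2).sum().compute()  # ordinary expressions, no defs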

  • @user-nd9rd9mo9f
    @user-nd9rd9mo9f 2 years ago

    Many thanks. Now I understand why the file was not read

  • @markp2381
    @markp2381 2 years ago

    Great content! One question: isn't it strange to use a function in this form, function(do_something)(on_variable), instead of function(do_something(on_variable))?

    • @danbochman
      @danbochman  2 years ago +3

      Hey Mark! Thanks.
      I understand what you mean, but when a function returns a function this makes sense, as opposed to a function whose output is the input to the next function.
      delayed(fn) returns a new function (e.g. "delayed_fn"), and this new function is then called as usual: delayed_fn(x). So it's delayed(fn)(x).
      All decorators are functions which return callable functions. In this example they are used somewhat unnaturally because I wanted to keep both versions of the functions.
      Hope the explanation helped!
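
      The two equivalent spellings, side by side (a sketch):

      from dask import delayed

      def inc(x):
          return x + 1

      # Wrapper form: delayed(inc) returns a new lazy function, then call it
      lazy = delayed(inc)(41)

      # Decorator form: the function is lazy everywhere it is used
      @delayed
      def dec(x):
          return x - 1

      print(lazy.compute(), dec(43).compute())  # 42 42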

  • @kalagaarun9638
    @kalagaarun9638 4 years ago

    Thanks sir... very well explained and covered a wide range of topics!!
    By the way, what are chunks in dask.array? I find it a bit hard to understand...
    Can you explain?

    • @danbochman
      @danbochman  3 years ago +3

      Hey! Sorry for the delay, for some reason YouTube doesn't notify me of comments on my videos... (although it's set to do so)
      If your question is still relevant:
      Chunks are basically, as the name suggests, chunks of your big arrays split into smaller arrays (in Dask -> NumPy ndarrays).
      There exists a certain sweet spot between computational efficiency (parallelization potential) and memory capability (in-memory computations) for which a chunk size is optimal.
      Given your number of CPU cores, available RAM and file size, Dask finds the optimal way to split your data into chunks that can be processed most efficiently by your machine.
      You might ask: "Isn't fetching only a certain number of data rows each time sufficient?"
      Well, not exactly; the bottleneck may be at the column dimension (e.g. 10 rows, 2M columns), so row operations should split each row into chunks of columns for more efficient computation.
      Hope it helped clear some concepts for you!
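
      In code, chunking looks like this (a sketch; sizes chosen for illustration):

      import dask.array as da

      # A ~3.2 GB float64 array split into 16 chunks of ~200 MB each
      x = da.ones((20_000, 20_000), chunks=(5_000, 5_000))
      print(x.numblocks)  # (4, 4)

      # Re-split along rows only, e.g. for row-wise operations
      x = x.rechunk((1_000, 20_000))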

    • @kalagaarun9638
      @kalagaarun9638 3 years ago

      @@danbochman It helped a lot, thanks sir!

  • @FabioRBelotto
    @FabioRBelotto 1 year ago

    I've added dask delayed to some functions. When I visualize it, there are several parallel tasks planned, but my CPU does not seem to be affected (it's only using a small percentage of it)

  • @madhu1987ful
    @madhu1987ful 2 years ago

    Hey, thanks for the awesome video and the explanation. I have a use case. I am trying to build a deep learning TensorFlow model for time series forecasting. For this I need to use a multi-node cluster for parallelization across all nodes. I have a single function which can take data for any one store and predict for that store.
    Likewise I need to do predictions for 2 lakh (200,000) outlets. How can I use dask to parallelize this huge task across all nodes of my cluster? Can you please guide me? Thanks in advance.

    • @danbochman
      @danbochman  2 years ago +1

      Hi Madhu,
      Sorry, I wish I could help, but node cluster parallelization tasks are more dependent on the orchestration framework itself (e.g. Kubernetes) than on Dask.
      You have the dask.distributed module (distributed.dask.org/en/stable/), but handling the multi-worker infrastructure is where the real challenge lies...
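
      For completeness, the entry point to dask.distributed looks like this (a sketch; the scheduler address is hypothetical):

      from dask.distributed import Client

      # Local test "cluster" of processes on one machine
      client = Client(n_workers=4)

      # Or connect to a scheduler whose workers run on other machines
      # client = Client("tcp://scheduler-address:8786")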

  • @BilalAslamIsAwesome
    @BilalAslamIsAwesome 4 years ago +2

    Hello! Thank you for your tutorial. I downloaded and ran your notebook; at the Keras steps I'm getting an ('X' is not defined) error. I can't see where it was created either. Any ideas on how I can fix this to run it?

    • @danbochman
      @danbochman  4 years ago +1

      Hey Bilal!
      Whoops, I must've changed the variable name X to df_train and wasn't consistent in the process; it probably didn't pop an error for me because X was still in my notebook workspace.
      You can either change df_train to X, or change X to df_train.
      Just be consistent and it should work!

    • @BilalAslamIsAwesome
      @BilalAslamIsAwesome 4 years ago +1

      @@danbochman Great! Thank you, I will try that. Have you been using dask for everything now, or do you still use pandas and numpy?

    • @danbochman
      @danbochman  4 years ago +1

      @@BilalAslamIsAwesome Hope it worked for you! Please let me know if it didn't.
      Yes, I use Dask for almost anything. Some exceptions:
      1. Pandas Profiling - a great package (I have an old video on it) for EDA which I ALWAYS use when I first approach a new dataset; it doesn't play well with Dask.
      2. Complex (interactive) visualizations - my favorite package for this is Plotly, which doesn't synergize well with Dask. If plotting all the data points is a must, then I will use Dask + Datashader + Bokeh.

  • @asd222treed
    @asd222treed 2 years ago +1

    Great video! Thank you for sharing. But I think your code has some incorrect parts in the machine-learning-with-dask section.
    There is no X in the code (model.add(..., input_dim=X.shape[1], ...)),
    and when I train with model.fit_generator, TensorFlow says model.fit_generator is deprecated, and it finally displays an error: AttributeError: 'tuple' object has no attribute 'rank'

    • @danbochman
      @danbochman  2 years ago +2

      Hey!
      Whoops, I must've changed the variable name X to df_train and wasn't consistent in the process; it probably didn't pop an error for me because X was still in my notebook workspace.
      You can either change df_train to X, or change X to df_train.
      Just be consistent and it should work!

  • @abhisekbus
    @abhisekbus 3 years ago +1

    Came to handle a 5gig csv, stayed for capybara with friends.

  • @raafhornet
    @raafhornet 3 years ago

    Would it still work if the fraction of data you're sampling for the model is still larger than your memory?

    • @danbochman
      @danbochman  3 years ago

      Sorry for the late reply, this comment didn't pop up in my notifications...
      If a fraction of the data is larger than your memory, then no, Dask can't handle computations bigger than your in-memory capabilities.
      It just means that the fraction should be fractioned even more :)

  • @parikannappan1580
    @parikannappan1580 1 year ago

    4:12, visualize(): where can I get documentation? I tried to use it for sorted() and it did not work
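
    No reply in the thread; for reference, visualize() is a method on Dask objects (documented at docs.dask.org), so it cannot be called on a plain sorted() result; the call has to go through delayed or another Dask collection. A sketch (rendering requires the graphviz package):

    from dask import delayed

    @delayed
    def inc(x):
        return x + 1

    total = delayed(sum)([inc(i) for i in range(3)])
    total.visualize(filename="graph.png")  # writes the task graph as an image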

  • @dendi1076
    @dendi1076 3 years ago

    Hi sir, for Dask to work on my laptop, do I need to have more than 1 core? What if I only have 1 core on my laptop and no other nodes to work with? Will Dask still be helpful for reading a CSV file that has millions of rows, and help me speed up the process?

    • @danbochman
      @danbochman  3 years ago

      Hello dendi,
      If you only have 1 core, you won't be able to speed up performance with parallelization, as you don't have the ability to run multiple processes/threads. However, you would still be more memory efficient if you have a huge dataset on limited RAM.
      Dask can still help you work with data that is bigger than or close to your RAM size (e.g. 4-8 GB for most laptops), by only fetching what is relevant for the current computation in memory.
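
      A sketch of that memory-bounded pattern on a single core (blocksize is a real read_csv option; the file name is hypothetical):

      import dask.dataframe as dd

      # Smaller blocks lower peak memory; on one core they just run in sequence
      ddf = dd.read_csv("huge.csv", blocksize="64MB")
      print(ddf.npartitions)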

  • @felixtechverse
    @felixtechverse 1 year ago

    How do you schedule a Dask task? For example, how do you set a script to run every day at 10:00 with Dask?

  • @anirbanaws143
    @anirbanaws143 3 years ago

    How do I handle it if my data is .xlsx?

    • @danbochman
      @danbochman  3 years ago

      To be honest, if you're working with a huge .xlsx file and best performance is necessary, then I would recommend rewriting the .xlsx to .csvs with:

      import csv
      from openpyxl import load_workbook

      # read_only mode streams the sheet instead of loading it all into RAM
      wb = load_workbook(filename="data.xlsx", read_only=True)
      for sheet_name in wb.sheetnames:
          ws = wb[sheet_name]
          with open("output/%s.csv" % sheet_name.replace(" ", ""), "w", newline="") as f:
              c = csv.writer(f)
              for row in ws.rows:
                  c.writerow([cell.value for cell in row])

      If this is not an option for you, you can call Dask's delayed decorator on Pandas' read_excel function like so:

      import dask
      import dask.dataframe as dd
      import pandas as pd

      delayed_ddf = dask.delayed(pd.read_excel)("data.xlsx", sheet_name=0)
      ddf = dd.from_delayed(delayed_ddf)

      But you won't see a huge performance increase.
      Hope it helped!