@AIEngineering Thanks a ton, most practical tutorials on Spark so far. Can you please keep making more Spark videos? Thanks and bless you.
Thank you.. I have one coming tomorrow on Spark 3 on GPU
@@AIEngineeringLife Eagerly looking forward to it. Please add something on streaming data, as that's what most job requirements entail.
What kind of distribution strategy does Koalas use?
Koalas uses Spark underneath, so the distribution strategy is the same as Spark's. It is just a convenient pandas-like API on top of Spark.
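To illustrate that reply, here is a minimal sketch, assuming Koalas is installed (the column names are illustrative): the exact same pandas-style call runs distributed on Spark when it goes through Koalas.

```python
import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
kdf = ks.from_pandas(pdf)        # backed by a distributed Spark dataframe

print(kdf.groupby("key").sum())  # same syntax as pandas, executed by Spark
```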
Please explain Structured Streaming to us..
Yes I will later this year
So you mean we can build a machine learning model on Koalas, as compared to the earlier method of building a PySpark MLlib model on a PySpark df?
Thanks Srivatsan. For now, if we use Koalas on a huge dataframe in Databricks, is distributed processing still fully possible if I purchase more clusters (since Koalas is not yet primetime)? (I have 800 GB of data in Elasticsearch to build a recommender system model.)
Got it Prem
Thanks, can you tell me how to convert a large Spark dataframe into a pandas dataframe?
Mayur.. converting a large Spark dataframe to pandas is its own bottleneck, as all the data has to move from the executors to the driver. You can offset some of the performance cost by using the Arrow format, but it is not advisable to bring a very large dataset into pandas. Better to store it and then use pandas on a sample of it.
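A minimal sketch of both mitigations, assuming sdf is the large Spark dataframe (the sample fraction is illustrative):

```python
# Arrow speeds up the executor-to-driver transfer behind toPandas().
# (On Spark 3 the config key is spark.sql.execution.arrow.pyspark.enabled.)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Rather than pulling everything to the driver, sample first.
pdf = sdf.sample(fraction=0.01, seed=42).toPandas()
```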
Exactly... this is where I am stuck.... it feels much more comfortable working with tables in Databricks using SQL than Python.
Will there be any out-of-memory error if I convert a Koalas dataframe to a Spark dataframe, or vice versa?
Superb
Thanks, awesome tutorial. Can you make a video on some of the pandas operations that are not readily available in PySpark when converting to Spark? It would be greatly helpful.
Ravi.. I do not have a comparison, but let me see if I can put one together.
I do have a video on Apache Spark transformations and actions, in case you have not seen it.
@@AIEngineeringLife Thanks for the reply, sir. I am looking forward to a video on things like date_range, forward fill (ffill), bfill, etc., which we use in pandas.
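Until such a video exists, here is a minimal sketch of one way to emulate pandas-style ffill in PySpark, using last() with ignorenulls over an ordered window (the column names are hypothetical):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 10.0), (2, None), (3, None), (4, 40.0)], ["ts", "value"]
)

# Carry the most recent non-null value forward, like pandas df.ffill().
# In practice you would also partitionBy() a key to keep this scalable.
w = Window.orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)
df.withColumn("value_ffill", F.last("value", ignorenulls=True).over(w)).show()
```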
@AIEngineering Very informative and nice video. I have one small doubt: can't we use Dask on a distributed file system as an alternative to Koalas? Excuse me if my question sounds too silly.
Dask works well for process parallelism; I have seen it fail on data parallelism. I have not tried it recently, but it had its own drawbacks when running on distributed systems. Spark, on the other hand, started with data parallelism and has been around and stable for some time. If you do not have Spark already, Dask might be a good start; see if it works for you.
@@AIEngineeringLife Thanks for your reply. I just started on Dask. Just thought of getting your advice.
@@AIEngineeringLife There is one more reason behind my question. In one of the NVIDIA articles, I have read that Walmart uses Dask along with XGBoost on GPU (along with some other NVIDIA APIs) to forecast product inventory for around 500 million products on a weekly basis across different stores.
For XGBoost, the reason could be that XGBoost has native Dask support, and NVIDIA RAPIDS does as well. It is easy to get Dask running with NVIDIA RAPIDS and XGBoost.
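A minimal sketch of that combination on a local cluster, assuming xgboost and dask are installed (swap LocalCluster for a dask-cuda GPU cluster and hist for gpu_hist to bring RAPIDS into the picture; the data here is random and purely illustrative):

```python
import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster

client = Client(LocalCluster())  # use dask-cuda's LocalCUDACluster for GPUs

# Illustrative random data, partitioned into chunks across workers.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.randint(0, 2, size=(100_000,), chunks=(10_000,))

dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist"},  # gpu_hist on GPU
    dtrain,
    num_boost_round=50,
)
booster = output["booster"]  # trained model; output also holds eval history
```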
Awesome work sir....
Dask does the same operations, but it doesn't use the Spark engine; rather, it uses blocked algorithms. That's where it differs from Koalas... correct me if I'm wrong.
You are right Ramesh.. Dask has its own distribution strategy
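A minimal sketch of that blocked model, assuming Dask is installed: a Dask dataframe is a collection of pandas partitions, and operations are planned per block and then combined.

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"key": ["a", "b"] * 50_000, "value": range(100_000)})
ddf = dd.from_pandas(pdf, npartitions=8)  # 8 independent pandas blocks

# Built lazily as a task graph over the blocks; runs on .compute().
print(ddf.groupby("key")["value"].mean().compute())
```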
How robust are Koalas pre-processing and pipelines in production?
Aryan.. if you have seen my video, I mentioned that it is still not primetime. Currently only around 60% of pandas functions exist in Koalas. But it is a good start toward having a consistent API.
Does ks.to_csv work faster than pd.to_csv?
Rahul, Koalas is beneficial when we deal with large, distributed datasets. So pandas might be faster on a small dataset, as it need not split the data.
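A minimal sketch of the difference (the paths are hypothetical):

```python
import databricks.koalas as ks

kdf = ks.range(1_000_000)

# Koalas writes through Spark: executors write part files in parallel,
# which is what pays off on large, distributed data.
kdf.to_csv("/tmp/out_koalas", num_files=8)

# pandas writes one file from a single process; often faster for small
# data, but everything must first be collected to one machine.
kdf.to_pandas().to_csv("/tmp/out_pandas.csv")
```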
@@AIEngineeringLife While writing a huge dataset over a JDBC connection in PySpark, it takes around 2-3 hours, and sometimes I am facing ORA-01555. Is there any solution to this?
@@AIEngineeringLife For writing CSV I am using to_pandas().to_csv()
@@iamrvk98 .. Why don't you do the write to Oracle outside of Spark? Basically, write to a file from Spark and use the Oracle bulk loader to get it into the database.. The above error is due to read consistency.. Is the database also being used by another application while you are writing?
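A minimal sketch of that approach, assuming df is the Spark dataframe (the paths, credentials, and control file are hypothetical):

```python
# Step 1: land the data as CSV from Spark; executors write in parallel.
(df.write
   .mode("overwrite")
   .csv("/data/export/orders_csv"))

# Step 2 (outside Spark): bulk load with Oracle's SQL*Loader, e.g.
#   sqlldr user/pass@db control=orders.ctl data='/data/export/orders_csv/part-*' direct=true
# Direct-path loading avoids the long row-by-row JDBC transactions that
# tend to surface ORA-01555.
```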
Is there any video explaining Spark Streaming from Kafka?
Not yet, Himanish.. But I do have it planned for the future.
Is it possible to install Koalas on a Google Colab notebook?
Or is it only for Databricks?
Vinay, you should be able to. You can install it from PyPI. I have not tried it in Colab yet; will try and update.
I saw somewhere that it is coming as part of Spark 3.0, but I am not sure how true that is.
@@AIEngineeringLife Yeah, I will also try and update here, because Apache Spark really has a high learning curve.
Vinay.. check the below out.. I have installed it on Colab and the same steps should work in any environment:
colab.research.google.com/drive/1YGemqOfCwF60ZzYrFoM6MFREUQtpd6W2
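For reference, a minimal sketch of the install and a smoke test (versions left unpinned; the notebook linked above has the full steps):

```python
# In a Colab cell: Koalas is on PyPI and runs on top of PySpark.
!pip install pyspark koalas

import databricks.koalas as ks

kdf = ks.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(kdf.sum())
```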
@@AIEngineeringLife Really helpful..
Hi Sir,
The most important part is missing in the tutorial. I have many tables created in the database and I can see them. I can read them using SQL like "select * from database.table". How can I do the same with Koalas? That is the key part where most of us are struggling.. kindly help... thank you
Krishna.. Koalas is just a pandas replacement and not a SQL equivalent.. It is mostly for someone transitioning into Spark from Python. If you are comfortable with Spark dataframes and Spark SQL, that is more advanced than Koalas.
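In that spirit, a minimal sketch of reading a catalog table with Spark SQL and then continuing with the pandas-like API (the table name is the one from the question; spark is the session object already available in Databricks):

```python
import databricks.koalas as ks  # importing adds to_koalas() to Spark dataframes

sdf = spark.sql("select * from database.table")  # plain Spark SQL read
kdf = sdf.to_koalas()                            # pandas-like API from here on
print(kdf.head())
```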
@@AIEngineeringLife thank you for clarifying sir..
And yes sir.. your tutorial is awesome
awesome
Can you share the code?