@8:50, I have one small doubt: we have already filtered on department_id == 6, so no department other than 6 remains. Do we really need groupBy(department_id) after filtering?
No, since the data is already filtered down to a single department, you can apply sum on it directly. The groupBy is not mandatory; it only keeps department_id as a column in the result.
@easewithdata
Thank you 👍
In the last video you mentioned that we should avoid UDFs, but here you used one to read the broadcast value. Will it impact performance?
Yes, we should avoid Python UDFs as much as possible. This example was just a demonstration of a use case for broadcast variables.
You can always write the UDF in Scala and register it for use from Python.
@easewithdata thanks
Can accumulator variables be used to calculate an average as well? A sum can be accumulated on each executor, but an average won't work the same way.
Hello Sushant,
To calculate the average, the simplest approach is to use two accumulator variables, one for the sum and another for the count. Afterwards, divide the sum by the count to get the average.
If you like the content, please make sure to share it with your network 🛜
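A rough sketch of the sum-and-count idea from the reply above, in plain Python rather than Spark so it runs anywhere. The nested lists are made-up values standing in for the rows each executor would process, and `total` and `count` play the role of the two accumulators (in PySpark they would presumably be created with `sc.accumulator(...)` and updated inside an action such as `foreach`). It also shows why averaging the per-executor averages gives a different, wrong answer when partitions have unequal sizes.

```python
# Plain-Python simulation, NOT Spark code: each inner list stands in for
# the rows that one executor processes (values are made up).
partitions = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]

# Two running totals, playing the role of two accumulator variables.
total = 0.0
count = 0
for part in partitions:
    for value in part:
        total += value   # "sum" accumulator
        count += 1       # "count" accumulator

# The division happens once, at the end, on the combined totals.
avg = total / count if count else None

# For contrast: averaging the per-partition averages weights every
# partition equally regardless of its size, so the result differs.
naive = sum(sum(p) / len(p) for p in partitions) / len(partitions)

print(avg, naive)
```

Sum and count can each be merged across executors in any order, which is why they suit accumulators; the average itself cannot be merged that way.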
Hi sir, what is the difference between a broadcast join and a broadcast variable?
In a broadcast join, a copy of the smaller DataFrame is also stored on each executor, so no shuffling happens across the executors.
A broadcast join implements the same concept as a broadcast variable. It simplifies its use with DataFrames.
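To make the shared concept concrete, here is a minimal plain-Python sketch, not Spark code. It assumes a small lookup table (`dept_names`, hypothetical values) that is copied to every partition, and a larger dataset split into partitions. Each partition resolves its own rows against the local copy, so no rows have to move between partitions. A broadcast variable gives you that local copy to use by hand; a broadcast join does the equivalent lookup for you at the DataFrame level.

```python
# Plain-Python simulation, NOT Spark code. All names and values are
# hypothetical and only illustrate the idea.

# Small lookup table: the thing that would be broadcast to every executor.
dept_names = {1: "Sales", 6: "Engineering"}

# Larger dataset of (employee_id, department_id), split into partitions.
partitions = [
    [(101, 6), (102, 1)],
    [(103, 6), (104, 9)],
]

def join_partition(rows, lookup):
    """Inner-join one partition against the local copy of the lookup."""
    return [
        (emp_id, dept_id, lookup[dept_id])
        for emp_id, dept_id in rows
        if dept_id in lookup  # unmatched rows are dropped, as in an inner join
    ]

# Every partition is processed independently with its own copy of the lookup.
joined = [row for part in partitions for row in join_partition(part, dept_names)]

print(joined)
```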
AWESOME
Please, can you provide the link to download the sample data?
All datasets are available on GitHub. Check out the URL in the video description.