Building ML Models in Snowflake Using Python UDFs and Snowpark | DEMO
- Published: 22 Mar 2023
- Learn how to build machine-learning models in Snowflake in this demo by Sonny Rivera of ThoughtSpot and Chris Hastie of InterWorks. During the demo, they show how to use Snowpark to clean your data and perform feature engineering, build and train sales forecast models using Python in Snowflake, use Python UDFs to expose your predictive models, and present and analyze your models in ThoughtSpot.
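As a rough illustration of the pattern the demo follows (train a model, then expose its predict step as a Python UDF), here is a minimal sketch. The handler logic and all names (`predict_sales`, `PREDICT_SALES`) are illustrative stand-ins, not the code from the video:

```python
# Minimal sketch of exposing a "predict" step as a Snowflake Python UDF.
# The formula below is a placeholder for a real trained model's predict call.

def predict_sales(units: float, price: float) -> float:
    """Stand-in for model.predict(): hypothetical 5% uplift on revenue."""
    return units * price * 1.05

# Registering the handler requires a live Snowpark Session (not shown here):
# from snowflake.snowpark.types import FloatType
# session.udf.register(
#     func=predict_sales,
#     name="PREDICT_SALES",
#     input_types=[FloatType(), FloatType()],
#     return_type=FloatType(),
#     replace=True,
# )
# Once registered, SQL can call it, e.g.:
# SELECT PREDICT_SALES(units, price) FROM sales;
```

See the linked GitHub repository for the actual code used in the demo.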
To access the code used in this demo, go to:
github.com/ChrisHastieIW/Snow...
To access the Quickstart guide for this topic, go to:
github.com/thoughtspot/quicks...
Learn more about Thoughtspot:
Website: www.thoughtspot.com
Twitter: @thoughtspot
LinkedIn: www.linkedin.com/company/thoughtspot
Learn more about InterWorks:
Website: interworks.com
Twitter: @interworks
LinkedIn: www.linkedin.com/company/interworks
To connect with the presenters:
Sonny Rivera, Senior Analytics Evangelist, ThoughtSpot
LinkedIn: /sonnyrivera
Chris Hastie, Data Engineering and Analytics Consultant, InterWorks
LinkedIn: /chris-hastie
Learn how to build your application on Snowflake:
developers.snowflake.com
Continue the conversation by joining the Snowflake Community:
community.snowflake.com
❄ Join our YouTube community ❄ bit.ly/3lzfeeB
How would you solve this with a vectorized UDF?
Is there a demo covering the same?
Chris and I did not vectorize the UDF. That's a great idea. I'll sync with Chris and see what we can do. Thanks for the great suggestion.
@@sonny.rivera that would be very helpful.
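For anyone curious what the vectorized approach would look like: a hedged sketch of the idea follows. Instead of Snowflake calling the handler once per row, a vectorized (pandas) UDF receives whole batches as pandas Series. All names here are illustrative, not from the demo:

```python
# Hedged sketch of a vectorized UDF handler: the same stand-in formula
# as a scalar UDF, but applied to whole columns (pandas Series) at once,
# which cuts per-row call overhead.
import pandas as pd

def predict_sales_batch(units: pd.Series, price: pd.Series) -> pd.Series:
    # Column-wise arithmetic; a real handler would call model.predict here
    return units * price * 1.05

# In Snowflake, registration would mark the function as vectorized
# (e.g. via snowflake.snowpark.functions.pandas_udf); see the Snowflake
# documentation for the exact decorator and type-hint requirements.
```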
On a side note, is there any reference material on optimising costs for Snowflake compute resources?
Here are a few resources to get you started:
medium.com/snowflake/best-practices-to-optimize-snowflake-spend-73b8f66d16c1
medium.com/snowflake/using-snowflakes-scale-to-zero-capabilities-for-fun-profit-f326a1d222d0
medium.com/snowflake/deep-dive-into-managing-latency-throughput-and-cost-in-snowflake-2fa658164fa8
medium.com/snowflake/improve-snowflake-price-performance-by-optimizing-storage-be9b5962decb
medium.com/snowflake/compute-primitives-in-snowflake-and-best-practices-to-right-size-them-b3add53933a3
Thank you !
Where can I find the data set that is used in this video?
When you run the ML, does it run on the local machine or within Snowflake?
I often develop and test using VS Code/Python on my local machine and then deploy the code to Snowflake and Snowpark, where it runs in the cloud.
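One common way to keep the same code runnable both locally and in Snowflake is to read connection settings from the environment and only build a Session locally. A hedged sketch, with all variable names illustrative:

```python
import os

def connection_parameters() -> dict:
    """Collect Snowflake connection settings from environment variables,
    keeping credentials out of the source. Variable names are illustrative."""
    return {
        "account": os.environ.get("SNOWFLAKE_ACCOUNT", ""),
        "user": os.environ.get("SNOWFLAKE_USER", ""),
        "password": os.environ.get("SNOWFLAKE_PASSWORD", ""),
        "warehouse": os.environ.get("SNOWFLAKE_WAREHOUSE", ""),
        "database": os.environ.get("SNOWFLAKE_DATABASE", ""),
        "schema": os.environ.get("SNOWFLAKE_SCHEMA", ""),
    }

# Locally (e.g. in VS Code):
#   from snowflake.snowpark import Session
#   session = Session.builder.configs(connection_parameters()).create()
# Inside Snowflake (UDF or stored procedure), a session is supplied for
# you, so the same business logic runs unchanged in the cloud.
```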
If we do the per-category training and predictions in the UDF generate_auto_arima_predictions via a pandas DataFrame, we wouldn't get any parallelization benefit, right? We would process all the categories sequentially.
Shouldn't we use a UDTF for these kinds of operations?
Thanks for your comment! A UDTF would be a stronger option, as it could leverage parallel partitioning to perform these concurrently instead (as you mention). Check out the following two articles on training ARIMA models:
interworks.com/blog/2022/11/22/a-definitive-guide-to-creating-python-udtfs-directly-within-the-snowflake-user-interface/
interworks.com/blog/2022/11/29/a-definitive-guide-to-creating-python-udtfs-in-snowflake-using-snowpark/
For some more information on UDTFs and how they work, see:
interworks.com/blog/2022/11/15/an-introduction-to-python-udtfs-in-snowflake/
Thanks!
The models will run concurrently on the virtual warehouse. The UDTF is really just calling the 'predict' function. The model training is happening in the stored proc.
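To make the partitioning point concrete, here is a hedged sketch of a Python UDTF handler of the kind the articles above describe: Snowflake can run each PARTITION BY group (e.g., each product category) concurrently. The class name and SQL are illustrative, and the mean stands in for a real model's predict call:

```python
from typing import Iterable, Iterator, Tuple

class ForecastPerCategory:
    """Illustrative UDTF handler: buffers one partition's rows, then
    emits a single 'prediction' for that partition."""

    def __init__(self) -> None:
        self._values: list = []

    def process(self, value: float) -> Iterable[Tuple[float]]:
        # Called once per row; buffer the partition's data, emit nothing yet
        self._values.append(value)
        return []

    def end_partition(self) -> Iterator[Tuple[float]]:
        # Called once per partition; stand-in for model.predict(...)
        yield (sum(self._values) / len(self._values),)

# Illustrative SQL, assuming the handler is registered as
# FORECAST_PER_CATEGORY:
# SELECT category, prediction
# FROM sales,
#      TABLE(FORECAST_PER_CATEGORY(amount) OVER (PARTITION BY category));
```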