Serverless distributed processing with BigFrames

  • Published: 16 Oct 2024
  • Exciting news from Google Cloud with the launch of BigFrames (in preview). 🚀🚀🚀
    This new library has significant potential to streamline processes that were traditionally handled by more complex technologies like Apache Beam (Dataflow) or Spark. It also fills the gap between local Pandas operations running in Jupyter and deploying large-scale workloads in production, enabling faster interactive development at scale.
    By harnessing BigQuery's serverless compute and using Remote Functions on Cloud Run / Cloud Functions, it offers a more straightforward and modular approach to handling tasks before diving into machine learning workloads; a short code sketch follows the links below.
    #GoogleCloud #BigFrames #RemoteFunctions #DataProcessing #MachineLearning #DistributedCompute #Orchestration
    00:50 - What problem are we trying to solve
    03:27 - What is BigFrames
    04:58 - How does it work
    09:10 - Why should we care
    11:23 - Demo in Action
    23:17 - Pros & cons plus ideas
    Slides: docs.google.co...
    Repo: github.com/roc...
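    To give a feel for the programming model, here is a minimal sketch (the project id is a placeholder and the public natality table is just an example): pandas-style calls build a lazy expression that BigFrames compiles to SQL and runs on BigQuery's serverless compute, so only the small result ever reaches the notebook.

        import bigframes.pandas as bpd

        # All compute runs inside BigQuery, not on the local machine.
        bpd.options.bigquery.project = "my-project"  # placeholder project id
        bpd.options.bigquery.location = "US"

        # Lazily reference a table; no data is pulled to the client yet.
        df = bpd.read_gbq("bigquery-public-data.samples.natality")

        # Pandas-style transformations compile to BigQuery SQL.
        summary = (
            df[df["year"] >= 2000]
            .groupby("state")["weight_pounds"]
            .mean()
        )

        # Only this small aggregated result is materialised locally.
        print(summary.head())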

Comments • 5

  • @sajti812
    @sajti812 6 days ago

    Pretty cool intro to BigFrames

  • @practicalgcp2780
    @practicalgcp2780 11 months ago +2

    Hi there. Thanks for the feedback and very good questions. Let me address each point separately.
    1. In the as-is architecture, Airflow is just acting as a scheduler, and Cloud Scheduler is its replacement in the BigFrames approach. Keep in mind I am not saying this can replace everything Airflow does; it is positioned as an alternative to Dataflow for distributed processing when certain criteria are met.
    2. It depends on how you manage dependencies. Even within BigFrames you can have many steps that depend on each other, as I have shown in the diagram. But I would not use BigFrames to solve complex dependency management between different services, like pulling data from or sending data to a third party. Those are better managed in Airflow, or at least outside of BigFrames.
    3. It is not documented clearly, but from what I can see it depends on whether it is a native pandas operation (sort, group, aggregate, min, max), i.e. an analytical workload that can be run on BigQuery in SQL, or whether it uses the map() function to delegate to bespoke logic in a Python function. There is an exception: a simple key-to-value map can still be done inside BigQuery, but that is about it. (See the sketch after this list for the two paths in code.)
    4. It is designed for data science pre-processing, and there is an ML library that can do ML tasks too, which I haven't looked at in detail yet. In my view, distributed compute is distributed compute: you have a massive input and a predictable output, and you need it to scale. If it works, I don't see why it can't be used. Having said that, although it is easier, I would be cautious about trying to use this exclusively to replace Apache Beam. It can do some of that, but if you need a fully distributed processing engine that is efficient and cost-effective, I don't think BigFrames will be able to compete, especially where any custom processing has to be passed to a Cloud Run job via HTTPS requests, which isn't efficient compared to processing it on the workers themselves like Beam does. So consider what your use case is and see what works better. For me, I would use it for SQL-heavy transformations and for pre-processing prior to ML model training. Keep in mind the real power of this thing is the ability to process large amounts of data from Jupyter without needing any infrastructure for feature engineering during development.
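    To illustrate point 3, a minimal sketch of the two execution paths (the dataset and the bucketing logic are illustrative, and the remote_function decorator follows the preview-era docs, which took a list of input types and an output type):

        import bigframes.pandas as bpd

        df = bpd.read_gbq("bigquery-public-data.samples.natality")

        # Native pandas-style ops compile to SQL and run entirely
        # inside BigQuery -- no external calls involved.
        max_weight = df.groupby("state")["weight_pounds"].max()

        # Bespoke Python logic cannot be expressed in SQL, so BigFrames
        # deploys it as a remote function (Cloud Functions / Cloud Run)
        # that BigQuery invokes over HTTPS for each batch of rows.
        @bpd.remote_function([float], str)
        def weight_bucket(w: float) -> str:
            return "heavy" if w is not None and w > 8.0 else "normal"

        df["bucket"] = df["weight_pounds"].apply(weight_bucket)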

  • @SwapperTheFirst
    @SwapperTheFirst 11 months ago

    Hi Richard, thanks for the excellent presentation and demo.
    Could you please clarify a few doubts for me?
    (1) In your "as is" architecture you have Airflow as the orchestrator/scheduler. What tool plays that role in the BigFrames approach? For example, what if something fails during external function processing and you need to restart the data processing from scratch?
    (2) What tool would you use if you have non-linear steps, for example when step 3 depends on the outcome of step 2?
    (3) I don't fully understand how the framework decides that some steps can be vectorised/executed on BQ while others cannot and must be executed as external Pandas functions.
    (4) In your opinion, could this tool/framework be used for massive data processing with complex pipelines, or is it rather Pandas/Scikit-learn on steroids for data scientists?
    Again, thanks for your great work.