Serverless distributed processing with BigFrames

  • Published: 16 Oct 2024
  • Exciting news from Google Cloud with the launch of BigFrames (in preview). 🚀🚀🚀
    This new library has significant potential to streamline processes that were traditionally handled by more complex technologies like Apache Beam (Dataflow) or Spark. It also fills the gap between local Pandas operations running in Jupyter and deploying large-scale workloads in production, enabling faster interactive development at scale.
    By harnessing BigQuery's serverless compute and using Remote Functions on Cloud Run / Cloud Functions, it offers a more straightforward and modular approach to handling tasks before diving into machine learning workloads; a short code sketch follows the links below.
    #GoogleCloud #BigFrames #RemoteFunctions #DataProcessing #MachineLearning #DistributedCompute #Orchestration
    00:50 - What problem are we trying to solve
    03:27 - What is BigFrames
    04:58 - How does it work
    09:10 - Why should we care
    11:23 - Demo in Action
    23:17 - Pros & cons plus ideas
    Slides: docs.google.co...
    Repo: github.com/roc...
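    To give a feel for the programming model, here is a minimal sketch (the project id is a placeholder and the public natality table is just an example): pandas-style calls build a lazy expression that BigFrames compiles to SQL and runs on BigQuery's serverless compute, so only the small result ever reaches the notebook.

        import bigframes.pandas as bpd

        # All compute runs inside BigQuery, not on the local machine.
        bpd.options.bigquery.project = "my-project"  # placeholder project id
        bpd.options.bigquery.location = "US"

        # Lazily reference a table; no data is pulled to the client yet.
        df = bpd.read_gbq("bigquery-public-data.samples.natality")

        # Pandas-style transformations compile to BigQuery SQL.
        summary = (
            df[df["year"] >= 2000]
            .groupby("state")["weight_pounds"]
            .mean()
        )

        # Only this small aggregated result is materialised locally.
        print(summary.head())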

Comments • 5

  • @sajti812
    @sajti812 6 days ago

    Pretty cool intro to BigFrames

  • @practicalgcp2780
    @practicalgcp2780 11 months ago +2

    Hi there. Thanks for the feedback and very good questions. Let me address each point separately.
    1. In the as-is architecture, Airflow is just acting as a scheduler, and Cloud Scheduler is its replacement in the BigFrames approach. Keep in mind I am not saying this can replace everything Airflow does; it is positioned as an alternative to Dataflow for distributed processing when certain criteria are met.
    2. It depends on how you manage dependencies. Even within BigFrames you can have many steps that depend on each other, as I have shown in the diagram. But I would not use BigFrames to solve complex dependency management between different services, like pulling data from or sending data to a third party. Those are better managed in Airflow, or at least outside of BigFrames.
    3. It is not documented clearly, but from what I can see it depends on whether it is a native pandas operation (sort, group, aggregate, min, max), i.e. an analytical workload that can be run on BigQuery in SQL, or whether it uses the map() function to delegate to bespoke logic in a Python function. There is an exception: a simple key-to-value map can still be done inside BigQuery, but that is about it. (See the sketch after this list for the two paths in code.)
    4. It is designed for data science pre-processing, and there is an ML library that can do ML tasks too, which I haven't looked at in detail yet. In my view, distributed compute is distributed compute: you have a massive input and a predictable output, and you need it to scale. If it works, I don't see why it can't be used. Having said that, although it is easier, I would be cautious about trying to use this exclusively to replace Apache Beam. It can do some of that, but if you need a fully distributed processing engine that is efficient and cost-effective, I don't think BigFrames will be able to compete, especially where any custom processing has to be passed to a Cloud Run job via HTTPS requests, which isn't efficient compared to processing it on the workers themselves like Beam does. So consider what your use case is and see what works better. For me, I would use it for SQL-heavy transformations and for pre-processing prior to ML model training. Keep in mind the real power of this thing is the ability to process large amounts of data from Jupyter without needing any infrastructure for feature engineering during development.
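    To illustrate point 3, a minimal sketch of the two execution paths (the dataset and the bucketing logic are illustrative, and the remote_function decorator follows the preview-era docs, which took a list of input types and an output type):

        import bigframes.pandas as bpd

        df = bpd.read_gbq("bigquery-public-data.samples.natality")

        # Native pandas-style ops compile to SQL and run entirely
        # inside BigQuery -- no external calls involved.
        max_weight = df.groupby("state")["weight_pounds"].max()

        # Bespoke Python logic cannot be expressed in SQL, so BigFrames
        # deploys it as a remote function (Cloud Functions / Cloud Run)
        # that BigQuery invokes over HTTPS for each batch of rows.
        @bpd.remote_function([float], str)
        def weight_bucket(w: float) -> str:
            return "heavy" if w is not None and w > 8.0 else "normal"

        df["bucket"] = df["weight_pounds"].apply(weight_bucket)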

  • @SwapperTheFirst
    @SwapperTheFirst 11 months ago

    Hi Richard, thanks for the excellent presentation and demo.
    Could you please clarify a few doubts for me?
    (1) In your "as is" architecture you have Airflow as the orchestrator/scheduler. What tool plays that role in the BigFrames approach? For example, what if something fails during external function processing and you need to restart the data processing from scratch?
    (2) What tool would you use if you have non-linear steps, for example when step 3 depends on the outcome of step 2?
    (3) I don't fully understand how the framework decides that some steps can be vectorised/executed on BQ while others cannot and must be executed as external Pandas functions.
    (4) In your opinion, could this tool/framework be used for massive data processing with complex pipelines, or is it rather Pandas/Scikit-learn on steroids for data scientists?
    Again, thanks for your great work.