Asset-Based Data Orchestration (from DATA + AI Summit 2023)

  • Published: 27 Jul 2024
  • On June 8th, Sandy Ryza, lead engineer on the Dagster project, gave a presentation at the DATA + AI Summit 2023 in San Francisco. The talk was entitled "The Future of Data Orchestration: Asset-Based Orchestration".
    We are happy to share the key points of the talk in the video below.
    Sandy's thesis: data orchestration is a core component of any batch data processing platform, yet we've been using patterns that haven't changed since the 1980s. Sandy introduces a new pattern and way of thinking for data orchestration, known as asset-based orchestration, with data freshness sensors to trigger pipelines (see the sketch after this list).
    Try Dagster today with a 30-day free trial: dagster.io/lp/dagster-cloud-t...
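
To make the pattern concrete, here is a minimal sketch of a software-defined asset graph with a freshness policy and auto-materialization, assuming the Dagster APIs available around the time of the talk; the asset names are invented, not taken from the presentation:

```python
from dagster import AutoMaterializePolicy, FreshnessPolicy, asset


@asset
def raw_events():
    # Upstream asset: in a real pipeline this would ingest from a source system.
    return [{"action": "login"}, {"action": "login"}, {"action": "logout"}]


@asset(
    # Declare the outcome we care about: this asset should never lag
    # its upstream data by more than 60 minutes.
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=60),
    # Let the orchestrator decide when to materialize, rather than a cron schedule.
    auto_materialize_policy=AutoMaterializePolicy.eager(),
)
def login_counts(raw_events):
    # The function argument doubles as the dependency declaration,
    # which is how Dagster builds the asset graph.
    counts = {}
    for event in raw_events:
        counts[event["action"]] = counts.get(event["action"], 0) + 1
    return counts
```

Instead of wiring up an imperative pipeline, the declared policies let the orchestrator decide when login_counts needs to be recomputed.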

Comments • 10

  • @SuperDude69 • 1 year ago +3

    Really clearly described! I understand auto-materialization debugging much better now.

  • @dagsterio • 11 months ago

    Thanks for the questions. Sorry, we don't tend to monitor questions here on YouTube; we encourage you to ask them on our community Slack, which you can join at dagster.io/slack

  • @khiemhidden5019 • 8 months ago

    What if, once fraudulent_logins_model gets updated, there is a need to trigger a new insert into events_table?
    You could think of it as a circular dependency, but not an endless one, just a single-iteration loop.
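
One way to express that single-iteration loop without declaring a circular dependency is an asset sensor that launches a follow-up job whenever the model asset is materialized. A rough sketch, reusing the commenter's asset name and assuming a fraudulent_logins_model asset is defined elsewhere; the follow-up asset, job, and sensor names are made up:

```python
from dagster import AssetKey, RunRequest, asset, asset_sensor, define_asset_job


@asset
def events_table_extra_insert() -> None:
    # Hypothetical follow-up asset: performs the extra insert into
    # events_table after the model updates. It is deliberately not part
    # of the model's upstream graph, so no cycle is declared.
    print("inserting follow-up rows into events_table")


# Job that materializes only the follow-up asset.
insert_job = define_asset_job(
    "insert_into_events_table", selection=[events_table_extra_insert]
)


@asset_sensor(asset_key=AssetKey("fraudulent_logins_model"), job=insert_job)
def model_updated_sensor(context, asset_event):
    # Fires once per materialization of fraudulent_logins_model, so the
    # "loop" runs exactly one extra iteration instead of endlessly.
    return RunRequest(run_key=context.cursor)
```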

  • @hungnguyenthanh4101 • 11 months ago

    Could you do an introduction with MLflow, please?

  • @user-en4uz5kp2m • 1 year ago

    Hey Dagster team, I had a query.
    When it comes to using assets not to do any computation (i.e., they don't return any data or use I/O managers) but to trigger work on external services using resources, how do assets differ from operations (@op) in this case?
    I understand that assets use ops underneath to do computation, but I need an understanding of how using assets differs from using ops directly in the above case.
    What's the difference between them?
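
For context, the two shapes being compared might look roughly like this; a sketch only, with invented names and a print standing in for the external call. The observable difference is that the asset version is tracked in the asset catalog with a materialization history and lineage, while the op is only a step inside its job:

```python
from dagster import asset, job, op


@op
def refresh_dashboard_op() -> None:
    # As an op: a unit of computation that exists only inside a job and
    # leaves no entry in the asset catalog.
    print("calling external BI service")  # stand-in for a real API call


@job
def refresh_dashboard_job():
    refresh_dashboard_op()


@asset
def external_dashboard() -> None:
    # As an asset: the same side effect, but each run is recorded as a
    # materialization of a named, long-lived object, so it gets lineage,
    # history, and a place in the global asset graph, even though nothing
    # is returned and no I/O manager is used.
    print("calling external BI service")  # stand-in for a real API call
```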

  • @hungnguyenthanh4101 • 1 year ago

    I don't know if you could make a video on how to install it on Docker.

    • @dagsterio • 1 year ago

      You will find a "Deploying Dagster on Docker" guide here: docs.dagster.io/deployment/guides/docker

  • @luiztauffer8513 • 1 year ago

    Thanks for the nice presentation!
    How would this approach compare against using Airflow Sensors? It looks like the same thing, at least for upstream-triggered events.
    The downstream freshness policy does seem innovative!
    I wonder, whenever a pipeline is running "asynchronously" like that, do we completely lose the concept of a single pipeline run? Or does Dagster manage to somehow still represent that?

    • @s_ryz • 1 year ago +1

      Hey @luiztauffer8513 - thanks for watching!
      Regarding Airflow Sensors, this is addressed briefly at 4:02: Airflow Sensors help you chop up your DAG into smaller DAGs, but you still face the problems mentioned within each of those DAGs. Of course, nothing stops you from creating an Airflow DAG and an Airflow Sensor for every node in your data asset graph, but Airflow isn't really built to be operated like this. For example, it makes it harder to observe what's going on and harder to kick off manual runs, e.g. when you want to backfill some data. Also, even if there's a logical way to break them up into smaller DAGs, managing a DAG of DAGs is just more heavyweight, and you need to think carefully about where the boundaries are.
      > I wonder, whenever a pipeline is running "asynchronously" like that, do we completely lose the concept of a single pipeline run? Or does Dagster manage to somehow still represent that?
      When there's a set of tasks that can be executed at the same time, Dagster combines them into a single run that can be observed, tracked, or cancelled as a unit.
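
To make the single-run behavior concrete, a minimal sketch with invented asset names: selecting a set of connected assets plans them into one run, and runs launched automatically are grouped in broadly the same way when assets can execute together:

```python
from dagster import Definitions, asset, define_asset_job


@asset
def upstream_table():
    return [1, 2, 3]


@asset
def downstream_table(upstream_table):
    return [x * 2 for x in upstream_table]


# "*downstream_table" selects downstream_table plus everything upstream
# of it; materializing the selection produces one run that can be
# observed, tracked, or cancelled as a unit.
both_tables_job = define_asset_job("both_tables", selection="*downstream_table")

defs = Definitions(assets=[upstream_table, downstream_table], jobs=[both_tables_job])
```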

    • @fang_xianfu • 1 year ago

      I don't think it's so much that you lose the idea of a single pipeline run as that this feature is intended for circumstances where your use cases have outgrown the concept of a single pipeline run.