What is a Headless Data Architecture?

  • Published: 29 Sep 2024

Comments • 18

  • @danielthebear
    @danielthebear 3 months ago +4

    I love Iceberg, but I probably would not apply this architecture when data is distributed across different cloud providers, because each query that goes across cloud providers will incur high latency and generate egress costs - costs that will be difficult to predict. Furthermore, the CAP theorem applies when data is distributed. What are your thoughts on those 3 points?

    • @LtdJorge
      @LtdJorge 3 months ago +2

      Well, the team building your architecture could abstract it below the public API. If you query data from BigQuery, make the system do all processing on GCP, and so on.
      However, if you're trying to join/aggregate data from different clouds, then yeah, I guess you're out of luck. Or you could make a query engine that is architecture-aware and takes into account where the data is, the potential egress/ingress, etc., as costs for the query planner, and then tries to push down as many operations as possible, so that you only send the most compact, already processed data over the internet instead of the entire projection.
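The egress-aware planner idea in the reply above can be sketched roughly as follows. This is a minimal illustration, not any real engine's planner: the `TableScan` type, the `plan` function, and the per-GB prices are all hypothetical, and it only models where to run a join and whether filters and aggregations are pushed down to the source cloud first.

```python
from dataclasses import dataclass

# Hypothetical per-GB egress prices; real prices vary by provider, region and tier.
EGRESS_USD_PER_GB = {"gcp": 0.12, "aws": 0.09}


@dataclass
class TableScan:
    cloud: str         # cloud where the table's files live
    raw_gb: float      # size of the full projection
    reduced_gb: float  # size after filters/aggregations are pushed down


def egress_cost(scan, compute_cloud, pushdown):
    """Estimated cost of moving one scan's output to the cloud running the join."""
    if scan.cloud == compute_cloud:
        return 0.0  # data never leaves its cloud
    size = scan.reduced_gb if pushdown else scan.raw_gb
    return size * EGRESS_USD_PER_GB[scan.cloud]


def plan(scans, pushdown=True):
    """Pick the compute location with the lowest total estimated egress."""
    candidates = {s.cloud for s in scans}
    costs = {c: sum(egress_cost(s, c, pushdown) for s in scans) for c in candidates}
    best = min(costs, key=costs.get)
    return best, costs[best]


scans = [
    TableScan("gcp", raw_gb=500, reduced_gb=2),
    TableScan("aws", raw_gb=50, reduced_gb=1),
]
print(plan(scans, pushdown=False))  # ships the entire projection
print(plan(scans, pushdown=True))   # ships only pre-processed data
```

With these made-up numbers the join runs where the larger table lives, and pushdown shrinks what crosses the provider boundary. A real planner would also weigh latency and compute cost, which this sketch ignores.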

  • @abhijitsarkar482
    @abhijitsarkar482 2 months ago +1

    Hello Adam
    This is a great video.
    As you said, Iceberg can be used to store data in table format. I'm just wondering: does it act a bit like a NoSQL database, exposing its own query language? What I mean is, is Iceberg supposed to be consumed via some standard SaaS, or can it be queried directly?
    And lastly, would it be easy to integrate it with an observability platform like Grafana or New Relic, so that it can act as the source?

  • @thisismissem
    @thisismissem 3 months ago +1

    The most impressive thing about this video is he's writing everything backwards on that glass board he's behind 😮

  • @nroelandt
    @nroelandt 3 months ago

    Hi Adam, this sounds great in theory and in 'full' load scenarios. What about CDC workloads, where full loads and deltas are separate? The logic and the compute power (credits) needed will skyrocket.

    • @emonymph6911
      @emonymph6911 1 month ago

      @@ConfluentDevXTeam Hi Adam, I love this video and your reply here. It would be fun to hear your thoughts on the topics below.
      Topic 1
      Suppose we want to do EDA microservices with DBs + headless, and the data is generated from user input such as a form. Would we write it to a topic and then use KC or a table builder to write the topic to an OLTP? Basically, can an OLTP DB also be our processing head? And why would we use one? Because we want the frontend app to be ACID, and OLTP has data safety baked in (rollback, locking protection, etc.). The Confluent event log lets us rerun failed writes if the OLTP was down, but how would we log which events failed to reach the OLTP vs. the object store? Maybe this topic would make a good video.
      Topic 2
      I would also love to see a video and learn more about how to make Confluent streams from frontend UI elements to OLTP carry the data in an ACID way, if possible. Here is what ChatGPT said (it said it's not inherently ACID and recommends RR for microservices, but gave some thoughts). I'm assuming CDC is an important element for this to work, but I could be wrong.
      Atomicity:
      Idempotent Producers: Kafka supports idempotent producers, which ensure that messages are not duplicated. This helps maintain atomicity at the message level, ensuring that messages are either sent successfully or not sent at all.
      Transactional Producers and Consumers: Kafka provides transaction APIs that allow you to produce and consume messages atomically across multiple topics and partitions. This means you can ensure that either all operations in a transaction succeed or none do, bringing atomicity to a larger scope.
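The all-or-nothing behaviour described in the quoted answer can be sketched as below. This is a hedged illustration, not Confluent's recommended pattern: `send_atomically` and `FakeProducer` are hypothetical names, the broker address and transactional id are placeholders, and the fake producer stands in for a real client so the sketch runs without a broker. The method names mirror the transactional methods of `confluent_kafka.Producer`, and the configuration keys are the standard Kafka producer settings.

```python
# Settings named after Kafka producer configuration keys; values are placeholders.
PRODUCER_CONFIG = {
    "bootstrap.servers": "localhost:9092",  # assumption: a local broker
    "enable.idempotence": True,             # retried sends are de-duplicated
    "transactional.id": "form-writer-1",    # hypothetical id, stable per producer instance
}


def send_atomically(producer, records):
    """Write all (topic, value) records in one transaction, or none of them.

    `producer` is expected to expose the transactional methods of
    confluent_kafka.Producer, with init_transactions() already called.
    """
    producer.begin_transaction()
    try:
        for topic, value in records:
            producer.produce(topic, value=value)
        producer.commit_transaction()
        return True
    except Exception:
        # Simplification: a real client distinguishes retriable,
        # abortable and fatal errors instead of aborting on everything.
        producer.abort_transaction()
        return False


class FakeProducer:
    """In-memory stand-in so the sketch runs without a broker."""

    def __init__(self, fail_on=None):
        self.pending, self.committed, self.fail_on = [], [], fail_on

    def begin_transaction(self):
        self.pending = []

    def produce(self, topic, value=None):
        if value == self.fail_on:
            raise RuntimeError("simulated send failure")
        self.pending.append((topic, value))

    def commit_transaction(self):
        self.committed.extend(self.pending)
        self.pending = []

    def abort_transaction(self):
        self.pending = []


producer = FakeProducer(fail_on=b"bad")
print(send_atomically(producer, [("orders", b"ok"), ("audit", b"bad")]))
print(producer.committed)
```

Note that this atomicity only covers writes to Kafka topics. Consumers must read with `isolation.level=read_committed` to skip aborted messages, and a write to an external OLTP database is not part of the Kafka transaction.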

  • @thomass9181
    @thomass9181 2 months ago

    Hi. Thanks for an informative video. One of the topics you mention very briefly is data access control - as you are saying: “... this dotted line is doing a bit of the heavy lifting, but ...”, and then you continue to explain that it essentially means that access can be controlled at the data level. My impression is that this is not the case - this seems to be highly dependent on the engine you choose as "head", as there seems to be no inherent data access control in the storage layer (of which Delta Lake, Apache Iceberg and Apache Hudi are probably the best known for tables). How would I for example achieve a single point of access control across DuckDB and Databricks (to use your own examples of “heads”), or even Microsoft Fabric compute options (which seems to have different security models depending on the compute option used)? Unless I do add a "common" head/query point or some form of policy/security platform. Any insights greatly appreciated.

    • @__toby__
      @__toby__ 2 months ago

      Well, if you're using S3 as your storage layer, then you need to define what can access that bucket. So it's defined at the data layer, not the head.

    • @thomass9181
      @thomass9181 2 months ago

      ​@@__toby__ Thank you. You are right of course - access can (and should) be applied to the object storage layer. I think my confusion stemmed from the files/folders and table mismatch - you apply access to a bucket/folder path without necessarily knowing that it corresponds to e.g. a Delta or Iceberg table (in the sense that there is no "table semantics" in the storage layer). In practice though, using a combination seems common (thereby making it less "headless") - due to for example simplified management or data governance reasons, or due to the lack of granular control and flexibility that the different heads can provide.
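The files/folders versus table mismatch discussed in this exchange can be made concrete with a small sketch. It assumes, purely for illustration, that a table's data and metadata files all live under one S3 prefix; the helper `table_read_policy`, the bucket name, the prefix, and the role ARN are hypothetical. The policy grants read access to a path and carries no notion of "table" beyond that.

```python
import json


def table_read_policy(bucket, table_prefix, principal_arn):
    """Build an S3 bucket policy granting read access to one table's prefix."""
    prefix = table_prefix.strip("/") + "/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadTableFiles",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                "Sid": "ListTablePrefix",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Listing is limited to keys under the table's prefix.
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}*"}},
            },
        ],
    }


policy = table_read_policy(
    "example-lake",                            # hypothetical bucket
    "warehouse/sales/orders",                  # hypothetical table location
    "arn:aws:iam::123456789012:role/analyst",  # hypothetical role
)
print(json.dumps(policy, indent=2))
```

Any head that reads the bucket with that role gets the same file-level access, but row- or column-level rules cannot be expressed this way, which is where a catalog or engine-level policy layer tends to come back in.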

  • @marcom.
    @marcom. 3 months ago +3

    I don't get the point of this video, I must admit. If we build modular architectures with bounded contexts, each with its own data, loosely coupled with EDA - why should I want something that sounds like the exact opposite?

  • @DavidTangye
    @DavidTangye 2 months ago +1

    How to make database driven software as massively overcomplicated as humanly possible, thus assuring your own job and a guaranteed income stream for your software company from herds of gullible corporates with more money than sense.

    • @DavidTangye
      @DavidTangye 2 months ago

      @@ConfluentDevXTeam Human intelligence

  • @gaddafun
    @gaddafun 3 months ago +1

    Why not just call it a data lakehouse architecture like others are doing?

    • @thomass9181
      @thomass9181 2 months ago

      @@ConfluentDevXTeam Thank you for taking the time to reply to comments - it provides very useful context. I appreciate the distinction you make with headless data architecture here. However, it seems there might be a confusion between Lakehouse technology and the commonly applied data layout principles within Lakehouse architectures. Lakehouse papers, such as "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics" (M Armbrust, A Ghodsi, R Xin, M Zaharia, 2021) and "Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores" (Armbrust, M. et al., 2020) do not mandate medallion architectures or specific zoning requirements. Instead, they emphasize the unification of data warehousing and analytics capabilities over a single open data lake file format - much like what you describe as headless in the video. Writing directly from a stream to an append-only Iceberg or Delta table for example, for which you can “bring your own compute” to query, would fit well within the “Lakehouse” concept in my opinion, although maybe not considered "best practice" or common. (I empathize with your problem 1) by the way, having done much of the same myself - however, data for analytics very often DO mandate some form of "data value chain", for which layers may be useful).
      I think the main distinction based on your video and comment may be 1) the inclusion of streams (with Kafka in your example) and 2) the encompassing of operationally-oriented workloads. When it comes to the table storage however, whether analytical or operational, the headless architecture you describe would still depend on the same open storage layer and formats? Or am I missing something? Could you clarify how headless data architecture fundamentally differs in technology from the Lakehouse when it comes to the right hand side of your diagram (tables)?

    • @thomass9181
      @thomass9181 2 months ago

      For some reason my comment was deleted, not sure why, it was well-intended and hopefully useful for others as well. Will try again:
      Thank you for taking the time to reply to comments - it provides very useful context. I appreciate the distinction you make with headless data architecture here. However, it seems there might be a confusion between Lakehouse-technology and the commonly applied data layout principles within Lakehouse architectures. Lakehouse papers, such as "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics" (M Armbrust, A Ghodsi, R Xin, M Zaharia, 2021) and "Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores" (Armbrust, M. et al., 2020) do not mandate medallion architectures or specific zoning requirements. Instead, they emphasize the unification of data warehousing and analytics capabilities over a single open data lake file format - much like what you describe as headless in the video. Writing directly from a stream to an append-only Iceberg or Delta table for example, for which you can “bring your own compute” to query, would fit well within the “Lakehouse” concept in my opinion, although perhaps not common according to “best practice”. I empathize with your problem 1) by the way, having done much of the same myself, although data for analytics do normally involve some form of data value chain, where layers can prove useful.
      I think the main distinction based on your video and comment may be 1) the inclusion of streams (with Kafka in your example) and 2) the encompassing of operationally-oriented workloads. When it comes to the table storage however, whether analytical or operational, the headless architecture you describe would still depend on the same open storage layer and formats? Or am I missing something? Could you clarify how headless data architecture fundamentally differs in technology from the Lakehouse when it comes to the right hand side of your diagram (tables)?

  • @jarrodhroberson
    @jarrodhroberson 3 months ago

    Congratulations, you rediscovered client-server architecture and just confusingly renamed it headless. By your definition every RDBMS is “headless”.