What is a Headless Data Architecture?

  • Published Jul 3, 2024
  • The headless data architecture. Is it a fad? Some marketecture? Or something real? In this video, Adam Bellemare takes you through the basics of the headless data architecture and why it's beginning to emerge as a distinct pattern in its own right.
    Driven by the decoupling of data computation from storage, the headless data architecture provides the basis for a modular data ecosystem. Stream your data for near-real-time, low-latency use cases, or convert it to an Iceberg table for analytical use cases.
    The headless data architecture lets you plug your choice of processor into the stream/table data layer, using whichever format works best for you. Compose your applications, services, and data warehouses/lakes on the modular data layer - no more copying data, building pipelines, or dealing with chronic break-fix work.
    RELATED RESOURCES
    ► Apache Iceberg - iceberg.apache.org/
    ► Tableflow - cnfl.io/3VKq7hr
    ► Data quality restrictions and checks for enforcing data contracts - cnfl.io/3xnXk9i
    ► Connecting data into your headless data architecture with Kafka Connect - cnfl.io/4cqPvOK
    CHAPTERS
    00:00 - Intro
    00:42 - Kafka for Streams
    01:13 - Iceberg for Tables
    03:26 - Plugging in Heads
    04:55 - Benefits of HDA
    06:45 - Difference between HDA and Data Lake
    08:34 - Tableflow for Streams-to-Tables
    09:52 - Summary
    --
    ABOUT CONFLUENT
    Confluent is pioneering a fundamentally new category of data infrastructure focused on data in motion. Confluent’s cloud-native offering is the foundational platform for data in motion - designed to be the intelligent connective tissue enabling real-time data, from multiple sources, to constantly stream across the organization. With Confluent, organizations can meet the new business imperative of delivering rich, digital front-end customer experiences and transitioning to sophisticated, real-time, software-driven backend operations. To learn more, please visit www.confluent.io.
    #streamprocessing #apachekafka #kafka #confluent
  • Science & Technology

Comments • 15

  • @ConfluentDevXTeam
    @ConfluentDevXTeam 15 days ago +5

    Hey, Adam here. Plugging your data into the processing and query heads of your choice is a significant benefit of the headless data architecture. Let me know what heads you make the most use of, and what pain points you have!

  • @danielthebear
    @danielthebear 15 days ago +4

    I love Iceberg, but I probably would not apply this architecture when data is distributed across different cloud providers, because each query that goes across cloud providers will incur significant latency and generate egress costs - costs that are difficult to predict. Furthermore, the CAP theorem applies when data is distributed. What are your thoughts on those 3 points?

    • @LtdJorge
      @LtdJorge 15 days ago +2

      Well, the team building your architecture could abstract it below the public API. If you query data from BigQuery, make the system do all the processing on GCP, and so on.
      However, if you're trying to join/aggregate data from different clouds, then yeah, I guess you're out of luck. Or you could build a query engine that is architecture-aware and takes into account where the data is, the potential egress/ingress, etc., as costs for the query planner, and then tries to push down as many operations as possible, so that you only send the most compact, already-processed data over the internet instead of the entire projection.
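
      To make that concrete, here's a toy Python sketch (the prices, plan shapes, and names are all hypothetical, not any real engine's API): charge each candidate plan for the bytes it moves between clouds and pick the cheapest, which naturally favors pushing aggregation down to where the data lives.

      ```python
      # Toy cost model: charge each candidate plan for cross-cloud egress.
      # Prices and plan shapes are illustrative only.
      EGRESS_USD_PER_GB = {"aws": 0.09, "gcp": 0.12, "azure": 0.087}

      def plan_cost(plan):
          """Sum the egress cost of every cross-cloud transfer in a plan."""
          return sum(
              gb * EGRESS_USD_PER_GB[src]
              for src, dst, gb in plan["transfers"]
              if src != dst
          )

      # Plan A ships 500 GB of raw rows from GCP to AWS before aggregating;
      # plan B aggregates inside GCP and ships only ~20 MB of results.
      plans = [
          {"name": "ship-raw",  "transfers": [("gcp", "aws", 500.0)]},
          {"name": "push-down", "transfers": [("gcp", "aws", 0.02)]},
      ]
      print(min(plans, key=plan_cost)["name"])  # -> push-down
      ```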

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 14 days ago +3

      Adam here. Inter-cloud costs remain a factor, but typically I wouldn't expect to see a single query federated across clouds WITHOUT taking data locality into consideration. For example, issue separate queries for each cloud, aggregate locally, then bring those results to a single cloud for final aggregation (basically a multi-cloud Map-Reduce - thanks, Hadoop!). Speed also remains a factor, as you pointed out, due to the CAP theorem. There is no free lunch, so if you're going with a global, multi-cloud, distributed data layer, then yeah, you should probably invest in some tooling to prevent your users from shooting themselves in the foot with a $50k-per-query bill.
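
      A minimal sketch of that multi-cloud Map-Reduce shape (query_cloud() is a hypothetical stand-in for issuing a per-cloud query; the numbers are made up):

      ```python
      # Each cloud computes a partial aggregate locally (the "map" side);
      # only those tiny partials cross cloud boundaries for the final
      # "reduce" in a single cloud.
      def query_cloud(cloud):
          partials = {"aws": (1200, 40), "gcp": (900, 30), "azure": (300, 10)}
          return partials[cloud]  # (sum, count) for some metric

      total = count = 0
      for cloud in ("aws", "gcp", "azure"):
          s, n = query_cloud(cloud)
          total, count = total + s, count + n

      print("global average:", total / count)  # 2400 / 80 = 30.0
      ```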

  • @nroelandt
    @nroelandt 8 days ago

    Hi Adam, this sounds great in theory and in 'full load' scenarios. What about CDC workloads, where full loads and deltas are separate? The logic and the compute power (credits) needed will skyrocket.

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 7 days ago

      Adam here.
      When you create the CDC topic and snapshot the table, set retention to infinite and turn compaction on. Your topic will be an eventually-consistent replica of your database table. Whenever you create, update, or delete (CUD) a row in the DB, the latest full set of data will be propagated to the topic (think Debezium, with its before/after fields).
      Then as a consumer you can simply materialize the whole topic if you want the whole set of data, or just select the fields your app cares about and discard the rest. Basically, you want to do whatever you can to spare your consumers from having to "reconstruct" the data on their own; merging snapshots and deltas leads to a lot of complexity and work, replicated for every single consumer that wants the data.
      The tradeoff is more data over the wire with conventional message brokering - but the benefit is a much simpler architecture. Storage is super cheap, especially if you're using cloud storage as a backer for your Kafka topics. Note that this design is becoming even simpler with the adoption of Kafka replication without the network. E.g., Confluent Freight topics use Amazon S3 as both storage AND replication, so you don't pay cross-AZ fees anymore. Check it out here if you want to know more: www.confluent.io/blog/introducing-confluent-cloud-freight-clusters/
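
      For reference, here's a minimal sketch of those topic settings using the confluent-kafka Python client (the topic name and broker address are hypothetical):

      ```python
      # Create a compacted, infinite-retention CDC topic: the topic then
      # acts as an eventually-consistent replica of the source table,
      # keyed by primary key, as described above.
      from confluent_kafka.admin import AdminClient, NewTopic

      admin = AdminClient({"bootstrap.servers": "localhost:9092"})

      cdc_topic = NewTopic(
          "db.public.customers",  # hypothetical Debezium-style topic name
          num_partitions=6,
          replication_factor=3,
          config={
              "cleanup.policy": "compact",  # keep the latest value per key
              "retention.ms": "-1",         # never expire records by time
          },
      )

      # create_topics() is asynchronous; block on each future to surface errors.
      for topic, future in admin.create_topics([cdc_topic]).items():
          future.result()
          print(f"Created {topic}")
      ```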

  • @thisismissem
    @thisismissem 11 days ago

    The most impressive thing about this video is he's writing everything backwards on that glass board he's behind 😮

  • @uchechukwumadu9625
    @uchechukwumadu9625 15 days ago +1

    Why not just call it a data lakehouse architecture like others are doing?

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 14 days ago +4

      Adam here. I covered that here: th-cam.com/video/dahs6jpCYQs/w-d-xo.html
      Headless is the full decoupling of data access from processing _of all forms_, providing reusable data assets for use anywhere in the company - not just for analytics use cases. To emphasize: a data lake, lakehouse, or lake-like warehouse are all analytics constructs first and foremost. They're also predicated on the notion that you must copy all the data into the data lake's bronze layer "as-is". From there, you add schemas, formatting, and structure, leading to a silver layer (another copy). Only then can you start doing work on it.
      The problems:
      1) You don't own the data. The source breaks, your pipeline breaks, and then you're forced to react to it, determine impact/contamination, reprocess data sets, etc. (I did this for 7-8 years myself; no thanks.)
      2) It's stuck in your data lake. All that work you did to convert it from Source->Bronze->Silver is only usable if you use the data lake. _Historically_, leading data lake providers have been happy to provide you with the best performance ONLY if you use their query engines. Using an external engine (if even compatible) would lead to far worse performance. Data lakehouse/warehouse/househouse providers were more than happy to lock you in on the data front, because they made big $$$ on it. But happily, this is starting to change due to the adoption of open-source formats that you can run yourself in house - you can see it in the growing adoption of Iceberg (Databricks bought Tabular, founded by Iceberg's co-creators; Snowflake is investing heavily in Iceberg and open-sourcing its Polaris catalog). Data lake providers _could_ decide NOT to adopt these open formats, but then they risk losing their business to those who have - so the result is that most players are letting go of their control over storage so that they can take on Iceberg/Delta/Hudi-compatible workloads they may not have had access to before.
      If you want a quick mental shortcut for how this is different: a headless data architecture lets you plug your data into your data lake _at the silver layer_. The data is well-formed, controlled, schematized, and has designated ownership. But you can also plug that same data into a DuckDB instance, or into BigQuery, or Flink, or any other Iceberg-compatible / Kafka-compatible consumer endpoint. The idea is that you've decoupled the creation of the data plane from "analytics-only" concerns, and instead focused on building modular, reusable building blocks that can power any number of data lakes, warehouses, or swamps, in addition to operational systems.
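
      As a small illustration of that pluggability, here's what the DuckDB case might look like in Python using DuckDB's iceberg extension (the table path is hypothetical; an S3 path works the same way once httpfs and credentials are configured):

      ```python
      # Point a local DuckDB "head" at the shared, silver-grade Iceberg
      # table in place: no pipeline, no copy into a warehouse first.
      import duckdb

      con = duckdb.connect()
      con.execute("INSTALL iceberg")
      con.execute("LOAD iceberg")

      rows = con.execute(
          """
          SELECT count(*)
          FROM iceberg_scan('warehouse/shared/orders', allow_moved_paths = true)
          """
      ).fetchone()
      print(rows)
      ```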

  • @marcom.
    @marcom. 15 days ago +3

    I don't get the point of this video, I must admit. If we build modular architectures with bounded contexts, each with its own data, loosely coupled with EDA - why should I want something that sounds like the exact opposite?

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 14 days ago +3

      Adam here. I'm not advocating removing bounded contexts, putting all the data into a big mixed pile, and tightly coupling all your systems together.
      A headless data architecture promotes the data that should be shared into an accessible format - in this version, the stream or the table. If you already have EDA with well-formed and schematized streams, you're halfway there. The table component is an extension, where you take business data circulating in your streams and materialize it into an Iceberg table - but note that we didn't shove it into some data lake somewhere for only that lake to use. It remains outside the lake, and is _pluggable_ into whatever lake, warehouse, operational system, SaaS app, or client that needs it. This pluggability removes the need to build pipelines and copy data.
      The gist is that you focus on building a data layer that promotes access and reuse - something that comes for free with a Kafka-based EDA, but that has historically been a struggle for tables, due to the general approach of dumping everything into a data lake to sort out later.
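
      To sketch what that stream-to-table materialization amounts to, here's a hedged example with pyiceberg (it assumes a catalog named "default" and a pre-created table "shared.orders"; a real setup would drain micro-batches from Kafka continuously):

      ```python
      # Append a micro-batch of stream events to the shared Iceberg table.
      # Each append commits a new Iceberg snapshot that any compatible
      # engine (Flink, DuckDB, BigQuery, ...) can read without a copy.
      import pyarrow as pa
      from pyiceberg.catalog import load_catalog

      catalog = load_catalog("default")  # config from .pyiceberg.yaml or env
      table = catalog.load_table("shared.orders")

      batch = pa.table({
          "order_id": pa.array([101, 102], pa.int64()),
          "amount": pa.array([19.99, 5.25], pa.float64()),
      })

      table.append(batch)
      ```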

  • @jarrodhroberson
    @jarrodhroberson 12 days ago

    congratulations, you rediscovered client-server architecture and just confusingly renamed it headless. by your definition every RDBMS is "headless"

    • @ConfluentDevXTeam
      @ConfluentDevXTeam 12 days ago

      Adam here. It seems you're missing some key elements:
      1) A traditional RDBMS bundles processing with storage. In headless, the storage is completely separate from the processing and query layers. I do not know of any RDBMS that lets you access its underlying storage (e.g., a B-tree) directly, but if there is one, I would be keen to find out.
      2) You don't have the risk of one query saturating your data layer and stalling other queries, like you do in an RDBMS. However, this relies on the data layer being served by massive cloud object storage (R2, GCS, Azure Blob, S3, etc.), rather than by an in-house HDFS cluster you run yourself.
      3) Client-server embeds business logic inside the server for the client to call; there is a tight coupling between the two. In an HDA, there are no smarts in the data layer - it's just data that has been created by the upstream services. If your server only provided raw data via a GET, and writes via a PUT/POST, but had absolutely no other functionality whatsoever, then you could equate it to a headless model. That's pretty much what Iceberg and Kafka do, with a few cleanup optimizations sprinkled in.