Scaling Past Joins! - A Framework for Incremental Materialized Views w/ Ecto
- Published Jan 22, 2025
- ElixirConf US 2024 - Chris Bailey
2024.elixircon...
I think databases are the heart of your application.
Generally speaking, whenever we write Elixir for anything you could call “data-intensive”, we try to offload as much of the work as possible to the database.
I’m part of the “Data and Insights” team at Vetspire, where databases are our bread and butter.
What do you do if the problem you’re trying to solve isn’t as simple as writing more optimal joins? What if your data model is outside of your control and you can’t easily change it?
More importantly, what happens if you have so much data to churn through that you can’t reasonably query what you need quickly enough?
One of the first problems I worked on at Vetspire seemed simple enough! We had a slow query that we needed to optimize:
Patients have medications, immunizations, order items, and other things.
Medications/Immunizations/Order Items all have product IDs and dates associated with them
We needed to know, for a single patient, which products were ordered most recently, when they were ordered, and when each of those products was next due
Medications/Immunizations/Order Items could all have the same product IDs
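One way to express that requirement in Ecto looks roughly like the sketch below. The schema and field names (Medication, Immunization, OrderItem, product_id, date) are hypothetical stand-ins for illustration; the talk describes the real query as a pile of joins and distincts, whereas this sketch uses a union plus Postgres’ DISTINCT ON to keep the most recent row per product:

```elixir
import Ecto.Query

# Hypothetical schemas: each has :patient_id, :product_id, and :date.
def latest_products_query(patient_id) do
  meds =
    from m in Medication,
      where: m.patient_id == ^patient_id,
      select: %{product_id: m.product_id, date: m.date}

  imms =
    from i in Immunization,
      where: i.patient_id == ^patient_id,
      select: %{product_id: i.product_id, date: i.date}

  orders =
    from o in OrderItem,
      where: o.patient_id == ^patient_id,
      select: %{product_id: o.product_id, date: o.date}

  union = meds |> union_all(^imms) |> union_all(^orders)

  # DISTINCT ON (product_id) ... ORDER BY product_id, date DESC
  # keeps only the most recent row per product.
  from r in subquery(union),
    distinct: r.product_id,
    order_by: [asc: r.product_id, desc: r.date]
end
```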
This was all done naively in raw Elixir… so we rewrote it to use a bunch of joins, distincts, etc., and the job was done!
For good measure, we added some indexes and stuff ran more than well enough.
That was, until the marketing team wanted to extend this feature to send automated reminders to all patients who were overdue for certain products.
We extended our existing queries as you might assume, and things kept working. Then Vetspire grew, and more and more clinics wanted to bring over their data and use Vetspire to handle their patient care and reminders…
We reached a point where, even when streaming our data, partitioning it by individual clinic / time range, and using several heuristics to minimize the amount of data we’d need to query, our Postgres instance was struggling to actually execute our queries.
When we debugged this, it turned out that even with “optimal” indexes, the data we were requesting and joining on was causing memory to explode in Postgres itself, leading to really bad performance across the web app.
It turned out that doing a single query with all of the complex joins, distincts, filters, selects, subqueries, etc. was causing Postgres to spend most of its time writing temporary results out to the filesystem, slowing everything to a crawl.
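That kind of spill is visible in Postgres’ query plans. A sketch of how you might check for it from Elixir, using Ecto.Adapters.SQL.explain/4 (Repo and query here are placeholders for your repo and an Ecto query):

```elixir
# analyze: true runs the query for real; the extra options are
# forwarded to Postgres' EXPLAIN.
plan = Ecto.Adapters.SQL.explain(Repo, :all, query, analyze: true, buffers: true)
IO.puts(plan)

# In the output, lines like
#   Sort Method: external merge  Disk: 512000kB
#   Buffers: ... temp read=64000 written=64000
# mean Postgres is writing temporary results to the filesystem —
# the symptom described above.
```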
Because we also needed to run the exact same queries on several different “scopes” (i.e. a single patient at a time, a single location at a time, a cohort of locations at a time), we didn’t want to duplicate code all over the place in case stuff inadvertently went out of sync…
What we ended up doing was:
Decompose our one monster query with N joins into N smaller queries with no joins at all
These N queries could then be executed with a where: x.patient_id == ^patient_id, etc to “scope” the queries as needed, and we combined the results in Elixir itself
This got past the memory usage issues in Postgres, but the larger scoped queries were still too slow to run
Because our location and cohort scoped queries only needed to be up-to-date within the last 24h, we started running these queries out of hours and caching the results in new database tables
The processes that needed to read from these cached results for our daily medical reminders, analytics, reports etc could then do super optimal, indexed reads from these cached tables.
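The decomposition step above can be sketched as follows: N small, join-free queries, each scoped by a where clause, with the combination done in Elixir. Schema and field names are illustrative assumptions, not Vetspire’s actual code:

```elixir
import Ecto.Query

# One small, join-free, index-friendly query per source table.
defp scoped(schema, patient_id) do
  from r in schema,
    where: r.patient_id == ^patient_id,
    select: %{product_id: r.product_id, date: r.date}
end

# Combine the N result sets in Elixir: latest row per product_id.
def latest_products(patient_id) do
  [Medication, Immunization, OrderItem]
  |> Enum.flat_map(&Repo.all(scoped(&1, patient_id)))
  |> Enum.group_by(& &1.product_id)
  |> Map.new(fn {product_id, rows} ->
    {product_id, Enum.max_by(rows, & &1.date, Date)}
  end)
end
```

Swapping `patient_id` for a `location_id` (or a list of locations) gives the other “scopes” without duplicating the query logic.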
We ended up re-inventing materialized views in Elixir! And as the system grew, we ended up building out a framework which let us easily define and generate these materialized views incrementally in Elixir, with a bunch of safeguards to ensure that materialization is done optimally and declaratively.
This talk aims to run you through the approaches we tried and the problems they ran into, which eventually took us to this approach, as well as to share our open source framework for generating these materialized views.
The coolest things our framework offers are:
Materialized views can be generated incrementally
They’re defined in Ecto, which means you could share code / implement materialized views in databases which don’t support them as first class features
Our system is actually a graph of several different materialized views, so our framework allows you to define a DAG of different “stages” which you can run via several engines and scopes based on the runtime characteristics your problem demands!
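To make the DAG idea concrete, here is a purely hypothetical sketch of what such a stage definition could look like. The module, macro, and option names (EctoMaterialized, stage, depends_on, engine) are invented for illustration and are not the framework’s actual API:

```elixir
defmodule Reminders.Pipeline do
  # Hypothetical DSL: each stage materializes into its own table,
  # and depends_on: edges form the DAG between stages.
  use EctoMaterialized

  stage :latest_orders, scope: [:patient_id, :location_id] do
    # join-free base queries, materialized incrementally
  end

  stage :due_dates, depends_on: [:latest_orders] do
    # computed from the previous stage's cached table
  end

  # Run this stage out of hours, since it only needs to be
  # up-to-date within the last 24h.
  stage :overdue_reminders, depends_on: [:due_dates], engine: :nightly
end
```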
We’d love to share and talk through these fun problems, and present a case for using databases to solve your data-fetching problems with more than just better indexes, joins, etc. :-)