> Taxi shifts the obligation of describing how things stitch together from Consumers to Producers. Traditionally, it falls to consumers to work out how to do this. And, it’s an expensive question to answer… it involves tracking down API specs, reading docs, and building a mental model of how things hang together.
I work on enterprise data platforms and this project sounds very much like the "data mesh" concept where responsibility for defining metadata/usability of datasets is supposed to rest with the teams producing the datasets rather than the teams consuming datasets. The same criticism probably applies:
* Even though it is expensive/slow for a consumer to figure out how to use data, it is even more expensive/slow to have producers try to continuously anticipate the needs of all possible current and future consumers, rather than focus on their system and have actual consumers figure out how to use the subset of data they need as it changes.
* Above a Dunbar-esque-sized organization you can no longer rely on empathy to encourage producers to expend extra effort on consumers' needs, so either monetary or disciplinary encouragement is needed, both of which raise costs even further.
Is there a name for this kind of paradox? When an approach will work only in a company below a certain size/complexity/age, but that kind of company probably doesn't feel the pain enough to need the approach?
> it is even more expensive/slow to have producers try to continuously anticipate the needs of all possible current and future consumers
I agree, but I don't think Taxi encourages that. (I sure hope it doesn't).
All that's really happening is defining a system-agnostic set of terms, embedding those in producer schemas, and then letting consumers use those same terms to ask for data.
Those terms can be used to automate integration, and in many (but not all) cases, alleviate the need for low-value plumbing code.
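To make that concrete, here's a rough sketch of the idea in plain Python. This is not Taxi syntax or Orbital's API; the term names (`acme.CustomerId`, `acme.EmailAddress`), the `Field`/`Source` classes, and `find_providers` are all made up for illustration. It only shows the shape of it: producers tag their own field names with shared terms, and a consumer asks by term instead of by field name.

```python
from dataclasses import dataclass

# Hypothetical shared semantic terms (illustrative only, not real Taxi syntax).
CUSTOMER_ID = "acme.CustomerId"
EMAIL = "acme.EmailAddress"


@dataclass(frozen=True)
class Field:
    name: str  # the producer's own field name
    term: str  # the shared semantic term it is tagged with


@dataclass(frozen=True)
class Source:
    name: str
    fields: tuple

    def field_for(self, term):
        """Return this source's local field name for `term`, or None."""
        for f in self.fields:
            if f.term == term:
                return f.name
        return None


# Each producer keeps its own naming and only adds the tags.
crm = Source("crm", (Field("cust_no", CUSTOMER_ID), Field("mail", EMAIL)))
billing = Source("billing", (Field("customerRef", CUSTOMER_ID),))


def find_providers(sources, term):
    """Map every source exposing `term` to its local field name."""
    found = {}
    for s in sources:
        local_name = s.field_for(term)
        if local_name is not None:
            found[s.name] = local_name
    return found


# The consumer asks by term and never needs to know "cust_no" or "customerRef".
print(find_providers([crm, billing], CUSTOMER_ID))
```

If billing later renames `customerRef`, only the producer's tag mapping changes; the consumer's request for `acme.CustomerId` stays the same.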
Semantic Metadata is a growing trend in the data ecosystem (cube.dev is doing a nice job here too). The Agile folks have had Ubiquitous Language as a first class concept since day dot.
> an approach will work only in a company below a certain size/complexity/age
I think that GraphQL federation is an example of mid-to-large companies trying to solve the federation problem, suggesting they both feel the pain, and are looking for solutions.
The goals of Taxi feel at least somewhat like 'creating a domain driven design style Ubiquitous Language to describe data and relationships' and the deliberate not-forcing-a-single-global-thing seems like it'll at least make it harder for enterprise dysfunction to bend it into having the failure modes the grandparent describes.
What happens when a consumer does not understand the producer’s domain and misuses the data? As we load metadata with semantics of their own, we just kick the can down the road. Metadata is itself semantics about data. In many regulated environments, as a producer you are responsible for downstream aggregations etc. I think the idea works great in prototyping, but similar to data mesh, ownership is poorly defined in the presence of aggregate domains and it’s more tuned to operational point-to-point interfaces.
> What happens when a consumer does not understand the producer’s domain and misuses the data?
That's a problem with or without Taxi/Orbital.
Today, it's generally left to consumers to pick the fields that look right, and hope they get it right. That's a risky approach. And, by leaving it to consumers, you run the risk on every new integration.
I think by asking producers to annotate their attributes with a strongly defined semantic contract, you reduce the risk of consumers getting it wrong.
Producers understand their data much better than consumers, so are better informed on how to map their attributes to a set of semantic contracts that consumers can leverage.
> Even though it is expensive/slow for a consumer to figure out how to use data, it is even more expensive/slow to have producers try to continuously anticipate the needs of all possible current and future consumers, rather than focus on their system and have actual consumers figure out how to use the subset of data they need as it changes.
I agree completely on the speed impact but I would go further and say this is basically impossible. I have worked on data everywhere from small to massive enterprises and the only model I have ever seen work at a large enterprise is for source systems to just produce data in a raw-ish format that comes out of their system and leave it to the people who need to consume that data to figure out how to massage it into whatever format they need.
The reasons I think it's not possible for upstream to anticipate are:
1) Analytic use cases multiply as systems do and as users start consuming data. As soon as you start using data to address problems in the org, you see more and more questions you would like to use data to address. There is no way for a producer system to foresee all the possible analytic questions people would like to answer using their data, especially as data analysis is a creative endeavour and people's imagination is pretty amazing (especially taken collectively).
2) As data sources multiply, the analytical possibilities of combining these data sources explode quadratically. There is no way for any single upstream system to anticipate how this will go.
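A quick back-of-the-envelope for point 2, as a sketch: counting only pairs of sources gives the quadratic growth, and if you also allow joins across three or more sources the count grows much faster. The function names here are just for illustration.

```python
from math import comb


def pairwise_combinations(n_sources: int) -> int:
    """Number of distinct pairs of sources: n * (n - 1) / 2."""
    return comb(n_sources, 2)


def multiway_combinations(n_sources: int) -> int:
    """Number of distinct subsets containing two or more sources."""
    # All subsets (2**n) minus the empty set (1) and the single-source sets (n).
    return 2 ** n_sources - n_sources - 1


for n in (5, 10, 50):
    print(n, pairwise_combinations(n), multiway_combinations(n))
```

Ten sources already give 45 possible pairs, and fifty give 1225, before anyone considers joining three things at once.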
What upstream systems need to do is produce all the data they can from their system and let users do what they do when combining. You can make some sensible guidelines available (eg "document what the columns mean", "keep up to date estimates of volumes to expect", "have a data sample/staging system/test instance for people to integrate against", "put your goddam timestamps into UTC wherever possible" etc) that aren't too onerous for upstream maintainers to follow if they want to be good citizens.
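On the UTC point, a minimal sketch of what "good citizen" output could look like, using only the Python standard library. The helper name `to_utc_iso` is made up, and rejecting naive timestamps is my own assumption about a sensible default, not something any particular system mandates.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(ts: datetime) -> str:
    """Convert a timezone-aware timestamp to an ISO 8601 string in UTC."""
    if ts.tzinfo is None:
        # A naive timestamp is ambiguous downstream, so refuse to guess its zone.
        raise ValueError("naive timestamp: attach a timezone before exporting")
    return ts.astimezone(timezone.utc).isoformat()


# 09:30 at UTC-5 is 14:30 in UTC.
local = datetime(2023, 3, 1, 9, 30, tzinfo=timezone(timedelta(hours=-5)))
print(to_utc_iso(local))  # 2023-03-01T14:30:00+00:00
```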
Yeah, it's a common problem with non deterministic api schemas, e.g. openapi and swagger (https://swagger.io/specification/), but Taxi looks similar in some aspects to protoforce (https://www.protoforce.io/) except that the types subset is substantially narrower.
I've had my eye on Taxi for a while, and it's neat! I agree that the problems it aims to solve are real and painful in real life.
In my experience, I'm not sure people care about schemas or schema languages — they are just implementation details best left under the hood. This is why in my own work, I started on the query end of the spectrum instead.
This is why I built Trustfall, a query engine able to query any data source: DBs, APIs, files like PDF, CSV, or JSON, or any combination of those.
I agree that querying is where the real value is at, and Trustfall looks like an elegant approach.
While Taxi is all about documenting & augmenting specs, Orbital is the query engine (which is a bit similar to Trustfall) that consumes those specs.
Orbital's goal is to allow consumers to be able to query for data, without having to be aware of the specifics of the data sets / APIs / DBs, etc they're composing together.
IMO GraphQL does a nice job on the query side of keeping consumers away from the wiring, but shifts the obligation to middleware resolvers.
How data is wired together is ultimately an implementation detail, and one that changes.
If consumers need to know about how stuff is connected, they're subject to breakages when those details change.
> GraphQL does a nice job on the query side of keeping consumers away from the wiring, but shifts the obligation to middleware resolvers.
I'm not sure that I understand how Orbital is different from tools that autogenerate GraphQL APIs on top of data sources like databases and APIs.
From the site, it reads:
> "Orbital uses your existing API specs and Database schemas - enriched with Taxi metadata to describe links. Orbital turns this into rich API and data catalog, letting you explore all your data, and how it connects."
It sounds like the same concept, except using a new language instead of GraphQL?
The schema language looks very similar to something AWS has called Smithy by the way:
Autogen tools do a nice job of creating GraphQL schemas from a single source.
We're looking more at how to compose / stitch multiple sources into a single queryable API (i.e., stitch data from a DB, some REST APIs, and a Kafka topic).
In GraphQL, you need to write / maintain resolvers that handle the stitching for you - which means as upstream data sources change, resolvers need to be maintained too.
Smithy is nice, and we'll probably add first-class support in Orbital for it at some point. Taxi was around before Smithy, so we're the OG :)
It also says "This repository will contain our open source code, along with issues, discussions and roadmap.", is there an ETA for source code release?
Will the Java application in that docker container be open sourced?
i love how graphql describes its types in its dsl, that's one of the things i'd love to see carried over in other schema formats. it looks like taxi tries to do this to some extent
The examples all look like something I can do today in any modern functional language. What if I told you that you do not have to stick with Java or C#?