These data exchange gateways often take the form of APIs, delivered over HTTP and based on RESTful patterns. APIs allow the business to define fixed protocols for outside parties to request data and be provided with a response. In this case, the consumer of the data has to code their request and response handling to the API format dictated by the provider. In many cases, the API design is unique to each provider, requiring consumers to duplicate their interface code many times with adjustments for each provider’s API definition.
Additionally, API payloads are generally requested and processed in batches. If a consumer wants to perform their own analysis of a producer’s data set, they would generally make multiple requests to create a copy of the full set of data and load it into their own data management engine. Only then could they run analysis on it or combine it with their own data to form a superset for more interesting queries.
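The batch pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `make_fake_provider`, `copy_full_dataset`, the offset-based paging scheme and the sample rows are all hypothetical, and an in-memory function stands in for the provider's HTTP endpoint so the example runs without a network.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, object]
Page = Tuple[List[Record], Optional[int]]  # (records, next offset or None)


def make_fake_provider(rows: List[Record], page_size: int) -> Callable[[int], Page]:
    """Hypothetical stand-in for a provider's paginated API endpoint."""
    def fetch_page(offset: int) -> Page:
        chunk = rows[offset:offset + page_size]
        has_more = offset + page_size < len(rows)
        return chunk, (offset + page_size if has_more else None)
    return fetch_page


def copy_full_dataset(fetch_page: Callable[[int], Page]) -> Tuple[List[Record], int]:
    """Request every page and return (local copy, number of requests made)."""
    local_copy: List[Record] = []
    requests_made = 0
    offset: Optional[int] = 0
    while offset is not None:
        records, offset = fetch_page(offset)
        requests_made += 1
        local_copy.extend(records)
    return local_copy, requests_made


# Invented sample data: 25 rows served 10 at a time.
provider_rows = [{"id": i, "value": i * 10} for i in range(25)]
fetch = make_fake_provider(provider_rows, page_size=10)
local_copy, requests_made = copy_full_dataset(fetch)
print(len(local_copy), requests_made)
```

The consumer ends up holding a second, independent list of the provider's rows, and only at that point can any analysis start.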
In many cases, an API is the ideal. Some companies exchange data through more rudimentary methods, like flat file generation and shipping over FTP or email. These require more processing overhead and are less granular. Generally, the data consumer completely wipes and refreshes their copy of the source data on a batched basis, quickly resulting in stale data. Worse, copies of the source data are floating around in consumer installations, with no ability to revoke access if the producer/consumer relationship ends.
These approaches create a lot of overhead and limit the amount of data that can be shared. They don’t handle updates well and rely on making multiple copies of the source data. While the API can provide some level of governance and granular access, it still requires developers on both ends to load and consume the payload.
These methods of data exchange limit scale and flexibility. Yet, the real promise of AI and ML is realized with increasingly large data sets. Data shared across enterprises in the same industry or across industries for pattern analysis will yield better insights. ML training requires large amounts of data to generate models and validate that they produce accurate outcomes. Different types of data sets will round out the analysis from multiple perspectives. This stacking of data sets will allow for insights to be generated at the intersection of different producers.
An example might be the creation of highly personalized auto insurance premiums. These could be generated by combining weather data, driver history, insurance claims and vehicle manufacturer specs. In this case, the data might be sourced from four different entities – the government’s weather history, a driving app, insurance companies and car manufacturers. These four entities might have large data sets in their own silos, but no easy way to share data between them.
If a modern auto insurance provider wanted to conduct this kind of advanced data processing in near real-time, they would need to aggregate the data from the four entities mentioned. To get access to the relevant data from each party, they might be able to access each provider’s APIs or, more likely, get some sort of flat file or data dump. The aggregator would then combine all the data into one large data store (their own data warehouse) and then run the desired analysis. If any of the source data changed, they would need to repeat the request/combine process, generally from scratch.
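As a rough sketch of that request/combine cycle, the Python below joins four extracts into a single local table and shows why a source change forces a rebuild. Everything here is an assumption made for illustration: the function `build_local_warehouse`, the field names and the sample values are invented, and plain dictionaries stand in for the files or API responses each party would supply.

```python
from typing import Dict, List


def build_local_warehouse(
    weather: Dict[str, int],      # region -> storm days per year
    drivers: List[dict],          # driving app export
    claims: Dict[str, int],       # driver_id -> number of claims
    vehicles: Dict[str, bool],    # vehicle model -> has automatic braking
) -> List[dict]:
    """Join the four extracts into one denormalized table (a fresh copy)."""
    rows = []
    for d in drivers:
        rows.append({
            "driver_id": d["driver_id"],
            "hard_brakes_per_100km": d["hard_brakes_per_100km"],
            "storm_days": weather[d["region"]],
            "claim_count": claims.get(d["driver_id"], 0),
            "has_auto_braking": vehicles[d["vehicle_model"]],
        })
    return rows


# Invented extracts from the four hypothetical sources.
weather = {"north": 42, "south": 7}
drivers = [
    {"driver_id": "d1", "region": "north", "vehicle_model": "sedan_a",
     "hard_brakes_per_100km": 3.5},
    {"driver_id": "d2", "region": "south", "vehicle_model": "hatch_b",
     "hard_brakes_per_100km": 0.8},
]
claims = {"d1": 2}
vehicles = {"sedan_a": True, "hatch_b": False}

warehouse = build_local_warehouse(weather, drivers, claims, vehicles)

# One source changes: the existing copy is now stale...
claims["d2"] = 1
stale_count = warehouse[1]["claim_count"]

# ...so the whole combine step has to run again.
refreshed = build_local_warehouse(weather, drivers, claims, vehicles)
print(stale_count, refreshed[1]["claim_count"])
```

The combined table is a snapshot; nothing in it updates until the aggregator fetches and joins everything again.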
A better way to perform this exercise would be to make all the data sets accessible from a single cloud data platform. Each entity might still have their data in what appears to be their own physical data warehouse, but all those data warehouses would be connected by a common virtual data management backplane. That would allow for controlled sharing without all the overhead of provisioning APIs or creating data files.
In this case, the data management platform becomes an enormous “warehouse for data warehouses”. Each individual data warehouse still maintains strict federation and access controls. But if two or more entities want to share data, it would be easy to create a shared view representing the desired combination of data sets. Similar to a standard view on a database (which, unlike a materialized view, stores no results of its own), the shared view wouldn’t create a new copy, would have access controls and could be revoked at any time. That would make sharing much easier and more scalable, enabling far-reaching data analysis queries and hence richer insights.
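The three properties named above (no new copy, access controls, revocation at any time) can be illustrated with a toy Python model. This is a sketch under stated assumptions, not how any real data platform is implemented: the `SharedView` class, the `claim_totals` query, the account name and the sample rows are all hypothetical.

```python
from typing import Callable, Dict, Iterator, List, Set

Tables = Dict[str, List[dict]]


class SharedView:
    """Toy view: keeps references to producer tables and a per-account grant list."""

    def __init__(self, sources: Tables, query: Callable[[Tables], Iterator[dict]]):
        self._sources = sources      # references to producer data, not copies
        self._query = query
        self._grants: Set[str] = set()

    def grant(self, account: str) -> None:
        self._grants.add(account)

    def revoke(self, account: str) -> None:
        self._grants.discard(account)

    def read(self, account: str) -> List[dict]:
        if account not in self._grants:
            raise PermissionError(f"{account} has no access to this view")
        # The query runs against the producers' live tables on every read.
        return list(self._query(self._sources))


def claim_totals(sources: Tables) -> Iterator[dict]:
    """Combine two producers' tables: total claim amount per driver."""
    for driver in sources["drivers"]:
        total = sum(c["amount"] for c in sources["claims"]
                    if c["driver_id"] == driver["driver_id"])
        yield {"driver_id": driver["driver_id"], "total_claims": total}


# Invented producer tables.
drivers_table = [{"driver_id": "d1"}, {"driver_id": "d2"}]
claims_table = [{"driver_id": "d1", "amount": 1200}]

view = SharedView({"drivers": drivers_table, "claims": claims_table}, claim_totals)
view.grant("insurer_a")
before = view.read("insurer_a")

# A producer adds a row; the next read sees it without any refresh step.
claims_table.append({"driver_id": "d2", "amount": 300})
after = view.read("insurer_a")

view.revoke("insurer_a")
try:
    view.read("insurer_a")
    access_after_revoke = True
except PermissionError:
    access_after_revoke = False
print(before, after, access_after_revoke)
```

Because the consumer never holds its own copy, revoking the grant is enough to end access, which is the contrast with the flat file and API approaches described earlier.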
This is the promise of data marketplaces, built as a superset of the data warehousing underneath. If a single data management provider could gain a critical mass of data producers on their platform, then the benefits would be enormous. A warehouse of data warehouses would enable the next generation of AI and ML processing by providing the superset of raw data to analyze.
Ideally, the provider of the system to manage this data sharing would support several design tenets to really make it scale. Data sharing should be set up and manageable by non-technical users, without writing code or provisioning specialized infrastructure. Access controls should support granular data definitions, be account based and revocable on demand. The superset of data should be created without making copies. That combined view should not be limited by size and should support the same analysis and machine learning workloads as the original. While the provider may prefer to have a commercial relationship with all participants, they should make basic access to shared data sets available to non-customers.
If a provider could create a system of data sharing that meets these requirements, then strong network effects would come into play. The warehouse for warehouses provider with the most participants would realize tremendous value. This is because new participants would almost be compelled to utilize that warehousing solution in order to gain access to their industry’s data sharing ecosystem. Participants outside of the leading provider’s ecosystem could certainly fall back to sharing data through APIs or other manual processes. Over time, however, that would create a disadvantage for them as potential partners weigh the cost of creating a one-off data exchange mechanism versus just activating data sharing permissions on a common data management platform.