Data Platforms for AI and ML with Richard Winter
Independent consultant Richard Winter explains what a data platform is, the role of generative AI, and how to protect data from public exposure in chatbots.
In the latest podcast episode, Richard Winter, CEO and principal consultant of WinterCorp, discussed modern data platforms for advanced analytics, artificial intelligence, and machine learning. Winter will be teaching a session on data platform strategies for AI and ML at TDWI’s Modern Data Leader’s Summit in Chicago on April 30. His career as an independent consultant has spanned more than 30 years, focused on studying, testing, evaluating, and helping customers use data platforms. [Editor’s note: Speaker quotations have been edited for length and clarity.]
To set the stage, host Andrew Miller asked Winter what a data platform is in the context of artificial intelligence. “Most people don’t think of these data platforms that way. We think of them as being for business intelligence, reporting, and dashboards — the traditional meaning of data warehousing. However, what started happening about 15 years ago is that some of the vendors began building in functions so that machine learning and certain advanced analytics could be performed inside the data platform rather than outside where it’s traditionally been done.”
Lately, Winter says, many more vendors are doing that, and they’re building in the capability for generative AI. “Data scientists have done these things on special data science workbenches or environments — outside the data platform — but data volumes have grown so large, and these technologies are used at such scale, that it’s become critically important for some customers to move processing closer to the data — inside the data platform — so it’s more efficient and scalable.” For some customers, the only practical way to get their machine learning, AI, and advanced analytics workloads done in a timely way is to run them inside the database. There are advantages beyond efficiency and scalability: working inside the data platform also streamlines moving models into production. It’s faster, easier, less error-prone, and cheaper, Winter says.
Existing ML models built outside the platform can be brought in through a feature called Bring Your Own Model (BYOM): the model is developed elsewhere and exported in an interchange format such as PMML. Data scientists may have strong feelings about using a particular tool, and BYOM lets them move the resulting model into production easily.
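To make the BYOM idea concrete, the sketch below serializes a trained linear model’s parameters to a PMML-style XML document and reads them back, as a receiving platform would. This is a deliberately simplified, hypothetical document built by hand for illustration; real PMML follows the DMG schema and is normally produced by tooling, not written manually.

```python
# Sketch of Bring Your Own Model: a model trained elsewhere is serialized
# to PMML (an XML interchange format) and imported by the data platform.
# The document structure here is simplified and hypothetical.
import xml.etree.ElementTree as ET

def export_model(weights: dict, intercept: float) -> str:
    """Serialize a linear model's parameters to a PMML-style XML string."""
    root = ET.Element("PMML", version="4.4")
    model = ET.SubElement(root, "RegressionModel", functionName="regression")
    table = ET.SubElement(model, "RegressionTable", intercept=str(intercept))
    for name, coef in weights.items():
        ET.SubElement(table, "NumericPredictor",
                      name=name, coefficient=str(coef))
    return ET.tostring(root, encoding="unicode")

def import_model(xml_text: str):
    """Reconstruct the parameters on the receiving platform's side."""
    root = ET.fromstring(xml_text)
    table = root.find(".//RegressionTable")
    intercept = float(table.get("intercept"))
    weights = {p.get("name"): float(p.get("coefficient"))
               for p in table.findall("NumericPredictor")}
    return weights, intercept

doc = export_model({"age": 0.8, "claims": -1.2}, intercept=0.5)
weights, intercept = import_model(doc)
```

Because the model travels as declarative XML rather than as code, the platform can score with it without running the data scientist’s original toolchain.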
Generative AI
What does Winter think of generative AI? “ChatGPT and chatbots in general are mostly consumer-oriented uses of generative AI. In the enterprise setting and for business-to-business applications you can ask a question of generative AI, but its answers use more than its large language model — it involves retrieving data from the enterprise data warehouse.
“An insurance claims application could be used to check who was involved in an accident to detect any insurance fraud by repetitive claimants. That question could be answered by a conventional database query, but you could also have a generative-AI app create the queries which then would be answered by retrieving data from the database.”
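The insurance example can be sketched as a small pipeline: a natural-language question becomes SQL, and the database answers it. The “model” below is a canned stub standing in for a real language model, and the table and question are hypothetical.

```python
# Toy sketch of the fraud-detection example: a generative-AI app writes
# SQL from a question, and the answer comes from the database itself.
import sqlite3

def llm_generate_sql(question: str) -> str:
    # Hypothetical stand-in for a model that writes SQL from the question.
    if "repeat" in question.lower():
        return ("SELECT claimant, COUNT(*) AS n FROM claims "
                "GROUP BY claimant HAVING n > 1")
    raise ValueError("question not understood")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claimant TEXT, accident_id INTEGER)")
conn.executemany("INSERT INTO claims VALUES (?, ?)",
                 [("alice", 1), ("bob", 2), ("alice", 3), ("alice", 4)])

sql = llm_generate_sql("Which claimants appear in repeated accidents?")
repeat_claimants = conn.execute(sql).fetchall()
```

Note that the generative model only produces the query; the factual answer still comes from conventional database retrieval.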
With generative AI, many of the use cases involve similarity search. Rather than searching for an exact match (the way most database queries are phrased), “You’re asking ‘This is the thing or the idea, and I want to know if there are any things that are similar.’ Alternatively, the user might be asking a question about a broad subject, and you’ll want to get all the records that are broadly related to a certain question.
“That similarity search is done on large amounts of data using vector indexes. Popular data warehouse platforms either have vector indexing now or are adding it,” Winter explained. These platforms are also gaining the ability to create the vectors (a process called vector embedding), to store them, and to search with them — all features being built into data platforms.
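The similarity search Winter describes can be illustrated with a brute-force cosine-similarity scan. Production platforms use vector indexes (approximate nearest-neighbor structures) instead of scanning every record, and the embeddings would come from a model rather than being written by hand as they are in this hypothetical sketch.

```python
# Minimal sketch of similarity search over vector embeddings: each record
# is stored as a vector, and a query returns the most similar records.
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Hypothetical stored embeddings keyed by record id.
records = {
    "claim-1": [0.9, 0.1, 0.0],
    "claim-2": [0.1, 0.9, 0.1],
    "claim-3": [0.8, 0.2, 0.1],
}

def similarity_search(query, k=2):
    """Return the ids of the k records most similar to the query vector."""
    ranked = sorted(records, key=lambda r: cosine(query, records[r]),
                    reverse=True)
    return ranked[:k]

top = similarity_search([1.0, 0.0, 0.0])
```

A vector index replaces the `sorted` scan with a data structure that finds near neighbors without comparing the query against every stored vector — which is why it matters at data warehouse scale.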
There is more to consider, Winter warns. “There are differences between the data platforms that become profound as the requirements become more challenging. If you have a relatively small database and routine requirements that are the same as a million other companies, then probably any popular platform will be able to satisfy your requirements. However, if you have a large data warehouse, or your requirements are in some way more complex than the typical user, if you’re in that 5% of companies that have very demanding requirements, then it’s very important to you how Platform A differs from Platform B, not only for business intelligence but for machine learning, generative AI, and these other subjects.”
Protecting Data
With so many people using their company’s data as input to a public application, possibly exposing that data outside the company, what are enterprises doing so that their data stays safe? “ChatGPT and the equivalents offered by other vendors use a very large language model. It’s truly enormous. It has trillions of documents ingested, and most companies could not afford to create their own version. What they’re doing instead is taking smaller language models that have, say, a few billion documents ingested and then training them on their private data. As long as their architecture is set up correctly, there’s no threat that their data will end up on the open internet through operating such a model.
“Another scenario is taking a public model such as ChatGPT and enriching the answers it can give by having it generate queries against private data. This is called retrieval-augmented generation. You ask a question, and the model uses its broad training to understand it. It then generates a query against corporate data to get the specific information, gets the answer, and delivers it. This has to be done in a way in which the private information is protected and not incorporated into the general model.”
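The retrieval-augmented generation flow Winter describes can be sketched in a few lines. Both model calls below are hypothetical stubs; the point of the structure is that the private store is only queried at answer time and is never folded into the model itself.

```python
# Sketch of the RAG flow: interpret the question, retrieve the specific
# facts from private data, then compose the answer from what was found.
# Private data stays in the local store; it is read, not trained on.
PRIVATE_DATA = {"policy-42": "Policy 42 covers collision up to $50,000."}

def llm_pick_record(question: str) -> str:
    # Hypothetical stand-in: a model maps the question to a query key.
    return "policy-42" if "policy 42" in question.lower() else ""

def llm_compose(question: str, context: str) -> str:
    # Hypothetical stand-in: a model writes an answer from the context.
    return f"Based on company records: {context}"

def answer(question: str) -> str:
    key = llm_pick_record(question)       # understand the question
    context = PRIVATE_DATA.get(key, "")   # retrieve from private data
    return llm_compose(question, context) # generate the final answer

reply = answer("What does policy 42 cover?")
```

Because retrieval happens per question, the enterprise data never becomes part of the model’s weights — which is the protection property the article emphasizes.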
[Editor’s note: You can replay the podcast on demand here.]