MIT researchers introduce generative AI for databases

A new tool makes it easier for database users to perform complex statistical analyses of tabular data without needing to know what’s going on behind the scenes.

GenSQL, a generative AI system for databases, could help users make predictions, detect anomalies, guess missing values, correct errors, or generate synthetic data with just a few keystrokes.

For example, if the system were used to analyze medical data from a patient who has always had high blood pressure, it could flag a blood pressure reading that is low for that particular patient but would otherwise fall within the normal range.

GenSQL automatically integrates a tabular dataset and a generative probabilistic AI model, which can account for uncertainty and adjust its decision-making based on new data.

Additionally, GenSQL can be used to produce and analyze synthetic data that mimics the real data in a database. This can be particularly useful in situations where sensitive data, such as patient medical records, cannot be shared, or where real data is scarce.

This new tool is built on SQL, a programming language for creating and manipulating databases that was introduced in the late 1970s and is used by millions of developers worldwide.

“Historically, SQL taught the business world what a computer could do. You didn’t have to write custom programs; you just asked a database questions in a high-level language. We think that as we move from just querying data to querying patterns and data, we’re going to need an analogous language that teaches people the coherent questions you can ask a computer that has a probabilistic model of the data,” says Vikash Mansinghka ’05, MEng ’09, PhD ’09, senior author of a paper introducing GenSQL and a principal investigator and leader of the Probabilistic Computing Project in MIT’s Department of Brain and Cognitive Sciences.

When the researchers compared GenSQL to popular AI-based approaches for data analysis, they found that it was not only faster but also produced more accurate results. Importantly, the probabilistic models used by GenSQL are explainable, so users can read and edit them.

“By looking at data and trying to find meaningful patterns using just a few simple statistical rules, you risk missing important interactions. You really need to capture the correlations and dependencies of variables, which can be quite complex, in a model. With GenSQL, we want to enable a wide range of users to query their data and model without having to know all the details,” adds lead author Mathieu Huot, a researcher in the Department of Brain and Cognitive Sciences and a member of the Probabilistic Computing Project.

They are joined on the paper by MIT graduate students Matin Ghavami and Alexander Lew; Cameron Freer, a research scientist; Ulrich Schaechtle and Zane Shelby of Digital Garage; Martin Rinard, a professor in MIT’s Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Feras Saad ’15, MEng ’16, PhD ’22, an assistant professor at Carnegie Mellon University. The research was recently presented at the ACM Conference on Programming Language Design and Implementation.

Combining models and databases

SQL, which stands for “structured query language,” is a programming language for storing and manipulating information in a database. In SQL, users can ask questions about data using keywords, for example to sum, filter, or group database records.
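To make the idea of keyword-based queries concrete, here is a minimal sketch of an ordinary SQL query that filters, groups, and sums records. It runs the query through Python's built-in sqlite3 module; the developers table, its columns, and all of its values are invented for illustration and do not come from the research.

```python
import sqlite3

# Hypothetical toy table; the rows are made up for illustration only.
rows = [
    ("Seattle", "Rust", 150000),
    ("Seattle", "Python", 140000),
    ("Boston", "Rust", 130000),
    ("Boston", "Python", 120000),
    ("Seattle", "Rust", 170000),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE developers (city TEXT, language TEXT, salary INTEGER)")
conn.executemany("INSERT INTO developers VALUES (?, ?, ?)", rows)

# One query that filters (WHERE), groups (GROUP BY), and sums (SUM).
query = """
    SELECT city, SUM(salary) AS total_salary
    FROM developers
    WHERE language = 'Rust'
    GROUP BY city
    ORDER BY city
"""
totals = conn.execute(query).fetchall()
conn.close()

print(totals)
```

A query like this only summarizes the records that are stored; it says nothing about rows that are missing or about what the data implies for a new individual.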

However, querying a model can provide deeper insights because models can capture what the data means for an individual. For example, a developer wondering if she’s underpaid is likely more interested in what salary data means for her individually than in trends from database records.

The researchers noted that SQL did not provide an efficient way to integrate probabilistic AI models, but at the same time, approaches that use probabilistic models to make inferences did not support complex database queries.

They created GenSQL to fill this gap, allowing someone to query both a dataset and a probabilistic model using a simple yet powerful formal programming language.

A GenSQL user uploads their data and a probabilistic model, which the system automatically integrates. They can then run queries on the data that also draw on input from the probabilistic model running behind the scenes. This not only allows for more complex queries, but can also provide more accurate answers.

For example, a query in GenSQL might be something like: “What is the probability that a developer in Seattle knows the Rust programming language?” Simply looking at a correlation between columns in a database can miss subtle dependencies. Incorporating a probabilistic model can capture more complex interactions.
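As a point of comparison, the simplest possible answer to that question is a count-based estimate of a conditional probability, sketched below in Python. This is not GenSQL syntax and not the researchers' method; the records and the conditional_probability helper are hypothetical. It shows the kind of column-level calculation that a probabilistic model is meant to improve on.

```python
# Hypothetical records of (city, knows_rust); invented for illustration only.
records = [
    ("Seattle", True), ("Seattle", True), ("Seattle", False), ("Seattle", False),
    ("Boston", True), ("Boston", False), ("Boston", False), ("Boston", False),
]


def conditional_probability(records, city):
    """Estimate P(knows Rust | city) as a simple ratio of counts."""
    in_city = [knows for c, knows in records if c == city]
    if not in_city:
        raise ValueError(f"no records for {city!r}")
    # True counts as 1 and False as 0, so sum() counts the Rust users.
    return sum(in_city) / len(in_city)


p_seattle = conditional_probability(records, "Seattle")
p_boston = conditional_probability(records, "Boston")

print(p_seattle, p_boston)
```

An estimate like this conditions on a single column and fails outright when no matching rows exist, which is one reason to query a model of the data rather than raw counts.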

The probabilistic models used by GenSQL are also auditable, so users can see which data the model uses for decision-making. In addition, these models provide calibrated uncertainty measures with each answer.

For example, with this calibrated uncertainty, if the model is asked about the predicted outcomes of different cancer treatments for a patient from a minority group that is underrepresented in the dataset, GenSQL would tell the user that it is uncertain, and how uncertain it is, rather than overconfidently recommending the wrong treatment.
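The intuition that less data should mean less confidence can be illustrated with a toy Beta-Binomial calculation in Python. This is an assumption-laden sketch, not the model GenSQL uses: the beta_posterior helper, the uniform prior, and the outcome counts are all hypothetical. It shows only that uncertainty widens for a small group; it does not demonstrate calibration itself.

```python
import math


def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Return (mean, std) of the Beta posterior over a success rate.

    Assumes a Beta(prior_a, prior_b) prior and binomial outcomes.
    """
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


# Hypothetical counts: the same observed response rate, very different sample sizes.
well_represented = beta_posterior(successes=600, failures=400)
underrepresented = beta_posterior(successes=3, failures=2)

print("well represented: mean=%.3f std=%.3f" % well_represented)
print("underrepresented: mean=%.3f std=%.3f" % underrepresented)
```

Both groups have a similar estimated response rate, but the standard deviation for the small group is much larger, which is the signal a system should surface instead of a confident recommendation.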

Faster, more accurate results

To evaluate GenSQL, the researchers compared their system to popular baseline methods that use neural networks. GenSQL was between 1.7 and 6.8 times faster than these approaches, executing most queries in milliseconds while providing more accurate results.

They also applied GenSQL in two case studies: one in which the system identified mislabeled clinical trial data and the other in which it generated accurate synthetic data that captured complex relationships in genomics.

Next, the researchers want to apply GenSQL more broadly to conduct large-scale modeling of human populations. With GenSQL, they can generate synthetic data to draw conclusions about things like health and salary, while controlling the information used in the analysis.

They also want to make GenSQL easier to use and more powerful by adding new optimizations and automations to the system. In the long term, the researchers want to allow users to perform natural language queries in GenSQL. Their goal is to eventually develop a ChatGPT-like AI expert that could chat with any database and base its answers on GenSQL queries.

This research is funded in part by the Defense Advanced Research Projects Agency (DARPA), Google, and the Siegel Family Foundation.
