Prior Labs co-founders Sauraj Gambhir, Frank Hutter and Noah Hollmann / Courtesy of Prior Labs

ChatGPT is notoriously bad with numbers. That’s not a problem unique to OpenAI’s chatbot — most AI chatbots wouldn’t get a passing grade in maths.

This means that while LLMs have revolutionised text and deep learning has revolutionised image and video generation, a whole class of data remains largely indecipherable to AI in any useful way.

“Tabular data” — information organised in tables with rows and columns, such as in an Excel spreadsheet — is the bane of many businesses and a prime example of this untouched information in enterprise.

Over the past few years, researchers have been working on a solution to this, and now one startup has developed a model that can process ten million rows of data at once.

TabPFN is Prior Labs’ enterprise-scale AI foundation model, and it has seen a 1,000x leap in dataset size this year. Previously the startup focused on smaller datasets, of the kind used in medicine and scientific research, but it has now scaled to process millions of rows of data and is being used by Fortune 500 companies including Hitachi and TD Bank.

The open source model has also been downloaded more than two million times and has seen a broad set of use cases ranging from healthcare and life sciences to financial services, energy and manufacturing.

The model is trained on hundreds of millions of synthetic datasets, which enables it to process data across industries without task-specific training.

“We pass this entire data set into the network and ask it to make predictions, and we train that over hundreds of millions of data sets to actually make good predictions,” Professor Frank Hutter tells Pathfounders.

“And what emerges is an algorithm that as you execute a forward pass actually can make predictions that work on these hundreds of millions of data sets, and then they also work on all the new use cases you throw at them.”

Traditionally, if you wanted to use data to make predictions, each use case required its own model, which, depending on the complexity of the prediction, could take hours or even months to build.

Large enterprises have thousands of these models, but now one model can handle all these predictions, tackling any use case, from travel to medicine to finance, and any statistical task, whether that’s classification, regression or time-series forecasting.
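To make the contrast concrete, here is a minimal sketch of the traditional per-task workflow the article describes, where every prediction task gets its own model trained from scratch. It uses scikit-learn and synthetic data purely for illustration; the dataset and model choice are hypothetical and are not Prior Labs’ setup.

```python
# Traditional workflow: one bespoke model per prediction task.
# Illustrative only -- synthetic data, generic classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# One synthetic tabular "task": 500 rows, 10 feature columns.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trained from scratch for this single task; a new task means a new model.
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The open-source TabPFN keeps this familiar scikit-learn-style fit/predict interface, but the point of the foundation-model approach is that a single pretrained network makes predictions for a new dataset in one forward pass, rather than requiring per-task training like the sketch above.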

Hitachi uses TabPFN for predictive maintenance across its rail network to identify track issues; UK-based biotech Oxford Cancer Analytics uses it to detect complex lung diseases earlier for better patient outcomes; and there’s a global bank using it to forecast portfolio growth and optimise liquidity planning.

Rather than just pattern spotting, it can process all kinds of data, reducing manual effort for data scientists so they can spend more time on critical insights rather than on routine tasks.

Previously, these kinds of models could only handle continuous, numerical data — like income or height — not categorical data — like marital status or occupation.

“But typical data science problems are messy and there's categorical data, outliers, missing values, all kinds of stuff,” Hutter says. The first version of the model couldn’t handle that, but the second version, published in Nature in January this year, could.

“That was the first time that it was actually really useful for data scientists, but only up to 10,000 data points. Now it's useful for basically any tabular prediction task. And not just useful, but really strong.”

Next, Prior Labs plans a context-aware model that will let users describe what is in a dataset so the model can incorporate that description into its reasoning. There will also be an option to try out different scenarios, seeing what would happen to a dataset if certain changes were made, making it easier to interact with and transform the data.
