GenAI and Data Democratization, Part 1

June 20, 2024

by Bain & Company's Expert Partner, Richard Lichtenstein

Introduction

Most of the hype around GenAI centers on using it to digest and summarize qualitative, unstructured data. And GenAI is really good at this. I’ve seen amazing results with tools that summarize call transcripts, pull themes out of online reviews, and recap long documents.

In the next two articles, I’m going to talk about how GenAI can be helpful in two other situations:

1. Structured data that contains unstructured elements, such as confusing or unclear data labels (today)

2. Using natural language to query the data (next week)

I’m pulling examples for today’s post from Bain’s Pyxis business unit. Pyxis, which Bain acquired in 2019, processes alternative data. Pyxis data is primarily B2C: credit card data, online SKU-level data, and location data, all fully anonymized to protect consumers’ identities. Bain has 55 data sets that cover 30+ countries and over a trillion data points. Pyxis data is used to support our diligence work and is sold directly to Bain’s corporate clients. (I may do another post in the future on the types of insights you can get with Pyxis data.)

So, you can probably see the challenge already. We have massive amounts of data from many different sources with product and company names in many different languages. Additionally, many of these sources rely on “customer exhaust” such as e-receipts where the full product name may not be clear.

Not that long ago, we solved this problem with a lot of manual review by humans. Then GenAI came to the rescue.

How Pyxis uses GenAI for data structuring

This is a typical e-receipt SKU. It’s sort of English. You can see abbreviations in here that a human can interpret: PWR = power, RECL = recliner, HDRST = headrest, LUMB = lumbar, and LAF = left arm facing. Put together, the SKU describes a recliner.

Pyxis uses two approaches to automate this. The first is a neural-net (NN) solution. This is classic AI trained on millions of labelled examples. NNs are an older form of AI; I was building with them 20+ years ago in grad school. One important difference between NNs and GenAI is that NNs start as a blank slate. They don’t know anything about the world. Each labelled data point helps reinforce weights along a pathway from inputs to outputs. For an NN to solve a problem like this, where there are so many types of products, you need A LOT of data.

GenAI is different because the models are pretrained on billions of examples. LLMs know what a recliner is, what a headrest is, even what LAF means. They do not need to be trained from scratch. So, giving them a relatively small number of labelled examples is all it takes to get good results.
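To make "a relatively small number of examples" concrete, here is a minimal sketch of how a few labelled SKUs might be packed into a prompt for an LLM. This is an illustration, not Pyxis's actual pipeline: the function name, the prompt wording, the categories, and the example SKU strings are all assumptions (the SKU strings are invented, reusing only the abbreviations mentioned above), and the call to the model itself is deliberately left out.

```python
from typing import List, Tuple

# Hypothetical labelled examples. These strings are made up for illustration;
# they are not real Pyxis data.
EXAMPLES: List[Tuple[str, str]] = [
    ("LAF PWR RECL W/ HDRST & LUMB", "recliner"),
    ("AVOCADO BAG 4CT", "produce"),
]


def build_few_shot_prompt(examples: List[Tuple[str, str]], sku: str) -> str:
    """Build a plain-text prompt: an instruction, a few solved examples,
    then the new SKU with the category left blank for the model to fill in."""
    lines = ["Classify each e-receipt SKU into a product category.", ""]
    for text, category in examples:
        lines.append(f"SKU: {text}")
        lines.append(f"Category: {category}")
        lines.append("")
    lines.append(f"SKU: {sku}")
    lines.append("Category:")
    return "\n".join(lines)


if __name__ == "__main__":
    # The resulting string would be sent to whichever LLM you use.
    print(build_few_shot_prompt(EXAMPLES, "PWR RECL SOFA"))
```

The point of the sketch is the contrast with the NN approach: the examples only show the model the task and output format, because it already knows what the words mean.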

This one seems easier. It’s got to be a bag of avocados, right? That is correct. The GenAI model got this correct as well. But the neural net model got it wrong. It thought this was an Avocado brand mattress because we had fed it a bunch of furniture data.

The GenAI model said that this was coffee. The NN model said it was a coffee table. The answer is…

Coffee table. So, it’s not as simple as always using one model or the other. At this point, we accept the labels where the two models agree and have humans review the points of disagreement.
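The accept-on-agreement rule can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Pyxis's implementation: the `Decision` type, the `reconcile` function, and the normalization step (lowercasing and collapsing whitespace before comparing) are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    sku: str
    label: Optional[str]  # accepted label, or None if the models disagree
    needs_review: bool    # True when a human should look at this SKU


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences are not counted as disagreement."""
    return " ".join(label.lower().split())


def reconcile(sku: str, nn_label: str, llm_label: str) -> Decision:
    """Accept the label when both models agree; otherwise flag for review."""
    nn_norm, llm_norm = normalize(nn_label), normalize(llm_label)
    if nn_norm == llm_norm:
        return Decision(sku, nn_norm, needs_review=False)
    return Decision(sku, None, needs_review=True)
```

With this rule, the coffee example above (NN says "coffee table", GenAI says "coffee") is routed to a human, while any SKU on which both models return the same category is accepted automatically.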

This is the current state, but model performance will improve over time. It’s possible that in a year, we can just rely on the GenAI model, but we’ll see.

For readers who are facing similar challenges with their own first-party (1P) data or with similar third-party (3P) data, the good news is that LLMs should significantly improve your ability to structure data. They are not yet able to achieve perfect accuracy, though, so human checkers are still needed.

Next week, I’ll show how we can use GenAI to query the data with natural language, making the clean data available to everyone.

Follow Richard Lichtenstein's Substack newsletter here.