The Data Scientist and the Proud Dad (and the Mess that is Data Science, AI, and Machine Learning in Retail)

There are a few traits shared by most dads in the world. One is telling truly bad jokes with glee (AKA “dad jokes”). Another is never missing the chance to brag about their kids. Today, I get to brag about my daughter Halie, who recently received a Ph.D. in bioinformatics from the University of Illinois. For those of you who don’t know (and why would you?), Merriam-Webster defines bioinformatics as “the collection, classification, storage, and analysis of biochemical and biological information using computers especially as applied to molecular genetics and genomics.” So basically, it’s data science for biology and genetics.

By now you’re probably thinking, “Congrats Joe, your daughter is way smarter than you. But what does this have to do with retail?”

The answer is this. Halie spent 6 years earning her Ph.D., working to understand the genetic changes that take place in the genome of foxes that were bred to be tame. It’s fascinating stuff. But the important point here is that she estimates that 75% of her time was spent collecting and cleaning data.

Let that sink in.

Four-and-a-half years collecting data and ensuring it was properly structured and of good quality.

Only 1.5 years devoted to analyzing and writing.

But it makes sense. What’s the logic of applying “science” to “data” if your data are not accurate? The hard part of data science is the data, not the science. Data integrity is key, and the experts know this.

Which means that “data science” without accurate data isn’t science at all; it’s just a waste of time and money.

I want to be clear that the need for data integrity isn’t limited to sciences like bioinformatics. In the field of retail location analytics there are many different types of data required to produce meaningful results. It’s pretty easy to find good quality sources for things like demographics, lifestyle segmentation, and consumer expenditure estimates.

But some can only be obtained internally, like customer behavior data and the characteristics of potential sites.

And some can be bought from third parties, but out-of-the-box these lack the depth and accuracy needed for a truly rigorous analytical endeavor. The best example of this is competitor data. Retail location databases are available, but in all but rare (and expensive) cases they fall short in two ways. First, they are not up-to-date or error-free enough to use unvetted, even for the data that they do contain: competitor existence and location. Second, they do not contain the depth of data to be truly useful for deep analysis (for example, competitor site characteristics and store condition).

But we in retail exist in a world of the “I need it now” mentality. And this is why data quality takes a back seat to expediency.

I suggest that every company should ask two questions:

  1. How much attention is being focused on the accuracy of the data being purchased and collected? Typically, the answer is “very little.”
  2. How important is market data in understanding the performance of current and planned stores? For almost all retailers, with very few exceptions, the answer is “crucial.”

Fully vetted, detailed store data from a third party would be prohibitively expensive, which is why no one supplies it. The companies that sell standard store-location data do their best with web scraping and other techniques, but don’t have a prayer of keeping it truly up-to-date and accurate for the entire country. And they don’t even try to supply the kind of site-specific data that real analytics demands.

Which really means that almost everyone in retail who is spending money on data science, machine learning, or artificial intelligence (AI) is throwing that money away. Because almost no one in retail gives a second thought to the quality of the data going into their analysis – at least when it comes to retail markets.

And yet retailers are spending money.

They spend it with service bureaus who claim ±10% accuracy 90% of the time. And yet those service bureaus use this third-party data out-of-the-box, with no cleaning or vetting. You can’t get ±10% accuracy 90% of the time using data that is significantly less accurate than that. So, these service bureaus are either lying or they are ignorant of the realities of data science or the quality of the data they are relying on. I’m not sure which of these possibilities is worst.
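
To make the point about accuracy concrete, here is a toy simulation in Python. It is only a sketch under made-up assumptions: the 8% sales impact per competitor, the 5% noise, the 30% chance that a competitor is missing from the database, and the function name `hit_rate` are all hypothetical and are not drawn from any real retailer or data vendor. It simply counts how often a forecast lands within ±10% of "actual" sales when the competitor list is complete versus when it is not.

```python
import random


def hit_rate(miss_rate, n_sites=10_000, seed=0):
    """Share of toy forecasts that land within ±10% of toy 'actual' sales.

    miss_rate is the assumed probability that any given real competitor
    is absent from the purchased store-location database.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sites):
        true_competitors = rng.randint(0, 5)
        # Competitors the database actually lists (some may be missing).
        listed = sum(
            1 for _ in range(true_competitors) if rng.random() >= miss_rate
        )
        # Assumed toy relationship: each competitor removes 8% of sales,
        # plus 5% random noise that no model could predict.
        actual = 100.0 * (1 - 0.08 * true_competitors) * rng.gauss(1.0, 0.05)
        # The forecast uses the same relationship but only sees listed stores.
        forecast = 100.0 * (1 - 0.08 * listed)
        if abs(forecast - actual) <= 0.10 * actual:
            hits += 1
    return hits / n_sites


clean = hit_rate(miss_rate=0.0)   # vetted competitor data
stale = hit_rate(miss_rate=0.3)   # unvetted data with missing competitors
print(f"within ±10% using vetted data:   {clean:.0%}")
print(f"within ±10% using unvetted data: {stale:.0%}")
```

In this toy setup the model itself is perfect, so any drop in the hit rate comes purely from the missing competitor records. The exact figures depend entirely on the assumed parameters and should not be read as estimates for any real market.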

Other retailers are spending top dollar to hire data scientists and expecting them to work their magic without the critical raw ingredient: good-quality, carefully-vetted data about their markets. I’m sure these data scientists put lots of time and energy into cleaning up customer data, store sales data by department, etc. But – because they are educated in data science and not in retail – they then combine it with out-of-the-box, third-party competitor location data, simply assuming that it is accurate and up-to-date. You can guess where this will go.

The saddest part is that high-quality, vetted data are available to every retailer and they’re available almost for free. Retail real estate professionals know their markets. They clean and vet data every day, just by doing their jobs. All that needs to happen is to get the data from these people’s heads into a database. Then the data science magic can happen. The technology to do this exists and it’s very affordable. Some retailers are using it and getting amazing results. The rest… we’ll see.

I won’t be shocked to see a retrospective in a few years talking about the broken promise of data science, AI, or machine learning in retail. And if this happens, first I will laugh maniacally. Then I will turn it into a dad joke and gleefully tell it to Halie.



Joe Rando