How to make event data more reliable for forecasting models

Matthew Hicks
Director of Data Assurance

It takes almost a decade of focused study to become a data scientist, and yet most companies waste hours of data scientist time every week doing the frustrating work of finding and cleaning data.

Research has found data scientists spend at least 40% of their time locating, cleansing and standardizing data, rather than building models or running analysis that would impact their business.

It’s not the kind of work any data scientist particularly wants to do. But it often can’t be avoided. Dirty data breaks models. And for demand forecasting models, duplicate or spam events create fake demand signals.

Every data scientist knows that data quality is critical for good results. Garbage in, garbage out. Event data is particularly challenging as it's dynamic, nested and denormalized.

How PredictHQ processes events to give data scientists their time back

Turning messy event data from multiple sources into reliable, trustworthy demand intelligence you can plug directly into your demand forecasting and pricing models can be a nightmare.

We know this firsthand, because we built a powerful processing engine to do it. It’s our core business. We spent years building it to handle the many issues and intricacies of global intelligent event data. We’re obsessive about data quality. Because our team is made up of data scientists, we understand how frustrating dirty data is.

Here are some of the key steps our system takes to produce high-quality data, so data scientists can get close to half of their working hours back.

Gather and standardize

One of the biggest challenges with event data is getting enough events. This is important for two reasons:

  1. Coverage to generate meaningful forecasts. This requires major events like conferences and sporting matches, as well as smaller but compounding events like community fun runs and farmers markets.

  2. The ability to compare multiple records of the same event to verify or update key details.

The importance of the second point is often underestimated. Many event records contain inaccuracies, and events are complex. Once you have a minimum viable quantity of events, the records must be standardized before you can compare them programmatically. Even for just one city, the minimum viable quantity of events is far too many to handle manually.
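Much of the standardization work is unglamorous schema mapping: pulling each source's titles, timestamps and venue details into one structure you can compare programmatically. Here is a minimal sketch of that step in Python, assuming illustrative field names ("title", "start", "venue") rather than any real provider's schema.

```python
# A minimal sketch of mapping raw event records from different feeds onto one
# comparable schema. Field names and normalization rules are illustrative
# assumptions, not any provider's actual schema.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class StandardEvent:
    title: str                 # lower-cased, whitespace-collapsed title
    start_utc: datetime        # start time normalized to UTC
    venue_name: Optional[str]  # normalized venue name, if present
    lat: Optional[float]
    lon: Optional[float]

def normalize_text(raw: str) -> str:
    """Collapse whitespace and casing so values can be compared."""
    return " ".join(raw.split()).lower()

def standardize(record: dict) -> StandardEvent:
    """Map one raw feed record onto the common schema."""
    start = datetime.fromisoformat(record["start"])  # assumes ISO 8601 timestamps
    if start.tzinfo is None:                         # treat naive times as UTC
        start = start.replace(tzinfo=timezone.utc)
    venue = record.get("venue") or {}
    return StandardEvent(
        title=normalize_text(record["title"]),
        start_utc=start.astimezone(timezone.utc),
        venue_name=normalize_text(venue["name"]) if venue.get("name") else None,
        lat=venue.get("lat"),
        lon=venue.get("lon"),
    )
```

Once every source lands in a shape like this, comparing records across feeds becomes a tractable, programmatic problem rather than a manual one.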

Removing misleading events

In addition to removing duplicate events (covered below), your team will need to weed out spam events, as well as virtual-only and add-on events.

These need to be removed because events are catalysts of demand only when they drive the movement of people. Many events don't move people, such as:

  • A large virtual conference geocoded to San Francisco because that's where the host company's office is. Its 10,000 attendees aren't going to turn up, because the event is virtual and isn't held in a physical space, so the record overstates impact at the geocoded location.

  • Spam events, such as an advertisement for a restaurant or product, also don't drive people movement. A spam event's location, venue capacity and expected attendance are meaningless, as it exists for promotion and won't draw attendees at a particular time and date.

  • Add-on events, such as parking or a VIP event for a concert. These can have the same impact as duplicate events. Imagine the concert has 40,000 attendees, with parking organized for 20,000 cars and a VIP event for 10,000 people. If you don't remove the parking and VIP records, you've almost doubled the expected attendance.

Unfortunately, the volume of events in even one major city means this weeding out of misleading events can't be done manually: on average, San Francisco has more than 3,000 events each month, and London has more than 10,000. It needs to be done programmatically with machine learning, and it carries similar complexities to removing duplicates.
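At its core this is a classification problem. The rule-based sketch below only illustrates its shape, flagging likely spam, virtual-only and add-on events with keyword heuristics; the keywords and field names are assumptions for illustration, and a production pipeline (PredictHQ's included) relies on machine learning models rather than rules like these.

```python
# A rule-based sketch of flagging misleading events. Keyword lists and field
# names are illustrative assumptions; real systems use ML classifiers.
ADD_ON_HINTS = ("parking", "vip", "shuttle", "after party")
SPAM_HINTS = ("% off", "discount", "limited offer")

def is_misleading(event: dict) -> bool:
    title = event.get("title", "").lower()
    if event.get("is_virtual"):                      # virtual-only: no physical footprint
        return True
    if any(hint in title for hint in SPAM_HINTS):    # promotional spam
        return True
    if any(hint in title for hint in ADD_ON_HINTS):  # add-on to a parent event
        return True
    return False

sample = [
    {"title": "Downtown Music Festival", "is_virtual": False},
    {"title": "Downtown Music Festival - VIP parking", "is_virtual": False},
    {"title": "50% off dinner - limited offer", "is_virtual": False},
]
kept = [e for e in sample if not is_misleading(e)]   # keeps only the festival itself
```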

PredictHQ takes care of this for our customers' data scientists so they don't need to worry about misleading events and can get straight into forecasting demand and building new models. Our spam event rate is 0%.

Identifying and deleting duplicates

Once you have a critical mass of records standardized and cleaned, you will need to remove duplicate events. Duplicates break your models as they are effectively fake demand signals.

De-duping can seem like a straightforward challenge at first: just write a script to auto-delete everything with the same name, right? Except that name variation is significant. So you will need to write another script that identifies events taking place at the same venue. But unfortunately, venue names also vary widely and many geocodes on events are wrong. Dates? Times? Same issue – event data is notoriously messy.

Compare these three duplicate records – each describes the same event and comes from a well-known event API.

[Image: three duplicate records of the same event from different event APIs]

PredictHQ has more than a hundred different systems and models running to identify and automatically remove duplicates, and all events are available through a single event API.
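As a concrete illustration, a pairwise check might combine fuzzy title similarity with venue and start-time proximity, reusing the StandardEvent schema from the earlier sketch. The thresholds below are arbitrary assumptions, and this is not how PredictHQ's de-duplication actually works; it is only meant to show why the problem needs layered signals rather than exact matching.

```python
# A minimal sketch of pairwise duplicate detection. Thresholds are arbitrary
# assumptions; production de-duplication layers many more signals than this.
from datetime import timedelta
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_duplicates(e1: StandardEvent, e2: StandardEvent) -> bool:
    title_score = similarity(e1.title, e2.title)
    venue_score = similarity(e1.venue_name or "", e2.venue_name or "")
    close_in_time = abs(e1.start_utc - e2.start_utc) < timedelta(hours=12)
    # Require near-identical timing, a strong match on title or venue,
    # and a reasonable match across both, since names alone are unreliable.
    return (close_in_time
            and (title_score > 0.85 or venue_score > 0.85)
            and (title_score + venue_score) / 2 > 0.6)
```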

Verifying and enriching

You've aggregated many events, cleaned and standardized them, and removed all the duplicates, spam, add-ons, virtual-only and other misleading events. A lot of the verification work is done, but you still need to confirm an event is actually going to take place as the event record you've found says it will.

There are multiple factors to verify for an event, and each requires its own systems and models. These include:

  • Whether the event's venue actually exists

  • If the location’s address is accurate

  • If the event’s location matches its geocoding

  • If the venue capacity and the available tickets/attendance statistics match

  • If the event is a stand-alone and not part of a bigger event (which can cause double counting)

And those are only the most obvious elements that require verification. PredictHQ's systems update or add geocoding on about 50% of events, and delete at least one in five events each month for a range of issues.
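Two of these checks are easy to sketch: whether an event's geocode sits near its stated venue, and whether ticket availability is plausible given venue capacity. The 1 km tolerance and field names below are illustrative assumptions.

```python
# A minimal sketch of two verification checks. The tolerance and field names
# are illustrative assumptions.
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def geocode_matches_venue(event: dict, venue: dict, tolerance_km: float = 1.0) -> bool:
    """Does the event's geocode land within tolerance of its venue?"""
    return haversine_km(event["lat"], event["lon"], venue["lat"], venue["lon"]) <= tolerance_km

def attendance_is_plausible(event: dict, venue: dict) -> bool:
    """Tickets on sale should not exceed what the venue can hold."""
    return event.get("tickets_available", 0) <= venue.get("capacity", float("inf"))
```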

Ranking

Once you have your cleaned, de-duped and verified data, you'll still be left with thousands of events per major city each month (more than 10,000 for a city like London) if you are targeting meaningful coverage.

You’ll need a way to make sense of which events are worth your team’s attention, and which ones can safely be ignored.

That's why PredictHQ's ranking technology is so important: it helps you find the signal in the noise. Without it, there is no way to reliably know which events your team should update their forecasts and plans for. It's not just large events that matter; smaller events can combine to create perfect storms of demand with outsized impact.
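As a toy illustration of what ranking buys you, the sketch below scores events by expected attendance and boosts days where several events cluster together, so compounding smaller events surface alongside the big ones. The weights are arbitrary assumptions and bear no resemblance to PredictHQ's actual ranking models.

```python
# A toy ranking sketch: attendance-weighted scores with a boost for days where
# events cluster. Weights are arbitrary assumptions, purely for illustration.
from collections import defaultdict
from datetime import date

def impact_scores(events: list[dict]) -> dict[str, float]:
    by_day: dict[date, list[dict]] = defaultdict(list)
    for event in events:
        by_day[event["start_date"]].append(event)

    scores: dict[str, float] = {}
    for day_events in by_day.values():
        # Several events on the same day compound, so boost the whole cluster.
        cluster_boost = 1.0 + 0.1 * (len(day_events) - 1)
        for event in day_events:
            scores[event["id"]] = event.get("expected_attendance", 0) * cluster_boost
    return scores
```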

Give your data team half of their working week back

All of the above steps are time-consuming and costly. We know, because we've built all of them, and more, into our event processing pipeline. We also know because some of our largest customers first tried to build their own event-driven demand intelligence engines, losing months of senior data scientist and developer time and millions of dollars, only to find the problem constantly becoming more complex.

Intelligent event data is our core business. We look after it for you so you can take the high-quality data and create models that directly enhance your company's services and results.

Let your data scientists do their best work rather than leave them stuck cleaning data. Especially in today's hyper-competitive environment for recruiting and retaining data scientists, you want to enable them to do the exciting and innovative work they became data scientists to do.