How PredictHQ Identifies and Fixes Event Data at Scale

Matthew Hicks
Director of Data Assurance

Any developer or data scientist tasked with finding event data and making it useful for demand forecasting learns a painful lesson quickly: most event data is messy and remarkably unreliable.

Yet events are major catalysts of demand and need to be tracked. We know the pain of trying to wrangle event data into a usable format because it’s the reason we built PredictHQ to do it for you. We've created a powerful data processing pipeline that homes in on three elements: data quality, data enrichment, and demand prediction.

We often talk about how we aggregate almost half a million events per month from hundreds of sources. We also talk about enriching, verifying and ranking millions of events every month.

But we have not yet shared much about the huge volume of work our systems do in the initial aggregation and verification stage. This is where the depth and diversity of our data sources, together with our specialist focus, is most valuable, because it lets us systematically fix most issues directly.

How we systematically verify and fix event data at scale

There are several system groups that are central to this work but the most fundamental is our entities system.

The PredictHQ entities systems are the models and intelligence that enable a detailed understanding of venues, performers and the other entities that make up events. For example, Beyoncé is going to sell more tickets than a small punk band, partly because she will be using larger performance spaces but also because of her popularity. Our understanding of both the performers and venues means our system can classify events into very specific groups and auto-identify potential issues.

Below is our entity screen for San Francisco’s Oracle Park. This is a major venue, so many might assume matching events to it is easy. But few things are easy when it comes to reliable event data.


As you can see, more than 29 entity records make up our venue entity for Oracle Park. This includes its previous names such as AT&T Park, Pacific Bell Park (and Pac Bell Park). This depth and diversity of data enables our system to verify details such as address, latitude and longitude and allows us to locate all events in relation to a venue. You can also see this is our 112th version of this venue. Versions are created as the venue is updated with more information and better metadata. We want the best set of information on that particular venue, and that takes complex data science and constant iteration.
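To make the idea of one venue entity built from many records concrete, here is a minimal Python sketch of alias-based matching. It is an illustration only, not PredictHQ's implementation: the `VENUE_ALIASES` table, the `normalize` rules and the `resolve_venue` function are all hypothetical, and a real entity system would also compare address, latitude and longitude.

```python
import re
from typing import Optional

# Hypothetical alias table: canonical venue -> names mentioned in the article.
VENUE_ALIASES = {
    "Oracle Park": ["Oracle Park", "AT&T Park", "Pacific Bell Park", "Pac Bell Park"],
}


def normalize(name: str) -> str:
    """Lowercase, spell out '&', and collapse punctuation and whitespace."""
    name = name.lower().replace("&", " and ")
    return re.sub(r"[^a-z0-9]+", " ", name).strip()


# Reverse index from each normalized alias to its canonical venue name.
ALIAS_INDEX = {
    normalize(alias): canonical
    for canonical, aliases in VENUE_ALIASES.items()
    for alias in aliases
}


def resolve_venue(raw_name: str) -> Optional[str]:
    """Return the canonical venue for a provider-supplied name, or None if unknown."""
    return ALIAS_INDEX.get(normalize(raw_name))
```

With this sketch, a provider record that still says "AT&T Park" resolves to the same venue as one that says "Oracle Park", while an unknown name returns `None` and can be routed for further checks.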

The Issues Queue

The second key system group for correcting events at scale is our issues queue. It is a series of models that trigger alerts if there is an unusual detail to an event. We have more than 100 different kinds of issue alerts. In most cases, PredictHQ’s systems fix the issue directly. In some cases, our data team is alerted and assesses the event personally.

Our processes and data assurance team constantly check our data for issues. As we process millions of events, we are always creating programmatic solutions that will address the root cause of any issue.

We also have a powerful internal toolset for managing event and entity data, where we can fix individual issues or make bulk changes. A series of monitoring systems and dashboards alert us to any data issues, which are swiftly corrected. Every day our team invests in data quality, making our data better, cleaner and richer.

Here are some of the key issue alert types.

Attendance exceeding venue capacity

This is one of the most straightforward alerts to understand. It is triggered when an event’s predicted attendance is substantially higher than its venue’s capacity. For example, a 20,000-person concert set to take place in a venue that only seats 10,000 would trigger it.

There are several reasons this can occur, which our systems sort through and identify. These include:

  • A data provider has made a typo that is swiftly identified by checking other records of the same event.

  • It is the same event repeated, but the event record has described it as one event, such as a music concert that is on three nights in a row.

  • It is a multi-day or multi-venue event that is a series of events rather than one stand-alone event.

These errors are auto-interrogated and fixed by PredictHQ’s systems.
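The check described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not PredictHQ's actual logic: the `Event` fields, the 20% `CAPACITY_TOLERANCE` and the cause labels are hypothetical, chosen only to show how an over-capacity alert might be separated into "probably a repeated or multi-day event" and "needs cross-checking against other records".

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tolerance: alert only when attendance exceeds capacity by more than 20%.
CAPACITY_TOLERANCE = 1.2


@dataclass
class Event:
    title: str
    predicted_attendance: int
    venue_capacity: int
    sessions: int = 1  # number of separate performances the record appears to cover


def exceeds_capacity(event: Event) -> bool:
    """True when predicted attendance is substantially above venue capacity."""
    return event.predicted_attendance > event.venue_capacity * CAPACITY_TOLERANCE


def likely_cause(event: Event) -> Optional[str]:
    """Return a hypothetical cause label for an over-capacity event, or None if no alert."""
    if not exceeds_capacity(event):
        return None
    per_session = event.predicted_attendance / event.sessions
    if event.sessions > 1 and per_session <= event.venue_capacity * CAPACITY_TOLERANCE:
        # Attendance is plausible once split across the repeated sessions.
        return "repeated_or_multi_day"
    # Otherwise compare against other providers' records of the same event.
    return "needs_cross_check"
```

For the example in the text, `likely_cause(Event("Concert", 20000, 10000))` returns `"needs_cross_check"`, while a 27,000-attendance record covering three nights at the same venue returns `"repeated_or_multi_day"`.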


Key event details change

One of the most common incorrect assumptions about event data is that it is reasonably static. But it’s not. More than 80% of events change details after our systems first identify them. We have created systems to track this and ensure changes are accurate.

Last month, 2,219 events were postponed or cancelled. Companies need to know this as quickly as possible so they aren’t building strategies around events that won’t take place. Once events have been identified as postponed or cancelled, they are updated in our API within a matter of minutes.

When your system works with 20 million events and more than 2 billion data points, there is ample opportunity to develop the system’s understanding of different events.

We use extensive machine learning models and natural language processing to identify norms for events. So robust are these systems that PredictHQ’s knowledge graph understands niche details we didn’t set out to discover but are very glad we did.

This enables us to identify many elements, including if events seem unusually long, unusually early or late in the day or another category anomaly.

For example, a four-day-long concert would generate an alert, and so would a month-long festival.
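As a rough illustration of a duration alert, the Python sketch below swaps the learned norms described above for a plain lookup table. Everything in it is an assumption made for the example: the category names, the `MAX_TYPICAL_DURATION` limits and the `duration_alert` function are hypothetical, and real norms would come from models trained on historical events rather than hand-set thresholds.

```python
from datetime import datetime, timedelta

# Hypothetical per-category norms; a real system would learn these from data.
MAX_TYPICAL_DURATION = {
    "concerts": timedelta(hours=12),
    "festivals": timedelta(days=14),
}


def duration_alert(category: str, start: datetime, end: datetime) -> bool:
    """True when an event runs longer than the typical maximum for its category."""
    limit = MAX_TYPICAL_DURATION.get(category)
    if limit is None:
        # No norm known for this category, so this check raises no alert.
        return False
    return (end - start) > limit


# Usage: a four-day concert and a month-long festival both exceed their norms.
four_day_concert = duration_alert(
    "concerts", datetime(2024, 6, 1, 19, 0), datetime(2024, 6, 5, 19, 0)
)
month_long_festival = duration_alert(
    "festivals", datetime(2024, 7, 1, 10, 0), datetime(2024, 7, 31, 22, 0)
)
```

The same pattern would extend to other anomalies mentioned above, such as unusually early or late start times, by adding a norm per category for each attribute.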

Once these alerts are triggered, our systems can sort out most of the differences based on the depth and diversity of our providers, and our internal ranking of how accurate each provider has been previously. If there is a difference the systems cannot resolve, someone on our data assurance team will check it manually.
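One simple way to picture this kind of resolution is an accuracy-weighted vote, sketched below in Python. This is an assumed technique used for illustration, not a description of PredictHQ's ranking: the `resolve_field` function, the default weight of 0.5 for unknown providers and the `min_margin` threshold are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def resolve_field(
    reports: List[Tuple[str, str]],
    provider_accuracy: Dict[str, float],
    min_margin: float = 0.2,
) -> Optional[str]:
    """Pick the value backed by the most historically accurate providers.

    reports is a list of (provider, reported_value) pairs. Returns the winning
    value, or None when the vote is too close and should go to manual review.
    """
    scores: Dict[str, float] = defaultdict(float)
    for provider, value in reports:
        # Unknown providers get a neutral, hypothetical weight of 0.5.
        scores[value] += provider_accuracy.get(provider, 0.5)

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]

    # Require the winner to lead by a share of the total weight.
    margin = (ranked[0][1] - ranked[1][1]) / sum(scores.values())
    return ranked[0][0] if margin >= min_margin else None
```

In this sketch, two reliable providers agreeing on a date outvote a single less reliable one, while two similarly reliable providers that disagree produce `None`, which stands in for handing the event to the data assurance team.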

The data assurance team also manually checks the expected attendance of all events above a PredictHQ rank of 80 to ensure these major demand catalysts are accurate.