Get to Know our "duplicate_of_id" Field

Published on March 14, 2019
Yen Lim
Former Chief Product Officer

Introducing the “duplicate_of_id” field

The PredictHQ API is the result of an extensive 14-stage event processing pipeline designed to ensure our customers have the best event-related data in the world. As a result, there are lots of fields and parameters to explore within our API, and we love it when developers and data scientists ask us about them.

All of these are detailed in our Developer Documentation, but we occasionally like to highlight specific fields or parameters. Today, we want to explore our duplicate_of_id field, which can be found on deleted events and directs users to the active version of that event.

We draw in events from hundreds of sources and every single one is cleansed, enriched and verified. This process means many events are deleted before they enter our API. We delete more than 200,000 events each month, for a range of reasons.

Why? An event might be spam, such as an advertisement, or just flat-out fake. It could be a virtual-only event, such as a webinar. It could even just be a listing for parking or VIP ticket sales for a concert. We delete these because our customers use PredictHQ to forecast demand: a VIP ticket sales listing isn’t the concert itself, which we would already have in our system.

Why this field matters

Another common reason for deletion is duplication during aggregation. When we pull in data from so many sources, we end up with multiple versions of the same event. Given the scale we’re working at, we know which providers are more reliable, so we end up deleting up to 60% of events from some providers to ensure we have the best listing details.

There are two kinds of deleted events that users can see in the API. The first is an event that was verified but has since been cancelled. The second is an event that was identified as a duplicate of an active event; in such cases we provide the duplicate_of_id.

It’s an important field to be aware of because it tells you that a deleted event is a duplicate of an active one, so you can make sure your analysis and models for demand forecasting, dynamic pricing, workforce optimization, and many other use cases are based on the active event rather than the deleted duplicate.
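To make this concrete, here is a minimal Python sketch of how you might use duplicate_of_id once you already have event records in hand. It is an illustration under stated assumptions, not our official client code: each record is assumed to be a plain dict with an "id" key and, for deleted duplicates, a "duplicate_of_id" key. The function names and the example IDs are hypothetical, and following a chain of links is a defensive choice rather than documented API behavior.

```python
from typing import Dict, Iterable, List, Set


def resolve_active_id(event_id: str, duplicate_of: Dict[str, str]) -> str:
    """Follow duplicate_of_id links from event_id until an ID with no link is reached."""
    visited: Set[str] = set()
    current = event_id
    # The visited set guards against an accidental cycle in the input data.
    while current in duplicate_of and current not in visited:
        visited.add(current)
        current = duplicate_of[current]
    return current


def canonical_event_ids(events: Iterable[dict]) -> List[str]:
    """Return each canonical event ID once, in first-seen order.

    Assumes every record has an "id" key; "duplicate_of_id" is optional.
    """
    records = list(events)
    # Map each deleted duplicate to the event it points at.
    duplicate_of = {
        e["id"]: e["duplicate_of_id"]
        for e in records
        if e.get("duplicate_of_id")
    }
    ordered: List[str] = []
    seen: Set[str] = set()
    for e in records:
        canonical = resolve_active_id(e["id"], duplicate_of)
        if canonical not in seen:
            seen.add(canonical)
            ordered.append(canonical)
    return ordered


# Hypothetical records: "evt_a" was deleted as a duplicate of "evt_b".
sample_events = [
    {"id": "evt_a", "duplicate_of_id": "evt_b"},
    {"id": "evt_b"},
    {"id": "evt_c"},
]
print(canonical_event_ids(sample_events))
```

With the sample records above, "evt_a" collapses into "evt_b", so a model fed from this list would count that event once rather than twice. The sketch only handles duplicates; cancelled events carry no duplicate_of_id and would need to be filtered out separately.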