How Aviation Rank works

Dr. Xuxu Wang
Chief Data Officer

Aviation Rank™ is a numeric value that indicates how much an event will impact flight bookings. Its core technology is a set of advanced machine learning models that we have carefully built, trained and refined over the last year to pinpoint this predicted impact. This enables airlines to tailor their demand forecasting.

It has been a very exciting project to lead as we created this world-first technology. But as with most truly innovative technology, people have a lot of questions about how it works and how we know it works.

We have been fortunate to test the product extensively with several airlines. But before that, let me talk you through how we created Aviation Rank™.

The underlying data set

To begin, we created a training set out of PredictHQ’s historical data and began to define the initial function our machine learning models would have to perform.

One of the exciting things for the data scientists working with our data (both customers and my team here) is having so much high-quality data to work with: more than 2 billion data points from 20 million events across 30,000 cities.

This includes many years of verified event listings – we are the only event API that doesn’t delete its records. And it’s this historical data that formed the basis of our training set for Aviation Rank™.

Using our event data alongside airline network data, we were able to decompose the time series of historical airline network data and detect the unusual spikes, then build the event features we needed to begin developing our prediction models. This took months of research and development work, hypothesizing and reworking our assumptions. But as we began to close in on more and more reliable results, we knew we were creating a new way of forecasting future demand.
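To make the idea of decomposing a booking series and flagging unusual spikes concrete, here is a minimal sketch. It is not the method we use in production: the weekly period, the 3.5 threshold, the function name `detect_spikes` and the synthetic `bookings` series are all assumptions chosen for illustration. It removes a per-weekday median baseline and flags days whose residual is an outlier by a robust z-score.

```python
from statistics import median


def detect_spikes(series, period=7, threshold=3.5):
    """Return indices whose value sits far above the seasonal baseline.

    Illustrative only: `period` and `threshold` are assumed values.
    """
    if len(series) < period:
        raise ValueError("need at least one full period of data")

    # Seasonal baseline: the median of all observations that share the
    # same position in the cycle (e.g. every Monday).
    phase_median = [median(series[p::period]) for p in range(period)]
    residuals = [x - phase_median[i % period] for i, x in enumerate(series)]

    # Robust scale estimate: median absolute deviation (MAD), which a
    # handful of spikes cannot inflate the way a standard deviation can.
    center = median(residuals)
    mad = median([abs(r - center) for r in residuals])
    if mad == 0:
        return []  # no variation left, so nothing can be called a spike

    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [
        i
        for i, r in enumerate(residuals)
        if 0.6745 * (r - center) / mad > threshold
    ]


# Hypothetical data: eight weeks of a weekly pattern with small
# deterministic noise, plus one injected spike on day 30.
pattern = [100, 110, 120, 130, 150, 180, 160]
bookings = [pattern[i % 7] + (3 * i) % 5 - 2 for i in range(56)]
bookings[30] += 200

print(detect_spikes(bookings))  # [30]
```

Medians are used for both the baseline and the scale so that the spikes being searched for do not distort the estimate of what normal looks like.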

The airline data we used

Every airline has its own historical demand data, which can be useful for demand forecasting. The limitations of relying only on your airline’s historical data are significant, though:

Identifying the why behind demand spikes is hard. Usually it’s events, but you will need to identify the impact of seasonality, marketing promotions and other factors.

Correlating demand spikes to events is difficult (it’s our core business so trust us, we know), especially when hundreds of thousands of events recur in different locations (cities, even countries) each time.

Historical data reinforces what teams already know and doesn’t necessarily reveal what your competitors are winning at. For example, if one airline knows about a large event that takes place each year and has built marketing strategies or event partnerships, they may be enjoying a massive demand spike that barely registers on their competitors’ booking rates.

So we focused on network data, and were able to build a model that turned indirect bookings into a reliable, leading indicator of demand overall. Getting access to network data is not necessarily difficult; the biggest challenge was decomposing the data, identifying the anomalous historical spikes and attributing them reliably every time.

PredictHQ’s knowledge graph

Our capacity to do this came down to the entities systems within PredictHQ, our advanced NLP (natural language processing) work and our machine learning models. These are all fundamental and proprietary elements of our knowledge graph, so I cannot go into them in as much detail as I would like, but I can broadly outline why they were so powerful.

Without our entities systems (which analyze, label and provide additional intelligence to each event) we would not have been able to identify why some events appear very similar (same time, location and PredictHQ Rank™ for example) and yet have very different Aviation Rank™ scores.

Another important factor is our entities and NLP work that auto-sorts events into categories (i.e., is this a conference, an expo or a concert?) and then into the next layer: which industry is this conference in, what kind of music is featured at this concert, or what grade or level is this sporting game?

These may not seem important details at first. But when you are building machine learning models that are accurately predicting impact on bookings, these smaller details become very important to identify so you can see patterns you did not imagine were significant. We began to learn what kind of concerts had international vs domestic pull, and what kind of expos would summon people from all over the world. All of this directly informs Aviation Rank™, and it will keep getting smarter with time.

Testing methodology: How did we ensure Aviation Rank™ was accurate?

In its initial development, we used low-pass filtering by fast Fourier transform (FFT) for time series modeling and supervised learning algorithms extensively.
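For readers unfamiliar with the technique, low-pass filtering by FFT means transforming a series into its frequency components, discarding the fast-changing ones and transforming back, which leaves the slow trend. The sketch below shows the idea under stated assumptions and is not our production code: the function names, the `keep` cutoff and the synthetic signal are illustrative, and the small pure-Python radix-2 FFT only accepts power-of-two lengths (in practice a library routine such as NumPy's `numpy.fft` would normally be used).

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; length must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(spectrum):
    """Inverse FFT, computed by conjugating before and after a forward FFT."""
    n = len(spectrum)
    transformed = fft([z.conjugate() for z in spectrum])
    return [z.conjugate() / n for z in transformed]


def low_pass(series, keep):
    """Zero every component above `keep` cycles per window, then invert."""
    n = len(series)
    if n == 0 or n & (n - 1):
        raise ValueError("series length must be a power of two")
    spectrum = fft(series)
    for k in range(n):
        # For real input, bins k and n - k carry the same frequency.
        if min(k, n - k) > keep:
            spectrum[k] = 0j
    return [z.real for z in ifft(spectrum)]


# Hypothetical signal: a constant level, a slow cycle (2 per window)
# and fast noise-like ripple (20 per window).
n = 64
signal = [
    100
    + 20 * math.sin(2 * math.pi * 2 * t / n)
    + 5 * math.sin(2 * math.pi * 20 * t / n)
    for t in range(n)
]
smoothed = low_pass(signal, keep=4)  # the fast ripple is removed
```

With `keep=4`, the level and the slow cycle survive while the 20-cycle ripple is removed. Choosing the cutoff is a modeling decision that depends on the data.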

As we began to arrive at an acceptable accuracy rate, we also began to build a deep graph model to decompose event attributes, and to develop ensemble models for a prediction pipeline consisting of a binary classifier, random forest regression and advanced time series anomaly detection algorithms.

In many ways, the final few months have been the most exciting, as we made breakthrough changes to the models that enabled us to get each category’s rank (e.g. sports or conferences) well above the 90% accuracy mark.

Like most data science, this is phase one of Aviation Rank™. We have exciting plans about how we will continue to add to it. It is only one of the very first specialized demand intelligence products we will be creating.