Improving Predictive Accuracy

Practice Area

  • Data Science

Business Impact

  • 20-25% improvement in predictive accuracy


Challenges

  • Limited data source utilization
  • Inadequate predictive model

Technology

  • Python


Allergic rhinitis (AR) affects almost one in every ten Americans. Many of them rely on mobile applications to make informed decisions about potential exposure to outdoor allergens. Such applications deliver value to the customer by aiding their assessment of exposure risk and thus reducing their overall symptom burden, while also serving as an avenue for manufacturers of anti-allergic preparations to learn more about their target customer demographic, in particular their experience with AR.

Our client, a Fortune 50 healthcare conglomerate, offers an application to provide predictions about symptom severity, and wanted to improve its predictive model.

The application relied on limited data and an unmaintained model based on a simple regression algorithm to make predictions. As a result, there was considerable room for improvement in its predictive accuracy.

There were three key challenges involved in enabling the application to deliver more value for users. First, the metric for measuring predictive accuracy was inadequate: it reflected neither the class imbalance in the data (the severity classes were not evenly distributed) nor the ordering of symptom severity. Second, improving predictive accuracy would require adding and preparing new input variables. Third, the client had no documentation or in-house expertise for the legacy model, which meant that Starschema would first have to reverse-engineer the model and then improve it using more effective, cutting-edge predictive algorithms.

The Starschema team identified the most appropriate metric to measure model performance and replaced the legacy metric with it. The new, custom-adjusted metric reflects both relative class imbalances and the ordering of symptom severity, and it also served as a baseline for evaluating the effectiveness of the developments that followed.
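The case study does not name the metric that was chosen. As an illustration only, one metric with both properties is the macro-averaged mean absolute error: the absolute error respects the ordering of severity levels, and averaging per class keeps rare classes from being drowned out by common ones. The sketch below assumes hypothetical integer severity levels from 0 (none) to 3 (severe).

```python
from collections import defaultdict


def macro_averaged_mae(y_true, y_pred):
    """Mean absolute error computed per true class, then averaged over classes."""
    errors_by_class = defaultdict(list)
    for actual, predicted in zip(y_true, y_pred):
        # The absolute distance respects ordering: 0 vs 3 costs more than 0 vs 1.
        errors_by_class[actual].append(abs(actual - predicted))
    per_class_mae = [sum(errs) / len(errs) for errs in errors_by_class.values()]
    # Every class contributes equally, however rare it is.
    return sum(per_class_mae) / len(per_class_mae)


# Hypothetical, imbalanced labels: most days have no symptoms.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3]
always_none = [0] * len(y_true)  # a trivial model that always predicts "none"

plain_mae = sum(abs(a - p) for a, p in zip(y_true, always_none)) / len(y_true)
balanced_mae = macro_averaged_mae(y_true, always_none)

print(f"plain MAE: {plain_mae:.2f}, macro-averaged MAE: {balanced_mae:.2f}")
```

On this toy data the trivial model looks acceptable under the plain error but is penalized heavily once each severity class counts equally, which is the kind of behavior an imbalance-aware metric is meant to expose.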

Updating the data sources entailed two main tasks. The first involved making better use of existing sources. The team found that combining strongly correlated symptoms reduced statistical noise and decreased the model’s complexity, improving its overall robustness. The second was expanding the range of inputs, which had previously comprised only key symptoms and pollen counts, with weather and patient treatment information to give the application a more comprehensive foundation for predictions.

In addition, identifying symptoms that typically co-occur made it possible to classify the symptoms a user is experiencing as typically allergic, atypical or mixed. This way, the application can provide higher-quality feedback while requiring less manual input from the user.
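The merging of strongly correlated symptoms described above can be sketched as follows. This is a minimal illustration, not the team's actual procedure: the column names, the sample values and the 0.8 correlation threshold are all assumptions, and averaging is only one possible way to combine a group of columns.

```python
import pandas as pd


def merge_correlated_columns(df, threshold=0.8):
    """Average groups of columns whose correlation with an anchor column
    is at least `threshold` (a hypothetical cut-off)."""
    corr = df.corr()
    remaining = list(df.columns)
    merged = {}
    while remaining:
        anchor = remaining.pop(0)
        # Collect every remaining column that moves closely with the anchor.
        group = [anchor] + [c for c in remaining if corr.loc[anchor, c] >= threshold]
        remaining = [c for c in remaining if c not in group]
        merged["+".join(group)] = df[group].mean(axis=1)
    return pd.DataFrame(merged)


# Hypothetical daily symptom scores on a 0-3 scale.
symptoms = pd.DataFrame({
    "sneezing":   [0, 1, 2, 3, 2, 1],
    "runny_nose": [0, 1, 2, 3, 3, 1],
    "itchy_eyes": [2, 0, 1, 0, 2, 1],
})

reduced = merge_correlated_columns(symptoms)
print(reduced.columns.tolist())
```

Here the two nasal symptoms collapse into a single averaged column while the weakly related one is kept separate, so the model sees fewer, less noisy inputs.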

The most important step was feature engineering: deriving secondary variables from the raw input variables. For the client’s application, this increased predictive accuracy because it enabled the model to consider trends over time in addition to point-in-time values. Dimensionality reduction and feature selection helped further improve predictive accuracy by simplifying the model.
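As a rough illustration of trend features, the sketch below derives a lagged value, a day-over-day change and a rolling mean from a single daily series. The column names, dates, values and the three-day window are hypothetical; the case study does not list the features that were actually engineered.

```python
import pandas as pd

# Hypothetical daily pollen counts.
daily = pd.DataFrame(
    {"pollen_count": [10, 20, 40, 80, 60, 30]},
    index=pd.date_range("2024-04-01", periods=6, freq="D"),
)

features = daily.copy()
# Yesterday's value: lets the model compare today with the recent past.
features["pollen_lag_1d"] = daily["pollen_count"].shift(1)
# Day-over-day change: positive while counts rise, negative while they fall.
features["pollen_change_1d"] = daily["pollen_count"].diff()
# Three-day rolling mean: smooths out single-day spikes.
features["pollen_mean_3d"] = daily["pollen_count"].rolling(window=3).mean()

# The first rows lack enough history for the derived columns, so drop them.
features = features.dropna()
print(features)
```

Each row now carries both the point value and a short summary of how it got there, which is what allows a model without any built-in notion of time to pick up on trends.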

The Starschema team then rebuilt the underlying machine learning model from scratch, basing it on a Gradient Boosting Regressor algorithm, and changed the programming language from Java to Python.
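The case study does not say which library or settings were used. Assuming scikit-learn, a gradient boosting regressor can be trained along the following lines. The data here is synthetic and generated purely for illustration, and the feature names and hyperparameters are assumptions, not the client's configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the client's: three hypothetical inputs.
rng = np.random.default_rng(0)
n_samples = 500
X = np.column_stack([
    rng.uniform(0, 100, n_samples),  # pollen count
    rng.uniform(5, 30, n_samples),   # temperature
    rng.integers(0, 2, n_samples),   # 1 if the user took medication
])
# Made-up relationship: severity rises with pollen and falls with treatment.
severity = 0.03 * X[:, 0] - 0.8 * X[:, 2] + rng.normal(0, 0.2, n_samples)
severity = np.clip(severity, 0, 3)

X_train, X_test, y_train, y_test = train_test_split(
    X, severity, test_size=0.2, random_state=0
)

# Illustrative hyperparameters; real values would come from tuning.
model = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=0
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
r2 = model.score(X_test, y_test)  # coefficient of determination on held-out data
print(f"held-out R^2: {r2:.2f}")
```

Gradient boosting builds many shallow trees in sequence, each correcting the errors of the previous ones, which lets it capture non-linear effects and interactions that a simple regression would miss.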

Starschema delivered the solution in two months. The new data sources and ML model resulted in a consistent 20-25% uplift in the application’s predictive accuracy. The application now makes significantly more accurate predictions about symptom severity for the next three days based on pollen and weather data, as well as symptom and treatment data from the user.

The project also paved the way for future developments that will increase the application’s value. Users will benefit from further improvement in predictive accuracy thanks to the addition of air quality data, while the introduction of sales data will enable the indexing of the start of allergy season to help the client find out how it impacts the sales of allergy symptom relief products.

Ask the Expert

Eszter Windhager-Pokol

Head of Data Science

Eszter holds a degree in Applied Mathematics and has years of experience supporting data-driven decision-making as a consultant, with additional experience researching collaborative filtering and developing user behavior analytics products for IT security purposes. Eszter regularly holds data science trainings for business users and teaches Mastering the Process of Data Science at CEU as a visiting faculty instructor.
