These days, of course, things are a lot more complex. We routinely find target patient populations defined by multiple overlapping criteria. Where previously we might have been interested only in the Post-MI population, nowadays that specification is more likely to read ‘Post-MI patients with Diagnosed Type 2 Diabetes or a family history of Coronary Artery Disease. Contraindicated in patients with CHF staged II-IV in the NYHA classification’. Such a specification poses a number of challenges, of which the most significant are:

- Beyond the simple ‘Post-MI’ grouping, the additional criteria, expressed in their Boolean form – (Dx T2D OR Family Hx CAD) AND NOT CHF II-IV – require a detailed understanding of the overlaps between the individual criteria, and the comorbidities between them.
- Each of the criteria has a characteristic variation over time. That variation affects the comorbidities within the pools, because they aren’t all changing at the same rate. Further, there are causal relationships between them. AMI is a major precipitating factor in the development of CHF, so changes in the AMI rate should be reflected in lagged changes in the CHF prevalence within that pool.
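As a minimal sketch of the Boolean form above, the full specification (including the Post-MI requirement) can be written as a predicate over per-patient flags. The flag names and the four example patients are hypothetical, chosen only for illustration.

```python
def is_target(post_mi: bool, dx_t2d: bool, fhx_cad: bool, chf_ii_iv: bool) -> bool:
    """Post-MI AND (Dx T2D OR Family Hx CAD) AND NOT CHF II-IV."""
    return post_mi and (dx_t2d or fhx_cad) and not chf_ii_iv


# Hypothetical patients, for illustration only.
patients = [
    {"post_mi": True, "dx_t2d": True, "fhx_cad": False, "chf_ii_iv": False},
    {"post_mi": True, "dx_t2d": False, "fhx_cad": True, "chf_ii_iv": True},
    {"post_mi": True, "dx_t2d": False, "fhx_cad": False, "chf_ii_iv": False},
    {"post_mi": False, "dx_t2d": True, "fhx_cad": True, "chf_ii_iv": False},
]

eligible = [is_target(**p) for p in patients]
print(eligible)  # [True, False, False, False]
```

Only the first patient qualifies: the second is excluded by the CHF contraindication, the third has neither additional risk factor, and the fourth is not Post-MI.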

**Comorbidities in Forecasting**

Among the canonical commandments of forecasting, we might include “Thou shalt count a patient only once in thine forecast, and abjure them from other segments.” To this day, one of the biggest sources of inflated forecast values is double-counting patients. If a product is launched with a Depression indication, covering X million patients, then adds a claim for Anxiety, affecting Y million patients, building a forecast where the eligible patient pool is (X+Y) million patients neglects the subset of Depression patients with comorbid Anxiety who are already accounted for in the forecast. Looked at schematically, we can build up the situation as follows:

This is a very simple case of two disease states, A and B, which are completely independent of each other. If both have a prevalence of 10%, simple random chance places the proportion of people with both conditions at 10% × 10% = 1%. Note also that the population defined by A OR B is 10% + 10% - 1% = 19%, not 20%. So, even with unrelated disease states, we can’t simply add the numbers together.
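The arithmetic for the independent case can be checked in a few lines. This is only a sketch of the inclusion-exclusion rule; the function names are illustrative.

```python
def prevalence_both_independent(p_a: float, p_b: float) -> float:
    """Prevalence of A AND B when the two conditions are independent."""
    return p_a * p_b


def prevalence_either(p_a: float, p_b: float, p_both: float) -> float:
    """Prevalence of A OR B: add the two pools, then remove the double-counted overlap."""
    return p_a + p_b - p_both


p_a = p_b = 0.10
p_both = prevalence_both_independent(p_a, p_b)
p_either = prevalence_either(p_a, p_b, p_both)
print(round(p_both, 4), round(p_either, 4))  # 0.01 0.19
```

The second function holds whether or not the conditions are independent; only the size of the overlap changes.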

Now, let’s add a small wrinkle, and assume that there is a positive relationship between these conditions, such that having one increases the chances that you’ll also have the other.

The only difference here is that we’ve increased the overlap slightly, from 1% to 2% of the population, while keeping each condition at a prevalence of 10%. That leaves 8% with A alone, 8% with B alone, and 82% with neither. We can quantify the relationship by means of the Odds Ratio, an epidemiological term originally used for the very similar purpose of identifying the extent to which a disease prevalence changes with the presence or absence of a risk factor. If we use Condition A as the ‘risk factor’ for Condition B, we can apply the equation easily:

Odds Ratio (OR) = (Odds of B in patients with A) / (Odds of B in patients without A)

Odds of B in patients with A = 2%/8% = 0.25

Odds of B in patients without A = 8%/82% = 0.0976

Thus, the Odds Ratio is 0.25/0.0976 = 2.56

An OR value of 1.00 implies no special relationship between the disease states. A value less than 1.00 implies a protective effect (such as between malaria and sickle cell trait, where the OR is around 0.25), and a positive association gives a value greater than 1.00, as we see here.

*(There is a related concept, Relative Risk, which is calculated in a similar manner, using risk instead of odds. In that schema, we would have returned a value of 2.25. The lower the prevalence of the disease states involved, the less the difference between the numerical value of the Odds Ratio and Relative Risk)*
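Both measures can be reproduced from the four cell sizes used in the worked example (2% with both conditions, 8% with A alone, 8% with B alone, 82% with neither). The sketch below uses illustrative function names.

```python
def odds_ratio(both: float, a_only: float, b_only: float, neither: float) -> float:
    """Odds of B among those with A, divided by odds of B among those without A."""
    odds_b_with_a = both / a_only
    odds_b_without_a = b_only / neither
    return odds_b_with_a / odds_b_without_a


def relative_risk(both: float, a_only: float, b_only: float, neither: float) -> float:
    """Risk of B among those with A, divided by risk of B among those without A."""
    risk_b_with_a = both / (both + a_only)
    risk_b_without_a = b_only / (b_only + neither)
    return risk_b_with_a / risk_b_without_a


cells = dict(both=0.02, a_only=0.08, b_only=0.08, neither=0.82)
print(round(odds_ratio(**cells), 2), round(relative_risk(**cells), 2))  # 2.56 2.25
```

The only difference between the two is the denominator: odds divide by those without B in each group, while risk divides by the whole group.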

The immediate questions we might ask are:

- Where would I get the data to support this type of analysis for a more complex pattern of comorbidity?
- Even if we have this data, how does that help me forecast to 2030, where I believe the prevalence of the contributing conditions will be vastly different?
- What do I do to extend this analysis to other countries, where the diseases have quite different prevalence?

There is a certain amount of work to be done before we can use the data. As delivered, being government databases, they are generally a little cumbersome to work with, and the documentation supplied with them is not written with penetrating clarity. Nevertheless, it is possible to extract the relevant data and make use of it.

Essentially, the approach we take is to define a series of flags, identifying individual respondents as being either afflicted or unafflicted with a particular condition or risk factor. In more complex approaches, we can allow for non-dichotomous variables, where the patient is flagged with a variable that can take on more than two values (for example, Fasting Plasma Glucose levels can be grouped into the ranges 0-99, 100-125, 126+).
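A minimal sketch of this flagging step might look as follows. The record field names (`told_has_diabetes`, `told_has_chf`, `fpg_mg_dl`) are hypothetical and are not actual survey variable names, and the glucose cut-points are assumed to be in mg/dL.

```python
def fpg_category(fpg_mg_dl: float) -> str:
    """Bin a Fasting Plasma Glucose reading into the three ranges from the text."""
    if fpg_mg_dl < 100:
        return "0-99"
    if fpg_mg_dl < 126:
        return "100-125"
    return "126+"


def flag_respondent(record: dict) -> dict:
    """Turn one raw respondent record into dichotomous and banded flags.

    The input field names are assumptions made for this example.
    """
    return {
        "dx_t2d": bool(record.get("told_has_diabetes", False)),
        "dx_chf": bool(record.get("told_has_chf", False)),
        "fpg_band": fpg_category(record["fpg_mg_dl"]),
    }


# Hypothetical respondent, for illustration only.
respondent = {"told_has_diabetes": True, "told_has_chf": False, "fpg_mg_dl": 131}
flags = flag_respondent(respondent)
print(flags)  # {'dx_t2d': True, 'dx_chf': False, 'fpg_band': '126+'}
```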

Once that has been done, we can define, for a given set of n conditions, the full set of 2^n combinations that define all possible states in which a respondent could be found. In the prior examples, we’ve considered two conditions, with four possible states (A Alone, B Alone, A and B, Neither), but there’s nothing in the epidemiology textbooks to tell us how to extend this idea to more than two conditions. Going back to our previous example, we need to flag the following four groups of patients within NHANES, giving 2^4 = 16 possible states:

- AMI survivors
- Diagnosed T2D patients
- Family history of CAD
- Diagnosed CHF patients

The key point to understand here is that these segment sizes embody the full complexity of all the comorbidities between all the disease conditions we care about. Whatever we might call a multi-dimensional analog of the Odds Ratio, these values represent it fully, and we could extract any single OR value from this set of numbers using trivial arithmetic, without further analysis.
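To illustrate how the segment sizes carry the pairwise relationships, the sketch below enumerates the 16 states for the four conditions, collapses them to a 2×2 table for any chosen pair, and computes that pair's Odds Ratio. It also sums the segments that satisfy the example inclusion and exclusion criteria. The counts are invented for illustration and are not survey results.

```python
from itertools import product

CONDITIONS = ("AMI", "T2D", "FHx_CAD", "CHF")
# Every state is a tuple of 0/1 flags, one per condition: 2**4 = 16 states.
states = list(product((0, 1), repeat=len(CONDITIONS)))

# Hypothetical respondent counts per state, in the order generated above
# (the last flag varies fastest). These are NOT real survey figures.
hypothetical_counts = [8000, 150, 900, 40, 500, 30, 90, 10,
                       200, 20, 60, 8, 50, 6, 20, 4]
segments = dict(zip(states, hypothetical_counts))


def pairwise_odds_ratio(segments: dict, i: int, j: int) -> float:
    """Collapse the full table over all other conditions, then take the 2x2 OR."""
    cell = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for state, count in segments.items():
        cell[(state[i], state[j])] += count
    return (cell[(1, 1)] / cell[(1, 0)]) / (cell[(0, 1)] / cell[(0, 0)])


# OR for CHF given AMI, using nothing but the segment sizes.
or_ami_chf = pairwise_odds_ratio(segments, CONDITIONS.index("AMI"), CONDITIONS.index("CHF"))

# Target pool: AMI AND (T2D OR FHx_CAD) AND NOT CHF, built by summing segments.
target = sum(
    count for (ami, t2d, fhx, chf), count in segments.items()
    if ami and (t2d or fhx) and not chf
)
print(len(states), round(or_ami_chf, 2), target)
```

Because each respondent falls into exactly one state, summing segments in this way counts nobody twice.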

Our next challenge is to use this knowledge in a way that meets a key criterion: We must be able to use the embedded knowledge of how risk factors relate to each other separately from the individual prevalence estimates in the survey. This has several components to it:

- We cannot assume that just because NHANES says the prevalence of CHF is X%, this is the value we must use. Your data source may be the gold standard for some of the populations but not for others (I would argue strongly that NHANES is the definitive source for T2D in the US, but underrepresents CHF). Our approach must allow the user to substitute a preferred value for the overall prevalence of each condition.
- Ex-US comorbidity data are hard to come by. Local prevalence figures for individual disease states may vary substantially from country to country, but they are usually easier to obtain. An approach that allows us to apply US comorbidity data to those local prevalence figures is therefore highly valuable.
- Just as the prevalence of individual conditions can vary by country, it can vary over time. Survey data are a snapshot of a defined time window, often 5-10 years in the past, but the purpose of the analysis is not just to quantify the current market, but to project it over the full lifecycle of the product we’re ultimately forecasting.

It is important to emphasize that any time we apply data from one population to another, we are making some form of assumption. The goal of the approach we are taking is to limit the scope of that assumption. Specifically, the implicit assumption we are making is that however the prevalence of individual conditions varies, the way that the presence of one affects the presence of the other remains broadly consistent (for example, smoking rates and lung cancer rates vary substantially around the world, but our assumption would be that the way smoking increases lung cancer risk is similar everywhere). In cases where we have had the opportunity to compare the consequences of this assumption against local comorbidity data, we have found it to be quite valid.

The methodology by which we take the local prevalence projections for each contributing disease state and apply these comorbidities to map the detailed overlaps over time is highly complex and beyond the scope of this article, but the resultant output is a projection of the subsegment patient pools over time, in each country of interest for the forecast. These segments can then be combined in the appropriate way to include and exclude the relevant populations, and generate one or more sets of target patients for the forecast.
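For readers who want a feel for how such a mapping could work in principle, one well-known technique with the required property is iterative proportional fitting (IPF). It is shown here purely as an illustration and is not necessarily the methodology referred to above. IPF rescales a table of segment proportions until each condition matches a substituted prevalence, and because every step multiplies whole slices of the table by a constant, the Odds Ratios in the source table are left unchanged. The target prevalences below are hypothetical.

```python
def rescale_to_prevalences(segments: dict, target_prev: list, sweeps: int = 200) -> dict:
    """Iterative proportional fitting over a table of 0/1 state tuples.

    segments:    state tuple -> count or proportion from the source survey
    target_prev: preferred prevalence for each condition, in tuple order
    """
    total = sum(segments.values())
    p = {state: value / total for state, value in segments.items()}
    for _ in range(sweeps):
        for i, target in enumerate(target_prev):
            current = sum(v for s, v in p.items() if s[i] == 1)
            up = target / current                 # scale for states with condition i
            down = (1 - target) / (1 - current)   # scale for states without it
            p = {s: v * (up if s[i] == 1 else down) for s, v in p.items()}
    return p


def odds_ratio_2x2(p: dict) -> float:
    return (p[(1, 1)] / p[(1, 0)]) / (p[(0, 1)] / p[(0, 0)])


# Source table: the two-condition worked example (OR = 2.56).
source = {(1, 1): 0.02, (1, 0): 0.08, (0, 1): 0.08, (0, 0): 0.82}
# Hypothetical local prevalences: A = 15%, B = 10%.
local = rescale_to_prevalences(source, target_prev=[0.15, 0.10])

prev_a = local[(1, 1)] + local[(1, 0)]
prev_b = local[(1, 1)] + local[(0, 1)]
print(round(prev_a, 4), round(prev_b, 4), round(odds_ratio_2x2(local), 2))
```

The same function accepts state tuples of any length, so it can in principle be applied to the multi-condition segment tables discussed earlier.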

This approach can be extended as far as the sample size for the survey allows. Superficially, we could argue that the current NHANES sample size is around 50,000 respondents (2010-2011 survey data was not yet fully released at the time of writing), so that even with 6 separate patient pools, which would yield 2^6 = 64 individual subsegments, the average sample size would be roughly 800 respondents per bucket. Bear in mind, however, that the conditions we look at typically have prevalences of a few percent at best, so we wind up with >80% of the respondents falling into the ‘None’ subsegment, and the remaining <20% spread between the other 63 subsegments. The only way to know where to stop is to run the queries, and see how many respondents fall within the subsegments of a particular segmentation structure.
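A rough sketch of why the average bucket size is misleading: the example below assumes, purely for illustration, six independent conditions each with a prevalence of 3%, and an arbitrary threshold of 30 respondents below which a subsegment is treated as too sparse. Neither figure comes from the survey.

```python
from itertools import product

sample_size = 50_000   # approximate figure quoted in the text
n_conditions = 6
n_segments = 2 ** n_conditions
average_per_segment = sample_size / n_segments
print(n_segments, average_per_segment)  # 64 781.25

# Hypothetical assumptions: 3% prevalence each, all conditions independent.
prevalence = 0.03
min_usable = 30        # arbitrary sparsity threshold for this example

expected = {}
for state in product((0, 1), repeat=n_conditions):
    k = sum(state)  # number of conditions present in this state
    expected[state] = (
        sample_size * prevalence ** k * (1 - prevalence) ** (n_conditions - k)
    )

none_share = expected[(0,) * n_conditions] / sample_size
sparse = sum(1 for count in expected.values() if count < min_usable)
print(round(none_share, 3), sparse)
```

Under these assumptions, most of the 64 subsegments fall below the threshold even though the average is close to 800. Real comorbidities are positively associated, which puts more respondents into the overlap cells than independence would suggest, so the actual counts still have to be checked by running the queries.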

The real strength of this type of analysis is that it is the only way available to manage these highly complex patient definitions and inclusion/exclusion criteria, and it works well within a context of other types of analysis that might be required to identify particular patient groups.

To learn more about how we can help you refine your understanding of this complex and fascinating area, please contact the author directly.
