Fractured Patient Populations

Remember the days when ‘for the treatment of Hypertension’ was good enough for approval? It’s been a while since markets were that simple. In reality, of course, they never were quite that simple, but the data sources available to us, and the analytical tools to make sense of them, couldn’t peer into the murk well enough to show us otherwise.

These days, things are a lot more complex. We routinely find target patient populations defined by multiple overlapping criteria. Where previously we might have been interested only in the Post-MI population, nowadays that specification is more likely to read ‘Post-MI patients with Diagnosed Type 2 Diabetes or a family history of Coronary Artery Disease. Contraindicated in patients with CHF staged II-IV in the NYHA classification’. Such a specification poses a number of challenges, but the most significant are:

Beyond the simple ‘Post-MI’ grouping, the additional criteria, expressed in their Boolean form – (Dx T2D OR Family Hx CAD) AND NOT CHF II-IV – require a detailed understanding of the overlaps between the individual criteria, and the comorbidities between them.
Each of the criteria has a characteristic variation over time. That variation affects the comorbidities within the pools, because they aren’t all changing at the same rate. Further, there are causal relationships between them. AMI is a major precipitating factor in the development of CHF, so changes in the AMI rate should be reflected in lagged changes in the CHF prevalence within that pool.

Let’s look at the issues surrounding the overlaps in a little more detail.

Comorbidities in Forecasting

Among the canonical commandments of forecasting, we might include “Thou shalt count a patient only once in thine forecast, and abjure them from other segments.” To this day, one of the biggest sources of inflated forecast values is double-counting patients. If a product is launched with a Depression indication, covering X million patients, then adds a claim for Anxiety, affecting Y million patients, building a forecast where the eligible patient pool is (X+Y) million patients neglects the subset of Depression patients with comorbid Anxiety who are already accounted for in the forecast. Looked at schematically, we can build up the situation as follows:

This is a very simple case of two disease states, A and B, which are completely independent of each other. If both have a prevalence of 10%, the simple random chance places the number of people with both conditions at 10% x 10% = 1%. Note also that the population defined by A OR B is 19%, not 20%. So, even with unrelated disease states, we can’t simply add the numbers together.
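The arithmetic above can be reproduced in a few lines. This is a minimal sketch that assumes the two conditions are fully independent; the function name and segment labels are ours, chosen purely for illustration.

```python
def split_if_independent(prev_a, prev_b):
    """Segment shares for two conditions, assuming statistical independence."""
    both = prev_a * prev_b             # overlap expected by chance alone
    either = prev_a + prev_b - both    # count the overlap once, not twice
    return {
        "A and B": both,
        "A alone": prev_a - both,
        "B alone": prev_b - both,
        "A or B": either,
        "Neither": 1.0 - either,
    }


shares = split_if_independent(0.10, 0.10)
for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

The 'A or B' line is the eligible pool for a product indicated for either condition, and it is one overlap short of the naive sum.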

Now, let’s add a small wrinkle, and assume that there is a positive relationship between these conditions, such that having one increases the chances that you’ll also have the other.

The only difference here is that we’ve increased the overlap slightly, from 1% to 2%, while each condition still has a prevalence of 10% (so 8% have A alone, 8% have B alone, and 82% have neither). We can quantify this by means of the Odds Ratio, an epidemiological term originally used for the very similar purpose of identifying the extent to which a disease prevalence increases with the presence or absence of a risk factor. If we use Condition A as the ‘risk factor’ for Condition B, we can apply the equation easily:

Odds Ratio (OR) = (Odds of B in patients with A) / (Odds of B in patients without A)

Odds of B in patients with A = 2%/8% = 0.25
Odds of B in patients without A = 8%/82% = 0.0976
Thus, the Odds Ratio is 0.25/0.0976 = 2.56

An OR value of 1.00 implies no special relationship between the disease states. A value less than one implies a protective effect (such as between malaria and sickle cell trait, where the OR is around 0.25), and a positive effect has a value greater than 1.00, as we see here.

(There is a related concept, Relative Risk, which is calculated in a similar manner, using risk instead of odds. In that schema, we would have returned a value of (2%/10%) / (8%/90%) = 2.25. The lower the prevalence of the disease states involved, the smaller the difference between the numerical values of the Odds Ratio and the Relative Risk.)
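For readers who want to check the figures, the sketch below computes both measures from the four segment sizes used in the example (2% with both conditions, 8% with A alone, 8% with B alone, 82% with neither). The function and argument names are illustrative only.

```python
def odds_ratio(both, a_only, b_only, neither):
    """Odds of B among those with A, divided by odds of B among those without A."""
    odds_b_with_a = both / a_only
    odds_b_without_a = b_only / neither
    return odds_b_with_a / odds_b_without_a


def relative_risk(both, a_only, b_only, neither):
    """Risk of B among those with A, divided by risk of B among those without A."""
    risk_b_with_a = both / (both + a_only)
    risk_b_without_a = b_only / (b_only + neither)
    return risk_b_with_a / risk_b_without_a


# Segment sizes in percentage points of the total population.
cells = dict(both=2, a_only=8, b_only=8, neither=82)
print(f"Odds Ratio:    {odds_ratio(**cells):.2f}")
print(f"Relative Risk: {relative_risk(**cells):.2f}")
```

Because both functions use only ratios of the cells, they accept percentages, proportions, or raw respondent counts interchangeably.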

The immediate questions we might ask are:

Where would I get the data to support this type of analysis for a more complex pattern of comorbidity?
Even if we have this data, how does that help me forecast to 2030, where I believe the prevalence of the contributing conditions will be vastly different?
What do I do to extend this analysis to other countries, where the diseases have quite different prevalence?

This is where patient-level data is absolutely key. There are a number of large, well-designed, representative studies that have made their raw datasets available to researchers. The National Health and Nutrition Examination Survey (NHANES) and the National Comorbidity Survey Replication (NCS-R) both contain detailed data on tens of thousands of individual respondents. NCS-R focuses on mental health, and NHANES more on overall health. We have used NHANES extensively in the cardiovascular/metabolic area, and found it to be the richest data source by far for understanding the complex dynamics in these markets.

There is a certain amount of work to be done before we can use the data. As delivered, being government databases, they are generally a little cumbersome to work with, and the documentation supplied with them is not written with penetrating clarity. Nevertheless, it is possible to extract the relevant data and make use of it.

Essentially, the approach we take is to define a series of flags, identifying individual respondents as being either afflicted or unafflicted with a particular condition or risk factor. In more complex approaches, we can allow for non-dichotomous variables, where the patient is flagged with a variable that can take on more than two values (for example, Fasting Plasma Glucose levels can be grouped into the ranges 0-99, 100-125, 126+).
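As a rough illustration of the flagging step, the sketch below turns raw respondent records into dichotomous flags plus one banded variable. The records and field names are hypothetical and are not actual NHANES variable names; the glucose bands are the ones quoted above, and the values are assumed to be in mg/dL.

```python
from bisect import bisect_right

# Hypothetical respondent records (field names are illustrative only).
respondents = [
    {"id": 1, "told_diabetes": True, "told_chf": False, "fpg": 142},
    {"id": 2, "told_diabetes": False, "told_chf": False, "fpg": 91},
    {"id": 3, "told_diabetes": False, "told_chf": True, "fpg": 110},
]

FPG_CUTS = [100, 126]                     # lower bounds of the 2nd and 3rd bands
FPG_BANDS = ["0-99", "100-125", "126+"]


def fpg_band(value):
    """Map a Fasting Plasma Glucose reading to one of the three bands."""
    return FPG_BANDS[bisect_right(FPG_CUTS, value)]


def flag(respondent):
    """Reduce a raw record to the flags used for segmentation."""
    return {
        "id": respondent["id"],
        "dx_t2d": bool(respondent["told_diabetes"]),   # dichotomous flag
        "dx_chf": bool(respondent["told_chf"]),        # dichotomous flag
        "fpg_band": fpg_band(respondent["fpg"]),       # three-level flag
    }


flagged = [flag(r) for r in respondents]
for row in flagged:
    print(row)
```

In practice each flag definition would be agreed with the forecasting team first, since the flags determine which segments the forecast can later address.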

Once that has been done, we can define, for a given set of n conditions, the full set of 2^n combinations that describe every possible state in which a respondent could be found. In the prior examples, we’ve considered two conditions, with four possible states (A Alone, B Alone, A and B, Neither), but there’s nothing in the epidemiology textbooks to tell us how to extend this idea to more than two conditions. Going back to our previous example, we need to flag the following patients within NHANES:

AMI survivors
Diagnosed T2D patients
Family history of CAD
Diagnosed CHF patients

Schematically, we have this:

That gives us 2^4 = 16 possible patient groups, which drops to 8 if we define a combined group for ‘Dx T2D OR Family History of CAD’. Only some of these are eligible target patients for our forecast. (We must never forget that the entire point of these epidemiological calculations is to support a forecast model: it’s critical that epi groups and forecasting groups work together to ensure that the epidemiological approach dovetails with the eventual structure of the forecast.) If we construct our queries carefully, we can generate a set of values telling us the size of each pool, according to NHANES or whichever source we’re using (the approach is completely portable to other respondent-level surveys).
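To make the segmentation concrete, here is a minimal sketch that enumerates all 16 states for the four flags, sums respondent weights into each state, and then applies the Boolean target definition from earlier in the article. The sample rows and their weights are invented for illustration, and the CHF flag is assumed to mean NYHA class II-IV.

```python
from itertools import product

CONDITIONS = ("ami", "t2d", "fam_cad", "chf")

# Every possible state: 2^4 = 16 tuples of True/False flags.
ALL_STATES = list(product((False, True), repeat=len(CONDITIONS)))

# Hypothetical flagged respondents: (state, survey weight).
sample = [
    ((True, True, False, False), 1.0),     # post-MI with T2D: target
    ((True, False, True, False), 2.0),     # post-MI with family history: target
    ((True, True, False, True), 1.5),      # excluded by CHF
    ((True, False, False, False), 3.0),    # post-MI, no qualifying risk factor
    ((False, True, False, False), 4.0),    # not post-MI
    ((False, False, False, False), 20.0),  # none of the conditions
]


def segment_sizes(rows):
    """Sum the weights of respondents falling into each of the 16 states."""
    sizes = {state: 0.0 for state in ALL_STATES}
    for state, weight in rows:
        sizes[state] += weight
    return sizes


def is_target(state):
    """Post-MI AND (Dx T2D OR Family Hx CAD) AND NOT CHF."""
    ami, t2d, fam_cad, chf = state
    return ami and (t2d or fam_cad) and not chf


sizes = segment_sizes(sample)
target_size = sum(size for state, size in sizes.items() if is_target(state))
print(f"{len(ALL_STATES)} states, target pool weight = {target_size}")
```

Because every respondent lands in exactly one state, summing any selection of states can never count a patient twice.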

The key point to understand here is that these segment sizes embody the full complexity of all the comorbidities between all the disease conditions we care about. Whatever we might call a multi-dimensional analog of the Odds Ratio, these values represent it fully, and we could extract any single OR value from this set of numbers using trivial arithmetic, without further analysis.
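The 'trivial arithmetic' can be sketched as follows: to obtain the Odds Ratio between any two conditions, collapse the segment sizes over all the other conditions into a 2x2 table and take the cross-product ratio. The three-condition table below is invented for illustration; it reuses the 2%/8%/8%/82% example and adds a third condition that is unrelated to the first two.

```python
def pairwise_odds_ratio(sizes, i, j):
    """Crude Odds Ratio between conditions i and j.

    sizes maps a tuple of True/False flags (one per condition) to a segment
    size; all conditions other than i and j are summed over.
    """
    cell = {(a, b): 0.0 for a in (False, True) for b in (False, True)}
    for state, size in sizes.items():
        cell[(state[i], state[j])] += size
    return (cell[(True, True)] * cell[(False, False)]) / (
        cell[(True, False)] * cell[(False, True)]
    )


# Illustrative data: conditions A and B as in the earlier example, plus a
# third condition C that splits every A/B segment evenly.
two_way = {
    (True, True): 2.0,
    (True, False): 8.0,
    (False, True): 8.0,
    (False, False): 82.0,
}
sizes3 = {
    (a, b, c): size / 2
    for (a, b), size in two_way.items()
    for c in (False, True)
}

print(f"OR(A, B) = {pairwise_odds_ratio(sizes3, 0, 1):.2f}")
print(f"OR(A, C) = {pairwise_odds_ratio(sizes3, 0, 2):.2f}")
```

Note that this yields the crude (marginal) Odds Ratio; an Odds Ratio within a particular stratum of the other conditions would be read from the relevant subset of segments instead.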

Our next challenge is to use this knowledge in a way that meets a key criterion: We must be able to use the embedded knowledge of how risk factors relate to each other separately from the individual prevalence estimates in the survey. This has several components to it:

We cannot assume that just because NHANES says the prevalence of CHF is X% that this is the value we must use. Your data source may be the gold standard for some populations but not for others (I would argue strongly that NHANES is the definitive source for T2D in the US, but that it underrepresents CHF). Our approach must allow the user to substitute a preferred value for the overall prevalence of each condition.
Ex-US comorbidity data are hard to come by. Local prevalence figures for individual disease states vary substantially from country to country, but are usually easier to obtain, so an approach that lets us apply US comorbidity data to those local figures is highly valuable.
Just as the prevalence of individual conditions can vary by country, it can vary over time. Survey data are a snapshot of a defined time window, often 5-10 years in the past, but the purpose of the analysis is not just to quantify the current market, but to project it over the full lifecycle of the product we’re ultimately forecasting.

It is important to emphasize that any time we apply data from one population to another, we are making some form of assumption. The goal of the approach we are taking is to limit the scope of that assumption. Specifically, the implicit assumption we are making is that however the prevalence of individual conditions varies, the way that the presence of one affects the presence of the other remains broadly consistent (for example, smoking rates and lung cancer rates vary substantially around the world, but our assumption would be that the way smoking increases lung cancer risk is similar everywhere). In cases where we have had the opportunity to compare the consequences of this assumption against local comorbidity data, we have found it to be quite valid.

The methodology by which we take the local prevalence projections for each contributing disease state and apply these comorbidities to map the detailed overlaps over time is highly complex and beyond the scope of this article, but the resultant output is a projection of the subsegment patient pools over time, in each country of interest for the forecast. These segments can then be combined in the appropriate way to include and exclude the relevant populations, and generate one or more sets of target patients for the forecast.
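Although the full methodology is not described here, the core assumption can be illustrated with a generic textbook technique. The sketch below uses iterative proportional fitting on a two-condition table: it rescales survey segment shares until they match substituted prevalence figures, while leaving the Odds Ratio unchanged. This is our illustration of the principle only, not the methodology referred to above; the function names and the substituted prevalence values (15% and 12%) are hypothetical.

```python
def cross_product_ratio(table):
    """Odds Ratio of a 2x2 table keyed by (has_a, has_b)."""
    return (table[(True, True)] * table[(False, False)]) / (
        table[(True, False)] * table[(False, True)]
    )


def refit_to_prevalences(seed, prev_a, prev_b, iterations=200):
    """Rescale a 2x2 table of shares to new prevalences of A and B.

    Rows and columns are only ever multiplied by constants, so the
    cross-product ratio (the Odds Ratio) of the seed table is preserved.
    """
    table = dict(seed)
    for _ in range(iterations):
        # Match the target prevalence of A (and of not-A).
        for a, target in ((True, prev_a), (False, 1.0 - prev_a)):
            total = table[(a, True)] + table[(a, False)]
            for b in (True, False):
                table[(a, b)] *= target / total
        # Match the target prevalence of B (and of not-B).
        for b, target in ((True, prev_b), (False, 1.0 - prev_b)):
            total = table[(True, b)] + table[(False, b)]
            for a in (True, False):
                table[(a, b)] *= target / total
    return table


# Survey shares from the earlier example (Odds Ratio of about 2.56).
seed = {
    (True, True): 0.02,
    (True, False): 0.08,
    (False, True): 0.08,
    (False, False): 0.82,
}

# Hypothetical local or future prevalences substituted by the user.
fitted = refit_to_prevalences(seed, prev_a=0.15, prev_b=0.12)
print(f"Odds Ratio before: {cross_product_ratio(seed):.4f}")
print(f"Odds Ratio after:  {cross_product_ratio(fitted):.4f}")
print(f"Overlap (A and B): {fitted[(True, True)]:.4f}")
```

The same idea extends in principle to tables with more than two conditions, though a production model would also need to handle time series, empty cells, and causal lags such as the AMI-to-CHF relationship mentioned earlier.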

This approach can be extended as far as the sample size for the survey allows. Superficially, we could argue that the current NHANES sample size is around 50,000 respondents (2010-2011 survey data was not yet fully released at the time of writing), so even with 6 separate patient pools, which would yield 2^6 = 64 individual subsegments, the average would be roughly 800 respondents per bucket. Bear in mind, however, that the conditions we look at typically have a prevalence of a few percent at best, so we wind up with >80% of the respondents falling into the ‘None’ subsegment, and the remaining <20% spread between the other 63 subsegments. The only way to know where to stop is to run the queries, and see how many respondents fall within the subsegments of a particular segmentation structure.

The real strength of this type of analysis is that it is the only way available to manage these highly complex patient definitions and inclusion/exclusion criteria, and it works well within a context of other types of analysis that might be required to identify particular patient groups.

To learn more about how we can help you refine your understanding of this complex and fascinating area, please contact the author directly.

Paul McNiven, M.Sci.
Managing Partner

Tel: +1 (512) 888-9986 Ext 1

Email: paul.mcniven@humanumeric.com

Web: www.humanumeric.com