R2795 Data to Decisions: Introduction to AI in Healthcare

Session 1.1.5 – Logistic Regression

The Problem: Predicting Mortality in Healthcare

In healthcare, we often need to predict binary outcomes—events that either happen or don't happen:

Will this patient survive their hospital stay?
Will this patient develop a complication?
Does this patient have a particular disease?

Vizient Mortality Prediction: A Real-World Example

Vizient is a healthcare performance improvement company that provides risk-adjusted mortality models to help hospitals benchmark their outcomes. These models use logistic regression to predict the probability of in-hospital mortality based on patient characteristics.

Vizient Mortality Podcast - Knowledge on the Go

Consider gastrointestinal (GI) bleeding—a common, serious condition that requires rapid assessment and intervention. Predicting mortality risk helps clinicians:

Identify high-risk patients who need intensive monitoring
Guide resource allocation decisions
Compare outcomes across hospitals after adjusting for patient risk
Support shared decision-making with patients and families

Learning Objectives

By the end of this session, you will be able to:

Explain why linear regression fails for binary outcomes - Understand the fundamental limitations of linear regression when predicting probabilities and binary events, including the problem of predictions outside the [0,1] range
Understand the logistic regression model and link function - Describe how logistic regression transforms a linear combination of predictors into probabilities using the logit link function, creating an S-shaped curve that ensures probabilities stay between 0 and 1
Interpret logistic regression coefficients as odds ratios - Explain what coefficients in a logistic regression model mean in clinical context, understanding how each predictor affects the odds (and probability) of the outcome, using the Vizient-like GI bleed mortality model as an example
Recognize the advantages and limitations of logistic regression - Identify strengths (handles binary outcomes, interpretable odds ratios, well-calibrated probabilities) and limitations (assumes linearity in log-odds, requires adequate sample size, sensitive to class imbalance) when applied to healthcare prediction problems

1. Why Linear Regression Fails for Binary Outcomes

The Problem with Continuous Predictions

Imagine trying to predict whether a patient with GI bleeding will die using linear regression. We might model:

Mortality = β₀ + β₁(Age) + β₂(Sepsis) + ...

[Figure: Linear regression problems with probability predictions]

But what does this produce? A continuous number that could be:

Negative (e.g., -0.3)—what does "negative mortality" mean?
Greater than 1 (e.g., 1.5)—what does "150% mortality" mean?

These predictions are meaningless for binary outcomes. We need probabilities that are:

Bounded between 0 and 1
Interpretable as the chance an event will occur
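To make the problem concrete, here is a minimal sketch that fits an ordinary least-squares line to a binary outcome. The ages and outcomes are hypothetical toy values invented for illustration, not patient data, and the helper name fit_line is an assumption of this example.

```python
# Hypothetical toy data: age in years and whether the patient died (1) or survived (0).
ages = [20, 30, 40, 50, 60, 70, 80, 90, 100]
died = [0, 0, 0, 0, 0, 1, 1, 1, 1]

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

intercept, slope = fit_line(ages, died)
predictions = [intercept + slope * age for age in ages]

# Flag every fitted value that cannot be read as a probability.
for age, pred in zip(ages, predictions):
    flag = "" if 0.0 <= pred <= 1.0 else "  <-- not a valid probability"
    print(f"age {age:3d}: predicted 'mortality' = {pred:+.3f}{flag}")
```

With these toy values the fitted line dips below 0 for the youngest patients and rises above 1 for the oldest, even though every observed outcome is 0 or 1.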

Visual Comparison

✋ Hands-on Activity: Linear vs Logistic Regression

Let's see how linear and logistic regression differ when predicting binary outcomes:

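[Interactive demo: a slider sets the norepinephrine dose (mcg/min; shown at 10). The Linear Regression panel reports a predicted MAP (55.0 mmHg at this dose) and the Logistic Regression panel reports P(MAP > 55) (50.0% at this dose).]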

Notice how:

Linear regression can produce predictions outside [0,1] and doesn't capture the S-shaped relationship
Logistic regression produces probabilities bounded between 0 and 1, with a natural S-curve that reflects how risk changes with predictor values

2. Logistic Regression: The Model

The Core Idea

Logistic regression solves the binary outcome problem by:

Modeling the log-odds (logit) of the outcome as a linear function of predictors
Transforming the log-odds back to probabilities using the logistic function

This ensures probabilities always stay between 0 and 1, while still allowing us to use a linear combination of predictors.

The Logit Link Function

The logit (log-odds) is defined as:

logit(p) = ln(p / (1-p))

Where:

p = probability of the outcome (e.g., mortality)
p / (1-p) = the odds of the outcome
ln(odds) = the log-odds or logit

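The logit can range from -∞ to +∞. A linear combination of predictors is also unbounded, so it can be set equal to the logit without ever implying a probability outside [0,1].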
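The definitions above can be sketched in a few lines. The function names logit and inverse_logit are chosen for this illustration; the formulas are the ones given in this section.

```python
import math

def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Logistic function: maps any real-valued log-odds z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Probabilities near 0 or 1 map to large negative or positive log-odds.
for p in (0.01, 0.10, 0.50, 0.90, 0.99):
    z = logit(p)
    print(f"p = {p:.2f}  odds = {p / (1 - p):7.3f}  "
          f"logit = {z:+.3f}  back to p = {inverse_logit(z):.2f}")
```

A probability of 0.5 corresponds to odds of 1 and a logit of 0, and the two functions undo each other.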

The Logistic Regression Formula

The transformation from linear log-odds to S-shaped probabilities is the heart of logistic regression. Let's explore this visually:

✋ Hands-on Activity: The Logit Link Function

Watch how a linear relationship in log-odds space becomes an S-curve in probability space:

Step 1: The Logit Function

logit(p) = -6 + 0.1 × age

We start with a linear equation for the logit (log-odds) of probability p.

Applying the logistic function, p = 1 / (1 + exp(-logit(p))), to this linear equation creates the characteristic S-shaped curve that:

Approaches 0 as logit → -∞
Approaches 1 as logit → +∞
Is steepest when p ≈ 0.5 (logit ≈ 0)
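As a rough check of these three properties, the sketch below evaluates the Step 1 equation, logit(p) = -6 + 0.1 × age, over a range of ages. The function names are hypothetical, and the steepness formula uses the standard derivative of the logistic function, dp/d(logit) = p(1 - p).

```python
import math

def mortality_probability(age):
    """Logistic transform of the linear predictor logit(p) = -6 + 0.1 * age."""
    z = -6.0 + 0.1 * age
    return 1.0 / (1.0 + math.exp(-z))

def steepness(age):
    """Change in probability per additional year of age: 0.1 * p * (1 - p)."""
    p = mortality_probability(age)
    return 0.1 * p * (1.0 - p)

ages = list(range(0, 121, 20))
probs = [mortality_probability(a) for a in ages]

for age, p in zip(ages, probs):
    print(f"age {age:3d}: p = {p:.4f}, slope per year = {steepness(age):.4f}")

steepest_age = max(ages, key=steepness)
print("steepest at age", steepest_age)
```

In this example the logit is 0 at age 60, so that is where the probability is 0.5 and where one extra year changes the probability the most.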

3. Vizient-like GI Bleed Mortality Model

Here's an illustrative example of how a Vizient-like model might predict mortality in GI bleeding patients. This is a simplified, educational example (not the actual proprietary Vizient model):

The Model Equation

The model predicts mortality probability using logistic regression with the following equation:

logit(p) = -7.0 + 0.055 × (Age) + 0.20 × (Male) + 0.60 × (Emergency) + 0.90 × (ICU_24h) + 0.08 × (ElixScore) + 0.70 × (CHF_POA) + 0.55 × (CKD_POA) + 1.10 × (Sepsis_POA) + 1.40 × (Vent_24h) + 0.65 × (Creatinine_high) + 0.85 × (Lactate_high)

Then convert to probability:

p = 1 / (1 + exp(-logit(p)))
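The two steps can be written as a small function. This is a sketch of the simplified educational equation above, not Vizient's model; the dictionary layout, the function name predict_mortality, and both example patients are assumptions made for illustration.

```python
import math

INTERCEPT = -7.0
COEFFICIENTS = {
    "Age": 0.055, "Male": 0.20, "Emergency": 0.60, "ICU_24h": 0.90,
    "ElixScore": 0.08, "CHF_POA": 0.70, "CKD_POA": 0.55, "Sepsis_POA": 1.10,
    "Vent_24h": 1.40, "Creatinine_high": 0.65, "Lactate_high": 0.85,
}

def predict_mortality(patient):
    """Return (logit, probability); predictors missing from `patient` count as 0."""
    z = INTERCEPT + sum(beta * patient.get(name, 0)
                        for name, beta in COEFFICIENTS.items())
    return z, 1.0 / (1.0 + math.exp(-z))

# Hypothetical patients: a 40-year-old with no other risk factors, and the
# same patient with sepsis present-on-admission and early ventilation.
z_low, p_low = predict_mortality({"Age": 40})
z_high, p_high = predict_mortality({"Age": 40, "Sepsis_POA": 1, "Vent_24h": 1})

print(f"low risk:  logit = {z_low:+.2f}, p = {p_low:.4f}")
print(f"high risk: logit = {z_high:+.2f}, p = {p_high:.4f}")
```

Adding the two risk factors shifts the logit by 1.10 + 1.40 = 2.50, and the logistic function turns that shift into a much larger predicted probability.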

✋ Hands-on Activity: Interactive Mortality Calculator

Explore how patient characteristics affect predicted mortality risk using this interactive calculator. Adjust the patient characteristics on the left to see how the model calculates:

The logit value (log-odds)
The predicted mortality probability (as a percentage)
The contribution of each predictor to the final risk estimate

Interactive GI Bleed Mortality Calculator

Patient Characteristics

Age: patient age in years (slider from 18 to 100 years; set to 75 years in this example)
Male: patient is male
Emergency admission: ED/urgent admission (vs. elective)
ICU first 24h: ICU admission in first 24 hours
Elixhauser comorbidity score: overall comorbidity burden (slider from 0 to 30)
CHF (POA): congestive heart failure present-on-admission
CKD (POA): chronic kidney disease present-on-admission
Sepsis (POA): sepsis present-on-admission
Ventilation 24h: mechanical ventilation in first 24 hours
Creatinine high: first creatinine ≥ 2.0 mg/dL
Lactate high: first lactate ≥ 4 mmol/L

Predicted Mortality: 73.4% (Very High; intensive monitoring needed)

Model Details

logit(p) = 1.015
p = 1 / (1 + exp(-1.015)) = 0.734 = 73.4%

Term Contributions (log-odds)

Intercept: -7
Age: +4.13
Male: +0.2
Emergency admission: +0.6
ICU first 24h: none (not selected)
Elixhauser score: +0.64
CHF (POA): +0.7
CKD (POA): none (not selected)
Sepsis (POA): +1.1
Ventilation 24h: none (not selected)
Creatinine high: +0.65
Lactate high: none (not selected)
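The calculator's numbers can be reproduced by hand. The sketch below assumes the inputs implied by the displayed contributions: age 75, with male, emergency admission, CHF, sepsis, and high creatinine selected. The Elixhauser score of 8 is not shown directly; it is inferred here from the +0.64 contribution divided by the 0.08 coefficient.

```python
import math

INTERCEPT = -7.0
COEFFICIENTS = {
    "Age": 0.055, "Male": 0.20, "Emergency": 0.60, "ICU_24h": 0.90,
    "ElixScore": 0.08, "CHF_POA": 0.70, "CKD_POA": 0.55, "Sepsis_POA": 1.10,
    "Vent_24h": 1.40, "Creatinine_high": 0.65, "Lactate_high": 0.85,
}

# Inputs assumed from the calculator display (ElixScore = 8 is inferred).
patient = {
    "Age": 75, "Male": 1, "Emergency": 1, "ICU_24h": 0, "ElixScore": 8,
    "CHF_POA": 1, "CKD_POA": 0, "Sepsis_POA": 1, "Vent_24h": 0,
    "Creatinine_high": 1, "Lactate_high": 0,
}

# Each term's contribution to the log-odds is coefficient * value.
contributions = {name: COEFFICIENTS[name] * value for name, value in patient.items()}
logit_value = INTERCEPT + sum(contributions.values())
probability = 1.0 / (1.0 + math.exp(-logit_value))

print(f"{'Intercept':16s} {INTERCEPT:+.3f}")
for name, value in contributions.items():
    print(f"{name:16s} {value:+.3f}")
print(f"logit(p) = {logit_value:.3f}, p = {probability:.3f}")
```

Summing the intercept and the contributions gives the logit of 1.015 shown under Model Details, which the logistic function converts to a probability of about 0.734.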

4. Interpreting Coefficients as Odds Ratios

What Do Coefficients Mean?

In logistic regression, coefficients tell us how predictors affect the log-odds of the outcome. But it's often more intuitive to think in terms of odds ratios.

Converting Coefficients to Odds Ratios

For a coefficient β, the odds ratio is: OR = exp(β)

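The odds ratio tells us: "By what factor are the odds of the outcome multiplied when the predictor increases by 1 unit (or is present vs. absent)?" Note that this is a statement about odds, not about how many times more likely the outcome is.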

Interpreting the Vizient-like Model Coefficients

Let's interpret some coefficients from our GI bleed mortality model:

Coefficient: 0.055 for Age

Odds Ratio per year: exp(0.055) = 1.057
Meaning: Each additional year of age increases the odds of mortality by 5.7% (or multiplies odds by 1.057)
Clinical interpretation: Age is a continuous risk factor—older patients have incrementally higher risk, with the effect compounding over decades

Coefficient: 1.10 for Sepsis_POA

Odds Ratio: exp(1.10) = 3.00
Meaning: Patients with sepsis present-on-admission have 3.00 times the odds of mortality of otherwise similar patients without sepsis
Clinical interpretation: Sepsis is a life-threatening condition that substantially increases mortality risk in GI bleeding patients
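The two conversions above, plus the compounding of the age effect over a decade, can be checked with a short sketch. The ten-year horizon is an illustrative choice, not something specified by the model.

```python
import math

coefficients = {"Age (per year)": 0.055, "Sepsis_POA": 1.10}

# Odds ratio for a one-unit increase: OR = exp(beta).
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
for name, odds_ratio in odds_ratios.items():
    print(f"{name}: OR = {odds_ratio:.3f}")

# For a continuous predictor the effect compounds: a 10-year difference
# multiplies the odds by exp(10 * beta), which equals the one-year OR to the 10th power.
or_per_decade = math.exp(10 * coefficients["Age (per year)"])
print(f"Age (per 10 years): OR = {or_per_decade:.3f}")
```

A modest per-year odds ratio of about 1.057 therefore becomes roughly 1.73 per decade of age.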

Important Notes on Interpretation

Odds vs. Probability: Odds ratios describe relative changes in odds, not probabilities. A 2x increase in odds doesn't mean a 2x increase in probability (the relationship is non-linear)
Holding other variables constant: Like in linear regression, each coefficient represents the effect of that predictor while holding all other predictors constant
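Interaction effects: The model contains no interaction terms, so it assumes each predictor shifts the log-odds by the same amount regardless of the other predictors' values. In reality, some combinations (e.g., sepsis + mechanical ventilation) might have synergistic effects not captured by simple additive terms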
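The odds-versus-probability note is easier to see with numbers. This sketch applies an odds ratio of 3.0 (roughly the sepsis effect) to several baseline probabilities; the baselines and the function name apply_odds_ratio are arbitrary choices for illustration.

```python
def apply_odds_ratio(baseline_probability, odds_ratio):
    """Convert to odds, multiply by the odds ratio, and convert back to a probability."""
    odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)

for baseline in (0.01, 0.10, 0.50, 0.90):
    updated = apply_odds_ratio(baseline, 3.0)
    print(f"baseline p = {baseline:.2f} -> p = {updated:.3f} "
          f"(probability multiplied by {updated / baseline:.2f})")
```

The same odds ratio nearly triples a 1% risk, raises a 10% risk to 25%, and moves a 50% risk only to 75%, which is why an odds ratio should not be read as a multiplier on probability.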

5. Evaluation Metrics: AUC, Sensitivity, and Specificity

The ROC curve plots sensitivity (True Positive Rate) vs. 1 - specificity (False Positive Rate) across all classification thresholds. The Area Under the ROC Curve (AUC) summarizes discrimination:

AUC = 0.5: Random guessing
AUC = 0.7-0.8: Acceptable
AUC = 0.8-0.9: Excellent
AUC > 0.9: Outstanding
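One way to build intuition for the AUC is its pairwise interpretation: it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it that way on made-up scores; the data and the function name auc_by_pairs are assumptions of this example.

```python
def auc_by_pairs(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical predicted probabilities for patients who died and who survived.
died_scores = [0.9, 0.8, 0.4]
survived_scores = [0.7, 0.3, 0.2, 0.1]

auc = auc_by_pairs(died_scores, survived_scores)
print(f"AUC = {auc:.3f}")
```

Here 11 of the 12 possible pairs are ranked correctly, so the AUC is 11/12. A model that assigned the same score to everyone would tie on every pair and score 0.5.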

At any threshold, predictions form a confusion matrix with four categories: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). Key metrics:

Sensitivity (Recall): TP / (TP + FN) - How well does the model identify positive cases?
Specificity: TN / (TN + FP) - How well does the model avoid false alarms?
Accuracy: (TP + TN) / Total - Overall correctness (can be misleading with imbalanced data)

There's a trade-off between sensitivity and specificity: lower thresholds increase sensitivity but decrease specificity. The optimal threshold depends on clinical context (e.g., screening vs. confirmatory tests).
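The trade-off can be seen by sweeping the threshold over a small set of scores. The labels, scores, and function name below are invented for illustration; a case is classified positive when its score is at or above the threshold.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Classify score >= threshold as positive and return (sensitivity, specificity)."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical outcomes (1 = died) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

results = {}
for threshold in (0.75, 0.50, 0.25):
    results[threshold] = sensitivity_specificity(y_true, scores, threshold)
    sens, spec = results[threshold]
    print(f"threshold {threshold:.2f}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Lowering the threshold from 0.75 to 0.25 catches every positive case in this toy set, at the cost of flagging half of the negatives.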

✋ Hands-on Activity: Exploring ROC Curves and Classification Metrics

Interact with this demonstration to see how the ROC curve, confusion matrix, and classification metrics change as you adjust the classification threshold:

ROC Curve and Classification Metrics

Classification threshold: 0.50 (slider from 0.00 to 1.00)

Confusion Matrix

Actual: Positive, Predicted: Positive (True Positive): 26
Actual: Positive, Predicted: Negative (False Negative): 74
Actual: Negative, Predicted: Positive (False Positive): 31
Actual: Negative, Predicted: Negative (True Negative): 969

Accuracy: (TP + TN) / Total = 995 / 1100 = 90.5%
Sensitivity (Recall): TP / (TP + FN) = 26 / 100 = 26.0%
Specificity: TN / (TN + FP) = 969 / 1000 = 96.9%
PPV (Positive Predictive Value): TP / (TP + FP) = 26 / 57 = 45.6%
NPV (Negative Predictive Value): TN / (TN + FN) = 969 / 1043 = 92.9%
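The metrics at this threshold follow directly from the counts. In the sketch below, the false negative and false positive counts are derived from the fractions shown above (26 of 100 actual positives detected, 969 of 1000 actual negatives correctly cleared) rather than read from the display.

```python
# Counts taken from the fractions shown at threshold 0.50.
tp, actual_positives = 26, 100
tn, actual_negatives = 969, 1000
fn = actual_positives - tp   # positives the model missed
fp = actual_negatives - tn   # negatives flagged as positive

metrics = {
    "Accuracy": (tp + tn) / (tp + tn + fp + fn),
    "Sensitivity": tp / (tp + fn),
    "Specificity": tn / (tn + fp),
    "PPV": tp / (tp + fp),
    "NPV": tn / (tn + fn),
}

for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

The numbers illustrate the warning about imbalanced data: accuracy is above 90% even though the model detects only about a quarter of the positive cases.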

6. Advantages and Limitations of Logistic Regression

When Logistic Regression Works Well

✅ Good for:

Binary outcomes (mortality, complications, diagnoses)
When interpretability is important (odds ratios are clinically meaningful)
When you have a moderate number of predictors
When relationships are approximately linear in log-odds space
When you need well-calibrated probability estimates

When to Consider Alternatives

❌ Consider other methods when:

You have very few events (rare outcomes) relative to predictors
Relationships are highly non-linear in log-odds space
You have many predictors relative to sample size (consider regularization)
You need to model complex interactions automatically (consider machine learning)
You have clustered or longitudinal data (consider mixed-effects models)

Summary

Key Takeaways

Linear regression fails for binary outcomes - It can produce predictions outside [0,1] and assumes constant variance, which doesn't hold for binary data. We need a method that constrains predictions to valid probability ranges.
Logistic regression uses the logit link function - It models the log-odds (logit) as a linear combination of predictors, then transforms to probabilities using the logistic function. This creates an S-shaped curve that ensures probabilities stay between 0 and 1.
Coefficients translate into odds ratios - Each coefficient is the change in the log-odds of the outcome per one-unit increase in a predictor. The odds ratio = exp(coefficient) describes the corresponding multiplicative effect on the odds. This allows us to quantify risk factors and understand which patient characteristics matter most for outcomes like mortality.
Logistic regression has strengths and limitations - It's interpretable, handles binary outcomes well, and produces calibrated probabilities. However, it assumes linearity in log-odds, requires adequate sample size, and may oversimplify complex relationships. Understanding these trade-offs helps you know when logistic regression is appropriate and when to consider alternatives.

Questions for Reflection

Why can't we simply constrain linear regression predictions to [0,1] and use that for binary outcomes? What problems would this approach still have?
In the Vizient-like model, why might sepsis (coefficient 1.10) have a smaller effect than mechanical ventilation (coefficient 1.40)? What clinical factors could explain this?
How would you interpret an odds ratio of 2.0 for a continuous predictor like age? How does this differ from interpreting an odds ratio of 2.0 for a binary predictor like "male"?
What additional variables might improve a GI bleed mortality prediction model? What about interaction terms (e.g., sepsis × age)?
How might the coefficients in this model differ if you were predicting mortality in a different patient population (e.g., pediatric patients, elective surgery patients)?

