Decision Trees & Ensemble Methods

A study of three supervised learning algorithms applied to the Titanic survival dataset, examining how ensemble methods reduce overfitting and improve prediction accuracy on real data.

The Problem
Binary Classification Under Uncertainty

Given passenger attributes, can a model reliably predict who survived the 1912 Titanic disaster? For each passenger the model outputs 1 (survived) or 0 (did not survive).

The core challenge is learning genuine patterns rather than memorizing training examples. A model that overfits scores well on training data but fails on new passengers. Ensemble methods are designed to fix exactly that.

Motivation
Why This Matters

Classification problems appear everywhere cadets and officers operate. Knowing how to select and apply the right classifier is a critical decision-making skill in any environment where a wrong prediction has real consequences.

Three important domains in the Army are predictive maintenance, medical triage, and personnel readiness. Anticipating equipment failure before it happens keeps fleets mission-ready. Consistent data-driven prioritization assists medics under stress when time and personnel are limited. Identifying attrition risk early gives commanders time to intervene. In each case, the stakes of a wrong prediction are real, which is exactly why understanding when and how to apply these tools matters.

The Dataset
Titanic Passenger Records (Kaggle)

The Kaggle Titanic dataset contains passenger records from the disaster with known survival outcomes. Seven features were used as inputs: passenger class, sex, age, siblings/spouses aboard, parents/children aboard, ticket fare, and embarkation port.

891 Training Records · 7 Input Features · 80/20 Train/Test Split
Methodology
How the Analysis Was Conducted

Three models were trained using Python's scikit-learn library: a single Decision Tree (max_depth=5), a Bagging Classifier (n_estimators=100, base trees with max_depth=5), and a Random Forest (n_estimators=100, max_depth=5). All three used the same 80/20 train-test split with random_state=42 for reproducibility. Missing numerical values such as Age were imputed with the median of the known values; categorical variables (Sex, Embarked) were label-encoded, converting text categories into integers so the models could process them. Performance was measured by accuracy on the held-out test set.
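The sketch below shows one way to reproduce that setup, assuming the Kaggle train.csv file with its standard column names (Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Survived); the original analysis may differ in minor preprocessing details.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score

# Kaggle Titanic training file, assumed to sit in the working directory
df = pd.read_csv("train.csv")

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
X, y = df[features].copy(), df["Survived"]

# Impute missing numeric values with the median of the known values
X["Age"] = X["Age"].fillna(X["Age"].median())
X["Fare"] = X["Fare"].fillna(X["Fare"].median())

# Label-encode categorical columns (text categories -> integers);
# astype(str) turns any missing Embarked values into their own category
for col in ["Sex", "Embarked"]:
    X[col] = LabelEncoder().fit_transform(X[col].astype(str))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Bagging": BaggingClassifier(              # 100 bootstrap-sampled depth-5 trees
        estimator=DecisionTreeClassifier(max_depth=5),  # base_estimator in older sklearn
        n_estimators=100, random_state=42),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=5, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```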

Three Algorithms, One Dataset, Real Results

Each model addresses the limitations of the previous one. The results below come from training and evaluating all three on the Titanic dataset using identical preprocessing and splits.

How Each Algorithm Works
Decision Tree

A single model that learns to split data by asking yes or no questions at each node. It picks the feature and threshold that best separates the classes at each step. Fast to train and easy to visualize, but it tends to memorize training data rather than learn general patterns.
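The yes/no structure is easy to see by printing a fitted tree's rules with scikit-learn's export_text. The tiny fare/age dataset below is purely illustrative, not the Titanic data:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data, illustrative only: [fare, age] -> survived (1) / did not survive (0)
X = [[7.25, 22], [71.28, 38], [7.92, 26], [53.10, 35], [8.05, 28], [51.86, 2]]
y = [0, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each printed line is a yes/no threshold question on a single feature
print(export_text(tree, feature_names=["fare", "age"]))
```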

Split Criterion: Gini Impurity
G(t) = 1 − p₁² − p₂² − … − pₖ²
Worked split: 10 passengers (6 survived / 4 died)
Before split:  G = 1 − 0.6² − 0.4² = 0.480
After best split (children of 8 and 2 passengers, with 6/2 and 0/2):  G̃ = (8/10)(0.375) + (2/10)(0.000) = 0.300
Gini gain:  0.480 − 0.300 = 0.180

G = 0 is a perfectly pure node. Every split picks the threshold that minimizes the weighted Gini of the two children.
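A few lines of Python reproduce the worked split above; the child class counts (6 survived / 2 died and 0 survived / 2 died) are the ones behind the figures shown:

```python
def gini(counts):
    """Gini impurity of a node given its class counts, e.g. [survived, died]."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

parent = gini([6, 4])                            # 0.480: 10 passengers, 6 survived / 4 died
left, right = gini([6, 2]), gini([0, 2])         # children: 0.375 and 0.000
weighted = (8 / 10) * left + (2 / 10) * right    # 0.300
print(f"Gini gain = {parent - weighted:.3f}")    # 0.180
```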

Diagram: a single tree splitting on Feature A, then Features B and C, with yes/no branches (one tree using all features).
Fast to train · Fully interpretable · High variance · Overfits easily
Bagging

Bootstrap Aggregating trains many separate trees, each on a different random sample of the training data drawn with replacement. Final predictions come from a majority vote or average across all trees. Every tree still has access to all features at every split.

Bootstrap Sampling + Variance Reduction
P(sample in bag) ≈ 63.2% = 1 − (890/891)⁸⁹¹
Var(T̄_B) = ρσ² + (1 − ρ)σ²/B

ρ is the pairwise correlation between trees. The ρσ² term is a hard floor. Adding more trees never removes it. Only reducing ρ does. That is exactly what random forests target.
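Both quantities are easy to check numerically. The sketch below uses n = 891 for the bootstrap probability and treats σ² = 1 with the ρ ≈ 0.76 figure quoted for bagging further down as illustrative inputs:

```python
n = 891
p_in_bag = 1 - (1 - 1 / n) ** n      # probability a given record appears in a bootstrap sample
print(f"P(sample in bag) = {p_in_bag:.3f}")      # ~0.632, i.e. about 1 - 1/e

def ensemble_var(rho, sigma2, B):
    """Variance of the average of B equally correlated trees."""
    return rho * sigma2 + (1 - rho) * sigma2 / B

# The (1 - rho) * sigma2 / B term shrinks toward zero, but rho * sigma2 remains as a floor
for B in (1, 10, 100, 10_000):
    print(B, round(ensemble_var(rho=0.76, sigma2=1.0, B=B), 4))
```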

Diagram: three bagged trees voting by majority, with all features available at each split.
Reduces variance · Less overfitting · Trees stay correlated · Good accuracy
Random Forest

Like bagging but with one critical addition. At each node split, only a random subset of features is considered rather than all of them. This prevents any single strong feature from dominating every tree, making the trees genuinely different from each other and the ensemble far more powerful.

Feature Subset Rule + Decorrelation
m = ⌊√p⌋  features per split  (p=7 → m=2)

Bagging: ρ ≈ 0.76  →  floor = 0.76σ²
RF:      ρ ≈ 0.67  →  floor = 0.67σ²

Forcing each split to consider only m features stops dominant features from appearing in every tree. Cutting ρ from 0.76 to 0.67 reduces the irreducible variance floor. No number of bagging trees can match this.
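In scikit-learn the feature-subset rule is a single parameter. A minimal sketch of the contrast, holding everything else equal (parameter names follow recent scikit-learn versions):

```python
import math
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

p = 7
m = math.floor(math.sqrt(p))
print(m)    # 2 features considered at each split

# Bagging: every tree may use all p features at every split
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=5),   # base_estimator in older sklearn
    n_estimators=100, random_state=42)

# Random Forest: each split draws a fresh random subset of m = floor(sqrt(p)) features
forest = RandomForestClassifier(
    n_estimators=100, max_depth=5, max_features="sqrt", random_state=42)
```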

Diagram: three trees, each splitting on a different random feature subset (A, C; B, D; A, E), combined by majority vote.
Lowest variance · Decorrelated trees · Best accuracy · Less interpretable
The Core Difference

Both bagging and random forests build many trees on bootstrap samples of the data. The key difference is that random forests restrict each split to a random subset of features, typically the square root of the total feature count.

This forces trees to rely on different signals. When trees are genuinely diverse, averaging their predictions cancels out more errors and produces a stronger model.

Results on the Titanic Dataset
Model Accuracy: Train vs. Test

All three models use max_depth=5. Random Forest achieves the highest test accuracy at 81.6%, beating the single Decision Tree (79.3%) and Bagging (79.9%). The gap between train and test bars shows how much each model overfits.

Bar chart comparing train and test accuracy for Decision Tree, Bagging, and Random Forest on the Titanic dataset
Overfitting Gap by Model

The train−test accuracy gap measures how much each model memorizes vs. generalizes. Decision Tree: 5.4 pp. Bagging actually widens it to 7.2 pp because its highly correlated trees all overfit in the same direction. Random Forest closes it back down to 4.5 pp by decorrelating the trees, giving it the tightest gap of the three.

Bar chart showing the overfitting gap (train minus test accuracy) for each model
Test Accuracy vs Number of Trees

Both methods improve quickly, then flatten out. The reason is mathematical: the variance of a B-tree ensemble is Var = ρσ² + (1−ρ)σ²/B. The second term shrinks as you add trees, but the first term, ρσ², is a hard floor that no number of trees can remove. At 10,000 trees the second term is essentially zero, but accuracy still cannot exceed the ceiling set by that floor. The only way past it is to lower ρ, the correlation between trees. That is exactly what Random Forest does: by forcing each split to see only a random subset of features, its trees disagree more often, ρ drops, and the floor is lower. Bagging never reduces ρ because every tree can use every feature, so it levels off below Random Forest. A Decision Tree is a single tree by definition and is not shown.

Line chart showing test accuracy vs number of trees for Bagging and Random Forest on the Titanic dataset
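A curve like the one above can be produced by sweeping n_estimators; the sketch below assumes the X_train/X_test split from the pipeline sketch in the Methodology section, and the grid of tree counts is arbitrary:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

for B in (1, 5, 10, 25, 50, 100, 200):
    bag = BaggingClassifier(
        estimator=DecisionTreeClassifier(max_depth=5),
        n_estimators=B, random_state=42).fit(X_train, y_train)
    rf = RandomForestClassifier(
        n_estimators=B, max_depth=5, random_state=42).fit(X_train, y_train)
    print(B, bag.score(X_test, y_test), rf.score(X_test, y_test))
```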
Random Forest: Feature Importance

Sex is by far the strongest predictor of survival, consistent with historical accounts of "women and children first." Passenger class and fare, both proxies for socioeconomic status, rank second and third. Age contributes meaningfully; port of embarkation contributes very little.

Horizontal bar chart showing Random Forest feature importances for Titanic survival prediction
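An importance chart like this one can be generated directly from the fitted forest; the sketch assumes the models dictionary and features list from the earlier pipeline sketch:

```python
import matplotlib.pyplot as plt

rf = models["Random Forest"]            # fitted RandomForestClassifier from the pipeline sketch
ranked = sorted(zip(features, rf.feature_importances_), key=lambda pair: pair[1])

names, scores = zip(*ranked)
plt.barh(names, scores)
plt.xlabel("Importance (mean decrease in Gini impurity)")
plt.title("Random Forest feature importance")
plt.tight_layout()
plt.savefig("feature_importance.png")
```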
Takeaways

Ensembles are generally the stronger choice, but understanding your data before applying a model matters just as much as the model itself. Bagging underperformed here not because it is a weaker method, but because this dataset has a few heavily weighted features: sex, class, and age dominate survival prediction. When every tree sees all features, they all split on the same signals and stay correlated. Random Forest forces diversity by restricting what each split can see, which is why it pulls ahead.

The lesson is not just that Random Forest wins. It is that knowing why it wins, and when it would not, is what makes the difference between applying a tool and understanding one.

How This Was Built

From domain registration to a live research site, built entirely through Claude Enterprise with no local development environment. Every commit pushed to GitHub deploys automatically to dennis-pezan.com.

Step 01
Domain & Professional Email

Purchased dennis-pezan.com through IONOS and configured a professional email through Zoho.

Step 02
Claude Enterprise Access

Obtained Claude Enterprise access, which includes Claude Code. No local IDE, no local files, no server configuration.

Step 03
Landing Page & Research Pages

All pages built through Claude Code. The ML analysis (Kaggle data, Python, scikit-learn, matplotlib charts) was scripted the same way and embedded directly as PNGs.

Step 04
Version Control via GitHub

Two repos: JettDrums/Landing-Page and JettDrums/MA289. Claude Code commits and pushes directly, nothing stored locally.

Step 05
Automated Deployment via Vercel

Both repos connected to Vercel. Every push triggers an automatic deploy and changes go live at dennis-pezan.com in seconds.

Deployment Pipeline
IONOS
dennis-pezan.com domain + professional email
Zoho
Professional email tied to the domain
Claude Enterprise
Site design, code, and ML analysis via Claude Code
GitHub
JettDrums/Landing-Page + JettDrums/MA289
Vercel Auto-Deploy
dennis-pezan.com — live on every push