Decision Trees & Ensemble Methods

MA289 / Machine Learning

A study of three supervised learning algorithms applied to the Titanic survival dataset, examining how ensemble methods reduce overfitting and improve prediction accuracy on real data.

The Problem

Binary Classification Under Uncertainty

Given passenger attributes, can a model reliably predict who survived the 1912 Titanic disaster? For each passenger the model outputs 1 (survived) or 0 (did not survive).

The core challenge is learning genuine patterns rather than memorizing training examples. A model that overfits scores well on training data but fails on new passengers. Ensemble methods are designed to reduce exactly that kind of overfitting.

Motivation

Why This Matters

Classification problems appear everywhere cadets and officers operate. Knowing how to select and apply the right classifier is a critical decision-making skill in any environment where a wrong prediction has real consequences.

Three important domains in the Army are predictive maintenance, medical triage, and personnel readiness. Anticipating equipment failure before it happens keeps fleets mission-ready. Consistent data-driven prioritization assists medics under stress when time and personnel are limited. Identifying attrition risk early gives commanders time to intervene. In each case, the stakes of a wrong prediction are real, which is exactly why understanding when and how to apply these tools matters.

The Dataset

Titanic Passenger Records (Kaggle)

The Kaggle Titanic dataset contains passenger records from the disaster with known survival outcomes. Seven features were used as inputs: passenger class, sex, age, siblings/spouses aboard, parents/children aboard, ticket fare, and embarkation port.

891

Training Records

7

Input Features

80/20

Train/Test Split

Methodology

How the Analysis Was Conducted

Three models were trained using Python's scikit-learn library: a single Decision Tree (max_depth=5), a Bagging Classifier (n_estimators=100, base trees with max_depth=5), and a Random Forest (n_estimators=100, max_depth=5). All three used the same 80/20 train-test split with random_state=42 for reproducibility. Missing numerical values such as Age were imputed with the median of the known values. Categorical variables (Sex, Embarked) were label-encoded, converting text categories into integers so the models could process them. Performance was measured by accuracy on the held-out test set.
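The setup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the column names follow the Kaggle CSV, the helper names (`preprocess`, `build_models`, `evaluate`) are hypothetical, and filling missing Embarked values with the most common port is an assumption the text does not state.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# The seven input features, using the Kaggle column names
FEATURES = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]


def preprocess(df):
    """Median-impute Age and convert Sex and Embarked to integer codes."""
    X = df[FEATURES].copy()
    X["Age"] = X["Age"].fillna(X["Age"].median())
    for col in ("Sex", "Embarked"):
        # Assumption: missing categories are filled with the most common value
        X[col] = X[col].fillna(X[col].mode()[0])
        X[col] = X[col].astype("category").cat.codes
    return X


def build_models(seed=42):
    """The three classifiers with the hyperparameters quoted in the text."""
    return {
        "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=seed),
        "Bagging": BaggingClassifier(
            DecisionTreeClassifier(max_depth=5), n_estimators=100, random_state=seed
        ),
        "Random Forest": RandomForestClassifier(
            n_estimators=100, max_depth=5, random_state=seed
        ),
    }


def evaluate(df, seed=42):
    """Return {model name: (train accuracy, test accuracy)} on an 80/20 split."""
    X, y = preprocess(df), df["Survived"]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    scores = {}
    for name, model in build_models(seed).items():
        model.fit(X_tr, y_tr)
        scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    return scores
```

Calling `evaluate` on a DataFrame loaded from the Kaggle training file would return one train/test accuracy pair per model. Exact numbers may differ from those reported here depending on library version and encoding details.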

Machine Learning Analysis

Three Algorithms, One Dataset, Real Results

Each model builds on the limitations of the previous one. The results below come from training and evaluating all three on the Titanic dataset using identical preprocessing and splits.

Section 01

How Each Algorithm Works

Decision Tree

A single model that learns to split data by asking yes or no questions at each node. It picks the feature and threshold that best separates the classes at each step. Fast to train and easy to visualize, but it tends to memorize training data rather than learn general patterns.

Split Criterion: Gini Impurity

$G(t) = 1 - p_1^2 - p_2^2 - \dots - p_k^2$

Worked split: 10 passengers (6 survived / 4 died)

Before split: $G = 1 - 0.6^2 - 0.4^2 = 0.480$

After best split: $\tilde{G} = (8/10)(0.219) + (2/10)(0.500) = 0.275$

Gini gain: $0.480 - 0.275 = 0.205$

$G = 0$ is a perfectly pure node. Every split picks the threshold that minimizes the weighted Gini of the two children.
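The Gini arithmetic can be checked with a few lines of plain Python. This is a small sketch: the function names are illustrative, and the child counts in `children` are a hypothetical split chosen for the example, not the split behind the figures quoted above.

```python
def gini(counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(children):
    """Size-weighted average Gini of child nodes, each given as class counts."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


parent = (6, 4)               # 6 survived, 4 died, as in the worked split
parent_gini = gini(parent)    # 1 - 0.6^2 - 0.4^2

# Hypothetical children (not from the text): counts are (survived, died)
children = [(5, 1), (1, 3)]
gain = parent_gini - weighted_gini(children)

# The text's own weighted sum, using the child impurities it quotes
text_after = (8 / 10) * 0.219 + (2 / 10) * 0.500

print(round(parent_gini, 3), round(gain, 3), round(text_after, 3))
```

A tree learner effectively repeats the `weighted_gini` calculation for every candidate feature and threshold and keeps the split with the largest gain.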

Fast to train · Fully interpretable · High variance · Overfits easily

Bagging

Bootstrap Aggregating trains many separate trees, each on a different random sample of the training data drawn with replacement. Final predictions come from a majority vote or average across all trees. Every tree still has access to all features at every split.

Bootstrap Sampling + Variance Reduction

$P(\text{sample in bag}) = 1 - (890/891)^{891} \approx 63.2\%$

$\mathrm{Var}(\bar{T}_B) = \rho\sigma^2 + (1 - \rho)\sigma^2 / B$

ρ is the pairwise correlation between trees. The ρσ² term is a hard floor. Adding more trees never removes it. Only reducing ρ does. That is exactly what random forests target.
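The 63.2% figure can be reproduced both from the closed form and by simulation. The sketch below uses only the standard library; the function names and the number of simulated bootstrap draws are arbitrary choices for illustration.

```python
import random


def in_bag_probability(n):
    """Chance a given record appears at least once in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_in_bag_fraction(n, trials=200, seed=42):
    """Average share of distinct records drawn across simulated bootstrap samples."""
    rng = random.Random(seed)
    distinct = 0
    for _ in range(trials):
        # Draw n indices with replacement and count how many are unique
        distinct += len({rng.randrange(n) for _ in range(n)})
    return distinct / (trials * n)


p_exact = in_bag_probability(891)
p_sim = simulated_in_bag_fraction(891)
print(f"closed form: {p_exact:.3f}, simulated: {p_sim:.3f}")
```

The closed form tends to 1 − 1/e as n grows, so the result is close to 63.2% for any reasonably large sample size, not just 891.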

Reduces variance · Less overfitting · Trees stay correlated · Good accuracy

Random Forest

Like bagging but with one critical addition. At each node split, only a random subset of features is considered rather than all of them. This prevents any single strong feature from dominating every tree, which makes the trees less correlated with each other and the averaged ensemble more accurate.

Feature Subset Rule + Decorrelation

$m = \lfloor \sqrt{p} \rfloor$ features per split ($p = 7 \to m = 2$)

Bagging: $\rho \approx 0.76 \to \text{floor} = 0.76\,\sigma^2$

RF: $\rho \approx 0.67 \to \text{floor} = 0.67\,\sigma^2$

Forcing each split to consider only m features stops dominant features from appearing in every tree. Cutting ρ from 0.76 to 0.67 reduces the irreducible variance floor. No number of bagging trees can match this.
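As a quick check on the numbers above, the snippet below computes the subset size for p = 7 and the relative drop in the variance floor when ρ falls from 0.76 to 0.67. The function names are illustrative, and σ² is set to 1 because only the ratio matters here.

```python
import math


def features_per_split(p):
    """Square-root rule: floor(sqrt(p)), with at least one feature."""
    return max(1, math.isqrt(p))


def variance_floor(rho, sigma2=1.0):
    """The rho * sigma^2 term that remains no matter how many trees are added."""
    return rho * sigma2


m = features_per_split(7)
floor_bagging = variance_floor(0.76)
floor_forest = variance_floor(0.67)
relative_drop = (floor_bagging - floor_forest) / floor_bagging

print(f"m = {m}, floor drops by {relative_drop:.1%}")
```

With these ρ values the floor is roughly 12% lower for the random forest, a gain that adding trees alone cannot deliver.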

Lowest variance · Decorrelated trees · Best accuracy · Less interpretable

The Core Difference

Both bagging and random forests build many trees on bootstrap samples of the data. The key difference is that random forests restrict each split to a random subset of features, typically the square root of the total feature count.

This forces trees to rely on different signals. When trees are genuinely diverse, averaging their predictions cancels out more errors and produces a stronger model.
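One plausible way to see this diversity is to measure how strongly the individual trees agree. The sketch below estimates the average pairwise correlation between per-tree predicted probabilities for a bagging ensemble and a random forest. It runs on synthetic data from `make_classification`, not the Titanic records, and `mean_tree_correlation` is a hypothetical helper. It is not necessarily how the ρ values reported on this page were obtained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


def mean_tree_correlation(ensemble, X):
    """Average pairwise Pearson correlation of per-tree P(class 1) on X."""
    probs = np.array([tree.predict_proba(X)[:, 1] for tree in ensemble.estimators_])
    corr = np.corrcoef(probs)
    # Keep only the upper triangle so each pair of trees is counted once
    return float(corr[np.triu_indices_from(corr, k=1)].mean())


# Synthetic stand-in with 7 features (an assumption for illustration only)
X, y = make_classification(
    n_samples=400, n_features=7, n_informative=3, random_state=42
)

bagging = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5), n_estimators=50, random_state=42
).fit(X, y)
forest = RandomForestClassifier(
    n_estimators=50, max_depth=5, max_features="sqrt", random_state=42
).fit(X, y)

rho_bagging = mean_tree_correlation(bagging, X)
rho_forest = mean_tree_correlation(forest, X)
print(f"bagging rho ~ {rho_bagging:.2f}, random forest rho ~ {rho_forest:.2f}")
```

On data with a few dominant features the forest's correlation would typically come out lower, though the size of the difference depends on the dataset.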

Section 02

Results on the Titanic Dataset

Model Accuracy: Train vs. Test

All three models use max_depth=5. Random Forest achieves the highest test accuracy at 81.6%, beating the single Decision Tree (79.3%) and Bagging (79.9%). The gap between train and test bars shows how much each model overfits.

Bar chart comparing train and test accuracy for Decision Tree, Bagging, and Random Forest on the Titanic dataset

Overfitting Gap by Model

The train−test accuracy gap measures how much each model memorizes rather than generalizes. The Decision Tree's gap is 5.4 percentage points (pp). Bagging actually widens it to 7.2 pp because its highly correlated trees all overfit in the same direction. Random Forest narrows it to 4.5 pp by decorrelating the trees, giving it the tightest gap of the three.

Bar chart showing the overfitting gap (train minus test accuracy) for each model

Test Accuracy vs Number of Trees

Both methods improve quickly, then flatten out. The reason is mathematical: the variance of a B-tree ensemble is Var = ρσ² + (1−ρ)σ²/B. The second term shrinks as you add trees, but the first term, ρσ², is a hard floor that no number of trees can remove. At 10,000 trees the second term is essentially zero, but accuracy still cannot exceed the ceiling set by that floor. The only way past it is to lower ρ, the correlation between trees. That is exactly what Random Forest does: by forcing each split to see only a random subset of features, its trees disagree more often, ρ drops, and the floor is lower. Bagging does nothing to lower ρ beyond what bootstrap resampling already provides, because every tree can use every feature, so it levels off below Random Forest. A Decision Tree is a single tree by definition and is not shown.
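The plateau can be seen by evaluating the variance formula directly. The sketch below plugs in the ρ values quoted in the Random Forest section (0.76 for bagging, 0.67 for the forest) and sets σ² to 1 for convenience. The function name is illustrative.

```python
def ensemble_variance(rho, B, sigma2=1.0):
    """Var = rho * sigma^2 + (1 - rho) * sigma^2 / B for an average of B trees."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / B


for B in (1, 10, 100, 10_000):
    bag = ensemble_variance(0.76, B)
    rf = ensemble_variance(0.67, B)
    print(f"B={B:>6}  bagging={bag:.4f}  random forest={rf:.4f}")
```

Both columns start at σ² for a single tree and settle toward their respective floors. Past a few hundred trees the change is in the fourth decimal place.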

Random Forest: Feature Importance

Sex is by far the strongest predictor of survival, consistent with historical accounts of "women and children first." Passenger class and fare, both proxies for socioeconomic status, rank second and third. Age contributes meaningfully; port of embarkation contributes very little.
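Rankings like this are usually read from a fitted forest's `feature_importances_` attribute, which holds impurity-based importances that sum to 1. The sketch below shows the mechanics on synthetic data in which only one column drives the label. The data, the feature names, and the helper `ranked_importances` are illustrative assumptions, not the project's code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def ranked_importances(model, feature_names):
    """Return impurity-based importances as a Series sorted high to low."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False)


# Synthetic stand-in data (not the Titanic records): only column 0 drives the label
rng = np.random.default_rng(42)
X = rng.random((300, 3))
y = (X[:, 0] > 0.5).astype(int)

forest = RandomForestClassifier(
    n_estimators=100, max_depth=5, random_state=42
).fit(X, y)

ranking = ranked_importances(forest, ["signal", "noise_a", "noise_b"])
print(ranking)
```

Impurity-based importances can overstate features with many distinct values, such as fare, so they are best treated as a rough ranking rather than a precise measure.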

Takeaways

Ensembles are generally the stronger choice, but understanding your data before applying a model matters just as much as the model itself. Bagging underperformed here not because it is a weaker method, but because this dataset has a few heavily weighted features: sex, passenger class, and fare dominate survival prediction. When every tree sees all features, they all split on the same signals and stay correlated. Random Forest forces diversity by restricting what each split can see, which is why it pulls ahead.

The lesson is not just that Random Forest wins. It is that knowing why it wins, and when it would not, is what makes the difference between applying a tool and understanding one.

Interactive Demo

Random Forest Explorer

Adjust the sliders to see how each hyperparameter actually affected the model when trained on the Titanic dataset. Every result is a real measurement, not a formula.

Hyperparameters / Titanic dataset (7 features)

Number of Trees 50

More trees average out individual errors. On Titanic, most of the accuracy gain arrives by tree 20. Past 50 the test score plateaus and you're just adding compute time.

Max Tree Depth 5

Shallow trees underfit; deep trees memorize. Watch the overfitting gap widen as depth increases. Depth 4-6 tends to be the sweet spot on this dataset.

Features per Split 3

How many of the 7 features each split may consider. The square-root rule gives ⌊√7⌋ = 2 (√7 ≈ 2.65). Using all 7 removes the decorrelation benefit, and the forest behaves like bagging.
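The three quantities the explorer reports could be measured along these lines for any slider setting. This is a sketch under stated assumptions: it uses synthetic 7-feature data rather than the Titanic records, the helper `measure` is hypothetical, and "CV Std" is assumed to mean the standard deviation of 5-fold cross-validation accuracy on the training split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split


def measure(X, y, n_trees=50, depth=5, max_features=3, seed=42):
    """Fit one forest and report test accuracy, train-test gap, and CV spread."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    rf = RandomForestClassifier(
        n_estimators=n_trees,
        max_depth=depth,
        max_features=max_features,
        random_state=seed,
    )
    # Cross-validate on the training split only, then fit once for the final scores
    cv_scores = cross_val_score(rf, X_tr, y_tr, cv=5)
    rf.fit(X_tr, y_tr)
    test_acc = rf.score(X_te, y_te)
    return {
        "test_accuracy": test_acc,
        "overfit_gap": rf.score(X_tr, y_tr) - test_acc,
        "cv_std": cv_scores.std(),
    }


# Synthetic stand-in with 7 features (an assumption for illustration only)
X, y = make_classification(
    n_samples=400, n_features=7, n_informative=3, random_state=42
)

# Sweep the features-per-split slider across a few settings
results = {m: measure(X, y, max_features=m) for m in (2, 3, 7)}
for m, row in results.items():
    print(m, {k: round(float(v), 3) for k, v in row.items()})
```

Setting `max_features=7` lets every split see every feature, which is the configuration the text describes as equivalent to bagging.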

Real results from Titanic training run

Test Accuracy

Overfit Gap

CV Std


Project Workflow

How This Was Built

From domain registration to a live research site, built entirely through Claude Enterprise with no local development environment. Every commit pushed to GitHub deploys automatically to dennis-pezan.com.

Step 01

Domain & Professional Email

Purchased dennis-pezan.com through IONOS and configured a professional email through Zoho.

Step 02

Claude Enterprise Access

Obtained Claude Enterprise access, which includes Claude Code. No local IDE, no local files, no server configuration.

Step 03

Landing Page & Research Pages

All pages built through Claude Code. The ML analysis (Kaggle data, Python, scikit-learn, matplotlib charts) was scripted the same way and embedded directly as PNGs.

Step 04

Version Control via GitHub

Two repos: JettDrums/Landing-Page and JettDrums/MA289. Claude Code commits and pushes directly, nothing stored locally.

Step 05

Automated Deployment via Vercel

Both repos connected to Vercel. Every push triggers an automatic deploy and changes go live at dennis-pezan.com in seconds.

Deployment Pipeline

IONOS

dennis-pezan.com domain registration

Zoho

Professional email tied to the domain

Claude Enterprise

Site design, code, and ML analysis via Claude Code

GitHub

JettDrums/Landing-Page + JettDrums/MA289

Vercel Auto-Deploy

dennis-pezan.com — live on every push