
Ensemble ML system (XGBoost + LightGBM + LR, 103 features) predicting NCAA tournament winners with a live GitHub Pages dashboard
An ML system built for the CMU Second Annual March Madness Machine Learning Competition. It uses a weighted ensemble of XGBoost (40%), LightGBM (40%), and Logistic Regression (20%), trained on 103 differential features per matchup, including KenPom, Barttorvik T-Rank, NET rankings, and seed history. Under walk-forward CV on real Kaggle NCAA data, it achieved 72.3% accuracy (Men) and 75.1% (Women). A GitHub Actions pipeline retrains the models and deploys a live bracket dashboard to GitHub Pages.
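As a rough illustration of the 40/40/20 weighting described above, the sketch below blends per-model win probabilities into one ensemble probability. The function name `blend_probabilities`, the dictionary keys, and the example probabilities are hypothetical and are not taken from the repository; the actual code may combine the models differently (for example, after calibration).

```python
# Minimal sketch of a weighted probability blend (assumed, not the project's code).
# Weights mirror the 40% / 40% / 20% split stated in the description.
WEIGHTS = {"xgboost": 0.40, "lightgbm": 0.40, "logreg": 0.20}


def blend_probabilities(model_probs, weights=WEIGHTS):
    """Return the weighted average of per-model win probabilities.

    model_probs maps a model name to P(team A beats team B) from that model.
    """
    if set(model_probs) != set(weights):
        raise ValueError("model names must match the weight keys")
    total = sum(weights.values())  # normalise in case weights do not sum to 1
    return sum(weights[name] * model_probs[name] for name in weights) / total


# Hypothetical per-model outputs for a single matchup.
example = {"xgboost": 0.70, "lightgbm": 0.60, "logreg": 0.50}
blended = blend_probabilities(example)
print(f"Ensemble win probability: {blended:.2f}")
```

Dividing by the weight total keeps the result a valid probability even if the weights are later retuned and no longer sum exactly to 1.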
NCAA March Madness bracket prediction using an ML ensemble validated with walk-forward CV.
• Models: XGBoost (40%) + LightGBM (40%) + Logistic Regression (20%); isotonic regression calibration; optional Optuna Bayesian hyperparameter tuning.
• Features: 103 differential features per matchup: box-score stats, KenPom AdjEM/AdjO/AdjD, Barttorvik T-Rank, NET ranking, Massey ordinals (30+ systems), SOS, coaching tenure, conference strength, and tournament seed history.
• Data: Kaggle March Machine Learning Mania 2026 dataset (124k+ real games); external enrichment from KenPom and Barttorvik.
• Validation: walk-forward CV (train on years ≤N, validate on N+1), so no future seasons leak into training; 72.3% Men's / 75.1% Women's CV accuracy.
• Pipeline: run_bracket.py simulates full round-by-round brackets; export_site_data.py publishes model outputs to the live dashboard.
• CI/CD: GitHub Actions for CI tests, weekly model retraining with Kaggle secrets, and GitHub Pages deployment.
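
The walk-forward scheme in the Validation bullet (train on years ≤N, validate on N+1) can be sketched as a simple split generator. This is an assumed illustration, not the repository's implementation: `walk_forward_splits` and `min_train_seasons` are hypothetical names, and the season list is made up.

```python
# Minimal sketch of walk-forward CV splits by season (assumed, not the project's code).
def walk_forward_splits(seasons, min_train_seasons=1):
    """Yield (train_seasons, validation_season) pairs in chronological order.

    Each fold trains on every season up to and including N and validates
    on the next season, so validation data is never older than training data.
    """
    ordered = sorted(set(seasons))
    for i in range(min_train_seasons, len(ordered)):
        yield ordered[:i], ordered[i]


# Hypothetical seasons; require at least two seasons of training data.
splits = list(walk_forward_splits([2021, 2022, 2023, 2024], min_train_seasons=2))
for train, valid in splits:
    print(f"train on {train}, validate on {valid}")
```

Each fold's training window grows by one season, which mirrors how the model would be used in practice: fit on all past tournaments, then predict the next one.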