This loan approval AI is 98.8% accurate.
It also breaks federal law.
A forensic audit of algorithmic bias in US mortgage lending, across 50,000 real HMDA applications. XGBoost, SHAP, and threshold debiasing. What the model sees, what it hides, and what the law says about it.
A rule written in 1978, still the line between bias and crime.
If a protected group is approved at less than 80% of the most-approved group's rate, US regulators treat it as evidence of discrimination. Simple math, serious consequences.
67.5% / 85.1% = 0.7934, below the 0.80 legal floor.
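The four-fifths check is a single division followed by a comparison. Below is a minimal sketch using the two approval rates quoted above. The function names and the `floor` argument are illustrative and do not come from any library. With the rounded rates printed on this page the ratio comes to about 0.793; reproducing a fourth decimal would need the unrounded rates.

```python
def disparate_impact_ratio(group_rate: float, reference_rate: float) -> float:
    """Approval rate of a group divided by the reference group's approval rate."""
    if reference_rate <= 0:
        raise ValueError("reference_rate must be positive")
    return group_rate / reference_rate


def passes_four_fifths(group_rate: float, reference_rate: float, floor: float = 0.80) -> bool:
    """True when the ratio meets or exceeds the four-fifths floor."""
    return disparate_impact_ratio(group_rate, reference_rate) >= floor


# Rounded approval rates quoted above: 67.5% vs 85.1%.
ratio = disparate_impact_ratio(0.675, 0.851)
print(f"DIR = {ratio:.3f}, passes: {passes_four_fifths(0.675, 0.851)}")
```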
The model doesn’t see race.
It sees ZIP code.
And ZIP code, in America, is race.
The XGBoost model was trained without race as an input. It learned to use something else.
The 5th most important feature in the model is the percentage of minority population in the census tract where the property sits. Its correlation with the applicant’s actual race is −0.416, stronger than any other feature in the dataset.
The model never saw race. It saw geography. In America, that distinction collapses. This is redlining, rediscovered by a machine.
- 01 · interest_rate · 0.545
- 02 · purchaser_type · 0.382
- 03 · applicant_credit_score_type · 0.045
- 04 · income · 0.003
- 05 · tract_minority_population_percent · 0.011 · Proxy
Unlike the top 4 features, which encode borrower risk signals, this feature encodes where the property is. Geography is not a credit attribute. It is a demographic one.
Three lines of code, four groups restored to legal compliance.
Microsoft’s Fairlearn library includes a ThresholdOptimizer. Apply it once, with the demographic_parity constraint, and the model that was violating federal law is no longer violating federal law. The accuracy cost: 3.39 percentage points. The discrimination, it turns out, was a choice.
White applicants are the reference group (DIR = 1.000 by definition) and are omitted from this view.
The accuracy cost, in plain English.
For every 100 loan applications the original model evaluated, it assigned the correct label (approve or deny) 98.78 times. The debiased model gets 95.39 correct. The tradeoff is 3.39 applications per 100. The gain: full legal compliance across every demographic group the federal government protects.
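from fairlearn.postprocessing import ThresholdOptimizer

# Three lines to go from violation to compliance
optimizer = ThresholdOptimizer(
    estimator=xgb_model,
    constraints="demographic_parity",
    objective="accuracy_score",
)
# race_train / race_test: the race column split the same way as X and y,
# so each sensitive_features array is row-aligned with the data beside it.
optimizer.fit(X_train, y_train, sensitive_features=race_train)
y_pred = optimizer.predict(X_test, sensitive_features=race_test)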
Fairlearn v0.12.0 · Microsoft Research · Open source
The tools exist.
The data is public.
The law is clear.
Every biased lending model in production is a decision,
not a mistake.
The audit, as a tool you can use.
This calibrator runs a simplified version of our audited model directly in your browser. It uses the actual per-race thresholds produced by the ThresholdOptimizer. Change the applicant’s race below and watch the threshold shift. The decision logic is the exact calibration that brings all groups into compliance with federal lending law.
NOTE · This is a calibrated approximation of the production XGBoost model, not the full tree ensemble. It preserves the threshold logic for demonstration purposes.
The default values are calibrated to demonstrate the threshold contrast. Run them as-is, then change the race to see the calibration in action.
White is the reference group. DIR = 1.000 by construction; every other group is measured relative to this baseline.
Don’t believe me? Here’s every number, every formula, every model.
Radical transparency is the opposite of how most lending audits are conducted. Most live behind NDAs. This one doesn’t. Below is every dataset, every preprocessing decision, every hyperparameter, every fairness metric formula, and every library version used to produce the findings on this page. If you find a flaw, the data and code are public. Reproduce it. Challenge it. That is the point.
Training data
- Dataset·Home Mortgage Disclosure Act (HMDA) Loan Application Register (LAR)
- Year·2024 (full disclosure year)
- Data release schedule·2024 represents the most recent full-year disclosure available; 2025 release scheduled Q3 2026
- Geographic scope·National
- Sample size analyzed·50,000 applications
- Sampling·Stratified by loan_type, action_taken, and applicant_race to preserve original distribution
- Held-out test set·10,000 applications (20% of total)
Preprocessing
- Missing value handling·Median imputation for continuous, mode for categorical
- Categorical encoding·One-hot encoding for low-cardinality, target encoding for high-cardinality
- Feature scaling·StandardScaler on continuous features
- Sensitive attributes·race and sex, excluded from the feature matrix
- Train/test split·80/20 stratified by action_taken
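The 80/20 stratified split listed above can be pictured as shuffling each outcome class separately and cutting each at the same fraction. This is a minimal pure-Python sketch, not the audit's own code. The helper name `stratified_split` and the toy `action_taken` column are assumptions for illustration. In practice scikit-learn's `train_test_split` with its `stratify` argument does the same job.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each label's share preserved in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy action_taken column: 1 = originated, 3 = denied (HMDA codes).
action_taken = [1] * 80 + [3] * 20
train_idx, test_idx = stratified_split(action_taken)
print(len(train_idx), len(test_idx))
```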
Models trained
- Logistic Regression·max_iter=1000, C=1.0, penalty=l2 (baseline)
- Random Forest·n_estimators=200, max_depth=None, class_weight=balanced
- XGBoost (selected)·n_estimators=200, max_depth=6, learning_rate=0.1, subsample=0.85
- Test set accuracy·0.9878 · F1: 0.9897 · AUC-ROC: 0.9987
Fairness metrics computed
- Disparate Impact Ratio (DIR)·P(approved | minority) / P(approved | reference)
- Demographic Parity Difference (DPD)·P(approved | minority) − P(approved | reference)
- Equalized Odds Difference (EOD)·max(|TPR difference|, |FPR difference|) across groups
- Reference group·White (largest group, set as baseline per industry convention)
- Legal threshold for DIR·0.80 per EEOC Uniform Guidelines, 29 CFR 1607.4(D)
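The three formulas above can be computed directly from labels, predictions, and group membership. The sketch below is a simplified pairwise version. It compares one group against the reference group, whereas the EOD definition above takes the maximum across all groups. The function names and the toy data are assumptions for illustration.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, true-positive rate, and false-positive rate."""
    counts = defaultdict(lambda: {"n": 0, "sel": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for yt, yp, g in zip(y_true, y_pred, groups):
        c = counts[g]
        c["n"] += 1
        c["sel"] += yp
        if yt == 1:
            c["pos"] += 1
            c["tp"] += yp
        else:
            c["neg"] += 1
            c["fp"] += yp
    return {
        g: {
            "selection_rate": c["sel"] / c["n"],
            "tpr": c["tp"] / c["pos"] if c["pos"] else float("nan"),
            "fpr": c["fp"] / c["neg"] if c["neg"] else float("nan"),
        }
        for g, c in counts.items()
    }


def fairness_metrics(y_true, y_pred, groups, group, reference):
    """DIR, DPD, and EOD for one group against the reference group."""
    rates = group_rates(y_true, y_pred, groups)
    g, r = rates[group], rates[reference]
    return {
        "DIR": g["selection_rate"] / r["selection_rate"],
        "DPD": g["selection_rate"] - r["selection_rate"],
        "EOD": max(abs(g["tpr"] - r["tpr"]), abs(g["fpr"] - r["fpr"])),
    }


# Toy data: 1 = approved, 0 = denied.
y_true = [1, 1, 1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["White"] * 4 + ["Black"] * 4
metrics = fairness_metrics(y_true, y_pred, groups, "Black", "White")
print(metrics)
```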
Intersectional analysis
- Groups computed·5 races × 3 sex categories (Male, Female, Joint) = 15 cells
- Joint·applications with two co-applicants of different recorded sex
- Per-cell metrics·approval rate, DIR, n
- Reference cell·White Male (largest cell, n=4,238 in test set)
- Cells failing 4/5 rule·10 of 15 (66.7%)
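The per-cell table described above amounts to grouping applications by (race, sex), computing each cell's approval rate, and dividing by the reference cell's rate. This is a minimal sketch under that reading. The function name, the record layout, and the toy records are assumptions, not the audit's code or data.

```python
from collections import defaultdict


def intersectional_dir(records, reference=("White", "Male"), floor=0.80):
    """Approval rate, DIR, and n for each (race, sex) cell.

    `records` is an iterable of (race, sex, approved) tuples, approved in {0, 1}.
    """
    totals = defaultdict(lambda: [0, 0])  # cell -> [approved count, n]
    for race, sex, approved in records:
        totals[(race, sex)][0] += approved
        totals[(race, sex)][1] += 1

    if reference not in totals:
        raise ValueError(f"reference cell {reference!r} has no records")
    ref_approved, ref_n = totals[reference]
    ref_rate = ref_approved / ref_n

    table = {}
    for cell, (approved, n) in totals.items():
        rate = approved / n
        dir_ = rate / ref_rate
        table[cell] = {"rate": rate, "DIR": dir_, "n": n, "fails": dir_ < floor}
    return table


# Toy records, three cells only.
records = (
    [("White", "Male", 1)] * 9 + [("White", "Male", 0)] * 1
    + [("Black", "Female", 1)] * 6 + [("Black", "Female", 0)] * 4
    + [("Asian", "Joint", 1)] * 8 + [("Asian", "Joint", 0)] * 2
)
table = intersectional_dir(records)
for cell, row in table.items():
    print(cell, row)
```

Small cells deserve caution: a DIR computed on a handful of applications is noisy, which is why n is reported next to each ratio.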
SHAP explainability
- Method·TreeExplainer on the XGBoost model
- Sample size for explanation·5,000 test observations
- Top features by mean absolute SHAP value: interest_rate (0.545), purchaser_type (0.382), applicant_credit_score_type (0.045), tract_minority_population_percent (0.011)
- Proxy detection threshold·|Pearson r| > 0.15 with race
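The proxy screen above is a correlation filter: any feature whose absolute Pearson r with race exceeds 0.15 is flagged. The sketch below assumes race is encoded as a single 0/1 indicator, which the page does not specify. The helper names and toy values are illustrative. The sign of r depends on that encoding, so only the magnitude is compared.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def flag_proxies(features, race_indicator, threshold=0.15):
    """Return {feature name: r} for features whose |r| with race exceeds the threshold."""
    flagged = {}
    for name, values in features.items():
        r = pearson_r(values, race_indicator)
        if abs(r) > threshold:
            flagged[name] = r
    return flagged


# Toy data: is_white is a 0/1 encoding of race (an assumption of this sketch).
is_white = [1, 1, 1, 0, 0, 0]
features = {
    "tract_minority_population_percent": [10, 20, 15, 70, 80, 60],
    "loan_term": [360, 180, 360, 360, 180, 360],
}
flagged = flag_proxies(features, is_white)
print(flagged)
```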
Debiasing methods tested
- Reweighting·balanced sample weights via Reweighing (the class of this name ships in AIF360 as aif360.algorithms.preprocessing.Reweighing, not in fairlearn.preprocessing)
- ExponentiatedGradient·with DemographicParity constraint, eps=0.01
- ThresholdOptimizer (selected)·per-group thresholds optimized for demographic_parity, objective=accuracy_score
- Selected because·only method achieving 4/4 group compliance with 4/5 rule
- Per-group thresholds applied·White 0.996, Asian 0.963, Black 0.007, Native American 0.004, Pacific Islander 0.002
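Read literally, the per-group thresholds above define a simple rule: approve when the model's approval score meets the cutoff for the applicant's group. The sketch below applies that rule with the listed values. It is a deterministic simplification, since Fairlearn's ThresholdOptimizer can randomize between thresholds when it predicts. The `decide` function and the example score are hypothetical.

```python
# Per-group thresholds as listed above.
THRESHOLDS = {
    "White": 0.996,
    "Asian": 0.963,
    "Black": 0.007,
    "Native American": 0.004,
    "Pacific Islander": 0.002,
}


def decide(score: float, race: str, thresholds=THRESHOLDS) -> bool:
    """Approve when the model's approval score meets the group's threshold."""
    if race not in thresholds:
        raise KeyError(f"no threshold for group {race!r}")
    return score >= thresholds[race]


# The same hypothetical score can land on either side of the cutoff.
score = 0.97
for race in THRESHOLDS:
    print(race, "approve" if decide(score, race) else "deny")
```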
Compute
- Hardware·MacBook Pro M3 Pro, 18GB RAM (XGBoost training)
- Platform·Zerve cloud notebooks (full pipeline reproduction for ZerveHack 2026)
- Total runtime·~12 minutes for full audit pipeline
This audit was conducted as a single-author research artifact for the ZerveHack 2026 hackathon. No vendor relationships influenced the methodology. The dataset is publicly available; the libraries are open source; the methodology follows established conventions in the algorithmic fairness literature. The findings are reproducible from the raw HMDA disclosure files.
For corrections, methodological challenges, or replication assistance, contact the author.
- Audit period·April 24, 2026
- Data version·HMDA 2024 LAR
- Compute environment·Zerve cloud
- Code language·Python 3.12
- Random seeds·numpy 42, sklearn 42, xgb 42
- Reproducibility·full
- License·MIT (code), public (data)