CRIME & COMMUNITIES ANALYSIS

ML investigation into which socioeconomic and demographic features are most correlated with violent crime across U.S. communities.

Pandas, NumPy, scikit-learn, Matplotlib, Jupyter


Overview

A data science project exploring which community-level features are most strongly associated with violent crime rates across the United States. Using a Kaggle dataset of 2,215 communities with 146 columns spanning demographic and socioeconomic variables, we applied multiple regression and classification models to identify the strongest predictors. Preprocessing involved dropping the police data (33 columns with excessive missingness), filtering collinear features, and applying a missing-data threshold. We then compared four regressors (Linear, Random Forest, SVR, Linear SVR) with four classifiers (Logistic, Random Forest, SVC, Linear SVC) and found that the regressors provided richer, more interpretable output.
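
The missing-data threshold mentioned above could look something like the following. This is a minimal sketch, not the project's actual function: the name `drop_sparse_columns`, the 50% cutoff, and the toy column names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Keep only columns whose fraction of missing values is <= max_missing."""
    missing_fraction = df.isna().mean()  # per-column share of NaN values
    keep = missing_fraction[missing_fraction <= max_missing].index
    return df[keep]


# Toy frame with hypothetical column names (not taken from the real dataset).
toy = pd.DataFrame({
    "pct_unemployed": [5.1, 7.3, np.nan, 4.0],         # 25% missing
    "police_per_pop": [np.nan, np.nan, np.nan, 210.0],  # 75% missing
    "median_income": [41000, 52000, 38000, 60000],      # 0% missing
})

cleaned = drop_sparse_columns(toy, max_missing=0.5)
print(list(cleaned.columns))
```

With a 50% cutoff, the mostly empty `police_per_pop` column is dropped while the other two survive. A stricter or looser cutoff changes how many columns remain, so the threshold is worth treating as a tunable parameter.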

Key Results

● Top Predictor: Family structure
● Models Compared: 8 total
● Final Features: 15 selected
● Original Columns: 146
● Communities: 1,993
● Approach: Regression

Methodology

We started with the 146-column dataset and reduced it to 97 features by removing police data, geographic identifiers, and population sums. We then implemented a missing-data threshold function and hand-selected 15 key features based on correlation analysis, avoiding collinear pairs. After comparing regression and classification approaches, we selected regressors for their richer output and evaluated Random Forest, Lasso, KNN, Linear SVR, a deep ANN, and a shallow ANN. Finally, we ran feature importance analysis across the models to identify consistent predictors.
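
The features in this project were hand-selected, but the same idea (rank by correlation with the target, skip anything too collinear with what is already chosen) can be approximated automatically. The sketch below is an assumed greedy variant, not the procedure the project used; the function name `select_features`, the 0.8 collinearity cutoff, and the synthetic columns are hypothetical.

```python
import numpy as np
import pandas as pd


def select_features(df: pd.DataFrame, target: str, k: int = 15,
                    max_pairwise: float = 0.8) -> list:
    """Greedily pick up to k features most correlated with the target,
    skipping any feature too correlated with one already selected."""
    corr = df.corr().abs()
    ranked = corr[target].drop(target).sort_values(ascending=False)
    selected = []
    for name in ranked.index:
        if all(corr.loc[name, chosen] < max_pairwise for chosen in selected):
            selected.append(name)
        if len(selected) == k:
            break
    return selected


# Synthetic data: "b" is a near-copy of "a", "c" is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
c = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),
    "c": c,
    "target": 2.0 * a + c + 0.1 * rng.normal(size=200),
})

selected = select_features(toy, target="target", k=15)
print(selected)
```

Only one of the two near-duplicate columns is kept, alongside the independent one. A manual pass like the one described above can still be preferable when domain knowledge should decide which of two collinear features to retain.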

What We Built

● Comprehensive preprocessing pipeline: 146 columns filtered to 15 key features
● Eight models compared across regression and classification paradigms
● Feature importance analysis across Random Forest, Lasso, ANN, and Linear SVR
● Correlation heatmaps and bar plots for exploratory analysis
● Quartile classification attempted alongside continuous regression
● Ethical considerations in analyzing sensitive demographic data
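
A model comparison of the kind listed above can be sketched with scikit-learn's cross-validation helpers. This is an illustrative setup on synthetic data, not the project's evaluation code: the hyperparameters, the 5-fold split, and the R^2 metric are assumptions, and the ANN models are omitted to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# Synthetic stand-in for 15 selected features (not the real dataset).
X, y = make_regression(n_samples=200, n_features=15, noise=10.0, random_state=0)

# Scale-sensitive models are wrapped in a pipeline with a scaler.
models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=1.0)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "Linear SVR": make_pipeline(
        StandardScaler(), LinearSVR(max_iter=10000, random_state=0)
    ),
}

# Mean 5-fold cross-validated R^2 for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean R^2 = {score:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking information from the held-out fold into preprocessing.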

Challenges

● Police information accounted for 33 columns with heavy missingness, so all of them had to be removed
● Strong multicollinearity between demographic variables demanded careful feature selection
● Right-skewed distribution of violent crime rates made quartile classification ineffective
● Balancing model interpretability with predictive power for actionable policy insights
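
One way to see why quartile labels fit a right-skewed target poorly is to bin a synthetic skewed variable and compare the bin widths. The sketch below uses a lognormal sample as an assumed stand-in for the crime rate; it does not use the project's data, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target (lognormal), standing in for a crime rate.
rng = np.random.default_rng(0)
crime_rate = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# Quartile binning: each bin holds the same number of communities.
labels, edges = pd.qcut(
    crime_rate, q=4, labels=["Q1", "Q2", "Q3", "Q4"], retbins=True
)

counts = labels.value_counts()
widths = np.diff(edges)  # range of values covered by each quartile

print(counts.sort_index())
print(dict(zip(["Q1", "Q2", "Q3", "Q4"], widths.round(2))))
```

Each quartile holds the same number of rows, but the top quartile spans a much wider range of values than the bottom one. A classifier trained on these labels treats a community just above the third-quartile edge the same as an extreme outlier, which discards information a regressor can still use.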

Outcomes

● Family structure (children born to unmarried parents) consistently emerged as the strongest predictor across all models
● Random Forest attributed roughly 55% of its feature importance to this single feature; ANN and Lasso confirmed the finding
● Regressors slightly outperformed classifiers and provided more interpretable continuous predictions
● Identified a set of features negatively correlated with crime: percent White residents and percent English-only speakers
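
Cross-checking feature importance between a tree ensemble and a linear model, as described in the outcomes above, might be done along these lines. This is a hedged sketch on synthetic data: the feature names, coefficients, and hyperparameters are invented for illustration and say nothing about the real dataset's values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Synthetic data in which one feature dominates the target by construction.
rng = np.random.default_rng(0)
names = ["dominant", "minor", "noise_1", "noise_2"]  # hypothetical names
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=names)
y = 5.0 * X["dominant"] + 1.0 * X["minor"] + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Impurity-based importances sum to 1; Lasso coefficients are compared by
# absolute size, which is meaningful here because all features share a scale.
importance = pd.DataFrame(
    {
        "forest_importance": forest.feature_importances_,
        "lasso_abs_coef": np.abs(lasso.coef_),
    },
    index=names,
)

top_by_forest = importance["forest_importance"].idxmax()
top_by_lasso = importance["lasso_abs_coef"].idxmax()
print(importance.sort_values("forest_importance", ascending=False))
```

When two models with different inductive biases agree on the top feature, the ranking is less likely to be an artifact of one algorithm. Importance still reflects association rather than causation, which matters when the features are sensitive demographic variables.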

Papers & Reports

Investigating Crime and Communities

IEEE-format paper presenting the full analysis from data preprocessing through feature importance findings.
