ML investigation into which socioeconomic and demographic features are most correlated with violent crime across U.S. communities.
A data science project exploring which community-level features are most strongly associated with violent crime rates across the United States. Using a Kaggle dataset of 2,215 communities with 146 columns spanning demographic and socioeconomic variables, we applied multiple regression and classification models to identify the strongest predictors. After extensive preprocessing (dropping 33 police-data columns with excessive missingness, filtering collinear features, and applying a missing-data threshold), we compared regressors (Linear, Random Forest, SVR, Linear SVR) and classifiers (Logistic, Random Forest, SVC, Linear SVC), and found that regressors provided richer, more interpretable output.
Top Predictor: Family structure
Models Compared: 8 total
Final Features: 15 selected
Original Columns: 146
Communities: 1,993
Regression
Approach
We started with the 146-column dataset and preprocessed it down to 97 features by removing police data, geographic identifiers, and population sums. We then implemented a missing-data threshold function and hand-selected 15 key features based on correlation analysis, avoiding collinear pairs. After comparing regression and classification approaches, we chose regressors for their richer output and evaluated Random Forest, Lasso, KNN, Linear SVR, a deep ANN, and a shallow ANN. Finally, we ran a feature importance analysis across all models to identify consistent predictors.
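The missing-data threshold and the correlation-based selection can be illustrated with a minimal sketch. This is not the project's actual code: the function names, the 0.5 missingness cutoff, the 0.9 pairwise-correlation cutoff, and the toy column names and values are all assumptions made for illustration. The project selected its 15 features by hand, so the greedy loop below is only an automated stand-in for that judgment. It uses the Python standard library alone so that it runs anywhere.

```python
import math


def drop_sparse_columns(columns, max_missing_frac=0.5):
    """Keep columns whose fraction of missing (None) values is at most the cutoff."""
    kept = {}
    for name, values in columns.items():
        missing_frac = sum(v is None for v in values) / len(values)
        if missing_frac <= max_missing_frac:
            kept[name] = values
    return kept


def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, k=15, max_pairwise_r=0.9):
    """Rank features by |correlation with target|, skipping near-duplicates."""
    ranked = sorted(
        (name for name in columns if name != target),
        key=lambda name: abs(pearson(columns[name], columns[target])),
        reverse=True,
    )
    selected = []
    for name in ranked:
        # Reject a candidate that is collinear with a feature already chosen.
        if all(abs(pearson(columns[name], columns[s])) < max_pairwise_r
               for s in selected):
            selected.append(name)
        if len(selected) == k:
            break
    return selected


# Toy data with hypothetical column names and made-up values.
toy = {
    "pct_two_parent": [80, 60, 40, 20, 10],
    "pct_single_parent": [20, 40, 60, 80, 90],  # mirror image of the column above
    "median_income": [50, 30, 45, 20, 25],
    "police_budget": [None, None, None, 5, None],  # 80% missing
    "violent_rate": [100, 300, 500, 700, 900],
}

kept = drop_sparse_columns(toy, max_missing_frac=0.5)
selected = select_features(kept, target="violent_rate", k=2)
print(sorted(kept))
print(selected)
```

On the toy data, the mostly empty `police_budget` column is dropped by the threshold, and only one of the two mirror-image family-structure columns is selected. Avoiding collinearity is meant to prevent exactly that kind of redundancy.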