CRYPTOCURRENCY PRICE PREDICTION

Comparative ML study using XGBoost and LSTM to predict significant crypto price movements, with extensive feature engineering and financial performance optimization.

XGBoost · TensorFlow · scikit-learn · Pandas · CoinGecko API

Overview

A two-phase research project on cryptocurrency price prediction. Phase 1 compared logistic regression, random forest, XGBoost, and LSTM models at predicting price increases of 15% or more over 30 days across 2,000+ cryptocurrencies. Phase 2 tuned the hyperparameters of an XGBoost model to predict price doublings within 60 days, optimizing for financial returns rather than conventional classification metrics. We collected daily data for 2,000+ coins via the CoinGecko API, engineered 50+ features across price, volume, volatility, momentum, and temporal categories, and introduced a novel train-leader-test-follower dataset split.
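
One way to turn the two prediction targets above into binary labels is sketched below. The write-up does not say whether a "15%+ increase over 30 days" refers to the price at the end of the window or to the highest price inside it; this sketch assumes the latter. The function name `label_forward_gain` and its parameters are hypothetical.

```python
from typing import List, Optional


def label_forward_gain(
    prices: List[float], horizon: int, gain: float
) -> List[Optional[int]]:
    """Label each day 1 if the price rises by at least `gain` at any point
    in the next `horizon` days, else 0.

    Assumption: "within the window" means the window maximum. Days without
    a full look-ahead window get None so they can be dropped before training.
    """
    labels: List[Optional[int]] = []
    for i, price in enumerate(prices):
        window = prices[i + 1 : i + 1 + horizon]
        if len(window) < horizon or price <= 0:
            labels.append(None)  # not enough future data (or unusable price)
        else:
            labels.append(1 if max(window) >= price * (1 + gain) else 0)
    return labels


# Phase 1 style target: +15% within 30 days -> gain=0.15, horizon=30
# Phase 2 style target: doubling within 60 days -> gain=1.0, horizon=60
```

Returning None for the trailing days keeps rows with an incomplete look-ahead window out of training, so they are not silently counted as negatives.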

Key Results

  • LSTM F1-Score: 0.389
  • XGBoost F1-Score: 0.383
  • Portfolio Return: 1,689%
  • Profit Factor: 9.59
  • Top Feature: Month (20.4%)
  • Coins Analyzed: 2,000+

Methodology

Collected 365 days of daily data for 2,000+ cryptocurrencies via the CoinGecko API with custom rate limiting. Filtered out stablecoins and engineered features across five categories: price (moving averages, momentum), volume (OBV, correlations), volatility (Bollinger Bands, standard deviation), momentum (RSI, MACD), and temporal (month, day of week). Used a train-leader-test-follower split to evaluate generalization: models were trained on coins in the top 50% by market cap and tested on a random sample from the bottom 50%. Addressed severe class imbalance (<1% positive) with majority-class undersampling followed by SMOTE. Phase 2 ran a threshold sensitivity analysis over decision thresholds up to a predicted probability of 0.95.
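
The train-leader-test-follower split can be illustrated with a few lines of standard-library Python. This is a minimal sketch, not the project's code: the function name, the `market_caps` mapping, and the size of the random test sample are all assumptions.

```python
import random
from typing import Dict, List, Tuple


def leader_follower_split(
    market_caps: Dict[str, float], test_sample_size: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split coin IDs so that training uses the top half by market cap
    ("leaders") and testing uses a random sample of the bottom half
    ("followers"). No coin appears on both sides."""
    ranked = sorted(market_caps, key=lambda coin: market_caps[coin], reverse=True)
    half = len(ranked) // 2
    leaders, followers = ranked[:half], ranked[half:]

    rng = random.Random(seed)  # fixed seed keeps the test sample reproducible
    n_test = min(test_sample_size, len(followers))
    test_coins = rng.sample(followers, n_test)
    return leaders, test_coins


caps = {"a": 100.0, "b": 80.0, "c": 50.0, "d": 10.0, "e": 5.0, "f": 1.0}
train_coins, test_coins = leader_follower_split(caps, test_sample_size=2)
```

Splitting by coin rather than by row means every observation of a given coin lands entirely in training or entirely in testing, which is what lets the test set measure generalization to unseen, smaller coins.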

What We Built

  • Custom CoinGecko API scraper with rate limiting and error handling for 2,000+ coins
  • 50+ engineered features across price, volume, volatility, momentum, and temporal categories
  • Novel train-leader-test-follower split for realistic generalization testing
  • Four model architectures: Logistic Regression, Random Forest, XGBoost, dual-layer LSTM
  • Threshold sensitivity analysis optimizing for financial returns over classification metrics
  • Two-step resampling: undersampling majority + SMOTE for <1% minority class
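
The two-step resampling in the last item can be sketched as follows. The write-up does not state the sampling ratios or the library used, so this is a simplified, standard-library illustration of the idea: randomly drop majority rows, then create synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours (the core step of SMOTE). All names and the choice to balance the classes exactly are assumptions; in practice a maintained implementation such as imbalanced-learn's `RandomUnderSampler` and `SMOTE` would be the more likely choice.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def undersample(majority: Sequence[Point], n_keep: int, rng: random.Random) -> List[Point]:
    """Step 1: keep a random subset of the majority class."""
    return rng.sample(list(majority), min(n_keep, len(majority)))


def smote_like(minority: Sequence[Point], n_new: int, k: int, rng: random.Random) -> List[Point]:
    """Step 2: build synthetic minority points on the line segment between
    a random minority point and one of its k nearest minority neighbours."""
    if n_new > 0 and len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    synthetic: List[Point] = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


def two_step_resample(
    majority: Sequence[Point],
    minority: Sequence[Point],
    n_majority_keep: int,
    k: int = 5,
    seed: int = 0,
) -> Tuple[List[Point], List[int]]:
    """Undersample the majority, then oversample the minority until the two
    classes are the same size (assumed target ratio). Returns (X, y)."""
    rng = random.Random(seed)
    kept = undersample(majority, n_majority_keep, rng)
    n_new = max(0, len(kept) - len(minority))
    synthetic = smote_like(minority, n_new, k, rng)
    X = kept + list(minority) + synthetic
    y = [0] * len(kept) + [1] * (len(minority) + len(synthetic))
    return X, y
```

Resampling of this kind is normally applied to the training data only, so that the test set keeps the real class ratio.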

Challenges

  • Handling CoinGecko API rate limits while collecting data for 2,000+ cryptocurrencies
  • Severe class imbalance: price doubling events made up less than 1% of observations
  • Designing an evaluation methodology that reflects real-world financial applicability rather than just classification accuracy
  • Pivoting from an initial sentiment-analysis approach to technical indicators due to data collection constraints
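
For the rate-limit challenge, a collection loop along the following lines is one plausible shape. It is a sketch under stated assumptions, not the project's scraper: `fetch` stands in for whatever function actually calls the CoinGecko API, `RateLimitError` is a hypothetical exception that function would raise on a "too many requests" response, and the interval and retry counts are placeholder values.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error raised by `fetch` when the API asks us to slow down."""


def fetch_all(
    coin_ids: Iterable[str],
    fetch: Callable[[str], T],
    min_interval: float = 1.5,   # assumed minimum seconds between requests
    max_retries: int = 3,        # assumed retry budget per coin
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Dict[str, T], List[str]]:
    """Fetch data for each coin, spacing requests at least `min_interval`
    apart and backing off exponentially after a rate-limit error.
    Returns (results, ids_that_failed_after_all_retries)."""
    results: Dict[str, T] = {}
    failed: List[str] = []
    last_call = None
    for coin_id in coin_ids:
        for attempt in range(max_retries + 1):
            if last_call is not None:
                wait = min_interval - (clock() - last_call)
                if wait > 0:
                    sleep(wait)  # enforce spacing between requests
            last_call = clock()
            try:
                results[coin_id] = fetch(coin_id)
                break
            except RateLimitError:
                if attempt < max_retries:
                    sleep(min_interval * 2 ** attempt)  # exponential backoff
        else:
            failed.append(coin_id)  # retries exhausted; record and move on
    return results, failed
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting, and returning the failed IDs lets a later pass retry them instead of aborting a 2,000+ coin run.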

Outcomes

  • LSTM slightly outperformed XGBoost (F1: 0.389 vs 0.383), but XGBoost offered far better computational efficiency
  • The optimized XGBoost model with a 0.95 probability threshold achieved a 1,689% backtested portfolio return with a profit factor of 9.59
  • Discovered strong seasonality: 'month' was the single most important feature at 20.4% importance
  • Demonstrated that optimizing for financial metrics leads to dramatically different model configurations than classification metrics
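
A threshold sensitivity analysis of the kind described above can be sketched as a sweep over probability cutoffs, scoring each cutoff by financial metrics instead of F1. The backtest design is not described in this write-up, so the sketch makes simple assumptions: every prediction at or above the cutoff becomes one equally weighted trade, trade returns are summed, and profit factor uses its usual definition of gross profit divided by gross loss. The function names and toy numbers are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def profit_factor(trade_returns: Sequence[float]) -> Optional[float]:
    """Gross profit divided by gross loss. Returns None when there are no
    trades and infinity when there are gains but no losing trades."""
    if not trade_returns:
        return None
    gains = sum(r for r in trade_returns if r > 0)
    losses = -sum(r for r in trade_returns if r < 0)
    if losses == 0:
        return float("inf") if gains > 0 else None
    return gains / losses


def sweep_thresholds(
    probs: Sequence[float],
    realized_returns: Sequence[float],
    thresholds: Sequence[float],
) -> List[Dict[str, object]]:
    """For each cutoff, 'buy' every prediction with probability >= cutoff
    and summarize the resulting trades."""
    rows: List[Dict[str, object]] = []
    for th in thresholds:
        taken = [r for p, r in zip(probs, realized_returns) if p >= th]
        rows.append({
            "threshold": th,
            "trades": len(taken),
            "total_return": sum(taken),  # assumes equal-weight, summed returns
            "profit_factor": profit_factor(taken),
        })
    return rows


# Toy data: model probabilities and the return each trade would have realized.
probs = [0.99, 0.96, 0.70, 0.60, 0.30]
returns = [1.0, -0.25, 0.5, -0.5, -0.25]
for row in sweep_thresholds(probs, returns, thresholds=[0.5, 0.95]):
    print(row)
```

In this toy data, raising the cutoff from 0.5 to 0.95 halves the number of trades while doubling the profit factor, which illustrates why a cutoff chosen for financial metrics can sit far from the one that maximizes a classification metric.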
