Note: If you want to take a look at our code and analysis, check out our notebooks in this repository. You may have to refresh a couple times for them to load properly.
- 1. Introduction
- 2. Figures
- 3. Methods
- 4. Data Exploration and Analysis
- 5. Data Visualization and Key Insights
- 6. Advanced Analytics: Unsupervised Learning
- 7a. Model 1 Development and Enhancement
- 7b. Model 2 Development and Enhancement
- 8. Conclusions and Future Improvements
- 9. Discussion
- 10. Conclusion
- 11. Environment Setup and Usage
- 12. Statement of Collaboration
The goal of our project was to develop a machine learning system capable of predicting NFL fantasy football performance based on historical player statistics. This problem was chosen due to its practical applications in fantasy sports, professional team scouting, and the broader challenge of predicting human athletic performance in dynamic environments.
As avid NFL fans who have spent countless hours analyzing player performance and debating draft strategies, we wanted to test whether machine learning could systematize the intuitive player evaluation that dedicated football fans develop. Fantasy football transforms every Sunday into a personal investment in player performance, where statistical analysis meets genuine passion for the game.
A robust, high-performing predictive model in this domain could revolutionize several areas:
Fantasy Sports Applications:
- Automated Draft Strategy: Our models can identify undervalued players and optimal draft positions, giving fantasy managers a competitive edge
- Weekly Lineup Optimization: Predictive models help determine which players to start based on matchup analysis and consistency metrics
- Trade Evaluation: Quantitative player valuations enable more informed trade decisions throughout the fantasy season
Professional Team Applications:
- Player Scouting: Teams can identify undervalued free agents and draft prospects who may outperform their projected draft position
- Contract Negotiations: Data-driven performance projections help teams make optimal contract decisions based on predicted future value rather than past achievements
- Roster Management: Predictive models assist in determining when to extend, trade, or release players based on projected career trajectories
Broader Sports Analytics:
- Injury Impact Assessment: Understanding how injuries affect long-term performance patterns
- Age Curve Analysis: Quantifying how players of different positions age and when performance typically declines
- Scheme Fit Evaluation: Identifying which players are likely to succeed in specific offensive systems
This project was also a natural extension of our shared passion for football and data science, allowing us to combine technical skills with genuine sports knowledge.
Our analysis includes several key visualizations that demonstrate our methodology and findings:
Figure 1: Position Distribution - Shows the composition of NFL player data across positions, revealing the scarcity principle in fantasy football where positions with fewer elite players carry premium value.
Figure 2: Fantasy Points Consistency Analysis - Displays scoring variability by position, with lower bars indicating more consistent, predictable scoring and higher bars showing "boom-or-bust" potential.
Figure 3: Initial Correlation Analysis - Basic relationship between 2023 rushing yards and 2024 fantasy points, establishing the foundation for our predictive approach.
Figure 4: Multi-Feature Analysis - Comprehensive examination of how various 2023 performance metrics (snap count, rushing yards, touchdowns, yards per carry) correlate with 2024 fantasy outcomes.
Figure 5: Model Validation Results - Support Vector Regression performance showing actual vs. predicted fantasy points with separate visualization of training and test data.
Figure 6: Position-Filtered Enhancement - Improved model performance when restricting analysis to RB1/RB2 level players, demonstrating better accuracy for fantasy-relevant decisions.
Note: Additional figures throughout our analysis show training vs. testing errors, evaluation metrics, hyperparameter tuning results, and detailed model interpretations.
- Dataset: NFL Player Statistics (1999-2022) from Kaggle - [philiphyde1/nfl-stats-1999-2022]
- Total Observations: 215,243 across weekly and yearly statistics
- Focus Period: 2023-2024 seasons for prediction modeling
- Key Analysis: Position distribution, fantasy points consistency, and data quality assessment
- Target Variable Creation: Calculated fantasy_points_ppr using standard PPR scoring rules
- Data Filtering: Regular season games only, position-specific analysis
- Quality Assurance: Handled missing values, removed duplicates, created depth chart rankings
- Feature Engineering: Developed reliability index combining performance and consistency metrics
Support Vector Regression (SVR) Implementation:
- Justification: Chosen for effectiveness with non-linear relationships and robustness to outliers
- Features: 2023 performance metrics (snap count, rushing yards, touchdowns, yards per carry)
- Hyperparameter Tuning: Grid search with 5-fold cross-validation
- Enhancement: Position-filtered analysis for RB1/RB2 players
# Core model implementation
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
params = {
'C': [1, 10, 50, 100],
'epsilon': [0.1, 0.5, 1.0],
'gamma': ['scale', 0.01, 0.1, 1],
'kernel': ['rbf', 'poly']
}
grid = GridSearchCV(SVR(), params, cv=5, scoring='r2')
grid.fit(X_train_scaled, y_train)- Position-Specific Filtering: Focus on fantasy-relevant players (RB1/RB2 only)
- Feature Scaling: StandardScaler normalization
- Validation Strategy: 80/20 train-test split with cross-validation
Principal Component Analysis (PCA) + K-means Clustering Implementation:
- Justification: Chosen for its ability to analyze multiple different statistical categories and plot players against each other in different positions to analyze their value against other positions
- Features: 2024 performance metrics (rushing yards, rushing touchdowns, receiving yards, receiving touchdowns, receptions, passing yards, passing touchdowns, interceptions)
- Principal Component Analysis Found two most variant components tracked passing play and rushing play. When plotted against each other it formed a triangular branchning scatter plot.
- K-means Clustering Used K-means clustering on Principal Component formed features to find patterns in our data.
- Position-Specific Labeling: Using Different K-values for our K-means Clustering yielded more clusters with different kinds of players in each cluster.
- Comparison across Positions The Principal Component Analysis gave us a representation of our data allowing us to compare players across all positions from RB, QB, TE, WR, despite the vastly different roles each of these play on the field
- Evaluation of Z-score to Fantasy Percentile Taking the Z-score of every player in the PCA graph, and comparing it against their season fantasy scores gives us the following graph:
Our dataset is a comprehensive collection of NFL statistics from 1999-2022, comprising 215,243 total observations distributed across eight CSV files. The data have two primary axes: (weekly vs. yearly) and (player vs. team), which is ideal for our goal of building a fantasy football prediction system.
Player-Level Data (Primary Focus):
- Weekly Offensive Stats (58,629 observations): This is the most critical dataset for our project. Each observation represents a single player's offensive performance in one game.
- Yearly Offensive Stats (7,133 observations): Aggregated weekly data (16-17 games) into full-season summaries for each player, useful for establishing baseline performance.
- Defensive Stats (Weekly & Yearly): While our primary focus is offense, the 117,993 weekly and 16,148 yearly defensive observations provide context for team defensive efficiency.
Team-Level Data (Contextual): The team-level files provide broader context on team tendencies (e.g., pass-heavy vs. run-heavy offenses) that can influence a player's opportunity and projected output.
For our fantasy football prediction system, we engineered our target variable fantasy_points_ppr based on standard PPR (Points Per Reception) league scoring:
- 1 point per reception ('receptions')
- 0.1 points per rushing/receiving yard ('rushing_yards', 'receiving_yards')
- 6 points per rushing/receiving touchdown ('rushing_tds', 'receiving_tds')
- 4 points per passing touchdown ('passing_tds')
- -2 points per interception thrown ('interceptions')
This engineered column serves as a continuous variable on a ratio scale and represents the value our models aim to predict for future player performance.
Missing and Duplicate Values Analysis:
- No duplicate rows found across the primary datasets
- Missing values identified in the "yards per carry" column, which we handled by initializing with 0s (indicating no rushing attempts)
- Data integrity maintained through systematic filtering and validation processes
Our comprehensive visualization approach revealed critical patterns that directly informed our modeling strategy:
Description: This pie chart illustrates the distribution of players across different positions within our dataset. From a fantasy drafting perspective, this visualization is key to understanding positional value.
Key Insights:
- Wide Receivers comprise the largest segment of skill position players
- Running Backs and Tight Ends represent smaller but crucial position groups
- Positions with smaller slices (like Running Back or Tight End) demonstrate scarcity, increasing the value of elite players at those positions
- The drop-off in production to the next available player is much steeper for scarce positions
- This analysis directly impacts our position-specific modeling approach and draft strategy prioritization
Description: This bar chart displays the consistency of fantasy scoring by position, measured by the standard deviation of weekly points. Lower bars signify more consistent, predictable scoring, while higher bars indicate more "boom-or-bust" potential.
Interpretation:
- Lower Bars (More Consistent): Positions like Quarterback often show lower variance, providing a stable scoring floor each week
- Higher Bars (Less Consistent): Positions like Wide Receiver or Tight End have higher variance, meaning their weekly scores fluctuate more
- Quarterbacks typically show the most consistent scoring patterns
- Wide Receivers and Tight Ends exhibit higher variability
Fantasy Football Application: Our models leverage this insight to manage risk by prioritizing players from more consistent positions to build a reliable core, while strategically targeting less consistent, high-upside players for competitive advantage.
After focusing on running backs for our initial modeling approach, we analyzed key performance indicators:
Features Analyzed:
- Offensive snap count (opportunity metric)
- Rushing yards (volume metric)
- Rushing touchdowns (efficiency metric)
- Yards per carry (skill metric)
Key Findings:
- Strong correlation between 2023 rushing yards and 2024 fantasy performance
- Snap count proved crucial as an opportunity-based predictor
- Touchdown production showed regression patterns year-over-year
Since all fantasy players are scored on the same scale regardless of position, we conducted PCA on 2024 player fantasy data to identify hidden player archetypes and validate our position-based assumptions.
Methodology:
- Data Filtering: Focused on 2024 offensive players capable of generating fantasy points
-Aggregation Aggregated player statistics week by week for the season, and included fantasy points in our aggregation but not in our PCA.

- Data Standardization: Scaled data to mean 0, standard deviation 1 across all players
-Done using StandardScaler from scikitLearn

- PCA Application: Applied dimensional reduction to discover underlying patterns

Results and Insights:
Three Distinct "Branches" Identified:
- Far Bottom Right: Quarterback cluster - sparse and spread out due to smaller QB population
- Far Left Bottom: Receiver/TE cluster - dense clustering due to larger population and similar statistical profiles
- Center to Top: Running Back cluster - characterized by high rushing yardage and rushing touchdowns
Principal Component Analysis:
- PC1 (Highest Variance): Dominated by passing yards (positive) vs. receptions (negative)
- Explains quarterback separation on X-axis
- Receivers cluster on low end due to reception-heavy profiles
- PC2: Rushing yards (positive) vs. passing yards (negative)
- Traditional QBs (Goff, Burrow) appear bottom due to high passing/low rushing
- Dual-threat QBs (Daniels, Hurts, Jackson) cluster toward middle
Objective: Test whether machine learning could correctly identify player positions based on statistical profiles alone.
Initial Clustering (K=3):
- Successfully separated QBs, RBs, and WR/TE groups
- Minor classification issues with low-volume players being grouped as WRs
Enhanced Clustering (K=4):
- 100% accuracy for WR/TE position identification
- No misclassification of Running Backs or Quarterbacks
- Successfully isolated backup/low-volume players into distinct cluster

Validation Results:
- Machine learning clustering perfectly aligned with actual position assignments
- Confirms that statistical profiles inherently separate by position
- Validates our position-specific modeling approach
Through manual labeling of our K-means clustering with 9 clusters, we identified several distinct player archetypes:
Fantasy Application: These archetypes inform draft strategy by identifying player types that provide specific value propositions, and their distance from the mean of all data also informs us of who is an outlier among the data and gives us an informed approach to player on player comparisons, especially when it is difficult to tell between two players to draft.
Temporal Filtering:
- Focused on 2023-2024 seasons for most recent and relevant data
- Regular season only filtering (removed playoff data irrelevant to fantasy)
- Position-specific analysis prioritizing fantasy-relevant positions
Feature Engineering:
- Created reliability indices combining performance and consistency
- Developed opportunity-based metrics emphasizing volume indicators
- Implemented position-specific normalization for cross-positional comparisons
Initial SVR Model:
Train RMSE: 69.92
Test RMSE: 97.70
Train R²: 0.479
Test R²: -0.026
Position-Filtered Enhancement (RB1/RB2 only):
Train RMSE: 71.89
Test RMSE: 74.93
Train R²: 0.457
Test R²: 0.216
Final Tuned Model (GridSearchCV):
Train RMSE: 28.91
Test RMSE: 34.67
Train R²: 0.801
Test R²: 0.739
Key Improvements:
- 53% reduction in test RMSE through position filtering
- Eliminated negative R² indicating meaningful predictive power
- GridSearchCV optimization with parameters:
{'C': 10, 'epsilon': 1.0, 'gamma': 'scale', 'kernel': 'rbf'}
What Our Model Successfully Captures:
- Volume-based opportunity metrics (snap counts, target share)
- Touchdown regression patterns (high scorers typically decline)
- Position-specific performance patterns (elite players more predictable)
- Year-over-year consistency factors for established players
Acknowledged Limitations:
- Small sample size (47 RBs with complete data) limits generalizability
- Missing contextual factors: coaching changes, injury history, offensive line quality
- Temporal assumptions: assumes performance patterns persist across seasons
- Position isolation: doesn't account for cross-positional impacts
Real-World Challenges:
- NFL's inherent unpredictability due to injuries, weather, game script
- Fantasy football's dynamic environment with weekly adjustments needed
- Equal treatment of statistics that vary significantly in value (red zone vs. garbage time production)
- Data Filtering: Focused on 2024 offensive players capable of generating fantasy points
-Aggregation Aggregated player statistics week by week for the season, and included fantasy points in our aggregation but not in our PCA.

- Data Standardization: Scaled data to mean 0, standard deviation 1 across all players
-Done using StandardScaler from scikitLearn

- PCA Application: Applied dimensional reduction to discover underlying patterns

- K-Means Clustering Applied K-means clustering to data to get the clusters of players and their archetypes, as well as classification of them by position
Initial K=3 Model:
The K=3 model had classified the QBs and RBs well, but classified all other players that didn't have significant enough statistics into the 3rd cluster, so we decided to add another cluster to try and reduce this error.
K=4 Model:
The K=4 Model saw significant improvements, with a 4th cluster being added that split the third cluster into pure receivers and other players with stats that weren't significant enough to be classified into a different category.
K=9 Model:
Playing around with the K value, and landing on K=9 gave me a good number of clusters classifying different archetypes of players neatly, and allowed me to label them as well. The classification report was insightful, and showed the clustering was about 3 clusters per position. Cluster 1 takes the role of the "Mean" cluster having the most diversity in positions. Since Tight Ends and Receivers both are ball catchers primarily, they are treated the same in the clustering.
To test the hypothesis that euclidean distance from the mean of the PCA graph and fantasy points were correlated we decided to plot the fantasy points against the distance, and divide by position to see which positions fit the best. We found the following graphs:
As you can see, there is a clear positive correlation with every position and the fantasy production of the player vs the distance from the mean of the PCA components. This means that PCA can be used as a reliable predictor of fantasy production to evaluate who should be selected.
Key Improvements:
- Higher Classification Accuracy in classification report through adding K-means
- Archetype and Hidden Player type classification with PCA 1 and 2 allowing the data to show hidden patterns
- Outlier Discovery and Fantasy Performance Prediction Being able to predict the Fantasy Points with the performance factor of PCAs
What Our Model Successfully Captures:
- Player Archetypes Beyond Positions: By combining PCA with K-means clustering, the model identifies groups of players with similar statistical profiles, such as dual-threat QBs, elite RBs, or high-volume WRs, rather than limiting classification to listed positions.
- Dimensional Reduction Reveals Patterns: PCA reduces redundant dimensions while preserving variance, showing which statistical categories (e.g., receptions vs. rushing yards) most influence player separation.
- Improved Differentiation Across Skill Levels: With higher K values, the model successfully separates backup-level contributors from elite starters and hybrid players, allowing deeper analysis of tiers within positions.
- Insight into Fantasy Value: Distance from the PCA mean and cluster assignment correlate with fantasy production, giving a way to evaluate not just total points, but style and archetype of contribution.
Acknowledged Limitations:
- Cluster Interpretability: While clusters can be labeled with football knowledge, the algorithm itself does not inherently “know” positions or player roles. Some clusters may include mixed positions (e.g., WR/TE overlap) that require manual interpretation.
- Static Snapshot: PCA and clustering were applied to season-long aggregates. This misses week-to-week variance, injuries, and mid-season role changes that strongly impact fantasy outcomes.
- Overfitting Cluster Count: Choosing K=9 worked well heuristically, but it wasn’t optimized using metrics like the elbow method or silhouette score, which could make clusters more robust.
- No Contextual Features: The model only considers player statistics, ignoring external factors such as opponent strength, game scripts, or injuries, which drive real-world fantasy value.
Real-World Challenges:
- Data Availability & Quality: Play-by-play or situational data could enhance clustering but is often harder to acquire and standardize.
- Evolving Player Roles: Modern NFL offenses increasingly blur positional distinctions (e.g., WRs taking rushing snaps, TEs lining up wide), which makes clean clustering more difficult.
- Fantasy Scoring Variability: Different leagues (PPR, standard, half-PPR, superflex) weigh contributions differently. A player archetype valuable in one scoring system may not translate equally to another.
- Practical Use in Fantasy Drafting: While clustering reveals hidden player types, managers must still consider draft position, ADP (average draft position), and roster construction strategy, which aren’t modeled here.
Our analysis revealed that statistical modeling for individual positions in isolation is challenging due to:
- Fluid nature of NFL rosters and player roles
- High injury rates particularly at running back position
- External dependencies like offensive line blocking and defensive efficiency
- Small sample sizes inherent in 17-game NFL seasons
Injury and Context Handling:
- Weight snap counts more heavily as indicator of health and role security
- Incorporate injury probability models based on historical data
- Add coaching stability metrics to account for scheme changes
Advanced Feature Engineering:
- Depth chart movement multipliers for players changing roles
- Opponent-adjusted statistics accounting for defensive strength
- Game script factors distinguishing garbage time from meaningful production
- Red zone opportunity weighting for touchdown-dependent metrics
Multi-Position Integration:
- Cross-positional correlation analysis (QB-WR connections, RB-OL dependencies)
- Team-based modeling incorporating offensive system effects
- Weekly prediction capabilities for in-season adjustments
1. Enhanced Clustering Approaches
- Experiment with different clustering algorithms (e.g., DBSCAN, Gaussian Mixture Models) to capture non-linear groupings and overlapping archetypes.
- Use cluster validation metrics such as silhouette score, Davies–Bouldin index, or Calinski–Harabasz to more systematically choose the optimal K.
2. Incorporation of Contextual Features
- Add situational features such as opponent defensive rankings, offensive line performance, and team pace of play to better explain player outputs.
- Include game script indicators (e.g., average score differential when a player receives touches) to capture how usage depends on context.
3. Time-Series Modeling
- Shift from season-level aggregates to week-by-week PCA and clustering, which would better capture streaks, injuries, and player development.
- Apply Hidden Markov Models (HMMs) or RNNs to model transitions between player roles across a season.
4. Fantasy-Scoring Sensitivity Analysis
- Re-run clustering under different scoring formats (standard, half-PPR, superflex) to see how player archetypes shift.
- Compare distance-to-mean metrics against fantasy points across these formats to evaluate stability of archetypes.
5. Integration with Draft/ADP Data
- Merge with Average Draft Position (ADP) datasets to assess whether identified archetypes are overvalued or undervalued relative to their fantasy output.
- Could guide draft strategy by highlighting hidden tiers or overlooked player types.
6. Advanced Visualization
- Deploy interactive dashboards (e.g., Plotly, Streamlit) where users can filter by position, team, or cluster and explore PCA/clustering results dynamically.
- Highlight “nearest neighbors” in PCA space to suggest comparable players for trades or waiver pickups.
7. Hybrid Modeling Approaches
- Combine unsupervised learning (clustering) with supervised prediction (regression, gradient boosting, neural nets) to directly forecast fantasy output while retaining interpretability.
- Use cluster assignments as features in downstream predictive models.
Our SVR approach successfully captured several key football analytics concepts:
- Opportunity Metrics: Snap count proved crucial for predicting future success
- Touchdown Regression: Model identified the well-known phenomenon where high touchdown scorers typically decline
- Position-Specific Patterns: Elite players (RB1/RB2) showed more predictable patterns than fringe roster players
1. Effective Dimensionality Reduction
- PCA successfully distilled multiple offensive statistics into two main components (PC1 & PC2), simplifying the analysis without losing key information.
- Allowed clear visualization of player archetypes and statistical patterns across positions.
2. Archetype Identification
- K-means clustering on PCA components effectively grouped players into distinct archetypes, such as:
- Starting QBs
- Dual-threat QBs
- WR1s / receiving tight ends
- High-output RBs
- Backup or low-tier players
- These clusters align well with real-world fantasy football roles.
3. Position-Aware Insights
- Clustering reveals hidden patterns within positions, e.g., separating elite RBs from mid-tier starters, or WR2/TE hybrids from WR1s.
- Provides position-specific intelligence useful for draft or waiver decisions.
4. Quantitative Player Comparison
- Euclidean distance from the PCA mean provides a single metric capturing overall fantasy-relevant impact, making it easy to compare players across positions.
5. Flexibility Across Scoring Formats
- Model allows comparison of distance metrics against PPR and standard scoring, showing robustness and adaptability to different fantasy leagues.
6. Data-Driven Insights Without Prior Bias
- Unsupervised approach discovers archetypes naturally from player statistics, reducing reliance on subjective labels or analyst assumptions.
7. Visualization-Friendly
- PCA + clustering allows intuitive plotting and interactive exploration, making insights accessible to both analysts and fantasy managers.
Sample Size Constraints: Only 47 running backs had complete 2023-2024 data, limiting generalizability. For our unsupervised data, we only have data to extrapolate from that year, and can't analyze trends across multiyear spans to analyze regression of player performance due to age, injury, weight gain, height, and other factors.
Missing Contextual Factors:
- Coaching changes and scheme alterations
- Injury history and recovery patterns
- Team offensive line quality and blocking efficiency
- Game script and situational factors
Temporal Assumptions: The model assumes 2023 performance patterns persist into 2024, potentially missing regime changes or player development arcs
Fantasy football operates in a dynamic environment where player values fluctuate based on weekly performance, injuries, and matchups. Our annual prediction model doesn't capture in-season adjustments that fantasy managers must make throughout the season.
Our project tested whether machine learning could systematize intuitive player evaluation. Results suggest partial success - the model captures broad patterns while missing nuanced factors that experienced fantasy players consider.
Where Models Excel:
- Removing emotional bias from player evaluation
- Quantifying touchdown regression and efficiency mean reversion
- Identifying volume-based opportunity metrics
Where Human Knowledge Remains Superior:
- Contextualizing coaching changes and scheme fits
- Evaluating injury impact and recovery timelines
- Understanding team dynamics and locker room situations
Our NFL fantasy football prediction project represents a successful integration of sports analytics, machine learning, and genuine football passion. While we initially aimed to build a complete "fantasy football autodrafter," we discovered something more valuable: a framework for enhancing rather than replacing football knowledge.
- Player Archetype Analysis We were able to successfully classify players and the archetype they fit in by position, all in one graph using PCA and K-means clustering. This was one of the most interesting and fun parts of the project in our opinion. Sports Data is usually very difficult to quantify statistically, and it isn't always a game of numbers, as coaching and schemes can often outplay superior talent if done right, but seeing numbers validated in our data and form distinct clusters that could be analyzed by the machine and code was very satisfying. It also showed the application of machine learning processes in a way that was interesting to us and could show us meaningful results and data
- Correlation Between PCA and fantasy performance We found that using PCA, our model was extremely improved in predicting fantasy statistics. The Euclidean PCA Distance was able to determine the Fantasy player statistics much better than our other statistics were in our first model. The relationship was much more tightly fit and made much more sense for all positions. It was very interesting to see the correlation between those two statistics.
- Different Objective Choosing a different objective for these models would have been good. Sports Prediction is difficult to do, due to certain characteristics, so using Machine Learning for player classification like for College Players to see where their best fit in the NFL would be, would be one that we want to do.
- Dynamic Models: Develop weekly prediction capabilities for in-season adjustments based on a player's experience, injuries, play of other factors like Offensive Linemen, Quarterback, Defensive Schemes, and more.
- Multi-Position Analysis: Extend methodology to defensive players, and offensive linemen to analyze deeper statistics like coverage grade, blocking grade, efficiency, and more.
- Ensemble Methods: Combine multiple model types for improved accuracy, such as PCA and Logistic Regression, SVM Boundary Classification, or using a different feature transformation for the data.
Our methodologies extend beyond fantasy football into professional team management, player development, and contract negotiations. The challenges we encountered - handling seasonal variance, balancing evaluation criteria, and predicting human performance, as well as analysis of play style and statistics for combining players that will scheme together well - have applications in financial forecasting, team success, coaching schemes, player evaluation, and healthcare outcomes.
- Python 3.8+ with pip package manager
- Kaggle account with API access
- Jupyter Lab/Notebook environment
# Clone repository
git clone https://github.com/jaw039/CSE151A-Project.git
cd CSE151A-Project
# Install dependencies
pip install pandas numpy matplotlib seaborn scikit-learn kaggle jupyterlab
# Set up Kaggle API (place kaggle.json in ~/.kaggle/)
# Download dataset
python get_data.pyCSE151A-Project/
├── data_exploration.ipynb # Initial data analysis and visualization
├── firstModel.ipynb # SVR model development and validation
├── feature_eng.ipynb # Feature engineering and reliability metrics
├── stat_projections.ipynb # Advanced statistical analysis
├── PCA.ipynb # PCA + K-means clustering
├── final_report.ipynb # Comprehensive project documentation
├── images/ # Generated visualizations and figures
├── README.md # This documentation
└── data/ # NFL statistics dataset (auto-generated)
Jackie Wang
- Title: Lead Data Scientist and Supervised Learning Specialist
- Contribution: Primary contributor to data exploration, SVR model implementation, and feature engineering. Led preprocessing pipeline development and reliability index creation. Major contributor to technical documentation and figure generation. Responsible for model validation and hyperparameter tuning processes.
Akul Kudari
- Title: Project Lead and Unsupervised Learning Specialist
- Contribution: Project management and methodology oversight. Key contributor to position-specific filtering approaches and model enhancement strategies. Significant contributions to data visualization frameworks and results interpretation. Led collaborative discussions and ensured project timeline adherence. Contributed to Unsupervised Learning Approach and Clustering Approach
Both team members participated actively in all phases of the project, from initial data exploration through final model validation. The project benefited from diverse perspectives combining technical expertise, football domain knowledge, and analytical rigor. All major decisions were made collaboratively, with regular code reviews and methodology discussions ensuring high-quality outcomes.










