epfl-ada/ada-2025-project-theoutliers

ada-2025-project-theoutliers

Check out our data story website: https://ula1111111111.github.io/

README Milestone 2 and 3

Title: The Pulse of the Market: Who Sets the Rhythm?

Motivations:

Financial markets are often portrayed as dominated by a handful of giants, but do the biggest companies truly drive the movements of their entire industry? This project investigates whether industry leaders systematically influence smaller peers within NASDAQ sectors. Using historical stock price data enriched with external metadata, we will group companies by industry, develop objective criteria to rank them, and analyze how information and volatility propagate across the market. Our methodology combines time-series modeling and Granger causality testing to detect leader-follower patterns in price dynamics. We will also assess how persistent these relationships are and how they evolve during major market events. Another component compares value-weighted ETFs with equally weighted portfolios to quantify whether performance is concentrated among large-cap leaders or broadly distributed. By highlighting where price discovery originates, from dominant firms or distributed behavior, this project offers insights into market power and the dynamics of information diffusion.

Research Questions:

  1. How do we define a "leader" and a "follower" in stock movements? How should companies be grouped into sectors and ranked within them?
  2. How can we detect directional influence between stocks within a sector? (We use daily return time series and test for influence with lagged correlations or Granger causality.)
  3. Are leader-follower dynamics consistent across sectors? Do some sectors have stronger leadership patterns than others?
  4. How stable are these influence patterns over time, and how do they evolve during major market events?
  5. What does ETF analysis reveal about the performance of market leaders, sector averages, and followers, and about overall market concentration?
  6. How does survivorship bias impact the validity of our conclusions, and what steps can be taken to reduce its effects? Which time window should we analyze?

Additional dataset:

https://www.kaggle.com/datasets/dhimananubhav/nasdaq-company-list

This dataset is essential to our analysis: it provides, in particular, the sector, market capitalization, and name of every company listed on the Nasdaq, together with its ticker. It lets us group companies (and hence stocks) by sector, which our sector-based analysis requires.

Methods:

1. Data Handling & Preprocessing

2. Merging Datasets & Sectorizing companies

  • Match the provided and external data to obtain the sector and market cap of every Nasdaq company
  • Hierarchise companies within their sector based on relevant criteria (developed in the notebook)
  • Produce the final dataset, which gives each company a 'ranking' within its sector for use in the subsequent analyses
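The merge-and-rank step above can be sketched as follows. This is a minimal illustration, not the notebook's method: the ranking criterion (market cap only), the column names `Symbol`, `Sector` and `MarketCap`, the helper names, and all sample values are assumptions made for the example. Plain Python is used to keep the sketch self-contained; the same logic maps onto a pandas merge followed by a grouped rank.

```python
from collections import defaultdict

# Hypothetical metadata rows; tickers and market caps are illustrative only.
metadata = [
    {"Symbol": "AAA", "Sector": "Technology", "MarketCap": 500e9},
    {"Symbol": "BBB", "Sector": "Technology", "MarketCap": 20e9},
    {"Symbol": "CCC", "Sector": "Health Care", "MarketCap": 80e9},
    {"Symbol": "DDD", "Sector": "Health Care", "MarketCap": 5e9},
    {"Symbol": "EEE", "Sector": "Technology", "MarketCap": None},  # missing cap
]

# Hypothetical tickers for which price files exist (note inconsistent formatting).
price_symbols = {"aaa", "BBB ", "CCC", "DDD", "EEE", "ZZZ"}


def normalize(symbol):
    """Standardize a ticker so both sources can be matched on it."""
    return symbol.strip().upper()


def rank_within_sector(metadata, price_symbols):
    """Keep companies present in both sources and rank them by market cap per sector."""
    available = {normalize(s) for s in price_symbols}
    by_sector = defaultdict(list)
    for row in metadata:
        symbol = normalize(row["Symbol"])
        # Inner join on the ticker; rows without a market cap cannot be ranked.
        if symbol in available and row["MarketCap"] is not None:
            by_sector[row["Sector"]].append((symbol, row["MarketCap"]))

    ranking = {}
    for sector, members in by_sector.items():
        members.sort(key=lambda member: member[1], reverse=True)
        for rank, (symbol, cap) in enumerate(members, start=1):
            ranking[symbol] = {"Sector": sector, "MarketCap": cap, "Rank": rank}
    return ranking


ranking = rank_within_sector(metadata, price_symbols)
for symbol, info in sorted(ranking.items()):
    print(symbol, info["Sector"], info["Rank"])
```

Rank 1 marks the largest company in each sector; tickers found in only one source, or lacking a market cap, are dropped rather than ranked.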

3. Leader–Follower Identification

  • Granger Causality Tests: test whether one stock's past returns improve the prediction of another stock's returns beyond what that stock's own history provides
  • Cross-Correlation Analysis: identify lagged relationships between stock returns
  • Such findings may reveal which firms act as information leaders, transmitting price signals that others follow, and could provide valuable insights for trading strategies, sector analysis, or portfolio diversification.
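As an illustration of the cross-correlation step, the sketch below correlates a candidate leader's returns on day t with a candidate follower's returns on day t + lag and reports the lag with the strongest correlation. The function names, the maximum lag of 5 days, and the synthetic return series are assumptions for the example, not the project's actual settings or data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(leader, follower, lag):
    """Correlate leader returns on day t with follower returns on day t + lag."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


def screen_pair(leader, follower, max_lag=5):
    """Return the lag with the largest absolute correlation, plus all scores."""
    scores = {
        lag: lagged_correlation(leader, follower, lag)
        for lag in range(max_lag + 1)
    }
    best = max(scores, key=lambda lag: abs(scores[lag]))
    return best, scores


# Synthetic daily returns: the follower echoes the leader one day later, plus noise.
random.seed(0)
leader = [random.gauss(0, 0.01) for _ in range(500)]
follower = [random.gauss(0, 0.002)] + [
    0.8 * leader[t - 1] + random.gauss(0, 0.002) for t in range(1, 500)
]

best, scores = screen_pair(leader, follower)
print("strongest lag:", best)
```

A screen like this only flags candidate pairs. Correlation at a positive lag does not establish predictive influence, so flagged pairs would still need a formal Granger test (for instance `grangercausalitytests` from statsmodels, which is a tooling assumption rather than something stated here).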

4. ETF vs Equally-Weighted Method

  • Construct two portfolios: one using global or sectoral ETFs (value-weighted), and one equally-weighted portfolio of individual stocks
  • Compute for both portfolios: daily and annualized returns, cumulative performance, volatility, and Sharpe ratio
  • Compare performances to evaluate the dominance of large-cap leaders versus the average behavior of smaller firms (followers)
  • Run the same comparison by sector, using Sector from Dataset 1 to build the equally-weighted sector portfolios
  • Interpret whether sector performance is driven by a few large firms or by more distributed contributions across all firms
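The portfolio metrics listed above could be computed along these lines. This is a sketch under stated assumptions: 252 trading days per year, daily rebalancing to equal weights, a zero risk-free rate, and a Sharpe ratio defined as annualized return over annualized volatility. All return figures are made up for illustration.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualization convention


def equal_weight_returns(returns_by_stock):
    """Daily returns of a portfolio rebalanced to equal weights every day."""
    series = list(returns_by_stock.values())
    return [sum(day) / len(day) for day in zip(*series)]


def performance(daily_returns, risk_free_annual=0.0):
    """Cumulative return, annualized return, annualized volatility and Sharpe ratio."""
    n = len(daily_returns)
    cumulative = math.prod(1 + r for r in daily_returns) - 1
    annual_return = (1 + cumulative) ** (TRADING_DAYS / n) - 1
    annual_vol = statistics.stdev(daily_returns) * math.sqrt(TRADING_DAYS)
    # Sharpe ratio is undefined when returns never vary.
    sharpe = (
        (annual_return - risk_free_annual) / annual_vol
        if annual_vol > 0
        else float("nan")
    )
    return {
        "cumulative": cumulative,
        "annual_return": annual_return,
        "annual_vol": annual_vol,
        "sharpe": sharpe,
    }


# Hypothetical daily returns for a value-weighted ETF and two individual stocks.
etf_returns = [0.01, -0.005, 0.007, 0.002]
stock_returns = {
    "AAA": [0.02, -0.01, 0.01, 0.0],
    "BBB": [0.0, 0.002, 0.004, 0.006],
}

ew_returns = equal_weight_returns(stock_returns)
etf_stats = performance(etf_returns)
ew_stats = performance(ew_returns)
print("ETF Sharpe:", etf_stats["sharpe"])
print("Equal-weight Sharpe:", ew_stats["sharpe"])
```

Running the same two functions on the stocks of a single sector gives the per-sector comparison; a value-weighted portfolio that clearly outperforms its equally-weighted counterpart would suggest that performance is concentrated in the largest firms.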

Proposed Timeline:

Week 7-8: Data Preparation & Initial Analysis (MILESTONE 2)

  • Data collection
  • Data cleaning and preprocessing pipeline
  • Exploratory data analysis and descriptive statistics
  • Merging the two cleaned datasets
  • Hierarchising companies in their sector

Week 9-10: Core Analysis Implementation

  • Implement Granger causality testing framework
  • Develop influence network construction methods
  • Conduct rolling window temporal analysis
  • Sector-based comparative analysis
  • Develop ETF vs Equally-weighted portfolio analysis
  • Analyse the impact of media shocks

Week 11: Validation & Refinement

  • Robustness checks with alternative methodologies
  • Statistical validation of identified relationships
  • Prepare preliminary results and visualizations

Week 12-13: Final Integration & Documentation (MILESTONE 3)

  • Integrate all analyses into final pipeline
  • Create comprehensive visualizations and reports
  • Finalize documentation and repository structure

Organization within the team:

Amine and Andrew :

  • Pulled in two metadata files (companies + all securities) and standardized tickers.
  • Kept only common stocks/ADRs for the stock universe and handled ETFs separately.
  • Spot-checked price files to confirm columns and date parsing were consistent.
  • Merged sectors, market caps, and IPO years onto each symbol and removed duplicates.
  • Applied an availability filter: kept symbols with at least 2 years of data.
  • Built two clean inputs: stocks_filtered (our stock universe) and etfs_only (for benchmarks).
  • Ran quick EDA: sector mix, rough market-cap spread, stock vs ETF counts, and data-coverage checks.
  • Noted survivorship-bias risk and set guardrails (e.g., focus window like 2015–2020, consider IPO-age weighting).
  • Outcome: a tidy, analysis-ready dataset wired for the next steps (Granger/lag tests, rolling windows, ETF vs EW).

Clement and Urszula :

  • Explored the data and its meaning to check that everything was relevant to our analysis (securities that are not common stocks are not meaningful for it, so we removed them from the data).
  • Found an external dataset providing more information about companies listed on the Nasdaq (dataset above) and checked that it overlaps sufficiently with the provided data.
  • Developed a method to hierarchise the companies within their sector (developed in the notebook).
  • Worked on the stock leadership analysis, exploring the statistical methods to be used.
  • Updated the README and everything else needed for the submission.

Léonard :

  • Developed the theoretical framework for comparing ETFs (value-weighted portfolios) with equally-weighted portfolios to evaluate how market performance is distributed between large-cap leaders and smaller followers.
  • Determined the key performance metrics to be used in the analysis (annualized returns, cumulative performance, volatility, Sharpe ratios).
  • Analyzed the presence of survivorship bias in the dataset, which only includes companies still listed on NASDAQ in 2020, thus excluding those that were delisted or bankrupt.
  • Proposed theoretical strategies, including limiting the analysis period to 2015–2020 and introducing a lifespan weighting approach based on the IPOyear variable from Dataset 1.

Contributions within the team for milestone 3:

  • Amine: Prepared analysis-ready datasets by cleaning and merging multiple data sources. Worked on the ETF vs. equally weighted portfolio analysis, including implementation, interpretation, and storytelling. Contributed to the development of the project website, particularly the creation of interactive plots and visual graphics used to present results.

  • Léonard: Focused on the ETF vs. equally weighted portfolio framework, including the theoretical motivation behind the comparison. Developed detailed explanations of the methodology and provided in-depth analysis of the results. Contributed visualizations and written interpretation for this part of the project.

  • Andrew: Worked on data normalization and the company hierarchization framework used throughout the analysis. Implemented the ranking calculations and contributed to the hierarchization storyline on the website. Created interactive Flourish visualizations allowing users to explore company size and rankings.

  • Clement: Developed performance metrics and contributed to the overall hierarchization methodology. Led the survivorship bias analysis and investigated sector size effects. Played a major role in website design, interactive graphics, and data storytelling, ensuring coherence across different sections of the project.

  • Urszula: Developed the statistical methodology for the leader–follower analysis, including the cross-correlation screening and Granger causality testing framework. Implemented the leader–follower analysis and contributed to the structure of the website. Designed and integrated interactive network graphs and heatmaps, as well as the accompanying data storytelling.

  • All team members: Collaborated on refining the README, validating results across analyses, and preparing the final submission. All members contributed to discussions on methodology choices and interpretation of results.

How to run the code:

  1. Clone the repository to your local machine.
  2. Ensure you have Python 3.9 or newer installed along with the required libraries listed in requirements.txt. You can install them using pip:
    pip install -r requirements.txt
    
  3. Download the provided Kaggle Stock Market Dataset and place it in the data/ folder.
  4. Download the additional dataset from Kaggle (link provided above) and place it also in the data/ folder.
  5. Navigate to the project directory and run the results.ipynb Jupyter notebook. All results and website-ready outputs (figures and JSON files) are generated directly from results.ipynb.

About

ada-2025-project-theoutliers created by GitHub Classroom
