ada-2025-project-theoutliers

Check out our data story website: https://ula1111111111.github.io/

README Milestone 2 and 3

Title: The Pulse of the Market: Who Sets the Rhythm?

Motivations:

Financial markets are often portrayed as dominated by a handful of giants, but do the biggest companies truly drive the movements of their entire industry? This project investigates whether industry leaders systematically influence smaller peers within NASDAQ sectors. Using historical stock price data enriched with external metadata, we will group companies by industry, develop objective criteria to rank them, and analyze how information and volatility propagate across the market. Our methodology combines time-series modeling and Granger causality testing to detect leader-follower patterns in price dynamics. We will also assess how persistent these relationships are and how they evolve during major market events. Another component compares value-weighted ETFs with equally weighted portfolios to quantify whether performance is concentrated among large-cap leaders or broadly distributed. By highlighting where price discovery originates, from dominant firms or distributed behavior, this project offers insights into market power and the dynamics of information diffusion.

Research Questions:

How do we define a "leader" and a "follower" in stock movements? How should companies be grouped into sectors and ranked within them?
How can we detect directional influence between stocks within a sector? We work with daily return time series and test for influence using lagged correlations or Granger causality.
Are leader-follower dynamics consistent across sectors? Do some sectors have stronger leadership patterns than others?
How stable are these influence patterns over time, and how do they evolve during major market events?
What does ETF analysis reveal about the performance of market leaders, sector averages, and followers, and about overall market concentration?
How does survivorship bias impact the validity of our conclusions, and what steps can be taken to reduce its effects? Which time window should we analyze?

Additional dataset:

https://www.kaggle.com/datasets/dhimananubhav/nasdaq-company-list

This dataset is essential to our analysis. It provides the name, ticker, sector, and market capitalization of every company listed on the Nasdaq, which lets us group companies (and therefore stocks) by sector for our sector-based analysis.

Methods:

1. Data Handling & Preprocessing

2. Merging Datasets & Sectorizing companies

Match the provided and external data to obtain the sector and market cap of every Nasdaq company
Rank companies within their sector based on relevant criteria (developed in the notebook)
Produce the final dataset, which gives each company's rank within its sector and feeds the later analyses
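
The merge-and-rank step above can be sketched as follows. This is only an illustration on toy data: the column names (`Symbol`, `Sector`, `MarketCap`, `SectorRank`) are assumptions, and ranking by market cap alone is a simplifying stand-in for the criteria developed in the notebook.

```python
import pandas as pd

# Hypothetical toy inputs; the real column names may differ.
price_symbols = pd.DataFrame({"Symbol": ["AAA", "BBB", "CCC", "DDD"]})
companies = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC", "EEE"],
    "Sector": ["Technology", "Technology", "Health Care", "Finance"],
    "MarketCap": [500.0, 120.0, 80.0, 60.0],
})

# Inner join keeps only the symbols present in both sources.
merged = price_symbols.merge(companies, on="Symbol", how="inner")

# Rank inside each sector: 1 = largest market cap (assumed criterion).
merged["SectorRank"] = (
    merged.groupby("Sector")["MarketCap"]
    .rank(method="first", ascending=False)
    .astype(int)
)
merged = merged.sort_values(["Sector", "SectorRank"]).reset_index(drop=True)
print(merged)
```

Symbols that appear in only one source (`DDD` and `EEE` here) drop out of the inner join, so it is worth checking how many tickers are lost at this step.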

3. Leader–Follower Identification

Granger causality tests: test whether past values of one stock's returns improve the prediction of another stock's returns
Cross-correlation analysis: identify lagged relationships between stock returns
Such findings may reveal which firms act as information leaders, transmitting price signals that others follow, and could provide valuable insights for trading strategies, sector analysis, or portfolio diversification.
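
As a rough illustration of the two techniques above, the sketch below computes a lagged correlation and a Granger-style F statistic with plain NumPy on synthetic data. The function names, the lag choice, and the simulated series are all assumptions made for this example; it is not the project's implementation, and it stops at the F statistic without computing a p-value.

```python
import numpy as np


def lagged_corr(leader, follower, lag):
    """Correlation between leader[t - lag] and follower[t], for lag >= 1."""
    return float(np.corrcoef(leader[:-lag], follower[lag:])[0, 1])


def granger_f_stat(x, y, p):
    """F statistic for 'x Granger-causes y' using p lags of each series."""
    n = len(y)
    target = y[p:]
    # Column k holds the series shifted back by k steps, aligned with target.
    y_lags = np.column_stack([y[p - k : n - k] for k in range(1, p + 1)])
    x_lags = np.column_stack([x[p - k : n - k] for k in range(1, p + 1)])
    const = np.ones((n - p, 1))
    restricted = np.hstack([const, y_lags])            # own lags only
    unrestricted = np.hstack([const, y_lags, x_lags])  # own lags + x lags

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = (n - p) - unrestricted.shape[1]
    return ((rss_r - rss_u) / p) / (rss_u / dof)


# Synthetic returns in which y follows x with a one-day delay.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
noise = rng.normal(scale=0.5, size=500)
y = np.empty(500)
y[0] = noise[0]
y[1:] = 0.8 * x[:-1] + noise[1:]

corr_1 = lagged_corr(x, y, 1)
f_xy = granger_f_stat(x, y, 2)  # x -> y, expected to be large
f_yx = granger_f_stat(y, x, 2)  # y -> x, expected to be small
print(corr_1, f_xy, f_yx)
```

Running the test in both directions is what gives the relationship a direction: a large statistic for x to y combined with a small one for y to x is consistent with x leading y.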

4. ETF vs Equally-Weighted Method

Construct two portfolios: one using global or sectoral ETFs (value-weighted), and one equally-weighted portfolio of individual stocks
Compute for both portfolios: daily and annualized returns, cumulative performance, volatility, and Sharpe ratio
Compare performances to evaluate the dominance of large-cap leaders versus the average behavior of smaller firms (followers)
Run the same comparison by sector, using Sector from Dataset 1 to build the equally-weighted sector portfolios
Interpret whether sector performance is driven by a few large firms or by more distributed contributions across all firms
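
The metrics listed above could be computed along these lines. This is a minimal sketch on made-up prices: the 252-trading-day convention, the zero risk-free rate, the daily rebalancing implied by averaging returns, and every name in the code are assumptions for illustration.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # assumed number of trading days per year


def performance_summary(daily_returns, risk_free_annual=0.0):
    """Cumulative return, annualized return, annualized volatility, Sharpe ratio."""
    cumulative = float((1.0 + daily_returns).prod() - 1.0)
    annualized = (1.0 + cumulative) ** (TRADING_DAYS / len(daily_returns)) - 1.0
    volatility = float(daily_returns.std(ddof=1) * np.sqrt(TRADING_DAYS))
    sharpe = (annualized - risk_free_annual) / volatility if volatility > 0 else float("nan")
    return {
        "cumulative_return": cumulative,
        "annualized_return": annualized,
        "annualized_volatility": volatility,
        "sharpe_ratio": sharpe,
    }


# Hypothetical prices: three stocks and one value-weighted ETF.
stock_prices = pd.DataFrame({
    "AAA": [100.0, 101.0, 102.0, 101.0, 103.0],
    "BBB": [50.0, 50.5, 50.0, 51.0, 51.5],
    "CCC": [20.0, 20.2, 20.4, 20.1, 20.6],
})
etf_prices = pd.Series([200.0, 202.0, 203.0, 202.0, 205.0])

stock_returns = stock_prices.pct_change().dropna()
equal_weighted = stock_returns.mean(axis=1)  # equal weight, rebalanced daily
etf_returns = etf_prices.pct_change().dropna()

summary = pd.DataFrame({
    "etf": performance_summary(etf_returns),
    "equal_weighted": performance_summary(equal_weighted),
})
print(summary)
```

For the per-sector comparison, the same function can be applied to the mean return of each sector's stocks.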

Proposed Timeline:

Week 7-8: Data Preparation & Initial Analysis (MILESTONE 2)

Data collection
Data cleaning and preprocessing pipeline
Exploratory data analysis and descriptive statistics
Merging the two cleaned datasets
Hierarchising companies in their sector

Week 9-10: Core Analysis Implementation

Implement Granger causality testing framework
Develop influence network construction methods
Conduct rolling window temporal analysis
Sector-based comparative analysis
Develop ETF vs Equally-weighted portfolio analysis
Analyse the impact of media shocks

Week 11: Validation & Refinement

Robustness checks with alternative methodologies
Statistical validation of identified relationships
Prepare preliminary results and visualizations

Week 12-13: Final Integration & Documentation (MILESTONE 3)

Integrate all analyses into final pipeline
Create comprehensive visualizations and reports
Finalize documentation and repository structure

Organization within the team:

Amine and Andrew :

Pulled in two metadata files (companies + all securities) and standardized tickers.
Kept only common stocks/ADRs for the stock universe and handled ETFs separately.
Spot-checked price files to confirm columns and date parsing were consistent.
Merged sectors, market caps, and IPO years onto each symbol and removed duplicates.
Applied an availability filter: kept symbols with at least 2 years of data.
Built two clean inputs: stocks_filtered (our stock universe) and etfs_only (for benchmarks).
Ran quick EDA: sector mix, rough market-cap spread, stock vs ETF counts, and data-coverage checks.
Noted survivorship-bias risk and set guardrails (e.g., focus window like 2015–2020, consider IPO-age weighting).
Outcome: a tidy, analysis-ready dataset wired for the next steps (Granger/lag tests, rolling windows, ETF vs EW).
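
One way to express the availability filter described above is shown below. It is a sketch under assumptions: it measures "2 years" as the calendar span between a symbol's first and last observation, and the column names and toy rows are hypothetical.

```python
import pandas as pd

MIN_YEARS = 2  # threshold from the availability filter

# Hypothetical long-format price index: one row per symbol and trading day.
prices = pd.DataFrame({
    "Symbol": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Date": pd.to_datetime([
        "2015-01-02", "2016-06-01", "2018-01-02",
        "2019-01-02", "2019-12-31",
    ]),
})

# Calendar span between first and last observation for each symbol.
dates_by_symbol = prices.groupby("Symbol")["Date"]
span = dates_by_symbol.max() - dates_by_symbol.min()

# Keep symbols whose history covers at least MIN_YEARS (365-day years assumed).
kept = sorted(span[span >= pd.Timedelta(days=365 * MIN_YEARS)].index)
stocks_filtered = prices[prices["Symbol"].isin(kept)]
print(kept)
```

Counting trading days per symbol instead of the calendar span would be stricter for stocks with long gaps in their history.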

Clement and Urszula :

Explored the data and checked which parts are relevant to our analysis. Securities that are not common stocks are not meaningful for our purposes, so we removed them from the data.
Found an external dataset with more information about the companies listed on the Nasdaq (dataset above) and checked that it matches the provided data closely enough.
Developed a method to rank companies within their sector (developed in the notebook).
Worked on the stock leadership analysis and explored the statistical methods to use.
Updated the README and everything else needed for the submission.

Léonard :

Developed the theoretical framework for comparing ETFs (value-weighted portfolios) with equally-weighted portfolios to evaluate how market performance is distributed between large-cap leaders and smaller followers.
Determined the key performance metrics to be used in the analysis (annualized returns, cumulative performance, volatility, Sharpe ratios).
Analyzed the presence of survivorship bias in the dataset, which only includes companies still listed on NASDAQ in 2020, thus excluding those that were delisted or bankrupt.
Proposed theoretical strategies, including limiting the analysis period to 2015–2020 and introducing a lifespan weighting approach based on the IPOyear variable from Dataset 1.

Contributions within the team for milestone 3:

Amine: Prepared analysis-ready datasets by cleaning and merging multiple data sources. Worked on the ETF vs. equally weighted portfolio analysis, including implementation, interpretation, and storytelling. Contributed to the development of the project website, particularly the creation of interactive plots and visual graphics used to present results.
Léonard: Focused on the ETF vs. equally weighted portfolio framework, including the theoretical motivation behind the comparison. Developed detailed explanations of the methodology and provided in-depth analysis of the results. Contributed visualizations and written interpretation for this part of the project.
Andrew: Worked on data normalization and the company hierarchization framework used throughout the analysis. Implemented the ranking calculations and contributed to the hierarchization storyline on the website. Created interactive Flourish visualizations allowing users to explore company size and rankings.
Clement: Developed performance metrics and contributed to the overall hierarchization methodology. Led the survivorship bias analysis and investigated sector size effects. Played a major role in website design, interactive graphics, and data storytelling, ensuring coherence across different sections of the project.
Urszula: Developed the statistical methodology for the leader–follower analysis, including the cross-correlation screening and Granger causality testing framework. Implemented the leader–follower analysis and contributed to the structure of the website. Designed and integrated interactive network graphs and heatmaps, as well as the accompanying data storytelling.
All team members: Collaborated on refining the README, validating results across analyses, and preparing the final submission. All members contributed to discussions on methodology choices and interpretation of results.

How to run the code:

Clone the repository to your local machine.
Ensure you have Python 3.9 or newer installed along with the required libraries listed in requirements.txt. You can install them using pip:
```
pip install -r requirements.txt
```
Download the provided Kaggle Stock Market Dataset and place it in the data/ folder.
Download the additional dataset from Kaggle (link provided above) and place it in the data/ folder as well.
Navigate to the project directory and run the results.ipynb Jupyter notebook. All results and website-ready outputs (figures and JSON files) are generated directly from results.ipynb.
