GitHub - Riyush/ML-For-Econ-Project: Data collection and analysis files for ECMA 31350

Repository contents:

Custom_Motley_Scraper
rfile and data
.DS_Store
2012_JBF_Earning Call.pdf
CEO Dismissal Data 2021.02.03.xlsx
CEO Dismissal Database 2021.02.03.docx
Disclosure Sentiment Machine Learning vs Dicionary Methods MAIN PAPER.pdf
README
ecma_training_proj.ipynb
requirements.txt
sentiment_trans_02-24_new.xlsx
vars_v1.xlsx

Data:

This repository contains all initial work done on the final ML for ECON project.
The first step was building the dataset, which lives in the Custom_Motley_Scraper
folder. This folder contains a subfolder called transcripts, which is the output
folder where raw transcripts are collected. There are 2 important Python files:
Scraping_Functions.py and Scraping_Operations.py. The functions file defines
the functions needed to scrape fool.com. The operations file uses those functions
to systematically scrape data. There are also 2 Excel files:
SP500 Ticker and exchange.xlsx and companies_copy.xlsx. These files contain a list
of companies to be scraped, their tickers, and their exchanges. This information
is needed for the scraper to work.

To use the scraper, download the repository and modify line 7 of the
operations file (Scraping_Operations.py). This line defines the directory the
transcripts are written to. Set it to a directory that exists on your computer.
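
As a rough illustration of the kind of change involved, the sketch below builds
an output path with pathlib and creates the directory if it is missing. The
names OUTPUT_DIR and ensure_output_dir are hypothetical and are not taken from
Scraping_Operations.py; the variable actually defined on line 7 may have a
different name and form.

```python
from pathlib import Path

# Hypothetical name and location: adapt to whatever line 7 of
# Scraping_Operations.py actually defines.
OUTPUT_DIR = Path.home() / "ML-For-Econ-Project" / "transcripts"


def ensure_output_dir(path: Path) -> Path:
    """Create the transcript output directory if it does not exist yet."""
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Calling ensure_output_dir(OUTPUT_DIR) before a scraping run avoids a failure
on the first file write when the directory has not been created yet.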

In theory, the operations file was meant to work by scanning through the
entries of the SP500 Ticker and exchange.xlsx file and scraping all transcripts
in one fell swoop. In practice, unpredictable behavior of the website caused
errors with certain companies. To work around this, the companies_copy.xlsx file
was initially created as a copy of SP500 Ticker and exchange.xlsx. We would run
the scraper and let it scrape transcripts until an error occurred. When an error
occurred, we would delete from companies_copy.xlsx all the entries that the
scraper had successfully scraped, then rerun the operations file until an error
occurred again. In practice this meant scraping 10-15 companies per run before
hitting an error. This resulted in a slightly tedious but manageable scraping
process. We stopped scraping when we observed that fool.com didn't have a
complete 5 years of transcripts for more obscure companies. This resulted in
186 companies scraped and roughly 3300 transcripts as our dataset.
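
The run-until-error workflow described above can be sketched in a few lines.
This is only an illustration, not the code in Scraping_Operations.py: the
function scrape_until_error, the scrape_company callback, and the
(name, ticker, exchange) tuple layout are all assumptions made for the example.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed row layout: (company name, ticker, exchange).
Company = Tuple[str, str, str]


def scrape_until_error(
    companies: Iterable[Company],
    scrape_company: Callable[[Company], None],
) -> Tuple[List[Company], List[Company]]:
    """Scrape companies in order and stop at the first failure.

    Returns (scraped, remaining). `remaining` starts with the company
    that raised, so it can be inspected, skipped, or retried.
    """
    pending = list(companies)
    scraped: List[Company] = []
    for index, company in enumerate(pending):
        try:
            scrape_company(company)  # hypothetical per-company scraper
        except Exception as exc:  # the site fails in unpredictable ways
            print(f"Stopped at {company[1]}: {exc}")
            return scraped, pending[index:]
        scraped.append(company)
    return scraped, []
```

Writing the remaining list back to companies_copy.xlsx after each run would
replace the manual step of deleting the rows that were already scraped.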

All analysis of the data is done in ecma_training_proj.ipynb and the other R files.