Here you can find a presentation explaining who we are and the problem we want to solve.
The dataset can be found in this Google Drive folder.
- categories.json: A JSON file that contains our current categorization. It has three levels: Topic, Subtopic, and Tag. Only the first level (Topic) is needed for this model.
- initiatives.json.zip: A zipped file with our full dataset in JSON format. Only the title and content fields of each entry should be used to identify the SDGs. The entire dataset is tagged, but fewer than 50% of the documents returned results with our current system.
- small-batch.json: A sample file with just 20 items for testing purposes.
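
To get started with the files above, here is a minimal sketch of how the entries could be loaded and reduced to the two fields the model needs. It assumes that each file holds a top-level JSON array of objects with `title` and `content` keys, and that the zip archive contains a single JSON file. The actual layout may differ, so adjust accordingly. The helper names `read_entries` and `text_for_model` are illustrative and not part of this repository.

```python
import json
import zipfile
from pathlib import Path


def read_entries(path):
    """Load entries from a .json file, or from the first .json member of a .zip."""
    path = Path(path)
    if path.suffix == ".zip":
        with zipfile.ZipFile(path) as archive:
            # Assumption: the archive holds a single JSON file.
            member = next(n for n in archive.namelist() if n.endswith(".json"))
            with archive.open(member) as fh:
                return json.load(fh)
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def text_for_model(entry):
    """Join the only two fields the model needs: title and content."""
    # Assumption: the fields are named "title" and "content" and may be missing or null.
    title = entry.get("title") or ""
    content = entry.get("content") or ""
    return f"{title}\n{content}".strip()


if __name__ == "__main__":
    sample = Path("small-batch.json")
    if sample.exists():
        entries = read_entries(sample)
        texts = [text_for_model(e) for e in entries]
        print(f"Loaded {len(texts)} entries")
```

Starting with `small-batch.json` keeps iteration fast. The same `read_entries` call should work on `initiatives.json.zip` without unzipping it first, provided the assumptions above hold.
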
For better versioning, please install Jupytext and follow this guide.