The extraction module computes raw metrics from occurrence records. It crawls a directory containing Darwin Core Archives (downloaded from the GBIF website) and, for each one, generates a simple JSON report file containing raw metrics and other useful data (taxonomy, multimedia extensions, ...). These JSON reports are then used as input to the aggregation module.
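As an illustration, here is a minimal sketch of that workflow using python-dwca-reader. The report layout and metric shown are assumptions for illustration only, not the actual format produced by `bin/extract_data.py`:

```python
import json
from pathlib import Path

from dwca.read import DwCAReader  # provided by python-dwca-reader

DATA_SOURCE_DIR = Path("/path/to/archives")  # downloaded DwC-A zip files
REPORTS_DIR = Path("/path/to/reports")       # destination for JSON reports

for archive_path in sorted(DATA_SOURCE_DIR.glob("*.zip")):
    # python-dwca-reader does the low-level parsing of the archive.
    with DwCAReader(str(archive_path)) as dwca:
        # One example of a raw metric: the number of core (occurrence) rows.
        occurrence_count = sum(1 for _ in dwca)

    report = {
        "archive": archive_path.name,        # hypothetical report layout
        "occurrence_count": occurrence_count,
    }
    report_file = REPORTS_DIR / (archive_path.stem + ".json")
    report_file.write_text(json.dumps(report, indent=2))
```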
- One archive/report can describe multiple datasets AND a specific dataset can be spread across several archives, hence the need for the aggregation module (see the sketch below). In return, this design provides a lot of flexibility in terms of data volume and horizontal scalability.
- At this stage, we are able to process 15GB+ compressed archives in a couple of hours on a standard MacBook Pro. By splitting the whole GBIF archive into smaller files, we were ultimately able to parse all GBIF occurrence data.
- This module is very concise since the hard/low-level work is delegated to python-dwca-reader.
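To make the many-to-many relationship between archives and datasets concrete, here is a hypothetical sketch of how per-dataset counts found in several reports could be merged by the aggregation module. Field names and keys are invented for illustration; the real report format may differ:

```python
from collections import defaultdict

# Two hypothetical reports: "dataset-1" appears in both archives,
# so its metrics must be summed across reports.
reports = [
    {"archive": "download-a.zip",
     "datasets": {"dataset-1": {"occurrence_count": 120},
                  "dataset-2": {"occurrence_count": 30}}},
    {"archive": "download-b.zip",
     "datasets": {"dataset-1": {"occurrence_count": 80}}},
]

totals = defaultdict(int)
for report in reports:
    for dataset_key, metrics in report["datasets"].items():
        totals[dataset_key] += metrics["occurrence_count"]

print(dict(totals))  # {'dataset-1': 200, 'dataset-2': 30}
```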
- Install the requirements:
$ pip install -r requirements.txt
- Download data from GBIF (for example, search per dataset or per publishing country) and place the Darwin Core Archives (zip files) in an empty directory.
- Create another empty directory somewhere else to receive the reports.
- Configure these two directories in `DATA_SOURCE_DIR` and `REPORTS_DIR` (at the top of `bin/extract_data.py`); see the example after this list.
- Run the extractor:
$ python bin/extract_data.py
- That's it! You can now use the generated reports in the aggregation module.
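For reference, configuring the two directories amounts to editing two constants at the top of `bin/extract_data.py`; the paths below are placeholders:

```python
# Placeholder paths; adjust to your environment.
DATA_SOURCE_DIR = "/data/gbif/archives"  # directory holding the downloaded zip files
REPORTS_DIR = "/data/gbif/reports"       # directory that will receive the JSON reports
```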