Registering biodiversity-related vocabulary as Wikidata lexemes and linking their senses to Wikidata items
- Wikidata
- COL
Content mining involves a combination of methods that take into account knowledge about
- the documents being mined
- the languages used therein
- the documents’ content and
- specific use cases at hand.

While special workflows exist for mining biodiversity information (e.g. Plazi), it would be desirable to have such workflows more readily available to and reusable by a broader community, and to invite community members to help curate the underlying information. Integrating such workflows with Wikidata would be a good step forward in this direction.
Wikidata is a FAIR and open semantic database that is community-curated following the Wikipedia approach that "anyone can edit it". Besides information about concepts and their relationships with each other and the wider world (known as items and properties in Wikidata parlance, largely equivalent to the subjects and predicates in semantic web lingo), Wikidata has also begun to store structured information about languages and their components, which it calls lexemes. The idea of this hackathon project would be to
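To make the lexeme–item link concrete: a lexeme's sense can be connected to an item via the Wikidata property P5137 ("item for this sense"). The sketch below builds a SPARQL query for the Wikidata Query Service that lists lexemes with a sense pointing to a given item. The item ID used in the demo (Q729, "animal") is just an illustrative choice, and the query is only constructed here, not executed.

```python
# Sketch: build a SPARQL query listing lexemes whose senses link to a
# given Wikidata item via P5137 ("item for this sense").

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # Wikidata Query Service

def lexemes_for_item_query(qid: str) -> str:
    """Return a SPARQL query for lexemes with a sense linked to `qid`."""
    return f"""
    SELECT ?lexeme ?lemma ?sense WHERE {{
      ?lexeme wikibase:lemma ?lemma ;
              ontolex:sense ?sense .
      ?sense wdt:P5137 wd:{qid} .   # item for this sense
    }}
    """

query = lexemes_for_item_query("Q729")  # Q729 = "animal"
print(query)
```

Such a query could be sent to the endpoint above (e.g. with `requests`, using `format=json`) to seed a review list of already-linked vocabulary before mining new terms.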
- pick a set of documents - possibly some that other groups at the hackathon are analyzing or producing, such as taxonomic publications or taxon treatments - or similar resources in one or more target languages and
- use natural language processing to identify biodiversity-related vocabulary in those documents, register it as Wikidata lexemes and link the lexemes' senses to the corresponding Wikidata items and
- make provisions to scale this up.
- OPTIONAL: Use visualizations via Ordia to assist with quality control, prioritization and data exploration.
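The steps above can be sketched minimally as follows: extract candidate terms from a snippet of text and prepare a Wikidata API request (`action=wbsearchentities` with `type=lexeme`) to check which terms already exist as lexemes. The tokenizer and length filter are deliberately naive placeholders for a real NLP pipeline, and the request is only constructed here, not sent.

```python
import re

# Naive candidate-term extraction; a real workflow would use a proper
# NLP pipeline (tokenization, lemmatization, named-entity recognition).
def candidate_terms(text: str) -> list[str]:
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    # crude filter: drop very short tokens (articles, prepositions, etc.)
    return [t for t in tokens if len(t) > 3]

def lexeme_search_params(term: str, language: str = "en") -> dict:
    """Parameters for a wbsearchentities lexeme lookup on the Wikidata API."""
    return {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": "lexeme",
        "format": "json",
    }

text = "Range extension and conservation status of the rare Solanaceae shrub"
for term in candidate_terms(text):
    params = lexeme_search_params(term)
    # e.g. requests.get("https://www.wikidata.org/w/api.php", params=params)
    print(params["search"])
```

Terms for which the lookup returns no match would be candidates for new lexemes; matched terms would move on to sense linking (e.g. via P5137).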
- Global assessment report on biodiversity and ecosystem services of the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services
- Range extension and conservation status of the rare Solanaceae shrub, Solanum conocarpum
- Text-to-lexemes
The minimal outcome of this hackathon project would be the documentation of a workflow - for at least one BiCIKL infrastructure - that covers in detail the steps listed under "2" in the methodology section above, ideally based on use cases (as per "1") for which automation (as per "3") is within reach. In the long run, doing this systematically for one or more of the BiCIKL research infrastructures will help disambiguate ambiguous terms more reliably and recognize biodiversity-related entities with higher precision across a growing range of linguistic and thematic contexts and document types. Such improved workflows can in turn enrich the information in Wikidata, providing the basis for a positive data-quality feedback loop between content mining and Wikidata-based curation of linguistic and subject-matter contexts for such content. For documentation of a similar event focused on climate-related lexemes, see Climate Lexeme Week.
Please list your name, link it to your GitHub profile and say a few words about what you'd like to do in the framework of this hackathon. Feel free to add further information, e.g. ORCID, Wikimedia user name or social media handles.
- Daniel Mietchen (0000-0001-9488-1870, User:Daniel Mietchen, @EvoMRI): I plan to focus on vocabulary related to species that are invasive, endangered, recently extinct or newly described.
- Finn Årup Nielsen (Scholia): in particular Danish lexemes.
- you?