Rationale
The Darwin Core (DwC) standard itself is a general-use1 vocabulary that defines terms which can be used to facilitate the sharing of information about biological diversity.2 As such, it does not prescribe any method of implementation. At present, there are guides for implementing DwC in text files3 and in XML4. Since September 2010, there has been significant discussion about how DwC terms could be used as properties to describe resources using the Resource Description Framework (RDF).5
RDF differs from other methods of information transfer in that it does not assume that there is a designated recipient of the information, nor does it assume that the recipient has preexisting knowledge of how the information should be understood. RDF is designed to permit the "discovery" of information by an undetermined recipient, so the provider must supply everything that the recipient would need to understand the information received. This is different from a text transfer, in which the sender and receiver agree in advance on which fields will be included and on the format of the file that contains them, and from an XML transfer, where a schema may be used to spell out the number, types, and permitted order of fields in the XML document used to transfer the information. The other assumption is that the recipient will be a computer (or, perhaps more accurately, a computer program known as "the consuming application" or "client") that is not necessarily under the guidance of a human. The relevance of the lack of preexisting knowledge about a resource in the biodiversity informatics community is summed up thus: "Biodiversity research is typically carried out by combining data of different kinds from multiple sources. The providers of data do not know who will use their data or how it will be combined with data from other sources. The consumer needs some level of commonality across all the data received so that it can be combined for analysis without the need to write computer software for every new combination."6
RDF itself does not have a single method of representation ("serialization"). It can be represented by diagrams ("graphs") or text formats such as N3 7. However, the default serialization when a URI is "looked up" (dereferenced) by a computer is XML.
RDF itself is simply a way to describe the properties of "things" (resources). RDF can be "used" by different communities to serve various purposes. One such community is "Linked Data" or "Linked Open Data" (LOD). Linked Data is "a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF."8 Some draw a distinction between the goals of LOD and "the Semantic Web"9. Whereas LOD is focused on exposing and collecting large amounts of information with minimal effort, the Semantic Web is more restrictive because it focuses effort on asserting only things that are true and subject to tractable reasoning.10 If the primary purpose of terms used in RDF is to link one resource to another, those terms can be defined simply using the RDF Vocabulary Description Language (RDFS)11. However, if terms are intended to convey more meaning and to enable reasoning, they may be given more complex definitions using the Web Ontology Language (OWL).12
1 http://lists.tdwg.org/pipermail/tdwg-content/2010-October/000013.html
2 http://rs.tdwg.org/dwc/index.htm
3 http://rs.tdwg.org/dwc/terms/guides/text/index.htm
4 http://rs.tdwg.org/dwc/terms/guides/xml/index.htm
6 TDWG Technical Roadmap 2007. http://wiki.tdwg.org/twiki/pub/TAG/RoadMap2007/TAG_Roadmap_2007_final.pdf
7 http://www.w3.org/DesignIssues/Notation3
9 http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001707.html
10 http://lists.tdwg.org/pipermail/tdwg-content/2010-November/001968.html
11 http://www.w3.org/TR/rdf-schema/
12 http://www.w3.org/TR/owl2-overview/
Because of the "self-explanatory" nature of RDF, general guidelines1 must be followed in constructing it so that a non-human client can "understand" what it means without help other than what the client can discover on its own. Each unit of information (known as a "triple") is like a sentence consisting of a subject, a predicate, and an object. The subject is represented by a URI (Uniform Resource Identifier), which in the context of the biodiversity domain would be a globally unique identifier (GUID) for the "thing" (known as a resource) being described. The predicate designates the relationship between the subject and the object. The predicate is generally a property of the thing that is being described (i.e. the subject). If the property is described by text, it is called a data property and the object is called a "literal". If the property is a relationship with another resource, the object is the URI of that other resource and the property is called an object property.
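The subject/predicate/object structure described above can be sketched as a small data model. This is only an illustration in Python: the class names `URI`, `Literal`, and `Triple`, the `example.org` identifiers, and the `relatedTo` predicate are hypothetical and are not part of DwC or of any RDF library.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URI:
    """A resource identifier (hypothetical helper class)."""
    value: str


@dataclass(frozen=True)
class Literal:
    """A plain text value (hypothetical helper class)."""
    value: str


@dataclass(frozen=True)
class Triple:
    """One statement: subject, predicate, object."""
    subject: URI
    predicate: URI
    obj: Union[URI, Literal]


def property_kind(triple: Triple) -> str:
    """Classify the predicate by the kind of object it points to."""
    if isinstance(triple.obj, Literal):
        return "data property"
    return "object property"


# Hypothetical GUID for the resource being described.
specimen = URI("http://example.org/specimen/123")

# A data property: the object is a literal.
title = Triple(
    specimen,
    URI("http://purl.org/dc/terms/title"),
    Literal("Pressed leaf specimen"),
)

# An object property: the object is the URI of another resource.
# The predicate here is invented purely for illustration.
link = Triple(
    specimen,
    URI("http://example.org/terms/relatedTo"),
    URI("http://example.org/image/456"),
)

print(property_kind(title))
print(property_kind(link))
```

The only thing that distinguishes the two kinds of property in this sketch is whether the object is a literal or a URI, which mirrors the distinction drawn in the text.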
Predicates are actually URIs themselves. For example
http://purl.org/dc/terms/title
is the Dublin Core (DCMI) term2 for the title of a resource. The first part of the URI (the namespace) is often abbreviated. "http://purl.org/dc/terms/" may be abbreviated "dcterms:", which would make the abbreviated URI of the predicate "dcterms:title". Although in theory a client can figure out what a predicate "means" by looking up (dereferencing3) its URI, some URIs won't actually dereference, and as a practical matter the information provided doesn't really tell the client much. In the case of dcterms:title,4 a client could learn that it is a property whose object is a literal and that its label is "Title". That's not really very informative for a computer. What makes dcterms:title useful is that it is "well-known", i.e. a lot of people who are writing client applications know what dcterms:title is intended to mean and they can therefore do something useful with it.
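The namespace abbreviation described above amounts to a simple string substitution. The following Python sketch shows one way it might be done; the `PREFIXES` table and the `expand` and `abbreviate` helpers are hypothetical names chosen for this illustration, and the prefix labels themselves are only conventions.

```python
# Prefix table: abbreviation -> namespace URI.
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
}


def expand(abbreviated: str, prefixes: dict = PREFIXES) -> str:
    """Turn 'dcterms:title' into the full predicate URI."""
    prefix, sep, local_name = abbreviated.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {abbreviated!r}")
    return prefixes[prefix] + local_name


def abbreviate(uri: str, prefixes: dict = PREFIXES) -> str:
    """Turn a full URI into its abbreviated form when a prefix is known."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no known namespace: leave the URI unchanged


print(expand("dcterms:title"))
print(abbreviate("http://rs.tdwg.org/dwc/terms/occurrenceID"))
```

The abbreviation is purely a notational convenience: the predicate is still the full URI, which is what a client would dereference.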
What this means is that unless predicates describe properties using well-known terms, the information that they impart is close to useless. For that reason, it is most practical to describe the properties of a resource using a vocabulary that is well-known to the community that is likely to be creating applications that will make use of the information. In the case of the biodiversity community, Darwin Core terms are probably the best known and therefore would be the best candidates to serve as descriptive properties.
There are many DwC terms5 that are suitable for use as data properties having literal objects. Unfortunately, there are few or no DwC terms that are suitable for use as object properties having URI objects. Such properties are needed to describe how a resource in one class is related to a resource in another class. DwC defines a number of terms whose names end in "ID", e.g. occurrenceID, eventID, taxonID, etc. It is not clear whether these terms should refer to the identifier of the subject itself, or whether they should refer to the identifier of the object of a property.6 They are used both ways in examples given for XML7, and apparently they can be used however the user prefers.8 The lack of clarity about the DwC "ID" terms makes them unsuitable for use as object properties. There have been several suggested solutions for the lack of object properties in DwC, including developing social conventions9, using existing terms to refer to URIs and applying rdfs:label properties to identify the string versions10, using the same term names but putting them in different namespaces11, and creating new terms specifically designed to be object properties12.
1 http://www.w3.org/TR/rdf-primer/
2 http://dublincore.org/documents/dcmi-terms/
3 http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001670.html
4 http://purl.org/dc/terms/title
5 http://rs.tdwg.org/dwc/terms/index.htm
6 http://lists.tdwg.org/pipermail/tdwg-content/2010-August/000061.html
7 http://rs.tdwg.org/dwc/terms/guides/xml/index.htm
8 http://lists.tdwg.org/pipermail/tdwg-content/2010-September/000050.html
9 http://lists.tdwg.org/pipermail/tdwg-content/2010-October/000011.html
10 http://lists.tdwg.org/pipermail/tdwg-content/2010-September/000057.html and http://lists.tdwg.org/pipermail/tdwg-content/2010-October/000010.html
11 http://lists.tdwg.org/pipermail/tdwg-content/2010-October/000013.html
12 http://lists.tdwg.org/pipermail/tdwg-content/2010-September/000054.html and http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001788.html
A major imperative is that the best practices for implementing globally unique identifiers (GUIDs) require that GUIDs in the biodiversity informatics community be in the form of URIs. Those URIs should be resolvable, and the default metadata response format should be RDF serialized as XML.1 The Global Biodiversity Information Facility (GBIF) encourages the use of URIs and encourages data providers to deploy RDF services.2 Major efforts such as BiSciCol3 are designing their implementations around HTTP URI GUIDs with the assumption that they will dereference to produce RDF. "If GUIDs are used to uniquely identify 'pieces' of data we need to have a shared understanding of what we mean by a 'piece of data' i.e. what kind of thing is it that a particular id applies to, a specimen, a person, an observation, a complete data set. We also need to have a shared understanding of at least some of the properties we use to describe these things. ... HTTP should be thought of as the universal syntax for addressable resources."4
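As a rough illustration of what a client might do when dereferencing an HTTP URI GUID, the Python sketch below builds a request that asks for RDF serialized as XML. The `example.org` GUID and the `build_rdf_request` helper are hypothetical, and whether any particular server honors the `Accept` header is an assumption. The request is only constructed here, not sent.

```python
from urllib.request import Request

# Registered media type for RDF serialized as XML.
RDF_XML = "application/rdf+xml"


def build_rdf_request(guid_uri: str) -> Request:
    """Prepare an HTTP GET that asks for the RDF/XML description of a GUID."""
    return Request(guid_uri, headers={"Accept": RDF_XML})


# Hypothetical GUID; a real client would pass this request to urlopen().
request = build_rdf_request("http://example.org/specimen/123")
print(request.get_method(), request.full_url)
print(request.get_header("Accept"))
```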
Exposing metadata as Linked Data makes it possible for small institutions to participate in large-scale "cloud computing" efforts to aggregate metadata.
Referring to resources as URIs provides a stable point of reference despite changes of name or changing metadata.5
Aggregated RDF metadata can be stored efficiently in the form of triples6 and queried using SPARQL to discover relationships that might not have otherwise been discovered.7
Metadata from our biodiversity informatics community can be integrated with metadata provided by other communities who are already adopting the Linked Data/Semantic Web approach.8
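The idea of aggregating triples from several providers and querying them for relationships can be sketched in a few lines of Python. This is not SPARQL and not a real triple store: it is a simplified stand-in that matches triple patterns, with `None` acting as a wildcard. All identifiers, the `depicts` and `collectedBy` predicates, and the two "providers" are invented for illustration.

```python
from typing import List, Optional, Tuple

TripleT = Tuple[str, str, str]

EX = "http://example.org/"
DEPICTS = EX + "terms/depicts"            # hypothetical predicate
COLLECTED_BY = EX + "terms/collectedBy"   # hypothetical predicate

# Two hypothetical providers publish triples independently.
provider_a: List[TripleT] = [
    (EX + "image/1", DEPICTS, EX + "specimen/123"),
]
provider_b: List[TripleT] = [
    (EX + "specimen/123", COLLECTED_BY, EX + "agent/7"),
    (EX + "specimen/999", COLLECTED_BY, EX + "agent/8"),
]

# Aggregation is just pooling the triples, because both providers
# refer to the specimen by the same URI.
store = provider_a + provider_b


def match(store: List[TripleT],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[TripleT]:
    """Return the triples that fit the pattern; None matches anything."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def collectors_of_depicted(store: List[TripleT], image: str) -> List[str]:
    """Follow image -> specimen -> collector across both providers."""
    agents = []
    for _, _, specimen in match(store, s=image, p=DEPICTS):
        for _, _, agent in match(store, s=specimen, p=COLLECTED_BY):
            agents.append(agent)
    return agents


print(collectors_of_depicted(store, EX + "image/1"))
```

Neither provider states who collected the specimen shown in the image; the link only appears once the two sets of triples are combined, which is the kind of discovery the text refers to.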
1 TDWG GUID Applicability Statement, recommendations 2, 7, and 10. http://www.tdwg.org/standards/150
2 Adoption of Persistent Identifiers for Biodiversity Informatics, recommendations 4 and 9. http://www2.gbif.org/Persistent-Identifiers.pdf
3 http://biscicol.blogspot.com/
4 TDWG Technical Roadmap 2008. http://www.tdwg.org/fileadmin/subgroups/tag/TAG_Roadmap_2008.pdf
5 http://lists.tdwg.org/pipermail/tdwg-content/2010-November/001825.html
6 http://lists.tdwg.org/pipermail/tdwg-content/2010-September/000054.html
7 http://lists.tdwg.org/pipermail/tdwg-content/2010-November/001929.html
8 http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001736.html
Judge for yourself:
- http://lists.tdwg.org/pipermail/tdwg-content/2010-September/000050.html and http://lists.tdwg.org/pipermail/tdwg-content/2010-October/000013.html
- http://lists.tdwg.org/pipermail/tdwg-content/2010-September/000058.html and http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001678.html
- http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001584.html
- http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001589.html and http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001591.html
- http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001751.html and http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001810.html and http://lists.tdwg.org/pipermail/tdwg-content/2010-November/001925.html
- http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001755.html
- http://lists.tdwg.org/pipermail/tdwg-content/2010-November/001836.html
- http://lists.tdwg.org/pipermail/tdwg-content/2010-November/001956.html
- http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002265.html