Wednesday, June 7, 2017

We Won a Data Science Competition. I Blogged About It Elsewhere. Now I am Cross-Posting Here.

We (a bunch of other guys and I) won a data science competition called #Datathon17. I blogged about it at the Ontotext blog and at the Data Science Society blog. I am cross-posting it here to reaffirm my commitment to this blog.

I have lots of exciting stuff coming up really soon. Later this week I will blog about the history of AI.

Friday, January 13, 2017

Prototype of the Open Biodiversity Knowledge Management System (OBKMS) Shown at the TDWG Conference

Last month Pensoft and I went to the TDWG 2016 Annual Conference in Santa Clara de San Carlos, Costa Rica, where we presented the prototype of OBKMS. We argued that OBKMS’ vision is to become the go-to resource for peer-reviewed structured scientific data for the biodiversity domain.

For new readers of the blog, I would like to review the present and future of the system. OBKMS stands for Open Biodiversity Knowledge Management System. It is the realization, as Linked Data, of the biodiversity knowledge graph envisaged in pro-iBiosphere. It is a collaborative effort, developed jointly by Pensoft and Plazi with the help of many partners outside of these two organizations. A probably non-exhaustive list includes Kiril Simov, Éamonn Ó Tuama and Nico Franz.

The components of OBKMS are as follows:

  • The Linked Open Data (LOD) dataset together with OWL ontologies stored in an RDF graph database
  • Tools for information extraction from publications and databases that are used for creating the dataset
  • A website for querying and exploring the dataset
  • Documentation in the form of scientific papers and technical documentation stored on the website and elsewhere

In 2017, a set of blog posts will follow describing each of these aspects. At present, a GraphDB database containing 2,181,183 RDF triples is already online. The triples currently come from information extracted from the journals Biodiversity Data Journal and ZooKeys, but will soon also come from all other Pensoft journals, Plazi’s Treatment Bank and other sources.

The triples follow the models described in the SPAR Ontologies, DarwinCore-SW, the Treatment Ontology, and an OBKMS-specific ontology. Triple generation is carried out with the help of two as-yet-unpublished R packages named “okbms” and “rdf4jr”. We will try to publish at least one of them as an rOpenSci package.
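To make the notion of a triple concrete, here is a minimal sketch in Python (not the R code used in the project) that builds two triples about an article and serializes them as N-Triples. The `http://example.org/obkms/` namespace, the article identifier and the title are made up for illustration; only the `rdf:type`, FaBiO `JournalArticle` and Dublin Core `title` IRIs are real vocabulary terms, and the actual OBKMS triples may well look different.

```python
# Illustrative sketch only: OBKMS itself generates its triples with R packages.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FABIO_JOURNAL_ARTICLE = "http://purl.org/spar/fabio/JournalArticle"  # SPAR (FaBiO)
DCTERMS_TITLE = "http://purl.org/dc/terms/title"
EX = "http://example.org/obkms/"  # hypothetical namespace, not the real one


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


article = iri(EX + "article-1")  # made-up identifier
triples = [
    (article, iri(RDF_TYPE), iri(FABIO_JOURNAL_ARTICLE)),
    (article, iri(DCTERMS_TITLE), literal('A "made-up" title')),
]
print(to_ntriples(triples))
```

Each printed line is one statement: a subject, a predicate and an object, terminated by a period.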

The following picture taken from our TDWG presentation summarizes the architecture of OBKMS:

The following bird’s-eye view of the model illustrates the key concepts that we are modeling. It does not go into enough detail for writing database queries; for that, a look at the upcoming papers describing the model is needed:


In the OBKMS universe, most information comes from scientific articles. Therefore, we model articles as nodes in our graph/network, together with their sub-article elements such as sections (the treatment section being the most prominent), tables, figures, and so on. Each of these elements can be considered an expression in the FRBR sense. Colloquially, an expression is some idea written down. This notion is modeled in FRBR by stating that expressions are realizations of some more abstract entities called works. In OBKMS, examples of works are taxonomic name usages and taxon concepts. They store the information content of the respective article elements by having biological properties, such as occurrence information, or nomenclatural properties, such as scientific names.
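The expression/work distinction can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not the OBKMS ontology: the class names, the fields and the helper `works_in` are hypothetical, and “Aus bus” is just a placeholder name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Work:
    """Abstract entity, e.g. a taxon concept or a taxonomic name usage."""
    kind: str
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Expression:
    """Something written down: an article or one of its sub-article elements."""
    kind: str                              # "article", "treatment", "figure", ...
    realization_of: Optional[Work] = None  # the work this expression realizes
    parts: List["Expression"] = field(default_factory=list)


def works_in(expression):
    """Collect every work realized by an expression or any of its nested parts."""
    found = []
    if expression.realization_of is not None:
        found.append(expression.realization_of)
    for part in expression.parts:
        found.extend(works_in(part))
    return found


concept = Work("taxon concept", {"scientific name": "Aus bus"})  # placeholder name
treatment = Expression("treatment", realization_of=concept)
article = Expression("article", parts=[treatment, Expression("figure")])
print([work.kind for work in works_in(article)])
```

The article contains the treatment, and the treatment realizes the taxon concept, so walking the article’s parts leads to the biological information.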

Note that scientific names are not the hub for biological information about species, because scientists have historically attached different meanings to the same name and because of synonymy and homonymy in the naming of biological species. Instead, we use taxon concepts that are linked to scientific names, which allows for some flexibility (see also “Taxon vs Taxon Concept”). The system currently stores 43,296 scientific names. Theoretically, a single scientific name can refer to multiple taxon concepts. For each taxonomic name usage in the text of a given article there is a date-stamped node in the graph/network. If the taxonomic name usage is accompanied by a bibliographic citation, we try to map it directly to a taxon concept. If the bibliographic citation is missing, it is only mapped to a scientific name. Sometimes a name is mentioned in the text with a specific context (e.g. “new syn.,” i.e. synonymization). In this case a modified relationship is used to denote the special case.
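The mapping rule for name usages can be written down as a small Python sketch. Everything here is an assumption made for illustration: the `NameUsage` fields, the relationship label “mentions”, and the “name sec. citation” way of labeling a taxon concept are not taken from the OBKMS model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class NameUsage:
    """One mention of a name in an article (all fields are illustrative)."""
    name: str
    date: str                       # date stamp of the usage
    citation: Optional[str] = None  # accompanying bibliographic citation, if any
    status: Optional[str] = None    # special context such as "new syn."


def link_target(usage: NameUsage) -> Tuple[str, str, str]:
    """Return (relationship, kind of target node, label of target node)."""
    # A special context yields a modified relationship instead of the plain one.
    relationship = "mentions" if usage.status is None else f"mentions [{usage.status}]"
    if usage.citation:
        # With a citation, the usage can be tied to a specific taxon concept.
        return relationship, "taxon concept", f"{usage.name} sec. {usage.citation}"
    # Without one, only the scientific name itself is known.
    return relationship, "scientific name", usage.name


print(link_target(NameUsage("Aus bus", "2017-01-13", citation="Author 2016")))
print(link_target(NameUsage("Aus bus", "2017-01-13", status="new syn.")))
```

The first usage has a citation and is linked to a taxon concept. The second has no citation, so it is linked only to the name, through a relationship marked with its “new syn.” context.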

In order to come up with the special cases, we will utilize a combined top-down and bottom-up approach. We will mine the nomenclatural sections of taxonomic articles for these different usages and create a vocabulary. We will further use nomenclatural ontologies to refine these usages.

We plan to move the prototype of OBKMS to a beta version in the first quarter of 2017. A website will be launched containing a SPARQL endpoint to OBKMS and some predefined queries, very similar to FactForge. Make sure to check out this blog regularly for the coming updates.
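For readers who have not used a SPARQL endpoint before, here is a rough Python sketch of how a query is sent over HTTP GET according to the SPARQL 1.1 Protocol: the query text goes into a `query` URL parameter. The endpoint URL below is a placeholder I made up, since the real one has not been announced, and the query simply counts all triples, so it does not depend on the OBKMS model.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://example.org/obkms/sparql"  # hypothetical placeholder URL

# Counts all triples in the store; works against any SPARQL endpoint.
COUNT_TRIPLES = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"


def sparql_get_url(endpoint, query):
    """Build a GET URL that carries the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, COUNT_TRIPLES)
print(url)

# Decoding the URL again gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

The sketch only builds the URL and does not contact any server. Fetching it with an HTTP client would return the query results once a real endpoint is in place.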