Wednesday, June 7, 2017

We Won a Data Science Competition. I Blogged About It Elsewhere. Now I am Cross-Posting Here.

We (I and a bunch of other guys) won a data science competition called #Datathon17. I blogged about it at the Ontotext blog and at the Data Science Society blog. I am cross-posting it here to reaffirm my commitment to this blog.

I have lots of exciting stuff coming up really soon. Later this week I will blog about the history of AI.

Friday, January 13, 2017

Prototype of the Open Biodiversity Knowledge Management System (OBKMS) Shown at the TDWG Conference

Last month Pensoft and I went to the TDWG 2016 Annual Conference in Santa Clara de San Carlos, Costa Rica, where we presented the prototype of OBKMS. We argued that OBKMS’ vision is to become the go-to resource for peer-reviewed structured scientific data for the biodiversity domain.

For new readers of the blog I would like to review the present and future of the system. OBKMS stands for Open Biodiversity Knowledge Management System and is the realization, as Linked Data, of the biodiversity knowledge graph envisaged in pro-iBiosphere. It is a collaborative effort that is currently being developed jointly by Pensoft and Plazi with the help of many partners outside of these two organizations. A probably non-exhaustive list includes Kiril Simov, Éamonn Ó Tuama and Nico Franz.

The components of OBKMS are as follows:

  • The Linked Open Data (LOD) dataset together with OWL ontologies stored in an RDF graph database
  • Tools for information extraction from publications and databases that are used for creating the dataset
  • A website for querying and exploring the dataset
  • Documentation in the form of scientific papers and technical documentation stored on the website and elsewhere

In 2017, a set of blog posts will follow describing each of these aspects. At present a GraphDB database is already online containing 2,181,183 RDF triples. The triples currently come from information extracted from the journals Biodiversity Data Journal and ZooKeys, but will soon also come from all of Pensoft’s other journals, Plazi’s Treatment Bank and other sources.

The triples follow the models described in the SPAR Ontologies, DarwinCore-SW, the Treatment Ontology, and an OBKMS-specific ontology. Triple generation is carried out with the help of two as-yet-unpublished R packages named “obkms” and “rdf4jr”. We will try to publish at least one of them as an rOpenSci package.

The following picture taken from our TDWG presentation summarizes the architecture of OBKMS:

The following bird’s eye view of the model illustrates the key concepts that we are modeling without, however, going into sufficient detail for the understanding needed to write database queries (a look at the upcoming papers describing the model is needed for that):


In the OBKMS universe, most information comes from scientific articles. Therefore, we model articles as nodes in our graph/network, together with their sub-article elements such as sections (the treatment section being the most prominent), tables, figures, and so on. Each of these elements can be considered expressions in the FRBR sense. Colloquially an expression is some idea written down. This notion is modeled in FRBR by stating that expressions are realizations of some more abstract entities called works. In OBKMS examples of works are taxonomic name usages and taxon concepts. They store the information content of the respective article elements by having biological properties, such as occurrence information, or nomenclatural properties such as scientific names.

Note that scientific names are not the hub for biological information about species, because scientists have historically attached different meanings to the same name and because of synonymies and homonymies in the naming of biological species. For this purpose we use taxon concepts that are linked to scientific names, allowing for some flexibility (see also “Taxon vs Taxon Concept”). The system currently stores 43,296 scientific names. Theoretically, a single scientific name can refer to multiple taxon concepts. For each taxonomic name usage in the text of a given article there is a node in the graph/network with a date stamp. If the taxonomic name usage is accompanied by a bibliographic citation, we try to map it directly to a taxon concept. If the bibliographic citation is missing, it is only mapped to a scientific name. Sometimes a name is mentioned in the text with a specific context (e.g. “new syn.,” i.e. synonymization). In this case a modified relationship is used to denote these special cases.
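The mapping rules described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the code that OBKMS runs: the `NameUsage` class, the `link_target` function and the relationship strings are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class NameUsage:
    """One mention of a scientific name in an article (hypothetical structure)."""
    name: str                       # scientific name string as it appears in the text
    citation: Optional[str] = None  # accompanying bibliographic citation, if any
    status: Optional[str] = None    # special context such as "new syn.", if any


def link_target(usage: NameUsage) -> Tuple[str, str]:
    """Return (relationship, kind of target node) for a taxonomic name usage."""
    # A usage with a citation can be mapped directly to a taxon concept;
    # without a citation it is mapped only to the scientific name.
    target = "taxon concept" if usage.citation else "scientific name"
    # A special context (e.g. synonymization) calls for a modified relationship.
    # The relationship strings here are placeholders, not OBKMS vocabulary.
    relationship = f"mentions ({usage.status})" if usage.status else "mentions"
    return relationship, target
```

For instance, `link_target(NameUsage("Loxodonta africana", citation="Grubb et al (2000)"))` yields a plain relationship pointing at a taxon concept, while the same name without a citation points only at the scientific name.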

In order to come up with the special cases, we will utilize a combined top-down and bottom-up approach. We will mine the nomenclatural sections of taxonomic articles for these different usages and create a vocabulary. We will further use nomenclatural ontologies to refine these usages.

We plan to move the prototype of OBKMS to a beta version in the first quarter of 2017. A website will be launched containing a SPARQL endpoint to OBKMS and some predefined queries, very similar to FactForge. Make sure to check out this blog regularly for the coming updates.

Tuesday, August 2, 2016

Taxon vs Taxon Concept

In our August issue of the OBKMS blog, I would like to write about the "Taxon vs Taxon Concept" debate. For me, July was the month of this heated debate. I started the month not really sure what a taxon concept is, and thinking I knew what a taxon is; I am ending it with a changed perception of what a taxon is and a much better idea of what a taxon concept is.

But let’s rewind and set the stage. Enter the taxon:
“A taxonomic unit, whether named or not: i.e. a population, or group of populations of organisms which are usually inferred to be phylogenetically related and which have characters in common which differentiate (q.v.) the unit (e.g. a geographic population, a genus, a family, an order) from other such units. A taxon encompasses all included taxa of lower rank (q.v.) and individual organisms. [...]” (Wikipedia citing the Code).
So, the taxon is the natural group taxonomists are studying and we ought to model it in OBKMS!

Well, not so fast!

OBKMS is a scientific database that aims to integrate taxonomic information. Being a scientific database, it integrates information about taxa found in scientific theories about taxa. Enter the taxon concept:
“A taxonomic concept is the underlying meaning, or referential extension, of a scientific name as stated by a particular author in a particular publication. It represents the author’s full-blown view of how the name is supposed to reach out to objects in nature.”
Okay, so the taxon concept is a circumscription of a taxon by a taxonomist in a written record. So, as the OBKMS will initially process scientific articles, and extract their information, one could argue that the information extracted from the articles will be encapsulated in informational items, which are taxon concepts. So, it is even more natural to model taxon concepts, as well.

But then we model taxa and taxon concepts? Taxon concepts “are about” taxa. Well, not necessarily. Essentially our database needs to do two things:

  1. It needs to capture/extract the information contained in various biodiversity texts and records.
  2. It needs to link this information “to the real world.” I.e. when we want to know something about a taxon, an entity in the real world, we want to be able to ask the database what information it stores about this entity.

It seems natural to think of the database taxon, x, as an entity standing for the real world taxon and linked to the information concepts. However, what is the real nature of x, i.e. what is the entity x that stands for the real world object taxon? In semiotics, the branch of philosophy dealing with symbolism, the act of linking units of thought to real things is modeled by a triple/triangle: [symbol - signifier, reference - unit of thought/meaning, referent - real world object]. See for details. So, the taxon is the referent, i.e. the real world object we are modeling. The taxon concept is the reference, i.e. the unit of thought that we ought to link to the real world object taxon. What is the symbol then? Enter the scientific name:
“The full scientific name, with authorship and date information if known. When forming part of an Identification, this should be the name in lowest level taxonomic rank that can be determined. This term should not contain identification qualifications, which should instead be supplied in the identificationQualifier term. 
Examples: ‘Coleoptera’ (order), ‘Vespertilionidae’ (family), ‘Manis’ (genus), ‘Ctenomys sociabilis’ (genus + specificEpithet), ‘Ambystoma tigrinum diaboli’ (genus + specificEpithet + infraspecificEpithet), ‘Roptrocerus typographi (Györfi, 1952)’ (genus + specificEpithet + scientificNameAuthorship), ‘Quercus agrifolia var. oxyadenia (Torr.) J.T. Howell’ (genus + specificEpithet + taxonRank + infraspecificEpithet + scientificNameAuthorship).”
So the scientific name is the symbol that stands for the taxon in nature. In a sense, the taxon is forever inaccessible to the world of informatics, as we cannot speak directly of it. Only the act of naming creates a connection between the world of thought and the real world. So we do not need to model taxa! We need to model scientific names!

Yes, but that is not the full story. Remember the semiotic triangle: it links concepts to taxa, and the names symbolize concepts and stand for taxa. So, when we want to know something about a taxon, we use its scientific name instead and request the information about it (the concept). Clearly, there is a one-to-one mapping between scientific names and taxonomic concepts!


Taxonomic concepts evolve. Enter the African elephant, Loxodonta africana. Until 2000 L. africana symbolized a taxonomic concept that included both the bush elephant and the forest elephant (they were considered subspecies of the same species). However, Grubb et al (2000) hypothesized that the two subspecies are two species and a Science article in 2001 (DOI: 10.1126/science.1059936) proved that they are about as different as the Asian elephant and the woolly mammoth. So, to many, after 2001, L. africana stands only for the bush elephant and symbolizes a different, revised taxonomic concept. The forest elephant is now “stood for” by the symbol L. cyclotis.

But what other symbol do we use to unambiguously link referents (taxa) to their references/meanings (taxonomic concepts)? Thanks to Nico Franz we have this symbol, and it is called a taxonomic concept label. Essentially, it adds a reference to the record containing the taxon concept to the end of the scientific name string. In the elephant example above, we might say L. africana sensu Grubb et al (2000) to unambiguously speak of the taxon referenced by the taxon concept of Grubb et al (2000). The whole string “L. africana sensu Grubb et al (2000)” is the taxonomic concept label string of the taxon concept of Grubb et al (2000).
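As a minimal sketch of how such a label is put together, assuming nothing more than string concatenation with the word “sensu”: the function name `concept_label` is hypothetical and chosen only for this illustration.

```python
def concept_label(scientific_name: str, source: str) -> str:
    """Build a taxonomic concept label: the name, then 'sensu', then the source."""
    # The source is the reference to the record that contains the taxon concept.
    return f"{scientific_name} sensu {source}"


label = concept_label("L. africana", "Grubb et al (2000)")
print(label)
```

The same scientific name combined with a different source gives a different label, which is exactly what lets one name point to more than one taxon concept without ambiguity.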

That's it for now—this discussion was difficult but I think needed!

Live long and prosper!

Wednesday, June 29, 2016

Idea: An R-Package for RDF4J

Here, I will tell a little story about coming up with the idea of a little R-package that I am planning to develop.

Since journal articles are the central objects (and at the same time the sources of atomized information from which RDF is produced) for the Open Biodiversity Knowledge Management System (OBKMS), I started by expressing information about a journal article in RDF. I borrowed heavily from the Treatment Ontology of Plazi. Here’s how you can express a sample paper in RDF (Turtle notation):

@prefix dcterms: <> .
@prefix frbr: <> .
@prefix prism: <> .
@prefix dcterms: <> .

<> rdf:type owl:NamedIndividual ,
                                fabio:JournalArticle ;
                                dcterms:creator "Huang, Sunbin" ,
                                                "Tian, Mingyi" ,
                                                "Yin, Haomin" ;
                                dcterms:date "2014" ;
                                dcterms:title "Du'an Karst of Guangxi: a kingdom of the cavernicolous genus Dongodytes Deuve (Coleoptera, Carabidae, Trechinae)" ;
                                frbr:partOf <>;                                            
                                prism:startingPage "69" ;
                                prism:endingPage "107" ;
                                fabio:hasPageCount "39" .

<> a fabio:JournalVolume ;
  frbr:partOf <> ;
  frbr:hasPart <> .

<> a fabio:Journal;
  frbr:hasPart <> .

Note that, in the context of the biodiversity graph, I have created graph nodes for the journal and the journal volume in order to properly attribute the article to its publishing journal. As we will put information from various journals into OBKMS, I decided to go online and look for lists of journals and their issues/volumes. CrossRef seemed like as good a place to start as any.

Indeed, I’ve found a dataset, which currently has 45763 rows, each specifying an individual journal. The dataset is in the CSV format, which makes it extremely easy to work with in R thanks to `read.csv`. So with the help of RCurl, I imported the dataset into my workspace, and after cleaning it up I wanted to upload it to the graph database. My initial idea was to use `getURL` and Co. to submit SPARQL update queries to the endpoints as described in the RDF4J server REST API, but soon I found myself writing my own little helper functions to make use of the different features of the API, and an idea dawned on me.
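To give an idea of what such a helper function does, here is a minimal sketch, written in Python rather than R for brevity. It relies on the RDF4J REST API convention that a SPARQL update is sent as a form-encoded `update` parameter in a POST to the repository's `statements` endpoint. The server URL, the repository ID `obkms`, the example IRI and the function name `build_update_request` are all assumptions made for this illustration; the request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_update_request(server_url: str, repository_id: str, sparql_update: str) -> Request:
    """Prepare (but do not send) a SPARQL update request for an RDF4J-style server."""
    # Updates go to the 'statements' endpoint of the chosen repository.
    endpoint = f"{server_url.rstrip('/')}/repositories/{repository_id}/statements"
    # The update string travels as a form-encoded 'update' parameter.
    body = urlencode({"update": sparql_update}).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(endpoint, data=body, headers=headers, method="POST")


# Hypothetical journal node; the example.org IRI is a placeholder.
update = (
    'INSERT DATA { <http://example.org/journal/1> '
    '<http://purl.org/dc/terms/title> "Example Journal" . }'
)
request = build_update_request("http://localhost:8080/rdf4j-server", "obkms", update)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would actually submit it; wrapping a handful of such calls behind friendlier function names is essentially what the package idea amounts to.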

A Google search later, I found two R packages doing RDF-related stuff: RDF and rrdf. Neither of them seemed to have the full RDF4J API functionality implemented, so I said to myself: why not put the little helper functions that I am already writing in a tiny little R package, which will abstract the RDF4J functionality for the user and make it really easy for the R programmer to work with their RDF4J-compliant RDF store. Here's the list of those types of stores:

I discussed this idea with somebody from the R community and he thinks it is a good idea to do the package, so in the coming weeks I will be spending part of my time developing it.

Note: In the Turtle above, I have used literals for persons. This is probably not a good idea and `foaf:Agent` might be more appropriate.

Latest Developments as of 2016-06-29

I haven’t updated my blog since January 25th, wow! Memo to self: update your blog more often!

A lot of time has passed and a lot of things have happened, which are impossible to cover fully. For example, BIG4 had its second workshop in Havraniky, the Czech Republic, apparently now called Czechia.

The following pictures should sum up the trip pretty well.

Here, I am pictured examining a small insect (probably a Coleopteran) under a microscope in the field station in Havraniky. This work was fun but very time-consuming. I learned to appreciate the expertise that professional taxonomists have. I really thank Emanuel from BIG4 for teaching me about Coleoptera identification and Prof. Ximo Mengual for teaching me about Diptera recognition. Unfortunately, I didn't have time to learn Lepidoptera or Hymenoptera.
After collecting in the field and identifying in the station, we visited local wineries. The area around Podyji is known for its white wines.

In the meantime, work on the Open Biodiversity Knowledge Management System (OBKMS) is well underway. We have a prototype of the system already working thanks to some help from Plazi, Ontotext, Kiril Simov, and Éamonn Ó Tuama. Rod Page wrote an article about his vision of the biodiversity knowledge graph, which I reviewed via RIO’s post-publication peer-review mechanism. The review itself has the DOI "10.3897/rio.2.e8767.r25935". I'll re-publish an updated version of the review and thoughts around it as one of my next blog posts.

In a nutshell, it is a technological article, which is an excellent read for people interested in the OBKMS. At the end of the day, however, one can realize the vision of the OBKMS with a number of technologies, each having its advantages and disadvantages.

My work in the last six months or so was concentrated on using RDF stores, GraphDB in particular, to implement OBKMS. I have chosen it for a number of reasons after having played around a little bit with Neo4j at the end of last year. One of the things that I like about RDF is that there is a huge body of both data (linked open datasets) and “data schemas,” known in the RDF world as ontologies, which give you ideas about how to model your data.

In particular, as OBKMS will be based mostly on scholarly papers, we intend to make heavy use of the SPAR Ontologies. For the biodiversity part, we will draw from Darwin Core Filtered Push, Darwin Core for the Semantic Web, and the Treatment Ontology. In the last week, I have come up with a conceptual model of the relationships between Taxa, Names, Treatments, and Taxon Concepts, which I plan to express in OWL on top of these existing ontologies. I also have an idea for an R package to abstract RDF4J queries, which I will publish next in the blog.

Monday, January 25, 2016

PlutoF - ARPHA Workflow Established

Estonia, the homeland of Skype, is also the home of PlutoF, a global biodiversity data management system. The purpose of PlutoF is to “Create, manage, share, analyse and publish biology-related databases and projects.” It can administer research projects, as well as citizen science projects. It can administer all kinds of taxon occurrence data such as specimens, observations, sequences, living cultures, bibliographic references, or material samples. Finally, it has several work-benches, called “labs,” which facilitate work with files or collections, publishing work, taxonomy work, data analysis, and work with molecular data. It is a web-based multi-user system that is free of charge. The PlutoF services are utilized by several big academic and citizen science projects in Estonia and around the world, such as UNITE, the Unified system for DNA based fungal species linked to classification, and DINA, an open-source Web-based information management system for natural history data.

The PlutoF system can be useful to individual researchers, citizen scientists, or natural history museums to manage all their data. To learn more about the system and initiate collaboration, Pensoft invited a team of scientists and engineers to visit Bulgaria in the fall of 2015.

The PlutoF technical team in Pensoft’s office in Sofia. From left to right: Dr. Kessy Abarenkov, bioinformatician, database designer; Alan Zirk, team leader; Prof. Urmas Kõljalg, a world-renowned mycologist and bioinformatician, visionary; Timo Piirman, back-end developer, API design; Raivo Pöhönen, front-end developer, user experience design.

Several ideas grew out of this meeting, one of which has already been deployed in the production environment of Pensoft; I was the main person responsible for it on the Pensoft side. Namely, the users of ARPHA can now import specimen or occurrence records from the PlutoF database directly into an ARPHA manuscript via their Specimen ID. In order to accomplish this task, Timo designed a specialized API for the export of occurrence data from PlutoF and I, with the help of other programmers, wrote the software necessary for ingesting this data and transforming it into a record in the manuscript. Essentially, it is a two-step process, due to the fact that universal identifiers are not deployed widely in the taxonomic community.

Currently, the most widely used system for naming and identifying specimens in biodiversity science is the so-called Darwin Core Triplet, which consists of the institution code, collection code, and catalog number. Essentially the catalog number, i.e. the textual reference that is on the physical label of the specimen, is unique within a collection but need not be unique across collections and institutions. The catalog number corresponds to the Specimen ID in the PlutoF system.

The way we thought about importing records from PlutoF via Specimen ID’s is as follows: the PlutoF user locates the Specimen ID of the record that they want to import and enters it into a dialog in the ARPHA authoring tool. This ID resolves to one or more unique ID’s in the PlutoF database, the records belonging to which are then imported into the manuscript. By looking at the imported data, the user removes irrelevant records. Thankfully in most of the cases, the Specimen ID’s are also unique and the user does not need to do the last step.
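The two-step import can be illustrated with a small Python sketch. It is not the ARPHA or PlutoF code: the `Record` class, its field names, the `resolve_specimen_id` function and the sample data are all invented for this example, with the catalog number standing in for the Specimen ID as described above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Record:
    """A specimen record as it might sit in a database (hypothetical layout)."""
    record_id: str         # unique internal ID
    institution_code: str
    collection_code: str
    catalog_number: str    # the text on the physical label, i.e. the Specimen ID

    def triplet(self) -> Tuple[str, str, str]:
        """The Darwin Core Triplet: institution code, collection code, catalog number."""
        return (self.institution_code, self.collection_code, self.catalog_number)


def resolve_specimen_id(specimen_id: str, records: List[Record]) -> List[Record]:
    """Step 1: a Specimen ID resolves to one or more records with unique IDs."""
    return [r for r in records if r.catalog_number == specimen_id]


# Made-up sample data: the same catalog number occurs in two collections.
database = [
    Record("rec-1", "INST-A", "COLL-1", "12345"),
    Record("rec-2", "INST-B", "COLL-7", "12345"),
    Record("rec-3", "INST-A", "COLL-1", "67890"),
]

candidates = resolve_specimen_id("12345", database)
# Step 2: the user inspects the candidates and drops the irrelevant ones.
kept = [r for r in candidates if r.institution_code == "INST-A"]
print([r.record_id for r in kept])
```

When the catalog number happens to be unique, step 1 already returns a single record and the manual filtering in step 2 is not needed.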

In short, thanks to the efforts of the Pensoft team and our partners world-wide, ARPHA can now import specimen and occurrence records directly from the following repositories: GBIF, BOLD Systems, iDigBio, and PlutoF.

Automated import of specimen records into the ARPHA writing tool.

There are numerous further workflows that I am certainly looking forward to collaborating on with PlutoF. One is the very important import of specimen records from a Species Hypothesis (SH) ID. Species Hypotheses is the terminology used by UNITE to describe DNA-based fungal species, equivalent to the Operational Taxonomic Units (OTU’s) terminology used by analogous platforms such as BOLD Systems. In order to streamline the publication of these SH’s as new species, we plan to develop a workflow that takes all specimen records linked to a particular SH and imports them into a treatment in a manuscript authored in ARPHA.

I expect to establish this and other workflows, such as data paper generation, between ARPHA and PlutoF in the near future, for which I will give regular updates. This concludes today’s discussion.

Tuesday, January 12, 2016

Why I decided to publish my PhD Project Plan

According to Was ist Open Science? there are six leading principles of Open Science: open methodology, open source, open data, open access, open peer review, and open educational resources. If I am to have an open thesis, then I have to strive to apply those six principles to my own work. In the spirit of this effort, my advisor, Prof. Penev, and I have decided to publish my PhD project plan as a RIO article.

RIO (Research Ideas and Outcomes) is a new academic journal, which aims to publish results from all steps of the research continuum. Therefore, a PhD project plan was a very good candidate for it, as it is the first output of my PhD research so far.

By publishing my PhD research plan I achieve several things. First, since we’re being funded through a Marie-Curie grant (No. 542241), we achieve visibility of where the public financing goes. Second, we have invited several known experts to review my submission and we’ve got invaluable comments, which will help us steer the research in a better direction. Third, by publishing the research plan in RIO and sharing it on social media, we hope to attract comments from the public, which again will help me steer my research in a better direction, and might turn out to be valuable contributions. Fourth, we are more motivated to stay on track when we know that the plan is public.

You can find the plan here and even an official press release! This is in a nutshell why we decided to publish my PhD Project Plan. In the remainder of this post, I will share with you some more thoughts on open theses.

Having an open thesis consists of two parts: first, an open license allowing unhindered access to and distribution of the text, and second, optionally, drafting the thesis in the open. By publishing my PhD research plan I am making the first step towards drafting my thesis in the open. Further steps could be, should I decide that this is the most appropriate direction, the opening up of lab notebooks and the gradual publication of the chapters of my thesis as they become available. For this purpose, I have started using the Open Science Framework to host my lab notebooks and other project documentation. Should I decide that I want to publish the chapters of my PhD thesis, I could set up a GitHub repository and push updates as I write them.

The Wikipedia page on open thesis lists two theses that are being drafted in the open right now: that of Max Klein and that of Patrick Hadley.

# # # #

If you’ve read the research plan and want to comment, don’t hesitate to post here, mention me @vsenderov on Twitter, or email me at datascience at pensoft dot net.