Wednesday, June 29, 2016

Idea: An R-Package for RDF4J

Here, I will tell the little story of how I came up with the idea for a small R package that I am planning to develop.

Since journal articles are the central objects---and at the same time sources of atomized information for producing RDF---for the Open Biodiversity Knowledge Management System (OBKMS), I started by expressing information about a journal article in RDF. I borrowed heavily from the Treatment Ontology of Plazi. Here’s how you can express a sample paper in RDF (Turtle notation):


@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix prism: <http://prismstandard.org/namespaces/basic/2.0/> .
@prefix fabio: <http://purl.org/spar/fabio/> .

<http://dx.doi.org/10.3897/zookeys.454.7269> rdf:type owl:NamedIndividual ,
                                fabio:JournalArticle ;
                                dcterms:creator "Huang, Sunbin" ,
                                                "Tian, Mingyi" ,
                                                "Yin, Haomin" ;
                                dcterms:date "2014" ;
                                dcterms:title "Du'an Karst of Guangxi: a kingdom of the cavernicolous genus Dongodytes Deuve (Coleoptera, Carabidae, Trechinae)" ;
                                frbr:partOf <http://id.pensoft.net/ZooKeysVol454> ;
                                prism:startingPage "69" ;
                                prism:endingPage "107" ;
                                fabio:hasPageCount "39" .

<http://id.pensoft.net/ZooKeysVol454> a fabio:JournalVolume ;
  frbr:partOf <http://id.pensoft.net/ZooKeys> ;
  frbr:hasPart <http://dx.doi.org/10.3897/zookeys.454.7269> .

<http://id.pensoft.net/ZooKeys> a fabio:Journal;
  frbr:hasPart <http://id.pensoft.net/ZooKeysVol454> .


Note that, in the context of the biodiversity graph, I have created graph nodes for the journal and the journal volume in order to properly attribute the article to its publishing journal. As we will put information from various journals into the OBKMS, I decided to go online and look for lists of journals and their issues/volumes. CrossRef seemed like as good a place to start as any.


Indeed, I found a dataset, which currently has 45,763 rows, each specifying an individual journal. The dataset is in CSV format, which makes it extremely easy to work with in R thanks to `read.csv`. So, with the help of RCurl, I imported the dataset into my workspace and, after cleaning it up, wanted to upload it to the graph database. My initial idea was to use `getURL` and Co. to submit SPARQL update queries to the endpoints described in the RDF4J server REST API, but soon I found myself writing my own little helper functions to make use of the different features of the API, and an idea dawned on me.
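
To make this step concrete, here is a minimal sketch of the import, assuming a placeholder URL for the CSV and a hypothetical `JournalTitle` column (the real column names may differ):

library(RCurl)

# Download the journal list as text; the URL is a placeholder for wherever
# the CrossRef title list CSV actually lives
csv_text <- getURL("https://example.org/crossref-title-list.csv")

# read.csv() can parse the downloaded text directly; each row is one journal
journals <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Basic clean-up before upload: drop rows without a journal title
# ("JournalTitle" is a hypothetical column name)
journals <- journals[!is.na(journals$JournalTitle) & journals$JournalTitle != "", ]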


A Google search later, I found two R packages doing RDF-related work: RDF and rrdf. Neither of them seemed to have the full RDF4J API functionality implemented, so I said to myself: why not put the little helper functions that I am already writing into a small R package that abstracts the RDF4J functionality and makes it really easy for an R programmer to work with their RDF4J-compliant RDF store? GraphDB, which I use for the OBKMS prototype mentioned below, is one such store.
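
To sketch what such an abstraction could look like, here are two toy helpers; the server URL and repository name are placeholders, and the endpoints follow my reading of the RDF4J server REST API:

library(RCurl)

server_url <- "http://localhost:8080/rdf4j-server"  # placeholder server
repository <- "obkms"                               # placeholder repository ID

# Submit a SPARQL update (e.g. INSERT DATA) to the repository's
# /statements endpoint
rdf4j_update <- function(update_query, server = server_url, repo = repository) {
  endpoint <- paste0(server, "/repositories/", repo, "/statements")
  postForm(endpoint, update = update_query, style = "POST")
}

# Evaluate a SPARQL SELECT query and parse the CSV results into a data frame
rdf4j_select <- function(select_query, server = server_url, repo = repository) {
  endpoint <- paste0(server, "/repositories/", repo)
  response <- getForm(endpoint, query = select_query,
                      .opts = list(httpheader = c(Accept = "text/csv")))
  read.csv(text = response, stringsAsFactors = FALSE)
}

The package would wrap the rest of the API (repository management, transactions, namespaces) in the same style.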


I discussed this idea with somebody from the R community, and he thinks it is a good idea to do the package, so in the coming weeks I will be spending part of my time developing it.

Note: In the Turtle above, I have used plain literals for persons. This is probably not a good idea, and `foaf:Agent` might be more appropriate.

Latest Developments as of 2016-06-29

I haven’t updated my blog since January 25th, wow! Memo to self: update your blog more often!

A lot of time has passed and a lot of things have happened, which are impossible to cover fully. For example, BIG4 had its second workshop in Havraniky in the Czech Republic (apparently now called Czechia).

The following pictures should sum up the trip pretty well.

Here, I am pictured examining a small insect---probably a Coleopteran---under a microscope in the field station in Havraniky. This work was fun but very time-consuming. I learned to appreciate the expertise that professional taxonomists have. I really thank Emanuel from BIG4 for teaching me about Coleoptera identification and Prof. Ximo Mengual for teaching me about Diptera recognition. Unfortunately, I didn't have time to learn Lepidoptera or Hymenoptera. 
After collecting in the field and identifying specimens at the station, we visited local wineries. The area around Podyji is known for its white wines.

In the meantime, work on the Open Biodiversity Knowledge Management System (OBKMS) is well underway. We have a prototype of the system already working, thanks to help from Plazi, Ontotext, Kiril Simov, and Eamon O'Tuama. Rod Page wrote an article about his vision of the biodiversity knowledge graph, which I reviewed via RIO’s post-publication peer-review mechanism. The review itself has the DOI 10.3897/rio.2.e8767.r25935. I'll re-publish an updated version of the review, and my thoughts around it, as one of my next blog posts.

In a nutshell, it is a technological article and an excellent read for people interested in the OBKMS. At the end of the day, however, one can realize the vision of the OBKMS with a number of technologies, each having its advantages and disadvantages.

My work in the last six months or so has concentrated on using RDF stores, GraphDB in particular, to implement the OBKMS. I chose it for a number of reasons, after having played around a little with Neo4j at the end of last year. One of the things that I like about RDF is that there is a huge body of both data (linked open datasets) and “data schemas,” known in the RDF world as ontologies, which give you ideas about how to model your data.


As the OBKMS will be based mostly on scholarly papers, we intend to make heavy use of the SPAR Ontologies. For the biodiversity part, we will draw from Darwin Core, Filtered Push, Darwin Core for the Semantic Web, and the Treatment Ontology. In the last week, I have come up with a conceptual model of the relationships between Taxa, Names, Treatments, and Taxon Concepts, which I plan to express in OWL on top of these existing ontologies. I also have an idea for an R package to abstract the RDF4J API, which I will write about next on the blog.

Monday, January 25, 2016

PlutoF - ARPHA Workflow Established

Estonia, the homeland of Skype, is also the home of PlutoF, a global biodiversity data management system. The purpose of PlutoF is to “Create, manage, share, analyse and publish biology-related databases and projects.” It can administer research projects as well as citizen science projects. It can handle all kinds of taxon occurrence data, such as specimens, observations, sequences, living cultures, bibliographic references, and material samples. Finally, it has several workbenches, called “labs,” which facilitate work with files and collections, publishing work, taxonomy work, data analysis, and work with molecular data. It is a web-based, multi-user system that is free of charge. The PlutoF services are utilized by several big academic and citizen science projects in Estonia and around the world, such as UNITE, the unified system for DNA-based fungal species linked to classification, and DINA, an open-source, web-based information management system for natural history data.

The PlutoF system can be useful to individual researchers, citizen scientists, or natural history museums to manage all their data. To learn more about the system and initiate collaboration, Pensoft invited a team of scientists and engineers to visit Bulgaria in the fall of 2015.


The PlutoF technical team in Pensoft’s office in Sofia. From left to right: Dr. Kessy Abarenkov, bioinformatician, database designer; Alan Zirk, team leader; Prof. Urmas Kõljalg, a world-renowned mycologist and bioinformatician, visionary; Timo Piirman, back-end developer, API design; Raivo Pöhönen, front-end developer, user experience design.

Several ideas grew out of this meeting, one of which has already been deployed in Pensoft’s production environment, with me as the person mainly responsible on the Pensoft side. Namely, users of ARPHA can now import specimen or occurrence records from the PlutoF database directly into an ARPHA manuscript via their Specimen ID. To accomplish this, Timo designed a specialized API for the export of occurrence data from PlutoF, and I, with the help of other programmers, wrote the software for ingesting these data and transforming them into records in the manuscript. Essentially, it is a two-step process, because universal identifiers are not yet widely deployed in the taxonomic community.

Currently, the most widely used system for naming and identifying specimens in biodiversity science is the so-called Darwin Core Triplet, which consists of an institution code, a collection code, and a catalog number. The catalog number, i.e. the textual reference on the physical label of the specimen, is unique within a collection, but need not be unique across collections and institutions. The catalog number corresponds to the Specimen ID in the PlutoF system.

The way we thought about importing records from PlutoF via Specimen IDs is as follows: the PlutoF user locates the Specimen ID of the record that they want to import and enters it into a dialog in the ARPHA authoring tool. This ID resolves to one or more unique IDs in the PlutoF database, and the records belonging to them are then imported into the manuscript. By looking at the imported data, the user removes any irrelevant records. Thankfully, in most cases the Specimen IDs are unique and the user does not need to do this last step.
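
To illustrate the disambiguation step with a toy example (this is not the actual ARPHA code, and all values are made up), the candidate records returned for a catalog number can be narrowed down by the other two parts of the Darwin Core Triplet:

# Hypothetical candidates returned for Specimen ID (catalog number) "12345"
candidates <- data.frame(
  institutionCode = c("NHMD", "ZMUC", "NHMD"),
  collectionCode  = c("Coleoptera", "Diptera", "Diptera"),
  catalogNumber   = c("12345", "12345", "12345"),
  stringsAsFactors = FALSE
)

# The user (or the import dialog) keeps only the record from the intended
# institution and collection
wanted <- candidates[candidates$institutionCode == "NHMD" &
                     candidates$collectionCode == "Coleoptera", ]

# The full Darwin Core Triplet of the surviving record
paste(wanted$institutionCode, wanted$collectionCode, wanted$catalogNumber,
      sep = ":")   # "NHMD:Coleoptera:12345"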

In short, thanks to the efforts of the Pensoft team and our partners world-wide, ARPHA can now import specimen and occurrence records directly from the following repositories: GBIF, BOLD Systems, iDigBio, and PlutoF.

Automated import of specimen records into the ARPHA writing tool.

There are numerous further workflows that I am looking forward to collaborating on with PlutoF. One is the very important import of specimen records from a Species Hypothesis (SH) ID. Species Hypothesis is the term used by UNITE to describe DNA-based fungal species, equivalent to the Operational Taxonomic Unit (OTU) terminology used by analogous platforms such as BOLD Systems. In order to streamline the publication of these SHs as new species, we plan to develop a workflow that takes all specimen records linked to a particular SH and imports them into a treatment in a manuscript authored in ARPHA.

I expect to establish this and other workflows, such as data paper generation, between ARPHA and PlutoF in the near future, for which I will give regular updates. This concludes today’s discussion.

Tuesday, January 12, 2016

Why I decided to publish my PhD Project Plan

According to Was ist Open Science?, there are six leading principles of Open Science: open methodology, open source, open data, open access, open peer review, and open educational resources. If I am to have an open thesis, then I have to strive to apply these six principles to my own work. In the spirit of this effort, my advisor, Prof. Penev, and I have decided to publish my PhD project plan as a RIO article.

RIO (Research Ideas and Outcomes) is a new academic journal, which aims to publish results from all steps of the research continuum. Therefore, a PhD project plan was a very good candidate for it, as it is the first output of my PhD research so far.

By publishing my PhD research plan I achieve several things. First, since we’re being funded through a Marie-Curie grant (No. 542241), we make it visible where the public financing goes. Second, we have invited several known experts to review my submission, and we’ve received invaluable comments, which will help us steer the research in a better direction. Third, by publishing the research plan in RIO and sharing it on social media, we hope to attract comments from the public, which again will help me steer my research in a better direction and might turn out to be valuable contributions. Fourth, we are more motivated to stay on track when we know that the plan is public.

You can find the plan here and even an official press release! This is in a nutshell why we decided to publish my PhD Project Plan. In the remainder of this post, I will share with you some more thoughts on open theses.

Having an open thesis consists of two parts: the first is an open license allowing unhindered access to and distribution of the text; the second, optional part consists of drafting the thesis in the open. By publishing my PhD research plan I am making the first step towards drafting my thesis in the open. Further steps could be, should I decide that this is the most appropriate direction, the opening up of lab notebooks and the gradual publication of the chapters of my thesis as they become available. For this purpose, I have started using the Open Science Framework to host my lab notebooks and other project documentation. Should I decide that I want to publish the chapters of my PhD thesis as I write them, I could set up a GitHub repository and push updates there.

The Wikipedia page on open thesis lists two theses that are being drafted in the open right now: that of Max Klein and that of Patrick Hadley.

# # # #

If you’ve read the research plan and want to comment, don’t hesitate to post here, mention me @vsenderov on Twitter, or email me at datascience at pensoft dot net.

Wednesday, October 21, 2015

A cross-posting from the BIG4 blog

The following was originally published at the BIG4 blog as First Impressions and First Results.

I recently blogged on my PhD thesis blog about the BIG4 kick-off meeting in Copenhagen. Here, I will revisit this topic and give my first impressions as well as report some first BIG4-relevant results.

As I pointed out, the format of the meeting was two days of presentations by students and PIs about individual projects, then a field-trip day, then two more days of presentations and entomology-related workshops led by senior lecturers. Certainly, one of the more memorable moments for me was the Friday workshop, when the students got to examine and document parts of the Fabricius collection at the Natural History Museum of Denmark.



The whole symposium was very well organized thanks to Sree and Alexey. I am certainly looking forward to the next one, which will probably be in the Czech Republic.

On the scientific side, I think we have a good mix of entomologists and molecular researchers, with me being squarely in the second camp. I am looking forward to the next half a year or so, when the first data begin to be generated, so that I have material to work on. In the meantime, I will be building interconnections between data portals and the Biodiversity Data Journal (BDJ) in order to learn the Pensoft system, laying the groundwork for an open thesis, and working on different research agendas for my project.

Some of those interconnections have already been engineered, and I’d now like to introduce two new workflows. The first workflow facilitates the import of metadata into BDJ as a data paper. It allows an author in BDJ to initialize a data paper manuscript from an EML text file containing the metadata belonging to a dataset. In other words, given a dataset and its metadata, we convert the structured information about the dataset found in the metadata into a journal-style formatted manuscript, ready for submission for review in BDJ after modifications have been made. The other workflow facilitates the import of occurrence records into a taxonomic manuscript at BDJ. As you can see, it is now possible to copy occurrence records from GBIF, BOLD Systems, and iDigBio into your taxonomic manuscript by just typing their IDs into a dialog.
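
As a minimal sketch of where the first workflow starts, this is roughly how the fields that seed a data paper manuscript could be pulled out of an EML file in R (the file name and XPath expressions are illustrative, and the production code does more than this):

library(XML)

# Parse a dataset's EML metadata file (placeholder file name)
eml <- xmlParse("dataset.eml.xml")

# Extract the pieces that map onto sections of a data paper manuscript
title    <- xpathSApply(eml, "//dataset/title", xmlValue)
abstract <- xpathSApply(eml, "//dataset/abstract", xmlValue)
creators <- xpathSApply(eml, "//dataset/creator/individualName/surName", xmlValue)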

These two workflows could be used by BIG4 students and PIs to write data papers about the datasets that they are generating, and in taxonomy papers. An easy way to utilize them would be for all BIG4 member labs to install a copy of GBIF’s Integrated Publishing Toolkit (IPT) on their lab server and share their biodiversity datasets with the community via IPT. Then, should the authors decide to publish in the Biodiversity Data Journal, they would be able both to create a data paper about their dataset based on GBIF’s EML format and to import the individual occurrences into a taxonomic paper.

In terms of the big picture, they would be most useful for species redescriptions, and more specifically for extending the morphological descriptions of old taxa with genomic data. In the future, I also plan, together with the ARPHA engineers, to write an importer for occurrence data from Darwin Core Archive (DwC-A) files. This will allow for an almost universal exchange of occurrences between databases and BDJ.

We would certainly be very happy to hear from other BIG4 students and PI's and also from the general public. Therefore, I will cross-post this blog entry on the BIG4 Google+ page, and on my PhD thesis blog, where you can start a discussion.

Thursday, October 8, 2015

The Open Science Pyramid

I made a mental promise to myself to write a blog post here every week or so, and I find myself breaking this promise in the second week. After I attended the BIG4 kick-off meeting in Copenhagen, Denmark, I traveled to Moscow on a vacation (if you friend me on Facebook you might find some pictures from the trip there), and after that I spent most of my blogging time contributing to the Pensoft blog. I'll alert the readers of this blog and my Twitter followers (@vsenderov) when the post that I contributed to appears there, and I will discuss the subject matter here as well. It will be about data papers.


What happened a few weeks ago in Copenhagen? The first and second days of the symposium were dedicated to presentations by PIs and students, the third day was an excursion, and Thursday and Friday were spent in the museum looking at and sorting specimens and hearing presentations about museum work. It was all pretty neat, as you can see from the photo below!


In my first post I introduced the Open Biodiversity Knowledge Management System. The system in itself - what it means, what it is, etc. - needs to be discussed much further, but for now I want to say that my PhD project will be dedicated to building this system, and this blog is dedicated to my PhD project. The logical structure of the project, agreed upon after discussions with my advisor, Prof. Penev, is as follows:
  1. Chapter One: Introduction of the open science principles, work on universal identifiers.
  2. Chapter Two: New forms of digital publications.
  3. Chapter Three: New forms of displaying genomics data.

So the posts here will follow this scheme somewhat. The next few posts will be about open science, and after that everything else will follow.

Therefore I will begin now by introducing the open science pyramid, a simple visual aid to illustrate some of the aspects of open science:





The idea behind it is that your digital publication, which is not behind a paywall (open access), is only the tip of the iceberg when it comes to the dissemination of knowledge. Your paper usually consists of a nice story plus results in the form of figures, tables, and statements. While the story is what appeals to the reader, and is what our brains have evolved to process, it is actually the least scientific part of your paper, since it is neither verifiable nor reproducible. What is verifiable and reproducible are the figures, tables, and statements, which presumably have been computed by an algorithm. So, if another author is to be able to collect more evidence in favor of your statements, or, even better, disprove them, you have to give them access to more than the story plus the results: you have to give them access to the algorithms behind the computation (open source). And lastly, if you want other people to be able to reproduce your computations, you have to open up the data that your algorithms worked on (open data). A caveat: they will of course have to subsample the population again to eliminate biases.

This is it for my second post: it is too short, but as time passes and the project progresses I will make these posts longer. Since my thesis is starting to shape up as an open thesis, the next post will be about open theses.

Saturday, September 12, 2015

What can we do for the OBKMS to happen?

This is my first blog post here, so it merits defining what the Open Biodiversity Knowledge Management System (OBKMS) actually is, or at least how I view it. The OBKMS was defined as part of the EU-funded project pro-iBiosphere. In a nutshell, its implementation would allow biodiversity and biodiversity-related data to flow freely from acquisition to analysis and storage to publishing and back.




One of its early critics---and I mean this in a positive way, since only through critique can a concept be improved; it is the very essence of the scientific method---is a person whose blog I will look to for inspiration and as a model in writing mine: none other than Prof. Rod Page. In personal correspondence with me, Prof. Page organized his concerns into the following major categories:

  1. The system itself needs to be well-defined: what it is, what it is trying to achieve, etc.
  2. Linked data and the semantic web are still in a very early phase of their development and represent hopes rather than solutions to real challenges.
  3. Network effects need to be leveraged. In his language, a "network effect" is an effect "where both users and providers get tangible benefits."
In my current grasp of the OBKMS, I completely agree with Prof. Page on (1) and (3) and disagree on (2). I do understand the dislike in the scientific community for linked data---scientists are used to storing data in a much different way than triples---but I do believe in linked data at the moment. I also believe that in the future there will be ways to easily transform tabular data into a triple store and back. For an idea, you could check the article by Allocca and Gougousis (2015), whose academic editor was the humble writer of this blog, and which gives an idea of how to reverse RML.
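
To make the tabular-data-to-triples idea a bit more concrete, here is a toy sketch in R of how the rows of a journal table could be serialized as Turtle; the URIs are illustrative and the ISSNs are made up:

# A tiny made-up table of journals
journals <- data.frame(
  id    = c("ZooKeys", "PhytoKeys"),
  title = c("ZooKeys", "PhytoKeys"),
  issn  = c("1111-1111", "2222-2222"),
  stringsAsFactors = FALSE
)

# Turn each row into a small block of Turtle triples
row_to_turtle <- function(row) {
  subject <- paste0("<http://id.pensoft.net/", row["id"], ">")
  paste0(subject, " a fabio:Journal ;\n",
         "  dcterms:title \"", row["title"], "\" ;\n",
         "  prism:issn \"", row["issn"], "\" .")
}

cat(apply(journals, 1, row_to_turtle), sep = "\n\n")

Going the other way, from triples back to a table, is exactly what a SPARQL SELECT query gives you, which is why I find the round trip plausible.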

That is all for my first post. Next week I will be at the BIG4 kick-off meeting in Copenhagen, where I will present some interesting workflows that I've been helping to develop for the Biodiversity Data Journal.