Building the Open Biodiversity Knowledge Management System: January 2016

Estonia, the homeland of Skype, is also the home of PlutoF, a global biodiversity data management system. The purpose of PlutoF is to “Create, manage, share, analyse and publish biology-related databases and projects.” It can administer research projects, as well as citizen science projects. It can administer all kinds of taxon occurrence data such as specimens, observations, sequences, living cultures, bibliographic references, or material samples. Finally, it has several work-benches, called “labs,” which facilitate work with files or collections, publishing work, taxonomy work, data analysis, and work with molecular data. It is a web-based multi-user system that is free of charge. The PlutoF services are utilized by several big academic and citizen science projects in Estonia and around the world, such as UNITE, the Unified system for DNA based fungal species linked to classification, and DINA, an open-source Web-based information management system for natural history data.

The PlutoF system can help individual researchers, citizen scientists, and natural history museums manage all their data. To learn more about the system and initiate collaboration, Pensoft invited the PlutoF team of scientists and engineers to visit Bulgaria in the fall of 2015.

The PlutoF technical team in Pensoft’s office in Sofia. From left to right: Dr. Kessy Abarenkov, bioinformatician, database designer; Alan Zirk, team leader; Prof. Urmas Kõljalg, a world-renowned mycologist and bioinformatician, visionary; Timo Piirman, back-end developer, API design; Raivo Pöhönen, front-end developer, user experience design.

Several ideas grew out of this meeting, one of which has already been deployed in Pensoft’s production environment; I was the main person responsible for it on the Pensoft side. Namely, users of ARPHA can now import specimen or occurrence records from the PlutoF database directly into an ARPHA manuscript via the records’ Specimen IDs. To accomplish this, Timo designed a specialized API for exporting occurrence data from PlutoF, and I, with the help of other programmers, wrote the software that ingests this data and transforms it into a record in the manuscript. The import is essentially a two-step process because universal identifiers are not widely deployed in the taxonomic community.

Currently, the most widely used system for naming and identifying specimens in biodiversity science is the so-called Darwin Core Triplet, which consists of the institution code, collection code, and catalog number. The catalog number, i.e. the textual reference on the physical label of the specimen, is unique within a collection but need not be unique across collections and institutions. The catalog number corresponds to the Specimen ID in the PlutoF system.
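
To make the triplet concrete, here is a minimal Python sketch. It is only an illustration: the class name, the field names, the colon-separated text form, and all the values are my assumptions for this example, not something taken from PlutoF or ARPHA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DarwinCoreTriplet:
    """Illustrative model of a Darwin Core Triplet (names are hypothetical)."""

    institution_code: str
    collection_code: str
    catalog_number: str

    def __str__(self) -> str:
        # Colon-separated form, assumed here purely for display.
        return f"{self.institution_code}:{self.collection_code}:{self.catalog_number}"


# Made-up values: two specimens that share a catalog number
# but live in different institutions and collections.
a = DarwinCoreTriplet("INST-A", "COLL-1", "0001")
b = DarwinCoreTriplet("INST-B", "COLL-2", "0001")

same_catalog_number = a.catalog_number == b.catalog_number
same_specimen = a == b
```

In this sketch the catalog number alone matches for both specimens, while the full triplets differ, which is why a catalog number (or Specimen ID) by itself may point to more than one record.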

We designed the import of records from PlutoF via Specimen IDs as follows: the PlutoF user locates the Specimen ID of the record they want to import and enters it into a dialog in the ARPHA authoring tool. This ID resolves to one or more unique IDs in the PlutoF database, and the records belonging to those IDs are then imported into the manuscript. The user then reviews the imported data and removes any irrelevant records. Thankfully, in most cases the Specimen IDs are also unique and the user does not need to perform this last step.
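
The two-step resolution described above can be sketched in Python roughly as follows. This is not the actual PlutoF API or the ARPHA import code: the in-memory `RECORDS` table, the function names, and all IDs and values are hypothetical stand-ins chosen for illustration.

```python
from typing import Dict, Iterable, List

# Hypothetical stand-in for the PlutoF database:
# internal unique ID -> record. All values are made up.
RECORDS: Dict[int, Dict[str, str]] = {
    101: {"specimen_id": "DEMO-0001", "taxon": "Taxon A"},
    102: {"specimen_id": "DEMO-0002", "taxon": "Taxon B"},
    103: {"specimen_id": "DEMO-0002", "taxon": "Taxon C"},
}


def resolve_specimen_id(specimen_id: str) -> List[int]:
    """Step 1: resolve a Specimen ID to one or more internal unique IDs."""
    return sorted(
        uid for uid, rec in RECORDS.items() if rec["specimen_id"] == specimen_id
    )


def import_records(specimen_id: str) -> List[Dict[str, str]]:
    """Step 2: fetch the record behind each resolved unique ID."""
    return [
        dict(RECORDS[uid], internal_id=str(uid))
        for uid in resolve_specimen_id(specimen_id)
    ]


def keep_relevant(
    records: List[Dict[str, str]], keep_ids: Iterable[str]
) -> List[Dict[str, str]]:
    """Optional last step: the user drops records that are not relevant."""
    wanted = set(keep_ids)
    return [rec for rec in records if rec["internal_id"] in wanted]


unique_case = import_records("DEMO-0001")      # resolves to a single record
ambiguous_case = import_records("DEMO-0002")   # resolves to two records
cleaned = keep_relevant(ambiguous_case, ["103"])
```

When the Specimen ID is unique, as in the first call, the import yields exactly one record and the manual clean-up step is unnecessary.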

In short, thanks to the efforts of the Pensoft team and our partners world-wide, ARPHA can now import specimen and occurrence records directly from the following repositories: GBIF, BOLD Systems, iDigBio, and PlutoF.

Automated import of specimen records into the ARPHA writing tool.

There are numerous further workflows on which I look forward to collaborating with PlutoF. One very important example is the import of specimen records from a Species Hypothesis (SH) ID. Species Hypothesis is the term used by UNITE to describe DNA-based fungal species; it is equivalent to the Operational Taxonomic Unit (OTU) terminology used by analogous platforms such as BOLD Systems. To streamline the publication of these SHs as new species, we plan to develop a workflow that takes all specimen records linked to a particular SH and imports them into a treatment in a manuscript authored in ARPHA.

I expect to establish this and other workflows between ARPHA and PlutoF, such as data paper generation, in the near future, and I will give regular updates on them. This concludes today’s discussion.

According to Was ist Open Science?, there are six leading principles of Open Science: open methodology, open source, open data, open access, open peer review, and open educational resources. If I am to have an open thesis, then I have to strive to apply those six principles to my own work. In the spirit of this effort, my advisor, Prof. Penev, and I have decided to publish my PhD project plan as a RIO article.

RIO (Research Ideas and Outcomes) is a new academic journal that aims to publish results from all steps of the research continuum. A PhD project plan was therefore a very good candidate for it, as it is the first output of my PhD research so far.

By publishing my PhD research plan I achieve several things. First, since the work is funded through a Marie-Curie grant (No. 542241), publishing the plan makes it visible where the public financing goes. Second, we invited several known experts to review my submission and received invaluable comments, which will help us steer the research in a better direction. Third, by publishing the research plan in RIO and sharing it on social media, we hope to attract comments from the public, which may likewise help guide the research and might turn out to be valuable contributions. Fourth, we are more motivated to stay on track when we know that the plan is public.

You can find the plan here, and there is even an official press release! This is, in a nutshell, why we decided to publish my PhD Project Plan. In the remainder of this post, I will share some more thoughts on open theses.

Having an open thesis consists of two parts: the first is an open license that allows unhindered access to and distribution of the text; the second, optional part is drafting the thesis in the open. By publishing my PhD research plan I am taking the first step towards drafting my thesis in the open. Further steps could be, should I decide that this is the most appropriate direction, opening up my lab notebooks and gradually publishing the chapters of my thesis as they become available. For this purpose, I have started using the Open Science Framework to host my lab notebooks and other project documentation. Should I decide to publish the chapters of my PhD thesis, I could set up a GitHub repository and push updates as I write them.

The Wikipedia page on open thesis lists two theses that are being drafted in the open right now: that of Max Klein and that of Patrick Hadley.

# # # #

If you’ve read the research plan and want to comment, don’t hesitate to post here, mention me @vsenderov on Twitter, or email me at datascience at pensoft dot net.

Monday, January 25, 2016

PlutoF - ARPHA Workflow Established

Tuesday, January 12, 2016

Why I decided to publish my PhD Project Plan