Building the Open Biodiversity Knowledge Management System: 2015

Wednesday, October 21, 2015

A cross-posting from the BIG4 blog

The following was originally published at the BIG4 blog as First Impressions and First Results.

I recently blogged on my PhD thesis blog about the BIG4 kick-off meeting in Copenhagen. Here, I will revisit this topic and give my first impressions as well as report some first BIG4-relevant results.

As I pointed out, the format of the meeting was two days of presentations by students and PIs about individual projects, then a field-trip day, then two more days of presentations and entomology-related workshops led by senior lecturers. Certainly, one of the more memorable moments for me was the Friday workshop, when the students got to examine and document parts of the Fabricius collection at the Natural History Museum of Denmark.

The whole symposium was very well organized thanks to Sree and Alexey. I am certainly looking forward to the next one, which will probably be in the Czech Republic.

On the scientific side, I think we have a good mix of entomologists and molecular researchers, with me being squarely in the second camp. I am looking forward to the next half a year or so, when the first data begin to be generated, so that I have material to work on. In the meantime, I will be building some interconnections between data portals and Biodiversity Data Journal (BDJ) in order to learn the Pensoft system, laying the groundwork for an open thesis, and working on different research agendas for my project.

Some of those interconnections have already been engineered, and I’d now like to introduce two new workflows. The first workflow facilitates the import of metadata into BDJ as a data paper. It allows an author in BDJ to initialize her data paper manuscript from an EML text file containing the metadata of a dataset. In other words, given a dataset and its metadata, we convert the structured information about the dataset found in the metadata into a journal-style manuscript that, after the author has made modifications, is ready for submission for review in BDJ. The other workflow facilitates the import of occurrence records into a taxonomic manuscript at BDJ. It is now possible to copy occurrence records from GBIF, BOLD Systems and iDigBio into your taxonomic manuscript just by typing their IDs in a dialog.
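To make the first workflow a bit more concrete, here is a minimal Python sketch of the general idea: read a few fields from an EML file and map them onto the sections of a manuscript draft. This is not the actual BDJ importer. The sample document, the function name eml_to_manuscript and the choice of fields are my own assumptions for illustration, and a real EML file contains far more than the title, creators and abstract used here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EML document; real files carry many more fields.
SAMPLE_EML = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
         packageId="example/1" system="example">
  <dataset>
    <title>Rove beetles of an example region</title>
    <creator>
      <individualName>
        <givenName>Ada</givenName>
        <surName>Example</surName>
      </individualName>
    </creator>
    <abstract>
      <para>Occurrence records collected during an example survey.</para>
    </abstract>
  </dataset>
</eml:eml>
"""


def eml_to_manuscript(eml_text):
    """Map a few EML dataset fields onto sections of a manuscript draft."""
    root = ET.fromstring(eml_text)
    # Only the root element is namespaced in EML; <dataset> is unqualified.
    dataset = root.find("dataset")
    if dataset is None:
        raise ValueError("no <dataset> element found")

    # Build "Given Surname" strings from each <creator>.
    authors = []
    for creator in dataset.findall("creator"):
        given = creator.findtext("individualName/givenName", default="").strip()
        sur = creator.findtext("individualName/surName", default="").strip()
        name = " ".join(part for part in (given, sur) if part)
        if name:
            authors.append(name)

    # Join the abstract paragraphs into one block of text.
    paragraphs = [(p.text or "").strip() for p in dataset.findall("abstract/para")]

    return {
        "title": dataset.findtext("title", default="").strip(),
        "authors": authors,
        "abstract": "\n\n".join(p for p in paragraphs if p),
    }


manuscript = eml_to_manuscript(SAMPLE_EML)
print(manuscript["title"])
print(", ".join(manuscript["authors"]))
```

The returned dictionary stands in for the manuscript draft that the author would then edit before submission.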

These two workflows could be used by BIG4 students and PIs to write data papers about the datasets that they are generating and to import occurrence records into their taxonomy papers. An easy way to utilize them would be for all BIG4 member labs to install a copy of GBIF’s Integrated Publishing Toolkit (IPT) on their lab server and share their biodiversity datasets with the community via IPT. Then, should the authors decide to publish in Biodiversity Data Journal, they would be able both to create a data paper about their dataset based on GBIF’s EML format and to import the individual occurrences into a taxonomic paper.

In terms of the big picture, they would be most useful for species redescriptions, and more specifically for extending the morphological descriptions of old taxa with genomic data. I also plan, together with the ARPHA engineers, to write an importer for occurrence data from Darwin Core Archive (DwC-A) files. This will allow for an almost universal exchange of occurrences between databases and BDJ.
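The importer does not exist yet, so the following Python sketch only illustrates the kind of reading step it would need, under simplifying assumptions. A Darwin Core Archive is a zip file, and I assume here that its core data file is a tab-delimited occurrence.txt with a header row of Darwin Core term names. Real archives describe their files and columns in a meta.xml descriptor, which this sketch ignores. The function names and the sample records are hypothetical.

```python
import csv
import io
import zipfile


def build_sample_archive():
    """Create a tiny in-memory archive so the sketch runs without any files."""
    rows = (
        "occurrenceID\tscientificName\tdecimalLatitude\tdecimalLongitude\n"
        "occ-1\tExamplus primus\t55.7\t12.6\n"
        "occ-2\tExamplus secundus\t42.7\t23.3\n"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("occurrence.txt", rows)
    buffer.seek(0)
    return buffer


def read_occurrences(archive_file, core_name="occurrence.txt"):
    """Return the occurrence rows of the archive as a list of dictionaries.

    Assumption: the core file is tab-delimited and starts with a header row.
    """
    with zipfile.ZipFile(archive_file) as archive:
        with archive.open(core_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text, delimiter="\t"))


occurrences = read_occurrences(build_sample_archive())
for record in occurrences:
    print(record["occurrenceID"], record["scientificName"])
```

Each dictionary is keyed by Darwin Core term, which is the shape an importer would need before mapping the records onto the fields of a taxonomic manuscript.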

We would certainly be very happy to hear from other BIG4 students and PIs, and also from the general public. Therefore, I will cross-post this blog entry on the BIG4 Google+ page and on my PhD thesis blog, where you can start a discussion.

Thursday, October 8, 2015

The Open Science Pyramid

I made a mental promise to myself to write a blog post here every week or so, and I find myself breaking this promise in the second week. After I attended the BIG4 kick-off meeting in Copenhagen, Denmark, I traveled to Moscow on vacation (if you friend me on Facebook you might find some pictures from the trip there), and after that I spent most of my blogging time contributing to the Pensoft blog. I'll alert the readers of this blog and my Twitter followers (@vsenderov) when the post that I contributed to appears there, and I will discuss the subject matter here as well. It will be about data papers.

What happened a few weeks ago in Copenhagen? The first and second days of the symposium were dedicated to presentations by PIs and students, the third day was an excursion, and Thursday and Friday were spent in the museum looking at and sorting specimens and hearing presentations about museum stuff. It was all pretty neat as you can see from the photo below!

In my first post I introduced the Open Biodiversity Knowledge Management System. The system in itself (what it means, what it is, etc.) needs to be discussed much further, but for now I want to say that my PhD project will be dedicated to building this system and this blog is dedicated to my PhD project. The logical structure of the project, agreed upon after discussions with my advisor Prof. Penev, is as follows:

Chapter One: Introduction of the open science principles, work on universal identifiers.

Chapter Two: New forms of digital publications.

Chapter Three: New forms of displaying genomics data.

So the blogs here will follow this scheme somewhat. The next few posts will be about open science and after that everything else will follow.

Therefore I will begin now by introducing the open science pyramid, a simple visual aid to illustrate some of the aspects of open science:

The idea behind it is that your digital publication, which is not behind a paywall (open access), is only the tip of the iceberg when it comes to dissemination of knowledge. Your paper usually consists of a nice story plus results in the form of figures, tables and statements. While the story is what is appealing to the reader and is what our brains have evolved to process, it is actually the least scientific part of your paper, since it is neither verifiable nor reproducible. What is verifiable and reproducible are the figures, tables and statements, which presumably have been computed by an algorithm. So, if another author is to be able to collect more evidence in favor of your statements, or even better, disprove them, you have to give them access to more than the story plus the results: you have to give them access to the algorithms that are behind the computation (open source). And lastly, if you want other people to be able to reproduce your computations, you would have to open up the data that your algorithm worked on (open data). A caveat: they will of course have to subsample the population again to eliminate biases.

This is it for my second post. It is short, but as time and the project progress I will make these posts longer. Since my thesis is starting to shape up as an open thesis, the next post will be about open theses.

Saturday, September 12, 2015

What can we do for the OBKMS to happen?

This is my first blog post here, so it is worth defining what the Open Biodiversity Knowledge Management System (OBKMS) actually is, or at least how I view it. The OBKMS was defined as part of the EU-funded project pro-iBiosphere. In a nutshell, its implementation would allow biodiversity and biodiversity-related data to flow freely from acquisition to analysis and storage to publishing and back.

One of its early critics is a person whose blog I will look up to for inspiration and as a model in writing mine, namely none other than Prof. Rod Page. I mean "critic" in a positive way, since only through critique can a concept be improved (it is the very essence of the scientific method). In a personal correspondence with me, Prof. Page organized his concerns into the following major categories:

(1) The system itself needs to be well-defined: what it is, what it is trying to achieve, etc.
(2) Linked data and semantic web are still in a very early phase of their development and represent hopes rather than address real challenges.
(3) Network effects need to be leveraged. In his language, a "network effect" is an effect "where both users and providers get tangible benefits."

In my current grasp of OBKMS, I completely agree with Prof. Page on (1) and (3) and disagree on (2). I do understand the dislike in the scientific community for linked data (scientists are used to storing data in a much different way than triples), but I also believe in linked data at the moment. I also believe that in the future there will be ways to easily transform tabular data into a triple store and back. For an idea of how this might work, you could check an article by Allocca and Gougousis (2015), whose academic editor was the humble writer of this blog, which gives an idea of how to reverse RML.
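To show what I mean by going from a table to triples and back, here is a toy Python sketch. It is not RML and not the method of the article mentioned above. It is a plain illustration that assumes every row has a unique identifier column and that each subject has at most one value per column. The function names and sample records are made up for this example.

```python
def rows_to_triples(rows, key):
    """Turn each table cell into a (subject, predicate, object) triple.

    The value in the `key` column becomes the subject, the column name
    becomes the predicate and the cell value becomes the object.
    """
    triples = []
    for row in rows:
        subject = row[key]
        for column, value in row.items():
            if column != key:
                triples.append((subject, column, value))
    return triples


def triples_to_rows(triples, key):
    """Group triples by subject to rebuild one table row per subject."""
    rows = {}
    for subject, predicate, obj in triples:
        rows.setdefault(subject, {key: subject})[predicate] = obj
    return list(rows.values())


# Hypothetical occurrence table.
table = [
    {"occurrenceID": "occ-1", "scientificName": "Examplus primus", "country": "Denmark"},
    {"occurrenceID": "occ-2", "scientificName": "Examplus secundus", "country": "Bulgaria"},
]

triples = rows_to_triples(table, key="occurrenceID")
restored = triples_to_rows(triples, key="occurrenceID")

for triple in triples:
    print(triple)
```

In this simple case the round trip gives back the original table. Real linked data needs proper identifiers (URIs) for subjects and predicates, which is where a mapping language comes in.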

This is it for my first post. Next week I will be at the BIG4 kick-off meeting in Copenhagen, where I will present some interesting workflows that I've been helping develop for the Biodiversity Data Journal.