Wednesday, June 29, 2016

Idea: An R-Package for RDF4J

Here, I will tell a little story about coming up with the idea of a little R-package that I am planning to develop.

Since journal articles are the central objects---and at the same time sources of atomized information from which to produce RDF---for the Open Biodiversity Knowledge Management System (OBKMS), I started by expressing information about a journal article in RDF. I borrowed heavily from the Treatment Ontology of Plazi. Here’s how you can express a sample paper in RDF (Turtle notation):

@prefix dcterms: <> .
@prefix frbr: <> .
@prefix prism: <> .
@prefix dcterms: <> .

<> rdf:type owl:NamedIndividual ,
                                fabio:JournalArticle ;
                                dcterms:creator "Huang, Sunbin" ,
                                                "Tian, Mingyi" ,
                                                "Yin, Haomin" ;
                                dcterms:date "2014" ;
                                dcterms:title "Du'an Karst of Guangxi: a kingdom of the cavernicolous genus Dongodytes Deuve (Coleoptera, Carabidae, Trechinae)" ;
                                frbr:partOf <> ;
                                prism:startingPage "69" ;
                                prism:endingPage "107" ;
                                fabio:hasPageCount "39" .

<> a fabio:JournalVolume ;
  frbr:partOf <> ;
  frbr:hasPart <> .

<> a fabio:Journal;
  frbr:hasPart <> .

Note that, in the context of the biodiversity graph, I have created graph nodes for the journal and the journal volume in order to properly attribute the article to its publishing journal. As we will put information from various journals into OBKMS, I decided to go online and look for lists of journals and their issues/volumes. CrossRef seemed like as good a place to start as any.

Indeed, I’ve found a dataset, which currently has 45763 rows, each specifying an individual journal. The dataset is in the CSV format, which makes it extremely easy to work with in R thanks to `read.csv`. So with the help of RCurl, I imported the dataset into my workspace and, after cleaning it up, wanted to upload it to the graph database. My initial idea was to use `getURL` and Co. to submit SPARQL update queries to the endpoints as described in the RDF4J server REST API, but soon I found myself writing my own little helper functions to make use of the different features of the API, and an idea dawned on me.
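To make the upload step a bit more concrete, here is a minimal sketch in R of how rows of a cleaned-up data frame could be turned into a single SPARQL `INSERT DATA` update and posted to the statements endpoint of a repository. Everything named here is an assumption made for illustration: the column names `uri` and `title`, the helper names, the server URL, the repository ID, and the choice of `dcterms:title` for the journal name. The sending function uses httr rather than RCurl, purely to keep the sketch short, and it is not called here.

```r
# Escape backslashes and double quotes so a string can sit inside
# a double-quoted SPARQL literal.
escape_literal <- function(x) {
  x <- gsub("\\", "\\\\", x, fixed = TRUE)
  gsub("\"", "\\\"", x, fixed = TRUE)
}

# Build one INSERT DATA update from a data frame with the
# (hypothetical) columns `uri` and `title`.
build_insert <- function(journals) {
  triples <- sprintf(
    "  <%s> a fabio:Journal ;\n    dcterms:title \"%s\" .",
    journals$uri, escape_literal(journals$title)
  )
  paste0(
    "PREFIX fabio: <http://purl.org/spar/fabio/>\n",
    "PREFIX dcterms: <http://purl.org/dc/terms/>\n",
    "INSERT DATA {\n",
    paste(triples, collapse = "\n"),
    "\n}"
  )
}

# URL of the statements endpoint of a repository on the server.
statements_url <- function(server, repo) {
  paste0(sub("/+$", "", server), "/repositories/", repo, "/statements")
}

# Post the update as a raw SPARQL update body (requires httr; not run here).
send_update <- function(server, repo, update) {
  if (!requireNamespace("httr", quietly = TRUE)) stop("httr is required")
  httr::POST(
    statements_url(server, repo),
    body = update,
    httr::content_type("application/sparql-update")
  )
}

# Placeholder data standing in for the cleaned journal list.
journals <- data.frame(
  uri = "http://example.org/journal/1",
  title = "Example Journal",
  stringsAsFactors = FALSE
)
update <- build_insert(journals)
cat(update, "\n")
```

With a real server one would then call something like `send_update("http://localhost:8080/rdf4j-server", "obkms", update)`, where both arguments are placeholders. For a table with tens of thousands of rows it would probably be wise to send the data in batches rather than as one huge update.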

A Google search later, I found two R packages doing RDF-related stuff: RDF and rrdf. Neither of them seemed to have the full RDF4J API functionality implemented, so I said to myself: why not put the little helper functions that I am already writing into a tiny little R package, which will abstract the RDF4J functionality for the user and make it really easy for the R programmer to work with their RDF4J-compliant RDF store? Here's the list of those types of stores:

I discussed this idea with somebody from the R community, and he seems to think the package is a good idea, so in the coming weeks I will be spending part of my time developing it.
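As a rough idea of what the query side of such a package might look like, here is a small sketch in R. It is not the package itself: the function names `rdf4j_repo`, `query_url`, and `run_select`, the server URL, and the repository ID are all made up for illustration. The sketch assumes that SPARQL queries are sent to the repository URL with a `query` parameter, in line with the SPARQL protocol, and it uses httr for the actual request, which is not run here.

```r
# Describe a repository on an RDF4J-compliant server (hypothetical helper).
rdf4j_repo <- function(server, id) {
  base <- paste0(sub("/+$", "", server), "/repositories/", id)
  list(query = base, statements = paste0(base, "/statements"))
}

# Build the URL for a SPARQL query sent with HTTP GET.
# reserved = TRUE percent-encodes every character that is not unreserved.
query_url <- function(repo, sparql) {
  paste0(repo$query, "?query=", utils::URLencode(sparql, reserved = TRUE))
}

# Run a SELECT query and ask for JSON results (requires httr; not run here).
run_select <- function(repo, sparql) {
  if (!requireNamespace("httr", quietly = TRUE)) stop("httr is required")
  httr::GET(
    query_url(repo, sparql),
    httr::accept("application/sparql-results+json")
  )
}

# Placeholder server and repository ID.
repo <- rdf4j_repo("http://localhost:8080/rdf4j-server", "obkms")
url <- query_url(repo, "SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(url)
```

Keeping the URL construction separate from the HTTP call means the string-building parts can be tested without a running server.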

Note: In the Turtle above, I have used literals for persons. This is probably not a good idea, and `foaf:Agent` might be more appropriate.

Latest Developments as of 2016-06-29

I haven’t updated my blog since January 25th, wow! Memo to self: update your blog more often!

A lot of time has passed and a lot of things have happened, which are impossible to cover fully. For example, BIG4 had its second workshop in Havraniky, the Czech Republic, apparently now called Czechia.

The following pictures should sum up the trip pretty well.

Here, I am pictured examining a small insect---probably a Coleopteran---under a microscope in the field station in Havraniky. This work was fun but very time-consuming. I learned to appreciate the expertise that professional taxonomists have. I really thank Emanuel from BIG4 for teaching me about Coleoptera identification and Prof. Ximo Mengual for teaching me about Diptera recognition. Unfortunately, I didn't have time to learn Lepidoptera or Hymenoptera. 
After collecting in the field and identifying in the station, we visited local wineries. The area around Podyji is known for its white wines.

In the meantime, work on the Open Biodiversity Knowledge Management System (OBKMS) is well underway. We have a prototype of the system already working thanks to some help from Plazi, Ontotext, Kiril Simov, and Eamon O'Tuama. Rod Page wrote an article about his vision of the biodiversity knowledge graph, which I reviewed via RIO’s post-publication peer-review mechanism. The review itself has the DOI "10.3897/rio.2.e8767.r25935". I'll re-publish an updated version of the review and my thoughts around it as one of my next blog posts.

In a nutshell, it is a technological article and an excellent read for people interested in the OBKMS. At the end of the day, however, one can realize the vision of the OBKMS with a number of technologies, each having its advantages and disadvantages.

My work in the last six months or so has concentrated on using RDF stores, GraphDB in particular, to implement OBKMS. I chose it for a number of reasons after having played around a little with Neo4j at the end of last year. One of the things that I like about RDF is that there is a huge body of both data (linked open datasets) and “data schemas,” known in the RDF world as ontologies, which give you ideas about how to model your data.

In particular, as OBKMS will be based mostly on scholarly papers, we intend to make heavy use of the SPAR Ontologies. For the biodiversity part, we will draw from Darwin Core Filtered Push, Darwin Core for the Semantic Web, and the Treatment Ontology. Over the last week, I have come up with a conceptual model of the relationships between Taxa, Names, Treatments, and Taxon Concepts, which I plan to express in OWL on top of these existing ontologies. I also have an idea for an R package to abstract RDF4J queries, which I will publish next on the blog.