Wednesday, June 29, 2016

Idea: An R-Package for RDF4J

Here, I will tell a little story about coming up with the idea of a little R-package that I am planning to develop.

Since journal articles are the central objects---and at the same-time sources for atomized information to produce RDF---for the Open Biodiversity Knowledge Management System (OBKMS), I started by expressing information about a journal article in RDF. I borrowed heavily from the Treatment Ontology of Plazi. Here’s how you can express a sample paper in RDF (Turtle notation):

@prefix dcterms: <> .
@prefix frbr: <> .
@prefix prism: <> .
@prefix dcterms: <> .

<> rdf:type owl:NamedIndividual ,
                                fabio:JournalArticle ;
                                dcterms:creator "Huang, Sunbin" ,
                                                "Tian, Mingyi" ,
                                                "Yin, Haomin"
                                dcterms:date "2014" ;
                                dcterms:title "Du'an Karst of Guangxi: a kingdom of the cavernicolous genus Dongodytes Deuve (Coleoptera, Carabidae, Trechinae)" ;
                                frbr:partOf <>;                                            
                                prism:startingPage "69" ;
                                prism:endingPage "107" ;
                                fabio:hasPageCount "39" .

<> a fabio:JournalVolume ;
  frbr:partOf <> ;
  frbr:hasPart <> .

<> a fabio:Journal;
  frbr:hasPart <> .

Note, that in the context of the biodiversity graph, I have created graph nodes for the journal and the journal volume, in order to properly attribute the article to its publishing journal. As we will put information from various journals into OBKMS, I decided to go on-line and look for lists of journals and their issues/ volumes. CrossRef seemed like as good a place to start as any.

Indeed, I’ve found a dataset, which currently has 45763 rows, each specifying an individual journal. The dataset is in the CSV format, which makes extremely easy to work with in R thanks to `read.csv`. So with the help of RCurl, I imported the dataset into workspace, and after cleaning up wanted to upload it to the graph database. My initial idea was to use `getUrl` and Co. to submit SPARQL update-queries to the endpoints as described in RDF4J server REST API, but soon I found myself writing my own little helper functions to make use of the different features of the API and an idea dawned on me.

A Google search later, I found two R packages doing RDF-related stuff: RDF and rrdf. None of them seemed to have the full RDF4J API functionality implemented, so I said to myself: why not put the little helper functions that I am already writing in a tiny little R package, which will abstract the RDF4J functionality for the user and make it really easy for the R programmer to work with their RDF4J-compliant RDF store. Here's the list of those types of stores:

I discussed this idea with somebody from R community and he seems it is a good idea to do the package, so in coming weeks I will be spending part of my developing it.

Note: In the Turtle above, I have used literals for persons. This is probably not a good idea and `foaf:Agent` might more appropriate.

No comments:

Post a Comment