Thursday, June 25, 2009

EOL, Wikipedia, TDWG, LinkedData, and the Vision Thing

Time for more half-baked ideas. There's been a lot of discussion on Twitter about EOL, Linked Data (sometimes abbreviated LOD), and Wikipedia. Pete DeVries (@pjd) is keen on LOD, and has been asking why TDWG isn't playing in this space. I've been muttering dark thoughts about EOL, and singing the praises of Wikipedia. And so it goes on. So, here's one vision of where we could (?should) be going with this.

Let's imagine that we do indeed want to play in the Linked Data space. The concern that tends to be raised the most is that biodiversity informatics uses LSIDs as the standard GUID, and LSIDs don't play nice with Linked Data, which expects identifiers to be HTTP URIs that can be dereferenced. This is true, but not life threatening. There are various hacks (like this and this) that deal with the problem.
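One common flavour of hack is to wrap the LSID in an HTTP URI by prefixing it with the address of a resolver proxy, so that Linked Data clients get something they can fetch. Here's a minimal sketch of that idea in Python. The proxy address and the example LSID are assumptions for illustration, not a recommendation of a particular service, and the syntax check is deliberately loose.

```python
import re

# Assumed proxy base for illustration; any HTTP service that accepts
# an LSID appended to its address would work the same way.
LSID_PROXY = "http://lsid.tdwg.org/"

# Loose check for urn:lsid:authority:namespace:object[:revision]
LSID_PATTERN = re.compile(
    r"^urn:lsid:[^:\s]+:[^:\s]+:[^:\s]+(:[^:\s]+)?$", re.IGNORECASE
)


def lsid_to_http_uri(lsid, proxy=LSID_PROXY):
    """Wrap an LSID in an HTTP URI by prefixing it with a proxy address."""
    lsid = lsid.strip()
    if not LSID_PATTERN.match(lsid):
        raise ValueError("not a well-formed LSID: %r" % lsid)
    return proxy + lsid


if __name__ == "__main__":
    # Hypothetical LSID, made up for this example.
    print(lsid_to_http_uri("urn:lsid:example.org:names:12345"))
```

The LSID itself is untouched, so existing LSID clients keep working; the HTTP form is just an extra handle on the same identifier.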

But, the real concern (I think) is that we need a way to link our stuff to the rest of the Linked Data cloud. That is, wherever possible we need to reuse existing identifiers. In the LOD diagram below (for the latest version see here) DBpedia.org is key to linking much of this together, and major players (such as the BBC) are now using DBpedia.org to make connections.



DBpedia.org is based on Wikipedia, so I think you can see where this is going. There are some 120,000+ taxon pages in Wikipedia, so that's some 120,000+ identifiers in DBpedia.org that others interested in organisms can (and will) use to refer to taxa. Given the centrality of Wikipedia and DBpedia to LOD, why don't we adopt DBpedia.org URIs as the default GUID for our taxa? At present we have numerous, competing identifiers (e.g., NCBI tax ids, ITIS TSNs, Catalogue of Life LSIDs, uBio NameBankIDs, plus LSIDs from various nomenclators). For users this is a mess -- which one do I use? Deciding requires dealing with issues (such as the difference between nomenclatural codes, and between taxonomic names and concepts) that, frankly, nobody outside our community cares about.
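To make the idea concrete: a DBpedia.org URI is essentially the Wikipedia page title, with spaces turned into underscores, appended to http://dbpedia.org/resource/. Here's a rough Python sketch that mints such a URI and hangs our competing identifiers off it. The function names and the identifier values in the example are hypothetical, and the percent-encoding is a simplification of whatever DBpedia.org actually does with unusual characters.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def dbpedia_uri(wikipedia_title):
    """Build a DBpedia.org resource URI from a Wikipedia page title."""
    title = wikipedia_title.strip().replace(" ", "_")
    # Simplified encoding: keep a few characters common in page titles.
    return DBPEDIA_RESOURCE + quote(title, safe="_(),'")


def index_by_dbpedia_uri(records):
    """Group scheme-specific identifiers under one DBpedia.org URI per taxon.

    `records` is an iterable of (wikipedia_title, scheme, identifier) tuples.
    """
    index = {}
    for title, scheme, identifier in records:
        index.setdefault(dbpedia_uri(title), {})[scheme] = identifier
    return index


if __name__ == "__main__":
    # Placeholder identifiers, made up for this example.
    records = [
        ("Panthera leo", "ncbi", "TAXID-PLACEHOLDER"),
        ("Panthera leo", "itis", "TSN-PLACEHOLDER"),
    ]
    for uri, ids in index_by_dbpedia_uri(records).items():
        print(uri, ids)
```

The point of the second function is that a user only ever sees the DBpedia.org URI; the choice between NCBI, ITIS, and the rest becomes our problem behind the scenes, not theirs.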

So, if we want to play with LOD, we need to make our identifiers play nice (straightforward), and we should think seriously about adopting DBpedia.org URIs as the default GUID for taxa.

Now, where does this leave EOL? Well, frankly, it should get out of the business of making web pages for taxa, because Wikipedia owns that space already. Wikipedia has fewer pages, but they are often much more detailed than the corresponding EOL pages, and Wikipedia reacts faster to new discoveries. Wikipedia supports community editing, versioning, and quite sophisticated tools for handling bibliographic references.

There's plenty of scope for useful tools and services for EOL to develop, but I think the real game is elsewhere. Now, Wikipedia is far from perfect. It's basically semi-structured text with a God-awful template language, and it would benefit greatly from more structure (e.g., as could be provided by Semantic MediaWiki), but I think we should think about building upon it. We could build our own (and my experiments over at itaxon.org explore this), but the big challenge is getting a community around a project, and if David Shorthouse's pronouncement that The Community is Dead is correct, then maybe we should get on board with the community that already exists. Perhaps what EOL should be doing is talking to Wikipedia, improving the existing templates for taxon pages, and creating bots to automatically populate Wikipedia with more taxon pages.