Friday, January 09, 2015

GBIF, biodiversity informatics and the "platform rant"

Each year about this time, as I ponder what to devote my time on in the coming year, I get exasperated and frustrated that each year will be like the previous one, and biodiversity informatics will seem no closer to getting its act together. Sure, we are putting more and more data online, but we are no closer to linking this stuff together, or building things that people can use to do cool science with. And each year I try and figure out why we are still flaying about and not getting very far. This year, I've settled on the lack of "platforms".

In 2011 Steve Yegge (accidentally) published a widely read document known as the "Google Platforms Rant". It's become something of a classic, and I wonder if biodiversity informatics can learn from this rant (it's long but well worth a read).

One way to think about this is to look at how we build things. In the early days, people would have some data and build a web site:

Rant1

In the diagram above "dev" is the web developer who builds the site, and "DBA" is the person who manages the data (for many projects this is one and the same person). The user is presented with a web site, and that's the only way they can access the data. If the web site is well designed this typically works OK, but the user will come up against limitations. Why do I have to manually search for each record? How can I combine this data with some other data? These questions lead to some users doing things like screen scrapping, anything to get the data and do more than the web site permits (I spend a lot of my time doing exactly this). In contrast, the person (or team) building the site ("dev") can access the data and tools directly.

Eventually some sites realise that they could add value to their users if they added an API, so typically we get something like this:

Rant2

Now we have an API (yay), but notice that it is completely separate from the web site. Now the site developers have to manage two different things, and two sets of users (web site visitors, and user programming against the API). Because the site and the API are different, and the site gets more users, typically what happens is the API lacks much of the functionality of the site, which frustrates users of the API. For example, when Mendeley launched it's API its limited functionality and lack of documentation drove me nuts. Similarly, the Encyclopedia of Life (EOL) API is pretty sucky. If anyone from EOL is reading this, for the love of God add user authentication and the ability to create and edit collections to the API. Until you do, you'll never have an ecosystem of apps.

A solution to sucky APIs is "dogfooding":

Rant3

Dogfooding is the idea that your product is so good you'd use it yourself. In the case of web development, if we build the web site on top of the same API that we expose to users, then the site developers have a strong incentive to make the API well-documented and robust, because their web site runs on the same services. As a result the interests of the web developers and users who are programmers are much more aligned. If a user finds a bug in the API, or the API lacks a feature, it's much more likely to get fixed. An example of a biodiversity informatics project that "gets" dogfooding is GBIF, which has a nice API that powers much of their web site. This is a good example of how to tell if an API is any good, namely, can you recreate the web site yourself just using the API?

But the example above leaves one aspect of the whole system still intact and not accessible to users. Typically a company or organisation has data, tools, and processes that it uses to manage whatever is central to its operations. These are kept separate from users, who only get to access these indirectly through the web site or the API.

A "platform" takes things one step further. Steve Yegge summarises Jeff Bezos' memo that outlined Amazon's move to a platform:

  1. All teams will henceforth expose their data and functionality through service interfaces.
  2. Teams must communicate with each other through these interfaces.
  3. There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
  4. It doesn't matter what technology they use. HTTP, Corba, Pubsub, custom protocols -- doesn't matter. Bezos doesn't care.
  5. All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
  6. Anyone who doesn't do this will be fired.
  7. Thank you; have a nice day!
All the core bits of infrastructure that powered Amazon were to become services, and different bits of Amazon could only talk to each other through these services. The point of this is that it enabled Amazon to expose its infrastructure to the outside world (AKA paying customers) and now we have Amazon cloud services for storing data, running compute jobs, and so on. By exposing its infrastructure as services, Amazon now runs a big chunk of the startup economy. By insisting that Amazon itself uses these services (dogfooding at the infrastructure level), Amazon ensures that this infrastructure works (because its own business depends on it).

Rant4

There are some things Google does that are like a platform (despite the complaints in the "Google Platforms Rant"). For example, you could imagine that most workers at Google use tools such as Google Docs to create and share documents. Likewise, Google Scholar is unlikely to be a simple act of altruism. If you have a team of world class researchers you need a tool that enables them to find existing research. Google Scholar does this. If you then expose it to the outside world you get more users, and an incentive for commercial publishers to open up their paywall journals to being indexed by Google's crawlers, and incentive that would be missing if Scholar was purely an internal service.

Now, giant companies like Amazon and Google might seem a world away from biodiversity informatics, but I think there are things we can learn from this. Looking around, I think there are other examples of platforms that may seem closer to home. For example, the NCBI runs GenBank and PubMed, and these are very like platforms. GenBank provides tools, such as BLAST, that it provides to the user community, but which it also uses internally to cluster sequences into related sets. Consider PubMed, which has gone from a simple index to the biomedical literature to a publishing platform. PubMed has driven the standardisation of XML across biomedical publishers. It is quite possible to visit the NCBI site, explore data, then read full text for the associated publications in PubMed Central, without ever leaving the NCBI site. No wonder some commercial publishers are deeply worried about PubMed Central.

A key thing about platforms is that the people running the platform have a deep interest in many of the same things as the users of that platform (note the "users" scattered all over the platform diagram above). Instead of user being a separate category that you try and serve by figuring out what they want, developers are users too.

To try and flesh this out a little more, what would a "taxonomic" platform look like? At the moment, we have lots of taxonomic web sites that pump out lists of names and little else. This is not terribly useful. If we think about what goes into making lists of names, it requires access to the scientific literature, it requires being able to read that literature and extract statements about names (e.g., this is the original description, these two names are synonyms, etc.), and it requires some way of summarising what we know about those names and the taxa that we label with those names. Typically these are all things that happen behind the scenes, then the user simply gets a list of names. A platform would expose all of the data, tools, and processes that went into making that list. It would provide the literature in both human and computer readable forms, it would provide tools for extracting information, tools to store knowledge about those names, and tools to make inferences using that knowledge. All of these would be exposed to users. And these some services and tools would be used by the people building those services and tools.

This last point means that you also need people working on the same problems as "users". For example, consider something like GBIF. At the moment GBIF consumes output of taxonomic research (such as lists of names) and tries to make sense of these before serving them back to the community. There is little alignment between the interests of taxonomists and GBIF itself. For GBIF to become a taxonomic platform, it would need to provide the data, tools and services for people to do taxonomic research, and ideally it would actually have taxonomists working at GBIF using those tools (these taxonomists could, for example, be visiting fellows working on particular taxa, rather than permanent employees). These tools would greatly help the taxonomic community, but also help GBIF make sense of the millions of names it has to interpret.

It's important to note here the the goal of the platform is NOT to "help" users - that simply reinforces the distinction between you and the "users". Instead it is to become a user. You may have more resources, and work on a different scale (few business Amazon's services support will be anything like as big as Amazon), but you are ultimately "just" another user.