Sunday, March 01, 2009

What would people want/expect from taxon searching on TreeBASE?

Rutger Vos asked on Twitter "What would people want/expect from taxon searching on TreeBASE?". This is a good question, and one which motivated the work I did on TBMap (see doi:10.1186/1471-2105-8-158), which developed a mapping between TreeBASE taxa and other databases. In that paper I published a table showing the effectiveness of string and hierarchical queries of TreeBASE. A string query returned just the studies that included the query term, a hierarchical query returned all studies that contained taxa that descend from a given node in the NCBI taxonomy.

A hierarchical query can be visualised as a range query. For example, the diagram below shows a classification where the descendants of Node A correspond to the range 1-8 (this is a simplification of visitation numbers, see my earlier post, and also Chen et al. doi:10.1186/1471-2148-8-90). The three trees can be represented as the ranges 3-8, 6-7, and 2-4, respectively. To find trees for the taxon corresponding to Node A we look for trees whose range intersects 1-8, to find trees corresponding to Node B we look for trees whose range intersects 1-5.


This approach retrieves a list of all trees that include a given taxon, but potentially this list could be very large (for example, a query for all plant trees could return 1000's of trees). So the question becomes how to order these trees? Some ideas:
  1. order by the number of taxa in the tree.
  2. order by the size of the range ("span") of the tree.
  3. order by set inclusion.

Ordering by size seems attractive, but will favour trees with more taxa over those with a greater taxonomic spread (which is favoured by ordering by the span of the tree). I used span in my challenge entry to display papers with related taxa.

What I have in mind for ordering by set inclusion is constructing a directed graph where each node represents a tree, and a pair of nodes (x, y) is connected by an edge xy if the taxa in tree y are nested inside the taxa in tree x. We could also introduce some additional nodes corresponding to nodes in the classification. If we then topologically sort the graph we have a linear order for the trees. Given that this order could be pre-computed independently of any queries (in much the same way that PageRank is), it could make for faster queries.

It would be useful to explore these and other ordering criteria. Perhaps the best approach would be some measure which combines one or more criteria, in which case we might want to use some form of rank aggregation (see iSpecies clones, and taxonomic intelligence for some links to the relevant literature).

1 comment:

sdfsdf said...

That's a wonderfully detailed reply, thank you very much.

To provide a bit more context, TreeBASE2 organizes the labels (i.e. contents of nexus taxa blocks, the row names of character blocks and the leaf names in tree blocks) it finds in submitted studies as follows:

* For every label, a TaxonLabel object is created, which is linked to the submission.

* TaxonLabel objects are linked to TaxonVariant objects, which are simply the normalized form of the string occurrence. If multiple submissions contain the label "Pan paniscus", there are TaxonLabel objects by that name for each submission and a single TaxonVariant "Pan paniscus" to which all the TaxonLabels refer.

* If "Pan paniscus" occurs in uBio, the TaxonVariant contains the NameBankID for that record.

* If the uBio record contains an NCBI taxon ID, the TaxonVariant object has a reference to a Taxon object, which holds that NCBI ID and the "canonical" NameBankID.

* If submitters use cryptic names in their data files they can help this matching by manually entering the NameBankID.

The implication for searching is that users can search on raw labels (which would return any old cruft people put in their data files), on variants (which is the aggregate of cruft that uBio has encountered) or on "true" taxa.

(I am still thinking about your comments on ordering the results. We present search results as tables which can be sorted on columns, so we don't have to choose a single ordering. If we can express "span" or "inclusiveness" in a number, the user can sort on that.)