Classifying occurrences
One of the two major pillars of the Edgar project is vetting bird records.
The selfish reason for vetting is that modelling the climatic suitability of an area for a given species of bird – the other pillar of the Edgar project – is sensitive to inaccurate occurrence records. The model assumes that the occurrence of a bird in an area is evidence that the bird can survive in that area. So for our modelling to be accurate, we need to identify occurrence records that aren’t evidence that a bird can survive somewhere.
The unselfish reason for doing vetting is to pool the vetted data and make it available to other researchers who need clean data. To this end, we will make our vetted data available for download and list it in metadata repositories, and also feed back vetting contributions to data sources, wherever that’s possible.
There are several reasons that recorded occurrences might not be evidence of survival. The obvious one is that the occurrence was recorded in error – the species was mis-identified by the observer, or observation details were altered when written down or copied.
However, a recorded observation can be true but still unsuitable for modelling. A valid observation from decades ago might not show that current conditions in that area are suitable. A rainforest bird may have been caught in severe weather and blown off the coast. What the modelling process really needs to know is whether an occurrence demonstrates that an area can sustain a population of the bird species.
So, occurrence records need cleaning. For our purposes we want them classified into “suitable for modelling” and “unsuitable for modelling”, but that’s the language nerdy modellers use; the bird experts who actually have the knowledge to differentiate between those two categories don’t use that terminology.
In situations where you are paying people to interact with your system, or there’s some other reason that your system has more power than your users, you can just give the users a manual to read, and you’re done. In our case, the vetting users are volunteers with valuable knowledge, so we want to treat them pretty well.[1] We need to collect vetting information in the language spoken by the users.
Bird watcher’s classification - the habitat dimension
One of the dimensions along which a recorded occurrence can be classified is the nature of the habitat.
After a bit of cultural immersion and a long discussion with Lauren, I think this is a reasonable classification for an occurrence’s habitat.
not yet classified (0): We haven't yet put this occurrence record into a proper classification.
invalid (1): The occurrence record is incorrect; the bird could not have been seen there.
valid: The occurrence record is correct – the bird really was seen in that spot.
Not all of these points are strictly about the habitat, but I’ve sacrificed the pleasure of seeing a nice clean classification for the convenience of having a single dimension for our volunteers to interact with.
I only think this is a good list. The true test is if our volunteer bird experts feel like they can choose the right classification without feeling constrained by the list they have to choose from. Conversely, it’s a bad list if none of the offered classifications match the expert’s opinion.
I can detect that by offering the vetting users a comments box as well as a classification selector. If a user feels like none of the classifications offered are suitable, they’re more likely to write an explanatory comment; if they can select a completely suitable classification, they’re less likely to comment. So a high frequency of comments could mean we need to rework our classifications.
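As a rough illustration of that check, here is a minimal sketch that computes the fraction of vettings carrying a comment, per classification. The function name and the `(classification, comment)` pair shape are assumptions made for this example, not part of Edgar.

```python
from collections import Counter


def comment_rate_by_classification(vettings):
    """Return {classification: fraction of vettings with a non-empty comment}.

    `vettings` is an iterable of (classification, comment) pairs; this
    record shape is hypothetical and chosen only for the illustration.
    """
    totals = Counter()
    commented = Counter()
    for classification, comment in vettings:
        totals[classification] += 1
        # Treat None and whitespace-only comments as "no comment".
        if comment and comment.strip():
            commented[classification] += 1
    return {c: commented[c] / totals[c] for c in totals}
```

A classification whose comment rate is much higher than the others would be a candidate for rework; what counts as "much higher" is a judgement call this sketch doesn't try to make.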
Note that I’ve left classification 6 off my diagram. The last three pinkish ones are combinations of two bi-valued dimensions, { breeding | non-breeding } and { introduced | natural }, which should give four results. The one I’ve left out is the ‘introduced non-breeding’ combination, which Lauren suggested was an unreasonable combination given that a non-breeding core area implies a migratory bird, and migratory birds are unlikely to stay in an area even once they’ve been “introduced”.
How we arrive at an initial classification
Edgar will launch with about eighteen million occurrences. That’s too many to rely on volunteer vettings that classify each one. We need to have some way to auto-classify occurrences.
Here’s how we plan to do it:
When importing occurrences from ALA or some other source, examine the metadata for that occurrence. ALA attach “assertions” to each recorded observation, some of which refer to apparent validity; for example, one assertion is attached to an observation that is outside the normal environmental range of the species. We can apply an initial guess at validity using those assertions.
BirdLife Australia have provided Edgar with the accepted ranges for bird species as geographic regions, with separate region polygons differentiating ranges that are core, irruptive, etc. Incoming observations will be compared against those regions to suggest classifications.
When we record a vetting decision to apply to an observation, we will assume the vetting classification applies to a circle around the observation. So a new occurrence may fall into a previously vetted area, in which case we can apply the classification given in the vetting.
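The third mechanism, applying a vetting to a circle around the vetted observation, could be sketched roughly like this. The radius, the `Vetting` record, the haversine distance and the "nearest vetting wins" rule are all assumptions for illustration; the plan above doesn't specify any of them.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0


@dataclass
class Vetting:
    lat: float
    lon: float
    classification: str
    radius_km: float  # hypothetical: no radius is given in the plan above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def classification_from_vettings(lat, lon, vettings):
    """Classification of the nearest vetting whose circle contains the point.

    Returns None when the point falls inside no vetted circle.
    Preferring the nearest vetting is an assumption of this sketch.
    """
    best = None  # (distance_km, classification)
    for v in vettings:
        d = haversine_km(lat, lon, v.lat, v.lon)
        if d <= v.radius_km and (best is None or d < best[0]):
            best = (d, v.classification)
    return best[1] if best else None
```

For example, with a vetting of "invalid" at (-19.26, 146.82) and a 5 km radius, a new occurrence at (-19.27, 146.82) is about 1.1 km away and would pick up "invalid".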
That gives us three opportunities to get a classification before we ask a volunteer bird expert to look at the observation record. But now we are a man with two watches; if we get differing classifications from our various sources, we have some ambiguity to resolve.
Certainty classification
Each observation record gets a primary classification about the observation’s validity and habitat. We will also track a measure of our certainty in that classification. Later we may use certainty to draw the attention of vetting users to classifications we aren’t sure of.
We can achieve a classification by allocating each classifying mechanism a certainty level, then collecting the votes for a given occurrence, and choosing the classification with the highest certainty level.
Certainty ranges from 0 to 6.
“Not yet classified” starts at certainty level 1 (I’m reserving 0 for some future “wtf!?” situation).
A classification of “invalid” determined by metadata about the occurrence may be 2 or higher, depending on the metadata. Invalid at certainty levels 4 and 5 might not be shown on the map.
A classification from BirdLife polygons gives a certainty level of 2.
A classification from vetting by a normal user will give certainty level 3. If there are multiple vettings, the most recent one wins.
If our project schedule allows, we will give admin users the ability to mark some users as “recognised experts”. If we do so, then a vetting by an expert user will give certainty level 4.
A classification from a vetting entered by an admin user gives a certainty level of 6.
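A minimal sketch of the voting described above might look like this. The `Vote` record and constant names are made up for the example. Breaking every certainty tie by recency is an assumption that extends the "most recent one wins" rule, which the text states only for normal user vettings.

```python
from dataclasses import dataclass
from datetime import datetime

# Certainty levels taken from the list above; the names are hypothetical.
UNCLASSIFIED = 1
BIRDLIFE_POLYGON = 2
USER_VETTING = 3
EXPERT_VETTING = 4
ADMIN_VETTING = 6


@dataclass
class Vote:
    classification: str
    certainty: int
    when: datetime


def primary_classification(votes):
    """Pick the vote with the highest certainty; the most recent wins ties."""
    if not votes:
        # No mechanism has voted yet: fall back to "not yet classified".
        return Vote("not yet classified", UNCLASSIFIED, datetime.min)
    return max(votes, key=lambda v: (v.certainty, v.when))
```

So a BirdLife polygon vote of "valid" (certainty 2) is overridden by a later normal-user vetting of "invalid" (certainty 3), which in turn is overridden by any admin vetting (certainty 6).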
Contention classification
Sources of classification can disagree. We need a single primary classification to serve our modelling needs, because an occurrence is either included in the modelling or not, so our classification strategy enforces a precise priority of classification. Where there is disagreement, though, we want an admin user to investigate further and resolve the conflict.
Contention levels range from 0 to 3 and are calculated by looking at the classification votes.
disagreement between two classification votes where both are at the highest level of certainty is a level 3 contention.
disagreement between the highest certainty vote and a certainty vote one point lower is a level 2 contention.
disagreement between the highest certainty vote and a vote two certainty points lower is a level 1 contention.
any other disagreement is considered uncontentious (level 0).
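Those rules could be sketched as follows. The function name and the `(classification, certainty)` pair shape are assumptions for illustration. The sketch also assumes that only real classification votes are passed in (no "not yet classified" placeholder), and that contention is measured against the strongest dissenting vote.

```python
def contention_level(votes):
    """Contention level (0 to 3) for a list of (classification, certainty) votes."""
    if not votes:
        return 0
    top_certainty = max(certainty for _, certainty in votes)
    top_classes = {cls for cls, certainty in votes if certainty == top_certainty}
    if len(top_classes) > 1:
        # Disagreement among votes at the highest certainty level.
        return 3
    (winner,) = top_classes
    # Certainty levels of all votes that disagree with the winner.
    dissent = [certainty for cls, certainty in votes if cls != winner]
    if not dissent:
        return 0
    gap = top_certainty - max(dissent)
    # One point lower -> level 2, two points lower -> level 1, otherwise 0.
    return {1: 2, 2: 1}.get(gap, 0)
```

For instance, a normal-user vetting of "valid" (certainty 3) against a BirdLife polygon vote of "invalid" (certainty 2) is a level 2 contention, while an admin vetting (certainty 6) against the same polygon vote is uncontentious.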
[1] Plus, our mums told us we should always treat everyone well.