Species taxonomists make expert categorisation judgments based on their experience working with collections of specimens. Many machine learning systems for automated inference have a similar objective, but are not usually developed in direct collaboration with knowledge communities built around taxonomic expertise and methods.
The goal of this project is to develop a new experimental system for automated classification of fossil and insect photographs into species, based on machine learning / artificial intelligence techniques. Funded by the Natural History Museum, the initial work has been carried out within the Undergraduate Research Opportunities Programme by:
- Norman MacLeod, Keeper of Palaeontology at the Natural History Museum in London,
- Dr Ben Glocker at Microsoft Research Cambridge and
- Tom Whitehead, an undergraduate engineer at Cambridge University.
The approach chosen is to develop a supervised learning system using the Random Forest Classifier algorithm. We will aim for the software to be easily extensible and reconfigurable so that a variety of changes can be made at a later date. The results achieved with this approach will be compared to those for the existing system (called DAISY) which is based on the Plastic Self-Organising Map (PSOM) algorithm.
In the early stages of this project we are focusing primarily on three data sets: wasp heads, UK butterfly species and forams (foraminifera). The system currently uses colour, texture, pattern and a small amount of shape information to generate the binary trees.
We are currently attempting to use a single image (a non-linear combination of the training images) from each class to automatically split test images into separate components. We are also expanding the variety of tests that can be applied to images at nodes within the binary trees.
The basic components of the system are:
Random Forest. The random forest is a collection of binary trees, each with different characteristics. To classify a test image, the image is passed down each binary tree until it reaches a leaf node. Each leaf node holds a probability distribution over classes, determined by which images from the training set reached that node. The probability distributions from the leaf nodes reached are combined, and the class with the highest probability is returned as the predicted class.
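The classification step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's code: the `Node` class, the `leaf_distribution` and `classify` functions, and the choice to combine the leaf distributions by simple averaging are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence, Tuple


@dataclass
class Node:
    """Hypothetical tree node: internal nodes hold a split, leaves a distribution."""
    split: Optional[Callable[[object], bool]] = None  # True means "go right"
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    distribution: Optional[Dict[str, float]] = None   # set only on leaf nodes


def leaf_distribution(tree: Node, image: object) -> Dict[str, float]:
    """Pass the image down one tree and return the distribution at the leaf reached."""
    node = tree
    while node.distribution is None:
        node = node.right if node.split(image) else node.left
    return node.distribution


def classify(forest: Sequence[Node], image: object) -> Tuple[str, Dict[str, float]]:
    """Average the leaf distributions over all trees (an assumed combination rule)
    and return the most probable class together with the averaged distribution."""
    totals: Dict[str, float] = {}
    for tree in forest:
        for label, p in leaf_distribution(tree, image).items():
            totals[label] = totals.get(label, 0.0) + p
    averaged = {label: total / len(forest) for label, total in totals.items()}
    return max(averaged, key=averaged.get), averaged
```

Here an "image" can be any object the split functions understand; in a toy setting it might be a dictionary of precomputed features.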
Nodes. All nodes other than leaf nodes contain a split function which determines whether an image passes left or right at that node. To choose the split function, a large number of candidates are generated (currently 1000) and the entropy of the probability distribution at each child node is calculated for each candidate. The selected split function is the candidate whose child nodes have the lowest entropy (that is, the greatest information gain).
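As a rough sketch of this selection procedure, the following Python generates random candidate splits and keeps the one with the lowest child entropy. Several details are assumptions made for illustration only: training samples are represented as `(features, label)` pairs with numeric features in the range 0 to 1, each candidate is a simple threshold on one feature, and the two child entropies are combined as a size-weighted sum. The function names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution of a list of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_score(samples, split):
    """Size-weighted entropy of the two child nodes produced by a split.
    Lower is better; minimising it maximises the information gain."""
    left = [label for features, label in samples if not split(features)]
    right = [label for features, label in samples if split(features)]
    n = len(samples)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)


def random_candidate(rng, n_features):
    """Assumed candidate form: threshold test on one randomly chosen feature."""
    index = rng.randrange(n_features)
    threshold = rng.random()
    return lambda features, i=index, t=threshold: features[i] > t


def choose_split(samples, n_candidates=1000, seed=0):
    """Generate candidates and return the one whose children have the lowest entropy."""
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    candidates = [random_candidate(rng, n_features) for _ in range(n_candidates)]
    return min(candidates, key=lambda s: split_score(samples, s))
```

A split that sends every sample to the same child scores the same as not splitting at all, so it is never preferred over a split that actually separates the classes.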
Split Functions. There are many types of split function which can be used, and new types can be added and removed very easily. These split functions can be based on any aspect of the image, such as colour, texture or shape. The tests can also be run on the whole image, a certain area of the image or a single component of the image (such as just the left forewing on a butterfly).
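One way such interchangeable split functions might be organised is as small factory functions kept in a registry, so that types can be added or removed without touching the tree code. The sketch below is purely illustrative: the image representation (rows of RGB tuples), the rectangular `region` argument standing in for "a certain area of the image", the brightness-range test used as a crude stand-in for texture, and all names are assumptions rather than the system's actual design.

```python
from typing import Callable, Dict, List, Optional, Tuple

Pixel = Tuple[int, int, int]          # (red, green, blue)
Image = List[List[Pixel]]             # rows of pixels
Region = Tuple[int, int, int, int]    # (top, left, bottom, right), end-exclusive


def crop(image: Image, region: Optional[Region]) -> Image:
    """Return the given rectangular area, or the whole image if region is None."""
    if region is None:
        return image
    top, left, bottom, right = region
    return [row[left:right] for row in image[top:bottom]]


def make_colour_split(channel: int, threshold: float,
                      region: Optional[Region] = None) -> Callable[[Image], bool]:
    """Split on the mean value of one colour channel within the region."""
    def split(image: Image) -> bool:
        values = [pixel[channel] for row in crop(image, region) for pixel in row]
        return sum(values) / len(values) > threshold
    return split


def make_contrast_split(threshold: float,
                        region: Optional[Region] = None) -> Callable[[Image], bool]:
    """Split on the brightness range within the region (a crude texture proxy)."""
    def split(image: Image) -> bool:
        brightness = [sum(p) / 3 for row in crop(image, region) for p in row]
        return max(brightness) - min(brightness) > threshold
    return split


# Hypothetical registry: adding or removing a split type is a one-line change.
SPLIT_FACTORIES: Dict[str, Callable[..., Callable[[Image], bool]]] = {
    "colour": make_colour_split,
    "contrast": make_contrast_split,
}
```

Restricting a test to a single component, such as one wing, would additionally need a segmentation of the image into components, which this sketch does not attempt.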
Testing. Early tests less than two weeks into the project gave an accuracy of 78.26%. At this stage only very simple split functions were being used and we are confident that we will be able to significantly improve upon this result.