was required to handle many of the entries; for instance, to cope with spelling mistakes, ad hoc or inconsistent abbreviations (e.g., “Bham” or “B’ham” for “Birmingham”), and the use of different ways of expressing the same thing (e.g., “Birmingham accent”, “Bham accent”, “local accent”, “accent: local”, etc.).
After some initial analysis, the researchers decided to focus on eight variables: age, height, hair color, hair length, build, accent, race, and number of accomplices.
Once the data had been processed into the appropriate structured format, the next step was to use geometric clustering to group the 105 offender descriptions into collections that were likely to refer to the same individual. To understand how this was done, let’s first consider a method that at first sight might appear to be feasible, but which soon proves to have significant weaknesses. Then, by seeing how those weaknesses may be overcome, we will arrive at the method used in the British study.
First, you code each of the eight variables numerically. Age (often a guess) is likely to be recorded either as a single figure or a range; if it is a range, take the mean. Gender (not considered in the British Midlands study because all the cases examined had a female distracter) can be coded as 1 for male, 0 for female. Height may be given as a number (inches), a range, or a term such as “tall”, “medium”, or “short”; again, some method has to be chosen to convert each of these to a single figure. Likewise, schemes have to be devised to represent each of the other variables as a number.
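To make the coding step concrete, here is a minimal Python sketch of how a few of these conversions might look. It is an illustration only, not the scheme the researchers used: the function names, the `HEIGHT_TERMS` table, and the inch values assigned to “short”, “medium”, and “tall” are all assumptions made up for this example.

```python
import re

# Hypothetical inch values for verbal height terms (assumed, not from the study).
HEIGHT_TERMS = {"short": 64.0, "medium": 68.0, "tall": 72.0}

# Coding given in the text: 1 for male, 0 for female.
GENDER_CODES = {"male": 1, "female": 0}


def code_range(text):
    """Turn '35' or '30-40' into one figure; a range becomes its mean."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers:
        raise ValueError(f"no number found in {text!r}")
    return sum(numbers) / len(numbers)


def code_height(text):
    """Accept a verbal term, a single number of inches, or a range."""
    key = text.strip().lower()
    if key in HEIGHT_TERMS:
        return HEIGHT_TERMS[key]
    return code_range(text)


def code_gender(text):
    """Look up the 1/0 code; unknown values raise KeyError."""
    return GENDER_CODES[text.strip().lower()]


print(code_range("30-40"))   # 35.0
print(code_height("tall"))   # 72.0 (under the assumed table above)
```

Whatever values are chosen for the verbal terms, the point is that every description ends up as a list of plain numbers, one per variable.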
When the numerical coding has been completed, each perpetrator description is then represented by an eight-vector, the coordinates of a point in eight-dimensional geometric (Euclidean) space. The familiar distance measure of Euclidean geometry (the Pythagorean metric) can then be used to measure the geometric distance between each pair of points. This gives the distance between two vectors (x₁, …, x₈) and (y₁, …, y₈) as:
d(x, y) = √[(x₁ − y₁)² + … + (x₈ − y₈)²]
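The Pythagorean metric just described can be computed in a few lines of Python. This is a sketch for illustration; the function name `euclidean_distance` and the two sample vectors are invented here and do not come from the study's data.

```python
import math


def euclidean_distance(x, y):
    """Pythagorean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Two made-up coded descriptions, eight numbers each (hypothetical values).
first = (35.0, 72.0, 2.0, 1.0, 3.0, 1.0, 0.0, 2.0)
second = (38.0, 68.0, 2.0, 1.0, 3.0, 1.0, 0.0, 2.0)

# The vectors differ by 3 in the first coordinate and 4 in the second only.
print(euclidean_distance(first, second))  # 5.0
```

Two descriptions that agree on every coded variable are at distance 0, and the distance grows as the coded values move apart.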
Points that are close together under this metric are likely to correspond to perpetrator descriptions that have several features in common; and the closer the points, the more features the descriptions are likely to have in common. (Remember, there are problems with this approach, which we’ll get to momentarily. For the time being, however, let’s suppose that things work more or less as just described.)
The challenge now is to identify clusters of points that are close together. If there were only two variables, this would be easy. All the points could be plotted on a single x,y-graph and visual inspection would indicate possible clusters. But human beings are totally unable to visualize eight-dimensional space, no matter what assistance the software system designers provide by way of data visualization tools. The way around this difficulty is to reduce the eight-dimensional array of points (descriptions) to a two-dimensional array (i.e., a matrix or table). The idea is to arrange the data points (that is, the vector representatives of the offender descriptions) in a two-dimensional grid in such a way that:
pairs of points that are extremely close together in the eight-dimensional space are put into the same grid entry;
pairs of points that are neighbors in the grid are close together in the eight-dimensional space; and
points that are farther apart in the grid are farther apart in the space.
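For readers who like to see such an arrangement built in code, the sketch below trains a very small self-organizing map, the kind of program introduced in the next paragraph. It is a generic textbook-style version written for this illustration, not the software used in the British study; the function names, grid size, learning rate, neighborhood radius, and iteration count are all assumed values.

```python
import math
import random


def squared_distance(u, v):
    """Squared Pythagorean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def best_cell(weights, x):
    """Return the grid cell whose weight vector lies nearest to x."""
    return min(weights, key=lambda cell: squared_distance(weights[cell], x))


def train_som(points, rows, cols, iterations=2000, seed=0):
    """Fit a rows-by-cols grid of weight vectors to the data points."""
    rng = random.Random(seed)
    # Start every cell at a randomly chosen data point.
    weights = {(r, c): list(rng.choice(points))
               for r in range(rows) for c in range(cols)}
    start_radius = max(rows, cols) / 2.0
    for step in range(iterations):
        remaining = 1.0 - step / iterations
        rate = 0.5 * remaining                        # learning rate shrinks over time
        radius = max(0.5, start_radius * remaining)   # so does the neighborhood
        x = rng.choice(points)
        win_r, win_c = best_cell(weights, x)
        # Pull the winning cell and its grid neighbors toward the sample.
        for (r, c), w in weights.items():
            grid_d2 = (r - win_r) ** 2 + (c - win_c) ** 2
            pull = rate * math.exp(-grid_d2 / (2.0 * radius ** 2))
            for i in range(len(w)):
                w[i] += pull * (x[i] - w[i])
    return weights
```

After training, `best_cell(weights, description)` gives the grid entry for a coded description. Identical descriptions always land in the same entry, and because neighboring cells are pulled toward the same samples during training, cells that are adjacent in the grid tend to hold similar descriptions.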
This can be done using a special kind of computer program known as a neural net, in particular, a Kohonen self-organizing map (or SOM). Neural nets (including SOMs) are described later in the chapter. For now, all we need to know is that these systems, which work iteratively, are extremely good at homing in (over the course of many iterations) on patterns, such as geometric clusters of the kind we are interested in, and thus