The Numbers Behind NUMB3RS

Authors: Keith Devlin
can indeed take an eight-dimensional array of the kind described above and place the points appropriately in a two-dimensional grid. (Part of the skill required to use an SOM effectively in a case such as this is deciding in advance, or by some initial trial and error, what are the optimal dimensions of the final grid. The SOM needs that information in order to start work.)
    Once the data has been put into the grid, law enforcement officers can examine grid squares that contain several entries, which are highly likely to come from a single gang responsible for a series of crimes. They can also visually identify clusters of neighboring squares on the grid, which are likewise likely to represent gang activity. In either case, the officers can examine the corresponding original crime statement entries, looking for indications that those crimes are indeed the work of a single gang.
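    As a rough illustration of the procedure just described, here is a minimal self-organizing map written from scratch. It is a sketch under stated assumptions, not the software used in the study: the function names (`train_som`, `best_cell`, `shared_cells`), the decay schedules, the two-by-two grid, and the six sample records are all hypothetical.

```python
import math
import random
from collections import defaultdict

def best_cell(weights, vec):
    """Return the grid cell whose weight vector is nearest to vec."""
    return min(
        weights,
        key=lambda cell: sum((w - v) ** 2 for w, v in zip(weights[cell], vec)),
    )

def train_som(data, rows, cols, epochs=40, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays linearly
        sigma = max(sigma0 * (1 - frac), 0.5)    # neighborhood radius shrinks
        for vec in rng.sample(data, len(data)):  # visit records in random order
            bmu = best_cell(weights, vec)        # best-matching grid square
            for cell, w in weights.items():
                grid_d2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                pull = lr * math.exp(-grid_d2 / (2 * sigma ** 2))
                for i in range(dim):
                    w[i] += pull * (vec[i] - w[i])
    return weights

def shared_cells(weights, data):
    """Map each grid square holding more than one record to its record indices."""
    grid = defaultdict(list)
    for idx, vec in enumerate(data):
        grid[best_cell(weights, vec)].append(idx)
    return {cell: ids for cell, ids in grid.items() if len(ids) > 1}

# Hypothetical eight-variable descriptions, already coded as numbers.
records = [
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],   # same description as record 0
    [1, 0, 1, 1, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 1, 1],
]
som = train_som(records, rows=2, cols=2)
candidates = shared_cells(som, records)  # squares an officer would look at first
```

    With six records and only four grid squares, at least one square must hold more than one record, and identical descriptions always share a square. Those shared squares are the ones whose original statements would be pulled for closer reading.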
    Now let’s see what goes wrong with the method just described, and how to correct it.
    The first problem is that the original encoding of entries as numbers is not systematic. This can lead to one variable dominating others when the entries are clustered using geometric distance (the Pythagorean metric) in eight-dimensional space. For example, a dimension that measures height (which could be anything between 60 inches and 76 inches) would dominate the entry for gender (0 or 1). So the first step is to scale (in mathematical terminology, normalize) the eight numerical variables, so that each one varies between 0 and 1.
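    A small calculation makes the dominance effect concrete. The sketch below uses only two of the variables, and the three records and the helper names (`euclidean`, `min_max_scale`, `scaled`) are hypothetical; the 60 to 76 inch range is the one quoted above.

```python
import math

def euclidean(a, b):
    """Pythagorean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(value, low, high):
    """Map a value from the range [low, high] onto [0, 1]."""
    return (value - low) / (high - low)

# Hypothetical records: (height in inches, gender coded 0 or 1).
a = (62, 0)
b = (74, 0)   # same gender as a, 12 inches taller
c = (62, 1)   # same height as a, different gender

# Unscaled: the height difference swamps the gender difference.
raw_height_gap = euclidean(a, b)
raw_gender_gap = euclidean(a, c)

def scaled(record):
    """Rescale height to [0, 1]; gender is already 0 or 1."""
    height, gender = record
    return (min_max_scale(height, 60, 76), gender)

# Scaled: both variables now contribute on a comparable footing.
scaled_height_gap = euclidean(scaled(a), scaled(b))
scaled_gender_gap = euclidean(scaled(a), scaled(c))
```

    Before scaling, the height gap is 12 against a gender gap of 1; after scaling, the height gap drops to 0.75 while the gender gap stays at 1.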
    One way to do that would be to simply scale down each variable by a multiplicative scaling factor appropriate for that particular feature (height, age, etc.). But that will introduce further problems when the separation distances are calculated; for example, if gender and height are among the variables, then, all other variables being roughly the same, a very tall woman would come out close to a very short man (because female gives a 0 and male gives a 1, whereas tall comes out close to 1 and short close to 0). Thus, a more sophisticated normalization procedure has to be used.
    The approach finally adopted in the British Midlands study was to make every numerical entry binary (just 0 or 1). This meant splitting the continuous variables (age and height) into overlapping ranges (a few years and a few inches, respectively), with a 1 denoting an entry in a given range and a 0 meaning outside that range, and using pairs of binary variables to encode each factor of hair color, hair length, build, accent, and race. The exact coding chosen was fairly specific to the data being studied, so there is little to be gained from providing all the details here. (The age and height ranges were taken to be overlapping to account for entries toward the edges of the chosen ranges.) The normalization process resulted in a set of 46 binary variables. Thus, the geometric clustering was done over a geometric space of 46 dimensions.
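    The idea of overlapping binary ranges can be sketched as follows. The age bands here are invented for illustration, since the study's actual coding is not given, and `encode_ranges` and `differing_bits` are hypothetical helper names.

```python
# Hypothetical overlapping age bands, in years (not the study's actual bands).
AGE_BANDS = [(15, 22), (20, 27), (25, 32), (30, 37)]

def encode_ranges(value, bands):
    """Return one 0/1 flag per band: 1 if value lies inside the band (inclusive)."""
    return [1 if low <= value <= high else 0 for low, high in bands]

def differing_bits(u, v):
    """Count positions where two binary vectors differ.

    For 0/1 vectors this equals the squared Pythagorean distance.
    """
    return sum(1 for x, y in zip(u, v) if x != y)

age_21 = encode_ranges(21, AGE_BANDS)   # falls in the first two bands
age_23 = encode_ranges(23, AGE_BANDS)   # falls in the second band only
age_34 = encode_ranges(34, AGE_BANDS)   # falls in the last band only

near = differing_bits(age_21, age_23)
far = differing_bits(age_21, age_34)
```

    Because the bands overlap, an age of 21 still shares a band with an age of 23 even though 21 sits near the upper edge of the first band, so the two encodings differ in one position, whereas 21 and 34 differ in three.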
    Another problem was how to handle missing data. For example, what do you do if a victim’s statement says nothing about the perpetrator’s accent? If you enter a 0, that would amount to assigning an accent. But what will the clustering program do if you leave that entry blank? (In the British Midlands study, the program would treat a missing entry as 0.) Missing data points are in fact one of the major headaches for data miners, and there really is no universally good solution. If there are only a few such cases, you could either ignore them or else see what solutions you get with different values entered.
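    The suggestion of trying different values for a missing entry can be sketched like this. The records are hypothetical, `None` is an assumed marker for a detail the statement did not mention, and `fillings` is an illustrative helper rather than part of any particular data-mining package.

```python
import math
from itertools import product

def euclidean(a, b):
    """Pythagorean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fillings(record, candidates=(0, 1)):
    """Yield every completion of record, trying each candidate for each None."""
    slots = [candidates if value is None else (value,) for value in record]
    for combo in product(*slots):
        yield list(combo)

# Hypothetical binary records; the third entry of `incomplete` is missing.
incomplete = [1, 0, None, 1]
reference = [1, 0, 1, 1]

# Distance to the reference record under each possible filling.
distances = {tuple(f): euclidean(f, reference) for f in fillings(incomplete)}
```

    Here the incomplete record either matches the reference exactly or sits one unit away, depending on which value is entered. If the conclusions drawn from the clustering change with the filling, the missing entry matters and deserves a closer look.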
    As mentioned earlier, a key decision that has to be made before the SOM can be run is the size of the resulting two-dimensional grid. It needs to be small enough that the SOM is forced to put some data points into the same grid squares, and that some non-empty grid squares end up with non-empty neighbors. The investigators in the British Midlands study eventually decided to opt for a five-by-seven grid. With 105
