Author: Andrew Kingston
Two popular approaches to isolated-word speech recognition are template matching and neural networks. This report presents an implementation of each approach and compares the two. The recognition task is limited to digits, and experiments were performed both with a single speaker and on a more difficult multiple-speaker task. To construct the input for the recognition systems, the report examines various digital signal processing routines that transform the raw speech signal into word templates that can be compared with one another. Word boundary detection is investigated, and an algorithm was developed to automatically separate each digit from the collected speech samples. A "dynamic time warping" procedure lets the template matching system time-align two templates; in contrast, the "temporal flow" neural network model contains a network of delay links to represent temporal relationships. Both approaches achieve a 100% recognition rate with a single speaker, but on the multiple-speaker task the neural network model achieves consistently higher rates, with a best recognition rate of 92.5%.
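As an illustration of how dynamic time warping lets a template matcher compare words spoken at different speeds, the following is a minimal sketch and not the report's own implementation. It assumes each word template is a sequence of per-frame feature vectors, uses Euclidean frame distance, and uses the common symmetric step pattern (advance in the test, the reference, or both). The names `frame_distance`, `dtw_distance` and `recognise` are hypothetical. A word is classified as the stored template with the smallest accumulated warping cost.

```python
import math
from typing import Dict, List, Sequence

# Assumed representation: one feature vector (e.g. spectral coefficients) per frame.
Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature frames (an assumed local metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(test: List[Frame], ref: List[Frame]) -> float:
    """Accumulated cost of the best time alignment of `test` against `ref`."""
    n, m = len(test), len(ref)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning test[:i] with ref[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(test[i - 1], ref[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance in the test word only
                cost[i][j - 1],      # advance in the reference template only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


def recognise(test: List[Frame], templates: Dict[str, List[Frame]]) -> str:
    """Return the label of the template with the lowest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(test, templates[label]))


if __name__ == "__main__":
    # Toy one-dimensional "templates"; real templates would come from the DSP front end.
    templates = {"one": [[0.0], [1.0], [2.0]], "two": [[5.0], [5.0], [0.0]]}
    slow_one = [[0.0], [0.0], [1.0], [1.0], [2.0]]  # "one" spoken more slowly
    print(recognise(slow_one, templates))
```

Because the warping path may repeat frames, the slowly spoken "one" aligns with the "one" template at zero cost, even though the two sequences have different lengths. Practical systems often add slope constraints or a global search window to the recursion; the sketch omits them for brevity.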