
Speech Tool

The problem

How can we describe what goes on in speech? There are a couple of ways that people customarily approach this:

  • To describe the cause -- what goes on physically with our speech-producing organs. There are many things which are going on at once. I can open and close my jaw. My tongue and lips can each assume a variety of positions and degrees of tension. My lungs pump air in and out. My larynx is doing something. The variations of structure (from one individual to another), position, motion, and tension of these and other vocal organs (and the ways in which these variations are correlated among the organs) could be described.
  • To analyze the result -- what sounds are produced by the vocal organs. With current technology, this approach is simpler, because it is easier to record and analyze speech sounds than it is to record and analyze the position, motion, and tension of the vocal organs. A description of the sounds would consist of a time series of audio spectra.

    The above approaches are comprehensive, but their raw results are not easy to understand. The reason is that each approach results in a multi-dimensional time series -- something that we're not very good at grasping and interpreting (except, of course, in the original form: speech itself). In the case of the actions of the organs, there are ten or more independent (or partly independent) quantities; in the case of the audio spectra, there are many frequencies.
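To make "a time series of audio spectra" concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not a description of any existing tool: the function name `short_time_spectra`, the frame length, the hop size, and the synthetic 1 kHz tone standing in for a recording are all assumptions chosen for the example.

```python
import cmath
import math

def short_time_spectra(samples, frame_len=64, hop=32):
    """Split a signal into overlapping frames and return one magnitude
    spectrum per frame (a list of lists: time x frequency)."""
    spectra = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Hann window, to reduce spectral leakage at the frame edges.
        windowed = [
            x * 0.5 * (1.0 - math.cos(2.0 * math.pi * n / (frame_len - 1)))
            for n, x in enumerate(frame)
        ]
        # Naive discrete Fourier transform, non-negative frequencies only.
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n, x in enumerate(windowed)
            )
            spectrum.append(abs(acc))
        spectra.append(spectrum)
    return spectra

# Hypothetical input: a pure 1 kHz tone sampled at 8 kHz (not real speech).
sample_rate = 8000
signal = [math.sin(2.0 * math.pi * 1000 * t / sample_rate) for t in range(512)]

spectra = short_time_spectra(signal)
print(len(spectra), "frames, each with", len(spectra[0]), "frequency values")
```

Even with these small settings, every frame is a point in a 33-dimensional space (one value per frequency bin), which is the difficulty described above: the series cannot be drawn directly as a single 2D or 3D curve.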

    Even a one-dimensional time series may be hard to describe in words. A one- or two-dimensional series can be presented as a 2D (or, if the data are suited to it, 3D) graph. For higher numbers of dimensions, however, we cannot graph the data directly.

The kind of solution I'm looking for

    It is sometimes possible to compress multi-dimensional information into a two- or three-dimensional representation, if the information has a structure that allows it. I believe that speech is suited to such compression. Why? Here are some reasons that compression may be possible:

  • Many of the speech organs are coupled, more or less directly, so their states do not vary completely independently.
  • There are many combinations of activities of the organs which are not used in speech.
  • We can artificially limit the number of dimensions we include, based on some assumptions about what aspects of speech are most important. For example, whispered speech is quite intelligible in many languages -- and whispering does not involve the larynx to the same extent as voiced speech does; so, a representation of speech which didn't include the larynx could still show a lot.
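The first reason above (coupled organs) can be illustrated with a small sketch. Note the assumptions: principal component analysis, computed here by power iteration, is only one possible compression and is not named anywhere in this text; the six "organ" channels, their `weights`, and the two hidden factors driving them are invented for the example and are not measurements of real speech.

```python
import math
import random

# ASSUMPTION: PCA stands in for whatever compression might suit speech.
# The channels below are synthetic, not recordings of vocal organs.

def mean_center(rows):
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]
    return [[r[d] - means[d] for d in range(dims)] for r in rows]

def covariance(rows):
    n, dims = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(dims)]
            for i in range(dims)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dominant_component(cov, rng, iterations=500):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    vec = [rng.uniform(-1.0, 1.0) for _ in cov]
    for _ in range(iterations):
        nxt = [dot(row, vec) for row in cov]
        norm = math.sqrt(dot(nxt, nxt))
        vec = [x / norm for x in nxt]
    eigval = dot(vec, [dot(row, vec) for row in cov])
    return vec, eigval

def deflate(cov, vec, eigval):
    """Remove a found component so the next power iteration finds the next one."""
    size = len(cov)
    return [[cov[i][j] - eigval * vec[i] * vec[j] for j in range(size)]
            for i in range(size)]

rng = random.Random(0)

# Six hypothetical channels, each a fixed mix of two hidden factors plus noise.
weights = [(1.0, 0.0), (0.8, 0.3), (0.5, -0.5),
           (0.0, 1.0), (-0.6, 0.7), (0.9, 0.2)]
frames = []
for t in range(200):
    a = math.sin(2.0 * math.pi * t / 50)   # hidden factor 1
    b = math.cos(2.0 * math.pi * t / 23)   # hidden factor 2
    frames.append([wa * a + wb * b + rng.gauss(0.0, 0.05) for wa, wb in weights])

centered = mean_center(frames)
cov = covariance(centered)
total_variance = sum(cov[i][i] for i in range(len(cov)))

first, var1 = dominant_component(cov, rng)
second, var2 = dominant_component(deflate(cov, first, var1), rng)

# Each 6-dimensional frame becomes one 2D point; the sequence is a path in a plane.
trajectory = [(dot(row, first), dot(row, second)) for row in centered]
explained = (var1 + var2) / total_variance
print(f"2 of 6 dimensions keep {explained:.1%} of the variance")
```

Because the six channels here are driven by only two underlying factors, two dimensions retain nearly all of the variation, and `trajectory` is a curve that could be drawn on paper. Whether real speech data would compress this well is exactly the open question; a linear method like this is only the simplest thing to try.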

    The representation of speech I'm imagining might be something like handwriting, in that it would be a gestural diagram, a wiggle through space that would represent things happening in time. Or, it might be an animation of an object like a moving hand: something which could change its shape, orientation, and position over time (in a way that one could imagine performing oneself).