How can we describe what goes on in speech? There are a couple of ways that people customarily approach this:
The above approaches are comprehensive, but their raw results are not easy to understand. The reason is that each approach results in a multi-dimensional time series -- something that we're not very good at grasping and interpreting (except, of course, in the original form: speech itself). In the case of the actions of the organs, there are ten or more independent (or partly-independent) quantities; in the case of the audio spectra, there are many frequencies.
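To make the "many frequencies" point concrete, here is a minimal sketch of how a sound recording turns into a multi-dimensional time series. It is an illustration only: the gliding tone is a synthetic stand-in for speech, and the function name `short_time_spectra`, the frame length, and the hop size are all assumptions chosen for the example, not anything prescribed above.

```python
import numpy as np

def short_time_spectra(signal, frame_len=256, hop=128):
    """Return magnitude spectra, one row per time step.

    Output shape is (n_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))

# Synthetic stand-in for speech (assumption): one second of a tone
# gliding from 200 Hz up to 800 Hz, sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
freq = 200 + 600 * t
phase = 2 * np.pi * np.cumsum(freq) / sample_rate
signal = np.sin(phase)

spectra = short_time_spectra(signal)
print(spectra.shape)  # (time steps, frequency bins)
```

With these settings every time step is described by 129 numbers, one per frequency bin, which is far more than can be plotted as a simple curve.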
Even a one-dimensional time series may be hard to describe in words. A one- or two-dimensional series can be presented as a 2D (or, if the data are suited to it, 3D) graph. For higher numbers of dimensions, however, we cannot graph the data directly.
It is sometimes possible to compress multi-dimensional information into a two- or three-dimensional representation, if the information is structured in a way that allows that. I believe that speech is suited to such a compression. Why? Here are some reasons that a compression may be possible:
The representation of speech I'm imagining might be something like handwriting, in that it would be a gestural diagram, a wiggle through space that would represent things happening in time. Or it might be an animation of an object like a moving hand: something that could change its shape, orientation, and position over time, in a way that one could imagine performing oneself.
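As a rough illustration of what "a wiggle through space" could look like computationally, the sketch below projects a ten-dimensional time series onto two dimensions, giving a 2D path that could be drawn like a pen stroke. This uses principal component analysis, a generic technique I am swapping in for the sake of the example; it is not a claim about the right compression for speech. The data are hypothetical: ten channels deliberately built from two hidden motions, which is exactly the kind of structure that makes such a compression possible. The name `project_to_2d` is likewise made up for this sketch.

```python
import numpy as np

def project_to_2d(series):
    """Project an (n_steps, n_dims) time series onto its two leading
    principal components.

    Returns the 2D path and the fraction of total variance it retains.
    """
    centered = series - series.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    path = centered @ vt[:2].T
    variances = singular_values ** 2
    explained = variances[:2].sum() / variances.sum()
    return path, explained

# Hypothetical data (assumption): ten channels that are all mixtures of
# two hidden "gestures", plus a little independent noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 200)
hidden = np.column_stack([np.sin(t), np.cos(3 * t)])
mixing = rng.normal(size=(2, 10))
series = hidden @ mixing + 0.01 * rng.normal(size=(200, 10))

path, explained = project_to_2d(series)
print(path.shape, explained)
```

Because the ten channels here really are driven by only two underlying motions, the 2D path keeps nearly all of the variance. Real speech data would not be this tidy, and a linear projection may well be too crude for it.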