Cepstral analysis
The technique of cepstral analysis is fairly well described in Wakita's papers
(Wakita 1976 or 1996). One way of approaching the idea of the cepstrum is
to consider first the spectral representation that we have in figure 4.7a,
a standard power spectral density function displayed on a logarithmic scale.
Figure 4.7. (a) Spectrum of a voiced speech segment; (b) the cepstrum of
(a). Reproduced from Noll (1967).
Ignore for a moment the fact that the horizontal axis of figure 4.7a
is a frequency scale; after all, that is just a label that we give to
the graph. Suppose that the graph we are looking at here were a short segment
of a signal in the time domain, a short time-varying signal. We want to find,
first, the distance between the smaller peaks and, second, the distance between
the larger peaks. In the time domain that would amount to determining the
frequency of the smaller vibrations and the frequency of the larger vibrations.
If this were a time domain signal we could do a Fourier analysis of it, and
that would tell us the frequency of the higher frequency events (the little
peaks) as well as the lower frequency events (the big peaks). In terms of
what we know this graph to represent, the higher frequency peaks represent
the harmonics and the lower frequency peaks represent the underlying resonances.
So if we were to take a Fourier analysis of this graph, that would give us
a technique for separating out the narrow harmonic peaks from the broad resonance
peaks. So, having applied the Fast Fourier Transform once to get the power
spectrum, and having put that spectrum on a logarithmic scale, if we apply a
transform a second time we will be able to measure the spacing between the
little peaks and the spacing between the big peaks. However, since on this
second application we are mapping not from the time domain to the frequency
domain but from the frequency domain back into the time domain, the operation
that we use is not an FFT as such, but an inverse Fast Fourier Transform. This
has the same separating effect but takes us from the frequency domain back
into the time domain.
What is this time domain that arises as the result of applying an inverse
Fast Fourier Transform to a log power spectral density? It is called
quefrency, a term coined by Bogert et al. (1963) that is intended
to denote the inverse of frequency in some respect, the inverse of frequency
being a time domain unit. On the horizontal axis of figure 4.7b it is given
in seconds, though as we shall see the units of quefrency will turn out to
be samples of the original signal. The quefrency of the spike that
we see representing the fundamental frequency will turn out to be the period
(in samples) of the fundamental frequency of the original signal. You can
see in figure 4.7b the result of applying an inverse Fast Fourier Transform
to a log power spectrum. There is one sharply demarcated spike at around
T seconds on the scale, which is the spike representing the fundamental
frequency. If T = 8.4 ms, for instance, the frequency will be 119 Hz,
since f = 1/T. At the left-hand end of the horizontal axis,
that is, at quefrencies of events with a very much smaller period, we have peaks
which represent the resonances of the vocal tract. Note that there is a very
clear separation between the fundamental frequency and the resonances. That
is because the fundamental period, the period of the glottal wave, may
be of the order of, say, 10 milliseconds, whereas the period (if that is the
right word to use) of the resonators, the time taken for each of their oscillations,
is very much shorter. At a sampling rate of 8000 samples a second, a resonance
at 4000 Hz, for instance, would have a period of just 2 samples, a quarter
of a millisecond.
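The arithmetic relating quefrency, period and frequency can be written out as a pair of small C helpers. This is a sketch for illustration only; the function names quefrency_to_hz and period_in_samples are hypothetical and do not come from the original text.

```c
#include <assert.h>

/* Frequency in Hz corresponding to a cepstral peak at quefrency
 * q_samples, using f = 1/T with T = q_samples / sample_rate_hz. */
double quefrency_to_hz(int q_samples, double sample_rate_hz)
{
    assert(q_samples > 0);
    return sample_rate_hz / (double)q_samples;
}

/* Period, in samples, of a component at freq_hz. */
double period_in_samples(double freq_hz, double sample_rate_hz)
{
    assert(freq_hz > 0.0);
    return sample_rate_hz / freq_hz;
}
```

At 8000 samples a second, period_in_samples(4000.0, 8000.0) gives the 2 samples mentioned above, a 10 millisecond glottal period corresponds to 80 samples, and a cepstral spike at 67 samples (about 8.4 ms) corresponds to a fundamental of roughly 119 Hz.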