"Why Pitch Detection Is Hard, and How YIN Solved It"

Every guitar tuner, every pitch-correction plugin, every audio-to-MIDI converter needs to answer one question: what frequency is this signal? The naive approach, autocorrelation, has been known since the 1960s. It also fails in a specific, predictable way that plagued audio tools for decades. In 2002, Alain de Cheveigné and Hideki Kawahara published an algorithm called YIN that cut error rates by roughly a factor of three relative to the best competing methods, through a conceptual shift simple enough to fit on a whiteboard.

The Autocorrelation Trap

Autocorrelation measures how similar a signal is to a delayed version of itself. A periodic signal at frequency f will be most similar to itself at delay 1/f (one period), so the pitch is wherever the autocorrelation function peaks. This works in textbook examples. It breaks in practice.

The problem is that the autocorrelation function peaks at every integer multiple of the fundamental period, not just the first one. A note at 200 Hz has autocorrelation peaks at 5 ms, 10 ms, 15 ms, and so on. Finding the right peak requires knowing where to look, which requires knowing the approximate frequency already. Worse, the peak at lag zero is always the highest, so threshold-based selection inevitably makes trade-offs between sensitivity and false positives that depend heavily on the signal content.
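
To see the ambiguity concretely, here is a minimal sketch that evaluates the autocorrelation of a 200 Hz sine sampled at 8 kHz, where one period is exactly 40 samples. The function names autocorrelation and makeSine are made up for this example and do not come from any library.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Autocorrelation r(tau) over a fixed window of `window` samples.
double autocorrelation(const std::vector<double>& x, int window, int tau) {
    assert(window + tau <= static_cast<int>(x.size()));
    double sum = 0.0;
    for (int j = 0; j < window; ++j) {
        sum += x[j] * x[j + tau];
    }
    return sum;
}

// Illustrative helper: n samples of a unit-amplitude sine.
std::vector<double> makeSine(double freqHz, double sampleRate, int n) {
    const double twoPi = 6.283185307179586;
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i) {
        x[i] = std::sin(twoPi * freqHz * i / sampleRate);
    }
    return x;
}
```

With a 400-sample window (ten full periods), lags 40 and 80 score the same as lag 0 up to rounding error, while the half-period lag 20 is strongly negative. Nothing in the function itself says which of the equal peaks is the fundamental.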

The result: octave errors. Pitch detection confidently returns 100 Hz when the signal is 200 Hz, or 440 Hz when the fundamental is 880 Hz. Real-time tuners built on this approach needed extensive heuristics to paper over the gaps. Some added zero-crossing rate estimation as a fallback. Others used short-term energy windowing to disambiguate. None of it was clean.

The YIN Insight: Measure Difference, Not Similarity

De Cheveigné and Kawahara inverted the problem. Instead of measuring how similar the signal is at each lag, YIN measures how different it is.

Define the difference function d(t, tau) as the squared difference between the signal at time t and the signal at time t + tau, summed over a window. A periodic signal has d equal to zero at exactly tau = T (one period) and at multiples of T. This looks equivalent to autocorrelation, and mathematically it nearly is, but the key distinction is what happens at tau = 0: the difference function is zero there (a signal is identical to itself), while the autocorrelation peaks there.

This shift allows a much cleaner normalization. YIN computes the Cumulative Mean Normalized Difference Function (CMNDF), written d'(t, tau), by dividing each value d(t, tau) by the running mean of all d values from lag 1 up to the current tau. At tau = 0 the CMNDF is defined to be exactly 1. It falls below 1 only where the difference at a lag is smaller than the average difference over the shorter lags, which happens at periods. The first minimum below a threshold of 0.1 is the pitch period.
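
Written out, the two functions look as follows. This is a sketch in one common indexing convention, with x the sampled signal, W the window length in samples, and t the frame start; the subscripted d_t(tau) is the same quantity the prose writes as d(t, tau).

```latex
d_t(\tau) = \sum_{j=t+1}^{t+W} \left( x_j - x_{j+\tau} \right)^2

d'_t(\tau) =
\begin{cases}
1, & \tau = 0 \\[4pt]
\dfrac{d_t(\tau)}{\frac{1}{\tau} \sum_{j=1}^{\tau} d_t(j)}, & \tau \ge 1
\end{cases}
```

For a perfectly periodic signal with period T, d_t(T) = 0 and therefore d'_t(T) = 0, while a lag whose difference merely equals the running average gives d'_t = 1.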

The threshold of 0.1 does not need signal-specific calibration. It works across speech, singing voice, guitar, and other monophonic instruments with minimal adjustment. The original paper reports this finding explicitly, and it has held up across two decades of implementations.

The Six Steps

The full YIN algorithm runs in six steps:

  1. Compute autocorrelation over a window of at least two pitch periods.
  2. Compute the difference function from the autocorrelation using the identity d(tau) = r_t(0) + r_(t+tau)(0) - 2 * r_t(tau), where r_t is the autocorrelation of the window starting at t. The second energy term belongs to the shifted window; it equals r_t(0) only when the signal energy is constant across the shift. This avoids computing d directly and makes the implementation more efficient.
  3. Normalize to CMNDF by dividing each d(tau) by the cumulative mean of d from lag 1 up to tau. Set d'(0) = 1 by definition.
  4. Find the first tau where the CMNDF drops below the threshold (default 0.1), then follow it down to the local minimum. If no lag drops below the threshold, use the global minimum.
  5. Apply parabolic interpolation around the minimum to get sub-sample period resolution without increasing the window size.
  6. Pick the best local estimate: select among candidates using contextual smoothing, or return the result directly for real-time applications.
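
Step 2 is easy to get subtly wrong, so here is a small sketch that checks the identity numerically. It assumes energies and correlations taken over a fixed window of W samples starting at index 0, and the helper names (energy, correlation, differenceDirect, differenceFromCorrelation) are invented for this example. The exact form uses the energy of the shifted window as its second term; using the unshifted energy twice is only an approximation when the signal's level changes.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Energy of the W-sample window starting at `start`.
double energy(const std::vector<double>& x, int start, int W) {
    assert(start + W <= static_cast<int>(x.size()));
    double sum = 0.0;
    for (int j = start; j < start + W; ++j) sum += x[j] * x[j];
    return sum;
}

// Correlation between the window at 0 and the window shifted by tau.
double correlation(const std::vector<double>& x, int W, int tau) {
    assert(W + tau <= static_cast<int>(x.size()));
    double sum = 0.0;
    for (int j = 0; j < W; ++j) sum += x[j] * x[j + tau];
    return sum;
}

// Difference function computed directly from its definition.
double differenceDirect(const std::vector<double>& x, int W, int tau) {
    assert(W + tau <= static_cast<int>(x.size()));
    double sum = 0.0;
    for (int j = 0; j < W; ++j) {
        const double diff = x[j] - x[j + tau];
        sum += diff * diff;
    }
    return sum;
}

// Same quantity from energies and correlation:
// d(tau) = E(0) + E(tau) - 2 * r(tau), with E(tau) the shifted-window energy.
double differenceFromCorrelation(const std::vector<double>& x, int W, int tau) {
    return energy(x, 0, W) + energy(x, tau, W) - 2.0 * correlation(x, W, tau);
}
```

On a decaying test tone the two functions agree to rounding error at every lag, which is what lets an FFT-based correlation stand in for the direct sum.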

A practical implementation in C++ covers the core in about 50 lines. The listing here leaves out the one non-trivial part of a real-time version, the circular buffer management for overlapping frames:

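// Raw difference function d(tau) over a fixed window of `window` samples.
// The caller must ensure window + tau <= bufferSize.
float difference(const float* buffer, int window, int tau) {
    float sum = 0.0f;
    for (int j = 0; j < window; ++j) {
        float diff = buffer[j] - buffer[j + tau];
        sum += diff * diff;
    }
    return sum;
}

// Returns the pitch in Hz, or -1 if no lag falls below the threshold.
float findPitch(const float* buffer, int bufferSize, float sampleRate, float threshold = 0.1f) {
    const int window = bufferSize / 2; // fixed integration window; lags stay below it
    float runningSum = 0.0f;
    for (int tau = 1; tau < window; ++tau) {
        float d = difference(buffer, window, tau);
        runningSum += d;
        // CMNDF: d(tau) over the mean of d(1..tau). Silence gives 0/0, so treat it as 1.
        float dPrime = (runningSum > 0.0f) ? d * tau / runningSum : 1.0f;
        if (dPrime >= threshold) continue;

        // Walk downhill to the local minimum that follows the threshold crossing.
        while (tau + 1 < window) {
            float dNext = difference(buffer, window, tau + 1);
            float nextPrime = dNext * (tau + 1) / (runningSum + dNext);
            if (nextPrime >= dPrime) break;
            ++tau;
            runningSum += dNext;
            d = dNext;
            dPrime = nextPrime;
        }

        // Parabolic interpolation on the raw difference function around tau.
        float prev = difference(buffer, window, tau - 1);
        float next = difference(buffer, window, tau + 1);
        float denom = prev - 2.0f * d + next;
        float offset = (denom > 0.0f) ? 0.5f * (prev - next) / denom : 0.0f;
        return sampleRate / (tau + offset);
    }
    return -1.0f; // unvoiced (a stricter reading of step 4 would fall back to the global minimum)
}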

This is simplified for clarity: the naive inner loop makes it O(N^2). Production implementations replace it with FFT-based autocorrelation to get O(N log N), which matters at 48 kHz with frame sizes of 2048 or larger.

pYIN and the Probabilistic Extension

YIN commits to one pitch estimate per frame. That works well for stable sustained notes but causes problems during transitions: the threshold crossing on frame N might be ambiguous, and once a choice is made, there is no way to revise it based on frames N+1 or N+2.

Matthias Mauch and Simon Dixon published pYIN in 2014, extending the algorithm with a Hidden Markov Model. Instead of returning the first tau below a fixed threshold, pYIN treats the threshold itself as uncertain: it places a beta-distributed prior over threshold values and applies the thresholding step for each one, so every lag that some threshold would select becomes a candidate weighted by the prior mass behind it. These weighted candidates, several per frame, feed a Viterbi decoder that finds the most likely pitch trajectory across the entire phrase.
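
The candidate-generation idea can be sketched in a few lines. This is a simplified illustration, not the published algorithm: the names firstDipBelow and candidateWeights are hypothetical, the prior weights are passed in by the caller instead of being derived from a beta distribution, thresholds that select nothing are simply skipped, and the HMM and Viterbi stages are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Returns the lag of the first local minimum of `cmndf` that lies below
// `threshold`, or -1 if the curve never dips below it.
int firstDipBelow(const std::vector<double>& cmndf, double threshold) {
    const int n = static_cast<int>(cmndf.size());
    for (int tau = 1; tau < n; ++tau) {
        if (cmndf[tau] < threshold) {
            // Keep descending while the next value is lower.
            while (tau + 1 < n && cmndf[tau + 1] < cmndf[tau]) ++tau;
            return tau;
        }
    }
    return -1;
}

// Accumulates prior mass per candidate lag. thresholds[i] carries the weight
// prior[i]; the weights are an input here (assumption of this sketch).
std::map<int, double> candidateWeights(const std::vector<double>& cmndf,
                                       const std::vector<double>& thresholds,
                                       const std::vector<double>& prior) {
    assert(thresholds.size() == prior.size());
    std::map<int, double> weights;
    for (std::size_t i = 0; i < thresholds.size(); ++i) {
        const int tau = firstDipBelow(cmndf, thresholds[i]);
        if (tau >= 0) weights[tau] += prior[i];  // mass for this candidate
    }
    return weights;
}
```

On a curve with a shallow dip at one lag and a deeper dip at a later lag, strict thresholds vote for the deep dip and lenient ones for the shallow dip, so the frame ends up with two weighted candidates instead of one hard decision.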

The improvement is measurable in noisy conditions and on guitar and violin, where inharmonicity causes the CMNDF to have shallower minima than on voices. pYIN is available in librosa as librosa.pyin() and in the Essentia library as PitchYinProbabilistic. For research and offline analysis it is the better choice. For real-time applications with a latency budget in the low tens of milliseconds, standard YIN with a 1024-sample window (about 23 ms of audio at 44.1 kHz) is still the practical baseline.

What Actually Uses YIN

Auto-Tune does not use YIN. The original patent was filed in 1997 by Andy Hildebrand at Antares and predates the 2002 paper by five years. It uses a custom autocorrelation-based detection method derived from seismic signal analysis, where Hildebrand had worked previously. The pitch-shifting portion uses a variant of PSOLA (Pitch-Synchronous Overlap-Add).

Melodyne's pitch detection is proprietary and undisclosed beyond the marketing descriptions of its Melodic, Percussive, and Universal algorithm modes. The polyphonic DNA algorithm does not operate on a single pitch estimate per frame and is architecturally different from anything in the YIN family.

Open-source guitar tuner applications, pitch-to-MIDI converters, and audio analysis libraries are where YIN actually appears. Librosa, Essentia, the JUCE pitch_detector module by Adam Wilson, and standalone implementations like Ashok Fernandez's C port all trace their core logic directly to the 2002 paper. The threshold invariance is what made adoption easy: a developer building a guitar tuner app does not need to read acoustics literature to get acceptable results. They copy a reference implementation, set threshold = 0.1, and it works.

The Practical Takeaway

If you are implementing monophonic pitch detection for real-time use, YIN with a 1024 to 2048 sample window at your target sample rate covers most cases. The window must contain at least two periods of the lowest frequency you want to detect: 40 Hz has a 25 ms period, so it needs at least 50 ms of audio, which is 2205 samples at 44100 Hz. That puts a floor on your latency regardless of algorithm choice.
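
The window arithmetic is simple enough to keep as a helper. The two function names below are hypothetical; the only rule they encode is the two-period minimum stated above.

```cpp
#include <cassert>
#include <cmath>

// Smallest window, in samples, that holds two periods of `lowestHz`.
int minWindowSamples(double sampleRate, double lowestHz) {
    assert(sampleRate > 0.0 && lowestHz > 0.0);
    return static_cast<int>(std::ceil(2.0 * sampleRate / lowestHz));
}

// Time needed to fill that window, in milliseconds.
double windowLatencyMs(int windowSamples, double sampleRate) {
    return 1000.0 * windowSamples / sampleRate;
}
```

Here minWindowSamples(44100.0, 40.0) gives 2205 and windowLatencyMs(2205, 44100.0) gives 50, matching the figures above.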

For offline analysis where accuracy matters more than latency, pYIN through librosa is one function call and handles most edge cases without parameter tuning.

The reason YIN still appears in new code 24 years after publication is not a lack of alternatives: polyphonic material and deep-learning-based approaches have their own, newer tools. It is that for the specific problem of real-time monophonic pitch estimation with minimal computational overhead and zero training data, the combination of the CMNDF normalization and the threshold invariance has not been meaningfully improved upon.