SNL uses a laugh track

When does the laugh track start?


One of my weekend projects took me into the depths of signal processing. As with all of my coding projects that call for some heavy math, I'm usually happy to hack my way to a solution despite the lack of theoretical foundations, but in this case I have none and would like some advice. The problem: I'm trying to find out when the live audience laughs during a TV show.

I spent a lot of time researching machine learning approaches to recognizing laughter, but realized they mostly deal with recognizing an individual person's laughter. Two hundred people laughing at once have very different acoustic properties, and my intuition is that they should be distinguishable with much cruder techniques than a neural network. However, I could be completely wrong! I'd appreciate thoughts on this.

Here's what I've tried so far: I broke a five-minute excerpt from a recent episode of Saturday Night Live into two-second clips and labeled each one "laughing" or "not laughing". Using Librosa's MFCC feature extractor, I then ran K-means clustering on the data and got good results: the two clusters mapped almost exactly onto my labels. But when I tried to process the longer file, the predictions fell apart.
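
Roughly, the pipeline looks like this (a minimal sketch; the file name, clip length, and MFCC settings are placeholders, not my exact code):

```python
import librosa
import numpy as np
from sklearn.cluster import KMeans

# Load the excerpt and cut it into fixed-length two-second clips.
y, sr = librosa.load("snl_excerpt.wav")          # placeholder file name
clip_len = 2 * sr

features = []
for start in range(0, len(y) - clip_len, clip_len):
    clip = y[start:start + clip_len]
    mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=13)
    features.append(mfcc.mean(axis=1))           # average MFCCs over the clip

# Two clusters: ideally "laughing" vs. "not laughing".
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.array(features))
print(labels)
```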

What I'm going to try next: I'll be more careful about creating the laugh clips. Instead of blindly splitting and sorting, I'll extract them manually so that no dialogue pollutes the signal. Then I'll chop them into quarter-second clips, compute the MFCCs, and use them to train an SVM.
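
Something along these lines (a rough sketch; the folder layout, frame length, and SVM settings are placeholders I haven't settled on):

```python
import glob
import librosa
import numpy as np
from sklearn.svm import SVC

def frame_features(path, frame_sec=0.25, n_mfcc=13):
    # Mean MFCC vector for each quarter-second frame of a clip.
    y, sr = librosa.load(path)
    hop = int(frame_sec * sr)
    return [
        librosa.feature.mfcc(y=y[start:start + hop], sr=sr, n_mfcc=n_mfcc).mean(axis=1)
        for start in range(0, len(y) - hop, hop)
    ]

X, y_labels = [], []
for label, pattern in enumerate(["clips/not_laughing/*.wav", "clips/laughing/*.wav"]):
    for path in glob.glob(pattern):
        feats = frame_features(path)
        X.extend(feats)
        y_labels.extend([label] * len(feats))

clf = SVC(kernel="rbf").fit(np.array(X), np.array(y_labels))
```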

My questions at this point:

  1. Does any of this make sense?

  2. Can statistics help here? I've scrolled around in Audacity's spectrogram view and can see pretty clearly where laughter occurs. In a logarithmic power spectrogram, speech has a very characteristic, "furrowed" appearance, whereas laughter covers a wide range of frequencies fairly evenly, almost like a normal distribution. It's even possible to visually distinguish applause from laughter by the narrower band of frequencies represented in the applause. That makes me think of standard deviations. I see there is such a thing as the Kolmogorov-Smirnov test; could that be helpful here? (You can see the laughter in the picture above as an orange wall starting about 45% of the way in.) A rough sketch of what such a distribution comparison might look like follows this list.

  3. The linear spectrogram seems to show that laughter is more energetic at lower frequencies and tapers off at higher frequencies. Does that mean it would be classified as pink noise? If so, could that give me a handle on the problem?
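
To make question 2 concrete, here is roughly what I imagine such a comparison looking like (the reference frame index and the 0.2 cutoff are pure placeholders):

```python
import librosa
import numpy as np
from scipy.stats import ks_2samp

y, sr = librosa.load("snl_excerpt.wav")       # placeholder file name
S = np.abs(librosa.stft(y, n_fft=1024))       # magnitude spectrogram, (freq_bins, frames)

ref = S[:, 1000]                              # a frame I know (by listening) is laughter; index is made up

for t in range(0, S.shape[1], 100):
    stat, p = ks_2samp(S[:, t], ref)          # compare the distributions of bin magnitudes
    print(t, round(stat, 3), "laughter-like" if stat < 0.2 else "different")
```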

I apologize if I've misused any jargon; I've been on Wikipedia quite a bit and wouldn't be surprised if I've gotten a few terms mixed up.




Reply:


Based on your observation that the spectra of the two signals are distinguishable, you can use the spectrum as a feature to classify laughter versus speech.

There are many ways that you can look at the problem.

Approach 1

In the simplest case, you can look at each MFCC vector on its own and feed it to a classifier. Since you have a lot of coefficients in the frequency domain, you could look at a cascade-classifier structure with a boosting algorithm on top of it, such as AdaBoost, to discriminate between the speech and laughter classes.
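
A rough sketch of this approach, assuming per-frame MFCC vectors built from hand-labeled clips (the folder layout is an assumption, and AdaBoost's default decision stumps stand in for a full cascade):

```python
import glob
import librosa
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

X, y = [], []
for label, pattern in enumerate(["clips/speech/*.wav", "clips/laughter/*.wav"]):
    for path in glob.glob(pattern):
        audio, sr = librosa.load(path)
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
        X.extend(mfcc.T)                     # one MFCC vector per frame
        y.extend([label] * mfcc.shape[1])

clf = AdaBoostClassifier(n_estimators=200)   # boosted decision stumps by default
clf.fit(np.array(X), np.array(y))
frame_predictions = clf.predict(np.array(X)) # per-frame speech/laughter decisions
```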

Approach 2

Recognize that speech is essentially a time-varying signal, so one of the most effective approaches is to look at how the signal varies over time. This can be done by dividing the signal into batches of samples and looking at the spectrum of each batch. You will notice that laughter tends to have a more repetitive spectral pattern over a stretch of time, whereas speech inherently carries more information, so its spectrum varies more. You can feed this into an HMM-type model to determine whether, for a given sequence of frequency spectra, you keep staying in the same state or keep changing states. The temporal behavior differs here even if the speech spectrum occasionally resembles that of laughter.
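
A sketch of this idea, assuming the hmmlearn package (any HMM library would do); which of the two states corresponds to laughter still has to be checked against a few labeled frames:

```python
import librosa
import numpy as np
from hmmlearn.hmm import GaussianHMM

y, sr = librosa.load("snl_excerpt.wav")                  # placeholder file name
X_seq = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T    # (n_frames, 13), in time order

# Two hidden states, ideally "laughter" and "speech"; the state path gives a segmentation.
hmm = GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
hmm.fit(X_seq)
states = hmm.predict(X_seq)
```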

Approach 3

Apply an LPC/CELP encoding to the signal and look at the residual. CELP coding builds a very precise model of speech production.

From the reference here: CELP Coding Theory

The redundancies in the speech signal are almost completely removed after the short-term and long-term prediction of the speech signal, and the residual has very low correlation. An excitation that synthesizes the speech is then searched for, with the codebook index and gain taken from the fixed codebook. The selection criterion for the optimal codebook index is the MMSE between the locally synthesized speech and the original speech signal.

Simply put, after everything the analyzer can predict as speech has been removed, what remains (the residual) is what gets transmitted so the exact waveform can be reconstructed.

How does this help with your problem? When you apply CELP coding to individual speech, most of the signal is predicted away and only a small residual remains, because CELP models the vocal tract well. For laughter, a large part of the signal survives in the residual, since CELP cannot predict a crowd of voices the way it predicts a single speaker. You can analyze this residual (also in the frequency domain) to decide whether a segment is laughter or speech.
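
A simplified stand-in for this idea, using plain LPC via librosa rather than a full CELP codec: measure how much energy survives in the per-frame prediction residual (the LPC order, frame length, and threshold below are placeholders):

```python
import librosa
import numpy as np
from scipy.signal import lfilter

def residual_ratio(frame, order=12):
    a = librosa.lpc(frame, order=order)    # prediction-error filter coefficients
    residual = lfilter(a, [1.0], frame)    # inverse-filter to get the residual
    return np.sum(residual ** 2) / (np.sum(frame ** 2) + 1e-12)

y, sr = librosa.load("snl_excerpt.wav")    # placeholder file name
frame_len = int(0.25 * sr)
for start in range(0, len(y) - frame_len, frame_len):
    r = residual_ratio(y[start:start + frame_len])
    # Speech is predicted well (small ratio); crowd laughter leaves a larger residual.
    print(round(start / sr, 2), "laughter?" if r > 0.5 else "speech?", round(r, 3))
```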


Most speech recognizers use not only the MFCC coefficients but also their first and second derivatives (delta and delta-delta features). I suspect those would be very useful in this case as well and would help you distinguish laughter from other sounds.
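
For example, with Librosa the delta features can be stacked onto the MFCCs like this (a small sketch; the file name is a placeholder):

```python
import librosa
import numpy as np

y, sr = librosa.load("clip.wav")               # placeholder file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)            # first derivative
delta2 = librosa.feature.delta(mfcc, order=2)  # second derivative
features = np.vstack([mfcc, delta, delta2])    # (39, n_frames) feature matrix
```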
