

To increase overall accuracy, you merge the CNN and ensemble classifier results using late fusion. The CNN and ensemble classifier produce roughly equivalent overall accuracy, but perform better at distinguishing different acoustic scenes. To illustrate a simple approach that produces reasonable results, this example trains a CNN using mel spectrograms and an ensemble classifier using wavelet scattering. The top-ranked systems in the challenge used late fusion and data augmentation to help their systems generalize. The most popular feature for top-ranked systems in the DCASE 2017 contest was the mel spectrogram ( melSpectrogram). More recently, the best performing systems have used deep learning, usually CNNs, and a fusion of multiple models. Hidden Markov models (HMMs) were trained to describe the temporal evolution of the GMMs. Other popular features used for ASC include zero crossing rate, spectral centroid ( spectralCentroid), spectral rolloff ( spectralRolloffPoint), spectral flux ( spectralFlux ), and linear prediction coefficients ( lpc). Early attempts at ASC used mel-frequency cepstral coefficients ( mfcc) and Gaussian mixture models (GMMs) to describe their statistical distribution.

ASC is a generic classification problem that is foundational for context awareness in devices, robots, and many other applications.

Acoustic scene classification (ASC) is the task of classifying environments from the sounds they produce.
