Deep learning for acoustic scene classification

Kek, Xing Yong

Please use this identifier to cite or link to this item: http://theses.ncl.ac.uk/jspui/handle/10443/5904

Title:	Deep learning for acoustic scene classification
Authors:	Kek, Xing Yong
Issue Date:	2023
Publisher:	Newcastle University
Abstract:	Acoustic scene classification is the task of automatically identifying the surrounding environment from the sounds it produces. The current standard framework pairs a log Mel-spectrogram with a convolutional neural network, and there is a growing move toward low-complexity modeling, in which the convolutional neural network is compressed. The log Mel-spectrogram is constrained by the Heisenberg uncertainty principle: the representation is based on a fixed timescale and becomes unstable to time-warping deformation when the timescale is greater than 25 ms. Hence, instead of using the log Mel-spectrogram, this thesis investigates the viability of wavelet scattering. Wavelet scattering is a cascade of wavelet transforms that provides a scaling factor based on wavelet theory, and it is stable to deformation when the timescale is greater than 25 ms. However, the averaging operation in wavelet scattering limits the maximum scaling factor of the wavelet, so wavelet scattering is also constrained by the Heisenberg uncertainty principle. In addition, the first- and second-order coefficients are observed to differ considerably in magnitude, which makes the acoustic classification model difficult to train. This large disparity in the data representation can cause internal covariate shift, leading to slower convergence and even poor classification accuracy. Hence, this thesis proposes a multi-timescale approach that uses a genetic algorithm for feature selection to address the limitation of a ‘fixed’ timescale, and a two-stage convolutional neural network architecture to tackle the magnitude disparity between first- and second-order coefficients in wavelet scattering. To meet the challenge of designing low-complexity models, this thesis also adapts a model compression technique to compress the convolution blocks of the two-stage convolutional neural network architecture.
For the feature representation in the low-complexity setting, combining wavelet scattering with multiple timescales is not a viable option, as multiple timescales increase both the model size and the time complexity. Hence, a simple ‘mixing’ of first- and second-order coefficients is proposed to slightly increase the variety of timescales. In addition, the genetic algorithm for feature selection is adapted to reduce the size of the second-order frequency dimension, which is large (e.g., > 500) compared with the first-order dimension (e.g., < 100) and has fine-scale resolution, which can result in unnecessary over-representation of acoustic profiles.
Description:	PhD Thesis
URI:	http://hdl.handle.net/10443/5904
Appears in Collections:	School of Engineering
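
The abstract describes using a genetic algorithm to select features, for example to shrink the second-order frequency dimension. The record does not give the encoding, fitness function, or genetic operators used in the thesis, so the following is only a minimal sketch under assumed choices: a boolean mask over frequency bins, elitist selection of the better half, uniform crossover, and bit-flip mutation. The names `ga_select`, `toy_fitness`, and `INFORMATIVE` are hypothetical, and the toy fitness stands in for whatever validation score would be used in practice.

```python
import numpy as np


def ga_select(fitness, n_features, pop_size=20, generations=30,
              mutation_rate=0.05, seed=0):
    """Evolve a boolean mask over n_features that maximizes fitness(mask).

    Returns the best mask and the best score seen in each generation.
    All operator choices here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, n_features)) < 0.5  # random initial masks
    history = []
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        history.append(float(scores.max()))
        order = np.argsort(scores)[::-1]
        elite = pop[order[: pop_size // 2]]  # keep the better half unchanged
        n_children = pop_size - len(elite)
        # Uniform crossover between randomly paired elite parents.
        pa = elite[rng.integers(len(elite), size=n_children)]
        pb = elite[rng.integers(len(elite), size=n_children)]
        children = np.where(rng.random(pa.shape) < 0.5, pa, pb)
        # Bit-flip mutation on the children only.
        children ^= rng.random(children.shape) < mutation_rate
        pop = np.vstack([elite, children])
    scores = np.array([fitness(m) for m in pop])
    history.append(float(scores.max()))
    return pop[int(np.argmax(scores))], history


# Toy problem (assumption): 8 of 40 bins are useful, the rest add cost.
INFORMATIVE = np.zeros(40, dtype=bool)
INFORMATIVE[:8] = True


def toy_fitness(mask):
    """Reward selected informative bins, penalize every other selected bin."""
    return float(mask[INFORMATIVE].sum()) - 0.5 * float(mask[~INFORMATIVE].sum())


best_mask, history = ga_select(toy_fitness, n_features=40)
print("selected bins:", np.flatnonzero(best_mask))
```

Because the better half of each generation is carried over unchanged, the best score in `history` never decreases. In a real pipeline the fitness call would be far more expensive than in this toy, since each evaluation would involve scoring a classifier on the masked features.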

Files in This Item:

File	Description	Size	Format
Kek X Y 2023.pdf		4.61 MB	Adobe PDF
dspacelicence.pdf		43.82 kB	Adobe PDF
