Audio Embeddings
The system uses mel-spectrograms as the primary acoustic feature representation: raw audio is converted to a time-frequency representation and then filtered onto the mel scale.
The preprocessing pipeline has several steps: audio files are loaded at a 22.05 kHz sampling rate and converted to mel-spectrograms with 128 mel bands and an 8 kHz frequency ceiling. The spectrograms are then converted from power to decibels and padded or truncated to a fixed temporal dimension of 228 frames, ensuring consistent input dimensions across variable-length audio.
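A minimal sketch of this preprocessing step, assuming librosa as the audio library; the hop length and FFT size fall back to library defaults (they are not stated above), and the function name is illustrative.

```python
import numpy as np
import librosa

SAMPLE_RATE = 22050   # 22.05 kHz, as described
N_MELS = 128          # mel bands
FMAX = 8000           # 8 kHz frequency ceiling
MAX_FRAMES = 228      # fixed temporal dimension

def preprocess(path: str) -> np.ndarray:
    """Load an audio file and return a (128, 228) log-mel spectrogram."""
    y, sr = librosa.load(path, sr=SAMPLE_RATE)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_MELS, fmax=FMAX)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # power -> decibels

    # Pad or truncate along the time axis to a fixed number of frames.
    if mel_db.shape[1] < MAX_FRAMES:
        pad = MAX_FRAMES - mel_db.shape[1]
        mel_db = np.pad(mel_db, ((0, 0), (0, pad)), mode="constant")
    else:
        mel_db = mel_db[:, :MAX_FRAMES]
    return mel_db
```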
The architecture consists of a feature extraction network: a first Conv2D layer (32 filters, 3×3 kernels) captures basic spectro-temporal patterns, followed by MaxPooling2D (2×2) and a second convolutional layer (64 filters, 3×3 kernels) that extracts more complex acoustic features. The feature maps are flattened and projected into a 256-dimensional embedding space through a fully connected layer. This embedding is mapped to 8 emotion classes (neutral, calm, happy, sad, angry, fearful, disgust, surprised) via a softmax output layer. The model is trained with the Adam optimizer and categorical cross-entropy loss.
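A sketch of this architecture in Keras, assuming TensorFlow as the backend; the ReLU activations and the exact input shape (128 mel bands × 228 frames × 1 channel) are assumptions, since the text only specifies layer sizes, the optimizer, and the loss.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 8  # neutral, calm, happy, sad, angry, fearful, disgust, surprised

model = models.Sequential([
    layers.Input(shape=(128, 228, 1)),               # log-mel spectrogram + channel axis
    layers.Conv2D(32, (3, 3), activation="relu"),    # basic spectro-temporal patterns
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),    # more complex acoustic features
    layers.Flatten(),
    layers.Dense(256, activation="relu"),            # 256-dimensional embedding
    layers.Dense(NUM_CLASSES, activation="softmax"), # emotion class probabilities
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```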
The model achieves 87% accuracy on emotion classification on the RAVDESS dataset, demonstrating strong performance in distinguishing subtle emotional variations in speech. The 256-dimensional embeddings capture rich semantic information about audio content, enabling a range of downstream applications.
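A sketch of how those embeddings could be pulled out for downstream use, reusing the hypothetical model and preprocess function from the sketches above: the output of the 256-unit dense layer is taken as the embedding. The file path is a placeholder.

```python
import numpy as np
import tensorflow as tf

# Expose the penultimate layer (the Dense(256) embedding) as a model output.
embedding_model = tf.keras.Model(
    inputs=model.inputs,
    outputs=model.layers[-2].output,
)

features = preprocess("speech.wav")                # (128, 228) log-mel spectrogram
features = features[np.newaxis, ..., np.newaxis]   # add batch and channel axes
embedding = embedding_model.predict(features)      # shape: (1, 256)
```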
Oh, and I did this end-to-end on a plane to SF!

