Voice disorders arise from pathological processes with anatomical, functional, or paralytic causes, and they affect the physiological, auditory, aerodynamic, acoustic, and perceptual aspects of voice production. The main categories of voice pathology include hyperkinetic dysphonia, hypokinetic dysphonia, and other conditions such as reflux laryngitis, vocal fold nodules, and vocal fold paralysis.
Methodology
The study used the VOice ICar fEDerico II (VOICED) dataset, which contains recordings of healthy and pathological voices. The dataset comprises 208 voice recordings collected during a clinical study, of which 150 are disordered voices and 58 are healthy. The voices were recorded with a Samsung Galaxy S4 mobile device, sampled at 8000 Hz with 32-bit resolution, and saved in .txt format.
Each 5-second recording of the vowel /a/ was divided into 250 ms segments with a 125 ms overlap, yielding 36 images per recording. Each segment was then transformed into a Mel spectrogram, which provides a time-frequency representation of the audio signal. The study classified these images with pre-trained CNNs (OpenL3, YAMNET, VGGish), using transfer learning and 5-fold cross-validation.
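The segmentation step can be illustrated with a minimal sketch, which is not the authors' code. It assumes the 8000 Hz sampling rate stated above and keeps only full windows; the function names `segment_bounds` and `hz_to_mel` are hypothetical, and the Mel conversion uses the common HTK-style formula, which the source does not confirm. Under these assumptions an exact 5-second signal gives 39 full windows, while 36 windows correspond to 4.625 s of signal, so the reported 36 images presumably reflect trimming or a shorter usable duration that the source does not specify.

```python
import math


def segment_bounds(n_samples, sample_rate=8000, win_ms=250, hop_ms=125):
    """Return (start, end) sample indices of every full analysis window.

    Assumption: windows that would run past the end of the signal are dropped.
    """
    win = sample_rate * win_ms // 1000  # 2000 samples at 8000 Hz
    hop = sample_rate * hop_ms // 1000  # 1000 samples, i.e. 125 ms overlap
    return [(start, start + win) for start in range(0, n_samples - win + 1, hop)]


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to the Mel scale (HTK-style formula, assumed)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# Window counts for an exact 5 s signal and for a 4.625 s signal.
full_length = segment_bounds(5 * 8000)
trimmed = segment_bounds(37000)
print(len(full_length), len(trimmed))
```

Each (start, end) pair would then index one slice of the waveform, and each slice would be converted to a Mel spectrogram image before being passed to the network.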
Results
The OpenL3 network achieved the highest accuracy at 99.44%, outperforming VGGish (95.34%) and YAMNET (94.36%). Classification performance was evaluated using accuracy, precision, recall (sensitivity), and the area under the curve (AUC). The confusion matrices and ROC curves showed high classification performance for all eight classes: Glottic Insufficiency, Hyperkinetic Dysphonia, Hypokinetic Dysphonia, Prolapse, Reflux Laryngitis, Vocal Fold Nodules, Vocal Fold Paralysis, and Healthy.
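The metrics named above follow directly from a confusion matrix. The sketch below shows the standard definitions of accuracy and per-class precision and recall; the helper name `per_class_metrics` is hypothetical, and the 3-class matrix contains made-up counts for illustration only, not results from the study.

```python
def per_class_metrics(cm):
    """Compute accuracy plus per-class precision and recall.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    precision, recall = [], []
    for k in range(n):
        true_pos = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision.append(true_pos / predicted_k if predicted_k else 0.0)
        recall.append(true_pos / actual_k if actual_k else 0.0)
    return accuracy, precision, recall


# Toy 3-class confusion matrix with made-up counts (not data from the study).
toy_cm = [
    [8, 1, 1],
    [0, 9, 1],
    [2, 0, 8],
]
accuracy, precision, recall = per_class_metrics(toy_cm)
print(accuracy, precision, recall)
```

For the eight-class problem the same function would take an 8 x 8 matrix, with one precision and one recall value per pathology class.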
Explainability Analysis
The study used Occlusion Sensitivity, an explainable AI (XAI) technique, to understand the behavior of the deep neural network. The XAI maps revealed that the model relied primarily on specific frequency bands to classify voice pathologies. For instance, Hyperkinetic Dysphonia was characterized by dominant frequency bands around 700 Hz and 100 Hz, while Hypokinetic Dysphonia showed bands over 200 Hz and 900 Hz. The XAI maps averaged over all correctly classified images highlighted the regions used most for classification and showed that these regions differ between classes.
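The idea behind Occlusion Sensitivity can be sketched as follows: mask one patch of the input at a time and record how much the class score drops. This is a minimal, dependency-free illustration and not the implementation used in the study. The names `occlusion_map` and `toy_score` are hypothetical, and `toy_score` is a stand-in for a real classifier's class score: it simply averages the energy in rows 2 and 3, playing the role of a model that relies on one frequency band.

```python
def occlusion_map(image, score_fn, patch=2, stride=2, fill=0.0):
    """Return a map of score drops; larger values mark more important regions.

    image is a list of rows (frequency bins) by columns (time frames).
    """
    rows, cols = len(image), len(image[0])
    baseline = score_fn(image)
    heat = [[0.0] * cols for _ in range(rows)]
    counts = [[0] * cols for _ in range(rows)]
    for r0 in range(0, rows - patch + 1, stride):
        for c0 in range(0, cols - patch + 1, stride):
            occluded = [row[:] for row in image]  # copy, then mask one patch
            for r in range(r0, r0 + patch):
                for c in range(c0, c0 + patch):
                    occluded[r][c] = fill
            drop = baseline - score_fn(occluded)
            for r in range(r0, r0 + patch):
                for c in range(c0, c0 + patch):
                    heat[r][c] += drop
                    counts[r][c] += 1
    # Average the drops where patches overlap; uncovered cells stay at 0.
    return [[h / n if n else 0.0 for h, n in zip(heat_row, count_row)]
            for heat_row, count_row in zip(heat, counts)]


def toy_score(img):
    """Hypothetical classifier score: mean energy in rows 2 and 3 only."""
    band = img[2:4]
    return sum(sum(row) for row in band) / (len(band) * len(band[0]))


spectrogram = [[1.0] * 6 for _ in range(6)]  # toy 6 x 6 "spectrogram"
heat = occlusion_map(spectrogram, toy_score)
for row in heat:
    print([round(v, 3) for v in row])
```

In this toy case only rows 2 and 3 receive non-zero values, mirroring how a map for a real model would highlight the frequency bands it depends on. Averaging such maps over many correctly classified images gives a class-level view of the kind described above.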
Conclusion
This research demonstrates the effectiveness of using Mel spectrograms and deep learning for voice pathology detection, achieving high accuracy with the OpenL3 network. The application of XAI techniques provided insights into the model’s decision-making process, showing that specific frequency bands are crucial for distinguishing between different pathologies. The proposed system has the potential to be used as a support tool for specialists, particularly in telemedicine applications. Future work could expand to include other vowels and pathologies, enhancing the system’s versatility and diagnostic capabilities.