Can AI show cognitive empathy through phonetics?

Understanding and accurately identifying human emotional states is essential for mental health providers. Can artificial intelligence (AI) and machine learning reproduce the human capacity for cognitive empathy? A new peer-reviewed study shows that AI can detect emotions from audio clips as short as 1.5 seconds with accuracy on par with human performance.

“The human voice serves as a powerful channel for expressing emotional states, as it provides universally understandable cues about the sender’s situation and can transmit them over long distances,” wrote first author Hans Demmerling of the Max Planck Institute for Human Development, in collaboration with Leonie Stresemann, Tina Braun, and Timo von Ortzen, psychology researchers based at the Center for Lifespan Psychology in Germany.

In AI deep learning, the quality and quantity of training data are critical to the efficiency and accuracy of the algorithm. The audio data used for this research comprised over 1,500 unique clips drawn from open-source English- and German-language emotion databases: the English recordings from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the German recordings from the Berlin Database of Emotional Speech (Emo-DB).

“Emotion recognition from audio recordings is a rapidly advancing field, with important implications for artificial intelligence and human-computer interaction,” the researchers wrote.

For the purposes of this study, the researchers narrowed emotional states to six categories: happiness, fear, neutral, anger, sadness, and disgust. The audio recordings were divided into 1.5-second segments, and a range of features was quantified from each segment of the audio signal, including pitch tracking, pitch intensity, spectral bandwidth, magnitude, phase, MFCCs, chroma, tonnetz, spectral contrast, spectral roll-off, fundamental frequency, spectral centroid, zero-crossing rate, root mean square, spectral flatness, and harmonic-percussive source separation (HPSS).
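
For readers who want to experiment with this kind of preprocessing, here is a minimal sketch of the 1.5-second segmentation step using the open-source Python library librosa. The file path is a placeholder and this is not the study’s own pipeline.

```python
# Sketch: cutting a recording into fixed 1.5-second segments.
# "speech.wav" is a placeholder path; the study's own preprocessing is not published here.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=22050, mono=True)  # waveform and sample rate
segment_len = int(1.5 * sr)                              # 1.5 seconds expressed in samples

# Drop any trailing remainder shorter than a full segment, then reshape.
n_segments = len(y) // segment_len
segments = np.reshape(y[: n_segments * segment_len], (n_segments, segment_len))
print(segments.shape)  # (number of 1.5-second clips, samples per clip)
```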

Psychoacoustics is the science of how humans perceive sound, including the voice. Audio frequency (pitch) and amplitude (volume) strongly influence the way people experience sound. In psychoacoustics, pitch corresponds to the frequency of a sound and is measured in hertz (Hz) and kilohertz (kHz): the higher the frequency, the higher the pitch. Amplitude refers to the loudness of a sound and is measured in decibels (dB): the greater the amplitude, the louder the sound.
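
As an illustration of these two psychoacoustic quantities, the sketch below estimates fundamental frequency in hertz and signal level in decibels with librosa; the audio file is a placeholder and the parameter choices are assumptions, not values from the study.

```python
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=22050)  # placeholder recording

# Pitch: fundamental frequency in Hz, estimated with probabilistic YIN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7")
)
print("median pitch (Hz):", np.nanmedian(f0))

# Amplitude: frame-wise level converted to decibels (0 dB = full scale).
rms = librosa.feature.rms(y=y)
print("mean level (dB):", float(librosa.amplitude_to_db(rms, ref=1.0).mean()))
```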

Spectral bandwidth (spectral spread) is the range between the upper and lower frequencies of a signal and is derived from the spectral centroid. The spectral centroid is the “center of mass” of the audio signal’s spectrum. Spectral flatness measures how uniformly energy is distributed across frequencies, distinguishing noise-like from tone-like signals. Spectral roll-off identifies the frequency below which a set proportion of the signal’s total energy is concentrated, i.e., the range in which the signal is most strongly represented.
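
A rough sketch of how these spectral descriptors can be computed with librosa (one common open-source option, not necessarily the study’s tooling; the file path is a placeholder):

```python
import librosa

y, sr = librosa.load("speech.wav", sr=22050)  # placeholder file path

centroid  = librosa.feature.spectral_centroid(y=y, sr=sr)   # "center of mass" of the spectrum (Hz)
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)  # spread around the centroid (Hz)
flatness  = librosa.feature.spectral_flatness(y=y)          # near 0 = tonal, near 1 = noise-like
rolloff   = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)  # freq below which 85% of energy lies

print(centroid.shape, bandwidth.shape, flatness.shape, rolloff.shape)
```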

MFCCs, or mel-frequency cepstral coefficients, are widely used features in voice processing; they summarize the shape of a sound’s spectrum on the perceptually motivated mel frequency scale.

Chroma features, or pitch class profiles, describe how a signal’s energy is distributed across the twelve semitone pitch classes of an octave and are commonly used to analyze musical key.

In music theory, a Tonnetz (German for “tone network”) is a visual representation of the relationships between chords in neo-Riemannian theory, named after the German musicologist Hugo Riemann (1849–1919), one of the founders of modern musicology.
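
The three features just described, MFCCs, chroma, and the Tonnetz-based tonal centroid, can be sketched with librosa as follows; n_mfcc=13 is a conventional choice, not a value reported in the study, and the file path is a placeholder.

```python
import librosa

y, sr = librosa.load("speech.wav", sr=22050)  # placeholder file path

mfcc    = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 mel-frequency cepstral coefficients per frame
chroma  = librosa.feature.chroma_stft(y=y, sr=sr)       # energy in each of the 12 pitch classes
harm    = librosa.effects.harmonic(y)                   # tonnetz expects the harmonic component
tonnetz = librosa.feature.tonnetz(y=harm, sr=sr)        # 6-dimensional tonal centroid features

print(mfcc.shape, chroma.shape, tonnetz.shape)
```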

A common acoustic feature for audio analysis is the zero crossing rate (ZCR). For an audio signal frame, the zero crossing rate measures the number of times the signal amplitude changes sign and crosses the X-axis.
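
The definition translates almost directly into code; the toy example below counts sign changes in a synthetic sine wave rather than real speech, purely to illustrate the formula.

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of consecutive sample pairs whose sign differs."""
    signs = np.sign(frame)
    crossings = np.count_nonzero(np.diff(signs) != 0)
    return crossings / (len(frame) - 1)

t = np.linspace(0, 1, 1000, endpoint=False)
frame = np.sin(2 * np.pi * 5 * t)   # 5 Hz sine sampled over 1 second
print(zero_crossing_rate(frame))    # roughly 0.01: about 10 crossings over 999 intervals
```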

In audio production, root mean square (RMS) measures the average loudness or power of a sound wave over time.
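
A minimal numerical illustration of RMS and its decibel equivalent, using a synthetic tone rather than a real recording:

```python
import numpy as np

t = np.linspace(0, 1, 22050, endpoint=False)
frame = 0.5 * np.sin(2 * np.pi * 440 * t)   # 440 Hz tone at half amplitude

rms = np.sqrt(np.mean(frame ** 2))          # ~0.354 for a sine of amplitude 0.5
level_db = 20 * np.log10(rms)               # ~ -9 dB relative to full scale
print(rms, level_db)
```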

HPSS, harmonic-percussive source separation, is a method of breaking down an audio signal into harmonic and percussive components.
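
In code, one common realization of HPSS is librosa’s implementation; the sketch below assumes a placeholder audio file and is not the study’s own processing chain.

```python
import librosa

y, sr = librosa.load("speech.wav", sr=22050)      # placeholder file path

y_harmonic, y_percussive = librosa.effects.hpss(y)
# y_harmonic keeps sustained, pitched content; y_percussive keeps transient, drum-like content.
print(y_harmonic.shape, y_percussive.shape)
```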

The scientists applied three different AI deep learning models to classify emotions from short audio clips, using a combination of Python, TensorFlow, and Bayesian optimization, and then benchmarked the results against human performance. The models evaluated were a deep neural network (DNN), a convolutional neural network (CNN), and a hybrid model combining a DNN to process acoustic features with a CNN to analyze spectrograms. The goal was to see which model performed best.
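
To make the hybrid idea concrete, here is a rough TensorFlow/Keras sketch of a two-branch network: a dense branch for acoustic feature vectors and a convolutional branch for spectrograms, feeding a six-class softmax. The layer sizes and input shapes are placeholders, not the architecture published in the study.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Branch 1: deep neural network over a vector of acoustic features.
feat_in = layers.Input(shape=(193,), name="acoustic_features")  # feature count is a placeholder
x1 = layers.Dense(256, activation="relu")(feat_in)
x1 = layers.Dense(128, activation="relu")(x1)

# Branch 2: convolutional network over a (mel-)spectrogram treated as an image.
spec_in = layers.Input(shape=(128, 66, 1), name="spectrogram")  # shape is a placeholder
x2 = layers.Conv2D(32, 3, activation="relu")(spec_in)
x2 = layers.MaxPooling2D()(x2)
x2 = layers.Conv2D(64, 3, activation="relu")(x2)
x2 = layers.GlobalAveragePooling2D()(x2)

# Merge both branches and classify into the six emotion categories.
merged = layers.concatenate([x1, x2])
out = layers.Dense(6, activation="softmax", name="emotion")(merged)

model = Model(inputs=[feat_in, spec_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```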

Artificial intelligence reads emotions.

The researchers discovered that, across the board, the AI models’ emotion classification accuracy exceeded chance and matched human performance. Among the three AI models, the deep neural network and the hybrid model outperformed the convolutional neural network.

This combination of artificial intelligence and data science applied to psychology demonstrates that machines are capable of performing voice-based cognitive empathy tasks at a level comparable to human performance.

“This interdisciplinary research bridges psychology and computer science, highlighting the potential for automated emotion recognition and development in a wide range of applications,” the researchers concluded.

Copyright © 2024 Cami Rosso. All rights reserved.
