Deep learning has been actively explored for auditory attention detection based on electroencephalogram (EEG) signals. However, past research in this area has focused mainly on the auditory modality, and relatively few studies have investigated the influence of vision on auditory attention. Moreover, the mature public datasets in common use, such as KUL and DTU, contain only EEG and audio data, whereas in daily life auditory attention is usually accompanied by visual information. To study auditory attention more comprehensively under combined audio-visual conditions, this work integrates EEG, audio, and video data for auditory attention detection.
To simulate a real-world perceptual environment, this paper constructs an audio-video EEG dataset to enable an in-depth exploration of auditory attention. The dataset contains two stimulus scenarios: audio-video and audio-only. In the audio-video scenario, subjects attend to the voice of the speaker shown in the video and ignore the voice of the other speaker; that is, they receive visual and auditory input simultaneously. In the audio-only scenario, subjects attend to one of the two speakers' voices, i.e., they receive only auditory input. Using the EEG data recorded in these two scenarios, this paper verifies the effectiveness of the dataset and compares the scenarios with existing detection methods.
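For concreteness, the sketch below illustrates the per-window evaluation setup implied above: a continuous EEG trial is split into fixed-length decision windows, each inheriting the trial's attended-speaker label, before being passed to a detection model. The function name, sampling rate, and window length are illustrative assumptions, not specifications of the dataset.

```python
import numpy as np

def segment_into_decision_windows(eeg, label, fs=128, win_s=2.0):
    """Split one continuous trial (channels x samples) into fixed-length
    decision windows, each labeled with the trial's attended speaker.
    fs and win_s are illustrative defaults, not dataset specifications."""
    win_len = int(round(win_s * fs))
    n_win = eeg.shape[1] // win_len
    windows = [eeg[:, i * win_len:(i + 1) * win_len] for i in range(n_win)]
    return np.stack(windows), np.full(n_win, label)

# Hypothetical 60-s, 64-channel trial at 128 Hz; attended speaker coded as 0/1.
trial = np.random.randn(64, 60 * 128)
X, y = segment_into_decision_windows(trial, label=1)
print(X.shape, y.shape)  # (30, 64, 256) (30,)
```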
The results show the following. 1) Across decision windows of various lengths, the average accuracy under audio-only stimuli was significantly higher than under audio-video stimuli; with a 2-s decision window, detection accuracy reached only 70.5% for audio-video stimuli and 75.2% for audio-only stimuli. 2) In frequency-band experiments on the two public datasets and on the audio-video EEG dataset constructed in this paper, the gamma band yielded the best detection performance on the DTU dataset and in the audio-video scenario, whereas the alpha band performed best on the KUL dataset. In the audio-only scenario, although the average classification accuracy of the alpha band with a 2-s decision window was lower than that of the theta band, it was still higher than that of the remaining bands.
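The frequency-band comparisons above presuppose decomposing the EEG into canonical bands. The following is a minimal sketch of that step using zero-phase Butterworth band-pass filtering; the band edges, sampling rate, and filter order are assumptions for illustration and may differ from the processing pipeline used in this paper.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

# Canonical band edges in Hz; the exact cutoffs used in the paper may differ.
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 50)}

def bandpass(eeg, low, high, fs=128, order=4):
    """Zero-phase Butterworth band-pass filter applied along the time axis
    of a (channels x samples) EEG array."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, eeg, axis=-1)

eeg = np.random.randn(64, 10 * 128)  # hypothetical 10-s trial at 128 Hz
band_eeg = {name: bandpass(eeg, lo, hi) for name, (lo, hi) in BANDS.items()}
```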
This paper proposes an audio-video EEG dataset that more closely simulates real-world scenes. The experiments indicate that in the audio-video stimulation scenario, subjects must process two streams of sensory information simultaneously, which divides their attention and degrades detection performance. In addition, EEG signals in the alpha and gamma frequency bands carry important information during auditory spatial attention. Compared with existing public auditory attention detection datasets, the proposed dataset introduces video information and thus provides richer modal information for research on and applications of brain-computer interfaces. This information supports in-depth study of auditory attention patterns and the underlying neural mechanisms under simultaneous audio-visual stimulation, and is expected to promote further research and applications in auditory attention. The dataset is publicly available at http://iiphci.ahu.edu.cn/toAuditoryAttentionEnglish.