Archive: Commonly Used TTS Datasets

This post will be updated from time to time as more datasets are tried out.

Chinese (Mandarin)

AISHELL-3

URL: https://www.aishelltech.com/aishell_3

AISHELL-3 is a large-scale, high-fidelity multi-speaker Mandarin Chinese speech corpus released by AISHELL (希尔贝壳) that can be used to train multi-speaker text-to-speech (TTS) systems. It contains roughly 85 hours of emotion-neutral recordings, 88,035 utterances in total, read by 218 native Mandarin speakers from different accent regions of China and recorded in a quiet indoor environment with a high-fidelity microphone (44.1 kHz, 16-bit). Auxiliary speaker attributes such as gender, age group, and native accent are explicitly marked and provided in the corpus, and each recording comes with both character-level and pinyin-level transcripts. Pinyin and prosody labels were produced by professional annotators and passed strict quality inspection; the word and tone transcription accuracy is above 98%.
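
For reference, here is a minimal sketch of how the character- and pinyin-level transcripts might be read, assuming the archive ships a per-split transcript file (e.g. train/content.txt, a hypothetical path here) whose lines interleave each Chinese character with its tone-numbered pinyin; check the downloaded corpus for the exact layout.

```python
# A minimal sketch (not the official loader) for AISHELL-3-style transcripts.
# Assumption: a transcript file whose lines look like
#   SSB00050001.wav 广 guang3 州 zhou1 女 nv3 ...
from pathlib import Path

def load_aishell3_transcripts(content_file: str) -> dict:
    """Return {utterance_id: (hanzi_string, pinyin_token_list)}."""
    table = {}
    for line in Path(content_file).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        utt_id, tokens = parts[0], parts[1:]
        hanzi = "".join(tokens[0::2])  # even positions: Chinese characters
        pinyin = tokens[1::2]          # odd positions: tone-numbered pinyin
        table[utt_id] = (hanzi, pinyin)
    return table

if __name__ == "__main__":
    # "AISHELL-3/train/content.txt" is a hypothetical path; adjust to your copy.
    transcripts = load_aishell3_transcripts("AISHELL-3/train/content.txt")
    print(f"{len(transcripts)} utterances loaded")
```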


English

LJ Speech

URL: https://keithito.com/LJ-Speech-Dataset/

The LJ Speech Dataset is a public-domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds, and the total length is approximately 24 hours.
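
A small sketch of pairing each clip with its transcription, assuming the standard LJSpeech-1.1 layout: a pipe-separated metadata.csv (id | raw text | normalized text) next to a wavs/ directory of 22.05 kHz mono clips.

```python
# A sketch for iterating LJ Speech, assuming the LJSpeech-1.1 layout:
# metadata.csv with pipe-separated fields (id | raw text | normalized text)
# and a wavs/ directory of 22.05 kHz, 16-bit mono clips.
import csv
from pathlib import Path

def read_ljspeech(root: str):
    """Yield (wav_path, text) pairs, preferring the normalized transcription."""
    root = Path(root)
    with open(root / "metadata.csv", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            utt_id, raw, normalized = row[0], row[1], row[-1]
            yield root / "wavs" / f"{utt_id}.wav", normalized or raw

if __name__ == "__main__":
    pairs = list(read_ljspeech("LJSpeech-1.1"))
    print(f"{len(pairs)} clips")  # expected: 13,100
    print(pairs[0])
```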

LibriTTS

URL: https://research.google/resources/datasets/libri-tts/

LibriTTS is a multi-speaker English corpus of approximately 585 hours of read speech at a 24 kHz sampling rate, designed for TTS research. It is derived from the original materials (MP3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus.

The main differences from the LibriSpeech corpus are listed below:

  1. The audio files are at 24kHz sampling rate.
  2. The speech is split at sentence breaks.
  3. Both original and normalized texts are included.
  4. Contextual information (e.g., neighbouring sentences) can be extracted.
  5. Utterances with significant background noise are excluded.
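
To illustrate points 2-4 above, here is a rough sketch that walks a LibriTTS subset and pairs each 24 kHz wav with its original and normalized transcripts; the sibling <utt>.original.txt / <utt>.normalized.txt file layout is an assumption about the released corpus and should be verified against the actual download.

```python
# A sketch that pairs each LibriTTS wav with its transcripts. Assumption:
# <subset>/<speaker>/<chapter>/<utt>.wav sits next to <utt>.original.txt
# and <utt>.normalized.txt files.
from pathlib import Path

def iter_libritts(subset_dir: str):
    """Yield (wav_path, original_text, normalized_text) triples."""
    for wav in sorted(Path(subset_dir).rglob("*.wav")):
        original = wav.with_name(wav.stem + ".original.txt")
        normalized = wav.with_name(wav.stem + ".normalized.txt")
        if original.exists() and normalized.exists():
            yield (
                wav,                                             # 24 kHz audio
                original.read_text(encoding="utf-8").strip(),    # raw book text
                normalized.read_text(encoding="utf-8").strip(),  # normalized text
            )

if __name__ == "__main__":
    for wav, orig, norm in iter_libritts("LibriTTS/dev-clean"):
        print(wav.name, "|", norm)
        break
```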

Multilingual

Emotional Speech Database (ESD)

URL: https://hltsingapore.github.io/ESD/

ESD is an Emotional Speech Database for voice conversion research. The ESD database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies.
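
A rough indexing sketch follows, assuming the archive is organized as numbered speaker folders containing one subfolder per emotion category (Angry, Happy, Neutral, Sad, Surprise); the folder names are assumptions, so adjust them to whatever the actual release uses.

```python
# A rough indexing sketch for ESD. Assumption: numbered speaker folders, each
# with one subfolder per emotion category; adjust EMOTIONS and paths to match
# the actual release.
from collections import defaultdict
from pathlib import Path

EMOTIONS = {"Angry", "Happy", "Neutral", "Sad", "Surprise"}

def index_esd(root: str) -> dict:
    """Map (speaker, emotion) -> list of wav paths, regardless of nesting depth."""
    root_path = Path(root)
    index = defaultdict(list)
    for wav in root_path.rglob("*.wav"):
        rel = wav.relative_to(root_path)
        speaker = rel.parts[0]
        emotion = next((p for p in rel.parts if p in EMOTIONS), None)
        if emotion is not None:
            index[(speaker, emotion)].append(wav)
    return index

if __name__ == "__main__":
    index = index_esd("ESD")  # hypothetical root folder name
    for (speaker, emotion), files in sorted(index.items())[:5]:
        print(speaker, emotion, len(files))
```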


Multimodality

M3ED

Paper: https://aclanthology.org/2022.acl-long.391.pdf

Download: https://github.com/AIM3-RUC/RUCM3ED

M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database. ACL 2022

In this work, we propose a multi-modal, multi-scene, and multi-label emotional dialogue dataset, M3ED, for multimodal emotion recognition in conversations. Compared to MELD, currently the largest multimodal dialogue dataset for emotion recognition, M3ED is larger (24,449 vs. 13,708 utterances), more diversified (56 different TV series vs. only the TV series Friends), of higher quality (balanced performance across all three modalities), and contains blended-emotion annotations, which are not available in MELD. M3ED is the first multimodal emotion dialogue dataset in Chinese; it can serve as a valuable addition to the affective computing community and promote research on cross-culture emotion analysis and recognition. Furthermore, we propose a general Multimodal Dialog-aware Interaction framework, which considers multimodal fusion, temporal-context modeling, and speaker-interaction modeling, and achieves state-of-the-art performance. We also propose several interesting future exploration directions based on the M3ED dataset.

MELD

URL: https://affective-meld.github.io/

Multimodal EmotionLines Dataset (MELD) was created by enhancing and extending the EmotionLines dataset. MELD contains the same dialogue instances available in EmotionLines, but it also encompasses the audio and visual modalities along with text. MELD has more than 1,400 dialogues and 13,000 utterances from the Friends TV series, with multiple speakers participating in the dialogues. Each utterance in a dialogue is labeled with one of seven emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise, or Fear. MELD also has sentiment (positive, negative, and neutral) annotation for each utterance.
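
As a quick sanity check on the label distribution, here is a hedged sketch that counts emotions in the released CSV metadata; the file name train_sent_emo.csv and the "Emotion" column are assumptions based on the public release and should be verified against the download.

```python
# A hedged sketch that counts MELD emotion labels. Assumption: the released
# CSV (e.g. train_sent_emo.csv) has an "Emotion" column per utterance.
import csv
from collections import Counter

def emotion_distribution(csv_path: str) -> Counter:
    with open(csv_path, encoding="utf-8", newline="") as f:
        return Counter(row["Emotion"] for row in csv.DictReader(f))

if __name__ == "__main__":
    counts = emotion_distribution("MELD.Raw/train_sent_emo.csv")  # assumed path
    for emotion, n in counts.most_common():
        print(f"{emotion:10s} {n}")
```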

MOSEI

URL: http://multicomp.cs.cmu.edu/resources/cmu-mosei-dataset/

The CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset is the largest dataset for multimodal sentiment analysis and emotion recognition to date. It contains more than 23,500 sentence utterance videos from more than 1,000 online YouTube speakers and is gender balanced. All sentence utterances are randomly chosen from various topics and monologue videos. The videos are transcribed and properly punctuated.
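
CMU-MOSEI is usually accessed through the CMU Multimodal SDK rather than as loose files. The sketch below is heavily hedged: the mmdatasdk.mmdataset entry point, the cmu_mosei.highlevel / cmu_mosei.labels recipes, and the computational_sequences attribute are assumptions recalled from the SDK README and may differ across SDK versions.

```python
# A heavily hedged sketch for fetching CMU-MOSEI via the CMU Multimodal SDK.
# The recipe names (cmu_mosei.highlevel / cmu_mosei.labels) and the
# computational_sequences attribute are assumptions and may differ between
# SDK versions.
from mmsdk import mmdatasdk

# Download pre-extracted feature sequences and labels into local folders.
features = mmdatasdk.mmdataset(mmdatasdk.cmu_mosei.highlevel, "cmumosei/")
labels = mmdatasdk.mmdataset(mmdatasdk.cmu_mosei.labels, "cmumosei_labels/")

print(list(features.computational_sequences.keys()))
print(list(labels.computational_sequences.keys()))
```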

IEMOCAP

URL: https://sail.usc.edu/iemocap/index.html

The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal, and multi-speaker database collected at the SAIL lab at USC. It contains approximately 12 hours of audiovisual data, including video, speech, facial motion capture, and text transcriptions. It consists of dyadic sessions in which actors perform improvisations or scripted scenarios specifically selected to elicit emotional expressions. The IEMOCAP database is annotated by multiple annotators with categorical labels, such as anger, happiness, sadness, and neutrality, as well as dimensional labels such as valence, activation, and dominance. The detailed motion-capture information, the interactive setting used to elicit authentic emotions, and the size of the database make this corpus a valuable addition to the existing databases in the community for the study and modeling of multimodal and expressive human communication.
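
The categorical and dimensional labels live in per-dialog EmoEvaluation text files. Below is a parsing sketch, assuming the Session*/dialog/EmoEvaluation/*.txt layout and summary lines of the form [start - end] UTT_ID emotion [valence, activation, dominance]; both are assumptions about the released corpus, so verify against your copy.

```python
# A parsing sketch for IEMOCAP's EmoEvaluation summary lines. Assumption:
# files under Session*/dialog/EmoEvaluation/ contain lines of the form
#   [start - end]   Ses01F_impro01_F000   neu   [2.5000, 2.5000, 2.5000]
import re
from pathlib import Path

LINE = re.compile(
    r"\[(?P<start>[\d.]+) - (?P<end>[\d.]+)\]\s+"
    r"(?P<utt>\S+)\s+(?P<emotion>\w+)\s+"
    r"\[(?P<val>[\d.]+),\s*(?P<act>[\d.]+),\s*(?P<dom>[\d.]+)\]"
)

def parse_emo_evaluation(root: str):
    """Yield one dict per utterance: categorical label plus VAD scores."""
    for txt in Path(root).glob("Session*/dialog/EmoEvaluation/*.txt"):
        for line in txt.read_text(encoding="utf-8", errors="ignore").splitlines():
            m = LINE.match(line)
            if m:
                yield {
                    "utterance": m["utt"],
                    "emotion": m["emotion"],  # e.g. neu, ang, hap, sad, exc, xxx
                    "start": float(m["start"]),
                    "end": float(m["end"]),
                    "vad": (float(m["val"]), float(m["act"]), float(m["dom"])),
                }

if __name__ == "__main__":
    labels = list(parse_emo_evaluation("IEMOCAP_full_release"))  # assumed root
    print(len(labels), "labeled utterances")
```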


Apart From Databases

For preprocessing the data with MFA (forced alignment), please check out my post: MFA: Montreal Forced Aligner.