Text-to-speech (TTS) generates speech from text using computers. In recent years, deep learning-based TTS has reached a quality comparable to that of human speakers for neutral speech, but it still falls short on expressiveness.
At LINE, we’re developing a highly expressive TTS system with a variety of speaking styles and precise controls. This session introduces two themes: emotional TTS model development and TTS system operation.
The first half covers a method accepted to INTERSPEECH 2022, an international conference on speech processing. Using a small amount of neutral speech as a base, the method applies voice conversion to generate pseudo-emotional data, which is then used to build an emotional TTS model.
The second half covers our initiatives to adopt a microservice architecture in order to improve the speed and efficiency of the maintenance and development cycle of a system with multiple inference modules.