JP4551803B2 - Speech synthesizer and program thereof

Info

Publication number: JP4551803B2
Application number: JP2005096526A
Authority: JP
Inventors: 正統田村; 剛平林; 岳彦籠嶋
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2005-03-29
Filing date: 2005-03-29
Publication date: 2010-09-29
Anticipated expiration: 2025-03-29
Also published as: US20060224391A1; US7630896B2; JP2006276528A; CN1841497A; CN1841497B

Abstract

A speech synthesis system in a preferred embodiment includes a speech unit storage section, a phonetic environment storage section, a phonetic sequence/prosodic information input section, a plural-speech-unit selection section, a fused-speech-unit sequence generation section, and a fused-speech-unit modification/concatenation section. By fusing a plurality of selected speech units in the fused speech unit sequence generation section, a fused speech unit is generated. In the fused speech unit sequence generation section, the average power information is calculated for a plurality of selected M speech units, N speech units are fused together, and the power information of the fused speech unit is so corrected as to be equalized with the average power information of the M speech units.

Description

The present invention relates to a speech synthesis apparatus and method for text-to-speech synthesis, and more particularly to a speech synthesis apparatus and method that generate a speech signal from a phoneme sequence and prosodic information such as fundamental frequency and phoneme duration.

Artificially generating a speech signal from arbitrary text is called "text-to-speech synthesis". Text-to-speech synthesis is generally performed in three stages: a language processing unit, a prosody processing unit, and a speech synthesis unit.

The input text first undergoes morphological analysis and syntactic analysis in the language processing unit. The prosody processing unit then performs accent and intonation processing and outputs a phoneme sequence and prosodic information (fundamental frequency, phoneme duration, and so on). Finally, the speech signal synthesis unit synthesizes a speech waveform from the phoneme sequence and prosodic information.

One speech synthesis method is unit-selection speech synthesis, which takes the input phoneme sequence and prosodic information as a target and selects a speech unit sequence from a large number of speech units for synthesis. Unit-selection synthesis selects speech units from a large, pre-stored inventory based on the input phoneme sequence and prosodic information. One selection technique defines the degree of distortion of the synthesized speech as a cost function and selects the unit sequence that makes the cost small. For example, the target distortion, which represents differences in prosody, phonetic environment, and the like between the target speech and each speech unit, and the concatenation distortion, which arises from connecting speech units, are quantified as costs. The speech unit sequence used for synthesis is selected based on these costs, and synthesized speech is generated from the selected sequence. By selecting an appropriate unit sequence from a large number of speech units, unit-selection synthesis yields synthesized speech with little loss of sound quality from editing and concatenating units.
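The cost-based selection described above can be sketched as a shortest-path search. This is a minimal illustration, not the patent's procedure: the text only says that a sequence with small cost is selected, so the dynamic-programming search, the function name `select_unit_sequence`, and the caller-supplied `target_cost` and `concat_cost` functions are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, TypeVar

Target = TypeVar("Target")
Unit = TypeVar("Unit")


def select_unit_sequence(
    targets: Sequence[Target],
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[Target, Unit], float],
    concat_cost: Callable[[Unit, Unit], float],
) -> List[Unit]:
    """Return one candidate per target so that the summed cost is minimal.

    Hypothetical helper: total cost = sum of target costs plus the
    concatenation costs between neighbouring units.
    """
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j],
    #               index of its predecessor in row i - 1)
    best = [[(target_cost(targets[0], u), -1) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, prev = min(
                (best[i - 1][k][0] + concat_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), prev))
        best.append(row)

    # Backtrack from the cheapest candidate of the last synthesis unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]
```

In a real synthesizer the targets and units would carry prosodic and phonetic-environment features. Keeping several low-cost candidates per synthesis unit, rather than only the best path, is what the multiple-unit-selection method builds on.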

There is also a multiple-unit-selection speech synthesis method (see Non-Patent Document 1). Taking the input phoneme sequence and prosodic information as a target, it selects a plurality of speech units, based on the degree of distortion of the synthesized speech, for each synthesis unit obtained by dividing the input phoneme sequence. It then generates a new speech unit by fusing the selected speech units and synthesizes speech by concatenating the resulting units.

As the fusion method, for example, averaging of pitch waveforms is used. This reduces the sound quality degradation that occurs in a unit-selection synthesizer because of mismatch between the target phoneme sequence and prosodic information and the selected speech unit sequence, as well as the degradation caused by mismatch at unit boundaries. The stability of the synthesized speech therefore improves while the natural-voice quality is retained.
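Averaging pitch waveforms can be illustrated with a few lines of Python. This is a sketch under stated assumptions: each pitch waveform is a list of samples that has already been aligned with the others, shorter waveforms are treated as zero-padded at the end, and the name `fuse_pitch_waveforms` is hypothetical. The passage above does not describe alignment or length handling.

```python
from typing import List, Sequence


def fuse_pitch_waveforms(waveforms: Sequence[Sequence[float]]) -> List[float]:
    """Average aligned pitch waveforms sample by sample.

    Assumption: waveforms are already time-aligned; shorter ones are
    treated as if padded with zeros up to the longest length.
    """
    if not waveforms:
        raise ValueError("at least one pitch waveform is required")
    length = max(len(w) for w in waveforms)
    fused = [0.0] * length
    for w in waveforms:
        for i, sample in enumerate(w):
            fused[i] += sample
    return [s / len(waveforms) for s in fused]
```

Because averaging partially cancels components that differ between units, the fused waveform tends to have lower power than its inputs, which is one reason a separate power correction step is useful.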

As a method of controlling the power of synthesized speech, a speech synthesis method has been disclosed that divides a speech unit at phoneme boundaries, estimates the power of each partial unit, and changes the power of the speech unit based on the estimated power (see Patent Document 1). The power estimation step generates the power using parameters prepared in advance, such as coefficients of Quantification Theory Type I.
Patent Document 1: JP 2001-282276 A
Non-Patent Document 1: Tatsuya Mizutani and Takehiko Kagoshima, "Speech synthesis by multiple unit selection and fusion method", Proceedings of the Acoustical Society of Japan Spring Meeting, March 2004, 1-7-3, pp. 217-218

A unit-selection synthesizer selects speech units from a large inventory by means of a cost function, but the power of the selected units is not necessarily appropriate. Power discontinuities therefore become audible and lower the quality of the synthesized speech. In a multiple-unit-selection synthesizer, increasing the number of units to be fused stabilizes the power of the synthesized speech, but the fused unit is then built from many speech units with different voice-quality characteristics, which increases distortion. In addition, fusing speech units whose power differs greatly from the appropriate power can degrade sound quality.

A speech synthesis method that has a power estimation step and controls power using parameters prepared in advance has difficulty reflecting the power information of a large number of speech units appropriately, and a mismatch between the power and the speech units can occur.

In view of these problems, an object of the present invention is to provide a speech synthesis apparatus and method that, in unit-selection or multiple-unit-selection speech synthesis, appropriately reflect the power information of a large-scale speech unit inventory and achieve high-quality synthesis in which the power of the speech units in each speech section is natural and stable.

The invention according to claim 1 is a speech synthesis apparatus that divides a phoneme sequence obtained from input text into predetermined synthesis units, obtains a representative speech unit for each synthesis unit, and generates synthesized speech by concatenating the representative speech units, the apparatus comprising: storage means for storing a plurality of speech units corresponding to the synthesis units; selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from the speech units stored in the storage means based on the degree of distortion of the synthesized speech; representative speech unit generation means for obtaining a statistic of power information from the selected plurality of speech units, correcting the power information based on the statistic so as to improve the quality of the synthesized speech, and generating the representative speech unit corresponding to the synthesis unit; and speech waveform generation means for generating a speech waveform by concatenating the generated representative speech units, wherein the selection means selects N speech units and M speech units (N < M), and the representative speech unit generation means obtains an average value of power information from the selected M speech units, creates a fused speech unit by fusing the selected N speech units, and generates the representative speech unit by correcting the fused speech unit so that its power information equals the average value of the power information obtained from the M speech units.
The invention according to claim 2 is a speech synthesis apparatus that divides a phoneme sequence obtained from input text into predetermined synthesis units, obtains a representative speech unit for each synthesis unit, and generates synthesized speech by concatenating the representative speech units, the apparatus comprising: storage means for storing a plurality of speech units corresponding to the synthesis units; selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from the speech units stored in the storage means based on the degree of distortion of the synthesized speech; representative speech unit generation means for obtaining a statistic of power information from the selected plurality of speech units, correcting the power information based on the statistic so as to improve the quality of the synthesized speech, and generating the representative speech unit corresponding to the synthesis unit; and speech waveform generation means for generating a speech waveform by concatenating the generated representative speech units, wherein the selection means selects N speech units and M speech units (N < M), and the representative speech unit generation means obtains an average value of power information from the selected M speech units, corrects each of the selected N speech units so that its power information equals the average value, and generates the representative speech unit by fusing the corrected N speech units.
The invention according to claim 3 is a speech synthesis apparatus that divides a phoneme sequence obtained from input text into predetermined synthesis units, obtains a representative speech unit for each synthesis unit, and generates synthesized speech by concatenating the representative speech units, the apparatus comprising: storage means for storing a plurality of speech units corresponding to the synthesis units; selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from the speech units stored in the storage means based on the degree of distortion of the synthesized speech; representative speech unit generation means for obtaining a statistic of power information from the selected plurality of speech units, correcting the power information based on the statistic so as to improve the quality of the synthesized speech, and generating the representative speech unit corresponding to the synthesis unit; and speech waveform generation means for generating a speech waveform by concatenating the generated representative speech units, wherein the selection means selects N speech units and M speech units (N < M), and the representative speech unit generation means obtains the average value of the power information of the selected M speech units and the average value of the power information of the selected N speech units, obtains a correction value that makes the average value for the N speech units equal to the average value for the M speech units, corrects each of the N speech units by applying the correction value, and generates the representative speech unit by fusing the corrected N speech units.
The invention according to claim 4 is a speech synthesis apparatus that divides a phoneme sequence obtained from input text into predetermined synthesis units, obtains a representative speech unit for each synthesis unit, and generates synthesized speech by concatenating the representative speech units, the apparatus comprising: storage means for storing a plurality of speech units corresponding to the synthesis units; selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from the speech units stored in the storage means based on the degree of distortion of the synthesized speech; representative speech unit generation means for obtaining a statistic of power information from the selected plurality of speech units, correcting the power information based on the statistic so as to improve the quality of the synthesized speech, and generating the representative speech unit corresponding to the synthesis unit; and speech waveform generation means for generating a speech waveform by concatenating the generated representative speech units, wherein the selection means selects N speech units and M speech units (N < M), and the representative speech unit generation means obtains a statistic of power information from the selected M speech units, obtains the power information of each of the selected N speech units, determines a weight for each of the N speech units based on the obtained statistic and the power information of the N speech units, and generates the representative speech unit by fusing the N speech units based on the weights.
The invention according to claim 5 is a speech synthesis apparatus that divides a phoneme sequence obtained from input text into predetermined synthesis units, obtains a representative speech unit for each synthesis unit, and generates synthesized speech by concatenating the representative speech units, the apparatus comprising: storage means for storing a plurality of speech units corresponding to the synthesis units; selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from the speech units stored in the storage means based on the degree of distortion of the synthesized speech; representative speech unit generation means for obtaining a statistic of power information from the selected plurality of speech units, correcting the power information based on the statistic so as to improve the quality of the synthesized speech, and generating the representative speech unit corresponding to the synthesis unit; and speech waveform generation means for generating a speech waveform by concatenating the generated representative speech units, wherein the selection means selects N speech units and M speech units (N < M), and the representative speech unit generation means obtains, from a statistic of the power information of the selected M speech units, an interval in which the distribution of the power information has at least a predetermined probability, or an interval based on the interquartile range of the power information, obtains the power information of each of the selected N speech units, excludes from the speech units to be selected any of the N speech units whose power information falls outside the interval, and generates the representative speech unit by fusing those of the selected N speech units that lie within the interval.
The invention according to claim 6 is a speech synthesis apparatus that divides a phoneme sequence obtained from input text into predetermined synthesis units, obtains a representative speech unit for each synthesis unit, and generates synthesized speech by concatenating the representative speech units, the apparatus comprising: storage means for storing a plurality of speech units corresponding to the synthesis units; selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from the speech units stored in the storage means based on the degree of distortion of the synthesized speech; representative speech unit generation means for obtaining a statistic of power information from the selected plurality of speech units, correcting the power information based on the statistic so as to improve the quality of the synthesized speech, and generating the representative speech unit corresponding to the synthesis unit; and speech waveform generation means for generating a speech waveform by concatenating the generated representative speech units, wherein the selection means selects M speech units and an optimal speech unit for which the degree of distortion of the synthesized speech is small, and the representative speech unit generation means obtains an average value of power information from the selected M speech units and generates the representative speech unit by correcting the optimal speech unit so that its power information equals the average value.
The invention according to claim 7 is a speech synthesis apparatus that divides a phoneme sequence obtained from input text into predetermined synthesis units, obtains a representative speech unit for each synthesis unit, and generates synthesized speech by concatenating the representative speech units, the apparatus comprising: storage means for storing a plurality of speech units corresponding to the synthesis units; selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from the speech units stored in the storage means based on the degree of distortion of the synthesized speech; representative speech unit generation means for obtaining a statistic of power information from the selected plurality of speech units, correcting the power information based on the statistic so as to improve the quality of the synthesized speech, and generating the representative speech unit corresponding to the synthesis unit; and speech waveform generation means for generating a speech waveform by concatenating the generated representative speech units, wherein the selection means selects M speech units, and the representative speech unit generation means obtains, from a statistic of the power information of the selected M speech units, an interval in which the distribution of the power information has at least a predetermined probability, or an interval based on the interquartile range of the power information, and generates the representative speech unit by selecting, from among the speech units whose power information falls within the interval, an optimal speech unit for which the degree of distortion of the synthesized speech is small.

According to the present invention, the power of synthesized speech can be stabilized in both the unit-selection method and the multiple-unit selection and fusion method. Compared with methods that estimate and control power in advance, the invention selects a plurality of units from a large-scale speech unit inventory based on a cost function and derives the average power from them, so the synthesized speech appropriately reflects the power information of the large-scale inventory.

In addition, units can be weighted during fusion based on power information, or defective units can be removed, which improves sound quality. As a result, the power is stable while sound quality is maintained, and more natural synthesized speech is obtained.

In an embodiment of the present invention, a speech synthesis apparatus that divides a phoneme sequence obtained from input text into predetermined synthesis units, obtains a representative speech unit for each synthesis unit, and generates synthesized speech by concatenating the representative speech units comprises: storage means for storing a plurality of speech units corresponding to the synthesis units; selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from the speech units stored in the storage means based on the degree of distortion of the synthesized speech; representative speech unit generation means for obtaining a statistic of power information from the plurality of speech units, correcting the power information based on the statistic so as to improve the quality of the synthesized speech, and generating the representative speech unit corresponding to the synthesis unit; and speech waveform generation means for generating a speech waveform by concatenating the generated representative speech units. With this configuration, when synthesized speech is generated, speech units are corrected using a statistic of the power information of a plurality of speech units selected from the speech unit inventory in each speech section, so the synthesized speech appropriately reflects the power information of the large-scale inventory.

In addition, the selection means selects N speech units and M speech units (N ≤ M). The representative speech unit generation means obtains an average value of power information from the selected M speech units, creates a fused speech unit by fusing the selected N speech units, and generates the representative speech unit by correcting the fused speech unit so that its power information equals the average value obtained from the M speech units. With this configuration, in the multiple-unit selection and fusion method, limiting the number of units used for fusion to N preserves sound quality, while correcting with the average power of the larger set of M units stabilizes the power of the fused unit, so natural synthesized speech is obtained.
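The fuse-then-correct step can be sketched as follows. This is an illustration under assumptions, not the patent's implementation: power is taken here to be the mean squared sample value of a unit, units are equal-length lists already ordered by ascending cost, fusion is a plain sample-wise average, and all function names are hypothetical. Since power is quadratic in amplitude, the samples are scaled by the square root of the power ratio.

```python
import math
from typing import List, Sequence


def power(unit: Sequence[float]) -> float:
    """Mean squared amplitude (assumed definition of 'power information')."""
    return sum(s * s for s in unit) / len(unit)


def fuse(units: Sequence[Sequence[float]]) -> List[float]:
    """Sample-wise average of equal-length units (assumed fusion method)."""
    return [sum(samples) / len(units) for samples in zip(*units)]


def fuse_with_power_correction(
    ranked_units: Sequence[Sequence[float]], n: int
) -> List[float]:
    """Fuse the n best units, then match the mean power of all M units.

    ranked_units holds the M selected units, best (lowest cost) first.
    """
    target = sum(power(u) for u in ranked_units) / len(ranked_units)
    fused = fuse(ranked_units[:n])
    fused_power = power(fused)
    if fused_power == 0.0:
        return fused  # silent unit: nothing to scale
    gain = math.sqrt(target / fused_power)
    return [gain * s for s in fused]
```

With M = 3 units of power 1, 9 and 4 and N = 2, the fused unit is rescaled so that its power becomes 14/3, the mean over all three units, even though only two of them contribute to its waveform shape.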

In addition, the selection means selects M speech units and an optimal speech unit, and the representative speech unit generation means obtains an average value of power information from the selected M speech units and generates the representative speech unit by correcting the optimal speech unit so that its power information equals the average value obtained from the M speech units. With this configuration, in the unit-selection method, the selected optimal unit is corrected using the average power of the M units and the corrected units are concatenated, so the power of the synthesized speech is stable while high sound quality is maintained.

In addition, the selection means selects N speech units and M speech units (N ≤ M). The representative speech unit generation means obtains a statistic of power information from the selected M speech units, obtains the power information of each of the selected N speech units, determines a weight for each of the N speech units based on the statistic obtained from the M speech units, and generates the representative speech unit by fusing the N speech units according to the weights. With this configuration, in the multiple-unit selection and fusion method, the further the power of a unit used for fusion lies from the average power of the larger set of M units, the smaller its weight in fusion. This improves the quality of the fused speech unit and yields high-quality synthesized speech.
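One way to realize weights that shrink with distance from the mean power is a Gaussian-shaped weight. The passage above does not give the weighting function, so the Gaussian form, the use of the population standard deviation of the M powers, and the names `fusion_weights` and `weighted_fuse` are assumptions chosen for this sketch.

```python
import math
import statistics
from typing import List, Sequence


def fusion_weights(
    m_powers: Sequence[float], n_powers: Sequence[float]
) -> List[float]:
    """Normalized weights that decrease as a unit's power leaves the M mean.

    Assumed form: exp(-(p - mean)^2 / (2 * std^2)), normalized to sum to 1.
    """
    mean = statistics.mean(m_powers)
    std = statistics.pstdev(m_powers)
    uniform = [1.0 / len(n_powers)] * len(n_powers)
    if std == 0.0:
        return uniform  # all M powers identical: no basis for preference
    raw = [math.exp(-((p - mean) ** 2) / (2.0 * std * std)) for p in n_powers]
    total = sum(raw)
    if total == 0.0:
        return uniform  # every weight underflowed to zero
    return [r / total for r in raw]


def weighted_fuse(
    units: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted sample-wise sum of equal-length units."""
    return [
        sum(w * s for w, s in zip(weights, samples)) for samples in zip(*units)
    ]
```

A unit whose power equals the mean of the M units receives the largest weight, and units far from it contribute little to the fused waveform without being discarded outright.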

In addition, the selection means selects N speech units and M speech units (N ≤ M). The representative speech unit generation means obtains an interval from a statistic of the power information of the selected M speech units, obtains the power information of each of the selected N speech units, excludes from the selected units, as an outlier, any of the N speech units whose power information falls outside the interval, and generates the representative speech unit by fusing the speech units that remain after the outliers are excluded. With this configuration, in the multiple-unit selection and fusion method, defective units whose power deviates greatly from the average power of the larger set of M units are removed. Fusing only the remaining units improves the quality of the fused speech unit and yields high-quality synthesized speech.
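The outlier removal can be sketched with an interval derived from the interquartile range of the M powers. The source only says the interval is obtained from a statistic of the power information, so the 1.5 × IQR fences used here (a common convention), the mean-squared-amplitude definition of power, and the function names are assumptions.

```python
import statistics
from typing import List, Sequence, Tuple


def power(unit: Sequence[float]) -> float:
    """Mean squared amplitude (assumed definition of 'power information')."""
    return sum(s * s for s in unit) / len(unit)


def power_interval(
    m_powers: Sequence[float], k: float = 1.5
) -> Tuple[float, float]:
    """Interval [Q1 - k*IQR, Q3 + k*IQR] computed from the M powers.

    The factor k = 1.5 is an assumed, conventional choice.
    """
    q1, _, q3 = statistics.quantiles(m_powers, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_power_outliers(
    n_units: Sequence[Sequence[float]], interval: Tuple[float, float]
) -> List[Sequence[float]]:
    """Keep only the units whose power lies inside the interval.

    The result may be empty; how to handle that case is left to the caller.
    """
    low, high = interval
    return [u for u in n_units if low <= power(u) <= high]
```

The surviving units would then be passed to the fusion step in place of the original N units.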

Further, the fused speech unit is corrected so that its power information equals the average value of the power information obtained from the M speech units only when the power information of the fused speech unit, obtained by fusing the N speech units, is larger than that average value. With this configuration, the correction only ever reduces the power, so even if the fused speech unit contains a noise component the correction cannot amplify it, and sound quality degradation caused by power correction is avoided.

以下、図面を参照して本発明の具体的な実施形態を説明する。 Hereinafter, specific embodiments of the present invention will be described with reference to the drawings.

[First Embodiment]
A text-to-speech synthesizer according to the first embodiment will be described.

[1] Configuration of Text-to-Speech Synthesizer
FIG. 1 is a block diagram showing the configuration of a text-to-speech synthesizer according to the first embodiment of the present invention.

このテキスト音声合成装置はテキスト入力部１１、言語処理部１２、韻律処理部１３、音声合成部１４、音声波形出力部１５から構成される。 This text-to-speech synthesizer includes a text input unit 11, a language processing unit 12, a prosody processing unit 13, a speech synthesis unit 14, and a speech waveform output unit 15.

言語処理部１２は、テキスト入力部１１から入力されるテキストの形態素解析・構文解析を行い、その結果を韻律処理部１３へ送る。 The language processing unit 12 performs morphological analysis / syntax analysis of the text input from the text input unit 11 and sends the result to the prosody processing unit 13.

韻律処理部１３は、言語解析結果からアクセントやイントネーションの処理を行い、音韻系列（音韻記号列）及び韻律情報を生成し、音声合成部１４へ送る。 The prosody processing unit 13 performs accent and intonation processing from the language analysis result, generates a phoneme sequence (phoneme symbol string) and prosody information, and sends them to the speech synthesis unit 14.

The speech synthesis unit 14 generates a speech waveform from the phoneme sequence and prosodic information. The speech waveform thus generated is output by the speech waveform output unit 15.

[2] Configuration of Speech Synthesis Unit 14
FIG. 2 is a block diagram showing a configuration example of the speech synthesis unit 14 in FIG. 1.

In FIG. 2, the speech synthesis unit 14 comprises a speech unit storage unit 21, a phoneme environment storage unit 22, a phoneme sequence / prosodic information input unit 23, a multiple speech unit selection unit 24, a fused speech unit sequence creation unit 25, and a fused speech unit editing / connecting unit 26.

[2-1] Speech unit storage unit 21
Speech units are stored in the speech unit storage unit 21, and information on their phoneme environments (phoneme environment information) is stored in the phoneme environment storage unit 22.

The speech unit storage unit 21 stores speech units in the unit of speech (synthesis unit) used when generating synthesized speech. A synthesis unit is a phoneme, a subdivision of a phoneme, or a combination of these: for example, a half-phoneme, a phoneme (C, V), a diphone (CV, VC, VV), a triphone (CVC, VCV), or a syllable (CV, V), where V denotes a vowel and C a consonant. The synthesis units may also be of variable length, for example a mixture of these.

The phoneme environment of a speech unit is information corresponding to the factors that form the environment of that speech unit. Such factors include, for example, the phoneme name of the speech unit, the preceding phoneme, the following phoneme, the phoneme after the following phoneme, the fundamental frequency, the phoneme duration, the presence or absence of stress, the position relative to the accent nucleus, the time since the last breath pause, the speaking rate, and emotion.

[2-2] Phoneme sequence / prosodic information input unit 23
The phoneme sequence / prosodic information input unit 23 receives the phoneme sequence and prosodic information corresponding to the input text, as output from the prosody processing unit 13. The prosodic information input to the phoneme sequence / prosodic information input unit 23 includes the fundamental frequency, the phoneme duration, and the like.

以下、音韻系列・韻律情報入力部２３に入力される音韻系列と韻律情報を、それぞれ「入力音韻系列」、「入力韻律情報」と呼ぶ。入力音韻系列は、例えば音韻記号の系列である。 Hereinafter, the phoneme sequence and the prosody information input to the phoneme sequence / prosodic information input unit 23 are referred to as “input phoneme sequence” and “input prosody information”, respectively. The input phoneme sequence is a sequence of phoneme symbols, for example.

[2-3] Multiple speech unit selection unit 24
For each synthesis unit of the input phoneme sequence, the multiple speech unit selection unit 24 estimates the degree of distortion of the synthesized speech based on the input prosodic information and the prosodic information contained in the phoneme environment of the speech units. Based on that degree of distortion, it selects, from the speech units stored in the speech unit storage unit 21, M speech units for obtaining the average power information and N (N ≤ M) speech units for obtaining the fused speech unit.

Here, the degree of distortion of the synthesized speech is obtained as a weighted sum of a target cost, which is the distortion due to the difference between the phoneme environment of a speech unit stored in the speech unit storage unit 21 and the target phoneme environment sent from the phoneme sequence / prosodic information input unit 23, and a connection cost, which is the distortion due to the difference in phoneme environment between the speech units being connected.

That is, the target cost is the distortion caused by using a speech unit stored in the speech unit storage unit 21 under the target unit environment of the input text, and the connection cost is the distortion caused by discontinuity of the unit environment between the speech units being connected. In the present embodiment, a cost function described later is used as the degree of distortion of the synthesized speech.

[2-4] Fused speech unit sequence creation unit 25
Next, the fused speech unit sequence creation unit 25 generates a fused speech unit by fusing the selected speech units. The fusion can be performed, for example, by averaging pitch waveforms, as described later. The fused speech unit sequence creation unit 25 obtains the average power information of the selected M speech units, fuses the N speech units, and corrects the power information of the resulting fused speech unit so that it equals the average power information of the M speech units. As a result, a sequence of power-corrected fused speech units corresponding to the sequence of phoneme symbols of the input phoneme sequence is obtained. The fused speech unit editing / connecting unit 26 modifies and concatenates this sequence based on the input prosodic information to generate the speech waveform of the synthesized speech. The speech waveform thus generated is output by the speech waveform output unit 15.

なお、「パワー情報」とは、音声波形の平均二乗値もしくは平均絶対振幅値である。 The “power information” is the mean square value or the mean absolute amplitude value of the speech waveform.

[3] Processing of Speech Synthesis Unit 14
Each process of the speech synthesis unit 14 is described in detail below.

ここでは、合成単位の音声素片は音素であるとする。 Here, it is assumed that the speech unit of the synthesis unit is a phoneme.

[3-1] Speech unit storage unit 21
As shown in FIG. 3, the speech unit storage unit 21 stores the speech waveform of each phoneme together with a speech unit number that identifies it. As shown in FIG. 4, the phoneme environment storage unit 22 stores the phoneme environment information of each speech unit stored in the speech unit storage unit 21 in association with the unit number of that speech unit. Here, the phoneme symbol (phoneme name), the fundamental frequency, the phoneme duration, and the connection boundary cepstrum are stored as the phoneme environment.

Although the speech unit is taken to be a phoneme here, the same applies to half-phonemes, diphones, triphones, syllables, combinations of these, and variable-length units.

Each speech unit stored in the speech unit storage unit 21 is obtained by labeling a large amount of separately collected speech data phoneme by phoneme and cutting out the speech waveform of each phoneme. For example, FIG. 5 shows the result of labeling speech data 51 phoneme by phoneme. In FIG. 5, a phoneme symbol is assigned as label data 53 to the speech data (speech waveform) of each phoneme delimited by label boundaries 52. Phoneme environment information for each phoneme (for example, the phoneme name (phoneme symbol), the fundamental frequency, and the phoneme duration) is also extracted from this speech data. Each speech waveform obtained from the speech data 51 and the phoneme environment information corresponding to it are given the same unit number and are stored in the speech unit storage unit 21 and the phoneme environment storage unit 22, respectively, as shown in FIGS. 3 and 4. Here, the phoneme environment information is assumed to include the phoneme of the speech unit, its fundamental frequency, and its phoneme duration.

[3-2] Multiple speech unit selection unit 24
Next, the multiple speech unit selection unit 24 will be described.

[3-2-1] Cost Function
First, the cost function used when the multiple speech unit selection unit 24 obtains a unit sequence will be described.

A sub-cost function $C_n(u_i, u_{i-1}, t_i)$ ($n = 1, \ldots, N$, where $N$ is the number of sub-cost functions) is defined for each factor of the distortion that arises when speech units are modified and concatenated to generate synthesized speech. Here, $t_i$ denotes the target phoneme environment information of the speech unit for the part corresponding to the $i$-th segment, when the target speech corresponding to the input phoneme sequence and input prosodic information is written $t = (t_1, \ldots, t_I)$, and $u_i$ denotes a speech unit, among those stored in the speech unit storage unit 21, that has the same phoneme as $t_i$.

The sub-cost functions calculate the costs used to estimate the degree of distortion, relative to the target speech, of the synthesized speech that results when it is generated using the speech units stored in the speech unit storage unit 21.

Two kinds of sub-cost are used to calculate this cost: a "target cost", which estimates the degree of distortion of the synthesized speech relative to the target speech caused by using the speech unit, and a "connection cost", which estimates the degree of distortion of the synthesized speech relative to the target speech caused by connecting the speech unit to another speech unit.

As target costs, a fundamental frequency cost, which represents the difference between the fundamental frequency of a speech unit stored in the speech unit storage unit 21 and the target fundamental frequency, and a phoneme duration cost, which represents the difference between the phoneme duration of the speech unit and the target phoneme duration, are used.

As the connection cost, a spectral connection cost representing the difference in spectrum at the connection boundary is used. Specifically, the fundamental frequency cost is calculated from

$$C_1(u_i, u_{i-1}, t_i) = \{\log f(v_i) - \log f(t_i)\}^2 \qquad (1)$$

where $v_i$ denotes the phoneme environment of the speech unit $u_i$ stored in the speech unit storage unit 21, and $f$ denotes a function that extracts the average fundamental frequency from the phoneme environment $v_i$. The phoneme duration cost is calculated from

$$C_2(u_i, u_{i-1}, t_i) = \{g(v_i) - g(t_i)\}^2 \qquad (2)$$

where $g$ denotes a function that extracts the phoneme duration from the phoneme environment $v_i$. The spectral connection cost is calculated from the cepstrum distance between two speech units:

$$C_3(u_i, u_{i-1}, t_i) = \lVert h(u_i) - h(u_{i-1}) \rVert \qquad (3)$$

where $h$ denotes a function that extracts the cepstrum coefficients at the connection boundary of the speech unit $u_i$ as a vector. The weighted sum of these sub-cost functions is defined as the synthesis unit cost function:

$$C(u_i, u_{i-1}, t_i) = \sum_{n=1}^{N} w_n \, C_n(u_i, u_{i-1}, t_i) \qquad (4)$$

where $w_n$ denotes the weight of each sub-cost function. In the present embodiment, for simplicity, every $w_n$ is set to 1. Equation (4) gives the synthesis unit cost of a speech unit when that speech unit is applied to a given synthesis unit.

The sum, over all segments, of the synthesis unit costs calculated from equation (4) for each of the segments obtained by dividing the input phoneme sequence into synthesis units is called the cost. The cost function for calculating it is defined as in equation (5):

$$\mathrm{Cost} = \sum_{i=1}^{I} C(u_i, u_{i-1}, t_i) \qquad (5)$$
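The sub-costs and their sums described in this section can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the squared log-F0 difference, squared duration difference, and Euclidean cepstral distance are the forms assumed here, and the `Target` and `Unit` containers, their field names, and the use of separate left and right boundary cepstra are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Target:
    f0: float          # target fundamental frequency
    duration: float    # target phoneme duration

@dataclass
class Unit:
    f0: float          # average fundamental frequency from the phoneme environment
    duration: float    # phoneme duration from the phoneme environment
    cep_left: tuple    # cepstrum at the left connection boundary (hypothetical field)
    cep_right: tuple   # cepstrum at the right connection boundary (hypothetical field)

def f0_cost(unit, target):
    return (math.log(unit.f0) - math.log(target.f0)) ** 2

def duration_cost(unit, target):
    return (unit.duration - target.duration) ** 2

def spectral_cost(unit, prev):
    # No connection cost for the first segment, which has no predecessor.
    if prev is None:
        return 0.0
    return math.dist(unit.cep_left, prev.cep_right)

def unit_cost(unit, prev, target, w=(1.0, 1.0, 1.0)):
    """Weighted sum of the sub-costs for one segment (all weights 1 by default)."""
    return (w[0] * f0_cost(unit, target)
            + w[1] * duration_cost(unit, target)
            + w[2] * spectral_cost(unit, prev))

def total_cost(units, targets, w=(1.0, 1.0, 1.0)):
    """Sum of the per-segment costs over the whole unit sequence."""
    prevs = [None] + list(units[:-1])
    return sum(unit_cost(u, p, t, w) for u, p, t in zip(units, prevs, targets))
```

With every weight set to 1, as in the embodiment, `total_cost` is simply the sum of the two target costs and the spectral connection cost over all segments.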

[3-2-2] Unit Selection Processing
The multiple speech unit selection unit 24 uses the cost functions of equations (1) to (5) above to select a plurality of speech units per segment (that is, per synthesis unit) in two stages.

図６は、素片選択処理を説明するためのフローチャートである。 FIG. 6 is a flowchart for explaining the segment selection process.

First, as the first stage of unit selection, in step S61 the sequence of speech units that minimizes the cost calculated by equation (5) is obtained from the speech units stored in the speech unit storage unit 21. This cost-minimizing combination of speech units is called the optimal unit sequence. That is, each speech unit in the optimal unit sequence corresponds to one of the segments obtained by dividing the input phoneme sequence into synthesis units, and the cost calculated by equation (5) from the synthesis unit costs of the speech units in the optimal unit sequence is smaller than that of any other speech unit sequence. The search for the optimal unit sequence can be carried out efficiently using dynamic programming (DP).
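The dynamic programming search mentioned above can be sketched as a Viterbi-style pass over the candidate units of each segment. The function name and the `target_cost` / `concat_cost` callables are hypothetical stand-ins for the target and connection costs; the text does not specify how the search is implemented.

```python
def best_unit_sequence(candidates, target_cost, concat_cost):
    """Find the unit sequence with minimum total cost.

    candidates[i]:        list of candidate units for segment i.
    target_cost(i, u):    target cost of unit u in segment i.
    concat_cost(prev, u): connection cost between adjacent units.
    Returns (minimum total cost, list with one chosen unit per segment).
    """
    # history[i][j] = (cost of the cheapest path ending in candidate j, back-pointer)
    history = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        prev_units, prev_best = candidates[i - 1], history[-1]
        current = []
        for u in candidates[i]:
            # Cheapest predecessor for this candidate.
            k = min(range(len(prev_units)),
                    key=lambda j: prev_best[j][0] + concat_cost(prev_units[j], u))
            cost = prev_best[k][0] + concat_cost(prev_units[k], u) + target_cost(i, u)
            current.append((cost, k))
        history.append(current)

    # Trace back from the cheapest final candidate.
    j = min(range(len(history[-1])), key=lambda idx: history[-1][idx][0])
    total = history[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = history[i][j][1]
    path.reverse()
    return total, path
```

The work grows with the number of segments times the square of the number of candidates per segment, instead of with the number of all possible sequences.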

Next, the process proceeds to step S62. In the second stage of unit selection, a plurality of speech units is selected per segment using the optimal unit sequence. The details of step S62 are described below, assuming that there are J segments and that, for each segment, M speech units for obtaining the average power information and N speech units for use in unit fusion are selected.

[3-2-3] Method of selecting a plurality of speech units per segment
In steps S621 to S623, one of the J segments is taken as the segment of interest. Steps S621 to S623 are repeated J times so that each of the J segments becomes the segment of interest exactly once.

First, in step S621, the speech unit of the optimal unit sequence is fixed for every segment other than the segment of interest. In this state, the speech units stored in the speech unit storage unit 21 are ranked for the segment of interest according to the cost value of equation (5), and the top M speech units for obtaining the average power information and the top N speech units for use in unit fusion are selected.

For example, as shown in FIG. 7, suppose the input phoneme sequence is "ts · i · i · s · a · …". In this case the synthesis units correspond to the phonemes "ts", "i", "i", "s", "a", …, and each of these phonemes corresponds to one segment. FIG. 7 shows the case where the segment corresponding to the third phoneme "i" of the input phoneme sequence is the segment of interest and a plurality of speech units is obtained for it. For the segments other than the one corresponding to the third phoneme "i", the speech units 71a, 71b, 71d, 71e, … of the optimal unit sequence are fixed.

In this state, the cost is calculated using equation (5) for each speech unit, among those stored in the speech unit storage unit 21, that has the same phoneme name (phoneme symbol) as the phoneme "i" of the segment of interest. When the cost is obtained for each speech unit, however, the only values that change are the target cost of the segment of interest, the connection cost between the segment of interest and the preceding segment, and the connection cost between the segment of interest and the following segment, so only these costs need to be considered. That is:

(Procedure 1) One of the speech units stored in the speech unit storage unit 21 that has the same phoneme name (phoneme symbol) as the phoneme "i" of the segment of interest is taken as the speech unit $u_3$. The fundamental frequency cost is calculated using equation (1) from the fundamental frequency $f(v_3)$ of the speech unit $u_3$ and the target fundamental frequency $f(t_3)$.

(Procedure 2) The phoneme duration cost is calculated using equation (2) from the phoneme duration $g(v_3)$ of the speech unit $u_3$ and the target phoneme duration $g(t_3)$.

(Procedure 3) The first spectral connection cost is calculated using equation (3) from the cepstrum coefficients $h(u_3)$ of the speech unit $u_3$ and the cepstrum coefficients $h(u_2)$ of the speech unit 71b ($u_2$). The second spectral connection cost is likewise calculated using equation (3) from the cepstrum coefficients $h(u_3)$ of the speech unit $u_3$ and the cepstrum coefficients $h(u_4)$ of the speech unit 71d ($u_4$).

(Procedure 4) The cost of the speech unit $u_3$ is calculated as the weighted sum of the fundamental frequency cost, the phoneme duration cost, and the first and second spectral connection costs calculated with the respective sub-cost functions in (Procedure 1) to (Procedure 3).

(Procedure 5) Once the cost has been calculated according to (Procedure 1) to (Procedure 4) for each speech unit stored in the speech unit storage unit 21 that has the same phoneme name (phoneme symbol) as the phoneme "i" of the segment of interest, the speech units are ranked so that a smaller cost gives a higher rank (step S621 in FIG. 6). For example, in FIG. 7, speech unit 72a has the highest rank and speech unit 72e the lowest. Then the top M speech units, 72a to 72d, are selected for obtaining the average power information (step S622 in FIG. 6), and the top N (N ≤ M) speech units, 72a to 72c, are selected for fusion (step S623 in FIG. 6).

以上の（手順１）〜（手順５）をそれぞれのセグメントに対して行う。その結果、それぞれのセグメントについて、Ｍ個及びＮ個の音声素片が得られる。 The above (Procedure 1) to (Procedure 5) are performed for each segment. As a result, M and N speech segments are obtained for each segment.
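Procedures 1 to 5 for a single segment of interest can be sketched as follows. The function name and the `target_cost` / `concat_cost` callables are hypothetical; the sketch only shows that, with the neighbouring units fixed, each candidate's rank depends on its own target cost and its two connection costs.

```python
def select_top_units(candidates, prev_fixed, next_fixed, target,
                     target_cost, concat_cost, m, n):
    """Rank the candidates for the segment of interest and return (top M, top N).

    prev_fixed / next_fixed: units of the optimal sequence in the neighbouring
    segments, or None at the start or end of the utterance.
    """
    if n > m:
        raise ValueError("N must not exceed M")

    def cost(u):
        c = target_cost(u, target)          # Procedures 1 and 2
        if prev_fixed is not None:          # Procedure 3: first connection cost
            c += concat_cost(prev_fixed, u)
        if next_fixed is not None:          # Procedure 3: second connection cost
            c += concat_cost(u, next_fixed)
        return c                            # Procedure 4

    ranked = sorted(candidates, key=cost)   # Procedure 5: smaller cost ranks higher
    return ranked[:m], ranked[:n]
```

Because both lists are prefixes of the same ranking, the N units used for fusion are always a subset of the M units used for the average power.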

[3-3] Fused speech unit creation unit 25
Next, the fused speech unit creation unit 25 will be described.

The fused speech unit creation unit 25 fuses the plurality of speech units selected by the multiple speech unit selection unit 24 to create a fused speech unit.

[3-3-1] Processing of Fused Speech Unit Creation Unit 25
FIG. 8 shows the processing of the fused speech unit creation unit 25.

First, in step S81, the average power information of the selected M speech units is obtained. The power information $P_i$ of each speech unit is obtained as

$$P_i = \frac{1}{T} \sum_{n=1}^{T} s_i(n)^2 \qquad (6)$$

and the average power information of the M speech units is obtained as the mean value $P_{ave}$ of the power information $P_i$ ($1 \le i \le M$) of the individual units:

$$P_{ave} = \frac{1}{M} \sum_{i=1}^{M} P_i \qquad (7)$$

Here, $s_i(n)$ is the speech signal of the $i$-th speech unit and $T$ is its number of samples.

Next, in step S82, the N speech units are fused using the fusion method described later. The N speech units selected by the multiple speech unit selection unit 24 are acquired from the speech unit storage unit 21 and fused to generate a new speech unit (the fused speech unit).

Finally, in step S83, the power information of the fused speech unit is corrected to the average power information $P_{ave}$. The power information $P_f$ of the fused speech unit is obtained by equation (6), and the ratio $r$ for correcting the power information is obtained as

$$r = \sqrt{\frac{P_{ave}}{P_f}} \qquad (8)$$

The power information is corrected by multiplying the fused speech unit by this ratio $r$.

For simplicity, the power information $P_f$ of the fused speech unit may instead be taken as the mean value of the power information $P_i$ ($1 \le i \le N$) of the N speech units.
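Steps S81 and S83 can be sketched as follows, assuming the power information is the mean square value of the waveform, so that the amplitude scaling factor is the square root of the power ratio. The function names are hypothetical, and the `attenuate_only` flag is an illustrative way of expressing the variant in which the correction is applied only when the fused unit's power exceeds the average.

```python
import math

def mean_square(samples):
    """Power information of one waveform: its mean square value."""
    return sum(s * s for s in samples) / len(samples)

def correct_power(fused, units_m, attenuate_only=False):
    """Scale the fused waveform so that its power matches the mean power of units_m.

    fused:   samples of the fused speech unit (must not be all zero).
    units_m: waveforms of the M selected speech units.
    Returns (corrected samples, ratio r).
    """
    p_ave = sum(mean_square(u) for u in units_m) / len(units_m)
    p_f = mean_square(fused)
    r = math.sqrt(p_ave / p_f)  # power scales with r squared
    if attenuate_only:
        r = min(r, 1.0)         # never amplify, so noise is not boosted
    return [r * s for s in fused], r

corrected, r = correct_power([2.0, -2.0], [[1.0, -1.0], [1.0, -1.0]])
```

In this toy case the fused unit has power 4 and the M units have average power 1, so `r` is 0.5 and the corrected waveform has power 1.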

[3-3-2] Power Information Correction
FIG. 9 shows an example of power information correction. The table in FIG. 9 lists the power information $P_i$ ($1 \le i \le M$) of the top M speech units selected for the right half-phoneme of the phoneme "i" when M = 15. In this example, the half-phoneme is the synthesis unit. With N = 3, the power information of the fused speech unit is $P_f = 2691671$ and the average power information of the M units is $P_{ave} = 1647084$, so the ratio for correcting the power information is r = 0.78. The power information is corrected by multiplying the speech waveform of the fused speech unit by this r.
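The quoted ratio can be checked by hand, assuming the power information is the mean square value so that the amplitude ratio is the square root of the power ratio:

$$r = \sqrt{\frac{P_{ave}}{P_f}} = \sqrt{\frac{1647084}{2691671}} \approx \sqrt{0.612} \approx 0.78$$

Multiplying every sample by $r$ scales the mean square value by $r^2 \approx 0.612$, which brings the power of the fused unit from $P_f$ down to $P_{ave}$.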

FIG. 10 shows an example of waveforms with corrected power information, for the phoneme "i" at the beginning of a sentence. FIG. 10(a) shows the case where the fused speech units are concatenated as they are, without correction, and FIG. 10(b) shows the case where the power information correction of the present invention is applied. The numbers on the horizontal axis are pitch mark numbers. In FIG. 10(a), the power increases abruptly between pitch mark numbers 9 and 10, where the left and right half-phonemes of the phoneme "i" are joined. In FIG. 10(b), by contrast, r = 1.28 for the left half-phoneme and r = 0.78 for the right half-phoneme, and the two are joined smoothly at the connection. This right half-phoneme is the one shown in FIG. 9.

[3-3-3] Speech Unit Fusion Method
Next, the method of fusing speech units in step S82 will be described. This step performs different processing depending on whether the speech units are voiced or unvoiced.

[3-3-3-1] Voiced Sound
First, the case of voiced sound will be described. For voiced sound, pitch waveforms are extracted from the speech units and fused at the level of the pitch waveform to create new pitch waveforms. A pitch waveform is a relatively short waveform, up to several times the fundamental period of the speech in length, that has no fundamental period of its own and whose spectrum represents the spectral envelope of the speech signal.

Various extraction methods exist: simply cutting out the waveform with a pitch-synchronous window; applying an inverse discrete Fourier transform to the power spectrum envelope obtained by cepstrum analysis or PSE analysis; obtaining the pitch waveform as the impulse response of a filter obtained by linear prediction analysis; and obtaining, by a closed-loop training method, a pitch waveform that reduces the distortion relative to natural speech at the level of the synthesized speech.

Here, the case of extracting pitch waveforms by cutting them out with a pitch-synchronous window is described as an example, with reference to the flowchart of FIG. 11. The following is the processing procedure for fusing the N speech units selected by the multiple speech unit selection unit 24 into one new speech unit.

ステップＳ１１１において、Ｎ個の音声素片のそれぞれの音声波形に、その周期間隔毎にマーク（ピッチマーク）を付ける。図１２（ａ）には、Ｎ個の音声素片のうちの１つの音声素片の音声波形１２１に対し、その周期間隔毎にピッチマーク１２２が付けられている場合を示している。 In step S111, marks (pitch marks) are added to the respective speech waveforms of the N speech units for each periodic interval. FIG. 12A shows a case in which pitch marks 122 are attached to the speech waveform 121 of one speech unit among the N speech units at every cycle interval.

In step S112, as shown in FIG. 12(b), a window is applied with each pitch mark as its reference to cut out a pitch waveform. A Hanning window 123 is used, with a window length of twice the fundamental period. Then, as shown in FIG. 12(c), the windowed waveform 124 is cut out as a pitch waveform. The processing shown in FIG. 12 (step S112) is applied to each of the N speech units. As a result, a sequence of pitch waveforms is obtained for each of the N speech units.
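Step S112 can be sketched as follows. Several details are assumptions made for illustration: the local fundamental period is estimated as the distance to the next pitch mark (the previous one for the last mark), the window is centred on the pitch mark, samples outside the signal are treated as zero, and at least two pitch marks are given.

```python
import math

def hanning(length):
    """Symmetric Hanning window of the given length."""
    if length < 2:
        return [1.0] * length
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def extract_pitch_waveforms(signal, pitch_marks):
    """Cut one windowed pitch waveform per pitch mark (sample indices)."""
    waveforms = []
    for k, mark in enumerate(pitch_marks):
        if k + 1 < len(pitch_marks):
            period = pitch_marks[k + 1] - mark
        else:
            period = mark - pitch_marks[k - 1]
        window = hanning(2 * period)        # window length: twice the period
        start = mark - period               # centre the window on the mark
        frame = [signal[i] if 0 <= i < len(signal) else 0.0
                 for i in range(start, start + 2 * period)]
        waveforms.append([w * s for w, s in zip(window, frame)])
    return waveforms
```

Each returned waveform tapers to zero at both ends, which is what allows the pitch waveforms to be overlapped and added again at a different pitch when the unit is later modified.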

Next, in step S113, the number of pitch waveforms is equalized across the N speech units of the segment: taking the sequence with the most pitch waveforms as the reference, pitch waveforms are duplicated in the sequences that have fewer, so that all N sequences contain the same number of pitch waveforms.

FIG. 13 shows the pitch waveform sequences e1 to e3 cut out in step S112 from the N (here, three) speech units d1 to d3 of the segment. Sequence e1 contains seven pitch waveforms, e2 contains five, and e3 contains six, so e1 has the largest number. Accordingly, for the other sequences e2 and e3, some of their pitch waveforms are copied so that each also contains seven pitch waveforms, matching e1. The resulting new pitch waveform sequences corresponding to e2 and e3 are e2′ and e3′.
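The equalization of step S113 could be sketched as follows. The text only states that some pitch waveforms are copied; which ones are chosen is not specified, so the proportional index mapping below is an assumption, and `equalize_counts` is a hypothetical name.

```python
def equalize_counts(sequences):
    """Duplicate pitch waveforms so every sequence is as long as the longest."""
    target = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        # Assumed policy: map each target position proportionally onto the
        # source sequence, which repeats some source waveforms.
        indices = [(k * len(seq)) // target for k in range(target)]
        result.append([seq[i] for i in indices])
    return result

# Stand-ins for pitch waveforms: 7, 5 and 6 items, as in FIG. 13.
e1, e2, e3 = list("abcdefg"), list("abcde"), list("abcdef")
print([len(s) for s in equalize_counts([e1, e2, e3])])  # [7, 7, 7]
```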

The process then proceeds to step S114, which operates on each pitch waveform. In step S114, the pitch waveforms corresponding to the N speech units of the segment are averaged position by position to generate a new sequence of pitch waveforms. This new sequence is the fused speech unit.

FIG. 14 shows the pitch waveform sequences e1, e2′, and e3′ obtained in step S113 from the N (here, three) speech units d1 to d3 of the segment. Each sequence contains seven pitch waveforms, so in step S114 the first through seventh pitch waveforms are each averaged over the three speech units, producing a new sequence f1 of seven new pitch waveforms. For example, the centroid of the first pitch waveform of e1, the first pitch waveform of e2′, and the first pitch waveform of e3′ is computed and taken as the first pitch waveform of the new sequence f1. The second through seventh pitch waveforms of f1 are obtained in the same way. The pitch waveform sequence f1 is the "fused speech unit" mentioned above. When computing the centroid, the pitch waveforms may be weighted. In that case, with w1, w2, and w3 denoting the weights of e1, e2, and e3, the new pitch waveform sequence f1 is obtained as the weighted average

$$ f_1 = w_1 e_1 + w_2 e_2' + w_3 e_3' \qquad (9) $$

In equation (9), the weights wi are assumed to be normalized.
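The position-by-position weighted averaging of step S114 might look like the sketch below. It assumes every sequence already holds the same number of pitch waveforms, and that waveforms of unequal length are aligned at their centres and zero-padded; that alignment rule and the name `fuse_pitch_waveforms` are assumptions, not taken from the text.

```python
def fuse_pitch_waveforms(sequences, weights=None):
    """Weighted average of the k-th pitch waveform of every sequence."""
    count = len(sequences)
    if weights is None:
        weights = [1.0] * count          # plain average
    total = float(sum(weights))
    weights = [w / total for w in weights]  # normalise the weights
    fused = []
    for waveforms in zip(*sequences):
        length = max(len(w) for w in waveforms)
        out = [0.0] * length
        for weight, waveform in zip(weights, waveforms):
            offset = (length - len(waveform)) // 2  # assumed centre alignment
            for t, sample in enumerate(waveform):
                out[offset + t] += weight * sample
        fused.append(out)
    return fused
```

Passing a weight of zero for a sequence excludes it from the fused speech unit without changing the rest of the procedure.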

The fusion of pitch waveforms is not limited to averaging. For example, closed-loop training can produce a pitch waveform sequence that is optimal at the level of the synthesized speech, without extracting the pitch waveforms of the individual speech units. Closed-loop training is a method of generating a representative speech unit such that the distortion relative to natural speech is small at the level of the synthesized speech actually produced by changing the fundamental frequency and duration. Because closed-loop training generates a unit that reduces distortion at the level of the synthesized speech, it yields a higher-quality speech unit than creating a new speech unit by averaging pitch waveforms (see Patent Document 2: Japanese Patent No. 3281281).

[3-3-3-2] Unvoiced segments
For an unvoiced segment, on the other hand, the unit fusion step uses, as is, the speech waveform of the speech unit ranked first among the N speech units selected for that segment by the multiple speech unit selection unit 24.

[3-4] Fused speech unit editing/concatenation unit 26
The fused speech unit editing/concatenation unit 26 modifies the fused speech units according to the input prosodic information and concatenates them to generate the speech waveform of the synthesized speech. Because a fused speech unit is in practice a sequence of pitch waveforms, the speech waveform can be generated by overlap-adding the pitch waveforms so that the fundamental frequency and phoneme duration of the fused speech unit match the target fundamental frequency and target phoneme duration given in the input prosodic information.

FIG. 15 is a diagram for explaining the processing of the fused speech unit editing/concatenation unit 26. It shows the case where the fused speech units obtained by the unit fusion unit for the synthesis units of the phonemes "m", "a", "d", and "o" are modified and concatenated to generate the speech waveform of "mado". As shown in FIG. 15, for each segment (synthesis unit), the fundamental frequency of the pitch waveforms in the fused speech unit is changed (changing the pitch of the sound) and the number of pitch waveforms is increased (changing the duration) according to the target fundamental frequency and target phoneme duration given in the input prosodic information. Adjacent pitch waveforms are then concatenated within and between segments to generate the synthesized speech.
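The overlap-add of pitch waveforms at a target pitch and duration can be sketched as below. This is an illustrative simplification, assuming a constant target period in samples and a proportional mapping from output positions to source pitch waveforms; neither the function name `overlap_add` nor these choices come from the embodiment.

```python
def overlap_add(pitch_waveforms, target_period, target_count):
    """Place target_count pitch waveforms target_period samples apart and sum."""
    n_src = len(pitch_waveforms)
    half = max(len(w) for w in pitch_waveforms) // 2
    out = [0.0] * ((target_count - 1) * target_period + 2 * half + 1)
    for k in range(target_count):
        # Repeating or skipping source waveforms changes the duration.
        src = pitch_waveforms[(k * n_src) // target_count]
        centre = half + k * target_period   # spacing sets the pitch
        start = centre - len(src) // 2
        for t, sample in enumerate(src):
            out[start + t] += sample
    return out

print(overlap_add([[1.0, 2.0, 1.0]] * 2, target_period=2, target_count=3))
# [1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0]
```

A smaller `target_period` raises the fundamental frequency, and a larger `target_count` lengthens the segment.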

As described above, in the present embodiment of multiple-unit-selection speech synthesis, N speech units to be fused and M (N ≤ M) speech units for obtaining power information are selected, and the power information of the fused speech unit is corrected to the average power information of the M speech units. This reduces the sense of discontinuity at concatenation points and yields natural synthesized speech.

[4] Modifications
[4-1] Modification 1
In the present embodiment, the power information of the fused speech unit is corrected to the average power information of the M speech units. Alternatively, the power information of each of the N speech units may first be corrected to the average power information of the M speech units, and the corrected N speech units may then be fused.

In this case, the processing of the fused speech unit creation unit 25 is as shown in FIG. 16. In step S161, the average power information of the M speech units is obtained from equations (6) and (7). In step S162, each of the N speech units is corrected so that its power information becomes P_ave. In step S163, the corrected speech units are fused to create the fused speech unit.
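Steps S161 and S162 could be sketched as follows, taking the power information to be the mean square value of a unit's samples, as the text describes for equation (6). The function names are hypothetical, and leaving a silent (zero-power) unit unscaled is an assumption.

```python
import math

def mean_square_power(unit):
    """Power information of one speech unit: mean of squared samples."""
    return sum(s * s for s in unit) / len(unit)

def correct_each_to_average(m_units, n_units):
    """Scale every one of the N units so its power equals the M-unit average."""
    p_ave = sum(mean_square_power(u) for u in m_units) / len(m_units)
    corrected = []
    for unit in n_units:
        p = mean_square_power(unit)
        r = math.sqrt(p_ave / p) if p > 0 else 1.0  # amplitude ratio
        corrected.append([r * s for s in unit])
    return corrected, p_ave
```

The square root appears because power scales with the square of the amplitude factor applied to the waveform.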

[4-2] Modification 2
In the present embodiment, the power information of the fused speech unit is corrected to the average power information of the M speech units. Alternatively, the average power information of the M speech units and that of the N speech units may both be obtained, a ratio may be computed that corrects the average power information of the N speech units to that of the M speech units, each of the N speech units may be multiplied by this ratio, and the corrected N speech units may then be fused to create the fused speech unit.

In this case, the processing of the fused speech unit creation unit 25 is as shown in FIG. 23. In step S231, the average power information P_ave of the M speech units is obtained from equations (6) and (7). Similarly, in step S232, the average power information P_f of the N speech units is obtained. In step S233, the ratio r is obtained from P_f and P_ave by equation (8). In step S234, each of the N speech units is corrected by multiplying it by the ratio r, and in step S235 the corrected N speech units are fused to create the fused speech unit.
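A minimal sketch of steps S231 to S234 follows. It assumes that the ratio of equation (8) is the square root of P_ave / P_f, which is consistent with the worked value r = 0.76 quoted for the FIG. 9 data; the function names are hypothetical.

```python
import math

def mean_square_power(unit):
    return sum(s * s for s in unit) / len(unit)

def scale_by_common_ratio(m_units, n_units):
    """Multiply all N units by one ratio so their average power matches M's."""
    p_ave = sum(mean_square_power(u) for u in m_units) / len(m_units)
    p_f = sum(mean_square_power(u) for u in n_units) / len(n_units)
    r = math.sqrt(p_ave / p_f)  # assumed form of equation (8)
    return [[r * s for s in unit] for unit in n_units], r
```

Unlike correcting each unit individually, a common ratio preserves the relative power differences among the N units before they are fused.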

[4-3] Modification 3
In the present embodiment, the power information is described as the mean square value given by equation (6). When the mean absolute amplitude is used instead, equation (6) is replaced by

$$ P_i = \frac{1}{T} \sum_{n} \lvert s_i(n) \rvert \qquad (10) $$

and equation (8) is replaced by the mean absolute amplitude ratio

$$ r = \frac{P_{ave}}{P_f} \qquad (11) $$

Here s_i(n) denotes the n-th sample of the i-th speech unit and the sum runs over its T samples. This removes the need for a square-root calculation, so the computation can be carried out using integer arithmetic only.
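The mean-absolute-amplitude variant can be sketched with exact rational arithmetic, which illustrates that no square root is involved. Integer samples and a non-silent unit are assumed; `Fraction` is only one way to keep the computation in integers, and the function names are hypothetical.

```python
from fractions import Fraction

def mean_abs_amplitude(unit):
    """Mean absolute amplitude of an integer-valued speech unit."""
    return Fraction(sum(abs(s) for s in unit), len(unit))

def correct_amplitude(m_units, unit):
    """Scale unit so its mean absolute amplitude matches the M-unit average."""
    a_ave = sum(mean_abs_amplitude(u) for u in m_units) / len(m_units)
    r = a_ave / mean_abs_amplitude(unit)   # plain ratio, no square root
    # Integer multiply and floor-divide; the rounding mode is an assumption.
    return [s * r.numerator // r.denominator for s in unit], r

print(correct_amplitude([[2, -2], [4, -4], [6, -6]], [8, -8]))
# ([4, -4], Fraction(1, 2))
```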

[4-4] Modification 4
In step S83 of FIG. 8 and step S162 of FIG. 16, where the power information of the fused speech unit or of the selected speech units is corrected, the correction may be applied only when the ratio r obtained from equation (8) or equation (11) is smaller than 1.0. That is, the power information is corrected only in the direction that reduces it, which avoids amplifying noise components contained in the speech units.
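The attenuation-only rule amounts to a simple guard on the ratio. The sketch below assumes the ratio r has already been computed; `apply_attenuation_only` is a hypothetical name.

```python
def apply_attenuation_only(unit, r):
    """Scale the unit by r only when that lowers its power (r < 1.0)."""
    if r >= 1.0:
        return list(unit)            # never amplify, so noise is not boosted
    return [r * s for s in unit]
```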

[Second Embodiment]
Next, the fused speech unit creation unit 25 according to the second embodiment will be described.

FIG. 17 shows the processing of the fused speech unit creation unit 25 according to the second embodiment.

In the second embodiment, the fusion weights wi in equation (9) are determined from statistics of the power information of the M speech units.

In step S171 of FIG. 17, the mean and variance of the power information of the M speech units selected by the multiple speech unit selection unit 24 are obtained.

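The process then proceeds to step S172, in which the likelihood of the power information of each of the N speech units used for fusion is obtained. Assuming a Gaussian distribution with the mean μ and variance σ² obtained in step S171, the likelihood of the power information Pi is

$$ p(P_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(P_i - \mu)^2}{2\sigma^2} \right) $$

The fusion weight wi of each speech unit is set from this likelihood, normalized over the N speech units.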

In this way, the closer the power information of each of the N speech units used for fusion is to the mean of the distribution obtained from the power information of the M speech units, the larger its weight, and the further it deviates from the mean, the smaller its weight. The fusion weight of a unit whose power information deviates from the typical value for the segment can thus be reduced, which reduces the degradation of sound quality caused by fusion.
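As a hedged illustration of likelihood-based fusion weights, the sketch below fits a Gaussian to the power values of the M units and normalizes the likelihoods of the N units so that they sum to one. The use of the population variance and the normalization step are assumptions; the function name is hypothetical.

```python
import math

def gaussian_fusion_weights(m_powers, n_powers):
    """Fusion weights proportional to the Gaussian likelihood of each power."""
    mu = sum(m_powers) / len(m_powers)
    var = sum((p - mu) ** 2 for p in m_powers) / len(m_powers)  # must be > 0
    likelihoods = [
        math.exp(-((p - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
        for p in n_powers
    ]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]
```

A unit whose power sits at the mean of the M units receives the largest weight, and the weight falls off smoothly with distance from the mean.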

As an approximation of the fusion weights described above, the weight of any of the N speech units whose power information falls outside an interval of predetermined probability in the distribution of the power information of the M speech units can be set to 0, and the remaining units can be fused with equal weights. FIG. 18 shows this processing. In step S181, the mean μ and standard deviation σ of the power information of the selected M speech units are obtained, and in step S182, the interval in which the power information falls with a predetermined probability is obtained. For example, if the interval is (μ − 1.96σ < Pi < μ + 1.96σ), the probability that Pi lies in the interval is 95%.

Next, in step S183, speech units outside this interval are removed. They can be removed by setting the fusion weight wi of any unit outside the interval to 0.

Then, in step S184, the speech units are fused to obtain the fused speech unit. Applied to the data of FIG. 9, the interval is (−273573 < Pi < 3567739), and P_3 = 4091979 falls outside it. Fusing with the weights w1 = 0.5, w2 = 0.5, and w3 = 0 therefore removes the outlying unit. The interval need not be determined by the method described above; for example, a method based on the interquartile range, another statistic, is also conceivable.
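Steps S181 to S183 can be sketched as follows, assuming the interval μ ± 1.96σ with the population standard deviation. The function name and the example power values are hypothetical and are not the FIG. 9 data.

```python
import math

def interval_weights(m_powers, n_powers, z=1.96):
    """Zero weight outside mu +/- z*sigma, equal weights for the rest."""
    mu = sum(m_powers) / len(m_powers)
    sigma = math.sqrt(sum((p - mu) ** 2 for p in m_powers) / len(m_powers))
    lo, hi = mu - z * sigma, mu + z * sigma
    inside = [lo < p < hi for p in n_powers]
    kept = sum(inside)
    if kept == 0:
        raise ValueError("no speech unit lies inside the interval")
    return [1.0 / kept if ok else 0.0 for ok in inside], (lo, hi)

# Hypothetical powers: the third of the N units is an outlier.
weights, _ = interval_weights([10, 12, 11, 9, 10, 11, 10, 40], [10, 12, 40])
print(weights)  # [0.5, 0.5, 0.0]
```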

For example, the power values are sorted, and the difference between the value at the 3/4 position (the upper quartile) and the value at the 1/4 position (the lower quartile) is called the interquartile range. The interval is defined to run from the lower quartile minus a constant multiple (1.5, for example) of the interquartile range to the upper quartile plus the same constant multiple, and values outside this interval are treated as outliers.
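The interquartile-range rule might be sketched as below. The text does not fix how the quartile positions are indexed, so the simple index convention used here is an assumption, as is the function name.

```python
def iqr_outliers(powers, k=1.5):
    """Return the interval bounds and the power values lying outside them."""
    ordered = sorted(powers)
    n = len(ordered)
    q1 = ordered[(n - 1) // 4]          # lower quartile (assumed indexing)
    q3 = ordered[(3 * (n - 1)) // 4]    # upper quartile (assumed indexing)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return lo, hi, [p for p in powers if not lo <= p <= hi]

print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100]))  # (-3.0, 13.0, [100])
```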

According to the present embodiment, when the power information of one of the top N units selected for a segment is an outlier, its fusion weight can be reduced or the unit can be excluded from the fusion. This prevents the degradation of sound quality caused by fusing units with differing power information, so that natural synthesized speech is obtained. The second embodiment can also be combined with the first: the fusion weights are determined as in the second embodiment, and the power information is corrected by the method of the first embodiment.

[Third Embodiment]
In the third embodiment, in unit-selection speech synthesis, the power information of the selected optimal speech unit is corrected to the average power information of a plurality of units. It differs from the first and second embodiments in that it does not include unit fusion.

[1] Configuration of the speech synthesis unit 14
FIG. 19 shows a configuration example of the speech synthesis unit 14 according to the third embodiment.

The speech synthesis unit 14 comprises a speech unit storage unit 191, a phoneme environment storage unit 192, a phoneme sequence/prosodic information input unit 193, a multiple speech unit selection unit 194, a speech unit creation unit 195, a speech unit editing/concatenation unit 196, and the speech waveform output unit 15.

[1] Speech unit storage unit 191
As in the first embodiment, the speech unit storage unit 191 stores the speech units obtained by analyzing the database, and the phoneme environment storage unit 192 stores the phoneme environment of each speech unit.

[2] Multiple speech unit selection unit 194
For each synthesis unit of the input phoneme sequence, the multiple speech unit selection unit 194 estimates the degree of distortion between the input prosodic information and the prosodic information contained in the phoneme environment of each speech unit, and selects, from the speech units stored in the speech unit storage unit 191, a plurality of speech units and an optimal speech unit so as to minimize that degree of distortion. As shown in FIG. 20, the plurality of speech units can be selected on the basis of the cost function described earlier. This differs from the processing of FIG. 6 in that only the optimal speech unit is selected rather than the top N speech units. As a result, M speech units and an optimal speech unit are selected for each segment of the phoneme symbol sequence of the input phoneme sequence.

[3] Speech unit creation unit 195
Next, the speech unit creation unit 195 will be described.

The speech unit creation unit 195 corrects the power information of the optimal speech unit selected by the multiple speech unit selection unit 194 and creates the speech unit used for synthesis.

FIG. 21 shows the processing of the speech unit creation unit 195.

First, in step S211, the power information Pi (1 ≤ i ≤ M) of each of the selected M speech units is obtained, and from it the average power information P_ave. As in the first embodiment, these are given by equations (6) and (7). Next, in step S212, the power information P_1 of the optimal speech unit is corrected to the average power information P_ave of the M units obtained in step S211. Here, the ratio r used to correct the power information is obtained as

$$ r = \sqrt{\frac{P_{ave}}{P_1}} $$

The power information is corrected by multiplying the optimal speech unit by this ratio r.

For the data of FIG. 9, the average power information of the M units is P_ave = 1647084 and the power information of the optimal unit is P_1 = 2859883, so the ratio is r = 0.76. The power information is corrected by multiplying the speech waveform of the optimal speech unit by this r.
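The quoted ratio can be checked numerically. The short sketch below assumes that r is the square root of P_ave / P_1, which reproduces the value 0.76 stated for the FIG. 9 data; the helper `scale` is a hypothetical name.

```python
import math

p_ave = 1647084  # average power information of the M units (FIG. 9 data)
p_1 = 2859883    # power information of the optimal unit (FIG. 9 data)

r = math.sqrt(p_ave / p_1)
print(round(r, 2))  # 0.76

def scale(waveform, ratio):
    """Multiply every sample of the optimal unit's waveform by the ratio."""
    return [ratio * s for s in waveform]
```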

[4] Speech unit editing/concatenation unit 196
The speech unit editing/concatenation unit 196 modifies the speech units according to the input prosodic information and concatenates them to generate the speech waveform of the synthesized speech. Pitch waveforms are cut out from each speech unit and overlap-added so that the fundamental frequency and phoneme duration of the speech unit match the target fundamental frequency and target phoneme duration given in the input prosodic information, thereby generating the speech waveform.

As described above, in the present embodiment of unit-selection speech synthesis, the selected speech unit is corrected to the average power information of the M speech units, which makes the output more stable and improves the quality of the synthesized speech.

[5] Modifications
As in the second embodiment, an interval may be obtained from the power information of the M speech units, and the optimal speech unit may be selected from within that interval.

In this case, the processing of the speech unit creation unit 195 is as shown in FIG. 22.

In step S221, the mean and standard deviation of the power information of the M speech units are obtained, and in step S222, the interval in which the power information falls with a predetermined probability is obtained.

In step S223, if the power information P_1 of the first-ranked speech unit lies within the interval, that unit is used. Otherwise, the power information P_2 of the second-ranked speech unit is tested, and so on, so that the speech unit with the smallest cost among those whose power information lies within the interval is selected. Thus, when the power information of a highly ranked speech unit differs greatly from the rest, that unit is excluded as an outlier and the optimal speech unit is selected from those remaining. The power information of the speech unit selected in this way may also be corrected to the average power information of the M units.
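The rank-order search of step S223 can be sketched as follows, assuming the powers are listed in ascending cost order and the interval bounds are already known. Falling back to the first-ranked unit when no unit lies inside the interval is an assumption, and the function name is hypothetical.

```python
def select_unit_in_interval(ranked_powers, lo, hi):
    """Index of the lowest-cost unit whose power lies strictly inside (lo, hi).

    ranked_powers[0] belongs to the first-ranked (lowest-cost) unit.
    """
    for rank, power in enumerate(ranked_powers):
        if lo < power < hi:
            return rank
    return 0  # assumed fallback: keep the first-ranked unit

# Hypothetical powers: the top-ranked unit is an outlier, so rank 2 is used.
print(select_unit_in_interval([50, 12, 10], lo=0, hi=30))  # 1
```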

The speech units selected in this way are edited and concatenated by the speech unit editing/concatenation unit 196 to obtain the synthesized speech.

As in the first embodiment, the mean absolute amplitude may be used instead of the average power information.

Also as in the first embodiment, so that the power information is corrected only in the direction that reduces it, the correction in step S212 of FIG. 21 may be applied to the optimal speech unit only when the ratio r is smaller than 1.0. This avoids amplifying noise components contained in the optimal speech unit.

FIG. 1 is a block diagram showing the configuration of the speech synthesis apparatus according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration example of the speech synthesis unit.
FIG. 3 is a diagram showing an example of how speech units are stored in the speech unit storage unit.
FIG. 4 is a diagram showing an example of how phoneme environments are stored in the phoneme environment storage unit.
FIG. 5 is a diagram for explaining the procedure for obtaining speech units from speech data.
FIG. 6 is a flowchart for explaining the processing operation of the multiple speech unit selection unit.
FIG. 7 is a diagram for explaining the procedure for obtaining a plurality of speech units for each of the segments corresponding to the input phoneme sequence.
FIG. 8 is a flowchart for explaining the processing operation of the fused speech unit creation unit.
FIG. 9 is a diagram showing an example of power information correction.
FIG. 10 is a diagram showing an example of power information correction.
FIG. 11 is a flowchart for explaining the processing of the unit fusion step.
FIG. 12 is a diagram for explaining the processing of the unit fusion unit.
FIG. 13 is a diagram for explaining the processing of the unit fusion unit.
FIG. 14 is a diagram for explaining the processing of the unit fusion unit.
FIG. 15 is a diagram for explaining the processing operation of the unit editing/concatenation unit.
FIG. 16 is a flowchart for explaining the processing operation of the fused speech unit creation unit.
FIG. 17 is a flowchart for explaining the processing operation of the fused speech unit creation unit in the second embodiment of the present invention.
FIG. 18 is a flowchart for explaining the processing operation of the fused speech unit creation unit in the second embodiment.
FIG. 19 is a block diagram showing a configuration example of the speech synthesis unit in the third embodiment of the present invention.
FIG. 20 is a flowchart for explaining the processing operation of the multiple speech unit selection unit in the third embodiment.
FIG. 21 is a flowchart showing the processing of the fused speech unit creation unit in the third embodiment.
FIG. 22 is a flowchart showing the processing of the fused speech unit creation unit in the third embodiment.
FIG. 23 is a flowchart for explaining the processing operation of the fused speech unit creation unit.

Explanation of symbols

11 Text input unit
12 Language processing unit
13 Prosody processing unit
14 Speech synthesis unit
15 Speech waveform output unit
21 Speech unit storage unit
22 Phoneme environment storage unit
23 Phoneme sequence/prosodic information input unit
24 Multiple speech unit selection unit
25 Fused speech unit sequence creation unit
26 Fused speech unit editing/concatenation unit

Claims

A speech synthesizer that divides a phoneme sequence obtained from input text into predetermined synthesis units, obtains a representative speech unit for each synthesis unit, and concatenates the representative speech units to generate synthesized speech, the speech synthesizer comprising:
storage means for storing a plurality of speech units corresponding to the synthesis units;
selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from the speech units stored in the storage means on the basis of a degree of distortion of the synthesized speech;
representative speech unit generation means for obtaining a statistic of power information from the selected plurality of speech units and correcting the power information on the basis of the statistic so as to improve the quality of the synthesized speech, thereby generating a representative speech unit corresponding to the synthesis unit; and
speech waveform generation means for generating a speech waveform by concatenating the generated representative speech units,
wherein the selection means selects N speech units and M speech units (N < M), and
the representative speech unit generation means:
obtains an average value of the power information from the selected M speech units;
creates a fused speech unit by fusing the selected N speech units; and
generates the representative speech unit by correcting the power information of the fused speech unit so that it equals the average value of the power information obtained from the M speech units.

A speech synthesizer that divides a phoneme sequence obtained from input text into predetermined synthesis units, obtains a representative speech unit for each synthesis unit, and concatenates the representative speech units to generate synthesized speech, the speech synthesizer comprising:
storage means for storing a plurality of speech units corresponding to the synthesis units;
selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from the speech units stored in the storage means on the basis of a degree of distortion of the synthesized speech;
representative speech unit generation means for obtaining a statistic of power information from the selected plurality of speech units and correcting the power information on the basis of the statistic so as to improve the quality of the synthesized speech, thereby generating a representative speech unit corresponding to the synthesis unit; and
speech waveform generation means for generating a speech waveform by concatenating the generated representative speech units,
wherein the selection means selects N speech units and M speech units (N < M), and
the representative speech unit generation means:
obtains an average value of the power information from the selected M speech units;
corrects the power information of each of the selected N speech units so that it equals the average value of the power information; and
generates the representative speech unit by fusing the corrected N speech units.

A speech synthesizer that generates synthesized speech by dividing a phoneme sequence obtained from input text into predetermined synthesis units, obtaining a representative speech unit for each synthesis unit, and connecting these representative speech units, the speech synthesizer comprising:
Storage means for storing a plurality of speech units corresponding to the synthesis unit;
Selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from those stored in the storage means, based on the degree of distortion of the synthesized speech;
Representative speech unit generation means for obtaining a statistic of power information from the plurality of selected speech units and correcting the power information based on that statistic so as to improve the quality of the synthesized speech, thereby generating a representative speech unit corresponding to the synthesis unit; and
Speech waveform generation means for generating a speech waveform by connecting the generated representative speech units,
Wherein
The selection means selects N speech units and M speech units (N < M),
In the representative speech unit generation means,
An average value of the power information of the selected M speech units is obtained,
An average value of the power information of the selected N speech units is obtained,
A correction value is obtained that brings the average value of the power information of the N speech units to the average value of the power information of the M speech units,
Each of the N speech units is corrected by applying the correction value,
The speech synthesizer being characterized in that the representative speech unit is generated by fusing the corrected N speech units.
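
The correction value in the claim above can be sketched as a single gain shared by all N speech units. This is only one possible reading: power is assumed to be mean squared amplitude, and the correction value is assumed to be an amplitude gain equal to the square root of the ratio of the two average powers. The names mean_power, average_power, correction_gain and apply_gain are hypothetical.

```python
import math

def mean_power(wave):
    """Mean squared amplitude (assumed definition of 'power information')."""
    return sum(s * s for s in wave) / len(wave)

def average_power(units):
    """Average of the per-unit powers."""
    return sum(mean_power(w) for w in units) / len(units)

def correction_gain(m_units, n_units):
    """Amplitude gain that moves the N-unit average power onto the M-unit average."""
    p_n = average_power(n_units)
    if p_n == 0.0:
        return 1.0  # all N units silent: leave them unchanged
    return math.sqrt(average_power(m_units) / p_n)

def apply_gain(units, gain):
    """Apply the same gain to every unit."""
    return [[gain * s for s in w] for w in units]

m_units = [[1.0, -1.0], [2.0, -2.0], [3.0, -3.0]]
n_units = m_units[:2]
corrected = apply_gain(n_units, correction_gain(m_units, n_units))
print(average_power(corrected), average_power(m_units))
```

Because one gain is applied to all N units, the power ratios among those units are preserved; only their average is moved to the M-unit average.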

A speech synthesizer that generates synthesized speech by dividing a phoneme sequence obtained from input text into predetermined synthesis units, obtaining a representative speech unit for each synthesis unit, and connecting these representative speech units, the speech synthesizer comprising:
Storage means for storing a plurality of speech units corresponding to the synthesis unit;
Selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from those stored in the storage means, based on the degree of distortion of the synthesized speech;
Representative speech unit generation means for obtaining a statistic of power information from the plurality of selected speech units and correcting the power information based on that statistic so as to improve the quality of the synthesized speech, thereby generating a representative speech unit corresponding to the synthesis unit; and
Speech waveform generation means for generating a speech waveform by connecting the generated representative speech units,
Wherein
The selection means selects N speech units and M speech units (N < M),
In the representative speech unit generation means,
A statistic of the power information is obtained from the selected M speech units,
The power information of each of the selected N speech units is obtained,
A weight for each of the N speech units is determined based on the obtained statistic and the power information of that speech unit,
The speech synthesizer being characterized in that the representative speech unit is generated by fusing the N speech units according to the weights.
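
The claim above does not state how the weights are derived from the statistic. As a hedged illustration only, the sketch below assumes that the statistic is the mean and standard deviation of the M-unit powers and that each weight follows a Gaussian-shaped score of how close a unit's power is to that mean. Fusion is again replaced by a weighted sample-wise sum. The names fusion_weights and weighted_fuse are hypothetical.

```python
import math
import statistics

def fusion_weights(m_powers, n_powers):
    """Normalized weights favouring units whose power is near the M-unit mean."""
    mu = statistics.fmean(m_powers)
    sigma = statistics.pstdev(m_powers)
    if sigma == 0.0:
        raw = [1.0] * len(n_powers)  # no spread: weight all units equally
    else:
        raw = [math.exp(-0.5 * ((p - mu) / sigma) ** 2) for p in n_powers]
    total = sum(raw)
    if total == 0.0:
        return [1.0 / len(n_powers)] * len(n_powers)  # underflow guard
    return [r / total for r in raw]

def weighted_fuse(waves, weights):
    """Stand-in for fusion: weighted sample-wise sum of equal-length waveforms."""
    return [sum(w * s for w, s in zip(weights, col)) for col in zip(*waves)]

weights = fusion_weights([1.0, 4.0, 9.0], [1.0, 4.0])
fused = weighted_fuse([[1.0, -1.0], [2.0, -2.0]], weights)
print(weights, fused)
```

Here the unit with power 4.0 lies closer to the M-unit mean than the unit with power 1.0, so it receives the larger weight.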

A speech synthesizer that generates synthesized speech by dividing a phoneme sequence obtained from input text into predetermined synthesis units, obtaining a representative speech unit for each synthesis unit, and connecting these representative speech units, the speech synthesizer comprising:
Storage means for storing a plurality of speech units corresponding to the synthesis unit;
Selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from those stored in the storage means, based on the degree of distortion of the synthesized speech;
Representative speech unit generation means for obtaining a statistic of power information from the plurality of selected speech units and correcting the power information based on that statistic so as to improve the quality of the synthesized speech, thereby generating a representative speech unit corresponding to the synthesis unit; and
Speech waveform generation means for generating a speech waveform by connecting the generated representative speech units,
Wherein
The selection means selects N speech units and M speech units (N < M),
In the representative speech unit generation means,
A section in which the distribution of the power information has at least a predetermined probability, or a section based on the interquartile range of the power information, is obtained from a statistic of the power information of the selected M speech units,
The power information of each of the selected N speech units is obtained,
Any of the N speech units whose power information falls outside the section is excluded from the selected speech units,
The speech synthesizer being characterized in that the representative speech unit is generated by fusing those of the selected N speech units whose power information lies within the section.
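
The interquartile-range alternative in the claim above can be sketched as an outlier filter. The claim does not say how the section is derived from the interquartile range; the sketch assumes the common convention of extending the quartiles by k times the interquartile range with k = 1.5, which is an assumption and not taken from the claim. The names iqr_section and units_within_section are hypothetical.

```python
import statistics

def iqr_section(powers, k=1.5):
    """Interval [Q1 - k*IQR, Q3 + k*IQR] computed from the M-unit powers."""
    q1, _, q3 = statistics.quantiles(powers, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def units_within_section(m_powers, n_powers, k=1.5):
    """Indices of the N units whose power lies inside the section."""
    low, high = iqr_section(m_powers, k)
    return [i for i, p in enumerate(n_powers) if low <= p <= high]

m_powers = [1.0, 1.1, 0.9, 1.0, 1.2, 0.95, 1.05, 9.0]
n_powers = [1.0, 9.0, 1.1]
print(units_within_section(m_powers, n_powers))
```

In this example the unit with power 9.0 falls outside the section and is dropped, so only the remaining units would be passed on to fusion. What to do when every unit is excluded is not covered by the claim and is left open here.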

A speech synthesizer that generates synthesized speech by dividing a phoneme sequence obtained from input text into predetermined synthesis units, obtaining a representative speech unit for each synthesis unit, and connecting these representative speech units, the speech synthesizer comprising:
Storage means for storing a plurality of speech units corresponding to the synthesis unit;
Selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from those stored in the storage means, based on the degree of distortion of the synthesized speech;
Representative speech unit generation means for obtaining a statistic of power information from the plurality of selected speech units and correcting the power information based on that statistic so as to improve the quality of the synthesized speech, thereby generating a representative speech unit corresponding to the synthesis unit; and
Speech waveform generation means for generating a speech waveform by connecting the generated representative speech units,
Wherein
The selection means selects M speech units and an optimal speech unit having a low degree of distortion of the synthesized speech,
In the representative speech unit generation means,
An average value of the power information of the selected M speech units is obtained,
The speech synthesizer being characterized in that the representative speech unit is generated by correcting the power information of the optimal speech unit so as to equal that average value.

A speech synthesizer that generates synthesized speech by dividing a phoneme sequence obtained from input text into predetermined synthesis units, obtaining a representative speech unit for each synthesis unit, and connecting these representative speech units, the speech synthesizer comprising:
Storage means for storing a plurality of speech units corresponding to the synthesis unit;
Selection means for selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from those stored in the storage means, based on the degree of distortion of the synthesized speech;
Representative speech unit generation means for obtaining a statistic of power information from the plurality of selected speech units and correcting the power information based on that statistic so as to improve the quality of the synthesized speech, thereby generating a representative speech unit corresponding to the synthesis unit; and
Speech waveform generation means for generating a speech waveform by connecting the generated representative speech units,
Wherein
The selection means selects M speech units,
In the representative speech unit generation means,
A section in which the distribution of the power information has at least a predetermined probability, or a section based on the interquartile range of the power information, is obtained from a statistic of the power information of the selected M speech units,
The speech synthesizer being characterized in that the representative speech unit is generated by selecting, from among the speech units whose power information lies within the section, an optimal speech unit having a low degree of distortion of the synthesized speech.
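
The selection in the claim above can be sketched for the probability-based alternative. The sketch assumes, for illustration only, that the power distribution is modelled by the mean and standard deviation of the M-unit powers and that the section is the interval within z standard deviations of the mean. The degree of distortion is represented by a single cost number per unit. The names gaussian_section and pick_optimal, and the fallback used when the section is empty, are hypothetical.

```python
import statistics

def gaussian_section(powers, z=1.0):
    """Interval mean +/- z * standard deviation of the M-unit powers."""
    mu = statistics.fmean(powers)
    sigma = statistics.pstdev(powers)
    return mu - z * sigma, mu + z * sigma

def pick_optimal(units, z=1.0):
    """units: list of (cost, power) pairs. Returns the index of the lowest-cost
    unit among those whose power lies inside the section."""
    low, high = gaussian_section([p for _, p in units], z)
    inside = [i for i, (_, p) in enumerate(units) if low <= p <= high]
    if not inside:
        inside = list(range(len(units)))  # assumed fallback: consider all units
    return min(inside, key=lambda i: units[i][0])

units = [(0.1, 9.0), (0.2, 1.0), (0.3, 1.1), (0.4, 0.9)]
print(pick_optimal(units))
```

The unit with the lowest cost overall has an outlying power of 9.0, so it is skipped and the lowest-cost unit inside the section is chosen instead.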

A speech synthesis program for generating synthesized speech by dividing a phoneme sequence obtained from input text into predetermined synthesis units, obtaining a representative speech unit for each synthesis unit, and connecting these representative speech units,
The program causing a computer to realize:
A storage function of storing a plurality of speech units corresponding to the synthesis unit;
A selection function of selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from those stored by the storage function, based on the degree of distortion of the synthesized speech;
A representative speech unit generation function of obtaining a statistic of power information from the plurality of selected speech units and correcting the power information based on that statistic so as to improve the quality of the synthesized speech, thereby generating a representative speech unit corresponding to the synthesis unit; and
A speech waveform generation function of generating a speech waveform by connecting the generated representative speech units,
Wherein
The selection function selects N speech units and M speech units (N < M),
In the representative speech unit generation function,
An average value of the power information of the selected M speech units is obtained,
A fused speech unit is created by fusing the selected N speech units,
The speech synthesis program being characterized in that the representative speech unit is generated by correcting the power information of the fused speech unit so as to equal the average value of the power information obtained from the M speech units.
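
The order of operations in the claim above, fusing first and correcting the power afterwards, can be sketched as follows. As assumptions made only for illustration, power is the mean squared amplitude, fusion is a plain sample-wise average of equal-length waveforms, and the N speech units are the first N of the M selected units. The names mean_power, fuse and fuse_then_correct are hypothetical.

```python
import math

def mean_power(wave):
    """Mean squared amplitude (assumed definition of 'power information')."""
    return sum(s * s for s in wave) / len(wave)

def fuse(waves):
    """Stand-in for fusion: sample-wise average of equal-length waveforms."""
    return [sum(col) / len(waves) for col in zip(*waves)]

def fuse_then_correct(m_units, n):
    """Fuse the first n units, then scale the result to the M-unit average power."""
    target = sum(mean_power(w) for w in m_units) / len(m_units)
    fused = fuse(m_units[:n])
    p = mean_power(fused)
    gain = math.sqrt(target / p) if p > 0.0 else 1.0
    return [gain * s for s in fused], target

m_units = [[1.0, -1.0, 1.0, -1.0],
           [1.0, 1.0, -1.0, -1.0],
           [2.0, -2.0, 2.0, -2.0]]
rep, target = fuse_then_correct(m_units, 2)
print(mean_power(fuse(m_units[:2])), mean_power(rep), target)
```

The first two waveforms partly cancel when averaged, so the fused unit has less power than either input. Scaling after fusion restores the power to the M-unit average.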

A speech synthesis program for generating synthesized speech by dividing a phoneme sequence obtained from input text into predetermined synthesis units, obtaining a representative speech unit for each synthesis unit, and connecting these representative speech units,
The program causing a computer to realize:
A storage function of storing a plurality of speech units corresponding to the synthesis unit;
A selection function of selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from those stored by the storage function, based on the degree of distortion of the synthesized speech;
A representative speech unit generation function of obtaining a statistic of power information from the plurality of selected speech units and correcting the power information based on that statistic so as to improve the quality of the synthesized speech, thereby generating a representative speech unit corresponding to the synthesis unit; and
A speech waveform generation function of generating a speech waveform by connecting the generated representative speech units,
Wherein
The selection function selects N speech units and M speech units (N < M),
In the representative speech unit generation function,
An average value of the power information of the selected M speech units is obtained,
The power information of each of the selected N speech units is corrected so as to equal that average value,
The speech synthesis program being characterized in that the representative speech unit is generated by fusing the corrected N speech units.

A speech synthesis program for generating synthesized speech by dividing a phoneme sequence obtained from input text into predetermined synthesis units, obtaining a representative speech unit for each synthesis unit, and connecting these representative speech units,
The program causing a computer to realize:
A storage function of storing a plurality of speech units corresponding to the synthesis unit;
A selection function of selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from those stored by the storage function, based on the degree of distortion of the synthesized speech;
A representative speech unit generation function of obtaining a statistic of power information from the plurality of selected speech units and correcting the power information based on that statistic so as to improve the quality of the synthesized speech, thereby generating a representative speech unit corresponding to the synthesis unit; and
A speech waveform generation function of generating a speech waveform by connecting the generated representative speech units,
Wherein
The selection function selects N speech units and M speech units (N < M),
In the representative speech unit generation function,
An average value of the power information of the selected M speech units is obtained,
An average value of the power information of the selected N speech units is obtained,
A correction value is obtained that brings the average value of the power information of the N speech units to the average value of the power information of the M speech units,
Each of the N speech units is corrected by applying the correction value,
The speech synthesis program being characterized in that the representative speech unit is generated by fusing the corrected N speech units.

A speech synthesis program for generating synthesized speech by dividing a phoneme sequence obtained from input text into predetermined synthesis units, obtaining a representative speech unit for each synthesis unit, and connecting these representative speech units,
The program causing a computer to realize:
A storage function of storing a plurality of speech units corresponding to the synthesis unit;
A selection function of selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from those stored by the storage function, based on the degree of distortion of the synthesized speech;
A representative speech unit generation function of obtaining a statistic of power information from the plurality of selected speech units and correcting the power information based on that statistic so as to improve the quality of the synthesized speech, thereby generating a representative speech unit corresponding to the synthesis unit; and
A speech waveform generation function of generating a speech waveform by connecting the generated representative speech units,
Wherein
The selection function selects N speech units and M speech units (N < M),
In the representative speech unit generation function,
A statistic of the power information is obtained from the selected M speech units,
The power information of each of the selected N speech units is obtained,
A weight for each of the N speech units is determined based on the obtained statistic and the power information of that speech unit,
The speech synthesis program being characterized in that the representative speech unit is generated by fusing the N speech units according to the weights.

A speech synthesis program for generating synthesized speech by dividing a phoneme sequence obtained from input text into predetermined synthesis units, obtaining a representative speech unit for each synthesis unit, and connecting these representative speech units,
The program causing a computer to realize:
A storage function of storing a plurality of speech units corresponding to the synthesis unit;
A selection function of selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from those stored by the storage function, based on the degree of distortion of the synthesized speech;
A representative speech unit generation function of obtaining a statistic of power information from the plurality of selected speech units and correcting the power information based on that statistic so as to improve the quality of the synthesized speech, thereby generating a representative speech unit corresponding to the synthesis unit; and
A speech waveform generation function of generating a speech waveform by connecting the generated representative speech units,
Wherein
The selection function selects N speech units and M speech units (N < M),
In the representative speech unit generation function,
A section in which the distribution of the power information has at least a predetermined probability, or a section based on the interquartile range of the power information, is obtained from a statistic of the power information of the selected M speech units,
The power information of each of the selected N speech units is obtained,
Any of the N speech units whose power information falls outside the section is excluded from the selected speech units,
The speech synthesis program being characterized in that the representative speech unit is generated by fusing those of the selected N speech units whose power information lies within the section.

A speech synthesis program for generating synthesized speech by dividing a phoneme sequence obtained from input text into predetermined synthesis units, obtaining a representative speech unit for each synthesis unit, and connecting these representative speech units,
The program causing a computer to realize:
A storage function of storing a plurality of speech units corresponding to the synthesis unit;
A selection function of selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from those stored by the storage function, based on the degree of distortion of the synthesized speech;
A representative speech unit generation function of obtaining a statistic of power information from the plurality of selected speech units and correcting the power information based on that statistic so as to improve the quality of the synthesized speech, thereby generating a representative speech unit corresponding to the synthesis unit; and
A speech waveform generation function of generating a speech waveform by connecting the generated representative speech units,
Wherein
The selection function selects M speech units and an optimal speech unit having a low degree of distortion of the synthesized speech,
In the representative speech unit generation function,
An average value of the power information of the selected M speech units is obtained,
The speech synthesis program being characterized in that the representative speech unit is generated by correcting the power information of the optimal speech unit so as to equal that average value.

A speech synthesis program for generating synthesized speech by dividing a phoneme sequence obtained from input text into predetermined synthesis units, obtaining a representative speech unit for each synthesis unit, and connecting these representative speech units,
The program causing a computer to realize:
A storage function of storing a plurality of speech units corresponding to the synthesis unit;
A selection function of selecting, for each synthesis unit of the phoneme sequence obtained from the input text, a plurality of speech units from those stored by the storage function, based on the degree of distortion of the synthesized speech;
A representative speech unit generation function of obtaining a statistic of power information from the plurality of selected speech units and correcting the power information based on that statistic so as to improve the quality of the synthesized speech, thereby generating a representative speech unit corresponding to the synthesis unit; and
A speech waveform generation function of generating a speech waveform by connecting the generated representative speech units,
Wherein
The selection function selects M speech units,
In the representative speech unit generation function,
A section in which the distribution of the power information has at least a predetermined probability, or a section based on the interquartile range of the power information, is obtained from a statistic of the power information of the selected M speech units,
The speech synthesis program being characterized in that the representative speech unit is generated by selecting, from among the speech units whose power information lies within the section, an optimal speech unit having a low degree of distortion of the synthesized speech.