JP2008185805A

JP2008185805A - Technology for creating high quality synthesis voice

Info

Publication number: JP2008185805A
Application number: JP2007019433A
Authority: JP
Inventors: Takateru Tachibana; 隆輝立花; Toru Nagano; 徹長野; Masafumi Nishimura; 雅史西村
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2007-01-30
Filing date: 2007-01-30
Publication date: 2008-08-14
Also published as: CN101236743B; US20080183473A1; CN101236743A; US8015011B2

Abstract

PROBLEM TO BE SOLVED: To efficiently generate high-quality synthesized speech by connecting a plurality of phoneme pieces.

SOLUTION: A system comprises: a phoneme piece storage section that stores a plurality of phoneme piece data; a synthesis section that reads, from the phoneme piece storage section, the phoneme piece data corresponding to each phoneme indicating the pronunciation of an input text and connects them to generate voice data representing synthesized speech of the text; a calculation section that calculates, based on the voice data, an index value indicating the unnaturalness of the synthesized speech of the text; a paraphrase storage section that stores, in association with each of a plurality of first notations, a second notation that is a paraphrase of that first notation; a replacing section that searches the text for a notation matching any of the first notations and replaces it with the corresponding second notation; and a determination section that outputs the generated voice data on condition that the calculated index value is smaller than a reference value, and inputs the replaced text to the synthesis section so that voice data is generated again on condition that the index value is equal to or greater than the reference value.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a technique for generating synthesized speech. In particular, it relates to a technique for generating synthesized speech by connecting a plurality of phoneme pieces.

Speech synthesis based on waveform editing (concatenative synthesis) has conventionally been used with the goal of generating synthesized speech that sounds natural to the listener. In this method, a speech synthesizer records the speech of a human speaker in advance and stores it in a database as speech waveform data. The synthesizer then reads out a plurality of speech waveform data based on the input text and connects them to generate synthesized speech. For such synthesized speech to sound natural to the listener, it is desirable that the frequency and timbre of the speech change continuously. For example, if the frequency or timbre changes greatly at a point where speech waveform data are connected, the synthesized speech sounds unnatural.

Patent Document 1: JP 2003-131679 A

Non-Patent Document 1: Wael Hamza, Raimo Bakis, and Ellen Eide, "RECONCILING PRONUNCIATION DIFFERENCES BETWEEN THE FRONTEND AND BACK-END IN THE IBM SPEECH SYNTHESIS SYSTEM", Proceedings of ICSLP, Jeju, South Korea, 2004, pp. 2561-2564

However, the variety of speech waveform data that can be recorded in advance is limited by cost and time constraints and by the storage capacity and processing power of computers. As a result, appropriate speech waveform data may not be registered in the database, and using substitute speech waveform data may cause the frequency or other properties to change greatly at a connection point, making the synthesized speech unnatural. This is likely to occur when the wording of the input text differs greatly from the content of the speech recorded in advance to generate the speech waveform data.

Patent Document 1 and Non-Patent Document 1 are cited as related art. The voice output device described in Patent Document 1 converts text composed in written language into spoken-language text before reading it aloud, which makes the content easier for the listener to understand. However, this device converts the text only to change its style of expression, and the conversion is performed without regard to information such as frequency changes in the speech waveform data. This conversion therefore cannot improve the quality of the synthesized speech. The technique of Non-Patent Document 1 stores in advance phoneme pieces that have the same notation but different pronunciations, and selects from among them the phoneme pieces that improve the quality of the synthesized speech. However, if no appropriate phoneme piece exists even after such a selection is attempted, the synthesized speech still becomes unnatural.

Accordingly, an object of the present invention is to provide a system, a method, and a program that can solve the above problems. This object is achieved by the combinations of features described in the independent claims. The dependent claims define further advantageous embodiments of the present invention.

To solve the above problems, a first aspect of the present invention provides a system for generating synthesized speech, comprising: a phoneme piece storage unit that stores a plurality of phoneme piece data, each representing the speech of a different phoneme; a synthesis unit that receives text, reads from the phoneme piece storage unit the phoneme piece data corresponding to each phoneme indicating the pronunciation of the received text, and connects them to generate voice data representing synthesized speech of the text; a calculation unit that calculates, based on the voice data, an index value indicating the unnaturalness of the synthesized speech of the text; a paraphrase storage unit that stores, in association with each of a plurality of first notations, a second notation that is a paraphrase of that first notation; a replacement unit that searches the text for a notation matching any of the first notations and replaces the notation found with the second notation corresponding to that first notation; and a determination unit that outputs the generated voice data on condition that the calculated index value is smaller than a predetermined reference value, and inputs the replaced text to the synthesis unit so that voice data is further generated for that text on condition that the index value is equal to or greater than the reference value. A method for generating synthesized speech with this system, and a program that causes an information processing apparatus to function as this system, are also provided.

The above summary of the invention does not enumerate all the necessary features of the present invention, and sub-combinations of these feature groups can also constitute inventions.

The present invention is described below through embodiments. The following embodiments do not limit the invention as defined by the claims, and not all combinations of the features described in the embodiments are necessarily essential to the solution provided by the invention.

FIG. 1 shows the overall configuration of a speech synthesis system 10 and the data related to it. The speech synthesis system 10 includes a phoneme piece storage unit 20 that stores a plurality of phoneme piece data. These phoneme piece data are generated in advance from target speech data, which represents the synthesized speech that the system aims to generate, by dividing that data phoneme by phoneme. The target speech data is, for example, a recording of an announcer reading a script aloud. The speech synthesis system 10 receives text and applies processing such as morphological analysis and a prosody model to it, generating data such as the prosody and timbre of each phoneme to be produced as the read-aloud speech of the text. Based on the generated data, such as frequencies, the speech synthesis system 10 then selects and reads a plurality of phoneme piece data from the phoneme piece storage unit 20 and connects them. On condition that the user approves, the connected phoneme piece data are output as voice data representing the synthesized speech of the text.

The variety of phoneme piece data that can be stored in the phoneme piece storage unit 20 is limited by constraints such as cost, required time, and the computational capacity of the speech synthesis system 10. Consequently, even when the speech synthesis system 10 determines, through processing such as applying the prosody model, the frequency at which each phoneme should be pronounced, phoneme piece data of that frequency may not be stored in the phoneme piece storage unit 20. In that case, the speech synthesis system 10 may select inappropriate phoneme piece data and generate low-quality synthesized speech. To address this, when the voice data it has generated does not have sufficient quality, the speech synthesis system 10 according to this embodiment changes the wording of the text within a range that does not change its meaning, with the aim of improving the quality of the output synthesized speech.

FIG. 2 shows an example of the data structure of the phoneme piece storage unit 20. The phoneme piece storage unit 20 stores a plurality of phoneme piece data, each representing the speech of a different phoneme. Specifically, for each phoneme, the phoneme piece storage unit 20 stores the notation of the phoneme, its speech waveform data, and its timbre data. As an example, for a phoneme with the notation "あ" (a), the phoneme piece storage unit 20 stores, as speech waveform data, information indicating how the fundamental frequency changes over time. Here, the fundamental frequency of a phoneme means the frequency component with the greatest loudness among the frequency components that make up the phoneme. For the same phoneme with the notation "あ", the phoneme piece storage unit 20 also stores, as timbre data, vector data whose elements indicate the magnitude or strength of the sound for each of a plurality of frequency components including the fundamental frequency. For convenience of explanation, FIG. 2 illustrates the timbre data only at the head and tail of each phoneme, but in practice the phoneme piece storage unit 20 stores data indicating how the magnitude or strength of each frequency component changes over time.
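The record layout described for FIG. 2 can be sketched as follows. This is a minimal illustration, not the patent's actual data format: the class names `PhonemePiece` and `PhonemePieceStore`, the field names, and the choice of plain float lists are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PhonemePiece:
    """One stored phoneme piece (hypothetical layout modeled on FIG. 2)."""
    notation: str             # notation of the phoneme, e.g. "あ"
    f0_contour: List[float]   # fundamental frequency over time (speech waveform data)
    timbre_head: List[float]  # magnitude per frequency component at the head
    timbre_tail: List[float]  # magnitude per frequency component at the tail


class PhonemePieceStore:
    """In-memory stand-in for the phoneme piece storage unit 20."""

    def __init__(self) -> None:
        self._by_notation: Dict[str, List[PhonemePiece]] = {}

    def add(self, piece: PhonemePiece) -> None:
        # Several recordings may share one notation, so keep a list per notation.
        self._by_notation.setdefault(piece.notation, []).append(piece)

    def candidates(self, notation: str) -> List[PhonemePiece]:
        # Return a copy so callers cannot mutate the store by accident.
        return list(self._by_notation.get(notation, []))
```

Keeping a list per notation reflects the text's point that more than one piece can carry the same notation "あ" while differing in frequency and timbre.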

Because the phoneme piece storage unit 20 stores the speech waveform data of each phoneme in this way, the speech synthesis system 10 can generate speech consisting of a plurality of phonemes by connecting these speech waveform data. FIG. 2 shows only one example of the contents of the phoneme piece data; the data structure and data format of the phoneme piece data stored in the phoneme piece storage unit 20 are not limited to those shown in the figure. As another example, the phoneme piece storage unit 20 may store the recorded data of a phoneme itself as phoneme piece data, or may store data obtained by applying a predetermined operation to the recorded data. The operation is, for example, a discrete cosine transform. This makes it possible to refer to desired frequency components of the recorded data, and thus to analyze the fundamental frequency and timbre.

FIG. 3 shows the functional configuration of the speech synthesis system 10. The speech synthesis system 10 includes the phoneme piece storage unit 20, a synthesis unit 310, a calculation unit 320, a determination unit 330, a display unit 335, a paraphrase storage unit 340, a replacement unit 350, and an output unit 370. First, the relationship between these units and hardware resources is described. The phoneme piece storage unit 20 and the paraphrase storage unit 340 are realized by storage devices such as the RAM 1020 and the hard disk drive 1040 described later. The synthesis unit 310, the calculation unit 320, the determination unit 330, and the replacement unit 350 are realized by the operation of the CPU 1000, described later, under the instructions of an installed program. The display unit 335 is realized by the graphic controller 1075 and the display device 1080 described later, together with a pointing device or keyboard for receiving input from the user. The output unit 370 is realized by a speaker and the input/output chip 1070.

As described above, the phoneme piece storage unit 20 stores a plurality of phoneme piece data. The synthesis unit 310 receives text from outside, reads from the phoneme piece storage unit 20 the phoneme piece data corresponding to each phoneme indicating the pronunciation of the received text, and connects them. Specifically, the synthesis unit 310 first performs morphological analysis on the text to detect the boundaries between the phrases it contains and the part of speech of each phrase. Next, based on data stored in advance about how each phrase is read, the synthesis unit 310 determines at what frequency and with what timbre each phoneme should be pronounced when the text is read aloud. The synthesis unit 310 then reads from the phoneme piece storage unit 20 the phoneme piece data closest to that frequency and timbre, connects them, and outputs the result to the calculation unit 320 as voice data representing the synthesized speech of the text.

The calculation unit 320 calculates an index value indicating the unnaturalness of the synthesized speech of the text, based on the voice data received from the synthesis unit 310. This index value indicates, for example, the degree of difference in pronunciation between a first phoneme piece data included in the voice data and a second phoneme piece data connected to it, measured at the boundary between the two. The degree of difference in pronunciation is the degree of difference in timbre and fundamental frequency. The larger this difference, the more abruptly the frequency and other properties of the speech change, and the more unnatural the synthesized speech sounds to the listener.
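A boundary-based index of this kind can be sketched as follows. The text states only that the index reflects the difference in timbre and fundamental frequency at the boundary between connected pieces; the weighted sum, the Euclidean distance for timbre vectors, and taking the worst boundary as the overall index are assumptions made for this example, as are the names `Piece`, `boundary_difference`, and `unnaturalness_index`.

```python
import math
from typing import List, NamedTuple, Sequence


class Piece(NamedTuple):
    f0_head: float            # fundamental frequency at the start of the piece
    f0_tail: float            # fundamental frequency at the end of the piece
    timbre_head: List[float]  # spectral magnitudes at the start
    timbre_tail: List[float]  # spectral magnitudes at the end


def boundary_difference(left: Piece, right: Piece,
                        f0_weight: float = 1.0,
                        timbre_weight: float = 1.0) -> float:
    """Difference in pronunciation where `left` is joined to `right`."""
    f0_gap = abs(left.f0_tail - right.f0_head)
    timbre_gap = math.dist(left.timbre_tail, right.timbre_head)
    return f0_weight * f0_gap + timbre_weight * timbre_gap


def unnaturalness_index(pieces: Sequence[Piece]) -> float:
    """Assumed aggregation: the largest difference over all boundaries."""
    return max(
        (boundary_difference(a, b) for a, b in zip(pieces, pieces[1:])),
        default=0.0,  # a single piece has no boundary
    )
```

Other aggregations, such as a sum or an average over boundaries, would fit the description equally well; the maximum is used here only because a single bad join is enough to make an utterance sound unnatural.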

The determination unit 330 determines whether the calculated index value is smaller than a predetermined reference value. On condition that the index value is equal to or greater than the reference value, the determination unit 330 instructs the replacement unit 350 to replace a notation in the text so that voice data is generated again for the replaced text. On condition that the index value is smaller than the reference value, the display unit 335 displays to the user the text from which the voice data was generated, and asks the user whether synthesized speech may be generated based on that text. This text may be the text received from outside as is, or it may be text produced by one or more rounds of replacement by the replacement unit 350.

On condition that it receives an input indicating approval, the determination unit 330 outputs the generated voice data to the output unit 370. The output unit 370 then generates synthesized speech based on the voice data and outputs it to the user. The replacement unit 350, on the other hand, starts processing when it receives an instruction from the determination unit 330 because the index value is equal to or greater than the reference value. The paraphrase storage unit 340 stores, in association with each of a plurality of first notations, a second notation that is a paraphrase of that first notation. On receiving the instruction from the determination unit 330, the replacement unit 350 first acquires from the synthesis unit 310 the text that was most recently subjected to speech synthesis. Next, the replacement unit 350 searches that text for a notation matching any of the first notations. If one is found, the replacement unit 350 replaces it with the second notation corresponding to that first notation. The text with the replaced notation is input to the synthesis unit 310, and voice data is generated again based on it.
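The interaction among the synthesis, calculation, determination, and replacement units can be sketched as a loop. This is an illustration under several assumptions: the synthesizer and the index calculation are passed in as plain callables, the function names `replace_once` and `synthesize_until_natural` are hypothetical, and the `max_rounds` limit and the `None` return value (when the user rejects the text or no paraphrase applies) are additions made so the example always terminates; the text above does not specify those cases.

```python
from typing import Callable, Dict, Optional, Tuple


def replace_once(text: str, paraphrases: Dict[str, str]) -> Optional[str]:
    """Replace the first matching first notation with its second notation."""
    for first, second in paraphrases.items():
        if first in text:
            return text.replace(first, second, 1)
    return None  # no first notation occurs in the text


def synthesize_until_natural(
    text: str,
    synthesize: Callable[[str], str],       # stands in for synthesis unit 310
    index_of: Callable[[str], float],       # stands in for calculation unit 320
    paraphrases: Dict[str, str],            # stands in for paraphrase storage unit 340
    reference: float,
    approve: Callable[[str], bool] = lambda shown_text: True,
    max_rounds: int = 10,                   # assumed safety limit
) -> Optional[Tuple[str, str]]:
    for _ in range(max_rounds):
        voice = synthesize(text)
        if index_of(voice) < reference:
            # Natural enough: show the text and output only if the user approves.
            return (text, voice) if approve(text) else None
        replaced = replace_once(text, paraphrases)
        if replaced is None:
            return None
        text = replaced  # feed the replaced text back into synthesis
    return None
```

In a real system `synthesize` would return audio data rather than a string; a string is used here only so the sketch runs without any audio dependency.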

FIG. 4 shows the functional configuration of the synthesis unit 310. The synthesis unit 310 includes a phrase storage unit 400, a phrase search unit 410, and a phoneme piece search unit 420. The synthesis unit 310 generates the reading of the text using the technique known as the n-gram model, and then generates voice data based on that reading. Specifically, the phrase storage unit 400 stores, for each of a plurality of phrases registered in advance, the reading of the phrase in association with its notation. The notation is the character string that makes up the phrase, and the reading is, for example, a symbol indicating pronunciation, a symbol indicating accent, or an accent type. The phrase storage unit 400 may store a plurality of different readings in association with the same notation. In that case, the phrase storage unit 400 further stores, for each reading, the probability that the notation is read that way.

More precisely, for each combination of a predetermined number of phrases (for example, a combination of two phrases in the bi-gram model), the phrase storage unit 400 stores the probability that the phrases in the combination are read with each combination of readings. For example, for the single phrase "僕の" ("my"), it stores not only the probability that the accent falls on the first syllable and the probability that it falls on the second syllable. When "僕の" is written immediately next to the phrase "近くの" ("nearby"), it also stores, for this combination of consecutive phrases, the probability that the accent falls on the first syllable and the probability that it falls on the second syllable. Separately, when "僕の" is written next to a phrase other than "近くの", it likewise stores, for that combination of consecutive phrases, the probability that the accent falls on each syllable.

The notation, reading, and probability information stored here is generated by performing speech recognition on target speech data recorded in advance and then counting, for each combination of phrases, how often each combination of readings appears. That is, combinations of phrases and readings that appear frequently in the target speech data are stored with high probability values. To further improve the accuracy of speech synthesis, it is desirable that the phoneme piece storage unit 20 also store part-of-speech information for the phrases. The part-of-speech information may likewise be generated by speech recognition of the target speech data, or it may be added manually to the speech-recognized text data.
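The counting step can be sketched as follows for the bi-gram case. The input format (each sentence as a list of notation and reading pairs, as might come out of speech recognition) and the function name `build_bigram_table` are assumptions for this example, and normalizing counts into relative frequencies is one plausible way to turn frequencies into probability values; the text says only that frequent combinations receive high values.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Word = Tuple[str, str]  # (notation, reading)
Table = Dict[Tuple[str, str], Dict[Tuple[str, str], float]]


def build_bigram_table(corpus: List[List[Word]]) -> Table:
    """Map each pair of consecutive notations to P(reading pair | notation pair)."""
    counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for sentence in corpus:
        # Walk over every pair of adjacent phrases in the sentence.
        for (w1, r1), (w2, r2) in zip(sentence, sentence[1:]):
            counts[(w1, w2)][(r1, r2)] += 1

    table: Table = {}
    for words, readings in counts.items():
        total = sum(readings.values())
        table[words] = {pair: n / total for pair, n in readings.items()}
    return table
```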

The phrase search unit 410 searches the phrase storage unit 400 for phrases whose notation matches each phrase contained in the input text, reads the reading corresponding to each phrase found from the phrase storage unit 400, and connects these readings to generate the reading of the text. In the bi-gram model, for example, the phrase search unit 410 scans the input text from the beginning and, for each combination of two consecutive phrases, searches the phrase storage unit 400 for a matching combination of phrases. The phrase search unit 410 then reads from the phrase storage unit 400 the combinations of readings corresponding to the combination of phrases found, together with their probability values. In this way, the phrase search unit 410 retrieves a plurality of probability values for each combination of phrases, working from the beginning of the text to the end.

For example, when the text contains the phrases A, B, and C in this order, the following readings are retrieved for the combination of phrases A and B: the combination a1 and b1 (probability p1), the combination a2 and b1 (probability p2), the combination a1 and b2 (probability p3), and the combination a2 and b2 (probability p4). Similarly, for the combination of phrases B and C, the following readings are retrieved: the combination b1 and c1 (probability p5), the combination b1 and c2 (probability p6), the combination b2 and c1 (probability p7), and the combination b2 and c2 (probability p8). The phrase search unit 410 then selects the combination of readings that maximizes the product of the probability values over the phrase combinations, and outputs it to the phoneme piece search unit 420 as the reading of the text. In this example, the products are taken over pairs that share the same reading of B, so p1×p5, p1×p6, p2×p5, p2×p6, p3×p7, p3×p8, p4×p7, and p4×p8 are each calculated, and the combination of readings corresponding to the largest of them is output.
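The selection in the A, B, C example can be sketched by enumerating every chain of readings and multiplying the bi-gram probabilities along it. Only chains in which both bi-grams use the same reading of B are meaningful, which is why the enumeration runs over one reading per phrase. The function name `best_reading`, the table layout, and the brute-force enumeration are assumptions for this example (a dynamic-programming search would be the usual choice for long texts), and the numeric probabilities are invented.

```python
from itertools import product
from math import prod
from typing import Dict, List, Optional, Tuple

Table = Dict[Tuple[str, str], Dict[Tuple[str, str], float]]


def best_reading(words: List[str], table: Table) -> Tuple[Optional[Tuple[str, ...]], float]:
    """Return the chain of readings with the largest product of bi-gram probabilities."""
    pairs = list(zip(words, words[1:]))
    # Collect the candidate readings seen for each position in the text.
    options = [set() for _ in words]
    for i, pair in enumerate(pairs):
        for r1, r2 in table.get(pair, {}):
            options[i].add(r1)
            options[i + 1].add(r2)

    best, best_p = None, 0.0
    for chain in product(*(sorted(o) for o in options)):
        p = prod(
            table.get(pair, {}).get(readings, 0.0)
            for pair, readings in zip(pairs, zip(chain, chain[1:]))
        )
        if p > best_p:
            best, best_p = chain, p
    return best, best_p


# Invented probabilities standing in for p1..p8.
example_table: Table = {
    ("A", "B"): {("a1", "b1"): 0.4, ("a2", "b1"): 0.3,
                 ("a1", "b2"): 0.2, ("a2", "b2"): 0.1},
    ("B", "C"): {("b1", "c1"): 0.1, ("b1", "c2"): 0.2,
                 ("b2", "c1"): 0.6, ("b2", "c2"): 0.1},
}
reading, probability = best_reading(["A", "B", "C"], example_table)
```

With these numbers the most probable A and B pair alone (a1, b1) does not win, because the B and C probabilities favor b2; this is the reason the product over the whole text is maximized rather than each pair separately.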

Next, the phoneme piece search unit 420 determines the target prosody and timbre for each phoneme based on the generated reading, searches the phoneme piece storage unit 20 for the phoneme piece data closest to each target, reads them out, connects them to generate voice data, and outputs the voice data to the calculation unit 320. For example, when the accents in the generated reading run syllable by syllable as LHHHLLH (where L denotes a low accent and H a high accent), the phoneme piece search unit 420 calculates the prosody of each phoneme so that these rises and falls are expressed smoothly. Prosody is represented by, for example, the change in fundamental frequency, the duration of the sound, and the volume. The fundamental frequency is calculated using a fundamental frequency model learned statistically in advance from speech data recorded by an announcer. With this model, the target value of the fundamental frequency of each phoneme can be obtained according to the accent environment, the part of speech, the sentence length, and so on. Although an example of obtaining the fundamental frequency from the accent has been described here, the timbre, duration, and volume can likewise be obtained from the pronunciation based on rules learned statistically in advance.

Techniques for determining the prosody and timbre of each phoneme from accent and pronunciation in this way are well known as prosody or timbre prediction techniques, so a more detailed description is omitted.
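The search for the closest phoneme piece can be sketched as a nearest-candidate lookup per phoneme. The target values would come from the statistical prosody model, which is not reproduced here; the names `Target`, `Candidate`, `cost`, and `select_pieces`, the weighted absolute-difference cost, and the default weights are all assumptions for this example.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Target(NamedTuple):
    notation: str
    f0: float        # target fundamental frequency (Hz)
    duration: float  # target length (s)
    volume: float    # target loudness (arbitrary units)


class Candidate(NamedTuple):
    notation: str
    f0: float
    duration: float
    volume: float
    piece_id: str    # hypothetical identifier of the stored piece


def cost(t: Target, c: Candidate,
         weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted distance between a target and a stored candidate."""
    return (weights[0] * abs(t.f0 - c.f0)
            + weights[1] * abs(t.duration - c.duration)
            + weights[2] * abs(t.volume - c.volume))


def select_pieces(targets: Sequence[Target],
                  store: Sequence[Candidate]) -> List[Candidate]:
    """Pick, for each target phoneme, the stored piece with the lowest cost."""
    chosen = []
    for t in targets:
        same = [c for c in store if c.notation == t.notation]
        if not same:
            raise LookupError(f"no phoneme piece stored for {t.notation!r}")
        chosen.append(min(same, key=lambda c: cost(t, c)))
    return chosen
```

Because frequency, duration, and volume have different units, the weights would need tuning in practice. When the closest stored piece is still far from the target, the joins between pieces degrade, which is the situation the unnaturalness index is meant to detect.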

FIG. 5 shows an example of the data structure of the paraphrase storage unit 340. The paraphrase storage unit 340 stores, in association with each of a plurality of first notations, a second notation that is a paraphrase of that first notation. In addition, the paraphrase storage unit 340 stores, in association with each pair of a first notation and its corresponding second notation, the degree of approximation in meaning between the two. For example, the paraphrase storage unit 340 stores the first notation "僕の" (an informal "my") in association with its paraphrase, the second notation "私の" (a neutral "my"), and further stores the degree of approximation "65%" in association with this pair. The degree of approximation is expressed, for example, as a percentage in this way. It may be entered by the operator who registered the notations in the paraphrase storage unit 340, or it may be calculated from the probability that users approved replacements made with this paraphrase.

As the number of notations registered in the paraphrase storage unit 340 grows, identical first notations may be stored in association with different second notations. That is, when the replacement unit 350 compares the input text with the first notations in the paraphrase storage unit 340, a notation in the text may match several first notations. In this case, the replacement unit 350 replaces that notation in the text with the second notation corresponding to the first notation that has the highest degree of approximation among them. The degree of approximation stored with each notation can thus be used as a guide for selecting the notation to replace with.
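Choosing among several matching entries by degree of approximation can be sketched as follows. The entry "僕の" to "私の" at 65% comes from the FIG. 5 example; the second entry ("俺の" at 50%), the names `Paraphrase` and `replace_best`, and replacing only the first occurrence are assumptions for this illustration.

```python
from typing import List, NamedTuple, Optional


class Paraphrase(NamedTuple):
    first: str         # first notation (searched for in the text)
    second: str        # second notation (the paraphrase)
    similarity: float  # degree of approximation in percent


def replace_best(text: str, entries: List[Paraphrase]) -> Optional[str]:
    """Apply the matching paraphrase with the highest degree of approximation."""
    matches = [e for e in entries if e.first in text]
    if not matches:
        return None  # nothing to replace
    best = max(matches, key=lambda e: e.similarity)
    return text.replace(best.first, best.second, 1)


entries = [
    Paraphrase("僕の", "俺の", 50.0),  # hypothetical entry, not from the source
    Paraphrase("僕の", "私の", 65.0),  # entry from the FIG. 5 example
]
```

Calling `replace_best("僕の本", entries)` applies the 65% entry, so the lower-ranked paraphrase is used only if it is the sole match.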

Furthermore, each second notation stored in the paraphrase storage unit 340 is preferably the notation of a phrase contained in the text that represents the content of the target speech data. When the target speech data is a recording of text being read aloud, this text is the text that was read. When the target speech data comes from free speech, the text may be the result of running speech recognition on the target speech data, or a manual transcription of its content. Because the replacement phrases are then ones that were actually used in the target speech data, the synthesized speech output for the replaced text can be made even more natural.

In addition, when several second notations are found for a first notation in the text, the replacement unit 350 may calculate, for each of them, the distance between the text obtained by replacing with that second notation and the text representing the content of the target speech data. Distance here is a known concept: an index of how close two texts are in their tendencies of expression and content, which can be calculated by existing methods. In this case, the replacement unit 350 selects the candidate text with the shortest distance as the replaced text. This approach also brings the speech based on the replaced text as close as possible to the target speech.
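The text does not say which distance measure is used, only that an existing method suffices. As one possible stand-in, the sketch below uses the Jaccard distance over character bigrams. The measure and the function names are assumptions chosen for illustration.

```python
def bigrams(text):
    """Set of overlapping two-character substrings of `text`."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over character bigrams; 0.0 means identical sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 1.0 - len(x & y) / len(x | y)

def closest_candidate(candidates, target_text):
    """Pick the candidate replaced text nearest to the target-speech text."""
    return min(candidates, key=lambda c: jaccard_distance(c, target_text))

# Hypothetical candidates and target transcript (English used for readability).
target = "please open the window near me"
print(closest_candidate(["window near me", "window by my side"], target))
```

Any measure that ranks candidate texts by closeness to the target corpus could be substituted without changing the selection step.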

FIG. 6 shows an example of the data structure of the phrase storage unit 400. The phrase storage unit 400 stores phrase data 600, pronunciation data 610, accent data 620, and part-of-speech data 630 in association with one another. The phrase data 600 gives the notation of each of a plurality of phrases. In the example of FIG. 6, the phrase data 600 contains 「大阪」, 「府」, 「在住」, 「の」, 「方」, 「に」, 「限」, 「り」, 「ま」, and 「す」 as phrase notations. The pronunciation data 610 and the accent data 620 together give the reading of each phrase: the pronunciation data 610 gives the pronunciation, and the accent data 620 gives the accent. The pronunciation is written, for example, in phonetic symbols using the alphabet. The accent is a sequence that assigns each phoneme a relative pitch, high (H) or low (L). The accent data 620 may also include an accent type, a number that identifies a particular combination of per-phoneme relative pitches. The phrase storage unit 400 may further store the part of speech of each phrase, shown as the part-of-speech data 630. Part of speech here is not limited to the strict grammatical sense and includes extended categories suited to speech synthesis and analysis. For example, it may include an ending word, a category for elements that form the end of a phrase.

FIG. 6 also shows, at the center of the figure and alongside these data, the speech waveform data generated by the phrase search unit 410 from them. When the text 「大阪府在住の方に限ります」 ("Limited to residents of Osaka Prefecture") is input, the phrase search unit 410 uses the n-gram model technique described above to obtain the relative pitch (L or H) and the pronunciation (phonetic symbols using the alphabet) of each phoneme. The phoneme segment search unit 420 then generates a fundamental frequency contour that reflects the relative pitch of each phoneme yet changes smoothly, so that it does not sound unnatural to the user. An example of such a contour is shown at the center of FIG. 6. A fundamental frequency that changes in exactly this way would be ideal, but phoneme piece data whose frequency values match it exactly cannot always be retrieved from the phoneme piece storage unit 20, and the synthesized speech may sound unnatural as a result. As already described, the speech synthesis system 10 addresses this by changing the text itself, within a range that does not change its meaning, so that the phoneme piece data that can be retrieved is used effectively and the quality of the synthesized speech improves.

FIG. 7 shows the flow of processing by which the speech synthesis system 10 generates synthesized speech. The synthesis unit 310 receives text from outside, reads from the phoneme piece storage unit 20 the phoneme piece data corresponding to each phoneme in the pronunciation of the text, and connects the data (S700). Specifically, the synthesis unit 310 first performs morphological analysis on the text to detect the boundaries between the phrases it contains and the part of speech of each phrase. Next, based on the readings stored in advance in the phrase storage unit 400, the synthesis unit 310 determines at what frequency and with what timbre each phoneme should be pronounced when the text is read aloud. It then reads from the phoneme piece storage unit 20 the phoneme piece data closest to that frequency and timbre, connects the data, and outputs the result to the calculation unit 320 as voice data representing the synthesized speech of the text.

The calculation unit 320 calculates an index value indicating the unnaturalness of the synthesized speech of the text, based on the voice data received from the synthesis unit 310 (S710). One example follows. The index value is calculated from two quantities: the degree of difference in speech at the connection boundaries between phoneme piece data, and the degree of difference between the speech of each phoneme as determined by the reading of the text and the speech of the phoneme piece data retrieved by the phoneme segment search unit 420. These are described in turn.

(1) Degree of difference at connection boundaries

For each connection boundary between phoneme piece data in the voice data, the calculation unit 320 calculates the degree of difference in fundamental frequency and the degree of difference in timbre at that boundary. The degree of difference in fundamental frequency may be the difference between the fundamental frequencies or the rate of change of the fundamental frequency. The degree of difference in timbre is the distance between a vector representing the timbre before the boundary and a vector representing the timbre after the boundary. For example, it may be the Euclidean distance in cepstrum space between a vector obtained by applying a discrete cosine transform to the speech waveform data before the boundary and a vector obtained by applying a discrete cosine transform to the speech waveform data after the boundary. The calculation unit 320 then sums the degrees of difference over all connection boundaries.

However, when an unvoiced consonant such as p or t is pronounced at a connection boundary between phoneme piece data, the calculation unit 320 treats the degree of difference at that boundary as 0. This is because listeners rarely notice anything wrong even when the timbre or the fundamental frequency changes sharply around an unvoiced consonant. For the same reason, when a connection boundary contains a comma, the calculation unit 320 treats the degree of difference at that boundary as 0.
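The boundary calculation, including the two exceptions, might look like the following sketch. Several details are assumptions made for illustration: the `Segment` record and its field names, the exact set of unvoiced consonants (the text names only p and t), treating a boundary as exempt when the unvoiced consonant lies on either side, and adding the frequency and timbre terms without weights.

```python
import math
from dataclasses import dataclass
from typing import Sequence

# Assumed set of unvoiced consonants; the text names only p and t.
UNVOICED = {"p", "t", "k", "s", "h", "f"}

@dataclass
class Segment:
    phoneme: str
    f0_start: float                # fundamental frequency at the segment start (Hz)
    f0_end: float                  # fundamental frequency at the segment end (Hz)
    timbre_start: Sequence[float]  # e.g. cepstral vector at the start
    timbre_end: Sequence[float]    # e.g. cepstral vector at the end
    comma_after: bool = False      # True if a comma follows this segment

def boundary_dissimilarity(left: Segment, right: Segment) -> float:
    """Degree of difference at the join between `left` and `right`."""
    # Joins at a comma or next to an unvoiced consonant are treated as 0.
    if left.comma_after or left.phoneme in UNVOICED or right.phoneme in UNVOICED:
        return 0.0
    f0_diff = abs(left.f0_end - right.f0_start)
    timbre_diff = math.dist(left.timbre_end, right.timbre_start)  # Euclidean distance
    return f0_diff + timbre_diff

def total_boundary_dissimilarity(segments) -> float:
    """Sum of the degrees of difference over every adjacent pair of segments."""
    return sum(boundary_dissimilarity(a, b) for a, b in zip(segments, segments[1:]))
```

In practice the two terms have different units, so a real system would likely scale them before adding; the unweighted sum here only keeps the sketch short.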

(2) Degree of difference between the speech based on the reading and the speech of the phoneme piece data

For each item of phoneme piece data in the voice data, the calculation unit 320 compares the prosody of that phoneme piece data with the prosody determined from the reading of the corresponding phoneme. The prosody may be defined by speech waveform data representing the fundamental frequency. For example, the calculation unit 320 may compare the sum or the average of the frequencies of the two sets of speech waveform data, and the difference between them is calculated as the degree of difference. Instead of this, or in addition to it, the calculation unit 320 compares vector data representing the timbre of each item of phoneme piece data with vector data determined from the reading of each phoneme. The calculation unit 320 then calculates, for the timbre at the beginning or end of the phoneme, the distance between the two vectors as the degree of difference. In addition, the calculation unit 320 may use the duration of each phoneme. For example, the phrase search unit 410 calculates a desirable duration for each phoneme from the reading, and the phoneme segment search unit 420 retrieves the phoneme piece data whose duration is closest to it. In this case, the calculation unit 320 calculates the difference between these two durations as the degree of difference.
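A minimal sketch of this per-phoneme comparison is given below. The `PhonemeSpec` record, its fields, and the choice to add the three terms without weights are assumptions. The text allows any subset of the three comparisons to be used.

```python
import math
from dataclasses import dataclass
from typing import Sequence

@dataclass
class PhonemeSpec:
    f0_mean: float           # average fundamental frequency (Hz)
    timbre: Sequence[float]  # timbre vector at the phoneme's start or end
    duration: float          # length of the pronunciation (seconds)

def target_dissimilarity(wanted: PhonemeSpec, retrieved: PhonemeSpec) -> float:
    """Difference between what the reading calls for and what was retrieved."""
    prosody_diff = abs(wanted.f0_mean - retrieved.f0_mean)
    timbre_diff = math.dist(wanted.timbre, retrieved.timbre)
    duration_diff = abs(wanted.duration - retrieved.duration)
    return prosody_diff + timbre_diff + duration_diff

# Hypothetical values: the reading calls for 200 Hz, the closest stored unit is 190 Hz.
wanted = PhonemeSpec(200.0, (0.0, 0.0), 0.25)
retrieved = PhonemeSpec(190.0, (3.0, 4.0), 0.25)
print(target_dissimilarity(wanted, retrieved))
```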

The calculation unit 320 may use the plain sum of the degrees of difference calculated above as the index value, or it may use a weighted sum. The calculation unit 320 may also feed the degrees of difference into a predetermined evaluation function and use its output as the index value. In short, the index value may be any value that reflects both the difference in speech at the connection boundaries and the difference between the speech based on the reading and the speech based on the phoneme piece data.

The determination unit 330 determines whether the index value calculated in this way is equal to or greater than a predetermined reference value (S720). If it is (S720: YES), the replacement unit 350 compares the text with the paraphrase storage unit 340 and searches the text for a notation that matches any of the first notations (S730). The replacement unit 350 then replaces the notation found with the second notation corresponding to that first notation.
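The combination step and the S720 test can be sketched as below. The function names and the default of equal weights are assumptions. The only behavior taken from the text is that a plain or weighted sum is allowed and that paraphrasing is triggered when the index value is equal to or greater than the reference value.

```python
def index_value(dissimilarities, weights=None):
    """Plain sum when no weights are given, otherwise a weighted sum."""
    if weights is None:
        weights = [1.0] * len(dissimilarities)
    return sum(w * d for w, d in zip(weights, dissimilarities))

def needs_paraphrase(index, reference):
    """S720: paraphrase when the index value is at or above the reference value."""
    return index >= reference

# Hypothetical degrees of difference for boundary and per-phoneme terms.
score = index_value([1.0, 2.0], weights=[0.5, 0.25])
print(score, needs_paraphrase(score, reference=0.55))
```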

The replacement unit 350 may compare every phrase in the text with the first notations as a replacement candidate, or it may compare only some of them. Preferably, the replacement unit 350 excludes certain sentences in the text from replacement even when a first notation is found in them. For example, the replacement unit 350 leaves unchanged any sentence that contains a proper noun or a numeral, and searches for notations matching a first notation only in sentences that contain neither. Sentences containing numerals or proper nouns often demand exactness of meaning, so skipping them prevents replacements that would change the meaning significantly.
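The sentence filter might be sketched as follows, assuming each sentence has already been split by morphological analysis into (surface, part-of-speech) pairs. The tag names `proper_noun` and `numeral` and the sample sentences are hypothetical.

```python
# Part-of-speech tags that mark a sentence as off-limits (assumed tag names).
PROTECTED = {"proper_noun", "numeral"}

def replaceable_sentences(sentences):
    """Keep only sentences that contain neither a proper noun nor a numeral."""
    return [
        sentence
        for sentence in sentences
        if not any(pos in PROTECTED for _, pos in sentence)
    ]

sentences = [
    [("大阪", "proper_noun"), ("に", "particle"), ("行く", "verb")],
    [("窓", "noun"), ("を", "particle"), ("開ける", "verb")],
    [("三", "numeral"), ("時", "noun"), ("に", "particle"), ("会う", "verb")],
]
print(len(replaceable_sentences(sentences)))
```

Only the second sentence survives the filter, so only it would be searched for first notations.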

To make processing more efficient still, the replacement unit 350 may compare only specific parts of the text, those that are candidates for replacement, with the first notations. For example, the replacement unit 350 scans the text from the beginning and selects, one after another, each combination of a predetermined number of consecutive phrases. If the text contains the phrases A, B, C, D, and E and the predetermined number is 3, the replacement unit 350 selects ABC, BCD, and CDE in this order. The replacement unit 350 then calculates, for each selected combination, an index value indicating the unnaturalness of the corresponding synthesized speech.

Specifically, for each combination of phrases, the replacement unit 350 sums the degrees of difference in speech at the connection boundaries between the phonemes in that combination. It then divides this total by the number of connection boundaries in the combination to obtain the average degree of difference per connection boundary. Likewise, the replacement unit 350 sums, over the phonemes in the combination, the degrees of difference between the synthesized speech and the speech based on the reading, and divides by the number of phonemes in the combination to obtain the average degree of difference per phoneme. The replacement unit 350 then calculates the sum of these two averages as the index value. Finally, for the phrases in the combination with the largest index value, the replacement unit 350 searches the paraphrase storage unit 340 for a first notation that matches the notation of one of those phrases. For example, if BCD has the largest index value among ABC, BCD, and CDE, the replacement unit 350 selects BCD and searches within it for a phrase that matches a first notation.

In this way, the neighborhood of the most unnatural part is given priority for replacement, which makes the replacement processing more efficient overall.
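The window scan and the per-window index value can be sketched as below. The function names and the sample scores are hypothetical, and a real system would derive the costs from the synthesized voice data rather than from a lookup table.

```python
def windows(words, n):
    """All runs of n consecutive words, in text order."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def window_index(boundary_costs, phoneme_costs):
    """Average cost per connection boundary plus average cost per phoneme."""
    mean_boundary = sum(boundary_costs) / len(boundary_costs) if boundary_costs else 0.0
    mean_phoneme = sum(phoneme_costs) / len(phoneme_costs) if phoneme_costs else 0.0
    return mean_boundary + mean_phoneme

def worst_window(words, n, score):
    """Window with the largest index value, or None if the text is shorter than n."""
    return max(windows(words, n), key=score, default=None)

# Hypothetical index values for the three windows of the A..E example.
scores = {"ABC": 0.3, "BCD": 0.9, "CDE": 0.5}
print(worst_window(list("ABCDE"), 3, lambda w: scores["".join(w)]))
```

Averaging rather than summing keeps windows with many phonemes from being selected merely because they are long.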

Next, the determination unit 330 inputs the replaced text to the synthesis unit 310 so that voice data is generated for it, and returns the processing to S700. If, on the other hand, the index value is less than the reference value (S720: NO), the display unit 335 displays the text with its notations replaced to the user (S740). The determination unit 330 then determines whether an input approving the replacement has been received for the displayed text (S750). If such an input has been received (S750: YES), the determination unit 330 outputs the voice data based on the replaced text (S770). If an input indicating that the replacement is not approved has been received (S750: NO), the determination unit 330 outputs the voice data based on the text before replacement, regardless of its index value (S760). The output unit 370 then outputs the synthesized speech.
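The overall flow of FIG. 7 can be sketched as a loop. Everything passed in as a callable (`synthesize`, `index_of`, `paraphrase`, `approve`) is a hypothetical stand-in for the corresponding unit, and the `max_rounds` cap and the early exit when no paraphrase remains are assumptions added so the sketch always terminates. The text itself does not describe them.

```python
def synthesize_with_paraphrasing(text, synthesize, index_of, paraphrase,
                                 approve, reference, max_rounds=20):
    """Return (text used, voice data) after the S700 to S770 loop."""
    original_audio = synthesize(text)                # S700 on the input text
    current, audio = text, original_audio
    for _ in range(max_rounds):
        if index_of(audio) < reference:              # S710, S720: NO
            break
        replaced = paraphrase(current)               # S730
        if replaced is None or replaced == current:  # nothing left to replace (assumed exit)
            break
        current = replaced
        audio = synthesize(current)                  # back to S700
    if current != text and approve(current):         # S740, S750
        return current, audio                        # S770
    return text, original_audio                      # S760, or no replacement was made

# Toy stand-ins: the "voice data" is just the text, and "boku" sounds unnatural.
result = synthesize_with_paraphrasing(
    "boku no mado",
    synthesize=lambda t: t,
    index_of=lambda a: 0.9 if "boku" in a else 0.1,
    paraphrase=lambda t: t.replace("boku", "watashi"),
    approve=lambda t: True,
    reference=0.55,
)
print(result)
```

If the user rejects the replaced text, the function falls back to the voice data of the original text, matching step S760.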

FIG. 8 shows a concrete example of the texts generated one after another while the speech synthesis system 10 generates synthesized speech. Text 1 is 「僕のそばの窓のデフロスタをつけてくれよ。」 ("Turn on the defroster for the window beside me, will you."). The voice data that the synthesis unit 310 generates from this text sounds unnatural, and its index value is higher than the reference value (for example, 0.55). Replacing 「デフロスタ」 with 「デフロスター」 (a spelling of "defroster" with a long final vowel) produces Text 2. The index value of Text 2 is still higher than the reference value, so 「そば」 (soba, "beside") is replaced with 「近く」 (chikaku, "near") to produce Text 3. In the same way, 「僕の」 is replaced with 「私の」, 「くれよ」 is replaced with 「ちょうだい」, and 「ちょうだい」 is replaced with 「ください」, producing Text 6. As the last replacement shows, a phrase that has already been replaced may be replaced again.

The index value of Text 6 is still higher than the reference value, so 「窓の」 is replaced with 「窓の、」, which adds a comma. As this shows, the replacement source or the replacement target (that is, the first notation or the second notation described above) may include a comma. In addition, 「デフロスター」 ("defroster") is replaced with 「デフォッガー」 ("defogger"). The resulting Text 8 has an index value below the reference value, so the output unit 370 outputs synthesized speech based on Text 8.

FIG. 9 shows an example of the hardware configuration of an information processing apparatus 500 that functions as the speech synthesis system 10. The information processing apparatus 500 includes: a CPU peripheral section having a CPU 1000, a RAM 1020, and a graphic controller 1075, which are interconnected by a host controller 1082; an input/output section having a communication interface 1030, a hard disk drive 1040, and a CD-ROM drive 1060, which are connected to the host controller 1082 by an input/output controller 1084; and a legacy input/output section having a ROM 1010, a flexible disk drive 1050, and an input/output chip 1070, which are connected to the input/output controller 1084.

The host controller 1082 connects the RAM 1020 with the CPU 1000 and the graphic controller 1075, which access the RAM 1020 at a high transfer rate. The CPU 1000 operates according to programs stored in the ROM 1010 and the RAM 1020 and controls each part. The graphic controller 1075 obtains image data that the CPU 1000 or another component generates in a frame buffer provided in the RAM 1020, and displays it on a display device 1080. Alternatively, the graphic controller 1075 may itself contain the frame buffer that stores the image data generated by the CPU 1000 or another component.

The input/output controller 1084 connects the host controller 1082 with the communication interface 1030, the hard disk drive 1040, and the CD-ROM drive 1060, which are relatively high-speed input/output devices. The communication interface 1030 communicates with external devices over a network. The hard disk drive 1040 stores the programs and data used by the information processing apparatus 500. The CD-ROM drive 1060 reads programs or data from a CD-ROM 1095 and provides them to the RAM 1020 or the hard disk drive 1040.

The input/output controller 1084 is also connected to the ROM 1010 and to relatively low-speed input/output devices such as the flexible disk drive 1050 and the input/output chip 1070. The ROM 1010 stores a boot program that the CPU 1000 executes when the information processing apparatus 500 starts up, programs that depend on the hardware of the information processing apparatus 500, and the like. The flexible disk drive 1050 reads programs or data from a flexible disk 1090 and provides them to the RAM 1020 or the hard disk drive 1040 via the input/output chip 1070. The input/output chip 1070 connects the flexible disk 1090 and, via a parallel port, serial port, keyboard port, mouse port, or the like, various other input/output devices.

A program provided to the information processing apparatus 500 is stored on a recording medium such as the flexible disk 1090, the CD-ROM 1095, or an IC card, and is supplied by the user. The program is read from the recording medium via the input/output chip 1070 and/or the input/output controller 1084, installed on the information processing apparatus 500, and executed. The operations that the program causes the information processing apparatus 500 and other components to perform are the same as the operations of the speech synthesis system 10 described with reference to FIGS. 1 to 8, so their description is omitted.

The program described above may be stored on an external storage medium. Besides the flexible disk 1090 and the CD-ROM 1095, usable storage media include optical recording media such as DVD and PD, magneto-optical recording media such as MD, tape media, and semiconductor memory such as IC cards. A storage device such as a hard disk or RAM in a server system connected to a dedicated communication network or the Internet may also be used as the recording medium, with the program provided to the information processing apparatus 500 over the network.

As described above, the speech synthesis system 10 according to this embodiment changes the notations in the text step by step, within a range that does not greatly alter the meaning, to find a wording for which the phoneme pieces join more naturally, and thereby raises the quality of the synthesized speech. This makes it possible to generate higher-quality speech even where acoustic processing alone, such as joining phonemes or modifying frequencies, reaches a limit in quality. Sound quality is evaluated accurately by using measures such as the degree of difference in speech at the connection boundaries between phonemes. This allows accurate decisions about whether to replace and about which part to replace.

The present invention has been described above using an embodiment, but the technical scope of the invention is not limited to the scope described in that embodiment. It will be apparent to those skilled in the art that various changes or improvements can be made to the embodiment. It is clear from the claims that forms incorporating such changes or improvements can also fall within the technical scope of the invention.

FIG. 1 shows the overall configuration of the speech synthesis system 10 and the data related to it.
FIG. 2 shows an example of the data structure of the phoneme piece storage unit 20.
FIG. 3 shows the functional configuration of the speech synthesis system 10.
FIG. 4 shows the functional configuration of the synthesis unit 310.
FIG. 5 shows an example of the data structure of the paraphrase storage unit 340.
FIG. 6 shows an example of the data structure of the phrase storage unit 400.
FIG. 7 shows the flow of processing by which the speech synthesis system 10 generates synthesized speech.
FIG. 8 shows a concrete example of the texts generated one after another while the speech synthesis system 10 generates synthesized speech.
FIG. 9 shows an example of the hardware configuration of the information processing apparatus 500 that functions as the speech synthesis system 10.

Explanation of symbols

10 Speech synthesis system
20 Phoneme piece storage unit
310 Synthesis unit
320 Calculation unit
330 Determination unit
335 Display unit
340 Paraphrase storage unit
350 Replacement unit
370 Output unit
400 Phrase storage unit
410 Phrase search unit
420 Phoneme segment search unit
500 Information processing apparatus
600 Phrase data
610 Pronunciation data
620 Accent data
630 Part-of-speech data

Claims

A system for generating synthesized speech,
A phoneme piece storage unit for storing a plurality of phoneme piece data each representing speech of a different phoneme;
A synthesis unit that receives text, reads from the phoneme piece storage unit the phoneme piece data corresponding to each phoneme indicating the pronunciation of the input text, connects the data, and generates voice data indicating synthesized speech of the text;
A calculation unit that calculates an index value indicating unnaturalness of the synthesized speech of the text based on the speech data;
A paraphrase storage unit that stores a second notation that is a paraphrase of the first notation in association with each of the plurality of first notations;
A replacement unit that searches the text for a notation that matches any of the first notations, and replaces the searched notation with the second notation corresponding to the first notation;
A determination unit that outputs the generated voice data on condition that the calculated index value is smaller than a predetermined reference value, and that, on condition that the index value is equal to or greater than the reference value, inputs the replaced text to the synthesis unit so that voice data is further generated for that text.

The system according to claim 1, wherein the calculation unit calculates, as the index value, the degree of difference in pronunciation between first phoneme piece data included in the voice data and second phoneme piece data connected next to the first phoneme piece data, at the boundary between the first phoneme piece data and the second phoneme piece data.

The phoneme storage unit stores, as the phoneme piece data, data indicating a fundamental frequency and a tone color of speech for each phoneme,
The calculation unit calculates a difference in fundamental frequency and tone color between the first phoneme piece data and the second phoneme piece data at a boundary between the first phoneme piece data and the second phoneme piece data. The system according to claim 2, wherein the system is calculated as the index value.

The synthesis unit is
For each of the plurality of phrases, a phrase storage unit that stores how to read the phrase in association with the notation of the phrase;
By searching the phrase storage unit for a phrase whose notation matches with each phrase included in the input text, and reading and connecting from the phrase storage unit the reading corresponding to each searched phrase, A phrase search unit for generating readings;
The phoneme piece data closest to the prosody of each phoneme determined based on the generated reading is retrieved from the phoneme piece storage unit, and the voice data is generated by connecting the read phoneme piece data. Phoneme segment search section and
The calculation unit calculates, as the index value, a difference between a prosody of each phoneme determined based on the generated reading and a prosody indicated by phoneme piece data searched corresponding to each phoneme. The described system.

The synthesis unit includes:
A phrase storage unit that stores, for each of a plurality of phrases, the reading of the phrase in association with the notation of the phrase;
A phrase search unit that searches the phrase storage unit for a phrase whose notation matches each phrase included in the input text, and generates a reading of the text by reading from the phrase storage unit and connecting the readings corresponding to the retrieved phrases; and
A phoneme piece search unit that retrieves from the phoneme piece storage unit the phoneme piece data closest to the timbre of each phoneme determined based on the generated reading, and generates the voice data by connecting the retrieved phoneme piece data,
and, in the described system, the calculation unit calculates, as the index value, a difference between the timbre of each phoneme determined based on the generated reading and the timbre indicated by the phoneme piece data retrieved for that phoneme.
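
The prosody variant and the timbre variant share one pattern: for each phoneme, pick the stored piece nearest to a target, then report the remaining distance as the index value. The following sketch assumes that a target and a stored piece are both plain numeric feature vectors compared by Euclidean distance; the names `select_closest_pieces`, `targets` and `inventory` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Feature = Sequence[float]  # assumption: e.g. (F0 in Hz, duration in s) or a timbre vector


def select_closest_pieces(
    targets: List[Tuple[str, Feature]],   # (phoneme, target feature) from the reading
    inventory: Dict[str, List[Feature]],  # hypothetical phoneme piece storage
) -> Tuple[List[Feature], float]:
    """Return the chosen pieces and the summed target-to-piece difference."""
    chosen: List[Feature] = []
    total_difference = 0.0
    for phoneme, target in targets:
        # Retrieve the stored piece whose feature is closest to the target.
        best = min(inventory[phoneme], key=lambda f: math.dist(f, target))
        chosen.append(best)
        # The residual distance contributes to the index value.
        total_difference += math.dist(best, target)
    return chosen, total_difference
```

A large total means the inventory had no piece near what the reading called for, which is the situation the paraphrase step is meant to avoid.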

The phoneme piece storage unit acquires in advance target speech data, which is speech data indicating the synthesized speech to be generated, and generates and stores in advance a plurality of phoneme piece data indicating the speech of a plurality of phonemes included in the target speech data,
The paraphrase storage unit stores, as each of the plurality of second notations, a notation of a phrase included in the text indicating the content of the target speech data, and
The replacement unit replaces a notation in the input text that matches a first notation with the second notation, which is a notation of a phrase included in the text indicating the content of the target speech data, in the described system.

The system according to claim 1, wherein the replacement unit calculates an index value indicating the unnaturalness of the synthesized speech for each combination of a predetermined number of words written consecutively in the input text, searches the paraphrase storage unit for a first notation that matches the notation of a word included in the combination having the largest calculated index value, and replaces the notation of that word with the corresponding second notation.
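
The window-based search can be sketched as follows, under the assumption that the text is already split into words and that a hypothetical `window_index` callable returns the unnaturalness of a run of `n` consecutive words. Ties are resolved in favour of the earliest window, which is a choice made here and not one stated in the claim.

```python
from typing import Callable, Dict, List


def replace_worst_window(
    words: List[str],
    window_index: Callable[[List[str]], float],  # hypothetical per-window index value
    paraphrases: Dict[str, str],                 # first notation -> second notation
    n: int = 2,                                  # the "predetermined number" of words
) -> List[str]:
    """Paraphrase only the words inside the most unnatural n-word window."""
    if len(words) < n:
        return list(words)
    starts = range(len(words) - n + 1)
    # Find the window of n consecutive words with the largest index value.
    worst = max(starts, key=lambda i: window_index(words[i:i + n]))
    result = list(words)
    for i in range(worst, worst + n):
        if result[i] in paraphrases:
            result[i] = paraphrases[result[i]]
    return result
```

Restricting replacement to the worst window leaves the rest of the wording untouched, so the text changes only where the synthesized speech is expected to sound worst.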

The system according to claim 1, wherein the paraphrase storage unit further stores, in association with each pair of a first notation and a second notation that is a paraphrase of that first notation, a degree of approximation of meaning between the first notation and the second notation, and
the replacement unit, on the condition that notations in the input text match a plurality of the first notations, replaces the matching notation whose first notation has the highest degree of approximation with the second notation corresponding to that first notation.
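
A minimal sketch of this selection rule, assuming the paraphrase storage unit is a list of hypothetical `Paraphrase` records whose `similarity` field holds the degree of approximation of meaning:

```python
from typing import List, NamedTuple


class Paraphrase(NamedTuple):
    first: str         # first notation (to be searched for in the text)
    second: str        # second notation (its paraphrase)
    similarity: float  # degree of approximation of meaning; the scale is an assumption


def replace_best_match(text: str, table: List[Paraphrase]) -> str:
    """Replace only the matching notation with the highest degree of approximation."""
    matches = [p for p in table if p.first in text]
    if not matches:
        return text
    best = max(matches, key=lambda p: p.similarity)
    return text.replace(best.first, best.second, 1)
```

Preferring the closest paraphrase keeps the change in meaning as small as possible each time the text is rewritten.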

The system according to claim 1, wherein the replacement unit does not replace notations in a sentence of the input text that includes at least one of a proper noun and a numeral, and, in a sentence that includes neither a proper noun nor a numeral, replaces a notation that matches a first notation with the second notation corresponding to that first notation.
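
The sentence-level guard can be illustrated as below. The sketch makes strong simplifying assumptions: sentences are split on terminal punctuation, numerals are detected only as digits, and proper nouns come from a hypothetical caller-supplied set in place of a real morphological analysis.

```python
import re
from typing import Dict, Set


def replace_safe_sentences(
    text: str,
    paraphrases: Dict[str, str],  # first notation -> second notation
    proper_nouns: Set[str],       # hypothetical stand-in for proper-noun detection
) -> str:
    """Paraphrase only sentences that contain no proper noun and no numeral."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = []
    for sentence in sentences:
        has_numeral = re.search(r"\d", sentence) is not None  # digits only (assumption)
        has_proper_noun = any(name in sentence for name in proper_nouns)
        if not (has_numeral or has_proper_noun):
            for first, second in paraphrases.items():
                sentence = sentence.replace(first, second)
        result.append(sentence)
    return " ".join(result)
```

Sentences carrying names or numbers tend to state facts that a paraphrase could distort, which is presumably why they are excluded from replacement.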

The system according to claim 1, further comprising a display unit that displays to the user the text whose notation has been replaced, on the condition that the notation has been replaced by the replacement unit,
wherein the determination unit outputs voice data based on the text in which the notation is replaced on the condition that an input approving the replacement is received for the displayed text, and outputs voice data based on the text before replacement, regardless of the index value, on the condition that an input not approving the replacement is received.

A method for generating synthesized speech,
Storing a plurality of phoneme piece data each representing the speech of a different phoneme;
Inputting text, reading from the stored phoneme piece data and connecting the phoneme piece data corresponding to each phoneme indicating the pronunciation of the input text, and generating voice data indicating the synthesized speech of the text;
Calculating an index value indicating unnaturalness of the synthesized speech of the text based on the speech data;
Storing a second notation that is a paraphrase of the first notation in association with each of the plurality of first notations;
Searching the text for a notation that matches any of the first notations, and replacing the searched notation with the second notation corresponding to the first notation;
Outputting the generated voice data on the condition that the calculated index value is smaller than a predetermined reference value, and further generating voice data by synthesizing speech of the replaced text on the condition that the index value is equal to or greater than the reference value.

A program for causing an information processing device to function as a system for generating synthesized speech,
The program causing the information processing device to function as:
A phoneme piece storage unit for storing a plurality of phoneme piece data each representing speech of a different phoneme;
A synthesis unit that inputs text, reads from the phoneme piece storage unit and connects the phoneme piece data corresponding to each phoneme indicating the pronunciation of the input text, and generates voice data indicating synthesized speech of the text;
A calculation unit that calculates an index value indicating unnaturalness of the synthesized speech of the text based on the speech data;
A paraphrase storage unit that stores a second notation that is a paraphrase of the first notation in association with each of the plurality of first notations;
A replacement unit that searches the text for a notation that matches any of the first notations, and replaces the searched notation with the second notation corresponding to the first notation;
A determination unit that outputs the generated voice data on the condition that the calculated index value is smaller than a predetermined reference value, and that inputs the replaced text to the synthesis unit to further generate voice data of the replaced text on the condition that the index value is equal to or greater than the reference value.