US20130013313A1 - Statistical enhancement of speech output from a statistical text-to-speech synthesis system - Google Patents
- Publication number
- US20130013313A1 (U.S. application Ser. No. 13/177,577)
- Authority
- US
- United States
- Prior art keywords
- corrective
- indicator
- parametric
- feature vector
- component
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
Definitions
- This invention relates to the field of synthesized speech.
- the invention relates to statistical enhancement of synthesized speech output from a statistical text-to-speech (TTS) synthesis system.
- Synthesized speech is artificially produced human speech generated by computer software or hardware.
- a TTS system converts language text into a speech signal or waveform suitable for digital-to-analog conversion and playback.
- One form of TTS system uses concatenative synthesis, in which pieces of recorded speech are selected from a database and concatenated to form the speech signal conveying the input text.
- Typically, the stored speech pieces represent phonetic units, e.g. sub-phones, phones, diphones, appearing in a certain phonetic-linguistic context.
- Another class of speech synthesis, referred to as "statistical TTS", creates the synthesized speech signal by statistical modeling of the human voice. Existing statistical TTS systems are based on hidden Markov models (HMM) with Gaussian mixture emission probability distributions, so "HMM TTS" and "statistical TTS" are sometimes used synonymously.
- However, in principle a statistical TTS system may employ other types of models. Hence the description of the present invention addresses statistical TTS in general, while HMM TTS is considered a particular example of the former.
- the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech may be modeled simultaneously by HMMs.
- Speech waveforms may be generated from HMMs based on the maximum likelihood criterion.
- HMM-based TTS systems have gained increased popularity in the industry and speech research community due to certain advantages of this approach over the concatenative synthesis paradigm.
- However, it is commonly acknowledged that HMM TTS systems produce speech of dimmed quality, lacking the crispness and liveliness that are present in natural speech and preserved to a large extent in concatenative TTS output.
- In general, the dimmed quality in HMM-based systems is attributed to spectral shape smearing, and in particular to formant widening, as a result of statistical modeling that involves averaging of vast amounts (e.g. thousands) of feature vectors representing speech frames.
- The formant smearing effect has been known for many years in the field of speech coding, although in HMM TTS this effect has a stronger negative impact on the perceptual quality of the output.
- Some speech enhancement techniques (also known as postfiltering) have been developed for speech codecs in order to compensate for quantization noise and sharpen the formants at the decoding phase.
- Some TTS systems follow this approach and employ a post-processing enhancement step aimed at partial compensation of the spectral smearing effect.
- FIG. 1 is a graph showing the smearing effect of spectral envelopes derived from cepstral vectors associated with the same context-dependent phonetic unit for real and synthetic speech;
- FIG. 2 is a stemmed plot of components of a ratio vector for a context-dependent phonetic unit with the components of the ratio vector plotted against quefrency;
- FIG. 3 is a block diagram of a first embodiment of a system in accordance with the present invention.
- FIG. 4 is a block diagram of a second embodiment of a system in accordance with the present invention.
- FIG. 5 is a block diagram of a computer system in which the present invention may be implemented.
- FIG. 6 is a flow diagram of a method in accordance with the present invention.
- FIG. 7 is a flow diagram of a first embodiment of a method in accordance with the present invention applied in an on-line operational mode.
- FIG. 8 is a flow diagram of a second embodiment of a method in accordance with the present invention applied in an off-line/on-line operational mode.
- a statistical compensation method is used on the speech output from a statistical TTS system.
- Distortion may be reduced in synthesized speech by compensating for the spectral smearing effect inherent in statistical TTS systems, and other distortions, by applying a corrective transformation to acoustic feature vectors generated by the system.
- an instantaneous spectral envelope of speech is parameterised, i.e. represented by an acoustic feature vector.
- the spectral envelope may combine the vocal tract and the glottal pulse related components. In this case, the influence of the glottal pulse on the spectral envelope is typically ignored, and the spectral envelope is deemed to be related to the vocal tract. In other systems, the glottal pulse and the vocal tract may be modeled and generated separately.
- the method is applied to the case of a single spectral envelope. In other embodiments, the method may be applied separately to the vocal tract and glottal pulse related components.
- a parameterized spectral envelope associated with each distinct phonetic unit is modeled by a separate probability distribution.
- These distinct units are usually parts of a phone taken in certain phonetic-linguistic context.
- For example, in a typical 3-state HMM-based system, each phone taken in a certain phonetic and linguistic context is modeled by a 3-state HMM.
- In this case the phonetic unit represents one third (the beginning, middle or end) of a phone taken in a context and is modeled by a multivariate Gaussian mixture probability density function.
- The same is true for systems utilizing hidden semi-Markov models (HSMM), where the state transition probabilities are not used and the unit durations are modeled directly.
- Other statistical TTS methods to which the described method may be applied may use models other than HMM states, with emission probability modeled by probability distributions other than Gaussian.
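- To make the sketches that follow concrete, a minimal illustrative container for the emission statistics of one phonetic unit is shown below; the class name and field layout are assumptions for illustration only, not structures taken from the patent or from any particular TTS toolkit.

```python
# Illustrative only: per-unit Gaussian-mixture emission statistics assumed by the later sketches.
from dataclasses import dataclass
import numpy as np

@dataclass
class PhoneticUnitModel:
    means: np.ndarray      # shape (num_mixtures, N): mean cepstral vectors of the Gaussians
    variances: np.ndarray  # shape (num_mixtures, N): diagonal variance vectors
    weights: np.ndarray    # shape (num_mixtures,): mixture weights summing to 1
```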
- acoustic features may be used for the spectral envelope parameterisation in statistical TTS systems.
- an acoustic feature vector in the form of a cepstral vector is used.
- other forms of acoustic feature vectors may be used, such as Line Spectral Frequencies (LSF) also referred to as Line Spectral Pairs (LSP).
- a power cepstrum is the result of taking the inverse Fourier transform of the log-spectrum.
- the frequency axis is warped prior to the cepstrum calculation.
- One of the popular frequency warping transformations is Mel-scale warping, reflecting perceptual properties of the human auditory system.
- the continuous spectral envelope is not available immediately from the voiced speech signal which has a quasi-periodic nature.
- Hence, there are a number of widely used techniques for cepstrum estimation, each based on a distinct method of spectral envelope estimation. Examples of such techniques are Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Predictive (PLP) cepstrum and Mel-scale Regularized Cepstral Coefficients (MRCC).
- Each component of a cepstral vector has an index referred to as quefrency; for example, the c(2) component is associated with quefrency 2, i.e. c(2) is the cepstrum value at quefrency 2.
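- As a rough illustration of the cepstral representation, the sketch below computes a plain (non-warped) cepstral vector as the inverse Fourier transform of a log spectral envelope. It is a simplified assumption for illustration: practical systems first estimate the envelope and apply Mel-scale warping (e.g. MFCC, PLP or MRCC), which is omitted here.

```python
import numpy as np

def cepstrum_from_envelope(envelope: np.ndarray, n_coeffs: int = 33) -> np.ndarray:
    """Toy cepstrum: inverse FFT of the log spectral envelope (no Mel warping).
    `envelope` samples a magnitude spectral envelope; returns c(1)..c(n_coeffs)."""
    log_spectrum = np.log(np.maximum(envelope, 1e-10))            # guard against log(0)
    full = np.concatenate([log_spectrum, log_spectrum[-2:0:-1]])  # mirror to a symmetric spectrum
    cepstrum = np.real(np.fft.ifft(full))
    return cepstrum[1:n_coeffs + 1]                               # drop c(0), keep quefrencies 1..N
```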
- the method proposed in the present invention does not exploit specific properties of Markov models or properties of Gaussian mixture models. Hence the method is applicable to any statistical TTS system that models the spectral envelope of a phonetic unit by a probability distribution defined in the space of acoustic feature vectors.
- the over-smoothed nature of the speech generated by a statistical TTS system is due to spectral shape smearing as a result of statistical modeling of cepstral vectors (or other acoustic feature vectors) for each phonetic unit.
- FIG. 1 is a graph 100 plotting amplitude 101 against frequency 102 with spectral envelopes derived from cepstral vectors selected from the real cluster 103 and synthetic cluster 104 associated with a certain unit drawn with dashed and solid lines respectively.
- the synthetic vectors 104 show flatter spectra with lower peaks and higher valleys compared to the real vectors 103 .
- the L2-norm of a sub-vector extracted from the full 33-dimensional cepstral vector [C(1), C(2), . . . , C(33)] was calculated.
- Sub-vectors were analyzed containing lowest quefrency coefficients [C(1) . . . C(11)], middle quefrency coefficients [C(12) . . . C(22)] and highest quefrency coefficients [C(23) . . . C(33)]. It was seen that the L2-norm of the middle quefrency and highest quefrency sub-vectors was systematically lower within the synthetic cluster than within the real cluster. At the same time the L2-norm of the lowest quefrency sub-vectors did not vary significantly between the real and synthetic clusters.
- M_real² and M_syn² are the component-wise empirical second moments of the real and synthetic vectors, respectively.
- the second moment vectors were smoothed along the quefrency axis with the 5-tap moving average operator prior to calculating the ratio vector (3).
- the stemmed plot 200 represents the components of the L2-norm ratio vector R calculated for the same unit analyzed on FIG. 1 with L2-norm ratio 201 plotted against quefrency 202 .
- the ratio vector components exhibit an increasing trend along the quefrency axis 202 which means that the synthetic vectors have a stronger attenuation than the real vectors on average. This statistical observation was validated on all the units of several male and female voice models in three languages summing up to about 7000 HMM states.
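- A minimal sketch of the analysis above, under the assumption that the ratio vector of relation (3) is the component-wise square root of the ratio of the smoothed second moments (the exact formula is not reproduced in this text); the 5-tap moving average corresponds to the smoothing mentioned above.

```python
import numpy as np

def smooth_5tap(v: np.ndarray) -> np.ndarray:
    """5-tap moving average along the quefrency axis."""
    return np.convolve(v, np.ones(5) / 5.0, mode="same")

def ratio_vector(real_vectors: np.ndarray, syn_vectors: np.ndarray) -> np.ndarray:
    """Component-wise attenuation ratio between real and synthetic cepstral clusters.
    Both inputs have shape (num_frames, N); a rising trend over quefrency indicates
    that the synthetic vectors are more strongly attenuated than the real ones."""
    m2_real = smooth_5tap(np.mean(real_vectors ** 2, axis=0))  # empirical second moments
    m2_syn = smooth_5tap(np.mean(syn_vectors ** 2, axis=0))
    return np.sqrt(m2_real / np.maximum(m2_syn, 1e-12))
```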
- the analysis above is used to compensate for this stronger attenuation of synthetic vectors prior to rendering the synthesized speech waveform.
- the attenuation of cepstrum coefficients in quefrency is considered.
- Other indications of acoustic distortion may be used for other forms of acoustic feature vectors, such as Line Spectral Frequencies.
- the distortion indicator may indicate (or enable a derivation of) a degree of spectral smoothness or other spectral distortion.
- the enhanced output vector O is:
- The general idea of the described method is to define a parametric family of smooth positive corrective functions W_p(n) (e.g. exponential), dependent on a parameter set p, and to calculate the parameter values either for each phonetic unit or for each emitted cepstral vector, so that the cepstral attenuation degree (and the corresponding spectral sharpness degree) after the liftering matches the average level observed in the corresponding real cluster.
- The described method statistically controls the corrective liftering so as to greatly improve the quality of the synthesized speech while preventing over-liftering that would introduce audible distortions.
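- Since liftering is a component-wise weighting in the cepstral (quefrency) domain, the enhanced output vector can be sketched as O(n) = W_p(n)·C(n); the snippet below is a minimal illustration of this step, not a reproduction of the patent's own formula.

```python
import numpy as np

def apply_liftering(c: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Apply a corrective liftering vector W to a synthetic cepstral vector C:
    O(n) = W(n) * C(n), component-wise over quefrency."""
    return w * c
```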
- Let W_p(n) be a parametric family of corrective liftering functions dependent on an enhancing parameter set p, and let H(X) be a vectorial function of a cepstral vector X indicative of its attenuation. H(X) is referred to as the attenuation indicator.
- A reference value H_real of the attenuation indicator may be calculated for a unit L by averaging H(X) over the real cluster associated with that unit:
- H_real = E{ H(X), X ∈ real cluster of L }   (5)
- An actual value H_syn of the attenuation indicator may be calculated by averaging H(X) over the synthetic cluster created in advance for the unit L:
- H_syn = E{ H(X), X ∈ synthetic cluster of L }   (6.1)
- Alternatively, H_syn may be calculated from the same single synthetic vector C to be processed: H_syn = H(C)   (6.2)
- Optimal values of the enhancing parameters may be calculated that provide the best approximation of the reference value of the attenuation indicator:
- p* = argmin_p D(H_real, H_syn, W_p)   (7)
- where D(H_real, H_syn, W_p) is an enhancement criterion that measures the dissimilarity between the reference value of the attenuation indicator and a predicted actual value of the attenuation indicator after applying the corrective liftering W_p.
- In one choice, the optimal enhancing parameter set p and the corrective liftering vector W_p associated with each unit may be calculated off-line, prior to exploitation of the enhanced system, and stored.
- the corresponding pre-stored liftering function may be applied to each synthetic vector C. This choice simplifies the implementation of the run-time component of the enhanced system.
- Alternatively, the calculation of the optimal corrective liftering vector W_p may be performed for each vector C emitted from the statistical model at run-time. Only the reference values H_real may be calculated off-line and stored. At synthesis time the reference value H_real associated with the corresponding unit may be passed to the enhancement algorithm. This choice removes the need to build the synthetic clusters for each unit. Moreover, with a proper selection of the attenuation indicator H(X), as described below, there is no need to store the H_real vectors; instead they are easily derived from the statistical model parameters, and the proposed method may be applied to pre-existing voice models built for the original TTS system.
- Relation (2) suggests a simple and mathematically tractable exponential corrective function, in which case the enhancing parameter set p is comprised of a single scalar exponent base α.
- The exponential liftering results in a uniform radial migration of the poles and zeros towards the unit circle of the complex plane, which directly relates to spectrum sharpening without changing the locations of the peaks and valleys on the frequency axis.
- The degree of the spectrum sharpening depends on the selected exponent base α. Too high an α may overemphasize the spectral formants and even render the inverse cepstrum transform unstable; on the other hand, too low an α may not yield the expected enhancement effect. This is why statistical control over the liftering parameters is important.
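- Assuming the exponential corrective function takes the classical liftering form W_α(n) = α^n (an assumption; the patent's own expression is not reproduced in this text), a sketch of its construction is given below. An α slightly above 1 moves the poles and zeros radially outward towards the unit circle and thus sharpens formants, in line with the trade-off described above.

```python
import numpy as np

def exponential_lifter(alpha: float, n_coeffs: int) -> np.ndarray:
    """Exponential liftering vector W_alpha(n) = alpha**n for quefrencies n = 1..N.
    alpha > 1 sharpens the spectral envelope; alpha < 1 smooths it further."""
    n = np.arange(1, n_coeffs + 1)
    return alpha ** n

# Example (values are illustrative only): mildly sharpen a 33-dimensional cepstral vector.
# enhanced = exponential_lifter(1.02, 33) * synthetic_cepstrum
```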
- In another embodiment, the enhancing parameter set may be comprised of three parameters: the base α of the first exponent, the base β of the second exponent and an integer concatenation point τ, i.e. the index of the vector component where the concatenation takes place.
- The reference value H_real given by (5) is the second moment M_real² of the real cluster associated with the phonetic unit L. Practically, there is no need to build the real cluster in order to calculate the vector M_real²; in many cases it can be easily calculated from the probability distribution of the real cepstral vectors. For example, in the case of the Gaussian mixture models used in HMM TTS systems, the reference value may be calculated as:
- where μ_i, σ_i² and w_i are respectively the mean vectors, variance vectors and weights associated with the individual Gaussians.
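- For a diagonal-covariance Gaussian mixture, the component-wise second moment follows from the standard identity E[X²] = Σ_i w_i (σ_i² + μ_i²), so the reference value can be computed directly from the stored model parameters without ever building a real cluster. The sketch below assumes the illustrative PhoneticUnitModel structure from earlier; it is a generic formulation, not a transcription of the patent's equation.

```python
import numpy as np

def reference_second_moment(model: "PhoneticUnitModel") -> np.ndarray:
    """Component-wise second moment of a Gaussian mixture:
    M2_real(n) = sum_i w_i * (var_i(n) + mean_i(n)**2); used as the reference H_real."""
    return np.einsum("i,in->n", model.weights, model.variances + model.means ** 2)
```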
- The actual value H_syn of the attenuation indicator may be either the empirical second moment of the cepstral vectors calculated over the synthetic cluster, or the squared vector C to be enhanced, depending on the choice between (6.1) and (6.2).
- The components of the vectors H_real and H_syn may optionally be smoothed by a short filter, such as a 5-tap moving average filter.
- The enhancement criterion D(H_real, H_syn, W_p) appearing in (7) may be defined as:
- the enhancement criterion may be defined as:
- The calculation (7) of the optimal enhancing parameter α may be achieved by log-linear regression:
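- One plausible form of that log-linear regression, under the assumption that the liftering α^n scales the second moment by α^(2n) so that log H_real(n) ≈ log H_syn(n) + 2n·log α, is sketched below; a least-squares fit of the slope then yields α. This is an assumed reconstruction for illustration, not the patent's equation (17).

```python
import numpy as np

def optimal_alpha(h_real: np.ndarray, h_syn: np.ndarray) -> float:
    """Least-squares fit of alpha in log H_real(n) ~= log H_syn(n) + 2*n*log(alpha)."""
    n = np.arange(1, len(h_real) + 1, dtype=float)
    d = np.log(np.maximum(h_real, 1e-12)) - np.log(np.maximum(h_syn, 1e-12))  # log-moment deficit
    log_alpha = np.sum(n * d) / (2.0 * np.sum(n * n))                          # regression through the origin
    return float(np.exp(log_alpha))
```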
- In FIG. 2, an example of the optimal corrective liftering function calculated according to (17) is drawn with the bold solid line 210.
- An enhanced spectral envelope resulting from the corrective liftering is shown in FIG. 1 by the dashed bold line 110. It can be seen that the enhanced spectral envelope exhibits emphasized peaks and valleys and resembles the real spectra much more closely than the original synthetic spectra.
- For the dual-exponent corrective function, the optimal set of the enhancing parameters may be calculated as follows. Fixing the concatenation point τ, the values of α and β may be calculated as:
- the optimal values of the three parameters may be obtained by scanning all the integer values of ⁇ within a predefined range:
- The optimal value of the exponent base β may be obtained by solving the following equation:
- The optimal enhancing parameters bring the attenuation degree of the synthetic cepstral vectors to the average level observed on the corresponding real cluster. Therefore, the enhancement may be strengthened or softened to some extent relative to the optimal level in order to optimize the perceptual quality of the enhanced synthesized speech.
- the optimal enhancing parameters calculated as described above may be altered depending on certain properties of the corresponding phonetic units emitting the synthetic vectors to be enhanced. For example, the optimal exponent base (17) calculated for vectors emitted from a certain unit of an HMM TTS system may be modified as:
- where a predefined factor F depends on the HMM state number representing that unit, the category of the phone represented by this HMM, and the voicing class of the segments represented by this state.
- The final value α_final may be used for rendering the corrective liftering vector to be applied to the corresponding synthetic cepstral vector.
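- The exact modification formula is not reproduced above. Purely as an illustration of a customization mechanism of this kind, one might blend the optimal exponent base towards 1 (no enhancement) with a unit-dependent factor F, as sketched below; this is a hypothetical form, not the patent's.

```python
def customize_alpha(alpha_opt: float, factor: float) -> float:
    """Illustrative customization only: factor = 1.0 keeps the optimal value,
    factor < 1.0 softens the enhancement, factor > 1.0 strengthens it."""
    return 1.0 + factor * (alpha_opt - 1.0)
```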
- Referring to FIGS. 3 and 4, block diagrams show example embodiments of systems 300, 400 in which the described statistical enhancement of synthesized speech is applied.
- the system 300 includes an on-line enhancement mechanism 340 for a statistical TTS system 310 .
- the system 300 includes a statistical TTS system 310 , for example, an HMM-based system which receives a text input 301 and synthesizes the text to provide a speech output 302 .
- In this embodiment, the TTS system 310 is an HMM-based system which models parameterised speech by a sequence of Markovian processes with unobserved (hidden) states and Gaussian mixture emission probability distributions. In other embodiments, other forms of statistical modeling may be used.
- the statistical TTS system 310 may include a phonetic unit model component 320 including an acoustic feature vector output component 321 for outputting synthetic acoustic feature vectors generated out of this unit model.
- the acoustic feature vector may be a cepstral vector.
- the acoustic feature vector may be a Line Spectral Frequencies vector.
- An initialization unit 330 may be provided including a corrective transformation defining component 331 for defining the parametric corrective transformation to be used for the corrective transformation instance derivation.
- the corrective transformation defining component 331 may also include an enhancing parameter set component 332 for defining the enhancing parameter set to be used.
- the initialization unit 330 may also include a distortion indicator component 333 for defining a distortion indicator to be used and an enhancement criterion component 334 for defining an enhancement criterion to be used.
- the initialization unit 330 may also include an enhancement customization component 335 dependent on unit attributes and enhancing parameters.
- the distortion indicator is an attenuation indicator.
- An on-line enhancement mechanism 340 is provided which may include the following components for enhancing distorted acoustic feature vectors as output by the phonetic unit model component 320 by applying an instance of the corrective transformation.
- the on-line enhancement mechanism 340 may include an inputs component 341 .
- the inputs component 341 may include an acoustic feature vector input component 342 for receiving outputs from the phonetic unit model component 320 .
- For example, a sequence of N-dimensional cepstral vectors.
- the inputs component 341 may also include a real emission statistics component 343 for receiving real emission statistics from the statistical model of the phonetic unit model component 320 .
- the inputs component 341 may also include a unit attributes component 344 for receiving unit attributes of the phonetic unit model component 320 .
- the on-line enhancement mechanism 340 may also include an enhancing parameter set component 350 .
- the enhancing parameter set component 350 may include a distortion indicator reference component 351 and a distortion indicator actual value component 352 for applying the distortion indicator definitions and calculating the actual and reference values for use in the enhancing parameter set derivation.
- the enhancing parameter set component 350 may also include an enhancement criterion applying component 353 for applying a defined enhancement criterion to measure the dissimilarity between the reference value of the distortion indicator and a predicted actual value.
- the enhancing parameter set component 350 may include a customization component 354 for altering optimal enhancing parameter set values according to unit attributes.
- The attributes may include the phone category to which the statistical model is attributed and the voicing class of the majority of speech frames used for the statistical model training.
- the on-line enhancement mechanism 340 may include a corrective transformation generating component 360 and a corrective transformation applying component 365 for applying an instance of the parametric transformation derived from the enhancing parameter set values to an acoustic feature vector yielding an enhanced vector.
- the on-line enhancement mechanism 340 may include an output component 370 for outputting the enhanced vector output 371 for use in a waveform synthesis of the speech component 380 of the statistical TTS system 310 .
- the system 400 shows an alternative embodiment to that of FIG. 3 in which the corrective transformation is generated off-line. Equivalent reference numbers to FIG. 3 are used where possible.
- the system 400 includes a statistical TTS system 410 , for example, an HMM-based system which receives a text input 401 and synthesizes the text to provide a speech output 402 .
- the statistical TTS system 410 may include a phonetic unit model component 420 including an acoustic feature vector output component 421 for outputting synthetic acoustic feature vectors generated out of this unit model.
- an initialization unit 430 may be provided including a corrective transformation defining component 431 for defining the parametric corrective transformation to be used for the corrective transformation instance derivation.
- the corrective transformation defining component 431 may also include a parameter set component 432 for defining the enhancing parameter set to be used.
- the initialization unit 430 may also include a distortion indicator component 433 for defining a distortion indicator to be used and an enhancement criterion component 434 for defining an enhancement criterion to be used.
- the initialization unit 430 may also include an enhancement customization component 435 dependent on unit attributes and enhancing parameters.
- an off-line enhancement calculation mechanism 440 may be provided for generating and storing a corrective transformation instance.
- An on-line enhancement mechanism 470 may be provided to retrieve and apply instances of the corrective transformation during speech synthesis.
- the off-line enhancement calculation mechanism 440 may include an inputs component 441 .
- the inputs component 441 may include a synthetic cluster vector component 442 for collecting a synthetic cluster of acoustic feature vectors for each phonetic unit emitted from the phonetic unit model component 420 .
- the inputs component 441 may also include a real emission statistics component 443 for receiving real emission statistics from the statistical model of the phonetic unit model component 420 .
- the inputs component 441 may also include a unit attributes component 444 for receiving unit attributes of the phonetic unit model component 420 .
- the off-line enhancement calculation mechanism 440 may also include an enhancing parameter set component 450 .
- the enhancing parameter set component 450 may include a distortion indicator reference component 451 and a distortion indicator actual value component 452 for applying the distortion indicator definitions and calculating the actual and reference values for use in the enhancing parameter set derivation.
- the enhancing parameter set component 450 may also include an enhancement criterion applying component 453 for applying a defined enhancement criterion to measure the dissimilarity between the reference value of the distortion indicator and a predicted actual value.
- the enhancing parameter set component 450 may include a customization component 454 for altering optimal enhancing parameter set values according to unit attributes.
- the off-line enhancement calculation mechanism 440 may include a corrective transformation generating and storing component 460 .
- the on-line enhancement mechanism 470 may include a corrective transformation retrieving and applying component 471 for applying the instance of the parametric corrective transformation derived from the enhancing parameter set values to an acoustic feature vector yielding an enhanced vector.
- the on-line enhancement mechanism 470 may include an output component 472 for outputting the enhanced vector output 473 for use in a waveform synthesis of the speech component 480 of the statistical TTS system 410 .
- an exemplary system for implementing aspects of the invention includes a data processing system 500 suitable for storing and/or executing program code including at least one processor 501 coupled directly or indirectly to memory elements through a bus system 503 .
- the memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
- the memory elements may include system memory 502 in the form of read only memory (ROM) 504 and random access memory (RAM) 505 .
- a basic input/output system (BIOS) 506 may be stored in ROM 504 .
- System software 507 may be stored in RAM 505 including operating system software 508 .
- Software applications 510 may also be stored in RAM 505 .
- the system 500 may also include a primary storage means 511 such as a magnetic hard disk drive and secondary storage means 512 such as a magnetic disc drive and an optical disc drive.
- the drives and their associated computer-readable media provide non-volatile storage of computer-executable instructions, data structures, program modules and other data for the system 500 .
- Software applications may be stored on the primary and secondary storage means 511 , 512 as well as the system memory 502 .
- the computing system 500 may operate in a networked environment using logical connections to one or more remote computers via a network adapter 516 .
- Input/output devices 513 can be coupled to the system either directly or through intervening I/O controllers.
- a user may enter commands and information into the system 500 through input devices such as a keyboard, pointing device, or other input devices (for example, microphone, joy stick, game pad, satellite dish, scanner, or the like).
- Output devices may include speakers, printers, etc.
- a display device 514 is also connected to system bus 503 via an interface, such as video adapter 515 .
- a flow diagram 600 shows the described method.
- a parametric family of corrective transformations is defined 601 operating in the space of acoustic feature vectors and dependent on a set of enhancing parameters.
- a distortion indicator of a feature vector may also be defined 602 .
- A feature vector is received 603 as emitted from a phonetic unit of the system.
- An instance of the corrective transformation may be generated 604 from the parametric corrective transformation by applying an optimized set of enhancing parameter values to reduce audible distortions.
- The instance of the corrective transformation may be generated by the following steps: calculating 605 a reference value of the distortion indicator attributed to a statistical model of the phonetic unit emitting the feature vector; calculating 606 an actual value of the distortion indicator attributed to feature vectors emitted by the statistical model of the phonetic unit emitting the feature vector; and calculating 607 a set of enhancing parameter values depending on the reference value of the distortion indicator, the actual value of the distortion indicator, and the parametric corrective transformation.
- the instance of the corrective transformation may be applied 608 to the feature vector to provide an enhanced vector for use in speech synthesis.
- flow diagrams 700 , 800 show example embodiments of the described method in the context of corrective liftering vectors applied to cepstral vectors with distortion indicators in the form of attenuation indicators for smoothing spectral distortion.
- a flow diagram 700 shows steps of an example embodiment of the described method corresponding to the case where cepstral acoustic feature vectors and liftering corrective transformation are used and the corrective liftering vectors are calculated on-line during the synthesis operation.
- A first initialization phase 710 may include defining 711: a parametric family of corrective liftering functions W_P(n) dependent on an enhancing parameter set P; an attenuation indicator H; an enhancement criterion D(H_REAL, H_SYN, W_P); and an enhancement customization mechanism F dependent on unit attributes and enhancing parameters.
- a second phase 720 is the operation of synthesis with enhancement.
- Cepstral vectors may be generated 721 from the statistical model.
- the following may be received 722 : synthetic cepstral vector C emitted from phonetic unit U; emission statistics REALS (e.g. mean and variance) from statistical model of U; and unit attributes UA of phonetic unit U.
- Optimal enhancing parameter values P* may be calculated 724 optimizing the enhancement criterion:
- P* = argmin_P D(H_REAL, H_SYN, W_P).
- The optimal values P* may be customized using the enhancement customization mechanism F and the unit attributes UA, yielding a parameter set P**. A corrective liftering vector W_P** corresponding to P** may be calculated 726 and applied 727 to vector C, yielding the enhanced vector O.
- The enhanced vector O may be used 728 in the waveform synthesis of speech.
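- Pulling the pieces together, the on-line enhancement step of FIG. 7 might be sketched as below, reusing the illustrative helpers from the earlier snippets (PhoneticUnitModel, reference_second_moment, smooth_5tap, optimal_alpha, customize_alpha and exponential_lifter are all assumed names); H_SYN is taken from the single vector C, as in choice (6.2).

```python
import numpy as np

def enhance_cepstral_vector(c: np.ndarray, unit_model: "PhoneticUnitModel",
                            custom_factor: float = 1.0) -> np.ndarray:
    """Illustrative on-line enhancement of one synthetic cepstral vector C (FIG. 7 flow)."""
    h_real = smooth_5tap(reference_second_moment(unit_model))  # reference indicator from model statistics
    h_syn = smooth_5tap(c ** 2)                                # actual indicator from the vector itself
    alpha = optimal_alpha(h_real, h_syn)                       # optimal enhancing parameter (step 724)
    alpha = customize_alpha(alpha, custom_factor)              # unit-dependent customization
    return exponential_lifter(alpha, len(c)) * c               # corrective liftering -> enhanced vector O
```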
- a flow diagram 800 shows steps of an example embodiment of the described method corresponding to the case where cepstral acoustic feature vectors and liftering corrective transformation are used and the corrective liftering vectors are calculated off-line and stored being linked to corresponding phonetic units.
- A first initialization phase 810 may include defining: a parametric family of corrective liftering functions W_P(n) dependent on an enhancing parameter set P; an attenuation indicator H; an enhancement criterion D(H_REAL, H_SYN, W_P); and an enhancement customization mechanism F dependent on unit attributes and enhancing parameters.
- a second phase 820 is an off-line calculation of unit dependent corrective vectors.
- Cepstral vectors may be generated 821 from the statistical model.
- a synthetic cluster of cepstral vectors emitted from phonetic unit U may be collected 822 .
- The synthetic cluster statistics SYNS (e.g. mean and variance) may be calculated 823.
- The emission statistics REALS (e.g. mean and variance) may be fetched 824 from the statistical model of U, together with the unit attributes UA of phonetic unit U.
- Optimal enhancing parameter values P* may be calculated 826 optimising the enhancement criterion:
- P* = argmin_P D(H_REAL, H_SYN, W_P).
- The optimal values P* may be customized using the enhancement customization mechanism F and the unit attributes UA, yielding P**; the corrective liftering vector W_P** corresponding to P** is calculated 828.
- the liftering vector W P** is stored 829 being linked to the unit U.
- During synthesis, a synthetic cepstral vector C is received 831 together with the corrective liftering vector W_P** corresponding to the unit emitting C.
- Corrective liftering vector W P** is applied 832 to vector C yielding enhanced vector O.
- the enhanced vector O is used 833 in waveform synthesis of speech.
- the enhancement method described improves the perceptual quality of synthesized speech by strong reduction of the spectral smearing effect.
- The effect of this enhancement technique consists of moving the poles and zeros of the transfer function corresponding to the synthesized spectral envelope towards the unit circle of the Z-plane, which leads to sharpening of the spectral peaks and valleys.
- This addresses the spectral over-smoothing of HMM-based TTS systems and of statistical TTS systems in general.
- Most HMM TTS systems model the spectral envelopes of frames in the cepstral space, i.e. they use cepstral feature vectors.
- the enhancement technique described works in the cepstral domain and is directly applicable to any statistical system employing cepstral features.
- The described method does not introduce audible distortions because it works adaptively, exploiting statistical information available within the statistical TTS system.
- The corrective transformation applied to a synthetic vector output from the original TTS system is calculated with the goal of bringing the value of a certain characteristic of the enhanced vector to the average level of that characteristic observed on relevant feature vectors derived from real speech.
- the described method does not require building of a new voice model.
- the described method can be employed with a pre-existing voice model.
- The real vector statistics used as a reference for the corrective transformation calculation can be calculated from the cepstral mean and variance vectors readily available within the existing voice model.
- aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
- the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
- a computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
- a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof.
- a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
- the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
- the computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
Abstract
Description
- This invention relates to the field of synthesized speech. In particular, the invention relates to statistical enhancement of synthesized speech output from a statistical text-to-speech (TTS) synthesis system.
- Synthesized speech is artificially produced human speech generated by computer software or hardware. A TTS system converts language text into a speech signal or waveform suitable for digital-to-analog conversion and playback.
- One form of TTS system uses concatenative synthesis, in which pieces of recorded speech are selected from a database and concatenated to form the speech signal conveying the input text. Typically, the stored speech pieces represent phonetic units, e.g. sub-phones, phones, diphones, appearing in a certain phonetic-linguistic context.
- Another class of speech synthesis, referred to as “statistical TTS”, creates the synthesized speech signal by statistical modeling of the human voice. Existing statistical TTS systems are based on hidden Markov models (HMM) with Gaussian mixture emission probability distribution, so “HMM TTS” and “statistical TTS” may sometimes be used synonymously. However, in principle a statistical TTS system may employ other types of models. Hence the description of the present invention addresses statistical TTS in general while HMM TTS is considered a particular example of the former.
- In an HMM-based system the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech may be modeled simultaneously by HMMs. Speech waveforms may be generated from HMMs based on the maximum likelihood criterion.
- HMM-based TTS systems have gained increased popularity in the industry and the speech research community due to certain advantages of this approach over the concatenative synthesis paradigm. However, it is commonly acknowledged that HMM TTS systems produce speech of dimmed quality, lacking the crispness and liveliness that are present in natural speech and preserved to a large extent in concatenative TTS output. In general, the dimmed quality in HMM-based systems is attributed to spectral shape smearing, and in particular to formant widening, as a result of statistical modeling that involves averaging of vast amounts (e.g. thousands) of feature vectors representing speech frames.
- The formant smearing effect has been known for many years in the field of speech coding, although in HMM TTS this effect has a stronger negative impact on the perceptual quality of the output. Some speech enhancement techniques (also known as postfiltering) have been developed for speech codecs in order to compensate for quantization noise and sharpen the formants at the decoding phase. Some TTS systems follow this approach and employ a post-processing enhancement step aimed at partial compensation of the spectral smearing effect.
- According to a first aspect of the present invention there is provided a method for enhancement of speech synthesized by a statistical text-to-speech (TTS) system employing a parametric representation of speech in a space of acoustic feature vectors, comprising: defining a parametric family of corrective transformations operating in the space of the acoustic feature vectors and dependent on a set of enhancing parameters; defining a distortion indicator of a feature vector or a plurality of feature vectors; receiving a feature vector output by the system; generating an instance of the corrective transformation by: calculating a reference value of the distortion indicator attributed to a statistical model of the phonetic unit emitting the feature vector; calculating an actual value of the distortion indicator attributed to feature vectors emitted by the statistical model of the phonetic unit emitting the feature vector; calculating the enhancing parameter values depending on the reference value of the distortion indicator, the actual value of the distortion indicator and the parametric corrective transformation; deriving an instance of the corrective transformation corresponding to the enhancing parameter values from the parametric family of the corrective transformations; and applying the instance of the corrective transformation to the feature vector to provide an enhanced feature vector.
- According to a second aspect of the present invention there is provided a computer program product for enhancement of speech synthesized by a statistical text-to-speech (TTS) system employing a parametric representation of speech in a space of acoustic feature vectors, the computer program product comprising: a computer readable non-transitory storage medium having computer readable program code embodied therewith, the computer readable program code comprising: computer readable program code configured to: define a parametric family of corrective transformations operating in the space of the acoustic feature vectors and dependent on a set of enhancing parameters; define a distortion indicator of a feature vector or a plurality of feature vectors; receive a feature vector output by the system; generate an instance of the corrective transformation by: calculating a reference value of the distortion indicator attributed to a statistical model of the phonetic unit emitting the feature vector; calculating an actual value of the distortion indicator attributed to feature vectors emitted by the statistical model of the phonetic unit emitting the feature vector; calculating the enhancing parameter values depending on the reference value of the distortion indicator, the actual value of the distortion indicator and the parametric corrective transformation; deriving an instance of the corrective transformation corresponding to the enhancing parameter values from the parametric family of the corrective transformations; and applying the instance of the corrective transformation to the feature vector to provide an enhanced feature vector.
- According to a third aspect of the present invention there is provided a system for enhancement of speech synthesized by a statistical text-to-speech (TTS) system employing a parametric representation of speech in a space of acoustic feature vectors, comprising: a processor; an acoustic feature vector input component for receiving an acoustic feature vector emitted by a phonetic unit; a corrective transformation defining component for defining a parametric family of corrective transformations operating in the space of the acoustic feature vectors and dependent on a set of enhancing parameters; an enhancing parameter set component including: a distortion indicator reference component for calculating a reference value of a distortion indicator attributed to a statistical model of the phonetic unit emitting the feature vector; a distortion indicator actual value component for calculating an actual value of the distortion indicator attributed to feature vectors emitted by the statistical model of the phonetic unit emitting the feature vector; and wherein the enhancing parameter set component calculates the enhancing parameter values depending on the reference value of the distortion indicator, the actual value of the distortion indicator and the parametric corrective transformation; a corrective transformation applying component for applying an instance of the corrective transformation to the feature vector to provide an enhanced feature vector.
- The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
- FIG. 1 is a graph showing the smearing effect of spectral envelopes derived from cepstral vectors associated with the same context-dependent phonetic unit for real and synthetic speech;
- FIG. 2 is a stemmed plot of components of a ratio vector for a context-dependent phonetic unit with the components of the ratio vector plotted against quefrency;
- FIG. 3 is a block diagram of a first embodiment of a system in accordance with the present invention;
- FIG. 4 is a block diagram of a second embodiment of a system in accordance with the present invention;
- FIG. 5 is a block diagram of a computer system in which the present invention may be implemented;
- FIG. 6 is a flow diagram of a method in accordance with the present invention;
- FIG. 7 is a flow diagram of a first embodiment of a method in accordance with the present invention applied in an on-line operational mode; and
- FIG. 8 is a flow diagram of a second embodiment of a method in accordance with the present invention applied in an off-line/on-line operational mode.
- It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numbers may be repeated among the figures to indicate corresponding or analogous features.
- In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.
- The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
- The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
- Method, system and computer program product are described in which a statistical compensation method is used on the speech output from a statistical TTS system. Distortion may be reduced in synthesized speech by compensating the spectral smearing effect inherent to statistical TTS systems and other distortions by applying a corrective transformation to acoustic feature vectors generated by the system.
- In a statistical TTS system, an instantaneous spectral envelope of speech is parameterised, i.e. represented by an acoustic feature vector. In some systems the spectral envelope may combine the vocal tract and the glottal pulse related components. In this case, the influence of the glottal pulse on the spectral envelope is typically ignored, and the spectral envelope is deemed to be related to the vocal tract. In other systems, the glottal pulse and the vocal tract may be modeled and generated separately. In one embodiment used as the main example for the specific description, the method is applied to the case of a single spectral envelope. In other embodiments, the method may be applied separately to the vocal tract and glottal pulse related components.
- In a statistical TTS system, a parameterized spectral envelope associated with each distinct phonetic unit is modeled by a separate probability distribution. These distinct units are usually parts of a phone taken in a certain phonetic-linguistic context. For example, in a typical 3-state HMM-based system each phone taken in a certain phonetic and linguistic context is modeled by a 3-state HMM. In this case the phonetic unit represents one third (the beginning, the middle, or the end) of a phone taken in a context and is modeled by a multivariate Gaussian mixture probability density function. The same is true for systems utilizing hidden semi-Markov models (HSMM), where the state transition probabilities are not used and the unit durations are modeled directly. Other statistical TTS methods to which the described method may be applied may use models other than HMM states, with emission probabilities modeled by probability distributions other than Gaussian.
- Different types of the acoustic features may be used for the spectral envelope parameterisation in statistical TTS systems. In one embodiment used as the main example for the specific description, an acoustic feature vector in the form of a cepstral vector is used. However, other forms of acoustic feature vectors may be used, such as Line Spectral Frequencies (LSF) also referred to as Line Spectral Pairs (LSP).
- In the context of cepstral features, a power cepstrum, or simply cepstrum, is the result of taking the inverse Fourier transform of the log-spectrum. In speech processing in general, and in TTS systems in particular, the frequency axis is warped prior to the cepstrum calculation. One of the popular frequency warping transformations is Mel-scale warping, reflecting perceptual properties of the human auditory system. The continuous spectral envelope is not available immediately from the voiced speech signal, which has a quasi-periodic nature. Hence, there are a number of widely used techniques for cepstrum estimation, each based on a distinct method of spectral envelope estimation. Examples of such techniques are: Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Predictive (PLP) cepstrum, and Mel-scale Regularized Cepstral Coefficients (MRCC). A finite number of cepstrum samples (also referred to as cepstral coefficients) is calculated to form a cepstral parameter vector modeled by a certain probability distribution for each phonetic unit within a statistical TTS system.
- The argument of the cepstrum signal and the indices of cepstral vector components are referred to as quefrency. The cepstrum is a discrete signal, i.e. an infinite sequence of coefficients c(0), c(1), c(2), . . . , where the index n is the quefrency. For example, c(2) is the cepstrum value at quefrency 2. The cepstral vector used in TTS is a truncated cepstrum: V=[c1, c2, . . . , cN]. Each component has an index referred to as its quefrency; for example, the c2 component is associated with quefrency 2.
- The method proposed in the present invention does not exploit specific properties of Markov models or properties of Gaussian mixture models. Hence the method is applicable to any statistical TTS system that models the spectral envelope of a phonetic unit by a probability distribution defined in the space of acoustic feature vectors.
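- By way of illustration only (this is not the MRCC estimator used in the study described below, and the frequency-warping step is omitted), the following Python sketch derives a truncated cepstral vector from a sampled spectral envelope by taking the inverse Fourier transform of the log-spectrum and keeping the first N coefficients; the index of each kept component is its quefrency.

```python
import numpy as np

def truncated_cepstrum(envelope, n_coeffs=33):
    """Toy cepstrum estimate: inverse FFT of the log spectral envelope.

    envelope: positive spectral-envelope samples on a uniform frequency grid
              (a real system would first warp the frequency axis, e.g. Mel-scale).
    """
    log_spectrum = np.log(np.asarray(envelope, dtype=float))
    cepstrum = np.fft.irfft(log_spectrum)     # real, since the envelope is zero-phase
    # c(0) carries the overall gain; the TTS feature vector keeps c(1)..c(N),
    # where the index of each component is its quefrency.
    return cepstrum[1:n_coeffs + 1]

# Example: a smooth toy envelope sampled at 257 frequency bins.
freqs = np.linspace(0.0, np.pi, 257)
envelope = 2.5 + 0.8 * np.cos(3 * freqs) + 0.3 * np.cos(7 * freqs)
cep = truncated_cepstrum(envelope)            # cep[1] is the value at quefrency 2
```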
- Studies and analysis presented below were carried out using a US English 5-state HSMM TTS system that employs 33-dimensional MRCC cepstral vectors for the spectral envelope parameterization. [Reference for MRCC: Shechtman, S. and Sorin, A., “Sinusoidal model parameterization for HMM-based TTS system”, in Proc. Interspeech 2010.] Thus each phonetic unit is represented by a certain state of a certain HMM. The cepstral vectors associated with each unit were modeled by a distinct multivariate Gaussian probability distribution.
- Once a voice model had been trained on a set of training sentences, all the cepstral vectors that were clustered to a certain phonetic unit were gathered. This collection of cepstral vectors, hereafter referred to as the real cluster, was used for estimation of the unit's Gaussian mean and variance during the voice model training. All the training sentences were then synthesized and all the synthetic cepstral vectors emitted from this unit's Gaussian model were collected. This second collection is referred to as the synthetic cluster.
- The over-smoothed nature of the speech generated by a statistical TTS system is due to spectral shape smearing as a result of statistical modeling of cepstral vectors (or other acoustic feature vectors) for each phonetic unit.
- An example of the smearing effect is depicted in
FIG. 1. FIG. 1 is a graph 100 plotting amplitude 101 against frequency 102, with spectral envelopes derived from cepstral vectors selected from the real cluster 103 and the synthetic cluster 104 associated with a certain unit drawn with dashed and solid lines respectively. The synthetic vectors 104 show flatter spectra, with lower peaks and higher valleys, compared to the real vectors 103.
- The spectrum flattening is closely related to an increased attenuation of the cepstrum with quefrency. Insight into this relation can be gained using the rational representation of the vocal tract transfer function:
S(z) = g · Π(m=1..M) (1 − zm·z^−1) / Π(k=1..K) (1 − pk·z^−1)   (1)
- where {pk} and {zm} are respectively the poles and zeros of S(z). Taking the logarithm of the right side of (1) and applying the Maclaurin series expansion to the additive logarithmic terms, the cepstrum of the vocal tract impulse response can be expressed as follows:
c(n) = (1/n) · [Σ(k) pk^n − Σ(m) zm^n],  n ≥ 1   (2)
- From (2), it follows that when the poles and zeros of the transfer function move away from the unit circle towards the origin of Z-plane—flattening spectral peaks and valleys—the cepstrum attenuation increases.
- Thus it is expected that synthetic cepstral vectors associated with a certain unit have higher attenuation in quefrency than the real vectors associated with that unit. This hypothesis is supported by the statistical observations which compare the L2-norm distribution over the cepstral vector components measured on real and synthetic clusters.
- Specifically, the L2-norm of a sub-vector extracted from the full 33-dimensional cepstral vector [C(1), C(2), . . . , C(33)] was calculated. Sub-vectors were analyzed containing lowest quefrency coefficients [C(1) . . . C(11)], middle quefrency coefficients [C(12) . . . C(22)] and highest quefrency coefficients [C(23) . . . C(33)]. It was seen that the L2-norm of the middle quefrency and highest quefrency sub-vectors was systematically lower within the synthetic cluster than within the real cluster. At the same time the L2-norm of the lowest quefrency sub-vectors did not vary significantly between the real and synthetic clusters.
- The same phenomenon was observed in the mean values calculated over the real and synthetic clusters. For a given unit the L2-norm ratio vector R is defined as:
R(n) = √( Mreal²(n) / Msyn²(n) ),  n = 1, . . . , N   (3)
- where Mreal² and Msyn² are the component-wise empirical second moments of the real and synthetic vectors respectively. The second moment vectors were smoothed along the quefrency axis with a 5-tap moving average operator prior to calculating the ratio vector (3).
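- A minimal numpy sketch of this statistic, assuming the real and synthetic clusters for one unit are simply stacked as rows of two arrays (the function and variable names are illustrative, not taken from any particular TTS toolkit):

```python
import numpy as np

def smoothed_second_moment(cluster, taps=5):
    """Component-wise empirical second moment, smoothed along quefrency.

    cluster: array of shape (num_vectors, N) holding the cepstral vectors
             gathered for one phonetic unit.
    """
    m2 = np.mean(np.asarray(cluster, dtype=float) ** 2, axis=0)
    kernel = np.ones(taps) / taps
    return np.convolve(m2, kernel, mode="same")      # 5-tap moving average

def ratio_vector(real_cluster, synthetic_cluster):
    """L2-norm ratio vector R of equation (3)."""
    return np.sqrt(smoothed_second_moment(real_cluster) /
                   smoothed_second_moment(synthetic_cluster))

# Values of R(n) above 1 at the middle and high quefrencies indicate that the
# synthetic vectors are more strongly attenuated than the real ones.
```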
- With reference to
FIG. 2, the stemmed plot 200 represents the components of the L2-norm ratio vector R calculated for the same unit analyzed in FIG. 1, with the L2-norm ratio 201 plotted against quefrency 202. The ratio vector components exhibit an increasing trend along the quefrency axis 202, which means that on average the synthetic vectors have a stronger attenuation than the real vectors. This statistical observation was validated on all the units of several male and female voice models in three languages, summing to about 7000 HMM states.
- The analysis above is used to compensate for this stronger attenuation of synthetic vectors prior to rendering the synthesized speech waveform. In the above study and analysis, the attenuation of cepstrum coefficients in quefrency is considered. Other indications of acoustic distortion may be used for other forms of acoustic feature vectors, such as Line Spectral Frequencies. The distortion indicator may indicate (or enable a derivation of) a degree of spectral smoothness or other spectral distortion.
- In an example embodiment of the described method, the compensation transformation is represented as a component-wise multiplication, referred to as liftering, of a distorted synthetic cepstral vector C=[C(1), . . . , C(N)] by a corrective vector W=[W(1), . . . , W(N)] with positive components. Then the enhanced output vector O is:
O(n) = W(n)·C(n),  n = 1, . . . , N   (4)
- Hereafter a dual treatment of the corrective vector is adopted. On one hand it is considered a vector, i.e. an ordered set of values. On the other hand it is considered as a result of sampling of function W(n) at the grid n=[1, 2, . . . , N].
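- As a sketch, the liftering operation of equation (4) is a single element-wise product:

```python
import numpy as np

def lifter(cepstral_vector, corrective_vector):
    """Apply corrective liftering, O(n) = W(n) * C(n) for n = 1..N (equation (4))."""
    return (np.asarray(corrective_vector, dtype=float) *
            np.asarray(cepstral_vector, dtype=float))
```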
- The observations described above suggest that the corrective liftering function W(n) should in general be increasing in n, though not necessarily monotonically. Two requirements may be imposed on the corrective function in order to prevent audible distortions in the enhanced synthesized speech:
-
- The form of the liftering function may be chosen so that the frequencies of spectral peaks and valleys do not change significantly as a result of the liftering operation. In particular it means that the liftering function should be smooth in quefrency.
- The degree of spectrum sharpness achieved by the corrective liftering operation may be within the range observed in the real cluster associated with the corresponding phonetic unit.
- The general idea of the described method is to define a parametric family of smooth positive corrective functions Wp(n) (e.g. exponential) dependent on a parameter set p and to calculate the parameter values either for each phonetic unit or for each emitted cepstral vector so that the cepstral attenuation degree (and corresponding spectral sharpness degree) after the liftering matches the average level observed in the corresponding real cluster.
- The described method statistically controls the corrective liftering to greatly improve the quality of synthesized speech while preventing an over-liftering introducing audible distortions.
- Let: Wp(n) be a parametric family of corrective liftering functions dependent on an enhancing parameter set p; C=[C(n), n=1, . . . , N] be a synthetic cepstral vector emitted from a phonetic unit model L of a statistical TTS system; and H(X) be a vectorial function of a cepstral vector X indicative of its attenuation. Hereafter H(X) is referred to as the attenuation indicator.
- A reference value Hreal of the attenuation indicator may be calculated for the unit L by averaging of H(X) over the real cluster associated with that unit:
Hreal = E{ H(X), X ∈ real cluster of L }   (5)
- An actual value Hsyn of the attenuation indicator may be calculated by averaging H(X) over the synthetic cluster created in advance for the unit L:
Hsyn = E{ H(X), X ∈ synthetic cluster of L }   (6.1)
- Alternatively the actual value Hsyn may be calculated from the same single synthetic vector C to be processed:
Hsyn = H(C)   (6.2)
- Optimal values of the enhancing parameters may be calculated that provide the best approximation of the reference value of the attenuation indicator:
popt = arg min_p D(Hreal, Hsyn, Wp)   (7)
- where D(Hreal, Hsyn, Wp) is an enhancement criterion that measures a dissimilarity between the reference value of the attenuation indicator and a predicted actual value of the attenuation indicator after applying the corrective liftering Wp.
- Finally, the optimal liftering may be applied to vector C yielding the enhanced vector O:
O(n) = Wpopt(n)·C(n),  n = 1, . . . , N   (8)
- which may be used further for the output speech waveform rendering according to the regular scheme adopted for the original statistical TTS system.
- The process described above may be applied to each cepstral vector output from the original statistical TTS system.
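- The following sketch strings steps (5) to (8) together for the per-vector case (6.2) discussed next. It is only one possible concretisation, not the patent's reference implementation: it assumes the exponential corrective family and the squared-component attenuation indicator introduced later in the description, fits the single exponent base by a least-squares log-linear regression, and clips it to an assumed safe range.

```python
import numpy as np

def enhance_vector(c, h_real, alpha_limits=(1.0, 1.2)):
    """Enhance one synthetic cepstral vector in the per-vector mode (6.2).

    c:      synthetic cepstral vector [C(1), ..., C(N)] emitted by a unit model
    h_real: reference attenuation indicator of that unit, e.g. the smoothed
            second-moment vector of its real cluster
    """
    c = np.asarray(c, dtype=float)
    h_real = np.asarray(h_real, dtype=float)
    n = np.arange(1, len(c) + 1)
    h_syn = c ** 2                                   # actual indicator, eq. (6.2)
    # Per-component target gain sqrt(Hreal/Hsyn); fit log W(n) = n * log(alpha).
    log_ratio = 0.5 * np.log(np.maximum(h_real, 1e-12) / np.maximum(h_syn, 1e-12))
    log_alpha = np.sum(n * log_ratio) / np.sum(n * n)     # least-squares slope
    alpha = float(np.clip(np.exp(log_alpha), *alpha_limits))  # assumed safe range
    w = alpha ** n                                   # corrective liftering vector
    return w * c                                     # enhanced vector, eq. (8)
```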
- Referring to the calculation of the actual value Hsyn of the attenuation indicator given by the two alternative formulas (6.1) and (6.2), it can be noted that the alternative choices yield similar results. This may be explained by the fact that in HMM TTS systems synthetic clusters exhibit low variance, and therefore each vector, e.g. C, is close to the cluster's average. However, (6.1) and (6.2) lead to two different modes of operation of the enhanced system.
- In the first case (6.1), the optimal enhancing parameter set p and the corrective liftering vector Wp associated with each unit may be calculated off-line, before the enhanced system is deployed, and stored. At synthesis time, the corresponding pre-stored liftering function may be applied to each synthetic vector C. This choice simplifies the implementation of the run-time component of the enhanced system.
- In the second case (6.2), the calculation of the optimal corrective liftering vector Wp may be performed at run-time for each vector C emitted from the statistical model. Only the reference values Hreal may be calculated off-line and stored. At synthesis time the reference value Hreal associated with the corresponding unit may be passed to the enhancement algorithm. This choice removes the need to build the synthetic clusters for each unit. Moreover, with a proper selection of the attenuation indicator H(X), as described below, there is no need to store the Hreal vectors. Instead they are easily derived from the statistical model parameters, and the proposed method may be applied to pre-existing voice models built for the original TTS system.
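- A sketch of the first, off-line mode under the same illustrative assumptions as above: one corrective liftering vector is fitted and cached per phonetic unit, so the run-time step reduces to a table lookup and a multiply.

```python
import numpy as np

def build_unit_lifters(unit_stats, n_dims=33):
    """Precompute one corrective liftering vector per phonetic unit (mode 6.1).

    unit_stats maps a unit id to a pair (h_real, h_syn) of reference and
    synthetic-cluster second-moment vectors; this structure is an assumption.
    """
    n = np.arange(1, n_dims + 1)
    lifters = {}
    for unit_id, (h_real, h_syn) in unit_stats.items():
        log_ratio = 0.5 * np.log(np.asarray(h_real) / np.asarray(h_syn))
        alpha = np.exp(np.sum(n * log_ratio) / np.sum(n * n))
        lifters[unit_id] = alpha ** n               # W(n) = alpha**n for this unit
    return lifters

# At synthesis time only a lookup and a multiply remain:
#   enhanced = lifters[unit_id] * synthetic_cepstral_vector
```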
- The method described above in general terms will be better understood with reference to following example embodiments addressing specific important points of the algorithm.
- Relation (2) suggests a simple and mathematically tractable exponential corrective function:
-
Wα(n) = α^n,  α > 1   (9)
- in which case the enhancing parameter set p may be comprised of a single scalar exponent base α. Within the pole-zero model (2), the exponential liftering results in the uniform radial migration of poles and zeros towards the unit circle of the complex plane that directly relates to spectrum sharpening without changing the location of the peaks and valleys on the frequency axis:
-
- The degree of the spectrum sharpening depends on the selected exponent base α value. A too high α may overemphasize the spectral formants and even render the inverse cepstrum transform unstable. On the other hand, a too low α may not yield the expected enhancement effect. This is why the statistical control over the liftering parameters is important.
- A study of typical shapes of the L2-norm ratio vectors (exemplified by the stemmed plot on
FIG. 2 ) motivated an alternative, less tractable mathematically, corrective function in the form of two concatenated exponents: -
- In this case the enhancing parameters set may be comprised of three parameters: the base α of the first exponent, the base β of the second exponent and integer concatenation point γ, i.e. the index of the vector component where the concatenation takes place.
- The embodiments of the proposed method described below may be based on the attenuation indicator defined as:
-
H(X) = [X²(n), n = 1, . . . , N]   (12)
- Then the reference value Hreal given by (5) is the second moment Mreal² of the real cluster associated with the phonetic unit L. Practically there is no need to build the real cluster in order to calculate the vector Mreal². In many cases it can be easily calculated from the probability distribution of the real cepstral vectors. For example, in the case of the Gaussian mixture models used in HMM TTS systems, the reference value may be calculated as:
Mreal²(n) = Σ(i) λi · ( μi²(n) + σi²(n) )   (13)
-
- where μi, σi 2 and λi are respectively mean-vectors, variance-vectors and weights associated with individual Gaussians.
- The actual value Hsyn of the attenuation indicator may be either the empirical second moment of the cepstral vectors calculated over the synthetic cluster or squared vector C to be enhanced depending on the choice between (6.1) and (6.2).
- The components of the vectors Hreal and Hsyn may be optionally smoothed by a short filter such as 5-tap moving average filter. Hereafter, the smoothed versions of the vectors retain the same notations to avoid complication of the formulas.
- In one embodiment of the proposed method, the enhancement criterion D(Hreal, Hsyn, Wp) appearing in (7) may be defined as:
-
- When H(X) is defined by (12), the enhancement criterion (14) represents a dissimilarity between the corrective vector Wp and the L2-norm ratio vector R = [√(Mreal²(n)/Hsyn(n)), n = 1, . . . , N], or in other words the enhancement criterion represents a predicted flatness of the L2-norm ratio vector after applying the enhancement.
- In another embodiment, the enhancement criterion may be defined as:
-
- Note that when H(X) is defined by (12)
-
- where S(ω) is spectral envelope corresponding to the cepstral vector X. Hence the enhancement criterion (15) predicts the dissimilarity between the real and enhanced synthetic vectors in terms of spectrum smoothness.
- In the case of the exponential corrective liftering function (9) and the enhancement criterion (14), the calculation (7) of the optimal enhancing parameter a may be achieved by log-linear regression:
-
- Referring to the
FIG. 2 , an example of the optimal corrective liftering function calculated according to (17) is drawn by the boldsolid line 210. An enhanced spectral envelope resulting from the corrective liftering is shown onFIG. 1 by the dashedbold line 110. It can be seen that the enhanced spectral envelope exhibits emphasized peaks and valleys and resembles the real spectra much better compared to the original synthetic spectra. - In the case of two-concatenated exponents (11) and the enhancement criterion (14), the optimal set of the enhancing parameters may be calculated as follows. Fixing the concatenation point γ, the values of α and β may be calculated as:
-
- Then the optimal values of the three parameters may be obtained by scanning all the integer values of γ within a predefined range:
-
- with 1<minγ<maxγ<N such as for example min γ=0.5*N and max γ=0.75*N.
- An example of the optimal corrective liftering function calculated according to (18) and (19) is drawn on
FIG. 2 by the bold dashedline 220. - In the case of the exponential corrective liftering function (9) and enhancement criterion (15), the optimal value of the exponent base α may be obtained by solving following equation:
-
- The left side of (20) is an unbounded, monotonically increasing function of α which is less than the right-side value for α=0. Therefore the equation has a unique solution and can be solved numerically by one of the methods known in the art.
- The optimal enhancing parameters bring the attenuation degree of the synthetic cepstral vectors to the averaged level observed on the corresponding real cluster. Therefore, the enhancement may be strengthened or softened to some extent relatively to the optimal level in order to optimize the perceptual quality of the enhanced synthesized speech. In some embodiments of the proposed method, the optimal enhancing parameters calculated as described above may be altered depending on certain properties of the corresponding phonetic units emitting the synthetic vectors to be enhanced. For example, the optimal exponent base (17) calculated for vectors emitted from a certain unit of an HMM TTS system may be modified as:
-
αfinal=1+(αopt−1)·F(state_number,phone,voicing_class) (21) - where a predefined factor F depends on the HMM state number representing that unit, a category of the phone represented by this HMM and voicing class of the segments represented by this state. For example F(3,“AH”,1)=1.2 means that the enhancement will be strengthened roughly by 20% relatively to the optimal level for all the units representing
state number 3 of the phone “AH” given that the majority of frames clustered to this unit are voiced. - Then the final value αfinal may be used for rendering the corrective liftering vector to be applied to the corresponding synthetic cepstral vector.
- Referring to
FIGS. 3 and 4 , block diagrams show example embodiments of asystem - Referring to
FIG. 3 , thesystem 300 includes an on-line enhancement mechanism 340 for astatistical TTS system 310. Thesystem 300 includes astatistical TTS system 310, for example, an HMM-based system which receives atext input 301 and synthesizes the text to provide aspeech output 302. - In one embodiment,
TTS system 310 is an HMM-based system which models parameterised speech by a sequence of Markovian processes with unobserved (hidden) states with Gaussian mixture emitting probability distribution. In other embodiments, other forms of statistical modeling may be used. - The
statistical TTS system 310 may include a phoneticunit model component 320 including an acoustic featurevector output component 321 for outputting synthetic acoustic feature vectors generated out of this unit model. In one embodiment, the acoustic feature vector may be a cepstral vector. In another embodiment, the acoustic feature vector may be a Line Spectral Frequencies vector. - An
initialization unit 330 may be provided including a correctivetransformation defining component 331 for defining the parametric corrective transformation to be used for the corrective transformation instance derivation. The correctivetransformation defining component 331 may also include an enhancing parameter setcomponent 332 for defining the enhancing parameter set to be used. Theinitialization unit 330 may also include adistortion indicator component 333 for defining a distortion indicator to be used and anenhancement criterion component 334 for defining an enhancement criterion to be used. Theinitialization unit 330 may also include anenhancement customization component 335 dependent on unit attributes and enhancing parameters. In the embodiment of the acoustic feature vector being a cepstral vector, the distortion indicator is an attenuation indicator. - An on-
line enhancement mechanism 340 is provided which may include the following components for enhancing distorted acoustic feature vectors as output by the phoneticunit model component 320 by applying an instance of the corrective transformation. - The on-
line enhancement mechanism 340 may include aninputs component 341. Theinputs component 341 may include an acoustic featurevector input component 342 for receiving outputs from the phoneticunit model component 320. For example, a sequence of N-dimensional cepstral vectors. - The
inputs component 341 may also include a realemission statistics component 343 for receiving real emission statistics from the statistical model of the phoneticunit model component 320. - The
inputs component 341 may also include a unit attributescomponent 344 for receiving unit attributes of the phoneticunit model component 320. - The on-
line enhancement mechanism 340 may also include an enhancing parameter setcomponent 350. The enhancing parameter setcomponent 350 may include a distortionindicator reference component 351 and a distortion indicatoractual value component 352 for applying the distortion indicator definitions and calculating the actual and reference values for use in the enhancing parameter set derivation. - The enhancing parameter set
component 350 may also include an enhancementcriterion applying component 353 for applying a defined enhancement criterion to measure the dissimilarity between the reference value of the distortion indicator and a predicted actual value. - The enhancing parameter set
component 350 may include acustomization component 354 for altering optimal enhancing parameter set values according to unit attributes. The attributes may include a phone category which the statistical model is attributed to and voicing class of the majority of speech frames used for the statistical model training. - The on-
line enhancement mechanism 340 may include a correctivetransformation generating component 360 and a correctivetransformation applying component 365 for applying an instance of the parametric transformation derived from the enhancing parameter set values to an acoustic feature vector yielding an enhanced vector. - The on-
line enhancement mechanism 340 may include anoutput component 370 for outputting theenhanced vector output 371 for use in a waveform synthesis of thespeech component 380 of thestatistical TTS system 310. - Referring to
FIG. 4 , thesystem 400 shows an alternative embodiment to that ofFIG. 3 in which the corrective transformation is generated off-line. Equivalent reference numbers toFIG. 3 are used where possible. - As in
FIG. 3 , thesystem 400 includes astatistical TTS system 410, for example, an HMM-based system which receives atext input 401 and synthesizes the text to provide aspeech output 402. Thestatistical TTS system 410 may include a phoneticunit model component 420 including an acoustic featurevector output component 421 for outputting synthetic acoustic feature vectors generated out of this unit model. - As in
FIG. 3 , aninitialization unit 430 may be provided including a correctivetransformation defining component 431 for defining the parametric corrective transformation to be used for the corrective transformation instance derivation. The correctivetransformation defining component 431 may also include a parameter setcomponent 432 for defining the enhancing parameter set to be used. Theinitialization unit 430 may also include adistortion indicator component 433 for defining a distortion indicator to be used and anenhancement criterion component 434 for defining an enhancement criterion to be used. Theinitialization unit 430 may also include anenhancement customization component 435 dependent on unit attributes and enhancing parameters. - In this embodiment, an off-line
enhancement calculation mechanism 440 may be provided for generating and storing a corrective transformation instance. An on-line enhancement mechanism 450 may be provided to retrieve and apply instances of the corrective transformation during speech synthesis. - The off-line
enhancement calculation mechanism 440 may include aninputs component 441. Theinputs component 441 may include a syntheticcluster vector component 442 for collecting a synthetic cluster of acoustic feature vectors for each phonetic unit emitted from the phoneticunit model component 420. Theinputs component 441 may also include a realemission statistics component 443 for receiving real emission statistics from the statistical model of the phoneticunit model component 420. Theinputs component 441 may also include a unit attributescomponent 444 for receiving unit attributes of the phoneticunit model component 420. - The off-line
enhancement calculation mechanism 440 may also include an enhancing parameter setcomponent 450. The enhancing parameter setcomponent 450 may include a distortionindicator reference component 451 and a distortion indicatoractual value component 452 for applying the distortion indicator definitions and calculating the actual and reference values for use in the enhancing parameter set derivation. The enhancing parameter setcomponent 450 may also include an enhancementcriterion applying component 453 for applying a defined enhancement criterion to measure the dissimilarity between the reference value of the distortion indicator and a predicted actual value. The enhancing parameter setcomponent 450 may include acustomization component 454 for altering optimal enhancing parameter set values according to unit attributes. - The off-line
enhancement calculation mechanism 440 may include a corrective transformation generating and storingcomponent 460. - The on-
line enhancement mechanism 470 may include a corrective transformation retrieving and applyingcomponent 471 for applying the instance of the parametric corrective transformation derived from the enhancing parameter set values to an acoustic feature vector yielding an enhanced vector. The on-line enhancement mechanism 470 may include anoutput component 472 for outputting theenhanced vector output 473 for use in a waveform synthesis of thespeech component 480 of thestatistical TTS system 410. - Referring to
FIG. 5 , an exemplary system for implementing aspects of the invention includes adata processing system 500 suitable for storing and/or executing program code including at least oneprocessor 501 coupled directly or indirectly to memory elements through abus system 503. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution. - The memory elements may include
system memory 502 in the form of read only memory (ROM) 504 and random access memory (RAM) 505. A basic input/output system (BIOS) 506 may be stored inROM 504.System software 507 may be stored inRAM 505 includingoperating system software 508.Software applications 510 may also be stored inRAM 505. - The
system 500 may also include a primary storage means 511 such as a magnetic hard disk drive and secondary storage means 512 such as a magnetic disc drive and an optical disc drive. The drives and their associated computer-readable media provide non-volatile storage of computer-executable instructions, data structures, program modules and other data for thesystem 500. Software applications may be stored on the primary and secondary storage means 511, 512 as well as thesystem memory 502. - The
computing system 500 may operate in a networked environment using logical connections to one or more remote computers via anetwork adapter 516. - Input/
output devices 513 can be coupled to the system either directly or through intervening I/O controllers. A user may enter commands and information into thesystem 500 through input devices such as a keyboard, pointing device, or other input devices (for example, microphone, joy stick, game pad, satellite dish, scanner, or the like). Output devices may include speakers, printers, etc. Adisplay device 514 is also connected tosystem bus 503 via an interface, such asvideo adapter 515. - Referring to
FIG. 6 , a flow diagram 600 shows the described method. A parametric family of corrective transformations is defined 601 operating in the space of acoustic feature vectors and dependent on a set of enhancing parameters. A distortion indicator of a feature vector may also be defined 602. A feature vector is received 603 as emitted form a phonetic unit of the system. An instance of the corrective transformation may be generated 604 from the parametric corrective transformation by applying an optimized a set of enhancing parameter values to reduce audible distortions. - The instance of the corrective transformation may be generated by the following steps. Calculating 605 a reference value of the distortion indicator attributed to a statistical model of the phonetic unit emitting the feature vector, and calculating 606 an actual value of the distortion indicator attributed to feature vectors emitted by the statistical model of the phonetic unit emitting the feature vector, and calculating 607 a set of enhancing parameter values depending on the reference value of the distortion indicator, the actual value of the distortion indicator, and the parametric corrective transformation.
- The instance of the corrective transformation may be applied 608 to the feature vector to provide an enhanced vector for use in speech synthesis.
- Referring to
FIGS. 7 and 8 , flow diagrams 700, 800 show example embodiments of the described method in the context of corrective liftering vectors applied to cepstral vectors with distortion indicators in the form of attenuation indicators for smoothing spectral distortion. - Referring to
FIG. 7 , a flow diagram 700 shows steps of an example embodiment of the described method corresponding to the case where cepstral acoustic feature vectors and liftering corrective transformation are used and the corrective liftering vectors are calculated on-line during the synthesis operation. - A
first initialization phase 710 may include defining 711: parametric family of corrective liftering functions WP(N) dependent on enhancing parameter set P; attenuation indicator H; enhancement criterion D(H, H, WP); and enhancement customization mechanism F dependent on unit attributes and enhancing parameters. - A
second phase 720 is the operation of synthesis with enhancement. Cepstral vector generation may be applied 721 from the statistical model. The following may be received 722: synthetic cepstral vector C emitted from phonetic unit U; emission statistics REALS (e.g. mean and variance) from statistical model of U; and unit attributes UA of phonetic unit U. - A reference value of the attenuation indictor may be calculated HREAL=H(REALS) as well as an actual value HSYN=H(C) 723. Optimal enhancing parameter values P* may be calculated 724 optimizing the enhancement criterion:
-
- The optimal enhancing parameter values may be altered 725 according to unit attributes applying customization mechanism P**=F(P*,UA). A corrective liftering vector WP** corresponding to P** may be calculated 726 and applied 727 to vector C yielding enhanced vector O. The enhanced vector O may be used 728 in waveform synthesis of speech
- Referring to
FIG. 8 , a flow diagram 800 shows steps of an example embodiment of the described method corresponding to the case where cepstral acoustic feature vectors and liftering corrective transformation are used and the corrective liftering vectors are calculated off-line and stored being linked to corresponding phonetic units. - A
first initialization phase 810 may include defining: parametric family of corrective liftering functions WP(N) dependent on enhancing parameter set P; attenuation indicator H; enhancement criterion D(H, H, WP); and enhancement customization mechanism F dependent on unit attributes and enhancing parameters. - A
second phase 820 is an off-line calculation of unit dependent corrective vectors. Cepstral vector generation may be applied 821 from the statistical model. For each phonetic unit U, a synthetic cluster of cepstral vectors emitted from phonetic unit U may be collected 822. The synthetic cluster statistics (e.g. means and variance) SYNS may be calculated 823. The emission statistics (e.g. mean and variance) REALS may be fetched 824 from statistical model of U together with the unit attributes UA of phonetic model U. - A reference value of attenuation indicator may be calculated HREAL=H(REALS) as well as the actual value HSYN=H(SYNS) 825. Optimal enhancing parameter values P* may be calculated 826 optimising the enhancement criterion:
-
- The optimal enhancing parameter values may be altered 827 according to unit attributes applying customization mechanism P**=F(P*,UA).
- The corrective liftering vector WP** corresponding to P** is calculated 828. The liftering vector WP** is stored 829 being linked to the unit U.
- At an on-
line operation 830 of synthesis with enhancement, a synthetic cepstral vector C is received 831 together with a corrective liftering vector WP** corresponding to unit emitting C. Corrective liftering vector WP** is applied 832 to vector C yielding enhanced vector O. The enhanced vector O is used 833 in waveform synthesis of speech. - The enhancement method described improves the perceptual quality of synthesized speech by strong reduction of the spectral smearing effect. The effect of this enhancement technique consists of moving poles and zeros of the transfer function corresponding to the synthesized spectral envelope towards the unit circle of Z-plane which leads to sharpening of spectral peaks and valleys.
- It is applicable to a wide class of HMM-based TTS systems and of statistical TTS systems in general. Most HMM TTS systems model frames' spectral envelopes in the cepstral space i.e. use cepstral feature vectors. The enhancement technique described works in the cepstral domain and is directly applicable to any statistical system employing cepstral features.
- The described method does not introduce audible distortions due to the fact that it works adaptively exploiting statistical information available within a statistical TTS system. The corrective transformation applied to a synthetic vector output from the original TTS system is calculated with the goal to bring the value of certain characteristics of the enhanced vector to the average level of this characteristic observed on relevant feature vectors derived from real speech.
- The described method does not require building of a new voice model. The described method can be employed with a pre-existing voice model. The real vectors statistics used as a reference for the corrective transformation calculation can be calculated based on the cepstral mean and variance vectors readily available within the existing voice model.
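- As a sketch of this point, the reference second moment used above can be assembled directly from the mean, variance and weight vectors stored in an existing voice model (the array shapes here are assumptions: one row per mixture component):

```python
import numpy as np

def reference_second_moment(means, variances, weights):
    """Component-wise second moment of a Gaussian mixture emission model.

    means, variances: arrays of shape (num_components, N) from the voice model
    weights:          mixture weights, shape (num_components,)
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    w = np.asarray(weights, dtype=float)[:, None]
    # E[X**2] per component is mu**2 + sigma**2; weight and sum over components.
    return np.sum(w * (means ** 2 + variances), axis=0)
```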
- As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
- Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- Aspects of the present invention are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
- The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Claims (25)
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/177,577 US8682670B2 (en) | 2011-07-07 | 2011-07-07 | Statistical enhancement of speech output from a statistical text-to-speech synthesis system |
DE112012002524.5T DE112012002524B4 (en) | 2011-07-07 | 2012-06-28 | Statistical improvement of speech output from a text-to-speech synthesis system |
CN201280033177.0A CN103635960B (en) | 2011-07-07 | 2012-06-28 | From the statistics enhancement of the voice that statistics Text To Speech synthesis system exports |
GB1400493.1A GB2507674B (en) | 2011-07-07 | 2012-06-28 | Statistical enhancement of speech output from A statistical text-to-speech synthesis system |
PCT/IB2012/053270 WO2013011397A1 (en) | 2011-07-07 | 2012-06-28 | Statistical enhancement of speech output from statistical text-to-speech synthesis system |
JP2014518027A JP2014522998A (en) | 2011-07-07 | 2012-06-28 | Statistical enhancement of speech output from statistical text-to-speech systems. |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/177,577 US8682670B2 (en) | 2011-07-07 | 2011-07-07 | Statistical enhancement of speech output from a statistical text-to-speech synthesis system |
Publications (2)
Publication Number | Publication Date |
---|---|
US20130013313A1 true US20130013313A1 (en) | 2013-01-10 |
US8682670B2 US8682670B2 (en) | 2014-03-25 |
Family
ID=47439189
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/177,577 Expired - Fee Related US8682670B2 (en) | 2011-07-07 | 2011-07-07 | Statistical enhancement of speech output from a statistical text-to-speech synthesis system |
Country Status (6)
Country | Link |
---|---|
US (1) | US8682670B2 (en) |
JP (1) | JP2014522998A (en) |
CN (1) | CN103635960B (en) |
DE (1) | DE112012002524B4 (en) |
GB (1) | GB2507674B (en) |
WO (1) | WO2013011397A1 (en) |
Cited By (148)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140156280A1 (en) * | 2012-11-30 | 2014-06-05 | Kabushiki Kaisha Toshiba | Speech processing system |
US9697820B2 (en) * | 2015-09-24 | 2017-07-04 | Apple Inc. | Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks |
US9865248B2 (en) | 2008-04-05 | 2018-01-09 | Apple Inc. | Intelligent text-to-speech conversion |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9966060B2 (en) | 2013-06-07 | 2018-05-08 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US9971774B2 (en) | 2012-09-19 | 2018-05-15 | Apple Inc. | Voice-based media searching |
US9986419B2 (en) | 2014-09-30 | 2018-05-29 | Apple Inc. | Social reminders |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
US10049675B2 (en) | 2010-02-25 | 2018-08-14 | Apple Inc. | User profiling for voice input processing |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US10079014B2 (en) | 2012-06-08 | 2018-09-18 | Apple Inc. | Name recognition system |
US10083690B2 (en) | 2014-05-30 | 2018-09-25 | Apple Inc. | Better resolution when referencing to concepts |
US10089072B2 (en) | 2016-06-11 | 2018-10-02 | Apple Inc. | Intelligent device arbitration and control |
US10108612B2 (en) | 2008-07-31 | 2018-10-23 | Apple Inc. | Mobile device having human language translation capability with positional feedback |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US10269345B2 (en) | 2016-06-11 | 2019-04-23 | Apple Inc. | Intelligent task discovery |
US10297253B2 (en) | 2016-06-11 | 2019-05-21 | Apple Inc. | Application integration with a digital assistant |
US10303715B2 (en) | 2017-05-16 | 2019-05-28 | Apple Inc. | Intelligent automated assistant for media exploration |
US10311871B2 (en) | 2015-03-08 | 2019-06-04 | Apple Inc. | Competing devices responding to voice triggers |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
US10318871B2 (en) | 2005-09-08 | 2019-06-11 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US10332518B2 (en) | 2017-05-09 | 2019-06-25 | Apple Inc. | User interface for correcting recognition errors |
US10354011B2 (en) | 2016-06-09 | 2019-07-16 | Apple Inc. | Intelligent automated assistant in a home environment |
US10356243B2 (en) | 2015-06-05 | 2019-07-16 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10381016B2 (en) | 2008-01-03 | 2019-08-13 | Apple Inc. | Methods and apparatus for altering audio output signals |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US10403283B1 (en) | 2018-06-01 | 2019-09-03 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10410637B2 (en) | 2017-05-12 | 2019-09-10 | Apple Inc. | User-specific acoustic models |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
US10417344B2 (en) | 2014-05-30 | 2019-09-17 | Apple Inc. | Exemplar-based natural language processing |
US10417405B2 (en) | 2011-03-21 | 2019-09-17 | Apple Inc. | Device access using voice authentication |
US10431204B2 (en) | 2014-09-11 | 2019-10-01 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US20190304435A1 (en) * | 2017-05-18 | 2019-10-03 | Telepathy Labs, Inc. | Artificial intelligence-based text-to-speech system and method |
US10438595B2 (en) | 2014-09-30 | 2019-10-08 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US10453443B2 (en) | 2014-09-30 | 2019-10-22 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US10475438B1 (en) * | 2017-03-02 | 2019-11-12 | Amazon Technologies, Inc. | Contextual text-to-speech processing |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
US10482874B2 (en) | 2017-05-15 | 2019-11-19 | Apple Inc. | Hierarchical belief states for digital assistants |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10497365B2 (en) | 2014-05-30 | 2019-12-03 | Apple Inc. | Multi-command single utterance input method |
US10496705B1 (en) | 2018-06-03 | 2019-12-03 | Apple Inc. | Accelerated task performance |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US10521466B2 (en) | 2016-06-11 | 2019-12-31 | Apple Inc. | Data driven natural language event detection and classification |
US10529332B2 (en) | 2015-03-08 | 2020-01-07 | Apple Inc. | Virtual assistant activation |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
US10643611B2 (en) | 2008-10-02 | 2020-05-05 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
US10657961B2 (en) | 2013-06-08 | 2020-05-19 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US10684703B2 (en) | 2018-06-01 | 2020-06-16 | Apple Inc. | Attention aware virtual assistant dismissal |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10699717B2 (en) | 2014-05-30 | 2020-06-30 | Apple Inc. | Intelligent assistant for home automation |
US10706841B2 (en) | 2010-01-18 | 2020-07-07 | Apple Inc. | Task flow identification based on user intent |
US10714117B2 (en) | 2013-02-07 | 2020-07-14 | Apple Inc. | Voice trigger for a digital assistant |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
US10733993B2 (en) | 2016-06-10 | 2020-08-04 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
US10741185B2 (en) | 2010-01-18 | 2020-08-11 | Apple Inc. | Intelligent automated assistant |
US10748546B2 (en) | 2017-05-16 | 2020-08-18 | Apple Inc. | Digital assistant services based on device capabilities |
US10755703B2 (en) | 2017-05-11 | 2020-08-25 | Apple Inc. | Offline personal assistant |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
US10769385B2 (en) | 2013-06-09 | 2020-09-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US10789945B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Low-latency intelligent automated assistant |
US10791176B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US10795541B2 (en) | 2009-06-05 | 2020-10-06 | Apple Inc. | Intelligent organization of tasks items |
US10810274B2 (en) | 2017-05-15 | 2020-10-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US10839159B2 (en) | 2018-09-28 | 2020-11-17 | Apple Inc. | Named entity normalization in a spoken dialog system |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
US10904611B2 (en) | 2014-06-30 | 2021-01-26 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US10942703B2 (en) | 2015-12-23 | 2021-03-09 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US20210073611A1 (en) * | 2011-08-10 | 2021-03-11 | Konlanbi | Dynamic data structures for data-driven modeling |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
US11010561B2 (en) | 2018-09-27 | 2021-05-18 | Apple Inc. | Sentiment prediction from textual data |
US11010127B2 (en) | 2015-06-29 | 2021-05-18 | Apple Inc. | Virtual assistant for media playback |
US11023513B2 (en) | 2007-12-20 | 2021-06-01 | Apple Inc. | Method and apparatus for searching using an active ontology |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US11048473B2 (en) | 2013-06-09 | 2021-06-29 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US11070949B2 (en) | 2015-05-27 | 2021-07-20 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display |
US11069336B2 (en) | 2012-03-02 | 2021-07-20 | Apple Inc. | Systems and methods for name pronunciation |
US11080012B2 (en) | 2009-06-05 | 2021-08-03 | Apple Inc. | Interface for a virtual digital assistant |
US11120372B2 (en) | 2011-06-03 | 2021-09-14 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US11126400B2 (en) | 2015-09-08 | 2021-09-21 | Apple Inc. | Zero latency digital assistant |
US11127397B2 (en) | 2015-05-27 | 2021-09-21 | Apple Inc. | Device voice control |
US11133008B2 (en) | 2014-05-30 | 2021-09-28 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11170166B2 (en) | 2018-09-28 | 2021-11-09 | Apple Inc. | Neural typographical error modeling via generative adversarial networks |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
US11217251B2 (en) | 2019-05-06 | 2022-01-04 | Apple Inc. | Spoken notifications |
US11217266B2 (en) * | 2016-06-21 | 2022-01-04 | Sony Corporation | Information processing device and information processing method |
US11227589B2 (en) | 2016-06-06 | 2022-01-18 | Apple Inc. | Intelligent list reading |
US11231904B2 (en) | 2015-03-06 | 2022-01-25 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US11237797B2 (en) | 2019-05-31 | 2022-02-01 | Apple Inc. | User activity shortcut suggestions |
US11269678B2 (en) | 2012-05-15 | 2022-03-08 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11314370B2 (en) | 2013-12-06 | 2022-04-26 | Apple Inc. | Method for extracting salient dialog usage from live data |
US11350253B2 (en) | 2011-06-03 | 2022-05-31 | Apple Inc. | Active transport based notifications |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
US11388291B2 (en) | 2013-03-14 | 2022-07-12 | Apple Inc. | System and method for processing voicemail |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11462215B2 (en) | 2018-09-28 | 2022-10-04 | Apple Inc. | Multi-modal inputs for voice commands |
US11468282B2 (en) | 2015-05-15 | 2022-10-11 | Apple Inc. | Virtual assistant in a communication session |
US11467802B2 (en) | 2017-05-11 | 2022-10-11 | Apple Inc. | Maintaining privacy of personal information |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11488406B2 (en) | 2019-09-25 | 2022-11-01 | Apple Inc. | Text detection using global geometry estimators |
US11495218B2 (en) | 2018-06-01 | 2022-11-08 | Apple Inc. | Virtual assistant operation in multi-device environments |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
US11638059B2 (en) | 2019-01-04 | 2023-04-25 | Apple Inc. | Content playback on multiple devices |
US11657813B2 (en) | 2019-05-31 | 2023-05-23 | Apple Inc. | Voice identification in digital assistant systems |
US11671920B2 (en) | 2007-04-03 | 2023-06-06 | Apple Inc. | Method and system for operating a multifunction portable electronic device using voice-activation |
US11696060B2 (en) | 2020-07-21 | 2023-07-04 | Apple Inc. | User identification using headphones |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11765209B2 (en) | 2020-05-11 | 2023-09-19 | Apple Inc. | Digital assistant hardware abstraction |
US11790914B2 (en) | 2019-06-01 | 2023-10-17 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11798547B2 (en) | 2013-03-15 | 2023-10-24 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
US11809483B2 (en) | 2015-09-08 | 2023-11-07 | Apple Inc. | Intelligent automated assistant for media search and playback |
US11838734B2 (en) | 2020-07-20 | 2023-12-05 | Apple Inc. | Multi-device audio adjustment coordination |
US11853536B2 (en) | 2015-09-08 | 2023-12-26 | Apple Inc. | Intelligent automated assistant in a media environment |
US11886805B2 (en) | 2015-11-09 | 2024-01-30 | Apple Inc. | Unconventional virtual assistant interactions |
CN117540326A (en) * | 2024-01-09 | 2024-02-09 | 深圳大学 | Construction status abnormality identification method and system for drill and blast tunnel construction equipment |
US11914848B2 (en) | 2020-05-11 | 2024-02-27 | Apple Inc. | Providing relevant data items based on context |
US12010262B2 (en) | 2013-08-06 | 2024-06-11 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
US12014118B2 (en) | 2017-05-15 | 2024-06-18 | Apple Inc. | Multi-modal interfaces having selection disambiguation and text modification capability |
US12051413B2 (en) | 2015-09-30 | 2024-07-30 | Apple Inc. | Intelligent device identification |
US12197817B2 (en) | 2016-06-11 | 2025-01-14 | Apple Inc. | Intelligent device arbitration and control |
US12223282B2 (en) | 2016-06-09 | 2025-02-11 | Apple Inc. | Intelligent automated assistant in a home environment |
Family Cites Families (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3472964A (en) * | 1965-12-29 | 1969-10-14 | Texas Instruments Inc | Vocal response synthesizer |
US5067158A (en) * | 1985-06-11 | 1991-11-19 | Texas Instruments Incorporated | Linear predictive residual representation via non-iterative spectral reconstruction |
US5940791A (en) * | 1997-05-09 | 1999-08-17 | Washington University | Method and apparatus for speech analysis and synthesis using lattice ladder notch filters |
US6266638B1 (en) * | 1999-03-30 | 2001-07-24 | At&T Corp | Voice quality compensation system for speech synthesis based on unit-selection speech database |
US6725190B1 (en) * | 1999-11-02 | 2004-04-20 | International Business Machines Corporation | Method and system for speech reconstruction from speech recognition features, pitch and voicing with resampled basis functions providing reconstruction of the spectral envelope |
US6430522B1 (en) * | 2000-03-27 | 2002-08-06 | The United States Of America As Represented By The Secretary Of The Navy | Enhanced model identification in signal processing using arbitrary exponential functions |
US20020026253A1 (en) * | 2000-06-02 | 2002-02-28 | Rajan Jebu Jacob | Speech processing apparatus |
CN1156819C (en) * | 2001-04-06 | 2004-07-07 | International Business Machines Corporation | A Method of Generating Personalized Speech from Text |
US7103539B2 (en) | 2001-11-08 | 2006-09-05 | Global IP Sound Europe AB | Enhanced coded speech |
US7092567B2 (en) * | 2002-11-04 | 2006-08-15 | Matsushita Electric Industrial Co., Ltd. | Post-processing system and method for correcting machine recognized text |
US8005677B2 (en) * | 2003-05-09 | 2011-08-23 | Cisco Technology, Inc. | Source-dependent text-to-speech system |
KR100612843B1 (en) | 2004-02-28 | 2006-08-14 | 삼성전자주식회사 | Probability Density Compensation Method, Consequent Speech Recognition Method and Apparatus for Hidden Markov Models |
FR2868586A1 (en) * | 2004-03-31 | 2005-10-07 | France Telecom | IMPROVED METHOD AND SYSTEM FOR CONVERTING A VOICE SIGNAL |
US8073147B2 (en) * | 2005-11-15 | 2011-12-06 | Nec Corporation | Dereverberation method, apparatus, and program for dereverberation |
WO2008033095A1 (en) * | 2006-09-15 | 2008-03-20 | Agency For Science, Technology And Research | Apparatus and method for speech utterance verification |
US8024193B2 (en) * | 2006-10-10 | 2011-09-20 | Apple Inc. | Methods and apparatus related to pruning for concatenative text-to-speech synthesis |
US8321222B2 (en) * | 2007-08-14 | 2012-11-27 | Nuance Communications, Inc. | Synthesis by generation and concatenation of multi-form segments |
US8244534B2 (en) * | 2007-08-20 | 2012-08-14 | Microsoft Corporation | HMM-based bilingual (Mandarin-English) TTS techniques |
JP5457706B2 (en) * | 2009-03-30 | 2014-04-02 | 株式会社東芝 | Speech model generation device, speech synthesis device, speech model generation program, speech synthesis program, speech model generation method, and speech synthesis method |
US9031834B2 (en) * | 2009-09-04 | 2015-05-12 | Nuance Communications, Inc. | Speech enhancement techniques on the power spectrum |
GB2478314B (en) * | 2010-03-02 | 2012-09-12 | Toshiba Res Europ Ltd | A speech processor, a speech processing method and a method of training a speech processor |
US8757490B2 (en) * | 2010-06-11 | 2014-06-24 | Josef Bigun | Method and apparatus for encoding and reading optical machine-readable data codes |
2011
- 2011-07-07 US US13/177,577 patent/US8682670B2/en not_active Expired - Fee Related

2012
- 2012-06-28 WO PCT/IB2012/053270 patent/WO2013011397A1/en active Application Filing
- 2012-06-28 DE DE112012002524.5T patent/DE112012002524B4/en not_active Expired - Fee Related
- 2012-06-28 JP JP2014518027A patent/JP2014522998A/en active Pending
- 2012-06-28 GB GB1400493.1A patent/GB2507674B/en not_active Expired - Fee Related
- 2012-06-28 CN CN201280033177.0A patent/CN103635960B/en not_active Expired - Fee Related
Cited By (259)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11928604B2 (en) | 2005-09-08 | 2024-03-12 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US10318871B2 (en) | 2005-09-08 | 2019-06-11 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US11979836B2 (en) | 2007-04-03 | 2024-05-07 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US11671920B2 (en) | 2007-04-03 | 2023-06-06 | Apple Inc. | Method and system for operating a multifunction portable electronic device using voice-activation |
US11023513B2 (en) | 2007-12-20 | 2021-06-01 | Apple Inc. | Method and apparatus for searching using an active ontology |
US10381016B2 (en) | 2008-01-03 | 2019-08-13 | Apple Inc. | Methods and apparatus for altering audio output signals |
US9865248B2 (en) | 2008-04-05 | 2018-01-09 | Apple Inc. | Intelligent text-to-speech conversion |
US10108612B2 (en) | 2008-07-31 | 2018-10-23 | Apple Inc. | Mobile device having human language translation capability with positional feedback |
US11348582B2 (en) | 2008-10-02 | 2022-05-31 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US10643611B2 (en) | 2008-10-02 | 2020-05-05 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US11900936B2 (en) | 2008-10-02 | 2024-02-13 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US11080012B2 (en) | 2009-06-05 | 2021-08-03 | Apple Inc. | Interface for a virtual digital assistant |
US10795541B2 (en) | 2009-06-05 | 2020-10-06 | Apple Inc. | Intelligent organization of tasks items |
US10741185B2 (en) | 2010-01-18 | 2020-08-11 | Apple Inc. | Intelligent automated assistant |
US12165635B2 (en) | 2010-01-18 | 2024-12-10 | Apple Inc. | Intelligent automated assistant |
US11423886B2 (en) | 2010-01-18 | 2022-08-23 | Apple Inc. | Task flow identification based on user intent |
US12087308B2 (en) | 2010-01-18 | 2024-09-10 | Apple Inc. | Intelligent automated assistant |
US10706841B2 (en) | 2010-01-18 | 2020-07-07 | Apple Inc. | Task flow identification based on user intent |
US10049675B2 (en) | 2010-02-25 | 2018-08-14 | Apple Inc. | User profiling for voice input processing |
US10692504B2 (en) | 2010-02-25 | 2020-06-23 | Apple Inc. | User profiling for voice input processing |
US10417405B2 (en) | 2011-03-21 | 2019-09-17 | Apple Inc. | Device access using voice authentication |
US11120372B2 (en) | 2011-06-03 | 2021-09-14 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US11350253B2 (en) | 2011-06-03 | 2022-05-31 | Apple Inc. | Active transport based notifications |
US20210073611A1 (en) * | 2011-08-10 | 2021-03-11 | Konlanbi | Dynamic data structures for data-driven modeling |
US12210951B2 (en) * | 2011-08-10 | 2025-01-28 | Konlanbi | Dynamic data structures for data-driven modeling |
US11069336B2 (en) | 2012-03-02 | 2021-07-20 | Apple Inc. | Systems and methods for name pronunciation |
US11321116B2 (en) | 2012-05-15 | 2022-05-03 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US11269678B2 (en) | 2012-05-15 | 2022-03-08 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US10079014B2 (en) | 2012-06-08 | 2018-09-18 | Apple Inc. | Name recognition system |
US9971774B2 (en) | 2012-09-19 | 2018-05-15 | Apple Inc. | Voice-based media searching |
US20140156280A1 (en) * | 2012-11-30 | 2014-06-05 | Kabushiki Kaisha Toshiba | Speech processing system |
US9466285B2 (en) * | 2012-11-30 | 2016-10-11 | Kabushiki Kaisha Toshiba | Speech processing system |
US11862186B2 (en) | 2013-02-07 | 2024-01-02 | Apple Inc. | Voice trigger for a digital assistant |
US10978090B2 (en) | 2013-02-07 | 2021-04-13 | Apple Inc. | Voice trigger for a digital assistant |
US10714117B2 (en) | 2013-02-07 | 2020-07-14 | Apple Inc. | Voice trigger for a digital assistant |
US12009007B2 (en) | 2013-02-07 | 2024-06-11 | Apple Inc. | Voice trigger for a digital assistant |
US11636869B2 (en) | 2013-02-07 | 2023-04-25 | Apple Inc. | Voice trigger for a digital assistant |
US12277954B2 (en) | 2013-02-07 | 2025-04-15 | Apple Inc. | Voice trigger for a digital assistant |
US11557310B2 (en) | 2013-02-07 | 2023-01-17 | Apple Inc. | Voice trigger for a digital assistant |
US11388291B2 (en) | 2013-03-14 | 2022-07-12 | Apple Inc. | System and method for processing voicemail |
US11798547B2 (en) | 2013-03-15 | 2023-10-24 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
US9966060B2 (en) | 2013-06-07 | 2018-05-08 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
US10657961B2 (en) | 2013-06-08 | 2020-05-19 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US10769385B2 (en) | 2013-06-09 | 2020-09-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
US11048473B2 (en) | 2013-06-09 | 2021-06-29 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US12073147B2 (en) | 2013-06-09 | 2024-08-27 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US11727219B2 (en) | 2013-06-09 | 2023-08-15 | Apple Inc. | System and method for inferring user intent from speech inputs |
US12010262B2 (en) | 2013-08-06 | 2024-06-11 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
US11314370B2 (en) | 2013-12-06 | 2022-04-26 | Apple Inc. | Method for extracting salient dialog usage from live data |
US12067990B2 (en) | 2014-05-30 | 2024-08-20 | Apple Inc. | Intelligent assistant for home automation |
US10714095B2 (en) | 2014-05-30 | 2020-07-14 | Apple Inc. | Intelligent assistant for home automation |
US10878809B2 (en) | 2014-05-30 | 2020-12-29 | Apple Inc. | Multi-command single utterance input method |
US12118999B2 (en) | 2014-05-30 | 2024-10-15 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US11670289B2 (en) | 2014-05-30 | 2023-06-06 | Apple Inc. | Multi-command single utterance input method |
US10657966B2 (en) | 2014-05-30 | 2020-05-19 | Apple Inc. | Better resolution when referencing to concepts |
US11810562B2 (en) | 2014-05-30 | 2023-11-07 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US10497365B2 (en) | 2014-05-30 | 2019-12-03 | Apple Inc. | Multi-command single utterance input method |
US10417344B2 (en) | 2014-05-30 | 2019-09-17 | Apple Inc. | Exemplar-based natural language processing |
US11133008B2 (en) | 2014-05-30 | 2021-09-28 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US10699717B2 (en) | 2014-05-30 | 2020-06-30 | Apple Inc. | Intelligent assistant for home automation |
US11257504B2 (en) | 2014-05-30 | 2022-02-22 | Apple Inc. | Intelligent assistant for home automation |
US10083690B2 (en) | 2014-05-30 | 2018-09-25 | Apple Inc. | Better resolution when referencing to concepts |
US11699448B2 (en) | 2014-05-30 | 2023-07-11 | Apple Inc. | Intelligent assistant for home automation |
US12200297B2 (en) | 2014-06-30 | 2025-01-14 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10904611B2 (en) | 2014-06-30 | 2021-01-26 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US11838579B2 (en) | 2014-06-30 | 2023-12-05 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US11516537B2 (en) | 2014-06-30 | 2022-11-29 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10431204B2 (en) | 2014-09-11 | 2019-10-01 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10390213B2 (en) | 2014-09-30 | 2019-08-20 | Apple Inc. | Social reminders |
US10438595B2 (en) | 2014-09-30 | 2019-10-08 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US9986419B2 (en) | 2014-09-30 | 2018-05-29 | Apple Inc. | Social reminders |
US10453443B2 (en) | 2014-09-30 | 2019-10-22 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US11231904B2 (en) | 2015-03-06 | 2022-01-25 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US10529332B2 (en) | 2015-03-08 | 2020-01-07 | Apple Inc. | Virtual assistant activation |
US10930282B2 (en) | 2015-03-08 | 2021-02-23 | Apple Inc. | Competing devices responding to voice triggers |
US12236952B2 (en) | 2015-03-08 | 2025-02-25 | Apple Inc. | Virtual assistant activation |
US10311871B2 (en) | 2015-03-08 | 2019-06-04 | Apple Inc. | Competing devices responding to voice triggers |
US11087759B2 (en) | 2015-03-08 | 2021-08-10 | Apple Inc. | Virtual assistant activation |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US11842734B2 (en) | 2015-03-08 | 2023-12-12 | Apple Inc. | Virtual assistant activation |
US11468282B2 (en) | 2015-05-15 | 2022-10-11 | Apple Inc. | Virtual assistant in a communication session |
US12001933B2 (en) | 2015-05-15 | 2024-06-04 | Apple Inc. | Virtual assistant in a communication session |
US12154016B2 (en) | 2015-05-15 | 2024-11-26 | Apple Inc. | Virtual assistant in a communication session |
US11127397B2 (en) | 2015-05-27 | 2021-09-21 | Apple Inc. | Device voice control |
US11070949B2 (en) | 2015-05-27 | 2021-07-20 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display |
US10356243B2 (en) | 2015-06-05 | 2019-07-16 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10681212B2 (en) | 2015-06-05 | 2020-06-09 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US11010127B2 (en) | 2015-06-29 | 2021-05-18 | Apple Inc. | Virtual assistant for media playback |
US11947873B2 (en) | 2015-06-29 | 2024-04-02 | Apple Inc. | Virtual assistant for media playback |
US11954405B2 (en) | 2015-09-08 | 2024-04-09 | Apple Inc. | Zero latency digital assistant |
US11500672B2 (en) | 2015-09-08 | 2022-11-15 | Apple Inc. | Distributed personal assistant |
US11550542B2 (en) | 2015-09-08 | 2023-01-10 | Apple Inc. | Zero latency digital assistant |
US11853536B2 (en) | 2015-09-08 | 2023-12-26 | Apple Inc. | Intelligent automated assistant in a media environment |
US12204932B2 (en) | 2015-09-08 | 2025-01-21 | Apple Inc. | Distributed personal assistant |
US11809483B2 (en) | 2015-09-08 | 2023-11-07 | Apple Inc. | Intelligent automated assistant for media search and playback |
US11126400B2 (en) | 2015-09-08 | 2021-09-21 | Apple Inc. | Zero latency digital assistant |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US9697820B2 (en) * | 2015-09-24 | 2017-07-04 | Apple Inc. | Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks |
US12051413B2 (en) | 2015-09-30 | 2024-07-30 | Apple Inc. | Intelligent device identification |
US11809886B2 (en) | 2015-11-06 | 2023-11-07 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11526368B2 (en) | 2015-11-06 | 2022-12-13 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11886805B2 (en) | 2015-11-09 | 2024-01-30 | Apple Inc. | Unconventional virtual assistant interactions |
US10354652B2 (en) | 2015-12-02 | 2019-07-16 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10942703B2 (en) | 2015-12-23 | 2021-03-09 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US11853647B2 (en) | 2015-12-23 | 2023-12-26 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US11227589B2 (en) | 2016-06-06 | 2022-01-18 | Apple Inc. | Intelligent list reading |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
US11069347B2 (en) | 2016-06-08 | 2021-07-20 | Apple Inc. | Intelligent automated assistant for media exploration |
US10354011B2 (en) | 2016-06-09 | 2019-07-16 | Apple Inc. | Intelligent automated assistant in a home environment |
US12223282B2 (en) | 2016-06-09 | 2025-02-11 | Apple Inc. | Intelligent automated assistant in a home environment |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10733993B2 (en) | 2016-06-10 | 2020-08-04 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US11657820B2 (en) | 2016-06-10 | 2023-05-23 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US11037565B2 (en) | 2016-06-10 | 2021-06-15 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US12175977B2 (en) | 2016-06-10 | 2024-12-24 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10089072B2 (en) | 2016-06-11 | 2018-10-02 | Apple Inc. | Intelligent device arbitration and control |
US10942702B2 (en) | 2016-06-11 | 2021-03-09 | Apple Inc. | Intelligent device arbitration and control |
US11749275B2 (en) | 2016-06-11 | 2023-09-05 | Apple Inc. | Application integration with a digital assistant |
US12293763B2 (en) | 2016-06-11 | 2025-05-06 | Apple Inc. | Application integration with a digital assistant |
US10269345B2 (en) | 2016-06-11 | 2019-04-23 | Apple Inc. | Intelligent task discovery |
US10297253B2 (en) | 2016-06-11 | 2019-05-21 | Apple Inc. | Application integration with a digital assistant |
US10521466B2 (en) | 2016-06-11 | 2019-12-31 | Apple Inc. | Data driven natural language event detection and classification |
US11809783B2 (en) | 2016-06-11 | 2023-11-07 | Apple Inc. | Intelligent device arbitration and control |
US10580409B2 (en) | 2016-06-11 | 2020-03-03 | Apple Inc. | Application integration with a digital assistant |
US12197817B2 (en) | 2016-06-11 | 2025-01-14 | Apple Inc. | Intelligent device arbitration and control |
US11152002B2 (en) | 2016-06-11 | 2021-10-19 | Apple Inc. | Application integration with a digital assistant |
US11217266B2 (en) * | 2016-06-21 | 2022-01-04 | Sony Corporation | Information processing device and information processing method |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US10553215B2 (en) | 2016-09-23 | 2020-02-04 | Apple Inc. | Intelligent automated assistant |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US11656884B2 (en) | 2017-01-09 | 2023-05-23 | Apple Inc. | Application integration with a digital assistant |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
US12260234B2 (en) | 2017-01-09 | 2025-03-25 | Apple Inc. | Application integration with a digital assistant |
US10475438B1 (en) * | 2017-03-02 | 2019-11-12 | Amazon Technologies, Inc. | Contextual text-to-speech processing |
US11443733B2 (en) * | 2017-03-02 | 2022-09-13 | Amazon Technologies, Inc. | Contextual text-to-speech processing |
US10332518B2 (en) | 2017-05-09 | 2019-06-25 | Apple Inc. | User interface for correcting recognition errors |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
US10741181B2 (en) | 2017-05-09 | 2020-08-11 | Apple Inc. | User interface for correcting recognition errors |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
US11599331B2 (en) | 2017-05-11 | 2023-03-07 | Apple Inc. | Maintaining privacy of personal information |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US11467802B2 (en) | 2017-05-11 | 2022-10-11 | Apple Inc. | Maintaining privacy of personal information |
US10755703B2 (en) | 2017-05-11 | 2020-08-25 | Apple Inc. | Offline personal assistant |
US10847142B2 (en) | 2017-05-11 | 2020-11-24 | Apple Inc. | Maintaining privacy of personal information |
US11538469B2 (en) | 2017-05-12 | 2022-12-27 | Apple Inc. | Low-latency intelligent automated assistant |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
US11405466B2 (en) | 2017-05-12 | 2022-08-02 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US10410637B2 (en) | 2017-05-12 | 2019-09-10 | Apple Inc. | User-specific acoustic models |
US11380310B2 (en) | 2017-05-12 | 2022-07-05 | Apple Inc. | Low-latency intelligent automated assistant |
US11862151B2 (en) | 2017-05-12 | 2024-01-02 | Apple Inc. | Low-latency intelligent automated assistant |
US10791176B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US10789945B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Low-latency intelligent automated assistant |
US11837237B2 (en) | 2017-05-12 | 2023-12-05 | Apple Inc. | User-specific acoustic models |
US11580990B2 (en) | 2017-05-12 | 2023-02-14 | Apple Inc. | User-specific acoustic models |
US12014118B2 (en) | 2017-05-15 | 2024-06-18 | Apple Inc. | Multi-modal interfaces having selection disambiguation and text modification capability |
US10482874B2 (en) | 2017-05-15 | 2019-11-19 | Apple Inc. | Hierarchical belief states for digital assistants |
US10810274B2 (en) | 2017-05-15 | 2020-10-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
US10303715B2 (en) | 2017-05-16 | 2019-05-28 | Apple Inc. | Intelligent automated assistant for media exploration |
US12026197B2 (en) | 2017-05-16 | 2024-07-02 | Apple Inc. | Intelligent automated assistant for media exploration |
US11675829B2 (en) | 2017-05-16 | 2023-06-13 | Apple Inc. | Intelligent automated assistant for media exploration |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
US11217255B2 (en) | 2017-05-16 | 2022-01-04 | Apple Inc. | Far-field extension for digital assistant services |
US10748546B2 (en) | 2017-05-16 | 2020-08-18 | Apple Inc. | Digital assistant services based on device capabilities |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
US12254887B2 (en) | 2017-05-16 | 2025-03-18 | Apple Inc. | Far-field extension of digital assistant services for providing a notification of an event to a user |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US10909171B2 (en) | 2017-05-16 | 2021-02-02 | Apple Inc. | Intelligent automated assistant for media exploration |
US11244670B2 (en) * | 2017-05-18 | 2022-02-08 | Telepathy Labs, Inc. | Artificial intelligence-based text-to-speech system and method |
US20190304435A1 (en) * | 2017-05-18 | 2019-10-03 | Telepathy Labs, Inc. | Artificial intelligence-based text-to-speech system and method |
US11244669B2 (en) * | 2017-05-18 | 2022-02-08 | Telepathy Labs, Inc. | Artificial intelligence-based text-to-speech system and method |
US20190304434A1 (en) * | 2017-05-18 | 2019-10-03 | Telepathy Labs, Inc. | Artificial intelligence-based text-to-speech system and method |
US12118980B2 (en) * | 2017-05-18 | 2024-10-15 | Telepathy Labs, Inc. | Artificial intelligence-based text-to-speech system and method |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US11710482B2 (en) | 2018-03-26 | 2023-07-25 | Apple Inc. | Natural assistant interaction |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US12211502B2 (en) | 2018-03-26 | 2025-01-28 | Apple Inc. | Natural assistant interaction |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US11854539B2 (en) | 2018-05-07 | 2023-12-26 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11907436B2 (en) | 2018-05-07 | 2024-02-20 | Apple Inc. | Raise to speak |
US11900923B2 (en) | 2018-05-07 | 2024-02-13 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11169616B2 (en) | 2018-05-07 | 2021-11-09 | Apple Inc. | Raise to speak |
US11487364B2 (en) | 2018-05-07 | 2022-11-01 | Apple Inc. | Raise to speak |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
US11630525B2 (en) | 2018-06-01 | 2023-04-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US10984798B2 (en) | 2018-06-01 | 2021-04-20 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11495218B2 (en) | 2018-06-01 | 2022-11-08 | Apple Inc. | Virtual assistant operation in multi-device environments |
US12080287B2 (en) | 2018-06-01 | 2024-09-03 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
US10684703B2 (en) | 2018-06-01 | 2020-06-16 | Apple Inc. | Attention aware virtual assistant dismissal |
US11009970B2 (en) | 2018-06-01 | 2021-05-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US12067985B2 (en) | 2018-06-01 | 2024-08-20 | Apple Inc. | Virtual assistant operations in multi-device environments |
US12061752B2 (en) | 2018-06-01 | 2024-08-13 | Apple Inc. | Attention aware virtual assistant dismissal |
US10403283B1 (en) | 2018-06-01 | 2019-09-03 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11431642B2 (en) | 2018-06-01 | 2022-08-30 | Apple Inc. | Variable latency device coordination |
US11360577B2 (en) | 2018-06-01 | 2022-06-14 | Apple Inc. | Attention aware virtual assistant dismissal |
US10720160B2 (en) | 2018-06-01 | 2020-07-21 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
US10504518B1 (en) | 2018-06-03 | 2019-12-10 | Apple Inc. | Accelerated task performance |
US10496705B1 (en) | 2018-06-03 | 2019-12-03 | Apple Inc. | Accelerated task performance |
US10944859B2 (en) | 2018-06-03 | 2021-03-09 | Apple Inc. | Accelerated task performance |
US11010561B2 (en) | 2018-09-27 | 2021-05-18 | Apple Inc. | Sentiment prediction from textual data |
US11170166B2 (en) | 2018-09-28 | 2021-11-09 | Apple Inc. | Neural typographical error modeling via generative adversarial networks |
US11893992B2 (en) | 2018-09-28 | 2024-02-06 | Apple Inc. | Multi-modal inputs for voice commands |
US11462215B2 (en) | 2018-09-28 | 2022-10-04 | Apple Inc. | Multi-modal inputs for voice commands |
US10839159B2 (en) | 2018-09-28 | 2020-11-17 | Apple Inc. | Named entity normalization in a spoken dialog system |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
US11638059B2 (en) | 2019-01-04 | 2023-04-25 | Apple Inc. | Content playback on multiple devices |
US12136419B2 (en) | 2019-03-18 | 2024-11-05 | Apple Inc. | Multimodality in digital assistant systems |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
US11783815B2 (en) | 2019-03-18 | 2023-10-10 | Apple Inc. | Multimodality in digital assistant systems |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11217251B2 (en) | 2019-05-06 | 2022-01-04 | Apple Inc. | Spoken notifications |
US11705130B2 (en) | 2019-05-06 | 2023-07-18 | Apple Inc. | Spoken notifications |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US12216894B2 (en) | 2019-05-06 | 2025-02-04 | Apple Inc. | User configurable task triggers |
US11675491B2 (en) | 2019-05-06 | 2023-06-13 | Apple Inc. | User configurable task triggers |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US12154571B2 (en) | 2019-05-06 | 2024-11-26 | Apple Inc. | Spoken notifications |
US11888791B2 (en) | 2019-05-21 | 2024-01-30 | Apple Inc. | Providing message response suggestions |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11360739B2 (en) | 2019-05-31 | 2022-06-14 | Apple Inc. | User activity shortcut suggestions |
US11237797B2 (en) | 2019-05-31 | 2022-02-01 | Apple Inc. | User activity shortcut suggestions |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11657813B2 (en) | 2019-05-31 | 2023-05-23 | Apple Inc. | Voice identification in digital assistant systems |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US11790914B2 (en) | 2019-06-01 | 2023-10-17 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11488406B2 (en) | 2019-09-25 | 2022-11-01 | Apple Inc. | Text detection using global geometry estimators |
US11924254B2 (en) | 2020-05-11 | 2024-03-05 | Apple Inc. | Digital assistant hardware abstraction |
US12197712B2 (en) | 2020-05-11 | 2025-01-14 | Apple Inc. | Providing relevant data items based on context |
US11765209B2 (en) | 2020-05-11 | 2023-09-19 | Apple Inc. | Digital assistant hardware abstraction |
US11914848B2 (en) | 2020-05-11 | 2024-02-27 | Apple Inc. | Providing relevant data items based on context |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11838734B2 (en) | 2020-07-20 | 2023-12-05 | Apple Inc. | Multi-device audio adjustment coordination |
US12219314B2 (en) | 2020-07-21 | 2025-02-04 | Apple Inc. | User identification using headphones |
US11750962B2 (en) | 2020-07-21 | 2023-09-05 | Apple Inc. | User identification using headphones |
US11696060B2 (en) | 2020-07-21 | 2023-07-04 | Apple Inc. | User identification using headphones |
CN117540326A (en) * | 2024-01-09 | 2024-02-09 | 深圳大学 | Construction status abnormality identification method and system for drill and blast tunnel construction equipment |
Also Published As
Publication number | Publication date |
---|---|
CN103635960A (en) | 2014-03-12 |
GB2507674B (en) | 2015-04-08 |
US8682670B2 (en) | 2014-03-25 |
CN103635960B (en) | 2016-04-13 |
GB201400493D0 (en) | 2014-02-26 |
GB2507674A (en) | 2014-05-07 |
WO2013011397A1 (en) | 2013-01-24 |
JP2014522998A (en) | 2014-09-08 |
DE112012002524T5 (en) | 2014-03-13 |
DE112012002524B4 (en) | 2018-05-30 |
Similar Documents
Publication | Title
---|---|
US8682670B2 (en) | Statistical enhancement of speech output from a statistical text-to-speech synthesis system |
CN109523989B (en) | Speech synthesis method, speech synthesis device, storage medium, and electronic apparatus | |
CN111161702B (en) | Personalized speech synthesis method and device, electronic equipment and storage medium | |
US9031834B2 (en) | Speech enhancement techniques on the power spectrum | |
US20140114663A1 (en) | Guided speaker adaptive speech synthesis system and method and computer program product | |
Yapanel et al. | A new perceptually motivated MVDR-based acoustic front-end (PMVDR) for robust automatic speech recognition | |
US20080243508A1 (en) | Prosody-pattern generating apparatus, speech synthesizing apparatus, and computer program product and method thereof | |
US9607610B2 (en) | Devices and methods for noise modulation in a universal vocoder synthesizer | |
EP0970466A2 (en) | Voice conversion system and methodology | |
JP2016537662A (en) | Bandwidth extension method and apparatus | |
US8280724B2 (en) | Speech synthesis using complex spectral modeling | |
KR102198598B1 (en) | Method for generating synthesized speech signal, neural vocoder, and training method thereof | |
US9922662B2 (en) | Coherently-modified speech signal generation by time-dependent scaling of intensity of a pitch-modified utterance | |
CN113421584B (en) | Audio noise reduction method, device, computer equipment and storage medium | |
CN110930975B (en) | Method and device for outputting information | |
JP5807921B2 (en) | Quantitative F0 pattern generation device and method, model learning device for F0 pattern generation, and computer program | |
CN113345410A (en) | Training method of general speech and target speech synthesis model and related device | |
CN116543778A (en) | Vocoder training method, audio synthesis method, medium, device and computing equipment | |
CN114203155A (en) | Method and apparatus for training vocoder and speech synthesis | |
US20250124934A1 (en) | Multi-lag format for audio coding | |
CN116129854A (en) | Speech synthesis method and device, and training method and device of speech synthesis model | |
CN119649791A (en) | Speech synthesis method, model training method and related device | |
Guner et al. | A small footprint hybrid statistical/unit selection text-to-speech synthesis system for agglutinative languages | |
CN119763540A (en) | Audio synthesis method, audio synthesis model training method and related device | |
Wen et al. | Statistical modification based post-filtering technique for HMM-based speech synthesis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: SHECHTMEN, SLAVA; SORIN, ALEXANDER. REEL/FRAME: 026552/0476. Effective date: 20110703 |
FEPP | Fee payment procedure | Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.) |
LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.) |
STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
FP | Lapsed due to failure to pay maintenance fee | Effective date: 20180325 |