CN104484319A - Methods and systems for automated text correction - Google Patents


Info

Publication number
CN104484319A
CN104484319A
Authority
CN
China
Prior art keywords
text
class
learning text
learning
input
Prior art date
Legal status
Pending
Application number
CN201410815655.4A
Other languages
Chinese (zh)
Inventor
丹尼尔·赫曼·理查德·戴梅尔
陆巍
黄伟道
Current Assignee
National University of Singapore
Original Assignee
National University of Singapore
Priority date
Filing date
Publication date
Application filed by National University of Singapore filed Critical National University of Singapore
Publication of CN104484319A

Classifications

    • G06F40/00 Handling natural language data
    • G06F40/274 Converting codes to words; Guess-ahead of partial word inputs
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes
    • G06F40/253 Grammatical analysis; Style critique


Abstract

The present embodiments demonstrate systems and methods for automated text correction. In certain embodiments, the methods and systems may be implemented through analysis according to a single text correction model. In a particular embodiment, the single text correction model may be generated through analysis of both a corpus of learner text and a corpus of non-learner text.

Description

Methods and Systems for Automated Text Correction

Divisional application of the following case:

Filing date: 2011-09-23

Application number: 2011800459619

Title of invention: Methods and Systems for Automated Text Correction

Technical Field

The present invention relates to methods and systems for automated text correction.

Background

Text correction is often difficult and time-consuming. In addition, editing text is often expensive, particularly when translation is involved, because editing typically requires skilled, trained staff. For example, editing a translation may require intensive labor by staff with a high level of proficiency in two or more languages.

Automated translation systems, such as some online translators, can reduce some of the labor-intensive aspects of translation, but they are still no substitute for human translators. In particular, automated systems do a relatively good job of word-for-word translation, but the meaning of a sentence is often unintelligible because of inaccuracies in grammar and punctuation.

Some automated text editing systems do exist, but such systems are often imprecise. In addition, prior art automated text editing systems may require a relatively large amount of processing resources.

Some automated text editing systems may require training or configuration in order to edit text accurately. For example, some prior art systems can be trained using an annotated corpus of learner text. Alternatively, some prior art systems can be trained using an unannotated corpus of non-learner text. One of ordinary skill in the art will recognize the difference between learner text and non-learner text.

The output of a standard automatic speech recognition (ASR) system typically consists of utterances in which important linguistic and structural information, such as true case, sentence boundaries, and punctuation, is unavailable. Such linguistic and structural information improves the readability of transcribed speech text and assists further downstream processing, such as part-of-speech (POS) tagging, parsing, information extraction, and machine translation.

Prior art punctuation prediction techniques use lexical and prosodic cues. However, prosodic features such as pitch and pause duration are typically unavailable without the original raw speech waveform. In some scenarios in which natural language processing (NLP) of transcribed speech text is the main concern, speech prosody information may not be readily available. In the evaluation campaigns of the International Workshop on Spoken Language Translation (IWSLT), only manually transcribed or automatically recognized speech text is provided, and the original raw speech waveforms are unavailable.

Conventionally, punctuation insertion is performed during speech recognition. In one example, prosodic features are used together with language model probabilities within a decision tree framework. In another example, insertion in the broadcast news domain is performed with finite state and multi-layer perceptron approaches to the task, in which prosodic and lexical information is incorporated. In a further example, a maximum entropy-based tagging approach performs punctuation insertion in spontaneous English conversation, including the use of lexical and prosodic features. In another example, sentence boundary detection is performed using conditional random fields (CRFs). Boundary detection shows an improvement over earlier methods based on hidden Markov models (HMMs).

Some prior art techniques treat the sentence boundary detection and punctuation insertion tasks as hidden event detection tasks. For example, an HMM may describe a joint distribution over words and inter-word events, where the observations are words and word/event pairs are encoded as hidden states. Specifically, in this task, word boundaries and punctuation marks are encoded as inter-word events. The training phase involves using smoothing techniques to train an n-gram language model over all observed words and events. The learned n-gram probability scores are then used as HMM state transition scores. During testing, the posterior probability of an event at each word is computed by dynamic programming using the forward-backward algorithm. The most probable sequence of states then forms the output, yielding a punctuated sentence. Such HMM-based approaches have several drawbacks.
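As a rough illustration of the hidden-event formulation above, the sketch below decodes one inter-word punctuation event per word with Viterbi decoding. All scores are invented stand-ins for the trained n-gram language model scores, and the event inventory and cue word are our own assumptions for illustration.

```python
# Toy hidden-event HMM for punctuation insertion. Observations are words;
# hidden states carry the inter-word event that follows each word.
# Viterbi decoding recovers the most likely event sequence.

EVENTS = ["NONE", "COMMA", "PERIOD"]

def transition_score(prev_event, event):
    # hypothetical log-scores: punctuation events are rarer than no event
    return {"NONE": 0.0, "COMMA": -1.5, "PERIOD": -2.0}[event]

def emission_score(word, event):
    # hypothetical cue: a discourse word like "however" favours an adjacent comma
    if word == "however" and event == "COMMA":
        return 2.0
    return 0.0

def viterbi(words):
    best = [{e: emission_score(words[0], e) for e in EVENTS}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for e in EVENTS:
            prev = max(EVENTS, key=lambda p: best[i - 1][p] + transition_score(p, e))
            best[i][e] = (best[i - 1][prev] + transition_score(prev, e)
                          + emission_score(words[i], e))
            back[i][e] = prev
    # backtrace from the best final state
    e = max(EVENTS, key=lambda x: best[-1][x])
    path = [e]
    for i in range(len(words) - 1, 0, -1):
        e = back[i][e]
        path.append(e)
    return list(reversed(path))

print(viterbi(["hello", "however", "world"]))
# → ['NONE', 'COMMA', 'NONE']
```

In a real system the transition and emission tables would come from the smoothed n-gram model trained over words and events, as described above.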

First, an n-gram language model can capture only the surrounding context information. However, punctuation insertion may require modeling longer-range dependencies. For example, the method cannot effectively capture the long-range dependency between an initial phrase such as "would you", which strongly indicates a question, and the ending question mark. Therefore, special techniques may be used in addition to the hidden event language model in order to handle long-range dependencies.

Examples of prior art techniques include relocating or duplicating punctuation marks to different positions in a sentence so that they appear closer to the indicative words (for example, "how much" indicates a question). One such technique suggests duplicating the ending punctuation mark at the beginning of each sentence before training the language model. Empirically, this technique has demonstrated its effectiveness in predicting question marks in English, because most indicative words for English questions appear at the beginning of the question. However, such techniques are specially designed and may not be widely applicable in general or to languages other than English. Further, directly applying the method may fail when each utterance contains multiple sentences without clearly annotated sentence boundaries within the utterance.
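The duplication trick described above amounts to a simple preprocessing pass over the training sentences; a minimal sketch follows (the token format is our own assumption):

```python
# Sketch of the preprocessing trick described above: before training the
# language model, copy each sentence's ending punctuation mark to its start,
# so the n-gram model sees the question mark near the indicative opening words.

END_PUNCT = {".", "?", "!"}

def copy_end_punct_to_front(tokens):
    # duplicate, not move: the ending mark is kept in place as well
    if tokens and tokens[-1] in END_PUNCT:
        return [tokens[-1]] + tokens
    return list(tokens)

sent = ["would", "you", "like", "tea", "?"]
print(copy_end_punct_to_front(sent))
# → ['?', 'would', 'you', 'like', 'tea', '?']
```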

Another drawback associated with such methods is that they encode an assumption of strong correlation between the punctuation mark to be inserted and its surrounding words. They therefore lack the robustness to handle situations in which noise or out-of-vocabulary (OOV) words occur frequently, such as in text automatically recognized by an ASR system.

Grammatical error correction (GEC) has been recognized as an interesting and commercially attractive problem in natural language processing (NLP), in particular for learners of English as a foreign or second language (EFL/ESL).

Despite growing interest, research has been hampered by the lack of a large annotated corpus of learner text available for research purposes. As a result, the standard approach to GEC has been to train an off-the-shelf classifier to re-predict words in non-learner text. Learning GEC models directly from an annotated learner corpus has not been well explored, nor have methods that combine learner and non-learner text. Further, evaluation of GEC has been problematic. Previous work has evaluated either on artificial test instances, as a surrogate for real learner errors, or on proprietary data unavailable to other researchers. As a result, existing methods cannot be compared on the same test set, making it unclear where the current state of the art actually stands.

The industry-standard approach to GEC is to build a statistical model that can select the most likely correction from a confusion set of possible correction choices. How the confusion set is defined depends on the type of error. Context-sensitive spelling error correction has traditionally focused on confusion sets with similar spelling (e.g., {dessert, desert}) or similar pronunciation (e.g., {there, their}). In other words, the words in the confusion set are deemed likely to be confused because of spelling or phonetic similarity. Other work in GEC defines confusion sets based on syntactic similarity; for example, all English articles or the most frequent English prepositions form a confusion set.
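The confusion-set approach described above can be sketched as follows: for each word that belongs to a known confusion set, a model scores every member of the set in context and proposes the highest-scoring one. The scorer below is a hypothetical stand-in, not a trained statistical model.

```python
# Sketch of confusion-set based correction. The confusion sets follow the
# examples in the text; the context scorer is an invented toy.

CONFUSION_SETS = [
    {"dessert", "desert"},
    {"there", "their"},
    {"in", "on", "at"},   # a preposition confusion set
]

def find_confusion_set(word):
    for s in CONFUSION_SETS:
        if word in s:
            return s
    return None

def correct(tokens, score):
    out = []
    for i, w in enumerate(tokens):
        s = find_confusion_set(w)
        if s is None:
            out.append(w)
        else:
            # propose the highest-scoring member of the confusion set
            out.append(max(sorted(s), key=lambda c: score(tokens, i, c)))
    return out

# Hypothetical context scorer: prefers "on" before a weekday, else keeps
# the writer's original word.
def toy_score(tokens, i, cand):
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    if cand == "on" and nxt == "Monday":
        return 1.0
    return 0.5 if cand == tokens[i] else 0.0

print(correct(["see", "you", "in", "Monday"], toy_score))
# → ['see', 'you', 'on', 'Monday']
```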

Summary of the Invention

The present embodiments demonstrate systems and methods for automated text correction. In certain embodiments, the methods and systems may be implemented through analysis according to a single text correction model. In a particular embodiment, the single text correction model may be generated through analysis of both a corpus of learner text and a corpus of non-learner text.

According to one embodiment, an apparatus includes at least one processor and a memory device coupled to the at least one processor. The at least one processor is configured to identify words of an input utterance. The at least one processor is also configured to place the words in a plurality of first nodes stored in the memory device. The at least one processor is further configured to assign a word-level label to each of the first nodes based in part on neighboring nodes in a linear chain. The at least one processor is also configured to generate an output sentence by combining the words from the plurality of first nodes with punctuation selected based in part on the word-level label assigned to each first node.

According to another embodiment, a computer program product includes a computer-readable medium having code to identify words of an input utterance. The medium also includes code to place the words in a plurality of first nodes stored in a memory device. The medium further includes code to assign a word-level label to each of the first nodes based in part on neighboring nodes of the plurality of first nodes. The medium also includes code to generate an output sentence by combining the words from the plurality of first nodes with punctuation selected based in part on the word-level label assigned to each first node.

According to another embodiment, a method includes identifying words of an input utterance. The method also includes placing the words in a plurality of first nodes stored in a memory device. The method further includes assigning a word-level label to each first node of the plurality of first nodes based in part on neighboring nodes of the plurality of first nodes. The method also includes generating an output sentence by combining the words from the plurality of first nodes with punctuation selected based in part on the word-level label assigned to each first node.
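The final combination step of the embodiments above, joining each word with the punctuation implied by its word-level label, is deterministic and can be sketched directly. The label inventory here is an assumption for illustration; the labels themselves would come from the chain model.

```python
# Sketch of combining words with the punctuation implied by each node's
# word-level label. The label set is our own illustrative assumption.

LABEL_TO_PUNCT = {"NONE": "", "COMMA": ",", "PERIOD": ".", "QMARK": "?"}

def build_sentence(words, labels):
    parts = []
    for w, lab in zip(words, labels):
        # append the punctuation mark (if any) directly after the word
        parts.append(w + LABEL_TO_PUNCT[lab])
    return " ".join(parts)

words = ["would", "you", "like", "tea"]
labels = ["NONE", "NONE", "NONE", "QMARK"]
print(build_sentence(words, labels))
# → would you like tea?
```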

An additional embodiment of a method includes receiving a natural language text input that includes a grammatical error, where a portion of the input text includes a class from a set of classes. The method may also include generating a plurality of selection tasks from a corpus of non-learner text that is assumed to be free of grammatical errors, where for each selection task a classifier re-predicts a class used in the non-learner text. Further, the method may include generating a plurality of correction tasks from a corpus of learner text, where for each correction task the classifier proposes a class to use in the learner text. Additionally, the method may include training a grammar correction model using a set of binary classification problems that includes the plurality of selection tasks and the plurality of correction tasks. The embodiment may also include using the trained grammar correction model to predict a class for the text input from the set of possible classes.
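The two kinds of training examples just described can be sketched for the article-error case: selection tasks re-predict the article observed in (assumed clean) non-learner text, while correction tasks pair the writer's article with the annotator's correction in learner text. The feature function below (previous word plus noun-phrase head) is a minimal stand-in, not the patent's feature set.

```python
# Sketch of building selection tasks (from non-learner text) and correction
# tasks (from annotated learner text) for article classification.

def np_features(prev_word, head_noun):
    # hypothetical feature function for a noun phrase
    return {"prev=" + prev_word, "head=" + head_noun}

def selection_tasks(nonlearner_nps):
    # items: (previous word, head noun, article observed in clean text)
    return [(np_features(p, h), article) for (p, h, article) in nonlearner_nps]

def correction_tasks(learner_nps):
    # items: (previous word, head noun, writer's article, annotated article);
    # the writer's possibly wrong article is itself a feature of learner text
    return [(np_features(p, h) | {"writer=" + w}, gold)
            for (p, h, w, gold) in learner_nps]

sel = selection_tasks([("saw", "dog", "a"), ("over", "fence", "the")])
cor = correction_tasks([("bought", "car", "", "a")])
```

Note that the learner-text examples live in a larger feature space than the non-learner ones, because they carry the writer's own choice as an extra feature, which matches the observation below that the two corpora have different feature spaces.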

In a further embodiment, the method includes outputting a suggestion to change the class of the text input to the predicted class if the predicted class differs from the class in the text input. In such an embodiment, the learner text is annotated by a teacher with classes assumed to be correct. The class may be an article associated with a noun phrase in the input text. The method may also include extracting feature functions for the classifier from noun phrases in the non-learner text and the learner text.

In another embodiment, the class is a preposition associated with a prepositional phrase in the input text. Such a method may include extracting feature functions for the classifier from prepositional phrases in the non-learner text and the learner text.

In one embodiment, the non-learner text and the learner text have different feature spaces, the feature space of the learner text including the word used by the writer. Training the grammar correction model may include minimizing a loss function over the training data. Training the grammar correction model may also include identifying a plurality of linear classifiers by analyzing the non-learner text. The linear classifiers further include weight factors, which are included in a matrix of weight factors.

In one embodiment, training the grammar correction model further includes performing singular value decomposition (SVD) on the matrix of weight factors. Training the grammar correction model may also include identifying combined weight values that represent a first weight-value element identified by analyzing the non-learner text and a second weight-value element identified by analyzing the learner text through minimization of an empirical risk function.
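The SVD step above can be sketched with NumPy: the weight matrix of the linear classifiers trained on non-learner text is factorized, and a low-rank part is kept as shared structure, on top of which a second component would be fit on the annotated learner text. The dimensions and random data here are toy values.

```python
# Sketch of SVD over the classifiers' weight matrix (toy dimensions).

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))          # stand-in for the weight-factor matrix

U, s, Vt = np.linalg.svd(W, full_matrices=False)

k = 2                                # keep the top-k singular directions
W_shared = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# W_shared is the rank-k part capturing structure shared across classifiers;
# a residual weight component would then be learned on the learner text by
# minimizing an empirical risk function.
print(W.shape, W_shared.shape)       # → (6, 4) (6, 4)
```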

An apparatus for automated text correction is also provided. The apparatus may include, for example, a processor configured to perform the steps of the methods described above.

Another embodiment of a method is provided. The method may include correcting semantic collocation errors. One embodiment of such a method includes automatically identifying one or more translation candidates in response to a corpus analysis of parallel language text performed at a processing device. Additionally, the method may include using the processing device to determine a feature associated with each translation candidate. The method may also include generating a set of one or more weight values from a corpus of learner text stored in a data storage device. The method may further include calculating, using the processing device, a score for each of the one or more translation candidates in response to the features associated with the one or more translation candidates and the set of one or more weight values.

In a further embodiment, identifying one or more translation candidates may include selecting a parallel corpus of texts from a database of parallel texts, each parallel text including a text in a first language and a corresponding text in a second language; using the processing device to segment the text in the first language; using the processing device to tokenize the text in the second language; using the processing device to automatically align words in the first text with words in the second text; using the processing device to extract phrases from the aligned words of the first text and the second text; and using the processing device to calculate probabilities of paraphrase matches associated with one or more phrases in the first text and one or more phrases in the second text.
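The paraphrase-match probability at the end of this pipeline is commonly computed by pivoting through the other language: two phrases e1 and e2 in the second language are related through the first-language phrases f that both align to, with p(e2 | e1) = Σ_f p(e2 | f) · p(f | e1). The tiny phrase tables below are invented for illustration; real tables would come from the extracted, aligned phrases.

```python
# Sketch of the pivot paraphrase probability. Both tables are invented toys:
# p_f_given_e maps (first-language phrase, second-language phrase) -> p(f|e),
# p_e_given_f maps (second-language phrase, first-language phrase) -> p(e|f).

p_f_given_e = {("cai yong", "employ"): 0.6, ("shi yong", "employ"): 0.4}
p_e_given_f = {("use", "cai yong"): 0.5, ("adopt", "cai yong"): 0.5,
               ("use", "shi yong"): 0.9, ("employ", "shi yong"): 0.1}

def paraphrase_prob(e1, e2):
    # p(e2 | e1) = sum over pivot phrases f of p(e2 | f) * p(f | e1)
    total = 0.0
    for (f, e), pfe in p_f_given_e.items():
        if e == e1:
            total += p_e_given_f.get((e2, f), 0.0) * pfe
    return total

print(round(paraphrase_prob("employ", "use"), 2))
# → 0.66
```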

In a particular embodiment, the feature associated with each translation candidate is a paraphrase match probability. The set of one or more weight values may be computed using a minimum error rate training (MERT) operation on a corpus of learner text.

The method may also include generating a phrase table of collocation corrections with features derived from spelling edit distance. In another embodiment, the method may include generating a phrase table of collocation corrections with features derived from a homophone dictionary. In yet another embodiment, the method may include generating a phrase table of collocation corrections with features derived from synonyms. Additionally, the method may include generating a phrase table of collocation corrections with features derived from paraphrases induced from the writer's native language.
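The first of the feature families above, spelling edit distance, is standard Levenshtein distance; a minimal sketch follows (the length normalization is our own choice, not the patent's).

```python
# Sketch of a spelling edit-distance feature for a candidate collocation
# correction, via classic dynamic-programming Levenshtein distance.

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_distance_feature(source_phrase, candidate):
    # normalize by the longer string so the feature lies in [0, 1]
    d = edit_distance(source_phrase, candidate)
    return d / max(len(source_phrase), len(candidate), 1)

print(edit_distance("concern", "concerned"))   # → 2
```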

In such an embodiment, the phrase table includes one or more penalty features for use in calculating the paraphrase match probability.

There is also provided an apparatus including at least one processor and a memory device coupled to the at least one processor, where the at least one processor is configured to perform the steps of the methods described above. There is also provided a tangible computer-readable medium including computer-readable code that, when executed by a computer, causes the computer to perform the operations of the methods described above.

The term "coupled" is defined as connected, although not necessarily directly, and not necessarily mechanically.

The terms "a" and "an" are defined as one or more unless this disclosure explicitly requires otherwise.

The term "substantially" and its variations are defined as being largely but not necessarily wholly what is specified, as understood by one of ordinary skill in the art, and in one non-limiting embodiment "substantially" refers to ranges within 10%, preferably within 5%, more preferably within 1%, and most preferably within 0.5% of what is specified.

The terms "comprise" (and any form of comprise, such as "comprises" and "comprising"), "have", "include" (and any form of include, such as "includes" and "including"), and "contain" (and any form of contain, such as "contains" and "containing") are open-ended linking verbs. As a result, a method or device that "comprises", "has", "includes", or "contains" one or more steps or elements possesses those one or more steps or elements, but is not limited to possessing only those steps or elements. Likewise, a step of a method or an element of a device that "comprises", "has", "includes", or "contains" one or more features possesses those one or more features, but is not limited to possessing only those one or more features. Furthermore, a device or structure that is configured in a certain way is configured in at least that way, but may also be configured in ways that are not listed. Other features and associated advantages will become apparent with reference to the following detailed description of specific embodiments in conjunction with the accompanying drawings.

Brief Description of the Drawings

The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present invention. The invention may be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein.

FIG. 1 is a block diagram illustrating a system for analyzing utterances according to one embodiment of the present disclosure;

FIG. 2 is a block diagram illustrating a data management system configured to store sentences according to one embodiment of the present disclosure;

FIG. 3 is a block diagram illustrating a computer system for analyzing utterances according to one embodiment of the present disclosure;

FIG. 4 is a block diagram illustrating a graphical representation of a linear-chain CRF;

FIG. 5 is an example labeling of a training sentence for a linear-chain conditional random field (CRF);

FIG. 6 is a block diagram illustrating a graphical representation of a two-layer factorial CRF;

FIG. 7 is an example labeling of a training sentence for a factorial conditional random field (CRF);

FIG. 8 is a flowchart illustrating one embodiment of a method for inserting punctuation into a sentence;

FIG. 9 is a flowchart illustrating one embodiment of a method for automatic grammatical error correction;

FIG. 10A is a graph illustrating the accuracy of one embodiment of a text correction model for correcting article errors;

FIG. 10B is a graph illustrating the accuracy of one embodiment of a text correction model for correcting preposition errors;

FIG. 11A is a graph illustrating the F1 measure of the present method for correcting article errors compared with a conventional method using the DeFelice feature set;

FIG. 11B is a graph illustrating the F1 measure of the present method for correcting article errors compared with a conventional method using the Han feature set;

FIG. 11C is a graph illustrating the F1 measure of the present method for correcting article errors compared with a conventional method using the Lee feature set;

FIG. 12A is a graph illustrating the F1 measure of the present method for correcting preposition errors compared with a conventional method using the DeFelice feature set;

FIG. 12B is a graph illustrating the F1 measure of the present method for correcting preposition errors compared with a conventional method using the TetreaultChunk feature set;

FIG. 12C is a graph illustrating the F1 measure of the present method for correcting preposition errors compared with a conventional method using the TetreaultParse feature set;

FIG. 13 is a flowchart illustrating one embodiment of a method for correcting semantic collocation errors.

Detailed Description

Various features and advantages are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known starting materials, processing techniques, components, and equipment are omitted so as not to unnecessarily obscure the invention in detail. It should be understood, however, that the detailed description and the specific examples, while indicating embodiments of the invention, are given by way of illustration only and not by way of limitation. Various substitutions, modifications, additions, and/or rearrangements within the spirit and/or scope of the underlying inventive concept will become apparent to those skilled in the art from this disclosure.

在本说明书中描述的某些单元已经被标记为模块,以便更为特别地强调它们的实现独立性。模块是“一种自包含硬件或软件组件,其与更大的系统交互”,艾伦弗里德曼,“The Computer Glossary”268(1998年,第8版)。模块包括机器或机器可执行指令。例如,模块可以被实现为硬件电路,包括定制的VLSI电路或门阵列,现成的半导体例如逻辑芯片、晶体管或其他分离组件。模块也可以被实现在可编程硬件器件中,例如现场可编程门阵列、可编程阵列逻辑、可编程逻辑器件或类似等。Certain elements described in this specification have been labeled as modules in order to more particularly emphasize their implementation independence. A module is "a self-contained hardware or software component that interacts with a larger system," Alan Friedman, "The Computer Glossary" 268 (8th ed., 1998). A module includes machine or machine-executable instructions. For example, a module may be implemented as a hardware circuit comprising custom VLSI circuits or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. A module may also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices, or the like.

模块也可以包括软件定义的单元或指令,当由处理机器或装置执行时,将存储在数据存储装置上的数据从第一状态转换到第二状态。可执行代码的标识模块可以例如包括计算机指令的一个或多个物理或逻辑块,其可以被组织为对象、过程或功能。不管怎样,标识模块的可执行文件不需要物理上在一起,而是可以包括存储在不同位置中的分离指令,其在逻辑上连接在一起时包括模块,并且当由处理器执行时,实现声明的数据转换。A module may also include software-defined units or instructions that, when executed by a processing machine or device, transform data stored on the data storage device from a first state to a second state. An identified module of executable code may, for instance, comprise one or more physical or logical blocks of computer instructions, which may be organized as an object, procedure, or function. However, the executables that identify the modules need not be physically together, but may include separate instructions stored in different locations that, when logically connected together, comprise the modules and, when executed by a processor, implement the statement data conversion.

Indeed, a module of executable code may be a single instruction, or many instructions, and may even be distributed over several different code segments, among different programs, and across several memory devices. Similarly, operational data may be identified and illustrated herein within modules, and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations, including over different storage devices.

In the following description, numerous specific details are provided, such as examples of programming, software modules, user selections, network transactions, database queries, database structures, hardware modules, hardware circuits, hardware chips, etc., to provide a thorough understanding of the present embodiments. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, materials, and so forth. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.

Figure 1 illustrates one embodiment of a system 100 for automated text and speech editing. The system 100 may include a server 102, a data storage device 106, a network 108, and a user interface device 110. In one particular embodiment, the system 100 may include a memory controller 104, or memory server, configured to manage the transfer of data between the data storage device 106 and the server 102 or other components in communication with the network 108. In an alternative embodiment, the memory controller 104 may be coupled to the network 108.

In one embodiment, the user interface device 110 is referred to broadly and is intended to encompass a suitable processor-based device such as a desktop computer, a laptop computer, a personal digital assistant (PDA) or tablet computer, a smartphone, or another mobile communication or organizer device having access to the network 108. In a further embodiment, the user interface device 110 may access the Internet or another wide area or local area network to access a web application or web service hosted by the server 102, and may provide a user interface for enabling a user to enter or receive information. For example, a user may enter input utterances or text into the system 100 through a microphone (not shown) or a keyboard 320.

The network 108 may facilitate the transfer of data between the server 102 and the user interface device 110. The network 108 may include any type of communications network including, but not limited to, a direct PC-to-PC connection, a local area network (LAN), a wide area network (WAN), a modem-to-modem connection, the Internet, a combination of the above, or any other communications network now known or later developed within the networking arts that permits two or more computers to communicate with one another.

In one embodiment, the server 102 is configured to store input utterances and/or input text. Additionally, the server may access data stored in the data storage device 106 via a storage area network (SAN), a LAN, a data bus, or the like.

The data storage device 106 may include a hard disk, including hard disks arranged in a redundant array of independent disks (RAID) array, a tape storage drive comprising a magnetic tape data storage device, an optical storage device, or the like. In one embodiment, the data storage device 106 may store sentences in English or another language. The data may be arranged in a database and accessible through structured query language (SQL) queries, or other database query languages or operations.

Figure 2 illustrates one embodiment of a data management system 200 configured to store input utterances and/or input text. In one embodiment, the data management system 200 may include a server 102. The server 102 may be coupled to a data bus 202. In one embodiment, the data management system 200 may also include a first data storage device 204, a second data storage device 206, and/or a third data storage device 208. In further embodiments, the data management system 200 may include additional data storage devices (not shown). In one embodiment, a corpus of learner text, such as the NUS Corpus of Learner English (NUCLE), may be stored in the first data storage device 204. The second data storage device 206 may store, for example, a corpus of non-learner text. Examples of non-learner text may include parallel corpora, news or journal text, and other publicly available text. In certain embodiments, the non-learner text is selected from sources believed to contain relatively few errors. The third data storage device 208 may contain computed data, input text, and/or input utterance data. In a further embodiment, the described data may be stored together in a consolidated data storage device 210.

In one embodiment, the server 102 may submit a query to selected data storage devices 204, 206 to retrieve input sentences. The server 102 may store a consolidated data set in the consolidated data storage device 210. In such an embodiment, the server 102 may refer back to the consolidated data storage device 210 to obtain a set of data elements associated with a specified sentence. Alternatively, the server 102 may query each of the data storage devices 204, 206, 208 independently, or in a distributed query, to obtain the set of data elements associated with an input sentence. In another alternative embodiment, multiple databases may be stored on a single consolidated data storage device 210.

The data management system 200 may also include files for inputting and processing utterances. In various embodiments, the server 102 may communicate with the data storage devices 204, 206, 208 over the data bus 202. The data bus 202 may comprise a SAN, a LAN, or the like. The communication infrastructure may include Ethernet, Fibre Channel Arbitrated Loop (FC-AL), Small Computer System Interface (SCSI), Serial Advanced Technology Attachment (SATA), Advanced Technology Attachment (ATA), and/or other similar data communication schemes associated with data storage and communication. For example, the server 102 may communicate indirectly with the data storage devices 204, 206, 208, 210, the server 102 first communicating with a storage server or the memory controller 104.

The server 102 may host a software application configured for analyzing utterances and/or input text. The software application may further include modules for interfacing with the data storage devices 204, 206, 208, 210, interfacing with the network 108, interfacing with a user through the user interface device 110, and the like. In a further embodiment, the server 102 may host an engine, an application plug-in, or an application programming interface (API).

Figure 3 illustrates a computer system 300 adapted according to certain embodiments of the server 102 and/or the user interface device 110. A central processing unit ("CPU") 302 is coupled to a system bus 304. The CPU 302 may be a general-purpose CPU or microprocessor, a graphics processing unit ("GPU"), a microcontroller, or the like, which may be specially programmed to perform the methods described in the flowcharts below. The present embodiments are not restricted by the architecture of the CPU 302 so long as the CPU 302 supports, directly or indirectly, the modules and operations described herein. The CPU 302 may execute various logical instructions according to the present embodiments.

The computer system 300 may also include random access memory (RAM) 308, which may be SRAM, DRAM, SDRAM, or the like. The computer system 300 may utilize the RAM 308 to store the various data structures used by a software application having code to analyze utterances. The computer system 300 may also include read-only memory (ROM) 306, which may be PROM, EPROM, EEPROM, optical storage, or the like. The ROM may store configuration information for booting the computer system 300. The RAM 308 and the ROM 306 hold user and system data.

The computer system 300 may also include an input/output (I/O) adapter 310, a communications adapter 314, a user interface adapter 316, and a display adapter 322. In certain embodiments, the I/O adapter 310 and/or the user interface adapter 316 may enable a user to interact with the computer system 300 in order to input utterances or text. In a further embodiment, the display adapter 322 may display a graphical user interface associated with a software or web-based application, or a mobile application, for generating text with inserted punctuation, grammar corrections, and other related text and speech editing functions.

The I/O adapter 310 may connect one or more storage devices 312, such as one or more of a hard drive, a compact disc (CD) drive, a floppy disk drive, and a tape drive, to the computer system 300. The communications adapter 314 may be adapted to couple the computer system 300 to the network 108, which may be one or more of a LAN, a WAN, and/or the Internet. The user interface adapter 316 couples user input devices, such as a keyboard 320 and a pointing device 318, to the computer system 300. The display adapter 322 may be driven by the CPU 302 to control the display on the display device 324.

The applications of the present disclosure are not limited to the architecture of the computer system 300. Rather, the computer system 300 is provided as an example of one type of computing device that may be adapted to perform the functions of the server 102 and/or the user interface device 110. For example, any suitable processor-based device may be utilized including, without limitation, personal digital assistants (PDAs), desktop computers, smartphones, computer game consoles, and multi-processor servers. Moreover, the systems and methods of the present disclosure may be implemented on application-specific integrated circuits (ASICs), very large scale integration (VLSI) circuits, or other circuitry. In fact, persons of ordinary skill in the art may utilize any number of suitable structures capable of executing logical operations according to the described embodiments.

The schematic flowchart diagrams and associated description that follow are generally set forth as logical flowchart diagrams. As such, the depicted order and labeled steps are indicative of one embodiment of the presented method. Other steps and methods may be conceived that are equivalent in function, logic, or effect to one or more steps, or portions thereof, of the illustrated method. Additionally, the format and symbols employed are provided to explain the logical steps of the method and are understood not to limit the scope of the method. Although various arrow types and line types may be employed in the flowchart diagrams, they are understood not to limit the scope of the corresponding method. Indeed, some arrows or other connectors may be used to indicate only the logical flow of the method. For instance, an arrow may indicate a waiting or monitoring period of unspecified duration between enumerated steps of the depicted method. Additionally, the order in which a particular method occurs may or may not strictly adhere to the order of the corresponding steps shown.

Punctuation prediction

According to one embodiment, punctuation symbols may be predicted from a purely text-processing perspective, where only the speech text is available, without relying on additional prosodic features such as pitch and pause duration. For example, the punctuation prediction task may be performed on transcribed conversational speech text, or utterances. Unlike many other corpora, such as corpora of broadcast news, a conversational speech corpus may include dialogs in which informal and short sentences frequently appear. Furthermore, due to the nature of conversation, it may also include more question sentences than other corpora.

One natural way to relax the strong dependency assumptions encoded by hidden event language models is to adopt undirected graphical models, in which arbitrarily overlapping features can be exploited. Conditional random fields (CRFs) have been widely applied to various sequence labeling and segmentation tasks. A CRF may be a discriminative model of the conditional distribution of the complete label sequence given the observation. For example, a first-order linear-chain CRF, which adopts a first-order Markov assumption, may be defined by the following equation:

p_λ(y|x) = (1/Z(x)) · exp( Σ_t Σ_k λ_k f_k(x, y_{t-1}, y_t, t) )

where x is the observation and y is the label sequence. The feature function f_k, as a function of the time step t, may be defined over the entire observation x and two adjacent hidden labels. Z(x) is a normalization factor that ensures a well-formed probability distribution.
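The equation above can be made concrete with a minimal sketch that evaluates p_λ(y|x) by brute-force enumeration over a toy label set. The feature functions, weights, and the example sentence below are illustrative assumptions, not the features used in the described embodiment; a real decoder would compute Z(x) with dynamic programming rather than enumeration.

```python
import itertools
import math

LABELS = ["NONE", "COMMA", "PERIOD", "QMARK", "EXMARK"]

def score(x, y, weights, feature_fns):
    # Unnormalized score: sum over t and k of lambda_k * f_k(x, y_{t-1}, y_t, t).
    total = 0.0
    for t in range(len(x)):
        prev = y[t - 1] if t > 0 else "<START>"
        for w, f in zip(weights, feature_fns):
            total += w * f(x, prev, y[t], t)
    return total

def crf_probability(x, y, weights, feature_fns):
    # p_lambda(y | x): Z(x) is computed here by exhaustive enumeration over
    # all label sequences, which is only feasible for toy inputs.
    z = sum(math.exp(score(x, cand, weights, feature_fns))
            for cand in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(x, y, weights, feature_fns)) / z

# Two illustrative features: a word/label feature and a label-bigram feature.
feats = [
    lambda x, prev, cur, t: 1.0 if x[t] == "no" and cur == "COMMA" else 0.0,
    lambda x, prev, cur, t: 1.0 if prev == "COMMA" and cur == "NONE" else 0.0,
]
p = crf_probability(["no", "please"], ("COMMA", "NONE"), [2.0, 1.0], feats)
```

Because the two features overlap freely over the observation and adjacent labels, no independence assumption among them is needed, which is the point of the CRF formulation.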

Figure 4 is a block diagram illustrating a graphical representation of a linear-chain CRF. A series of first nodes 402a, 402b, 402c, ..., 402n is coupled to a series of second nodes 404a, 404b, 404c, ..., 404n. The second nodes may be events, such as word-layer tags, associated with corresponding nodes of the first nodes 402. The punctuation prediction task may be modeled as the process of assigning a tag to each word. The set of possible tags may include NONE, COMMA (,), PERIOD (.), QMARK (?), and EXMARK (!). According to one embodiment, each word may be associated with one event. The event identifies which punctuation symbol (possibly NONE) should be inserted after the word.

The training data for the model may include a set of utterances in which the punctuation symbols are encoded as tags assigned to the individual words. The tag NONE means that no punctuation symbol is inserted after the current word. Any other tag identifies the position for inserting the corresponding punctuation symbol. The most probable sequence of tags is predicted, and the punctuated text can then be constructed from such an output. An example of punctuating an utterance is shown in Figure 5.

Figure 5 is an example punctuation tagging of a training sentence for a linear-chain conditional random field (CRF). A sentence 502 may be divided into words, with a word-layer tag 504 assigned to each word. The word-layer tags 504 may indicate the punctuation mark that follows the word in the output sentence. For example, the word "no" is tagged COMMA to indicate that a comma should follow the word "no". Additionally, some words, such as "please", are tagged NONE to indicate that no punctuation mark follows the word "please".
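The tag-to-text mapping described above is mechanical once the tag sequence is predicted; a minimal sketch of the reconstruction step (the tag names follow the set listed earlier, and the helper `apply_tags` is an illustrative name, not one from the embodiment):

```python
# Map each word-layer tag to the symbol it inserts after the word.
PUNCT = {"NONE": "", "COMMA": ",", "PERIOD": ".", "QMARK": "?", "EXMARK": "!"}

def apply_tags(words, tags):
    # Append each word's predicted symbol (nothing for NONE) and join.
    return " ".join(w + PUNCT[t] for w, t in zip(words, tags))
```

For the Figure 5 example, `apply_tags(["no", "please"], ["COMMA", "NONE"])` yields the text with a comma after "no" and no symbol after "please".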

According to one embodiment, the features of the conditional random field may be factorized as a product of binary functions assigned to the clique at the current time step (in this case, an edge) and feature functions defined on the observation sequence alone. The n-gram occurrences surrounding the current word, together with position information, are used as binary feature functions, for n = 1, 2, 3. Words appearing within five words of the current word are considered when constructing the features. Special start and end symbols are used beyond the utterance boundaries. For example, for the words shown in Figure 5, example features include the unigram feature "do" at relative position 0, "please" at relative position -1, the bigram feature "you want" at relative positions 2 to 3, and the trigram feature "no please do" at relative positions -2 to 0.
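The n-gram feature extraction described in this paragraph can be sketched as follows. This is an illustrative interpretation of the scheme (the feature-string format, the `<S>`/`</S>` boundary symbols, and the exact window arithmetic are assumptions of this sketch):

```python
START, END = "<S>", "</S>"

def ngram_features(words, t, max_n=3, window=5):
    # N-gram features (n = 1, 2, 3) with relative-position information,
    # restricted to words within `window` positions of the current word t.
    def tok(i):
        if i < 0:
            return START      # special symbol before the utterance start
        if i >= len(words):
            return END        # special symbol after the utterance end
        return words[i]
    feats = []
    for n in range(1, max_n + 1):
        # an n-gram starting at offset d spans offsets d .. d+n-1,
        # all of which must lie inside [-window, window]
        for d in range(-window, window - n + 2):
            gram = " ".join(tok(t + d + j) for j in range(n))
            feats.append("%dgram[%d:%d]=%s" % (n, d, d + n - 1, gram))
    return feats

# current word "do" in a hypothetical utterance fragment
feats_for_do = ngram_features(["no", "please", "do", "you", "want"], 2)
```

Each returned string acts as the name of one binary feature function that fires when that n-gram occurs at that relative position.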

The linear-chain CRF model in this embodiment may be able to exploit arbitrarily overlapping features to model the dependencies between words and punctuation symbols. Thus, the strong dependency assumptions of the hidden event language model can be avoided. Further improvements to the model are provided by including analysis of long-range dependencies at the sentence level. For example, in the same utterance shown in Figure 5, the long-range dependency between the ending question mark and the distant indicative words "do you want" may not otherwise be captured.

A factorial CRF (F-CRF), an instance of the dynamic conditional random field, may be used as a framework providing the capability of labeling multiple layers of tags simultaneously for a given sequence. The F-CRF learns a joint conditional distribution of the tags given the observation. Dynamic conditional random fields may be defined as the conditional probability of a sequence of label vectors y given the observation x:

p_λ(y|x) = (1/Z(x)) · exp( Σ_t Σ_{c∈C} Σ_k λ_k f_k(x, y_{(c,t)}, t) )

where the cliques are indexed at each time step, C is the set of clique indices, and y_{(c,t)} is the set of variables in the unrolled version of the clique with index c at time t.

Figure 6 is a block diagram illustrating a graphical representation of a two-layer factorial CRF. According to one embodiment, the F-CRF may have two layers of nodes as tags, where the clique at each time step includes two within-chain edges (e.g., z2-z3 and y2-y3) and one between-chain edge (e.g., z3-y3). A series of first nodes 602a, 602b, 602c, ..., 602n is coupled to a series of second nodes 604a, 604b, 604c, ..., 604n. A series of third nodes 606a, 606b, 606c, ..., 606n is coupled to the series of second nodes and to the series of first nodes. The nodes of the series of second nodes are coupled to one another to provide long-range dependencies between the nodes.

According to one embodiment, the second nodes are word-layer nodes and the third nodes are sentence-layer nodes. Each sentence-layer node may be coupled with a corresponding word-layer node. Both the sentence-layer nodes and the word-layer nodes may be coupled with the first nodes. The sentence-layer nodes may capture long-range dependencies between the word-layer nodes.

In the F-CRF, two sets of tags may be assigned to the words in an utterance: word-layer tags and sentence-layer tags. The word-layer tags may include NONE, COMMA, PERIOD, QMARK, and/or EXMARK. The sentence-layer tags may include declarative begin, declarative inner, question begin, question inner, exclamatory begin, and/or exclamatory inner. The word-layer tags may be responsible for inserting a punctuation symbol (including NONE) after each word, while the sentence-layer tags may be used to annotate sentence boundaries and identify the sentence type (declarative, question, or exclamatory).

According to one embodiment, the tags from the word layer may be the same as those used in the linear-chain CRF. The sentence-layer tags may be designed for the three types of sentences: DEBEG and DEIN indicate the beginning and the inner part of a declarative sentence, respectively, and similarly for QNBEG and QNIN (question sentences) and EXBEG and EXIN (exclamatory sentences). The same example utterance that we looked at in the previous section may be annotated with the two layers of tags, as shown in Figure 7.

Figure 7 is an example tagging of a training sentence for a factorial conditional random field (CRF). A sentence 702 may be divided into words, and each word annotated with a word-layer tag 704 and a sentence-layer tag 706. For example, the word "no" may be annotated with the COMMA word-layer tag and the DEBEG sentence-layer tag.
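The construction of the two tag layers from a gold punctuated utterance, as in Figure 7, can be sketched as below. This is a minimal sketch under the assumption that the utterance is pre-tokenized with punctuation as separate tokens; `build_layers` is an illustrative helper name, not one from the embodiment.

```python
WORD_TAG = {",": "COMMA", ".": "PERIOD", "?": "QMARK", "!": "EXMARK"}
SENT_PREFIX = {"PERIOD": "DE", "QMARK": "QN", "EXMARK": "EX"}

def build_layers(tokens):
    # Walk the punctuated tokens: a punctuation token tags the preceding
    # word, and a sentence-ending symbol closes off one sentence span,
    # whose words receive <type>BEG for the first word and <type>IN after.
    words, word_tags, sent_tags = [], [], []
    sent_start = 0
    for tok in tokens:
        if tok in WORD_TAG:
            if not words:
                continue  # ignore leading punctuation in this sketch
            word_tags[-1] = WORD_TAG[tok]
            prefix = SENT_PREFIX.get(WORD_TAG[tok])
            if prefix:  # sentence-ending symbol: emit the sentence layer
                for j in range(sent_start, len(words)):
                    sent_tags.append(prefix + ("BEG" if j == sent_start else "IN"))
                sent_start = len(words)
        else:
            words.append(tok)
            word_tags.append("NONE")
    return words, word_tags, sent_tags

# a declarative sentence followed by a question sentence, as in Figure 7
layers = build_layers(["no", ",", "please", ".", "do", "you", "want", "?"])
```

The second sentence comes out tagged QNBEG, QNIN, QNIN, which is exactly the sentence-layer evidence the text describes as guiding the final QMARK prediction.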

The analogous feature factorization and n-gram feature functions used in the linear-chain CRF may be used in the F-CRF. When learning the sentence-layer tags together with the word-layer tags, the F-CRF model is capable of leveraging useful clues learned from the sentence layer about the sentence type (e.g., a question sentence, annotated with QNBEG, QNIN, QNIN, ..., or a declarative sentence, annotated with DEBEG, DEIN, DEIN, ...), which can be used to guide the prediction of the punctuation symbol at each word, thus improving the performance at the word layer.

For example, consider jointly annotating the utterance shown in Figure 7. When evidence shows that the utterance consists of two sentences, a declarative sentence followed by a question sentence, the model tends to annotate the second part of the utterance with the sentence tag sequence QNBEG, QNIN, .... Given the dependencies between the two layers that exist at each time step, these sentence-layer tags help to predict the word-layer tag at the end of the utterance as QMARK. According to one embodiment, the two layers of tags may be learned jointly during training. Thus, the word-layer tags can influence the sentence-layer tags, and vice versa. The GRMM package may be used for building both the linear-chain CRF (L-CRF) and the factorial CRF (F-CRF). The tree-based reparameterization (TRP) schedule for belief propagation is used for approximate inference.

The techniques described above may allow conditional random fields (CRFs) to be used to perform prediction within utterances without relying on prosodic cues. Thus, the described methods may be useful for post-processing transcribed utterances of a conversation. Additionally, long-range dependencies may be established between words in an utterance to improve the prediction of punctuation within the utterance.

Experiments were performed with different methods on part of the corpus of the IWSLT09 evaluation campaign, in which both Chinese and English conversational speech texts were used. Two multilingual datasets were considered: the BTEC (Basic Travel Expression Corpus) dataset and the CT (Challenge Task) dataset. The former consists of tourism-related sentences, while the latter consists of human-mediated cross-lingual dialogs in the travel domain. The official IWSLT09 BTEC training set consists of 19,972 Chinese-English utterance pairs, and the CT training set consists of 10,061 such pairs. Each of the two datasets may be randomly split into two portions, with 90% of the utterances used for training the punctuation model and the remaining 10% used for evaluating the prediction performance. For all experiments, the default segmentation of the Chinese text may be used as provided, while the English text may be preprocessed with the Penn Treebank tokenizer. Table 1 gives statistics of the two datasets after processing.
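The 90/10 random split described above is a standard step; a minimal sketch (the seed and the helper name `split_corpus` are assumptions of this sketch, not details from the described experiments):

```python
import random

def split_corpus(utterances, train_frac=0.9, seed=0):
    # Shuffle a copy with a fixed seed for reproducibility, then cut 90/10.
    shuffled = list(utterances)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, heldout = split_corpus(range(100))
```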

The proportions of sentence types in the two datasets are listed. Most of the sentences are declarative sentences. However, question sentences appear more frequently in the BTEC dataset than in the CT dataset. For all datasets, exclamatory sentences contribute less than 1% and are not listed. In addition, the utterances from the CT dataset are longer (with more words per utterance), and thus many CT utterances consist of multiple sentences.

Table 1: Statistics of the BTEC and CT datasets

Additional experiments may be divided into two categories: duplicating the ending punctuation symbol to the beginning of the sentence before training, or not duplicating it before training. This setting may be used to assess the impact of the proximity between punctuation symbols and indicative words on the prediction task. Under each category, two possible methods are tested. The single-pass method performs the prediction in one single step, in which all punctuation symbols are predicted sequentially from left to right. In the cascaded method, the training sentences are first formatted by replacing all sentence-ending punctuation symbols with a special sentence-boundary symbol. A model for sentence-boundary prediction may be learned from such training data. According to one embodiment, this step may be followed by the prediction of the punctuation symbols.
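The two data-preparation steps just described, the boundary-symbol substitution of the cascaded method and the ending-symbol duplication setting, can be sketched as below. The boundary symbol `<B>` and both helper names are illustrative assumptions of this sketch:

```python
END_PUNCT = {".", "?", "!"}

def to_boundary_symbols(tokens, boundary="<B>"):
    # Cascaded method, step 1: collapse every sentence-ending symbol into a
    # generic boundary symbol for the sentence-boundary prediction model.
    return [boundary if t in END_PUNCT else t for t in tokens]

def duplicate_ending(tokens):
    # Duplication setting: copy each sentence-ending symbol to the front of
    # its sentence, so that it sits next to sentence-initial indicative
    # words such as "do you" or "where".
    out, sent = [], []
    for t in tokens:
        sent.append(t)
        if t in END_PUNCT:
            out.extend([t] + sent)
            sent = []
    out.extend(sent)  # keep any trailing unterminated material as-is
    return out
```

For English, `duplicate_ending(["do", "you", "want", "?"])` places the question mark directly before "do you", which is the proximity effect the experiments measure.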

Both trigram and 5-gram language models are tried for all combinations of the above settings. This gives a total of eight possible combinations based on the hidden event language model. When training all the language models, modified Kneser-Ney smoothing for n-grams may be used. To assess the performance of the punctuation prediction task, the computations of precision (prec.), recall (rec.), and F1-measure (F1) are defined by the following equation:

F1 = 2 / (1/prec. + 1/rec.)
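The harmonic-mean F1 above, with precision and recall computed over predicted punctuation placements, can be sketched as follows. Representing placements as (position, symbol) pairs is an assumption of this sketch:

```python
def precision_recall_f1(predicted, gold):
    # `predicted` and `gold` are sets of (position, symbol) pairs marking
    # where each punctuation symbol was placed.
    tp = len(predicted & gold)                       # correctly placed symbols
    prec = tp / len(predicted) if predicted else 0.0
    rec = tp / len(gold) if gold else 0.0
    if prec == 0.0 or rec == 0.0:
        return prec, rec, 0.0
    return prec, rec, 2.0 / (1.0 / prec + 1.0 / rec)  # F1 = 2/(1/P + 1/R)

prec, rec, f1 = precision_recall_f1({(1, ","), (4, "?")},
                                    {(1, ","), (4, "."), (7, "?")})
```

Here one of the two predictions is correct (P = 1/2) and one of the three gold symbols is recovered (R = 1/3), giving F1 = 0.4.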

The performance of punctuation prediction on the Chinese (CN) and English (EN) texts of the correctly recognized outputs of the BTEC and CT datasets is shown in Table 2 and Table 3, respectively. The performance of the hidden event language model depends heavily on whether the duplication method is used and on the actual language under consideration. Specifically, for English, duplicating the ending punctuation symbol to the beginning of the sentence before training proves useful for improving the overall prediction performance. In contrast, applying the same technique to Chinese hurts the performance.

One explanation is that English question sentences begin with indicative words, such as "do you" or "where", that distinguish question sentences from declarative sentences. Thus, duplicating the ending punctuation symbol to the beginning of the sentence, so that it is close to these indicative words, helps to improve the prediction accuracy. For question sentences, however, Chinese exhibits quite different syntactic structures.

First, in many cases Chinese tends to use syntactically ambiguous particles at the end of a sentence to indicate a question; particles of this type include 吗 and 呢. Retaining the original positions of the ending punctuation symbols before training therefore yields better performance. Another finding is that, unlike in English, the words that indicate an interrogative sentence in Chinese can appear at almost any position in the sentence. Examples include 哪里有… (where…), …是什么 (what…) and …多少… (how many/much…). This poses difficulties for the simple hidden event language model, which encodes only simple dependencies on the surrounding words through n-gram language modeling.

Table 2: Punctuation prediction performance on the Chinese (CN) and English (EN) texts in the correctly recognized output of the BTEC data set. Percentage scores for precision (Prec.), recall (Rec.) and the F1 measure (F1) are reported.

Table 3: Punctuation prediction performance on the Chinese (CN) and English (EN) texts in the correctly recognized output of the CT data set. Percentage scores for precision (Prec.), recall (Rec.) and the F1 measure (F1) are reported.

By employing a discriminative model that supports non-independent, overlapping features, the L-CRF model generally outperforms the hidden event language model. By introducing an additional label layer that performs sentence segmentation and sentence type prediction, the F-CRF model further improves on the performance of the L-CRF model. Statistical significance tests are performed using bootstrap resampling. The improvements of F-CRF over L-CRF on the Chinese and English texts of the CT data set, and on the English texts of the BTEC data set, are statistically significant (p<0.01). The improvement of F-CRF over L-CRF on the Chinese texts is smaller, probably because L-CRF already performs well on Chinese. The F1 measures on the CT data set are lower than those on BTEC, mainly because the CT data set contains longer utterances and fewer interrogative sentences. Overall, the proposed F-CRF model is robust and consistently works well regardless of the language and data set on which it is tested. This indicates that the approach is general and relies on minimal linguistic assumptions, and can therefore easily be used on other languages and data sets.

The models can also be evaluated on texts produced by an ASR system. For this evaluation, the 1-best ASR outputs of the spontaneous speech in the official IWSLT08 BTEC evaluation data set, released as part of the IWSLT09 corpus, can be used. The data set includes 504 utterances in Chinese and 498 utterances in English. Unlike the correctly recognized texts described in Section 6.1, the ASR outputs contain substantial recognition errors (recognition accuracy is 86% for Chinese and 80% for English). In the data set released by the IWSLT 2009 organizers, the ASR outputs are not annotated with correct punctuation symbols. To perform the experimental evaluation, the correct punctuation symbols on the ASR outputs can be annotated manually. The evaluation results for each model are shown in Table 4. The results show that F-CRF still gives higher performance than L-CRF and the hidden event language model, and that the improvement is statistically significant (p<0.01).

Table 4: Punctuation prediction performance on the Chinese (CN) and English (EN) texts in the ASR output of the IWSLT08 BTEC evaluation data set. Percentage scores for precision (Prec.), recall (Rec.) and the F1 measure (F1) are reported.

In another evaluation of the models, an indirect approach can be adopted to automatically assess the performance of punctuation prediction on ASR output texts: the punctuated ASR texts are fed into a state-of-the-art machine translation system, and the resulting translation performance is evaluated. Translation performance is then measured with automated evaluation metrics that correlate well with human judgment. The existing phrase-based statistical machine translation toolkit Moses, together with the entire IWSLT09 BTEC training set used for training the translation system, serves as the translation engine.

The Berkeley aligner is used to align the training bitext, with the lexicalized reordering model enabled, because lexicalized reordering gives better performance than simple distance-based reordering. In particular, the default lexicalized reordering model (msd-bidirectional-fe) is used. To tune the parameters of Moses, we use the official IWSLT05 evaluation set, in which the correct punctuation symbols are present. Evaluation is performed on the ASR output of the IWSLT08 BTEC evaluation data set, with punctuation symbols inserted by each punctuation prediction method. The tuning and evaluation sets include 7 reference translations. Following the convention in statistical machine translation, we report BLEU-4 scores, which have been shown to correlate well with human judgment, with the closest reference length as the effective reference length. The minimum error rate training (MERT) procedure is used to tune the model parameters of the translation system.

Due to the unstable nature of MERT, 10 runs are performed for each translation task, each with a different random initialization of the parameters, and the BLEU-4 scores averaged over the 10 runs are reported. The results are shown in Table 5. By applying F-CRF as the punctuation prediction model for the ASR texts, the best translation performance is achieved for both translation directions. In addition, we also assess translation performance when the manually annotated punctuation symbols are used for translation. The average BLEU scores for the two translation tasks are 31.58 (Chinese to English) and 24.16 (English to Chinese), which shows that our punctuation prediction models give competitive performance for spoken language translation.

Table 5: Translation performance on the punctuated ASR outputs using Moses (average BLEU percentage scores).

In accordance with the embodiments above, an exemplary method for predicting the punctuation symbols of transcribed conversational utterance texts has been described. The proposed approach is built on the dynamic conditional random field (DCRF) framework, which performs punctuation prediction jointly with sentence boundary and sentence type prediction on speech utterances. Text processing with the DCRF can be performed without relying on prosodic cues. The exemplary embodiments outperform the widely used conventional approach based on the hidden event language model. The disclosed embodiments have been shown to be non-language-specific, working well for both Chinese and English, and for both correctly recognized and automatically recognized texts. The disclosed embodiments also lead to better translation accuracy when the punctuated automatically recognized texts are used in subsequent translation.

Figure 8 is a flowchart illustrating one embodiment of a method for inserting punctuation into a sentence. In one embodiment, method 800 begins at block 802 with identifying the words of an input utterance. At block 804, the words are placed into a plurality of first nodes. At block 806, a word-layer label is assigned to each first node of the plurality of first nodes based at least in part on the neighboring nodes of the plurality of first nodes. According to one embodiment, sentence-layer labels and/or word-layer labels may also be assigned to the first nodes based in part on the boundaries of the input utterance. At block 808, an output sentence is generated by combining the words from the plurality of first nodes with punctuation marks selected based in part on the word-layer label assigned to each of the first nodes.
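A rough sketch of the flow of method 800 follows. This is not the actual DCRF inference of the patent; the `assign_word_labels` rule below is a hypothetical stand-in that tags each word node with the punctuation symbol to emit after it, using a toy question-indicator heuristic purely for illustration.

```python
# Word-layer labels: the punctuation symbol (if any) to attach after a word.
NONE, PERIOD, QMARK = "", ".", "?"

# Hypothetical question-indicator words (illustration only).
QUESTION_STARTERS = {"do", "where", "what", "how"}

def assign_word_labels(words):
    """Toy stand-in for the word-layer labeling step (block 806).

    A real implementation would run CRF inference over the node
    sequence; here the label of the final node simply depends on
    whether the utterance starts with a question indicator.
    """
    labels = [NONE] * len(words)
    if words:
        is_question = words[0].lower() in QUESTION_STARTERS
        labels[-1] = QMARK if is_question else PERIOD
    return labels

def punctuate(words):
    """Blocks 802-808: combine the words with the selected punctuation."""
    labels = assign_word_labels(words)
    return " ".join(w + l for w, l in zip(words, labels))

sentence = punctuate(["where", "is", "the", "station"])
```

The point of the sketch is the data flow (words into nodes, labels per node, labels merged back into an output sentence), not the labeling rule itself.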

Grammatical error correction

There is a difference between training on annotated learner text and training on non-learner text, namely whether the observed word can be used as a feature. When training on non-learner text, the observed word cannot be used as a feature. The author's word choice is instead removed from the text and serves as the correct class. A classifier is trained to re-predict the word given the surrounding context. The confusion set of possible classes is usually predefined. This selection-task formulation is convenient, because training examples can be created "for free" from any text that is assumed to be free of grammatical errors. The more realistic correction task is defined as follows: given a particular word and its context, propose an appropriate correction. The proposed correction may be identical to the observed word, i.e., no correction is necessary. The main difference is that the author's word choice can be encoded as part of the features.

Article errors are one frequent type of error made by EFL learners. For article errors, the classes are the three articles a, the, and the zero article. This covers article insertion, deletion and substitution errors. During training, every noun phrase (NP) in the training data is one training example. When training on learner text, the correct class is the article provided by the human annotator; when training on non-learner text, the correct class is the observed article. The context is encoded via a set of feature functions. During testing, every NP in the test set is one test example. When testing on learner text, the correct class is the article provided by the human annotator, whereas when testing on non-learner text, the correct class is the observed article.

Preposition errors are another frequent type of error made by EFL learners. Preposition errors are handled in a way similar to article errors, but the focus is typically on preposition substitution errors. In this work, the classes are 36 frequent English prepositions (about, along, among, around, as, at, beside, besides, between, by, down, during, except, for, from, in, inside, into, of, off, on, onto, outside, over, through, to, toward, towards, under, underneath, until, up, upon, with, within, without). Every prepositional phrase (PP) governed by one of the 36 prepositions is one training or test example. In this embodiment, PPs governed by other prepositions are ignored.

Figure 9 illustrates one embodiment of a method 900 for correcting grammatical errors. In one embodiment, method 900 may include receiving 902 a natural language text input, where the input text includes a grammatical error and a portion of the input text includes a class from a set of classes. The method 900 may also include generating 904 a plurality of selection tasks from a corpus of non-learner text assumed to be free of grammatical errors, where for each selection task a classifier re-predicts the class used in the non-learner text. Further, the method 900 may include generating 906 a plurality of correction tasks from a corpus of learner text, where for each correction task a classifier proposes the class to use in the learner text. Additionally, the method 900 may include training 908 a grammar correction model using a set of binary classification problems that includes the plurality of selection tasks and the plurality of correction tasks. This embodiment may also include using 910 the trained grammar correction model to predict the class of the text input from the set of possible classes.

According to one embodiment, grammatical error correction (GEC) is formulated as a classification problem, and linear classifiers are used to solve the classification problem.

Classifiers are used to approximate the relationship between articles or prepositions and their contexts in learner text, as well as their valid corrections. An article or preposition together with its context is represented as a feature vector X ∈ χ; the correction is the class Y.

In one embodiment, binary linear classifiers of the form u^T X are used, where u is a weight vector. The outcome is considered to be +1 if the score is positive and -1 if the score is negative. A popular method for finding u is empirical risk minimization with least-squares regularization. Given a training set {X_i, Y_i}, i = 1, ..., n, the goal is to find the weight vector that minimizes the empirical loss on the training data:

û = argmin_u (1/n) Σ_{i=1}^{n} L(u^T X_i, Y_i) + λ‖u‖²

where L is a loss function. In one embodiment, a modification of Huber's robust loss function is used. According to one embodiment, the regularization parameter λ can be set to 10^-4. A multi-class classification problem with m classes can be cast as m binary classification problems in a one-vs-rest setup. The prediction is the class whose classifier gives the highest score.
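The linear scoring and one-vs-rest prediction just described can be sketched as follows. The weight vectors and the 3-dimensional feature space are hypothetical; the training procedure (empirical risk minimization with the Huber-style loss) is not reproduced here.

```python
def score(u, x):
    """Linear classifier score u^T x; positive means +1, negative means -1."""
    return sum(ui * xi for ui, xi in zip(u, x))

def predict_class(weight_vectors, x):
    """One-vs-rest prediction: the class whose classifier scores highest.

    weight_vectors maps class labels (e.g. the three article classes)
    to their weight vectors u.
    """
    return max(weight_vectors, key=lambda c: score(weight_vectors[c], x))

# Hypothetical trained classifiers for the three article classes.
weights = {
    "a":    [0.5, -0.2, 0.1],
    "the":  [0.1, 0.8, -0.3],
    "ZERO": [-0.4, 0.0, 0.6],
}
x = [1.0, 1.0, 0.0]   # toy feature vector for one NP context
pred = predict_class(weights, x)
```

In practice the feature vectors are sparse and high-dimensional, but the decision rule is exactly this argmax over per-class linear scores.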

Six feature extraction approaches are implemented, three for articles and three for prepositions. The approaches require different linguistic preprocessing: chunking, CCG parsing and constituency parsing.

Examples of feature extraction for article errors include "DeFelice", "Han" and "Lee". DeFelice: the system for article errors uses a CCG parser to extract a rich set of syntactic and semantic features, including part-of-speech (POS) tags, hypernyms from WordNet, and named entities. Han: the system relies on shallow syntactic and lexical features derived from a chunker, including the words before, in and after the NP, the head word, and POS tags. Lee: the system uses a constituency parser. Features include POS tags, surrounding words, the head word, and hypernyms from WordNet.

Examples of feature extraction for preposition errors include "DeFelice", "TetreaultChunk" and "TetreaultParse". DeFelice: the system for preposition errors uses a rich set of syntactic and semantic features similar to the system for article errors; in the re-implementation, a subcategorization dictionary is not used. TetreaultChunk: the system uses a chunker to extract features from a two-word window around the preposition, including lexical and POS n-grams, as well as the head words of neighboring constituents. TetreaultParse: the system extends TetreaultChunk by adding additional features derived from constituency and dependency parse trees.

For each of the above feature sets, the observed article or preposition is added as an additional feature when training on learner text.

According to one embodiment, alternating structure optimization (ASO), a multi-task learning algorithm that exploits the common structure of multiple related problems, can be used for grammatical error correction. Assume there are m binary classification problems. Each classifier u_i is a weight vector of dimension p. Let Θ be an orthonormal h × p matrix that captures the common structure of the m weight vectors. It is assumed that each weight vector can be decomposed into two parts: one part that models the particular i-th classification problem and one part that models the common structure:

u_i = w_i + Θ^T v_i

The parameters [{w_i, v_i}, Θ] are learned by joint empirical risk minimization, i.e., by minimizing the joint empirical loss of the m problems on the training data:

Σ_{l=1}^{m} [ (1/n) Σ_{i=1}^{n} L((w_l + Θ^T v_l)^T X_i^l, Y_i^l) + λ‖w_l‖² ]

In ASO, the problems used to find Θ need not be the same as the target problems to be solved. Instead, auxiliary problems can be created automatically for the sole purpose of learning a better Θ.

Assuming there are k target problems and m auxiliary problems, an approximate solution to the above problem can be obtained by the following algorithm:

1. Learn m linear classifiers u_i independently.

2. Let U = [u_1, u_2, ..., u_m] be the p × m matrix formed from the m weight vectors.

3. Perform singular value decomposition (SVD) on U: U = V_1 D V_2^T. The first h column vectors of V_1 are stored as the rows of Θ.

4. Learn w_j and v_j for each target problem by minimizing the empirical risk:

(1/n) Σ_{i=1}^{n} L((w_j + Θ^T v_j)^T X_i, Y_i) + λ‖w_j‖²

5. The weight vector for the j-th target problem is:

u_j = w_j + Θ^T v_j.
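Under the assumption that the m auxiliary classifiers have already been trained, steps 2-3 and 5 of the algorithm above can be sketched with NumPy. The dimensions p, m and h are illustrative, and the auxiliary weight vectors are random stand-ins for real trained classifiers.

```python
import numpy as np

def common_structure(aux_weight_vectors, h):
    """Steps 2-3 of ASO: stack the auxiliary classifiers and extract Theta.

    aux_weight_vectors: list of m weight vectors, each of dimension p
    h:                  number of shared dimensions to keep
    Returns Theta, an h x p matrix whose rows are the first h left
    singular vectors of U.
    """
    U = np.column_stack(aux_weight_vectors)             # p x m matrix
    V1, D, V2t = np.linalg.svd(U, full_matrices=False)  # U = V1 D V2^T
    return V1[:, :h].T                                  # Theta: h x p

def combined_weight(w, v, theta):
    """Step 5: u_j = w_j + Theta^T v_j."""
    return w + theta.T @ v

rng = np.random.default_rng(0)
aux = [rng.normal(size=5) for _ in range(3)]   # m=3 auxiliary problems, p=5
theta = common_structure(aux, h=2)
```

Because the rows of Θ are left singular vectors, they are orthonormal, which matches the requirement that Θ be an orthonormal h × p matrix.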

Beneficially, the selection task on non-learner text is a highly informative auxiliary problem for the correction task on learner text. For example, a classifier that predicts the presence or absence of the preposition on can be helpful for correcting erroneous uses of on in learner text: if the classifier's confidence for on is low but the author used the preposition on, the author may have made a mistake. Because the auxiliary problems can be created automatically, the power of very large corpora of non-learner text can be leveraged.

In one embodiment, assume a grammatical error correction task with m classes. For each class, a binary auxiliary problem is defined. The feature space of the auxiliary problems is the original feature space χ restricted to all features except the observed word. The weight vectors of the auxiliary problems form the matrix U in step 2 of the ASO algorithm, from which Θ is obtained through SVD. Given Θ, the vectors w_j and v_j, j = 1, ..., k, can be obtained from the annotated learner text using the complete feature space χ.

This can be seen as an instance of transfer learning, since the auxiliary problems are trained on data from a different domain (non-learner text) and have a slightly different feature space. The method is general and can be applied to any classification problem in GEC.

Evaluation metrics are defined for the two sets of experiments, on non-learner text and on learner text. For the experiments on non-learner text, accuracy, defined as the number of correct predictions divided by the total number of test instances, is used as the evaluation metric. For the experiments on learner text, the F1 measure is used as the evaluation metric. The F1 measure is defined as:

F1 = (2 × precision × recall) / (precision + recall)

where precision is the number of proposed corrections that agree with the human annotator divided by the total number of corrections proposed by the system, and recall is the number of proposed corrections that agree with the human annotator divided by the total number of errors annotated by the human annotator.

A set of experiments is designed to test the correction task on the NUCLE test data. This second set of experiments investigates the primary goal of this work: automatically correcting grammatical errors in learner text. The test instances are extracted from NUCLE. In contrast to the selection task described earlier, the author's observed word choice can differ from the correct class, and the observed word is available during testing. Two different baselines and the ASO method are investigated.

The first baseline is a classifier trained on Gigaword in the same way as described for the selection task experiments. A simple thresholding strategy is used to make use of the observed word during testing: the system only flags an error if the difference between the classifier's confidence in its first choice and its confidence in the observed word is higher than a threshold t. For each feature set, the threshold parameter t is tuned on the NUCLE development data. In the experiments, the values of t range between 0.7 and 1.2.
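The thresholding strategy of the first baseline can be sketched as follows; the confidence scores are assumed to come from the per-class linear classifiers, and the preposition scores below are made-up illustrations.

```python
def flag_error(scores, observed, t):
    """First-baseline thresholding for the correction task.

    scores:   dict mapping each class to the classifier's confidence
    observed: the word the author actually wrote
    t:        threshold tuned on development data (0.7-1.2 in the text)

    Returns the proposed correction, or None if no error is flagged.
    """
    best = max(scores, key=scores.get)
    if best != observed and scores[best] - scores[observed] > t:
        return best
    return None

scores = {"on": 0.1, "in": 1.5, "at": 0.3}
correction = flag_error(scores, observed="on", t=0.7)  # margin 1.4 > 0.7
kept = flag_error(scores, observed="in", t=0.7)        # best == observed
```

Raising t trades recall for precision: fewer corrections are proposed, but each is backed by a larger confidence margin over the author's choice.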

The second baseline is a classifier trained on NUCLE. The classifier is trained in the same way as the Gigaword models, except that the author's observed word choice is included as a feature. The correct class during training is the correction provided by the human annotator. Because the observed word is part of the features, the model does not require an additional thresholding step; in fact, thresholding is harmful in this case. During training, the instances that contain no error greatly outnumber the instances that do contain an error. To reduce this imbalance, all instances that contain an error are kept, and a random sample of q percent of the instances that contain no error is retained. For each data set, the undersampling parameter q is tuned on the NUCLE development data. In the experiments, the values of q range between 20% and 40%.
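The undersampling of error-free instances described above can be sketched as follows; the instance representation (a pair of features and an error flag) and the fixed seed are illustrative choices, not from the patent.

```python
import random

def undersample(instances, q, seed=0):
    """Keep every instance that contains an error, plus a random sample
    of q percent of the error-free instances.

    instances: iterable of (features, has_error) pairs
    q:         percentage of error-free instances to retain (20-40 here)
    """
    rng = random.Random(seed)
    with_error = [inst for inst in instances if inst[1]]
    without_error = [inst for inst in instances if not inst[1]]
    k = int(len(without_error) * q / 100)
    return with_error + rng.sample(without_error, k)

# Toy data: 100 instances, every 10th one contains an error.
data = [(f"x{i}", i % 10 == 0) for i in range(100)]
sampled = undersample(data, q=20)   # 10 errors + 18 sampled clean instances
```

All error instances survive the sampling, so recall-relevant training signal is never discarded; only the dominant error-free class is thinned out.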

The ASO method is trained in the following way. Binary auxiliary problems are created for the articles or prepositions, i.e., there are 3 auxiliary problems for articles and 36 auxiliary problems for prepositions. The classifiers for the auxiliary problems are trained on all 10 million instances from Gigaword in the same way as in the selection task experiments. The weight vectors of the auxiliary problems form the matrix U. Singular value decomposition (SVD) is performed to obtain U = V_1 D V_2^T. All columns of V_1 are kept to form Θ. The target problems are again binary classification problems, one for each article or preposition, but this time trained on NUCLE. The author's observed word choice is included as a feature for the target problems. Instances that contain no error are undersampled, and the parameter q is tuned on the NUCLE development data; the values of q are between 20% and 40%. No thresholding is applied.

The learning curves of the correction task experiments on the NUCLE test data are shown in Figures 11 and 12. Each sub-plot shows the curves of the three models described in the last section: ASO trained on NUCLE and Gigaword, the baseline classifier trained on NUCLE, and the baseline classifier trained on Gigaword. For ASO, the x-axis shows the number of target-problem training instances. We observe that training on annotated learner text can significantly improve performance. In three experiments, the NUCLE models outperform the Gigaword models trained on 10 million instances. Finally, the ASO models show the best results. In those experiments where the NUCLE model already performs better than the Gigaword baseline, ASO gives comparable or slightly better results. In those experiments where neither baseline shows good performance (TetreaultChunk, TetreaultParse), ASO achieves a larger improvement over either baseline.

Semantic collocation error correction

In one embodiment, the frequency of collocation errors is attributed to the author's native or first language (L1). These types of errors are referred to as "L1-transfer errors". L1-transfer errors are used to estimate how many errors in EFL writing could potentially be corrected with information about the author's L1 language. For example, an L1-transfer error can be the result of an imprecise translation between words of the author's L1 language and English. In such an instance, a word in Chinese that has multiple meanings may not translate precisely into, for example, English.

In one embodiment, the analysis is based on the NUS Corpus of Learner English (NUCLE). The corpus consists of about 1,400 essays written by EFL university students on a wide range of topics, such as environmental pollution or healthcare. Most of the students are native Chinese speakers. The corpus contains about one million words, which are completely annotated with error tags and corrections. The annotations are stored in a stand-off fashion. Each error tag consists of the start and end offsets of the annotation, the type of the error, and the appropriate gold correction as deemed by the annotator. The annotators were asked to provide a correction that would result in a grammatical sentence if the selected word or phrase were replaced by the correction.

In one embodiment, errors that have been tagged with the error tag wrong collocation/idiom/preposition are analyzed. A fixed list of frequent English prepositions is used to automatically filter out all instances that represent simple substitutions of prepositions. In a similar way, a small number of article errors that were tagged as collocation errors are filtered out. Finally, instances where the annotated phrase or the suggested correction is longer than 3 words are filtered out, as they contain highly context-specific corrections and are unlikely to generalize well (e.g., "for the simple reasons that these can help them" → "simply to").

After filtering, 2,747 collocation errors and their respective corrections remain, which account for approximately 6% of all errors in NUCLE. This makes collocation errors the 7th largest error category, after article errors, redundancies, prepositions, word number, verb tense, and semantics. Not counting duplicates, there are 2,412 distinct collocation errors and corrections. Although other error types are more frequent, collocation errors represent a particular challenge, because the possible corrections are not restricted to a closed set of choices, and they directly involve semantics rather than syntax. The collocation errors are analyzed, and it is found that they can be attributed to the following sources of confusion:

Spelling: An error can be caused by similar orthography if the edit distance between the wrong phrase and its correction is less than a certain threshold.

Homophones: An error can be caused by similar pronunciation if the wrong word and its correction have the same pronunciation. A pronunciation dictionary is used to map words to their phonetic representations.

Synonyms: Synonyms can cause an error if the erroneous word and its correction are synonyms in WordNet. WordNet 3.0 is used.

L1-transfer: An error can be caused by L1-transfer if the erroneous phrase and its correction share a common translation in a Chinese-English phrase table. The details of the phrase table construction are described below. Although in this particular embodiment the method is applied to Chinese-English translation, it can be applied to any language pair for which a parallel corpus is available.

Since the pronouncing dictionary and WordNet are defined for individual words, the matching process is extended to phrases in the following way: two phrases A and B are considered homophones/synonyms if they have the same length and the i-th word in phrase A is a homophone/synonym of the corresponding i-th word in phrase B.
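The word-by-word extension to phrases might be sketched as follows; the predicate `related` stands in for a homophone or synonym lookup (e.g. against a pronouncing dictionary or WordNet), and the names are illustrative rather than taken from the patent.

```python
def phrases_match(phrase_a, phrase_b, related):
    """Two phrases match if they have the same length and each word of A is
    identical to, or related (homophone/synonym) to, the word at the same
    position in B. `related(w1, w2)` is a caller-supplied predicate."""
    words_a, words_b = phrase_a.split(), phrase_b.split()
    if len(words_a) != len(words_b):
        return False
    return all(a == b or related(a, b) for a, b in zip(words_a, words_b))
```

With a toy predicate that relates "big" and "large", the phrases "a big house" and "a large house" match, while phrases of different length do not.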

Table 6: Analysis of collocation errors. The threshold for spelling errors is 1 for phrases of up to six letters and 2 for longer phrases.

Suspected error source                             Tokens   Types
Spelling                                              154     131
Homophones                                              2       2
Synonyms                                               74      60
L1-transfer                                          1016     782
L1-transfer w/o spelling                              954     727
L1-transfer w/o homophones                           1015     781
L1-transfer w/o synonyms                              958     737
L1-transfer w/o spelling, homophones, synonyms        906     692

Table 7: Examples of collocation errors with different sources of confusion. Corrections are shown in parentheses. For L1-transfer, the shared Chinese translation is also shown. The L1-transfer examples shown here do not belong to any other category.

The results of the analysis are shown in Table 6. Tokens denotes the count of erroneous phrase-correction pairs including duplicates, and types denotes the count of distinct erroneous phrase-correction pairs. Since a collocation error can belong to more than one category, the rows in the table do not sum to the total number of errors. The number of errors that can be traced to L1-transfer greatly exceeds the number for every other category. The table also shows the number of collocation errors that can be traced to L1-transfer but not to any other source: 906 collocation errors, of 692 distinct types, can be attributed to L1-transfer but not to spelling, homophones, or synonyms. Table 7 shows some examples of collocation errors of each category from our corpus. There are also collocation error types that cannot be traced to any of the above sources.

A method 1300 for correcting collocation errors in EFL writing is disclosed. One embodiment of the method 1300 includes automatically identifying 1302 one or more translation candidates in response to a corpus analysis of parallel-language text performed in a processing device. Additionally, the method 1300 may include using the processing device to determine 1304 features associated with each translation candidate. The method 1300 may also include generating 1306 a set of one or more weight values from a corpus of learner text stored in a data storage device. The method 1300 may further include calculating 1308, using the processing device, a score for each of the one or more translation candidates in response to the features associated with each translation candidate and the set of one or more weight values.

In one embodiment, the method is based on L1-induced paraphrasing. L1-induced paraphrasing with a parallel corpus is used to automatically find collocation candidates from a sentence-aligned L1-English parallel corpus. Since most essays in the corpus were written by native Chinese speakers, the FBIS Chinese-English corpus is used, which consists of approximately 230,000 Chinese sentences (8.5 million words) from news articles, each with a single English translation. The English half of the corpus is tokenized and lowercased. The Chinese half of the corpus is segmented with a maximum entropy segmenter. The text is then automatically aligned at the word level with the Berkeley aligner. English-L1 and L1-English phrases of up to three words are extracted from the aligned text with a phrase-extraction heuristic. The paraphrase probability of an English phrase e1, given an English phrase e2, is defined as:

p(e1 | e2) = Σ_f p(e1 | f) · p(f | e2)

where f denotes a foreign phrase in the L1 language. The phrase translation probabilities p(e1 | f) and p(f | e2) are estimated by maximum likelihood estimation and smoothed with Good-Turing smoothing. Finally, only paraphrases with a probability above a certain threshold (set to 0.001 in this work) are retained.
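The paraphrase probability above is a marginalization over the pivot phrases f of the L1 side. A minimal sketch with dictionary-of-dictionaries translation tables; the 0.001 cutoff mirrors the description, everything else (names, data layout) is illustrative:

```python
from collections import defaultdict

def paraphrase_probs(e_given_f, f_given_e, threshold=0.001):
    """p(e1|e2) = sum over pivot phrases f of p(e1|f) * p(f|e2).

    e_given_f[f][e1] = p(e1|f); f_given_e[e2][f] = p(f|e2).
    Only paraphrases with probability above `threshold` are kept."""
    probs = {}
    for e2, f_dist in f_given_e.items():
        scores = defaultdict(float)
        for f, p_f_e2 in f_dist.items():
            for e1, p_e1_f in e_given_f.get(f, {}).items():
                scores[e1] += p_e1_f * p_f_e2
        probs[e2] = {e1: p for e1, p in scores.items() if p > threshold}
    return probs
```

Identity paraphrases (e1 = e2) can be dropped later, as is done below when phrase table entries whose phrase and candidate correction are identical are removed.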

In another embodiment, the collocation correction method can be implemented in the framework of phrase-based statistical machine translation (SMT). Phrase-based SMT tries to find the highest-scoring translation e for a given input sentence f. The decoding process that finds the highest-scoring translation is guided by a log-linear model that scores translation candidates using a set of feature functions h_i, i = 1, ..., n:

score(e | f) = exp( Σ_{i=1}^{n} λ_i · h_i(e, f) )

Typical features include the phrase translation probability p(e | f), the inverse phrase translation probability p(f | e), the language model score p(e), and a constant phrase penalty. The feature weights λ_i, i = 1, ..., n, can be optimized by minimum error rate training (MERT) on a development set of input sentences and reference translations.
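Scoring a candidate under the log-linear model reduces to a weighted sum of feature values inside an exponential. A minimal sketch (feature extraction and MERT itself are outside its scope; the names are illustrative):

```python
import math

def loglinear_score(feature_values, weights):
    """score(e|f) = exp(sum_i lambda_i * h_i(e, f)) for one candidate whose
    feature values h_i(e, f) have already been computed."""
    return math.exp(sum(l * h for l, h in zip(weights, feature_values)))

def best_candidate(candidates, weights):
    """candidates: list of (translation, feature_vector) pairs."""
    return max(candidates, key=lambda c: loglinear_score(c[1], weights))[0]
```

Since exp is monotone, ranking candidates by the weighted sum alone yields the same argmax; the exponential only matters when the scores are interpreted as unnormalized probabilities.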

The phrase table of the phrase-based SMT decoder MOSES is modified to include collocation corrections with features derived from spelling, homophones, synonyms, and L1-induced paraphrases.

Spelling: For each English word, the phrase table contains entries consisting of the word itself paired with every word within a certain edit distance of the original word. Each entry has a constant feature of 1.0.

Homophones: For each English word, the phrase table contains entries consisting of the word itself paired with each of its homophones. The CuVPlus dictionary is used to determine homophones. Each entry has a constant feature of 1.0.

Synonyms: For each English word, the phrase table contains entries consisting of the word itself paired with each of its synonyms in WordNet. If a word has more than one sense, all of its senses are considered. Each entry has a constant feature of 1.0.

L1-paraphrases: For each English phrase, the phrase table contains entries consisting of the phrase paired with each of its L1-derived paraphrases. Each entry has two real-valued features: the paraphrase probability and the inverse paraphrase probability.

Baseline: The phrase tables built for spelling, homophones, and synonyms are combined; the combined phrase table contains three binary features, one each for spelling, homophones, and synonyms.

All: The phrase tables for spelling, homophones, synonyms, and L1-paraphrases are combined; the combined phrase table contains five features: three binary features for spelling, homophones, and synonyms, and two real-valued features for the L1-paraphrase probability and the inverse L1-paraphrase probability.

Additionally, each phrase table contains the standard constant phrase penalty feature. The first four tables contain collocation candidates only for individual words. It is left to the decoder to construct corrections for longer phrases during decoding if necessary.
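The combination of the per-source tables into the baseline table with three binary features might be sketched like this. The in-memory structure is hypothetical; the real MOSES phrase table is a text file with one entry per line:

```python
def build_phrase_table(words, spelling_cands, homophone_cands, synonym_cands):
    """Combine per-source candidates into a table mapping each word to its
    candidate corrections, each carrying three binary features in the order
    (spelling, homophone, synonym), mirroring the 'baseline' table above.
    Each *_cands argument maps a word to a set of candidate replacements."""
    table = {}
    for w in words:
        entries = {}
        for feat_idx, cands in enumerate((spelling_cands, homophone_cands,
                                          synonym_cands)):
            for c in cands.get(w, ()):
                feats = entries.setdefault(c, [0.0, 0.0, 0.0])
                feats[feat_idx] = 1.0  # candidate produced by this source
        table[w] = entries
    return table
```

A candidate proposed by several sources (e.g. "there" for "their" as both a spelling and a homophone confusion) receives 1.0 in each corresponding feature.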

A set of experiments is performed to test the method for semantic collocation error correction. The data sets used in the experiments are a development set of 770 sentences and a test set of 856 sentences, randomly sampled from the corpus. Each sentence contains exactly one collocation error. The sampling is performed in such a way that sentences from the same document cannot end up in both the development and the test set. To keep the conditions as realistic as possible, the test set is not filtered in any way.

Evaluation metrics are also defined to evaluate collocation error correction in the experiments. Both automatic and human evaluations are performed. The main evaluation metric is the mean reciprocal rank (MRR), which is the arithmetic mean of the reciprocal ranks of the first correct answer returned by the system:

MRR = (1/N) · Σ_{i=1}^{N} 1/rank(i)

where N is the size of the test set. If the system does not return a correct answer for a test instance, 1/rank(i) is set to zero.
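As a sketch, the MRR computation with a zero reciprocal rank for unanswered instances (represented as None below) looks like:

```python
def mean_reciprocal_rank(ranks):
    """ranks: one entry per test instance, giving the 1-based rank of the
    first correct answer, or None when no correct answer was returned
    (its reciprocal rank then counts as zero)."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)
```

For instance, ranks [1, 2, None, 4] give (1 + 0.5 + 0 + 0.25) / 4 = 0.4375.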

In the human evaluation, the precision at rank k, k = 1, 2, 3, is additionally reported, where the precision is computed as:

P_k = ( Σ_{a ∈ A} score(a) ) / |A|

where A is the set of returned answers at rank k or lower and score(·) is a real-valued scoring function between zero and one.
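A sketch of the precision at rank k, where each answer carries a score of 0.0, 0.5, or 1.0 from the averaged binary judgments of the two human judges described further below:

```python
def precision_at_k(scored_answers, k):
    """scored_answers: list of (rank, score) pairs with score in [0, 1].
    P_k sums the scores of answers at rank <= k and divides by their count."""
    at_k = [s for r, s in scored_answers if r <= k]
    return sum(at_k) / len(at_k) if at_k else 0.0
```

With answers scored (1, 1.0), (2, 0.5), (3, 0.0), the precision is 0.75 at rank 2 and 0.5 at rank 3.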

In the collocation error experiments, automatic correction of collocation errors can conceptually be divided into two steps: i) identifying the wrong collocation in the input, and ii) correcting the identified collocation. Here it is assumed that the wrong collocation has already been identified.

In the experiments, the start and end offsets of the collocation error provided by the human annotators are used to identify the location of the collocation error. The translation of the remainder of the sentence is fixed to its identity. Phrase table entries in which the phrase and the candidate correction are identical are removed, which effectively forces the system to change the identified phrase. The distortion limit of the decoder is set to zero to achieve monotone decoding. For the language model, a 5-gram language model is used, trained on the English Gigaword corpus with modified Kneser-Ney smoothing. All experiments use the same language model to allow a fair comparison.

MERT training with the popular BLEU metric is performed on the development set of erroneous sentences and their corrections. Since the search space is limited to changing a single phrase per sentence, training converges relatively quickly, after two or three iterations. After convergence, the model can be used to automatically correct new collocation errors.

The performance of the proposed method is evaluated on the test set of 856 sentences, each containing one collocation error. Both automatic and human evaluations are performed. In the automatic evaluation, the performance of the system is measured by computing the rank of the gold answer provided by the human annotator in the n-best list of the system. The size of the n-best list is limited to the top 100 outputs. If the gold answer is not found in the top 100 outputs, the rank is considered infinite, or in other words, the reciprocal rank is zero. The number of test instances for which the gold answer is ranked among the top k answers, k = 1, 2, 3, 10, 100, is reported. The results of the automatic evaluation are shown in Table 8.

Table 8: Results of the automatic evaluation. Columns 2 to 6 show the number of gold answers ranked within the top k answers. The last column shows the mean reciprocal rank as a percentage. Higher values are better.

Model            Rank=1   Rank≤2   Rank≤3   Rank≤10   Rank≤100     MRR
Spelling             35       41       42        44         44    4.51
Homophones            1        1        1         1          1    0.11
Synonyms             32       47       52        60         61    4.98
Baseline             49       68       80        93         96    7.61
L1-paraphrases       93      133      154       216        243   15.43
All                 112      150      166       216        241   17.21

Table 9: Inter-annotator agreement, P(E) = 0.5.

P(A)   0.8076
Kappa  0.6152

For a collocation error, there is usually more than one possible correct answer. Thus, by considering only a single gold answer as correct and all other answers as wrong, the automatic evaluation underestimates the actual performance of the system. A human evaluation is performed for the systems Baseline and All. Two English speakers were recruited to judge a subset of 500 test sentences. For each sentence, the judges were shown the original sentence and the 3 best candidates from each of the two systems. The human evaluation is limited to the 3 best candidates because answers at ranks greater than 3 would be of little use in a practical application. The candidates are displayed together in alphabetical order, without any information about their rank, which system produced them, or the gold answer provided by the annotator. The difference between each candidate and the original sentence is highlighted. For each candidate, the judges are asked to make a binary judgment as to whether the proposed candidate is a valid correction of the original. A valid correction is denoted by a score of 1.0 and an invalid correction by a score of 0.0. The inter-annotator agreement is reported in Table 9. The probability of agreement P(A) is the percentage of times the annotators agree, and P(E) is the agreement expected by chance, which in our case is 0.5. The Kappa coefficient is defined as

Kappa = (P(A) − P(E)) / (1 − P(E))

A Kappa coefficient of 0.6152 is obtained from the experiment; a Kappa coefficient between 0.6 and 0.8 is considered to show substantial agreement. To compute the precision at rank k, the judgments are averaged. Thus, for each returned answer, the system can receive a score of 0.0 (both judgments negative), 0.5 (the judges disagree), or 1.0 (both judgments positive).
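Both the agreement statistic of Table 9 and the averaged per-answer scores follow directly from the pairs of binary judgments; a sketch:

```python
def kappa(judgements, p_chance=0.5):
    """judgements: list of (judge1, judge2) binary pairs.
    P(A) is the observed agreement rate; Kappa = (P(A) - P(E)) / (1 - P(E))."""
    p_a = sum(a == b for a, b in judgements) / len(judgements)
    return (p_a - p_chance) / (1.0 - p_chance)

def answer_score(judgement_pair):
    """Average of the two binary judgments: 0.0, 0.5, or 1.0."""
    return sum(judgement_pair) / 2.0
```

With P(A) = 0.8076 and P(E) = 0.5, this reproduces the Kappa of 0.6152 reported in Table 9.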

In light of the present disclosure, all of the methods disclosed and claimed herein can be made and performed without undue experimentation. Although the apparatus and methods of this invention have been described in terms of preferred embodiments, it will be apparent to those of skill in the art that variations may be applied to the methods, and in the steps or in the sequence of steps of the methods described herein, without departing from the concept, spirit, and scope of the invention. In addition, modifications may be made to the disclosed apparatus, and components may be eliminated or substituted for the components described herein, where the same or similar results would be achieved. All such similar substitutes and modifications apparent to those skilled in the art are deemed to be within the spirit, scope, and concept of the invention as defined by the appended claims.

Claims (26)

1. A method for correcting a grammatical error, the method comprising:
receiving a natural language text input, the text input comprising a grammatical error, wherein a part of the input text comprises a class from a set of classes;
generating a plurality of selection tasks from a corpus of non-learner text assumed to be free of grammatical errors, wherein for each selection task a classifier re-predicts the class used in the non-learner text;
generating a plurality of correction tasks from a corpus of learner text, wherein for each correction task a classifier proposes the class used in the learner text;
training a grammatical error correction model using a set of binary classification problems, the set of binary classification problems comprising the plurality of selection tasks and the plurality of correction tasks; and
predicting, using the trained grammatical error correction model, the class of the text input from the set of possible classes.
2. The method according to claim 1, further comprising outputting a suggestion to change the class of the text input to the predicted class if the predicted class differs from the class in the text input.
3. The method according to claim 1, wherein the learner text has been annotated by a teacher to indicate the assumed correct class.
4. The method according to claim 1, wherein the class is an article associated with a noun phrase in the input text.
5. The method according to claim 4, further comprising extracting feature functions for the classifier from noun phrases in the non-learner text and the learner text.
6. The method according to claim 1, wherein the class is a preposition associated with a prepositional phrase in the input text.
7. The method according to claim 6, further comprising extracting feature functions for the classifier from prepositional phrases in the non-learner text and the learner text.
8. The method according to claim 1, wherein the non-learner text and the learner text have different feature spaces, the feature space of the learner text comprising the words used by the author.
9. The method according to claim 1, wherein training the grammatical error correction model comprises minimizing a loss function on the training data.
10. The method according to claim 1, wherein training the grammatical error correction model further comprises identifying a plurality of linear classifiers by analyzing the non-learner text.
11. The method according to claim 10, wherein the linear classifiers further comprise weight factors, the weight factors being included in a matrix of weight factors.
12. The method according to claim 11, wherein training the grammatical error correction model further comprises performing a singular value decomposition (SVD) on the matrix of weight factors.
13. The method according to claim 12, wherein training the grammatical error correction model further comprises identifying combined weight values, each combined weight value representing a first weight value element identified by analyzing the non-learner text and a second weight value element identified by analyzing the learner text through minimization of an empirical risk function.
14. An apparatus, comprising:
at least one processor and a memory device coupled to the at least one processor, wherein the at least one processor is configured to:
receive a natural language text input, the text input comprising a grammatical error, wherein a part of the input text comprises a class from a set of classes;
generate a plurality of selection tasks from a corpus of non-learner text assumed to be free of grammatical errors, wherein for each selection task a classifier re-predicts the class used in the non-learner text;
generate a plurality of correction tasks from a corpus of learner text, wherein for each correction task a classifier proposes the class used in the learner text;
train a grammatical error correction model using a set of binary classification problems, the set of binary classification problems comprising the plurality of selection tasks and the plurality of correction tasks; and
predict, using the trained grammatical error correction model, the class of the text input from the set of possible classes.
15. The apparatus according to claim 14, wherein the at least one processor is further configured to output a suggestion to change the class of the text input to the predicted class if the predicted class differs from the class in the text input.
16. The apparatus according to claim 14, wherein the learner text has been annotated by a teacher to indicate the assumed correct class.
17. The apparatus according to claim 14, wherein the class is an article associated with a noun phrase in the input text.
18. The apparatus according to claim 17, wherein the at least one processor is further configured to extract feature functions for the classifier from noun phrases in the non-learner text and the learner text.
19. The apparatus according to claim 14, wherein the class is a preposition associated with a prepositional phrase in the input text.
20. The apparatus according to claim 19, wherein the at least one processor is further configured to extract feature functions for the classifier from prepositional phrases in the non-learner text and the learner text.
21. The apparatus according to claim 14, wherein the non-learner text and the learner text have different feature spaces, the feature space of the learner text comprising the words used by the author.
22. The apparatus according to claim 14, wherein training the grammatical error correction model comprises minimizing a loss function on the training data.
23. The apparatus according to claim 14, wherein training the grammatical error correction model further comprises identifying a plurality of linear classifiers by analyzing the non-learner text.
24. The apparatus according to claim 23, wherein the linear classifiers further comprise weight factors, the weight factors being included in a matrix of weight factors.
25. The apparatus according to claim 24, wherein training the grammatical error correction model further comprises performing a singular value decomposition (SVD) on the matrix of weight factors.
26. The apparatus according to claim 25, wherein training the grammatical error correction model further comprises identifying combined weight values, each combined weight value representing a first weight value element identified by analyzing the non-learner text and a second weight value element identified by analyzing the learner text through minimization of an empirical risk function.

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US38618310P 2010-09-24 2010-09-24
US61/386,183 2010-09-24
US201161495902P 2011-06-10 2011-06-10
US61/495,902 2011-06-10
US201161509151P 2011-07-19 2011-07-19
US61/509,151 2011-07-19

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201180045961.9A Division CN103154936B (en) 2010-09-24 2011-09-23 For the method and system of robotization text correction

Publications (1)

Publication Number Publication Date
CN104484319A true CN104484319A (en) 2015-04-01

Family

ID=45874062

Family Applications (3)

Application Number Title Priority Date Filing Date
CN201410815655.4A Pending CN104484319A (en) 2010-09-24 2011-09-23 Methods and systems for automated text correction
CN201180045961.9A Expired - Fee Related CN103154936B (en) 2010-09-24 2011-09-23 For the method and system of robotization text correction
CN201410812170.XA Pending CN104484322A (en) 2010-09-24 2011-09-23 Methods and systems for automated text correction

Family Applications After (2)

Application Number Title Priority Date Filing Date
CN201180045961.9A Expired - Fee Related CN103154936B (en) 2010-09-24 2011-09-23 For the method and system of robotization text correction
CN201410812170.XA Pending CN104484322A (en) 2010-09-24 2011-09-23 Methods and systems for automated text correction

Country Status (4)

Country Link
US (3) US20140163963A2 (en)
CN (3) CN104484319A (en)
SG (2) SG188531A1 (en)
WO (1) WO2012039686A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682397A (en) * 2016-12-09 2017-05-17 江西中科九峰智慧医疗科技有限公司 Knowledge-based electronic medical record quality control method
CN107766325A (en) * 2017-09-27 2018-03-06 百度在线网络技术(北京)有限公司 Text joining method and its device
CN107844481A (en) * 2017-11-21 2018-03-27 新疆科大讯飞信息科技有限责任公司 Text recognition error detection method and device
CN110379433A (en) * 2019-08-02 2019-10-25 清华大学 Method, apparatus, computer equipment and the storage medium of authentication

Families Citing this family (155)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060116865A1 (en) 1999-09-17 2006-06-01 Www.Uniscape.Com E-services translation utilizing machine translation and translation memory
US7904595B2 (en) 2001-01-18 2011-03-08 Sdl International America Incorporated Globalization management system and method therefor
US7983896B2 (en) 2004-03-05 2011-07-19 SDL Language Technology In-context exact (ICE) matching
US10319252B2 (en) 2005-11-09 2019-06-11 Sdl Inc. Language capability assessment and training apparatus and techniques
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10417646B2 (en) 2010-03-09 2019-09-17 Sdl Inc. Predicting the cost associated with translating textual content
US9547626B2 (en) 2011-01-29 2017-01-17 Sdl Plc Systems, methods, and media for managing ambient adaptability of web applications and web services
US10657540B2 (en) 2011-01-29 2020-05-19 Sdl Netherlands B.V. Systems, methods, and media for web content management
US10580015B2 (en) 2011-02-25 2020-03-03 Sdl Netherlands B.V. Systems, methods, and media for executing and optimizing online marketing initiatives
US10140320B2 (en) 2011-02-28 2018-11-27 Sdl Inc. Systems, methods, and media for generating analytical data
US9984054B2 (en) 2011-08-24 2018-05-29 Sdl Inc. Web interface including the review and manipulation of a web document and utilizing permission based control
US9773270B2 (en) 2012-05-11 2017-09-26 Fredhopper B.V. Method and system for recommending products based on a ranking cocktail
US10261994B2 (en) 2012-05-25 2019-04-16 Sdl Inc. Method and system for automatic management of reputation of translators
US11386186B2 (en) 2012-09-14 2022-07-12 Sdl Netherlands B.V. External content library connector systems and methods
US10452740B2 (en) 2012-09-14 2019-10-22 Sdl Netherlands B.V. External content libraries
US11308528B2 (en) 2012-09-14 2022-04-19 Sdl Netherlands B.V. Blueprinting of multimedia assets
US9916306B2 (en) 2012-10-19 2018-03-13 Sdl Inc. Statistical linguistic analysis of source content
KR101374900B1 (en) * 2012-12-13 2014-03-13 포항공과대학교 산학협력단 Apparatus for grammatical error correction and method for grammatical error correction using the same
US9372850B1 (en) * 2012-12-19 2016-06-21 Amazon Technologies, Inc. Machined book detection
DE102012025351B4 (en) * 2012-12-21 2020-12-24 Docuware Gmbh Processing of an electronic document
US8978121B2 (en) * 2013-01-04 2015-03-10 Gary Stephen Shuster Cognitive-based CAPTCHA system
CN104969289B (en) 2013-02-07 2021-05-28 苹果公司 Voice trigger of digital assistant
US20140244361A1 (en) * 2013-02-25 2014-08-28 Ebay Inc. System and method of predicting purchase behaviors from social media
US10289653B2 (en) 2013-03-15 2019-05-14 International Business Machines Corporation Adapting tabular data for narration
CN104142915B (en) * 2013-05-24 2016-02-24 腾讯科技(深圳)有限公司 A kind of method and system adding punctuate
US9460088B1 (en) * 2013-05-31 2016-10-04 Google Inc. Written-domain language modeling with decomposition
US9164977B2 (en) * 2013-06-24 2015-10-20 International Business Machines Corporation Error correction in tables using discovered functional dependencies
US9348815B1 (en) 2013-06-28 2016-05-24 Digital Reasoning Systems, Inc. Systems and methods for construction, maintenance, and improvement of knowledge representations
US9600461B2 (en) 2013-07-01 2017-03-21 International Business Machines Corporation Discovering relationships in tabular data
US9607039B2 (en) 2013-07-18 2017-03-28 International Business Machines Corporation Subject-matter analysis of tabular data
EP3031004A4 (en) 2013-08-09 2016-08-24 Behavioral Recognition Sys Inc SECURITY OF COGNITIVE INFORMATION USING BEHAVIOR RECOGNITION SYSTEM
KR101482430B1 (en) * 2013-08-13 2015-01-15 포항공과대학교 산학협력단 Method for correcting error of preposition and apparatus for performing the same
US9830314B2 (en) * 2013-11-18 2017-11-28 International Business Machines Corporation Error correction in tables using a question and answer system
CN104750687B (en) * 2013-12-25 2018-03-20 株式会社东芝 Improve method and device, machine translation method and the device of bilingualism corpora
CN104915356B (en) * 2014-03-13 2018-12-07 中国移动通信集团上海有限公司 A kind of text classification bearing calibration and device
US9690771B2 (en) * 2014-05-30 2017-06-27 Nuance Communications, Inc. Automated quality assurance checks for improving the construction of natural language understanding systems
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US9311301B1 (en) 2014-06-27 2016-04-12 Digital Reasoning Systems, Inc. Systems and methods for large scale global entity resolution
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
JP6371870B2 (en) * 2014-06-30 2018-08-08 アマゾン・テクノロジーズ・インコーポレーテッド Machine learning service
US10102480B2 (en) 2014-06-30 2018-10-16 Amazon Technologies, Inc. Machine learning service
US10963810B2 (en) 2014-06-30 2021-03-30 Amazon Technologies, Inc. Efficient duplicate detection for machine learning data sets
US10061765B2 (en) * 2014-08-15 2018-08-28 Freedom Solutions Group, Llc User interface operation based on similar spelling of tokens in text
US10318590B2 (en) 2014-08-15 2019-06-11 Freedom Solutions Group, Llc User interface operation based on token frequency of use in text
EP3179354B1 (en) * 2014-08-26 2020-10-21 Huawei Technologies Co., Ltd. Method and terminal for processing media file
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US10460227B2 (en) 2015-05-15 2019-10-29 Apple Inc. Virtual assistant in a communication session
US10095740B2 (en) 2015-08-25 2018-10-09 International Business Machines Corporation Selective fact generation from table data in a cognitive system
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification
US10614167B2 (en) 2015-10-30 2020-04-07 Sdl Plc Translation review workflow systems and methods
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US12223282B2 (en) 2016-06-09 2025-02-11 Apple Inc. Intelligent automated assistant in a home environment
JP6727607B2 (en) * 2016-06-09 2020-07-22 国立研究開発法人情報通信研究機構 Speech recognition device and computer program
US10586535B2 (en) 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US12197817B2 (en) 2016-06-11 2025-01-14 Apple Inc. Intelligent device arbitration and control
DK201670540A1 (en) 2016-06-11 2018-01-08 Apple Inc Application integration with a digital assistant
CN106202056B (en) * 2016-07-26 2019-01-04 北京智能管家科技有限公司 Chinese word segmentation scene library update method and system
CN107704456B (en) * 2016-08-09 2023-08-29 松下知识产权经营株式会社 Identification control method and identification control device
US10650621B1 (en) 2016-09-13 2020-05-12 Iocurrents, Inc. Interfacing with a vehicular controller area network
CN106484138B (en) * 2016-10-14 2019-11-19 北京搜狗科技发展有限公司 Input method and device
US10056080B2 (en) * 2016-10-18 2018-08-21 Ford Global Technologies, Llc Identifying contacts using speech recognition
US10380263B2 (en) * 2016-11-15 2019-08-13 International Business Machines Corporation Translation synthesizer for analysis, amplification and remediation of linguistic data across a translation supply chain
CN106601253B (en) * 2016-11-29 2017-12-12 肖娟 Audit and proofreading method and system for intelligent robot text broadcast reading
WO2018126213A1 (en) * 2016-12-30 2018-07-05 Google Llc Multi-task learning using knowledge distillation
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
US20180232443A1 (en) * 2017-02-16 2018-08-16 Globality, Inc. Intelligent matching system with ontology-aided relation extraction
DK180048B1 (en) 2017-05-11 2020-02-04 Apple Inc. MAINTAINING THE DATA PROTECTION OF PERSONAL INFORMATION
DK201770428A1 (en) 2017-05-12 2019-02-18 Apple Inc. Low-latency intelligent automated assistant
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
DK201770411A1 (en) 2017-05-15 2018-12-20 Apple Inc. MULTI-MODAL INTERFACES
DK179560B1 (en) 2017-05-16 2019-02-18 Apple Inc. Far-field extension for digital assistant services
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
KR101977206B1 (en) * 2017-05-17 2019-06-18 주식회사 한글과컴퓨터 Assonantic terms correction system
CN107341143B (en) * 2017-05-26 2020-08-14 北京奇艺世纪科技有限公司 Sentence continuity judgment method and device and electronic equipment
US10657327B2 (en) * 2017-08-01 2020-05-19 International Business Machines Corporation Dynamic homophone/synonym identification and replacement for natural language processing
CN111226222B (en) * 2017-08-03 2023-07-07 语冠信息技术(上海)有限公司 Depth context-based grammar error correction using artificial neural networks
US20190051375A1 (en) 2017-08-10 2019-02-14 Nuance Communications, Inc. Automated clinical documentation system and method
US11316865B2 (en) 2017-08-10 2022-04-26 Nuance Communications, Inc. Ambient cooperative intelligence system and method
KR102008145B1 (en) * 2017-09-20 2019-08-07 장창영 Apparatus and method for analyzing sentence habit
CN107908635B (en) * 2017-09-26 2021-04-16 百度在线网络技术(北京)有限公司 Method and device for establishing text classification model and text classification
CN107704450B (en) * 2017-10-13 2020-12-04 威盛电子股份有限公司 Natural language recognition device and natural language recognition method
US10635863B2 (en) 2017-10-30 2020-04-28 Sdl Inc. Fragment recall and adaptive automated translation
CN107967303B (en) * 2017-11-10 2021-03-26 传神语联网网络科技股份有限公司 Method and device for displaying corpus
US10740555B2 (en) 2017-12-07 2020-08-11 International Business Machines Corporation Deep learning approach to grammatical correction for incomplete parses
US10817676B2 (en) 2017-12-27 2020-10-27 Sdl Inc. Intelligent routing services and systems
RU2726009C1 (en) * 2017-12-27 2020-07-08 Общество С Ограниченной Ответственностью "Яндекс" Method and system for correcting incorrect word set due to input error from keyboard and/or incorrect keyboard layout
WO2019173333A1 (en) 2018-03-05 2019-09-12 Nuance Communications, Inc. Automated clinical documentation system and method
US11250383B2 (en) 2018-03-05 2022-02-15 Nuance Communications, Inc. Automated clinical documentation system and method
WO2019173340A1 (en) 2018-03-05 2019-09-12 Nuance Communications, Inc. System and method for review of automated clinical documentation
CN108595410B (en) * 2018-03-19 2023-03-24 小船出海教育科技(北京)有限公司 Automatic correction method and device for handwritten composition
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
CN108829657B (en) * 2018-04-17 2022-05-03 广州视源电子科技股份有限公司 Smoothing method and system
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
CN108647207B (en) * 2018-05-08 2022-04-05 上海携程国际旅行社有限公司 Natural language correction method, system, device and storage medium
US11036926B2 (en) 2018-05-21 2021-06-15 Samsung Electronics Co., Ltd. Generating annotated natural language phrases
CN108875934A (en) * 2018-05-28 2018-11-23 北京旷视科技有限公司 Neural network training method, device, system and storage medium
DK180639B1 (en) 2018-06-01 2021-11-04 Apple Inc ATTENTION AWARE VIRTUAL ASSISTANT DISMISSAL
DK201870355A1 (en) 2018-06-01 2019-12-16 Apple Inc. Virtual assistant operation in multi-device environments
US10629205B2 (en) * 2018-06-12 2020-04-21 International Business Machines Corporation Identifying an accurate transcription from probabilistic inputs
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11256867B2 (en) 2018-10-09 2022-02-22 Sdl Inc. Systems and methods of machine learning for digital assets and message creation
US10902219B2 (en) * 2018-11-21 2021-01-26 Accenture Global Solutions Limited Natural language processing based sign language generation
KR101983517B1 (en) * 2018-11-30 2019-05-29 한국과학기술원 Method and system for augmenting the credibility of documents
CN111368506B (en) * 2018-12-24 2023-04-28 阿里巴巴集团控股有限公司 Text processing method and device
US11580301B2 (en) * 2019-01-08 2023-02-14 Genpact Luxembourg S.à r.l. II Method and system for hybrid entity recognition
CN109766537A (en) * 2019-01-16 2019-05-17 北京未名复众科技有限公司 Study-abroad document writing method, device and electronic equipment
US11586822B2 (en) * 2019-03-01 2023-02-21 International Business Machines Corporation Adaptation of regular expressions under heterogeneous collation rules
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
DK201970509A1 (en) 2019-05-06 2021-01-15 Apple Inc Spoken notifications
CN112036174B (en) * 2019-05-15 2023-11-07 南京大学 Punctuation marking method and device
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11468890B2 (en) 2019-06-01 2022-10-11 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
CN110210033B (en) * 2019-06-03 2023-08-15 苏州大学 Chinese basic discourse unit recognition method based on theme-rheme theory
US11043207B2 (en) 2019-06-14 2021-06-22 Nuance Communications, Inc. System and method for array data simulation and customized acoustic modeling for ambient ASR
US11216480B2 (en) 2019-06-14 2022-01-04 Nuance Communications, Inc. System and method for querying data points from graph data structures
US11227679B2 (en) 2019-06-14 2022-01-18 Nuance Communications, Inc. Ambient clinical intelligence system and method
US11531807B2 (en) 2019-06-28 2022-12-20 Nuance Communications, Inc. System and method for customized text macros
US11295092B2 (en) * 2019-07-15 2022-04-05 Google Llc Automatic post-editing model for neural machine translation
CN110427619B (en) * 2019-07-23 2022-06-21 西南交通大学 An automatic proofreading method for Chinese text based on multi-channel fusion and reordering
CN110688833B (en) * 2019-09-16 2022-12-02 苏州创意云网络科技有限公司 Text correction method, device and equipment
CN110688858A (en) * 2019-09-17 2020-01-14 平安科技(深圳)有限公司 Semantic analysis method and device, electronic equipment and storage medium
CN110750974B (en) * 2019-09-20 2023-04-25 成都星云律例科技有限责任公司 Method and system for structured processing of judicial judgment documents
US11670408B2 (en) 2019-09-30 2023-06-06 Nuance Communications, Inc. System and method for review of automated clinical documentation
CN111026799B (en) * 2019-12-06 2023-07-18 安翰科技(武汉)股份有限公司 Method, equipment and medium for structuring text of capsule endoscopy report
CN111090981B (en) * 2019-12-06 2022-04-15 中国人民解放军战略支援部队信息工程大学 Method and system for constructing automatic sentence segmentation and punctuation generation model for Chinese text based on bidirectional long-short-term memory network
CN111241810B (en) * 2020-01-16 2023-08-01 百度在线网络技术(北京)有限公司 Punctuation prediction method and punctuation prediction device
US11544458B2 (en) * 2020-01-17 2023-01-03 Apple Inc. Automatic grammar detection and correction
CN111507104B (en) 2020-03-19 2022-03-25 北京百度网讯科技有限公司 Method and device for establishing label labeling model, electronic equipment and readable storage medium
US12301635B2 (en) 2020-05-11 2025-05-13 Apple Inc. Digital assistant hardware abstraction
US11061543B1 (en) 2020-05-11 2021-07-13 Apple Inc. Providing relevant data items based on context
US11593557B2 (en) 2020-06-22 2023-02-28 Crimson AI LLP Domain-specific grammar correction system, server and method for academic text
CN111723584B (en) * 2020-06-24 2024-05-07 天津大学 Punctuation prediction method taking domain information into account
US11490204B2 (en) 2020-07-20 2022-11-01 Apple Inc. Multi-device audio adjustment coordination
US11438683B2 (en) 2020-07-21 2022-09-06 Apple Inc. User identification using headphones
CN111931490B (en) * 2020-09-27 2021-01-08 平安科技(深圳)有限公司 Text error correction method, device and storage medium
US11222103B1 (en) 2020-10-29 2022-01-11 Nuance Communications, Inc. Ambient cooperative intelligence system and method
CN112395861A (en) * 2020-11-18 2021-02-23 平安普惠企业管理有限公司 Method and device for correcting Chinese text and computer equipment
CN112597768B (en) * 2020-12-08 2022-06-28 北京百度网讯科技有限公司 Text auditing method, device, electronic equipment, storage medium and program product
CN112966518B (en) * 2020-12-22 2023-12-19 西安交通大学 High-quality answer identification method for large-scale online learning platform
CN112712804B (en) * 2020-12-23 2022-08-26 哈尔滨工业大学(威海) Speech recognition method, system, medium, computer device, terminal and application
US12175187B2 (en) * 2021-03-03 2024-12-24 Oracle International Corporation Correcting content generated by deep learning
CN113012701B (en) * 2021-03-16 2024-03-22 联想(北京)有限公司 Identification method, identification device, electronic equipment and storage medium
CN112966506A (en) * 2021-03-23 2021-06-15 北京有竹居网络技术有限公司 Text processing method, device, equipment and storage medium
CN114117082B (en) * 2022-01-28 2022-04-19 北京欧应信息技术有限公司 Method, apparatus and medium for correction of data to be corrected
US12051022B2 (en) * 2022-05-18 2024-07-30 Capital One Services, Llc Discriminative model for identifying and demarcating textual features in risk control documents
CN115169330B (en) * 2022-07-13 2023-05-02 平安科技(深圳)有限公司 Chinese text error correction and verification method, device, equipment and storage medium
US11983488B1 (en) * 2023-03-14 2024-05-14 OpenAI Opco, LLC Systems and methods for language model-based text editing
US20250021760A1 (en) * 2023-07-13 2025-01-16 International Business Machines Corporation Mono-lingual language models using parallel data
CN116822498B (en) * 2023-08-30 2023-12-01 深圳前海环融联易信息科技服务有限公司 Text error correction processing method, model processing method, device, equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5870700A (en) * 1996-04-01 1999-02-09 Dts Software, Inc. Brazilian Portuguese grammar checker
DE10120262A1 (en) * 2000-10-20 2002-04-25 Microsoft Corp Computerized grammatical checking of sentence in German language involves determining and correcting errors in grammatical cases by analysis, access to morphology table and lexicon
US20030154070A1 (en) * 2002-02-12 2003-08-14 Naoyuki Tokuda System and method for accurate grammar analysis using a learners' model and part-of-speech tagged (POST) parser
CA2504111A1 (en) * 2004-05-28 2005-11-28 Microsoft Corporation Critiquing clitic pronoun ordering in french
CN101197084A (en) * 2007-11-06 2008-06-11 安徽科大讯飞信息科技股份有限公司 Automatic spoken English evaluating and learning system
CN101802812A (en) * 2007-08-01 2010-08-11 金格软件有限公司 Automatic context sensitive language correction and enhancement using an internet corpus

Family Cites Families (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US2008306A (en) * 1934-04-04 1935-07-16 Goodrich Co B F Method and apparatus for protecting articles during a tumbling operation
US6278967B1 (en) * 1992-08-31 2001-08-21 Logovista Corporation Automated system for generating natural language translations that are domain-specific, grammar rule-based, and/or based on part-of-speech analysis
SG49804A1 (en) * 1996-03-20 1998-06-15 Government Of Singapore Repres Parsing and translating natural language sentences automatically
WO2000049599A1 (en) * 1999-02-19 2000-08-24 Sony Corporation Speech translator, speech translating method, and recorded medium on which speech translation control program is recorded
JP4517260B2 (en) * 2000-09-11 2010-08-04 日本電気株式会社 Automatic interpretation system, automatic interpretation method, and storage medium recording automatic interpretation program
US7054803B2 (en) * 2000-12-19 2006-05-30 Xerox Corporation Extracting sentence translations from translated documents
SE0101127D0 (en) * 2001-03-30 2001-03-30 Hapax Information Systems Ab Method of finding answers to questions
GB2375210B (en) * 2001-04-30 2005-03-23 Vox Generation Ltd Grammar coverage tool for spoken language interface
US7031911B2 (en) * 2002-06-28 2006-04-18 Microsoft Corporation System and method for automatic detection of collocation mistakes in documents
US7249012B2 (en) * 2002-11-20 2007-07-24 Microsoft Corporation Statistical method and apparatus for learning translation relationships among phrases
JP3790825B2 (en) * 2004-01-30 2006-06-28 独立行政法人情報通信研究機構 Text generator for other languages
US20080306727A1 (en) * 2005-03-07 2008-12-11 Linguatec Sprachtechnologien Gmbh Hybrid Machine Translation System
US9880995B2 (en) * 2006-04-06 2018-01-30 Carole E. Chaski Variables and method for authorship attribution
JP4058057B2 (en) * 2005-04-26 2008-03-05 株式会社東芝 Sino-Japanese machine translation device, Sino-Japanese machine translation method and Sino-Japanese machine translation program
US20080133245A1 (en) * 2006-12-04 2008-06-05 Sehda, Inc. Methods for speech-to-speech translation
US20080162117A1 (en) * 2006-12-28 2008-07-03 Srinivas Bangalore Discriminative training of models for sequence classification
US7991609B2 (en) * 2007-02-28 2011-08-02 Microsoft Corporation Web-based proofing and usage guidance
US20080249764A1 (en) * 2007-03-01 2008-10-09 Microsoft Corporation Smart Sentiment Classifier for Product Reviews
US8949266B2 (en) * 2007-03-07 2015-02-03 Vlingo Corporation Multiple web-based content category searching in mobile search application
CN101271452B (en) * 2007-03-21 2010-07-28 株式会社东芝 Method and device for generating version and machine translation
US8326598B1 (en) * 2007-03-26 2012-12-04 Google Inc. Consensus translations from multiple machine translation systems
US9002869B2 (en) * 2007-06-22 2015-04-07 Google Inc. Machine translation for query expansion
US20090119095A1 (en) * 2007-11-05 2009-05-07 Enhanced Medical Decisions. Inc. Machine Learning Systems and Methods for Improved Natural Language Processing
KR100911621B1 (en) * 2007-12-18 2009-08-12 한국전자통신연구원 Korean-English automatic translation method and apparatus
US20090281791A1 (en) * 2008-05-09 2009-11-12 Microsoft Corporation Unified tagging of tokens for text normalization
US9411800B2 (en) * 2008-06-27 2016-08-09 Microsoft Technology Licensing, Llc Adaptive generation of out-of-dictionary personalized long words
US8560300B2 (en) * 2009-09-09 2013-10-15 International Business Machines Corporation Error correction using fact repositories
KR101259558B1 (en) * 2009-10-08 2013-05-07 한국전자통신연구원 apparatus and method for detecting sentence boundaries
US20110213610A1 (en) * 2010-03-01 2011-09-01 Lei Chen Processor Implemented Systems and Methods for Measuring Syntactic Complexity on Spontaneous Non-Native Speech Data by Using Structural Event Detection
US9552355B2 (en) * 2010-05-20 2017-01-24 Xerox Corporation Dynamic bi-phrases for statistical machine translation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JOHN LEE et al.: "Automatic Grammar Correction for Second-Language Learners", International Conference on InterSpeech-ICSLP 2006 *
RACHELE DE FELICE et al.: "A classifier-based approach to preposition and determiner error correction in L2 English", Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682397A (en) * 2016-12-09 2017-05-17 江西中科九峰智慧医疗科技有限公司 Knowledge-based electronic medical record quality control method
CN107766325A (en) * 2017-09-27 2018-03-06 百度在线网络技术(北京)有限公司 Text joining method and its device
CN107844481A (en) * 2017-11-21 2018-03-27 新疆科大讯飞信息科技有限责任公司 Text recognition error detection method and device
CN110379433A (en) * 2019-08-02 2019-10-25 清华大学 Method, apparatus, computer equipment and the storage medium of authentication

Also Published As

Publication number Publication date
US20140163963A2 (en) 2014-06-12
SG10201507822YA (en) 2015-10-29
CN103154936A (en) 2013-06-12
US20170242840A1 (en) 2017-08-24
CN104484322A (en) 2015-04-01
WO2012039686A1 (en) 2012-03-29
CN103154936B (en) 2016-01-06
US20170177563A1 (en) 2017-06-22
SG188531A1 (en) 2013-04-30
US20130325442A1 (en) 2013-12-05

Similar Documents

Publication Publication Date Title
CN103154936B (en) Method and system for automated text correction
US8972240B2 (en) User-modifiable word lattice display for editing documents and search queries
US11068653B2 (en) System and method for context-based abbreviation disambiguation using machine learning on synonyms of abbreviation expansions
Gudivada Natural language core tasks and applications
Xiong et al. HANSpeller: a unified framework for Chinese spelling correction
Carter et al. Syntactic discriminative language model rerankers for statistical machine translation
Septarina et al. Machine translation of Indonesian: a review
Iwatsuki et al. Using formulaic expressions in writing assistance systems
Das Semi-supervised and latent-variable models of natural language semantics
Xiong et al. Linguistically Motivated Statistical Machine Translation
Khoufi et al. Chunking Arabic texts using conditional random fields
ElSabagh et al. A comprehensive survey on Arabic text augmentation: approaches, challenges, and applications
Yupei et al. A Prompt-independent and Interpretable Automated Essay Scoring Method for Chinese Second Language Writing
Verma et al. Critical analysis of existing punjabi grammar checker and a proposed hybrid framework involving machine learning and rule-base criteria
Pan et al. An unsupervised artificial intelligence strategy for recognising multi-word expressions in transformed Bengali data
Chakraborty et al. N-Gram based Assamese Question Pattern Extraction and Probabilistic Modelling
Kaur Development of an approach for disambiguating ambiguous Hindi postposition
Jabin et al. An online English-Khmer hybrid machine translation system
Alqaisi Dependency-based bilingual word embeddings and neural machine translation
Madi et al. Grammar checking and relation extraction in text: approaches, techniques and open challenges
Hasegawa-Johnson et al. Arabic speech and language technology
Chathuranga et al. Opinion target extraction for student course feedback
Cing et al. Joint Word Segmentation and Part-of-Speech Tagging for Myanmar Language
Sirts Noisy-channel spelling correction models for Estonian learner language corpus lemmatisation
Rana Grammar Error Correction using Deep Reinforcement Learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150401