CN103729346B - Method for dynamically generating mass language assets in multiple language industry standard formats

Info

Publication number: CN103729346B
Application number: CN201210383201.5A
Authority: CN
Inventors: 杜金林; 朱懿; 杜勇
Original assignee: Translated By Mdt Infotech Ltd Shanghai
Current assignee: Translated By Mdt Infotech Ltd Shanghai
Priority date: 2012-10-11
Filing date: 2012-10-11
Publication date: 2017-02-08
Anticipated expiration: 2032-10-11
Also published as: CN103729346A

Abstract

The present invention is a method for dynamically generating massive language assets in multilingual industry-standard formats, comprising: developing a parser that reads the content of corpora and term bases stored in XML-based standard formats such as TMX and TBX and imports it into a specified database; during import, automatically matching entries that have the same content in different language pairs and placing them in the database tables, thereby automatically generating a multilingual database in which one source sentence is matched with multiple target-language sentences; when the user works with the system, automatically returning the search results to the user as translation memory according to the language pair the user specifies, and presenting them to the end user in a specific format for reuse; and, when the multilingual database is added to or updated, automatically updating the related content in multiple languages, so that users can continue to obtain the updated translation memory content after the language assets have been dynamically updated. Because language assets saved in a text-based database format are reused directly, the data is not easily damaged or lost, and asset security is improved.

Description

Method for dynamically generating massive language assets in multilingual industry standard format

Technical Field

The invention relates to a method for dynamically generating massive language assets in multilingual industry-standard formats. It is used in the development and application of the TM module in CAT software or multilingual translation systems, and belongs to the technical field of multilingual machine translation.

Background Art

TM (Translation Memory) is one of the most widely used technologies in computer-aided translation (CAT). TM technology can significantly improve translation efficiency and ensure content consistency. Because many kinds of CAT software are built on TM technology and their TM storage formats differ widely, an open standard called TMX (Translation Memory eXchange) has been successfully adopted in the localization and translation industry to facilitate TM data exchange between translation agencies and between CAT tools.

In software and website localization, the data files to be processed are highly repetitive. In addition, content is updated frequently, and each update builds on the previous version by adding a small amount of new content or making minor corrections to the existing content. It is therefore well worth reusing the content already translated for earlier versions instead of translating it again.

TM technology reuses this translated content effectively. It improves translation efficiency by means of segments and a TM database. The translation database uses the translation unit (TU) as its unit of data and links each source-language sentence to its target-language sentence. When a translator works in a TM-based CAT tool, the tool continuously stores newly translated content in the TM database. For each piece of content to be translated (a word, phrase, sentence, or paragraph), the tool first searches the TM database for a match and automatically offers the closest translation, which the translator can easily insert.
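The lookup behavior described above (search the TM database, offer the closest translation) can be sketched as follows. This is a minimal illustration, not the matching algorithm of any particular CAT tool: it assumes the TM is a plain dictionary from source segment to target segment and uses the character-based similarity from Python's standard `difflib` module. The function name `closest_translation`, the sample segments, and the 0.6 cutoff are all hypothetical.

```python
import difflib


def closest_translation(tm, query, cutoff=0.6):
    """Return (source, target, score) for the stored source segment
    most similar to the query, or None if nothing passes the cutoff."""
    matches = difflib.get_close_matches(query, list(tm), n=1, cutoff=cutoff)
    if not matches:
        return None
    source = matches[0]
    score = difflib.SequenceMatcher(None, query, source).ratio()
    return source, tm[source], score


# Hypothetical English-to-Chinese translation memory.
tm = {
    "Save the file": "保存文件",
    "Open the file": "打开文件",
}

# A new sentence that differs slightly from a stored one still finds a match.
hit = closest_translation(tm, "Save the files")
```

Production tools typically match on words or sub-segments and report a fuzzy-match percentage; the character ratio here only stands in for that idea.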

As more content is translated, the TM database keeps growing. Translators no longer have to retranslate content they have already handled and can concentrate on the new content, and the accuracy of the TM keeps translations of identical content consistent. This is the goal that TM technology pursues.

However, as economic globalization deepens, the software and website localization and globalization industry is growing rapidly, and more and more localization tools and TM tools based on TM technology are appearing. These tools come from different vendors, each with its own file and data storage format. Moreover, a localization service provider often serves different customers, or different projects of the same customer, and those customers and projects require different localization tools. Because the tools lack a common exchange format for their file data, previously accumulated TM resources are hard to reuse. The standard format for TM databases clearly needs to be unified.

In summary, as economic globalization deepens and the software and website localization and globalization industry grows rapidly, reusing existing language assets (TM and terminology resources) stored in TMX and TBX formats helps raise output and quality and lower cost. A TMX or TBX file usually covers a single language pair, such as English to Chinese or English to German. However, the state of the art still supports only single-language-pair formats; no technology yet generates multilingual language pairs automatically from identical content in existing single language pairs.

Disadvantages of the prior art: 1) the existing language asset storage architecture is two-dimensional and one-directional, so the correspondences between the source language and the various target languages cannot be linked together; 2) multilingual (multi-dimensional), multi-directional language pairs cannot be obtained automatically from identical content in massive numbers of single-language-pair TMX or TBX files, which wastes resources on a large scale, and obtaining them manually would incur huge labor costs.

Summary of the Invention

To solve the above problems, the invention aims to provide a method for dynamically generating massive language assets in multilingual industry-standard formats. The technical solution of the invention is as follows:

A method for dynamically generating massive language assets in multilingual industry-standard formats, comprising the following steps:

1. A parser is developed that reads the content of corpora and term bases in XML-based standard formats such as TMX and TBX and imports it into a specified database;

2. During import, entries with the same content in different language pairs are automatically matched and placed in the database tables, automatically generating a multilingual database in which one source sentence is matched with multiple target-language sentences;

3. When the user works with the system, the search results are automatically returned to the user as translation memory according to the language pair the user specifies, and presented to the end user in a specific format for reuse;

4. When the multilingual database is added to or updated, the related content in multiple languages is updated automatically, so that users can continue to obtain the updated translation memory content after the language assets have been dynamically updated.
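Step 1 can be sketched as below. This is only an illustration under stated assumptions, not the parser the patent describes: it uses Python's standard `xml.etree.ElementTree` and `sqlite3` modules, an in-memory SQLite database stands in for the "specified database", and the table name `segment`, the function names, and the sample TMX content are hypothetical. The matching of step 2 is deliberately left out here.

```python
import sqlite3
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical single-language-pair TMX file (English to Chinese).
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="sketch" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en-us" srclang="en-us" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en-us"><seg>Computer-assisted translation</seg></tuv>
      <tuv xml:lang="zh-cn"><seg>计算机辅助翻译</seg></tuv>
    </tu>
  </body>
</tmx>"""


def parse_tmx(text):
    """Return one dict {language code: segment text} per <tu> element."""
    root = ET.fromstring(text.encode("utf-8"))
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            # Fall back to a plain "lang" attribute, which some older files use.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                unit[lang.lower()] = "".join(seg.itertext()).strip()
        if unit:
            units.append(unit)
    return units


def import_units(conn, units):
    """Store every segment as a (unit_id, lang, text) row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS segment (unit_id INTEGER, lang TEXT, text TEXT)"
    )
    next_id = conn.execute(
        "SELECT COALESCE(MAX(unit_id), 0) + 1 FROM segment"
    ).fetchone()[0]
    for offset, unit in enumerate(units):
        conn.executemany(
            "INSERT INTO segment (unit_id, lang, text) VALUES (?, ?, ?)",
            [(next_id + offset, lang, text) for lang, text in unit.items()],
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
units = parse_tmx(SAMPLE_TMX)
import_units(conn, units)
```

Storing one row per language, rather than one column pair per file, is a design choice made for the sketch so that a unit can later hold any number of languages.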

As a preferred solution, the method described above for dynamically generating massive language assets in multilingual industry-standard formats further comprises:

a λ corpus parsing module, which parses the industry-standard formats TMX and TBX, reads the corpus information (including source language, target language, and so on) into memory, and converts it into binary objects;

a λ corpus adaptation module, which matches intermediate-language corpus entries and stores the corresponding target-language entries at the correct positions in the multilingual corpus matrix;

a λ corpus generation module, which reads the corpus information in the multilingual corpus matrix and outputs it as TMX or TBX files according to the industry standard, so that the corpus can be archived and backed up or used by other TMX- or TBX-compatible tools.
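The adaptation and generation modules can be illustrated with the sketch below. The patent does not specify the layout of the multilingual corpus matrix; here it is assumed, purely for illustration, to be a dictionary keyed by the intermediate-language (pivot) text, with one inner dictionary of `{language: text}` per row. The function names `adapt` and `generate_tmx` and the header attribute values are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def adapt(matrix, unit, pivot_lang):
    """Adaptation: file a parsed unit {lang: text} under the row whose
    pivot-language text matches. Returns False if the unit has no pivot text."""
    pivot_text = unit.get(pivot_lang)
    if pivot_text is None:
        return False
    row = matrix.setdefault(pivot_text, {})
    row.update(unit)
    return True


def generate_tmx(matrix, src_lang, tgt_lang):
    """Generation: write every row that has both languages as a TMX string."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch", "creationtoolversion": "0",
        "segtype": "sentence", "o-tmf": "none", "adminlang": "en-us",
        "srclang": src_lang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for row in matrix.values():
        if src_lang in row and tgt_lang in row:
            tu = ET.SubElement(body, "tu")
            for lang in (src_lang, tgt_lang):
                tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
                ET.SubElement(tuv, "seg").text = row[lang]
    return ET.tostring(tmx, encoding="unicode")


# Two bilingual units that share the same English text end up in one row.
matrix = {}
adapt(matrix, {"en-us": "Computer-assisted translation",
               "zh-cn": "计算机辅助翻译"}, "en-us")
adapt(matrix, {"en-us": "Computer-assisted translation",
               "fr-fr": "Traduction assistée par ordinateur"}, "en-us")

# A Chinese-to-French TMX that existed in neither input.
zh_fr_tmx = generate_tmx(matrix, "zh-cn", "fr-fr")
```

Matching here is by exact pivot text; a real system would probably normalize whitespace, case, and inline tags before comparing.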

The method of the invention for dynamically generating massive language assets in multilingual industry-standard formats has the following beneficial effects. The language assets held in the multilingual database are physically independent of the language assets held in TMX and TBX formats, so even if the multilingual database is deleted, the original language assets are unaffected, which keeps the assets secure. Moreover, the assets are stored on the storage medium as text-form XML (TMX and TBX are both XML-based); unlike binary database files that CAT tools read and write frequently, they are safe from accidental loss.

Processing the two industry-standard formats TMX and TBX directly brings the following beneficial effects:

1) Language assets saved in a text-based database format are reused directly, so the data is not easily damaged or lost, and asset security is improved.

2) No manual format conversion is needed; industry-standard formats are imported automatically, enabling reuse of language assets.

3) Multilingual, multi-dimensional language pairs and term pairs are obtained automatically. For example, a corpus that originally covered 3 language pairs can, by applying the invention, yield corpora for 9 additional language pairs, adding value to the assets and making full use of them. For an enterprise globalizing and internationalizing its products, this keeps language consistent throughout the globalization process, directly improves efficiency and quality, saves substantial multilingual production costs, and shortens the time needed to roll products out globally.

4) High-speed query and reuse of massive multilingual assets is supported.
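Item 3) gives the figures 3 and 9 without showing the arithmetic. One reading that reproduces them, offered here as an assumption and not as something the patent states, is that the three original pairs share one source language (four languages in total) and that each translation direction counts as a separate pair: 4 × 3 = 12 directed pairs, of which 9 are new. The language codes below are borrowed from the examples elsewhere in the description.

```python
from itertools import permutations

# Assumed starting point: three pairs that share English as the source.
original = {("en-us", "zh-cn"), ("en-us", "fr-fr"), ("en-us", "de-de")}

# Every language that appears in any pair.
languages = sorted({lang for pair in original for lang in pair})

# All ordered (source, target) combinations of distinct languages.
all_pairs = set(permutations(languages, 2))

# Pairs that become available only after the contents are merged.
additional = all_pairs - original

print(len(languages), len(all_pairs), len(additional))  # 4 12 9
```

Under this reading the gain grows quadratically: n languages linked through one pivot give n × (n − 1) directed pairs from only n − 1 original files.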

Brief Description of the Drawings

Figure 1. System block diagram of the method for dynamically generating massive language assets in multilingual industry-standard formats.

Detailed Description of the Embodiments

Abbreviations and definitions of key terms:

MTMM: Multilingual Translation Memory Matrix

TM: Translation Memory

TU: Translation Unit

TMX: Translation Memory eXchange (translation memory exchange format)

TBX: Term Base eXchange (term base exchange format)

CAT: Computer Aided Translation

LISA: Localization Industry Standards Association

OSCAR: Open Standards for Container/Content Allowing Re-use

A specific embodiment is as follows:

The method for dynamically generating massive language assets in multilingual industry-standard formats comprises the following steps:

1) A parser is developed that reads the content of corpora and term bases in XML-based standard formats such as TMX and TBX and imports it into a specified database;

2) During import, entries with the same content in different language pairs are automatically matched and placed in the database tables, automatically generating a multilingual database in which one source sentence is matched with multiple target-language sentences;

3) When the user works with the system, the search results are automatically returned to the user as translation memory according to the language pair the user specifies, and presented to the end user in a specific format for reuse;

4) When the multilingual database is added to or updated, the related content in multiple languages is updated automatically, so that users can continue to obtain the updated translation memory content after the language assets have been dynamically updated.
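Steps 3) and 4) can be sketched as a query over a multilingual table. This is a minimal illustration assuming a hypothetical SQLite schema in which every segment is a `(unit_id, lang, text)` row and segments with the same `unit_id` are translations of one another; the patent does not disclose its actual schema, and the exact-match lookup here stands in for whatever search the real system performs.

```python
import sqlite3


def open_db():
    """Create an in-memory database with one row per (unit, language)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE segment ("
        "unit_id INTEGER, lang TEXT, text TEXT, "
        "PRIMARY KEY (unit_id, lang))"
    )
    return conn


def upsert(conn, unit_id, lang, text):
    """Add a segment, or replace the existing one for that unit and language."""
    conn.execute(
        "INSERT OR REPLACE INTO segment (unit_id, lang, text) VALUES (?, ?, ?)",
        (unit_id, lang, text),
    )


def lookup(conn, src_lang, tgt_lang, query):
    """Return target-language texts whose source-language text equals the query."""
    rows = conn.execute(
        "SELECT t.text FROM segment AS s "
        "JOIN segment AS t ON s.unit_id = t.unit_id "
        "WHERE s.lang = ? AND t.lang = ? AND s.text = ?",
        (src_lang, tgt_lang, query),
    ).fetchall()
    return [row[0] for row in rows]


conn = open_db()
upsert(conn, 1, "en-us", "Computer-assisted translation")
upsert(conn, 1, "zh-cn", "计算机辅助翻译")
upsert(conn, 1, "fr-fr", "Traduction assistée par ordinateur")

# The user picks any language pair, here Chinese to French.
before = lookup(conn, "zh-cn", "fr-fr", "计算机辅助翻译")

# Updating one language is visible from every pair that includes it.
upsert(conn, 1, "fr-fr", "TAO")
after = lookup(conn, "zh-cn", "fr-fr", "计算机辅助翻译")
```

Because each language is stored once per unit, a single update is enough for every language pair to return the new text, which is one way to read the "automatic update" of step 4).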

The method for dynamically generating massive language assets in multilingual industry-standard formats further comprises the following:

The language assets held in the multilingual database are physically independent of the language assets held in TMX and TBX formats, so even if the multilingual database is deleted, the original language assets are unaffected, which keeps the assets secure. Moreover, the assets are stored on the storage medium as text-form XML (TMX and TBX are both XML-based); unlike binary database files that CAT tools read and write frequently, they are safe from accidental loss.

Conceptual example sentences for the invention:

A. Illustration of the concept for translation memory (TMX):

Example of ordinary two-dimensional, single-language-pair TM content:

English (en-us): People’s Republic of China is a permanent member of the United Nations Organization

Chinese (zh-cn): 中华人民共和国是联合国组织的常任理事国

French (fr-fr): République populaire de Chine est membre permanent de l’Organisation des Nations Unies

German (de-de): Der Volksrepublik China ist Mitglied der Organisation der Vereinten Nationen

With the technology of the invention, TM entries for any matching multilingual, multi-dimensional language pair are obtained automatically, for example:

German (de-de): Der Volksrepublik China ist Mitglied der Organisation der Vereinten Nationen
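The example sentences in section A can be used to show how a pair that was never translated directly, such as Chinese to German, falls out of two memories that share the same English text. The sketch below assumes each bilingual memory is a dictionary from the English sentence to its translation; the function name `derive_pairs` is hypothetical, and the sentences are copied from the example as given.

```python
def derive_pairs(pivot_to_a, pivot_to_b):
    """Join two bilingual memories on their shared pivot sentences,
    returning a memory from language A to language B."""
    return {
        a_text: pivot_to_b[pivot]
        for pivot, a_text in pivot_to_a.items()
        if pivot in pivot_to_b
    }


EN = ("People’s Republic of China is a permanent member "
      "of the United Nations Organization")

# Two ordinary single-language-pair memories, both with English as source.
en_to_zh = {EN: "中华人民共和国是联合国组织的常任理事国"}
en_to_de = {EN: "Der Volksrepublik China ist Mitglied der "
                "Organisation der Vereinten Nationen"}

# Chinese to German, obtained without any new translation work.
zh_to_de = derive_pairs(en_to_zh, en_to_de)
```

Only sentences present in both memories produce a derived pair, so the yield depends on how much source content the original files have in common.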

B. Illustration of the concept for term bases (TBX):

Ordinary two-dimensional, single-language-pair term content:

English (en-us): Computer-assisted translation

Chinese (zh-cn): 计算机辅助翻译

French (fr-fr): Traduction assistée par ordinateur

German (de-de): Computerunterstützte

With the technology of the invention, terms for any matching multilingual, multi-dimensional language pair are obtained automatically:

Chinese (zh-cn): 计算机辅助翻译

German (de-de): Computerunterstützte

Every vendor wants users to depend more heavily on its own CAT product. From the user's point of view, however, a method that supports massive language assets, automatically generates multilingual pairs from identical content in single language pairs, keeps the assets secure, and makes maximum use of resources is highly valuable. The technical solution of the invention yields the following benefits: besides ensuring reuse of the original single-language-pair sentence pairs and the security of the assets, it automatically obtains multilingual, multi-dimensional language pairs for the user, adding value to the assets and making full use of them.

The above is only a preferred embodiment of the invention. Any non-inventive improvement made by a person skilled in the art in keeping with its spirit falls within the scope of protection of the invention.

Claims

1. A method for dynamically generating massive language assets in multilingual industry-standard formats, characterized by comprising the following steps: (1) developing a parser that reads out the content of corpora and term bases in XML-based standard formats such as TMX and TBX and imports it into a specified database; (2) during import, automatically matching entries with the same content in different language pairs and placing them in the database tables, and automatically generating a multilingual database in which one source sentence is matched with multiple target-language sentences; (3) when the user works with the system, automatically returning the search results to the user as translation memory according to the language pair specified by the user, and presenting them to the end user in a specific format for reuse; (4) when the multilingual database is added to or updated, automatically updating the related content in multiple languages, so that users can continue to obtain the updated translation memory content after the language assets have been dynamically updated;

the language assets are stored on the storage medium as text-form XML, and language assets A, which exist in the form of the multilingual database, are physically independent of language assets B, which exist in TMX and TBX formats.

2. The method for dynamically generating massive language assets in multilingual industry-standard formats according to claim 1, characterized in that step (1) specifically comprises: using a λ corpus parsing module, which parses the industry-standard formats TMX and TBX, reads the corpus information into memory, and converts it into binary objects;

step (2) specifically comprises: using a λ corpus adaptation module, which matches intermediate-language corpus entries and stores the corresponding target-language entries at the correct positions in a multilingual corpus matrix; and using a λ corpus generation module, which reads the corpus information in the multilingual corpus matrix and outputs it as a TMX or TBX format file according to the industry standard.