JP5148278B2 - Method and system for selecting a language for text segmentation

Info

Publication number: JP5148278B2
Application number: JP2007534758A
Authority: JP
Inventors: Gilad Israel Elbaz; Jacob Leon Mandelson
Original assignee: Google Inc.
Priority date: 2004-09-30
Filing date: 2005-09-28
Publication date: 2013-02-20
Anticipated expiration: 2025-09-28
Also published as: EP1800224B1; US20060074628A1; US20110301939A1; CA2581902A1; WO2006039398A3; EP2511832A3; WO2006039398A8; WO2006039398A2; CN101095138A; CN102831107A; EP2511832B1; US20130013288A1; US20130018648A1; ES2395168T3; EP1800224A2; JP2008515107A; CN102708095A; US8306808B2; EP2511832A2; CN102831107B

Description

The present invention relates generally to text segmentation, and more particularly to selecting a language for text segmentation.

Text processing methods and systems exist that attempt to interpret data representing text. Text processing becomes more difficult when the received text contains a string of characters with no breaks indicating words or other tokens. A token can be a word, an acronym, an abbreviation, a proper name, a geographic name, a stock market ticker symbol, or another token. In general, existing methods and systems can segment such a string into multiple combinations of segmented strings. Selecting the correct language to use for the text can yield more meaningful results.

Embodiments of the present invention comprise methods and systems for selecting a language for text segmentation. One embodiment comprises: identifying at least a first candidate language and a second candidate language associated with a string of characters; determining at least a first segmentation result associated with the first candidate language from the string, and at least a second segmentation result associated with the second candidate language from the string; determining a first frequency of occurrence for the first segmentation result and a second frequency of occurrence for the second segmentation result; and identifying a feasible language from the first and second candidate languages based at least in part on the first and second frequencies of occurrence.

This exemplary embodiment is mentioned not to limit or define the invention, but to provide one example of an embodiment of the invention to aid understanding of it. Exemplary embodiments are discussed in the detailed description, where further description of the invention is provided. The advantages offered by the various embodiments of the present invention may be further understood by examining this specification.

These and other features, aspects, and advantages of the present invention are better understood when the following detailed description is read with reference to the accompanying drawings.

<Introduction>
Embodiments of the present invention comprise methods and systems for selecting a language for text segmentation. There are multiple embodiments of the present invention. By way of introduction and example, one exemplary embodiment provides a method for improving the segmentation of a string of characters, such as a domain name, into multiple tokens or words by selecting the correct language for the string. A number of potential, or candidate, languages for the string can be selected based on various signals, such as linguistics associated with the string, an IP address associated with a user, the character set used in the string, browser settings of a browser application program associated with the user, and the top-level domain associated with the string. The string can be segmented into a number of segmentation results using each candidate language. Each segmentation result is a particular combination of words or other tokens. For example, the string "usedrugs" can be segmented into the following results for English:
"used rugs", "use drugs", "us ed rugs", and so on.
From these segmentation results for each candidate language, a feasible segmentation result and a feasible language can be identified based on the number of documents or search queries in that language that contain the segmentation result.
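The "usedrugs" example can be reproduced with a short sketch. The vocabulary below is a hypothetical six-word stand-in for an English token list, and the exhaustive recursive split is only one possible way to enumerate segmentation results; the description does not prescribe an algorithm.

```python
from functools import lru_cache

# Hypothetical English vocabulary; a real system would draw tokens from a
# per-language token store.
ENGLISH_TOKENS = frozenset({"us", "use", "used", "ed", "drugs", "rugs"})


def segmentations(text: str, vocabulary: frozenset) -> list:
    """Return every way to split `text` into tokens found in `vocabulary`."""

    @lru_cache(maxsize=None)
    def split_from(start: int) -> tuple:
        if start == len(text):
            return ((),)  # exactly one way to segment an empty remainder
        results = []
        for end in range(start + 1, len(text) + 1):
            token = text[start:end]
            if token in vocabulary:
                # Prepend this token to every segmentation of the remainder.
                for rest in split_from(end):
                    results.append((token,) + rest)
        return tuple(results)

    return [" ".join(tokens) for tokens in split_from(0)]


print(segmentations("usedrugs", ENGLISH_TOKENS))
```

With this vocabulary the function yields the three results named in the text. A string that cannot be covered by the vocabulary yields an empty list.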

For example, for each candidate language, the segmentation result with the highest probability of being the best feasible segmentation result can be selected. A search engine can determine the number of documents or search queries that contain a selected segmentation result, and can do so for each selected segmentation result in each candidate language. In one embodiment, the segmentation result that occurs most frequently in documents or search queries in its particular language is identified as the best feasible segmentation result, and the language associated with it is identified as the best feasible language. The language signals used to determine the candidate languages can also be used in selecting the feasible language. The feasible segmentation result and the feasible language can be used in a variety of functions, including selecting advertisements based on the language and the result.
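The selection step described above amounts to taking a maximum over the candidate languages. The following is a minimal sketch under stated assumptions: the function name, the language codes, and the counts are all hypothetical, and the frequency lookup is injected as a callable because the text allows several sources (documents or search queries).

```python
def pick_feasible_language(top_results, frequency_of):
    """Return (language, result) for the candidate whose top result occurs most.

    top_results  -- {language code: top segmentation result for that language}
    frequency_of -- callable (language, result) -> frequency of occurrence
    """
    language = max(
        top_results,
        key=lambda lang: frequency_of(lang, top_results[lang]),
    )
    return language, top_results[language]


# Hypothetical counts standing in for a search engine's answer.
COUNTS = {("en", "used rugs"): 1200, ("fr", "usedrugs"): 3}
top = {"en": "used rugs", "fr": "usedrugs"}
print(pick_feasible_language(top, lambda lang, res: COUNTS.get((lang, res), 0)))
```

Passing the lookup as a parameter keeps the comparison independent of where the counts come from.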

This introduction is given to introduce the reader to the general subject matter of the present invention. The invention is by no means limited to such subject matter. Exemplary embodiments are described below.

<System structure>
Various systems can be constructed in accordance with the present invention. FIG. 1 is a diagram of an exemplary system in which embodiments of the present invention can operate. The present invention can also operate in, and be embodied in, other systems. In the drawings, like numerals denote like elements throughout the several figures. The system 100 shown in FIG. 1 comprises multiple client devices 102a-n that communicate with a server device 104 and a server device 150 over a network 106. In one embodiment, the network 106 shown comprises the Internet. In other embodiments, other networks can be used, such as an intranet, WAN, or LAN. In addition, methods according to the present invention can operate within a single computer.

The client devices 102a-n shown in FIG. 1 each comprise a computer-readable medium, such as a random access memory (RAM) 108, coupled to a processor 110. The processor 110 executes computer-executable program instructions stored in memory 108. Such processors may comprise a microprocessor, an ASIC, and state machines. Such processors comprise, or may be in communication with, media, for example computer-readable media, which store instructions that, when executed by the processor, cause the processor to perform the steps described herein. Embodiments of computer-readable media include, but are not limited to, an electronic, optical, magnetic, or other storage or transmission device capable of providing a processor, such as the processor 110 of client 102a, with computer-readable instructions. Other examples of suitable media include, but are not limited to, a floppy disk, CD-ROM, DVD, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, all optical media, all magnetic tape or other magnetic media, or any other suitable medium from which a computer processor can read instructions. Also, various other forms of computer-readable media may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless. The instructions may comprise code from any computer programming language, including, for example, C, C++, Visual Basic, Java, Python, Perl, and JavaScript.

The client devices 102a-n may also comprise a number of external or internal devices, such as a mouse, a CD-ROM, a DVD, a keyboard, a display, or other input or output devices. Examples of client devices 102a-n are personal computers, digital assistants, personal digital assistants, cellular phones, mobile phones, smart phones, pagers, digital tablets, laptop computers, Internet appliances, and other processor-based devices. In general, a client device 102a may be any suitable type of processor-based platform that is connected to the network 106 and that interacts with one or more application programs. The client devices 102a-n may operate on any operating system capable of supporting a browser or browser-enabled application, such as Microsoft® Windows® or Linux. The client devices 102a-n shown include, for example, personal computers executing a browser application program such as Microsoft Corporation's Internet Explorer®, Netscape Communications Corporation's Netscape Navigator®, and Apple Computer, Inc.'s Safari®.

Through the client devices 102a-n, users 112a-n can communicate over the network 106 with each other and with other systems and devices connected to the network 106. As shown in FIG. 1, the server device 104 and the server device 150 are also connected to the network 106.

The server device 104 can comprise a server executing a segmentation engine application program, and the server device 150 can comprise a server executing a search engine application program. Similar to the client devices 102a-n, the server device 104 and the server device 150 shown in FIG. 1 comprise a processor 116 coupled to a computer-readable memory 118 and a processor 152 coupled to a computer-readable memory 154, respectively. The server devices 104 and 150, each depicted as a single computer system, may be implemented as a network of computer processors. Examples of server devices 104, 150 are servers, mainframe computers, networked computers, processor-based devices, and similar types of systems and devices. The client processor 110 and the server processors 116, 152 can be any of a number of computer processors, as described above, such as processors from Intel Corporation of Santa Clara, California and Motorola Corporation of Schaumburg, Illinois.

Memory 118 contains a segmentation application program, also known as the segmentation engine 120. The server device 104, or a related device, can access the network 106 to receive strings of characters from other devices or systems connected to the network 106. Characters can include marks or symbols used in a writing system, including data representing a character, such as ASCII, Unicode, ISO 8859-1, Shift-JIS, EBCDIC, or any other suitable character set. In one embodiment, the segmentation engine 120 can receive a string of characters, such as a domain name, from a server device on the network 106 when a user 112a directs a web browser application to a domain name that is not active.

In one embodiment, the segmentation engine 120 identifies candidate languages for the string, segments the string into potential combinations of tokens for each candidate language, and selects a particular language and combination to associate with the string. A token can comprise a word, a proper name, a geographic name, an abbreviation, an acronym, a stock market ticker symbol, or another token. The segmentation engine 120 can include a segmentation processor 122, a frequency processor 124, and a language processor 126. In the embodiment shown in FIG. 1, each comprises computer code residing in the memory 118.

The language processor 126 can identify the candidate languages, or the language, for a string. In one embodiment, the language processor 126 can use signals to identify a number of candidate languages for the string. For example, the language processor can use linguistics, the user's IP address, the character set used in the string, browser settings of a browser application program associated with the user, and the top-level domain associated with the string to determine candidate languages for the string.
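One way to picture how such signals could be combined is a simple vote count, sketched below. This is an assumption for illustration only: the lookup tables, the equal weighting, and the function name are hypothetical, and the IP address and linguistic signals are omitted because they would need external data.

```python
from collections import Counter

# Hypothetical lookup tables; real mappings would be far larger.
TLD_LANGUAGES = {"fr": ["fr"], "de": ["de"], "ch": ["de", "fr", "it"], "jp": ["ja"]}
CHARSET_LANGUAGES = {"shift_jis": ["ja"], "iso-8859-1": ["en", "fr", "de"]}


def candidate_languages(domain, accept_language="", charset="", limit=3):
    """Rank languages by how many signals point to them (one vote per signal)."""
    votes = Counter()

    # Signal: top-level domain of the string.
    tld = domain.rsplit(".", 1)[-1].lower()
    votes.update(TLD_LANGUAGES.get(tld, []))

    # Signal: character set used in the string.
    votes.update(CHARSET_LANGUAGES.get(charset.lower(), []))

    # Signal: browser language setting, e.g. "fr-CH,fr;q=0.9,en;q=0.8".
    browser_languages = set()
    for part in accept_language.split(","):
        tag = part.split(";")[0].strip().lower()
        if tag:
            browser_languages.add(tag.split("-")[0])
    votes.update(browser_languages)

    return [language for language, _ in votes.most_common(limit)]


print(candidate_languages("exemple.ch", "fr-CH,fr;q=0.9,en;q=0.8", "iso-8859-1"))
```

In this example French receives a vote from all three signals and ranks first; German and English each receive two.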

The segmentation processor 122 can determine, for each candidate language, a list of potential combinations of tokens, or segmentation results, from the string. In one embodiment, the token processor 124 determines a probability for each segmentation result in the list and selects the top segmentation result for each language based on that probability. The probability for a segmentation result can be based on frequency values associated with the individual tokens in the result. In one embodiment, the unsegmented string may be included as a segmentation result.
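The text says only that the probability is based on the frequency values of the individual tokens. As one possible reading, the sketch below multiplies the relative frequencies of the tokens (a unigram model); that formula, the frequency table, and the function names are assumptions, not the method of the patent.

```python
import math

# Hypothetical token frequencies for English.
TOKEN_FREQ = {"en": {"used": 500, "rugs": 40, "use": 900, "drugs": 60, "us": 700, "ed": 5}}


def score(tokens, freq):
    """Product of each token's relative frequency; unknown tokens score zero."""
    total = sum(freq.values())
    return math.prod(freq.get(token, 0) / total for token in tokens)


def top_result(results, freq):
    """Pick the segmentation result with the highest score for one language."""
    return max(results, key=lambda result: score(result.split(), freq))


results = ["used rugs", "use drugs", "us ed rugs"]
print(top_result(results, TOKEN_FREQ["en"]))
```

With these made-up frequencies, "use drugs" scores highest because both of its tokens are comparatively common. A product of relative frequencies penalizes results with more tokens, which is a property of this particular assumption.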

The frequency processor 124 can perform, or cause to be performed, a frequency search on the top selected segmentation result for each candidate language. The frequency processor 124 can include a spell-checking function, or can call a spell-checking function residing elsewhere, to spell check the selected segmentation results. Any spelling-corrected results can be included in the frequency search. In one embodiment, the frequency processor sends the selected segmentation results to the server device 150, which performs a frequency search on them. As described below, a frequency search can determine a frequency of occurrence for each particular segmentation result. Based on the frequency search, the top or feasible segmentation result can be identified by the segmentation processor 122. The language associated with the feasible result can be identified by the segmentation processor 122 as the feasible language for the string. In one embodiment, the feasible segmentation result and the feasible language can be sent to an advertising server, which can select targeted advertisements based on the feasible language, the selected result, or both. Other functions and features of the segmentation processor 122, the frequency processor 124, and the language processor 126 are described further below.

The server device 104 also provides access to other storage elements, such as a token storage element, which in the embodiment shown is a token database 120. The token database 120 can be used to store tokens and frequency information associated with each token. The token database 120 can also store the language or languages associated with each token. Data storage elements may include any one or combination of methods for storing data, including without limitation arrays, hash tables, lists, and pairs. The server device 104 can access other similar types of data storage devices.
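A hash table keyed by token is one of the storage methods the paragraph lists. The sketch below shows such a layout holding a frequency per language for each token; the class name, method names, and sample values are hypothetical.

```python
from collections import defaultdict


class TokenStore:
    """In-memory token store: token -> {language: frequency}."""

    def __init__(self):
        self._freq = defaultdict(dict)

    def add(self, token, language, frequency):
        self._freq[token][language] = frequency

    def frequency(self, token, language):
        # .get avoids creating empty entries for unknown tokens.
        return self._freq.get(token, {}).get(language, 0)

    def languages(self, token):
        return sorted(self._freq.get(token, {}))


store = TokenStore()
store.add("chat", "en", 800)
store.add("chat", "fr", 650)
store.add("rugs", "en", 40)
print(store.languages("chat"), store.frequency("rugs", "fr"))
```

Storing the languages alongside the frequencies lets one lookup answer both "how common is this token" and "in which languages does it occur".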

The server device 150 can include a server executing a search engine application program, such as the Google® search engine. In other embodiments, the server device 150 can comprise a related information server or an advertising server. In another embodiment, there can be multiple server devices 150.

Memory 154 contains a search engine application program, also known as the search engine 156. The search engine 156 can locate relevant information from the network 106 in response to a search query from a user 112a, and can maintain a search log of search queries. The search engine 156 can also perform a frequency search in response to a frequency search request from the frequency processor 124. The search engine 156 can provide a search result set to a user 112a, or frequency information to the segmentation engine 120, via the network 106.

In one embodiment, the server device 150, or a related device, has previously performed a crawl of the network 106 to locate articles, such as web pages, stored on other devices or systems connected to the network 106. Articles include, for example, documents, emails, instant messenger messages, database entries, web pages of various formats such as HTML, XML, and XHTML, Portable Document Format (PDF) files, media files such as image files, audio files, and video files, or any other documents, groups of documents, or information of any suitable type. An indexer 158 can be used to index the articles in memory 154 or in another data storage device, such as an index 160. The index may also include the language or languages associated with each article. In one embodiment, there are multiple indexes, each containing a portion of the total articles indexed. It should be understood that other suitable methods for indexing articles can be used in place of, or in combination with, crawling.

The search engine 156 can perform a frequency search in a number of suitable ways. In one embodiment, the search engine 156 performs a web search using each top selected segmentation result as a search query, searching for articles in the candidate language of the segmentation result that contain the search query. In this embodiment, a frequency search result set can be generated and can comprise one or more article identifiers. An article identifier can be, for example, a Uniform Resource Locator (URL), a file name, a link, an icon, a path for a local file, or anything else that identifies an article. In one embodiment, an article identifier comprises a URL associated with the article.

The frequency processor 124 can use the number of article identifiers in each frequency search result set as a representation of the number of occurrences of the respective segmentation result. In another embodiment, the frequency processor 124 can interface directly with the indexer 158. The indexer 158 can determine, for each top selected segmentation result, the number of articles in the associated candidate language in which the segmentation result appears. This information can be sent to the frequency processor 124. In still another embodiment, the search engine 156 and/or the frequency processor 124 can determine, for each selected segmentation result, the number of occurrences in search queries in the associated candidate language from the search log, and the frequency processor 124 can determine a frequency of occurrence based on this search log information. In one embodiment, the number of articles or search queries found in a frequency search for a segmentation result is normalized by the total number of articles or search queries in the associated language.
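Normalization matters because candidate languages can differ greatly in how many articles or queries exist for them. The sketch below divides the match count by the language total, which is the simplest reading of "normalized based on the total number"; the function name and all numbers are hypothetical.

```python
def normalized_frequency(matches: int, total_in_language: int) -> float:
    """Share of a language's articles (or queries) containing the result."""
    if total_in_language <= 0:
        return 0.0  # no data for this language
    return matches / total_in_language


# Hypothetical figures: English has more raw matches, Dutch a higher share.
english = normalized_frequency(matches=1200, total_in_language=1_000_000)
dutch = normalized_frequency(matches=300, total_in_language=50_000)
print(english < dutch)
```

Here the raw count favors English (1200 versus 300), while the normalized frequency favors Dutch, so the two measures can select different languages.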

It should be noted that the present invention may comprise systems having a different architecture than that shown in FIG. 1. For example, in some systems according to the present invention, the server device 104 may comprise a single physical or logical server. The system 100 shown in FIG. 1 is merely exemplary and is used to help explain the method illustrated in FIG. 2.

<Process>
Various methods in accordance with embodiments of the present invention can be carried out. One exemplary method comprises: identifying at least a first candidate language and a second candidate language associated with a string of characters; determining at least a first segmentation result associated with the first candidate language from the string, and at least a second segmentation result associated with the second candidate language from the string; determining a first frequency of occurrence for the first segmentation result and a second frequency of occurrence for the second segmentation result; and identifying a feasible language from the first and second candidate languages based at least in part on the first and second frequencies of occurrence. More than two candidate languages can be identified, and more than two segmentation results can be determined. For example, three candidate languages can be identified and four segmentation results can be determined for each candidate language.

The feasible language can be determined based in part on identifying a feasible segmentation result from the first and second segmentation results, based at least in part on the first and second frequencies of occurrence. The first and second candidate languages may be identified based at least in part on one or more language signals. The language signals can comprise at least one of: linguistics associated with the string, an IP address of a user associated with the string, the character set used for the string, browser settings of a browser application program associated with a user associated with the string, and the top-level domain associated with the string. In one embodiment, identifying the feasible language may also be based at least in part on the language signals.

In one embodiment, identifying the feasible language from the first and second candidate languages based at least in part on the first and second frequencies of occurrence comprises selecting the first candidate language when the first frequency of occurrence is greater than the second frequency of occurrence. The string can comprise a domain name. The first segmentation result can comprise a first combination of tokens, and the second segmentation result can comprise a second combination of tokens.

In one embodiment, determining the first frequency of occurrence for the first segmentation result comprises determining the number of articles in the first candidate language that contain the first segmentation result, and normalizing that number by the total number of articles in the first candidate language. Determining the number of articles in the first language can comprise determining the number of article identifiers in a search result set generated in response to a search query comprising the first segmentation result.

In one embodiment, determining the number of articles in the first language comprises accessing an index of articles. In another embodiment, determining the first frequency of occurrence comprises determining the number of occurrences of the first segmentation result in a plurality of search queries in the first candidate language, and normalizing that number by the total number of search queries in the first candidate language.

The method may also comprise selecting an article based at least in part on the feasible language, the feasible segmentation result, or both, and the article may comprise an advertisement. In one embodiment, determining the first segmentation result comprises determining a plurality of segmentation results in the first candidate language from the string, and identifying the first segmentation result from among those segmentation results. Identifying the first segmentation result can comprise calculating a probability value for each of the plurality of segmentation results. A first probability value associated with the first segmentation result can be based at least in part on the frequency of each token in the first segmentation result.

Another example method comprises: determining a first segmented result in a first candidate language from a domain name and a second segmented result in a second candidate language from the domain name; determining a first frequency of occurrence for the first segmented result and a second frequency of occurrence for the second segmented result in at least one of an article index, a text index, and a search result set; selecting the first candidate language as the operable language when the first frequency of occurrence is greater than the second frequency of occurrence; selecting the second candidate language as the operable language when the second frequency of occurrence is greater than the first frequency of occurrence; selecting an advertisement based at least in part on the operable language; and displaying the advertisement in association with a web page associated with the domain name. The advertisement comprises text in the operable language.

FIG. 2 illustrates an example method 200 for selecting a language for text segmentation in accordance with one embodiment of the present invention. This example method is provided as one of many ways of carrying out methods according to the present invention. The method 200 shown in FIG. 2 can be executed by one system or by a combination of systems, or carried out in other ways. The method 200 is described below as carried out by the system 100 shown in FIG. 1, and various elements of the system 100 are referenced in explaining the example method of FIG. 2.

Referring to FIG. 2, the example method begins at block 202. Block 202 is followed by block 204, in which a character string is accessed by the segmentation engine 120. The character string can be received or accessed, for example, from a device connected to the network 106 or from another device. In one embodiment, the character string can be a domain name associated with an invalid or nonexistent website, received from an advertising server associated with the domain name.

Block 204 is followed by block 206, in which candidate languages for the character string are identified. In one embodiment, the language processor 126 can use one or more language signals to determine a number of candidate languages for the character string. For example, based on the language signals, the language processor can identify English, French, and Spanish as three candidate languages for the character string.

Language signals that can be used include, for example, linguistics associated with the character string, the IP address of a user associated with the character string, the character set used in the character string, the browser settings of a browser application program of a user associated with the character string, and the top-level domain associated with the character string. Linguistics can be used to identify structures or features of the character string that are indicative of a particular language. For example, some languages tend to begin or end groups of characters in certain ways and use common patterns. The user's IP address can indicate the user's location and country, and the language or languages associated with that country can be used as candidate languages. The character set of the character string can indicate one or more languages associated with the character string. For example, a Cyrillic character set can indicate Russian or another Slavic language. The browser settings of the user's browser application program can indicate the language and/or character set associated with the character string. For example, the language and character set from the browser settings can be passed in an HTTP header accompanying the character string.

The top-level domain associated with the character string can indicate a country. The top-level domain is the highest level of the hierarchy below the root; in a domain name, it is the right-most part. For example, in the domain name "usedrugs.co.uk", the top-level domain is ".uk", which can indicate the United Kingdom. The top-level domain ".ru" can indicate Russia. The country associated with a top-level domain can be used in determining candidate languages; ".ru", for example, suggests that the associated character string may be in Russian. A top-level domain may also indicate more than one language. For example, ".ch" can indicate Switzerland and that the character string may be associated with French, German, or Italian. Other suitable signals and methods of identifying candidate languages for the character string may also be used.
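By way of illustration only, the top-level-domain signal described above might be sketched as follows. The lookup table and the function names are hypothetical; the description names only ".uk", ".ru", and ".ch" as examples and does not prescribe any particular data structure.

```python
# Hypothetical lookup table built only from the examples given in the description.
TLD_LANGUAGES = {
    "uk": ["English"],
    "ru": ["Russian"],
    "ch": ["French", "German", "Italian"],
}


def top_level_domain(domain_name):
    """Return the right-most label of a domain name, lower-cased."""
    return domain_name.rstrip(".").rsplit(".", 1)[-1].lower()


def candidate_languages_from_tld(domain_name):
    """Map a domain name to candidate languages; unknown TLDs yield no signal."""
    return list(TLD_LANGUAGES.get(top_level_domain(domain_name), []))


print(candidate_languages_from_tld("usedrugs.co.uk"))  # ['English']
print(candidate_languages_from_tld("example.ch"))      # ['French', 'German', 'Italian']
```

A top-level domain that is absent from the table, such as ".com", simply contributes no candidates, leaving the other language signals to decide.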

Block 206 is followed by block 208, in which a plurality of segmented results is generated from the character string by segmenting it for each candidate language. Segmenting the character string may comprise parsing its characters into various combinations of tokens and can be performed by the segmentation processor 122. The segmentation processor 122 can generate a list of segmented results for each candidate language. Each segmented result can be a particular combination of tokens or a single token. For example, the character string "assocomunicazioni" can be segmented in Italian into "asso comunicazioni", among other segmented results, and in French into "asso com uni cazioni", among other segmented results. As another example, the character string "maisonblanche" can be segmented in French into "maison blanche" and in English into "mai son blanc he", among other segmented results. As a further example, the character string "usedrugs" can be segmented in English into segmented results including "used rugs", "use drugs", "us ed rugs", "u sed rugs", and "usedrugs". Segmented results can likewise be generated for other candidate languages, such as French and German. The unsegmented character string may be included among the segmented results.
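By way of illustration only, generating every segmented result for one candidate language can be sketched as a recursive search over a token vocabulary. The vocabulary below is a hypothetical stand-in for the tokens of one language; it is not the segmentation technique of the application incorporated by reference.

```python
def segmentations(text, vocabulary):
    """Yield every way to split `text` into tokens drawn from `vocabulary`."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocabulary:
            for tail in segmentations(text[end:], vocabulary):
                yield [head] + tail


# Hypothetical English token set; the whole string is listed as a token so
# that the unsegmented string appears among the results.
ENGLISH_TOKENS = {"u", "us", "use", "used", "ed", "sed", "rugs", "drugs", "usedrugs"}

results = sorted(" ".join(tokens) for tokens in segmentations("usedrugs", ENGLISH_TOKENS))
print(results)
# ['u sed rugs', 'us ed rugs', 'use drugs', 'used rugs', 'usedrugs']
```

Running the same function with a different vocabulary for each candidate language yields one list of segmented results per language.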

The segmentation processor 122 can use tokens from the token database 126 in the segmentation process. Various methods can be used to segment the character string, such as the segmentation techniques disclosed in PCT International Patent Application No. PCT/US03/41609, entitled "Methods and Systems for Text Segmentation", filed December 30, 2003, which is incorporated herein by reference in its entirety.

Block 208 is followed by block 210, in which the top segmented results are determined for each candidate language. The top segmented results can be determined by the segmentation processor 122 and can be the results with the highest probability of being the best or operable segmented result. In one embodiment, the segmented results can be ranked based on a probability value determined for each segmented result. In one embodiment, the probability value can be determined by summing frequency values associated with the individual tokens in each segmented result. In another embodiment, the probability value can be determined by a more complex function that involves summing the logarithms of the frequency values associated with the individual tokens in each segmented result. A number of the top-ranked segmented results can then be selected. For example, the segmented results for each candidate language can be ranked and the top three results for each candidate language selected.
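By way of illustration only, the log-sum scoring and top-three selection of block 210 might be sketched as follows. The token frequency values are invented for the example, and the floor value used for unseen tokens is an assumption that the description does not specify.

```python
import math


def log_score(tokens, token_frequency, floor=1e-9):
    """Sum the logarithms of the token frequencies; unseen tokens get `floor`."""
    return sum(math.log(token_frequency.get(token, floor)) for token in tokens)


def top_results(candidates, token_frequency, n=3):
    """Rank candidate segmented results by score, highest first, and keep n."""
    ranked = sorted(
        candidates,
        key=lambda tokens: log_score(tokens, token_frequency),
        reverse=True,
    )
    return ranked[:n]


# Invented relative token frequencies for English.
FREQ = {"used": 0.02, "rugs": 0.001, "use": 0.03, "drugs": 0.002,
        "us": 0.04, "ed": 0.0005, "u": 0.001, "sed": 0.0001}

candidates = [["used", "rugs"], ["use", "drugs"], ["us", "ed", "rugs"],
              ["u", "sed", "rugs"], ["usedrugs"]]
print(top_results(candidates, FREQ))
# [['use', 'drugs'], ['used', 'rugs'], ['us', 'ed', 'rugs']]
```

Summing logarithms rather than multiplying raw frequencies avoids numerical underflow for long strings and naturally penalizes results made of many rare tokens.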

Block 210 is followed by block 212, in which a frequency search is performed for the top selected segmented results for each candidate language. The frequency search can be performed by the frequency processor 124 in conjunction with the search engine 156. In one embodiment, the segmentation processor 122 can pass the selected segmented results to the frequency processor 124. The frequency processor 124 can determine the frequency of occurrence of each segmented result in a corpus of articles or search queries.

In one embodiment, the frequency processor 124 can determine the frequency of occurrence of the segmented results based on articles indexed by the search engine 156. In one embodiment, the frequency processor 124 can send the top selected segmented results to the search engine 156 via the network 106. The search engine 156 can search the indexed articles for each segmented result, using each segmented result as a search query. For example, the frequency processor 124 can send each segmented result for each candidate language to the search engine 156 as a search query enclosed in quotation marks, so that the search engine 156 searches articles in the particular language for the exact segmented phrase. In one embodiment, for each segmented result, the search engine 156 can generate, in response to the search query, a search result set containing a number of article identifiers. The search engine 156 can send the search result set for each segmented result back to the frequency processor 124 via the network 106. The frequency processor 124 can determine, from each search result set and based on the number of article identifiers, the frequency with which each segmented result occurs.
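By way of illustration only, the exact-phrase search and the counting of article identifiers might be sketched with a small in-memory collection standing in for the search engine 156. The articles, identifiers, and language codes below are invented; a real search engine would be queried over the network instead.

```python
import re

# Hypothetical miniature article collection: (article id, language, text).
ARTICLES = [
    (1, "en", "We buy and sell used rugs and carpets."),
    (2, "en", "Never use drugs without a prescription."),
    (3, "en", "Used rugs in good condition wanted."),
    (4, "fr", "La maison blanche est au bout de la rue."),
]


def phrase_search(phrase, language, articles=ARTICLES):
    """Return ids of articles in `language` containing `phrase` as an exact phrase."""
    pattern = re.compile(r"\b" + re.escape(phrase.lower()) + r"\b")
    return [
        article_id
        for article_id, article_language, text in articles
        if article_language == language and pattern.search(text.lower())
    ]


def occurrence_count(phrase, language, articles=ARTICLES):
    """Number of article identifiers in the result set for a quoted query."""
    return len(phrase_search(phrase, language, articles))


print(phrase_search("used rugs", "en"))          # [1, 3]
print(occurrence_count("maison blanche", "fr"))  # 1
```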

In another embodiment, the frequency processor 124 can send the top selected segmented results to the indexer 158 via the network 106. For each of the selected segmented results, the indexer 158 can access the index 160 to determine the number of articles in the particular language in which the segmented result appears. In one embodiment, the index 160 may comprise multiple indexes, and the indexer 158 can check a portion of the overall index for each segmented result. The indexer 158 can then pass the number of occurrences associated with each segmented result to the frequency processor 124 via the network 106.

In yet another embodiment, the frequency processor 124 can send the top selected segmented results to the search engine 156 via the network 106 to determine the number of occurrences of the segmented results in search queries. For example, the search engine 156 can determine, for each segmented result in the relevant language, the number of times the segmented result was used as a search query or as part of a search query. The search engine 156 can send the number of occurrences in search queries for each segmented result to the frequency processor 124 via the network 106.
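By way of illustration only, counting how often a segmented result was used as a search query or as part of one might be sketched as follows. The query log is invented, and matching on whole words in sequence is an assumption about what "part of a search query" means.

```python
def query_log_count(phrase, queries):
    """Count logged queries that contain `phrase` as a contiguous run of whole words."""
    needle = phrase.lower().split()
    count = 0
    for query in queries:
        words = query.lower().split()
        last_start = len(words) - len(needle)
        if any(words[i:i + len(needle)] == needle for i in range(last_start + 1)):
            count += 1
    return count


# Hypothetical log of previously received English queries.
QUERY_LOG = ["used rugs", "cheap used rugs online", "use drugs safely", "rugs used"]

print(query_log_count("used rugs", QUERY_LOG))  # 2
print(query_log_count("use drugs", QUERY_LOG))  # 1
```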

For example, if the segmentation processor 122 determines that the selected segmented results for the character string "usedrugs" in English are "used rugs", "use drugs", and "us ed rugs", the frequency processor 124 can send these segmented results, along with the segmented results associated with other candidate languages, to the search engine 156. The search engine 156 can, for example, use these results as search queries and generate a search result set for each segmented result. For example, the search engine 156 can use "used rugs" as a search query and determine a search result set for that query containing article identifiers associated with English-language articles that contain the phrase "used rugs". The search engine 156 can do the same for the segmented results associated with the other candidate languages. In another embodiment, the search engine 156 can determine, from an associated search log containing previously received search queries, the number of search queries containing a received segmented result. For example, the search engine 156 can search the search log for the number of search queries containing the phrase "used rugs". In yet another embodiment, the indexer 158 of the search engine 156 can receive the segmented results and determine the number of articles in the index 160, or in a portion of the index 160, that contain each segmented result. For example, the indexer 158 can search the index 160 or a portion of the index 160 for the number of English-language articles containing "used rugs".

A spell-checking function can also be included in the frequency search. For example, the frequency processor 124 can include or call a spell-checking function so that the top selected segmented results can be spell checked. The spell-checking function can determine the correct or preferred spelling for the individual tokens in each segmented result. The frequency processor 124 can perform a frequency search on the top segmented results as well as on any spell-corrected segmented results, to determine a frequency of occurrence for both. For example, if a segmented result is "basebal game" and the spell-corrected result is "baseball game", a frequency search can be performed for both.
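By way of illustration only, producing both the original and the spell-corrected segmented result for the frequency search might be sketched as follows. The description does not name a spell-checking algorithm; this sketch swaps in the closest-match helper from Python's standard difflib module, with an invented dictionary and an arbitrarily chosen similarity cutoff.

```python
import difflib


def spell_correct(tokens, dictionary):
    """Replace each token not in `dictionary` with its closest dictionary word, if any."""
    corrected = []
    for token in tokens:
        if token in dictionary:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(token, dictionary, n=1, cutoff=0.8)
        corrected.append(matches[0] if matches else token)
    return corrected


def variants_to_search(tokens, dictionary):
    """Return the segmented result plus its spell-corrected form when they differ."""
    original = " ".join(tokens)
    corrected = " ".join(spell_correct(tokens, dictionary))
    return [original] if corrected == original else [original, corrected]


DICTIONARY = ["baseball", "game", "basket"]  # hypothetical word list
print(variants_to_search(["basebal", "game"], DICTIONARY))
# ['basebal game', 'baseball game']
```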

In one embodiment, each frequency of occurrence for a segmented result is a value normalized by the total number of articles or search queries in the particular language. For example, if an English segmented result occurs in 70 English articles or search queries out of a total of 1,000 English articles or search queries, its frequency of occurrence is 0.07 (70/1000). Similarly, if a French segmented result occurs in 60 French articles or search queries out of a total of 400 French articles or search queries, its frequency of occurrence is 0.15 (60/400). In this way, the frequency of occurrence takes into account the prevalence of each language in the corpus of articles or search queries and does not inherently favor more common languages.
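By way of illustration only, the normalization in the example above can be written out directly. The function name is hypothetical; the counts are the ones given in the description.

```python
def normalized_frequency(hits, total):
    """Fraction of articles (or queries) in a language that contain the segmented result."""
    if total <= 0:
        raise ValueError("total must be positive")
    return hits / total


english = normalized_frequency(70, 1000)  # 70 of 1,000 English articles
french = normalized_frequency(60, 400)    # 60 of 400 French articles
print(english, french)                    # 0.07 0.15
print(french > english)                   # True
```

Although the English result has more raw hits (70 against 60), the French result has the higher normalized frequency, which is the effect the normalization is meant to achieve.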

Block 212 is followed by block 214, in which the operable language and the operable segmented result are identified. In one embodiment, the frequency processor 124 can identify the operable language and the operable segmented result. For example, the frequency processor 124 can select the segmented result with the highest associated frequency of occurrence. As described above, the frequency of occurrence can be a value normalized based on the number of articles or search queries containing the segmented result and the total number of articles or search queries in the particular language. Additional signals can also be used to determine the operable segmented result. For example, the frequency processor 124 can take into account an objective ranking of the articles containing each segmented result (such as the PageRank® ranking algorithm for web articles) and use the objective ranking to weigh the articles containing each segmented result. The number of times the segmented result occurs in an article and its position within the article can also be used to weigh the articles containing the segmented result. The candidate language associated with the operable segmented result can be selected as the operable language.

In one embodiment, the language signals used at block 206 to identify the candidate languages may also be used in determining the operable language. If the language signals indicate that the character string is most likely in a particular language, these signals can be used to give greater weight to that language. For example, language signals such as linguistics, the IP address of the associated user, the character set used in the character string, the browser settings of the user's browser application program, and the top-level domain associated with the character string may indicate that the language associated with the character string is a particular language, such as French. The frequency of occurrence for a segmented result in another language, such as English, may be close to or exceed the frequency of occurrence for a segmented result in French. The language signals can then be used to weight French so that French is selected as the operable language in this example. The method 200 ends at block 216.
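By way of illustration only, selecting the operable language at block 214, with optional weighting by language signals, might be sketched as follows. The multiplicative weighting scheme and all numeric values are assumptions; the description says only that the signals can give a language greater weight.

```python
def select_operable_language(frequencies, signal_weights=None):
    """Pick the language whose best segmented result has the highest weighted frequency.

    frequencies: {language: (segmented_result, normalized_frequency)}
    signal_weights: {language: multiplier}; languages not listed default to 1.0.
    """
    weights = signal_weights or {}

    def weighted(item):
        language, (_, frequency) = item
        return frequency * weights.get(language, 1.0)

    language, (segmented_result, _) = max(frequencies.items(), key=weighted)
    return language, segmented_result


# Invented normalized frequencies for the string "maisonblanche".
frequencies = {
    "English": ("mai son blanc he", 0.012),
    "French": ("maison blanche", 0.010),
}

print(select_operable_language(frequencies))
# ('English', 'mai son blanc he')
print(select_operable_language(frequencies, {"French": 2.0}))
# ('French', 'maison blanche')
```

Without weights the English result narrowly wins on frequency alone; a weight favoring French, as might be derived from a ".fr" top-level domain or browser settings, reverses the outcome.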

The operable language and the operable segmented result can be used in various ways. The operable language and/or the operable segmented result can be used in the selection of advertisements. For example, the user 112a may attempt to navigate to the website "usedrugs.com" by entering that string into a browser application. If no website exists at the domain name "usedrugs.com", the user's browser application may be redirected to a third-party website. The third-party website may wish to place advertisements and/or links relevant to the domain name entered by the user on the web page the user sees. The third-party website can send the domain name "usedrugs.com" to the segmentation engine 120. The segmentation engine 120 can use the methods and systems described above to return an operable language and an operable segmented result to the third-party website or to an advertising server associated with that website. For example, the operable segmented result may be "used rugs" and the operable language may be English. The third-party website or advertising server can then display advertisements and/or links relevant to the English phrase "used rugs" on the web page viewed by the user and can ensure that the language used on the website is English. The operable language can also be used to select the language of status messages displayed to the user.
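By way of illustration only, choosing advertisements from the operable language and operable segmented result might be sketched as follows. The `Ad` record, the inventory, and the keyword-overlap rule are all hypothetical; the description does not specify how an advertising server matches advertisements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ad:
    text: str
    language: str
    keywords: frozenset


def select_ads(ads, operable_language, segmented_result):
    """Keep ads in the operable language that share a keyword with the segmented result."""
    wanted = set(segmented_result.lower().split())
    return [ad for ad in ads if ad.language == operable_language and wanted & ad.keywords]


# Hypothetical advertisement inventory.
INVENTORY = [
    Ad("Quality second-hand rugs", "en", frozenset({"rugs", "carpets"})),
    Ad("Tapis d'occasion", "fr", frozenset({"tapis"})),
    Ad("Pharmacy discounts", "en", frozenset({"drugs", "pharmacy"})),
]

chosen = select_ads(INVENTORY, "en", "used rugs")
print([ad.text for ad in chosen])  # ['Quality second-hand rugs']
```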

<General>
Although the above description contains many specifics, these should not be construed as limiting the scope of the invention, but merely as examples of the disclosed embodiments. Those skilled in the art will envision other possible variations within the scope of the invention. The terms "first" and "second" are used herein merely to distinguish one item from another. Unless expressly noted, they are not used to indicate first or second in time, first or second in a list, or any other order. For example, a "second" item may come before a "first" item in time or in a list, unless expressly indicated otherwise.

FIG. 1 shows a diagram of a system according to one embodiment of the present invention.
FIG. 2 shows a flowchart of one embodiment of a method carried out by the present invention.

Explanation of symbols

100 System
102a-n Client devices
104 Server device
106 Network
116 Processor
118 Memory
120 Segmentation engine
122 Segmentation processor
124 Frequency processor
126 Language processor
126 Token database
150 Server device
152 Processor
154 Memory
156 Search engine
158 Indexer
160 Index

Claims

A computer-implemented method (200) comprising:
receiving (204) a character string that contains no tokens indicating segmentation;
identifying (206), using specific rules and information, at least a first candidate language and a second candidate language as languages expected to be used in the character string;
determining (208) at least a first segmented result having a first plurality of tokens associated with the first candidate language for the character string, and at least a second segmented result having a second plurality of tokens associated with the second candidate language for the character string;
determining a first frequency of occurrence of the first segmented result in at least one of a search engine index or a log of search queries received by a search engine (156), and determining a second frequency of occurrence of the second segmented result in at least one of the search engine index or the log of search queries received by the search engine (156); and
identifying (214), and outputting, an operable language from the first candidate language and the second candidate language based at least in part on the first frequency of occurrence and the second frequency of occurrence.

The computer-implemented method (200) of claim 1, wherein identifying (206) the first candidate language comprises identifying the first candidate language based on at least one item of information selected from the group consisting of information included in the IP address of a user associated with the character string, browser settings of a browser application program of the user associated with the character string, and a top-level domain associated with the character string.

The computer-implemented method (200) of claim 2, wherein the step (206) of identifying the operable language is performed based at least in part on at least one of the characters included in the IP address of the user.

The computer-implemented method (200) of claim 1, wherein the step (212) of determining the first frequency of occurrence of the first segmented result in the search engine index comprises normalizing the first frequency of occurrence based on the number of search engine index entries corresponding to the first candidate language, and
the step (212) of determining the first frequency of occurrence of the first segmented result in the search query log comprises normalizing the first frequency of occurrence based on the number of search queries in the log corresponding to the first candidate language.

The computer-implemented method (200) of claim 1, further comprising outputting an advertisement selected based on the operable language.

The computer-implemented method (200) of claim 1, wherein the step of determining the first segmented result comprises:
determining a plurality of segmented results in the first candidate language from the character string; and
identifying the first segmented result from the plurality of segmented results based on a probability value associated with each of the plurality of segmented results,
wherein each segmented result comprises a plurality of tokens that differs from that of every other segmented result.

The computer-implemented method (200) of claim 6, wherein a first probability value associated with the first segmented result is calculated based at least in part on the frequency of each token in the first segmented result.

The computer-implemented method (200) of claim 1, further comprising outputting a web page containing text expressed in the operable language.

The computer-implemented method (200) of claim 1, wherein the step (212) of determining the first frequency of occurrence comprises using the search engine (156) to identify the number of articles in the first candidate language that correspond to a first query containing the first segmented result, and
the step (212) of determining the second frequency of occurrence comprises using the search engine (156) to identify the number of articles in the second candidate language that correspond to a second query containing the second segmented result.

The computer-implemented method (200) of claim 1, wherein the step (212) of determining the first frequency of occurrence comprises normalizing the first frequency of occurrence based on the total number of articles in the first candidate language indexed by the search engine.

The computer-implemented method (200) of claim 9, wherein using the search engine (156) to identify the number of articles in the first candidate language comprises:
executing, in the search engine, a search query containing the first segmented result; and
determining the number of article identifiers in a result set generated by the search engine as a result of executing the search query.

The computer-implemented method (200) of claim 9, wherein the step of using the search engine (156) to identify the number of articles in the first candidate language comprises determining the number of entries in an index (160) associated with the search engine (156) that correspond to one or more of the first plurality of tokens.

The computer-implemented method (200) of claim 1, further comprising:
selecting an advertisement based at least in part on the operable language; and
displaying the advertisement in association with a web page associated with a domain name,
wherein the advertisement comprises text in the operable language.

A computer-readable recording medium having recorded thereon a program configured to cause a computer to execute the computer-implemented method (200) of any one of claims 1 to 13.