Friday, July 16, 2010

Practical experience with Chinese word segmentation algorithms

I have recently been working on a search-engine-related project. Needless to say, it is built on Lucene, standing on the shoulders of our predecessors, heh. Anyone familiar with Lucene knows that its support for Chinese word segmentation is not very good, so I implemented some Chinese word segmentation features myself, and the initial test results were pretty good. There is already a lot of discussion about Chinese word segmentation, but actual development gave me some experience of my own, and I had been meaning to write a blog post about it. Then today I unexpectedly found that a colleague had already written an article on the subject. I read it and felt that everything I wanted to say, he had already said, so I am reposting it here.

The following is the reposted article, "A discussion about word segmentation":

---------------------------------------------

In full-text information retrieval systems, the question of which segmentation method to use when building the inverted index has always drawn conflicting opinions, and it has never been settled.

As far as I know, certain papers have "shown through research" that an index built with bigram (two-character) segmentation is "the best". I have also seen a fellow blogger claim that single-character segmentation is the most accurate (sorry, I have forgotten the exact source). And, of course, packaging a Chinese word segmentation component based on a dictionary or on co-occurrence frequency into one's own project is also a very popular practice.

With so many views and practices around, one is inevitably seized by a lofty determination to sort out which of them is actually right.

As a mature and sensible hot-blooded youth, however, I think that determination is unnecessary, because an information retrieval system is judged by several criteria at once. Recall, precision, and query efficiency pull against one another: they cannot all be maximized, so you have to choose. People who care about different indicators naturally hold different views and pursue different approaches.

Suppose you are building a Web search engine. Query efficiency must come first, because huge data volumes and concurrent requests are a barrier you cannot get around. Between recall and precision you will then lean toward precision, because the relationship between end users and a Web search engine is like that between a heartless man and an infatuated woman: users want the most satisfying results as quickly as possible, abandon you the next moment, and come back only when they need you again. (Of course, if you run a paid bid-ranking service, you had better pay attention to recall in order to avoid customer complaints. The conflict of interest between the majority of ordinary users and a handful of VIPs is profound, long-term, and irreconcilable...)

For a traditional library information retrieval system the situation is different. Books and articles come with good keyword indexes and well-defined structured fields such as title, author, abstract, and text, and the document collection is relatively stable and relatively small. All of this pushes your decisions toward improving recall, for a very simple reason: you are in a position to do so.

Now that it is clear that information retrieval systems are evaluated on several indicators at once, let us look at how different segmentation strategies for indexing actually affect those indicators.

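Let us first compare the two opposing strategies: single-character segmentation, which indexes every Chinese character as its own term, versus Chinese word segmentation, which indexes whole words: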

The strongest evidence offered by supporters of single-character segmentation goes roughly like this:
"World Cup" (presumably 世界杯 in the original Chinese) is one word. With single-character segmentation, a search for "world" (世界) also hits this document; with Chinese word segmentation it finds nothing.
The rebuttal from supporters of Chinese word segmentation goes roughly like this:
Take "participated in the World Cup" (presumably 参加过世界杯). With single-character segmentation, a search for "passed away" (过世) also hits this document, although in fact nobody died.

From these two arguments we can observe that single-character segmentation raises the system's recall while lowering its precision. Chinese word segmentation does the opposite: it raises precision and lowers recall, and the coarser the granularity of the words (the longer the average word length), the more pronounced this trend becomes.
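The trade-off can be reproduced on a toy corpus. The sketch below is only an illustration under stated assumptions: the three documents, the hand-written word segmentation, and the relevance judgments are all made up for this example, and the Chinese strings are my reconstruction of the translated "World Cup" examples. It compares indexing every character as a term against indexing whole words.

```python
# Toy illustration. The documents, their word segmentation and the
# relevance judgments are assumptions made up for this sketch.

# Hand-written stand-in for the output of a dictionary-based segmenter.
WORD_SEGMENTED = {
    "d1": ["参加", "过", "世界杯"],  # "took part in the World Cup"
    "d2": ["世界", "和平"],          # "world peace"
    "d3": ["他", "过世", "了"],      # "he passed away"
}
DOCS = {doc_id: "".join(words) for doc_id, words in WORD_SEGMENTED.items()}


def contains_sequence(tokens, query_tokens):
    """True if query_tokens occurs as a contiguous run inside tokens."""
    n = len(query_tokens)
    return any(tokens[i:i + n] == query_tokens
               for i in range(len(tokens) - n + 1))


def search_chars(query):
    """Single-character segmentation: match adjacent characters."""
    q = list(query)
    return {d for d, text in DOCS.items() if contains_sequence(list(text), q)}


def search_words(query):
    """Word segmentation: the query must equal an indexed word."""
    return {d for d, words in WORD_SEGMENTED.items() if query in words}


def precision_recall(hits, relevant):
    """Return (precision, recall) for a result set."""
    tp = len(hits & relevant)
    precision = tp / len(hits) if hits else 1.0
    recall = tp / len(relevant) if relevant else 1.0
    return precision, recall


# Assumed judgments: a user searching 世界 also wants the World Cup page;
# a user searching 过世 only wants the page about someone passing away.
RELEVANT = {"世界": {"d1", "d2"}, "过世": {"d3"}}

for query, relevant in RELEVANT.items():
    print(query, "chars:", precision_recall(search_chars(query), relevant))
    print(query, "words:", precision_recall(search_words(query), relevant))
```

With these assumed judgments, the character index keeps full recall on both queries but drops to a precision of 0.5 on 过世, while the word index keeps full precision but drops to a recall of 0.5 on 世界.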

This conclusion seems to help explain why Google, Baidu, and other Web search engines, which in theory also demand high precision, have all adopted Chinese word segmentation. But if our understanding stopped at this level it would be rather superficial. The real point is that a Web search engine that needs high throughput when handling Chinese content has no choice but to use Chinese word segmentation.

Let us picture an inverted index as a table in which each row holds a TermText and the list of numbers of all documents that contain that TermText. When we query a particular keyword, we can then fetch all documents containing it in one go, instead of scanning the original document collection one by one. Building the index with different segmentation strategies really means breaking up the set of document numbers at different levels and spreading it over different rows of the index. Single-character segmentation breaks it up the least: the number of rows is only equal to the number of distinct Chinese characters, so each row's document list is long and the whole table is very "wide". By contrast, Chinese word segmentation with a coarser granularity distributes the document numbers over many more rows, which makes the table narrower. The width keeps shrinking as the word granularity grows; in the most extreme case every document is treated as a single "word", and the width of the table is 1 everywhere.
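The "width" argument can be made concrete with a small sketch. The four documents and their hand-written segmentation are assumptions for illustration only, and `build_index` and `max_width` are hypothetical helper names, not Lucene APIs. The code builds the same inverted index three times (one term per character, one term per word, one term per whole document) and reports the longest document list in each.

```python
from collections import defaultdict

# Hypothetical mini-corpus, already split into words by hand.
SEGMENTED = {
    "d1": ["参加", "过", "世界杯"],
    "d2": ["世界", "和平"],
    "d3": ["他", "过世", "了"],
    "d4": ["世界杯", "决赛"],
}


def build_index(tokenized_docs):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in tokenized_docs.items():
        for term in tokens:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def max_width(index):
    """Length of the longest document list (the widest row)."""
    return max(len(doc_ids) for doc_ids in index.values())


# One term per character, per word, and per whole document.
by_char = build_index({d: list("".join(w)) for d, w in SEGMENTED.items()})
by_word = build_index(SEGMENTED)
by_doc = build_index({d: ["".join(w)] for d, w in SEGMENTED.items()})

print("widest row, characters:", max_width(by_char))  # 4 (the row for 世)
print("widest row, words:     ", max_width(by_word))  # 2 (the row for 世界杯)
print("widest row, documents: ", max_width(by_doc))   # 1
```

A corpus this small cannot show the row-count side of the argument (a real vocabulary has far more words than characters), so only the width is illustrated here.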

From the discussion above we can draw the following two points:
First, when the document collection is very large, system throughput is limited by the performance of the disk that stores the inverted index file. Using Chinese word segmentation to narrow the inverted index table therefore helps improve throughput.
Second, whether for Boolean queries or for position-based queries (such as PhraseQuery in Lucene), query performance with single-character segmentation is no better than with Chinese word segmentation.
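The second point can be sketched with a small positional index. This is not how Lucene implements PhraseQuery; it is a minimal stand-in with a hypothetical corpus and hypothetical helper names, meant only to show how many posting entries a phrase lookup has to read. With one term per character, finding 世界杯 means intersecting three document lists and checking positions; with one term per word, it is a single row lookup.

```python
from collections import defaultdict

# Hypothetical mini-corpus, already split into words by hand.
SEGMENTED = {
    "d1": ["参加", "过", "世界杯"],
    "d2": ["世界", "和平"],
    "d3": ["他", "过世", "了"],
    "d4": ["世界杯", "决赛"],
}


def build_positional_index(tokenized_docs):
    """Map term -> {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, tokens in tokenized_docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, terms):
    """Return (matching doc ids, number of posting entries read)."""
    postings = [index.get(t, {}) for t in terms]
    entries_read = sum(len(p) for p in postings)
    # Documents that contain every term of the phrase.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    # Keep only documents where the terms appear at consecutive positions.
    hits = set()
    for doc in candidates:
        for start in postings[0][doc]:
            if all(start + i in postings[i][doc]
                   for i in range(1, len(terms))):
                hits.add(doc)
                break
    return hits, entries_read


char_index = build_positional_index(
    {d: list("".join(w)) for d, w in SEGMENTED.items()})
word_index = build_positional_index(SEGMENTED)

char_hits, char_cost = phrase_query(char_index, list("世界杯"))
word_hits, word_cost = phrase_query(word_index, ["世界杯"])
print(sorted(char_hits), char_cost)  # ['d1', 'd4'] 9
print(sorted(word_hits), word_cost)  # ['d1', 'd4'] 2
```

Both indexes return the same documents, but the character index reads 9 posting entries against 2 for the word index, and it also has to do the position checks.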

Seen this way, it is not hard to understand why Web search engines use Chinese word segmentation. By the same token, when the document collection is small, a single-character segmentation strategy will not cause any major problems.

As for bigram (two-character) segmentation, in my view it tries to take a temperate middle path on a battlefield of hand-to-hand combat: it is a compromise between single-character segmentation and Chinese word segmentation that is less than satisfactory in some respects (ambiguity, meaningless bigrams, and so on). In implementation it is closer to single-character segmentation than to Chinese word segmentation. In my humble opinion, if it is applied as an improvement on top of standard Chinese word segmentation, it can to some extent mitigate the reduced recall of Chinese word segmentation and may turn out to be a more balanced solution all round.
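For reference, bigram segmentation simply emits every pair of adjacent characters. The sketch below is a minimal version; the example sentence is my reconstruction of the translated "World Cup" example, and `SMALL_DICTIONARY` is a made-up word list used only to label which bigrams happen to be real words.

```python
def bigrams(text):
    """Overlapping two-character terms; shorter input is kept as one term."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Made-up dictionary, used only to label the bigrams below.
SMALL_DICTIONARY = {"参加", "过世", "世界"}

terms = bigrams("参加过世界杯")
print(terms)  # ['参加', '加过', '过世', '世界', '界杯']

meaningless = [t for t in terms if t not in SMALL_DICTIONARY]
print(meaningless)  # ['加过', '界杯']
```

The output shows both drawbacks mentioned above: 加过 and 界杯 are meaningless bigrams, and 过世 ("passed away") is a real word that the sentence does not actually contain as a word. On the other hand, 世界 is indexed, so a search for "world" still finds the document.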
