Friday, July 16, 2010

Information Extraction

In practical applications, choosing the most effective model and algorithm is a real test of skill. So for the past few days I have stayed at home, kept off the web, read no blogs, and concentrated on algorithms for information extraction (Information Extraction). Among these, I spent a good while chewing over the hidden Markov model (HMM) again. An article on the Google China blog, "The Beauty of Mathematics, Series 3 - Hidden Markov Model in Language Processing", explains the application of hidden Markov models in a classic way and is a very good read. I had already worked through the first few installments of the "Beauty of Mathematics" series systematically and put several of them into my "One a Day" section as a record of my learning. Although the section is called "One a Day", in practice it is updated far less often than daily. Those articles are not my own writing, so they need to be digested before they are really useful to me. Simply reposting them would be unnecessary and a waste of time. So I hope the articles I add to "One a Day" genuinely help me.

Here is a brief overview of three categories of information extraction methods.

1. Rule-based methods. This approach works well for a specific, narrowly defined problem, but it places strict demands on the information to be extracted. Extraction is driven by a rule base, so the quality of the rule base directly determines the algorithm's recall and precision. In practice, especially in commercial projects, building a high-quality rule base is often not economical. The approach can still serve as the core when a project starts: once enough data has accumulated to train a model and the trained algorithm goes into production, the quality of the whole project can be raised to some degree.
2. Hidden Markov methods. This is a classic information extraction algorithm. However, it requires the source content to be sequentially related, that is, the data must be arranged in an order that reflects logical relations. For content whose pieces are independent of one another, it does not work well. Unfortunately, the data source in my project is exactly like that. Its content is divided into paragraphs, and although Chinese writers do have habitual paragraph orders, that habit cannot be abstracted into a logical ordering, so the Markov algorithm is not a good fit.
3. Text classification-based methods. This approach assumes the pieces of information are independent of one another and uses a classification algorithm to extract them, which suits extraction problems where the information appears in no dependable order. With a relatively high-quality Chinese word segmentation algorithm, extraction precision and recall are fairly high. I plan to use this approach as the core algorithm of my project.
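
To make the third approach more concrete, here is a minimal sketch of classification-based extraction. Everything specific in it is an assumption for illustration only: the post does not name a classifier, so multinomial naive Bayes is used as a stand-in; the field labels ("contact", "salary", "requirements") and the training sentences are invented; and tokenize() simply splits on whitespace where a real project would call a Chinese word segmenter. The point it shows is that each segment is labelled on its own, so the order of segments plays no role.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Placeholder for a real Chinese word segmenter (assumption):
    # here we just lowercase and split on whitespace.
    return text.lower().split()


class NaiveBayesFieldClassifier:
    """Multinomial naive Bayes with add-one smoothing (illustrative choice)."""

    def __init__(self):
        self.label_counts = Counter()            # training segments per label
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def extract(segments, classifier):
    # Each segment is classified independently; segment order is never used.
    fields = defaultdict(list)
    for segment in segments:
        fields[classifier.classify(segment)].append(segment)
    return dict(fields)


# Invented training data (hypothetical labels and sentences).
TRAINING = [
    ("phone 010 5550 0100", "contact"),
    ("email and phone number", "contact"),
    ("salary 8000 per month", "salary"),
    ("monthly salary negotiable", "salary"),
    ("requires python experience", "requirements"),
    ("experience with java requires degree", "requirements"),
]

classifier = NaiveBayesFieldClassifier()
classifier.train(TRAINING)

segments = ["salary 9000 per month", "contact phone 010 5550 0199"]
result = extract(segments, classifier)
print(result)
```

Shuffling the segments changes nothing about which field each one lands in, which is the property that makes this approach a better match than an HMM for data whose paragraphs carry no logical order.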
