Automatic Keyphrase Assignment

Vishwa Vinay

Thesis submitted to the Department of Computer Science, University College London, in partial fulfilment of the MSc in Intelligent Systems programme.

University College London
September 2003

Statement

Information retrieval (IR) deals with the representation, storage, organisation of, and access to information items. The representation and organisation of the information items should provide the user with easy access to the information in which he/she is interested. Keyphrases give a high-level description of a document's content that is intended to make it easy for prospective readers to decide whether or not it is relevant to them. Documents that have keyphrases assigned to them can be summarised very concisely, or clustered into groups based on the low-cost measure of similarity that the keyphrases provide. A related application is topic search: upon entering a keyphrase into a search engine, all documents with this particular keyphrase attached are returned to the user. In summary, keyphrases provide a powerful means for sifting through large numbers of documents by focusing on those that are likely to be relevant.

Unfortunately, only a small fraction of documents have keyphrases assigned to them – mostly because authors only provide keyphrases when they are explicitly instructed to do so – and manually attaching keyphrases to existing documents is a very laborious task. Therefore, ways of automating this process are of interest. This problem is an old one with established solutions. The aim of this project is not to propose a radically new approach, but rather to examine the existing algorithms that suit our purpose and to provide a working implementation.

The report is divided into two independent parts, each dealing with a sub-problem. The first part considers keyphrase extraction methodologies. Though there are many ways of defining "keyphrase", we assume that the linguistic 'noun phrase' structure fits our requirement. A way of identifying these phrases in free text is described. The second part of the report deals with the subject of stemming. The mechanism mentioned above successfully labels documents, but to group together conceptually related keyphrases, morphological variants of a given word need to be collapsed to the same root. This process of removing prefixes and suffixes from words in a document or query, in the formation of terms in the system's internal model, is called stemming.

By its nature, the problem depends on concepts, models and ideas borrowed from other fields – like computational linguistics and natural language processing. While vanilla computer science algorithms as well as elementary linguistic techniques are mentioned in this project, the main aim was to look at a couple of promising Machine Learning based algorithms in the context of the general area of document retrieval. There is also a bias towards algorithms that minimise the role of expert knowledge.

Internal Supervisor: Alex Chengyu Fang, Department of Phonetics and Linguistics, UCL
External Supervisor: Sean Slattery, APR Smartlogik

Acknowledgements

Several people have been instrumental in helping me finish this work, either directly or indirectly. To all the teachers on the MSc IS course – I've had a great time over the past year. To my advisor, Professor Alex Chengyu Fang, I would like to express my gratitude for his patience, understanding, and constructively critical eye.
To the people at Applied Psychology Research, I thank you for providing me with a setting to work on the project. A special thanks to Sean for his guidance at all times. To my Mom and Dad for giving me the opportunity to be here.

Vishwa Vinay
Department of Computer Science
University College London
September 2003

Table of Contents

Statement
Acknowledgements
Table of Contents
Chapter 1 Noun Phrase Extraction
  Introduction
  Part-of-Speech Tagging
  Hidden Markov Models in the Context of POS Tagging
  Example of the Tagging Process
  Noun Phrases
  Other Relevant Processing
  Description of the NP Extraction Process
  Implementation and Optimisation
  Results
  Future Directions
Chapter 2 Stemming
  Introduction
  A Few Useful Definitions
  A Survey of the Most Popular Stemming Algorithms
  Effects of Stemming
  The Exception List
  Results
  Future Directions
Conclusions
References
Appendix Hidden Markov Model (First Order) Formalism

Chapter 1 Noun Phrase Extraction

This section of the report shows how to design and tune a stochastic part-of-speech tagging system capable of extracting noun phrases from free text. In particular, the report presents a Hidden Markov Model based system and solutions to a few practical problems associated with it. The HMM is trained on a set of untagged sentences and uses an accompanying lexicon. Simple patterns of tags define what a noun phrase is, and such sequences of tags are extracted.

Introduction

Automatically extracting keyphrases from documents is a task common to many applications in Knowledge Management, Information Extraction/Retrieval and Natural Language Processing. Relevant keyphrases can be used as criteria for categorisation or for retrieval. Taxonomies can be created based on these phrases. Automatic summarisation mechanisms can pick sentences with a high keyphrase count. This project aims at picking noun phrases from a document as keyphrases.

The automation of this and related processes is of great interest due to the increased availability of documents in digital form and the ensuing need to organise them. The traditional knowledge-based approach requires domain experts to manually define a classifier. Machine Learning based approaches are far more attractive because of the decreased reliance on expert knowledge and easy portability across domains. In particular, efficient and reliable 'unsupervised learning' methods further reduce the dependence on data that has been manually prepared.

An important first step in discovering higher level structures and patterns in the text is part-of-speech tagging – associating each word with its grammatical context. Efficient, reliable and robust HMM based tagging systems have been proposed – this project uses [CKPS92] as a model, with other additions to satisfy the needs of a practical system. Documents received from higher level components were parsed and the extracted noun phrases were returned, where this information was used appropriately.

The goal of the project was to design and develop a set of modules that accepts a document, tags its tokens with their appropriate parts of speech, and uses these tags to identify noun phrases and other 'useful'/'interesting' areas in the text. The aim was to develop this as an independent library that can be called by anyone who requires this functionality.

Part-of-Speech Tagging

Part-of-speech tagging is the process of associating each word in a sentence with its relevant grammatical context.
Most words can serve different functions in different sentences; POS tagging disambiguates which particular purpose a word serves in a given sentence. The tags themselves depend on the tag-set used. While the major categories such as noun, verb, adjective and adverb exist in all, different tag-sets differ in terms of finer divisions. Two commonly used tag-sets are:

1) The Brown Corpus tag-set: http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html
2) The Penn Treebank tag-set: http://www.ling.upenn.edu/courses/Fall_2002/ling001/penn_treebank_pos.html

Depending on the purpose of the tagging, an appropriate tag-set is used. While applications that are linguistically inclined choose more elaborate sets, for most other applications a restricted set suffices. A smaller set also reduces computational costs, which is an issue in our case, and we therefore use the Penn Treebank tag-set.

The first resource required for most tagging mechanisms is a lexicon – a list of words along with all the possible tags that each word can be associated with. The tags in the lexicon come from the chosen tag-set. The tagging will work with a non-exhaustive lexicon, but in terms of both speed and accuracy, the more exhaustive it is, the better.

Different approaches exist for the automatic tagging of natural texts, such as:

1) Using transition probabilities empirically calculated from a pre-tagged corpus. An n-gram model (a given state depends on the previous n states) is used to calculate the most likely sequence of tags. The main problems with such an approach are the reliance on a tagged corpus (to calculate the probabilities) and the possibility of a combinatorial explosion when there exists a sequence of words with many choices for tags (dynamic programming methods exist but may not provide accurate results).
2) A technique known as shallow parsing breaks the sentence into chunks [a content word (noun, verb, adjective, adverb) surrounded by function words (articles, prepositions, etc.) matching a pattern] and performs tagging as the first step towards building parse trees. Again, it relies on the existence of previously parsed/chunked data. [M02]
3) Some techniques use the tags of the words around an ambiguous word to determine its part of speech. However, this is usually used as a post-processing step to deal with words not seen in the lexicon. [NKM01] uses Support Vector Machines for this purpose and reports a very good accuracy for the tagging process, but SVMs tend to be computationally intensive.
4) A Maximum Entropy based method also exists but again requires a manually annotated corpus. [R96]

Our situation dictates the requirement of a technique that does not require any kind of prepared data (apart from a lexicon). [CKPS92] describes one such approach that uses Hidden Markov Models and has been shown to be accurate.

Hidden Markov Models in the Context of POS Tagging

HMMs have long been used to model spoken language ([LRS83]) and other sequential signals. Their structure offers a compact representation of word and word co-occurrence probabilities, which have been shown to carry information about the syntax and semantics of a language (English in our case). Further, efficient algorithms exist for learning parameters from sequences of words, for computing the probability of a sequence given the model, and for finding the highest probability path through the states of the model for a given sequence of words.
An HMM trained on a set of example sentences thus captures information about the grammar of the language in which the sentences were written. Each different part-of-speech tag corresponds to a word class. There are many possible combinations of POS tags, and each combination seen in the lexicon is called an ambiguity class. In one approach to POS tagging using HMMs, the underlying Markov process models the transitions across word classes, while the ambiguity classes are observed. Unseen words can be handled by taking suffix information into consideration, but in our case they were allocated the set of open class tags – the part-of-speech classes that can in principle contain an unbounded number of words (nouns, adjectives, verbs, adverbs). Training involves looking at a set of (untagged) sentences and using them to estimate the transition and emission probabilities of the HMM – the Forward-Backward algorithm is used for this purpose. When the learnt model is given a new sentence, tagging corresponds to estimating the best sequence of hidden states (word classes) for its sequence of ambiguity classes (Viterbi algorithm).

Example of the Tagging Process

Sample lexicon section:

Word      | Ambiguity class
listen    | vb_vbp
listened  | vbd_vbn
listener  | nn
listeners | nns
listening | nn_vbg
listens   | vbz
listing   | nn_vbg
listings  | nns
lists     | nns_vbz

HMM section: [Figure: hidden states (word classes) vb, vbd, vbn, vbz, nns, nn; observed variables (ambiguity classes) with allowed emissions vb_vbd; vb_vbd_vbn; vbd_vbn; vbz_nns; nns_nn; nn.] All transitions between word classes are 'allowed', so there exist arrows from every hidden state to every other (including itself). The probabilities associated with each arrow are estimated during training.

Sample sentence (each word contributes one ambiguity class, taken from the lexicon):

Sentence: The quick brown fox jumped over the lazy dogs .
Ambiguity classes: The {dt}; quick {jj, nnp, rb}; brown {jj, np, nnp}; fox {nnp}; jumped {vbd, vbn}; over {in, rb, rp}; the {dt}; lazy {jj, nnp}; dogs {nns, vbz}; . {punc}

Treating the sentence as a series of ambiguity classes, the inference problem is to find the most probable path through hidden states from which the ambiguity class corresponding to each word can legally be emitted.

Noun Phrases

As linguistic structures, noun phrases are usually defined in terms of recursive production rules, e.g.:

NP  -> dt NP2
NP  -> NP2
NP2 -> (adj)* noun
NP2 -> noun

Identification of such structures would involve construction of the parse tree of tags for the particular sentence – a computationally expensive process, especially considering that identifying the noun phrases is part of a larger activity. In our case, a simple sequence of tags was used to specify our definition of a noun phrase, and such a pattern noticed in the tags of a given sentence corresponded to a phrase of interest. An example pattern would be

(dt) (jj)* {nn, nns, nnp}

where (x) indicates that x is optional, (x)* is any sequence of 0 or more x's, and {x, y, z} is one of x, y or z.

Examples of noun phrases satisfying this grammar:

1) Eager John rushed to the cinema .
   jj    nnp  vbd    to  dt  nn     punc
2) The fat black cat ate a  white mouse .
   dt  jj  jj    nnp vbd dt jj    nn    punc

Other Relevant Processing

The first step of the entire process involves breaking the text of the given document into its constituent text units. Any whitespace/punctuation separated sequence of characters forms one independent unit, the punctuation marks being considered as units themselves. Markers into the document, in the form of offsets from the start, are identified. These text units form the basis of all subsequent processing.
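As an illustration, here is a minimal C++ sketch of this tokenisation step. The type and function names are illustrative rather than the thesis's actual identifiers, and it assumes only the rule stated above: alphanumeric runs form one unit, each punctuation character is a unit of its own, and whitespace only separates.

```cpp
#include <cctype>
#include <string>
#include <vector>

// One whitespace/punctuation-delimited unit, with its offset into the
// document so that later stages can refer back to the original text.
struct TextUnit {
    std::string text;
    std::size_t offset;  // position of the first character in the document
};

// Minimal tokenisation sketch: alphanumeric runs form one unit, each
// punctuation character is a unit of its own, whitespace only separates.
std::vector<TextUnit> tokenise(const std::string& doc) {
    std::vector<TextUnit> units;
    std::size_t i = 0;
    while (i < doc.size()) {
        unsigned char c = static_cast<unsigned char>(doc[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isalnum(c)) {
            std::size_t start = i;
            while (i < doc.size() &&
                   std::isalnum(static_cast<unsigned char>(doc[i]))) ++i;
            units.push_back({doc.substr(start, i - start), start});
        } else {
            units.push_back({std::string(1, doc[i]), i});  // punctuation unit
            ++i;
        }
    }
    return units;
}
```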
Numbers, URLs and email addresses are other regions in the text that one may wish to highlight. Extracting them involves the following:

1) Identify simple Finite State Machines (FSMs) that at least roughly describe the structure of such a type, e.g. a numeral would be a sequence of digits, possibly separated by commas and having a maximum of one decimal point. Identifying these zones comes down to finding the points at which the text of the document enters and exits this FSM.
2) Draw up a list of special words, e.g. domain extensions for emails and URLs ("com", "edu", "org", etc.) and number words ("one", "two", "hundred", "thousand", etc.) for numbers. Membership in this list is one of the criteria used to navigate through the FSM.
3) Tagging the words involves identifying the best sequence of legal tags for words in the same sentence. Therefore, sentence boundaries need to be identified. The FSM model is used for this as well: an end-of-sentence punctuation mark ('.', '!', '?') followed by whitespace or a capitalised letter identifies the end of a sentence.

Description of the NP Extraction Process

Once the text has been tokenised and split into sentences, it is ready for the extraction process. The heart of this process is the trained HMM.

Training the Model

The parameters of the HMM are estimated by iteratively going through sentences in a training file. While the model is stable with respect to insufficient data, to ensure that all the patterns in the grammar of the language are reinforced in the parameters, we ensure that the training data is sufficiently large and representative. Each token is replaced by its ambiguity class by way of a lexicon lookup, thereby translating a sentence into a list of symbols. This sequence is used to update the parameters of the model. The parameters are iteratively updated, making sure that they move towards their maximum likelihood values. After the specified number of iterations over the entire training set has been performed, the HMM is stored to disk for later use.

Using the Trained Model

A given new sentence is converted to a list of ambiguity classes. The inference process of the HMM identifies the most likely sequence of hidden states for this list, which is then translated back into part-of-speech tags (the unique tags that exist in the lexicon).

NP Identification

Given a tagged sentence, extracting the noun phrases involves picking the sequences of tags that match a given pattern, e.g. a sequence of adjectives terminated by a noun (possibly with predeterminers like "all", "many" and with central determiners "a", "an", "the"). By definition, a noun phrase can contain another noun phrase – in our case, only the maximal one is picked. The start and end of this phrase (in terms of its text unit indices) is then stored.
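A small C++ sketch of this kind of maximal pattern match for (dt) (jj)* {nn, nns, nnp} follows. The names are hypothetical, and the grammar is hard-coded here, whereas the actual system (described in the next section) reads its grammar from an external file.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Sketch of maximal-match NP identification over a tagged sentence using
// the pattern (dt) (jj)* {nn, nns, nnp}. Returns [start, end) unit indices.
std::vector<std::pair<std::size_t, std::size_t>>
extractNounPhrases(const std::vector<std::string>& tags) {
    static const std::set<std::string> adj  = {"jj"};
    static const std::set<std::string> noun = {"nn", "nns", "nnp"};
    std::vector<std::pair<std::size_t, std::size_t>> phrases;
    std::size_t i = 0;
    while (i < tags.size()) {
        std::size_t start = i;
        if (tags[i] == "dt") ++i;                           // optional determiner
        while (i < tags.size() && adj.count(tags[i])) ++i;  // (jj)*
        std::size_t nounEnd = i;
        while (nounEnd < tags.size() && noun.count(tags[nounEnd]))
            ++nounEnd;                                      // maximal noun run
        if (nounEnd > i) {
            phrases.emplace_back(start, nounEnd);           // at least one noun
            i = nounEnd;
        } else {
            i = start + 1;                                  // no match; move on
        }
    }
    return phrases;
}
```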
Implementation and Optimisation

Design

An object-oriented implementation in C++ was decided on. The HMM, Tagger and NP Extractor are all developed as separate components. This not only underlines their logical independence but also aids subsequent reuse in other situations. The lexicon is maintained as an independent structure common to all the phases of the extraction process. [Figure: component architecture – a Document is passed to the Noun Phrase Extractor, which uses the Noun Phrase Grammar, the Lexicon, the Part-of-Speech Tagger and its HMM, alongside Other Extractors, to produce a Parsed Document.]

An HMM with N hidden states and M output symbols is described by 3 parameters:

1) An N*N matrix of transition probabilities
2) An N*M matrix of emission probabilities
3) An N*1 matrix of start probabilities

The values of N and M are provided by the user of the HMM class. The Lexicon class is an in-memory representation of the lexicon file. Apart from a list of words and their associated tags, an instantiation of this class also maintains a list of the unique tags in the lexicon (word classes) as well as the list of combinations of these tags that have been seen in the lexicon (ambiguity classes).

The Tagger class maintains an HMM as its member along with a reference to a Lexicon. The value of N used for construction of the HMM is the number of word classes and M is the number of ambiguity classes. The HMM recognises only numbers representing hidden states and observed variables. The Tagger performs the translation of the sequence of strings it receives (the set of legal tags for each word) to the ambiguity class numbers the HMM works on. Each number in the state sequence output by the HMM is then converted to the string representation of the corresponding lexical tag.

The Noun Phrase Extractor has an additional dependency – a grammar defining what a noun phrase is. This is specified in terms of 2 sets of tags:

1) Tags that signal the start of the phrase – since we are interested in the maximal NP, we mark the start of any sequence of tags belonging to this set.
2) The end-of-phrase tags – the last occurrence of a member of this set signals the end of the phrase.

This grammar is maintained as an external file to provide flexibility to the design.
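To make the design concrete, a skeleton of the HMM component's three parameters, as described above, might look as follows. The member names are assumptions for illustration, not the project's actual identifiers.

```cpp
#include <vector>

// Skeleton of the HMM component: N hidden states (word classes) and
// M output symbols (ambiguity classes), both supplied by the caller.
class HMM {
public:
    HMM(int nStates, int nSymbols)
        : N(nStates), M(nSymbols),
          transition(nStates, std::vector<double>(nStates, 1.0 / nStates)),
          emission(nStates, std::vector<double>(nSymbols, 0.0)),
          start(nStates, 1.0 / nStates) {}

    int N, M;
    std::vector<std::vector<double>> transition; // N*N: P(state j | state i)
    std::vector<std::vector<double>> emission;   // N*M: P(symbol k | state j);
                                                 // allowed entries are seeded
                                                 // from the lexicon
    std::vector<double> start;                   // N*1: P(initial state i)
};
```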
Performance Tuning

1) To make sure that we have a reliable model, the HMM needs to be trained over a large number of sentences. One way of handling multiple sentences would have been to have additional states for the beginning and end of a sentence and to use one long appended sequence representing the whole training set (with begin- and end-of-sentence symbols inserted at the appropriate places) – but having to maintain all the training data in memory increases the resource requirements of the system. [LPP00] defines a method where an HMM can be trained on multiple observations, with the final parameters being averaged over the trained values. Since in our situation the different training sentences are unrelated to each other, the independence assumption holds good and the training equations in the HMM reflect this.
2) The "allowed" emissions are automatically inferred from the lexicon. Though these are legal, there is the chance that the values of these parameters go to 0 if they are under-represented in the training data. Due to this, there may be a sequence of symbols for which the trained HMM is unable to infer a transition through states with a non-zero probability. To ensure that this does not happen, at the beginning of every iteration, every allowed emission is initialised with an observed frequency of one. Therefore, though the frequently occurring patterns are reinforced, all legal options remain valid. In the case of transitions, since all transitions are in theory 'allowed' in our case, every transition is initialised to one.
3) The values of all the parameters of the HMM are probabilities; they therefore only hold values between 0 and 1. Over iterations and repeated multiplications and divisions, there is the danger of underflow. Logarithms are used to partially rectify this problem. An additional step involving the scaling of all parameters by a normalising constant is introduced; this ensures that the values stay within the dynamic range of the computer. [LRS83] (A sketch of this scaling step follows this list.)
4) The grammar input into the NP Extractor controls which phrases qualify as noun phrases and which do not. Our method of defining an NP is a little too restrictive to capture its full linguistic complexity. Expanding the grammar would lead to more phrases being picked up, but care should be taken that words are not incorrectly included. Special attention should be paid to the following tags – VBG (represents a present continuous verb as well as a gerund, the latter being interesting), IN (represents a preposition or subordinating conjunction, but since noun phrases can be part of prepositional phrases it may be a good idea to include this) and CD (cardinal numbers, which are good to pick up). e.g. The(dt) corrupt(jj) politician(nn) was(vbd) found(vbn) guilty(jj) of(in) accepting(vbg) over(in) 100,000(cd) pounds(nns) in(in) bribes(nns) .(punc)
5) To reduce the dimensionality of the model, all the classes that are not part of the NP grammar are replaced by one single tag. While this loses some fine-grained information about the structure of the sentence, in our case it leads to a smaller number of parameters for the HMM (more speed) without compromising the quality of the noun phrases pulled out.
6) During inference, when an unseen word is encountered, it is associated with an ambiguity class consisting of all open class tags. Which specific class it is finally tagged with depends on its context in the sentence. If unseen words are considered to be interesting, the open class can be restricted to contain only tags that are part of our NP grammar. In effect, every unseen word then becomes part of a noun phrase.
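As referenced in point 3, a minimal sketch of the per-step rescaling (following the scheme in [LRS83]) could look like this; the function name is hypothetical.

```cpp
#include <vector>

// Rescale one time-slice of forward variables by c_t = 1 / sum_i alpha_t(i)
// so that repeated multiplications of small probabilities do not underflow.
// The same constants rescale the backward variables (see the Appendix), and
// the returned c_t values let us recover log P(O|model) = -sum_t log c_t.
double scaleRow(std::vector<double>& alpha) {
    double sum = 0.0;
    for (double a : alpha) sum += a;
    double c = (sum > 0.0) ? 1.0 / sum : 1.0;  // guard the degenerate case
    for (double& a : alpha) a *= c;
    return c;
}
```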
Results

When using supervised mechanisms to train the part-of-speech tagger, the accuracy of the model is obtained by comparison with a manually annotated corpus. With unsupervised learning, we do not have a gold standard corpus against which the accuracy of the learner can be measured. [CKPS92] reports the performance of their implementation against the (tagged) Brown corpus, achieving an accuracy of 96%. But the lexicon used for this purpose was constructed from the tagged text, and therefore the testing phase would not have encountered any unseen words. In [CKPS92], the corpus used contained about 1,000,000 words and 8 iterations of training were performed. Our data had about 3,750,000 words and 5 iterations were performed. The least expected accuracy in other similar situations is about 85% ([NKM01]).

Training was performed on a set of 198,500 English sentences of varying lengths. The HMM had 37 hidden states (unique word classes, i.e. part-of-speech tags) and 324 ambiguity classes (word class combinations seen in the lexicon). Training time was a little over 4 hours. The trained model extracted 10,866 phrases from 6,920 documents in 475 seconds. This compares favourably, both in terms of quality of output and speed, with an industry standard natural language processing engine.

Future Directions

The described implementation performs satisfactorily in terms of both accuracy and speed and is stable with respect to insufficient training data. One area deserving attention is its performance on unseen words. To ensure higher accuracy with such words, extra processing can be added, for instance by incorporating linguistic information and using the suffixes of such words to help in the disambiguation. The design of the POS Tagger keeps the HMM completely insulated from the other processing. Hint passing mechanisms can be introduced into this structure so that the HMM uses information other than just the ambiguity class of the word to come to its decision. Other prior information that can be used to bias the HMM includes a choice of non-uniform starting probabilities for the different states.

Once the performance of the HMM is shown to scale, a more exhaustive tag-set can be used. This helps by giving finer control over the definition of the grammar, because such a set carries information about a richer set of linguistic features. Currently, the grammar of the NP is very simple. It serves the purpose, but it may be interesting to investigate other representations.

One shortcoming of the work described above is that the entire framework was not subjected to a rigorous evaluation procedure. Though the results were extremely encouraging, for the credibility of the system its behaviour needs to be validated. This could mean checking the behaviour of the HMM or examining the 'quality' of the keyphrases extracted.

Chapter 2 Stemming

The problem of stemming is a well studied one – for both linguistic and computational purposes. This section of the report presents a literature survey of the most common algorithms, identifying their pros and cons with respect to the particular purpose at hand (an interactive document retrieval system offering a free text search option). One approach was chosen as suitable, and an implementation of it was used for the design and development of an add-on tool to aid its operation.

Introduction

Information Retrieval (IR) is essentially a matter of deciding which documents in a collection should be retrieved to satisfy a user's need for information. The user's information need is represented by a query, which contains one or more search terms, plus perhaps some additional information such as importance weights. The retrieval decision is made by comparing the terms of the query with the index terms (important words or phrases) appearing in the document itself. The decision may be binary (retrieve/reject), or it may involve estimating the degree of relevance that the document has to the query.

Unfortunately, the words that appear in documents and in queries often have many variants. In most cases, these variants have similar semantic interpretations and can be treated as equivalent in Information Retrieval applications. Thus, for pairs of terms such as "computing" and "computation" to be recognised as equivalent, some form of natural language processing is required. Stemming is the simplest example of this kind of processing.

The nature of the application places different requirements on the stemming algorithm. The benefits (if any) of adding stemming to a system depend on which particular algorithm is used and how it is influenced by the rest of the system. If the 'rules' controlling the process of stemming a word are chosen carefully, they will largely be representative of the language in general. But, as with everything else, there are likely to be exceptions – these could be specific words that do not fit the general pattern, or domain specific words that should be handled specially. A list of such words can be assembled manually, but this report provides an algorithm for examining a corpus and drawing up a list of words that could potentially be placed in such a list.

A Few Useful Definitions

Linguistic concepts

English words are constructed from two different types of units – roots and affixes.
Roots usually have a rather specific meaning, and this meaning tends to be relatively constant across all the words that use the root. Roots also contribute the greatest conceptual content to the overall meaning of the word. Since roots do most of the work of conveying meaning, they are required word elements. Affixes attach to roots or to a combination of roots and other affixes. Their main use is to modify the meaning conveyed by the root or roots. The primary aim of stemming is therefore affix removal. Affixes are of two kinds:

1) Prefixes, which are attached to the beginning of a root
2) Suffixes, which are placed at the end

Suffixes in turn are of two kinds:

a) Inflectional suffixes are required to make a sentence grammatically correct, but they add little meaning to the word. They never change a word from one grammatical class to another, but each grammatical class has its own special set of inflectional suffixes. This is a very limited set in English – providing singular/plural information for nouns, tense for verbs and comparison-related information for adjectives. If an inflectional suffix occurs, it will always be the last suffix of any type in the word, and there will only be one inflectional suffix in any word. e.g. computer(s) {noun}, compute(d) {verb}, fast(er) {adjective}
b) Derivational suffixes are used to make (or derive) new words. In particular, they are used to change a word from one grammatical class to another. They can be divided into three kinds – noun-forming, adjective-forming and verb-forming. Multiple derivational suffixes may occur, but the last one determines the part of speech. e.g. comput(ation) {noun-forming}, comput(able) {adjective-forming}, comput(ation)(al) {adjective-forming after noun-forming}

IR Performance Evaluation

The purpose of an Information Retrieval system is to effectively pull out from the database a set of documents that match the user's need. This introduces the notion of relevance. Different users may differ about the relevance or non-relevance of particular documents to given questions, making the judgement of relevance very subjective. During experiments, relevance values for the documents are usually assigned by a set of experts. Users who typically have requirements in that domain provide the queries to be used for testing the system. Any new algorithm that is proposed needs to be objectively evaluated against present systems so that a proper justification is provided for a change to the new system.

For Information Retrieval systems, the popular choices of statistics are the values of Precision and Recall. Recall for a certain query is defined as the ratio of the number of relevant documents retrieved by the system to the total number of relevant documents in the database. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. The role of metrics like Average Precision and Average Recall in IR evaluation is disputed. While they are objective and (comparatively) easy to calculate, it is argued that they may not represent the true performance of the system. [KR96] provides statistical tests for measuring the change in the performance of a system with respect to these statistics – with and without stemming – using Analysis of Variance (ANOVA) type techniques.
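Restated compactly in the standard set notation (this adds nothing beyond the definitions just given):

\[ \mathrm{Recall} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{relevant}\}|}, \qquad \mathrm{Precision} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{retrieved}\}|} \]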
A Survey of the Most Popular Stemming Algorithms

Stemming algorithms can largely be classified into two groups:

1. Ones that do not use any kind of linguistic knowledge
2. Ones that do

The techniques that do not use background information tend to be very simple to implement and also computationally inexpensive. They also seem to serve the purpose in a surprisingly large percentage of applications. These elementary techniques include:

1. Case folding: this makes word matching effectively case-insensitive. Its major drawback is in the handling of acronyms (e.g. "AIDS" & "aids").
2. S-stemming: this stemmer removes common word endings like "s", "es" and "ies" to produce conflations. The advantage of this algorithm is that it is extremely conservative and therefore rarely produces results that are 'wrong'/'surprising'.
3. N-truncation: here, the last N letters of words are simply removed. Since variants of words are usually formed by the addition of suffixes, this provides the simplest form of suffix removal and is quite effective.

It is natural to expect that these elementary techniques will not satisfy most current applications, and a more detailed and rigorous treatment is necessary for all but the simplest situations. One way of looking at stemming is to think of it as clustering – the clusters being based on the rules of conflation. [DRAF] provides an algorithm to automatically create these clusters. While the paper reports experiments on the Arabic language, similar principles would hold for English. [CX95] gives details of a corpus based technique for automatically identifying conflation groups. It suggests the use of an "aggressive" stemmer (one that conflates to a smaller set of words) to identify a preliminary list. Based on some statistics gathered from a corpus, a decision is made as to whether or not these groups represent the same concept – if they do, they remain grouped together. The statistic considered is a mutual information measure – if two words are related in concept, they are likely to co-occur most of the time. The method relies on a sufficiently aggressive stemmer making the first set of associations.

When it comes to current IR applications, two suffix removal based stemming algorithms are very popular – the Lovins stemmer and the Porter stemmer. The Porter stemmer removes about 60 suffixes in a multi-step approach; each step successively removes suffixes or makes transformations to the stem. The Lovins stemmer removes over 260 different suffixes using a longest-match algorithm. The driving aim of these algorithms is not to produce the linguistic root, but to improve the performance of a retrieval system. e.g. the Porter stemmer conflates "accommodate", "accommodated" and "accommodation" to "accommod" – which serves the purpose of being the single handle on the three words but is not their linguistic representative.

[K93] provides a lexicon-based stemming method that places the condition that the resulting stem should be a valid word. This alleviates a major criticism of the Porter and Lovins stemmers. The approach is affix removal (both prefix and suffix) to identify morphological variants of a word. The technique also suggests an extremely useful procedure of maintaining an exception list of words that are resolved by way of a lookup, rather than being subjected to the list of rules.
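A minimal C++ sketch of this dictionary-checked idea is given below; the rule list is a tiny illustrative sample, not the actual rule set of [K93], and the function name is hypothetical.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Lexicon-checked stemming in the spirit of [K93]: a suffix rule fires only
// if the resulting candidate is itself a valid dictionary word.
std::string stemWithDictionary(const std::string& word,
                               const std::set<std::string>& dictionary) {
    // Illustrative sample rules only: (suffix to strip, replacement).
    static const std::vector<std::pair<std::string, std::string>> rules = {
        {"ies", "y"}, {"es", ""}, {"s", ""},       // inflectional
        {"ation", "e"}, {"able", "e"}, {"er", ""}  // derivational (sample)
    };
    for (const auto& [suffix, replacement] : rules) {
        if (word.size() > suffix.size() &&
            word.compare(word.size() - suffix.size(),
                         suffix.size(), suffix) == 0) {
            std::string candidate =
                word.substr(0, word.size() - suffix.size()) + replacement;
            if (dictionary.count(candidate))
                return candidate;  // valid word: accept the conflation
        }
    }
    return word;  // no rule produced a dictionary word: leave unchanged
}
```

Words missing from the dictionary (proper nouns, for instance) simply pass through unchanged, which is exactly the behaviour, and the limitation, discussed later in the evaluation.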
The WordNet project at the Cognitive Science Laboratory in Princeton represents a much more ambitious approach to the problem of concept stemming. Amongst the data maintained as part of the project is a dictionary listing of all the legal suffixes for each word. Also, a detailed taxonomy of English nouns/verbs/adjectives/adverbs that is in line with psycholinguistic theories of human lexical memory is constructed and used to identify sets of synonyms that are semantically related. The motivation of the WordNet system is therefore a better understanding of the linguistic processing in the human brain, and this may not necessarily benefit a practical IR system.

Effects of Stemming

The earliest evaluations of the different stemming algorithms varied widely in their reports. Harman reached the conclusion that stemming results in no significant performance improvement. Her evaluations were based on comparisons between the Lovins, Porter and S-stemmers. Krovetz, on the other hand, reports generally positive effects of stemming. Though the test collections used in these two experiments were different, the resulting conclusion was that in those cases where stemming is beneficial, it tends to have only a small effect on performance, and the choice of stemmer among the most common variants is not important. However, there was no evidence that a reasonable stemmer will hurt retrieval performance. [HG96] and [H96] consider more detailed criteria for evaluating the effect of the different stemming algorithms in the backdrop of IR. These provide more informative guidelines for new system design. Some of the conclusions reached were:

1) The nature of the language in which the system is operating influences the performance benefits, if any, that one is likely to get from the addition of stemming. English being a language with very limited morphological variety, conflation should not be expected to be very productive.
2) Query lengths and document lengths are also factors affecting the change in performance. In the case of short queries, stemming is definitely useful, but it tends to add noise as queries and documents become longer.
3) If the IR system is interactive, stemming of any kind is expected to be useful.

Brief Evaluation of the Different Stemming Algorithms

1) S-stemming and N-truncation: these methods are far too simple to provide any kind of improvement in system performance. Therefore, at all levels of precision and recall, a system having no stemming at all is likely to be just as good as one with S-stemming or truncation.
2) The Lovins stemmer, which uses a longest-match approach, scores over other algorithms when it comes to identifying derivational variants of words formed by suffix addition. However, reliance on a large number of general rules means that it is likely to make mistakes in specific contexts (e.g. "serve" and "server" would be equated, which is wrong in the domain of computers and technology). Being a highly aggressive stemmer, this algorithm tends to falter at high levels of precision.
3) The Porter stemmer is probably the most popular algorithm used for stemming. By way of its fall-through mechanism, it ensures that a large number of words are collapsed to a small set, but this aggressive nature is also its drawback. The Porter stemmer is similar to the Lovins stemmer in being rule-based, and the large number of rules introduces a maintainability problem. The stems produced by this algorithm are usually not proper English words, which may be a problem if the stems are used for purposes other than indexing (e.g. displayed to the user in an interactive system). There are some conflations produced by the Porter algorithm that are known to be incorrect (e.g. "police" & "policy") while it misses certain others (e.g. "Europe" & "European").
4) Morphological stemming provides results that are not very consistent across a range of text collections and queries, but scores over the others in terms of the clarity and flexibility of the algorithm. Applications requiring a conservative level of stemming use only inflectional suffix removal, while more aggressive requirements are met by adding derivational suffix removal. Prefix removal (as part of derivational morphology considerations) tends to be detrimental to the performance of IR systems. The drawback of this method is its reliance on an exhaustive dictionary – rules are applied only if the resulting word is part of the dictionary. While this ensures that all stems are valid words, it also means that most proper nouns (and other words missing from the dictionary) will be skipped. Though [K93] does provide a list of suggested suffixes and prefixes, the algorithm is simple enough that one could tailor the aggressiveness of the system by adding/removing rules.
5) Unsupervised clustering based algorithms are interesting from a Machine Learning perspective but are not useful in this scenario, which requires a certain amount of linguistic knowledge. As part of the investigation, an implementation of this approach was tried – the results did not match up to those of the other algorithms, mainly because of two issues:
• Clustering algorithms depend on knowing the number of clusters beforehand. What this means in this case is that, given a large corpus, we need to know how many of its words are stems. Values between 60-80% of the total number of unique words were tried, with mildly varying quality of output.
• It is difficult to define a distance metric that captures the stem-derived-word relationship. The Dice coefficient over the set of all bigrams (or trigrams) in the two words ends up associating words that are very different and just happen to have letters in common (e.g. "services" & "vice") – see the sketch after this list. If extra weight is placed on common letters at the beginning of words, other problems arise (e.g. "record" & "recent").
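As referenced in the last bullet, here is a sketch of the bigram Dice coefficient. This is the standard definition, Dice(a, b) = 2·|bigrams(a) ∩ bigrams(b)| / (|bigrams(a)| + |bigrams(b)|), with hypothetical function names.

```cpp
#include <set>
#include <string>

// Set of adjacent-letter bigrams of a word, e.g. "vice" -> {vi, ic, ce}.
std::set<std::string> bigrams(const std::string& w) {
    std::set<std::string> result;
    for (std::size_t i = 0; i + 1 < w.size(); ++i)
        result.insert(w.substr(i, 2));
    return result;
}

// Dice coefficient over bigram sets; "services" vs "vice" scores
// deceptively high, which is exactly the problem noted above.
double dice(const std::string& a, const std::string& b) {
    std::set<std::string> ba = bigrams(a), bb = bigrams(b);
    std::size_t common = 0;
    for (const auto& g : ba) common += bb.count(g);
    if (ba.empty() && bb.empty()) return 0.0;
    return 2.0 * common / (ba.size() + bb.size());
}
```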
The Exception List

All the stemming approaches work on the basis of a set of rules that have been observed for the words of the language. These rules are generalisations of the structure of words in that language, but for any language there are bound to be words that do not conform to them. e.g. in English, "better" should be conflated to "good", or consider the relation between "men" and "man". The morphological stemmer uses the mechanism of maintaining a list of exceptions – a mapping from words to their stems – for words that do not fall in line with the rules that the stemmer applies. This provides a convenient way of correctly handling outliers. A revised version of the original Porter algorithm also uses this kind of list to avoid making mistakes that are known.

Apart from exceptions in the language itself, when considering a particular domain (e.g. medical, financial, etc.) we might encounter other words that need to be treated differently. e.g. "blue" and "blues" are different concepts when music is the topic. Exception lists are constructed manually for each application. Using human judgement for these lists can be expensive and inconsistent in quality. [CX95] defines a technique for automatically modifying conflation classes based on corpus-specific statistics. While the paper uses this technique for identifying groups of words that are to be stemmed to the same root, it is used in a slightly different context here. The idea is to use these statistics to identify a list of pairs of words that are possibly being incorrectly equated by the stemmer.

The algorithm

Parameters:
OBSERVATIONS – the minimum number of times a word must have been seen for its statistics to be considered reliable
THRESHOLD – a value for the statistic that separates the interesting from the not-interesting

Resources:
DATA – a corpus with sentences from the chosen domain
STEMMER – a sufficiently aggressive stemmer

To construct equivalence classes:
1. Break DATA into tokens
2. Pass each token through STEMMER
3. If word wi stems to si, then {wi : si = s} forms an equivalence class

To calculate the statistics:

for each equivalence class Ei
    for each pair of words (wj, wk) in Ei
        if ( |wj| > OBSERVATIONS ) and ( |wk| > OBSERVATIONS )
            if Statistic(wj, wk) < THRESHOLD
                add (wj, wk) to the list of suspect conflations

where |w| is the number of times token w has been seen. As can be seen from the above algorithm, the number of times the statistic is calculated depends on the average number of words in each conflation class, which is just a little above one.

Definition of Statistic(w1, w2)

[CX95] suggests the use of a co-occurrence measure, similar to the expected mutual information measure, as evidence that two words being stemmed to the same root may be related in concept:

\[ \mathrm{Statistic}(w_1, w_2) = \frac{|w_{12}|_{\mathrm{window}}}{|w_1|_{\mathrm{window}} + |w_2|_{\mathrm{window}}} \]

where |w|_window represents the number of times token w has been seen in a defined window, and |w12|_window the number of windows in which both words occur. The logic behind using this as a measure is that words that are conceptually related are likely to co-occur; and if two words seem to be conceptually related, then conflating them together is likely to be correct. This mutual information based measure is logically very indicative of the dependence we wish to estimate. Its calculation also has the property of being comparatively inexpensive computationally. It was however observed to suffer from a shortcoming when it came to the inflectional forms of verbs – due to the structure of the language, different forms of the same verb are unlikely to co-occur. A similar observation was made, though to a slightly lesser degree, for noun forms. This could lead to a number of pairings being labelled as suspicious though they are perfectly valid. (Examples are given in the Results.)

It was suspected that the shortcomings of this measure were probably due to the fact that it did not represent the complete definition of the mutual information between two variables – it took into consideration only the documents where both words occurred. The complete definition of mutual information is

\[ \mathrm{MI}(x; y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)\,P(y)} \]

In our case, the distributions represent a binary occurs/does-not-occur choice. The summation for the complete MI value should therefore be over all four 0/1 combinations of x and y. (The earlier measure effectively represented just the {1,1} combination.) This new statistic, in general, did not work any better or worse than the earlier simple measure – even when the undesired {0,1} and {1,0} combinations (where one of the two words occurs but the other does not) were penalised.
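A small C++ sketch of the simple co-occurrence statistic follows. It assumes document-level windows and precomputed posting sets (sets of document ids per word) – both assumptions of the sketch, not details fixed by the text above.

```cpp
#include <set>

// Statistic(w1, w2) = |w12| / (|w1| + |w2|), where |w| is the number of
// documents (windows) containing w and |w12| the number containing both.
double cooccurrenceStatistic(const std::set<int>& docsWithW1,
                             const std::set<int>& docsWithW2) {
    std::size_t both = 0;
    for (int d : docsWithW1) both += docsWithW2.count(d);  // |w12|
    std::size_t denom = docsWithW1.size() + docsWithW2.size();
    return denom == 0 ? 0.0 : static_cast<double>(both) / denom;
}
```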
An attempt to more accurately model the 'language' of each document in which a word occurs (rather than just which documents the word occurs in) leads us to a measure based on the information theoretic distance known as the Kullback-Leibler (KL) divergence. For two discrete distributions p and q, the KL distance is defined as

\[ \mathrm{KL}(p, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \]

The value ranges from 0 (when the two distributions are the same) to ∞. The higher the value of this distance, the farther apart the two distributions are. This measure suffers from a drawback with respect to our current requirement – it is asymmetric. To get around this problem, consider a slightly different criterion:

\[ I(p, q) = \mathrm{KL}(p, q) + \mathrm{KL}(q, p) = \sum_x \big(p(x) - q(x)\big) \log \frac{p(x)}{q(x)} \]

For our application, if D is the set of all documents, let D1 = { D | w1 ∈ D } and D2 = { D | w2 ∈ D }, and let p1 be the distribution of words in D1 and p2 the distribution of words in D2. Use

\[ \mathrm{Statistic}(w_1, w_2) = -I(p_1, p_2) \qquad \forall\, w_1, w_2 \text{ such that } \mathrm{Stem}(w_1) = \mathrm{Stem}(w_2) \]

This measure led to results that were much more interesting, though the benefits came at a computational cost. Note: measures should be taken to appropriately smooth the distributions to handle zero values.

On observing the output, a refinement was made to the algorithm. In order to accentuate the difference between two conceptually unrelated words, the probability distributions were weighted to account for the following:

1. If the entire corpus is from one domain, the total number of documents in which a given word occurs is indicative of the importance it has as a concept in that domain.
2. The frequency of occurrence of a word in a document indicates its value as a concept in that document.

Therefore,

\[ \mathrm{Statistic}(w_1, w_2) = -I(p_1', p_2'), \qquad p_i'(x) = c_i \cdot n_i \cdot p_i(x)\ \ \forall x \]

where c_i is the number of documents in D_i and n_i is the number of times w_i occurs in D_i. This modified algorithm resulted in better output.
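A compact C++ sketch of the symmetrised divergence used above; it assumes the two distributions are defined over the same vocabulary and have already been smoothed so that no entry is zero, as the note requires.

```cpp
#include <cmath>
#include <vector>

// I(p, q) = KL(p, q) + KL(q, p) = sum_x (p(x) - q(x)) * log(p(x) / q(x)).
// Both vectors must be the same length with strictly positive entries.
double symmetricKL(const std::vector<double>& p, const std::vector<double>& q) {
    double total = 0.0;
    for (std::size_t x = 0; x < p.size(); ++x)
        total += (p[x] - q[x]) * std::log(p[x] / q[x]);
    return total;
}
// The suspect score is then Statistic(w1, w2) = -symmetricKL(p1, p2).
```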
Results

A corpus of data from the financial stock market domain was used. The data was pre-processed to remove all non-whitespace and non-alphanumeric characters. Each individual 'piece of data' was considered as one document. This also defined the window for all the considerations in our algorithms.

Stemmer statistics (morphology based stemmer with both inflectional and derivational suffix removal):

Number of documents: 2028
Total number of words: 858711
Number of stems: 16824
Average number of words per conflation class: 1.3838
Speed: 17.45 words per ms (Porter stemmer: 13.09 words per ms)

Performance of the exception-words catching algorithm

The following are tables with the top 25 entries (sorted in decreasing order, i.e. most suspect ones at the top) for the 2 corpus statistics used. The first column is the common stem for the two words in columns 2 and 3. Their score is in column 4.

1) Mutual Information (MI) criterion. Average mutual information value: 0.421218.

Stem        | Word 1        | Word 2          | Score
file        | filings       | files           | 0
philip      | philip        | philips         | 0
feed        | fed           | feed            | 0
arise       | arising       | arise           | 0
favour      | favourable    | favor           | 0
pay         | pays          | payable         | 0
appropriate | appropriately | appropriateness | 0
protect     | protected     | protection      | 0
protect     | protections   | protection      | 0
allow       | allows        | allowed         | 0
allow       | allowing      | allows          | 0
exclude     | excludes      | excluded        | 0
copy        | copied        | copying         | 0
affect      | affect        | affects         | 0
wish        | wishing       | wish            | 0
engage      | engagement    | engaging        | 0
engage      | engagement    | engage          | 0
end         | ending        | ends            | 0
advertise   | advertisers   | advertisement   | 0
observe     | observations  | observers       | 0
rise        | risen         | rises           | 0
effect      | effects       | effected        | 0
addressee   | addressee     | addressees      | 0
address     | addresses     | addressed       | 0
print       | printable     | printed         | 0

2) Modified Kullback-Leibler (KL) distance metric. Average KL distance value: 0.0966257.

Stem      | Word 1     | Word 2      | Score
report    | report     | reporters   | 14.3499
report    | report     | reporter    | 12.7848
report    | report     | reporting   | 12.4487
invest    | investment | invests     | 8.35869
bid       | bid        | bidding     | 8.3376
secure    | securities | secures     | 7.96572
bid       | bid        | bids        | 7.60884
secure    | securities | secured     | 7.54453
say       | said       | saying      | 7.45059
report    | report     | reported    | 7.31222
secure    | securities | securing    | 7.30048
say       | said       | say         | 7.14817
share     | shares     | sharing     | 6.59807
share     | shares     | shared      | 6.555
report    | reports    | report      | 6.50175
share     | share      | sharing     | 6.19935
share     | share      | shared      | 6.17766
invest    | investment | investing   | 6.15977
invest    | investment | invest      | 6.01905
financial | financial  | financially | 5.75508
say       | said       | says        | 5.62843
market    | market     | marketing   | 5.52504
secure    | securities | secure      | 5.43397
inform    | inform     | information | 5.26692
group     | group      | groups      | 4.68804

The following observation can easily be made: inflectional forms of nouns, and more importantly verbs, tend to have high suspect scores, e.g. (protections-protection), (allows-allowed), (said-says), (group-groups). This seems to be more of a problem for the MI measure. The instances of words stemming to 'report' are largely due to the difference between the noun and verb usages of the word. It is however very encouraging to find that the KL metric identified (share, shares) as being conceptually different from the act of 'sharing'. Other interesting entries were:

Word pair                                                       | Position on the MI list | Position on the KL list
project - projected                                             | 44                      | 134
son - sony (Sony Corporation)                                   | -                       | -
bear - bears (Bear Stearns and stock market "bears")            | -                       | -
firm - firmly                                                   | 140                     | 91
exchange - exchangeable (stock exchange and exchangeable bonds) | 171                     | 194

The average KL distance was close to 0 – this is a good sign, because one would expect documents from the same domain to have a very similar "structure".

Future Directions

The different stemming algorithms differ in their approaches and therefore vary widely with respect to their characteristics. [HG96] provides a detailed analysis of the nature of query terms when one algorithm performs significantly better or worse than the others. However, there are other empirical properties of the stemmers that can be used to represent their performance – like the average number of words per conflation class, the ratio of word-set size to stem-set size, and a concept similarity measure between words and their stems. This kind of study could be useful.

The identification of exception words depends entirely on the definition of a good corpus statistic. While it is possible to increase the complexity of this metric beyond what has been proposed earlier (e.g. considering weights for each term, the variance of the KL distance across documents, etc.), it is not completely clear what the right function to use would be.
Looking across different domains with lists produced by different functions could give us a better idea of the nature of this ideal statistic. Though the problem has been approached largely from a Machine Learning/Information Retrieval perspective, it is quite clear that for any significant improvement beyond this point, it is the amount of linguistic information in the model that will be the important factor.

Conclusions

Text processing is a known success story as an application area for Machine Learning algorithms. While this report does reinforce this view, it is also recognised that new and more complex applications will require newer and stronger algorithms. While the basic methodologies can be made capable of handling these situations, the most likely hurdle is going to be the complexity of the resulting variations – both in terms of space and time. One area that needs to be addressed is ways of including side information (linguistic or application domain specific) in our models. These essentially 'hybrid' systems will be more capable of exhibiting "intelligent" behaviour. While the final aim of computer science purists is to eliminate this sort of human expert intervention, it is to be understood that when dealing with a framework as complicated as language, we are far from possessing good enough tools to achieve this purpose.

In this thesis report, practical answers for two text retrieval issues have been detailed. While the chosen implementations are by no means the only alternatives, they were chosen because they suited our particular purpose well. This available choice from a range of algorithms, as well as the adaptability of each, is indicative of the usefulness of Machine Learning based techniques, which is encouraging for a student of this field.

References

[CKPS92] Cutting, D., J. Kupiec, J. Pedersen and P. Sibun. A Practical Part-of-Speech Tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing, pp. 133-140, 1992.
[M02] Megyesi, B. Shallow Parsing with PoS Taggers and Linguistic Features. Journal of Machine Learning Research: Special Issue on Shallow Parsing, JMLR (2): 639-668, MIT Press, 2002.
[NKM01] Nakagawa, T., T. Kudoh and Y. Matsumoto. Unknown Word Guessing and Part-of-Speech Tagging Using Support Vector Machines. In Proceedings of the 6th Natural Language Processing Pacific Rim Symposium (NLPRS2001), 2001.
[R96] Ratnaparkhi, A. A Maximum Entropy Model for Part-of-Speech Tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 133-142, 1996.
[LPP00] Li, X., M. Parizeau and R. Plamondon. Training Hidden Markov Models with Multiple Observations – A Combinatorial Method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4): 371-377, 2000.
[LRS83] Levinson, S., L. Rabiner and M. Sondhi. An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition. Bell Systems Technical Journal, 62: 1035-1074, 1983.
[CX95] Croft, W. B. and J. Xu. Corpus-Specific Stemming using Word Form Co-occurrence. In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval, 1995.
[K93] Krovetz, R. Viewing Morphology as an Inference Process. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 191-202, 1993.
[HG96] Hull, D. A. and G. Grefenstette. A Detailed Analysis of English Stemming Algorithms. Technical Report MLTT-023, Xerox Research and Technology, January 1996.
[H96] Hull, D. A. Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1): 70-84, 1996.
[FZ98] Fuller, M. and J. Zobel. Conflation-based Comparison of Stemming Algorithms. In Proceedings of the Third Australian Document Computing Symposium, Sydney, Australia, 1998.
[DRAF] De Roeck, A. N. and W. Al-Fares. A Morphologically Sensitive Clustering Algorithm for Identifying Arabic Roots.
[KR96] Kraaij, W. and R. Pohlmann. Viewing Stemming as Recall Enhancement. In Proceedings of ACM SIGIR, pp. 40-48, 1996.
[CJVR] van Rijsbergen, C. J. Information Retrieval, Second Edition. http://www.dcs.gla.ac.uk/Keith/Preface.html
[Por] The Porter Stemming Algorithm. http://www.tartarus.org/~martin/PorterStemmer/
[WrdNt] The WordNet Project. http://www.cogsci.princeton.edu/~wn/
[Lin] Lin, J. Divergence Measures Based on the Shannon Entropy.
[CoTh] Cover, T. M. and J. A. Thomas. Elements of Information Theory. John Wiley, 1991.

Appendix Hidden Markov Model (First Order) Formalism

A hidden Markov process is a doubly stochastic process: an underlying process that is hidden from observation, and an observable process determined by the underlying process. The model is characterised by the following elements:

• a set of hidden states S = {S_1, S_2, ..., S_N}, where N is the number of states in the model
• a state transition probability distribution A = {a_ij}, 1 ≤ i, j ≤ N, where

\[ a_{ij} = P[q_{t+1} = S_j \mid q_t = S_i], \qquad 0 \le a_{ij} \le 1, \qquad \sum_{j=1}^{N} a_{ij} = 1 \]

• a set of observation symbols V = {V_1, V_2, ..., V_M}, where M is the number of output symbols
• an observation symbol probability distribution (emission probabilities) B, with b_j(v) = P(v_t = v | s_t = j) for continuous v_t, or b_jk = P(v_t = k | s_t = j) for discrete v_t, where 1 ≤ j ≤ N and 1 ≤ k ≤ M, and

\[ 0 \le b_j(v) \le 1, \quad \sum_{v} b_j(v) = 1 \qquad \text{or} \qquad 0 \le b_{jk} \le 1, \quad \sum_{k=1}^{M} b_{jk} = 1 \]

• an initial state probability distribution Π = {π_i}, 1 ≤ i ≤ N, with

\[ 0 \le \pi_i \le 1, \qquad \sum_{i=1}^{N} \pi_i = 1 \]

The HMM is therefore a set of three parameters, Λ = {A, B, Π}.

[Figure: an example configuration of an HMM with 6 states and 6 output symbols o_1, ..., o_6.]

The Forward-Backward algorithm

Let O = O_1 O_2 ... O_T be an observation sequence, where O_t ∈ V is the observation symbol at time t, and let Q = q_1 q_2 ... q_T be a state sequence, where q_t ∈ S is the state at time t. Given a model Λ and an observation sequence O, the observation evaluation problem P(O|Λ) can be solved using the forward-backward algorithm.

• the forward variable α_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | Λ) can be computed inductively:

initialisation: \[ \alpha_1(i) = \pi_i\, b_i(O_1), \qquad 1 \le i \le N \]
induction: \[ \alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Big] b_j(O_{t+1}), \qquad 1 \le t \le T-1,\ 1 \le j \le N \]

• the backward variable β_t(i) = P(O_{t+1} O_{t+2} ... O_T | q_t = S_i, Λ), with inductive equations:

initialisation: \[ \beta_T(i) = 1, \qquad 1 \le i \le N \]
induction: \[ \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j), \qquad 1 \le t \le T-1,\ 1 \le i \le N \]

• observation evaluation:

\[ P(O \mid \Lambda) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i) \quad \forall t, \qquad \text{and} \qquad P(O \mid \Lambda) = \sum_{i=1}^{N} \alpha_T(i) \]

(Pictures from [LPP00].)

Baum-Welch Training

Define:

• the joint event \[ \xi_t(i,j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \Lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \Lambda)} \]
• the state variable \[ \gamma_t(i) = P(q_t = S_i \mid O, \Lambda) = \sum_{j=1}^{N} \xi_t(i,j) \]
• the parameter updating equations:

1. state transition probability \[ \bar a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad 1 \le i, j \le N \]
2. symbol emission probability \[ \bar b_j(k) = \frac{\sum_{t=1,\, O_t = V_k}^{T-1} \gamma_t(j)}{\sum_{t=1}^{T-1} \gamma_t(j)}, \qquad 1 \le j \le N,\ 1 \le k \le M \]
3. initial state probability \[ \bar\pi_i = \gamma_1(i) \]
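A direct C++ transcription of the forward recursion defined above (without the scaling discussed under Numerical Stability below) might look like this; the names are illustrative.

```cpp
#include <vector>

// Forward algorithm: returns P(O | model) = sum_i alpha_T(i) for an HMM
// with transition matrix a (N*N), emission matrix b (N*M) and start
// probabilities pi (N). Observations are symbol indices into b's columns.
double forwardProbability(const std::vector<std::vector<double>>& a,
                          const std::vector<std::vector<double>>& b,
                          const std::vector<double>& pi,
                          const std::vector<int>& O) {
    if (O.empty()) return 1.0;  // empty sequence: probability 1 by convention
    std::size_t N = pi.size(), T = O.size();
    std::vector<double> alpha(N);
    for (std::size_t i = 0; i < N; ++i)            // initialisation
        alpha[i] = pi[i] * b[i][O[0]];
    for (std::size_t t = 1; t < T; ++t) {          // induction
        std::vector<double> next(N, 0.0);
        for (std::size_t j = 0; j < N; ++j) {
            for (std::size_t i = 0; i < N; ++i)
                next[j] += alpha[i] * a[i][j];
            next[j] *= b[j][O[t]];
        }
        alpha = next;
    }
    double p = 0.0;                                // evaluation
    for (double v : alpha) p += v;
    return p;
}
```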
Training with Multiple Observation Sequences

Consider a set of observation sequences O = {O^(1), O^(2), ..., O^(K)}, where O^(k) = O_1^(k) O_2^(k) ... O_{T_k}^(k), 1 ≤ k ≤ K. Considering all the possible dependencies we can have within this set, in general we have:

\[ P(O \mid \Lambda) = P(O^{(1)} \mid \Lambda)\, P(O^{(2)} \mid O^{(1)}, \Lambda) \cdots P(O^{(K)} \mid O^{(K-1)} \ldots O^{(1)}, \Lambda) \]
\[ P(O \mid \Lambda) = P(O^{(2)} \mid \Lambda)\, P(O^{(3)} \mid O^{(2)}, \Lambda) \cdots P(O^{(1)} \mid O^{(K)} \ldots O^{(2)}, \Lambda) \]
\[ \vdots \]
\[ P(O \mid \Lambda) = P(O^{(K)} \mid \Lambda)\, P(O^{(1)} \mid O^{(K)}, \Lambda) \cdots P(O^{(K-1)} \mid O^{(K)} O^{(K-2)} \ldots O^{(1)}, \Lambda) \]

The multiple observation probability given the model can be expressed as the summation

\[ P(O \mid \Lambda) = \sum_{k=1}^{K} w_k\, P(O^{(k)} \mid \Lambda) \]

where

\[ w_1 = \tfrac{1}{K}\, P(O^{(2)} \mid O^{(1)}, \Lambda) \cdots P(O^{(K)} \mid O^{(K-1)} \ldots O^{(1)}, \Lambda) \]
\[ w_2 = \tfrac{1}{K}\, P(O^{(3)} \mid O^{(2)}, \Lambda) \cdots P(O^{(1)} \mid O^{(K)} \ldots O^{(2)}, \Lambda) \]
\[ \vdots \]
\[ w_K = \tfrac{1}{K}\, P(O^{(1)} \mid O^{(K)}, \Lambda) \cdots P(O^{(K-1)} \mid O^{(K)} O^{(K-2)} \ldots O^{(1)}, \Lambda) \]

are combinatorial weights which control the conditional dependence-independence between the individual observations. In the specific case where the different observations are mutually independent, we have

\[ P(O \mid \Lambda) = \prod_{k=1}^{K} P(O^{(k)} \mid \Lambda) \]

and the combinatorial weights become

\[ w_k = \frac{1}{K}\, \frac{P(O \mid \Lambda)}{P(O^{(k)} \mid \Lambda)}, \qquad 1 \le k \le K \]

Using this, the training equations are:

1. state transition probability \[ \bar a_{ij} = \frac{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \xi_t^{(k)}(i,j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \gamma_t^{(k)}(i)}, \qquad 1 \le i, j \le N \]
2. symbol emission probability \[ \bar b_j(m) = \frac{\sum_{k=1}^{K} \sum_{t=1,\, O_t^{(k)} = V_m}^{T_k} \gamma_t^{(k)}(j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k} \gamma_t^{(k)}(j)}, \qquad 1 \le j \le N,\ 1 \le m \le M \]
3. initial state probability \[ \bar\pi_i = \frac{1}{K} \sum_{k=1}^{K} \gamma_1^{(k)}(i), \qquad 1 \le i \le N \]

Viterbi Decoding

The primary use of a trained HMM is to find a state sequence I = i_1, i_2, ..., i_T such that the probability of the occurrence of the observation sequence O = O_1 O_2 ... O_T from this state sequence is greater than that from any other sequence. The Viterbi algorithm is a dynamic programming procedure used for this purpose: find the state sequence maximising

\[ U(i_1, i_2, \ldots, i_T) = \pi_{i_1}\, b_{i_1}(O_1) \prod_{t=2}^{T} a_{i_{t-1} i_t}\, b_{i_t}(O_t) \]

Define:

• the cumulative weight variable δ, with
initialisation \[ \delta_1(i) = \pi_i\, b_i(O_1) \]
induction \[ \delta_t(j) = \max_{1 \le i \le N} \big[ \delta_{t-1}(i)\, a_{ij} \big]\, b_j(O_t) \]
• the maximum path (trace-back) variable ψ, with
initialisation \[ \psi_1(i) = 0 \]
induction \[ \psi_t(j) = \arg\max_{1 \le i \le N} \big[ \delta_{t-1}(i)\, a_{ij} \big] \]
• termination \[ q_T^* = \arg\max_{1 \le i \le N} \big[ \delta_T(i) \big] \]
• trace back \[ q_t^* = \psi_{t+1}(q_{t+1}^*), \qquad t = T-1, T-2, \ldots, 1 \]

Q* = {q_1*, q_2*, ..., q_T*} is the optimal state sequence.

Numerical Stability

Define

\[ c_t = \Big[ \sum_{i=1}^{N} \alpha_t(i) \Big]^{-1}, \qquad \hat\alpha_t(i) = c_t\, \alpha_t(i), \qquad \hat\beta_t(i) = c_t\, \beta_t(i) \]

and use the scaled variables in the induction equations for the forward and backward variables:

\[ \hat\alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \hat\alpha_t(i)\, a_{ij} \Big] b_j(O_{t+1}), \qquad 1 \le t \le T-1,\ 1 \le j \le N \]
\[ \hat\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \hat\beta_{t+1}(j), \qquad 1 \le t \le T-1,\ 1 \le i \le N \]
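A C++ sketch of the Viterbi recursion above, again without scaling and with illustrative names; it assumes a non-empty observation sequence of symbol indices.

```cpp
#include <algorithm>
#include <vector>

// Viterbi decoding: delta tracks the best path probability into each state,
// psi records the argmax for the trace-back. Returns the optimal sequence.
std::vector<int> viterbi(const std::vector<std::vector<double>>& a,
                         const std::vector<std::vector<double>>& b,
                         const std::vector<double>& pi,
                         const std::vector<int>& O) {
    if (O.empty()) return {};
    std::size_t N = pi.size(), T = O.size();
    std::vector<std::vector<double>> delta(T, std::vector<double>(N, 0.0));
    std::vector<std::vector<int>> psi(T, std::vector<int>(N, 0));
    for (std::size_t i = 0; i < N; ++i)            // initialisation
        delta[0][i] = pi[i] * b[i][O[0]];
    for (std::size_t t = 1; t < T; ++t)            // induction
        for (std::size_t j = 0; j < N; ++j) {
            for (std::size_t i = 0; i < N; ++i) {
                double v = delta[t - 1][i] * a[i][j];
                if (v > delta[t][j]) {
                    delta[t][j] = v;
                    psi[t][j] = static_cast<int>(i);
                }
            }
            delta[t][j] *= b[j][O[t]];
        }
    std::vector<int> q(T);                         // termination
    q[T - 1] = static_cast<int>(
        std::max_element(delta[T - 1].begin(), delta[T - 1].end()) -
        delta[T - 1].begin());
    for (std::size_t t = T - 1; t > 0; --t)        // trace back
        q[t - 1] = psi[t][q[t]];
    return q;
}
```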
