NAME
    Text::StemTagPOS

SYNOPSIS
      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'The first sentence. Sentence number two.';
      dump $stemTagger->getStemmedAndTaggedText ($text);

DESCRIPTION
    "Text::StemTagPOS" uses the modules Lingua::Stem::Snowball and
    Lingua::EN::Tagger to do part-of-speech tagging and stemming of English
    text. It was developed to pre-process text for the module
    Text::Categorize::Textrank.

    All text should be encoded in Perl's internal format; see Encode for
    converting text from various encodings to a Perl string.

CONSTRUCTOR
  "new"
    The method "new" creates an instance of the "Text::StemTagPOS" class
    with the following parameters:

    "isoLangCode"
         isoLangCode => 'en'

        "isoLangCode" is the ISO language code of the language that will be
        tagged and stemmed by the object. It must be 'en', which is the
        default; other languages may be added when POS taggers for them are
        added to CPAN.

    "endingSentenceTag"
         endingSentenceTag => 'PP'

        "endingSentenceTag" is the part-of-speech tag from
        Lingua::EN::Tagger that will be used to indicate the end of a
        sentence. The default is 'PP'. The value of "endingSentenceTag"
        must be a tag generated by the module Lingua::EN::Tagger; see
        method "getListOfPartOfSpeechTags" for all the possible tags, which
        are based on the Penn Treebank tagset.

    "listOfPOSTypesToKeep" and/or "listOfPOSTagsToKeep"
         listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]

        The method "getTaggedTextToKeep" uses "listOfPOSTypesToKeep" and
        "listOfPOSTagsToKeep" to build the default list of the
        parts-of-speech to be retained when filtering previously tagged
        text. The default list is "[qw(TEXTRANK_WORDS)]", which is all the
        nouns and adjectives in the text, as used in the textrank
        algorithm. Permitted types for "listOfPOSTypesToKeep" are 'ALL',
        'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION',
        'TEXTRANK_WORDS', and 'VERBS'. "listOfPOSTagsToKeep" provides finer
        control over the parts-of-speech to be retained.
        For a list of all the possible tags see method
        "getListOfPartOfSpeechTags".

METHODS
  "getStemmedAndTaggedText"
     getStemmedAndTaggedText (@Text, $Text, \@Text)

    The method "getStemmedAndTaggedText" returns a hierarchy of array
    references containing the stemmed words, the original words, their
    part-of-speech tags, and their word position indices within the
    original text. The hierarchy is of the form

      [
        [ # sentence level: first sentence.
          [ # word level: first word.
            stemmed word, original word, part-of-speech tag, word index
          ]
          [ # word level: second word.
            stemmed word, original word, part-of-speech tag, word index
          ]
          ...
        ]
        [ # sentence level: second sentence.
          [ # word level: first word.
            stemmed word, original word, part-of-speech tag, word index
          ]
          [ # word level: second word.
            stemmed word, original word, part-of-speech tag, word index
          ]
          ...
        ]
      ]

    Its only parameters are any combination of strings of text as scalars,
    references to scalars, arrays of strings of text, or references to
    arrays of strings of text.

    The examples below show how to call the method; note that the constants
    Text::StemTagPOS::WORD_STEMMED, Text::StemTagPOS::WORD_ORIGINAL,
    Text::StemTagPOS::WORD_POSTAG, and Text::StemTagPOS::WORD_INDEX are used
    to access the information about each word.

      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'The first sentence. Sentence number two.';
      my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
      dump $stemmedTaggedText;
      # $stemmedTaggedText will contain the following:
      # [
      #   [
      #     ["the", "The", "/DET", 0],
      #     ["first", "first", "/JJ", 1],
      #     ["sentenc", "sentence", "/NN", 2],
      #     [".", ".", "/PP", 3],
      #   ],
      #   [
      #     ["sentenc", "Sentence", "/NN", 4],
      #     ["number", "number", "/NN", 5],
      #     ["two", "two", "/CD", 6],
      #     [".", ".", "/PP", 7],
      #   ],
      # ]

      # access the fields of the first word of the first sentence.
      my $word = $stemmedTaggedText->[0][0];
      print
        'WORD_STEMMED: ' . "'" .
        $word->[Text::StemTagPOS::WORD_STEMMED] . "'\n" .
        'WORD_ORIGINAL: ' . "'" .
        $word->[Text::StemTagPOS::WORD_ORIGINAL] . "'\n" .
        'WORD_POSTAG: ' . "'" .
        $word->[Text::StemTagPOS::WORD_POSTAG] . "'\n" .
        'WORD_INDEX: ' .
        $word->[Text::StemTagPOS::WORD_INDEX] . "\n";
      # WORD_STEMMED: 'the'
      # WORD_ORIGINAL: 'The'
      # WORD_POSTAG: '/DET'
      # WORD_INDEX: 0

    The following example shows the various ways the text can be passed to
    the method:

      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'This is a sentence with seven words.';
      dump $stemTagger->getStemmedAndTaggedText ($text, [$text, \$text], ($text, \$text));

  "getTaggedTextToKeep"
     getTaggedTextToKeep (stemmedTaggedText => [...], listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]);

    The method "getTaggedTextToKeep" returns the array references of all
    the words whose part-of-speech tag is of a type specified by
    "listOfPOSTypesToKeep" or "listOfPOSTagsToKeep". The word lists
    returned have the same hierarchical sentence structure used by
    "stemmedTaggedText".

    Note that "listOfPOSTypesToKeep" and "listOfPOSTagsToKeep" are optional
    parameters; if neither is defined, the values used when the object was
    instantiated are used. If one of them is defined, its values override
    the default values.

    "stemmedTaggedText"
         stemmedTaggedText => [...]

        "stemmedTaggedText" is the array reference returned by
        "getStemmedAndTaggedText" or a previous call to
        "getTaggedTextToKeep".

    "listOfPOSTypesToKeep" and/or "listOfPOSTagsToKeep"
         listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]

        "listOfPOSTypesToKeep" and "listOfPOSTagsToKeep" define the list of
        parts-of-speech types to be retained when filtering previously
        tagged text. Permitted values for "listOfPOSTypesToKeep" are 'ALL',
        'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION',
        'TEXTRANK_WORDS', and 'VERBS'. For the possible values of
        "listOfPOSTagsToKeep" see the method "getListOfPartOfSpeechTags".

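    The filtering described above can be pictured with a small stand-alone
    sketch. It does not call "getTaggedTextToKeep" and is not the module's
    implementation: the helper "keepWordsWithTags" is hypothetical, the
    local constants only mirror the documented order of the word fields,
    and the data is copied from the "getStemmedAndTaggedText" example
    output.

```perl
use strict;
use warnings;

# Assumed field positions, mirroring the documented word layout:
# [stemmed word, original word, part-of-speech tag, word index].
use constant { WORD_STEMMED => 0, WORD_ORIGINAL => 1, WORD_POSTAG => 2, WORD_INDEX => 3 };

# Hypothetical helper (not part of Text::StemTagPOS): keep only the words
# whose tag is in $listOfTags, preserving the sentence hierarchy.
sub keepWordsWithTags
{
  my ($stemmedTaggedText, $listOfTags) = @_;
  my %keep = map { ($_ => 1) } @$listOfTags;
  return [
    map { my $sentence = $_; [ grep { $keep{ $_->[WORD_POSTAG] } } @$sentence ] }
      @$stemmedTaggedText
  ];
}

# Data copied from the documented output for
# 'The first sentence. Sentence number two.'
my $stemmedTaggedText = [
  [ ['the', 'The', '/DET', 0], ['first', 'first', '/JJ', 1],
    ['sentenc', 'sentence', '/NN', 2], ['.', '.', '/PP', 3] ],
  [ ['sentenc', 'Sentence', '/NN', 4], ['number', 'number', '/NN', 5],
    ['two', 'two', '/CD', 6], ['.', '.', '/PP', 7] ],
];

# Keep the nouns and adjectives that appear in this sample.
my $kept = keepWordsWithTags ($stemmedTaggedText, ['/NN', '/JJ']);
foreach my $sentence (@$kept)
{
  print join (' ', map { $_->[WORD_ORIGINAL] . '#' . $_->[WORD_INDEX] } @$sentence), "\n";
}
# prints:
# first#1 sentence#2
# Sentence#4 number#5
```

    Each retained word keeps the index it was assigned in the original
    text. A sentence with no matching words would remain as an empty array
    in this sketch; whether the module does the same is not stated here.
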
        Note that "listOfPOSTypesToKeep" and "listOfPOSTagsToKeep" are
        optional parameters; if neither is defined, the values used when
        the object was instantiated are used. If one of them is defined,
        its values override the default values.

      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'This is the first sentence. This is the last sentence.';
      my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
      dump $stemTagger->getTaggedTextToKeep (
        stemmedTaggedText => $stemmedTaggedText);
      # only the nouns and adjectives are retained by default.
      # [
      #   [
      #     ["first", "first", "/JJ", 3],
      #     ["sentenc", "sentence", "/NN", 4],
      #   ],
      #   [
      #     ["last", "last", "/JJ", 9],
      #     ["sentenc", "sentence", "/NN", 10],
      #   ],
      # ]

  "getWordsPhrasesInTaggedText"
     getWordsPhrasesInTaggedText (stemmedTaggedText => ..., listOfPhrasesToFind => [...], listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]);

    The method "getWordsPhrasesInTaggedText" returns a reference to an
    array in which each entry corresponds to a word or phrase in
    "listOfPhrasesToFind", in the same order. The value of each entry is a
    list of the locations where that word or phrase was found. Each
    location is an integer pair of the form [first-word-index,
    last-word-index], where first-word-index is the index of the first word
    of the phrase and last-word-index is the index of the last word. The
    index values are those assigned to the stemmed and tagged words in
    "stemmedTaggedText".

      [
        [ # first phrase locations
          [first word index, last word index],
          [first word index, last word index],
          ...
        ]
        [ # second phrase locations
          [first word index, last word index],
          [first word index, last word index],
          ...
        ]
        ...
      ]

    "stemmedTaggedText"
         stemmedTaggedText => [...]

        "stemmedTaggedText" is the array reference returned by
        "getStemmedAndTaggedText" or "getTaggedTextToKeep".

    "listOfPhrasesToFind"
         listOfPhrasesToFind => [...]

        "listOfPhrasesToFind" is an array reference containing a list of
        strings of text that are either single words or phrases that are
        to be located in the text provided by "stemmedTaggedText". Before
        the words or phrases are located they are filtered using
        "listOfPOSTypesToKeep" or "listOfPOSTagsToKeep".

    "listOfPOSTypesToKeep" and/or "listOfPOSTagsToKeep"
         listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]

        "listOfPOSTypesToKeep" and "listOfPOSTagsToKeep" define the list of
        parts-of-speech types to be retained when filtering previously
        tagged text. Permitted values for "listOfPOSTypesToKeep" are 'ALL',
        'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION',
        'TEXTRANK_WORDS', and 'VERBS'. For the possible values of
        "listOfPOSTagsToKeep" see the method "getListOfPartOfSpeechTags".

        Note that "listOfPOSTypesToKeep" and "listOfPOSTagsToKeep" are
        optional parameters; if neither is defined, the values used when
        the object was instantiated are used. If one of them is defined,
        its values override the default values.

    The code below illustrates the output format:

      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'This is the first sentence. This is the last sentence.';
      my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
      dump $stemmedTaggedText;
      my $listOfWordsOrPhrasesToFind = ['first sentence', 'this is', 'third sentence', 'sentence'];
      # keep all parts-of-speech so that every word can be matched.
      my $phraseLocations = $stemTagger->getWordsPhrasesInTaggedText (
        listOfPOSTypesToKeep => [qw(ALL)],
        stemmedTaggedText => $stemmedTaggedText,
        listOfPhrasesToFind => $listOfWordsOrPhrasesToFind);
      dump $phraseLocations;
      # [
      #   [[3, 4]],            # 'first sentence'
      #   [[0, 1], [6, 7]],    # 'this is': note period in text has index 5.
      #   [],                  # 'third sentence'
      #   [[4, 4], [10, 10]]   # 'sentence'
      # ]

  "getListOfPartOfSpeechTags"
    The method "getListOfPartOfSpeechTags" takes no parameters.
    It returns an array reference where each item in the list is of the
    form "[part of speech tag, description, examples]". It is meant for
    getting the part-of-speech tags that can be used to populate
    "listOfPOSTagsToKeep".

      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      dump $stemTagger->getListOfPartOfSpeechTags;

  "getListOfStemmedWordsInText"
    The method "getListOfStemmedWordsInText" returns an array reference of
    the sorted stemmed words in the text given by "stemmedTaggedText".

    "stemmedTaggedText"
         stemmedTaggedText => [...]

        "stemmedTaggedText" is the array reference returned by
        "getStemmedAndTaggedText" or "getTaggedTextToKeep" for the text.

      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'The first sentence. Sentence number two.';
      my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
      dump $stemTagger->getListOfStemmedWordsInText (stemmedTaggedText => $stemmedTaggedText);

  "getListOfStemmedWordsInAllDocuments"
    The method "getListOfStemmedWordsInAllDocuments" returns an array
    reference of the sorted stemmed words that are common to all the
    documents given by "listOfStemmedTaggedText", that is, the intersection
    of their words.

    "listOfStemmedTaggedText"
         listOfStemmedTaggedText => [...]

        "listOfStemmedTaggedText" is a list of document references returned
        by "getStemmedAndTaggedText" or "getTaggedTextToKeep".

INSTALLATION
    To install the module run the following commands:

      perl Makefile.PL
      make
      make test
      make install

    If you are on a Windows system you should use 'nmake' rather than
    'make'.

AUTHOR
    Jeff Kubina <jeff.kubina@gmail.com>

COPYRIGHT
    Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is
    free software; you can redistribute it and/or modify it under the same
    terms as Perl itself.

    The full text of the license can be found in the LICENSE file included
    with this module.

KEYWORDS
    natural language processing, NLP, part of speech tagging, POS, stemming

SEE ALSO
    Encode, perlunicode, Lingua::Stem::Snowball, Lingua::EN::Tagger,
    Text::Iconv, Text::Categorize::Textrank, utf8