NAME
    Text::StemTagPOS - part-of-speech tagging and stemming of English text.

SYNOPSIS
      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'The first sentence. Sentence number two.';
      dump $stemTagger->getStemmedAndTaggedText ($text);

DESCRIPTION
    "Text::StemTagPOS" uses the modules Lingua::Stem::Snowball and
    Lingua::EN::Tagger to do part-of-speech tagging and stemming of English
    text. It was developed to pre-process text for the module
    Text::Categorize::Textrank. Encoding of all text should be in Perl's
    internal format; see Encode for converting text from various encodings to
    a Perl string.
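
    As a minimal sketch of that conversion, the core Encode module can
    decode raw bytes into a Perl string before the text is tagged. The
    sample bytes and the choice of UTF-8 are assumptions made for
    illustration; use whatever encoding the input actually has.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes, as they might arrive from a file or socket (sample data).
my $bytes = "na\xc3\xafve caf\xc3\xa9";

# Decode the bytes into a Perl character string; this is the form that
# should be passed on for tagging and stemming.
my $text = decode('UTF-8', $bytes);

# Each accented letter takes two bytes in UTF-8 but is one character.
printf "%d bytes, %d characters\n", length($bytes), length($text);
# 12 bytes, 10 characters
```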

CONSTRUCTOR
  "new"
    The method "new" creates an instance of the "Text::StemTagPOS" class
    with the following parameters:

    "isoLangCode"
         isoLangCode => 'en'

        "isoLangCode" is the ISO language code of the language that will be
        tagged and stemmed by the object. It must be 'en', which is the
        default; other languages may be added when POS taggers for them are
        added to CPAN.

    "endingSentenceTag"
         endingSentenceTag => 'PP'

        "endingSentenceTag" is the part-of-speech tag from
        Lingua::EN::Tagger that will be used to indicate the end of a
        sentence. The default is 'PP'. The value of "endingSentenceTag" must
        be a tag generated by the module Lingua::EN::Tagger; see the method
        "getListOfPartOfSpeechTags" for all the possible tags, which are
        based on the Penn Treebank tagset.

    "listOfPOSTypesToKeep" and/or "listOfPOSTagsToKeep"
         listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]

        The method "getTaggedTextToKeep" uses "listOfPOSTypesToKeep" and
        "listOfPOSTagsToKeep" to build the default list of the
        parts-of-speech to be retained when filtering previously tagged
        text. The default list is "[qw(TEXTRANK_WORDS)]", which is all the
        nouns and adjectives in the text, as used in the textrank algorithm.
        Permitted types for "listOfPOSTypesToKeep" are 'ALL', 'ADJECTIVES',
        'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION',
        'TEXTRANK_WORDS', and 'VERBS'. "listOfPOSTagsToKeep" provides finer
        control over the parts-of-speech to be retained. For a list of all
        the possible tags see method "getListOfPartOfSpeechTags".

METHODS
  "getStemmedAndTaggedText"
     getStemmedAndTaggedText (@Text, $Text, \@Text)

    The method "getStemmedAndTaggedText" returns a hierarchy of array
    references containing the stemmed words, the original words, their
    part-of-speech tag, and their word position index within the original
    text. The hierarchy is of the form

      [
        [ # sentence level: first sentence.
          [ # word level: first word.
            stemmed word, original word, part-of-speech tag, word index
          ]
          [ # word level: second word.
            stemmed word, original word, part-of-speech tag, word index
          ]
          ...
        ]
        [ # sentence level: second sentence.
          [ # word level: first word.
            stemmed word, original word, part-of-speech tag, word index
          ]
          [ # word level: second word.
            stemmed word, original word, part-of-speech tag, word index
          ]
          ...
        ]
      ]

    Its parameters are any combination of strings of text passed as
    scalars, references to scalars, arrays of strings, or references to
    arrays of strings. The examples below show the various ways to call the
    method. Note that the constants Text::StemTagPOS::WORD_STEMMED,
    Text::StemTagPOS::WORD_ORIGINAL, Text::StemTagPOS::WORD_POSTAG, and
    Text::StemTagPOS::WORD_INDEX are used to access the information about
    each word.

      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'The first sentence. Sentence number two.';
      my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
      dump $stemmedTaggedText;

      # $stemmedTaggedText will contain the following:
      # [
      #   [
      #     ["the", "The", "/DET", 0],
      #     ["first", "first", "/JJ", 1],
      #     ["sentenc", "sentence", "/NN", 2],
      #     [".", ".", "/PP", 3],
      #   ],
      #   [
      #     ["sentenc", "Sentence", "/NN", 4],
      #     ["number", "number", "/NN", 5],
      #     ["two", "two", "/CD", 6],
      #     [".", ".", "/PP", 7],
      #   ],
      # ]

      my $word = $stemmedTaggedText->[0][0];
      print
        'WORD_STEMMED: ' .
        "'" . $word->[Text::StemTagPOS::WORD_STEMMED] . "'\n" .
        'WORD_ORIGINAL: ' .
        "'" . $word->[Text::StemTagPOS::WORD_ORIGINAL] . "'\n" .
        'WORD_POSTAG: ' .
        "'" . $word->[Text::StemTagPOS::WORD_POSTAG] . "'\n" .
        'WORD_INDEX: ' .
        $word->[Text::StemTagPOS::WORD_INDEX] . "\n";

      # WORD_STEMMED: 'the'
      # WORD_ORIGINAL: 'The'
      # WORD_POSTAG: '/DET'
      # WORD_INDEX: 0

    The following example shows the various ways the text can be passed to
    the method:

      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'This is a sentence with seven words.';
      dump $stemTagger->getStemmedAndTaggedText ($text,
        [$text, \$text], ($text, \$text));
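
    The returned hierarchy can be walked with two nested loops, one over
    sentences and one over words. The sketch below runs without the module
    installed by hard-coding the structure shown in the first example; the
    local constants STEMMED, ORIGINAL, POSTAG, and INDEX are stand-ins
    invented for this sketch and simply follow the documented order of the
    fields. With the module loaded, the Text::StemTagPOS::WORD_* constants
    would be used instead.

```perl
use strict;
use warnings;

# Stand-in positions following the documented field order
# [stemmed word, original word, part-of-speech tag, word index].
use constant { STEMMED => 0, ORIGINAL => 1, POSTAG => 2, INDEX => 3 };

# The structure shown above for 'The first sentence. Sentence number two.'
my $stemmedTaggedText = [
  [ ['the', 'The', '/DET', 0], ['first', 'first', '/JJ', 1],
    ['sentenc', 'sentence', '/NN', 2], ['.', '.', '/PP', 3] ],
  [ ['sentenc', 'Sentence', '/NN', 4], ['number', 'number', '/NN', 5],
    ['two', 'two', '/CD', 6], ['.', '.', '/PP', 7] ],
];

# Rebuild each sentence as a line of "original/TAG" tokens; the tag
# strings already begin with a slash.
my @lines;
foreach my $sentence (@$stemmedTaggedText) {
  my @tokens;
  foreach my $word (@$sentence) {
    push @tokens, $word->[ORIGINAL] . $word->[POSTAG];
  }
  push @lines, join(' ', @tokens);
}
print "$_\n" for @lines;
# The/DET first/JJ sentence/NN ./PP
# Sentence/NN number/NN two/CD ./PP
```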

  "getTaggedTextToKeep"
     getTaggedTextToKeep (stemmedTaggedText => [...],
      listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]);

    The method "getTaggedTextToKeep" returns the array references of the
    words whose part-of-speech tag is of a type listed in
    "listOfPOSTypesToKeep" or is a tag listed in "listOfPOSTagsToKeep". The
    word lists returned have the same hierarchical sentence structure used
    by "stemmedTaggedText". Both "listOfPOSTypesToKeep" and
    "listOfPOSTagsToKeep" are optional. If neither is defined, the values
    used when the object was instantiated are used; if either is defined,
    its values override those defaults.

    "stemmedTaggedText"
         stemmedTaggedText => [...]

        "stemmedTaggedText" is the array reference returned by
        "getStemmedAndTaggedText" or a previous call to
        "getTaggedTextToKeep".

    "listOfPOSTypesToKeep" and/or "listOfPOSTagsToKeep"
         listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]

        "listOfPOSTypesToKeep" and "listOfPOSTagsToKeep" define the list of
        parts-of-speech types to be retained when filtering previously
        tagged text. Permitted values for "listOfPOSTypesToKeep" are
        'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS',
        'PUNCTUATION', 'TEXTRANK_WORDS', and 'VERBS'. For the possible values
        of "listOfPOSTagsToKeep" see the method "getListOfPartOfSpeechTags".
        Both parameters are optional. If neither is defined, the values used
        when the object was instantiated are used; if either is defined, its
        values override those defaults.

      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'This is the first sentence. This is the last sentence.';
      my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
      dump $stemTagger->getTaggedTextToKeep (
        stemmedTaggedText => $stemmedTaggedText);
      # only the nouns and adjectives are retained by default.
      # [
      #   [
      #     ["first", "first", "/JJ", 3],
      #     ["sentenc", "sentence", "/NN", 4],
      #   ],
      #   [
      #     ["last", "last", "/JJ", 9],
      #     ["sentenc", "sentence", "/NN", 10],
      #   ],
      # ]
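
    The filtering idea can be sketched in plain Perl. This is not the
    module's implementation: keepTags is a hypothetical helper, the data is
    the structure shown for "getStemmedAndTaggedText", and the explicit tag
    list '/JJ' and '/NN' is an assumption chosen for illustration (this
    document does not state which tags each type such as 'TEXTRANK_WORDS'
    expands to).

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the words whose tag (field 2 of each
# word entry) is in @$tagsToKeep, preserving the sentence structure.
sub keepTags {
  my ($stemmedTaggedText, $tagsToKeep) = @_;
  my %keep;
  $keep{$_} = 1 for @$tagsToKeep;
  my @filtered;
  foreach my $sentence (@$stemmedTaggedText) {
    push @filtered, [ grep { $keep{ $_->[2] } } @$sentence ];
  }
  return \@filtered;
}

# Structure documented for 'The first sentence. Sentence number two.'
my $stemmedTaggedText = [
  [ ['the', 'The', '/DET', 0], ['first', 'first', '/JJ', 1],
    ['sentenc', 'sentence', '/NN', 2], ['.', '.', '/PP', 3] ],
  [ ['sentenc', 'Sentence', '/NN', 4], ['number', 'number', '/NN', 5],
    ['two', 'two', '/CD', 6], ['.', '.', '/PP', 7] ],
];

my $kept = keepTags($stemmedTaggedText, ['/JJ', '/NN']);

# Print the original words that survived, one sentence per line.
foreach my $sentence (@$kept) {
  print join(' ', map { $_->[1] } @$sentence), "\n";
}
# first sentence
# Sentence number
```

    Word indices are left untouched by the filter, so a retained word can
    still be located in the original text.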

  "getWordsPhrasesInTaggedText"
     getWordsPhrasesInTaggedText (stemmedTaggedText => ...,
        listOfPhrasesToFind => [...],  listOfPOSTypesToKeep => [...],
        listOfPOSTagsToKeep => [...]);

    The method "getWordsPhrasesInTaggedText" returns a reference to an array
    where each entry in the array corresponds to the word or phrase in
    "listOfPhrasesToFind". The value of each entry is a list of word indices
    where the words or phrases were found. Each list contains integer pairs
    of the form [first-word-index, last-word-index] where first-word-index
    is the index to the first word of the phrase and last-word-index the
    index of the last word. The values of the index are those assigned to
    the stemmed and tagged word in "stemmedTaggedText".

      [
        [ # first phrase locations
          [first word index, last word index],
          [first word index, last word index], ...
        ]
        [ # second phrase locations
          [first word index, last word index],
          [first word index, last word index], ...
        ]
        ...
      ]

    "stemmedTaggedText"
         stemmedTaggedText => [...]

        "stemmedTaggedText" is the array reference returned by
        "getStemmedAndTaggedText" or "getTaggedTextToKeep".

    "listOfPhrasesToFind"
         listOfPhrasesToFind => [...]

        "listOfPhrasesToFind" is an array reference containing a list of
        strings of text that are either single words or phrases that are to
        be located in the text provided by "stemmedTaggedText". Before the
        words or phrases are located they are filtered using
        "listOfPOSTypesToKeep" or "listOfPOSTagsToKeep".

    "listOfPOSTypesToKeep" and/or "listOfPOSTagsToKeep"
         listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]

        "listOfPOSTypesToKeep" and "listOfPOSTagsToKeep" define the list of
        parts-of-speech types to be retained when filtering previously
        tagged text. Permitted values for "listOfPOSTypesToKeep" are
        'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS',
        'PUNCTUATION', 'TEXTRANK_WORDS', and 'VERBS'. For the possible values
        of "listOfPOSTagsToKeep" see the method "getListOfPartOfSpeechTags".
        Both parameters are optional. If neither is defined, the values used
        when the object was instantiated are used; if either is defined, its
        values override those defaults.

    The code below illustrates the output format:

      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'This is the first sentence. This is the last sentence.';
      my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
      dump $stemmedTaggedText;
      my $listOfWordsOrPhrasesToFind = ['first sentence','this is',
        'third sentence', 'sentence'];
      my $phraseLocations = $stemTagger->getWordsPhrasesInTaggedText (
        listOfPOSTypesToKeep => [qw(ALL)],
        stemmedTaggedText => $stemmedTaggedText,
        listOfPhrasesToFind => $listOfWordsOrPhrasesToFind);
      dump $phraseLocations;
      # [
      #   [[3, 4]],           # 'first sentence'
      #   [[0, 1], [6, 7]],   # 'this is': note period in text has index 5.
      #   [],                 # 'third sentence'
      #   [[4, 4], [10, 10]]  # 'sentence'
      # ]
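
    To make the [first-word-index, last-word-index] format concrete, the
    sketch below locates phrases with a naive sliding-window match over the
    lowercased original words. It is an illustration only, not the module's
    algorithm: findPhrase is a hypothetical helper, no part-of-speech
    filtering is applied, and the data is the structure documented for the
    text 'The first sentence. Sentence number two.'.

```perl
use strict;
use warnings;

my $stemmedTaggedText = [
  [ ['the', 'The', '/DET', 0], ['first', 'first', '/JJ', 1],
    ['sentenc', 'sentence', '/NN', 2], ['.', '.', '/PP', 3] ],
  [ ['sentenc', 'Sentence', '/NN', 4], ['number', 'number', '/NN', 5],
    ['two', 'two', '/CD', 6], ['.', '.', '/PP', 7] ],
];

# Flatten the sentences into [word index, lowercased original word] pairs.
my @words;
foreach my $sentence (@$stemmedTaggedText) {
  foreach my $word (@$sentence) {
    push @words, [ $word->[3], lc $word->[1] ];
  }
}

# Hypothetical helper: return every [first index, last index] pair at
# which the words of $phrase occur consecutively in @$words.
sub findPhrase {
  my ($words, $phrase) = @_;
  my @target = split ' ', lc $phrase;
  my @found;
  return \@found unless @target;
  for my $start (0 .. $#{$words} - $#target) {
    my $matches = 1;
    for my $offset (0 .. $#target) {
      if ($words->[$start + $offset][1] ne $target[$offset]) {
        $matches = 0;
        last;
      }
    }
    push @found, [ $words->[$start][0], $words->[$start + $#target][0] ]
      if $matches;
  }
  return \@found;
}

my @phrases = ('first sentence', 'third sentence', 'sentence');
my @locations = map { findPhrase(\@words, $_) } @phrases;

foreach my $i (0 .. $#phrases) {
  my $pairs = join ', ', map { sprintf('[%d, %d]', @$_) } @{ $locations[$i] };
  print "'$phrases[$i]': [$pairs]\n";
}
# 'first sentence': [[1, 2]]
# 'third sentence': []
# 'sentence': [[2, 2], [4, 4]]
```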

  "getListOfPartOfSpeechTags"
    The method "getListOfPartOfSpeechTags" takes no parameters. It returns
    an array reference where each item in the list is of the form "[part of
    speech tag, description, examples]". It is meant for getting the
    part-of-speech tags that can be used to populate "listOfPOSTagsToKeep".

      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      dump $stemTagger->getListOfPartOfSpeechTags;

  "getListOfStemmedWordsInText"
    The method "getListOfStemmedWordsInText" returns an array reference of
    the sorted stemmed words in the text given by "stemmedTaggedText".

    "stemmedTaggedText"
         stemmedTaggedText => [...]

        "stemmedTaggedText" is the array reference returned by
        "getStemmedAndTaggedText" or "getTaggedTextToKeep" of the text.

      use Text::StemTagPOS;
      use Data::Dump qw(dump);
      my $stemTagger = Text::StemTagPOS->new;
      my $text = 'The first sentence. Sentence number two.';
      my $stemmedTaggedText = $stemTagger->getStemmedAndTaggedText ($text);
      dump $stemTagger->getListOfStemmedWordsInText (
        stemmedTaggedText => $stemmedTaggedText);

  "getListOfStemmedWordsInAllDocuments"
    The method "getListOfStemmedWordsInAllDocuments" returns an array
    reference of the sorted stemmed words of the intersection of all the
    words in the documents given by "listOfStemmedTaggedText".

    "listOfStemmedTaggedText"
         listOfStemmedTaggedText => [...]

        "listOfStemmedTaggedText" is a list of document references returned
        by "getStemmedAndTaggedText" or "getTaggedTextToKeep".
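
    The intersection described above can be sketched in plain Perl as
    follows. This is an illustration of the idea rather than the module's
    code: commonStems is a hypothetical helper, and the two one-sentence
    "documents" are hand-built sample data in the documented shape, reusing
    the words from the "getStemmedAndTaggedText" example.

```perl
use strict;
use warnings;

# Hand-built sample documents in the documented shape.
my $doc1 = [
  [ ['the', 'The', '/DET', 0], ['first', 'first', '/JJ', 1],
    ['sentenc', 'sentence', '/NN', 2], ['.', '.', '/PP', 3] ],
];
my $doc2 = [
  [ ['sentenc', 'Sentence', '/NN', 0], ['number', 'number', '/NN', 1],
    ['two', 'two', '/CD', 2], ['.', '.', '/PP', 3] ],
];

# Hypothetical helper: return the sorted stems that occur in every
# document of @$documents.
sub commonStems {
  my ($documents) = @_;
  my %count;
  foreach my $document (@$documents) {
    my %seen;    # count each stem at most once per document
    foreach my $sentence (@$document) {
      $seen{ $_->[0] } = 1 for @$sentence;
    }
    $count{$_}++ for keys %seen;
  }
  my $total  = scalar @$documents;
  my @common = grep { $count{$_} == $total } keys %count;
  return [ sort @common ];
}

my $common = commonStems([$doc1, $doc2]);
print join(' ', @$common), "\n";
# . sentenc
```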

INSTALLATION
    To install the module run the following commands:

      perl Makefile.PL
      make
      make test
      make install

    If you are on a Windows box you should use 'nmake' rather than 'make'.

AUTHOR
     Jeff Kubina <jeff.kubina@gmail.com>

COPYRIGHT
    Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is
    free software; you can redistribute it and/or modify it under the same
    terms as Perl itself.

    The full text of the license can be found in the LICENSE file included
    with this module.

KEYWORDS
    natural language processing, NLP, part of speech tagging, POS, stemming

SEE ALSO
    Encode, perlunicode, Lingua::Stem::Snowball, Lingua::EN::Tagger,
    Text::Iconv, Text::Categorize::Textrank, utf8