ModErn Text Analysis
META Enumerates Textual Applications
meta::analyzers::ngram_pos_analyzer Class Reference

Analyzes documents based on part-of-speech tags instead of words.

#include <ngram_pos_analyzer.h>

Inheritance diagram for meta::analyzers::ngram_pos_analyzer:
meta::util::multilevel_clonable< analyzer, ngram_analyzer, ngram_pos_analyzer >

Public Member Functions

 ngram_pos_analyzer (uint16_t n, std::unique_ptr< token_stream > stream, const std::string &crf_prefix)
 Constructor.
 
 ngram_pos_analyzer (const ngram_pos_analyzer &other)
 Copy constructor.
 
virtual void tokenize (corpus::document &doc) override
 Tokenizes a file into a document.
 
Public Member Functions inherited from meta::util::multilevel_clonable< analyzer, ngram_analyzer, ngram_pos_analyzer >
virtual std::unique_ptr< analyzer > clone () const
 Clones the given object.
 

Static Public Attributes

static const std::string id = "ngram-pos"
 Identifier for this analyzer.
 

Private Types

using base = util::multilevel_clonable< analyzer, ngram_analyzer, ngram_pos_analyzer >
 

Private Attributes

std::unique_ptr< token_stream > stream_
 The token stream to be used for extracting tokens.
 
std::shared_ptr< sequence::crf > crf_
 The CRF used to tag the sentences.
 
const sequence::sequence_analyzer seq_analyzer_
 Generates features for the CRF; const indicates testing mode.
 

Detailed Description

Analyzes documents based on part-of-speech tags instead of words.

The recommended tokenizer for this analyzer is icu-tokenizer with no other filters added, so that capitalization and similar surface cues remain available as features. For the same reason, function words and stop words should not be removed, and words should not be stemmed.
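
To illustrate what "n-grams of part-of-speech tags" means, the following is a minimal sketch and not the analyzer's actual implementation. It assumes the tags have already been produced by some tagger; the helper name `count_pos_ngrams` and the `_` separator used to join tags are hypothetical choices made for this example only.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper, not part of MeTA: counts n-grams over a sequence of
// part-of-speech tags that a tagger has already produced.
std::map<std::string, std::size_t>
count_pos_ngrams(const std::vector<std::string>& tags, std::size_t n)
{
    std::map<std::string, std::size_t> counts;
    if (n == 0 || tags.size() < n)
        return counts; // no complete n-gram fits in the sequence

    for (std::size_t i = 0; i + n <= tags.size(); ++i)
    {
        // Join n consecutive tags with '_' (an assumed separator) to form
        // a single feature name.
        std::string feature = tags[i];
        for (std::size_t j = 1; j < n; ++j)
            feature += "_" + tags[i + j];
        ++counts[feature];
    }
    return counts;
}
```

For example, calling `count_pos_ngrams` with the tag sequence `DT NN VBZ DT NN` and `n = 2` yields three distinct features, with `DT_NN` counted twice. Because the features depend only on the tags, the quality of the tagging matters, which is why the unfiltered tokenizer is recommended above.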

Constructor & Destructor Documentation

meta::analyzers::ngram_pos_analyzer::ngram_pos_analyzer ( uint16_t  n,
std::unique_ptr< token_stream >  stream,
const std::string &  crf_prefix 
)

Constructor.

Parameters
n  The value of n to use for the n-grams.
stream  The stream to read tokens from.
crf_prefix

meta::analyzers::ngram_pos_analyzer::ngram_pos_analyzer ( const ngram_pos_analyzer &  other )

Copy constructor.

Parameters
other  The other ngram_pos_analyzer to copy from.

Member Function Documentation

virtual void meta::analyzers::ngram_pos_analyzer::tokenize ( corpus::document &  doc )
override, virtual

Tokenizes a file into a document.

Parameters
doc  The document to store the tokenized information in.
