ModErn Text Analysis
META Enumerates Textual Applications
Analyzes documents based on part-of-speech tags instead of words.
#include <ngram_pos_analyzer.h>
Public Member Functions
ngram_pos_analyzer (uint16_t n, std::unique_ptr< token_stream > stream, const std::string &crf_prefix)
Constructor.
ngram_pos_analyzer (const ngram_pos_analyzer &other)
Copy constructor.
virtual void | tokenize (corpus::document &doc) override
Tokenizes a file into a document.
Public Member Functions inherited from meta::util::multilevel_clonable< analyzer, ngram_analyzer, ngram_pos_analyzer >
virtual std::unique_ptr< analyzer > | clone () const
Clones the given object.
Static Public Attributes
static const std::string | id = "ngram-pos"
Identifier for this analyzer.
Private Types
using | base = util::multilevel_clonable< analyzer, ngram_analyzer, ngram_pos_analyzer >
Private Attributes
std::unique_ptr< token_stream > | stream_
The token stream to be used for extracting tokens.
std::shared_ptr< sequence::crf > | crf_
The CRF used to tag the sentences.
const sequence::sequence_analyzer | seq_analyzer_
Generates features for the CRF; const indicates testing mode.
Analyzes documents based on part-of-speech tags instead of words.
The recommended tokenizer for this analyzer is icu-tokenizer with no other filters added. Leaving the token stream unfiltered preserves capitalization and similar surface cues, so that they can be used as features when tagging. For the same reason, function words and stop words should not be removed, and words should not be stemmed.
meta::analyzers::ngram_pos_analyzer::ngram_pos_analyzer (uint16_t n, std::unique_ptr< token_stream > stream, const std::string & crf_prefix)
Constructor.
n | The value of n to use for the ngrams.
stream | The stream to read tokens from.
crf_prefix |
meta::analyzers::ngram_pos_analyzer::ngram_pos_analyzer (const ngram_pos_analyzer & other)
Copy constructor.
other | The other ngram_pos_analyzer to copy from |
virtual void meta::analyzers::ngram_pos_analyzer::tokenize (corpus::document & doc) override
Tokenizes a file into a document.
doc | The document to store the tokenized information in |