ModErn Text Analysis
META Enumerates Textual Applications
A filter that normalizes text to match Penn Treebank conventions.
#include <ptb_normalizer.h>
Public Member Functions | |
ptb_normalizer (std::unique_ptr< token_stream > source) | |
Constructs a ptb_normalizer which reads tokens from the given source. | |
ptb_normalizer (const ptb_normalizer &other) | |
Copy constructor. | |
void | set_content (const std::string &content) override |
Sets the content for the beginning of the filter chain. | |
std::string | next () override |
Obtains the next token in the sequence. | |
operator bool () const override | |
Determines whether there are more tokens available in the stream. | |
Public Member Functions inherited from meta::util::multilevel_clonable< Root, Base, Derived > | |
virtual std::unique_ptr< Root > | clone () const |
Clones the given object. | |
Static Public Attributes | |
static const std::string | id = "ptb-normalizer" |
Identifier for this filter. | |
Private Member Functions | |
std::string | current_token () |
void | parse_token (const std::string &token) |
Performs token normalization, splitting, etc. | |
Private Attributes | |
std::unique_ptr< token_stream > | source_ |
The source to read tokens from. | |
std::deque< std::string > | tokens_ |
Buffered tokens to return. | |
A filter that normalizes text to match Penn Treebank conventions.
This normalization is an important preprocessing step for input to POS taggers and parsers that were trained on Penn Treebank-formatted data.
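As an illustration of how a filter of this kind sits in a token stream chain, here is a minimal, self-contained sketch. It does not use the MeTA headers: `token_stream` below is a hypothetical stand-in that mirrors only the three members listed on this page (`set_content`, `next`, `operator bool`), and `whitespace_tokenizer`, `bracket_normalizer`, and `collect` are names invented for the example. The only normalization shown is the Penn Treebank convention of writing round brackets as `-LRB-` and `-RRB-`; the full set of rules applied by `ptb_normalizer` is not described here and should not be inferred from this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the token_stream interface (not the MeTA header).
class token_stream
{
  public:
    virtual ~token_stream() = default;
    virtual void set_content(const std::string& content) = 0;
    virtual std::string next() = 0;
    virtual operator bool() const = 0;
};

// Hypothetical source at the start of the chain: splits on whitespace.
class whitespace_tokenizer : public token_stream
{
  public:
    void set_content(const std::string& content) override
    {
        tokens_.clear();
        idx_ = 0;
        std::istringstream in{content};
        for (std::string tok; in >> tok;)
            tokens_.push_back(tok);
    }
    std::string next() override { return tokens_.at(idx_++); }
    operator bool() const override { return idx_ < tokens_.size(); }

  private:
    std::vector<std::string> tokens_;
    std::size_t idx_ = 0;
};

// Hypothetical filter: rewrites round brackets the Penn Treebank way.
class bracket_normalizer : public token_stream
{
  public:
    explicit bracket_normalizer(std::unique_ptr<token_stream> source)
        : source_{std::move(source)}
    {
        assert(source_ != nullptr);
    }
    // Content is forwarded to the beginning of the chain.
    void set_content(const std::string& content) override
    {
        source_->set_content(content);
    }
    std::string next() override
    {
        auto tok = source_->next();
        if (tok == "(") return "-LRB-";
        if (tok == ")") return "-RRB-";
        return tok;
    }
    operator bool() const override { return static_cast<bool>(*source_); }

  private:
    std::unique_ptr<token_stream> source_;
};

// Helper: feed text into a chain and drain every token from it.
std::vector<std::string> collect(token_stream& stream, const std::string& text)
{
    stream.set_content(text);
    std::vector<std::string> out;
    while (stream)
        out.push_back(stream.next());
    return out;
}
```

A chain is built by wrapping the tokenizer in the filter, for example `std::make_unique<bracket_normalizer>(std::make_unique<whitespace_tokenizer>())`, and then passing the result to `collect` together with the input text.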
meta::analyzers::filters::ptb_normalizer::ptb_normalizer | ( | std::unique_ptr< token_stream > | source | ) |
Constructs a ptb_normalizer which reads tokens from the given source.
source | The source to construct the filter from |
meta::analyzers::filters::ptb_normalizer::ptb_normalizer | ( | const ptb_normalizer & | other | ) |
Copy constructor.
other | The ptb_normalizer to copy into this one |
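Because `source_` is a `std::unique_ptr`, which cannot itself be copied, a copy constructor for a filter like this has to produce a new upstream object rather than share the pointer. The sketch below shows one common way to do that, by calling a virtual `clone()` on the pointee. Whether `ptb_normalizer` does exactly this is an assumption; `source`, `fixed_source`, and `wrapping_filter` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical upstream type; only clone() matters for this example.
class source
{
  public:
    virtual ~source() = default;
    virtual std::unique_ptr<source> clone() const = 0;
    virtual std::string name() const = 0;
};

class fixed_source : public source
{
  public:
    explicit fixed_source(std::string n) : name_{std::move(n)} {}
    std::unique_ptr<source> clone() const override
    {
        return std::make_unique<fixed_source>(*this);
    }
    std::string name() const override { return name_; }

  private:
    std::string name_;
};

// Hypothetical filter that owns its upstream source exclusively.
class wrapping_filter
{
  public:
    explicit wrapping_filter(std::unique_ptr<source> src)
        : source_{std::move(src)}
    {
        assert(source_ != nullptr);
    }
    // Deep copy: clone the pointee so each filter owns its own source.
    wrapping_filter(const wrapping_filter& other)
        : source_{other.source_->clone()}
    {
    }
    const source& upstream() const { return *source_; }

  private:
    std::unique_ptr<source> source_;
};
```

With this design, copying a filter yields an independent chain: the copy has the same upstream state but a distinct upstream object.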
void meta::analyzers::filters::ptb_normalizer::set_content | ( | const std::string & | content | ) | override |
Sets the content for the beginning of the filter chain.
content | The string content to set |
std::string meta::analyzers::filters::ptb_normalizer::current_token | ( | ) | private |
void meta::analyzers::filters::ptb_normalizer::parse_token | ( | const std::string & | token | ) | private |
Performs token normalization, splitting, etc.
The token(s) are placed on the token buffer.
token | The token to be parsed |
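The buffering described above (one incoming token may become several outgoing tokens, held in a `std::deque` until requested) can be sketched as follows. This is a simplified illustration, not the MeTA implementation: `buffered_splitter` is a hypothetical class, it reads from a plain vector instead of a `token_stream`, and the only rule shown is the Penn Treebank convention of splitting the contraction suffix `n't` into its own token. The actual rules inside `ptb_normalizer::parse_token` are not documented on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical filter: one input token may yield several buffered tokens.
class buffered_splitter
{
  public:
    explicit buffered_splitter(std::vector<std::string> input)
        : input_{std::move(input)}
    {
    }

    // More tokens exist if the buffer or the input is non-empty.
    explicit operator bool() const
    {
        return !tokens_.empty() || pos_ < input_.size();
    }

    std::string next()
    {
        assert(!tokens_.empty() || pos_ < input_.size());
        // Refill the buffer only when it has been drained.
        if (tokens_.empty())
            parse_token(input_.at(pos_++));
        auto tok = tokens_.front();
        tokens_.pop_front();
        return tok;
    }

  private:
    // Splits a trailing "n't" off the token; results go on the buffer.
    void parse_token(const std::string& token)
    {
        static const std::string suffix = "n't";
        if (token.size() > suffix.size()
            && token.compare(token.size() - suffix.size(), suffix.size(),
                             suffix) == 0)
        {
            tokens_.push_back(token.substr(0, token.size() - suffix.size()));
            tokens_.push_back(suffix);
        }
        else
        {
            tokens_.push_back(token);
        }
    }

    std::vector<std::string> input_;
    std::size_t pos_ = 0;
    std::deque<std::string> tokens_; // buffered tokens to return
};
```

A deque suits this pattern because tokens are appended at the back as they are produced and removed from the front in the order they should be returned.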