ModErn Text Analysis
META Enumerates Textual Applications
meta::analyzers::filters::ptb_normalizer Class Reference

A filter that normalizes text to match Penn Treebank conventions.

#include <ptb_normalizer.h>

Inherits meta::util::multilevel_clonable< Root, Base, Derived >.

Public Member Functions

 ptb_normalizer (std::unique_ptr< token_stream > source)
 Constructs a ptb_normalizer which reads tokens from the given source.
 
 ptb_normalizer (const ptb_normalizer &other)
 Copy constructor.
 
void set_content (const std::string &content) override
 Sets the content for the beginning of the filter chain.
 
std::string next () override
 Obtains the next token in the sequence.
 
 operator bool () const override
 Determines whether there are more tokens available in the stream.
 
Public Member Functions inherited from meta::util::multilevel_clonable< Root, Base, Derived >
virtual std::unique_ptr< Root > clone () const
 Clones the given object.
 

Static Public Attributes

static const std::string id = "ptb-normalizer"
 Identifier for this filter.
 

Private Member Functions

std::string current_token ()
 
void parse_token (const std::string &token)
 Performs token normalization, splitting, etc.
 

Private Attributes

std::unique_ptr< token_stream > source_
 The source to read tokens from.
 
std::deque< std::string > tokens_
 Buffered tokens to return.
 

Detailed Description

A filter that normalizes text to match Penn Treebank conventions.

This is important as a preprocessing step for input to POS taggers and parsers that were trained on Penn Treebank formatted data.
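
As one concrete illustration of such a convention, Penn Treebank formatted data writes brackets as escape tokens such as -LRB- and -RRB-. The following is a minimal sketch of that single rule, not the filter's actual implementation: the function names ptb_escape_bracket and check_bracket_escapes are hypothetical, and whether ptb_normalizer applies exactly this mapping is an assumption.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical helper, not part of MeTA: maps a single bracket token to its
// Penn Treebank escape token and returns any other token unchanged.
std::string ptb_escape_bracket(const std::string& token)
{
    static const std::unordered_map<std::string, std::string> escapes = {
        {"(", "-LRB-"}, {")", "-RRB-"},
        {"[", "-LSB-"}, {"]", "-RSB-"},
        {"{", "-LCB-"}, {"}", "-RCB-"}};

    auto it = escapes.find(token);
    return it == escapes.end() ? token : it->second;
}

// Small self-check showing the mapping on a bracket and on an ordinary word.
void check_bracket_escapes()
{
    assert(ptb_escape_bracket("(") == "-LRB-");
    assert(ptb_escape_bracket("dog") == "dog");
}
```

A tagger or parser trained on Penn Treebank data has only seen the escaped forms, which is why a raw "(" in the input would otherwise look like an unknown token.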

Constructor & Destructor Documentation

meta::analyzers::filters::ptb_normalizer::ptb_normalizer ( std::unique_ptr< token_stream >  source)

Constructs a ptb_normalizer which reads tokens from the given source.

Parameters
source  The source to construct the filter from

meta::analyzers::filters::ptb_normalizer::ptb_normalizer ( const ptb_normalizer &  other)

Copy constructor.

Parameters
other  The ptb_normalizer to copy into this one

Member Function Documentation

void meta::analyzers::filters::ptb_normalizer::set_content ( const std::string &  content)
override

Sets the content for the beginning of the filter chain.

Parameters
content  The string content to set

std::string meta::analyzers::filters::ptb_normalizer::current_token ( )
private

Returns
The token at the front of the buffered token list

void meta::analyzers::filters::ptb_normalizer::parse_token ( const std::string &  token)
private

Performs token normalization, splitting, etc.

The token(s) are placed on the token buffer.

Parameters
token  The token to be parsed

See also
http://www.cis.upenn.edu/~treebank/tokenizer.sed
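
The buffering described above, where parsing one input token may yield several output tokens that are queued and then handed out one at a time, can be sketched as follows. This is an illustration under stated assumptions rather than the class's real code: contraction_splitter is a hypothetical stand-alone class, it handles only the "n't" split found in Penn Treebank tokenization (for example "don't" becomes "do" and "n't"), and it assumes current_token removes the token it returns, which the documentation above does not state.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>

// Hypothetical illustration class. The member names mirror those documented
// above, but this is not MeTA's implementation.
class contraction_splitter
{
  public:
    // Splits a trailing "n't" off the token and places the resulting
    // piece(s) on the token buffer; other tokens are buffered unchanged.
    void parse_token(const std::string& token)
    {
        static const std::string suffix = "n't";
        if (token.size() > suffix.size()
            && token.compare(token.size() - suffix.size(), suffix.size(),
                             suffix) == 0)
        {
            tokens_.push_back(token.substr(0, token.size() - suffix.size()));
            tokens_.push_back(suffix);
        }
        else
        {
            tokens_.push_back(token);
        }
    }

    // Removes and returns the token at the front of the buffer
    // (the removal is an assumption of this sketch).
    std::string current_token()
    {
        assert(!tokens_.empty());
        std::string token = std::move(tokens_.front());
        tokens_.pop_front();
        return token;
    }

    // Reports whether any buffered tokens remain.
    bool empty() const { return tokens_.empty(); }

  private:
    // Buffered tokens to return, oldest first.
    std::deque<std::string> tokens_;
};
```

A std::deque suits this pattern because tokens are appended at the back as they are produced and consumed from the front in the same order.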

The documentation for this class was generated from the following files: