ModErn Text Analysis
META Enumerates Textual Applications
meta::analyzers::tokenizers::icu_tokenizer Class Reference

Converts documents into streams of tokens by following the Unicode standards for sentence and word segmentation.

#include <icu_tokenizer.h>

Inheritance diagram for meta::analyzers::tokenizers::icu_tokenizer:
meta::util::multilevel_clonable< Root, Base, Derived >

Classes

class  impl
 Implementation class for the icu_tokenizer.
 

Public Member Functions

 icu_tokenizer ()
 Creates an icu_tokenizer.
 
 icu_tokenizer (const icu_tokenizer &other)
 Copies an icu_tokenizer.
 
 icu_tokenizer (icu_tokenizer &&other)
 Moves an icu_tokenizer.
 
 ~icu_tokenizer ()
 Destroys an icu_tokenizer.
 
void set_content (const std::string &content) override
 Sets the content for the tokenizer to parse.
 
std::string next () override
 
 operator bool () const override
 Determines if there are more tokens in the document.
 
- Public Member Functions inherited from meta::util::multilevel_clonable< Root, Base, Derived >
virtual std::unique_ptr< Root > clone () const
 Clones the given object.
 

Static Public Attributes

static const std::string id = "icu-tokenizer"
 Identifier for this tokenizer.
 

Private Attributes

util::pimpl< impl > impl_
 The implementation for this tokenizer.
 

Detailed Description

Converts documents into streams of tokens by following the Unicode standards for sentence and word segmentation.

Constructor & Destructor Documentation

meta::analyzers::tokenizers::icu_tokenizer::icu_tokenizer ( const icu_tokenizer &  other)

Copies an icu_tokenizer.

Parameters
other  The other icu_tokenizer to copy into this one
meta::analyzers::tokenizers::icu_tokenizer::icu_tokenizer ( icu_tokenizer &&  other)
default

Moves an icu_tokenizer.

Parameters
other  The other icu_tokenizer to move into this one

Member Function Documentation

void meta::analyzers::tokenizers::icu_tokenizer::set_content ( const std::string &  content)
override

Sets the content for the tokenizer to parse.

The input is assumed to be UTF-8 encoded. ICU converts it to UTF-16 internally for segmentation, but all tokens are returned as UTF-8 encoded strings.

Parameters
content  The string content to set
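
To make the UTF-8 to UTF-16 round trip more concrete, the following is a minimal sketch of the conversion for well-formed input. It is not the code ICU or icu_tokenizer runs; the function name utf8_to_utf16 is hypothetical, and unlike ICU it does not validate or repair malformed sequences. It only shows why byte counts and UTF-16 code unit counts differ for non-ASCII text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of MeTA or ICU): decodes well-formed UTF-8
// into UTF-16 code units. Malformed input is not handled in this sketch.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const auto lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t len = 1;

        // The lead byte determines the sequence length and its payload bits.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else                            { cp = lead & 0x07; len = 4; }

        // A truncated sequence would be a caller error in this sketch.
        assert(i + len <= utf8.size());

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        i += len;

        if (cp < 0x10000)
        {
            // Basic Multilingual Plane: one UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Supplementary planes: encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For example, the five-byte UTF-8 string "caf\xC3\xA9" becomes four UTF-16 code units, while the four-byte sequence "\xF0\x9F\x98\x80" (U+1F600) becomes a two-unit surrogate pair. Because tokens are returned as UTF-8, callers never see the intermediate UTF-16 form.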
std::string meta::analyzers::tokenizers::icu_tokenizer::next ( )
override
Returns
the next token in the document. This will be either a sentence boundary ("<s>" or "</s>"), a token consisting of non-whitespace characters, or a token consisting only of whitespace characters.
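
Since next() can return sentence markers and whitespace-only tokens alongside ordinary tokens, callers often filter the stream. The sketch below shows one way to do that against any object exposing operator bool and next(). The names collect_words and fake_token_stream are hypothetical: fake_token_stream is a stand-in that splits on ASCII whitespace and treats the whole input as a single sentence, so the example compiles without MeTA or ICU. It does not perform Unicode segmentation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in with the same three members as icu_tokenizer.
// It splits on ASCII whitespace only and wraps the input in one sentence.
class fake_token_stream
{
  public:
    void set_content(const std::string& content)
    {
        tokens_.clear();
        pos_ = 0;
        if (content.empty())
            return;
        tokens_.push_back("<s>");
        std::size_t i = 0;
        while (i < content.size())
        {
            // Emit maximal runs of whitespace or non-whitespace characters.
            const bool space = is_space(content[i]);
            std::size_t j = i;
            while (j < content.size() && is_space(content[j]) == space)
                ++j;
            tokens_.push_back(content.substr(i, j - i));
            i = j;
        }
        tokens_.push_back("</s>");
    }

    std::string next()
    {
        assert(pos_ < tokens_.size()); // callers must check operator bool first
        return tokens_[pos_++];
    }

    explicit operator bool() const { return pos_ < tokens_.size(); }

  private:
    static bool is_space(char c)
    {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    std::vector<std::string> tokens_;
    std::size_t pos_ = 0;
};

// Drains a token stream, dropping sentence markers and tokens made up
// solely of ASCII whitespace (an assumption; real input may contain other
// Unicode whitespace that this check does not recognise).
template <class TokenStream>
std::vector<std::string> collect_words(TokenStream& stream)
{
    std::vector<std::string> words;
    while (stream)
    {
        std::string tok = stream.next();
        if (tok == "<s>" || tok == "</s>")
            continue;
        if (tok.find_first_not_of(" \t\r\n") == std::string::npos)
            continue;
        words.push_back(std::move(tok));
    }
    return words;
}
```

Because collect_words is a template, the same loop should work with a real icu_tokenizer instance after set_content has been called, assuming the interface documented above.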
