ModErn Text Analysis
META Enumerates Textual Applications
Converts documents into streams of tokens by following the Unicode standards for sentence and word segmentation.
#include <icu_tokenizer.h>
Classes
class impl
Implementation class for the icu_tokenizer.
Public Member Functions
icu_tokenizer ()
Creates an icu_tokenizer.
icu_tokenizer (const icu_tokenizer &other)
Copies an icu_tokenizer.
icu_tokenizer (icu_tokenizer &&other)
Moves an icu_tokenizer.
~icu_tokenizer ()
Destroys an icu_tokenizer.
void set_content (const std::string &content) override
Sets the content for the tokenizer to parse.
std::string next () override
Returns the next token in the document.
operator bool () const override
Determines if there are more tokens in the document.
Public Member Functions inherited from meta::util::multilevel_clonable< Root, Base, Derived >
virtual std::unique_ptr< Root > clone () const
Clones the given object.
Static Public Attributes
static const std::string id = "icu-tokenizer"
Identifier for this tokenizer.
Private Attributes
util::pimpl< impl > impl_
The implementation for this tokenizer.
meta::analyzers::tokenizers::icu_tokenizer::icu_tokenizer (const icu_tokenizer &other)
Copies an icu_tokenizer.
other | The other icu_tokenizer to copy into this one
meta::analyzers::tokenizers::icu_tokenizer::icu_tokenizer (icu_tokenizer &&other) default
Moves an icu_tokenizer.
other | The other icu_tokenizer to move into this one
void meta::analyzers::tokenizers::icu_tokenizer::set_content (const std::string &content) override
Sets the content for the tokenizer to parse.
The input is assumed to be UTF-8 encoded. ICU converts it to UTF-16 internally for segmentation, but all tokens are output as UTF-8 encoded strings.
content | The string content to set
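The three documented members (set_content, next, operator bool) suggest a simple drain loop: set the content once, then call next while the tokenizer converts to true. The sketch below illustrates that loop only. The class toy_tokenizer and the helper drain are hypothetical names introduced for illustration; toy_tokenizer merely mimics the documented interface and splits on ASCII whitespace, whereas the real icu_tokenizer relies on ICU's Unicode sentence and word segmentation and may produce different tokens.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in that mirrors the documented interface
// (set_content, next, operator bool). It splits on ASCII whitespace only;
// the real icu_tokenizer delegates segmentation to ICU.
class toy_tokenizer
{
  public:
    // Replaces the current content and rewinds to its first token.
    void set_content(const std::string& content)
    {
        content_ = content;
        pos_ = 0;
        skip_spaces();
    }

    // Returns the next token; call only while operator bool is true.
    std::string next()
    {
        const std::size_t start = pos_;
        while (pos_ < content_.size() && !is_space(content_[pos_]))
            ++pos_;
        std::string token = content_.substr(start, pos_ - start);
        skip_spaces();
        return token;
    }

    // True while at least one more token remains.
    operator bool() const
    {
        return pos_ < content_.size();
    }

  private:
    static bool is_space(char c)
    {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    void skip_spaces()
    {
        while (pos_ < content_.size() && is_space(content_[pos_]))
            ++pos_;
    }

    std::string content_;
    std::size_t pos_ = 0;
};

// Collects every token from any tokenizer exposing the interface above.
// The input is passed through as UTF-8 bytes, matching the documented
// assumption about set_content.
template <class Tokenizer>
std::vector<std::string> drain(Tokenizer& tok, const std::string& utf8_text)
{
    tok.set_content(utf8_text);
    std::vector<std::string> tokens;
    while (tok)
        tokens.push_back(tok.next());
    return tokens;
}
```

Because drain is a template, the same loop should in principle work with any type offering these three members, so it could be tried against meta::analyzers::tokenizers::icu_tokenizer in a build that links MeTA and ICU. Splitting on ASCII whitespace never cuts a multi-byte UTF-8 sequence, so the stand-in's tokens remain valid UTF-8.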