ModErn Text Analysis
META Enumerates Textual Applications
Converts documents into streams of tokens by following the Unicode standards for sentence and word segmentation.
#include <icu_tokenizer.h>
Classes | |
| class | impl |
| Implementation class for the icu_tokenizer. | |
Public Member Functions | |
| icu_tokenizer () | |
| Creates an icu_tokenizer. | |
| icu_tokenizer (const icu_tokenizer &other) | |
| Copies an icu_tokenizer. | |
| icu_tokenizer (icu_tokenizer &&other) | |
| Moves an icu_tokenizer. | |
| ~icu_tokenizer () | |
| Destroys an icu_tokenizer. | |
| void | set_content (const std::string &content) override |
| Sets the content for the tokenizer to parse. | |
| std::string | next () override |
| operator bool () const override | |
| Determines if there are more tokens in the document. | |
Public Member Functions inherited from meta::util::multilevel_clonable< Root, Base, Derived > | |
| virtual std::unique_ptr< Root > | clone () const |
| Clones the given object. | |
Static Public Attributes | |
| static const std::string | id = "icu-tokenizer" |
| Identifier for this tokenizer. | |
Private Attributes | |
| util::pimpl< impl > | impl_ |
| The implementation for this tokenizer. | |
Converts documents into streams of tokens by following the Unicode standards for sentence and word segmentation.
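The members listed above (set_content, next, and operator bool) suggest a simple pull-style loop: set the content once, then request tokens until the tokenizer converts to false. The sketch below illustrates that loop with a hypothetical stand-in class, toy_tokenizer, so that it compiles without MeTA or ICU. The stand-in and the collect_tokens helper are assumptions made for illustration only; the stand-in splits on ASCII whitespace and does not reproduce the Unicode segmentation that icu_tokenizer performs.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in, NOT the real icu_tokenizer. It mirrors only the
// three members listed on this page and splits on ASCII whitespace instead
// of applying Unicode sentence and word segmentation.
class toy_tokenizer
{
  public:
    // Replaces the content and rewinds to the first token.
    void set_content(const std::string& content)
    {
        content_ = content;
        pos_ = 0;
        skip_whitespace();
    }

    // Returns the next token; callers are expected to test the tokenizer
    // with operator bool before calling this.
    std::string next()
    {
        assert(pos_ < content_.size());
        const std::size_t start = pos_;
        while (pos_ < content_.size() && !is_space(content_[pos_]))
            ++pos_;
        std::string token = content_.substr(start, pos_ - start);
        skip_whitespace();
        return token;
    }

    // True while at least one more token remains.
    operator bool() const
    {
        return pos_ < content_.size();
    }

  private:
    static bool is_space(char c)
    {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }

    void skip_whitespace()
    {
        while (pos_ < content_.size() && is_space(content_[pos_]))
            ++pos_;
    }

    std::string content_;
    std::size_t pos_ = 0;
};

// The consumption loop: set the content, then pull tokens until the
// tokenizer reports that none are left.
std::vector<std::string> collect_tokens(const std::string& text)
{
    toy_tokenizer tok;
    tok.set_content(text);
    std::vector<std::string> tokens;
    while (tok)
        tokens.push_back(tok.next());
    return tokens;
}
```

With the real class, the same loop shape would presumably apply, with icu_tokenizer in place of the stand-in; the tokens it yields would follow Unicode segmentation rather than whitespace splitting.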
| meta::analyzers::tokenizers::icu_tokenizer::icu_tokenizer | ( | const icu_tokenizer & | other | ) |
Copies an icu_tokenizer.
| other | The other icu_tokenizer to copy into this one |
| meta::analyzers::tokenizers::icu_tokenizer::icu_tokenizer | ( | icu_tokenizer && | other | ) | default |
Moves an icu_tokenizer.
| other | The other icu_tokenizer to move into this one |
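| void meta::analyzers::tokenizers::icu_tokenizer::set_content | ( | const std::string & | content | ) | override |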
Sets the content for the tokenizer to parse.
The input is assumed to be UTF-8 encoded. ICU converts it to UTF-16 internally for segmentation, but all tokens are returned as UTF-8 encoded strings.
| content | The string content to set |
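To make the encoding note concrete, the sketch below shows what a UTF-8 to UTF-16 conversion involves at the code-unit level. It is a hand-written illustration, not ICU code and not how icu_tokenizer is implemented; the function name utf8_to_utf16 is hypothetical, and the sketch assumes well-formed UTF-8 input with no validation beyond a truncation check.

```cpp
#include <cstddef>
#include <string>

// Illustrative only: converts well-formed UTF-8 into UTF-16 code units.
// ICU performs this conversion itself; this function exists purely to show
// why the two encodings use different numbers of code units.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        // Sequence length is determined by the lead byte.
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2
                              : lead < 0xF0 ? 3 : 4;
        if (i + len > in.size())
            break; // truncated sequence: stop rather than read out of bounds

        // Strip the length marker bits from the lead byte.
        char32_t cp = len == 1 ? lead : (lead & (0xFF >> (len + 1)));
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above U+FFFF need a surrogate pair in UTF-16.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Because byte offsets in the UTF-8 input generally do not match code-unit offsets in the UTF-16 form, returning tokens as UTF-8 strings, as described above, spares callers from having to map positions between the two encodings.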