ModErn Text Analysis
META Enumerates Textual Applications
Converts documents into streams of whitespace-delimited tokens.
#include <whitespace_tokenizer.h>
Public Member Functions | |
| whitespace_tokenizer () | |
| Creates a whitespace_tokenizer. | |
| void | set_content (const std::string &content) override |
| Sets the content for the tokenizer to parse. | |
| std::string | next () override |
| operator bool () const override | |
| Determines if there are more tokens in the document. | |
Public Member Functions inherited from meta::util::multilevel_clonable< Root, Base, Derived > | |
| virtual std::unique_ptr< Root > | clone () const |
| Clones the given object. | |
Static Public Attributes | |
| static const std::string | id = "whitespace-tokenizer" |
| Identifier for this tokenizer. | |
Private Attributes | |
| std::string | content_ |
| Buffered string content for this tokenizer. | |
| uint64_t | idx_ |
| Character index into the current buffer. | |
Converts documents into streams of whitespace-delimited tokens.
This tokenizer keeps whitespace in the token stream rather than discarding it, and combines each run of adjacent non-whitespace characters into a single token.
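The behavior described above can be illustrated with a minimal, self-contained sketch. It is not the MeTA implementation: `whitespace_tokenizer_sketch` and `tokenize_all` are hypothetical names, the class does not derive from the library's token stream base, and it assumes that each whitespace character is emitted as its own one-character token, which this page does not spell out.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in, NOT the MeTA class. It mirrors the documented
// set_content / next / operator bool interface without any base class.
class whitespace_tokenizer_sketch
{
  public:
    // Buffers the text and rewinds to its first character.
    void set_content(const std::string& content)
    {
        content_ = content;
        idx_ = 0;
    }

    // Returns the next token, or an empty string once the buffer is used up.
    // Assumption: every whitespace character becomes its own token.
    std::string next()
    {
        if (idx_ >= content_.size())
            return "";
        if (is_space(content_[idx_]))
            return std::string(1, content_[idx_++]);
        // Merge the run of adjacent non-whitespace characters.
        const std::size_t start = idx_;
        while (idx_ < content_.size() && !is_space(content_[idx_]))
            ++idx_;
        return content_.substr(start, idx_ - start);
    }

    // True while unread characters remain in the buffer.
    operator bool() const
    {
        return idx_ < content_.size();
    }

  private:
    static bool is_space(char c)
    {
        // Cast avoids undefined behavior for negative char values.
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }

    std::string content_;  // buffered string content
    std::size_t idx_ = 0;  // character index into the buffer
};

// Hypothetical helper: drains the tokenizer into a vector.
std::vector<std::string> tokenize_all(const std::string& text)
{
    whitespace_tokenizer_sketch tok;
    tok.set_content(text);
    std::vector<std::string> tokens;
    while (tok)
        tokens.push_back(tok.next());
    return tokens;
}
```

Under these assumptions, concatenating the tokens returned by `tokenize_all` reproduces the input exactly, which is one way to read "preserves the whitespace".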
void set_content (const std::string &content) override
Sets the content for the tokenizer to parse.
Parameters
| content | The string content to set |