ModErn Text Analysis
META Enumerates Textual Applications
Converts documents into streams of whitespace-delimited tokens.
#include <whitespace_tokenizer.h>
Public Member Functions | |
whitespace_tokenizer () | |
Creates a whitespace_tokenizer. | |
void | set_content (const std::string &content) override |
Sets the content for the tokenizer to parse. | |
std::string | next () override |
operator bool () const override | |
Determines if there are more tokens in the document. | |
Public Member Functions inherited from meta::util::multilevel_clonable< Root, Base, Derived > | |
virtual std::unique_ptr< Root > | clone () const |
Clones the given object. | |
Static Public Attributes | |
static const std::string | id = "whitespace-tokenizer" |
Identifier for this tokenizer. | |
Private Attributes | |
std::string | content_ |
Buffered string content for this tokenizer. | |
uint64_t | idx_ |
Character index into the current buffer. | |
Converts documents into streams of whitespace-delimited tokens.
This tokenizer preserves whitespace rather than discarding it, and combines adjacent non-whitespace characters into a single token.
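The behavior described above can be illustrated with a small, self-contained sketch. It is not the MeTA implementation: `sketch_whitespace_tokenizer` and `tokenize_all` are hypothetical names, and the sketch mirrors only the three members listed on this page (`set_content`, `next`, `operator bool`). It also assumes that each whitespace character is returned as its own one-character token and that `next()` throws when no tokens remain; this page does not confirm either detail.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in, NOT the MeTA class. It only mimics the interface
// documented on this page so the token stream can be observed in isolation.
class sketch_whitespace_tokenizer
{
  public:
    // Stores the text to tokenize and restarts from its first character.
    void set_content(const std::string& content)
    {
        content_ = content;
        idx_ = 0;
    }

    // Returns the next token.
    // Assumption: every whitespace character is its own token, while a run
    // of non-whitespace characters is merged into one token.
    std::string next()
    {
        if (!*this)
            throw std::out_of_range{"next() called with no tokens left"};

        if (is_space(content_[idx_]))
            return std::string(1, content_[idx_++]);

        std::string token;
        while (idx_ < content_.size() && !is_space(content_[idx_]))
            token += content_[idx_++];
        assert(!token.empty()); // the first character was non-whitespace
        return token;
    }

    // True while unread characters remain in the buffer.
    // (Declared explicit here; the real class may differ.)
    explicit operator bool() const
    {
        return idx_ < content_.size();
    }

  private:
    // std::isspace requires a value representable as unsigned char.
    static bool is_space(char c)
    {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }

    std::string content_;   // buffered string content
    std::size_t idx_ = 0;   // character index into the buffer
};

// Hypothetical helper: drains the sketch tokenizer into a vector.
std::vector<std::string> tokenize_all(const std::string& text)
{
    sketch_whitespace_tokenizer tok;
    tok.set_content(text);

    std::vector<std::string> tokens;
    while (tok)
        tokens.push_back(tok.next());
    return tokens;
}
```

Under these assumptions, `tokenize_all("one  two")` yields four tokens: `"one"`, `" "`, `" "`, and `"two"`. Because no whitespace is dropped, concatenating the tokens reproduces the original input, which is the practical consequence of a whitespace-preserving tokenizer.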
void set_content (const std::string &content) override
Sets the content for the tokenizer to parse.
content | The string content to set |