ModErn Text Analysis
META Enumerates Textual Applications
character_tokenizer.h
1 
9 #ifndef META_CHARACTER_TOKENIZER_H_
10 #define META_CHARACTER_TOKENIZER_H_
11 
12 #include "analyzers/token_stream.h"
13 #include "util/clonable.h"
14 
15 namespace meta
16 {
17 namespace corpus
18 {
19 class document;
20 }
21 }
22 
23 namespace meta
24 {
25 namespace analyzers
26 {
27 namespace tokenizers
28 {
29 
34 class character_tokenizer
35  : public util::clonable<token_stream, character_tokenizer>
36 {
37  public:
41  character_tokenizer();
42 
47  void set_content(const std::string& content) override;
48 
53  std::string next() override;
54 
58  operator bool() const override;
59 
61  const static std::string id;
62 
63  private:
65  std::string content_;
66 
68  uint64_t idx_;
69 };
70 }
71 }
72 }
73 #endif
void set_content(const std::string &content) override
Sets the content for the tokenizer.
Definition: character_tokenizer.cpp:24
meta::analyzers::tokenizers::character_tokenizer
Converts documents into streams of characters.
Definition: character_tokenizer.h:34
meta::util::clonable
Template class to facilitate polymorphic cloning.
Definition: clonable.h:28
The ModErn Text Analysis toolkit is a suite of natural language processing, classification, information retrieval, data mining, and other applications of text processing.
Definition: analyzer.h:24
uint64_t idx_
Character index into the current buffer.
Definition: character_tokenizer.h:68
std::string next() override
Definition: character_tokenizer.cpp:30
static const std::string id
Identifier for this tokenizer.
Definition: character_tokenizer.h:61
std::string content_
The buffered string content for this tokenizer.
Definition: character_tokenizer.h:65
character_tokenizer()
Creates a character_tokenizer.
Definition: character_tokenizer.cpp:19
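
The header above only declares the interface; the definitions live in character_tokenizer.cpp, which is not shown here. As a rough illustration of what "converts documents into streams of characters" could look like, the following is a minimal standalone sketch. The names simple_character_tokenizer and tokenize are hypothetical, and the class deliberately does not derive from token_stream or util::clonable, so it compiles without the rest of the toolkit. Its handling of a call to next() on an exhausted stream (an assertion) is an assumption and may differ from the real implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in that mirrors the member functions declared in
// character_tokenizer.h; it is NOT the toolkit's implementation.
class simple_character_tokenizer
{
  public:
    // Replaces the buffered content and rewinds to the first character.
    void set_content(const std::string& content)
    {
        content_ = content;
        idx_ = 0;
    }

    // Returns the next character as a one-character string.
    // Assumption: calling next() with nothing left is a programming error.
    std::string next()
    {
        assert(idx_ < content_.size());
        return std::string(1, content_[idx_++]);
    }

    // True while unread characters remain in the buffer.
    explicit operator bool() const
    {
        return idx_ < content_.size();
    }

  private:
    std::string content_; // the buffered string content
    uint64_t idx_ = 0;    // character index into the current buffer
};

// Hypothetical helper: drains the tokenizer into a vector of tokens.
std::vector<std::string> tokenize(const std::string& text)
{
    simple_character_tokenizer tok;
    tok.set_content(text);

    std::vector<std::string> tokens;
    while (tok)
        tokens.push_back(tok.next());
    return tokens;
}
```

Under these assumptions, tokenize("abc") yields the three tokens "a", "b", and "c", and an empty string yields no tokens. The loop `while (tok) tok.next()` follows the same pattern that the declared `operator bool()` and `next()` suggest for draining a token stream.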