ModErn Text Analysis
META Enumerates Textual Applications
Fills document objects with content line-by-line from an input file.
#include <line_corpus.h>
Public Member Functions | |
| line_corpus (const std::string &file, std::string encoding, uint64_t num_lines=0) | |
| bool | has_next () const override |
| document | next () override |
| uint64_t | size () const override |
Public Member Functions inherited from meta::corpus::corpus | |
| corpus (std::string encoding) | |
| Constructs a new corpus with the given encoding. | |
| virtual | ~corpus ()=default |
| Destructor. | |
| const std::string & | encoding () const |
Private Attributes | |
| doc_id | cur_id_ |
| The current document we are on. | |
| uint64_t | num_lines_ |
| The number of lines in the file. | |
| io::parser | parser_ |
| Parser to read the corpus file. | |
| std::unique_ptr< io::parser > | class_parser_ |
| Parser to read the class labels. | |
| std::unique_ptr< io::parser > | name_parser_ |
| Parser to read the document names. | |
Additional Inherited Members | |
Static Public Member Functions inherited from meta::corpus::corpus | |
| static std::unique_ptr< corpus > | load (const std::string &config_file) |
Each line of the input file is treated as one document. The tokenizer is responsible for parsing each document's content into labels and features.
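The line-by-line pattern described above can be sketched as follows. This is a minimal, self-contained illustration and not the MeTA implementation: `simple_document`, `simple_line_reader`, and `read_all` are hypothetical names, and the sketch reads from any `std::istream` rather than through `io::parser`. It only mirrors the `has_next()` / `next()` protocol listed in the member table.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a document: an id plus the raw line content.
// Parsing the content into labels and features is left to a tokenizer.
struct simple_document
{
    uint64_t id;
    std::string content;
};

// Hypothetical line-by-line reader (an assumption, not MeTA's class).
class simple_line_reader
{
  public:
    explicit simple_line_reader(std::istream& in) : in_(in), next_id_{0}
    {
        advance(); // buffer the first line so has_next() can answer
    }

    // True while at least one unread line remains.
    bool has_next() const
    {
        return has_line_;
    }

    // Returns the buffered line as a document and buffers the following one.
    // Precondition: has_next() is true.
    simple_document next()
    {
        simple_document doc{next_id_++, std::move(buffered_)};
        advance();
        return doc;
    }

  private:
    void advance()
    {
        has_line_ = static_cast<bool>(std::getline(in_, buffered_));
    }

    std::istream& in_;
    uint64_t next_id_;
    std::string buffered_;
    bool has_line_ = false;
};

// Drains a stream into a vector, one document per line.
std::vector<simple_document> read_all(std::istream& in)
{
    std::vector<simple_document> docs;
    simple_line_reader reader{in};
    while (reader.has_next())
        docs.push_back(reader.next());
    return docs;
}
```

Buffering one line ahead is one way to keep `has_next()` a `const` query; the real class may track its position differently, for example by comparing a current id against the known line count.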
| meta::corpus::line_corpus::line_corpus (const std::string &file, std::string encoding, uint64_t num_lines = 0) |
| file | The path to the corpus file, where each line represents a document |
| encoding | The encoding for the file |
| num_lines | The number of lines in the corpus file, if known beforehand. If unknown, omit this parameter and the constructor will count the lines. |
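The `num_lines` default can be illustrated with a short sketch. It assumes, as the parameter description suggests, that a value of 0 means "unknown" and triggers a counting pass. `count_lines` and `resolve_num_lines` are hypothetical helpers introduced here for illustration; they are not part of the documented API.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts the lines remaining on a stream.
uint64_t count_lines(std::istream& in)
{
    uint64_t count = 0;
    std::string line;
    while (std::getline(in, line))
        ++count;
    return count;
}

// Hypothetical helper mirroring the documented default:
// 0 is treated as "unknown", so the lines are counted.
uint64_t resolve_num_lines(std::istream& in, uint64_t num_lines = 0)
{
    if (num_lines != 0)
        return num_lines; // trust the caller and skip the extra pass
    return count_lines(in);
}
```

Passing a known count avoids one full read of the input, which can matter for large corpus files.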
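| bool meta::corpus::line_corpus::has_next () const | override virtual |
Implements meta::corpus::corpus.
| document meta::corpus::line_corpus::next () | override virtual |
Implements meta::corpus::corpus.
| uint64_t meta::corpus::line_corpus::size () const | override virtual |
Implements meta::corpus::corpus.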