Fills document objects with content line-by-line from an input file.

Inheritance: meta::corpus::line_corpus derives from meta::corpus::corpus.

Public Member Functions
	line_corpus (const std::string &file, std::string encoding, uint64_t num_lines=0)

bool	has_next () const override

document	next () override

uint64_t	size () const override

Public Member Functions inherited from meta::corpus::corpus
	corpus (std::string encoding)
	Constructs a new corpus with the given encoding.

virtual	~corpus ()=default
	Destructor.

const std::string &	encoding () const

Private Attributes
doc_id	cur_id_
	The current document we are on.

uint64_t	num_lines_
	The number of lines in the file.

io::parser	parser_
	Parser to read the corpus file.

std::unique_ptr< io::parser >	class_parser_
	Parser to read the class labels.

std::unique_ptr< io::parser >	name_parser_
	Parser to read the document names.

Additional Inherited Members
Static Public Member Functions inherited from meta::corpus::corpus
static std::unique_ptr< corpus >	load (const std::string &config_file)

Detailed Description

Fills document objects with content line-by-line from an input file.

The tokenizer in use is responsible for correctly parsing each document's content into labels and features.
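The line-per-document behavior described above can be illustrated with a small standalone sketch. It does not use the MeTA headers: `sketch_document`, `sketch_line_corpus`, and `collect_contents` are hypothetical names invented for this example, and the sketch loads all lines eagerly for simplicity, which is an assumption and not necessarily how line_corpus reads its file. Only the has_next / next / size interface mirrors the members listed on this page.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a document: an id plus the raw line content.
struct sketch_document
{
    std::uint64_t id;
    std::string content;
};

// Hypothetical stand-in for line_corpus: every input line becomes one document.
// Assumption: lines are read eagerly into memory to keep the sketch short.
class sketch_line_corpus
{
  public:
    explicit sketch_line_corpus(std::istream& in)
    {
        std::string line;
        while (std::getline(in, line))
            lines_.push_back(line);
    }

    // True while at least one unread document remains.
    bool has_next() const
    {
        return cur_id_ < lines_.size();
    }

    // Returns the next document; the caller must check has_next() first.
    sketch_document next()
    {
        sketch_document doc{cur_id_, lines_[cur_id_]};
        ++cur_id_;
        return doc;
    }

    // Total number of documents (lines) in the corpus.
    std::uint64_t size() const
    {
        return lines_.size();
    }

  private:
    std::vector<std::string> lines_;
    std::uint64_t cur_id_ = 0;
};

// Drains a corpus built from `text` and returns each document's content in order.
std::vector<std::string> collect_contents(const std::string& text)
{
    std::istringstream in{text};
    sketch_line_corpus corpus{in};
    std::vector<std::string> out;
    while (corpus.has_next())
        out.push_back(corpus.next().content);
    return out;
}
```

In this sketch the content of each document is left unparsed, so a line such as "spam buy now" is handed on verbatim. Splitting it into a label and features is left to the tokenizer, as noted above.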

Constructor & Destructor Documentation

meta::corpus::line_corpus::line_corpus	(	const std::string &	file,
		std::string	encoding,
		uint64_t	num_lines = 0
	)

Parameters

file	The path to the corpus file, where each line represents one document.
encoding	The encoding of the file.
num_lines	The number of lines in the corpus file, if known beforehand. If unknown, omit this parameter and the constructor will calculate the value.
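
The default for num_lines can be sketched as follows. This is a standalone illustration, not the library's code: `count_lines`, `resolve_num_lines`, and `resolve_num_lines_from_text` are hypothetical helpers, and treating 0 as the "unknown" sentinel is an assumption inferred from the default value in the signature.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts the lines remaining on a stream.
std::uint64_t count_lines(std::istream& in)
{
    std::uint64_t n = 0;
    std::string line;
    while (std::getline(in, line))
        ++n;
    return n;
}

// Hypothetical helper: assumes 0 means "unknown", in which case the
// line count is computed; any other value is trusted as given.
std::uint64_t resolve_num_lines(std::istream& in, std::uint64_t num_lines = 0)
{
    return num_lines == 0 ? count_lines(in) : num_lines;
}

// Convenience wrapper so the behavior can be tried on an in-memory string.
std::uint64_t resolve_num_lines_from_text(const std::string& text,
                                          std::uint64_t num_lines = 0)
{
    std::istringstream in{text};
    return resolve_num_lines(in, num_lines);
}
```

Supplying the count up front skips the extra pass over the input, which is the practical reason to pass num_lines when it is already known.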