ModErn Text Analysis
META Enumerates Textual Applications
Represents an indexable document.
#include <document.h>
Public Member Functions | |
document (const std::string &path="[NONE]", doc_id d_id=doc_id{0}, const class_label &label=class_label{"[NONE]"}) | |
Constructor. | |
void | increment (const std::string &term, double amount) |
Increment the count of the specified term. | |
std::string | path () const |
const class_label & | label () const |
std::string | name () const |
void | name (const std::string &n) |
uint64_t | length () const |
double | count (const std::string &term) const |
Get the number of occurrences for a particular term. | |
const std::unordered_map< std::string, double > & | counts () const |
void | content (const std::string &content, const std::string &encoding="utf-8") |
Sets the content of the document to be the parameter. | |
void | encoding (const std::string &encoding) |
Sets the encoding for the document to be the parameter. | |
const std::string & | content () const |
const std::string & | encoding () const |
doc_id | id () const |
bool | contains_content () const |
void | label (class_label label) |
Sets the label for this document. | |
Private Attributes | |
std::string | path_ |
Where this document is on disk. | |
doc_id | d_id_ |
The document id for this document. | |
class_label | label_ |
Which category this document would be classified into. | |
std::string | name_ |
The short name for this document (not the full path) | |
size_t | length_ |
The number of (non-unique) tokens in this document. | |
std::unordered_map< std::string, double > | counts_ |
Counts of how many times each token appears. | |
util::optional< std::string > | content_ |
What the document contains. | |
std::string | encoding_ |
The encoding for the content. | |
Represents an indexable document.
Internally, a document may contain either string content or a path to a file it represents on disk.
Once tokenized, a document contains a mapping of term -> frequency. This mapping is empty upon creation.
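The term -> frequency mapping described above can be illustrated with a minimal standalone sketch. `toy_document` is a hypothetical stand-in, not the MeTA class; it assumes only that `increment` adds to a per-term count and that the mapping starts out empty. Updating the length inside `increment` is also an assumption, not something this page states.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for illustration; not meta::corpus::document.
class toy_document
{
  public:
    // Add `amount` to the count stored for `term`, creating the entry
    // (starting from 0) if the term has not been seen before.
    void increment(const std::string& term, double amount)
    {
        counts_[term] += amount;
        length_ += amount; // assumption: length tracks the total count
    }

    // Read-only view of the whole term -> frequency mapping.
    const std::unordered_map<std::string, double>& counts() const
    {
        return counts_;
    }

    double length() const
    {
        return length_;
    }

  private:
    std::unordered_map<std::string, double> counts_; // empty upon creation
    double length_ = 0.0;
};
```

Counts are stored as `double` rather than as an integer type, matching the `counts()` signature above, so fractional or weighted increments are representable.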
meta::corpus::document::document(const std::string &path = "[NONE]", doc_id d_id = doc_id{0}, const class_label &label = class_label{"[NONE]"})
Constructor.
path | The path to the document |
d_id | The doc id to assign to this document |
label | The optional class label to assign to this document |
void meta::corpus::document::increment(const std::string &term, double amount)
Increment the count of the specified term.
term | The string token whose count to increment |
amount | The amount to increment by |
std::string meta::corpus::document::path() const
const class_label & meta::corpus::document::label() const
std::string meta::corpus::document::name() const
void meta::corpus::document::name(const std::string &n)
n | The new name for this document |
uint64_t meta::corpus::document::length() const
double meta::corpus::document::count(const std::string &term) const
Get the number of occurrences for a particular term.
term | The string term to look up |
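One way such a lookup could behave is sketched below on a plain `std::unordered_map`. The free function `count_of` is hypothetical, and returning 0 for a term that was never incremented is an assumption, since this page does not say what `count` yields for unseen terms.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: look up a term's count without modifying the map.
// Assumption: terms that were never incremented report a count of 0.
double count_of(const std::unordered_map<std::string, double>& counts,
                const std::string& term)
{
    auto it = counts.find(term);
    return it == counts.end() ? 0.0 : it->second;
}
```

Using `find` rather than `operator[]` keeps the lookup usable on a const map and avoids inserting an empty entry for every term that is queried.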
const std::unordered_map< std::string, double > & meta::corpus::document::counts() const
void meta::corpus::document::content(const std::string &content, const std::string &encoding = "utf-8")
Sets the content of the document to be the parameter.
content | The string content to assign into this document |
encoding | The encoding of content, which defaults to utf-8 |
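The relationship between `content`, `encoding`, and `contains_content` can be sketched with `std::optional` (C++17) standing in for `util::optional`. `toy_content` is a hypothetical type. Throwing when unset content is read, and starting with a "utf-8" encoding before any content is assigned, are both assumptions rather than documented behavior.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in; std::optional replaces util::optional here.
class toy_content
{
  public:
    // Store the content together with its encoding (utf-8 unless stated).
    void content(const std::string& text,
                 const std::string& encoding = "utf-8")
    {
        content_ = text;
        encoding_ = encoding;
    }

    // True once content has been assigned.
    bool contains_content() const
    {
        return content_.has_value();
    }

    // Assumption: reading content that was never set is an error.
    const std::string& content() const
    {
        if (!content_)
            throw std::runtime_error{"document has no content"};
        return *content_;
    }

    const std::string& encoding() const
    {
        return encoding_;
    }

  private:
    std::optional<std::string> content_; // empty until content() is called
    std::string encoding_ = "utf-8";     // assumption: initial encoding
};
```

Keeping the content optional lets a document that only refers to a file on disk be distinguished from one that carries an empty string as its content.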
void meta::corpus::document::encoding(const std::string &encoding)
Sets the encoding for the document to be the parameter.
encoding | The string label for the encoding |
const std::string & meta::corpus::document::content() const
const std::string & meta::corpus::document::encoding() const
doc_id meta::corpus::document::id() const
bool meta::corpus::document::contains_content() const
void meta::corpus::document::label(class_label label)
Sets the label for this document.
label | The new label for this document |