ModErn Text Analysis
META Enumerates Textual Applications
meta::corpus::document Class Reference

Represents an indexable document.

#include <document.h>

Public Member Functions

 document (const std::string &path="[NONE]", doc_id d_id=doc_id{0}, const class_label &label=class_label{"[NONE]"})
 Constructor.
 
void increment (const std::string &term, double amount)
 Increment the count of the specified term.
 
std::string path () const
 
const class_label & label () const
 
std::string name () const
 
void name (const std::string &n)
 
uint64_t length () const
 
double count (const std::string &term) const
 Get the number of occurrences for a particular term.
 
const std::unordered_map< std::string, double > & counts () const
 
void content (const std::string &content, const std::string &encoding="utf-8")
 Sets the content of the document to be the parameter.
 
void encoding (const std::string &encoding)
 Sets the encoding for the document to be the parameter.
 
const std::string & content () const
 
const std::string & encoding () const
 
doc_id id () const
 
bool contains_content () const
 
void label (class_label label)
 Sets the label for this document.
 

Private Attributes

std::string path_
 Where this document is on disk.
 
doc_id d_id_
 The document id for this document.
 
class_label label_
 Which category this document would be classified into.
 
std::string name_
 The short name for this document (not the full path)
 
size_t length_
 The number of (non-unique) tokens in this document.
 
std::unordered_map< std::string, double > counts_
 Counts of how many times each token appears.
 
util::optional< std::string > content_
 What the document contains.
 
std::string encoding_
 The encoding for the content.
 

Detailed Description

Represents an indexable document.

Internally, a document may contain either string content or a path to a file it represents on disk.

Once tokenized, a document contains a mapping of term -> frequency. This mapping is empty upon creation.
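
As an illustration of the term -> frequency mapping, here is a minimal sketch that does not use MeTA itself. The names term_counts and tokenize_whitespace are hypothetical, and whitespace splitting only stands in for a real tokenizer. The map type mirrors the std::unordered_map< std::string, double > returned by counts().

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical alias matching the map type that counts() exposes.
using term_counts = std::unordered_map<std::string, double>;

// Hypothetical helper (not part of MeTA): split on whitespace and
// count how often each token occurs.
term_counts tokenize_whitespace(const std::string& text)
{
    term_counts counts; // empty, like a document's mapping upon creation
    std::istringstream stream{text};
    std::string token;
    while (stream >> token)
        counts[token] += 1.0;
    return counts;
}

// Usage: the mapping is empty for empty input and fills as tokens are seen.
void tokenize_example()
{
    assert(tokenize_whitespace("").empty());
    term_counts counts = tokenize_whitespace("to be or not to be");
    assert(counts.size() == 4u); // four unique terms
    assert(counts["to"] == 2.0); // "to" occurs twice
}
```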

Constructor & Destructor Documentation

meta::corpus::document::document ( const std::string &  path = "[NONE]",
doc_id  d_id = doc_id{0},
const class_label &  label = class_label{"[NONE]"} 
)

Constructor.

Parameters
path   The path to the document
d_id   The doc id to assign to this document
label  The optional class label to assign to this document

Member Function Documentation

void meta::corpus::document::increment ( const std::string &  term,
double  amount 
)

Increment the count of the specified term.

Parameters
term    The string token whose count to increment
amount  The amount to increment by
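
A minimal sketch of how incrementing could behave, assuming (this page does not confirm it) that increment adds amount both to the term's entry in the counts map and to the running length. sketch_document is a hypothetical stand-in that models only the counting members; it is not the MeTA class.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for meta::corpus::document (counting members only).
class sketch_document
{
  public:
    // Assumption: the term's count and the total length both grow by amount.
    void increment(const std::string& term, double amount)
    {
        counts_[term] += amount; // a missing entry starts at 0.0
        // Fractional amounts are truncated in this sketch.
        length_ += static_cast<std::uint64_t>(amount);
    }

    std::uint64_t length() const { return length_; }

    const std::unordered_map<std::string, double>& counts() const
    {
        return counts_;
    }

  private:
    std::uint64_t length_ = 0;
    std::unordered_map<std::string, double> counts_;
};

// Usage: length tracks total tokens, not unique terms.
void increment_example()
{
    sketch_document doc;
    doc.increment("cat", 1.0);
    doc.increment("cat", 2.0);
    doc.increment("dog", 1.0);
    assert(doc.counts().at("cat") == 3.0);
    assert(doc.counts().size() == 2u);
    assert(doc.length() == 4u);
}
```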
std::string meta::corpus::document::path ( ) const
Returns
the path to this document (the argument to the constructor)
const class_label & meta::corpus::document::label ( ) const
Returns
the classification category this document is in
std::string meta::corpus::document::name ( ) const
Returns
the name of this document
void meta::corpus::document::name ( const std::string &  n)
Parameters
n  The new name for this document
uint64_t meta::corpus::document::length ( ) const
Returns
the total number of tokens recorded for this document. This is not the number of unique terms.
double meta::corpus::document::count ( const std::string &  term) const

Get the number of occurrences for a particular term.

Parameters
term  The string term to look up
Returns
the number of times term appears in this document
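
A const lookup cannot use operator[] on the counts map, since operator[] inserts missing keys. The sketch below uses find instead and assumes (not stated on this page) that a term never seen yields 0. count_of is a hypothetical free function, not part of MeTA.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical helper: read a term's count without modifying the map.
// Assumption: terms that were never counted report 0.
double count_of(const std::unordered_map<std::string, double>& counts,
                const std::string& term)
{
    auto it = counts.find(term);
    return it == counts.end() ? 0.0 : it->second;
}

// Usage: looking up an unseen term leaves the map unchanged.
void count_example()
{
    const std::unordered_map<std::string, double> counts{{"cat", 2.0}};
    assert(count_of(counts, "cat") == 2.0);
    assert(count_of(counts, "dog") == 0.0);
    assert(counts.size() == 1u);
}
```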
const std::unordered_map< std::string, double > & meta::corpus::document::counts ( ) const
Returns
the map of counts for this document.
void meta::corpus::document::content ( const std::string &  content,
const std::string &  encoding = "utf-8" 
)

Sets the content of the document to be the parameter.

Parameters
content   The string content to assign into this document
encoding  The encoding of content, which defaults to utf-8
Note
Saving the document's content is only used by some corpus formats; not all documents are guaranteed to have content stored in the object itself.
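
Since content may be absent, callers would typically check contains_content() before reading it. The following is a minimal sketch under stated assumptions: sketch_document is a hypothetical stand-in, a plain flag replaces util::optional so that only the standard library is needed, and throwing on a missing value is a choice made for this sketch, not documented MeTA behavior.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for meta::corpus::document (content members only).
class sketch_document
{
  public:
    // Store the content and its encoding; encoding defaults to utf-8.
    void content(const std::string& content,
                 const std::string& encoding = "utf-8")
    {
        content_ = content;
        encoding_ = encoding;
        has_content_ = true;
    }

    bool contains_content() const { return has_content_; }

    // Assumption: reading absent content is an error in this sketch.
    const std::string& content() const
    {
        if (!has_content_)
            throw std::logic_error{"document has no stored content"};
        return content_;
    }

    const std::string& encoding() const { return encoding_; }

  private:
    bool has_content_ = false; // stands in for util::optional's empty state
    std::string content_;
    std::string encoding_;
};

// Usage: guard reads with contains_content().
void content_example()
{
    sketch_document doc;
    assert(!doc.contains_content());
    doc.content("hello world");
    if (doc.contains_content())
        assert(doc.content() == "hello world");
    assert(doc.encoding() == "utf-8");
}
```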
void meta::corpus::document::encoding ( const std::string &  encoding)

Sets the encoding for the document to be the parameter.

Parameters
encoding  The string label for the encoding
const std::string & meta::corpus::document::content ( ) const
Returns
the contents of this document
const std::string & meta::corpus::document::encoding ( ) const
Returns
the encoding for this document
doc_id meta::corpus::document::id ( ) const
Returns
the doc_id for this document
bool meta::corpus::document::contains_content ( ) const
Returns
whether this document contains its content internally
void meta::corpus::document::label ( class_label  label)

Sets the label for this document.

Parameters
label  The new label for this document
