ModErn Text Analysis
META Enumerates Textual Applications
meta::index::disk_index::disk_index_impl Class Reference

The implementation of a disk_index.

#include <disk_index_impl.h>

Public Member Functions

void initialize_metadata (uint64_t num_docs=0)
 Initializes the following metadata maps: doc_sizes_, labels_, unique_terms_.
 
void load_doc_sizes (uint64_t num_docs=0)
 Loads the doc sizes.
 
void load_labels (uint64_t num_docs=0)
 Loads the doc labels.
 
void load_unique_terms (uint64_t num_docs=0)
 Loads the unique terms per document.
 
void load_doc_id_mapping ()
 Loads the doc_id mapping.
 
void load_term_id_mapping ()
 Loads the term_id mapping.
 
void load_label_id_mapping ()
 Loads the label_id mapping.
 
void load_postings ()
 Loads the postings file.
 
void save_label_id_mapping ()
 Saves the label_id mapping.
 
string_list_writer make_doc_id_writer (uint64_t num_docs) const
 Creates a string_list_writer for writing the docids mapping.
 
void set_label (doc_id id, const class_label &label)
 Sets the label for a document.
 
void set_length (doc_id id, uint64_t length)
 Sets the size of a document.
 
void set_unique_terms (doc_id id, uint64_t terms)
 Sets the number of unique terms for a document.
 
const io::mmap_file & postings () const
 
uint64_t total_unique_terms () const
 
label_id doc_label_id (doc_id id) const
 
std::vector< class_label > class_labels () const
 

Public Attributes

friend disk_index
 Friend declaration that gives the disk_index interface access to this implementation.
 

Static Public Attributes

static const std::vector< const char * > files
 Filenames used in the index.
 

Private Member Functions

label_id get_label_id (const class_label &lbl)
 

Private Attributes

std::string index_name_
 the location of this index
 
util::optional< string_list > doc_id_mapping_
 doc_id -> document path mapping.
 
util::optional< util::disk_vector< double > > doc_sizes_
 doc_id -> document length mapping.
 
util::optional< util::disk_vector< label_id > > labels_
 Maps which class a document belongs to (if any).
 
util::optional< util::disk_vector< uint64_t > > unique_terms_
 Holds how many unique terms there are per-document.
 
util::optional< vocabulary_map > term_id_mapping_
 Maps string terms to term_ids.
 
util::invertible_map< class_label, label_id > label_ids_
 Assigns an integer to each class label (used for liblinear mappings)
 
util::optional< io::mmap_file > postings_
 An optional handle to a memory-mapped postings file.
 
std::mutex mutex_
 mutex for thread-safe operations
 

Detailed Description

The implementation of a disk_index.

Member Function Documentation

void meta::index::disk_index::disk_index_impl::initialize_metadata ( uint64_t  num_docs = 0)

Initializes the following metadata maps: doc_sizes_, labels_, unique_terms_.

Parameters
num_docs    The number of documents stored in the index

void meta::index::disk_index::disk_index_impl::load_doc_sizes ( uint64_t num_docs = 0 )

Loads the doc sizes.

Parameters
num_docs    The number of documents stored in the index

void meta::index::disk_index::disk_index_impl::load_labels ( uint64_t num_docs = 0 )

Loads the doc labels.

Parameters
num_docs    The number of documents stored in the index

void meta::index::disk_index::disk_index_impl::load_unique_terms ( uint64_t num_docs = 0 )

Loads the unique terms per document.

Parameters
num_docs    The number of documents stored in the index

string_list_writer meta::index::disk_index::disk_index_impl::make_doc_id_writer ( uint64_t num_docs ) const

Creates a string_list_writer for writing the docids mapping.

Parameters
num_docs    The number of documents stored in the index, used as the size of the string_list_writer
Returns
the string_list_writer to write doc ids

void meta::index::disk_index::disk_index_impl::set_label ( doc_id id, const class_label & label )

Sets the label for a document.

Parameters
id       The document id
label    The new label

void meta::index::disk_index::disk_index_impl::set_length ( doc_id id, uint64_t length )

Sets the size of a document.

Parameters
id        The document id
length    The number of terms that will appear in the document

void meta::index::disk_index::disk_index_impl::set_unique_terms ( doc_id id, uint64_t terms )

Sets the number of unique terms for a document.

Parameters
id       The document id
terms    The number of unique terms that will appear in the document

const io::mmap_file & meta::index::disk_index::disk_index_impl::postings ( ) const

Returns
the mmap file for the postings.

uint64_t meta::index::disk_index::disk_index_impl::total_unique_terms ( ) const

Returns
the total number of unique terms in the index.

label_id meta::index::disk_index::disk_index_impl::doc_label_id ( doc_id id ) const

Parameters
id    The document id
Returns
the label id for a given document.

std::vector< class_label > meta::index::disk_index::disk_index_impl::class_labels ( ) const

Returns
the possible class labels for this index

label_id meta::index::disk_index::disk_index_impl::get_label_id ( const class_label & lbl )
private

Parameters
lbl    the string class label to find the id for
Returns
the label_id of a class_label, creating a new one if necessary
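
The "look up, or create if missing" behavior described for get_label_id can be illustrated with a small two-way map. This is only a sketch under stated assumptions, not MeTA's implementation: label_table, get_or_create, and label_for are hypothetical names, class_label and label_id are replaced by simple aliases, and numbering ids from 1 is an arbitrary choice made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for MeTA's class_label and label_id types.
using class_label = std::string;
using label_id = std::uint32_t;

// Minimal two-way map, loosely modeled on the role label_ids_ plays.
class label_table
{
  public:
    // Returns the id for lbl, assigning the next unused id on first sight.
    label_id get_or_create(const class_label& lbl)
    {
        auto it = forward_.find(lbl);
        if (it != forward_.end())
            return it->second;
        // Assumption for this sketch: ids start at 1 and grow by one.
        auto next = static_cast<label_id>(forward_.size() + 1);
        forward_.emplace(lbl, next);
        backward_.emplace(next, lbl);
        return next;
    }

    // Reverse lookup; the id must already have been handed out.
    const class_label& label_for(label_id id) const
    {
        auto it = backward_.find(id);
        assert(it != backward_.end());
        return it->second;
    }

    std::size_t size() const { return forward_.size(); }

  private:
    std::unordered_map<class_label, label_id> forward_;
    std::unordered_map<label_id, class_label> backward_;
};
```

Asking for the same label twice yields the same id, so the table only grows when a previously unseen label arrives.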

Member Data Documentation

const std::vector< const char * > meta::index::disk_index::disk_index_impl::files
static
Initial value:
= {"/docids.mapping", "/docids.mapping_index", "/docsizes.counts",
"/docs.labels", "/docs.uniqueterms", "/labelids.mapping",
"/postings.index", "/termids.mapping", "/termids.mapping.inverse"}

Filenames used in the index.
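
Because every entry begins with a slash, the names look like suffixes meant to be appended to the index location (index_name_). The sketch below illustrates that reading only; index_paths is a hypothetical helper, and the assumption that paths are formed by plain concatenation is not confirmed by this page.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Copy of the suffix list documented above.
static const std::vector<const char*> index_files
    = {"/docids.mapping", "/docids.mapping_index", "/docsizes.counts",
       "/docs.labels", "/docs.uniqueterms", "/labelids.mapping",
       "/postings.index", "/termids.mapping", "/termids.mapping.inverse"};

// Hypothetical helper: builds one full path per index file.
// Assumption: each suffix is appended directly to the index directory name.
std::vector<std::string> index_paths(const std::string& index_name)
{
    assert(!index_name.empty());
    std::vector<std::string> paths;
    paths.reserve(index_files.size());
    for (const char* suffix : index_files)
        paths.push_back(index_name + suffix);
    return paths;
}
```

A caller could walk the returned list to check that every file of an existing index is present before trying to load it.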

util::optional<string_list> meta::index::disk_index::disk_index_impl::doc_id_mapping_
private

doc_id -> document path mapping.

Each index corresponds to a doc_id (uint64_t).

util::optional<util::disk_vector<double> > meta::index::disk_index::disk_index_impl::doc_sizes_
private

doc_id -> document length mapping.

Each index corresponds to a doc_id (uint64_t).

util::optional<util::disk_vector<label_id> > meta::index::disk_index::disk_index_impl::labels_
private

Maps which class a document belongs to (if any).

Each index corresponds to a doc_id (uint64_t).

util::optional<util::disk_vector<uint64_t> > meta::index::disk_index::disk_index_impl::unique_terms_
private

Holds how many unique terms there are per-document.

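This is loosely the inverse view of IDF: it counts distinct terms per document rather than documents per term. For a forward_index the field is redundant, since the count can be derived from the postings, but storing it avoids querying the postings file. Each index corresponds to a doc_id (uint64_t).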
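
As an illustration of the two per-document counts that set_length and set_unique_terms record, the sketch below derives both from a list of tokens. doc_stats and compute_stats are hypothetical names introduced for this example; how MeTA actually tokenizes and counts is not shown on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical record of the two per-document counts.
struct doc_stats
{
    std::uint64_t length;       // total number of term occurrences
    std::uint64_t unique_terms; // number of distinct terms
};

// Computes both counts from an already tokenized document.
doc_stats compute_stats(const std::vector<std::string>& tokens)
{
    std::unordered_set<std::string> distinct(tokens.begin(), tokens.end());
    // A document can never have more distinct terms than tokens.
    assert(distinct.size() <= tokens.size());
    return {static_cast<std::uint64_t>(tokens.size()),
            static_cast<std::uint64_t>(distinct.size())};
}
```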

util::optional<io::mmap_file> meta::index::disk_index::disk_index_impl::postings_
private

An optional handle to a memory-mapped postings file.

It is wrapped in util::optional because, in some cases, its initialization must be delayed until the postings file has been created.
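
The delayed-initialization pattern can be sketched with std::optional from C++17 standing in for util::optional. fake_mapping and lazy_holder are hypothetical types written for this example; the real io::mmap_file interface is not reproduced here.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for a resource that cannot be opened
// until its backing file exists.
struct fake_mapping
{
    explicit fake_mapping(const std::string& path) : path_{path} {}
    std::string path_;
};

class lazy_holder
{
  public:
    bool loaded() const { return mapping_.has_value(); }

    // Constructs the resource in place once the file is known to exist.
    void load(const std::string& path) { mapping_.emplace(path); }

    // Access is only valid after load() has been called.
    const fake_mapping& get() const
    {
        assert(mapping_.has_value());
        return *mapping_;
    }

  private:
    std::optional<fake_mapping> mapping_; // empty until load() is called
};
```

The holder can be default-constructed while the index is still being built, and the mapping is created only when load() runs.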

