ModErn Text Analysis
META Enumerates Textual Applications
Classes | Public Member Functions | Private Member Functions | Private Attributes | List of all members
meta::index::vocabulary_map_writer Class Reference

A class that writes the B+-tree-like data structure used for storing the term id mapping in an index. More...

#include <vocabulary_map_writer.h>

Classes

class  vocabulary_map_writer_exception
 An exception that can be thrown during the building of the tree. More...
 

Public Member Functions

 vocabulary_map_writer (const std::string &path, uint16_t block_size=4096)
 Creates a writer for a tree at the given path and block_size. More...
 
 ~vocabulary_map_writer ()
 The destructor for a vocabulary_map_writer flushes the last leaf node and builds the internal structure—as such, it may block for a period of time while finalizing the tree structure. More...
 
void insert (const std::string &term)
 Inserts this term into the map. More...
 

Private Member Functions

void write_padding ()
 Writes null bytes to fill up the current block.
 
void flush ()
 Flushes a node to disk after writing the padding bytes.
 

Private Attributes

std::ofstream file_
 The file containing the forward mapping tree.
 
uint64_t file_write_pos_
 The current write position in the forward mapping tree file. More...
 
std::ofstream inverse_file_
 The file containing the reverse mapping.
 
std::string path_
 The path to the tree file.
 
uint16_t block_size_
 The block size of every node in the tree, in bytes.
 
uint64_t num_terms_
 The total number of terms inserted so far.
 
uint16_t remaining_block_space_
 The remaining space in the block currently being written.
 
uint64_t written_nodes_
 Number of written nodes to be "merged" when writing the next level.
 

Detailed Description

A class that writes the B+-tree-like data structure used for storing the term id mapping in an index.

This class provides a write-only view of the mapping.

The file format consists of two files: one for the "forward" mapping and one for the "backward" mapping.

The "forward" mapping is a structure like a B+-tree, with the exception that each node has a fixed size instead of there always being a fixed number of elements. This is necessary due to the size of the keys being unknown ahead of time; this also allows for some gains in space utilization as we do not pad the keys to be a fixed maximum length. However, the nodes are always block_size bytes large and will be padded with null bytes when needed.

Each internal node is a sorted list of strings and byte positions—depending on the string comparison, the search will pick the appropriate byte position and seek there, recursively. The leaf nodes are a sorted list of strings and id assignments. The search linearly scans the leaf node it arrives at for the term in question.

The "backward" mapping is simply a disk-persisted vector of byte positions, indexed by the term id. Reading a string from the forward mapping's file starting at the given byte position will yield the string for that id.

The mappings created are non-portable and depend on the endianness of the system building them. This may be changed in the future.

This class is not internally synchronized, so external synchronization must be provided if using it in a threaded context to avoid race conditions.

Constructor & Destructor Documentation

meta::index::vocabulary_map_writer::vocabulary_map_writer ( const std::string &  path,
uint16_t  block_size = 4096 
)

Creates a writer for a tree at the given path and block_size.

Changing the block size is not recommended—if doing so, ensure that any and all vocabulary_maps created for the written file are also created with the same modified block size.

Parameters
paththe path to the tree file tow rite
block_sizethe size of the nodes in the tree
meta::index::vocabulary_map_writer::~vocabulary_map_writer ( )

The destructor for a vocabulary_map_writer flushes the last leaf node and builds the internal structure—as such, it may block for a period of time while finalizing the tree structure.

This may be changed in the future.

Member Function Documentation

void meta::index::vocabulary_map_writer::insert ( const std::string &  term)

Inserts this term into the map.

No checking is done for duplicate terms or that the ordering is indeed lexicographically sorted, but these invariants are important to maintain for the tree to work properly.

Parameters
termthe term to insert

Member Data Documentation

uint64_t meta::index::vocabulary_map_writer::file_write_pos_
private

The current write position in the forward mapping tree file.

This is kept to avoid potential 32-bit system issues with the tellp()/tellg() functions on the standard streams.


The documentation for this class was generated from the following files: