ModErn Text Analysis
META Enumerates Textual Applications
|
A class that writes the B+-tree-like data structure used for storing the term id mapping in an index. More...
#include <vocabulary_map_writer.h>
Classes | |
class | vocabulary_map_writer_exception |
An exception that can be thrown during the building of the tree. More... | |
Public Member Functions | |
vocabulary_map_writer (const std::string &path, uint16_t block_size=4096) | |
Creates a writer for a tree at the given path and block_size. More... | |
~vocabulary_map_writer () | |
The destructor for a vocabulary_map_writer flushes the last leaf node and builds the internal structure—as such, it may block for a period of time while finalizing the tree structure. More... | |
void | insert (const std::string &term) |
Inserts this term into the map. More... | |
Private Member Functions | |
void | write_padding () |
Writes null bytes to fill up the current block. | |
void | flush () |
Flushes a node to disk after writing the padding bytes. | |
Private Attributes | |
std::ofstream | file_ |
The file containing the forward mapping tree. | |
uint64_t | file_write_pos_ |
The current write position in the forward mapping tree file. More... | |
std::ofstream | inverse_file_ |
The file containing the reverse mapping. | |
std::string | path_ |
The path to the tree file. | |
uint16_t | block_size_ |
The block size of every node in the tree, in bytes. | |
uint64_t | num_terms_ |
The total number of terms inserted so far. | |
uint16_t | remaining_block_space_ |
The remaining space in the block currently being written. | |
uint64_t | written_nodes_ |
Number of written nodes to be "merged" when writing the next level. | |
A class that writes the B+-tree-like data structure used for storing the term id mapping in an index.
This class provides a write-only view of the mapping.
The file format consists of two files: one for the "forward" mapping and one for the "backward" mapping.
The "forward" mapping is a structure like a B+-tree, with the exception that each node has a fixed size instead of there always being a fixed number of elements. This is necessary due to the size of the keys being unknown ahead of time; this also allows for some gains in space utilization as we do not pad the keys to be a fixed maximum length. However, the nodes are always block_size bytes large and will be padded with null bytes when needed.
Each internal node is a sorted list of strings and byte positions—depending on the string comparison, the search will pick the appropriate byte position and seek there, recursively. The leaf nodes are a sorted list of strings and id assignments. The search linearly scans the leaf node it arrives at for the term in question.
The "backward" mapping is simply a disk-persisted vector of byte positions, indexed by the term id. Reading a string from the forward mapping's file starting at the given byte position will yield the string for that id.
The mappings created are non-portable and depend on the endianness of the system building them. This may be changed in the future.
This class is not internally synchronized, so external synchronization must be provided if using it in a threaded context to avoid race conditions.
meta::index::vocabulary_map_writer::vocabulary_map_writer | ( | const std::string & | path, |
uint16_t | block_size = 4096 |
||
) |
Creates a writer for a tree at the given path and block_size.
Changing the block size is not recommended—if doing so, ensure that any and all vocabulary_maps created for the written file are also created with the same modified block size.
path | the path to the tree file tow rite |
block_size | the size of the nodes in the tree |
meta::index::vocabulary_map_writer::~vocabulary_map_writer | ( | ) |
The destructor for a vocabulary_map_writer flushes the last leaf node and builds the internal structure—as such, it may block for a period of time while finalizing the tree structure.
This may be changed in the future.
void meta::index::vocabulary_map_writer::insert | ( | const std::string & | term | ) |
Inserts this term into the map.
No checking is done for duplicate terms or that the ordering is indeed lexicographically sorted, but these invariants are important to maintain for the tree to work properly.
term | the term to insert |
|
private |
The current write position in the forward mapping tree file.
This is kept to avoid potential 32-bit system issues with the tellp()/tellg()
functions on the standard streams.