ModErn Text Analysis
META Enumerates Textual Applications
An LDA topic model implemented using the Approximate Distributed LDA algorithm.
#include <parallel_lda_gibbs.h>
Public Member Functions | |
virtual | ~parallel_lda_gibbs ()=default |
Destructor: virtual for potential subclassing. | |
Public Member Functions inherited from meta::topics::lda_gibbs | |
lda_gibbs (std::shared_ptr< index::forward_index > idx, uint64_t num_topics, double alpha, double beta) | |
Constructs the lda model over the given documents, with the given number of topics, and hyperparameters \(\alpha\) and \(\beta\) for the priors on \(\theta\) (topic proportions) and \(\phi\) (topic distributions), respectively. | |
virtual | ~lda_gibbs ()=default |
Destructor: virtual for potential subclassing. | |
virtual void | run (uint64_t num_iters, double convergence=1e-6) |
Runs the sampler for a maximum number of iterations, or until the given convergence criterion is met. | |
Public Member Functions inherited from meta::topics::lda_model | |
lda_model (std::shared_ptr< index::forward_index > idx, uint64_t num_topics) | |
Constructs an lda_model over the given set of documents and with a fixed number of topics. | |
virtual | ~lda_model ()=default |
Destructor. | |
void | save_doc_topic_distributions (const std::string &filename) const |
Saves the topic proportions \(\theta_d\) for each document to the given file. | |
void | save_topic_term_distributions (const std::string &filename) const |
Saves the term distributions \(\phi_j\) for each topic to the given file. | |
void | save (const std::string &prefix) const |
Saves the current model to a set of files beginning with prefix: prefix.phi, prefix.theta, and prefix.terms. | |
Protected Member Functions | |
virtual void | initialize () override |
Initializes the first set of topic assignments for inference. | |
virtual void | perform_iteration (uint64_t iter, bool init=false) override |
Performs a sampling iteration of the AD-LDA algorithm. | |
virtual void | decrease_counts (topic_id topic, term_id term, doc_id doc) override |
Decreases all counts associated with the given topic, term, and document by one. | |
virtual void | increase_counts (topic_id topic, term_id term, doc_id doc) override |
Increases all counts associated with the given topic, term, and document by one. | |
virtual double | compute_sampling_weight (term_id term, doc_id doc, topic_id topic) const override |
Computes a weight proportional to \(P(z_i = j | w, \boldsymbol{z})\). | |
Protected Member Functions inherited from meta::topics::lda_gibbs | |
topic_id | sample_topic (term_id term, doc_id doc) |
Samples a topic from the full conditional distribution \(P(z_i = j | w, \boldsymbol{z})\). | |
virtual double | compute_term_topic_probability (term_id term, topic_id topic) const override |
virtual double | compute_doc_topic_probability (doc_id doc, topic_id topic) const override |
double | corpus_log_likelihood () const |
lda_gibbs & | operator= (const lda_gibbs &)=delete |
lda_gibbs cannot be copy assigned. | |
lda_gibbs (const lda_gibbs &other)=delete | |
lda_gibbs cannot be copy constructed. | |
Protected Member Functions inherited from meta::topics::lda_model | |
lda_model & | operator= (const lda_model &)=delete |
lda_models cannot be copy assigned. | |
lda_model (const lda_model &)=delete | |
lda_models cannot be copy constructed. | |
Protected Attributes | |
parallel::thread_pool | pool_ |
The thread pool used for parallelization. | |
std::unordered_map< std::thread::id, std::vector< stats::multinomial< term_id > > > | phi_diffs_ |
Stores the difference in topic_term counts on a per-thread basis for use in the reduction step. | |
Protected Attributes inherited from meta::topics::lda_gibbs | |
std::vector< std::vector< topic_id > > | doc_word_topic_ |
The topic assignment for every word in every document. | |
std::vector< stats::multinomial< term_id > > | phi_ |
The word distributions for each topic, \(\phi_t\). | |
std::vector< stats::multinomial< topic_id > > | theta_ |
The topic distributions for each document, \(\theta_d\). | |
std::mt19937_64 | rng_ |
The random number generator for the sampler. | |
Protected Attributes inherited from meta::topics::lda_model | |
std::shared_ptr< index::forward_index > | idx_ |
The index containing the documents for the model. | |
size_t | num_topics_ |
The number of topics. | |
size_t | num_words_ |
The total number of unique words. | |
An LDA topic model implemented using the Approximate Distributed LDA algorithm.
Based on the algorithm detailed by David Newman et al.
void initialize () [override, protected, virtual]
Initializes the first set of topic assignments for inference.
Employs an online application of the sampler where counts are only considered for the words observed so far through the loop.
Reimplemented from meta::topics::lda_gibbs.
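The online initialization described above can be illustrated with a minimal serial sketch. It is not the MeTA implementation: `online_state` and `online_initialize` are hypothetical names, documents are plain vectors of term ids, and the weight is assumed to follow the usual collapsed Gibbs form, with \(\alpha\) smoothing the document-topic counts and \(\beta\) the topic-term counts. The point is only that each token is sampled against the counts of the tokens seen before it, and that nothing is decremented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical container for the sampler state; not a MeTA type.
struct online_state
{
    std::vector<std::vector<std::size_t>> assignment; // topic of each token
    std::vector<std::vector<long>> topic_term;        // counts [topic][term]
    std::vector<long> topic_total;                    // counts [topic]
    std::vector<std::vector<long>> doc_topic;         // counts [doc][topic]
};

// Assigns a first topic to every token, in document order.
online_state online_initialize(const std::vector<std::vector<std::size_t>>& docs,
                               std::size_t num_topics, std::size_t num_words,
                               double alpha, double beta, std::uint64_t seed)
{
    assert(num_topics > 0 && num_words > 0 && alpha > 0 && beta > 0);
    online_state s;
    s.assignment.resize(docs.size());
    s.topic_term.assign(num_topics, std::vector<long>(num_words, 0));
    s.topic_total.assign(num_topics, 0);
    s.doc_topic.assign(docs.size(), std::vector<long>(num_topics, 0));

    std::mt19937_64 rng{seed};
    std::vector<double> weights(num_topics);
    for (std::size_t d = 0; d < docs.size(); ++d)
    {
        for (std::size_t w : docs[d])
        {
            assert(w < num_words);
            // Weights use only the counts accumulated from earlier tokens.
            for (std::size_t t = 0; t < num_topics; ++t)
            {
                weights[t] = (s.topic_term[t][w] + beta)
                             / (s.topic_total[t] + num_words * beta)
                             * (s.doc_topic[d][t] + alpha);
            }
            std::discrete_distribution<std::size_t> dist(weights.begin(),
                                                         weights.end());
            const std::size_t topic = dist(rng);
            // No decrement step: the token has no previous assignment.
            s.assignment[d].push_back(topic);
            ++s.topic_term[topic][w];
            ++s.topic_total[topic];
            ++s.doc_topic[d][topic];
        }
    }
    return s;
}
```

With all counts at zero, the first token sees uniform weights, so its topic is drawn uniformly; later tokens are increasingly shaped by the assignments made before them.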
void perform_iteration (uint64_t iter, bool init=false) [override, protected, virtual]
Performs a sampling iteration of the AD-LDA algorithm.
This consists of splitting up the sampling of (document, word) topic assignments across threads, keeping for each thread a difference in counts for the potentially shared topic counts. Once the sampling has finished, the counts are reduced down (serially) before the iteration is completed.
iter | The current iteration number |
init | Whether or not this iteration should use the online method for initializing the sampler |
Reimplemented from meta::topics::lda_gibbs.
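The split, sample, and reduce structure described above can be sketched as follows. This is a toy illustration under stated assumptions, not the code in this class: `counts`, `toy_lda`, `sample_shard`, and `ad_lda_iteration` are hypothetical names, plain `std::thread` workers stand in for the thread pool, documents are partitioned into contiguous shards, and the per-thread differences live in a vector indexed by worker number rather than in a map keyed by thread id.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical count tables; used for the shared counts and for each diff.
struct counts
{
    std::vector<std::vector<long>> topic_term; // [topic][term]
    std::vector<long> topic_total;             // [topic]
    counts(std::size_t topics, std::size_t words)
        : topic_term(topics, std::vector<long>(words, 0)), topic_total(topics, 0) {}
};

// Hypothetical toy model; topics start out assigned round-robin.
struct toy_lda
{
    std::size_t num_topics, num_words;
    double alpha, beta;
    std::vector<std::vector<std::size_t>> docs;       // term ids per document
    std::vector<std::vector<std::size_t>> assignment; // topic per token
    std::vector<std::vector<long>> doc_topic;         // [doc][topic]
    counts shared;                                    // read-only while sampling

    toy_lda(std::vector<std::vector<std::size_t>> d, std::size_t topics,
            std::size_t words, double a, double b)
        : num_topics(topics), num_words(words), alpha(a), beta(b),
          docs(std::move(d)), assignment(docs.size()),
          doc_topic(docs.size(), std::vector<long>(topics, 0)), shared(topics, words)
    {
        std::size_t next = 0;
        for (std::size_t i = 0; i < docs.size(); ++i)
            for (std::size_t w : docs[i])
            {
                const std::size_t t = next++ % num_topics;
                assignment[i].push_back(t);
                ++doc_topic[i][t];
                ++shared.topic_term[t][w];
                ++shared.topic_total[t];
            }
    }
};

// Resamples documents [begin, end); shared-count changes go only into diff.
void sample_shard(toy_lda& m, counts& diff, std::size_t begin, std::size_t end,
                  std::uint64_t seed)
{
    std::mt19937_64 rng{seed};
    std::vector<double> weights(m.num_topics);
    for (std::size_t d = begin; d < end; ++d)
        for (std::size_t i = 0; i < m.docs[d].size(); ++i)
        {
            const std::size_t w = m.docs[d][i];
            const std::size_t old_topic = m.assignment[d][i];
            --m.doc_topic[d][old_topic]; // rows of doc_topic belong to one thread
            --diff.topic_term[old_topic][w];
            --diff.topic_total[old_topic];
            for (std::size_t t = 0; t < m.num_topics; ++t)
            {
                // This thread's view: shared counts plus its own pending diff.
                const long n_tw = m.shared.topic_term[t][w] + diff.topic_term[t][w];
                const long n_t = m.shared.topic_total[t] + diff.topic_total[t];
                weights[t] = (n_tw + m.beta) / (n_t + m.num_words * m.beta)
                             * (m.doc_topic[d][t] + m.alpha);
            }
            std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
            const std::size_t new_topic = dist(rng);
            m.assignment[d][i] = new_topic;
            ++m.doc_topic[d][new_topic];
            ++diff.topic_term[new_topic][w];
            ++diff.topic_total[new_topic];
        }
}

// One iteration: parallel sampling, then a serial reduction of the diffs.
void ad_lda_iteration(toy_lda& m, std::size_t num_threads, std::uint64_t seed)
{
    assert(num_threads > 0);
    std::vector<counts> diffs(num_threads, counts(m.num_topics, m.num_words));
    std::vector<std::thread> workers;
    const std::size_t n = m.docs.size();
    for (std::size_t k = 0; k < num_threads; ++k)
        workers.emplace_back(sample_shard, std::ref(m), std::ref(diffs[k]),
                             n * k / num_threads, n * (k + 1) / num_threads, seed + k);
    for (auto& worker : workers)
        worker.join();
    for (const counts& diff : diffs) // serial reduction step
        for (std::size_t t = 0; t < m.num_topics; ++t)
        {
            m.shared.topic_total[t] += diff.topic_total[t];
            for (std::size_t w = 0; w < m.num_words; ++w)
                m.shared.topic_term[t][w] += diff.topic_term[t][w];
        }
}
```

Because no thread writes to the shared counts while sampling, the workers need no locks. The cost is the approximation that gives the algorithm its name: within an iteration, each thread samples against counts that omit the changes made concurrently by the other threads.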
void decrease_counts (topic_id topic, term_id term, doc_id doc) [override, protected, virtual]
Decreases all counts associated with the given topic, term, and document by one.
topic | The topic in question |
term | The term in question |
doc | The document in question |
Reimplemented from meta::topics::lda_gibbs.
void increase_counts (topic_id topic, term_id term, doc_id doc) [override, protected, virtual]
Increases all counts associated with the given topic, term, and document by one.
topic | The topic in question |
term | The term in question |
doc | The document in question |
Reimplemented from meta::topics::lda_gibbs.
double compute_sampling_weight (term_id term, doc_id doc, topic_id topic) const [override, protected, virtual]
Computes a weight proportional to \(P(z_i = j | w, \boldsymbol{z})\).
term | The current word we are sampling for |
doc | The document in which the term resides |
topic | The topic \(j\) we want to compute the probability for |
Reimplemented from meta::topics::lda_gibbs.
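For reference, the standard collapsed Gibbs weight for LDA is \(P(z_i = j \mid w, \boldsymbol{z}_{-i}) \propto \frac{n_{j,w} + \beta}{n_j + V\beta}\,(n_{d,j} + \alpha)\), where every count excludes the token being resampled and \(V\) is the vocabulary size. The sketch below evaluates that expression from raw counts. It is an assumption that this is the form being computed: the page states only that the weight is proportional to the full conditional, and `sampling_weight` with its parameters is a hypothetical free function, not a member of this class.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: unnormalized weight for assigning one token to a topic.
// All counts must already exclude the token being resampled.
double sampling_weight(long term_topic_count, // n_{j,w}: term w assigned to topic j
                       long topic_count,      // n_j: all tokens assigned to topic j
                       long doc_topic_count,  // n_{d,j}: tokens of doc d in topic j
                       std::size_t num_words, // V: vocabulary size
                       double alpha, double beta)
{
    assert(alpha > 0 && beta > 0 && num_words > 0);
    assert(term_topic_count >= 0 && topic_count >= term_topic_count);
    // Smoothed probability of the term under the topic.
    const double term_given_topic
        = (term_topic_count + beta) / (topic_count + num_words * beta);
    // Smoothed, unnormalized affinity of the document for the topic.
    const double topic_given_doc = doc_topic_count + alpha;
    return term_given_topic * topic_given_doc;
}
```

The document-side normalizer is omitted because it is the same for every topic \(j\), so it cancels when the weights are normalized for sampling.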
std::unordered_map< std::thread::id, std::vector< stats::multinomial< term_id > > > phi_diffs_ [protected]
Stores the difference in topic_term counts on a per-thread basis for use in the reduction step.
Indexed as [thread_id][topic]
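A minimal sketch of a table with this layout follows. Plain integer counts stand in for `stats::multinomial< term_id >`, and `count_matrix`, `diff_table`, `record_change`, and `reduce_diffs` are hypothetical names chosen for illustration. Each worker looks up its own entry with `std::this_thread::get_id()`, and a single thread later folds every entry into the shared counts.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <unordered_map>
#include <vector>

using count_matrix = std::vector<std::vector<long>>;                  // [topic][term]
using diff_table = std::unordered_map<std::thread::id, count_matrix>; // [thread_id][topic]

// Records a change in the calling thread's private diff.
// The entry for this thread must have been created beforehand.
void record_change(diff_table& diffs, std::size_t topic, std::size_t term, long delta)
{
    auto it = diffs.find(std::this_thread::get_id());
    assert(it != diffs.end());
    it->second[topic][term] += delta;
}

// Serial reduction: folds every thread's diff into the shared counts,
// then zeroes the diffs so they can be reused in the next iteration.
void reduce_diffs(count_matrix& shared, diff_table& diffs)
{
    for (auto& entry : diffs)
        for (std::size_t t = 0; t < shared.size(); ++t)
            for (std::size_t w = 0; w < shared[t].size(); ++w)
            {
                shared[t][w] += entry.second[t][w];
                entry.second[t][w] = 0;
            }
}
```

Inserting into a `std::unordered_map` from several threads at once is a data race, so in a layout like this one the entry for each worker has to exist before sampling starts. During sampling each thread then only looks up and modifies its own entry.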