ModErn Text Analysis
META Enumerates Textual Applications
Implements a perceptron classifier, but using the dual formulation of the problem. More...
#include <dual_perceptron.h>
Public Member Functions | |
template<class Kernel > | |
dual_perceptron (std::shared_ptr< index::forward_index > idx, Kernel &&kernel_fn=kernel::polynomial{}, double alpha=default_alpha, double gamma=default_gamma, double bias=default_bias, uint64_t max_iter=default_max_iter) | |
Constructs a dual_perceptron classifier over the given index and with the given parameters. More... | |
void | train (const std::vector< doc_id > &docs) override |
Trains the perceptron on the given training documents. More... | |
class_label | classify (doc_id d_id) override |
Classifies the given document. More... | |
void | reset () override |
Resets all learned information for this perceptron so it may be re-learned. | |
Public Member Functions inherited from meta::classify::classifier | |
classifier (std::shared_ptr< index::forward_index > idx) | |
virtual confusion_matrix | test (const std::vector< doc_id > &docs) |
Classifies a collection of documents into specific groups, as determined by training data; this function will make repeated calls to classify(). More... | |
virtual confusion_matrix | cross_validate (const std::vector< doc_id > &input_docs, size_t k, bool even_split=false, int seed=1) |
Performs k-fold cross-validation on a set of documents. More... | |
Static Public Attributes | |
static const constexpr double | default_alpha = 0.1 |
The default \(\alpha\) parameter. | |
static const constexpr double | default_gamma = 0.05 |
The default \(\gamma\) parameter. | |
static const constexpr double | default_bias = 0 |
The default \(b\) parameter. | |
static const constexpr uint64_t | default_max_iter = 100 |
The default number of allowed iterations. | |
static const std::string | id = "dual-perceptron" |
The identifier for this classifier. | |
Private Types | |
using | pdata = decltype(idx_->search_primary(doc_id{})) |
Convenience typedef for the postings data type. | |
Private Member Functions | |
void | decrease_weight (const class_label &label, const doc_id &id) |
Decreases the "weight" (mistake count) for a given class label and document. More... | |
Private Attributes | |
std::unordered_map< class_label, std::unordered_map< doc_id, uint64_t > > | weights_ |
The "weight" (mistake count) vectors for each class label. | |
std::function< double(pdata, pdata)> | kernel_ |
The kernel function to be used in lieu of a dot product. | |
const double | alpha_ |
\(\alpha\), the learning rate | |
const double | gamma_ |
\(\gamma\), the error threshold (in terms of percentage of mistakes on the training data in one iteration of training). | |
const double | bias_ |
\(b\), the bias factor. | |
const uint64_t | max_iter_ |
The maximum number of iterations for training. | |
Additional Inherited Members | |
Protected Attributes inherited from meta::classify::classifier | |
std::shared_ptr< index::forward_index > | idx_ |
The index that the classifier is run on. | |
Implements a perceptron classifier, but using the dual formulation of the problem.
This allows the perceptron to be used for data that is not necessarily linearly separable via the use of a kernel function.
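To make the role of the kernel concrete, here is a minimal, self-contained sketch of a polynomial kernel \(K(a,b) = (a \cdot b + c)^p\). It is not the library's `kernel::polynomial`: the function name `polynomial_kernel`, the dense `std::vector<double>` representation, and the default values for `power` and `c` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative polynomial kernel K(a, b) = (a . b + c)^power.
// The name, the dense vectors, and the defaults are assumptions of this
// sketch, not the library's kernel::polynomial.
double polynomial_kernel(const std::vector<double>& a,
                         const std::vector<double>& b,
                         unsigned power = 2, double c = 1.0)
{
    assert(a.size() == b.size());

    // Ordinary dot product of the two feature vectors.
    double dot = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        dot += a[i] * b[i];

    // Raise (dot + c) to the requested integer power.
    double result = 1.0;
    for (unsigned p = 0; p < power; ++p)
        result *= dot + c;
    return result;
}
```

With `power = 1` and `c = 0` the function reduces to a plain dot product, which corresponds to the ordinary (linear) perceptron; higher powers implicitly add feature interactions without ever building the expanded feature vectors.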
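template<class Kernel > dual_perceptron (std::shared_ptr< index::forward_index > idx, Kernel &&kernel_fn=kernel::polynomial{}, double alpha=default_alpha, double gamma=default_gamma, double bias=default_bias, uint64_t max_iter=default_max_iter) | inline |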
Constructs a dual_perceptron classifier over the given index and with the given parameters.
idx | The index to run the classifier on |
kernel_fn | The kernel function to be used |
alpha | \(\alpha\), the learning rate |
gamma | \(\gamma\), the error threshold (in terms of percentage of mistakes on one training run) |
bias | \(b\), the bias |
max_iter | The maximum allowed iterations for training. |
void train (const std::vector< doc_id > &docs) | override virtual |
Trains the perceptron on the given training documents.
Maintains a set of weight vectors \(w_1,\ldots,w_K\) where \(K\) is the number of classes and updates them for each training document seen in each iteration. This continues until the error threshold is met or the maximum number of iterations is completed.
Unlike the regular perceptron, the dual formulation does not store feature weights: each vector \(w_k\) is a "mistake vector" whose entry for a training instance counts how often that instance was misclassified.
docs | The training set |
Implements meta::classify::classifier.
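The training procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the types `features`, `example`, and `mistake_table`, the free functions `predict` and `train`, and the exact stopping rule (error rate in one pass below `gamma`) are hypothetical stand-ins. The sketch also leaves out \(\alpha\) entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for this sketch; these are not MeTA types.
using features = std::vector<double>;
using class_label = std::string;
using kernel_fn = std::function<double(const features&, const features&)>;
// For each class: training-instance index -> mistake count.
using mistake_table
    = std::map<class_label, std::map<std::size_t, std::uint64_t>>;

struct example
{
    features x;
    class_label y;
};

// Returns the class k maximizing sum_d w_k^d * (K(d, x) + b).
// Ties go to the first class in map order.
class_label predict(const mistake_table& weights,
                    const std::vector<example>& data, const features& x,
                    const kernel_fn& kernel, double bias)
{
    class_label best;
    double best_score = 0.0;
    bool first = true;
    for (const auto& cls : weights)
    {
        double score = 0.0;
        for (const auto& entry : cls.second)
            score += static_cast<double>(entry.second)
                     * (kernel(data[entry.first].x, x) + bias);
        if (first || score > best_score)
        {
            best = cls.first;
            best_score = score;
            first = false;
        }
    }
    return best;
}

// Repeats passes over the data until the per-pass error rate drops below
// gamma or max_iter passes have been made (assumed stopping rule).
mistake_table train(const std::vector<example>& data, const kernel_fn& kernel,
                    double gamma, double bias, std::uint64_t max_iter)
{
    assert(!data.empty());
    mistake_table weights;
    for (const auto& ex : data)
        weights[ex.y]; // make sure every class has a (possibly empty) row

    for (std::uint64_t iter = 0; iter < max_iter; ++iter)
    {
        std::uint64_t errors = 0;
        for (std::size_t i = 0; i < data.size(); ++i)
        {
            const class_label guess
                = predict(weights, data, data[i].x, kernel, bias);
            if (guess == data[i].y)
                continue;
            ++errors;
            // Penalize the wrongly predicted class for this instance...
            auto& wrong = weights[guess];
            auto it = wrong.find(i);
            if (it != wrong.end() && --(it->second) == 0)
                wrong.erase(it);
            // ...and record one more mistake under the true class.
            ++weights[data[i].y][i];
        }
        if (static_cast<double>(errors) / static_cast<double>(data.size())
            < gamma)
            break;
    }
    return weights;
}
```

Called on four XOR-style points with the kernel \((a \cdot b + 1)^2\), `gamma = 0.05`, and `bias = 0`, this sketch reaches an error-free pass on its third iteration, something a linear perceptron cannot do on that data.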
class_label classify (doc_id d_id) | override virtual |
Classifies the given document.
The class label returned is \(\arg\max_k \sum_d w_k^d \, (K(d,x) + b)\); in other words, the class whose associated weight vector gives the highest result.
doc | The document to be classified |
Implements meta::classify::classifier.
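As a small worked example of the decision rule, suppose there are two classes and two training documents \(d_1, d_2\). The mistake counts, kernel values, and bias below are made-up numbers chosen only for illustration: \(w_1^{d_1} = 2\), \(w_2^{d_2} = 1\), all other counts zero, \(K(d_1,x) = 0.5\), \(K(d_2,x) = 0.8\), and \(b = 0\).

\[
\begin{aligned}
\text{score}_1(x) &= w_1^{d_1}\,(K(d_1,x) + b) = 2 \cdot (0.5 + 0) = 1.0 \\
\text{score}_2(x) &= w_2^{d_2}\,(K(d_2,x) + b) = 1 \cdot (0.8 + 0) = 0.8 \\
\arg\max_k \text{score}_k(x) &= 1
\end{aligned}
\]

Although \(x\) is more similar to \(d_2\) under the kernel, class 1 wins because \(d_1\) was misclassified more often during training and therefore carries a larger count.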
void decrease_weight (const class_label &label, const doc_id &id) | private |
Decreases the "weight" (mistake count) for a given class label and document.
label | The class label |
id | The document |
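One plausible way to implement such a decrement over a sparse table of mistake counts is sketched below. The aliases, the free-function form, and the choice to erase entries that reach zero are assumptions of this sketch; the page above does not specify how the library handles a count of zero.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for this sketch; not the library's definitions.
using class_label = std::string;
using doc_id = std::uint64_t;
using weight_table
    = std::unordered_map<class_label,
                         std::unordered_map<doc_id, std::uint64_t>>;

// Decrements the mistake count for (label, id), never going below zero.
// Entries that reach zero are erased so the table stays sparse (assumed).
void decrease_weight(weight_table& weights, const class_label& label,
                     doc_id id)
{
    auto cls = weights.find(label);
    if (cls == weights.end())
        return; // unknown class: nothing to decrease

    auto it = cls->second.find(id);
    if (it == cls->second.end())
        return; // an absent entry means the count is already zero

    assert(it->second > 0); // stored counts are always positive
    if (--(it->second) == 0)
        cls->second.erase(it);
}
```

Treating a missing entry as zero avoids unsigned underflow and keeps lookups from inserting empty rows for classes that were never seen.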