Analyzers, Tokenizers, and Filters
The first step in creating an index over any sort of text data is the
“tokenization” process. At a high level, this simply means converting your
individual text documents into sparse vectors of counts of terms—these
sparse vectors are then typically consumed by an indexer to output an
inverted_index
over your corpus.
MeTA structures this text analysis process into several layers in order to give you as much power and control over the way your text is analyzed as possible. The important components are:

- analyzers, which take documents from the corpus and convert their content into sparse vectors of counts, storing that data back in the document,
- tokenizers, which take a document’s content and split it into a stream of tokens, and
- filters, which take a stream of tokens, perform operations on them (like stemming, filtering, and other mutations), and also produce a stream of tokens.
An analyzer, in most cases, will take a “filter chain” that is used to generate the final tokens for its tokenization process: the filter chains are always defined as a specific tokenizer class followed by a sequence of 0 or more filter classes, each of which reads from the previous class’s output. For example, here is a simple filter chain that lowercases all tokens and only keeps tokens with a certain length:

icu_tokenizer -> lowercase_filter -> length_filter
Using the Default Filter Chain
MeTA defines a “sane default” filter chain that you are encouraged to use for general text analysis in the absence of any specific requirements. To use it, you should specify the following in your configuration file:
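A minimal sketch of that block; note that the exact name of the default chain varies between releases, so check it against the config.toml bundled with your version of MeTA:

```toml
[[analyzers]]
method = "ngram-word"
ngram = 1
# "default-unigram-chain" in recent releases; older releases use "default-chain"
filter = "default-unigram-chain"
```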
This configures your text analysis process to consider unigrams of words generated by running each of your documents through the default filter chain. This filter chain should work well for most languages, as all of its operations (including but not limited to tokenization and sentence boundary detection) are defined in terms of the Unicode standard wherever possible.
To consider both unigrams and bigrams, your configuration file should look like the following:
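Along these lines, simply adding a second analyzer block with ngram = 2 (again, use whatever default-chain name your release expects):

```toml
[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-unigram-chain"

[[analyzers]]
method = "ngram-word"
ngram = 2
filter = "default-unigram-chain"
```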
Each [[analyzers]]
block defines a single analyzer and its corresponding
filter chain: you can use as many as you would like—the tokens generated
by each analyzer you specified will be counted and placed in a single
sparse vector of counts. This is useful for combining multiple different
kinds of features together into your document representation. For example,
the following configuration would combine unigram words, bigram
part-of-speech tags, tree skeleton features, and subtree features.
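A hedged sketch of what such a configuration might look like. The method names match the analyzers named above, but the filter specifications and the model-path keys (crf-prefix, tagger, parser, features) are modeled on the sample config.toml shipped with MeTA and may differ in your release, so treat this as illustrative:

```toml
[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-unigram-chain"

[[analyzers]]
method = "ngram-pos"
ngram = 2
filter = [{type = "icu-tokenizer"}, {type = "ptb-normalizer"}]
crf-prefix = "crf"                # path to the downloaded CRF tagging model

[[analyzers]]
method = "tree"
filter = [{type = "icu-tokenizer"}, {type = "ptb-normalizer"}]
features = ["skel", "subtree"]    # tree skeleton and subtree features
tagger = "perceptron-tagger"      # paths to the downloaded tagger/parser models
parser = "parser"
```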
The path to the models in the tree
and ngram-pos
analyzers is wherever you
put the files downloaded from the current
release.
Getting Creative: Specifying Your Own Filter Chain
If your application requires specific text analysis operations, you can
specify directly what your filter chain should look like by modifying your
configuration file. Instead of filter
being a string parameter as above,
we will change filter
to look very much like the [[analyzers]]
blocks:
each analyzer
will have a series of [[analyzers.filter]]
blocks, each
of which defines a step in the filter chain. All filter chains must start
with a tokenizer. Here is an example filter chain for unigram words like
the one at the beginning of this tutorial:
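A sketch mirroring the icu_tokenizer -> lowercase_filter -> length_filter chain shown earlier. The type strings come from each class’s static id member (see the note below), and the length filter’s min/max options are illustrative, so verify them against the API documentation for your release:

```toml
[[analyzers]]
method = "ngram-word"
ngram = 1
    [[analyzers.filter]]
    type = "icu-tokenizer"

    [[analyzers.filter]]
    type = "lowercase"

    [[analyzers.filter]]
    type = "length"
    min = 2
    max = 35
```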
MeTA provides many different classes to support building filter chains.
Please look at the API documentation for more information. In particular,
the analyzers::tokenizers
namespace and the
analyzers::filters
namespace
should give you a good idea of the capabilities—the static public
attribute id
for a given class is the string you need to use for the
“type” in the configuration file.
Extending MeTA With Your Own Filters
In certain situations, you may want to do more complex text analysis by defining your own components to plug into the filter chain. To do this, you should first determine what kind of component you want to add.
- Add an analyzer if you want to define an entirely new kind of token (e.g., tree features).
- Add a tokenizer if you want to change the way that tokens are generated directly using the document’s plain-text content.
- Add a filter if you want to mutate, remove, or inject tokens into an existing stream of tokens.
Adding an Analyzer
To define your own analyzer is to specify your own mechanism for document
tokenization entirely. This is typically done in cases where analyzing the
text of a document directly is not sufficient (or not meaningful). A good
example of the need for a new analyzer is the existing libsvm_analyzer, which tokenizes documents whose content is actually already pre-processed to be in the standard libsvm format. Other examples include the subclasses of tree_analyzer, which operate on pre-processed document trees.
Adding your own analyzer is relatively straightforward: you should subclass from analyzer and implement the tokenize(corpus::document) method. One slight caveat to be aware of is that analyzers are required to be clonable by the internal implementation, but this is easily solved by adapting your subclassing specification to inherit through the util::clonable mixin rather than from analyzer directly, and by providing a valid copy constructor (see the sketch below). The polymorphic cloning facility is then taken care of by the base analyzer combined with the util::clonable mixin.
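A minimal declaration sketch, assuming the 2.x-era tokenize(corpus::document&) interface described here; the class name my_analyzer is just a placeholder, and the header paths and exact type of id may differ between MeTA versions:

```cpp
// header paths may differ between MeTA versions
#include "analyzers/analyzer.h"
#include "corpus/document.h"
#include "util/clonable.h"

// Instead of inheriting from analyzer directly,
//     class my_analyzer : public analyzer { ... };
// route the inheritance through the util::clonable mixin:
class my_analyzer
    : public meta::util::clonable<meta::analyzers::analyzer, my_analyzer>
{
  public:
    my_analyzer() = default;
    my_analyzer(const my_analyzer& other) = default; // required: valid copy constructor

    // unique string identifying this analyzer to the factory
    const static std::string id;

    // increments feature counts on the given document
    void tokenize(meta::corpus::document& doc) override;
};
```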
Your tokenize method is responsible for incrementing the counts of your features in the corpus::document object given to the method. Features are identified by unique strings.
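For illustration only, a sketch of such a method, assuming the document class exposes an increment(term, amount) member for bumping feature counts (as the stock analyzers use); the feature name here is made up:

```cpp
void my_analyzer::tokenize(meta::corpus::document& doc)
{
    // hypothetical feature: a single count keyed by a unique string;
    // a real analyzer would typically emit many such feature strings
    doc.increment("my-feature", 1);
}
```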
Your analyzer
object will be a thread-local instance during
indexing, so be aware that member variables are not shared across threads,
and that access to any static member variables should be properly
synchronized. We strongly encourage stateless analyzers (that is, analyzers that are capable of operating on a single document at a time without keeping context information).
To be able to use your analyzer by specifying it in a configuration file,
it must be registered with the factory. You can do this by calling the
following function in main()
somewhere before you create your index:
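A sketch of that call, assuming the register_analyzer helper from MeTA's analyzer factory (the header path and namespace may vary slightly between versions):

```cpp
#include "analyzers/analyzer_factory.h" // header path may differ by version

int main(int argc, char* argv[])
{
    // register before any index is created so the factory can resolve
    // my_analyzer::id from the configuration file
    meta::analyzers::register_analyzer<my_analyzer>();

    // ... load config, create the index, etc. ...
    return 0;
}
```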
The class my_analyzer
should also have a static member id
that
specifies the string that should be used to identify that analyzer to the
factory—this id must be unique.
If you require special construction behavior (beyond default
construction), you may specialize the make_analyzer()
function for your
specific analyzer class to extract additional information from the
configuration file: that specialization would look something like this:
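A sketch of that specialization, matching the two-parameter shape described next; the cpptoml::table configuration type and exact namespaces are assumptions that may differ between MeTA releases, and the my-option key is purely hypothetical:

```cpp
#include <memory>
#include <stdexcept>
#include "analyzers/analyzer_factory.h" // header paths may differ by version
#include "cpptoml.h"

namespace meta
{
namespace analyzers
{
template <>
std::unique_ptr<analyzer>
    make_analyzer<my_analyzer>(const cpptoml::table& global,
                               const cpptoml::table& config)
{
    // `global` is the whole configuration file; `config` is this analyzer's block.
    // Read a (hypothetical) extra option from the analyzer's own block:
    auto option = config.get_as<std::string>("my-option");
    if (!option)
        throw std::runtime_error{"my-analyzer requires my-option in its block"};
    return std::make_unique<my_analyzer>(*option);
}
}
}
```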
The first parameter is the configuration group for the entire configuration file, and the second parameter is the local configuration group for your analyzer block. Generally, you will only use the local configuration group unless you need to read some global paths from the main configuration file.
Adding a Tokenizer
To define your own tokenizer is to specify a new mechanism for initially
separating the textual content of a document into a series of discrete
“tokens”. These tokens may be modified later via filters (they may be
split, removed, or otherwise modified), but a tokenizer’s job is to do
this initial separation work. Creating a new tokenizer should be a
relatively rare occurrence, as the existing icu_tokenizer
should perform
well for most languages due to its adherence to the Unicode standard (and
its related annexes).
Adding your own tokenizer is very similar to adding an analyzer: you need
to subclass token_stream
now, and the same clonable caveat remains, so
your declaration should look something like this:
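Something like the following, again using the util::clonable mixin; the token_stream member function signatures shown here (in particular set_content taking a const std::string&) follow the 2.x-era API and may differ in newer releases:

```cpp
// header paths may differ between MeTA versions
#include <string>
#include "analyzers/token_stream.h"
#include "util/clonable.h"

class my_tokenizer
    : public meta::util::clonable<meta::analyzers::token_stream, my_tokenizer>
{
  public:
    my_tokenizer() = default;
    my_tokenizer(const my_tokenizer& other) = default; // valid copy constructor

    // unique string identifying this tokenizer to the factory ("type" in config)
    const static std::string id;

    // virtual interface of token_stream (see the list below)
    std::string next() override;
    void set_content(const std::string& content) override;
    operator bool() const override;

  private:
    std::string content_; // hypothetical internal state
    std::size_t pos_ = 0;
};
```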
Remember to provide a valid copy constructor!
Your tokenizer class should implement the virtual methods of the token_stream class:

- next() obtains the next token in the sequence
- set_content() changes the underlying content being tokenized
- operator bool() determines if there are more tokens left in your token stream
To be able to use your tokenizer by specifying it in a configuration file,
it must be registered with the factory. You can do this by calling the
following function in main()
somewhere before you create your index:
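A sketch, assuming the register_tokenizer helper (commonly found alongside the filter factory; the header path may vary by version):

```cpp
#include "analyzers/filter_factory.h" // header path may differ by version

int main(int argc, char* argv[])
{
    // register before creating the index so my_tokenizer::id can be
    // resolved from the configuration file
    meta::analyzers::register_tokenizer<my_tokenizer>();

    // ... create the index, etc. ...
    return 0;
}
```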
The class my_tokenizer
should also have a static member id
that
specifies the string to be used to identify that tokenizer to the
factory—this id must be unique.
If you require special construction behavior (beyond default
construction), you may specialize the make_tokenizer()
function for your
specific tokenizer class to extract additional information from the
configuration file: that specialization would look something like this:
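A sketch of that specialization, taking the single configuration block described next; the cpptoml::table type, the return type, and the my-option key (plus the matching my_tokenizer constructor) are assumptions that may differ in your release:

```cpp
#include <memory>
#include <stdexcept>
#include "analyzers/filter_factory.h" // header paths may differ by version
#include "cpptoml.h"

namespace meta
{
namespace analyzers
{
template <>
std::unique_ptr<token_stream>
    make_tokenizer<my_tokenizer>(const cpptoml::table& config)
{
    // read a (hypothetical) extra option from this tokenizer's filter block
    auto option = config.get_as<std::string>("my-option");
    if (!option)
        throw std::runtime_error{"my-tokenizer requires my-option"};
    // assumes my_tokenizer has a constructor taking that option
    return std::make_unique<my_tokenizer>(*option);
}
}
}
```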
The configuration group passed to this function is the configuration block for your tokenizer.
Adding a Filter
To add a filter is to specify a new mechanism for transforming existing token streams after they have been created from a document. This should be the most common occurrence, as it’s also the most general and encompasses things like lexical analysis, filtering, stop word removal, stemming, and so on.
Creating a new filter is nearly identical to creating a new tokenizer
class: you will subclass token_stream
(using the util::clonable
mixin)
and implement the virtual functions of token_stream. The major
difference is that a filter class’s constructor takes as its first
parameter the token_stream
to read from (this is passed as a
std::unique_ptr<token_stream>
to signify that your filter class should
take ownership of that source).
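A declaration sketch under those assumptions; the source-cloning copy constructor follows the pattern used by the built-in filters, but verify the exact token_stream signatures against your release:

```cpp
// header paths may differ between MeTA versions
#include <memory>
#include <string>
#include "analyzers/token_stream.h"
#include "util/clonable.h"

class my_filter
    : public meta::util::clonable<meta::analyzers::token_stream, my_filter>
{
  public:
    // takes ownership of the token stream it reads from
    my_filter(std::unique_ptr<meta::analyzers::token_stream> source)
        : source_{std::move(source)}
    {
    }

    // valid copy constructor: clone the wrapped source stream
    my_filter(const my_filter& other) : source_{other.source_->clone()}
    {
    }

    const static std::string id;

    std::string next() override;
    void set_content(const std::string& content) override;
    operator bool() const override;

  private:
    std::unique_ptr<meta::analyzers::token_stream> source_;
};
```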
Registration of a new filter class is done as follows:
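A sketch, assuming a register_filter helper analogous to register_tokenizer above (header path may vary):

```cpp
#include "analyzers/filter_factory.h" // header path may differ by version

int main(int argc, char* argv[])
{
    // register before creating the index so my_filter::id can be
    // resolved from [[analyzers.filter]] blocks in the config file
    meta::analyzers::register_filter<my_filter>();

    // ...
    return 0;
}
```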
And the following is the specialization of the make_filter() function that would be required if you need special construction behavior:
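A sketch of that specialization; note that it receives the source token_stream to wrap in addition to the filter's configuration block, and the my-option key (plus the corresponding my_filter constructor) is hypothetical:

```cpp
#include <memory>
#include <stdexcept>
#include <utility>
#include "analyzers/filter_factory.h" // header paths may differ by version
#include "cpptoml.h"

namespace meta
{
namespace analyzers
{
template <>
std::unique_ptr<token_stream>
    make_filter<my_filter>(std::unique_ptr<token_stream> source,
                           const cpptoml::table& config)
{
    // read a (hypothetical) option from this filter's [[analyzers.filter]] block
    auto option = config.get_as<std::string>("my-option");
    if (!option)
        throw std::runtime_error{"my-filter requires my-option"};
    // assumes my_filter has a constructor taking (source, option)
    return std::make_unique<my_filter>(std::move(source), *option);
}
}
}
```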