Part of Speech Tagging
MeTA also provides models that can be used for part-of-speech tagging. At the moment, these models are designed for tagging English text, but they can be trained for any language once appropriate feature extractors are defined.
To use these models, you should download a tagger model file from the releases page on the GitHub repository. MeTA currently has two different POS-tagger models available:
- A linear-chain conditional random field (`meta::sequence::crf`)
- An averaged Perceptron greedy tagger (`meta::sequence::perceptron`)
Using Taggers
Using the CRF
First, extract your model files into a directory. You should modify your `config.toml` to contain a `[crf]` group like so:
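A minimal sketch of that group (the exact key set may vary between releases):

```toml
[crf]
prefix = "crf" # assumed path; use the folder you extracted into
```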
where `prefix` has been set to the folder that contains the model files you extracted.
Interactive tagging
You can interactively tag sentences using the provided `pos-tag` tool.
This application will load the CRF model, and then proceed to tag sentences typed in at the prompt. You can stop tagging by simply inputting a blank line.
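For example, assuming the tool follows the usual MeTA convention of taking the path to your configuration file as its argument:

```bash
./pos-tag config.toml
```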
As an analyzer
The CRF model can also be used as an analyzer during index creation to create features based on n-grams of part-of-speech tags. To do so, you would need to add an analyzer group to your configuration that looks like the following:
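A sketch of such a group (the method name, filter types, and `crf-prefix` key here are assumptions based on MeTA's analyzer configuration format, so double-check them against the analyzers tutorial):

```toml
[[analyzers]]
method = "ngram-pos"
ngram = 2
filter = [{type = "icu-tokenizer"}, {type = "ptb-normalizer"}]
crf-prefix = "crf" # assumed key: path to the extracted CRF model files
```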
You can alter the filter chain if you would like, but we strongly recommend sticking with the above setup as it is designed to match the original Penn Treebank tokenization format that the supplied model is trained on.
Programmatically
To use the CRF inside your own program, your code might look like this:
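The following is a minimal sketch modeled on the bundled `pos-tag` tool; the header paths, the sequence-building call, and the analyzer name are assumptions that may differ between MeTA versions, so verify them against the API documentation:

```cpp
#include <iostream>

#include "meta/sequence/crf/crf.h"
#include "meta/sequence/crf/tagger.h"
#include "meta/sequence/sequence.h"
#include "meta/sequence/sequence_analyzer.h"

using namespace meta;

int main()
{
    // assumed path: the prefix you extracted the model files into
    std::string prefix = "crf";

    sequence::crf crf{prefix}; // load the trained model
    auto tagger = crf.make_tagger();

    // the analyzer generates the same features the model was trained with
    auto ana = sequence::default_pos_analyzer();
    ana.load(prefix);

    // build a sentence to tag, one token per observation
    sequence::sequence seq;
    for (const auto& word : {"The", "dog", "ran", "home", "."})
        seq.add_symbol(sequence::symbol_t{word});

    ana.analyze(seq); // feature extraction
    tagger.tag(seq);  // label assignment

    for (const auto& obs : seq)
        std::cout << obs.symbol() << "_" << ana.tag(obs.label()) << " ";
    std::cout << "\n";
    return 0;
}
```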
Have a look at the API documentation for the `meta::sequence::crf` class for more information.
Using the greedy tagger (Perceptron)
First, extract your model files into a directory. You should modify your `config.toml` to contain a `[sequence]` group like so:
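A minimal sketch of that group (as above, `prefix` points at the extracted model files):

```toml
[sequence]
prefix = "perceptron-tagger" # assumed path; use the folder you extracted into
```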
where `prefix` has been set to the folder that contains the model files you extracted.
Interactive tagging
The `pos-tag` tool doesn't currently use this tagger (patches welcome!), but you can still interactively tag sentences using the `profile` tool. See the `profile` tutorial for a walkthrough of that demo application.
Programmatically
To use the greedy Perceptron-based tagger inside your own program, your code might look like this:
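As with the CRF example, this is only a sketch (the header paths and sequence-building call are assumptions; confirm against the API documentation):

```cpp
#include <iostream>

#include "meta/sequence/perceptron.h"
#include "meta/sequence/sequence.h"

using namespace meta;

int main()
{
    // assumed path: the prefix you extracted the model files into
    sequence::perceptron tagger{"perceptron-tagger"};

    // build a sentence to tag, one token per observation
    sequence::sequence seq;
    for (const auto& word : {"The", "dog", "ran", "home", "."})
        seq.add_symbol(sequence::symbol_t{word});

    // the greedy tagger fills in each observation's tag directly
    tagger.tag(seq);

    for (const auto& obs : seq)
        std::cout << obs.symbol() << "_" << obs.tag() << " ";
    std::cout << "\n";
    return 0;
}
```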
This API is a bit simpler than that of the CRF. For more information, you can check the API documentation for the `meta::sequence::perceptron` class.
Training Taggers
In order to train your own models using our provided training programs, you will need to have a copy of the Penn Treebank (v2) extracted into your data prefix (see the overview tutorial). Your folder structure should look like the following:
```
prefix
|---- penn-treebank
      |---- treebank-2
            |---- tagged
                  |---- wsj
                        |---- 00
                        |---- 01
                        ...
                        |---- 24
```
Training a CRF
To train your own CRF model from the Penn Treebank data, you should be able to use the provided `crf-train` executable. You will first need to adjust your `[crf]` group in your `config.toml` to look something like this:
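For example (the treebank-related keys below are assumptions modeled on the folder layout above; consult `crf_train.cpp` for the keys the tool actually reads):

```toml
[crf]
prefix = "crf"             # where the trained model will be written
treebank = "penn-treebank" # assumed keys: treebank location relative to the
corpus = "wsj"             # data prefix, and the corpus subdirectory
train-sections = [0, 18]   # assumed keys: which WSJ sections to use for
dev-sections = [19, 21]    # training, development, and testing
test-sections = [22, 24]
```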
You should now be able to run the training procedure:
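From your build directory:

```bash
./crf-train config.toml
```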
This will train a CRF model using the default training options. For more information on the options available, please see the API documentation for the `meta::sequence::crf` class (in particular, the `parameters` struct). If you would like to try different options, you can use the code provided in `src/sequence/crf/tools/crf_train.cpp` as a starting point. You will need to change the call to `crf.train()` to use a non-default `parameters` struct.
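As an illustration only (the field name below is an assumption; the authoritative list lives in the `parameters` struct definition in `crf.h`), the change might look like:

```cpp
// inside crf_train.cpp, replacing the default-options training call
sequence::crf::parameters params; // starts out with the default values
params.max_iters = 100;           // assumed field name; check crf.h
crf.train(params, training);      // 'training' is the loaded training data
```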
The model will take several hours to train. Its termination is based on convergence of the loss function.
Training a greedy Perceptron-based tagger
To train your own greedy tagger model from the Penn Treebank data, you should be able to use the provided `greedy-tagger-train` executable. You will need to first adjust your `[sequence]` group in your `config.toml` to look something like this (very similar to the above):
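For example (same caveats as the CRF configuration above; `greedy_tagger_train.cpp` has the authoritative key names):

```toml
[sequence]
prefix = "perceptron-tagger" # where the trained model will be written
treebank = "penn-treebank"   # assumed keys, as in the [crf] group above
corpus = "wsj"
train-sections = [0, 18]
dev-sections = [19, 21]
test-sections = [22, 24]
```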
You should now be able to run the training procedure:
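From your build directory:

```bash
./greedy-tagger-train config.toml
```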
This will train the averaged Perceptron model using the default training options. The termination criterion is simply a maximum iteration count, which defaults to 5 as of the time of writing. This means that the greedy tagger is significantly faster to train than its corresponding CRF model. In practice, the two achieve nearly the same accuracy with our default settings (the CRF being just slightly better).
If you want to adjust the number of training iterations, you can use the code provided in `src/sequence/tools/greedy_tagger_train.cpp` as a starting point. You will need to change the call to `tagger.train()` to use a non-default `training_options` struct.
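For illustration (again, the option name is an assumption; see `perceptron.h` for the real `training_options` definition):

```cpp
// inside greedy_tagger_train.cpp, replacing the default-options training call
sequence::perceptron::training_options options; // default values
options.max_iterations = 10;     // assumed field name; the default is 5
tagger.train(training, options); // 'training' is the loaded training data
```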
Testing Taggers
If you follow the instructions above for the tagger type you wish to test, you should be able to test them with their corresponding testing executables. For the CRF, you would use
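```bash
./crf-test config.toml  # assumed name, following the crf-train convention
```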
and for the greedy Perceptron-based tagger, you would use
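```bash
./greedy-tagger-test config.toml
```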
Both will run over the testing section defined in `config.toml` and report precision, recall, and F1 score for each class, as well as the overall token-level accuracy. The current CRF model achieves 97% accuracy, and the greedy Perceptron model achieves 96.9% accuracy.