Converts documents into streams of tokens by following the Unicode standards for sentence and word segmentation.

#include <icu_tokenizer.h>

Classes
class	impl
	Implementation class for the icu_tokenizer.

Public Member Functions
	icu_tokenizer ()
	Creates an icu_tokenizer.

	icu_tokenizer (const icu_tokenizer &other)
	Copies an icu_tokenizer.

	icu_tokenizer (icu_tokenizer &&other)
	Moves an icu_tokenizer.

	~icu_tokenizer ()
	Destroys an icu_tokenizer.

void	set_content (const std::string &content) override
	Sets the content for the tokenizer to parse.

std::string	next () override
	Returns the next token in the document.

	operator bool () const override
	Determines if there are more tokens in the document.

Public Member Functions inherited from meta::util::multilevel_clonable< Root, Base, Derived >
virtual std::unique_ptr< Root >	clone () const
	Clones the given object.

Static Public Attributes
static const std::string	id = "icu-tokenizer"
	Identifier for this tokenizer.

Private Attributes
util::pimpl< impl >	impl_
	The implementation for this tokenizer.

Detailed Description

Converts documents into streams of tokens by following the Unicode standards for sentence and word segmentation.

Constructor & Destructor Documentation

meta::analyzers::tokenizers::icu_tokenizer::icu_tokenizer ( const icu_tokenizer & other )

Copies an icu_tokenizer.

Parameters

other The other icu_tokenizer to copy into this one

meta::analyzers::tokenizers::icu_tokenizer::icu_tokenizer ( icu_tokenizer && other )

default

Moves an icu_tokenizer.

Parameters

other The other icu_tokenizer to move into this one

Member Function Documentation

void meta::analyzers::tokenizers::icu_tokenizer::set_content ( const std::string & content )

override

Sets the content for the tokenizer to parse.

The input is assumed to be UTF-8 encoded. ICU converts it to UTF-16 internally for segmentation, but all tokens are output as UTF-8 encoded strings.

Parameters

content The string content to set
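
To make the encoding round trip described above concrete, here is a minimal hand-rolled sketch of a UTF-8 to UTF-16 conversion and back. The helpers utf8_to_utf16 and utf16_to_utf8 are hypothetical names introduced for illustration; they are not part of MeTA or ICU, which performs this conversion internally with its own routines. The sketch assumes well-formed input and performs no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not MeTA or ICU API): decodes well-formed UTF-8
// into UTF-16 code units. Malformed input is not detected.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const auto b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 1;
        // The lead byte determines the sequence length and its payload bits.
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else                          { cp = b0 & 0x07; len = 4; }
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical helper (not MeTA or ICU API): encodes well-formed UTF-16
// back into UTF-8.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = in[i];
        // Recombine a high/low surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size())
        {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80)
        {
            out.push_back(static_cast<char>(cp));
        }
        else if (cp < 0x800)
        {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else if (cp < 0x10000)
        {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else
        {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The practical consequence for callers is that byte offsets in the input string do not correspond to UTF-16 code unit offsets, so token boundaries should be taken from the returned strings rather than computed from positions.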

std::string meta::analyzers::tokenizers::icu_tokenizer::next ( )

override

Returns: the next token in the document. This will be either a sentence boundary ("<s>" or "</s>"), a token consisting of non-whitespace characters, or a token consisting of only whitespace characters.
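
The three members documented above (set_content, next, and operator bool) are typically used together in a loop. Because the real class needs MeTA and ICU to build, the sketch below uses toy_tokenizer, a hypothetical stand-in with the same three-member interface. It splits on ASCII whitespace only and does not emit "<s>" or "</s>", so its output is not what icu_tokenizer would produce. The drain helper is also an illustrative name, not MeTA API.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in mirroring the documented interface. The real
// icu_tokenizer segments with ICU and also emits sentence boundaries;
// this toy only alternates whitespace and non-whitespace runs.
class toy_tokenizer
{
  public:
    // Replaces the content and restarts tokenization from the beginning.
    void set_content(const std::string& content)
    {
        content_ = content;
        pos_ = 0;
    }

    // Returns the next run of whitespace or non-whitespace characters.
    std::string next()
    {
        assert(pos_ < content_.size() && "next() called with no tokens left");
        const bool space = is_space(content_[pos_]);
        const std::size_t start = pos_;
        while (pos_ < content_.size() && is_space(content_[pos_]) == space)
            ++pos_;
        return content_.substr(start, pos_ - start);
    }

    // True while there are more tokens to read.
    operator bool() const
    {
        return pos_ < content_.size();
    }

  private:
    static bool is_space(char c)
    {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }

    std::string content_;
    std::size_t pos_ = 0;
};

// Collects every token from any tokenizer exposing set_content(),
// next(), and operator bool.
template <class Tokenizer>
std::vector<std::string> drain(Tokenizer& tok, const std::string& text)
{
    tok.set_content(text);
    std::vector<std::string> tokens;
    while (tok) // check operator bool before each call to next()
        tokens.push_back(tok.next());
    return tokens;
}
```

With the real class, the same loop should apply after including icu_tokenizer.h and constructing a meta::analyzers::tokenizers::icu_tokenizer; callers that only want words would additionally skip the sentence-boundary and whitespace-only tokens.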

The documentation for this class was generated from the following files:

/home/chase/projects/meta/include/analyzers/tokenizers/icu_tokenizer.h
/home/chase/projects/meta/src/analyzers/tokenizers/icu_tokenizer.cpp
