ModErn Text Analysis
META Enumerates Textual Applications
meta::utf::segmenter Class Reference

Class that encapsulates segmenting Unicode strings.

#include <segmenter.h>

Classes

class  impl
 Implementation class for the segmenter.

class  segment
 Represents a segment within a Unicode string.
 

Public Member Functions

 segmenter ()
 Constructs a segmenter.

 segmenter (const segmenter &)
 Copy constructs a segmenter.

 ~segmenter ()
 Destructor for segmenter.

void set_content (const std::string &str)
 Resets the content of the segmenter to the given string.

std::vector< segment > sentences () const
 Segments the current content into sentences by following the Unicode segmentation standard.

std::vector< segment > words () const
 Segments the current content into words by following the Unicode segmentation standard.

std::vector< segment > words (const segment &seg) const
 Segments a given segment into words by following the Unicode segmentation standard.
 
std::string content (const segment &seg) const
 

Private Attributes

util::pimpl< impl > impl_
 A pointer to the implementation class for the segmenter.
 

Detailed Description

Class that encapsulates segmenting Unicode strings.

Supports segmenting sentences as well as words.

Constructor & Destructor Documentation

meta::utf::segmenter::segmenter ( )

Constructs a segmenter.

A single segmenter instance may be used to segment many different Unicode strings. If you are segmenting many strings, reuse one instance rather than constructing a new one for each string.

Member Function Documentation

void meta::utf::segmenter::set_content ( const std::string &  str)

Resets the content of the segmenter to the given string.

Parameters
str    A UTF-8 string that should be segmented

auto meta::utf::segmenter::sentences ( ) const

Segments the current content into sentences by following the Unicode segmentation standard.

Returns
a vector of segments that represent sentences

auto meta::utf::segmenter::words ( ) const

Segments the current content into words by following the Unicode segmentation standard.

Returns
a vector of segments that represent words

auto meta::utf::segmenter::words ( const segment & seg ) const

Segments a given segment into words by following the Unicode segmentation standard.

Typically, this would be used to further segment a sentence segment into its constituent words.

Parameters
seg    the segment to sub-segment into words

Returns
a vector of segments that represent words

std::string meta::utf::segmenter::content ( const segment & seg ) const

Returns
the content associated with a given segment as a UTF-8 encoded string

Parameters
seg    the segment to get content for
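
The members documented above are typically combined in one pass: set the content, split it into sentences, split each sentence into words, and read each word back with content(). The sketch below illustrates that call pattern only. It does not use the library itself. Instead it defines a hypothetical stand-in, toy_segmenter, whose segment struct, ASCII-only splitting rules, and helper function words_by_sentence are all assumptions made for illustration. The real meta::utf::segmenter follows the Unicode segmentation standard, and its segment class may be laid out differently.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in that mirrors the documented member functions only.
// It splits ASCII text with naive rules; the real meta::utf::segmenter
// follows the Unicode segmentation standard instead.
class toy_segmenter
{
  public:
    // Assumed layout: a half-open byte range [begin, end) into the content.
    struct segment
    {
        std::size_t begin;
        std::size_t end;
    };

    // Replaces the current content, so one instance can serve many strings.
    void set_content(const std::string& str) { content_ = str; }

    // Naive rule: a sentence ends right after '.', '!' or '?'.
    std::vector<segment> sentences() const
    {
        std::vector<segment> result;
        std::size_t start = 0;
        for (std::size_t i = 0; i < content_.size(); ++i)
        {
            const char c = content_[i];
            if (c == '.' || c == '!' || c == '?')
            {
                result.push_back({start, i + 1});
                start = i + 1;
            }
        }
        if (start < content_.size())
            result.push_back({start, content_.size()});
        return result;
    }

    // Naive rule: a word is a maximal run of ASCII letters or digits.
    std::vector<segment> words(const segment& seg) const
    {
        assert(seg.begin <= seg.end && seg.end <= content_.size());
        std::vector<segment> result;
        std::size_t i = seg.begin;
        while (i < seg.end)
        {
            if (!is_word_char(content_[i]))
            {
                ++i;
                continue;
            }
            const std::size_t start = i;
            while (i < seg.end && is_word_char(content_[i]))
                ++i;
            result.push_back({start, i});
        }
        return result;
    }

    // Words of the whole content: the same rule applied to the full range.
    std::vector<segment> words() const
    {
        return words(segment{0, content_.size()});
    }

    // Copies out the text that a segment refers to.
    std::string content(const segment& seg) const
    {
        assert(seg.begin <= seg.end && seg.end <= content_.size());
        return content_.substr(seg.begin, seg.end - seg.begin);
    }

  private:
    static bool is_word_char(char c)
    {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    }

    std::string content_;
};

// Hypothetical helper: sentence-by-sentence word extraction.
std::vector<std::vector<std::string>>
words_by_sentence(toy_segmenter& seg, const std::string& text)
{
    seg.set_content(text);
    std::vector<std::vector<std::string>> result;
    for (const auto& sentence : seg.sentences())
    {
        std::vector<std::string> tokens;
        for (const auto& word : seg.words(sentence))
            tokens.push_back(seg.content(word));
        result.push_back(tokens);
    }
    return result;
}
```

Under these assumptions, calling words_by_sentence(seg, "Hello world. How are you?") yields two word lists, one holding "Hello" and "world" and one holding "How", "are" and "you". The same seg object can then be passed again with a different string, which is the reuse that the constructor documentation encourages.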

The documentation for this class was generated from the following files: