ModErn Text Analysis
META Enumerates Textual Applications
meta::utf::segmenter::impl Class Reference

Implementation class for the segmenter.

Public Types

enum  segment_t { SENTENCES, WORDS }
 Tag enum that selects the segmentation strategy.
 

Public Member Functions

 impl ()
 Constructs a new impl.
 
 impl (const impl &other)
 Copy constructs an impl.
 
 impl (impl &&)=default
 Defaulted move constructor.
 
void set_content (const std::string &str)
 Sets the content of the segmenter.
 
std::string substr (int32_t begin, int32_t end) const
 Obtains a UTF-8 encoded string by extracting the UTF-16 encoded substring between the given indices and converting it to UTF-8.
 
std::vector< segment > sentences () const
 Segments the entire content into sentences.
 
std::vector< segment > words () const
 Segments the entire content into words.
 
std::vector< segment > segments (int32_t first, int32_t last, segment_t type) const
 Generic segmentation method that operates on the substring between the given indices, using the given strategy for segmenting that substring.
 

Private Attributes

icu::UnicodeString u_str_
 The internal ICU string.
 
std::unique_ptr< icu::BreakIterator > sentence_iter_
 A pointer to a sentence break iterator.
 
std::unique_ptr< icu::BreakIterator > word_iter_
 A pointer to a word break iterator.
 

Detailed Description

Implementation class for the segmenter.

Constructor & Destructor Documentation

meta::utf::segmenter::impl::impl ( const impl &  other)
inline

Copy constructs an impl.

Parameters
other  The impl to copy.
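
Because the class holds its break iterators through std::unique_ptr, which is move-only, a copy constructor has to duplicate the pointees explicitly. The following is a minimal sketch of that pattern, not the library's code: break_iter, word_iter and impl_sketch are hypothetical stand-ins, and the sketch assumes the iterator type exposes a polymorphic clone(), as icu::BreakIterator does.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for a polymorphic break iterator (not the ICU type).
struct break_iter
{
    virtual ~break_iter() = default;
    virtual std::unique_ptr<break_iter> clone() const = 0;
    virtual void set_text(const std::string& text) = 0;
    virtual const std::string& text() const = 0;
};

// Hypothetical concrete iterator that only remembers the text it was given.
struct word_iter : break_iter
{
    std::string text_;

    std::unique_ptr<break_iter> clone() const override
    {
        return std::make_unique<word_iter>(*this);
    }
    void set_text(const std::string& text) override { text_ = text; }
    const std::string& text() const override { return text_; }
};

// Sketch of an impl-like class that owns its iterator through unique_ptr.
class impl_sketch
{
  public:
    impl_sketch() : word_iter_{std::make_unique<word_iter>()} {}

    // unique_ptr cannot be copied, so the pointee is cloned explicitly.
    impl_sketch(const impl_sketch& other)
        : content_{other.content_}, word_iter_{other.word_iter_->clone()}
    {
        // Re-attach the cloned iterator to this object's own text.
        word_iter_->set_text(content_);
    }

    impl_sketch(impl_sketch&&) = default;

    void set_content(const std::string& str)
    {
        content_ = str;
        word_iter_->set_text(content_);
    }

    const std::string& iterator_text() const { return word_iter_->text(); }
    const break_iter* iterator() const { return word_iter_.get(); }

  private:
    std::string content_;
    std::unique_ptr<break_iter> word_iter_;
};
```

After a copy, the two objects own distinct iterators, so calling set_content on one leaves the other unchanged.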

Member Function Documentation

void meta::utf::segmenter::impl::set_content ( const std::string &  str)
inline

Sets the content of the segmenter.

Parameters
str  The content to be set
std::string meta::utf::segmenter::impl::substr ( int32_t  begin,
int32_t  end 
) const
inline

Obtains a UTF-8 encoded string by extracting the UTF-16 encoded substring between the given indices and converting it to UTF-8.

Parameters
begin  The beginning index
end    The ending index
Returns
the substring between begin and end
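
The indices refer to UTF-16 code units, not UTF-8 bytes, so the extracted range has to be re-encoded before it is returned. The sketch below illustrates that conversion with the standard library only. It is not the library's implementation, which works on an icu::UnicodeString. The function name utf16_substr_to_utf8 is hypothetical, and treating the range as half-open [begin, end) is an assumption, since the description above does not say whether end is inclusive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Converts the UTF-16 code units in [begin, end) to a UTF-8 std::string.
// Assumes well-formed UTF-16 and a range that does not split a surrogate pair.
std::string utf16_substr_to_utf8(const std::u16string& text, int32_t begin,
                                 int32_t end)
{
    assert(begin >= 0 && begin <= end
           && static_cast<std::size_t>(end) <= text.size());

    std::string out;
    for (int32_t i = begin; i < end; ++i)
    {
        char32_t cp = text[static_cast<std::size_t>(i)];

        // Combine a high surrogate with the low surrogate that follows it.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < end)
        {
            const char32_t low = text[static_cast<std::size_t>(i + 1)];
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }

        // Emit one to four UTF-8 bytes depending on the code point's size.
        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

A range of five UTF-16 code units can therefore produce more than five bytes: any non-ASCII character in the range expands to two or more bytes in the result.
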
std::vector<segment> meta::utf::segmenter::impl::sentences ( ) const
inline

Segments the entire content into sentences.

Returns
a vector of segments representing sentences
std::vector<segment> meta::utf::segmenter::impl::words ( ) const
inline

Segments the entire content into words.

Returns
a vector of segments representing words
std::vector<segment> meta::utf::segmenter::impl::segments ( int32_t  first,
int32_t  last,
segment_t  type 
) const
inline

Generic segmentation method that operates on the substring between the given indices, using the given strategy for segmenting that substring.

Parameters
first  The index of the beginning of the string to work on
last   The index of the end of the string to work on
type   The type of segmentation to perform
Returns
a vector of segments (whose meaning depends on type)
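
The shape of this interface, a single range-based method selected by a strategy tag with sentences() and words() as whole-content conveniences, can be sketched as follows. This is an illustration under stated assumptions, not the library's code: index_range stands in for the segment type, the boundary rules are a deliberately naive replacement for ICU break iterators, the range is assumed to be half-open, and it is only an assumption that sentences() and words() delegate to segments().

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the segment type: a half-open index range.
struct index_range
{
    int32_t begin;
    int32_t end;
};

enum segment_t { SENTENCES, WORDS };

// Naive boundary rules standing in for a real break iterator:
// words end at whitespace; sentences end after '.', '!' or '?'.
std::vector<index_range> segments_sketch(const std::string& text,
                                         int32_t first, int32_t last,
                                         segment_t type)
{
    assert(first >= 0 && first <= last
           && static_cast<std::size_t>(last) <= text.size());

    std::vector<index_range> result;
    int32_t start = first;
    for (int32_t i = first; i < last; ++i)
    {
        const auto c
            = static_cast<unsigned char>(text[static_cast<std::size_t>(i)]);
        if (type == WORDS)
        {
            if (std::isspace(c))
            {
                if (i > start)
                    result.push_back({start, i});
                start = i + 1;
            }
        }
        else
        {
            // Skip whitespace that precedes the next sentence.
            if (start == i && std::isspace(c))
            {
                start = i + 1;
                continue;
            }
            if (c == '.' || c == '!' || c == '?')
            {
                result.push_back({start, i + 1});
                start = i + 1;
            }
        }
    }
    // Whatever remains after the last boundary forms a final segment.
    if (start < last)
        result.push_back({start, last});
    return result;
}

// Whole-content conveniences expressed through the generic method.
std::vector<index_range> sentences_sketch(const std::string& text)
{
    return segments_sketch(text, 0, static_cast<int32_t>(text.size()),
                           SENTENCES);
}

std::vector<index_range> words_sketch(const std::string& text)
{
    return segments_sketch(text, 0, static_cast<int32_t>(text.size()), WORDS);
}
```

Because each segment is an index range and not a copied string, a caller can pass a sentence's range back into the generic method to obtain the words inside that sentence.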

The documentation for this class was generated from the following file: