ModErn Text Analysis
META Enumerates Textual Applications
meta::utf Namespace Reference

Functions for converting to and from various character sets.

Classes

class  icu_handle
 Internal class that ensures that ICU cleans up all of its "still-reachable" memory before program termination.

class  segmenter
 Class that encapsulates segmenting Unicode strings.

class  transformer
 Class that encapsulates transliteration of Unicode strings.
 

Functions

std::string to_utf8 (const std::string &str, const std::string &charset)
 Converts a string from the given charset to utf8.

std::u16string to_utf16 (const std::string &str, const std::string &charset)
 Converts a string from the given charset to utf16.

std::string to_utf8 (const std::u16string &str)
 Converts a string from utf16 to utf8.

std::u16string to_utf16 (const std::string &str)
 Converts a string from utf8 to utf16.
 
std::string tolower (const std::string &str)
 Lowercases a utf8 string.

std::string toupper (const std::string &str)
 Uppercases a utf8 string.

std::string foldcase (const std::string &str)
 Folds the case of a utf8 string.

std::string transform (const std::string &str, const std::string &id)
 Transliterates a utf8 string, using the rules defined in ICU.

std::string remove_if (const std::string &str, std::function< bool(uint32_t)> pred)
 Removes UTF-32 codepoints that match the given function.
 
uint64_t length (const std::string &str)
 
bool isalpha (uint32_t codepoint)
 
bool isblank (uint32_t codepoint)
 
std::u16string icu_to_u16str (const icu::UnicodeString &icu_str)
 Helper method that converts an ICU string to a std::u16string.

std::string icu_to_u8str (const icu::UnicodeString &icu_str)
 Helper method that converts an ICU string to a std::string in utf8.

void utf8_append_codepoint (std::string &dest, uint32_t codepoint)
 Helper method that appends a UTF-32 codepoint to the given utf8 string.
 

Detailed Description

Functions for converting to and from various character sets.

Function Documentation

std::string meta::utf::to_utf8(const std::string& str, const std::string& charset)

Converts a string from the given charset to utf8.

Parameters
str: The string to convert
charset: The charset of the given string
Returns
a utf8 string
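
As a rough illustration of what a charset conversion involves, the sketch below handles a single charset, ISO-8859-1, where every byte value is also its Unicode code point. The name latin1_to_utf8 is hypothetical and not part of meta::utf; the real function accepts a charset name and is not limited to this case.

```cpp
#include <string>

// Hypothetical helper, not part of meta::utf: converts ISO-8859-1 (Latin-1)
// bytes to UTF-8. Latin-1 byte values equal code points U+0000..U+00FF.
std::string latin1_to_utf8(const std::string& str)
{
    std::string out;
    out.reserve(str.size());
    for (unsigned char byte : str)
    {
        if (byte < 0x80)
        {
            // ASCII range: one byte, copied unchanged
            out.push_back(static_cast<char>(byte));
        }
        else
        {
            // U+0080..U+00FF: two bytes of the form 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            out.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return out;
}
```

With this sketch, the Latin-1 byte 0xE9 (é) becomes the two UTF-8 bytes 0xC3 0xA9, while ASCII input passes through unchanged.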

std::u16string meta::utf::to_utf16(const std::string& str, const std::string& charset)

Converts a string from the given charset to utf16.

Parameters
str: The string to convert
charset: The charset of the given string
Returns
a utf16 string

std::string meta::utf::to_utf8(const std::u16string& str)

Converts a string from utf16 to utf8.

Parameters
str: The string to convert
Returns
a utf8 string
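
The mapping from utf16 to utf8 can be sketched without any library support. The function below is a hypothetical illustration, not the meta::utf implementation, and it assumes well-formed input with no unpaired surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical sketch, not the meta::utf implementation: converts UTF-16 to
// UTF-8, assuming the input is well formed.
std::string utf16_to_utf8_sketch(const std::u16string& str)
{
    std::string out;
    for (std::size_t i = 0; i < str.size(); ++i)
    {
        uint32_t cp = str[i];
        // A high surrogate followed by a low surrogate encodes one code
        // point above U+FFFF
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < str.size())
        {
            uint32_t low = str[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i; // the low surrogate has been consumed
            }
        }
        if (cp < 0x80)
        {
            out.push_back(static_cast<char>(cp));
        }
        else if (cp < 0x800)
        {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else if (cp < 0x10000)
        {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else
        {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

A code point above U+FFFF occupies two utf16 code units but a single four-byte utf8 sequence, which is why the surrogate pair is merged before encoding.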

std::u16string meta::utf::to_utf16(const std::string& str)

Converts a string from utf8 to utf16.

Parameters
str: The string to convert
Returns
a utf16 string
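
The reverse direction decodes each utf8 sequence into a code point and emits one or two utf16 code units. The following is a hypothetical sketch under the assumption of well-formed utf8 input; it is not taken from the meta::utf sources.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical sketch, not the meta::utf implementation: converts UTF-8 to
// UTF-16, assuming the input is well formed.
std::u16string utf8_to_utf16_sketch(const std::string& str)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < str.size())
    {
        unsigned char lead = static_cast<unsigned char>(str[i]);
        uint32_t cp;
        std::size_t len;
        // The lead byte determines the sequence length and its payload bits
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Each continuation byte contributes six more bits
        for (std::size_t k = 1; k < len && i + k < str.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(str[i + k]) & 0x3F);
        i += len;

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above U+FFFF are split into a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```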

std::string meta::utf::tolower(const std::string& str)

Lowercases a utf8 string.

Parameters
str: The string to convert
Returns
a lowercased utf8 string

std::string meta::utf::toupper(const std::string& str)

Uppercases a utf8 string.

Parameters
str: The string to convert
Returns
an uppercased utf8 string
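
std::string meta::utf::foldcase(const std::string& str)

Folds the case of a utf8 string.

This is similar to lowercasing, but more general: case folding maps strings that differ only in case to the same form, which makes the result suitable for caseless comparison.

Parameters
str: The string to convert
Returns
a case-folded utf8 string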

std::string meta::utf::transform(const std::string& str, const std::string& id)

Transliterates a utf8 string, using the rules defined in ICU.

See also
http://userguide.icu-project.org/transforms
Parameters
str: The string to transliterate
id: The ICU identifier for the transliteration method to use
Returns
the transliterated string, in utf8

std::string meta::utf::remove_if(const std::string& str, std::function<bool(uint32_t)> pred)

Removes UTF-32 codepoints that match the given function.

Parameters
str: The string to remove characters from
pred: The predicate that returns true for codepoints that should be removed
Returns
a utf8 formatted string with all codepoints matching pred removed
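
Because the predicate receives whole code points rather than bytes, the string has to be decoded one utf8 sequence at a time. The sketch below shows one way this could work; remove_if_sketch is a hypothetical name, the code assumes well-formed utf8, and it is not the meta::utf implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical sketch, not the meta::utf implementation: drops every code
// point for which pred returns true. Assumes well-formed UTF-8 input.
std::string remove_if_sketch(const std::string& str,
                             std::function<bool(uint32_t)> pred)
{
    std::string out;
    std::size_t i = 0;
    while (i < str.size())
    {
        unsigned char lead = static_cast<unsigned char>(str[i]);
        uint32_t cp;
        std::size_t len;
        // The lead byte determines the sequence length and its payload bits
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Each continuation byte contributes six more bits
        for (std::size_t k = 1; k < len && i + k < str.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(str[i + k]) & 0x3F);

        // Keep the original bytes of every code point the predicate rejects
        if (!pred(cp))
            out.append(str, i, len);
        i += len;
    }
    return out;
}
```

For example, passing a predicate that returns true for 0xE9 removes every é from the input and leaves all other characters byte-for-byte intact.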

uint64_t meta::utf::length(const std::string& str)

Parameters
str: The string to find the length of
Returns
the number of code points in a utf8 string
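
The number of code points generally differs from the byte count reported by std::string::size(). A minimal sketch of the idea, assuming well-formed utf8 and using the hypothetical name utf8_length_sketch, counts every byte that is not a continuation byte:

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch, not the meta::utf implementation: counts code points
// in well-formed UTF-8 by skipping continuation bytes (bit pattern 10xxxxxx).
uint64_t utf8_length_sketch(const std::string& str)
{
    uint64_t count = 0;
    for (unsigned char byte : str)
    {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the utf8 encoding of "café", size() reports 5 bytes while this sketch reports 4 code points.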

bool meta::utf::isalpha(uint32_t codepoint)

Parameters
codepoint: The codepoint in question
Returns
whether a code point is a letter character

bool meta::utf::isblank(uint32_t codepoint)

Parameters
codepoint: The codepoint in question
Returns
whether a code point is a blank character

std::u16string meta::utf::icu_to_u16str(const icu::UnicodeString& icu_str)  [inline]

Helper method that converts an ICU string to a std::u16string.

Parameters
icu_str: The ICU string to be converted
Returns
a std::u16string from the given ICU string

std::string meta::utf::icu_to_u8str(const icu::UnicodeString& icu_str)  [inline]

Helper method that converts an ICU string to a std::string in utf8.

Parameters
icu_str: The ICU string to be converted
Returns
a std::string in utf8 from the given ICU string

void meta::utf::utf8_append_codepoint(std::string& dest, uint32_t codepoint)  [inline]

Helper method that appends a UTF-32 codepoint to the given utf8 string.

Parameters
dest: The string to append the codepoint to
codepoint: The UTF-32 codepoint to append
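
Appending a code point means encoding it as one to four utf8 bytes, depending on its magnitude. The sketch below illustrates the standard encoding; append_codepoint_sketch is a hypothetical name, the code assumes a valid code point (at most U+10FFFF and not a surrogate), and it is not copied from the meta::utf sources.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch, not the meta::utf implementation: appends the UTF-8
// encoding of a valid code point to dest.
void append_codepoint_sketch(std::string& dest, uint32_t codepoint)
{
    if (codepoint < 0x80)
    {
        // 0xxxxxxx
        dest.push_back(static_cast<char>(codepoint));
    }
    else if (codepoint < 0x800)
    {
        // 110xxxxx 10xxxxxx
        dest.push_back(static_cast<char>(0xC0 | (codepoint >> 6)));
        dest.push_back(static_cast<char>(0x80 | (codepoint & 0x3F)));
    }
    else if (codepoint < 0x10000)
    {
        // 1110xxxx 10xxxxxx 10xxxxxx
        dest.push_back(static_cast<char>(0xE0 | (codepoint >> 12)));
        dest.push_back(static_cast<char>(0x80 | ((codepoint >> 6) & 0x3F)));
        dest.push_back(static_cast<char>(0x80 | (codepoint & 0x3F)));
    }
    else
    {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        dest.push_back(static_cast<char>(0xF0 | (codepoint >> 18)));
        dest.push_back(static_cast<char>(0x80 | ((codepoint >> 12) & 0x3F)));
        dest.push_back(static_cast<char>(0x80 | ((codepoint >> 6) & 0x3F)));
        dest.push_back(static_cast<char>(0x80 | (codepoint & 0x3F)));
    }
}
```

Appending U+20AC (the euro sign) to a string with this sketch adds the three bytes 0xE2 0x82 0xAC.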