i18n

You can generate flexible number or date format patterns and format and parse dates and numbers for any locale. The i18n API is implemented by using the ICU library.

Ubrk

The Ubrk API (in mobile and wearable applications) is used to find the location of boundaries in text. The i18n_ubreak_iterator_h handle maintains a current position and scans over the text returning the index of characters where the boundaries occur.

The following boundary analyzing methods are available:

  • Line boundary analysis determines where a text string can be broken when line-wrapping. The mechanism correctly handles punctuation and hyphenated words.
  • Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks, such as quotation marks and parentheses.
  • Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double-click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides.
  • Character boundary analysis identifies the boundaries of Extended Grapheme Clusters, which are groupings of codepoints that must be treated as character-like units for many text operations. For additional information on grapheme clusters and guidelines on their use, see Unicode Standard Annex #29, Unicode Text Segmentation.
  • Title boundary analysis locates all positions, typically starts of words, that must be set to Title Case when title casing the text.

The text boundary positions are found according to the rules described in Unicode Standard Annex #29, Unicode Text Segmentation, and Unicode Standard Annex #14, Unicode Line Breaking Algorithm.

Ucalendar

The Ucalendar API (in mobile and wearable applications) is used for converting between a Udate object and a set of integer fields such as I18N_UCALENDAR_YEAR, I18N_UCALENDAR_MONTH, I18N_UCALENDAR_DATE, and I18N_UCALENDAR_HOUR. A Udate object represents a specific instant in time with one millisecond precision.

The types of Ucalendar interpret a Udate according to the rules of a specific calendar system, such as the Gregorian or a traditional system.

A Ucalendar object can produce all the time field values needed to implement the date-time formatting for a particular language and calendar style (for example, Japanese-Gregorian, Japanese-Traditional).

When computing a Udate from the time fields, 2 special circumstances can arise. The information can be insufficient to compute the Udate (you have only the year and the month, but not the day of the month), or the information can be inconsistent (such as "Tuesday, July 15, 1996" even though July 15, 1996 is actually a Monday).

  • Insufficient information

    The calendar uses the default information to specify the missing fields. This can vary by calendar: for the Gregorian calendar, the default for a field is the same as that of the start of the epoch, such as I18N_UCALENDAR_YEAR = 1970, I18N_UCALENDAR_MONTH = JANUARY, I18N_UCALENDAR_DATE = 1.

  • Inconsistent information

    If the fields conflict, the calendar prefers the most recently set fields. For example, when determining the day, the calendar looks for one of the field combinations listed in the following table. The most recent combination, as determined by the most recently set single field, is used.

Table: Combinations of the calendar fields to determine the day
Combinations of the calendar fields
I18N_UCALENDAR_MONTH + I18N_UCALENDAR_DAY_OF_MONTH

I18N_UCALENDAR_MONTH + I18N_UCALENDAR_WEEK_OF_MONTH + I18N_UCALENDAR_DAY_OF_WEEK

I18N_UCALENDAR_MONTH + I18N_UCALENDAR_DAY_OF_WEEK_IN_MONTH + I18N_UCALENDAR_DAY_OF_WEEK

I18N_UCALENDAR_DAY_OF_YEAR

I18N_UCALENDAR_DAY_OF_WEEK + I18N_UCALENDAR_WEEK_OF_YEAR

For the time of day:

Table: Combinations of calendar fields to determine the time of the day
Combinations of the calendar fields
I18N_UCALENDAR_HOUR_OF_DAY

I18N_UCALENDAR_AM_PM + I18N_UCALENDAR_HOUR

Note
For some non-Gregorian calendars, different fields are necessary for complete disambiguation. For example, a full specification of the historical Arabic astronomical calendar requires the year, month, day-of-month and day-of-week in some cases.
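
The "Tuesday, July 15, 1996" conflict described above can be checked with a small sketch. It does not use the Ucalendar API: it relies on the standard C mktime() function, which behaves comparably for this case because it ignores the day-of-week field on input and recomputes it from the year, month, and day of the month. The weekday_of() helper is a hypothetical name introduced only for this illustration.

```c
#include <assert.h>
#include <time.h>

/* Returns the day of the week (0 = Sunday ... 6 = Saturday) for a Gregorian
 * date, or -1 if the C library cannot represent it. month is 1-12.
 * Standard C only; this is an analogy, not the Ucalendar API. */
int weekday_of(int year, int month, int day_of_month)
{
    assert(month >= 1 && month <= 12);

    struct tm fields = {0};
    fields.tm_year = year - 1900;
    fields.tm_mon = month - 1;
    fields.tm_mday = day_of_month;
    fields.tm_hour = 12;   /* midday keeps clear of DST transitions */
    fields.tm_isdst = -1;  /* let the library decide */
    fields.tm_wday = 2;    /* deliberately inconsistent: claims "Tuesday" */

    if (mktime(&fields) == (time_t)-1)
        return -1;
    return fields.tm_wday; /* recomputed from year, month, and day */
}
```

Calling weekday_of(1996, 7, 15) yields 1 (Monday): the month and day-of-month combination wins, and the conflicting day of the week is discarded.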

Uchar

The Uchar API (in mobile and wearable applications) provides low-level access to the Unicode Character Database.

Unicode assigns each code point (not just the assigned character) values for several properties. Most of them are simple boolean flags, or constants from a small enumerated list. For some properties, values are strings or other relatively more complex types.

For more information, see About the Unicode Character Database and the ICU User Guide chapter on Properties.

The following table describes the details of script codes that you can get using the i18n_uchar_get_int_property_value() function.

Table: Script codes
Value Code English name Value Code English name
0 Zyyy Code for undetermined script 80 Latf Latin (Fraktur variant)
1 Zinh Code for inherited script 81 Latg Latin (Gaelic variant)
2 Arab Arabic 82 Lepc Lepcha (Rong)
3 Armn Armenian 83 Lina Linear A
4 Beng Bengali 84 Mand Mandaic, Mandaean
5 Bopo Bopomofo 85 Maya Mayan hieroglyphs
6 Cher Cherokee 86 Mero Meroitic hieroglyphs
7 Copt Coptic 87 Nkoo N’Ko
8 Cyrl Cyrillic 88 Orkh Old Turkic, Orkhon Runic
9 Dsrt Deseret (Mormon) 89 Perm Old Permic
10 Deva Devanagari (Nagari) 90 Phag Phags-pa
11 Ethi Ethiopic (Geʻez) 91 Phnx Phoenician
12 Geor Georgian (Mkhedruli) 92 Plrd Miao (Pollard)
13 Goth Gothic 93 Roro Rongorongo
14 Grek Greek 94 Sara Sarati
15 Gujr Gujarati 95 Syre Syriac (Estrangelo variant)
16 Guru Gurmukhi 96 Syrj Syriac (Western variant)
17 Hani Han (Hanzi, Kanji, Hanja) 97 Syrn Syriac (Eastern variant)
18 Hang Hangul (Hangŭl, Hangeul) 98 Teng Tengwar
19 Hebr Hebrew 99 Vaii Vai
20 Hira Hiragana 100 Visp Visible Speech
21 Knda Kannada 101 Xsux Cuneiform, Sumero-Akkadian
22 Kana Katakana 102 Zxxx Code for unwritten documents
23 Khmr Khmer 103 Zzzz Code for uncoded script
24 Laoo Lao 104 Cari Carian
25 Latn Latin 105 Jpan Japanese (alias for Han+Hiragana+Katakana)
26 Mlym Malayalam 106 Lana Tai Tham (Lanna)
27 Mong Mongolian 107 Lyci Lycian
28 Mymr Myanmar (Burmese) 108 Lydi Lydian
29 Ogam Ogham 109 Olck Ol Chiki (Ol Cemet’, Ol Santali)
30 Ital Old Italic (Etruscan, Oscan) 110 Rjng Rejang (Redjang, Kaganga)
31 Orya Oriya 111 Saur Saurashtra
32 Runr Runic 112 Sgnw SignWriting
33 Sinh Sinhala 113 Sund Sundanese
34 Syrc Syriac 114 Moon Moon (Mooncode, Moonscript, Moontype)
35 Taml Tamil 115 Mtei Meitei Mayek (Meithei, Meetei)
36 Telu Telugu 116 Armi Imperial Aramaic
37 Thaa Thaana 117 Avst Avestan
38 Thai Thai 118 Cakm Chakma
39 Tibt Tibetan 119 Kore Korean (alias for Hangul+Han)
40 Cans Unified Canadian Aboriginal Syllabics 120 Kthi Kaithi
41 Yiii Yi 121 Mani Manichaean
42 Tglg Tagalog (Baybayin, Alibata) 122 Phli Inscriptional Pahlavi
43 Hano Hanunoo (Hanunóo) 123 Phlp Psalter Pahlavi
44 Buhd Buhid 124 Phlv Book Pahlavi
45 Tagb Tagbanwa 125 Prti Inscriptional Parthian
46 Brai Braille 126 Samr Samaritan
47 Cprt Cypriot 127 Tavt Tai Viet
48 Limb Limbu 128 Zmth Mathematical notation
49 Linb Linear B 129 Zsym Symbols
50 Osma Osmanya 130 Bamu Bamum
51 Shaw Shavian (Shaw) 131 Lisu Lisu (Fraser)
52 Tale Tai Le 132 Nkgb Nakhi Geba ('Na-'Khi ²Ggŏ-¹baw, Naxi Geba)
53 Ugar Ugaritic 133 Sarb Old South Arabian
54 Hrkt Japanese syllabaries (alias for Hiragana+Katakana) 134 Bass Bassa Vah
55 Bugi Buginese 135 Dupl Duployan shorthand, Duployan stenography
56 Glag Glagolitic 136 Elba Elbasan
57 Khar Kharoshthi 137 Gran Grantha
58 Sylo Syloti Nagri 138 Kpel Kpelle
59 Talu New Tai Lue 139 Loma Loma
60 Tfng Tifinagh (Berber) 140 Mend Mende Kikakui
61 Xpeo Old Persian 141 Merc Meroitic Cursive
62 Bali Balinese 142 Narb Old North Arabian (Ancient North Arabian)
63 Batk Batak 143 Nbat Nabataean
64 Blis Blissymbols 144 Palm Palmyrene
65 Brah Brahmi 145 Sind Khudawadi, Sindhi
66 Cham Cham 146 Wara Warang Citi (Varang Kshiti)
67 Cirt Cirth 147 Afak Afaka
68 Cyrs Cyrillic (Old Church Slavonic variant) 148 Jurc Jurchen
69 Egyd Egyptian demotic 149 Mroo Mro, Mru
70 Egyh Egyptian hieratic 150 Nshu Nushu
71 Egyp Egyptian hieroglyphs 151 Shrd Sharada, Śāradā
72 Geok Khutsuri (Asomtavruli and Nuskhuri) 152 Sora Sora Sompeng
73 Hans Han (Simplified variant) 153 Takr Takri, Ṭākrī, Ṭāṅkrī
74 Hant Han (Traditional variant) 154 Tang Tangut
75 Hmng Pahawh Hmong 155 Wole Woleai
76 Hung Old Hungarian (Hungarian Runic) 156 Hluw Anatolian hieroglyphs (Luwian hieroglyphs, Hittite hieroglyphs)
77 Inds Indus (Harappan) 157 Khoj Khojki
78 Java Javanese 158 Tirh Tirhuta
79 Kali Kayah Li -1 Invalid code

Ucollator

The Ucollator API (in mobile and wearable applications) performs locale-sensitive string comparison. It builds searching and sorting routines for natural language text and provides correct sorting orders for most supported locales. If specific data for a locale is not available, the order eventually falls back to the CLDR root sort order. The sorting order can be customized by providing your own set of rules. For more information, see the ICU Collation Customization section of the User Guide.

Udate

The Udate API (in mobile and wearable applications) consists of functions that convert dates and times from their internal representations to textual form and back again in a language-independent manner. Converting from the internal representation (milliseconds since midnight UTC, January 1, 1970) to text is known as formatting, and converting from text to milliseconds is known as parsing. Tizen currently defines only one concrete handle (i18n_udate_format_h), which can handle practically all normal date formatting and parsing actions.
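
To make the internal representation concrete, the following sketch turns a millisecond count into text using only the standard C library. It is not the Udate API and it ignores locales entirely; the format_epoch_millis() name and the fixed "YYYY-MM-DD hh:mm:ss" layout are assumptions chosen for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Formats a non-negative count of milliseconds since 1970-01-01 00:00:00 UTC
 * as "YYYY-MM-DD hh:mm:ss" (UTC). Returns the number of characters written,
 * or 0 on failure. Standard C only; this is not the Udate API. */
size_t format_epoch_millis(int64_t millis, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0 && millis >= 0);

    time_t seconds = (time_t)(millis / 1000); /* drop the millisecond part */
    struct tm *utc = gmtime(&seconds);
    if (utc == NULL)
        return 0;
    return strftime(out, out_size, "%Y-%m-%d %H:%M:%S", utc);
}
```

With a buffer of at least 20 bytes, a value of 0 produces "1970-01-01 00:00:00". A locale-aware formatter such as the one behind i18n_udate_format_h starts from the same millisecond value but chooses field order, names, and separators from locale data.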

The Udate format helps you to format and parse dates for any locale. Your code can be completely independent of the locale conventions for months, days of the week, or even the lunar or solar calendar format.

You can pass in different options for the arguments for date and time style to control the length of the result; you can select from SHORT, MEDIUM, LONG, and FULL. The exact result depends on the locale.

  • I18N_UDATE_SHORT is completely numeric, such as 12/13/52 or 3:30pm.
  • I18N_UDATE_MEDIUM is longer, such as Jan 12, 1952.
  • I18N_UDATE_LONG is longer still, such as January 12, 1952 or 3:30:32pm.
  • I18N_UDATE_FULL is completely specified, such as Saturday, April 12, 1952 AD or 3:30:42pm PST.

Date and Time Patterns

The date and time formats are specified by the date and time pattern strings. Within the date and time pattern strings, all unquoted ASCII letters (A-Z and a-z) are reserved as pattern letters representing calendar fields. The i18n_udate_format_h handle supports the date and time formatting algorithm and pattern letters defined by the UTS#35 Unicode Locale Data Markup Language (LDML). It is further documented in the ICU User Guide.

Udatepg

The Udatepg API (in mobile and wearable applications) enables flexible generation of date format patterns, such as "yy-MM-dd". You can build up the generator by adding successive patterns. After this, you can query the generator with a pattern that includes only the desired fields and lengths. The generator returns the pattern that is most similar to the query.

The main method is the i18n_udatepg_get_best_pattern() function, since normally the Udatepg API is pre-built with data from a particular locale. However, generators can be built directly from other data as well.

Uenumeration

The Uenumeration API (in mobile and wearable applications) enables you to create an enumeration object out of a given set of strings. The object can be created out of an array of const char* strings or an array of i18n_uchar* strings.

The enumeration object enables navigation through the enumeration values, with the use of the i18n_uenumeration_next() or i18n_uenumeration_unext() function (depending on the type used for creating the enumeration object), and with the i18n_uenumeration_reset() function.

You can get the number of values stored in the enumeration object with the i18n_uenumeration_count() function.

Ulocale

The Ulocale API (in mobile and wearable applications) represents a specific geographical, political, or cultural region. Locale-sensitive operations use the Ulocale functions to tailor information for the user. For example, displaying a number is a locale-sensitive operation. The number must be formatted according to the customs and conventions of the user's native country, region, or culture.

You create a locale with one of the options listed below. Each component is separated by an underscore in the locale string:

Table: Options for creating a locale
Options for creating a locale
newLanguage

newLanguage + newCountry

newLanguage + newCountry + newVariant

The first option is a valid ISO Language Code. These codes are the lower-case two-letter codes as defined by the ISO-639 standard.

The second option includes an additional ISO Country Code.

The third option requires additional information on the variant. The variant codes are vendor-specific and browser-specific. For example, use WIN for Windows, MAC for Macintosh, and POSIX for POSIX. Where there are two variants, separate them with an underscore, and put the most important one first. For example, a Traditional Spanish collation might be referenced with es, ES, and Traditional_WIN.
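
The underscore-separated layout can be sketched in plain C as follows. The split_locale_id() function and the locale_parts structure are hypothetical helpers written for this illustration, with arbitrary buffer sizes; in an application, the Ulocale functions are the intended way to query the components of a locale.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the three components of a locale string. */
typedef struct {
    char language[16];
    char country[16];
    char variant[64];
} locale_parts;

/* Copies [begin, end) into dst, truncating to cap - 1 bytes, and terminates it. */
static void copy_span(char *dst, size_t cap, const char *begin, const char *end)
{
    size_t n = (size_t)(end - begin);
    if (n >= cap)
        n = cap - 1;
    memcpy(dst, begin, n);
    dst[n] = '\0';
}

/* Splits "language[_country[_variant]]" at the first two underscores.
 * Everything after the second underscore is the variant, so a compound
 * variant such as "Traditional_WIN" stays in one piece. */
locale_parts split_locale_id(const char *id)
{
    assert(id != NULL);

    locale_parts p = { "", "", "" };
    const char *end = id + strlen(id);

    const char *first = strchr(id, '_');
    if (first == NULL) {
        copy_span(p.language, sizeof p.language, id, end);
        return p;
    }
    copy_span(p.language, sizeof p.language, id, first);

    const char *second = strchr(first + 1, '_');
    if (second == NULL) {
        copy_span(p.country, sizeof p.country, first + 1, end);
        return p;
    }
    copy_span(p.country, sizeof p.country, first + 1, second);
    copy_span(p.variant, sizeof p.variant, second + 1, end);
    return p;
}
```

For the string "es_ES_Traditional_WIN", the sketch yields the language "es", the country "ES", and the variant "Traditional_WIN". It performs no validity check, which matches the behavior described for locale identifiers in general.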

Because a locale is just an identifier for a region, no validity check is performed when you specify a locale. If you want to see whether particular resources are available for the locale you asked for, you must query those resources.

Once you have specified a locale, you can query it for information about itself. Use i18n_ulocale_get_language() to get the ISO Language Code. You can use i18n_ulocale_get_display_name() to get the full name of the locale in a form suitable for display to the user.

Unormalization

The Unormalization API (in mobile and wearable applications) performs standard Unicode normalization. All instances of i18n_unormalizer_s are immutable. Instances returned by i18n_unormalization_get_instance() are singletons that must not be deleted by the caller.

Unumber

The Unumber API (in mobile and wearable applications) helps you to format and parse numbers for any locale. Your code can be completely independent of the locale conventions for decimal points, thousands-separators, or even the particular decimal digits used, or whether the number format is even decimal. There are different number format styles like decimal, currency, percent and spellout.

Usearch

The Usearch API (in mobile and wearable applications) provides language-sensitive text searching based on the comparison rules defined in a Ucollator data struct. This ensures that language-specific peculiarities are handled. For example, for the German collator, the characters ß and SS are matched if case is chosen to be ignored. That is why it can be important to pass a locale when creating the search handle with the i18n_usearch_create_new() function.

Uset

Uset is a mutable set of Unicode characters and multicharacter strings. The sets represent character classes used in regular expressions. A character class specifies a subset of the Unicode code points. The legal code points are U+0000 to U+10FFFF, inclusive.

The set supports 2 types of functions:

  • The operand function allows the caller to modify the value of the set. The operand function works similarly to the boolean logic: a boolean OR is implemented by add, a boolean AND is implemented by retain, a boolean XOR is implemented by a complement taking an argument, and a boolean NOT is implemented by a complement with no argument. In terms of traditional set theory function names, add is a union, retain is an intersection, remove is an asymmetric difference, and complement with no argument is a set complement with respect to the superset range MIN_VALUE-MAX_VALUE.
  • The i18n_uset_apply_pattern() or i18n_uset_to_pattern() function. Unlike the functions that add characters or categories, and control the logic of the set, the i18n_uset_apply_pattern() function sets all attributes of a set at once, based on a string pattern.

Pattern Syntax

Patterns are accepted by the i18n_uset_create_pattern(), i18n_uset_create_pattern_options(), and i18n_uset_apply_pattern() functions and returned by the i18n_uset_to_pattern() function. The patterns follow a syntax similar to that employed by version 8 regular expression character classes.

Table: Examples of simple pattern syntaxes
Pattern Description
[] No characters
[a] Character 'a'
[ae] Characters 'a' and 'e'
[a-e] Characters 'a' through 'e' inclusive, in Unicode code point order
[\u4E01] Character U+4E01
[a{ab}{ac}] Character 'a' and the multicharacter strings 'ab' and 'ac'
[\p{Lu}] All characters in the general category 'uppercase letter'

Any character can be preceded by a backslash in order to remove any special meaning. Whitespace characters are ignored, unless they are escaped.

Property patterns specify a set of characters having a certain property as defined by the Unicode standard. Both the POSIX-like [:Lu:] and the Perl-like \p{Lu} syntax are recognized.

Patterns specify individual characters, ranges of characters, and Unicode property sets. When the elements are concatenated, they specify their union. To complement a set, place a '^' immediately after the opening '['. Property patterns are inverted by modifying their delimiters: [:^foo:] and \P{foo}. In any other location, '^' has no special meaning.

Ranges are indicated by placing a '-' between 2 characters, as in "a-z". This specifies the range of all characters from the left to the right, in Unicode order. If the left character is greater than or equal to the right character, it is a syntax error. If a '-' occurs as the first character after the opening '[' or '[^', or if it occurs as the last character before the closing ']', it is taken as a literal. This means that [a\-b], [-ab], and [ab-] all indicate the same set of three characters, 'a', 'b', and '-'.

Sets can be intersected using the '&' operator, and the asymmetric set difference can be taken using the '-' operator. For example, [[:L:]&[\u0000-\u0FFF]] indicates the set of all Unicode letters with values less than 4096. The operators ('&' and '-') have equal precedence and bind left-to-right. This means that [[:L:]-[a-z]-[\u0100-\u01FF]] is equivalent to [[[:L:]-[a-z]]-[\u0100-\u01FF]]. This only really matters for difference; intersection is commutative.

Table: Examples of set syntaxes
Set Description
[a] Set containing 'a'
[a-z] Set containing 'a' through 'z' and all letters in between, in Unicode order
[^a-z] Set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF
[[pat1][pat2]] Union of sets specified by pat1 and pat2
[[pat1]&[pat2]] Intersection of sets specified by pat1 and pat2
[[pat1]-[pat2]] Asymmetric difference of sets specified by pat1 and pat2
[:Lu:] or \p{Lu} Set of characters having the specified Unicode property, in this case Unicode uppercase letters
[:^Lu:] or \P{Lu} Set of characters not having the given Unicode property
Note
You cannot add an empty string ("") to a set.

Formal Syntax

The following table describes the formal syntax of the patterns.

Table: Formal syntax patterns
Pattern Description
pattern := ('[' '^'? item* ']') | property
item := char | (char '-' char) | pattern-expr
pattern-expr := pattern | pattern-expr pattern | pattern-expr op pattern
op := '&' | '-'
special := '[' | ']' | '-'
char := Any character that is not special | ('\' any character) | ('\u' hex hex hex hex)
property := Unicode property set pattern
a := b a can be replaced by b
a? 0 or 1 instance of a
a* 0 or more instances of a
a | b Either a or b
'a' Literal string between the quotes

Ustring

The Ustring API (in mobile and wearable applications) provides general Unicode string handling. Some functions are similar in name, signature, and behavior to the ANSI C <string.h> functions, and other functions provide more Unicode-specific functionality.

The i18n API uses 16-bit Unicode (UTF-16) in the form of arrays of i18n_uchar code units. UTF-16 encodes each Unicode code point with either 1 or 2 i18n_uchar code units. This is the default form of Unicode, and a forward-compatible extension of the original, fixed-width form that was known as UCS-2. UTF-16 superseded UCS-2 with Unicode 2.0 in 1996.

The i18n API also handles 16-bit Unicode text with unpaired surrogates. Such text is not well-formed UTF-16. Code-point-related functions treat unpaired surrogates as surrogate code points, that is, as separate units.

Although UTF-16 is a variable-width encoding form, like some legacy multi-byte encodings, it is much more efficient even for random access because the code unit values for single-unit characters, lead units, and trail units are completely disjoint. This means that it is easy to determine character (code point) boundaries from random offsets in the string.

Unicode (UTF-16) string processing is optimized for the single-unit case. Although it is important to support supplementary characters, which use pairs of lead/trail code units called "surrogates", their occurrence is rare. Almost all characters in modern use require only a single i18n_uchar code unit (that is, their code point values are <= 0xffff).
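
The disjoint value ranges make boundary detection and decoding short to write. The following sketch uses uint16_t in place of i18n_uchar so that it compiles with the standard C library alone; the function names are hypothetical and are not part of the Ustring API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lead (high) surrogates occupy 0xD800-0xDBFF, trail (low) 0xDC00-0xDFFF.
 * Every other 16-bit value is a complete single-unit character. */
static int is_lead(uint16_t u)  { return (u & 0xFC00) == 0xD800; }
static int is_trail(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

/* Moves an arbitrary offset back to the start of the code point it falls in.
 * Only one step back is ever needed. */
size_t code_point_start(const uint16_t *s, size_t length, size_t offset)
{
    assert(s != NULL && offset < length);
    if (offset > 0 && is_trail(s[offset]) && is_lead(s[offset - 1]))
        return offset - 1;
    return offset;
}

/* Decodes the code point starting at *index and advances *index past it.
 * An unpaired surrogate is returned as its own surrogate code point. */
uint32_t next_code_point(const uint16_t *s, size_t length, size_t *index)
{
    assert(s != NULL && index != NULL && *index < length);
    uint16_t unit = s[(*index)++];
    if (is_lead(unit) && *index < length && is_trail(s[*index])) {
        uint16_t trail = s[(*index)++];
        return 0x10000u + (((uint32_t)unit - 0xD800u) << 10)
                        + ((uint32_t)trail - 0xDC00u);
    }
    return unit; /* the common single-unit case, or an unpaired surrogate */
}
```

For the units 0x0041, 0xD83D, 0xDE00, 0x0042, the decoder returns U+0041, U+1F600, and U+0042, and an offset of 2 (the trail unit) is moved back to 1.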

Character Set Mapping Tables

The i18n API provides a character set conversion with mapping tables for a number of important codepages. The default tables are a subset of IBM's CDRA conversion table repository. ICU's Converter Explorer shows aliases and codepage charts for the default tables that are built into a standard ICU distribution.

Conversions for most codepages are implemented differently on different platforms. The mapping tables are provided from many different sources, so that i18n API users and others can use these tables to get the same conversion behavior as on the original platforms.

The mapping tables and some of the source code of the tools that collected these tables are checked into a CVS repository.

For more information about character sets, codepages and encodings, see Coded Character Sets on the IBM site.