i18n
You can generate flexible number or date format patterns, and format and parse dates and numbers for any locale. The i18n API is implemented using the ICU library.
Ubrk
The Ubrk API (in mobile and wearable applications) is used to find the location of boundaries in text. The i18n_ubreak_iterator_h handle maintains a current position and scans the text, returning the index of the characters where boundaries occur.
The following boundary analyzing methods are available:
- Line boundary analysis determines where a text string can be broken when line-wrapping. The mechanism correctly handles punctuation and hyphenated words.
- Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks, such as quotation marks and parentheses.
- Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double-click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides.
- Character boundary analysis identifies the boundaries of Extended Grapheme Clusters, which are groupings of codepoints that must be treated as character-like units for many text operations. For additional information on grapheme clusters and guidelines on their use, see Unicode Standard Annex #29, Unicode Text Segmentation.
- Title boundary analysis locates all positions, typically starts of words, that must be set to Title Case when title casing the text.
- The text boundary positions are found according to the rules described in Unicode Standard Annex #29, Text Boundaries, and Unicode Standard Annex #14, Line Breaking Properties.
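The boundary-scanning idea can be sketched in plain C. The toy scanner below handles only ASCII word runs (alternating alphanumeric and non-alphanumeric spans), whereas a real i18n_ubreak_iterator_h applies the full UAX #29 rules; the function name and behavior are illustrative, not part of the Tizen API:

```c
#include <ctype.h>
#include <string.h>

/* Toy word-boundary scanner: returns the index of the next boundary at or
 * after `pos`, or the string length if no boundary remains. It only
 * switches between runs of ASCII alphanumeric and non-alphanumeric
 * characters; real Ubrk iterators follow the UAX #29 segmentation rules. */
size_t next_word_boundary(const char *text, size_t pos)
{
    size_t len = strlen(text);
    if (pos >= len)
        return len;
    int in_word = isalnum((unsigned char)text[pos]) != 0;
    size_t i = pos + 1;
    while (i < len && (isalnum((unsigned char)text[i]) != 0) == in_word)
        i++;
    return i;
}
```

Calling the scanner repeatedly walks the boundaries of "Hello, world" at indices 5 (end of "Hello"), 7 (start of "world"), and 12 (end of text).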
Ucalendar
The Ucalendar API (in mobile and wearable applications) is used for converting between a Udate object and a set of integer fields such as I18N_UCALENDAR_YEAR, I18N_UCALENDAR_MONTH, I18N_UCALENDAR_DAY, and I18N_UCALENDAR_HOUR. A Udate object represents a specific instant in time with one millisecond precision.
The types of Ucalendar interpret a Udate according to the rules of a specific calendar system, such as the Gregorian or traditional system.
A Ucalendar object can produce all the time field values needed to implement the date-time formatting for a particular language and calendar style (for example, Japanese-Gregorian, Japanese-Traditional).
When computing a Udate from the time fields, two special circumstances can arise: the information can be insufficient to compute the Udate (for example, you have only the year and the month, but not the day of the month), or the information can be inconsistent (such as "Tuesday, July 15, 1996", even though July 15, 1996 is actually a Monday).
- Insufficient information: The calendar uses default information to specify the missing fields. The defaults can vary by calendar: for the Gregorian calendar, the default for a field is the same as that of the start of the epoch, such as I18N_UCALENDAR_YEAR = 1970, I18N_UCALENDAR_MONTH = JANUARY, and I18N_UCALENDAR_DATE = 1.
- Inconsistent information: If the fields conflict, the calendar prefers the most recently set fields. For example, when determining the day, the calendar looks for one of the field combinations listed in the following table. The most recent combination, as determined by the most recently set single field, is used.
Combinations of the calendar fields |
---|
I18N_UCALENDAR_MONTH + I18N_UCALENDAR_DAY_OF_MONTH |
I18N_UCALENDAR_MONTH + I18N_UCALENDAR_WEEK_OF_MONTH + I18N_UCALENDAR_DAY_OF_WEEK |
I18N_UCALENDAR_MONTH + I18N_UCALENDAR_DAY_OF_WEEK_IN_MONTH + I18N_UCALENDAR_DAY_OF_WEEK |
I18N_UCALENDAR_DAY_OF_YEAR |
I18N_UCALENDAR_DAY_OF_WEEK + I18N_UCALENDAR_WEEK_OF_YEAR |
For the time of day:
Combinations of the calendar fields |
---|
I18N_UCALENDAR_HOUR_OF_DAY |
I18N_UCALENDAR_AM_PM + I18N_UCALENDAR_HOUR |
Note |
---|
For some non-Gregorian calendars, different fields are necessary for complete disambiguation. For example, a full specification of the historical Arabic astronomical calendar requires the year, month, day-of-month and day-of-week in some cases. |
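The resolution of the simplest combination, I18N_UCALENDAR_MONTH + I18N_UCALENDAR_DAY_OF_MONTH, can be sketched with the well-known days-from-civil algorithm. This is only an illustration of how a set of integer fields collapses into a single millisecond instant; it assumes UTC and omits the time zone and DST handling that a real Ucalendar performs:

```c
#include <stdint.h>

/* Days since 1970-01-01 for a proleptic Gregorian date
 * (Howard Hinnant's days_from_civil algorithm). */
int64_t days_from_civil(int y, int m, int d)
{
    y -= m <= 2;
    int era = (y >= 0 ? y : y - 399) / 400;
    unsigned yoe = (unsigned)(y - era * 400);                 /* [0, 399] */
    unsigned doy = (153u * (unsigned)(m + (m > 2 ? -3 : 9)) + 2) / 5
                   + (unsigned)d - 1;
    unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return (int64_t)era * 146097 + (int64_t)doe - 719468;
}

/* A Udate is milliseconds since January 1, 1970, 00:00 UTC. */
int64_t udate_from_fields(int year, int month, int day,
                          int hour, int minute, int second)
{
    return ((days_from_civil(year, month, day) * 24 + hour) * 60 + minute)
               * 60000LL + second * 1000LL;
}
```

For example, (1970, 1, 2, 0, 0, 0) resolves to 86400000 ms, and July 15, 1996 lands on day 9692, which is indeed a Monday; this is why "Tuesday, July 15, 1996" is inconsistent input.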
Uchar
The Uchar API (in mobile and wearable applications) provides a low-level access to the Unicode Character Database.
Unicode assigns each code point (not just each assigned character) values for several properties. Most of these are simple boolean flags or constants from a small enumerated list. For some properties, the values are strings or other more complex types.
For more information, see About the Unicode Character Database and ICU User Guide chapter on Properties.
The following table describes the details of script codes that you can get using the i18n_uchar_get_int_property_value() function.
Value | Code | English name | Value | Code | English name |
---|---|---|---|---|---|
0 | Zyyy | Code for undetermined script | 80 | Latf | Latin (Fraktur variant) |
1 | Zinh | Code for inherited script | 81 | Latg | Latin (Gaelic variant) |
2 | Arab | Arabic | 82 | Lepc | Lepcha (Rong) |
3 | Armn | Armenian | 83 | Lina | Linear A |
4 | Beng | Bengali | 84 | Mand | Mandaic, Mandaean |
5 | Bopo | Bopomofo | 85 | Maya | Mayan hieroglyphs |
6 | Cher | Cherokee | 86 | Mero | Meroitic hieroglyphs |
7 | Copt | Coptic | 87 | Nkoo | N’Ko |
8 | Cyrl | Cyrillic | 88 | Orkh | Old Turkic, Orkhon Runic |
9 | Dsrt | Deseret (Mormon) | 89 | Perm | Old Permic |
10 | Deva | Devanagari (Nagari) | 90 | Phag | Phags-pa |
11 | Ethi | Ethiopic (Geʻez) | 91 | Phnx | Phoenician |
12 | Geor | Georgian (Mkhedruli) | 92 | Plrd | Miao (Pollard) |
13 | Goth | Gothic | 93 | Roro | Rongorongo |
14 | Grek | Greek | 94 | Sara | Sarati |
15 | Gujr | Gujarati | 95 | Syre | Syriac (Estrangelo variant) |
16 | Guru | Gurmukhi | 96 | Syrj | Syriac (Western variant) |
17 | Hani | Han (Hanzi, Kanji, Hanja) | 97 | Syrn | Syriac (Eastern variant) |
18 | Hang | Hangul (Hangŭl, Hangeul) | 98 | Teng | Tengwar |
19 | Hebr | Hebrew | 99 | Vaii | Vai |
20 | Hira | Hiragana | 100 | Visp | Visible Speech |
21 | Knda | Kannada | 101 | Xsux | Cuneiform, Sumero-Akkadian |
22 | Kana | Katakana | 102 | Zxxx | Code for unwritten documents |
23 | Khmr | Khmer | 103 | Zzzz | Code for uncoded script |
24 | Laoo | Lao | 104 | Cari | Carian |
25 | Latn | Latin | 105 | Jpan | Japanese (alias for Han+Hiragana+Katakana) |
26 | Mlym | Malayalam | 106 | Lana | Tai Tham (Lanna) |
27 | Mong | Mongolian | 107 | Lyci | Lycian |
28 | Mymr | Myanmar (Burmese) | 108 | Lydi | Lydian |
29 | Ogam | Ogham | 109 | Olck | Ol Chiki (Ol Cemet’, Ol Santali) |
30 | Ital | Old Italic (Etruscan, Oscan) | 110 | Rjng | Rejang (Redjang, Kaganga) |
31 | Orya | Oriya | 111 | Saur | Saurashtra |
32 | Runr | Runic | 112 | Sgnw | SignWriting |
33 | Sinh | Sinhala | 113 | Sund | Sundanese |
34 | Syrc | Syriac | 114 | Moon | Moon (Mooncode, Moonscript, Moontype) |
35 | Taml | Tamil | 115 | Mtei | Meitei Mayek (Meithei, Meetei) |
36 | Telu | Telugu | 116 | Armi | Imperial Aramaic |
37 | Thaa | Thaana | 117 | Avst | Avestan |
38 | Thai | Thai | 118 | Cakm | Chakma |
39 | Tibt | Tibetan | 119 | Kore | Korean (alias for Hangul+Han) |
40 | Cans | Unified Canadian Aboriginal Syllabics | 120 | Kthi | Kaithi |
41 | Yiii | Yi | 121 | Mani | Manichaean |
42 | Tglg | Tagalog (Baybayin, Alibata) | 122 | Phli | Inscriptional Pahlavi |
43 | Hano | Hanunoo (Hanunóo) | 123 | Phlp | Psalter Pahlavi |
44 | Buhd | Buhid | 124 | Phlv | Book Pahlavi |
45 | Tagb | Tagbanwa | 125 | Prti | Inscriptional Parthian |
46 | Brai | Braille | 126 | Samr | Samaritan |
47 | Cprt | Cypriot | 127 | Tavt | Tai Viet |
48 | Limb | Limbu | 128 | Zmth | Mathematical notation |
49 | Linb | Linear B | 129 | Zsym | Symbols |
50 | Osma | Osmanya | 130 | Bamu | Bamum |
51 | Shaw | Shavian (Shaw) | 131 | Lisu | Lisu (Fraser) |
52 | Tale | Tai Le | 132 | Nkgb | Nakhi Geba ('Na-'Khi ²Ggŏ-¹baw, Naxi Geba) |
53 | Ugar | Ugaritic | 133 | Sarb | Old South Arabian |
54 | Hrkt | Japanese syllabaries (alias for Hiragana+Katakana) | 134 | Bass | Bassa Vah |
55 | Bugi | Buginese | 135 | Dupl | Duployan shorthand, Duployan stenography |
56 | Glag | Glagolitic | 136 | Elba | Elbasan |
57 | Khar | Kharoshthi | 137 | Gran | Grantha |
58 | Sylo | Syloti Nagri | 138 | Kpel | Kpelle |
59 | Talu | New Tai Lue | 139 | Loma | Loma |
60 | Tfng | Tifinagh (Berber) | 140 | Mend | Mende Kikakui |
61 | Xpeo | Old Persian | 141 | Merc | Meroitic Cursive |
62 | Bali | Balinese | 142 | Narb | Old North Arabian (Ancient North Arabian) |
63 | Batk | Batak | 143 | Nbat | Nabataean |
64 | Blis | Blissymbols | 144 | Palm | Palmyrene |
65 | Brah | Brahmi | 145 | Sind | Khudawadi, Sindhi |
66 | Cham | Cham | 146 | Wara | Warang Citi (Varang Kshiti) |
67 | Cirt | Cirth | 147 | Afak | Afaka |
68 | Cyrs | Cyrillic (Old Church Slavonic variant) | 148 | Jurc | Jurchen |
69 | Egyd | Egyptian demotic | 149 | Mroo | Mro, Mru |
70 | Egyh | Egyptian hieratic | 150 | Nshu | Nushu |
71 | Egyp | Egyptian hieroglyphs | 151 | Shrd | Sharada, Śāradā |
72 | Geok | Khutsuri (Asomtavruli and Nuskhuri) | 152 | Sora | Sora Sompeng |
73 | Hans | Han (Simplified variant) | 153 | Takr | Takri, Ṭākrī, Ṭāṅkrī |
74 | Hant | Han (Traditional variant) | 154 | Tang | Tangut |
75 | Hmng | Pahawh Hmong | 155 | Wole | Woleai |
76 | Hung | Old Hungarian (Hungarian Runic) | 156 | Hluw | Anatolian hieroglyphs (Luwian hieroglyphs, Hittite hieroglyphs) |
77 | Inds | Indus (Harappan) | 157 | Khoj | Khojki |
78 | Java | Javanese | 158 | Tirh | Tirhuta |
79 | Kali | Kayah Li | -1 | - | Invalid code |
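As a sketch of how the table above is consumed, the snippet below maps a few of the integer values back to their four-letter codes. Only the value/code pairs come from the table; the struct, the lookup function, and the assumption that the relevant property constant is named I18N_UCHAR_SCRIPT are illustrative:

```c
#include <string.h>

/* Illustrative subset of the script-code table; the full list is what the
 * script property returned by i18n_uchar_get_int_property_value() indexes. */
typedef struct {
    int value;
    const char *code;
    const char *name;
} script_entry;

static const script_entry scripts[] = {
    { 0,  "Zyyy", "Code for undetermined script" },
    { 8,  "Cyrl", "Cyrillic" },
    { 17, "Hani", "Han (Hanzi, Kanji, Hanja)" },
    { 25, "Latn", "Latin" },
    { 38, "Thai", "Thai" },
};

const char *script_code_for_value(int value)
{
    for (size_t i = 0; i < sizeof scripts / sizeof scripts[0]; i++)
        if (scripts[i].value == value)
            return scripts[i].code;
    return "Zzzz"; /* code for uncoded script, per the table */
}
```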
Ucollator
The Ucollator API (in mobile and wearable applications) performs locale-sensitive string comparison. It builds searching and sorting routines for natural language text and provides correct sorting orders for most supported locales. If specific data for a locale is not available, the order eventually falls back to the CLDR root sort order. The sorting order can be customized by providing your own set of rules. For more information, see the ICU Collation Customization section of the User Guide.
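A rough intuition for strength-limited comparison: at a strength that ignores case, "Apple" and "apple" compare equal. The toy comparator below hard-codes that single rule for ASCII; a real Ucollator derives its ordering from CLDR locale data and also handles accents, contractions, and expansions:

```c
#include <ctype.h>

/* Toy "ignore case" comparison for ASCII strings: returns <0, 0, or >0,
 * like strcmp, but treats 'A' and 'a' as equal. Real collation is driven
 * by locale rules, not by a fixed tolower() mapping. */
int primary_compare(const char *a, const char *b)
{
    while (*a && *b) {
        int ca = tolower((unsigned char)*a++);
        int cb = tolower((unsigned char)*b++);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    return (*a != 0) - (*b != 0);
}
```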
Udate
The Udate API (in mobile and wearable applications) consists of functions that convert dates and times from their internal representations to textual form and back again in a language-independent manner. Converting from the internal representation (milliseconds since midnight, January 1, 1970 UTC) to text is known as formatting, and converting from text to milliseconds is known as parsing. Tizen currently defines only one concrete handle (i18n_udate_format_h), which can handle practically all normal date formatting and parsing actions.
The Udate format helps you to format and parse dates for any locale. Your code can be completely independent of the locale conventions for months, days of the week, or even the lunar or solar calendar format.
You can pass in different options for the date and time style arguments to control the length of the result; you can select from SHORT, MEDIUM, LONG, and FULL. The exact result depends on the locale.
- I18N_UDATE_SHORT is completely numeric, such as 12/13/52 or 3:30pm
- I18N_UDATE_MEDIUM is longer, such as Jan 12, 1952
- I18N_UDATE_LONG is longer, such as January 12, 1952 or 3:30:32pm
- I18N_UDATE_FULL is completely specified, such as Tuesday, April 12, 1952 AD or 3:30:42pm PST.
Date and Time Patterns
The date and time formats are specified by the date and time pattern strings. Within the date and time pattern strings, all unquoted ASCII letters (A-Z and a-z) are reserved as pattern letters representing calendar fields. The i18n_udate_format_h handle supports the date and time formatting algorithm and pattern letters defined by the UTS#35 Unicode Locale Data Markup Language (LDML). It is further documented in the ICU User Guide.
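The pattern-letter substitution can be sketched for a tiny subset of the UTS #35 letters. This toy formatter understands only "yyyy", "yy", "MM", and "dd" and copies everything else literally; a real i18n_udate_format_h supports the full LDML letter set, localized month and day names, and quoting:

```c
#include <stdio.h>
#include <string.h>

/* Expands a tiny subset of LDML pattern letters into digits:
 * "yyyy" -> 4-digit year, "yy" -> 2-digit year,
 * "MM" -> 2-digit month, "dd" -> 2-digit day.
 * All other characters are copied through as literals. */
void format_ldml(const char *pattern, int year, int month, int day, char *out)
{
    const char *p = pattern;
    char *o = out;
    while (*p) {
        if (strncmp(p, "yyyy", 4) == 0) { o += sprintf(o, "%04d", year); p += 4; }
        else if (strncmp(p, "yy", 2) == 0) { o += sprintf(o, "%02d", year % 100); p += 2; }
        else if (strncmp(p, "MM", 2) == 0) { o += sprintf(o, "%02d", month); p += 2; }
        else if (strncmp(p, "dd", 2) == 0) { o += sprintf(o, "%02d", day); p += 2; }
        else { *o++ = *p++; }
    }
    *o = '\0';
}
```

For example, the pattern "yy-MM-dd" applied to January 12, 1952 yields "52-01-12".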
Udatepg
The Udatepg API (in mobile and wearable applications) enables flexible generation of date format patterns, such as "yy-MM-dd". The user can build up the generator by adding successive patterns. After this, a query can be made using a pattern that includes only the desired fields and lengths. The generator returns the pattern that is most similar to it.
The main function is i18n_udatepg_get_best_pattern(), since normally the Udatepg generator is pre-built with data from a particular locale. However, generators can also be built directly from other data.
Uenumeration
The Uenumeration API (in mobile and wearable applications) enables you to create an enumeration object out of a given set of strings. The object can be created out of an array of const char* strings or an array of i18n_uchar* strings.
The enumeration object enables navigation through the enumeration values, with the use of the i18n_uenumeration_next() or i18n_uenumeration_unext() function (depending on the type used for creating the enumeration object), and with the i18n_uenumeration_reset() function.
You can get the number of values stored in the enumeration object with the i18n_uenumeration_count() function.
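The enumeration pattern itself is simple: a cursor over an array of strings with next/reset/count operations. The sketch below mirrors the shape of i18n_uenumeration_next(), i18n_uenumeration_reset(), and i18n_uenumeration_count(); the struct and function names are illustrative, not the Tizen API:

```c
#include <stddef.h>

/* Minimal string-enumeration cursor. */
typedef struct {
    const char **values; /* backing array of strings */
    size_t count;        /* number of values */
    size_t pos;          /* current cursor position */
} str_enumeration;

/* Returns the next value, or NULL when the enumeration is exhausted. */
const char *enumeration_next(str_enumeration *e)
{
    return e->pos < e->count ? e->values[e->pos++] : NULL;
}

/* Rewinds the cursor to the first value. */
void enumeration_reset(str_enumeration *e) { e->pos = 0; }

/* Returns the number of values stored in the enumeration. */
size_t enumeration_count(const str_enumeration *e) { return e->count; }
```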
Ulocale
The Ulocale API (in mobile and wearable applications) represents a specific geographical, political, or cultural region. Locale-sensitive operations use the Ulocale functions to tailor information for the user. For example, displaying a number is a locale-sensitive operation. The number must be formatted according to the customs and conventions of the user's native country, region, or culture.
You create a locale with one of the options listed below. Each component is separated by an underscore in the locale string:
Options for creating a locale |
---|
newLanguage |
newLanguage + newCountry |
newLanguage + newCountry + newVariant |
The first option is a valid ISO Language Code. These codes are the lower-case, two-letter codes defined by the ISO 639 standard.
The second option includes an additional ISO Country Code.
The third option requires additional information on the variant. The variant codes are vendor- and browser-specific. For example, use WIN for Windows, MAC for Macintosh, and POSIX for POSIX. Where there are two variants, separate them with an underscore and put the most important one first. For example, a traditional Spanish collation can be referenced with the language "es", country "ES", and variant "Traditional_WIN".
Because a locale is just an identifier for a region, no validity check is performed when you specify a locale. If you want to see whether particular resources are available for the locale you asked for, you must query those resources.
Once you have specified a locale, you can query it for information about itself. Use the i18n_ulocale_get_language() function to get the ISO Language Code, and the i18n_ulocale_get_display_name() function to get the name of the language suitable for displaying to the user.
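The underscore-separated identifier structure described above can be split mechanically. The helper below is a simplified sketch (no buffer-length enforcement, no script subtags); it only illustrates how language, country, and variant sit inside a string such as "es_ES_Traditional_WIN":

```c
#include <stdio.h>
#include <string.h>

/* Splits a locale identifier of the form language[_COUNTRY[_VARIANT]].
 * Components that are absent are returned as empty strings. Note that the
 * variant itself may contain underscores (Traditional_WIN), which is why
 * the final conversion consumes the rest of the string. */
void split_locale(const char *locale, char *lang, char *country, char *variant)
{
    lang[0] = country[0] = variant[0] = '\0';
    sscanf(locale, "%[^_]_%[^_]_%s", lang, country, variant);
}
```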
Unormalization
The Unormalization API (in mobile and wearable applications) performs standard Unicode normalization. All instances of i18n_unormalizer_s are unmodifiable and immutable. Instances returned by i18n_unormalization_get_instance() are singletons that must not be deleted by the caller.
Unumber
The Unumber API (in mobile and wearable applications) helps you to format and parse numbers for any locale. Your code can be completely independent of the locale conventions for decimal points, thousands separators, or even the particular decimal digits used, or whether the number format is even decimal. There are different number format styles, such as decimal, currency, percent, and spellout.
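Grouping is the easiest convention to visualize. The toy formatter below inserts a separator every three digits, the way an en_US decimal format renders 1234567 as "1,234,567"; a real Unumber format also localizes the separator character, the grouping size, and the digits themselves:

```c
#include <stdio.h>
#include <string.h>

/* Formats a long with a grouping separator every three digits.
 * `out` must be large enough for the digits plus separators. */
void format_grouped(long value, char sep, char *out)
{
    char digits[32];
    int n = sprintf(digits, "%ld", value);
    char *o = out;
    for (int i = 0; i < n; i++) {
        /* Insert a separator before each group of three, but never right
         * after a leading minus sign or at the very start. */
        if (i > 0 && digits[i] != '-' && digits[i - 1] != '-' &&
            (n - i) % 3 == 0)
            *o++ = sep;
        *o++ = digits[i];
    }
    *o = '\0';
}
```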
Usearch
The Usearch API (in mobile and wearable applications) provides language-sensitive text searching based on the comparison rules defined in a Ucollator data struct. This ensures that language-specific peculiarities are handled correctly. For example, with the German collator, the characters ß and SS match if case is ignored. This is why it can be important to pass a locale when creating the search with the i18n_usearch_create_new() function.
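To see what collation adds, compare with a plain case-insensitive scan. The toy search below equates only ASCII case variants; it cannot equate sequences of different lengths such as ß and SS, which is exactly the capability a collation-driven Usearch provides:

```c
#include <ctype.h>
#include <string.h>

/* Case-insensitive substring search over ASCII: returns the byte offset
 * of the first match, or -1 if there is none. */
int search_ignore_case(const char *haystack, const char *needle)
{
    size_t nlen = strlen(needle);
    for (size_t i = 0; haystack[i]; i++) {
        size_t j = 0;
        while (j < nlen && haystack[i + j] &&
               tolower((unsigned char)haystack[i + j]) ==
                   tolower((unsigned char)needle[j]))
            j++;
        if (j == nlen)
            return (int)i;
    }
    return -1;
}
```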
Uset
Uset is a mutable set of Unicode characters and multicharacter strings. These sets represent the character classes used in regular expressions. A set specifies a subset of the Unicode code points. The legal code points are U+0000 to U+10FFFF, inclusive.
The set supports two kinds of functions:
- The operand functions allow the caller to modify the value of the set. They work similarly to boolean logic: a boolean OR is implemented by add, a boolean AND by retain, a boolean XOR by a complement taking an argument, and a boolean NOT by a complement with no argument. In terms of traditional set theory function names, add is a union, retain is an intersection, remove is an asymmetric difference, and complement with no argument is a set complement with respect to the superset range MIN_VALUE-MAX_VALUE.
- The i18n_uset_apply_pattern() or i18n_uset_to_pattern() function. Unlike the functions that add characters or categories, and control the logic of the set, the i18n_uset_apply_pattern() function sets all attributes of a set at once, based on a string pattern.
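The boolean-logic view of the operand functions can be demonstrated over a tiny universe of 32 code points, where bit i means "code point i is in the set". This is a conceptual sketch only; a real Uset covers U+0000..U+10FFFF and also holds multicharacter strings:

```c
#include <stdint.h>

/* Bit i set means code point i is a member. */
typedef uint32_t tiny_set;

tiny_set set_add(tiny_set s, tiny_set other)            { return s | other; }  /* boolean OR  / union */
tiny_set set_retain(tiny_set s, tiny_set other)         { return s & other; }  /* boolean AND / intersection */
tiny_set set_complement_arg(tiny_set s, tiny_set other) { return s ^ other; }  /* boolean XOR */
tiny_set set_complement(tiny_set s)                     { return ~s; }         /* boolean NOT, vs. the full range */
tiny_set set_remove(tiny_set s, tiny_set other)         { return s & ~other; } /* asymmetric difference */
```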
Pattern Syntax
Patterns are accepted by the i18n_uset_create_pattern(), i18n_uset_create_pattern_options(), and i18n_uset_apply_pattern() functions and returned by the i18n_uset_to_pattern() function. The patterns follow a syntax similar to that employed by version 8 regular expression character classes.
Pattern | Description |
---|---|
[] | No characters |
[a] | Character 'a' |
[ae] | Characters 'a' and 'e' |
[a-e] | Characters 'a' through 'e' inclusive, in Unicode code point order |
[\u4E01] | Character U+4E01 |
[a{ab}{ac}] | Character 'a' and the multicharacter strings 'ab' and 'ac' |
[\p{Lu}] | All characters in the general category 'uppercase letter' |
Any character can be preceded by a backslash in order to remove any special meaning. Whitespace characters are ignored, unless they are escaped.
Property patterns specify a set of characters having a certain property as defined by the Unicode standard. Both the POSIX-like [:Lu:] syntax and the Perl-like syntax \p{Lu} are recognized.
Patterns specify individual characters, ranges of characters, and Unicode property sets. When the elements are concatenated, they specify their union. To complement a set, place a '^' immediately after the opening '['. Property patterns are inverted by modifying their delimiters: [:^foo:] and \P{foo}. In any other location, '^' has no special meaning.
Ranges are indicated by placing a '-' between 2 characters, as in "a-z". This specifies the range of all characters from the left to the right, in Unicode order. If the left character is greater than or equal to the right character, it is a syntax error. If a '-' occurs as the first character after the opening '[' or '[^', or if it occurs as the last character before the closing ']', it is taken as a literal. This means that [a\-b], [-ab], and [ab-] all indicate the same set of three characters, 'a', 'b', and '-'.
Sets can be intersected using the '&' operator, or the asymmetric set difference can be taken using the '-' operator. For example, [[:L:]&[\u0000-\u0FFF]] indicates the set of all Unicode letters with values less than 4096. The operators '&' and '-' have equal precedence and bind left-to-right. This means that [[:L:]-[a-z]-[\u0100-\u01FF]] is equivalent to [[[:L:]-[a-z]]-[\u0100-\u01FF]]. This only really matters for difference; intersection is commutative.
Set | Description |
---|---|
[a] | Set containing 'a' |
[a-z] | Set containing 'a' through 'z' and all letters in between, in Unicode order |
[^a-z] | Set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF |
[[pat1][pat2]] | Union of sets specified by pat1 and pat2 |
[[pat1]&[pat2]] | Intersection of sets specified by pat1 and pat2 |
[[pat1]-[pat2]] | Asymmetric difference of sets specified by pat1 and pat2 |
[:Lu:] or \p{Lu} | Set of characters having the specified Unicode property, in this case Unicode uppercase letters |
[:^Lu:] or \P{Lu} | Set of characters not having the given Unicode property |
Note |
---|
You cannot add an empty string ("") to a set. |
Formal Syntax
The following table describes the formal syntax of the patterns.
Pattern | Description |
---|---|
pattern := | ('[' '^'? item* ']') | property |
item := | char | (char '-' char) | pattern-expr |
pattern-expr := | pattern | pattern-expr pattern | pattern-expr op pattern |
op := | '&' | '-' |
special := | '[' | ']' | '-' |
char := | Any character that is not special | ('\' any character) | ('\u' hex hex hex hex) |
property := | Unicode property set pattern |
a := b | a can be replaced by b |
a? | 0 or 1 instance of a |
a* | 0 or more instances of a |
a | b | Either a or b |
'a' | Literal string between the quotes |
Ustring
The Ustring API (in mobile and wearable applications) provides general Unicode string handling. Some functions are similar in name, signature, and behavior to the ANSI C <string.h> functions, while others provide more Unicode-specific functionality.
The i18n API uses 16-bit Unicode (UTF-16) in the form of arrays of i18n_uchar code units. UTF-16 encodes each Unicode code point with either 1 or 2 i18n_uchar code units. This is the default form of Unicode, and a forward-compatible extension of the original, fixed-width form known as UCS-2. UTF-16 superseded UCS-2 with Unicode 2.0 in 1996.
The i18n API also handles 16-bit Unicode text with unpaired surrogates. Such text is not well-formed UTF-16. Code-point-related functions treat unpaired surrogates as surrogate code points, that is, as separate units.
Although UTF-16 is a variable-width encoding form, like some legacy multi-byte encodings, it is much more efficient even for random access, because the code unit values for single-unit characters, lead units, and trail units are completely disjoint. This means that it is easy to determine character (code point) boundaries from random offsets in the string.
Unicode (UTF-16) string processing is optimized for the single-unit case. Although it is important to support supplementary characters, which use pairs of lead/trail code units called surrogates, their occurrence is rare. Almost all characters in modern use require only a single i18n_uchar code unit (that is, their code point values are <= 0xFFFF).
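The disjoint code unit ranges make code point counting a simple single pass. The sketch below counts code points in a NUL-terminated UTF-16 array, treating a lead surrogate (0xD800-0xDBFF) followed by a trail surrogate (0xDC00-0xDFFF) as one supplementary code point and an unpaired surrogate as its own code point, matching the ill-formed-text behavior described above:

```c
#include <stddef.h>
#include <stdint.h>

/* Counts code points in a NUL-terminated UTF-16 string. A valid
 * lead+trail surrogate pair counts as one code point; an unpaired
 * surrogate counts as one code point on its own. */
size_t utf16_count_codepoints(const uint16_t *s)
{
    size_t count = 0;
    for (size_t i = 0; s[i]; i++) {
        count++;
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++; /* skip the trail unit of a surrogate pair */
    }
    return count;
}
```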
Character Set Mapping Tables
The i18n API provides a character set conversion with mapping tables for a number of important codepages. The default tables are a subset of IBM's CDRA conversion table repository. ICU's Converter Explorer shows aliases and codepage charts for the default tables that are built into a standard ICU distribution.
Conversions for most codepages are implemented differently on different platforms. We are providing mapping tables here from many different sources, so that i18n users and others can use these tables to get the same conversion behavior as on the original platforms.
The mapping tables and some of the source code of the tools that collected these tables are checked into a CVS repository.
For more information about character sets, codepages and encodings, see Coded Character Sets on the IBM site.