i18n
You can generate flexible number or date format patterns, and format and parse dates and numbers for any locale. The i18n API is implemented using the ICU library.
Ubrk
The Ubrk API (in mobile and wearable applications) is used to find the location of boundaries in text. The i18n_ubreak_iterator_h handle maintains a current position and scans the text, returning the index of the characters where boundaries occur.
The following boundary analyzing methods are available:
- Line boundary analysis determines where a text string can be broken when line-wrapping. The mechanism correctly handles punctuation and hyphenated words.
- Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks, such as quotation marks and parentheses.
- Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double-click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides.
- Character boundary analysis identifies the boundaries of Extended Grapheme Clusters, which are groupings of codepoints that must be treated as character-like units for many text operations. For additional information on grapheme clusters and guidelines on their use, see Unicode Standard Annex #29, Unicode Text Segmentation.
- Title boundary analysis locates all positions, typically starts of words, that must be set to Title Case when title casing the text.
- The text boundary positions are found according to the rules described in Unicode Standard Annex #29, Text Boundaries, and Unicode Standard Annex #14, Line Breaking Properties.
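The boundary-scanning idea can be sketched in plain C. The toy scanner below handles only ASCII word runs (alternating alphanumeric and non-alphanumeric spans), whereas a real i18n_ubreak_iterator_h applies the full UAX #29 rules; the function name and behavior are illustrative, not part of the Tizen API:

```c
#include <ctype.h>
#include <string.h>

/* Toy word-boundary scanner: returns the index of the next boundary at or
 * after `pos`, or the string length if no boundary remains. It only
 * switches between runs of ASCII alphanumeric and non-alphanumeric
 * characters; real Ubrk iterators follow the UAX #29 segmentation rules. */
size_t next_word_boundary(const char *text, size_t pos)
{
    size_t len = strlen(text);
    if (pos >= len)
        return len;
    int in_word = isalnum((unsigned char)text[pos]) != 0;
    size_t i = pos + 1;
    while (i < len && (isalnum((unsigned char)text[i]) != 0) == in_word)
        i++;
    return i;
}
```

Calling the scanner repeatedly walks the boundaries of "Hello, world" at indices 5 (end of "Hello"), 7 (start of "world"), and 12 (end of text).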
Ucalendar
The Ucalendar API (in mobile and wearable applications) is used for converting between a Udate object and a set of integer fields such as I18N_UCALENDAR_YEAR, I18N_UCALENDAR_MONTH, I18N_UCALENDAR_DAY, and I18N_UCALENDAR_HOUR. A Udate object represents a specific instant in time with one millisecond precision.
The types of Ucalendar interpret a Udate according to the rules of a specific calendar system, such as the Gregorian or traditional system.
A Ucalendar object can produce all the time field values needed to implement the date-time formatting for a particular language and calendar style (for example, Japanese-Gregorian, Japanese-Traditional).
When computing a Udate from the time fields, two special circumstances can arise: the information can be insufficient to compute the Udate (for example, you have only the year and the month, but not the day of the month), or the information can be inconsistent (such as "Tuesday, July 15, 1996", even though July 15, 1996 is actually a Monday).
- Insufficient information: The calendar uses default information to specify the missing fields. The defaults can vary by calendar: for the Gregorian calendar, the default for a field is the same as that of the start of the epoch, such as I18N_UCALENDAR_YEAR = 1970, I18N_UCALENDAR_MONTH = JANUARY, and I18N_UCALENDAR_DATE = 1.
- Inconsistent information: If the fields conflict, the calendar prefers the most recently set fields. For example, when determining the day, the calendar looks for one of the field combinations listed in the following table. The most recent combination, as determined by the most recently set single field, is used.
Combinations of the calendar fields |
---|
I18N_UCALENDAR_MONTH + I18N_UCALENDAR_DAY_OF_MONTH |
I18N_UCALENDAR_MONTH + I18N_UCALENDAR_WEEK_OF_MONTH + I18N_UCALENDAR_DAY_OF_WEEK |
I18N_UCALENDAR_MONTH + I18N_UCALENDAR_DAY_OF_WEEK_IN_MONTH + I18N_UCALENDAR_DAY_OF_WEEK |
I18N_UCALENDAR_DAY_OF_YEAR |
I18N_UCALENDAR_DAY_OF_WEEK + I18N_UCALENDAR_WEEK_OF_YEAR |
For the time of day:
Combinations of the calendar fields |
---|
I18N_UCALENDAR_HOUR_OF_DAY |
I18N_UCALENDAR_AM_PM + I18N_UCALENDAR_HOUR |
Note |
---|
For some non-Gregorian calendars, different fields are necessary for complete disambiguation. For example, a full specification of the historical Arabic astronomical calendar requires the year, month, day-of-month and day-of-week in some cases. |
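The resolution of the simplest combination, I18N_UCALENDAR_MONTH + I18N_UCALENDAR_DAY_OF_MONTH, can be sketched with the well-known days-from-civil algorithm. This is only an illustration of how a set of integer fields collapses into a single millisecond instant; it assumes UTC and omits the time zone and DST handling that a real Ucalendar performs:

```c
#include <stdint.h>

/* Days since 1970-01-01 for a proleptic Gregorian date
 * (Howard Hinnant's days_from_civil algorithm). */
int64_t days_from_civil(int y, int m, int d)
{
    y -= m <= 2;
    int era = (y >= 0 ? y : y - 399) / 400;
    unsigned yoe = (unsigned)(y - era * 400);                 /* [0, 399] */
    unsigned doy = (153u * (unsigned)(m + (m > 2 ? -3 : 9)) + 2) / 5
                   + (unsigned)d - 1;
    unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return (int64_t)era * 146097 + (int64_t)doe - 719468;
}

/* A Udate is milliseconds since January 1, 1970, 00:00 UTC. */
int64_t udate_from_fields(int year, int month, int day,
                          int hour, int minute, int second)
{
    return ((days_from_civil(year, month, day) * 24 + hour) * 60 + minute)
               * 60000LL + second * 1000LL;
}
```

For example, (1970, 1, 2, 0, 0, 0) resolves to 86400000 ms, and July 15, 1996 lands on day 9692, which is indeed a Monday; this is why "Tuesday, July 15, 1996" is inconsistent input.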
Uchar
The Uchar API (in mobile and wearable applications) provides a low-level access to the Unicode Character Database.
Unicode assigns each code point (not just each assigned character) values for several properties. Most of these are simple boolean flags or constants from a small enumerated list. For some properties, the values are strings or other more complex types.
For more information, see About the Unicode Character Database and ICU User Guide chapter on Properties.
The following table describes the details of script codes that you can get using the i18n_uchar_get_int_property_value() function.
Value | Code | English name | Value | Code | English name |
---|---|---|---|---|---|
0 | Zyyy | Code for undetermined script | 80 | Latf | Latin (Fraktur variant) |
1 | Zinh | Code for inherited script | 81 | Latg | Latin (Gaelic variant) |
2 | Arab | Arabic | 82 | Lepc | Lepcha (Rong) |
3 | Armn | Armenian | 83 | Lina | Linear A |
4 | Beng | Bengali | 84 | Mand | Mandaic, Mandaean |
5 | Bopo | Bopomofo | 85 | Maya | Mayan hieroglyphs |
6 | Cher | Cherokee | 86 | Mero | Meroitic hieroglyphs |
7 | Copt | Coptic | 87 | Nkoo | N’Ko |
8 | Cyrl | Cyrillic | 88 | Orkh | Old Turkic, Orkhon Runic |
9 | Dsrt | Deseret (Mormon) | 89 | Perm | Old Permic |
10 | Deva | Devanagari (Nagari) | 90 | Phag | Phags-pa |
11 | Ethi | Ethiopic (Geʻez) | 91 | Phnx | Phoenician |
12 | Geor | Georgian (Mkhedruli) | 92 | Plrd | Miao (Pollard) |
13 | Goth | Gothic | 93 | Roro | Rongorongo |
14 | Grek | Greek | 94 | Sara | Sarati |
15 | Gujr | Gujarati | 95 | Syre | Syriac (Estrangelo variant) |
16 | Guru | Gurmukhi | 96 | Syrj | Syriac (Western variant) |
17 | Hani | Han (Hanzi, Kanji, Hanja) | 97 | Syrn | Syriac (Eastern variant) |
18 | Hang | Hangul (Hangŭl, Hangeul) | 98 | Teng | Tengwar |
19 | Hebr | Hebrew | 99 | Vaii | Vai |
20 | Hira | Hiragana | 100 | Visp | Visible Speech |
21 | Knda | Kannada | 101 | Xsux | Cuneiform, Sumero-Akkadian |
22 | Kana | Katakana | 102 | Zxxx | Code for unwritten documents |
23 | Khmr | Khmer | 103 | Zzzz | Code for uncoded script |
24 | Laoo | Lao | 104 | Cari | Carian |
25 | Latn | Latin | 105 | Jpan | Japanese (alias for Han+Hiragana+Katakana) |
26 | Mlym | Malayalam | 106 | Lana | Tai Tham (Lanna) |
27 | Mong | Mongolian | 107 | Lyci | Lycian |
28 | Mymr | Myanmar (Burmese) | 108 | Lydi | Lydian |
29 | Ogam | Ogham | 109 | Olck | Ol Chiki (Ol Cemet’, Ol Santali) |
30 | Ital | Old Italic (Etruscan, Oscan) | 110 | Rjng | Rejang (Redjang, Kaganga) |
31 | Orya | Oriya | 111 | Saur | Saurashtra |
32 | Runr | Runic | 112 | Sgnw | SignWriting |
33 | Sinh | Sinhala | 113 | Sund | Sundanese |
34 | Syrc | Syriac | 114 | Moon | Moon (Mooncode, Moonscript, Moontype) |
35 | Taml | Tamil | 115 | Mtei | Meitei Mayek (Meithei, Meetei) |
36 | Telu | Telugu | 116 | Armi | Imperial Aramaic |
37 | Thaa | Thaana | 117 | Avst | Avestan |
38 | Thai | Thai | 118 | Cakm | Chakma |
39 | Tibt | Tibetan | 119 | Kore | Korean (alias for Hangul+Han) |
40 | Cans | Unified Canadian Aboriginal Syllabics | 120 | Kthi | Kaithi |
41 | Yiii | Yi | 121 | Mani | Manichaean |
42 | Tglg | Tagalog (Baybayin, Alibata) | 122 | Phli | Inscriptional Pahlavi |
43 | Hano | Hanunoo (Hanunóo) | 123 | Phlp | Psalter Pahlavi |
44 | Buhd | Buhid | 124 | Phlv | Book Pahlavi |
45 | Tagb | Tagbanwa | 125 | Prti | Inscriptional Parthian |
46 | Brai | Braille | 126 | Samr | Samaritan |
47 | Cprt | Cypriot | 127 | Tavt | Tai Viet |
48 | Limb | Limbu | 128 | Zmth | Mathematical notation |
49 | Linb | Linear B | 129 | Zsym | Symbols |
50 | Osma | Osmanya | 130 | Bamu | Bamum |
51 | Shaw | Shavian (Shaw) | 131 | Lisu | Lisu (Fraser) |
52 | Tale | Tai Le | 132 | Nkgb | Nakhi Geba ('Na-'Khi ²Ggŏ-¹baw, Naxi Geba) |
53 | Ugar | Ugaritic | 133 | Sarb | Old South Arabian |
54 | Hrkt | Japanese syllabaries (alias for Hiragana+Katakana) | 134 | Bass | Bassa Vah |
55 | Bugi | Buginese | 135 | Dupl | Duployan shorthand, Duployan stenography |
56 | Glag | Glagolitic | 136 | Elba | Elbasan |
57 | Khar | Kharoshthi | 137 | Gran | Grantha |
58 | Sylo | Syloti Nagri | 138 | Kpel | Kpelle |
59 | Talu | New Tai Lue | 139 | Loma | Loma |
60 | Tfng | Tifinagh (Berber) | 140 | Mend | Mende Kikakui |
61 | Xpeo | Old Persian | 141 | Merc | Meroitic Cursive |
62 | Bali | Balinese | 142 | Narb | Old North Arabian (Ancient North Arabian) |
63 | Batk | Batak | 143 | Nbat | Nabataean |
64 | Blis | Blissymbols | 144 | Palm | Palmyrene |
65 | Brah | Brahmi | 145 | Sind | Khudawadi, Sindhi |
66 | Cham | Cham | 146 | Wara | Warang Citi (Varang Kshiti) |
67 | Cirt | Cirth | 147 | Afak | Afaka |
68 | Cyrs | Cyrillic (Old Church Slavonic variant) | 148 | Jurc | Jurchen |
69 | Egyd | Egyptian demotic | 149 | Mroo | Mro, Mru |
70 | Egyh | Egyptian hieratic | 150 | Nshu | Nushu |
71 | Egyp | Egyptian hieroglyphs | 151 | Shrd | Sharada, Śāradā |
72 | Geok | Khutsuri (Asomtavruli and Nuskhuri) | 152 | Sora | Sora Sompeng |
73 | Hans | Han (Simplified variant) | 153 | Takr | Takri, Ṭākrī, Ṭāṅkrī |
74 | Hant | Han (Traditional variant) | 154 | Tang | Tangut |
75 | Hmng | Pahawh Hmong | 155 | Wole | Woleai |
76 | Hung | Old Hungarian (Hungarian Runic) | 156 | Hluw | Anatolian hieroglyphs (Luwian hieroglyphs, Hittite hieroglyphs) |
77 | Inds | Indus (Harappan) | 157 | Khoj | Khojki |
78 | Java | Javanese | 158 | Tirh | Tirhuta |
79 | Kali | Kayah Li | -1 | - | Invalid code |
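As a sketch of how the table above is consumed, the snippet below maps a few of the integer values back to their four-letter codes. Only the value/code pairs come from the table; the struct, the lookup function, and the assumption that the relevant property constant is named I18N_UCHAR_SCRIPT are illustrative:

```c
#include <string.h>

/* Illustrative subset of the script-code table; the full list is what the
 * script property returned by i18n_uchar_get_int_property_value() indexes. */
typedef struct {
    int value;
    const char *code;
    const char *name;
} script_entry;

static const script_entry scripts[] = {
    { 0,  "Zyyy", "Code for undetermined script" },
    { 8,  "Cyrl", "Cyrillic" },
    { 17, "Hani", "Han (Hanzi, Kanji, Hanja)" },
    { 25, "Latn", "Latin" },
    { 38, "Thai", "Thai" },
};

const char *script_code_for_value(int value)
{
    for (size_t i = 0; i < sizeof scripts / sizeof scripts[0]; i++)
        if (scripts[i].value == value)
            return scripts[i].code;
    return "Zzzz"; /* code for uncoded script, per the table */
}
```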
Ucollator
The Ucollator API (in mobile and wearable applications) performs locale-sensitive string comparison. It builds searching and sorting routines for natural language text and provides correct sorting orders for most supported locales. If specific data for a locale is not available, the order eventually falls back to the CLDR root sort order. The sorting order can be customized by providing your own set of rules. For more information, see the ICU Collation Customization section of the User Guide.
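A rough intuition for strength-limited comparison: at a strength that ignores case, "Apple" and "apple" compare equal. The toy comparator below hard-codes that single rule for ASCII; a real Ucollator derives its ordering from CLDR locale data and also handles accents, contractions, and expansions:

```c
#include <ctype.h>

/* Toy "ignore case" comparison for ASCII strings: returns <0, 0, or >0,
 * like strcmp, but treats 'A' and 'a' as equal. Real collation is driven
 * by locale rules, not by a fixed tolower() mapping. */
int primary_compare(const char *a, const char *b)
{
    while (*a && *b) {
        int ca = tolower((unsigned char)*a++);
        int cb = tolower((unsigned char)*b++);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    return (*a != 0) - (*b != 0);
}
```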
Udate
The Udate API (in mobile and wearable applications) consists of functions that convert dates and times from their internal representations to textual form and back again in a language-independent manner. Converting from the internal representation (milliseconds since midnight, January 1, 1970 UTC) to text is known as formatting, and converting from text to milliseconds is known as parsing. Tizen currently defines only one concrete handle (i18n_udate_format_h), which can handle practically all normal date formatting and parsing actions.
The Udate format helps you to format and parse dates for any locale. Your code can be completely independent of the locale conventions for months, days of the week, or even the lunar or solar calendar format.
You can pass in different options for the date and time style arguments to control the length of the result; you can select from SHORT, MEDIUM, LONG, and FULL. The exact result depends on the locale.
- I18N_UDATE_SHORT is completely numeric, such as 12/13/52 or 3:30pm
- I18N_UDATE_MEDIUM is longer, such as Jan 12, 1952
- I18N_UDATE_LONG is longer, such as January 12, 1952 or 3:30:32pm
- I18N_UDATE_FULL is completely specified, such as Tuesday, April 12, 1952 AD or 3:30:42pm PST.
Date and Time Patterns
The date and time formats are specified by the date and time pattern strings. Within the date and time pattern strings, all unquoted ASCII letters (A-Z and a-z) are reserved as pattern letters representing calendar fields. The i18n_udate_format_h handle supports the date and time formatting algorithm and pattern letters defined by the UTS#35 Unicode Locale Data Markup Language (LDML). It is further documented in the ICU User Guide.
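The pattern-letter substitution can be sketched for a tiny subset of the UTS #35 letters. This toy formatter understands only "yyyy", "yy", "MM", and "dd" and copies everything else literally; a real i18n_udate_format_h supports the full LDML letter set, localized month and day names, and quoting:

```c
#include <stdio.h>
#include <string.h>

/* Expands a tiny subset of LDML pattern letters into digits:
 * "yyyy" -> 4-digit year, "yy" -> 2-digit year,
 * "MM" -> 2-digit month, "dd" -> 2-digit day.
 * All other characters are copied through as literals. */
void format_ldml(const char *pattern, int year, int month, int day, char *out)
{
    const char *p = pattern;
    char *o = out;
    while (*p) {
        if (strncmp(p, "yyyy", 4) == 0) { o += sprintf(o, "%04d", year); p += 4; }
        else if (strncmp(p, "yy", 2) == 0) { o += sprintf(o, "%02d", year % 100); p += 2; }
        else if (strncmp(p, "MM", 2) == 0) { o += sprintf(o, "%02d", month); p += 2; }
        else if (strncmp(p, "dd", 2) == 0) { o += sprintf(o, "%02d", day); p += 2; }
        else { *o++ = *p++; }
    }
    *o = '\0';
}
```

For example, the pattern "yy-MM-dd" applied to January 12, 1952 yields "52-01-12".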
Udatepg
The Udatepg API (in mobile and wearable applications) enables flexible generation of date format patterns, such as "yy-MM-dd". The user can build up the generator by adding successive patterns. After this, a query can be made using a pattern that includes only the desired fields and lengths. The generator returns the pattern that is most similar to it.
The main function is i18n_udatepg_get_best_pattern(), since normally the Udatepg generator is pre-built with data from a particular locale. However, generators can also be built directly from other data.
Uenumeration
The Uenumeration API (in mobile and wearable applications) enables you to create an enumeration object out of a given set of strings. The object can be created out of an array of const char* strings or an array of i18n_uchar* strings.
The enumeration object enables navigation through the enumeration values, with the use of the i18n_uenumeration_next() or i18n_uenumeration_unext() function (depending on the type used for creating the enumeration object), and with the i18n_uenumeration_reset() function.
You can get the number of values stored in the enumeration object with the i18n_uenumeration_count() function.
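The enumeration pattern itself is simple: a cursor over an array of strings with next/reset/count operations. The sketch below mirrors the shape of i18n_uenumeration_next(), i18n_uenumeration_reset(), and i18n_uenumeration_count(); the struct and function names are illustrative, not the Tizen API:

```c
#include <stddef.h>

/* Minimal string-enumeration cursor. */
typedef struct {
    const char **values; /* backing array of strings */
    size_t count;        /* number of values */
    size_t pos;          /* current cursor position */
} str_enumeration;

/* Returns the next value, or NULL when the enumeration is exhausted. */
const char *enumeration_next(str_enumeration *e)
{
    return e->pos < e->count ? e->values[e->pos++] : NULL;
}

/* Rewinds the cursor to the first value. */
void enumeration_reset(str_enumeration *e) { e->pos = 0; }

/* Returns the number of values stored in the enumeration. */
size_t enumeration_count(const str_enumeration *e) { return e->count; }
```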
Ulocale
The Ulocale API (in mobile and wearable applications) represents a specific geographical, political, or cultural region. Locale-sensitive operations use the Ulocale functions to tailor information for the user. For example, displaying a number is a locale-sensitive operation. The number must be formatted according to the customs and conventions of the user's native country, region, or culture.
You create a locale with one of the options listed below. Each component is separated by an underscore in the locale string:
Options for creating a locale |
---|
newLanguage |
newLanguage + newCountry |
newLanguage + newCountry + newVariant |
The first option is a valid ISO Language Code. These codes are the lower-case, two-letter codes defined by the ISO 639 standard.
The second option includes an additional ISO Country Code.
The third option requires additional information on the variant. The variant codes are vendor- and browser-specific. For example, use WIN for Windows, MAC for Macintosh, and POSIX for POSIX. Where there are two variants, separate them with an underscore and put the most important one first. For example, a traditional Spanish collation can be referenced with the language "es", country "ES", and variant "Traditional_WIN".
Because a locale is just an identifier for a region, no validity check is performed when you specify a locale. If you want to see whether particular resources are available for the locale you asked for, you must query those resources.
Once you have specified a locale, you can query it for information about itself. Use the i18n_ulocale_get_language() function to get the ISO Language Code, and the i18n_ulocale_get_display_name() function to get the name of the language suitable for displaying to the user.
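The underscore-separated identifier structure described above can be split mechanically. The helper below is a simplified sketch (no buffer-length enforcement, no script subtags); it only illustrates how language, country, and variant sit inside a string such as "es_ES_Traditional_WIN":

```c
#include <stdio.h>
#include <string.h>

/* Splits a locale identifier of the form language[_COUNTRY[_VARIANT]].
 * Components that are absent are returned as empty strings. Note that the
 * variant itself may contain underscores (Traditional_WIN), which is why
 * the final conversion consumes the rest of the string. */
void split_locale(const char *locale, char *lang, char *country, char *variant)
{
    lang[0] = country[0] = variant[0] = '\0';
    sscanf(locale, "%[^_]_%[^_]_%s", lang, country, variant);
}
```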
Unormalization
The Unormalization API (in mobile and wearable applications) performs standard Unicode normalization. All instances of i18n_unormalizer_s are unmodifiable and immutable. Instances returned by i18n_unormalization_get_instance() are singletons that must not be deleted by the caller.
Unumber
The Unumber API (in mobile and wearable applications) helps you to format and parse numbers for any locale. Your code can be completely independent of the locale conventions for decimal points, thousands separators, or even the particular decimal digits used, or whether the number format is even decimal. There are different number format styles, such as decimal, currency, percent, and spellout.
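Grouping is the easiest convention to visualize. The toy formatter below inserts a separator every three digits, the way an en_US decimal format renders 1234567 as "1,234,567"; a real Unumber format also localizes the separator character, the grouping size, and the digits themselves:

```c
#include <stdio.h>
#include <string.h>

/* Formats a long with a grouping separator every three digits.
 * `out` must be large enough for the digits plus separators. */
void format_grouped(long value, char sep, char *out)
{
    char digits[32];
    int n = sprintf(digits, "%ld", value);
    char *o = out;
    for (int i = 0; i < n; i++) {
        /* Insert a separator before each group of three, but never right
         * after a leading minus sign or at the very start. */
        if (i > 0 && digits[i] != '-' && digits[i - 1] != '-' &&
            (n - i) % 3 == 0)
            *o++ = sep;
        *o++ = digits[i];
    }
    *o = '\0';
}
```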
Usearch
The Usearch API (in mobile and wearable applications) provides language-sensitive text searching based on the comparison rules defined in a Ucollator data struct. This ensures that language-specific peculiarities are handled correctly. For example, with the German collator, the characters ß and SS match if case is ignored. This is why it can be important to pass a locale when creating the search with the i18n_usearch_create_new() function.
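To see what collation adds, compare with a plain case-insensitive scan. The toy search below equates only ASCII case variants; it cannot equate sequences of different lengths such as ß and SS, which is exactly the capability a collation-driven Usearch provides:

```c
#include <ctype.h>
#include <string.h>

/* Case-insensitive substring search over ASCII: returns the byte offset
 * of the first match, or -1 if there is none. */
int search_ignore_case(const char *haystack, const char *needle)
{
    size_t nlen = strlen(needle);
    for (size_t i = 0; haystack[i]; i++) {
        size_t j = 0;
        while (j < nlen && haystack[i + j] &&
               tolower((unsigned char)haystack[i + j]) ==
                   tolower((unsigned char)needle[j]))
            j++;
        if (j == nlen)
            return (int)i;
    }
    return -1;
}
```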
Uset
Uset is a mutable set of Unicode characters and multicharacter strings. These sets represent the character classes used in regular expressions. A set specifies a subset of the Unicode code points. The legal code points are U+0000 to U+10FFFF, inclusive.
The set supports two kinds of functions:
- The operand functions allow the caller to modify the value of the set. They work similarly to boolean logic: a boolean OR is implemented by add, a boolean AND by retain, a boolean XOR by a complement taking an argument, and a boolean NOT by a complement with no argument. In terms of traditional set theory function names, add is a union, retain is an intersection, remove is an asymmetric difference, and complement with no argument is a set complement with respect to the superset range MIN_VALUE-MAX_VALUE.
- The i18n_uset_apply_pattern() or i18n_uset_to_pattern() function. Unlike the functions that add characters or categories, and control the logic of the set, the i18n_uset_apply_pattern() function sets all attributes of a set at once, based on a string pattern.
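The boolean-logic view of the operand functions can be demonstrated over a tiny universe of 32 code points, where bit i means "code point i is in the set". This is a conceptual sketch only; a real Uset covers U+0000..U+10FFFF and also holds multicharacter strings:

```c
#include <stdint.h>

/* Bit i set means code point i is a member. */
typedef uint32_t tiny_set;

tiny_set set_add(tiny_set s, tiny_set other)            { return s | other; }  /* boolean OR  / union */
tiny_set set_retain(tiny_set s, tiny_set other)         { return s & other; }  /* boolean AND / intersection */
tiny_set set_complement_arg(tiny_set s, tiny_set other) { return s ^ other; }  /* boolean XOR */
tiny_set set_complement(tiny_set s)                     { return ~s; }         /* boolean NOT, vs. the full range */
tiny_set set_remove(tiny_set s, tiny_set other)         { return s & ~other; } /* asymmetric difference */
```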
Pattern Syntax
Patterns are accepted by the i18n_uset_create_pattern(), i18n_uset_create_pattern_options(), and i18n_uset_apply_pattern() functions and returned by the i18n_uset_to_pattern() function. The patterns follow a syntax similar to that employed by version 8 regular expression character classes.
Pattern | Description |
---|---|
[] | No characters |
[a] | Character 'a' |
[ae] | Characters 'a' and 'e' |
[a-e] | Characters 'a' through 'e' inclusive, in Unicode code point order |
[\u4E01] | Character U+4E01 |
[a{ab}{ac}] | Character 'a' and the multicharacter strings 'ab' and 'ac' |
[\p{Lu}] | All characters in the general category 'uppercase letter' |
Any character can be preceded by a backslash in order to remove any special meaning. Whitespace characters are ignored, unless they are escaped.
Property patterns specify a set of characters having a certain property as defined by the Unicode standard. Both the POSIX-like [:Lu:] syntax and the Perl-like syntax \p{Lu} are recognized.
Patterns specify individual characters, ranges of characters, and Unicode property sets. When the elements are concatenated, they specify their union. To complement a set, place a '^' immediately after the opening '['. Property patterns are inverted by modifying their delimiters: [:^foo:] and \P{foo}. In any other location, '^' has no special meaning.
Ranges are indicated by placing a '-' between 2 characters, as in "a-z". This specifies the range of all characters from the left to the right, in Unicode order. If the left character is greater than or equal to the right character, it is a syntax error. If a '-' occurs as the first character after the opening '[' or '[^', or if it occurs as the last character before the closing ']', it is taken as a literal. This means that [a\-b], [-ab], and [ab-] all indicate the same set of three characters, 'a', 'b', and '-'.
Sets can be intersected using the '&' operator, or the asymmetric set difference can be taken using the '-' operator. For example, [[:L:]&[\u0000-\u0FFF]] indicates the set of all Unicode letters with values less than 4096. The operators '&' and '-' have equal precedence and bind left-to-right. This means that [[:L:]-[a-z]-[\u0100-\u01FF]] is equivalent to [[[:L:]-[a-z]]-[\u0100-\u01FF]]. This only really matters for difference; intersection is commutative.
Set | Description |
---|---|
[a] | Set containing 'a' |
[a-z] | Set containing 'a' through 'z' and all letters in between, in Unicode order |
[^a-z] | Set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF |
[[pat1][pat2]] | Union of sets specified by pat1 and pat2 |
[[pat1]&[pat2]] | Intersection of sets specified by pat1 and pat2 |
[[pat1]-[pat2]] | Asymmetric difference of sets specified by pat1 and pat2 |
[:Lu:] or \p{Lu} | Set of characters having the specified Unicode property, in this case Unicode uppercase letters |
[:^Lu:] or \P{Lu} | Set of characters not having the given Unicode property |
Note |
---|
You cannot add an empty string ("") to a set. |
Formal Syntax
The following table describes the formal syntax of the patterns.
Pattern | Description |
---|---|
pattern := | ('[' '^'? item* ']') | property |
item := | char | (char '-' char) | pattern-expr |
pattern-expr := | pattern | pattern-expr pattern | pattern-expr op pattern |
op := | '&' | '-' |
special := | '[' | ']' | '-' |
char := | Any character that is not special | ('\' any character) | ('\u' hex hex hex hex) |
property := | Unicode property set pattern |
a := b | a can be replaced by b |
a? | 0 or 1 instance of a |
a* | 0 or more instances of a |
a | b | Either a or b |
'a' | Literal string between the quotes |
Ustring
The Ustring API (in mobile and wearable applications) provides general Unicode string handling. Some functions are similar in name, signature, and behavior to the ANSI C <string.h> functions, while others provide more Unicode-specific functionality.
The i18n API uses 16-bit Unicode (UTF-16) in the form of arrays of i18n_uchar code units. UTF-16 encodes each Unicode code point with either 1 or 2 i18n_uchar code units. This is the default form of Unicode, and a forward-compatible extension of the original, fixed-width form known as UCS-2. UTF-16 superseded UCS-2 with Unicode 2.0 in 1996.
The i18n API also handles 16-bit Unicode text with unpaired surrogates. Such text is not well-formed UTF-16. Code-point-related functions treat unpaired surrogates as surrogate code points, that is, as separate units.
Although UTF-16 is a variable-width encoding form, like some legacy multi-byte encodings, it is much more efficient even for random access, because the code unit values for single-unit characters, lead units, and trail units are completely disjoint. This means that it is easy to determine character (code point) boundaries from random offsets in the string.
Unicode (UTF-16) string processing is optimized for the single-unit case. Although it is important to support supplementary characters, which use pairs of lead/trail code units called surrogates, their occurrence is rare. Almost all characters in modern use require only a single i18n_uchar code unit (that is, their code point values are <= 0xFFFF).
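The disjoint code unit ranges make code point counting a simple single pass. The sketch below counts code points in a NUL-terminated UTF-16 array, treating a lead surrogate (0xD800-0xDBFF) followed by a trail surrogate (0xDC00-0xDFFF) as one supplementary code point and an unpaired surrogate as its own code point, matching the ill-formed-text behavior described above:

```c
#include <stddef.h>
#include <stdint.h>

/* Counts code points in a NUL-terminated UTF-16 string. A valid
 * lead+trail surrogate pair counts as one code point; an unpaired
 * surrogate counts as one code point on its own. */
size_t utf16_count_codepoints(const uint16_t *s)
{
    size_t count = 0;
    for (size_t i = 0; s[i]; i++) {
        count++;
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++; /* skip the trail unit of a surrogate pair */
    }
    return count;
}
```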
Character Set Mapping Tables
The i18n API provides a character set conversion with mapping tables for a number of important codepages. The default tables are a subset of IBM's CDRA conversion table repository. ICU's Converter Explorer shows aliases and codepage charts for the default tables that are built into a standard ICU distribution.
Conversions for most codepages are implemented differently on different platforms. We are providing mapping tables here from many different sources, so that i18n users and others can use these tables to get the same conversion behavior as on the original platforms.
The mapping tables and some of the source code of the tools that collected these tables are checked into a CVS repository.
For more information about character sets, codepages and encodings, see Coded Character Sets on the IBM site.