1. Collator::getAvailableLocales().
Returns the locales available at the time of the call, including registered locales.
If a severe error occurs (such as an out-of-memory condition), this returns null.
If there is no locale data, an empty enumeration is returned.
The returned list contains strings in the format of the RFC 4646 standard (see http://www.rfc-editor.org/rfc/rfc4646.txt).
Example of the locale format: 'en_US', 'ru_UA', 'ua_UA' (see http://demo.icu-project.org/icu-bin/locexp).
2. Collator::getDisplayName( $obj_locale, $disp_locale ).
Gets the name of the object for the desired locale, in the desired language. Both arguments
must come from the getAvailableLocales method.
@param string $obj_locale Locale to get the display name for.
@param string $disp_locale Locale in which the name is displayed.
Both parameters are case insensitive.
For the locale format, see the RFC 4647 standard at ftp://ftp.rfc-editor.org/in-notes/rfc4647.txt
3. Collator::getLocaleByType( $type ).
Lets the caller select whether to get information on the requested, valid, or actual locale.
The returned locale tag is a string formatted according to the RFC 4646 standard and normalized to normal form -
value is a string from
For example, suppose a collator for "en_US_CALIFORNIA" was requested. In the current state of ICU (2.0),
the requested locale is "en_US_CALIFORNIA", the valid locale is "en_US" (the most specific locale
supported by ICU), and the actual locale is "root" (the collation data comes unmodified from the UCA).
A locale is considered supported by ICU if there is a core ICU bundle for that locale (although
it may be empty).
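The valid/actual distinction can be inspected with a minimal sketch like the one below. It assumes PHP with the intl extension, where the equivalent call is named Collator::getLocale() and takes Locale::VALID_LOCALE or Locale::ACTUAL_LOCALE; that naming is an assumption relative to the getLocaleByType() name used in this tutorial. The strings returned depend on the installed ICU data, so none are shown.

```php
<?php
// Assumption: the intl extension is loaded. getLocale() is used here in place
// of the getLocaleByType() name that this tutorial describes.
$collator = new Collator('en_US_CALIFORNIA');

// The most specific locale that ICU has data for.
$valid = $collator->getLocale(Locale::VALID_LOCALE);

// The locale the collation data actually comes from.
$actual = $collator->getLocale(Locale::ACTUAL_LOCALE);

// The values depend on the installed ICU data, so they are only printed.
echo 'valid:  ', var_export($valid, true), "\n";
echo 'actual: ', var_export($actual, true), "\n";
```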
4. VariableTop
The Variable_Top attribute is only meaningful if the Alternate attribute is not set to NonIgnorable.
In that case, it controls which characters count as ignorable. The string value specifies
the "highest" character weight (in UCA order) that is to be considered ignorable.
Thus, for example, if a user wanted whitespace to be ignorable, but not any visible characters,
then they would use the value Variable_Top="\u0020" (space). The string should only be a
single character. All characters of the same primary weight are equivalent, so
Variable_Top="\u3000" (ideographic space) has the same effect as Variable_Top="\u0020".
This setting (alone) has little impact on string comparison performance; setting it lower or higher
will make sort keys slightly shorter or longer respectively.
5. Strength
The ICU Collation Service supports many levels of comparison (named "Levels", but also
known as "Strengths"). Having these categories enables ICU to sort strings precisely
according to local conventions. In addition, because the levels can be selectively
employed, searching for a string in text can be performed with various matching
conditions.
Performance optimizations have been made for ICU collation with the default level
settings. Performance-specific impacts are discussed in the Performance section below.
Following is a list of the names for each level and an example usage:
   1. Primary Level: Typically, this is used to denote differences between base characters
      (for example, "a" < "b"). It is the strongest difference. For example, dictionaries are
      divided into different sections by base character. This is also called the level 1
      strength.
   2. Secondary Level: Accents in the characters are considered secondary differences (for
      example, "as" < "às" < "at"). Other differences between letters can also be considered
      secondary differences, depending on the language. A secondary difference is ignored
      when there is a primary difference anywhere in the strings. This is also called the
      level 2 strength.
      Note: In some languages (such as Danish), certain accented letters are considered to
      be separate base characters. In most languages, however, an accented letter only has a
      secondary difference from the unaccented version of that letter.
   3. Tertiary Level: Upper and lower case differences in characters are distinguished at the
      tertiary level (for example, "ao" < "Ao" < "aò"). In addition, a variant of a letter differs
      from the base form on the tertiary level (such as "A" and "Ⓐ"). Another example is the
      difference between large and small Kana. A tertiary difference is ignored when there is
      a primary or secondary difference anywhere in the strings. This is also called the level 3
      strength.
   4. Quaternary Level: When punctuation is ignored (see Ignoring Punctuations) at levels
      1-3, an additional level can be used to distinguish words with and without punctuation
      (for example, "ab" < "a-b" < "aB"). This difference is ignored when there is a primary,
      secondary or tertiary difference. This is also known as the level 4 strength. The
      quaternary level should only be used if ignoring punctuation is required or when
      processing Japanese text (see Hiragana processing).
   5. Identical Level: When all other levels are equal, the identical level is used as a
      tiebreaker. The Unicode code point values of the NFD form of each string are
      compared at this level, just in case there is no difference at levels 1-4.
      For example, Hebrew cantillation marks are only distinguished at this level. This level
      should be used sparingly, because it is extremely rare for two strings to differ only in
      their code point values. Using this level substantially decreases the performance for
      both incremental comparison and sort key generation (as well as increasing the sort
      key length). It is also known as the level 5 strength.
For example, people may choose to ignore accents, or to ignore accents and case, when searching
for text. Almost all characters are distinguished by the first three levels, and in most
locales the default value is thus Tertiary. However, if Alternate is set to Shifted,
then the Quaternary strength can be used to break ties among whitespace, punctuation, and
symbols that would otherwise be ignored. If very fine distinctions among characters are required,
then the Identical strength can be used (for example, Identical strength distinguishes
between the Mathematical Bold Small A and the Mathematical Italic Small A). However, using
levels higher than Tertiary, such as the Identical strength, results in significantly longer sort
keys and slower string comparison performance for equal strings.
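The effect of the first three levels can be seen with a minimal sketch such as the following. It assumes PHP 7+ with the intl extension and uses the setStrength() and compare() methods of the shipped Collator class; the 'en' locale and the sample words are arbitrary choices for illustration.

```php
<?php
// Assumption: PHP 7+ with the intl extension; 'en' is an arbitrary locale.
$collator = new Collator('en');
$accented = "r\u{00F4}le"; // "rôle"

// Primary: only base letters count, so accents and case are ignored.
$collator->setStrength(Collator::PRIMARY);
$primaryAccent = $collator->compare('role', $accented); // 0
$primaryCase   = $collator->compare('role', 'Role');    // 0

// Secondary: accents now count, case still does not.
$collator->setStrength(Collator::SECONDARY);
$secondaryAccent = $collator->compare('role', $accented); // negative
$secondaryCase   = $collator->compare('role', 'Role');    // 0

// Tertiary: case counts as well.
$collator->setStrength(Collator::TERTIARY);
$tertiaryCase = $collator->compare('role', 'Role'); // negative

var_dump($primaryAccent, $primaryCase, $secondaryAccent, $secondaryCase, $tertiaryCase);
```

Choosing Primary or Secondary in this way is what makes accent-insensitive or case-insensitive searching possible.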
6. Collator::__construct( $locale ).
The Locale attribute is typically the most important attribute for correct sorting and matching,
according to the user expectations in different countries and regions. The default UCA
ordering will only sort a few languages such as Dutch and Portuguese correctly ("correctly"
meaning according to the normal expectations for users of the languages).
Otherwise, you need to supply the locale to UCA in order to properly collate text for a
given language. Thus a locale needs to be supplied so as to choose a collator that is correctly
tailored for that locale. The choice of a locale will automatically preset the values for
all of the attributes to something that is reasonable for that locale. Thus most of the time the
other attributes do not need to be explicitly set. In some cases, the choice of locale will make a
difference in string comparison performance and/or sort key length.
In short attribute names, the locale has the form <language>_<script>_<region>_<keyword>.
Not all the elements are required. Valid values for locale elements are the generally valid values
for RFC 4646 locale naming and the RFC 4647 lookup algorithm.
Example:
Locale="sv" (Swedish) "Kypper" < "Köpfe"
Locale="de" (German) "Köpfe" < "Kypper"
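The Swedish/German example above can be reproduced with a short sketch. It assumes PHP 7+ with the intl extension and ICU collation data for both locales; the variable names are illustrative only.

```php
<?php
// Assumption: PHP 7+ with the intl extension and data for 'sv' and 'de'.
$kopfe  = "K\u{00F6}pfe"; // "Köpfe"
$kypper = 'Kypper';

$swedish = new Collator('sv');
$german  = new Collator('de');

// Swedish sorts "ö" as a separate letter after "z", so "Kypper" comes first.
$svResult = $swedish->compare($kypper, $kopfe); // negative

// German treats "ö" as an accented "o", so "Köpfe" comes first.
$deResult = $german->compare($kopfe, $kypper); // negative

var_dump($svResult, $deResult);
```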
7. Collator::get/setAttribute.
ICU uses UCA as a default starting point for ordering. Not all languages have sorting sequences
that correspond with the UCA because UCA cannot simultaneously encompass the specifics of all
the languages currently in use. Therefore, ICU provides a data-driven, flexible, and run-time
customizable mechanism called "tailoring". Tailoring overrides the default order of code points
and the values of the ICU Collation Service attributes.
The collator has the following attributes:
- FRENCH_COLLATION, possible values are:
    ON
    OFF (default)
    DEFAULT
- CASE_FIRST, possible values are:
    OFF (default)
    LOWER_FIRST
    UPPER_FIRST
    DEFAULT
- CASE_LEVEL, possible values are:
    OFF (default)
    ON
    DEFAULT
- NORMALIZATION_MODE, possible values are:
    OFF (default)
    ON
    DEFAULT
- STRENGTH, possible values are:
    PRIMARY
    SECONDARY
    TERTIARY (default)
    QUATERNARY
    IDENTICAL
    DEFAULT
- ALTERNATE_HANDLING, possible values are:
    NON_IGNORABLE (default)
    SHIFTED
    DEFAULT
- HIRAGANA_QUATERNARY_MODE, possible values are:
    ON
    OFF (default)
    DEFAULT
- NUMERIC_COLLATION, possible values are:
    ON
    OFF (default)
    DEFAULT
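Reading and writing an attribute can be sketched as follows. This assumes PHP with the intl extension, where the attribute and value names above are class constants of Collator; the 'en' locale is an arbitrary choice.

```php
<?php
// Assumption: the intl extension is loaded; 'en' is an arbitrary locale.
$collator = new Collator('en');

// Read the current value of an attribute.
$before = $collator->getAttribute(Collator::STRENGTH); // Collator::TERTIARY by default

// Change the attribute, then read it back.
$collator->setAttribute(Collator::STRENGTH, Collator::PRIMARY);
$after = $collator->getAttribute(Collator::STRENGTH);

// getStrength() reports the same setting as the STRENGTH attribute.
$viaGetter = $collator->getStrength();

var_dump($before === Collator::TERTIARY, $after === Collator::PRIMARY, $viaGetter === $after);
```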
Description of each attribute:
FRENCH_COLLATION - Sort strings with different accents from the back of the string. This attribute
is automatically set to On for the French locales and a few others. Users normally would
not need to explicitly set this attribute. There is a string comparison performance cost when
it is set On, but sort key length is unaffected.
Example:
F=X cote < coté < côte < côté
F=O cote < côte < coté < côté
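A minimal sketch of the two orderings follows, assuming PHP 7+ with the intl extension. The attribute is set explicitly on an 'en' collator so that the result does not depend on any locale default; Collator::SORT_STRING is passed so that the values are compared as strings by the collator.

```php
<?php
// Assumption: PHP 7+ with the intl extension; 'en' is an arbitrary locale.
// The words are côté, coté, côte, cote, written with escapes for portability.
$words = ["c\u{00F4}t\u{00E9}", "cot\u{00E9}", "c\u{00F4}te", 'cote'];

$collator = new Collator('en');

// Accent differences compared from the front of the string.
$collator->setAttribute(Collator::FRENCH_COLLATION, Collator::OFF);
$forward = $words;
$collator->sort($forward, Collator::SORT_STRING);

// Accent differences compared from the back of the string.
$collator->setAttribute(Collator::FRENCH_COLLATION, Collator::ON);
$backward = $words;
$collator->sort($backward, Collator::SORT_STRING);

echo implode(' < ', $forward), "\n";  // cote < coté < côte < côté
echo implode(' < ', $backward), "\n"; // cote < côte < coté < côté
```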
CASE_FIRST - The Case_First attribute is used to control whether uppercase letters come before
lowercase letters or vice versa, in the absence of other differences in the strings. The possible
values are Uppercase_First (U) and Lowercase_First (L), plus the standard Default and Off (X).
There is almost no difference between the Off and Lowercase_First options in terms of results,
so typically users will not use Lowercase_First: only Off or Uppercase_First. (People interested
in the detailed differences between X and L should consult the Collation Customization.)
Specifying either L or U won't affect string comparison performance, but will affect the sort key
length.
Example:
C=X or C=L "china" < "China" < "denmark" < "Denmark"
C=U "China" < "china" < "Denmark" < "denmark"
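The example above can be sketched in PHP as follows, assuming the intl extension is available; the 'en' locale and the variable names are illustrative choices.

```php
<?php
// Assumption: the intl extension is loaded; 'en' is an arbitrary locale.
$words = ['Denmark', 'china', 'denmark', 'China'];

$collator = new Collator('en');

// Case_First off: lowercase sorts before uppercase when nothing else differs.
$lowerFirst = $words;
$collator->sort($lowerFirst, Collator::SORT_STRING);

// Case_First set to Uppercase_First.
$collator->setAttribute(Collator::CASE_FIRST, Collator::UPPER_FIRST);
$upperFirst = $words;
$collator->sort($upperFirst, Collator::SORT_STRING);

echo implode(' < ', $lowerFirst), "\n"; // china < China < denmark < Denmark
echo implode(' < ', $upperFirst), "\n"; // China < china < Denmark < denmark
```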
CASE_LEVEL - The Case_Level attribute is used when ignoring accents but not case. In such a situation,
set Strength to Primary and Case_Level to On. In most locales, this setting is Off by default.
There is a small string comparison performance and sort key impact if this attribute is set to On.
Example:
S=1, E=X role = Role = rôle
S=1, E=O role = rôle < Role
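The combination described above, Primary strength with Case_Level on, can be sketched like this. It assumes PHP 7+ with the intl extension; the 'en' locale is an arbitrary choice.

```php
<?php
// Assumption: PHP 7+ with the intl extension; 'en' is an arbitrary locale.
$collator = new Collator('en');
$accented = "r\u{00F4}le"; // "rôle"

// Ignore accents (Primary strength) but keep case differences (Case_Level on).
$collator->setStrength(Collator::PRIMARY);
$collator->setAttribute(Collator::CASE_LEVEL, Collator::ON);

$accentResult = $collator->compare('role', $accented); // 0: accent ignored
$caseResult   = $collator->compare('role', 'Role');    // negative: case still counts

var_dump($accentResult, $caseResult);
```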
NORMALIZATION_MODE - The Normalization setting determines whether text is thoroughly normalized
or not in comparison. Even if the setting is off (which is the default for many locales), text as
represented in common usage will compare correctly (for details, see UTN #5). Only if the accent
marks are in noncanonical order will there be a problem. If the setting is On, then the best
results are guaranteed for all possible text input. There is a medium string comparison performance
cost if this attribute is On, depending on the frequency of sequences that require normalization.
There is no significant effect on sort key length. If the input text is known to be in NFD or NFKD
normalization forms, there is no need to enable this Normalization option.
STRENGTH - see chapter 5, Strength.
ALTERNATE_HANDLING - The Alternate attribute is used to control the handling of the so-called
variable characters in the UCA: whitespace, punctuation and symbols. If Alternate is set to
NonIgnorable (N), then differences among these characters are of the same importance as
differences among letters. If Alternate is set to Shifted (S), then these characters are of only
minor importance. The Shifted value is often used in combination with Strength set to Quaternary.
In such a case, whitespace, punctuation, and symbols are considered when comparing strings,
but only if all other aspects of the strings (base letters, accents, and case) are identical.
If Alternate is not set to Shifted, then there is no difference between a Strength of 3 and
a Strength of 4. For more information and examples, see
Variable_Weighting in the UCA (http://www.unicode.org/reports/tr10/#Variable_Weighting).
The reason the Alternate values are not simply On and Off is that additional Alternate values
may be added in the future. The UCA option Blanked is expressed with Strength set to 3,
and Alternate set to Shifted. The default for most locales is NonIgnorable. If Shifted is selected,
it may be slower if there are many strings that are the same except for punctuation;
sort key length will not be affected unless the strength level is also increased.
Example:
S=3, A=N di Silva < Di Silva < diSilva < U.S.A. < USA
S=3, A=S di Silva = diSilva < Di Silva < U.S.A. = USA
S=4, A=S di Silva < diSilva < Di Silva < U.S.A. < USA
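The three rows of the example can be checked for the "U.S.A." / "USA" pair with a minimal sketch, assuming PHP with the intl extension; the 'en' locale is an arbitrary choice.

```php
<?php
// Assumption: the intl extension is loaded; 'en' is an arbitrary locale.
$collator = new Collator('en');

// S=3, A=N: punctuation is as significant as letters.
$collator->setStrength(Collator::TERTIARY);
$collator->setAttribute(Collator::ALTERNATE_HANDLING, Collator::NON_IGNORABLE);
$nonIgnorable = $collator->compare('U.S.A.', 'USA'); // negative

// S=3, A=S: punctuation is ignored at the first three levels.
$collator->setAttribute(Collator::ALTERNATE_HANDLING, Collator::SHIFTED);
$shiftedTertiary = $collator->compare('U.S.A.', 'USA'); // 0

// S=4, A=S: punctuation breaks the tie at the quaternary level.
$collator->setStrength(Collator::QUATERNARY);
$shiftedQuaternary = $collator->compare('U.S.A.', 'USA'); // negative

var_dump($nonIgnorable, $shiftedTertiary, $shiftedQuaternary);
```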
HIRAGANA_QUATERNARY_MODE - Compatibility with JIS X 4061 requires the introduction of an additional
level to distinguish Hiragana and Katakana characters. If compatibility with that standard is required,
then this attribute should be set On, and the strength set to Quaternary. This will affect sort key
length and string comparison performance.
NUMERIC_COLLATION - When turned on, this attribute generates a collation key for the
numeric value of substrings of digits. This is a way to get '100' to sort after '2'.
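A minimal sketch of the difference, assuming PHP with the intl extension. Collator::SORT_STRING is passed so that the values are compared as strings by the collator rather than by PHP's regular comparison; the 'en' locale is an arbitrary choice.

```php
<?php
// Assumption: the intl extension is loaded; 'en' is an arbitrary locale.
$values = ['100', '2', '10'];

$collator = new Collator('en');

// Numeric collation off: digits are compared character by character.
$plain = $values;
$collator->sort($plain, Collator::SORT_STRING);

// Numeric collation on: runs of digits are compared by numeric value.
$collator->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
$numeric = $values;
$collator->sort($numeric, Collator::SORT_STRING);

echo implode(', ', $plain), "\n";   // 10, 100, 2
echo implode(', ', $numeric), "\n"; // 2, 10, 100
```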