  1. ==========================================
  2. README for I18N Package
  3. ==========================================
  4. o Name and location of package
  5. Name: php-3.0.18-i18n-ja-2
  6. Location: http://www.happysize.co.jp/techie/php-ja-jp/
  7. ftp://ftp.happysize.co.jp/php-ja-jp/
  8. http://php.vdomains.org/
  9. ftp://ftp.vdomains.org/pub/php-ja-jp/
  10. http://php.jpnnet.com/
  11. Currently, this I18N version of PHP adds only Japanese support to base
  12. PHP. It lets you use Japanese in scripts and convert between the
  13. various Japanese encodings. ASCII-only scripts work as before when the
  14. i18n option is enabled. (Note: the executable is slightly larger due
  15. to the Unicode table.) The basic design approach is to allow other
  16. languages to be added in the future. Developers are encouraged to join
  17. us!
  18. For more information on Japanese encodings, please refer to the
  19. section "Additional Notes."
  20. o What is this package?
  21. This package allows you to handle multiple Japanese encodings (SJIS, EUC,
  22. UTF-8, JIS) in PHP. If you find any bugs in this package, please report
  23. them to the appropriate mailing list. For now, the PHP-jp mailing list
  24. is the best place for this.
  25. PHP-jp ML mailto:PHP-jp@sidecar.ics.es.osaka-u.ac.jp
  26. http://sidecar.ics.es.osaka-u.ac.jp/php-jp/
  27. (discussions are in Japanese)
  28. o Who should use this
  29. Due to lack of documentation, it's not intended for beginners. If
  30. something goes wrong, be prepared to fix it on your own.
  31. o Warranty and Copyright
  32. There is no warranty with this package. Use it at your own risk.
  33. Please refer to the source code for the copyrights. In general, each
  34. program's copyright is owned by its programmer. Unless you obey the
  35. copyright holders' restrictions, you are not allowed to use it in any
  36. form.
  37. o Redistribution
  38. As described in the source code, this package and its components may
  39. be redistributed with certain restrictions.
  40. Because this package is still in beta, please redistribute it as an
  41. entire package and not in the form of a patch. We would prefer to
  42. have this package distributed as one single package (not a patch of
  43. a patch of a patch), so please avoid releasing any patches to this
  44. package.
  45. o Who made this
  46. A team of volunteers, PHP3 Internationalization, has been contributing
  47. their free time to produce it. Although we are not related to the core
  48. PHP programmers, we hope to have our modifications merged into the
  49. core distribution in the near future. That is why we did not call this a
  50. "Japanese Patch" (or distribution). Our final goal is a truly
  51. internationalized PHP!
  52. For anyone interested in this project, please drop us a line.
  53. Contact Address:
  54. phpj-dev@kage.net
  55. (Discussions are in Japanese, but feel free to write us in English)
  56. Webpage (English and Japanese):
  57. http://php.jpnnet.com/
  58. Project Outline (Japanese):
  59. http://www.happysize.co.jp/techie/php-ja-jp/spec.htm
  60. Developers:
  61. Hironori Sato <satoh@jpnnet.com>
  62. Shigeru Kanemoto <sgk@happysize.co.jp>
  63. Tsukada Takuya <tsukada@fminn.nagano.nagano.jp>
  64. U. Kenkichi <kenkichi@axes.co.jp>
  65. Tateyama <tateyan@amy.hi-ho.ne.jp>
  66. Other gracious contributors
  67. o Future plans
  68. - fulfilling what's written in the outline
  69. - support for languages other than Japanese
  70. - turning the character conversion into a library (?)
  71. - more testing
  72. o Special Thanks to
  73. PHP Japanese webpage maintainer, Hirokawa-san
  74. http://www.cityfujisawa.ne.jp/%7Elouis/apps/phpfi/
  75. PHP-JP ML's Yamamoto-san
  76. http://sidecar.ics.es.osaka-u.ac.jp/php-jp/
  77. Previous jp-patch developers
  78. ==========================================
  79. Advantages of using I18N package
  80. ==========================================
  81. - allows you to use various character encodings for script files and
  82. http output
  83. - detects the character encoding of POST/GET/COOKIE input
  84. - proper mail output using a JIS body and a MIME/Base64/JIS subject
  85. - if the http output's Content-Type is text/html, sets the proper charset
  86. - stable character encoding conversion
  87. - multibyte regex
  88. ==========================================
  89. Installation
  90. ==========================================
  91. o Summary
  92. Add the --enable-i18n option when running configure, along with any
  93. other options your own setup needs.
  94. Don't forget to copy php3.ini-dist to the desired location
  95. (e.g. /usr/local/lib/php3.ini).
  96. If you have already installed PHP3, copy all the entries in php3.ini-dist
  97. that start with "i18n.xxxx" to php3.ini.
  98. o configure option
  99. --enable-i18n
  100. include i18n features
  101. --enable-mbregex
  102. include multibyte regex library
  103. (without i18n enabled, the mbregex functions will not work)
  104. o creating cgi version
  105. % tar xvzf php-3.0.18-i18n-ja-2.tar.gz
  106. % cd php-3.0.18-i18n-ja-2
  107. % ./configure --enable-i18n --enable-mbregex
  108. % make
  109. o creating Apache version (regular module)
  110. % tar xvzf php-3.0.18-i18n-ja-2.tar.gz
  111. % tar xvzf apache_1.3.x.tar.gz
  112. % cd apache_1.3.x
  113. % ./configure
  114. % cd ../php-3.0.18-i18n-ja-2
  115. % ./configure --with-apache=../apache_1.3.x --enable-i18n --enable-mbregex
  116. % make
  117. % make install
  118. % cd ../apache_1.3.x
  119. % ./configure --activate-module=src/modules/php3/libphp3.a
  120. % make
  121. % make install
  122. o creating Apache DSO version
  123. create DSO capable Apache first
  124. % tar xvzf apache_1.3.x.tar.gz
  125. % cd apache_1.3.x
  126. % ./configure --enable-shared=max
  127. % make
  128. % make install
  129. now create php3
  130. % cd ../php-3.0.18-i18n-ja-2
  131. % ./configure --with-apxs=/usr/local/apache/bin/apxs --enable-i18n \
  132. --enable-mbregex
  133. % make
  134. % make install
  135. ==========================================
  136. Additional Notes
  137. ==========================================
  138. o Multibyte regex library
  139. From beta4, we have included the multibyte (mb) regex library which comes with
  140. Ruby. With this addition, you can now use regex in EUC, SJIS and UTF-8
  141. encoding. To avoid any conflicts with HSREGEX included with Apache,
  142. each function name has been changed. Therefore, mb regex functions are
  143. named differently from the original ereg functions in PHP. The character
  144. encoding used in mb regex is configured in i18n.internal_encoding.
  145. o Binary Output
  146. If the http output encoding is set to anything other than 'pass', output
  147. is converted automatically from the internal encoding to the http output
  148. encoding. Thus, if you output raw binary data, that data
  149. may be corrupted. In that case, set http_output to 'pass'.
  150. ex.
  151. <?
  152. i18n_http_output("pass");
  153. ...
  154. echo $the_binary_data_string;
  155. ?>
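To see why automatic conversion damages binary data, the sketch below pushes the
first bytes of a PNG file through an EUC-JP to SJIS conversion. It is written in
Python purely for illustration: Python's standard codecs stand in for the
package's converter, convert_output is a hypothetical name, and the exact bytes
PHP would produce may differ.

```python
# Illustration only: Python's codecs stand in for the package's converter.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def convert_output(data: bytes, internal: str, http_output: str) -> bytes:
    """Hypothetical stand-in for the automatic output conversion."""
    if http_output == "pass":
        return data  # 'pass' means: send the bytes untouched
    text = data.decode(internal, errors="replace")
    return text.encode(http_output, errors="replace")


converted = convert_output(PNG_SIGNATURE, "euc_jp", "shift_jis")
untouched = convert_output(PNG_SIGNATURE, "euc_jp", "pass")

print(converted == PNG_SIGNATURE)  # compares the converted header to the original
print(untouched == PNG_SIGNATURE)  # compares the 'pass' output to the original
```

The byte 89h is not valid text in EUC-JP, so any text conversion has to alter or
drop it, which is enough to make the image unreadable.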
  156. o Content-Type
  157. Depending on the setting of http_output, PHP will output the proper charset.
  158. ex. Content-Type: text/html; charset="..."
  159. Be aware of the following:
  160. - If you set the Content-Type header using the header() function, that
  161. overrides the automatic addition of charset.
  162. - Be careful where you call i18n_http_output(): if any output is
  163. made before the call, the header may already have been sent to
  164. the client.
  165. o In the event of trouble
  166. If you find any bugs or trouble, please contact us at the above address.
  167. It may help us to track the problem if you send us the script as well.
  168. If you encounter a memory-related error such as a segmentation violation,
  169. add --enable-debug when you run configure. This gives you more
  170. detailed information on where the error occurred. The error is written
  171. to the server log, or to the regular http output in CGI mode.
  172. o About Japanese encodings
  173. For historical reasons, multiple character encodings are in use
  174. for Japanese. The most common encodings are: SJIS, EUC, JIS, and UTF-8.
  175. Here is a (very) brief description of each:
  176. EUC
  177. commonly used in UNIX environments
  178. 8bit-8bit pairs
  179. multibyte characters always use bytes >=0x80
  180. SJIS
  181. commonly used on Macs and PCs
  182. similar to EUC
  183. mostly 8bit-8bit (some 8bit-7bit)
  184. mostly >=0x80, but the second byte can fall in the ASCII range
  185. halfwidth (ASCII-sized) katakana are single 8bit bytes
  186. JIS
  187. commonly used in 7bit environments (nntp and smtp)
  188. switches character sets with escape sequences (\033 plus a few more characters)
  189. UTF-8
  190. variable-length encoding of Unicode, one to several bytes per character
  191. Unicode covers most of the world's writing systems
  192. see http://www.unicode.org/ for more detail
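The byte-level properties listed above can be checked directly. The following is
a small sketch in Python (used only because its standard library ships all four
codecs; it is not part of this package) that encodes the same two kanji in each
encoding and tests the properties described.

```python
# Illustration only: Python's bundled codecs, not this package's converter.
SAMPLE = "表示"  # two kanji; the first one has a 5Ch second byte in SJIS

for name, codec in [("EUC", "euc_jp"), ("SJIS", "shift_jis"),
                    ("JIS", "iso2022_jp"), ("UTF-8", "utf-8")]:
    print(f"{name:6}", SAMPLE.encode(codec).hex(" "))

euc = SAMPLE.encode("euc_jp")
sjis = SAMPLE.encode("shift_jis")
jis = SAMPLE.encode("iso2022_jp")

euc_all_high = all(b >= 0x80 for b in euc)    # EUC: every byte is >= 80h
sjis_has_7bit = any(b < 0x80 for b in sjis)   # SJIS: a second byte can be 7bit
jis_starts_esc = jis.startswith(b"\x1b")      # JIS: begins with \033
jis_is_7bit = all(b < 0x80 for b in jis)      # JIS: safe for 7bit transports

print(euc_all_high, sjis_has_7bit, jis_starts_esc, jis_is_7bit)
```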
  193. Because all of these character encodings are in use, PHP needs to
  194. translate between them on the fly. In addition, the mb regex
  195. library lets you handle mb strings without fear of an mb character being
  196. chopped in half.
  197. Since Japanese is not the only language with multiple encodings, we
  198. encourage other developers to modify our code to suit their needs. We
  199. definitely need people to work with Korean, Chinese (both traditional
  200. and simplified), and Russian. Let us know if you are interested in
  201. this project!
  202. ==========================================
  203. php3.ini setting
  204. ==========================================
  205. The following ini options allow you to change the default settings.
  206. Define these settings in the global section of php3.ini.
  207. All keywords are case-insensitive.
  208. o Encoding naming
  209. For each encoding, there are three names: standardized, alias, MIME
  210. - UTF-8
  211. standard: UTF-8
  212. alias: N/A
  213. mime: UTF-8
  214. - ASCII
  215. standard: ASCII
  216. alias: N/A
  217. mime: US-ASCII
  218. - Japanese EUC
  219. standard: EUC-JP
  220. alias: EUC, EUC_JP, eucJP, x-euc-jp
  221. mime: EUC-JP
  222. - Shift JIS
  223. standard: SJIS
  224. alias: x-sjis, MS_Kanji
  225. mime: Shift_JIS
  226. - JIS
  227. standard: JIS
  228. alias: N/A
  229. mime: ISO-2022-JP
  230. - Quoted-Printable
  231. standard: Quoted-Printable
  232. alias: qprint
  233. mime: N/A
  234. - BASE64
  235. standard: BASE64
  236. alias: N/A
  237. mime: N/A
  238. - no conversion
  239. standard: pass
  240. alias: none
  241. mime: N/A
  242. - auto encoding detection
  243. standard: auto
  244. alias: unknown
  245. mime: N/A
  246. * N/A - Not Applicable
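The naming table above is effectively a lookup from any accepted spelling to one
standard name, plus a MIME name for headers. A minimal sketch of such a lookup
follows, in Python for illustration only. The helper names standard_name and
content_type are hypothetical, and treating the MIME names as accepted input
spellings is an assumption, not something this README states.

```python
# Illustration only: the table is transcribed from this README.
ENCODINGS = {
    # standard name: (aliases, MIME name or None)
    "UTF-8": ((), "UTF-8"),
    "ASCII": ((), "US-ASCII"),
    "EUC-JP": (("EUC", "EUC_JP", "eucJP", "x-euc-jp"), "EUC-JP"),
    "SJIS": (("x-sjis", "MS_Kanji"), "Shift_JIS"),
    "JIS": ((), "ISO-2022-JP"),
    "Quoted-Printable": (("qprint",), None),
    "BASE64": ((), None),
    "pass": (("none",), None),
    "auto": (("unknown",), None),
}

# Build a case-insensitive lookup from every spelling (assumption: MIME
# names are accepted as input too).
_LOOKUP = {}
for standard, (aliases, mime) in ENCODINGS.items():
    for spelling in (standard, *aliases, *([mime] if mime else [])):
        _LOOKUP[spelling.lower()] = standard


def standard_name(name: str) -> str:
    """Resolve any spelling to the standard name (hypothetical helper)."""
    return _LOOKUP[name.lower()]


def content_type(http_output: str) -> str:
    """Build a text/html Content-Type value using the MIME charset name."""
    mime = ENCODINGS[standard_name(http_output)][1]
    return f'text/html; charset="{mime}"' if mime else "text/html"


print(standard_name("MS_Kanji"))
print(content_type("EUC"))
```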
  247. o i18n.http_output - default http output encoding
  248. i18n.http_output = EUC-JP|SJIS|JIS|UTF-8|pass
  249. EUC-JP : EUC
  250. SJIS: SJIS
  251. JIS : JIS
  252. UTF-8: UTF-8
  253. pass: no conversion
  254. The default is pass (internal encoding is used)
  255. It can be re-configured on the fly using i18n_http_output().
  256. o i18n.internal_encoding - internal encoding
  257. i18n.internal_encoding = EUC-JP|SJIS|UTF-8
  258. EUC-JP : EUC
  259. SJIS: SJIS
  260. UTF-8: UTF-8
  261. The default is EUC-JP.
  262. The PHP parser is designed around ISO-8859-1. Other
  263. encodings must satisfy the following conditions in order
  264. to be used:
  265. - byte-oriented encoding
  266. - single-byte characters in the range 00h-7fh, compatible
  267. with ASCII
  268. - multibyte characters that never use bytes in 00h-7fh
  269. For Japanese, EUC-JP and UTF-8 are the only encodings that
  270. meet these criteria.
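The third condition is the one SJIS fails. As a rough check, the Python sketch
below (illustration only; the function name is hypothetical, and it tests a
sample string, not the whole code space) reports whether any multibyte character
in a sample uses a byte in 00h-7fh.

```python
# Illustration only: checks sample text, not every possible character.
def multibyte_stays_above_7fh(text: str, codec: str) -> bool:
    """True if no multibyte character in `text` uses a byte in 00h-7Fh."""
    for ch in text:
        encoded = ch.encode(codec)
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


sample = "表示"  # in SJIS the first kanji is 95h 5Ch, and 5Ch is '\'
results = {codec: multibyte_stays_above_7fh(sample, codec)
           for codec in ("euc_jp", "utf-8", "shift_jis")}
print(results)
```

A second byte of 5Ch is what the parser and magic_quotes_*** see as a backslash,
which is why SJIS can cause the problems mentioned for escape-based encodings.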
  271. If i18n.internal_encoding and i18n.http_output differ, conversion
  272. takes place at the time of output. If you convert data within a
  273. PHP script to URL encoding, BASE64 or Quoted-Printable, the character
  274. encoding stays as defined in i18n.internal_encoding. Thus, if you
  275. want the encoded data to match i18n.http_output, you need
  276. to convert the encoding manually first.
  277. ex. $str = urlencode( i18n_convert($str, i18n_http_output()) );
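Why the order matters: URL encoding works on bytes, so the result depends on the
character encoding in effect before it is applied. The Python sketch below
(illustration only, using the standard urllib.parse.quote) encodes the same text
once as EUC-JP, standing in for the internal encoding, and once as SJIS,
standing in for the http output encoding.

```python
from urllib.parse import quote

text = "日本語"

# Percent-encoding operates on bytes, so convert the characters first.
as_internal = quote(text.encode("euc_jp"))    # left in the internal encoding
as_output = quote(text.encode("shift_jis"))   # converted to the output encoding

print(as_internal)
print(as_output)
```

The two strings differ, and once the text is percent-encoded it is plain ASCII,
so a later output conversion can no longer fix it.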
  278. Encodings that use escape sequences, such as ISO-2022-** and HZ,
  279. cannot be used as the internal encoding. If used, they
  280. cause the following problems:
  281. - the parser throws odd errors
  282. - magic_quotes_*** breaks the encoding (SJIS may have a similar problem)
  283. - string manipulation and regex will malfunction
  284. o i18n.script_encoding - script encoding
  285. i18n.script_encoding = auto|EUC-JP|SJIS|JIS|UTF-8
  286. auto: automatic
  287. EUC-JP : EUC
  288. SJIS: SJIS
  289. JIS : JIS
  290. UTF-8: UTF-8
  291. The default is auto.
  292. The script's encoding is converted to i18n.internal_encoding before
  293. entering the script parser.
  294. Be aware that auto detection may fail under some conditions.
  295. For the best auto detection, add a multibyte character at the beginning
  296. of the script.
  297. o i18n.http_input - handling of http input (GET/POST/COOKIE)
  298. i18n.http_input = pass|auto
  299. auto: auto conversion
  300. pass: no conversion
  301. The default is auto.
  302. If set to pass, no conversion will take place.
  303. If set to auto, it will automatically detect the encoding. If
  304. detection is successful, the input is converted to the internal
  305. encoding. If not, the input is assumed to be in the encoding defined by
  306. i18n.http_input_default.
  307. o i18n.http_input_default - default http input encoding
  308. i18n.http_input_default = pass|EUC-JP|SJIS|JIS|UTF-8
  309. pass: no conversion
  310. EUC-JP : EUC
  311. SJIS: SJIS
  312. JIS : JIS
  313. UTF-8: UTF-8
  314. The default is pass.
  315. This option is only effective as long as i18n.http_input is set to
  316. auto. If the auto detection fails, this encoding is used as an
  317. assumption to convert the http input to the internal encoding.
  318. If set to pass, no conversion will take place.
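The interplay of i18n.http_input and i18n.http_input_default can be sketched as
follows. This is Python for illustration only. Detection by trial decoding, the
candidate order, and the names detect and read_input are all assumptions; the
README does not describe the package's actual detection algorithm.

```python
# Sketch only: trial decoding is an assumption, not the package's algorithm.
CANDIDATES = ("ascii", "iso2022_jp", "utf-8", "euc_jp", "shift_jis")


def detect(data: bytes):
    """Return the first candidate codec that decodes `data`, else None."""
    for codec in CANDIDATES:
        try:
            data.decode(codec)
            return codec
        except UnicodeDecodeError:
            continue
    return None


def read_input(data: bytes, internal: str = "euc_jp",
               default: str = "pass") -> bytes:
    """Mimic http_input = auto: detect, then convert or use the default."""
    source = detect(data)
    if source is None:
        if default == "pass":
            return data  # detection failed and the default is 'pass'
        source = default
    text = data.decode(source, errors="replace")
    return text.encode(internal, errors="replace")


sjis_form_value = "日本語".encode("shift_jis")
print(detect(sjis_form_value))
print(read_input(sjis_form_value) == "日本語".encode("euc_jp"))
```

Short inputs often decode cleanly under more than one candidate, which is one
way to picture why detection can fail or guess wrong on real form data.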
  319. o sample settings
  320. 1) For the most flexibility, we recommend the following settings.
  321. i18n.http_output = SJIS
  322. i18n.internal_encoding = EUC-JP
  323. i18n.script_encoding = auto
  324. i18n.http_input = auto
  325. i18n.http_input_default = SJIS
  326. 2) To avoid unexpected encoding problems, try these:
  327. i18n.http_output = pass
  328. i18n.internal_encoding = EUC-JP
  329. i18n.script_encoding = pass
  330. i18n.http_input = pass
  331. i18n.http_input_default = pass
  332. ==========================================
  333. PHP functions
  334. ==========================================
  335. The following describes the additional PHP functions.
  336. All keywords are case-insensitive.
  337. o i18n_http_output(encoding)
  338. o encoding = i18n_http_output()
  339. Sets the http output encoding. Any output after this
  340. call is converted to the given encoding. If no argument is given,
  341. the current http output encoding setting is returned.
  342. encodings
  343. EUC-JP : EUC
  344. SJIS: SJIS
  345. JIS : JIS
  346. UTF-8: UTF-8
  347. pass: no conversion
  348. NONE is not allowed
  349. o encoding = i18n_internal_encoding()
  350. Returns the current internal encoding as a string.
  351. internal encoding
  352. EUC-JP : EUC
  353. SJIS: SJIS
  354. UTF-8: UTF-8
  355. o encoding = i18n_http_input()
  356. Returns http input encoding.
  357. encodings
  358. EUC-JP : EUC
  359. SJIS: SJIS
  360. JIS : JIS
  361. UTF-8: UTF-8
  362. pass: no conversion (only if i18n.http_input is set to pass)
  363. o string = i18n_convert(string, encoding)
  364. string = i18n_convert(string, encoding, pre-conversion-encoding)
  365. Returns converted string in desired encoding. If
  366. pre-conversion-encoding is not defined, the given
  367. string is assumed to be in internal encoding.
  368. encoding
  369. EUC-JP : EUC
  370. SJIS: SJIS
  371. JIS : JIS
  372. UTF-8: UTF-8
  373. pass: no conversion
  374. pre-conversion-encoding
  375. EUC-JP : EUC
  376. SJIS: SJIS
  377. JIS : JIS
  378. UTF-8: UTF-8
  379. pass: no conversion
  380. auto: auto detection
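At the byte level, a conversion of this kind amounts to decoding from the source
encoding and re-encoding in the target one. The sketch below is a Python
stand-in for illustration only. The function name convert, its argument order,
and the EUC-JP default (mirroring the default internal encoding) are choices
made for this example, not the package's implementation.

```python
# Illustration only: a Python stand-in, not the package's implementation.
CODECS = {"EUC-JP": "euc_jp", "SJIS": "shift_jis",
          "JIS": "iso2022_jp", "UTF-8": "utf-8"}


def convert(data: bytes, to: str, source: str = "EUC-JP") -> bytes:
    """Re-encode `data` from `source` to `to`; 'pass' returns it unchanged."""
    if to == "pass" or source == "pass":
        return data
    return data.decode(CODECS[source]).encode(CODECS[to])


euc = b"\xc6\xfc\xcb\xdc\xb8\xec"  # "日本語" in EUC-JP
print(convert(euc, "SJIS").hex(" "))
print(convert(euc, "UTF-8").decode("utf-8"))
print(convert(convert(euc, "JIS"), "EUC-JP", "JIS") == euc)
```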
  381. o encoding = i18n_discover_encoding(string)
  382. Encoding of the given string is returned (as a string).
  383. encoding
  384. EUC-JP : EUC
  385. SJIS: SJIS
  386. JIS : JIS
  387. UTF-8: UTF-8
  388. ASCII: ASCII (only 09h, 0Ah, 0Dh, 20h-7Eh)
  389. pass: unable to determine (text is too short to determine)
  390. unknown: unknown or possible error
  391. o int = mbstrlen(string)
  392. o int = mbstrlen(string, encoding)
  393. Returns character length of a given string. If no encoding is defined,
  394. the encoding of string is assumed to be the internal encoding.
  395. encoding
  396. EUC-JP : EUC
  397. SJIS: SJIS
  398. JIS : JIS
  399. UTF-8: UTF-8
  400. auto: automatic
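The difference between a character length and a byte length is easy to see with
mixed text. The Python sketch below is for illustration only; char_length is a
hypothetical helper, not this package's function.

```python
def char_length(data: bytes, codec: str) -> int:
    """Count characters, not bytes, in an encoded string."""
    return len(data.decode(codec))


text = "PHP3日本語"  # 4 ASCII characters followed by 3 kanji

for codec in ("euc_jp", "shift_jis", "utf-8"):
    encoded = text.encode(codec)
    print(codec, "bytes:", len(encoded),
          "characters:", char_length(encoded, codec))
```

The byte count changes with the encoding, while the character count stays the
same, which is the value a multibyte-aware length function reports.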
  401. o int = mbstrpos(string1, string2)
  402. o int = mbstrpos(string1, string2, start)
  403. o int = mbstrpos(string1, string2, start, encoding)
  404. Same as strpos. If no encoding is defined, the encoding of string
  405. is assumed to be the internal encoding.
  406. encoding
  407. EUC-JP : EUC
  408. SJIS: SJIS
  409. JIS : JIS
  410. UTF-8: UTF-8
  411. o int = mbstrrpos(string1, string2)
  412. o int = mbstrrpos(string1, string2, encoding)
  413. Same as strrpos. If no encoding is defined, the encoding of string
  414. is assumed to be the internal encoding.
  415. encoding
  416. EUC-JP : EUC
  417. SJIS: SJIS
  418. JIS : JIS
  419. UTF-8: UTF-8
  420. o string = mbsubstr(string, position)
  421. o string = mbsubstr(string, position, length)
  422. o string = mbsubstr(string, position, length, encoding)
  423. Same as substr. If no encoding is defined, the encoding of string
  424. is assumed to be the internal encoding.
  425. encoding
  426. EUC-JP : EUC
  427. SJIS: SJIS
  428. JIS : JIS
  429. UTF-8: UTF-8
  430. o string = mbstrcut(string, position)
  431. o string = mbstrcut(string, position, length)
  432. o string = mbstrcut(string, position, length, encoding)
  433. Same as substr, but counted in bytes. If position is the 2nd byte of a mb
  434. character, the cut starts from the first byte of that character. The
  435. string is cut without splitting any mb character. In other words, if you
  436. set length to 5, you get only two 2-byte mb characters. If no encoding
  437. is defined, the encoding of string is assumed to be the internal encoding.
  438. encoding
  439. EUC-JP : EUC
  440. SJIS: SJIS
  441. JIS : JIS
  442. UTF-8: UTF-8
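The cutting rule described for mbstrcut can be sketched as below. This is Python
for illustration only. The name strcut is hypothetical, and the sketch assumes a
stateless encoding such as EUC-JP, SJIS or UTF-8 (JIS, with its escape
sequences, is not handled). The real function may treat edge cases differently.

```python
def strcut(data: bytes, position: int, length: int,
           codec: str = "euc_jp") -> bytes:
    """Byte-based cut that never splits a multibyte character (sketch)."""
    out = bytearray()
    offset = 0  # byte offset of the current character
    for ch in data.decode(codec):
        encoded = ch.encode(codec)
        end = offset + len(encoded)
        if end > position:  # character covers or follows `position`
            if len(out) + len(encoded) > length:
                break  # adding it would exceed the byte budget
            out += encoded
        offset = end
    return bytes(out)


data = "日本語テキスト".encode("euc_jp")  # 7 characters, 2 bytes each
print(strcut(data, 0, 5).decode("euc_jp"))  # length 5 holds two whole characters
print(strcut(data, 3, 2).decode("euc_jp"))  # position 3 is inside the 2nd character
```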
  443. o string = i18n_mime_header_encode(string)
  444. MIME encode the string in the format of =?ISO-2022-JP?B?[string]?=.
  445. o string = i18n_mime_header_decode(string)
  446. MIME decodes the string.
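The encoded form named above is the standard MIME encoded-word with Base64 over
ISO-2022-JP bytes. A minimal Python sketch follows, for illustration only. The
two function names are hypothetical, and the encoder emits a single
encoded-word without the line folding that long subjects need.

```python
import base64
from email.header import decode_header


def mime_header_encode(text: str) -> str:
    """Encode text as one =?ISO-2022-JP?B?...?= encoded-word (sketch)."""
    payload = base64.b64encode(text.encode("iso2022_jp")).decode("ascii")
    return f"=?ISO-2022-JP?B?{payload}?="


def mime_header_decode(header: str) -> str:
    """Decode encoded-words back to text using the standard library."""
    parts = []
    for raw, charset in decode_header(header):
        if isinstance(raw, bytes):
            raw = raw.decode(charset or "ascii")
        parts.append(raw)
    return "".join(parts)


subject = mime_header_encode("日本語")
print(subject)
print(mime_header_decode(subject))
```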
  447. o string = i18n_ja_jp_hantozen(string)
  448. o string = i18n_ja_jp_hantozen(string, option)
  449. o string = i18n_ja_jp_hantozen(string, option, encoding)
  450. Converts between fullwidth and halfwidth characters.
  451. option
  452. The following options are allowed. The default is "KV".
  453. Abbreviations: FW = fullwidth, HW = halfwidth
  454. "r" : FW alphabet -> HW alphabet
  455. "R" : HW alphabet -> FW alphabet
  456. "n" : FW number -> HW number
  457. "N" : HW number -> FW number
  458. "a" : FW alpha numeric (21h-7Eh) -> HW alpha numeric
  459. "A" : HW alpha numeric (21h-7Eh) -> FW alpha numeric
  460. "k" : FW katakana -> HW katakana
  461. "K" : HW katakana -> FW katakana
  462. "h" : FW hiragana -> HW katakana
  463. "H" : HW katakana -> FW hiragana
  464. "c" : FW katakana -> FW hiragana
  465. "C" : FW hiragana -> FW katakana
  466. "V" : merge a kana and its dakuten (voiced sound mark) into one character; only works with the "K" and "H" options
  467. encoding
  468. If no encoding is defined, the encoding of string is assumed to be
  469. the internal encoding.
  470. EUC-JP : EUC
  471. SJIS: SJIS
  472. JIS : JIS
  473. UTF-8: UTF-8
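Several of these conversions are simple code point shifts when expressed in
Unicode. The Python sketch below is for illustration only. It approximates the
"a", "c", "C" and "KV" options with translation tables and Unicode
normalization; the names are hypothetical, and the package itself works on the
Japanese encodings directly.

```python
import re
import unicodedata

# Unicode offsets: fullwidth ASCII variants are U+FF01-U+FF5E, and each
# katakana sits 60h above the matching hiragana.
FW_TO_HW_ALNUM = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}  # like "a"
HIRA_TO_KATA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}      # like "C"
KATA_TO_HIRA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}      # like "c"

_HW_KANA_RUN = re.compile("[\uff61-\uff9f]+")  # halfwidth katakana block


def hw_kana_to_fw(text: str) -> str:
    """Like "KV": HW katakana -> FW katakana, merging voiced sound marks."""
    return _HW_KANA_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text)


print("ＰＨＰ３".translate(FW_TO_HW_ALNUM))
print("ひらがな".translate(HIRA_TO_KATA))
print("カタカナ".translate(KATA_TO_HIRA))
print(hw_kana_to_fw("\uff76\uff9e\uff77\uff9e"))  # HW "ka" + mark, "ki" + mark
```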
  474. o int = mbereg(regex_pattern, string, string)
  475. o int = mberegi(regex_pattern, string, string)
  476. mb version of ereg() and eregi()
  477. o string = mbereg_replace(regex_pattern, string, string)
  478. o string = mberegi_replace(regex_pattern, string, string)
  479. mb version of ereg_replace() and eregi_replace()
  480. o string_array = mbsplit(regex, string, limit)
  481. mb version of split()
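The reason multibyte-aware matching is needed can be shown with a false match.
In the Python sketch below (illustration only, using Python's re module and
codecs, not the mb regex library), a byte-oriented search finds a character that
is not in the text, because its two bytes happen to straddle two neighbouring
EUC-JP characters.

```python
import re

haystack = "イア".encode("euc_jp")  # bytes: A5 A4 A5 A2
needle = "ぅ".encode("euc_jp")      # bytes: A4 A5

# Byte-oriented search: matches the 2nd byte of the first character plus
# the 1st byte of the second one.
byte_match = re.search(re.escape(needle), haystack)

# Character-aware search: decode first, then match whole characters.
char_match = re.search(re.escape("ぅ"), haystack.decode("euc_jp"))

print(byte_match.start() if byte_match else None)
print(char_match)
```

The byte search reports a hit at offset 1, in the middle of a character, while
the character-aware search correctly finds nothing.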
  482. ==========================================
  483. FAQ
  484. ==========================================
  485. Here, we have gathered some commonly asked questions from the PHP-jp
  486. mailing list.
  487. o To use Japanese in GET method
  488. If you need to pass Japanese text as a GET argument, such as
  489. xxxx.php?data=<Japanese text>, use PHP's urlencode function. Otherwise, the
  490. text may not be passed on to the target PHP script properly.
  491. ex: <a href="hoge.php?data=<? echo urlencode($data) ?>">Link</a>
  492. o When passing data via GET/POST/COOKIE, \ character sneaks in
  493. When SJIS is the internal encoding, or the passed-on data includes ' " or \,
  494. PHP automatically inserts the escape character \. To prevent this, change
  495. magic_quotes_gpc in php3.ini from On to Off. An alternative workaround
  496. is to use StripSlashes().
  497. If $quote_str is in SJIS and you would like to recover the Japanese text,
  498. use ereg_replace as follows:
  499. ereg_replace(sprintf("([%c-%c%c-%c]\\\\)\\\\",0x81,0x9f,0xe0,0xfc),
  500. "\\1",$quote_str);
  501. This removes the extra \ inserted after a SJIS lead byte and so restores the Japanese text in $quote_str.
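The effect can be reproduced at the byte level. The Python sketch below is for
illustration only: the replace() call merely imitates what byte-wise quote
escaping does to SJIS text, and the regular expression is a direct transcription
of the ereg_replace pattern above into Python's re syntax.

```python
import re

original = "表示".encode("shift_jis")  # first kanji is 95h 5Ch; 5Ch is '\'

# Imitate byte-wise escaping: every 5Ch byte gets a '\' added next to it.
escaped = original.replace(b"\\", b"\\\\")

# An SJIS lead byte (81h-9Fh or E0h-FCh) followed by two backslashes
# loses the extra one.
restored = re.sub(rb"([\x81-\x9f\xe0-\xfc]\\)\\", rb"\1", escaped)

print(original.hex(" "))
print(escaped.hex(" "))
print(restored == original)
```

The pattern only looks at byte pairs, so it is a workaround for this specific
case and not a general SJIS parser.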
  502. o Sometimes, encoding detection fails
  503. If i18n_http_input() returns 'pass', it's likely that PHP failed to
  504. detect whether the input is SJIS or EUC. In that case, add <input type=hidden
  505. value="some Japanese text"> to the form so that the incoming text's
  506. encoding can be detected properly.
  507. ==========================================
  508. Japanese Manual
  509. ==========================================
  510. Manual translated by the "PHP Japanese Manual Project":
  511. http://www.php.net/manual/ja/manual.php
  512. Starting with 3.0.18-i18n-ja, we have removed doc-jp from the tarball package.
  513. ==========================================
  514. Change Logs
  515. ==========================================
  516. o 2000-10-28, Rui Hirokawa <hirokawa@php.net>
  517. This patch is derived from php-3.0.15-i18n-ja as well as php-3.0.16 by
  518. Kuwamura, applied to the original php-3.0.18. It also includes the following fixes:
  519. 1) allows you to set charset in mail().
  520. 2) fixed mbregex definitions to avoid conflicts with system regex
  521. 3) php3.ini-dist now uses PASS for http_output instead of SJIS
  522. o 2000-11-24, Hironori Sato <satoh@yyplanet.com>
  523. Applied the above patch and added detection for gdImageStringTTF in configure.
  524. The following setups are known to work:
  525. gd-1.3-6, gd-devel-1.3-6, freetype-1.3.1-5, freetype-devel-1.3.1-5
  526. ImageTTFText($im,$size,$angle,$x1,$y1,$color,"/path/to/font.ttf",
  527. i18n_convert("日本語", "UTF-8"));
  528. ImageGif($im);
  529. gd-1.7.3-1k1, gd-devel-1.7.3-1k1, freetype-1.3.1-5, freetype-devel-1.3.1-5
  530. ImageTTFText($im,$size,$angle,$x1,$y1,$color,"/path/to/font.ttf","日本語");
  531. ImagePng($im);
  532. * i18n_internal_encoding = EUC or SJIS
  533. For any gd library before 1.6.2, you need to use i18n_convert. For
  534. gd-1.5.2/3, upgrade to anything above 1.7 to use ImageTTFText without
  535. i18n_convert. As long as internal_encoding is set to EUC or
  536. SJIS, ImageTTFText should work without mojibake. Again, make sure you
  537. call i18n_http_output("pass") before calling ImageGif, ImagePng, or ImageJpeg!
  538. o 2000-12-09, Rui Hirokawa <hirokawa@php.net>
  539. Fixed mail(), which caused a segmentation fault when the header was null.