123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615616617618619620621622623624625626627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675676677678679680681682683684685686687688689690691692693694695696697698699700701702703704705706707708709710711712713714715716717718719720721722723724725726727728729730731732733734735736737738739740741742743744745746747748749750751752753754755756757758759760761762763764765766767768769770771772773774 |
- ==========================================
- README for I18N Package
- ==========================================
- o Name and location of package
- Name: php-3.0.18-i18n-ja-2
- Location: http://www.happysize.co.jp/techie/php-ja-jp/
- ftp://ftp.happysize.co.jp/php-ja-jp/
- http://php.vdomains.org/
- ftp://ftp.vdomains.org/pub/php-ja-jp/
- http://php.jpnnet.com/
- Currently, this I18N version of PHP only adds Japanese support to base
- PHP. It allows you to use Japanese in scripts, as well as conversion
- between various Japanese encodings. It will work perfectly fine with
- ASCII with i18n option enabled. (note: executable is bit larger due
- to UNICODE table). The basic design aproach is to allow for other
- languages to be added in the future. Developers are encourage to join
- us!
- For more information on Japanese encodings, please refer to the
- section "Additional Notes."
- o What is this package?
- This package allows you to handle multiple Japanese encodings (SJIS, EUC,
- UTF-8, JIS) in PHP. If you find any bugs in this package, please report
- them to the appropriate mailing list. For now, the PHP-jp mailing list
- is the best place for this.
- PHP-jp ML mailto:PHP-jp@sidecar.ics.es.osaka-u.ac.jp
- http://sidecar.ics.es.osaka-u.ac.jp/php-jp/
- (discussions are in Japanese)
- o Who should use this
- Due to lack of documentation, it's not intended for beginners. If
- something goes wrong, be prepared to fix it on your own.
- o Warranty and Copyright
- There is no warranty with this package. Use it at your own risk.
- Please refer to the source code for the copyrights. In general, each
- program's copyright is owned by the programmer. Unless you obey the
- copyright holders restrictions, you are not allowed to use it in any
- form.
- o Redistribution
- As described in the source code, this package and the components are
- allowed to be redistributed with certain restrictions.
- Due to this package being still in beta, please try to redistribute
- it as an entire package. Please try not to distribute it as a form
- of patch. Because we would prefer to have this package distributed
- as one single package (not patch of patch of patch), avoid releasing
- any patch to this package.
- o Who made this
- A team of volunteers, PHP3 Internationalization, has been contributing
- their free time producing it. Although we are not related to the core
- PHP programmers, we are hoping to have our modifications merged into the
- core distribution in the near future. Thus, we did not call this a
- "Japanese Patch" (or distribution). Our final goal is to have true
- i18nized PHP!
- For anyone interested in this project, please drop us a line.
- Contact Address:
- phpj-dev@kage.net
- (Discussions are in Japanese, but feel free to write us in English)
- Webpage (English and Japanese):
- http://php.jpnnet.com/
- Project Outline (Japanese):
- http://www.happysize.co.jp/techie/php-ja-jp/spec.htm
- Developers:
- Hironori Sato <satoh@jpnnet.com>
- Shigeru Kanemoto <sgk@happysize.co.jp>
- Tsukada Takuya <tsukada@fminn.nagano.nagano.jp>
- U. Kenkichi <kenkichi@axes.co.jp>
- Tateyama <tateyan@amy.hi-ho.ne.jp>
- Other gracious contributors
- o Future plans
- - fulfilling what's written in outline
- - support for other languages other than Japanese
- - make the character conversion as a library (?)
- - more testing
- o Special Thanks to
- PHP Japanese webpage maintainer, Hirokawa-san
- http://www.cityfujisawa.ne.jp/%7Elouis/apps/phpfi/
- PHP-JP ML's Yamamoto-san
- http://sidecar.ics.es.osaka-u.ac.jp/php-jp/
- Previous jp-patch developers
- ==========================================
- Advantages of using I18N package
- ==========================================
- - allows you to use various character encodings for script files and
- http output
- - distinguish character encoding in POST/GET/COOKIE
- - proper mail output using JIS as body and MIME/Base64/JIS subject
- - if http output's Content-Type is text/html, it will set proper charset
- - stable character encoding conversion
- - multibyte regex
- ==========================================
- Installation
- ==========================================
- o Summary
- Add --enable-i18n option when running configure. For your own setup,
- add any other appropriate options as well.
- Don't forget to copy php3.ini-dist to desired location.
- (ex. /usr/local/lib/php3.ini)
- If you have already installed PHP3, copy all the entries in php3.ini-dist
- which start with "i18n.xxxx" to php3.ini.
- o configure option
- --enable-i18n
- include i18n features
- --enable-mbregex
- include multibyte regex library
- (without i18n enabled, mbregex functions will not function)
- o creating cgi version
- % tar xvzf php-3.0.18-i18n-ja-2.tar.gz
- % cd php-3.0.18-i18n-ja-2
- % ./configure --enable-i18n --enable-mbregex
- % make
- o creating Apache version (regular module)
- % tar xvzf php-3.0.18-i18n-ja-2.tar.gz
- % tar xvzf apache_1.3.x.tar.gz
- % cd apache_1.3.x
- % ./configure
- % cd ../php-3.0.18-i18n-ja-2
- % ./configure --with-apache=../apache_1.3.x --enable-i18n --enable-mbregex
- % make
- % make install
- % cd ../apache_1.3.x
- % ./configure --activate-module=src/modules/php3/libphp3.a
- % make
- % make install
- o creating Apache DSO version
- create DSO capable Apache first
- % tar xvzf apache_1.3.x.tar.gz
- % cd apache-1.3.x
- % ./configure --enable-shared=max
- % make
- % make install
- now create php3
- % cd php-3.0.18-i18n-ja-2
- % ./configure --with-apxs=/usr/local/apache/bin/apxs --enable-i18n \
- --enable-mbregex
- % make
- % make install
- ==========================================
- Additional Notes
- ==========================================
- o Multibyte regex library
- From beta4, we have included the multibyte (mb) regex library which comes with
- Ruby. With this addition, you can now use regex in EUC, SJIS and UTF-8
- encoding. To avoid any conflicts with HSREGEX included with Apache,
- each function name has been changed. Therefore, mb regex functions are
- named differently from the original ereg functions in PHP. The character
- encoding used in mb regex is configured in i18n.internal_encoding.
- o Binary Output
- If http output encoding is set to other than 'pass', conversion of encoding
- from internal encoding to http output is done automatically. Thus,
- if you prefer to spit out anything in raw binary format, your data
- may be corrupted. In such event, set http_output to 'pass'.
- ex.
- <?
- i18n_http_output("pass");
- ...
- echo $the_binary_data_string;
- ?>
- o Content-Type
- Depending on the setting of http_output, PHP will output the proper charset.
- ex. Content-Type: text/html; charset="..."
- Be aware of following:
- - If you set Content-Type header using header() function, that will
- override the automatic addition of charset.
- - Be cautious when you set i18n_http_output, since if any output is
- made prior to this, proper header may have been sent out to the
- client already.
- o In the event of trouble
- If you find any bugs or trouble, please contact us at the above address.
- It may help us to track the problem if you send us the script as well.
- If you encounter any memory related error such as segmentation violation,
- add --enable-debug when you run configure. This will give you more
- detail information on where error has occurred. The error is stored
- in the server log or regular http output in CGI mode.
- o About Japanese encodings
- Due to historical reason, there are multiple character encodings used
- for Japanese. The most common encodings are: SJIS, EUC, JIS, and UTF-8.
- Here are (very) brief description of them:
- EUC
- commonly used in UNIX environment
- 8bit-8bit combo
- always >=0x80
- SJIS
- commonly used in Mac or PCs
- similar to EUC
- mostly 8bit-8bit (some 8bit-7bit)
- mostly >=0x80
- there are some halfwidth (size of ASCII) multibytes
- JIS
- commonly used in 7bit environment (nntp and smtp)
- starts with escaping char, \033 and a few more characters
- UTF-8
- 16bit+ encoding
- defines many languages existing in this world
- see http://www.unicode.org/ for more detail
- Because of having all these character encodings, PHP needs to translate
- between these encodings on the fly. Also, the addition of the mb regex
- library allows you to handle mb strings without fear of getting mb char
- chopped in half.
- Since Japanese is not the only language with multiple encodings, we
- encourage other developers to modify our code to suit your needs. We
- definitely need people to work with Korean, Chinese (both traditional
- and simplified), and Russian. Let us know if you are interested in
- this project!
- ==========================================
- php3.ini setting
- ==========================================
- The following init options will allow you to change the default settings.
- Define these settings in the global section of php3.ini.
- All keywords are case-insensitive.
- o Encoding naming
- For each encoding, there are three names: standarized, alias, MIME
- - UTF-8
- standard: UTF-8
- alias: N/A
- mime: UTF-8
- - ASCII
- standard: ASCII
- alias: N/A
- mime: US-ASCII
- - Japanese EUC
- standard: EUC-JP
- alias: EUC, EUC_JP, eucJP, x-euc-jp
- mime: EUC-JP
- - Shift JIS
- standard: SJIS
- alias: x-sjis, MS_Kanji
- mime: Shift_JIS
- - JIS
- standard: JIS
- alias: N/A
- mime: ISO-2022-JP
- - Quoted-Printable
- standard: Quoted-Printable
- alias: qprint
- mime: N/A
- - BASE64
- standard: BASE64
- alias: N/A
- mime: N/A
- - no conversion
- standard: pass
- alias: none
- mime: N/A
- - auto encoding detection
- standard: auto
- alias: unknown
- mime: N/A
- * N/A - Not Applicapable
- o i18n.http_output - default http output encoding
- i18n.http_output = EUC-JP|SJIS|JIS|UTF-8|pass
- EUC-JP : EUC
- SJIS: SJIS
- JIS : JIS
- UTF-8: UTF-8
- pass: no conversion
- The default is pass (internal encoding is used)
- It can be re-configured on the fly using i18n_http_output().
- o i18n.internal_encoding - internal encoding
- i18n.internal_encoding = EUC-JP|SJIS|UTF-8
- EUC-JP : EUC
- SJIS: SJIS
- UTF-8: UTF-8
- The default is EUC-JP.
- PHP parser is designed based on using ISO-8859-1. For other
- encodings, following conditions have to be satisfied in order
- to use them:
- - per byte encoding
- - single byte character in range of 00h-7fh which is compatible
- with ASCII
- - multibyte without 00h-7fh
- In case of Japanese, EUC-JP and UTF-8 are the only encoding that
- meets this criteria.
- If i18n.internal_encoding and i18n.http_output differs, conversion
- takes place at the time of output. If you convert any data within
- PHP scripts to URL encoding, BASE64 or Quoted-Printable, encoding
- stays as defined in i18n.internal_encoding. Thus, if you would
- prefer to encode in compliance with i18n.http_output, you need
- to manually convert encoding.
- ex. $str = urlencode( i18n_convert($str, i18n_http_output()) );
- Encoding such as ISO-2022-** and HZ encoding which uses escape
- sequences can not be used as internal encoding. If used, they
- result in following errors:
- - parser pukes funky error
- - magic_quotes_*** breaks encoding (SJIS may have similar problem)
- - string manipulation and regex will malfunction
- o i18n.script_encoding - script encoding
- i18n.script_encoding = auto|EUC-JP|SJIS|JIS|UTF-8
- auto: automatic
- EUC-JP : EUC
- SJIS: SJIS
- JIS : JIS
- UTF-8: UTF-8
- The default is auto.
- The script's encoding is converted to i18n.internal_encoding before
- entering the script parser.
- Be aware that auto detection may fail under some conditions.
- For best auto detection, add multibyte character at beginning of
- script.
- o i18n.http_input - handling of http input (GET/POST/COOKIE)
- i18n.http_input = pass|auto
- auto: auto conversion
- pass: no conversion
- The default is auto.
- If set to pass, no conversion will take place.
- If set to auto, it will automatically detect the encoding. If
- detection is successful, it will convert to the proper internal
- encoding. If not, it will assume the input as defined in
- i18n.http_input_default.
- o i18n.http_input_default - default http input encoding
- i18n.http_input_default = pass|EUC-JP|SJIS|JIS|UTF-8
- pass: no conversion
- EUC-JP : EUC
- SJIS: SJIS
- JIS : JIS
- UTF-8: UTF-8
- The default is pass.
- This option is only effective as long as i18n.http_input is set to
- auto. If the auto detection fails, this encoding is used as an
- assumption to convert the http input to the internal encoding.
- If set to pass, no conversion will take place.
- o sample settings
- 1) For most flexibility, we recommend using following example.
- i18n.http_output = SJIS
- i18n.internal_encoding = EUC-JP
- i18n.script_encoding = auto
- i18n.http_input = auto
- i18n.http_input_default = SJIS
- 2) To avoid unexpected encoding problems, try these:
- i18n.http_output = pass
- i18n.internal_encoding = EUC-JP
- i18n.script_encoding = pass
- i18n.http_input = pass
- i18n.http_input_default = pass
- ==========================================
- PHP functions
- ==========================================
- The following describes the additional PHP functions.
- All keywords are case-insensitive.
- o i18n_http_output(encoding)
- o encoding = i18n_http_output()
- This will set the http output encoding. Any output following this
- function will be controlled by this function. If no argument is given,
- the current http output encode setting is returned.
- encodings
- EUC-JP : EUC
- SJIS: SJIS
- JIS : JIS
- UTF-8: UTF-8
- pass: no conversion
- NONE is not allowed
- o encoding = i18n_internal_encoding()
- Returns the current internal encoding as a string.
- internal encoding
- EUC-JP : EUC
- SJIS: SJIS
- UTF-8: UTF-8
- o encoding = i18n_http_input()
- Returns http input encoding.
- encodings
- EUC-JP : EUC
- SJIS: SJIS
- JIS : JIS
- UTF-8: UTF-8
- pass: no conversion (only if i18n.http_input is set to pass)
- o string = i18n_convert(string, encoding)
- string = i18n_convert(string, encoding, pre-conversion-encoding)
- Returns converted string in desired encoding. If
- pre-conversion-encoding is not defined, the given
- string is assumed to be in internal encoding.
- encoding
- EUC-JP : EUC
- SJIS: SJIS
- JIS : JIS
- UTF-8: UTF-8
- pass: no conversion
- pre-conversion-encoding
- EUC-JP : EUC
- SJIS: SJIS
- JIS : JIS
- UTF-8: UTF-8
- pass: no conversion
- auto: auto detection
- o encoding = i18n_discover_encoding(string)
- Encoding of the given string is returned (as a string).
- encoding
- EUC-JP : EUC
- SJIS: SJIS
- JIS : JIS
- UTF-8: UTF-8
- ASCII: ASCII (only 09h, 0Ah, 0Dh, 20h-7Eh)
- pass: unable to determine (text is too short to determine)
- unknown: unknown or possible error
- o int = mbstrlen(string)
- o int = mbstrlen(string, encoding)
- Returns character length of a given string. If no encoding is defined,
- the encoding of string is assumed to be the internal encoding.
- encoding
- EUC-JP : EUC
- SJIS: SJIS
- JIS : JIS
- UTF-8: UTF-8
- auto: automatic
- o int = mbstrpos(string1, string2)
- o int = mbstrpos(string1, string2, start)
- o int = mbstrpos(string1, string2, start, encoding)
- Same as strpos. If no encoding is defined, the encoding of string
- is assumed to be the internal encoding.
- encoding
- EUC-JP : EUC
- SJIS: SJIS
- JIS : JIS
- UTF-8: UTF-8
- o int = mbstrrpos(string1, string2)
- o int = mbstrrpos(string1, string2, encoding)
- Same as strrpos. If no encoding is defined, the encoding of string
- is assumed to be the internal encoding.
- encoding
- EUC-JP : EUC
- SJIS: SJIS
- JIS : JIS
- UTF-8: UTF-8
- o string = mbsubstr(string, position)
- o string = mbsubstr(string, position, length)
- o string = mbsubstr(string, position, length, encoding)
- Same as substr. If no encoding is defined, the encoding of string
- is assumed to be the internal encoding.
- encoding
- EUC-JP : EUC
- SJIS: SJIS
- JIS : JIS
- UTF-8: UTF-8
- o string = mbstrcut(string, position)
- o string = mbstrcut(string, position, length)
- o string = mbstrcut(string, position, length, encoding)
- Same as subcut. If position is the 2nd byte of a mb character, it will cut
- from the first byte of that character. It will cut the string without
- chopping a single byte from a mb character. In another words, if you
- set length to 5, you will only get two mb characters. If no encoding
- is defined, the encoding of string is assumed to be the internal encoding.
- encoding
- EUC-JP : EUC
- SJIS: SJIS
- JIS : JIS
- UTF-8: UTF-8
- o string = i18n_mime_header_encode(string)
- MIME encode the string in the format of =?ISO-2022-JP?B?[string]?=.
- o string = i18n_mime_header_decode(string)
- MIME decodes the string.
- o string = i18n_ja_jp_hantozen(string)
- o string = i18n_ja_jp_hantozen(string, option)
- o string = i18n_ja_jp_hantozen(string, option, encoding)
- Conversion between full width character and halfwidth character.
- option
- The following options are allowed. The default is "KV".
- Acronym: FW = fullwidth, HW = halfwidth
- "r" : FW alphabet -> HW alphabet
- "R" : HW alphabet -> FW alphabet
- "n" : FW number -> HW number
- "N" : HW number -> FW number
- "a" : FW alpha numeric (21h-7Eh) -> HW alpha numeric
- "A" : HW alpha numeric (21h-7Eh) -> FW alpha numeric
- "k" : FW katakana -> HW katakana
- "K" : HW katakana -> FW katakana
- "h" : FW hiragana -> HW hiragana
- "H" : HW hiragana -> FW katakana
- "c" : FW katakana -> FW hiragana
- "C" : FW hiragana -> FW katakana
- "V" : merge dakuon character. only works with "K" and "H" option
- encoding
- If no encoding is defined, the encoding of string is assumed to be
- the internal encoding.
- EUC-JP : EUC
- SJIS: SJIS
- JIS : JIS
- UTF-8: UTF-8
- int = mbereg(regex_pattern, string, string)
- int = mberegi(regex_pattern, string, string)
- mb version of ereg() and eregi()
- string = mbereg_replace(regex_pattern, string, string)
- string = mberegi_replace(regex_pattern, string, string)
- mb version of ereg_replace() and eregi_replace()
- string_array = mbsplit(regex, string, limit)
- mb version of split()
- ==========================================
- FAQ
- ==========================================
- Here, we have gathered some commonly asked questions on PHP-jp mailing
- list.
- o To use Japanese in GET method
- If you need to assign Japanese text in GET method with argument, such as;
- xxxx.php?data=<Japanese text>, use urlencode function in PHP. If not,
- text may not be passed onto action php properly.
- ex: <a href="hoge.php?data=<? echo urlencode($data) ?>">Link</a>
- o When passing data via GET/POST/COOKIE, \ character sneaks in
- When using SJIS as internal encoding, or passed-on data includes '"\,
- PHP automatically inserts escaping character, \. Set magic_quotes_gpc
- in php3.ini from On to Off. An alternative work around to this problem
- is to use StripSlashes().
- If $quote_str is in SJIS and you would like to extract Japanese text,
- use ereg_replace as follows:
- ereg_replace(sprintf("([%c-%c%c-%c]\\\\)\\\\",0x81,0x9f,0xe0,0xfc),
- "\\1",$quote_str);
- This will effectively extract Japanese text out of $quote_str.
- o Sometimes, encoding detection fails
- If i18n_http_input() returns 'pass', it's likely that PHP failed to
- detect whether it's SJIS or EUC. In such case, use <input type=hidden
- value="some Japanese text"> to properly detect the incoming text's
- encoding.
- ==========================================
- Japanese Manual
- ==========================================
- Translated manual done by "PHP Japanese Manual Project" :
- http://www.php.net/manual/ja/manual.php
- Starting 3.0.18-i18n-ja, we have removed doc-jp from tarball package.
- ==========================================
- Change Logs
- ==========================================
- o 2000-10-28, Rui Hirokawa <hirokawa@php.net>
- This patch is derived from php-3.0.15-i18n-ja as well as php-3.0.16 by
- Kuwamura applied to original php-3.0.18. It also includes following fixes:
- 1) allows you to set charset in mail().
- 2) fixed mbregex definitions to avoid conflicts with system regex
- 3) php3.ini-dist now uses PASS for http_output instead of SJIS
- o 2000-11-24, Hironori Sato <satoh@yyplanet.com>
- Applied above patched and added detection for gdImageStringTTF in configure.
- Following setups are known to work:
- gd-1.3-6, gd-devel-1.3-6, freetype-1.3.1-5, freetype-devel-1.3.1-5
- ImageTTFText($im,$size,$angle,$x1,$y1,$color,"/path/to/font.ttf",
- i18n_convert("ÆüËܸì", "UTF-8"));
- ImageGif($im);
- gd-1.7.3-1k1, gd-devel-1.7.3-1k1, freetype-1.3.1-5, freetype-devel-1.3.1-5
- ImageTTFText($im,$size,$angle,$x1,$y1,$color,"/path/to/font.ttf","ÆüËܸì");
- ImagePng($im);
- * i18n_internal_encoding = EUC Ëô¤Ï SJIS
- For any gd libraries before 1.6.2, you need to use i18n_convert. For
- gd-1.5.2/3, upgrade to anything above 1.7 to use ImageTTFText without
- using i18n_convert. As long as you have internal_encoding set to EUC or
- SJIS, ImageTTFText should work without mojibake. Again, make sure you
- have i18n_http_output("pass") before calling ImageGif, ImagePng, ImageJpeg!
- o 2000-12-09, Rui Hirokawa <hirokawa@php.net>
- Fixed mail() which was causing segmentation fault when header was null.
|