=====
!==
!== Japanese-HOWTO.txt for Samba release 2.2.x
!==
Contributor:    TAKAHASHI Motonobu <monyo@samba.org>
Date:           26 Nov 2001
Status:         Current

How to use Japanese on Samba 2.0.x/2.2.x
========================================

This document explains how to use Japanese file names and share
names in Samba, together with some related notes. Although Samba
Japanese Edition, provided by Samba Users Group Japan, is widely used
in Japan, this document describes the original versions of Samba
unless explicitly stated otherwise.

In addition, even if your UNIX system itself cannot handle Japanese,
Samba can still handle Japanese file names.


Settings for using Japanese
===========================

To use Japanese, you should set two parameters, "client code page"
and "coding system", appropriately. They specify that Japanese is
used and which encoding method is used for Japanese, respectively.

Both "client code page" and "coding system" must be placed at the top
of your smb.conf. Because these parameters also determine how Japanese
text inside smb.conf itself is encoded, Samba cannot interpret any
Japanese that appears before them. Be careful when you edit smb.conf
by hand! When smb.conf is edited with SWAT, they are placed at the top
automatically.

Set "client code page" to 932, which is the code page for Japanese.

"coding system" is a parameter intended for Japanese environments. It
determines the encoding method used for Japanese file names on your
Samba server.

Mainly for historical reasons, there are several encoding methods for
Japanese, which are not fully compatible with each other. Moreover,
Samba offers several encoding methods of its own to keep
interoperability with UNIX systems that cannot use Japanese file
names. "coding system" defines which encoding method to use.


Deciding which "coding system" to use
=====================================

Deciding which "coding system" value to use is a difficult issue.
At least five values, SJIS, EUC, CAP, HEX and UTF8 (UTF8 is available
only in the Samba 2.2 series), are in general use, and each has merits
and demerits.

The standard encoding method on Windows is "Shift_JIS", equivalent to
SJIS (although the Windows NT series uses Unicode 2.0 encoded as UCS-2
internally, Shift_JIS is used externally for Japanese, just as ASCII
is used for English). However, using SJIS in Samba, the same as
Windows, is not always the best choice, as described below.

Please read the following explanation and choose the value that suits
you. There are other values as well, but they are not explained here
because they are hardly ever used.

You can determine the value for "coding system" by checking the
following conditions in order. The details of each value are given
later.

1. Set to "HEX" unless one of the subsequent conditions is satisfied.

  "HEX" is safe because it uses only the ASCII characters in
  [0-9a-f:] to express Japanese file names.

2. Set to "CAP" if the directory shared with Samba is also shared with
    CAP or Netatalk.

  Since CAP and Netatalk usually write file names in "CAP" form,
  Samba needs to use the same encoding method.

  However, if Netatalk has the EUC-JP patch applied, file names are
  written in "EUC-JP" form, and Samba needs to use "EUC" instead.

3. If you need to use Japanese file names on UNIX,
    set to "EUC" if the form used on the UNIX is EUC-JP,
    set to "SJIS" if the form used on the UNIX is Shift_JIS, and
    set to "UTF8" if the form used on the UNIX is UTF-8.

  Usually, EUC-JP is used on Linux, FreeBSD, Solaris, IRIX and Tru64
  UNIX, and Shift_JIS is used on HP-UX and AIX. Most commercial UNIX
  systems can use either one by changing the locale.

  However, much free software works only with EUC-JP regardless of
  the setting on UNIX, so EUC-JP is also worth considering if you
  mainly use such software.

There is no all-round way to satisfy every condition. If some
conditions conflict, unfortunately you need to give up one of them.
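
The judgment order above can be sketched as a small Python function.
This is only an illustration: the function and its parameter names are
hypothetical and not part of Samba, and raising an error when the
conditions conflict is an assumption standing in for "give up one of
them".

```python
def choose_coding_system(shared_with_cap_or_netatalk=False,
                         netatalk_has_eucjp_patch=False,
                         unix_japanese_encoding=None):
    """Return a "coding system" value following the judgment order.

    All parameter names are hypothetical. unix_japanese_encoding is None
    when Japanese file names are not needed on UNIX, otherwise one of
    "EUC-JP", "Shift_JIS" or "UTF-8".
    """
    value = "HEX"  # 1. the safe default
    if shared_with_cap_or_netatalk:
        # 2. match what CAP/Netatalk writes (EUC-JP if the patch is applied)
        value = "EUC" if netatalk_has_eucjp_patch else "CAP"
    if unix_japanese_encoding is not None:
        # 3. match the encoding used on the UNIX side
        unix_map = {"EUC-JP": "EUC", "Shift_JIS": "SJIS", "UTF-8": "UTF8"}
        wanted = unix_map[unix_japanese_encoding]
        if shared_with_cap_or_netatalk and wanted != value:
            # Assumption: report the conflict instead of silently picking one.
            raise ValueError("conditions conflict: %s vs %s" % (value, wanted))
        value = wanted
    return value


print(choose_coding_system())                                 # HEX
print(choose_coding_system(shared_with_cap_or_netatalk=True)) # CAP
print(choose_coding_system(unix_japanese_encoding="EUC-JP"))  # EUC
```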


Details of each value of "coding system"
========================================

This section describes each value of "coding system" in detail, with
its merits and demerits.

o HEX
  In the case of "HEX", for example, if a Japanese file name consisting
  of 0x8ba4 and 0x974c (a 4-byte Japanese character string meaning
  "share") followed by ".txt" is written from Windows to Samba, the
  file name on UNIX becomes ":8b:a4:97:4c.txt" (a 16-byte ASCII
  string). This format is specific to Samba.

  The greatest merit of "HEX" is interoperability with English
  environments. With "HEX", every Japanese file name is stored on UNIX
  with its original byte values preserved, using only a few ASCII
  characters. This is very safe: even if your UNIX cannot handle
  Japanese characters, file names are never displayed broken and
  commands never abort while parsing file names.

  On the other hand, since 6 bytes are used to express a 2-byte
  character, long file names may exceed 128 bytes, which is the limit
  on file name length in the Samba 2.0 series.

  Moreover, it is very inconvenient for users on UNIX to work with
  Japanese file names written from Windows, since such a file name is
  visible only as an encoded string of ASCII characters.
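
  The "HEX" form described above can be sketched in Python as follows.
  This is an illustration only, not Samba's actual code: the function
  names are hypothetical, and the Shift_JIS lead byte ranges
  (0x81-0x9f and 0xe0-0xfc) are an assumption used to find 2-byte
  characters.

```python
def is_sjis_lead(b: int) -> bool:
    # Assumed lead-byte ranges of 2-byte Shift_JIS characters.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def hex_encode(sjis_name: bytes) -> str:
    """Write both bytes of a 2-byte character as ':xx', keep ASCII as is."""
    out = []
    i = 0
    while i < len(sjis_name):
        b = sjis_name[i]
        if is_sjis_lead(b) and i + 1 < len(sjis_name):
            out.append(":%02x:%02x" % (b, sjis_name[i + 1]))
            i += 2
        elif b >= 0x80:
            out.append(":%02x" % b)  # other non-ASCII byte
            i += 1
        else:
            out.append(chr(b))       # plain ASCII byte
            i += 1
    return "".join(out)


print(hex_encode(b"\x8b\xa4\x97\x4c.txt"))    # :8b:a4:97:4c.txt
print(len(hex_encode(b"\x8b\xa4" * 22)))      # 132
```

  The second line of output shows the length problem: a name of only
  22 Japanese characters already needs 132 bytes, which is over the
  128-byte limit of the Samba 2.0 series.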

o CAP
  In the case of "CAP", for example, if a Japanese file name consisting
  of 0x8ba4 and 0x974c (a 4-byte Japanese character string meaning
  "share") followed by ".txt" is written from Windows to Samba, the
  file name on UNIX becomes ":8b:a4:97L.txt" (a 14-byte ASCII
  string). This is the format used by CAP and Netatalk, file server
  software for Macintosh.

  The difference from "HEX" is that when a 2-byte Japanese character
  is divided into 2 bytes, a byte that can be expressed as an ASCII
  character is not encoded in ":xx" form but is written as the ASCII
  character itself. As a result, a "CAP"-encoded file name may contain
  a character that is allowed in a UNIX file name but is troublesome.
  In particular, take care with file names that contain "\ (0x5c)".

  The greatest merit of "CAP" is that its file name encoding is
  compatible with CAP and Netatalk, file server software for
  Macintosh. Since they usually write file names on UNIX in CAP form,
  if a directory is shared with both Samba and Netatalk, you need to
  use "CAP" to prevent Japanese file names from being broken.

  However, there are now some systems (e.g. Vine Linux, a distribution
  originating in Japan) that ship Netatalk with a patch applied to
  write file names in EUC-JP. On such systems you need to choose "EUC"
  instead of "CAP".

  Most merits and demerits of "CAP" are basically the same as those of
  "HEX", except that "HEX" is safer. It is better to use "HEX" or
  another value unless you need to use "CAP".
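
  The "CAP" form can be sketched in Python in the same way. Again this
  is only an illustration with a hypothetical function name, not the
  code of Samba, CAP or Netatalk: every byte of 0x80 or above is
  written as ":xx", and every ASCII byte is written as it is.

```python
def cap_encode(sjis_name: bytes) -> str:
    """Write bytes >= 0x80 as ':xx' and leave ASCII bytes untouched."""
    return "".join(":%02x" % b if b >= 0x80 else chr(b) for b in sjis_name)


print(cap_encode(b"\x8b\xa4\x97\x4c.txt"))   # :8b:a4:97L.txt

# A 2-byte character whose second byte is 0x5c leaves a bare backslash
# in the file name on UNIX.
print(cap_encode(b"\x95\x5c.txt"))           # :95\.txt
```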

o EUC
  In the case of "EUC", for example, if a Japanese file name consisting
  of 0x8ba4 and 0x974c (a 4-byte Japanese character string meaning
  "share") followed by ".txt" is written from Windows to Samba, the
  file name on UNIX becomes 0xb6a6, 0xcdad, ".txt" (an 8-byte BINARY
  string). "EUC" is equivalent to the industry standard called EUC-JP,
  widely used on Japanese UNIX (although EUC also covers languages
  other than Japanese, such as EUC-KR, "EUC" in Samba means EUC-JP
  only).

  The greatest merit of "EUC" is interoperability with "Japanized"
  UNIX. EUC-JP is usually the default Japanese character code on
  open source UNIX such as Linux and FreeBSD, and on commercial UNIX
  such as Solaris, IRIX and Tru64 UNIX (however, Solaris can also use
  Shift_JIS and UTF-8, and Tru64 UNIX can also use Shift_JIS). With
  "EUC", most Japanese file names created from Windows can also be
  referred to on UNIX. Also, most Japanized free software works
  mainly with EUC-JP only. It is good to choose "EUC" when using
  Japanese file names on these UNIX systems.

  However, when your locale is not set to EUC-JP, some characters
  cannot be displayed correctly. Although no character needs careful
  treatment in the way "\ (0x5c)" does, broken file names may be
  displayed and some commands may abort while parsing file names.

  Moreover, "EUC" is NOT fully compatible with Windows. The user
  defined characters available in Windows are not available with
  "EUC" because of its specification.

  Therefore, if you use "EUC", you need to avoid using incompatible
  characters in file names.
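
  The byte values in the example above can be checked with a short
  Python sketch. It uses Python's own "shift_jis" and "euc_jp" codecs
  as a stand-in for Samba's conversion, which is an assumption: the
  two may differ for vendor defined and user defined characters. The
  function name is hypothetical.

```python
def sjis_to_euc(sjis_name: bytes) -> bytes:
    """Convert a Shift_JIS file name to EUC-JP via Unicode."""
    return sjis_name.decode("shift_jis").encode("euc_jp")


euc_name = sjis_to_euc(b"\x8b\xa4\x97\x4c.txt")
print(euc_name.hex())                        # b6a6cdad2e747874
print(len(euc_name))                         # 8

# Every byte of the two Japanese characters is 0x80 or above, so a
# byte such as 0x5c cannot appear inside them.
print(all(b >= 0x80 for b in euc_name[:4]))  # True
```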

o SJIS
  "SJIS" is equivalent to Shift_JIS, used as a standard on Japanese
  Windows. In the case of "SJIS", for example, if a Japanese file name
  consisting of 0x8ba4 and 0x974c (a 4-byte Japanese character string
  meaning "share") followed by ".txt" is written from Windows to
  Samba, the file name on UNIX becomes 0x8ba4, 0x974c, ".txt" (an
  8-byte BINARY string), the same as on Windows.

  The greatest merit of "SJIS" is, in contrast to "EUC",
  interoperability with Windows. Since there is no conversion, it is
  fully compatible with Windows, and the "user defined characters" and
  "vendor defined characters", which have the problems mentioned
  later, can be used comparatively safely.

  However, as with "EUC", broken file names may be displayed and some
  commands may abort while parsing file names. In particular, unlike
  "EUC", file names may contain "\ (0x5c)", which needs to be treated
  carefully.

  Shift_JIS is usually the default Japanese character code on some
  commercial UNIX such as HP-UX and AIX (however, they can also use
  EUC-JP). With "SJIS", most Japanese file names created from Windows
  can also be referred to on UNIX. However, as mentioned in the
  description of "EUC", most Japanized free software actually works
  with EUC-JP only. You had better confirm that the Japanized free
  software you use can work with Shift_JIS.

  If your UNIX is already working with Shift_JIS and there is a user
  who needs to use Japanese file names written from Windows, "SJIS" is
  basically the best choice.

  If you use "SJIS" on a UNIX that cannot handle Shift_JIS because
  compatibility with Windows is most important, you should not touch
  files written from Windows on the UNIX side.
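
  The "\ (0x5c)" problem mentioned above comes from 2-byte Shift_JIS
  characters whose second byte is 0x5c. The following Python sketch
  looks for such characters in a file name. The function name is
  hypothetical and the lead byte ranges (0x81-0x9f and 0xe0-0xfc) are
  an assumption about Shift_JIS, not taken from Samba.

```python
def has_backslash_trail_byte(sjis_name: bytes) -> bool:
    """Return True if a 2-byte character ends with the byte 0x5c."""
    i = 0
    while i < len(sjis_name):
        b = sjis_name[i]
        is_lead = 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC
        if is_lead and i + 1 < len(sjis_name):
            if sjis_name[i + 1] == 0x5C:
                return True
            i += 2  # skip both bytes of the 2-byte character
        else:
            i += 1
    return False


# 0x955c is one kanji; its second byte is the same as ASCII "\".
print(has_backslash_trail_byte(b"\x95\x5c.txt"))          # True
print(has_backslash_trail_byte(b"\x8b\xa4\x97\x4c.txt"))  # False
```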

o UTF8
  "UTF8" is equivalent to UTF-8, the international standard defined by
  Unicode.org. In UTF-8, a *character* is expressed with 1 - 3 *bytes*.
  In the case of Japanese, most characters are expressed with 3
  bytes. Since Windows uses Shift_JIS, where a character is expressed
  with 2 bytes, to express Japanese, the byte length of a UTF-8 string
  is basically 1.5 times that of the original Shift_JIS string. In the
  case of "UTF8", for example, if a Japanese file name consisting of
  0x8ba4 and 0x974c (a 4-byte Japanese character string meaning
  "share") followed by ".txt" is written from Windows to Samba, the
  file name on UNIX becomes 0xe585, 0xb1e6, 0x9c89, ".txt" (a 10-byte
  BINARY string).

  For Japanese processing in Samba, "UTF8" has no merit except that
  Japanese file names can be handled when your UNIX uses UTF-8 as its
  current locale.

  As with "EUC", when your locale is not set to UTF-8, broken file
  names may be displayed and some commands may abort while parsing
  file names. Moreover, file names may contain "\ (0x5c)", which needs
  to be treated carefully.

  UTF-8 can be used on some commercial UNIX such as Solaris and
  HP-UX. However, as mentioned in the description of "EUC", most
  Japanized free software actually works with EUC-JP only, and even
  less of it works correctly with UTF-8 than with Shift_JIS. You had
  better confirm that the Japanized free software you use can work
  with UTF-8.

  Therefore there are few cases where UTF-8 is actually used as the
  encoding method of a file system.

  In addition, although it is not directly related to Samba, the
  conversion table between Shift_JIS and Unicode differs subtly
  between the iconv() function, which is generally used on UNIX, and
  the functions used on other platforms such as Windows and Java, so
  you should be careful when handling UTF-8.

  Therefore using "UTF8" is not worth considering for Samba at present.
  Although Mac OS X uses UTF-8 as the encoding method for file names,
  it uses Unicode 3.1 as its character set instead of Unicode 2.0,
  which Samba assumes as the character set for "UTF8". With "UTF8" on
  Mac OS X, therefore, some characters are broken, so it is also not
  recommended at present.
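
  The byte values and the 1.5 times growth described at the beginning
  of this item can be checked with a short Python sketch. It uses
  Python's own "shift_jis" codec as a stand-in for Samba's conversion
  table, which is an assumption (as noted above, conversion tables
  differ subtly between platforms). The function name is hypothetical.

```python
def sjis_to_utf8(sjis_name: bytes) -> bytes:
    """Convert a Shift_JIS file name to UTF-8 via Unicode."""
    return sjis_name.decode("shift_jis").encode("utf-8")


sjis_part = b"\x8b\xa4\x97\x4c"          # the two Japanese characters
utf8_part = sjis_to_utf8(sjis_part)

print(utf8_part.hex())                   # e585b1e69c89
print(len(utf8_part) / len(sjis_part))   # 1.5
print(len(sjis_to_utf8(sjis_part + b".txt")))  # 10
```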

Notes for changing "coding system"
==================================

  When you change "coding system" after it has been set up, you also
  need to change the encoding of the Japanese file names that already
  exist on the file system.

  The easiest way is to back up the files to Windows before changing
  "coding system" and to restore them after the change.

  The archive of Samba Japanese Edition, which is described later in
  detail, contains a perl script named "smbchartool" that supports
  this work. With it, you can do the conversion simply on UNIX.
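
  To show what changing the encoding of an existing file name means,
  here is a minimal Python sketch that turns a name stored in "HEX"
  form into the name that "EUC" would use. It does not reproduce
  smbchartool: the function names are hypothetical, and Python's own
  codecs are assumed as a stand-in for Samba's conversion.

```python
import re

# ':' followed by two hex digits, as written by "coding system = HEX".
_HEX_ESCAPE = re.compile(rb":([0-9a-fA-F]{2})")


def hex_decode(unix_name: bytes) -> bytes:
    """Turn every ':xx' sequence back into the original byte."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), unix_name)


def hex_name_to_euc(unix_name: bytes) -> bytes:
    """Re-encode a "HEX" form file name as an EUC-JP file name."""
    return hex_decode(unix_name).decode("shift_jis").encode("euc_jp")


print(hex_decode(b":8b:a4:97:4c.txt"))       # b'\x8b\xa4\x97L.txt'
print(hex_name_to_euc(b":8b:a4:97:4c.txt"))  # b'\xb6\xa6\xcd\xad.txt'
```

  A real conversion would have to apply such a function to every file
  and directory name under the share and rename them, which is why
  taking a backup first is the safer way.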


HOWTO and Notes for including Japanese characters in smb.conf
=============================================================

  In Samba 2.0.7 and later, you can include Japanese (and some other
  languages') characters in smb.conf by setting the "coding system"
  and "client code page" parameters appropriately. You need to write
  Japanese characters in the encoding method set by the "coding
  system" parameter. For example, to create a Japanese section named
  0x8ba4 and 0x974c (a 4-byte Japanese character string meaning
  "share") under "coding system = HEX", you need to write as follows:

-----
[global]
    client code page = 932
    coding system = HEX
...

[:8b:a4:97:4c]
    path = /tmp

-----

  SUGJ (Samba Users Group Japan) has tested that Japanese strings can
  be included not only in share names (and file names) but also in
  these parameters:

  - the comment of the server (server string)
  - the comment of the share (comment)
  - user names in username map

  Although Japanese strings can be included in most parameters that
  take a string as their value, several problems have been found when
  Japanese is used there, so it is recommended not to use Japanese
  strings in other parameters.


Issues for using Japanese and Samba Japanese Edition
======================================================

  Apart from the issue of which encoding method to use, there are
  some problems in Japanese processing in Samba. The biggest one is
  that, for historical reasons, the Shift_JIS codes for some
  Windows-specific *characters* differ between the Windows 9x series
  (Windows 95/98/Me) and the Windows NT series (Windows NT/2000/XP).

  These codes must be treated as identical in Samba, but this is not
  implemented correctly in current Samba, so a Japanese file name
  written from Windows 9x sometimes cannot be read from Windows NT.

  Another problem is that "user defined characters" cannot be used
  in EUC-JP, which is generally used on UNIX. Although the use of
  "user defined characters" is decreasing with the spread of the
  Internet, they are still indispensable in some commercial or public
  systems, where many KANJI characters are required, for example to
  display names correctly.

  These characters are Windows-specific but are widely used as an
  industry standard, so handling them correctly is indispensable in
  business.

  Moreover, the Windows NT series treats some special KANJI
  characters, such as full-width Roman numerals (as well as the
  full-width alphabet, full-width Cyrillic and full-width Greek
  letters), as case-insensitive, the same as ASCII characters, but the
  Windows 9x series treats them as case-sensitive. The current
  implementation of Samba differs from both.

  In addition, some parts of the implementation do not expect
  Japanese characters, so using Japanese as you would on Windows may
  cause trouble in unexpected places.

  Samba Japanese Edition, developed by SUGJ, was created to solve
  these problems. In Samba 2.0.7-ja-2.2 or later, these problems are
  solved for file/directory names and share names. Regrettably, this
  work has not been fully merged into the original Samba because it
  modifies many places in the source code.

  In Samba Japanese Edition, Japanese processing is extended as
  follows:

  1. "Normalization of Shift_JIS"
    In Samba 2.0.7-ja-2.2 or later, the differences between the
    Windows 9x series and the Windows NT series in case recognition
    and in the codes for some characters are handled.

  2. "User defined characters" can be used with EUC-JP
    In Samba 2.0.7-ja-2.2 or later, "EUC3", a value original to Samba
    Japanese Edition, is available for the "coding system" parameter.
    It is based on eucJP-open, an industry standard for using "user
    defined characters" with EUC-JP.

  3. Implementation for UTF-8
    In Samba 2.0.7-ja-2.0 or later, "UTF8" is available for the
    "coding system" parameter, which allows file names to be written
    in UTF-8, based on Unicode 2.0. This has been merged into the
    Samba 2.2 series.

  4. Mac OS X issues
    In Samba 2.0.10-ja-1.0 or later, "UTF8-MAC" is available for the
    "coding system" parameter, which allows file names to be written
    in UTF-8 based on Unicode 3.1 and the "Normalization Form" that
    is necessary for writing to the file systems of Mac OS X.

  SUGJ is currently developing a Samba Japanese Edition based on the
  Samba 2.2 series, into which this work is merged.

  This work will also be merged into the HEAD branch, which will
  become the Samba 3.0 series.
