Java i18n
Trivia: in “i18n”, 18 is the number of letters between the first “i” and the last “n” in “internationalization”.
Basics:
Most traditional character encoding standards are 8-bit and can represent only 256 different characters. During internationalization this becomes a bottleneck, since 256 characters cannot accommodate every character in use. The solution to this problem is the adoption of a universal character encoding: Unicode.
Unicode is
a) an industry standard designed
b) to bring together texts and symbols from all the writing systems of the world by
c) providing a unique number (not glyph) for every character.
This means that Unicode represents each character as a number, and the underlying application renders the character (symbol, font, size, or shape) with some rendering/mapping algorithm.
There are several possible representations of Unicode data, defined by the Unicode Transformation Formats (UTF). A UTF is an algorithmic mapping from every Unicode code point to a unique byte sequence.
There are various UTF algorithms available, such as UTF-8, UTF-16, and UTF-32, but the preferred character encoding in web environments is UTF-8, which is
a) a variable-length character encoding able to represent any character in the Unicode standard, yet
b) backward compatible with ASCII: the first 128 characters are encoded as the same single bytes. It is not byte-compatible with Latin-1, because characters greater than 127 are encoded differently.
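The variable-length behaviour described above can be checked with a few lines of Java. This is a minimal sketch; the class and method names (Utf8Demo, utf8Length, latin1Length) are made up for illustration, and the non-ASCII characters are written as Unicode escapes so the source file itself stays pure ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: counts the bytes one string occupies in different encodings.
class Utf8Demo {

    // Number of bytes needed to store the text in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to store the text in Latin-1 (ISO-8859-1).
    static int latin1Length(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String ascii = "A";          // U+0041, inside the ASCII range
        String accented = "\u00e9";  // U+00E9, e with acute accent, above 127
        String euro = "\u20ac";      // U+20AC, euro sign, not available in Latin-1

        System.out.println(utf8Length(ascii));      // 1 byte, same as ASCII
        System.out.println(utf8Length(accented));   // 2 bytes in UTF-8
        System.out.println(latin1Length(accented)); // 1 byte in Latin-1
        System.out.println(utf8Length(euro));       // 3 bytes in UTF-8
    }
}
```

The accented character shows the Latin-1 difference: it is a single byte in Latin-1 but two bytes in UTF-8, so bytes written in one encoding and read in the other come out garbled.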
Enabling full internationalization in a typical Java web system:
A typical flow looks something like this:
Client <–> Internet <–> Web Server <–> Application Server <–> DBMS
Let’s look at each layer and enable i18n for it.
Client: Web browsers (like Internet Explorer, Mozilla Firefox, Safari, and Opera) represent the client side of a web application. The best way to tell a browser about UTF-8 encoding is by putting the character-set information in the HTTP response header:
Server: Apache-Coyote/1.1
Pragma: No-cache
Cache-Control: no-cache
Expires: <date>
Content-Type: text/html;charset=utf-8
Transfer-Encoding: chunked
Date: <date>
Web Server:
1. Most web servers use the encoding of the operating system, defined in the system property file.encoding.
This property is usually defined as
a) ISO-8859-1 in Unix-based systems or
b) Cp1252 in Windows systems.
To ensure UTF-8 support, the file.encoding property has to be redefined during system startup.
2. Apache2 on Windows NT uses UTF-8 for all filename encodings. For Tomcat/JBoss, the recommendation is to change the startup script (run/catalina) and add the switch
-Dfile.encoding=UTF-8
to the JVM startup call, so that the HTTP response encoding defaults to UTF-8. This can still be overridden within the Java servlet code as needed.
3. Static HTML documents should include the following at the top of the <head> section:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
4. A JavaScript block or file reference should include the charset attribute:
<script src="scriptFile.js" type="text/javascript" charset="utf-8"></script>
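To see what the file.encoding setting from items 1 and 2 affects, the JVM's default charset can be printed at runtime. The sketch below is illustrative only (the class name DefaultCharsetDemo and the helper decodeUtf8 are assumptions, not part of any server); it also shows that passing the charset explicitly makes the result independent of the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares the JVM default charset with an explicit one.
class DefaultCharsetDemo {

    // Decode bytes with an explicitly named charset, independent of file.encoding.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The JVM-wide default; starting the JVM with -Dfile.encoding=UTF-8 influences it.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Encode and decode with an explicit charset: same result on every platform.
        byte[] utf8Bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(utf8Bytes).equals("\u00e9")); // prints true
    }
}
```

Running it once with and once without -Dfile.encoding=UTF-8 is a quick way to check whether a startup script change actually took effect.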
Application Server: Application servers are programs that sit between web server and backend business applications or databases.
1. Java files do not require any UTF-8 configuration, whereas JSP files enable UTF-8 encoding by placing a page directive at the top of the file that includes the pageEncoding and contentType attributes:
<%@ page contentType="text/html;charset=utf-8" pageEncoding="utf-8" %>
This page directive should be used in all JSP files that are included with the <jsp:include> tag (not the <%@ include %> page directive).
2. Moreover, if a JSP file contains an (X)HTML <head> tag, it should also include the UTF-8 meta tag:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
3. When the sendRedirect() method is used, query string parameters should be encoded with the java.net.URLEncoder.encode() method.
4. HTML forms should include the charset attribute:
<form action="processData.jsp" method="post" enctype="multipart/form-data; charset=utf-8">
……..
</form>
The form above submits its data in UTF-8. A filter must also be implemented to set the character encoding on the request before the form parameters are read, and on the response before any output is written:
request.setCharacterEncoding("UTF-8");
response.setContentType("text/html; charset=UTF-8");
5. When a request is submitted through JavaScript using the form’s “GET” method, multilingual query string parameters should be encoded with the JavaScript encodeURI method, and so should the URLs in all standard HTML hyperlink tags <a href="">.
6. Java dictionary files (message bundles) are key-value maps used to look up internationalized data. They do not provide a mechanism for indicating the encoding.
Therefore they have to be encoded manually. Java comes with a native2ascii converter which takes an -encoding switch to indicate the encoding of the file, the name of the source file and the name of the target file:
native2ascii -encoding UTF-8 SourceFile TargetFile
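Two of the Java-side steps in the list above, URL-encoding redirect parameters (item 3) and escaping message bundles (item 6), can be sketched in plain Java. The class I18nHelpers and its method names are hypothetical; the escaping method only approximates what native2ascii produces and is not a replacement for the tool.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class for illustration only.
class I18nHelpers {

    // Percent-encode one query string value as UTF-8 before handing the URL to sendRedirect().
    static String encodeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Replace each non-ASCII char with a backslash-u hex escape, similar to native2ascii output.
    static String toAsciiEscapes(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);                                  // ASCII passes through unchanged
            } else {
                out.append(String.format("\\u%04x", (int) c));  // four hex digits per char
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // search.jsp and the parameter name q are made-up examples.
        System.out.println("search.jsp?q=" + encodeQueryValue("caf\u00e9 au lait"));
        System.out.println(toAsciiEscapes("greeting=caf\u00e9"));
    }
}
```

Note that URLEncoder follows HTML form encoding rules, so a space becomes "+" rather than "%20". The escaped bundle line can be read back by java.util.Properties, which decodes the escapes when loading.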
Database: Database management systems (DBMS) require character-set information when a new database or table is created.
1. For databases that don’t use UTF-8 by default, the default character set has to be defined, for example in MySQL’s configuration file (my.ini):
default-character-set=utf8
2. Database drivers usually require extra configuration, for example when connecting to a MySQL database using a Java Database Connectivity (JDBC) driver:
Connection db = DriverManager.getConnection("jdbc:mysql://localhost/myDatabase?useUnicode=true&characterEncoding=utf-8", "username", "password");
