Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Java: Good World Citizen


Java to the Rescue: Part 1

Java deals with this problem effectively: it coerces you rather firmly into using Unicode for all character strings. C++ developers have a choice between narrow strings and wide strings -- which aren't necessarily Unicode. For better or worse, Java eliminates that choice.

The nice part about this is that in Unicode, U+00F1 is always the ñ character, and the lowercase rho, ρ, is always U+03C1. Even if I don't have a rho character on my keyboard, I know that I can use Java's escaping mechanism to represent it as "\u03C1", inconvenient as that may be.
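As a small illustration of the escaping mechanism, the sketch below spells both characters with Unicode escapes and prints their code points. The class name `UnicodeEscapes` and the helper `describe` are made up for this example; the code point used for lowercase rho is U+03C1, as listed in the Unicode Greek chart.

```java
class UnicodeEscapes {
    // Escaped spellings of the two characters discussed in the text.
    static final String N_TILDE = "\u00F1"; // LATIN SMALL LETTER N WITH TILDE
    static final String RHO = "\u03C1";     // GREEK SMALL LETTER RHO

    // Formats the first code point of s in the conventional U+XXXX notation.
    static String describe(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(describe(N_TILDE)); // U+00F1
        System.out.println(describe(RHO));     // U+03C1
        // Each escape produces exactly one 16-bit char in the string.
        System.out.println(N_TILDE.length() + RHO.length()); // 2
    }
}
```

Because the compiler translates the escapes before it reads the string literal, the source file itself can stay pure ASCII no matter which characters the program handles.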

This is a nice feature, because it means that at least internally, a Java string is a string is a string. You don't have to worry about what character set it is from -- it's Unicode.

It's Worse than You Think

In a perfect world, Java's insistence on Unicode would spill over to every file system, network packet type, and so on, and everything would be fine. But unfortunately, there are still billions of web browsers in the world configured to read text from an ISO-8859-X character set. And when our attention turns to Asia, things get even worse, for two reasons.

  • China, Japan, and Korea have character sets composed of thousands of ideographs. To compound this problem, there are competing character sets used to create Chinese web pages. Taiwan and the PRC tend to use two different character sets: Traditional Chinese (commonly encoded as Big5) and Simplified Chinese (commonly encoded as GB2312), respectively.
  • These character sets don't fit in a single byte, and accordingly must be encoded in order to be written into byte-oriented files and networks. Unicode is most commonly encoded as UTF-8, in which a single 16-bit character is encoded as one, two, or three bytes. Other multi-byte character sets, including the Chinese, Japanese, and Korean ones, use different encoding schemes, usually a row/column value encoded as two bytes.
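To make the one, two, or three byte claim concrete, here is a minimal sketch that counts the UTF-8 bytes produced for a few sample characters. The class and method names are illustrative only. The last case goes slightly beyond the text above: a character outside the 16-bit range is held in Java as a surrogate pair and takes four bytes in UTF-8.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes needed to store s in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));      // 1 (ASCII range)
        System.out.println(utf8Length("\u00F1")); // 2 (n with tilde)
        System.out.println(utf8Length("\u4E2D")); // 3 (a CJK ideograph)

        // U+1F600 lies beyond the 16-bit range, so Java stores it as two chars.
        String beyond16Bits = new String(Character.toChars(0x1F600));
        System.out.println(beyond16Bits.length());    // 2
        System.out.println(utf8Length(beyond16Bits)); // 4
    }
}
```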

Naturally, the different character sets that I've mentioned here are incompatible with one another. Needless to say, the encoding schemes are incompatible as well.
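A quick way to see the incompatibility is to encode a character with one scheme and decode the bytes with another. The sketch below (class and helper names are hypothetical) writes ñ as UTF-8 and then misreads the result as ISO-8859-1; it prints code points rather than the characters themselves so the output does not depend on the console's own encoding.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatch {
    // Decodes bytes as ISO-8859-1, where every byte maps to exactly one char.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Lists the chars of s in U+XXXX form, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00F1".getBytes(StandardCharsets.UTF_8); // bytes C3 B1

        String right = new String(utf8, StandardCharsets.UTF_8);
        String wrong = decodeAsLatin1(utf8);

        System.out.println(codePoints(right)); // U+00F1
        System.out.println(codePoints(wrong)); // U+00C3 U+00B1
    }
}
```

The bytes are identical in both cases; only the decoder's assumption differs, and the wrong assumption turns one character into two unrelated ones.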

Simply storing your data internally as Unicode doesn't solve the problem of incompatible character sets and encodings. The good news is that Java has built-in library support for converting to and from these encodings any time you convert to or from bytes during an I/O operation.

Both the OutputStreamWriter and InputStreamReader classes have two constructors of interest here: one that takes just a reference to a stream object and falls back on the default encoding, and a second that takes both a stream object and an encoding parameter.
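The following is a minimal sketch of the two-argument constructors in use, round-tripping a string through an in-memory byte stream. The class `EncodedStreams` and its `write` and `read` helpers are invented for this example; they wrap the checked IOException only to keep the sketch short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

class EncodedStreams {
    // Encodes text into bytes using the named encoding.
    static byte[] write(String text, String encoding) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, encoding)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decodes bytes back into a string using the named encoding.
    static String read(byte[] bytes, String encoding) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), encoding)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "ma\u00F1ana";
        System.out.println(write(text, "ISO-8859-1").length); // 6
        System.out.println(write(text, "UTF-8").length);      // 7
        System.out.println(read(write(text, "UTF-8"), "UTF-8").equals(text)); // true
    }
}
```

Passing the encoding explicitly on both sides keeps the result independent of whatever default the machine running the code happens to use.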

If you search through the Java docs for "Supported Encodings", you'll see that Java has built-in support for a huge library of character sets and encodings. Converting one of these to or from Unicode is simply a matter of instantiating a class with the correct encoding parameter.
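Besides reading the documentation, you can ask the runtime which encodings it actually provides. The sketch below uses `java.nio.charset.Charset` for that; the class name and the `supported` helper are illustrative. Which encodings beyond the small guaranteed set (UTF-8, ISO-8859-1, and a few others) are present depends on the Java installation, so the Big5 and GB2312 results are not assumed here.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class ListEncodings {
    // True if this Java runtime can convert to and from the named encoding.
    static boolean supported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // the name itself is not syntactically valid
        }
    }

    // Number of encodings the runtime reports as available.
    static int availableCount() {
        return Charset.availableCharsets().size();
    }

    public static void main(String[] args) {
        System.out.println("Available encodings: " + availableCount());
        // Results for the two Chinese encodings vary by installation.
        for (String name : new String[] {"UTF-8", "ISO-8859-1", "Big5", "GB2312"}) {
            System.out.println(name + " supported: " + supported(name));
        }
    }
}
```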

