Java to the Rescue: Part 1
Java deals with this problem in an effective way -- it coerces you rather firmly into using Unicode for all character strings. C++ developers have a choice between narrow strings and wide strings -- which aren't necessarily Unicode. For better or worse, Java eliminates that choice.
The nice part about this is that in Unicode, U+00F1 is always the ñ character, and the lowercase rho, ρ, is always U+03C1. Even if I don't have a rho character on my keyboard, I know that I can use Java's escaping mechanism to represent it as "\u03C1", inconvenient as that may be.
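A minimal sketch of that escape mechanism in action (the class name and string contents are my own example):

```java
public class EscapeDemo {
    public static void main(String[] args) {
        // \u00F1 is ñ; \u03C1 is the lowercase Greek rho.
        // The compiler processes these escapes before lexing,
        // so the string literally contains those characters.
        String spanish = "ma\u00F1ana";
        String rho = "\u03C1";

        System.out.println(spanish);  // prints "mañana" if the console encoding allows it
        System.out.println(spanish.length());  // 6 -- ñ is a single char
        System.out.println(rho.equals("ρ"));   // true
    }
}
```

Note that the escape stands for one Java char, not a byte sequence, which is why the length of "ma\u00F1ana" is six.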
This is a nice feature, because it means that at least internally, a Java string is a string is a string. You don't have to worry about what character set it is from -- it's Unicode.
It's Worse than You Think
In a perfect world, Java's insistence on Unicode would spill over to every file system, network packet type, and so on, and everything would be fine. But unfortunately, there are still billions of web browsers in the world configured to read text from an ISO-8859-X character set. And when our attention turns to Asia, things get even worse, for two reasons.
- China, Japan, and Korea have character sets composed of thousands of ideographs. To compound this problem, there are competing character sets used to create Chinese web pages: Taiwan tends to use the Traditional set (Big5), while the PRC uses the Simplified set (GB2312).
- These character sets don't fit in a single byte, and accordingly must be encoded in order to be written into byte-oriented files and networks. Unicode is most commonly encoded as UTF-8, in which a character in the 16-bit range is encoded as one, two, or three bytes (characters beyond that range take four). The Chinese, Japanese, and Korean character sets use different encoding schemes, usually a row/column value encoded as two bytes.
Naturally, the different character sets that I've mentioned here are incompatible with one another. Needless to say, the encoding schemes are incompatible as well.
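The variable-length nature of these encodings is easy to observe from Java. This sketch counts the bytes a few characters produce under different encodings (the encoding names "UTF-8" and "Big5" come from Java's Supported Encodings list; the sample characters are my own):

```java
import java.io.UnsupportedEncodingException;

public class EncodingSizes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // One Unicode character can become a different number of bytes
        // depending on both the character and the encoding.
        System.out.println("A".getBytes("UTF-8").length);       // 1 -- plain ASCII
        System.out.println("\u00F1".getBytes("UTF-8").length);  // 2 -- ñ
        System.out.println("\u4E2D".getBytes("UTF-8").length);  // 3 -- the ideograph 中
        System.out.println("\u4E2D".getBytes("Big5").length);   // 2 -- a row/column pair
    }
}
```

The same ideograph takes three bytes in UTF-8 but two in Big5, which is exactly why a byte stream is meaningless until you know which encoding produced it.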
Simply storing your data internally as Unicode doesn't solve the problem of incompatible character sets and encodings. But the good news is that Java has built-in library support for converting to and from these encodings any time you convert to or from bytes during an I/O operation.
Both the OutputStreamWriter and InputStreamReader classes have two constructors: one that takes just a reference to a stream object, and a second that takes both a stream object and the name of an encoding.
If you search through the Java docs for "Supported Encodings", you'll see that Java has built-in support for a huge library of character sets and encodings. Converting one of these to or from Unicode is simply a matter of instantiating a class with the correct encoding parameter.
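Here is a sketch of that round trip, using in-memory streams so it is self-contained (the class name and sample text are my own): the OutputStreamWriter converts Unicode characters to Big5 bytes on the way out, and the InputStreamReader converts the bytes back to Unicode on the way in.

```java
import java.io.*;

public class RoundTrip {
    public static void main(String[] args) throws IOException {
        String text = "\u4E2D\u6587";  // two ideographs, 中文

        // Writing: the second constructor argument names the encoding,
        // so the writer emits Big5 bytes rather than the platform default.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer out = new OutputStreamWriter(bytes, "Big5");
        out.write(text);
        out.close();

        // Reading: the reader decodes the Big5 bytes back into
        // Unicode characters as we consume the stream.
        Reader in = new InputStreamReader(
                new ByteArrayInputStream(bytes.toByteArray()), "Big5");
        StringBuilder back = new StringBuilder();
        for (int c; (c = in.read()) != -1; ) {
            back.append((char) c);
        }
        in.close();

        System.out.println(back.toString().equals(text));  // true
    }
}
```

Swapping "Big5" for any other name in the Supported Encodings list changes the bytes on the wire without touching the rest of the code.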