Heads up: what you need to know about Unicode I/O in GHC 6.12.1

The GHC 6.12.1 release candidate will be out shortly, and it includes a newly rewritten I/O library with Unicode support. Here’s what you need to know to make sure your applications and libraries continue to work with GHC 6.12.1.

We expect the release candidate phase to last a couple of weeks or so, depending on how many problems arise, after which 6.12.1 will be released.  However, 6.12 is not currently scheduled to become part of the Haskell Platform until the next platform release, due around February 2010, so package authors have a grace period for testing before 6.12.1 becomes more widely used.

The new System.IO docs describe the API in full, including the Unicode-related functionality.

Console and text I/O

If you are reading from or writing to the console, or reading/writing text files in the local encoding, then use the System.IO functions for text I/O (openFile, readFile, hGetContents, putStr, etc.), and you will automatically benefit from the new Unicode support. Text written will be encoded according to the current locale (or code page on Windows), and text read will be decoded accordingly.

If you need to use a particular encoding (e.g. UTF-8), then the hSetEncoding function lets you set the encoding on a Handle, for example:

  hSetEncoding stdout utf8
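
The same call works on file Handles. As a minimal sketch, assuming you want a file to be UTF-8 regardless of the locale, something like the following should do; writeUtf8 and readUtf8 are illustrative names, not part of System.IO.

```haskell
import System.IO

-- Write a String to a file as UTF-8, regardless of the current locale.
-- (writeUtf8 is a hypothetical helper, not a System.IO function.)
writeUtf8 :: FilePath -> String -> IO ()
writeUtf8 path str =
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf8   -- override the locale encoding for this Handle only
    hPutStr h str

-- Read a UTF-8 encoded file back into a String.
-- (readUtf8 is likewise a hypothetical helper.)
readUtf8 :: FilePath -> IO String
readUtf8 path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h utf8
    contents <- hGetContents h
    -- hGetContents is lazy, so force the whole String before withFile
    -- closes the Handle.
    length contents `seq` return contents
```

Setting the encoding per Handle leaves stdin/stdout and any other Handles on the locale encoding.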

Binary I/O

If you’re reading or writing binary data, or for some other reason you want to bypass the Unicode encoding/decoding that the IO library now does, you have two options:

  • Use openBinaryFile or hSetBinaryMode to put the Handle into binary mode. No encoding/decoding or newline translation will be done.
  • Use hGetBuf/hPutBuf, or the I/O operations provided by Data.ByteString, which all operate with binary data.
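
A minimal sketch of both options, assuming the bytestring package that ships with GHC; writeBytes and readBytes are illustrative names only.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)
import System.IO

-- Write raw bytes. Data.ByteString I/O never encodes, decodes, or
-- translates newlines. (writeBytes is a hypothetical helper.)
writeBytes :: FilePath -> [Word8] -> IO ()
writeBytes path = B.writeFile path . B.pack

-- Read raw bytes through a Handle that was opened in binary mode.
-- (readBytes is a hypothetical helper.)
readBytes :: FilePath -> IO [Word8]
readBytes path =
  withBinaryFile path ReadMode $ \h -> do
    bytes <- B.hGetContents h   -- strict: reads everything before returning
    return (B.unpack bytes)
```

For a Handle you did not open yourself, such as stdin or stdout, hSetBinaryMode handle True has the same effect as opening it in binary mode.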

Using utf8-string

If you’re using utf8-string in certain ways then you might get incorrect results.

  • The operations in System.IO.UTF8 add a UTF-8 wrapper around the corresponding System.IO operation. Unless the underlying Handle is in binary mode, these operations will result in garbage being read or written, because the text is encoded or decoded twice. For example, if you want to use System.IO.UTF8.print, then call hSetBinaryMode stdout True first. Better still, just use System.IO.print directly. If you need to fix the encoding to UTF-8 rather than using the locale encoding, then call hSetEncoding handle utf8.
  • The rest of the operations in utf8-string will continue to work as before.

Newline handling

There is a new API for newline translation in System.IO.  By default, Handles in text mode translate newlines to or from the native representation for the current platform, that is “\r\n” on Windows and “\n” on other platforms.  You can change this default using hSetNewlineMode, for example to be able to read a file with either Windows or Unix line-ending conventions:

  hSetNewlineMode handle universalNewlineMode

where universalNewlineMode translates from “\r\n” to “\n” on input, leaving “\n” alone, and translates “\n” to the native newline representation on output.
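
As a rough sketch of how this might look in practice, the helpers below read a file with mixed line endings and write one with no translation at all. The function names are made up for illustration; hSetNewlineMode, universalNewlineMode and noNewlineTranslation are the System.IO names.

```haskell
import System.IO

-- Read a text file, accepting either "\r\n" or "\n" line endings.
-- (readLinesUniversal is a hypothetical helper.)
readLinesUniversal :: FilePath -> IO [String]
readLinesUniversal path =
  withFile path ReadMode $ \h -> do
    hSetNewlineMode h universalNewlineMode
    contents <- hGetContents h
    -- Force the lazy String before withFile closes the Handle.
    length contents `seq` return (lines contents)

-- Write text exactly as given, with no newline translation on any platform.
-- (writeNoTranslation is a hypothetical helper.)
writeNoTranslation :: FilePath -> String -> IO ()
writeNoTranslation path str =
  withFile path WriteMode $ \h -> do
    hSetNewlineMode h noNewlineTranslation
    hPutStr h str
```

The Handle stays in text mode throughout, so encoding and decoding still follow the locale; only the newline handling changes.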


6 Responses to Heads up: what you need to know about Unicode I/O in GHC 6.12.1

  1. design says:

    yea, but can we replace with an actual lambda?

  2. guest says:

    Well, it’s nice that System.IO is finally UTF8-aware, but it seems it came at a cost of _breaking_ existing code (commonly used library will silently corrupt data).

    So what is the recommended way to write unicode IO so that it works both in 6.10 and in 6.12?

    Should we add cpp and conditional compilation to every file just to import the right IO (System.IO in 6.12 and System.IO.UTF8 in previous versions)? Are there plans to update utf8-string so it just does not break existing code on 6.12?

  3. simonmar says:

    Firstly, let’s be clear that Unicode /= UTF-8. The new IO library in 6.12.1 does I/O using the locale encoding by default, which may or may not be UTF-8. There was no good way to do this with 6.10 – you could use the iconv package, but that doesn’t support Windows, and only has a ByteString API.

    If you have code that works with 6.10 and is using one of the UTF-8 libraries, then it should be fairly easy to ensure it continues to work with 6.12. Make sure that you are using either ByteStrings to do I/O, or that the Handle is in binary mode (using hSetBinaryMode or openBinaryFile). If you’re using the utf8-string API exclusively, then you are already using openBinaryFile, so that’s fine. Just be careful if you use utf8-string with stdin/stdout: call hSetBinaryMode first. There’s no good way to add this to utf8-string, because there’s nowhere to add the call (you don’t want to do it for every operation).

    Alternatively, as you say you could use CPP to select between utf8-string and the native System.IO. I wouldn’t recommend doing that, because it gives you two versions of the code to test.

    Be careful with FilePaths. On Windows they are interpreted as Unicode, on Unix they are interpreted as [Word8], by taking the low 8 bits of each Char. So if you always encode FilePaths to UTF-8, that will break on Windows. Fixing FilePaths is a high priority.

  4. guest says:

    Thanks for writing a detailed reply, simonmar!

    > Firstly, let’s be clear that Unicode /= UTF-8. The new IO library in 6.12.1 does I/O using the locale encoding by default,

    That’s the really good news!

    > Make sure that you are using either ByteStrings to do I/O, or that the Handle is in binary mode … Just be careful if you use utf8-string with stdin/stdout: call hSetBinaryMode first.

    Well, this is how I used utf8-string actually (to replace standard IO). As “hSetBinaryMode stdin True” seems to work with System.IO.UTF8.getContents in 6.10, this is probably the way to keep old scripts compatible with both 6.10 and 6.12 without changing those scripts significantly.

    > Be careful with FilePaths. … Fixing FilePaths is a high priority.

    OK. Always better to know before. I just hope it gets fixed by February 2010.

  5. Maurício says:

    What about String -> CString conversion?
