From: "Martin J. Dürst" <duerst@...>
Date: 2010-11-29T20:02:02+09:00
Subject: [ruby-core:33461] Re: [ruby-cvs:37089] Ruby:r29896 (trunk): * string.c (rb_str_inspect): treat UTF-16 and UTF-32 as BE or LE.

This is a multi-part message in MIME format.
--------------040709010003050001070909
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit

Hello Yui,

On 2010/11/24 11:20, naruse@ruby-lang.org wrote:
> naruse	2010-11-24 11:20:11 +0900 (Wed, 24 Nov 2010)
>
>    New Revision: 29896
>
>    http://svn.ruby-lang.org/cgi-bin/viewvc.cgi?view=rev&revision=29896
>
>    Log:
>      * string.c (rb_str_inspect): treat UTF-16 and UTF-32 as BE or LE.
>
>    Modified files:
>      trunk/ChangeLog
>      trunk/string.c

As a result of this patch, I get the following:

(1)
ruby -e 'puts "\uFF21\uFF22\uFF23".
   encode("UTF-16LE").force_encoding("UTF-16").inspect'

=> "\u21FF\u22FF\u23FF"

(2)
ruby -e 'puts "\uFF21\uFF22\uFF23".
   encode("UTF-16BE").force_encoding("UTF-16").inspect'

=> "\u21FF\u22FF\u23FF"

(3)
ruby -e 'puts "\uFF21\uFF22\uFF23".
   encode("UTF-32LE").force_encoding("UTF-32").inspect'

=> "\u{21FF0000}\u{22FF0000}\u{23FF0000}"

(4)
ruby -e 'puts "\uFF21\uFF22\uFF23".
   encode("UTF-32BE").force_encoding("UTF-32").inspect'

=> "\uFF21\uFF22\uFF23"

(4) is right by chance, but the others are wrong.
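
These results follow from the first-byte test in r29896 (the line that the
attached diff removes): a leading 0xFF selects the LE variant, anything
else selects BE. Here is a rough Ruby model of that test, for illustration
only; `guess_r29896` is a made-up name, not a function in string.c:

```ruby
# Hypothetical model of the r29896 heuristic: only the first byte is examined.
def guess_r29896(str, family)
  suffix = str.getbyte(0) == 0xFF ? "LE" : "BE"
  str.dup.force_encoding("#{family}#{suffix}")
end

src = "\uFF21\uFF22\uFF23"
from_le = guess_r29896(src.encode("UTF-16LE").force_encoding("UTF-16"), "UTF-16")
from_be = guess_r29896(src.encode("UTF-16BE").force_encoding("UTF-16"), "UTF-16")

# LE data starts with 0x21, so it is read as BE; BE data starts with 0xFF,
# so it is read as LE. Both guesses are therefore byte-swapped.
puts from_le.codepoints.map { |c| format("U+%04X", c) }.join(" ")
puts from_be.codepoints.map { |c| format("U+%04X", c) }.join(" ")
```

Both lines print U+21FF U+22FF U+23FF, matching (1) and (2). The same
model explains why (4) comes out right: UTF-32BE data for these characters
starts with 0x00, which happens to select BE.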

I think the only sensible solution is to check the first two bytes (or 
four for UTF-32) for a BOM and, if there is none, to output everything as 
a BINARY string (with \x where necessary). I started working on a patch 
for this, but I didn't manage to display \x; I still got some \u. I'm 
attaching the patch in the hope that it provides a start.
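
To make the intended behaviour concrete, here is a minimal Ruby sketch of
the same idea (not the C patch itself). The names `BOMS` and
`resolve_by_bom` are invented for this example; the BOM is U+FEFF, i.e.
the bytes FE FF in big-endian order and FF FE in little-endian order:

```ruby
# Hypothetical lookup table: BOM bytes for each family and byte order.
BOMS = {
  "UTF-16" => { "\xFE\xFF".b => "UTF-16BE",
                "\xFF\xFE".b => "UTF-16LE" },
  "UTF-32" => { "\x00\x00\xFE\xFF".b => "UTF-32BE",
                "\xFF\xFE\x00\x00".b => "UTF-32LE" },
}

# Hypothetical helper: choose a concrete encoding from the BOM, or fall
# back to a binary (ASCII-8BIT) copy when no BOM is present.
def resolve_by_bom(str, family)
  bytes = str.b
  BOMS.fetch(family).each do |bom, name|
    return str.dup.force_encoding(name) if bytes.start_with?(bom)
  end
  bytes
end

with_bom = "\uFEFF\uFF21".encode("UTF-16LE").force_encoding("UTF-16")
no_bom   = "\uFF21".encode("UTF-16LE").force_encoding("UTF-16")
p resolve_by_bom(with_bom, "UTF-16").encoding  # the LE variant
p resolve_by_bom(no_bom, "UTF-16").encoding    # binary fallback
```

With this approach a string without a BOM is never guessed at; inspect
would show its raw bytes instead of possibly byte-swapped \u escapes.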

On a broader scale, I think that e.g. UTF-16 has essentially two usages:
1) Internally as an indication of UTF-16 in machine endianness.
2) Externally as an indication of "UTF-16BE or UTF-16LE", with the details
    depending on the circumstances:
    2a) The relevant RFC (http://tools.ietf.org/html/rfc2781) recommends
        interpretation as UTF-16BE if no BOM is present.
    2b) XML strictly requires a BOM.
    2c) Many Windows programs assume UTF-16LE if there's no BOM.
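
As an illustration of how 2a) to 2c) differ only in what happens when the
BOM is missing, here is a small Ruby sketch. The function name
`utf16_variant` and the policy symbols are invented for this example and
do not correspond to anything in Ruby itself:

```ruby
# Hypothetical sketch: resolve external "UTF-16" data to a concrete variant.
# A BOM always wins; the policy only matters when there is no BOM.
def utf16_variant(str, policy)
  head = str.b[0, 2]
  return "UTF-16BE" if head == "\xFE\xFF".b
  return "UTF-16LE" if head == "\xFF\xFE".b

  case policy
  when :rfc2781 then "UTF-16BE"                          # 2a
  when :xml     then raise ArgumentError, "BOM required" # 2b
  when :windows then "UTF-16LE"                          # 2c
  else raise ArgumentError, "unknown policy: #{policy.inspect}"
  end
end

data = "\uFF21".encode("UTF-16LE")   # no BOM
puts utf16_variant(data, :rfc2781)
puts utf16_variant(data, :windows)
```

The same bytes are read as big-endian under the RFC's recommendation and
as little-endian under the Windows convention, which is why the choice of
target in question A) below matters.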

My main questions here are:
A) Which one of the above is the current Ruby implementation effort
   (the above patch and a few related ones) targeting?
B) How complete is that implementation (thought to be)?
C) What about other implementation needs?
D) What can we do to make sure users have at least a chance of
    understanding what "UTF-16" in Ruby is good for?

Regards,    Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

--------------040709010003050001070909
Content-Type: text/plain;
 name="diff_string.c_2001-11-29.txt"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="diff_string.c_2001-11-29.txt"

Index: string.c
===================================================================
--- string.c	(revision 29969)
+++ string.c	(working copy)
@@ -4214,10 +4214,20 @@
     p = RSTRING_PTR(str); pend = RSTRING_END(str);
     prev = p;
     if (enc == utf16) {
-	enc = *p == (char)0xFF ? rb_enc_find("UTF-16LE") : rb_enc_find("UTF-16BE");
+	if (*p==(char)0xFE && p[1]==(char)0xFF)
+	    enc = rb_enc_find("UTF-16BE");
+        else if (*p==(char)0xFF && p[1]==(char)0xFE)
+	    enc = rb_enc_find("UTF-16LE");
+	else
+	    enc = rb_enc_find("ASCII-8BIT"), unicode_p = 0;
     }
     else if (enc == utf32) {
-	enc = *p == (char)0xFF ? rb_enc_find("UTF-32LE") : rb_enc_find("UTF-32BE");
+	if (*p==(char)0x00 && p[1]==(char)0x00 && p[2]==(char)0xFE && p[3]==(char)0xFF)
+	    enc = rb_enc_find("UTF-32BE");
+        else if (*p==(char)0xFF && p[1]==(char)0xFE && p[2]==(char)0x00 && p[3]==(char)0x00)
+	    enc = rb_enc_find("UTF-32LE");
+	else
+	    enc = rb_enc_find("ASCII-8BIT"), unicode_p = 0;
     }
     while (p < pend) {
 	unsigned int c, cc;

--------------040709010003050001070909--