From: "Martin J. Dürst"
Date: 2010-11-29T20:02:02+09:00
Subject: [ruby-core:33461] Re: [ruby-cvs:37089] Ruby:r29896 (trunk): * string.c (rb_str_inspect): treat UTF-16 and UTF-32 as BE or LE.

This is a multi-part message in MIME format.
--------------040709010003050001070909
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit

Hello Yui,

On 2010/11/24 11:20, naruse@ruby-lang.org wrote:
> naruse 2010-11-24 11:20:11 +0900 (Wed, 24 Nov 2010)
>
>   New Revision: 29896
>
>   http://svn.ruby-lang.org/cgi-bin/viewvc.cgi?view=rev&revision=29896
>
>   Log:
>     * string.c (rb_str_inspect): treat UTF-16 and UTF-32 as BE or LE.
>
>   Modified files:
>     trunk/ChangeLog
>     trunk/string.c

As a result of this patch, I get the following:

(1) ruby -e 'puts "\uFF21\uFF22\uFF23".encode("UTF-16LE").force_encoding("UTF-16").inspect'
    => "\u21FF\u22FF\u23FF"

(2) ruby -e 'puts "\uFF21\uFF22\uFF23".encode("UTF-16BE").force_encoding("UTF-16").inspect'
    => "\u21FF\u22FF\u23FF"

(3) ruby -e 'puts "\uFF21\uFF22\uFF23".encode("UTF-32LE").force_encoding("UTF-32").inspect'
    => "\u{21FF0000}\u{22FF0000}\u{23FF0000}"

(4) ruby -e 'puts "\uFF21\uFF22\uFF23".encode("UTF-32BE").force_encoding("UTF-32").inspect'
    => "\uFF21\uFF22\uFF23"

(4) is right by chance, but the others are wrong.

I think the only sensible solution is to check the first two bytes
(or four for UTF-32) for a BOM, and otherwise output everything as a
BINARY string (with \x where necessary).

I started working on a patch for this, but I didn't manage to get \x
displayed; I still got some \u. I'm attaching this patch; I hope it
provides a start.

On a broader scale, I think that e.g. UTF-16 has essentially two usages:

1) Internally as an indication of UTF-16 in machine endianness.
2) Externally as an indication of "UTF-16BE or UTF-16LE", with the details
   depending on the circumstances:
   2a) The relevant RFC (http://tools.ietf.org/html/rfc2781) recommends
       interpretation as UTF-16BE if no BOM is present.
   2b) XML strictly requires a BOM.
   2c) Many Windows programs assume UTF-16LE if there's no BOM.

My main questions here are:

A) Which one of the above is the current Ruby implementation effort
   (the above patch and a few related ones) targeting?
B) How complete is that implementation (thought to be)?
C) What about other implementation needs?
D) What can we do to make sure users have at least a chance of
   understanding what "UTF-16" in Ruby is good for?

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

--------------040709010003050001070909
Content-Type: text/plain;
 name="diff_string.c_2001-11-29.txt"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="diff_string.c_2001-11-29.txt"

Index: string.c
===================================================================
--- string.c	(revision 29969)
+++ string.c	(working copy)
@@ -4214,10 +4214,20 @@
     p = RSTRING_PTR(str); pend = RSTRING_END(str);
     prev = p;
     if (enc == utf16) {
-        enc = *p == (char)0xFF ? rb_enc_find("UTF-16LE") : rb_enc_find("UTF-16BE");
+        if (*p==(char)0xFE && p[1]==(char)0xFF)
+            enc = rb_enc_find("UTF-16BE");
+        else if (*p==(char)0xFF && p[1]==(char)0xFE)
+            enc = rb_enc_find("UTF-16LE");
+        else
+            enc = rb_enc_find("ASCII-8BIT"), unicode_p = 0;
     }
     else if (enc == utf32) {
-        enc = *p == (char)0xFF ? rb_enc_find("UTF-32LE") : rb_enc_find("UTF-32BE");
+        if (*p==(char)0x00 && p[1]==(char)0x00 && p[2]==(char)0xFE && p[3]==(char)0xFF)
+            enc = rb_enc_find("UTF-32BE");
+        else if (*p==(char)0xFF && p[1]==(char)0xFE && p[2]==(char)0x00 && p[3]==(char)0x00)
+            enc = rb_enc_find("UTF-32LE");
+        else
+            enc = rb_enc_find("ASCII-8BIT"), unicode_p = 0;
     }
     while (p < pend) {
         unsigned int c, cc;

--------------040709010003050001070909--
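
For readers who want to try the BOM check outside of string.c, here is a
minimal Ruby sketch of the same decision: look at the first two bytes
(four for UTF-32), pick the BE or LE variant if a BOM is found, and fall
back to ASCII-8BIT otherwise. The names BOMS and resolve_by_bom are
hypothetical and exist only for this illustration; they are not part of
Ruby or of the patch, and the sketch assumes Ruby 2.0 or later for String#b.

```ruby
# Hypothetical lookup table (illustration only): BOM byte sequences for the
# endian-neutral labels "UTF-16" and "UTF-32". U+FEFF is FE FF in big-endian
# and FF FE in little-endian byte order.
BOMS = {
  "UTF-16" => { "\xFE\xFF".b         => "UTF-16BE",
                "\xFF\xFE".b         => "UTF-16LE" },
  "UTF-32" => { "\x00\x00\xFE\xFF".b => "UTF-32BE",
                "\xFF\xFE\x00\x00".b => "UTF-32LE" },
}.freeze

# Hypothetical helper: return the name of the concrete encoding implied by
# the leading BOM of +str+, or "ASCII-8BIT" when no BOM is present (the same
# fallback the patch uses).
def resolve_by_bom(str, label)
  raw = str.b  # binary copy, so the comparison is byte-wise
  BOMS.fetch(label).each do |bom, name|
    return name if raw.start_with?(bom)
  end
  "ASCII-8BIT"
end

with_bom    = "\uFEFF\uFF21".encode("UTF-16BE")  # bytes: FE FF FF 21
without_bom = "\uFF21".encode("UTF-16LE")        # bytes: 21 FF

puts resolve_by_bom(with_bom, "UTF-16")     # prints UTF-16BE
puts resolve_by_bom(without_bom, "UTF-16")  # prints ASCII-8BIT
```

The label is passed in explicitly because FF FE is both the UTF-16LE BOM
and the start of the UTF-32LE BOM, so the bytes alone do not say which
family is meant.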