From: "kennygrant (Kenny Grant)" <kennygrant@...>
Date: 2012-11-09T19:30:32+09:00
Subject: [ruby-core:49157] [ruby-trunk - Bug #7267] Dir.glob on Mac OS X returns unexpected string encodings for unicode file names


Issue #7267 has been updated by kennygrant (Kenny Grant).


Thanks for the comments on this issue. I'm not clear on what the UTF8-MAC encoding represents; are there docs somewhere on this Ruby behaviour and the problems involved?

At present Dir.glob has inconsistent behaviour even working with the same file/filesystem:

It may return a filename marked UTF-8 which is NFD, or NFC, depending on the glob pattern you call it with (see writer.rb attachment to this issue). That's a small issue though and just indicates a wider complex problem.

> An issue is people may write decomposed filename. A imaginary use case is a program which make a filename from the name of a music output from iTunes. iTunes manages texts with UTF8-MAC. So the people will confuse.

OK, so in this case someone is unwittingly using a mix of UTF-8 NFC (any strings they create in Ruby with legible accents) and UTF-8 NFD (any strings they get from iTunes, say) in their script, which could lead to issues even before writing file names. If they get NFD from iTunes and then try to match on a track name with a regexp, it won't work unless they convert to NFC or explicitly create an NFD string, will it? So even ignoring the file system, there are issues with labelling two different normalization forms as UTF-8, as it leads to naive expectations of compatibility which are false.
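
A minimal sketch of that mismatch, assuming Ruby 2.2 or later for String#unicode_normalize (it is not available in the 1.9.3 and 2.0.0 builds discussed in this thread). The variable names are illustrative only:

```ruby
# "détente" in two Unicode normalization forms, written with escapes so the
# form does not depend on how an editor saves this file.
nfc = "d\u00E9tente"   # precomposed: U+00E9
nfd = "de\u0301tente"  # decomposed: "e" followed by U+0301 COMBINING ACUTE ACCENT

pattern = /\u00E9/     # a regexp written with the precomposed character

both_utf8     = (nfc.encoding == nfd.encoding)  # same encoding label on both
equal         = (nfc == nfd)                    # compared code point by code point
nfc_matches   = !(nfc =~ pattern).nil?
nfd_matches   = !(nfd =~ pattern).nil?
fixed_matches = !(nfd.unicode_normalize(:nfc) =~ pattern).nil?

puts "same encoding label:       #{both_utf8}"
puts "strings equal:             #{equal}"
puts "regexp matches NFC:        #{nfc_matches}"
puts "regexp matches NFD:        #{nfd_matches}"
puts "regexp matches NFD -> NFC: #{fixed_matches}"
```

Both strings carry the same encoding label, so nothing in the encoding warns that the comparison or the match will fail.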

> More painful thing is normal UTF-8 is not NFCed UTF-8.There is both NFCed one and NFDed one.
 
Yes, it might be good to make this very clear in the Ruby docs, as so many people use UTF-8 for all strings now, probably mostly without understanding this issue (I'm aware not everyone in the Ruby community uses UTF-8, for other reasons). I certainly didn't know about it, and it took quite a while to find any docs about it at all (most of which were for Perl or other languages). Thanks for taking the time to explain it, and sorry if I missed some obvious notes in the docs for the Ruby String class, say.

One thing I don't understand though, is that you say there are both in normal use - in use of Ruby ignoring file systems, if you create a string or regexp, NFC is the default isn't it? So Ruby has chosen one default for UTF-8 strings created in Ruby (as it must), but has to interact with lots of systems which might or might not be using NFC. At present we seem to have a de-facto default normalization of NFC, but nothing is translated to it when it comes from the OS. That might be a very hard problem, but in principle it would be nice to have one normalization blessed as the default so that all strings in a given encoding are comparable. The results of leaving them as they are supplied are really unexpected, and people using Ruby are not going to want to manually convert every string they touch from outside Ruby to NFC in case it was touched by HFS or created as NFD.

> First Ruby 1.9.0 set strings derived from filenames UTF8-MAC.
> But some reported that if filenames is UTF8-MAC, it is hard to compare
> with normal UTF-8 strings.

This is interesting as it's exactly the behaviour I expected (if it's not possible to cleanly translate to NFC) - if strings are coming through as UTF-8 NFD, I'd expect them to be marked as such somehow (for example by being marked as encoding UTF8-MAC) - is there any indication? Then at least it is clear that they are not comparable or compatible with the NFC Ruby strings I get when creating a string s = "détente". Removing the explicit encoding doesn't really solve the problem of strings being hard to compare, it just hides it until comparisons fail. By default, Ruby seems to use NFC UTF-8, which is what I'd expect, so I also expected to either receive strings which are comparable directly, or for those strings to be marked as some other encoding/normalization if they come from an HFS+ file system, so that I can deal with them even if Ruby doesn't. Is the normalization exposed somewhere on strings? The current situation was surprising and unexpected, though now that it has been explained I see why it might occur and how hard it is to fix.
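
On the question of whether the normalization is visible on a string: the encoding label does not carry it, but the code points can be inspected directly. A rough sketch, again assuming Ruby 2.2 or later for String#unicode_normalized? (not available in the versions this report was tested on):

```ruby
nfc = "d\u00E9tente"   # precomposed form
nfd = "de\u0301tente"  # decomposed form

[nfc, nfd].each do |s|
  # List every code point so a combining mark such as U+0301 becomes visible.
  hex = s.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
  puts "#{s.encoding}, #{s.length} code points: #{hex}"
  puts "  NFC? #{s.unicode_normalized?(:nfc)}  NFD? #{s.unicode_normalized?(:nfd)}"
end
```

The length differs by one because the decomposed form stores the accent as a separate code point.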

In the example code attached to this issue, I've used str.encode!("UTF-8", "UTF8-MAC") to translate file names/paths to NFC which works for my uses, but from naruse's comments, it seems that would fail in certain circumstances? 

> If the translation from UTF8-MAC -> UTF-8 is entirely non-lossy and would do no harm to other UTF-8 strings
> Yes until all part of the converting string is truly UTF8-MAC.

I assumed from others' comments that UTF8-MAC was purely a sub-encoding used to indicate the use of decomposed strings, but would appreciate some more detail (if anyone has a link) on what exactly it involves, and if translation from UTF8-MAC to UTF8 can lose information that implies other differences. If the only difference is the decomposition (patterns which do not occur in NFC), I'd expect re-encoding to be idempotent and not affect NFC strings and thus harmless to apply to NFC strings or strings containing a mix. Re the file-system example, I had assumed that if you ask HFS to write to a file on a mounted file system HFS would normalize all names to NFD (as it does for any HFS files), but perhaps that is incorrect. 
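
A small check of that expectation on a single sample, using only String#encode with the UTF8-MAC source encoding mentioned above. It covers one accented character and does not show whether the same holds for every input, which is the open question here:

```ruby
nfd = "de\u0301tente"  # decomposed, as a file name from HFS+ might arrive
nfc = "d\u00E9tente"   # precomposed, as typed in a Ruby source file

# Treat the bytes as UTF8-MAC and transcode them to plain UTF-8.
from_nfd = nfd.encode("UTF-8", "UTF8-MAC")
from_nfc = nfc.encode("UTF-8", "UTF8-MAC")
twice    = from_nfd.encode("UTF-8", "UTF8-MAC")

puts "NFD input composed to NFC: #{from_nfd == nfc}"
puts "NFC input left unchanged:  #{from_nfc == nfc}"
puts "second pass changes it:    #{twice != from_nfd}"
```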

I suppose the above boils down to this question:

Is there a correct way to handle this situation, and never fail when comparing a default Ruby string (NFC) against a file from any file system which may be NFD?
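
One possible approach, sketched under assumptions: normalize both sides to the same form before comparing. The helper same_name? is hypothetical (not part of Ruby), it relies on String#unicode_normalize from Ruby 2.2 or later, and it assumes both strings are in a Unicode encoding. HFS+ uses its own variant of decomposition, so this is not verified for every file name:

```ruby
# Hypothetical helper: compare two names by their NFC form rather than by
# their raw code points.
def same_name?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

from_script     = "./Test\u00E9.txt"   # NFC, as written in a Ruby source file
from_filesystem = "./Teste\u0301.txt"  # NFD, a stand-in for a Dir.glob result

names = [from_filesystem, "./Other.txt"]
found = names.select { |n| same_name?(n, from_script) }

puts "raw comparison:        #{from_script == from_filesystem}"
puts "normalized comparison: #{same_name?(from_script, from_filesystem)}"
puts "names found:           #{found.length}"
```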


----------------------------------------
Bug #7267: Dir.glob on Mac OS X returns unexpected string encodings for unicode file names
https://bugs.ruby-lang.org/issues/7267#change-32704

Author: kennygrant (Kenny Grant)
Status: Assigned
Priority: Normal
Assignee: duerst (Martin Dürst)
Category: 
Target version: 2.0.0
ruby -v: ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-darwin11.4.0]


Tested on Ruby 1.9.3-p194 and ruby-2.0.0-preview1 on Mac OS X 10.7.5

When calling file system methods with Ruby on Mac OS X, it is not possible to manipulate the resulting file name as a normal UTF-8 string, even though it reports the encoding as UTF-8. It seems to be a UTF-8-MAC string, even when the default encoding is set to UTF-8. This leads to confusion as the string can be manipulated normally except for any unicode characters, which seem to be decomposed. So a regexp using utf-8 characters won't work on the string, unless it is first converted from UTF-8-MAC. I'd expect the string encoding to be UTF-8, or at least to report that it is not a normal UTF-8 string if it has to be UTF-8-MAC for some reason. 

Example, run with a file called Testé.txt in the same folder:

# encoding: utf-8
# (the magic comment is needed on Ruby 1.9 for the non-ASCII literals below)

# Prints the string, then the string with every precomposed "é" replaced.
def transform_string(s)
  puts "Testing string #{s}"
  puts s.gsub(/é/, 'TEST')
end

Dir.glob("./*.txt").each do |f|
  puts "Inline string works as expected"
  s = "./Testé.txt"
  transform_string s

  puts "File name from Dir.glob does not"
  transform_string f

  puts "Encoded file name works as expected, though it is reported as UTF-8, not UTF-8-MAC"
  f.encode!('UTF-8', 'UTF-8-MAC')
  transform_string f
end


-- 
http://bugs.ruby-lang.org/