From: "kennygrant (Kenny Grant)"
Date: 2012-11-03T07:50:20+09:00
Subject: [ruby-core:48768] [ruby-trunk - Bug #7267] Dir.glob on Mac OS X returns unexpected string encodings for unicode file names

Issue #7267 has been updated by kennygrant (Kenny Grant).

File writer.rb added

The problem I encountered here was that although the encoding is UTF-8 in both cases, there are apparently several flavours of UTF-8: NFC (composed) and NFD (decomposed). This is not something most Ruby users will be familiar with, and I'm not sure it's really the same as having to choose your own encoding and always convert to and from it (which I understand is necessary). I was very confused at first, as these strings all report UTF-8 encoding yet behave differently: they display in the same way but act differently in some circumstances because of the underlying bytes.

Writing out a file with a UTF-8 (composed) name results in a different UTF-8 (decomposed) string when the name is read back in later with Dir.glob or other File/Dir methods, so apparently there is automatic translation one way but not the other. The file name read back in appears to work until you try matching against the string or manipulating its parts. This leads to silent failures: code which worked for ASCII strings will fail on Mac OS X for names which are not plain ASCII, unless you explicitly convert the file name back to composed form.

What I would expect is for Ruby's file system methods to convert back to composed form when reading file names in again, so that matches against strings or regexps defined as UTF-8 work correctly. As it is, these fail, and the string displays normally but does not behave as you would expect.

Apologies if this has already been considered and I am rehashing an old argument. I just found this behaviour somewhat puzzling until I worked out what it was doing, and even then it is painful to have to consider which flavour of UTF-8 is in use and convert to NFC all the time.
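The round trip described above (write a composed name, read it back, convert it to composed form again) could be wrapped up roughly as follows. This is only a sketch under stated assumptions: glob_nfc is a hypothetical helper name, not part of Ruby; String#unicode_normalize requires Ruby 2.2 or later, so it is not available in the 1.9.3 and 2.0.0-preview1 builds discussed in this thread; file names on disk are assumed to be UTF-8; and the accented character is chosen purely for illustration.

```ruby
require "tmpdir"

# Hypothetical helper (not part of Ruby): glob, then return each path in
# composed (NFC) form. Assumes file names on disk are UTF-8.
def glob_nfc(pattern)
  Dir.glob(pattern).map do |path|
    path.dup.force_encoding(Encoding::UTF_8).unicode_normalize(:nfc)
  end
end

composed_name = "Test\u00E9.txt" # accented "e" as the single code point U+00E9

Dir.mktmpdir do |dir|
  # Write a file using the composed name.
  File.write(File.join(dir, composed_name), "example")

  # Read the names back; they are normalized regardless of what the
  # file system stored.
  names = glob_nfc(File.join(dir, "*.txt")).map { |path| File.basename(path) }

  puts names.include?(composed_name)
end
```

Normalizing at the point where names enter the program keeps later string and regexp matches free of any NFC/NFD special cases.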
A related bug is that if you match a glob on the exact name, you will receive a UTF-8 string which is NFC, whereas if you try a partial match, you will receive NFD, so the behaviour can be inconsistent. See the attached file for an example which writes out a name and then tries to match it straight afterwards.

It would be great if Ruby could just consistently return NFC (as is used when you use UTF-8) and convert as necessary for the file system, but never expose that to the user.

Thanks for your time.

----------------------------------------
Bug #7267: Dir.glob on Mac OS X returns unexpected string encodings for unicode file names
https://bugs.ruby-lang.org/issues/7267#change-32255

Author: kennygrant (Kenny Grant)
Status: Open
Priority: Normal
Assignee:
Category:
Target version: 2.0.0
ruby -v: ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-darwin11.4.0]

Tested on Ruby 1.9.3-p194 and ruby-2.0.0-preview1 on Mac OS X 10.7.5

When calling file system methods with Ruby on Mac OS X, it is not possible to manipulate the resulting file name as a normal UTF-8 string, even though it reports the encoding as UTF-8. It seems to be a UTF-8-MAC string, even when the default encoding is set to UTF-8. This leads to confusion, as the string can be manipulated normally except for any unicode characters, which seem to be decomposed. So a regexp using UTF-8 characters won't work on the string unless it is first converted from UTF-8-MAC.

I'd expect the string encoding to be UTF-8, or at least to report that it is not a normal UTF-8 string if it has to be UTF-8-MAC for some reason.
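The mismatch can also be reproduced without touching the file system. The sketch below builds a composed and a decomposed version of the same name by hand and shows that they share an encoding, differ in bytes, and behave differently under a regexp. The accented character is an arbitrary choice for illustration, and String#unicode_normalize is assumed to be available (Ruby 2.2 or later); on the versions in this report, the encode call from UTF-8-MAC would be the equivalent step.

```ruby
composed   = "Test\u00E9.txt"   # NFC: single code point U+00E9
decomposed = "Teste\u0301.txt"  # NFD: "e" followed by combining acute U+0301

# Both strings report the same encoding but are not equal.
puts composed.encoding == decomposed.encoding # true
puts composed == decomposed                   # false

pattern = /\u00E9/ # matches only the composed character

# The substitution silently does nothing on the decomposed name...
puts decomposed.gsub(pattern, "TEST") == decomposed # true

# ...and works once the name is converted back to composed form.
recomposed = decomposed.unicode_normalize(:nfc)
puts recomposed == composed           # true
puts recomposed.gsub(pattern, "TEST") # TestTEST.txt
```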
Example, run with a file called Testé.txt in the same folder:

def transform_string s
  puts "Testing string #{s}"
  puts s.gsub(/é/,'TEST')
end

Dir.glob("./*.txt").each do |f|
  puts "Inline string works as expected"
  s = "./Testé.txt"
  puts transform_string s

  puts "File name from Dir.glob does not"
  puts transform_string f

  puts "Encoded file name works as expected, though it is reported as UTF-8, not UTF-8-MAC"
  f.encode!('UTF-8','UTF-8-MAC')
  puts transform_string f
end

-- 
http://bugs.ruby-lang.org/