From: "shyouhei (Shyouhei Urabe)"
Date: 2012-11-03T13:27:38+09:00
Subject: [ruby-core:48794] [ruby-trunk - Bug #7267] Dir.glob on Mac OS X returns unexpected string encodings for unicode file names

Issue #7267 has been updated by shyouhei (Shyouhei Urabe).

Just another reason why Unicode sucks.

Anyway, I know your feeling and I wish I could help you. The problem is, the world isn't built on top of Mac OS. There are virtually thousands of different hard disk formats, each with its own way of storing filenames. On _your_ Mac, it might be some sort of normalized Unicode. That is not the case for others, even on a Mac, for example when you network-mount a Linux-hosted logical volume.

So it's not a matter of normalizing. The real problem is that we can't, in practice, know the real encoding of a filename. There's simply no way to obtain the encoding of a path. We have to ASSUME what it is instead. And you're all familiar with the fact that assumptions always suck.

Yes, this is the reason for the mess. Creating a file with a UTF-8 filename and getting back a filename in a different kind of UTF-8 isn't because Ruby is being evil. It's the default behaviour of your filesystem. We just can't know about that situation. Can't.

PS. See also how the MacFUSE project treats this exact same issue as "WontFix": http://code.google.com/p/macfuse/issues/detail?id=139

PS. I repeat: I don't believe the current situation is the best. We should have a better workaround. I don't know how, though.

----------------------------------------
Bug #7267: Dir.glob on Mac OS X returns unexpected string encodings for unicode file names
https://bugs.ruby-lang.org/issues/7267#change-32286

Author: kennygrant (Kenny Grant)
Status: Open
Priority: Normal
Assignee:
Category:
Target version: 2.0.0
ruby -v: ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-darwin11.4.0]

Tested on Ruby 1.9.3-p194 and ruby-2.0.0-preview1 on Mac OS X 10.7.5

When calling file system methods with Ruby on Mac OS X, it is not possible to manipulate the resulting file name as a normal UTF-8 string, even though it reports the encoding as UTF-8. It seems to be a UTF-8-MAC string, even when the default encoding is set to UTF-8. This leads to confusion, as the string can be manipulated normally except for any Unicode characters, which seem to be decomposed. So a regexp using UTF-8 characters won't work on the string unless it is first converted from UTF-8-MAC. I'd expect the string encoding to be UTF-8, or at least to report that it is not a normal UTF-8 string if it has to be UTF-8-MAC for some reason.

Example, run with a file called Testé.txt in the same folder:

    def transform_string s
      puts "Testing string #{s}"
      puts s.gsub(/é/,'TEST')
    end

    Dir.glob("./*.txt").each do |f|
      puts "Inline string works as expected"
      s = "./Testé.txt"
      puts transform_string s

      puts "File name from Dir.glob does not"
      puts transform_string f

      puts "Encoded file name works as expected, though it is reported as UTF-8, not UTF-8-MAC"
      f.encode!('UTF-8','UTF-8-MAC')
      puts transform_string f
    end

-- 
http://bugs.ruby-lang.org/
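
For readers without an HFS+ volume at hand, the mismatch described in the report can be reproduced without touching the filesystem. The following is only a sketch under assumptions: the two strings are hand-built stand-ins for what the reporter typed and for what Dir.glob might hand back, the variable names are made up for illustration, and String#unicode_normalize needs Ruby 2.2 or later, which postdates this thread. The last step repeats the UTF-8-MAC conversion from the report.

```ruby
# Built from explicit escapes so the result does not depend on the
# source file's encoding or on any particular filesystem.
composed   = "Test\u00E9.txt"   # precomposed e-acute, as usually typed in source
decomposed = "Teste\u0301.txt"  # "e" + combining acute, the decomposed form

# Both strings report the same encoding, yet they hold different code points.
puts composed.encoding
puts decomposed.encoding
puts composed == decomposed     # false: comparison is code point by code point

# A pattern written with the precomposed character only matches one of them.
pattern = /\u00E9/
puts composed.gsub(pattern, 'TEST')
puts decomposed.gsub(pattern, 'TEST')

# Assumption: Ruby 2.2+. Normalizing to NFC makes the strings comparable.
normalized = decomposed.unicode_normalize(:nfc)
puts normalized == composed
puts normalized.gsub(pattern, 'TEST')

# The workaround from the report: transcode from UTF-8-MAC to UTF-8.
converted = decomposed.encode('UTF-8', 'UTF-8-MAC')
puts converted.encoding
```

Note that this only illustrates the symptom. As the reply above points out, a program cannot in general know which form a given filesystem will return, so normalizing is something the caller has to choose to do.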