From: "NARUSE, Yui"
Date: 2012-11-09T17:55:16+09:00
Subject: [ruby-core:49140] Re: [ruby-trunk - Bug #7267] Dir.glob on Mac OS X returns unexpected string encodings for unicode file names

> What I would expect to happen is for Ruby file system methods to convert back to composed form on reading in file names again, so that matches on a string or regexp defined as UTF-8 would work correctly. As it is these fail, and the string displays normally but does not behave as you would expect.

One issue is that people may write decomposed filenames. An imaginary use case is a program that builds a filename from a song title output by iTunes. iTunes manages text as UTF8-MAC, so those people would be confused.

> Apologies if this has already been considered and I am rehashing an old argument, I just found this behaviour somewhat puzzling until I worked out what it was doing,

This issue has been discussed for a long time. At first, Ruby 1.9.0 tagged strings derived from filenames as UTF8-MAC. But some people reported that if filenames are UTF8-MAC, they are hard to compare with normal UTF-8 strings. So from 1.9.2, filenames became UTF-8. I find there is no simple correct answer.

> and even then it is painful to have to consider which flavour of UTF-8 is in use and convert to NFC all the time.

A more painful thing is that normal UTF-8 is not necessarily NFC-normalized UTF-8. There are both NFC strings and NFD strings. People live with this ambiguity all the time, even if they don't notice.

> It would be great if Ruby could just consistently return NFC (as is used when you use UTF-8) and convert as necessary for the file system, but never expose that to the user.

Again, Ruby can't always return NFC strings, because on filesystems other than HFS+ filenames are not normalized and may contain both NFC and NFD names. (As you may know, Mac OS X's default filesystem is HFS+.)

> If the translation from UTF8-MAC -> UTF-8 is entirely non-lossy and would do no harm to other UTF-8 strings

Yes, as long as every part of the string being converted is truly UTF8-MAC.
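
To make the NFC/NFD ambiguity concrete, here is a minimal sketch. It assumes a Ruby with String#unicode_normalize (2.2 or later, so newer than the versions discussed in this thread), and the helper names same_after_nfc? and from_utf8_mac are hypothetical, made up only for illustration.

```ruby
# Hypothetical helper: compare two strings after NFC-normalizing both.
def same_after_nfc?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

# Hypothetical helper: treat the bytes as UTF8-MAC (decomposed) and
# transcode them to plain UTF-8, which composes the sequences.
def from_utf8_mac(name)
  name.encode("UTF-8", "UTF8-MAC")
end

composed   = "\u00E9"   # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

puts composed == decomposed                  # byte-wise comparison: false
puts same_after_nfc?(composed, decomposed)   # true
puts from_utf8_mac(decomposed) == composed   # true
```

Both strings display the same, but only the normalized comparison treats them as equal. The transcoding is safe only under the condition stated above: the input really must be UTF8-MAC.
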
> perhaps the right thing to do here would be to auto-translate UTF-8-MAC to UTF-8 on reading all file names assumed to be UTF-8 on Mac OS, as the OS default is decomposed, but the default in Ruby is composed.

I slightly doubt that this feature would truly make people happy and have no side effects, even leaving the technical difficulty aside.

> Hopefully this would not affect any other users/file systems, but I'm afraid I don't know enough to make that judgement call and may well have overlooked something.

On Mac OS X, there are filesystems other than HFS+. Mac OS X can use UFS, CDFS, NFS and so on. They don't normalize filenames. If you NFC-normalize such filenames, you lose the file.

Moreover, if you mount filesystems, a path may contain names from different filesystems, like

/ - HFS+
/foo - UFS
/foo/bar - ext4 over NFS
/foo/bar/baz - NTFS
/foo/bar/baz/cd - CDFS

Here, only "foo" is normalized. If bar is "e`" (decomposed) and there is also a directory named "è" (composed) in the parent directory, the normalized path loses the file.
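
The last case can be sketched without touching a real filesystem. The Hash constants below are a hypothetical stand-in for a directory on a filesystem that stores names byte-for-byte, and lookup_after_nfc is an illustrative helper, not anything Ruby provides. It again assumes String#unicode_normalize (Ruby 2.2 or later).

```ruby
# Simulated directory listings keyed by the exact name stored on disk.
BOTH_FORMS = { "e\u0300" => :decomposed_dir, "\u00E8" => :composed_dir }.freeze
ONLY_NFD   = { "e\u0300" => :decomposed_dir }.freeze

# Hypothetical lookup that NFC-normalizes the name before using it.
def lookup_after_nfc(entries, name)
  entries[name.unicode_normalize(:nfc)]
end

name_on_disk = "e\u0300"  # "e" followed by COMBINING GRAVE ACCENT

p BOTH_FORMS[name_on_disk]                    # exact name: :decomposed_dir
p lookup_after_nfc(BOTH_FORMS, name_on_disk)  # wrong entry: :composed_dir
p lookup_after_nfc(ONLY_NFD, name_on_disk)    # no entry at all: nil
```

So normalizing a name that came from a non-normalizing filesystem either points at a different entry or at nothing.
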