Bug #2026: String encodings are not supported by most of IO on Linux - Ruby - Ruby Issue Tracking System

Added by vo.x (Vit Ondruch) almost 16 years ago. Updated about 14 years ago.

Status: Closed
Assignee:
Target version:
ruby -v: ruby 1.9.2dev (2009-08-19 trunk 24581) [i686-linux]
Backport:

[ruby-core:25220]

Description

=begin
If a string used as a path has an encoding other than UTF-8, the path created on the file system is incorrect. This faulty behavior is common to most Dir and IO operations. The attached script demonstrates it using the Dir.mkdir method.
=end

Files

mkdir.rb (43 Bytes), added by vo.x (Vit Ondruch), 09/01/2009 12:27 AM

Related issues 1 (0 open — 1 closed)

Updated by matz (Yukihiro Matsumoto) almost 16 years ago

=begin
Hi,

In message "Re: [ruby-core:25220] [Bug #2026] String encodings are not supported by most of IO on Linux"
on Tue, 1 Sep 2009 00:27:14 +0900, Vit Ondruch [email protected] writes:

|If string used as path has different than UTF-8, the path created on file system is incorrect. The described faulty behavior is common for most of Dir and IO actions. Attached script demonstrates the behavior on sample of Dir#mkdir method.

Could you explain what you expect for the sample script?

Ruby 1.9 does not convert path encodings automatically, since paths may
be encoded in various encodings. For example, under my home directory
I have files with names in both EUC-JP and UTF-8, so automatic conversion
would cause problems. It is the responsibility of the user program to
ensure the encoding of the path strings.

						matz.

=end
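
matz's point that the program itself should settle the encoding of a path before using it can be sketched as follows. This is only an illustration: `mkdir_transcoded` is a hypothetical helper, not part of Ruby, and it assumes the caller already knows which encoding the file system should receive (UTF-8 here). The sample name is built with `\u` escapes so that the example does not depend on the encoding of the source file.

```ruby
require "tmpdir"

# Hypothetical helper (not part of Ruby): transcode the name explicitly
# before handing it to Dir.mkdir, instead of relying on automatic conversion.
def mkdir_transcoded(parent, name, fs_encoding = Encoding::UTF_8)
  on_disk = name.encode(fs_encoding) # raises if a character has no mapping
  Dir.mkdir(File.join(parent, on_disk))
  on_disk
end

# A Czech directory name that arrives encoded as Windows-1250 (cp1250).
name_cp1250 = "\u017Elu\u0165ou\u010Dk\u00FD".encode("Windows-1250")

Dir.mktmpdir do |tmp|
  created = mkdir_transcoded(tmp, name_cp1250)
  puts created.encoding                    # UTF-8
  puts Dir.exist?(File.join(tmp, created)) # true
end
```

Passing `name_cp1250` straight to `Dir.mkdir` would hand the cp1250 bytes to the operating system unchanged on Linux, which is the behavior the report describes.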

Updated by hramrach (Michal Suchanek) almost 16 years ago

=begin
2009/8/31 Yukihiro Matsumoto [email protected]:

> Hi,
>
> In message "Re: [ruby-core:25220] [Bug #2026] String encodings are not supported by most of IO on Linux"
> on Tue, 1 Sep 2009 00:27:14 +0900, Vit Ondruch [email protected] writes:
>
> |If string used as path has different than UTF-8, the path created on file system is incorrect. The described faulty behavior is common for most of Dir and IO actions. Attached script demonstrates the behavior on sample of Dir#mkdir method.
>
> Could you explain what you expect for the sample script?
>
> Ruby 1.9 does not convert path encoding automatically, since path may
> be encoded in various encodings, for example, under my home directory,
> I have files with names in EUC-JP and UTF-8. So auto conversion would
> cause problem. It is responsibility of user program to make sure the
> encoding of the path strings.

If you have your locale set then it is usually expected that your
pathnames are in your locale's encoding just like stdio, text files,
etc.

While it is possible to mix filenames in various encodings on some
filesystems it is usually not desirable (as is the case with text
files - they are easiest to view when in your locale's encoding).
Other filesystems require a single encoding and the filenames are
probably converted from the current locale to the on-disk encoding in
libc and filenames invalid in the current locale's encoding cannot be
created/accessed. This is the case at least for hfsplus on OSX and
perhaps some networked filesystems.

Thanks

Michal

=end

Updated by vo.x (Vit Ondruch) almost 16 years ago

=begin
Hello,

Working on Ubuntu, I have set the following environment variable: LANG="cs_CZ.UTF-8", which means my system expects IO operations to be UTF-8 encoded. Otherwise Nautilus, the command line, and every other application cannot correctly interpret the path names created by the attached script.

If I work on Windows, I expect that every filename will be stored in UTF-16LE, otherwise I'm in trouble again.

As Michal said: "While it is possible to mix filenames in various encodings on some filesystems it is usually not desirable". From my point of view, the string encoding should be converted explicitly before an IO method is called, and that should be Ruby's responsibility by default, unless the conversion is explicitly disabled.

Vit
=end

Updated by hramrach (Michal Suchanek) almost 16 years ago

=begin

> If I work on Windows, I expect that every filename will be stored in UTF-16LE, otherwise I'm in trouble again.

On Windows the situation is somewhat more complicated. It does store
the filenames in UTF-16 but it also has 8bit short filenames, and
depending on the interface you use you might get to UTF-16, the
"windows codepage" or the "dos codepage".

Thanks

Michal

=end
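
To make the difference between those interfaces concrete, the sketch below encodes the same one-character name in two ways: as UTF-16LE, which is what a wide-character interface would take, and as Windows-1250, chosen here only as one example of an 8-bit "Windows codepage". It does not call any Windows API; it only shows that the byte sequences differ, which is why the choice of interface matters.

```ruby
name = "\u017E" # "ž", U+017E

utf16  = name.encode("UTF-16LE")     # wide-character form
cp1250 = name.encode("Windows-1250") # one possible 8-bit codepage form

p utf16.bytes  # => [126, 1]
p cp1250.bytes # => [158]
```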

Updated by vo.x (Vit Ondruch) almost 16 years ago

=begin
Actually, it is simpler from Ruby's point of view: to cover all the cases you named, it is enough to use the UTF-16 functionality, and everybody will be happy. The only question is how to ensure that a string coming from Ruby, in whatever encoding, will be automatically converted into UTF-16. You really don't want to handle this conversion explicitly.

The same applies to Unix, where the filesystem encoding should be used automatically, unless a different encoding is explicitly enforced in some special cases.

Vit
=end

Updated by naruse (Yui NARUSE) almost 16 years ago

=begin
Current Ruby treats the filesystem encoding on Unix as binary,
because our initial research shows that Unix users expect it.

As you know, Windows is UTF-16LE/Locale (see also win32-unicode-test),
and Mac OS X is UTF-8.
If some OSs or distributions imply their filesystem encoding,
we can follow it.

> The only question is how to ensure that string coming from Ruby,
> in whatever encoding, will be automatically converted into UTF-16.

Use Encoding.default_internal.
=end
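
For readers who want to see which encodings their own Ruby associates with paths and I/O, the following sketch prints them. It relies on the `"filesystem"` and `"locale"` names accepted by `Encoding.find` in current Ruby releases; whether the 1.9.2dev build discussed in this thread already offered them is not something the thread states, so treat that as an assumption. `encoding_report` is a hypothetical helper name, and the values it prints depend on the platform and the locale settings.

```ruby
# Report the encodings Ruby currently associates with paths and I/O.
def encoding_report
  {
    filesystem:       Encoding.find("filesystem"),
    locale:           Encoding.find("locale"),
    default_external: Encoding.default_external,
    default_internal: Encoding.default_internal # nil unless set, e.g. via -E
  }
end

encoding_report.each { |name, enc| puts "#{name}: #{enc.inspect}" }
```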

Updated by hramrach (Michal Suchanek) almost 16 years ago

=begin
2009/9/1 Yui NARUSE [email protected]:

> Issue #2026 has been updated by Yui NARUSE.
>
> Current Ruby thinks the filesystem encoding of Unix is binary.
> Because our initial research shows users of Unix expect it.
>
> As you know, Windows is UTF-16LE/Locale (see also win32-unicode-test),
> and Mac OS X is UTF-8.
> If some OSs or distributions imply their filesystem encoding,
> we can follow it.

Any distribution that sets locale (other than C) does.

Thanks

Michal

=end

Updated by naruse (Yui NARUSE) almost 16 years ago

=begin
As I stated, our initial research shows that Unix users expect the filesystem encoding to be binary.
For example, matz used a UTF-8 locale with EUC-JP files.
See [ruby-dev:34923] (in Japanese).

So you need counterarguments, or you need to restrict the proposal to specific distributions.
=end

Updated by shyouhei (Shyouhei Urabe) almost 16 years ago

=begin
Michal Suchanek wrote:

> 2009/9/1 Yui NARUSE [email protected]:
>
>> Issue #2026 has been updated by Yui NARUSE.
>>
>> Current Ruby thinks the filesystem encoding of Unix is binary.
>> Because our initial research shows users of Unix expect it.
>>
>> As you know, Windows is UTF-16LE/Locale (see also win32-unicode-test),
>> and Mac OS X is UTF-8.
>> If some OSs or distributions imply their filesystem encoding,
>> we can follow it.
>
> Any distribution that sets locale (other than C) does.

That's just a default and users can override it. Even if a distro sets the default
locale to UTF-8, a user can still create a filename that is invalid as a
UTF-8 sequence. Ruby must be able to handle that kind of filename. Hence the
choice of binary encoding.

On the other hand AFAIK you cannot create a UTF16-invalid filename on Windows.

=end
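
The case described above, a filename whose bytes are not valid UTF-8 even though the locale is UTF-8, can be reproduced with plain strings. The sketch uses the ISO-8859-1 bytes for "café" as a stand-in for such a name; `valid_utf8_name?` is a hypothetical helper. Keeping the name tagged as binary preserves the bytes untouched, while reinterpreting them as UTF-8 yields an invalid string.

```ruby
# Returns true when the raw bytes of a filename form valid UTF-8.
def valid_utf8_name?(name)
  name.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

latin1_name = "caf\xE9".b # "café" as ISO-8859-1 bytes, tagged as binary
utf8_name   = "caf\u00E9" # the same word as real UTF-8

p valid_utf8_name?(latin1_name) # => false
p valid_utf8_name?(utf8_name)   # => true

# Replacing the invalid byte is lossy: the result no longer names the same file.
p latin1_name.dup.force_encoding(Encoding::UTF_8).scrub("?") # => "caf?"
```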

#10

Updated by hramrach (Michal Suchanek) almost 16 years ago

=begin
2009/9/1 Urabe Shyouhei [email protected]:

> Michal Suchanek wrote:
>
>> 2009/9/1 Yui NARUSE [email protected]:
>>
>>> Issue #2026 has been updated by Yui NARUSE.
>>>
>>> Current Ruby thinks the filesystem encoding of Unix is binary.
>>> Because our initial research shows users of Unix expect it.
>>>
>>> As you know, Windows is UTF-16LE/Locale (see also win32-unicode-test),
>>> and Mac OS X is UTF-8.
>>> If some OSs or distributions imply their filesystem encoding,
>>> we can follow it.
>>
>> Any distribution that sets locale (other than C) does.
>
> That's just a default and users can override it. Even if a distro sets default
> locale to UTF-8, a user can still generate a filename which is invalid as a
> UTF-8 sequence. Ruby must be able to handle that kind of filename. Hence the
> choice of binary encoding.

That's the same situation as with stdio. It is binary and can use any
encoding (or none at all) but unless specified otherwise it is
expected to be in the current locale.

The problem is described with mkdir. In this case it is somewhat
disturbing if Ruby creates directories with names not encoded in your
locale's encoding unless you explicitly ask.

> On the other hand AFAIK you cannot create a UTF16-invalid filename on Windows.

You cannot create an invalid filename on OS X either.

Thanks

Michal

=end

#11

Updated by vo.x (Vit Ondruch) almost 16 years ago

=begin
I have nothing against binary encoding of filenames, but it should be an option, not the default. Otherwise there should be some way to support OS X and Windows, which is not currently the case.

Vit
=end

#12

Updated by matz (Yukihiro Matsumoto) almost 16 years ago

=begin
Hi,

In message "Re: [ruby-core:25241] [Bug #2026] String encodings are not supported by most of IO on Linux"
on Tue, 1 Sep 2009 16:46:18 +0900, Vit Ondruch [email protected] writes:

|And the same apply also for Unix, where the filesystem encoding should be used automatically, as long as different encoding is not enforced explicitly in some special cases.

Automatic encoding conversion bites you rather than buys you. I can
tell you that from my 20+ years of experience with multi-byte text
processing. If you use UTF-8 everywhere, there should be no problem.
But in the original case, you tried to mix cp1250 and UTF-8
(automatically), and that was the whole source of the problem.

And if you really want one true internal encoding, try setting the
default_internal encoding by specifying -E external:internal,
e.g. -E cp1250:utf-8

						matz.

=end
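
The `-E cp1250:utf-8` option sets the default external and internal encodings for the whole process. The same external-to-internal pairing can be requested for a single file through the open mode, which makes the effect easy to observe in isolation. The sketch below is an illustration of that pairing for file contents only; `read_transcoded` is a hypothetical helper, and the example makes no claim about how path arguments are treated.

```ruby
require "tempfile"

# Hypothetical helper: read a file stored in `external` and return its
# contents transcoded to `internal`, like -E external:internal does globally.
def read_transcoded(path, external, internal = "UTF-8")
  File.read(path, mode: "r:#{external}:#{internal}")
end

Tempfile.create("enc-demo") do |f|
  f.binmode
  f.write("\u017Elu\u0165".encode("Windows-1250")) # store cp1250 bytes
  f.flush

  text = read_transcoded(f.path, "Windows-1250")
  puts text.encoding            # UTF-8
  puts text == "\u017Elu\u0165" # true
end
```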

#13

Updated by vo.x (Vit Ondruch) almost 16 years ago

=begin
OK, maybe I started from the wrong side. What I am trying to achieve is Unicode support on Windows. I started by playing with the Dir.mkdir method, which seems pretty simple. In the end, however, it calls the POSIX mkdir function, which by your account does not care much about encoding. On Windows I do have to care about the encoding, since it is pretty clear that Windows uses UTF-16LE. So what is your suggestion? There is no way to ensure the encoding of the string that reaches rb_w32_mkdir, which is the Windows implementation of mkdir.

Initially I tried to start the discussion in the http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/24993 post, but that did not attract much attention, probably because it is too complex and has several flaws. Therefore I tried to choose another way, which is passing UTF-8 encoded strings into rb_w32_mkdir. However, there is currently no way to ensure the encoding. Of course there could be some platform-specific ifdefs in dir_s_mkdir, but that would just clutter the code.
=end

#14

Updated by vo.x (Vit Ondruch) almost 16 years ago

=begin
I just want to clarify that if some file system encoding could be guaranteed, it would be easier to do an explicit conversion into the Windows encoding. That is my interest in this whole topic.

Vit
=end

#15

Updated by nobu (Nobuyoshi Nakada) almost 16 years ago

=begin
Hi,

At Wed, 2 Sep 2009 16:43:15 +0900,
Vit Ondruch wrote in [ruby-core:25267]:

> Ok, may be I started from wrong side. What I am trying to
> achieve is unicode support on Windows.

Check out the win32-unicode-test branch.

--
Nobu Nakada

=end

#16

Updated by nobu (Nobuyoshi Nakada) almost 16 years ago

Status changed from Open to Closed

=begin

=end

#17

Updated by usa (Usaku NAKAMURA) almost 16 years ago

=begin
Hello,

In message "[ruby-core:25268] [Bug #2026] String encodings are not supported by most of IO on Linux"
on Sep.02,2009 16:47:45, [email protected] wrote:

> I just want to clarify that if there could be ensured some file system encoding, it would be easier to do explicit conversion into Windows encoding. That is my interest of whole topic.

You can see what we are doing about Unicode path name support on
Windows in the win32-unicode-test branch.

The results of the branch will be merged into trunk before the 1.9.2
feature freeze.

Regards,

U.Nakamura [email protected]

=end

#18

Updated by vo.x (Vit Ondruch) almost 16 years ago

=begin
Great! I will check it out. It seems to do exactly what I need. Thank you.

Vit
=end
