Masa's blog(2011-06-26)

2011-06-26: ruby-1.9.x and encoding

Scripts written for ruby-1.8.x often fail with errors under ruby-1.9.x, so I checked how ruby-1.9.x handles encodings (of both the source and the data) with a few short scripts.

Environment

$ echo $LC_ALL
ja_JP.eucJP
$

Pattern 1

Here is the data.

$ cat test1.dat
This is line 1.
This is line 2.
This is line 3.
This is line 4.
This is line 5.
$ nkf -g test1.dat
ASCII
$

Here is the source.

$ cat test1.rb
#! /usr/bin/ruby
f = open("test1.dat","r")
while (r = f.gets)
        if (/line 3/ =~ r)
                puts "matching line is #{r}"
        end
end
f.close
$ nkf -g test1.rb
ASCII
$

Here is the result. No problems.

$ ./test1.rb
matching line is This is line 3.
$

Pattern 2

Now use multibyte characters in the source.

$ cat test1.rb
#! /usr/bin/ruby
f = open("test1.dat","r")
while (r = f.gets)
        if (/ライン３/ =~ r)
                puts "matching line is #{r}"
        end
end
f.close
$ nkf -g test1.rb
EUC-JP
$

Here is the result. An error occurs.

$ ./test1.rb
./test1.rb:4: invalid multibyte char (US-ASCII)
./test1.rb:4: invalid multibyte char (US-ASCII)
./test1.rb:4: syntax error, unexpected $end, expecting ')'
        if (/ライン３/ =~ r)
               ^
$
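
The message shows that the parser treated the file as US-ASCII, which is what ruby-1.9.x assumes when a source file declares no encoding. As a rough sketch outside the original test, the check below shows that the EUC-JP bytes of the katakana word in the regexp are not valid US-ASCII but are valid EUC-JP. The constant and helper names are made up for this illustration.

```ruby
# EUC-JP bytes of the katakana word used in the regexp of test1.rb.
# (Hypothetical constant name, used only for this illustration.)
EUC_JP_BYTES = "\xA5\xE9\xA5\xA4\xA5\xF3".b

# Returns true if the bytes form a valid string in the given encoding.
def valid_as?(bytes, encoding_name)
  bytes.dup.force_encoding(encoding_name).valid_encoding?
end

puts valid_as?(EUC_JP_BYTES, "US-ASCII")  # false
puts valid_as?(EUC_JP_BYTES, "EUC-JP")    # true
```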

Pattern 3

Specify the source encoding with a magic comment (line 2).

$ cat test1.rb
#! /usr/bin/ruby
# coding: euc-jp
f = open("test1.dat","r")
while (r = f.gets)
        if (/ライン３/ =~ r)
                puts "matching line is #{r}"
        end
end
f.close
$ nkf -g test1.rb
EUC-JP
$

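Here is the result. The error is avoided. (Nothing is printed because the data is still ASCII-only, so no line matches.)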

$ ./test1.rb
$
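
With the magic comment, the regexp literal becomes an EUC-JP regexp, and matching it against ASCII-only data works because ASCII-only strings are compatible with EUC-JP. A minimal sketch of that situation follows. It builds the regexp from explicit bytes so it does not depend on the encoding of its own file, and the helper name `euc_pattern` is an assumption for illustration.

```ruby
# Builds the pattern from test1.rb as an EUC-JP regexp from explicit bytes.
# (Hypothetical helper; the original script uses a regexp literal instead.)
def euc_pattern
  source = "\xA5\xE9\xA5\xA4\xA5\xF3\xA3\xB3".dup.force_encoding("EUC-JP")
  Regexp.new(source)
end

ascii_line = "This is line 3.\n"

puts euc_pattern.encoding.name   # EUC-JP
puts ascii_line.ascii_only?      # true
p(euc_pattern =~ ascii_line)     # nil: no error, but no match either
```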

Pattern 4

Now put multibyte characters in the data.

$ cat test1.dat
This is ライン１。
This is ライン２。
This is ライン３。
This is ライン４。
This is ライン５。
$ nkf -g test1.dat
EUC-JP
$

Here is the result. No problems.

$ ./test1.rb
matching line is This is ライン３。
$

Pattern 5

Now change the data encoding to utf-8 (nkf -W -e below converts it to EUC-JP only for display).

$ cat test1.dat | nkf -W -e
This is ライン１。
This is ライン２。
This is ライン３。
This is ライン４。
This is ライン５。
$ nkf -g test1.dat
UTF-8

Here is the result. An error occurs.

$ ./test1.rb
./test1.rb:5:in `<main>': invalid byte sequence in EUC-JP (ArgumentError)
$
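
The file is opened without an encoding, so each line is labeled with the default external encoding, which is EUC-JP under LC_ALL=ja_JP.eucJP, even though the bytes are really UTF-8. The sketch below imitates that by relabeling a UTF-8 string as EUC-JP without converting it. The helper names are assumptions for illustration, not part of the original script.

```ruby
# One line of the UTF-8 data file, written with \u escapes.
def utf8_line
  "This is \u30E9\u30A4\u30F3\uFF13\u3002\n"
end

# The pattern from test1.rb as an EUC-JP regexp (hypothetical helper).
def euc_pattern
  Regexp.new("\xA5\xE9\xA5\xA4\xA5\xF3\xA3\xB3".dup.force_encoding("EUC-JP"))
end

# Relabels the bytes as EUC-JP without converting them, which is what
# happens when UTF-8 data is read with an EUC-JP default external encoding.
def read_as_euc(line)
  line.dup.force_encoding("EUC-JP")
end

# Returns the ArgumentError raised by the match, or nil if none is raised.
def match_error(pattern, line)
  pattern =~ line
  nil
rescue ArgumentError => e
  e
end

mislabeled = read_as_euc(utf8_line)
puts mislabeled.valid_encoding?   # false
error = match_error(euc_pattern, mislabeled)
puts error.class if error
```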

Pattern 6

Specify the data encoding (utf-8) when opening the file.

$ cat test1.rb
#! /usr/bin/ruby
# coding: euc-jp
f = open("test1.dat","r:utf-8")
while (r = f.gets)
        if (/ライン３/ =~ r)
                puts "matching line is #{r}"
        end
end
f.close
$ nkf -g test1.rb
EUC-JP
$

Here is the result. The error is not avoided.

$ ./test1.rb
./test1.rb:5:in `<main>': incompatible encoding regexp match (EUC-JP regexp with UTF-8 string) (Encoding::CompatibilityError)
$
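
The data is now correctly labeled as UTF-8, but the regexp is still EUC-JP and both contain non-ASCII characters, so ruby refuses to compare them. The following sketch reproduces that outside the original script and shows that the match succeeds once the string is converted to EUC-JP. The helper names are hypothetical.

```ruby
# One line of the UTF-8 data file, written with \u escapes.
def utf8_line
  "This is \u30E9\u30A4\u30F3\uFF13\u3002\n"
end

# The pattern from test1.rb as an EUC-JP regexp (hypothetical helper).
def euc_pattern
  Regexp.new("\xA5\xE9\xA5\xA4\xA5\xF3\xA3\xB3".dup.force_encoding("EUC-JP"))
end

# Returns the match position (or nil), or the error class if the
# encodings of the regexp and the string are incompatible.
def try_match(pattern, line)
  pattern =~ line
rescue Encoding::CompatibilityError => e
  e.class
end

p try_match(euc_pattern, utf8_line)                   # Encoding::CompatibilityError
p try_match(euc_pattern, utf8_line.encode("EUC-JP"))  # 8
```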

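Pattern 7

Also specify the source encoding (euc-jp) when opening the file. In "r:utf-8:euc-jp" the first name is the encoding of the data and the second is the internal encoding, so each line is converted from utf-8 to euc-jp as it is read.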

$ cat test1.rb
#! /usr/bin/ruby
# coding: euc-jp
f = open("test1.dat","r:utf-8:euc-jp")
while (r = f.gets)
        if (/ライン３/ =~ r)
                puts "matching line is #{r}"
        end
end
f.close
$ nkf -g test1.rb
EUC-JP
$

Here is the result. The error is avoided.

$ ./test1.rb
matching line is This is ライン３。
$
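
To see the conversion itself, the sketch below writes UTF-8 data to a temporary file and reads it back with the same "r:utf-8:euc-jp" mode. It is only an illustration under assumed names (`euc_pattern`, `first_matching_line`), with the temporary file standing in for test1.dat.

```ruby
require "tmpdir"

# The pattern from test1.rb as an EUC-JP regexp (hypothetical helper).
def euc_pattern
  Regexp.new("\xA5\xE9\xA5\xA4\xA5\xF3\xA3\xB3".dup.force_encoding("EUC-JP"))
end

# Writes two UTF-8 lines to a temporary file, then reads them back with
# external encoding utf-8 and internal encoding euc-jp and returns the
# first line that matches the pattern (nil if none does).
def first_matching_line(pattern)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "test1.dat")
    File.binwrite(path, "This is \u30E9\u30A4\u30F3\uFF12\u3002\n" \
                        "This is \u30E9\u30A4\u30F3\uFF13\u3002\n")
    File.open(path, "r:utf-8:euc-jp") do |f|
      f.each_line { |line| return line if pattern =~ line }
    end
  end
  nil
end

line = first_matching_line(euc_pattern)
puts line.encoding.name   # EUC-JP
```

Each line arrives already converted, so the EUC-JP regexp can be used on it directly.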

Extra

When the data comes from standard input, specify the encodings with `set_encoding'.

$ cat test1.rb
#! /usr/bin/ruby
# coding: euc-jp
STDIN.set_encoding("utf-8", "euc-jp")
while (r = STDIN.gets)
        if (/ライン３/ =~ r)
                STDOUT.puts "matching line is #{r}"
        end
end
$ nkf -g test1.rb
EUC-JP

Here is the result. No problems.

$ ./test1.rb <test1.dat
matching line is This is ライン３。
$
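
The same call works on any IO object, not just STDIN. As a small sketch, the code below uses a pipe in place of standard input: UTF-8 bytes are written to one end, and the reading end is given the same `set_encoding("utf-8", "euc-jp")` call. The helper names are assumptions for illustration.

```ruby
# The pattern from test1.rb as an EUC-JP regexp (hypothetical helper).
def euc_pattern
  Regexp.new("\xA5\xE9\xA5\xA4\xA5\xF3\xA3\xB3".dup.force_encoding("EUC-JP"))
end

# Sends UTF-8 bytes through a pipe and reads one line back with
# external encoding utf-8 and internal encoding euc-jp.
def gets_transcoded(utf8_bytes)
  reader, writer = IO.pipe
  writer.binmode              # write the bytes unchanged
  writer.write(utf8_bytes)
  writer.close
  reader.set_encoding("utf-8", "euc-jp")
  reader.gets
ensure
  reader.close if reader && !reader.closed?
end

line = gets_transcoded("This is \u30E9\u30A4\u30F3\uFF13\u3002\n")
puts line.encoding.name   # EUC-JP
p(euc_pattern =~ line)    # 8
```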
