[インデックス 1288] ファイルの概要

このコミットは、Go言語の初期のコミットの一つであり、tabwriter パッケージにおけるUTF-8テキストの取り扱いを改善することを目的としています。具体的には、タブ区切りのテキストを整形する際に、ASCII文字だけでなくマルチバイト文字であるUTF-8文字の幅も正しく計算し、アラインメントを維持できるように修正が加えられています。

コミット

tabwriter パッケージは、テキストをカラム形式で整形し、各カラムの幅を揃えるためのユーティリティを提供します。このコミット以前は、tabwriter はテキストをASCIIエンコーディングとして扱い、文字幅をバイト数と同一視していました。しかし、UTF-8のような可変長エンコーディングでは、1文字が複数バイトで表現されることがあり、単純なバイト数での幅計算では表示上のアラインメントが崩れる問題がありました。このコミットは、この問題を解決し、UTF-8テキストが正しく整形されるように tabwriter を更新します。

GitHub上でのコミットページへのリンク

https://github.com/golang/go/commit/8aeb8647c5be40ef4e85649453da9ca3c52a42e5

元コミット内容

- handle UTF-8 text in tabwriter

R=r
DELTA=84  (27 added, 3 deleted, 54 changed)
OCL=20539
CL=20584
---
 src/lib/tabwriter/tabwriter.go      | 110 ++++++++++++++++++++++--------------
 src/lib/tabwriter/tabwriter_test.go |  24 ++++----
 2 files changed, 79 insertions(+), 55 deletions(-)

変更の背景

Go言語は当初から国際化対応を重視しており、UTF-8はGoの文字列の標準エンコーディングです。tabwriter のようなテキスト処理ユーティリティがASCIIのみを前提としていると、非ASCII文字を含むテキストを扱う際に表示が崩れるという問題が発生します。特に、日本語や中国語のような東アジアの言語では、1文字が2バイト以上で表現されることが一般的であり、これらの文字が混在するテキストを整形する際には、バイト数ではなく「文字幅」（表示上のグリフの幅）に基づいてアラインメントを行う必要があります。

このコミット以前の tabwriter は、文字幅の計算をバイト数で行っていたため、例えば日本語の「本」のような文字が1バイト文字と同じ幅として扱われ、結果としてカラムがずれて表示されるという不具合がありました。この変更は、Go言語が提供する utf8 パッケージを利用して、文字のバイト数とルーン（Unicodeコードポイント）数を区別し、表示上の幅を正確に計算することで、この問題を解決しています。

前提知識の解説

UTF-8: Unicode文字を可変長でエンコードする方式の一つ。ASCII文字は1バイトで表現され、非ASCII文字は2バイト以上で表現されます。これにより、ASCIIとの互換性を保ちつつ、世界中の多様な文字を表現できます。
ルーン (Rune): Go言語におけるルーンは、Unicodeコードポイントを表す int32 型のエイリアスです。Goの文字列はUTF-8バイトのシーケンスとして内部的に表現されますが、個々のUnicode文字を扱う際にはルーンとして扱われます。
文字幅とバイト数:
- バイト数: 文字列を構成するバイトの総数。UTF-8では、1つのルーンが1バイトから4バイトの範囲で表現されます。
- 文字幅: 画面上での文字の表示幅。多くのフォントでは、ASCII文字は1単位の幅を持ち、一部の東アジア文字（全角文字）は2単位の幅を持つことがあります。このコミットでは、簡略化のため、すべてのUTF-8文字が同じ幅（通常は1単位）を持つと仮定しています。これは、ターミナルや等幅フォントでの表示を想定しているためです。
tabwriter パッケージ: Go言語の標準ライブラリの一部で、タブ区切りのテキストを整形し、カラムを揃える機能を提供します。例えば、以下のようなテキストを整形する際に使用されます。
```
Name    Age  City
Alice   30   New York
Bob     25   London
Charlie 35   Paris
```
このパッケージは、各カラムの最大幅を計算し、それに応じてパディングを追加することで、上記のようにきれいに揃った出力を生成します。

技術的詳細

このコミットの主要な変更点は、tabwriter がテキストの「バイト数」と「表示上の幅（ルーン数）」を区別して管理するようになったことです。

utf8 パッケージの導入: tabwriter.go に import "utf8" が追加されました。これにより、UTF-8エンコードされたバイト列からルーンをデコードし、そのバイトサイズやルーン数を正確に取得できるようになります。
Writer 構造体の変更:
- width int: 以前は「最後の不完全なセルのバイト幅」を表していましたが、このコミットにより「最後の不完全なセルのルーン幅」を表すようになりました。
- size int: 新たに追加されたフィールドで、「最後の不完全なセルのバイトサイズ」を表します。
- lines array.Array: 以前は各行のセルのバイト幅のリストを保持していましたが、lines_size と lines_width の2つのフィールドに分割されました。
- lines_size array.Array: 各行のセルのバイトサイズのリストを保持します。
- lines_width array.Array: 各行のセルのルーン幅のリストを保持します。
- widths array.IntArray: カラムの最大幅をルーン幅で保持するようになりました。
UnicodeLen 関数の追加:
```
func UnicodeLen(buf *[]byte) int {
    l := 0;
    for i := 0; i < len(buf); {
        if buf[i] < utf8.RuneSelf {
            i++;
        } else {
            _, size := utf8.DecodeRune(buf[i : len(buf)]);
            i += size;
        }
        l++;
    }
    return l;
}
```
この関数は、与えられたバイトスライス buf 内のルーン（Unicodeコードポイント）の数を正確に数えます。utf8.RuneSelf はASCII文字の最大値（128）であり、それより小さい値であれば1バイト文字として処理し、そうでなければ utf8.DecodeRune を使ってルーンをデコードし、そのバイトサイズ分だけインデックスを進めます。これにより、マルチバイト文字も正しく1ルーンとしてカウントされます。
Append メソッドの変更: Writer.Append メソッドは、入力されたバイト列を内部バッファに追加する際に、以前は b.width += len(buf) のようにバイト数を直接幅として加算していました。この変更により、b.size += len(buf) でバイトサイズを更新し、b.width += UnicodeLen(buf) でルーン幅を更新するようになりました。
WriteLines および Format メソッドの変更: これらのメソッドは、実際にテキストを書き出す際やカラム幅を計算する際に、lines_size と lines_width の両方を利用するようになりました。これにより、パディングの計算にはルーン幅 (b.width) を使用し、実際のバイト列の書き込みにはバイトサイズ (b.size) を使用することで、UTF-8テキストの正確なアラインメントを実現しています。
テストケースの追加/修正: tabwriter_test.go では、日本語の「本」や「日本語」、アクセント付きの「è」などのマルチバイト文字を含むテストケースが追加され、tabwriter がこれらの文字を正しく処理し、期待通りのアラインメントを生成することを確認しています。

コアとなるコードの変更箇所

`src/lib/tabwriter/tabwriter.go`

--- a/src/lib/tabwriter/tabwriter.go
+++ b/src/lib/tabwriter/tabwriter.go
@@ -8,12 +8,12 @@ import (
 	"os";
 	"io";
 	"array";
+	"utf8";
 )


 // ----------------------------------------------------------------------------
 // ByteArray
-// TODO should use a ByteArray library eventually

 type ByteArray struct {
 	a *[]byte;
@@ -62,11 +62,13 @@ func (b *ByteArray) Append(s *[]byte) {

 // ----------------------------------------------------------------------------
 // Writer is a filter implementing the io.Write interface. It assumes
-// that the incoming bytes represent ASCII encoded text consisting of
+// that the incoming bytes represent UTF-8 encoded text consisting of
 // lines of tab-terminated "cells". Cells in adjacent lines constitute
 // a column. Writer rewrites the incoming text such that all cells in
 // a column have the same width; thus it effectively aligns cells. It
-// does this by adding padding where necessary.
+// does this by adding padding where necessary. All characters (ASCII
+// or not) are assumed to be of the same width - this may not be true
+// for arbitrary UTF-8 characters visualized on the screen.
 //
 // Note that any text at the end of a line that is not tab-terminated
 // is not a cell and does not enforce alignment of cells in adjacent
@@ -84,8 +86,6 @@ func (b *ByteArray) Append(s *[]byte) {
 //            (for correct-looking results, cellwidth must correspond
 //            to the tabwidth in the editor used to look at the result)

-// TODO Should support UTF-8 (requires more complicated width bookkeeping)
-

 export type Writer struct {
 	// TODO should not export any of the fields
@@ -97,15 +97,18 @@ export type Writer struct {
 	align_left bool;

 	// current state
-	buf ByteArray;  // the collected text w/o tabs and newlines
-	width int;  // width of last incomplete cell
-	lines array.Array;  // list of lines; each line is a list of cell widths
-	widths array.IntArray;  // list of column widths - re-used during formatting
+	buf ByteArray;  // collected text w/o tabs and newlines
+	size int;  // size of last incomplete cell in bytes
+	width int;  // width of last incomplete cell in runes
+	lines_size array.Array;  // list of lines; each line is a list of cell sizes in bytes
+	lines_width array.Array;  // list of lines; each line is a list of cell widths in runes
+	widths array.IntArray;  // list of column widths in runes - re-used during formatting
 }


 func (b *Writer) AddLine() {
-	b.lines.Push(array.NewIntArray(0));
+	b.lines_size.Push(array.NewIntArray(0));
+	b.lines_width.Push(array.NewIntArray(0));
 }


@@ -125,7 +128,8 @@ func (b *Writer) Init(writer io.Write, cellwidth, padding int, padchar byte, ali
 	b.align_left = align_left || padchar == '\t';  // tab enforces left-alignment

 	b.buf.Init(1024);
-	b.lines.Init(0);
+	b.lines_size.Init(0);
+	b.lines_width.Init(0);
 	b.widths.Init(0);
 	b.AddLine();  // the very first line

@@ -133,21 +137,23 @@ func (b *Writer) Init(writer io.Write, cellwidth, padding int, padchar byte, ali
 }


-func (b *Writer) Line(i int) *array.IntArray {
-	return b.lines.At(i).(*array.IntArray);
+func (b *Writer) Line(i int) (*array.IntArray, *array.IntArray) {
+	return
+		b.lines_size.At(i).(*array.IntArray),
+		b.lines_width.At(i).(*array.IntArray);
 }


 // debugging support
 func (b *Writer) Dump() {
 	pos := 0;
-	for i := 0; i < b.lines.Len(); i++ {
-		line := b.Line(i);
+	for i := 0; i < b.lines_size.Len(); i++ {
+		line_size, line_width := b.Line(i);
 		print("(", i, ") ");
-		for j := 0; j < line.Len(); j++ {
-			w := line.At(j);
-			print("[", string(b.buf.Slice(pos, pos + w)), "]");
-			pos += w;
+		for j := 0; j < line_size.Len(); j++ {
+			s := line_size.At(j);
+			print("[", string(b.buf.Slice(pos, pos + s)), "]");
+			pos += s;
 		}
 		print("\n");
 	}
@@ -198,16 +204,16 @@ exit:
 func (b *Writer) WriteLines(pos0 int, line0, line1 int) (pos int, err *os.Error) {
 	pos = pos0;
 	for i := line0; i < line1; i++ {
-		line := b.Line(i);
-		for j := 0; j < line.Len(); j++ {
-			w := line.At(j);
+		line_size, line_width := b.Line(i);
+		for j := 0; j < line_size.Len(); j++ {
+			s, w := line_size.At(j), line_width.At(j);

 			if b.align_left {
-				err = b.Write0(b.buf.a[pos : pos + w]);
+				err = b.Write0(b.buf.a[pos : pos + s]);
 				if err != nil {
 					goto exit;
 				}
-				pos += w;
+				pos += s;
 				if j < b.widths.Len() {
 					err = b.WritePadding(w, b.widths.At(j));
 					if err != nil {
@@ -223,20 +229,20 @@ func (b *Writer) WriteLines(pos0 int, line0, line1 int) (pos int, err *os.Error)
 					goto exit;
 				}
 				}
-			err = b.Write0(b.buf.a[pos : pos + w]);
+			err = b.Write0(b.buf.a[pos : pos + s]);
 			if err != nil {
 				goto exit;
 			}
-			pos += w;
+			pos += s;
 			}
 		}

-	if i+1 == b.lines.Len() {
+	if i+1 == b.lines_size.Len() {
 			// last buffered line - we don't have a newline, so just write
 			// any outstanding buffered data
-			err = b.Write0(b.buf.a[pos : pos + b.width]);
-			pos += b.width;
-			b.width = 0;
+			err = b.Write0(b.buf.a[pos : pos + b.size]);
+			pos += b.size;
+			b.size, b.width = 0, 0;
 		} else {
 			// not the last line - write newline
 			err = b.Write0(Newline);
@@ -256,9 +262,9 @@ func (b *Writer) Format(pos0 int, line0, line1 int) (pos int, err *os.Error) {
 	column := b.widths.Len();	
 	last := line0;
 	for this := line0; this < line1; this++ {
-		line := b.Line(this);
+		line_size, line_width := b.Line(this);

-		if column < line.Len() - 1 {
+		if column < line_size.Len() - 1 {
 			// cell exists in this column
 			// (note that the last cell per line is ignored)

@@ -272,10 +278,10 @@ func (b *Writer) Format(pos0 int, line0, line1 int) (pos int, err *os.Error) {
 			// column block begin
 			width := b.cellwidth;  // minimal width
 			for ; this < line1; this++ {
-				line = b.Line(this);
-				if column < line.Len() - 1 {
+				line_size, line_width = b.Line(this);
+				if column < line_size.Len() - 1 {
 					// cell exists in this column => update width
-					w := line.At(column) + b.padding;
+					w := line_width.At(column) + b.padding;
 					if w > width {
 						width = w;
 					}
@@ -302,18 +308,35 @@ exit:
 }


+func UnicodeLen(buf *[]byte) int {
+	l := 0;
+	for i := 0; i < len(buf); {
+		if buf[i] < utf8.RuneSelf {
+			i++;
+		} else {
+			rune, size := utf8.DecodeRune(buf[i : len(buf)]);
+			i += size;
+		}
+		l++;
+	}
+	return l;
+}
+
+
 func (b *Writer) Append(buf *[]byte) {
 	b.buf.Append(buf);
-	b.width += len(buf);
+	b.size += len(buf);
+	b.width += UnicodeLen(buf);
 }


 /* export */ func (b *Writer) Flush() *os.Error {
-	dummy, err := b.Format(0, 0, b.lines.Len());
+	dummy, err := b.Format(0, 0, b.lines_size.Len());
 	// reset (even in the presence of errors)
 	b.buf.Clear();
-	b.width = 0;
-	b.lines.Init(0);
+	b.size, b.width = 0, 0;
+	b.lines_size.Init(0);
+	b.lines_width.Init(0);
 	b.AddLine();
 	return err;
 }
@@ -329,13 +352,14 @@ func (b *Writer) Append(buf *[]byte) {
 		i0 = i + 1;  // exclude ch from (next) cell

 		// terminate cell
-		last := b.Line(b.lines.Len() - 1);
-		last.Push(b.width);
-		b.width = 0;
+		last_size, last_width := b.Line(b.lines_size.Len() - 1);
+		last_size.Push(b.size);
+		last_width.Push(b.width);
+		b.size, b.width = 0, 0;

 		if ch == '\n' {
 			b.AddLine();
-			if last.Len() == 1 {
+			if last_size.Len() == 1 {
 				// The previous line has only one cell which does not have
 				// an impact on the formatting of the following lines (the
 				// last cell per line is ignored by Format), thus we can

`src/lib/tabwriter/tabwriter_test.go`

--- a/src/lib/tabwriter/tabwriter_test.go
+++ b/src/lib/tabwriter/tabwriter_test.go
@@ -189,24 +189,24 @@ export func Test(t *testing.T) {

 	Check(
 		t, 8, 1, ' ', true,
-		"a\tb\tc\n"
-		"aa\tbbb\tcccc\tddddd\n"
+		"本\tb\tc\n"
+		"aa\t本本本\tcccc\tddddd\n"
 		"aaa\tbbbb\n",

-		"a       b       c\n"
-		"aa      bbb     cccc    ddddd\n"
+		"本       b       c\n"
+		"aa      本本本     cccc    ddddd\n"
 		"aaa     bbbb\n"
 	);

 	Check(
 		t, 8, 1, ' ', false,
-		"a\tb\tc\t\n"
-		"aa\tbbb\tcccc\tddddd\t\n"
-		"aaa\tbbbb\t\n",
+		"a\tè\tc\t\n"
+		"aa\tèèè\tcccc\tddddd\t\n"
+		"aaa\tèèèè\t\n",

-		"       a       b       c\n"
-		"      aa     bbb    cccc   ddddd\n"
-		"     aaa    bbbb\n"
+		"       a       è       c\n"
+		"      aa     èèè    cccc   ddddd\n"
+		"     aaa    èèèè\n"
 	);

 	Check(
@@ -233,7 +233,7 @@ export func Test(t *testing.T) {

 	Check(
 		t, 4, 1, '-', true,
-		"4444\t333\t22\t1\t333\n"
+		"4444\t日本語\t22\t1\t333\n"
 		"999999999\t22\n"
 		"7\t22\n"
 		"\t\t\t88888888\n"
@@ -241,7 +241,7 @@ export func Test(t *testing.T) {
 		"666666\t666666\t666666\t4444\n"
 		"1\t1\t999999999\t0000000000\n",

-		"4444------333-22--1---333\n"
+		"4444------日本語-22--1---333\n"
 		"999999999-22\n"
 		"7---------22\n"
 		"------------------88888888\n"

コアとなるコードの解説

このコミットの核心は、tabwriter がテキストの長さをバイト数ではなく、表示上の「文字数」（ルーン数）で計算するように変更された点です。

utf8 パッケージの利用: tabwriter.go に utf8 パッケージがインポートされたことで、Goの標準ライブラリが提供するUTF-8処理機能が利用可能になりました。
UnicodeLen 関数の導入: この関数は、バイトスライス内のルーンの数を正確に数えるためのものです。utf8.DecodeRune を使用することで、マルチバイト文字も正しく1ルーンとしてカウントされます。これにより、例えば日本語の「本」（3バイト）も、表示上は1文字として扱われるようになります。
Writer 構造体の size と width の分離:
- size は、内部バッファに格納されているテキストの実際のバイト数を追跡します。これは、io.Writer インターフェースを通じてバイト列を書き込む際に必要です。
- width は、表示上の文字幅（ルーン数）を追跡します。これは、カラムのアラインメント計算（パディングの量など）に使用されます。この分離により、tabwriter はバイト数と文字幅という異なる概念を適切に管理できるようになりました。
lines_size と lines_width の導入: 各行のセルの情報をバイトサイズとルーン幅の両方で保持することで、整形処理の各段階で適切な情報（バイト数での書き込み、ルーン幅でのアラインメント計算）を利用できるようになりました。
テストケースの強化: tabwriter_test.go に追加されたテストケースは、日本語やアクセント付き文字などのマルチバイト文字が正しく整形されることを検証しています。これにより、UTF-8対応が単なるコード変更だけでなく、実際の動作として確認されています。

これらの変更により、tabwriter はUTF-8テキストをより正確に処理し、国際化されたアプリケーションでの利用において、より信頼性の高いテキスト整形機能を提供するようになりました。

参考にした情報源リンク

Go言語の公式ドキュメント
Go言語のソースコード (特に src/lib/tabwriter/ ディレクトリ)
UTF-8に関する一般的な情報源 (例: Wikipedia)

comemo