[インデックス 13802] ファイルの概要

コミット

commit 6910356ea899e0c7b19da5f59c6058a737ed5e93
Author: Nigel Tao <nigeltao@golang.org>
Date:   Wed Sep 12 10:40:39 2012 +1000

    strings: fix NewReplacer(old0, new0, old1, new1, ...) to be consistent
    when oldi == oldj.
    
    Benchmark numbers show no substantial change.
    
    R=eric.d.eisner, rogpeppe
    CC=golang-dev
    https://golang.org/cl/6496104

GitHub上でのコミットページへのリンク

https://github.com/golang/go/commit/6910356ea899e0c7b19da5f59c6058a737ed5e93

元コミット内容

strings: fix NewReplacer(old0, new0, old1, new1, ...) to be consistent
when oldi == oldj.

Benchmark numbers show no substantial change.

R=eric.d.eisner, rogpeppe
CC=golang-dev
https://golang.org/cl/6496104

変更の背景

このコミットは、Go言語の標準ライブラリstringsパッケージ内のNewReplacer関数の振る舞いを修正し、一貫性を持たせることを目的としています。NewReplacerは、複数の文字列置換ルールを効率的に適用するためのReplacerオブジェクトを作成します。

変更の具体的な背景は、NewReplacerに渡される置換ペア（old, new）の中に、同じold文字列が複数回出現する場合（例: NewReplacer("a", "1", "a", "2")）の挙動が不明確または非一貫であったことです。以前の実装では、このような重複するold文字列に対する置換ルールが、どのペアを優先するかについて明確な定義がなく、実装の詳細によって結果が変わる可能性がありました。

このコミットでは、同じold文字列が複数回指定された場合、最初に指定された置換ペアが優先されるという明確なルールを導入し、そのように動作するように実装を修正しています。これにより、NewReplacerの利用者は、重複する置換ルールが与えられた場合でも、予測可能で一貫した結果を得られるようになります。テストケースの変更（"br2d"から"br1d"への期待値変更）が、この新しい一貫した挙動を明確に示しています。

また、コミットメッセージに「Benchmark numbers show no substantial change.」とあるように、この修正がパフォーマンスに大きな影響を与えないことも確認されています。

前提知識の解説

Go言語の`strings`パッケージと`Replacer`

Go言語のstringsパッケージは、文字列操作のための基本的な機能を提供します。その中でもReplacer型は、複数の置換ルールを一度に定義し、それを文字列に適用することで、効率的な一括置換を可能にします。これは、特に多数の置換が必要な場合や、同じ置換ルールセットを繰り返し使用する場合に有用です。

strings.NewReplacer関数は、可変長引数としてoldとnewの文字列ペアを受け取り、*Replacer型のインスタンスを返します。例えば、strings.NewReplacer("a", "A", "b", "B")は、文字列中の"a"を"A"に、"b"を"B"に置換するReplacerを作成します。

`Replacer`の内部実装戦略

Replacerは、置換対象の文字列（old）と置換後の文字列（new）の特性（例えば、oldが1バイト文字か、newが1バイト文字かなど）に応じて、内部的に異なる最適化された実装（アルゴリズム）を選択します。これはパフォーマンスを最大化するためです。

主な内部実装は以下の3種類です。

byteReplacer: oldもnewもすべて1バイト文字の場合に選択されます。最も高速な置換が可能です。内部的にはバイト配列やビットセットを用いて、高速なルックアップと置換を行います。
byteStringReplacer: oldはすべて1バイト文字だが、newは複数バイト文字（または空文字列）を含む場合に選択されます。byteReplacerよりは複雑ですが、oldが1バイトであるという特性を活かして最適化されます。
genericReplacer: 上記のいずれにも当てはまらない場合、つまりold文字列が複数バイト文字を含む場合に選択されます。最も汎用的な実装であり、最も低速になる可能性があります。

これらの実装は、NewReplacerが引数として受け取ったoldnewペアを解析し、最適なものを動的に選択します。

ビットセットによる高速なバイトチェック

byteReplacerやbyteStringReplacerの内部では、置換対象のバイト（old）が既に登録されているかを効率的にチェックするためにビットセットのようなデータ構造が使われています。コード中のo>>5やo&31といったビット演算は、256種類のバイト値を32ビットのuint32の配列（[8]uint32）にマッピングし、特定のバイトが存在するかどうかを高速に判定するためのものです。

o >> 5: バイト値oを32で割った商（o / 32）に相当し、uint32配列のインデックスを計算します。
o & 31: バイト値oを32で割った余り（o % 32）に相当し、uint32内のどのビットを操作するかを決定します。この方法により、特定のバイトが存在するかどうかのチェックが非常に高速に行えます。

技術的詳細

このコミットの技術的な核心は、strings.NewReplacerが内部で最適なReplacer実装を選択するロジックと、重複する置換ルール（oldi == oldj）の処理方法の変更にあります。

1. `NewReplacer`のアルゴリズム選択ロジックの変更

修正前は、NewReplacerはすべてのoldnewペアを一度ループし、byteReplacer、byteStringReplacer、genericReplacerのそれぞれの候補を同時に構築しようとしていました。そしてループ終了後に、どの候補が最も適切かを判断して返していました。

修正後は、アルゴリズム選択のフローがより明確になりました。

まず、すべてのold文字列が1バイトであるかをチェックします。
- もし1つでもold文字列が1バイトでなければ、即座にgenericReplacerを構築して返します。これは、genericReplacerが最も汎用的なため、これ以外の最適化された実装は適用できないからです。
すべてのold文字列が1バイトであることが確認された場合、次にすべてのnew文字列が1バイトであるかをチェックします。
- もしすべてのnew文字列も1バイトであれば、byteReplacerを構築して返します。
- そうでなければ（oldは1バイトだがnewは複数バイトの場合）、byteStringReplacerを構築して返します。

この変更により、不要なReplacer候補の構築を避け、コードの可読性と効率が向上しています。

2. 重複する`old`文字列の処理ロジックの導入

最も重要な変更点は、byteReplacerとbyteStringReplacerの構築ループ内に、重複するold文字列に対する処理ロジックが追加されたことです。

// src/pkg/strings/replace.go (抜粋)
// byteReplacer の構築ループ内
if bb.old[o>>5]&uint32(1<<(o&31)) != 0 {
    // Later old->new maps do not override previous ones with the same old string.
    continue
}
bb.old.set(o)
bb.new[o] = n

// byteStringReplacer の構築ループ内
if bs.old[o>>5]&uint32(1<<(o&31)) != 0 {
    // Later old->new maps do not override previous ones with the same old string.
    continue
}
bs.old.set(o)
bs.new[o] = []byte(new)

このコードスニペットが示すように、bb.old（またはbs.old）というビットセットが、既に現在のoldバイトoを記録しているかどうかをチェックしています。もし記録済みであれば、それは同じoldバイトに対する以前の置換ルールが既に存在することを意味します。この場合、continue文によって現在の置換ペアはスキップされ、以前のルールが維持されます。これにより、「後から指定された同じold文字列に対する置換ルールは、以前のルールを上書きしない」という一貫した挙動が保証されます。

この修正は、NewReplacer("a", "1", "a", "2")のようなケースで、以前は"br2d"（最後のルールが適用）となる可能性があったものが、確実に"br1d"（最初のルールが適用）となるように変更されたことを意味します。

--- a/src/pkg/strings/replace.go
+++ b/src/pkg/strings/replace.go
@@ -33,47 +33,41 @@ func NewReplacer(oldnew ...string) *Replacer {
 		panic("strings.NewReplacer: odd argument count")
 	}\n 
-\t// Possible implementations.\n-\tvar (\n-\t\tbb  byteReplacer\n-\t\tbs  byteStringReplacer\n-\t\tgen genericReplacer\n-\t)\n-\n-\tallOldBytes, allNewBytes := true, true
-\tfor len(oldnew) > 0 {\n-\t\told, new := oldnew[0], oldnew[1]\n-\t\toldnew = oldnew[2:]
-\t\tif len(old) != 1 {\n-\t\t\tallOldBytes = false
+\tallNewBytes := true
+\tfor i := 0; i < len(oldnew); i += 2 {\n+\t\tif len(oldnew[i]) != 1 {\n+\t\t\treturn &Replacer{r: makeGenericReplacer(oldnew)}\n \t\t}\n-\t\tif len(new) != 1 {\n+\t\tif len(oldnew[i+1]) != 1 {\n \t\t\tallNewBytes = false
 \t\t}\n+\t}\n \n-\t\t// generic
-\t\tgen.p = append(gen.p, pair{old, new})\n-\n-\t\t// byte -> string
-\t\tif allOldBytes {\n-\t\t\tbs.old.set(old[0])
-\t\t\tbs.new[old[0]] = []byte(new)
-\t\t}\n-\n-\t\t// byte -> byte
-\t\tif allOldBytes && allNewBytes {\n-\t\t\tbb.old.set(old[0])
-\t\t\tbb.new[old[0]] = new[0]
+\tif allNewBytes {\n+\t\tbb := &byteReplacer{}\n+\t\tfor i := 0; i < len(oldnew); i += 2 {\n+\t\t\to, n := oldnew[i][0], oldnew[i+1][0]\n+\t\t\tif bb.old[o>>5]&uint32(1<<(o&31)) != 0 {\n+\t\t\t\t// Later old->new maps do not override previous ones with the same old string.\n+\t\t\t\tcontinue
+\t\t\t}\n+\t\t\tbb.old.set(o)
+\t\t\tbb.new[o] = n
 \t\t}\n+\t\treturn &Replacer{r: bb}\n \t}\n \n-\tif allOldBytes && allNewBytes {\n-\t\treturn &Replacer{r: &bb}\n-\t}\n-\tif allOldBytes {\n-\t\treturn &Replacer{r: &bs}\n+\tbs := &byteStringReplacer{}\n+\tfor i := 0; i < len(oldnew); i += 2 {\n+\t\to, new := oldnew[i][0], oldnew[i+1]\n+\t\tif bs.old[o>>5]&uint32(1<<(o&31)) != 0 {\n+\t\t\t// Later old->new maps do not override previous ones with the same old string.\n+\t\t\tcontinue
+\t\t}\n+\t\tbs.old.set(o)
+\t\tbs.new[o] = []byte(new)
 \t}\n-\treturn &Replacer{r: &gen}\n+\treturn &Replacer{r: bs}\n }\n \n // Replace returns a copy of s with all replacements performed.\n@@ -94,6 +88,16 @@ type genericReplacer struct {\n \n type pair struct{ old, new string }\n \n+func makeGenericReplacer(oldnew []string) *genericReplacer {\n+\tgen := &genericReplacer{\n+\t\tp: make([]pair, len(oldnew)/2),\n+\t}\n+\tfor i := 0; i < len(oldnew); i += 2 {\n+\t\tgen.p[i/2] = pair{oldnew[i], oldnew[i+1]}\n+\t}\n+\treturn gen\n+}\n+\n type appendSliceWriter struct {\n \tb []byte\n }\n```

### `src/pkg/strings/replace_test.go`

```diff
--- a/src/pkg/strings/replace_test.go
+++ b/src/pkg/strings/replace_test.go
@@ -70,7 +70,7 @@ func TestReplacer(t *testing.T) {
 		testCase{inc, "\x00\xff", "\x01\x00"},
 		testCase{inc, "", ""},
 
-\t\ttestCase{NewReplacer("a", "1", "a", "2"), "brad", "br2d"}, // TODO: should this be "br1d"?
+\t\ttestCase{NewReplacer("a", "1", "a", "2"), "brad", "br1d"},
 	)
 
 	// repeat maps "a"->"a", "b"->"bb", "c"->"ccc", ...
@@ -95,7 +95,7 @@ func TestReplacer(t *testing.T) {
 		testCase{repeat, "abba", "abbbba"},
 		testCase{repeat, "", ""},
 
-\t\ttestCase{NewReplacer("a", "11", "a", "22"), "brad", "br22d"}, // TODO: should this be "br11d"?
+\t\ttestCase{NewReplacer("a", "11", "a", "22"), "brad", "br11d"},
 	)
 
 	// The remaining test cases have variable length old strings.
@@ -246,6 +246,14 @@ func TestReplacer(t *testing.T) {
 		testCase{blankFoo, "", "X"},
 	)
 
+\t// No-arg test cases.
+\n+\tnop := NewReplacer()\n+\ttestCases = append(testCases,\n+\t\ttestCase{nop, "abc", "abc"},\n+\t\ttestCase{nop, "", ""},\n+\t)
+\n \t// Run the test cases.\n \n \tfor i, tc := range testCases {\n@@ -277,9 +277,10 @@ func TestPickAlgorithm(t *testing.T) {
 \t\twant string
 \t}{\n \t\t{capitalLetters, "*strings.byteReplacer"},\n+\t\t{htmlEscaper, "*strings.byteStringReplacer"},\n \t\t{NewReplacer("12", "123"), "*strings.genericReplacer"},\n \t\t{NewReplacer("1", "12"), "*strings.byteStringReplacer"},\n-\t\t{htmlEscaper, "*strings.byteStringReplacer"},\n+\t\t{NewReplacer("a", "1", "b", "12", "cde", "123"), "*strings.genericReplacer"},\n \t}\n \tfor i, tc := range testCases {\n \t\tgot := fmt.Sprintf("%T", tc.r.Replacer())\n```

## コアとなるコードの解説

### `src/pkg/strings/replace.go`の変更点

1.  **アルゴリズム選択ロジックの再構築**:
    *   以前は`allOldBytes`, `allNewBytes`といったフラグをループ内で更新し、最後に`if`文で最適な`Replacer`を選択していました。
    *   新しいコードでは、まず`oldnew`ペアをループし、`old`文字列が1バイトでないものが一つでもあれば、即座に`makeGenericReplacer`を呼び出して`genericReplacer`を返します。これにより、最も複雑なケースを早期に処理し、以降の処理を簡素化しています。
    *   すべての`old`文字列が1バイトであることが確認された後、`allNewBytes`フラグ（これはループ内で`new`文字列が1バイトでない場合に`false`になる）に基づいて、`byteReplacer`または`byteStringReplacer`のどちらを構築するかが決定されます。

2.  **重複する`old`文字列の処理**:
    *   `byteReplacer`と`byteStringReplacer`を構築するループ内で、以下の条件が追加されました。
        ```go
        if bb.old[o>>5]&uint32(1<<(o&31)) != 0 {
            // Later old->new maps do not override previous ones with the same old string.
            continue
        }
        ```
        これは、`bb.old`（または`bs.old`）というビットセットが、既に現在の`old`バイト`o`を記録しているかどうかをチェックしています。もし記録済みであれば、それは同じ`old`バイトに対する以前の置換ルールが既に存在することを意味します。この場合、`continue`文によって現在の置換ペアはスキップされ、以前のルールが維持されます。これにより、「後から指定された同じ`old`文字列に対する置換ルールは、以前のルールを上書きしない」という一貫した挙動が実現されます。

3.  **`makeGenericReplacer`関数の導入**:
    *   `genericReplacer`の初期化と`pair`スライスの構築ロジックが、`NewReplacer`関数から`makeGenericReplacer`という独立した関数に切り出されました。これにより、コードの関心事が分離され、`NewReplacer`の主要なロジックがより読みやすくなっています。

### `src/pkg/strings/replace_test.go`の変更点

1.  **期待値の修正**:
    *   `NewReplacer("a", "1", "a", "2"), "brad", "br2d"`というテストケースの期待値が`"br1d"`に変更されました。
    *   同様に、`NewReplacer("a", "11", "a", "22"), "brad", "br22d"`の期待値が`"br11d"`に変更されました。
    *   これらの変更は、`NewReplacer`が重複する`old`文字列に対して、**最初に指定された置換ルールを優先する**という新しい一貫した挙動を反映しています。以前の`TODO`コメントも削除され、この挙動が確定したことを示しています。

2.  **引数なし`NewReplacer`のテストケース追加**:
    *   `NewReplacer()`（引数なし）が呼び出された場合のテストケースが追加されました。これは、空の`Replacer`が文字列を何も変更しないことを確認するものです。

3.  **`TestPickAlgorithm`のテストケース追加**:
    *   `NewReplacer("a", "1", "b", "12", "cde", "123")`という新しいテストケースが追加され、これが正しく`*strings.genericReplacer`を返すことを確認しています。これは、`old`文字列に複数バイトの`"cde"`が含まれるため、`genericReplacer`が選択されるべきであることを示しています。

これらの変更により、`strings.NewReplacer`の挙動がより予測可能になり、テストカバレッジも向上しています。

## 関連リンク

*   Go言語 `strings` パッケージ公式ドキュメント: [https://pkg.go.dev/strings](https://pkg.go.dev/strings)
*   Go言語 `strings.NewReplacer` 関数ドキュメント: [https://pkg.go.dev/strings#NewReplacer](https://pkg.go.dev/strings#NewReplacer)
*   Gerrit Code Review: [https://go-review.googlesource.com/6496104](https://go-review.googlesource.com/6496104)

## 参考にした情報源リンク

*   Go言語のソースコード (`src/pkg/strings/replace.go`, `src/pkg/strings/replace_test.go`)
*   Go言語の公式ドキュメント
*   ビット演算に関する一般的な知識 (例: ビットセットの実装)
*   Gerrit Code Review: [https://go-review.googlesource.com/6496104](https://go-review.googlesource.com/6496104)

comemo