[インデックス 10102] ファイルの概要

このコミットは、Go言語の実験的なHTMLテンプレートパッケージ exp/template/html における文字表現の変更に関するものです。具体的には、文字を扱う際に int 型から rune 型への移行が行われています。これにより、Unicode文字の正確な処理と、それに伴うセキュリティ上の堅牢性が向上しています。

コミット

commit 4911622055d1bcc88182a0c3292115e33c299814
Author: Russ Cox <rsc@golang.org>
Date:   Tue Oct 25 22:22:26 2011 -0700

    exp/template/html: use rune
    
    Nothing terribly interesting here.
    
    R=mikesamuel, nigeltao, r
    CC=golang-dev
    https://golang.org/cl/5307044

GitHub上でのコミットページへのリンク

https://github.com/golang/go/commit/4911622055d1bcc88182a0c3292115e33c299814

元コミット内容

exp/template/html: use rune

このコミットは、exp/template/html パッケージ内で文字を扱う際に int 型の代わりに rune 型を使用するように変更するものです。コミットメッセージには「Nothing terribly interesting here.（特に興味深いことはない）」とありますが、これはGo言語における文字と文字列の扱いに関する重要な変更を示唆しています。

変更の背景

Go言語では、文字列はUTF-8でエンコードされたバイトのシーケンスとして扱われます。個々の文字（Unicodeコードポイント）は rune 型で表現されます。rune は int32 のエイリアスであり、Unicodeコードポイントを格納するために使用されます。

このコミットが行われた2011年当時、Go言語の初期段階では、文字を int 型で扱うことが一般的でした。しかし、int はシステムのアーキテクチャ（32ビットか64ビットか）によってサイズが異なる可能性があり、Unicodeコードポイントを確実に表現するためには int32 または rune を明示的に使用することが推奨されます。

exp/template/html パッケージは、HTMLテンプレートを安全に生成するためのものであり、クロスサイトスクリプティング（XSS）などの脆弱性を防ぐために、入力された文字列を適切にエスケープする役割を担っています。文字の正確な識別と処理は、セキュリティ上非常に重要です。int を使用していると、文字コードの解釈に誤りが生じる可能性があり、特に非ASCII文字や多バイト文字の処理において問題が発生するリスクがありました。

rune 型への移行は、以下の目的で行われました。

Unicodeの正確な処理: rune を使用することで、Go言語がUnicodeコードポイントを扱うための標準的な方法に準拠し、多言語対応や特殊文字の処理をより堅牢にする。
コードの明確性: int ではなく rune を使用することで、その変数が文字（Unicodeコードポイント）を表していることがコードを読む人にとって明確になる。
潜在的なバグの回避: int のサイズが環境によって異なることによる予期せぬ挙動や、Unicode文字の誤った処理に起因するバグを未然に防ぐ。
セキュリティの向上: HTMLテンプレートにおけるエスケープ処理において、文字の正確な識別はXSSなどの脆弱性対策に直結するため、rune の使用はセキュリティの堅牢性を高める。

前提知識の解説

Go言語の `rune` 型

Go言語において、string は読み取り専用のバイトスライスであり、UTF-8でエンコードされたテキストを表します。string をイテレートすると、個々のバイトではなく、Unicodeコードポイント（文字）が rune 型として返されます。

byte: uint8 のエイリアスで、1バイトのデータを表します。ASCII文字は1バイトで表現されます。
rune: int32 のエイリアスで、Unicodeコードポイントを表します。UTF-8では1文字が1〜4バイトで表現されるため、rune はその1文字全体を格納できるサイズを持っています。

例えば、for i, r := range "こんにちは" のようにループを回すと、r は各文字の rune 値になります。

UnicodeとUTF-8

Unicode: 世界中の文字を統一的に扱うための文字コードの国際標準です。各文字には一意の「コードポイント」が割り当てられています。例えば、A は U+0041、あ は U+3042 です。
UTF-8: Unicodeのコードポイントをバイト列にエンコードするための可変長エンコーディング方式です。ASCII文字は1バイトで表現され、それ以外の文字は2バイト以上で表現されます。これにより、ASCII互換性を保ちつつ、効率的に多言語を扱うことができます。

Go言語は内部的にUTF-8を強く意識して設計されており、文字列処理において rune を使用することは、この設計思想に沿ったものです。

`exp/template/html` パッケージ

exp/template/html は、Go言語の標準ライブラリ html/template の前身、または実験的なバージョンであったと考えられます。このパッケージの主な目的は、Goのテンプレートエンジンを使用してHTMLを生成する際に、自動的にコンテキストに応じたエスケープ処理を行うことで、クロスサイトスクリプティング（XSS）攻撃などのWebセキュリティ脆弱性を防ぐことです。

例えば、ユーザーが入力した文字列をそのままHTMLに出力すると、悪意のあるスクリプトが埋め込まれてしまう可能性があります。このパッケージは、そのようなリスクを軽減するために、HTMLの特定のコンテキスト（例：属性値、JavaScriptコード内、CSSスタイル内）に応じて、適切なエスケープ処理を自動的に適用します。

技術的詳細

このコミットの技術的な核心は、文字を扱う関数の引数や変数において、int 型から rune 型への変更を徹底した点にあります。

変更された主な関数は以下の通りです。

isCSSNmchar(rune int) -> isCSSNmchar(r rune): CSS識別子に許可される文字を判定する関数。
hexDecode(s []byte) int -> hexDecode(s []byte) rune: 16進数シーケンスをデコードする関数。
isJSIdentPart(rune int) -> isJSIdentPart(r rune): JavaScript識別子の一部として許可される文字を判定する関数。

これらの変更は、単に型名を変更するだけでなく、関数内部での文字の比較や操作においても、int ではなく rune として扱うように修正されています。

例えば、isCSSNmchar 関数では、以下のような変更が見られます。

-	return 'a' <= rune && rune <= 'z' ||
-		'A' <= rune && rune <= 'Z' ||
-		'0' <= rune && rune <= '9' ||
-		'-' == rune ||
-		'_' == rune ||
+	return 'a' <= r && r <= 'z' ||
+		'A' <= r && r <= 'Z' ||
+		'0' <= r && r <= '9' ||
+		r == '-' ||
+		r == '_' ||

これは、引数名が rune から r に変更されただけでなく、比較演算子も rune 変数に対して直接適用されるようになっています。

また、decodeCSS 関数内では、hexDecode の結果を rune 型として受け取り、utf8.EncodeRune に渡すことで、Unicodeコードポイントをバイト列に正確にエンコードしています。

-			rune := hexDecode(s[1:j])
-			if rune > unicode.MaxRune {
-				rune, j = rune/16, j-1
+			r := hexDecode(s[1:j])
+			if r > unicode.MaxRune {
+				r, j = r/16, j-1
 			}
-			n := utf8.EncodeRune(b[len(b):cap(b)], rune)
+			n := utf8.EncodeRune(b[len(b):cap(b)], r)

この変更は、Go言語の文字列処理におけるベストプラクティスに準拠するものであり、特にセキュリティが要求されるテンプレートエンジンにおいては、文字の正確な処理が不可欠です。int を使用した場合、プラットフォーム依存の挙動や、Unicodeの範囲外の値を誤って処理してしまうリスクがありましたが、rune を明示的に使用することで、これらの問題を回避し、より堅牢なコードを実現しています。

コアとなるコードの変更箇所

このコミットでは、以下の4つのファイルが変更されています。

src/pkg/exp/template/html/css.go
src/pkg/exp/template/html/css_test.go
src/pkg/exp/template/html/html.go
src/pkg/exp/template/html/js.go

主な変更は、関数の引数、戻り値、および変数宣言における int 型から rune 型への変更です。

`src/pkg/exp/template/html/css.go`

// isCSSNmchar returns whether rune is allowed anywhere in a CSS identifier.
-func isCSSNmchar(rune int) bool {
+func isCSSNmchar(r rune) bool {
 	// Based on the CSS3 nmchar production but ignores multi-rune escape
 	// sequences.
 	// http://www.w3.org/TR/css3-syntax/#SUBTOK-nmchar
-	return 'a' <= rune && rune <= 'z' ||
-		'A' <= rune && rune <= 'Z' ||
-		'0' <= rune && rune <= '9' ||
-		'-' == rune ||
-		'_' == rune ||
+	return 'a' <= r && r <= 'z' ||
+		'A' <= r && r <= 'Z' ||
+		'0' <= r && r <= '9' ||
+		r == '-' ||
+		r == '_' ||
 		// Non-ASCII cases below.
-		0x80 <= rune && rune <= 0xd7ff ||
-		0xe000 <= rune && rune <= 0xfffd ||
-		0x10000 <= rune && rune <= 0x10ffff
+		0x80 <= r && r <= 0xd7ff ||
+		0xe000 <= r && r <= 0xfffd ||
+		0x10000 <= r && r && r <= 0x10ffff
 }

// decodeCSS decodes CSS3 escapes given a sequence of stringchars.
func decodeCSS(s []byte) []byte {
	b := make([]byte, 0, len(s))
	for i := 0; i < len(s); {
		c := s[i]
		if c == '\\' && i+1 < len(s) {
			j := i + 1
			if isHex(s[j]) {
				for j < len(s) && j < 7 && isHex(s[j]) {
					j++
				}
-				rune := hexDecode(s[1:j])
-				if rune > unicode.MaxRune {
-					rune, j = rune/16, j-1
+				r := hexDecode(s[1:j])
+				if r > unicode.MaxRune {
+					r, j = r/16, j-1
 				}
-				n := utf8.EncodeRune(b[len(b):cap(b)], rune)
+				n := utf8.EncodeRune(b[len(b):cap(b)], r)
 				b = b[:len(b)+n]
 				if j < len(s) && isSpace(s[j]) {
 					j++
 				}
 				i = j
 				continue
 			}
 		}
		// ... (省略)
	}
	return b
}

// hexDecode decodes a short hex digit sequence: "10" -> 16.
-func hexDecode(s []byte) int {
-	n := 0
+func hexDecode(s []byte) rune {
+	n := rune(0)
 	for _, c := range s {
 		n <<= 4
 		switch {
 		case '0' <= c && c <= '9':
-			n |= int(c - '0')
+			n |= rune(c - '0')
 		case 'a' <= c && c <= 'f':
-			n |= int(c-'a') + 10
+			n |= rune(c-'a') + 10
 		case 'A' <= c && c <= 'F':
-			n |= int(c-'A') + 10
+			n |= rune(c-'A') + 10
 		default:
 			panic(fmt.Sprintf("Bad hex digit in %q", s))
 		}
 	}
 	return n
}

// cssValueFilter filters a value for use in a CSS context.
func cssValueFilter(args ...interface{}) string {
	// ... (省略)
	for i, c := range b {
		switch c {
		case '-':
			// Disallow <!-- or -->.
			// -- should not appear in valid identifiers.
-			if i != 0 && '-' == b[i-1] {
+			if i != 0 && b[i-1] == '-' {
 				return filterFailsafe
 			}
 		default:
-			if c < 0x80 && isCSSNmchar(int(c)) {
+			if c < 0x80 && isCSSNmchar(rune(c)) {
 				id = append(id, c)
 			}
 		}
	}
	return string(id)
}

`src/pkg/exp/template/html/css_test.go`

テストコードでも、isCSSNmchar のテストケースの rune フィールドが int から rune に変更されています。また、TestHexDecode では hexDecode の戻り値を int にキャストして比較しています。

func TestIsCSSNmchar(t *testing.T) {
	tests := []struct {
-		rune int
+		rune rune
 		want bool
 	}{
 		{0, false},
		// ... (省略)
	}
	// ... (省略)
}

func TestHexDecode(t *testing.T) {
	for i := 0; i < 0x200000; i += 101 /* coprime with 16 */ {
		s := strconv.Itob(i, 16)
-		if got := hexDecode([]byte(s)); got != i {
+		if got := int(hexDecode([]byte(s))); got != i {
 			t.Errorf("%s: want %d but got %d", s, i, got)
 		}
 		s = strings.ToUpper(s)
-		if got := hexDecode([]byte(s)); got != i {
+		if got := int(hexDecode([]byte(s))); got != i {
 			t.Errorf("%s: want %d but got %d", s, i, got)
 		\t}
 	}
}

`src/pkg/exp/template/html/html.go`

htmlReplacer 関数内で、r (rune) と len(replacementTable) の比較が int(r) にキャストされています。これは、len が int を返すため、型の一致を保つための変更です。

func htmlReplacer(s string, replacementTable []string, badRunes bool) string {
	written, b := 0, new(bytes.Buffer)
	for i, r := range s {
-		if r < len(replacementTable) {
+		if int(r) < len(replacementTable) {
 			if repl := replacementTable[r]; len(repl) != 0 {
 				b.WriteString(s[written:i])
 				b.WriteString(repl)

`src/pkg/exp/template/html/js.go`

JavaScript関連の処理でも同様に、isJSIdentPart の引数や replace 関数内の比較で int から rune への変更が行われています。

func nextJSCtx(s []byte, preceding jsCtx) jsCtx {
	// ... (省略)
	j := n
-	for j > 0 && isJSIdentPart(int(s[j-1])) {
+	for j > 0 && isJSIdentPart(rune(s[j-1])) {
 		j--
 	}
	// ... (省略)
}

func replace(s string, replacementTable []string) string {
	// ... (省略)
	for i, r := range s {
		var repl string
		switch {
-		case r < len(replacementTable) && replacementTable[r] != "":
+		case int(r) < len(replacementTable) && replacementTable[r] != "":
 			repl = replacementTable[r]
 		case r == '\u2028':
 			repl = `\u2028`
		// ... (省略)
	}
	// ... (省略)
}

// isJSIdentPart returns whether rune is allowed anywhere in a JS identifier.
// It does not handle all the non-Latin letters, joiners, and combining marks,
// but it does handle every codepoint that can occur in a numeric literal or
// a keyword.
-func isJSIdentPart(rune int) bool {
+func isJSIdentPart(r rune) bool {
 	switch {
-	case '$' == rune:
+	case r == '$':
 		return true
-	case '0' <= rune && rune <= '9':
+	case '0' <= r && r <= '9':
 		return true
-	case 'A' <= rune && rune <= 'Z':
+	case 'A' <= r && r <= 'Z':
 		return true
-	case '_' == rune:
+	case r == '_':
 		return true
-	case 'a' <= rune && rune <= 'z':
+	case 'a' <= r && r <= 'z':
 		return true
 	}
 	return false
}

コアとなるコードの解説

このコミットの主要な変更は、Go言語の exp/template/html パッケージ内で文字を扱う際の型を int から rune に統一したことです。

isCSSNmchar および isJSIdentPart 関数の引数:
- これらの関数は、与えられた文字がCSSまたはJavaScriptの識別子として有効かどうかを判定します。
- 変更前は引数が int 型でしたが、変更後は rune 型になりました。これにより、関数がUnicodeコードポイントを直接受け取り、より正確な文字判定が可能になります。
- 関数内部の文字比較も、引数名が r に変更されたことに合わせて修正されています。
hexDecode 関数の戻り値:
- この関数は、16進数文字列をデコードして数値に変換します。
- 変更前は int を返していましたが、変更後は rune を返すようになりました。これは、デコードされた値がUnicodeコードポイントとして扱われることを明確にし、後続の utf8.EncodeRune 関数への連携をスムーズにします。
decodeCSS 関数内の rune 変数:
- CSSエスケープシーケンスをデコードする際に、16進数から変換された文字を一時的に格納する変数が rune 型になりました。
- unicode.MaxRune との比較や、utf8.EncodeRune への引数として rune 型の変数が直接使用されることで、Unicode文字のエンコードがより正確に行われます。
htmlReplacer および replace 関数内の型キャスト:
- これらの関数では、range ループで取得した rune 値 r を replacementTable のインデックスとして使用する際に、int(r) と明示的に int にキャストしています。
- これは、len(replacementTable) が int を返すため、比較演算子の両辺の型を一致させるためのGo言語の慣習的な記述です。rune は int32 のエイリアスですが、異なる型として扱われるため、このようなキャストが必要になります。

これらの変更は、Go言語の文字列と文字の扱いに関する設計思想に沿ったものであり、特にセキュリティが重要なWebアプリケーションのテンプレートエンジンにおいて、Unicode文字の正確な処理とXSSなどの脆弱性対策を強化する上で不可欠な改善と言えます。

参考にした情報源リンク

Go言語の rune について:
- Go Slices: usage and internals - The Go Programming Language (runeの概念が説明されています)
- Strings, bytes, runes and characters in Go - The Go Programming Language (Goにおける文字列、バイト、rune、文字に関する詳細な解説)
UnicodeとUTF-8について:
- UTF-8 - Wikipedia
- Unicode - Wikipedia
Go言語の html/template パッケージ（exp/template/html の後継または関連）:
- html/template package - html/template - Go Packages
CSS3 Syntax Module:
- CSS Syntax Module Level 3 - W3C Recommendation (nmcharプロダクションに関する情報)
JavaScript Language Specification (ECMAScript):
- ECMAScript® 2024 Language Specification (IdentifierNameに関する情報)

comemo