[Index 13706] File Overview

This commit modifies doc/go_spec.html, the official specification of the Go language. Its main purpose is to clarify the definitions and descriptions of rune and string literals. It makes no semantic changes; it tightens the wording and adds examples of illegal literals. In particular, the terms "character literal" and "character constant" are replaced by "rune literal" and "rune constant", and the word "character" is now used consistently for the source text (the code itself), never for values in a running program.

Commit

commit 9dfc6f6427b4b44d5684dad1ae5cea45a82821ee
Author: Rob Pike <r@golang.org>
Date:   Wed Aug 29 14:46:57 2012 -0700

    go_spec.html: clarify rune and string literals
    No changes to the meaning, just clearer language and more
    examples, including illegal rune and string literals.
    In particular, "character literal" and "character constant"
    are now called "rune literal" and "rune constant" and the
    word "character" always refers to the source text, not
    program values.
    
    R=golang-dev, gri
    CC=golang-dev
    https://golang.org/cl/6448137

Link to the commit page on GitHub

https://github.com/golang/go/commit/9dfc6f6427b4b44d5684dad1ae5cea45a82821ee

Original commit message

go_spec.html: clarify rune and string literals
No changes to the meaning, just clearer language and more
examples, including illegal rune and string literals.
In particular, "character literal" and "character constant"
are now called "rune literal" and "rune constant" and the
word "character" always refers to the source text, not
program values.

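Background of the change

Before this commit, the Go specification used "character" ambiguously, both for a character in the source code and for a Unicode code point value held by a program. This ambiguity could cause confusion, particularly when dealing with Unicode. Go represents Unicode code points with the predeclared type rune, which is an alias for int32.

The commit reflects Go's emphasis on clarity and simplicity. Because the specification is the primary document from which programmers learn the exact behavior of the language, precise definitions and consistent use of terms are essential. The change aims to remove confusion around Unicode handling and to describe Go's notions of strings and characters more accurately.

Specifically, the change is motivated by the following points.

Unified terminology: The word "character" is restricted to text characters in the source code, and Unicode code point values handled by a program are consistently called runes.
Clearer examples: Added examples of illegal rune and string literals show concretely which forms are accepted and which are errors.
Stricter wording: Although nothing changes semantically, the more rigorous wording is meant to prevent future misreadings and inconsistencies between implementations.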

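Prerequisite knowledge

Understanding this commit requires the following basic Go concepts and some knowledge of Unicode.

`rune` and `string` in Go

string type: A Go string is an immutable sequence of bytes. It conventionally holds UTF-8-encoded Unicode text, but this is not required. A string is therefore a sequence of bytes, not necessarily a sequence of characters: UTF-8 is a variable-length encoding in which one code point occupies one to four bytes.
rune type: rune is a predeclared Go type, an alias for int32, used to represent a single Unicode code point. A literal enclosed in single quotes (') is a rune literal; it denotes an untyped rune constant whose default type is rune. For example, 'a' has the value of Unicode code point U+0061 (97).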
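The difference between bytes and runes can be seen in a minimal sketch. The helper name byteAndRuneCounts is hypothetical and introduced only for illustration; len and utf8.RuneCountInString are standard.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCounts (hypothetical helper) returns the number of bytes
// and the number of Unicode code points (runes) in s.
func byteAndRuneCounts(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "日本語"
	b, r := byteAndRuneCounts(s)
	fmt.Println(b, r) // each of the three characters takes 3 bytes in UTF-8

	// Ranging over a string decodes UTF-8 and yields one rune per step;
	// i is the byte offset at which the rune starts.
	for i, c := range s {
		fmt.Printf("%d: %c (%U)\n", i, c, c)
	}

	// A rune is an integer: 'a' is the code point U+0061.
	var x rune = 'a'
	fmt.Println(x == 97)
}
```

Indexing a string with s[i] yields a byte, while range yields runes, which is why the two counts differ for non-ASCII text.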

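Unicode and UTF-8

Unicode: An international character encoding standard for handling the world's characters uniformly. Each character is assigned a unique code point (for example, U+0061 is 'a').
UTF-8: A variable-length encoding that maps Unicode code points to byte sequences. ASCII characters take one byte; all other characters take two to four bytes. Go source code is UTF-8, so string literals written in the source text yield UTF-8-encoded strings.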

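Literals

In a programming language, a literal is a value written directly in the source code.

Rune literal: A literal denoting a single Unicode code point. In Go it is enclosed in single quotes ('). Examples: 'a', '世', '\n', '\u0061'.
String literal: A literal denoting a string. Go has interpreted string literals, enclosed in double quotes ("), and raw string literals, enclosed in back quotes (`).
- Interpreted string literal: Backslash escape sequences are interpreted. Example: "Hello\nWorld".
- Raw string literal: Backslash escape sequences are not interpreted, and the characters appear in the string exactly as written. This form is convenient for strings that span multiple lines. Example: `Hello\nWorld`.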
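A small sketch of the two string literal forms follows. The function name literalLengths is hypothetical; it only exists to make the byte counts easy to check.

```go
package main

import "fmt"

// literalLengths (hypothetical helper) returns the byte lengths of the
// same text written as an interpreted and as a raw string literal.
func literalLengths() (int, int) {
	interpreted := "Hello\nWorld" // \n becomes a single newline byte
	raw := `Hello\nWorld`         // the backslash and the n are both kept
	return len(interpreted), len(raw)
}

func main() {
	i, r := literalLengths()
	fmt.Println(i, r)

	// A raw string literal may span several source lines.
	multi := `line one
line two`
	fmt.Println(multi)
}
```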

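The Go Programming Language Specification

The document that defines the official behavior of Go. It describes the syntax, semantics, predeclared types, packages, and so on in detail. The file changed by this commit, doc/go_spec.html, is the HTML source of this specification.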

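Technical details

Technically, this commit tightens the terminology of the Go specification and clarifies the rules for writing literals.

Terminology changes:
- Character literals -> Rune literals
- Character constants -> Rune constants
- The word "character" is now used only for a Unicode code point in the source text. A "character" handled as a value at run time is consistently called a rune.
Clarified description of rune literals:
- The old text said that a character literal represents a character constant, "typically a Unicode code point". The new text defines it directly: "A rune literal represents a rune constant, an integer value identifying a Unicode code point."
- The text now states that a single character enclosed in single quotes represents the Unicode value of the character itself.
Added examples of illegal rune literals:
- 'aa': too many characters (a rune literal denotes a single code point).
- '\xa': too few hexadecimal digits (\x must be followed by exactly two hexadecimal digits).
- '\0': too few octal digits (an octal escape requires exactly three octal digits after the backslash).
- '\uDFFF': a surrogate half (not a valid code point for a rune literal).
- '\U00110000': an invalid Unicode code point (outside the Unicode range, which ends at U+10FFFF).
These examples show concretely how a Go compiler interprets rune literals and in which cases it reports an error.
Clarified description of string literals:
- The text now says that the value of a raw string literal (``) is composed of the "uninterpreted (implicitly UTF-8-encoded) characters" between the quotes, emphasizing that raw string literals do not interpret escape sequences.
- Backslash escapes in an interpreted string literal ("") are now stated to be interpreted as they are in rune literals, "with the same restrictions", which keeps the two sections consistent.
- As with rune literals, a surrogate half and an invalid Unicode code point were added as examples of illegal string literals.
Unified terminology for constants:
- In the list of constant kinds, "character constants" became "rune constants".
- The description of the built-in function complex now refers to untyped "integer, rune, or floating-point constants" instead of "integer, character, or floating-point constants".
- In the description of the conversion applied when an untyped constant is assigned to a variable of interface type, "boolean, character, integer..." became "boolean, rune, integer...".

These changes aim to make the Go specification more accurate, less ambiguous, and more instructive. Precise terminology matters in particular when mapping the complexity of Unicode and UTF-8 onto Go's simple model.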
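The code point restrictions behind the last two illegal rune literals can be explored at run time. The sketch below is only an illustration and does not show how a compiler performs the check; the names legal and isEncodable are hypothetical, while utf8.ValidRune is a standard library function that rejects surrogate halves and out-of-range values.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// legal holds rune literals taken from the spec's list of legal examples.
var legal = []rune{'a', '\xff', '\u12e4', '\U00101234'}

// isEncodable (hypothetical helper) reports whether r is in the Unicode
// range and is not a surrogate half.
func isEncodable(r rune) bool {
	return utf8.ValidRune(r)
}

func main() {
	for _, r := range legal {
		fmt.Printf("%U valid=%t\n", r, isEncodable(r))
	}

	// The literals '\uDFFF' and '\U00110000' do not compile, so the same
	// values are written here as plain integers.
	fmt.Println(isEncodable(0xDFFF))   // surrogate half
	fmt.Println(isEncodable(0x110000)) // beyond U+10FFFF
}
```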

Core code changes

All changes are in doc/go_spec.html.

The main changes are as follows.

Header update:

--- a/doc/go_spec.html
+++ b/doc/go_spec.html
@@ -1,6 +1,6 @@
 <!--{
 	"Title": "The Go Programming Language Specification",
-	"Subtitle": "Version of August 17, 2012",
+	"Subtitle": "Version of August 29, 2012",
 	"Path": "/ref/spec"
 }-->

The version date of the specification is updated.

Clarified definition of "character":

@@ -88,7 +88,8 @@ Source code is Unicode text encoded in
 canonicalized, so a single accented code point is distinct from the
 same character constructed from combining an accent and a letter;
 those are treated as two code points.  For simplicity, this document
-will use the term <i>character</i> to refer to a Unicode code point.
+will use the unqualified term <i>character</i> to refer to a Unicode code point
+in the source text.
 </p>

The term "character" is now explicitly defined as a Unicode code point in the source text.
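The context lines of this hunk note that source text is not canonicalized, so an accented letter written as one code point differs from the same letter built from a base letter plus a combining accent. A minimal sketch, with the hypothetical helper codePoints:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// codePoints (hypothetical helper) counts the Unicode code points in s.
func codePoints(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	precomposed := "\u00e9" // é as a single code point
	combined := "e\u0301"   // e followed by a combining acute accent

	fmt.Println(codePoints(precomposed), codePoints(combined))
	fmt.Println(precomposed == combined) // compared byte by byte, no normalization

	// Only the single code point form fits in a rune literal; the
	// two code point form would be rejected as having too many characters.
	var r rune = '\u00e9'
	fmt.Printf("%U\n", r)
}
```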

Change from "Character literals" to "Rune literals":

@@ -197,7 +198,7 @@ token is
 	    <a href="#Integer_literals">integer</a>,
 	    <a href="#Floating-point_literals">floating-point</a>,
 	    <a href="#Imaginary_literals">imaginary</a>,
-	    <a href="#Character_literals">character</a>, or
+	    <a href="#Rune_literals">rune</a>, or
 	    <a href="#String_literals">string</a> literal
 	</li>

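In the list of line-final tokens after which a semicolon is inserted automatically (the hunk context reads "token is"), the link to "Character literals" now points to "Rune literals".

Title and content of the Character literals section: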

@@ -359,13 +360,15 @@ imaginary_lit = (decimals | float_lit) "i" .
 </pre>
 
 
-<h3 id="Character_literals">Character literals</h3>
+<h3 id="Rune_literals">Rune literals</h3>
 
 <p>
-A character literal represents a <a href="#Constants">character constant</a>,
-typically a Unicode code point, as one or more characters enclosed in single
-quotes.  Within the quotes, any character may appear except single
-quote and newline. A single quoted character represents itself,
+A rune literal represents a <a href="#Constants">rune constant</a>,
+an integer value identifying a Unicode code point.
+A rune literal is expressed as one or more characters enclosed in single quotes.
+Within the quotes, any character may appear except single
+quote and newline. A single quoted character represents the Unicode value
+of the character itself,
 while multi-character sequences beginning with a backslash encode
 values in various formats.
 </p>

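The section title changes from Character literals to Rune literals, and the description now states that a rune literal represents a rune constant, an integer value identifying a Unicode code point.

Corrected description of escape sequences: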

@@ -408,11 +411,11 @@ After a backslash, certain single-character escapes represent special values:
 \t   U+0009 horizontal tab
 \v   U+000b vertical tab
 \\   U+005c backslash
-\'   U+0027 single quote  (valid escape only within character literals)
+\'   U+0027 single quote  (valid escape only within rune literals)
 \"   U+0022 double quote  (valid escape only within string literals)
 </pre>
 <p>
-All other sequences starting with a backslash are illegal inside character literals.
+All other sequences starting with a backslash are illegal inside rune literals.
 </p>

The scope in which these escape sequences are valid is now described with "rune literals" instead of "character literals".

Added examples of illegal rune literals:

@@ -438,6 +441,11 @@ escaped_char     = `\` ( "a" | "b" | "f" | "n" | "r" | "t" | "v" | `\` | "'" | `
 '\xff'
 '\u12e4'
 '\U00101234'
+'aa'         // illegal: too many characters
+'\xa'        // illegal: too few hexadecimal digits
+'\0'         // illegal: too few octal digits
+'\uDFFF'     // illegal: surrogate half
+'\U00110000' // illegal: invalid Unicode code point
 </pre>

The illegal rune literal examples described above are added here.

Clarified description of string literals and added illegal examples:

@@ -452,7 +460,8 @@ raw string literals and interpreted string literals.
 Raw string literals are character sequences between back quotes
 ````.  Within the quotes, any character is legal except
 back quote. The value of a raw string literal is the
-string composed of the uninterpreted characters between the quotes;
+string composed of the uninterpreted (implicitly UTF-8-encoded) characters
+between the quotes;
 in particular, backslashes have no special meaning and the string may
 contain newlines.
 Carriage returns inside raw string literals
@@ -463,8 +472,9 @@ Interpreted string literals are character sequences between double
 quotes "". The text between the quotes,
 which may not contain newlines, forms the
 value of the literal, with backslash escapes interpreted as they
-are in character literals (except that \' is illegal and
- \" is legal).  The three-digit octal (\<i>nnn</i>)
+are in rune literals (except that \' is illegal and
+ \" is legal), with the same restrictions.
+The three-digit octal (\<i>nnn</i>)
 and two-digit hexadecimal (\x<i>nn</i>) escapes represent individual
 <i>bytes</i> of the resulting string; all other escapes represent
 the (possibly multi-byte) UTF-8 encoding of individual <i>characters</i>.
@@ -491,6 +501,8 @@ interpreted_string_lit = `"` { unicode_value | byte_value } `"` .
 "日本語"
 "\u65e5本\U00008a9e"
 "\xff\u00FF"
+"\uD800"       // illegal: surrogate half
+"\U00110000"   // illegal: invalid Unicode code point
 </pre>

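The raw string literal description gains the phrase "implicitly UTF-8-encoded", and the interpreted string literal description now says that escapes are interpreted as they are in rune literals, with the same restrictions. Examples of illegal string literals are also added.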
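The rule that \x and octal escapes denote single bytes while \u and \U escapes denote UTF-8-encoded code points can be checked with a short sketch. The literals are taken from the spec's own examples; the function name sameString is hypothetical.

```go
package main

import "fmt"

// sameString (hypothetical helper) reports whether several spellings of
// the same text produce identical strings.
func sameString() bool {
	a := "日本語"                                    // UTF-8 input text
	b := `日本語`                                    // the same text as a raw literal
	c := "\u65e5\u672c\u8a9e"                     // explicit Unicode code points
	d := "\U000065e5\U0000672c\U00008a9e"         // explicit Unicode code points
	e := "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e"   // explicit UTF-8 bytes
	return a == b && b == c && c == d && d == e
}

func main() {
	fmt.Println(sameString())

	// \xff is the single byte 0xff; \u00FF is the code point U+00FF,
	// which UTF-8 encodes as the two bytes 0xc3 0xbf.
	s := "\xff\u00FF"
	fmt.Printf("% x (%d bytes)\n", s, len(s))
}
```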

Unified terminology for constants:

@@ -500,15 +512,15 @@ These examples all represent the same string:
 <pre>
 "日本語"                                 // UTF-8 input text
 `日本語`                                 // UTF-8 input text as a raw literal
-"\u65e5\u672c\u8a9e"                    // The explicit Unicode code points
-"\U000065e5\U0000672c\U00008a9e"        // The explicit Unicode code points
-"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e"  // The explicit UTF-8 bytes
+"\u65e5\u672c\u8a9e"                    // the explicit Unicode code points
+"\U000065e5\U0000672c\U00008a9e"        // the explicit Unicode code points
+"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e"  // the explicit UTF-8 bytes
 </pre>
 
 <p>
 If the source code represents a character as two code points, such as
 a combining form involving an accent and a letter, the result will be
-an error if placed in a character literal (it is not a single code
+an error if placed in a rune literal (it is not a single code
 point), and will appear as two code points if placed in a string
 literal.
 </p>
@@ -517,7 +529,7 @@ literal.
 <h2 id="Constants">Constants</h2>
 
 <p>There are <i>boolean constants</i>,
-<i>character constants</i>,
+<i>rune constants</i>,
 <i>integer constants</i>,
 <i>floating-point constants</i>, <i>complex constants</i>,
 and <i>string constants</i>. Character, integer, floating-point,
@@ -527,7 +539,7 @@ collectively called <i>numeric constants</i>.
 
 <p>
 A constant value is represented by a
-<a href="#Character_literals">character</a>,
+<a href="#Rune_literals">rune</a>,
 <a href="#Integer_literals">integer</a>,
 <a href="#Floating-point_literals">floating-point</a>,
 <a href="#Imaginary_literals">imaginary</a>,
@@ -3586,7 +3598,7 @@ wherever it is legal to use an operand of boolean, numeric, or string type,
 respectively.
 Except for shift operations, if the operands of a binary operation are
 different kinds of untyped constants, the operation and, for non-boolean operations, the result use
-the kind that appears later in this list: integer, character, floating-point, complex.
+the kind that appears later in this list: integer, rune, floating-point, complex.
 For example, an untyped integer constant divided by an
 untyped complex constant yields an untyped complex constant.
 </p>
@@ -3614,7 +3626,7 @@ const f = int32(1) &lt;&lt; 33   // f == 0     (type int32)
 const g = float64(2) &gt;&gt; 1  // illegal    (float64(2) is a typed floating-point constant)
 const h = "foo" &gt; "bar"    // h == true  (untyped boolean constant)
 const j = true             // j == true  (untyped boolean constant)
-const k = 'w' + 1          // k == 'x'   (untyped character constant)
+const k = 'w' + 1          // k == 'x'   (untyped rune constant)
 const l = "hi"             // l == "hi"  (untyped string constant)
 const m = string(k)        // m == "x"   (type string)
 const Σ = 1 - 0.707i       //            (untyped complex constant)
@@ -3624,7 +3636,7 @@ const Φ = iota*1i - 1/1i   //            (untyped complex constant)
 
 <p>
 Applying the built-in function <code>complex</code> to untyped
-integer, character, or floating-point constants yields
+integer, rune, or floating-point constants yields
 an untyped complex constant.
 </p>
 
@@ -3960,7 +3972,7 @@ is assigned to a variable of interface type, the constant is <a href="#Conversio
 to type <code>bool</code>, <code>rune</code>, <code>int</code>, <code>float64</code>,
 <code>complex128</code> or <code>string</code>
 respectively, depending on whether the value is a
-boolean, character, integer, floating-point, complex, or string constant.
+boolean, rune, integer, floating-point, complex, or string constant.
 </p>

Throughout the description of constants, the term "character" is replaced by "rune".
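The behavior of untyped rune constants described in these hunks can be observed with a short sketch. The constant k mirrors the spec example; the names r, f, and typeName are hypothetical and introduced for illustration.

```go
package main

import "fmt"

// k mirrors the spec example: an untyped rune constant with value 'x'.
const k = 'w' + 1

// When two kinds of untyped constants are mixed, the result takes the
// kind that appears later in: integer, rune, floating-point, complex.
const r = 1 + 'a'   // untyped rune constant
const f = 'a' + 0.5 // untyped floating-point constant

// typeName (hypothetical helper) reports the default type a constant
// takes on when it is stored in an interface value.
func typeName(v interface{}) string {
	return fmt.Sprintf("%T", v)
}

func main() {
	// rune is an alias for int32, so %T prints int32 for rune values.
	fmt.Println(typeName(k), typeName(r), typeName(f))
	fmt.Println(k == 'x')

	// The explicit rune conversion is added here only for clarity.
	fmt.Println(string(rune(k)))
}
```

Passing the constants through an interface parameter is what triggers the default conversion to rune or float64 that the last hunk describes.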

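Explanation of the core changes

The "core code" in this commit is the text of doc/go_spec.html, the official Go specification. The specification is the most authoritative definition of the language's behavior, and Go compilers are expected to conform to it.

The heart of the change is to distinguish clearly between "character" and "rune" in Go and to use the terms consistently throughout the specification.

character: This term is restricted to a Unicode code point written in the source code (a text file), for example the character "A" or "あ" as typed in an editor.
rune: This term refers to an int32 value that a Go program uses to represent a single Unicode code point. Rune literals ('a', '世') produce rune constants.

This distinction matters especially given the nature of UTF-8. Because UTF-8 encodes one Unicode code point in a variable number of bytes, keeping strings as byte sequences separate from runes as code points helps in writing correct programs and avoiding bugs.

Specifically, the following points are important.

Stricter definition of rune literals: The definition that a rune literal is an integer value denoting a single Unicode code point is reinforced. Literals containing several characters such as 'aa', a surrogate half ('\uDFFF'), and an invalid Unicode code point ('\U00110000') are now listed explicitly as illegal rune literals, showing how a compiler should treat these cases.
Clearer relationship to string literals: The specification now states that escape sequences in interpreted string literals follow the rules for rune literals, so their behavior can be understood consistently.
Unified terminology for constants: Among Go's several kinds of constants, the ambiguous "character constant" is replaced by "rune constant", which makes the behavior of constants in default typing and in arithmetic clearer.

These changes aim to make the Go specification more accurate, less ambiguous, and easier for learners to understand. Although nothing changes semantically, the stricter terminology and the added examples are a meaningful improvement in keeping with Go's emphasis on simplicity and clarity.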

Sources consulted

Official Go documentation: the change history of doc/go_spec.html
Go commit history: https://github.com/golang/go/commits/master
Go issue tracker: https://github.com/golang/go/issues (may contain related discussion)
Go mailing list (golang-dev): https://groups.google.com/g/golang-dev (the golang-dev named in the commit message is the developer mailing list)
Go code review: https://go-review.googlesource.com/ (the current Gerrit instance; the link https://golang.org/cl/6448137 in the commit message points to the change list for this commit, which predates the project's move to Gerrit)
Technical blogs and articles about Go (for example, explanations of rune and string in Go)
General sources on Unicode and UTF-8 (such as Wikipedia)
