CppStd:lex.string

Source: https://timsong-cpp.github.io/cppwp/n3337/lex.string

2.14.5 String literals [lex.string]

Syntax (BNF)

string-literal:

encoding-prefix_opt " s-char-sequence_opt "

encoding-prefix_opt R raw-string

encoding-prefix: u8 u U L

s-char-sequence:

s-char

s-char-sequence s-char

s-char:

any member of the source character set except

the double-quote ", backslash \, or new-line character

escape-sequence

universal-character-name

raw-string:

" d-char-sequence_opt ( r-char-sequence_opt ) d-char-sequence_opt "

r-char-sequence:

r-char

r-char-sequence r-char

r-char:

any member of the source character set, except

a right parenthesis ) followed by the initial d-char-sequence

(which may be empty) followed by a double quote ".

d-char-sequence:

d-char

d-char-sequence d-char

d-char:

any member of the basic source character set except:

space, the left parenthesis (, the right parenthesis ), the backslash \,

and the control characters representing horizontal tab,

vertical tab, form feed, and newline.

1 A string literal is a sequence of characters (as defined in [lex.ccon]) surrounded by double quotes, optionally prefixed by R, u8, u8R, u, uR, U, UR, L, or LR, as in "...", R"(...)", u8"...", u8R"**(...)**", u"...", uR"*~(...)*~", U"...", UR"zzz(...)zzz", L"...", or LR"(...)", respectively.

2 A string literal that has an R in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters.

3 Note: The characters '(' and ')' are permitted in a raw-string. Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".
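The delimiter rule in paragraphs 2 and 3 can be tried out with a short sketch. The names `same_text`, `raw_with_delimiter`, `raw_without_delimiter` and `needs_delimiter` are hypothetical and chosen only for illustration; the last literal shows a body that contains `)"`, which is the case where a non-empty delimiter is actually required.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: true when two narrow strings hold the same characters.
inline bool same_text(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}

// Same body "(a|b)", once with the delimiter "delimiter" and once with an
// empty delimiter. Only ')' followed by the delimiter and '"' ends the literal.
const char* const raw_with_delimiter = R"delimiter((a|b))delimiter";
const char* const raw_without_delimiter = R"((a|b))";

// The body contains )" so an empty delimiter would end the literal too early.
// The delimiter "x" (at most 16 characters are allowed) avoids that.
const char* const needs_delimiter = R"x(say ")" now)x";
```

All three raw strings compare equal to ordinary literals with the corresponding escapes, for example `"say \")\" now"` for the last one.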

4 Note: A source-file new-line in a raw string literal results in a new-line in the resulting execution string-literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:

const char *p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);

5 Example: The raw string

R"a(
)\
a"
)a"

is equivalent to "\n)\\\na\"\n". The raw string

R"(??)"

is equivalent to "\?\?". The raw string

R"#(
)??="
)#"

is equivalent to "\n)\?\?=\"\n".

6 After translation phase 6, a string literal that does not begin with an encoding-prefix is an ordinary string literal, and is initialized with the given characters.

7 A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.

8 Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration ([basic.stc]).
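The type and size stated in paragraph 8 can be checked at compile time. The following is a minimal sketch; `element_count` is a hypothetical helper, not a standard library function. It relies on a string literal being an lvalue, so `decltype` yields a reference to the array type.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical helper: number of elements of an array, terminator included.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) {
    return N;
}

// "asdf" has four characters plus the terminating '\0'.
static_assert(std::is_same<decltype("asdf"), const char (&)[5]>::value,
              "an ordinary string literal is an array of n const char");

// Even the empty literal holds one element: the terminator.
static_assert(element_count("") == 1, "empty literal still has '\\0'");
```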

9 A string literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs.

10 A string literal that begins with U, such as U"asdf", is a char32_t string literal. A char32_t string literal has type “array of n const char32_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.

11 A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.
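Paragraphs 9 to 11 differ only in the element type that the prefix selects. A minimal compile-time sketch follows; `unit_size` is a hypothetical helper introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical helper: size in bytes of one element of a string literal.
template <typename T, std::size_t N>
constexpr std::size_t unit_size(const T (&)[N]) {
    return sizeof(T);
}

// Each prefix selects the element type; n is 5 for "asdf" in every case
// because all four characters need a single element here.
static_assert(std::is_same<decltype(u"asdf"), const char16_t (&)[5]>::value,
              "u prefix: array of n const char16_t");
static_assert(std::is_same<decltype(U"asdf"), const char32_t (&)[5]>::value,
              "U prefix: array of n const char32_t");
static_assert(std::is_same<decltype(L"asdf"), const wchar_t (&)[5]>::value,
              "L prefix: array of n const wchar_t");
```

The size of `wchar_t` is implementation-defined, so the sketch compares against `sizeof(wchar_t)` rather than a fixed number.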

12 Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined. The effect of attempting to modify a string literal is undefined.

13 In translation phase 6 ([lex.phases]), adjacent string literals are concatenated. If both string literals have the same encoding-prefix, the resulting concatenated string literal has that encoding-prefix. If one string literal has no encoding-prefix, it is treated as a string literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally supported with implementation-defined behavior. Note: This concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a literal has been translated into a value from the appropriate character set), a string literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation. Table [tab:lex.string.concat] has some examples of valid concatenations.

Table 8 — String literal concatenations
Source		Means	Source		Means	Source		Means
`u"a"`	`u"b"`	`u"ab"`	`U"a"`	`U"b"`	`U"ab"`	`L"a"`	`L"b"`	`L"ab"`
`u"a"`	`"b"`	`u"ab"`	`U"a"`	`"b"`	`U"ab"`	`L"a"`	`"b"`	`L"ab"`
`"a"`	`u"b"`	`u"ab"`	`"a"`	`U"b"`	`U"ab"`	`"a"`	`L"b"`	`L"ab"`

Characters in concatenated strings are kept distinct.

Example

"\xA" "B"

contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').
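The concatenation rules and the "\xA" "B" example can be verified at compile time. This is a sketch under the assumption that `count_elements` and `kept_distinct` are illustrative names; they do not come from the standard.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical helper: number of elements of an array, terminator included.
template <typename T, std::size_t N>
constexpr std::size_t count_elements(const T (&)[N]) {
    return N;
}

// Two characters plus the terminator: the hex escape does not absorb 'B'.
static_assert(count_elements("\xA" "B") == 3, "'\\xA', 'B', '\\0'");
// Written as one literal, \xAB is a single character.
static_assert(count_elements("\xAB") == 2, "'\\xAB', '\\0'");

// An unprefixed operand takes the encoding-prefix of the other operand.
static_assert(std::is_same<decltype(u"a" "b"), const char16_t (&)[3]>::value,
              "u\"a\" \"b\" means u\"ab\"");
static_assert(std::is_same<decltype("a" L"b"), const wchar_t (&)[3]>::value,
              "\"a\" L\"b\" means L\"ab\"");

const char kept_distinct[] = "\xA" "B";
```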

14 After any necessary concatenation, in translation phase 7 ([lex.phases]), '\0' is appended to every string literal so that programs that scan a string can find its end.

15 Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals ([lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \. In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding. The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'. The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'. Note: The size of a char16_t string literal is the number of code units, not the number of characters. Within char32_t and char16_t literals, any universal-character-names shall be within the range 0x0 to 0x10FFFF. The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.
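The size rules above can be illustrated by counting elements for one character inside and one outside the Basic Multilingual Plane. This is a hedged sketch; `code_units` is a hypothetical helper, and the code points U+00E9 and U+1F600 are picked only as examples.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: size of a string literal in elements (code units),
// including the terminating null element.
template <typename T, std::size_t N>
constexpr std::size_t code_units(const T (&)[N]) {
    return N;
}

// U+00E9 needs one code unit in UTF-16 and in UTF-32.
static_assert(code_units(u"\u00E9") == 2, "1 unit + terminator");
static_assert(code_units(U"\u00E9") == 2, "1 unit + terminator");

// U+1F600 needs a surrogate pair in a char16_t literal: one extra element.
static_assert(code_units(u"\U0001F600") == 3, "2 units + terminator");
static_assert(code_units(U"\U0001F600") == 2, "1 unit + terminator");

// In a UTF-8 literal U+00E9 maps to two elements ("at least one").
static_assert(code_units(u8"\u00E9") == 3, "2 units + terminator");
```

No count is asserted for an unprefixed literal containing a universal-character-name, because that depends on the execution character set.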

Source: https://timsong-cpp.github.io/cppwp/n4140/lex.string

2.14.5 String literals [lex.string]

Syntax (BNF)

string-literal:

encoding-prefix_opt " s-char-sequence_opt "

encoding-prefix_opt R raw-string

encoding-prefix: u8 u U L

s-char-sequence:

s-char

s-char-sequence s-char

s-char:

any member of the source character set except

the double-quote ", backslash \, or new-line character

escape-sequence

universal-character-name

raw-string:

" d-char-sequence_opt ( r-char-sequence_opt ) d-char-sequence_opt "

r-char-sequence:

r-char

r-char-sequence r-char

r-char:

any member of the source character set, except

a right parenthesis ) followed by the initial d-char-sequence

(which may be empty) followed by a double quote ".

d-char-sequence:

d-char

d-char-sequence d-char

d-char:

any member of the basic source character set except:

space, the left parenthesis (, the right parenthesis ), the backslash \,

and the control characters representing horizontal tab,

vertical tab, form feed, and newline.

1 A string literal is a sequence of characters (as defined in [lex.ccon]) surrounded by double quotes, optionally prefixed by R, u8, u8R, u, uR, U, UR, L, or LR, as in "...", R"(...)", u8"...", u8R"**(...)**", u"...", uR"*~(...)*~", U"...", UR"zzz(...)zzz", L"...", or LR"(...)", respectively.

2 A string literal that has an R in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters.

3 Note: The characters '(' and ')' are permitted in a raw-string. Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".

4 Note: A source-file new-line in a raw string literal results in a new-line in the resulting execution string-literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:

const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);

5 Example: The raw string

R"a(
)\
a"
)a"

is equivalent to "\n)\\\na\"\n". The raw string

R"(??)"

is equivalent to "\?\?". The raw string

R"#(
)??="
)#"

is equivalent to "\n)\?\?=\"\n".

6 After translation phase 6, a string literal that does not begin with an encoding-prefix is an ordinary string literal, and is initialized with the given characters.

7 A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal.

8 Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration ([basic.stc]).

9 For a UTF-8 string literal, each successive element of the object representation ([basic.types]) has the value of the corresponding code unit of the UTF-8 encoding of the string.
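Paragraph 9 fixes the element values of a UTF-8 string literal independently of the execution character set. A minimal sketch follows; `byte_at` is a hypothetical helper, written as a template so that it also compiles where the element type of a `u8` literal is not `char`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: element i of a string literal, read as an unsigned byte.
template <typename T, std::size_t N>
constexpr unsigned byte_at(const T (&s)[N], std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// U+00E9 is encoded in UTF-8 as the two code units C3 A9.
static_assert(byte_at(u8"\u00E9", 0) == 0xC3, "first code unit");
static_assert(byte_at(u8"\u00E9", 1) == 0xA9, "second code unit");
static_assert(byte_at(u8"\u00E9", 2) == 0x00, "terminator");

// U+20AC is encoded as the three code units E2 82 AC.
static_assert(byte_at(u8"\u20AC", 0) == 0xE2, "first code unit");
static_assert(byte_at(u8"\u20AC", 1) == 0x82, "second code unit");
static_assert(byte_at(u8"\u20AC", 2) == 0xAC, "third code unit");
```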

10 A string literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs.

11 A string literal that begins with U, such as U"asdf", is a char32_t string literal. A char32_t string literal has type “array of n const char32_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.

12 A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.

13 Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined. The effect of attempting to modify a string literal is undefined.
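Two practical consequences of paragraph 13 can be sketched as follows: modify a copy of the characters rather than the literal, and compare contents rather than addresses. The function names `capitalized_copy` and `same_contents` are hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: changes the first letter without touching the literal.
inline std::string capitalized_copy() {
    // An array initialized from a literal is a separate, writable object.
    char buffer[] = "hello";
    buffer[0] = 'H';
    return std::string(buffer);
}

// Hypothetical helper: compares characters. Comparing the pointers instead
// would depend on whether the implementation stores equal literals in the
// same object, which the paragraph above leaves to the implementation.
inline bool same_contents(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}
```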

14 In translation phase 6 ([lex.phases]), adjacent string literals are concatenated. If both string literals have the same encoding-prefix, the resulting concatenated string literal has that encoding-prefix. If one string literal has no encoding-prefix, it is treated as a string literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior. Note: This concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a literal has been translated into a value from the appropriate character set), a string literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation. Table [tab:lex.string.concat] has some examples of valid concatenations.

Table 8 — String literal concatenations
Source		Means	Source		Means	Source		Means
`u"a"`	`u"b"`	`u"ab"`	`U"a"`	`U"b"`	`U"ab"`	`L"a"`	`L"b"`	`L"ab"`
`u"a"`	`"b"`	`u"ab"`	`U"a"`	`"b"`	`U"ab"`	`L"a"`	`"b"`	`L"ab"`
`"a"`	`u"b"`	`u"ab"`	`"a"`	`U"b"`	`U"ab"`	`"a"`	`L"b"`	`L"ab"`

Characters in concatenated strings are kept distinct.

Example

"\xA" "B"

contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').

15 After any necessary concatenation, in translation phase 7 ([lex.phases]), '\0' is appended to every string literal so that programs that scan a string can find its end.

16 Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals ([lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \. In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding. The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'. The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'. Note: The size of a char16_t string literal is the number of code units, not the number of characters. Within char32_t and char16_t literals, any universal-character-names shall be within the range 0x0 to 0x10FFFF. The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.

Source: https://timsong-cpp.github.io/cppwp/n4659/lex.string

5.13.5 String literals [lex.string]

Syntax (BNF)

string-literal:

encoding-prefix_opt " s-char-sequence_opt "

encoding-prefix_opt R raw-string

s-char-sequence:

s-char

s-char-sequence s-char

s-char:

any member of the source character set except

the double-quote ", backslash \, or new-line character

escape-sequence

universal-character-name

raw-string:

" d-char-sequence_opt ( r-char-sequence_opt ) d-char-sequence_opt "

r-char-sequence:

r-char

r-char-sequence r-char

r-char:

any member of the source character set, except

a right parenthesis ) followed by the initial d-char-sequence

(which may be empty) followed by a double quote ".

d-char-sequence:

d-char

d-char-sequence d-char

d-char:

any member of the basic source character set except:

space, the left parenthesis (, the right parenthesis ), the backslash \,

and the control characters representing horizontal tab,

vertical tab, form feed, and newline.

1 A string-literal is a sequence of characters (as defined in [lex.ccon]) surrounded by double quotes, optionally prefixed by R, u8, u8R, u, uR, U, UR, L, or LR, as in "...", R"(...)", u8"...", u8R"**(...)**", u"...", uR"*~(...)*~", U"...", UR"zzz(...)zzz", L"...", or LR"(...)", respectively.

2 A string-literal that has an R in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters.

3 Note: The characters '(' and ')' are permitted in a raw-string. Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".

4 Note: A source-file new-line in a raw string literal results in a new-line in the resulting execution string literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:

const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);

5 Example: The raw string

R"a(
)\
a"
)a"

is equivalent to "\n)\\\na\"\n". The raw string

R"(??)"

is equivalent to "\?\?". The raw string

R"#(
)??="
)#"

is equivalent to "\n)\?\?=\"\n".

6 After translation phase 6, a string-literal that does not begin with an encoding-prefix is an ordinary string literal, and is initialized with the given characters.

7 A string-literal that begins with u8, such as u8"asdf", is a UTF-8 string literal.

8 Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration.

9 For a UTF-8 string literal, each successive element of the object representation has the value of the corresponding code unit of the UTF-8 encoding of the string.

10 A string-literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs.

11 A string-literal that begins with U, such as U"asdf", is a char32_t string literal. A char32_t string literal has type “array of n const char32_t”, where n is the size of the string as defined below; it is initialized with the given characters.

12 A string-literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it is initialized with the given characters.

13 In translation phase 6, adjacent string-literals are concatenated. If both string-literals have the same encoding-prefix, the resulting concatenated string literal has that encoding-prefix. If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior. Note: This concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a string literal has been translated into a value from the appropriate character set), a string-literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation. Table 9 has some examples of valid concatenations.

Table 9 — String literal concatenations
Source		Means	Source		Means	Source		Means
`u"a"`	`u"b"`	`u"ab"`	`U"a"`	`U"b"`	`U"ab"`	`L"a"`	`L"b"`	`L"ab"`
`u"a"`	`"b"`	`u"ab"`	`U"a"`	`"b"`	`U"ab"`	`L"a"`	`"b"`	`L"ab"`
`"a"`	`u"b"`	`u"ab"`	`"a"`	`U"b"`	`U"ab"`	`"a"`	`L"b"`	`L"ab"`

Characters in concatenated strings are kept distinct.

Example

"\xA" "B"

contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').
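The note in paragraph 13 says that rawness does not affect concatenation. A small sketch of that point, with the hypothetical names `mixed` and `utf16_mixed`: a raw literal and a non-raw literal are joined, and each part keeps the characters it had before the join.

```cpp
#include <cassert>
#include <cstring>

// The raw part contributes the three characters a, backslash, n.
// The ordinary part contributes b. Together with '\0' that is five elements.
const char mixed[] = R"(a\n)" "b";
static_assert(sizeof(mixed) == 5, "a, \\, n, b, terminator");

// Raw UTF-16 literal followed by an unprefixed literal: the result is a
// char16_t string, and "\n" is a real new-line because that part is not raw.
const char16_t utf16_mixed[] = uR"(a\n)" "\n";
static_assert(sizeof(utf16_mixed) / sizeof(utf16_mixed[0]) == 5,
              "a, \\, n, new-line, terminator");
```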

14 After any necessary concatenation, in translation phase 7, '\0' is appended to every string literal so that programs that scan a string can find its end.

15 Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals, except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \, and except that a universal-character-name in a char16_t string literal may yield a surrogate pair. In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding. The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'. The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'. Note: The size of a char16_t string literal is the number of code units, not the number of characters. Within char32_t and char16_t string literals, any universal-character-names shall be within the range 0x0 to 0x10FFFF. The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.

16 Evaluating a string-literal results in a string literal object with static storage duration, initialized from the given characters as specified above. Whether all string literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.

Note: The effect of attempting to modify a string literal is undefined.
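Static storage duration means that a pointer to a string literal object stays valid after the function that evaluated the literal has returned. A minimal sketch, with the hypothetical function name `greeting`:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical function: returning the address of a string literal object is
// fine because the object has static storage duration, unlike a local array.
inline const char* greeting() {
    return "hello";
}
```

Whether two calls return the same address is unspecified according to the paragraph above, so a caller should rely only on the characters, for example through `std::strcmp`.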

Source: https://timsong-cpp.github.io/cppwp/n4868/lex.string

5.13.5 String literals [lex.string]

Syntax (BNF)

string-literal:

encoding-prefix_opt " s-char-sequence_opt "

encoding-prefix_opt R raw-string

s-char-sequence:

s-char

s-char-sequence s-char

s-char:

any member of the basic source character set except the double-quote ", backslash \, or new-line character

escape-sequence

universal-character-name

raw-string:

" d-char-sequence_opt ( r-char-sequence_opt ) d-char-sequence_opt "

r-char-sequence:

r-char

r-char-sequence r-char

r-char:

any member of the source character set, except a right parenthesis ) followed by

the initial d-char-sequence (which may be empty) followed by a double quote ".

d-char-sequence:

d-char

d-char-sequence d-char

d-char:

any member of the basic source character set except:

space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters

representing horizontal tab, vertical tab, form feed, and newline.

1 A string-literal that has an R in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters.

2 Note: The characters '(' and ')' are permitted in a raw-string. Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".

3 Note: A source-file new-line in a raw string literal results in a new-line in the resulting execution string literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:

const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);

4 Example: The raw string

R"a(
)\
a"
)a"

is equivalent to "\n)\\\na\"\n". The raw string

R"(x = "\"y\"")"

is equivalent to "x = \"\\\"y\\\"\"".

5 After translation phase 6, a string-literal that does not begin with an encoding-prefix is an ordinary string literal. An ordinary string literal has type “array of n const char” where n is the size of the string as defined below, has static storage duration ([basic.stc]), and is initialized with the given characters.

6 A string-literal that begins with u8, such as u8"asdf", is a UTF-8 string literal. A UTF-8 string literal has type “array of n const char8_t”, where n is the size of the string as defined below; each successive element of the object representation ([basic.types]) has the value of the corresponding code unit of the UTF-8 encoding of the string.
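This draft changes the element type of a UTF-8 string literal from `char` to `char8_t`. The sketch below is one way to write code that compiles on both sides of that change. It assumes the compiler defines the feature-test macro `__cpp_char8_t` when `char8_t` is available; `utf8_unit` and `utf8_text` are hypothetical names.

```cpp
#include <cassert>
#include <type_traits>

// Element type of a u8 literal: char8_t where the feature is available,
// otherwise char, as in the earlier drafts quoted above.
#if defined(__cpp_char8_t)
using utf8_unit = char8_t;
#else
using utf8_unit = char;
#endif

static_assert(std::is_same<decltype(u8"asdf"), const utf8_unit (&)[5]>::value,
              "a UTF-8 string literal is an array of n const utf8_unit");

// Binding by reference keeps the array type whichever element type is used.
const auto& utf8_text = u8"asdf";
```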

7 Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals.

8 A string-literal that begins with u, such as u"asdf", is a UTF-16 string literal. A UTF-16 string literal has type “array of n const char16_t”, where n is the size of the string as defined below; each successive element of the array has the value of the corresponding code unit of the UTF-16 encoding of the string.

Note: A single c-char may produce more than one char16_t character in the form of surrogate pairs. A surrogate pair is a representation for a single code point as a sequence of two 16-bit code units.
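The surrogate pair mentioned in the note can be computed by hand and compared with what the compiler stores. The following is a sketch of the standard UTF-16 arithmetic; `SurrogatePair`, `to_surrogates` and `grinning` are hypothetical names, and U+1F600 is just an example code point above U+FFFF.

```cpp
#include <cassert>

// Hypothetical type holding the two 16-bit code units of a surrogate pair.
struct SurrogatePair {
    char16_t high;
    char16_t low;
};

// Splits a code point into a UTF-16 surrogate pair.
// Precondition (not checked): 0x10000 <= cp <= 0x10FFFF.
constexpr SurrogatePair to_surrogates(char32_t cp) {
    return SurrogatePair{
        static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)),
        static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))};
}

// One universal-character-name, two code units, plus the terminator.
constexpr char16_t grinning[] = u"\U0001F600";
static_assert(sizeof(grinning) / sizeof(grinning[0]) == 3, "pair + terminator");
static_assert(grinning[0] == to_surrogates(0x1F600).high, "high surrogate");
static_assert(grinning[1] == to_surrogates(0x1F600).low, "low surrogate");
```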

9 A string-literal that begins with U, such as U"asdf", is a UTF-32 string literal. A UTF-32 string literal has type “array of n const char32_t”, where n is the size of the string as defined below; each successive element of the array has the value of the corresponding code unit of the UTF-32 encoding of the string.

10 A string-literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it is initialized with the given characters.

11 In translation phase 6 ([lex.phases]), adjacent string-literals are concatenated. If both string-literals have the same encoding-prefix, the resulting concatenated string-literal has that encoding-prefix. If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior.

Note: This concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a string-literal has been translated into a value from the appropriate character set), a string-literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation. Table 11 has some examples of valid concatenations.

Table 11: String literal concatenations [tab:lex.string.concat]
Source		Means	Source		Means	Source		Means
`u"a"`	`u"b"`	`u"ab"`	`U"a"`	`U"b"`	`U"ab"`	`L"a"`	`L"b"`	`L"ab"`
`u"a"`	`"b"`	`u"ab"`	`U"a"`	`"b"`	`U"ab"`	`L"a"`	`"b"`	`L"ab"`
`"a"`	`u"b"`	`u"ab"`	`"a"`	`U"b"`	`U"ab"`	`"a"`	`L"b"`	`L"ab"`

Characters in concatenated strings are kept distinct.

Example

"\xA" "B"

contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').

12 After any necessary concatenation, in translation phase 7 ([lex.phases]), '\0' is appended to every string-literal so that programs that scan a string can find its end.

13 Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character-literals ([lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \, and except that a universal-character-name in a UTF-16 string literal may yield a surrogate pair. In a narrow string literal, a universal-character-name may map to more than one char or char8_t element due to multibyte encoding. The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'. The size of a UTF-16 string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'.

Note: The size of a char16_t string literal is the number of code units, not the number of characters.

Note: Any universal-character-names are required to correspond to a code point in the range [0, D800) or [E000, 10FFFF] (hexadecimal) ([lex.charset]). The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.

14 Evaluating a string-literal results in a string literal object with static storage duration, initialized from the given characters as specified above. Whether all string-literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.

Note: The effect of attempting to modify a string-literal is undefined.

Source: https://timsong-cpp.github.io/cppwp/n4950/lex.string

5.13.5 String literals [lex.string]

Syntax (BNF)

string-literal:

encoding-prefix_opt " s-char-sequence_opt "

encoding-prefix_opt R raw-string

s-char-sequence:

s-char

s-char-sequence s-char

s-char:

basic-s-char

escape-sequence

universal-character-name

basic-s-char:

any member of the translation character set except the U+0022 quotation mark,

U+005c reverse solidus, or new-line character

raw-string:

" d-char-sequence_opt ( r-char-sequence_opt ) d-char-sequence_opt "

r-char-sequence:

r-char

r-char-sequence r-char

r-char:

any member of the translation character set, except a U+0029 right parenthesis followed by

the initial d-char-sequence (which may be empty) followed by a U+0022 quotation mark

d-char-sequence:

d-char

d-char-sequence d-char

d-char:

any member of the basic character set except:

U+0020 space, U+0028 left parenthesis, U+0029 right parenthesis, U+005c reverse solidus,

U+0009 character tabulation, U+000b line tabulation, U+000c form feed, and new-line

1

The kind of a string-literal, its type, and its associated character encoding ([lex.charset]) are determined by its encoding prefix and sequence of s-chars or r-chars as defined by Table 12 where n is the number of encoded code units as described below.

Table 12: String literals [tab:lex.string.literal]
Encoding prefix	Kind	Type	Associated character encoding	Examples
none	ordinary string literal	array of n `const char`	ordinary literal encoding	`"ordinary string"` `R"(ordinary raw string)"`
`L`	wide string literal	array of n `const wchar_t`	wide literal encoding	`L"wide string"` `LR"w(wide raw string)w"`
`u8`	UTF-8 string literal	array of n `const char8_t`	UTF-8	`u8"UTF-8 string"` `u8R"x(UTF-8 raw string)x"`
`u`	UTF-16 string literal	array of n `const char16_t`	UTF-16	`u"UTF-16 string"` `uR"y(UTF-16 raw string)y"`
`U`	UTF-32 string literal	array of n `const char32_t`	UTF-32	`U"UTF-32 string"` `UR"z(UTF-32 raw string)z"`
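The type column of Table 12 can be spot-checked with decltype, which for a string literal (an lvalue) yields a reference to a const array. This is only a sketch: the variable names are hypothetical, and the __cpp_char8_t branch is an assumption to keep the block compiling in language modes that predate char8_t, where u8 literals have element type char.

```cpp
#include <type_traits>

// "abc" has n == 4: three code units plus the terminating null.
constexpr bool ordinary_is_char =
    std::is_same<decltype("abc"), const char (&)[4]>::value;
constexpr bool wide_is_wchar_t =
    std::is_same<decltype(L"abc"), const wchar_t (&)[4]>::value;
constexpr bool utf16_is_char16_t =
    std::is_same<decltype(u"abc"), const char16_t (&)[4]>::value;
constexpr bool utf32_is_char32_t =
    std::is_same<decltype(U"abc"), const char32_t (&)[4]>::value;

// The R in the prefix does not change the kind or the type.
constexpr bool raw_keeps_type =
    std::is_same<decltype(uR"y(abc)y"), const char16_t (&)[4]>::value;

#if defined(__cpp_char8_t)
constexpr bool utf8_matches =
    std::is_same<decltype(u8"abc"), const char8_t (&)[4]>::value;
#else
// Assumption: a language mode without char8_t, where u8 literals use char.
constexpr bool utf8_matches =
    std::is_same<decltype(u8"abc"), const char (&)[4]>::value;
#endif

static_assert(ordinary_is_char && wide_is_wchar_t && utf16_is_char16_t &&
                  utf32_is_char32_t && raw_keeps_type && utf8_matches,
              "string literal types follow the encoding prefix");
```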

2

A string-literal that has an R in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters.

3 Note: The characters '(' and ')' are permitted in a raw-string. Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".

4

Note: A source-file new-line in a raw string literal results in a new-line in the resulting execution string literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:

const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);

5

Example: The raw string

R"a(
)\
a"
)a"

is equivalent to "\n)\\\na\"\n". The raw string

R"(x = "\"y\"")"

is equivalent to "x = \"\\\"y\\\"\"".
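The equivalences stated in paragraphs 3 to 5 can be verified at compile time. In this sketch the helper same_text and the variable names are hypothetical, and the raw strings must start at column 0 exactly as written, since leading whitespace would become part of the literal.

```cpp
// Hypothetical helper: compile-time comparison of two null-terminated strings.
constexpr bool same_text(const char* a, const char* b) {
    return *a == *b && (*a == '\0' || same_text(a + 1, b + 1));
}

// Parentheses are allowed inside the body when a delimiter is used.
constexpr const char* kRegex = R"delimiter((a|b))delimiter";

// A backslash followed by a source new-line is kept verbatim (no line splicing).
constexpr const char* kLines = R"(a\
b
c)";

// With delimiter "a", only the sequence )a" ends the literal.
constexpr const char* kTricky = R"a(
)\
a"
)a";

// Backslashes and double quotes need no escaping in the body.
constexpr const char* kQuoted = R"(x = "\"y\"")";

static_assert(same_text(kRegex, "(a|b)"), "delimiter example");
static_assert(same_text(kLines, "a\\\nb\nc"), "new-lines are preserved");
static_assert(same_text(kTricky, "\n)\\\na\"\n"), "delimiter 'a' example");
static_assert(same_text(kQuoted, "x = \"\\\"y\\\"\""), "quoted example");
```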

6 Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals.

7

The common encoding-prefix for a sequence of adjacent string-literals is determined pairwise as follows: If two string-literals have the same encoding-prefix, the common encoding-prefix is that encoding-prefix. If one string-literal has no encoding-prefix, the common encoding-prefix is that of the other string-literal. Any other combinations are ill-formed.

Note: A string-literal's rawness has no effect on the determination of the common encoding-prefix.
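The pairwise rule can be illustrated by inspecting the element type of a concatenated literal. This is a sketch only, and has_element_type is a hypothetical helper name introduced for the example.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper: true if the array's element type is Expected.
template <class Expected, class T, std::size_t N>
constexpr bool has_element_type(const T (&)[N]) {
    return std::is_same<Expected, T>::value;
}

// Same encoding-prefix on both sides: that prefix is the common one.
static_assert(has_element_type<char16_t>(u"a" u"b"), "u + u gives u");

// One side unprefixed: the other side's prefix is the common one.
static_assert(has_element_type<char32_t>("a" U"b"), "none + U gives U");
static_assert(has_element_type<wchar_t>(L"a" "b"), "L + none gives L");

// Rawness plays no part in the determination.
static_assert(has_element_type<wchar_t>(LR"(a)" "b"), "LR + none gives L");
static_assert(has_element_type<char16_t>(R"(a)" u"b"), "R + u gives u");

// u"a" U"b" would be ill-formed: two different encoding-prefixes.
```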

8

In translation phase 6 ([lex.phases]), adjacent string-literals are concatenated. The lexical structure and grouping of the contents of the individual string-literals is retained.

Example

"\xA" "B"

represents the code unit '\xA' and the character 'B' after concatenation (and not the single code unit '\xAB'). Similarly,

R"(\u00)" "41"

represents six characters, starting with a backslash and ending with the digit 1 (and not the single character 'A' specified by a universal-character-name). Table 13 has some examples of valid concatenations.

Table 13: String literal concatenations [tab:lex.string.concat]
Source		Means	Source		Means	Source		Means
`u"a"`	`u"b"`	`u"ab"`	`U"a"`	`U"b"`	`U"ab"`	`L"a"`	`L"b"`	`L"ab"`
`u"a"`	`"b"`	`u"ab"`	`U"a"`	`"b"`	`U"ab"`	`L"a"`	`"b"`	`L"ab"`
`"a"`	`u"b"`	`u"ab"`	`"a"`	`U"b"`	`U"ab"`	`"a"`	`L"b"`	`L"ab"`
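The two examples in paragraph 8 can be confirmed with sizeof and indexing. This is a minimal sketch; the parentheses around each concatenation are needed so that the subscript applies to the whole concatenated literal.

```cpp
// "\xA" "B" is two code units plus the terminator, not the single '\xAB'.
static_assert(sizeof("\xA" "B") == 3, "'\\xA', 'B', and the terminator");
static_assert(("\xA" "B")[0] == 0x0A, "first code unit is \\xA");
static_assert(("\xA" "B")[1] == 'B', "second code unit is B");

// R"(\u00)" "41" is six characters: backslash, u, 0, 0, 4, 1.
static_assert(sizeof(R"(\u00)" "41") == 7, "six characters and the terminator");
static_assert((R"(\u00)" "41")[0] == '\\', "starts with a backslash");
static_assert((R"(\u00)" "41")[5] == '1', "ends with the digit 1");

// From Table 13: u"a" "b" means u"ab", i.e. three char16_t elements.
static_assert(sizeof(u"a" "b") / sizeof(char16_t) == 3, "u\"ab\" plus terminator");
```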

9

Evaluating a string-literal results in a string literal object with static storage duration ([basic.stc]). Whether all string-literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.

Note: The effect of attempting to modify a string literal object is undefined.
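One practical consequence of paragraph 9 is that code which needs to change the characters should work on its own copy. The following sketch assumes a hypothetical function name, first_upper, and an ASCII-compatible ordinary literal encoding only for the sake of readable sample text.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: modifies a private copy of the literal's contents.
inline char first_upper() {
    char buffer[] = "abc";  // array initialized from the literal; a distinct object
    buffer[0] = 'A';        // well-defined: the string literal object is untouched
    assert(std::strcmp(buffer, "Abc") == 0);
    return buffer[0];
}

// By contrast, given
//   const char* p = "abc";
//   const char* q = "abc";
// whether p == q holds is unspecified, because the two string literal
// objects may or may not share storage. Writing through a pointer to a
// string literal object (for example via const_cast) is undefined.
```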

10 String literal objects are initialized with the sequence of code unit values corresponding to the string-literal's sequence of s-chars (originally from non-raw string literals) and r-chars (originally from raw string literals), plus a terminating U+0000 null character, in order as follows:

(10.1)

The sequence of characters denoted by each contiguous sequence of basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and universal-character-names ([lex.charset]) is encoded to a code unit sequence using the string-literal's associated character encoding. If a character lacks representation in the associated character encoding, then the string-literal is conditionally-supported and an implementation-defined code unit sequence is encoded.

Note: No character lacks representation in any Unicode encoding form.

When encoding a stateful character encoding, implementations should encode the first such sequence beginning with the initial encoding state and encode subsequent sequences beginning with the final encoding state of the prior sequence.

Note: The encoded code unit sequence can differ from the sequence of code units that would be obtained by encoding each character independently.

(10.2)

Each numeric-escape-sequence ([lex.ccon]) contributes a single code unit with a value as follows:

(10.2.1)

Let v be the integer value represented by the octal number comprising the sequence of octal-digits in an octal-escape-sequence or by the hexadecimal number comprising the sequence of hexadecimal-digits in a hexadecimal-escape-sequence.

(10.2.2)

If v does not exceed the range of representable values of the string-literal's array element type, then the value is v.

(10.2.3)

Otherwise, if the string-literal's encoding-prefix is absent or L, and v does not exceed the range of representable values of the corresponding unsigned type for the underlying type of the string-literal's array element type, then the value is the unique value of the string-literal's array element type T that is congruent to v modulo 2^N, where N is the width of T.

(10.2.4)

Otherwise, the string-literal is ill-formed.

When encoding a stateful character encoding, these sequences should have no effect on encoding state.

(10.3)

Each conditional-escape-sequence ([lex.ccon]) contributes an implementation-defined code unit sequence. When encoding a stateful character encoding, it is implementation-defined what effect these sequences have on encoding state.
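Items (10.1) and (10.2) can be illustrated with a few compile-time checks. This is a sketch under stated assumptions: U+00E9 is an arbitrary sample character, and the casts to unsigned char are there so the comparisons hold whether the element type is signed or unsigned.

```cpp
// (10.1) U+00E9 is encoded in UTF-8 as the code units 0xC3 0xA9,
// followed by the terminating null.
static_assert(static_cast<unsigned char>(u8"\u00E9"[0]) == 0xC3, "lead byte");
static_assert(static_cast<unsigned char>(u8"\u00E9"[1]) == 0xA9, "trail byte");
static_assert(u8"\u00E9"[2] == 0, "terminating null");

// (10.2) Each numeric escape contributes exactly one code unit with value v.
static_assert(sizeof("\x41\102") == 3, "two code units plus terminator");
static_assert("\x41\102"[0] == 0x41, "hexadecimal escape");
static_assert("\x41\102"[1] == 0x42, "octal 102 is 0x42");

// (10.2.3) In an ordinary literal, 0xFF fits the unsigned counterpart of
// char, so the element is the value congruent to 255 modulo 2^N (this is
// -1 where char is a signed 8-bit type).
static_assert(static_cast<unsigned char>("\xFF"[0]) == 0xFF, "congruent to 255");

// (10.2.2) 0xFFFF is representable in char16_t, so the value is v itself.
static_assert(u"\xFFFF"[0] == 0xFFFF, "value is v");

// (10.2.4) "\x100" would be ill-formed where char is 8 bits wide,
// because 0x100 does not fit unsigned char.
```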
