CppStd:lex.phases

Source: https://timsong-cpp.github.io/cppwp/n3337/lex.phases

List of Tables [tab]
List of Figures [fig]
1 General [intro]
2 Lexical conventions [lex]
2.1 Separate translation [lex.separate]
2.2 Phases of translation [lex.phases]
2.3 Character sets [lex.charset]
2.4 Trigraph sequences [lex.trigraph]
2.5 Preprocessing tokens [lex.pptoken]
2.6 Alternative tokens [lex.digraph]
2.7 Tokens [lex.token]
2.8 Comments [lex.comment]
2.9 Header names [lex.header]
2.10 Preprocessing numbers [lex.ppnumber]
2.11 Identifiers [lex.name]
2.12 Keywords [lex.key]
2.13 Operators and punctuators [lex.operators]
2.14 Literals [lex.literal]

2.2 Phases of translation [lex.phases]

1 The precedence among the syntax rules of translation is specified by the following phases.[cpp11 1]
  1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences ([lex.trigraph]) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set ([lex.charset]) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
  2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. If, as a result, a character sequence that matches the syntax of a universal-character-name is produced, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, shall be processed as if an additional new-line character were appended to the file.
  3. The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment.[cpp11 2] Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is unspecified. The process of dividing a source file's characters into preprocessing tokens is context-dependent.
    [Example: see the handling of < within a #include preprocessing directive.]
  4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal-character-name is produced by token concatenation ([cpp.concat]), the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
  5. Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set ([lex.ccon], [lex.string]); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.[cpp11 3]
  6. Adjacent string literal tokens are concatenated.
  7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token ([lex.token]). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit. [Note: The process of analyzing and translating the tokens may occasionally result in one token being replaced by a sequence of other tokens ([temp.names]).] [Note: Source files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any particular implementation.]
  8. Translated translation units and instantiation units are combined as follows: [Note: Some or all of these may be supplied from a library.] Each translated translation unit is examined to produce a list of required instantiations. [Note: This may include instantiations which have been explicitly requested ([temp.explicit]).] The definitions of the required templates are located. It is implementation-defined whether the source of the translation units containing these definitions is required to be available. [Note: An implementation could encode sufficient information into the translated translation unit so as to ensure the source is not required here.] All the required instantiations are performed to produce instantiation units. [Note: These are similar to translated translation units, but contain no references to uninstantiated templates and no template definitions.] The program is ill-formed if any instantiation fails.
  9. All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

  1. Implementations must behave as if these separate phases occur, although in practice different phases might be folded together.
  2. A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment would arise from a source file ending with an unclosed /* comment.
  3. An implementation need not convert all non-corresponding source characters to the same execution character.
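
The effect of phases 2 and 6 can be observed directly in a small program. The following is a minimal sketch, not part of the standard text; the names ANSWER, answer and greeting are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstring>

// Phase 2: the backslash at the end of the next line and the new-line that
// follows it are deleted, so both physical lines form one #define directive.
#define ANSWER \
    42

// Returns the value of the macro that was defined across two physical lines.
inline int answer() { return ANSWER; }

// Phase 6: the three adjacent string literal tokens are concatenated
// into a single string literal.
inline const char* greeting() {
    return "phases " "of " "translation";
}
```

Calling answer() should yield 42 and greeting() should compare equal to "phases of translation", since both the splice and the concatenation happen before syntactic analysis in phase 7.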


Source: https://timsong-cpp.github.io/cppwp/n4140/lex.phases

2.2 Phases of translation [lex.phases]

1 The precedence among the syntax rules of translation is specified by the following phases.[cpp14 1]
  1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences ([lex.trigraph]) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set ([lex.charset]) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
  2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. Except for splices reverted in a raw string literal, if a splice results in a character sequence that matches the syntax of a universal-character-name, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, shall be processed as if an additional new-line character were appended to the file.
  3. The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment.[cpp14 2] Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is unspecified. The process of dividing a source file's characters into preprocessing tokens is context-dependent.
    [Example: see the handling of < within a #include preprocessing directive.]
  4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal-character-name is produced by token concatenation ([cpp.concat]), the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
  5. Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set ([lex.ccon], [lex.string]); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.[cpp14 3]
  6. Adjacent string literal tokens are concatenated.
  7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token ([lex.token]). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit. [Note: The process of analyzing and translating the tokens may occasionally result in one token being replaced by a sequence of other tokens ([temp.names]).] [Note: Source files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any particular implementation.]
  8. Translated translation units and instantiation units are combined as follows: [Note: Some or all of these may be supplied from a library.] Each translated translation unit is examined to produce a list of required instantiations. [Note: This may include instantiations which have been explicitly requested ([temp.explicit]).] The definitions of the required templates are located. It is implementation-defined whether the source of the translation units containing these definitions is required to be available. [Note: An implementation could encode sufficient information into the translated translation unit so as to ensure the source is not required here.] All the required instantiations are performed to produce instantiation units. [Note: These are similar to translated translation units, but contain no references to uninstantiated templates and no template definitions.] The program is ill-formed if any instantiation fails.
  9. All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

  1. Implementations must behave as if these separate phases occur, although in practice different phases might be folded together.
  2. A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment would arise from a source file ending with an unclosed /* comment.
  3. An implementation need not convert all non-corresponding source characters to the same execution character.
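
The phrase "except for splices reverted in a raw string literal" in phase 2 can be illustrated by comparing an ordinary and a raw string literal that both contain a backslash followed by a new-line. This is a minimal sketch; the function names spliced and raw are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// Ordinary literal: the backslash and the new-line are deleted in phase 2,
// so the literal holds only the two characters 'a' and 'b'.
inline const char* spliced() {
    return "a\
b";
}

// Raw literal: the splice is reverted, so the backslash and the new-line
// both remain part of the string (four characters in total).
inline const char* raw() {
    return R"(a\
b)";
}
```

Under these assumptions, std::strlen(spliced()) is 2 while std::strlen(raw()) is 4. The continuation lines start in the first column on purpose, because any leading spaces would become part of both literals.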


Source: https://timsong-cpp.github.io/cppwp/n4659/lex.phases

List of Tables [tab]
List of Figures [fig]
1 Scope [intro.scope]
2 Normative references [intro.refs]
3 Terms and definitions [intro.defs]
4 General principles [intro]
5 Lexical conventions [lex]
5.1 Separate translation [lex.separate]
5.2 Phases of translation [lex.phases]
5.3 Character sets [lex.charset]
5.4 Preprocessing tokens [lex.pptoken]
5.5 Alternative tokens [lex.digraph]
5.6 Tokens [lex.token]
5.7 Comments [lex.comment]
5.8 Header names [lex.header]
5.9 Preprocessing numbers [lex.ppnumber]
5.10 Identifiers [lex.name]
5.11 Keywords [lex.key]
5.12 Operators and punctuators [lex.operators]
5.13 Literals [lex.literal]

5.2 Phases of translation [lex.phases]

1 The precedence among the syntax rules of translation is specified by the following phases.[cpp17 1]
  1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set is replaced by the universal-character-name that designates that character. An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (e.g., using the \uXXXX notation), are handled equivalently except where this replacement is reverted ([lex.pptoken]) in a raw string literal.
  2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. Except for splices reverted in a raw string literal, if a splice results in a character sequence that matches the syntax of a universal-character-name, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, shall be processed as if an additional new-line character were appended to the file.
  3. The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment.[cpp17 2] Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is unspecified. The process of dividing a source file's characters into preprocessing tokens is context-dependent.
    [Example: see the handling of < within a #include preprocessing directive.]
  4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal-character-name is produced by token concatenation, the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
  5. Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set ([lex.ccon], [lex.string]); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.[cpp17 3]
  6. Adjacent string literal tokens are concatenated.
  7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit. [Note: The process of analyzing and translating the tokens may occasionally result in one token being replaced by a sequence of other tokens ([temp.names]).] [Note: Source files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any particular implementation.]
  8. Translated translation units and instantiation units are combined as follows: [Note: Some or all of these may be supplied from a library.] Each translated translation unit is examined to produce a list of required instantiations. [Note: This may include instantiations which have been explicitly requested.] The definitions of the required templates are located. It is implementation-defined whether the source of the translation units containing these definitions is required to be available. [Note: An implementation could encode sufficient information into the translated translation unit so as to ensure the source is not required here.] All the required instantiations are performed to produce instantiation units. [Note: These are similar to translated translation units, but contain no references to uninstantiated templates and no template definitions.] The program is ill-formed if any instantiation fails.
  9. All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

  1. Implementations must behave as if these separate phases occur, although in practice different phases might be folded together.
  2. A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment would arise from a source file ending with an unclosed /* comment.
  3. An implementation need not convert all non-corresponding source characters to the same execution character.
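
The phase 3 rule that each comment is replaced by one space character becomes visible once the preprocessor turns tokens into a string in phase 4. The sketch below assumes a hypothetical helper macro STRINGIZE built on the # operator; the function names are likewise illustrative only.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: turns its argument's preprocessing tokens into a
// string literal during phase 4.
#define STRINGIZE(x) #x

// Phase 3 replaces the comment with one space, so a and b remain two
// separate preprocessing tokens with white space between them.
inline const char* with_comment() { return STRINGIZE(a/* removed */b); }

// Without a comment or other white space, ab is a single identifier token.
inline const char* without_comment() { return STRINGIZE(ab); }
```

with_comment() is expected to return "a b" and without_comment() to return "ab": the comment does not paste its neighbours together, it separates them.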


Source: https://timsong-cpp.github.io/cppwp/n4868/lex.phases

5.2 Phases of translation [lex.phases]

1 The precedence among the syntax rules of translation is specified by the following phases.[cpp20 1]
  1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set is replaced by the universal-character-name that designates that character. An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (e.g., using the \uXXXX notation), are handled equivalently except where this replacement is reverted ([lex.pptoken]) in a raw string literal.
  2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. Except for splices reverted in a raw string literal, if a splice results in a character sequence that matches the syntax of a universal-character-name, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, shall be processed as if an additional new-line character were appended to the file.
  3. The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of whitespace characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment.[cpp20 2] Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of whitespace characters other than new-line is retained or replaced by one space character is unspecified. The process of dividing a source file's characters into preprocessing tokens is context-dependent.
    [Example: See the handling of < within a #include preprocessing directive.]
  4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal-character-name is produced by token concatenation, the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
  5. Each basic source character set member in a character-literal or a string-literal, as well as each escape sequence and universal-character-name in a character-literal or a non-raw string literal, is converted to the corresponding member of the execution character set ([lex.ccon], [lex.string]); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.[cpp20 3]
  6. Adjacent string literal tokens are concatenated.
  7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token ([lex.token]). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit. [Note: The process of analyzing and translating the tokens can occasionally result in one token being replaced by a sequence of other tokens ([temp.names]).] It is implementation-defined whether the sources for module units and header units on which the current translation unit has an interface dependency ([module.unit], [module.import]) are required to be available. [Note: Source files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any particular implementation.]
  8. Translated translation units and instantiation units are combined as follows: [Note: Some or all of these can be supplied from a library.] Each translated translation unit is examined to produce a list of required instantiations. [Note: This can include instantiations which have been explicitly requested ([temp.explicit]).] The definitions of the required templates are located. It is implementation-defined whether the source of the translation units containing these definitions is required to be available. [Note: An implementation can choose to encode sufficient information into the translated translation unit so as to ensure the source is not required here.] All the required instantiations are performed to produce instantiation units. [Note: These are similar to translated translation units, but contain no references to uninstantiated templates and no template definitions.] The program is ill-formed if any instantiation fails.
  9. All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

  1. Implementations behave as if these separate phases occur, although in practice different phases can be folded together. 
  2. A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment would arise from a source file ending with an unclosed /* comment. 
  3. An implementation need not convert all non-corresponding source characters to the same execution character. 
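
Two of the notes in phases 7 and 8 lend themselves to a short illustration: a token being replaced by a sequence of other tokens ([temp.names]) and an explicitly requested instantiation ([temp.explicit]). The following is a sketch under the assumption that nested_rows and twice are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Phase 3 forms a single >> preprocessing token in the declaration below.
// During phase 7 analysis it is treated as two closing angle brackets.
inline std::size_t nested_rows() {
    std::vector<std::vector<int>> grid(3, std::vector<int>(2, 0));
    return grid.size();
}

// A function template and an explicit instantiation of it. Conceptually,
// phase 8 is where such required instantiations are performed.
template <typename T>
T twice(T value) { return value + value; }

template int twice<int>(int);
```

Here nested_rows() returns 3 and twice(21) returns 42. As footnote 1 states, an implementation only has to behave as if the phases were separate, so a real compiler may well instantiate twice<int> while it is still parsing the translation unit.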


Source: https://timsong-cpp.github.io/cppwp/n4950/lex.phases

5.2 Phases of translation [lex.phases]

1 The precedence among the syntax rules of translation is specified by the following phases.[cpp23 1]
  1. An implementation shall support input files that are a sequence of UTF-8 code units (UTF-8 files). It may also support an implementation-defined set of other kinds of input files, and, if so, the kind of an input file is determined in an implementation-defined manner that includes a means of designating input files as UTF-8 files, independent of their content. [Note: In other words, recognizing the U+feff byte order mark is not sufficient.] If an input file is determined to be a UTF-8 file, then it shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of Unicode scalar values. A sequence of translation character set elements is then formed by mapping each Unicode scalar value to the corresponding translation character set element. In the resulting sequence, each pair of characters in the input sequence consisting of U+000d carriage return followed by U+000a line feed, as well as each U+000d carriage return not immediately followed by a U+000a line feed, is replaced by a single new-line character. For any other kind of input file supported by the implementation, characters are mapped, in an implementation-defined manner, to a sequence of translation character set elements ([lex.charset]), representing end-of-line indicators as new-line characters.
  2. If the first translation character is U+feff byte order mark, it is deleted. Each sequence of a backslash character (\) immediately followed by zero or more whitespace characters other than new-line followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. Except for splices reverted in a raw string literal, if a splice results in a character sequence that matches the syntax of a universal-character-name, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a splice, shall be processed as if an additional new-line character were appended to the file.
  3. The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of whitespace characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment.[cpp23 2] Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of whitespace characters other than new-line is retained or replaced by one space character is unspecified. As characters from the source file are consumed to form the next preprocessing token (i.e., not being consumed as part of a comment or other forms of whitespace), except when matching a c-char-sequence, s-char-sequence, r-char-sequence, h-char-sequence, or q-char-sequence, universal-character-names are recognized and replaced by the designated element of the translation character set. The process of dividing a source file's characters into preprocessing tokens is context-dependent.
    [Example: See the handling of < within a #include preprocessing directive.]
  4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
  5. For a sequence of two or more adjacent string-literal tokens, a common encoding-prefix is determined as specified in [lex.string]. Each such string-literal token is then considered to have that common encoding-prefix.
  6. Adjacent string-literal tokens are concatenated ([lex.string]).
  7. Whitespace characters separating tokens are no longer significant. Each preprocessing token is converted into a token ([lex.token]). The resulting tokens constitute a translation unit and are syntactically and semantically analyzed and translated. [Note: The process of analyzing and translating the tokens can occasionally result in one token being replaced by a sequence of other tokens ([temp.names]).] It is implementation-defined whether the sources for module units and header units on which the current translation unit has an interface dependency ([module.unit], [module.import]) are required to be available. [Note: Source files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any particular implementation.]
  8. Translated translation units and instantiation units are combined as follows: [Note: Some or all of these can be supplied from a library.] Each translated translation unit is examined to produce a list of required instantiations. [Note: This can include instantiations which have been explicitly requested ([temp.explicit]).] The definitions of the required templates are located. It is implementation-defined whether the source of the translation units containing these definitions is required to be available. [Note: An implementation can choose to encode sufficient information into the translated translation unit so as to ensure the source is not required here.] All the required instantiations are performed to produce instantiation units. [Note: These are similar to translated translation units, but contain no references to uninstantiated templates and no template definitions.] The program is ill-formed if any instantiation fails.
  9. All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

  1. Implementations behave as if these separate phases occur, although in practice different phases can be folded together.
  2. A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment would arise from a source file ending with an unclosed /* comment.
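
Phases 5 and 6 of this revision (a common encoding-prefix is determined, then adjacent string-literal tokens are concatenated) can be sketched with a wide literal followed by an unprefixed one. The function name wide_joined is a hypothetical choice for this illustration, and the example is written so that it also compiles under earlier revisions.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <type_traits>

// Only the first literal carries the L encoding-prefix. Phase 5 treats both
// literals as having that common prefix, and phase 6 concatenates them.
inline const wchar_t* wide_joined() { return L"ab" "cd"; }

// The concatenated literal is one wide string literal: four characters
// plus the terminating null character.
static_assert(std::is_same<decltype(L"ab" "cd"), const wchar_t (&)[5]>::value,
              "concatenation yields a single wide string literal");
```

std::wcslen(wide_joined()) is expected to be 4, with the second literal's characters stored as wchar_t just like the first.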