Clione is a Java implementation of a lexical parser that tokenizes source code written in C17 and other C-like programming languages.
The main facility is a tokenization API corresponding to the C preprocessor layer. It includes the features of trigraph replacement, line splicing, and tokenization but does not include macro expansion and directive handling.
A typical usage example would be as follows:
package com.example;
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import com.maroontress.clione.LexicalParser;
import com.maroontress.clione.Token;
public final class TokenDemo {
public static void main(String[] args) {
var path = FileSystems.getDefault().getPath(args[0]);
try (var parser = LexicalParser.of(Files.newBufferedReader(path))) {
run(parser);
} catch (IOException e) {
e.printStackTrace();
}
}
public static void run(LexicalParser parser) throws IOException {
for (;;) {
var maybeToken = parser.next();
if (maybeToken.isEmpty()) {
break;
}
var token = maybeToken.get();
printToken(token, "");
}
}
public static void printToken(Token token, String indent) {
var type = token.getType();
var value = token.getValue();
var span = token.getSpan();
var s = switch (type) {
case DELIMITER, DIRECTIVE_END
-> "'" + value.replaceAll("\n", "\\\\n") + "'";
default -> value;
};
System.out.printf("%s%s: %s: %s%n", indent, span, type, s);
for (var child : token.getChildren()) {
printToken(child, indent + "| ");
}
}
}
And helloworld.c
would be as
follows:
#include <stdio.h>
int main(void)
{
printf("hello world\n");
}
In this example, the result of "java com.example.TokenDemo helloworld.c
" is
as follows:
L1:1--19: DIRECTIVE: #
| L1:2--8: DIRECTIVE_NAME: include
| L1:9: DELIMITER: ' '
| L1:10--18: STANDARD_HEADER: <stdio.h>
| L1:19: DIRECTIVE_END: '\n'
L2:1: DELIMITER: '\n'
L3:1--3: RESERVED: int
L3:4: DELIMITER: ' '
L3:5--8: IDENTIFIER: main
L3:9: PUNCTUATOR: (
L3:10--13: RESERVED: void
L3:14: PUNCTUATOR: )
L3:15: DELIMITER: '\n'
L4:1: PUNCTUATOR: {
L4:2--L5:4: DELIMITER: '\n '
L5:5--10: IDENTIFIER: printf
L5:11: PUNCTUATOR: (
L5:12--26: STRING: "hello world\n"
L5:27: PUNCTUATOR: )
L5:28: PUNCTUATOR: ;
L5:29: DELIMITER: '\n'
L6:1: PUNCTUATOR: }
L6:2: DELIMITER: '\n'
The LexicalParser
object creates and returns a token from the stream of the
source file. It often extracts the ones from the source file, but trigraph and
digraph substitution and line concatenation may result in tokens that are not
in the source file. It returns an empty token when it finally reaches the end
of the source file.
The Token
objects that the next()
method of LexicalParser
instance
returns are the preprocessing tokens. So, the evaluation is necessary before
using their content. In other words, they can be incomplete according to the
token type. For example, the string literal or comment may not terminate, the
preprocessing number may not represent valid integer and floating-point
constants, and so on.
As in the example above, Token
objects can have children, which means they
can be in a tree structure. For tokens that the next()
method returns, tokens
of type TokenType.DIRECTIVE
only have children.
The Token
object has its type, span, and characters. The type is one of the
constants defined in enum TokenType
, the span represents the range of the
source file where the token occurs, and the characters are SourceChar
objects
that compose it.
The SourceChar
object represents a character that composes the token or EOF.
It may also have one or more child characters in some cases. For example, it is
the case that it represents:
- the character which is substituted for any digraph or trigraph sequence
- the character that follows a backslash (
\
) at the end of the line
The following code shows an example:
package com.example;
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.util.List;
import com.maroontress.clione.LexicalParser;
import com.maroontress.clione.SourceChar;
import com.maroontress.clione.Token;
public final class SourceCharDemo {
public static void main(String[] args) {
var path = FileSystems.getDefault().getPath(args[0]);
try (var parser = LexicalParser.of(Files.newBufferedReader(path))) {
run(parser);
} catch (IOException e) {
e.printStackTrace();
}
}
public static void run(LexicalParser parser) throws IOException {
for (;;) {
var maybeToken = parser.next();
if (maybeToken.isEmpty()) {
break;
}
printToken(maybeToken.get());
}
}
public static void printToken(Token token) {
var type = token.getType();
var value = token.getValue();
var span = token.getSpan();
var s = switch (type) {
case DELIMITER, DIRECTIVE_END
-> "'" + value.replaceAll("\n", "\\\\n") + "'";
default -> value;
};
System.out.printf("%s: %s: %s%n", span, type, s);
printChars(token.getChars(), " ");
}
private static void printChars(List<SourceChar> chars, String indent) {
for (var c : chars) {
var span = c.getSpan();
var value = c.toChar();
var s = (value == '\n')
? "'\\n'"
: Character.isHighSurrogate(value)
? "H(0x" + Integer.toString((int) value, 16) + ")"
: Character.isLowSurrogate(value)
? "L(0x" + Integer.toString((int) value, 16) + ")"
: String.valueOf(value);
System.out.printf("%s%s: %s%n", indent, span, s);
printChars(c.getChildren(), indent + "| ");
}
}
}
And
main.c
would be as follows:
ma??/
in
In this example, the result of "java com.example.SourceCharDemo main.c
" is as follows:
L1:1--L2:2: IDENTIFIER: main
L1:1: m
L1:2: a
L1:3--L2:1: i
| L1:3--5: \
| | L1:3: ?
| | L1:4: ?
| | L1:5: /
| L1:6: '\n'
| L2:1: i
L2:2: n
⋮
The result illustrates that the character i
in the identifier main
has
child characters: a backslash (\
), a newline (\n
), and i
. Furthermore,
the backslash character has child characters: ?
, ?
, and /
. Of course,
what happens is that the trigraph sequence ??/
is replaced with a backslash
at first, and then the backslash at the end of the line results in the line
concatenation.
A character corresponds to a column. So, one char
value often represents one
column. However, in the case of a character represented with a surrogate pair,
the two char
values in the pair represent one column. Here is an example
emojicat.c
:
char *cat = u8"🐱";
The result of "java com.example.SourceCharDemo emojicat.c
" is as follows:
⋮
L1:19--23: STRING: u8"🐱"
L1:13: u
L1:14: 8
L1:15: "
L1:16: H(0xd83d)
L1:16: L(0xdc31)
L1:17: "
⋮
This example shows that the high and low surrogate characters are in the same column.
The lexical parser starts tokenization after trigraph replacement and line splicing, according to the phases of translation.
Before anything else, the lexical parser substitutes \n
for all newlines,
that is, line feed (LF), carriage return and line feed (CRLF), and carriage
return (CR) in the stream, even if different newlines are mixed in the stream.
It indicates \n
as a newline (NL) character, regardless of platform.
After unifying newline characters, the lexical parser replaces
trigraph sequences with the new SourceChar
objects they
represent. The new one becomes the parent of the replaced characters and
represents their equivalent. The following table lists all trigraphs:
Trigraph | Equivalent |
---|---|
??< |
{ |
??> |
} |
??[ |
( |
??] |
) |
??= |
# |
??/ |
\ |
??' |
^ |
??! |
| |
??- |
~ |
Next to the trigraph replacement, the lexical parser removes the backslash
character at the end of the line. To be more precise, it replaces the
backslash, the newline character, and the next character with a new
SourceChar
object. The new one becomes the parent of the replaced characters
and represents the character that followed the backslash and newline
characters.
A pair of the backslash and newline characters may appear two or more times with consecutive occurrences. In that case, the new substituted one becomes the parent of both their characters and the next character.
After line splicing, the lexical parser starts to break the SourceChar
stream
into Token
s. A Token
object may be either:
- delimiters (that are sequences of whitespace characters)
- comments
- directives
- preprocessing tokens (that are standard header names, identifiers, preprocessing numbers, character constants, string literals, operators and punctuators, or unknown token)
A delimiter is a separator between tokens. Strictly speaking, it is not a token, but the lexical parser returns the delimiter as a token. Some applications may completely ignore delimiters (for example, code formatters).
The space, horizontal tab (HT), form feed (FF), vertical tab (VT), and NL characters are delimiters within any non-directive line. The space and HT characters are delimiters within any directive lines.
☕ By the way, have you seen source code including FF and VT characters? In the past, people often printed source code on paper. In the 1980s, I saw some source code that included a FF character inserted between functions. It resulted in a page break, so each function started at the top of the page. As far as a VT character goes, I have never seen it in the source code.
The token type of delimiters is TokenType.DELIMITER
.
A comment also can be a delimiter, because C preprocessors replace each comment with a space character.
There are two types of comments. The one starts with /*
and ends with */
.
The other starts with //
and ends with a newline character. No comment can be
inside a character constant, a string literal, a standard header name, or a
filename in either case.
The content of the token can be incomplete. For example, it may not terminate, and so on.
The token type of comments is TokenType.COMMENT
.
An identifier is a preprocessing token.
The first character of an identifier name must be one of:
- an underscore character or an uppercase or lowercase letter (
[_A-Za-z]
) - universal character names (
\uXXXX
or\UXXXXXXXX
,X
is a hexadecimal digit) - other implementation-defined characters
The second and subsequent character must be one of them or a digit ([0-9]
).
The other implementation-defined characters that LexicalParser
's
implementation defines are of
Unicode Identifier that is as follows:
- The first character: a character with which the
Character.isUnicodeIdentifierStart(int)
method returns
true
- The second and subsequent character: a character with which the
Character.isUnicodeIdentifierPart(int)
method returns
true
So, the lexical parser can parse the following C code:
char *\U0001f431 = "cat";
However, it does NOT support the following code because Unicode Identifier does not contain the emoji characters such as 🐱:
char *🐱 = "cat";
Note that the recent famous C compilers (like GCC, Clang, etc.) can compile the code where an identifier contains emoji characters like this.
The token type of identifiers is TokenType.IDENTIFIER
.
Reserved words are equivalent to identifiers, but they are in the set of
keywords, which you can specify with the factory method of LexicalParser
.
The token type of reserved words is TokenType.RESERVED
.
A character constant is a preprocessing token.
It consists of one or more characters enclosed in single quotes. The quotes may
follow a prefix either L
, u
, or U
. It may contain
escape sequences. It may not contain a newline
character.
The content of the token can be incomplete. For example, it may not terminate, it may contain no character, two or more characters, or invalid escape sequences inside the single quotes, and so on.
The token type of character constants is TokenType.CHARACTER
.
A string literal is a preprocessing token.
It consists of zero or more characters enclosed in double quotes. The quotes
may follow a prefix either L
, u
, U
, or u8
. It may contain
escape sequences. It may not contain a newline
character.
The content of the token can be incomplete. For example, it may not terminate, it may contain invalid escape sequences inside the double quotes, and so on.
The token type of string literals is TokenType.STRING
.
A preprocessing number is a preprocessing token.
It includes all integer and floating-point constants but does other forms except them.
The content of the token can be incomplete. For example, it may not represent both integer and floating-point constants, and so on.
The token type of preprocessing numbers is TokenType.NUMBER
.
Operator or punctuator tokens are preprocessing tokens. The following table
lists valid tokens of which the type is TokenType.OPERATOR
:
+ - * / % ++ -- == !=
> < >= <= ! && || ~ &
| ^ << >> = += -= *= /=
%= &= |= ^= <<= >>= -> . ?
Note that these are preprocessing tokens, not C operators. For example,
sizeof
is an operator in C, but a reserved word (or an identifier) as a
preprocessing token.
The following table lists all valid tokens of which the type is
TokenType.PUNCTUATOR
:
( ) [ ] { } :
; , ... <: :> <% %>
The lexical parser specially treats the four tokens: #
, %:
, ##
, and
%:%:
. The type of them is TokenType.OPERATOR
in directive lines. Otherwise,
#
and %:
are of type TokenType.DIRECTIVE
, ##
and %:%:
are of type
TokenType.UNKNOWN
as follows:
Tokens | In directive lines | Otherwise |
---|---|---|
# %: |
TokenType.OPERATOR |
TokenType.DIRECTIVE |
## %:%: |
TokenType.OPERATOR |
TokenType.UNKNOWN |
The following table lists all tokens that are digraphs:
Token | Equivalent |
---|---|
<: |
[ |
:> |
] |
<% |
{ |
%> |
} |
%: |
# |
%:%: |
## |
The lexical parser replaces the digraphs with their equivalents. The substituted characters have the child characters that represent the replaced ones.
A directive token consists of a number sign (or hash) character (#
) and the
child tokens. The null directive has no child tokens.
The child tokens must include a directive name, arguments (depending on the directive name), and the end of the directive (that is a newline character). They also may include delimiters and comments. The last of them must be the end of the directive.
The content of the child tokens can be incomplete. For example, they may represent an invalid directive, they may not end with the end of the directive, and so on.
The token type of directives is TokenType.DIRECTIVE
.
The tokens that represent the directive names must have the content which is
either: define
, undef
, include
, if
, ifdef
, ifndef
, else
, elif
,
endif
, line
, error
, or pragma
. Their token type is
TokenType.DIRECTIVE_NAME
.
The tokens that represents the end of the directive must have a newline
character as the content. Their token type is TokenType.DIRECTIVE_END
.
When the directive name equals include
, the argument must be either:
- a standard header name between angle brackets (
<
and>
) - a filename between double quotes (
"
and"
) - any other form that expands to a standard header name or a filename after macro replacement
A standard header name and a filename are preprocessing tokens.
The content of the tokens can be incomplete. For example, they may not terminate, and so on.
The token types of standard header names and filenames are
TokenType.STANDARD_HEADER
and TokenType.FILENAME
, respectively.
When the lexical parser encounters characters that do not fit the above description, it returns an unknown token containing them.
The token type of unknown tokens is TokenType.UNKNOWN
.
- com.maroontress.clione module