Add support for Nim (with tests) #633

gcr · 2024-01-31T23:21:02Z

Nim (formerly Nimrod) is a compiled systems language with type inference, macros, and memory safety. It's becoming more common; Nitter, for example, uses it.

This patch adds tree-sitter support from https://github.com/alaviss/tree-sitter-nim , which uses the MPL-2 license. If this isn't acceptable, I can try to find another implementation, e.g. alaviss/tree-sitter-nim#11 mentions https://github.com/aMOPel/tree-sitter-nim.

The statement -> expression -> ... tree makes the generated CST hard to navigate while providing little useful information. This commit hides those nodes. With them hidden, we can now separate call syntax into two without making the CST look weird.

Previously, the lexer's layout tokens were composed of both newlines and whitespace. This required disambiguation at the lexer level, and sequences like DEDENT -> INDENT_EQ required lexing tricks such as expecting the other token to be requested precisely after DEDENT. Things like lines into comments also required special tokens like SPACES_BEFORE_COMMENT.

This commit decouples newlines from layout tokens and uses the grammar to specify them. This brings several advantages:

* Comments can be handled transparently, since extra tokens are scanned automatically between newline -> layout.
* The parser can catch ambiguous definitions at compile time and require explicit disambiguation.
* The loosened tokens can be used as extra tokens in the grammar to track indentation across all lines. This improves the accuracy of layout tokens and allows some helpers to be removed.

Alongside this, long string content has been replicated in the grammar itself and only the quote handling portion is offloaded.
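For readers unfamiliar with layout tokens, the following Python sketch shows the generic indentation-stack technique that layout-sensitive parsers commonly build on. It is not the scanner's actual code: the function name `layout_tokens` and the `INDENT` token name are assumptions made for illustration (only `DEDENT` and `INDENT_EQ` appear in the commit message), and inconsistent dedents are not diagnosed.

```python
def layout_tokens(source: str) -> list:
    """Return (line_number, token) pairs computed from an indentation stack.

    Illustrative only: a generic off-side-rule sketch, not the
    tree-sitter-nim scanner.
    """
    stack = [0]  # indentation columns of the currently open blocks
    tokens = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            # Blank and comment-only lines do not affect layout.
            continue
        col = len(line) - len(line.lstrip(" "))
        if col > stack[-1]:
            # Deeper than the current block: open a new one.
            stack.append(col)
            tokens.append((lineno, "INDENT"))
        else:
            # Close every block that is deeper than this line.
            while col < stack[-1]:
                stack.pop()
                tokens.append((lineno, "DEDENT"))
            tokens.append((lineno, "INDENT_EQ"))
    return tokens


sample = "if x:\n  echo 1\n  # note\n  echo 2\necho 3\n"
for item in layout_tokens(sample):
    print(item)
```

Note that the comment-only line produces no token, and the final line yields a `DEDENT` followed by an `INDENT_EQ`, the same pairing the commit message mentions.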

A complete rewrite of the grammar, with almost no parts used from the original. The parser is now capable of parsing ~97% of all files within the nim-lang/Nim repository; however, the parser size is extremely large: 132MiB. Some features are currently disabled to prevent the parser from exceeding tree-sitter's own limit.

Put infix operators as inline within the grammar

This allows us to remove 11 external scanning nodes. Tie breaking against command call & prefix is now done solely on the prefix operator. Symbolic operators are now exposed as nodes to allow syntax queries to match against them. Due to size explosion, Unicode operators are disabled. Included are some fixes for scanner flags reset on new line.

improve field accuracy and introduce new fields

This pull includes two changes:

* Make sure fields pinpoint the node of interest.
* Improve field consistency within the grammar and introduce more fields for other complex structures.

This pull contains breaking changes for a few field names.

scanner: allow end on EOF and use column for flag invalidation

Allowing layout_end on EOF allows the parser to close and isolate grammar portions where layout_end is typically not allowed (i.e. in parentheses), which enables better error recovery. The empty termination hack does not seem to contribute anymore with this change and was removed.

With this change, we also remove the use of the synchronize node for fixing after newline. Instead it is cleared based on column position at the start of the lexer. This should reduce errors from node reuse, but is more or less a hunch since it is hard to test incremental parsing.

Fixes alaviss/tree-sitter-nim#64.

Since the scanner no longer consumes input when emitting layout tokens, we can recover the scanner state by rescanning the input instead of storing it. This allows the removal of some state invalidation schemes and avoids get_column overhead. This change has some implications for error recovery, as layout termination no longer changes the scanner state. For the regressed cases, it appears that making the body of a section optional was enough to make tree-sitter produce better recoveries.

This is 1:1 against the way the compiler parses these. The infix parsing grammar had to be duplicated due to the unique position that this matches, which is quite ugly. Unfortunately this syntax is widely used and thus parser support must be provided.

changes staged for 0.5.0

Notable changes:
- Reduced the amount of states used for tracking layout
- Support for concept without a body
- Support for type(x) expressions at the top level

Shortlog:

Leorize (9):
- remove flag and indentation tracking across scans
- allow concept body to be omitted
- grammar: share if alternatives between if and case
- eslint: disable useless-escape rule
- grammar: factor out for loop body
- support old type(x) expression in statement lists
- update readme for the current project status
- ci: bump upload/download artifact version
- bump version to 0.5.0

Wilfred · 2024-02-05T16:48:32Z

Thanks for the PR! I'm afraid I can't accept this in its current state: the parser is just too big (parser.c is 66MiB). Difftastic already has problems with the git repo being too big, and this parser is bigger than the largest parsers currently included.

I'd like to support Nim, but I need a smaller file. Say something smaller than 30MiB.

maxbrunsfeld · 2024-03-21T00:01:03Z

@Wilfred, if you're open to enabling the wasm feature for Tree-sitter (which adds a dependency on wasmtime), you could consider switching away from vendoring all of the Tree-sitter grammars, and instead allow users to add their own parsers at runtime via WASM files. With the wasm feature, the native Tree-sitter library can load Language objects from wasm files, but perform native parsing with the same Rust API as normal (only the lexing phase uses WASM, so performance is not impacted very much, and you're still free to Send the resulting syntax trees to other threads as normal).

You could probably make it seamless for users by bundling a list of known grammars (with file extensions and such) and just store URLs where the corresponding WASM files can be downloaded from.
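A minimal sketch of what such a bundled list might look like, assuming a simple in-memory table. The names `GrammarEntry`, `KNOWN_GRAMMARS`, and `grammar_for_path` are hypothetical, and the URLs are placeholders rather than real hosted WASM builds; nothing here reflects how Difftastic or Zed actually stores this metadata.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Optional


@dataclass(frozen=True)
class GrammarEntry:
    name: str           # user-facing language name
    extensions: tuple   # file extensions without the leading dot
    wasm_url: str       # hypothetical download location of the parser


# Illustrative entries only; the URLs are placeholders.
KNOWN_GRAMMARS = (
    GrammarEntry("Nim", ("nim", "nims", "nimble"),
                 "https://example.invalid/grammars/nim.wasm"),
    GrammarEntry("Rust", ("rs",),
                 "https://example.invalid/grammars/rust.wasm"),
)


def grammar_for_path(path: str) -> Optional[GrammarEntry]:
    """Look up the grammar entry for a file by its extension."""
    ext = PurePath(path).suffix.lstrip(".").lower()
    for entry in KNOWN_GRAMMARS:
        if ext in entry.extensions:
            return entry
    return None  # unknown extension or no extension at all


entry = grammar_for_path("src/nitter.nim")
print(entry.name if entry else "no grammar known")
```

A tool built this way would only need to fetch the file behind `wasm_url` the first time a matching extension is encountered, so the large Nim parser would not have to ship with the binary.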

It's a very new Tree-sitter feature, developed for the Zed editor's new extension system, but it works pretty well, and I think it might be well-suited for your use case, and solve the problem of needing to bundle a large set of languages.

I know this is off-topic; I just thought I'd mention it here, since this PR was linked from a HN thread.

Wilfred · 2024-03-22T22:46:37Z

@maxbrunsfeld ooh, I am very interested in this! The difft binary is pretty large, and having a nice way to distribute parsers separately would really help. Some distro packagers have expressed a preference for not vendoring parsers too.

I need highlights.scm too though: difftastic needs to know which nodes are strings/comments, and tree-sitter complains if you load a highlighting file that doesn't match the loaded parser. How does Zed handle this?

(I imagine Zed also needs to associate file extensions with languages, just like difftastic, so maybe you have a solution for that metadata too?)

maxbrunsfeld · 2024-03-23T00:08:55Z

Yeah, in Zed, extensions are specified via a combination of:

- a .wasm file for the Tree-sitter parser
- a set of .scm files containing queries for highlighting, language injection, outline symbols, etc.
- a TOML file with metadata about the language (user-facing name, file extensions, other editor configuration)

I'm guessing Difftastic would want a slightly different packaging format, because you don't need all of the stuff Zed uses, but I think a similar approach would probably work.

For now, these WASM files would need to be hosted somewhere. The WASM mode of compiling parsers isn't widely used yet, but down the road, I'd love to start standardizing on ways that Tree-sitter grammars store the WASM builds and queries. Maybe just GitHub release assets.

alaviss added 30 commits August 24, 2022 21:59

initial commit

ecf6006

scanner: use less C++11 stuff

79fa05c

macOS CI doesn't like it

add grammar for tuple and tuple deconstruction def

8a9f74a

add grammar for type sections and type specifications

60a43cd

There are still a few bugs, mostly related to EOL/DEDENT compositions.

use character classes for selective case-insensitivity

d5872db

Previously, non-capturing groups were used, which worked fine but was overkill and probably complicated the parser a bit.
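To illustrate the character-class approach named in this commit's title, here is a small Python sketch that builds a regex for a keyword. It is not the grammar's actual code: `keyword_pattern` is a hypothetical helper, and it assumes Nim's identifier comparison rule (the first character is case-sensitive, the rest are not) while leaving out underscore handling for brevity.

```python
import re


def keyword_pattern(word: str) -> str:
    """Build a regex with a case-sensitive first character and
    case-insensitive remaining characters, using character classes."""
    first, rest = word[0], word[1:]
    parts = [re.escape(first)]
    for ch in rest:
        if ch.isalpha():
            # One class per letter instead of a (?:...) group.
            parts.append(f"[{ch.lower()}{ch.upper()}]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


pattern = keyword_pattern("else")
print(pattern)  # e[lL][sS][eE]

matcher = re.compile(pattern)
print(bool(matcher.fullmatch("eLSe")))  # True
print(bool(matcher.fullmatch("Else")))  # False
```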

switch to indentation relations

6d52ab6

The scanner is now a bottom-up scanner, in the hope that this will make adding more rules easier.

remove dynamic precedence on object_type

b2b46ec

This is no longer a problem with the switch to indentation relations.

turn numeric literals into a single token

83677ae

This reduces the number of syntax nodes drastically.

add basic string literals

e3fa443

add long string grammar and tokenize raw strings

2b1011b

supports escaped quotes within raw string

9dd5a92

long string literals may have the r prefix

42fea5c

add grammar for single-line comments

bdd6d3e

add missing tests for comments

d8ef67d

basic support for simple calls

5d869f9

completely define the source layout

1311196

This commit removes newline from "tokens that can appear anywhere" and replaces it with a strict layout definition. This prevents erroneous code like ``` type X = int ``` from being parsed as a single type definition.

add a cumulative status check for required checks

9a1e38c

replace command_call with left associativity (Wilfred#2)

5976424

This fits the definition of how command call is evaluated perfectly!

get rid of command_call rule (Wilfred#3)

3876cb6

Follow up to 5976424

Add grammars for block-style statements

6026bc5

This commit adds new grammar for the following blocks: `block`, `if`, `when`, `case`, `try`, `for`, `while`.

correct tuple grammar on singular tuples

f102c56

A tuple expression with one entry must have a trailing comma.

allow for calls with empty arguments list

05fe522

add grammar for statements (Wilfred#7)

9042bef

Covers basic statements like: `import`, `export`, `discard`, `return`, etc.

tokenize interpreted string text

cf03d90

This prevents a separate syntax node from being created for each character and instead groups them whenever possible.

add block comment grammar

f9e1155

alaviss and others added 26 commits December 11, 2023 06:04

make type declaration RHS optional

05bffa0

This syntax quirk is utilized by system.nim. Fixes alaviss/tree-sitter-nim#66.

Merge pull request Wilfred#72 from alaviss/type-opt

740d062

make type declaration RHS optional

This syntax quirk is utilized by `system.nim`. Fixes alaviss/tree-sitter-nim#66.

bump version to 0.4.0

6a11bc7

Merge pull request Wilfred#73 from alaviss/bump-0.4.0

77762e3

bump version to 0.4.0

scanner: add workaround for column at EOF

6d7c079

In certain scenarios, the parser might crash due to an OOB in tree-sitter during get_column() at EOF. Since tree-sitter still hasn't released a new version with the fix, we will have to solve it here ourselves. Ref tree-sitter/tree-sitter#2563

Merge pull request Wilfred#74 from alaviss/column-eof

d41fd3e

scanner: add workaround for column at EOF

add missing consequence field to object declaration

37d5f37

This should align these with their when/case conditional counterparts.

Merge pull request Wilfred#76 from alaviss/field-obj

0fdb059

add missing consequence field to object declaration

add missing alternative field to object variants

d31594d

Fixes alaviss/tree-sitter-nim#75

Merge pull request Wilfred#79 from alaviss/variant-alt

482e2f4

add missing alternative field to object variants

allow concept body to be omitted

2e0eb2a

This would make indentation queries for this case a lot easier to write.

grammar: share if alternatives between if and case

e36916b

This drops the parser size by 6MiB. An unfortunate consequence is that `else` no longer terminates a case expression, but that feature shouldn't be used by anyone.

eslint: disable useless-escape rule

4e0c321

tree-sitter regexes are not the same as JS regexes, and "useless" escapes are often required.

grammar: factor out for loop body

b397db6

This one trick dropped state count/large state count from 20478/10711 to 20193/10568.

update readme for the current project status

38d17e7

Also added metadata to show Sponsor button on GitHub.

ci: bump upload/download artifact version

1e8b9cb

This update required both actions to be updated at the same time.

bump version to 0.5.0

901306b

Add 'vendored_parsers/tree-sitter-nim/' from commit '70ceee835e033acb…

386cc7d

…c7092cd7f4f6a251789af121'

git-subtree-dir: vendored_parsers/tree-sitter-nim
git-subtree-mainline: 7fda26d
git-subtree-split: 70ceee8

Add support for nim, including regression tests.

5748274

Checked on a few popular repositories on GitHub, like nitter and jester.

Whitespace

f546f5b
