Add support for Nim (with tests) #633

Open
wants to merge 151 commits into
base: master

@gcr gcr commented Jan 31, 2024

Nim (formerly Nimrod) is a compiled systems language with type inference, macros, and memory safety. It's becoming more common; Nitter, for example, uses it.

This patch adds tree-sitter support from https://github.com/alaviss/tree-sitter-nim, which uses the MPL-2 license. If this isn't acceptable, I can try to find another implementation, e.g. alaviss/tree-sitter-nim#11 mentions https://github.com/aMOPel/tree-sitter-nim.

macOS CI doesn't like it
There are still a few bugs, mostly related to EOL/DEDENT compositions.
Previously, non-capturing groups were used, which worked fine but was
overkill and probably complicated the parser a bit.
The scanner is now a bottom-up scanner, in the hope that this will make
adding more rules easier.
This is no longer a problem with the switch to indentation relations.
This reduces the number of syntax nodes drastically.
This commit removes newline from "tokens that can appear anywhere" and
replaces it with a strict layout definition.

This prevents erroneous code like

```
type
X = int
```

from being parsed as a single type definition.
The statement -> expression -> ... tree makes the generated CST hard to
navigate while providing little information of use.

This commit hides them.

With that hidden, we can now separate call syntax into two without
making the CST look weird.
This fits the definition of how command call is evaluated perfectly!
This commit adds new grammar for the following blocks: `block`, `if`, `when`,
`case`, `try`, `for`, `while`.
Tuple expression with one entry has to have a trailing comma
Covers basic statements like: `import`, `export`, `discard`, `return`, etc.
This prevents a separate syntax node from being created for each character
and instead groups them whenever possible.
Previously, the lexer's layout tokens were composed of both newlines and
whitespace. This required disambiguation at the lexer level, and things
like DEDENT -> INDENT_EQ required lexing tricks such as expecting the
other token to be requested precisely after DEDENT.

Things like lines leading into a comment also required special tokens like
SPACES_BEFORE_COMMENT.

This commit decouples newlines from layouting tokens and uses the
grammar to specify them. This brings several advantages:

* Comments can be transparently handled since extra tokens are scanned
  automatically between newline -> layout.

* The parser can catch ambiguous definitions at compile time and require
  explicit disambiguation.

* The loosened tokens can be used as extra tokens in the grammar to
  track indentation across all lines. This improves the accuracy of
  layout tokens and allows some helpers to be removed.

Alongside this, long string content has been replicated in the grammar
itself and only the quote handling portion is offloaded.
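
To illustrate what "tracking indentation across all lines" can look like, here is a minimal Python sketch. It is not the tree-sitter-nim scanner: the function name `layout_tokens` is hypothetical, the token names are borrowed loosely from the messages above, and the sketch assumes spaces-only indentation and dedents that return to an earlier column.

```python
from typing import List, Tuple


def layout_tokens(source: str) -> List[Tuple[int, str]]:
    """Classify each significant line by how its indentation relates to the
    enclosing block. Token names here are illustrative only."""
    stack = [0]  # columns of the currently open blocks
    tokens: List[Tuple[int, str]] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        text = line.lstrip(" ")
        # Blank and comment-only lines carry no layout information, so they
        # are skipped, much like extra tokens between a newline and a layout.
        if not text or text.startswith("#"):
            continue
        column = len(line) - len(text)
        if column > stack[-1]:
            # Deeper than the current block: a new block opens.
            stack.append(column)
            tokens.append((lineno, "INDENT"))
            continue
        # Shallower than the current block: close blocks until we match.
        while column < stack[-1]:
            stack.pop()
            tokens.append((lineno, "DEDENT"))
        tokens.append((lineno, "INDENT_EQ"))
    return tokens
```

For the unindented `type` / `X = int` example earlier in this thread, the second line yields `INDENT_EQ` rather than `INDENT`, which is the kind of signal a strict layout grammar could use to refuse to treat it as the body of the `type` section.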
A complete rewrite of the grammar, with almost no parts used from the
original.

The parser is now capable of parsing ~97% of all files within the
nim-lang/Nim repository, however the parser size is extremely large:
132MiB.

Some features are currently disabled to prevent the parser from
exceeding tree-sitter's own limit.
Put infix operators as inline within the grammar

This allows us to remove 11 external scanning nodes. Tie breaking against command call & prefix is now done solely on the prefix operator.

Symbolic operators are now exposed as nodes to allow syntax queries to match against them.

Due to size explosion, Unicode operators are disabled.

Included are some fixes for scanner flags reset on new line.
alaviss and others added 26 commits December 11, 2023 06:04
improve field accuracy and introduce new fields

This pull includes two changes:

* Make sure fields pinpoint the node of interest
* Improve field consistency within the grammar and introduce
  more fields for other complex structures.

This pull contains breaking changes for a few field names.
scanner: allow end on EOF and use column for flag invalidation

Allowing layout_end on EOF allows the parser to close and isolate grammar portions where layout_end is typically not allowed (i.e. in parentheses), which enables better error recovery.

The empty termination hack does not seem to contribute anymore with this change and was removed.

With this change, we also remove the use of synchronize node for fixing after newline. Instead it is cleared based on column position at the start of the lexer. This should reduce errors from node reuse, but is more or less a hunch since it is hard to test incremental parsing.

Fixes alaviss/tree-sitter-nim#64.
make type declaration RHS optional

This syntax quirk is utilized by `system.nim`

Fixes alaviss/tree-sitter-nim#66.
In certain scenarios, the parser might crash due to an OOB in tree-sitter
during get_column() at EOF. Since tree-sitter still hasn't released a
new version with the fix, we will have to solve it here ourselves.

Ref tree-sitter/tree-sitter#2563
scanner: add workaround for column at EOF
This should align these with their when/case conditional counterparts.
add missing consequence field to object declaration
add missing alternative field to object variants
Since the scanner no longer consumes input when emitting layout tokens,
we can recover the scanner state by rescanning the input instead of
storing it. This allows the removal of some state invalidation schemes
and avoids get_column overhead.

This change has some implications for error recovery, as layout
termination no longer changes the scanner state. For the regressed cases,
it appears that making the body of a section optional was enough to make
tree-sitter produce better recoveries.
This would make indentation queries for this case a lot
easier to write.
This drops the parser size by 6MiB.

An unfortunate consequence is that `else` no longer terminates
a case expression, but that feature shouldn't be used by anyone.
tree-sitter regexes are not the same as JS regexes, and "useless"
escapes are often required.
This one trick dropped state count/large state count from 20478/10711 to
20193/10568.
This is 1:1 against the way the compiler parses these.

The infix parsing grammar had to be duplicated due to the unique position
that this matches, which is quite ugly. Unfortunately this syntax is
widely used and thus parser support must be provided.
Also added metadata to show Sponsor button on GitHub.
This update required both actions to be updated at the same time.
changes staged for 0.5.0

Notable changes:

- Reduced the amount of states used for tracking layout
- Support for concept without a body
- Support for type(x) expressions at the top level

Shortlog:

  Leorize (9):
        remove flag and indentation tracking across scans
        allow concept body to be omitted
        grammar: share if alternatives between if and case
        eslint: disable useless-escape rule
        grammar: factor out for loop body
        support old type(x) expression in statement lists
        update readme for the current project status
        ci: bump upload/download artifact version
        bump version to 0.5.0
…c7092cd7f4f6a251789af121'

git-subtree-dir: vendored_parsers/tree-sitter-nim
git-subtree-mainline: 7fda26d
git-subtree-split: 70ceee8
Checked on a few popular repositories on GitHub like nitter and jester.
Wilfred (Owner) commented Feb 5, 2024

Thanks for the PR! I'm afraid I can't accept this in the current state: the parser is just too big (parser.c is 66MiB). Difftastic already has problems with the git repo being too big, and this parser is bigger than the largest parsers currently included.

I'd like to support Nim, but I need a smaller file. Say something smaller than 30MiB.

maxbrunsfeld (Contributor) commented Mar 21, 2024

@Wilfred, if you're open to enabling the wasm feature for Tree-sitter (which adds a dependency on wasmtime), you could consider switching away from vendoring all of the Tree-sitter grammars, and instead allow users to add their own parsers at runtime via WASM files. With the wasm feature, the native Tree-sitter library can load Language objects from wasm files, but perform native parsing with the same Rust API as normal (only the lexing phase uses WASM, so performance is not impacted very much, and you're still free to Send the resulting syntax trees to other threads as normal).

You could probably make it seamless for users by bundling a list of known grammars (with file extensions and such) and just store URLs where the corresponding WASM files can be downloaded from.
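
A bundled list of known grammars could be as small as a lookup table keyed by file extension. The Python sketch below is only an illustration of that idea, not difftastic or Tree-sitter code: `GrammarEntry`, `KNOWN_GRAMMARS`, and `grammar_for` are hypothetical names, and the URLs are placeholders rather than real download locations.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Optional, Tuple


@dataclass(frozen=True)
class GrammarEntry:
    name: str
    extensions: Tuple[str, ...]
    wasm_url: str  # placeholder URL, not a real hosting location


# Hypothetical bundled registry; a real one would list many more languages.
KNOWN_GRAMMARS = (
    GrammarEntry("nim", (".nim", ".nims", ".nimble"),
                 "https://example.invalid/grammars/nim.wasm"),
    GrammarEntry("toml", (".toml",),
                 "https://example.invalid/grammars/toml.wasm"),
)


def grammar_for(path: str) -> Optional[GrammarEntry]:
    """Return the registry entry whose extensions match the file, if any."""
    suffix = PurePath(path).suffix.lower()
    for entry in KNOWN_GRAMMARS:
        if suffix in entry.extensions:
            return entry
    return None
```

With a table like this, the tool would only need to fetch a WASM file the first time it sees a file of that language, and files with no matching entry could fall back to a plain-text diff.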

It's a very new Tree-sitter feature, developed for the Zed editor's new extension system, but it works pretty well, and I think it might be well-suited for your use case, and solve the problem of needing to bundle a large set of languages.

I know this is off-topic; I just thought I'd mention it here, since this PR was linked from a HN thread.

Wilfred (Owner) commented Mar 22, 2024

@maxbrunsfeld ooh, I am very interested in this! The difft binary is pretty large, and having a nice way to distribute parsers separately would really help. Some distro packagers have expressed a preference for not vendoring parsers too.

I need highlights.scm too though: difftastic needs to know which nodes are strings/comments, and tree-sitter complains if you load a highlighting file that doesn't match the loaded parser. How does Zed handle this?

(I imagine Zed also needs to associate file extensions with languages, just like difftastic, so maybe you have a solution for that metadata too?)

maxbrunsfeld (Contributor) commented

Yeah, in Zed, extensions are specified via a combination of:

  • a .wasm file for the Tree-sitter parser
  • a set of .scm files containing queries for highlighting, language injection, outline symbols, etc
  • a TOML file with metadata about the language (user-facing name, file extensions, other editor configuration)

I'm guessing Difftastic would want a slightly different packaging format, because you don't need all of the stuff Zed uses, but I think a similar approach would probably work.
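
As a rough sketch of what a slimmer package check might look like, the Python function below reports which of the three pieces described above are absent from a list of file names. The function name `missing_parts` and the exact requirements (one `.wasm` file, a `highlights.scm`, one `.toml` file) are assumptions made for illustration; neither Zed nor difftastic defines this format here.

```python
from typing import Iterable, List


def missing_parts(files: Iterable[str]) -> List[str]:
    """List the pieces a hypothetical parser package still lacks."""
    names = set(files)
    missing = []
    # The compiled Tree-sitter parser.
    if not any(name.endswith(".wasm") for name in names):
        missing.append("parser (.wasm)")
    # The highlighting query, needed to identify strings and comments.
    if "highlights.scm" not in names:
        missing.append("highlights.scm")
    # Language metadata such as the name and file extensions.
    if not any(name.endswith(".toml") for name in names):
        missing.append("metadata (.toml)")
    return missing
```

An empty result means the package has everything this sketch looks for; anything else could be shown to the user as a reason the language cannot be loaded.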

For now, these WASM files would need to be hosted somewhere. The WASM mode of compiling parsers isn't widely used yet, but down the road, I'd love to start standardizing on ways that Tree-sitter grammars store the WASM builds and queries. Maybe just GitHub release assets.
