Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify class TokenCollector to avoid two versions of maximal token logic #500

Conversation

sungshik
Copy link
Contributor

@sungshik sungshik commented Nov 1, 2024

(The description of this PR is somewhat lengthy -- a bit in the style of a small ADR -- to be able to remember why this PR was needed and how we got here.)

Background

The implementation of the semantic tokenizer consists of three parts:

  • class TokenCollector: conversion from parse trees to lists of semantic tokens
  • class TokenList: data structure that encodes lists of semantic tokens as integer arrays (required by LSP);
  • class TokenTypes: mapping from Rascal categories (= Rascal's legacy semantic token types, TextMate scopes, and LSP semantic token types) to LSP semantic token types;

To compute the semantic tokens for a parse tree, an instance of TokenCollector recursively traverses the parse tree. Each time when a semantic token is identified during the traversal, an instance of TokenList is updated.

Scope

This PR touches only the token collector. Tests were recently added in #494 to gain confidence this PR doesn't break anything.

Motivation

The semantic tokenizer should compute maximal tokens. For instance, if foobar is a string, then the semantic tokenizer should return foobar as semantic token instead of separately foo and bar.

As of 2495f20, logic to compute maximal tokens is actually implemented twice. Roughly:

  1. Token collector: It adds a new token to the token list only when the current character in the parse tree has a different Rascal category than the previous character. This results in maximal tokens in terms of Rascal categories.
  2. Token list: It merges each new token-to-be-added with the previous token when they have the same LSP semantic token type. This results in maximal tokens in terms of LSP semantic token types.

This PR removes point 1 from the token collector, because:

  • It is subsumed by point 2, but not the other way around (maximality in terms of Rascal categories vs. maximality in terms of LSP semantic token types).
  • The token collector has organically grown quite complex as new features were added over the years. Code comprehension/extensibility is starting to become problematic. In anticipation of a fix for Semantic tokenizer makes mistakes when a syntax tree (with category) has syntax children #456, now seems to be a good opportunity to re-simplify the design and implementation of the token collector.

Plan

Before this PR

The token collector has a number of fields that are used for bookkeeping. The fields are used throughout the recursive traversal. These are the relevant commits in which fields were added to support new features:

After this PR

By removing the logic to compute maximal tokens from the token collector (keep it only in the token list), the state of the token collector is simplified to:

private static class TokenCollector {
private int line;
private int column;

The recursive traversal is much simplified as well.

Copy link
Contributor Author

@sungshik sungshik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional comments

@sungshik sungshik changed the title Refactor class TokenCollector Simplify class TokenCollector Nov 4, 2024
@sungshik sungshik marked this pull request as ready for review November 4, 2024 11:20
@sungshik sungshik changed the title Simplify class TokenCollector Simplify class TokenCollector to avoid two implementations of maximal tokens Nov 4, 2024
@sungshik sungshik changed the title Simplify class TokenCollector to avoid two implementations of maximal tokens Simplify class TokenCollector to avoid two versions of maximal token logic Nov 4, 2024
@DavyLandman
Copy link
Member

Thanks for the writeup, and first adding tests 👍

@sungshik sungshik force-pushed the semantic-tokenizer-fall2024-refactor-collector branch from 1c6e471 to 41cfe99 Compare November 4, 2024 13:18
Copy link

sonarqubecloud bot commented Nov 4, 2024

@sungshik sungshik merged commit 4931b7d into semantic-tokenizer-fall2024 Nov 4, 2024
23 checks passed
@sungshik sungshik deleted the semantic-tokenizer-fall2024-refactor-collector branch November 6, 2024 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants