Simplify class `TokenCollector` to avoid two versions of maximal token logic #500

sungshik · 2024-11-01T14:50:56Z

(The description of this PR is somewhat lengthy -- a bit in the style of a small ADR -- to be able to remember why this PR was needed and how we got here.)

Background

The implementation of the semantic tokenizer consists of three parts:

class TokenCollector: conversion from parse trees to lists of semantic tokens;
class TokenList: data structure that encodes lists of semantic tokens as integer arrays (required by LSP);
class TokenTypes: mapping from Rascal categories (= Rascal's legacy semantic token types, TextMate scopes, and LSP semantic token types) to LSP semantic token types.

To compute the semantic tokens for a parse tree, an instance of TokenCollector recursively traverses the tree. Each time a semantic token is identified during the traversal, the collector updates an instance of TokenList.
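For context, LSP transmits semantic tokens as a flat integer array with five integers per token (deltaLine, deltaStartChar, length, tokenType, tokenModifiers), with positions relative to the previous token. Below is a minimal sketch of that encoding. The class `RelativeTokenEncoder` and its methods are hypothetical names chosen for illustration; this is not the actual `TokenList` implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the actual TokenList class from rascal-lsp.
final class RelativeTokenEncoder {
    private final List<Integer> data = new ArrayList<>();
    private int previousLine = 0;
    private int previousStart = 0;

    /** Appends one token; tokens are assumed to arrive in document order. */
    void add(int line, int startColumn, int length, int tokenType, int tokenModifiers) {
        int deltaLine = line - previousLine;
        // The start column is relative only when the token is on the same line as the previous one.
        int deltaStart = deltaLine == 0 ? startColumn - previousStart : startColumn;
        data.add(deltaLine);
        data.add(deltaStart);
        data.add(length);
        data.add(tokenType);
        data.add(tokenModifiers);
        previousLine = line;
        previousStart = startColumn;
    }

    /** Returns an immutable copy of the encoded integers, five per token. */
    List<Integer> encoded() {
        return List.copyOf(data);
    }
}
```

Because every token is encoded relative to its predecessor, the order in which the collector reports tokens matters: they have to be added in document order.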

Scope

This PR touches only the token collector. Tests were recently added in #494 to gain confidence this PR doesn't break anything.

Motivation

The semantic tokenizer should compute maximal tokens. For instance, if foobar is a string, then the semantic tokenizer should return foobar as a single semantic token rather than foo and bar as two separate tokens.

As of 2495f20, logic to compute maximal tokens is actually implemented twice. Roughly:

1. Token collector: it adds a new token to the token list only when the current character in the parse tree has a different Rascal category than the previous character. This results in maximal tokens in terms of Rascal categories.
2. Token list: it merges each new token-to-be-added with the previous token when they have the same LSP semantic token type. This results in maximal tokens in terms of LSP semantic token types.
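The merging done by the token list can be sketched as follows. This is an illustration under stated assumptions, not the code in `TokenList`: the class `MergingTokenList`, its `Token` type, and the rule that only directly adjacent tokens on the same line are merged are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of merging in the token list; names and the adjacency rule are assumptions.
final class MergingTokenList {
    static final class Token {
        final int line;
        final int startColumn;
        final int length;
        final String lspType;

        Token(int line, int startColumn, int length, String lspType) {
            this.line = line;
            this.startColumn = startColumn;
            this.length = length;
            this.lspType = lspType;
        }
    }

    private final List<Token> tokens = new ArrayList<>();

    void add(int line, int startColumn, int length, String lspType) {
        if (!tokens.isEmpty()) {
            Token last = tokens.get(tokens.size() - 1);
            // Assumption: merge only when the new token starts exactly where the previous one ends.
            boolean adjacent = last.line == line && last.startColumn + last.length == startColumn;
            if (adjacent && last.lspType.equals(lspType)) {
                // Grow the previous token instead of appending a new one.
                tokens.set(tokens.size() - 1,
                    new Token(last.line, last.startColumn, last.length + length, lspType));
                return;
            }
        }
        tokens.add(new Token(line, startColumn, length, lspType));
    }

    List<Token> tokens() {
        return List.copyOf(tokens);
    }
}
```

With this in place, a caller can add foo and bar separately (both of type string, directly adjacent) and still end up with the single token foobar.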

This PR removes point 1 from the token collector, because:

It is subsumed by point 2, but not the other way around (maximality in terms of Rascal categories vs. maximality in terms of LSP semantic token types).
The token collector has organically grown quite complex as new features were added over the years, and the code is becoming hard to understand and extend. In anticipation of a fix for #456 ("Semantic tokenizer makes mistakes when a syntax tree (with category) has syntax children"), now seems a good opportunity to re-simplify the design and implementation of the token collector.
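To illustrate why point 1 is subsumed by point 2 but not the other way around, here is a small sketch. The category names and the mapping are hypothetical (the real mapping lives in `TokenTypes` and is not shown in this PR); the only assumption that matters is that two different Rascal categories can map to the same LSP semantic token type.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration; category names and the mapping are made up for this example.
final class MaximalityDemo {
    static final Map<String, String> TO_LSP_TYPE = Map.of(
        "categoryA", "string",
        "categoryB", "string",
        "categoryC", "number");

    /** Maps a per-character list of Rascal categories to LSP semantic token types. */
    static List<String> toLspTypes(List<String> categories) {
        return categories.stream().map(TO_LSP_TYPE::get).collect(Collectors.toList());
    }

    /** Counts maximal runs of equal adjacent keys, i.e. the number of tokens produced. */
    static int countRuns(List<String> keys) {
        int runs = 0;
        String previous = null;
        for (String key : keys) {
            if (!key.equals(previous)) {
                runs++;
            }
            previous = key;
        }
        return runs;
    }
}
```

For the six characters of foobar with categories A, A, A, B, B, B, grouping by Rascal category yields two tokens, while grouping by LSP type yields one. Every run of equal categories lies inside a run of equal LSP types, so merging by LSP type alone is sufficient.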

Plan

Before this PR

The token collector has a number of fields that are used for bookkeeping. The fields are used throughout the recursive traversal. These are the relevant commits in which fields were added to support new features:

Initial implementation:

rascal-language-servers/rascal-lsp/src/main/java/org/rascalmpl/vscode/lsp/util/SemanticTokenizer.java

Lines 130 to 131 in 6a73d54

private static class TokenCollector extends TreeVisitor<RuntimeException> {
    private int location;

After adding support for subtrees without categories:

rascal-language-servers/rascal-lsp/src/main/java/org/rascalmpl/vscode/lsp/util/SemanticTokenizer.java

Lines 155 to 158 in cb733e1

    
private static class TokenCollector {
    private int location;
    private int line;
    private int column;

After adding support for multiline subtrees:

rascal-language-servers/rascal-lsp/src/main/java/org/rascalmpl/vscode/lsp/util/SemanticTokenizer.java

Lines 363 to 367 in 9bef283

    
private static class TokenCollector {
    private int line;
    private int column;
    private int startLineCurrentToken;
    private int startColumnCurrentToken;

After adding support for subtrees with nested categories:

rascal-language-servers/rascal-lsp/src/main/java/org/rascalmpl/vscode/lsp/util/SemanticTokenizer.java

Lines 391 to 396 in ee70838

    
private static class TokenCollector {
    private int line;
    private int column;
    private int startLineCurrentToken;
    private int startColumnCurrentToken;
    private String currentTokenCategory;

After this PR

By removing the logic to compute maximal tokens from the token collector (and keeping it only in the token list), the state of the token collector is simplified to:

rascal-language-servers/rascal-lsp/src/main/java/org/rascalmpl/vscode/lsp/util/SemanticTokenizer.java

Lines 394 to 396 in 6c2ed04

    
private static class TokenCollector {
    private int line;
    private int column;

The recursive traversal is much simplified as well.
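As a rough sketch of what a traversal with only `line` and `column` as state could look like (this is not the code of this PR): `Node` is a hypothetical stand-in for a parse tree, tokens are reported one character at a time on the assumption that the token list merges them, and letting a node's own category take precedence over the enclosing one is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; Node is a stand-in for a parse tree, not a rascal-lsp type.
final class CollectorSketch {
    static final class Node {
        final String category;     // null when the node carries no category
        final String text;         // non-null for leaves only
        final List<Node> children; // empty for leaves

        private Node(String category, String text, List<Node> children) {
            this.category = category;
            this.text = text;
            this.children = children;
        }

        static Node leaf(String text) {
            return new Node(null, text, List.of());
        }

        static Node tree(String category, Node... children) {
            return new Node(category, null, List.of(children));
        }
    }

    // The only bookkeeping state: the current position in the source text.
    private int line = 0;
    private int column = 0;

    // Reported tokens as "line:column:category", one per character; merging is left to the consumer.
    final List<String> emitted = new ArrayList<>();

    void collect(Node node, String enclosingCategory) {
        // Assumption: a node's own category takes precedence over the enclosing one.
        String category = node.category != null ? node.category : enclosingCategory;
        if (node.text == null) {
            for (Node child : node.children) {
                collect(child, category);
            }
            return;
        }
        for (char c : node.text.toCharArray()) {
            if (c == '\n') {
                // Simplification: only '\n' is treated as a line break.
                line++;
                column = 0;
                continue;
            }
            if (category != null) {
                emitted.add(line + ":" + column + ":" + category);
            }
            column++;
        }
    }
}
```

The collector no longer needs to remember where the current token started or which category it has; it only tracks the position and reports what it sees.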

sungshik

Additional comments

rascal-lsp/src/main/java/org/rascalmpl/vscode/lsp/util/SemanticTokenizer.java

DavyLandman · 2024-11-04T12:23:08Z

Thanks for the writeup, and first adding tests 👍

sonarqubecloud · 2024-11-04T13:28:39Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
84.6% Coverage on New Code
0.0% Duplication on New Code


Refactor class TokenCollector of the semantic tokenizer

6c2ed04

sungshik mentioned this pull request Nov 1, 2024

Semantic tokenizer improvements: tests, simplification, syntax-in-syntax bugfix #497

Merged

sungshik added 3 commits November 4, 2024 11:44

Simplify code to collect ambiguity semantic tokens

5b7b264

Make diff smaller

a7aa3f9

Make diff smaller

41cfe99

sungshik commented Nov 4, 2024: two review threads on rascal-lsp/src/main/java/org/rascalmpl/vscode/lsp/util/SemanticTokenizer.java (both resolved)

sungshik changed the title ~~Refactor class TokenCollector~~ Simplify class TokenCollector Nov 4, 2024

sungshik marked this pull request as ready for review November 4, 2024 11:20

sungshik changed the title ~~Simplify class TokenCollector~~ Simplify class TokenCollector to avoid two implementations of maximal tokens Nov 4, 2024

sungshik changed the title ~~Simplify class TokenCollector to avoid two implementations of maximal tokens~~ Simplify class TokenCollector to avoid two versions of maximal token logic Nov 4, 2024

PieterOlivier approved these changes Nov 4, 2024


sungshik force-pushed the semantic-tokenizer-fall2024-refactor-collector branch from 1c6e471 to 41cfe99 Compare November 4, 2024 13:18

sungshik merged commit 4931b7d into semantic-tokenizer-fall2024 Nov 4, 2024
23 checks passed

sungshik deleted the semantic-tokenizer-fall2024-refactor-collector branch November 6, 2024 15:18
