TRLC performance analysis and improvements #43

Open
2 of 5 tasks
florianschanda opened this issue Oct 23, 2023 · 0 comments
Labels
topic: core Affects lexer/parser/infrastructure

florianschanda commented Oct 23, 2023

The worst offenders for tests-system/bulk are:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   31.589   31.589 trlc/trlc.py:21(<module>)
        1    0.000    0.000   31.571   31.571 trlc/trlc.py:490(main)
        1    0.000    0.000   31.558   31.558 trlc/trlc.py:422(process)
        1    0.000    0.000   27.343   27.343 trlc/trlc.py:364(parse_trlc_files)
       85    0.037    0.000   27.338    0.322 trlc/parser.py:1705(parse_trlc_file)
    63859    0.044    0.000   27.257    0.000 trlc/parser.py:1577(parse_trlc_entry)
    63859    0.887    0.000   27.084    0.000 trlc/parser.py:1530(parse_record_object_declaration)
  1217740    0.433    0.000   19.282    0.000 trlc/parser.py:167(match)
  1217930    0.643    0.000   18.853    0.000 trlc/parser.py:139(advance)
  1217930    3.994    0.000   18.210    0.000 trlc/lexer.py:332(token)
   319609    0.692    0.000   10.996    0.000 trlc/parser.py:1352(parse_value)
  6214992    3.212    0.000    4.665    0.000 trlc/lexer.py:215(is_alnum)
   581360    0.591    0.000    4.360    0.000 trlc/ast.py:3060(lookup_direct)
        1    0.017    0.017    4.212    4.212 trlc/trlc.py:391(resolve_record_references)
    63859    0.144    0.000    4.160    0.000 trlc/ast.py:2873(resolve_references)
   109334    0.077    0.000    4.030    0.000 trlc/ast.py:1063(resolve_references)
    74685    0.032    0.000    3.853    0.000 trlc/ast.py:904(resolve_references)
  9752278    3.774    0.000    3.774    0.000 trlc/lexer.py:238(advance)
       10    0.298    0.030    3.659    0.366 /usr/lib/python3.8/difflib.py:688(get_close_matches)
    94872    0.136    0.000    2.933    0.000 trlc/parser.py:328(parse_qualified_name)
   638740    1.912    0.000    2.813    0.000 /usr/lib/python3.8/difflib.py:647(quick_ratio)
  1217930    0.876    0.000    2.442    0.000 trlc/lexer.py:232(skip_whitespace)
    63859    0.096    0.000    2.229    0.000 trlc/ast.py:2816(__init__)
  1217844    1.020    0.000    1.905    0.000 trlc/lexer.py:71(__init__)
    63859    0.531    0.000    1.846    0.000 trlc/ast.py:2821(<dictcomp>)
  1021744    0.720    0.000    1.316    0.000 trlc/ast.py:546(__init__)
  1217844    0.765    0.000    1.242    0.000 trlc/lexer.py:162(__init__)
  1217844    0.798    0.000    1.188    0.000 trlc/lexer.py:200(is_alpha)
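
The exact command used to produce this listing is not stated here. As a minimal sketch, assuming the standard library cProfile and pstats modules, a listing in the same format (sorted by cumulative time) can be collected like this. The busy_work function is a hypothetical stand-in for the real entry point being measured:

```python
import cProfile
import io
import pstats


def busy_work(n):
    # Hypothetical stand-in workload; replace with the entry point to measure.
    return sum(i * i for i in range(n))


def profile_report(func, *args, limit=10):
    """Run func under cProfile and return a text report sorted by cumtime."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "cumulative" sorts by time spent in a function including its callees.
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report(busy_work, 100_000))
```

Sorting by cumulative time makes the call chain visible from the top down; sorting by "tottime" instead would highlight functions that are expensive in their own body.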

This is not unexpected:

  • token() is the worst offender with 18s cumulative (number crunching)
  • parse_trlc_files() takes around 9s once you subtract the lexing (which seems largely unavoidable)
  • the rest of process() takes about 4s, which is entirely due to resolve_record_references (unavoidable, this is work that needs to happen sooner or later)
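
The figures in this list can be re-derived from the cumtime column of the listing above. A small sketch of the arithmetic, with the values copied from the listing:

```python
# Cumulative times in seconds, copied from the profile listing.
cumtime = {
    "process": 31.558,
    "parse_trlc_files": 27.343,
    "token": 18.210,
    "resolve_record_references": 4.212,
}

# Parsing time that is not spent inside the lexer.
parsing_without_lexing = cumtime["parse_trlc_files"] - cumtime["token"]

# Time spent in process() after the files have been parsed.
after_parsing = cumtime["process"] - cumtime["parse_trlc_files"]

if __name__ == "__main__":
    print(f"parsing without lexing: {parsing_without_lexing:.3f}s")
    print(f"process() after parsing: {after_parsing:.3f}s")
    print(f"resolve_record_references: {cumtime['resolve_record_references']:.3f}s")
```

The last two numbers are nearly identical, which is why the remaining time in process() can be attributed to resolve_record_references.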

There are some immediate ideas:

  • is_alpha, is_alnum, and is_digit could be replaced by built-in functions (but we need to take care of Unicode, so it's not as easy as just calling the builtins)
  • implement partial parsing (sound) #47
  • implement partial parsing (unsound) #48
  • token() could be optimised in some other way
  • token() could be replaced by a hand-written C lexer (but this adds portability concerns)
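
To illustrate the first idea and its Unicode caveat, here is a rough sketch. It is not the TRLC implementation; it assumes the lexer only wants ASCII letters and digits, and all names in it are hypothetical. Membership in a precomputed frozenset is one cheap way to classify a character, while the str builtins accept far more than ASCII:

```python
import string
import timeit

# Assumption for this sketch: only ASCII letters and digits are accepted.
ASCII_LETTERS = frozenset(string.ascii_letters)
ASCII_DIGITS = frozenset(string.digits)
ASCII_ALNUM = ASCII_LETTERS | ASCII_DIGITS


def is_alpha(char):
    return char in ASCII_LETTERS


def is_digit(char):
    return char in ASCII_DIGITS


def is_alnum(char):
    return char in ASCII_ALNUM


def is_alnum_compare(char):
    # Baseline written with range comparisons, for the equivalence check.
    return "a" <= char <= "z" or "A" <= char <= "Z" or "0" <= char <= "9"


if __name__ == "__main__":
    # The builtins are Unicode-aware, so they are not drop-in replacements.
    print("é".isalpha(), is_alpha("é"))
    print("²".isdigit(), is_digit("²"))

    # Rough timing comparison; numbers depend on machine and Python version.
    for func in (is_alnum_compare, is_alnum):
        seconds = timeit.timeit(lambda: func("q"), number=1_000_000)
        print(func.__name__, round(seconds, 3))
```

A set lookup also returns False for None or an empty string, which may be convenient at end of input, whereas the comparison version would raise on None.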

There is one more issue that could manifest on Windows with large repositories: if you have millions of files (most of which are not TRLC files), then the initial traversal for register_dir could take a long time.
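
One possible mitigation, sketched here under stated assumptions and not based on how register_dir is actually implemented: prune directories that cannot contain relevant files while walking, and filter by suffix as early as possible. The function name, the skipped directory names, and the suffix list are all hypothetical choices for this illustration:

```python
import os
import tempfile

# Assumed suffixes of interest for this sketch.
WANTED_SUFFIXES = (".rsl", ".trlc")


def find_candidate_files(root, skip_dirs=(".git",)):
    """Return sorted paths under root that end in one of WANTED_SUFFIXES."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [name for name in dirnames if name not in skip_dirs]
        for name in filenames:
            if name.endswith(WANTED_SUFFIXES):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "reqs"))
        for name in ("a.trlc", "model.rsl", "notes.txt"):
            open(os.path.join(root, "reqs", name), "w").close()
        for path in find_candidate_files(root):
            print(os.path.relpath(path, root))
```

Pruning only helps when large uninteresting subtrees can be named up front; every directory that is not skipped still has to be listed once.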

@florianschanda florianschanda added the topic: core Affects lexer/parser/infrastructure label Oct 23, 2023
@florianschanda florianschanda self-assigned this Oct 23, 2023
florianschanda added a commit to christophkloeffel/trlc that referenced this issue Oct 23, 2023
Replace the char classification functions with more efficient, but
equivalent, implementations.

This reduces token() runtime from 18.2s to 15.1s, which is a 17%
improvement.
florianschanda added a commit that referenced this issue Oct 23, 2023
Replace the char classification functions with more efficient, but
equivalent, implementations.

This reduces token() runtime from 18.2s to 15.1s, which is a 17%
improvement.