Investigate RSyntaxTextArea for RUPS #155

iText-CI · 2025-01-22T23:08:11Z

Programmatically created pull request that automatically keeps the merge branch up to date with develop.

Initially the idea was to just use PdfTokenizer from iText, but it had some problems: it does not preserve token positions in the input data, skips whitespace outright, raises hard errors for invalid input, etc. For more examples, see the PdfContentStreamParser docs. The new parser will be used as a reference model for the editor. Ideally we would have the same model in the editor to limit memory consumption, but having a separate one allows us some flexibility, which might be useful when implementing static analysis.
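To illustrate the requirements listed above (positions preserved, whitespace kept, no hard errors), here is a minimal sketch of a span-based scanner. It is not the PdfContentStreamParser from this PR; the class `SpanTokenizerSketch` and its `Span` type are hypothetical names, and it only separates whitespace runs from everything else. The whitespace set is the one the PDF specification defines (NUL, TAB, LF, FF, CR, SPACE).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the parser added in this PR. */
final class SpanTokenizerSketch {
    enum Kind { WHITESPACE, OTHER }

    /** A token described purely by its position in the input. */
    static final class Span {
        final Kind kind;
        final int start;
        final int length;

        Span(Kind kind, int start, int length) {
            this.kind = kind;
            this.start = start;
            this.length = length;
        }
    }

    /** PDF whitespace bytes: NUL, TAB, LF, FF, CR, SPACE. */
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
                || b == 0x0C || b == 0x0D || b == 0x20;
    }

    /**
     * Splits the input into alternating whitespace and non-whitespace runs.
     * Every byte lands in exactly one span, so the spans always cover the
     * whole input, and no input can make this method throw.
     */
    static List<Span> tokenize(byte[] data) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        while (i < data.length) {
            int start = i;
            boolean ws = isPdfWhitespace(data[i] & 0xFF);
            while (i < data.length && isPdfWhitespace(data[i] & 0xFF) == ws) {
                i++;
            }
            spans.add(new Span(ws ? Kind.WHITESPACE : Kind.OTHER, start, i - start));
        }
        return spans;
    }
}
```

Because each span stores only an offset and a length, the original bytes can be reproduced exactly by concatenating the spans in order, which is the property an editor needs to save a stream unmodified.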

This is one of the first steps of integrating RSyntaxTextArea for PDF content stream editing. Tokenization is based on the parsing logic, which was added in the previous commit.

Since PDF content streams are, in the general case, not text but binary data, some workarounds had to be made, as RSyntaxTextArea was designed to work with text. A custom token type and painter were created, so that we could render arbitrary characters not as text, but as hexadecimal representations of the binary data.

Additionally, Latin1Filter was made so that we could somewhat trick RSyntaxTextArea into working with binary data. It replaces any characters that are not representable in Latin-1 (i.e. U+0100 and beyond) with their UTF-8 representation, storing each byte in its own char. As a result, all chars in the document fit in one byte and the backing character array for the document acts as an inflated byte array. This way we can just encode the text with Latin-1 to get the expected bytes, which can be put directly into the PDF.

With how RSyntaxTextArea is structured, quite a lot of code from there has to be copied, as inheritance is not "granular" enough to do what we want.

Since the input stream is no longer processed by iText, what you see in the editor pane is the raw data in the stream itself, unmodified. This was one of the goals, as previously, opening a stream for editing made it pretty much impossible to save it without altering at least the whitespace in some way.

As of now, there are some regressions. For example:

1. Images are no longer rendered in the stream pane. For now, they are displayed as text.
2. Since the stream is presented as-is, there is no indentation at the moment. This will be added later as an explicit prettifier option.

Code folding, static syntax analysis and RSTAUI dialog integrations will be added later.
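The Latin-1 trick described for Latin1Filter can be sketched in a few lines. This is not the filter from the PR (which hooks into the Swing document); `Latin1Sketch` and its method names are hypothetical, and the sketch only shows the character transformation and the round trip back to bytes.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of the idea; not the Latin1Filter from this PR. */
final class Latin1Sketch {
    /**
     * Replaces every code point above U+00FF with its UTF-8 bytes,
     * storing one byte per char. Chars up to U+00FF are kept as they are.
     */
    static String toByteChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int codePoint = input.codePointAt(i);
            int charCount = Character.charCount(codePoint);
            if (codePoint <= 0xFF) {
                out.append((char) codePoint);
            } else {
                byte[] utf8 = input.substring(i, i + charCount)
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    // Mask to avoid sign extension: each char ends up in 0x00..0xFF.
                    out.append((char) (b & 0xFF));
                }
            }
            i += charCount;
        }
        return out.toString();
    }

    /** Every char now fits in one byte, so Latin-1 encoding is lossless. */
    static byte[] toBytes(String byteChars) {
        return byteChars.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

For example, `toByteChars("a\u00E9\u20AC")` keeps `a` and `é` as single chars and expands `€` (U+20AC) into the three chars 0xE2, 0x82, 0xAC, so `toBytes` on the result yields the five bytes 61 E9 E2 82 AC.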

Now in the editor you should be able to freely fold BT->ET blocks and BMC/BDC->EMC sequences.
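The pairing of BT with ET and of BMC/BDC with EMC can be illustrated with a small stack-based sketch. It is not the fold parser from this commit and does not use the RSyntaxTextArea API; `FoldSketch` is a hypothetical name, and splitting lines on whitespace is a simplification that would misread operator names appearing inside string operands.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration; not the fold parser added in this PR. */
final class FoldSketch {
    /**
     * Returns {startLine, endLine} pairs (0-based) for BT..ET and
     * BMC/BDC..EMC regions that span more than one line.
     */
    static List<int[]> findFolds(String[] lines) {
        List<int[]> folds = new ArrayList<>();
        Deque<Integer> textStarts = new ArrayDeque<>();
        Deque<Integer> markedStarts = new ArrayDeque<>();
        for (int line = 0; line < lines.length; line++) {
            for (String token : lines[line].trim().split("\\s+")) {
                switch (token) {
                    case "BT":
                        textStarts.push(line);
                        break;
                    case "BMC":
                    case "BDC":
                        markedStarts.push(line);
                        break;
                    case "ET":
                        close(textStarts, line, folds);
                        break;
                    case "EMC":
                        close(markedStarts, line, folds);
                        break;
                    default:
                        break; // operands and other operators do not affect folding
                }
            }
        }
        return folds;
    }

    /** Closes the innermost open region, ignoring closers with no opener. */
    private static void close(Deque<Integer> starts, int endLine, List<int[]> folds) {
        if (starts.isEmpty()) {
            return;
        }
        int startLine = starts.pop();
        if (endLine > startLine) {
            folds.add(new int[] {startLine, endLine});
        }
    }
}
```

Text objects and marked content use separate stacks here, so a stray ET or EMC without a matching opener is skipped instead of breaking the regions around it.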

This is pretty basic as a proof of concept. It can currently show the following issues:

* Array/Dictionary/String object was not closed.
* Unnecessary whitespace at the end of lines.
* Unexpected tokens.
* Operand count and type for path construction operators.
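As an illustration of the last check, here is a minimal sketch of operand validation for path construction operators. It is not the analysis code from this PR; `PathOperandCheckSketch` and its messages are hypothetical, and it assumes the input fragment contains only operands and path construction operators, separated by whitespace. The operand counts (`m` 2, `l` 2, `c` 6, `v` 4, `y` 4, `re` 4, `h` 0) follow the PDF specification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; names and messages are not from this PR. */
final class PathOperandCheckSketch {
    // Number of numeric operands each path construction operator takes.
    private static final Map<String, Integer> EXPECTED = Map.of(
            "m", 2, "l", 2, "c", 6, "v", 4, "y", 4, "re", 4, "h", 0);

    /** Matches PDF integers and reals such as 10, -3.5, +.25 or 4. */
    static boolean isNumber(String token) {
        return token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
    }

    /** Returns one message per operator with a wrong operand count or type. */
    static List<String> check(String content) {
        List<String> issues = new ArrayList<>();
        int numeric = 0;
        int nonNumeric = 0;
        for (String token : content.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            Integer expected = EXPECTED.get(token);
            if (expected == null) {
                // Simplification: anything that is not a known operator is an operand.
                if (isNumber(token)) {
                    numeric++;
                } else {
                    nonNumeric++;
                }
                continue;
            }
            int total = numeric + nonNumeric;
            if (total != expected) {
                issues.add(token + ": expected " + expected + " operands, got " + total);
            } else if (nonNumeric > 0) {
                issues.add(token + ": all operands must be numbers");
            }
            // Operands belong to the operator that follows them, so start over.
            numeric = 0;
            nonNumeric = 0;
        }
        return issues;
    }
}
```

With this sketch, `check("10 20 m 30 40 l 1 2 3 re h")` reports a single issue, `re: expected 4 operands, got 3`, while `check("10 (x) m")` reports that all operands of `m` must be numbers.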

Eswcvlad added 4 commits January 22, 2025 23:03

Add basic fold parser for PDF content streams

9812d4d


iText-CI assigned Eswcvlad Jan 22, 2025
