Investigate RSyntaxTextArea for RUPS #155

Open
wants to merge 4 commits into
base: develop

iText-CI
Contributor

Programmatically created Pull Request to automatically keep merge branch to develop up-to-date

Initially the idea was to just use PdfTokenizer from iText, but it had
some problems: it does not preserve token positions in the input data,
skips whitespace outright, raises hard errors on invalid input, etc.
For more examples, see the PdfContentStreamParser docs.
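
To illustrate the two properties that mattered here (token positions are kept and whitespace is not dropped), below is a minimal sketch. It is not the PdfContentStreamParser from this PR: the class and method names are hypothetical, and it only splits input into whitespace and non-whitespace runs, ignoring strings, comments and delimiters.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the parser added in this PR. */
final class PositionalTokenizerSketch {
    enum Kind { WHITESPACE, TOKEN }

    /** A token that remembers where it came from in the input. */
    static final class Token {
        final Kind kind;
        final int offset;
        final int length;

        Token(Kind kind, int offset, int length) {
            this.kind = kind;
            this.offset = offset;
            this.length = length;
        }
    }

    /** PDF whitespace bytes: NUL, TAB, LF, FF, CR and SPACE. */
    static boolean isPdfWhitespace(byte b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    /** Splits data into alternating runs; whitespace runs are kept as tokens. */
    static List<Token> tokenize(byte[] data) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < data.length) {
            int start = i;
            boolean ws = isPdfWhitespace(data[i]);
            while (i < data.length && isPdfWhitespace(data[i]) == ws) {
                i++;
            }
            tokens.add(new Token(ws ? Kind.WHITESPACE : Kind.TOKEN, start, i - start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        byte[] data = "BT  /F1 12 Tf\n".getBytes(StandardCharsets.ISO_8859_1);
        for (Token t : tokenize(data)) {
            System.out.println(t.kind + " @" + t.offset + " len " + t.length);
        }
    }
}
```

Because every byte belongs to exactly one token, concatenating the token spans in order reproduces the input unchanged, which is what makes lossless saving possible.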

This will be used as a reference model for the editor. Ideally we would
have the same model in the editor to limit memory consumption, but
having a separate one allows us some flexibility, which might be useful
when implementing static analysis.

This is one of the first steps of integrating RSyntaxTextArea for PDF
content stream editing. Tokenization is based on the parsing logic,
which was added in the previous commit.

Since PDF content streams are, in the general case, not text but binary
data, some workarounds had to be made, as RSyntaxTextArea was designed
to work with text. A custom token type and painter were created, so that
we could render arbitrary characters not as text, but as hexadecimal
representations of the binary data.
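
As a rough illustration of the hexadecimal rendering idea, here is a sketch of the text-or-hex decision only. It does not use the RSyntaxTextArea painter API, the class and method names are hypothetical, and the `<XX>` display format is an assumption rather than what the editor actually draws.

```java
/** Hypothetical sketch of the text-or-hex display decision. */
final class HexDisplaySketch {
    /** Printable ASCII plus common line and tab characters are shown as text. */
    static boolean isShownAsText(char c) {
        return (c >= 0x20 && c <= 0x7E) || c == '\n' || c == '\r' || c == '\t';
    }

    /** Expects chars in the 0x00..0xFF range, i.e. one byte per char. */
    static String toDisplay(String latin1Text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < latin1Text.length(); i++) {
            char c = latin1Text.charAt(i);
            if (isShownAsText(c)) {
                sb.append(c);
            } else {
                // Assumed display format: two uppercase hex digits in angle brackets.
                sb.append(String.format("<%02X>", c & 0xFF));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toDisplay("A\u0000\u00FFz"));
    }
}
```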

Additionally, Latin1Filter was added so that we could trick
RSyntaxTextArea into working with binary data. It replaces any
characters that are not representable in Latin-1 (i.e. U+0100 and
beyond) with their UTF-8 representation, storing each byte in its own
char. As a result, every char in the document fits in one byte, and the
backing character array for the document acts as an inflated byte
array. This way we can just encode the text with Latin-1 to get the
expected output, which can be put directly into the PDF.
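
The replacement described above can be sketched roughly as follows. This is not the Latin1Filter from this PR: the class and method names are hypothetical, and the hookup into the Swing document is left out entirely.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the "bytes stored in chars" idea. */
final class Latin1InflateSketch {
    /**
     * Keeps code points up to U+00FF as they are and replaces everything
     * else with its UTF-8 bytes, one byte per char. Unpaired surrogates
     * are not handled specially in this sketch.
     */
    static String inflate(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp <= 0xFF) {
                out.append((char) cp);
            } else {
                byte[] utf8 = text.substring(i, i + len).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append((char) (b & 0xFF));
                }
            }
            i += len;
        }
        return out.toString();
    }

    /** Every char now fits in one byte, so Latin-1 encoding is lossless. */
    static byte[] toStreamBytes(String inflated) {
        return inflated.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = toStreamBytes(inflate("(\u20AC) Tj"));
        System.out.println(bytes.length + " bytes");
    }
}
```

Note that a char already in the U+0080..U+00FF range is kept as a single byte, not expanded to UTF-8, which matches the "U+0100 and beyond" rule in the description.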

With how RSyntaxTextArea is structured, quite a lot of code from there
has to be copied, as inheritance is not "granular" enough to do what we
want.

Since the input stream is no longer processed by iText, what you see in
the editor pane is the raw, unmodified data of the stream itself. This
was one of the goals: before, opening a stream for editing made it
pretty much impossible to save it without altering at least the
whitespace in some way.

As of now, there are some regressions. For example:
1. Images are no longer rendered in the stream pane. For now, they are
   displayed as text.
2. Since the stream is presented as-is, there is no indentation at the
   moment. This will be added later as an explicit prettifier option.

Code folding, static syntax analysis and RSTAUI dialog integrations will
be added later.

Now in the editor you should be able to freely fold BT->ET blocks and
BMC/BDC->EMC sequences.
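
One way such fold regions could be computed is by matching openers and closers with a stack. The sketch below assumes that approach. It does not use the RSyntaxTextArea fold API, the names are hypothetical, and it splits lines on whitespace, so operators inside string literals or comments would be misread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of BT/ET and BMC/BDC/EMC fold matching. */
final class FoldRegionSketch {
    /** Returns {startLine, endLine} pairs (zero-based) in closing order. */
    static List<int[]> findFolds(String content) {
        List<int[]> folds = new ArrayList<>();
        Deque<Integer> textStarts = new ArrayDeque<>();
        Deque<Integer> markedStarts = new ArrayDeque<>();
        String[] lines = content.split("\n", -1);
        for (int line = 0; line < lines.length; line++) {
            for (String word : lines[line].trim().split("\\s+")) {
                switch (word) {
                    case "BT":
                        textStarts.push(line);
                        break;
                    case "BMC":
                    case "BDC":
                        markedStarts.push(line);
                        break;
                    case "ET":
                        close(textStarts, line, folds);
                        break;
                    case "EMC":
                        close(markedStarts, line, folds);
                        break;
                    default:
                        break;
                }
            }
        }
        return folds;
    }

    private static void close(Deque<Integer> starts, int endLine, List<int[]> folds) {
        if (starts.isEmpty()) {
            return; // Unmatched closer: nothing to fold.
        }
        int startLine = starts.pop();
        if (endLine > startLine) {
            folds.add(new int[] {startLine, endLine}); // Skip single-line blocks.
        }
    }

    public static void main(String[] args) {
        for (int[] fold : findFolds("BT\n(Hi) Tj\nET")) {
            System.out.println(fold[0] + " -> " + fold[1]);
        }
    }
}
```

Text objects and marked-content sequences use separate stacks here so that a marked-content sequence can span several BT/ET blocks without confusing the matching.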

The static analysis is pretty basic for now, as a proof of concept. It
can currently report the following issues:
* An array, dictionary or string object was not closed.
* Unnecessary whitespace at the end of lines.
* Unexpected tokens.
* Wrong operand count or type for path construction operators.
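
As an illustration of the last check in the list above, here is a minimal sketch of operand validation for path construction operators. The operand counts follow the PDF specification (m, l: 2; c: 6; v, y, re: 4; h: 0), but the class, the message format and the whitespace-based splitting are hypothetical and much simpler than a real analysis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an operand check for path construction operators. */
final class PathOperandCheckSketch {
    /** Number of numeric operands each path construction operator takes. */
    static final Map<String, Integer> OPERAND_COUNTS =
            Map.of("m", 2, "l", 2, "c", 6, "v", 4, "y", 4, "re", 4, "h", 0);

    static boolean isNumber(String word) {
        return word.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
    }

    /** Simplification: only numbers and /names are treated as operands. */
    static boolean isOperand(String word) {
        return isNumber(word) || word.startsWith("/");
    }

    static List<String> check(String content) {
        List<String> problems = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        for (String word : content.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (isOperand(word)) {
                operands.add(word);
                continue;
            }
            Integer expected = OPERAND_COUNTS.get(word);
            if (expected != null) {
                if (operands.size() != expected) {
                    problems.add(word + ": expected " + expected
                            + " operand(s), got " + operands.size());
                } else {
                    for (String operand : operands) {
                        if (!isNumber(operand)) {
                            problems.add(word + ": operand " + operand + " is not a number");
                        }
                    }
                }
            }
            operands.clear(); // Any operator consumes the pending operands.
        }
        return problems;
    }

    public static void main(String[] args) {
        check("10 20 m\n30 40 50 l\nh").forEach(System.out::println);
    }
}
```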