Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create annotations from highlights #6

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

brush701
Copy link

@brush701 brush701 commented Apr 7, 2021

Implements the PDF annotation spec for highlights, including QuadPoints which are used by many utilities to extract the underlying text. Tested and working with a range of sample PDFs on Windows 10 Home, OS build 19042.867

Copy link
Owner

@rschroll rschroll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all of the work on this, and sorry it's taken me a while to get to reviewing the work. In truth, part of the delay was knowing that it would take a while for me to wrap my head around this code.

I've put a bunch of nit-picky comments within the code. These are generally based on my own style opinions, which aren't documented anywhere. And since this code was adapted from RCU, it's generally not consistent anyway. If you don't feel like cleaning these up, I'll take care of them.

But before dealing with those, I see two large problems to address:

First, with this patch I get all highlighted sections drawn as rectangles. While this works well for highlighted text, this doesn't work well for drawings with the highlighter. At the very least, we should provide a setting to toggle between drawn highlights and annotations.

However, with the new split in how highlighting works with the tablet, I wonder if the easier solution is to only turn highlights from the highlights file into annotations, and leave the highlights in the lines files as drawn lines on the PDF. I feel like this would do the right thing in at least 95% of the cases going forward. (It would also let us avoid passing the annotations list into the pen classes, which has struck me as a confusion of responsibilities.)

Second, this doesn't handle some of the weird and crazy things PDF files do with rotations and crop boxes. Rather than try to explain, I'll attach some acid test files I used to figure out the transformations. Each page has a line of text in the middle of the form (page size) - (crop box) - rotation. The page coordinates are plotted around the edges. On each page, I've drawn a highlight over the line of text, as well as an arrow pointing upwards above this line. Note that these are lines-file-type highlights, so this may cause a testing problem if you follow my suggest approach above.

In brief, the tablet displays the CropBox, not the PageBox, following the page rotation. But it adds a -90 rotation to any page that would end up being landscape, based on its original size and rotation. If the page does not match the aspect ratio of the tablet, the CropBox is placed in the upper-left corner of the device, meaning extra space appears to the bottom or right. But this is based on the device, not the page orientation, so that space may be above or to the left of the CropBox in page coordinates.

(I wish I could say that I worked this out through clever calculation, but in truth I just tried every transformation until I found one that worked. For whatever reason, I've never been properly able to wrap my head around coordinate transformations. So I hope the explanation above makes some kind of sense.)

Please feel free to discuss this further here. I should be able to be more responsive in the days ahead.

boxes.2.zip
skinny.2.zip
wide.2.zip

self.x = x
self.y = y

def toList(self) -> list:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: to_list. (The existing code base is not terribly consistent here, since it derived from some Qt code. But I'm mostly following pep8 with new code, I think.)

Alternatively, we could implement __iter__ and then just call list(point). And depending what it's used for, we might just iterate through the point directly. But I'm happy with to_list; only go this way if it seems to provide other benefits.

# the line cannot have positive overlap
return False


Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: No more than one empty line.

max(self.ur.y, rectB.ur.y))
return Rect(ll, ur)

def toList(self) -> list:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: to_list.



@staticmethod
def fromRect(rect: Rect):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: from_rect.

self.annotype = annotype
self.rect = rect
if quadpoints:
self.quadpoints = quadpoints
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this produce some inconsistency, if quadpoints and rect aren't the same?

@@ -74,9 +75,11 @@ def render(source, *,
# key of zero length, so it doesn't break the rest of the
# process.
pages = []
highlihgts = []
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: spelling.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this doesn't appear to be used anywhere?

for k, layer_a in enumerate(page_annot):
layerannots = layer_a[1]
for a in layerannots:
# PDF origin is in bottom-left, so invert all
# y-coordinates.
author = 'RCU' #self.model.device_info['rcuname']
author = 'reMarkable' #self.model.device_info['rcuname']
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Remove old comment. (Admittedly, not your fault.)


def rotate_annot_points(points: list) -> list:
rotated = []
for n in range(0,len(points),2):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps for x, y in zip(points[::2], points[1::2])?

Or there are more clever solutions here: https://stackoverflow.com/questions/5389507/iterating-over-every-two-elements-in-a-list. (I'm actually surprised there isn't a solution in itertools.)

def scale_annot_points(points: list, scale:float, adjust: list) -> list:
scaled = []
for i, p in enumerate(points):
scaled.append(p*scale + adjust[i%2])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: use a list comprehension.

for i, p in enumerate(points):
scaled.append(p*scale + adjust[i%2])

return scaled
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Trailing newline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants