migrate to core v3 #117
base: master
Conversation
Oh, and this is based on #116, since I often cannot even run Calamari without that.

I wrote a simple script to measure and plot the GPU utilisation. The rather simple modification 9611e2c (which I will cherry-pick into #116 for core v2) helps in two ways: it utilises the GPU better (because it avoids too-small batches when regions have but a few lines), and thus also allows increasing the batch size without causing OOM.

Unfortunately, fb2a680 does not accomplish what I expected – reducing the peaky GPU utilisation behaviour due to the GPU waiting for the CPU and vice versa. Here's a log for batch_size=64 without parallel page threads (but with METS Server), i.e. before fb2a680:

[GPU utilisation plot]

Thus, surprisingly, the timeline still shows low average utilisation with lots of waiting time. This is also reflected by wall-time and CPU-time measurements (see below). It gets a little better if I split up the batches for the background thread (instead of having Calamari v1 do the batching) – see the sketch after this comment:

[GPU utilisation plot]

Also, it helps to increase the number of parallel page threads from 3 to 6 – but only relatively, not with regard to background threading. Here's with 6 threads before fb2a680:

[GPU utilisation plot]

I also tried with more than 1 background thread (i.e. the number of workers in the shared ThreadPoolExecutor), but that does not do much better either – the above with 2 "GPU" threads:

[GPU utilisation plot]

Increasing the number of parallel page threads to 12 or 24 becomes more inefficient still. Figures for a book with 180 pages of Fraktur:

[wall-time and CPU-time figures]

Perhaps we must go for Calamari 2 with its efficient tfaip pipelining...
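For illustration, a minimal sketch of the batch splitting plus single background "GPU" thread described above (pages, preprocess_lines and predictor.predict_raw are hypothetical stand-ins, not the actual Calamari/ocrd_calamari API):

from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 64
executor = ThreadPoolExecutor(max_workers=1)  # the single shared "GPU" worker

def batches(lines, size=BATCH_SIZE):
    # split into fixed-size batches on our side
    # instead of having Calamari v1 do the batching
    for i in range(0, len(lines), size):
        yield lines[i:i + size]

pending = []
for page in pages:
    line_images = preprocess_lines(page)  # CPU-bound, runs on the main thread
    for batch in batches(line_images):
        # GPU-bound prediction overlaps with preprocessing of the next page
        pending.append(executor.submit(predictor.predict_raw, batch))
results = [future.result() for future in pending]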
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##           master     #117      +/-   ##
==========================================
- Coverage   71.07%   68.93%   -2.15%
==========================================
  Files           5        4       -1
  Lines         204      206       +2
  Branches       50       55       +5
==========================================
- Hits          145      142       -3
- Misses         48       51       +3
- Partials       11       13       +2

☔ View full report in Codecov by Sentry.
When will OCR-D 3 be released? (Roughly)
Hard to tell. It hinges on a couple of open design decisions yet to be made in OCR-D/core#1240. Plus (likely) a switch from ThreadPoolExecutor to ProcessPoolExecutor for page-parallel processing. Plus bertsky/core#21. I did quite a bit of testing against various workflows with processors (both v3-migrated and old), but we should formalise this into an integration test (probably Quiver)... Something to discuss in the next Tech Call?
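For context, that switch essentially amounts to swapping the executor class – a minimal sketch, assuming a picklable per-page function process_page and a list page_ids (both hypothetical names):

from concurrent.futures import ProcessPoolExecutor

if __name__ == '__main__':  # guard required for process spawning on some platforms
    # processes side-step the GIL for CPU-bound per-page work (at the cost
    # of pickling arguments and results), whereas threads contend on the GIL
    with ProcessPoolExecutor(max_workers=6) as pool:
        results = list(pool.map(process_page, page_ids))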
process = Process(target=_start_mets_server,
                  kwargs={'workspace': workspace, 'url': 'mets.sock'})
process.start()
sleep(1)
Is there no better way than this sleep?
It's just for the test, you know. One second does not hurt considering how long the tests run.
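If we ever want to get rid of the fixed delay, one option would be polling for the server's UNIX socket – a minimal sketch (timeout and interval are arbitrary choices, and the socket file appearing does not strictly guarantee the server already accepts connections):

import os
from time import sleep

def wait_for_socket(path='mets.sock', timeout=10.0, interval=0.1):
    # poll for the METS Server's socket file instead of sleeping blindly
    waited = 0.0
    while not os.path.exists(path):
        if waited >= timeout:
            raise TimeoutError(f"METS Server socket {path} did not appear")
        sleep(interval)
        waited += interval

process.start()
wait_for_socket('mets.sock')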
'//page:TextLine/page:TextEquiv[1]/page:Unicode/text()', namespaces=NS)
assert len(text1_out) == len(text1), "not all lines have been recognized"
assert "verſchuldeten" in "\n".join(text1_out), "result for first page is inaccurate"
assert "\n".join(text1_out) != "\n".join(text1), "result is suspiciously identical to GT"
Why not remove the GT text in the fixture? If that is done in a robust way, it seems like the cleanest way to make sure we're actually testing for OCR text, not the original GT text.
Because then we could not compare the two, and we would not get the warnings in the log which we are asserting above.
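For reference, the kind of log assertion meant here could look roughly like this with pytest's caplog fixture (the processor invocation, file groups and warning text are assumptions, not this PR's actual test code):

def test_recognize_warns_on_existing_text(workspace, caplog):
    # run the processor on a workspace whose pages already carry GT text
    run_processor(CalamariRecognize, workspace=workspace,
                  input_file_grp='OCR-D-GT-SEG-LINE',
                  output_file_grp='OCR-D-OCR-CALAMARI')
    # assert that the overwrite warning was emitted (message text is hypothetical)
    assert any('already contained text' in message for message in caplog.messages)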
CONFIGS = ['', 'pageparallel', 'metscache', 'pageparallel+metscache']

@pytest.fixture(params=CONFIGS)
I was a bit confused by those configs being tied into the workspace fixture, but they seem to be connected in the OCR-D API (mets_server_url in the Workspace class), so this is probably ok?
Exactly, they are factually tied because the workspace references the METS Server if present.
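To spell out that coupling, here is a minimal sketch with a hypothetical test_config fixture (separate from the PR's actual workspace fixture, which also starts the METS Server as quoted below):

import pytest

@pytest.fixture(params=CONFIGS)
def test_config(request):
    # derive the two independent toggles from the combined config string
    yield {'metscache': 'metscache' in request.param,
           'pageparallel': 'pageparallel' in request.param}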
process.start()
sleep(1)
workspace = Workspace(resolver, directory, mets_server_url='mets.sock')
yield {'workspace': workspace, 'mets_server_url': 'mets.sock'}
Is mets_server_url redundant here? The Workspace seems to have it, too.
Indeed, run_processor only needs mets_server_url if there is no workspace. So we could actually simplify here.
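The simplification could then look like this (a sketch mirroring the quoted snippet; everything up to process.start() stays as it is):

process.start()
sleep(1)
workspace = Workspace(resolver, directory, mets_server_url='mets.sock')
# run_processor only needs mets_server_url when no workspace is given,
# and the Workspace already carries the URL, so yield it alone
yield workspace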
Still a draft as long as v3 is in beta/RC, but we can already use the CI and discuss the changes (esp. to the tests).

@kba this closely resembles the tests in OCR-D/ocrd_kraken#44 (covering variants with METS Caching and/or METS Server and/or parallel pages).