Convert And Extract
Identify the start page and end page of each biography so that each one can be extracted as a separate txt file.
Taking 政治與經濟篇 as an example, part of the result is shown below:
...
總 論 ..........................................................................................................陳翠蓮 XI
概 說 ........................................................................................................................ 001
第一章 政府部門 ................................................................................................ 005
王民寧 ................................................................................................................. 005
吳三連 ................................................................................................................. 008
...
石 益 ................................................................................................................. 059
...
2. Use a regular expression to identify each biography's biographee and start page, and infer the end page from the start page of the next biography.
Regex: ^(\w+ ?\w+) ? ?\.+ (\d\d\d)$
- group1 : biographee's name
- group2 : start page
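A minimal sketch of applying this pattern to a single table-of-contents line, assuming Python's `re` module is used (the source does not say which regex engine it runs). The function name `parse_toc_line` is hypothetical.

```python
import re

# Pattern copied from the text: group 1 is the biographee's name,
# group 2 is the three-digit start page printed in the book.
TOC_LINE = re.compile(r"^(\w+ ?\w+) ? ?\.+ (\d\d\d)$")

def parse_toc_line(line):
    """Return (name, start_page) if the line matches the pattern, else None."""
    match = TOC_LINE.match(line.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2))

# Two-character names written with a space, such as "石 益", also match.
entry = parse_toc_line("石 益 .......... 059")
```

Chapter headings such as 第一章 政府部門 have the same shape as a biography entry, so the pattern matches them too; they have to be filtered out in a separate step.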
Exclude false positives such as 第一章 政府部門, 005. The remaining matches are all true positives, such as 王民寧, 005.
Page 1 of the book is actually page 21 of the PDF, so add 20 to every start page; 王民寧, 005 becomes 王民寧, 025.
The start page of the next biography then gives the end page of the current one. For example, 吳三連 starts on PDF page 28, so 王民寧 has start: 25, end: 27
(on the premise that no two biographies appear on the same page).
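The filtering, page offset, and end-page inference can be sketched as follows. This is an illustration only: the source does not say how false positives are excluded, so the `第…章` heading check and the `also_exclude` parameter are assumptions, and `drop_headings` and `page_ranges` are hypothetical names.

```python
import re

PDF_OFFSET = 20  # per the text: book page 1 is PDF page 21

# Assumed heuristic: names like "第一章 政府部門" are chapter headings.
CHAPTER_HEADING = re.compile(r"^第\w+章")

def drop_headings(entries, also_exclude=()):
    """Remove chapter headings and any explicitly listed non-biography names."""
    return [
        (name, page)
        for name, page in entries
        if not CHAPTER_HEADING.match(name) and name not in also_exclude
    ]

def page_ranges(entries, offset=PDF_OFFSET):
    """Turn (name, book_start_page) pairs, in table-of-contents order,
    into (name, pdf_start, pdf_end) triples.

    The end page is one less than the next biography's start page, which
    relies on no two biographies sharing a page. The last entry has no
    successor, so its end page is left as None.
    """
    ranges = []
    for i, (name, start) in enumerate(entries):
        pdf_start = start + offset
        if i + 1 < len(entries):
            pdf_end = entries[i + 1][1] + offset - 1
        else:
            pdf_end = None
        ranges.append((name, pdf_start, pdf_end))
    return ranges

matches = [("概 說", 1), ("第一章 政府部門", 5), ("王民寧", 5), ("吳三連", 8)]
biographies = drop_headings(matches, also_exclude={"概 說"})
ranges = page_ranges(biographies)
```

The end page of the final biography in a book cannot be inferred this way and would need to come from elsewhere, for example the start page of whatever section follows it.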
Taking 王民寧 as an example:
java -jar ./Tools/pdfbox-app-1.8.13.jar ExtractText -startPage 25 -endPage 27 -encoding UTF-8 ./DataBase/政治與經濟篇.pdf ./DataBase/raw_txt/政治與經濟篇-25-王民寧.txt
Note: we prepend the book name and start page to the output txt's file name because some biographees share the same name.
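As a sketch of how the command and the output file name could be assembled for each biography, assuming the directory layout shown in the example command. The function name `build_extract_command` and its default arguments are hypothetical; only the PDFBox options that appear in the command above are used.

```python
def build_extract_command(book, name, pdf_start, pdf_end,
                          jar="./Tools/pdfbox-app-1.8.13.jar",
                          pdf_dir="./DataBase",
                          out_dir="./DataBase/raw_txt"):
    """Build the argument list for one PDFBox ExtractText call.

    The output name is <book>-<start page>-<biographee>.txt, so that
    biographees with the same name do not overwrite each other.
    """
    output_path = f"{out_dir}/{book}-{pdf_start}-{name}.txt"
    return [
        "java", "-jar", jar, "ExtractText",
        "-startPage", str(pdf_start),
        "-endPage", str(pdf_end),
        "-encoding", "UTF-8",
        f"{pdf_dir}/{book}.pdf",
        output_path,
    ]

command = build_extract_command("政治與經濟篇", "王民寧", 25, 27)
```

The resulting list can be passed to `subprocess.run(command, check=True)` to perform the extraction; passing a list rather than a single string avoids shell quoting problems with the non-ASCII file names.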