
Convert And Extract

richardyy1188 edited this page Aug 13, 2018 · 4 revisions

Goal

Identify the start page and end page of each biography so that each one can be extracted to its own txt file.

Process

1. Extract the index as txt

Taking 政治與經濟篇 as an example, part of the result is shown below:

...

總 論 ..........................................................................................................陳翠蓮 XI
概 說 ........................................................................................................................ 001
第一章 政府部門 ................................................................................................ 005
王民寧  ................................................................................................................. 005
吳三連  ................................................................................................................. 008
...
石 益  ................................................................................................................. 059
...

2. Use a regular expression to identify each biographee and the start page of their biography, and infer the end page from the start page of the next biography.

Regex: ^(\w+ ?\w+) ? ?\.+ (\d\d\d)$

  • group1 : biographee's name
  • group2 : start page
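
As a rough illustration, the regex above can be tried on a few index lines in Python. This is only a sketch: the helper name `parse_index_line` and the shortened sample lines are made up for the example, and it relies on Python 3's `\w` matching CJK characters in `str` patterns.

```python
import re

# Pattern from the text: group 1 is the name, group 2 the three-digit start page.
INDEX_LINE = re.compile(r"^(\w+ ?\w+) ? ?\.+ (\d\d\d)$")


def parse_index_line(line):
    """Return (name, printed_page) if the line matches the pattern, else None."""
    match = INDEX_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# Shortened sample lines (fewer dots than in the real index).
sample = [
    "總 論 ..........陳翠蓮 XI",        # no match: does not end in three digits
    "第一章 政府部門 .......... 005",   # matches, but is a chapter heading
    "王民寧  .......... 005",           # matches: a biography entry
    "石 益  .......... 059",            # matches: a name containing a space
]
for line in sample:
    print(parse_index_line(line))
```

Note that the chapter heading matches as well, which is why a separate filtering step is needed.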

Exclude false positives such as 第一章 政府部門, 005.
The remaining matches are all true positives, such as 王民寧, 005.
Page 1 of the book is page 21 of the PDF, so add 20 to every start page: 王民寧, 005 becomes 王民寧, 025.
The end page of a biography is then the page before the start page of the next biography, for example 王民寧 start: 25, end: 27 (on the premise that no two biographies appear on the same page).
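
The filtering, page offset, and end-page inference might look roughly like this in Python. The heading rules (`CHAPTER_HEADING`, `OTHER_HEADINGS`) are assumptions made for this sketch, since the page does not say how false positives are recognised, and the end page of the last biography is left as `None` because the index alone does not determine it.

```python
import re

PAGE_OFFSET = 20  # printed page 1 is PDF page 21, so add 20

# Assumption: headings are recognised by these hand-written rules.
CHAPTER_HEADING = re.compile(r"^第\w+章")
OTHER_HEADINGS = {"概 說"}


def is_biography(name):
    """Return True unless the matched name looks like a section heading."""
    return name not in OTHER_HEADINGS and CHAPTER_HEADING.match(name) is None


def page_ranges(entries, last_pdf_page=None):
    """Turn (name, printed_start_page) pairs, in index order, into
    (name, pdf_start, pdf_end) tuples.

    Assumes no two biographies share a page, so each biography ends on the
    page before the next one starts. The last entry gets `last_pdf_page`.
    """
    bios = [(name, page + PAGE_OFFSET) for name, page in entries if is_biography(name)]
    ranges = []
    for i, (name, start) in enumerate(bios):
        if i + 1 < len(bios):
            end = bios[i + 1][1] - 1
        else:
            end = last_pdf_page
        ranges.append((name, start, end))
    return ranges


entries = [("概 說", 1), ("第一章 政府部門", 5), ("王民寧", 5), ("吳三連", 8)]
print(page_ranges(entries))
```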

3. Then extract and convert each biography using PDFBox

Take 王民寧 as an example:
java -jar ./Tools/pdfbox-app-1.8.13.jar ExtractText -startPage 25 -endPage 27 -encoding UTF-8 ./DataBase/政治與經濟篇.pdf ./DataBase/raw_txt/政治與經濟篇-25-王民寧.txt
Note: we prepend the book name and start page to the output txt file's name because different biographees can have the same name.
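
To run the same command for every biography, the argument list could be built in Python along these lines. This is a sketch only: the function name `extract_command` and its parameters are hypothetical, while the jar path, the flags, and the output naming scheme are copied from the command above.

```python
PDFBOX_JAR = "./Tools/pdfbox-app-1.8.13.jar"  # jar path as used in the command above


def extract_command(book, name, start, end, data_dir="./DataBase"):
    """Build the PDFBox ExtractText argument list for one biography.

    The output file is named <book>-<start>-<name>.txt so that biographees
    with the same name do not overwrite each other.
    """
    pdf_path = f"{data_dir}/{book}.pdf"
    out_path = f"{data_dir}/raw_txt/{book}-{start}-{name}.txt"
    return [
        "java", "-jar", PDFBOX_JAR, "ExtractText",
        "-startPage", str(start),
        "-endPage", str(end),
        "-encoding", "UTF-8",
        pdf_path, out_path,
    ]


cmd = extract_command("政治與經濟篇", "王民寧", 25, 27)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, assuming Java and the jar are available at those paths. Using an argument list rather than a shell string means names that contain a space, such as 石 益, need no extra quoting.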