Regexes in preprocessing

richardyy1188 edited this page Aug 24, 2018 · 3 revisions

1. Regular Expressions

Some of these steps have problems, which are described under the next header.

1-1. Remove chapter and category octocat

1-2. Find the spaces that we don't want to clear out octocat

1-3. Get footnote indices octocat

1-4. Split into pages by page number octocat

1-5. Find the start point of the footnote part when a footnote continues across two pages, to distinguish the content part from the footnote part of a page octocat

1-6. Clear out footnote indices in content octocat

1-7. Get and remove authors

1-8. Get and remove Alias, Birth, Death octocat

2. Problems

2-1. False Positive

2-1-1. Paragraphs are segmented too aggressively

2-1-2. Falsely catches numbers that are not footnote indices
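
A minimal sketch of how this kind of false positive can arise. The sample text and both patterns are assumptions made for illustration, not the regexes actually used in this project: it assumes a footnote index is a short number attached directly to the preceding word or punctuation.

```python
import re

# Hypothetical sample: "3" and "4" are footnote indices, "1920" and "12" are not.
text = "He moved to Tokyo in 1920.3 The journal ran for 12 issues.4"

# Naive pattern: every run of digits is treated as a footnote index.
NAIVE = re.compile(r"\d+")

# Stricter pattern (an assumption): 1-3 digits glued to the preceding
# non-space, non-digit character and followed by whitespace or end of text.
ATTACHED = re.compile(r"(?<=[^\s\d])\d{1,3}(?=\s|$)")

naive_hits = NAIVE.findall(text)        # also catches the year and the count
attached_hits = ATTACHED.findall(text)  # only the attached indices

print(naive_hits)
print(attached_hits)
```

The stricter pattern can still misfire, for example on a name that ends in digits, so it narrows the problem rather than removing it.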

2-2. False Negative

2-2-1. Incorrect start point of the footnote part of a page
We haven't actually seen this case, but we can imagine it.

2-3. Others

2-3-1. Broken English
I haven't come up with a good way to deal with it yet.
I tried using a dictionary to check every possible concatenation of the broken English fragments, but it can't tell which result is the one we want.
For example, I want Association, but I may get As first, because As is also in the dictionary, and the result becomes As socia...
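
The ambiguity can be reproduced with a small sketch. The function name `merge_fragments`, the tiny dictionary, and the fragment list are all hypothetical, and the code only illustrates the dictionary-lookup idea described above, not the project's actual implementation.

```python
def merge_fragments(fragments, dictionary, prefer_longest):
    """Greedily join adjacent fragments into dictionary words."""
    words = []
    i = 0
    while i < len(fragments):
        # Candidate end positions for a word that starts at fragment i.
        ends = range(i + 1, len(fragments) + 1)
        if prefer_longest:
            ends = reversed(ends)
        for j in ends:
            candidate = "".join(fragments[i:j])
            if candidate.lower() in dictionary:
                words.append(candidate)
                i = j
                break
        else:
            # No dictionary word starts here; keep the fragment unchanged.
            words.append(fragments[i])
            i += 1
    return words


# Hypothetical dictionary and broken input.
DICTIONARY = {"as", "association", "of", "writers"}
fragments = ["As", "socia", "tion", "of", "writers"]

# Shortest match first: "As" is accepted immediately, the rest stays broken.
shortest = merge_fragments(fragments, DICTIONARY, prefer_longest=False)

# Longest match first: "Association" is tried before "As".
longest = merge_fragments(fragments, DICTIONARY, prefer_longest=True)

print(shortest)
print(longest)
```

Trying the longest concatenation first recovers Association in this example, but it can over-merge in other cases (for instance joining "in" and "to" into "into" when two separate words were intended), so the dictionary alone still cannot decide which reading is wanted.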