preprocessing issues #1

Open
michaelmoju opened this issue May 13, 2019 · 2 comments

michaelmoju commented May 13, 2019

I preprocessed the ACE2005 corpus with your code and ran into some issues. First, the print statement at line 93 of ace2json.py uses Python 2 syntax. I changed it to a Python 3 print() call by hand and then ran ace2json.py, but it reported that some files were missing. I found that in ace2005/text/CNN_IP_20030408.1600.04.txt.conll the Stanford annotator wrongly adds an extra period at lines 87-88: the word appears as "SR..", with two periods. Can you check this issue? Thanks for your contribution.
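To check whether other files have the same problem, a scan along these lines might help. This is only a rough sketch and not part of the repository: the helper name `find_double_period_tokens` is hypothetical, and it assumes the .conll files are whitespace-separated columns, so it simply inspects every field on every line.

```python
import re

# Hypothetical helper, not part of the repository.
# Matches a field made of word characters followed by exactly two periods, e.g. "SR..".
DOUBLE_PERIOD = re.compile(r"^\w+\.\.$")


def find_double_period_tokens(lines):
    """Return (line_number, field) pairs for fields that end in exactly two periods."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        # Assumption: columns are separated by tabs or spaces.
        for field in line.split():
            if DOUBLE_PERIOD.match(field):
                hits.append((lineno, field))
    return hits


# Made-up sample lines, only to show the behaviour.
sample = ["1\tthe\tDT", "2\tSR..\tNNP", "3\t.\t.", "4\twait...\tVB"]
print(find_double_period_tokens(sample))  # [(2, 'SR..')]
```

Passing the lines of a real file (for example `open(path, encoding="utf-8")`) instead of `sample` would list every suspicious token with its line number. A plain "." token and a three-period ellipsis are deliberately not reported.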

Owner

luanyi commented May 24, 2019

Hi Michael,

For preprocessing, we follow exactly the same procedure as Miwa & Bansal's repo. The ace2json.py script only converts their output to JSON format. I'm not exactly sure what their preprocessing script does about the "SR." issue, but this code should be fine for direct comparison with previous literature on ACE.

For the missing files, could you give me an example of a file you are missing and the error message you got?

Best,
Yi

@michaelmoju
Author

In the file preprocessing/ace2005/log, at line 165, there is a file-not-found error:

Traceback (most recent call last):
  File "../common/standoff.py", line 12, in <module>
    for line in open(sys.argv[1]):
FileNotFoundError: [Errno 2] No such file or directory: 'corpus/CNN_ENG_20030304_173120.16.txt'
corpus/CNN_ENG_20030305_170125.1.split.txt
corpus/CNN_ENG_20030306_070606.18.split.txt

That is because line 85 of preprocessing/common/standoff.py asserts that the texts are the same. However, I found that the Stanford annotator, for some reason, wrongly annotated lines 87-88 of ace2005/text/CNN_IP_20030408.1600.04.txt.conll. This caused an assertion error, and the procedure stopped producing the subsequent files after the error.
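To locate this kind of mismatch without aborting on the first failed assertion, something like the following could be used. It is a minimal sketch under my own assumptions, not the actual logic of standoff.py: the function name `first_mismatch` is hypothetical, and it assumes the annotator's tokens should appear verbatim and in order in the original text, separated only by whitespace (real tokenizers may also rewrite quotes or brackets, which this does not handle).

```python
def first_mismatch(original, tokens):
    """Return (token_index, token, text_offset) for the first token that does not
    line up with the original text, or None if every token aligns."""
    pos = 0
    for i, tok in enumerate(tokens):
        # Skip whitespace between tokens in the original text.
        while pos < len(original) and original[pos].isspace():
            pos += 1
        if not original.startswith(tok, pos):
            return (i, tok, pos)
        pos += len(tok)
    return None


# Made-up example sentence, only to show the behaviour.
text = "He joined SR. in 2003."
bad_tokens = ["He", "joined", "SR..", "in", "2003", "."]
good_tokens = ["He", "joined", "SR.", "in", "2003", "."]

print(first_mismatch(text, bad_tokens))   # (2, 'SR..', 10)
print(first_mismatch(text, good_tokens))  # None
```

Reporting the offending token and offset this way would make it possible to log the bad file and continue with the remaining ones, rather than stopping at the assertion.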

syuoni added a commit to syuoni/DyGIE that referenced this issue Nov 1, 2021