Make petrarch2 output more JSON friendly #44

cegme · 2017-07-13T19:50:18Z

When using the petrarch2 output, it would be helpful if it were both Python-friendly and valid JSON. Currently, the output is a Python-specific literal. Can we make the output abide by JSON rules? That way the data can still be read in Python (via the json package), and other languages and tools (e.g. mongo, Redis) can read and use the output without custom converters.

Below is a snippet of petrarch2 output that shows the problem. The main issue is that actorroot, actortext, and eventtext contain dictionaries whose keys are tuples.
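
A minimal illustration of the failure, using a trimmed-down stand-in for the `actortext` field rather than real petrarch2 output (the variable names here are only for illustration): the standard `json` module refuses to serialize a dictionary whose keys are tuples.

```python
import json

# Trimmed stand-in for one field of the output shown in this issue.
meta = {"actortext": {("TUNJUD", "NGAEDU", "173"): ["Tunisian court", "Nigerian student"]}}

try:
    json.dumps(meta)
    error = None
except TypeError as exc:
    # json only accepts str, int, float, bool or None as dictionary keys
    error = exc

print("json.dumps failed:", error)
```

Note that the integer sentence keys (`1` under `sents`) would not raise, but `json.dumps` would silently turn them into the string `"1"`, and every tuple value would come back as a list.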

python3
>>> s = """{u'nytasiapacific20160622.0002': {'sents': {1: {'geo-location': [{u'placename': u'Beirut', u'countrycode': u'LBN', u'lon': 35.49442, u'admin1': u'Beyrouth', u'lat': 33.88894, u'searchterm': u'Beirut'}], u'events': [(u'TUNJUD', u'NGAEDU', u'173')], 'content': u'A Tunisian court has jailed a Nigerian student for two years for helping young militants join an armed Islamic group in Beirut, his lawyer said Wednesday.', u'meta': {u'actorroot': {(u'TUNJUD', u'NGAEDU', u'173'): [u'', u'']}, (u'TUNJUD', u'NGAEDU', u'173'): [[u'JAILED'], [u'HAS']], u'eventtext': {(u'TUNJUD', u'NGAEDU', u'173'): u'has jailed'}, u'nouns': [([u' TUNISIAN', u' COURT'], [u'TUNJUD'], [(u'TUN', []), [u'~']]), ([u' NIGERIAN', u' STUDENT'], [u'NGAEDU'], [(u'NGA', []), [u'~']]), ([u' MILITANTS', u' ARMED ISLAMIC GROUP', u' BEIRUT'], [u'DZAREBUAF', u'LBNUAF'], [[u'~'], (u'DZAREB', []), (u'LBN', [])]), ([u' LAWYER'], [u'~JUD'], [[u'~']])], u'actortext': {(u'TUNJUD', u'NGAEDU', u'173'): [u'Tunisian court', u'Nigerian student']}}, 'parsed': u'(S (S (NP (DT A )  (NNP TUNISIAN )  (NN COURT )  )  (VP (VBZ HAS )  (VP (VBN JAILED )  (NP (DT A )  (NNP NIGERIAN )  (NN STUDENT )  )  (PP (IN FOR )  (NP (NP (CD TWO )  (NNS YEARS )  )  (PP (IN FOR )  (S (VP (VBG HELPING )  (S (NP (JJ YOUNG )  (NNS MILITANTS )  )  (VP (VB JOIN )  (NP (DT AN )  (JJ ARMED )  (JJ ISLAMIC )  (NN GROUP )  )  (PP (IN IN )  (NP (NNP BEIRUT )  )  )  )  )  )  )  )  )  )  )  )  )  (, , )  (NP (PRP$ HIS )  (NN LAWYER )  )  (VP (VBD SAID )  (NP (NNP WEDNESDAY )  )  )  (. . )  )  ', u'issues': [[u'STUDENTS', 1], [u'NAMED_TERROR_GROUP', 1]]}}, 'meta': {'date': '20160621', 'headline': u'Lightning Ridge Journal: An Amateur Undertaking in Australian Mining Town With No Funeral Home', u'verbs': {u'actorroot': {(u'TUNJUD', u'NGAEDU', u'173'): [u'', u'']}, (u'TUNJUD', u'NGAEDU', u'173'): [[u'JAILED'], [u'HAS']], u'eventtext': {(u'TUNJUD', u'NGAEDU', u'173'): u'has jailed'}, u'nouns': [([u' TUNISIAN', u' COURT'], [u'TUNJUD'], [(u'TUN', []), [u'~']]), ([u' NIGERIAN', u' STUDENT'], [u'NGAEDU'], [(u'NGA', []), [u'~']]), ([u' MILITANTS', u' ARMED ISLAMIC GROUP', u' BEIRUT'], [u'DZAREBUAF', u'LBNUAF'], [[u'~'], (u'DZAREB', []), (u'LBN', [])]), ([u' LAWYER'], [u'~JUD'], [[u'~']])], u'actortext': {(u'TUNJUD', u'NGAEDU', u'173'): [u'Tunisian court', u'Nigerian student']}}}}}"""
>>> import ast
>>> import pprint
>>> z = ast.literal_eval(s)  # parse the Python literal into a dict
>>> pprint.pprint(z)
{'nytasiapacific20160622.0002':
  {'meta': 
    {'date': '20160621',
             'headline': 'Lightning Ridge Journal: An Amateur Undertaking in Australian Mining Town With No Funeral Home',
             'verbs': {'actorroot': {('TUNJUD', 'NGAEDU', '173'): ['', '']},
                      'actortext': {('TUNJUD', 'NGAEDU', '173'): ['Tunisian court', 'Nigerian student']},
                      'eventtext': {('TUNJUD', 'NGAEDU', '173'): 'has jailed'},
                      'nouns': [([' TUNISIAN', ' COURT'], ['TUNJUD'], [('TUN', []), ['~']]),
                               ([' NIGERIAN', ' STUDENT'], ['NGAEDU'], [('NGA', []), ['~']]),
                               ([' MILITANTS', ' ARMED ISLAMIC GROUP', ' BEIRUT'], ['DZAREBUAF', 'LBNUAF'],
                               [['~'], ('DZAREB', []), ('LBN', [])]),
                               ([' LAWYER'], ['~JUD'], [['~']])],
                      ('TUNJUD', 'NGAEDU', '173'): [['JAILED'], ['HAS']]}},
   'sents': {1: {'content': 'A Tunisian court has jailed a Nigerian student for two years for helping young militants join an armed '
                          'Islamic group in Beirut, his lawyer said Wednesday.',
               'events': [('TUNJUD', 'NGAEDU', '173')],
               'geo-location': [{'admin1': 'Beyrouth',
                               'countrycode': 'LBN',
                               'lat': 33.88894,
                               'lon': 35.49442,
                               'placename': 'Beirut',
                               'searchterm': 'Beirut'}],
               'issues': [['STUDENTS', 1], ['NAMED_TERROR_GROUP', 1]],
               'meta': {'actorroot': {('TUNJUD', 'NGAEDU', '173'): ['', '']},
                       'actortext': {('TUNJUD', 'NGAEDU', '173'): ['Tunisian court', 'Nigerian student']},
                       'eventtext': {('TUNJUD', 'NGAEDU', '173'): 'has jailed'},
                       'nouns': [([' TUNISIAN', ' COURT'], ['TUNJUD'], [('TUN', []), ['~']]),
                                ([' NIGERIAN', ' STUDENT'], ['NGAEDU'], [('NGA', []), ['~']]),
                                ([' MILITANTS', ' ARMED ISLAMIC GROUP', ' BEIRUT'], ['DZAREBUAF', 'LBNUAF'],
                                [['~'], ('DZAREB', []), ('LBN', [])]),
                                ([' LAWYER'], ['~JUD'], [['~']])],
                       ('TUNJUD', 'NGAEDU', '173'): [['JAILED'], ['HAS']]},
               'parsed': '(S (S (NP (DT A )  (NNP TUNISIAN )  (NN COURT )  )  (VP (VBZ HAS )  (VP (VBN JAILED )  (NP (DT A )  (NNP '
                         'NIGERIAN )  (NN STUDENT )  )  (PP (IN FOR )  (NP (NP (CD TWO )  (NNS YEARS )  )  (PP (IN FOR )  (S (VP '
                         '(VBG HELPING )  (S (NP (JJ YOUNG )  (NNS MILITANTS )  )  (VP (VB JOIN )  (NP (DT AN )  (JJ ARMED )  (JJ '
                         'ISLAMIC )  (NN GROUP )  )  (PP (IN IN )  (NP (NNP BEIRUT )  )  )  )  )  )  )  )  )  )  )  )  )  (, , )  '
                         '(NP (PRP$ HIS )  (NN LAWYER )  )  (VP (VBD SAID )  (NP (NNP WEDNESDAY )  )  )  (. . )  )  '}}}}

Three alternatives are:

1. Quotify the key:
'actortext': {('TUNJUD', 'NGAEDU', '173'): ['Tunisian court', 'Nigerian student']},
to
'actortext': {"['TUNJUD', 'NGAEDU', '173']": ["Tunisian court", "Nigerian student"]},
2. Use arrays instead of tuples/dictionaries:
'actortext': {('TUNJUD', 'NGAEDU', '173'): ['Tunisian court', 'Nigerian student']},
to
'actortext': [["TUNJUD", "NGAEDU", "173"], ["Tunisian court", "Nigerian student"]],
3. Use more descriptive dictionaries (code/text key pairs):
'actortext': {('TUNJUD', 'NGAEDU', '173'): ['Tunisian court', 'Nigerian student']},
to
'actortext': {"code" : ["TUNJUD", "NGAEDU", "173"], "text": ["Tunisian court", "Nigerian student"]},
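
As a rough sketch of what the first and third alternatives could look like as a post-processing step: the helper names `quotify_keys` and `as_records` below are hypothetical and are not part of petrarch2. The third alternative is written here as a list of code/text records, which is an assumption on my part so that a sentence with more than one event can still be represented.

```python
import json

def quotify_keys(obj):
    """First alternative: replace each tuple key with a JSON-array string."""
    if isinstance(obj, dict):
        return {
            (json.dumps(list(k)) if isinstance(k, tuple) else k): quotify_keys(v)
            for k, v in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        # tuples used as values become plain lists
        return [quotify_keys(v) for v in obj]
    return obj

def as_records(field):
    """Third alternative: turn {code_tuple: text} into code/text dictionaries."""
    return [{"code": list(code), "text": text} for code, text in field.items()]

actortext = {("TUNJUD", "NGAEDU", "173"): ["Tunisian court", "Nigerian student"]}

print(json.dumps(quotify_keys({"actortext": actortext})))
print(json.dumps({"actortext": as_records(actortext)}))
```

`quotify_keys` walks the whole structure, so it also copes with dictionaries that mix string keys and tuple keys, as the `meta` dictionary in the snippet above does. `as_records` only makes sense for a field in which every key is an event tuple.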

This output/structure is decided during the do_coding phase of petrarch2. It seems like this change may break a lot of existing code.
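
One way to limit that breakage, sketched under the assumption that the quotified-key alternative is adopted and that event keys are flat tuples of strings: existing Python consumers could restore the tuple keys when loading the JSON. The function name `restore_tuple_keys` is hypothetical.

```python
import json

def restore_tuple_keys(obj):
    """Turn keys that parse as JSON arrays back into tuples after json.loads."""
    if isinstance(obj, dict):
        restored = {}
        for key, value in obj.items():
            if key.startswith("["):
                try:
                    parsed = json.loads(key)
                except ValueError:
                    parsed = None  # an ordinary key that merely starts with "["
                if isinstance(parsed, list):
                    key = tuple(parsed)
            restored[key] = restore_tuple_keys(value)
        return restored
    if isinstance(obj, list):
        return [restore_tuple_keys(v) for v in obj]
    return obj

# Build a small JSON document that uses quotified keys, then read it back.
encoded = json.dumps({"eventtext": {json.dumps(["TUNJUD", "NGAEDU", "173"]): "has jailed"}})
decoded = restore_tuple_keys(json.loads(encoded))
print(decoded)
```

This only restores dictionary keys. Tuples that appeared as values (for example in `events` or `nouns`) stay as lists, and the integer sentence keys come back as strings, so code that depends on those types would still need adjusting.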


johnb30 · 2017-07-17T17:50:25Z

Definitely agree, and this is a thing that has been plaguing us for a while. It breaks hypnos downstream and causes other issues as well.

I don't think those particular fields are used anywhere downstream ( @cnnorris or @philip-schrodt could maybe shed more light on that) so changing the format slightly probably wouldn't cause the program to go haywire. This is very much something that should be fixed, though. I just don't think anyone's had the time to get around to it.

philip-schrodt · 2017-07-17T21:50:33Z

Yes, definitely an issue, though we're transitioning to the next generation of the coder, which uses universal-dependency (CONLL-U) parses as the input, and will be generating both CAMEO and PLOVER events as the output. So that is where our effort is going at the moment, though we will definitely try to stick with strictly json-compliant formats there (or at least that is the intention).

cegme · 2017-07-20T19:33:10Z

@johnb30 @philip-schrodt Thanks for the update.
Has an output format been decided for the universal-dependency coder? It looks like Universal Petrarch is using the same code. /cc @JingL

philip-schrodt · 2017-07-22T19:12:37Z

I'll be posting some new code on the pschrodt branch of Universal Petrarch next week to produce output in PLOVER JSON format, though I think we'll probably maintain parallel systems for CAMEO (the existing tab-delimited format) and PLOVER. A prototype of that output routine is in the mudflat system: https://github.com/philip-schrodt/mudflat. (The reference for PLOVER is https://github.com/openeventdata/PLOVER; there's a current draft of the manual there with the output format.)
