Commit

Merge pull request #5 from SCANL/qol_update

Complete QOL update minus some important-but-not-required features we will add in the near future. See #3 and #4.

cnewman authored Aug 1, 2021
2 parents 6a4b736 + 245811d commit 079b110
Showing 17 changed files with 5,442 additions and 1,808 deletions.
8 changes: 6 additions & 2 deletions .github/workflows/python-app.yml
@@ -5,9 +5,11 @@ name: Python application

on:
push:
branches: [ main ]
branches:
- '*'
pull_request:
branches: [ main ]
branches:
- '*'

jobs:
build:
@@ -35,4 +37,6 @@ jobs:
- name: Test with unittest
run: |
cd ensemble_tagger_implementation
export PYTHONPATH=.
export PERL5LIB=./POSSE/Scripts
python -m unittest
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
__pycache__
*.log
47 changes: 27 additions & 20 deletions README.md
@@ -1,6 +1,3 @@



# SCANL Ensemble tagger
This is the official release of the SCANL ensemble part-of-speech tagger.

@@ -39,25 +36,36 @@ Once it is compiled, you should have an executable in the build/bin folder.

Before running the python server, you need to install required modules. To download all of the required modules, use:

sudo pip3 install -r requirements.txt
sudo pip3 install -r requirements.txt

You will then need to configure flask, so that it knows how to run the server:
Configure ``PYTHONPATH`` as well:

export FLASK_APP=model_classification.py
export PYTHONPATH=~/path/to/ensemble_tagger/ensemble_tagger_implementation

You will also need to configure POSSE (one of the taggers). Do the following:
1. Install wordnet-dev
2. Open POSSE/Scripts/getWordNetType.sh
3. You **MAY** need to modify this line, which is at the top of the file: `/usr/bin/wn $1 | grep "Information available for (noun|verb|adj|adv) $1" | cut -d " " -f4` by changing the path to wordnet (/usr/bin/wn) to the path on your own system. But usr/bin is the typical installation directory so it is unlikely you need to do this step.
4. set your PERL5LIB path to point to the Scripts folder in POSSE's directory: `export PERL5LIB=/path/from/root/ensemble_tagger/POSSE/Scripts`
3. You **MAY** need to modify this line, which is at the top of the file: ``/usr/bin/wn $1 | grep "Information available for (noun|verb|adj|adv) $1" | cut -d " " -f4`` by changing the path to wordnet (/usr/bin/wn) to the path on your own system. But /usr/bin is the typical installation directory, so it is unlikely you need to do this step.
4. set your PERL5LIB path to point to the Scripts folder in POSSE's directory: ``export PERL5LIB=~/path/to/ensemble_tagger/POSSE/Scripts``

Finally, you need to install Spiral, which we use for identifier splitting:

sudo pip3 install git+https://github.com/casics/spiral.git
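
As a quick sanity check that Spiral installed (a sketch, not from the repository; the tagger calls Spiral's ronin splitter the same way in ``ensemble_functions.py``):
```
# Check that Spiral is importable and splits identifiers as expected.
from spiral import ronin

print(ronin.split("GetNumberArray"))  # expected: ['Get', 'Number', 'Array']
```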

Once it is all installed, you should be able to run the server (you may need to go into the ``ensemble_tagger_implementation`` directory before you do the following command):
Once it is all installed, you should be able to run the server:

flask run
cd ensemble_tagger_implementation
python3 routes.py [MODEL]

Where MODEL can be one of the following. ``DTCP`` is the default if you do not specify a model:
1. DTCP
2. RFCP
3. DTCA
4. RFCA
5. DTNP
6. RFNP
7. DTNA
8. RFNA
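
For example, to start the server with one of the non-default models (the names map to .pkl files via ``tagger_config/model_config.yml``):
```
python3 routes.py RFCP
```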

This will start the server, which will listen for identifier names sent via HTTP over the route:

@@ -78,6 +86,14 @@ Tag a function: ``http://127.0.0.1:5000/int/GetNumberArray(int* begin, int* end)``

Tag a class: ``http://127.0.0.1:5000/class/PersonRecord/CLASS``
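
As a quick sketch of doing the same from Python (assuming you have the ``requests`` package installed; the response body is whatever annotation the server returns):
```
# Minimal sketch: query the running tagger for a class identifier.
# Assumes the server above is listening on 127.0.0.1:5000.
import requests

response = requests.get("http://127.0.0.1:5000/class/PersonRecord/CLASS")
print(response.text)  # part-of-speech annotation for 'PersonRecord'
```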

**You should run the tests to validate that everything is set up correctly at this point.**

Make sure you're in the ``ensemble_tagger_implementation`` directory, then run:
```
python -m unittest
```
If the tests do not pass, something above is misconfigured. Re-read the instructions carefully; if you can't figure out what's wrong, open an issue.

You can use HTTP to interact with the server and get part-of-speech annotations. This is where the C++ script comes in. You can run this script using the following command, assuming you're in the build folder:

./bin/grabidentifiers {srcML file name}
@@ -86,15 +102,6 @@ This will run the program that automatically queries the route above using all i

If you are unfamiliar with srcML, [check it out](https://www.srcml.org/). Since the actual tagger is a web server, you don't have to use srcML. You could always use other AST-based code representations, or any other method of obtaining identifier information. If you decide not to use srcML, you should ignore the C++ script.
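
If you do roll your own extraction, a hypothetical sketch (the identifier triples below are illustrative, not from this repository) is to build the same route per identifier:
```
# Hypothetical sketch: send your own (type, name, context) triples to the
# server instead of using srcML and the C++ script.
import requests

identifiers = [
    ("int", "GetNumberArray", "FUNCTION"),
    ("class", "PersonRecord", "CLASS"),
]
for id_type, name, context in identifiers:
    url = f"http://127.0.0.1:5000/{id_type}/{name}/{context}"
    print(requests.get(url).text)
```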

## Configure the script
### Choose a model
You can configure the script yourself by commenting out various parts of it and uncommenting others. There is a comment after each .pkl file, telling you which configuration each model represents. Uncomment the one you want to run and comment out the ones you don't want to run. The code looks like this:

input_model = 'models/model_DecisionTreeClassifier_training_set_conj.pkl' #DTCP

### Choose a tagset
You will also need to comment/uncomment the tagsets at the top depending on which model you are using. You can look at the comment above each tagset to see which two configurations each one should be used for. Each tagset is used for one decision tree configuration and one random forest configuration, so two configurations in total.

## Errors?
Please make an issue if you run into errors

@@ -105,4 +112,4 @@ Please make an issue if you run into errors
The data used to train this tagger can be found here: https://github.com/SCANL/datasets/tree/master/ensemble_tagger_training_data

# Interested in our other work?
Find our other research here: https://www.scanl.org/
Find our other research [at our webpage](https://www.scanl.org/) and check out the [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue)
123 changes: 123 additions & 0 deletions ensemble_tagger_implementation/ensemble_functions.py
@@ -0,0 +1,123 @@
from process_features import Get_identifier_context, CODE_CONTEXT, Convert_tag_to_numeric_category
from preprocess_identifiers import Parse_posse, Parse_stanford, Parse_swum, Split_raw_identifier

import logging
root_logger = logging.getLogger(__name__)
root_logger.setLevel(logging.DEBUG)
handler = logging.FileHandler('tagger_error.log', 'a', 'utf-8')
root_logger.addHandler(handler)
import pandas as pd
import sys, subprocess, joblib, pexpect
import yaml
from spiral import ronin

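#spawn one long-lived Stanford tagger process up front so every identifier
#query below reuses the same JVM instead of restarting it per request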
stanford_process = pexpect.spawn(
"""java -mx3g -cp
'../stanford-postagger-2018-10-16/stanford-postagger.jar:'
edu.stanford.nlp.tagger.maxent.MaxentTagger
-model ../stanford-postagger-2018-10-16/models/english-bidirectional-distsim.tagger""")

stanford_process.expect("(For EOF, use Return, Ctrl-D on Unix; Enter, Ctrl-Z, Enter on Windows.)")

def Process_identifier_with_swum(identifier_data, context_of_identifier):
#format identifier string in preparation to send it to SWUM
identifier_type_and_name = Split_raw_identifier(identifier_data)
split_identifier_name_raw = ronin.split(identifier_type_and_name[1])
split_identifier_name = '_'.join(ronin.split(identifier_type_and_name[1]))
if Get_identifier_context(context_of_identifier) != CODE_CONTEXT.FUNCTION:
swum_string = "{identifier_type} {identifier_name}".format(identifier_name = split_identifier_name, identifier_type = identifier_type_and_name[0])
swum_process = subprocess.Popen(['java', '-jar', '../SWUM/SWUM_POS/swum.jar', swum_string, '2', 'true'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
else:
split_identifier_name = split_identifier_name+'('+identifier_data.split('(')[1]
swum_string = " {identifier_type} {identifier_name}".format(identifier_name = split_identifier_name, identifier_type = identifier_type_and_name[0])
swum_process = subprocess.Popen(['java', '-jar', '../SWUM/SWUM_POS/swum.jar', swum_string, '1', 'true'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

swum_out, swum_err = swum_process.communicate()
swum_parsed_out = Parse_swum(swum_out.decode('utf-8').strip(), split_identifier_name_raw)
return swum_parsed_out

def Process_identifier_with_posse(identifier_data, context_of_identifier):
#format identifier string in preparation to send it to POSSE
identifier_type_and_name = Split_raw_identifier(identifier_data)
split_identifier_name_raw = ronin.split(identifier_type_and_name[1])
split_identifier_name = ' '.join(split_identifier_name_raw)
posse_string = "{data} | {identifier_name}".format(data = identifier_data, identifier_name = split_identifier_name)
type_value = Get_identifier_context(context_of_identifier)
if any([type_value == x for x in [CODE_CONTEXT.DECLARATION, CODE_CONTEXT.ATTRIBUTE, CODE_CONTEXT.PARAMETER]]):
posse_process = subprocess.Popen(['../POSSE/Scripts/mainParser.pl', 'A', posse_string], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
elif type_value == CODE_CONTEXT.CLASS:
posse_process = subprocess.Popen(['../POSSE/Scripts/mainParser.pl', 'C', posse_string], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
else:
posse_process = subprocess.Popen(['../POSSE/Scripts/mainParser.pl', 'M', posse_string], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

posse_out, posse_err = posse_process.communicate()
posse_out_parsed = Parse_posse(posse_out.decode('utf-8').strip(), split_identifier_name_raw)
return posse_out_parsed

def Process_identifier_with_stanford(identifier_data, context_of_identifier):
identifier_type_and_name = Split_raw_identifier(identifier_data)
split_identifier_name_raw = ronin.split(identifier_type_and_name[1])
if Get_identifier_context(context_of_identifier) != CODE_CONTEXT.FUNCTION:
split_identifier_name = "{identifier_name}".format(identifier_name=' '.join(split_identifier_name_raw))
else:
split_identifier_name = "I {identifier_name}".format(identifier_name=' '.join(split_identifier_name_raw))

stanford_process.sendline(split_identifier_name)
stanford_process.expect(' '.join([word+'_[A-Z]+' for word in split_identifier_name_raw]))
#stanford_out, stanford_err = stanford_process.communicate()
stanford_out = Parse_stanford(stanford_process.after.decode('utf-8').strip(), split_identifier_name_raw)
return stanford_out

def Generate_ensemble_tagger_input_format(external_tagger_outputs):
ensemble_input = dict()
for tagger_output in external_tagger_outputs:
identifier, grammar_pattern = tagger_output.split(',')
identifier_grammarPattern = zip(identifier.split(), grammar_pattern.split())
i = 0
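#key each word by its position (e.g. 'get0') and collect one part-of-speech
#vote per external tagger under that key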
for word_gp_pair in identifier_grammarPattern:
if word_gp_pair[0]+str(i) in ensemble_input:
ensemble_input[word_gp_pair[0]+str(i)].append(word_gp_pair[1])
else:
ensemble_input[word_gp_pair[0]+str(i)] = [word_gp_pair[1]]
i = i + 1
root_logger.debug("Final ensemble input: {identifierDat}".format(identifierDat=ensemble_input))
return ensemble_input

def Run_external_taggers(identifier_data, context_of_identifier):
external_tagger_outputs = []
#split and process identifier data into external tagger outputs
external_tagger_outputs.append(Process_identifier_with_swum(identifier_data, context_of_identifier))
external_tagger_outputs.append(Process_identifier_with_posse(identifier_data, context_of_identifier))
external_tagger_outputs.append(Process_identifier_with_stanford(identifier_data, context_of_identifier))
root_logger.debug("raw ensemble input: {identifierDat}".format(identifierDat=external_tagger_outputs))
return Generate_ensemble_tagger_input_format(external_tagger_outputs)

def Annotate_word(swum_tag, posse_tag, stanford_tag, normalized_length, code_context):
model_dictionary = input_model = swum = posse = stanford = None

#Determine whether to go with default model (DTCP) or if user selected one
with open("tagger_config/model_config.yml", 'r') as stream:
model_dictionary = yaml.safe_load(stream)
if len(sys.argv) < 2:
input_model = model_dictionary['models']['DTCP']
swum, posse, stanford = Convert_tag_to_numeric_category(swum_tag, posse_tag, stanford_tag, 'DTCP')
else:
input_model = model_dictionary['models'][sys.argv[1]]
swum, posse, stanford = Convert_tag_to_numeric_category(swum_tag, posse_tag, stanford_tag, sys.argv[1])

data = {'SWUM_TAG': [swum],
'POSSE_TAG': [posse],
'STANFORD_TAG': [stanford],
'NORMALIZED_POSITION': [normalized_length],
'CONTEXT': [code_context]
}

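#assemble a one-row feature frame; the pickled models expect exactly these
#five feature columns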
df_features = pd.DataFrame(data,
columns=['SWUM_TAG', 'POSSE_TAG', 'STANFORD_TAG', 'NORMALIZED_POSITION', 'CONTEXT'])

clf = joblib.load(input_model)
y_pred = clf.predict(df_features)
return (y_pred[0])

#read_from_cmd_line()
