Commit

Merge pull request #5 from SCANL/qol_update

Complete QOL update minus some important-but-not-required features we will add in the near future. See #3 and #4.

cnewman authored Aug 1, 2021
2 parents 6a4b736 + 245811d commit 079b110
Showing 17 changed files with 5,442 additions and 1,808 deletions.
8 changes: 6 additions & 2 deletions .github/workflows/python-app.yml
@@ -5,9 +5,11 @@ name: Python application

on:
push:
branches: [ main ]
branches:
- '*'
pull_request:
branches: [ main ]
branches:
- '*'

jobs:
build:
@@ -35,4 +37,6 @@ jobs:
- name: Test with unittest
run: |
cd ensemble_tagger_implementation
export PYTHONPATH=.
export PERL5LIB=./POSSE/Scripts
python -m unittest
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
__pycache__
*.log
47 changes: 27 additions & 20 deletions README.md
@@ -1,6 +1,3 @@



# SCANL Ensemble tagger
This is the official release of the SCANL ensemble part-of-speech tagger.

@@ -39,25 +36,36 @@ Once it is compiled, you should have an executable in the build/bin folder.

Before running the python server, you need to install required modules. To download all of the required modules, use:

sudo pip3 install -r requirements.txt
sudo pip3 install -r requirements.txt

You will then need to configure flask, so that it knows how to run the server:
Configure ``PYTHONPATH`` as well:

export FLASK_APP=model_classification.py
export PYTHONPATH=~/path/to/ensemble_tagger/ensemble_tagger_implementation

You will also need to configure POSSE (one of the taggers). Do the following:
1. Install wordnet-dev
2. Open POSSE/Scripts/getWordNetType.sh
3. You **MAY** need to modify this line, which is at the top of the file: `/usr/bin/wn $1 | grep "Information available for (noun|verb|adj|adv) $1" | cut -d " " -f4` by changing the path to wordnet (/usr/bin/wn) to the path on your own system. But usr/bin is the typical installation directory so it is unlikely you need to do this step.
4. set your PERL5LIB path to point to the Scripts folder in POSSE's directory: `export PERL5LIB=/path/from/root/ensemble_tagger/POSSE/Scripts`
3. You **MAY** need to modify this line, which is at the top of the file: ``/usr/bin/wn $1 | grep "Information available for (noun|verb|adj|adv) $1" | cut -d " " -f4`` by changing the path to wordnet (/usr/bin/wn) to the path on your own system. But /usr/bin is the typical installation directory, so it is unlikely you need to do this step.
4. set your PERL5LIB path to point to the Scripts folder in POSSE's directory: ``export PERL5LIB=~/path/to/ensemble_tagger/POSSE/Scripts``

Finally, you need to install Spiral, which we use for identifier splitting:

sudo pip3 install git+https://github.com/casics/spiral.git
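
As a quick sanity check that Spiral installed (a sketch, not from the repository; the tagger calls Spiral's ronin splitter the same way in ``ensemble_functions.py``):
```
# Check that Spiral is importable and splits identifiers as expected.
from spiral import ronin

print(ronin.split("GetNumberArray"))  # expected: ['Get', 'Number', 'Array']
```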

Once it is all installed, you should be able to run the server (you may need to go into the ``ensemble_tagger_implementation`` directory before you do the following command):
Once it is all installed, you should be able to run the server:

flask run
cd ensemble_tagger_implementation
python3 routes.py [MODEL]

Where MODEL can be one of the following. ``DTCP`` is the default if you do not specify a model:
1. DTCP
2. RFCP
3. DTCA
4. RFCA
5. DTNP
6. RFNP
7. DTNA
8. RFNA
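
For example, to start the server with one of the non-default models (the names map to .pkl files via ``tagger_config/model_config.yml``):
```
python3 routes.py RFCP
```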

This will start the server, which will listen for identifier names sent via HTTP over the route:

@@ -78,6 +86,14 @@ Tag a function: ``http://127.0.0.1:5000/int/GetNumberArray(int* begin, int* end)``

Tag a class: ``http://127.0.0.1:5000/class/PersonRecord/CLASS``
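
As a quick sketch of doing the same from Python (assuming you have the ``requests`` package installed; the response body is whatever annotation the server returns):
```
# Minimal sketch: query the running tagger for a class identifier.
# Assumes the server above is listening on 127.0.0.1:5000.
import requests

response = requests.get("http://127.0.0.1:5000/class/PersonRecord/CLASS")
print(response.text)  # part-of-speech annotation for 'PersonRecord'
```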

**You should run the tests to validate that everything is set up correctly at this point.**

Make sure you're in the ``ensemble_tagger_implementation`` directory, then run:
```
python -m unittest
```
If the tests do not pass, something above is misconfigured. Re-read the instructions carefully; if you can't figure out what's wrong, open an issue.

You can use HTTP to interact with the server and get part-of-speech annotations. This is where the C++ script comes in. You can run this script using the following command, assuming you're in the build folder:

./bin/grabidentifiers {srcML file name}
@@ -86,15 +102,6 @@ This will run the program that automatically queries the route above using all i

If you are unfamiliar with srcML, [check it out](https://www.srcml.org/). Since the actual tagger is a web server, you don't have to use srcML. You could always use other AST-based code representations, or any other method of obtaining identifier information. If you decide not to use srcML, you should ignore the C++ script.
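
If you do roll your own extraction, a hypothetical sketch (the identifier triples below are illustrative, not from this repository) is to build the same route per identifier:
```
# Hypothetical sketch: send your own (type, name, context) triples to the
# server instead of using srcML and the C++ script.
import requests

identifiers = [
    ("int", "GetNumberArray", "FUNCTION"),
    ("class", "PersonRecord", "CLASS"),
]
for id_type, name, context in identifiers:
    url = f"http://127.0.0.1:5000/{id_type}/{name}/{context}"
    print(requests.get(url).text)
```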

## Configure the script
### Choose a model
You can configure the script yourself by commenting out various parts of it and uncommenting others. There is a comment after each .pkl file, telling you which configuration each model represents. Uncomment the one you want to run and comment out the ones you don't want to run. The code looks like this:

input_model = 'models/model_DecisionTreeClassifier_training_set_conj.pkl' #DTCP

### Choose a tagset
You will also need to comment/uncomment the tagsets at the top depending on which model you are using. You can look at the comment above each tagset to see which two configurations each one should be used for. Each tagset is used for one decision tree configuration and one random forest configuration, so two configurations in total.

## Errors?
Please make an issue if you run into errors

@@ -105,4 +112,4 @@ Please make an issue if you run into errors
The data used to train this tagger can be found here: https://github.com/SCANL/datasets/tree/master/ensemble_tagger_training_data

# Interested in our other work?
Find our other research here: https://www.scanl.org/
Find our other research [at our webpage](https://www.scanl.org/) and check out the [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue)
123 changes: 123 additions & 0 deletions ensemble_tagger_implementation/ensemble_functions.py
@@ -0,0 +1,123 @@
from process_features import Get_identifier_context, CODE_CONTEXT, Convert_tag_to_numeric_category
from preprocess_identifiers import Parse_posse, Parse_stanford, Parse_swum, Split_raw_identifier

import logging
root_logger = logging.getLogger(__name__)
root_logger.setLevel(logging.DEBUG)
handler = logging.FileHandler('tagger_error.log', 'a', 'utf-8')
root_logger.addHandler(handler)
import pandas as pd
import sys, subprocess, joblib, pexpect
import yaml
from spiral import ronin

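#spawn one long-lived Stanford tagger process up front so every identifier
#query below reuses the same JVM instead of restarting it per request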
stanford_process = pexpect.spawn(
"""java -mx3g -cp
'../stanford-postagger-2018-10-16/stanford-postagger.jar:'
edu.stanford.nlp.tagger.maxent.MaxentTagger
-model ../stanford-postagger-2018-10-16/models/english-bidirectional-distsim.tagger""")

stanford_process.expect("(For EOF, use Return, Ctrl-D on Unix; Enter, Ctrl-Z, Enter on Windows.)")

def Process_identifier_with_swum(identifier_data, context_of_identifier):
#format identifier string in preparation to send it to SWUM
identifier_type_and_name = Split_raw_identifier(identifier_data)
split_identifier_name_raw = ronin.split(identifier_type_and_name[1])
split_identifier_name = '_'.join(ronin.split(identifier_type_and_name[1]))
if Get_identifier_context(context_of_identifier) != CODE_CONTEXT.FUNCTION:
swum_string = "{identifier_type} {identifier_name}".format(identifier_name = split_identifier_name, identifier_type = identifier_type_and_name[0])
swum_process = subprocess.Popen(['java', '-jar', '../SWUM/SWUM_POS/swum.jar', swum_string, '2', 'true'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
else:
split_identifier_name = split_identifier_name+'('+identifier_data.split('(')[1]
swum_string = " {identifier_type} {identifier_name}".format(identifier_name = split_identifier_name, identifier_type = identifier_type_and_name[0])
swum_process = subprocess.Popen(['java', '-jar', '../SWUM/SWUM_POS/swum.jar', swum_string, '1', 'true'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

swum_out, swum_err = swum_process.communicate()
swum_parsed_out = Parse_swum(swum_out.decode('utf-8').strip(), split_identifier_name_raw)
return swum_parsed_out

def Process_identifier_with_posse(identifier_data, context_of_identifier):
#format identifier string in preparation to send it to POSSE
identifier_type_and_name = Split_raw_identifier(identifier_data)
split_identifier_name_raw = ronin.split(identifier_type_and_name[1])
split_identifier_name = ' '.join(split_identifier_name_raw)
posse_string = "{data} | {identifier_name}".format(data = identifier_data, identifier_name = split_identifier_name)
type_value = Get_identifier_context(context_of_identifier)
if any([type_value == x for x in [CODE_CONTEXT.DECLARATION, CODE_CONTEXT.ATTRIBUTE, CODE_CONTEXT.PARAMETER]]):
posse_process = subprocess.Popen(['../POSSE/Scripts/mainParser.pl', 'A', posse_string], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
elif type_value == CODE_CONTEXT.CLASS:
posse_process = subprocess.Popen(['../POSSE/Scripts/mainParser.pl', 'C', posse_string], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
else:
posse_process = subprocess.Popen(['../POSSE/Scripts/mainParser.pl', 'M', posse_string], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

posse_out, posse_err = posse_process.communicate()
posse_out_parsed = Parse_posse(posse_out.decode('utf-8').strip(), split_identifier_name_raw)
return posse_out_parsed

def Process_identifier_with_stanford(identifier_data, context_of_identifier):
identifier_type_and_name = Split_raw_identifier(identifier_data)
split_identifier_name_raw = ronin.split(identifier_type_and_name[1])
if Get_identifier_context(context_of_identifier) != CODE_CONTEXT.FUNCTION:
split_identifier_name = "{identifier_name}".format(identifier_name=' '.join(split_identifier_name_raw))
else:
split_identifier_name = "I {identifier_name}".format(identifier_name=' '.join(split_identifier_name_raw))

stanford_process.sendline(split_identifier_name)
stanford_process.expect(' '.join([word+'_[A-Z]+' for word in split_identifier_name_raw]))
#stanford_out, stanford_err = stanford_process.communicate()
stanford_out = Parse_stanford(stanford_process.after.decode('utf-8').strip(), split_identifier_name_raw)
return stanford_out

def Generate_ensemble_tagger_input_format(external_tagger_outputs):
ensemble_input = dict()
for tagger_output in external_tagger_outputs:
identifier, grammar_pattern = tagger_output.split(',')
identifier_grammarPattern = zip(identifier.split(), grammar_pattern.split())
i = 0
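#key each word by its position (e.g. 'get0') and collect one part-of-speech
#vote per external tagger under that key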
for word_gp_pair in identifier_grammarPattern:
if word_gp_pair[0]+str(i) in ensemble_input:
ensemble_input[word_gp_pair[0]+str(i)].append(word_gp_pair[1])
else:
ensemble_input[word_gp_pair[0]+str(i)] = [word_gp_pair[1]]
i = i + 1
root_logger.debug("Final ensemble input: {identifierDat}".format(identifierDat=ensemble_input))
return ensemble_input

def Run_external_taggers(identifier_data, context_of_identifier):
external_tagger_outputs = []
#split and process identifier data into external tagger outputs
external_tagger_outputs.append(Process_identifier_with_swum(identifier_data, context_of_identifier))
external_tagger_outputs.append(Process_identifier_with_posse(identifier_data, context_of_identifier))
external_tagger_outputs.append(Process_identifier_with_stanford(identifier_data, context_of_identifier))
root_logger.debug("raw ensemble input: {identifierDat}".format(identifierDat=external_tagger_outputs))
return Generate_ensemble_tagger_input_format(external_tagger_outputs)

def Annotate_word(swum_tag, posse_tag, stanford_tag, normalized_length, code_context):
model_dictionary = input_model = swum = posse = stanford = None

#Determine whether to go with default model (DTCP) or if user selected one
with open("tagger_config/model_config.yml", 'r') as stream:
model_dictionary = yaml.safe_load(stream)
if len(sys.argv) < 2:
input_model = model_dictionary['models']['DTCP']
swum, posse, stanford = Convert_tag_to_numeric_category(swum_tag, posse_tag, stanford_tag, 'DTCP')
else:
input_model = model_dictionary['models'][sys.argv[1]]
swum, posse, stanford = Convert_tag_to_numeric_category(swum_tag, posse_tag, stanford_tag, sys.argv[1])

data = {'SWUM_TAG': [swum],
'POSSE_TAG': [posse],
'STANFORD_TAG': [stanford],
'NORMALIZED_POSITION': [normalized_length],
'CONTEXT': [code_context]
}

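#assemble a one-row feature frame; the pickled models expect exactly these
#five feature columns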
df_features = pd.DataFrame(data,
columns=['SWUM_TAG', 'POSSE_TAG', 'STANFORD_TAG', 'NORMALIZED_POSITION', 'CONTEXT'])

clf = joblib.load(input_model)
y_pred = clf.predict(df_features)
return (y_pred[0])

#read_from_cmd_line()
