Requirements -
Libraries -
sklearn, scipy, numpy, pickle (pickle ships with the Python standard library and needs no separate install)
Files -
Please ensure that the directory structure is maintained.
Preprocessing the Data
For Models 1 to 12
- cd into the codes directory
- Get the training data and combine all the files into one file named combined.txt. Each script below writes an output file that the next command reads.
- Cleaning the Data
python3 clean.py combined.txt
python3 more_clean.py cleaned.txt
- Removing sentences from the corpus that are wrongly chunked.
python3 remove_wrongly_chunked_sentences.py more_cleaned.txt head_mapping.json
- Numbering the sentences
python3 number_sentences.py correct_sentences.txt
- Extracting Heads of Each Chunk
python3 head_finding.py numbered.txt
- Extracting Dependency Relations between the heads
python3 dependencies_finding.py head_sentences.txt
- Applying the arc-eager algorithm to remove the sentences which cannot be parsed. The non-parsable sentences are the ones with projectivity issues (see the sketch after this list).
python3 arc_eager.py dependency_sentences.txt
- Extracting some pairs of chunks that are unrelated.
python3 get_unrelated_dependencies.py parsable_dependency_sentences.txt
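Why does the arc-eager step drop sentences? A dependency tree is non-projective exactly when two of its arcs cross, and the arc-eager transition system cannot derive such trees. Below is a minimal sketch of a crossing-arcs check; the function name and head-list input format are illustrative assumptions, not the repository's actual code.

```python
# Minimal projectivity check: a tree is projective iff no two arcs cross.
# `heads[i]` is the head of token i+1 (tokens are 1-based, 0 is the root);
# this input convention is an assumption for illustration.
def is_projective(heads):
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            # Two arcs cross when exactly one endpoint of the second
            # lies strictly inside the span of the first.
            if l1 < l2 < r1 < r2:
                return False
    return True

print(is_projective([2, 0, 2]))     # True: a simple projective chain
print(is_projective([3, 4, 0, 3]))  # False: arcs (1,3) and (2,4) cross
```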
For Models 13 to 17
- cd into the Morphological Features directory
- Follow steps 2 to 9 above (from combining the training data through extracting the unrelated pairs)
After preprocessing, the corpus size drops from 13K sentences in the training data to 7.5K! The other 5.5K sentences either had insufficient data, were wrongly chunked, or were non-parsable due to projectivity issues.
Preparing the Data for Training
- Rename the final file named parsable_dependency_sentences_with_unrelated_pairs.txt to Training.txt
- Preprocess the testing and development data in the same way and rename the resulting files to Testing.txt and Development.txt
For Models 1 to 12
- cd into the Data folder
- Move all three files (Training.txt, Development.txt and Testing.txt) into the Data folder.
- Run the following four Python scripts:
python3 get_distinct_chunk_tags.py Training.txt Development.txt Testing.txt
python3 get_distinct_pos_tags.py Training.txt Development.txt Testing.txt
python3 get_distinct_words.py Training.txt Development.txt Testing.txt
python3 get_distinct_morph_vectors.py Training.txt Development.txt Testing.txt
These four scripts extract all the distinct chunk tags, POS tags, words and morphological feature vectors, and store them in files in the same folder (a sketch of the idea follows this list).
- Append the data from Development.txt into the Training.txt file
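For reference, here is a minimal sketch of what the get_distinct_* scripts do conceptually: collect every distinct value in one column across the three files and write the sorted result out. The column layout and output filename are assumptions for illustration, not the repository's actual format.

```python
# Sketch of a get_distinct_* script: gather every distinct value in one
# column of whitespace-separated input files. The column index and the
# output filename below are hypothetical.
import sys

def collect_distinct(paths, column):
    values = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.split()
                if len(fields) > column:
                    values.add(fields[column])
    return sorted(values)

if __name__ == "__main__":
    # e.g. python3 get_distinct_pos_tags.py Training.txt Development.txt Testing.txt
    tags = collect_distinct(sys.argv[1:], column=1)  # hypothetical POS column
    with open("distinct_pos_tags.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(tags) + "\n")
```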
For Models 13 to 17
- Move all three files (Training.txt, Development.txt and Testing.txt) into the Morphological Features directory
- cd into the "Data/For Model 13 to 15" directory
- Run the following four Python scripts (the paths are quoted because the directory name contains a space):
python3 get_distinct_chunk_tags.py "../../Morphological Features/Training.txt" "../../Morphological Features/Development.txt" "../../Morphological Features/Testing.txt"
python3 get_distinct_pos_tags.py "../../Morphological Features/Training.txt" "../../Morphological Features/Development.txt" "../../Morphological Features/Testing.txt"
python3 get_distinct_words.py "../../Morphological Features/Training.txt" "../../Morphological Features/Development.txt" "../../Morphological Features/Testing.txt"
python3 get_distinct_morph_vectors.py "../../Morphological Features/Training.txt" "../../Morphological Features/Development.txt" "../../Morphological Features/Testing.txt"
These four scripts extract all the distinct chunk tags, POS tags, words and morphological feature vectors, and store them in files in the same folder.
- Append the data from Development.txt to the Training.txt file, as sketched below
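The append step is a one-liner; a minimal Python version (the shell equivalent is `cat Development.txt >> Training.txt`) is:

```python
# Append the development data to the training data in place;
# filenames follow the steps above.
with open("Development.txt", encoding="utf-8") as dev, \
        open("Training.txt", "a", encoding="utf-8") as train:
    train.write(dev.read())
```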
Running the Models -
- cd into the Data/Models directory
- cd into the directory of the model you want to run
For Models 1 to 12
- Train the Model by running
python3 train.py ../../Training.txt
- Test the Model by running
python3 predict.py ../../Testing.txt
For Models 13 to 17
- Train the Model by running
python3 train.py "../../../Morphological Features/Training.txt"
- Test the Model by running
python3 predict.py "../../../Morphological Features/Testing.txt"
Reading the output -
- The values printed in the first two arrays are the predictions of the SVM and Logistic Regression models respectively. Depending on what the model predicts (the L/R/U relationship or the dependency relation), these values are either arc directions or arc labels.
- The numeric values are the accuracies of the SVM and Logistic Regression models respectively, as sketched below.
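To make that output order concrete, here is a hedged sketch of the train/predict flow it implies, using sklearn and pickle from the Requirements. The feature matrices X and y, the model filename, and the classifier settings are placeholders, not the repository's actual code.

```python
# Sketch: train an SVM and a Logistic Regression model on the same features,
# pickle them, then print predictions and accuracies in the order the
# "Reading the output" section describes. Feature extraction is omitted.
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def train(X, y, model_path="models.pkl"):
    svm = LinearSVC().fit(X, y)
    logreg = LogisticRegression(max_iter=1000).fit(X, y)
    with open(model_path, "wb") as f:
        pickle.dump((svm, logreg), f)

def predict(X_test, y_test, model_path="models.pkl"):
    with open(model_path, "rb") as f:
        svm, logreg = pickle.load(f)
    print(svm.predict(X_test))           # first array: SVM predictions
    print(logreg.predict(X_test))        # second array: Logistic Regression predictions
    print(svm.score(X_test, y_test))     # SVM accuracy
    print(logreg.score(X_test, y_test))  # Logistic Regression accuracy
```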
Source of our Data -
We processed about 15K sentences of gold-labelled dependency data from the Marathi dataset at https://ltrc.iiit.ac.in/downloads/kolhi/.
Github Repository Link
Work Division
The work was divided so that each member contributed to every task.
Tanish (2018114005) - Mostly Coding Part
Mukund (2018114015) - Mostly Theoretical Part
We both feel that we have contributed equally to this project.
Under the Guidance of
Manish Shrivastava, Assistant Professor, IIIT Hyderabad
Pruthwik Mishra, PhD Student, IIIT Hyderabad
Alok Debnath, Introduction to NLP Course TA, IIIT Hyderabad