Requirements -
Libraries -
sklearn, scipy, numpy, pickle (pickle ships with the Python standard library and needs no separate install)
Files -
Please ensure that the directory structure is maintained.
Preprocessing the Data
For Models 1 to 12
- cd into the codes directory
- Get the training data and combine all the files into one file named combined.txt. Each script below writes an output file that the next command reads.
- Cleaning the Data
python3 clean.py combined.txt
python3 more_clean.py cleaned.txt
- Removing sentences from the corpus that are wrongly chunked.
python3 remove_wrongly_chunked_sentences.py more_cleaned.txt head_mapping.json
- Numbering the sentences
python3 number_sentences.py correct_sentences.txt
- Extracting Heads of Each Chunk
python3 head_finding.py numbered.txt
- Extracting Dependency Relations between the heads
python3 dependencies_finding.py head_sentences.txt
- Applying the arc-eager algorithm to remove the sentences which cannot be parsed. The non-parsable sentences are the ones with projectivity issues (see the sketch after this list).
python3 arc_eager.py dependency_sentences.txt
- Extracting some pairs of chunks that are unrelated.
python3 get_unrelated_dependencies.py parsable_dependency_sentences.txt
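Why does the arc-eager step drop sentences? A dependency tree is non-projective exactly when two of its arcs cross, and the arc-eager transition system cannot derive such trees. Below is a minimal sketch of a crossing-arcs check; the function name and head-list input format are illustrative assumptions, not the repository's actual code.

```python
# Minimal projectivity check: a tree is projective iff no two arcs cross.
# `heads[i]` is the head of token i+1 (tokens are 1-based, 0 is the root);
# this input convention is an assumption for illustration.
def is_projective(heads):
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            # Two arcs cross when exactly one endpoint of the second
            # lies strictly inside the span of the first.
            if l1 < l2 < r1 < r2:
                return False
    return True

print(is_projective([2, 0, 2]))     # True: a simple projective chain
print(is_projective([3, 4, 0, 3]))  # False: arcs (1,3) and (2,4) cross
```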
For Models 13 to 17
- cd into the Morphological Features directory
- Follow steps 2 to 9 above (from combining the training data through extracting the unrelated pairs)
After preprocessing, the corpus size drops from 13K sentences in the training data to 7.5K! The other 5.5K sentences either had insufficient data, were wrongly chunked, or were non-parsable due to projectivity issues.
Preparing the Data for Training
- Rename the final file named parsable_dependency_sentences_with_unrelated_pairs.txt to Training.txt
- Preprocess the testing and development data in the same way and rename the resulting files to Testing.txt and Development.txt
For Models 1 to 12
- cd into the Data folder
- Move all three files (Training.txt, Development.txt and Testing.txt) into the Data folder.
- Run the following four Python scripts:
python3 get_distinct_chunk_tags.py Training.txt Development.txt Testing.txt
python3 get_distinct_pos_tags.py Training.txt Development.txt Testing.txt
python3 get_distinct_words.py Training.txt Development.txt Testing.txt
python3 get_distinct_morph_vectors.py Training.txt Development.txt Testing.txt
These four scripts extract all the distinct chunk tags, POS tags, words and morphological feature vectors, and store them in files in the same folder (a sketch of the idea follows this list).
- Append the data from Development.txt into the Training.txt file
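For reference, here is a minimal sketch of what the get_distinct_* scripts do conceptually: collect every distinct value in one column across the three files and write the sorted result out. The column layout and output filename are assumptions for illustration, not the repository's actual format.

```python
# Sketch of a get_distinct_* script: gather every distinct value in one
# column of whitespace-separated input files. The column index and the
# output filename below are hypothetical.
import sys

def collect_distinct(paths, column):
    values = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.split()
                if len(fields) > column:
                    values.add(fields[column])
    return sorted(values)

if __name__ == "__main__":
    # e.g. python3 get_distinct_pos_tags.py Training.txt Development.txt Testing.txt
    tags = collect_distinct(sys.argv[1:], column=1)  # hypothetical POS column
    with open("distinct_pos_tags.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(tags) + "\n")
```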
For Models 13 to 17
- Move all three files (Training.txt, Development.txt and Testing.txt) into the Morphological Features directory
- cd into the "Data/For Model 13 to 15" directory
- Run the following four Python scripts (the paths are quoted because the directory name contains a space):
python3 get_distinct_chunk_tags.py "../../Morphological Features/Training.txt" "../../Morphological Features/Development.txt" "../../Morphological Features/Testing.txt"
python3 get_distinct_pos_tags.py "../../Morphological Features/Training.txt" "../../Morphological Features/Development.txt" "../../Morphological Features/Testing.txt"
python3 get_distinct_words.py "../../Morphological Features/Training.txt" "../../Morphological Features/Development.txt" "../../Morphological Features/Testing.txt"
python3 get_distinct_morph_vectors.py "../../Morphological Features/Training.txt" "../../Morphological Features/Development.txt" "../../Morphological Features/Testing.txt"
These four scripts extract all the distinct chunk tags, POS tags, words and morphological feature vectors, and store them in files in the same folder.
- Append the data from Development.txt to the Training.txt file, as sketched below
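The append step is a one-liner; a minimal Python version (the shell equivalent is `cat Development.txt >> Training.txt`) is:

```python
# Append the development data to the training data in place;
# filenames follow the steps above.
with open("Development.txt", encoding="utf-8") as dev, \
        open("Training.txt", "a", encoding="utf-8") as train:
    train.write(dev.read())
```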
Running the Models -
- cd into the Data/Models directory
- cd into the directory of the model you want to run
For Models 1 to 12
- Train the Model by running
python3 train.py ../../Training.txt
- Test the Model by running
python3 predict.py ../../Testing.txt
For Models 13 to 17
- Train the Model by running
python3 train.py "../../../Morphological Features/Training.txt"
- Test the Model by running
python3 predict.py "../../../Morphological Features/Testing.txt"
Reading the output -
- The values printed in the first two arrays are the predictions of the SVM and Logistic Regression models respectively. Depending on what the model predicts (the L/R/U relationship or the dependency relation), these values are either arc directions or arc labels.
- The numeric values are the accuracies of the SVM and Logistic Regression models respectively, as sketched below.
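To make that output order concrete, here is a hedged sketch of the train/predict flow it implies, using sklearn and pickle from the Requirements. The feature matrices X and y, the model filename, and the classifier settings are placeholders, not the repository's actual code.

```python
# Sketch: train an SVM and a Logistic Regression model on the same features,
# pickle them, then print predictions and accuracies in the order the
# "Reading the output" section describes. Feature extraction is omitted.
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def train(X, y, model_path="models.pkl"):
    svm = LinearSVC().fit(X, y)
    logreg = LogisticRegression(max_iter=1000).fit(X, y)
    with open(model_path, "wb") as f:
        pickle.dump((svm, logreg), f)

def predict(X_test, y_test, model_path="models.pkl"):
    with open(model_path, "rb") as f:
        svm, logreg = pickle.load(f)
    print(svm.predict(X_test))           # first array: SVM predictions
    print(logreg.predict(X_test))        # second array: Logistic Regression predictions
    print(svm.score(X_test, y_test))     # SVM accuracy
    print(logreg.score(X_test, y_test))  # Logistic Regression accuracy
```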
Source of our Data -
We processed about 15K sentences of gold-labelled dependency data from the Marathi dataset at https://ltrc.iiit.ac.in/downloads/kolhi/.
Github Repository Link
Work Division
The work was divided so that each member contributed to every task.
Tanish (2018114005) - Mostly Coding Part
Mukund (2018114015) - Mostly Theoretical Part
We both feel that we have contributed equally to this project.
Under the Guidance of
Manish Shrivastava, Assistant Professor, IIIT Hyderabad
Pruthwik Mishra, PhD Student, IIIT Hyderabad
Alok Debnath, Introduction to NLP Course TA, IIIT Hyderabad