This repository implements an Automatic Short Answer Grading (ASAG) system on the Mohler (Texas) dataset.
The data/ directory contains two files in CSV format:

- Data.csv: 2,442 student answers to around 87 computer science questions, collected over 10 assignments and 2 tests. Scores are continuous on a 0-5 scale, and every answer is graded by two human evaluators; we take the average of the two grades as the gold-standard score.
- QA1.csv: only the questions and the model answers, so it contains fewer entries.
Download the Baroni embeddings here and extract them into the ./data/ directory. Afterwards, the ./data/ directory should look like:
```
data/
    Data.csv
    QA1.csv
    EN-wform.w.5.cbow.neg10.400.subsmpl.txt
```
Packages used in this project can be installed with the following command:
```
pip install -r requirements.txt
```
- Normalize the text: convert to lowercase, remove non-alphanumeric characters, tokenize, and filter out stopwords.
- Build a pool of correct words by parsing the collection of texts, excluding words shorter than two characters.
- Identify potentially incorrect words by checking them against WordNet and a spellchecker.
- Map potentially incorrect words to candidate corrections:
  - Use Levenshtein and fuzzy matching ratios to find candidate matches.
  - Refine the mapping by spellchecking, word splitting, and checking against the word pools.
- Replace the identified incorrect words in the DataFrame column with their corrections from the generated dictionary (a minimal sketch of this pipeline follows the list).
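As an illustration, here is a minimal sketch of the normalization and correction-mapping steps, assuming NLTK, pyspellchecker, and rapidfuzz as the underlying libraries; the helper names (`normalize`, `build_correction_map`) and the ratio threshold are illustrative, not the project's actual implementation.

```python
import re
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from spellchecker import SpellChecker   # pyspellchecker
from rapidfuzz import fuzz              # Levenshtein-based fuzzy ratios

STOP = set(stopwords.words("english"))
spell = SpellChecker()

def normalize(text):
    """Lowercase, strip non-alphanumerics, tokenize, and drop stopwords."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [tok for tok in word_tokenize(text) if tok not in STOP]

def build_correction_map(tokens, word_pool):
    """Map suspect tokens to the closest candidate in the correct-word pool."""
    mapping = {}
    for tok in tokens:
        # A token is suspect only if both WordNet and the spellchecker reject it.
        if wordnet.synsets(tok) or spell.known([tok]):
            continue
        best = max(word_pool, key=lambda w: fuzz.ratio(tok, w), default=None)
        if best is not None and fuzz.ratio(tok, best) >= 80:  # assumed threshold
            mapping[tok] = best
    return mapping
```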
- Generate the Sum of Word Embeddings (SOWE) for every answer in the dataset; these sums are what the cosine-similarity computation operates on.
- The embeddings are taken from the Baroni word vectors (see the sketch below).
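A minimal sketch of how the SOWE vectors might be computed from the Baroni file, assuming one word followed by its 400 vector components per line; the helper names are illustrative.

```python
import numpy as np

def load_embeddings(path, dim=400):
    """Load the Baroni vectors from the plain-text file into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) == dim + 1:
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def sowe(tokens, vectors, dim=400):
    """Sum of Word Embeddings: element-wise sum of the tokens' vectors."""
    vecs = [vectors[tok] for tok in tokens if tok in vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)
```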
- We generate features for each student answer by comparing it against the reference answer: cosine similarity, alignment score, length ratio, Euclidean distance, and fuzzy features. The detailed feature-extraction code is in feature_extraction.py; a simplified sketch follows this list.
- We use these features to train the regression models.
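Building on the sowe helper sketched above, a simplified version of such a feature vector might look like the following; the alignment score is computed in feature_extraction.py and is omitted here.

```python
import numpy as np
from rapidfuzz import fuzz

def answer_features(ref_tokens, ans_tokens, vectors):
    """Compare a student answer to the reference answer (sketch)."""
    ref_vec, ans_vec = sowe(ref_tokens, vectors), sowe(ans_tokens, vectors)
    denom = np.linalg.norm(ref_vec) * np.linalg.norm(ans_vec)
    return [
        float(ref_vec @ ans_vec) / denom if denom else 0.0,   # cosine similarity
        len(ans_tokens) / max(len(ref_tokens), 1),            # length ratio
        float(np.linalg.norm(ref_vec - ans_vec)),             # Euclidean distance
        fuzz.token_set_ratio(" ".join(ref_tokens),
                             " ".join(ans_tokens)) / 100.0,   # fuzzy feature
    ]
```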
- We split the Mohler data into 75% training and 25% testing data.
- We train three regression models on the training data: a Random Forest Regressor, Ridge regression, and a neural network.
- We use the trained models to predict the grades of the test data and generate the results (a sketch of this step follows).
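A minimal sketch of the split-train-predict loop with scikit-learn; the MLPRegressor stands in for the neural network, the placeholder data replaces the real feature matrix, and the hyperparameters are illustrative defaults rather than the project's actual settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))       # placeholder for the extracted features
y = rng.uniform(0, 5, size=200)     # placeholder for the averaged human grades

# 75%-25% train/test split, as in the project.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "ridge": Ridge(alpha=1.0),
    "neural_net": MLPRegressor(hidden_layer_sizes=(64,), max_iter=500),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: test MSE = {mean_squared_error(y_test, preds):.3f}")
```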
To train the models, first download the Baroni embeddings (see above) so that they can be used for feature extraction. Then run:

```
bash main.sh
```
[1] Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors (Baroni et al., ACL 2014)