COSC480 Project - Quantifying Conceptual Density in Text

This repository contains the source code for my COSC480 project, "Quantifying Conceptual Density in Text".

The following description of the project is taken from the abstract of the report:

"Conceptual density" is a term that describes the degree to which concepts in a domain are integrated, or interdependent. There is a hypothesis that text documents with high conceptual density are harder to process. At present, the concept of “conceptual density” has been informally defined. In this pilot study, I investigate the idea of conceptual density in the context of expository text documents, a graph-based model for quantifying conceptual density and ways to evaluate this model. Finally, I discuss open problems and direction for future work related to conceptual density.

The original aims and objectives can be found here. The main report can be found here.

Getting Started

Set up your Python environment. If you are using conda, you can do this with the command:
```
conda env create -f environment.yml
```
and then activate the environment using:
```
conda activate cosc480
```
Otherwise, make sure you have the packages listed in environment.yml installed.
Run the setup script:
```
python setup.py
```
Install GraphViz if you want to visualise the generated graph structures. If you are not planning to run scripts that use CoreNLP, skip ahead to the last step (running the main script).
Install Java SE 1.8+
Download the Stanford CoreNLP package and extract the contents somewhere. Then set the environment variable CORENLP_HOME to the path where you extracted the files:
```
export CORENLP_HOME=path/to/corenlp
```
This sets the environment variable for the current shell.

To make this environment variable persist across shell sessions, run:
```
echo export CORENLP_HOME=path/to/corenlp >> ~/.bashrc
```
These changes will take effect when you open a new shell.
Start up the CoreNLP server:
```
bash corenlp_server/run.sh
```
To see the help message type:
```
bash corenlp_server/run.sh -h
```
With the default settings, the server is available at localhost:9000, and you can use its web interface to issue queries.
Run the main script to start quantifying conceptual density!
```
python -m qcd docs/bread.xml
```
To see the help message type:
```
python -m qcd --help
```
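
If you started the CoreNLP server, you can also query it from Python instead of the web interface. The following is a minimal sketch, not part of this repository: the function names `build_corenlp_request` and `annotate` are hypothetical, and it assumes the server is running with the default settings at localhost:9000. It uses the CoreNLP server's HTTP interface, which takes the text as the POST body and the annotator settings as a JSON `properties` query parameter.

```python
import json
import urllib.parse
import urllib.request


def build_corenlp_request(text, annotators="tokenize,ssplit,pos",
                          server_url="http://localhost:9000"):
    """Build a POST request for a running CoreNLP server.

    The server URL and annotator list are assumptions; adjust them to
    match how you started the server.
    """
    properties = {"annotators": annotators, "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(properties)})
    return urllib.request.Request(
        f"{server_url}/?{query}",
        data=text.encode("utf-8"),
        method="POST",
    )


def annotate(text, **kwargs):
    """Send text to the server and return the parsed JSON response."""
    request = build_corenlp_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires the CoreNLP server to be running):
# result = annotate("Bread is made from flour and water.")
# for sentence in result["sentences"]:
#     print([token["word"] for token in sentence["tokens"]])
```

Building the request separately from sending it makes it easy to inspect the URL and payload without a running server.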

Annotating Documents

There is a separate GitHub repository for a web application that supports annotating documents and generating the annotated XML documents. This web app can be accessed from here and the GitHub repository can be found here.

Evaluating the Model

The easiest way to evaluate the model of conceptual density on a document is as follows:

Go to the web application and create a document.
Annotate the document.
Download the annotated XML version of the document, either via the web application interface or from the command line.

For example, we can download the annotated XML document from the command line with the following:
```
curl -o annotations.xml https://cosc480-document-annotator.herokuapp.com/api/documents/1/xml
```
The exact URL for a given document can be found through the link on the download button in the web app interface, or by taking the URL in the above example and replacing the document ID with the ID of the document you want. Document IDs can be found in the URL when viewing a document in the web app, or by inspecting the JSON response from the endpoint cosc480-document-annotator.herokuapp.com/api/documents.

Annotations may need to be spread across multiple copies of a document (e.g. when annotations overlap). In this case, you can either evaluate the conceptual density model on each document separately or merge the XML documents with:
```
python -m qcd.merge_xml doc-1.xml doc-2.xml ... doc-n.xml -output-path annotations.xml
```
and evaluate the model using the merged XML document.
Run the evaluation script on a given document:
```
python -m qcd.evaluate annotations.xml
```
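
The download and evaluation steps above can also be scripted. The sketch below is illustrative only and is not part of this repository: the function names and the `annotations-<id>.xml` file naming are hypothetical, while the URL pattern and the `qcd.evaluate` module come from the commands shown above. It assumes you run it from the repository root with the project environment active, and it evaluates each document separately rather than merging them.

```python
import subprocess
import sys
import urllib.request

# Base URL taken from the curl example above.
API_BASE = "https://cosc480-document-annotator.herokuapp.com/api/documents"


def document_xml_url(document_id):
    """Return the URL of the annotated XML for a given document ID."""
    return f"{API_BASE}/{document_id}/xml"


def evaluate_command(xml_path):
    """Return the command that runs the evaluation script on one file."""
    return [sys.executable, "-m", "qcd.evaluate", str(xml_path)]


def download_and_evaluate(document_ids):
    """Download each annotated document and evaluate it separately."""
    for document_id in document_ids:
        # Hypothetical output file name, one file per document.
        xml_path = f"annotations-{document_id}.xml"
        urllib.request.urlretrieve(document_xml_url(document_id), xml_path)
        subprocess.run(evaluate_command(xml_path), check=True)


# Example usage (requires network access and the qcd package):
# download_and_evaluate([1])
```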
