Serelex

Name

Serelex - a tool for semantic relation extraction.

Synopsis

serelex [options]

Description

A tool for semantic relation extraction. The program finds pairs of semantically related words based on the text definitions coming from the Wikipedia articles (other texts may be also used). The extraction method implemented in this system is based on three similarity measures (cosine, gloss overlap, and Karaulov's measure) between texts and two nearest-neighbor algorithms (KNN and Mutual KNN). The tool is a cross-platform console application.

Licensed under LGPLv3.

Options

-c file
Concepts file, default concepts.csv. A text file containing a set of input words (one word per line). The program will try to find semantic relations between these input words. For instance, if words 'crocodile, alligator, house, and building' were given as input the program will try to return pairs 'crocodile,alligator' and 'house,building' among all possible combinations.

-d file
Definitions file, default definitions.csv. A text file containing a set of definitions for words specified in the concepts.csv file. If this file contains definitions for more words that given in concepts.csv the program will skip the words which do not appear in the concepts.csv. The sample-data directory contains files with "definitions" derived from the introduction of Wikipedia articles. Other texts may be used as definitions (traditional dictionary glosses, WordNet, etc.) if provided in the same format.

-s file
Stop-words file, default stoplist.csv. A text file containing stop words (one word per line). Words from this list will not be used by a similarity measure. All occurrences of these words in definitons.csv will be removed.

-o file
Output file, default result.csv. A text file containing set of found semantic relations between words specified in the file concepts.csv. Each line of this file contains a pair of semantically related words, according to the specified method.

-S o|c|k
Similarity method, default o. Semantic similarity measure used for relation extraction.
o - Gloss overlap measure equal to number of common words in the definitions of two words.
c - Cosine between bag-of-word vectors build from definitions of respective of two words
k - Karaulov's semantic similarity measure.

-M 1|2
Component analysis method, default knn. An algorithm used to derive semantic relations from pairwise similarity scores between the words.
1 - knn. Standard nearest-neighbor algorithm (KNN). Here K most similar words are related to a target word.
2 - mknn. Mutual nearest-neighbor algorithm (MKNN). Here K mutually most similar words are related to a target word.

-K NUM
Number of nearest-neighbors, default K = 2.

-T T1 T2 T3
T1, T2, T3 - parameters of Karaulov's semantic similarity measure, default T1 = 2, T1 = 1, T3 = 6.

Files and Catalogs

bin - contains binary excutable file:

serelex_win32.exe - excutable file for 32-bit Windows
serelex_win64.exe - excutable file for 64-bit Windows
serelex_i686 - executable file for i686 Debian-based systems
serelex_amd64 - executable file for amd64 Debian-based systems

docs - provides documentation in html format

sample-data - contains sample source data:

concepts.csv - a text file containing a set of input words (one word per line)
definitions.csv - a text file containing a set of definitions for words specified in the concepts.csv file
stoplist.csv - a text file containing stop words (one word per line)

src - the source code

windows - provides project to Microsoft Visual Studio 2010

test.sh and test.bat - run analysis for algorithm KNN, MutalKNN, with overlap and cosinus method and K = 1, 2, 5, 10

Build

To build under Windows, use MS Visual Studio 2010 project in windows folder. To select configuration - x32 or x64 - use select list on tools panel.
To build under *nix use command make. To build on amd64 use command make ARCH=-m64

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Serelex

Name

Synopsis

Description

Options

Files and Catalogs

Build

About

Releases

Packages

Contributors 3

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
bin		bin
docs		docs
preprocessing		preprocessing
sample-data		sample-data
src		src
windows		windows
.gitignore		.gitignore
README.md		README.md
makefile		makefile
test.bat		test.bat
test.sh		test.sh

jgc128/DefVectors

Folders and files

Latest commit

History

Repository files navigation

Serelex

Name

Synopsis

Description

Options

Files and Catalogs

Build

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages