
An academic project at HIAST university. A semantic search engine written in Java, built with Stanford CoreNLP, Lucene, and WordNet.


nemat-al/Semantic-Search-Engine


Semantic Search Engine

A semantic search engine written in Java, built with Stanford CoreNLP, Lucene, and WordNet.


Index

  1. Introduction
  2. Used Software and Requirements
  3. Semantic Search Engine Steps
  4. Results

Introduction

For a short introduction, please read the introductory report.

The purpose of the project is to develop a semantic search engine for the English language. The search runs over a set of BBC news items stored in an XML file, which can be found here.

To achieve semantic search, semantic linguistic processing can be applied either to the documents or to the query. Because the query is much smaller, the semantic analysis is applied to it.


Used Software and Requirements

The project is implemented in Java. It uses Lucene for document indexing and searching, Stanford CoreNLP for processing both the documents and the query, and WordNet for finding synonyms during the language processing applied to the query.

To run the project on a local machine, download it and open it in a Java IDE, for example Apache NetBeans. As prerequisites, the following software should be downloaded:

  1. Java JDK.
  2. Lucene.
  3. Stanford CoreNLP.
  4. WordNet.

The jar files should be located in the Jar folder.


Semantic Search Engine Steps

The following picture shows the structure of the developed search engine.

[Figure: structure of the developed search engine]

The steps can be summarized under three headings:

  1. Documents processing.
  2. Query processing.
  3. Searching and showing the results.

Documents processing

The first step is to read the data from the XML data file, and then parse it using the defined RssFeedParser class.
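As a rough illustration of this parsing step, here is a minimal sketch that uses only the JDK's built-in DOM parser. It is not the project's RssFeedParser: the class name `RssSketch`, the `NewsItem` type, and the assumed XML layout (RSS-style `<item>` elements with `<title>` and `<description>` children) are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for the project's RssFeedParser (names and XML layout are assumptions).
class RssSketch {

    // One news entry: the two fields that are analyzed and indexed later.
    static final class NewsItem {
        final String title;
        final String description;

        NewsItem(String title, String description) {
            this.title = title;
            this.description = description;
        }
    }

    // Parses RSS-style XML and returns one NewsItem per <item> element.
    static List<NewsItem> parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = dom.getElementsByTagName("item");
            List<NewsItem> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                result.add(new NewsItem(childText(item, "title"),
                                        childText(item, "description")));
            }
            return result;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers get a single unchecked failure.
            throw new IllegalArgumentException("Could not parse feed XML", e);
        }
    }

    // Returns the trimmed text of the first descendant with the given tag, or "" if absent.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<rss><channel>"
                + "<item><title>First headline</title>"
                + "<description>First summary.</description></item>"
                + "<item><title>Second headline</title>"
                + "<description>Second summary.</description></item>"
                + "</channel></rss>";
        for (NewsItem item : parse(xml)) {
            System.out.println(item.title + " | " + item.description);
        }
    }
}
```

Reading only the children of each `<item>` keeps channel-level elements, such as the feed's own title, out of the document set.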

The parsed documents are then analyzed in the defined DocumentAnalyzer class. Using Stanford CoreNLP functions, both the title and the content of each document are split into sentences, and each sentence is split into tokens. Stop words are then removed. Finally, the lemma of each remaining token is found.
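The order of these stages can be sketched in plain Java. This is a dependency-free illustration, not the project's DocumentAnalyzer and not the CoreNLP API: the sentence and token splitting are naive regular expressions, and the stop-word list and lemma table are tiny hand-written assumptions standing in for what CoreNLP provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified analysis pipeline: sentences -> tokens -> stop-word filter -> lemmas.
class AnalysisSketch {

    // Illustrative stop-word list (an assumption; a real list is much longer).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "is", "are", "was", "in", "of", "to", "and", "by"));

    // Illustrative lemma table (an assumption; the project gets lemmas from CoreNLP).
    static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("women", "woman");
        LEMMAS.put("arrested", "arrest");
        LEMMAS.put("says", "say");
    }

    static List<String> analyze(String text) {
        List<String> lemmas = new ArrayList<>();
        // 1. Split into sentences at whitespace that follows ., ! or ?
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            // 2. Split each sentence into lowercase tokens.
            for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
                // 3. Drop empty fragments and stop words.
                if (token.isEmpty() || STOP_WORDS.contains(token)) {
                    continue;
                }
                // 4. Map the token to its lemma, falling back to the token itself.
                lemmas.add(LEMMAS.getOrDefault(token, token));
            }
        }
        return lemmas;
    }

    public static void main(String[] args) {
        System.out.println(analyze("A man is arrested. The women say no!"));
    }
}
```

Removing stop words before the lemma lookup mirrors the order given in the text.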

The last step in this phase is to index and save the processed documents using the defined LuceneWriteIndex class, which uses functions provided by Lucene.
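Lucene stores documents in an inverted index, which maps each term to the documents that contain it. The following in-memory sketch shows that idea only. It is a conceptual stand-in, not Lucene's API and not the project's LuceneWriteIndex; the class name `IndexSketch` is hypothetical, and saving to disk is left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory inverted index over already-analyzed documents.
class IndexSketch {

    // Builds term -> sorted set of document IDs, where the ID is the position in the input list.
    static Map<String, SortedSet<Integer>> build(List<List<String>> analyzedDocs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int id = 0; id < analyzedDocs.size(); id++) {
            for (String term : analyzedDocs.get(id)) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("woman", "short", "record"),
                Arrays.asList("woman", "arrest"));
        System.out.println(build(docs));
    }
}
```

Because the index is keyed by lemma, a query term only needs to be reduced to the same lemma to reach every document that used any inflected form of it.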

Query processing

The first step is to read the query entered by the user. Then, using the defined QueryAnalyzer class, the same linguistic processing applied to the documents is applied to the query. In addition, using CoreNLP functions, the part of speech of each token is determined and synonyms for it are looked up with the defined SynonmsHandler function. Finally, the query to be searched is rebuilt from the lemmas of the tokens in the original query, excluding stop words and names, together with the lemmas of the synonyms of those tokens.
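The query rebuilding described above can be sketched as follows. This is an illustration under several assumptions, not the project's QueryAnalyzer or SynonmsHandler: tokens arrive already lemmatized and tagged with Penn Treebank part-of-speech tags, names are recognized by the `NNP` tag prefix, and a small hand-written map stands in for the WordNet lookup. Skipping synonyms for names is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical query expansion: keep content lemmas and add synonyms chosen by part of speech.
class QueryExpansionSketch {

    // A query token after analysis: its lemma and its Penn Treebank POS tag.
    static final class Token {
        final String lemma;
        final String pos;

        Token(String lemma, String pos) {
            this.lemma = lemma;
            this.pos = pos;
        }
    }

    // Illustrative stop-word list (an assumption).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "is", "of", "in", "to"));

    // Illustrative synonym table keyed by "lemma#pos" (an assumption standing in for WordNet).
    static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("woman#n", Arrays.asList("female"));
        SYNONYMS.put("tell#v", Arrays.asList("say", "state"));
    }

    // Maps a Penn Treebank tag to one of WordNet's four categories, or "" if none applies.
    static String wordNetPos(String tag) {
        if (tag.startsWith("NN")) return "n";
        if (tag.startsWith("VB")) return "v";
        if (tag.startsWith("JJ")) return "a";
        if (tag.startsWith("RB")) return "r";
        return "";
    }

    // Returns the expanded query terms in first-seen order, without duplicates.
    static List<String> expand(List<Token> query) {
        Set<String> terms = new LinkedHashSet<>();
        for (Token token : query) {
            boolean isName = token.pos.startsWith("NNP");
            if (isName || STOP_WORDS.contains(token.lemma)) {
                continue; // names and stop words contribute nothing to the rebuilt query
            }
            terms.add(token.lemma);
            String key = token.lemma + "#" + wordNetPos(token.pos);
            terms.addAll(SYNONYMS.getOrDefault(key, Collections.<String>emptyList()));
        }
        return new ArrayList<>(terms);
    }

    public static void main(String[] args) {
        List<Token> query = Arrays.asList(
                new Token("the", "DT"), new Token("woman", "NN"),
                new Token("Peru", "NNP"), new Token("tell", "VBZ"));
        System.out.println(expand(query));
    }
}
```

Keying the synonym lookup by part of speech avoids pulling in synonyms for a sense the query word does not have, for example the noun sense of a word used as a verb.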

Searching and showing the results

After the user's query is read and processed, the rebuilt query is searched using the functions defined in the LuceneReadIndex class. The search runs over the indexed documents, and the original documents are returned as the results to be shown.
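The point that matching happens on the processed form while the original text is what gets returned can be illustrated with a small sketch. It ranks documents by how many distinct query terms they contain, which is a deliberately simple substitute for Lucene's scoring. The `SearchSketch` and `Doc` names are hypothetical, and this is not the project's LuceneReadIndex.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical search step: match on processed terms, return the original documents.
class SearchSketch {

    // A document keeps its original text alongside the lemmas produced by analysis.
    static final class Doc {
        final String original;
        final Set<String> terms;

        Doc(String original, String... terms) {
            this.original = original;
            this.terms = new HashSet<>(Arrays.asList(terms));
        }
    }

    // Returns original texts of matching documents, best match first; ties keep input order.
    static List<String> search(List<Doc> docs, Collection<String> queryTerms) {
        Set<String> distinctQuery = new HashSet<>(queryTerms);
        List<int[]> scored = new ArrayList<>(); // each entry is {score, document position}
        for (int i = 0; i < docs.size(); i++) {
            int score = 0;
            for (String term : distinctQuery) {
                if (docs.get(i).terms.contains(term)) {
                    score++;
                }
            }
            if (score > 0) {
                scored.add(new int[] {score, i});
            }
        }
        scored.sort((a, b) -> a[0] != b[0]
                ? Integer.compare(b[0], a[0])   // higher score first
                : Integer.compare(a[1], b[1])); // then earlier document first
        List<String> results = new ArrayList<>();
        for (int[] entry : scored) {
            results.add(docs.get(entry[1]).original);
        }
        return results;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("Mayor tells council to wait", "mayor", "tell", "council", "wait"),
                new Doc("Rain expected", "rain", "expect"),
                new Doc("Minister says he will tell all", "minister", "say", "tell"));
        System.out.println(search(docs, Arrays.asList("tell", "say", "state")));
    }
}
```

With an expanded query such as `tell`, `say`, `state`, a document that contains only "says" is still found, even though the user's original word does not appear in it.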


Results

Here are some examples of user queries and the returned results.

| Query | Titles of Returned Documents | Descriptions of Returned Documents |
| --- | --- | --- |
| women | Indian is world's shortest woman | Jyoti Amge, ... world's shortest woman by Guinness World Records. |
| | Woman set alight in New York fit | A man is arrested after a 73-year-old woman is ... |
| | ... | |
| tells | Riot operation 'flawed' say MPs | The operation to police ... a report by the Home Affairs Committee says.. |
| | Peru 'halts' parole woman's trip | A US woman on parole in Peru.... says she was stopped from leaving the country. |
| | ... | |
