Documentation as markdown in main repo

Mandalka committed Dec 12, 2021
1 parent 81de12d commit c47a988
Showing 169 changed files with 7,693 additions and 0 deletions.
155 changes: 155 additions & 0 deletions docs/README.md
---
title: Open-Source Search Engine with Apache Lucene / Solr
authors:
- Markus Mandalka
---

# Open-Source Search Engine with Apache Lucene / Solr



## Integrated research tools for easier searching, monitoring, analytics, discovery & text mining of heterogeneous & large document sets & news with free software on your own server



### Search engine (Fulltext search)


[Easy full text search](../doc/search) in multiple data sources and many different file formats: Just enter a search query (which can include [powerful search operators](../doc/search/operators)) and navigate through the results.




### Thesaurus & Grammar (Semantic search)


Based on a [thesaurus](../doc/datamanagement/thesaurus), the multilingual semantic search engine also finds [synonyms, hyponyms and aliases](../doc/search/fuzzy#synonyms). Using heuristics for [grammar rules like stemming](../doc/search/fuzzy#stemming), it finds other word forms, too.
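
As a toy illustration of the stemming idea only (not the engine's actual stemmer, which is provided by Solr's language analyzers), a naive suffix stripper maps different word forms to a common stem:

```python
# Toy suffix-stripping stemmer: illustration only, not Solr's real analyzer.
def naive_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        # keep at least 3 characters of stem to avoid over-stripping
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# All of these word forms reduce to the same stem:
print([naive_stem(w) for w in ("search", "searching", "searches")])
# → ['search', 'search', 'search']
```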




### Interactive filters (Faceted search)



Navigate easily through many results with [interactive filters](../doc/search#faceted_search) (faceted search), which aggregate an overview of, and interactive filters for, (meta)data like authors, organizations, persons, places, dates, products, tags or document types.
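
Conceptually, faceted search aggregates the values of a metadata field over all matching documents into counted filters; a minimal sketch of that aggregation idea (in practice Solr computes the facets, and the field names below are invented):

```python
from collections import Counter

def facet_counts(docs, field):
    """Count how often each value of a metadata field occurs in the result set."""
    return Counter(value for doc in docs for value in doc.get(field, []))

# Invented example documents with multi-valued metadata fields
results = [
    {"author": ["Alice"], "tag": ["budget"]},
    {"author": ["Alice", "Bob"], "tag": ["contract", "budget"]},
]
print(facet_counts(results, "author"))  # → Counter({'Alice': 2, 'Bob': 1})
```

Clicking a facet value in the user interface then simply narrows the query to documents carrying that value.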





### Exploration, browsing & preview (Exploratory search)



Explore your data or search results with an [overview of aggregated search results](../doc/search#faceted_search) by different facets with [named entities (e.g. file paths, tags, persons, locations, organizations or products)](../doc/datamanagement/thesaurus), while browsing comfortably through search results or document sets.
View previews (e.g. PDF, extracted text, table rows or images).
Analyze or review document sets by preview, extracted text or [wordlists for text mining](../doc/analytics/textmining).





### Collaborative annotation & tagging (Social search & collaborative filtering)



[Tag your documents with keywords, categories, names or text notes](../doc/datamanagement/annotation "Tagging and annotation") that are not included in the original content, so you can find them more easily later (document management & knowledge management) or in other research or search contexts, and filter annotated or tagged documents with interactive filters (faceted search).

Or evaluate, rate or assess documents and filter them by rating (e.g. for validation or collaborative filtering).





### Data visualization (Dataviz)



Visualize data like document dates as [trend charts](../doc/analyze/trend), [text analysis](../doc/analyze/textmining) results for example as [word clouds](../doc/analyze/words), [connections and networks in a visual graph view](../doc/analytics/graph), or results with [geodata on interactive maps](../doc/analytics/map).





### Monitoring: Alerts & Watchlists (Newsfeeds)



Stay informed via watchlists for news alerts from media monitoring or activity streams of new or changed documents on file shares: subscribe to searches and filters as RSS newsfeeds and get notified when there are new or changed documents, news or search results for your keywords, search context or filter.
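
A subscribed search feed can then be polled by any RSS client; a minimal sketch of parsing such a feed with the Python standard library (the sample feed content below is invented for illustration, not the software's actual feed output):

```python
import xml.etree.ElementTree as ET

# Invented sample of an RSS 2.0 feed for a saved search
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Search: budget</title>
<item><title>New matching document</title><link>http://example.org/doc1</link></item>
</channel></rss>"""

def feed_items(rss_xml):
    """Return (title, link) pairs for every item in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

print(feed_items(SAMPLE_FEED))
# → [('New matching document', 'http://example.org/doc1')]
```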







### Supports different file formats


No matter whether [structured data like databases, tables or spreadsheets](../doc/search/table) or [unstructured data like text documents](../doc/analytics/textmining), e-mails or even scanned legacy documents: search in many different formats and content types (text files, Word and other Microsoft Office or OpenOffice documents, Excel or LibreOffice Calc tables, PDF, e-mail, CSV, images, photos and pictures like JPG or TIFF, videos and [many other file formats](http://tika.apache.org/1.13/formats.html)).





### Supports multiple data sources


Find all your data in one place: search in many different [data sources](../doc/admin/connectors) like [files and folders, file servers, file shares](../connector/files), [databases](../connector/db), websites, content management systems, [RSS feeds](../doc/datamanagement/rss) and many more.

The connectors and importers of the [Extract Transform Load (ETL) framework for data integration](../etl) connect and combine multiple data sources, and as an integrated [document analysis and data enrichment](../doc/data_enrichment) framework it enhances the data with the results of diverse analytics tools.
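
The enrichment idea can be sketched as a chain of plugins that each add fields to a `data` dictionary before it is written to the index (plugin and field names here are invented for illustration; the real framework ships its own plugins):

```python
# Sketch of the data-enrichment idea; plugin and field names are invented.
def enrich_filetype(data):
    # Derive a filterable facet from the file name
    data["filetype_s"] = data.get("id", "").rsplit(".", 1)[-1].lower()
    return data

def enrich_wordcount(data):
    data["words_i"] = len(data.get("content", "").split())
    return data

PLUGINS = [enrich_filetype, enrich_wordcount]

def process(data):
    """Run every enrichment plugin; the result would then be written to the index."""
    for plugin in PLUGINS:
        data = plugin(data)
    return data

print(process({"id": "report.PDF", "content": "quarterly budget report"}))
```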





### Automatic text recognition



[Optical character recognition (OCR), i.e. automatic text recognition for images](../doc/admin/config/ocr), recognizes text content stored in graphical formats like scanned legacy documents, screenshots or photographed documents, whether as image files or embedded in PDF files.







---



## Open-Source enterprise search and information retrieval technology based on interoperable open standards



### Mobile (Responsive Design)



Open Semantic Search can be used not only with every desktop (Linux, Windows or Mac) web browser. With its [responsive design](http://foundation.zurb.com "Powered by Zurb Foundation") and open standards like HTML5, you can also search with tablets, smartphones and other mobile devices.





### Metadata management (RDF)


Structure your research, investigations, navigation, document sets, collections, metadata forms or notes in a Semantic Wiki, Drupal or another content management system (CMS), or with an innovative annotation framework with taxonomies and custom fields for tagging documents, annotating, linking relationships, mapping and structured notes. This way you integrate powerful and flexible metadata management or annotation tools using interoperable open standards like the Resource Description Framework (RDF) and the Simple Knowledge Organization System ([SKOS](https://www.w3.org/TR/skos-primer)).





### Filesystem monitoring



Using [file monitoring](../trigger/filemonitoring), new or changed files are indexed within seconds, without frequent recrawls (which are often not feasible with many files).
Colleagues can find new data immediately, without (often forgotten) uploads to a document management system (DMS) or filling out a registration form for each new or changed document or dataset in a data management system, data registry or digital asset management (DAM) system.




43 changes: 43 additions & 0 deletions docs/connector/db/README.md
---
title: Full text search, faceted search & textmining in a database (SQL DB)
authors:
- Markus Mandalka
---

# Full text search, faceted search & textmining in a database (SQL DB)


## Connectors for Relational Database Management Systems (RDBMS) & Structured Query Language (SQL) to Solr or Elastic Search


To use the simple and powerful [search user interfaces (UI)](../../doc/search) for [full text search](../../doc/search), faceted search, semantic search and ontology- or thesaurus-based [text mining](../../doc/analytics/textmining) research tools on structured data from the fields of a SQL database, import its tables into the search index:

## Index SQL databases like MySQL or PostgreSQL


There are multiple ways and open-source connectors to import SQL databases like MySQL, MariaDB, PostgreSQL and other databases based on Structured Query Language (SQL) or supported via Open Database Connectivity (ODBC) or Java Database Connectivity (JDBC) into an Apache Solr index:

## Import SQL database to Solr search index


Free software for indexing SQL to Apache Solr:
* [Apache ManifoldCF JDBC Connector](http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#jdbcrepository)
* [Apache NiFi](https://nifi.apache.org/) - Read from `ExecuteSQL` and write to the Solr index
* Set up the built-in [RDBMS data import handler](http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS) for your database
* Export your database to a CSV format and use the built-in CSV import of Solr
* Write a short Python script for database import based on [Open Semantic ETL](../../etl) with a concrete SQL query, write the columns to the `data` variable (a Python dictionary, i.e. `data['columnname'] = columnvalue`) and export/write it to the index by calling `etl.process(data=data)`
* Other [Extract Transform Load (ETL) frameworks](../../etl) like Talend Open Studio
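
The Python-script option above can be sketched as follows, using an in-memory SQLite database as a stand-in for MySQL/PostgreSQL; the `etl.process(data=data)` call is left as a comment because the import path of Open Semantic ETL depends on your installation:

```python
import sqlite3

def rows_to_data(cursor):
    """Yield one ETL `data` dictionary per result row of a SQL query."""
    columns = [col[0] for col in cursor.description]
    for row in cursor:
        yield dict(zip(columns, row))

# Stand-in database; with MySQL/PostgreSQL use the matching DB-API driver instead
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE articles (id INTEGER, title TEXT)")
con.execute("INSERT INTO articles VALUES (1, 'Hello Solr')")

for data in rows_to_data(con.execute("SELECT id, title FROM articles")):
    # With Open Semantic ETL, each row would be written to the index here:
    # etl.process(data=data)
    print(data)  # → {'id': 1, 'title': 'Hello Solr'}
```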


## Import SQL database to Elastic Search index


Free software for indexing SQL to Elastic Search:
* [Apache ManifoldCF JDBC Connector](http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#jdbcrepository)
* [Apache NiFi](https://nifi.apache.org/) - Read from `ExecuteSQL` and write to the Elastic Search index
* [Elasticsearch JDBC importer](https://github.com/jprante/elasticsearch-jdbc)
* Export your database to a CSV format and import the CSV file, for example with Logstash
* Write a short Python script for database import based on [Open Semantic ETL](../../etl) with a concrete SQL query, write the columns to the `data` variable (a Python dictionary, i.e. `data['columnname'] = columnvalue`) and export/write it to the index by calling `etl.process(data=data)`
* Other [Extract Transform Load (ETL) frameworks](../../etl) like Talend Open Studio


12 changes: 12 additions & 0 deletions docs/connector/email/README.md
---
title: Index or import E-Mails for search with Solr or Elastic Search
authors:
- Markus Mandalka
---

# Index or import E-Mails for search with Solr or Elastic Search


Apache Solr or Elastic Search importer for E-Mails from IMAP or POP3 email servers

[Learn more](http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#emailrepository) ...
58 changes: 58 additions & 0 deletions docs/connector/files/README.md
---
title: Crawl and index files, file folders or file servers
authors:
- Markus Mandalka
---

# Crawl and index files, file folders or file servers


## How to index files like Word documents, PDF files and whole document folders to Apache Solr or Elastic Search?



This connector and its command line tools crawl directories and files from your filesystem and index them to [Apache Solr](../../etl/export/solr) or [Elastic Search](../../etl/elasticsearch) for [full text search](../../doc/search) and [text mining](../../doc/analytics/textmining).

If you use Linux, this means you can crawl whatever is mountable on Linux into an Apache Solr or Elastic Search index or into a triplestore.

## Index different file system types to Solr or Elastic Search


This can be a hard disk or partitions formatted with *fat*, *ext3* or *ext4*, a file server connected via *ntfs*, file shares like *smb*, or even *sshfs* or *sftp* on servers, private file sharing services like Seafile or OwnCloud on your own servers, or Dropbox, Amazon or other storage services in the cloud.

## Data enrichment by different data analytic tools


This connector integrates enhanced data enrichment and data analysis plugins like automatic text recognition (OCR) for images and photos (i.e. files like PNG, JPG, GIF ...) or inside PDFs (e.g. scanned documents) using Tesseract OCR.


## Usage



Index a file or directory:

### Web admin interface



Using the web admin interface:
* Open the page *Files*
* Enter the *filename* into the form
* Press the button "crawl"


### Command line


Using the command line interface (CLI):

`opensemanticsearch-index-file *filename*`

### API


Using the REST-API:

`http://127.0.0.1/search-apps/api/index-file?uri=*/home/opensemanticsearch/readme.txt*`
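
From a script, the same REST call can be built and sent with the Python standard library; a small sketch (host and path taken from the URL above; the `urlopen` call is left commented so the snippet runs without a server):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def index_file_url(path, host="127.0.0.1"):
    """Build the index-file API URL for a given file path."""
    return "http://%s/search-apps/api/index-file?%s" % (host, urlencode({"uri": path}))

url = index_file_url("/home/opensemanticsearch/readme.txt")
print(url)
# Uncomment to actually trigger indexing on your server:
# urlopen(url).read()
```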
## Config


Config file for indexing files: `*/etc/opensemanticsearch/connector-files*`
59 changes: 59 additions & 0 deletions docs/connector/files/pdf/README.md
---
title: Index PDF files for search and text mining with Solr or Elastic Search
authors:
- Markus Mandalka
---

# Index PDF files for search and text mining with Solr or Elastic Search


## How to index a PDF file or many PDF documents for full text search and text mining



You can [search](../../../doc/search) and do [text mining](../../../doc/analytics/textmining) with the content of many PDF documents, since the content of PDF files is extracted and text in images is recognized automatically by optical character recognition (OCR).

## Indexing a PDF file to Solr or Elastic Search


To do so, index the PDF documents, or the file directories or file shares that contain them, to the Solr or Elastic Search server index:

### Desktop search


If you use [Open Semantic Desktop Search](../../../doc/desktop_search), just copy the PDF files to a directory that is indexed automatically, or add the directory with the PDF files to the shared folders for indexing and restart the virtual machine or press the "Index" button within the VM.

### File monitoring


If you use a file share where [file monitoring](../../../trigger/filemonitoring) is active, just copy the PDF files to a monitored folder or file directory and wait until they are indexed automatically.

### Web admin interface



Using the web admin interface:
* Open the page *Files*
* Enter the *filename* of the PDF file into the form
* Press the button "crawl"


### Command line


Using the command line interface (CLI):

`opensemanticsearch-index-file *filename*`

### API


Using the REST-API within your tools, scripts or within your browser:

`http://127.0.0.1/search-apps/api/index-file?uri=*/home/opensemanticsearch/document.pdf*`

## Indexing a folder with PDF files to Solr or Elastic Search


You can index whole folders with PDF documents to Apache Solr or Elastic Search in the same way. Just use the name of the file directory or folder instead of a single file name.

## Config


Config file for indexing files: `*/etc/opensemanticsearch/connector-files*`
53 changes: 53 additions & 0 deletions docs/connector/hypothesis/README.md
---
title: Import Hypothesis web annotations, tags & documents to Solr and Elastic Search
authors:
- Markus Mandalka
---

# Import Hypothesis web annotations, tags & documents to Solr and Elastic Search


## Integration of Hypothesis with Solr & Elastic Search


The [integrated open source visual annotation tool](../../doc/datamanagement/annotation/hypothesis) [Hypothesis](https://hypothes.is) provides a powerful visual user interface for (collaborative) web annotation and tagging by human editors, teams and groups. It supports not only tagging documents and adding page notes, but also annotating documents and web pages within the text, even for single words, names, parts of sentences, sentences or paragraphs.


## Import annotations and tags from Hypothesis for semantic search, faceted search, analytics & textmining



In the **web user interface** for **configuration of data sources**, in the **tab Hypothesis**, you can easily **set up the import / indexing of annotations, tags and annotated documents from the Hypothesis API** to the Solr or Elasticsearch search index, enriching the indexed documents with Hypothesis annotations and tags:

![](../../screenshots/hypothesis_import.png)

If you want to import not only public annotations, you must set up your private API token, which you can find in the settings menu "developer" when logged in to hypothes.is:

![](../../screenshots/hypothesis_config.png)
## Filters


In the tab filter(s) you can set from which user or group annotations should be imported. If you are using the public Hypothesis web service, you must set such a filter to prevent importing all public annotations from all users of the web service.

## Full text search in document content with annotations and text


After setting up Hypothesis as a data source and the first run of the import, you can do a full text search on the full document content combined/enriched with your notes and tags.

## Faceted search and interactive filters by Hypothesis tags


Additionally, you get interactive filters for faceted search, text analytics and text mining based on your hypothes.is tags.

## Automatic named entity recognition / named entity extraction for tagged or annotated documents and web pages


Additionally, automatic analysis like named entity recognition (automatic extraction of persons, organizations and places by machine learning) is done automatically for documents and web pages you tagged with Hypothesis.

## Limitations



Since Hypothesis doesn't (yet?) support export to standards like RDF, there is a dedicated plugin for the Hypothesis API. The plugin is in early development, so it still has some limitations:

Until API paging is implemented within the next weeks, the plugin imports only the last 200 annotations per user or group. So you have to set the delta time to a period in which your last annotations are within this limit.
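
For reference, fetching the latest annotations of one user directly from the Hypothesis search API can be sketched as follows (endpoint and parameter names as documented by hypothes.is at the time of writing; verify against the current API docs before relying on them):

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SEARCH_API = "https://api.hypothes.is/api/search"

def build_search_request(user, api_token=None, limit=200):
    """Build an API request; 200 matches the plugin's current per-run limit."""
    url = "%s?%s" % (SEARCH_API, urlencode({"user": user, "limit": limit}))
    headers = {}
    if api_token:  # the private API token is needed to see private annotations
        headers["Authorization"] = "Bearer %s" % api_token
    return Request(url, headers=headers)

# Uncomment to fetch annotations (network access and a valid account required):
# rows = json.load(urlopen(build_search_request("acct:alice@hypothes.is")))["rows"]
```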