Termit #1

Closed
wants to merge 81 commits into from
81 commits
fa5390d
Service refactoring, bump Spring to v2.4.4
psiotwo Jul 23, 2021
84bbc74
Libraries update, Java to 11.
psiotwo Jul 23, 2021
1cf0067
Test refactoring
psiotwo Jul 23, 2021
4d65e32
Useless dependency removal
psiotwo Jul 23, 2021
122f736
Fix java version in pom
psiotwo Jul 23, 2021
d70d9a3
Dockerization + Lombok + Spring boot configuration/YAML
psiotwo Jul 23, 2021
01ab633
Extract morphodita configuration
psiotwo Jul 25, 2021
9b76ce7
Splitting into modules; Maven -> Gradle
psiotwo Jul 25, 2021
4aa1586
refacting multimodule build setup
psiotwo Jul 25, 2021
c7671b7
Remove XMLAnnotationParser
psiotwo Jul 27, 2021
6fd892b
Setup github action
psiotwo Jul 27, 2021
91adc5b
Encoding with utf-8 charset.
psiotwo Jul 27, 2021
8c60a42
Encoding with utf-8 charset fix.
psiotwo Jul 27, 2021
a89cafa
removing termit-irrelevant services, modularization of the lemmatizer…
psiotwo Jul 27, 2021
fb6f135
refactoring mocks
psiotwo Jul 29, 2021
4b8be02
refactor project, upgrade gradle to 7, spring to 5,
psiotwo Jul 30, 2021
e87b1d3
[upd] Update the SPARQL query with language filter
Jun 14, 2019
76cee2f
[upd] Check for negation in a separate method
Jun 14, 2019
98076c4
[upd] Allow processing for a specific language (English model support)
Jun 18, 2019
55e7c45
[add] Add English language model file
Jun 18, 2019
2d223e7
[add] Add English stop-word list
Jul 2, 2019
f31ce1d
[upd] Check for stopwords in annotations with minor refactoring
Jul 4, 2019
d56a45b
[upd] Update the SPARQL query with language filter
Jun 14, 2019
449dc21
[upd] Allow processing for a specific language (English model support)
Jun 18, 2019
bb385ed
[upd] Check for negation in a separate method
Jun 14, 2019
cbaa003
[add] add stopwords for English language analysis
Jul 4, 2019
14974fe
[add] create test for stopwords
Jul 4, 2019
64b8b43
[Upd] Improve code quality and efficiency a little bit. Update .gitig…
ledsoft Nov 27, 2019
1de7fd6
[Fix] Ensure MorphoDita analyzes ontology labels correctly.
Nov 29, 2019
b24d2ae
[Fix] Fix system default URI encoding issue.
Feb 5, 2020
88483fc
[#1218] Adjust the query to use PrefLabel instead of label to stay al…
May 4, 2020
4d47cb6
[#1226] Ensure creating unique ids for nodes with multiple analysis.
May 7, 2020
9d50a47
[Fix] Fix issues caused by update term namespace in the data.
May 31, 2020
9f0510a
[#Fix] Fix the IRI of terms also in the annotate method.
Jun 1, 2020
3da29a5
[#1249] Fix spacing problems in the annotator.
Jul 10, 2020
7046992
[add] XML transformation of Annotace output
May 21, 2019
e31d121
[#Fix] Save output to a new file instead of loading one.
Jul 17, 2020
f75b5d3
[fix] Specify input language for morphological analysis.
Jul 17, 2020
5b39a30
[fix] Expect various lengths of tags for different input languages / …
Jul 17, 2020
e88f6f4
[fix] Reflect the recent changes in data and retrieve only terms with…
Jul 24, 2020
e11152e
[upd] Use the correct tokenizer corresponding to the input language.
Saeedla Sep 2, 2020
382eb05
[#1336] Allow to retrieve all the labels in the vocabulary with their…
Saeedla Sep 7, 2020
a618379
[#1336] Allow to retrieve all the labels in the vocabulary with their…
Saeedla Sep 7, 2020
27915c4
[#1336] Implement a new scoring strategy to sort the term candidates.
Saeedla Sep 7, 2020
0579d5c
[#1336] Prioritize prefLabel to ensure its retrieval when exact match.
Saeedla Sep 9, 2020
610849f
[#upd] Improve the English stop-words list.
Saeedla Sep 9, 2020
c1a511e
[#1352] Configure keyword extractor to be called on demand.
Saeedla Sep 15, 2020
e0faecf
[#926] Support authenticated access to the repository.
Oct 27, 2020
dada6d9
[Upd] Compare on token level as well as lemmas.
Nov 20, 2020
5fcf25b
[Fix] Split string on / separator.
Nov 20, 2020
86d05db
[Upd] Update English stop-word list.
Nov 20, 2020
2a4f180
merging KBSS master, making SparkLemmatizer working
psiotwo Aug 1, 2021
07e23d0
preloading Spark lemmatizers.
psiotwo Aug 1, 2021
e11922c
making english lemmatizer working
psiotwo Aug 3, 2021
10fdbb4
Code cleanup, fix Spark configuration
psiotwo Aug 6, 2021
edf5f41
Setup docker build of annotace with spark lemmatizer
psiotwo Aug 31, 2021
03cbcd9
Fix stopwords resolution in JAR files.
psiotwo Aug 31, 2021
d600c7d
[R-#1613] backquotes recognition during tokenization in Spark.
psiotwo Oct 18, 2021
08920bd
[R-#1613] support backquotes in Spark lemmatizer
psiotwo Oct 18, 2021
db38548
[R-#1613] dependencies update
psiotwo Oct 18, 2021
60f0b4a
Connecting to the ontology repo with/without the credentials.
psiotwo Oct 22, 2021
0ed5f00
Fix reading model.
psiotwo Oct 22, 2021
301aa59
Fix reading model.
psiotwo Oct 22, 2021
c5a252e
[#3] idempotent annotate method
psiotwo Oct 26, 2021
4124464
Merge pull request #4 from kbss-cvut/3-idempotent
psiotwo Oct 26, 2021
475d8bd
[#3] adding one more test
psiotwo Oct 26, 2021
643c4d6
Merge pull request #5 from kbss-cvut/3-idempotent
psiotwo Oct 26, 2021
b9bc7d3
[#7] Instructions to run annotace.
psiotwo Jun 10, 2022
1171872
[Bug #6] Update dependencies, including Apache Spark.
ledsoft Apr 12, 2023
185c0e5
[Fix] Fix build issues (tests).
ledsoft Apr 12, 2023
76bef03
[Fix] Fix deprecated Gradle feature warning.
ledsoft Apr 12, 2023
1ede3cb
[Ref] Unify single and double quote usage in build.gradle files.
ledsoft Apr 13, 2023
aed5014
[Upd] Bump Gradle version to 8.1 and use a version catalog to manage …
ledsoft Apr 13, 2023
93ebb42
[Upd] Update Docker images to use Gradle 8.0.2 and Eclipse Temurin al…
ledsoft Apr 13, 2023
bb0ec58
[Doc #7] Improve info on running annotace in readme.
ledsoft Apr 13, 2023
561b01d
Merge pull request #10 from kbss-cvut/fix/spark-scala-version
ledsoft Apr 13, 2023
7dfe72b
Minor code improvements, remove test resources from core main resources.
ledsoft Mar 11, 2024
61ef363
More code cleanup.
ledsoft Mar 11, 2024
58878be
Allow running Annotace either with Spark or MorphoDiTa lemmatizer.
ledsoft Mar 11, 2024
e14e815
[Fix] Fix Docker compose configuration for running Annotace with Morp…
ledsoft Mar 12, 2024
67f213e
Provide readme with instructions on how to run Annotace.
ledsoft Mar 12, 2024
24 changes: 24 additions & 0 deletions .github/workflows/before-push-to-termit.yml
@@ -0,0 +1,24 @@
name: Before merge to 'termit'

on:
  push:
    branches: [ termit ]
  pull_request:
    branches: [ termit ]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2
      - name: Set up JDK 11
        uses: actions/setup-java@v2
        with:
          java-version: '11'
          distribution: 'adopt'
      - name: Grant execute permission for gradlew
        run: chmod +x gradlew
      - name: Build with Gradle
        run: ./gradlew build -x :lemmatizer-morphodita:build -x :lemmatizer-tests:build
37 changes: 37 additions & 0 deletions .github/workflows/on-push-to-termit.yml
@@ -0,0 +1,37 @@
name: On push to 'termit'
on:
  push:
    branches: [ termit ]
  workflow_dispatch:
env:
  IMAGE_NAME: annotace-spark
  USERNAME: ${{ github.actor }}
  TOKEN: ${{ secrets.GITHUB_TOKEN }}
jobs:
  build-and-publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Grant execute permission for gradlew
        run: chmod +x gradlew
      - name: Build with Gradle
        run: ./gradlew build -x :lemmatizer-morphodita:build -x :lemmatizer-tests:build
      - name: Build image
        run: docker build . --file Dockerfile --tag $IMAGE_NAME
      - name: Log into registry
        run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login docker.pkg.github.com -u ${{ github.actor }} --password-stdin
      - name: Push image
        run: |
          IMAGE_ID=docker.pkg.github.com/${{ github.repository }}/$IMAGE_NAME
          # Change all uppercase to lowercase
          IMAGE_ID=$(echo $IMAGE_ID | tr '[A-Z]' '[a-z]')
          # Strip git ref prefix from version
          VERSION=$(echo "${{ github.ref }}" | sed -e 's,.*/\(.*\),\1,')
          # Strip "v" prefix from tag name
          [[ "${{ github.ref }}" == "refs/tags/"* ]] && VERSION=$(echo $VERSION | sed -e 's/^v//')
          # Use Docker `latest` tag convention
          [ "$VERSION" == "termit" ] && VERSION=latest
          echo IMAGE_ID=$IMAGE_ID
          echo VERSION=$VERSION
          docker tag $IMAGE_NAME $IMAGE_ID:$VERSION
          docker push $IMAGE_ID:$VERSION
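The tag-resolution steps in the `Push image` script can be restated outside of shell, which makes the branch-to-tag mapping easier to check. The following is a minimal Java sketch and is not part of the PR; the class and method names (`ImageTagSketch`, `imageId`, `version`) are hypothetical, and the registry host and the `termit` branch name are taken from the workflow above.

```java
import java.util.Locale;

// Hypothetical helper that mirrors the shell logic of the "Push image" step.
public final class ImageTagSketch {

    private ImageTagSketch() {
    }

    // Builds the image ID and lowercases it, like the `tr '[A-Z]' '[a-z]'` call.
    public static String imageId(String repository, String imageName) {
        return ("docker.pkg.github.com/" + repository + "/" + imageName).toLowerCase(Locale.ROOT);
    }

    // Derives the Docker tag from a git ref such as "refs/heads/termit" or "refs/tags/v1.2.3".
    public static String version(String gitRef) {
        // Keep only the part after the last '/', like the first sed expression.
        String version = gitRef.substring(gitRef.lastIndexOf('/') + 1);
        // Strip a leading "v", but only for tag refs.
        if (gitRef.startsWith("refs/tags/") && version.startsWith("v")) {
            version = version.substring(1);
        }
        // The 'termit' branch is published under the conventional `latest` tag.
        return "termit".equals(version) ? "latest" : version;
    }

    public static void main(String[] args) {
        System.out.println(imageId("kbss-cvut/annotace", "annotace-spark") + ":" + version("refs/heads/termit"));
    }
}
```

Under this sketch, a push to the `termit` branch resolves to the `latest` tag, while a tag ref such as `refs/tags/v1.2.3` resolves to `1.2.3`.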
7 changes: 6 additions & 1 deletion .gitignore
@@ -1 +1,6 @@
.idea/**
.idea/**
.gradle
**/build/
*.iml
**/out/
lib
12 changes: 12 additions & 0 deletions Dockerfile
@@ -0,0 +1,12 @@
FROM gradle:8.0.2-jdk11-alpine as build
RUN mkdir annotace
WORKDIR /annotace
COPY . .
RUN gradle bootJar -Pcore,lemmatizer-spark,keywordextractor-ker

FROM eclipse-temurin:11-jdk-alpine as runtime
COPY --from=build /annotace/core/build/libs/*.jar /
RUN mv annotace*.jar annotace.jar

EXPOSE 8080
ENTRYPOINT ["java","-jar","/annotace.jar"]
35 changes: 35 additions & 0 deletions Dockerfile-morphodita
@@ -0,0 +1,35 @@
ARG MORPHODITA_TAGGERS
ARG MORPHODITA_ZIP
ARG MORPHODITA_ZIP_SO

########################################################################################################################

FROM alpine as unzip
ARG MORPHODITA_TAGGERS
ARG MORPHODITA_ZIP
RUN mkdir taggers
RUN mkdir morphodita
COPY $MORPHODITA_TAGGERS /taggers
COPY $MORPHODITA_ZIP /morphodita
WORKDIR /morphodita
RUN unzip *.zip

FROM gradle:8.0.2-jdk11-alpine as buildMaven
ARG MORPHODITA_ZIP_SO
RUN mkdir annotace
WORKDIR /annotace
COPY . .
RUN gradle clean bootJar -x test

FROM eclipse-temurin:11-jdk-alpine as runtime
ARG MORPHODITA_ZIP_SO
# Work around an issue with missing library on Alpine Linux - https://www.svlada.com/fun-times-with-gcc-musl-alpine-linux/
RUN apk add --update --no-cache libc6-compat
RUN cp /lib64/ld-linux-x86-64.so.2 /lib/
COPY --from=buildMaven /annotace/core/build/libs/annotace-*.jar /
RUN mv *.jar annotace.jar
COPY --from=unzip /taggers .
COPY --from=unzip /morphodita/$MORPHODITA_ZIP_SO /lib

EXPOSE 8080
ENTRYPOINT ["java","-jar","/annotace.jar"]
52 changes: 48 additions & 4 deletions README.md
@@ -1,4 +1,48 @@
In order to run MorphoDiTa JNI, it is necessary to
- Download MorphoDiTa 1.9.2 binaries
- Take system library (.so on linux, .dll on Win) and put it on the Java library path (java.library.path system var)
- Download and put necessary MorphoDiTa language models into src/main/resources
# Annotace

Annotace is a text analysis service used e.g. by [TermIt](https://github.com/kbss-cvut/termit) and its [web annotation plugin](https://github.com/alanbuzek/termit-extension).

## How to run it?

- Install Java 11
- Run `./gradlew bootRun` (on Linux/WSL) or `gradlew.bat bootRun` (on Windows)

## Lemmatizers

Annotace supports two lemmatizer implementations:

- [Spark](https://sparknlp.org/)-based lemmatizer is more suitable for annotation of English texts. This is the default lemmatizer.
- [MorphoDiTa](https://ufal.mff.cuni.cz/morphodita)-based lemmatizer is more suitable for annotation of Czech or Slovak texts. It comes in two variants:
  - JNI-based - runs locally using the MorphoDiTa library itself
  - Service-based - invokes a remote annotation service (needs to be configured)

## Setup

Spark-based Annotace setup does not require any additional configuration or files. Either run it directly with `./gradlew bootRun`
or use Docker: an [image](https://github.com/kbss-cvut/annotace/pkgs/container/annotace%2Fannotace-spark) (`ghcr.io/kbss-cvut/annotace/annotace-spark:latest`) is published in the GitHub container registry.

Running Annotace with MorphoDiTa is a bit more complicated.

### Annotace with MorphoDiTa Locally

1. Download the MorphoDiTa [ZIP archive](https://github.com/ufal/morphodita/releases/download/v1.9.2/morphodita-1.9.2-bin.zip) and extract it.
2. Find a file with JNI bindings corresponding to your platform in the extracted directory. For 64-bit Linux the file is `morphodita-1.9.2-bin/bin-linux64/java/libmorphodita_java.so`.
3. Add the path to the **directory containing this file** to the `java.library.path` Java system property (on Linux, the `LD_LIBRARY_PATH` environment variable feeds into it, as in the example below).
4. Provide a mapping of taggers (language models) to Annotace, either by editing `application.yml` before the build or by passing them as environment variables.
5. Run Annotace with the MorphoDiTa lemmatizer by setting `ANNOTACE_LEMMATIZER` to `morphodita-jni`.

A complete command line example would be:
`ANNOTACE_LEMMATIZER=morphodita-jni ANNOTACE_MORPHODITA_TAGGERS_CS=/opt/annotace/lib/czech-morfflex2.0-pdtc1.0-220710.tagger LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/annotace/lib/morphodita-1.9.2-bin/bin-linux64/java ./gradlew bootRun`
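
If the JNI binding fails to load, it can help to check whether the library file is actually visible on `java.library.path`. The following is a minimal sketch and not part of Annotace; the class name `LibraryPathCheck` is hypothetical, and the library name `morphodita_java` is an assumption derived from the file name `libmorphodita_java.so` mentioned in step 2.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Optional;

// Hypothetical diagnostic: looks for a native library in a path list such as java.library.path.
public final class LibraryPathCheck {

    private LibraryPathCheck() {
    }

    // Returns the first existing file matching the platform-specific library file name, if any.
    public static Optional<Path> findLibrary(String pathList, String libraryName) {
        // On Linux, "morphodita_java" maps to "libmorphodita_java.so".
        String fileName = System.mapLibraryName(libraryName);
        return Arrays.stream(pathList.split(File.pathSeparator))
                     .filter(entry -> !entry.isBlank())
                     .map(entry -> Path.of(entry, fileName))
                     .filter(Files::isRegularFile)
                     .findFirst();
    }

    public static void main(String[] args) {
        String searchPath = System.getProperty("java.library.path", "");
        // ASSUMPTION: the library name is inferred from the file name in the README.
        System.out.println(findLibrary(searchPath, "morphodita_java")
                .map(path -> "Found: " + path)
                .orElse("Not found on java.library.path: " + searchPath));
    }
}
```

Running it with the same `LD_LIBRARY_PATH` as in the command above shows whether the JVM would be able to see the binding in that directory.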


### Annotace with MorphoDiTa in Docker

1. Download the MorphoDiTa [ZIP archive](https://github.com/ufal/morphodita/releases/download/v1.9.2/morphodita-1.9.2-bin.zip).
2. Set `MORPHODITA_ZIP` in `docker-compose-morphodita.yml` to path to the downloaded MorphoDiTa ZIP file.
3. Download and extract taggers (language models). Put them into a single directory.
4. Set `MORPHODITA_TAGGERS` in `docker-compose-morphodita.yml` to path to the taggers' directory.
5. Run `docker compose -f docker-compose-morphodita.yml up -d --build` to build and start Annotace with MorphoDiTa.

## License

Annotace is licensed under GPL v3.0; Spark and MorphoDiTa are distributed under their respective licenses.
3 changes: 3 additions & 0 deletions api/build.gradle
@@ -0,0 +1,3 @@
dependencies {
implementation(libs.jackson.annotations)
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
package cz.cvut.kbss.textanalysis.keywordextractor;

import cz.cvut.kbss.textanalysis.keywordextractor.model.KeywordExtractorResult;

public interface KeywordExtractorAPI {

KeywordExtractorResult process(final String input);
}
Original file line number Diff line number Diff line change
Expand Up @@ -16,22 +16,23 @@
* along with this program. If not, see <https://www.gnu.org/licenses/>.
* © 2019 GitHub, Inc.
*/
package cz.cvut.kbss.textanalysis;
package cz.cvut.kbss.textanalysis.keywordextractor.model;

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Collections;
import java.util.List;
import lombok.Data;

@JsonIgnoreProperties(ignoreUnknown = true)
@Data
public class KeywordExtractorResult {

public class Stopwords {
private List<String> keywords;

public List<String> getStopwords(){
try {
return Files.readAllLines(new File(Stopwords.class.getClassLoader().getResource("stopwords-Czech.txt").getFile()).toPath());
} catch (IOException e) {
e.printStackTrace();
return Collections.emptyList();
}
public static KeywordExtractorResult createEmpty() {
KeywordExtractorResult response = new KeywordExtractorResult();
response.setKeywords(Collections.emptyList());
return response;
}
}
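One way `createEmpty()` might be used is as a fallback whenever an extractor has nothing to return. The following is a minimal sketch, not code from the PR: `KeywordFallbackSketch`, `Extractor`, `Result` and `processSafely` are hypothetical stand-ins for `KeywordExtractorAPI` and `KeywordExtractorResult`, written in plain Java without Lombok so the example is self-contained.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a null-safe wrapper around a keyword extractor.
public final class KeywordFallbackSketch {

    private KeywordFallbackSketch() {
    }

    // Stand-in for KeywordExtractorResult (the original relies on Lombok's @Data).
    public static final class Result {
        private List<String> keywords;

        public List<String> getKeywords() {
            return keywords;
        }

        public void setKeywords(List<String> keywords) {
            this.keywords = keywords;
        }

        public static Result createEmpty() {
            Result result = new Result();
            result.setKeywords(Collections.emptyList());
            return result;
        }
    }

    // Stand-in for KeywordExtractorAPI.
    public interface Extractor {
        Result process(String input);
    }

    // Returns an empty result for blank input or when the delegate yields nothing.
    public static Result processSafely(Extractor delegate, String input) {
        if (input == null || input.isBlank()) {
            return Result.createEmpty();
        }
        Result result = delegate.process(input);
        return result != null ? result : Result.createEmpty();
    }
}
```

With such a wrapper, callers can iterate over `getKeywords()` without a null check, at the cost of not being able to tell "no keywords found" apart from "extractor not invoked".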
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
/**
* Annotace Copyright (C) 2019 Czech Technical University in Prague
*
* This program is free software: you can redistribute it and/or modify it under the terms of the
* GNU General Public License as published by the Free Software Foundation, either version 3 of the
* License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
* even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public License along with this program. If
* not, see <https://www.gnu.org/licenses/>. © 2019 GitHub, Inc.
*/

package cz.cvut.kbss.textanalysis.lemmatizer;

import cz.cvut.kbss.textanalysis.lemmatizer.model.LemmatizerResult;
import cz.cvut.kbss.textanalysis.lemmatizer.model.SingleLemmaResult;
import java.util.List;

public interface LemmatizerApi {

/**
* Lemmatizes the given text w.r.t. the given language.
*
* @param text text to lemmatize
* @param lang language to use
* @return result of the lemmatizations
*/
LemmatizerResult process(String text, String lang);
}
Original file line number Diff line number Diff line change
Expand Up @@ -16,34 +16,21 @@
* along with this program. If not, see <https://www.gnu.org/licenses/>.
* © 2019 GitHub, Inc.
*/
package cz.cvut.kbss.textanalysis.model;
package cz.cvut.kbss.textanalysis.lemmatizer.model;

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonProperty;

import java.util.List;
import lombok.Data;

@JsonIgnoreProperties(ignoreUnknown = true)
public class MorphoDitaResult {
@Data
public class LemmatizerResult {

@JsonProperty
private List<List<MorphoDitaResultJson>> result;
private String lemmatizer;

public MorphoDitaResult() {
}

public List<List<MorphoDitaResultJson>> getResult() {
return result;
}

public void setResult(List<List<MorphoDitaResultJson>> result) {
this.result = result;
}

@Override
public String toString() {
return "MorphoDitaResult{" +
"result=" + result +
'}';
}
@JsonProperty
private List<List<SingleLemmaResult>> result;
}
Original file line number Diff line number Diff line change
Expand Up @@ -16,12 +16,18 @@
* along with this program. If not, see <https://www.gnu.org/licenses/>.
* © 2019 GitHub, Inc.
*/
package cz.cvut.kbss.textanalysis.service.morphodita;
package cz.cvut.kbss.textanalysis.lemmatizer.model;

import cz.cvut.kbss.textanalysis.model.MorphoDitaResultJson;
import java.util.List;
import lombok.Data;

public interface MorphoDitaServiceAPI {
@Data
public class SingleLemmaResult {

List<List<MorphoDitaResultJson>> getMorphoDiteResultProcessed(String s);
private String token;

private String lemma;

private String spaces;

private boolean negated;
}
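A consumer of `LemmatizerResult` has to walk the nested `List<List<SingleLemmaResult>>`. The following is a minimal sketch of such a traversal, not code from the PR. It assumes that the outer list holds sentences, the inner lists hold tokens, and `spaces` carries the whitespace following each token; none of this is stated in the diff. `LemmaWalkSketch` and `Token` are hypothetical stand-ins written without Lombok.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of traversing a lemmatizer result.
public final class LemmaWalkSketch {

    private LemmaWalkSketch() {
    }

    // Stand-in for SingleLemmaResult (the original relies on Lombok's @Data).
    public static final class Token {
        public final String token;
        public final String lemma;
        public final String spaces;
        public final boolean negated;

        public Token(String token, String lemma, String spaces, boolean negated) {
            this.token = token;
            this.lemma = lemma;
            this.spaces = spaces;
            this.negated = negated;
        }
    }

    // Rebuilds the surface text. ASSUMPTION: `spaces` is the whitespace that follows the token.
    public static String surfaceText(List<List<Token>> sentences) {
        StringBuilder text = new StringBuilder();
        for (List<Token> sentence : sentences) {
            for (Token token : sentence) {
                text.append(token.token).append(token.spaces == null ? "" : token.spaces);
            }
        }
        return text.toString();
    }

    // Flattens all sentences into a single list of lemmas, in document order.
    public static List<String> lemmas(List<List<Token>> sentences) {
        return sentences.stream()
                        .flatMap(List::stream)
                        .map(token -> token.lemma)
                        .collect(Collectors.toList());
    }
}
```

If the `spaces` assumption holds, comparing `surfaceText` with the original input is a cheap check that a lemmatizer implementation preserves the text it was given.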
52 changes: 52 additions & 0 deletions build.gradle
@@ -0,0 +1,52 @@
plugins {
    id "org.springframework.boot" version "2.7.10" apply false
    id "io.spring.dependency-management" version "1.0.11.RELEASE" apply false
}

group "cz.cvut.kbss"
description "Text analysis for Czech language and annotation recommendation service"
version "0.0.1"

def revision = "git rev-list --count HEAD".execute().text.trim()
def hash = "git rev-parse --short HEAD".execute().text.trim()
version = "0.0.1.r${revision}.${hash}";

ext {
    junitVersion = "5.9.2"
}

subprojects {
    apply plugin: "java"
    apply plugin: "java-library"

    compileJava {
        sourceCompatibility = "11"
        targetCompatibility = "11"
    }

    test {
        useJUnitPlatform()
    }

    group parent.group
    version parent.version

    repositories {
        mavenCentral()
        maven {
            name = "kbss-private"
            url = uri("https://kbss.felk.cvut.cz/m2repo-private")
        }
    }

    dependencies {
        implementation "org.slf4j:slf4j-api:1.7.36"
        implementation "ch.qos.logback:logback-classic:1.2.11"
        compileOnly "org.projectlombok:lombok:1.18.20"

        annotationProcessor "org.projectlombok:lombok:1.18.20"

        testImplementation(libs.junit.api)
        testRuntimeOnly(libs.junit.engine)
    }
}