Termit #1

Closed
wants to merge 81 commits into from
Commits
fa5390d
Service refactoring, bump Spring to v2.4.4
psiotwo Jul 23, 2021
84bbc74
Libraries update, Java to 11.
psiotwo Jul 23, 2021
1cf0067
Test refactoring
psiotwo Jul 23, 2021
4d65e32
Useless dependency removal
psiotwo Jul 23, 2021
122f736
Fix java version in pom
psiotwo Jul 23, 2021
d70d9a3
Dockerization + Lombok + Spring boot configuration/YAML
psiotwo Jul 23, 2021
01ab633
Extract morphodita configuration
psiotwo Jul 25, 2021
9b76ce7
Splitting into modules; Maven -> Gradle
psiotwo Jul 25, 2021
4aa1586
refacting multimodule build setup
psiotwo Jul 25, 2021
c7671b7
Remove XMLAnnotationParser
psiotwo Jul 27, 2021
6fd892b
Setup github action
psiotwo Jul 27, 2021
91adc5b
Encoding with utf-8 charset.
psiotwo Jul 27, 2021
8c60a42
Encoding with utf-8 charset fix.
psiotwo Jul 27, 2021
a89cafa
removing termit-irrelevant services, modularization of the lemmatizer…
psiotwo Jul 27, 2021
fb6f135
refactoring mocks
psiotwo Jul 29, 2021
4b8be02
refactor project, upgrade gradle to 7, spring to 5,
psiotwo Jul 30, 2021
e87b1d3
[upd] Update the SPARQL query with language filter
Jun 14, 2019
76cee2f
[upd] Check for negation in a separate method
Jun 14, 2019
98076c4
[upd] Allow processing for a specific language (English model support)
Jun 18, 2019
55e7c45
[add] Add English language model file
Jun 18, 2019
2d223e7
[add] Add English stop-word list
Jul 2, 2019
f31ce1d
[upd] Check for stopwords in annotations with minor refactoring
Jul 4, 2019
d56a45b
[upd] Update the SPARQL query with language filter
Jun 14, 2019
449dc21
[upd] Allow processing for a specific language (English model support)
Jun 18, 2019
bb385ed
[upd] Check for negation in a separate method
Jun 14, 2019
cbaa003
[add] add stopwords for English language analysis
Jul 4, 2019
14974fe
[add] create test for stopwords
Jul 4, 2019
64b8b43
[Upd] Improve code quality and efficiency a little bit. Update .gitig…
ledsoft Nov 27, 2019
1de7fd6
[Fix] Ensure MorphoDita analyzes ontology labels correctly.
Nov 29, 2019
b24d2ae
[Fix] Fix system default URI encoding issue.
Feb 5, 2020
88483fc
[#1218] Adjust the query to use PrefLabel instead of label to stay al…
May 4, 2020
4d47cb6
[#1226] Ensure creating unique ids for nodes with multiple analysis.
May 7, 2020
9d50a47
[Fix] Fix issues caused by update term namespace in the data.
May 31, 2020
9f0510a
[#Fix] Fix the IRI of terms also in the annotate method.
Jun 1, 2020
3da29a5
[#1249] Fix spacing problems in the annotator.
Jul 10, 2020
7046992
[add] XML transformation of Annotace output
May 21, 2019
e31d121
[#Fix] Save output to a new file instead of loading one.
Jul 17, 2020
f75b5d3
[fix] Specify input language for morphological analysis.
Jul 17, 2020
5b39a30
[fix] Expect various lengths of tags for different input languages / …
Jul 17, 2020
e88f6f4
[fix] Reflect the recent changes in data and retrieve only terms with…
Jul 24, 2020
e11152e
[upd] Use the correct tokenizer corresponding to the input language.
Saeedla Sep 2, 2020
382eb05
[#1336] Allow to retrieve all the labels in the vocabulary with their…
Saeedla Sep 7, 2020
a618379
[#1336] Allow to retrieve all the labels in the vocabulary with their…
Saeedla Sep 7, 2020
27915c4
[#1336] Implement a new scoring strategy to sort the term candidates.
Saeedla Sep 7, 2020
0579d5c
[#1336] Prioritize prefLabel to ensure its retrieval when exact match.
Saeedla Sep 9, 2020
610849f
[#upd] Improve the English stop-words list.
Saeedla Sep 9, 2020
c1a511e
[#1352] Configure keyword extractor to be called on demand.
Saeedla Sep 15, 2020
e0faecf
[#926] Support authenticated access to the repository.
Oct 27, 2020
dada6d9
[Upd] Compare on token level as well as lemmas.
Nov 20, 2020
5fcf25b
[Fix] Split string on / separator.
Nov 20, 2020
86d05db
[Upd] Update English stop-word list.
Nov 20, 2020
2a4f180
merging KBSS master, making SparkLemmatizer working
psiotwo Aug 1, 2021
07e23d0
preloading Spark lemmatizers.
psiotwo Aug 1, 2021
e11922c
making english lemmatizer working
psiotwo Aug 3, 2021
10fdbb4
Code cleanup, fix Spark configuration
psiotwo Aug 6, 2021
edf5f41
Setup docker build of annotace with spark lemmatizer
psiotwo Aug 31, 2021
03cbcd9
Fix stopwords resolution in JAR files.
psiotwo Aug 31, 2021
d600c7d
[R-#1613] backquotes recognition during tokenization in Spark.
psiotwo Oct 18, 2021
08920bd
[R-#1613] support backquotes in Spark lemmatizer
psiotwo Oct 18, 2021
db38548
[R-#1613] dependencies update
psiotwo Oct 18, 2021
60f0b4a
Connecting to the ontology repo with/without the credentials.
psiotwo Oct 22, 2021
0ed5f00
Fix reading model.
psiotwo Oct 22, 2021
301aa59
Fix reading model.
psiotwo Oct 22, 2021
c5a252e
[#3] idempotent annotate method
psiotwo Oct 26, 2021
4124464
Merge pull request #4 from kbss-cvut/3-idempotent
psiotwo Oct 26, 2021
475d8bd
[#3] adding one more test
psiotwo Oct 26, 2021
643c4d6
Merge pull request #5 from kbss-cvut/3-idempotent
psiotwo Oct 26, 2021
b9bc7d3
[#7] Instructions to run annotace.
psiotwo Jun 10, 2022
1171872
[Bug #6] Update dependencies, including Apache Spark.
ledsoft Apr 12, 2023
185c0e5
[Fix] Fix build issues (tests).
ledsoft Apr 12, 2023
76bef03
[Fix] Fix deprecated Gradle feature warning.
ledsoft Apr 12, 2023
1ede3cb
[Ref] Unify single and double quote usage in build.gradle files.
ledsoft Apr 13, 2023
aed5014
[Upd] Bump Gradle version to 8.1 and use a version catalog to manage …
ledsoft Apr 13, 2023
93ebb42
[Upd] Update Docker images to use Gradle 8.0.2 and Eclipse Temurin al…
ledsoft Apr 13, 2023
bb0ec58
[Doc #7] Improve info on running annotace in readme.
ledsoft Apr 13, 2023
561b01d
Merge pull request #10 from kbss-cvut/fix/spark-scala-version
ledsoft Apr 13, 2023
7dfe72b
Minor code improvements, remove test resources from core main resources.
ledsoft Mar 11, 2024
61ef363
More code cleanup.
ledsoft Mar 11, 2024
58878be
Allow running Annotace either with Spark or MorphoDiTa lemmatizer.
ledsoft Mar 11, 2024
e14e815
[Fix] Fix Docker compose configuration for running Annotace with Morp…
ledsoft Mar 12, 2024
67f213e
Provide readme with instructions on how to run Annotace.
ledsoft Mar 12, 2024
[add] XML transformation of Annotace output
(cherry picked from commit 2c6fdf364bcbb5fb6bcb263744fd7581255627e4)
Lama Saeeda authored and psiotwo committed Jul 30, 2021
commit 7046992e2a2bf449e4296c1ef28fdb12c773d9c3
@@ -0,0 +1,234 @@
package cz.cvut.kbss.textanalysis.service;

import cz.cvut.kbss.textanalysis.model.MorphoDitaResultJson;
import org.jsoup.Jsoup;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.File;
import java.io.IOException;
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XMLAnnotationParser {

XMLAnnotationService xmlAnnotationService = new XMLAnnotationService();

public void paresContent() throws ParserConfigurationException {
//Initialize xml document
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.newDocument();

final List<String> COMP = new ArrayList<String>(Arrays.asList("zahrnuje","zahrnovat", "tvořený", "skládající", "členěno", "členit", "rozděluje", "rozdělovat"));
final List<String> COMPR = new ArrayList<String>(Arrays.asList("vyskytující", "tvoří", "tvořit", "součástí", "součást"));
final List<String> CATV = new ArrayList<String>(Arrays.asList("rozlišuje", "rozlišovat", "člení", "vymezují", "vymezovat", "rozlišením", "rozlišení", "zejména"));
final List<String> CN = new ArrayList<String>(Arrays.asList("typy", "základní typy", "typ"));
final List<String> SYN = new ArrayList<String>(Arrays.asList("synonym", "syn", "ekvivalent", "ekvivalentem"));
final List<String> PROP = new ArrayList<String>(Arrays.asList("přiřazen", "přiřazena", "přiřadit"));
final List<String> BE = new ArrayList<String>(Arrays.asList("být", "bude", "budou", "byl", "byt", "je", "jsou"));
final List<String> CNR = new ArrayList<String>(Arrays.asList("takový", "takové", "taková"));
final List<String> PUNC = new ArrayList<String>(Arrays.asList("."));

int labelCount=0;

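String content;
try {
// Read the annotated HTML from the classpath, decoding explicitly as UTF-8 rather than the platform default.
content = new String(Files.readAllBytes(Paths.get(getClass().getClassLoader().getResource("MPP-allgraphs-dev-annotated-noStopwords.html").toURI())), java.nio.charset.StandardCharsets.UTF_8);
} catch (IOException | URISyntaxException e) {
e.printStackTrace();
// Without the input there is nothing to parse; stop here instead of passing null on.
return;
}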

String processedContent = xmlAnnotationService.stripSpans(content);

processedContent = processedContent.replaceAll("\\(", "\\(").replaceAll("</u>", " ").trim();

String cleanDoc = processedContent.replaceAll("<span.*?\">", " ").replaceAll("</span>", " ");

List<List<MorphoDitaResultJson>> analyzedDocList = xmlAnnotationService.morphoAnalyze(cleanDoc);

Element tokenCollection = doc.createElement("TokenCollection");
doc.appendChild(tokenCollection);

Attr type = doc.createAttribute("Type");
type.setValue("Content");
tokenCollection.setAttributeNode(type);

for (List<MorphoDitaResultJson> list1 : analyzedDocList) {

for (int i=0; i<list1.size(); i++) {
if(labelCount != 0) {
i = i+labelCount - 1;
labelCount = 0;
}

String token = list1.get(i).getToken();
String tag = transformTag(list1.get(i).getTag().substring(0, 2));
String lemma = stripLemma(list1.get(i).getLemma());
if (processedContent.startsWith(token)) {

Element xmltoken;
//xmltoken = createToken(doc, tokenCollection, lemma);
xmltoken = createToken(doc, tokenCollection, token);
createFeature(doc, xmltoken, "SynCategory", tag);

if (COMP.contains(token))
createFeature(doc, xmltoken, "Concept", "http://www.hermes.com/knowledgebase.owl#COMP");
else if (COMPR.contains(token))
createFeature(doc, xmltoken, "Concept", "http://www.hermes.com/knowledgebase.owl#COMPR");
else if (CATV.contains(token))
createFeature(doc, xmltoken, "Concept", "http://www.hermes.com/knowledgebase.owl#CATV");
else if (CN.contains(token))
createFeature(doc, xmltoken, "Concept", "http://www.hermes.com/knowledgebase.owl#CN");
else if (SYN.contains(token))
createFeature(doc, xmltoken, "Concept", "http://www.hermes.com/knowledgebase.owl#SYN");
else if (PROP.contains(token))
createFeature(doc, xmltoken, "Concept", "http://www.hermes.com/knowledgebase.owl#PROP");
else if (BE.contains(token))
createFeature(doc, xmltoken, "Concept", "http://www.hermes.com/knowledgebase.owl#BE");
else if (CNR.contains(token))
createFeature(doc, xmltoken, "Concept", "http://www.hermes.com/knowledgebase.owl#CNR");
else if (PUNC.contains(token))
createFeature(doc, xmltoken, "Concept", "http://www.hermes.com/knowledgebase.owl#PUNC");



processedContent = processedContent.replaceFirst(Pattern.quote(token), Matcher.quoteReplacement("")).trim();
}
else {
if (processedContent.startsWith("<span")) {
final org.jsoup.nodes.Document spanDoc =Jsoup.parse(processedContent);
org.jsoup.nodes.Element element = spanDoc.selectFirst("span");
String spanValue = element.attr("resource");

Element xmlToken = createToken(doc, tokenCollection, element.text());

if (spanValue.contains("ufo")) {
createFeature(doc, xmlToken, "UFOConcept", spanValue);
createFeature(doc, xmlToken, "UFOConcept", "http://www.hermes.com/knowledgebase.owl#Concept");
}
else {
createFeature(doc, xmlToken, "Concept", spanValue);
createFeature(doc, xmlToken, "Concept", "http://www.hermes.com/knowledgebase.owl#Concept");
}

//termTag = termTag + list1.get(i).getTag().substring(0, 2);
String[] words = element.text().split("\\s+");
labelCount = words.length;
//int counter = labelCount;
String termTag;
if (labelCount > 1)
termTag = "NNP";
else
termTag = tag;

createFeature(doc, xmlToken, "SynCategory", termTag);
processedContent = processedContent.replaceFirst("<span.*?>", "").replaceFirst(".*?</span>", " ").trim();
}
}
}
}

transformXml(doc);
System.out.println(doc.toString());
}

public Element createToken(Document doc, Element root, String token) {
Element xmltoken = doc.createElement("Token");
root.appendChild(xmltoken);
Attr value = doc.createAttribute("Value");
Attr endOffset = doc.createAttribute("EndOffset");
endOffset.setValue("0");
Attr startOffset = doc.createAttribute("StartOffset");
startOffset.setValue("0");
value.setValue(token);
xmltoken.setAttributeNode(value);
xmltoken.setAttributeNode(endOffset);
xmltoken.setAttributeNode(startOffset);

return xmltoken;
}

public Element createFeature(Document doc, Element parentToken, String type, String value) {
Element feature = doc.createElement("Feature");
parentToken.appendChild(feature);

Attr featureType = doc.createAttribute("Type");
Attr featureValue = doc.createAttribute("Value");
feature.setAttributeNode(featureType);
feature.setAttributeNode(featureValue);
featureType.setValue(type);
featureValue.setValue(value);

return feature;
}

public void transformXml(Document doc) {
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer;
try {
transformer = transformerFactory.newTransformer();
} catch (TransformerConfigurationException e) {
e.printStackTrace();
// No transformer means nothing can be written; return to avoid a NullPointerException below.
return;
}
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
DOMSource domSource = new DOMSource(doc);
StreamResult streamResult = new StreamResult(new File(XMLAnnotationParser.class.getResource("xmlAnnotations.xml").getFile()));

try {
transformer.transform(domSource, streamResult);
} catch (TransformerException e) {
e.printStackTrace();
}
System.out.println("xml file is created");
}

public String transformTag(String tag) {
String newTag = tag;
// PP: personal pronoun, NN: noun, VB: verb,
if (tag.contains("C")) newTag = "CD";
else if (tag.contains("R") || tag.contains(",")) newTag = "IN"; //preposition or subordinating conjunction, e.g. do, z, aby, kdyby, etc.
else if (tag.equals("AA")) newTag = "JJ"; //Adjective
else if (tag.equals("Dg")) newTag = "RBR"; //RBR here represents comparative or superlative adverb, ex. druhotně, vzájemně, územně, stejně, průběžně, samostatně, bezprostředně, obecně, etc.
else if (tag.equals("Db")) newTag = "RB"; //Adverb without a possibility to form negation and degrees of comparison, ex. tak, zejména, Zároveň, nejen, pouze, především, také, dále, etc.
else if (tag.equals("TT")) newTag = "RP"; // li, jen, nikoli[v], až, ovšem, tedy, etc.
else if (tag.equals("II")) newTag = "UH"; //interjection
else if (tag.equals("Vs") || tag.equals("Vp")) newTag = "VBN"; //Verb, past participle ex. je kladen, je pořízen, etc.
else if (tag.equals("Vf")) newTag = "VB"; //Verb, infinitive
else if (tag.contains("V")) newTag = "VB";
else if (tag.equals("PS")) newTag = "WP$"; //Pronoun possessive
else if (tag.contains("^")) newTag = "CC"; //Conjunction (connecting main clauses, not subordinate) ex. nebo
else if (tag.equals("Z:")) newTag = "DT"; //Punctuation
else if (tag.equals("P7")) newTag = "PRP"; //reflexive se

return newTag;
}

public String stripLemma(String lemma) {
if (lemma.contains("_")) {
return lemma.substring(0, lemma.indexOf("_"));
} else
if (lemma.contains("-")) {
return lemma.substring(0, lemma.indexOf("-"));
}
else
return lemma;

}
}
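Review note: `createToken` and `createFeature` above build a `TokenCollection` of `Token` elements, each carrying `Feature` children. The sketch below shows the shape of that output for a single token. It is a minimal illustration, not code from this PR: the class `TokenXmlSketch` and the method `render` are hypothetical names, and it serializes to a string instead of writing `xmlAnnotations.xml`.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper (not part of this PR): renders one token the way the parser structures it.
public class TokenXmlSketch {

    public static String render(String token, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

            // Root element, mirroring <TokenCollection Type="Content">.
            Element root = doc.createElement("TokenCollection");
            root.setAttribute("Type", "Content");
            doc.appendChild(root);

            // One <Token>; the offsets are fixed at "0" as in createToken.
            Element tokenEl = doc.createElement("Token");
            tokenEl.setAttribute("Value", token);
            tokenEl.setAttribute("StartOffset", "0");
            tokenEl.setAttribute("EndOffset", "0");
            root.appendChild(tokenEl);

            // One <Feature> holding the mapped part-of-speech tag.
            Element feature = doc.createElement("Feature");
            feature.setAttribute("Type", "SynCategory");
            feature.setAttribute("Value", tag);
            tokenEl.appendChild(feature);

            // Serialize to a string so the result can be inspected without touching the file system.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not render token XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render("je", "VB"));
    }
}
```

Using `setAttribute` instead of separate `Attr` nodes is only a shorthand; the resulting elements should carry the same attributes as those produced by the methods above.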
@@ -0,0 +1,71 @@
package cz.cvut.kbss.textanalysis.service;

import cz.cvut.kbss.textanalysis.model.MorphoDitaResultJson;
import cz.cvut.kbss.textanalysis.service.morphodita.MorphoDitaServiceAPI;
import cz.cvut.kbss.textanalysis.service.morphodita.MorphoDitaServiceJNI;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.safety.Whitelist;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.List;

@Service
public class XMLAnnotationService {

@Autowired
private MorphoDitaServiceAPI morphoDitaService = new MorphoDitaServiceJNI();

public String stripSpans(String htmlDocument) {
Document.OutputSettings settings = new Document.OutputSettings();
settings.prettyPrint(false);
final Document doc =Jsoup.parse(htmlDocument);
for( Element element : doc.select("span") ) {
if (!element.attr("score").equals("1.0")) {
TextNode tn = new TextNode(" " +element.text().trim() + " ");
element.before(tn);
element.remove();
}
}

return stripTagsWithWhitelist(doc);
}

public String stripTagsWithWhitelist(Document htmlDocument) {
//final Document doc =Jsoup.parse(htmlDocument);

Whitelist whitelist = Whitelist.none();
whitelist.addTags("span");
whitelist.addAttributes("span", "about", "resource", "score");
whitelist.addProtocols("span", "score", "1.0");

Document.OutputSettings settings = new Document.OutputSettings();
settings.prettyPrint(false);

htmlDocument.select("br").after(" ");
htmlDocument.select("p").after(" ");

String noTags = Jsoup.clean(htmlDocument.toString(),"", whitelist, settings);
return noTags;

}

public String stripTags(String htmlDocument) {
Document.OutputSettings settings = new Document.OutputSettings();
settings.prettyPrint(false);
Whitelist whitelist = Whitelist.none();
Document doc = Jsoup.parse(htmlDocument);
doc.select("br").after("\\n");
doc.select("p").after("\\n");
// Pass the output settings so that prettyPrint(false) actually takes effect.
return Jsoup.clean(doc.toString(), "", whitelist, settings);
}

public List<List<MorphoDitaResultJson>> morphoAnalyze(String s) {
final List<List<MorphoDitaResultJson>> morphoDitaResult = morphoDitaService.getMorphoDiteResultProcessed(s);
return morphoDitaResult;
}

}
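Review note: `stripSpans` keeps only the `span` annotations whose `score` attribute equals `1.0` and replaces every other span with its plain text padded by spaces. The sketch below illustrates that rule on small inputs. It is an assumption-laden stand-in: the class `SpanFilterSketch` and the method `keepConfidentSpans` are hypothetical, and a regular expression replaces jsoup only to keep the example free of dependencies, so nested or malformed markup is not handled the way the real parser would handle it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for stripSpans (not part of this PR); regex is used instead of jsoup.
public class SpanFilterSketch {

    // Matches a non-nested <span ...>text</span>, capturing the attribute string and the inner text.
    private static final Pattern SPAN = Pattern.compile("<span([^>]*)>(.*?)</span>", Pattern.DOTALL);

    public static String keepConfidentSpans(String html) {
        Matcher m = SPAN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Keep the span untouched only when it carries score="1.0".
            boolean confident = m.group(1).contains("score=\"1.0\"");
            String replacement = confident ? m.group() : " " + m.group(2).trim() + " ";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepConfidentSpans("a <span score=\"0.5\">b</span> c"));
    }
}
```

The padding spaces mirror the `TextNode` built in `stripSpans`, which is why unwrapping a low-score span can leave doubled spaces in the result.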