- Add documentation about the execution-process.
- Set the name of the "uber-jar" produced by the "maven-shade-plugin", in order to avoid taking the name of the simple jar and thus taking its place when uploaded to a remote Maven repository by Jenkins.
- Add a smaller input-file containing the first 500 records (taken from the 1000-record file).
- Update License.
- Update a dependency.
- Code polishing.
LSmyrnaios committed Feb 7, 2024
1 parent 85187ac commit ab33077
Showing 6 changed files with 537 additions and 11 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright 2018-2023 Lampros Smyrnaios
Copyright 2018-2024 OpenAIRE AMKE

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
2 changes: 2 additions & 0 deletions README.md
@@ -45,6 +45,8 @@ DocFileFullPath: *the full-storage-path of the fulltext-document-file.*<br>
ErrorCause: *the cause of the failure of retrieving the docUrl or the docFile.*<br>
<br>

The program's execution process can be found [here](documentation/program-execution-process.md).
<br>
This program utilizes multiple threads to speed up the process, while applying politeness-delays between same-domain connections, in order to avoid overloading the data-providers.
<br>
In case no IDs are available to be used in the input, the user should provide a file containing just urls (one url per line)
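The politeness-delay behavior described above could look roughly like the following minimal Java sketch. The class, method, and constant names here are illustrative (not the project's actual API), and the 1-second delay is an assumed value, not the project's real setting:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: before connecting to a url, each worker thread checks
// when that url's domain was last contacted and learns how long to sleep so
// that a fixed politeness-delay is respected between same-domain connections.
public class PolitenessSketch {

    private static final long POLITENESS_DELAY_MS = 1000;  // assumed value
    private static final Map<String, Long> lastAccessPerDomain = new ConcurrentHashMap<>();

    // Returns how many ms the caller should wait before contacting this domain.
    // (As a simplification, the access-time is recorded up front; the real
    // program may update it after the connection instead.)
    public static long requiredWaitMs(String domain, long nowMs) {
        Long last = lastAccessPerDomain.put(domain, nowMs);
        if (last == null)
            return 0;  // first contact with this domain: no wait needed
        long elapsed = nowMs - last;
        return (elapsed >= POLITENESS_DELAY_MS) ? 0 : (POLITENESS_DELAY_MS - elapsed);
    }

    public static void main(String[] args) {
        System.out.println(requiredWaitMs("example.com", 0));    // first visit: 0
        System.out.println(requiredWaitMs("example.com", 300));  // 300ms later: 700
        System.out.println(requiredWaitMs("other.org", 300));    // different domain: 0
    }
}
```

Because the map is a `ConcurrentHashMap`, many worker threads can consult it without external locking, which fits the multi-threaded design described above.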
19 changes: 19 additions & 0 deletions documentation/program-execution-process.md
@@ -0,0 +1,19 @@
## Program's execution process

During the program's execution, the following steps take place:

1) The given command-line-arguments are validated and the related variables are set.
2) The input-json-file is parsed and the id-url pairs are loaded in batches.
3) Each id-url pair is added to the "callableTasksList" and is then processed by one of the running threads.
4) For each id-url pair, the url is checked against known problematic urls (those which give neither the landing-page of the publication nor the full-text (or dataset) itself).
5) Each url is prepared for connection. If it belongs to a "special-case" domain, a dedicated handler transforms it either into a full-text url or into a "workable" landing-page.
6) If the url gives a full-text (or dataset) file directly (even after redirections), the file will be saved (if such option is chosen) and the results will be queued to be written to the disk in due time.
7) If the url leads to a web-page, the following steps take place:
- The file-url presented in the `<meta>` (metadata) tags is checked for whether it actually gives the file.
- If the above step does not succeed, then the internal links are extracted from the page.
- During the links-extraction process, some likely-to-be-fulltext-links (based on text-mining of the surrounding text) are picked-up and connected immediately.
- Also, some invalid docUrls or irrelevant urls are identified and blocked before we give them a chance to be connected.
- After the extraction process, we loop through the internal-links list, identify some potential docOrDataset urls and check those first, before checking the rest.
- If a docUrl is verified, step 6 takes place. Otherwise, the negative result is checked for its "re-triable status" and written to the disk in due time.
8) Once all records of a batch are processed, the results of those records are written to the output-file.
9) Once all batches have finished being processed, the program exits.
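Steps 2-3 and 8 above could be sketched as follows. This is a minimal illustration, not the project's actual implementation: the class name, the stand-in url-check lambda, and the result-string format are all hypothetical, while "callableTasksList" is the name used in the steps above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the batch flow: each id-url pair becomes one Callable
// in the "callableTasksList", the thread-pool runs the whole batch, and the
// results are collected once the batch finishes.
public class BatchFlowSketch {

    public static List<String> processBatch(List<String[]> idUrlBatch, int threads) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        List<Callable<String>> callableTasksList = new ArrayList<>();
        for (String[] pair : idUrlBatch) {
            String id = pair[0], url = pair[1];
            callableTasksList.add(() -> id + " -> checked " + url);  // stands in for the real url-check
        }
        List<String> results = new ArrayList<>();
        for (Future<String> future : executor.invokeAll(callableTasksList))  // blocks until the batch is done
            results.add(future.get());
        executor.shutdown();
        return results;  // in the real program, these are queued and written to the output-file
    }

    public static void main(String[] args) throws Exception {
        List<String[]> batch = List.of(new String[]{"id1", "http://example.com/a"},
                                       new String[]{"id2", "http://example.com/b"});
        System.out.println(processBatch(batch, 2));
    }
}
```

`invokeAll` returns the futures in the same order as the submitted tasks, so each record's result stays aligned with its id-url pair when the batch is written out.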
18 changes: 10 additions & 8 deletions pom.xml
@@ -35,6 +35,7 @@
<goal>shade</goal>
</goals>
<configuration>
<finalName>publications_retriever-uber_jar-${version}</finalName>
<transformers>
<!-- add Main-Class to manifest file -->
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
@@ -78,6 +79,13 @@

<dependencies>

<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-api -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.36</version>
</dependency>

<!-- logback versions 1.4.X require Java-11 -->
<!-- logback versions 1.3.X require Java-8, but if this project is added as Dependency in a Spring Boot App, then Spring Boot throws an error, since it does not yet support logback 1.3.x -->

@@ -88,13 +96,6 @@
<version>1.2.13</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-api -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.36</version>
</dependency>

<!-- https://mvnrepository.com/artifact/ch.qos.logback/logback-classic -->
<dependency>
<groupId>ch.qos.logback</groupId>
@@ -164,6 +165,7 @@
</exclusions>
</dependency>

<!-- There is a newer version, but from v.1.4 onwards it requires Java-11. -->
<dependency>
<groupId>com.github.crawler-commons</groupId>
<artifactId>crawler-commons</artifactId>
@@ -192,7 +194,7 @@
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-engine</artifactId>
<version>5.10.1</version>
<version>5.10.2</version>
<scope>test</scope>
</dependency>

@@ -32,8 +32,11 @@ public class TestNonStandardInputOutput {
private static final Logger logger = LoggerFactory.getLogger(TestNonStandardInputOutput.class);

private static final String testingSubDir = "idUrlPairs"; // "idUrlPairs" or "justUrls".
private static final String testingDirectory = System.getProperty("user.dir") + File.separator + "testData" + File.separator + testingSubDir + File.separator;
private static final String testInputFile = "orderedList1000.json"; //"test_only_ids.json"; //"id_to_url_rand10000_20201015.json"; //"test_non_utf_output.json"; //"around_200k_IDs.json"; // "sampleCleanUrls3000.json", "orderedList1000.json", "orderedList5000.json", "testRandomNewList100.csv", "test.json", "id_to_url_rand10000_20201015.json"
private static final String testingDirectory = FileUtils.workingDir + "testData" + File.separator + testingSubDir + File.separator;
private static final String testInputFile = "orderedList500.json"; //"test_only_ids.json"; //"id_to_url_rand10000_20201015.json"; //"test_non_utf_output.json"; //"around_200k_IDs.json"; // "sampleCleanUrls3000.json", "orderedList1000.json", "orderedList5000.json", "testRandomNewList100.csv", "test.json", "id_to_url_rand10000_20201015.json"

// TODO - Add an extra cmd-argument to define the "upper-limit" of the number of records to process from a file.
// Thus, we will have a single 5000 or 10000 records file and process only the first N records, instead of having multiple files of smaller length.

private static final File inputFile = new File(testingDirectory + testInputFile);
private static File outputFile = new File(testingDirectory + "results_" + testInputFile); // This can change if the user gives the inputFile as a cmd-arg.
