- Add documentation about the execution-process.
- Set the name of the "uber-jar" produced by the "maven-shade-plugin", in order to avoid taking the name of the simple jar and thus taking its place when uploaded to a remote Maven repository by Jenkins.
- Add a smaller input-file containing the first 500 records (taken from the 1000-record file).
- Update License.
- Update a dependency.
- Code polishing.
LSmyrnaios committed Feb 7, 2024
1 parent 85187ac commit ab33077
Showing 6 changed files with 537 additions and 11 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright 2018-2023 Lampros Smyrnaios
Copyright 2018-2024 OpenAIRE AMKE

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
2 changes: 2 additions & 0 deletions README.md
@@ -45,6 +45,8 @@ DocFileFullPath: *the full-storage-path of the fulltext-document-file.*<br>
ErrorCause: *the cause of the failure of retrieving the docUrl or the docFile.*<br>
<br>

The program's execution process can be found [here](documentation/program-execution-process.md).
<br>
This program utilizes multiple threads to speed up the process, while applying politeness-delays between same-domain connections, in order to avoid overloading the data-providers.
<br>
In case no IDs are available to be used in the input, the user should provide a file containing just urls (one url per line)
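The politeness-delay behavior described above could look roughly like the following minimal Java sketch. The class, method, and constant names here are illustrative (not the project's actual API), and the 1-second delay is an assumed value, not the project's real setting:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: before connecting to a url, each worker thread checks
// when that url's domain was last contacted and learns how long to sleep so
// that a fixed politeness-delay is respected between same-domain connections.
public class PolitenessSketch {

    private static final long POLITENESS_DELAY_MS = 1000;  // assumed value
    private static final Map<String, Long> lastAccessPerDomain = new ConcurrentHashMap<>();

    // Returns how many ms the caller should wait before contacting this domain.
    // (As a simplification, the access-time is recorded up front; the real
    // program may update it after the connection instead.)
    public static long requiredWaitMs(String domain, long nowMs) {
        Long last = lastAccessPerDomain.put(domain, nowMs);
        if (last == null)
            return 0;  // first contact with this domain: no wait needed
        long elapsed = nowMs - last;
        return (elapsed >= POLITENESS_DELAY_MS) ? 0 : (POLITENESS_DELAY_MS - elapsed);
    }

    public static void main(String[] args) {
        System.out.println(requiredWaitMs("example.com", 0));    // first visit: 0
        System.out.println(requiredWaitMs("example.com", 300));  // 300ms later: 700
        System.out.println(requiredWaitMs("other.org", 300));    // different domain: 0
    }
}
```

Because the map is a `ConcurrentHashMap`, many worker threads can consult it without external locking, which fits the multi-threaded design described above.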
19 changes: 19 additions & 0 deletions documentation/program-execution-process.md
@@ -0,0 +1,19 @@
## Program's execution process

During the program's execution, the following steps take place:

1) The given command-line-arguments are validated and the related variables are set.
2) The input-json-file is parsed and the id-url pairs are loaded in batches.
3) Each id-url pair is added to the "callableTasksList" and is then processed by one of the running threads.
4) For each id-url pair, the url is checked against known problematic urls (those which give neither the landing-page of the publication nor the full-text (or dataset) itself).
5) Each url is prepared for connection. If it belongs to a "special-case" domain, a dedicated handler transforms it either into a full-text url or into a "workable" landing-page.
6) If the url gives a full-text (or dataset) file directly (even after redirections), the file will be saved (if such option is chosen) and the results will be queued to be written to the disk in due time.
7) If the url leads to a web-page, the following steps take place:
- The file-url presented in the `<meta>` (metadata) tags is checked for whether it actually gives the file.
- If the above step does not succeed, then the internal links are extracted from the page.
- During the links-extraction process, some likely-to-be-fulltext-links (based on text-mining of the surrounding text) are picked-up and connected immediately.
- Also, some invalid docUrls or irrelevant urls are identified and blocked before we give them a chance to be connected.
- After the extraction process, we loop through the internal-links list, identify some potential docOrDataset urls and check those first, before checking the rest.
- If a docUrl is verified, step 6 takes place. Otherwise, the negative result is checked for its "re-triable status" and written to the disk in due time.
8) Once all records of a batch are processed, the results of those records are written to the output-file.
9) Once all batches have finished being processed, the program exits.
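Steps 2-3 and 8 above could be sketched as follows. This is a minimal illustration, not the project's actual implementation: the class name, the stand-in url-check lambda, and the result-string format are all hypothetical, while "callableTasksList" is the name used in the steps above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the batch flow: each id-url pair becomes one Callable
// in the "callableTasksList", the thread-pool runs the whole batch, and the
// results are collected once the batch finishes.
public class BatchFlowSketch {

    public static List<String> processBatch(List<String[]> idUrlBatch, int threads) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        List<Callable<String>> callableTasksList = new ArrayList<>();
        for (String[] pair : idUrlBatch) {
            String id = pair[0], url = pair[1];
            callableTasksList.add(() -> id + " -> checked " + url);  // stands in for the real url-check
        }
        List<String> results = new ArrayList<>();
        for (Future<String> future : executor.invokeAll(callableTasksList))  // blocks until the batch is done
            results.add(future.get());
        executor.shutdown();
        return results;  // in the real program, these are queued and written to the output-file
    }

    public static void main(String[] args) throws Exception {
        List<String[]> batch = List.of(new String[]{"id1", "http://example.com/a"},
                                       new String[]{"id2", "http://example.com/b"});
        System.out.println(processBatch(batch, 2));
    }
}
```

`invokeAll` returns the futures in the same order as the submitted tasks, so each record's result stays aligned with its id-url pair when the batch is written out.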
18 changes: 10 additions & 8 deletions pom.xml
@@ -35,6 +35,7 @@
<goal>shade</goal>
</goals>
<configuration>
<finalName>publications_retriever-uber_jar-${version}</finalName>
<transformers>
<!-- add Main-Class to manifest file -->
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
@@ -78,6 +79,13 @@

<dependencies>

<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-api -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.36</version>
</dependency>

<!-- logback versions 1.4.X require Java-11 -->
<!-- logback versions 1.3.X require Java-8, but if this project is added as Dependency in a Spring Boot App, then Spring Boot throws an error, since it does not yet support logback 1.3.x -->

@@ -88,13 +96,6 @@
<version>1.2.13</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-api -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.36</version>
</dependency>

<!-- https://mvnrepository.com/artifact/ch.qos.logback/logback-classic -->
<dependency>
<groupId>ch.qos.logback</groupId>
@@ -164,6 +165,7 @@
</exclusions>
</dependency>

<!-- There is a newer version, but from v.1.4 onwards it requires Java-11. -->
<dependency>
<groupId>com.github.crawler-commons</groupId>
<artifactId>crawler-commons</artifactId>
@@ -192,7 +194,7 @@
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-engine</artifactId>
<version>5.10.1</version>
<version>5.10.2</version>
<scope>test</scope>
</dependency>

@@ -32,8 +32,11 @@ public class TestNonStandardInputOutput {
private static final Logger logger = LoggerFactory.getLogger(TestNonStandardInputOutput.class);

private static final String testingSubDir = "idUrlPairs"; // "idUrlPairs" or "justUrls".
private static final String testingDirectory = System.getProperty("user.dir") + File.separator + "testData" + File.separator + testingSubDir + File.separator;
private static final String testInputFile = "orderedList1000.json"; //"test_only_ids.json"; //"id_to_url_rand10000_20201015.json"; //"test_non_utf_output.json"; //"around_200k_IDs.json"; // "sampleCleanUrls3000.json", "orderedList1000.json", "orderedList5000.json", "testRandomNewList100.csv", "test.json", "id_to_url_rand10000_20201015.json"
private static final String testingDirectory = FileUtils.workingDir + "testData" + File.separator + testingSubDir + File.separator;
private static final String testInputFile = "orderedList500.json"; //"test_only_ids.json"; //"id_to_url_rand10000_20201015.json"; //"test_non_utf_output.json"; //"around_200k_IDs.json"; // "sampleCleanUrls3000.json", "orderedList1000.json", "orderedList5000.json", "testRandomNewList100.csv", "test.json", "id_to_url_rand10000_20201015.json"

// TODO - Add an extra cmd-argument to define the "upper-limit" of the number of records to process from a file.
// Thus, we will have a single 5000 or 10000 records file and process only the first N records, instead of having multiple files of smaller length.

private static final File inputFile = new File(testingDirectory + testInputFile);
private static File outputFile = new File(testingDirectory + "results_" + testInputFile); // This can change if the user gives the inputFile as a cmd-arg.
