Harvest links from the children of an element you specify on a webpage, using an XPath (which you can copy from your browser's inspect-element tools). This repository also contains a Python 3.5+ utility app, \py\LinkHarvest_Downloader.py, to process the resulting JSON and CSV files and download the links they contain.
- Get the harvested links as a JSON or CSV string.
- Download the links as a JSON or CSV file with LinkHarvest_.DownloadJson() or LinkHarvest_.DownloadCSV().
- Download all the links in the resulting files with the included Python 3.5+ app. Example:
py/LinkHarvest_Downloader.py --input C:/temp/LinksHarvested_2021-6-3_122311.json --output C:/temp/LinkHarvest --starting 0 --ending 5
Each harvested link is represented as an object of the following form:
{
    url: "https://archive.org/download/Aru.zip",
    filename: "Aru.zip",
    extension: "zip",
    status: 0
}
// Create a harvester and collect the child links under the given XPath
var LinkHarvest_ = new LinkHarvest();
LinkHarvest_.GetLinksFromXpath('/html/body/div/main/div[5]/div/div/div[1]/div[6]/div[8]/div');
// Get the Links as a JSON
var linkjson = LinkHarvest_.LinksToJson();
console.log(linkjson);
// Get the Links as a CSV
var linkcsv = LinkHarvest_.LinksToCSV();
console.log(linkcsv);
// Download Links document as Json file
LinkHarvest_.DownloadJson();
// Download Links document as CSV
LinkHarvest_.DownloadCSV();
Methods:

Name | Description | Args |
---|---|---|
GetLinksFromXpath | Get the child `<a href>` links under the XPath provided | xpath = string |
DownloadJson | Download the links document as a JSON file | - |
DownloadCSV | Download the links document as a CSV file | - |
LinksToJson | Get the links as a JSON string | - |
LinksToCSV | Get the links as a CSV string | - |
GetURLExtension | Get the file extension from a URL | url = string |
GetDateTime | Get the current date and time in YMMDD_HHmmss format | - |
Properties:

Name | Description | Default |
---|---|---|
links | Array containing the current links as URL strings. | - |
linksjs | Array containing the current links as JSON objects. | - |
csvstring | After running LinksToCSV or DownloadCSV, contains the current CSV string representing the links. | - |
csvheader | Change this to change the CSV header output. | LinkHarvest_.csvheader = "url,filename,extension,status"; |
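
The exported files are plain JSON/CSV and can be read by any tool. As a minimal sketch (not part of the library), assuming a CSV exported with the default header above:

```python
import csv

# Read a CSV exported by DownloadCSV(); assumes the default header
# "url,filename,extension,status" shown above.
with open("LinksHarvested_2021-6-3_122311.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["filename"], "->", row["url"])
```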
A Python 3.5+ utility app, \py\LinkHarvest_Downloader.py, to process the JSON or CSV files produced above and download the links they contain.

- Change directory to \py: `cd \py`
- Run `pip install -r requirements.txt` or `pip3 install -r requirements.txt`
- Run the downloader, for example:
LinkHarvest_Downloader.py --input C:/temp/LinksHarvested_2021-6-3_122311.json --output C:/temp/LinkHarvest --starting 0 --ending 5
LinkHarvest_Downloader.py --input C:/temp/LinksHarvested_2021-6-3_122311.csv --output C:/temp/LinkHarvest --starting 0 --ending 5 --searchstrings USA,En
Name | Description | Example |
---|---|---|
--input | JSON or CSV file to process, created by the Link Harvester JS class above. | C:/temp/LinksHarvested_2021-6-3_122311.json |
--output | Directory to save the resulting downloaded files to. | C:/temp/LinkHarvest |
--starting | The first index to download. | 0 |
--ending | The last index to download. | 10 |
--searchstrings | Comma-separated strings to search for in the file names. | EN,USA |
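
The real implementation lives in \py\LinkHarvest_Downloader.py; the sketch below only illustrates the processing the arguments above describe (load the link objects, keep indices --starting through --ending, optionally filter by --searchstrings, and download each URL into --output). The function name and the use of urllib are assumptions for illustration, not the tool's actual code, and the input JSON is assumed to be an array of the link objects shown earlier.

```python
import json
import urllib.request
from pathlib import Path

def download_links(input_path, output_dir, starting=0, ending=None, searchstrings=None):
    """Illustrative only: download link objects like those produced by LinksToJson()."""
    links = json.loads(Path(input_path).read_text(encoding="utf-8"))
    # --ending is described as the last index to download, so treat it as inclusive.
    links = links[starting:(ending + 1) if ending is not None else None]
    if searchstrings:
        terms = [s.strip() for s in searchstrings.split(",") if s.strip()]
        links = [link for link in links if any(t in link["filename"] for t in terms)]
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for link in links:
        urllib.request.urlretrieve(link["url"], out / link["filename"])

# Mirrors the CLI invocation above:
# download_links("C:/temp/LinksHarvested_2021-6-3_122311.json",
#                "C:/temp/LinkHarvest", starting=0, ending=5)
```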
A Python 3.5+ utility app, \py\LinkHarvest_ConvertIAList.py, to process a .txt file containing a list of filenames created by the Internet Archive CLI via `ia list collectionname > listdump.txt`.

An Internet Archive collection is identified by the last segment of its details URL: for https://archive.org/details/CAT_DATASET the collection name is CAT_DATASET.
- Change directory to \py: `cd \py`
- Run `pip install -r requirements.txt` or `pip3 install -r requirements.txt`
- Run the converter, for example:
python LinkHarvest_ConvertIAList.py --input "C:\temp\listdump.txt" --output "C:\temp\listdump.json" --collection "CAT_DATASET" --exporttype json --ignorestrings "tiff,flowers"
Name | Description | Example |
---|---|---|
--input | The .txt list file to process, created by the Internet Archive CLI as described above. | C:\temp\listdump.txt |
--output | Output file to write to. | C:\temp\listdump.json |
--collection | The collection you're downloading from. | CAT_DATASET |
--exporttype | Extension to save the output as (json or csv). | json or csv |
--ignorestrings | Comma-separated strings to ignore in the file names. | (Demo),Poop,.tiff |
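
For orientation only, the sketch below shows the kind of transformation these arguments describe: each filename in the ia list dump becomes a link object pointing at an archive.org download URL, skipping names that contain any --ignorestrings term. It assumes download URLs of the form https://archive.org/download/<collection>/<filename> and a JSON export; the real logic is in \py\LinkHarvest_ConvertIAList.py.

```python
import json
from pathlib import Path

def convert_ia_list(input_path, output_path, collection, ignorestrings=""):
    """Illustrative only: build link objects from an `ia list` text dump."""
    ignores = [s for s in ignorestrings.split(",") if s]
    links = []
    for name in Path(input_path).read_text(encoding="utf-8").splitlines():
        name = name.strip()
        if not name or any(term in name for term in ignores):
            continue
        links.append({
            "url": "https://archive.org/download/{}/{}".format(collection, name),
            "filename": name,
            "extension": name.rsplit(".", 1)[-1] if "." in name else "",
            "status": 0,
        })
    Path(output_path).write_text(json.dumps(links, indent=2), encoding="utf-8")

# Mirrors the CLI invocation above (JSON export only):
# convert_ia_list(r"C:\temp\listdump.txt", r"C:\temp\listdump.json",
#                 "CAT_DATASET", ignorestrings="tiff,flowers")
```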