Have issue to download the manifest data with the tsv file. #32

Open
yyuu33nn opened this issue Dec 13, 2024 · 1 comment

yyuu33nn commented Dec 13, 2024

Attempting to download: https://downloads.hmpdacc.org/dacc/hhs/genome/microbiome/wgs/analysis/hmgs/v1/PGA/SRS015072/SRS015072-annotation.slimmap.v2.bz2
File number 1
Filename: SRS015072-annotation.slimmap.v2.bz2
Attempt 1 failed: 403 Client Error: Forbidden for url: https://downloads.hmpdacc.org/dacc/hhs/genome/microbiome/wgs/analysis/hmgs/v1/PGA/SRS015072/SRS015072-annotation.slimmap.v2.bz2

I get this output when I try to download the data from HMP.

This is my current code:

import argparse
import os
import pandas as pd
import requests
from urllib.parse import urlparse
from tqdm import tqdm
from requests.exceptions import RequestException
import time

# Step 1: Initialize Argument Parser
my_parser = argparse.ArgumentParser(description='Make sure your files exist in the data folder!')

# Step 2: Add Argument for Input File
my_parser.add_argument('-i', '--input',
                       action='store',
                       metavar='input',
                       type=str,
                       help='The path to your manifest file (TSV format)')

# Step 3: Parse Arguments
args = my_parser.parse_args()

# Step 4: Read the Manifest File
f_name = args.input
manifest_df = pd.read_csv(f_name, sep='\t')
list_of_single_column = manifest_df['urls'].tolist()

# Initialize Counters and Lists for Success and Failure Tracking
cc = 0
success = []
failed = []

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Referer": "https://downloads.hmpdacc.org/"
}

print("Updated: 15_04_2024")
print("Tip: In your manifest file.tsv, if the URL (https://...) works in your browser, this script will help you.")

# Step 5: File Download Loop
for x in list_of_single_column:
    try:
        print(f"\nAttempting to download: {x}")

        a = urlparse(x)
        filename = os.path.basename(a.path)
        cc += 1
        print(f"File number {cc}")
        print(f"Filename: {filename}")

        # Retry mechanism
        max_retries = 3
        retry_delay = 5  # seconds

        for attempt in range(max_retries):
            try:
                # A timeout keeps a stalled connection from hanging the loop
                response = requests.get(x, headers=headers, stream=True, timeout=30)
                response.raise_for_status()  # Raise an error for HTTP issues

                total_size = int(response.headers.get('content-length', 0))

                with open(filename, 'wb') as f, tqdm(
                    desc=filename,
                    total=total_size,
                    unit='B',
                    unit_scale=True,
                    unit_divisor=1024,
                ) as bar:
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
                        bar.update(len(chunk))

                success.append(filename)
                print(f"Downloaded successfully: {filename}")
                break
            except RequestException as e:
                print(f"Attempt {attempt + 1} failed: {e}")
                if attempt < max_retries - 1:
                    time.sleep(retry_delay)
                else:
                    print("Max retries reached. Skipping this file.")
                    failed.append(filename)

    except Exception as e:
        print(f"Error downloading {filename}: {e}")
        failed.append(filename)

# Step 6: Save Successful and Failed Downloads to Separate Manifest Files
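# Compare exact basenames: an empty list joined into a regex would match every row
url_basenames = manifest_df['urls'].astype(str).map(lambda u: os.path.basename(urlparse(u).path))
successful = manifest_df[url_basenames.isin(success)]
failed_df = manifest_df[url_basenames.isin(failed)]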

successful.to_csv("successful_manifest.tsv", sep='\t', index=False, header=True)
failed_df.to_csv("failed_manifest.tsv", sep='\t', index=False, header=True)

print("\n## Finally finished! ##")
print(f"Successful downloads: {len(success)}")
print(f"Failed downloads: {len(failed)}")

leahfa commented Jan 13, 2025

Hi, I am having the same issue. Did you ever manage to resolve this?
Pretty sure we need to generate some token to allow the download, but I am not even managing to register in order to do this. I guess the site is just not maintained at all.
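
One thing that may help narrow this down: a 403 means the server received the request and refused it, so repeating the same request three times, as the retry loop in the script does, is unlikely to change the result. Below is a minimal sketch of a retry policy that stops early on such responses. The names `RETRYABLE_STATUS`, `should_retry`, `attempt_download`, and the `fetch` callable are hypothetical and not part of the original script, and the set of "transient" status codes is an assumption to adjust as needed.

```python
import time
from typing import Callable, Optional, Tuple

# Assumption: these codes are treated as transient; 403 is deliberately absent.
RETRYABLE_STATUS = frozenset({408, 429, 500, 502, 503, 504})


def should_retry(status_code: Optional[int]) -> bool:
    """Return True when repeating the same request might succeed.

    None stands for "no HTTP response at all" (connection error, timeout).
    """
    return status_code is None or status_code in RETRYABLE_STATUS


def attempt_download(
    fetch: Callable[[], Optional[int]],
    max_retries: int = 3,
    retry_delay: float = 5.0,
) -> Tuple[bool, int]:
    """Call fetch() until it returns 200, a non-retryable status, or retries run out.

    fetch is a hypothetical callable that performs one request and returns the
    HTTP status code, or None if no response was received.
    Returns (succeeded, attempts_made).
    """
    for attempt in range(1, max_retries + 1):
        status = fetch()
        if status == 200:
            return True, attempt
        if not should_retry(status):
            # e.g. 403 Forbidden: the server refused the request, so stop early.
            return False, attempt
        if attempt < max_retries:
            time.sleep(retry_delay)
    return False, max_retries
```

In the script above, the status code should be available as `e.response.status_code` when `raise_for_status()` raises a `requests.exceptions.HTTPError`, so the same check could be applied inside the existing `except RequestException` branch. This only avoids pointless retries; it does not resolve the 403 itself.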
