Updated CLI for Cambridge #4
base: main
Changes from 23 commits
@@ -0,0 +1,3 @@
CLI for Cambridge University Press.

The CLI finds the new releases for a specified subject and stores them as a .tsv file in a specified directory.
@@ -0,0 +1,28 @@
import click
from bs4 import BeautifulSoup
from utils import (
    get_page_content,
    find_links_and_tags,
    download_file
)


@click.command()
@click.option('--subject', prompt='Enter the subject', type=click.Choice(['computer science', 'engineering', 'mathematics', 'physics', 'science and engineering', 'statistics']))
@click.option('--directory', prompt='Enter the directory', type=click.Path(exists=True, file_okay=False, dir_okay=True))
def main(subject,directory):
    url = 'https://www.cambridge.org/core/services/librarians/kbart'
    response = get_page_content(url)

    if response:
        soup = BeautifulSoup(response.text, 'html.parser')
        prefix = 'cambridge ebooks and partner presses: 2023 '
        subjects = [subject]
        found_links = find_links_and_tags(soup, subjects, prefix)
        for link, subject in zip(found_links, subjects):
            full_link = 'https://www.cambridge.org' + link
            download_file(full_link, subject, directory)


if __name__ == "__main__":
    main()
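For reference, a minimal sketch of exercising this command without a shell, using click's CliRunner; the import path main and the /tmp directory are assumptions, since the captured diff does not show file names:

# Sketch: invoke the Cambridge command through click's test runner.
# The module name `main` is an assumption; adjust the import to the real file name.
from click.testing import CliRunner

from main import main

runner = CliRunner()
result = runner.invoke(
    main,
    ['--subject', 'computer science', '--directory', '/tmp'],
)
print(result.exit_code)  # 0 when the command ran without raising
print(result.output)     # anything the command printed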
@@ -0,0 +1,32 @@
import pytest
from utils import get_page_content, find_links_and_tags, download_file
from bs4 import BeautifulSoup
import os


@pytest.mark.vcr()
def test_page_content_valid():
    response = get_page_content('https://www.cambridge.org/core/services/librarians/kbart')
    assert response.status_code == 200


@pytest.mark.vcr()
def test_page_content_invalid():
    response = get_page_content('https://www.cambridge.org/core/services/librarians/kbartWontWork')
    assert response.status_code == 404


@pytest.mark.vcr
def test_found_links():
    url = 'https://www.cambridge.org/core/services/librarians/kbart'
    response = get_page_content(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    found_links = find_links_and_tags(soup, ['computer science'], 'cambridge ebooks and partner presses: 2023 ')
    assert not len(found_links) == 0


@pytest.mark.vcr
def test_download_file(tmp_path):
    url = 'https://www.cambridge.org/core/services/aop-cambridge-core/kbart/create/bespoke/717854B1C18FD5D0B882344E83E6F52B'
    desired_filename = 'computer science'
    target_filepath = str(tmp_path) + '/'
Review comment: are you sure that you need the + '/'? Since I see you use os.path.join in line 30, it's not needed, because with your solution …
    download_file(url, desired_filename, target_filepath)
    expected_filepath = os.path.join(target_filepath, f"{desired_filename}.tsv")
    assert os.path.exists(expected_filepath)
    assert os.path.getsize(expected_filepath) > 0
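To illustrate the point in the comment above: os.path.join only inserts a separator when one is missing, so the trailing '/' is redundant on the line that builds expected_filepath (the example paths below are made up):

import os

# Both calls produce the same path; os.path.join adds the '/' only if needed.
print(os.path.join('/tmp/pytest-0', 'computer science.tsv'))   # /tmp/pytest-0/computer science.tsv
print(os.path.join('/tmp/pytest-0/', 'computer science.tsv'))  # /tmp/pytest-0/computer science.tsv

Note that download_file itself still concatenates target_filepath + filename with plain string addition, so dropping the trailing '/' in the test is only safe together with the os.path.join change sketched further down.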
@@ -0,0 +1,36 @@
import requests
import structlog
logger = structlog.get_logger()
Review comment: maybe separate the imports and the logger variable with a blank line? It improves readability.
def get_page_content(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an HTTPError if the response status code indicates an error
Review comment: I don't think we need a comment here; the function raise_for_status() is quite informative.
        return response
    except requests.exceptions.ConnectionError as err:
        logger.error(err)
        return response
    except requests.exceptions.HTTPError as err:
        logger.error(err)
        return response


def find_links_and_tags(soup, subjects, prefix):
    found_links = []
    for subject in subjects:
        target_word = prefix + subject
        for tag in soup.find_all(string=lambda text: text and target_word in text.lower()):
            parent_tag = tag.parent
Review comment: what happens if the tag doesn't have a parent? Does it return None or crash? Say the parent tag is the root?
Review comment (follow-up): So should we have something like … or …?
Review comment (follow-up): To check, please write a simple function to see what happens when: … For my question, there are a few ways to check it: reading the docs and trying it manually.
Review comment (follow-up): @jdm010 ⬆️
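A small probe of what BeautifulSoup's .parent actually returns, sketched here to answer the question above (not part of the diff):

# Probe: what does .parent give back in the cases the thread asks about?
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a href='/x'>cambridge ebooks and partner presses: 2023 computer science</a>", 'html.parser')

text_node = soup.find(string=True)
print(text_node.parent.name)  # 'a' -> a string inside the parsed tree always has a parent tag

print(soup.parent)            # None -> the BeautifulSoup object itself (the root) has no parent

text_node.extract()           # detach the string from the tree
print(text_node.parent)       # None -> parent_tag.name would now raise AttributeError

So for strings coming out of find_all(string=...) on a freshly parsed page, tag.parent is a real tag, but checking that parent_tag is not None before touching parent_tag.name would keep the loop from crashing on those edge cases.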
            if parent_tag.name == 'a' and parent_tag.get('href'):
                link = parent_tag.get('href')
                found_links.append(link)
    return found_links

def download_file(url, desired_filename, target_filepath):
    response = requests.get(url)
    if response.status_code == 200:
        filename = f"{desired_filename}.tsv"
        with open(target_filepath + filename, 'wb') as file:
Review comment: you can use …
            file.write(response.content)
        print(f'Successfully downloaded {filename}')
    else:
        print(f"Error: Failed to download {desired_filename} ({response.status_code})")
@@ -0,0 +1,28 @@
from bs4 import BeautifulSoup
import click

from utils import (
    get_page_content,
    find_links_and_tags,
    download_file
)


@click.command()
@click.option('--subject', prompt='Enter the subject', type=click.Choice(['chemistry and materials science', 'computer science', 'engineering', 'mathematics and statistics', 'physics and astronomy']))
@click.option('--directory', prompt='Enter the directory', type=click.Path(exists=True, file_okay=False, dir_okay=True))
def main(subject,directory):
    url = 'https://adminportal.springernature.com/metadata/kbart'
    response = get_page_content(url)

    if response:
        soup = BeautifulSoup(response.text, 'html.parser')
        prefix = 'springer '
        suffix = ' ebooks 2023'
        subjects = [subject]
        found_links = find_links_and_tags(soup, subjects, prefix, suffix)
        for link, subject in zip(found_links, subjects):
            full_link = 'https://adminportal.springernature.com' + link
            download_file(full_link, subject, directory)


if __name__ == "__main__":
    main()
@@ -0,0 +1,36 @@
import pytest
from utils import get_page_content, find_links_and_tags, download_file
from bs4 import BeautifulSoup
import os


@pytest.mark.vcr()
def test_page_content_valid():
    response = get_page_content('https://adminportal.springernature.com/metadata/kbart')
    assert response.status_code == 200


@pytest.mark.vcr()
def test_page_content_invalid():
    response = get_page_content('https://adminportal.springernature.com/metadata/errkbart')
    assert response.status_code == 404


@pytest.mark.vcr
def test_found_links():
    url = 'https://adminportal.springernature.com/metadata/kbart'
    response = get_page_content(url)
    if response:
        soup = BeautifulSoup(response.text, 'html.parser')
        subjects = ['engineering']
        founds_links = find_links_and_tags(soup, subjects, 'springer ', ' ebooks 2023')
        assert len(founds_links) == 1


@pytest.mark.vcr
def test_download_file(tmp_path):
    url = 'https://adminportal.springernature.com/metadata/kbart/Springer_Global_Springer_Computer_Science_eBooks_2023_English+International_2023-08-01.txt'
    desired_filename = 'computer science'
    target_filepath = str(tmp_path) + '/'
Review comment: you don't need to add the trailing slash if you will use …
    download_file(url, desired_filename, target_filepath)

    expected_filepath = os.path.join(target_filepath, f"{desired_filename}.tsv")
    assert os.path.exists(expected_filepath)
    assert os.path.getsize(expected_filepath) > 0
@@ -0,0 +1,36 @@
import requests
import structlog
logger = structlog.get_logger()


def get_page_content(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an HTTPError if the response status code indicates an error
Review comment: same as above, I don't think we need a comment here; raise_for_status() is quite informative.
        return response
    except requests.exceptions.ConnectionError as err:
        logger.error(err)
        return response
    except requests.exceptions.HTTPError as err:
        logger.error(err)
        return response

def find_links_and_tags(soup, subjects, prefix, suffix):
    found_links = []
    for subject in subjects:
        target_word = prefix + subject + suffix
        for tag in soup.find_all(string=lambda text: text and target_word in text.lower()):
            parent_tag = tag.parent
            if parent_tag.name == 'a' and parent_tag.get('href'):
                link = parent_tag.get('href')
                found_links.append(link)
    return found_links

def download_file(url, desired_filename, target_filepath):
    response = requests.get(url)
    if response.status_code == 200:
        filename = f"{desired_filename}.tsv"
        with open(target_filepath + filename, 'wb') as file:
Review comment: can you …
            file.write(response.content)
        print(f'Successfully downloaded {filename}')
Review comment: instead of all the prints, please use the logger.
    else:
        print(f"Error: Failed to download {desired_filename} ({response.status_code})")
Review comment: please assert how many links are actually found instead of 'not len', which makes it a more precise test. Also, can you add a test where 0 links are found? You can simulate that response by reading it from a file: copy the XML to a file, remove the tags you are looking for, and use that file as the test input. Or you can try to find an XML that doesn't have one and use that.
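A sketch of the zero-links test described above, written against the Cambridge find_links_and_tags signature and using a small page written to tmp_path instead of a recorded response (the markup is made up; it just needs to contain no matching links):

# Sketch: a test where find_links_and_tags finds nothing, driven by a local
# file rather than a live or recorded response. The markup stands in for a
# KBART page with the target links removed.
from bs4 import BeautifulSoup

from utils import find_links_and_tags


def test_found_links_none(tmp_path):
    page = tmp_path / 'kbart_no_matches.html'
    page.write_text("<html><body><a href='/other'>unrelated link</a></body></html>")

    soup = BeautifulSoup(page.read_text(), 'html.parser')
    found_links = find_links_and_tags(
        soup, ['computer science'], 'cambridge ebooks and partner presses: 2023 '
    )

    # Nothing on the page matches prefix + subject, so the exact count is 0.
    assert len(found_links) == 0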