-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated CLI for Cambridge #4
base: main
Are you sure you want to change the base?
Conversation
Fetches files for relevant subjects from Cambridge University Press
File containing functions that will be called from other files
Retrieves data and modifies it, so that the CLI can compare its newly fetched files
CLI to fetch new releases.
Structure of code changed
CLI for Cambridge University Press. get_test_data.py must be run first to fetch and store files locally. These files are then randomly modified, so we can check for new releases. cambridge_cli.py can be used to fetch new releases for the subject specified by the user. Future versions of the code will be higher-level (where the user can specify the publisher in the CLI), and also be able to fetch releases for all subjects.
Springer CLI
src/providers/cambridge/utils.py
Outdated
return response.text | ||
except requests.exceptions.HTTPError as err: | ||
print(f"HTTP Error: Failed to fetch the webpage ({err})") | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to return None. Function without return statement, by default, returns None:
https://realpython.com/python-return-statement/#implicit-return-statements
Added tests
src/providers/cambridge/utils.py
Outdated
response.raise_for_status() # Raise an HTTPError if the response status code indicates an error | ||
return response.text | ||
except requests.exceptions.HTTPError as err: | ||
print(f"HTTP Error: Failed to fetch the webpage ({err})") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also change fprints to structlog. Here you can find step by step example: https://www.structlog.org/en/stable/bound-loggers.html#step-by-step-example
From docs, you can see that there are different log methods like: debug(), info(), warning(), error(), and critical(). In this case we need error().
Improved tests
Added an exception for get_page_content. The function now returns response rather than response.text. Replaced f-string with structlog.
Minor change to accommodate changes with utils.py
Utils for Springer
Created CLI for Springer
Tests for Springer
response = get_page_content(url) | ||
soup = BeautifulSoup(response.text, 'html.parser') | ||
found_links = find_links_and_tags(soup, ['computer science'], 'cambridge ebooks and partner presses: 2023 ') | ||
assert not len(found_links) == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, put how many links are actually found, instead of: not len
, which makes it a more precise test
Also, can you add a test, where is 0 links found? You can simulate that response, just read it from a file. Let's say copy the xml to the file and remove the tags you are looking for. And then you can use this file for a test input.
OR you can try to find an XML that doesn't have one and use this one
src/providers/cambridge/utils.py
Outdated
def get_page_content(url): | ||
try: | ||
response = requests.get(url) | ||
response.raise_for_status() # Raise an HTTPError if the response status code indicates an error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need a comment, the function raise_for_status() is quite informative
@@ -0,0 +1,36 @@ | |||
import requests | |||
import structlog | |||
logger = structlog.get_logger() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe separate imports and a variable with a new line? Improves the readability
for subject in subjects: | ||
target_word = prefix + subject | ||
for tag in soup.find_all(string=lambda text: text and target_word in text.lower()): | ||
parent_tag = tag.parent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if the tag doesn't have a parent? Does it return None or crash? let's say the parent tag is root?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should we have something like
else:
return None
or
else:
continue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To check, please write a simple function to see what happens when:
- Return None
- Using else: contunue
What is the difference?
For my question, there are few ways to check it: reading docs and trying it manually:
Please try to run a small function that reads super small HTML, and try to find the parent that doesn’t exist:
soup = BeautifulSoup("<p>Some</p>")
for tag in soup.find_all('p'):
tag.parent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jdm010 ⬆️
def test_download_file(tmp_path): | ||
url = 'https://www.cambridge.org/core/services/aop-cambridge-core/kbart/create/bespoke/717854B1C18FD5D0B882344E83E6F52B' | ||
desired_filename = 'computer science' | ||
target_filepath = str(tmp_path) + '/' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you sure that you need + "/" ? since I see you use os.path.join in line 30, it's not needed, because os.path.join adds
"/" after every string you called function with.
With your solution expected_filepath
has two backslashes after the value of target_filepath
.
src/providers/cambridge/utils.py
Outdated
response = requests.get(url) | ||
if response.status_code == 200: | ||
filename = f"{desired_filename}.tsv" | ||
with open(target_filepath + filename, 'wb') as file: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use os.path.join
for constructing the file path, instead of using +
src/providers/springer/test_utils.py
Outdated
def test_download_file(tmp_path): | ||
url = 'https://adminportal.springernature.com/metadata/kbart/Springer_Global_Springer_Computer_Science_eBooks_2023_English+International_2023-08-01.txt' | ||
desired_filename = 'computer science' | ||
target_filepath = str(tmp_path) + '/' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to add backslash if you will use os.path.join
in download_file
src/providers/springer/utils.py
Outdated
response = requests.get(url) | ||
if response.status_code == 200: | ||
filename = f"{desired_filename}.tsv" | ||
with open(target_filepath + filename, 'wb') as file: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you os.path.join
src/providers/springer/utils.py
Outdated
def get_page_content(url): | ||
try: | ||
response = requests.get(url) | ||
response.raise_for_status() # Raise an HTTPError if the response status code indicates an error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same: I don't think we need a comment, the function raise_for_status() is quite informative
Utils for Taylor & Francis
CLI for Taylor & Francis
Improved readability, added else statement for ind_links_and_tags
Improved test_found_links and added case with 0 links found
Better readability, improved find_links_and_tags
Improved test_found_links, added case with no links found.
Added a higher-level CLI that gives the option of choosing the publisher
file_path = os.path.join(target_filepath, filename) | ||
with open(file_path, 'wb') as file: | ||
file.write(response.content) | ||
print(f'Successfully downloaded {filename}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of all prints, please use logger
CLI downloads new releases for a chosen subject in a chosen directory as a .tsv file.