Updated CLI for Cambridge #4

Open · wants to merge 31 commits into main

Changes shown from 23 of 31 commits.

Commits
7ab11d6 Create cambridge_fetchfiles.py (jdm010, Aug 4, 2023)
f3ee332 Create functions.py (jdm010, Aug 10, 2023)
019a958 Create get_test_data.py (jdm010, Aug 10, 2023)
156459e Create cambridge_cli.py (jdm010, Aug 10, 2023)
c21dc98 Delete cambridge_fetchfiles.py (jdm010, Aug 10, 2023)
e8c0114 Merge pull request #1 from jdm010/jdm010-patch-1 (jdm010, Aug 16, 2023)
d972ee3 Create functions.py (jdm010, Aug 16, 2023)
cca5082 Add files via upload (jdm010, Aug 16, 2023)
649196d Merge pull request #2 from jdm010/jdm010-patch-1 (jdm010, Aug 16, 2023)
617be2c Create README.md (jdm010, Aug 16, 2023)
fd36c76 Create README.md (jdm010, Aug 16, 2023)
4b26b35 Delete src/providers/springer directory (jdm010, Aug 29, 2023)
4e9e590 Update and rename functions.py to utils.py (jdm010, Aug 29, 2023)
9bd378a Delete get_test_data.py (jdm010, Aug 29, 2023)
93b93ac Update and rename cambridge_cli.py to cli.py (jdm010, Aug 29, 2023)
27d8520 Update README.md (jdm010, Aug 29, 2023)
1867fb4 Create test_utils.py (jdm010, Aug 30, 2023)
3580917 Update test_utils.py (jdm010, Aug 31, 2023)
57be339 Update utils.py (jdm010, Aug 31, 2023)
4b1d16a Update cli.py (jdm010, Aug 31, 2023)
22b9559 Create utils.py (jdm010, Aug 31, 2023)
313b9dd Create cli.py (jdm010, Aug 31, 2023)
070d31f Create test_utils.py (jdm010, Aug 31, 2023)
a5374cc Create utils.py (jdm010, Sep 5, 2023)
7231777 Create cli.py (jdm010, Sep 5, 2023)
330c7e9 Update utils.py (jdm010, Sep 6, 2023)
13ce132 Update test_utils.py (jdm010, Sep 6, 2023)
2e85656 Update utils.py (jdm010, Sep 6, 2023)
979f7c1 Update test_utils.py (jdm010, Sep 6, 2023)
db4f301 Create cli.py (jdm010, Sep 6, 2023)
121fd15 Create README.md (jdm010, Sep 6, 2023)
3 changes: 3 additions & 0 deletions src/providers/cambridge/README.md
@@ -0,0 +1,3 @@
CLI for Cambridge University Press.

The CLI finds the new releases for a specified subject and stores them as a .tsv file in a specified directory.
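A hypothetical invocation (the module path and option values are assumed from the code in this PR):

    python cli.py --subject "computer science" --directory ./downloads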
28 changes: 28 additions & 0 deletions src/providers/cambridge/cli.py
@@ -0,0 +1,28 @@
import click
from bs4 import BeautifulSoup
from utils import (
    get_page_content,
    find_links_and_tags,
    download_file
)

@click.command()
@click.option('--subject', prompt='Enter the subject', type=click.Choice(['computer science', 'engineering', 'mathematics', 'physics', 'science and engineering', 'statistics']))
@click.option('--directory', prompt='Enter the directory', type=click.Path(exists=True, file_okay=False, dir_okay=True))
def main(subject, directory):
    url = 'https://www.cambridge.org/core/services/librarians/kbart'
    response = get_page_content(url)

    if response:
        soup = BeautifulSoup(response.text, 'html.parser')
        prefix = 'cambridge ebooks and partner presses: 2023 '
        subjects = [subject]
        found_links = find_links_and_tags(soup, subjects, prefix)
        for link, subject in zip(found_links, subjects):
            full_link = 'https://www.cambridge.org' + link
            download_file(full_link, subject, directory)

if __name__ == "__main__":
    main()
32 changes: 32 additions & 0 deletions src/providers/cambridge/test_utils.py
@@ -0,0 +1,32 @@
import pytest
from utils import get_page_content, find_links_and_tags, download_file
from bs4 import BeautifulSoup
import os

@pytest.mark.vcr()
def test_page_content_valid():
    response = get_page_content('https://www.cambridge.org/core/services/librarians/kbart')
    assert response.status_code == 200

@pytest.mark.vcr()
def test_page_content_invalid():
    response = get_page_content('https://www.cambridge.org/core/services/librarians/kbartWontWork')
    assert response.status_code == 404

@pytest.mark.vcr
def test_found_links():
    url = 'https://www.cambridge.org/core/services/librarians/kbart'
    response = get_page_content(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    found_links = find_links_and_tags(soup, ['computer science'], 'cambridge ebooks and partner presses: 2023 ')
    assert not len(found_links) == 0
ErnestaP (Aug 31, 2023):

Please assert how many links are actually found, instead of `not len(...) == 0`; an exact count makes the test more precise.

Also, can you add a test where 0 links are found? You can simulate that response by reading it from a file: copy the XML into a file, remove the tags you are looking for, and use that file as the test input. Or find an XML that doesn't contain a matching tag and use that one.
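A minimal sketch of the suggested zero-links test, assuming a saved page is read from a local fixture (the fixture path here is hypothetical):

    def test_found_links_empty():
        # Hypothetical fixture: a saved KBART page with the matching anchor tags removed
        with open('tests/data/kbart_no_matches.html') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        found_links = find_links_and_tags(soup, ['computer science'], 'cambridge ebooks and partner presses: 2023 ')
        assert len(found_links) == 0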


@pytest.mark.vcr
def test_download_file(tmp_path):
    url = 'https://www.cambridge.org/core/services/aop-cambridge-core/kbart/create/bespoke/717854B1C18FD5D0B882344E83E6F52B'
    desired_filename = 'computer science'
    target_filepath = str(tmp_path) + '/'
Review comment:

Are you sure you need the + "/"? Since you use os.path.join in line 30, it's not needed: os.path.join inserts a separator between the parts it joins.

With your solution, expected_filepath ends up with two slashes after the value of target_filepath.
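A minimal sketch of the suggested change, assuming download_file is also switched to os.path.join (as suggested in the comments on utils.py below):

    target_filepath = str(tmp_path)  # no trailing slash; os.path.join adds separators as needed
    download_file(url, desired_filename, target_filepath)
    expected_filepath = os.path.join(target_filepath, f"{desired_filename}.tsv")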

    download_file(url, desired_filename, target_filepath)
    expected_filepath = os.path.join(target_filepath, f"{desired_filename}.tsv")
    assert os.path.exists(expected_filepath)
    assert os.path.getsize(expected_filepath) > 0
36 changes: 36 additions & 0 deletions src/providers/cambridge/utils.py
@@ -0,0 +1,36 @@
import requests
import structlog
logger = structlog.get_logger()

Review comment:

Maybe separate the imports and the variable with a new line? It improves readability.


def get_page_content(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an HTTPError if the response status code indicates an error

Review comment:

I don't think we need the comment here; raise_for_status() is quite self-explanatory.

        return response
    except requests.exceptions.ConnectionError as err:
        logger.error(err)
        return response
    except requests.exceptions.HTTPError as err:
        logger.error(err)
        return response

def find_links_and_tags(soup, subjects, prefix):
    found_links = []
    for subject in subjects:
        target_word = prefix + subject
        for tag in soup.find_all(string=lambda text: text and target_word in text.lower()):
            parent_tag = tag.parent

Review comment:

What happens if the tag doesn't have a parent, say the parent tag is the root? Does it return None or crash?

Author (jdm010):

So should we have something like

            else:
                return None

or

            else:
                continue

Review comment:

To check, please write a simple function and see what happens when you:

  1. return None
  2. use else: continue

What is the difference?

For my question, there are a few ways to check: reading the docs or trying it manually. Please run a small function that parses a tiny HTML document and look up a parent that doesn't exist:

soup = BeautifulSoup("<p>Some</p>", "html.parser")
for tag in soup.find_all('p'):
    tag.parent

Review comment:

@jdm010 ⬆️
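For reference, a minimal check of the question above, based on my reading of bs4's behavior (worth verifying locally): a string matched by find_all(string=...) always has a parent, at worst the BeautifulSoup object itself, whose .name is '[document]'; only the root's own .parent is None.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p>Some</p>", "html.parser")
    tag = soup.find('p')
    print(tag.parent.name)  # '[document]': the BeautifulSoup root object
    print(soup.parent)      # None: only the root itself has no parent

So in find_links_and_tags, parent_tag should never be None for a matched string, though it can be the document root rather than an <a> tag.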

            if parent_tag.name == 'a' and parent_tag.get('href'):
                link = parent_tag.get('href')
                found_links.append(link)
    return found_links

def download_file(url, desired_filename, target_filepath):
    response = requests.get(url)
    if response.status_code == 200:
        filename = f"{desired_filename}.tsv"
        with open(target_filepath + filename, 'wb') as file:

Review comment:

You can use os.path.join to construct the file path, instead of using +.

            file.write(response.content)
        print(f'Successfully downloaded {filename}')
    else:
        print(f"Error: Failed to download {desired_filename} ({response.status_code})")
28 changes: 28 additions & 0 deletions src/providers/springer/cli.py
@@ -0,0 +1,28 @@
from bs4 import BeautifulSoup
import click

from utils import (
    get_page_content,
    find_links_and_tags,
    download_file
)

@click.command()
@click.option('--subject', prompt='Enter the subject', type=click.Choice(['chemistry and materials science', 'computer science', 'engineering', 'mathematics and statistics', 'physics and astronomy']))
@click.option('--directory', prompt='Enter the directory', type=click.Path(exists=True, file_okay=False, dir_okay=True))
def main(subject, directory):
    url = 'https://adminportal.springernature.com/metadata/kbart'
    response = get_page_content(url)

    if response:
        soup = BeautifulSoup(response.text, 'html.parser')
        prefix = 'springer '
        suffix = ' ebooks 2023'
        subjects = [subject]
        found_links = find_links_and_tags(soup, subjects, prefix, suffix)
        for link, subject in zip(found_links, subjects):
            full_link = 'https://adminportal.springernature.com' + link
            download_file(full_link, subject, directory)

if __name__ == "__main__":
    main()
36 changes: 36 additions & 0 deletions src/providers/springer/test_utils.py
@@ -0,0 +1,36 @@
import pytest
from utils import get_page_content, find_links_and_tags, download_file
from bs4 import BeautifulSoup
import os

@pytest.mark.vcr()
def test_page_content_valid():
    response = get_page_content('https://adminportal.springernature.com/metadata/kbart')
    assert response.status_code == 200

@pytest.mark.vcr()
def test_page_content_invalid():
    response = get_page_content('https://adminportal.springernature.com/metadata/errkbart')
    assert response.status_code == 404

@pytest.mark.vcr
def test_found_links():
    url = 'https://adminportal.springernature.com/metadata/kbart'
    response = get_page_content(url)
    if response:
        soup = BeautifulSoup(response.text, 'html.parser')
        subjects = ['engineering']
        found_links = find_links_and_tags(soup, subjects, 'springer ', ' ebooks 2023')
        assert len(found_links) == 1

@pytest.mark.vcr
def test_download_file(tmp_path):
    url = 'https://adminportal.springernature.com/metadata/kbart/Springer_Global_Springer_Computer_Science_eBooks_2023_English+International_2023-08-01.txt'
    desired_filename = 'computer science'
    target_filepath = str(tmp_path) + '/'

Review comment:

You don't need to add the slash if you use os.path.join in download_file.


    download_file(url, desired_filename, target_filepath)

    expected_filepath = os.path.join(target_filepath, f"{desired_filename}.tsv")
    assert os.path.exists(expected_filepath)
    assert os.path.getsize(expected_filepath) > 0
36 changes: 36 additions & 0 deletions src/providers/springer/utils.py
@@ -0,0 +1,36 @@
import requests
import structlog
logger = structlog.get_logger()

def get_page_content(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an HTTPError if the response status code indicates an error

Review comment:

Same as above: I don't think we need the comment; raise_for_status() is quite self-explanatory.

        return response
    except requests.exceptions.ConnectionError as err:
        logger.error(err)
        return response
    except requests.exceptions.HTTPError as err:
        logger.error(err)
        return response

def find_links_and_tags(soup, subjects, prefix, suffix):
    found_links = []
    for subject in subjects:
        target_word = prefix + subject + suffix
        for tag in soup.find_all(string=lambda text: text and target_word in text.lower()):
            parent_tag = tag.parent
            if parent_tag.name == 'a' and parent_tag.get('href'):
                link = parent_tag.get('href')
                found_links.append(link)
    return found_links

def download_file(url, desired_filename, target_filepath):
    response = requests.get(url)
    if response.status_code == 200:
        filename = f"{desired_filename}.tsv"
        with open(target_filepath + filename, 'wb') as file:

Review comment:

Can you use os.path.join here?

            file.write(response.content)
        print(f'Successfully downloaded {filename}')

Review comment:

Instead of all the prints, please use the logger.

    else:
        print(f"Error: Failed to download {desired_filename} ({response.status_code})")