Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scrapy price monitor update #14

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

further-reading
Copy link

Updating the scrapy price monitor example project as part of ecommerce content review.

Some notable changes from the original:

  • Only scrapes books.toscrape.com to prevent need to update site specific scrapers
  • For interacting with collections I used my collection helper to simplify it
  • I use an ItemLoader for handling the item to allow for addign standard processors
  • I use price-parser to handle extraction of the price from string

Old version is still present in the _scrapy_price_monitor_OLD directory.


class CollectionHelper:
"""Adapter to make interacting with scraping collection easier"""
def __init__(self, proj_id, collection_name, api_key=None, create=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only problem I see with this is that it will force people to use our Collections, I'm not sure if it's needed, maybe we can just leave it out? It will make code easier to use by newbies. If someone new to Scrapy will start using it without apikey he will get errors, he might not know what to do to tix it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The project is intended to run on scrapy cloud, which will default to its enviornment apikey when api_key is None. The alternative would require setting up persistance to keep some sort of database which is far more complex.

for name, urls in products.items():
for url in urls:
if self.name in url:
now = datetime.now().strftime('%Y/%m/%d %H:%M:%S')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just use isoformat here

datetime.now().isoformat()
'2021-11-19T09:30:27.341132'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually why is there need to pass item in meta? just for showing how to pass in meta? product name is already available on product page, retailer is static and now can also be taken from product page.

Output is somewhat confusing

{'product_name': 'soumission', 'retailer': 'books.toscrape.com', 'when': '2021/11/19 09:23:19', 'url': 'https://books.toscrape.com/catalogue/soumission_998/index.html', 'name': 'Soumission', 'price': 50.1}

so there is product name twice

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's possible that the name on the website may not exactly be the name on the json. The json name seems to be more of a label to indicate what product the urls should be linking to.

SHUB_PROJ_ID = os.getenv('SHUB_JOBKEY', '999999').split('/')[0]

ITEM_PIPELINES = {
'price_monitor.pipelines.CollectionStoragePipeline': 400,
}

# settings for Amazon SES email service
AWS_ACCESS_KEY = os.getenv('$AWS_ACCESS_KEY')
AWS_SECRET_KEY = os.getenv('$AWS_SECRET_KEY')
EMAIL_ALERT_FROM = 'Price Monitor <[email protected]>'
EMAIL_ALERT_TO = ['[email protected]']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you tried if e-mail sending actually works?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I don't personally have an AWS email service set up for this so I wasn't able to test if this part still works.

@pawelmhm
Copy link
Member

In my opinion we could skip e-mail sending, as this is extra configuration step and it's good to keep things easy for newcomers. We can just print somethign in bin/monitor without sending e-mail and add comment asking users to configure somethign, they can add slack integration, e-mail whaterver.

@further-reading
Copy link
Author

further-reading commented Nov 19, 2021

In my opinion we could skip e-mail sending, as this is extra configuration step and it's good to keep things easy for newcomers. We can just print somethign in bin/monitor without sending e-mail and add comment asking users to configure somethign, they can add slack integration, e-mail whaterver.

I'm not certain if the intended audience of the post are total newcomers. One of the main selling points of the blog appears to be the email step so that the user gets passive notifications and I'm not sure if it's still a valuable sample if it defaults to a log you have to manually check every 30 minutes.

I wonder if maybe a disclaimer explaining that this notification script is using an external service that can be changed and the user could write an alternative would be better?

Update: I updated the alert step to seperate the ASN relevant code to make it easier for a user to update and/or replace this content for their preferred alert approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants