SEO Crawler

A bare-bones SEO crawler built with Python and Scrapy.

This project has become part of the advertools package; check out the documentation page for details.

Using Scrapy, the crawler extracts the main SEO elements of a website for exploratory analysis. It works by taking a list of known URLs to crawl and returning structured results.
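
As a minimal sketch of that workflow using the advertools API (the URLs and output file name below are placeholders), a crawl can be run and its structured results loaded into a DataFrame:

```python
import advertools as adv
import pandas as pd

# List of known URLs to crawl (placeholder URLs for illustration)
url_list = [
    'https://example.com/',
    'https://example.com/blog/',
]

# Crawl the supplied URLs and write structured results to a jsonlines file
adv.crawl(url_list, 'seo_crawl.jl', follow_links=False)

# Load the results into a DataFrame for exploratory analysis
crawl_df = pd.read_json('seo_crawl.jl', lines=True)
print(crawl_df.head())
```

The output file uses the jsonlines format, which is why it is read back with `lines=True`.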

The main elements include:

  • url: the actual URL
  • slug: the URI (path) part of the URL
  • directories: the URI split by slashes, giving the different folders (directories) in each URL
  • title: the <title> tag
  • h1, h2, h3, h4: the header tags
  • description: the meta description
  • link_urls: not activated by default; needs special configuration to make sure you are getting links to certain sites
  • link_text: depends on the above; the anchor text of each link
  • link_count: the number of links on the page (based on your criteria)
  • load_time: page load time in seconds
  • status_code: the response status code of the page (200, 301, 404, etc.)
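
The sketch below shows how a bare-bones Scrapy spider could extract these elements; the spider name, start URL, and exact selectors are illustrative and not the project's actual code:

```python
import scrapy


class SEOSpider(scrapy.Spider):
    # Placeholder name and start URL for illustration
    name = 'seo_spider'
    start_urls = ['https://example.com/']

    def parse(self, response):
        yield {
            'url': response.url,
            # URI part of the URL (everything after the domain)
            'slug': response.url.split('/', 3)[-1],
            # Folders (directories) in the URL path
            'directories': [d for d in response.url.split('/')[3:] if d],
            'title': response.xpath('//title/text()').get(),
            'h1': response.xpath('//h1//text()').getall(),
            'h2': response.xpath('//h2//text()').getall(),
            'h3': response.xpath('//h3//text()').getall(),
            'h4': response.xpath('//h4//text()').getall(),
            'description': response.xpath(
                '//meta[@name="description"]/@content').get(),
            'link_urls': response.xpath('//a/@href').getall(),
            'link_text': response.xpath('//a//text()').getall(),
            'link_count': len(response.xpath('//a/@href').getall()),
            # Scrapy records download latency in the request/response meta
            'load_time': response.meta.get('download_latency'),
            'status_code': response.status,
        }
```

Such a spider can be run with, for example, `scrapy runspider seo_spider.py -o output.csv` to write the extracted elements to a CSV file.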

Many other elements can be added to the list, but they differ from site to site. Some examples:

  • publishing date
  • product price
  • content category
  • tags of an article
  • whether or not a certain keyword appears in a certain location
  • type of content (inferred from a URL directory, or from certain content on page)
  • etc.
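
As an illustration of how such site-specific fields might be added (the selectors, spider name, and keyword below are hypothetical and would have to be adapted to the target site's markup), the yielded item could be extended like this:

```python
import scrapy


class SiteSpecificSpider(scrapy.Spider):
    # Placeholder name and start URL for illustration
    name = 'site_specific'
    start_urls = ['https://example.com/blog/some-article']

    def parse(self, response):
        title = response.xpath('//title/text()').get()
        path_parts = [p for p in response.url.split('/')[3:] if p]
        yield {
            'url': response.url,
            'title': title,
            # Hypothetical selectors; adapt them to the target site's HTML
            'pub_date': response.xpath(
                '//meta[@property="article:published_time"]/@content').get(),
            'price': response.css('span.product-price::text').get(),
            # Content category inferred from the first URL directory
            'category': path_parts[0] if path_parts else None,
            # Whether a certain keyword appears in the title
            'keyword_in_title': 'seo' in (title or '').lower(),
        }
```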