This repository has been archived by the owner on Jan 30, 2023. It is now read-only.

Merge pull request #38 from spatie/upgrade-crawler
Upgrade crawler
brendt authored Mar 20, 2018
2 parents cdf5847 + cbf36c9 commit 500099b
Showing 8 changed files with 161 additions and 83 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,11 @@

 All notable changes to `laravel-link-checker` will be documented in this file

+## 4.0.0 - 2018-03-20
+
+- Update to `spatie/crawler ^4.0`
+- Drop `illuminate/support ~5.5.0` support
+
 ## 3.0.0 - 2018-02-12

 - Update to `spatie/crawler ^3.0`
50 changes: 35 additions & 15 deletions README.md
@@ -141,50 +141,71 @@ By default the package will log all broken links. If you want to have them maile
 ## Creating your own crawl profile
 A crawlprofile determines which links need to be crawled. By default `Spatie\LinkChecker\CheckAllLinks` is used,
 which will check all links it finds. This behaviour can be customized by specifying a class in the `default_profile`-option in the config file.
-The class must implement the `Spatie\Crawler\CrawlProfile`-interface:
+The class must extend the abstract class `Spatie\Crawler\CrawlProfile`:

 ```php

-interface CrawlProfile
+abstract class CrawlProfile
 {
     /**
      * Determine if the given url should be crawled.
      *
-     * @param \Spatie\Crawler\Url $url
+     * @param \Psr\Http\Message\UriInterface $url
      *
      * @return bool
      */
-    public function shouldCrawl(Url $url);
+    abstract public function shouldCrawl(UriInterface $url): bool;
 }
 ```
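To make the change concrete, here is a minimal sketch of a custom profile written against the new abstract class. The class name `OnlyCurrentHost` and its constructor are illustrative inventions, not part of the package:

```php
<?php

use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfile;

// Hypothetical profile: only crawl urls that live on a given host.
class OnlyCurrentHost extends CrawlProfile
{
    /** @var string */
    protected $host;

    public function __construct(string $host)
    {
        $this->host = $host;
    }

    public function shouldCrawl(UriInterface $url): bool
    {
        // PSR-7 uris expose their host directly.
        return $url->getHost() === $this->host;
    }
}
```

Registering a profile like this would then be a matter of pointing the `default_profile` option in the config file at the class.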

 ## Creating your own reporter
 A reporter determines what should be done when a link is crawled and when the crawling process is finished.
 This package provides two reporters: `Spatie\LinkChecker\Reporters\LogBrokenLinks` and `Spatie\LinkChecker\Reporters\MailBrokenLinks`.
-You can create your own behaviour by making a class adhere to the `Spatie\Crawler\CrawlObserver`-interface:
+You can create your own behaviour by making a class extend the abstract class `Spatie\Crawler\CrawlObserver`:

 ```php
-interface CrawlObserver
+abstract class CrawlObserver
 {
     /**
      * Called when the crawler will crawl the url.
      *
-     * @param \Spatie\Crawler\Url $url
+     * @param \Psr\Http\Message\UriInterface $url
      */
-    public function willCrawl(Url $url);
+    public function willCrawl(UriInterface $url)
+    {
+    }
+
+    /**
+     * Called when the crawler has crawled the given url successfully.
+     *
+     * @param \Psr\Http\Message\UriInterface $url
+     * @param \Psr\Http\Message\ResponseInterface $response
+     * @param \Psr\Http\Message\UriInterface|null $foundOnUrl
+     */
+    abstract public function crawled(
+        UriInterface $url,
+        ResponseInterface $response,
+        ?UriInterface $foundOnUrl = null
+    );

     /**
-     * Called when the crawler has crawled the given url.
+     * Called when the crawler had a problem crawling the given url.
      *
-     * @param \Spatie\Crawler\Url $url
-     * @param \Psr\Http\Message\ResponseInterface|null $response
+     * @param \Psr\Http\Message\UriInterface $url
+     * @param \GuzzleHttp\Exception\RequestException $requestException
+     * @param \Psr\Http\Message\UriInterface|null $foundOnUrl
      */
-    public function hasBeenCrawled(Url $url, $response);
+    abstract public function crawlFailed(
+        UriInterface $url,
+        RequestException $requestException,
+        ?UriInterface $foundOnUrl = null
+    );

     /**
      * Called when the crawl has ended.
      */
-    public function finishedCrawling();
+    public function finishedCrawling()
+    {
+    }
 }
 ```
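As a rough sketch, a custom reporter built on the new base class only has to implement the two abstract hooks; `willCrawl` and `finishedCrawling` already have empty default bodies. The `EchoObserver` name and its behaviour are invented for illustration:

```php
<?php

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObserver;

// Hypothetical observer that simply prints every crawl result.
class EchoObserver extends CrawlObserver
{
    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null
    ) {
        echo "{$response->getStatusCode()} - {$url}\n";
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null
    ) {
        echo "FAILED - {$url}: {$requestException->getMessage()}\n";
    }
}
```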

@@ -197,7 +218,6 @@ Please see [CHANGELOG](CHANGELOG.md) for more information what has changed recen

 ## Testing

-
 First, start the test server in a separate terminal session:
 ``` bash
 cd tests/server
5 changes: 5 additions & 0 deletions UPGRADING.md
@@ -2,6 +2,11 @@

 Because there are many breaking changes an upgrade is not that easy. There are many edge cases this guide does not cover. We accept PRs to improve this guide.

+## From 3.0 to 4.0
+
+- `spatie/crawler` is updated to `^4.0`. This version made changes to the way custom `Profiles` and `Observers` are made. Please see the [UPGRADING](https://github.com/spatie/crawler/blob/master/UPGRADING.md) guide of `spatie/crawler` to know how to update any custom crawl profiles or observers - if you have any.
+- Laravel 5.5 support is dropped.

## From 2.0 to 3.0

- `spatie/crawler` is updated to `^3.0`. This version introduced the use of PSR-7 `UriInterface` instead of a custom `Url` class. Please see the [UPGRADING](https://github.com/spatie/crawler/blob/master/UPGRADING.md) guide of `spatie/crawler` to know how to update any custom crawl profiles - if you have any.
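For custom observers, the core of the 3.0 to 4.0 change is that the single `hasBeenCrawled` hook is split into separate success and failure hooks. A hedged sketch of the new shape, assuming a hypothetical `MyObserver` that previously handled both cases in one method:

```php
<?php

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObserver;

// Hypothetical 4.0-style observer: the old hasBeenCrawled($url, $response)
// logic is now divided between crawled() and crawlFailed().
class MyObserver extends CrawlObserver
{
    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null
    ) {
        // Previously: the branch of hasBeenCrawled() where $response was set.
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null
    ) {
        // Previously: the branch where $response was null or an error occurred.
    }
}
```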
4 changes: 2 additions & 2 deletions composer.json
@@ -21,8 +21,8 @@
     ],
     "require": {
         "php" : "^7.1",
-        "illuminate/support": "~5.5.0|~5.6.0",
-        "spatie/crawler": "^3.0"
+        "illuminate/support": "~5.6.0",
+        "spatie/crawler": "^4.0.3"
     },
     "require-dev": {
         "phpunit/phpunit" : "^6.0|^7.0",
45 changes: 31 additions & 14 deletions src/Reporters/BaseReporter.php
@@ -2,10 +2,12 @@

 namespace Spatie\LinkChecker\Reporters;

+use GuzzleHttp\Exception\RequestException;
+use Psr\Http\Message\ResponseInterface;
 use Psr\Http\Message\UriInterface;
 use Spatie\Crawler\CrawlObserver;

-abstract class BaseReporter implements CrawlObserver
+abstract class BaseReporter extends CrawlObserver
 {
     const UNRESPONSIVE_HOST = 'Host did not respond';
@@ -15,26 +17,41 @@ abstract class BaseReporter implements CrawlObserver

     protected $urlsGroupedByStatusCode = [];

     /**
-     * Called when the crawler will crawl the url.
+     * Called when the crawler has crawled the given url successfully.
      *
-     * @param \Psr\Http\Message\UriInterface $url
+     * @param \Psr\Http\Message\UriInterface $url
+     * @param \Psr\Http\Message\ResponseInterface $response
+     * @param null|\Psr\Http\Message\UriInterface $foundOnUrl
+     *
+     * @return int|string
      */
-    public function willCrawl(UriInterface $url)
-    {
-    }
+    public function crawled(
+        UriInterface $url,
+        ResponseInterface $response,
+        ?UriInterface $foundOnUrl = null
+    ) {
+        $statusCode = $response->getStatusCode();
+
+        if (!$this->isExcludedStatusCode($statusCode)) {
+            $this->urlsGroupedByStatusCode[$statusCode][] = $url;
+        }
+
+        return $statusCode;
+    }

     /**
-     * Called when the crawler has crawled the given url.
+     * Called when the crawler had a problem crawling the given url.
      *
-     * @param \Psr\Http\Message\UriInterface $url
-     * @param \Psr\Http\Message\ResponseInterface|null $response
-     * @param \Psr\Http\Message\UriInterface $foundOnUrl
-     *
-     * @return string
+     * @param \Psr\Http\Message\UriInterface $url
+     * @param \GuzzleHttp\Exception\RequestException $requestException
+     * @param \Psr\Http\Message\UriInterface|null $foundOnUrl
      */
-    public function hasBeenCrawled(UriInterface $url, $response, ?UriInterface $foundOnUrl = null)
-    {
-        $statusCode = $response ? $response->getStatusCode() : static::UNRESPONSIVE_HOST;
+    public function crawlFailed(
+        UriInterface $url,
+        RequestException $requestException,
+        ?UriInterface $foundOnUrl = null
+    ) {
+        $statusCode = $requestException->getCode();

         if (!$this->isExcludedStatusCode($statusCode)) {
             $this->urlsGroupedByStatusCode[$statusCode][] = $url;
77 changes: 45 additions & 32 deletions src/Reporters/LogBrokenLinks.php
@@ -2,6 +2,8 @@

 namespace Spatie\LinkChecker\Reporters;

+use GuzzleHttp\Exception\RequestException;
+use Psr\Http\Message\ResponseInterface;
 use Psr\Http\Message\UriInterface;
 use Psr\Log\LoggerInterface;
@@ -14,38 +16,6 @@ public function __construct(LoggerInterface $log)
         $this->log = $log;
     }

-    /**
-     * Called when the crawler has crawled the given url.
-     *
-     * @param \Psr\Http\Message\UriInterface $url
-     * @param \Psr\Http\Message\ResponseInterface|null $response
-     * @param \Psr\Http\Message\UriInterface $foundOnUrl
-     *
-     * @return string
-     */
-    public function hasBeenCrawled(UriInterface $url, $response, ?UriInterface $foundOnUrl = null)
-    {
-        $statusCode = parent::hasBeenCrawled($url, $response);
-
-        if ($this->isSuccessOrRedirect($statusCode)) {
-            return;
-        }
-
-        if ($this->isExcludedStatusCode($statusCode)) {
-            return;
-        }
-
-        $reason = $response ? $response->getReasonPhrase() : '';
-
-        $logMessage = "{$statusCode} {$reason} - {$url}";
-
-        if ($foundOnUrl) {
-            $logMessage .= " (found on {$foundOnUrl}";
-        }
-
-        $this->log->warning($logMessage);
-    }

     /**
      * Called when the crawl has ended.
      */
@@ -71,4 +41,47 @@ public function finishedCrawling()

         });
     }
+
+    /**
+     * Called when the crawler had a problem crawling the given url.
+     *
+     * @param \Psr\Http\Message\UriInterface $url
+     * @param \GuzzleHttp\Exception\RequestException $requestException
+     * @param \Psr\Http\Message\UriInterface|null $foundOnUrl
+     */
+    public function crawlFailed(
+        UriInterface $url,
+        RequestException $requestException,
+        ?UriInterface $foundOnUrl = null
+    ) {
+        parent::crawlFailed($url, $requestException, $foundOnUrl);
+
+        $statusCode = $requestException->getCode();
+
+        if ($this->isExcludedStatusCode($statusCode)) {
+            return;
+        }
+
+        $this->log->warning(
+            $this->formatLogMessage($url, $requestException, $foundOnUrl)
+        );
+    }
+
+    protected function formatLogMessage(
+        UriInterface $url,
+        RequestException $requestException,
+        ?UriInterface $foundOnUrl = null
+    ): string {
+        $statusCode = $requestException->getCode();
+
+        $reason = $requestException->getMessage();
+
+        $logMessage = "{$statusCode} {$reason} - {$url}";
+
+        if ($foundOnUrl) {
+            $logMessage .= " (found on {$foundOnUrl}";
+        }
+
+        return $logMessage;
+    }
 }
50 changes: 34 additions & 16 deletions src/Reporters/MailBrokenLinks.php
@@ -2,7 +2,9 @@

 namespace Spatie\LinkChecker\Reporters;

+use GuzzleHttp\Exception\RequestException;
 use Illuminate\Contracts\Mail\Mailer;
+use Psr\Http\Message\ResponseInterface;
 use Psr\Http\Message\UriInterface;

 class MailBrokenLinks extends BaseReporter
@@ -22,22 +24,6 @@ public function __construct(Mailer $mail)
         $this->mail = $mail;
     }

-    /**
-     * Called when the crawler has crawled the given url.
-     *
-     * @param \Psr\Http\Message\UriInterface $url
-     * @param \Psr\Http\Message\ResponseInterface|null $response
-     * @param \Psr\Http\Message\UriInterface $foundOnUrl
-     *
-     * @return string
-     */
-    public function hasBeenCrawled(UriInterface $url, $response, ?UriInterface $foundOnUrl = null)
-    {
-        $url->foundOnUrl = $foundOnUrl;
-
-        return parent::hasBeenCrawled($url, $response, $foundOnUrl);
-    }

/**
* Called when the crawl has ended.
*/
@@ -55,4 +41,36 @@ public function finishedCrawling()
             $message->subject(config('laravel-link-checker.reporters.mail.subject'));
         });
     }
+
+    /**
+     * Called when the crawler has crawled the given url successfully.
+     *
+     * @param \Psr\Http\Message\UriInterface $url
+     * @param \Psr\Http\Message\ResponseInterface $response
+     * @param \Psr\Http\Message\UriInterface|null $foundOnUrl
+     */
+    public function crawled(
+        UriInterface $url,
+        ResponseInterface $response,
+        ?UriInterface $foundOnUrl = null
+    ) {
+        $url->foundOnUrl = $foundOnUrl;
+
+        return parent::crawled($url, $response, $foundOnUrl);
+    }
+
+    /**
+     * Called when the crawler had a problem crawling the given url.
+     *
+     * @param \Psr\Http\Message\UriInterface $url
+     * @param \GuzzleHttp\Exception\RequestException $requestException
+     * @param \Psr\Http\Message\UriInterface|null $foundOnUrl
+     */
+    public function crawlFailed(
+        UriInterface $url,
+        RequestException $requestException,
+        ?UriInterface $foundOnUrl = null
+    ) {
+        return;
+    }
 }
8 changes: 4 additions & 4 deletions tests/LogBrokenUrlsTest.php
@@ -15,8 +15,8 @@ public function it_can_report_broken_urls_in_the_log()

         $this->app[Kernel::class]->call('link-checker:run', ['--url' => $this->appUrl]);

-        $this->assertLogContainsTextAfterLastMarker('400 Bad Request - http://localhost:4020/400');
-        $this->assertLogContainsTextAfterLastMarker('500 Internal Server Error - http://localhost:4020/500');
+        $this->assertLogContainsTextAfterLastMarker('GET http://localhost:4020/400` resulted in a `400 Bad Request`');
+        $this->assertLogContainsTextAfterLastMarker('GET http://localhost:4020/500` resulted in a `500 Internal Server Error`');
         $this->assertLogContainsTextAfterLastMarker('link checker summary');
         $this->assertLogContainsTextAfterLastMarker('Crawled 1 url(s) with statuscode 400');
         $this->assertLogContainsTextAfterLastMarker('Crawled 1 url(s) with statuscode 500');
@@ -32,8 +32,8 @@ public function it_does_not_report_excluded_status_codes()

         $this->app['config']->set('laravel-link-checker.reporters.exclude_status_codes', [500]);
         $this->app[Kernel::class]->call('link-checker:run', ['--url' => $this->appUrl]);

-        $this->assertLogContainsTextAfterLastMarker('400 Bad Request - http://localhost:4020/400');
-        $this->assertLogNotContainsTextAfterLastMarker('500 Internal Server Error - http://localhost:4020/500');
+        $this->assertLogContainsTextAfterLastMarker('GET http://localhost:4020/400` resulted in a `400 Bad Request`');
+        $this->assertLogNotContainsTextAfterLastMarker('GET http://localhost:4020/500` resulted in a `500 Internal Server Error`');
         $this->assertLogContainsTextAfterLastMarker('link checker summary');
         $this->assertLogContainsTextAfterLastMarker('Crawled 1 url(s) with statuscode 400');
         $this->assertLogNotContainsTextAfterLastMarker('Crawled 1 url(s) with statuscode 500');
