Documentation for crwlr / crawler (v1.0)

Attention: You're currently viewing the documentation for v1.0 of the crawler package.
This is not the latest version of the package.

HTTP Steps

The Http step implements the LoadingStepInterface and automatically receives the crawler's Loader when added to the crawler.
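
For context, here's a minimal sketch of how an Http step typically ends up in a crawler. The MyCrawler class and the bot name are just illustrative assumptions; the point is that you never pass the loader to the step yourself:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

// Hypothetical example crawler; HttpCrawler provides an HTTP loader out of the box.
class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new BotUserAgent('MyBot');
    }
}

$crawler = new MyCrawler();

// The Http::get() step automatically receives the crawler's loader when it is added.
$crawler->input('https://www.example.com/')
    ->addStep(Http::get());

$crawler->runAndTraverse();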

HTTP Requests

There are static methods to get steps for all the different HTTP methods:

use Crwlr\Crawler\Steps\Loading\Http;

Http::get();
Http::post();
Http::put();
Http::patch();
Http::delete();

They all have optional parameters for headers, a body (if available for the method), and the HTTP version:

use Crwlr\Crawler\Steps\Loading\Http;

Http::get(array $headers = [], string $httpVersion = '1.1');

Http::post(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1'
);

Http::put(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);

Http::patch(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);

Http::delete(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);
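
For example, to send a POST request with a JSON body and custom headers, it could look roughly like this (the URL, header, and body values are just placeholders):

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/api/search')
    ->addStep(
        Http::post(
            ['Accept' => 'application/json', 'Content-Type' => 'application/json'],
            json_encode(['query' => 'example']),
        )
    );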

Error Responses

By default, error responses (HTTP status code 4xx and 5xx) are not passed on to the next step in the crawler. If you want to also cascade error responses down to the next step, you can call the yieldErrorResponses() method:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/broken-link')
    ->addStep(
        Http::get()->yieldErrorResponses()
    )
    ->addStep(...);

Another default behavior is that the crawler keeps on crawling after error responses (except for some special behavior in the case of a 429 HTTP response; see the Politeness page). If it's important for your crawler that none of the requests fail, call the stopOnErrorResponse() method, and the step will throw a LoadingException when it receives an error response.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/broken-link')
    ->addStep(
        Http::get()->stopOnErrorResponse()
    )
    ->addStep(...);

Paginating List Pages

A typical challenge when crawling is listings with multiple pages on different URLs. A convenient way to solve this is Paginators. Here's a simple example of how to use one:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/some/listing')
    ->addStep(
        Http::get()->paginate('#pages')
    );

As the first argument, the paginate() method takes either a CSS selector string or an instance of the PaginatorInterface (more about this below). With a CSS selector, the method creates an instance of the SimpleWebsitePaginator class. The selector can either target the link to the next page or just the element containing all the pagination links. The SimpleWebsitePaginator remembers all the URLs it has already loaded, so it won't load any link twice. But keep in mind that pages may not be loaded in the correct order when you select a pagination wrapper element.

As the second argument, the paginate() method takes the maximum number of pages it will load. If you don't provide a value yourself, the default is 1000.
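
For example, assuming the listing's next-page link can be selected with .pagination a.next (a hypothetical selector for your page's markup), and you want to stop after at most 100 pages:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/some/listing')
    ->addStep(
        Http::get()->paginate('.pagination a.next', 100)
    );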

Custom Paginators

The SimpleWebsitePaginator is currently the only Paginator shipped with the package, but if it doesn't fit your needs, you can write your own. A Paginator has to implement the PaginatorInterface. You can also extend the AbstractPaginator class, which comes with a constructor taking a max pages argument and a default implementation of the PaginatorInterface::prepareRequest() method that just returns the incoming request unchanged.

use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http\Paginators\AbstractPaginator; // namespace assumed for v1.0
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\UriInterface;
use Psr\Log\LoggerInterface;

class CustomPaginator extends AbstractPaginator
{
    public function hasFinished(): bool
    {
        // This method is called after each page load to check if we're finished loading all pages.
    }

    public function getNextUrl(): ?string
    {
        // Return the next URL that should be loaded, or null if there is no further page to load.
    }

    public function prepareRequest(
        RequestInterface $request,
        ?RespondedRequest $previousResponse = null,
    ): RequestInterface {
        // Here you can manipulate a request before it is sent.
        // So you can e.g. also solve use cases where serving different pages is done via POST requests.

        // But implementing this method is optional when you extend the AbstractPaginator class!
    }

    public function processLoaded(
        UriInterface $url,
        RequestInterface $request,
        ?RespondedRequest $respondedRequest,
    ): void {
        // This method is called after a page was loaded.
        // Here you can process the response and get further links to load.
    }

    public function logWhenFinished(LoggerInterface $logger): void
    {
        // This method is called when hasFinished() returned true. Here you can log some messages if you want to.
    }
}

You can then use your Paginator class like this:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/some/listing')
    ->addStep(
        Http::get()->paginate(new CustomPaginator())
    );

Or, as another example, if you've built a Paginator for a use case where different pages are served based on POST parameters:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/some/listing')
    ->addStep(
        Http::post()->paginate(new MyPostParamPaginator())
    );

Crawling (whole Websites)

If you want to crawl a whole website, the Http::crawl() step is for you. By default, it just follows all the links it finds until everything on the same host is loaded.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(Http::crawl());

Depth

You can also tell it to only follow links to a certain depth.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->depth(2)
    );

This means it will load all the URLs it finds on the page from the initial input (in this case https://www.example.com/), then all the links it finds on those pages, and then it'll stop. With a depth of 3, it will load one more level of newly found links.

Start with a sitemap

By using the inputIsSitemap() method, you can start crawling with a sitemap.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/sitemap.xml')
    ->addStep(
        Http::crawl()
            ->inputIsSitemap()
    );

The crawl step usually assumes that all input URLs will deliver HTML documents, so if you want to start crawling with a sitemap, calling this method is necessary.

Load URLs on the same domain (instead of host)

As mentioned, by default it loads all pages on the same host, so for example www.example.com. If there's a link to https://jobs.example.com/foo, it won't follow that link, as it is on the host jobs.example.com. But you can tell it to also load all URLs on the same domain, using the sameDomain() method:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->sameDomain()
    );

Only load URLs matching path criteria

There are two methods that you can use to tell it to only load URLs with certain paths:

pathStartsWith()

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathStartsWith('/foo/')
    );

In this case, it will only load found URLs whose path starts with /foo/, so for example https://www.example.com/foo/bar, but not https://www.example.com/other/bar.

pathMatches()

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathMatches('/\/bar\//')
    );

The pathMatches() method takes a regex to match the paths of found URLs. So in this case it will load all URLs containing /bar/ anywhere in the path.

Custom Filtering based on URL or Link Element

The customFilter() method allows you to define your own callback function that will be called with any found URL or link:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->customFilter(function (Url $url) {
                return $url->scheme() === 'https';
            })
    );

So, this example will only load URLs where the URL scheme is https.

In case the URL was found in an HTML document (not in a sitemap), the Closure also receives the link element as a Symfony DomCrawler instance as the second argument:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;
use Symfony\Component\DomCrawler\Crawler;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->customFilter(function (Url $url, ?Crawler $linkElement) {
                return $linkElement && str_contains($linkElement->innerText(), 'Foo');
            })
    );

So, this example will only load links when the link text contains Foo.

Load all URLs but yield only matching

When restricting crawling, e.g. to only paths starting with /foo/, it will only load matching URLs (after the initial input URL). So if some page /some/page contains a link to /foo/quz, that link won't be found, because /some/page itself won't be loaded. If you want to find all links matching your criteria on the whole website, but yield only the responses of the matching URLs, you can use the loadAllButYieldOnlyMatching() method.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathStartsWith('/foo')
            ->loadAllButYieldOnlyMatching()
    );

This works for restrictions defined using the path methods (pathStartsWith() and pathMatches()) and also for the customFilter() method. Of course, it doesn't affect depth or staying on the same host or domain.
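
For example, combined with a custom filter (the path criterion used here is just an illustrative assumption), this could look like:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->customFilter(function (Url $url) {
                // Hypothetical criterion: yield only pages whose path contains /jobs/.
                return str_contains((string) $url->path(), '/jobs/');
            })
            ->loadAllButYieldOnlyMatching()
    );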