Library for Rapid (Web) Crawler and Scraper Development

This package provides a lightweight framework and many ready-to-use building blocks, called steps, that you can combine to build your own crawlers and scrapers. But first, let's clarify what those two terms mean.

What's the Difference between Crawling and Scraping?

For most use cases, the two go hand in hand, which is why this library supports and combines both.

What is a Crawler?

Animated visualization of a crawling procedure

A (web) crawler is a program that (down)loads documents and follows the links in them to load the linked documents as well. A crawler could simply follow every link it finds (and is allowed to load according to the robots.txt file); it would then end up loading the whole internet, provided the URLs it starts with are not dead ends. Alternatively, it can be restricted to links matching certain criteria (same domain/host, URL path starting with "/foo", ...) or to a certain depth. A depth of 3 means 3 levels deep: links found on the initial URLs provided to the crawler are level 1, links found on those level 1 documents are level 2, and so on.
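To make the depth and filter rules above concrete, here is a minimal sketch in plain PHP. It is not this library's implementation: the function crawlToDepth(), its parameters, and the in-memory $graph (a stand-in for actually loading documents over HTTP) are all hypothetical names chosen for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Breadth-first crawl over a hypothetical in-memory link graph.
 *
 * @param array<string, string[]> $linkGraph URL => links found in that document
 * @param string[] $startUrls initial URLs (level 0)
 * @return array<string, int> loaded URL => level at which it was found
 */
function crawlToDepth(array $linkGraph, array $startUrls, int $maxDepth, string $pathPrefix = '/'): array
{
    $levels = array_fill_keys($startUrls, 0);
    $queue = $startUrls;

    while ($queue !== []) {
        $url = array_shift($queue);

        if ($levels[$url] >= $maxDepth) {
            continue; // Links in this document would exceed the depth limit.
        }

        foreach ($linkGraph[$url] ?? [] as $link) {
            $path = parse_url($link, PHP_URL_PATH) ?: '/';

            if (isset($levels[$link]) || !str_starts_with($path, $pathPrefix)) {
                continue; // Already loaded, or filtered out by the path criterion.
            }

            $levels[$link] = $levels[$url] + 1;
            $queue[] = $link;
        }
    }

    return $levels;
}

// Hypothetical site structure used only for this example.
$graph = [
    'https://example.com/' => ['https://example.com/foo', 'https://example.com/bar'],
    'https://example.com/foo' => ['https://example.com/foo/1'],
    'https://example.com/foo/1' => ['https://example.com/foo/2'],
];

// Depth 2: "/foo/2" is only reachable at level 3, so it is not loaded.
$levels = crawlToDepth($graph, ['https://example.com/'], 2);

// Depth 3, restricted to paths starting with "/foo": "/bar" is skipped.
$fooOnly = crawlToDepth($graph, ['https://example.com/'], 3, '/foo');
```

A real crawler would load each URL and parse its links instead of looking them up in an array, but the bookkeeping of levels and filters stays the same.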

What is a Scraper?

Visualization of extracting data from a document

A scraper extracts data from a document. Crawling only gets you the documents that you're looking for, but in most use cases you also want to extract certain data from those documents, which is called scraping.
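As a rough illustration of what extracting data from a document means, the following sketch uses only PHP's bundled DOM extension and does not involve this library. The scrape() helper and the sample HTML are hypothetical. It uses XPath queries because DOMXPath does not understand CSS selectors.

```php
<?php

declare(strict_types=1);

/**
 * Extract the trimmed text of the first node matching each XPath query.
 *
 * @param array<string, string> $queries property name => XPath query
 * @return array<string, ?string> property name => text, or null if nothing matched
 */
function scrape(string $html, array $queries): array
{
    $document = new DOMDocument();

    // Collect libxml warnings (e.g. about HTML5 tags like <article>) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $data = [];

    foreach ($queries as $property => $query) {
        $nodes = $xpath->query($query);
        $node = $nodes === false ? null : $nodes->item(0);
        $data[$property] = $node === null ? null : trim($node->textContent);
    }

    return $data;
}

// Made-up sample document for this example.
$html = '<html><body><article><h1> Hello World </h1>'
    . '<span class="date">2022-04-01</span></article></body></html>';

$data = scrape($html, [
    'title' => '//article/h1',
    'date' => '//article//*[@class="date"]',
    'author' => '//article//*[@class="articleAuthor"]', // not present in the sample
]);

var_export($data);
```

Here title and date are found, while author is null because the sample document has no matching element.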

That being said, this project prefers the term crawling, but most of the time it also includes scraping. The class that you need to extend is called Crawler, but it covers both crawling and scraping.

Requirements

Requires PHP version 8.1 or above.

Installation

composer require crwlr/crawler

Usage

To build a crawler, you always need to create your own class extending the Crawler or HttpCrawler class. In a class extending HttpCrawler, you need to define at least a user agent for your crawler, which can be just a name for your crawler/bot or any browser user-agent string.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}

A very simple example to extract data from some articles that we get from a listing of articles would look like this:

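use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();                             // The class defined above

$crawler->input('https://www.example.com/articles');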

$crawler->addStep(Http::get())                          // Load the listing page
    ->addStep(Html::getLinks('#artList .article a'))    // Get the links to the articles
    ->addStep(Http::get())                              // Load the article pages
    ->addStep(
        Html::first('article')                          // Extract the data
            ->extract([
                'title' => 'h1',
                'date' => '.date',
                'author' => '.articleAuthor'
            ])
            ->addKeysToResult()
    );

foreach ($crawler->run() as $result) {
    // Do something with the Result
}

As you can see, a central concept is the so-called "steps". The key to using this library is understanding how data flows through those steps.
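One way to picture that data flow is a chain of functions where every output of one step becomes an input of the next. The sketch below is plain PHP and only a mental model, not how the library is implemented. The runSteps() function, the $pages array, and the two closures are hypothetical stand-ins for real loading and extraction steps.

```php
<?php

declare(strict_types=1);

/**
 * Feed every input through the steps in order.
 * Each step receives one input and returns zero or more outputs.
 *
 * @param iterable<mixed> $inputs
 * @param array<callable> $steps
 */
function runSteps(iterable $inputs, array $steps): Generator
{
    if ($steps === []) {
        yield from $inputs; // No steps left: these are the final results.

        return;
    }

    $step = array_shift($steps);

    foreach ($inputs as $input) {
        // Every output of this step becomes an input of the remaining steps.
        yield from runSteps($step($input), $steps);
    }
}

// Stand-in for a listing page and the links found on it.
$pages = ['/articles' => ['/articles/1', '/articles/2']];

// Stand-in for "get the links": one input, several outputs.
$getLinks = fn (string $url): array => $pages[$url] ?? [];

// Stand-in for "extract the data": one input, one output.
$extract = fn (string $url): array => [
    ['url' => $url, 'title' => 'Article ' . basename($url)],
];

// The second argument "false" discards generator keys, which repeat across "yield from" calls.
$results = iterator_to_array(runSteps(['/articles'], [$getLinks, $extract]), false);
```

A single input URL fans out into one result per article link, which is why one listing page in the example above produces several results.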

Assuming the listing contains 3 articles, running the article crawler above on the command line produces output like this:

08:57:40:123456 [INFO] Loaded https://www.example.com/robots.txt
08:57:40:234567 [INFO] Wait 0.0xs for politeness.
08:57:41:123456 [INFO] Loaded https://www.example.com/articles
08:57:41:234567 [INFO] Select links with CSS selector: #artList .article a
08:57:41:345678 [INFO] Wait 0.0xs for politeness.
08:57:42:123456 [INFO] Loaded https://www.example.com/articles/1
08:57:42:234567 [INFO] Extracted properties title, date, author from document.
08:57:42:345678 [INFO] Wait 0.0xs for politeness.
08:57:43:123456 [INFO] Loaded https://www.example.com/articles/2
08:57:43:234567 [INFO] Wait 0.0xs for politeness.
08:57:44:123456 [INFO] Loaded https://www.example.com/articles/3
08:57:44:234567 [INFO] Extracted properties title, date, author from document.
08:57:44:345678 [INFO] Extracted properties title, date, author from document.

As you can see, a lot is already built in. By default, the HttpCrawler uses the PoliteHttpLoader, which obeys the rules defined in a robots.txt file if the requested host has one. It also automatically ensures that the crawler doesn't put too much load on the server being crawled by waiting a little between requests. The wait time depends on how long the latest request took to be answered, so if the server starts to respond more slowly, the crawler also waits longer between requests.
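The idea of a wait time that follows the server's response time can be sketched as below. This is only an illustration in plain PHP under assumed values: the functions politeDelay() and loadPolitely(), the factor, and the bounds are hypothetical and do not reflect the actual settings or code of the PoliteHttpLoader.

```php
<?php

declare(strict_types=1);

/**
 * Compute how long to wait (in seconds) before the next request.
 * The factor and the bounds are hypothetical values chosen for illustration.
 */
function politeDelay(
    float $lastResponseSeconds,
    float $factor = 1.0,
    float $min = 0.25,
    float $max = 10.0
): float {
    // A slower response leads to a longer wait, clamped to [$min, $max].
    return max($min, min($max, $lastResponseSeconds * $factor));
}

/**
 * Load URLs one after another, pausing between requests.
 *
 * @param string[] $urls
 * @param callable $load hypothetical function that loads one URL and returns the response
 * @return array<string, mixed> URL => response
 */
function loadPolitely(array $urls, callable $load): array
{
    $responses = [];
    $lastElapsed = null;

    foreach ($urls as $url) {
        if ($lastElapsed !== null) {
            // Wait based on how long the previous request took.
            usleep((int) round(politeDelay($lastElapsed) * 1_000_000));
        }

        $start = microtime(true);
        $responses[$url] = $load($url);
        $lastElapsed = microtime(true) - $start;
    }

    return $responses;
}
```

With these assumed defaults, a response that took 2 seconds leads to a 2 second pause, while very fast and very slow responses are clamped to 0.25 and 10 seconds respectively.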

If you don't want to use those features, you can use a different loader.