Documentation for crwlr / crawler (v1.1)

Attention: You're currently viewing the documentation for v1.1 of the crawler package.
This is not the latest version of the package.

Step Output Filters

Any step that extends the abstract Step class shipped with the package has the where() and orWhere() methods to filter its outputs. Here's an example of how to use them:

use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Json;

$json = <<<JSON
{
    "queenAlbums": [
        { "title": "Queen", "year": 1973, "charts": { "uk": 24, "us": 83 } },
        { "title": "Queen II", "year": 1974, "charts": { "uk": 5, "us": 49 } },
        { "title": "A Night at the Opera", "year": 1975, "charts": { "uk": 1, "us": 4 } },
        { "title": "A Day at the Races", "year": 1976, "charts": { "uk": 1, "us": 5 } },
        { "title": "The Game", "year": 1980, "charts": { "uk": 1, "us": 1 } },
        { "title": "A Kind of Magic", "year": 1986, "charts": { "uk": 1, "us": 46 } }
    ]
}
JSON;

$crawler = new MyCrawler();

$crawler->input($json);

$crawler->addStep(
    Json::each('queenAlbums', ['title', 'year', 'chartsUK' => 'charts.uk', 'chartsUS' => 'charts.us'])
        ->where('year', Filter::greaterThan(1979))
        ->where('chartsUS', Filter::equal(1))
);

As you can see, you always need to provide a Filter object. That shouldn't be complicated, though, as the Filter class has a static method for every available filter.

In the example, the result is only the album "The Game", as it's the only one in the list that was released after 1979 and reached #1 in the US charts.
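To make the chaining behaviour concrete, here is a plain-PHP sketch that reproduces the same selection with array_filter(), without using the crawler package. It assumes, based on the result described above, that every chained where() condition has to pass; the variable names are made up for illustration.

```php
<?php

// Flattened album data, shaped like the output of the Json::each() step above.
$albums = [
    ['title' => 'Queen', 'year' => 1973, 'chartsUK' => 24, 'chartsUS' => 83],
    ['title' => 'Queen II', 'year' => 1974, 'chartsUK' => 5, 'chartsUS' => 49],
    ['title' => 'A Night at the Opera', 'year' => 1975, 'chartsUK' => 1, 'chartsUS' => 4],
    ['title' => 'A Day at the Races', 'year' => 1976, 'chartsUK' => 1, 'chartsUS' => 5],
    ['title' => 'The Game', 'year' => 1980, 'chartsUK' => 1, 'chartsUS' => 1],
    ['title' => 'A Kind of Magic', 'year' => 1986, 'chartsUK' => 1, 'chartsUS' => 46],
];

// Both conditions have to hold (assumption: chained where() calls are combined with AND).
$filtered = array_filter(
    $albums,
    fn (array $album): bool => $album['year'] > 1979 && $album['chartsUS'] === 1
);

// array_column() re-indexes the result, so this prints a list with one title: The Game
print_r(array_column($filtered, 'title'));
```

"A Kind of Magic" also passes the year condition, but fails the US charts condition, so it is dropped.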

The first parameter is the key in the step's output array (or object). If the step outputs only a single value that is not an array or object, you can pass just the Filter:

use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Html;

$crawler->addStep(
    Html::getLink('.linkClass')
        ->where(Filter::urlDomain('crwlr.software'))
);

As mentioned, there is also orWhere(). So, in the Queen albums example from above, you can also do:

use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Json;

$crawler->addStep(
    Json::each('queenAlbums', ['title', 'year', 'chartsUK' => 'charts.uk', 'chartsUS' => 'charts.us'])
        ->where('year', Filter::greaterThan(1979))
        ->where('chartsUS', Filter::equal(1))
        ->orWhere('chartsUK', Filter::equal(1))
);

This will also get "A Kind of Magic", as it was released after 1979 and reached #1 in the UK.
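The stated result ("The Game" plus "A Kind of Magic", but not the earlier UK #1 albums) fits a reading where orWhere() is an alternative to the where() call directly before it, i.e. year > 1979 AND (chartsUS = 1 OR chartsUK = 1). The following plain-PHP sketch puts that reading next to the other possible one. The grouping is an assumption derived from the described result, not taken from the package source, and the variable names are hypothetical.

```php
<?php

$albums = [
    ['title' => 'Queen', 'year' => 1973, 'chartsUK' => 24, 'chartsUS' => 83],
    ['title' => 'Queen II', 'year' => 1974, 'chartsUK' => 5, 'chartsUS' => 49],
    ['title' => 'A Night at the Opera', 'year' => 1975, 'chartsUK' => 1, 'chartsUS' => 4],
    ['title' => 'A Day at the Races', 'year' => 1976, 'chartsUK' => 1, 'chartsUS' => 5],
    ['title' => 'The Game', 'year' => 1980, 'chartsUK' => 1, 'chartsUS' => 1],
    ['title' => 'A Kind of Magic', 'year' => 1986, 'chartsUK' => 1, 'chartsUS' => 46],
];

// Small helper to reduce filtered rows to a plain list of titles.
$titles = fn (array $rows): array => array_column($rows, 'title');

// Reading 1 (assumed): orWhere() is an alternative only to the where() right before it.
$grouped = array_filter(
    $albums,
    fn (array $a): bool => $a['year'] > 1979 && ($a['chartsUS'] === 1 || $a['chartsUK'] === 1)
);

// Reading 2: orWhere() is an alternative to everything that came before it.
$ungrouped = array_filter(
    $albums,
    fn (array $a): bool => ($a['year'] > 1979 && $a['chartsUS'] === 1) || $a['chartsUK'] === 1
);

print_r($titles($grouped));   // The Game, A Kind of Magic
print_r($titles($ungrouped)); // A Night at the Opera, A Day at the Races, The Game, A Kind of Magic
```

Only the first reading matches the result described in this section. If you depend on the exact grouping, it is worth verifying it against a small input like this one.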

Negating filters

You can invert any filter by calling the negate() method, which is available on every filter.

use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Sitemap;

Sitemap::getUrlsFromSitemap()
    ->where(
        Filter::urlPathStartsWith('/foo')->negate()
    );

So, this step will get only those URLs from a sitemap whose path doesn't start with /foo.
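As a rough illustration of what negating such a check means, here is a plain-PHP sketch that does not use the package. The example URLs and the $pathStartsWithFoo helper are hypothetical, and the sketch assumes a simple string prefix comparison on the URL path.

```php
<?php

$urls = [
    'https://www.example.com/foo/bar',
    'https://www.example.com/food',
    'https://www.example.com/blog/foo',
    'https://www.example.com/',
];

// Stand-in for a "path starts with /foo" check (hypothetical helper, not part of the package).
$pathStartsWithFoo = fn (string $url): bool =>
    str_starts_with((string) parse_url($url, PHP_URL_PATH), '/foo');

// Negating the check keeps exactly the URLs that the original check would drop.
$kept = array_values(
    array_filter($urls, fn (string $url): bool => !$pathStartsWithFoo($url))
);

print_r($kept); // https://www.example.com/blog/foo and https://www.example.com/
```

Note that with a plain prefix comparison, /food also counts as starting with /foo, so it is excluded here as well.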

Available Filters

Comparison Filters

use Crwlr\Crawler\Steps\Filters\Filter;

Filter::equal(mixed $toValue);
Filter::notEqual(mixed $value);
Filter::greaterThan(mixed $value);
Filter::greaterThanOrEqual(mixed $value);
Filter::lessThan(mixed $value);
Filter::lessThanOrEqual(mixed $value);

String Filters

use Crwlr\Crawler\Steps\Filters\Filter;

Filter::stringContains(string $string);                 // uses PHP's str_contains()
Filter::stringStartsWith(string $string);               // str_starts_with()
Filter::stringEndsWith(string $string);                 // str_ends_with()
Filter::stringLengthEqual(int $length);                 // strlen($outputValue) === $length
Filter::stringLengthNotEqual(int $length);              // strlen($outputValue) !== $length
Filter::stringLengthGreaterThan(int $length);           // strlen($outputValue) > $length
Filter::stringLengthGreaterThanOrEqual(int $length);    // strlen($outputValue) >= $length
Filter::stringLengthLessThan(int $length);              // strlen($outputValue) < $length
Filter::stringLengthLessThanOrEqual(int $length);       // strlen($outputValue) <= $length
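Since the length filters are described in terms of strlen(), keep in mind that PHP's strlen() returns the number of bytes, not the number of characters. For UTF-8 strings with non-ASCII characters the two differ, as this small standalone sketch shows (the example string is arbitrary):

```php
<?php

// "Motörhead" has 9 characters, but the "ö" (written as \u{00F6}) takes 2 bytes in UTF-8.
$name = "Mot\u{00F6}rhead";

var_dump(strlen($name));       // int(10): the length in bytes
var_dump(strlen($name) === 9); // bool(false): a length check against 9 would not match
```

If you need character counts instead, one option could be a custom filter callback (see the Custom Filter Callback section) built on mb_strlen(), provided the mbstring extension is available.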

URL Filters

use Crwlr\Crawler\Steps\Filters\Filter;

Filter::urlScheme(string $scheme);              // e.g. http, https, ftp,...
Filter::urlHost(string $host);                  // www.crwlr.software
Filter::urlDomain(string $domain);              // crwlr.software
Filter::urlPath(string $path);                  // /exact/path
Filter::urlPathStartsWith(string $pathStart);   // /foo
Filter::urlPathMatches(string $regex);          // Regex (without delimiters) that the path has to match.
                                                // Like: ^/\d{1,5}/
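To get a feeling for how a delimiter-less pattern like the one above behaves, here is a plain-PHP sketch that extracts the path with parse_url() and matches it with preg_match(). The pathMatches() helper and the "~" delimiter are assumptions made for this sketch; how the package wraps the pattern internally is not covered here.

```php
<?php

// Hypothetical helper: wraps a delimiter-less pattern and tests it against the URL's path.
function pathMatches(string $url, string $regex): bool
{
    $path = (string) parse_url($url, PHP_URL_PATH);

    // "~" is an arbitrary delimiter choice; a pattern containing "~" would need escaping.
    return preg_match('~' . $regex . '~', $path) === 1;
}

$regex = '^/\d{1,5}/';

var_dump(pathMatches('https://www.crwlr.software/12345/foo', $regex));  // bool(true)
var_dump(pathMatches('https://www.crwlr.software/123456/foo', $regex)); // bool(false): six digits
var_dump(pathMatches('https://www.crwlr.software/packages', $regex));   // bool(false): no digits
```

Because the pattern is anchored with ^, it only matches when the path begins with one to five digits followed by a slash.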

Custom Filter Callback

use Crwlr\Crawler\Steps\Filters\Filter;

Filter::custom(function (mixed $outputValue) {
    if (/* $outputValue should be passed on */) {
        return true;
    }

    return false; // Throw this output away
});
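As a concrete example of such a callback, the following standalone sketch keeps only non-empty strings. The closure has the same shape as the one passed to Filter::custom() above; array_filter() merely stands in for the crawler so the snippet runs without the package, and the sample values are made up.

```php
<?php

// Receives one output value; returns true to pass it on, false to throw it away.
$isNonEmptyString = function (mixed $outputValue): bool {
    return is_string($outputValue) && trim($outputValue) !== '';
};

// array_filter() is used here only to show which values the callback lets through.
$kept = array_values(
    array_filter(['Queen', '   ', '', null, 42, 'The Game'], $isNonEmptyString)
);

print_r($kept); // Queen, The Game
```

In a crawler, the same closure would be handed to Filter::custom() and the resulting filter passed to where().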