Documentation for crwlr / crawler (v0.5)

Attention: You're currently viewing the documentation for v0.5 of the crawler package.
This is not the latest version of the package.

Step Output Filters

Any step that extends the abstract Step class shipped with the package has the where() and orWhere() methods to filter its outputs. Here's an example of how to use them:

use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Json;

$json = <<<JSON
{
    "queenAlbums": [
        { "title": "Queen", "year": 1973, "charts": { "uk": 24, "us": 83 } },
        { "title": "Queen II", "year": 1974, "charts": { "uk": 5, "us": 49 } },
        { "title": "A Night at the Opera", "year": 1975, "charts": { "uk": 1, "us": 4 } },
        { "title": "A Day at the Races", "year": 1976, "charts": { "uk": 1, "us": 5 } },
        { "title": "The Game", "year": 1980, "charts": { "uk": 1, "us": 1 } },
        { "title": "A Kind of Magic", "year": 1986, "charts": { "uk": 1, "us": 46 } }
    ]
}
JSON;

$crawler = new MyCrawler();

$crawler->input($json);

$crawler->addStep(
    Json::each('queenAlbums', ['title', 'year', 'chartsUK' => 'charts.uk', 'chartsUS' => 'charts.us'])
        ->where('year', Filter::greaterThan(1979))
        ->where('chartsUS', Filter::equal(1))
);

As you can see, you always need to provide a Filter object. That is easy to do, as the Filter class has a static method for every available filter.

In the example, the only result is the album "The Game", as it's the only one in the list that was released after 1979 and reached #1 in the US charts.
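To make the effect of chaining two where() calls concrete, here is a minimal plain-PHP sketch of the same selection using array_filter(). It does not use the crawler package at all: the $albums array is only an illustrative stand-in for the step's outputs, and the closure stands in for the two filters.

```php
<?php

// Illustrative stand-in for the outputs of the Json::each() step above.
$albums = [
    ['title' => 'Queen', 'year' => 1973, 'chartsUK' => 24, 'chartsUS' => 83],
    ['title' => 'Queen II', 'year' => 1974, 'chartsUK' => 5, 'chartsUS' => 49],
    ['title' => 'A Night at the Opera', 'year' => 1975, 'chartsUK' => 1, 'chartsUS' => 4],
    ['title' => 'A Day at the Races', 'year' => 1976, 'chartsUK' => 1, 'chartsUS' => 5],
    ['title' => 'The Game', 'year' => 1980, 'chartsUK' => 1, 'chartsUS' => 1],
    ['title' => 'A Kind of Magic', 'year' => 1986, 'chartsUK' => 1, 'chartsUS' => 46],
];

// Both conditions must hold, like the two chained where() calls.
$filtered = array_values(array_filter(
    $albums,
    fn (array $album): bool => $album['year'] > 1979 && $album['chartsUS'] === 1
));

$titles = array_column($filtered, 'title');

print_r($titles); // only "The Game" remains
```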

The first parameter is the key in the step's output array (or object). If the step outputs only a single value that is not an array or object, you can pass just the Filter:

$crawler->addStep(
    Html::getLink('.linkClass')
        ->where(Filter::urlDomain('crwlr.software'))
);

As mentioned, there is also orWhere(). Extending the first example above, you can also do:

$crawler->addStep(
    Json::each('queenAlbums', ['title', 'year', 'chartsUK' => 'charts.uk', 'chartsUS' => 'charts.us'])
        ->where('year', Filter::greaterThan(1979))
        ->where('chartsUS', Filter::equal(1))
        ->orWhere('chartsUK', Filter::equal(1))
);

This will also get "A Kind of Magic", as it was released after 1979 and reached #1 in the UK.
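Judging by that result, the orWhere() condition acts as an alternative to the where() condition right before it, while the earlier year condition still has to hold. The following plain-PHP sketch shows that reading. It is an assumption inferred from the example's result, not taken from the package's source, and the $albums array is again just an illustrative stand-in for the step's outputs.

```php
<?php

// Illustrative stand-in for the outputs of the Json::each() step above.
$albums = [
    ['title' => 'Queen', 'year' => 1973, 'chartsUK' => 24, 'chartsUS' => 83],
    ['title' => 'Queen II', 'year' => 1974, 'chartsUK' => 5, 'chartsUS' => 49],
    ['title' => 'A Night at the Opera', 'year' => 1975, 'chartsUK' => 1, 'chartsUS' => 4],
    ['title' => 'A Day at the Races', 'year' => 1976, 'chartsUK' => 1, 'chartsUS' => 5],
    ['title' => 'The Game', 'year' => 1980, 'chartsUK' => 1, 'chartsUS' => 1],
    ['title' => 'A Kind of Magic', 'year' => 1986, 'chartsUK' => 1, 'chartsUS' => 46],
];

// Assumed grouping: year AND (chartsUS OR chartsUK).
$filtered = array_values(array_filter(
    $albums,
    fn (array $album): bool =>
        $album['year'] > 1979
        && ($album['chartsUS'] === 1 || $album['chartsUK'] === 1)
));

$titles = array_column($filtered, 'title');

print_r($titles); // "The Game" and "A Kind of Magic"
```

If the conditions were instead grouped as (year AND chartsUS) OR chartsUK, "A Night at the Opera" and "A Day at the Races" would pass as well, which is not the result described here.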

Available Filters

Comparison Filters

Filter::equal(mixed $toValue);
Filter::notEqual(mixed $value);
Filter::greaterThan(mixed $value);
Filter::greaterThanOrEqual(mixed $value);
Filter::lessThan(mixed $value);
Filter::lessThanOrEqual(mixed $value);

String Filters

Filter::stringContains(string $string);    // uses PHP's str_contains()
Filter::stringStartsWith(string $string);  // uses PHP's str_starts_with()
Filter::stringEndsWith(string $string);    // uses PHP's str_ends_with()
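As the comments note, these filters rely on PHP's native string functions, which are available since PHP 8.0 and compare case-sensitively. Here is a small sketch of what each underlying check evaluates to; the $title value is a made-up sample output, not something the package provides.

```php
<?php

// Hypothetical single string output of a step.
$title = 'A Night at the Opera';

$contains   = str_contains($title, 'Night');   // true
$startsWith = str_starts_with($title, 'A ');   // true
$endsWith   = str_ends_with($title, 'Races');  // false

// The comparison is case-sensitive, so a lowercase needle does not match.
$containsLowercase = str_contains($title, 'night'); // false

var_dump($contains, $startsWith, $endsWith, $containsLowercase);
```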

URL Filters

Filter::urlScheme(string $scheme);              // e.g. http, https, ftp,...
Filter::urlHost(string $host);                  // www.crwlr.software
Filter::urlDomain(string $domain);              // crwlr.software
Filter::urlPath(string $path);                  // /exact/path
Filter::urlPathStartsWith(string $pathStart);   // /foo
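To show which part of a URL each of these filters looks at, here is a rough sketch using PHP's built-in parse_url(). The example URL is made up, and the package may well parse URLs differently internally, so treat this only as an illustration of the URL components involved.

```php
<?php

// Hypothetical link output of a step.
$url = 'https://www.crwlr.software/packages/crawler?foo=bar';

$parts = parse_url($url);

$schemeMatches     = $parts['scheme'] === 'https';                  // cf. urlScheme('https')
$hostMatches       = $parts['host'] === 'www.crwlr.software';       // cf. urlHost('www.crwlr.software')
$pathMatches       = $parts['path'] === '/packages/crawler';        // cf. urlPath('/packages/crawler')
$pathStartsMatches = str_starts_with($parts['path'], '/packages');  // cf. urlPathStartsWith('/packages')

var_dump($schemeMatches, $hostMatches, $pathMatches, $pathStartsMatches);
```

The registrable domain (crwlr.software in this case) is left out of the sketch, because parse_url() only returns the full host and working out the domain reliably requires knowledge of public suffixes.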