Step Output Filters
Any step that extends the abstract Step class shipped with the package has the where() and orWhere() methods to filter its outputs. Here's an example of how to use them:
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Json;

$json = <<<JSON
{
    "queenAlbums": [
        { "title": "Queen", "year": 1973, "charts": { "uk": 24, "us": 83 } },
        { "title": "Queen II", "year": 1974, "charts": { "uk": 5, "us": 49 } },
        { "title": "A Night at the Opera", "year": 1975, "charts": { "uk": 1, "us": 4 } },
        { "title": "A Day at the Races", "year": 1976, "charts": { "uk": 1, "us": 5 } },
        { "title": "The Game", "year": 1980, "charts": { "uk": 1, "us": 1 } },
        { "title": "A Kind of Magic", "year": 1986, "charts": { "uk": 1, "us": 46 } }
    ]
}
JSON;

$crawler = new MyCrawler();

$crawler->input($json);

$crawler->addStep(
    Json::each('queenAlbums', ['title', 'year', 'chartsUK' => 'charts.uk', 'chartsUS' => 'charts.us'])
        ->where('year', Filter::greaterThan(1979))
        ->where('chartsUS', Filter::equal(1))
);
As you can see, you always need to provide a Filter object. That shouldn't be complicated, though, as the Filter class has a static method for every available filter.
In the example, both where() conditions have to match, so the only result is the album "The Game": it's the only one in the list that was released after 1979 and reached #1 in the US charts.
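To make the combination logic tangible, here is a minimal plain-PHP sketch that reproduces the effect of the two chained where() calls without using the library. The $outputs array is a hypothetical stand-in for the mapped outputs of Json::each(), and array_filter() is used for illustration only; it is not how the package implements filtering internally.

```php
// Hypothetical stand-in for the outputs the Json::each() step yields after mapping.
$outputs = [
    ['title' => 'Queen', 'year' => 1973, 'chartsUK' => 24, 'chartsUS' => 83],
    ['title' => 'Queen II', 'year' => 1974, 'chartsUK' => 5, 'chartsUS' => 49],
    ['title' => 'A Night at the Opera', 'year' => 1975, 'chartsUK' => 1, 'chartsUS' => 4],
    ['title' => 'A Day at the Races', 'year' => 1976, 'chartsUK' => 1, 'chartsUS' => 5],
    ['title' => 'The Game', 'year' => 1980, 'chartsUK' => 1, 'chartsUS' => 1],
    ['title' => 'A Kind of Magic', 'year' => 1986, 'chartsUK' => 1, 'chartsUS' => 46],
];

// Both conditions must hold, like the two chained where() calls.
$filtered = array_filter(
    $outputs,
    fn (array $output): bool => $output['year'] > 1979 && $output['chartsUS'] === 1,
);

// array_column() re-indexes the result, so $titles is a plain list of titles.
$titles = array_column($filtered, 'title');

print_r($titles);
```

Only "The Game" remains in $titles, which matches the result described above.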
The first parameter is the key in the step's output array (or object). If the step outputs only a single, non-array/object value, you can pass just the Filter:
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Html;

$crawler->addStep(
    Html::getLink('.linkClass')
        ->where(Filter::urlDomain('crwlr.software'))
);
As mentioned, there is also orWhere(). So, extending the same example as above, you can also do:
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Json;

$crawler->addStep(
    Json::each('queenAlbums', ['title', 'year', 'chartsUK' => 'charts.uk', 'chartsUS' => 'charts.us'])
        ->where('year', Filter::greaterThan(1979))
        ->where('chartsUS', Filter::equal(1))
        ->orWhere('chartsUK', Filter::equal(1))
);
This additionally returns "A Kind of Magic", as it reached #1 in the UK charts.
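The stated result is consistent with orWhere() acting as an alternative to the where() directly before it, while the year filter still has to match. The following plain-PHP sketch spells out that reading and contrasts it with the other possible grouping. The grouping is an assumption inferred from the result described above, and the $outputs array is a hypothetical stand-in for the step's outputs.

```php
// Hypothetical stand-in for the outputs the Json::each() step yields after mapping.
$outputs = [
    ['title' => 'Queen', 'year' => 1973, 'chartsUK' => 24, 'chartsUS' => 83],
    ['title' => 'Queen II', 'year' => 1974, 'chartsUK' => 5, 'chartsUS' => 49],
    ['title' => 'A Night at the Opera', 'year' => 1975, 'chartsUK' => 1, 'chartsUS' => 4],
    ['title' => 'A Day at the Races', 'year' => 1976, 'chartsUK' => 1, 'chartsUS' => 5],
    ['title' => 'The Game', 'year' => 1980, 'chartsUK' => 1, 'chartsUS' => 1],
    ['title' => 'A Kind of Magic', 'year' => 1986, 'chartsUK' => 1, 'chartsUS' => 46],
];

// Assumed reading: year AND (chartsUS OR chartsUK).
$assumed = array_column(array_filter(
    $outputs,
    fn (array $o): bool => $o['year'] > 1979 && ($o['chartsUS'] === 1 || $o['chartsUK'] === 1),
), 'title');

// Other grouping, for contrast: (year AND chartsUS) OR chartsUK.
$other = array_column(array_filter(
    $outputs,
    fn (array $o): bool => ($o['year'] > 1979 && $o['chartsUS'] === 1) || $o['chartsUK'] === 1,
), 'title');

print_r($assumed);
print_r($other);
```

Under the assumed reading, $assumed contains "The Game" and "A Kind of Magic". The other grouping would also let the two UK #1 albums from the 1970s through, which the documented result does not mention.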
Available Filters
Comparison Filters
use Crwlr\Crawler\Steps\Filters\Filter;
Filter::equal(mixed $toValue);
Filter::notEqual(mixed $value);
Filter::greaterThan(mixed $value);
Filter::greaterThanOrEqual(mixed $value);
Filter::lessThan(mixed $value);
Filter::lessThanOrEqual(mixed $value);
String Filters
use Crwlr\Crawler\Steps\Filters\Filter;
Filter::stringContains(string $string); // uses PHP's str_contains()
Filter::stringStartsWith(string $string); // str_starts_with()
Filter::stringEndsWith(string $string); // str_ends_with()
Filter::stringLengthEqual(int $length); // strlen($outputValue) === $length
Filter::stringLengthNotEqual(int $length); // strlen($outputValue) !== $length
Filter::stringLengthGreaterThan(int $length); // strlen($outputValue) > $length
Filter::stringLengthGreaterThanOrEqual(int $length); // strlen($outputValue) >= $length
Filter::stringLengthLessThan(int $length); // strlen($outputValue) < $length
Filter::stringLengthLessThanOrEqual(int $length); // strlen($outputValue) <= $length
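Since the comments above refer to PHP's strlen(), it may help to remember that strlen() counts bytes, not characters. The sketch below only demonstrates that PHP behaviour with made-up example strings; that the length filters behave exactly like this is an assumption based on the comments in the listing.

```php
// strlen() counts bytes, so multi-byte UTF-8 characters count more than once.
$ascii = 'Queen';
$multiByte = "Mot\u{F6}rhead"; // "Motörhead"; the "ö" takes 2 bytes in UTF-8

$asciiLength = strlen($ascii);         // 5 bytes, 5 characters
$multiByteLength = strlen($multiByte); // 10 bytes, although there are only 9 characters

var_dump($asciiLength, $multiByteLength);
```

If character counts matter for non-ASCII output, a Filter::custom() callback using mb_strlen() would be one possible workaround.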
URL Filters
use Crwlr\Crawler\Steps\Filters\Filter;
Filter::urlScheme(string $scheme); // e.g. http, https, ftp,...
Filter::urlHost(string $host); // www.crwlr.software
Filter::urlDomain(string $domain); // crwlr.software
Filter::urlPath(string $path); // /exact/path
Filter::urlPathStartsWith(string $pathStart); // /foo
Filter::urlPathMatches(string $regex); // Regex (without delimiters) that the path has to match.
// Like: ^/\d{1,5}/
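As a rough picture of which part of a URL each filter looks at, here is a plain-PHP sketch using parse_url() and preg_match(). It is an illustration only: the example URL is hypothetical, the package may parse URLs differently, and the "~" delimiter wrapped around the pattern is an assumption.

```php
$url = 'https://www.crwlr.software/123/some-article'; // hypothetical URL

$scheme = parse_url($url, PHP_URL_SCHEME); // "https"
$host = parse_url($url, PHP_URL_HOST);     // "www.crwlr.software"
$path = parse_url($url, PHP_URL_PATH);     // "/123/some-article"

// parse_url() has no component for the registrable domain ("crwlr.software"),
// so that part is left out of this sketch.

// The pattern is given without delimiters, so they are added before matching.
$regex = '^/\d{1,5}/';
$pathMatches = preg_match('~' . $regex . '~', $path) === 1;

// A plain prefix check, comparable to what urlPathStartsWith('/foo') asks for.
$pathStartsWithFoo = str_starts_with($path, '/foo');

var_dump($scheme, $host, $path, $pathMatches, $pathStartsWithFoo);
```

For this URL the regex check succeeds, because the path begins with one to five digits between slashes, while the "/foo" prefix check fails.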
Custom Filter Callback
use Crwlr\Crawler\Steps\Filters\Filter;
Filter::custom(function (mixed $outputValue) {
    if (/* $outputValue should be passed on */) {
        return true;
    }

    return false; // Throw this output away
});
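A concrete callback could look like the following sketch. The closure name $keepShortStrings and its rule (non-empty strings of at most 100 bytes) are hypothetical, chosen only to show the true/false contract from the template above. The callback is called directly here so the snippet runs without the package.

```php
// Hypothetical callback: keep only outputs that are non-empty strings
// of at most 100 bytes.
$keepShortStrings = function (mixed $outputValue): bool {
    return is_string($outputValue)
        && $outputValue !== ''
        && strlen($outputValue) <= 100;
};

// The callback can be tried in isolation before handing it to the filter.
$passes = $keepShortStrings('A Night at the Opera'); // string within the limit
$rejectsNonString = $keepShortStrings(1975);         // not a string

var_dump($passes, $rejectsNonString);

// With the package installed, it would presumably be passed like this:
// $step->where('title', Filter::custom($keepShortStrings));
```

Keeping the callback in a variable makes it easy to test on sample values before wiring it into a step.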