Refining Output Data
Sometimes extracted data isn't in the exact format you need. You may want to clean it up or transform it before passing it on to the next step.
Using the refineOutput() Method
The refineOutput() method, which is available on any step, enables you to do exactly that. It can be used in different ways depending on what kind of refinement you need and what type of output the step produces.
Predefined Refiners vs. Custom Callback Function
The method accepts either a (predefined) refiner (a class implementing the Crwlr\Crawler\Steps\Refiners\RefinerInterface) or a custom callback function.
The library ships with many predefined refiners that offer convenient helpers for common operations such as modifying URLs, strings, date/time values, or even HTML. If no refiner fits your use case, you can manipulate the output data yourself in a custom callback function.
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Refiners\UrlRefiner;
Html::getLinks('#navigation a')
    ->refineOutput(
        UrlRefiner::withScheme('https'),
    );
use Crwlr\Crawler\Steps\Html;
use Crwlr\Url\Url;
Html::getLinks('#navigation a')
    ->refineOutput(function (string $url) {
        return Url::parse($url)
            ->scheme('https')
            ->__toString();
    });
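As a further illustration of a custom callback, here is a minimal, library-free sketch. The closure has the shape of one you could pass to refineOutput() for scalar string outputs. The name $normalizePrice, the German-style number notation, and the sample price strings are assumptions made for this example, and array_map() merely stands in for the crawler feeding outputs to the callback.

```php
<?php

// Hypothetical callback: turns a scraped price string into a float.
$normalizePrice = function (string $price): float {
    // Keep only digits, dots and commas, e.g. "€ 1.299,00" => "1.299,00".
    $number = preg_replace('/[^0-9.,]/', '', $price);

    // Assumes German-style notation: "." groups thousands, "," marks decimals.
    $number = str_replace(['.', ','], ['', '.'], $number);

    return (float) $number;
};

// Made-up sample outputs; array_map() stands in for the crawler here.
$refined = array_map($normalizePrice, ['€ 1.299,00', '49,90 EUR']);

var_dump($refined);
```

Because the callback is an ordinary function from one value to another, it can be tried out on sample data like this before it is wired into a step.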
Usage Based on Output Type
The refineOutput() method can be used differently depending on the type of output.
If the step yields scalar values (a single value without a key), you can pass a refiner as the only argument:
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Refiners\UrlRefiner;
Html::getLinks('#navigation a')
    ->refineOutput(
        UrlRefiner::withScheme('https')
    );
To target a specific property of an associative array (or object) output, provide its key as the first argument:
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Refiners\StringRefiner;
Html::root()
    ->extract([
        'foo' => '.foo',
        'bar' => '.bar',
        'baz' => '.baz',
    ])
    ->refineOutput(
        'foo', // Key of property to refine.
        StringRefiner::replace('a', 'b'),
    );
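To make the effect of the key argument concrete, the following library-free sketch mimics the idea on a plain array. refineKey() is a hypothetical helper written for this example, not part of the library, and passing outputs through unchanged when the key is missing is an assumption.

```php
<?php

// Hypothetical helper: applies $refiner to a single key of an output array.
// It only mimics the idea of refineOutput('foo', ...); it is not library code.
function refineKey(array $output, string $key, callable $refiner): array
{
    // Assumption: outputs without the key are passed through unchanged.
    if (array_key_exists($key, $output)) {
        $output[$key] = $refiner($output[$key]);
    }

    return $output;
}

// Made-up sample output with two properties holding the same value.
$output = ['foo' => 'banana', 'bar' => 'banana'];

$refined = refineKey(
    $output,
    'foo',
    fn (string $value): string => str_replace('a', 'b', $value),
);

print_r($refined); // Only "foo" is changed; "bar" keeps its original value.
```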
Refining an Entire Output Array in One Callback Function
If the step returns an associative array (or object) and you don’t pass a key as the first argument, the entire output will be passed to your custom callback function. This allows you to use that function to manipulate multiple properties at once — which is especially useful when some values depend on others.
use Crwlr\Crawler\Steps\Html;
Html::root()
    ->extract([
        'foo' => '.foo',
        'bar' => '.bar',
        'baz' => '.baz',
    ])
    // No key is passed here, so the callback receives the entire output array.
    ->refineOutput(function (array $outputData, mixed $originalInputData) {
        if ($outputData['foo'] === 'something') {
            $outputData['bar'] .= 'something else';
        }

        return $outputData;
    });
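A whole-array callback can also be exercised on its own, which is handy when one value is derived from others. The sketch below uses the same parameter list as the callback above; the property names (firstName, lastName, fullName, source), the sample data, and the assumption that the original input is a URL string are all made up for illustration.

```php
<?php

// Receives the whole output array plus the original input of the step.
$refine = function (array $outputData, mixed $originalInputData): array {
    // Derive one property from two others.
    $outputData['fullName'] = trim($outputData['firstName'] . ' ' . $outputData['lastName']);

    // Hypothetical use of the input: remember where the data came from.
    $outputData['source'] = (string) $originalInputData;

    return $outputData;
};

// Called by hand with made-up data instead of by the crawler.
$result = $refine(
    ['firstName' => 'Ada', 'lastName' => 'Lovelace'],
    'https://www.example.com/people/ada',
);

print_r($result);
```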
Predefined Refiners
The following examples showcase all predefined refiners included in the library.
String Refiners
use Crwlr\Crawler\Steps\Refiners\StringRefiner;
StringRefiner::afterFirst('foo'); // Rest of the string after the first occurrence of "foo".
StringRefiner::afterLast('foo'); // Rest of the string after the last occurrence of "foo".
StringRefiner::beforeFirst('foo'); // String before the first occurrence of "foo".
StringRefiner::beforeLast('foo'); // String before the last occurrence of "foo".
// Everything between the first occurrence of "foo" and the next occurrence of "bar" after that "foo".
StringRefiner::betweenFirst('foo', 'bar');
// Everything between the last occurrence of "foo" and the next occurrence of "bar" after that "foo".
StringRefiner::betweenLast('foo', 'bar');
// Find and replace.
StringRefiner::replace('°', '');
// Can also take arrays of strings, like:
StringRefiner::replace(['foo', 'bar'], ['FOO', 'BAR']);
Note: all of these string refiners automatically trim the refined string.
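For readers who want to see the described semantics spelled out, here is a rough plain-PHP approximation of three of these refiners. It is an illustrative sketch, not the library's implementation: the function names are reused only for readability, and returning the trimmed input when a needle is not found is an assumption.

```php
<?php

// Rest of the string after the first occurrence of $needle, trimmed.
function afterFirst(string $value, string $needle): string
{
    $pos = strpos($value, $needle);

    // Assumption: without a match, the trimmed input is returned as is.
    return $pos === false ? trim($value) : trim(substr($value, $pos + strlen($needle)));
}

// String before the last occurrence of $needle, trimmed.
function beforeLast(string $value, string $needle): string
{
    $pos = strrpos($value, $needle);

    return $pos === false ? trim($value) : trim(substr($value, 0, $pos));
}

// Everything between the first $start and the next $end after it, trimmed.
function betweenFirst(string $value, string $start, string $end): string
{
    $rest = afterFirst($value, $start);
    $pos = strpos($rest, $end);

    return $pos === false ? $rest : trim(substr($rest, 0, $pos));
}

echo afterFirst('Price: 12 EUR', ':'), PHP_EOL;
echo beforeLast('a / b / c', '/'), PHP_EOL;
echo betweenFirst('foo one bar two bar', 'foo', 'bar'), PHP_EOL;
```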
URL Refiners
use Crwlr\Crawler\Steps\Refiners\UrlRefiner;
UrlRefiner::withScheme('https'); // Sets the scheme to "https".
// E.g. http://example.com => https://example.com
UrlRefiner::withHost('www.example.com'); // Sets the host to "www.example.com"
// E.g. https://example.com => https://www.example.com
UrlRefiner::withPort(1234); // Sets the port to "1234"
// E.g. https://example.com/foo => https://example.com:1234/foo
UrlRefiner::withoutPort(); // Removes the port.
// E.g. https://example.com:1234/foo => https://example.com/foo
UrlRefiner::withPath('/contact'); // Sets the path to "/contact"
// E.g. https://example.com/foo => https://example.com/contact
UrlRefiner::withQuery('a=b&c=d'); // Sets the query to "a=b&c=d"
// E.g. https://example.com/foo?foo=bar => https://example.com/foo?a=b&c=d
UrlRefiner::withoutQuery(); // Removes the query.
// E.g. https://example.com/foo?foo=bar => https://example.com/foo
UrlRefiner::withFragment('foo'); // Sets the fragment to "foo".
// E.g. https://example.com/home => https://example.com/home#foo
UrlRefiner::withoutFragment(); // Removes the fragment.
// E.g. https://example.com/home#foo => https://example.com/home
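The before/after pairs above all follow the same pattern: one URL component is replaced or removed and the rest is kept. The sketch below reproduces that pattern with PHP's parse_url(). buildUrl() and withComponent() are hypothetical helpers for this example, not library functions, and they assume absolute http(s) URLs without user info.

```php
<?php

// Rebuilds a URL string from the parts returned by parse_url().
function buildUrl(array $parts): string
{
    $url = $parts['scheme'] . '://' . $parts['host'];
    $url .= isset($parts['port']) ? ':' . $parts['port'] : '';
    $url .= $parts['path'] ?? '';
    $url .= isset($parts['query']) ? '?' . $parts['query'] : '';
    $url .= isset($parts['fragment']) ? '#' . $parts['fragment'] : '';

    return $url;
}

// Hypothetical helper: sets one component ($value given) or removes it ($value null).
function withComponent(string $url, string $component, string|int|null $value): string
{
    // Assumes an absolute URL, so "scheme" and "host" are always present.
    $parts = parse_url($url);

    if ($value === null) {
        unset($parts[$component]);
    } else {
        $parts[$component] = $value;
    }

    return buildUrl($parts);
}

echo withComponent('http://example.com/foo?foo=bar', 'scheme', 'https'), PHP_EOL;
echo withComponent('https://example.com:1234/foo', 'port', null), PHP_EOL;
echo withComponent('https://example.com/home', 'fragment', 'foo'), PHP_EOL;
```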
DateTime Refiners
use Crwlr\Crawler\Steps\Refiners\DateTimeRefiner;
DateTimeRefiner::reformat('Y-m-d H:i:s');
// Automatically detects the format of the extracted date/time string
// and converts it to the given target format.
DateTimeRefiner::reformat('Y-m-d H:i:s', 'd. F Y \u\m H:i:s');
// If automatic detection doesn't work, you can explicitly define the
// origin format as the second argument.
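The two call styles can be approximated with PHP's built-in date classes, which may help when working out a suitable origin format. This is a sketch under assumptions, not the library's implementation: reformatDate() is a hypothetical helper, PHP's own date parser stands in for the automatic detection, and returning unparsable values unchanged is an assumption. The sample date strings are made up.

```php
<?php

// Hypothetical helper: converts $value to $targetFormat.
// $originFormat is optional, mirroring the two call styles shown above.
function reformatDate(string $value, string $targetFormat, ?string $originFormat = null): string
{
    try {
        $date = $originFormat === null
            ? new DateTimeImmutable($value) // Let PHP's date parser guess the format.
            : DateTimeImmutable::createFromFormat($originFormat, $value);
    } catch (Exception $exception) {
        $date = false;
    }

    // Assumption: values that can't be parsed are returned unchanged.
    if ($date === false) {
        return $value;
    }

    return $date->format($targetFormat);
}

// Format guessed by PHP.
echo reformatDate('2024-03-12T14:30:00', 'd.m.Y H:i'), PHP_EOL;

// Explicit origin format; "\u\m" escapes the literal word "um".
echo reformatDate('12. March 2024 um 14:30:00', 'Y-m-d H:i:s', 'd. F Y \u\m H:i:s'), PHP_EOL;
```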
HTML Refiners
use Crwlr\Crawler\Steps\Refiners\HtmlRefiner;
HtmlRefiner::remove('#cookie-consent');
// Removes elements matching the given CSS selector from an HTML string.
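To illustrate what removing an element from an HTML string involves, here is a small sketch based on PHP's DOM extension. It is not the library's implementation: removeById() is a hypothetical helper that only handles an element ID (like the "#cookie-consent" selector above) rather than arbitrary CSS selectors, and it returns a full serialized document instead of the original fragment.

```php
<?php

// Hypothetical helper: removes all elements with the given id attribute.
function removeById(string $html, string $id): string
{
    $document = new DOMDocument();

    // Suppress warnings that libxml emits for markup it doesn't know.
    @$document->loadHTML($html);

    $xpath = new DOMXPath($document);

    // Copy the matches into an array before removing nodes from the tree.
    $matches = iterator_to_array($xpath->query('//*[@id="' . $id . '"]'));

    foreach ($matches as $element) {
        $element->parentNode->removeChild($element);
    }

    return $document->saveHTML();
}

// Made-up sample markup.
$html = '<html><body><div id="cookie-consent">Accept cookies?</div><p>Hello</p></body></html>';

$clean = removeById($html, 'cookie-consent');

echo $clean;
```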