HTTP Steps
The Http step uses the LoadingStep trait and automatically receives the crawler's Loader when it is added to the crawler.
HTTP Requests
There are static methods to get steps for all the different HTTP methods:
use Crwlr\Crawler\Steps\Loading\Http;
Http::get();
Http::post();
Http::put();
Http::patch();
Http::delete();
They all have optional parameters for headers, a body (for methods that allow one), and the HTTP version:
use Crwlr\Crawler\Steps\Loading\Http;
Http::get(array $headers = [], string $httpVersion = '1.1');
Http::post(
array $headers = [],
string|StreamInterface|null $body = null,
string $httpVersion = '1.1'
);
Http::put(
array $headers = [],
string|StreamInterface|null $body = null,
string $httpVersion = '1.1',
);
Http::patch(
array $headers = [],
string|StreamInterface|null $body = null,
string $httpVersion = '1.1',
);
Http::delete(
array $headers = [],
string|StreamInterface|null $body = null,
string $httpVersion = '1.1',
);
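For example, a POST request with a custom Accept header and a form-encoded body could look like this (the endpoint and values are just placeholders):
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com/api/search')
    ->addStep(
        Http::post(
            headers: ['Accept' => 'application/json'],
            body: 'query=crawler&page=1',
        )
    );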
Adding Raw HTTP Response Data to the Result
After an HTTP request step, you will usually add another step to extract data from the response document (HTML, XML, or whatever format is returned).
If you want to directly add properties from the Http step to the crawling result, you can use the output keys url, status, headers, and body:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('https://www.example.com')
->addStep(
Http::get()
->keep(['url', 'status', 'headers', 'body'])
);
Getting Headers and/or Body from a Previous Step
By default, if the step receives array input, it will look for the keys url or uri to use as the request URL. But to be as flexible as possible, the Http steps can receive not only the URL, but also headers and a body from the outputs of a previous step. Let's say you have a MyCustomStep that produces outputs like:
[
'link' => 'https://www.example.com',
'someHeaderValue' => '123abc',
'queryString' => 'foo=bar&baz=quz',
]
You can get those values to be used as a certain HTTP request header and as the request body, like this:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('...')
->addStep(new MyCustomStep())
->addStep(
Http::post()
->useInputKeyAsUrl('link')
->useInputKeyAsHeader('someHeaderValue', 'x-header-value')
->useInputKeyAsBody('queryString')
);
As you can see, you can even map an output key to a specific header name.
You can also use an array from the output containing multiple headers. Let's assume the output of MyCustomStep looks like:
[
'link' => 'https://www.example.com',
'customHeaders' => [
'Accept' => 'text/html,application/xhtml+xml,application/xml',
'Accept-Encoding' => 'gzip, deflate',
],
]
In this case you can add those headers to your request like this:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('...')
->addStep(new MyCustomStep())
->addStep(
Http::post()
->useInputKeyAsUrl('link')
->useInputKeyAsHeaders('customHeaders')
);
If you're also defining some headers statically when creating the step, dynamic headers from the previous step's outputs are merged with them:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('...')
->addStep(new MyCustomStep())
->addStep(
Http::get(headers: ['Accept-Language' => 'de-DE'])
->useInputKeyAsUrl('link')
->useInputKeyAsHeaders('customHeaders')
);
Watch out: usually the Http steps receive the request URL as scalar input, or you define which key from the array input should be used by calling the step's useInputKey() method. When you also want to get headers and/or a body from the input, you have to use the useInputKeyAsUrl() method, because with useInputKey(), all other values are thrown away before the step is invoked.
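To illustrate the difference, here is a minimal sketch using the MyCustomStep output from above:
use Crwlr\Crawler\Steps\Loading\Http;
// useInputKey(): only the value of the 'link' key is passed on to the step,
// so 'someHeaderValue' and 'queryString' are no longer available.
Http::get()->useInputKey('link');
// useInputKeyAsUrl(): the full input array is kept, so the other keys can
// still be mapped to a header and the request body.
Http::post()
    ->useInputKeyAsUrl('link')
    ->useInputKeyAsHeader('someHeaderValue', 'x-header-value')
    ->useInputKeyAsBody('queryString');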
Static URLs and Using Input Properties in URL, Headers, and Body
You can define a static request URL on an Http step using the staticUrl() method. This can be useful, for example, when making POST requests to a fixed endpoint:
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('https://www.example.com/foo')
->addStep(Http::get())
->addStep(
Html::root()->extract([
'query' => Dom::cssSelector('#foo')->attribute('data-query'),
// The attribute contains a query string to be sent in a POST request.
])
)
->addStep(
Http::post()
->staticUrl('https://www.example.com/bar')
->useInputKeyAsBody('query'),
);
In addition to setting a static URL, you can also inject dynamic values using placeholder variables in the format [crwl:propertyName]. These variables will be replaced with values from the input data, including values passed through from previous steps using keep().
Here's an example using a variable inside the URL:
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('https://www.example.com/foo')
->addStep(Http::get())
->addStep(
Html::each('#list .item')->extract([
'id' => '.id',
])
)
->addStep(
Http::get()
->staticUrl('https://www.example.com/bar?id=[crwl:id]'),
);
If the id extracted by the Html step is 123, the request URL for the last Http step will be https://www.example.com/bar?id=123.
You can also use these variable placeholders in request headers and the request body:
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('https://www.example.com/foo')
->addStep(Http::get())
->addStep(
Html::each('#list .item')->extract([
'id' => '.id',
'foo' => '.fooHeader',
])
)
->addStep(
Http::post(
headers: ['x-foo' => '[crwl:foo],bar'],
body: 'id=[crwl:id]&type=detail',
),
);
Error Responses
By default, error responses (HTTP status codes 4xx and 5xx) are not passed on to the next step in the crawler. If you want to also cascade error responses down to the next step, you can call the yieldErrorResponses() method:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('https://www.example.com/broken-link')
->addStep(
Http::get()->yieldErrorResponses()
)
->addStep(...);
Another default behavior is that crawlers keep crawling after error responses (except for some special behavior in the case of a 429 HTTP response; see the Politeness page). If it's important for your crawler that none of the requests fail, call the stopOnErrorResponse() method, and the step will throw a LoadingException when it receives an error response.
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('https://www.example.com/broken-link')
->addStep(
Http::get()->stopOnErrorResponse()
)
->addStep(...);
Skipping the Cache for a Specific Step
If your crawler uses an HTTP response cache, you can disable it for individual Http steps by calling the skipCache() method:
use Crwlr\Crawler\Cache\FileCache;
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');
$crawler->getLoader()->setCache(new FileCache(__DIR__ . '/cachedir'));
$crawler
->input('https://www.example.com/')
->addStep(Http::get()->skipCache()) // This Http step won't use the cache.
->addStep(Html::getLinks('#list .link'))
->addStep(Http::get()); // This Http step will use the cache.
This allows you to bypass the cache selectively while keeping it enabled for the rest of the crawler.
Using a (Headless) Chrome Browser to Load Pages
If you want to use a (headless) Chrome browser for loading pages, there are two ways to enable it:
1. Globally for the entire crawler:
You can configure the crawler’s loader to use a browser for all requests:
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');
$crawler->getLoader()->useHeadlessBrowser();
$crawler
->input('https://www.example.com/foo')
->addStep(Http::get())
->addStep(Html::getLinks('#list .link'))
->addStep(Http::get());
Both Http steps in this example will use the (headless) Chrome browser.
2. For a specific step only:
If you want to use the browser only for certain requests, you can call the useBrowser() method on a specific Http step. This switches the loader to use the browser for this step's requests and automatically reverts to the previous setting afterward.
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');
$crawler
->input('https://www.example.com/foo')
->addStep(Http::get()->useBrowser()) // Uses the Chrome browser.
->addStep(Html::getLinks('#list .link'))
->addStep(Http::get()); // Doesn't use the Chrome browser.
Post Browser Navigate Hooks
If your crawler's loader uses the headless browser as shown above, the postBrowserNavigateHook() method allows you to define actions that are performed immediately after navigating to the target URL but before reading the HTML source code.
For common tasks, like clicking elements, scrolling, or typing into inputs, the BrowserAction class provides predefined actions that make it very straightforward to interact with the page (see Predefined Browser Actions below).
Alternatively, you can define a fully custom callback function, leveraging the Page class API of the chrome-php/chrome package, which is used under the hood.
Here's an example using both approaches:
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Loading\Http;
use HeadlessChromium\Page;
use Crwlr\Crawler\Steps\Loading\Http\Browser\BrowserAction;
$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');
$crawler->getLoader()->useHeadlessBrowser();
$crawler
->input('https://www.example.com/foo')
->addStep(
Http::get()
->postBrowserNavigateHook(BrowserAction::clickElement('#firstname'))
->postBrowserNavigateHook(BrowserAction::typeText('Christian'))
->postBrowserNavigateHook(function (Page $page) {
$page->mouse()->find('#some_button')->click();
}),
);
Predefined Browser Actions
To make performing actions in the browser as easy as possible, the Crwlr\Crawler\Steps\Loading\Http\Browser\BrowserAction class provides various pre-built callback functions for this purpose. For simplicity, the following code examples only show the Http step, without the rest of the crawler code as in the example above.
BrowserAction::screenshot()
Takes a screenshot of the current page in the browser. This action requires an instance of the ScreenshotConfig class, which is used to configure the screenshot settings. The only required argument for ScreenshotConfig::make() is the path where the screenshot will be stored. Other options, like image file type, quality, and full-page capture, are optional and can be set using the respective methods.
use Crwlr\Crawler\Loader\Http\Browser\ScreenshotConfig;
Http::get()
->postBrowserNavigateHook(
BrowserAction::screenshot(
ScreenshotConfig::make('/my/screenshots/store/path')
->setImageFileType('jpeg') // Default 'png'. Available: 'jpeg', 'png', 'webp'
->setQuality(75) // Only available with 'jpeg' and 'webp'. Default 80.
->setFullPage(), // Screenshot the whole page from top to bottom,
// not only the viewport area.
),
);
This browser action is especially useful if you can't disable headless mode in your current environment to observe what actually happens in the browser. With this screenshot action, you can at least capture snapshots of the page at specific moments.
BrowserAction::clickElement()
Click an element that matches a given CSS selector. The action automatically waits until an element matching the selector is rendered, then clicks it.
Http::get()
->postBrowserNavigateHook(
BrowserAction::clickElement('#some_element'),
);
BrowserAction::clickInsideShadowDom()
Sometimes, an element cannot be clicked with BrowserAction::clickElement() because it is located inside a shadow DOM. This action allows you to click an element inside a shadow DOM tree.
Http::get()
->postBrowserNavigateHook(
BrowserAction::clickInsideShadowDom('#shadow_host', '.element-inside-shadow-host'),
);
Note: If you also want the HTML content of shadow DOM elements to be included in the returned HTML source, you can configure the browser like this:
$crawler->getLoader()->browser()->includeShadowElementsInHtml();
BrowserAction::typeText()
Types the specified text. This is commonly used to fill form input fields when one is focused, which can be achieved by using the BrowserAction::clickElement() action before it.
Http::get()
->postBrowserNavigateHook(
BrowserAction::typeText('Hello World!'),
);
BrowserAction::scrollDown()
Scrolls the page down by the specified number of pixels.
Http::get()
->postBrowserNavigateHook(
BrowserAction::scrollDown(500),
);
BrowserAction::scrollUp()
Scrolls the page up by the specified number of pixels.
Http::get()
->postBrowserNavigateHook(
BrowserAction::scrollUp(500),
);
BrowserAction::moveMouseToElement()
Moves the mouse cursor to the element matching the given CSS selector. This can be useful for triggering hover effects.
Http::get()
->postBrowserNavigateHook(
BrowserAction::moveMouseToElement('#some_element'),
);
BrowserAction::moveMouseToPosition()
Moves the mouse cursor to the specified coordinates on the page (x, y).
Http::get()
->postBrowserNavigateHook(
BrowserAction::moveMouseToPosition(250, 400),
);
BrowserAction::evaluate()
Executes JavaScript code in the context of the current page.
Http::get()
->postBrowserNavigateHook(
BrowserAction::evaluate('document.getElementById("some_element").innerHTML = \'Hello\''),
);
BrowserAction::waitUntilDocumentContainsElement()
Waits until an element matching the given CSS selector appears in the DOM before proceeding. This is useful when you need to ensure that a specific element is present before continuing with the next actions.
Http::get()
->postBrowserNavigateHook(
BrowserAction::waitUntilDocumentContainsElement('#some_element'),
);
Note: You don't need to use this action together with BrowserAction::clickElement(), BrowserAction::clickInsideShadowDom(), or BrowserAction::moveMouseToElement(), as these actions already wait for the element to appear internally.
BrowserAction::wait()
Pauses execution for the specified number of seconds before proceeding with the next action. While waiting for a specific DOM element (see BrowserAction::waitUntilDocumentContainsElement()) is usually a better solution, sometimes waiting for a fixed amount of time is the only way to make things work.
Http::get()
->postBrowserNavigateHook(
BrowserAction::wait(2.5),
);
BrowserAction::waitForReload()
Waits for a page reload to occur. This is useful when a preceding action clicks an element (such as a link or button) and you expect the browser to navigate to a different page as a result.
Http::get()
->postBrowserNavigateHook(BrowserAction::clickElement('#some_button'))
->postBrowserNavigateHook(
BrowserAction::waitForReload(),
);
Paginating List Pages
A typical challenge when crawling is listings with items spread over multiple pages. A convenient way to handle this is paginators. Here's a simple example of how to use one:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('https://www.example.com/some/listing')
->addStep(
Http::get()->paginate('#pages')
);
As its first argument, the paginate() method takes either a CSS selector string or an instance of the Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator class. With a CSS selector, the method creates an instance of the SimpleWebsitePaginator class. The selector can either target the link to the next page or just the element containing all the pagination links. The SimpleWebsitePaginator remembers all the URLs it has already loaded, so it won't load any link twice. Keep in mind, though, that pages may not be loaded in the correct order when you select a pagination wrapper element.
As the second argument, the paginate() method takes the maximum number of pages it will load. If you don't provide a value, the default is 1000.
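For example, assuming the next-page link can be selected with the (hypothetical) CSS selector #pages .next and you want to load at most 20 pages:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com/some/listing')
    ->addStep(
        Http::get()->paginate('#pages .next', 20)
    );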
Query Params Paginator
Another paginator implementation shipped with the package is the Crwlr\Crawler\Steps\Loading\Http\Paginators\QueryParamsPaginator. It automatically increases or decreases the values of query parameters, either in the URL or in the request body (e.g. with POST requests).
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;
$crawler = new MyCrawler();
$crawler
->input('https://www.example.com/list')
->addStep(
Http::post(body: 'page=1&offset=0')
->paginate(
Paginator::queryParams()
->inBody() // or ->inUrl() when working with URL query params
->increase('page')
->increase('offset', 20)
)
);
In this example, the page query parameter is increased by one (the default increase value) after each request, and the offset parameter is increased by 20. You also have the option to decrease parameter values as needed using the decrease() method.
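For instance, if a listing counts pages downwards, a sketch using decrease() (starting from a hypothetical page=30) could look like this:
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;
$crawler
    ->input('https://www.example.com/list?page=30')
    ->addStep(
        Http::get()
            ->paginate(
                Paginator::queryParams()
                    ->inUrl()
                    ->decrease('page') // 30, 29, 28, ...
            )
    );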
If you're dealing with a nested query string like pagination[page]=1&pagination[size]=25, you can use dot notation to define the query param to increase or decrease:
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;
$crawler = new MyCrawler();
$crawler
->input('https://www.example.com/list?pagination[page]=1&pagination[size]=25')
->addStep(
Http::get()
->paginate(
Paginator::queryParams()
->inUrl()
->increase('pagination.page')
)
);
However, the issue with this example is that it continuously sends requests until it reaches the default limit of 1000 requests (you can customize this limit by specifying it as a method argument: Paginator::queryParams(300)). What we want to do here is provide the paginator with a rule that determines when it should stop loading further pages based on the responses it receives:
Paginator Stop Rules
Suppose the example.com/list endpoint returns a JSON list of books, with book items stored in data.books. When we reach the end of the list, data.books becomes empty, and we want to stop loading any further pages. To achieve this, we can do:
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;
use Crwlr\Crawler\Steps\Loading\Http\Paginators\StopRules\PaginatorStopRules;
$crawler = new MyCrawler();
$crawler
->input('https://www.example.com/list?page=1')
->addStep(
Http::get()
->paginate(
Paginator::queryParams()
->inUrl()
->increase('page')
->stopWhen(PaginatorStopRules::isEmptyInJson('data.books'))
)
);
As you can see, you can define a so-called stop rule through the stopWhen() method. These stop rules are applicable to any paginator, as they are implemented in the AbstractPaginator class. The package includes several pre-defined stop rules, such as:
PaginatorStopRules::isEmptyResponse()
// Paginator stops when response body is empty.
PaginatorStopRules::isEmptyInJson('data.items')
// Paginator stops when response is empty, or data.items doesn't exist or is empty in JSON response.
PaginatorStopRules::isEmptyInHtml('#search .list .item')
// Paginator stops when response is empty, or the CSS selector `#search .list .item` does not select any nodes.
PaginatorStopRules::isEmptyInXml('channel item')
// Paginator stops when response is empty, or the CSS selector `channel item` does not select any nodes.
PaginatorStopRules::contains('a specific string')
// Paginator stops when response is empty, or the response body contains a specific string.
PaginatorStopRules::notContains('a specific string')
// Paginator stops when response is empty, or the response body does not contain a specific string.
If your use case requires unique criteria, you can also supply a custom Closure.
use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;
use Psr\Http\Message\RequestInterface;
$crawler = new MyCrawler();
$crawler
->input('https://www.example.com/list?page=1')
->addStep(
Http::get()
->paginate(
Paginator::queryParams()
->inUrl()
->increase('page')
->stopWhen(function (RequestInterface $request, ?RespondedRequest $respondedRequest) {
// Based on the $request and the $respondedRequest object provided to the callback
// you can decide if the paginator should stop. In this case, return true.
return true;
})
)
);
Custom Paginators
If the paginators shipped with the package don't fit your needs, you can write your own. A paginator has to extend the Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator class and implement at least a custom getNextRequest() method. Optionally, you can also implement custom versions of the methods processLoaded(), hasFinished(), and logWhenFinished().
use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\UriInterface;
use Psr\Log\LoggerInterface;
class CustomPaginator extends AbstractPaginator
{
public function getNextRequest(): ?RequestInterface
{
// Let's say we paginate URLs with a path like this: /list-of-things/<pageNumber>
$latestRequestUrlPath = $this->latestRequest->getUri()->getPath();
$prevPageNumber = explode('/list-of-things/', $latestRequestUrlPath);
if (count($prevPageNumber) < 2) {
return null;
}
$nextPageNumber = ((int) $prevPageNumber[1]) + 1;
return $this->latestRequest->withUri(
$this->latestRequest->getUri()->withPath('/list-of-things/' . $nextPageNumber)
);
}
}
You can then use your Paginator class like this:
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginators\StopRules\PaginatorStopRules;
$crawler
->input('https://www.example.com/list-of-things/1')
->addStep(
Http::get()
->paginate(new CustomPaginator())
->stopWhen(PaginatorStopRules::isEmptyInHtml('#results .item'))
);
Crawling (whole Websites)
If you want to crawl a whole website, the Http::crawl() step is for you. By default, it just follows all the links it finds until everything on the same host is loaded.
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('https://www.example.com/')
->addStep(Http::crawl());
Depth
You can also tell it to only follow links to a certain depth.
use Crwlr\Crawler\Steps\Loading\Http;
$crawler->input('https://www.example.com/')
->addStep(
Http::crawl()
->depth(2)
);
This means it will load all the URLs it finds on the page from the initial input (in this case https://www.example.com/), then all the links it finds on those pages, and then it'll stop. With a depth of 3, it will load another level of newly found links.
Start with a sitemap
By using the inputIsSitemap() method, you can start crawling with a sitemap.
use Crwlr\Crawler\Steps\Loading\Http;
$crawler->input('https://www.example.com/sitemap.xml')
->addStep(
Http::crawl()
->inputIsSitemap()
);
The crawl step usually assumes that all input URLs will deliver HTML documents, so if you want to start crawling with a sitemap, the call to this method is necessary.
Load URLs on the same domain (instead of host)
As mentioned, by default it loads all the pages on the same host, for example www.example.com. If there's a link to https://jobs.example.com/foo, it won't follow that link, as it is on jobs.example.com. But you can tell it to also load all URLs on the same domain, using the sameDomain() method:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('https://www.example.com/')
->addStep(
Http::crawl()
->sameDomain()
);
Only load URLs matching path criteria
There are two methods you can use to tell it to only load URLs with certain paths:
pathStartsWith()
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('https://www.example.com/')
->addStep(
Http::crawl()
->pathStartsWith('/foo/')
);
In this case, it will only load found URLs where the path starts with /foo/, so for example https://www.example.com/foo/bar, but not https://www.example.com/other/bar.
pathMatches()
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('https://www.example.com/')
->addStep(
Http::crawl()
->pathMatches('/\/bar\//')
);
The pathMatches() method takes a regex to match the paths of found URLs. So in this case, it will load all URLs containing /bar/ anywhere in the path.
Custom Filtering based on URL or Link Element
The customFilter() method allows you to define your own callback function that will be called with any found URL or link:
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;
$crawler
->input('https://www.example.com/')
->addStep(
Http::crawl()
->customFilter(function (Url $url) {
return $url->scheme() === 'https';
})
);
So, this example will only load URLs where the URL scheme is https.
In case the URL was found in an HTML document (not in a sitemap), the Closure also receives the link element as a Crwlr\Crawler\Steps\Dom\HtmlElement instance as the second argument:
use Crwlr\Crawler\Steps\Dom\HtmlElement;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;
$crawler
->input('https://www.example.com/')
->addStep(
Http::crawl()
->customFilter(function (Url $url, ?HtmlElement $linkElement) {
return $linkElement && str_contains($linkElement->text(), 'Foo');
})
);
So, this example will only load links when the link text contains Foo.
Load all URLs but yield only matching
When restricting crawling, e.g. to only paths starting with /foo/, it will only load matching URLs (after the initial input URL). So if some page /some/page contains a link to /foo/quz, that link won't be found, because /some/page won't be loaded. If you want to find all links matching your criteria on the whole website, but yield only the responses of the matching URLs, you can use the loadAllButYieldOnlyMatching() method.
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('https://www.example.com/')
->addStep(
Http::crawl()
->pathStartsWith('/foo')
->loadAllButYieldOnlyMatching()
);
This works for restrictions defined using the path methods (pathStartsWith() and pathMatches()) and also for the customFilter() method. Of course, it doesn't affect depth or staying on the same host or domain.
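For example, combined with customFilter(), the following sketch crawls the whole website but yields only responses for pages that were linked with a link text containing Foo:
use Crwlr\Crawler\Steps\Dom\HtmlElement;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;
$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->customFilter(function (Url $url, ?HtmlElement $linkElement) {
                // Only keep links whose link text contains 'Foo'.
                return $linkElement && str_contains($linkElement->text(), 'Foo');
            })
            ->loadAllButYieldOnlyMatching()
    );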
Use Canonical Links
If a website delivers the same content via multiple URLs (for example, example.com/products?productId=123 and example.com/products/123), it can use canonical links to tell crawlers that a page is a duplicate of another one and which one is the main URL. If you want to avoid loading the same document multiple times, you can tell the Http::crawl() step to use canonical links by calling its useCanonicalLinks() method.
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('https://www.example.com/foo')
->addStep(
Http::crawl()
->useCanonicalLinks()
);
When that method is called, the step will not yield a response if its canonical link URL was already yielded before. If it discovers a link, and a document pointing to that URL via canonical link was already loaded, the newly discovered link is treated as if it was already loaded. Further, this feature also sets the canonical link URL as the effectiveUri of the response.
Keep URL Fragments
By default, the Http::crawl() step throws away the fragment part of all discovered URLs (example.com/path#fragment => example.com/path), because websites only very rarely respond with different content based on the fragment part. If a site that you're crawling does so, you can tell the step to keep the URL fragment by calling the keepUrlFragment() method.
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
->input('https://www.example.com/something')
->addStep(
Http::crawl()
->keepUrlFragment()
);