Documentation for crwlr / crawler (v0.4)

Attention: You're currently viewing the documentation for v0.4 of the crawler package.
This is not the latest version of the package.

HTML Steps

There are two kinds of steps available via static methods of the Html class: steps that get links (urls) from HTML documents, and steps that select data/text via CSS selectors (or XPath queries).

Getting (absolute) links

This can only be used with an instance of RespondedRequest as input, so immediately after an Http loading step. The reason is that the step needs to know the url of the document to resolve relative links in the document to absolute ones.

There are two methods: you can get either one link or all links (optionally matching a CSS selector).
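To illustrate why the document url is required, here is a minimal sketch of resolving links against it in plain PHP. The resolveLink() function is hypothetical and deliberately simplified (it ignores ports, "../" segments, query-only links and protocol-relative links); it is not how the crawler package is implemented.

```php
<?php
// Simplified illustration, NOT the crawler package's implementation.
// resolveLink() is a hypothetical helper written for this example only.
function resolveLink(string $documentUrl, string $href): string
{
    // A link that already has a scheme is absolute; keep it unchanged.
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;
    }

    $base = parse_url($documentUrl);
    $origin = $base['scheme'] . '://' . $base['host'];

    // Root-relative link: append it to scheme and host.
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }

    // Path-relative link: resolve it against the directory of the document.
    $path = $base['path'] ?? '/';
    $directory = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $directory . $href;
}

$documentUrl = 'https://www.example.com/listing/index.html';

echo resolveLink($documentUrl, 'page-2.html'), PHP_EOL;          // https://www.example.com/listing/page-2.html
echo resolveLink($documentUrl, '/about'), PHP_EOL;               // https://www.example.com/about
echo resolveLink($documentUrl, 'https://www.crwl.io/'), PHP_EOL; // https://www.crwl.io/
```

Without the document url, the first two links could not be turned into something a later loading step can request.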

Html::getLink()

Takes the first link in the document. If you provide the optional CSS selector, it takes the first link matching that selector.

Html::getLink();
Html::getLink('#listing #nextPage');

Html::getLinks()

Works exactly the same way, but yields all matching links as separate outputs.

Html::getLinks();
Html::getLinks('.matchingLink');

In both methods, if your CSS selector matches an element that is not a link (<a>) element, it is ignored.

Both steps provide the following chainable methods to filter:

// Only links to urls on the same domain.
Html::getLinks()->onSameDomain();

// Only links to urls not on the same domain.
Html::getLinks()->notOnSameDomain();

// Only links to urls on (a) certain domain(s).
Html::getLinks()->onDomain('example.com');

Html::getLinks()->onDomain(['example.com', 'crwl.io']);

// Only links to urls on the same host (includes subdomain).
Html::getLinks()->onSameHost();

// Only links to urls not on the same host.
Html::getLinks()->notOnSameHost();

// Only links to urls on (a) certain host(s).
Html::getLinks()->onHost('blog.example.com');

Html::getLinks()->onHost(['blog.example.com', 'www.crwl.io']);
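The difference between the domain filters and the host filters can be sketched in plain PHP. This is only an illustration under a loud assumption: naiveDomainOf() is a hypothetical helper that treats the last two labels of a host as its domain, which is wrong for suffixes like "co.uk". How the package itself determines the domain is not shown here.

```php
<?php
// Illustration only, NOT the crawler package's implementation.
// hostOf() and naiveDomainOf() are hypothetical helpers for this example.
function hostOf(string $url): string
{
    return strtolower((string) parse_url($url, PHP_URL_HOST));
}

function naiveDomainOf(string $url): string
{
    // Assumption: the domain is the last two labels of the host.
    $labels = explode('.', hostOf($url));

    return implode('.', array_slice($labels, -2));
}

$documentUrl = 'https://www.example.com/listing';

$links = [
    'https://www.example.com/page-2',
    'https://blog.example.com/post',
    'https://www.crwl.io/',
];

// Same host: the full host, subdomain included, has to match.
$sameHost = array_values(array_filter(
    $links,
    fn (string $link) => hostOf($link) === hostOf($documentUrl)
));

// Same domain: any subdomain of example.com matches.
$sameDomain = array_values(array_filter(
    $links,
    fn (string $link) => naiveDomainOf($link) === naiveDomainOf($documentUrl)
));

print_r($sameHost);   // only https://www.example.com/page-2
print_r($sameDomain); // the www.example.com and blog.example.com links
```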

Selecting data

The main method to select data is extract(), but you always have to use it in combination with one of root, each, first or last.

Html::root()->extract(['title' => 'h1', 'date' => '#main .date']);

Html::each('#listing .item')->extract(['title' => 'h1', 'date' => '#main .date']);

Html::first('#listing .item')->extract(['title' => 'h1', 'date' => '#main .date']);

Html::last('#listing .item')->extract(['title' => 'h1', 'date' => '#main .date']);

root extracts a set of properties from the root of the document. each, first and last extract a set of properties from a list of similar items: each yields one output per matching item, while first and last yield only the first or the last matching item. each is the only one that yields multiple outputs.

The extract method takes an array with the property names that you want to have in the output/result as keys and the CSS selectors as values.
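The shape of the outputs can be sketched with PHP's bundled DOM extension. This is an illustration, not the package's code: extractProperties() is a hypothetical helper, it uses XPath because plain PHP has no CSS selector support, and returning null for a missing element is an assumption made for this example.

```php
<?php
// Illustration only, NOT the crawler package's implementation.
$doc = new DOMDocument();
$doc->loadHTML(
    '<div id="listing">'
    . '<div class="item"><h1>First</h1><span class="date">2022-01-01</span></div>'
    . '<div class="item"><h1>Second</h1><span class="date">2022-02-02</span></div>'
    . '</div>'
);
$xpath = new DOMXPath($doc);

// Hypothetical helper: maps property names to the text of the first match
// found inside the given context node.
function extractProperties(DOMXPath $xpath, DOMNode $context, array $mapping): array
{
    $output = [];

    foreach ($mapping as $property => $query) {
        $match = $xpath->query($query, $context)->item(0);
        $output[$property] = $match ? trim($match->textContent) : null;
    }

    return $output;
}

$mapping = ['title' => './/h1', 'date' => './/span[@class="date"]'];
$items = $xpath->query('//div[@id="listing"]/div[@class="item"]');

// "each": one output per item.
$each = [];
foreach ($items as $item) {
    $each[] = extractProperties($xpath, $item, $mapping);
}

// "first" and "last": a single output.
$first = extractProperties($xpath, $items->item(0), $mapping);
$last = extractProperties($xpath, $items->item($items->length - 1), $mapping);

print_r($each);  // two arrays, each with the keys "title" and "date"
print_r($first); // title "First", date "2022-01-01"
print_r($last);  // title "Second", date "2022-02-02"
```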

Accessing other Node Values

By default, the CSS selectors return the text of the selected node. But of course you can also get other values:

Html::last('#listing .item')->extract([
    'default' => Dom::cssSelector('.default')->text(),
    'foo' => Dom::cssSelector('.foo')->innerText(),
    'bar' => Dom::cssSelector('.bar')->html(),
    'baz' => Dom::cssSelector('.baz')->outerHtml(),
    'test' => Dom::cssSelector('.test')->attribute('data-test'),
]);

text
You don't have to use this explicitly, it's the default when you only provide the selector as string. It gets the text inside the node including children.

innerText
Gets only the text directly inside the node. Excludes text from child nodes.

html
Gets the html source inside the selected element.

outerHtml
Gets the html of the selected element including the element itself.

attribute(x)
Gets the value of attribute x of the selected element.
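The differences between these values are easiest to see side by side. The following sketch reproduces them with PHP's bundled DOM extension on a small example element; it is an illustration of the concepts, not the package's implementation, and details such as whitespace handling may differ in the package.

```php
<?php
// Illustration only, NOT the crawler package's implementation.
$doc = new DOMDocument();
$doc->loadHTML('<p class="price" data-test="eur">Price: <b>42</b></p>');

$xpath = new DOMXPath($doc);
$node = $xpath->query('//p[@class="price"]')->item(0);

// text: all text inside the node, including text of child nodes.
$text = $node->textContent;

// innerText: only the text directly inside the node, without child nodes.
$innerText = '';
foreach ($node->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        $innerText .= $child->nodeValue;
    }
}

// html: the html source inside the element.
$html = '';
foreach ($node->childNodes as $child) {
    $html .= $doc->saveHTML($child);
}

// outerHtml: the html of the element including the element itself.
$outerHtml = $doc->saveHTML($node);

// attribute('data-test'): the value of one attribute.
$attribute = $node->getAttribute('data-test');

var_dump($text);      // "Price: 42"
var_dump($innerText); // "Price: "
var_dump($html);      // "Price: <b>42</b>"
var_dump($outerHtml); // "<p class="price" data-test="eur">Price: <b>42</b></p>"
var_dump($attribute); // "eur"
```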

Using XPath instead of CSS selectors

The Xml and Html steps both have the same base class (Dom), which uses the Symfony DomCrawler behind the scenes to extract data. By default, Html steps use CSS selectors and Xml steps use XPath queries. But if you want to, you can also use XPath for Html:

Html::each(Dom::xPath('//div[@id=\'bookstore\']/div[@class=\'book\']'))
    ->extract([
        'title' => Dom::xPath('//h3[@class=\'title\']'),
        'author' => Dom::xPath('//*[@class=\'author\']'),
        'year' => Dom::xPath('//span[@class=\'year\']'),
    ]);