Documentation for crwlr / crawler (v1.10)

XML Steps

The Xml step extends the same base class (Dom) as the Html step, but uses XPath queries by default instead of CSS selectors. Selecting data from an XML document therefore looks much the same as selecting it from HTML:

use Crwlr\Crawler\Steps\Xml;

Xml::root()->extract('//title');

Xml::root()->extract(['title' => '//title', 'author' => '//author']);

Xml::each('bookstore/book')->extract(['title' => '//title', 'author' => '//author']);

Xml::first('bookstore/book')->extract(['title' => '//title', 'author' => '//author']);

Xml::last('bookstore/book')->extract(['title' => '//title', 'author' => '//author']);

root extracts data from the root of the document. each, first and last extract data from a list of similar items: each yields one output per matching item, while first and last yield a single output for the first or last matching item.

The extract method takes either a single XPath query or an array of queries, whose keys name the data properties being extracted.
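To make the difference between each, first and last concrete, here is a minimal sketch in plain PHP using the built-in DOM extension. It is not how the library implements these steps. The sample XML and the helper extractFrom() are made up for illustration. The library examples above write '//title', which is presumably evaluated relative to each item. Plain DOMXPath needs the explicit './/title' form to stay inside the current item.

```php
<?php
// Illustration only: plain ext-dom, not the crwlr implementation.

$xml = <<<XML
<bookstore>
  <book><title>Book A</title><author>Ann</author></book>
  <book><title>Book B</title><author>Bob</author></book>
  <book><title>Book C</title><author>Cy</author></book>
</bookstore>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

/**
 * Hypothetical helper: applies a mapping of property name => XPath query,
 * evaluating each query relative to the given context node.
 */
function extractFrom(DOMXPath $xpath, DOMNode $node, array $mapping): array
{
    $result = [];

    foreach ($mapping as $key => $query) {
        // Take the first matching node (if any) and use its text content.
        $match = $xpath->query($query, $node)->item(0);
        $result[$key] = $match ? $match->textContent : null;
    }

    return $result;
}

// The leading "." keeps each query inside the current <book> element.
$mapping = ['title' => './/title', 'author' => './/author'];

$books = iterator_to_array($xpath->query('/bookstore/book'));

// "each": one result per matching item.
$each = array_map(fn (DOMNode $book) => extractFrom($xpath, $book, $mapping), $books);

// "first" and "last": a single result from the first or last matching item.
$first = extractFrom($xpath, $books[0], $mapping);
$last = extractFrom($xpath, $books[count($books) - 1], $mapping);

print_r($each);
print_r($first);
print_r($last);
```

In this sketch $each holds three arrays, one per book, while $first and $last each hold a single array with the keys title and author.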

Nesting Extracted Data

If you use the extract() method with a mapping array, you can also use another Xml step as a value to achieve nesting:

use Crwlr\Crawler\Steps\Xml;

Xml::each('//events/event')
    ->extract([
        'title' => '//name',
        'location' => '//location',
        'date' => '//date',
        'talks' => Xml::each('//talks/talk')->extract([
            'title' => '//title',
            'speaker' => '//speaker',
        ])
    ]);
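The shape of data that such a nested mapping describes can be sketched in plain PHP with the built-in DOM extension. This is an illustration under assumptions, not the library's code: the sample XML and the helper firstText() are invented here, and the queries use the './/' form that plain DOMXPath needs for relative lookups.

```php
<?php
// Illustration only: plain ext-dom, not the crwlr implementation.

$xml = <<<XML
<events>
  <event>
    <name>PHP Conf</name>
    <location>Vienna</location>
    <date>2023-05-01</date>
    <talks>
      <talk><title>Crawling 101</title><speaker>Ann</speaker></talk>
      <talk><title>XPath Tips</title><speaker>Bob</speaker></talk>
    </talks>
  </event>
</events>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

/**
 * Hypothetical helper: text of the first node matching $query,
 * evaluated relative to $node, or null if nothing matches.
 */
function firstText(DOMXPath $xpath, string $query, DOMNode $node): ?string
{
    $match = $xpath->query($query, $node)->item(0);

    return $match ? $match->textContent : null;
}

$events = [];

foreach ($xpath->query('//events/event') as $event) {
    // Inner loop: plays the role of the nested step under the 'talks' key.
    $talks = [];

    foreach ($xpath->query('.//talks/talk', $event) as $talk) {
        $talks[] = [
            'title' => firstText($xpath, './/title', $talk),
            'speaker' => firstText($xpath, './/speaker', $talk),
        ];
    }

    $events[] = [
        'title' => firstText($xpath, './/name', $event),
        'location' => firstText($xpath, './/location', $event),
        'date' => firstText($xpath, './/date', $event),
        'talks' => $talks,
    ];
}

print_r($events);
```

Each event becomes one array, and its talks key holds a list with one array per talk.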

Accessing other Node Values

By default, an XPath query returns the text of the selected node. You can also get other values:

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Xml;

Xml::first('listing/item')->extract([
    'default' => Dom::xPath('//default')->text(),
    'foo' => Dom::xPath('//foo')->innerText(),
    'bar' => Dom::xPath('//bar')->html(),
    'baz' => Dom::xPath('//baz')->outerHtml(),
    'test' => Dom::xPath('//test')->attribute('test'),
]);

text
You don't have to use this explicitly. It is the default when you provide only the selector as a string. It gets the text inside the node, including the text of its children.

innerText
Gets only the text directly inside the node. Excludes text from child nodes.

html
Gets the XML source inside the selected element.

outerHtml
Gets the XML source of the selected element, including the element itself.

attribute(x)
Gets the value of attribute x of the selected element.
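The differences between these values are easiest to see on a small element with mixed content. The following sketch uses only PHP's built-in DOM extension to produce rough equivalents of each value. It is an approximation for illustration: the sample element is made up, and the library may treat details such as whitespace differently.

```php
<?php
// Illustration only: plain ext-dom analogues, not the crwlr implementation.

$doc = new DOMDocument();
$doc->loadXML('<item id="42">Hello <b>big</b> world</item>');
$node = $doc->documentElement;

// text: all text inside the node, including text of child elements.
$text = $node->textContent;

// innerText: only the text nodes that are direct children of the node.
$innerText = '';

foreach ($node->childNodes as $child) {
    if ($child instanceof DOMText) {
        $innerText .= $child->nodeValue;
    }
}

// html: the serialized source of everything inside the element.
$html = '';

foreach ($node->childNodes as $child) {
    $html .= $doc->saveXML($child);
}

// outerHtml: the serialized source including the element itself.
$outerHtml = $doc->saveXML($node);

// attribute('id'): the value of a single attribute.
$attribute = $node->getAttribute('id');

var_dump($text, $innerText, $html, $outerHtml, $attribute);
```

Here $text contains the word "big" from the child element, while $innerText does not, and $outerHtml differs from $html only by the surrounding item tags.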

Using CSS selectors instead of XPath queries

By default, Xml steps use XPath queries, but you can also use CSS selectors with the Xml step:

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Xml;

Xml::each(Dom::cssSelector('bookstore book'))
    ->extract([
        'title' => Dom::cssSelector('title'),
        'author' => Dom::cssSelector('author'),
        'year' => Dom::cssSelector('year'),
    ]);
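When switching between the two query styles, keep in mind that a space in a CSS selector is the descendant combinator, so 'bookstore book' matches book elements at any depth below bookstore, whereas the XPath 'bookstore/book' matches direct children only. The sketch below shows the two behaviours with plain DOMXPath. The sample XML, including the shelf element, is invented for illustration, and the XPath expressions are hand-written equivalents, not output of the library.

```php
<?php
// Illustration only: XPath equivalents of two CSS combinators, using ext-dom.

$xml = <<<XML
<bookstore>
  <book><title>Direct</title></book>
  <shelf>
    <book><title>Nested</title></book>
  </shelf>
</bookstore>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Equivalent of the CSS selector "bookstore book" (any descendant).
$descendantCount = $xpath->query('//bookstore//book')->length;

// Equivalent of the CSS selector "bookstore > book" (direct children only).
$childCount = $xpath->query('//bookstore/book')->length;

var_dump($descendantCount, $childCount);
```

With this sample document the descendant query finds both books, and the child query finds only the one directly below bookstore.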