Documentation for crwlr / crawler (v0.2)

Attention: You're currently viewing the documentation for v0.2 of the crawler package.
This is not the latest version of the package.

XML Steps

The Xml step extends the same base class (Dom) as the Html step, but uses XPath queries by default instead of CSS selectors. Selecting data from an XML document therefore looks much the same as selecting from HTML:

Xml::root()->extract(['title' => '//title', 'author' => '//author']);

Xml::each('bookstore/book')->extract(['title' => '//title', 'author' => '//author']);

Xml::first('bookstore/book')->extract(['title' => '//title', 'author' => '//author']);

Xml::last('bookstore/book')->extract(['title' => '//title', 'author' => '//author']);

root extracts a set of properties from the root of the document. each, first and last all extract a set of properties from a list of similar items. each is the only one that yields multiple outputs (one per matching item); first and last yield a single output.

The extract method takes an array that maps each property name you want in the output/result (the key) to the XPath query that selects its value.
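To make the difference between each, first and last concrete, here is a minimal sketch that uses only PHP's built-in DOMDocument and DOMXPath, not the crawler package itself. The sample XML and the extractBook helper are hypothetical and exist only for illustration; this is not how the Xml step is implemented. The sketch writes `.//title` because plain DOMXPath treats `//title` as a document-wide query, so the leading dot is needed to stay inside the current item.

```php
<?php
// Plain DOM sketch (not crwlr code): which items each, first and last would cover.

// Hypothetical sample document.
$xml = <<<XML
<bookstore>
    <book><title>Book One</title><author>Alice</author></book>
    <book><title>Book Two</title><author>Bob</author></book>
    <book><title>Book Three</title><author>Carol</author></book>
</bookstore>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Hypothetical helper: maps property names to values, relative to one item node.
function extractBook(DOMXPath $xpath, DOMNode $book): array
{
    return [
        'title'  => $xpath->evaluate('string(.//title)', $book),
        'author' => $xpath->evaluate('string(.//author)', $book),
    ];
}

// The list of similar items, comparable to the 'bookstore/book' argument.
$books = iterator_to_array($xpath->query('/bookstore/book'));

// "each": one output per item.
$each = array_map(fn (DOMNode $book) => extractBook($xpath, $book), $books);

// "first" and "last": a single output for the first or last item.
$first = extractBook($xpath, $books[0]);
$last = extractBook($xpath, $books[count($books) - 1]);

print_r($each);
print_r($first);
print_r($last);
```

With this sample, $each holds three arrays, while $first and $last each hold one array with the keys title and author.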

Accessing other Node Values

By default, an XPath query returns the text of the selected node, but you can also get other values:

Xml::first('listing/item')->extract([
    'default' => Dom::xPath('//default')->text(),
    'foo' => Dom::xPath('//foo')->innerText(),
    'bar' => Dom::xPath('//bar')->html(),
    'baz' => Dom::xPath('//baz')->outerHtml(),
    'test' => Dom::xPath('//test')->attribute('test'),
]);

text
You don't have to call this explicitly; it's the default when you provide the selector only as a string. It gets the text inside the node, including the text of its children.

innerText
Gets only the text directly inside the node. Excludes text from child nodes.

html
Gets the XML source inside the selected element.

outerHtml
Gets the XML source of the selected element, including the element itself.

attribute(x)
Gets the value of attribute x of the selected element.
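The five value types above can be approximated with PHP's built-in DOM classes. This is only a sketch of what each description means, under the assumption that the behavior matches the wording above; it does not use the crawler package, the sample XML is hypothetical, and the package's exact whitespace handling may differ.

```php
<?php
// Plain DOM sketch (not crwlr code) of the five value types.

// Hypothetical sample element with text, a child element and an attribute.
$xml = '<listing><item><foo test="yes">outer <b>inner</b> tail</foo></item></listing>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$foo = (new DOMXPath($doc))->query('//foo')->item(0);

// text: all text inside the node, including text of child elements.
$text = $foo->textContent;

// innerText: only text nodes that are direct children of the node.
$innerText = '';
foreach ($foo->childNodes as $child) {
    if ($child instanceof DOMText) {
        $innerText .= $child->nodeValue;
    }
}

// html: the XML source inside the element (its serialized children).
$html = '';
foreach ($foo->childNodes as $child) {
    $html .= $doc->saveXML($child);
}

// outerHtml: the XML source of the element, including the element itself.
$outerHtml = $doc->saveXML($foo);

// attribute('test'): the value of the attribute named "test".
$attribute = $foo->getAttribute('test');

var_dump($text, $innerText, $html, $outerHtml, $attribute);
```

In this sample, $text contains "inner" because it comes from a child element, while $innerText does not.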

Using CSS selectors instead of XPath queries

By default, Xml steps use XPath queries, but you can also use CSS selectors if you prefer:

Xml::each(Dom::cssSelector('bookstore book'))
    ->extract([
        'title' => Dom::cssSelector('title'),
        'author' => Dom::cssSelector('author'),
        'year' => Dom::cssSelector('year'),
    ]);