Documentation for crwlr / crawler (v1.7)

Json

The Json step has three static methods:

  • Json::all() to extract the whole JSON object
  • Json::get() to cherry-pick properties from the JSON object
  • and Json::each() to extract multiple items from an array in the JSON object

Json::all()

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Json;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = HttpCrawler::make()->withUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com/json')
    ->addStep(Http::get())
    ->addStep(Json::all());
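
For orientation, here is a plain-PHP sketch of what "extracting the whole JSON object" amounts to. It assumes the step's output corresponds to the response body decoded into nested associative arrays; that is an assumption for illustration, not a statement about the library's internals. The $body string is a made-up stand-in for a loaded HTTP response, and only PHP's built-in json_decode() is used.

```php
<?php

// Hypothetical response body, standing in for what the Http step would load.
$body = '{"data": {"something": "yolo", "target": {"foo": "Lorem ipsum"}}}';

// Decode the whole document into nested associative arrays.
// JSON_THROW_ON_ERROR makes invalid JSON raise a JsonException instead of returning null.
$all = json_decode($body, true, 512, JSON_THROW_ON_ERROR);

// Every property of the document is available in the decoded result.
var_dump($all['data']['target']['foo']);
```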

Json::get()

The Json::get() method works much like the extract method of the Html and Xml steps. Thanks to adbario/php-dot-notation, extracting data from JSON documents is simple: you address each property with a dot-notation path. Suppose the URL https://www.example.com/json responds with the following JSON:

{
    "data": {
        "something": "yolo",
        "target": {
            "foo": "Lorem ipsum",
            "bar": "dolor sit",
            "array": [
                { "baz": "zero" },
                { "baz": "one" },
                { "baz": "two" }
            ]
        }
    }
}

Cherry-pick your desired properties like this:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Json;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = HttpCrawler::make()->withUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com/json')
    ->addStep(Http::get())
    ->addStep(
        Json::get([
            'foo' => 'data.target.foo',
            'bar' => 'data.target.array.1.baz',
        ])
    );

The output of the Json step is then:

array(2) {
  ["foo"]=>
  string(11) "Lorem ipsum"
  ["bar"]=>
  string(3) "one"
}
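
To make the dot-notation paths more concrete, the following is a minimal plain-PHP sketch of how such a path can be resolved against the decoded example JSON. The dotGet() helper is hypothetical and written only for this illustration; it is not part of crwlr/crawler or adbario/php-dot-notation, and returning null for a missing path is a choice of the sketch, not documented library behavior. It requires PHP 8 for the mixed return type.

```php
<?php

/**
 * Hypothetical helper: resolve a dot-notation path against decoded JSON.
 * Returns null when any segment of the path is missing (a choice of this sketch).
 */
function dotGet(array $data, string $path): mixed
{
    $current = $data;

    foreach (explode('.', $path) as $segment) {
        // Numeric segments such as "1" address list indexes,
        // because PHP normalizes numeric string keys to integers.
        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return null;
        }

        $current = $current[$segment];
    }

    return $current;
}

$json = <<<'JSON'
{
    "data": {
        "something": "yolo",
        "target": {
            "foo": "Lorem ipsum",
            "bar": "dolor sit",
            "array": [
                { "baz": "zero" },
                { "baz": "one" },
                { "baz": "two" }
            ]
        }
    }
}
JSON;

$document = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

// Same mapping as in the Json::get() example: output key => dot-notation path.
$mapping = [
    'foo' => 'data.target.foo',
    'bar' => 'data.target.array.1.baz',
];

$output = [];

foreach ($mapping as $key => $path) {
    $output[$key] = dotGet($document, $path);
}

var_dump($output);
```

The path data.target.array.1.baz walks through the data and target objects, picks the element at index 1 of array, and reads its baz property, which is why the mapped value is "one".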

Json::each()

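You can also extract multiple items from an array in the JSON object by using the Json::each() method. Its first argument is the dot-notation path to the array, and its second argument is the property mapping applied to each item. Let's say the JSON looks like this: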

{
    "list": {
        "people": [
            { "name": "Hans Zimmer", "age": { "years": 66 }, "home": "US" },
            { "name": "John Williams", "age": { "years": 92 }, "home": "US" },
            { "name": "Alan Silvestri", "age": { "years": 73 }, "home": "US" }
        ]
    }
}

You can get the names and ages like this:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Json;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = HttpCrawler::make()->withUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com/json')
    ->addStep(Http::get())
    ->addStep(
        Json::each(
            'list.people',
            [ // Provide the data mapping as the second argument to the each() method.
                'name' => 'name',
                'age' => 'age.years',
            ]
        )
    );

This yields 3 separate outputs:

array(2) {
  ["name"]=>
  string(11) "Hans Zimmer"
  ["age"]=>
  int(66)
}
array(2) {
  ["name"]=>
  string(13) "John Williams"
  ["age"]=>
  int(92)
}
array(2) {
  ["name"]=>
  string(14) "Alan Silvestri"
  ["age"]=>
  int(73)
}
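
The per-item behavior can be sketched in plain PHP as well. The resolvePath() and eachItem() helpers below are hypothetical names invented for this illustration, not library functions; the sketch assumes that the mapping paths are resolved relative to each item of the array found at the first path, which is what the example above suggests. It requires PHP 8 for the mixed return type.

```php
<?php

// Hypothetical helper: walk a dot-notation path through nested arrays.
// Returns null when a segment is missing (a choice of this sketch).
function resolvePath(array $data, string $path): mixed
{
    $current = $data;

    foreach (explode('.', $path) as $segment) {
        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return null;
        }

        $current = $current[$segment];
    }

    return $current;
}

// Hypothetical helper: yield one mapped array per item of the list at $listPath.
function eachItem(array $document, string $listPath, array $mapping): Generator
{
    $items = resolvePath($document, $listPath);

    if (!is_array($items)) {
        return; // Nothing to iterate over.
    }

    foreach ($items as $item) {
        if (!is_array($item)) {
            continue; // Skip scalar entries; the mapping needs an object.
        }

        $output = [];

        // Paths in the mapping are relative to the current item.
        foreach ($mapping as $key => $path) {
            $output[$key] = resolvePath($item, $path);
        }

        yield $output;
    }
}

$json = <<<'JSON'
{
    "list": {
        "people": [
            { "name": "Hans Zimmer", "age": { "years": 66 }, "home": "US" },
            { "name": "John Williams", "age": { "years": 92 }, "home": "US" },
            { "name": "Alan Silvestri", "age": { "years": 73 }, "home": "US" }
        ]
    }
}
JSON;

$document = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

$outputs = iterator_to_array(
    eachItem($document, 'list.people', ['name' => 'name', 'age' => 'age.years']),
    false
);

var_dump($outputs);
```

Properties that are not part of the mapping, such as home, do not appear in the mapped arrays.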