url

Installation

Install the latest version with:

composer require crwlr/url

Usage

Including the package

<?php

include('vendor/autoload.php');

use Crwlr\Url\Url;

To start using the library, include Composer's autoload file and import the Url class so you don't have to write out the full namespace every time. The remaining code examples omit these lines.

Parsing urls

$url = Url::parse('https://john:123@www.example.com:8080/foo?bar=baz');

// Accessing url components as properties
$scheme = $url->scheme;                 // => "https"
$user = $url->user;                     // => "john"
$host = $url->host;                     // => "www.example.com"
$domain = $url->domain;                 // => "example.com"

// Or via method calls
$port = $url->port();                   // => 8080
$domainSuffix = $url->domainSuffix();   // => "com"
$path = $url->path();                   // => "/foo"
$fragment = $url->fragment();           // => NULL

Available url components

Below is a list of all components the Url class handles. In each example url, the part set off by spaces is what the component returns.

  • scheme
    https ://john:123@subdomain.example.com:8080/foo?bar=baz#anchor
  • user
    https:// john :123@subdomain.example.com:8080/foo?bar=baz#anchor
  • pass or password (alias)
    https://john: 123 @subdomain.example.com:8080/foo?bar=baz#anchor
  • host
    https://john:123@ subdomain.example.com :8080/foo?bar=baz#anchor
  • domain
    https://john:123@subdomain. example.com :8080/foo?bar=baz#anchor
  • domainLabel
    https://john:123@subdomain. example .com:8080/foo?bar=baz#anchor
  • domainSuffix
    https://john:123@subdomain.example. com :8080/foo?bar=baz#anchor
  • subdomain
    https://john:123@ subdomain .example.com:8080/foo?bar=baz#anchor
  • port
    https://john:123@subdomain.example.com: 8080 /foo?bar=baz#anchor
  • path
    https://john:123@subdomain.example.com:8080 /foo ?bar=baz#anchor
  • query
    https://john:123@subdomain.example.com:8080/foo? bar=baz #anchor
  • fragment
    https://john:123@subdomain.example.com:8080/foo?bar=baz# anchor

When a component is not present in a url (e.g. the url contains no user and password), the corresponding property returns NULL.
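As a quick check of the list above, the following sketch parses the example url and reads every component through its method. It assumes the package is installed via Composer and that each component name from the list can be called as a method, as the parsing section describes; the variable names are only illustrative.

```php
<?php

include('vendor/autoload.php');

use Crwlr\Url\Url;

$url = Url::parse('https://john:123@subdomain.example.com:8080/foo?bar=baz#anchor');

// Collect every component from the list above, keyed by its name.
$components = [
    'scheme'       => $url->scheme(),
    'user'         => $url->user(),
    'pass'         => $url->pass(),        // password() is listed as an alias
    'host'         => $url->host(),
    'domain'       => $url->domain(),
    'domainLabel'  => $url->domainLabel(),
    'domainSuffix' => $url->domainSuffix(),
    'subdomain'    => $url->subdomain(),
    'port'         => $url->port(),
    'path'         => $url->path(),
    'query'        => $url->query(),
    'fragment'     => $url->fragment(),
];

foreach ($components as $name => $value) {
    // var_export() makes NULL visible instead of printing an empty string.
    echo $name . ': ' . var_export($value, true) . "\n";
}

// A url without credentials: per the note above, user should be NULL.
$plain = Url::parse('https://www.example.com/foo');

if ($plain->user() === null) {
    echo "This url has no user component.\n";
}
```

Comparing against null with === avoids mistaking an empty or "0" value for a missing component.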

Combinations of components

root

Sometimes it is helpful to get what this package calls the root: everything that comes before the path component.

$url = Url::parse('https://www.example.com:8080/foo?bar=baz');
$root = $url->root();   // => "https://www.example.com:8080"

relative

Complementary to the root, you can also retrieve everything from the path onward (path, query and fragment) combined, via relative. It's called relative because the result is like a relative url (without scheme and host information).

$url = Url::parse('https://www.example.com/foo?bar=baz#anchor');
$relative = $url->relative();   // => "/foo?bar=baz#anchor"
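Since root covers everything before the path and relative covers everything from the path onward, joining the two should give back the full url. The sketch below illustrates that relationship, assuming the package is installed via Composer. That both parts rejoin to the exact output of toString() is inferred from the two descriptions above, not a documented guarantee.

```php
<?php

include('vendor/autoload.php');

use Crwlr\Url\Url;

$url = Url::parse('https://www.example.com:8080/foo?bar=baz#anchor');

$root = $url->root();           // everything before the path
$relative = $url->relative();   // path, query and fragment

// Joining both parts should reproduce the complete url.
$rejoined = $root . $relative;

var_dump($rejoined === $url->toString());
```

This split is handy when you want to keep the relative part and attach it to a different root, for example when moving links from a staging host to a production host.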

Parsing a query string

If you're after the query of a url, you may want it as an array rather than a string. queryArray() returns exactly that:

$url = Url::parse('https://www.example.com/foo?bar=baz&key=value');
var_dump($url->queryArray());

Output

array(2) {
  ["bar"]=>
  string(3) "baz"
  ["key"]=>
  string(5) "value"
}
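With the array in hand, reading a single parameter is plain array access. A minimal sketch, assuming the package is installed via Composer; the parameter name page and its fallback value are made up for illustration.

```php
<?php

include('vendor/autoload.php');

use Crwlr\Url\Url;

$url = Url::parse('https://www.example.com/foo?bar=baz&key=value');
$query = $url->queryArray();

// The null coalescing operator supplies a fallback for parameters
// that are not part of the url.
$bar = $query['bar'] ?? null;     // present in the url
$page = $query['page'] ?? '1';    // hypothetical parameter, not present

var_dump($bar, $page);
```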

Modifying urls

Every method that returns a component's value can also set or replace it when called with an argument. For example, if you have an array of urls and want to make sure they all use https, you can do it like this:

$urls = [
    'https://www.example.com',
    'http://notsecure.example.org/foo',
    'https://secure.example.org/bar',
    'http://www.example.com/baz'
];

foreach ($urls as $key => $url) {
    $urls[$key] = Url::parse($url)->scheme('https')->toString();
}

var_dump($urls);

Output

array(4) {
  [0]=>
  string(24) "https://www.example.com/"
  [1]=>
  string(33) "https://notsecure.example.org/foo"
  [2]=>
  string(30) "https://secure.example.org/bar"
  [3]=>
  string(27) "https://www.example.com/baz"
}

Another example: most websites can be reached with or without the www subdomain. If you have an array of urls and want to ensure that they all point to the version with www:

$urls = [
    'https://www.example.com/stuff',
    'https://example.com/yolo',
    'https://example.com/products',
    'https://www.example.com/contact',
];

$urls = array_map(function($url) {
    return Url::parse($url)->host('www.example.com')->toString();
}, $urls);

var_dump($urls);

Output

array(4) {
  [0]=>
  string(29) "https://www.example.com/stuff"
  [1]=>
  string(28) "https://www.example.com/yolo"
  [2]=>
  string(32) "https://www.example.com/products"
  [3]=>
  string(31) "https://www.example.com/contact"
}

The same works for all components listed under the available url components. For the query string you can also provide an array:

$url = Url::parse('https://www.example.com/foo');
$url->queryArray(['param' => 'value', 'marco' => 'polo']);
echo $url;

Output

https://www.example.com/foo?param=value&marco=polo

As the example above shows, you can use a Url object like a string thanks to its __toString() method.
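In the scheme and host examples above, toString() is chained directly onto the setter call, which suggests that setters return the Url object. Under that assumption, several components can be changed in one expression. The sketch below combines both earlier examples; it assumes the package is installed via Composer, and the input urls are invented for illustration.

```php
<?php

include('vendor/autoload.php');

use Crwlr\Url\Url;

$urls = [
    'http://example.com/yolo',
    'https://www.example.com/contact',
];

// Force https and the www host in a single chained expression.
$normalized = array_map(function ($url) {
    return Url::parse($url)
        ->scheme('https')
        ->host('www.example.com')
        ->toString();
}, $urls);

var_dump($normalized);
```

Both entries should end up on https://www.example.com with their paths unchanged.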

Resolving relative urls

When you scrape urls from a website, you will come across relative urls like /path/to/page, ../path/to/page, ?param=value, #anchor and the like. This package makes it easy to resolve them to absolute urls, using the url of the page they were found on.

$documentUrl = Url::parse('https://www.example.com/foo/bar/baz');

$relativeLinks = [
    '/path/to/page',
    '../path/to/page',
    '?param=value',
    '#anchor'
];

$absoluteLinks = array_map(function($relativeLink) use ($documentUrl) {
    return $documentUrl->resolve($relativeLink)->toString();
}, $relativeLinks);

var_dump($absoluteLinks);

Output

array(4) {
  [0]=>
  string(36) "https://www.example.com/path/to/page"
  [1]=>
  string(40) "https://www.example.com/foo/path/to/page"
  [2]=>
  string(47) "https://www.example.com/foo/bar/baz?param=value"
  [3]=>
  string(42) "https://www.example.com/foo/bar/baz#anchor"
}

If you pass an absolute url to resolve() it will just return that absolute url.

Comparing url components

You can compare individual components of two different urls with compare().

$url1 = Url::parse('https://www.example.com/foo/bar');
$url2 = Url::parse('https://www.example.org/contact?key=value');

if ($url1->compare($url2, 'host')) {
    echo "Urls 1 and 2 ARE on the same host.\n";
} else {
    echo "Urls 1 and 2 ARE NOT on the same host.\n";
}

if ($url1->compare($url2, 'subdomain')) {
    echo "Urls 1 and 2 ARE on the same subdomain.\n";
} else {
    echo "Urls 1 and 2 ARE NOT on the same subdomain.\n";
}

if ($url1->compare($url2, 'query')) {
    echo "Urls 1 and 2 HAVE the same query.\n";
} else {
    echo "Urls 1 and 2 DO NOT HAVE the same query.\n";
}

Output

Urls 1 and 2 ARE NOT on the same host.
Urls 1 and 2 ARE on the same subdomain.
Urls 1 and 2 DO NOT HAVE the same query.

Again, this works with all components listed under the available url components. Instead of a Url object ($url2 in the example above), you can also pass a url as a string.

$url1 = Url::parse('https://www.example.com/foo/bar');
$url2 = 'https://www.example.org/foo/bar?key=value';

if ($url1->compare($url2, 'path')) {
    echo "Urls 1 and 2 HAVE the same path.\n";
} else {
    echo "Urls 1 and 2 DO NOT HAVE the same path.\n";
}

Output

Urls 1 and 2 HAVE the same path.
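Resolving and comparing combine naturally when filtering scraped links. The sketch below resolves each link against the document url and keeps only those on the same host. It assumes the package is installed via Composer and that resolve() returns a Url object for absolute input as well. The link values and variable names are invented for illustration.

```php
<?php

include('vendor/autoload.php');

use Crwlr\Url\Url;

$documentUrl = Url::parse('https://www.example.com/foo/bar');

// Hypothetical links as they might appear in a scraped page.
$links = [
    '/contact',
    '?page=2',
    'https://www.example.org/elsewhere',
];

$internalLinks = [];

foreach ($links as $link) {
    // Relative links become absolute; absolute ones are returned as they are.
    $resolved = $documentUrl->resolve($link);

    // Keep the link only if it points to the same host as the document.
    if ($documentUrl->compare($resolved, 'host')) {
        $internalLinks[] = $resolved->toString();
    }
}

var_dump($internalLinks);
```

Comparing on 'domain' instead of 'host' would also keep links to other subdomains of example.com, which may or may not be what a crawler wants.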

Internationalized domain names (IDN)

echo Url::parse('https://www.пример.онлайн/hello/world')->toString();

Output

https://www.xn--e1afmkfd.xn--80asehdb/hello/world

Behind the scenes, true/punycode is used to parse internationalized domain names.

Updating Mozilla's Public Suffix List

Mozilla's Public Suffix List is parsed and stored in a file in this package to be able to extract the domain suffix from a url's host component. It should be updated with every new release of this package. If you need to get the latest version of the list immediately, because a particular new suffix isn't included in the list in this repository, you can update it using the following composer command:

composer update-suffixes

Note: Please don't overuse this, as Mozilla states on their page:

If you wish to make your app download an updated list periodically, please use this URL and have your app download the list no more than once per day. (The list usually changes a few times per week; more frequent downloading is pointless and hammers our servers.)

https://publicsuffix.org/list/