Prevent Homograph Attacks using the crwlr/url Package

2022-01-19

This post is not about crawling or scraping, but about another valuable use case for the url package: preventing so-called homograph attacks.

About the attack

Homograph attacks use internationalized domain names (IDNs) to build malicious links whose domains look like those of trusted organizations. You might know attacks that try to trick you with typos like faecbook, or with zeros instead of Os (g00gle). With internationalized domain names this kind of attack is even harder to spot, because the attackers use characters that look almost exactly like other characters (also depending on the font they're displayed in).

Can you see the difference between those two As?

a а

No? In fact, they aren't the same. The second one is a Cyrillic character.
You can check this, for example, with PHP's ord function, which returns the value of the first byte of a string.

var_dump(ord('a')); // int(97)
var_dump(ord('а')); // int(208)
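Because ord only looks at the first byte, the 208 above is just the leading byte of the two-byte UTF-8 sequence for the Cyrillic letter. As a small sketch (assuming PHP 7.2 or later with the mbstring extension available, and a UTF-8 encoded source file), you can also look at the raw bytes and the full Unicode code points. The variable names are made up for this illustration.

```php
<?php

$latin = 'a';            // U+0061 LATIN SMALL LETTER A
$cyrillic = "\u{0430}";  // U+0430 CYRILLIC SMALL LETTER A, written as an escape to avoid mix-ups

// Raw UTF-8 bytes: one byte for the Latin letter, two bytes for the Cyrillic one.
var_dump(bin2hex($latin));    // string(2) "61"
var_dump(bin2hex($cyrillic)); // string(4) "d0b0"

// Full Unicode code points (mb_ord needs the mbstring extension).
var_dump(mb_ord($latin, 'UTF-8'));    // int(97)
var_dump(mb_ord($cyrillic, 'UTF-8')); // int(1072)
```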

Browsers have already implemented mechanisms to warn users that a page they're visiting might not be as legitimate as they thought.

But still: if your website links to URLs originating from user input, it's a good idea to keep an eye on URLs containing internationalized domain names.

How to identify IDN URLs using the Url class

The Url class has the handy hasIdn method:

use Crwlr\Url\Url;

$legitUrl = Url::parse('https://www.apple.com');
$seemsLegitUrl = Url::parse('https://www.аpple.com');

var_dump($legitUrl->hasIdn());              // bool(false)
var_dump($seemsLegitUrl->hasIdn());         // bool(true)

var_dump($legitUrl->__toString());          // string(21) "https://www.apple.com"
var_dump($seemsLegitUrl->__toString());     // string(28) "https://www.xn--pple-43d.com"

As you can see, it's very easy to identify IDN URLs this way. Of course, there are many legitimate IDN domains, so you might not want to block all of them automatically. I'd suggest putting some kind of monitoring in place that notifies you when users post links to IDNs.
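Such a monitoring hook could look roughly like the following. This is only a minimal sketch: findIdnUrls is a hypothetical helper name, the autoload path assumes the package was installed via Composer, and error_log stands in for whatever notification channel you actually use. It relies only on the parse, hasIdn and __toString methods shown above.

```php
<?php

require __DIR__ . '/vendor/autoload.php'; // Assumption: crwlr/url installed via Composer.

use Crwlr\Url\Url;

/**
 * Hypothetical helper: returns the ASCII form of every submitted URL whose host is an IDN.
 *
 * @param string[] $submittedUrls
 * @return string[]
 */
function findIdnUrls(array $submittedUrls): array
{
    $flagged = [];

    foreach ($submittedUrls as $submittedUrl) {
        try {
            $url = Url::parse($submittedUrl);
        } catch (\Throwable $exception) {
            continue; // Not parsable as a URL; leave that to your regular input validation.
        }

        if ($url->hasIdn()) {
            $flagged[] = $url->__toString();
        }
    }

    return $flagged;
}

// "\u{0430}" is the Cyrillic letter from the example above.
$userLinks = ['https://www.apple.com', "https://www.\u{0430}pple.com"];

foreach (findIdnUrls($userLinks) as $idnUrl) {
    // Placeholder notification: replace with mail, chat message, queue job, etc.
    error_log('User posted a link to an IDN: ' . $idnUrl);
}
```

Logging the ASCII form makes the difference obvious to whoever reviews the notification, since the look-alike characters are no longer rendered.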

Maybe you're operating in a country where IDNs are very common. In that case, you may be able to find a way to automatically separate legitimate uses in your area from suspicious ones.
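One possible way to do that, sketched here under loud assumptions, is to skip the manual review for IDN hosts under domain suffixes that are common in your region. needsIdnReview and the list of trusted suffixes are made up for this illustration. The function works on the ASCII form of a URL (like the __toString output shown earlier) and uses only plain PHP functions.

```php
<?php

/**
 * Hypothetical helper: decides whether a URL with an IDN host should be reviewed manually.
 *
 * @param string   $asciiUrl        URL in its ASCII (Punycode) form.
 * @param string[] $trustedSuffixes Suffixes where IDNs are considered normal, e.g. ['de', 'at'].
 */
function needsIdnReview(string $asciiUrl, array $trustedSuffixes): bool
{
    $host = parse_url($asciiUrl, PHP_URL_HOST);

    if (!is_string($host)) {
        return true; // No parsable host: better to have a human look at it.
    }

    $host = strtolower($host);

    // Punycode-encoded labels start with the "xn--" prefix.
    $hasIdnLabel = false;

    foreach (explode('.', $host) as $label) {
        if (strncmp($label, 'xn--', 4) === 0) {
            $hasIdnLabel = true;
            break;
        }
    }

    if (!$hasIdnLabel) {
        return false; // Plain ASCII host, nothing IDN-specific to review.
    }

    foreach ($trustedSuffixes as $suffix) {
        $suffix = '.' . ltrim(strtolower($suffix), '.');

        if (substr($host, -strlen($suffix)) === $suffix) {
            return false; // IDN under a suffix where IDNs are expected.
        }
    }

    return true;
}

$trusted = ['de', 'at']; // Assumption: example suffixes, adjust to your region.

var_dump(needsIdnReview('https://www.xn--pple-43d.com', $trusted)); // bool(true)
var_dump(needsIdnReview('https://www.xn--mller-kva.de', $trusted)); // bool(false), illustrative IDN host under .de
var_dump(needsIdnReview('https://www.apple.com', $trusted));        // bool(false)
```

Keep in mind that trusting a whole suffix is a trade-off: a look-alike domain could also be registered under a trusted suffix, so this only reduces the noise in your monitoring and is not a security guarantee.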