Invasion of Privacy
Friday, 11 December 2015
One of my ongoing research projects concerns web browser identification. This project stems from my ongoing research into better anti-porn filters at FotoForensics. To me, the ideal solution should identify a distinct (unique enough) signature for each browser. This signature shouldn't need to be saved; it only needs to be generated and compared against a list of known hostile actors. If you're hostile, then the system can identify you and take preventative measures. If you're harmless, then it won't store the signature (not worth the disk space, and it will be regenerated the next time you visit).
Right now, people who upload prohibited content receive a ban, and the ban is tied to their signature. Most banned users notice the block, understand why, and go away. However, a small percent of banned users immediately flush cookies and change networks. We had one guy this year who spent nearly four hours trying to evade the signature detection. Fortunately, the current signature system permits tracking these hostile actors as they change network addresses -- he ended up going nowhere.
The better the signature, the more likely it is to identify specific undesirable users. The current tracking system isn't perfect, but it's better than nothing. And it catches the majority of people who violate the site's terms of service. About twice a year, we catch someone by coincidence. Twice out of a few million visitors really isn't a bad error rate, but I still want to do better.
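To make the "generate, compare, don't store" idea concrete, here is a minimal sketch. The post does not say which attributes go into the real signature or how they are combined, so the attribute names, the hashing scheme, and both function names are assumptions for illustration only. The one property taken from the text is that the network address is not part of the signature, which is why a banned user can change networks and still match.

```python
import hashlib


def browser_signature(attributes):
    """Derive a repeatable signature from browser attributes.

    `attributes` is a hypothetical dict of observed properties (for
    example the user-agent string). The IP address is deliberately
    left out so the signature survives a change of network.
    """
    # Sort the keys so the same attributes always hash to the same value.
    canonical = "\n".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_banned(attributes, banned_signatures):
    """Regenerate the signature on each visit and compare it to the ban list.

    Nothing is stored for visitors who do not match.
    """
    return browser_signature(attributes) in banned_signatures
```

Only the signatures of banned visitors need to be kept; everyone else's signature is recomputed on the next visit and discarded.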
Who are you?
For the signature, I don't care about names or email addresses. In fact, I don't want that kind of information; too much risk to personal privacy. (Besides, the public FotoForensics site doesn't require logins.) One of the better solutions is to use browser fingerprinting: your web browser generates a distinct-enough signature when you visit a web site.
A big part of the browser fingerprint comes from the user-agent string. This describes the type of browser and some basic capabilities. For example:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.73 Safari/537.36
This user-agent string identifies Chrome 47.0.2526.73 on Windows 7 (NT 6.1). The "WOW64" identifies a 32-bit version of the browser running on a 64-bit operating system. And it claims to support the KHTML rendering engine's extensions.
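The fields in that string can be pulled apart mechanically. The sketch below is one possible way to do it, assuming the common "product/version (comment)" layout; the regular expression and the function name are illustrative, and real-world strings can be messier than this pattern expects.

```python
import re

# One "product/version" token, optionally followed by a "(...)" comment.
TOKEN = re.compile(r"([^\s/()]+)/([^\s()]+)(?:\s*\(([^)]*)\))?")

EXAMPLE_UA = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/47.0.2526.73 Safari/537.36"
)


def parse_user_agent(ua):
    """Return a list of (product, version, comment_fields) tuples."""
    tokens = []
    for match in TOKEN.finditer(ua):
        product, version, comment = match.groups()
        # Comment fields are separated by semicolons, e.g. "Windows NT 6.1; WOW64".
        fields = [part.strip() for part in comment.split(";")] if comment else []
        tokens.append((product, version, fields))
    return tokens


for product, version, fields in parse_user_agent(EXAMPLE_UA):
    print(product, version, fields)
```

For the example string this yields four products (Mozilla, AppleWebKit, Chrome, Safari), with the operating system details in the comment attached to the first one.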
Of course, there's lots of garbage in these user-agent strings. In the old days of the web, site developers would build web pages that changed content based on the kind of web browser. Since the Netscape browser (aka Mozilla) was the best thing available, web sites would explicitly look for "Mozilla/5.0" in the user-agent string. Today, every browser claims to be "Mozilla/5.0" -- the actual string is meaningless. And as far as I can tell, no web sites still look for this string.
By the same token, Chrome is not Safari. Yet every Chrome browser claims to be "Safari" in the user-agent string -- just in case there's a web site that only supports Safari and AppleWebKit (like some of Apple's web services). In fact, KHTML and AppleWebKit are not the only rendering options; this browser also says "like Gecko", just in case any site still looks for Gecko-specific functionality.
All of this mimicked functionality leads to some great lies in user-agent strings. Like this one, from a Nokia Lumia 930 smartphone:
Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; NOKIA; Lumia 930) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Mobile Safari/537.36 Edge/13.10586
This string claims to be a Windows Phone and an Android running Chrome, Mobile Safari, and Microsoft's Edge browser -- all in the same program and all at the same time!
So what's the truth? The Lumia 930 is a Windows Phone (not Android) and it is running the Edge browser. This phone used to support some Android apps, but Microsoft put an end to that. In addition, all Edge browsers lie about being Chrome and Safari, and Mobile Edge lies about being Mobile Safari. The user-agent string specifies all of this, regardless of whether the support is actually fully compatible (hint: it isn't).
And then there's this user-agent string:
Mozilla/5.0 (Mobile; Windows Phone 8.1; Android 4.0; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; Microsoft; Lumia 640 Dual SIM) like iPhone OS 7_0_3 Mac OS X AppleWebKit/537 (KHTML, like Gecko) Mobile Safari/537
This Lumia 640 smartphone is a Windows Phone that claims to also be both an Android and an iPhone. I guess this is because there's an app that can make the Lumia "look like" an iPhone, even if it's only a different window manager.
Truth Detector
There are some well-known lies that appear in the user-agent string. So if you're going to check for a specific browser, then you need to check in a specific order. For example:
- Opera. No self-respecting browser would ever claim to be Opera. So if the user-agent string says "Opera" or "OPR", then you can assume it is Opera. In contrast, Opera claims to also be Chrome and Safari.
- Edge. Similar to Opera, no browsers claim to be Microsoft's new Edge browser except for Edge. Edge also claims to be Chrome and Safari.
- Trident. This denotes Internet Explorer. Again, nobody wants to claim to be IE except for IE. And IE claims to also be lots of other browsers. (I find it funny that even the big liar, Edge, doesn't want to be mistaken for IE.)
  As an aside: IE used to include an MSIE field to identify Internet Explorer. This field was dropped in IE 11, and Trident has been used since IE 8. If the browser claims to be running something older than IE8 (e.g., MSIE without Trident), then it's either a pre-Windows XP system (too old to view most web sites), a bot, or an anonymizing system (where the programmer never updated the impersonation string for the user-agent). Today, the bot and anonymizer options are much more likely.
- Windows Phone. No Android would ever claim to be a Windows Phone, but many Windows Phones claim to also be Androids, iPhones, and other platforms.
- Chrome. Now that we have the big liars out of the way, we can focus on the little liars. Chrome claims to also be Safari, but Safari doesn't claim to be Chrome.
- Safari. As far as I can tell, this browser does not lie by default. Lots of other browsers claim to be Safari, but Safari doesn't claim to be anyone else.
- Firefox. I think this is the most honest browser: nobody except Firefox claims to be Firefox, and Firefox doesn't claim to be anyone else.
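The ordering in the list above can be sketched as a chain of substring checks. This is only an illustration of the ordering, not the code used at FotoForensics: the function names and return labels are made up, plain substring matching is a simplification, and the relative order of the Android and iPhone platform checks is an assumption (the list only says that Windows Phone must come first).

```python
def classify_browser(ua):
    """Label a user-agent string, checking the least-impersonated names first."""
    if "Opera" in ua or "OPR" in ua:
        return "Opera"          # also claims Chrome and Safari
    if "Edge" in ua:
        return "Edge"           # also claims Chrome and Safari
    if "Trident" in ua:
        return "IE"             # IE 8 and later
    if "MSIE" in ua:
        # MSIE without Trident claims to be older than IE 8:
        # more likely a bot or an anonymizer than a real browser.
        return "IE (suspicious)"
    if "Chrome" in ua:
        return "Chrome"         # also claims Safari
    if "Safari" in ua:
        return "Safari"
    if "Firefox" in ua:
        return "Firefox"
    return "Other"


def classify_platform(ua):
    """Label the platform; Windows Phone first because it also claims the others."""
    if "Windows Phone" in ua:
        return "Windows Phone"
    if "Android" in ua:         # order of this and the next check is an assumption
        return "Android"
    if "iPhone" in ua:
        return "iPhone"
    return "Other"
```

Run against the three user-agent strings quoted earlier, this labels the desktop example as Chrome, the Lumia 930 as Edge on Windows Phone, and the Lumia 640 as IE on Windows Phone.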
Uncommon Knowledge
After accounting for the known-false fields in the user-agent string, we can start collecting statistics about the various types of web browsers out there. There are plenty of web sites that list the current browser marketspace distribution. However, they all seem to handle these lies differently. For example, if they don't look for Edge, then they will likely count it as Chrome or Safari. And if they look for the wrong lies (like NetMarketshare), then you might mistakenly believe that most browsers run IE.
I pulled up some stats from today (11-Dec-2015):
Source | Chrome | Firefox | IE | Edge | Safari | Opera | Other |
---|---|---|---|---|---|---|---|
w3schools.com | 67.4% | 19.2% | 6.8% | n/a | 3.9% | 1.5% | 1.2% |
Clicky | 52.1% | 17.0% | 20.5% | 2.1% | 6.9% | 1.4% | 0% |
w3counter.com | 45.5% | 11.4% | 13.1% | n/a | 20.8% | 3.0% | 6.2% |
NetMarketshare | 31.4% | 12.2% | 50.0% | n/a | 4.3% | 1.5% | 0.6% |
Observed at FotoForensics | 55.0% | 18.7% | 3.6% | 0.5% | 4.1% | 1.4% | 16.4% |
I expect a little variation due to different collection methods. The rest of the variations are likely due to problems decoding the strings. For example, w3counter noticed 20% Safari browsers. However, most browsers claim to be Safari, so this is probably acting as a catch-all.
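A small sketch shows how a "catch-all" can arise. It tallies the three user-agent strings quoted earlier in this post with a deliberately naive labeler that looks for "Safari" first. This is not a claim about how w3counter or any other service actually parses strings; the labeler and its check order are invented purely to show the effect.

```python
from collections import Counter

# The three user-agent strings quoted earlier; none of them is really Safari.
SAMPLES = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/47.0.2526.73 Safari/537.36",
    "Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; NOKIA; Lumia 930) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 "
    "Mobile Safari/537.36 Edge/13.10586",
    "Mozilla/5.0 (Mobile; Windows Phone 8.1; Android 4.0; ARM; Trident/7.0; Touch; "
    "rv:11.0; IEMobile/11.0; Microsoft; Lumia 640 Dual SIM) like iPhone OS 7_0_3 "
    "Mac OS X AppleWebKit/537 (KHTML, like Gecko) Mobile Safari/537",
]


def naive_label(ua):
    """Hypothetical labeler that checks names in a poor order (Safari first)."""
    for name in ("Safari", "Chrome", "Firefox"):
        if name in ua:
            return name
    return "Other"


def tally(user_agents, labeler):
    """Count how many user-agent strings receive each label."""
    return Counter(labeler(ua) for ua in user_agents)


print(tally(SAMPLES, naive_label))
```

All three strings are counted as Safari, even though they come from Chrome, Edge, and mobile IE.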
Most of the "others" at FotoForensics are from mobile devices, bots, or browsers configured to lie about their user-agent string. The main difference between my metrics and the others is that I report significantly fewer Internet Explorer web browsers. You see, to get these statistics, I'm not just looking at the user-agent string. I first look at the string, then I test the browser. Real IE responds to the IE test as real IE. In contrast, an IE that lies about not being IE, or a non-IE browser that falsely claims to be IE, responds differently and goes into the "other" category.
As far as I can tell, about 10% of browsers are configured to provide misleading user-agent strings. Some user-agent strings were changed within the browser. Others came from anonymizing proxy systems. But in each case, the false strings became trivial to identify.
Truth Tests
As a tracking attribute, I really don't care about the value of the user-agent string, as long as it doesn't change. There are so many different mobile devices out there that your smartphone is probably the only one of its kind to touch any given web site today. (Unless you're using an iPhone. There are far fewer variations on Apple devices.)
I also find it ironic that all of this effort is made to lie about functionality. As far as I can tell, only a few sites (ahem, Apple) still do browser checking. Most web sites (like NetMeeting and GoDaddy's domain management system) don't bother checking; they just fail to run properly if your browser lacks their browser-specific requirements.
My original purpose for doing this experiment was to test a theory. My theory was that browsers configured to lie about their user-agent string were more likely to upload and access prohibited content.
I defined a hypothesis that can be tested: assume there is a correlation between prohibited content and misleading user-agent strings. Then I created a test to evaluate this hypothesis: compare the user-agent string against the web browser's feature set. If the browser claims to be Chrome, then it should have Chrome-specific functionality. If it does match, then it may still be lying about the version of Chrome, but it still looks like Chrome. In contrast, if it fails the test, then I know it is lying about being Chrome. I ended up making tests for every major browser and then comparing the results against uploaded and accessed content at FotoForensics.
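The comparison step might look something like the sketch below. The post does not say which feature tests are used, so the probe names here are opaque placeholders, and the rule for a match (the claimed browser's probes all succeed and no other browser's probes do) is an assumption. As noted above, a match says nothing about whether the claimed version is honest.

```python
# Placeholder probe names; each stands in for some browser-specific feature test
# whose result the client reports back to the server.
EXPECTED_PROBES = {
    "Chrome": {"chrome_probe"},
    "IE": {"ie_probe"},
    "Firefox": {"firefox_probe"},
}


def claim_is_consistent(claimed_browser, observed_probes):
    """Compare the browser named in the user-agent with the probes that succeeded.

    Returns True if the feature set matches the claim, False if the browser
    is lying about what it is, and None if no test exists for that claim.
    """
    expected = EXPECTED_PROBES.get(claimed_browser)
    if expected is None:
        return None
    # Probes that belong to the other browsers should not succeed.
    foreign = set().union(
        *(probes for name, probes in EXPECTED_PROBES.items() if name != claimed_browser)
    )
    observed = set(observed_probes)
    return expected <= observed and not (observed & foreign)
```

A visitor whose string says Chrome but whose browser passes only the IE probe would return False and land in the "other" category.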
As it turns out, there is no significant difference between people who upload porn and people who have misleading user-agent strings. (The hypothesis is unsupported, so the theory fails.) There doesn't seem to be any correlation between misleading user-agent strings and the type of content accessed by the user.
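The post does not say which statistical test produced "no significant difference". One common choice for comparing two rates is a two-proportion z-test, sketched below. The counts in the usage lines are made up purely to show the call; they are not FotoForensics data.

```python
import math


def two_proportion_z_test(hits_a, total_a, hits_b, total_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value). A large p_value means the observed difference
    is consistent with both groups having the same underlying rate.
    """
    rate_a = hits_a / total_a
    rate_b = hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: prohibited uploads among visitors with misleading
# user-agent strings versus visitors with honest ones.
z, p = two_proportion_z_test(52, 10_000, 460, 90_000)
print(f"z = {z:.2f}, p = {p:.2f}")
```

With a conventional threshold of 0.05, a p-value above the threshold would be reported as no significant difference between the two groups.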
There were a few other great outcomes from this test. For example, it is yet another way to rapidly identify bots, scanners, and hostile systems. A lot of scanners use lists of user-agent strings that they randomly select. They want the server to think it is just another browser. But with any of a dozen simple tests, it becomes clear that it is just a bot.
A little privacy, please?
And then there is "private browsing". Different browsers call it by different names. Chrome calls it "Incognito" and Internet Explorer says "InPrivate". The entire idea with private browsing is that nothing gets saved to disk. This way, you can visit porn sites, or other prohibited web services, and nothing gets saved to your computer.
There is not supposed to be any way for the server to know if you are using private browsing, and it should be invisible to client-side JavaScript. Except that it can be detected and browser manufacturers have known this for years.
At FotoForensics, about 10% of Chrome and Opera users have private browsing enabled. Firefox and IE are at 20%, and Safari is 4%. (The other browsers occur so infrequently that the percentages of private browsing become misleading.) This is yet another attribute that can be combined to distinguish your browser from anyone else.
I should also note that there is no significant difference between people who upload porn and those who use private browsing. Private browsing does not appear to be an indicator of malicious intent.
In the future, I'll be detecting private browsing mode and not storing the user's FotoForensics access history on their browser. (Since it won't store anyway, this will cut down on my bandwidth, while abiding by their desire to keep their browsing activities private. Oh the irony!)
For people who think that changing their user-agent string or using private browsing makes them anonymous online, beware: it really makes you easy to detect! Rather than becoming anonymous, these "fake anonymous" steps make you appear even more unique. If you really want to be anonymous, it is better to tell the truth and blend into the crowd. (It kind of reminds me of the old joke: All you non-conformists are alike.)
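One way to put a number on "more unique" is surprisal: an attribute shared by a fraction p of visitors contributes -log2(p) bits of identifying information. This framing is an addition for illustration, not something the measurements above were computed with. The rates below are the private-browsing percentages quoted earlier in this post.

```python
import math

# Fraction of each browser's users seen with private browsing enabled,
# taken from the percentages quoted above.
PRIVATE_RATE = {
    "Chrome": 0.10,
    "Opera": 0.10,
    "Firefox": 0.20,
    "IE": 0.20,
    "Safari": 0.04,
}


def surprisal_bits(fraction):
    """Bits of identifying information from a trait shared by `fraction` of users."""
    return -math.log2(fraction)


for browser, rate in PRIVATE_RATE.items():
    private = surprisal_bits(rate)
    normal = surprisal_bits(1 - rate)
    print(f"{browser}: private {private:.2f} bits, not private {normal:.2f} bits")
```

In every row the private-browsing minority gives away several bits while the majority gives away a fraction of one bit, which is the "blend into the crowd" point in numeric form.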