Many known security holes in contemporary web infrastructure come from bugs that have already been fixed and from poor design choices that are fixable. We suspect that, all too often, web server administrators do not update or properly configure the software running on their servers because of lack of time, lack of knowledge, old hardware, a need for stability, backwards compatibility, secret embedded hardware, or any number of other reasons. If so, many web servers running today are still vulnerable to exploits that were fixed years ago.
Despite our suspicions, little is known about how many active servers run egregiously unpatched software or engage in risky, easily preventable behaviors, such as handling string injection points carelessly or failing to identify content types properly. Such servers can harm both themselves and others. Superficial surveys of server platform type exist, such as the Netcraft server platform distribution survey at http://news.netcraft.com/archives/2012/03/05/march-2012-web-server-survey.html, but we have yet to find a comprehensive survey that covers server behaviors as well as versions.
Therefore, we want to gather data from a large and diverse sample of currently active servers, covering both easier questions such as ‘are you up-to-date?’ and ‘what platform do you run?’ and more complex behavior tests, such as checking for proper content-type identification. To do this, we plan to build a web crawler that can non-destructively query both server platform and server behaviors. We then plan to run the crawler against the Alexa Top n sites and known embedded devices, such as printers.
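As a rough illustration of what a single non-destructive query might look like, the sketch below sends one HEAD request and records the advertised platform, version, and declared content type. It is not the planned crawler: the function names `parse_server_header` and `probe`, the `survey-sketch/0.1` user agent, and the choice of plain HTTP on port 80 are all assumptions made for this example. Note also that the `Server` header is self-reported and can be suppressed or falsified, so it would be a starting point rather than proof of a version.

```python
import http.client
import re
from typing import Optional, Tuple

# Leading "product/version" token of a Server header,
# e.g. "Apache/2.2.3 (CentOS)" -> product "Apache", version "2.2.3".
_SERVER_RE = re.compile(r"^\s*([^\s/()]+)(?:/([^\s()]+))?")


def parse_server_header(value: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split a Server header into (platform, version); either may be None."""
    if not value:
        return (None, None)
    match = _SERVER_RE.match(value)
    if match is None:
        return (None, None)
    return (match.group(1), match.group(2))


def probe(host: str, port: int = 80, timeout: float = 5.0) -> dict:
    """Send one HEAD request for "/" and report what the server advertises.

    HEAD asks for headers only, so the probe reads metadata and changes
    nothing on the server. (Hypothetical helper, not the real crawler.)
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # Assumed user agent string, so administrators can identify the survey.
        conn.request("HEAD", "/", headers={"User-Agent": "survey-sketch/0.1"})
        response = conn.getresponse()
        server = response.getheader("Server")
        content_type = response.getheader("Content-Type")
        status = response.status
        response.read()  # a HEAD response has no body; this just finishes it
    finally:
        conn.close()

    platform, version = parse_server_header(server)
    return {
        "host": host,
        "status": status,
        "platform": platform,
        "version": version,
        # None here means the server did not declare a content type at all.
        "content_type": content_type,
    }
```

Calling `probe("example.com")` would return a dictionary for that host; a crawler could loop over a site list, catch `OSError` for unreachable hosts, and rate-limit its requests. Deeper behavior tests, such as comparing the declared content type against the actual response body, would need additional requests beyond this sketch.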
See our proposal (with citations) here: Comp527Proposal