The Deep Web

Deep Web (also commonly called the Invisible Web): for this project, the Deep Web refers to online databases and other dynamic pages that cannot be indexed by common search engines like Google. A technical description helps in understanding the Deep Web:

“The Deep Web refers to content hidden behind HTML forms. In order to get to such content, a user has to perform a form submission with valid input values. The name Deep Web arises from the fact that such content was thought to be beyond the reach of search engines. The Deep Web is also believed to be the biggest source of structured data on the Web and hence accessing its contents has been a long standing challenge in the data management community” (Madhavan et al., 1)

One way to think of this is to imagine that you are looking for a flight. You could Google ‘JFK-LAX’ and get a list of airlines offering this route, or similar offers. To access a Deep Web database, you would go to a specific airline’s website, such as United, to search for this route along with the dates of travel to get specific information (in this case, flight times).
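To make the form-submission idea concrete, here is a minimal sketch of what filling in such a search form amounts to. The endpoint `airline.example` and the field names `from`, `to`, and `date` are hypothetical, chosen only for illustration; a real airline site would use its own URL and parameters, and often a POST request rather than a query string.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint and field names, for illustration only.
SEARCH_ENDPOINT = "https://airline.example/flights/search"


def build_search_url(origin, destination, depart_date):
    """Build the URL a browser might request when the search form is submitted."""
    fields = {"from": origin, "to": destination, "date": depart_date}
    # urlencode turns the form fields into a query string such as "from=JFK&to=LAX&..."
    return f"{SEARCH_ENDPOINT}?{urlencode(fields)}"


url = build_search_url("JFK", "LAX", "2024-06-01")
print(url)

# The results page exists only for this particular combination of inputs.
submitted = parse_qs(urlsplit(url).query)
print(submitted)
```

The page behind this URL is generated on demand from the database. A crawler that only follows links has no way of knowing which values to submit, so it never sees the page.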

Harvesting the Deep and Surface Web with a Directed Query Engine (source: Bergman, 5).

Colley & McDonnell quote Chris Sherman, who offers a further explanation: “When an indexing spider comes across a database, it’s as if it has run smack into the entrance of a massive library with securely bolted doors. Spiders can record the library’s address, but can tell you nothing about the books, magazines or other documents it contains” (Colley & McDonnell).

Size and Scale

An important aspect of understanding the Deep Web is its sheer scale. Michael K. Bergman published an influential white paper in 2001 that is still widely treated as the definitive source of Deep Web information. In the paper, Bergman quantifies the Deep Web with a set of striking figures:

  • Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.
  • The deep Web contains 7,500 terabytes of information compared to nineteen terabytes of information in the surface Web.
  • The deep Web contains nearly 550 billion individual documents compared to the one billion of the surface Web.
  • The deep Web is the largest growing category of new information on the Internet.
  • More than half of the deep Web content resides in topic-specific databases.
  • A full ninety-five per cent of the deep Web is publicly accessible information, not subject to fees or subscriptions (Bergman, 1-2).
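As a quick sanity check on the figures above, the ratios can be worked out directly. This is only a sketch using the numbers quoted in the list; that the "400 to 550 times" range corresponds to these two comparisons (by size and by document count) is an assumption, not something the list states.

```python
# Figures as quoted from Bergman in the list above.
deep_tb, surface_tb = 7_500, 19          # terabytes of information
deep_docs, surface_docs = 550e9, 1e9     # individual documents

size_ratio = deep_tb / surface_tb        # how many times larger by volume
doc_ratio = deep_docs / surface_docs     # how many times larger by document count

print(f"By size: about {size_ratio:.0f} times larger")
print(f"By documents: about {doc_ratio:.0f} times larger")
```

By volume the Deep Web comes out at a little under 400 times the surface Web, and by document count at 550 times, which lines up with the range given in the first bullet.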

Extra: Searching The Deep Web, an educational (but nonetheless hilarious) video from the late 1990s or early 2000s.
