AUUG: Annual Winter Conference '97
Data Mining on the World Wide Web
The Web has turned out to be a great place to publish information, but often that information is poorly organized, poorly served, or simply served in a manner we would rather not burden ourselves with. Rather than forgo this information entirely, we can now programmatically retrieve, condense and optionally archive it. How this is done, how easy it is, and some of the wider implications form the basis of this paper.
Many organizations are now publishing everything from catalogues of useful information to searchable directories on the World Wide Web. This is happening at a time when some of our favorite programming and scripting languages are also offering significant abilities to access the Web directly and to process the information retrieved from it.
Faced with this cornucopia of data, the challenge now is to automate its retrieval so that we can have local access to it at will, without the need to 'hit the wire' personally. Indeed, the ability to perform such operations allows us to shift the load on our outbound connections to those times of day when neither we nor our co-workers are inconvenienced. In fact, as time goes by and the Web becomes bigger, the need changes from one of finding data to condensing it down to just what we wanted.
This then is a discussion of how to automate access to Web based data and how to build programs which can navigate the Web automatically to perform such tasks.
When the first tools which in a sense defined the Web were built, they were both thin on the ground and of limited scope. This is not to say that text based browsing is bad, simply that it was all you could do.
Over time though, people have wanted to do more complex things with Web sites, and these have almost always meant writing programs to access those sites. In the beginning this meant lifting the 'C' routines used by the first generation browsers and using them to write programs to do the work. More recently, the code to access the Web has appeared in a number of other languages, and often these languages are better suited to the tasks we want to do. This is because we often need to do a lot of text processing to find the data we asked for amid the morass of Webvertisements, Frames and corporate image building, and because the Web and the network it is built on are by nature slow, so we don't require the speed or efficiency of compiled 'C'.
The choice of language then may more often be made on intangibles like how good the debugging support is or how fast the edit/test cycle is. To some extent, given the way Web sites are built and continuously upgraded, you may find that your mining script needs a similar level of commitment to maintenance.
It's certainly worth evaluating up front how much effort you're prepared to
expend on getting this data. Just as the Web site may well be a work in
progress which never completes, so too the script you build to access it may
well need constant care.
Another factor in the equation of which language to choose is the scope of the Web support it offers and the complexity of customizing the supplied library components to the task at hand. So let's look at a few of them.

Python
Python comes with extensive libraries in its standard form which greatly
simplify the task of trolling the Web for data. With both libraries to
automate HTTP accesses and URL manipulation and libraries to parse SGML and
HTML it is well placed to interact with the Web. Python also offers reasonable
speed and is well documented.
PERL

PERL also has extensive Web access libraries available to assist in the task of automating data retrieval. In particular PERL, with its robust and efficient regular expression support, is suited to cases where searching of the returned data is pattern based. PERL is ideal for gleaning nuggets of data from large volumes of material in this way. Note though, that you either have to have a fast network connection or be patient.
Java

Much of the code necessary for Java to do this also exists. The proof of this is that at least one browser (that we know of) has been written in Java. This means the mechanisms needed to retrieve, pick apart and sensibly interpret Web based data streams exist. To what extent these code resources are and will be made available is less clear. (HotJava was to be re-engineered by Sun as a set of classes for 3rd parties to develop "Web-enabled" programs. As such this would represent the only commercially supported example of such code at this point in time.)
It's worth noting that the distinguishing feature here is that each language has libraries which do some or all of the work of interacting with Web servers for you. They all have code to parse retrieved HTML and, to a greater or lesser degree, to do other webby things like manipulating URLs and handling MIME encapsulated data. The extent of the support you can leverage off, and the ease with which that support can be customized and tuned, also become issues when choosing.
Most of these languages also offer greater cross platform support, should you have an active interest in performing these operations from hosts other than UNIX machines. In particular, each of the three named above has good PC support. The downside of such cross platform source portability is that each foregoes some of the intimacy with the environment which is offered by C and C++.
Finally remember that if you do need speed and efficiency, you can't beat C and C++. In fact, in something of a turnaround, we at times need to remind ourselves that the Web was first written in the older, more traditional languages and migrated to the younger languages only later, when the Web was starting to mature and we broadened our interest in how we might use it.
For the purposes of this paper, we will use the facilities offered by Python
almost exclusively. Those of you who have an interest in using PERL for the same tasks should look at the excellent book from O'Reilly & Associates on the subject, "Web Client Programming with PERL". Indeed, this book contains many valuable insights and can be recommended to anyone planning on undertaking such tasks, or indeed to those Web administrators who want insight into how to make their site more useful.
Spiders & Robots
Spiders and Robots are specialized classes of data mining programs which have
been on the Web since its earliest days. Indeed, the Web search engines which
are so useful now when searching the Web interactively, build their databases
almost exclusively by making use of these automated programs to explore the
Web by finding pages and then exploring all the links from those pages to
other pages both locally and at more remote sites on the Web. This effectively
exploits the connectivity inherent in the ability to link pages of information
and also highlights a fact of Web existence which is often overlooked. If your
Web pages 'stand alone', then they are effectively not really in the Web
itself. They are leaves on the tree, rather than branches. This is why other
people often encourage you to link to their pages as it increases the
likelihood they will be found when a search engine (or human) passes along the
links from those pages.
This also serves to highlight one difficulty of finding your data on the Web. There is a combinatorial explosion of links which your Miner will discover and which it will need to decide whether to follow or ignore. Strategies need to be developed, and from time to time refined, to allow the Miner to sensibly follow or discard various pieces of data.
Several examples of Python based robots will be shown, ranging from a dozen lines of Python to longer, more complex and more capable systems. Indeed, a simple Python based Miner might look like this:
import urllib

# Grab the nominated URLs.
urls = [
    'http://www.vic.auug.org.au/auugvic/av_meetings.html',
    'http://www.vic.auug.org.au/auugvic/av_mtg_current.html',
]
for url in urls:
    ## files = urllib.urlretrieve(url, outputfile)
    files = urllib.urlretrieve(url)
    print url, ' ----> ', files
    # code to 'do' something to the data could go here...
This produces output which looks like this:
http://www.vic.auug.org.au/auugvic/av_meetings.html ----> /email@example.com
http://www.vic.auug.org.au/auugvic/av_mtg_current.html ----> /firstname.lastname@example.org
I should note that while it's pretty minimal, this example serves to show off a few things of note. Firstly, with the right libraries, and those available for both PERL and Python certainly qualify, the task of getting the raw data off the Web is completely trivial. Secondly, you need to watch out for error conditions carefully. This is highlighted by the second URL retrieved in the example, which in fact doesn't exist. The Web server returned the dreaded "404 - URL not found" and the Web page storing that frankly useless data was saved for us anyway. In fact, the Python library offers a simple means to trap this condition but, as ever, we have elided this code and other error checking in the interests of brevity.
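As a minimal sketch of such a trap, reusing the urls list from above: the basic urllib.URLopener (unlike the 'fancy' opener which urlretrieve uses by default) raises IOError when the server reports an HTTP error, so the bad page can be skipped rather than saved.

import urllib

# URLopener, unlike the default 'fancy' opener, raises IOError on
# HTTP errors such as "404 - URL not found".
opener = urllib.URLopener()
for url in urls:
    try:
        files = opener.retrieve(url)
        print url, ' ----> ', files
    except IOError, detail:
        print url, ' failed:', detail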
Each example then attempts to show some feature of accessing Web based data which may be of use to the prospective user. These features largely come down to limiting the data which the robot is prepared to retrieve and store for longer periods. Any Web robot can easily retrieve more data in a short period than most people can comfortably contemplate, if the only limitations are the bandwidth of the path from data source to data sink and the storage available at your site. It has to be noted that with falling disk prices and the arrival of cable modem access to the net, this can be significant. If you have signed up for Big Pond Cable and thus have lots of bandwidth and no time charges, you would be well advised to watch the volume charges they live off!
So how can we limit the amount of data we try to pull down?
rule based approaches
Rule based approaches attempt to limit accepted data by examining some aspect
of the data itself. These aspects include where the data came from (source
host), where the data is stored (path), what the data is (mime type), keywords
in the data (meta tags or regular expression searches) or the age of the data.
MIME types in fact are less of a problem than might be imagined. The browser software, whose part we are playing, informs the server at the beginning of an HTTP transfer which MIME types it can cope with. The server then will not offer the client anything which falls outside the scope implied by that.
The most common heuristics reject data which comes from a different host than the source URL, data which comes from elsewhere on the host than the source URL we have retrieved, or data which uses a different protocol. As examples of these, we might choose not to download the advertisement from www.doubleclick.com, we could choose to examine only files under the dilbert subtree of www.unitedmedia.com, and we could avoid attempting to use ftp or mailto as transports. These examples serve to highlight that it can be useful to save such ignored URLs in a log file for examination in the event that the data we wanted didn't get retrieved. Usually this is a symptom that our filtering has been too successful and we have in fact filtered out all of our signal along with the noise.
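By way of an illustrative sketch of such rules (the base URL is just the Dilbert example above, and the exact subtree path is a guess; a real Miner would make these rules configurable):

import urlparse

base = 'http://www.unitedmedia.com/comics/dilbert/'

def acceptable(url):
    # compare scheme, host and path against those of the base URL
    scheme, host, path = urlparse.urlparse(url)[:3]
    bscheme, bhost, bpath = urlparse.urlparse(base)[:3]
    if scheme != bscheme:                  # rejects ftp:, mailto: etc.
        return 0
    if host != bhost:                      # rejects www.doubleclick.com
        return 0
    if path[:len(bpath)] != bpath:         # stay within the subtree
        return 0
    return 1

Rejected URLs would be appended to the log file mentioned above rather than silently dropped.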
state based approaches
State based approaches are simpler and rely on keeping a more or less
complete set of the data for a site around between instances of the robot
running. This allows newer instances to choose to ignore data which is already
cached locally. This applies mostly to those situations where 'whole sites' or
subtrees of sites are being 'mirrored' locally. Even when whole portions of
some remote site are mirrored in this fashion the extraction of the data of
interest may then take place off-line as a post processing operation.
State based mechanisms can be used to effectively monitor sites for new
information too. Indeed, several URL watching services exist on the net which
use precisely this approach. These services cache a copy of a URL and issue
mail alerts when the cached data differs from the live copy. One such service
can be seen at:
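A minimal sketch of this cache-and-compare idea keeps a message digest of each page between runs (the file name 'seen.db' is arbitrary):

import urllib, md5, shelve

seen = shelve.open('seen.db')    # digests persist between runs
url = 'http://www.vic.auug.org.au/auugvic/av_meetings.html'
digest = md5.new(urllib.urlopen(url).read()).digest()
if not seen.has_key(url) or seen[url] != digest:
    print url, 'is new or has changed'   # a watcher would send mail here
    seen[url] = digest
seen.close()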
Most practical solutions combine elements of both approaches. If a large data set is being mined, keeping a locally cached copy can lead to significant savings in bandwidth. At the same time, such data can very easily become out of date and misleading. A solution is to build a richer configuration which allows persistent state information to be used where it is appropriate to do so and ignored where it is not. In fact, most Miners tend to be either very flexible and configured for the job at hand or, more commonly, handcrafted for the job at hand. The latter are only practical due to the simplicity of the basic functionality, as was seen above. Still, the approach to take is a matter of judgement in each of the cases you are interested in.
A common piece of information which robots and web crawlers are expected to
honour is the robots.txt file. This file contains details of which areas of a
web server robots are expected to stay away from. This was largely proposed
to stop web crawling robots operated by search engine sites from indexing Web
pages which were not for public consumption or of limited use to the general
net surfer. But, enforcement of this is entirely on the honour system which
raises the question of whether your robots should also honour this file. In
some sense, your Miner can be considered to be an agent which is merely
retrieving data you would otherwise be forced to retrieve manually. Clearly
though, the Miner can implement behaviors which real humans don't follow. This means it may well be regarded as either a malfunctioning robot or as an actively hostile one. The solution in either case is for the owner of the data to block access to it from your site, your browser's ID string, or some combination of attributes which targets the behavior of the Miner. In general, don't make waves. No one is obliged to make their data available, and specifically they are not obliged to make it available to you.
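Should you decide to honour it, a minimal and deliberately conservative check is easy to sketch; this treats every Disallow line as applying to our robot rather than tracking the User-agent sections:

import urllib, string

def disallowed_prefixes(host):
    # collect the path prefixes the site asks robots to avoid
    prefixes = []
    f = urllib.urlopen('http://' + host + '/robots.txt')
    for line in string.split(f.read(), '\n'):
        line = string.strip(line)
        if string.lower(line[:9]) == 'disallow:':
            prefixes.append(string.strip(line[9:]))
    return prefixes

# a URL is then off limits if its path starts with any of these prefixes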
Automating access to interesting information
Having found an interesting piece of data on the Web, you now wish to automate
its retrieval. In all cases this will involve retrieving it and storing it
locally. We have already seen the simple example above get this far. What
happens next depends on what sort of data you are looking for.
The first thing to note is that the structure of a 'page' of data on the Web
may in fact be composed of many interlocked pieces. The most obvious of these
are the graphics which adorn many Web pages. The simple example we saw only
retrieved the base HTML page data of the URLs we asked it to fetch though. To
have the complete page at hand we must additionally fetch those extra pieces
of data we do not yet have in hand. For most purposes this means simply
scanning the base HTML data and extracting the URLs associated with the
rendering of the page. We may also wish to retrieve other pieces of data which
have been associated with the page. In some cases you will choose to specifically ignore some of the data (like images) which has no interest or benefit to you.
More generally of course, there is also the question of examining the HTML
itself and extracting the data for which we downloaded it.
Much Web data is obscured under a layer of HTML. Some of this is merely formatting and some is in the form of hyperlinks to other places. The odds are that the data you want may well be spread across more than a single Web page, and retrieving it requires visiting each page of the set in an automated manner. Unless the user wishes to be involved in the filtering of associated URLs, the solution is to pass an HTML parser across the retrieved data and save tags with references which may be related to the data we are searching for. Most typically, to get a complete image of a page, associated URLs (hyperlinks) need to be processed, as are image resources which may be attached to the page. If you are attempting to mine other data types (such as MIDI or MPEG) then special arrangements will likely need to be made. When this has been done, we have
recreated the Web crawler of the search engine which downloads pages from your
host and then downloads all the linked pages and so on, hoping to eventually
traverse the entire allocated URL space of the Web. This mode of operation is
also similar to the many solutions which offer to pull down Web trees and
subtrees from remote servers. These are the basis of many Web mirroring
projects where popular remote sites are mirrored for the bandwidth savings
such mirroring allows.
Here we can see that we have a few ways of getting our customized code to have a look at the HTML data. One is to pass a Formatter in to the Python parser itself. This will be handed the textual data from the page, which can easily be searched for the data of interest. The other method, shown below, is to make a subclass of the HTML parser included with the Python standard library. In the usual manner of OO languages we can accept or override the member functions of the class as suits us, and thus customize the behavior of the parser as deeply as may be necessary.
This is also the point where we can most easily extract the data we are
interested in. Parsing the base HTML of the page is often sufficient to allow
us to mine the data we first came looking for.
import htmllib
from formatter import NullFormatter

# Sub-class the standard HTML parser class.
class lclHTMLParser(htmllib.HTMLParser):
    def __init__(self, formatter, verbose=0):
        htmllib.HTMLParser.__init__(self, formatter, verbose)
        self.imglist = []

    def handle_image(self, src, alt, *args):
        self.imglist.append(src)    # remember each image reference

# ... after retrieving the HTML file ...
# now scan the file for other URLs
dst = open(outputfile, 'r')
n = NullFormatter()
p = lclHTMLParser(n)
p.feed(dst.read())
p.close()
imglist = p.imglist

# make full paths out of the list of URLs we found
for j in p.anchorlist:
    # ... process each link as appropriate ...
    pass
Occasionally however we do not even need to go to such lengths. At times the data we are interested in is easily marked in the HTML data itself. Comments which the HTML author left behind can often be greatly useful.
other approaches (e.g. saving the Dilbert GIF and destroying the rest)
At times the task of extracting data from a Web page is simpler. A simple example of this is the retrieval of a comic page (such as Dilbert or Robotman for instance) where the data is neatly packaged as a GIF which can be moved away and the balance of the data destroyed. The Dilbert pages also highlight another reason for processing the HTML: United Media obfuscates file names with random suffixes to prevent pro-active users from performing directory name space searches for as yet unpublished data. It should be noted that automated namespace searches are feasible but are to be discouraged, as they clearly go against the wishes of the owner of the intellectual property.
Automating access to (other people's) search engines
Search engines present an interface to the world which is bound to HTTP just as the data sources we are more interested in are. This means retrieving specific information through these CGI gatewayed services is no more difficult than any other form of Web data mining. Conceptually at least. :)

In fact, Web search engines offer a valuable means of automating the search of other people's data over and above the search for the data itself. To make maximum use of them, the data extraction from the retrieved pages may have to become much more subtle. On the upside, the extraction of the link data from the search engine results page is simply a matter of following the links from the results page. This is already catered for by all the languages we are considering.

accessing GET and POST forms
Most of these search engines are based on the use of CGI forms in some manner to get input from the user and process it as a query. The two forms of query submission are referred to as GET and POST. GET forms look to the user just like any other URL and can be retrieved in a like manner. Examining the source page or the output it produces can allow you to determine the encoding rules for the URLs it produces, allowing you to bypass the form entirely if desired.
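As a sketch, assuming a hypothetical engine whose form page reveals that it encodes its query in a parameter named 'q':

import urllib

terms = 'AUUG winter conference'
query = 'http://search.example.com/cgi-bin/query?q=' + \
        urllib.quote_plus(terms)
results = urllib.urlopen(query).read()   # ordinary HTML, mined as before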
Indeed, many sites now offer access to common search engines based on the fact that these engines will accept new query URLs from any source. Look for the search engine access pages which feature text fields for search terms and simply pass the command URL to the engines directly, allowing their output to capture the entire screen. Note that many of these pages are attempting to attract users in this way merely to siphon some advertising revenue from page hit counts. Why visit Infoseek and Altavista and Yahoo and others when one page can give you access to a half dozen of the most useful and most complete?
POST forms use similar query encoding but send the data to the target CGI as part of the query body rather than the URL portion of the connection. At the server the data passes into the CGI interfaced program as the standard input stream rather than the command line and hence some size restrictions on the data are relaxed (allowing more complex queries to be assembled). Support for these forms in the languages we are looking at is sketchier at the moment but, given the open nature of the sources, can be implemented without a great deal of effort.
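A sketch of a POST submission using the standard httplib module (the host, path and field name are again hypothetical):

import httplib, urllib

body = 'q=' + urllib.quote_plus('AUUG winter conference')
h = httplib.HTTP('search.example.com')
h.putrequest('POST', '/cgi-bin/query')
h.putheader('Content-type', 'application/x-www-form-urlencoded')
h.putheader('Content-length', '%d' % len(body))
h.endheaders()
h.send(body)                      # the query travels in the request body
errcode, errmsg, headers = h.getreply()
if errcode == 200:
    results = h.getfile().read()  # ordinary HTML once again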
Note that in the case of both GET and POST forms, the data returns as the
same HTML encoded data we have already dealt with before. The mechanisms we
use are the same we have already seen. It may help to refer to a good HTTP
reference such as those listed at the end of this paper.    
Searching the search engine results.
Having generated GET or POST data, you may wish to use a generic search engine to find references to a site and have your mining robot process the results from the search engine and visit the first 100 sites automatically for you. You can then browse the retrieved data at your leisure. This is really just a special case of the more general operations we have been discussing, but it bears mention simply because it causes your search to fan out across the net more than we have preferred previous searches to do.
plugin/java based data streams
Having realized the value of their data, and in the hopes of capturing the hearts and minds of the user community, some search engines now offer programmatic access directly to the data store their searches examine (which is to say the data they themselves have mined from the Web). So far, access to this data stream is performed by a plug-in or Java user agent on behalf of the user, and the format of these custom data streams is as yet secret. (Given the competitive nature of this market segment at the moment, we can imagine they are unlikely to allow easy access to this data for competitors, but access for the public must mean access for the competitors too, if they choose to make use of it.)
Mining data from inside returned HTML
We've briefly touched on the techniques to use above. The simple fact is that an HTML page, which is the form of most data retrieved from the Web, consists of the data to display and the markup which controls its formatting. Finding the data is usually easy in principle. The markup can be ignored by using a parser to find it: the parser simply does nothing with the markup after identifying it. Whatever is left is the raw data.
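As a small sketch of this, the standard parser can be pointed at a formatter which writes only the displayed text into a string:

import htmllib, formatter, StringIO

def html_to_text(html):
    # whatever the 'dumb' writer emits is the page's raw text
    buf = StringIO.StringIO()
    writer = formatter.DumbWriter(buf)
    p = htmllib.HTMLParser(formatter.AbstractFormatter(writer))
    p.feed(html)
    p.close()
    return buf.getvalue()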
Intellectual Property considerations
It is fair to say that much abuse of intellectual property rights is taking
place on the Web even as we speak. Most of the content of many 'adult' sites
for example is digitised without the knowledge or consent of the copyright
owners or indeed the subject(s).
It is also clear that modern copyright law is completely inadequate to the task of dealing with computers or networks in any reasonable manner. In particular, the interpretations of what constitutes copying of a work have been alternately both liberal and harsh in the recent past. (For example, does copying data from disk to an operating system maintained disk buffer, prior to delivery to the user application, constitute acceptable use or not? Lawyers are unsure! Where this interpretation might leave packets in transit through routers and switches is even more interesting, as the owners of those devices have no clear right to any data at all.)
In some sense, the Web is a publishing medium, and the owners of intellectual property must recognize that once it is published, many people will make many uses of their intellectual property, some anticipated (and indeed the reason for their electronic publication) and some unforeseen. The key issue is whether you deprive the owner/author of income which they might (rightly or wrongly) expect to receive. The only sensible position to take is that if you are asked by an intellectual property owner to cease collecting data from them, you should comply. By all means see if they are prepared to negotiate access to their intellectual property, but they own it, and failure to respect that ownership can result in legal action being taken (cf. Paramount's current actions to protect Star Trek intellectual property and Fox's attempts to protect Simpsons characterizations).
copyright and 'fair use' provisions
In general, if you are collecting data for your own use only, you may claim protection under the fair use provisions of copyright law. To what extent your having a copy of the data from a site constitutes an infringement is unclear. For example, were you to collect the daily Dilbert strip each day, it could be argued you were depriving the author of some of his anthology book income (from you). Would proving you own a complete set of such books constitute rebuttal? Certainly collecting such information and making it available in its collected form is likely to draw the ire of the owners. And quite rightly so.
storing other people's data
As has been said above, the main issue is storage. Web sites effectively give away their data for free as part of their publishing process; it is in fact what this paper relies on as the mechanism underlying its operations. Nevertheless, the idea that some surfers are making more permanent copies of this data sits uncomfortably with some information providers. Some sites will never even realize that they have been mined. Others may well be keeping careful watch. Some will simply notice in their agent log files that a 'Python-urllib/1.5' agent visited their site. If you visit often enough to annoy, you may also expect to be noticed.
linking to other people's data
Recently, attempts were made by rival newspaper publishers in the UK to prevent hyperlinks being made to their pages by other (notably news publishing) organizations. This has raised the issue of whether such linking is a fair use of published data. In the past, the answer was yes. In fact, the Web refers to the connectivity from one site to another. Without such links between disparate places, the Web wouldn't be a Web at all so much as a collection of little islands. The now ubiquitous advertising banners serve to take up a lot of slack here and are tacitly approved by both parties to the link, but it will still be a severe blow to the Web if such links are found to breach intellectual property rights. The last word has yet to be written on this issue, but it bodes poorly for the existence of search engines at all if such a finding is allowed to stand. The question of jurisdiction when links are transnational will also be a challenge.
exporting other people's data
This is fairly clearly an infringement of the law as it stands, and in fact you are unlikely to find anyone supporting the position that you should be allowed to take (steal) someone's data and make commercial use of it. In this sense Web publishing will remain like all other forms of publishing. Having said this, the Web, like traditional paper publishing, features a number of data sources which are effectively free for all. These include open public records, data which the owner wants widely disseminated (like advertisements) and data which is otherwise clearly marked as being for public use.
As noted above, even data which is clearly proprietary and protected may be made available if suitable terms can be struck with the owner, and a direct approach to enquire about the terms of such licensing should not be ruled out where a use for the data is clear and where a value for such access can be established. Such data is often already made available behind secure channels (such as SSL) or behind user/password challenge dialogs.
Optimizing your site to be mined
Having seen how spiders and robots are built, we can easily determine what
sort of site is easier to mine for data and what isn't. In particular this may
be of interest to those people who want to be well represented in the general
Web search engines.
The other reason for making it easy to mine your data is that you may have a
commercial interest in making it easier for your users to do this. An example
is the semiconductor chip manufacturers who now largely publish their
technical data online and allow designers to access it there or download it
for offline use. Making the download process efficient can only increase their
sales and their customer satisfaction. In general, if you are serving something to the public, making the content available with a minimum of fuss will only benefit you. The more hurdles you place, the fewer successful surfers will complete the course.
tree organization vs. directed graphs.
A recent fashion in Web site design has been the change from the simpler tree structured page layout to that of a directed graph. Some of these topologies are now becoming arbitrarily complex. State based robots can cope with such sites reasonably well, but rule based systems often have difficulty optimizing access or indeed even preventing the reload of data which has been captured already. While some see these directed graphs as more natural, there is clearly a judgement call to be made here.
load balancing across servers at one site
Sites which partition data across multiple servers with different names may find some robots do not follow links to these other machines. In particular, human intervention may be required to ensure that such sites are examined fully. If such a site frequently re-partitions its data set then it may effectively limit its general usefulness. Newer facilities in the DNS may see such crude load balancing abandoned in favor of more scalable and more subtle schemes. For now, examination of rejected URLs is often an illuminating exercise.

use of robots.txt
robots.txt, as a general guide to the intentions of the site administrators, should be honoured by all robots. In particular this may avoid any unpleasantness about intellectual property and access to private data sets or sensitive pages of information.
Where to from here
caching and proxies
Another form of Web service which we have yet to look at effectively is that of Web proxies. For simplicity, the examples here have ignored the existence of caches and proxies. The principal reason for considering these machines is that, by definition, the ones of interest are closer to us both geographically and logically than the services themselves, and much of the data we wish to mine may already be there. Caches and proxies are obviously also attractive as not only may the data be available in a more timely manner, but it may be available at a reduced charge to the robot. Cache and proxy administrators may choose to use specially programmed robots as a crude mechanism to preload commonly requested pages at intervals which pre-empt regular loads.
paying attention to expiry information on data
A piece of the meta data which may be associated with each piece of Web data is an estimate by the server of how long the information will remain current. This is principally intended to enable browsers both to perform cache management for the user and to reload data (pages) which have become stale. This allows dynamic data such as statistics to be reloaded periodically and kept fresh without the need for user interaction to bring this about. Some Web servers and sites though are known to provide (either incidentally or deliberately) quite incorrect values for this data, so means to override this processing must also be provided if it is undertaken.
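The headers of each response carry this estimate when the server offers one; a minimal sketch of reading it:

import urllib

f = urllib.urlopen('http://www.vic.auug.org.au/auugvic/av_meetings.html')
expires = f.info().getheader('Expires')  # headers parse as an rfc822 message
if expires:
    print 'server claims the data is current until', expires
else:
    print 'no expiry information offered'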
parallelizing retrieval

Most of the code we've looked at so far is single threaded. For the purposes of this presentation this puts natural limits on complexity and bandwidth. Clear too is the fact that if you have a lot of data to retrieve, farming the job of retrieving data out to a pool of subprocesses, and thus parallelizing the retrieval process, will make better use of your available bandwidth given the unavoidable round-trip delays in the protocol layers. Data mining processes which run for extended durations (days rather than hours) should alternatively be written to be self limiting in their bandwidth requirements, as otherwise howls of protest may be heard from interactive users, who should have priority over such automated background tasks.
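A sketch of such a pool, using threads rather than subprocesses (the pool size of 4 is an arbitrary politeness limit, and 'urls' is the list from the first example):

import threading, urllib, time

def fetch(url):
    try:
        print url, ' ----> ', urllib.urlretrieve(url)
    except IOError, detail:
        print url, ' failed:', detail

workers = []
for url in urls:
    while threading.activeCount() > 4:   # crude throttle on the pool
        time.sleep(1)
    t = threading.Thread(target=fetch, args=(url,))
    workers.append(t)
    t.start()
for t in workers:
    t.join()                             # wait for the stragglers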
The other obvious cause for concern here is that if we parallelize the retrieval process too well we may become a significant and, more importantly, unwelcome load on the target site(s). This more than anything is likely to draw attention, protest or ultimately refusal of service from those same targets. Once again, a balanced approach mindful of the data owner's point of view is more likely to be welcomed.
Many sites today are defraying the costs of their network access by hosting
advertising in one form or another. If your automated access mechanism is
running at non-peak hours without supervision you may wish to consider having
it follow these links too. The hits may well be coming back to your data
source as revenue and you're not obliged to actually look at (or in fact
store) any data which comes back this way. If you're paying volume charges
this may seem less attractive to you, but then advertising is unlikely to form
a large percentage of what you mine and as noted, it may well be financing
your data supplier.
Whether this is seen as a good thing or not will depend on your view of the efficacy and desirability of Web advertising. Clearly though, while people are taking simplistic hit count based measurements of patronage, they are open to gross and not particularly subtle manipulation like this. Beware, both if you are a buyer of such services and if you rely on them. Building a robot to make false hits on a Web site or advertising link is no harder than doing real work. As I said, hit counts are too simplistic and you must beware of how you are tempted to interpret them.
You may find a few other small hurdles to jump when you choose to play this game. Most of these are not show stoppers, but they are occasionally annoying.
When you retrieve HTML pages from the Web, the inclination is to store them under names which are close to the name implied by the URL. This means a mechanism must be built which can perform the mapping from URL to local filename. This leads to the problem that Web servers may interpret some URLs in different ways. Notably, URLs which map to directories in the Web server spool area may return a default page or a file list for the directory. When a local storage name is created, the choice must be made between a file and a directory (i.e. part of a path). Indeed, whichever choice is made, a mechanism needs to be present which allows the Miner to recant and reverse its decision (e.g. a directory was stored as a file and must be renamed, a directory created, and the file moved into it with a name which hopefully won't clash).
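One plausible mapping is sketched below; the 'index.html' guess for directory URLs is exactly the sort of decision which may later need reversing:

import os, urlparse

def local_name(url, root='mirror'):
    # e.g. http://host/dir/ -> mirror/host/dir/index.html
    host, path = urlparse.urlparse(url)[1:3]
    if path == '' or path[-1:] == '/':
        path = path + 'index.html'       # guess a name for directories
    return os.path.join(root, host) + path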
As we speak, the first implementations of HTTP 1.1 are creeping out into the world. This has already highlighted some problems in HTTP 1.0 implementations. So far these are more in the nature of annoyances, but it must be noted that when you get a file full of "404 - data not found", this may be the cause. This is merely another reason why choosing a language which is easily debugged and tuned may be a wiser choice than a compiled traditional language.
A slight variation on the theme above is that at times objects are returned from a Web server with a MIME type which was not anticipated in the robot. As HTTP clients must nominate the MIME types they are prepared to accept in advance of seeing the data, asking for a resource which has a type other than one in which interest has been registered results in an error too.
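Registering interest up front is straightforward with urllib's opener objects; a short sketch:

import urllib

opener = urllib.URLopener()
opener.addheader('Accept', 'text/html')   # sent with every request made
opener.addheader('Accept', 'image/gif')
filename, headers = opener.retrieve('http://www.vic.auug.org.au/auugvic/av_meetings.html')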
Taming HTTP implementations
Surprisingly, bugs in the HTTP implementations of all these languages are not unheard of! This once again manifests, as noted above, in error HTML files being retrieved. The solution is once again a quick bout of debugging. It can be illuminating to use tcpdump or a similar tool to examine the data stream from the robot in such circumstances and compare it to a similarly captured data stream from one of the mainstream browsers. Problems are usually quickly highlighted in this manner.
Conclusions

Mining data from the Web can be both quick and easy. The quality of the data
is as variable as the sources it is gleaned from. The new kid script languages
are all well supplied with Web savvy and allow the user to build and tune a
variety of Miners which can be used to good effect. In fact, the flexibility
thus gained outweighs any efficiency concerns which might be raised. Indeed,
the speed of Web transfer seldom places such a strain on systems that
efficiency need be a concern and the desire to limit bandwidth impact on the
source sites should maintain this state.
Balanced against this ease, is the need to be sensitive to the intent and
concerns of the owners of the data we are working with. The law is gloriously
silent on most things Web and while this vacuum can be seen as carte blanche,
it more realistically means that older, perhaps inappropriate laws may be
pressed into service or that the owners of intellectual property who feel
slighted may take civil action to protect their ownership.
References

Clinton Wong, Web Client Programming.
O'Reilly & Associates, Inc. ISBN 1-56592-214-X
Shishir Gundavaram, CGI Programming.
O'Reilly & Associates, Inc. ISBN 1-56592-168-2
Chuck Musciano & Bill Kennedy, HTML: The Definitive Guide.
O'Reilly & Associates, Inc. ISBN 1-56592-175-5
T. Berners-Lee, R. Fielding, H. Frystyk,
RFC 1945: Hypertext Transfer Protocol -- HTTP/1.0
R. Fielding, J. Gettys, J. Mogul, H. Frystyk, T. Berners-Lee,
RFC 2068: Hypertext Transfer Protocol -- HTTP/1.1
Mark Lutz, Programming Python.
O'Reilly & Associates, Inc. ISBN 1-56592-197-6
Aaron Watters, Guido van Rossum, James C. Ahlstrom,
Internet Programming with Python
M & T Books, ISBN 1-55851-484-8
Larry Wall, Tom Christiansen, Randal Schwartz, Programming Perl.
O'Reilly & Associates, Inc. ISBN 1-56592-149-6