AUUG: Annual Winter Conference '97



Data Mining on the World Wide Web

Enno Davids, Metva P/L.
Enno.Davids@metva.com.au





ABSTRACT
The Web has turned out to be a great place to publish information, but often that information is poorly organized, poorly served or simply served in a manner which we choose not to burden ourselves with. Rather than forego this information entirely, we can now programmatically retrieve, condense and optionally archive this data. How this is done, how easy it is, and some of its wider implications form the basis of this paper.


Introduction

Many organizations are now publishing everything from useful catalogues of information to searchable directories on the World Wide Web. This is happening at a time when some of our favorite programming and scripting languages are also offering significant abilities to access the Web directly and process the information retrieved from there.

Faced with this cornucopia of data, the challenge is now to automate the retrieval of this data so that we can have local access to it at will, without the need to 'hit the wire' personally. Indeed, the ability to perform such operations allows us to spread the load on our outbound connections to those times of day when neither we nor our co-workers are inconvenienced. In fact, as time goes by, the Web becomes bigger and the need changes from one of finding data to condensing it down to just what we want.

This then is a discussion of how to automate access to Web based data and how to build programs which can navigate the Web automatically to perform such data retrieval.


Languages

When the first tools which in a sense defined the Web were built, they were both thin on the ground and of limited scope. This is not to say that text based browsing is bad, simply that it was all you could do.

Over time though, people have wanted to do more complex things with Web sites and these have almost always meant writing programs to access these sites. In the beginning this meant lifting the 'C' routines used by the first generation browsers and using them to write programs that do useful work. More recently, the code to access the Web has appeared in a few other languages and often these languages are better suited to the tasks we want to do. This is because we often need to do a lot of text processing to find the data we asked for in the morass of Webvertisements, Frames and corporate image building, and because the Web and the network it is built on are by nature slow and don't require the speed or efficiency of compiled 'C'.

The choice of language then may more often be made on intangibles like how good the debugging support is or how fast the edit/test cycle is. To some extent, because of the way Web sites are built and continuously upgraded, you may find that your mining script needs a similar level of commitment to maintenance. It's certainly worth evaluating up front how much effort you're prepared to expend on getting this data. Just as the Web site may well be a work in progress which never completes, so too the script you build to access it may well need constant care.

Another factor in the equation of which language to choose is the scope of the Web support it offers and how easily the library components it supplies can be customized to the task at hand. So let's look at a few of the players.

Python Python comes with extensive libraries in its standard form which greatly simplify the task of trolling the Web for data. With both libraries to automate HTTP accesses and URL manipulation and libraries to parse SGML and HTML it is well placed to interact with the Web. Python also offers reasonable speed and is well documented.
PERL PERL also has extensive Web access libraries available to assist in the task of automating data retrieval. In particular PERL, with its robust and efficient regular expression support, is well suited where searching of the returned data is pattern based. PERL is ideal for gleaning nuggets of data from large volumes in this way. Note though that you either have to have a fast network connection or be patient.
Java Much of the code necessary for Java to do this also exists. The proof of this is that at least one browser (that we know of) has been written in Java. This means the mechanisms needed to retrieve, pick apart and sensibly interpret Web based data streams exist. To what extent these code resources are available and will be made available is less clear. (HotJava was to be re-engineered by Sun as a set of classes for 3rd parties to develop "Web-enabled" programs. As such this would represent the only really commercially supported example of such code at this point in time.)


It's worth noting that the distinguishing feature here is that each language has libraries which do some or all of the work of interacting with Web servers for you. They all have code to parse retrieved HTML and, to a greater or lesser degree, to do other webby things like manipulating URLs and handling MIME encapsulated data. The extent of such support you can leverage off, and the ease with which this support can be customized and tuned, also become issues when choosing.
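
As a taste of the kind of support on offer, Python's standard urlparse module handles the URL manipulation side. The snippet below (the page named is purely illustrative) resolves a relative link against its base page and splits a URL into its component parts.

    import urlparse

    # A purely illustrative base page on the author's site.
    base = 'http://www.metva.com.au/papers/index.html'

    # Resolve a relative link found in that page against its base URL.
    print urlparse.urljoin(base, 'mining.html')

    # Split a URL into (scheme, host, path, params, query, fragment).
    print urlparse.urlparse(base)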

Most of these other languages also offer greater cross platform support should you have an active interest in performing these operations from hosts other than UNIX machines. In particular each of the three named above has good PC support. The downside of such cross platform source portability is that each foregoes some of the intimacy with the environment which is offered by C and C++.

Finally remember that if you do need speed and efficiency, you can't beat C and C++. In fact, in something of a turnaround, we at times need to remind ourselves that the Web was first written in the older, more traditional languages and migrated to the younger languages only later, when the Web was starting to mature and we broadened our interest in how we might use it.

For the purposes of this paper, we will use the facilities offered by Python almost exclusively. Those of you who have an interest in using PERL for the same tasks should look at the excellent book on the subject from O'Reilly & Associates, "Web Client Programming with Perl" [1]. Indeed, this book contains many valuable insights and can be recommended to anyone planning on undertaking such tasks, or indeed to those Web administrators who want insight into how to make their sites more useful.


Spiders & Robots

Spiders and Robots are specialized classes of data mining programs which have been on the Web since its earliest days. Indeed, the Web search engines which are so useful now when searching the Web interactively build their databases almost exclusively by using these automated programs to explore the Web: they find pages and then explore all the links from those pages to other pages, both locally and at more remote sites. This effectively exploits the connectivity inherent in the ability to link pages of information, and it also highlights a fact of Web existence which is often overlooked. If your Web pages 'stand alone', then they are effectively not really in the Web itself. They are leaves on the tree, rather than branches. This is why other people often encourage you to link to their pages, as it increases the likelihood they will be found when a search engine (or human) passes along the links from those pages.

This also serves to highlight one difficulty of finding your data on the Web. There is a combinatorial explosion of links which your Miner will discover and which it will need to decide whether to follow or ignore. Strategies need to be developed, and from time to time refined, to allow the Miner to sensibly follow or discard various pieces of data.
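
One simple strategy, sketched below with a hypothetical worth_following() helper, is to remember every URL already handled and to discard links which lead away from the site the Miner started at.

    import urlparse

    seen = {}          # URLs already fetched or queued

    # Hypothetical helper: decide whether a discovered link is worth following.
    def worth_following(url, start_host='www.metva.com.au'):
        # Discard anything we have already seen.
        if seen.has_key(url):
            return 0
        # Discard links which lead away from the site we started at.
        if urlparse.urlparse(url)[1] != start_host:
            return 0
        seen[url] = 1
        return 1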

Several examples of Python based robots will be shown which range from a dozen lines of Python to longer, more complex and more capable systems. Indeed, a simple Python based Miner might look like this:
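
(The listing below is a sketch using only the standard urllib module; the two URLs are illustrative, the second deliberately pointing at a page which doesn't exist, and all error handling has been left out.)

    import urllib

    # Illustrative URLs; the second deliberately doesn't exist.
    urls = ['http://www.metva.com.au/index.html',
            'http://www.metva.com.au/no-such-page.html']

    for n in range(len(urls)):
        data = urllib.urlopen(urls[n]).read()    # fetch whatever the server returns
        out = open('page%d.html' % n, 'w')       # save it to a local file
        out.write(data)
        out.close()
        print urls[n], '-', len(data), 'bytes'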

I should note that while it's pretty minimal, this example serves to show off a few things of note. Firstly, with the right libraries, and those available for both PERL and Python certainly qualify, the task of getting the raw data off the Web is completely trivial. Secondly, you need to watch out for error conditions carefully. This is highlighted by the second URL retrieved in the example, which in fact doesn't exist. The Web server returned the dreaded "404 - URL not found" and the Web page storing that frankly useless data was saved for us anyway. In fact, the Python library offers simple means to trap this condition but, as ever, we have elided this code and other error checking in the interests of brevity.

Each example then attempts to show some feature of accessing Web based data which may be of use to the prospective user. These features largely devolve to limiting the data which the robot is prepared to retrieve and store for longer periods. Each Web robot can easily retrieve much more data in a short period than most people can comfortably contemplate, if the only limitations are the bandwidth of the path from data source to data sink and the storage available at your site. It has to be noted that with falling disk prices and the arrival of cable modem access to the net, this can be significant. If you have signed up for Big Pond Cable, and thus have lots of bandwidth and no time charges, you would be well advised to watch the volume charges they live off!

So how can we limit the amount of data we try to pull down?


Automating access to interesting information

Having found an interesting piece of data on the Web, you now wish to automate its retrieval. In all cases this will involve retrieving it and storing it locally. We have already seen the simple example above get this far. What happens next depends on what sort of data you are looking for.

The first thing to note is that a 'page' of data on the Web may in fact be composed of many interlocked pieces. The most obvious of these are the graphics which adorn many Web pages. The simple example we saw only retrieved the base HTML data of the URLs we asked it to fetch. To have the complete page at hand we must additionally fetch those extra pieces of data we do not yet have. For most purposes this means simply scanning the base HTML data and extracting the URLs associated with the rendering of the page. We may also wish to retrieve other pieces of data which have been associated with the page. In some cases you will choose to specifically ignore some of the data (like images) which is of no interest or benefit to you.
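
As a sketch of that scan, using Python's standard sgmllib module on a page saved earlier by the simple Miner (the file name is illustrative), the following collects the src URLs of the inline images and resolves them against the page's own URL:

    import sgmllib, urlparse

    class PageBits(sgmllib.SGMLParser):
        # Collect the URLs of the extra pieces needed to render a page.
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.inline = []

        def start_img(self, attrs):
            for name, value in attrs:
                if name == 'src':
                    self.inline.append(value)

    parser = PageBits()
    parser.feed(open('page0.html').read())    # a page saved by the earlier example
    parser.close()
    for url in parser.inline:
        # Inline pieces are often relative; resolve them against the page's URL.
        print urlparse.urljoin('http://www.metva.com.au/index.html', url)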

More generally of course, there is also the question of examining the HTML itself and extracting the data for which we downloaded it.


Automating access to (other peoples) search engines

Search engines present an interface to the world which is bound to HTTP, just like the data sources we are more interested in. This means retrieving specific information through these CGI gatewayed services is no more difficult than any other form of Web data mining. Conceptually at least. :)

In fact, Web search engines offer a valuable means of automating the search for other people's data, over and above the search for the data itself. To get the maximum benefit, the data extraction from the retrieved pages may have to become much more subtle. On the upside, extracting the link data from a search engine's results page is simply a matter of following the links it contains, something already catered for by all the languages we are considering.
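
A sketch of the idea, using the standard urllib, htmllib and formatter modules against an entirely hypothetical CGI search interface, might look like this:

    import urllib, htmllib, formatter

    # Build the query URL for a hypothetical CGI gatewayed search engine.
    query = 'data mining'
    url = 'http://search.example.com/cgi-bin/search?q=' + urllib.quote_plus(query)

    results = urllib.urlopen(url).read()

    # The result links are just anchors in the returned page.
    parser = htmllib.HTMLParser(formatter.NullFormatter())
    parser.feed(results)
    parser.close()
    for link in parser.anchorlist:
        print link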


Mining data from inside returned HTML

We've briefly touched on the techniques to use above. The simple fact is that an HTML page, which is the form of most data retrieved from the Web, consists of the data to display and the markup which controls its formatting. Finding the data is usually easy in principle: the markup can be identified with a parser and then simply ignored. The parser does nothing with the markup after identifying it, and whatever is left is the raw data.
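
A minimal sketch of such a parser, again built on the standard sgmllib module and fed a previously saved page (the file name is illustrative), might be:

    import sgmllib, string

    class TextOnly(sgmllib.SGMLParser):
        # Identify the markup, do nothing with it, and keep whatever is left.
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.pieces = []

        def unknown_starttag(self, tag, attrs):
            pass                               # markup is identified ...
        def unknown_endtag(self, tag):
            pass                               # ... and then discarded

        def handle_data(self, data):
            self.pieces.append(data)           # the raw data survives

    parser = TextOnly()
    parser.feed(open('page0.html').read())     # a page saved by the earlier example
    parser.close()
    print string.join(parser.pieces, '')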


Intellectual Property considerations

It is fair to say that much abuse of intellectual property rights is taking place on the Web even as we speak. Most of the content of many 'adult' sites for example is digitised without the knowledge or consent of the copyright owners or indeed the subject(s).

It is also clear that modern copyright law is completely inadequate to the task of dealing with computers or networks in any reasonable manner. In particular, the interpretations of what constitutes copying of a work have been alternately liberal and harsh in the recent past. (For example, does copying data from disk to an operating system maintained disk buffer, prior to delivery to the user application, constitute acceptable use or not? Lawyers are unsure! Where this interpretation might leave packets in transit through routers and switches is even more interesting, as the owners of those devices have no clear right to any data at all.)

In some sense, the Web is a publishing medium, and the owners of intellectual property must recognize that once their work is published, many people will make use of it in ways they have anticipated (indeed, the reason for electronic publication in the first place) and in some ways that are unforeseen. The key question is whether you may deprive the owner/author of income which they might (rightly or wrongly) expect to receive. The only sensible position to take is that if you are asked by an intellectual property owner to cease collecting data from them, you should comply. By all means see if they are prepared to negotiate access to their intellectual property, but they own it, and failure to respect that ownership can result in legal action being taken (c.f. Paramount's current actions to protect Star Trek intellectual property and Fox's attempts to protect Simpsons characterizations).


Optimizing your site to be mined

Having seen how spiders and robots are built, we can easily determine what sort of site is easy to mine for data and what sort isn't. In particular this may be of interest to those people who want to be well represented in the general Web search engines.

The other reason for making it easy to mine your data is that you may have a commercial interest in making it easier for your users to do this. An example is the semiconductor chip manufacturers who now largely publish their technical data online and allow designers to access it there or download it for offline use. Making the download process efficient can only increase their sales and their customer satisfaction. In general, if you are serving something to the public, making the content available with a minimum of fuss will generally only benefit you. The more hurdles you place, the fewer successful surfers will complete the course.


Where to from here


Other Problems

You may find a few other small hurdles to jump when you choose to play this game. Most of these are not show stoppers but they are occasionally inconvenient.


Conclusion

Mining data from the Web can be both quick and easy. The quality of the data is as variable as the sources it is gleaned from. The new kid script languages are all well supplied with Web savvy and allow the user to build and tune a variety of Miners which can be used to good effect. In fact, the flexibility thus gained outweighs any efficiency concerns which might be raised. Indeed, the speed of Web transfer seldom places such a strain on systems that efficiency need be a concern and the desire to limit bandwidth impact on the source sites should maintain this state.

Balanced against this ease, is the need to be sensitive to the intent and concerns of the owners of the data we are working with. The law is gloriously silent on most things Web and while this vacuum can be seen as carte blanche, it more realistically means that older, perhaps inappropriate laws may be pressed into service or that the owners of intellectual property who feel slighted may take civil action to protect their ownership.



References

[1] Clinton Wong, Web Client Programming with Perl.
O'Reilly & Associates, Inc. ISBN 1-56592-214-X
[2] Shishir Gundavaram, CGI Programming.
O'Reilly & Associates, Inc. ISBN 1-56592-168-2
[3] Chuck Musciano & Bill Kennedy, HTML: The Definitive Guide.
O'Reilly & Associates, Inc. ISBN 1-56592-175-5
[4] T. Berners-Lee, R. Fielding, H. Frystyk,
RFC1945 Hypertext Transfer Protocol -- HTTP/1.0
http://www.ics.uci.edu/pub/ietf/http/rfc1945
[5] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, T. Berners-Lee,
RFC2068 Hypertext Transfer Protocol -- HTTP/1.1
http://www.ics.uci.edu/pub/ietf/http/rfc2068.txt
[6] Mark Lutz, Programming Python.
O'Reilly & Associates, Inc. ISBN 1-56592-197-6
[7] Aaron Watters, Guido van Rossum, James C. Ahlstrom,
Internet Programming with Python
M & T Books, ISBN 1-55851-484-8
[8] Larry Wall, Tom Christiansen, Randal Schwartz,
Programming Perl.
O'Reilly & Associates, Inc. ISBN 1-56592-149-6



Document last revised: Mon Jan 19 23:00:18 EST 1998


