CSC 465 Assignment 3

 

The Web Spider

 

 

Due:  Thursday, March 8, 2001

 

 

The objective of this assignment is to convince you of the relative ease with which HTTP client applications can be designed and implemented.

 

You may work either individually or with one partner.  If you would like a partner but don’t know someone to ask, let me know and I’ll match you up.

 

 

 

The Challenge.

 

Implement an HTTP client application to explore and profile a web site.  The program will be called spider.

 

 

Communication Protocol.

 

§ Your program will use socket connections to communicate with web servers using the HTTP/1.0 protocol.

§ For each page requested, you will first issue a HEAD command.  If the header information indicates that the file is OK (status code 200), that it is an HTML file (content type text/html), and that it is on the web site’s server, then issue a GET command for the same file.  Read the file contents and parse it to find hyperlinks and references to images.  Details below.  (A minimal sketch of the HEAD exchange follows this list.)

§ Links found during parsing are similarly explored if they have not already been encountered.
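
For concreteness, here is a minimal sketch of a single HEAD exchange over a raw socket.  It assumes the server listens on port 80 and that the host and path have already been pulled out of the URL; the class and method names are illustrative only, not required by the assignment.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class HeadSketch {
    // Issue one HEAD request and return the numeric status code.
    static int sendHead(String host, String path) throws Exception {
        Socket sock = new Socket(host, 80);        // HTTP's default port
        PrintWriter out = new PrintWriter(sock.getOutputStream());
        BufferedReader in = new BufferedReader(
                new InputStreamReader(sock.getInputStream()));

        // HTTP/1.0 request: request line, optional headers, then a blank line.
        out.print("HEAD " + path + " HTTP/1.0\r\n");
        out.print("Host: " + host + "\r\n");
        out.print("\r\n");
        out.flush();

        // The status line looks like "HTTP/1.0 200 OK"; the code is the second token.
        String statusLine = in.readLine();
        int status = Integer.parseInt(statusLine.split(" ")[1]);

        sock.close();
        return status;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sendHead("www.example.com", "/index.html"));
    }
}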

 

 

Program Input.

 

Run the client with the command:

 

java spider rootURL [trace] [pageLimit]

 

Where rootURL is the web page at which to begin the exploration, and pageLimit is an optional limit on the number of web pages to request using GET (default 100).  The trace feature is triggered by typing the word “trace” after the rootURL; it too is optional.  Since “trace” is non-integer and pageLimit must be an integer, your program should handle the situation where a page limit is specified but the trace argument is not.
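
One way to handle the optional arguments is to treat any remaining argument that is the word “trace” as the trace flag and anything else as the page limit.  The following is a minimal sketch only; the class name and structure are illustrative.

public class ArgsSketch {
    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: java spider rootURL [trace] [pageLimit]");
            return;
        }
        String rootURL = args[0];
        boolean trace = false;
        int pageLimit = 100;                        // default from the assignment

        for (int i = 1; i < args.length; i++) {
            if (args[i].equalsIgnoreCase("trace")) {
                trace = true;                       // the word "trace" turns tracing on
            } else {
                pageLimit = Integer.parseInt(args[i]);  // anything else must be the limit
            }
        }
        System.out.println(rootURL + " trace=" + trace + " limit=" + pageLimit);
    }
}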

Program Output.

 

Output a statistical summary to include:

1. Number of HEAD requests issued

2. A counter for each status code returned from a HEAD request (e.g. how many 200s, how many 404s, etc.)

3. Number of GET requests issued

4. Number of web pages successfully fetched (should match the number of GET requests)

5. For all pages successfully fetched using GET, report the minimum, maximum, and average (a small helper sketch appears after this list):

a. page length in bytes

b. number of images per page (count “IMG” occurrences)

c. number of hyperlinks per page (count “HREF” occurrences)

d. number of internal links per page (links to pages on the same server)

e. number of external links per page (links to pages on a different server)

 

For #5, don’t be concerned with duplicates.  For example, if a page contains three links to the same web page, count all three.  That does not mean you should explore the page three times – this is explained below.
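
If it helps, the min/max/average bookkeeping for item 5 can be isolated in a small helper class like the following sketch (the class and method names are illustrative only).

public class Stat {
    private int count = 0;
    private long total = 0;
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    // Record one observation (a page length, an image count, etc.).
    public void record(long value) {
        count++;
        total += value;
        if (value < min) min = value;
        if (value > max) max = value;
    }

    public long getMin() { return min; }
    public long getMax() { return max; }
    public double getAverage() {
        return (count == 0) ? 0.0 : (double) total / count;
    }
}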

 

Trace output:  Output a single line for each HEAD or GET request, to include (a minimal sketch follows this list):

1. Whether it was HEAD or GET

2. The URL requested

3. The status code from the response

4. The file length from the response, if the status code was 200
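
A trace line can be as simple as the following sketch; the helper name and exact formatting are illustrative, not required.

public class TraceSketch {
    static void traceLine(String method, String url, int status, long length) {
        String line = method + " " + url + " " + status;
        if (status == 200) {
            line += " length=" + length;   // only report the length on success
        }
        System.out.println(line);
    }

    public static void main(String[] args) {
        traceLine("HEAD", "http://www.example.com/index.html", 200, 2048);
        traceLine("GET",  "http://www.example.com/index.html", 200, 2048);
        traceLine("HEAD", "http://www.example.com/missing.html", 404, 0);
    }
}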

 

 

The Basic Idea.

 

Your program will get a URL from the command line, contact the HTTP server (which you will have to extract from the URL) through a socket, and request information about the page using the HTTP HEAD command.  If this is not successful, which you can tell from the status code, that’s it.  If it is successful, fetch the page using GET, record the page length (this is in the header), count the number of image references and hyperlinks, then repeat the process recursively for each hyperlink.  Notice that every page request starts with an HTTP HEAD command, whose response is a header.  If the header status code is 200, the content type is text/html, and the page is on the same server (the one identified on the command line), then issue GET to fetch the page and repeat the process.  Otherwise, do not issue the GET.
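
Here is a high-level sketch of that control flow.  The head(), get(), isInternal(), and extractLinks() methods are stubs standing in for the socket and parsing code you will write; only the order of operations is the point, and all names are illustrative.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CrawlSketch {
    // Tiny holder for the interesting parts of a HEAD response.
    private static class Header {
        int status;
        String contentType;
        Header(int status, String contentType) {
            this.status = status;
            this.contentType = contentType;
        }
    }

    private final Set<String> visited = new HashSet<String>(); // pages already fetched with GET
    private final int pageLimit;
    private int pagesFetched = 0;

    CrawlSketch(int pageLimit) {
        this.pageLimit = pageLimit;
    }

    void explore(String url) {
        if (visited.contains(url) || pagesFetched >= pageLimit) {
            return;                                 // already seen, or over the GET limit
        }
        Header h = head(url);                       // HEAD first, always
        if (h.status != 200 || !h.contentType.startsWith("text/html") || !isInternal(url)) {
            return;                                 // header says: do not GET
        }
        String page = get(url);                     // fetch the page itself
        visited.add(url);
        pagesFetched++;
        for (String link : extractLinks(page)) {
            explore(link);                          // recurse on each hyperlink
        }
    }

    // ---- stubs: replace with the real socket and parsing code ----
    private Header head(String url)                { return new Header(200, "text/html"); }
    private String get(String url)                 { return ""; }
    private boolean isInternal(String url)         { return true; }
    private List<String> extractLinks(String page) { return new ArrayList<String>(); }

    public static void main(String[] args) {
        new CrawlSketch(100).explore("http://www.example.com/index.html");
    }
}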

 

 

To Turn In:

 

Copy your source files as well as an ASCII readme file to your eccentric server upload folder (if working as a team, copy to either member’s folder).  The readme file must include your name(s), explain the implementation platform, and describe anything weird that might happen.

 

Note that I will be out-of-town from February 20 through February 25, so any questions you need addressed before February 26 will have to be asked before then.  I am leaving immediately after the CSC 482 class Tuesday morning.

 

 

Details.

 

1. Beware of cycles!  For instance, if you fetch index.html and find it has a link to info.html, then you fetch info.html and it has a link back to index.html, a hyperlink cycle is formed.  If you blindly explore every link, you’ll get into a never-ending cycle!  To avoid it, you have to keep a list of each page that has already been explored.  To do this, add the URL to the list after it has been fetched using GET.  Then before issuing any HEAD command, check the requested URL against this list.  If it is already there, do not issue the HEAD.  Consider using a hash table; Java provides a nice class for this.  The URL class in java.net should be useful too, and provides a hashCode method.  (A minimal sketch of this check appears after this list.)

 

2. Your solution should be reasonably object-oriented.  Even if you have little Java experience, its class construct is quite similar to that of C++.

3. Your source code should be adequately documented (self-documenting identifier names, comments, indentation, etc.).  Be sure to put your name(s) at the top of every file!

 

4. Other details come to light in the FAQ list below.  You must read and understand them.
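
As promised in item 1, here is a minimal sketch of the “already explored” bookkeeping.  It stores URL strings in a java.util.HashSet; a Hashtable keyed on java.net.URL objects (which provide hashCode) would work just as well.  The names are illustrative.

import java.util.HashSet;
import java.util.Set;

public class VisitedSketch {
    private final Set<String> visited = new HashSet<String>();

    // Call after a page has been fetched using GET.
    void markExplored(String url) {
        visited.add(url);
    }

    // Call before issuing any HEAD command.
    boolean alreadyExplored(String url) {
        return visited.contains(url);
    }

    public static void main(String[] args) {
        VisitedSketch v = new VisitedSketch();
        System.out.println(v.alreadyExplored("http://www.example.com/index.html")); // false
        v.markExplored("http://www.example.com/index.html");
        System.out.println(v.alreadyExplored("http://www.example.com/index.html")); // true
    }
}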

 

 

FAQs

 

Q: What distinguishes “internal” versus “external” pages?

 

A:  If the URL is a relative pathname, the page is internal.  If the URL is absolute, extract the host name (everything between “://” and the next “/”).  The Java InetAddress class provides a static method called getByName, which takes a String presumed to contain a host name and returns its IP address as an InetAddress object.  My suggestion is that you use this at the start to get the IP address of the server specified on the command line.  Then for each absolute URL, similarly extract the host name and get its IP address.  If it matches the initial one, the page is internal; otherwise it is external.  If this causes serious performance problems, you might want to just keep the original server name in a String, then for each absolute URL extract the host name and compare the strings for equality.  This will catch everything except aliases, and will run a hell of a lot faster.
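
A minimal sketch of both approaches, assuming the “://” extraction rule described above.  The class and method names are illustrative.

import java.net.InetAddress;

public class HostCheckSketch {
    // Everything between "://" and the next "/" is the host name.
    static String extractHost(String absoluteURL) {
        int start = absoluteURL.indexOf("://") + 3;
        int end = absoluteURL.indexOf('/', start);
        return (end == -1) ? absoluteURL.substring(start)
                           : absoluteURL.substring(start, end);
    }

    // Alias-proof but slower: compare resolved IP addresses.
    static boolean sameServer(String rootHost, String otherHost) throws Exception {
        return InetAddress.getByName(rootHost).equals(InetAddress.getByName(otherHost));
    }

    // Faster alternative: compare the host name strings directly (misses aliases).
    static boolean sameServerByName(String rootHost, String otherHost) {
        return rootHost.equalsIgnoreCase(otherHost);
    }

    public static void main(String[] args) throws Exception {
        String root = extractHost("http://www.example.com/index.html");
        String link = extractHost("http://www.example.org/dir/page.html");
        System.out.println(sameServerByName(root, link));  // false: different host names
        System.out.println(sameServer(root, link));        // compares resolved IP addresses
    }
}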

 

 

Q:  Regarding the summary information, it is not clear to me what counts as an image or a hyperlink for statistical purposes.

 

A: Interpret “number of images per page” as the number of IMGs.

Interpret “number of hyperlinks per page” as the number of HREFs.

Interpret “number of internal links per page” as the number of internal (see previous question) HREFs that link to HTML files.

Interpret “number of external links per page” as the number of external (see previous question) HREFs that link to HTML files.

 

 

 

Q:  How do I know whether an HREF refers to an HTML file or not?

 

A:  You can request information about a page without fetching the page itself by using the HTTP command HEAD instead of GET.  The command format is the same.  HEAD will return only the header information.  Look for the header line “Content-Type: xxxxxx”.  If the content type is “text/html” then it is HTML, otherwise not.  This means you will issue HEAD commands to external servers (which requires a new socket), but you will never issue a GET command to an external server.

 

 

Q:  Does this mean I'll use HEAD as a filter of sorts?

 

A:  Exactly.  Rather than trying to determine the file type from the filename extension, you will request information about the file using HEAD.  The response message will indicate the file content type.  Once you get information on an internal file using HEAD, you will issue GET for the same file only if the status code was 200 (OK) and the content type was text/html.
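
A minimal sketch of extracting the status code, Content-Type, and Content-Length from a response header.  The header here is a canned String so the example runs on its own; in the spider it would come from the socket’s input stream.

import java.io.BufferedReader;
import java.io.StringReader;

public class HeaderParseSketch {
    public static void main(String[] args) throws Exception {
        String response =
            "HTTP/1.0 200 OK\r\n" +
            "Content-Type: text/html\r\n" +
            "Content-Length: 2048\r\n" +
            "\r\n";

        BufferedReader in = new BufferedReader(new StringReader(response));

        // Status line: "HTTP/1.0 200 OK" -> the code is the second token.
        int status = Integer.parseInt(in.readLine().split(" ")[1]);

        String contentType = "";
        long contentLength = -1;
        String line;
        // Header lines continue until the blank line that ends the header.
        while ((line = in.readLine()) != null && line.length() > 0) {
            String lower = line.toLowerCase();
            if (lower.startsWith("content-type:")) {
                contentType = line.substring("Content-Type:".length()).trim();
            } else if (lower.startsWith("content-length:")) {
                contentLength = Long.parseLong(line.substring("Content-Length:".length()).trim());
            }
        }
        boolean issueGet = (status == 200) && contentType.startsWith("text/html");
        System.out.println(status + " " + contentType + " " + contentLength + " GET? " + issueGet);
    }
}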

 

 

Q:  What if response to HEAD is code 404 (not found)?

 

A: If you request a file that does not exist, the content type will come back as text/html regardless of any file type that might be inferred from the filename extension!  This is because the content type will refer to the server-generated file that contains the 404 error message.  I do not want you to GET that file!

 

 

Q:  What is a "redirect" and how do I handle it?

 

A:  When you issue the HEAD command and get a response code in the range 300-399 (typically 301), this means the link is being redirected to another file.  That file will be given in the "Location:" field in the response header.  It will be a full URL, so it needs to be fully processed (check for internal/external, then issue HEAD again).

 

 

Q:  What if an IMG or HREF is contained inside an HTML comment?

 

A:  Do not check for comments.  If this happens, your program will not be aware of it, and will go ahead and attempt to count / follow it.  This will rarely occur. 

 

 

Q:  Pages can use cascading style sheets that themselves have images.  Do I need to find these?

 

A:  No.  Just look for IMG and HREF (not case sensitive) in the current page.
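
A minimal sketch of a case-insensitive occurrence count, which is all the statistics in item 5 require.  The names are illustrative.

public class CountSketch {
    // Count occurrences of token in page, ignoring case.
    static int countOccurrences(String page, String token) {
        String haystack = page.toUpperCase();
        String needle   = token.toUpperCase();
        int count = 0;
        int index = haystack.indexOf(needle);
        while (index != -1) {
            count++;
            index = haystack.indexOf(needle, index + needle.length());
        }
        return count;
    }

    public static void main(String[] args) {
        String page = "<a href=\"a.html\">x</a> <IMG src=\"pic.gif\"> <A HREF=\"b.html\">y</A>";
        System.out.println(countOccurrences(page, "HREF"));  // 2
        System.out.println(countOccurrences(page, "IMG"));   // 1
    }
}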

 

 

 

Q:  What if the same image file is referenced repeatedly in an HTML file?

 

A:  Count each one separately.

 

 

Q:  What about "mailto" links?

 

A:  They are included in the count of "number of hyperlinks per page", but are not otherwise processed.

 

 

Q:  GET requires that the pathname start with "/" and contain the full path relative to the server root (necessary because the server is stateless).  But HTML assumes the browser will keep track of the current directory and allows links to be relative to it.  How do I deal with this?

 

A:  You have to do what the browser does.  Keep track of the current directory, and piece together the file path using it plus the relative path.  You need to handle "../" and "./" when they appear in a link.
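
A minimal sketch of resolving a relative link against the current directory by hand, in the spirit of the answer above.  It assumes currentDir begins and ends with "/" (e.g. "/docs/course/"); the names are illustrative.

import java.util.ArrayList;
import java.util.List;

public class PathSketch {
    static String resolve(String currentDir, String link) {
        if (link.startsWith("/")) {
            return link;                            // already relative to the server root
        }
        String path = currentDir + link;            // e.g. "/docs/course/" + "../index.html"
        List<String> parts = new ArrayList<String>();
        for (String seg : path.split("/")) {
            if (seg.length() == 0 || seg.equals(".")) {
                continue;                           // skip empty and "./" segments
            } else if (seg.equals("..")) {
                if (!parts.isEmpty()) {
                    parts.remove(parts.size() - 1); // "../" backs up one directory
                }
            } else {
                parts.add(seg);
            }
        }
        StringBuffer sb = new StringBuffer();
        for (String seg : parts) {
            sb.append("/").append(seg);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("/docs/course/", "../index.html")); // /docs/index.html
        System.out.println(resolve("/docs/course/", "./notes.html"));  // /docs/course/notes.html
        System.out.println(resolve("/docs/course/", "/top.html"));     // /top.html
    }
}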

 

 

Q:  What if the hyperlink ends with a directory name or with "/"?

 

A: In the first case, the server will respond with a redirect message (see the earlier question).  In the second case, the server will return index.html, index.htm, default.htm, ... as appropriate.




Last reviewed: 15 February 2001

Peter Sanderson ( PeteSanderson@smsu.edu )