Project Part 3: Web Crawling, Caching Search Results, and (optional) GUI
Due: Wednesday, December 12, 2012 by 11:59:59pm
You may work with one partner on
this assignment. Anyone from the class is fine.
If you were unable to complete Project 2, then I strongly
recommend you work with a partner who did complete it and start
from their code.
You can
download all starting code for this project from here.
COURSE PROJECT DESCRIPTION
This assignment is the third and final part in a series of related
assignments about the World Wide Web. Our ultimate goal is to build
a web crawler and search engine for a limited portion of the Web.
Your search engine will have some of the features of common search
engines such as Yahoo, Lycos, or Google. By the end of the semester
you will have implemented a Web browser with a fairly sophisticated
search engine. Your Web browser will have the following
capabilities:
- It will display a Web page given a URL
- It will automatically discover a local network of connected
web pages given a starting URL
- Its search engine will search for good matching Web pages
given a query string and display the resulting URLs in order of
best match to worst (we will re-define a "good match" as we add
more features to the search engine).
- The Web browser will automatically display the best matching
URL result of a search.
Most of the work involves implementing the search engine part of the
Web browser, and this is the part that you implemented in the first
two parts of the project.
I strongly encourage you to work in pairs on the project
assignments.
PROBLEM DESCRIPTION
This program consists of three main (and separable) parts:
- Part 1 is adding crawling, so that given an initial URL, it
will find a set of webpages that are local to that starting URL.
- Part 2 is adding caching of query results to your search
engine, speeding up the execution of queries by using the cached
results of previous queries.
- (Optional) Part 3 is tying your code into a GUI front-end to
your search engine that also fetches and displays web pages.
The solution that you submit should link the three parts together.
However, you can start working on each part independently. Parts 1
and 3 will not take you that long. Part 2 will take longer
since you must implement a hash table. When you add the second
part to the GUI Web browser, print the caching information to the
search results window as well as to stdout.
Part 1: Web Crawling
For this part, you will have your web
browser automatically discover new links given a starting point.
You will discover these links by parsing the starting url's file to
find href links to other local webpages, then parsing those pages in
turn, and so on. Specialized programs, called web crawlers, do this
on the Web to discover new pages for search engines.
Your Web browser will start by taking two or three command line
arguments: the first is a start_url, the second is link_depth
(the depth you will follow any link from the start_url), and the
third is an optional html_ignore file. For example, you may run
your program like this:
% java WebBrowser www.brynmawr.edu 4 html_ignore_file
Starting at the initial URL, it will crawl the websites following
href links to a depth of 4, yielding a list of URLs. The
list of URLs will then be used in place of the urls contained in
the urlListFile from previous parts. For each url, you will create
a URLContent object by parsing its file and creating its
WordFrequency tree. You should be able to make just a few
modifications to your ProcessQueries class to get this to work.
You have been given a HREFScanner class that takes care of all the
ugly parsing of url links. You initialize it with a url; it then
opens the url's file, scans the webpage for href tags, and returns
the next valid url link from the scanned page each time its
getNextToken() method is called.
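For example, here is a minimal sketch of driving the scanner (how the
scanner signals that no links remain is an assumption here; check its
main() method for the exact usage):

    HREFScanner scanner = new HREFScanner("www.brynmawr.edu");
    String link = scanner.getNextToken();   // assumed to return null when done
    while (link != null) {
        System.out.println("found link: " + link);
        link = scanner.getNextToken();
    }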
Getting Started
Here is a breakdown of the specific changes you will need to make
to your existing code (again, the exact changes you may need to
make will vary based on your specific implementation):
Change ProcessQueries
---------------------
(*) add a constructor that takes a start url and link_depth
limit (and optionally an html_ignore_list) as input, and that:
(1) recursively discovers the URLs up to the specified depth
(2) creates the URLContent list and cache as before
Here is a sketch (in Java) of how to recursively discover the URLs:

List<String> urls = new ArrayList<String>();    // class-level variable

void crawlWeb(String url, int depth) {
    if (depth == 0) return;                     // base case
    HREFScanner scanner = new HREFScanner(url); // parses the url's file
    // assumption: getNextToken() returns null when no links remain
    // (check HREFScanner's main() for its exact behavior)
    String link = scanner.getNextToken();
    while (link != null) {
        // skip links that are not valid html files (e.g., do not end in .html)
        if (link.endsWith(".html") && !urls.contains(link)) {
            urls.add(link);
            crawlWeb(link, depth - 1);
        }
        link = scanner.getNextToken();
    }
}
The data structure used for urls in the sketch above is a List. You
are encouraged to choose a different data structure for urls, picking
one that is more efficient. Which one should you use?
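One possibility (an assumption, not necessarily the intended answer):
a Set makes the duplicate check constant time on average, whereas a
List's contains is linear in the number of urls seen so far.

    // hypothetical alternative to the List; LinkedHashSet also preserves
    // the order in which urls were discovered
    Set<String> urls = new LinkedHashSet<String>();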
Part 2: Caching
Below is a demonstration of how your more efficient query processing
program should behave. You will save all previous query results in a
data structure called the cache. When the user types in a query, you
will first check the cache to see if you have processed a similar
request in the past.
#####################################
Enter next Query String or -1 to quit
#####################################
artificial
Results for your Query "artificial":
-----------------------------------------------
Query: artificial
Query Result NOT Found in cache   <-------- print out whether or not query is in cache
Matching URLs:
cs.brynmawr.edu/~eeaton priority: -8
#####################################
Enter next Query String or -1 to quit
#####################################
artificial intelligence
Results for your Query "artificial intelligence":
-----------------------------------------------
Query: artificial intelligence
Query Result NOT Found in cache
Matching URLs:
cs.brynmawr.edu/~eeaton priority: -17
#####################################
Enter next Query String or -1 to quit
#####################################
intelligence artificial
Results for your Query "intelligence artificial":
-----------------------------------------------
Query: artificial intelligence
Query Result Found in cache
Matching URLs:
cs.brynmawr.edu/~eeaton priority: -17
Notice that the first time the user asks about "artificial
intelligence" we cannot find anything in the cache. But the next
time the user asks about "intelligence artificial" we can grab the
stored information that we previously displayed for "artificial
intelligence" and return it immediately. Also note that we
want to be able to recognize that a query is the same regardless of
the word order, so "artificial intelligence" is the same as
"intelligence artificial".
To get the query result to print out nicely in the display, you need
to convert the query results to a string (either in html or ASCII).
One way to do this is to add a queryToString method (or a
queryToHTMLString method) to the ProcessQuery class that takes as
input the result returned by the performQuery method and converts it
to a string, using "\n" (or <br> in html) for new lines. You can then
call setText on the JEditorPane to display the string.
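As a sketch of what such a method might look like (the parameter
types here are assumptions; adapt them to whatever your performQuery
method actually returns):

    // Sketch only: assumes the query result can be walked as parallel
    // lists of urls and priorities; match this to your Project 2 code.
    String queryToHTMLString(String query, List<String> urls,
                             List<Integer> priorities) {
        StringBuilder sb = new StringBuilder();
        sb.append("Results for your Query \"").append(query).append("\":<br>");
        sb.append("Matching URLs:<br>");
        for (int i = 0; i < urls.size(); i++) {
            sb.append(urls.get(i)).append(" priority: ")
              .append(priorities.get(i)).append("<br>");
        }
        return sb.toString();
    }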
Getting Started
Your first step should be to implement either a ChainingHashTable
or an OpenAddressingHashTable (i.e., one that resolves collisions by
probing). Your hash table implementation should use the provided
dictionary interface in Dictionary.java. Think carefully about which
implementation of the HashTable you plan to use for your cache: does
it make a difference?
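To fix ideas, here is a rough skeleton of the chaining approach. The
insert/find method names are assumptions; match them to whatever the
provided Dictionary.java interface actually declares.

    import java.util.LinkedList;

    // Skeleton of a chaining hash table; adapt it to implement the
    // provided Dictionary interface.
    public class ChainingHashTable<K, V> {
        private LinkedList<Entry<K, V>>[] buckets;

        private static class Entry<K, V> {
            K key;
            V value;
            Entry(K k, V v) { key = k; value = v; }
        }

        @SuppressWarnings("unchecked")
        public ChainingHashTable(int numBuckets) {
            buckets = new LinkedList[numBuckets];
            for (int i = 0; i < numBuckets; i++) {
                buckets[i] = new LinkedList<Entry<K, V>>();
            }
        }

        private int bucketFor(K key) {
            // map hashCode to a bucket index, masking off the sign bit
            return (key.hashCode() & 0x7fffffff) % buckets.length;
        }

        public void insert(K key, V value) {
            buckets[bucketFor(key)].add(new Entry<K, V>(key, value));
        }

        public V find(K key) {
            for (Entry<K, V> e : buckets[bucketFor(key)]) {
                if (e.key.equals(key)) return e.value;
            }
            return null;   // not found
        }
    }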
Once your HashTable implementation is complete and working, use
it to implement the cache.
The cache is represented as a hashtable where the key is the
query string (with the words reordered to be in alphabetical
order). The element stored with the query should be the String
answer for that query.
hashtable cache
-------------------------------------------------------------------------
|key: | | | | | | | |
| String | | | | | | | |
|element: | | | | | | | |
| String | | | | | | | |
| | | | | | | | |
| | | | | | | | |
-------------------------------------------------------------------------
1. From the words in the query, create an alphabetically ordered
query string that is all one case -- the same case you used when
storing tokens into your WordFrequencyTrees
2. Search for that query string in the cache
   (A) if a match is found:
       immediately return and display the corresponding answer
   (B) if a match is not found:
       (a) create a result priority queue, just as in Project 2, to
           produce the answer
       (b) add the (query, answer) pair to the cache
       (c) return the answer to the caller
3. After each query, print out:
   (a) whether or not the query was found in the cache
   (b) the answer for that query
The best place to implement caching is in your ProcessQueries class.
However, you may also implement another class to perform the caching
and make calls to the ProcessQueries class methods.
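Putting the steps together, here is a hedged sketch of the lookup
logic (normalizeQuery is the hypothetical helper sketched above;
performQuery and the cache's insert/find methods are assumptions to
match to your own code):

    // Sketch of the cached query path inside ProcessQueries.
    // Assumes: cache is a hash table mapping String -> String.
    String cachedQuery(String rawQuery) {
        String key = normalizeQuery(rawQuery);   // word-order/case insensitive
        String answer = cache.find(key);
        if (answer != null) {
            System.out.println("Query Result Found in cache");
        } else {
            System.out.println("Query Result NOT Found in cache");
            answer = performQuery(key);   // priority-queue search, Project 2
            cache.insert(key, answer);    // remember it for next time
        }
        return answer;
    }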
Optional Part 3: GUI [5 pts Extra
Credit]
Build a GUI front-end to your search engine. It should have
the following 6 GUI components:
- a URL text box (where the user can type in a url)
- a fetch URL button
- a display area for fetched urls
- a query text box (where the user can type in a query)
- a search button
- a display area for search results
Your WebBrowser's main method will read in the starting url and
depth (for part 1) and ignore_list command line options, create a
ProcessQueries object (from Project 2), and then create the GUI
front-end.
The WebBrowser GUI will work like the following:
- When the fetch URL button is selected, the url entered in the
URL text box is evaluated, the url's webpage is fetched, and its
contents are displayed in the display box for fetched urls.
- When the search button is selected, the query entered in the
query box is evaluated and the search results box displays the
ordered list of matching webpages. In addition, the webpage of
the best match is fetched and displayed in the display box for
fetched urls.
Most of the GUI has already been implemented for you. Your
effort will be connecting what you've already written to this code.
Your main class must be called WebBrowser.java and take in three
command line arguments:
java cs206proj.part3.WebBrowser startingURL depth ignoreFile
See part 1 for a more detailed description of these command
line arguments.
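A hedged sketch of what main might look like (the constructor
signatures are assumptions based on your own Project 2 code):

    // Sketch only: the ProcessQueries and WebBrowser constructors shown
    // here are assumptions -- match them to your own classes.
    public static void main(String[] args) {
        String startingURL = args[0];
        int depth = Integer.parseInt(args[1]);
        String ignoreFile = (args.length > 2) ? args[2] : null;  // optional

        ProcessQueries pq = new ProcessQueries(startingURL, depth, ignoreFile);
        new WebBrowser(pq);  // build the GUI front-end around the query engine
    }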
Getting Started
Start by reading the Tutorial example about the JEditorPane class,
and try out the example (TextSamplerDemoHelp.java) to get an idea
of how to display .html files and search results (look at the
JEditorPane's setPage and setText methods), and how to associate
scroll bars with panes:
http://download.oracle.com/javase/tutorial/uiswing/components/editorpane.html
Getting your GUI to display url files and to display search result
strings is easy if you use the JEditorPane's setPage and setText
methods. Remember to import javax.swing.text.* and java.net.URL
along with the standard Swing imports.
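For instance, a minimal sketch of the two calls (the pane variable
names and the example url are made up):

    import javax.swing.JEditorPane;
    import java.net.URL;

    // display a fetched webpage in one pane...
    JEditorPane pagePane = new JEditorPane();
    pagePane.setEditable(false);
    try {
        pagePane.setPage(new URL("http://cs.brynmawr.edu/~eeaton/index.html"));
    } catch (java.io.IOException e) {
        e.printStackTrace();   // setPage fails if the url cannot be fetched
    }

    // ...and a search-result string (as html) in another
    JEditorPane resultsPane = new JEditorPane();
    resultsPane.setContentType("text/html");
    resultsPane.setText("Matching URLs:<br>cs.brynmawr.edu/~eeaton priority: -17");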
Note: the links on the displayed webpage don't work. You do not
have to add support to make links work. However, there is a way
to add this functionality, and you can try it out if you'd like.
The name of your GUI class should be WebBrowser.java.
This is also the main class that you will run to launch the
WebBrowser.
You might also find the Java Trail on Swing helpful:
http://download.oracle.com/javase/tutorial/ui/index.html
You are welcome to modify the provided sample GUI however you
like. If you do, you will probably want to use a GridBagLayout to
lay out the buttons and the text displays in the same panel. This
will likely give the best results when the window is re-sized (you
want the display parts of the window to re-size vertically, but
probably not the buttons or the search text and url text boxes).
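A hedged fragment showing the idea: give the display component all
the vertical weight so it absorbs resizes, and give the button row
none:

    import java.awt.GridBagConstraints;
    import java.awt.GridBagLayout;
    import javax.swing.*;

    JPanel panel = new JPanel(new GridBagLayout());
    GridBagConstraints c = new GridBagConstraints();
    c.fill = GridBagConstraints.BOTH;
    c.gridx = 0;

    c.gridy = 0;                // button row: fixed height on resize
    c.weightx = 1.0;
    c.weighty = 0.0;
    panel.add(new JButton("Search"), c);

    c.gridy = 1;                // display area: grows with the window
    c.weighty = 1.0;
    panel.add(new JScrollPane(new JEditorPane()), c);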
Extra Credit
Do not start on the extra credit parts until you have completed the
regular part of the assignment. You can receive up to 10 points
worth of extra credit on this assignment by implementing one or more
of the following extensions:
- [5 pts]: Add Back, Forward, and Home buttons to your Web
Browser. Add data structures to keep track of your browsing
history, and "Back", "Forward", and "Home" buttons that use these
data structures to fetch and display the correct webpage. You will
likely want to put a limit on the history depth (15 or 20 previous
pages seems pretty good) so that this data structure doesn't get
too large.
- [5 pts]: Add support for enabling links on displayed
webpages and listing search results as links. Clicking on a
link on a displayed webpage should result in the link's
webpage being fetched and displayed by your web browser. In
addition, your list of matching urls from your search engine
should be displayed as a list of links to webpages (use href).
Clicking on one of the links listed as a search result will
result in the webpage being displayed in your URL display
window.
Extra credit is graded on an all or nothing basis: either the
feature completely works or it doesn't. No extra credit will
be given for partially implemented extra credit features.
Also, I will not provide you with assistance on completing the
extra credit; you must figure out how to do it yourselves.
If you implement one or more extra credit features, be sure to
include a description of the feature and how to test it in the
README file.
CLASSES
Classes you'll need for this assignment include the classes you
developed for part 2 of the project, plus the following:
- Dictionary.java -- an
interface for the Dictionary ADT
- TryHash.java -- test
program for the hashtables
- TryHashMyString.java --
test program for the hashtables that includes an example of an
object with overridden hashCode() method
- HREFScanner.java: a scanner
class whose constructor takes a url, and parses the url's file
returning href links to possibly valid webpages as the next
token. The main() method demonstrates its use.
- WebBrowser.java:
a simple GUI web browser
You can download all code from here.
SUBMITTING THE ASSIGNMENT
Submit the following via dropbox:
- All .java files necessary for compiling your code (including
the necessary ones from parts 1&2 of the project).
- All .class files of your code located in the proper package
directories (including the necessary ones from parts 1&2 of
the project).
- An html_ignore file, containing tokens that should be ignored in
an html input file.
- A README file with your name and your partner's name (if you
had one).
If you developed in Eclipse, simply compress your entire Eclipse
project folder into a single .tar or .zip archive, and then copy
this archive into dropbox to submit all of these files.
Once you have submitted your code, I strongly encourage you to test
it, either from within dropbox using the command line, or by copying
it to a clean location and then attempting to run it.
If you work with a partner, please only one of you submit your joint
solution via dropbox.
ECLIPSE PROJECT ORGANIZATION -- Important!
You should work within the same Eclipse project that you did for
parts 1 and 2. That project should contain three packages:
cs206proj.part1, cs206proj.part2, and cs206proj.part3. All new code
should be placed within the cs206proj.part3 package. This will
enable you to build easily off parts 1 and 2 of the project.
When you submit your code, you must be certain to submit all of the
code from parts 1 and 2 that you used in this project as well.
This assignment is
based on a course project created by Tia Newhall and Lisa Meeden.