Date:
Department of Computer Science
In partial fulfillment of the Requirements for the Degree of
Doctor of Philosophy
will defend his Dissertation
IMPROVING WEB RETRIEVAL BY MINING THE HTML TAGS FOR KEYWORDS
AND EXPLORING THE HYPERLINK STRUCTURES OF WEB PAGES
The increasing amount of data stored in the World Wide Web demands
efficient techniques for information retrieval. When consulting a
regular search engine, it is very common to receive millions of
documents as an answer. We explore different aspects of the web to
improve retrieval quality. For instance, we developed TAKER,
a system that finds Tag-Keyword Relationships in HTML documents from a
given set of “exemplary” documents representative of a particular
topic or keyword. For each document, we extract a vector representing the
relationship between the keyword and the HTML tags. We propose that
the “centroid” of those vectors characterizes the particular community
being studied. Moreover, we use our own web query language to examine
the “prestige” of documents. Under this concept, the prestige of a
web page related to a particular topic depends on the number of
documents that reference it and are related to the same topic. Furthermore,
hyperlinks inside HTML pages contain a wealth of information about the
relationships among web pages. Given a set of web pages, we can
explore the hyperlink relationships among these pages. We provide
formal definitions of hyperlink relations. We then use this
notation to define similarity between two web pages and between two
sets of web pages. For each case, we provide several
definitions of similarity using forward and backward links. The
similarity measure gives us a number between 0 and 1. We also
demonstrate how to use the similarity measure to study clustering
within a set of pages and to determine the “diversity” of a set of web
pages. Moreover, we mined association rules from the structure of HTML
documents to find their keywords. We designed various experiments to test
these conjectures, and the results supported the efficiency of our approaches.
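
As an illustration of the link-based similarity described in the abstract, the following minimal sketch scores two pages by the overlap of their forward links and of their backward links. It assumes a Jaccard-style overlap, which is only one plausible way to obtain a value between 0 and 1; the dissertation's own definitions are not given here and may differ. The function names and the toy page graph are hypothetical.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two link sets, in [0, 1].

    Assumption: two empty sets score 0.0 (a convention chosen for this sketch).
    """
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def backward_links(forward: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert a forward-link map: for each page, the set of pages linking to it."""
    back: Dict[str, Set[str]] = {page: set() for page in forward}
    for src, targets in forward.items():
        for dst in targets:
            back.setdefault(dst, set()).add(src)
    return back


# Hypothetical toy graph: page -> pages it links to (forward links).
forward = {
    "a.html": {"x.html", "y.html"},
    "b.html": {"y.html", "z.html"},
    "c.html": {"a.html", "b.html"},
}
back = backward_links(forward)

# a.html and b.html share one of three distinct forward targets.
forward_sim = jaccard(forward["a.html"], forward["b.html"])
# Both are referenced only by c.html, so their backward-link sets coincide.
backward_sim = jaccard(back["a.html"], back["b.html"])
```

In this toy graph the two pages look only loosely related by what they link to, but identical by who links to them, which is why defining similarity separately for forward and backward links can be informative.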
Time: 4:00 PM
Place: 550-PGH
Faculty, students, and the general public are invited.
Thesis Advisor: Dr. S. H. Stephen Huang