University of Houston
Department of Computer Science

In partial fulfillment of the Requirements for the Degree of
Doctor of Philosophy

Jesús Ubaldo Queved-Torrero
will defend his Dissertation

IMPROVING WEB RETRIEVAL BY MINING THE HTML TAGS FOR KEYWORDS
AND EXPLORING THE HYPERLINK STRUCTURES OF WEB PAGES



Abstract

The increasing amount of data stored on the World Wide Web demands efficient techniques for information retrieval. A query to a regular search engine commonly returns millions of documents. We explore different aspects of the web to improve the quality of its retrieval. For instance, we developed TAKER, a system that finds Tag-Keyword Relationships in HTML documents from a given set of “exemplary” documents representative of a particular topic or keyword. For each document, we extract a vector that represents the relationship between the keyword and the HTML tags. We propose that the “centroid” of those vectors characterizes the particular community being studied. Moreover, we use our own web query language to examine the “prestige” of documents. This concept states that the prestige of a web page related to a particular topic depends on the number of documents that reference it and are related to the same topic. Furthermore, hyperlinks inside HTML pages contain a wealth of information about the relationships among web pages. Given a set of web pages, we can explore the hyperlink relationships among these pages. We provide formal definitions of hyperlink relations. We then use this notation to define similarity between two web pages and between two sets of web pages. For each case, we provide several definitions of similarity using forward and backward links. Each similarity measure yields a number between 0 and 1. We also demonstrate how to use the similarity measure to study clustering within a set of pages and to determine the “diversity” of a set of web pages. Finally, we mined Association Rules from the structure of HTML documents to find their keywords. We designed various experiments to test these conjectures, and the results supported the efficiency of our approaches.
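
As an illustration only, a link-based similarity between two pages that falls between 0 and 1 could be sketched as follows. The abstract does not give the dissertation's formal definitions, so everything here is an assumption: the Jaccard coefficient is swapped in as a stand-in measure, and the names jaccard, backward_links, page_similarity, the weight parameter, and the sample link data are all hypothetical.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient of two sets; taken to be 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def backward_links(forward: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert a forward-link map: for each page, collect the pages linking to it."""
    back: Dict[str, Set[str]] = {page: set() for page in forward}
    for source, targets in forward.items():
        for target in targets:
            back.setdefault(target, set()).add(source)
    return back


def page_similarity(p: str, q: str, forward: Dict[str, Set[str]],
                    weight: float = 0.5) -> float:
    """Weighted mix of forward-link and backward-link Jaccard similarity.

    Assumption: an equal-weight blend by default; the source does not say
    how (or whether) the two link directions are combined.
    """
    back = backward_links(forward)
    fwd = jaccard(forward.get(p, set()), forward.get(q, set()))
    bwd = jaccard(back.get(p, set()), back.get(q, set()))
    return weight * fwd + (1 - weight) * bwd


# Hypothetical forward links: page -> set of pages it points to.
links = {
    "a.html": {"c.html", "d.html"},
    "b.html": {"c.html", "e.html"},
    "c.html": {"a.html"},
}

score = page_similarity("a.html", "b.html", links)
```

Because both components are Jaccard coefficients and the weight lies in [0, 1], the result always stays within [0, 1], which matches the range stated in the abstract. Pairwise scores of this kind could then feed a clustering step or a diversity estimate over a set of pages.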

 

 

Date: Thursday, August 5, 2004
Time: 4:00 PM
Place: 550-PGH



Faculty, students, and the general public are invited.
Thesis Advisor: Dr. S. H. Stephen Huang