Date of Award
Doctor of Philosophy in Information Systems - (Ph.D.)
Yi-Fang Brook Wu
A key to successfully satisfying an information need lies in how users express it as keyword queries. However, many users find it difficult to express their information needs in keywords, especially when the need is complex. Search By Multiple Examples (SBME), a promising method for overcoming this problem, allows users to specify their information needs as a set of relevant documents rather than as a set of keywords.
Most studies on SBME adopt Positive Unlabeled learning (PU learning) techniques, treating the examples provided by the user (denoted as query examples) as the positive set and the entire data collection in the database as the unlabeled set. The user's information need is then represented as a query vector, which is built from the query examples alone or further augmented with unlabeled data treated as negative examples, and the documents are ranked according to their degree of similarity to the query vector. To build the query vector, the query examples are treated as being relevant to a single topic, but they often belong to multiple topics. New methods are needed to deal with this topic diversity issue.
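The query-vector approach described above can be illustrated with a minimal sketch. It assumes a Rocchio-style centroid over plain term-frequency vectors and cosine similarity; the function names and the `alpha` and `beta` weights are hypothetical choices for illustration and are not taken from the dissertation.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average a non-empty list of sparse term vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_query_vector(query_examples, unlabeled=None, alpha=1.0, beta=0.25):
    """Positive centroid, optionally minus a scaled centroid of the
    unlabeled documents, which are treated as negative examples.
    alpha and beta are illustrative weights (an assumption)."""
    query = {t: alpha * w for t, w in centroid(query_examples).items()}
    if unlabeled:
        for t, w in centroid(unlabeled).items():
            query[t] = query.get(t, 0.0) - beta * w
    return query


def rank(collection, query):
    """Return document ids ordered by decreasing similarity to the query."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in collection.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

Because a single centroid averages all query examples together, examples drawn from several topics blur into one vector, which is the topic diversity issue noted above.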
Furthermore, many PU learning algorithms are available, but it is still unknown which of them perform most effectively for SBME, because the experiments in previous studies did not take into account the realistic search situation, in which the number of query examples varies and is much smaller than the size of the unlabeled data. When the query examples are far fewer than the unlabeled documents, system effectiveness may degrade dramatically because of the class imbalance problem. Thus, it is important to identify the most effective PU learning algorithms for SBME and to explore how to improve system effectiveness further.
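As a rough illustration of the class imbalance problem, one common remedy is to randomly undersample the unlabeled set so that it is only a fixed multiple of the query examples. The sketch below assumes that remedy purely for illustration; the abstract does not say which technique the dissertation uses, and the function name, `ratio`, and `seed` are hypothetical.

```python
import random


def undersample_unlabeled(query_examples, unlabeled, ratio=3, seed=0):
    """Return a random subset of `unlabeled` holding at most
    `ratio` times as many documents as `query_examples`.
    Undersampling is an assumed technique, not the author's stated method."""
    target = min(len(unlabeled), ratio * len(query_examples))
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.sample(unlabeled, target)
```

With two query examples and 100 unlabeled documents, the default `ratio=3` keeps six unlabeled documents, so a classifier trained on the result sees a 1:3 class ratio instead of 1:50.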
In previous studies on SBME, a document is usually treated as a vector whose features are the terms in the collection. Such a term-vector document representation leads to high dimensionality when the collection is large and, even worse, some noisy features seriously degrade the performance of the learning algorithms. Feature selection is necessary to solve the high dimensionality problem.
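To make the dimensionality point concrete, the sketch below reduces a term vocabulary with a simple document-frequency threshold. This is a deliberately simpler stand-in, not the topic-model-based reduction that the dissertation proposes; the function names and the `min_df` and `max_features` parameters are hypothetical.

```python
from collections import Counter


def select_features_by_df(documents, min_df=2, max_features=1000):
    """Keep terms that occur in at least `min_df` documents,
    most frequent first, capped at `max_features` terms."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))  # count each term once per document
    kept = [(count, term) for term, count in df.items() if count >= min_df]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken alphabetically
    return [term for _, term in kept[:max_features]]


def project(doc, vocabulary):
    """Represent a document as term counts over the reduced vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]
```

Rare terms that appear in a single document are dropped, which shrinks the vector length and removes some noisy features at the cost of discarding potentially useful rare terms.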
This research proposes a framework named Information Filtering by Multiple Examples (IFME) to explore how to improve SBME by: (1) solving the topic diversity issue by adopting probabilistic topic models to predict the user's information need from the query examples; (2) tackling the class imbalance problem by adopting machine learning techniques; (3) identifying the most effective PU learning algorithms for SBME; (4) adopting ensemble learning techniques to further improve the effectiveness of the PU learning algorithms for SBME; and (5) adopting topic models for feature dimension reduction. The experimental results show that the proposed framework addressed the research questions successfully.
Zhu, Mingzhu, "Information filtering by multiple examples" (2015). Dissertations. 128.