A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification
Inderjit S. Dhillon, Subramanyam Mallela, Rahul Kumar;
3(Mar):1265-1287, 2003.
Abstract
The high dimensionality of text data can be a deterrent to applying complex
learners such as Support Vector Machines to the task of text
classification. Feature clustering is a powerful alternative to feature selection
for reducing the dimensionality of text data. In this paper we propose a new
information-theoretic divisive algorithm for feature/word clustering and
apply it to text classification.
Existing techniques for such "distributional clustering" of words are
agglomerative in nature and result
in (i) sub-optimal word clusters and (ii) high computational cost.
To explicitly capture the optimality of word clusters in
an information-theoretic framework, we first derive a global
criterion for feature clustering. We then present a fast, divisive
algorithm that monotonically decreases this objective function value.
We show that our algorithm
minimizes the "within-cluster Jensen-Shannon divergence" while
simultaneously maximizing the "between-cluster Jensen-Shannon divergence".
In comparison to previously proposed agglomerative strategies, our
divisive algorithm is much faster and achieves comparable or higher classification
accuracies. We further show that feature clustering is an
effective technique for building smaller class models in hierarchical
classification. We present detailed experimental results
using Naive Bayes and Support Vector Machines
on the 20Newsgroups data set and a 3-level hierarchy of HTML
documents collected from the Open Directory project (www.dmoz.org).
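
To make the divisive step concrete, the following is a minimal Python sketch of the kind of k-means-style divisive procedure the abstract describes, assuming the class-conditional word distributions P(C|w) and word priors pi(w) have already been estimated from a training corpus; all function and variable names here are illustrative, not the authors' code.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) in bits; eps guards against division by zero.
    q = np.asarray(q, dtype=float) + eps
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def divisive_word_clustering(word_dists, word_priors, k, n_iters=50, seed=0):
    # word_dists  : (n_words, n_classes) array; row i is P(C | w_i).
    # word_priors : (n_words,) array of word marginals pi(w_i).
    # Returns an integer cluster label for each word.
    rng = np.random.default_rng(seed)
    n_words, n_classes = word_dists.shape
    labels = rng.integers(k, size=n_words)  # random initial partition
    for _ in range(n_iters):
        # Each cluster's centroid is the prior-weighted mean of its
        # member distributions, itself a distribution over classes.
        centroids = np.empty((k, n_classes))
        for j in range(k):
            members = labels == j
            if members.any():
                w = word_priors[members] / word_priors[members].sum()
                centroids[j] = w @ word_dists[members]
            else:
                # Re-seed an empty cluster from a random word.
                centroids[j] = word_dists[rng.integers(n_words)]
        # Reassign each word to the cluster whose centroid is closest in
        # KL divergence; this step cannot increase the prior-weighted
        # within-cluster divergence objective.
        new_labels = np.array([
            np.argmin([kl_divergence(p, c) for c in centroids])
            for p in word_dists
        ])
        if np.array_equal(new_labels, labels):
            break  # converged to a local minimum of the objective
        labels = new_labels
    return labels

Each iteration alternates a centroid update with a reassignment pass, so the prior-weighted within-cluster divergence decreases monotonically, mirroring the guarantee claimed for the algorithm above.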