Abstract

In text mining and natural language processing (NLP) applications, a vector representation
of text data is key to designing an effective machine learning algorithm.
Document-term and word-context matrices are two important matrices to represent
texts as vectors. These two matrices are usually sparse and high-dimensional. The
process of creating low-dimensional representations of texts
is referred to as dimensionality reduction. Because dimensionality reduction determines how
text data are represented, it is of central importance. In the machine learning literature,
little to no attention has been paid to a popular statistical technique, correspondence
analysis (CA), while other dimensionality reduction methods, such as latent semantic
analysis (LSA), receive far more attention. This dissertation studies whether CA is a good
dimensionality reduction technique for text mining and NLP.
Chapter 2 theoretically compares CA and LSA of a document-term matrix. In addition,
the performance of CA is compared to the performance of different versions
of LSA in the context of text categorization and authorship attribution. The main criterion
for these comparisons is a measure of accuracy. From a theoretical point of view, CA has
more attractive properties than LSA: in LSA, the matrix that is analyzed contains both
the effect of the margins and the dependence between documents and terms, whereas CA
eliminates the effect of the margins, so that its solution displays only the dependence.
The results for four empirical
datasets show that CA can obtain higher accuracies on text categorization and
authorship attribution than the different versions of LSA.
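The contrast between the two decompositions can be sketched as follows. This is a minimal illustration, not the dissertation's procedure: the toy counts and the number of retained dimensions are invented.

```python
import numpy as np

# Toy document-term count matrix (3 documents x 4 terms); hypothetical data
# used only to illustrate the two computations.
N = np.array([[4., 1., 0., 2.],
              [1., 5., 2., 0.],
              [0., 2., 6., 1.]])

# LSA: truncated SVD of the matrix itself, so the margins (document lengths,
# term frequencies) remain part of what is analyzed.
U, s, Vt = np.linalg.svd(N, full_matrices=False)
lsa_docs = U[:, :2] * s[:2]               # 2-dimensional document coordinates

# CA: SVD of the standardized residuals, which removes the margins and leaves
# only the dependence between documents and terms.
P = N / N.sum()                           # correspondence matrix
r = P.sum(axis=1)                         # row (document) masses
c = P.sum(axis=0)                         # column (term) masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U2, s2, Vt2 = np.linalg.svd(S, full_matrices=False)
ca_docs = (U2[:, :2] * s2[:2]) / np.sqrt(r)[:, None]  # principal coordinates
```

With the margins divided out, the CA coordinates reflect only the document-term dependence, which is the property contrasted with LSA above.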
Chapter 3 also studies the performance of CA and LSA in the context of document-term
matrices. CA and LSA are empirically compared in information retrieval by calculating
the mean average precision. An attempt is made to improve CA by applying two kinds
of weighting that are also used in LSA: weighting schemes for the elements of the
document-term matrix, and adjustment of the singular value
weighting exponent. The results for four empirical datasets show that CA always performs
better than LSA. Weighting the elements of the raw data matrix can improve
CA, but the effect is data-dependent and the improvement is small. Adjusting the singular
value weighting exponent often improves the performance of CA; however, the
extent of the improvement depends on the dataset and the number of dimensions.
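The two kinds of weighting can be sketched as follows. The log-entropy element weighting shown here is one common LSA scheme and is an assumption for illustration (the chapter's exact schemes may differ), as are the random counts and the choices of `k` and `alpha`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical document-term count matrix (20 documents x 30 terms).
N = rng.poisson(1.0, size=(20, 30)).astype(float)

# Element weighting: log-entropy, one scheme commonly applied before LSA.
p = N / np.maximum(N.sum(axis=0), 1e-12)          # term distribution over docs
plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
g = 1.0 + plogp.sum(axis=0) / np.log(N.shape[0])  # global entropy weight per term
W = np.log1p(N) * g                               # local log weight x global weight

# Singular value weighting exponent: coordinates U_k * s_k**alpha, where
# alpha = 1 gives the standard coordinates and other values re-weight the axes.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k, alpha = 5, 0.5
docs = U[:, :k] * s[:k] ** alpha
```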
Chapter 4 compares CA with PPMI-SVD, GloVe, and SGNS. Theoretically, CA, like
PPMI-SVD, GloVe, and SGNS, can be linked to a factorization of the pointwise mutual
information (PMI) matrix. An attempt is made to improve CA by applying weighting schemes to
the elements of the word-context matrix. An empirical comparison on word similarity
tasks shows that the overall results for CA with the two weighting schemes are slightly better than those of PPMI-SVD, GloVe, and SGNS.
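The PPMI-SVD baseline referred to above can be sketched as follows; the co-occurrence counts and the retained dimensionality are invented for illustration, and the symmetric square-root weighting is one common convention, not necessarily the one used in the chapter.

```python
import numpy as np

# Toy word-context co-occurrence counts (4 words x 4 contexts); invented data.
C = np.array([[10., 2., 1., 0.],
              [ 3., 8., 2., 1.],
              [ 1., 2., 9., 4.],
              [ 0., 1., 3., 7.]])

P = C / C.sum()
pw = P.sum(axis=1, keepdims=True)        # word marginals
pc = P.sum(axis=0, keepdims=True)        # context marginals

# PMI = log P(w, c) / (P(w) P(c)); zero counts give -inf, clipped away by PPMI.
with np.errstate(divide="ignore"):
    pmi = np.log(P / (pw * pc))
ppmi = np.maximum(pmi, 0.0)

# Word vectors from a truncated SVD of the PPMI matrix.
U, s, Vt = np.linalg.svd(ppmi, full_matrices=False)
word_vecs = U[:, :2] * np.sqrt(s[:2])
```

CA instead takes the SVD of the standardized residuals of the same co-occurrence matrix, which is what allows the theoretical link to this PMI factorization.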
CA is susceptible to outliers. In Chapter 5, the so-called reconstitution algorithm is
introduced to cope with outlying cells. This algorithm can reduce the contribution of
the outlying cells in CA. The reconstitution algorithm is compared with two alternative
methods for handling outliers, the supplementary points method and MacroPCA.
It is shown that the proposed strategy works well.
Summarizing, we have shown that CA matches or outperforms techniques that are now
commonly used in computing science. We think that the performance of CA in the studies
of this dissertation shows that CA deserves more attention in this field.