There is an interesting train of statistical thought which leads to
exactly this conclusion about the utility of considering some
multi-word units to be on par with single words. If we take for the
moment as given that economical descriptions are better than prolix
descriptions, then we can say that we prefer a theory of language
which allows use to describe a corpus and our theory together with
minimal size. To be useful, we must include the size of the corpus
(after maximal explanation by our theory) since we would otherwise opt
for the null theory. Conversely, we must include the size of our
theory since otherwise we might as well just have a theory in which we
could represent our observations (say, for example, the British
National Corpus) by the string "BNC". In this case, our theory is
huge, but the representation of our observation set is minute.
Neither extreme is useful, so we need to use a combined size of theory
and encoded observation.
In statistical circles, approaches which use this principle of
minimality of description are collectively referred to as Minimum
Description Length techniques. Such techniques have been in a small
vogue in computational linguistics recently.
Suppose that we wish to examine the question of what exactly the
minimal unit of text might most appropriately be. Computationally,
one way that we might proceed with this analysis is to use primitive
tokens which are the characters in our text. Then our lexicon would
consist of rules which define unique expansions for non-primitive
tokens in terms of other tokens, either primitive or not. Associated
with each rule is an estimated probability of occurrence in the
encoded corpus the use of which will be described later. In order to
guarantee that all elements of our lexicon can be expanded, it is
useful to impose a stratification condition on our lexicon which
states that the dependency graph of terms defined in our lexicon is a
direct acyclic graph.
Given all these preliminaries, we can now encode our corpus and our
lexicon using standard compression techniques such as an arithmetic
encoder. The specific algorithm for encoding the corpus for
compression would be the Viterbi algorithm which allows us to find a
minimum size encoding if we know the compressed size of each token. The
encoded form of the corpus together with the lexicon allows us to
re-estimate the compressed size of each token which we can then use to
get a new encoded form for the corpus. This process of encoding and
re-estimation is a special case of the Estimate Maximization (EM)
algorithm. The result of this algorithm is ultimately a size in bits
for the encoded corpus plus lexicon. We can then attempt to minimize
this total size.
Carl de Marcken worked extensively on this problem in the work he
reported in his dissertation. His particular interests were the
application of this method to the segmentation of speech signals and
the derivation of higher level language models. He mentioned in
passing the use of this technique for segmenting Asian languages. I
have been able to show that this technique is, in fact, an approximate
generalization of the most commonly used algorithms for Chinese
segmentation.
David Magerman proposed some time ago that mutual information between
adjacent words could be used to chunk texts into primitive syntactic
constituents. The same argument that shows Sproat's algorithm for
Chinese segmentation by mutual information shows Magerman's
segmentation to also be a special of the MDL method.
Of particular interest to me is the fact that the log-likelihood ratio
test for bigrams that I proposed in my CL paper is also an
approximation of the improvement in total encoded size of corpus and
lexicon which might be expected by adding that bigram to the lexicon.
This provides an interesting connection between MDL methods and the
log-likelihood ratio test. Given the economy with which the
log-likelihood ratio test can be computed, this connection may be of
substantial practical import.
I conjecture that MDL methods such as these would provide a very
strong basis for the determination of which multi-word units actually
do behave as single words.