Corpora: Counting semantic propositions (was Relative text length)

From: Tadeusz Piotrowski (tadpiotr@plusnet.pl)
Date: Mon Apr 29 2002 - 22:11:23 MET DST

    I know some people love semantic propositions etc., but for me we are
    back again in the world of Platonic ideas. I like this discussion group
    because language is usually not regarded here as an ideal object. I must
    confess I find counting (calculating) ideal objects like semantic
    propositions a bit difficult. I find it difficult both as a researcher
    and as a practising translator, and I reach for my Quine to find peace
    of mind.
    Regards
    Tadeusz Piotrowski

    > -----Original Message-----
    > From: owner-corpora@lists.uib.no
    > [mailto:owner-corpora@lists.uib.no] On Behalf Of Alex Chengyu Fang
    > Sent: Monday, April 29, 2002 5:34 PM
    > To: Yorick Wilks
    > Cc: ramesh@ccl.bham.ac.uk; corpora@hd.uib.no
    > Subject: Re: Corpora: Relative text length
    >
    >
    > What I wanted to say is that there are different ways
    > of measuring the relative length and that, if counts
    > of characters, syllables and morphemes are used, you
    > are likely to see differences between language pairs.
    > If, however, the semantic proposition is used as the key,
    > languages may not be so different, as the number of
    > propositions should be a near constant across
    > multi-lingual texts that are mutual translations of
    > each other.
    >
    > So, my simplistic view is that to see the differences,
    > use characters, syllables and morphemes as
    > measurements. To see similarities (the other
    > direction), the number of semantic propositions can
    > serve the purpose.
    >
    > Regards,
    >
    > Alex
    >
    >
    > --- Yorick Wilks <yorick@dcs.shef.ac.uk> wrote:
    > > Sorry, I don't quite follow this--I thought the original question was
    > > just about length (whether text, characters, morphemes or words) and I
    > > didn't know when reading the question what the questioner's purpose
    > > was---I HOPE it wasn't language discrimination, because Ramesh's
    > > figures show pretty clearly that length (as words) doesn't separate
    > > Slavic languages like Czech from Estonian/Hungarian--though length as
    > > characters does a bit better, although there's no separation from the
    > > Slavic family as a whole at all!
    > > None of that seems terribly simplistic, just natural, given the
    > > question and answer (though which is unhelpful as it turns out).
    > >
    > > What I don't follow is the link to alignment that you make--alignment
    > > is clearly interesting, but what does it or can it say about the
    > > relative length of languages that the simpler counts do not?
    > > What is this 'other direction' you write of---is it that, if you
    > > align at the sentence level many-one, it says something about some
    > > property of the languages that can distinguish them?
    > > Or won't all that depend on the existence and shared significance of
    > > punctuation marks--which seems a bit implausible?
    > > Regards
    > > Yorick Wilks
    > >
    > >
    > >
    > > Alex Chengyu Fang wrote:
    > >
    > > > Which measure to use depends on the purpose of the
    > > > study, whether to bring out differences or
    > > > similarities of the languages concerned.
    > > >
    > > > A rather simplistic view is that counts of words, characters,
    > > > syllables, morphemes etc. tend to be used to discriminate between
    > > > languages. An attempt in the other direction is the use of the
    > > > number of propositions to, for instance, automatically align
    > > > multilingual texts:
    > > >
    > > > Campbell, J. and A.C. Fang. 1995. Automated Alignment in
    > > > Multilingual Corpora. In Proceedings of the 10th Pacific Asia
    > > > Conference on Language, Information and Computation (PACLIC10),
    > > > 27-28 December 1995, Hong Kong City University, Hong Kong.
    > > > pp 185-193.
    > > >
    > > > Regards,
    > > >
    > > > Alex Fang
    > > >
    > > > --- ramesh@ccl.bham.ac.uk wrote:
    > > > > Dear Yorick
    > > > >
    > > > > Would morpheme counts not be even more accurate (or
    > > > > linguistically valid) than counting orthographic characters?
    > > > > Unfortunately, I don't think anyone has done these yet...
    > > > >
    > > > > Anyway, I agree that for the moment, character counts
    > > > > are a useful addition to word counts.
    > > > >
    > > > > Problems about translation (compensation, explication,
    > > > > zero translation, etc) obviously apply throughout.
    > > > >
    > > > > Here are some figures from my own research:
    > > > >
    > > > > 1. FIFA Laws in English, German, Spanish, and French.
    > > > > French is longest, then Spanish, German, and English.
    > > > >
    > > > > lines   words   characters   text
    > > > >
    > > > >   726   10216        56874   Laws97GB.txt
    > > > >   724    9173        63402   Laws97DE.txt
    > > > >  1342   11030        63765   Laws97SP.txt
    > > > >  1169   11763        67537   Laws97FR.txt
    > > > >
    > > > > 2. Canadian Hansard in English and French.
    > > > > French is longer in both samples.
    > > > >
    > > > > lines   words    chars   text
    > > > >
    > > > >  1569   20336   104015   c1.001.E.A
    > > > >  1569   22413   124457   c1.002.F.A
    > > > >
    > > > >  1120   12260    62421   c2.002.E.A
    > > > >  1120   12135    62622   c2.003.F.A
    > > > >
    > > > > 3. George Orwell's 1984 (thanks to Multext-East and TELRI)
    > > > > in several languages. These figures were provided by
    > > > > Dr Tomaz Erjavec (Ljubljana) with various additional caveats:
    > > > >
    > > > >               line     word     char
    > > > >
    > > > > English      16053   102787   584803
    > > > > Bulgarian    11172    85878   536977
    > > > > Czech        11087    79022   498216
    > > > > Estonian     17872    78792   545984
    > > > > Hungarian     8813    79814   575219
    > > > > Romanian     16684   103704   603868
    > > > > Slovene      14938    91336   541461
    > > > >
    > > > > 4. Le Monde Diplomatique in English and French:
    > > > >
    > > > > lines   words   characters   text
    > > > >
    > > > >   116     956         6410   LEMAE1.txt
    > > > >   133     941         7457   LEMAF1.txt
    > > > >
    > > > > 5. From research with Dr Maria Cristina Borba (Rio Grande, Brazil).
    > > > > Alice in Wonderland in English, 2 Brazilian-Portuguese translations
    > > > > (one for adults, one for children), and a Catalan translation
    > > > > (MARIST).
    > > > >
    > > > >                              CARROLL     LEITE  SEVCENKO    MARIST
    > > > >
    > > > > File length (bytes)          204,288   148,889   150,235   143,055
    > > > > Running words (tokens)        31,731    25,348    26,245    25,566
    > > > > Different words (types)        3,417     3,896     3,614     4,400
    > > > > type/token ratio (mean)       44.99%    51.61%    51.25%    51.19%
    > > > > ave. word length (letters)      3.63      4.36      4.31      4.16
    > > > >
    > > > > Best
    > > > > Ramesh
    > > > >
    > > > > Ramesh Krishnamurthy
    > > > > Honorary Research Fellow, University of
    > > Birmingham;
    > > > > Honorary Research Fellow, University of
    > > > > Wolverhampton;
    > > > > Consultant, Cobuild and Bank of English Corpus,
    > > > > Collins Dictionaries.
    > > > >
    > > > >
    > > > > On Thu, Apr 25, 2002 at 04:56:15PM +0100, Yorick
    > > > > Wilks wrote:
    > > > > >
    > > > > > Isn't there some (minor) confusion here? If the question really
    > > > > > is relative TEXT length, then nothing to do with word counts
    > > > > > will settle it--what matters is character counts, since word
    > > > > > length varies considerably between languages. The table
    > >
    > === message truncated ===
    >
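
The surface counts compared in the quoted thread (lines, words, characters, types, tokens, average word length) can be sketched roughly as below. This is only an illustration under stated assumptions, not the procedure any of the posters used: the function names text_measures and relative_length are hypothetical, a "word" is crudely taken to be any run of word characters, and the type/token ratio is the plain types/tokens quotient. That plain quotient will not reproduce the "type/token ratio (mean)" row quoted above (for CARROLL, 3,417 / 31,731 is about 10.8%, not 44.99%), so the quoted figure is evidently averaged in some way the thread does not describe.

```python
import re


def text_measures(text):
    """Return simple surface length measures for one text.

    Assumption: a word is any maximal run of word characters (letters,
    digits, underscore); real tokenisation is language-specific.
    """
    tokens = re.findall(r"\w+", text.lower())
    types = set(tokens)
    n_tokens = len(tokens)
    return {
        "lines": len(text.splitlines()),
        "words": n_tokens,
        "characters": len(text),  # counts code points, newlines included
        "types": len(types),
        # Plain types/tokens quotient, not a chunk-averaged ratio.
        "type_token_ratio": len(types) / n_tokens if n_tokens else 0.0,
        "avg_word_length": (
            sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0
        ),
    }


def relative_length(count_a, count_b):
    """Length of text B relative to text A for one chosen unit of measure."""
    return count_b / count_a


if __name__ == "__main__":
    # Figures for the first Canadian Hansard sample, taken from the thread.
    words_en, words_fr = 20336, 22413
    chars_en, chars_fr = 104015, 124457
    print("French/English by words:     ", relative_length(words_en, words_fr))
    print("French/English by characters:", relative_length(chars_en, chars_fr))
```

On the first Hansard sample the French text comes out at about 1.10 times the English by words and about 1.20 times by characters, which illustrates the point made in the thread that the choice of unit changes the apparent relative length.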


