Corpora: [CORPORA] Re: closed class list

From: MT (mtnaves@fil.ub.es)
Date: Tue Jun 11 2002 - 08:50:38 MET DST


    Lists of English CLOSED CLASS words available on the web, selected by
    Teresa Naves (naves@fil.ub.es)

    ***** Geoffrey Leech, Paul Rayson, Andrew Wilson (2001) Word Frequencies
    in Written and Spoken English: based on the British National Corpus.
    Longman, London. ISBN 0582-32007-0, available at
    http://www.comp.lancs.ac.uk/ucrel/bncfreq/
    Chapter 5, "Rank Frequency Lists of Words within Word Classes (Parts of
    Speech) in the whole corpus", contains the complete list of closed class
    words under the following subchapters:
    List 5.5: Frequency list of pronouns (not lemmatized):
    http://www.comp.lancs.ac.uk/ucrel/bncfreq/lists/5_5_all_rank_pron.txt
    List 5.6: Frequency list of determiners:
    http://www.comp.lancs.ac.uk/ucrel/bncfreq/lists/5_6_all_rank_determ.txt
    List 5.7: Frequency list of determiner/pronouns:
    http://www.comp.lancs.ac.uk/ucrel/bncfreq/lists/5_7_all_rank_detpro.txt
    List 5.8: Frequency list of prepositions:
    http://www.comp.lancs.ac.uk/ucrel/bncfreq/lists/5_8_all_rank_preposition.txt
    List 5.9: Frequency list of conjunctions:
    http://www.comp.lancs.ac.uk/ucrel/bncfreq/lists/5_9_all_rank_conjunction.txt
    List 5.10: Frequency list of interjections and discourse particles:
    http://www.comp.lancs.ac.uk/ucrel/bncfreq/lists/5_10_all_rank_interjection.txt

    1. Noemi Preissner sent the following message about stop lists to the
    CORPORA discussion list on February 199: "Patrice Bonhomme has a
    collection of stop-lists for English, French and German available at
    http://www.loria.fr/~bonhomme/sw/ Apart from stop words you will find
    lists of word frequencies there. Those lists have been created on the
    base of the corpora at the Silfide server
    (http://www.loria.fr/projets/Silfide/)"
    (Below you'll find the list of stop words for English by Patrice
    Bonhomme, available at http://www.loria.fr/~bonhomme/sw/stopword.en)
    2. Modified Penn Treebank Tag-Set (closed class categories) from
    INFOGISTICS http://www.infogistics.com/tagset.html
    (I don't understand why subordinating conjunctions are not listed under
    either closed-class or open-class words) (find the whole list below)
    3. English Conjunctions by Linda Bryson
    http://www.gsu.edu/~wwwesl/egw/bryson.htm
    (She regards both subordinating and co-ordinating conjunctions as closed
    class)

    Further information on the web on closed class / Readings on closed
    class words
    4. "The English base lists are (i) a list of 288 closed class words
    drawn from the Alvey Grammar 3rd release (lexicon file d.le), (ii) a
    list of 9532 general open class words derived from the British National
    Corpus ([BNC]) via word/part of speech frequency lists compiled by Adam
    Kilgarriff of the University of Brighton, kindly made available by
    anonymous ftp [AK], and (iii) a list of 32,250 technical words drawn
    from the European Corpus Initiative CDROM (ECI)" from TEMAA at
    http://cst.dk/temaa/D12/d12exp-2.html#Heading8
    5. Elly van Gelderen, "Function Words", Encyclopedia of Linguistics
    Sample Entry
    http://www.fitzroydearborn.com/chicago/linguistics/sample-function-words.php3

    1. Stop words for English by Patrice Bonhomme (available at
    http://www.loria.fr/~bonhomme/sw/stopword.en)
    a
    about
    above
    abst
    accordance
    according
    across
    act
    actually
    added
    adj
    adopted
    after
    afterwards
    again
    against
    all
    almost
    alone
    along
    already
    also
    although
    always
    am
    among
    amongst
    an
    and
    announce
    another
    any
    anyhow
    anyone
    anything
    anywhere
    are
    aren
    aren't
    arent
    around
    as
    at
    auth
    available
    b
    be
    became
    because
    become
    becomes
    becoming
    been
    before
    beforehand
    begin
    beginning
    behind
    being
    below
    beside
    besides
    between
    beyond
    billion
    both
    but
    by
    c
    ca
    can
    can't
    cannot
    cant
    caption
    co
    co.
    contains
    could
    couldn't
    couldnt
    d
    date
    did
    didn't
    didnt
    do
    does
    doesn't
    doesnt
    don't
    dont
    down
    during
    e
    each
    ed
    eg
    eight
    eighty
    either
    else
    elsewhere
    end
    ending
    enough
    etc
    even
    ever
    every
    everyone
    everything
    everywhere
    except
    f
    far
    few
    fifty
    first
    five
    fix
    for
    former
    formerly
    forty
    found
    four
    from
    further
    g
    get
    go
    got
    h
    had
    has
    hasn't
    hasnt
    have
    haven't
    havent
    he
    he'd
    he'll
    he's
    hed
    hell
    hence
    her
    here
    here's
    hereafter
    hereby
    herein
    heres
    hereupon
    hers
    herself
    hes
    hid
    him
    himself
    his
    home
    hop
    how
    however
    hundred
    i
    i'd
    i'll
    i'm
    i've
    id
    ie
    if
    ill
    im
    in
    inc
    inc.
    include
    includes
    indeed
    index
    information
    instead
    internet
    into
    is
    isn't
    isnt
    it
    it's
    its
    itself
    ive
    j
    just
    k
    keys
    l
    last
    later
    latter
    latterly
    least
    less
    let
    let's
    lets
    like
    likely
    line
    links
    ll
    ltd
    m
    made
    make
    makes
    many
    may
    maybe
    me
    meantime
    meanwhile
    might
    million
    miss
    more
    moreover
    most
    mostly
    mr
    mrs
    much
    must
    my
    myself
    n
    na
    namely
    near
    neither
    never
    nevertheless
    new
    next
    nine
    ninety
    no
    nobody
    none
    nonetheless
    noone
    nor
    not
    nothing
    now
    nowhere
    o
    of
    off
    often
    oh
    omitted
    on
    once
    one
    one's
    ones
    only
    onto
    or
    ord
    other
    others
    otherwise
    our
    ours
    ourselves
    out
    over
    overall
    own
    p
    page
    pages
    part
    per
    perhaps
    pp
    proud
    put
    q
    r
    ran
    rather
    re
    recent
    recently
    ref
    refs
    related
    research
    run
    s
    same
    say
    search
    sec
    section
    seem
    seemed
    seeming
    seems
    server
    seven
    seventy
    several
    she
    she'd
    she'll
    she's
    shed
    shell
    shes
    should
    shouldn't
    shouldnt
    since
    six
    sixty
    so
    some
    somehow
    someone
    something
    sometime
    sometimes
    somewhere
    still
    stop
    such
    t
    taking
    ten
    than
    that
    that'll
    that's
    that've
    thatll
    thats
    thatve
    the
    their
    them
    themselves
    then
    thence
    there
    there'd
    there'll
    there're
    there's
    there've
    thereafter
    thereby
    thered
    therefore
    therein
    therell
    therere
    theres
    thereupon
    thereve
    these
    they
    they'd
    they'll
    they're
    they've
    theyd
    theyll
    theyre
    theyve
    thirty
    this
    those
    though
    thousand
    three
    through
    throughout
    thru
    thus
    til
    tip
    to
    together
    too
    toward
    towards
    trillion
    try
    twenty
    two
    u
    under
    unless
    unlike
    unlikely
    until
    unto
    up
    upon
    ups
    us
    used
    using
    v
    ve
    very
    via
    vol
    vols
    vs
    w
    was
    wasn't
    wasnt
    way
    we
    we'd
    we'll
    we're
    we've
    web
    wed
    well
    were
    weren't
    werent
    weve
    what
    what'll
    what's
    what've
    whatever
    whatll
    whats
    whatve
    when
    whence
    whenever
    where
    where's
    whereafter
    whereas
    whereby
    wherein
    wheres
    whereupon
    wherever
    whether
    which
    while
    whim
    whither
    who
    who'd
    who'll
    who's
    whod
    whoever
    whole
    wholl
    whom
    whomever
    whos
    whose
    why
    will
    with
    within
    without
    won't
    wont
    words
    world
    would
    wouldn't
    wouldnt
    www
    x
    y
    yes
    yet
    you
    you'd
    you'll
    you're
    you've
    youd
    youll
    your
    youre
    yours
    yourself
    yourselves
    youve
    z
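
    A stop list in this one-word-per-line layout can be loaded into a set
    and used to filter running text. The following is only a minimal sketch
    in Python: the sample entries, the function names and the very rough
    tokenizer are my own assumptions for illustration, not part of Patrice
    Bonhomme's distribution.

```python
import re

# Hypothetical sample: a few entries in the same one-word-per-line layout
# as the list above. In practice you would read the full file instead.
SAMPLE_STOP_LIST = """\
a
about
above
aren't
of
the
"""


def load_stop_words(text):
    """Return a set of lower-cased stop words from one-word-per-line text."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def tokenize(sentence):
    """Very rough tokenizer: runs of letters, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", sentence.lower())


def remove_stop_words(sentence, stop_words):
    """Return the tokens of `sentence` that are not in the stop list."""
    return [tok for tok in tokenize(sentence) if tok not in stop_words]


stop_words = load_stop_words(SAMPLE_STOP_LIST)
print(remove_stop_words("The list of words above aren't about corpora", stop_words))
# prints ['list', 'words', 'corpora']
```

    Using a set keeps each membership test cheap, which matters once the
    full list and a large corpus are involved.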

    2. Modified Penn Treebank Tag-Set (closed class categories) from
    INFOGISTICS http://www.infogistics.com/tagset.html
    Tag    Description                       Example
    CD     cardinal number                   1, third
    CC     coordinating conjunction          and
    DT     determiner                        the
    EX     existential there                 there is
    IN     preposition                       in, of, like
    LS     list marker                       1)
    MD     modal                             could, will
    PDT    predeterminer                     both the boys
    POS    possessive ending                 friend's
    PRP    personal pronoun                  I, he, it
    PRP$   possessive pronoun                my, his
    RP     particle                          give up
    TO     to (both "to go" and "to him")    to go, to him
    UH     interjection                      uhhuhhuhh
    WDT    wh-determiner                     which
    WP     wh-pronoun                        who, what
    WP$    possessive wh-pronoun             whose
    WRB    wh-adverb                         where, when
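
    If a text has already been tagged with this tag set, the table above is
    enough to separate closed class from open class tokens. The sketch below
    assumes the input is a list of (word, tag) pairs; the example sentence
    and its tags are hand-made for illustration and the function name is
    hypothetical.

```python
# Closed-class tags copied from the table above.
CLOSED_CLASS_TAGS = {
    "CD", "CC", "DT", "EX", "IN", "LS", "MD", "PDT", "POS",
    "PRP", "PRP$", "RP", "TO", "UH", "WDT", "WP", "WP$", "WRB",
}


def split_by_class(tagged_tokens):
    """Split (word, tag) pairs into closed-class and open-class word lists."""
    closed, open_ = [], []
    for word, tag in tagged_tokens:
        if tag in CLOSED_CLASS_TAGS:
            closed.append(word)
        else:
            open_.append(word)
    return closed, open_


# Hypothetical hand-tagged input; a real tagger's output would go here.
tagged = [("the", "DT"), ("boys", "NNS"), ("could", "MD"),
          ("give", "VB"), ("up", "RP")]
closed, open_ = split_by_class(tagged)
print(closed)  # prints ['the', 'could', 'up']
print(open_)   # prints ['boys', 'give']
```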

    > Diego Molla wrote:
    >
    > > By definition, a list of closed class words must be easy to compile,
    > > since new additions to the list would be rare.
    > >
    > > Oddly enough, I haven't found any such list on the Web. A student of
    > > mine needs to use a list of closed class words. Does anybody know of
    > > such a list?
    >
    > Assuming you're interested in English, I have a list of closed class
    > words that I developed for working with a corpus of usenet text. It has
    > about 150 words. As far as I can tell, the set of closed class words in
    > English is not completely well-defined. Some words (pronouns,
    > conjunctives, articles) are clearly closed class. But certain adverbs
    > and common verbs are probably debatable, as are, I think, digits. So
    > for what it's worth, here's my list. You notice that it includes things
    > like punctuation and stuff in brackets like <NUM> (which stands for a
    > number) that you may want to remove.
    >
    > Doug

    ----------------------------------------------------------------------------

    > . , THE TO AND A OF <MIX> " IN
    > I <NUM> : YOU IS THAT ) ( IT FOR
    > ON ! <URL> HAVE WITH ? THIS BE ... NOT
    > ARE AS WAS BUT OR FROM MY AT IF THEY
    > <XXX> YOUR ALL HE BY ONE ME WHAT SO CAN
    > WILL DO AN ABOUT WE JUST WOULD THERE NO LIKE
    > OUT HIS HAS UP MORE WHO WHEN DON'T SOME HAD
    > THEM ANY THEIR IT'S ONLY ; WHICH I'M BEEN OTHER
    > WERE HOW THEN NOW HER THAN SHE WELL <IPA> ALSO
    > US VERY BECAUSE AM HERE COULD EVEN <EMO> HIM INTO
    > OUR MUCH TOO DID SHOULD OVER WANT THESE MAY WHERE
    > MOST MANY THOSE DOES WHY PLEASE OFF GOING ITS I'VE
    > DOWN THAT'S CAN'T YOU'RE DIDN'T ANOTHER AROUND MUST <EMA> FEW
    > DOESN'T EVERY YES EACH MAYBE I'LL AWAY DOING OH ELSE
    > ISN'T HE'S THERE'S HI WON'T OK THEY'RE YEAH MINE WE'RE
    > WHAT'S SHALL SHE'S HELLO OKAY HERE'S -
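
    Doug notes that his list contains punctuation and bracketed placeholders
    such as <NUM> that you may want to remove. A rough way to do that,
    assuming the list is available as whitespace-separated tokens, is
    sketched below. The excerpt, the function name and the rule "drop any
    token with no letters in it" are my own assumptions, not Doug's.

```python
import re

# Hypothetical short excerpt of the list above, as whitespace-separated tokens.
RAW = """. , THE TO AND <MIX> " IN I <NUM> : DON'T ... IT'S -"""


def clean_closed_class_list(raw):
    """Drop bracketed placeholders and pure punctuation; lower-case the rest."""
    words = []
    for token in raw.split():
        if re.fullmatch(r"<[A-Z]+>", token):
            continue  # placeholder such as <NUM> or <URL>
        if not any(ch.isalpha() for ch in token):
            continue  # punctuation such as "." or "..."
        words.append(token.lower())
    return words


print(clean_closed_class_list(RAW))
# prints ['the', 'to', 'and', 'in', 'i', "don't", "it's"]
```

    Contractions such as DON'T survive because they contain letters; only
    tokens made up entirely of non-letters are treated as punctuation.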



    This archive was generated by hypermail 2b29 : Tue Jun 11 2002 - 09:02:43 MET DST