Corpora: Swedish corpus material on the net

Daniel Ridings (ridings@svenska.gu.se)
Fri, 23 Oct 1998 07:49:35 +0100 (MET)

Next message: DL: "Re: Corpora: Corpus of scientific texts"
Previous message: GCW: "Re: Corpora: Corpus of scientific texts"
In reply to: Daniel Ridings: "Corpora: Swedish corpus material on the net"

Språkdata, which is now a part of the Department of Swedish, Göteborgs
universitet, has been actively engaged in working with written language
resources since the mid-sixties.

We are pleased to announce that we have now enabled free access to
all of our modern material (after 1965) on the WWW. The address of the
English home page is: http://spraakbanken.gu.se/lbeng.html and the
address of the Swedish home page is: http://spraakbanken.gu.se

On the home page you will see "Word-classed tagged concordances" or
"konkordanser med ordklasstaggning" which are links that will take you to
the home page for the corpus material. There you can choose between
"Press 65", "Parole" (the default), "The Bank of Swedish" (ca 55 million
words) or "Shona" and "Ndebele". The last two are the natural languages
spoken by the majority of the population in Zimbabwe.

All of the Swedish material has been part-of-speech tagged and searches
may be formed to take advantage of the tags. The searching software is
the IMS Corpus Workbench from Stuttgart. Please read the skimpy, but
essential, instructions on the home page. A quick example will illustrate
the possibilities.

In Swedish, there is a tendency for the future tense to be constructed
without the "att" after "kommer". Kommer+att, however, is still by far the
most frequent form, making it a little tricky to get at the excerpts
without "att". The following search criterion will help:

"kommer" [word!="att" & msd!="F.*" & msd!="V@I.*"]{0,3} [msd="V@N.*"] within S

That should be written on one line and reads:

"kommer" followed by 0-3 words that are not "att", not punctuation, and
not finite verbs, then followed by an infinitive, all of which must fall
within a single orthographic sentence.
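
For readers who want to see the logic of that query spelled out step by step, here is a minimal Python sketch that applies the same conditions to one sentence given as a list of (word, msd) pairs. It is not part of the Corpus Workbench and says nothing about how the Workbench evaluates queries internally. The function names and the tag values in the sample sentences are illustrative assumptions, chosen only so that they fit the patterns "F.*", "V@I.*" and "V@N.*" used above. The sketch reports the shortest match for each "kommer".

```python
import re

def is_filler(word, msd):
    # Mirrors [word!="att" & msd!="F.*" & msd!="V@I.*"]:
    # not "att", not punctuation, not a finite verb.
    return (word != "att"
            and not re.fullmatch(r"F.*", msd)
            and not re.fullmatch(r"V@I.*", msd))

def find_matches(sentence, max_gap=3):
    # sentence: list of (word, msd) pairs for ONE sentence, which stands in
    # for the "within S" restriction. Returns the matched word sequences.
    hits = []
    for i, (word, _) in enumerate(sentence):
        if word != "kommer":
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap              # position where the infinitive must be
            if j >= len(sentence):
                break
            # the newest token in the gap must satisfy the filler condition
            if gap > 0 and not is_filler(*sentence[j - 1]):
                break
            if re.fullmatch(r"V@N.*", sentence[j][1]):
                hits.append([w for w, _ in sentence[i:j + 1]])
                break                    # keep only the shortest match
    return hits

# Hypothetical sample sentences; the tags are placeholders, not corpus data.
with_att = [("Han", "PRON"), ("kommer", "V@IPAS"), ("att", "CIS"),
            ("resa", "V@N0AS"), (".", "FE")]
without_att = [("Han", "PRON"), ("kommer", "V@IPAS"), ("inte", "RG0S"),
               ("resa", "V@N0AS"), (".", "FE")]

print(find_matches(with_att))
print(find_matches(without_att))
```

The first sentence yields no hit because "att" blocks the gap, while the second yields the sequence "kommer inte resa".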

A simpler search, containing only high-frequency words that would
rarely be searched for on their own:

"i" "och" "för" "sig"

Note that each word in a phrase must be written separately, in its own
quotation marks.
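
If you build such phrase searches programmatically, a small helper along the following lines may save typing. This is only a sketch: the function name is made up for illustration, and it assumes the phrase contains plain words with no characters that are special in regular expressions, since it does no escaping.

```python
def phrase_to_query(phrase):
    # Split the phrase on whitespace and put each word in its own
    # double quotes, separated by single spaces.
    return " ".join('"{}"'.format(word) for word in phrase.split())

print(phrase_to_query("i och för sig"))
```

Called with "i och för sig", it produces the four separately quoted words shown in the example above.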

The Bank of Swedish will be growing continually and we hope to provide at
least 100 million words and refined searching criteria for collocations
and lexical associations in the near future.

Please feel free to get in touch with us if you have questions or problems.

Daniel Ridings +46 31 773 47 99
Språkbanken ridings@svenska.gu.se
Göteborgs universitet
SE-405 30 GÖTEBORG
