Re: [Corpora-List] Searching Japanese corpora

From: Cyrus Shaoul (cyrus.shaoul@ualberta.ca)
Date: Thu Dec 21 2006 - 18:11:40 MET

Next message: Ryan North: "Re: [Corpora-List] Corpora of comic strips/books"

Previous message: Ron Artstein: "[Corpora-List] Call for papers: 2007 SEMDIAL (Workshop on the Semantics and Pragmatics of Dialogue)"
In reply to: Eric J. M. Smith: "[Corpora-List] Searching Japanese corpora"
Next in thread: Brett Powley: "Re: [Corpora-List] Searching Japanese corpora"
Reply: Brett Powley: "Re: [Corpora-List] Searching Japanese corpora"

Hi Eric,

As I understand it, the pronunciation of any kanji or kanji compound can
be written in either hiragana or katakana (and each kanji or kanji
compound can have multiple pronunciations). In most types of written
Japanese, however, it is uncommon to spell out the pronunciation of a
word that is normally written in kanji, and there are many words that
are always written in katakana or hiragana and never in kanji. So when
searching for words, a tool that automatically searched for a kanji word
and its kana representations at the same time would not be that useful.

I should confess that there are some words that appear fairly often in
both kanji and kana, such as some older loanwords, some place names,
some proper names, words containing low-frequency kanji, and a few other
types of words. I have a gut feeling that the number of words that fall
into these categories is not that large.

I don't know of any tools out there to do the kind of query you
mentioned, but it has been a few years since I worked on Japanese text.
In the meantime, I can only suggest making several queries: one with the
kanji or kanji compound, and others with the hiragana and katakana
spellings of all the possible pronunciations.
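
If it helps, here is a rough sketch of how those several queries could be
folded into a single pattern. It is only an illustration, not an existing
tool: the function names are made up, the readings have to be typed in by
hand (nothing here looks them up), and the katakana spellings are derived
from the hiragana ones by relying on the fixed 0x60 offset between the two
Unicode blocks.

```python
import re

# Offset between the hiragana block (U+3041..U+3096) and the
# corresponding katakana code points (U+30A1..U+30F6).
KANA_OFFSET = 0x60


def hiragana_to_katakana(text):
    """Convert hiragana characters to katakana; leave everything else alone."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


def build_query(kanji_form, hiragana_readings):
    """Build one regex matching the kanji form or any kana spelling of it.

    hiragana_readings is a hand-supplied list (an assumption of this
    sketch); no dictionary lookup is performed.
    """
    variants = {kanji_form}
    for reading in hiragana_readings:
        variants.add(reading)
        variants.add(hiragana_to_katakana(reading))
    # Try longer variants first so a short variant cannot shadow a longer one.
    ordered = sorted(variants, key=lambda v: (-len(v), v))
    return re.compile("|".join(re.escape(v) for v in ordered))


# Hypothetical example: one word with one hand-entered reading.
pattern = build_query("猫", ["ねこ"])
corpus = "猫が好き。ネコも好き。ねこはかわいい。"
matches = pattern.findall(corpus)
```

A plain substring search like this will also hit kana sequences that
happen to sit inside longer words, so the results would still need to be
checked by eye or run through a tokenizer first.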

Yours,

Cyrus

http://www.psych.ualberta.ca/~westburylab/

Eric J. M. Smith wrote:
> Greetings,
>
> Following up on our recent thread about grep with Unicode, I'm curious
> about how people search for text in Japanese-language corpora.
>
> My understanding of Japanese is rudimentary, but is it not possible
> (potentially at least) for the same text to be written in hiragana,
> katakana, or kanji? In order to find all occurrences of a particular
> string in a corpus, would I have to do the search 3 times, once for
> each script? I assume that would be the case for something like grep.
> But are there more sophisticated query tools which abstract away the
> question of which script is actually used for data within the corpus?
>
> Thanks,
>
> Eric J. M. Smith
> Dept. of Linguistics
> University of Toronto
>


This archive was generated by hypermail 2b29 : Thu Dec 21 2006 - 18:10:24 MET