[Corpora-List] EXTENDED DEADLINE: ATALA Workshop, Role of typography and punctuation in natural language processing

From: Ghassan Mourad (Ghassan.Mourad@paris4.sorbonne.fr)
Date: Tue Sep 30 2003 - 13:27:37 MET DST

    EXTENDED DEADLINE: October 10th 2003

             CALL FOR WORKSHOP PAPERS

    (Please accept my apologies if you receive multiple copies of this
    message.)
    -------------------------------------------------------------------------------------------------------

    ATALA Workshop

    **************************************
    22 November 2003
    ENST, 46, rue Barrault (49, rue Vergnault), 75013 Paris
    ****************************************************
    Title :
    Role of typography and punctuation in natural language processing
    (text segmentation, prosody, syntactic analysis, information retrieval,
    coding in multilingual systems,…)

    Organisation : Ghassan Mourad & Jean-Pierre Descles
    Laboratory : LaLICC (UMR 8139 Paris-Sorbonne / CNRS)

    Conference call

    Objective:
    Even though punctuation and typography are rarely taught as subjects in
    their own right, we can hardly deny their role in reading and writing. The
    same holds for natural language processing, where punctuation plays an
    important role. Typographical and punctuation signs are “natural tags” of
    information, and indicators on which most processing should rely. It is
    essential to inventory and study all the issues that arise in multilingual,
    multi-script, and multi-encoding processing.

    The ATALA workshop is particularly concerned with current research on
    punctuation, typography, coding and transcription issues in linguistics
    and language processing, and with existing work in this restricted domain
    or directly related to it.

    Issues:
    Linguistic engineering and language processing are confronted with new
    issues. It is now necessary to work not only on isolated sentences or
    utterances, but also on entire texts, structured or unstructured; for
    example, texts from the Internet or from document bases stored by companies
    or administrations, encyclopaedias or even dictionary articles.
    Moreover, texts are rarely tagged or digitised. Yet text processing
    requires pre-processing before syntactic, semantic and pragmatic analysis
    can be conducted. In particular, each text has two structures: formal and
    discursive. The latter depends on the former. The formal structure
    expresses a certain intended meaning; it results from the coding in a
    typographical system and from “text-setting” or text layout.
    The pre-processing of a text must exploit the formal structure (locating
    titles and subtitles; splitting the text into sentences, paragraphs,
    utterances, propositions and words; identifying quotations; identifying
    list items; taking spatial layout into account; locating images, diagrams,
    captions and boxes…) before executing other tasks or exploiting the
    discursive structure (identifying temporal, spatial, topic and event
    frames; relations between concepts, terms and events; anaphoric links;
    enunciative phenomena…).
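
    As a rough illustration of what exploiting the formal structure can mean
    in practice, the sketch below splits a plain text into blocks and labels
    each one as a title, a list or a paragraph, using only layout and
    punctuation cues. The function names, the 60-character title limit and the
    list-marker pattern are assumptions made for this example; they are not
    taken from any system mentioned in this call.

```python
import re

# Assumed pattern for list markers: "-", "*", "•", or short labels such as "1." or "(a)".
LIST_ITEM = re.compile(r'^\s*(?:[-*•]|\(?[0-9a-z]{1,2}[.)])\s+')

# Punctuation that normally ends a sentence or clause, so rarely ends a title.
CLOSING_MARKS = (".", "!", "?", ";", ",")


def classify_block(block):
    """Label one block as 'list', 'title' or 'paragraph' (heuristic)."""
    lines = block.strip().splitlines()
    # Every line starts with a list marker: treat the block as a list.
    if all(LIST_ITEM.match(line) for line in lines):
        return "list"
    # A single short line without closing punctuation is taken to be a title.
    if (len(lines) == 1 and len(lines[0]) <= 60
            and not lines[0].rstrip().endswith(CLOSING_MARKS)):
        return "title"
    return "paragraph"


def formal_structure(text):
    """Split text on blank lines and return (label, block) pairs."""
    blocks = [b for b in re.split(r'\n\s*\n', text) if b.strip()]
    return [(classify_block(b), b.strip()) for b in blocks]


sample = (
    "Issues\n\n"
    "Text processing requires pre-processing. Each text has two structures.\n\n"
    "- formal\n- discursive\n"
)
labels = [label for label, _ in formal_structure(sample)]
```

    Heuristics of this kind are fragile (a short final sentence without a full
    stop would be mislabelled as a title), which is one reason the formal
    structure deserves study in its own right.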

    Without complete control over the exploitation of formal structure, text
    processing will not really be operational. Obviously, this issue did not
    arise when we worked only on isolated sentences. For semantic analysis,
    however, text must be segmented into linguistic units that are larger or
    smaller than the normative sentence, by taking into account semiotic marks
    that the computer can identify clearly and formally. Punctuation and all
    typographic signs (indices) are still the most relevant elements, since
    they provide sharp indications for formal text segmentation and
    structuring; these indications are the foundation of automatic textual
    linguistics.

    We can distinguish between three types of approaches for segmentation:
    (a) Numerical approaches (neural networks, n-grams, Markov models…);
    (b) Finite automata and regular expressions approaches (for instance
    INTEX);
    (c) Contextual exploration approaches based on punctuation marks (for
    instance SegATex).
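
    To make the contrast concrete, here is a minimal sketch loosely in the
    spirit of approaches (b) and (c): a regular expression proposes candidate
    boundaries at punctuation marks, and a small contextual check on the words
    to the left and right decides whether each candidate is kept. It is not
    the INTEX or SegATex method; the abbreviation list, the rules and the
    function name are hypothetical and far too small for real use.

```python
import re

# Hypothetical abbreviation list (lower-cased, without the final full stop).
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e", "ch"}

# Candidate boundary: ".", "!" or "?", an optional closing quote or bracket,
# then at least one whitespace character.
CANDIDATE = re.compile(r'[.!?]["\')\]»]?\s+')


def segment(text):
    """Split text into sentences using punctuation plus left/right context."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        mark = text[match.start()]
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        next_char = text[match.end():match.end() + 1]
        # Left context: a full stop after a known abbreviation is not a boundary.
        if mark == "." and last_word in ABBREVIATIONS:
            continue
        # Right context: require an upper-case letter, a digit or an opening mark.
        if next_char and not (next_char.isupper() or next_char.isdigit()
                              or next_char in '"«('):
            continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


result = segment("Dr. Smith arrived. He was late! Was it e.g. traffic? Yes.")
```

    Here the full stops after "Dr" and "e.g" are rejected by the left-context
    rule, so the text is split into four sentences rather than six.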

    Traditional accounts of punctuation (treatises, handbooks) are generally
    normative and do not allow precise rules to be expressed that could lead
    to automatic segmentation. Furthermore, these treatises did not consider
    the semantic analysis of highly polysemous marks such as the comma,
    semicolon, colon, dash, parentheses, ... Yet these marks play a very
    important role in semantic structuring; analysing them makes it possible
    to improve the segmentation process and the discursive structuring of
    text.
    Text processing tools offer enormous potential for typographic variation;
    for example, highlighting a term that is being quoted, giving an example,
    or disambiguating an expression… To quote Ch. Gouriou: “For every problem
    posed by the transcription of thought, typography must provide at least
    one solution; it offers several as soon as it is asked to bring out
    nuances or subtleties.” However, the interpretation to be given to these
    variations is not uniform and depends on other contextual (typographic and
    punctuation) elements; for example, an italicized expression does not have
    the same value (meaning) depending on whether it is capitalized or placed
    between quotation marks. It is indeed a conglomerate of typographic marks,
    variable from text to text, which gives an occurrence of typographic
    change its value. Text processing must resolve these linguistic and
    computational issues.

    Theme:
    Submissions may also discuss cross-domain topics related to:

    - Formal segmentation of text,
    - Text discursive segmentation based on punctuation and typography marks,
    - “Textual architecture”,
    - The role of punctuation, particularly the comma, in syntactic
    analysis,
    - Contribution of punctuation to the coding of prosody and
    contribution of typography to the coding of intonation,
    - Contribution of punctuation to the identification of proper
    names, compound words, abbreviations, initials, …
    - Comparison of punctuation in various linguistic systems (Arabic,
    Chinese…),
    - Coding and transcription issues in various linguistic systems,
    - …

    Modalities :
    Submission : a 2-4 page summary.
    We ask authors to indicate whether their submission:
    - presents work in progress or is a position paper;
    - presents completed theoretical or applied work.
    The 2-4 page summary must be sent before 10 October 2003 by e-mail in
    text, .rtf, .doc or .pdf format to:
    Ghassan.Mourad@paris4.sorbonne.fr
    and
    Jean-Pierre.Descles@paris4.sorbonne.fr

    Acceptance notifications will be sent by 20 October 2003.

    ****************************************************************************************

    Ghassan Mourad
    ISHA, Paris - Sorbonne
    Laboratoire LaLICC (Langage, Logique, Informatique, Cognition et Communication)
    (UMR 8139 Paris-Sorbonne / CNRS)
    http://www.lalic.paris4.sorbonne.fr/
    96, Bd Raspail
    75006 Paris
    France
    tel : 01 44 39 35 90
    fax : 01 44 39 35 91


