Academic literature on the topic 'Corpora annotation with deep linguistic information'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference papers, and other scholarly sources on the topic 'Corpora annotation with deep linguistic information.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Corpora annotation with deep linguistic information"

1

Dorr, Bonnie J., Rebecca J. Passonneau, David Farwell, et al. "Interlingual annotation of parallel text corpora: a new framework for annotation and evaluation." Natural Language Engineering 16, no. 3 (2010): 197–243. http://dx.doi.org/10.1017/s1351324910000070.

Abstract:
This paper focuses on an important step in the creation of a system of meaning representation and the development of semantically annotated parallel corpora, for use in applications such as machine translation, question answering, text summarization, and information retrieval. The work described below constitutes the first effort of any kind to annotate multiple translations of foreign-language texts with interlingual content. Three levels of representation are introduced: deep syntactic dependencies (IL0), intermediate semantic representations (IL1), and a normalized representation that unifies conversives, nonliteral language, and paraphrase (IL2). The resulting annotated, multilingually induced, parallel corpora will be useful as an empirical basis for a wide range of research, including the development and evaluation of interlingual NLP systems and paraphrase-extraction systems as well as a host of other research and development efforts in theoretical and applied linguistics, foreign language pedagogy, translation studies, and other related disciplines.
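To make the three levels concrete, here is a purely illustrative Python sketch of how IL0, IL1 and IL2 analyses of a toy sentence might be stored; the field names and values are invented for exposition and do not reproduce the project's actual representation format:

    # Hypothetical data structures for "IBM bought Lotus"; all names invented.
    il0 = {  # IL0: deep syntactic dependencies
        "head": "bought",
        "deps": {"subj": "IBM", "obj": "Lotus"},
    }
    il1 = {  # IL1: intermediate semantic representation (predicate + roles)
        "pred": "buy",
        "agent": "IBM",
        "theme": "Lotus",
    }
    il2 = {  # IL2: normalized; "Lotus was sold to IBM" would map to the
             # same record, unifying conversives (buy/sell) and paraphrase
        "event": "transfer-of-ownership",
        "recipient": "IBM",
        "item": "Lotus",
    }

The point of the normalization at IL2 is that conversive and paraphrastic variants of a sentence collapse onto one record, which is what makes annotated multiple translations comparable.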
2

Zolotov, Pitirim Y. "Linguodidactic properties of corpus technologies." Tambov University Review. Series: Humanities, no. 185 (2020): 75–82. http://dx.doi.org/10.20310/1810-0201-2020-25-185-75-82.

Abstract:
For the last two decades, corpus technologies, understood as the combination of means and methods for processing and analyzing data from electronic linguistic corpora and as a type of information and communication technology, have attracted great interest among researchers and teachers of foreign languages. We explain the concepts of corpus linguistics, corpus technology, linguistic corpus, and concordance. The methods of studying corpus technologies, namely annotation, abstraction, and analysis, are considered. The advantages of linguistic corpora are given, and the history of the emergence and development of electronic linguistic corpora from the pre-digital to the digital period is described. Minimum requirements for a corpus of texts are presented: representativeness, known corpus volume, electronic form, annotation, and balance. We consider the typology of linguistic corpora. According to the language of the texts, corpora are monolingual or multilingual, the latter being divided into mixed and parallel ones. According to the source of the language data, there are written, oral, and mixed corpora. Corpora can be annotated or non-annotated, with three types of annotation: linguistic, metatextual, and extralinguistic. According to the representation of the language material, corpora are fragmented or non-fragmented; according to the type of access, open or restricted; according to genre representation, linguistic corpora are diverse. By size, corpora are divided into representative, illustrative, and monitoring types. The didactic properties of corpus technologies in the field of foreign language teaching are studied, and a division of the linguodidactic properties of corpus technologies into mandatory and optional ones is proposed.
3

Erjavec, Tomaž. "Označevanje korpusov." Jezik in slovstvo 48, no. 3-4 (2024): 61–76. http://dx.doi.org/10.4312/jis.48.3-4.61-76.

Abstract:
Ordered collections of machine-readable texts, corpora, are useful in various branches of linguistics. The present paper focuses on the machine-readable form of corpora, above all on their annotation, i.e. the addition of interpretative information to the text in the corpus. The annotation presented is based on the international standards in this field, which contributes to better documentation and verifiability, easier use of processing applications, and better interchange and longevity. In the first part, corpus encoding standards, above all XML (eXtensible Markup Language) and TEI (Text Encoding Initiative), are discussed, while in the second part some of the more interesting levels of linguistic annotation of corpora are outlined.
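As a minimal sketch of the kind of token-level encoding the paper discusses, the following Python snippet builds a TEI-style annotated sentence with only the standard library; the element and attribute names follow common TEI practice (<s>, <w>, lemma, ana), but the morphosyntactic tag values are illustrative assumptions, not the paper's actual scheme:

    import xml.etree.ElementTree as ET

    # One sentence, one <w> element per token, carrying a lemma and a
    # morphosyntactic description in the ana attribute (values illustrative).
    s = ET.Element("s", {"xml:id": "s1"})
    for form, lemma, msd in [("Corpora", "corpus", "Ncnp"),
                             ("are", "be", "Va-p3"),
                             ("annotated", "annotate", "Vmp")]:
        w = ET.SubElement(s, "w", lemma=lemma, ana=msd)
        w.text = form

    print(ET.tostring(s, encoding="unicode"))
    # <s xml:id="s1"><w lemma="corpus" ana="Ncnp">Corpora</w>...</s>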
4

Alayiaboozar, Elham. "Indicators and stages of building a linguistic corpus: written and spoken varieties." Linguistics of Iranian Dialects 4, no. 2 (2019): 267–90. https://doi.org/10.5281/zenodo.14033040.

Abstract:
This research aims to assist researchers in the construction of various linguistic corpora by collecting information related to the indicators and stages of corpus building. In this article, after reviewing the opinions of researchers who have constructed corpora in different languages, the general indicators for building linguistic corpora are discussed. These indicators pertain to the construction of textual and spoken varieties of the corpus, including sampling, representativeness, balance, size, type of corpus, and homogeneity. Subsequently, the process of constructing a textual corpus is presented, which encompasses text selection, text preprocessing, and annotation, with detailed explanations provided for each of these stages. Finally, the process of constructing a spoken corpus is outlined, which includes data collection, transcription, display and annotation, and accessibility, with thorough explanations given for each of the mentioned stages.
5

Iomdin, Leonid. "Microsyntactic Annotation of Corpora and its Use in Computational Linguistics Tasks." Journal of Linguistics/Jazykovedný časopis 68, no. 2 (2017): 169–78. http://dx.doi.org/10.1515/jazcas-2017-0027.

Abstract:
Microsyntax is a linguistic discipline dealing with idiomatic elements whose important properties are strongly related to syntax. In a way, these elements may be viewed as transitional entities between the lexicon and the grammar, which explains why they are often underrepresented in both of these resource types: the lexicographer fails to see such elements as full-fledged lexical units, while the grammarian finds them too specific to justify the creation of individual well-developed rules. As a result, such elements are poorly covered by the linguistic models used in advanced modern computational linguistic tasks like high-quality machine translation or deep semantic analysis. A possible way to mend the situation and improve the coverage and adequate treatment of microsyntactic units in linguistic resources is to develop corpora with microsyntactic annotation, closely linked to specially designed lexicons. The paper shows how this task is solved in the deeply annotated corpus of Russian, SynTagRus.
6

Jiménez-Zafra, Salud María, Roser Morante, María Teresa Martín-Valdivia, and L. Alfonso Ureña-López. "Corpora Annotated with Negation: An Overview." Computational Linguistics 46, no. 1 (2020): 1–52. http://dx.doi.org/10.1162/coli_a_00371.

Abstract:
Negation is a universal linguistic phenomenon with a great qualitative impact on natural language processing applications. The availability of corpora annotated with negation is essential to training negation processing systems. Currently, most corpora have been annotated for English, but the presence of languages other than English on the Internet, such as Chinese or Spanish, grows every day. In this study, we present a review of the corpora annotated with negation information in several languages with the goal of evaluating what aspects of negation have been annotated and how compatible the corpora are. We conclude that it is very difficult to merge the existing corpora because we found differences in the annotation schemes used and, most importantly, in the annotation guidelines: the way in which each corpus was tokenized and the negation elements that have been annotated. Unlike other well-established tasks such as semantic role labeling or parsing, negation has no standard annotation scheme or guidelines, which hampers progress in its treatment.
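The incompatibilities the survey describes are easy to see on a toy example. The sketch below is illustrative only; neither scheme reproduces an actual corpus's guidelines. It shows two annotations of the same sentence whose token indices cannot be aligned because the corpora were tokenized differently and annotate different negation elements:

    sentence = "I can't swim"

    scheme_a = {                      # cue and scope both annotated
        "tokens": ["I", "can't", "swim"],
        "cue":    [1],                # index of the negation cue
        "scope":  [0, 1, 2],          # whole clause in scope
    }
    scheme_b = {                      # only the cue annotated
        "tokens": ["I", "ca", "n't", "swim"],
        "cue":    [2],                # "n't" after splitting the clitic
    }

    # The cue indices differ (1 vs 2) purely because of tokenization, and
    # scheme_b has no scope layer at all, so the two corpora cannot simply
    # be merged without re-tokenizing and re-annotating.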
7

Cantalini, Giorgina, and Massimo Moneglia. "The annotation of gesture and gesture/prosody synchronization in multimodal speech corpora." Journal of Speech Sciences 9 (September 9, 2020): 7–30. http://dx.doi.org/10.20396/joss.v9i00.14956.

Abstract:
This paper highlights the functional and structural correlations between gesticulation and prosody, focusing on gesture/prosody synchronization in spontaneous spoken Italian. The gesture annotation used follows the LASG model (Bressem et al. 2013), while the prosodic annotation focuses on the identification of terminal and non-terminal prosodic breaks which, according to L-AcT (Cresti, 2000; Moneglia & Raso 2014), determine speech act boundaries and the information structure, respectively. Gesticulation co-occurs with speech in about 90% of the speech flow examined, and gestural arcs are synchronous with prosodic boundaries. Gesture Phrases, which contain the expressive phase (Stroke), never cross terminal prosodic boundaries, making the utterance the maximum unit for gesture/speech correlation. Strokes may correlate with all information unit types, though only infrequently with Dialogic Units (i.e. those functional to the management of the communication). The identification of linguistic units via the marking of prosodic boundaries allows us to understand the linguistic scope of the gesture, supporting its interpretation. Gestures may be linked to different linguistic levels, namely: a) the word; b) the information unit phrase; c) the information unit function; d) the illocutionary value.
8

Hajič, Jan, Eva Hajičová, Jiří Mírovský, and Jarmila Panevová. "Linguistically Annotated Corpus as an Invaluable Resource for Advancements in Linguistic Research: A Case Study." Prague Bulletin of Mathematical Linguistics 106, no. 1 (2016): 69–124. http://dx.doi.org/10.1515/pralin-2016-0012.

Abstract:
A case study based on experience in linguistic investigations using annotated monolingual and multilingual text corpora; the "cases" include a description of language phenomena belonging to different layers of the language system: morphology, surface and underlying syntax, and discourse. The analysis is based on the complex annotation of syntax, semantic functions, information structure and discourse relations in the Prague Dependency Treebank, a collection of annotated Czech texts. We want to demonstrate that the annotation of a corpus is not a self-contained goal: in order to be consistent, it should be based on some linguistic theory, and, at the same time, it should serve as a test bed for the given linguistic theory in particular and for linguistic research in general.
9

Novák, Václav. "Semantic Network Manual Annotation and its Evaluation." Prague Bulletin of Mathematical Linguistics 90, no. 1 (2008): 69–82. http://dx.doi.org/10.2478/v10108-009-0008-4.

Abstract:
The present contribution is a brief extract of Novák (2008). The Prague Dependency Treebank (PDT) is a valuable resource of linguistic information annotated on several layers. These layers range from the morphemic to the deep, and they should contain all the linguistic information about the text. The natural extension is to add a semantic layer suitable as a knowledge base for tasks like question answering, information extraction, etc. In this paper I set up criteria for this representation, explore possible formalisms for the task and discuss their properties. One of them, Multilayered Extended Semantic Networks (MultiNet), is chosen for further investigation. Its properties are described and an annotation process is set up. I discuss some practical modifications of MultiNet for the purpose of manual annotation. MultiNet elements are compared to the elements of the deep linguistic layer of the PDT. The tools and problems of the annotation process are presented and initial annotation data are evaluated.
10

Druskat, Stephan, Thomas Krause, Clara Lachenmaier, and Bastian Bunzeck. "Hexatomic: An extensible, OS-independent platform for deep multi-layer linguistic annotation of corpora." Journal of Open Source Software 8, no. 86 (2023): 4825. http://dx.doi.org/10.21105/joss.04825.


Dissertations / Theses on the topic "Corpora annotation with deep linguistic information"

1

Castro, Sérgio Ricardo de. "Developing reliability metrics and validation tools for datasets with deep linguistic information." Master's thesis, 2011. http://hdl.handle.net/10451/13908.

Abstract:
The purpose of this dissertation is to propose a reliability metric and respective validation tools for corpora annotated with deep linguistic information. The annotation of a corpus with deep linguistic information is a complex task and is therefore aided by a computational grammar. This grammar generates all the possible grammatical representations for sentences. The human annotators select the most correct analysis for each sentence, or reject it if no suitable representation is achieved. This task is repeated by two human annotators under a double-blind annotation scheme, and the resulting annotations are adjudicated by a third annotator. This process should result in reliable datasets, since the main purpose of the dataset is to serve as training and validation data for other natural language processing tools. It is therefore necessary to have a metric that assures such reliability and quality. In most cases, the metrics used for shallow annotation or parser evaluation have been applied to this same task. However, the increased complexity demands finer granularity in order to properly measure the reliability of the dataset. With that in mind, I suggest a metric based on Cohen's Kappa that, instead of considering the assignment of tags to parts of the sentence, considers decisions at the level of the semantic discriminants, the most granular unit available for this task. By comparing each annotator's options it is possible to evaluate, with a high degree of granularity, how close their analyses were for any given sentence. An application was developed to apply this model to the data resulting from the annotation process, which was aided by the LOGON framework. The output of this application includes not only the metric for the annotated dataset but also information about divergent decisions, intended to aid the adjudication process.
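The following Python sketch shows how such a discriminant-level kappa could be computed. It is a reconstruction from the abstract, not the dissertation's actual code, and it assumes each annotator's analysis is reduced to binary decisions over the sentence's semantic discriminants, with chance agreement on a binary choice taken as 0.5, as a uniform chance model would give:

    def option_kappa(ann_a, ann_b, expected=0.5):
        """Cohen-style kappa over binary discriminant decisions.

        ann_a, ann_b: dicts mapping discriminant IDs to True/False
        (does the discriminant belong to the best analysis?).
        """
        shared = set(ann_a) & set(ann_b)
        if not shared:
            return None                      # nothing to compare
        observed = sum(ann_a[d] == ann_b[d] for d in shared) / len(shared)
        return (observed - expected) / (1 - expected)

    # Two annotators agreeing on 3 of 4 discriminants of one sentence:
    a = {"d1": True, "d2": False, "d3": True, "d4": True}
    b = {"d1": True, "d2": True,  "d3": True, "d4": True}
    print(option_kappa(a, b))  # observed 0.75 -> kappa 0.5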
2

"Information structure in cross-linguistic corpora : annotation guidelines for phonology, morphology, syntax, semantics and information structure." Universität Potsdam, 2007. http://opus.kobv.de/ubp/volltexte/2007/1419/.

Abstract:
This volume presents annotation guidelines that have been developed in the context of the SFB 632, a collaborative research center entitled "Information Structure: The Linguistic Means for Structuring Utterances, Sentences and Texts". An important result of the SFB 632 is the set of SFB corpora from more than 20 typologically different languages, which have been annotated according to the guidelines presented here. The ultimate purpose of the data and its annotations is to support the study of Information Structure. Information Structure involves all levels of grammar and, hence, the present guidelines cover relevant aspects of all these levels:
- Phonology
- Morphology
- Syntax
- Semantics
- Information Structure
These levels are dealt with in individual chapters containing tagset declarations with obligatory and optional tags, detailed annotation instructions, and illustrative examples. The volume also presents an evaluation of inter-annotator agreement for the syntactic and Information Structural annotation.
3

Castro, Sérgio Ricardo de. "Developing reliability metrics and validation tools for datasets with deep linguistic information." Master's thesis, 2011. http://hdl.handle.net/10451/8688.

Abstract:
Master's thesis in Informatics Engineering, presented to the Universidade de Lisboa through the Faculdade de Ciências, 2011.

Most of the natural language processing tools in use today, from POS taggers to syntactic parsers, require corpora annotated with the relevant linguistic information for training and evaluation purposes. The quality of the results obtained by these tools is directly tied to the quality of the corpora used for their training or evaluation, so it is of the highest interest to build annotated training and evaluation corpora of the greatest possible quality. As the techniques and tools of natural language processing have become more sophisticated and technically complex, the amount and depth of the information contained in annotated corpora has also grown. The current state of the art consists of corpora annotated with deep grammatical information, that is, annotation containing not only the function or type of each element but also the types of the relations, direct or long-distance, between the different elements. This growing amount of information makes the annotation task increasingly complex, hence the need to guarantee that the process results in corpora of the best possible quality.

Following this growing complexity, the techniques used in the annotation process have also changed. The information to be introduced into the corpus is too complex to be entered manually, so the process is now driven by a computational grammar, which produces all the possible grammatical representations for each sentence; one or more human annotators then choose the grammatical representation that best fits the sentence in question. This process guarantees a uniform annotation format, as well as full consistency in the tags used, both of them recurring problems in manually annotated corpora. The goal of this dissertation is to identify a method or metric for evaluating the task of annotating corpora with deep grammatical information, as well as an application that collects the data produced by the annotation task and computes the metric or metrics needed for its validation and evaluation.

With this goal in mind, the background of the annotation task was first explored, on both its linguistic and its natural language processing sides. On the linguistic side, some basic notions should be highlighted, such as that of a corpus, a collection of linguistic material from multiple sources, such as radio broadcasts, the printed press and even everyday conversation. An annotated corpus is a corpus in which the material has been explicitly enriched with linguistic information that is implicit for a native speaker of the language, with the aim of helping machines process the material. Corpus annotation by the NLX group is carried out under a double-blind annotation scheme, in which two annotators each choose, from the set of possible grammatical representations assigned to each sentence by the LXGram grammar, the one they consider most correct. These representations are subsequently adjudicated by a third annotator, and the result of this adjudication is the representation that enters the annotated corpus. The focus of this work is to evaluate the quality and reliability of the material resulting from this annotation process.

The annotation process can be seen as the assignment of categories to items: in this case, the assignment of categories or linguistic information to the words or multi-word units of a sentence. Concretely, given a list of semantic discriminants, the annotators must decide which of them do or do not belong to the best grammatical representation of a given sentence. The literature offers several approaches to the evaluation of annotation under simple annotation schemes, for example POS tagging, such as Cohen's Kappa (Cohen, 1960), or k, and its variants, such as S (Bennett et al., 1954) and pi (Scott, 1955). All these metrics rest on the same idea, namely that the inter-annotator agreement rate can be computed from two values: the observed agreement (Ao), i.e. the amount of information on which the annotators agree, and the expected agreement (Ae), i.e. the amount of agreement that would be expected between the annotators if the annotation were done at random. All the metrics derived directly from Cohen's Kappa compute the agreement rate in the same way, using the formula agreement = (Ao - Ae) / (1 - Ae). The point of divergence between the different approaches lies in how the expected agreement rate is computed, with each approach representing it through a different statistical distribution. Another type of metric, normally used for the evaluation of syntactic analyses, is also applied to this kind of task: metrics such as Parseval (Black et al., 1991) and Leaf Ancestor (Sampson and Babarczy, 2003), which compare, sentence by sentence, the analysis given by the automatic parser against a gold standard (the analysis considered correct for the sentence).

However, the complexity of the task under evaluation demands not only a solid metric but also enough granularity to distinguish the small divergences that may underlie apparently contradictory results. Given the task at hand, the most granular approach possible is to compare each individual decision on each discriminant for a given sentence. Since the aim is the greatest possible granularity, for the metric developed here, Y-Option Kappa, the observed agreement rate is the ratio between the number of discriminants with identical decisions, or options, and the total number of discriminants available for a given sentence. As each discriminant has two possible values, i.e. it either belongs or does not belong to the best grammatical representation, the expected agreement rate can be modelled as a uniform distribution over binary decisions, which means that the expected agreement under random decision is 0.5. The Y-Option Kappa metric is then computed with the same formula used by Cohen's k and its variants.

The annotation task is supported by a package of linguistic tools called LOGON, which allows dynamic corpus annotation: sentences are re-analysed on the fly by the computational grammar as the annotators take decisions on the discriminants. This gives access to the resulting grammatical representations, affording a better perception of the outcome of the decisions taken. The information resulting from the annotation process is stored in log files that can be used to reconstruct the resulting grammatical representation of each sentence. This package is very useful and provides precious help in the annotation process. However, the log files store only the information needed to reconstruct the final grammatical representation, which results in a list of discriminants that may be incomplete for the purposes of evaluating the annotation process. For example, when an annotator rejects a sentence, i.e. considers that the set of possible grammatical representations contains no correct one, only the discriminants considered up to the moment of rejection are recorded in the log file. To solve this problem, some adaptations had to be made to the original idea of the Y-Option Kappa metric so that it could be applied to the collected data. Two general cases yield concrete sets of information in the log files: either each annotator accepts a grammatical representation as optimal for the sentence, in which case all the options are present and can be compared directly; or at least one of the annotators rejects every grammatical representation for the sentence, in which case only a partial list of the options taken exists for that annotator. To handle the latter cases, average values are estimated over the cases in which all the information is available and then applied to the cases in which it is not. The metric is thus computed sentence by sentence, and the final result reported is the arithmetic mean of the metric over all sentences. An application was developed that computes the value of the metric from the log files, together with additional information to support the adjudication task. A future goal would be to modify the applications of the LOGON package, in particular [incr tsdb()], so that all the discriminants for each sentence are stored, thus dispensing with the need to compute estimates.
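As a companion to the kappa sketch under entry 1 above, the snippet below illustrates one way the "estimated averages" adaptation for rejected sentences could work: agreement on the unlogged discriminants of a partially logged sentence is replaced by the average per-discriminant agreement rate measured on the fully logged ones. The exact estimation scheme used in the thesis may differ:

    def partial_kappa(matches, logged, missing, avg_rate, expected=0.5):
        """Kappa for a sentence whose log is incomplete (rejection case).

        matches: identical decisions among the logged discriminants
        logged / missing: counts of logged vs unlogged discriminants
        avg_rate: mean agreement rate estimated from complete sentences
        """
        observed = (matches + avg_rate * missing) / (logged + missing)
        return (observed - expected) / (1 - expected)

    # 6 of 8 logged discriminants agree; 4 were never logged; fully logged
    # sentences agree 80% of the time on average:
    print(partial_kappa(6, 8, 4, 0.8))   # observed ~0.767 -> kappa ~0.53

    # The corpus-level figure is then the arithmetic mean of the
    # per-sentence kappas, as described in the abstract.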

Books on the topic "Corpora annotation with deep linguistic information"

1

Garside, R. G., Geoffrey N. Leech, and Tony McEnery, eds. Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, 1997.

2

Götze, Michael, ed. Information structure in cross-linguistic corpora: Annotation guidelines for phonology, morphology, syntax, semantics and information structure. Universitätsverlag Potsdam, 2007.

3

Garside, Roger, Geoffrey N. Leech, and Tony McEnery, eds. Corpus Annotation: Linguistic Information from Computer Text Corpora. Addison Wesley Longman, 1997.

4

Garside, R. G., Geoffrey Leech, and Anthony Mark McEnery. Corpus Annotation: Linguistic Information from Computer Text Corpora. Taylor & Francis Group, 2016.

5

Lüdeling, Anke, Julia Ritz, Manfred Stede, and Amir Zeldes. Corpus Linguistics and Information Structure Research. Edited by Caroline Féry and Shinichiro Ishihara. Oxford University Press, 2015. http://dx.doi.org/10.1093/oxfordhb/9780199642670.013.013.

Abstract:
This chapter describes the contributions that Corpus Linguistics (the study of linguistic phenomena by means of systematically exploiting collections of naturally-occurring linguistic data) can make to IS research. It discusses issues of designing a corpus that can serve as a basis for qualitative or quantitative studies, and then turns to the central issue of data annotation: what corpora are available that have been annotated with IS-related annotations, and how can such annotations be evaluated? In case a corpus does not have direct IS annotation, can other types of annotations, especially in the form of multi-layer annotation, be used as indirect evidence for the presence of IS phenomena? Next, the present state of the art in automatic IS annotation (by means of techniques from computational linguistics) is sketched, and finally, several sample studies that exploit IS annotations are introduced briefly.
6

McEnery, Tony. Corpus Linguistics. Edited by Ruslan Mitkov. Oxford University Press, 2012. http://dx.doi.org/10.1093/oxfordhb/9780199276349.013.0024.

Abstract:
Corpus data have emerged as the raw data and benchmark for several NLP applications. A corpus is described as a large body of linguistic evidence composed of attested language use. It may be contrasted with sentences constructed from metalinguistic reflection upon language use, rather than produced as a result of communication in context. Corpora can be both spoken and written, and can be categorized as follows: monolingual, representing one language; comparable, using multiple monolingual corpora to create a comparative framework; and parallel, in which a corpus in one language is paired with translations of its data into other languages. The choice of corpus depends on the research question or the chosen application. Adding linguistic information can enhance a corpus; annotation is achieved by analysts, human or mechanical, or a combination of the two. The modern computerized corpus has been in vogue only since the 1940s. Ever since, the volume of corpus banks has risen steadily and assumed an increasingly multilingual nature.
7

Ufimtseva, Nataliya V., Iosif A. Sternin, and Elena Yu Myagkova. Russian psycholinguistics: results and prospects (1966–2021): a research monograph. Institute of Linguistics, Russian Academy of Sciences, 2021. http://dx.doi.org/10.30982/978-5-6045633-7-3.

Abstract:
The monograph reflects the problems of Russian psycholinguistics from the moment of its inception in Russia to the present day and presents its main directions that are currently developing. In addition, theoretical developments and practical results obtained in the framework of different directions and research centers are described in a concise form. The task of the book is to reflect, as far as it is possible in one edition, firstly, the history of the formation of Russian psycholinguistics; secondly, its methodology and developed methods; thirdly, the results obtained in different research centers and directions in different regions of Russia; and fourthly, to outline the main directions of the further development of Russian psycholinguistics. There is no doubt that in the theoretical, methodological and applied aspects, the main problems and the results of their development by Russian psycholinguistics have no analogues in world linguistics and psycholinguistics, or are represented by completely original concepts and methods. We have tried to show this uniqueness of the problematics and the methodological equipment of Russian psycholinguistics in this book. The main role in the formation of Russian psycholinguistics was played by the Moscow psycholinguistic school of A.A. Leontyev, which still defines the main directions of Russian psycholinguistics. Russian psycholinguistics (the theory of speech activity - TSA) is based on the achievements of Russian psychology: L.S. Vygotsky's cultural-historical approach to the analysis of mental phenomena and A.N. Leontyev's system-activity approach. Moscow is the most "psycholinguistic region" of Russia - INL RAS, Moscow State University, Moscow State Linguistic University, RUDN, Moscow State Pedagogical University, Sechenov University and other Moscow universities. Saint Petersburg psycholinguists have significant achievements, especially in the study of neurolinguistic problems and ontolinguistics. The most important feature of Russian psycholinguistics is its widespread development in the regions and the emergence of recognized psycholinguistic research centers - St. Petersburg, Tver, Saratov, Perm, Ufa, Omsk, Novosibirsk, Voronezh, Yekaterinburg, Kursk, Chelyabinsk; psycholinguistics is represented in Cherepovets, Ivanovo, Volgograd, Vyatka, Kaluga, Krasnoyarsk, Irkutsk, Vladivostok, Abakan, Maikop, Barnaul, Ulan-Ude, Yakutsk, Syktyvkar, Armavir and other cities; in Belarus - Minsk; in Ukraine - Lvov, Chernivtsi, Kharkov; in the DPR - Donetsk; and in Kazakhstan - Alma-Ata and Chimkent. Our researchers work in Bulgaria, Hungary, Vietnam, China, France, and Switzerland, and there are Russian psycholinguists in Canada, the USA, Israel, Austria and a number of other countries. All scientists from these regions and countries have contributed to the development of Russian psycholinguistics, to the development of psycholinguistic theory and methods of psycholinguistic research, and their participation has not been forgotten. We tried to present the main Russian psycholinguists in the Appendix - in the sections "Scientometrics", "Monographs and Manuals" and "Dissertations" - even if there is no information about them in the Electronic Library and RSCI. The principles of including scientists in the scientometric list are presented in the Appendix.
Our analysis of the content of the resulting monograph on psycholinguistic research in Russia allows us to draw preliminary conclusions about some of the distinctive features of Russian psycholinguistics: 1. cultural-historical approach to the analysis of mental phenomena of L.S.Vygotsky and the system-activity approach of A.N. Leontiev as methodological basis of Russian psycholinguistics; 2. theoretical nature of psycholinguistic research as a characteristic feature of Russian psycholinguistics. Our psycholinguistics has always built a general theory of the generation and perception of speech, mental vocabulary, linked specific research with the problems of ontogenesis, the relationship between language and thinking; 3. psycholinguistic studies of speech communication as an important subject of psycholinguistics; 4. attention to the psycholinguistic analysis of the text and the development of methods for such analysis; 5. active research into the ontogenesis of linguistic ability; 6. investigation of linguistic consciousness as one of the important subjects of psycholinguistics; 7. understanding the need to create associative dictionaries of different types as the most important practical task of psycholinguistics; 8. widespread use of psycholinguistic methods for applied purposes, active development of applied psycholinguistics. The review of the main directions of development of Russian psycholinguistics, carried out in this monograph, clearly shows that the direction associated with the study of linguistic consciousness is currently being most intensively developed in modern Russian psycholinguistics. As the practice of many years of psycholinguistic research in our country shows, the subject of study of psycholinguists is precisely linguistic consciousness - this is a part of human consciousness that is responsible for generating, understanding speech and keeping language in consciousness. Associative experiments are the core of most psycholinguistic techniques and are important both theoretically and practically. The following main areas of practical application of the results of associative experiments can be outlined. 1. Education. Associative experiments are the basis for constructing Mind Maps, one of the most promising tools for systematizing knowledge, assessing the quality, volume and nature of declarative knowledge (and using special techniques and skills). Methods based on smart maps are already widely used in teaching foreign languages, fast and deep immersion in various subject areas. 2. Information search, search optimization. The results of associative experiments can significantly improve the quality of information retrieval, its efficiency, as well as adaptability for a specific person (social group). When promoting sites (promoting them in search results), an associative experiment allows you to increase and improve the quality of the audience reached. 3. Translation studies, translation automation. An associative experiment can significantly improve the quality of translation, take into account intercultural and other social characteristics of native speakers. 4. Computational linguistics and automatic word processing. The results of associative experiments make it possible to reveal the features of a person's linguistic consciousness and contribute to the development of automatic text processing systems in a wide range of applications of natural language interfaces of computer programs and robotic solutions. 5. Advertising. 
The use of data on associations for specific words, slogans and texts allows you to predict and improve advertising texts. 6. Social relationships. The analysis of texts using the data of associative experiments makes it possible to assess the tonality of messages (negative / positive moods, aggression and other characteristics) based on user comments on the Internet and social networks, in the press in various projections (by individuals, events, organizations, etc.) from various social angles, to diagnose the formation of extremist ideas. 7. Content control and protection of personal data. Associative experiments improve the quality of content detection and filtering by identifying associative fields in areas subject to age restrictions, personal information, tobacco and alcohol advertising, incitement to ethnic hatred, etc. 8. Gender and individual differences. The data of associative experiments can be used to compare the reactions (and, in general, other features of thinking) between men and women, different social and age groups, representatives of different regions. The directions for the further development of Russian psycholinguistics from the standpoint of the current state of psycholinguistic science in the country are seen by us, first of all:  in the development of research in various areas of linguistic consciousness, which will contribute to the development of an important concept of speech as a verbal model of non-linguistic consciousness, in which knowledge revealed by social practice and assigned by each member of society during its inculturation is consolidated for society and on its behalf;  in the expansion of the problematics, which is formed under the influence of the growing intercultural communication in the world community, which inevitably involves the speech behavior of natural and artificial bilinguals in the new object area of psycholinguistics;  in using the capabilities of national linguistic corpora in the interests of researchers studying the functioning of non-linguistic and linguistic consciousness in speech processes;  in expanding research on the semantic perception of multimodal texts, the scope of which has greatly expanded in connection with the spread of the Internet as a means of communication in the life of modern society;  in the inclusion of the problems of professional communication and professional activity in the object area of psycholinguistics in connection with the introduction of information technologies into public practice, entailing the emergence of new professions and new features of the professional ethos;  in the further development of the theory of the mental lexicon (identifying the role of different types of knowledge in its formation and functioning, the role of the word as a unit of the mental lexicon in the formation of the image of the world, as well as the role of the natural / internal metalanguage and its specificity in speech activity);  in the broad development of associative lexicography, which will meet the most diverse needs of society and cognitive sciences. The development of associative lexicography may lead to the emergence of such disciplines as associative typology, associative variantology, associative axiology;  in expanding the spheres of applied use of psycholinguistics in social sciences, sociology, semasiology, lexicography, in the study of the brain, linguodidactics, medicine, etc. This book is a kind of summarizing result of the development of Russian psycholinguistics today. 
Each section provides a bibliography of studies on the relevant issue. The Appendix contains the scientometrics of leading Russian psycholinguists, basic monographs, psycholinguistic textbooks and dissertations defended in psycholinguistics. The content of the publications presented here is convincing evidence of the relevance of psycholinguistic topics and the effectiveness of the development of psycholinguistic problems in Russia.

Book chapters on the topic "Corpora annotation with deep linguistic information"

1

Alves, Diego, and Daniel Gomes. "Robustness of Corpus-Based Typological Strategies for Dependency Parsing." In Event Analytics across Languages and Communities. Springer Nature Switzerland, 2024. http://dx.doi.org/10.1007/978-3-031-64451-1_3.

Abstract:
This chapter presents a comparison of the corpus-based typological classification of ten European Union languages obtained using parallel corpora with one generated in a less controlled scenario, with non-parallel, automatically annotated data. First, we describe the specific pipeline that was created to extract and annotate multilingual data from the Arquivo.pt 2019 European Parliamentary Elections collection. Two new corpora for all EU languages were generated and made publicly and freely available: one composed of raw texts extracted from this collection and the other with syntactic annotation obtained automatically. Then, we present an overview of different quantitative typological approaches developed for dependency parsing improvement and select the most optimised ones to conduct our comparative analysis. Finally, we compare both scenarios using the same corpus-based strategy and show that the classification obtained using the data provided by the Arquivo.pt dataset provides valuable linguistic information for this type of study, presenting similarities with the classification based on parallel corpora. However, considering the dissimilarities observed, further analysis is required before validating this new method.
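As one hedged illustration of the kind of corpus-based typological feature such studies extract from automatically annotated data, the Python function below computes the share of direct objects that follow their verbal head in a CoNLL-U file; the choice of feature is an example introduced here, not necessarily one of the chapter's exact measures:

    def vo_ratio(conllu_path):
        """Proportion of 'obj' dependents occurring after their head."""
        post, total = 0, 0
        with open(conllu_path, encoding="utf-8") as f:
            for line in f:
                cols = line.rstrip("\n").split("\t")
                # skip comments, blank lines and multiword-token ranges
                if len(cols) != 10 or not cols[0].isdigit():
                    continue
                if cols[7] == "obj":                 # DEPREL column
                    total += 1
                    if int(cols[0]) > int(cols[6]):  # ID after HEAD
                        post += 1
        return post / total if total else None

Vectors of such ratios, one per language, can then be compared or clustered to yield typological classifications of the kind the chapter evaluates.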
2

Fang, Alex Chengyu. "Autasys: Grammatical Tagging and Cross-Tagset Mapping." In Comparing English Worldwide. Oxford University PressOxford, 1996. http://dx.doi.org/10.1093/oso/9780198235828.003.0009.

Abstract:
Ever since the advent of the first computer linguistic corpus in the 1960s, linguists and computer programmers have been working on the annotation of material thus stored. Word-class tagging, the assignment of an unambiguous indication of the grammatical word class to each word in a text, has been in great demand, not only in lexicographical and grammatical studies, but also in natural language processing (NLP), an area where the corpus-based, or more specifically, probabilistic approach is becoming increasingly popular. Taggers have flourished, and the past twenty years or so have witnessed TAGGIT (Greene and Rubin, 1971), CLAWS (Marshall, 1983; Garside et al., 1987), VOLSUNGA (DeRose, 1988), AGTS (Huang, 1991), and TOSCA (Oostdijk, 1991), to name just a few. Tagsets differing in various aspects have also come into being, with Brown (Francis, 1980), LOB (Johansson et al., 1986), and Lund (Svartvik, 1987) as the best known. Most recently, a tagset has been designed at the Survey of English Usage (SEU), University College London (Greenbaum and Ni, 1994; Greenbaum, 1995), which has been used to annotate the one-million-word British component of the International Corpus of English (ICE-GB, cf. Greenbaum, 1992). This has created an intriguing situation in corpus annotation. On the one hand, compilers of corpora vary in what they intend as the primary uses of their corpora. Grammarians, lexicographers, language teachers, and NLP researchers naturally want different information from corpus annotation: grammatical, morphological, discoursal, statistical, semantic, pragmatic, or prosodic. On the other hand, unfortunately, we have not seen any single annotation scheme that meets all these requirements. Corpora thus differently annotated according to different schemes have become 'isolated islands', rendering cross-corpora studies virtually impossible. Consequently, it is desirable that either a standard annotation scheme be agreed upon in this field, or flexible systems be designed that can readily adapt themselves to different annotation schemes.
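Cross-tagset mapping of the kind discussed here can be pictured, at its simplest, as a lookup table from one scheme's tags to another's, as in the hedged sketch below; the CLAWS-style and ICE-style tags shown are fragments chosen for illustration and do not reproduce either tagset in full:

    # Illustrative fragment of a tag mapping table (not the full tagsets).
    CLAWS_TO_ICE = {
        "NN1": "N(com,sing)",    # singular common noun
        "NN2": "N(com,plu)",     # plural common noun
        "VVD": "V(montr,past)",  # past-tense lexical verb
    }

    def map_tags(tagged, table, default="UNMAPPED"):
        # One-to-many and many-to-one correspondences make a plain table
        # lossy, which is why flexible mapping systems are called for.
        return [(word, table.get(tag, default)) for word, tag in tagged]

    print(map_tags([("dogs", "NN2"), ("barked", "VVD")], CLAWS_TO_ICE))
    # [('dogs', 'N(com,plu)'), ('barked', 'V(montr,past)')]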
3

Porter, Nick, and Akiva Quinn. "Developing the ICE Corpus Utility Program." In Comparing English Worldwide. Oxford University PressOxford, 1996. http://dx.doi.org/10.1093/oso/9780198235828.003.0007.

Abstract:
Soon after the Survey of English Usage initiated the project, it was realized that there was a need for a general corpus processing and analysis tool for the International Corpus of English. In its broadest conception, this tool was to cover the central requirements for corpus preparation and study, including corpus annotation, markup conversion and filtering, searching, concordancing, statistical analyses, subcorpus information, and subcorpus selection. The system would primarily target ICE corpora, yet would provide a range of general corpus utilities equally applicable to other corpora. The requirements for ICECUP were determined by the design and content of ICE and the corpus utilities that were to be provided. The International Corpus exists primarily to allow comparison between national and regional varieties of English; further, each component corpus is structured according to medium (spoken, manuscript, and printed) and genre (news, business, natural sciences, novels, and so on). Describing and quantifying linguistic features within national, medium, and genre categories is a key requirement. ICE corpora use Standard Generalized Markup Language (SGML) to encode typographic and content features of a text, as well as word classes, syntactic structures, and functions. Searches and concordances should be able to include any combination of these markup symbols in their search arguments, together with lexical items, punctuation, and wildcards. The citations or concordance lines shown should be able to focus attention on relevant annotations by selectively filtering out markup symbols. Making functions easy to find, and providing help to explain the options at any point in the program, are also key parts of the specification. Besides providing facilities to support linguistic research, ICECUP has to be easy to use and accessible to potential users.
4

Abdoullaev, Azamat. "How to Represent the World." In Reality, Universal Ontology and Knowledge Systems. IGI Global, 2008. http://dx.doi.org/10.4018/978-1-59904-966-3.ch010.

Abstract:
As far as human knowledge about the world is commonly given in NL expressions and as far as universal ontology is a general science of the world, the examination of its impact on natural language science and technology is among the central topics of many academic workshops and conferences. Ontologists, knowledge engineers, lexicographers, lexical semanticists, and computer scientists are attempting to integrate top-level entity classes with language knowledge presented in extensive corpora and electronic lexical resources. Such a deep quest is mostly motivated by high application potential of reality-driven models of language for knowledge communication and management, information retrieval and extraction, information exchange in software and dialogue systems, all with an ultimate view to transform the World Wide Web into a machine-readable global language resource of world knowledge, the Onto-Semantic Web. One of the practical applications of integrative ontological framework is to discover the underlying mechanisms of representing and processing language content and meaning by cognitive agents, human and artificial. Specifically, to provide the formalized algorithms or rules, whereby machines could derive or attach significance (or signification) from coded signals, both natural signs obtained by sensors and linguistic symbols.

Conference papers on the topic "Corpora annotation with deep linguistic information"

1

Stewart, Michael, and Wei Liu. "Seq2KG: An End-to-End Neural Model for Domain Agnostic Knowledge Graph (not Text Graph) Construction from Text." In 17th International Conference on Principles of Knowledge Representation and Reasoning {KR-2020}. International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/kr.2020/77.

Abstract:
Knowledge Graph Construction (KGC) from text unlocks information held within unstructured text and is critical to a wide range of downstream applications. General approaches to KGC from text are heavily reliant on the existence of knowledge bases, yet most domains do not even have an external knowledge base readily available. In many situations this results in information loss as a wealth of key information is held within "non-entities". Domain-specific approaches to KGC typically adopt unsupervised pipelines, using carefully crafted linguistic and statistical patterns to extract co-occurred noun phrases as triples, essentially constructing text graphs rather than true knowledge graphs. In this research, for the first time, in the same flavour as Collobert et al.'s seminal work of "Natural language processing (almost) from scratch" in 2011, we propose a Seq2KG model attempting to achieve "Knowledge graph construction (almost) from scratch". An end-to-end Sequence to Knowledge Graph (Seq2KG) neural model jointly learns to generate triples and resolves entity types as a multi-label classification task through deep learning neural networks. In addition, a novel evaluation metric that takes both semantic and structural closeness into account is developed for measuring the performance of triple extraction. We show that our end-to-end Seq2KG model performs on par with a state of the art rule-based system which outperformed other neural models and won the first prize of the first Knowledge Graph Contest in 2019. A new annotation scheme and three high-quality manually annotated datasets are available to help promote this direction of research.
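To give a feel for what a metric combining semantic and structural closeness might look like, here is a toy scorer that compares a predicted triple with a gold triple slot by slot using token overlap; it illustrates the idea only and is not the evaluation metric proposed in the paper:

    def jaccard(a, b):
        """Token-overlap similarity between two phrases."""
        x, y = set(a.lower().split()), set(b.lower().split())
        return len(x & y) / len(x | y) if x | y else 0.0

    def triple_score(pred, gold):
        """Mean slot-wise similarity of (head, relation, tail) triples:
        structure is respected (slots compared pairwise) while near-matches
        still earn partial semantic credit."""
        return sum(jaccard(p, g) for p, g in zip(pred, gold)) / 3

    pred = ("IBM", "acquired", "Red Hat")
    gold = ("IBM", "acquired", "Red Hat Inc")
    print(round(triple_score(pred, gold), 3))  # 1.0, 1.0, 0.667 -> 0.889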