Journal articles on the topic 'Corpora annotation with deep linguistic information'

1

DORR, BONNIE J., REBECCA J. PASSONNEAU, DAVID FARWELL, et al. "Interlingual annotation of parallel text corpora: a new framework for annotation and evaluation." Natural Language Engineering 16, no. 3 (2010): 197–243. http://dx.doi.org/10.1017/s1351324910000070.

Abstract:
This paper focuses on an important step in the creation of a system of meaning representation and the development of semantically annotated parallel corpora, for use in applications such as machine translation, question answering, text summarization, and information retrieval. The work described below constitutes the first effort of any kind to annotate multiple translations of foreign-language texts with interlingual content. Three levels of representation are introduced: deep syntactic dependencies (IL0), intermediate semantic representations (IL1), and a normalized representation th
2

Zolotov, Pitirim Y. "Linguodidactic properties of corpus technologies." Tambov University Review. Series: Humanities, no. 185 (2020): 75–82. http://dx.doi.org/10.20310/1810-0201-2020-25-185-75-82.

Abstract:
For the last two decades, corpus technologies, understood as a combination of means and methods of processing and analyzing data of electronic linguistic corpora, as a type of information and communication technology, have attracted great interest of researchers and teachers of foreign languages. We explain the concepts of corpus linguistics, corpus technology, linguistic corpus, concordance. The methods of studying case technologies, which are annotation, abstraction, and analysis, are considered. The advantages of linguistic corpora are given. The history of the emergence and development o
3

Erjavec, Tomaž. "Označevanje korpusov." Jezik in slovstvo 48, no. 3-4 (2024): 61–76. http://dx.doi.org/10.4312/jis.48.3-4.61-76.

Abstract:
Ordered collections of machine-readable texts, corpora, are useful in various branches of linguistics. The present paper focuses on the machine-readable form of corpora, above all on their annotation, i.e. adding interpretative information to the text in the corpus. The annotation presented is based on taking into consideration the international standards from this field, which contributes to better documentation and verifiability, easier use of processing applications, and better interchange and longevity. In the first part, corpus encoding standards, above all XML (eXtensible Markup Language)
4

Alayiaboozar, Elham. "Indicators and stages of building a linguistic corpus: written and spoken varieties." Linguistics of Iranian Dialects 4, no. 2 (2019): 267–90. https://doi.org/10.5281/zenodo.14033040.

Abstract:
This research aims to assist researchers in the construction of various linguistic corpora by collecting information related to the indicators and stages of corpus building. In this article, after reviewing the opinions of researchers who have constructed corpora in different languages, the general indicators for building linguistic corpora are discussed. These indicators pertain to the construction of textual and spoken varieties of the corpus, including sampling, representativeness, balance, size, type of corpus, and homogeneity. Subsequently, the process of constructing a textual corpus is
5

Iomdin, Leonid. "Microsyntactic Annotation of Corpora and its Use in Computational Linguistics Tasks." Journal of Linguistics/Jazykovedný casopis 68, no. 2 (2017): 169–78. http://dx.doi.org/10.1515/jazcas-2017-0027.

Abstract:
Microsyntax is a linguistic discipline dealing with idiomatic elements whose important properties are strongly related to syntax. In a way, these elements may be viewed as transitional entities between the lexicon and the grammar, which explains why they are often underrepresented in both of these resource types: the lexicographer fails to see such elements as full-fledged lexical units, while the grammarian finds them too specific to justify the creation of individual well-developed rules. As a result, such elements are poorly covered by linguistic models used in advanced modern comp
6

Jiménez-Zafra, Salud María, Roser Morante, María Teresa Martín-Valdivia, and L. Alfonso Ureña-López. "Corpora Annotated with Negation: An Overview." Computational Linguistics 46, no. 1 (2020): 1–52. http://dx.doi.org/10.1162/coli_a_00371.

Abstract:
Negation is a universal linguistic phenomenon with a great qualitative impact on natural language processing applications. The availability of corpora annotated with negation is essential to training negation processing systems. Currently, most corpora have been annotated for English, but the presence of languages other than English on the Internet, such as Chinese or Spanish, is greater every day. In this study, we present a review of the corpora annotated with negation information in several languages with the goal of evaluating what aspects of negation have been annotated and how compatible
7

Cantalini, Giorgina, and Massimo Moneglia. "annotation of gesture and gesture / prosody synchronization in multimodal speech corpora." Journal of Speech Sciences 9 (September 9, 2020): 07–30. http://dx.doi.org/10.20396/joss.v9i00.14956.

Abstract:
This paper was written with the aim of highlighting the functional and structural correlations between gesticulation and prosody, focusing on gesture / prosody synchronization in spontaneous spoken Italian. The gesture annotation used follows the LASG model (Bressem et al. 2013), while the prosodic annotation focuses on the identification of terminal and non-terminal prosodic breaks which, according to L-AcT (Cresti, 2000; Moneglia & Raso 2014), determine speech act boundaries and the information structure, respectively. Gesticulation co-occurs with speech in about 90% of the speech flow e
8

Hajič, Jan, Eva Hajičová, Jiří Mírovský, and Jarmila Panevová. "Linguistically Annotated Corpus as an Invaluable Resource for Advancements in Linguistic Research: A Case Study." Prague Bulletin of Mathematical Linguistics 106, no. 1 (2016): 69–124. http://dx.doi.org/10.1515/pralin-2016-0012.

Abstract:
A case study based on experience in linguistic investigations using annotated monolingual and multilingual text corpora; the “cases” include a description of language phenomena belonging to different layers of the language system: morphology, surface and underlying syntax, and discourse. The analysis is based on a complex annotation of syntax, semantic functions, information structure and discourse relations of the Prague Dependency Treebank, a collection of annotated Czech texts. We want to demonstrate that annotation of corpus is not a self-contained goal: in order to be consistent,
9

Novák, Václav. "Semantic Network Manual Annotation and its Evaluation." Prague Bulletin of Mathematical Linguistics 90, no. 1 (2008): 69–82. http://dx.doi.org/10.2478/v10108-009-0008-4.

Abstract:
The present contribution is a brief extract of (Novák, 2008). The Prague Dependency Treebank (PDT) is a valuable resource of linguistic information annotated on several layers. These layers range from morphemic to deep and they should contain all the linguistic information about the text. The natural extension is to add a semantic layer suitable as a knowledge base for tasks like question answering, information extraction etc. In this paper I set up criteria for this representation, explore the possible formalisms for this task and discuss
10

Druskat, Stephan, Thomas Krause, Clara Lachenmaier, and Bastian Bunzeck. "Hexatomic: An extensible, OS-independent platform for deep multi-layer linguistic annotation of corpora." Journal of Open Source Software 8, no. 86 (2023): 4825. http://dx.doi.org/10.21105/joss.04825.

11

Zeng, Jinshan, Xianchao Tong, Xianglong Yu, Wenyan Xiao, and Qing Huang. "InterpretARA: Enhancing Hybrid Automatic Readability Assessment with Linguistic Feature Interpreter and Contrastive Learning." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (2024): 19497–505. http://dx.doi.org/10.1609/aaai.v38i17.29921.

Abstract:
The hybrid automatic readability assessment (ARA) models that combine deep and linguistic features have recently received rising attention due to their impressive performance. However, the utilization of linguistic features is not fully realized, as ARA models frequently concentrate excessively on numerical values of these features, neglecting valuable structural information embedded within them. This leads to limited contribution of linguistic features in these hybrid ARA models, and in some cases, it may even result in counterproductive outcomes. In this paper, we propose a novel hybrid ARA
12

Finkel, Raphael, Daniel Kaufman, and Ahmed Shamim. "Analyzing Code-mixing in Linguistic Corpora Using Kratylos." Journal on Computing and Cultural Heritage 15, no. 1 (2022): 1–15. http://dx.doi.org/10.1145/3480238.

Abstract:
Code-switching, code-mixing, and, more generally, multilingualism pose technological challenges for language documentation, the sub-discipline of linguistics that deals with the annotation and basic analysis of field recordings and other primary data. We focus here on a case study involving code-mixing in the endangered Koda language, which poses special problems for morphosyntactic analysis. We offer a robust approach to multilingual annotations that involves a combination of the popular open source software FieldWorks Language Explorer (FLEx) with Kratylos, a web-based corpus tool for displa
13

Buntman, Nadezhda V., Anna S. Borisova, and Yulia A. Darovskikh. "Verb database: Structure, clusters and options." Russian Journal of Linguistics 27, no. 4 (2023): 981–1004. http://dx.doi.org/10.22363/2687-0088-35812.

Abstract:
The content and volume of language corpora provide an opportunity to obtain reliable information about the real use of a particular linguistic unit. Nowadays, there is a large number of corpora in different languages, and their formation technologies are being improved. Nevertheless, some problems and limitations arise when using these resources in comparative studies. Corpora users need to work with annotated data submitted to tagging through annotation protocols. The article presents the structure and functionality of the supracorpora verb database (SVD) developed on the basis of a parallel Russ
14

Santamaría García, Carmen. "Bricolage assembling." International Journal of Corpus Linguistics 16, no. 3 (2011): 345–70. http://dx.doi.org/10.1075/ijcl.16.3.04san.

Abstract:
This article illustrates the use of spoken corpora for a contrastive study of casual conversation in English and Spanish. It models an eclectic methodology for cross-linguistic comparison at the level of discourse, specifically of exchange structures, by drawing upon analytic resources from corpus linguistics (CL), conversation analysis (CA) and discourse analysis (DA). This combination of perspectives presents challenges and limitations which will be discussed and exemplified through a case study that explores agreement and disagreement sequences. English data have been retrieved from the San
15

Rackevičienė, Sigita, Liudmila Mockienė, Andrius Utka, and Aivaras Rokas. "Methodological Framework for the Development of an English-Lithuanian Cybersecurity Termbase." Studies about Languages, no. 39 (November 27, 2021): 85–92. http://dx.doi.org/10.5755/j01.sal.1.39.29156.

Abstract:
The aim of the paper is to present a methodological framework for the development of an English-Lithuanian bilingual termbase in the cybersecurity domain, which can be applied as a model for other language pairs and other specialised domains. It is argued that the presented methodological approach can ensure creation of high-quality bilingual termbases even with limited available resources. The paper touches upon the methods and problems of dataset (corpora) compilation, terminology annotation, automatic bilingual term extraction (BiTE) and alignment, knowledge-rich context extraction, and lin
16

Zeng, Jinshan, Xianglong Yu, Xianchao Tong, and Wenyan Xiao. "Self-Supervised Collaborative Information Bottleneck for Text Readability Assessment." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 24 (2025): 25814–22. https://doi.org/10.1609/aaai.v39i24.34774.

Abstract:
Text readability assessment involves categorizing texts based on readers' comprehension levels. Hybrid automatic readability assessment (ARA) models, combining deep and linguistic features, have recently attracted rising attention due to their impressive performance. However, existing hybrid ARA models generally ignore the specific-intrinsic information of deep and linguistic representations, and cannot fully explore their common-intrinsic information. In this paper, we introduce a self-supervised collaborative information bottleneck (SCIB) module for ARA to address these issues. Specifically,
17

ГОЛОЩУК, С. "КОРПУСНА ЛІНГВІСТИКА: СУЧАСНИЙ СТАН ТА ПЕРСПЕКТИВИ ДОСЛІДЖЕНЬ" [Corpus linguistics: current state and research prospects]. Current issues of linguistics and translation studies 22 (December 2, 2021): 33–36. https://doi.org/10.31891/2415-7929-2021-22-7.

Abstract:
The article deals with the analysis of the characteristic features of corpus linguistics. Corpus linguistics is the study of language data on a large scale - the computer-aided analysis of very extensive collections of transcribed utterances or written texts. The article outlines the basic methods of corpus linguistics, explains the influence of the generative linguists on corpus linguistics, and surveys the major approaches to the use of corpus data. Clear and detailed explanations lay out the key issues of method and theory in contemporary corpus linguistics. Corpus linguistics is viewed as
18

Kozak, Ivan, and Nataliia Kunanets. "Information Systems for Working with Text Corpora: Classification and Comparative Analysis." Vìsnik Nacìonalʹnogo unìversitetu "Lʹvìvsʹka polìtehnìka". Serìâ Ìnformacìjnì sistemi ta merežì 16 (November 21, 2024): 273–89. https://doi.org/10.23939/sisn2024.16.273.

Abstract:
The article examines information systems for working with text corpora, particularly their application for linguistic analysis and management of large text data. Information systems for supporting text corpora are analyzed, classified, and compared based on their historical development and functional capabilities. The main focus is comparing the two most common systems that can be distinguished by functionality as corpus managers: ‘AntConc’ and ‘Sketch Engine’. These are evaluated based on key criteria: corpus creation, text processing, annotation, storage and export, data analysis and visuali
19

Weisser, Martin. "Annotating the ICE corpora pragmatically – preliminary issues & steps." ICAME Journal 41, no. 1 (2017): 181–214. http://dx.doi.org/10.1515/icame-2017-0008.

Abstract:
Since the inception of the ICE project in 1990, ICE corpora have been used extensively in the investigation and comparison of varieties of English on different linguistic levels. These levels, however, have so far primarily been restricted to lexis and lexico-grammar, while relatively little has to date been achieved in the investigation of pragmatic strategies used by the speakers in these corpora. One of the main reasons for this shortcoming is a lack of suitable annotation that would make such a detailed pragmatic comparison possible. This paper will propose a suitable model and fo
20

Beloglazova, E. V., and N. K. Genidze. "ROSSICA V. BELAROSSICA: UNIVERSAL AND CULTURE-SPECIFIC FEATURES OF RUSSIA- AND BELARUS-CENTERED DISCOURSES." Voprosy Kognitivnoy Lingvistiki, no. 4 (2023): 108–15. http://dx.doi.org/10.20916/1812-3228-2023-4-108-115.

Abstract:
The paper focuses on the comparison of two variations of foreign-culture-oriented discourse - the English-language descriptions of Russia and Belarus based on ad hoc corpora of titles Rossica-T and Belarossica-T. The research aims at identifying both universal and culture-specific markers of foreign-culture-oriented discourse. The methodology employed combines automatic processing of the textual data with the AntConc corpus manager tools, as well as the deep analysis and manual annotation of the data. The particular foci of the research are the corpora keywords, key clusters, collocates analys
21

Becker, Maria, Michael Bender, and Marcus Müller. "Classifying heuristic textual practices in academic discourse." International Journal of Corpus Linguistics 25, no. 4 (2020): 426–60. http://dx.doi.org/10.1075/ijcl.19097.bec.

Abstract:
In this paper, we investigate how deep learning techniques can be applied to discourse pragmatics. As a testcase we analyse heuristic textual practices, defined as linguistic implementations of decision routines in research processes in academic discourse. We develop a complex annotation scheme of pragmalinguistic categories on different levels of granularity and manually annotate a corpus of texts across various scientific disciplines. This is the basis for training recurrent neural networks to classify heuristic textual practices. Our experiments show that the annotation categories
22

Batista, F., H. Moniz, I. Trancoso, N. Mamede, and A. I. Mata. "Extending automatic transcripts in a unified data representation towards a prosodic-based metadata annotation and evaluation." Journal of Speech Sciences 2, no. 2 (2021): 113–36. http://dx.doi.org/10.20396/joss.v2i2.15035.

Abstract:
This paper describes a framework that extends automatic speech transcripts in order to accommodate relevant information coming from manual transcripts, the speech signal itself, and other resources, like lexica. The proposed framework automatically collects, relates, computes, and stores all relevant information together in a self-contained data source, making it possible to easily provide a wide range of interconnected information suitable for speech analysis, training, and evaluating a number of automatic speech processing tasks. The main goal of this framework is to integrate different ling
23

Crawford Camiciottoli, Belinda. "Persuasion in Earnings Calls: A Diachronic Pragmalinguistic Analysis." International Journal of Business Communication 55, no. 3 (2017): 275–92. http://dx.doi.org/10.1177/2329488417735644.

Abstract:
This study investigates persuasive language in earnings calls. These are routine events organized by companies to report their quarterly financial results. The analysis is based on the earnings calls of 10 companies in the third quarter of 2009, when financial markets were still suffering from the global financial crisis, and the third quarter of 2013 when markets had largely recovered. Earnings call transcripts were compiled in two parallel corpora (Crisis Corpus and Recovery Corpus), thus providing a diachronic perspective. Semantic annotation software was used to extract pragmalinguistic re
24

Silvestri, Stefano, Francesco Gargiulo, and Mario Ciampi. "Iterative Annotation of Biomedical NER Corpora with Deep Neural Networks and Knowledge Bases." Applied Sciences 12, no. 12 (2022): 5775. http://dx.doi.org/10.3390/app12125775.

Abstract:
The large availability of clinical natural language documents, such as clinical narratives or diagnoses, requires the definition of smart automatic systems for their processing and analysis, but the lack of annotated corpora in the biomedical domain, especially in languages different from English, makes it difficult to exploit the state-of-art machine-learning systems to extract information from such kinds of documents. For these reasons, healthcare professionals lose big opportunities that can arise from the analysis of this data. In this paper, we propose a methodology to reduce the manual e
25

Neptune, Nathalie, and Josiane Mothe. "Automatic Annotation of Change Detection Images." Sensors 21, no. 4 (2021): 1110. http://dx.doi.org/10.3390/s21041110.

Abstract:
Earth observation satellites have been capturing a variety of data about our planet for several decades, making many environmental applications possible such as change detection. Recently, deep learning methods have been proposed for urban change detection. However, there has been limited work done on the application of such methods to the annotation of unlabeled images in the case of change detection in forests. This annotation task consists of predicting semantic labels for a given image of a forested area where change has been detected. Currently proposed methods typically do not provide ot
26

du Toit, Jakobus S., and Martin J. Puttkammer. "Developing Core Technologies for Resource-Scarce Nguni Languages." Information 12, no. 12 (2021): 520. http://dx.doi.org/10.3390/info12120520.

Abstract:
The creation of linguistic resources is crucial to the continued growth of research and development efforts in the field of natural language processing, especially for resource-scarce languages. In this paper, we describe the curation and annotation of corpora and the development of multiple linguistic technologies for four official South African languages, namely isiNdebele, Siswati, isiXhosa, and isiZulu. Development efforts included sourcing parallel data for these languages and annotating each on token, orthographic, morphological, and morphosyntactic levels. These sets were in turn used t
27

Bossaglia, Giulia, and Lúcia de Almeida Ferrari. "C-oral-Brasil project." Journal of Speech Sciences 7, no. 2 (2019): 65–77. http://dx.doi.org/10.20396/joss.v7i2.15000.

Abstract:
In this paper we present different resources for the study of spoken Brazilian Portuguese, developed within the C-ORAL-BRASIL project. The C-ORAL-BRASIL stemmed from the European C-ORAL-ROM project (Cresti & Moneglia, 2005), which has compiled spoken corpora of Italian, French, Spanish, and European Portuguese. The corpora of the C-ORAL family represent adequate tools for the analysis of spoken language, for they are provided not only with the transcripts of the recorded sessions (with prosodic breaks’ annotation), but also with their audio files and the text-to-speech alignment. So far, t
28

Bhanusree, Yalamanchili, Samayamantula Srinivas Kumar, and Anne Koteswara Rao. "Time-Distributed Attention-Layered Convolution Neural Network with Ensemble Learning using Random Forest Classifier for Speech Emotion Recognition." Journal of Information and Communication Technology 22, no. 1 (2023): 49–76. http://dx.doi.org/10.32890/jict2023.22.1.3.

Abstract:
Speech Emotion Recognition (SER) is a field of identifying human emotions from human speech utterances. Human speech utterances are a combination of linguistic and non-linguistic information. Nonlinguistic SER provides a generalized solution in human–computer interaction applications as it overcomes the language barrier. Machine learning and deep learning techniques were previously proposed for classifying emotions using handpicked features. To achieve effective and generalized SER, feature extraction can be performed using deep neural networks and ensemble learning for classification. The propose
29

Warikoo, Neha, Yung-Chun Chang, and Shang-Pin Ma. "Gradient Boosting over Linguistic-Pattern-Structured Trees for Learning Protein–Protein Interaction in the Biomedical Literature." Applied Sciences 12, no. 20 (2022): 10199. http://dx.doi.org/10.3390/app122010199.

Abstract:
Protein-based studies contribute significantly to gathering functional information about biological systems; therefore, the protein–protein interaction detection task is one of the most researched topics in the biomedical literature. To this end, many state-of-the-art systems using syntactic tree kernels (TK) and deep learning have been developed. However, these models are computationally complex and have limited learning interpretability. In this paper, we introduce a linguistic-pattern-representation-based Gradient-Tree Boosting model, i.e., LpGBoost. It uses linguistic patterns to optimize
30

Ye, Peng, Yujin Jiang, and Yadi Wang. "CHTopo: A Multi-Source Large-Scale Chinese Toponym Annotation Corpus." Information 16, no. 7 (2025): 610. https://doi.org/10.3390/info16070610.

Abstract:
Toponyms are fundamental geographical resources characterized by their spatial attributes, distinct from general nouns. While natural language provides rich toponymic data beyond traditional surveying methods, its qualitative ambiguity and inherent uncertainty challenge systematic extraction. Traditional toponym recognition methods based on part-of-speech tagging only focus on the surface-level features of words, failing to effectively handle complex scenarios such as alias nesting, metonymy ambiguity, and mixed punctuation. This leads to the loss of toponym semantic integrity and deviations i
31

Mahany, Ahmed, Heba Khaled, Nouh Sabri Elmitwally, Naif Aljohani, and Said Ghoniemy. "Negation and Speculation in NLP: A Survey, Corpora, Methods, and Applications." Applied Sciences 12, no. 10 (2022): 5209. http://dx.doi.org/10.3390/app12105209.

Abstract:
Negation and speculation are universal linguistic phenomena that affect the performance of Natural Language Processing (NLP) applications, such as those for opinion mining and information retrieval, especially in biomedical data. In this article, we review the corpora annotated with negation and speculation in various natural languages and domains. Furthermore, we discuss the ongoing research into recent rule-based, supervised, and transfer learning techniques for the detection of negating and speculative content. Many English corpora for various domains are now annotated with negation and spe
32

Alexandridis, Georgios, Iraklis Varlamis, Konstantinos Korovesis, George Caridakis, and Panagiotis Tsantilas. "A Survey on Sentiment Analysis and Opinion Mining in Greek Social Media." Information 12, no. 8 (2021): 331. http://dx.doi.org/10.3390/info12080331.

Abstract:
As the amount of content that is created on social media is constantly increasing, more and more opinions and sentiments are expressed by people in various subjects. In this respect, sentiment analysis and opinion mining techniques can be valuable for the automatic analysis of huge textual corpora (comments, reviews, tweets etc.). Despite the advances in text mining algorithms, deep learning techniques, and text representation models, the results in such tasks are very good for only a few high-density languages (e.g., English) that possess large training corpora and rich linguistic resources;
33

Lortkipanidze, Liana L., and Anna R. Chutkerashvili. "Automated Development of a Grammatical Dictionary for Georgian Dialects." European Journal of Engineering Science and Technology 8, no. 1 (2025): 13–25. https://doi.org/10.33422/ejest.v8i1.1553.

Abstract:
This paper presents an automated system for compiling grammatical dictionaries of the Georgian language and its dialects. Unlike traditional dictionaries, grammatical dictionaries include not only base word forms but also complete paradigms, offering detailed morphological and syntactic information. This is particularly crucial for agglutinative-inflectional languages such as Georgian, where word forms vary significantly depending on context. The system applies a dictionary-based approach to expand lexical resources by identifying words with shared grammatical markers and integrates an innovat
34

Gusarenko, S. V., and M. K. Gusarenko. "On the corpus of speech samples with errors in the use of Russian as a foreign language: methods of data representation and deep markup parameters." Гуманитарные и юридические исследования 9, no. 4 (2022): 650–58. http://dx.doi.org/10.37493/2409-1030.2022.4.17.

Abstract:
The purpose of the study, the results of which are presented in the article, is to develop the optimal composition and method of presenting data in the developed corpus of Russian speech samples with errors made by foreign students. The development of such a corpus is conditioned, firstly, by the need for a scientific description of erroneous linguistic expressions, as all significant facts of the use of the language are currently being described, and secondly, by the need to create a unified database of systematized data on errors in the speech of Russian language learners for linguodidactic
35

Simov, Kiril, and Petya Osenova. "Special Thematic Section on Semantic Models for Natural Language Processing (Preface)." Cybernetics and Information Technologies 18, no. 1 (2018): 93–94. http://dx.doi.org/10.2478/cait-2018-0008.

Abstract:
With the availability of large language data online, cross-linked lexical resources (such as BabelNet, Predicate Matrix and UBY) and semantically annotated corpora (SemCor, OntoNotes, etc.), more and more applications in Natural Language Processing (NLP) have started to exploit various semantic models. The semantic models have been created on the base of LSA, clustering, word embeddings, deep learning, neural networks, etc., and abstract logical forms, such as Minimal Recursion Semantics (MRS) or Abstract Meaning Representation (AMR), etc. Additionally, the Linguistic Linked Open Data
APA, Harvard, Vancouver, ISO, and other styles
36

Kuttaiyapillai, Dhanasekaran, Anand Madasamy, Shobanadevi Ayyavu, and Md Shohel Sayeed. "Clinical named entity extraction for extracting information from medical data." Indonesian Journal of Electrical Engineering and Computer Science 35, no. 3 (2024): 1722. http://dx.doi.org/10.11591/ijeecs.v35.i3.pp1722-1731.

Abstract:
Clinical named entity extraction (NER) based on deep learning gained much attention among researchers and data analysts. This paper proposes a NER approach to extract valuable Parkinson’s disease-related information. To develop an effective NER method and to handle problems in disease data analytics, a unique NER technique applies a “recognize-map-extract (RME)” mechanism and aims to deal with complex relationships present in the data. Due to the fast-growing medical data, there is a challenge in the development of suitable deep-learning methods for NER. Furthermore, the traditional machine le
37

ZENNAKI, O., N. SEMMAR, and L. BESACIER. "A neural approach for inducing multilingual resources and natural language processing tools for low-resource languages." Natural Language Engineering 25, no. 1 (2018): 43–67. http://dx.doi.org/10.1017/s1351324918000293.

Abstract:
This work focuses on the rapid development of linguistic annotation tools for low-resource languages (languages that have no labeled training data). We experiment with several cross-lingual annotation projection methods using recurrent neural networks (RNN) models. The distinctive feature of our approach is that our multilingual word representation requires only a parallel corpus between source and target languages. More precisely, our approach has the following characteristics: (a) it does not use word alignment information, (b) it does not assume any knowledge about target languages
38

YOON, JUNTAE, KEY-SUN CHOI, and MANSUK SONG. "A corpus-based approach for Korean nominal compound analysis based on linguistic and statistical information." Natural Language Engineering 7, no. 3 (2001): 251–70. http://dx.doi.org/10.1017/s1351324901002686.

Abstract:
The syntactic structure of a nominal compound must be analyzed first for its semantic interpretation. In addition, the syntactic analysis of nominal compounds is very useful for NLP applications such as information extraction, since a nominal compound often has a similar linguistic structure with a simple sentence, as well as representing concrete and compound meaning of an object with several nouns combined. In this paper, we present a novel model for structural analysis of nominal compounds using linguistic and statistical knowledge which is coupled based on lexical information. That is, the
39

Sahraoui, Maya, Marc Pignal, Régine Vignes Lebbe, and Vincent Guigue. "NEARSIDE: Structured kNowledge Extraction frAmework from SpecIes DEscriptions." Biodiversity Information Science and Standards 6 (September 7, 2022): e94297. https://doi.org/10.3897/biss.6.94297.

Abstract:
Species descriptions are stored in textual form in corpora such as in floras and faunas, but this large amount of information cannot be used directly by algorithms, nor can it be linked to other data sources. The production of knowledge bases expressing structured data can benefit from collaborative and easy-to-use platforms like Xper3 (Vignes-Lebbe et al. 2017, Kerner and Vignes 2019, Saucède et al. 2021) but is very time-consuming at the human level. It is therefore mandatory for this task to make the information contained in species descriptions measurable and compatible with computer techn
40

Bogdanchikov, Andrey, Dauren Ayazbayev, and Iraklis Varlamis. "Classification of Scientific Documents in the Kazakh Language Using Deep Neural Networks and a Fusion of Images and Text." Big Data and Cognitive Computing 6, no. 4 (2022): 123. http://dx.doi.org/10.3390/bdcc6040123.

Abstract:
The rapid development of natural language processing and deep learning techniques has boosted the performance of related algorithms in several linguistic and text mining tasks. Consequently, applications such as opinion mining, fake news detection or document classification that assign documents to predefined categories have significantly benefited from pre-trained language models, word or sentence embeddings, linguistic corpora, knowledge graphs and other resources that are in abundance for the more popular languages (e.g., English, Chinese, etc.). Less represented languages, such as the Kaza
41

Dhanasekaran, Kuttaiyapillai, Anand Madasamy, Shobanadevi Ayyavu, and Md Shohel Sayeed. "Clinical named entity extraction for extracting information from medical data." Indonesian Journal of Electrical Engineering and Computer Science 35, no. 3 (2024): 1722–31. https://doi.org/10.11591/ijeecs.v35.i3.pp1722-1731.

Abstract:
Clinical named entity extraction (NER) based on deep learning gained much attention among researchers and data analysts. This paper proposes a NER approach to extract valuable Parkinson’s disease-related information. To develop an effective NER method and to handle problems in disease data analytics, a unique NER technique applies a “recognize-map-extract (RME)” mechanism and aims to deal with complex relationships present in the data. Due to the fast-growing medical data, there is a challenge in the development of suitable deep-learning methods for NER. Furthermore, the trad
42

Abdelmageed, Nora, Felicitas Löffler, Leila Feddoul, et al. "BiodivNERE: Gold standard corpora for named entity recognition and relation extraction in the biodiversity domain." Biodiversity Data Journal 10 (October 7, 2022): e89481. https://doi.org/10.3897/BDJ.10.e89481.

Abstract:
Biodiversity is the assortment of life on earth covering evolutionary, ecological, biological, and social forms. To preserve life in all its variety and richness, it is imperative to monitor the current state of biodiversity and its change over time and to understand the forces driving it. This need has resulted in numerous works being published in this field. With this, a large amount of textual data (publications) and metadata (e.g. dataset description) has been generated. To support the management and analysis of these data, two techniques from computer science are of interest, namely Named
43

Mouratidis, Despoina, Katia Lida Kermanidis, and Vilelmini Sosoni. "Innovatively Fused Deep Learning with Limited Noisy Data for Evaluating Translations from Poor into Rich Morphology." Applied Sciences 11, no. 2 (2021): 639. http://dx.doi.org/10.3390/app11020639.

Abstract:
Evaluation of machine translation (MT) into morphologically rich languages has not been well studied despite its importance. This paper proposes a classifier, that is, a deep learning (DL) schema for MT evaluation, based on different categories of information (linguistic features, natural language processing (NLP) metrics and embeddings), by using a model for machine learning based on noisy and small datasets. The linguistic features are string based for the language pairs English (EN)–Greek (EL) and EN–Italian (IT). The paper also explores the linguistic differences that affect evaluation acc
45

HACHEY, B., C. GROVER, and R. TOBIN. "Datasets for generic relation extraction." Natural Language Engineering 18, no. 1 (2011): 21–59. http://dx.doi.org/10.1017/s1351324911000106.

Abstract:
A vast amount of usable electronic data is in the form of unstructured text. The relation extraction task aims to identify useful information in text (e.g. PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format such as a relational database or RDF triplestore that can be more effectively used for querying and automated reasoning. A number of resources have been developed for training and evaluating automatic systems for relation extraction in different domains. However, comparative evaluation is impeded by the fact that these corpora use different markup for
46

Moniri, Sara, Tobias Schlosser, and Danny Kowerko. "Investigating the Challenges and Opportunities in Persian Language Information Retrieval through Standardized Data Collections and Deep Learning." Computers 13, no. 8 (2024): 212. http://dx.doi.org/10.3390/computers13080212.

Abstract:
The Persian language, also known as Farsi, is distinguished by its intricate morphological richness, yet it contends with a paucity of linguistic resources. With an estimated 110 million speakers, it finds prevalence across Iran, Tajikistan, Uzbekistan, Iraq, Russia, Azerbaijan, and Afghanistan. However, despite its widespread usage, scholarly investigations into Persian document retrieval remain notably scarce. This circumstance is primarily attributed to the absence of standardized test collections, which impedes the advancement of comprehensive research endeavors within this realm. As data
47

Koseska-Toszewa, Violetta, and Roman Roszko. "Języki słowiańskie i litewski w korpusach równoległych Clarin-PL." Studia z Filologii Polskiej i Słowiańskiej 51 (December 31, 2016): 191–217. http://dx.doi.org/10.11649/sfps.2016.011.

Abstract:
Slavic languages and the Lithuanian language in the Clarin-PL parallel corpora. The Clarin Eric and Clarin-PL strategic scientific purpose is to support humanistic research in a multicultural and multilingual Europe. Polish researchers put the emphasis on building a bridge between the Polish language and Polish linguistic technologies and other European languages and their linguistic technologies. So far, the Polish scientific community has mainly focused on Polish-English connections. Clarin-PL has been developing the first and only multilingual corpora of the Polish language in conjunction wit
48

Hahn, Udo, and Michel Oleynik. "Medical Information Extraction in the Age of Deep Learning." Yearbook of Medical Informatics 29, no. 01 (2020): 208–20. http://dx.doi.org/10.1055/s-0040-1702001.

Abstract:
Objectives: We survey recent developments in medical Information Extraction (IE) as reported in the literature from the past three years. Our focus is on the fundamental methodological paradigm shift from standard Machine Learning (ML) techniques to Deep Neural Networks (DNNs). We describe applications of this new paradigm concentrating on two basic IE tasks, named entity recognition and relation extraction, for two selected semantic classes—diseases and drugs (or medications)—and relations between them. Methods: For the time period from 2017 to early 2020, we searched for relevant publication
49

Maestre, María Miró, Marta Vicente, Elena Lloret, and Armando Suárez Cueto. "Extracting Narrative Patterns in Different Textual Genres: A Multilevel Feature Discourse Analysis." Information 14, no. 1 (2022): 28. http://dx.doi.org/10.3390/info14010028.

Abstract:
We present a data-driven approach to discover and extract patterns in textual genres with the aim of identifying whether there is an interesting variation of linguistic features among different narrative genres depending on their respective communicative purposes. We want to achieve this goal by performing a multilevel discourse analysis according to (1) the type of feature studied (shallow, syntactic, semantic, and discourse-related); (2) the texts at a document level; and (3) the textual genres of news, reviews, and children’s tales. To accomplish this, several corpora from the three textual
50

Aerts, Diederik, Suzette Geriente, Roberto Leporini, and Sandro Sozzo. "Bell’s Inequalities and Entanglement in Corpora of Italian Language." Entropy 27, no. 7 (2025): 656. https://doi.org/10.3390/e27070656.

Abstract:
We analyse the results of three information retrieval tests on conceptual combinations that we have recently performed using corpora of Italian language. Each test has the form of a ‘Bell-type test’ and was aimed at identifying ‘quantum entanglement’ in the combination, or composition, of two concepts. In the first two tests, we studied the Italian translation of the combination The Animal Acts, while in the third test, we studied the Italian translation of the combination The Animal eats the Food. We found a significant violation of Bell’s inequalities in all tests. Empirical patterns confirm