Journal articles on the topic 'Noisy Texts'

Author: Grafiati

Published: 2 June 2025

Last updated: 31 July 2025

Consult the top 50 journal articles for your research on the topic 'Noisy Texts.'

1

Abainia, K. "Topic Identification of Noisy Texts: Statistical Approaches." International Journal of Hidden Data Mining and Scientific Knowledge Discovery 01, no. 01 (2015): 2. https://doi.org/10.5281/zenodo.20362.

Abstract:

This paper deals with the problem of automatic theme identification of noisy Arabic texts. Actually, there exist several works in this field based on statistical and machine learning approaches for different text categories. Unfortunately, most of the proposed approaches are suitable in clean and long texts. In this investigation, we carried out a comparative study between two different statistical approaches based on tf-idf. Hence, different configurations were used in both approaches to provide a large comparison. Furthermore, an in-house corpus called ANTSIX was created to evaluate the prop

2

Doval, Yerai, Jesús Vilares, and Carlos Gómez-Rodríguez. "Towards Robust Word Embeddings for Noisy Texts." Applied Sciences 10, no. 19 (2020): 6893. http://dx.doi.org/10.3390/app10196893.

Abstract:

Research on word embeddings has mainly focused on improving their performance on standard corpora, disregarding the difficulties posed by noisy texts in the form of tweets and other types of non-standard writing from social media. In this work, we propose a simple extension to the skipgram model in which we introduce the concept of bridge-words, which are artificial words added to the model to strengthen the similarity between standard words and their noisy variants. Our new embeddings outperform baseline models on noisy texts on a wide range of evaluation tasks, both intrinsic and extrinsic,

3

Pang, Ning, Zhen Tan, Xiang Zhao, Weixin Zeng, and Weidong Xiao. "Domain relation extraction from noisy Chinese texts." Neurocomputing 418 (December 2020): 21–35. http://dx.doi.org/10.1016/j.neucom.2020.07.077.

4

Liu, Zheng, Chiyu Liu, Bin Xia, and Tao Li. "Multiple Relational Topic Modeling for Noisy Short Texts." International Journal of Software Engineering and Knowledge Engineering 28, no. 11n12 (2018): 1559–74. http://dx.doi.org/10.1142/s021819401840017x.

Abstract:

Understanding contents in social networks by inferring high-quality latent topics from short texts is a significant task in social analysis, which is challenging because social network contents are usually extremely short, noisy and full of informal vocabularies. Due to the lack of sufficient word co-occurrence instances, well-known topic modeling methods such as LDA and LSA cannot uncover high-quality topic structures. Existing research works seek to pool short texts from social networks into pseudo documents or utilize the explicit relations among these short texts such as hashtags in tweets

5

Bolger, Elizabeth. "“Noisy Pleasures” and “Noisy Evil[s]”: The Political Dimensions of Sound in Jane Austen’s Mansfield Park." Eighteenth Century 64, no. 3-4 (2023): 287–301. https://doi.org/10.1353/ecy.2023.a950265.

Abstract: In this essay, I demonstrate how Jane Austen uses sound (chatter, commotion, and silence) in Mansfield Park (1814) to create a conservative soundscape, one that associates reticence with the elite and noise with the lower classes. At the same time, I argue that Austen creates this soundscape only to break down its hierarchical assumptions about class. By considering her depiction of sound in relation to other eighteenth-century texts that discuss the politics of sound—Edmund Burke and Mary Wollstonecraft’s texts on the French Revolution, as well as Thomas Clarkson and Samuel Johnson’

6

Stoica, Alina. "Filtering Noisy Web Data by Identifying and Leveraging Users' Contributions." Proceedings of the International AAAI Conference on Web and Social Media 6, no. 1 (2021): 583–86. http://dx.doi.org/10.1609/icwsm.v6i1.14295.

Abstract:

In this paper we present several methods for collecting Web textual contents and filtering noisy data. We show that knowing which user publishes which contents can contribute to detecting noise. We begin by collecting data from two forums and from Twitter. For the forums, we extract the meaningful information from each discussion (texts of question and answers, IDs of users, date). For the Twitter dataset, we first detect tweets with very similar texts, which helps avoiding redundancy in further analysis. Also, this leads us to clusters of tweets that can be used in the same way as the forum d

7

Gorai, Sanjay Kumar, and Shekhar Pradhan. "Bridging the Gap: OCR Techniques for Noisy and Distorted Texts." International Journal of Scientific Research in Computer Science, Engineering and Information Technology 11, no. 1 (2025): 695–703. https://doi.org/10.32628/cseit2511111.

Abstract:

Optical Character Recognition (OCR) has evolved significantly over the years, enabling automated text extraction from a variety of sources. However, OCR systems often struggle with noisy and distorted texts, such as those found in low-quality scans, degraded historical documents, or images captured in challenging conditions. This paper explores state-of-the-art techniques and advancements in OCR for handling noisy and distorted texts. We discuss preprocessing methods, robust feature extraction, deep learning models, and post-processing techniques, providing a comprehensive overview of the fiel

8

Niu, Yue, Hongjie Zhang, and Jing Li. "A Nested Chinese Restaurant Topic Model for Short Texts with Document Embeddings." Applied Sciences 11, no. 18 (2021): 8708. http://dx.doi.org/10.3390/app11188708.

Abstract:

In recent years, short texts have become a kind of prevalent text on the internet. Due to the short length of each text, conventional topic models for short texts suffer from the sparsity of word co-occurrence information. Researchers have proposed different kinds of customized topic models for short texts by providing additional word co-occurrence information. However, these models cannot incorporate sufficient semantic word co-occurrence information and may bring additional noisy information. To address these issues, we propose a self-aggregated topic model incorporating document embeddings.

9

Habeeb, Imad Qasim, Tamara Z. Fadhil, Yaseen Naser Jurn, Zeyad Qasim Habeeb, and Hanan Najm Abdulkhudhur. "An ensemble technique for speech recognition in noisy environments." Indonesian Journal of Electrical Engineering and Computer Science 18, no. 2 (2020): 835. http://dx.doi.org/10.11591/ijeecs.v18.i2.pp835-842.

Abstract:

Automatic speech recognition (ASR) is a technology that allows a computer and mobile device to recognize and translate spoken language into text. ASR systems often produce poor accuracy for the noisy speech signal. Therefore, this research proposed an ensemble technique that does not rely on a single filter for perfect noise reduction but incorporates information from multiple noise reduction filters to improve the final ASR accuracy. The main factor of this technique is the generation of K-copies of the speech signal using three noise reduction filters. The speech features of thes

10

Habeeb, Imad Qasim, Tamara Z. Fadhil, Yaseen Naser Jurn, Zeyad Qasim Habeeb, and Hanan Najm Abdulkhudhur. "An ensemble technique for speech recognition in noisy environments." Indonesian Journal of Electrical Engineering and Computer Science (IJEECS) 18, no. 2 (2020): 835–42. https://doi.org/10.11591/ijeecs.v18.i2.pp835-842.

Abstract:

Automatic speech recognition (ASR) is a technology that allows a computer and mobile device to recognize and translate spoken language into text. ASR systems often produce poor accuracy for the noisy speech signal. Therefore, this research proposed an ensemble technique that does not rely on a single filter for perfect noise reduction but incorporates information from multiple noise reduction filters to improve the final ASR accuracy. The main factor of this technique is the generation of K-copies of the speech signal using three noise reduction filters. The speech features of these copies dif

11

Kasthuriarachchy, Buddhika, Madhu Chetty, Adrian Shatte, and Darren Walls. "From General Language Understanding to Noisy Text Comprehension." Applied Sciences 11, no. 17 (2021): 7814. http://dx.doi.org/10.3390/app11177814.

Abstract:

Obtaining meaning-rich representations of social media inputs, such as Tweets (unstructured and noisy text), from general-purpose pre-trained language models has become challenging, as these inputs typically deviate from mainstream English usage. The proposed research establishes effective methods for improving the comprehension of noisy texts. For this, we propose a new generic methodology to derive a diverse set of sentence vectors combining and extracting various linguistic characteristics from latent representations of multi-layer, pre-trained language models. Further, we clearly establish

12

Khan, Jebran, and Sungchang Lee. "Enhancement of Sentiment Analysis by Utilizing Noisy Social Media Texts." Journal of Korean Institute of Communications and Information Sciences 45, no. 6 (2020): 1027–37. http://dx.doi.org/10.7840/kics.2020.45.6.1027.

13

Murtadha, Ahmed, Shengfeng Pan, Wen Bo, et al. "Rank-Aware Negative Training for Semi-Supervised Text Classification." Transactions of the Association for Computational Linguistics 11 (2023): 771–86. http://dx.doi.org/10.1162/tacl_a_00574.

Abstract:

Semi-supervised text classification-based paradigms (SSTC) typically employ the spirit of self-training. The key idea is to train a deep classifier on limited labeled texts and then iteratively predict the unlabeled texts as their pseudo-labels for further training. However, the performance is largely affected by the accuracy of pseudo-labels, which may not be significant in real-world scenarios. This paper presents a Rank-aware Negative Training (RNT) framework to address SSTC in learning with noisy label settings. To alleviate the noisy information, we adapt a reasoning with uncerta

14

Kim, Mi-Young, Ying Xu, Osmar R. Zaiane, and Randy Goebel. "Recognition of Patient-Related Named Entities in Noisy Tele-Health Texts." ACM Transactions on Intelligent Systems and Technology 6, no. 4 (2015): 1–23. http://dx.doi.org/10.1145/2651444.

15

Coggeshall, Elizabeth. "Resonant Texts in Noisy Spaces: Approaching the “Publics” of the Public Humanities." PMLA/Publications of the Modern Language Association of America 140, no. 1 (2025): 145–53. https://doi.org/10.1632/s0030812925000112.

16

Tang, Qirui, Wenkang Jiang, Xinlong Pan, et al. "Using Psycholinguistic Clues to Index Deep Semantic Evidences: Personality Detection in Social Media Texts." Chinese Journal of Information Fusion 2, no. 2 (2025): 112–26. https://doi.org/10.62762/cjif.2025.820998.

Abstract:

Detecting personalities in social media content is an important application of personality psychology. Most early studies apply a coherent piece of writing to personality detection, but today, the challenge is to identify dominant personality traits from a series of short, noisy social media posts. To this end, recent studies have attempted to individually encode the deep semantics of posts, often using attention-based methods, and then relate them, or directly assemble them into graph structures. However, due to the inherently disjointed and noisy nature of social media content, constructing

17

Matricciani, Emilio. "Translation Can Distort the Linguistic Parameters of Source Texts Written in Inflected Language: Multidimensional Mathematical Analysis of “The Betrothed”, a Translation in English of “I Promessi Sposi” by A. Manzoni." AppliedMath 5, no. 1 (2025): 24. https://doi.org/10.3390/appliedmath5010024.

Abstract:

We compare, mathematically, the text of a famous Italian novel, I promessi sposi, written by Alessandro Manzoni (source text), to its most recent English translation, The Betrothed by Michael F. Moore (target text). The mathematical theory applied does not measure the efficacy and beauty of texts; only their mathematical underlying structure and similarity. The translation theory adopted by the translator is the “domestication” of the source text because English is not as economical in its use of subject pronouns as Italian. A domestication index measures the degree of domestication. The modif

18

Matricciani, Emilio. "Linguistic Mathematical Relationships Saved or Lost in Translating Texts: Extension of the Statistical Theory of Translation and Its Application to the New Testament." Information 13, no. 1 (2022): 20. http://dx.doi.org/10.3390/info13010020.

Abstract:

The purpose of the paper is to extend the general theory of translation to texts written in the same language and show some possible applications. The main result shows that the mutual mathematical relationships of texts in a language have been saved or lost in translating them into another language and consequently texts have been mathematically distorted. To make objective comparisons, we have defined a “likeness index”—based on probability and communication theory of noisy binary digital channels-and have shown that it can reveal similarities and differences of texts. We have applied the ex

19

Shirakawa, Masumi, Kotaro Nakayama, Takahiro Hara, and Shojiro Nishio. "Wikipedia-Based Semantic Similarity Measurements for Noisy Short Texts Using Extended Naive Bayes." IEEE Transactions on Emerging Topics in Computing 3, no. 2 (2015): 205–19. http://dx.doi.org/10.1109/tetc.2015.2418716.

20

Sahare, Parul, and Sanjay B. Dhok. "Separation of Handwritten and Machine-Printed Texts from Noisy Documents Using Contourlet Transform." Arabian Journal for Science and Engineering 43, no. 12 (2018): 8159–77. http://dx.doi.org/10.1007/s13369-018-3365-1.

21

Sahare, Parul, and Sanjay B. Dhok. "Separation of Machine-Printed and Handwritten Texts in Noisy Documents using Wavelet Transform." IETE Technical Review 36, no. 4 (2018): 341–61. http://dx.doi.org/10.1080/02564602.2018.1475266.

22

Etxeberria, Izaskun, Iñaki Alegria, and Larraitz Uria. "Weighted finite-state transducers for normalization of historical texts." Natural Language Engineering 25, no. 2 (2019): 307–21. http://dx.doi.org/10.1017/s1351324918000505.

Abstract:

This paper presents a study about methods for normalization of historical texts. The aim of these methods is learning relations between historical and contemporary word forms. We have compiled training and test corpora for different languages and scenarios, and we have tried to read the results related to the features of the corpora and languages. Our proposed method, based on weighted finite-state transducers, is compared to previously published ones. Our method learns to map phonological changes using a noisy channel model; it is a simple solution that can use a limited amount of sup

23

Ma, Pingchuan, Lennart Rietdorf, Dmytro Kotovenko, Vincent Tao Hu, and Björn Ommer. "Does VLM Classification Benefit from LLM Description Semantics?" Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 6 (2025): 5973–81. https://doi.org/10.1609/aaai.v39i6.32638.

Abstract:

Accurately describing images with text is a foundation of explainable AI. Vision-Language Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, expressing semantic similarities between vision and language embeddings. VLM classification can be improved with descriptions generated by Large Language Models (LLMs). However, it is difficult to determine the contribution of actual description semantics, as the performance gain may also stem from a semantic-agnostic ensembling effect, where multiple modified text prompts act as a noisy test-tim

24

Marie, Benjamin, and Atsushi Fujita. "Synthesizing Parallel Data of User-Generated Texts with Zero-Shot Neural Machine Translation." Transactions of the Association for Computational Linguistics 8 (November 2020): 710–25. http://dx.doi.org/10.1162/tacl_a_00341.

Abstract:

Neural machine translation (NMT) systems are usually trained on clean parallel data. They can perform very well for translating clean in-domain texts. However, as demonstrated by previous work, the translation quality significantly worsens when translating noisy texts, such as user-generated texts (UGT) from online social media. Given the lack of parallel data of UGT that can be used to train or adapt NMT systems, we synthesize parallel data of UGT, exploiting monolingual data of UGT through crosslingual language model pre-training and zero-shot NMT systems. This paper presents two different b

25

Xiao, Ya, Chengxiang Tan, Zhijie Fan, Qian Xu, and Wenye Zhu. "Joint Entity and Relation Extraction with a Hybrid Transformer and Reinforcement Learning Based Model." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 9314–21. http://dx.doi.org/10.1609/aaai.v34i05.6471.

Abstract:

Joint extraction of entities and relations is a task that extracts the entity mentions and semantic relations between entities from the unstructured texts with one single model. Existing entity and relation extraction datasets usually rely on distant supervision methods which cannot identify the corresponding relations between a relation and the sentence, thus suffers from noisy labeling problem. We propose a hybrid deep neural network model to jointly extract the entities and relations, and the model is also capable of filtering noisy data. The hybrid model contains a transformer-based encodi

26

Stein, Rachel, and Max Keller. "Multi-Task Learning for Sentiment and Topic Classification in Social Media Texts." Frontiers in Artificial Intelligence Research 2, no. 1 (2025): 134–42. https://doi.org/10.71465/fair268.

Abstract:

Social media platforms have emerged as rich sources of textual data, offering insights into public opinion, consumer preferences, and emerging topics. However, extracting meaningful information from such unstructured and noisy content presents considerable challenges. This study proposes a multi-task learning (MTL) framework to simultaneously perform sentiment classification and topic classification on social media texts. By sharing representations across tasks, the model leverages interrelated patterns and dependencies between sentiment and topical content. Experimental results demonstrate th

27

Xiong, Shuguang, Huitao Zhang, and Meng Wang. "Ensemble Model of Attention Mechanism-Based DCGAN and Autoencoder for Noised OCR Classification." Journal of Electronic & Information Systems 4, no. 1 (2022): 33–41. http://dx.doi.org/10.30564/jeis.v4i1.6725.

Abstract:

Optical Character Recognition (OCR) is a technology that converts images of text into machine-readable formats, essential for digitizing printed texts and enabling digital searches. Traditional OCR methods often struggle with variations in font styles and noise. This paper proposes an innovative approach to enhance OCR classification under challenging conditions by leveraging an ensemble model that combines an Attention Mechanism-Based Generative Adversarial Network (GAN) and an Autoencoder. The GAN generates synthetic data to mitigate the limitations of small datasets, while the autoencoder e

28

Khokhlova, Maria. "Identifying Errors in Russian Web Corpora." Journal of Linguistics/Jazykovedný casopis 73, no. 1 (2022): 977–85. http://dx.doi.org/10.2478/jazcas-2022-0021.

Abstract:

The explosion of the Web leads to the production of large amounts of texts and inevitably influences their quality. Errors that tend to occur more often can distort results, especially when texts are used for scientific purposes, in language teaching or learning. Hence, there is a need to examine the existing corpora based on web texts and to clean up the data, which may contain such “noisy” fragments. In our study, we deal with the problem of errors and analyze the Aranea Russicum Maximum corpus. Among such errors, we can name, above all, encoding errors, incorrect font types, as wel

29

Fauzi, M. Ali, Ro'i Fahreza Nur Firmansyah, and Tri Afirianto. "Improving Sentiment Analysis of Short Informal Indonesian Product Reviews using Synonym Based Feature Expansion." TELKOMNIKA Telecommunication, Computing, Electronics and Control 16, no. 3 (2018): 1345–50. https://doi.org/10.12928/TELKOMNIKA.v16i3.7751.

Abstract:

Sentiment analysis in short informal texts like product reviews is more challenging. Short texts are sparse, noisy, and lack of context information. Traditional text classification methods may not be suitable for analyzing sentiment of short texts given all those difficulties. A common approach to overcome these problems is to enrich the original texts with additional semantics to make it appear like a large document of text. Then, traditional classification methods can be applied to it. In this study, we developed an automatic sentiment analysis system of short informal Indonesian texts using

30

Li, Ximing, Jiaojiao Zhang, and Jihong Ouyang. "Dirichlet Multinomial Mixture with Variational Manifold Regularization: Topic Modeling over Short Texts." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 7884–91. http://dx.doi.org/10.1609/aaai.v33i01.33017884.

Abstract:

Conventional topic models suffer from a severe sparsity problem when facing extremely short texts such as social media posts. The family of Dirichlet multinomial mixture (DMM) can handle the sparsity problem, however, they are still very sensitive to ordinary and noisy words, resulting in inaccurate topic representations at the document level. In this paper, we alleviate this problem by preserving local neighborhood structure of short texts, enabling to spread topical signals among neighboring documents, so as to correct the inaccurate topic representations. This is achieved by using variation

31

Kavitha, S., Muhammad Abul Kalam, J. Bhargavi, Dinesh Mannem, Pagidimari Aravind, and Aluvala Poojitha. "AI-Driven Restoration of Documents using CNN and OCR for Precise Text Recovery." International Scientific Journal of Engineering and Management 04, no. 03 (2025): 1–7. https://doi.org/10.55041/isjem02385.

Abstract:

Document restoration is a critical task in preserving text integrity from degraded, noisy, or damaged sources. This paper presents an AI-driven approach utilizing Convolutional Neural Networks (CNN) for image enhancement and Optical Character Recognition (OCR) for precise text recovery. The CNN model effectively removes noise, reconstructs missing or distorted text regions, and enhances readability, while OCR ensures accurate transcription. The proposed method demonstrates superior performance compared to traditional restoration techniques, improving both visual clarity and text extraction ac

32

Yao, Wenlin, Cheng Zhang, Shiva Saravanan, Ruihong Huang, and Ali Mostafavi. "Weakly-Supervised Fine-Grained Event Recognition on Social Media Texts for Disaster Management." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (2020): 532–39. http://dx.doi.org/10.1609/aaai.v34i01.5391.

Abstract:

People increasingly use social media to report emergencies, seek help or share information during disasters, which makes social networks an important tool for disaster management. To meet these time-critical needs, we present a weakly supervised approach for rapidly building high-quality classifiers that label each individual Twitter message with fine-grained event categories. Most importantly, we propose a novel method to create high-quality labeled data in a timely manner that automatically clusters tweets containing an event keyword and asks a domain expert to disambiguate event word senses

33

Song, Wei, and Zijiang Yang. "Improving Distantly Supervised Relation Extraction with Multi-Level Noise Reduction." AI 5, no. 3 (2024): 1709–30. http://dx.doi.org/10.3390/ai5030084.

Abstract:

Background: Distantly supervised relation extraction (DSRE) aims to identify semantic relations in large-scale texts automatically labeled via knowledge base alignment. It has garnered significant attention due to its high efficiency, but existing methods are plagued by noise at both the word and sentence level and fail to address these issues adequately. The former level of noise arises from the large proportion of irrelevant words within sentences, while noise at the latter level is caused by inaccurate relation labels for various sentences. Method: We propose a novel multi-level noise reduc

34

Nguyen, Thi Tuyet Hai, Adam Jatowt, Mickael Coustaty, and Antoine Doucet. "Survey of Post-OCR Processing Approaches." ACM Computing Surveys 54, no. 6 (2021): 1–37. http://dx.doi.org/10.1145/3453476.

Abstract:

Optical character recognition (OCR) is one of the most popular techniques used for converting printed documents into machine-readable ones. While OCR engines can do well with modern text, their performance is unfortunately significantly reduced on historical materials. Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing quality of OCR results by studying their effects on information retrieval and natural language processi

35

Nguyen, Thi-Tuyet-Hai, Adam Jatowt, Mickael Coustaty, and Antoine Doucet. "Survey of Post-OCR Processing Approaches." ACM Computing Surveys 1, 1 (March 2020) (March 1, 2021): 36. https://doi.org/10.5281/zenodo.4640070.

Abstract:

Optical character recognition (OCR) is one of the most popular techniques used for converting printed documents into machine-readable ones. OCR engines can perform well on modern text, unfortunately, their performance is significantly reduced on historical materials. Additionally, many texts have been already processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing quality of OCR results by studying their effects on information retrieval and natural language processing

36

Cheng, Kewei, Xian Li, Yifan Ethan Xu, Xin Luna Dong, and Yizhou Sun. "PGE." Proceedings of the VLDB Endowment 15, no. 6 (2022): 1288–96. http://dx.doi.org/10.14778/3514061.3514074.

Abstract:

Although product graphs (PGs) have gained increasing attentions in recent years for their successful applications in product search and recommendations, the extensive power of PGs can be limited by the inevitable involvement of various kinds of errors. Thus, it is critical to validate the correctness of triples in PGs to improve their reliability. Knowledge graph (KG) embedding methods have strong error detection abilities. Yet, existing KG embedding methods may not be directly applicable to a PG due to its distinct characteristics: (1) PG contains rich textual signals, which necessitates a jo

37

Wołk, Krzysztof, Agnieszka Wołk, and Krzysztof Marasek. "Semantic approach for building generated virtual-parallel corpora from monolingual texts." Poznan Studies in Contemporary Linguistics 55, no. 2 (2019): 469–90. http://dx.doi.org/10.1515/psicl-2019-0017.

Abstract:

Several natural languages have undergone a great deal of processing, but the problem of limited textual linguistic resources remains. The manual creation of parallel corpora by humans is rather expensive and time consuming, while the language data required for statistical machine translation (SMT) do not exist in adequate quantities for their statistical information to be used to initiate the research process. On the other hand, applying known approaches to build parallel resources from multiple sources, such as comparable or quasi-comparable corpora, is very complicated and provides

38

Abreu Mendoza, Carlos. "Sound, Noisy Feelings, and the Audible Sublime in Nineteenth-Century Latin America." Revista de Estudios Hispánicos 57, no. 3 (2023): 403–26. http://dx.doi.org/10.1353/rvs.2023.a924206.

Abstract: This article delves into the aural dimensions of Latin American literary tradition, exploring those instances when the sublime provides the vocabulary and syntax to textually inscribe sonic phenomena. As highlighted by Ana María Ochoa Gautier, the audible techniques cultivated by lettered elites found a way into their writing, thus forming a rich textual archive where they registered sounds, voices, and noises that often defied classification and overwhelmed the senses. In reading texts for sounds that emphasize adversity, discordance, and sono-racial designations, this article trace

39

Koltcov, Sergei, Anton Surkov, Olessia Koltsova, and Vera Ignatenko. "Using large language models for extracting and pre-annotating texts on mental health from noisy data in a low-resource language." PeerJ Computer Science 10 (November 28, 2024): e2395. http://dx.doi.org/10.7717/peerj-cs.2395.

Abstract:

Recent advancements in large language models (LLMs) have opened new possibilities for developing conversational agents (CAs) in various subfields of mental healthcare. However, this progress is hindered by limited access to high-quality training data, often due to privacy concerns and high annotation costs for low-resource languages. A potential solution is to create human-AI annotation systems that utilize extensive public domain user-to-user and user-to-professional discussions on social media. These discussions, however, are extremely noisy, necessitating the adaptation of LLMs for fully au

40

Yang, Jvlie. "The Evaluation of Performance Related to Noise Robustness of VITS for Speech Synthesis." Highlights in Science, Engineering and Technology 57 (July 11, 2023): 62–68. http://dx.doi.org/10.54097/hset.v57i.9904.

Abstract:

In recent years, the utilization of voice interfaces has gained significant popularity, with speech synthesis technology playing a pivotal role in their functionality. However, speech synthesis technology is susceptible to noise interference in practical applications, which may lead to a decrease in the quality of speech synthesis. In this paper, the noise robustness of the Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS) model was investigated, which has shown promising results in speech synthesis tasks. This study conducted experiments using six different

41

Liu, Shengyu, Buzhou Tang, Qingcai Chen, Xiaolong Wang, and Xiaoming Fan. "Feature Engineering for Drug Name Recognition in Biomedical Texts: Feature Conjunction and Feature Selection." Computational and Mathematical Methods in Medicine 2015 (2015): 1–9. http://dx.doi.org/10.1155/2015/913489.

Abstract:

Drug name recognition (DNR) is a critical step for drug information extraction. Machine learning-based methods have been widely used for DNR with various types of features such as part-of-speech, word shape, and dictionary feature. Features used in current machine learning-based methods are usually singleton features which may be due to explosive features and a large number of noisy features when singleton features are combined into conjunction features. However, singleton features that can only capture one linguistic characteristic of a word are not sufficient to describe the information for

42

Fang, Hui, Ge Xu, Yunfei Long, and Weimian Tang. "An Effective ELECTRA-Based Pipeline for Sentiment Analysis of Tourist Attraction Reviews." Applied Sciences 12, no. 21 (2022): 10881. http://dx.doi.org/10.3390/app122110881.

Abstract:

In the era of information explosion, it is difficult for people to decide on a tourist destination quickly. Online travel review texts provide valuable references and suggestions to assist in decision making. However, tourist attraction reviews are primarily informal and noisy. Most works in this field focus on shallow machine learning models or non-pretrained deep learning models. These approaches struggle to generate satisfactory classification results. To solve this issue, the paper proposes a pipeline model. In the first step of this paper, we preprocess tourist attraction reviews by perfo

43

Yu, Yunxia, Yibo Guan, and Yurong Hu. "Natural language processing applications in social network analysis: a data mining approach." Journal of Physics: Conference Series 2813, no. 1 (2024): 012009. http://dx.doi.org/10.1088/1742-6596/2813/1/012009.

Abstract:

Social media texts, such as tweets and user profiles, reflect the opinions, emotions, preferences, and behaviors of social media users, as well as the structure and dynamics of their social networks. Analyzing social media texts can provide valuable insights and applications for various domains and stakeholders, such as marketing, politics, healthcare, and education. However, analyzing social media texts poses significant challenges, as they are often noisy, informal, unstructured, and heterogeneous. Moreover, social media texts are not only influenced by the individual characteristic

44

Waltham-Smith, Naomi. "Unflappable." Paragraph 45, no. 3 (2022): 336–50. http://dx.doi.org/10.3366/para.2022.0408.

Abstract:

Taking off from the Flügelschlag or coup d’aile in Trakl’s poem to which the ‘Ein’ of ‘Ein Geschlecht’ responds with the Grundton (fundamental or tonic) of the Gedicht (poem), the article tracks the figure of this noisy wing-flap as a metaphor for the force of reading (aloud) from Geschlecht III to exchanges between Derrida and Cixous in Voiles, Insister, ‘Fourmis’ and other texts. Alongside figures of take-off, there are also repeated images in these texts of frozen flights and broken or belimed wings which may be connected with the quasi-methodological remarks about the irreducible violenc

45

Zenevich, Ekaterina Vasilievna. "Reception of the Christian tradition of "cleansing" prayer in the lyrics of Julia Zhadovskaya." Litera, no. 5 (May 2024): 198–207. http://dx.doi.org/10.25136/2409-8698.2024.5.70669.

Abstract:

Due to the constant interest in the study of religious images and motifs in literary works, it becomes relevant to study the motifs contained in literary texts on religious subjects. One of the key features of the work of the mid-19th-century poetess Yu.V. Zhadovskaya (1824-1883) is an artistic reinterpretation of the Christian traditions of spiritual weeping and "cleansing" prayer. The subject of the study is the motif of prayer tears as an external sign of "cleansing prayer" in the lyrics of Yu.V. Zhadovskaya. The object of the research is the texts of religious and spiritual subjec

46

Du, Siyuan, and Hao Wang. "Addressing Syntax-Based Semantic Complementation: Incorporating Entity and Soft Dependency Constraints into Metonymy Resolution." Future Internet 14, no. 3 (2022): 85. http://dx.doi.org/10.3390/fi14030085.

Abstract:

State-of-the-art methods for metonymy resolution (MR) consider the sentential context by modeling the entire sentence. However, entity representation, or syntactic structure that are informative may be beneficial for identifying metonymy. Other approaches only using deep neural network fail to capture such information. To leverage both entity and syntax constraints, this paper proposes a robust model EBAGCN for metonymy resolution. First, this work extracts syntactic dependency relations under the guidance of syntactic knowledge. Then the work constructs a neural network to incorporate both en

47

Sun, Yuan, Jian Dai, Zhenwen Ren, Yingke Chen, Dezhong Peng, and Peng Hu. "Dual Self-Paced Cross-Modal Hashing." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 14 (2024): 15184–92. http://dx.doi.org/10.1609/aaai.v38i14.29441.

Abstract:

Cross-modal hashing (CMH) is an efficient technique to retrieve relevant data across different modalities, such as images, texts, and videos, which has attracted more and more attention due to its low storage cost and fast query speed. Although existing CMH methods achieve remarkable processes, almost all of them treat all samples of varying difficulty levels without discrimination, thus leaving them vulnerable to noise or outliers. Based on this observation, we reveal and study dual difficulty levels implied in cross-modal hashing learning, i.e., instance-level and feature-level difficulty. To

48

Chen, Lielei, and Hui Fang. "An Automatic Method for Extracting Innovative Ideas Based on the Scopus® Database." Knowledge Organization 46, no. 3 (2019): 171–86. http://dx.doi.org/10.5771/0943-7444-2019-3-171.

Abstract:

The novelty of knowledge claims in a research paper can be considered an evaluation criterion for papers to supplement citations. To provide a foundation for research evaluation from the perspective of innovativeness, we propose an automatic approach for extracting innovative ideas from the abstracts of technology and engineering papers. The approach extracts N-grams as candidates based on part-of-speech tagging and determines whether they are novel by checking the Scopus® database to determine whether they had ever been presented previously. Moreover, we discussed the distributions of innovat

49

Xin, Yi, Monika E. Grabowska, Srushti Gangireddy, et al. "Improving topic modeling performance on social media through semantic relationships within biomedical terminology." PLOS ONE 20, no. 2 (2025): e0318702. https://doi.org/10.1371/journal.pone.0318702.

Abstract:

Topic modeling utilizes unsupervised machine learning to detect underlying themes within texts and has been deployed routinely to analyze social media for insights into healthcare issues. However, the inherent messiness of social media hinders the full realization of this technique’s potential. As such, we hypothesized that restricting medical concepts in social media texts to specific related semantic types and applying topic modeling to these concepts could be a feasible approach to overcome the challenge of traditional topic modeling for social media texts. Therefore, we developed a semanti

50

Branavan, S. R. K., H. Chen, J. Eisenstein, and R. Barzilay. "Learning Document-Level Semantic Properties from Free-Text Annotations." Journal of Artificial Intelligence Research 34 (April 23, 2009): 569–603. http://dx.doi.org/10.1613/jair.2633.

Abstract:

This paper presents a new method for inferring the semantic properties of documents by leveraging free-text keyphrase annotations. Such annotations are becoming increasingly abundant due to the recent dramatic growth in semi-structured, user-generated online content. One especially relevant domain is product reviews, which are often annotated by their authors with pros/cons keyphrases such as "a real bargain" or "good value." These annotations are representative of the underlying semantic properties; however, unlike expert annotations, they are noisy: lay authors may use different labels t
