Journal articles on the topic 'Pre-training corpora'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 journal articles for your research on the topic 'Pre-training corpora.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Sun, Yu, Shuohuan Wang, Yukun Li, et al. "ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 8968–75. http://dx.doi.org/10.1609/aaai.v34i05.6428.

Full text
Abstract:
Recently pre-trained models have achieved state-of-the-art results in various language understanding tasks. Current pre-training procedures usually focus on training the model with several simple tasks to grasp the co-occurrence of words or sentences. However, besides co-occurring information, there exists other valuable lexical, syntactic and semantic information in training corpora, such as named entities, semantic closeness and discourse relations. In order to extract the lexical, syntactic and semantic information from training corpora, we propose a continual pre-training framework named E
APA, Harvard, Vancouver, ISO, and other styles
2

Moodaley, Wayne, and Arnesh Telukdarie. "A Conceptual Framework for Subdomain Specific Pre-Training of Large Language Models for Green Claim Detection." European Journal of Sustainable Development 12, no. 4 (2023): 319. http://dx.doi.org/10.14207/ejsd.2023.v12n4p319.

Full text
Abstract:
Detection of false or misleading green claims (referred to as “greenwashing”) within company sustainability disclosures is challenging for a number of reasons, which include the textual and qualitative nature, volume, and complexity of such disclosures. In recent years, notable progress made in the fields of artificial intelligence and specifically, large language models (LLMs), has showcased the capacity of these tools to effectively analyse extensive and intricate textual data, including the contents of sustainability disclosures. Transformer-based LLMs, such as Google’s BERT architecture, w
APA, Harvard, Vancouver, ISO, and other styles
3

Hussain, Rida Ghafoor. "RiskBERT: A Pre-Trained Insurance-Based Language Model for Text Classification." International Journal of Innovative Technology and Exploring Engineering 14, no. 7 (2025): 12–18. https://doi.org/10.35940/ijitee.f1097.14070625.

Full text
Abstract:
The rapid growth of insurance-related documents has increased the need for efficient and accurate text classification techniques. Advances in natural language processing (NLP) and deep learning have enabled the extraction of valuable insights from textual data, particularly in specialised domains such as insurance, legal, and scientific documents. While Bidirectional Encoder Representations from Transformers (BERT) models have demonstrated state-of-the-art performance across various NLP tasks, their application to domain-specific corpora often results in suboptimal accuracy due to linguistic an
APA, Harvard, Vancouver, ISO, and other styles
4

Liu, Yinhan, Jiatao Gu, Naman Goyal, et al. "Multilingual Denoising Pre-training for Neural Machine Translation." Transactions of the Association for Computational Linguistics 8 (November 2020): 726–42. http://dx.doi.org/10.1162/tacl_a_00343.

Full text
Abstract:
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART—a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective (Lewis et al., 2019). mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, whereas previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete mod
APA, Harvard, Vancouver, ISO, and other styles
5

Dean, Roger Thornton, and Marcus Thomas Pearce. "Algorithmically-generated Corpora that use Serial Compositional Principles Can Contribute to the Modeling of Sequential Pitch Structure in Non-tonal Music." Empirical Musicology Review 11, no. 1 (2016): 27. http://dx.doi.org/10.18061/emr.v11i1.4900.

Full text
Abstract:
We investigate whether pitch sequences in non-tonal music can be modeled by an information-theoretic approach using algorithmically-generated melodic sequences, made according to 12-tone serial principles, as the training corpus. This is potentially useful, because symbolic corpora of non-tonal music are not readily available. A non-tonal corpus of serially-composed melodies was constructed algorithmically using classic principles of 12-tone music, including prime, inversion, retrograde and retrograde inversion transforms. A similar algorithm generated a tonal melodic corpus of tonal transform
APA, Harvard, Vancouver, ISO, and other styles
6

Kreutzer, Julia, Isaac Caswell, Lisa Wang, et al. "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets." Transactions of the Association for Computational Linguistics 10 (2022): 50–72. http://dx.doi.org/10.1162/tacl_a_00447.

Full text
Abstract:
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/am
APA, Harvard, Vancouver, ISO, and other styles
7

Yuan, Sha, Hanyu Zhao, Zhengxiao Du, et al. "WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models." AI Open 2 (2021): 65–68. http://dx.doi.org/10.1016/j.aiopen.2021.06.001.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Qian, Jing, Yong Yue, Katie Atkinson, and Gangmin Li. "Understanding Chinese Moral Stories with Further Pre-Training." International Journal on Natural Language Computing 12, no. 2 (2023): 01–12. http://dx.doi.org/10.5121/ijnlc.2023.12201.

Full text
Abstract:
The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is compacted into a single statement without involving any characters within the original text, necessitating a more astute language model that can comprehend connotative morality and exhibit commonsense reasoning. The “pretraining + fine-tuning” paradigm is widely embraced in neural language models. In this paper, we propose an intermediary phase to establish an improved paradigm of “pre-training + further pre
APA, Harvard, Vancouver, ISO, and other styles
9

Jing, Qian, Yue Yong, Atkinson Katie, and Li Gangmin. "Understanding Chinese Moral Stories with Further Pre-Training." International Journal on Natural Language Computing (IJNLC) 12, no. 2 (2023): 12. https://doi.org/10.5281/zenodo.7929155.

Full text
Abstract:
The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is compacted into a single statement without involving any characters within the original text, necessitating a more astute language model that can comprehend connotative morality and exhibit commonsense reasoning. The “pre-training + fine-tuning” paradigm is widely embraced in neural language models. In this paper, we propose an intermediary phase to establish an improved paradigm of “pre-tra
APA, Harvard, Vancouver, ISO, and other styles
10

Chukhno, Olena, and Nataliia Tuchyna. "OVERCOMING DIFFICULTIES IN USING LINGUISTIC CORPORA FOR TEACHING ENGLISH TO PRE-SERVICE TEACHERS." Education. Innovation. Practice 12, no. 7 (2024): 91–105. http://dx.doi.org/10.31110/2616-650x-vol12i7-014.

Full text
Abstract:
The rapid pace of technological advancement necessitates that Ukrainian graduates possess advanced digital literacy and critical thinking skills as well as lifelong learning abilities. Within this context, using linguistic corpora can be considered an effective approach which contributes to developing professional communicative competence by engaging students with authentic language data and promoting critical analysis and independent learning. The article addresses the challenges of integrating the direct corpus-based approach into pre-service English language teacher education. These include
APA, Harvard, Vancouver, ISO, and other styles
11

Jiang, Xiaoze, Yaobo Liang, Weizhu Chen, and Nan Duan. "XLM-K: Improving Cross-Lingual Language Model Pre-training with Multilingual Knowledge." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 10840–48. http://dx.doi.org/10.1609/aaai.v36i10.21330.

Full text
Abstract:
Cross-lingual pre-training has achieved great successes using monolingual and bilingual plain text corpora. However, most pre-trained models neglect multilingual knowledge, which is language agnostic but comprises abundant cross-lingual structure alignment. In this paper, we propose XLM-K, a cross-lingual language model incorporating multilingual knowledge in pre-training. XLM-K augments existing multilingual pre-training with two knowledge tasks, namely Masked Entity Prediction Task and Object Entailment Task. We evaluate XLM-K on MLQA, NER and XNLI. Experimental results clearly demonstrate s
APA, Harvard, Vancouver, ISO, and other styles
12

Galli, Carlo, Maria Teresa Colangelo, Marco Meleti, and Elena Calciolari. "The Specialist’s Paradox: Generalist AI May Better Organize Medical Knowledge." Algorithms 18, no. 7 (2025): 451. https://doi.org/10.3390/a18070451.

Full text
Abstract:
This study investigates the ability of six pre-trained sentence transformers to organize medical knowledge by performing unsupervised clustering on 70 high-level Medical Subject Headings (MeSH) terms across seven medical specialties. We evaluated models from different pre-training paradigms: general-purpose, domain-adapted, and from-scratch domain-specific. The results reveal a clear performance hierarchy. A top tier of models, including the general-purpose MPNet and the domain-adapted BioBERT and RoBERTa, produced highly coherent, specialty-aligned clusters (Adjusted Rand Index > 0.80). Co
APA, Harvard, Vancouver, ISO, and other styles
13

Kajiwara, Tomoyuki, Biwa Miura, and Yuki Arase. "Monolingual Transfer Learning via Bilingual Translators for Style-Sensitive Paraphrase Generation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 8042–49. http://dx.doi.org/10.1609/aaai.v34i05.6314.

Full text
Abstract:
We tackle the low-resource problem in style transfer by employing transfer learning that utilizes abundantly available raw corpora. Our method consists of two steps: pre-training learns to generate a semantically equivalent sentence with an input assured grammaticality, and fine-tuning learns to add a desired style. Pre-training has two options, auto-encoding and machine translation based methods. Pre-training based on AutoEncoder is a simple way to learn these from a raw corpus. If machine translators are available, the model can learn more diverse paraphrasing via roundtrip translation. Afte
APA, Harvard, Vancouver, ISO, and other styles
14

Shen, Huawen, Gengluo Li, Jinwen Zhong, and Yu Zhou. "LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 7 (2025): 6805–13. https://doi.org/10.1609/aaai.v39i7.32730.

Full text
Abstract:
Visual Information Extraction (VIE) plays a crucial role in the comprehension of semi-structured documents, and several pre-trained models have been developed to enhance performance. However, most of these works are monolingual (usually English). Due to the extremely unbalanced quantity and quality of pre-training corpora between English and other languages, few works can extend to non-English scenarios. In this paper, we conduct systematic experiments to show that vision and layout modality hold invariance among images with different languages. If decoupling language bias from document images
APA, Harvard, Vancouver, ISO, and other styles
15

Alruwaili, Awatif. "An online training course on the use of corpora for teachers in public schools." JALT CALL Journal 19, no. 1 (2023): 53–70. http://dx.doi.org/10.29140/jaltcall.v19n1.675.

Full text
Abstract:
This paper describes the outcomes of a teacher-training course offered to in-service teachers in public education on the use of corpora in language education. The paper reports on a mixed method study that explores in-service teachers’ evaluation of an online seven-week course in corpus linguistics (CL). Data were gathered through surveys and open-ended questions. Seventy-one in-service teachers took part in this study and completed both pre- and post-course questionnaires. The main aim of the course was to introduce the main concepts of CL, including an exploration of CL tools and resources, a
APA, Harvard, Vancouver, ISO, and other styles
16

Shi, Peng, Patrick Ng, Zhiguo Wang, et al. "Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 15 (2021): 13806–14. http://dx.doi.org/10.1609/aaai.v35i15.17627.

Full text
Abstract:
Most recently, there has been significant interest in learning contextual representations for various NLP tasks, by leveraging large scale text corpora to train powerful language models with self-supervised learning objectives, such as Masked Language Model (MLM). Based on a pilot study, we observe three issues of existing general-purpose language models when they are applied in the text-to-SQL semantic parsers: fail to detect the column mentions in the utterances, to infer the column mentions from the cell values, and to compose target SQL queries when they are complex. To mitigate these issu
APA, Harvard, Vancouver, ISO, and other styles
17

Kryeziu, Labehat, and Visar Shehu. "Pre-Training MLM Using Bert for the Albanian Language." SEEU Review 18, no. 1 (2023): 52–62. http://dx.doi.org/10.2478/seeur-2023-0035.

Full text
Abstract:
Knowing that language is often used as a classifier of human intelligence and the development of systems that understand human language remains a challenge all the time (Kryeziu & Shehu, 2022). Natural Language Processing is a very active field of study, where transformers have a key role. Transformers function based on neural networks and they are increasingly showing promising results. One of the first major contributions to transfer learning in Natural Language Processing was the use of pre-trained word embeddings in 2010 (Joseph, Lev, & Yoshua, 2010). Pre-trained models li
APA, Harvard, Vancouver, ISO, and other styles
18

Li, Zhen, Dan Qu, Chaojie Xie, Wenlin Zhang, and Yanxia Li. "Language Model Pre-training Method in Machine Translation Based on Named Entity Recognition." International Journal on Artificial Intelligence Tools 29, no. 07n08 (2020): 2040021. http://dx.doi.org/10.1142/s0218213020400217.

Full text
Abstract:
The Neural Machine Translation (NMT) model has become the mainstream technology in machine translation. The supervised neural machine translation model trains on abundant sentence-level parallel corpora. But for low-resource languages or dialects with no such corpus available, it is difficult to achieve good performance. Researchers began to focus on unsupervised neural machine translation (UNMT), which uses monolingual corpora as training data. UNMT needs to construct a language model (LM) that learns semantic information from the monolingual corpus. This paper focuses on the pre-training of LM in
APA, Harvard, Vancouver, ISO, and other styles
19

Liu, Peng, Lemei Zhang, and Jon Atle Gulla. "Pre-train, Prompt, and Recommendation: A Comprehensive Survey of Language Modeling Paradigm Adaptations in Recommender Systems." Transactions of the Association for Computational Linguistics 11 (2023): 1553–71. http://dx.doi.org/10.1162/tacl_a_00619.

Full text
Abstract:
The emergence of Pre-trained Language Models (PLMs) has achieved tremendous success in the field of Natural Language Processing (NLP) by learning universal representations on large corpora in a self-supervised manner. The pre-trained models and the learned representations can be beneficial to a series of downstream NLP tasks. This training paradigm has recently been adapted to the recommendation domain and is considered a promising approach by both academia and industry. In this paper, we systematically investigate how to extract and transfer knowledge from pre-trained models learned
APA, Harvard, Vancouver, ISO, and other styles
20

Luo, Da, Yanglei Gan, Rui Hou, et al. "Synergistic Anchored Contrastive Pre-training for Few-Shot Relation Extraction." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (2024): 18742–50. http://dx.doi.org/10.1609/aaai.v38i17.29838.

Full text
Abstract:
Few-shot Relation Extraction (FSRE) aims to extract relational facts from a sparse set of labeled corpora. Recent studies have shown promising results in FSRE by employing Pre-trained Language Models (PLMs) within the framework of supervised contrastive learning, which considers both instances and label facts. However, how to effectively harness massive instance-label pairs to encompass the learned representation with semantic richness in this learning paradigm is not fully explored. To address this gap, we introduce a novel synergistic anchored contrastive pre-training framework. This framewo
APA, Harvard, Vancouver, ISO, and other styles
21

Sprugnoli, Rachele, Giovanni Moretti, and Marco Passarotti. "Building and Comparing Lemma Embeddings for Latin. Classical Latin versus Thomas Aquinas." Italian Journal of Computational Linguistics 6, no. 1 (2021): 29–45. https://doi.org/10.5281/zenodo.4618000.

Full text
Abstract:
This paper presents a new set of lemma embeddings for the Latin language. Embeddings are trained on a manually annotated corpus of texts belonging to the Classical era: different models, architectures and dimensions are tested and evaluated using a novel benchmark for the synonym selection task. In addition, we release vectors pre-trained on the “Opera Maiora” by Thomas Aquinas, thus providing a resource to analyze Latin in a diachronic perspective. The embeddings built upon the two training corpora are compared to each other to support diachronic lexical
APA, Harvard, Vancouver, ISO, and other styles
22

Maruyama, Takumi, and Kazuhide Yamamoto. "Extremely Low-Resource Text Simplification with Pre-trained Transformer Language Model." International Journal of Asian Language Processing 30, no. 01 (2020): 2050001. http://dx.doi.org/10.1142/s2717554520500010.

Full text
Abstract:
Inspired by machine translation task, recent text simplification approaches regard a task as a monolingual text-to-text generation, and neural machine translation models have significantly improved the performance of simplification tasks. Although such models require a large-scale parallel corpus, such corpora for text simplification are very few in number and smaller in size compared to machine translation task. Therefore, we have attempted to facilitate the training of simplification rewritings using pre-training from a large-scale monolingual corpus such as Wikipedia articles. In addition,
APA, Harvard, Vancouver, ISO, and other styles
23

Zheng, Yinhe, Rongsheng Zhang, Minlie Huang, and Xiaoxi Mao. "A Pre-Training Based Personalized Dialogue Generation Model with Persona-Sparse Data." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 9693–700. http://dx.doi.org/10.1609/aaai.v34i05.6518.

Full text
Abstract:
Endowing dialogue systems with personas is essential to deliver more human-like conversations. However, this problem is still far from well explored due to the difficulties of both embodying personalities in natural languages and the persona sparsity issue observed in most dialogue corpora. This paper proposes a pre-training based personalized dialogue model that can generate coherent responses using persona-sparse dialogue data. In this method, a pre-trained language model is used to initialize an encoder and decoder, and personal attribute embeddings are devised to model richer dialogue cont
APA, Harvard, Vancouver, ISO, and other styles
24

Mao, Zhuoyuan, Chenhui Chu, and Sadao Kurohashi. "Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation." ACM Transactions on Asian and Low-Resource Language Information Processing 21, no. 4 (2022): 1–29. http://dx.doi.org/10.1145/3491065.

Full text
Abstract:
In the present study, we propose novel sequence-to-sequence pre-training objectives for low-resource neural machine translation (NMT): Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English. JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is proposed based on phrase structure masking and reordering tasks. Experiments on ASPEC Japanese–English & Japanese–Chinese, Wikipedia Japanese–Chinese, News English
APA, Harvard, Vancouver, ISO, and other styles
25

Kim, Svetlana, and Yuchae Jung. "Elevating Clinical Semantics: Contrastive Pre-Training Beyond Cross-Entropy in Discharge Summaries." Applied Sciences 15, no. 12 (2025): 6541. https://doi.org/10.3390/app15126541.

Full text
Abstract:
Despite remarkable advances in neural language models, a substantial gap remains in precisely interpreting the complex semantics of Electronic Medical Records (EMR). We propose Contrastive Representations Pre-Training (CRPT) to address this gap, replacing the conventional Next Sentence Prediction task’s cross-entropy loss with contrastive loss and incorporating whole-word masking to capture multi-token domain-specific terms better. We also introduce a carefully designed negative sampling strategy that balances intra-document and cross-document sentences, enhancing the model’s discriminative po
APA, Harvard, Vancouver, ISO, and other styles
26

Ahn, Youngdo, Sangwook Han, Seonggyu Lee, and Jong Won Shin. "Speech Emotion Recognition Incorporating Relative Difficulty and Labeling Reliability." Sensors 24, no. 13 (2024): 4111. http://dx.doi.org/10.3390/s24134111.

Full text
Abstract:
Emotions in speech are expressed in various ways, and the speech emotion recognition (SER) model may perform poorly on unseen corpora that contain different emotional factors from those expressed in training databases. To construct an SER model robust to unseen corpora, regularization approaches or metric losses have been studied. In this paper, we propose an SER method that incorporates relative difficulty and labeling reliability of each training sample. Inspired by the Proxy-Anchor loss, we propose a novel loss function which gives higher gradients to the samples for which the emotion label
APA, Harvard, Vancouver, ISO, and other styles
27

Ai, Xi, and Bin Fang. "Empirical Regularization for Synthetic Sentence Pairs in Unsupervised Neural Machine Translation." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 14 (2021): 12471–79. http://dx.doi.org/10.1609/aaai.v35i14.17479.

Full text
Abstract:
UNMT tackles translation on monolingual corpora in two required languages. Since there is no explicitly cross-lingual signal, pre-training and synthetic sentence pairs are significant to the success of UNMT. In this work, we empirically study the core training procedure of UNMT to analyze the synthetic sentence pairs obtained from back-translation. We introduce new losses to UNMT to regularize the synthetic sentence pairs by jointly training the UNMT objective and the regularization objective. Our comprehensive experiments support that our method can generally improve the performance of curren
APA, Harvard, Vancouver, ISO, and other styles
28

Zhu, Quan, Xiaoyin Wang, Xuan Liu, Wanru Du, and Xingxing Ding. "Multi-task learning for aspect level semantic classification combining complex aspect target semantic enhancement and adaptive local focus." Mathematical Biosciences and Engineering 20, no. 10 (2023): 18566–91. http://dx.doi.org/10.3934/mbe.2023824.

Full text
Abstract:
<abstract> <p>Aspect-based sentiment analysis (ABSA) is a fine-grained and diverse task in natural language processing. Existing deep learning models for ABSA face the challenge of balancing the demand for finer granularity in sentiment analysis with the scarcity of training corpora for such granularity. To address this issue, we propose an enhanced BERT-based model for multi-dimensional aspect target semantic learning. Our model leverages BERT's pre-training and fine-tuning mechanisms, enabling it to capture rich semantic feature parameters. In addition, we propose a complex seman
APA, Harvard, Vancouver, ISO, and other styles
29

Siddhant, Aditya, Anuj Goyal, and Angeliki Metallinou. "Unsupervised Transfer Learning for Spoken Language Understanding in Intelligent Agents." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 4959–66. http://dx.doi.org/10.1609/aaai.v33i01.33014959.

Full text
Abstract:
User interaction with voice-powered agents generates large amounts of unlabeled utterances. In this paper, we explore techniques to efficiently transfer the knowledge from these unlabeled utterances to improve model performance on Spoken Language Understanding (SLU) tasks. We use Embeddings from Language Model (ELMo) to take advantage of unlabeled data by learning contextualized word representations. Additionally, we propose ELMo-Light (ELMoL), a faster and simpler unsupervised pre-training method for SLU. Our findings suggest unsupervised pre-training on a large corpus of unlabeled utterance
APA, Harvard, Vancouver, ISO, and other styles
30

Fromont, Robert, and Kevin Watson. "Factors influencing automatic segmental alignment of sociophonetic corpora." Corpora 11, no. 3 (2016): 401–31. http://dx.doi.org/10.3366/cor.2016.0101.

Full text
Abstract:
Automatically time-aligning utterances at the segmental level is increasingly common practice in phonetic and sociophonetic work because of the obvious benefits it brings in allowing the efficient scaling up of the amount of speech data that can be analysed. The field is arriving at a set of recommended practices for improving alignment accuracy, but methodological differences across studies (e.g., the use of different languages and different measures of accuracy) often mean that direct comparison of the factors which facilitate or hinder alignment can be difficult. In this paper, following a
APA, Harvard, Vancouver, ISO, and other styles
31

Gao, Yunfan, Yun Xiong, Siqi Wang, and Haofen Wang. "GeoBERT: Pre-Training Geospatial Representation Learning on Point-of-Interest." Applied Sciences 12, no. 24 (2022): 12942. http://dx.doi.org/10.3390/app122412942.

Full text
Abstract:
Thanks to the development of geographic information technology, geospatial representation learning based on POIs (Point-of-Interest) has gained widespread attention in the past few years. POI is an important indicator to reflect urban socioeconomic activities, widely used to extract geospatial information. However, previous studies often focus on a specific area, such as a city or a district, and are designed only for particular tasks, such as land-use classification. On the other hand, large-scale pre-trained models (PTMs) have recently achieved impressive success and become a milestone in ar
APA, Harvard, Vancouver, ISO, and other styles
32

Chiang, Cheng-Han, and Hung-yi Lee. "On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 10518–25. http://dx.doi.org/10.1609/aaai.v36i10.21295.

Full text
Abstract:
Pre-training language models (LMs) on large-scale unlabeled text data makes the model much easier to achieve exceptional downstream performance than their counterparts directly trained on the downstream tasks. In this work, we study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to their counterparts trained from scratch on downstream tasks. We propose to use artificially constructed datasets as the pre-training data to exclude the effect of semantics, and further control what characteristics the pre-training corpora have. By fine-tuning
APA, Harvard, Vancouver, ISO, and other styles
33

Karimzadeh, Morteza, and Alan MacEachren. "GeoAnnotator: A Collaborative Semi-Automatic Platform for Constructing Geo-Annotated Text Corpora." ISPRS International Journal of Geo-Information 8, no. 4 (2019): 161. http://dx.doi.org/10.3390/ijgi8040161.

Full text
Abstract:
Ground-truth datasets are essential for the training and evaluation of any automated algorithm. As such, gold-standard annotated corpora underlie most advances in natural language processing (NLP). However, only a few relatively small (geo-)annotated datasets are available for geoparsing, i.e., the automatic recognition and geolocation of place references in unstructured text. The creation of geoparsing corpora that include both the recognition of place names in text and matching of those names to toponyms in a geographic gazetteer (a process we call geo-annotation), is a laborious, time-consu
APA, Harvard, Vancouver, ISO, and other styles
34

Li, Yucheng, Frank Guerin, and Chenghua Lin. "LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (2024): 18600–18607. http://dx.doi.org/10.1609/aaai.v38i17.29822.

Full text
Abstract:
Data contamination in evaluation is getting increasingly prevalent with the emergence of language models pre-trained on super large, automatically crawled corpora. This problem leads to significant challenges in the accurate assessment of model capabilities and generalisations. In this paper, we propose LatestEval, an automatic method that leverages the most recent texts to create uncontaminated reading comprehension evaluations. LatestEval avoids data contamination by only using texts published within a recent time window, ensuring no overlap with the training corpora of pre-trained language
APA, Harvard, Vancouver, ISO, and other styles
35

Bae, Jae Kwon. "A Study on Application of the Artificial Intelligence-Based Pre-trained Language Model." Academic Society of Global Business Administration 21, no. 2 (2024): 64–83. http://dx.doi.org/10.38115/asgba.2024.21.2.64.

Full text
Abstract:
Pre-trained Language Model(PLM) refers to a natural language processing(NLP) model that has been pre-trained using large amounts of text data. The PLM has the limitation of not being able to understand domain-specific terminology due to a lack of training data for terminology. Therefore, the need for a domain-specific language model modified through BERT- or GPT-based pre-trained learning has recently been emphasized. In this study, we analyze BERT's pre-training method and BERT-based transformation techniques (ALBERT, RoBERTa, ELECTRA) and propose a PLM that can be used in biomedical, financi
APA, Harvard, Vancouver, ISO, and other styles
36

Fang, Liuqin, Qing Ma, and Jiahao Yan. "The effectiveness of corpus-based training on collocation use in L2 writing for Chinese senior secondary school students." Journal of China Computer-Assisted Language Learning 1, no. 1 (2021): 80–109. http://dx.doi.org/10.1515/jccall-2021-2004.

Full text
Abstract:
Corpus tools are known to be effective in helping L2 learners improve their writing, especially regarding their use of words. Most corpus-based L2 writing research has focused on university students while little attention has been paid to secondary school L2 students. This study investigated whether senior secondary school students in China, upon receiving corpus-based training under the framework of data-driven learning (DDL), could improve their vocabulary use, especially the use of collocations, in their writing for the International English Language Testing System (IELTS) test. Tw
APA, Harvard, Vancouver, ISO, and other styles
37

Yuan, Ling, Chenglong Zeng, and Peng Pan. "Research of Chinese Entity Recognition Model Based on Multi-Feature Semantic Enhancement." Electronics 13, no. 24 (2024): 4895. https://doi.org/10.3390/electronics13244895.

Full text
Abstract:
Chinese Entity Recognition (CER) aims to extract key information entities from Chinese text data, supporting subsequent natural language processing tasks such as relation extraction, knowledge graph construction, and intelligent question answering. However, CER faces several challenges, including limited training corpora, unclear entity boundaries, and complex entity structures, resulting in low accuracy and a call for further improvements. To address issues such as high annotation costs and ambiguous entity boundaries, this paper proposes the SEMFF-CER model, a CER model based on semantic enh
APA, Harvard, Vancouver, ISO, and other styles
38

Kang, Yu, Tianqiao Liu, Hang Li, Yang Hao, and Wenbiao Ding. "Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 10875–83. http://dx.doi.org/10.1609/aaai.v36i10.21334.

Full text
Abstract:
Multimodal pre-training for audio-and-text has recently been proved to be effective and has significantly improved the performance of many downstream speech understanding tasks. However, these state-of-the-art pre-training audio-text models work well only when provided with large amount of parallel audio-and-text data, which brings challenges on many languages that are rich in unimodal corpora but scarce of parallel cross-modal corpus. In this paper, we investigate whether it is possible to pre-train an audio-text multimodal model with extremely low-resource parallel data and extra non-paralle
APA, Harvard, Vancouver, ISO, and other styles
39

He, Wanwei, Yinpei Dai, Yinhe Zheng, et al. "GALAXY: A Generative Pre-trained Model for Task-Oriented Dialog with Semi-supervised Learning and Explicit Policy Injection." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 10749–57. http://dx.doi.org/10.1609/aaai.v36i10.21320.

Full text
Abstract:
Pre-trained models have proved to be powerful in enhancing task-oriented dialog systems. However, current pre-training methods mainly focus on enhancing dialog understanding and generation tasks while neglecting the exploitation of dialog policy. In this paper, we propose GALAXY, a novel pre-trained dialog model that explicitly learns dialog policy from limited labeled dialogs and large-scale unlabeled dialog corpora via semi-supervised learning. Specifically, we introduce a dialog act prediction task for policy optimization during pre-training and employ a consistency regularization term to r
APA, Harvard, Vancouver, ISO, and other styles
40

Garrido-Muñoz, Ismael, Arturo Montejo-Ráez, Fernando Martínez-Santiago, and L. Alfonso Ureña-López. "A Survey on Bias in Deep NLP." Applied Sciences 11, no. 7 (2021): 3184. http://dx.doi.org/10.3390/app11073184.

Full text
Abstract:
Deep neural networks are hegemonic approaches to many machine learning areas, including natural language processing (NLP). Thanks to the availability of large corpora collections and the capability of deep architectures to shape internal language mechanisms in self-supervised learning processes (also known as “pre-training”), versatile and performing models are released continuously for every new network design. These networks, somehow, learn a probability distribution of words and relations across the training collection used, inheriting the potential flaws, inconsistencies and biases contain
APA, Harvard, Vancouver, ISO, and other styles
41

Perkowski, Ernest, Rui Pan, Tuan Dung Nguyen, et al. "AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets." Research Notes of the AAS 8, no. 1 (2024): 7. http://dx.doi.org/10.3847/2515-5172/ad1abe.

Full text
Abstract:
We explore the potential of enhancing LLM performance in astronomy-focused question-answering through targeted, continual pre-training. By employing a compact 7B-parameter LLaMA-2 model and focusing exclusively on a curated set of astronomy corpora—comprising abstracts, introductions, and conclusions—we achieve notable improvements in specialized topic comprehension. While general LLMs like GPT-4 excel in broader question-answering scenarios due to superior reasoning capabilities, our findings suggest that continual pre-training with limited resources can still enhance model performan
APA, Harvard, Vancouver, ISO, and other styles
42

Pota, Marco, Mirko Ventura, Rosario Catelli, and Massimo Esposito. "An Effective BERT-Based Pipeline for Twitter Sentiment Analysis: A Case Study in Italian." Sensors 21, no. 1 (2020): 133. http://dx.doi.org/10.3390/s21010133.

Full text
Abstract:
Over the last decade industrial and academic communities have increased their focus on sentiment analysis techniques, especially applied to tweets. State-of-the-art results have been recently achieved using language models trained from scratch on corpora made up exclusively of tweets, in order to better handle the Twitter jargon. This work aims to introduce a different approach for Twitter sentiment analysis based on two steps. Firstly, the tweet jargon, including emojis and emoticons, is transformed into plain text, exploiting procedures that are language-independent or easily applicable to d
APA, Harvard, Vancouver, ISO, and other styles
43

Wang, Ke, Xiutian Zhao, and Wei Peng. "Learning from Failure: Improving Meeting Summarization without Good Samples." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (2024): 19153–61. http://dx.doi.org/10.1609/aaai.v38i17.29883.

Full text
Abstract:
Existing methods for aligning language models with various human needs rely heavily on high-quality, task-specific data. However, industrial deployment of task-specific language models often encounters challenges in the availability of appropriate training samples. Taking meeting summarization for instance, public datasets are scarce, and private corpora are also hard to obtain due to privacy issues or resource-demanding annotation. To improve meeting summarization in the absence of positively-rated (i.e., “good”) samples, we propose Score Tuning, a cold start tuning framework that leve
APA, Harvard, Vancouver, ISO, and other styles
44

González-Docasal, Ander, and Aitor Álvarez. "Enhancing Voice Cloning Quality through Data Selection and Alignment-Based Metrics." Applied Sciences 13, no. 14 (2023): 8049. http://dx.doi.org/10.3390/app13148049.

Full text
Abstract:
Voice cloning, an emerging field in the speech-processing area, aims to generate synthetic utterances that closely resemble the voices of specific individuals. In this study, we investigated the impact of various techniques on improving the quality of voice cloning, specifically focusing on a low-quality dataset. To contrast our findings, we also used two high-quality corpora for comparative analysis. We conducted exhaustive evaluations of the quality of the gathered corpora in order to select the most-suitable data for the training of a voice-cloning system. Following these measurements, we c
APA, Harvard, Vancouver, ISO, and other styles
45

Vu, Dang Thanh, Gwanghyun Yu, Chilwoo Lee, and Jinyoung Kim. "Text Data Augmentation for the Korean Language." Applied Sciences 12, no. 7 (2022): 3425. http://dx.doi.org/10.3390/app12073425.

Full text
Abstract:
Data augmentation (DA) is a universal technique to reduce overfitting and improve the robustness of machine learning models by increasing the quantity and variety of the training dataset. Although data augmentation is essential in vision tasks, it is rarely applied to text datasets since it is less straightforward. Some studies have concerned text data augmentation, but most of them are for the majority languages, such as English or French. There have been only a few studies on data augmentation for minority languages, e.g., Korean. This study fills the gap by demonstrating several common data
APA, Harvard, Vancouver, ISO, and other styles
46

Qi, Kunxun, and Jianfeng Du. "Translation-Based Matching Adversarial Network for Cross-Lingual Natural Language Inference." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 8632–39. http://dx.doi.org/10.1609/aaai.v34i05.6387.

Full text
Abstract:
Cross-lingual natural language inference is a fundamental task in cross-lingual natural language understanding, widely addressed by neural models recently. Existing neural model based methods either align sentence embeddings between source and target languages, heavily relying on annotated parallel corpora, or exploit pre-trained cross-lingual language models that are fine-tuned on a single language and hard to transfer knowledge to another language. To resolve these limitations in existing methods, this paper proposes an adversarial training framework to enhance both pre-trained models and cl
APA, Harvard, Vancouver, ISO, and other styles
47

Li, Lei, Yongfeng Zhang, and Li Chen. "Personalized Prompt Learning for Explainable Recommendation." ACM Transactions on Information Systems 41, no. 4 (2023): 1–26. http://dx.doi.org/10.1145/3580488.

Full text
Abstract:
Providing user-understandable explanations to justify recommendations could help users better understand the recommended items, increase the system’s ease of use, and gain users’ trust. A typical approach to realize it is natural language generation. However, previous works mostly adopt recurrent neural networks to meet the ends, leaving the potentially more effective pre-trained Transformer models under-explored. In fact, user and item IDs, as important identifiers in recommender systems, are inherently in different semantic space as words that pre-trained models were already trained on. Thus
APA, Harvard, Vancouver, ISO, and other styles
48

Yang, Tiancheng, Ilia Sucholutsky, Kuang-Yu Jen, and Matthias Schonlau. "exKidneyBERT: a language model for kidney transplant pathology reports and the crucial role of extended vocabularies." PeerJ Computer Science 10 (February 28, 2024): e1888. http://dx.doi.org/10.7717/peerj-cs.1888.

Full text
Abstract:
Background Pathology reports contain key information about the patient’s diagnosis as well as important gross and microscopic findings. These information-rich clinical reports offer an invaluable resource for clinical studies, but data extraction and analysis from such unstructured texts is often manual and tedious. While neural information retrieval systems (typically implemented as deep learning methods for natural language processing) are automatic and flexible, they typically require a large domain-specific text corpus for training, making them infeasible for many medical subdomains. Thus,
APA, Harvard, Vancouver, ISO, and other styles
49

A. Brenes, Jose, Javier Ferrández-Pastor, José M. Cámara-Zapata, and Gabriela Marín-Raventós. "Use of Hough Transform and Homography for the Creation of Image Corpora for Smart Agriculture." International Journal on Cybernetics & Informatics 12, no. 6 (2023): 09–19. http://dx.doi.org/10.5121/ijci.2023.120602.

Full text
Abstract:
In the context of smart agriculture, developing deep learning models demands large and high-quality datasets for training. However, the current lack of such datasets for specific crops poses a significant challenge to the progress of this field. This research proposes an automated method to facilitate the creation of training datasets through automated image capture and pre-processing. The method’s efficacy is demonstrated through two case studies conducted in a Cannabis Sativa cultivation setting. By leveraging automated processes, the proposed approach enables the creation of large-volume and high-q
APA, Harvard, Vancouver, ISO, and other styles
50

Panboonyuen, Teerapong, Kulsawasd Jitkajornwanich, Siam Lawawirojwong, Panu Srestasathiern, and Peerapon Vateekul. "Semantic Segmentation on Remotely Sensed Images Using an Enhanced Global Convolutional Network with Channel Attention and Domain Specific Transfer Learning." Remote Sensing 11, no. 1 (2019): 83. http://dx.doi.org/10.3390/rs11010083.

Full text
Abstract:
In the remote sensing domain, it is crucial to complete semantic segmentation on raster images, e.g., river, building, forest, etc. A deep convolutional encoder–decoder (DCED) network is the state-of-the-art semantic segmentation method for remotely sensed images. However, the accuracy is still limited, since the network is not designed for remotely sensed images and the training data in this domain is deficient. In this paper, we aim to propose a novel CNN for semantic segmentation particularly for remote sensing corpora with three main contributions. First, we propose
APA, Harvard, Vancouver, ISO, and other styles