To see the other types of publications on this topic, follow the link: Pre-training corpora.

Journal articles on the topic 'Pre-training corpora'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 journal articles for your research on the topic 'Pre-training corpora.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Sun, Yu, Shuohuan Wang, Yukun Li, et al. "ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 8968–75. http://dx.doi.org/10.1609/aaai.v34i05.6428.

Full text
Abstract:
Recently, pre-trained models have achieved state-of-the-art results in various language understanding tasks. Current pre-training procedures usually focus on training the model with several simple tasks to grasp the co-occurrence of words or sentences. However, besides co-occurring information, there exists other valuable lexical, syntactic and semantic information in training corpora, such as named entities, semantic closeness and discourse relations. In order to extract the lexical, syntactic and semantic information from training corpora, we propose a continual pre-training framework named ERNIE 2.0 which incrementally builds pre-training tasks and then learns pre-trained models on these constructed tasks via continual multi-task learning. Based on this framework, we construct several tasks and train the ERNIE 2.0 model to capture lexical, syntactic and semantic aspects of information in the training data. Experimental results demonstrate that the ERNIE 2.0 model outperforms BERT and XLNet on 16 tasks including English tasks on GLUE benchmarks and several similar tasks in Chinese. The source codes and pre-trained models have been released at https://github.com/PaddlePaddle/ERNIE.
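For orientation, here is a minimal PyTorch sketch of the continual multi-task pre-training idea the abstract describes: tasks are introduced one at a time, and each stage keeps training on every task seen so far. The encoder, task heads, and toy batches are invented placeholders, not the released ERNIE 2.0 code.

```python
import torch
import torch.nn as nn

# Toy continual multi-task pre-training in the spirit of ERNIE 2.0:
# tasks are added incrementally, each stage trains on all tasks seen so far.
class SharedEncoder(nn.Module):
    def __init__(self, vocab_size=1000, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, token_ids):
        out, _ = self.rnn(self.embed(token_ids))
        return out.mean(dim=1)                      # sentence-level representation

def toy_batch(num_labels, batch=8, seq=16):
    x = torch.randint(0, 1000, (batch, seq))        # fake token ids
    y = torch.randint(0, num_labels, (batch,))      # fake task labels
    return x, y

encoder = SharedEncoder()
loss_fn = nn.CrossEntropyLoss()
task_stream = [("word_task", 2), ("structure_task", 3), ("semantic_task", 4)]
heads = {}

for stage, (name, num_labels) in enumerate(task_stream, start=1):
    heads[name] = nn.Linear(128, num_labels)        # add a head for the new task
    params = list(encoder.parameters()) + [p for h in heads.values() for p in h.parameters()]
    opt = torch.optim.Adam(params, lr=1e-3)
    for step in range(50):                          # one continual multi-task stage
        opt.zero_grad()
        total = 0.0
        for task_name, n in task_stream[:stage]:    # every task introduced so far
            x, y = toy_batch(n)
            total = total + loss_fn(heads[task_name](encoder(x)), y)
        total.backward()
        opt.step()
    print(f"stage {stage}: trained on {stage} task(s)")
```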
APA, Harvard, Vancouver, ISO, and other styles
2

Moodaley, Wayne, and Arnesh Telukdarie. "A Conceptual Framework for Subdomain Specific Pre-Training of Large Language Models for Green Claim Detection." European Journal of Sustainable Development 12, no. 4 (2023): 319. http://dx.doi.org/10.14207/ejsd.2023.v12n4p319.

Full text
Abstract:
Detection of false or misleading green claims (referred to as “greenwashing”) within company sustainability disclosures is challenging for a number of reasons, which include the textual and qualitative nature, volume, and complexity of such disclosures. In recent years, notable progress made in the fields of artificial intelligence and specifically, large language models (LLMs), has showcased the capacity of these tools to effectively analyse extensive and intricate textual data, including the contents of sustainability disclosures. Transformer-based LLMs, such as Google’s BERT architecture, were trained on general domain text corpora. Subsequent research has shown that further pre-training of such LLMs on specific domains, such as the climate or sustainability domains, may improve performance. However, previous research often uses text corpora that exhibit significant variation across topics and language and which often consist of heterogeneous subdomains. We therefore propose a conceptual framework for further pre-training of transformer based LLMs using text corpora relating to specific sustainability subdomains i.e. subdomain specific pre-training. We do so as a basis for the improved performance of such models in analysing sustainability disclosures. The main contribution is a conceptual framework to advance the use of LLMs for the reliable identification of green claims and ultimately, greenwashing.
 Keywords: greenwashing, artificial intelligence, sustainability, sustainability reporting, sustainability disclosures.
APA, Harvard, Vancouver, ISO, and other styles
3

Hussain, Rida Ghafoor. "RiskBERT: A Pre-Trained Insurance-Based Language Model for Text Classification." International Journal of Innovative Technology and Exploring Engineering 14, no. 7 (2025): 12–18. https://doi.org/10.35940/ijitee.f1097.14070625.

Full text
Abstract:
The rapid growth of insurance-related documents has increased the need for efficient and accurate text classification techniques. Advances in natural language processing (NLP) and deep learning have enabled the extraction of valuable insights from textual data, particularly in specialised domains such as insurance, legal, and scientific documents. While Bidirectional Encoder Representations from Transformers (BERT) models have demonstrated state-of-the-art performance across various NLP tasks, their application to domain-specific corpora often results in suboptimal accuracy due to linguistic and contextual differences. In this study, I propose RiskBERT, a domain-specific language representation model pre-trained on insurance corpora. I further pre-trained LegalBERT on insurance-specific datasets to enhance its understanding of insurance-related texts. The resulting model, RiskBERT, was then evaluated on downstream clause and provision classification tasks using two benchmark datasets – LEDGAR [1] and Unfair ToS [2]. I conducted a comparative analysis against BERT-Base and LegalBERT to assess the impact of domain-specific pre-training on classification performance. The findings demonstrate that pre-training on insurance-specific corpora significantly improves the model’s ability to analyse complex insurance texts. The experimental results show that RiskBERT significantly outperforms LegalBERT and BERT-Base, achieving superior accuracy in classifying complex insurance texts, with 96.8% and 92.1% accuracy, respectively, on the LEDGAR and Unfair ToS datasets. These findings highlight the effectiveness of domain-adaptive pre-training and underscore the importance of specialised language models for enhancing insurance document processing, making RiskBERT a valuable tool for this purpose.
APA, Harvard, Vancouver, ISO, and other styles
4

Liu, Yinhan, Jiatao Gu, Naman Goyal, et al. "Multilingual Denoising Pre-training for Neural Machine Translation." Transactions of the Association for Computational Linguistics 8 (November 2020): 726–42. http://dx.doi.org/10.1162/tacl_a_00343.

Full text
Abstract:
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART—a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective (Lewis et al., 2019). mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, whereas previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low-resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that it enables transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.
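To make the denoising objective concrete, the sketch below (my own toy code, not the mBART implementation) applies the two BART-style noise operations, sentence permutation and span masking with a single mask token, to whitespace-tokenized text; the model is then trained to reconstruct the original from the noisy version. The 35% masking budget and Poisson-like span lengths roughly follow the paper's recipe, but the tokenization here is naive.

```python
import random

MASK = "<mask>"

def mask_spans(tokens, mask_ratio=0.35, lam=3.5):
    """Replace contiguous spans with a single <mask> until ~mask_ratio of tokens are removed."""
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)
    while budget > 0 and tokens:
        span = max(1, min(int(random.expovariate(1 / lam)), budget))  # rough Poisson-like span length
        start = random.randrange(len(tokens))
        del tokens[start:start + span]
        tokens.insert(start, MASK)
        budget -= span
    return tokens

def permute_sentences(text):
    """Shuffle sentence order; reconstruction forces the model to recover discourse structure."""
    sents = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sents)
    return ". ".join(sents) + "."

def add_noise(text):
    noisy = permute_sentences(text)
    return " ".join(mask_spans(noisy.split()))

source = "The cat sat on the mat. It was warm. The dog watched from the door."
print(add_noise(source))   # noisy input; the original `source` is the reconstruction target
```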
APA, Harvard, Vancouver, ISO, and other styles
5

Dean, Roger Thornton, and Marcus Thomas Pearce. "Algorithmically-generated Corpora that use Serial Compositional Principles Can Contribute to the Modeling of Sequential Pitch Structure in Non-tonal Music." Empirical Musicology Review 11, no. 1 (2016): 27. http://dx.doi.org/10.18061/emr.v11i1.4900.

Full text
Abstract:
We investigate whether pitch sequences in non-tonal music can be modeled by an information-theoretic approach using algorithmically-generated melodic sequences, made according to 12-tone serial principles, as the training corpus. This is potentially useful, because symbolic corpora of non-tonal music are not readily available. A non-tonal corpus of serially-composed melodies was constructed algorithmically using classic principles of 12-tone music, including prime, inversion, retrograde and retrograde inversion transforms. A similar algorithm generated a tonal melodic corpus of tonal transformations, in each case based on a novel tonal melody and expressed in alternating major keys. A cognitive model of auditory expectation (IDyOM) was used first to analyze the sequential pitch structure of the corpora, in some cases with pre-training on established tonal folk-song corpora (Essen, Schaffrath, 1995). The two algorithmic corpora can be distinguished in terms of their information content, and they were quite different from random corpora and from the folk-song corpus. We then demonstrate that the algorithmic serial corpora can assist modeling of canonical non-tonal compositions by Webern and Schoenberg, and also non-tonal segments of improvisations by skilled musicians. Separately, we developed the process of algorithmic melody composition into a software system (the Serial Collaborator) capable of generating multi-stranded serial keyboard music. Corpora of such keyboard compositions based either on the non-tonal or the tonal melodic corpora were generated and assessed for their information-theoretic modeling properties.
APA, Harvard, Vancouver, ISO, and other styles
6

Kreutzer, Julia, Isaac Caswell, Lisa Wang, et al. "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets." Transactions of the Association for Computational Linguistics 10 (2022): 50–72. http://dx.doi.org/10.1162/tacl_a_00447.

Full text
Abstract:
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
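Automatic checks of the kind that can supplement such a human audit are easy to prototype. The heuristics below are my own illustrative ones, not the authors' audit code: they flag lines that are near-empty, mostly non-alphabetic, or dominated by a single repeated token, and report the fraction of acceptable sentences per corpus.

```python
from collections import Counter

def sentence_ok(sent: str) -> bool:
    """Very rough quality heuristics for one line of a web-mined corpus."""
    tokens = sent.split()
    if len(tokens) < 3:
        return False                      # empty or near-empty line
    alpha = sum(ch.isalpha() for ch in sent) / max(len(sent), 1)
    if alpha < 0.5:
        return False                      # mostly punctuation, digits, or markup
    most_common = Counter(tokens).most_common(1)[0][1]
    if most_common / len(tokens) > 0.5:
        return False                      # one token repeated more than half the time
    return True

def audit(corpus_lines):
    ok = sum(sentence_ok(line) for line in corpus_lines)
    return ok / max(len(corpus_lines), 1)

sample = ["This is a normal sentence about language data.",
          "buy buy buy buy buy buy",
          "#### 12345 $$$$",
          "Short"]
print(f"acceptable fraction: {audit(sample):.2f}")
```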
APA, Harvard, Vancouver, ISO, and other styles
7

Yuan, Sha, Hanyu Zhao, Zhengxiao Du, et al. "WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models." AI Open 2 (2021): 65–68. http://dx.doi.org/10.1016/j.aiopen.2021.06.001.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Qian, Jing, Yong Yue, Katie Atkinson, and Gangmin Li. "Understanding Chinese Moral Stories with Further Pre-Training." International Journal on Natural Language Computing 12, no. 2 (2023): 01–12. http://dx.doi.org/10.5121/ijnlc.2023.12201.

Full text
Abstract:
The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is compacted into a single statement without involving any characters within the original text, necessitating a more astute language model that can comprehend connotative morality and exhibit commonsense reasoning. The “pre-training + fine-tuning” paradigm is widely embraced in neural language models. In this paper, we propose an intermediary phase to establish an improved paradigm of “pre-training + further pre-training + fine-tuning”. Further pre-training generally refers to continual learning on task-specific or domain-relevant corpora before being applied to target tasks, which aims at bridging the gap in data distribution between the phases of pre-training and fine-tuning. Our work is based on a Chinese dataset named STORAL-ZH that consists of 4k human-written story-moral pairs. Furthermore, we design a two-step process of domain-adaptive pre-training in the intermediary phase. The first step depends on a newly-collected Chinese dataset of Confucian moral culture. And the second step bases on the Chinese version of a frequently-used commonsense knowledge graph (i.e. ATOMIC) to enrich the backbone model with inferential knowledge besides morality. By comparison with several advanced models including BERT-base, RoBERTa-base and T5-base, experimental results on two understanding tasks demonstrate the effectiveness of our proposed three-phase paradigm.
APA, Harvard, Vancouver, ISO, and other styles
9

Jing, Qian, Yue Yong, Atkinson Katie, and Li Gangmin. "Understanding Chinese Moral Stories with Further Pre-Training." International Journal on Natural Language Computing (IJNLC) 12, no. 2 (2023): 12. https://doi.org/10.5281/zenodo.7929155.

Full text
Abstract:
The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is compacted into a single statement without involving any characters within the original text, necessitating a more astute language model that can comprehend connotative morality and exhibit commonsense reasoning. The “pre-training + fine-tuning” paradigm is widely embraced in neural language models. In this paper, we propose an intermediary phase to establish an improved paradigm of “pre-training + further pre-training + fine-tuning”. Further pre-training generally refers to continual learning on task-specific or domain-relevant corpora before being applied to target tasks, which aims at bridging the gap in data distribution between the phases of pre-training and fine-tuning. Our work is based on a Chinese dataset named STORAL-ZH that consists of 4k human-written story-moral pairs. Furthermore, we design a two-step process of domain-adaptive pre-training in the intermediary phase. The first step depends on a newly-collected Chinese dataset of Confucian moral culture. And the second step bases on the Chinese version of a frequently-used commonsense knowledge graph (i.e. ATOMIC) to enrich the backbone model with inferential knowledge besides morality. By comparison with several advanced models including BERT-base, RoBERTa-base and T5-base, experimental results on two understanding tasks demonstrate the effectiveness of our proposed three-phase paradigm.
APA, Harvard, Vancouver, ISO, and other styles
10

Chukhno, Olena, and Nataliia Tuchyna. "OVERCOMING DIFFICULATIES IN USING LINGUISTIC CORPORA FOR TEACHING ENGLISH TO PRE-SERVICE TEACHERS." Education. Innovation. Practice 12, no. 7 (2024): 91–105. http://dx.doi.org/10.31110/2616-650x-vol12i7-014.

Full text
Abstract:
The rapid pace of technological advancement necessitates that Ukrainian graduates possess advanced digital literacy and critical thinking skills as well as lifelong learning abilities. Within this context, using linguistic corpora can be considered an effective approach which contributes to developing professional communicative competence by engaging students with authentic language data and promoting critical analysis and independent learning. The article addresses the challenges of integrating the direct corpus-based approach into pre-service English language teacher education. These include some limitations related to the technical domain (e.g. limited access to hardware and software, lack of digital skills, navigation issues) and those pertaining to pedagogical integration of corpus-based instruction (e.g. resistance to change, lack of student engagement, data overload). The authors stress the importance of understanding these limitations to effectively integrate the direct corpus-based approach into English language teaching. To overcome technological barriers, the authors suggest upgrading infrastructure, utilizing offline corpora, fostering peer mentoring, developing customized mini-corpora, enhancing digital literacy through targeted training of teachers and students, etc. The constraints within the pedagogical domain can be dealt with by raising awareness of the benefits of corpora use, integrating corpus-based activities into the regular curriculum, designing interactive tasks that align with students’ interests and learning goals, differentiated instruction, etc. Future research may involve creating a system of activities for developing professional communicative competence of pre-service teachers of English with the use of linguistic corpora.
APA, Harvard, Vancouver, ISO, and other styles
11

Jiang, Xiaoze, Yaobo Liang, Weizhu Chen, and Nan Duan. "XLM-K: Improving Cross-Lingual Language Model Pre-training with Multilingual Knowledge." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 10840–48. http://dx.doi.org/10.1609/aaai.v36i10.21330.

Full text
Abstract:
Cross-lingual pre-training has achieved great successes using monolingual and bilingual plain text corpora. However, most pre-trained models neglect multilingual knowledge, which is language agnostic but comprises abundant cross-lingual structure alignment. In this paper, we propose XLM-K, a cross-lingual language model incorporating multilingual knowledge in pre-training. XLM-K augments existing multilingual pre-training with two knowledge tasks, namely Masked Entity Prediction Task and Object Entailment Task. We evaluate XLM-K on MLQA, NER and XNLI. Experimental results clearly demonstrate significant improvements over existing multilingual language models. The results on MLQA and NER exhibit the superiority of XLM-K in knowledge related tasks. The success in XNLI shows a better cross-lingual transferability obtained in XLM-K. What is more, we provide a detailed probing analysis to confirm the desired knowledge captured in our pre-training regimen. The code is available at https://github.com/microsoft/Unicoder/tree/master/pretraining/xlmk.
APA, Harvard, Vancouver, ISO, and other styles
12

Kajiwara, Tomoyuki, Biwa Miura, and Yuki Arase. "Monolingual Transfer Learning via Bilingual Translators for Style-Sensitive Paraphrase Generation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 8042–49. http://dx.doi.org/10.1609/aaai.v34i05.6314.

Full text
Abstract:
We tackle the low-resource problem in style transfer by employing transfer learning that utilizes abundantly available raw corpora. Our method consists of two steps: pre-training learns to generate a semantically equivalent sentence with an input assured grammaticality, and fine-tuning learns to add a desired style. Pre-training has two options, auto-encoding and machine translation based methods. Pre-training based on AutoEncoder is a simple way to learn these from a raw corpus. If machine translators are available, the model can learn more diverse paraphrasing via roundtrip translation. After these, fine-tuning achieves high-quality paraphrase generation even in situations where only 1k sentence pairs of the parallel corpus for style transfer is available. Experimental results of formality style transfer indicated the effectiveness of both pre-training methods and the method based on roundtrip translation achieves state-of-the-art performance.
APA, Harvard, Vancouver, ISO, and other styles
13

Shen, Huawen, Gengluo Li, Jinwen Zhong, and Yu Zhou. "LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 7 (2025): 6805–13. https://doi.org/10.1609/aaai.v39i7.32730.

Full text
Abstract:
Visual Information Extraction (VIE) plays a crucial role in the comprehension of semi-structured documents, and several pre-trained models have been developed to enhance performance. However, most of these works are monolingual (usually English). Due to the extremely unbalanced quantity and quality of pre-training corpora between English and other languages, few works can extend to non-English scenarios. In this paper, we conduct systematic experiments to show that vision and layout modality hold invariance among images with different languages. If decoupling language bias from document images, a vision-layout-based model can achieve impressive cross-lingual generalization. Accordingly, we present a simple but effective multilingual training paradigm LDP (Language Decoupled Pre-training) for better utilization of monolingual pre-training data. Our proposed model LDM (Language Decoupled Model) is first pre-trained on the language-independent data, where the language knowledge is decoupled by a diffusion model, and then the LDM is fine-tuned on the downstream languages. Extensive experiments show that the LDM outperformed all SOTA multilingual pre-trained models, and also maintains competitiveness on downstream monolingual/English benchmarks.
APA, Harvard, Vancouver, ISO, and other styles
14

Alruwaili, Awatif. "An online training course on the use of corpora for teachers in public schools." JALT CALL Journal 19, no. 1 (2023): 53–70. http://dx.doi.org/10.29140/jaltcall.v19n1.675.

Full text
Abstract:
This paper describes the outcomes of a teacher-training course offered to inservice teachers in public education on the use of corpora in language education. The paper reports on a mixed method study that explores in-service teachers’ evaluation of an online seven-week course in corpus linguistics (CL). Data were gathered through surveys and open-ended questions. Seventy-one in-service teachers took part in this study and completed both pre- and post-course questionnaires. The main aim of the course was to introduce the main concepts of CL, including an exploration of CL tools and resources, and the use of CL in learning and teaching. The course presents a wide range of corpora, corpus-based materials, and tools. The participants had a positive reaction to the course and considered it useful and comprehensive, although they did not prefer the online delivery method for the practical sessions of the course. The participants acknowledged the usefulness of corpora in their classrooms as well as the possible difficulties they might face, which shows that they genuinely thought about applying corpora in their teaching contexts. Moreover, the participants showed a willingness for the future use of corpora. Offering such a course to in-service teachers using an online delivery method is not preferable, but a hybrid course may be more suitable and effective, given the variation in teachers’ computer and corpus literacy.
APA, Harvard, Vancouver, ISO, and other styles
15

Shi, Peng, Patrick Ng, Zhiguo Wang, et al. "Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 15 (2021): 13806–14. http://dx.doi.org/10.1609/aaai.v35i15.17627.

Full text
Abstract:
Most recently, there has been significant interest in learning contextual representations for various NLP tasks, by leveraging large scale text corpora to train powerful language models with self-supervised learning objectives, such as Masked Language Model (MLM). Based on a pilot study, we observe three issues of existing general-purpose language models when they are applied in the text-to-SQL semantic parsers: fail to detect the column mentions in the utterances, to infer the column mentions from the cell values, and to compose target SQL queries when they are complex. To mitigate these issues, we present a model pretraining framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterance and table schemas, by leveraging generation models to generate high-quality pre-train data. GAP Model is trained on 2 million utterance-schema pairs and 30K utterance-schema-SQL triples, whose utterances are generated by generation models. Based on experimental results, neural semantic parsers that leverage GAP Model as a representation encoder obtain new state-of-the-art results on both Spider and Criteria-to-SQL benchmarks.
APA, Harvard, Vancouver, ISO, and other styles
16

Kryeziu, Labehat, and Visar Shehu. "Pre-Training MLM Using Bert for the Albanian Language." SEEU Review 18, no. 1 (2023): 52–62. http://dx.doi.org/10.2478/seeur-2023-0035.

Full text
Abstract:
Knowing that language is often used as a classifier of human intelligence, the development of systems that understand human language remains a challenge all the time (Kryeziu & Shehu, 2022). Natural Language Processing is a very active field of study, where transformers have a key role. Transformers function based on neural networks and they are increasingly showing promising results. One of the first major contributions to transfer learning in Natural Language Processing was the use of pre-trained word embeddings in 2010 (Joseph, Lev, & Yoshua, 2010). Pre-trained models like ELMo (Matthew, et al., 2018) and BERT (Devlin, et al., 2019) are trained on large corpora of unlabeled text and as a result learning from text representations has achieved good performance on many of the underlying tasks on datasets from different domains. Pre-training in the language model has proven that there has been an improvement in some aspects of natural language processing, based on the paper (Dai & Le, 2015). In the present paper, we will pre-train BERT on the task of Masked Language Modeling (MLM) with the Albanian language dataset (alb_dataset) that we have created for this purpose (Kryeziu et al., 2022). We will compare two approaches: training of BERT using the available OSCAR dataset and using our alb_dataset that we have collected. The paper shows some discrepancies during training, especially while evaluating the performance of the model.
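As background on the MLM objective used here, the following hand-rolled sketch shows the standard corruption step: 15% of positions are selected and split 80/10/10 between the mask token, a random token, and the original token. It is illustrative only; a real setup would rely on a subword tokenizer and a library collator such as Hugging Face's DataCollatorForLanguageModeling.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    """BERT-style MLM corruption: labels are -100 everywhere except the selected positions."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    selected = torch.rand(input_ids.shape) < mlm_probability
    labels[~selected] = -100                          # only selected positions contribute to the loss

    # 80% of selected positions -> [MASK]
    mask_here = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids[mask_here] = mask_token_id

    # 10% of selected positions -> a random token (half of the remaining 20%)
    random_here = selected & ~mask_here & (torch.rand(input_ids.shape) < 0.5)
    input_ids[random_here] = torch.randint(vocab_size, input_ids.shape)[random_here]

    # the remaining 10% are left unchanged
    return input_ids, labels

ids = torch.randint(5, 1000, (2, 12))                 # toy batch of token ids
corrupted, labels = mask_tokens(ids, mask_token_id=4, vocab_size=1000)
print(corrupted)
print(labels)
```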
APA, Harvard, Vancouver, ISO, and other styles
17

Li, Zhen, Dan Qu, Chaojie Xie, Wenlin Zhang, and Yanxia Li. "Language Model Pre-training Method in Machine Translation Based on Named Entity Recognition." International Journal on Artificial Intelligence Tools 29, no. 07n08 (2020): 2040021. http://dx.doi.org/10.1142/s0218213020400217.

Full text
Abstract:
The Neural Machine Translation (NMT) model has become the mainstream technology in machine translation. The supervised neural machine translation model trains with an abundance of sentence-level parallel corpora. But for a low-resource language or dialect with no such corpus available, it is difficult to achieve good performance. Researchers began to focus on unsupervised neural machine translation (UNMT), which uses a monolingual corpus as training data. UNMT needs to construct a language model (LM) which learns semantic information from the monolingual corpus. This paper focuses on the pre-training of the LM in unsupervised machine translation and proposes a pre-training method, NER-MLM (named entity recognition masked language model). Through performing NER, the proposed method can obtain better semantic information and language model parameters with better training results. In the unsupervised machine translation task, the BLEU scores on the WMT’16 English–French and English–German data sets are 35.30 and 27.30, respectively. To the best of our knowledge, these are the highest results reported in the field of UNMT so far.
APA, Harvard, Vancouver, ISO, and other styles
18

Liu, Peng, Lemei Zhang, and Jon Atle Gulla. "Pre-train, Prompt, and Recommendation: A Comprehensive Survey of Language Modeling Paradigm Adaptations in Recommender Systems." Transactions of the Association for Computational Linguistics 11 (2023): 1553–71. http://dx.doi.org/10.1162/tacl_a_00619.

Full text
Abstract:
The emergence of Pre-trained Language Models (PLMs) has achieved tremendous success in the field of Natural Language Processing (NLP) by learning universal representations on large corpora in a self-supervised manner. The pre-trained models and the learned representations can be beneficial to a series of downstream NLP tasks. This training paradigm has recently been adapted to the recommendation domain and is considered a promising approach by both academia and industry. In this paper, we systematically investigate how to extract and transfer knowledge from pre-trained models learned by different PLM-related training paradigms to improve recommendation performance from various perspectives, such as generality, sparsity, efficiency and effectiveness. Specifically, we propose a comprehensive taxonomy to divide existing PLM-based recommender systems w.r.t. their training strategies and objectives. Then, we analyze and summarize the connection between PLM-based training paradigms and different input data types for recommender systems. Finally, we elaborate on open issues and future research directions in this vibrant field.
APA, Harvard, Vancouver, ISO, and other styles
19

Luo, Da, Yanglei Gan, Rui Hou, et al. "Synergistic Anchored Contrastive Pre-training for Few-Shot Relation Extraction." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (2024): 18742–50. http://dx.doi.org/10.1609/aaai.v38i17.29838.

Full text
Abstract:
Few-shot Relation Extraction (FSRE) aims to extract relational facts from a sparse set of labeled corpora. Recent studies have shown promising results in FSRE by employing Pre-trained Language Models (PLMs) within the framework of supervised contrastive learning, which considers both instances and label facts. However, how to effectively harness massive instance-label pairs to encompass the learned representation with semantic richness in this learning paradigm is not fully explored. To address this gap, we introduce a novel synergistic anchored contrastive pre-training framework. This framework is motivated by the insight that the diverse viewpoints conveyed through instance-label pairs capture incomplete yet complementary intrinsic textual semantics. Specifically, our framework involves a symmetrical contrastive objective that encompasses both sentence-anchored and label-anchored contrastive losses. By combining these two losses, the model establishes a robust and uniform representation space. This space effectively captures the reciprocal alignment of feature distributions among instances and relational facts, simultaneously enhancing the maximization of mutual information across diverse perspectives within the same relation. Experimental results demonstrate that our framework achieves significant performance enhancements compared to baseline models in downstream FSRE tasks. Furthermore, our approach exhibits superior adaptability to handle the challenges of domain shift and zero-shot relation extraction. Our code is available online at https://github.com/AONE-NLP/FSRE-SaCon.
APA, Harvard, Vancouver, ISO, and other styles
20

Sprugnoli, Rachele, Giovanni Moretti, and Marco Passarotti. "Building and Comparing Lemma Embeddings for Latin. Classical Latin versus Thomas Aquinas." Italian Journal of Computational Linguistics 6, no. 1 (2021): 29–45. https://doi.org/10.5281/zenodo.4618000.

Full text
Abstract:
This paper presents a new set of lemma embeddings for the Latin language. Embeddings are trained on a manually annotated corpus of texts belonging to the Classical era: different models, architectures and dimensions are tested and evaluated using a novel benchmark for the synonym selection task. In addition, we release vectors pre-trained on the “Opera Maiora” by Thomas Aquinas, thus providing a resource to analyze Latin in a diachronic perspective. The embeddings built upon the two training corpora are compared to each other to support diachronic lexical studies. The words showing the highest usage change between the two corpora are reported and a selection of them is discussed.
APA, Harvard, Vancouver, ISO, and other styles
21

Maruyama, Takumi, and Kazuhide Yamamoto. "Extremely Low-Resource Text Simplification with Pre-trained Transformer Language Model." International Journal of Asian Language Processing 30, no. 01 (2020): 2050001. http://dx.doi.org/10.1142/s2717554520500010.

Full text
Abstract:
Inspired by machine translation task, recent text simplification approaches regard a task as a monolingual text-to-text generation, and neural machine translation models have significantly improved the performance of simplification tasks. Although such models require a large-scale parallel corpus, such corpora for text simplification are very few in number and smaller in size compared to machine translation task. Therefore, we have attempted to facilitate the training of simplification rewritings using pre-training from a large-scale monolingual corpus such as Wikipedia articles. In addition, we propose a translation language model to seamlessly conduct a fine-tuning of text simplification from the pre-training of the language model. The experimental results show that the translation language model substantially outperforms a state-of-the-art model under a low-resource setting. In addition, a pre-trained translation language model with only 3000 supervised examples can achieve a performance comparable to that of the state-of-the-art model using 30,000 supervised examples.
APA, Harvard, Vancouver, ISO, and other styles
22

Zheng, Yinhe, Rongsheng Zhang, Minlie Huang, and Xiaoxi Mao. "A Pre-Training Based Personalized Dialogue Generation Model with Persona-Sparse Data." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 9693–700. http://dx.doi.org/10.1609/aaai.v34i05.6518.

Full text
Abstract:
Endowing dialogue systems with personas is essential to deliver more human-like conversations. However, this problem is still far from well explored due to the difficulties of both embodying personalities in natural languages and the persona sparsity issue observed in most dialogue corpora. This paper proposes a pre-training based personalized dialogue model that can generate coherent responses using persona-sparse dialogue data. In this method, a pre-trained language model is used to initialize an encoder and decoder, and personal attribute embeddings are devised to model richer dialogue contexts by encoding speakers' personas together with dialogue histories. Further, to incorporate the target persona in the decoding process and to balance its contribution, an attention routing structure is devised in the decoder to merge features extracted from the target persona and dialogue contexts using dynamically predicted weights. Our model can utilize persona-sparse dialogues in a unified manner during the training process, and can also control the amount of persona-related features to exhibit during the inference process. Both automatic and manual evaluation demonstrates that the proposed model outperforms state-of-the-art methods for generating more coherent and persona consistent responses with persona-sparse data.
APA, Harvard, Vancouver, ISO, and other styles
23

Mao, Zhuoyuan, Chenhui Chu, and Sadao Kurohashi. "Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation." ACM Transactions on Asian and Low-Resource Language Information Processing 21, no. 4 (2022): 1–29. http://dx.doi.org/10.1145/3491065.

Full text
Abstract:
In the present study, we propose novel sequence-to-sequence pre-training objectives for low-resource machine translation (NMT): Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English. JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is proposed based on phrase structure masking and reordering tasks. Experiments on ASPEC Japanese–English & Japanese–Chinese, Wikipedia Japanese–Chinese, News English–Korean corpora demonstrate that JASS and ENSS outperform MASS and other existing language-agnostic pre-training methods by up to +2.9 BLEU points for the Japanese–English tasks, up to +7.0 BLEU points for the Japanese–Chinese tasks and up to +1.3 BLEU points for English–Korean tasks. Empirical analysis, which focuses on the relationship between individual parts in JASS and ENSS, reveals the complementary nature of the subtasks of JASS and ENSS. Adequacy evaluation using LASER, human evaluation, and case studies reveals that our proposed methods significantly outperform pre-training methods without injected linguistic knowledge and they have a larger positive impact on the adequacy as compared to the fluency.
APA, Harvard, Vancouver, ISO, and other styles
24

Kim, Svetlana, and Yuchae Jung. "Elevating Clinical Semantics: Contrastive Pre-Training Beyond Cross-Entropy in Discharge Summaries." Applied Sciences 15, no. 12 (2025): 6541. https://doi.org/10.3390/app15126541.

Full text
Abstract:
Despite remarkable advances in neural language models, a substantial gap remains in precisely interpreting the complex semantics of Electronic Medical Records (EMR). We propose Contrastive Representations Pre-Training (CRPT) to address this gap, replacing the conventional Next Sentence Prediction task’s cross-entropy loss with contrastive loss and incorporating whole-word masking to capture multi-token domain-specific terms better. We also introduce a carefully designed negative sampling strategy that balances intra-document and cross-document sentences, enhancing the model’s discriminative power. Implemented atop a BERT-based architecture and evaluated on the Biomedical Language Understanding Evaluation (BLUE) benchmark, our Discharge Summary CRPT model achieves significant performance gains, including a natural language inference precision of 0.825 and a sentence similarity score of 0.775. We further extend our approach through Bio+Discharge Summary CRPT, combining biomedical and clinical corpora to boost downstream performance across tasks. Our framework demonstrates robust interpretive capacity in clinical texts by emphasizing sentence-level semantics and domain-aware masking. These findings underscore CRPT’s potential for advancing semantic accuracy in healthcare applications and open new avenues for integrating larger negative sample sets, domain-specific masking techniques, and multi-task learning paradigms.
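A generic sketch of how a contrastive sentence-pair objective can replace the binary NSP cross-entropy (my own InfoNCE-style formulation under assumed shapes, not the authors' CRPT code): each sentence embedding is pulled toward its paired positive and pushed away from the other sentences in the batch, which may mix intra-document and cross-document negatives as the abstract describes.

```python
import torch
import torch.nn.functional as F

def contrastive_sentence_loss(anchor_emb, positive_emb, temperature=0.07):
    """In-batch contrastive loss: row i of `anchor_emb` should match row i of `positive_emb`."""
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.t() / temperature            # similarity of every anchor to every candidate
    targets = torch.arange(a.size(0))           # the diagonal holds the true pairs
    return F.cross_entropy(logits, targets)

# Toy batch: 4 sentence pairs, 256-dim embeddings (e.g., [CLS] vectors from the encoder).
anchors = torch.randn(4, 256)
positives = torch.randn(4, 256)
print(contrastive_sentence_loss(anchors, positives).item())
```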
APA, Harvard, Vancouver, ISO, and other styles
25

Ahn, Youngdo, Sangwook Han, Seonggyu Lee, and Jong Won Shin. "Speech Emotion Recognition Incorporating Relative Difficulty and Labeling Reliability." Sensors 24, no. 13 (2024): 4111. http://dx.doi.org/10.3390/s24134111.

Full text
Abstract:
Emotions in speech are expressed in various ways, and the speech emotion recognition (SER) model may perform poorly on unseen corpora that contain different emotional factors from those expressed in training databases. To construct an SER model robust to unseen corpora, regularization approaches or metric losses have been studied. In this paper, we propose an SER method that incorporates relative difficulty and labeling reliability of each training sample. Inspired by the Proxy-Anchor loss, we propose a novel loss function which gives higher gradients to the samples for which the emotion labels are more difficult to estimate among those in the given minibatch. Since the annotators may label the emotion based on the emotional expression which resides in the conversational context or other modality but is not apparent in the given speech utterance, some of the emotional labels may not be reliable and these unreliable labels may affect the proposed loss function more severely. In this regard, we propose to apply label smoothing for the samples misclassified by a pre-trained SER model. Experimental results showed that the performance of the SER on unseen corpora was improved by adopting the proposed loss function with label smoothing on the misclassified data.
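A simplified reading of the two ingredients described above (not the authors' implementation): per-sample losses are reweighted so that harder samples in the minibatch receive larger gradients, and samples flagged as misclassified by a pre-trained SER model are trained against smoothed rather than hard one-hot targets. The class count and smoothing value below are arbitrary.

```python
import torch
import torch.nn.functional as F

def smoothed_targets(labels, num_classes, smoothing=0.1):
    """One-hot targets with label smoothing, for samples whose labels may be unreliable."""
    one_hot = F.one_hot(labels, num_classes).float()
    return one_hot * (1 - smoothing) + smoothing / num_classes

def difficulty_weighted_loss(logits, labels, unreliable_mask, num_classes=4, smoothing=0.1):
    hard = F.one_hot(labels, num_classes).float()
    soft = smoothed_targets(labels, num_classes, smoothing)
    targets = torch.where(unreliable_mask.unsqueeze(1), soft, hard)

    log_probs = F.log_softmax(logits, dim=-1)
    per_sample = -(targets * log_probs).sum(dim=-1)      # cross-entropy against (possibly soft) targets

    # Harder samples (higher loss) receive proportionally larger weight within the minibatch.
    weights = per_sample.detach() / per_sample.detach().sum().clamp(min=1e-8)
    return (weights * per_sample).sum()

logits = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
misclassified = torch.rand(8) < 0.25     # in practice: flagged by a pre-trained SER model
print(difficulty_weighted_loss(logits, labels, misclassified).item())
```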
APA, Harvard, Vancouver, ISO, and other styles
26

Ai, Xi, and Bin Fang. "Empirical Regularization for Synthetic Sentence Pairs in Unsupervised Neural Machine Translation." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 14 (2021): 12471–79. http://dx.doi.org/10.1609/aaai.v35i14.17479.

Full text
Abstract:
UNMT tackles translation on monolingual corpora in the two required languages. Since there is no explicit cross-lingual signal, pre-training and synthetic sentence pairs are significant to the success of UNMT. In this work, we empirically study the core training procedure of UNMT to analyze the synthetic sentence pairs obtained from back-translation. We introduce new losses to UNMT to regularize the synthetic sentence pairs by jointly training the UNMT objective and the regularization objective. Our comprehensive experiments support that our method can generally improve the performance of currently successful models on three similar pairs {French, German, Romanian}–English and one dissimilar pair Russian–English with acceptable additional cost.
APA, Harvard, Vancouver, ISO, and other styles
27

Zhu, Quan, Xiaoyin Wang, Xuan Liu, Wanru Du, and Xingxing Ding. "Multi-task learning for aspect level semantic classification combining complex aspect target semantic enhancement and adaptive local focus." Mathematical Biosciences and Engineering 20, no. 10 (2023): 18566–91. http://dx.doi.org/10.3934/mbe.2023824.

Full text
Abstract:
Aspect-based sentiment analysis (ABSA) is a fine-grained and diverse task in natural language processing. Existing deep learning models for ABSA face the challenge of balancing the demand for finer granularity in sentiment analysis with the scarcity of training corpora for such granularity. To address this issue, we propose an enhanced BERT-based model for multi-dimensional aspect target semantic learning. Our model leverages BERT's pre-training and fine-tuning mechanisms, enabling it to capture rich semantic feature parameters. In addition, we propose a complex semantic enhancement mechanism for aspect targets to enrich and optimize fine-grained training corpora. Third, we combine the aspect recognition enhancement mechanism with a CRF model to achieve more robust and accurate entity recognition for aspect targets. Furthermore, we propose an adaptive local attention mechanism learning model to focus on sentiment elements around rich aspect target semantics. Finally, to address the varying contributions of each task in the joint training mechanism, we carefully optimize this training approach, allowing for a mutually beneficial training of multiple tasks. Experimental results on four Chinese and five English datasets demonstrate that our proposed mechanisms and methods effectively improve ABSA models, surpassing some of the latest models in multi-task and single-task scenarios.
APA, Harvard, Vancouver, ISO, and other styles
28

Siddhant, Aditya, Anuj Goyal, and Angeliki Metallinou. "Unsupervised Transfer Learning for Spoken Language Understanding in Intelligent Agents." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 4959–66. http://dx.doi.org/10.1609/aaai.v33i01.33014959.

Full text
Abstract:
User interaction with voice-powered agents generates large amounts of unlabeled utterances. In this paper, we explore techniques to efficiently transfer the knowledge from these unlabeled utterances to improve model performance on Spoken Language Understanding (SLU) tasks. We use Embeddings from Language Model (ELMo) to take advantage of unlabeled data by learning contextualized word representations. Additionally, we propose ELMo-Light (ELMoL), a faster and simpler unsupervised pre-training method for SLU. Our findings suggest unsupervised pre-training on large corpora of unlabeled utterances leads to significantly better SLU performance compared to training from scratch and it can even outperform conventional supervised transfer. Additionally, we show that the gains from unsupervised transfer techniques can be further improved by supervised transfer. The improvements are more pronounced in low resource settings and when using only 1000 labeled in-domain samples, our techniques match the performance of training from scratch on 10-15x more labeled in-domain data.
APA, Harvard, Vancouver, ISO, and other styles
29

Fromont, Robert, and Kevin Watson. "Factors influencing automatic segmental alignment of sociophonetic corpora." Corpora 11, no. 3 (2016): 401–31. http://dx.doi.org/10.3366/cor.2016.0101.

Full text
Abstract:
Automatically time-aligning utterances at the segmental level is increasingly common practice in phonetic and sociophonetic work because of the obvious benefits it brings in allowing the efficient scaling up of the amount of speech data that can be analysed. The field is arriving at a set of recommended practices for improving alignment accuracy, but methodological differences across studies (e.g., the use of different languages and different measures of accuracy) often mean that direct comparison of the factors which facilitate or hinder alignment can be difficult. In this paper, following a review of the state of the art in automatic segmental alignment, we test the effects of a number of factors on its accuracy. Namely, we test the effects of: (1) the presence or absence of pause markers in the training data, (2) the presence of overlapping speech or other noise, (3) using training data from single or multiple speakers, (4) using different sampling rates, (5) using pre-trained acoustic models versus models trained ‘from scratch’, and (6) using different amounts of training data. For each test, we examine three different varieties of English, from New Zealand, the USA and the UK. The paper concludes with some recommendations for automatic segmental alignment in general.
APA, Harvard, Vancouver, ISO, and other styles
30

Gao, Yunfan, Yun Xiong, Siqi Wang, and Haofen Wang. "GeoBERT: Pre-Training Geospatial Representation Learning on Point-of-Interest." Applied Sciences 12, no. 24 (2022): 12942. http://dx.doi.org/10.3390/app122412942.

Full text
Abstract:
Thanks to the development of geographic information technology, geospatial representation learning based on POIs (Point-of-Interest) has gained widespread attention in the past few years. POI is an important indicator to reflect urban socioeconomic activities, widely used to extract geospatial information. However, previous studies often focus on a specific area, such as a city or a district, and are designed only for particular tasks, such as land-use classification. On the other hand, large-scale pre-trained models (PTMs) have recently achieved impressive success and become a milestone in artificial intelligence (AI). Against this background, this study proposes the first large-scale pre-training geospatial representation learning model called GeoBERT. First, we collect about 17 million POIs in 30 cities across China to construct pre-training corpora, with 313 POI types as the tokens and the level-7 Geohash grids as the basic units. Second, we pre-train GeoBERT to learn grid embeddings in a self-supervised manner by masking the POI type and then predicting it. Third, under the paradigm of “pre-training + fine-tuning”, we design five practical downstream tasks. Experiments show that, with just one additional output layer fine-tuning, GeoBERT outperforms previous NLP methods (Word2vec, GloVe) used in geospatial representation learning by 9.21% on average in F1-score for classification tasks, such as store site recommendation and working/living area prediction. For regression tasks, such as POI number prediction, house price prediction, and passenger flow prediction, GeoBERT demonstrates greater performance improvements. The experiment results prove that pre-training on large-scale POI data can significantly improve the ability to extract geospatial information. In the discussion section, we provide a detailed analysis of what GeoBERT has learned from the perspective of attention mechanisms.
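The corpus construction described here is easy to picture with a small sketch: each grid cell becomes one "sentence" whose tokens are the POI types it contains, and masked-type prediction is then ordinary MLM over these sequences. The grid ids and POI types below are invented placeholders; GeoBERT uses level-7 Geohash cells and 313 real POI types.

```python
import random
from collections import defaultdict

# Toy POIs: (precomputed grid id, POI type). In GeoBERT the grid id would be a level-7 Geohash.
pois = [("wx4g0e1", "restaurant"), ("wx4g0e1", "bank"), ("wx4g0e1", "pharmacy"),
        ("wx4g0e2", "school"), ("wx4g0e2", "restaurant"), ("wx4g0e2", "bus_stop")]

grid_sentences = defaultdict(list)
for grid_id, poi_type in pois:
    grid_sentences[grid_id].append(poi_type)       # one "sentence" of POI-type tokens per grid cell

def mask_one(tokens, mask_token="[MASK]"):
    """Mask a single POI type; the model is trained to predict it from the rest of the cell."""
    tokens = list(tokens)
    i = random.randrange(len(tokens))
    target = tokens[i]
    tokens[i] = mask_token
    return tokens, target

for grid_id, tokens in grid_sentences.items():
    corrupted, target = mask_one(tokens)
    print(grid_id, corrupted, "->", target)
```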
APA, Harvard, Vancouver, ISO, and other styles
31

Chiang, Cheng-Han, and Hung-yi Lee. "On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 10518–25. http://dx.doi.org/10.1609/aaai.v36i10.21295.

Full text
Abstract:
Pre-training language models (LMs) on large-scale unlabeled text data makes the model much easier to achieve exceptional downstream performance than their counterparts directly trained on the downstream tasks. In this work, we study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to their counterparts trained from scratch on downstream tasks. We propose to use artificially constructed datasets as the pre-training data to exclude the effect of semantics, and further control what characteristics the pre-training corpora have. By fine-tuning the pre-trained models on GLUE benchmark, we can learn how beneficial it is to transfer the knowledge from the model trained on the dataset possessing that specific trait. We define and discuss three different characteristics in the artificial dataset: 1) matching the token's uni-gram or bi-gram distribution between pre-training and downstream fine-tuning, 2) the presence of the explicit dependencies among the tokens in a sequence, 3) the length of the implicit dependencies among the tokens in a sequence. Our experiments show that the explicit dependencies in the sequences of the pre-training data are critical to the downstream performance. Our results also reveal that models achieve better downstream performance when pre-trained on a dataset with a longer range of implicit dependencies. Based on our analysis, we find that models pre-trained with artificial datasets are prone to learn spurious correlation in downstream tasks. Our work reveals that even if the LMs are not pre-trained on natural language, they still gain transferability on certain human language downstream tasks once the LMs learn to model the token dependencies in the sequences. This result helps us understand the exceptional transferability of pre-trained LMs.
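One way to picture an artificial corpus with explicit, controllable dependencies (my own toy construction, not the paper's generator) is nested open/close token pairs drawn from an arbitrary vocabulary: every opening token must later be closed by its partner, so dependency structure exists without any natural-language semantics, and the nesting depth bounds how long-range the dependencies can be.

```python
import random

PAIRS = {"a": "A", "b": "B", "c": "C"}   # arbitrary symbols: lowercase opens, uppercase closes

def gen_sequence(length=20, max_depth=4):
    """Generate a sequence whose only structure is matched, nested open/close dependencies."""
    seq, stack = [], []
    while len(seq) < length:
        can_open = len(stack) < max_depth
        if stack and (not can_open or random.random() < 0.5):
            seq.append(PAIRS[stack.pop()])        # close the most recently opened token
        else:
            tok = random.choice(list(PAIRS))
            stack.append(tok)
            seq.append(tok)
    while stack:                                   # close anything still open
        seq.append(PAIRS[stack.pop()])
    return seq

print(" ".join(gen_sequence()))
```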
APA, Harvard, Vancouver, ISO, and other styles
32

Karimzadeh, Morteza, and Alan MacEachren. "GeoAnnotator: A Collaborative Semi-Automatic Platform for Constructing Geo-Annotated Text Corpora." ISPRS International Journal of Geo-Information 8, no. 4 (2019): 161. http://dx.doi.org/10.3390/ijgi8040161.

Full text
Abstract:
Ground-truth datasets are essential for the training and evaluation of any automated algorithm. As such, gold-standard annotated corpora underlie most advances in natural language processing (NLP). However, only a few relatively small (geo-)annotated datasets are available for geoparsing, i.e., the automatic recognition and geolocation of place references in unstructured text. The creation of geoparsing corpora that include both the recognition of place names in text and matching of those names to toponyms in a geographic gazetteer (a process we call geo-annotation), is a laborious, time-consuming and expensive task. The field lacks efficient geo-annotation tools to support corpus building and lacks design guidelines for the development of such tools. Here, we present the iterative design of GeoAnnotator, a web-based, semi-automatic and collaborative visual analytics platform for geo-annotation. GeoAnnotator facilitates collaborative, multi-annotator creation of large corpora of geo-annotated text by generating computationally-generated pre-annotations that can be improved by human-annotator users. The resulting corpora can be used in improving and benchmarking geoparsing algorithms as well as various other spatial language-related methods. Further, the iterative design process and the resulting design decisions can be used in annotation platforms tailored for other application domains of NLP.
APA, Harvard, Vancouver, ISO, and other styles
33

Li, Yucheng, Frank Guerin, and Chenghua Lin. "LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (2024): 18600–18607. http://dx.doi.org/10.1609/aaai.v38i17.29822.

Full text
Abstract:
Data contamination in evaluation is getting increasingly prevalent with the emergence of language models pre-trained on super large, automatically crawled corpora. This problem leads to significant challenges in the accurate assessment of model capabilities and generalisations. In this paper, we propose LatestEval, an automatic method that leverages the most recent texts to create uncontaminated reading comprehension evaluations. LatestEval avoids data contamination by only using texts published within a recent time window, ensuring no overlap with the training corpora of pre-trained language models. We develop the LatestEval automated pipeline to 1) gather the latest texts; 2) identify key information, and 3) construct questions targeting the information while removing the existing answers from the context. This encourages models to infer the answers themselves based on the remaining context, rather than just copy-paste. Our experiments demonstrate that language models exhibit negligible memorisation behaviours on LatestEval as opposed to previous benchmarks, suggesting a significantly reduced risk of data contamination and leading to a more robust evaluation. Data and code are publicly available at: https://github.com/liyucheng09/LatestEval.
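The contamination-avoidance step reduces to date filtering plus answer removal, sketched below under simplifying assumptions: documents carry a publication date, and the "key information" is picked naively as the longest capitalized span, whereas the paper uses an automated pipeline for gathering texts and constructing questions.

```python
import re
from datetime import date

docs = [  # hypothetical crawled documents with publication dates
    {"date": date(2021, 5, 1), "text": "Old article likely seen during pre-training."},
    {"date": date(2024, 3, 9), "text": "The probe was built by Acme Dynamics and launched last week."},
]

cutoff = date(2023, 12, 1)   # assumed end of the models' pre-training data

def make_item(text):
    """Remove a salient span from the passage and ask the model to infer it from the remaining context."""
    spans = re.findall(r"(?:[A-Z][a-z]+ ?){2,}", text)
    if not spans:
        return None
    answer = max(spans, key=len).strip()
    question = text.replace(answer, "_____")
    return {"passage": question, "answer": answer}

fresh = [d for d in docs if d["date"] > cutoff]          # only texts newer than the pre-training window
eval_set = [item for d in fresh if (item := make_item(d["text"]))]
print(eval_set)
```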
APA, Harvard, Vancouver, ISO, and other styles
34

Bae, Jae Kwon. "A Study on Application of the Artificial Intelligence-Based Pre-trained Language Model." Academic Society of Global Business Administration 21, no. 2 (2024): 64–83. http://dx.doi.org/10.38115/asgba.2024.21.2.64.

Full text
Abstract:
Pre-trained Language Model(PLM) refers to a natural language processing(NLP) model that has been pre-trained using large amounts of text data. The PLM has the limitation of not being able to understand domain-specific terminology due to a lack of training data for terminology. Therefore, the need for a domain-specific language model modified through BERT- or GPT-based pre-trained learning has recently been emphasized. In this study, we analyze BERT's pre-training method and BERT-based transformation techniques (ALBERT, RoBERTa, ELECTRA) and propose a PLM that can be used in biomedical, financial, and legal domains. The biomedical-specific pre-trained learning model is designed to learn domain-specific language characteristics such as technical terminology, medical sentence structure, and medical entity name recognition in the biomedical field. It is mainly adjusted to be applied to biomedical tasks through transfer learning based on BERT's pre-training method and architecture. For this purpose, it is pre-trained with pre-trained biomedical text data, and this pre-training transfers domain-specific knowledge to the model through learning representations for biomedical-related texts. The finance-specific pre-trained learning model is a model that can understand and process financial terminology, financial market trends, and sentence structures and vocabulary related to financial products and services. It can be used to generate news articles about financial market trends and to extract key information by concisely summarizing long texts such as financial reports and corporate press releases. Additionally, finance-specific pre-trained models help financial analysts generate investment recommendations based on a company's financial condition, performance, and prospects. The legal-specific pre-trained model is a language model suitable for legal documents and is used for legal document classification, legal document summarization, and legal document similarity evaluation. The legal-specific pre-learning model was created by pre-training the BERT model on special texts in the legal field, and through this, it learns characteristics specialized for legal documents. The performance of the legal-specific pre-training model can be improved to solve legal-related tasks through scratch pre-training and additional pre-training using legal corpora.
APA, Harvard, Vancouver, ISO, and other styles
35

Fang, Liuqin, Qing Ma, and Jiahao Yan. "The effectiveness of corpus-based training on collocation use in L2 writing for Chinese senior secondary school students." Journal of China Computer-Assisted Language Learning 1, no. 1 (2021): 80–109. http://dx.doi.org/10.1515/jccall-2021-2004.

Full text
Abstract:
Corpus tools are known to be effective in helping L2 learners improve their writing, especially regarding their use of words. Most corpus-based L2 writing research has focused on university students, while little attention has been paid to secondary school L2 students. This study investigated whether senior secondary school students in China, upon receiving corpus-based training under the framework of data-driven learning (DDL), could improve their vocabulary use, especially their use of collocations, in their writing for the International English Language Testing System (IELTS) test. Twenty-two students aged 16–18 in a senior secondary school in Nanchang, China who were planning to take the IELTS exam participated in the study. The Corpus of Contemporary American English (COCA) and Word and Phrase were the main corpora that the participants used to learn various search functions. Pre-writing and post-writing tests were administered to measure the effect of corpus training. In addition, a questionnaire and interviews were used to collect students’ perspectives and attitudes. The results indicate that students made improvements in word selection after three corpus training sessions, and their attitudes towards corpus use were positive even though they were restricted from using computers to access corpora inside their school.
APA, Harvard, Vancouver, ISO, and other styles
36

Yuan, Ling, Chenglong Zeng, and Peng Pan. "Research of Chinese Entity Recognition Model Based on Multi-Feature Semantic Enhancement." Electronics 13, no. 24 (2024): 4895. https://doi.org/10.3390/electronics13244895.

Full text
Abstract:
Chinese Entity Recognition (CER) aims to extract key information entities from Chinese text data, supporting subsequent natural language processing tasks such as relation extraction, knowledge graph construction, and intelligent question answering. However, CER faces several challenges, including limited training corpora, unclear entity boundaries, and complex entity structures, which limit accuracy and call for further improvement. To address issues such as high annotation costs and ambiguous entity boundaries, this paper proposes SEMFF-CER, a CER model based on semantic enhancement and multi-feature fusion. The model employs character feature extraction algorithms, SofeLexicon semantic enhancement for vocabulary feature extraction, and deep semantic feature extraction from pre-trained models. These features are integrated into the entity recognition process via gating mechanisms, effectively leveraging diverse features to enrich contextual semantics and improve recognition accuracy. Additionally, the model incorporates several optimization strategies: an adaptive loss function to balance negative samples and improve the F1 score, data augmentation to enhance robustness, and dropout and the Adamax optimization algorithm to refine training. The SEMFF-CER model is characterized by low dependence on training corpora, fast computation, and strong scalability. Experiments conducted on four Chinese benchmark entity recognition datasets validate the proposed model, demonstrating superior performance over existing models with the highest F1 score.
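The gated multi-feature fusion described above can be pictured with a small PyTorch module. This is a hedged sketch, not the SEMFF-CER implementation: it assumes three feature streams of equal dimension and a softmax gate that weighs them per token.

```python
# Hedged sketch of gated fusion of character, lexicon, and pre-trained-model
# features; dimensions and the exact gating form are assumptions.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)   # one weight per feature source

    def forward(self, char_feat, lex_feat, plm_feat):
        # gate weights sum to 1 across the three sources for every token
        weights = torch.softmax(
            self.gate(torch.cat([char_feat, lex_feat, plm_feat], dim=-1)), dim=-1)
        return (weights[..., 0:1] * char_feat +
                weights[..., 1:2] * lex_feat +
                weights[..., 2:3] * plm_feat)

fusion = GatedFusion(dim=128)
streams = [torch.randn(2, 20, 128) for _ in range(3)]   # batch of 2, 20 tokens each
print(fusion(*streams).shape)                            # torch.Size([2, 20, 128])
```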
APA, Harvard, Vancouver, ISO, and other styles
37

Kang, Yu, Tianqiao Liu, Hang Li, Yang Hao, and Wenbiao Ding. "Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 10875–83. http://dx.doi.org/10.1609/aaai.v36i10.21334.

Full text
Abstract:
Multimodal pre-training for audio-and-text has recently been shown to be effective and has significantly improved the performance of many downstream speech understanding tasks. However, these state-of-the-art pre-trained audio-text models work well only when provided with large amounts of parallel audio-and-text data, which poses challenges for the many languages that are rich in unimodal corpora but lack a parallel cross-modal corpus. In this paper, we investigate whether it is possible to pre-train an audio-text multimodal model with extremely low-resource parallel data and extra non-parallel unimodal data. Our pre-training framework consists of the following components: (1) Intra-modal Denoising Auto-Encoding (IDAE), which reconstructs input text (audio) representations from a noisy version of themselves. (2) Cross-modal Denoising Auto-Encoding (CDAE), which is pre-trained to reconstruct the input text (audio), given both a noisy version of the input text (audio) and the corresponding translated noisy audio features (text embeddings). (3) Iterative Denoising Process (IDP), which iteratively translates raw audio (text) and the corresponding text embeddings (audio features) translated from the previous iteration into new, less-noisy text embeddings (audio features). We adapt a dual cross-modal Transformer as our backbone model, which consists of two unimodal encoders for IDAE and two cross-modal encoders for CDAE and IDP. Our method achieves performance comparable to a model pre-trained on fully parallel data on multiple downstream speech understanding tasks, demonstrating the great potential of the proposed method.
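To make the intra-modal denoising auto-encoding (IDAE) idea more concrete, the following PyTorch sketch corrupts a token sequence and trains a toy encoder-decoder to reconstruct the clean version. The corruption scheme, GRU backbone, and dimensions are simplifying assumptions, not the paper's dual cross-modal Transformer.

```python
# Schematic denoising auto-encoding sketch: corrupt the input, reconstruct the
# clean sequence. Everything here (GRU, sizes, corruption rate) is an assumption.
import random
import torch
import torch.nn as nn

def corrupt(tokens, drop_prob=0.15, mask_id=0):
    """Randomly replace tokens with a mask id to create the noisy input."""
    return torch.tensor([mask_id if random.random() < drop_prob else t for t in tokens])

embed = nn.Embedding(1000, 64)
encoder = nn.GRU(input_size=64, hidden_size=64, batch_first=True)
decoder = nn.GRU(input_size=64, hidden_size=64, batch_first=True)
out_proj = nn.Linear(64, 1000)
loss_fn = nn.CrossEntropyLoss()

clean = torch.randint(1, 1000, (16,))                 # one toy sequence
noisy = corrupt(clean.tolist())
_, hidden = encoder(embed(noisy).unsqueeze(0))        # encode the noisy input
dec_out, _ = decoder(embed(clean).unsqueeze(0), hidden)
loss = loss_fn(out_proj(dec_out).squeeze(0), clean)   # reconstruct the clean tokens
loss.backward()
```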
APA, Harvard, Vancouver, ISO, and other styles
38

He, Wanwei, Yinpei Dai, Yinhe Zheng, et al. "GALAXY: A Generative Pre-trained Model for Task-Oriented Dialog with Semi-supervised Learning and Explicit Policy Injection." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 10749–57. http://dx.doi.org/10.1609/aaai.v36i10.21320.

Full text
Abstract:
Pre-trained models have proved to be powerful in enhancing task-oriented dialog systems. However, current pre-training methods mainly focus on enhancing dialog understanding and generation tasks while neglecting the exploitation of dialog policy. In this paper, we propose GALAXY, a novel pre-trained dialog model that explicitly learns dialog policy from limited labeled dialogs and large-scale unlabeled dialog corpora via semi-supervised learning. Specifically, we introduce a dialog act prediction task for policy optimization during pre-training and employ a consistency regularization term to refine the learned representation with the help of unlabeled dialogs. We also implement a gating mechanism to weigh suitable unlabeled dialog samples. Empirical results show that GALAXY substantially improves the performance of task-oriented dialog systems, and achieves new state-of-the-art results on benchmark datasets: In-Car, MultiWOZ2.0 and MultiWOZ2.1, improving their end-to-end combined scores by 2.5, 5.3 and 5.5 points, respectively. We also show that GALAXY has a stronger few-shot ability than existing models under various low-resource settings. For reproducibility, we release the code and data at https://github.com/siat-nlp/GALAXY.
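A hedged sketch of the semi-supervised objective described above: a dialog-act prediction loss on labelled dialogs plus a consistency regularizer that pulls together two dropout-perturbed predictions on unlabelled dialogs. The classifier and loss weighting are stand-ins, not GALAXY's actual architecture or schedule.

```python
# Hedged sketch: supervised dialog-act loss + KL consistency on unlabelled data.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActClassifier(nn.Module):
    def __init__(self, dim=256, n_acts=20):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Dropout(0.3),
                                 nn.Linear(dim, n_acts))
    def forward(self, x):
        return self.net(x)

model = ActClassifier()
labelled_x, labels = torch.randn(8, 256), torch.randint(0, 20, (8,))
unlabelled_x = torch.randn(32, 256)

act_loss = F.cross_entropy(model(labelled_x), labels)
# two stochastic passes (dropout active in train mode) give different
# distributions that the consistency term pulls together
log_p = F.log_softmax(model(unlabelled_x), dim=-1)
q = F.softmax(model(unlabelled_x), dim=-1)
consistency = F.kl_div(log_p, q, reduction="batchmean")
loss = act_loss + 0.5 * consistency      # 0.5 is an assumed weighting
loss.backward()
```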
APA, Harvard, Vancouver, ISO, and other styles
39

Garrido-Muñoz, Ismael, Arturo Montejo-Ráez, Fernando Martínez-Santiago, and L. Alfonso Ureña-López. "A Survey on Bias in Deep NLP." Applied Sciences 11, no. 7 (2021): 3184. http://dx.doi.org/10.3390/app11073184.

Full text
Abstract:
Deep neural networks are the dominant approach in many machine learning areas, including natural language processing (NLP). Thanks to the availability of large corpora collections and the capability of deep architectures to shape internal language mechanisms in self-supervised learning processes (also known as “pre-training”), versatile and high-performing models are released continuously for every new network design. These networks learn a probability distribution over words and relations across the training collection used, inheriting the potential flaws, inconsistencies and biases contained in such a collection. As pre-trained models have been found to be very useful approaches to transfer learning, dealing with bias has become a relevant issue in this new scenario. We introduce bias in a formal way and explore how it has been treated in several networks, in terms of detection and correction. In addition, available resources are identified and a strategy to deal with bias in deep NLP is proposed.
APA, Harvard, Vancouver, ISO, and other styles
40

Perkowski, Ernest, Rui Pan, Tuan Dung Nguyen, et al. "AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets." Research Notes of the AAS 8, no. 1 (2024): 7. http://dx.doi.org/10.3847/2515-5172/ad1abe.

Full text
Abstract:
We explore the potential of enhancing LLM performance in astronomy-focused question-answering through targeted, continual pre-training. By employing a compact 7B-parameter LLaMA-2 model and focusing exclusively on a curated set of astronomy corpora—comprising abstracts, introductions, and conclusions—we achieve notable improvements in specialized topic comprehension. While general LLMs like GPT-4 excel in broader question-answering scenarios due to superior reasoning capabilities, our findings suggest that continual pre-training with limited resources can still enhance model performance on specialized topics. Additionally, we present an extension of AstroLLaMA: the fine-tuning of the 7B LLaMA model on a domain-specific conversational data set, culminating in the release of the chat-enabled AstroLLaMA for community use. Comprehensive quantitative benchmarking is currently in progress and will be detailed in an upcoming full paper. The model, AstroLLaMA-Chat, is now available at https://huggingface.co/universeTBD, providing the first open-source conversational AI tool tailored for the astronomy community.
APA, Harvard, Vancouver, ISO, and other styles
41

Pota, Marco, Mirko Ventura, Rosario Catelli, and Massimo Esposito. "An Effective BERT-Based Pipeline for Twitter Sentiment Analysis: A Case Study in Italian." Sensors 21, no. 1 (2020): 133. http://dx.doi.org/10.3390/s21010133.

Full text
Abstract:
Over the last decade industrial and academic communities have increased their focus on sentiment analysis techniques, especially applied to tweets. State-of-the-art results have been recently achieved using language models trained from scratch on corpora made up exclusively of tweets, in order to better handle the Twitter jargon. This work aims to introduce a different approach for Twitter sentiment analysis based on two steps. Firstly, the tweet jargon, including emojis and emoticons, is transformed into plain text, exploiting procedures that are language-independent or easily applicable to different languages. Secondly, the resulting tweets are classified using the language model BERT, but pre-trained on plain text, instead of tweets, for two reasons: (1) pre-trained models on plain text are easily available in many languages, avoiding resource- and time-consuming model training directly on tweets from scratch; (2) available plain text corpora are larger than tweet-only ones, therefore allowing better performance. A case study describing the application of the approach to Italian is presented, with a comparison with other Italian existing solutions. The results obtained show the effectiveness of the approach and indicate that, thanks to its general basis from a methodological perspective, it can also be promising for other languages.
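The two-step recipe above (normalise Twitter jargon into plain text, then classify with a plain-text BERT) can be sketched in a few lines of Python. The emoticon map and the default sentiment model below are illustrative assumptions; an Italian deployment would substitute an Italian plain-text checkpoint.

```python
# Hedged sketch of the two-step approach: jargon normalisation, then a
# plain-text pre-trained classifier. Emoticon map and model choice are assumptions.
import re
import emoji                                   # pip install emoji
from transformers import pipeline

EMOTICONS = {":)": "happy face", ":(": "sad face", ":D": "laughing face"}

def normalise_tweet(text):
    text = emoji.demojize(text, delimiters=(" ", " "))   # e.g. a smiley becomes its name
    for emo, plain in EMOTICONS.items():
        text = text.replace(emo, f" {plain} ")
    return re.sub(r"\s+", " ", text).strip()

classifier = pipeline("sentiment-analysis")    # defaults to an English plain-text model
print(classifier(normalise_tweet("Great match today :) 🙌")))
```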
APA, Harvard, Vancouver, ISO, and other styles
42

Wang, Ke, Xiutian Zhao, and Wei Peng. "Learning from Failure: Improving Meeting Summarization without Good Samples." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (2024): 19153–61. http://dx.doi.org/10.1609/aaai.v38i17.29883.

Full text
Abstract:
Existing methods for aligning language models with various human needs rely heavily on high-quality, task-specific data. However, industrial deployment of task-specific language models often encounters challenges in the availability of appropriate training samples. Taking meeting summarization as an example, public datasets are scarce, and private corpora are also hard to obtain due to privacy issues or resource-demanding annotation. To improve meeting summarization in the absence of positively-rated (i.e., "good") samples, we propose Score Tuning, a cold-start tuning framework that leverages bad samples of distinguishable degrees to incrementally enhance the performance of summary generation without an initial presence of good samples. Our method utilizes asynchronous and numerical human feedback that measures the quality of generated summaries. Formulating data into triplets of (transcript, summary, score), our approach instructs a pre-trained model to learn the association between summary quality and human-rated scores and hence to generate better summaries corresponding to higher scores. The experimental results show that our method is effective in improving meeting summarization on both English and Chinese corpora while requiring less annotated data and fewer training resources than existing alignment methods. Additionally, we preliminarily explore the transferability of our approach to machine translation tasks and demonstrate its potential for future development and use in other domains.
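One way to picture learning from scored samples is to let the numeric score condition the target summary, so that at inference time the model is prompted with the top score. The Python sketch below is a loose, hypothetical illustration of that idea, not the authors' Score Tuning formulation.

```python
# Hypothetical score-conditioned formatting of (transcript, summary, score) triplets.
def build_training_text(transcript, summary, score, max_score=5):
    return (f"Meeting transcript:\n{transcript}\n"
            f"Write a summary of quality {score}/{max_score}:\n{summary}")

def build_inference_prompt(transcript, max_score=5):
    # at inference, ask for the highest-quality summary
    return (f"Meeting transcript:\n{transcript}\n"
            f"Write a summary of quality {max_score}/{max_score}:\n")

triplet = ("Team discussed the Q3 roadmap and agreed on two milestones.",
           "The team agreed on two Q3 milestones.", 4)
print(build_training_text(*triplet))
print(build_inference_prompt(triplet[0]))
```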
APA, Harvard, Vancouver, ISO, and other styles
43

González-Docasal, Ander, and Aitor Álvarez. "Enhancing Voice Cloning Quality through Data Selection and Alignment-Based Metrics." Applied Sciences 13, no. 14 (2023): 8049. http://dx.doi.org/10.3390/app13148049.

Full text
Abstract:
Voice cloning, an emerging field in the speech-processing area, aims to generate synthetic utterances that closely resemble the voices of specific individuals. In this study, we investigated the impact of various techniques on improving the quality of voice cloning, specifically focusing on a low-quality dataset. To contrast our findings, we also used two high-quality corpora for comparative analysis. We conducted exhaustive evaluations of the quality of the gathered corpora in order to select the most suitable data for the training of a voice-cloning system. Following these measurements, we conducted a series of ablations, removing audio files with a lower signal-to-noise ratio and higher variability in utterance speed from the corpora in order to decrease their heterogeneity. Furthermore, we introduced a novel algorithm that calculates the fraction of aligned input characters by exploiting the attention matrix of the Tacotron 2 text-to-speech system. This algorithm provides a valuable metric for evaluating alignment quality during the voice-cloning process. We present the results of our experiments, demonstrating that the performed ablations significantly increased the quality of synthesised audio for the challenging low-quality corpus. Notably, our findings indicate that models trained on a 3 h corpus starting from a pre-trained model exhibit audio quality comparable to models trained from scratch on significantly larger amounts of data.
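The alignment metric described above can be approximated from the attention matrix alone. The NumPy sketch below counts the fraction of input characters that receive a dominant attention weight at some decoder step; the 0.5 threshold and the toy monotonic matrix are assumptions, not the paper's exact definition.

```python
# Simplified alignment-fraction metric computed from a (decoder_steps x chars)
# attention matrix; the threshold is an assumed hyperparameter.
import numpy as np

def aligned_character_fraction(attention, threshold=0.5):
    """attention: array of shape (decoder_steps, n_input_chars), rows sum to 1."""
    attended = attention.max(axis=0) >= threshold   # per input character
    return attended.mean()

steps, chars = 200, 40
att = np.zeros((steps, chars))
att[np.arange(steps), np.arange(steps) // 5] = 1.0   # toy monotonic alignment
print(f"aligned fraction: {aligned_character_fraction(att):.2f}")   # 1.00
```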
APA, Harvard, Vancouver, ISO, and other styles
44

Vu, Dang Thanh, Gwanghyun Yu, Chilwoo Lee, and Jinyoung Kim. "Text Data Augmentation for the Korean Language." Applied Sciences 12, no. 7 (2022): 3425. http://dx.doi.org/10.3390/app12073425.

Full text
Abstract:
Data augmentation (DA) is a universal technique to reduce overfitting and improve the robustness of machine learning models by increasing the quantity and variety of the training dataset. Although data augmentation is essential in vision tasks, it is rarely applied to text datasets since it is less straightforward there. Some studies have addressed text data augmentation, but most of them target majority languages such as English or French; there have been only a few studies on data augmentation for minority languages, e.g., Korean. This study fills the gap by evaluating several common data augmentation methods on Korean corpora with pre-trained language models. In short, we evaluate the performance of two text data augmentation approaches, known as text transformation and back translation. We compare these augmentations across Korean corpora on four downstream tasks: semantic textual similarity (STS), natural language inference (NLI), question duplication verification (QDV), and sentiment classification (STC). Compared to cases without augmentation, the performance gains from text data augmentation are 2.24%, 2.19%, 0.66%, and 0.08% on the STS, NLI, QDV, and STC tasks, respectively.
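Back translation, one of the two augmentation approaches evaluated above, round-trips a sentence through a pivot language to obtain a paraphrase. The sketch below uses publicly available English-French Marian checkpoints purely for illustration; a Korean setup would substitute Korean-English models.

```python
# Hedged back-translation sketch with Hugging Face translation pipelines.
# The en<->fr checkpoints are illustrative; swap in language pairs as needed.
from transformers import pipeline

to_pivot = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
from_pivot = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(sentence):
    pivot = to_pivot(sentence, max_length=128)[0]["translation_text"]
    return from_pivot(pivot, max_length=128)[0]["translation_text"]

original = "The new phone has an excellent camera but a short battery life."
print(back_translate(original))   # a paraphrased variant usable as augmented data
```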
APA, Harvard, Vancouver, ISO, and other styles
45

Qi, Kunxun, and Jianfeng Du. "Translation-Based Matching Adversarial Network for Cross-Lingual Natural Language Inference." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 8632–39. http://dx.doi.org/10.1609/aaai.v34i05.6387.

Full text
Abstract:
Cross-lingual natural language inference is a fundamental task in cross-lingual natural language understanding that has recently been widely addressed by neural models. Existing neural-model-based methods either align sentence embeddings between source and target languages, relying heavily on annotated parallel corpora, or exploit pre-trained cross-lingual language models that are fine-tuned on a single language and struggle to transfer knowledge to another language. To resolve these limitations, this paper proposes an adversarial training framework to enhance both pre-trained models and classical neural models for cross-lingual natural language inference. It trains on the union of data in the source language and data in the target language, learning language-invariant features to improve inference performance. Experimental results on the XNLI benchmark demonstrate that three popular neural models enhanced by the proposed framework significantly outperform the original models.
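A common way to learn language-invariant features adversarially is a gradient-reversal layer feeding a language discriminator. The PyTorch sketch below illustrates that general pattern with a stand-in encoder; it is not the paper's specific framework.

```python
# Hedged sketch of adversarial language-invariant feature learning via
# gradient reversal; the encoder and heads are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad            # reverse gradients flowing back into the encoder

encoder = nn.Linear(300, 256)                 # stand-in sentence encoder
nli_head = nn.Linear(256, 3)                  # entailment / neutral / contradiction
lang_disc = nn.Linear(256, 2)                 # source vs. target language

x = torch.randn(16, 300)
nli_y = torch.randint(0, 3, (16,))
lang_y = torch.randint(0, 2, (16,))

h = encoder(x)
loss = (F.cross_entropy(nli_head(h), nli_y) +
        F.cross_entropy(lang_disc(GradReverse.apply(h)), lang_y))
loss.backward()
```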
APA, Harvard, Vancouver, ISO, and other styles
46

Li, Lei, Yongfeng Zhang, and Li Chen. "Personalized Prompt Learning for Explainable Recommendation." ACM Transactions on Information Systems 41, no. 4 (2023): 1–26. http://dx.doi.org/10.1145/3580488.

Full text
Abstract:
Providing user-understandable explanations to justify recommendations could help users better understand the recommended items, increase the system’s ease of use, and gain users’ trust. A typical approach to realizing this is natural language generation. However, previous works mostly adopt recurrent neural networks to this end, leaving the potentially more effective pre-trained Transformer models under-explored. In fact, user and item IDs, as important identifiers in recommender systems, inherently lie in a semantic space different from that of the words on which pre-trained models were trained. Thus, how to effectively fuse IDs into such models becomes a critical issue. Inspired by recent advances in prompt learning, we come up with two solutions: find alternative words to represent IDs (called discrete prompt learning) and directly input ID vectors to a pre-trained model (termed continuous prompt learning). In the latter case, ID vectors are randomly initialized while the model was trained in advance on large corpora, so they are actually at different learning stages. To bridge the gap, we further propose two training strategies: sequential tuning and recommendation as regularization. Extensive experiments show that our continuous prompt learning approach, equipped with the training strategies, consistently outperforms strong baselines on three datasets of explainable recommendation.
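Continuous prompt learning can be pictured as prepending trainable ID vectors to the word embeddings of a pre-trained language model. The sketch below does this with GPT-2 via the inputs_embeds interface; the embedding sizes and fusion scheme are assumptions, not the paper's exact design.

```python
# Hedged sketch of continuous prompt learning: trainable user/item ID vectors
# are prepended as soft-prompt embeddings to a pre-trained LM's inputs.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
dim = lm.config.n_embd

user_emb = nn.Embedding(1000, dim)            # trainable ID vectors (random init)
item_emb = nn.Embedding(5000, dim)

def forward_with_prompt(user_id, item_id, text):
    ids = tok(text, return_tensors="pt").input_ids
    word_embeds = lm.transformer.wte(ids)                            # (1, seq, dim)
    prompt = torch.stack([user_emb(torch.tensor([user_id])),
                          item_emb(torch.tensor([item_id]))], dim=1)  # (1, 2, dim)
    return lm(inputs_embeds=torch.cat([prompt, word_embeds], dim=1))

out = forward_with_prompt(7, 42, "great sound quality and comfortable fit")
print(out.logits.shape)
```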
APA, Harvard, Vancouver, ISO, and other styles
47

Yang, Tiancheng, Ilia Sucholutsky, Kuang-Yu Jen, and Matthias Schonlau. "exKidneyBERT: a language model for kidney transplant pathology reports and the crucial role of extended vocabularies." PeerJ Computer Science 10 (February 28, 2024): e1888. http://dx.doi.org/10.7717/peerj-cs.1888.

Full text
Abstract:
Background: Pathology reports contain key information about the patient’s diagnosis as well as important gross and microscopic findings. These information-rich clinical reports offer an invaluable resource for clinical studies, but data extraction and analysis from such unstructured texts is often manual and tedious. While neural information retrieval systems (typically implemented as deep learning methods for natural language processing) are automatic and flexible, they typically require a large domain-specific text corpus for training, making them infeasible for many medical subdomains. Thus, an automated data extraction method for pathology reports that does not require a large training corpus would be of significant value and utility. Objective: To develop a language model-based neural information retrieval system that can be trained on small datasets and validate it by training it on renal transplant-pathology reports to extract relevant information for two predefined questions: (1) “What kind of rejection does the patient show?”; (2) “What is the grade of interstitial fibrosis and tubular atrophy (IFTA)?” Methods: Kidney BERT was developed by pre-training Clinical BERT on 3.4K renal transplant pathology reports and 1.5M words. Then, exKidneyBERT was developed by extending Clinical BERT’s tokenizer with six technical keywords and repeating the pre-training procedure. This extended the model’s vocabulary. All three models were fine-tuned with information retrieval heads. Results: The model with extended vocabulary, exKidneyBERT, outperformed Clinical BERT and Kidney BERT in both questions. For rejection, exKidneyBERT achieved an 83.3% overlap ratio for antibody-mediated rejection (ABMR) and 79.2% for T-cell mediated rejection (TCMR). For IFTA, exKidneyBERT had a 95.8% exact match rate. Conclusion: ExKidneyBERT is a high-performing model for extracting information from renal pathology reports. Additional pre-training of BERT language models on specialized small domains does not necessarily improve performance. Extending the BERT tokenizer’s vocabulary library is essential for specialized domains to improve performance, especially when pre-training on small corpora.
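The vocabulary-extension step that proved crucial above corresponds to a short recipe in the Hugging Face API: add the domain keywords to the tokenizer and resize the model's embedding matrix before any further pre-training. The keyword list below is illustrative, not the paper's actual six terms.

```python
# Minimal vocabulary-extension sketch; the new_terms list is an assumption.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

new_terms = ["glomerulitis", "tubulitis", "ifta"]          # assumed examples
num_added = tokenizer.add_tokens(new_terms)
model.resize_token_embeddings(len(tokenizer))              # new rows are randomly initialised

print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
# Further masked-language-model pre-training on the domain reports would follow here.
```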
APA, Harvard, Vancouver, ISO, and other styles
48

A. Brenes, Jose, Javier Ferrández-Pastor, José M. Cámara-Zapata, and Gabriela Marín-Raventós. "Use of Hough Transform and Homography for the Creation of Image Corpora for Smart Agriculture." International Journal on Cybernetics & Informatics 12, no. 6 (2023): 09–19. http://dx.doi.org/10.5121/ijci.2023.120602.

Full text
Abstract:
In the context of smart agriculture, developing deep learning models demands large, high-quality datasets for training. However, the current lack of such datasets for specific crops poses a significant challenge to the progress of this field. This research proposes an automated method to facilitate the creation of training datasets through automated image capture and pre-processing. The method’s efficacy is demonstrated through two case studies conducted in a Cannabis sativa cultivation setting. By leveraging automated processes, the proposed approach enables the creation of large-volume, high-quality datasets, significantly reducing human effort. The results indicate that the proposed method not only simplifies dataset creation but also allows researchers to concentrate on other critical tasks, such as refining image labeling and advancing artificial intelligence model creation. This work contributes towards efficient and accurate deep learning applications in smart agriculture.
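For readers unfamiliar with the two classical tools named in the title, the OpenCV sketch below detects dominant lines with the probabilistic Hough transform and rectifies the image with a homography estimated from four reference points. The file names and point coordinates are placeholders, not the paper's pipeline.

```python
# Hedged OpenCV sketch: Hough line detection plus homography-based rectification.
# Image path and reference points are placeholders.
import cv2
import numpy as np

img = cv2.imread("plant_scene.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                        minLineLength=80, maxLineGap=10)
print(0 if lines is None else len(lines), "candidate structure lines")

# Homography from four detected corners (placeholders) to a frontal rectangle.
src = np.float32([[120, 90], [520, 80], [540, 400], [100, 410]])
dst = np.float32([[0, 0], [400, 0], [400, 320], [0, 320]])
H, _ = cv2.findHomography(src, dst)
rectified = cv2.warpPerspective(img, H, (400, 320))
cv2.imwrite("plant_rectified.jpg", rectified)
```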
APA, Harvard, Vancouver, ISO, and other styles
49

Panboonyuen, Teerapong, Kulsawasd Jitkajornwanich, Siam Lawawirojwong, Panu Srestasathiern, and Peerapon Vateekul. "Semantic Segmentation on Remotely Sensed Images Using an Enhanced Global Convolutional Network with Channel Attention and Domain Specific Transfer Learning." Remote Sensing 11, no. 1 (2019): 83. http://dx.doi.org/10.3390/rs11010083.

Full text
Abstract:
In the remote sensing domain, it is crucial to perform semantic segmentation of classes such as river, building, and forest on raster images. A deep convolutional encoder–decoder (DCED) network is the state-of-the-art semantic segmentation method for remotely sensed images. However, the accuracy is still limited, since the network is not designed for remotely sensed images and the training data in this domain are scarce. In this paper, we propose a novel CNN for semantic segmentation, particularly for remote sensing corpora, with three main contributions. First, we propose applying a recent CNN called a global convolutional network (GCN), since it can capture different resolutions by extracting multi-scale features from different stages of the network. Additionally, we further enhance the network by improving its backbone with a larger number of layers, which is suitable for medium resolution remotely sensed images. Second, “channel attention” is presented in our network in order to select the most discriminative filters (features). Third, “domain-specific transfer learning” is introduced to alleviate the scarcity issue by utilizing other remotely sensed corpora with different resolutions as pre-training data. The experiments were conducted on two datasets: (i) medium resolution data collected from the Landsat-8 satellite and (ii) very high resolution data from the ISPRS Vaihingen Challenge Dataset. The results show that our networks outperformed DCED in terms of F1 by 17.48% and 2.49% on the medium and very high resolution corpora, respectively.
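The “channel attention” idea, selecting the most discriminative filters, is often implemented as a squeeze-and-excitation block. The PyTorch sketch below shows that common variant; the paper's exact attention design may differ.

```python
# Hedged sketch of a squeeze-and-excitation style channel attention block.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                     # re-weight feature maps channel-wise

feat = torch.randn(2, 256, 64, 64)             # toy feature maps from a backbone
print(ChannelAttention(256)(feat).shape)       # torch.Size([2, 256, 64, 64])
```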
APA, Harvard, Vancouver, ISO, and other styles
50

Liu, Rui, and Barzan Mozafari. "Transformer with Memory Replay." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 7 (2022): 7567–75. http://dx.doi.org/10.1609/aaai.v36i7.20722.

Full text
Abstract:
Transformers achieve state-of-the-art performance on natural language processing tasks by pre-training on large-scale text corpora. They are extremely compute-intensive and have very high sample complexity. Memory replay is a mechanism that remembers and reuses past examples by saving them to and replaying them from a memory buffer. It has been successfully used in reinforcement learning and GANs due to its better sample efficiency. In this paper, we propose Transformer with Memory Replay, which integrates memory replay with the transformer, making the transformer more sample-efficient. Experiments on the GLUE and SQuAD benchmark datasets showed that Transformer with Memory Replay achieves at least a one-percentage-point increase over the baseline transformer model when pre-trained with the same number of examples. Further, by adopting a careful design that reduces the wall-clock time overhead of memory replay, we also empirically achieve better runtime efficiency.
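A minimal way to picture memory replay during pre-training is a buffer that stores recent batches and occasionally replays one alongside fresh data. The sketch below illustrates that general mechanism; the sampling policy, buffer size, and replay rate are assumptions, not the paper's algorithm.

```python
# Hedged sketch of a memory-replay buffer interleaved with a training loop.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)

    def add(self, batch):
        self.buffer.append(batch)

    def sample(self):
        return random.choice(self.buffer)

def training_loop(data_stream, train_step, replay_prob=0.2):
    buffer = ReplayBuffer()
    for batch in data_stream:
        train_step(batch)                      # normal update on fresh data
        buffer.add(batch)
        if random.random() < replay_prob:
            train_step(buffer.sample())        # extra update on a replayed batch

# toy usage: the "training step" just records which batches it saw
seen = []
training_loop(range(10), train_step=lambda b: seen.append(b))
print(seen)   # fresh batches 0..9 interleaved with occasional replays
```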
APA, Harvard, Vancouver, ISO, and other styles