Journal articles on the topic 'Monolingual machine translation'

Consult the top 50 journal articles for your research on the topic 'Monolingual machine translation.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Riezler, Stefan, and Yi Liu. "Query Rewriting Using Monolingual Statistical Machine Translation." Computational Linguistics 36, no. 3 (2010): 569–82. http://dx.doi.org/10.1162/coli_a_00010.

Abstract:
Long queries often suffer from low recall in Web search due to conjunctive term matching. The chances of matching words in relevant documents can be increased by rewriting query terms into new terms with similar statistical properties. We present a comparison of approaches that deploy user query logs to learn rewrites of query terms into terms from the document space. We show that the best results are achieved by adopting the perspective of bridging the “lexical chasm” between queries and documents by translating from a source language of user queries into a target language of Web documents. We train a state-of-the-art statistical machine translation model on query-snippet pairs from user query logs, and extract expansion terms from the query rewrites produced by the monolingual translation system. We show in an extrinsic evaluation in a real-world Web search task that the combination of a query-to-snippet translation model with a query language model achieves improved contextual query expansion compared to a state-of-the-art query expansion model that is trained on the same query log data.
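
To make the idea concrete (this is an illustrative sketch, not the authors' system), query expansion from a monolingual translation table reduces to looking up rewrite candidates per query term; the table, probabilities, and cutoffs below are invented for the example:

```python
# Toy monolingual "phrase table" learned from query-snippet pairs:
# each query term maps to rewrite candidates with translation probabilities.
PHRASE_TABLE = {
    "cheap": [("affordable", 0.41), ("low-cost", 0.27), ("budget", 0.13)],
    "fix":   [("repair", 0.52), ("troubleshoot", 0.21)],
}

def expand_query(query, k=2, min_prob=0.2):
    """Append the top-k rewrite candidates above a probability cutoff."""
    terms = query.split()
    expansions = []
    for term in terms:
        candidates = PHRASE_TABLE.get(term, [])
        expansions += [w for w, p in sorted(candidates, key=lambda c: -c[1])[:k]
                       if p >= min_prob]
    return terms + expansions

print(expand_query("cheap laptop fix"))
# ['cheap', 'laptop', 'fix', 'affordable', 'low-cost', 'repair', 'troubleshoot']
```
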
2

Marie, Benjamin, and Atsushi Fujita. "Synthesizing Parallel Data of User-Generated Texts with Zero-Shot Neural Machine Translation." Transactions of the Association for Computational Linguistics 8 (November 2020): 710–25. http://dx.doi.org/10.1162/tacl_a_00341.

Abstract:
Neural machine translation (NMT) systems are usually trained on clean parallel data. They can perform very well for translating clean in-domain texts. However, as demonstrated by previous work, the translation quality significantly worsens when translating noisy texts, such as user-generated texts (UGT) from online social media. Given the lack of parallel data of UGT that can be used to train or adapt NMT systems, we synthesize parallel data of UGT, exploiting monolingual data of UGT through crosslingual language model pre-training and zero-shot NMT systems. This paper presents two different but complementary approaches: One alters given clean parallel data into UGT-like parallel data whereas the other generates translations from monolingual data of UGT. On the MTNT translation tasks, we show that our synthesized parallel data can lead to better NMT systems for UGT while making them more robust in translating texts from various domains and styles.
3

Irvine, Ann, and Chris Callison-Burch. "End-to-end statistical machine translation with zero or small parallel texts." Natural Language Engineering 22, no. 4 (2016): 517–48. http://dx.doi.org/10.1017/s1351324916000127.

Abstract:
We use bilingual lexicon induction techniques, which learn translations from monolingual texts in two languages, to build an end-to-end statistical machine translation (SMT) system without the use of any bilingual sentence-aligned parallel corpora. We present a detailed analysis of the accuracy of bilingual lexicon induction, and show how a discriminative model can be used to combine various signals of translation equivalence (like contextual similarity, temporal similarity, orthographic similarity and topic similarity). Our discriminative model produces higher-accuracy translations than previous bilingual lexicon induction techniques. We reuse these signals of translation equivalence as features in a phrase-based SMT system. These monolingually estimated features enhance low-resource SMT systems in addition to allowing end-to-end machine translation without parallel corpora.
4

Lesatari, Aufa Eka Putri, Arie Ardiyanti, and Ibnu Asror. "Phrase Based Statistical Machine Translation Javanese-Indonesian." JURNAL MEDIA INFORMATIKA BUDIDARMA 5, no. 2 (2021): 378. http://dx.doi.org/10.30865/mib.v5i2.2812.

Abstract:
This research aims to produce a statistical machine translation system that can be implemented to perform Javanese-Indonesian translation, and to determine the influence of the main data sources of statistical machine translation, namely the parallel corpus and the monolingual corpus, on the quality of Javanese-Indonesian statistical machine translation. Testing was carried out by gradually adding to the quantity of the parallel corpus and the monolingual corpus across seven configurations of the Javanese-Indonesian statistical machine translation system. All machine translation configurations were tested with test data totaling 500 lines of Javanese sentences. Results from the machine translation system were evaluated automatically using the Bilingual Evaluation Understudy (BLEU). Test results across the seven configurations showed an increase in the evaluation score of the translation system after the quantities of the parallel corpus and the monolingual corpus were increased. Increasing the parallel corpus quantity raised the BLEU score by 3.6% between configurations 1 and 2, by 8.23% between configurations 2 and 3, and by 14.92% between configurations 3 and 7. Additional monolingual corpus quantity increased the BLEU score by 0.18% between configurations 4 and 5, by 0.06% between configurations 5 and 6, and by 0.24% between configurations 6 and 7. The test results showed that the quantities of both the parallel corpus and the monolingual corpus can increase the evaluation score of Javanese-Indonesian statistical machine translation, but the quantity of the parallel corpus has a greater influence than the quantity of the monolingual corpus.
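
BLEU scoring of this kind is straightforward to reproduce with the sacrebleu package; a minimal sketch, with invented Indonesian-style sentences rather than the paper's data:

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["saya pergi ke pasar pagi ini"]        # system translations
references = [["saya pergi ke pasar tadi pagi"]]     # one stream of references
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```
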
5

Yu, Lei, Laurent Sartran, Wojciech Stokowiec, et al. "Better Document-Level Machine Translation with Bayes’ Rule." Transactions of the Association for Computational Linguistics 8 (July 2020): 346–60. http://dx.doi.org/10.1162/tacl_a_00319.

Abstract:
We show that Bayes’ rule provides an effective mechanism for creating document translation models that can be learned from only parallel sentences and monolingual documents, a compelling benefit because parallel documents are not always available. In our formulation, the posterior probability of a candidate translation is the product of the unconditional (prior) probability of the candidate output document and the “reverse translation probability” of translating the candidate output back into the source language. Our proposed model uses a powerful autoregressive language model as the prior on target language documents, but it assumes that each sentence is translated independently from the target to the source language. Crucially, at test time, when a source document is observed, the document language model prior induces dependencies between the translations of the source sentences in the posterior. The model’s independence assumption not only enables efficient use of available data, but it additionally admits a practical left-to-right beam-search algorithm for carrying out inference. Experiments show that our model benefits from using cross-sentence context in the language model, and it outperforms existing document translation approaches.
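
The factorization the abstract describes is the classic noisy-channel decomposition; writing x for the source document with sentences x_i and y for a candidate output document with sentences y_i, the model selects

```latex
\hat{y} \;=\; \arg\max_{y}\; \underbrace{p(y)}_{\text{document language model prior}} \;\times\; \underbrace{\prod_{i} p(x_i \mid y_i)}_{\text{independent sentence-level reverse translation}}
```

so only a target-side language model (trainable on monolingual documents) and a sentence-level reverse translation model (trainable on parallel sentences) are ever needed.
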
6

Zhang, Wenbo, Xiao Li, Yating Yang, and Rui Dong. "Pre-Training on Mixed Data for Low-Resource Neural Machine Translation." Information 12, no. 3 (2021): 133. http://dx.doi.org/10.3390/info12030133.

Abstract:
The pre-training fine-tuning mode has been shown to be effective for low-resource neural machine translation. In this mode, models pre-trained on monolingual data are used to initialize translation models, transferring knowledge from the monolingual data into them. In recent years, pre-training models usually take sentences with randomly masked words as input, and are trained by predicting these masked words based on the unmasked ones. In this paper, we propose a new pre-training method that still predicts masked words, but randomly replaces some of the unmasked words in the input with their translation words in another language. The translation words come from bilingual data, so that the data for pre-training contains both monolingual data and bilingual data. We conduct experiments on a Uyghur-Chinese corpus to evaluate our method. The experimental results show that our method gives the pre-training model better generalization ability and helps the translation model achieve better performance. Through a word translation task, we also demonstrate that our method enables the embeddings of the translation model to acquire more alignment knowledge.
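
A minimal sketch of the data preparation as the abstract describes it, masking some tokens and code-switching some of the remaining ones via a bilingual dictionary; the dictionary, rates, and example sentence are invented (the paper itself uses Uyghur-Chinese data):

```python
import random

# Hypothetical dictionary entries standing in for real bilingual data.
BILINGUAL_DICT = {"book": "Buch", "good": "gut", "house": "Haus"}

def make_pretraining_example(tokens, mask_rate=0.15, replace_rate=0.10,
                             mask_token="[MASK]", seed=0):
    """Return (inputs, targets): targets map positions to masked-out words."""
    rng = random.Random(seed)
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        r = rng.random()
        if r < mask_rate:                                  # mask and predict later
            inputs.append(mask_token)
            targets[i] = tok
        elif r < mask_rate + replace_rate and tok in BILINGUAL_DICT:
            inputs.append(BILINGUAL_DICT[tok])             # translation replacement
        else:
            inputs.append(tok)                             # keep unchanged
    return inputs, targets

print(make_pretraining_example("the good book is in the house".split()))
```
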
7

Wang, Rui. "Neural Network Machine Translation Method Based on Unsupervised Domain Adaptation." Complexity 2020 (December 24, 2020): 1–11. http://dx.doi.org/10.1155/2020/6657344.

Abstract:
Relying on large-scale parallel corpora, neural machine translation has achieved great success in certain language pairs. However, the acquisition of a high-quality parallel corpus is one of the main difficulties in machine translation research. To solve this problem, this paper proposes unsupervised domain-adaptive neural network machine translation, which can be trained using only two unrelated monolingual corpora and still obtain a good translation result. The method first measures the matching degree of translation rules by adding relevant subject information to the rules and dynamically calculating the similarity between each translation rule and the document to be translated during the decoding process. Secondly, through the joint training of multiple training tasks, the source language can learn useful semantic and structural information from the monolingual corpus of a third language that is not parallel to the current two languages during the process of translation into the target language. Experimental results show that this method obtains better results than traditional statistical machine translation.
8

Li, Zhen, Dan Qu, Chaojie Xie, Wenlin Zhang, and Yanxia Li. "Language Model Pre-training Method in Machine Translation Based on Named Entity Recognition." International Journal on Artificial Intelligence Tools 29, no. 07n08 (2020): 2040021. http://dx.doi.org/10.1142/s0218213020400217.

Abstract:
The Neural Machine Translation (NMT) model has become the mainstream technology in machine translation. Supervised neural machine translation models are trained on abundant sentence-level parallel corpora, but for a low-resource language or dialect with no such corpus available, it is difficult to achieve good performance. Researchers have therefore begun to focus on unsupervised neural machine translation (UNMT), which uses monolingual corpora as training data. UNMT needs to construct a language model (LM), which learns semantic information from the monolingual corpus. This paper focuses on the pre-training of the LM in unsupervised machine translation and proposes a pre-training method, NER-MLM (named entity recognition masked language model). By performing NER, the proposed method can obtain better semantic information and language model parameters with better training results. In the unsupervised machine translation task, the BLEU scores on the WMT’16 English–French and English–German data sets are 35.30 and 27.30, respectively. To the best of our knowledge, these are the highest results reported in the field of UNMT so far.
9

Yang, Zhen, Wei Chen, Feng Wang, and Bo Xu. "Effectively training neural machine translation models with monolingual data." Neurocomputing 333 (March 2019): 240–47. http://dx.doi.org/10.1016/j.neucom.2018.12.032.

10

He, Xiaodong, Mei Yang, Jianfeng Gao, Patrick Nguyen, and Robert Moore. "Improved Monolingual Hypothesis Alignment for Machine Translation System Combination." ACM Transactions on Asian Language Information Processing 8, no. 2 (2009): 1–19. http://dx.doi.org/10.1145/1526252.1526254.

11

Khaikal, Muhammad Fiqri, and Arie Ardiyanti Suryani. "Statistical Machine Translation Dayak Language – Indonesia Language." Informatika Mulawarman : Jurnal Ilmiah Ilmu Komputer 16, no. 1 (2021): 49. http://dx.doi.org/10.30872/jim.v16i1.5315.

Abstract:
This paper discusses the creation of a machine translation system for a local language of Indonesia; a local language was chosen because machine translators for local languages, and for Dayak in particular, are still rarely found. The machine translation system in this research uses a statistical approach, with data taken from articles on the dayaknews.com pages: a Dayak-Indonesian parallel corpus of approximately 1000 sentences, divided into three sections in order to analyse the patterns that emerge, and an Indonesian monolingual corpus of approximately 1000 sentences. Testing was carried out using the Bilingual Evaluation Understudy (BLEU) tool and yielded a highest accuracy value of 49.15%, an increase of approximately 3% over other machine translators.
12

Soto, Xabier, Olatz Perez-de-Viñaspre, Gorka Labaka, and Maite Oronoz. "Neural machine translation of clinical texts between long distance languages." Journal of the American Medical Informatics Association 26, no. 12 (2019): 1478–87. http://dx.doi.org/10.1093/jamia/ocz110.

Abstract:
Objective: To analyze techniques for machine translation of electronic health records (EHRs) between long-distance languages, using Basque and Spanish as a reference. We studied distinct configurations of neural machine translation systems and used different methods to overcome the lack of a bilingual corpus of clinical texts or health records in Basque and Spanish. Materials and Methods: We trained recurrent neural networks on an out-of-domain corpus with different hyperparameter values. Subsequently, we used the optimal configuration to evaluate machine translation of EHR templates between Basque and Spanish, using manual translations of the Basque templates into Spanish as a standard. We successively added to the training corpus clinical resources, including a Spanish-Basque dictionary derived from resources built for the machine translation of the Spanish edition of SNOMED CT into Basque, artificial sentences in Spanish and Basque derived from frequently occurring relationships in SNOMED CT, and Spanish monolingual EHRs. Apart from calculating bilingual evaluation understudy (BLEU) values, we tested the performance in the clinical domain by human evaluation. Results: We achieved slight improvements over our reference system by tuning some hyperparameters using an out-of-domain bilingual corpus, obtaining 10.67 BLEU points for Basque-to-Spanish clinical domain translation. The inclusion of clinical terminology in Spanish and Basque and the application of the back-translation technique on monolingual EHRs significantly improved the performance, obtaining 21.59 BLEU points. This was confirmed by the human evaluation performed by 2 clinicians, who ranked our machine translations close to the human translations. Discussion: We showed that, even after optimizing the hyperparameters out-of-domain, the inclusion of available resources from the clinical domain and the applied methods were beneficial for the described objective, managing to obtain adequate translations of EHR templates. Conclusion: We have developed a system which is able to properly translate health record templates from Basque to Spanish without making use of any bilingual corpus of clinical texts or health records.
13

Luo, Gong-Xu, Ya-Ting Yang, Rui Dong, Yan-Hong Chen, and Wen-Bo Zhang. "A Joint Back-Translation and Transfer Learning Method for Low-Resource Neural Machine Translation." Mathematical Problems in Engineering 2020 (May 31, 2020): 1–11. http://dx.doi.org/10.1155/2020/6140153.

Abstract:
Neural machine translation (NMT) for low-resource languages has drawn great attention in recent years. In this paper, we propose a joint back-translation and transfer learning method for low-resource languages. It is widely recognized that data augmentation methods and transfer learning methods are both straightforward and effective ways to address low-resource problems. However, existing methods, which utilize only one of these methods alone, limit the capacity of NMT models for low-resource problems. In order to make full use of the advantages of existing methods and further improve the translation performance of low-resource languages, we propose a new method to perfectly integrate the back-translation method with mainstream transfer learning architectures, which can not only initialize the NMT model by transferring parameters of the pretrained models, but also generate synthetic parallel data by translating large-scale monolingual data of the target side to boost the fluency of translations. We conduct experiments to explore the effectiveness of the joint method by incorporating back-translation into the parent-child and the hierarchical transfer learning architectures. In addition, different preprocessing and training methods are explored to get better performance. Experimental results on Uygur-Chinese and Turkish-English translation demonstrate the superiority of the proposed method over baselines that use single methods.
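
The back-translation half of the method can be sketched in a few lines; `backward_model` here is a hypothetical target-to-source translator standing in for whatever pretrained NMT system is used:

```python
def back_translate(target_monolingual, backward_model):
    """Build synthetic (source, target) pairs from target-side monolingual text.

    The target side stays clean human text; only the source side is
    machine-generated, which is what makes back-translation effective.
    """
    synthetic_pairs = []
    for tgt_sentence in target_monolingual:
        src_sentence = backward_model.translate(tgt_sentence)  # hypothetical API
        synthetic_pairs.append((src_sentence, tgt_sentence))
    return synthetic_pairs

# Training data for the child model would then be, e.g.:
# train_data = real_parallel_pairs + back_translate(target_mono, backward_model)
```
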
14

Maruyama, Takumi, and Kazuhide Yamamoto. "Extremely Low-Resource Text Simplification with Pre-trained Transformer Language Model." International Journal of Asian Language Processing 30, no. 01 (2020): 2050001. http://dx.doi.org/10.1142/s2717554520500010.

Abstract:
Inspired by the machine translation task, recent text simplification approaches regard the task as monolingual text-to-text generation, and neural machine translation models have significantly improved the performance of simplification tasks. Although such models require a large-scale parallel corpus, corpora for text simplification are very few in number and smaller in size compared to those for machine translation. Therefore, we have attempted to facilitate the training of simplification rewritings using pre-training from a large-scale monolingual corpus such as Wikipedia articles. In addition, we propose a translation language model to seamlessly conduct fine-tuning of text simplification from the pre-training of the language model. The experimental results show that the translation language model substantially outperforms a state-of-the-art model under a low-resource setting. In addition, a pre-trained translation language model with only 3000 supervised examples can achieve performance comparable to that of the state-of-the-art model using 30,000 supervised examples.
15

Wołk, Krzysztof, Agnieszka Wołk, and Krzysztof Marasek. "Semantic approach for building generated virtual-parallel corpora from monolingual texts." Poznan Studies in Contemporary Linguistics 55, no. 2 (2019): 469–90. http://dx.doi.org/10.1515/psicl-2019-0017.

Abstract:
Several natural languages have undergone a great deal of processing, but the problem of limited textual linguistic resources remains. The manual creation of parallel corpora by humans is rather expensive and time consuming, while the language data required for statistical machine translation (SMT) do not exist in adequate quantities for their statistical information to be used to initiate the research process. On the other hand, applying known approaches to build parallel resources from multiple sources, such as comparable or quasi-comparable corpora, is very complicated and provides rather noisy output, which later needs to be further processed and requires in-domain adaptation. To optimize the performance of comparable corpora mining algorithms, it is essential to use a quality parallel corpus for training of a good data classifier. In this research, we have developed a methodology for generating an accurate parallel corpus (Czech-English, Polish-English) from monolingual resources by calculating the compatibility between the results of three machine translation systems. We have created translations of large, single-language resources by applying multiple translation systems and strictly measuring translation compatibility using rules based on the Levenshtein distance. The results produced by this approach were very favorable. The generated corpora successfully improved the quality of SMT systems and seem to be useful for many other natural language processing tasks.
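
The compatibility measurement can be illustrated with a plain Levenshtein ratio over the outputs of several MT systems; the threshold and sentences are invented, and the paper's actual rules are more elaborate:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def compatible(translations, threshold=0.85):
    """Accept a sentence when all MT outputs agree pairwise above a ratio."""
    for i in range(len(translations)):
        for j in range(i + 1, len(translations)):
            a, b = translations[i], translations[j]
            ratio = 1 - levenshtein(a, b) / max(len(a), len(b))
            if ratio < threshold:
                return False
    return True

print(compatible(["the cat sat", "the cat sat", "the cat sits"]))
```
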
16

Xu, Guanghao, Youngjoong Ko, and Jungyun Seo. "Improving Neural Machine Translation by Filtering Synthetic Parallel Data." Entropy 21, no. 12 (2019): 1213. http://dx.doi.org/10.3390/e21121213.

Abstract:
Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains a lot of noise: weakly paired sentences or translation errors. In this paper, we propose a novel approach to filter this noise from synthetic data. For each sentence pair of the synthetic data, we compute a semantic similarity score using bilingual word embeddings. By selecting sentence pairs according to these scores, we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 BLEU points for tst2016 and tst2017, respectively.
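
Assuming source and target sentences can be embedded into one shared space, the filtering step reduces to a cosine threshold; the embeddings below are random stand-ins for real bilingual word embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = {w: rng.normal(size=50)                      # stand-in shared-space vectors
       for w in ["the", "cat", "sat", "die", "katze", "sass"]}

def sentence_vec(tokens):
    return np.mean([EMB[t] for t in tokens if t in EMB], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def filter_pairs(pairs, threshold=0.4):
    """Keep synthetic pairs whose two sides are similar enough in embedding space."""
    return [(s, t) for s, t in pairs
            if cosine(sentence_vec(s.split()), sentence_vec(t.split())) >= threshold]

print(filter_pairs([("the cat sat", "die katze sass")]))
```
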
17

Silva, Carlos Eduardo, and Lincoln Fernandes. "Apresentando o copa-trad versão 2.0 um sistema com base em corpus paralelo para pesquisa, ensino e prática da tradução." Ilha do Desterro A Journal of English Language, Literatures in English and Cultural Studies 73, no. 1 (2020): 297–316. http://dx.doi.org/10.5007/2175-8026.2020v73n1p297.

Abstract:
This paper describes COPA-TRAD Version 2.0, a parallel corpus-based system developed at the Universidade Federal de Santa Catarina (UFSC) for translation research, teaching and practice. COPA-TRAD enables the user to investigate the practices of professional translators by identifying translational patterns related to a particular element or linguistic pattern. In addition, the system allows for the comparison between human translation and automatic translation provided by three well-known machine translation systems available on the Internet (Google Translate, Microsoft Translator and Yandex). Currently, COPA-TRAD incorporates five subcorpora (Children's Literature, Literary Texts, Meta-Discourse in Translation, Subtitles and Legal Texts) and provides the following tools: parallel concordancer, monolingual concordancer, wordlist and a DIY Tool that enables the user to create his own parallel disposable corpus. The system also provides a POS-tagging tool interface to analyze and classify the parts of speech of a text.
18

Oh, Jiun, and Yong-Suk Choi. "Reusing Monolingual Pre-Trained Models by Cross-Connecting Seq2seq Models for Machine Translation." Applied Sciences 11, no. 18 (2021): 8737. http://dx.doi.org/10.3390/app11188737.

Abstract:
This work uses sequence-to-sequence (seq2seq) models pre-trained on monolingual corpora for machine translation. We pre-train two seq2seq models with monolingual corpora for the source and target languages, then combine the encoder of the source language model and the decoder of the target language model, i.e., the cross-connection. We add an intermediate layer between the pre-trained encoder and the decoder to help them map to each other, since the modules are pre-trained completely independently. These monolingual pre-trained models can work as a multilingual pre-trained model, because any one model can be cross-connected with another model pre-trained on any other language, while their capacity is not affected by the number of languages. We demonstrate that our method improves the translation performance significantly over the random baseline. Moreover, we analyze the appropriate choice of the intermediate layer, the importance of each part of a pre-trained model, and the performance change along with the size of the bitext.
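
A minimal PyTorch sketch of the cross-connection, assuming the two halves were pre-trained separately; the dimensions, layer counts, and use of raw tensors in place of embedded token sequences are arbitrary simplifications:

```python
import torch
import torch.nn as nn

d_model = 512
# Stand-ins for the two independently pre-trained seq2seq halves.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
# Freshly initialized intermediate layer bridging the two representation spaces.
adapter = nn.Linear(d_model, d_model)

src = torch.randn(2, 10, d_model)   # already-embedded source batch
tgt = torch.randn(2, 7, d_model)    # already-embedded target batch
memory = adapter(encoder(src))      # cross-connect through the adapter
output = decoder(tgt, memory)       # shape: (2, 7, 512)
```
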
19

Tezcan, Arda, Véronique Hoste, and Lieve Macken. "Estimating word-level quality of statistical machine translation output using monolingual information alone." Natural Language Engineering 26, no. 1 (2019): 73–94. http://dx.doi.org/10.1017/s1351324919000111.

Abstract:
Various studies show that statistical machine translation (SMT) systems suffer from fluency errors, especially in the form of grammatical errors and errors related to idiomatic word choices. In this study, we investigate the effectiveness of using monolingual information contained in the machine-translated text to estimate word-level quality of SMT output. We propose a recurrent neural network architecture which uses morpho-syntactic features and word embeddings as word representations within surface and syntactic n-grams. We test the proposed method on two language pairs and for two tasks, namely detecting fluency errors and predicting overall post-editing effort. Our results show that this method is effective for capturing all types of fluency errors at once. Moreover, on the task of predicting post-editing effort, while solely relying on monolingual information, it achieves on-par results with the state-of-the-art quality estimation systems which use both bilingual and monolingual information.
20

Weng, Rongxiang, Heng Yu, Shujian Huang, Shanbo Cheng, and Weihua Luo. "Acquiring Knowledge from Pre-Trained Model to Neural Machine Translation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 9266–73. http://dx.doi.org/10.1609/aaai.v34i05.6465.

Abstract:
Pre-training and fine-tuning have achieved great success in the natural language processing field. The standard paradigm for exploiting them includes two steps: first, pre-training a model, e.g., BERT, with large-scale unlabeled monolingual data; then, fine-tuning the pre-trained model with labeled data from downstream tasks. However, in neural machine translation (NMT), we address the problem that the training objective of the bilingual task is far different from that of the monolingual pre-trained model. Because of this gap, fine-tuning alone cannot fully utilize prior language knowledge in NMT. In this paper, we propose an APT framework for acquiring knowledge from pre-trained models for NMT. The proposed approach includes two modules: 1) a dynamic fusion mechanism to fuse task-specific features adapted from general knowledge into the NMT network, and 2) a knowledge distillation paradigm to learn language knowledge continuously during the NMT training process. The proposed approach can integrate suitable knowledge from pre-trained models to improve NMT. Experimental results on WMT English-to-German, German-to-English, and Chinese-to-English machine translation tasks show that our model outperforms strong baselines and the fine-tuning counterparts.
21

Rubino, Raphael, Benjamin Marie, Raj Dabre, Atsushi Fujita, Masao Utiyama, and Eiichiro Sumita. "Extremely low-resource neural machine translation for Asian languages." Machine Translation 34, no. 4 (2020): 347–82. http://dx.doi.org/10.1007/s10590-020-09258-6.

Abstract:
This paper presents a set of effective approaches to handle extremely low-resource language pairs for self-attention based neural machine translation (NMT) focusing on English and four Asian languages. Starting from an initial set of parallel sentences used to train bilingual baseline models, we introduce additional monolingual corpora and data processing techniques to improve translation quality. We describe a series of best practices and empirically validate the methods through an evaluation conducted on eight translation directions, based on state-of-the-art NMT approaches such as hyper-parameter search, data augmentation with forward and backward translation in combination with tags and noise, as well as joint multilingual training. Experiments show that the commonly used default architecture of self-attention NMT models does not reach the best results, validating previous work on the importance of hyper-parameter tuning. Additionally, empirical results indicate the amount of synthetic data required to efficiently increase the parameters of the models leading to the best translation quality measured by automatic metrics. We show that the best NMT models trained on large amount of tagged back-translations outperform three other synthetic data generation approaches. Finally, comparison with statistical machine translation (SMT) indicates that extremely low-resource NMT requires a large amount of synthetic parallel data obtained with back-translation in order to close the performance gap with the preceding SMT approach.
22

Kajiwara, Tomoyuki, Biwa Miura, and Yuki Arase. "Monolingual Transfer Learning via Bilingual Translators for Style-Sensitive Paraphrase Generation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 8042–49. http://dx.doi.org/10.1609/aaai.v34i05.6314.

Abstract:
We tackle the low-resource problem in style transfer by employing transfer learning that utilizes abundantly available raw corpora. Our method consists of two steps: pre-training learns to generate a sentence that is semantically equivalent to the input with assured grammaticality, and fine-tuning learns to add a desired style. Pre-training has two options: auto-encoding and machine-translation-based methods. Pre-training based on an autoencoder is a simple way to learn from a raw corpus. If machine translators are available, the model can learn more diverse paraphrasing via roundtrip translation. After these steps, fine-tuning achieves high-quality paraphrase generation even in situations where only 1k sentence pairs of the parallel corpus for style transfer are available. Experimental results on formality style transfer indicate the effectiveness of both pre-training methods, and the method based on roundtrip translation achieves state-of-the-art performance.
23

Wang, Yiren, Lijun Wu, Yingce Xia, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. "Transductive Ensemble Learning for Neural Machine Translation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (2020): 6291–98. http://dx.doi.org/10.1609/aaai.v34i04.6097.

Abstract:
Ensemble learning, which aggregates multiple diverse models for inference, is a common practice to improve the accuracy of machine learning tasks. However, it has been observed that the conventional ensemble methods only bring marginal improvement for neural machine translation (NMT) when individual models are strong or there are a large number of individual models. In this paper, we study how to effectively aggregate multiple NMT models under the transductive setting where the source sentences of the test set are known. We propose a simple yet effective approach named transductive ensemble learning (TEL), in which we use all individual models to translate the source test set into the target language space and then finetune a strong model on the translated synthetic corpus. We conduct extensive experiments on different settings (with/without monolingual data) and different language pairs (English↔{German, Finnish}). The results show that our approach boosts strong individual models with significant improvement and benefits a lot from more individual models. Specifically, we achieve the state-of-the-art performances on the WMT2016-2018 English↔German translations.
24

Marie, Benjamin, and Atsushi Fujita. "Phrase Table Induction Using In-Domain Monolingual Data for Domain Adaptation in Statistical Machine Translation." Transactions of the Association for Computational Linguistics 5 (December 2017): 487–500. http://dx.doi.org/10.1162/tacl_a_00075.

Abstract:
We present a new framework to induce an in-domain phrase table from in-domain monolingual data that can be used to adapt a general-domain statistical machine translation system to the targeted domain. Our method first compiles sets of phrases in source and target languages separately and generates candidate phrase pairs by taking the Cartesian product of the two phrase sets. It then computes inexpensive features for each candidate phrase pair and filters them using a supervised classifier in order to induce an in-domain phrase table. We experimented on the language pair English–French, both translation directions, in two domains and obtained consistently better results than a strong baseline system that uses an in-domain bilingual lexicon. We also conducted an error analysis that showed the induced phrase tables proposed useful translations, especially for words and phrases unseen in the parallel data used to train the general-domain baseline system.
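
The candidate-generation step is literally a Cartesian product followed by inexpensive scoring; this sketch substitutes a made-up length-ratio feature and threshold for the paper's supervised classifier:

```python
from itertools import product

src_phrases = ["machine translation", "neural network"]
tgt_phrases = ["traduction automatique", "réseau de neurones", "pomme"]

def cheap_score(src, tgt):
    """Stand-in feature: word-length ratio (the paper combines many features)."""
    ls, lt = len(src.split()), len(tgt.split())
    return min(ls, lt) / max(ls, lt)

candidates = product(src_phrases, tgt_phrases)   # Cartesian product of phrase sets
phrase_table = [(s, t) for s, t in candidates if cheap_score(s, t) > 0.5]
print(phrase_table)
```
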
25

Sulubacak, Umut, Ozan Caglayan, Stig-Arne Grönroos, et al. "Multimodal machine translation through visuals and speech." Machine Translation 34, no. 2-3 (2020): 97–147. http://dx.doi.org/10.1007/s10590-020-09250-0.

Abstract:
Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.
26

Togeby, Ole. "Parsing Danish Text in Eurotra." Nordic Journal of Linguistics 11, no. 1-2 (1988): 175–91. http://dx.doi.org/10.1017/s0332586500001803.

Abstract:
The machine translation project Eurotra is described as a multilanguage modular translation system with 9 monolingual analysis modules, 72 bilingual transfer modules, and 9 monolingual synthesis modules. The analysis module for Danish is described as a three-step parser with structure generation rules for immediate constituent structure, syntactic structure, and semantic structure, and translation rules between them. The topological grammatical description of Danish proposed by Paul Diderichsen is shown to be useful in building the parser for Danish, especially with respect to the interaction between empty slots and filled slots in the topological pattern. Lastly, the special problem of parsing and disambiguating sentences that allow many PP attachment patterns is mentioned, and a solution is suggested.
27

Marie, Benjamin, and Atsushi Fujita. "Phrase Table Induction Using Monolingual Data for Low-Resource Statistical Machine Translation." ACM Transactions on Asian and Low-Resource Language Information Processing 17, no. 3 (2018): 1–25. http://dx.doi.org/10.1145/3168054.

28

Wan, Yu, Baosong Yang, Derek F. Wong, Lidia S. Chao, Haihua Du, and Ben C. H. Ao. "Unsupervised Neural Dialect Translation with Commonality and Diversity Modeling." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 9130–37. http://dx.doi.org/10.1609/aaai.v34i05.6448.

Abstract:
As a special machine translation task, dialect translation has two main characteristics: 1) lack of a parallel training corpus; and 2) similar grammar on the two sides of the translation. In this paper, we investigate how to exploit the commonality and diversity between dialects to build unsupervised translation models that access only monolingual data. Specifically, we leverage pivot-private embedding, layer coordination, and parameter sharing to sufficiently model commonality and diversity among source and target, ranging from the lexical, through the syntactic, to the semantic level. In order to examine the effectiveness of the proposed models, we collect a 20-million-sentence monolingual corpus for each of Mandarin and Cantonese, which are the official language and the most widely used dialect in China. Experimental results reveal that our methods outperform rule-based simplified-traditional Chinese conversion and conventional unsupervised translation models by over 12 BLEU points.
29

Ahmadnia, Benyamin, and Bonnie J. Dorr. "Augmenting Neural Machine Translation through Round-Trip Training Approach." Open Computer Science 9, no. 1 (2019): 268–78. http://dx.doi.org/10.1515/comp-2019-0019.

Abstract:
The quality of Neural Machine Translation (NMT), as a data-driven approach, massively depends on the quantity, quality, and relevance of the training dataset. Such approaches have achieved promising results for bilingually high-resource scenarios but are inadequate for low-resource conditions. Generally, NMT systems learn from millions of words of bilingual training data. However, the human labeling process is very costly and time-consuming. In this paper, we describe a round-trip training approach to bilingual low-resource NMT that takes advantage of monolingual datasets to address the training data bottleneck, thus augmenting translation quality. We conduct detailed experiments on English-Spanish as a high-resource language pair as well as Persian-Spanish as a low-resource language pair. Experimental results show that this competitive approach outperforms the baseline systems and improves translation quality.
30

Peris, Álvaro, Mara Chinea-Ríos, and Francisco Casacuberta. "Neural Networks Classifier for Data Selection in Statistical Machine Translation." Prague Bulletin of Mathematical Linguistics 108, no. 1 (2017): 283–94. http://dx.doi.org/10.1515/pralin-2017-0027.

Abstract:
Corpora are precious resources, as they allow for a proper estimation of statistical machine translation models. Data selection is a variant of the domain adaptation field, aimed at extracting those sentences from an out-of-domain corpus that are the most useful for translating a different target domain. We address the data selection problem in statistical machine translation as a classification task. We present a new method, based on neural networks, able to deal with monolingual and bilingual corpora. Empirical results show that our data selection method provides slightly better translation quality, compared to a state-of-the-art method (cross-entropy), while requiring substantially less data. Moreover, the results obtained are coherent across different language pairs, demonstrating the robustness of our proposal.
31

Hu, J. Edward, Rachel Rudinger, Matt Post, and Benjamin Van Durme. "PARABANK: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-Constrained Neural Machine Translation." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6521–28. http://dx.doi.org/10.1609/aaai.v33i01.33016521.

Abstract:
We present PARABANK, a large-scale English paraphrase dataset that surpasses prior work in both quantity and quality. Following the approach of PARANMT (Wieting and Gimpel, 2018), we train a Czech-English neural machine translation (NMT) system to generate novel paraphrases of English reference sentences. By adding lexical constraints to the NMT decoding procedure, however, we are able to produce multiple high-quality sentential paraphrases per source sentence, yielding an English paraphrase resource with more than 4 billion generated tokens and exhibiting greater lexical diversity. Using human judgments, we also demonstrate that PARABANK’s paraphrases improve over PARANMT on both semantic similarity and fluency. Finally, we use PARABANK to train a monolingual NMT model with the same support for lexically-constrained decoding for sentence rewriting tasks.
32

Xia, Yingce, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. "Tied Transformers: Neural Machine Translation with Shared Encoder and Decoder." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 5466–73. http://dx.doi.org/10.1609/aaai.v33i01.33015466.

Abstract:
Sharing source and target side vocabularies and word embeddings has been a popular practice in neural machine translation (briefly, NMT) for similar languages (e.g., English to French or German translation). The success of such word-level sharing motivates us to move one step further: we consider model-level sharing and tie the whole parts of the encoder and decoder of an NMT model. We share the encoder and decoder of Transformer (Vaswani et al. 2017), the state-of-the-art NMT model, and obtain a compact model named Tied Transformer. Experimental results demonstrate that such a simple method works well for both similar and dissimilar language pairs. We empirically verify our framework for both supervised NMT and unsupervised NMT: we achieve a 35.52 BLEU score on IWSLT 2014 German to English translation, 28.98/29.89 BLEU scores on WMT 2014 English to German translation without/with monolingual data, and a 22.05 BLEU score on WMT 2016 unsupervised German to English translation.
33

Irvine, Ann, and Chris Callison-Burch. "A Comprehensive Analysis of Bilingual Lexicon Induction." Computational Linguistics 43, no. 2 (2017): 273–310. http://dx.doi.org/10.1162/coli_a_00284.

Abstract:
Bilingual lexicon induction is the task of inducing word translations from monolingual corpora in two languages. In this article we present the most comprehensive analysis of bilingual lexicon induction to date. We present experiments on a wide range of languages and data sizes. We examine translation into English from 25 foreign languages: Albanian, Azeri, Bengali, Bosnian, Bulgarian, Cebuano, Gujarati, Hindi, Hungarian, Indonesian, Latvian, Nepali, Romanian, Serbian, Slovak, Somali, Spanish, Swedish, Tamil, Telugu, Turkish, Ukrainian, Uzbek, Vietnamese, and Welsh. We analyze the behavior of bilingual lexicon induction on low-frequency words, rather than testing solely on high-frequency words, as previous research has done. Low-frequency words are more relevant to statistical machine translation, where systems typically lack translations of rare words that fall outside of their training data. We systematically explore a wide range of features and phenomena that affect the quality of the translations discovered by bilingual lexicon induction. We provide illustrative examples of the highest ranking translations for orthogonal signals of translation equivalence like contextual similarity and temporal similarity. We analyze the effects of frequency and burstiness, and the sizes of the seed bilingual dictionaries and the monolingual training corpora. Additionally, we introduce a novel discriminative approach to bilingual lexicon induction. Our discriminative model is capable of combining a wide variety of features that individually provide only weak indications of translation equivalence. When feature weights are discriminatively set, these signals produce dramatically higher translation quality than previous approaches that combined signals in an unsupervised fashion (e.g., using minimum reciprocal rank). We also directly compare our model's performance against a sophisticated generative approach, the matching canonical correlation analysis (MCCA) algorithm used by Haghighi et al. (2008). Our algorithm achieves an accuracy of 42% versus MCCA's 15%.
34

Liu, Yinhan, Jiatao Gu, Naman Goyal, et al. "Multilingual Denoising Pre-training for Neural Machine Translation." Transactions of the Association for Computational Linguistics 8 (November 2020): 726–42. http://dx.doi.org/10.1162/tacl_a_00343.

Abstract:
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART, a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective (Lewis et al., 2019). mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, whereas previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low-resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that it enables transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.
35

Li, Yu, Xiao Li, Yating Yang, and Rui Dong. "A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation." Information 11, no. 5 (2020): 255. http://dx.doi.org/10.3390/info11050255.

Abstract:
One important issue that affects the performance of neural machine translation is the scale of available parallel data. For low-resource languages, the amount of parallel data is not sufficient, which results in poor translation quality. In this paper, we propose a diversity data augmentation method that does not use extra monolingual data. We expand the training data by generating diverse pseudo-parallel data on the source and target sides. To generate diverse data, a restricted sampling strategy is employed at the decoding steps. Finally, we filter and merge the original data and the synthetic parallel corpus to train the final model. In experiments, the proposed approach achieved a 1.96 BLEU point improvement on the IWSLT2014 German–English translation task, which was used to simulate a low-resource language pair. Our approach also consistently and substantially obtained 1.0 to 2.0 BLEU point improvements in three other low-resource translation tasks, including English–Turkish, Nepali–English, and Sinhala–English.
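
Restricted sampling at each decoding step can be sketched with plain numpy; the logits, vocabulary, and hyperparameters are invented, and real decoders apply this per time step over subword vocabularies:

```python
import numpy as np

rng = np.random.default_rng(42)

def top_k_sample(logits, k=3, temperature=0.8):
    """Sample only among the k most probable tokens: diverse but not degenerate."""
    logits = np.asarray(logits, dtype=float) / temperature
    top = np.argsort(logits)[-k:]                   # indices of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

vocab = ["the", "a", "cat", "dog", "sat"]
print(vocab[top_k_sample([2.1, 1.9, 0.3, 0.2, -1.0])])
```
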
36

Mohiuddin, Tasnim, and Shafiq Joty. "Unsupervised Word Translation with Adversarial Autoencoder." Computational Linguistics 46, no. 2 (2020): 257–88. http://dx.doi.org/10.1162/coli_a_00374.

Abstract:
Crosslingual word embeddings learned from monolingual embeddings play a crucial role in many downstream tasks, ranging from machine translation to transfer learning. Adversarial training has shown impressive success in learning crosslingual embeddings and the associated word translation task without any parallel data by mapping monolingual embeddings to a shared space. However, recent work has shown superior performance for non-adversarial methods in more challenging language pairs. In this article, we investigate adversarial autoencoders for unsupervised word translation and propose two novel extensions that yield more stable training and improved results. Our method includes regularization terms to enforce cycle consistency and input reconstruction, and puts the target encoders as an adversary against the corresponding discriminator. We use two types of refinement procedures sequentially after obtaining the trained encoders and mappings from the adversarial training, namely, refinement with the Procrustes solution and refinement with symmetric re-weighting. Extensive experiments with high- and low-resource languages from two different data sets show that our method achieves better performance than existing adversarial and non-adversarial approaches and is also competitive with the supervised system. Along with performing comprehensive ablation studies to understand the contribution of different components of our adversarial model, we also conduct a thorough analysis of the refinement procedures to understand their effects.
37

Moreo Fernández, Alejandro, Andrea Esuli, and Fabrizio Sebastiani. "Lightweight Random Indexing for Polylingual Text Classification." Journal of Artificial Intelligence Research 57 (October 13, 2016): 151–85. http://dx.doi.org/10.1613/jair.5194.

Abstract:
Multilingual Text Classification (MLTC) is a text classification task in which documents are written each in one among a set L of natural languages, and in which all documents must be classified under the same classification scheme, irrespective of language. There are two main variants of MLTC, namely Cross-Lingual Text Classification (CLTC) and Polylingual Text Classification (PLTC). In PLTC, which is the focus of this paper, we assume (differently from CLTC) that for each language in L there is a representative set of training documents; PLTC consists of improving the accuracy of each of the |L| monolingual classifiers by also leveraging the training documents written in the other (|L| − 1) languages. The obvious solution, consisting of generating a single polylingual classifier from the juxtaposed monolingual vector spaces, is usually infeasible, since the dimensionality of the resulting vector space is roughly |L| times that of a monolingual one, and is thus often unmanageable. As a response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or are not always free to use. One machine-translation-free and dictionary-free method that, to the best of our knowledge, has never been applied to PLTC before, is Random Indexing (RI). We analyse RI in terms of space and time efficiency, and propose a particular configuration of it (that we dub Lightweight Random Indexing – LRI). By running experiments on two well known public benchmarks, Reuters RCV1/RCV2 (a comparable corpus) and JRC-Acquis (a parallel one), we show LRI to outperform (both in terms of effectiveness and efficiency) a number of previously proposed machine-translation-free and dictionary-free PLTC methods that we use as baselines.
38

Li, Hang, and Cong Li. "Word Translation Disambiguation Using Bilingual Bootstrapping." Computational Linguistics 30, no. 1 (2004): 1–22. http://dx.doi.org/10.1162/089120104773633367.

Abstract:
This article proposes a new method for word translation disambiguation, one that uses a machine-learning technique called bilingual bootstrapping. In learning to disambiguate words to be translated, bilingual bootstrapping makes use of a small amount of classified data and a large amount of unclassified data in both the source and the target languages. It repeatedly constructs classifiers in the two languages in parallel and boosts the performance of the classifiers by classifying unclassified data in the two languages and by exchanging information regarding classified data between the two languages. Experimental results indicate that word translation disambiguation based on bilingual bootstrapping consistently and significantly outperforms existing methods that are based on monolingual bootstrapping.
39

Tezcan, Arda, Véronique Hoste, and Lieve Macken. "A Neural Network Architecture for Detecting Grammatical Errors in Statistical Machine Translation." Prague Bulletin of Mathematical Linguistics 108, no. 1 (2017): 133–45. http://dx.doi.org/10.1515/pralin-2017-0015.

Abstract:
In this paper we present a Neural Network (NN) architecture for detecting grammatical errors in Statistical Machine Translation (SMT) using monolingual morpho-syntactic word representations in combination with surface and syntactic context windows. We test our approach on two language pairs and two tasks, namely detecting grammatical errors and predicting overall post-editing effort. Our results show that this approach is not only able to accurately detect grammatical errors but it also performs well as a quality estimation system for predicting overall post-editing effort, which is characterised by all types of MT errors. Furthermore, we show that this approach is portable to other languages.
40

Kapočiūtė-Dzikienė, Jurgita, Askars Salimbajevs, and Raivis Skadiņš. "Monolingual and Cross-Lingual Intent Detection without Training Data in Target Languages." Electronics 10, no. 12 (2021): 1412. http://dx.doi.org/10.3390/electronics10121412.

Abstract:
Due to recent DNN advancements, many NLP problems can be effectively solved using transformer-based models and supervised data. Unfortunately, such data is not available in some languages. This research is based on the assumptions that (1) training data can be obtained by machine translating it from another language; and (2) there are cross-lingual solutions that work without training data in the target language. Consequently, in this research, we use an English dataset and solve the intent detection problem for five target languages (German, French, Lithuanian, Latvian, and Portuguese). When seeking the most accurate solutions, we investigate BERT-based word and sentence transformers together with eager learning classifiers (CNN, BERT fine-tuning, FFNN) and a lazy learning approach (cosine similarity as the memory-based method). We offer and evaluate several strategies to overcome the data scarcity problem with machine translation, cross-lingual models, and a combination of the two. The experimental investigation revealed the robustness of sentence transformers under various cross-lingual conditions. The accuracy of ~0.842, achieved on the English dataset with completely monolingual models, is considered our top line. However, cross-lingual approaches demonstrate similar accuracy levels, reaching ~0.831, ~0.829, ~0.853, ~0.831, and ~0.813 on German, French, Lithuanian, Latvian, and Portuguese, respectively.
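
The lazy-learning variant mentioned above amounts to nearest-neighbour search over stored sentence embeddings; the vectors here are random stand-ins for sentence-transformer outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for sentence-transformer embeddings of labeled training utterances.
MEMORY = {"book_flight": rng.normal(size=(5, 384)),
          "check_weather": rng.normal(size=(5, 384))}

def predict_intent(query_vec):
    """Return the intent whose stored examples are most cosine-similar."""
    q = query_vec / np.linalg.norm(query_vec)
    best_intent, best_sim = None, -1.0
    for intent, vecs in MEMORY.items():
        sims = (vecs / np.linalg.norm(vecs, axis=1, keepdims=True)) @ q
        if sims.max() > best_sim:
            best_intent, best_sim = intent, float(sims.max())
    return best_intent

print(predict_intent(rng.normal(size=384)))
```
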
APA, Harvard, Vancouver, ISO, and other styles
41

Adjeisah, Michael, Guohua Liu, Douglas Omwenga Nyabuga, Richard Nuetey Nortey, and Jinling Song. "Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation." Computational Intelligence and Neuroscience 2021 (April 11, 2021): 1–10. http://dx.doi.org/10.1155/2021/6682385.

Full text
Abstract:
Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains a challenge. This research contributes to the domain with a low-resource English-Twi translation study based on filtered synthetic-parallel corpora. It is often difficult to establish what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose to expand the training data by injecting a synthetic-parallel corpus obtained by translating a monolingual corpus from the target language, based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we extensively use three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel corpora demonstrate that injecting a pseudo-parallel corpus and extensive filtering with sentence-level similarity metrics significantly improve the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach exhibits substantial gains in BLEU and TER scores.
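The filtering step can be sketched as follows: each synthetic sentence pair is mapped to a feature vector (here simple length-based features, a stand-in for the paper's representation), and pairs whose squared Mahalanobis distance from the clean-pair distribution exceeds a threshold are discarded.

```python
# Sketch of Mahalanobis-distance filtering of synthetic parallel pairs.
import numpy as np

def features(src, tgt):
    ls, lt = len(src.split()), len(tgt.split())
    return np.array([ls, lt, ls / max(lt, 1)])

def mahalanobis_filter(clean_pairs, synthetic_pairs, threshold=9.0):
    X = np.array([features(s, t) for s, t in clean_pairs])
    mu = X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    kept = []
    for s, t in synthetic_pairs:
        d = features(s, t) - mu
        if d @ cov_inv @ d <= threshold:   # squared Mahalanobis distance
            kept.append((s, t))
    return kept
```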
APA, Harvard, Vancouver, ISO, and other styles
42

Flati, T., and R. Navigli. "The CQC Algorithm: Cycling in Graphs to Semantically Enrich and Enhance a Bilingual Dictionary." Journal of Artificial Intelligence Research 43 (February 19, 2012): 135–71. http://dx.doi.org/10.1613/jair.3456.

Full text
Abstract:
Bilingual machine-readable dictionaries are knowledge resources useful in many automatic tasks. However, compared to monolingual computational lexicons like WordNet, bilingual dictionaries typically provide less structured information, such as lexical and semantic relations, and often do not cover the entire range of possible translations for a word of interest. In this paper we present Cycles and Quasi-Cycles (CQC), a novel algorithm for the automated disambiguation of ambiguous translations in the lexical entries of a bilingual machine-readable dictionary. The dictionary is represented as a graph, and cyclic patterns are sought in the graph to assign an appropriate sense tag to each translation in a lexical entry. Further, we use the algorithm's output to improve the quality of the dictionary itself by suggesting accurate solutions to structural problems such as misalignments, partial alignments, and missing entries. Finally, we successfully apply CQC to the task of synonym extraction.
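The cycle idea can be illustrated simply: the dictionary is a directed graph whose edges are translations, and a path that leaves a source sense through one of its translations and returns to it is evidence that the translation belongs to that sense. Scoring by counting bounded-length cycles, as below, is a simplification of the paper's weighted scheme.

```python
# Sketch of cycle-based sense support in a bilingual dictionary graph.
from collections import defaultdict

edges = defaultdict(set)          # word/sense -> set of translations

def add(w, t):
    edges[w].add(t)

def cycle_score(source, translation, max_len=4):
    """Count paths translation -> ... -> source of length <= max_len."""
    count, frontier = 0, [(translation, 1)]
    while frontier:
        node, depth = frontier.pop()
        for nxt in edges[node]:
            if nxt == source:
                count += 1
            elif depth < max_len:
                frontier.append((nxt, depth + 1))
    return count

add("bank#finance", "banca"); add("banca", "bank#finance")
add("bank#river", "riva");    add("riva", "bank#river")
print(cycle_score("bank#finance", "banca"))   # 1: supports the finance sense
```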
APA, Harvard, Vancouver, ISO, and other styles
43

Heid, Ulrich. "A linguistic bootstrapping approach to the extraction of term candidates from German text." Terminology 5, no. 2 (1998): 161–81. http://dx.doi.org/10.1075/term.5.2.06hei.

Full text
Abstract:
This paper deals with computational linguistic tools and methods for the extraction of raw material for terminological glossaries from machine-readable text. We concentrate on monolingual German term candidates, and only briefly hint at tools and procedures for the creation of bilingual glossaries. Most of the examples we use to illustrate the methods and results of our work come from technical texts provided by the translation services of Daimler Chrysler AG and from legal texts made available by the European Academy in Bozen, Südtirol. The Academy is working on translations of legal documents for bilingual South Tyrol and, in this context, on the creation, upgrading, and maintenance of terminological resources.
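A minimal sketch of pattern-based term-candidate extraction over POS-tagged text, in the spirit of such approaches: shallow patterns like ADJ+NOUN propose candidates, which are then ranked by frequency. The tagset and patterns are illustrative assumptions, not the paper's actual grammar.

```python
# Sketch of POS-pattern term-candidate extraction (toy tagset and patterns).
from collections import Counter

PATTERNS = [("ADJ", "NOUN"), ("NOUN", "NOUN")]

def term_candidates(tagged_sentences):
    counts = Counter()
    for sent in tagged_sentences:            # sent: [(token, tag), ...]
        tags = [t for _, t in sent]
        for i in range(len(sent) - 1):
            if (tags[i], tags[i + 1]) in PATTERNS:
                counts[" ".join(tok for tok, _ in sent[i:i + 2])] += 1
    return counts.most_common()

sample = [[("hydraulische", "ADJ"), ("Bremse", "NOUN"), ("prüfen", "VERB")]]
print(term_candidates(sample))   # [('hydraulische Bremse', 1)]
```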
APA, Harvard, Vancouver, ISO, and other styles
44

Ji, Baijun, Zhirui Zhang, Xiangyu Duan, Min Zhang, Boxing Chen, and Weihua Luo. "Cross-Lingual Pre-Training Based Transfer for Zero-Shot Neural Machine Translation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (2020): 115–22. http://dx.doi.org/10.1609/aaai.v34i01.5341.

Full text
Abstract:
Transfer learning between different language pairs has shown its effectiveness for Neural Machine Translation (NMT) in low-resource scenarios. However, existing transfer methods involving a common target language are far from successful in the extreme scenario of zero-shot translation, due to the language space mismatch problem between the transferor (the parent model) and the transferee (the child model) on the source side. To address this challenge, we propose an effective transfer learning approach based on cross-lingual pre-training. Our key idea is to make all source languages share the same feature space, thus enabling a smooth transition for zero-shot translation. To this end, we introduce one monolingual pre-training method and two bilingual pre-training methods to obtain a universal encoder for different languages. Once the universal encoder is constructed, the parent model built on such an encoder is trained with large-scale annotated data and then directly applied in the zero-shot translation scenario. Experiments on two public datasets show that our approach significantly outperforms a strong pivot-based baseline and various multilingual NMT approaches.
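The transfer recipe can be sketched at toy scale: a cross-lingually pre-trained encoder puts all source languages in one feature space; an NMT model built on top of it is trained on the high-resource (parent) pair and then applied unchanged to a new source language. Module shapes and the stand-in encoder are assumptions for illustration only.

```python
# Toy sketch of zero-shot transfer through a shared ("universal") encoder.
import torch
import torch.nn as nn

class TinyNMT(nn.Module):
    def __init__(self, universal_encoder, vocab_out, dim=64):
        super().__init__()
        self.encoder = universal_encoder          # shared across source languages
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_out)

    def forward(self, src_ids, tgt_emb):
        ctx = self.encoder(src_ids)               # language-agnostic states
        hidden = ctx.mean(dim=1, keepdim=True).transpose(0, 1).contiguous()
        dec, _ = self.decoder(tgt_emb, hidden)
        return self.out(dec)

universal = nn.Sequential(nn.Embedding(32000, 64))  # stand-in pre-trained encoder
model = TinyNMT(universal, vocab_out=32000)
# 1) train `model` on the parent pair (e.g., De-En) with the encoder frozen;
# 2) decode a new source language (e.g., Fr) with no further training.
logits = model(torch.randint(0, 32000, (2, 7)), torch.randn(2, 5, 64))
```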
APA, Harvard, Vancouver, ISO, and other styles
45

Fonseca, Norma Barbosa de Lima. "Investigando adjetivos atributivos em traduções publicadas, em um sistema de tradução automática e na pós-edição monolíngue: contribuições para a pedagogia da tradução / Investigating attributive adjectives in published translations, in a machine translation system and in monolingual post-editing: contributions to translation pedagogy." Texto Livre: Linguagem e Tecnologia 11, no. 1 (2018): 121–43. http://dx.doi.org/10.17851/1983-3652.11.1.121-143.

Full text
Abstract:
ABSTRACT: This article investigates the Brazilian Portuguese translation of attributive adjectives in an excerpt from Heart of Darkness (CONRAD, 1994) in three studies. The first analyzes the translation options for this excerpt in four translations published in 1984, 1996, 2008, and 2011. The second examines the machine-translated output from English into Portuguese of the same excerpt as provided by Google Translate. The third analyzes the target texts produced in a monolingual post-editing task, whose source text was the same excerpt, performed by eight students in a translation practice course. The article thus integrates process and product analysis to verify patterns in the translation of noun phrases with one, two, and three attributive adjectives in the analyzed excerpt. The study draws on part of the Corpus de (Re)traduções (RETRAD), complemented by the machine-translated output and the students' post-edited texts. Results point to recurrent syntactic and lexical patterns in the translation of noun phrases with up to two attributive adjectives in the published translations, with prepositions and conjunctions added in Portuguese for noun phrases with more than two adjectives; greater syntactic and lexical variety is found in the translations of noun phrases with three or more attributive adjectives. The fact that the students made few changes to the post-edited target texts seems to indicate that they do not feel confident producing a target text without consulting the source text. It is therefore necessary to promote the use of corpora in the translation classroom and to expose students to the limitations of machine translation systems and to problem-solving strategies, leading them to reflect on the translation and post-editing processes. This will help increase students' confidence in their choices and lead to target texts that represent the language in use. KEYWORDS: attributive adjectives; machine translation system; monolingual post-editing; corpus linguistics; experimental research.
APA, Harvard, Vancouver, ISO, and other styles
46

Sutcliffe, R. F. E., A. McElligott, and G. O’Neill. "Translation between Irish and English by mapping between monolingual machine readable dictionaries using distributed representations." Irish Journal of Psychology 14, no. 3 (1993): 442–44. http://dx.doi.org/10.1080/03033910.1993.10557949.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

KANG, INSU, OH-WOOG KWON, JONG-HYEOK LEE, and GEUNBAE LEE. "CROSS-LANGUAGE TEXT RETRIEVAL BY QUERY TRANSLATION USING TERM REWEIGHTING." International Journal of Pattern Recognition and Artificial Intelligence 14, no. 05 (2000): 617–29. http://dx.doi.org/10.1142/s0218001400000404.

Full text
Abstract:
In dictionary-based query translation for cross-language text retrieval, transfer ambiguity is one of the main causes of performance deterioration, but the problem has not received significant attention in this field. To resolve transfer ambiguity, this paper proposes a two-phase query translation based on term reweighting, which uses a bilingual transfer dictionary originally designed for machine translation. In general, source-language query terms each show some word association with the others, so their correct translations should be more likely to co-occur in target documents. Based on this simple intuition, the first phase discriminates the more relevant target documents from the others. Using statistical and ranking information from the highly relevant documents, the second phase then converts the translated query vector into a reweighted form, adding extra weight to probably correct target terms. In experiments, the results were remarkable: the proposed method achieved almost the same performance as the monolingual IR system, improving precision by about 9% over a baseline system.
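A small sketch of the two-phase idea: phase 1 retrieves documents with every dictionary translation of every query term; phase 2 boosts candidate translations that co-occur in the top-ranked documents, on the intuition that correct translations of related query terms appear together. Plain term counting below is a stand-in for the paper's weighting model.

```python
# Sketch of two-phase query translation with term reweighting.
from collections import Counter

def translate_query(query_terms, transfer_dict, documents, top_k=5, boost=2.0):
    candidates = [t for term in query_terms for t in transfer_dict.get(term, [])]
    # Phase 1: rank documents by how many candidate translations they contain.
    scored = sorted(documents, key=lambda d: sum(d.count(c) for c in candidates),
                    reverse=True)[:top_k]
    # Phase 2: reweight candidates by their frequency in the top documents.
    freq = Counter(c for d in scored for c in candidates if c in d)
    return {c: 1.0 + boost * freq[c] for c in candidates}

dictionary = {"bank": ["banque", "rive"], "loan": ["prêt"]}
docs = ["la banque accorde un prêt", "la rive du fleuve"]
print(translate_query(["bank", "loan"], dictionary, docs, top_k=1))
# 'banque' and 'prêt' co-occur in the top document and get boosted over 'rive'.
```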
APA, Harvard, Vancouver, ISO, and other styles
48

TSVETKOV, YULIA, and SHULY WINTNER. "Extraction of multi-word expressions from small parallel corpora." Natural Language Engineering 18, no. 4 (2012): 549–73. http://dx.doi.org/10.1017/s1351324912000101.

Full text
Abstract:
We present a general, novel methodology for extracting multi-word expressions (MWEs) of various types, along with their translations, from small, word-aligned parallel corpora. Unlike existing approaches, we focus on misalignments; these typically indicate expressions in the source language that are translated to the target in a non-compositional way. We introduce a simple algorithm that proposes MWE candidates based on such misalignments, relying on 1:1 alignments as anchors that delimit the search space. We use a large monolingual corpus to rank and filter these candidates. Evaluation of the quality of the extraction algorithm reveals significant improvements over naïve alignment-based methods. The extracted MWEs, with their translations, are used in the training of a statistical machine translation system, showing a small but significant improvement in its performance.
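The anchoring heuristic is easy to sketch: 1:1 alignment points delimit the search space, and any source span between two consecutive anchors that contains unaligned or multiply aligned words is proposed as an MWE candidate. Monolingual ranking is reduced here to nothing more than span extraction; the example data is invented.

```python
# Sketch of misalignment-based MWE candidate extraction with 1:1 anchors.
from collections import Counter

def mwe_candidates(src_tokens, alignments):
    """alignments: set of (src_idx, tgt_idx) pairs."""
    src_links = Counter(i for i, _ in alignments)
    tgt_links = Counter(j for _, j in alignments)
    one_to_one = sorted(i for i, j in alignments
                        if src_links[i] == 1 and tgt_links[j] == 1)
    anchors = [-1] + one_to_one + [len(src_tokens)]
    spans = []
    for a, b in zip(anchors, anchors[1:]):
        if b - a > 1:                       # misaligned material between anchors
            spans.append(" ".join(src_tokens[a + 1:b]))
    return spans

print(mwe_candidates(["he", "kicked", "the", "bucket", "yesterday"],
                     {(0, 0), (1, 1), (4, 2)}))   # -> ['the bucket']
```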
APA, Harvard, Vancouver, ISO, and other styles
49

Aziz, Wilker. "Grasp: Randomised Semiring Parsing." Prague Bulletin of Mathematical Linguistics 104, no. 1 (2015): 51–62. http://dx.doi.org/10.1515/pralin-2015-0013.

Full text
Abstract:
We present a suite of algorithms for inference tasks over (finite and infinite) context-free sets. For generality and clarity, we have chosen the framework of semiring parsing with support for the most common semirings (e.g., Forest, Viterbi, k-best and Inside). We see parsing from the more general viewpoint of weighted deduction, allowing for arbitrary weighted finite-state input, and provide implementations of both bottom-up (CKY-inspired) and top-down (Earley-inspired) algorithms. We focus on approximate inference by Monte Carlo methods and provide implementations of ancestral sampling and slice sampling. In principle, sampling methods can deal with models whose independence assumptions are weaker than what is feasible with standard dynamic programming. We envision applications such as monolingual constituency parsing, synchronous parsing, context-free models of reordering for machine translation, and machine translation decoding.
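The appeal of semiring parsing is that one recursion serves many inference tasks: the same CKY chart computes the Viterbi (best-derivation) score or the Inside (total) probability depending only on which semiring operations are plugged in. The grammar format and toy PCFG below are illustrative assumptions.

```python
# A tiny instance of semiring parsing: CKY parameterised by (plus, times).
from collections import defaultdict

def cky(words, lexicon, rules, plus, times, zero):
    n = len(words)
    chart = defaultdict(lambda: zero)
    for i, w in enumerate(words):
        for lhs, p in lexicon.get(w, []):
            chart[i, i + 1, lhs] = plus(chart[i, i + 1, lhs], p)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            for j in range(i + 1, k):
                for (lhs, b, c), p in rules.items():
                    score = times(times(chart[i, j, b], chart[j, k, c]), p)
                    chart[i, k, lhs] = plus(chart[i, k, lhs], score)
    return chart[0, n, "S"]

lex = {"fish": [("N", 0.5), ("V", 0.5)], "people": [("N", 0.5)]}
rules = {("S", "N", "V"): 1.0, ("N", "N", "N"): 0.2}
sent = ["people", "fish"]
print(cky(sent, lex, rules, max, lambda a, b: a * b, 0.0))                 # Viterbi
print(cky(sent, lex, rules, lambda a, b: a + b, lambda a, b: a * b, 0.0))  # Inside
```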
APA, Harvard, Vancouver, ISO, and other styles
50

Sun, Haipeng, Rui Wang, Masao Utiyama, et al. "Unsupervised Neural Machine Translation for Similar and Distant Language Pairs." ACM Transactions on Asian and Low-Resource Language Information Processing 20, no. 1 (2021): 1–17. http://dx.doi.org/10.1145/3418059.

Full text
Abstract:
Unsupervised neural machine translation (UNMT) has achieved remarkable results for several language pairs, such as French–English and German–English. Most previous studies have focused on modeling UNMT systems; few studies have investigated the effect of UNMT on specific languages. In this article, we first empirically investigate UNMT for four diverse language pairs (French/German/Chinese/Japanese–English). We confirm that the performance of UNMT in translation tasks for similar language pairs (French/German–English) is dramatically better than for distant language pairs (Chinese/Japanese–English). We empirically show that the lack of shared words and different word orderings are the main reasons that lead UNMT to underperform in Chinese/Japanese–English. Based on these findings, we propose several methods, including artificial shared words and pre-ordering, to improve the performance of UNMT for distant language pairs. Moreover, we propose a simple general method to improve translation performance for all these four language pairs. The existing UNMT model can generate a translation of a reasonable quality after a few training epochs owing to a denoising mechanism and shared latent representations. However, learning shared latent representations restricts the performance of translation in both directions, particularly for distant language pairs, while denoising dramatically delays convergence by continuously modifying the training data. To avoid these problems, we propose a simple, yet effective and efficient, approach that (like UNMT) relies solely on monolingual corpora: pseudo-data-based unsupervised neural machine translation. Experimental results for these four language pairs show that our proposed methods significantly outperform UNMT baselines.
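The pseudo-data recipe can be sketched at a high level: an early UNMT checkpoint translates each monolingual corpus into the other language once, and the resulting fixed pseudo-parallel pairs then train an ordinary supervised NMT model, avoiding the continual denoising that slows UNMT convergence. The `translate` and `train_supervised` callables below are hypothetical hooks, not a real library API.

```python
# Sketch of pseudo-data-based unsupervised NMT (hypothetical hooks).
def build_pseudo_corpus(unmt_model, mono_src, mono_tgt, translate):
    src_to_tgt = [(s, translate(unmt_model, s, direction="src2tgt"))
                  for s in mono_src]
    tgt_to_src = [(translate(unmt_model, t, direction="tgt2src"), t)
                  for t in mono_tgt]
    # Both directions contribute; each pair keeps one human-written side.
    return src_to_tgt + tgt_to_src

def pseudo_data_unmt(unmt_model, mono_src, mono_tgt, translate, train_supervised):
    pseudo = build_pseudo_corpus(unmt_model, mono_src, mono_tgt, translate)
    return train_supervised(pseudo)   # ordinary supervised training from here on
```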
APA, Harvard, Vancouver, ISO, and other styles