Journal articles on the topic 'Parallel corpus'


Consult the top 50 journal articles for your research on the topic 'Parallel corpus.'


1

Macken, Lieve, Orphée De Clercq, and Hans Paulussen. "Dutch Parallel Corpus: A Balanced Copyright-Cleared Parallel Corpus." Meta 56, no. 2 (October 14, 2011): 374–90. http://dx.doi.org/10.7202/1006182ar.

Abstract:
This paper presents the Dutch Parallel Corpus, a high-quality parallel corpus for Dutch, French and English consisting of more than ten million words. The corpus contains five different text types and is balanced with respect to text type and translation direction. All texts included in the corpus have been cleared from copyright. We discuss the importance of parallel corpora in various research domains and contrast the Dutch Parallel Corpus with existing parallel corpora. The Dutch Parallel Corpus distinguishes itself from other parallel corpora by having a balanced composition and by its availability to the wide research community, thanks to its copyright clearance. All texts in the corpus are sentence-aligned and further enriched with basic linguistic annotations (lemmas and word class information). Approximately 25,000 words of the Dutch-English part have been manually aligned at the sub-sentential level. Rich metadata facilitates the navigability of the corpus and enables users to select the texts that satisfy their needs. The entire corpus is released as full texts in XML format and is also available via a web interface, which supports basic and complex search queries and presents the results as parallel concordances. The corpus will be distributed by the Flemish-Dutch Human Language Technology Agency (TST-Centrale).
2

Satoła-Staśkowiak, Joanna. "On the Benefits of Foreign Language Learning Based on Parallel Language Corpus." Cognitive Studies | Études cognitives, no. 15 (December 31, 2015): 57–65. http://dx.doi.org/10.11649/cs.2015.005.

Abstract:
A recently observed strong interest in language corpora, which can be defined as collections of texts in electronic format, together with my work within the European project CLARIN on ‘The Parallel Polish-Bulgarian-Russian Corpus’, prompted this text on the use of a parallel language corpus for learning a foreign language. The article discusses the benefits of using such a corpus in foreign language learning, describes selected corpus tools that support the learning process, and indicates some risks arising from incorrect use of the corpus.
3

Sujaini, Herry. "Peningkatan Akurasi Mesin Penerjemah Bahasa Inggris - Indonesia dengan Memaksimalkan Kualitas dan Kuantitas Korpus Paralel." Jurnal Teknologi Informasi dan Ilmu Komputer 7, no. 3 (May 22, 2020): 471. http://dx.doi.org/10.25126/jtiik.2020732076.

Abstract:
Parallel corpora play a very important role in statistical machine translation (SMT). Parallel corpora obtained from various sources are usually of poor quality, while corpus quantity is a main requirement for good translation results. This study aims to determine the effect of parallel corpus size and quality on SMT. It uses the bilingual evaluation understudy (BLEU) method to classify parallel sentence pairs as high-quality or poor. The method is applied to a parallel corpus containing 1.5 M English-Indonesian sentence pairs and yields 900K high-quality pairs. Several SMT systems with various sizes of raw parallel corpus and filtered high-quality corpus are trained with MOSES and evaluated for performance. The experimental results show that the size of the parallel corpus is a major factor in translation performance. In addition, better translation performance can be achieved with a smaller high-quality corpus using a quality filter method. The experimental results for English-Indonesian SMT show that by using 60% of sentences whose translation quality is good, translation quality can increase by 7.31%.
4

Izquierdo, Marlén, Knut Hofland, and Øystein Reigem. "The ACTRES parallel corpus: an English–Spanish translation corpus." Corpora 3, no. 1 (May 2008): 31–41. http://dx.doi.org/10.3366/e1749503208000051.

Abstract:
This paper describes the compilation of the ACTRES Parallel Corpus, an English–Spanish translation corpus built at the Department of Modern Languages at the University of León (Spain) by the ACTRES research group. The computerisation of the corpus was carried out in collaboration with Knut Hofland and Øystein Reigem, from the Department of Culture, Language and Information Technology, Aksis, at the UNIFOB/University of Bergen (Norway). The corpus is conceived as a powerful tool for cross-linguistic research in the fields of Contrastive Analysis and Descriptive Translation Studies. It was the need to bridge the gap between these disciplines and to extend applications that encouraged the building of a parallel corpus as a suitable tool to achieve these goals. This paper focusses on the practical aspects of building the corpus. A brief account of the research which prompted this endeavour precedes the description of this process. This paper is an account of the building of the ACTRES Parallel Corpus, so no empirical results from research done on the basis of the corpus are reported here. Concerning new insights drawn from the actual use of P-ACTRES in English–Spanish translation and contrastive projects, there is an extended bibliography at: http://actres.unileon.es/
5

Karimov, Rustam Abdurasulovich. "Text Selection Issue For Parallel Corpus." American Journal of Social Science and Education Innovations 2, no. 09 (September 26, 2020): 311–16. http://dx.doi.org/10.37547/tajssei/volume02issue09-48.

Abstract:
It is known that the basis of any corpus is its units. Typically, texts of different genres are selected as corpus units to ensure the representativeness of the corpus. Therefore, when creating any language corpus, the principles for selecting the texts that make it up should be defined first. Parallel corpus units consist of texts that have been translated one or more times from the original. Which topics and genres of text to choose for a parallel corpus is determined by the purpose of the compiler.
6

Levchuk, Pavlo, Danuta Roszko, and Roman Roszko. "Multilingual corps institute of Slavic Studies, Polish Academy of Sciences – CLARIN PL. Polish-Lithuanian Parallel Corpus “2” and Polish-Ukrainian Parallel Corpus." Language: classic - modern - postmodern, no. 6 (December 30, 2020): 306–170. http://dx.doi.org/10.18523/lcmp2522-9281.2020.6.306-170.

7

Resnik, Philip, and Noah A. Smith. "The Web as a Parallel Corpus." Computational Linguistics 29, no. 3 (September 2003): 349–80. http://dx.doi.org/10.1162/089120103322711578.

Abstract:
Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.
8

Al-Raisi, Fatima, Weijian Lin, and Abdelwahab Bourai. "A Monolingual Parallel Corpus of Arabic." Procedia Computer Science 142 (2018): 334–38. http://dx.doi.org/10.1016/j.procs.2018.10.487.

9

Duškin, Maksim, and Joanna Satoła-Staśkowiak. "The Bulgarian-Polish-Russian parallel corpus." Cognitive Studies | Études cognitives, no. 11 (November 24, 2015): 241–54. http://dx.doi.org/10.11649/cs.2011.015.

Abstract:
The Semantics Laboratory Team of the Institute of Slavic Studies of the Polish Academy of Sciences is planning to begin work on the creation of a Bulgarian-Polish-Russian parallel corpus. The three selected languages are representatives of the main groups of Slavic languages: Bulgarian represents the southern group of Slavic languages, Polish – the western group of Slavic languages, Russian – the eastern group of Slavic languages. Our project will be the first parallel corpus of these three languages. The planned corpus will be based on material dating from one period (the 20th century) and will have a synchronous nature. The project will not constitute the sum of the separate corpora of the selected languages. One of the problems with creating multilingual parallel corpora is the differing proportions of translated texts between the selected languages; for example, Polish literature is often translated into Bulgarian, but not vice versa. Bulgarian, Russian and Polish differ typologically: Bulgarian is an analytic language, while Polish and Russian are synthetic. The parallel corpus should have compatible annotation, while taking into account the characteristic features of the selected languages. We hope that the Bulgarian-Polish-Russian parallel corpus will serve as a source of linguistic material for contrastive language studies and may prove to be a big help for linguists, translators, terminologists and students of linguistics. The results of our work will be available on the Internet.
10

Lesatari, Aufa Eka Putri, Arie Ardiyanti, and Ibnu Asror. "Phrase Based Statistical Machine Translation Javanese-Indonesian." JURNAL MEDIA INFORMATIKA BUDIDARMA 5, no. 2 (April 25, 2021): 378. http://dx.doi.org/10.30865/mib.v5i2.2812.

Abstract:
This research aims to produce a statistical machine translation system for Javanese-Indonesian translation and to determine how its main data sources, namely the parallel corpus and the monolingual corpus, influence the quality of Javanese-Indonesian statistical machine translation. The testing was carried out by gradually increasing the quantity of parallel corpus and monolingual corpus across seven configurations of the Javanese-Indonesian system. All configurations were tested with test data totaling 500 lines of Javanese sentences. The machine translation output was evaluated automatically using Bilingual Evaluation Understudy (BLEU). Test results in the seven configurations showed an increase in the evaluation score after the quantity of parallel corpus and monolingual corpus was increased. Adding parallel corpus data increased the BLEU score by 3.6% between configurations 1 and 2, by 8.23% between configurations 2 and 3, and by 14.92% between configurations 3 and 7. Adding monolingual corpus data increased the BLEU score by 0.18% between configurations 4 and 5, by 0.06% between configurations 5 and 6, and by 0.24% between configurations 6 and 7. The test results showed that the quantity of both parallel corpus and monolingual corpus can increase the evaluation score of Javanese-Indonesian statistical machine translation, but that the quantity of parallel corpus has a greater influence than the quantity of monolingual corpus.
11

Xiong, Kai, Rui Yuan, Wenxue He, Yanmei Jing, Yansheng Wang, Qiqi He, and Huafu Li. "Crawling Chinese-Myanmar Parallel Corpus: Automatic Collection, Screening and Cleaning Corpus." IOP Conference Series: Materials Science and Engineering 646 (October 17, 2019): 012046. http://dx.doi.org/10.1088/1757-899x/646/1/012046.

12

Erjavec, Tomaž. "The IJS-ELAN Slovene-English Parallel Corpus." International Journal of Corpus Linguistics 7, no. 1 (October 18, 2002): 1–20. http://dx.doi.org/10.1075/ijcl.7.1.01erj.

Abstract:
The paper presents an annotated parallel Slovene-English corpus developed in the scope of the EU ELAN project. The IJS-ELAN corpus was compiled to be a widely distributable dataset for language engineering and for translation and terminology studies. The corpus contains 1 million words from fifteen recent terminology-rich texts. The corpus is sentence aligned and word-tagged with context disambiguated morphosyntactic descriptions and lemmas. These descriptions model simple feature structures, the structure of which is shared between Slovene and English. The corpus is encoded according to the Guidelines for Text Encoding and Interchange and is freely available on the Web for downloading. Additionally, access to IJS-ELAN is available via a powerful Web concordancer.
13

Deep, Kamal, Ajit Kumar, and Vishal Goyal. "Development of Punjabi-English (PunEng) Parallel Corpus for Machine Translation System." International Journal of Engineering & Technology 7, no. 2 (May 10, 2018): 690. http://dx.doi.org/10.14419/ijet.v7i2.10762.

Abstract:
This paper describes the creation process and statistics of the Punjabi-English (PunEng) parallel corpus. A parallel corpus is the main requirement for developing statistical as well as neural machine translation. Until now, no PunEng parallel corpus has been available. In this paper, we show the difficulties and the intensive labor involved in developing a parallel corpus. We discuss the methods used for collecting data and the results, and describe the errors encountered during data collection and how to handle them.
14

Lefever, Els, and Véronique Hoste. "Parallel corpora make sense." International Journal of Corpus Linguistics 19, no. 3 (September 1, 2014): 333–67. http://dx.doi.org/10.1075/ijcl.19.3.02lef.

Abstract:
We present a multilingual approach to Word Sense Disambiguation (WSD), which automatically assigns the contextually appropriate sense to a given word. Instead of using a predefined monolingual sense-inventory, we use a language-independent framework by deriving the senses of a given word from word alignments on a multilingual parallel corpus, which we made available for corpus linguistics research. We built five WSD systems with English as the input language and translations in five supported languages (viz. French, Dutch, Italian, Spanish and German) as senses. The systems incorporate both binary translation features and local context features. The experimental results are very competitive, which confirms our initial hypothesis that each language contributes to the disambiguation of polysemous words. Because our system extracts all information from the parallel corpus, it offers a flexible language-independent approach, which implicitly deals with the sense distinctness issue and allows us to bypass the knowledge acquisition bottleneck for WSD.
15

Cheon, Juryong, and Youngjoong Ko. "Parallel sentence extraction to improve cross-language information retrieval from Wikipedia." Journal of Information Science 47, no. 2 (February 10, 2021): 281–93. http://dx.doi.org/10.1177/0165551521992754.

Abstract:
Translation language resources, such as bilingual word lists and parallel corpora, are important factors affecting the effectiveness of cross-language information retrieval (CLIR) systems. In particular, when large domain-appropriate parallel corpora are not available, developing an effective CLIR system is particularly difficult. Furthermore, creating a large parallel corpus is costly and requires considerable effort. Therefore, we here demonstrate the construction of parallel corpora from Wikipedia as well as improved query translation, wherein the queries are used for a CLIR system. To do so, we first constructed a bilingual dictionary, termed WikiDic. Then, we evaluated individual language resources and combinations of them in terms of their ability to extract parallel sentences; the combinations of our proposed WikiDic with the translation probability from the Web’s bilingual example sentence pairs and WikiDic was found to be best suited to parallel sentence extraction. Finally, to evaluate the parallel corpus generated from this best combination of language resources, we compared its performance in query translation for CLIR to that of a manually created English–Korean parallel corpus. As a result, the corpus generated by our proposed method achieved a better performance than did the manually created corpus, thus demonstrating the effectiveness of the proposed method for automatic parallel corpus extraction. Not only can the method demonstrated herein be used to inform the construction of other parallel corpora from language resources that are readily available, but also, the parallel sentence extraction method will naturally improve as Wikipedia continues to be used and its content develops.
16

Kashioka, Hideki. "Synonymous Sentences Grouping with Multilingual Parallel Corpus." Journal of Natural Language Processing 11, no. 5 (2004): 3–18. http://dx.doi.org/10.5715/jnlp.11.5_3.

17

Leong, Chongman, Xuebo Liu, Derek F. Wong, and Lidia S. Chao. "Exploiting Translation Model for Parallel Corpus Mining." IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021): 2829–39. http://dx.doi.org/10.1109/taslp.2021.3105798.

18

Alotaibi, Hind M. "AEPC: Designing an Arabic/English parallel corpus." Research in Corpus Linguistics 4 (2016): 1–7. http://dx.doi.org/10.32714/ricl.04.01.

Abstract:
Parallel corpora ‒ collections of aligned translated texts of two or more languages ‒ play a significant role in translation and contrastive studies. Given the importance of the availability of such learning resources for the education and training of translators, Arabic suffers from a lack of such learning resources. Although there are a limited number of free Arabic/English parallel corpora, a major drawback is that they are domain-restricted corpora, which limits their benefits for Arabic translation education. This paper describes an ongoing project to design and construct a balanced, representative, and free-to-use Arabic English parallel corpus (AEPC). In addition, the project involves the design and implementation of an Arabic/English concordance tool. The proposed parallel corpus and its tool can be integrated into translators’ training institutions as an educational resource for translation studies and teaching. It can be used in training and testing Arabic/English machine translation systems. The first phase of this project involved compiling high-quality translated text samples; all translations were done by human translators. The corpus covers a wide range of text types and rich metadata. The target figure for the corpus is minimally 10 million words, with the intention to increase that figure in the future. After compiling the texts, manual (i.e. human-aided) alignment was performed, offering better outcomes in terms of accuracy compared to automated alignment. The second phase of this project involved designing a web interface with a bilingual concordancer, where users can explore the content of the AEPC in both English and Arabic.
19

Mohammadi, Mohammad Hadi, Qiao Pan, Dehua Chen, and Marjan Kamyab. "PC-Corpus: A Persian-Chinese Parallel Corpora." Journal of Physics: Conference Series 1176 (March 2019): 022002. http://dx.doi.org/10.1088/1742-6596/1176/2/022002.

20

Schwartz, Lane. "Better Splitting Algorithms for Parallel Corpus Processing." Prague Bulletin of Mathematical Linguistics 98, no. 1 (October 1, 2012): 109–19. http://dx.doi.org/10.2478/v10108-012-0013-x.

Abstract:
Each iteration of minimum error rate training involves re-translating a development set. Distributing this work across computational nodes can speed up translation time, but in practice some parts may take much longer to complete than others, leading to computational slack time. To address this problem, we develop three novel algorithms for distributing translation tasks in a parallel computing environment, drawing on research in parallel machine scheduling. We present results showing a substantial speedup in overall decoding time.
21

Ruziev, Khusniddin Bakhritdinovich. "Stages to Create Corpus of Parallel Texts." Theoretical & Applied Science 124, no. 08 (August 30, 2023): 243–47. http://dx.doi.org/10.15863/tas.2023.08.124.25.

22

Hu, Ninglei, and Shurui Yang. "Translation Learning Based on Italian Parallel Corpus." Creativity and Innovation 6, no. 3 (2022): 141–47. http://dx.doi.org/10.47297/wspciwsp2516-252726.20220603.

23

Gao, Yin. "Research on English Electronic Standard Database Based on Android Platform." Advanced Materials Research 971-973 (June 2014): 2752–55. http://dx.doi.org/10.4028/www.scientific.net/amr.971-973.2752.

Abstract:
E-C parallel corpus is of great importance to translation teaching and practice due to its abundant text corpus. Although there are a number of E-C parallel corpora at home and abroad, few can be applied to translation teaching. In view of this, it is necessary to construct E-C parallel corpora that can be used in translation teaching. This paper aims to explore the construction of mini E-C parallel corpus and its application in translation teaching.
24

Le Bruyn, Bert, Martín Fuchs, Martijn van der Klis, Jianan Liu, Chou Mo, Jos Tellings, and Henriëtte De Swart. "Parallel Corpus Research and Target Language Representativeness: The Contrastive, Typological, and Translation Mining Traditions." Languages 7, no. 3 (July 7, 2022): 176. http://dx.doi.org/10.3390/languages7030176.

Abstract:
This paper surveys the strategies that the Contrastive, Typological, and Translation Mining parallel corpus traditions rely on to deal with the issue of target language representativeness of translations. On the basis of a comparison of the corpus architectures and research designs of the three traditions, we argue that they have each developed their own representativeness strategies: (i) monolingual control corpora (Contrastive tradition), (ii) limits on the scope of research questions (Typological tradition), and (iii) parallel control corpora (Translation Mining tradition). We introduce normalized pointwise mutual information (NPMI) as a bi-directional measure of cross-linguistic association, allowing for an easy comparison of the outcomes of different traditions and the impact of the monolingual and parallel control corpus representativeness strategies. We further argue that corpus size has a major impact on the reliability of the monolingual control corpus strategy and that a sequential parallel control corpus strategy is preferable for smaller corpora.
25

Zhang, Yingyi. "Russian Speech Conversion Algorithm Based on a Parallel Corpus and Machine Translation." Wireless Communications and Mobile Computing 2022 (March 23, 2022): 1–9. http://dx.doi.org/10.1155/2022/8023115.

Abstract:
The phonetic conversion technology is crucial in the resource construction of Russian phonetic information processing. This paper explains how to build a corpus and the key algorithms that are used, as well as how to design auxiliary translation software and implement the key algorithms. This paper focuses on the “parallel corpus” method of problem solving and the indispensable role of a parallel corpus in Russian learning. This paper examines the foundations, motivations, and methods for using parallel corpora in translation instruction. The main way of using a parallel corpus in the classroom environment is to present data, so that learners can be exposed to a large amount of easily screened bilingual data, and translation skills and specific language item translation can be taught in a concentrated and focused manner. Among them, the creation of a large-scale Russian-Chinese parallel corpus will play an important role not only in improving the translation quality of Russian-Chinese machine translation systems but also in Chinese and Russian teaching as well as other branches of linguistics and translation studies, all of which should be given sufficient attention. This paper proposes the use of automatic speech analysis technology to assist Russian pronunciation learning and designs a Russian word pronunciation learning assistant system with demonstration, scoring, and feedback functions, in response to the shortcomings of pronunciation teaching in Russian teaching in China. It can provide corpus support for gathering a large number of parallel corpora and, in the future, enabling online translation. This system is used for corpus automatic construction, and future corpus automatic construction systems could be built on top of it. The proper application of parallel corpus data will aid in the development of a high-quality autonomous learning and translation teaching environment.
26

Liu, Chao Peng. "Research on Web Application Technology for Building a Chinese-French Parallel Corpus of the Four Great Chinese Classical Novels." Applied Mechanics and Materials 473 (December 2013): 206–10. http://dx.doi.org/10.4028/www.scientific.net/amm.473.206.

Abstract:
As masterpieces in Chinese classical literature, the Four Great Chinese Classical Novels with their multilingual translations have exerted a profound influence in literature and translation studies both home and abroad. Building a Chinese-French bilingual parallel corpus of the Four Great Chinese Classical Novels is believed to facilitate large-scale investigations into the original Chinese text and their French translations in terms of stylistics, diction, culture and translation techniques. In this paper, we introduced the French translations of the four novels and illustrated the process of building the parallel corpus in detail. When the parallel corpus is completed, statistical analysis can thereby be carried out by employing different corpus tools. In order to enhance the availability and convenience of the parallel corpus, a web-based query platform is designed to provide world-wide search through the Internet for interested researchers and language learners.
27

Santos, Diana, and Signe Oksefjell. "Using a Parallel Corpus to Validate Independent Claims." Languages in Contrast 2, no. 1 (December 31, 1999): 115–30. http://dx.doi.org/10.1075/lic.2.1.07san.

Abstract:
This paper examines the results from two corpus-based contrastive studies. Both studies offer cross-linguistic claims about the language pair English-Portuguese. We attempt to replicate the studies and check the findings against a different corpus, viz. the English-Portuguese part of the English-Norwegian Parallel Corpus, to see whether the regularities observed in the original corpora can be confirmed. After a brief presentation of each study, we describe how we gathered equivalent data, present our findings in the new corpus, and discuss some possible reasons for discrepancies in relation to the earlier studies. The topics investigated are boundary-crossing movement descriptions (after Slobin 1997) and perception verbs (after Santos 1998).
28

Wang, Xizhe. "Word Alignment of Chinese Poetry Parallel Corpus based on Word Embedding Technology." Applied and Computational Engineering 8, no. 1 (August 1, 2023): 721–25. http://dx.doi.org/10.54254/2755-2721/8/20230110.

Abstract:
Research on word alignment technology provides the basis for many fields in natural language processing, such as speech recognition, bilingual dictionary writing, corpus construction and information retrieval. Building on existing research on sentence-aligned parallel corpora of poetry, this paper studies how to align English and Chinese words. First, the parallel corpus of Chinese poetry is preprocessed and translated into words; the English poetry word vectors are then trained with a word embedding method; the optimal corresponding word is determined by similarity calculation; and finally word alignment is obtained from the sentence-aligned parallel corpus.
29

Yang, Wei, Hanfei Shen, and Yves Lepage. "Inflating a Small Parallel Corpus into a Large Quasi-parallel Corpus Using Monolingual Data for Chinese-Japanese Machine Translation." Journal of Information Processing 25 (2017): 88–99. http://dx.doi.org/10.2197/ipsjjip.25.88.

30

Mosavi Miangah, Tayebeh. "Constructing a Large-Scale English-Persian Parallel Corpus." Meta 54, no. 1 (April 29, 2009): 181–88. http://dx.doi.org/10.7202/029804ar.

Abstract:
In recent years the exploitation of large text corpora in solving various kinds of linguistic problems, including those of translation, has become commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them. The project reported here is an attempt to build an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is of course open; that is, more material can be added as the need arises. One of the main activities associated with building such a corpus is to develop software for parallel concordancing, in which a user can enter a search string in one language and see all the citations for that string in that language together with the corresponding sentences in the target language. Our intention is to construct general translation memory software using the present English-Persian parallel corpus.
31

Sole-Mauri, Francina, Pilar Sánchez-Gijón, and Antoni Oliver. "Cadlaws – An English–French Parallel Corpus of Legally Equivalent Documents." Mutatis Mutandis. Revista Latinoamericana de Traducción 14, no. 2 (July 13, 2021): 494–508. http://dx.doi.org/10.17533/udea.mut.v14n2a10.

Abstract:
This article presents Cadlaws, a new English–French corpus built from Canadian legal documents, and describes the corpus construction process and preliminary statistics obtained from it. The corpus contains over 16 million words in each language and includes unique features since it is composed of documents that are legally equivalent in both languages but not the result of a translation. The corpus is built upon enactments co-drafted by two jurists to ensure legal equality of each version and to reflect the concepts, terms and institutions of two legal traditions. The article also discusses why the corpus is defined as a parallel corpus rather than a comparable one. Cadlaws has been pre-processed for machine translation, and baseline Bilingual Evaluation Understudy (BLEU) scores, which compare a candidate translation of a text to a gold-standard translation, are given for a neural machine translation system. To the best of our knowledge, this is the largest parallel corpus of texts which convey the same meaning in this language pair and is freely available for non-commercial use.
32

Siruk, Olena, and Ivan Derzhanski. "Linguistic Corpora as International Cultural Heritage: The Corpus of Bulgarian and Ukrainian Parallel Texts." Digital Presentation and Preservation of Cultural and Scientific Heritage 3 (September 30, 2013): 91–98. http://dx.doi.org/10.55630/dipp.2013.3.9.

Abstract:
The paper reports on our ongoing work on the creation of a corpus of Bulgarian and Ukrainian parallel texts. We discuss some differences in the approaches and the interpretation of some concepts, as well as various problems associated with the construction of our corpus, in particular the occasional ‘nonparallelism’ of original and translated texts. We give examples of the application of the parallel corpus for the study of lexical semantics and note the outstanding role of the corpus in the lexicographic description of Ukrainian and Bulgarian translation equivalents. We draw attention to the importance of creating parallel corpora as objects of national as well as global cultural heritage.
33

De Pauw, Guy, Peter Waiganjo Wagacha, and Gilles-Maurice de Schryver. "Exploring the sawa corpus: collection and deployment of a parallel corpus English—Swahili." Language Resources and Evaluation 45, no. 3 (July 19, 2011): 331–44. http://dx.doi.org/10.1007/s10579-011-9159-7.

34

Ebling, Sarah. "Building a parallel corpus of German/Swiss German Sign Language train announcements." International Journal of Corpus Linguistics 21, no. 1 (March 31, 2016): 116–29. http://dx.doi.org/10.1075/ijcl.21.1.06ebl.

Abstract:
We present a parallel corpus of German/Swiss German Sign Language train announcements. The corpus is used in a statistical machine translation system that translates from German to Swiss German Sign Language. The output of the translation system is then passed on to an animation system, the result being a sign language avatar representation on a mobile phone. Building the parallel corpus consisted of four steps: translating the written German train announcements into Swiss German Sign Language glosses, signing the announcements in front of a camera on the basis of the gloss transcriptions, notating the signs in the video recordings in a form-based sign language notation system, and adding information about non-manual features. The resulting corpus contains 3,241 sentence pairs, which makes it a large parallel corpus involving sign language.
35

Mihailov, Mihail, and Hannu Tommola. "Compiling Parallel Text Corpora: Towards Automation of Routine Procedures." Text Corpora and Multilingual Lexicography 6, no. 3 (December 17, 2001): 67–77. http://dx.doi.org/10.1075/ijcl.6.si.07mih.

Abstract:
The aim of the research project running at the Department of Translation Studies of the University of Tampere is to collect a Russian-Finnish parallel corpus of fiction. The corpus will be equipped with efficient search and analysis tools. The texts of the corpus will be stored as ordinary text files. Each text will be registered in a Microsoft Access database and supplied with a description. Automated parallel concordancing is being developed for the corpus. The program will find the keywords in text A (Russian), then look for possible translation equivalents of the keywords in language B (Finnish), and then search for the portion of text B (Finnish) where most of the keywords in question can be found.
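The three-step concordancing procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function name `best_matching_segment`, the toy Russian-Finnish lexicon, the pre-split segments of text B, and exact token matching are all hypothetical choices made here, not details of the Tampere program.

```python
def best_matching_segment(keywords_a, lexicon, segments_b):
    """Return (index, hits) of the text B segment with the most keyword equivalents."""
    # Step 2: collect possible translation equivalents of the text A keywords.
    equivalents = set()
    for kw in keywords_a:
        equivalents.update(lexicon.get(kw, []))

    # Step 3: count equivalents in each text B segment and keep the best one.
    best_index, best_hits = -1, 0
    for i, segment in enumerate(segments_b):
        tokens = set(segment.lower().split())
        hits = len(tokens & equivalents)
        if hits > best_hits:
            best_index, best_hits = i, hits
    return best_index, best_hits

# Hypothetical toy lexicon (Russian -> Finnish) and toy segments of text B.
toy_lexicon = {"дом": ["talo", "koti"], "кошка": ["kissa"], "собака": ["koira"]}
segments_fi = [
    "koira juoksee",
    "kissa nukkuu talo on suuri",
    "aurinko paistaa",
]

# Step 1 (finding the keywords in text A) is assumed to be done already.
result = best_matching_segment(["дом", "кошка"], toy_lexicon, segments_fi)
print(result)
```

Exact token matching is a simplification: Finnish and Russian are heavily inflected, so a practical version would presumably compare lemmas or stems rather than surface forms.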
36

Oliver, Antoni. "El corpus paral·lel del Diari Oficial de la Generalitat de Catalunya: compilació, anàlisi i exemples d'ús." Zeitschrift für Katalanistik 30 (July 1, 2017): 269–91. http://dx.doi.org/10.46586/zfk.2017.269-291.

Abstract:
This paper presents the process of compiling the parallel corpus from the Official Diary of the Catalan Government (DOGC). It describes the downloading process and the tools and processes used for text treatment and linguistic analysis. The final result is a large parallel corpus that is freely available in several formats and with several annotation levels. This corpus is a very valuable resource for different applications. As an example, three possible fields of application are described: as a translation memory in a computer-assisted translation tool; for terminology extraction and querying; and for training statistical machine translation systems. Keywords: Parallel corpus, translation memory, terminology extraction, statistical machine translation, Natural Language Processing
37

Tan, Tien-Ping, Chai Kim Lim, and Wan Rose Eliza Abdul Rahman. "Sliding Window and Parallel LSTM with Attention and CNN for Sentence Alignment on Low-Resource Languages." Pertanika Journal of Science and Technology 30, no. 1 (November 24, 2021): 97–121. http://dx.doi.org/10.47836/pjst.30.1.06.

Abstract:
A parallel text corpus is an important resource for building a machine translation (MT) system. Existing resources such as translated documents, bilingual dictionaries, and translated subtitles are excellent resources for constructing a parallel text corpus. A sentence alignment algorithm automatically aligns source sentences and target sentences because manual sentence alignment is resource-intensive. Over the years, sentence alignment approaches have improved from sentence length heuristics to statistical lexical models to deep neural networks. Solving the alignment problem as a classification problem is interesting as classification is the core of machine learning. This paper proposes a parallel long short-term memory with attention and convolutional neural network (parallel LSTM+Attention+CNN) for classifying two sentences as parallel or non-parallel sentences. A sliding window approach is also proposed with the classifier to align sentences in the source and target languages. The proposed approach was compared with three classifiers, namely the feedforward neural network, CNN, and bi-directional LSTM. It was also compared with the BleuAlign sentence alignment system. The classification accuracy of these models was evaluated using a Malay-English parallel text corpus and the UN French-English parallel text corpus. The Malay-English sentence alignment performance was then evaluated using research documents and the very challenging Classical Malay-English document. The proposed classifier obtained more than 80% accuracy in categorizing parallel/non-parallel sentences with a model built using only five thousand training parallel sentences. It also has a higher sentence alignment accuracy than the other baseline systems.
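The idea of pairing a parallel/non-parallel classifier with a sliding window can be sketched as below. This is only an illustration: a toy lexicon-overlap scorer (`toy_classifier`) stands in for the paper's parallel LSTM+Attention+CNN model, and the window centering by length ratio, the `window` and `threshold` parameters, and all names are assumptions of this sketch, since the abstract does not specify them.

```python
def align_with_sliding_window(src, tgt, is_parallel, window=2, threshold=0.5):
    """Align each source sentence to the best-scoring target inside a window."""
    alignments = []
    if not src or not tgt:
        return alignments
    ratio = len(tgt) / len(src)
    for i, s in enumerate(src):
        center = int(i * ratio)  # rough expected position in the target text
        lo = max(0, center - window)
        hi = min(len(tgt), center + window + 1)
        best_j, best_score = None, threshold
        for j in range(lo, hi):
            score = is_parallel(s, tgt[j])
            if score > best_score:  # keep only candidates above the threshold
                best_j, best_score = j, score
        if best_j is not None:
            alignments.append((i, best_j))
    return alignments

# Hypothetical stand-in for a trained classifier: share of source words whose
# toy dictionary translation appears in the target sentence.
toy_lexicon = {"saya": "i", "makan": "eat", "nasi": "rice",
               "dia": "she", "minum": "drinks", "air": "water"}

def toy_classifier(src_sentence, tgt_sentence):
    src_tokens = src_sentence.lower().split()
    tgt_tokens = set(tgt_sentence.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(1 for w in src_tokens if toy_lexicon.get(w) in tgt_tokens)
    return hits / len(src_tokens)

malay = ["saya makan nasi", "dia minum air"]
english = ["I eat rice", "An unrelated sentence", "She drinks water"]
result = align_with_sliding_window(malay, english, toy_classifier)
print(result)
```

Restricting the search to a window keeps the number of classifier calls roughly linear in the document length, at the cost of missing pairs that drift further apart than the window allows.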
38

Hafsah, Saidah Saad, Lailatul Qadri Zakaria, and Ahmad Fadhil Naswir. "Parallel-Based Corpus Annotation for Malay Health Documents." Applied Sciences 13, no. 24 (December 9, 2023): 13129. http://dx.doi.org/10.3390/app132413129.

Abstract:
Named entity recognition (NER) is a crucial component of various natural language processing (NLP) applications, particularly in healthcare. It involves accurately identifying and extracting named entities such as medical terms, diseases, and drug names, and it is essential to healthcare professionals for tasks like clinical text analysis, electronic health record management, and medical research. However, healthcare NER faces challenges, especially in Malay, in which specialized corpora are limited, and no general corpus is available yet. To address this, the paper proposes a method for constructing an annotated corpus of Malay health documents. The researchers leverage a parallel source that contains annotated entities in English, because the tools available for the Malay language are limited and NER is very language-dependent. Additional credible Malay documents are incorporated as sources to enhance the development. The targeted health entities in this research include penyakit (diseases), simptom (symptoms), and rawatan (treatments). The primary objective is to facilitate the development of NER algorithms specifically tailored to the healthcare domain in the Malay language. The methodology encompasses data collection, preprocessing, annotation of text in both English and Malay, and corpus creation. The outcome of this research is the establishment of the Malay Health Document Annotated Corpus, which serves as a valuable resource for training and evaluating NLP models in the Malay language. Future research directions may focus on developing domain-specific NER models, exploring alternative algorithms, and enhancing performance. Overall, this research aims to address the challenges of healthcare NER in the Malay language by constructing an annotated corpus and facilitating the development of tailored NER algorithms for the healthcare domain.
39

Jindal, Shishpal, Vishal Goyal, and Jaskarn Singh. "Building English-Punjabi Parallel corpus for Machine Translation." International Journal of Computer Applications 180, no. 8 (December 16, 2017): 26–29. http://dx.doi.org/10.5120/ijca2017916036.

40

MATSUNAGA, Tsutomu, Daisuke SATO, and Masami HARA. "Parallel Corpus Clean-up Based on Recursive Learning." Journal of Japan Society for Fuzzy Theory and Intelligent Informatics 29, no. 1 (2017): 527–32. http://dx.doi.org/10.3156/jsoft.29.1_527.

41

Kuandykova, Ayana, Amandyk Kartbayev, and Tannur Kaldybekov. "English-Kazakh Parallel Corpus For Statistical Machine Translation." International Journal on Natural Language Computing 3, no. 3 (June 30, 2014): 65–72. http://dx.doi.org/10.5121/ijnlc.2014.3306.

42

Rakhmanova, Azizakhan Abdugafurovna. "THE ROLE OF PARALLEL TEXT IN CORPUS LINGUISTICS." Theoretical & Applied Science 91, no. 11 (November 30, 2020): 66–70. http://dx.doi.org/10.15863/tas.2020.11.91.15.

43

Arkhipova, Elena Ivanovna. "Using a Parallel Corpus to Translate Ethnocultural Collocations." Filologičeskie nauki. Voprosy teorii i praktiki, no. 2 (February 2022): 554–58. http://dx.doi.org/10.30853/phil20220046.

44

Sidorova, Elena Yurievna. "Diminutive «Потихоньку» in the Texts of Parallel Corpus." Filologičeskie nauki. Voprosy teorii i praktiki, no. 11 (November 2021): 3404–9. http://dx.doi.org/10.30853/phil210567.

45

박명수. "Investigation into English-Korean Parallel Corpus with ParaConc." Journal of Translation Studies 18, no. 5 (December 2017): 29–57. http://dx.doi.org/10.15749/jts.2017.18.5.002.

46

Torres-Ramos, Sulema, and Raymundo E. Garay-Quezada. "A Survey on Statistical-based Parallel Corpus Alignment." Research in Computing Science 90, no. 1 (December 31, 2015): 57–76. http://dx.doi.org/10.13053/rcs-90-1-5.

47

Čermák, František, and Alexandr Rosen. "The case of InterCorp, a multilingual parallel corpus." International Journal of Corpus Linguistics 17, no. 3 (December 31, 2012): 411–27. http://dx.doi.org/10.1075/ijcl.17.3.05cer.

Abstract:
This paper introduces InterCorp, a parallel corpus including texts in Czech and 27 other languages, available for online searches via a web interface. After discussing some issues and merits of a multilingual resource, we argue that it has an important role especially for languages with fewer native speakers, supporting both comparative research and studies of the language from the perspective of other languages. We proceed with an overview of the corpus: the strategy and criteria for including new texts, the representation of available languages and text types, linguistic annotation, and a sketch of pre-processing issues. Finally, we present the search interface and suggest some research opportunities.
48

Oksefjell, Signe. "A Description of the English-Norwegian Parallel Corpus." International Journal of Corpus Linguistics 4, no. 2 (December 31, 1999): 197–219. http://dx.doi.org/10.1075/ijcl.4.2.01oks.

Abstract:
This paper gives an introduction to the most important steps in the process of compiling the English-Norwegian Parallel Corpus (ENPC), which contains 50 original English text extracts with their translations into Norwegian and 50 original Norwegian text extracts with their translations into English, in all about 2.6 million words. Even if the most time-consuming part of the process is to prepare the text extracts for the corpus, much of the focus has also been on the development of software, notably a browser handling parallel texts and an alignment program linking the original and translated versions of the same text. The preparation of the texts themselves includes scanning, proofreading, mark-up, and alignment. Although the ENPC is completed, the ENPC project is still developing, and the most recent extensions will be mentioned in this paper, such as adding more languages, compiling multiple translations (in the same language) of the same text, part-of-speech-tagging, and marking direct speech and thought in the ENPC.
49

Tadić, M. "Procedures in Building the Croatian-English Parallel Corpus." International Journal of Corpus Linguistics 6, no. 1 (December 1, 2001): 107–23. http://dx.doi.org/10.1075/ijcl.6.3.10tad.

50

Tadic, Marko. "Procedures in Building the Croatian-English Parallel Corpus." Text Corpora and Multilingual Lexicography 6, no. 3 (December 17, 2001): 107–23. http://dx.doi.org/10.1075/ijcl.6.si.10tad.

Abstract:
This contribution gives a survey of procedures and formats used in building the Croatian-English parallel corpus that is being collected at the Institute of Linguistics at the Philosophical Faculty, University of Zagreb. The primary text source is the newspaper Croatia Weekly, which has been published since the beginning of 1998 by HIKZ (Croatian Institute for Information and Culture). After a quick survey of existing English-Croatian parallel corpora, the article deals with the procedures involved in text conversion and text encoding, particularly alignment. There are several recent suggestions for alignment encoding, and they are listed and elaborated at the end of the article.