Dissertations / Theses on the topic 'Corpus comparables'
Consult the top 41 dissertations / theses for your research on the topic 'Corpus comparables.'
Prochasson, Emmanuel. "Alignement multilingue en corpus comparables spécialisés." Phd thesis, Université de Nantes, 2009. http://tel.archives-ouvertes.fr/tel-00462248.
Li, Bo. "Mesurer et améliorer la qualité des corpus comparables." Thesis, Grenoble, 2012. http://www.theses.fr/2012GRENM069.
Bilingual corpora are an essential resource used to cross the language barrier in multilingual Natural Language Processing (NLP) tasks. Most current work makes use of parallel corpora, which are mainly available for major languages and constrained domains. Comparable corpora, text collections comprised of documents covering overlapping information, are however less expensive to obtain in high volume. Previous work has shown that using comparable corpora is beneficial for several NLP tasks. Going beyond those studies, we try in this thesis to improve the quality of comparable corpora so as to improve the performance of applications exploiting them. The idea is advantageous since it can work with any existing method making use of comparable corpora. We first discuss the notion of comparability, inspired by practical experience with bilingual corpora. The notion motivates several implementations of the comparability measure under the probabilistic framework, as well as a methodology to evaluate the ability of comparability measures to capture gold-standard comparability levels. The comparability measures are also examined in terms of robustness to dictionary changes. The experiments show that a symmetric measure relying on vocabulary overlap correlates very well with gold-standard comparability levels and is robust to dictionary changes. Based on the comparability measure, two methods, namely the greedy approach and the clustering approach, are then developed to improve the quality of any given comparable corpus. The general idea of these two methods is to choose the high-quality subpart of the original corpus and to enrich the low-quality subpart with external resources. The experiments show that both methods improve the quality of the given comparable corpus, in terms of comparability scores, with the clustering approach being more efficient than the greedy approach.
The enhanced comparable corpus further results in better bilingual lexicons extracted with the standard extraction algorithm. Lastly, we investigate the task of Cross-Language Information Retrieval (CLIR) and the application of comparable corpora in CLIR. We develop novel CLIR models extending the recently proposed information-based models in monolingual IR. The information-based CLIR model is shown to give the best performance overall. Bilingual lexicons extracted from comparable corpora are then combined with the existing bilingual dictionary and used in CLIR experiments, which results in significant improvement of the CLIR system.
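As a rough illustration of what a symmetric, vocabulary-overlap comparability measure of the kind described in this abstract might look like, here is a minimal Python sketch. It is not the thesis's actual definition: the function name `comparability`, the dictionary format (source word mapped to a set of target translations), and the exact normalization are assumptions made for this example only.

```python
def comparability(src_vocab, tgt_vocab, dictionary):
    """Symmetric vocabulary-overlap score in [0, 1] (illustrative sketch).

    dictionary maps each source word to a set of target translations.
    The score is the share of dictionary-covered words, on both sides,
    that have at least one translation in the other side's vocabulary.
    """
    src_vocab, tgt_vocab = set(src_vocab), set(tgt_vocab)

    # Build the reverse (target -> source) dictionary.
    reverse = {}
    for src_word, translations in dictionary.items():
        for tgt_word in translations:
            reverse.setdefault(tgt_word, set()).add(src_word)

    # Only words known to the dictionary can be checked for a translation.
    src_covered = src_vocab & set(dictionary)
    tgt_covered = tgt_vocab & set(reverse)
    total = len(src_covered) + len(tgt_covered)
    if total == 0:
        return 0.0

    src_hits = sum(1 for w in src_covered if dictionary[w] & tgt_vocab)
    tgt_hits = sum(1 for w in tgt_covered if reverse[w] & src_vocab)
    return (src_hits + tgt_hits) / total


# Hypothetical toy data, for illustration only.
toy_dictionary = {"chat": {"cat"}, "chien": {"dog"}, "maison": {"house"}}
score = comparability({"chat", "chien", "arbre"},
                      {"cat", "house", "tree"},
                      toy_dictionary)
print(score)
```

Because words missing from the dictionary are excluded from both numerator and denominator, a score defined this way depends on dictionary coverage, which is presumably why the abstract examines robustness to dictionary changes.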
Goeuriot, Lorraine. "Découverte et caractérisation des corpus comparables spécialisés." Phd thesis, Université de Nantes, 2009. http://tel.archives-ouvertes.fr/tel-00474405.
Bo, Li. "Mesurer et améliorer la qualité des corpus comparables." Phd thesis, Université de Grenoble, 2012. http://tel.archives-ouvertes.fr/tel-00997769.
Hazem, Amir. "Extraction de lexiques bilingues à partir de corpus comparables." Phd thesis, Université de Nantes, 2013. http://tel.archives-ouvertes.fr/tel-00946914.
Ke, Guiyao. "Mesures de comparabilité pour la construction assistée de corpus comparables bilingues thématiques." Phd thesis, Université de Bretagne Sud, 2014. http://tel.archives-ouvertes.fr/tel-00997837.
Delpech, Estelle. "Traduction assistée par ordinateur et corpus comparables : contributions à la traduction compositionnelle." Phd thesis, Université de Nantes, 2013. http://tel.archives-ouvertes.fr/tel-00905930.
Bouamor, Dhouha. "Constitution de ressources linguistiques multilingues à partir de corpus de textes parallèles et comparables." Phd thesis, Université Paris Sud - Paris XI, 2014. http://tel.archives-ouvertes.fr/tel-00994222.
Harastani, Rima. "Alignement lexical en corpus comparables : le cas des composés savants et des adjectifs relationnels." Phd thesis, Université de Nantes, 2014. http://tel.archives-ouvertes.fr/tel-00949025.
Harastani, Rima. "Alignement lexical en corpus comparables : le cas des composés savants et des adjectifs relationnels." Phd thesis, Nantes, 2014. https://archive.bu.univ-nantes.fr/pollux/show/show?id=715f898e-83d7-4541-8a0c-cf910bd67fee.
Our work concerns the automatic extraction of a list of terms aligned with their translations (i.e., a specialized bilingual lexicon) from comparable corpora belonging to a specific domain. Comparable corpora include texts written in two languages which are not mutual translations but belong to the same domain. This thesis contributes to improving the quality of an extracted bilingual lexicon. We propose methods dedicated to the translation of two types of terms that have common characteristics among many languages or that cause specific problems for translation due to their nature. These types of terms are neoclassical compounds (terms containing at least one root borrowed from Greek or Latin) and terms composed of one noun and one relational adjective. We also propose a method that exploits contexts rich in domain-specific terms to re-rank the translations provided in a bilingual lexicon for a given term. The experiments are performed using two specialized comparable corpora (in the domains of Breast Cancer and Renewable Energy), in French, English, German and Spanish.
Cetro, Rosa. "Lexique-grammaire et Unitex : quels apports pour une description terminologique bilingue de qualité ? : analyse sur deux corpus comparables de médecine thermale." Phd thesis, Université Paris-Est, 2013. http://tel.archives-ouvertes.fr/tel-00823735.
Deléger, Louise. "Exploitation de corpus parallèles et comparables pour la détection de correspondances lexicales : application au domaine médical." Paris 6, 2009. http://www.theses.fr/2009PA066400.
Korenchuk, Yuliya. "Méthode d'enrichissement et d'élargissement d'une ontologie à partir de corpus de spécialité multilingues." Thesis, Strasbourg, 2017. http://www.theses.fr/2017STRAC014/document.
This thesis proposes a method for enriching and populating an ontology, a structure of concepts linked by semantic relations, with terms in French, English and German from comparable domain-specific corpora. Our main contribution is the development of extraction methods based on endogenous resources, learned from the corpus and the ontology being analyzed. Because they rely on character n-grams, these resources are readily available and independent of any particular language or domain. The first contribution concerns the use of endogenous morphological and morphosyntactic resources for extracting mono- and polylexical terms from the corpus. The second contribution aims to use endogenous resources to identify translations for these terms. The third contribution concerns the construction of endogenous morphological families designed to enrich and populate the ontology.
Tajo, Kinda. "La terminologie bilingue (Arabe-Français) de la surdité : analyse du discours textuelle et socioterminologique." Thesis, Paris 3, 2013. http://www.theses.fr/2013PA030180.
The specialized text in the domain of deafness is a complex phenomenon in which terms have important semantic functions. Discourse updates the meaning of terms and brings out new, dynamic significations. The bilingual corpus (French, Arabic) is representative of different types of discourse and levels of specialization, especially when it comes to comparing the terminology of deafness in three Arab countries (Lebanon, Syria, Jordan). Terms, which transmit the knowledge of specialized fields, are nowadays a central object of study for terminology. Terms can be extracted manually but also by means of new automatic term extraction software. Our doctoral thesis takes into consideration the linguistic needs of language users, who are now regarded as the real consumers of terminology. The thesis applies socioterminological and textual approaches to the domain of deafness. It highlights phenomena such as synonymy, terminological variation, scientific popularization, metaphor and translation, among many others. The outcome of the research is the construction of a trilingual terminological database that meets the requirements of specialists and non-specialists.
Serrone, Gabriella. "Figement juri-linguistique : étude des collocations dans deux corpus juridiques français et italien." Sorbonne Paris Cité, 2015. http://www.theses.fr/2015USPCC318.
The purpose of this thesis is to study collocational phenomena in order to analyze the genre "judgement" and the text type "judgement of the Cour de cassation", the highest court in the judiciary of some civil law countries. By adopting the definition of collocation theorized by corpus linguistics and John Sinclair (1991, 2004), the research determines the values words take in their specific context in the field of law, and in particular the judicial domain, and their role in the development and progression of the genre and the text under study. Data for the analysis were collected from two comparable corpora, made up of judgements passed by the Cour de cassation in two countries that have this institution at the top of their civil and criminal justice system: France and Italy. In particular, the analysis of the collocation profiles of the key words of the field, in French and Italian, together with the findings on collocation, colligation, semantic preference and prosody, aims to reveal the structure and the organizational phases of the text. This methodical analysis of collocational phenomena and of textual progression in the two languages yields some observations concerning genre and text translation, with the purpose of underlining the advantages a corpus-driven approach can bring to the judicial translator's work.
Hô, Dinh Océane. "Caractérisation différentielle de forums de discussion sur le VIH en vietnamien et en français : Éléments pour la fouille comportementale du web social." Thesis, Sorbonne Paris Cité, 2017. http://www.theses.fr/2017USPCF022/document.
The standard discourse produced by official organisations is confronted with the unofficial or informal discourse of the social web. Empowering people to express themselves results in a new balance of authority when it comes to knowledge, and changes the way people learn. Social web discourse is available to everyone and its size is growing fast, which opens up new fields for both humanities and social sciences to investigate. The latter, however, are not equipped to engage with such complex and little-analysed data. The aim of this dissertation is to investigate how far social web discourse can help supplement official discourse. In it we set out a method to collect and analyse data that is in line with the characteristics of a digital environment, namely data size, anonymity, transience and structure. We focus on forums, where such discourse is built, and test our method on a specific social issue, i.e. the HIV/AIDS epidemic in Vietnam. This field of investigation encompasses several related questions that have to do with health, society, the evolution of morals and the mismatch between different kinds of discourse. Our study is also grounded in the analysis of a comparable French corpus dealing with the same topic, whose genre and discourse characteristics are equivalent to those of the Vietnamese one: this two-pronged research highlights the specific features of different socio-cultural environments.
Zouaidi, Safa. "La combinatoire des verbes d'affect : analyse sémantique, syntaxique et discursive français-arabe." Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAL028/document.
The principal aim of this research is to build an integrative functional model for the analysis of affective verbs in French and Arabic. I have chosen four affective verbs: two verbs of emotion (to astonish and to rage in French and their equivalents in Arabic) and two verbs of sentiment (to admire and to envy in French and their equivalents [ʔadhaʃa], [ʔaɣḍaba] in Arabic), which belong to the semantic fields of Surprise, Anger, Admiration, and Jealousy. More concretely, the analysis proceeds on two levels. On the semantic and syntactic level, the semantic dimensions carried by verbal collocations such as to extremely astonish and to rage prodigiously in French, and [ʔaʕʒaba ʔiʕʒāban kabīran] (admire admiration big)* and [ɣaḍaba ɣaḍabaan ʃadīdan] (to rage rage extreme) in Arabic, are systematically linked to syntax (the recurrent grammatical constructions) (Hoey 2005). On the syntactic and discursive level, the use of passive, active and reflexive forms of affective verbs is examined from the perspective of informational dynamics in the sentence (Van Valin and LaPolla 1997). From a methodological point of view, the study combines a quantitative and a qualitative approach to verbal combinatorics and favours a contrastive perspective. It is based on the French journalistic corpus of the Emobase database (Emolex project, 100 million words) and the journalistic corpus Arabicorpus (137 million words). Furthermore, the thesis contributes to the study of the semantic values and the syntactic and discursive behaviour of affective verb combinations in Arabic and French, which will make it possible to better structure the semantic field of emotions in relation to what is proposed by current studies in lexicography. The main results of the study can be applied in language teaching, translation, and automated processing of the lexicon of emotions in the two languages compared.
Domont, Ludivine. "Minéral/minéralité : étude diachronique de la construction discursive d'un descripteur sensoriel dans les textes prescriptifs et descriptifs de la filière vitivinicole." Thesis, Bourgogne Franche-Comté, 2019. http://www.theses.fr/2019UBFCH016.
Mineral/minerality: diachronic analysis of the discursive construction of a sensory descriptor in prescriptive and descriptive texts from the wine industry. Abstract: This thesis aims to analyze the sensory descriptors mineral and minerality in discourse with an empirical bottom-up and hypothetico-deductive approach. The first part is dedicated to the constitution of the study corpus MINERS, within the theoretical framework of corpus-based linguistics and on the basis of selection criteria defined for the design of a comparable specialized corpus. In the second part, dedicated to empirical analysis, the corpus is exploited within the methodological framework of a computational and situated discourse analysis in order to reconstruct the emergence and diffusion of the use of mineral and minerality. In a short-term diachronic and comparative perspective (1981-2014) between prescriptive and descriptive discourses in French, the methodological framework of discursive and lexical semantics is applied in order to carry out a semantic mapping of the retained sensory descriptors. The goal of these lexical, semantic, discursive, pragmatic and cognitive analyses is to establish an "identikit" definition based on the underlying processes of categorization and conceptualization of the lexemes in question. The circulation of these concepts is also observed from a synchronic perspective (2006-2014) in German and English lingua franca descriptive discourses in order to take into account the conditions of diffusion and the uses of competing forms. Keywords: corpus linguistics, comparable corpora, specialized discourses, oenology, discursive semantics, lexicology, categorization.
Shrestha, Prajol. "Alignement inter-modalités de corpus comparable monolingue." Phd thesis, Université de Nantes, 2013. http://tel.archives-ouvertes.fr/tel-00909179.
Abdul, Rauf Sadaf. "Sélection de corpus en traduction automatique statistique." Phd thesis, Université du Maine, 2012. http://tel.archives-ouvertes.fr/tel-00732984.
Polo, Anna. "La traducción de la modalidad en un corpus de textos científico-divulgativos." Doctoral thesis, Università degli studi di Padova, 2014. http://hdl.handle.net/11577/3424056.
The doctoral thesis presented here, La traducción de la modalidad en un corpus de textos científico-divulgativos, belongs to the strand of empirical-experimental translation studies based on corpus analysis. It sets out to determine whether the translation of modality, in relation to a specific text type, namely the argumentative popular-science essay, constitutes a problem from a translation studies perspective. The semantic domain of modality, understood as the expression of the speaker's attitude towards the truth value of what is stated in the utterance, is characteristic of every communicative act, yet it is a complex notion and is in fact one of the most controversial domains in linguistics. For this reason, a critical review of the different theoretical approaches to this domain was carried out, which led to the adoption of a deliberately narrow view. In this work, modality is therefore characterized as a semantic-pragmatic category based on the logical values of necessity and possibility, expressed through the subdomains of epistemic and root modality (the latter subdivided into dynamic, deontic and anankastic), and related in particular to the speaker's stance on the possibility or necessity of the truth value expressed in the utterance, to the speaker's degree of certainty, and to the speaker's involvement with the text. The lexical markers analysed in this thesis are modal periphrases, verbs of propositional attitude and modal adverbs. These markers do not contribute significantly to the informational content of the utterance but represent the point of view of the enunciator, so failure to recognize their actual value can significantly alter the functional equivalence of the translated texts.
The methodology adopted in this work is based on combining two distinct types of corpus (one parallel and one comparable), which led to the statistical analysis of a large set of data; these made it possible to identify and interpret both regularities and discrepancies in the reference texts that would otherwise be difficult to analyse. On the one hand, the qualitative and quantitative study of the data, devoted to linguistic questions, yielded counts of the occurrences of the individual markers and an analysis of their relative and absolute frequency values. On the other hand, the translation-oriented analysis of the markers under study showed which technical procedures are used in translating them and which, because they substantially alter the point of view of the person producing the utterance, must be considered unacceptable. The originality of this methodology stems from the complementarity of the different tools and methods of analysis used: first, the combination of a quantitative and a qualitative study; second, the adoption of a control corpus in support of the aligned parallel corpus, which allows a more accurate analysis of certain linguistic and translational phenomena connected with the intrinsic complexity of the translation process. The size and homogeneity of the working corpus revealed some systematic tendencies in the translation process that are not merely stylistic options. The results presented in this thesis demonstrate that the translation of modal markers is indeed a problem that must be addressed systematically both in translator training and in translation studies.
Al-Qaisi, Fu'ad. "Apport de la linguistique de corpus à la lexicographie bilingue (français-arabe) : macrostructure et microstructure d'un dictionnaire de collocations." Thesis, Lyon 2, 2015. http://www.theses.fr/2015LYO20115.
The aim of this study is to examine the contribution of corpus linguistics to bilingual French-Arabic lexicography. We particularly focus on collocations, as our research begins with the compilation of a bilingual corpus leading up to the integration of collocations in the lexicon. Fundamentals such as corpus linguistics, corpora and collocation are examined. Our research then takes an empirical turn that is based on the use of our corpus. To overcome the unavailability of corpus processing tools in Arabic, an approach was developed in this study that we called the footbridge strategy. The idea is to start from a French-Arabic (translated) parallel corpus. This corpus consists of the French version of Le Monde Diplomatique, and its translation. Using a parallel corpus aims to facilitate the identification of contrastive phenomena. The results obtained in the translated corpus (in its Arabic component) will be subsequently checked in an Arabic monolingual corpus. The latter is a corpus consisting of three newspapers: Alrai, Alayyam, Algouhouria. Throughout the exploitation of the corpus, results are compared first between corpora and dictionaries, secondly between corpus types (parallel and comparable), and thirdly between newspapers (Alrai, Alayyam, Algouhouria). Then a number of collocations are subjected to semantic and structural review and consideration. This review process not only brings some clarifications on the environment of collocations between language and speech but also about a possible approach for their integration in the dictionary. Legitimate questions gradually arise regarding the resemblance of collocations in French and Arabic. The results highlight phenomena such as collocational chains (clusters), collocational synonyms, etc. The study culminates in the design of a computer dictionary of collocations, i.e. an active bilingual dictionary aimed at Arabic language specialists and translators.
Hoddinott, Simon Matthew. "Web mining for translators: automatic construction of comparable, genre-driven corpora." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/10775/.
Zennaki, Othman. "Construction automatique d'outils et de ressources linguistiques à partir de corpus parallèles." Thesis, Université Grenoble Alpes (ComUE), 2019. http://www.theses.fr/2019GREAM006/document.
This thesis focuses on the automatic construction of linguistic tools and resources for analyzing texts in low-resource languages. We propose an approach using Recurrent Neural Networks (RNNs) that requires only a parallel or multi-parallel corpus between a well-resourced language and one or more low-resource languages. This parallel or multi-parallel corpus is used to construct a multilingual representation of the words of the source and target languages. We used this multilingual representation to train our neural models, and we investigated both uni- and bidirectional RNN models. We also proposed a method to include external information (for instance, low-level information from Part-Of-Speech tags) in the RNN to train higher-level taggers (for instance, SuperSenses taggers and syntactic dependency parsers). We demonstrated the validity and genericity of our approach on several languages and conducted experiments on various NLP tasks: Part-Of-Speech tagging, SuperSenses tagging and dependency parsing. The results obtained are very satisfactory. Our approach has the following characteristics and advantages: (a) it does not use word alignment information; (b) it does not assume any knowledge about the target languages (the one requirement is that the source and target languages are not too syntactically divergent), which makes it applicable to a wide range of low-resource languages; (c) it provides genuinely multilingual taggers (one tagger for N languages).
Chiao, Yun-Chuang. "Extraction lexicale bilingue à partir de textes médicaux comparable : application à la recherche d'information translangue." Paris 6, 2004. https://tel.archives-ouvertes.fr/tel-00007704.
Laviosa-Braithwaite, S. "The English Comparable Corpus (ECC) : a resource and a methodology for the empirical study of translation." Thesis, University of Manchester, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.488308.
Almujaiwel, Sultan Nasser. "Contrastive lexicology and comparable English-Arabic corpora-based analysis of vague and mistranslated Arabic equivalence : the case of the modern English-Arabic dictionary of al-Mawrid." Thesis, University of Exeter, 2012. http://hdl.handle.net/10871/13141.
Do, Thi Ngoc Diep. "Extraction de corpus parallèle pour la traduction automatique depuis et vers une langue peu dotée." Phd thesis, Université de Grenoble, 2011. http://tel.archives-ouvertes.fr/tel-00680046.
Veiga, Alexandre Trigo. "A identificação de termos de Maçonaria simbólica usando corpora comparáveis." Pontifícia Universidade Católica de São Paulo, 2014. https://tede2.pucsp.br/handle/handle/13692.
The present research was developed in order to present an alternative methodology for gathering and identifying terms from a specific area of study in comparable corpora in Portuguese and English using computer tools designed for linguistic analysis. The selected area is Symbolic Freemasonry, and the corpora compiled for this study consist of manuals and rituals used by freemasons in their work that are available on the Internet. The computer tools used for this research are WordSmith Tools 6.0, zExtractor and SketchEngine. The terms identified as a result of this research will provide relevant data for developing a bilingual glossary of Symbolic Freemasonry to aid translators and proofreaders who specialize in masonic works.
Afli, Haithem. "La Traduction automatique statistique dans un contexte multimodal." Thesis, Le Mans, 2014. http://www.theses.fr/2014LEMA1012/document.
The performance of statistical machine translation systems depends on the availability of bilingual parallel texts, also known as bitexts. However, freely available parallel texts are a sparse resource: their size is often limited, their linguistic coverage insufficient, or the domain of the texts inappropriate. There are relatively few language pairs for which parallel corpora of adequate size are available in a given domain. One way to overcome the lack of parallel data is to exploit comparable corpora, which are more abundant. Previous work in this area has been applied to the text modality. The question we ask in this thesis is: can comparable multimodal corpora help address the lack of parallel data in machine translation? We studied how to use resources from different modalities (text or speech) for the development of a statistical machine translation system. The first contribution is a method for extracting parallel data from a comparable multimodal corpus (text and audio). The audio data are transcribed with an automatic speech recognition system and translated with a machine translation system. These translations are then used as queries to select parallel sentences and generate a bitext. In the second contribution, we improve our method by exploiting sub-sentential units, extending our system to generate parallel segments. We also improve the filtering module. Finally, we present several approaches to adapt translation systems with the extracted data. Our experiments, conducted on data from the TED and Euronews web sites, show the feasibility of our approaches.
Prestes, Kassius Vargas. "Extração multilíngue de termos multipalavra em corpora comparáveis." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2015. http://hdl.handle.net/10183/118257.
This work investigates techniques for multiword term extraction from comparable corpora, which are sets of texts in two (or more) languages about the same topic. Term extraction, especially of multiword terms, is very important for creating terminologies and ontologies and for improving machine translation. In this work we use a Portuguese/English comparable corpus and want to find terms and their equivalents in both languages. To do this, we start with separate term extraction for each language, using morphosyntactic patterns to identify the n-grams (sequences of n words) most likely to be important terms of the domain. Given the terms of each language, we use their context, i.e., the words that occur around the term, to compare terms across languages and find the bilingual equivalents. Our main goals in this work were to identify monolingual terms, to apply alignment techniques to Portuguese, and to evaluate different window parameters for context extraction, namely window size and type (the PoS used). This is the first work to apply this methodology to Portuguese, and in spite of the lack of lexical and computational resources (such as bilingual dictionaries and parsers) for this language, we achieved results comparable to the state of the art in French/English.
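The context-based comparison described in this abstract can be sketched roughly as follows. This is a generic context-vector approach, not the author's implementation: the function names, the seed dictionary format, the fixed token window and the cosine similarity are all assumptions made for illustration, and the sketch treats terms as single tokens, whereas the thesis targets multiword terms.

```python
import math
from collections import Counter


def context_vector(tokens, term, window=2):
    """Count the words occurring within `window` tokens of each occurrence of `term`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1
    return vec


def translate_vector(vec, seed_dict):
    """Map a source-language context vector into the target language.

    seed_dict is a hypothetical one-to-one bilingual seed dictionary;
    context words it does not cover are dropped.
    """
    out = Counter()
    for word, count in vec.items():
        if word in seed_dict:
            out[seed_dict[word]] += count
    return out


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(src_tokens, tgt_tokens, src_term, candidates, seed_dict, window=2):
    """Rank target-language candidates by context similarity to the source term."""
    src_vec = translate_vector(context_vector(src_tokens, src_term, window), seed_dict)
    scored = [(cosine(src_vec, context_vector(tgt_tokens, c, window)), c)
              for c in candidates]
    return sorted(scored, reverse=True)


# Hypothetical toy data, for illustration only.
pt_tokens = "o gato bebe leite".split()
en_tokens = "the cat drinks milk the dog eats meat".split()
seed = {"o": "the", "bebe": "drinks", "leite": "milk"}
print(rank_candidates(pt_tokens, en_tokens, "gato", ["cat", "dog"], seed))
```

The `window` parameter stands in for the window size that the abstract says was evaluated. Filtering context words by part of speech, the other parameter mentioned, would need a tagger and is left out of this sketch.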
Shen, Lionel. "Méthodes de veille textométrique multilingue appliquées à des corpus de l’environnement et de l’énergie : « Restitution, prévision et anticipation d’événements par poly-résonances croisées »." Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCA085/document.
This thesis proposes a series of textometric multilingual information monitoring methods applied to thematic corpora (textometry is also called textual statistics or text data analysis). Two types of corpora are mobilized to create this work: a comparable corpus and a parallel corpus in which the textual data are extracted from the press and discourse of NGOs. The information source was retrieved from three countries in three different languages: English, French and Chinese. The two corpora were constructed on two topical issues concerning the environment and energy, with a focus on three concepts: energy, nuclear power and the EPR (European Pressurized Reactor or Evolutionary Power Reactor). After a brief review of the state of the art on business intelligence, information monitoring and textometry, we first set out the two chosen subjects – the environment and energy – and then the morphosyntactic features of the three languages in national and international contexts. The overall characteristics, similarities and peculiarities of these corpora are highlighted successively. The recounts and qualitative and quantitative analyses of the results were carried out using textometric tools, including factor analysis of correspondences, co-occurrences and polyco-occurrential networks, specificities of the hypergeometric model and repeated segments or map sections. Thereafter, bilingual bitextual information monitoring was applied to the same three concepts with the aim of elucidating how the comparable corpus and the parallel corpus can mutually help each other in a process of multilingual information monitoring, by restitution, forecasting and anticipation. We conclude our research by offering an analytical method called Objects-Features-Opening (OFO).
Martínez, Vilinsky Bárbara. "La Infrarrepresentación de elementos únicos en textos traducidos de inglés a español: perífrasis verbales, demostrativos y sufijos apreciativos en un corpus comparable y paralelo de novela policíaca." Doctoral thesis, Universitat Jaume I, 2016. http://hdl.handle.net/10803/669024.
Saad, Motaz. "Fouille de documents et d'opinions multilingue." Thesis, Université de Lorraine, 2015. http://www.theses.fr/2015LORR0003/document.
The aim of this thesis is to study sentiments in comparable documents. First, we collect English, French and Arabic comparable corpora from Wikipedia and Euronews, and we align each corpus at the document level. We further gather English-Arabic news documents from local and foreign news agencies. The English documents are collected from the BBC website and the Arabic documents are collected from the Al-jazeera website. Second, we present a cross-lingual document similarity measure to automatically retrieve and align comparable documents. Then, we propose a cross-lingual sentiment annotation method to label source and target documents with sentiments. Finally, we use statistical measures to compare the agreement of sentiments in the source and target documents of each comparable pair. The methods presented in this thesis are language independent and can be applied to any language pair.
Orenha, Adriane [UNESP]. "Unidades fraseológicas especializadas: colocações e colocações estendidas em contratos sociais e estatutos sociais traduzidos no modo juramentado e não-juramentado." Universidade Estadual Paulista (UNESP), 2009. http://hdl.handle.net/11449/103524.
This investigation aims at carrying out a study on terms, collocations and extended specialized collocations present in articles of incorporation/articles of organization/articles of association and bylaws that represent our research corpora. We will also observe similarities and differences in sworn and legal translation corpora concerning the use of such terms and lexical patterns, and point out the ones that are most frequently used in the documents under study. This research derives its theoretical and methodological sources from Corpus-Based Translation Studies, Corpus Linguistics and Phraseology, more specifically from work on collocations, specialized collocations and specialized phraseological units (SPUs). Terminology, from its theoretical standpoint, also contributes to this study, as do works on sworn translation. One of the aspects that motivates this study is the fact that sworn translation is considered to be of great relevance to commercial, social and legal relations among nations. To conduct this research, we compiled a study corpus (CE1) composed of articles of incorporation/articles of organization/articles of association and bylaws submitted to the process of sworn translation in the English-Portuguese and Portuguese-English directions, excerpted from the Books of Sworn Translation Records, made available by five Brazilian sworn translators, duly sworn by the Board of Trade of two Brazilian States; and a study corpus (CE2) made up of documents of the same nature not submitted to the process of sworn translation, in the same translation directions. Besides these corpora, we also built two comparable corpora formed by documents of the same kind originally written in Portuguese and in English. The results obtained in this research showed some similarities with regard to the terms used in documents submitted to the process of sworn translation...
Jakubina, Laurent. "Induction de lexiques bilingues à partir de corpus comparables et parallèles." Thèse, 2017. http://hdl.handle.net/1866/20488.
Rebout, Lise. "L’extraction de phrases en relation de traduction dans Wikipédia." Thèse, 2012. http://hdl.handle.net/1866/8614.
Working with comparable corpora can be useful to enhance bilingual parallel corpora. In such corpora, even if the documents in the target language are not exact translations of those in the source language, one can still find translated words or sentences. The free encyclopedia Wikipedia is a multilingual comparable corpus of several million documents. Our task is to find a general endogenous method for extracting as many parallel sentences as possible from this source. We work with the English-French language pair, but our method, which uses no external bilingual resources, can be applied to any other language pair. It can best be described in two steps. The first consists of detecting the article pairs that are most likely to contain translations. This is achieved through a neural network trained on a small data set composed of sentence-aligned articles. The second step is to select sentence pairs through another neural network whose outputs are then re-interpreted by a combinatorial optimization algorithm and an extension heuristic. Adding the 560,000 sentence pairs extracted from Wikipedia to the training set of a baseline statistical machine translation system improves the quality of the resulting translations. We make both the aligned data and the extracted corpus available to the scientific community.
Tay, Hui-teng, and 鄭暉騰. "Exploring Explicitation in Legal Translation through a Comparable Corpus." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/74935488698970480351.
國立臺灣師範大學
翻譯研究所
104
Explicitation is a much-studied phenomenon and is considered by some to be a "universal" of translation. In the field of legal translation, where the precision and accuracy of language are key priorities, it would make sense for explicitation to be more pronounced in translations than in non-translated English texts. But to what extent is this true? This is the primary issue this paper seeks to explore. Using the easily accessible and versatile corpus-processing tool AntConc, I analysed a monolingual comparable corpus consisting of a translational English sub-corpus and a non-translational English sub-corpus, both drawn from court judgments published on the website of the Judiciary of the Hong Kong Special Administrative Region. Due to the unique nature of Hong Kong's legal system, in which both English and Chinese are official languages of the court, a large number of key judgments are translated from Chinese into English, presumably for reference purposes, which makes them suitable for the study of translational differences in legal translation. With respect to explicitation, we looked at several explicitation phenomena, including the "verb+that-clause" pattern, conjunctions and transitional words. The frequencies of these explicitation phenomena are tabulated, and the differences in frequency between the translational and non-translational sub-corpora are subjected to a log-likelihood test to determine their significance. The findings as a whole do support the view that explicitating connectives are used in a statistically more pronounced manner in the translated sub-corpus.
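The log-likelihood comparison mentioned in the abstract above can be sketched as follows. This is a minimal illustration only, assuming the two-corpus log-likelihood (G2) formulation commonly used in corpus linguistics; the abstract does not state which variant the author applied, and the function name and the counts in the usage lines are hypothetical, not figures from the thesis.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood (G2) for one feature.

    freq_a, freq_b: observed occurrences of the feature in each sub-corpus.
    size_a, size_b: total token counts of each sub-corpus.
    (Hypothetical helper; an assumed formulation, not the author's code.)
    """
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    # Expected counts if the feature were equally frequent in both sub-corpora.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


# Illustrative counts only: a connective seen 120 times in 100,000 translated
# tokens versus 80 times in 100,000 non-translated tokens.
score = log_likelihood(120, 100_000, 80, 100_000)
# 3.84 is the chi-square critical value for 1 degree of freedom at p < 0.05.
print(score > 3.84)
```

A score above 3.84 would suggest that the difference in relative frequency is unlikely to be due to chance at the 5% level; identical relative frequencies give a score of zero.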
Le, Serrec Annaïch. "Analyse comparative de l'équivalence terminologique en corpus parallèle et en corpus comparable : application au domaine du changement climatique." Thèse, 2012. http://hdl.handle.net/1866/9044.
The research undertaken for this thesis concerns the analysis of terminological equivalence in a parallel corpus and a comparable corpus. More specifically, we focus on specialized texts related to the domain of climate change. A distinctive aspect of this study is its analysis of the equivalents of single-word terms. The theoretical frameworks on which we rely are the terminologie textuelle (Bourigault and Slodzian 1999) and the lexico-sémantique approaches (L’Homme 2005). This study has two objectives. The first is to perform a comparative analysis of terminological equivalents in the two types of corpora in order to verify whether the equivalents found in the parallel corpus differ from the ones observed in the comparable corpus. The second is to compare in detail the equivalents associated with the same English term, in order to describe them and define a typology. A detailed analysis of the French equivalents of 343 English terms is carried out with the help of computer tools (term extractor, text aligner, etc.) and a rigorous methodology divided into three parts. The first part, common to both objectives of the research, concerns the elaboration of the corpus, the validation of the English terms and the identification of the French equivalents in the two corpora. The second part describes the criteria on which we rely to compare the equivalents from the two types of corpora. The third part sets up the typology of equivalents associated with the same English term. The results for the first objective show that, of the 343 English terms analyzed, relatively few (12) have equivalents open to criticism in both corpora, while the number of terms with similar equivalents in the two corpora is very high (272 identical equivalents and 55 unobjectionable equivalents). The analysis described in this chapter confirms our hypothesis that the terminology used in parallel corpora does not differ from that used in comparable corpora.
The results for the second objective show that many English terms are rendered by several equivalents (70% of the terms analyzed). The largest group of equivalents is made up of near-synonyms rather than synonyms, and equivalents belonging to another part of speech also account for an important share of the equivalents analyzed. The typology developed in this thesis thus presents mechanisms of terminological equivalence that have rarely been described so systematically in previous work.
Grégoire, Francis. "Extraction de phrases parallèles à partir d’un corpus comparable avec des réseaux de neurones récurrents bidirectionnels." Thèse, 2017. http://hdl.handle.net/1866/20191.
Špínová, Adéla. "Hypotéza unique items v překladu. Korpusová studie." Master's thesis, 2017. http://www.nusl.cz/ntk/nusl-370047.