Academic literature on the topic 'Corpus comparable'

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Corpus comparable.'

Journal articles on the topic "Corpus comparable"

1

Laviosa, Sara. "How Comparable Can 'Comparable Corpora' Be?" Target. International Journal of Translation Studies 9, no. 2 (January 1, 1997): 287–317. http://dx.doi.org/10.1075/target.9.2.05lav.

Abstract:
The development of a coherent methodology for corpus-based work in translation studies is essential for the evolution of this new field of research into a fully-fledged paradigm within the discipline. The design of a monolingual, multi-source-language comparable corpus of English as a resource for the systematic study of the nature of translated text can be regarded as an important step towards the development of such a methodology. This paper deals with a crucial and problematic aspect of the design of a monolingual comparable corpus, namely the achievement of an adequate level of comparability between its translational and non-translational components.
2

López Arroyo, Belén. "Can comparable corpora be compared?" Ibérica, no. 39 (January 2, 2020): 43–68. http://dx.doi.org/10.17398/2340-2784.39.43.

Abstract:
It can be said that, at present, there is no unanimous agreement on the criteria for compiling a comparable corpus or on how to assess a corpus's comparability. A comparable corpus is a collection of texts in different languages or varieties that are similar in certain respects. But in which ones? According to McEnery and Wilson (2007: 20), sampling proportion, genre, domain, and time should be the main criteria when compiling a comparable corpus, and they should be the same across the different languages. However, previous studies (López-Arroyo & Roberts, 2017) show that these criteria may not be valid in all fields. In the present study, we analyze comparability from the point of view of the purpose of the corpus. To this end, we compiled a comparable corpus of 150 wine tasting notes in English and 150 in Spanish, written by two authorities in the field and published in the same decades; according to McEnery and Xiao (2007), our subcorpora meet all the requirements for comparability. However, our methodology, which focuses on the analysis of other factors such as format, content, and style, will show that proportion, genre, domain, time, and size alone are not always sufficient when comparing corpora.
3

Čermáková, Ann, Jarmo Jantunen, Tommi Jauhiainen, John Kirk, Michal Křen, Marc Kupietz, and Elaine Uí Dhonnchadha. "The International Comparable Corpus: Challenges in building multilingual spoken and written comparable corpora." Research in Corpus Linguistics 9, no. 1 (2021): 89–103. http://dx.doi.org/10.32714/ricl.09.01.06.

Abstract:
This paper reports on the efforts of twelve national teams in building the International Comparable Corpus (ICC; https://korpus.cz/icc) that will contain highly comparable datasets of spoken, written and electronic registers. The languages currently covered are Czech, Finnish, French, German, Irish, Italian, Norwegian, Polish, Slovak, Swedish and, more recently, Chinese, as well as English, which is considered to be the pivot language. The goal of the project is to provide much-needed data for contrastive corpus-based linguistics. The ICC is committed to the idea of re-using existing multilingual resources as much as possible, and its design is modelled, with various adjustments, on the International Corpus of English (ICE). As such, ICC will contain approximately the same balance of 40 percent written language and 60 percent spoken language, distributed across 27 different text types and contexts. A number of issues encountered by the project teams are discussed, ranging from copyright and data sustainability to technical advances in data distribution.
4

Awal, Norsimah Mat, Intan Safinaz Zainuddin, and Imran Ho-Abdullah. "Use of Comparable Corpus in Teaching Translation." Procedia - Social and Behavioral Sciences 18 (2011): 638–42. http://dx.doi.org/10.1016/j.sbspro.2011.05.094.

5

López Arroyo, Belén, and Roda P. Roberts. "Genre and Register in Comparable Corpora: An English/Spanish Contrastive Analysis." Meta 62, no. 1 (July 6, 2017): 114–36. http://dx.doi.org/10.7202/1040469ar.

Abstract:
A multilingual comparable corpus is a corpus containing texts that are collected using the same sampling frame and similar balance and representativeness. According to McEnery and Xiao (2007: 20), sampling proportion, genre, domain, and time constitute the main criteria when compiling a comparable corpus, and these criteria must match in the different languages for the corpus to be considered comparable. The problem is that these criteria do not always guarantee that the different language subcorpora in a comparable corpus match. This study, which analyzes two comparable corpora compiled by the authors, shows that, even when the text selection criteria are refined, genre theory cannot always guarantee enough linguistic similarities between language for specific purposes (LSP) texts in different languages. Genre seems to suffice to establish a good comparable corpus for scientific abstracts. However, the comparable corpus of wine tasting notes is not truly comparable, since the English and Spanish texts differ in register.
6

Weng, Yu, Shumin Dong, and Chaomurilige Chaomurilige. "A Privacy-Preserving Multilingual Comparable Corpus Construction Method in Internet of Things." Mathematics 12, no. 4 (February 17, 2024): 598. http://dx.doi.org/10.3390/math12040598.

Abstract:
With the expansion of Internet of Things (IoT) and artificial intelligence (AI) technologies, multilingual scenarios are gradually increasing, and applications based on multilingual resources are also on the rise. In this process, apart from the need to construct multilingual resources, privacy protection issues such as data privacy leakage are increasingly highlighted. Comparable corpora are important in multilingual language information processing in IoT. However, privacy-preserving multilingual comparable corpora are rare, so there is an urgent need to construct such a multilingual corpus resource. This paper proposes a method for constructing a privacy-preserving multilingual comparable corpus, taking Chinese–Uighur–Tibetan IoT-based news as an example. The method maps the texts in different languages to a unified language vector space to avoid sensitive information, then calculates the similarity between texts in different languages, which serves as a comparability index for constructing comparable relations. Through the decision-making mechanism of minimizing the impossibility, it can identify a comparable corpus pair of multilingual texts based on chapter size to realize the construction of a privacy-preserving Chinese–Uighur–Tibetan comparable corpus (CUTCC). Evaluation experiments demonstrate the effectiveness of our proposed provable method, which outperforms in accuracy rate by 77%, recall rate by 34% and F value by 47.17%. The CUTCC provides valuable privacy-preserving data resources support and language service for multilingual situations in IoT.
7

Sun, Yuan, and Qian Zhao. "Tibetan-Chinese Named Entity Extraction Based on Comparable Corpus." Applied Mechanics and Materials 571-572 (June 2014): 1202–5. http://dx.doi.org/10.4028/www.scientific.net/amm.571-572.1202.

Abstract:
Tibetan-Chinese named entity extraction is the foundation of cross-language information processing and provides a basis for research on machine translation and cross-language information retrieval. In this paper, we use the multi-language links of Wikipedia to obtain a Tibetan-Chinese comparable corpus, and combine sentence length, word matching, and entity boundary words to obtain parallel sentences. We then extract Tibetan-Chinese named entities from the comparable corpus in three ways: (1) extracting natural labeling information; (2) acquiring the links between Tibetan entries and Chinese entries; (3) using a sequence intersection method, which includes sentence representation, Chinese named entity recognition, and intersection of the corresponding Tibetan sentences. Finally, the results show that the extraction method based on the comparable corpus is effective.
8

Li, Bo, Eric Gaussier, and Dan Yang. "Measuring bilingual corpus comparability." Natural Language Engineering 24, no. 4 (January 15, 2018): 523–49. http://dx.doi.org/10.1017/s1351324917000481.

Abstract:
Comparable corpora serve as an important substitute for parallel resources in cases of under-resourced language pairs. Previous work mostly aims to find a better strategy to exploit existing comparable corpora, while ignoring the variety in corpus quality. The quality of comparable corpora greatly affects their usability in practice, a fact that has been justified by several studies. However, researchers have not been able to establish a widely accepted and fully validated framework to measure corpus quality. In this paper, we therefore investigate a comprehensive methodology to deal with the quality of comparable corpora. To be exact, we propose several comparability measures and a quantitative strategy to test those measures. Our experiments show that the proposed comparability measure can capture gold-standard comparability levels very well and is robust to the bilingual dictionary used. Moreover, we show in the task of bilingual lexicon extraction that the proposed measure correlates well with the performance of the real-world application.
9

Forchini, Pierfranca, and Amanda Murphy. "N-grams in comparable specialized corpora." International Journal of Corpus Linguistics 13, no. 3, special issue "Patterns, meaningful units and specialized discourses" (September 17, 2008): 351–67. http://dx.doi.org/10.1075/ijcl.13.3.06for.

Abstract:
This paper investigates the idiom principle realized as four-word phrases (4-grams) headed by prepositions in specialized corpora in English and Italian. Concentrating on 'at the end of', it reports that the collocates of 'at the end of' regard time, and that apparently synonymic 4-grams are not used in the same contexts. It then explores realizations of 'at the end of' in a specialized comparable corpus of Italian. Two findings emerge: firstly, that the most obvious equivalent, 'alla fine d*', occurs more frequently than in the English corpus; secondly, this n-gram is frequently used, but has weaker collocational relations, and several synonymic 3-grams share its collocates. This invites contrastive research on lexical variation and repetition and on the strength of collocations of multi-word units in English and Italian. Lastly, the paper recounts an experiment with students who gained awareness of language by concentrating on phraseology in comparable corpora.
10

Visky, Mihaela. "L’utilisation du corpus comparable dans l’enseignement de la traduction." Professional Communication and Translation Studies 6 (2013): 165–76. http://dx.doi.org/10.59168/mojx8412.

Abstract:
Comparable corpora serve as a basis for computer-assisted translation, contrastive analysis, lexicology, etc., and they are also used in translation teaching. The exercises given to students were aimed at improving comprehension of source texts and reformulation in the target language, especially with regard to the use of terms and expressions specific to each language. We believe that using comparable corpora in the translation classroom is an introduction to the professional environment and an important stage in translator training.

Dissertations / Theses on the topic "Corpus comparable"

1

Shrestha, Prajol. "Alignement inter-modalités de corpus comparable monolingue." PhD thesis, Université de Nantes, 2013. http://tel.archives-ouvertes.fr/tel-00909179.

Abstract:
The growing production of electronic documents available as text or audio (newspapers, radio, audio recordings of television, etc.) calls for automated tools for tracking and navigation. It should be possible, for example, while reading an article in an online newspaper, to access radio broadcasts corresponding to what is being read. Such fine-grained navigation between media requires aligning "passages" with similar content in documents from different monolingual, comparable modalities. Our work focuses on this problem of aligning short texts in a monolingual, multimodal comparable setting. The problem is to find similarities between short texts and to determine how to extract features from these texts that help us find the similarities needed for the alignment process. We contribute to this problem in three parts. The first part attempts to define similarity, which is the basis of the alignment process. The second part aims to develop a new text representation to facilitate the creation of the reference corpus that will be used to evaluate alignment methods. Finally, the third contribution is to study different alignment methods and the effect of their components on the alignment process. These components include different text representations, weights, and similarity measures.
2

Prochasson, Emmanuel. "Alignement multilingue en corpus comparables spécialisés." PhD thesis, Université de Nantes, 2009. http://tel.archives-ouvertes.fr/tel-00462248.

Abstract:
Comparable corpora bring together multilingual documents that are not translations of one another but share common features. Our work concerns bilingual lexicon extraction from these corpora, that is, the recognition and alignment of a common multilingual vocabulary available in the corpus. We focus on specialized comparable corpora, that is, corpora made up of documents that reveal the terminology used in specialized languages. We work on medical corpora: one of them covers the topic of diabetes and nutrition, in French, English, and Japanese; the other covers the topic of breast cancer, in English and French. We propose and evaluate various improvements to the alignment process, in particular for the difficult case of Japanese. We extend this manuscript with a reflection on the nature of comparable corpora and the notion of comparability.
3

Li, Bo. "Mesurer et améliorer la qualité des corpus comparables." Thesis, Grenoble, 2012. http://www.theses.fr/2012GRENM069.

Abstract:
Bilingual corpora are an essential resource used to cross the language barrier in multilingual Natural Language Processing (NLP) tasks. Most current work makes use of parallel corpora, which are mainly available for major languages and constrained domains. Comparable corpora, text collections comprising documents that cover overlapping information, are however less expensive to obtain in high volume. Previous work has shown that using comparable corpora is beneficial for several NLP tasks. In parallel with those studies, we try in this thesis to improve the quality of comparable corpora so as to improve the performance of applications exploiting them. The idea is advantageous since it can work with any existing method making use of comparable corpora. We first discuss the notion of comparability, inspired by experience in using bilingual corpora. The notion motivates several implementations of the comparability measure under the probabilistic framework, as well as a methodology to evaluate the ability of comparability measures to capture gold-standard comparability levels. The comparability measures are also examined in terms of robustness to dictionary changes. The experiments show that a symmetric measure relying on vocabulary overlap can correlate very well with gold-standard comparability levels and is robust to dictionary changes. Based on the comparability measure, two methods, namely the greedy approach and the clustering approach, are then developed to improve the quality of any given comparable corpus. The general idea of these two methods is to choose the high-quality subpart from the original corpus and to enrich the low-quality subpart with external resources. The experiments show that one can improve the quality, in terms of comparability scores, of the given comparable corpus by these two methods, with the clustering approach being more efficient than the greedy approach. The enhanced comparable corpus further results in better bilingual lexicons extracted with the standard extraction algorithm. Lastly, we investigate the task of Cross-Language Information Retrieval (CLIR) and the application of comparable corpora in CLIR. We develop novel CLIR models extending the recently proposed information-based models in monolingual IR. The information-based CLIR model is shown to give the best performance overall. Bilingual lexicons extracted from comparable corpora are then combined with the existing bilingual dictionary and used in CLIR experiments, which results in significant improvement of the CLIR system.
4

Abdul, Rauf Sadaf. "Sélection de corpus en traduction automatique statistique." PhD thesis, Université du Maine, 2012. http://tel.archives-ouvertes.fr/tel-00732984.

Abstract:
In our world of international communication, machine translation has become an essential key technology. Several approaches exist, but for some years now statistical machine translation has been considered the most promising. In this approach, all knowledge is extracted automatically from examples of translations, called parallel texts, and from monolingual data in the target language. Statistical machine translation is a data-driven process. This is commonly put forward as a major advantage of statistical approaches, since no intervention by bilingual humans is needed, but it can turn into a problem when the data needed to develop the system are unavailable, too small, or of an unsuitable genre. The research presented in this thesis is an attempt to overcome one of the obstacles to the large-scale deployment of statistical machine translation systems: the lack of parallel corpora. A parallel corpus is a collection of sentences in the source and target languages that are aligned at the sentence level. Most existing parallel corpora have been produced by professional translators. This is a costly task in terms of money, human resources, and time. In the first part of this thesis, we worked on using comparable corpora to improve statistical translation systems. A comparable corpus is a collection of data in several languages, collected independently, but which often contains parts that are mutual translations. The size and quality of the parallel content can vary considerably from one comparable corpus to another, depending on various factors, notably the method used to build the corpus. In any case, it is not easy to identify parallel parts automatically. As part of this thesis, we developed an approach for doing so that is based entirely on freely available tools. The main idea of our approach is to use a statistical machine translation system to translate all the source-language sentences of the comparable corpus. Each of these translations is then used as a query to find potentially parallel sentences. This search is carried out with an information retrieval tool. In a second step, the retrieved sentences are compared with the machine translations to determine whether they are indeed parallel to the corresponding source-language sentence. Several criteria were evaluated, such as word error rate and translation edit rate (TER). We carried out a very detailed experimental analysis to demonstrate the value of our approach. The comparable corpora used are in the news domain, more precisely newswire from press agencies such as Agence France-Presse (AFP), Associated Press, and Xinhua News. These agencies publish news daily in several languages. We were able to extract parallel texts from large collections of more than three hundred million words for the French/English and Arabic/English language pairs. These parallel texts significantly improved our statistical translation systems. We also present a theoretical comparison of the model developed in this thesis with another approach presented in the literature. Various extensions are also studied: automatic extraction of unknown words and creation of a dictionary, detection and removal of extra information, etc. In the second part of this thesis, we examined the possibility of using monolingual data to improve the translation model of a statistical system...
5

Bouamor, Dhouha. "Constitution de ressources linguistiques multilingues à partir de corpus de textes parallèles et comparables." PhD thesis, Université Paris Sud - Paris XI, 2014. http://tel.archives-ouvertes.fr/tel-00994222.

Abstract:
Bilingual lexicons are particularly useful resources for machine translation and cross-language information retrieval. Building them manually requires strong expertise in both languages concerned and is a costly process. Several automatic methods have been proposed as an alternative, but they are available for only a limited number of languages and their performance still lags far behind the quality of manual translations. Our work concerns the extraction of these bilingual lexicons from corpora of parallel and comparable texts, that is, the recognition and alignment of a common multilingual vocabulary present in these corpora.
6

Al-Qaisi, Fu'ad. "Apport de la linguistique de corpus à la lexicographie bilingue (français-arabe) : macrostructure et microstructure d'un dictionnaire de collocations." Thesis, Lyon 2, 2015. http://www.theses.fr/2015LYO20115.

Abstract:
The aim of this study is to examine the contribution of corpus linguistics to bilingual French-Arabic lexicography. We particularly focus on collocations, as our research begins with the compilation of a bilingual corpus leading up to the integration of collocations in the lexicon. Fundamentals such as corpus linguistics, corpora and collocation are examined. Our research then takes an empirical turn that is based on the use of our corpus. To overcome the unavailability of corpus processing tools in Arabic, an approach was developed in this study that we called the footbridge strategy. The idea is to start from a French-Arabic (translated) parallel corpus. This corpus consists of the French version of Le Monde Diplomatique, and its Arabic translation. Using a parallel corpus aims to facilitate the identification of contrastive phenomena. The results obtained in the translated corpus (in its Arabic component) will be subsequently checked in an Arabic monolingual (comparable) corpus. The latter is a corpus consisting of three newspapers: Alrai, Alayyam, Algouhouria. Throughout the exploitation of the corpus, results are compared first between corpora and dictionaries, secondly between corpus types (parallel and comparable), and thirdly between newspapers (Alrai, Alayyam, Algouhouria). Then a number of collocations are subjected to semantic and structural review. This review process not only brings some clarifications on the environment of collocations between language and speech but also about a possible approach for their integration in the dictionary. Legitimate questions gradually arise regarding the resemblance of collocations in French and Arabic. The results highlight phenomena such as collocational chains (clusters), collocational synonyms, etc. The study culminates in the design of a computer dictionary of collocations, i.e. an active bilingual dictionary aimed at Arabic language specialists and translators.
7

Hoddinott, Simon Matthew. "Web mining for translators: automatic construction of comparable, genre-driven corpora." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/10775/.

Abstract:
The aim of this paper is to evaluate the efficacy of the application WebBootCaT for creating specialised corpora automatically, investigating the translation of articles of association from Italian into English. The first section reflects on the relevant literature and argues for the utility of corpora for translators. The second section discusses the methodology employed, and the third section analyses the results obtained and comments on how language professionals could exploit the application to the full. The fourth section provides a few concrete usage examples of the corpora thus built, and concludes that WebBootCaT is a genuinely powerful tool that professional translators could use to save time and improve their translations in the long term.
8

Laviosa-Braithwaite, S. "The English Comparable Corpus (ECC): a resource and a methodology for the empirical study of translation." Thesis, University of Manchester, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.488308.

9

Chiao, Yun-Chuang. "Extraction lexicale bilingue à partir de textes médicaux comparable : application à la recherche d'information translangue." Paris 6, 2004. https://tel.archives-ouvertes.fr/tel-00007704.

10

Zennaki, Othman. "Construction automatique d'outils et de ressources linguistiques à partir de corpus parallèles." Thesis, Université Grenoble Alpes (ComUE), 2019. http://www.theses.fr/2019GREAM006/document.

Abstract:
This thesis focuses on the automatic construction of linguistic tools and resources for analyzing texts in low-resource languages. We propose an approach that uses Recurrent Neural Networks (RNNs) and requires only a parallel or multi-parallel corpus between a well-resourced source language and one or more low-resource target languages. This parallel or multi-parallel corpus is used to construct a multilingual representation of the words of the source and target languages. We used this multilingual representation to train our neural models, and we investigated two architectures: simple (unidirectional) and bidirectional RNNs. We also proposed several RNN variants that take low-level linguistic information into account (for instance, Part-Of-Speech tags) when building higher-level annotators (for instance, SuperSenses taggers and syntactic dependency parsers). We demonstrated the validity and genericity of our approach on several languages and on several NLP tasks: Part-Of-Speech tagging, SuperSenses tagging, and dependency parsing. The results obtained are very satisfactory. Our approach has the following characteristics and advantages: (a) it does not use word alignment information; (b) it does not assume any prior knowledge about the target languages (the one requirement is that the source and target languages are not too syntactically divergent), which makes it applicable to a wide range of low-resource languages; (c) it provides authentic multilingual taggers (one tagger for N languages).

Books on the topic "Corpus comparable"

1

Laviosa-Braithwaite, S. The English comparable corpus (ECC): A resource and a methodology for the empirical study of translation. Manchester: UMIST, 1996.

2

McEnery, Tony. Corpus Linguistics. Edited by Ruslan Mitkov. Oxford University Press, 2012. http://dx.doi.org/10.1093/oxfordhb/9780199276349.013.0024.

Abstract:
Corpus data have emerged as the raw data and benchmark for several NLP applications. A corpus is described as a large body of linguistic evidence composed of attested language use. It may be contrasted with sentences constructed from metalinguistic reflection upon language use rather than produced as a result of communication in context. A corpus can be spoken or written. Corpora can be categorized as follows: monolingual, representing one language; comparable, using multiple monolingual corpora to create a comparative framework; and parallel, in which a corpus in one language is gathered and its data translated into other languages. The choice of corpus depends on the research question or the chosen application. Adding linguistic information can enhance a corpus; this annotation is carried out by human analysts, by machines, or by a combination of the two. The modern computerized corpus has been in use only since the 1940s. Since then, the volume of corpus banks has risen steadily and assumed an increasingly multilingual nature.
3

Cermáková, Anna, Hilde Hasselgård, Markéta Malá, and Denisa Šebestová, eds. Contrastive Corpus Linguistics. Bloomsbury Publishing Plc, 2024. http://dx.doi.org/10.5040/9781350385962.

Abstract:
Marking 30 years of contrastive corpus linguistics, this volume provides a state-of-the-art overview of the field, charting its development over time and expanding the boundaries of the discipline. Focusing on a diversity of methods and approaches to language comparison, it uses both comparable and translation corpora, and explores a broad range of language registers from newspaper reporting and spoken political discourse to film scripts and football match reports. Using English as the pivot language for each chapter, the volume offers contrastive bilingual and trilingual perspectives on a number of languages, including Czech, Finnish, French, German, Norwegian, Spanish, and Swedish, covering a typologically diverse field. By exploring the application of complex multi-genre multilingual data sets and expanding the horizons of contrastive studies, it demonstrates how a juxtaposition of cross-linguistic and register variation can deepen our insight into language variation and use. The volume is dedicated to two prominent contrastive corpus linguists, Karin Aijmer and Bengt Altenberg, who have decisively shaped the discipline from its very beginnings. The book opens with a chapter by Aijmer, reflecting on the current breadth and future prospects of research in the area while pointing to emergent trends with an insight that only she can offer.

Book chapters on the topic "Corpus comparable"

1

Shultz, Thomas R., Scott E. Fahlman, Susan Craw, Periklis Andritsos, Panayiotis Tsaparas, Ricardo Silva, Chris Drummond, et al. "Comparable Corpus." In Encyclopedia of Machine Learning, 194. Boston, MA: Springer US, 2011. http://dx.doi.org/10.1007/978-0-387-30164-8_144.

2

Baradaran Hashemi, Homa, Azadeh Shakery, and Heshaam Faili. "Creating a Persian-English Comparable Corpus." In Multilingual and Multimodal Information Access Evaluation, 27–39. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010. http://dx.doi.org/10.1007/978-3-642-15998-5_5.

3

Eckart, Thomas, and Uwe Quasthoff. "Statistical Corpus and Language Comparison on Comparable Corpora." In Building and Using Comparable Corpora, 151–65. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. http://dx.doi.org/10.1007/978-3-642-20128-8_8.

4

McEnery, Tony, and Richard Xiao. "Parallel and comparable corpora: The state of play." In Corpus-Based Perspectives in Linguistics, 131–45. Amsterdam: John Benjamins Publishing Company, 2007. http://dx.doi.org/10.1075/ubli.6.11mce.

5

Li, Bo, and Eric Gaussier. "Exploiting Comparable Corpora for Lexicon Extraction: Measuring and Improving Corpus Quality." In Building and Using Comparable Corpora, 131–49. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. http://dx.doi.org/10.1007/978-3-642-20128-8_7.

6

Philip, Gill. "Arriving at equivalence: Making a case for comparable general reference corpora in translation studies." In Corpus Use and Translating, 59–73. Amsterdam: John Benjamins Publishing Company, 2009. http://dx.doi.org/10.1075/btl.82.06phi.

7

Rahimi, Zahra, and Azadeh Shakery. "Topic Based Creation of a Persian-English Comparable Corpus." In Information Retrieval Technology, 458–69. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011. http://dx.doi.org/10.1007/978-3-642-25631-8_41.

8

Rogati, Monica, and Yiming Yang. "Cross-Lingual Pseudo-Relevance Feedback Using a Comparable Corpus." In Lecture Notes in Computer Science, 151–57. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002. http://dx.doi.org/10.1007/3-540-45691-0_12.

9

Kruijt, Anne, Patrizia Cordin, and Stefan Rabanus. "On the validity of crowdsourced data." In Studies in Corpus Linguistics, 10–33. Amsterdam: John Benjamins Publishing Company, 2023. http://dx.doi.org/10.1075/scl.110.01kru.

Abstract:
This chapter demonstrates the validity of crowdsourced data by comparing the crowdsourced data from the VinKo project with traditionally collected data from the AThEME project. Both datasets target non-standard language varieties of the South Tyrol, Trentino, and Veneto regions in north-eastern Italy. Three different morphosyntactic phenomena are discussed, each relating to a particular language variety, providing evidence that the crowdsourced data is of comparable quality to the traditionally gathered data and has the added advantage of yielding a larger overall dataset covering a denser location network.
10

Picchi, Eugenio, and Carol Peters. "Cross-Language Information Retrieval: A System for Comparable Corpus Querying." In Cross-Language Information Retrieval, 81–92. Boston, MA: Springer US, 1998. http://dx.doi.org/10.1007/978-1-4615-5661-9_7.


Conference papers on the topic "Corpus comparable"

1

Huidrom, Rudali, Yves Lepage, and Khogendra Khomdram. "EM Corpus: a comparable corpus for a less-resourced language pair Manipuri-English." In The 14th Workshop on Building and Using Comparable Corpora. INCOMA Ltd. Shoumen, BULGARIA, 2021. http://dx.doi.org/10.26615/978-954-452-076-2_008.

2

Youliang, Zhou, Gong Zhengxian, and Zhou Guodong. "Complement the comparable corpus obtained from websites." In 2010 2nd International Conference on Future Computer and Communication. IEEE, 2010. http://dx.doi.org/10.1109/icfcc.2010.5497762.

3

Lupu, Mihai. "Bootstrapping a Comparable Corpus from Patent Family Members." In 2012 23rd International Workshop on Database and Expert Systems Applications (DEXA). IEEE, 2012. http://dx.doi.org/10.1109/dexa.2012.60.

4

Sun, Y., and W. B. Guo. "Tibetan–Chinese cross-language-network comparable corpus extraction." In International Conference on Computer Science and Technology. Southampton, UK: WIT Press, 2014. http://dx.doi.org/10.2495/iccst140311.

5

Lv, Fei, and Zede Zhu. "A Summary of Studies on Bilingual Comparable Corpus." In 2019 International Conference on Smart Grid and Electrical Automation (ICSGEA). IEEE, 2019. http://dx.doi.org/10.1109/icsgea.2019.00138.

6

JP, Sanjanasri, Vijay Krishna Menon, Soman KP, and Krzysztof Wolk. "Mining Bilingual Word Pairs from Comparable Corpus using Apache Spark Framework." In The 14th Workshop on Building and Using Comparable Corpora. INCOMA Ltd. Shoumen, BULGARIA, 2021. http://dx.doi.org/10.26615/978-954-452-076-2_002.

7

Abidi, K., M. A. Menacer, and Kamel Smaïli. "CALYOU: A Comparable Spoken Algerian Corpus Harvested from YouTube." In Interspeech 2017. ISCA, 2017. http://dx.doi.org/10.21437/interspeech.2017-1305.

8

Katsumata, Satoru, and Mamoru Komachi. "(Almost) Unsupervised Grammatical Error Correction using Synthetic Comparable Corpus." In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. Stroudsburg, PA, USA: Association for Computational Linguistics, 2019. http://dx.doi.org/10.18653/v1/w19-4413.

9

Milajevs, Dmitrijs. "Toward a Comparable Corpus of Latvian, Russian and English Tweets." In Proceedings of the 10th Workshop on Building and Using Comparable Corpora. Stroudsburg, PA, USA: Association for Computational Linguistics, 2017. http://dx.doi.org/10.18653/v1/w17-2505.

10

Yamamoto, Yuki, Tomoyoshi Akiba, and Hajime Tsukada. "Training Neural Machine Translation Models by Using an Entire Comparable Corpus." In 2023 10th International Conference on Advanced Informatics: Concept, Theory and Application (ICAICTA). IEEE, 2023. http://dx.doi.org/10.1109/icaicta59291.2023.10390174.


Reports on the topic "Corpus comparable"

1

Wolfenson, David, William W. Thatcher, and James E. Kinder. Regulation of LH Secretion in the Periovulatory Period as a Strategy to Enhance Ovarian Function and Fertility in Dairy and Beef Cows. United States Department of Agriculture, December 2003. http://dx.doi.org/10.32747/2003.7586458.bard.

Abstract:
The general research objective was to increase herd pregnancy rates by enhancing corpus luteum (CL) function and optimizing follicle development, in order to increase conception rate and embryo survival. The specific objectives were: to determine the effect of the duration of the preovulatory LH surge on CL function; to determine the function of LH during the postovulatory period on CL development; to optimize CL differentiation and follicle development by means of a biodegradable GnRH implant; to test whether optimization of CL development and follicle dynamics in timed-insemination protocols would improve fertility in high-yielding dairy cows. Low fertility in cattle results in losses of hundreds of millions of dollars in the USA and Israel. Two major causes of low fertility are formation of a functionally impaired CL, and subsequent enhanced ovarian follicle development. A functionally impaired CL may result from suboptimal LH secretion. The two major causes of low fertility in dairy cattle in the US and Israel are negative energy status and summer heat stress; in both situations, low fertility is associated with reductions in LH secretion and impaired development of the ovulatory follicle and of the CL. In Florida, the use of 450-mg deslorelin (GnRH analogue) implants to induce ovulation under the Ovsynch protocol resulted in higher pregnancy rates than use of 750-mg implants, and pregnancy losses tended to decrease compared to controls, probably due to a decrease in follicular development and estradiol secretion at the time of conceptus signaling to maintain the CL. An alternative strategy to enhance progesterone concentrations involved induction of an accessory CL by injection of hCG on day 5 after the cows were inseminated. Treatment with hCG resulted in 86% of the cows having two CLs, compared with 23% of the control cows. Conception rates were higher among the hCG-treated cows than among the controls.
Another approach was to replace the second injection of GnRH analogue, in a timed-insemination protocol, with estradiol cypionate (ECP) injected 24 h after the injection of PGF₂α. Pregnancy rates were comparable with those obtained under the regular Ovsynch (timed-AI) program. Use of ECP induced estrus, and cows inseminated at detected estrus are indeed more fertile than those not in estrus at the time of insemination. Collectively, the BARD-supported programs at the University of Florida have improved timed insemination programs. In Ohio, the importance of the frequency of LH episodes during the early stages of the estrous cycle of cattle, when the corpus luteum is developing, was studied in an in vivo experiment in which cows were subjected to various episodic exposures to exogenous bovine LH. Results indicate that the frequent LH episodes immediately following the time of ovulation are important in development of the corpus luteum, from the points of view of both size and functionality. In another study, rates of cell proliferation and numbers of endothelial cells were examined in vitro in CLs collected from cows that received post-ovulation pulsatile LH treatment at various frequencies. The results indicate that the corpora lutea growth that results from luteal cell proliferation is enhanced by the episodes of LH release that occur immediately after the time of ovulation in cattle. The results also show that luteal endothelial cell numbers did not differ among cows treated with different LH doses. In Israel, a longer duration of the preovulatory LH surge stimulated the steroidogenic capacity of granulosa-derived luteal cells, and might, thereby, contribute to a higher progesterone output from the bovine corpus luteum. In an in vivo study, a subgroup of high-yielding dairy cows with an extended estrus-to-ovulation interval was identified.
Associated with this extended interval were low plasma progesterone and estradiol concentrations and a low preovulatory LH surge prior to ovulation, as well as low post-ovulation progesterone concentration. In experiments based on the above results, we found that injection of GnRH at the onset of estrus increased the LH peak, prevented late ovulation, decreased the variability between cows and elicited high and uniform progesterone levels after ovulation. GnRH at estrus onset increased conception rates, especially in the summer, and among primiparous cows and those with low body condition. Another study compared ovarian functions in multiparous lactating cows with those in nulliparous non-lactating heifers. The results revealed differences in ovarian follicular dynamics, and in plasma concentrations of steroids and gonadotropins that may account for the differences in fertility between heifers and cows.
2

Seiple, Jacqueline, Luis Santiago, Christopher Spaur, Safra Altman, Matthew Balazik, Thomas Laczo, Daniel Mensah, Warunika Amarasingha, Andrew Payson, and Danielle Szimanski. Two years of post-project monitoring of a navigation solution in a dynamic coastal environment, Smith Island, Maryland. Engineer Research and Development Center (U.S.), June 2022. http://dx.doi.org/10.21079/11681/44620.

Abstract:
In 2018, jetties and a sill were constructed by the US Army Corps of Engineers adjacent to the Sheep Pen Gut Federal Channel at Rhodes Point, Smith Island, Maryland. These navigation improvements were constructed under Section 107 of the Continuing Authorities Program. Material dredged for construction of the structures and realignment of the channel was used to restore degraded marsh. Following construction and dredging, 2 years of monitoring were performed to evaluate the performance of navigation improvements with respect to the prevention of shoaling within the channel, shoreline changes, and impacts to submerged aquatic vegetation (SAV). Technical Report ERDC/CHL TR-20-14 describes the first year of post-project monitoring and the methodologies employed. This report describes conclusions derived from 2 years of monitoring. While the navigation improvements are largely preventing the channel from infilling, shoaling within it is occurring at rates higher than expected. The placement site appears stable and accreting landward; however, there continues to be erosion along the shoreline and through the gaps in the breakwaters. SAV monitoring indicates that SAV is not present in the project footprint, even though turbidity is comparable to the reference area. Physical disturbance of the bottom sediment during construction may explain SAV absence.
3

Chidsey, Thomas C., David E. Eby, Michael D. Vanden Berg, and Douglas A. Sprinkel. Microbial Carbonate Reservoirs and Analogs from Utah. Utah Geological Survey, July 2021. http://dx.doi.org/10.34191/ss-168.

Abstract:
Multiple oil discoveries reveal the global scale and economic importance of a distinctive reservoir type composed of possible microbial lacustrine carbonates like the Lower Cretaceous pre-salt reservoirs in deepwater offshore Brazil and Angola. Marine microbialite reservoirs are also important in the Neoproterozoic to lowest Cambrian strata of the South Oman Salt Basin as well as large Paleozoic deposits including those in the Caspian Basin of Kazakhstan (e.g., Tengiz field), and the Cedar Creek Anticline fields and Ordovician Red River “B” horizontal play of the Williston Basin in Montana and North Dakota, respectively. Evaluation of the various microbial fabrics and facies, associated petrophysical properties, diagenesis, and bounding surfaces is critical to understanding these reservoirs. Utah contains unique analogs of microbial hydrocarbon reservoirs in the modern Great Salt Lake and the lacustrine Tertiary (Eocene) Green River Formation (cores and outcrop) within the Uinta Basin of northeastern Utah. Comparable characteristics of both lake environments include shallow-water ramp margins that are susceptible to rapid widespread shoreline changes, as well as compatible water chemistry and temperature ranges that were ideal for microbial growth and formation/deposition of associated carbonate grains. Thus, microbialites in Great Salt Lake and from the Green River Formation exhibit similarities in terms of the variety of microbial textures and fabrics. In addition, Utah has numerous examples of marine microbial carbonates and associated facies that are present in subsurface analog oil field cores.