Journal articles on the topic 'Corpus-based data'

Consult the top 50 journal articles for your research on the topic 'Corpus-based data.'

1

Gu, Chonglong. "Corpus triangulation: combining data and methods in corpus-based translation studies." Translator 24, no. 1 (December 6, 2017): 107–10. http://dx.doi.org/10.1080/13556509.2018.1411639.

2

Gamper, Johann, and Oliviero Stock. "Corpus-based terminology." Terminology 5, no. 2 (December 31, 1998): 147–59. http://dx.doi.org/10.1075/term.5.2.05gam.

Abstract:
The manual acquisition of terminological material from the domain-specific text material is a very time-consuming task. Recent advances in text-processing research provide a basis for automating this task. Computer-assisted term acquisition improves both the quantity and the quality of terminological work. This paper gives a brief overview of this new approach in terminology acquisition. Three subtasks are distinguished: compilation of an electronic text corpus, extraction of terminological data, and management of terminological data. Each of the subtasks will be discussed in some detail by identifying the core problems as well as proposed solutions. As a concrete initiative in this emerging field, we present an ongoing research project at the European Academy Bolzano, which illustrates the importance of computer-assisted terminology acquisition and of the resulting steps that have been taken in recent times. The paper concludes with a summary of five selected papers which have been presented at a workshop on corpus-based terminology in Bolzano. The full papers are published in this volume and in volume 4(2) of this journal.
3

Wolk, Christoph, and Benedikt Szmrecsanyi. "Probabilistic corpus-based dialectometry." Journal of Linguistic Geography 6, no. 1 (April 2018): 56–75. http://dx.doi.org/10.1017/jlg.2018.6.

Abstract:
Researchers in dialectometry have begun to explore measurements based on fundamentally quantitative metrics, often sourced from dialect corpora, as an alternative to the traditional signals derived from dialect atlases. This change of data type amplifies an existing issue in the classical paradigm, namely that locations may vary in coverage and that this affects the distance measurements: pairs involving a location with lower coverage suffer from greater noise and therefore imprecision. We propose a method for increasing robustness using generalized additive modeling, a statistical technique that allows leveraging the spatial arrangement of the data. The technique is applied to data from the British English dialect corpus FRED; the results are evaluated regarding their interpretability and according to several quantitative metrics. We conclude that data availability is an influential covariate in corpus-based dialectometry and beyond, and recommend that researchers be aware of this issue and of methods to alleviate it.
4

Khamis, Noorli. "Corpus-based Data for Determining Specialised Language Features." International Journal of Advanced Trends in Computer Science and Engineering 9, no. 1 (February 15, 2020): 36–41. http://dx.doi.org/10.30534/ijatcse/2020/07912020.

5

Mikulová, Marie, Eduard Bejček, Veronika Kolářová, and Jarmila Panevová. "Subcategorization of Adverbial Meanings Based on Corpus Data." Journal of Linguistics/Jazykovedný časopis 68, no. 2 (December 1, 2017): 268–77. http://dx.doi.org/10.1515/jazcas-2017-0036.

Abstract:
We introduce a corpus-based description of selected adverbial meanings in Czech sentences. Its basic repertory is one of a long-lasting tradition in both scientific and school grammars. However, before the corpus era, researchers had to rely on their own excerption; nowadays, current syntax has a vast material basis available in the form of electronic corpora. Using the case of spatial adverbials, we describe the methodology we used to acquire a detailed, comprehensive, well-arranged description of meanings of adverbials, including a list of formal realizations with examples. Theoretical knowledge stemming from this work will lead to an improvement of the annotation of the meanings in the Prague Dependency Treebanks, which serve as the corpus sources for our research. The Prague Dependency Treebanks include data manually annotated on the layer of deep syntax and thus provide a large number of valuable examples on the basis of which the meanings of adverbials can be defined more accurately and subcategorized more precisely. Both theoretical and practical results will subsequently be used in NLP, such as machine translation.
6

Bloothooft, Gerrit. "Corpus-based Name Standardization." History and Computing 6, no. 3 (October 1994): 153–67. http://dx.doi.org/10.3366/hac.1994.6.3.153.

Abstract:
A method is described to standardize nominal data on the basis of a combination of rules and a probabilistic similarity measure. Onomastic corpora are used to estimate the probability of spelling variations automatically. These corpora are also the basis for finding the most likely standard for a name not encountered before.
7

Szmrecsanyi, Benedikt, and Christoph Wolk. "Holistic corpus-based dialectology." Revista Brasileira de Linguística Aplicada 11, no. 2 (2011): 561–92. http://dx.doi.org/10.1590/s1984-63982011000200011.

Abstract:
This paper is concerned with sketching future directions for corpus-based dialectology. We advocate a holistic approach to the study of geographically conditioned linguistic variability, and we present a suitable methodology, 'corpus-based dialectometry', in exactly this spirit. Specifically, we argue that in order to live up to the potential of the corpus-based method, practitioners need to (i) abandon their exclusive focus on individual linguistic features in favor of the study of feature aggregates, (ii) draw on computationally advanced multivariate analysis techniques (such as multidimensional scaling, cluster analysis, and principal component analysis), and (iii) aid interpretation of empirical results by marshalling state-of-the-art data visualization techniques. To exemplify this line of analysis, we present a case study which explores joint frequency variability of 57 morphosyntax features in 34 dialects all over Great Britain.
8

Escudero-Mancebo, David, and Valentín Cardeñoso-Payo. "Applying data mining techniques to corpus based prosodic modeling." Speech Communication 49, no. 3 (March 2007): 213–29. http://dx.doi.org/10.1016/j.specom.2007.01.008.

9

Lyddon, Paul. "Discovering Language Properties through Corpus-Based Dictionary Data Analysis." Vocabulary Learning and Instruction 6, no. 2 (2017): 61–70. http://dx.doi.org/10.7820/vli.v06.2.lyddon.

Abstract:
To reveal underlying patterns in real language use, linguists have increasingly come to rely on corpus analyses, involving the evaluation of statistical frequencies in generally sizable bodies of natural linguistic data. However, accessing and analyzing large samples of raw language is neither always practical nor even truly necessary, especially in cases pertaining to structural characteristics. In fact, the requisite data can oftentimes be gleaned from a state-of-the-art (i.e., corpus-based) dictionary. Moreover, given the widespread availability of easily searchable electronic dictionaries nowadays, almost any language teacher or learner can use one to answer a number of these types of queries. This paper illustrates this claim with a step-by-step analysis of corpus-based dictionary data for the purpose of formulating the sound-symbol relations in English words with vowels preceding –gh.
10

de Monnink, Inge. "Combining Corpus and Experimental Data." International Journal of Corpus Linguistics 4, no. 1 (August 13, 1999): 77–111. http://dx.doi.org/10.1075/ijcl.4.1.05mon.

Abstract:
In this article I argue that, from a methodological point of view, descriptive studies improve considerably if they use a multi-method approach to the data, more specifically, if they use a combination of corpus data and experimental data. In the modern conception of corpus linguistics, intuitive data play an important role. The linguist formulates research hypotheses based on his or her intuitive knowledge. These hypotheses are then tested on the corpus data. I argue that a sound descriptive study should not end with simply stating the results from the corpus study. Instead, the corpus data have to be supplemented. An appropriate way to supplement corpus data is through the use of elicitation techniques. I illustrate the multi-method approach on a case study of floating postmodification in the English noun phrase.
11

Krummes, Cedric, and Astrid Ensslin. "Formulaic language and collocations in German essays: from corpus-driven data to corpus-based materials." Language Learning Journal 43, no. 1 (July 4, 2012): 110–27. http://dx.doi.org/10.1080/09571736.2012.694900.

12

Levin, Beth, and Grace Song. "Making Sense of Corpus Data." International Journal of Corpus Linguistics 2, no. 1 (January 1, 1997): 23–64. http://dx.doi.org/10.1075/ijcl.2.1.04lev.

Abstract:
This paper demonstrates the essential role of corpus data in the development of a theory that explains and predicts word behavior. We make this point through a case study of verbs of sound, drawing our evidence primarily from the British National Corpus. We begin by considering pretheoretic notions of the verbs of sound as presented in corpus-based dictionaries and then contrast them with the predictions made by a theory of syntax, as represented by Chomsky's Government-Binding framework. We identify and classify the transitive uses of sixteen representative verbs of sound found in the corpus data. Finally, we consider what a linguistic account with both syntactic and lexical semantic components has to offer as an explanation of observed differences in the behavior of the sample verbs.
13

Fatima, Mehwish, Saba Anwar, Amna Naveed, Waqas Arshad, Rao Muhammad Adeel Nawab, Muntaha Iqbal, and Alia Masood. "Multilingual SMS-based author profiling: Data and methods." Natural Language Engineering 24, no. 5 (June 26, 2018): 695–724. http://dx.doi.org/10.1017/s1351324918000244.

Abstract:
In recent years, many benchmark author profiling corpora have been developed for various genres including Twitter, social media, blogs, hotel reviews and e-mail, etc. However, no such standard evaluation resource has been developed for Short Messaging Service (SMS), a popular medium of communication, which is very useful for author profiling. The primary aim of this study is to develop a large multilingual (English and Roman Urdu) benchmark SMS-based author profiling corpus. The proposed corpus contains 810 author profiles, wherein each profile consists of an aggregation of SMS messages as a single document of an author, along with seven demographic traits associated with each author profile: gender, age, native language, native city, qualification, occupation and personality type (introvert/extrovert). The secondary aims of this study include the following: (1) annotating the proposed corpus for code-switching annotations at the lexical level (approximately 0.69 million tokens are manually annotated for code-switching) and (2) applying the stylometry-based method (groups of sixty-four features) and the content-based method (twelve features) for gender identification in order to demonstrate how our proposed corpus can be used for the development and evaluation of various author profiling methods. The results show that the content-based character 5-gram feature outperformed all the other features by obtaining an accuracy score of 0.975 and an F1 score of 0.947 for gender identification while using the entire corpus. Furthermore, our proposed corpora (SMS–AP–18 and code-switched SMS–AP–18) are freely and publicly available for research purposes.
14

Stoddart, Kenneth. "The Corpus: A Data-Based Device for Teaching Field Methods." Teaching Sociology 15, no. 2 (April 1987): 197. http://dx.doi.org/10.2307/1318037.

15

Yoon, Juntae. "Compound noun segmentation based on lexical data extracted from corpus." Natural Language Engineering 7, no. 2 (June 2001): 167–85. http://dx.doi.org/10.1017/s1351324901002637.

Abstract:
Compound noun segmentation is one of the crucial problems in Korean language processing because a series of nouns in Korean may appear without spaces in real text, which makes it difficult to identify its morphological constituents. This paper presents an effective method of Korean compound noun segmentation based on lexical data extracted from a corpus. The segmentation consists of two tasks: First, it uses a Hand-Built Segmentation Dictionary (HBSD) to segment compound nouns which frequently occur or need an exceptional process. Second, a segmentation algorithm using data from a corpus is proposed, where simple nouns and their frequencies are stored in a Simple Noun Dictionary (SND) for segmentation. The analysis is executed based on modified tabular parsing using min-max operation. Our experiments have shown an accuracy rate of about 97.29%, which indicates that the method is very effective.
16

Li, Qin, Shaobo Li, Sen Zhang, Jie Hu, and Jianjun Hu. "A Review of Text Corpus-Based Tourism Big Data Mining." Applied Sciences 9, no. 16 (August 12, 2019): 3300. http://dx.doi.org/10.3390/app9163300.

Abstract:
With the massive growth of the Internet, text data has become one of the main formats of tourism big data. Because such text is an effective means of expressing tourists’ opinions, mining it has great potential to inspire innovations for tourism practitioners. In the past decade, a variety of text mining techniques have been proposed and applied to tourism analysis to develop tourism value analysis models, build tourism recommendation systems, create tourist profiles, and make policies for supervising tourism markets. The successes of these techniques have been further boosted by the progress of natural language processing (NLP), machine learning, and deep learning. With the understanding of the complexity due to this diverse set of techniques and tourism text data sources, this work attempts to provide a detailed and up-to-date review of text mining techniques that have been, or have the potential to be, applied to modern tourism big data analysis. We summarize and discuss different text representation strategies, text-based NLP techniques for topic extraction, text classification, sentiment analysis, and text clustering in the context of tourism text mining, and their applications in tourist profiling, destination image analysis, market demand, etc. Our work also provides guidelines for constructing new tourism big data applications and outlines promising research areas in this field for the coming years.
17

Buts, Jan, and Henry Jones. "From text to data mediality in corpus-based translation studies." MonTI. Monografías de Traducción e Interpretación, no. 13 (2021): 301–29. http://dx.doi.org/10.6035/monti.2021.13.10.

18

Safriyani, Rizka. "Corpus-Based Research in Vocabulary Learning." NOBEL: Journal of Literature and Language Teaching 11, no. 2 (September 30, 2020): 203–16. http://dx.doi.org/10.15642/nobel.2020.11.2.203-216.

Abstract:
In the university, corpus-based research is commonly done for writing a thesis. However, corpus-based research can also be introduced to first-year EFL students to build their critical thinking and vocabulary mastery. Little research discusses the practice of corpus-based research for first-year EFL students. Therefore, it is essential to investigate the benefits and the challenges of corpus-based research in the Indonesian EFL setting. This study aims to examine the benefits and the challenges of corpus-based research in the Indonesian EFL setting. Students did corpus-based research in the English for Islamic Studies course. Students tried to structure an English glossary from online Islamic articles, Islamic journals, and Islamic blogs. Forty-four students were chosen as the subjects of the research. A survey was administered to the students to gather data about the benefits and the challenges of corpus-based research. The results showed that corpus-based research increases vocabulary, increases students' understanding of research, improves students' accuracy in writing, develops critical thinking, and develops collaboration. Students faced several challenges in implementing corpus-based research. The findings show that students have difficulties in understanding new vocabulary. In addition, they have problems classifying data into specific topics, allocating time, and writing their reports.
19

Khalil Ibrahim, Riyadh. "Translation oriented corpus-based contrastive linguistics." Babel. Revue internationale de la traduction / International Journal of Translation 61, no. 3 (December 7, 2015): 381–93. http://dx.doi.org/10.1075/babel.61.3.04kha.

Abstract:
The paper aims at studying the relationship between contrastive linguistics (CL) and translation as branches of applied linguistics, on the one hand, and the use of computer corpora (CC) on the other. It also stresses the fact that the boundaries of CL have been redrawn to incorporate the output of CC in performing various tasks in translation, going beyond the traditional methods of CL, which were applied exclusively to solving problems in foreign language teaching (FLT). The paper supports the call for the manipulation of data obtained from CC in contrastive linguistic projects for the betterment of translation quality. Previously, CL was concerned with linguistic systems rather than language use, but with the introduction of corpora, language use became more easily accessible and the field of CL has expanded. The access to huge amounts of original texts and their translation in electronic format is of great benefit to professional translators, since a wide range of translation solutions for any particular source language are available by a gentle hit on the required tagging key. As for translation-oriented corpus-based CL, it becomes obvious that the actual contrastive study will be carried out in order to obtain data for explaining the various phenomena in translation. Hence, translation as a communicative event can assume a fully-fledged descriptive discipline if it manages to develop its own descriptive tools of study. Computer corpora can play a decisive role in turning translation into a well-established academic discipline.
20

Cotos, Elena. "Enhancing writing pedagogy with learner corpus data." ReCALL 26, no. 2 (February 21, 2014): 202–24. http://dx.doi.org/10.1017/s0958344014000019.

Abstract:
Learner corpora have become prominent in language teaching and learning, enhancing data-driven learning (DDL) pedagogy by promoting ‘learning driven data’ in the classroom. This study explores the potential of a local learner corpus by investigating the effects of two types of DDL activities, one relying on a native-speaker corpus (NSC) and the second combining native-speaker and learner corpora. Both types of activities aimed at improving second language writers’ knowledge of linking adverbials and were based on a preliminary analysis of adverbial use in the local learner corpus produced by 31 study participants. Quantitative and qualitative data, obtained from writing samples, pre/post-tests, and questionnaires, were converged through concurrent triangulation. The results showed an increase in frequency, diversity and accuracy in all participants’ use of adverbials, but more significant improvement was made by the students who were exposed to the corpus containing their own writing. The findings of this study are thus interpreted as suggestive that combining learner and native-speaker data is a feasible and effective practice, which can be readily integrated in DDL-based instruction with positive impact.
21

Kesäniemi, Joonas, Turo Vartiainen, Tanja Säily, and Terttu Nevalainen. "Exploring Meta-analysis for Historical Corpus Linguistics Based on Linked Data." Journal of Research Design and Statistics in Linguistics and Communication Science 5, no. 1-2 (August 29, 2019): 4–47. http://dx.doi.org/10.1558/jrds.36709.

22

Kim, Heyoung, and 전수인. "Fostering Lexis Awareness and Autonomy by Corpus-based Data-Driven Learning." English Teaching 63, no. 2 (June 2008): 213–35. http://dx.doi.org/10.15858/engtea.63.2.200806.213.

23

Nuthmann, A., and R. Kliegl. "An examination of binocular reading fixations based on sentence corpus data." Journal of Vision 9, no. 5 (May 1, 2009): 31. http://dx.doi.org/10.1167/9.5.31.

24

Kim, Jungsoo. "Conative Alternation in English: An Entailment-Based Perspective with Corpus Data." Studies in Modern Grammar 97 (March 31, 2018): 55–88. http://dx.doi.org/10.14342/smog.2018.97.55.

25

Romli, Taj Rijal Muhamad, Abd Rauf Hassan, and Hasnah Mohamad. "Equivalent Malay-Arabic Data Corpus Collection." European Journal of Language and Literature 4, no. 1 (April 30, 2016): 65. http://dx.doi.org/10.26417/ejls.v4i1.p65-73.

Abstract:
This paper aims to introduce a strategy for searching for and collecting comparable sentences in Arabic-Malay corpus data. This method was introduced for the use of students, researchers and amateur translators to search and compare the structure of sentences in Arabic and Malay. The first stage is to collect corpus data with high-impact titles from the press, which must be able to enlarge the scope of study as stated by Maia (2003). The second stage is to search using the specified key words based on selected high-impact titles such as the Football World Cup of 2010 and 2014. The data search uses the WebCorp engine (http://www.webcorp.org.uk/live/) and also the open database Google (https://www.google.com). The third stage is to filter the data by using Aker et al.'s (2012) and Braschler's (1998) method based on similar story, related story and similar aspects. At the fourth stage every category is measured by Guidere's (2002) equivalence strength, which is strong comparability (SC), medium (MC) and weak (WC). At the last stage comparable sentences between the two languages are compiled in parallel according to Mona Baker’s (1992) levels of grouping, which are sentence level, combination of words, grammatical, pragmatic and textual level. The result from data analysis based on Mona Baker's and Vinay and Darbelnet’s (1995) comparable theory proved that a large number of sentences are on the same level of comparability from the point of view of information delivery. This can be used as the basis of additional evidence concerning the validity of 'universal theory' in the science of translation.
26

Wolf, Florian, and Edward Gibson. "Representing Discourse Coherence: A Corpus-Based Study." Computational Linguistics 31, no. 2 (June 2005): 249–87. http://dx.doi.org/10.1162/0891201054223977.

Abstract:
This article aims to present a set of discourse structure relations that are easy to code and to develop criteria for an appropriate data structure for representing these relations. Discourse structure here refers to informational relations that hold between sentences in a discourse. The set of discourse relations introduced here is based on Hobbs (1985). We present a method for annotating discourse coherence structures that we used to manually annotate a database of 135 texts from the Wall Street Journal and the AP Newswire. All texts were independently annotated by two annotators. Kappa values of greater than 0.8 indicated good interannotator agreement. We furthermore present evidence that trees are not a descriptively adequate data structure for representing discourse structure: In coherence structures of naturally occurring texts, we found many different kinds of crossed dependencies, as well as many nodes with multiple parents. The claims are supported by statistical results from our hand-annotated database of 135 texts.
27

Braun, Sabine. "Integrating corpus work into secondary education: From data-driven learning to needs-driven corpora." ReCALL 19, no. 3 (August 24, 2007): 307–28. http://dx.doi.org/10.1017/s0958344007000535.

Abstract:
This paper reports on an empirical case study conducted to investigate the overall conditions and challenges of integrating corpus materials and corpus-based learning activities into English-language classes at a secondary school in Germany. Starting from the observation that in spite of the large amount of research into corpus-based language learning, hands-on work with corpora has remained an exception in secondary schools, the paper starts by outlining a set of pedagogical requirements for corpus integration and the approach which has formed the basis for designing the case study. Then the findings of the study are reported and discussed. As a result of the methodological challenges identified in the study, the author argues for a move from ‘data-driven learning’ to needs-driven corpora, corpus activities and corpus methodologies.
28

Poirier, Éric. "Exploring theoretical functions of corpus data in teaching translation." Cadernos de Tradução 36, no. 1 (April 26, 2016): 177. http://dx.doi.org/10.5007/2175-7968.2016v36nesp1p177.

Abstract:
As language referential data banks, corpora are instrumental in the exploration of translation solutions in bilingual parallel texts or conventional usages of source or target language in monolingual general or specialized texts. These roles are firmly rooted in translation processes, from analysis and interpretation of source text to searching for an acceptable equivalent and integrating it into the production of the target text. Provided the creative and not the conservative way be taken, validation or adaptation of target text in accordance with conventional usages in the target language also benefits from corpora. Translation teaching is not exploiting this way of translating that is common practice in the professional translation markets around the world. Instead of showing what corpus tools can do to translation teaching, we start our analysis with a common issue within translation teaching and show how corpus data can help to resolve it in learning activities in translation courses. We suggest a corpus-driven model for the interpretation of ‘business’ as a term and as an item in complex terms based on source text pattern analysis. This methodology will make it possible for teachers to explain and justify interpretation rules that have been defined theoretically from corpus data. It will also help teachers to conceive and non-subjectively assess practical activities designed for learners of translation. Corpus data selected for the examples of rule-based interpretations provided in this paper have been compiled in a corpus-driven study (Poirier, 2015) on the translation of the noun ‘business’ in the field of specialized translation in business, economics, and finance from English to French. The corpus methodology and rule-based interpretation of senses can be generalized and applied in the definition of interpretation rules for other language pairs and other specialized simple and complex terms.
These works will encourage the matching of translation study theories and corpus translation studies with professional practices. It will also encourage the matching of translation studies and corpus translation studies with source and target language usages and with textual correlations between source language real usages and target language translation real practices.
29

Armstrong, Susan. "Corpus-based methods for NLP and translation studies." Interpreting. International Journal of Research and Practice in Interpreting 2, no. 1-2 (January 1, 1997): 141–62. http://dx.doi.org/10.1075/intp.2.1-2.06arm.

Abstract:
This paper gives an overview of current topics and themes in corpus-based studies of language that could be of relevance for interpretation research. Basic methods and their practical use in NLP and speech applications are presented. Issues in data acquisition and annotation, the basis for all data-oriented work, are also discussed. The paper concludes with some suggestions on how this work could be applied to corpus-based interpretation studies.
30

Jiménez-Crespo, Miguel Ángel, and Maribel Tercedor. "Applying Corpus Data to Define Needs in Web Localization Training." Meta 56, no. 4 (July 11, 2012): 998–1021. http://dx.doi.org/10.7202/1011264ar.

Abstract:
Localization is increasingly making its way into translation training programs at university level. However, there is still a scarce amount of empirical research addressing issues such as defining localization in relation to translation, what localization competence entails or how to best incorporate intercultural differences between digital genres, text types and conventions, among other aspects. In this paper, we propose a foundation for the study of localization competence based upon previous research on translation competence. This project was developed following an empirical corpus-based contrastive study of student translations (learner corpus), combined with data from a comparable corpus made up of an original Spanish corpus and a Spanish localized corpus. The objective of the study is to identify differences in production between digital texts localized by students and professionals on the one hand, and original texts on the other. This contrastive study allows us to gain insight into how localization competence interrelates with the superordinate concept of translation competence, thus shedding light on which aspects need to be addressed during localization training in university translation programs.
APA, Harvard, Vancouver, ISO, and other styles
31

Simsek, Tugba. "TURKISH EFL LEARNERS’ REFLECTIONS ON CORPUS-BASED LANGUAGE TEACHING." Global Journal of Foreign Language Teaching 6, no. 1 (August 1, 2016): 21. http://dx.doi.org/10.18844/gjflt.v6i1.806.

Abstract:
The aim of this study is to investigate Turkish EFL learners’ reflections on corpus-based language teaching: what benefits or drawbacks they experienced during a corpus-based implementation, and what suggestions they can make about this experience. The data were collected through minute papers and semi-structured interviews, and content analysis was conducted for data analysis. The results indicated that the participants found the corpus-based instruction very effective, especially because they could interact directly with real-life data. They emphasized that interacting with genuine native speaker language made them more motivated and interested in the classroom. In terms of drawbacks, they stated that the concordances were sometimes difficult to understand. Nevertheless, the learners had a positive perception of corpus-based language teaching. Keywords: Corpus-Based Language Teaching; EFL Learners; Reflection
32

Lou, Jianying, and Yanqing Zhang. "Semantic change analysis of Korean verbs based on massive culture corpus data." Personal and Ubiquitous Computing 24, no. 1 (October 26, 2019): 115–25. http://dx.doi.org/10.1007/s00779-019-01328-8.

33

Shutova, Ekaterina, Barry J. Devereux, and Anna Korhonen. "Conceptual metaphor theory meets the data: a corpus-based human annotation study." Language Resources and Evaluation 47, no. 4 (June 15, 2013): 1261–84. http://dx.doi.org/10.1007/s10579-013-9238-z.

34

Xu, Jianbo. "Application Research of Cognitive Linguistics Based on Big Data Internet Corpus Construction." Journal of Physics: Conference Series 1861, no. 1 (March 1, 2021): 012028. http://dx.doi.org/10.1088/1742-6596/1861/1/012028.

35

Li, Ya. "Teaching Development and Application of Japanese Corpus Based on Computer Big Data." Journal of Physics: Conference Series 1744, no. 4 (February 1, 2021): 042056. http://dx.doi.org/10.1088/1742-6596/1744/4/042056.

36

Smart, Jonathan. "The role of guided induction in paper-based data-driven learning." ReCALL 26, no. 2 (February 19, 2014): 184–201. http://dx.doi.org/10.1017/s0958344014000081.

Abstract:
This study examines the role of guided induction as an instructional approach in paper-based data-driven learning (DDL) in the context of an ESL grammar course during an intensive English program at an American public university. Specifically, it examines whether corpus-informed grammar instruction is more effective through inductive, data-driven learning or through traditional deductive instruction. In the study, 49 participants completed two weeks of ESL grammar instruction on the passive voice in English. The learners participated in one of three instructional treatments: a data-driven learning treatment, a deductive instructional treatment using corpus-informed teaching materials, and a deductive instructional treatment using traditional (i.e., non-corpus-informed) materials. Results from pre-test, post-test, and delayed post-test indicated that the DDL group significantly improved their grammar ability with the passive voice, while the other two treatment groups did not show significant gains. The findings from this study suggest that in this learning context there are measurable benefits to teaching ESL grammar inductively using paper-based DDL.
37

Horch, Stephanie. "Complementing corpus analysis with web-based experimentation in research on World Englishes." English World-Wide 40, no. 1 (February 1, 2019): 24–52. http://dx.doi.org/10.1075/eww.00021.hor.

Abstract:
Usage-based research in linguistics has to a large extent relied on corpus data. However, a feature’s “failure to appear in even a very large corpus (such as the Web) is not evidence for ungrammaticality, nor is appearance evidence for grammaticality” (Schütze and Sprouse 2013: 29). It is therefore advisable to complement corpus-based analyses with experimental data, so as to (ideally) obtain converging evidence. This paper reviews reasons for combining corpus linguistic with psycholinguistic experimental methods, and demonstrates how research on varieties of English can profit from experimentation. For a study of conversion in Asian Englishes, the maze task (Forster, Guerrera, and Elliot 2009; Forster 2010) was implemented with web-based, open-source software. The results of the experiment dovetail with a previous analysis of the Corpus of Global Web-based English (Davies 2013). These results should encourage researchers not to base findings exclusively on corpus evidence, but to corroborate them by means of experimental data.
38

Hoffmann, Lea. "Review of a Basic Phraseological Vocabulary – an Explorative Data Analysis." Kalbotyra 73 (December 28, 2020): 61–75. http://dx.doi.org/10.15388/kalbotyra.2020.3.

Abstract:
This article addresses the possibilities and limitations that frequency-based studies of the relevance of multi-word expressions open up for applied purposes. To this end, the corpus Ref10 of the project Wortschatzwissen.de was examined exploratively. After a category system for multi-word expressions had been developed, a sample of the corpus was examined and its items were assigned to the different categories. Subsequently, the identified multi-word expressions were compared with a phrase list of Hallsteinsdóttir, Šajánková & Quasthoff (2006). Findings suggest that the proportion of collocations is particularly high in all subcorpora and that, in addition, idioms and light verb constructions are predominant. Moreover, a large proportion of the idioms identified in the Ref10 corpus sample does not occur at all or occurs only partially, i.e. in an unlisted variant, in the phraseological optimum of Hallsteinsdóttir, Šajánková & Quasthoff (2006). This raises above all the question of how phrase variance is to be evaluated in corpus analyses and to what extent corpus linguists should rely only on basic vocabulary from the perspective of applied linguistics.
39

Nesset, Tore. "Big data in Russian linguistics?" Zeitschrift für Slawistik 64, no. 2 (May 28, 2019): 157–74. http://dx.doi.org/10.1515/slaw-2019-0012.

Abstract:
With the advent of large web-based corpora, Russian linguistics steps into the era of “big data”. But how useful are large datasets in our field? What are the advantages? Which problems arise? The present study seeks to shed light on these questions based on an investigation of the Russian paucal construction in the RuTenTen corpus, a web-based corpus with more than ten billion words. The focus is on the choice between adjectives in the nominative (dve/tri/četyre starye knigi) and genitive (dve/tri/četyre staryx knigi) in paucal constructions with the numerals dve, tri or četyre and a feminine noun. Three generalizations emerge. First, the large RuTenTen dataset enables us to identify predictors that could not be explored in smaller corpora. In particular, it is shown that predicates, modifiers, prepositions and word-order affect the case of the adjective. Second, we identify situations where the RuTenTen data cannot be straightforwardly reconciled with findings from earlier studies or there appear to be discrepancies between different statistical models. In such cases, further research is called for. The effect of the numeral (dve, tri vs. četyre) and verbal government are relevant examples. Third, it is shown that adjectives in the nominative have more easily learnable predictors that cover larger classes of examples and show clearer preferences for the relevant case. It is therefore suggested that nominative adjectives have the potential to outcompete adjectives in the genitive over time. Although these three generalizations are valuable additions to our knowledge of Russian paucal constructions, three problems arise. Large internet-based corpora like the RuTenTen corpus (a) are not balanced, (b) involve a certain amount of “noise”, and (c) do not provide metadata. As a consequence of this, it is argued, it may be wise to exercise some caution with regard to conclusions based on “big data”.
40

Qiao, Hong Liang. "Structural Boundary Model — A Corpus-based Parsing Approach." International Journal of Corpus Linguistics 4, no. 1 (August 13, 1999): 113–35. http://dx.doi.org/10.1075/ijcl.4.1.06hon.

Abstract:
The paper discusses the design of a new computational model based on corpora—the Structural Boundary Model (SBM), particularly for the purpose of NLP. The Structural Boundary Model is constructed on the basis of parsed corpora. It consists of two main bodies, namely structural boundary data and CFG rules. The grammar supports parsing in a unique way by assigning structural boundary labels retrieved from a parsed corpus as a training corpus for the parser. Parsing experiments have demonstrated that the Structural Boundary Model is an appropriate novel computational model for parsing.
41

Hussein, Khalid Shakir. "The potentialities of corpus-based techniques for analyzing literature." Journal of Literature, Language & Culture (COES&RJ-JLLC) 1, no. 2 (April 1, 2020): 28. http://dx.doi.org/10.25255/2378.3591.2020.1.2.28.43.

Abstract:
This paper presents an attempt to explore the analytical potential of five corpus-based techniques: concordances, frequency lists, keyword lists, collocate lists, and dispersion plots. The basic question addressed concerns the contribution these techniques make to gaining more objective and insightful knowledge of the way literary meanings are encoded and the way literary language is organized. Three sizable English novels (Joyce's Ulysses, Woolf's The Waves, and Faulkner's As I Lay Dying) are subjected to corpus linguistic analysis. It is only by virtue of corpus-based techniques that huge amounts of literary data become analyzable. Otherwise, the data would remain no more than several lines of poetry or short excerpts of narrative. The corpus-based techniques presented throughout this paper contribute, to varying degrees, to a rigorous interpretation of literary texts, far from the intuitive approaches usually used in traditional stylistics.
42

Hizbullah, Nur, Zakiyah Arifa, Yoke Suryadarma, Ferry Hidayat, Luthfi Muhyiddin, and Eka Kurnia Firmansyah. "SOURCE-BASED ARABIC LANGUAGE LEARNING: A CORPUS LINGUISTIC APPROACH." Humanities & Social Sciences Reviews 8, no. 3 (June 17, 2020): 940–54. http://dx.doi.org/10.18510/hssr.2020.8398.

Abstract:
Purpose: The study explores the process of using Arabic websites for Arabic language learning, utilising the Arabic corpus linguistic approach. This approach enables mining data from websites, systematically compiling the mined data, and processing the data for the express purpose of Arabic language teaching, including its clusters such as Arabic pragmatics, Arabic linguistics, and Arabic translation teaching. Methodology: The research is descriptive and uses qualitative methods to analyse the process and the step-by-step procedures needed to make good use of the data. Main Findings: This study is based on the theory of source-based teaching, while the process of utilising the websites is systematically elaborated through the corpus linguistic mechanism. The research concludes that almost all Arabic websites can be employed as authentic, reliable teaching sources. The sources can be put to good use for teaching the four language competencies, as objects of linguistic study, and for translation, particularly through websites whose contents are bilingual or multilingual. Implications/Applications: The use of corpora for teaching and learning still needs to be disseminated and promoted among both practitioners and researchers of the Arabic language in Indonesia. Novelty/Originality of this study: This study highlights that almost all Arabic-language websites are among the richest sources of learning. These learning resources can be used for language learning and various other dimensions of scientific Arabic. Corpus linguistics has many benefits for learners and teachers in Arabic language learning. This study offers a new approach to Arabic teaching and learning using website resources, and to the dynamics of Arabic learning using technology.
43

Xu, Yi. "A corpus-based functional study of shi…de constructions." Chinese Language and Discourse 5, no. 2 (November 28, 2014): 146–84. http://dx.doi.org/10.1075/cld.5.2.02xu.

Abstract:
Existing research on shi…de constructions has often referred to its purported “emphasis” or “focus” functions. This paper reexamines shi…de by analyzing 787 examples of shi…de extracted from a spoken corpus. Positive evidence was found for several earlier proposals that rely on intuitively-generated data. Meanwhile, additional features of the construction can be observed. Results indicate that the preferred form of shi…de takes stative predicates, and some examples occur in such high frequencies that they form formulaic expressions. The construction always achieves stative predication, is often associated with subjectivity, and expresses the speaker’s certainty in stancetaking. To explain all the data, a unified “emphasis” function is proposed to integrate the traditional analyses of “constructive focus” and “affirmation.” Also, the overlap between the copula and the emphasis/focus function of shi in shi…de suggests that the construction is a form grammaticalized from shi + nominalization. This paper thus shows that corpus data can enable us to tackle an old issue with new evidence by making all subtypes of the construction available for quantitative and qualitative analysis, which in turn helps us redefine and reconceptualize otherwise ambiguous notions.
44

Woliński, Marcin, and Witold Kieraś. "ANALIZA FLEKSYJNA TEKSTÓW HISTORYCZNYCH I ZMIENNOŚĆ FLEKSJI POLSKIEJ Z PERSPEKTYWY DANYCH KORPUSOWYCH." Poradnik Językowy, no. 8/2020(777) (October 28, 2020): 66–80. http://dx.doi.org/10.33896/porj.2020.8.5.

Abstract:
The subject matter of this paper is Chronofleks, a computer system (http://chronofleks.nlp.ipipan.waw.pl/) modelling Polish inflection based on corpus material. The system visualises changes of inflectional paradigms of individual lexemes over time and enables examination of the variability of the frequency of inflected form groups distinguished based on various criteria. Feeding Chronofleks with corpus data required development of IT tools to ensure an inflectional processing sequence of texts analogous to the ones used for modern language; they comprise a transcriber, a morphological analyser, and a tagger. The work was performed on data from three historical periods (1601–1772, 1830–1918, and modern ones) elaborated in independent projects. Therefore, finding a common manner of describing data from the individual periods was a significant element of the work. Keywords: electronic text corpus – natural language processing – inflection of Polish – history of language
45

Pérez-Paredes, Pascual, and Jose M. Alcaraz-Calero. "Developing annotation solutions for online Data Driven Learning." ReCALL 21, no. 1 (January 2009): 55–75. http://dx.doi.org/10.1017/s0958344009000093.

Abstract:
Although annotation is a widely-researched topic in Corpus Linguistics (CL), its potential role in Data Driven Learning (DDL) has not been addressed in depth by Foreign Language Teaching (FLT) practitioners. Furthermore, most of the research in the use of DDL methods pays little attention to annotation in the design and implementation of corpus-based/driven language teaching. In this paper, we set out to examine the process of development of SACODEYL Annotator, an application that seeks to assist SACODEYL system users in annotating XML multilingual corpora. First, we discuss the role of annotation in DDL and the dominating paradigm in general corpus applications. In the context of the language classroom, we argue that it is essential that corpora should be pedagogically motivated (Braun, 2005 and 2007a). Then, we move on to deal with the analysis and design stages of our annotation solution by illustrating its main features. Some of these include a user friendly hierarchical and extensible taxonomy tree to facilitate the learner-oriented annotation of the corpora; real-time graphics representation of the annotated corpus matching the XML TEI-compliant (Text Encoding Initiative) standard, as well as an intuitive management of the different data sections and associated metadata. SACODEYL (System Aided Compilation and Open Distribution of European Youth Language) is an EU funded MINERVA project which aims to develop an ICT-based system for the assisted compilation and open distribution of multimedia European teen talk in the context of language education. This research lays emphasis on the functionalities of the application within the SACODEYL context. However, our paper addresses similarly the needs of potential multimedia language corpus administrators in general on the lookout for powerful annotation assisting software. SACODEYL Annotator is free to use and can be downloaded from our website.
46

Devereux, Barry, Nicholas Pilkington, Thierry Poibeau, and Anna Korhonen. "Towards Unrestricted, Large-Scale Acquisition of Feature-Based Conceptual Representations from Corpus Data." Research on Language and Computation 7, no. 2-4 (December 2009): 137–70. http://dx.doi.org/10.1007/s11168-010-9068-8.

47

Gil-Vallejo, Lara, Marta Coll-Florit, Irene Castellón, and Jordi Turmo. "Verb similarity: Comparing corpus and psycholinguistic data." Corpus Linguistics and Linguistic Theory 14, no. 2 (September 25, 2018): 275–307. http://dx.doi.org/10.1515/cllt-2016-0045.

Abstract:
Similarity, which plays a key role in fields like cognitive science, psycholinguistics and natural language processing, is a broad and multifaceted concept. In this work we analyse how two approaches that belong to different perspectives, the corpus view and the psycholinguistic view, articulate similarity between verb senses in Spanish. Specifically, we compare the similarity between verb senses based on their argument structure, which is captured through semantic roles, with their similarity defined by word associations. We address the question of whether verb argument structure, which reflects the expression of the events, and word associations, which are related to the speakers’ organization of the mental lexicon, shape similarity between verbs in a congruent manner, a topic which has not been explored previously. While we find significant correlations between verb sense similarities obtained from these two approaches, our findings also highlight some discrepancies between them and the importance of the degree of abstraction of the corpus annotation and psycholinguistic representations.
48

Astia, Idda, and Sofi Yunianti. "Corpus-Based Analysis of the Most Frequent Adjective on Covid-19." Indonesian Journal of EFL and Linguistics 5, no. 2 (December 4, 2020): 505. http://dx.doi.org/10.21462/ijefl.v5i2.318.

Abstract:
This study investigates the types of the most frequent adjectives, and the functions those adjectives serve, in academic writing about COVID-19. The study was conducted using the corpus tool Sketch Engine and followed a mixed-method design combining quantitative and qualitative approaches. The data source was a corpus of academic writing about COVID-19, chosen because COVID-19 has been a trending topic around the globe and has become an international concern. Data collection involved several steps. First, the most frequent adjectives in the COVID-19 corpus were identified by generating a wordlist. Second, the 20 most frequent adjectives were selected, on the grounds that 20 items adequately represent the most frequent adjectives. Third, the concordance tool was chosen to examine the function of the adjectives in the corpus. Fourth, the 20 most frequent adjectives were entered one at a time into the concordance. Fifth, the data were analyzed in light of the related theory. The study infers that the most frequent adjectives are describing adjectives, whose function is to frame the condition, situation and characteristics of the noun in COVID-19 cases.
49

Culpeper, Jonathan, Andrew Hardie, Jane Demmen, Jennifer Hughes, and Matt Timperley. "Supporting the corpus-based study of Shakespeare’s language: Enhancing a corpus of the First Folio." ICAME Journal 45, no. 1 (May 1, 2021): 37–86. http://dx.doi.org/10.2478/icame-2021-0002.

Abstract:
This article explores challenges in the corpus linguistic analysis of Shakespeare’s language, and Early Modern English more generally, with particular focus on elaborating possible solutions and the benefits they bring. An account of work that took place within the Encyclopedia of Shakespeare’s Language Project (2016–2019) is given, which discusses the development of the project’s data resources, specifically, the Enhanced Shakespearean Corpus. Topics covered include the composition of the corpus and its subcomponents; the structure of the XML markup; the design of the extensive character metadata; and the word-level corpus annotation, including spelling regularisation, part-of-speech tagging, lemmatisation and semantic tagging. The challenges that arise from each of these undertakings are not exclusive to a corpus-based treatment of Shakespeare’s plays but it is in the context of Shakespeare’s language that they are so severe as to seem almost insurmountable. The solutions developed for the Enhanced Shakespearean Corpus – often combining automated manipulation with manual interventions, and always principled – offer a way through.
50

Fellbaum, Christiane. "How flexible are idioms? A corpus-based study." Linguistics 57, no. 4 (July 26, 2019): 735–67. http://dx.doi.org/10.1515/ling-2019-0015.

Abstract:
Idioms are a compelling subject of study for linguists, lexicographers and psycholinguists due to their seemingly idiosyncratic status as lexical units that pose challenges for integration into accepted grammatical frameworks. The literature reveals much disagreement on the semantic compositionality, syntactic flexibility and lexical variation of both specific idioms and idioms as a class. We analyze some of the sources for the disparate analyses, which are most often based on judgments of constructed rather than attested examples. Relying solely on corpus data from English and German that shows a wide range of syntactic and lexical variation independent of semantic compositionality, we argue that speakers’ use of idioms is in fact compatible with the rules governing freely composed language. This article is based on a talk given at the BSGL 2015 meeting on idioms in Brussels (Fellbaum 2015b).