Dissertations / Theses: 'Corpus-based data'

1

Nolli, Carla Fernanda. "Data-driven learning and corpus-based approaches in language education." Florianópolis, SC, 2006. http://repositorio.ufsc.br/xmlui/handle/123456789/88465.

Abstract:

Master's dissertation, Universidade Federal de Santa Catarina, Centro de Comunicação e Expressão, Programa de Pós-Graduação em Letras/Inglês e Literatura Correspondente.
Made available in DSpace on 2012-10-22T09:21:53Z (GMT). No. of bitstreams: 0
This study analyzes examples of conditional sentences found in teaching materials (textbooks and grammar books) and compares them with a large corpus to verify their frequency and authenticity. The comparison was carried out with corpus analysis software, which generated a concordance list for the word if. These tokens were analyzed and classified to distinguish the three types of conditional sentences studied in this thesis. A further purpose of this research is to shed light on an approach that remains largely unexplored in Brazil, namely Data-Driven Learning (DDL), which explores teaching and learning through corpus linguistics.
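The concordance step described in this abstract can be illustrated with a minimal keyword-in-context (KWIC) sketch. The abstract does not name the software used, so the function `concordance`, its `width` parameter, and the sample sentences below are hypothetical stand-ins for the real tool and corpus.

```python
import re

def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    lines = []
    pattern = rf"\b{re.escape(keyword)}\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Take up to `width` characters of context on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Hypothetical sample text, not taken from the thesis corpus.
sample = (
    "If it rains, we will stay home. "
    "She would travel more if she had time. "
    "If he had studied, he would have passed."
)

hits = concordance(sample, "if")
for line in hits:
    print(line)
```

Each printed line shows one token of if with its surrounding context. Classifying the lines into conditional types would still be a manual or rule-based step that this sketch does not attempt.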

2

Adolphs, Svenja. "Linking lexico-grammar and speech acts : a corpus-based approach." Thesis, University of Nottingham, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.391412.

3

Marchewka, Katarzyna M. "Gender agreement in Polish : a study based on elicitation and corpus data." Thesis, University of Surrey, 2016. http://epubs.surrey.ac.uk/809946/.

Abstract:

This thesis explores the role of gender and explains how gender agreement operates in Polish and presents the possible agreement that particular gender can provide when conjoined with a noun of different gender or with a hybrid noun. The linguistic representation of specific gender is connected not only with the morphological shape but also the inherent semantics of a given noun. In the Polish language a great deal of information about possible masculine-personal or masculine-non-personal agreement is provided by the value ‘person’ for a given noun, and recent research on gender agreement in Polish has shown that some of the proposed rules for gender resolution and agreement between subject and predicate do not describe all the agreement possibilities. Likewise, with regard to hybrid nouns in Polish little research has been done on their agreement. This thesis thus examines the interaction of nouns of different genders, their values and their verbal agreement. Drawing mostly on primary questionnaire work with native speakers of Polish, I argue that semantics has a predominant impact on gender agreement. I support my claim by presenting data from the Polish corpus. The thesis provides the most comprehensive description of Polish gender agreement in sentences with conjoined noun phrases and agreement with hybrid nouns to date, by investigating their morphological status, their semantic restrictions, and their use in discourse. Building on previous analyses of agreement possibilities in the Polish language, I argue for an additional rule in gender resolution. I provide a description of various types of hybrid nouns in Polish and check the impact of semantic agreement versus formal agreement on Polish hybrid nouns using Corbett’s Agreement Hierarchy.

4

Wang, Lixum. "The use of parallel texts in language learning : computer software and teaching materials for English and Chinese." Thesis, University of Birmingham, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.368990.

5

Zhang, Min, and 張珉. "Using corpus data in a MOODLE-based self-learning course : teaching education students to 'cite like an academic'." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2015. http://hdl.handle.net/10722/211141.

Abstract:

Citation, an essential feature of academic writing, is a challenging area for second language (L2) student writers due to its linguistic and functional complexities. In an effort to address this challenge, I report the development and evaluation of a MOODLE-based self-access workshop on citation learning, Cite Like an Academic (CLA). CLA aims to enhance the understanding of citation use among postgraduate students in education. It employs a design-based research approach characterized by three iterative phases involving needs analysis, pedagogical design, and evaluation of an online learning artefact for increased understanding to guide further improvements (Phillips, McNaught, & Kennedy, 2012). For the first-phase needs analysis research, I investigated the rhetorical functions of citations across various research article (RA) sections and their linguistic features. To this end, genre and corpus approaches were integrated to compare an expert corpus of research articles (the RAC) and a student corpus of master’s in education (MEd) dissertations (the MDC). The findings indicate that (1) all the RA Introduction-Methods-Results-Discussion (IMRD) sections contained citations fulfilling a wide range of rhetorical functions, and (2) RAC writers differed from MDC writers in their preference for citation types across sections, citation density across sections, reporting verb (RV) categories, RV lexico-grammatical patterns, and RV rhetorical functions. Alongside this investigation of citation use, I interviewed postgraduate students and communicated via email with supervisors to understand the needs of potential workshop participants. The second phase, the CLA pedagogy design, was guided by an adapted version of the critical pragmatic approach (Harwood & Hadley, 2004). Following the pragmatic approach, instruction materials were informed by the needs analysis research findings. 
The critical approach involved the participants in trying out genre analysis and corpus analysis of RAs they selected for citation learning. The third phase was the evaluation of the workshop through a user walk-through trial and three rounds of implementations. Various types of data were collected from 41 participants, including personal communications, MOODLE records of forum discussions and log reports, participants’ writing, interviews, and pre-CLA and post-CLA questionnaires. I report the findings on the effects of genre-based materials on thesis revision, as well as students’ gains and difficulties in carrying out genre analysis and building and using their I-Corpus for citation learning. The findings indicate that content familiarity and peer interaction contributed to learners’ in-depth genre analysis; however, Move interpretation needed attention in students’ learning of genre analysis. Genre familiarity and completed writing ready for revision facilitated learners’ direct use of genre-based materials in writing, and building an individual corpus of RA part genres raised learners’ awareness of the variations in RA macro-structures. In addition, the findings demonstrate that students needed training on formulating search terms for citation searches and using corpus analytic software for corpus data observation and interpretation. In particular, students should be reminded of the disciplinary context and textual context when reusing language data from a corpus in writing revision. Finally, I provide suggestions for how to improve and adapt the workshop to support students’ citation learning and accommodate their different learning needs.
Education
Doctoral
Doctor of Philosophy

6

Tsiros, Augoustinos. "A multidimensional sketching interface for visual interaction with corpus-based concatenative sound synthesis." Thesis, Edinburgh Napier University, 2016. http://researchrepository.napier.ac.uk/Output/463438.

Abstract:

The present research sought to investigate the correspondence between auditory and visual feature dimensions and to utilise this knowledge in order to inform the design of audio-visual mappings for visual control of sound synthesis. The first stage of the research involved the design and implementation of Morpheme, a novel interface for interaction with corpus-based concatenative synthesis. Morpheme uses sketching as a model for interaction between the user and the computer. The purpose of the system is to facilitate the expression of sound design ideas by describing the qualities of the sound to be synthesised in visual terms, using a set of perceptually meaningful audio-visual feature associations. The second stage of the research involved the preparation of two multidimensional mappings for the association between auditory and visual dimensions. The third stage of this research involved the evaluation of the Audio-Visual (A/V) mappings and of Morpheme's user interface. The evaluation comprised two controlled experiments, an online study and a user study. Our findings suggest that the strength of the perceived correspondence between the A/V associations prevails over the timbre characteristics of the sounds used to render the complementary polar features. Hence, the empirical evidence gathered by previous research is generalizable to different contexts, and the overall dimensionality of the sound used for rendering should not have a very significant effect on the comprehensibility and usability of an A/V mapping. However, the findings of the present research also show that there is a non-linear interaction between the harmonicity of the corpus and the perceived correspondence of the audio-visual associations. For example, strongly correlated cross-modal cues such as size-loudness or vertical position-pitch are affected less by the harmonicity of the audio corpus than more weakly correlated dimensions (e.g. texture granularity-sound dissonance). 
No significant differences were revealed as a result of musical/audio training. The third study consisted of an evaluation of Morpheme's user interface in which participants were asked to use the system to design a sound for given video footage. The usability of the system was found to be satisfactory. An interface for drawing visual queries was developed for high-level control of the retrieval and signal processing algorithms of concatenative sound synthesis. This thesis elaborates on previous research findings and proposes two methods for empirically driven validation of audio-visual mappings for sound synthesis. These methods could be applied to a wide range of contexts in order to inform the design of cognitively useful multimodal interfaces and the representation and rendering of multimodal data. Moreover, this research contributes to the broader understanding of multimodal perception by gathering empirical evidence about the correspondence between auditory and visual feature dimensions and by investigating which factors affect the perceived congruency between aural and visual structures.

7

Vieira, Nataliya Godinho Soares. "Training and discovering corpus-based data driven exercices in english teaching (L2/FL) to native speakers of portuguese (L1)." Master's thesis, Faculdade de Ciências Sociais e Humanas, Universidade Nova de Lisboa, 2012. http://hdl.handle.net/10362/7422.

Abstract:

Project submitted in partial fulfilment of the requirements for the degree of Master in English Teaching.
Given the rapid development of new technologies and their use in foreign language teaching, Corpus Linguistics offers new tools and materials that enrich second language learning. This project presents a framework of theoretical principles related to online corpora and proposes examples of training and discovering corpus-based data-driven exercises, which are an original contribution to the teaching and learning of English (L2) by native speakers of Portuguese (L1). The data-driven exercises, based on concordances extracted from corpora, enable discovery-oriented teaching and engage students in "discovery learning", thereby enriching the personal development of both teachers and students. The project has multiple pedagogical aims related to the use of the data-driven learning (DDL) approach and to the application of ICT-based resources in foreign language teaching and learning.

8

Garcia, William Danilo. "Fanfictions, linguística de corpus e aprendizagem direcionada por dados : tarefas de produção escrita com foco no uso autêntico de língua e atividades que visam à autonomia dos alunos de letras em analisar preposições /." São José do Rio Preto, 2020. http://hdl.handle.net/11449/192699.

Abstract:

Advisor: Paula Tavares Pinto
Abstract: Although the relationship between Corpus Linguistics and Language Teaching received attention even before the advent of computers, it intensified around the 1990s, when research on learner corpora and Data-Driven Learning came to the fore. Against this background, this study aimed to compile four learner corpora based on authentic language use in order to develop data-driven teaching activities, built on the students' own data, that foster in students an autonomous profile of linguistic investigation (more precisely, of the prepositions with, in, on, at, for and to). Concerning the theoretical framework, we highlight Prabhu (1987), Skehan (1996), Willis (1996), Nunan (2004) and Ellis (2006) on Task-Based Language Teaching, and Jenkins (2012) and Neves (2014) on fanfiction. For Corpus Linguistics, the study draws on Sinclair (1991), Berber Sardinha (2000) and Viana (2011); for learner corpora, on Granger (1998, 2012, 2013); and for Data-Driven Learning, on Johns (1991, 1994), Berber Sardinha (2011) and Boulton (2010). Methodologically, texts were collected from Language Teaching undergraduate students, who completed a writing task in which they wrote a fanfiction. These texts formed two initial learner corpora, which were analyzed with the AntConc tool (ANTHONY, 2018) in order to observe the occurrence of prepositions in English and whether they were accurately ...
Master's degree

9

Gentilini, Livia. "La terminologia della sicurezza informatica nella banca dati FranceTerme: un'analisi corpus-based." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/17696/.

Abstract:

This dissertation investigates the spread of anglicisms in cybersecurity terminology in French. The aim is to assess the work of the official French bodies for language protection by comparing the percentage of use in French of selected anglicisms with that of the corresponding official French equivalents proposed by the Dispositif d’enrichissement de la langue française. The first chapter gives an overview of specialised languages, with particular attention to terminological variation, describes the characteristics of the language of computing in English and French, and defines the concept of cybersecurity. The second chapter addresses linguistic interference, moving from the concept of neology to those of borrowing and calque. The third chapter deals with language policy in France: the evolution of the official bodies, the political approach to the spread of foreign elements in the language, and the specific laws enacted on the matter. The focus is on the main bodies belonging to the Dispositif d’enrichissement de la langue française, whose workings and objectives are presented. The fourth chapter focuses on the research methodology. A set of terminological records related to the cybersecurity domain was selected from the ministerial database FranceTerme. The web corpus Araneum Francogallicum Maius was used to find occurrences of these terms for quantitative comparison. The fifth chapter concentrates on the analysis of the materials: after listing the total occurrences found in the corpus, it compares the absolute frequencies and the relative percentage frequencies of the official French equivalents and of the corresponding foreign terms, in order to identify possible trends.
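The frequency comparison described in the fifth chapter can be sketched as follows. This is only an illustration under stated assumptions: the function `usage_shares`, the placeholder pair labels, and all counts are hypothetical and are not figures from the dissertation or from the Araneum Francogallicum Maius corpus.

```python
def usage_shares(anglicism_count, official_count):
    """Return (anglicism %, official equivalent %) of the combined occurrences.

    Returns None when neither form occurs, since no share can be computed.
    """
    total = anglicism_count + official_count
    if total == 0:
        return None
    return (100 * anglicism_count / total, 100 * official_count / total)

# Hypothetical absolute frequencies: (anglicism, official French equivalent).
counts = {
    "term pair A": (750, 250),
    "term pair B": (40, 160),
}

shares = {pair: usage_shares(*pair_counts) for pair, pair_counts in counts.items()}
for pair, (anglicism_pct, official_pct) in shares.items():
    print(f"{pair}: anglicism {anglicism_pct:.1f}%, official {official_pct:.1f}%")
```

Reporting both the absolute counts and the percentage shares matters because a large share computed from a handful of occurrences says little about actual usage.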

10

Ghisi, Daniele. "Music across music : towards a corpus-based, interactive computer-aided composition." Thesis, Paris 6, 2017. http://www.theses.fr/2017PA066561/document.

Abstract:

The reworking of existing music in order to build new music is a quintessential characteristic of the Western musical tradition. This thesis proposes and discusses my personal approach to the subject: borrowing music fragments from large-scale corpora (containing audio samples as well as symbolic scores) in order to build a low-level, descriptor-based palette of grains. Parameters are handled via digital hybrid scores, in order to equip corpus-based composition with the control of notational practices. This thesis also introduces the dada library, which gives Max the ability to organize, select and generate musical content via a set of graphical interfaces that embody an exploratory approach to music composition. Its modules address a range of scenarios, including, but not limited to, database visualization, score segmentation and analysis, concatenative synthesis, music generation via physical or geometrical modelling, wave terrain synthesis, graph exploration, cellular automata, swarm intelligence, and videogames. The library is open-source and fosters a performative approach to computer-aided composition. Finally, this thesis addresses the question of whether the classical representation of music, disentangled into the standard set of traditional parameters, is optimal. Two possible alternatives to orthogonal decompositions are presented: grain-based score representations, inheriting techniques from corpus-based composition, and unsupervised machine learning models, providing entangled, 'agnostic' representations of music. The thesis also details my first experience of collaborative writing within the /nu/thing collective.

11

Kalledat, Tobias. "Tracking domain knowledge based on segmented textual sources." Doctoral thesis, Humboldt-Universität zu Berlin, Wirtschaftswissenschaftliche Fakultät, 2009. http://dx.doi.org/10.18452/15925.

Abstract:

This research aims to analyse the influence of pre-processing on the results of knowledge generation and to give concrete recommendations for suitable pre-processing of text corpora in text data mining (TDM) projects. It focuses on the extraction and tracking of concepts within particular knowledge domains, using an approach based on segmenting corpora horizontally (along the timeline) and vertically (by persistence of terms). The result is a set of time-segmented subcorpora that reflect the persistence of the terms they contain. Within each time segment, two clusters of terms can be formed: one containing the terms that are not persistent with respect to the whole corpus, and one containing the terms that occur in every time segment. Using simple frequency measures, it can be shown that the statistical quality of a single corpus is sufficient on its own to measure pre-processing quality; comparison corpora are not necessary. The time series of the frequency measures show significant negative correlations between the cluster of terms that occur permanently and the cluster of terms that are not persistent across all time segments. This holds only for the optimally pre-processed corpus and was not found in the other test sets, whose pre-processing quality was low. When the most frequent terms are grouped into concepts using domain-specific taxonomies, a significant negative correlation appears between the number of distinct terms per time segment and the terms assigned to a taxonomy, again only for the corpus with high pre-processing quality. A semantic analysis of data prepared with a threshold-based TDM method yielded significantly different extracted knowledge depending on the quality of pre-processing. The methods and measures introduced in this research make it possible to measure both the quality of the source corpora and the quality of the applied taxonomies. Based on these findings, indicators for measuring and assessing corpora and taxonomies are developed, and recommendations are given for a level of pre-processing appropriate to the goal of the subsequent analysis.
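The segmentation idea in this abstract can be sketched roughly as follows: split the vocabulary into terms that occur in every time segment and terms that do not, build a frequency series for each cluster, and correlate the two series. The abstract does not specify the frequency measure, so the use of absolute token counts is an assumption, and the function names and toy segments below are hypothetical.

```python
from math import sqrt

def split_by_persistence(segments):
    """Split the vocabulary into terms present in all segments and the rest."""
    vocab_per_segment = [set(tokens) for tokens in segments]
    persistent = set.intersection(*vocab_per_segment)
    non_persistent = set.union(*vocab_per_segment) - persistent
    return persistent, non_persistent

def cluster_series(segments, cluster):
    """Absolute token count of the cluster's terms per segment (assumed measure)."""
    return [sum(1 for token in tokens if token in cluster) for tokens in segments]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)

# Hypothetical, already pre-processed tokens for three time segments.
segments = [
    "data mining data text model".split(),
    "data text text web portal portal".split(),
    "data data data text cloud".split(),
]

persistent, non_persistent = split_by_persistence(segments)
r = pearson(cluster_series(segments, persistent),
            cluster_series(segments, non_persistent))
print(sorted(persistent), sorted(non_persistent), round(r, 3))
```

With only three toy segments the coefficient carries no statistical weight; the sketch shows the mechanics only, and a significance test over many segments would be needed to support a claim like the one in the abstract.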

12

Utgof, Darja. "The Perception of Lexical Similarities Between L2 English and L3 Swedish." Thesis, Linköping University, Department of Culture and Communication, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-15874.

Abstract:

The present study investigates lexical similarity perceptions by students of Swedish as a foreign language (L3) with a good yet non-native proficiency in English (L2). The general theoretical framework is provided by studies in transfer of learning and its specific instance, transfer in language acquisition.

It is accepted as true that all previous linguistic knowledge is facilitative in developing proficiency in a new language. However, a frequently reported phenomenon is that students see similarities between two systems in a different way than linguists and theoreticians of education do. As a consequence, the full facilitative potential of transfer remains unused.

The present research seeks to shed light on the similarity perceptions with the focus on the comprehension of a written text. In order to elucidate students’ views, a form involving similarity judgements and multiple choice questions for formally similar items has been designed, drawing on real language use as provided by corpora. 123 forms have been distributed in 6 groups of international students, 4 of them studying Swedish at Level I and 2 studying at Level II.

The test items in the form vary in the degree of formal, semantic and functional similarity from very close cognates, to similar words belonging to different word classes, to items exhibiting category membership and/or being in subordinate/superordinate relation to each other, to deceptive cognates. The author proposes expected similarity ratings and compares them to the results obtained. The objective measure of formal similarity is provided by a string matching algorithm, Levenshtein distance.

The similarity judgements point at the fact that intermediate similarity values can be considered problematic. Similarity ratings between somewhat similar items are usually lower than could be expected. Besides, difference in grammatical meaning lowers similarity values significantly even if lexical meaning nearly coincides. Thus, the obtained results indicate that in order to utilize similarities to facilitate language learning, more attention should be paid to underlying similarities.
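The abstract names Levenshtein distance as its objective measure of formal similarity. A minimal sketch of that algorithm follows. The normalisation in `formal_similarity` and the example word pairs are assumptions made for illustration; they are not taken from the study's test items or scoring.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def formal_similarity(a, b):
    """Assumed normalisation: 1.0 for identical strings, 0.0 for maximally different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

# Hypothetical English/Swedish pairs, not taken from the study's form.
for english, swedish in [("house", "hus"), ("hand", "hand")]:
    print(english, swedish, levenshtein(english, swedish),
          round(formal_similarity(english, swedish), 2))
```

A purely orthographic score like this ignores meaning and word class, which is why the study sets it against learners' own similarity judgements.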

13

Hong, Shinchul. "The pedagogical use of corpus data based on two case studies: the Dong-A for Korean Learners and the Chemnitz for German Learners." Thesis, Lancaster University, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.531696.

14

"A corpus-based induction learning approach to natural language processing." Chinese University of Hong Kong, 1996. http://library.cuhk.edu.hk/record=b5888859.

Abstract:

by Leung Chi Hong.
Thesis (Ph.D.)--Chinese University of Hong Kong, 1996.
Includes bibliographical references (leaves 163-171).
Chapter 1. Introduction
Chapter 2. Background Study of Natural Language Processing
2.1. Knowledge-based approach
2.1.1. Morphological analysis
2.1.2. Syntactic parsing
2.1.3. Semantic parsing
2.1.3.1. Semantic grammar
2.1.3.2. Case grammar
2.1.4. Problems of knowledge acquisition in knowledge-based approach
2.2. Corpus-based approach
2.2.1. Beginning of corpus-based approach
2.2.2. An example of corpus-based application: word tagging
2.2.3. Annotated corpus
2.2.4. State of the art in the corpus-based approach
2.3. Knowledge-based approach versus corpus-based approach
2.4. Co-operation between two different approaches
Chapter 3. Induction Learning applied to Corpus-based Approach
3.1. General model of traditional corpus-based approach
3.1.1. Division of a problem into a number of sub-problems
3.1.2. Solution selected from a set of predefined choices
3.1.3. Solution selection based on a particular kind of linguistic entity
3.1.4. Statistical correlations between solutions and linguistic entities
3.1.5. Prediction of the best solution based on statistical correlations
3.2. First problem in the corpus-based approach: Irrelevance in the corpus
3.3. Induction learning
3.3.1. General issues about induction learning
3.3.2. Reasons of using induction learning in the corpus-based approach
3.3.3. General model of corpus-based induction learning approach
3.3.3.1. Preparation of positive corpus and negative corpus
3.3.3.2. Statistical correlations between solutions and linguistic entities
3.3.3.3. Combination of the statistical correlations obtained from the positive and negative corpora
3.4. Second problem in the corpus-based approach: Modification of initial probabilistic approximations
3.5. Learning feedback modification
3.5.1. Determination of which correlation scores to be modified
3.5.2. Determination of the magnitude of modification
3.5.3. A general algorithm of learning feedback modification
Chapter 4. Identification of Phrases and Templates in Domain-specific Chinese Texts
4.1. Analysis of the problem solved by the traditional corpus-based approach
4.2. Phrase identification based on positive and negative corpora
4.3. Phrase identification procedure
4.3.1. Step 1: Phrase seed identification
4.3.2. Step 2: Phrase construction from phrase seeds
4.4. Template identification procedure
4.5. Experiment and result
4.5.1. Testing data
4.5.2. Details of experiments
4.5.3. Experimental results
4.5.3.1. Phrases and templates identified in financial news articles
4.5.3.2. Phrases and templates identified in political news articles
4.6. Conclusion
Chapter 5. A Corpus-based Induction Learning Approach to Improving the Accuracy of Chinese Word Segmentation
5.1. Background of Chinese word segmentation
5.2. Typical methods of Chinese word segmentation
5.2.1. Syntactic and semantic approach
5.2.2. Statistical approach
5.2.3. Heuristic approach
5.3. Problems in word segmentation
5.3.1. Chinese word definition
5.3.2. Word dictionary
5.3.3. Word segmentation ambiguity
5.4. Corpus-based induction learning approach to improving word segmentation accuracy
5.4.1. Rationale of approach
5.4.2. Method of constructing modification rules
5.5. Experiment and results
5.6. Characteristics of modification rules constructed in experiment
5.7. Experiment constructing rules for compound words with suffixes
5.8. Relationship between modification frequency and Zipf's first law
5.9. Problems in the approach
5.10. Conclusion
Chapter 6. Corpus-based Induction Learning Approach to Automatic Indexing of Controlled Index Terms
6.1. Background of automatic indexing
6.1.1. Definition of index term and indexing
6.1.2. Manual indexing versus automatic indexing
6.1.3. Different approaches to automatic indexing
6.2. Corpus-based induction learning approach to automatic indexing
6.2.1. Fundamental concept about corpus-based automatic indexing
6.2.2. Procedure of automatic indexing
6.2.2.1. Learning process
6.2.2.2. Indexing process
6.3. Experiments of corpus-based induction learning approach to automatic indexing
6.3.1. An experiment evaluating the complete procedures
6.3.1.1. Testing data used in the experiment
6.3.1.2. Details of the experiment
6.3.1.3. Experimental result
6.3.2. An experiment comparing with the traditional approach
6.3.3. An experiment determining the optimal indexing score threshold
6.3.4. An experiment measuring the precision and recall of indexing performance
6.4. Learning feedback modification
6.4.1. Positive feedback
6.4.2. Negative feedback
6.4.3. Change of indexed proportions of positive/negative training corpus in feedback iterations
6.4.4. An experiment evaluating the learning feedback modification
6.4.5. An experiment testing the significance factor in merging process
6.5. Conclusion
Chapter 7. Conclusion
Appendix A: Some examples of identified phrases in financial news articles
Appendix B: Some examples of identified templates in financial news articles
Appendix C: Some examples of texts containing the templates in financial news articles
Appendix D: Some examples of identified phrases in political news articles
Appendix E: Some examples of identified templates in political news articles
Appendix F: Some examples of texts containing the templates in political news articles
Appendix G: Syntactic tags used in word segmentation modification rule experiment
Appendix H: An example of semantic approach to automatic indexing
Appendix I: An example of syntactic approach to automatic indexing
Appendix J: Samples of INSPEC and MEDLINE Records
Appendix K: Examples of Promoting and Demoting Words
References

15

Mak, King Tong. "The dynamics of collocation: a corpus-based study of the phraseology and pragmatics of the introductory-it construction." Thesis, 2005. http://hdl.handle.net/2152/1776.

16

Vyatkina, Nina A. "Development of second language pragmatic competence: the data-driven teaching of German modal particles based on a learner corpus." 2007. http://www.etda.libraries.psu.edu/theses/approved/WorldWideIndex/ETD-1928/index.html.

17

Yang, Shu-Hui (楊淑惠). "An Investigation into the Acquisition of Japanese Giving and Receiving Expressions: Based on the LARP at SCU Corpus Data." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/96861692541568529274.

Abstract:

Master's degree
Soochow University
Department of Japanese Language and Culture
104
In Japanese, giving and receiving expressions are a complicated area: they encode the modesty and caution of Japanese interpersonal relationships, and misusing them can easily cause misunderstanding. Based mainly on the LARP at SCU corpus data, this study focuses on the Japanese auxiliary verbs of giving and receiving and examines how Taiwanese learners acquire them. The study consists of five chapters. The first chapter is the introduction. The second chapter reviews previous research, which inspired the author to analyze the topic from different angles. The third chapter gives an overview of the history of the LARP at SCU corpus, defines the related terminology, and describes the methodology. The fifth chapter presents the conclusion. The findings of the analysis are as follows. 1. “(TE) KURERU”, “(TE) AGERU”, and “(TE) MORAU” are the most frequently misused items. 2. Auxiliary verbs are more difficult to use than main verbs, which implies that they need more attention. 3. The ratio of pre-test to post-test is 1:3.5, which justifies the vital role the tutor plays. 4. The ability to use giving and receiving expressions is strongly related to inadequate knowledge of Japanese language and culture. In conclusion, students will acquire giving and receiving expressions more effectively if the teacher gives them more guidance and advice.
