
Dissertations / Theses on the topic 'Data-to-text'


Consult the top 50 dissertations / theses for your research on the topic 'Data-to-text.'


1

Kyle, Cameron. "Data to information to text summaries of financial data." Master's thesis, University of Cape Town, 2018. http://hdl.handle.net/11427/29643.

Full text
Abstract:
The field of auditing is becoming increasingly dependent on information technology, as auditors are forced to follow the increasingly complex information processing of their clients. There is a need for a system that can convert the vast quantities of data generated by existing systems and data analytics techniques into usable information, and then into a format that is easy for someone not trained in data analytics to understand. This is possible through Natural Language Generation (NLG), a pipeline that has not previously been applied to auditing. This research looks at the auditing of investment fund management, where one specific procedure is the comparison of two time series (one of the fund being tested and one of the benchmark it is supposed to follow) to identify potential misstatements in the investment fund. We solve this problem through a combination of incremental innovations on existing techniques in the text planning stage and pre-NLG processing steps, together with effective use of accepted sentence planning and realisation techniques. Additionally, fuzzy logic is used to provide a more human-like decision system. This allows the system to transform data into information and then into text. The system was evaluated by experts and achieved positive results with regard to audit impact, readability and understandability, while falling slightly short of the stated accuracy targets. These preliminary results are positive overall and are therefore encouraging for further development.
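As a rough illustration of the data-to-information-to-text idea described in this abstract, the following Python sketch compares a fund series against its benchmark, grades the deviation with fuzzy membership functions and realises a single template sentence. The thresholds, membership functions, wording and function names are invented for illustration and are not taken from the thesis.

```python
# Illustrative sketch only: a minimal data-to-information-to-text pipeline for
# comparing a fund against its benchmark. Thresholds, membership functions and
# wording are hypothetical, not those used in the thesis.

def trapezoid(x, a, b, c, d):
    """Trapezoidal fuzzy membership function."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def describe_deviation(fund, benchmark):
    # Data -> information: mean absolute deviation between the two series (in %).
    dev = sum(abs(f - b) for f, b in zip(fund, benchmark)) / len(fund)
    # Fuzzy grading of the deviation (assumed membership functions).
    grades = {
        "in line with": trapezoid(dev, -1, 0, 0.5, 1.0),
        "slightly divergent from": trapezoid(dev, 0.5, 1.0, 2.0, 3.0),
        "materially divergent from": trapezoid(dev, 2.0, 3.0, 10, 11),
    }
    label = max(grades, key=grades.get)
    # Information -> text: simple template realisation.
    return (f"The fund's monthly returns were {label} its benchmark "
            f"(mean absolute deviation {dev:.2f} percentage points).")

print(describe_deviation([1.2, 0.8, -0.3, 2.1], [1.0, 0.9, -0.1, 1.8]))
```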
2

Gkatzia, Dimitra. "Data-driven approaches to content selection for data-to-text generation." Thesis, Heriot-Watt University, 2015. http://hdl.handle.net/10399/3003.

Full text
Abstract:
Data-to-text systems are powerful in generating reports from data automatically and thus they simplify the presentation of complex data. Rather than presenting data using visualisation techniques, data-to-text systems use human language, which is the most common way for human-human communication. In addition, data-to-text systems can adapt their output content to users’ preferences, background or interests and therefore they can be pleasant for users to interact with. Content selection is an important part of every data-to-text system, because it is the module that decides which from the available information should be conveyed to the user. This thesis makes three important contributions. Firstly, it investigates data-driven approaches to content selection with respect to users’ preferences. It develops, compares and evaluates two novel content selection methods. The first method treats content selection as a Markov Decision Process (MDP), where the content selection decisions are made sequentially, i.e. given the already chosen content, decide what to talk about next. The MDP is solved using Reinforcement Learning (RL) and is optimised with respect to a cumulative reward function. The second approach considers all content selection decisions simultaneously by taking into account data relationships and treats content selection as a multi-label classification task. The evaluation shows that the users significantly prefer the output produced by the RL framework, whereas the multi-label classification approach scores significantly higher than the RL method in automatic metrics. The results also show that the end users’ preferences should be taken into account when developing Natural Language Generation (NLG) systems. NLG systems are developed with the assistance of domain experts, however the end users are normally non-experts. Consider for instance a student feedback generation system, where the system imitates the teachers. The system will produce feedback based on the lecturers’ rather than the students’ preferences although students are the end users. Therefore, the second contribution of this thesis is an approach that adapts the content to “speakers” and “hearers” simultaneously. It considers initially two types of known stakeholders; lecturers and students. It develops a novel approach that analyses the preferences of the two groups using Principal Component Regression and uses the derived knowledge to hand-craft a reward function that is then optimised using RL. The results show that the end users prefer the output generated by this system, rather than the output that is generated by a system that mimics the experts. Therefore, it is possible to model the middle ground of the preferences of different known stakeholders. In most real world applications however, first-time users are generally unknown, which is a common problem for NLG and interactive systems: the system cannot adapt to user preferences without prior knowledge. This thesis contributes a novel framework for addressing unknown stakeholders such as first time users, using Multi-objective Optimisation to minimise regret for multiple possible user types. In this framework, the content preferences of potential users are modelled as objective functions, which are simultaneously optimised using Multi-objective Optimisation. This approach outperforms two meaningful baselines and minimises regret for unknown users.
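The MDP formulation summarised above (choose the next content item given what has already been selected, and learn from a cumulative reward) can be sketched with tabular Q-learning. The content items, reward function and hyperparameters below are toy assumptions, not the thesis's actual reward or domain.

```python
# Minimal sketch of content selection framed as an MDP and solved with tabular
# Q-learning. Items, reward and hyperparameters are toy assumptions.
import random
from collections import defaultdict

ITEMS = ["marks", "effort", "difficulty", "deadlines", "attendance"]
PREFERRED = {"marks", "effort", "deadlines"}   # assumed user preferences

def reward(selected):
    # Cumulative reward observed at the end of an episode: coverage of the
    # preferred content minus a penalty for verbosity.
    return len(selected & PREFERRED) - 0.4 * len(selected - PREFERRED)

Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 1.0, 0.2

for episode in range(5000):
    state = frozenset()
    while True:
        actions = [i for i in ITEMS if i not in state] + ["STOP"]
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        if action == "STOP":
            # Terminal update with the delayed, episode-level reward.
            Q[(state, action)] += alpha * (reward(state) - Q[(state, action)])
            break
        next_state = state | {action}
        next_actions = [i for i in ITEMS if i not in next_state] + ["STOP"]
        target = gamma * max(Q[(next_state, a)] for a in next_actions)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state

# Greedy roll-out of the learned policy: what to talk about, and in which order.
state, plan = frozenset(), []
while True:
    actions = [i for i in ITEMS if i not in state] + ["STOP"]
    action = max(actions, key=lambda a: Q[(state, a)])
    if action == "STOP":
        break
    plan.append(action)
    state = state | {action}
print(plan)
```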
3

Turner, Ross. "Georeferenced data-to-text techniques and application /." Thesis, Available from the University of Aberdeen Library and Historic Collections Digital Resources, 2009. http://digitool.abdn.ac.uk:80/webclient/DeliveryManager?application=DIGITOOL-3&owner=resourcediscovery&custom_att_2=simple_viewer&pid=56243.

Full text
4

Štajner, Sanja. "New data-driven approaches to text simplification." Thesis, University of Wolverhampton, 2015. http://hdl.handle.net/2436/554413.

Full text
Abstract:
Many texts we encounter in our everyday lives are lexically and syntactically very complex. This makes them difficult to understand for people with intellectual or reading impairments, and difficult for various natural language processing systems to process. This motivated the need for text simplification (TS), which transforms texts into simpler variants. Given that this is still a relatively new research area, many challenges remain. The focus of this thesis is on better understanding the current problems in automatic text simplification (ATS) and proposing new data-driven approaches to solving them. We propose methods for learning sentence splitting and deletion decisions, built upon parallel corpora of original and manually simplified Spanish texts, which outperform existing similar systems. Our experiments in adapting those methods to different text genres and target populations report promising results, thus offering one possible solution for dealing with the scarcity of parallel corpora for text simplification aimed at specific target populations, which is currently one of the main issues in ATS. The results of our extensive analysis of the phrase-based statistical machine translation (PB-SMT) approach to ATS reject the widespread assumption that the success of that approach largely depends on the size of the training and development datasets. They indicate more influential factors for the success of the PB-SMT approach to ATS, and reveal some important differences between cross-lingual MT and the monolingual MT used in ATS. Our event-based system for simplifying news stories in English (EventSimplify) overcomes some of the main problems in ATS. It requires neither a large number of handcrafted simplification rules nor parallel data, and it performs significant content reduction. The automatic and human evaluations conducted show that it produces grammatical text and increases readability, preserving and simplifying relevant content and reducing irrelevant content. Finally, this thesis addresses another important issue in TS, namely how to automatically evaluate the performance of TS systems given that access to the target users might be difficult. Our experiments indicate that existing readability metrics can successfully be used for this task when enriched with human evaluation of grammaticality and preservation of meaning.
5

Jones, Greg 1963-2017. "RADIX 95n: Binary-to-Text Data Conversion." Thesis, University of North Texas, 1991. https://digital.library.unt.edu/ark:/67531/metadc500582/.

Full text
Abstract:
This paper presents Radix 95n, a binary to text data conversion algorithm. Radix 95n (base 95) is a variable length encoding scheme that offers slightly better efficiency than is available with conventional fixed length encoding procedures. Radix 95n advances previous techniques by allowing a greater pool of 7-bit combinations to be made available for 8-bit data translation. Since 8-bit data (i.e. binary files) can prove to be difficult to transfer over 7-bit networks, the Radix 95n conversion technique provides a way to convert data such as compiled programs or graphic images to printable ASCII characters and allows for their transfer over 7-bit networks.
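The general idea behind base-95 encoding (packing arbitrary 8-bit data into the 95 printable ASCII characters so it survives 7-bit transport) can be sketched as a simple big-integer conversion. This is only an illustration of the principle; it is not Jones's variable-length Radix 95n scheme, and the length-prefix framing is an assumption added to make the round trip work.

```python
# Sketch of the general idea behind base-95 binary-to-text conversion: map
# arbitrary 8-bit data onto the 95 printable ASCII characters (0x20-0x7E) so it
# survives 7-bit transports. A simple big-integer encoding, not Jones's exact
# variable-length Radix 95n scheme.

ALPHABET = [chr(c) for c in range(0x20, 0x7F)]   # the 95 printable characters

def encode95(data: bytes) -> str:
    n = int.from_bytes(data, "big")
    digits = []
    while n:
        n, r = divmod(n, 95)
        digits.append(ALPHABET[r])
    # Prefix the byte length so leading zero bytes survive the round trip.
    return f"{len(data)}:" + "".join(reversed(digits))

def decode95(text: str) -> bytes:
    length, _, body = text.partition(":")
    n = 0
    for ch in body:
        n = n * 95 + ALPHABET.index(ch)
    return n.to_bytes(int(length), "big")

payload = bytes([0, 200, 13, 10, 255])           # bytes awkward for 7-bit links
wire = encode95(payload)
assert decode95(wire) == payload
print(wire)
```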
6

Štajner, Sanja. "New data-driven approaches to text simplification." Thesis, University of Wolverhampton, 2016. http://hdl.handle.net/2436/601113.

Full text
7

Rose, Øystein. "Text Mining in Health Records : Classification of Text to Facilitate Information Flow and Data Overview." Thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2007. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9629.

Full text
Abstract:

This project consists of two parts. In the first part we apply techniques from the field of text mining to classify sentences in encounter notes of the electronic health record (EHR) into classes of subjective, objective and plan character. This is a simplification of the SOAP standard, and is applied due to the way GPs structure their encounter notes. Structuring the information in a subjective, objective, and plan way may enhance future information flow between the EHR and the personal health record (PHR). In the second part of the project we apply the most adequate of these techniques to classify encounter notes from the histories of patients suffering from diabetes. We believe that the distribution of sentences of subjective, objective, and plan character changes according to different phases of a disease. In our work we experiment with several preprocessing techniques, classifiers, and amounts of data. Of the classifiers considered, we find that Complement Naive Bayes (CNB) produces the best result, both with and without preprocessing of the data. On the raw dataset, CNB yields an accuracy of 81.03%, while on the preprocessed dataset, CNB yields an accuracy of 81.95%. The Support Vector Machines (SVM) classifier algorithm yields results comparable to those obtained with CNB, while the J48 classifier algorithm performs more poorly. Concerning preprocessing techniques, we find that techniques reducing the dimensionality of the datasets improve the results for smaller attribute sets, but worsen the results for larger attribute sets. The trend is the opposite for preprocessing techniques that expand the set of attributes. However, finding the ratio between the size of the dataset and the number of attributes at which the preprocessing techniques improve the result is difficult. Hence, preprocessing techniques are not applied in the second part of the project. From the results of the classification of the patient histories we have extracted graphs that show how the sentence class distribution develops after the first diagnosis of diabetes is made. Although no empirical research has been carried out on this point, we believe that such graphs may, through further research, facilitate the recognition of points of interest in the patient history. From the same results we also create graphs that show the average distribution of sentences of subjective, objective, and plan character for 429 patients after the first diagnosis of diabetes is made. From these graphs we find evidence that there is an overrepresentation of subjective sentences in encounter notes where the diagnosis of diabetes is first made. However, we believe that similar experiments for several diseases may uncover patterns or trends concerning the diseases in focus.
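A minimal sketch of the first part of the project, sentence classification into subjective, objective and plan classes with Complement Naive Bayes, might look as follows using scikit-learn. The example sentences, labels and feature settings are invented; they are not the thesis's data or exact configuration.

```python
# Minimal sketch of sentence classification into subjective / objective / plan
# classes with Complement Naive Bayes. Example sentences and labels are
# invented; scikit-learn is assumed to be available.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

sentences = [
    "Patient reports increasing thirst and fatigue.",      # subjective
    "Feels dizzy in the mornings.",                         # subjective
    "HbA1c measured at 8.2 percent.",                       # objective
    "Blood pressure 145/90, weight 92 kg.",                 # objective
    "Start metformin 500 mg twice daily.",                  # plan
    "Refer to dietician and review in three months.",       # plan
]
labels = ["subjective", "subjective", "objective", "objective", "plan", "plan"]

model = make_pipeline(CountVectorizer(lowercase=True), ComplementNB())
model.fit(sentences, labels)

print(model.predict(["Schedule follow-up blood tests next week.",
                     "Glucose level is 11.4 mmol/L."]))
```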

8

Ma, Yimin. "Text classification on imbalanced data: Application to systematic reviews automation." Thesis, University of Ottawa (Canada), 2007. http://hdl.handle.net/10393/27532.

Full text
Abstract:
Systematic review is a basic process of evidence-based medicine, and consequently there is an urgent need for tools assisting with, and eventually automating, a large part of this process. In a traditional systematic review system, reviewers or domain experts manually classify the literature into a relevant and an irrelevant class through a series of systematic review levels. In our work with TrialStat, we apply text classification techniques to a systematic review system in order to minimize the human effort in identifying relevant literature. In most cases, the relevant articles are a small portion of the Medline corpus, so the first essential issue for this task is achieving high recall for those relevant articles. We also face two technical challenges: handling imbalanced data, and reducing the size of the labeled training set. To address these issues, we first study the feature selection and sample selection bias caused by the skewed data. We then experiment with different feature selection, sample selection, and classification methods to find the ones that can properly handle these problems. In order to minimize the labeled training set size, we also experiment with active learning techniques. Active learning selects the most informative instances to be labeled, so that the number of required training examples is reduced while performance is maintained. By using an active learning technique, we saved 86% of the effort required to label the training examples. The best test result was obtained by combining the feature selection method Modified BNS, clustering-based sample selection and active learning, with Naive Bayes as the classifier. We achieved 100% recall for the minority class with an overall accuracy of 58.43%. By achieving a work saved over sampling (WSS) of 53.4%, we saved half of the workload for the reviewers.
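A pool-based active-learning loop with uncertainty sampling and a Naive Bayes classifier, in the spirit of the approach above, can be sketched as follows. The data is synthetic, and the feature selection and clustering-based sample selection used in the thesis are omitted; numbers such as the seed-set size and query batch size are assumptions.

```python
# Sketch of pool-based active learning with uncertainty sampling and Naive
# Bayes, aiming at high recall on a rare "relevant" class. Synthetic data
# stands in for the labeled/unlabeled article collections.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)
rng = np.random.RandomState(0)
train_idx, test_idx = np.arange(1500), np.arange(1500, 2000)

# Small seed set with both classes, then an unlabeled pool.
labeled = list(rng.choice(train_idx[y[train_idx] == 0], 15, replace=False))
labeled += list(rng.choice(train_idx[y[train_idx] == 1], 5, replace=False))
pool = [i for i in train_idx if i not in set(labeled)]

clf = GaussianNB()
for _ in range(30):                          # label 10 instances per round
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)      # least confident predictions
    for q in sorted(np.argsort(uncertainty)[-10:], reverse=True):
        labeled.append(pool.pop(q))          # "ask the reviewer" for a label

clf.fit(X[labeled], y[labeled])
pred = clf.predict(X[test_idx])
print("minority-class recall:", recall_score(y[test_idx], pred, pos_label=1))
print("labels used:", len(labeled), "out of", len(train_idx))
```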
9

Salah, Aghiles. "Von Mises-Fisher based (co-)clustering for high-dimensional sparse data : application to text and collaborative filtering data." Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCB093/document.

Full text
Abstract:
Cluster analysis or clustering, which aims to group together similar objects, is undoubtedly a very powerful unsupervised learning technique. With the growing amount of available data, clustering is increasingly gaining in importance in various areas of data science, for purposes such as automatic summarization, dimensionality reduction, visualization, outlier detection, speeding up search engines, and the organization of huge data sets. Existing clustering approaches are, however, severely challenged by the high dimensionality and extreme sparsity of the data sets arising in some current areas of interest, such as Collaborative Filtering (CF) and text mining. Such data often consist of thousands of features and more than 95% zero entries. In addition to being high dimensional and sparse, the data sets encountered in the aforementioned domains are also directional in nature. In fact, several previous studies have empirically demonstrated that directional measures (which assess the distance between objects relative to the angle between them), such as the cosine similarity, are substantially superior to measures such as Euclidean distortions for clustering text documents or assessing the similarities between users/items in CF. This suggests that in such contexts only the direction of a data vector (e.g., a text document) is relevant, not its magnitude. It is worth noting that the cosine similarity is exactly the scalar product between unit-length data vectors, i.e., L2-normalized vectors. Thus, from a probabilistic perspective, using the cosine similarity is equivalent to assuming that the data are directional and distributed on the surface of a unit hypersphere. Despite the substantial empirical evidence that certain high-dimensional sparse data sets, such as those encountered in the above domains, are better modelled as directional data, most existing models in text mining and CF are based on popular assumptions such as Gaussian, Multinomial or Bernoulli distributions, which are inadequate for L2-normalized data. In this thesis, we focus on the two challenging tasks of text document clustering and item recommendation, which are still attracting a lot of attention in the domains of text mining and CF, respectively. In order to address the above limitations, we propose a suite of new models and algorithms which rely on the von Mises-Fisher (vMF) distribution, which arises naturally for directional data lying on a unit hypersphere.
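A hard-assignment special case of the vMF mixture idea, spherical k-means on L2-normalized vectors, can be sketched in a few lines of NumPy. The synthetic matrix below stands in for a TF-IDF document-term matrix; the function names and cluster construction are illustrative only.

```python
# Sketch of spherical k-means, the hard-assignment limiting case of a mixture
# of von Mises-Fisher distributions: documents are L2-normalized so that only
# their direction matters, and each cluster is represented by a unit mean
# vector. Synthetic data stands in for a TF-IDF matrix.
import numpy as np

def l2_normalize(X):
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def spherical_kmeans(X, k, n_iter=50, seed=0):
    X = l2_normalize(X)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        sims = X @ centroids.T                 # cosine similarity (unit norms)
        labels = sims.argmax(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = l2_normalize(members.sum(axis=0, keepdims=True))[0]
    return labels, centroids

# Toy high-dimensional, sparse-ish data with two directional clusters.
rng = np.random.default_rng(1)
A = rng.random((50, 100)) * (rng.random((50, 100)) < 0.05)
B = rng.random((50, 100)) * (rng.random((50, 100)) < 0.05)
A[:, :10] += 1.0   # cluster A concentrates on the first 10 "terms"
B[:, 90:] += 1.0   # cluster B on the last 10
labels, _ = spherical_kmeans(np.vstack([A, B]), k=2)
print(labels)
```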
10

Natarajan, Jeyakumar. "Text mining of biomedical literature and its applications to microarray data analysis and interpretation." Thesis, University of Ulster, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.445041.

Full text
11

Malatji, Promise Tshepiso. "The development of accented English synthetic voices." Thesis, University of Limpopo, 2019. http://hdl.handle.net/10386/2917.

Full text
Abstract:
Thesis (M. Sc. (Computer Science)) -- University of Limpopo, 2019
A Text-to-speech (TTS) synthesis system is a software system that receives text as input and produces speech as output. A TTS synthesis system can be used for, amongst others, language learning, and reading out text for people living with different disabilities, i.e., physically challenged, visually impaired, etc., by native and non-native speakers of the target language. Most people relate easily to a second language spoken by a non-native speaker they share a native language with. Most online English TTS synthesis systems are usually developed using native speakers of English. This research study focuses on developing accented English synthetic voices as spoken by non-native speakers in the Limpopo province of South Africa. The Modular Architecture for Research on speech sYnthesis (MARY) TTS engine is used in developing the synthetic voices. The Hidden Markov Model (HMM) method was used to train the synthetic voices. Secondary training text corpus is used to develop the training speech corpus by recording six speakers reading the text corpus. The quality of developed synthetic voices is measured in terms of their intelligibility, similarity and naturalness using a listening test. The results in the research study are classified based on evaluators’ occupation and gender and the overall results. The subjective listening test indicates that the developed synthetic voices have a high level of acceptance in terms of similarity and intelligibility. A speech analysis software is used to compare the recorded synthesised speech and the human recordings. There is no significant difference in the voice pitch of the speakers and the synthetic voices except for one synthetic voice.
12

Odd, Joel, and Emil Theologou. "Utilize OCR text to extract receipt data and classify receipts with common Machine Learning algorithms." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-148350.

Full text
Abstract:
This study investigated whether it was feasible to use machine learning tools on OCR-extracted text data to classify receipts and extract specific data points. Two OCR tools were evaluated: the first was the Azure Computer Vision API and the second was the Google Drive REST API, where the Google Drive REST API was the main OCR tool used in the project because of its impressive performance. The classification task mainly tried to predict which of five given categories a receipt belongs to, along with a more challenging task of predicting specific subcategories inside those five larger categories. The data points we were trying to extract were the date of purchase on the receipt and the total price of the transaction. The classification was mainly done with the help of scikit-learn, while the extraction of data points was achieved by a simple custom-made N-gram model. The results were promising, with about 94% cross-validation score for classifying receipts by category with the help of a LinearSVC classifier. Our custom model was successful in 72% of cases for the price data point, while the results for extracting the date were less successful with an accuracy of 50%, which we still consider very promising given the simplistic nature of the custom model.
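The two parts described above, category classification of OCR output and extraction of the date and total, might be sketched as follows with scikit-learn and simple patterns. The OCR strings, category names and regular expressions are invented assumptions; the thesis's custom N-gram extractor is replaced here by plain regexes for brevity.

```python
# Sketch of (1) classifying OCR'd receipt text into categories with a LinearSVC
# over TF-IDF features, and (2) pulling out the purchase date and total with
# simple patterns. OCR strings, categories and regexes are illustrative only.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

receipts = [
    "ICA Supermarket mjolk brod ost TOTAL 154.50 2018-03-02",
    "Circle K bensin 95 oktan SUMMA 612.00 2018-02-17",
    "SF Bio biobiljett popcorn TOTALT 245.00 2018-04-21",
    "Coop frukt gronsaker kvitto TOTAL 89.90 2018-03-11",
    "OKQ8 diesel biltvatt SUMMA 455.00 2018-01-30",
    "Filmstaden bio 2 biljetter TOTALT 310.00 2018-05-05",
]
categories = ["groceries", "fuel", "entertainment",
              "groceries", "fuel", "entertainment"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(receipts, categories)
print(clf.predict(["Preem bensin TOTAL 500.00 2018-06-01"]))

def extract_fields(text):
    # Date as ISO-like pattern; total after a (hypothetical) keyword.
    date = re.search(r"\d{4}-\d{2}-\d{2}", text)
    total = re.search(r"(?:TOTALT?|SUMMA)\s+(\d+[.,]\d{2})", text)
    return (date.group(0) if date else None,
            total.group(1) if total else None)

print(extract_fields(receipts[0]))   # ('2018-03-02', '154.50')
```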
13

Bergh, Adrienne. "A Machine Learning Approach to Predicting Alcohol Consumption in Adolescents From Historical Text Messaging Data." Chapman University Digital Commons, 2019. https://digitalcommons.chapman.edu/cads_theses/2.

Full text
Abstract:
Techniques based on artificial neural networks represent the current state-of-the-art in machine learning due to the availability of improved hardware and large data sets. Here we employ doc2vec, an unsupervised neural network, to capture the semantic content of text messages sent by adolescents during high school, and encode this semantic content as numeric vectors. These vectors effectively condense the text message data into highly leverageable inputs to a logistic regression classifier in a matter of hours, as compared to the tedious and often quite lengthy task of manually coding data. Using our machine learning approach, we are able to train a logistic regression model to predict adolescents' engagement in substance abuse during distinct life phases with accuracy ranging from 76.5% to 88.1%. We show the effects of grade level and text message aggregation strategy on the efficacy of document embedding generation with doc2vec. Additional examination of the vectorizations for specific terms extracted from the text message data adds quantitative depth to this analysis. We demonstrate the ability of the method used herein to overcome traditional natural language processing concerns related to unconventional orthography. These results suggest that the approach described in this thesis is a competitive and efficient alternative to existing methodologies for predicting substance abuse behaviors. This work reveals the potential for the application of machine learning-based manipulation of text messaging data to development of automatic intervention strategies against substance abuse and other adolescent challenges.
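A minimal sketch of the embedding-plus-classifier pipeline described above, doc2vec vectors feeding a logistic regression model, could look like this with gensim and scikit-learn. The messages, labels and hyperparameters are invented and far smaller than the study's data.

```python
# Sketch of the doc2vec + logistic regression pipeline: embed each (aggregated)
# set of text messages as a fixed-length vector, then classify. Messages and
# labels are invented; gensim and scikit-learn are assumed to be installed.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

messages = [
    "lets grab a drink after the party tonight",
    "can you send me the homework for bio class",
    "bring some beers to the lake this weekend",
    "studying for the math test all evening",
]
labels = [1, 0, 1, 0]   # 1 = later substance use (hypothetical outcome)

corpus = [TaggedDocument(words=m.split(), tags=[i])
          for i, m in enumerate(messages)]
d2v = Doc2Vec(vector_size=50, min_count=1, epochs=100, seed=1, workers=1)
d2v.build_vocab(corpus)
d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.epochs)

X = [d2v.infer_vector(m.split()) for m in messages]
clf = LogisticRegression().fit(X, labels)
print(clf.predict([d2v.infer_vector("party drinks tonight".split())]))
```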
14

Shokat, Imran. "Computational Analyses of Scientific Publications Using Raw and Manually Curated Data with Applications to Text Visualization." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-78995.

Full text
Abstract:
Text visualization is a field dedicated to the visual representation of textual data by using computer technology. A large number of visualization techniques are available, and now it is becoming harder for researchers and practitioners to choose an optimal technique for a particular task among the existing techniques. To overcome this problem, the ISOVIS Group developed an interactive survey browser for text visualization techniques. ISOVIS researchers gathered papers which describe text visualization techniques or tools and categorized them according to a taxonomy. Several categories were manually assigned to each visualization technique. In this thesis, we aim to analyze the dataset of this browser. We carried out several analyses to find temporal trends and correlations of the categories present in the browser dataset. In addition, a comparison of these categories with a computational approach has been made. Our results show that some categories became more popular than before whereas others have declined in popularity. The cases of positive and negative correlation between various categories have been found and analyzed. Comparison between manually labeled datasets and results of computational text analyses were presented to the experts with an opportunity to refine the dataset. Data which is analyzed in this thesis project is specific to text visualization field, however, methods that are used in the analyses can be generalized for applications to other datasets of scientific literature surveys or, more generally, other manually curated collections of textual documents.
15

Hill, Geoffrey. "Sensemaking in Big Data: Conceptual and Empirical Approaches to Actionable Knowledge Generation from Unstructured Text Streams." Kent State University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=kent1433597354.

Full text
16

Zhou, Wubai. "Data Mining Techniques to Understand Textual Data." FIU Digital Commons, 2017. https://digitalcommons.fiu.edu/etd/3493.

Full text
Abstract:
More than ever, information delivery and storage online rely heavily on text. Billions of texts are produced every day in the form of documents, news, logs, search queries, ad keywords, tags, tweets, messenger conversations, social network posts, etc. Text understanding is a fundamental and essential task covering broad research topics, and contributes to many applications in areas such as text summarization, search engines, recommendation systems, online advertising, conversational bots and so on. However, understanding text is never a trivial task for computers, especially for noisy and ambiguous text such as logs and search queries. This dissertation mainly focuses on textual understanding tasks derived from two domains, disaster management and IT service management, that mainly utilize textual data as an information carrier. Improving situation awareness in disaster management and alleviating the human effort involved in IT service management call for more intelligent and efficient solutions to understand the textual data acting as the main information carrier in these two domains. From the perspective of data mining, four directions are identified: (1) intelligently generating a storyline summarizing the evolution of a hurricane from a relevant online corpus; (2) automatically recommending resolutions according to the textual symptom description in a ticket; (3) gradually adapting the resolution recommendation system for time-correlated features derived from text; (4) efficiently learning distributed representations for short and lousy ticket symptom descriptions and resolutions. Provided with different types of textual data, the data mining techniques proposed in those four research directions successfully address our tasks of understanding and extracting valuable knowledge from textual data. This dissertation addresses the research topics outlined above. Concretely, I focus on designing and developing data mining methodologies to better understand textual information, including (1) a storyline generation method for efficient summarization of natural hurricanes based on a crawled online corpus; (2) a recommendation framework for automated ticket resolution in IT service management; (3) an adaptive recommendation system for time-varying, temporally correlated features derived from text; (4) a deep neural ranking model that not only successfully recommends resolutions but also efficiently outputs distributed representations for ticket descriptions and resolutions.
17

Pereira, José Casimiro. "Natural language generation in the context of multimodal interaction in Portuguese : Data-to-text based in automatic translation." Doctoral thesis, Universidade de Aveiro, 2017. http://hdl.handle.net/10773/21767.

Full text
Abstract:
Doctorate in Informatics
Abstract in Portuguese not available
To enable interaction by text and/or speech, it is essential that we devise systems capable of translating internal data into sentences or texts that can be shown on screen or heard by users. In this context, it is essential that these natural language generation (NLG) systems provide sentences in the native languages of the users (in our case European Portuguese) and enable an easy development and integration process while providing output that is perceived as natural. The creation of high-quality NLG systems is not an easy task, even for a small domain. The main difficulties arise from: classic approaches being very demanding in know-how and development time; a lack of variability in the generated sentences of most generation methods; difficulty in easily accessing complete tools; shortage of resources, such as large corpora; and support being available in only a limited number of languages. The main goal of this work was to propose, develop and test a method to convert data to Portuguese text, which can be developed with the smallest possible amount of time and resources while being capable of generating utterances with variability and quality. The thesis defended argues that this goal can be achieved by adopting data-driven language generation, more precisely generation based on language translation, and following an Engineering Research Methodology. In this thesis, two Data2Text NLG systems are presented. They were designed to provide a way to quickly develop an NLG system which can generate sentences of good quality. The proposed systems use tools that are freely available and can be developed by people with limited linguistic skills. One important characteristic is the use of statistical machine translation techniques; this approach requires only a small natural language corpus, resulting in easier and cheaper development when compared to more common approaches. The main result of this thesis is the demonstration that, by following the proposed approach, it is possible to create systems capable of translating information/data into good-quality sentences in Portuguese. This is done without major effort in resource creation and with the common knowledge of an experienced application developer. The systems created, particularly the hybrid system, are capable of providing a good solution for problems in data-to-text conversion.
18

Thorstensson, Niklas. "A knowledge-based grapheme-to-phoneme conversion for Swedish." Thesis, University of Skövde, Department of Computer Science, 2002. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-731.

Full text
Abstract:

A text-to-speech system is a complex system consisting of several different modules such as grapheme-to-phoneme conversion, articulatory and prosodic modelling, voice modelling etc.

This dissertation is aimed at the creation of the initial part of a text-to-speech system, i.e. the grapheme-to-phoneme conversion, designed for Swedish. The problem area at hand is the conversion of orthographic text into a phonetic representation that can be used as a basis for a future complete text-to-speech system.

The central issue of the dissertation is the grapheme-to-phoneme conversion and the elaboration of rules and algorithms required to achieve this task. The dissertation aims to prove that it is possible to make such a conversion by a rule-based algorithm with reasonable performance. Another goal is to find a way to represent phonotactic rules in a form suitable for parsing. It also aims to find and analyze problematic structures in written text compared to phonetic realization.

This work proposes a knowledge-based grapheme-to-phoneme conversion system for Swedish. The system suggested here is implemented, tested, evaluated and compared to other existing systems. The results achieved are promising, and show that the system is fast, with a high degree of accuracy.
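A toy illustration of rule-based grapheme-to-phoneme conversion, using an ordered list of rewrite rules applied longest-match-first, is given below. The handful of Swedish-inspired rules and the phoneme symbols are simplified assumptions and bear no resemblance to the full rule set developed in the dissertation.

```python
# Toy illustration of rule-based grapheme-to-phoneme conversion: an ordered
# list of rewrite rules applied longest-match-first. The rules and phoneme
# symbols are simplified assumptions, not the dissertation's rule set.

RULES = [            # (grapheme sequence, phoneme string), longest first
    ("stj", "SJ"), ("skj", "SJ"), ("sj", "SJ"),
    ("tj", "TJ"), ("ck", "K"), ("ng", "NG"),
    ("a", "A"), ("o", "O"), ("e", "E"), ("i", "I"), ("u", "U"),
    ("b", "B"), ("d", "D"), ("f", "F"), ("g", "G"), ("h", "H"),
    ("j", "J"), ("k", "K"), ("l", "L"), ("m", "M"), ("n", "N"),
    ("p", "P"), ("r", "R"), ("s", "S"), ("t", "T"), ("v", "V"),
]

def g2p(word):
    word = word.lower()
    phonemes, i = [], 0
    while i < len(word):
        for graph, phon in RULES:            # rules are ordered longest first
            if word.startswith(graph, i):
                phonemes.append(phon)
                i += len(graph)
                break
        else:
            i += 1                            # skip characters with no rule
    return " ".join(phonemes)

print(g2p("stjärna"))   # -> "SJ R N A" with this toy rule set
print(g2p("kyckling"))  # -> "K K L I NG" (toy rules miss the soft k)
```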

19

Yu, Shuren. "How to Leverage Text Data in a Decision Support System? : A Solution Based on Machine Learning and Qualitative Analysis Methods." Thesis, Umeå universitet, Institutionen för informatik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-163899.

Full text
Abstract:
In the big data context, the growing volume of textual data presents challenges for traditional decision support systems (DSS) built on structured data, which struggle to process the semantic information carried by text. To meet this challenge, this thesis proposes a DSS solution based on machine learning and qualitative analysis, named TLE-DSS. TLE-DSS comprises three critical analytical modules: Thematic Analysis (TA), Latent Dirichlet Allocation (LDA) and Evolutionary Grounded Theory (EGT). To better understand how TLE-DSS operates, the thesis uses an experimental case to explain how decisions are made through it. Additionally, during the data analysis of the experimental case, the thesis proposes a way to determine the optimal number of topics in LDA by comparing the perplexity of models with different topic counts. A model with the optimal number of topics was then visualized using LDAvis. Moreover, the thesis expounds the principle and application value of EGT. In the last part, it discusses the challenges and potential ethical issues that TLE-DSS still faces.
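The topic-count selection step described above, comparing LDA models with different numbers of topics by their perplexity, might be sketched as follows with gensim (pyLDAvis could then visualise the chosen model, in the same spirit as the LDAvis step). The toy documents and candidate topic counts are invented.

```python
# Sketch of choosing the number of LDA topics by comparing perplexity across
# candidate models. The toy documents are invented; gensim is assumed.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    "delivery was late and the package arrived damaged".split(),
    "courier lost my package and support never replied".split(),
    "great product quality and fast delivery".split(),
    "the app keeps crashing after the latest update".split(),
    "update broke the login screen of the app".split(),
    "excellent build quality would buy this product again".split(),
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

for k in (2, 3, 4, 5):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=20, random_state=0)
    # log_perplexity returns a per-word likelihood bound; a higher bound
    # (i.e. lower perplexity) suggests a better fit on this corpus.
    print(k, "topics -> log perplexity bound:", lda.log_perplexity(corpus))
```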
20

Price, Sarah Jane. "What are we missing by ignoring text records in the Clinical Practice Research Datalink? : using three symptoms of cancer as examples to estimate the extent of data in text format that is hidden to research." Thesis, University of Exeter, 2016. http://hdl.handle.net/10871/21692.

Full text
Abstract:
Electronic medical record databases (e.g. the Clinical Practice Research Datalink, CPRD) are increasingly used in epidemiological research. The CPRD has two formats of data: coded, which is the sole format used in almost all research; and free-text (or ‘hidden’), which may contain much clinical information but is generally unavailable to researchers. This thesis examines the ramifications of omitting free-text records from research. Cases with bladder (n=4,915) or pancreatic (n=3,635) cancer were matched to controls (n=21,718, bladder; n=16,459, pancreas) on age, sex and GP practice. Coded and text-only records of attendance for haematuria, jaundice and abdominal pain in the year before cancer diagnosis were identified. The number of patients whose entire attendance record for a symptom/sign existed solely in the text was quantified. Associations between recording method (coded or text-only) and case/control status were estimated (χ2 test). For each symptom/sign, the positive predictive value (PPV, Bayes' Theorem) and odds ratio (OR, conditional logistic regression) for cancer were estimated before and after supplementation with text-only records. Text-only recording was considerable, with 7,951/20,958 (37%) of symptom records being in that format. For individual patients, text-only recording was more likely in controls (140/336=42%) than cases (556/3,147=18%) for visible haematuria in bladder cancer (χ2 test, p<0.001), and for jaundice (21/31=67% vs 463/1,565=30%, p<0.0001) and abdominal pain (323/1,126=29% vs 397/1,789=22%, p<0.001) in pancreatic cancer. Adding text records reduced PPVs of visible haematuria for bladder cancer from 4.0% (95% CI: 3.5–4.6%) to 2.9% (2.6–3.2%) and of jaundice for pancreatic cancer from 12.8% (7.3–21.6%) to 6.3% (4.5–8.7%). Coded records suggested that non-visible haematuria occurred in 127/4,915 (2.6%) cases, a figure below that generally used for study. Supplementation with text-only records increased this to 312/4,915 (6.4%), permitting the first estimation of its OR (28.0, 95% CI: 20.7–37.9, p<0.0001) and PPV (1.60%, 1.22–2.10%, p<0.0001) for bladder cancer. The results suggest that GPs make strong clinical judgements about the probable significance of symptoms – preferentially coding clinical features they consider significant to a diagnosis, while using text to record those that they think are not.
21

Stojanovic, Milan. "Teknik för dokumentering av möten och konferenser." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-247247.

Full text
Abstract:
Documentation of meetings and conferences is performed at most companies by one or more people sitting at a computer and typing what has been said during the meeting. This may lead to typing mistakes or to things being incorrectly perceived by the person taking the notes; the human factor is quite large. This work focuses on developing proposals for new technologies that reduce or eliminate the human factor and thus improve the documentation of meetings and conferences. This is a problem for many companies and institutions, including Seavus Stockholm, where this study is conducted. It is assumed that most companies do not document their meetings and conferences in video or audio format, so this study is only concerned with text-based documentation. The aim of this study was to investigate how to implement new features and build a modern conference system, using modern technologies and new applications, to improve the documentation of meetings and conferences. Speech-to-text in combination with speaker recognition has not yet been applied for such a purpose, and it can facilitate documenting meetings and conferences. To complete the study, several methods were combined to achieve the desired goals. First, the project's scope and objectives were defined. Then, based on analysis of observations of the company's documentation process, a design proposal was created. Following this, interviews with the stakeholders were conducted where the proposals were presented, and a requirement specification was created. The theory was then studied to create an understanding of how different techniques work, in order to design and create a proposal for the architecture. The result of this study contains a proposal for an architecture which shows that it is possible to implement these techniques to improve the documentation process. Furthermore, possible use cases and interaction diagrams are presented that show how the system may work. Although the proof of concept is considered satisfactory, additional work and testing are needed to fully implement and integrate the concept in practice.
22

Gerrish, Charlotte. "European Copyright Law and the Text and Data Mining Exceptions and Limitations : With a focus on the DSM Directive, is the EU Approach a Hindrance or Facilitator to Innovation in the Region?" Thesis, Uppsala universitet, Juridiska institutionen, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-385195.

Full text
Abstract:
We are in a digital age with Big Data at the heart of our global online environment. Exploiting Big Data by manual means is virtually impossible. We therefore need to rely on innovative methods such as Machine Learning and AI to allow us to fully harness the value of Big Data available in our digital society. One of the key processes allowing us to innovate using new technologies such as Machine Learning and AI is by the use of TDM which is carried out on large volumes of Big Data. Whilst there is no single definition of TDM, it is universally acknowledged that TDM involves the automated analytical processing of raw and unstructured data sets through sophisticated ICT tools in order to obtain valuable insights for society or to enable efficient Machine Learning and AI development. Some of the source text and data on which TDM is performed is likely to be protected by copyright, which creates difficulties regarding the balance between the exclusive rights of copyright holders, and the interests of innovators developing TDM technologies and performing TDM, for both research and commercial purposes, who need as much unfettered access to source material in order to create the most performant AI solutions. As technology has grown so rapidly over the last few decades, the copyright law framework must adapt to avoid becoming redundant. This paper looks at the European approach to copyright law in the era of Big Data, and specifically its approach to TDM exceptions in light of the recent DSM Directive, and whether this approach has been, or is, a furtherance or hindrance to innovation in the EU.
23

Eriksson, Ruth, and Miranda Luis Galaz. "Ett digitalt läromedel för barn med lässvårigheter." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-189205.

Full text
Abstract:
The digital age is changing society. New technology provides opportunities to produce and organize knowledge in new ways. The technology available in schools today can also be used to optimize literacy training for students with reading difficulties. This thesis examines how digital teaching material for literacy training for children with reading difficulties can be designed and implemented, and shows that this is possible to achieve. Digital learning material of good quality should be based on a scientifically accepted method of literacy training. This thesis uses Gunnel Wendick's training model, which is already used by many special education teachers. The training model is currently used with word lists, without computers, tablets or the like. We analyze Wendick's training model and employ it, in a creative way, to design a digital equivalent of the original model. Our goal is to create digital learning material that implements Wendick's training model and thus makes it possible to use it on various smart devices. With this we hope to facilitate the work of both special education teachers and children with reading difficulties, and to make the procedures more appealing and creative. In our study, we examine various technical possibilities for implementing Wendick's training model. We choose to create a prototype of a web application, with suitable functionality for administrators, special education teachers and students. The prototype's functionality can be divided into two parts: the administrative part and the exercise part. The administrative part covers the user interface and functionality for handling students and other relevant data. The exercise part includes training views and their functionality. The exercises are intended to train the auditory channel, phonological awareness, with the goal of reading accurately, and orthographic decoding, with the goal that students should automate their decoding, that is, perceive words as an image. In developing the digital teaching material, we used proven software engineering principles and implementation techniques. We compiled high-level requirements, modelled the domain and defined appropriate use cases. To implement the application, we used the Java EE platform, the Web Speech API, PrimeFaces, and more. Our prototype is a good start that invites further development, with the hope that a full web application will be created that will transform practices in our schools.
24

陳我智 and Ngor-chi Chan. "Text-to-speech conversion for Putonghua." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1990. http://hub.hku.hk/bib/B31209580.

Full text
25

Almqvist, Daniel, and Magnus Jansson. "Förbättrat informationsflöde med hjälp av Augmented Reality." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177463.

Full text
Abstract:
Augmented Reality is a technology in which digital objects are overlaid on a picture or similar media using the camera of a mobile device. There are several different ways to use Augmented Reality technology, so research in the field has been carried out. An example of an area where the technology can be used is advertising. Advertising is something everyone is confronted with daily, but it is often seen as boring or is something many people do not even notice. Through an Augmented Reality prototype, users can register both patterns and speech and fetch the required data from a database. The prototype can create an interactive event that displays information in a unique way, where everyone, including people with disabilities, can access information they usually cannot. This interactive event also gives life to previously tedious advertising or information posters. The result of the report is a prototype on the Android mobile platform that uses Augmented Reality technology and has many features. It can use voice recognition and keywords to access additional information about a keyword. The testing of this prototype shows that many people are in favour of using it and see it as an interesting way to get information. They are therefore willing to use the application themselves to get their own advertising out in a unique and appealing way.
26

Luffarelli, Marta. "A text mining approach to materiality assessment." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amslaurea.unibo.it/23127/.

Full text
Abstract:
Companies worldwide currently put significant effort into performing materiality analysis, whose aim is to explain corporate sustainability in an annual report. Materiality reflects the most important social, economic and environmental issues for a company and its stakeholders. Many studies and standards have been proposed to establish the main steps to follow in identifying the specific topics to be included in a sustainability report. However, few existing quantitative and structured approaches help in understanding how to deal with the identified topics and how to prioritise them so as to effectively present the most valuable ones. Moreover, traditional approaches involve a long and complex procedure in which many people have to be reached and interviewed, and several companies' reports have to be read, to extract the material topics to be discussed in the sustainability report. This dissertation proposes an automated mechanism for gathering the opinions of stakeholders and the company in order to identify relevant issues. To accomplish this, text mining techniques are used to analyse textual documents written by either a stakeholder or the reporting company. A measure of how much each document deals with a set of defined topics is then extracted. This information is finally used to prioritise topics according to how much the author's opinion matters. The entire work is based on a real case study in the telecommunications domain.
APA, Harvard, Vancouver, ISO, and other styles
27

Wu, Qinyi. "Partial persistent sequences and their applications to collaborative text document editing and processing." Diss., Georgia Institute of Technology, 2011. http://hdl.handle.net/1853/44916.

Full text
Abstract:
In a variety of text document editing and processing applications, it is necessary to keep track of the revision history of text documents by recording changes and the metadata of those changes (e.g., user names and modification timestamps). Recent Web 2.0 document editing and processing applications, such as real-time collaborative note taking and wikis, require fine-grained shared access to collaborative text documents as well as efficient retrieval of metadata associated with different parts of collaborative text documents. Current revision control techniques only support coarse-grained shared access and are inefficient at retrieving metadata of changes at the sub-document granularity. In this dissertation, we design and implement partial persistent sequences (PPSs) to support real-time collaboration and manage metadata of changes at fine granularities for collaborative text document editing and processing applications. As a persistent data structure, PPSs have two important features. First, items in the data structure are never removed. We maintain the necessary timestamp information to keep track of both inserted and deleted items and use the timestamp information to reconstruct the state of a document at any point in time. Second, PPSs create unique, persistent, and ordered identifiers for items of a document at fine granularities (e.g., a word or a sentence). As a result, we are able to support consistent and fine-grained shared access to collaborative text documents by detecting and resolving editing conflicts based on the revision history, as well as to efficiently index and retrieve metadata associated with different parts of collaborative text documents. We demonstrate the capabilities of PPSs through two important problems in collaborative text document editing and processing applications: data consistency control and fine-grained document provenance management. The first problem studies how to detect and resolve editing conflicts in collaborative text document editing systems. We approach this problem in two steps. In the first step, we use PPSs to capture data dependencies between different editing operations and define a consistency model more suitable for real-time collaborative editing systems. In the second step, we extend our work to the entire spectrum of collaborations and adapt transactional techniques to build a flexible framework for the development of various collaborative editing systems. The generality of this framework is demonstrated by its capability to specify three different types of collaboration, as exemplified by the systems RCS, MediaWiki, and Google Docs respectively. We precisely specify the programming interfaces of this framework and describe a prototype implementation over Oracle Berkeley DB High Availability, a replicated database management engine. The second problem, fine-grained document provenance management, studies how to efficiently index and retrieve fine-grained metadata for different parts of collaborative text documents. We use PPSs to design both disk-economic and computation-efficient techniques to index provenance data for millions of Wikipedia articles. Our approach is disk-economic because we only save a few full versions of a document and only keep delta changes between those full versions. Our approach is also computation-efficient because we avoid the need to parse the revision history of collaborative documents to retrieve fine-grained metadata. Compared to MediaWiki, the revision control system for Wikipedia, our system uses less than 10% of the disk space and achieves at least an order of magnitude speed-up in retrieving fine-grained metadata for documents with thousands of revisions.
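The abstract describes the two defining properties of a PPS but not its implementation. The following minimal sketch, written for illustration only, keeps every inserted item, marks deletions with a timestamp instead of physically removing them, and reconstructs the visible document at any point in time; the class and field names are hypothetical.

```python
import time
from typing import List, Optional

class PartialPersistentSequence:
    """Minimal sketch of a partial persistent sequence (PPS).

    Items are never removed; deletions only record a timestamp, so the
    document state at any point in time can be reconstructed.
    """

    def __init__(self):
        self._items = []        # kept in document order, never shrinks
        self._next_uid = 0      # persistent, ordered identifier

    def insert(self, pos: int, value: str, ts: Optional[float] = None) -> int:
        ts = ts if ts is not None else time.time()
        uid = self._next_uid
        self._next_uid += 1
        # pos is an index among currently visible (non-deleted) items.
        visible = [i for i, it in enumerate(self._items) if it["deleted_at"] is None]
        real_pos = visible[pos] if pos < len(visible) else len(self._items)
        self._items.insert(real_pos, {"uid": uid, "value": value,
                                      "inserted_at": ts, "deleted_at": None})
        return uid

    def delete(self, uid: int, ts: Optional[float] = None) -> None:
        ts = ts if ts is not None else time.time()
        for it in self._items:
            if it["uid"] == uid:
                it["deleted_at"] = ts   # mark as deleted, never remove
                return

    def state_at(self, ts: float) -> List[str]:
        """Reconstruct the visible document at timestamp ts."""
        return [it["value"] for it in self._items
                if it["inserted_at"] <= ts
                and (it["deleted_at"] is None or it["deleted_at"] > ts)]
```

For example, after inserting three words and later deleting one, `state_at` with a timestamp taken before the deletion still reproduces the earlier document state, which is the behaviour the abstract attributes to PPSs.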
APA, Harvard, Vancouver, ISO, and other styles
28

Pilipiec, Patrick. "Using Machine Learning to Understand Text for Pharmacovigilance: A Systematic Review." Thesis, Luleå tekniska universitet, Institutionen för system- och rymdteknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-83458.

Full text
Abstract:
Background: Pharmacovigilance is a science that involves the ongoing monitoring of adverse drug reactions of existing medicines. Its primary purpose is to sustain and improve public health. The existing systems that apply the science of pharmacovigilance in practice are, however, not only expensive and time-consuming, but they also fail to include experiences from many users. The application of computational linguistics to user-generated text is hypothesized to be a proactive and effective supplemental source of evidence. Objective: To review the existing evidence on the effectiveness of computational linguistics in understanding user-generated text for the purpose of pharmacovigilance. Methodology: A broad and multi-disciplinary systematic literature search was conducted that involved four databases. Studies were considered relevant if they reported on the application of computational linguistics to understand text for pharmacovigilance. Both peer-reviewed journal articles and conference materials were included. The PRISMA guidelines were used to evaluate the quality of this systematic review. Results: A total of 16 relevant publications were included in this systematic review. All studies were evaluated to have medium reliability and validity. Despite the quality, for all types of drugs, the vast majority of publications reported positive findings with respect to the identification of adverse drug reactions. The remaining two studies reported rather neutral results but acknowledged the potential of computational linguistics for pharmacovigilance. Conclusions: There is consistent evidence that computational linguistics can be used effectively and accurately on user-generated textual content published on the Internet to identify adverse drug reactions for the purpose of pharmacovigilance. The evidence suggests that the analysis of textual data has the potential to complement the traditional system of pharmacovigilance. Recommendations for researchers and practitioners of computational linguistics, policy makers, and users of drugs are provided.
APA, Harvard, Vancouver, ISO, and other styles
29

Pafilis, Evangelos. "Web-based named entity recognition and data integration to accelerate molecular biology research." [S.l. : s.n.], 2008. http://nbn-resolving.de/urn:nbn:de:bsz:16-opus-89706.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Katerattanakul, Nitsawan. "A pilot study in an application of text mining to learning system evaluation." Diss., Rolla, Mo. : Missouri University of Science and Technology, 2010. http://scholarsmine.mst.edu/thesis/pdf/Katerattanakul_09007dcc807b614f.pdf.

Full text
Abstract:
Thesis (M.S.)--Missouri University of Science and Technology, 2010.
Vita. The entire thesis text is included in the file. Title from title screen of thesis/dissertation PDF file (viewed June 19, 2010). Includes bibliographical references (p. 72-75).
APA, Harvard, Vancouver, ISO, and other styles
31

Adapa, Supriya. "TensorFlow Federated Learning: Application to Decentralized Data." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.

Find full text
Abstract:
Machine learning is a complex discipline, but implementing machine learning models is far less daunting than it used to be, thanks to frameworks such as Google's TensorFlow Federated that ease the process of acquiring data, training models, serving predictions, and refining future results. There are an estimated 3 billion smartphones in the world and 7 billion connected devices. These phones and devices are constantly generating new data. Traditional analytics and machine learning need that data to be centrally collected before it is processed to yield insights, ML models, and ultimately better products. This centralized approach can be problematic if the data is sensitive or expensive to centralize. Would it not be better if we could run the data analysis and machine learning right on the devices where that data is generated, and still be able to aggregate what has been learned? TensorFlow Federated (TFF) is an open-source framework for experimenting with machine learning and other computations on decentralized data. It implements an approach called Federated Learning (FL), which enables many participating clients to train shared ML models while keeping their data locally. TFF was designed based on Google's experience developing federated learning technology, where it powers ML models for mobile keyboard predictions and on-device search. With TFF, a flexible, open framework for locally simulating decentralized computations is available to all TensorFlow users. Using Twitter datasets, we performed text classification of positive and negative tweets from a Twitter account with machine learning.
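The TFF API itself is not shown in the abstract, so rather than guess at library calls, the sketch below illustrates the underlying Federated Averaging idea in plain NumPy: each simulated client trains a small logistic regression model on its own private data, and the server only aggregates the resulting weights, proportionally to client data size. The model, data and function names are invented for illustration and are not the thesis's implementation.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training step: logistic regression via gradient descent."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))      # sigmoid predictions
        grad = X.T @ (preds - y) / len(y)         # cross-entropy gradient
        w -= lr * grad
    return w

def federated_averaging(clients, rounds=10, dim=20):
    """Server loop: clients never share raw data, only updated weights."""
    global_w = np.zeros(dim)
    for _ in range(rounds):
        updates, sizes = [], []
        for X, y in clients:                      # (features, labels) per device
            updates.append(local_update(global_w, X, y))
            sizes.append(len(y))
        # Weighted average of client models, proportional to local data size.
        global_w = np.average(np.stack(updates), axis=0, weights=np.array(sizes, float))
    return global_w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two simulated devices with small private datasets (e.g., tweet feature vectors).
    clients = [(rng.normal(size=(50, 20)), rng.integers(0, 2, 50)) for _ in range(2)]
    print(federated_averaging(clients)[:5])
```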
APA, Harvard, Vancouver, ISO, and other styles
32

Strandberg, Per Erik. "On text mining to identify gene networks with a special reference to cardiovascular disease." Thesis, Linköping University, The Department of Physics, Chemistry and Biology, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-2810.

Full text
Abstract:

The rate at which articles get published grows exponentially, and the possibility of accessing texts in machine-readable formats is also increasing. The need for an automated system to gather relevant information from text, text mining, is thus growing.

The goal of this thesis is to find a biologically relevant gene network for atherosclerosis, the main cause of cardiovascular disease, by inspecting gene cooccurrences in abstracts from PubMed. In addition to this, gene nets for yeast were generated to evaluate the validity of using text mining as a method.

The nets found were validated in many ways; for example, they were found to have the well-known power-law link distribution. They were also compared to other gene nets generated by other, often microbiological, methods from different sources. In addition to classic measurements of similarity like overlap, precision, recall and f-score, a new way to measure similarity between nets is proposed and used. The method uses an urn approximation and measures, in standard deviations, the distance of an observed overlap from that expected when comparing two unrelated nets. The validity of this approximation is supported both analytically and with simulations, for both Erdős-Rényi nets and nets having a power-law link distribution. The new method shows that very poor overlap, precision, recall and f-score can still be very far from random, and also how much overlap one could expect at random. The choice of cutoff was also investigated.

Results typically show an overlap of only about 1%, but at the remarkable distance of 100 standard deviations from what one could have expected at random. Of particular interest is that one can only expect an overlap of 2 edges, with a variance of 2, when comparing two trees over the same set of nodes. The use of a cutoff of one for cooccurrence graphs is discussed and motivated by, for example, the observation that this eliminates about 60-70% of the false positives but only 20-30% of the overlapping edges. This thesis shows that text mining of PubMed can be used to generate a biologically relevant gene subnet of the human gene net. A reasonable extension of this work is to combine the nets with gene expression data to find a more reliable gene net.
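The thesis's exact derivation is not reproduced here; as a hedged sketch of the urn approximation mentioned above, the snippet below uses the standard hypergeometric moments to express an observed edge overlap as a distance in standard deviations from the random expectation. With two trees over the same n nodes (m1 = m2 = n - 1) the expected overlap comes out close to 2 edges, matching the figure quoted in the abstract; the exact variant used in the thesis may differ.

```python
from math import comb, sqrt

def overlap_distance(n_nodes, m1, m2, observed_overlap):
    """Distance (in standard deviations) of an observed edge overlap from the
    overlap expected when two nets with m1 and m2 edges are drawn at random
    from the same node set (hypergeometric / urn approximation)."""
    N = comb(n_nodes, 2)                                  # number of possible edges
    expected = m1 * m2 / N
    variance = m1 * m2 * (N - m1) * (N - m2) / (N * N * (N - 1))
    return expected, variance, (observed_overlap - expected) / sqrt(variance)

# Example: two sparse nets over 1000 genes sharing only a handful of edges
# can still sit far from the random expectation.
print(overlap_distance(n_nodes=1000, m1=2000, m2=1500, observed_overlap=60))
```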

APA, Harvard, Vancouver, ISO, and other styles
33

Thun, Anton. "Matching Job Applicants to Free Text Job Ads Using Semantic Networks and Natural Language Inference." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-281250.

Full text
Abstract:
Automated e-recruitment systems have been a focus for research in the past decade due to the amount of work required to screen suitable applicants for a job post, whose resumés and job ads are commonly submitted as free text. While recruitment organizations have data covering applicant resumés and job ad descriptions, resumés are often confidential, limiting the use of direct deep learning methods. This presents an issue where traditional data-agnostic methods have greater potential to achieve good results in determining applicant suitability for a given job post. However, with the advent of transfer learning methods, it is possible to train language models independently of the task at hand, and thus independently of the available data. In this report, a language model fine-tuned on Natural Language Inference (NLI) via cross-lingual transfer learning is used for the job matching task. This is compared to a semantic method that uses Swedish taxonomies to construct networks with hierarchical and synonymy relations. As NLI may be applied to arbitrary sentence pairs, the use of text segmentation to enhance the methods' performance is also examined. The results show that the NLI approach is significantly better than a random suitability classifier, but is outperformed by the semantic method, which achieved 34% better performance on the dataset used. The use of text segmentation had a negligible effect on overall performance, but was shown to improve the ranking of the top-most suitable applicants in relation to manual expert relevance scores.
Automated e-recruitment systems have been a research focus over the past decade because of the amount of work required to screen suitable applicants for a job post, whose CVs and job ads are usually submitted as free text. While recruitment organisations hold data covering applicants' CVs and job ad descriptions, CVs are usually confidential, which limits the direct use of deep learning methods. This leads to a problem where traditional data-agnostic methods have greater potential to achieve good results in determining how suitable an applicant is for a given job post. However, with the advent of transfer learning it is possible to train language models independently of the actual task, and hence independently of the available data. In this report, a language model fine-tuned on Natural Language Inference (NLI) via cross-lingual transfer learning is used for the job matching problem. It is compared against a semantic method that uses Swedish taxonomies to construct networks with hierarchical and synonymy relations. Since NLI can be applied to arbitrary sentence pairs, text segmentation is also examined as a way to improve the methods' performance. The results show that the NLI method is significantly better than a random suitability classifier, but is outperformed by the semantic method, which performed 34% better on the dataset used. The use of text segmentation had a negligible effect on overall performance, but was shown to achieve better ranking of the most suitable applicants relative to expert assessments of their relevance.
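Neither the thesis code nor the Swedish taxonomies are reproduced here. The toy sketch below only illustrates the general idea of matching over a semantic network with synonymy and hierarchical relations: a job-ad requirement counts as covered when the resumé mentions the same concept, a synonym, or a narrower (child) concept. All terms, relations and scores are invented.

```python
# Toy taxonomy: child -> parent (hierarchical) and term -> canonical (synonymy).
PARENT = {"python": "programming", "java": "programming",
          "programming": "it skills", "sql": "databases", "databases": "it skills"}
SYNONYM = {"py": "python", "rdbms": "databases"}

def normalise(term):
    return SYNONYM.get(term, term)

def ancestors(term):
    """All broader concepts of a term, following the hierarchy upwards."""
    out, cur = set(), normalise(term)
    while cur in PARENT:
        cur = PARENT[cur]
        out.add(cur)
    return out

def covers(resume_terms, requirement):
    """A requirement is covered by an exact/synonym match or by any resume
    term that is a narrower concept of the requirement."""
    req = normalise(requirement)
    return any(t == req or req in ancestors(t) for t in map(normalise, resume_terms))

def match_score(resume_terms, ad_requirements):
    hits = sum(covers(resume_terms, r) for r in ad_requirements)
    return hits / max(len(ad_requirements), 1)

print(match_score(["py", "sql"], ["programming", "databases", "leadership"]))
# ~0.67: 'py' normalises to python (a child of programming), 'sql' to databases.
```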
APA, Harvard, Vancouver, ISO, and other styles
34

Kongthon, Alisa. "A text mining framework for discovering technological intelligence to support science and technology management." Diss., Available online, Georgia Institute of Technology, 2004:, 2004. http://etd.gatech.edu/theses/available/etd-04052004-162415/unrestricted/kongthon%5Falisa%5F200405%5Fphd.pdf.

Full text
Abstract:
Thesis (Ph. D.)--Industrial and Systems Engineering, Georgia Institute of Technology, 2004.
Zhu, Donghua, Committee Member ; Cozzens, Susan, Committee Member ; Huo, Xiaoming, Committee Member ; Porter, Alan, Committee Chair ; Lu, Jye-Chyi, Committee Member. Vita. Includes bibliographical references (leaves 191-195).
APA, Harvard, Vancouver, ISO, and other styles
35

Courseault, Cherie Renee. "A Text Mining Framework Linking Technical Intelligence from Publication Databases to Strategic Technology Decisions." Diss., Georgia Institute of Technology, 2004. http://hdl.handle.net/1853/5214.

Full text
Abstract:
This research developed a comprehensive methodology to quickly monitor key technical intelligence areas and provided a method that cleanses and consolidates information into an understandable, concise picture of topics of interest, thus bridging the concerns of technology management and text mining. This research evaluated and altered some existing analysis methods and developed an overall framework for answering technical intelligence questions. A six-step approach worked through the various stages of the Intelligence and Text Data Mining processes to address issues that hindered the use of Text Data Mining in the Intelligence Cycle and the actual use of that intelligence in making technology decisions. A questionnaire given to 34 respondents from four different industries identified the information most important to decision-makers as well as clusters of common interests. A bibliometric/text mining tool applied to journal publication databases profiled technology trends and presented that information in the context of the stated needs from the questionnaire. In addition to identifying the information that is important to decision-makers, this research improved the methods for analyzing information. An algorithm was developed that removed common non-technical terms and delivered at least an 89% precision rate in identifying synonymous terms. Such identifications are important for improving accuracy when mining free text, thus enabling the provision of the more specific information desired by the decision-makers. This level of precision was consistent across five different technology areas and three different databases. The result is the ability to use abstract phrases in analysis, which allows the more detailed nature of abstracts to be captured in clustering, while portraying the broad relationships as well.
APA, Harvard, Vancouver, ISO, and other styles
36

Wigington, Curtis Michael. "End-to-End Full-Page Handwriting Recognition." BYU ScholarsArchive, 2018. https://scholarsarchive.byu.edu/etd/7099.

Full text
Abstract:
Despite decades of research, offline handwriting recognition (HWR) of historical documents remains a challenging problem which, if solved, could greatly improve the searchability of online cultural heritage archives. Historical documents are plagued with noise, degradation, ink bleed-through, overlapping strokes, variation in slope and slant of the writing, and inconsistent layouts. Often the documents in a collection have been written by thousands of authors, all of whom have significantly different writing styles. In order to better capture the variations in writing styles, we introduce a novel data augmentation technique. This method achieves state-of-the-art results on modern datasets written in English and French and on a historical dataset written in German. HWR models are often limited by the accuracy of the preceding steps of text detection and segmentation. Motivated by this, we present a deep learning model that jointly learns text detection, segmentation, and recognition using mostly images without detection or segmentation annotations. Our Start, Follow, Read (SFR) model is composed of a Region Proposal Network to find the start position of handwriting lines, a novel line follower network that incrementally follows and preprocesses lines of (perhaps curved) handwriting into dewarped images, and a CNN-LSTM network to read the characters. SFR exceeds the performance of the winner of the ICDAR 2017 handwriting recognition competition, even when not using the provided competition region annotations.
APA, Harvard, Vancouver, ISO, and other styles
37

Chiarella, Andrew Francesco 1971. "Enabling the collective to assist the individual : a self-organising systems approach to social software and the creation of collaborative text signals." Thesis, McGill University, 2008. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=115618.

Full text
Abstract:
Authors augment their texts using devices such as bold and italic typeface to signal important information to the reader. These typographical text signals are an example of a signal designed to have some effect on others. However, some signals emerge through the unplanned, indirect, and collective efforts of a group of individuals. Paths emerge in parks without having been designed by anyone. Objects accumulate wear patterns that signal how others have interacted with the object. Books open to important, well-studied pages because the spine has worn, for example (Hill, Hollan, Wroblewski, & McCandless, 1992). Digital text and the large-scale collaboration made possible through the internet provide an opportunity to examine how unplanned, collaborative text signals could emerge. A software application, called CoREAD, was designed that enables readers to highlight sections of the text they deem important. In addition, CoREAD adds text signals to the text using font colour, based on the group's collective history and an aggregation function based on self-organising systems. The readers are potentially influenced by the text signals presented by CoREAD but also help to modify these same signals. Importantly, readers only interact with each other indirectly through the text. The design of CoREAD was greatly inspired by previous work on history-enriched digital objects (Hill & Hollan, 1993), and at a more general level it can be viewed as an example of distributed cognition (Hollan, Hutchins, & Kirsh, 2000).
Forty undergraduate students read two texts on topics from psychology using CoREAD. Students were asked to read each text in order to write a summary of it. After each new student read the text, the text signals were changed to reflect the current group of students. As such, each student read the text with different text signals presented.
The data were analysed for each text to determine if the text signals that emerged were stable and valid representations of the semantic content of the text. As well, the students' summaries were analysed to determine if students who read the text after the text signals had stabilised produced better summaries. Three methods demonstrated that CoREAD was capable of generating stable typographical text signals. The high importance text signals also appeared to capture the semantic content of the texts. For both texts, a summary made of the high signals performed as well as a benchmark summary. The results did not suggest that the stable text signals assisted readers to produce better summaries, however. Readers may not respond to these collaborative text signals as they would to authorial text signals, which previous research has shown to be beneficial (Lorch, 1989). The CoREAD project has demonstrated that readers can produce stable and valid text signals through an unplanned, self-organising process.
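CoREAD's actual aggregation function is not given in the abstract. The sketch below is a hypothetical version of the general mechanism described: each reader's highlights reinforce a per-sentence signal strength, older contributions decay, and the accumulated strengths are binned into a few discrete levels that could drive font colour. The decay and reinforcement parameters are illustrative assumptions.

```python
def update_signals(strengths, highlighted, decay=0.9, reinforcement=1.0):
    """One self-organising update after a reader finishes the text.

    strengths   : list of floats, one accumulated value per sentence
    highlighted : set of sentence indices this reader marked as important
    """
    return [decay * s + (reinforcement if i in highlighted else 0.0)
            for i, s in enumerate(strengths)]

def signal_levels(strengths, n_levels=3):
    """Bin accumulated strengths into discrete levels (e.g., font colours)."""
    top = max(strengths) or 1.0
    return [min(int(n_levels * s / top), n_levels - 1) for s in strengths]

# Simulate a few readers of a 6-sentence text.
strengths = [0.0] * 6
for marks in [{1, 4}, {1, 2}, {1, 4, 5}]:
    strengths = update_signals(strengths, marks)
print(signal_levels(strengths))   # sentence 1 ends up with the strongest signal
```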
APA, Harvard, Vancouver, ISO, and other styles
38

Levefeldt, Christer. "Evaluation of NETtalk as a means to extract phonetic features from text for synchronization with speech." Thesis, University of Skövde, Department of Computer Science, 1998. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-173.

Full text
Abstract:

The background for this project is a wish to automate synchronization of text and speech. The idea is to present speech through speakers synchronized word-for-word with text appearing on a monitor.

The solution decided upon is to use artificial neural networks (ANNs) to convert both text and speech into streams made up of sets of phonetic features, and then to match these two streams against each other. Several text-to-feature ANN designs based on the NETtalk system are implemented and evaluated. The extraction of phonetic features from speech and the synchronization itself are not implemented, but some assessments are made regarding their possible performance. The performance of a finished system cannot be determined, but a NETtalk-based ANN is believed to be suitable for such a system using phonetic features for synchronization.

APA, Harvard, Vancouver, ISO, and other styles
39

Dall, Rasmus. "Statistical parametric speech synthesis using conversational data and phenomena." Thesis, University of Edinburgh, 2017. http://hdl.handle.net/1842/29016.

Full text
Abstract:
Statistical parametric text-to-speech synthesis currently relies on predefined and highly controlled prompts read in a "neutral" voice. This thesis presents work on utilising recordings of free conversation for the purpose of filled pause synthesis and as an inspiration for improved general modelling of speech for text-to-speech synthesis purposes. A corpus of both standard prompts and free conversation is presented, and the potential usefulness of conversational speech as the basis for text-to-speech voices is validated. Additionally, through psycholinguistic experimentation it is shown that filled pauses can have potential subconscious benefits to the listener, but that current text-to-speech voices cannot replicate these effects. A method for pronunciation variant forced alignment is presented in order to obtain a more accurate automatic speech segmentation, something which is particularly poor for spontaneously produced speech. This pronunciation variant alignment is utilised not only to create a more accurate underlying acoustic model, but also as the driving force behind creating more natural pronunciation prediction at synthesis time. While this improves both the standard and spontaneous voices, the naturalness of voices based on spontaneous speech still lags behind the quality of voices based on standard read prompts. Thus, the synthesis of filled pauses is investigated in relation to specific phonetic modelling of filled pauses and through techniques for mixing standard prompts with spontaneous utterances, in order to retain the higher quality of voices based on standard speech while still utilising the spontaneous speech for filled pause modelling. A method for predicting where to insert filled pauses in the speech stream is also developed and presented, relying on an analysis of human filled pause usage and a mix of language modelling methods. The method achieves an insertion accuracy in close agreement with human usage. The various approaches are evaluated and their improvements documented throughout the thesis; at the end, the resulting filled pause quality is assessed through a repetition of the psycholinguistic experiments and an evaluation of the compilation of all developed methods.
APA, Harvard, Vancouver, ISO, and other styles
40

Abade, André da Silva. "Uma abordagem de teste estrutural de uma transformações M2T baseada em hipergrafos." Universidade Federal de São Carlos, 2016. https://repositorio.ufscar.br/handle/ufscar/8721.

Full text
Abstract:
No funding was received.
Context: MDD (Model-Driven Development) is a software development paradigm in which the main artefacts are models, from which source code or other artefacts are generated. Even though MDD allows different views of how to decompose a problem and how to design software to solve it, this paradigm introduces new challenges related to the input models, the transformations and the output artefacts. Problem statement: Software testing is thus a fundamental activity to reveal defects and improve confidence in the software products developed in this context. Several testing techniques and criteria have been proposed and investigated. Among them, functional testing has been extensively explored, primarily for M2M (Model-to-Model) transformations, while structural testing of M2T (Model-to-Text) transformations still poses challenges and lacks appropriate approaches. Objective: This work presents a proposal for the structural testing of M2T transformations through the characterisation of the input models as complex data, together with the templates and output artefacts involved in this process. Method: The proposed approach was organised in five phases. Its strategy proposes that the complex data (grammars and metamodels) be represented by directed hypergraphs, allowing a combinatorial traversal algorithm to create subsets of the input models that are used as test cases for the M2T transformations. From this perspective, we carried out two exploratory studies with the specific purpose of analysing the feasibility of the proposed approach. Results and conclusion: The evaluation of the results of the exploratory studies, through the analysis of several testing coverage criteria, demonstrated the relevance and feasibility of the approach for characterising complex data for M2T transformation testing. Moreover, structuring the testing strategy in phases enables the revision and adjustment of activities, in addition to assisting the replication of the approach within different applications that make use of the MDD paradigm.
Context: MDD (Model-Driven Development) is a software development paradigm in which the main artefacts are models, from which code or other artefacts are generated. Although this paradigm enables different views of how to decompose a problem and design software to solve it, it introduces new challenges arising from the complexity of the input models, the transformations and the output artefacts. Problem statement: Software testing is therefore a fundamental activity to reveal defects and increase confidence in software products developed in this context. Several testing techniques and criteria have been proposed and investigated. Among them, functional testing has been widely explored, primarily for M2M (Model-to-Model) transformations, while structural testing of M2T (Model-to-Text) transformations still poses challenges and lacks new approaches. Objectives: The objective of this work is to present a proposal for the structural testing of M2T transformations through the characterisation of the complex data of the input models, templates and output artefacts involved in this process. Methodology: The proposed approach was organised in five phases, and its strategy proposes that the complex data (grammars and metamodels) be represented by directed hypergraphs, allowing a combinatorial hypergraph-traversal algorithm to create subsets of the input models to be used as test cases for the M2T transformations. From this perspective, two exploratory studies were conducted with the specific purpose of analysing the feasibility of the proposed approach. Results: The evaluation of the exploratory studies provided, through the analysis of the applied coverage criteria, a set of data demonstrating the relevance and feasibility of the approach regarding the characterisation of complex data for testing M2T transformations. Segmenting the strategy into phases makes it possible to review and adjust the activities of the process, besides assisting the replicability of the approach in different applications that use the MDD paradigm.
APA, Harvard, Vancouver, ISO, and other styles
41

Mhlana, Siphe. "Development of isiXhosa text-to-speech modules to support e-Services in marginalized rural areas." Thesis, University of Fort Hare, 2011. http://hdl.handle.net/10353/495.

Full text
Abstract:
Information and Communication Technology (ICT) projects are being initiated and deployed in marginalized areas to help improve the standard of living for community members. This has led to a new field, responsible for information processing and knowledge development in rural areas, called Information and Communication Technology for Development (ICT4D). An ICT4D project has been implemented in a marginalized area called Dwesa; this is a rural area situated on the Wild Coast of the former homeland of Transkei, in the Eastern Cape Province of South Africa. In this rural community there are e-Service projects which have been developed and deployed to support the already existing ICT infrastructure. Some of these projects include an e-Commerce platform, an e-Judiciary service, e-Health and an e-Government portal. Although these projects are deployed in this area, community members face a language and literacy barrier because these services are typically accessed through English textual interfaces. This becomes a challenge because their language of communication is isiXhosa and some of the community members are illiterate. Most of the rural areas consist of illiterate people who cannot read and write isiXhosa but can only speak the language. This problem of illiteracy in rural areas affects both the youth and the elderly. This research seeks to design, develop and implement software modules that can be used to convert isiXhosa text into natural-sounding isiXhosa speech. Such an application is called a Text-to-Speech (TTS) system. The main objective of this research is to improve the usability of ICT4D eServices through the development of an isiXhosa Text-to-Speech system. This research is undertaken within the context of the Siyakhula Living Lab (SLL), an ICT4D intervention towards improving the lives of rural communities of South Africa in an attempt to bridge the digital divide. The developed TTS modules were subsequently tested to determine their applicability for improving eServices usability. The results show acceptable levels of usability for the audio utterances produced by the isiXhosa Text-to-Speech system for marginalized areas.
APA, Harvard, Vancouver, ISO, and other styles
42

Dol, Zulkifli. "A strategy for a systematic approach to biomarker discovery validation : a study on lung cancer microarray data set." Thesis, University of Manchester, 2015. https://www.research.manchester.ac.uk/portal/en/theses/a-strategy-for-a-systematic-approach-to-biomarker-discovery-validation--a-study-on-lung-cancer-microarray-data-set(8e439385-27d1-44ac-8b20-259b4a8f6716).html.

Full text
Abstract:
Cancer is a serious threat to human health and is now one of the major causes of death worldwide. However, the complexity of cancer makes the development of new and specific diagnostic tools particularly challenging. A number of different strategies have been developed for biomarker discovery in cancer using microarray data. The problem that typically needs to be addressed is the scale of the data sets; we simply do not have (or are likely to obtain) sufficient data for classical machine learning approaches to biomarker discovery to be properly validated. Obtaining a biomarker that is specific to a particular cancer is also very challenging. The initial promise that was held out for gene microarray work in the development of cancer biomarkers has not yet yielded the hoped-for breakthroughs. This work discusses the construction of a strategy for a systematic approach to biomarker discovery validation using lung cancer gene expression microarray data, based on non-small cell cancer and on patients who either stayed disease-free after surgery (a five-year window) or in whom the disease progressed and recurred. To assist validation, we have therefore looked at new methodologies for using existing biological knowledge to support machine learning biomarker discovery techniques. We employ a text mining strategy using previously published literature to correlate biological concepts with a given phenotype. Pathway-driven approaches, through the use of Web Services and workflows, enabled the large-scale dataset to be analysed systematically. The results showed that it was possible, at least using this specific data set, to clearly differentiate between progressive disease and disease-free patients using a set of biomarkers implicated in neuroendocrine signaling. A validation of the identified biomarkers was attempted in three separately published data sets. This analysis showed that although there was support for some of our findings in one of these data sets, this appeared to be a function of the close similarity in the experimental design followed, rather than of the specifics of the analysis method developed.
APA, Harvard, Vancouver, ISO, and other styles
43

Alshaer, Mohammad. "An Efficient Framework for Processing and Analyzing Unstructured Text to Discover Delivery Delay and Optimization of Route Planning in Realtime." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE1105/document.

Full text
Abstract:
The Internet of Things (IoT) is leading to a paradigm shift in the logistics sector. The advent of IoT has changed the ecosystem of logistics service management. Logistics service providers today use sensor technologies such as GPS or telemetry to collect data in real time while a delivery is in progress. Real-time data collection enables service providers to track and manage their shipment process efficiently. The main advantage of real-time data collection is that it allows logistics service providers to act proactively to avoid consequences such as delivery delays caused by unexpected or unknown events. Moreover, providers today tend to use data from external sources such as Twitter, Facebook and Waze, because these sources provide critical information about events such as traffic, accidents and natural disasters. Data from these external sources enrich the dataset and add value to the analysis. Furthermore, collecting them in real time makes it possible to use the data for real-time analysis and to prevent unexpected outcomes (such as delivery delays) at run time. However, the collected data are raw and must be processed for effective analysis. Collecting and processing data in real time is an enormous challenge. The main reason is that the data come from heterogeneous sources at very high speed. The high velocity and variety of the data create challenges for performing complex processing operations such as cleansing, filtering and handling incorrect data. The diversity of the data (structured, semi-structured and unstructured) raises challenges for processing the data both in batch mode and in real time, because different techniques may require operations on different types of data. A technical framework for processing heterogeneous data is very difficult to build and is not currently available. In addition, performing data processing operations in real time is very difficult; efficient techniques are needed to carry out the operations on high-throughput data, which cannot be done using conventional logistics information systems. Therefore, to exploit Big Data in logistics service processes, an efficient solution for collecting and processing data in real time and in batch mode is essential. In this thesis, we developed and experimented with two data processing methods: SANA and IBRIDIA. SANA is based on a Multinomial Naïve Bayes classifier, while IBRIDIA relies on Johnson's hierarchical clustering (HCL) algorithm, a hybrid technology enabling both batch and real-time data collection and processing. SANA is a service-based solution that processes unstructured data. This method serves as a multi-purpose system for extracting relevant events, including their context (such as place, location, time, etc.). In addition, it can be used to perform text analysis on the targeted events.
IBRIDIA was designed to process unknown data from external sources and cluster them in real time in order to gain knowledge and understanding of the data, making it possible to extract events that may lead to delivery delays. According to our experiments, both approaches show a unique ability to process logistics data.
The Internet of Things (IoT) is leading to a paradigm shift within the logistics industry. The advent of IoT has been changing the logistics service management ecosystem. Logistics service providers today use sensor technologies such as GPS or telemetry to collect data in realtime while the delivery is in progress. The realtime collection of data enables the service providers to track and manage their shipment process efficiently. The key advantage of realtime data collection is that it enables logistics service providers to act proactively to prevent outcomes such as delivery delay caused by unexpected or unknown events. Furthermore, providers today tend to use data stemming from external sources such as Twitter, Facebook, and Waze, because these sources provide critical information about events such as traffic, accidents, and natural disasters. Data from such external sources enrich the dataset and add value to the analysis. Besides, collecting them in real time provides an opportunity to use the data for on-the-fly analysis and prevent unexpected outcomes (such as delivery delay) at run time. However, data are collected raw and need to be processed for effective analysis. Collecting and processing data in real time is an enormous challenge. The main reason is that data stem from heterogeneous sources at huge speed. The high speed and variety of the data create challenges for performing complex processing operations such as cleansing, filtering, and handling incorrect data. The variety of data (structured, semi-structured, and unstructured) raises challenges for processing data both in batch style and in real time, because different types of data may require different processing techniques. A technical framework that enables the processing of heterogeneous data is heavily challenging to build and is not currently available. In addition, performing data processing operations in real time is heavily challenging; efficient techniques are required to carry out the operations on high-speed data, which cannot be done using conventional logistics information systems. Therefore, in order to exploit Big Data in logistics service processes, an efficient solution for collecting and processing data in both realtime and batch style is critically important. In this thesis, we developed and experimented with two data processing solutions: SANA and IBRIDIA. SANA is built on a Multinomial Naïve Bayes classifier, whereas IBRIDIA relies on Johnson's hierarchical clustering (HCL) algorithm, a hybrid technology that enables data collection and processing in both batch style and realtime. SANA is a service-based solution which deals with unstructured data. It serves as a multi-purpose system to extract the relevant events, including the context of the event (such as place, location, time, etc.). In addition, it can be used to perform text analysis over the targeted events. IBRIDIA was designed to process unknown data stemming from external sources and cluster them on the fly in order to gain knowledge and understanding of the data, which assists in extracting events that may lead to delivery delay. According to our experiments, both of these approaches show a unique ability to process logistics data. However, SANA is found more promising, since the underlying technology (the Naïve Bayes classifier) outperformed IBRIDIA from a performance measurement perspective.
SANA was meant to generate graph knowledge from the collected events immediately, in realtime and without any need to wait, thus reaching maximum benefit from these events. IBRIDIA, on the other hand, is valuable within the logistics domain for identifying the most influential category of events affecting delivery. Unfortunately, IBRIDIA must wait for a minimum number of events to arrive and always faces a cold start. Because we are interested in re-optimizing the route on the fly, we adopted SANA as our data processing framework.
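SANA's own feature pipeline is not detailed in the abstract. As a hedged sketch of its core idea, a Multinomial Naïve Bayes classifier over short event texts, the scikit-learn pipeline below is trained on invented messages and delay-related categories; none of the labels or examples come from the thesis.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set: free-text events and their delay-relevant category.
texts = ["heavy traffic jam on the highway near the port",
         "accident blocking two lanes downtown",
         "flood warning issued for the northern district",
         "sunny weather, roads clear this afternoon"]
labels = ["traffic", "accident", "disaster", "irrelevant"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["trucks stuck in a long traffic queue before the bridge"]))
```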
APA, Harvard, Vancouver, ISO, and other styles
44

Shatnawi, Safwan. "A data mining approach to ontology learning for automatic content-related question-answering in MOOCs." Thesis, Robert Gordon University, 2016. http://hdl.handle.net/10059/2122.

Full text
Abstract:
The advent of Massive Open Online Courses (MOOCs) allows a massive volume of registrants to enrol in these MOOCs. This research aims to offer MOOC registrants automatic content-related feedback to fulfil their cognitive needs. A framework is proposed which consists of three modules: the subject ontology learning module, the short text classification module, and the question answering module. Unlike previous research, a regular expression parser approach is used to identify relevant concepts for ontology learning. Also, the relevant concepts are extracted from unstructured documents. To build the concept hierarchy, a frequent pattern mining approach is used, guided by a heuristic function to ensure that sibling concepts are at the same level in the hierarchy. As this process does not require specific lexical or syntactic information, it can be applied to any subject. To validate the approach, the resulting ontology is used in a question-answering system which analyses students' content-related questions and generates answers for them. Textbook end-of-chapter questions and answers are used to validate the question-answering system. The resulting ontology is compared against the use of Text2Onto for the question-answering system, and it achieved favourable results. Finally, different indexing approaches based on a subject's ontology are investigated when classifying short text in MOOC forum discussion data; the investigated indexing approaches are unigram-based, concept-based and hierarchical concept indexing. The experimental results show that the ontology-based feature indexing approaches outperform the unigram-based indexing approach. Experiments are done in binary classification and multi-label classification settings. The results are consistent and show that hierarchical concept indexing outperforms both concept-based and unigram-based indexing. The bagging and random forest classifiers achieved the best results among the tested classifiers.
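The thesis's parser rules are not listed in the abstract. The sketch below is a hypothetical regular-expression-based concept extractor for course text: it keeps frequent one- and two-word terms and then nests a longer phrase under the shorter concept it contains, so sibling concepts naturally end up at the same level. The stop list, thresholds and example text are assumptions, not the thesis's rules.

```python
import re
from collections import Counter

STOP = {"the", "a", "an", "of", "and", "to", "is", "in", "for", "on", "with", "each", "has", "these"}

def candidate_concepts(text, min_freq=2):
    """Hypothetical regex-based extraction: frequent one- or two-word terms."""
    words = [w for w in re.findall(r"[a-z][a-z-]+", text.lower()) if w not in STOP]
    unigrams = Counter(words)
    bigrams = Counter(" ".join(pair) for pair in zip(words, words[1:]))
    return {term for term, count in (unigrams + bigrams).items() if count >= min_freq}

def build_hierarchy(concepts):
    """Nest a longer phrase under the shorter concept it contains, so that
    sibling concepts sit at the same level of the hierarchy."""
    tree = {c: [] for c in concepts}
    for child in concepts:
        for parent in concepts:
            if parent != child and parent in child.split():
                tree[parent].append(child)
    return tree

text = ("A relational database stores data in tables. Each table has a primary "
        "key. A foreign key in one table references the primary key of another "
        "table. The relational database enforces these keys.")
print(build_hierarchy(candidate_concepts(text)))
```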
APA, Harvard, Vancouver, ISO, and other styles
45

Spens, Henrik, and Johan Lindgren. "Using cloud services and machine learning to improve customer support : Study the applicability of the method on voice data." Thesis, Uppsala universitet, Avdelningen för systemteknik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-340639.

Full text
Abstract:
This project investigated how machine learning could be used to classify voice calls in a customer support setting. A set of a few hundred labeled voice calls was recorded and used as data. The calls were transcribed to text using a speech-to-text cloud service. This text was then normalized and used to train models able to classify new voice calls. Different algorithms were used to build the models, including support vector machines and neural networks. The optimal model, found by an extensive parameter search, was a support vector machine. Using this optimal model, a program that can classify live voice calls was built.
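The project's models are only summarised above. As a hedged sketch of the final pipeline (transcribed call text fed to a support vector machine), the scikit-learn code below trains on invented, already-transcribed calls and predicts the category of a new one; the speech-to-text step is assumed to have happened upstream, and all labels and transcripts are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Invented, already-transcribed and normalised customer calls.
transcripts = ["i cannot log in to my account after the update",
               "the invoice amount looks wrong this month",
               "my internet connection keeps dropping every evening",
               "i want to cancel my subscription next month"]
categories = ["login", "billing", "connectivity", "cancellation"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(transcripts, categories)

print(clf.predict(["the bill seems too high compared to last month"]))
```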
APA, Harvard, Vancouver, ISO, and other styles
46

Milosevic, Nikola. "A multi-layered approach to information extraction from tables in biomedical documents." Thesis, University of Manchester, 2018. https://www.research.manchester.ac.uk/portal/en/theses/a-multilayered-approach-to-information-extraction-from-tables-in-biomedical-documents(c2edce9c-ae7f-48fa-81c2-14d4bb87423e).html.

Full text
Abstract:
The quantity of literature in the biomedical domain is growing exponentially. It is becoming impossible for researchers to cope with this ever-increasing amount of information. Text mining provides methods that can improve access to information of interest through information retrieval, information extraction and question answering. However, most of these systems focus on information presented in the main body of text while ignoring other parts of the document such as tables and figures. Tables are a potentially important component of research presentation, as authors often include more detailed information in tables than in the textual sections of a document. Tables allow the presentation of large amounts of information in relatively limited space, due to their structural flexibility and ability to present multi-dimensional information. Table processing encapsulates specific challenges that table mining systems need to take into account. Challenges include the variety of visual and semantic structures in tables, the variety of information presentation formats, and dense content in table cells. The work presented in this thesis examines a multi-layered approach to information extraction from tables in biomedical documents. In this thesis we propose a representation model of tables and a method for table structure disentangling and information extraction. The model describes table structures and how they are read. We propose a method for information extraction that consists of: (1) table detection, (2) functional analysis, (3) structural analysis, (4) semantic tagging, (5) pragmatic analysis, (6) cell selection and (7) syntactic processing and extraction. In order to validate our approach, show its potential and identify remaining challenges, we applied our methodology to two case studies. The aim of the first case study was to extract baseline characteristics of clinical trials (number of patients, age, gender distribution, etc.) from tables. The second case study explored how the methodology can be applied to relationship extraction, examining the extraction of drug-drug interactions. Our method performed functional analysis with a precision of 0.9425, a recall of 0.9428 and an F1-score of 0.9426. Relationships between cells were recognized with a precision of 0.9238, a recall of 0.9744 and an F1-score of 0.9484. The information extraction methodology's performance is state-of-the-art in table information extraction, recording an F1-score range of 0.82-0.93 for demographic data, adverse event and drug-drug interaction extraction, depending on the complexity of the task and the available semantic resources. The presented methodology demonstrates that information can be efficiently extracted from tables in the biomedical literature. Information extraction from tables can be important for enhancing data curation, information retrieval, question answering and decision support systems with additional information from tables that cannot be found in the other parts of the document.
APA, Harvard, Vancouver, ISO, and other styles
47

Zhang, Zhuo. "A planning approach to migrating domain-specific legacy systems into service oriented architecture." Thesis, De Montfort University, 2012. http://hdl.handle.net/2086/9020.

Full text
Abstract:
The planning work prior to implementing an SOA migration project is very important for its success. Up to now, most of this work has been done manually. An SOA migration planning approach based on intelligent information processing methods is proposed to semi-automate this manual work. This thesis investigates the principal research question: 'How can we obtain SOA migration planning schemas (semi-)automatically, instead of by traditional manual work, in order to determine whether legacy software systems should be migrated to an SOA computation environment?'. The controlled experiment research method has been adopted to direct the research throughout the thesis. Data mining methods are used to analyse SOA migration sources and migration targets; the mined information supplements traditional analysis results. Text similarity measurement methods are used to measure the matching relationship between migration sources and migration targets, enabling quantitative analysis of matching relationships instead of the common qualitative analysis. Concretely, association rule and sequence pattern mining algorithms are proposed to analyse legacy assets and domain logic for establishing a Service model and a Component model. These two algorithms can mine all motifs with any min-support number without assuming any ordering, which makes them better suited than existing algorithms for establishing Service models and Component models in SOA migration situations. Two matching strategies, based on the keyword level and a superficial semantic level, are described; they can calculate the degree of similarity between legacy components and domain services effectively. Two decision-making methods, based on a similarity matrix and on hybrid information, are investigated for creating SOA migration planning schemas. Finally, a simple evaluation method is described. Two case studies on migrating e-learning legacy systems to SOA have been explored. The results show that the proposed approach is encouraging and applicable. Therefore, SOA migration planning schemas can be created semi-automatically, instead of by traditional manual work, by using data mining and text similarity measurement methods.
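The thesis's matching strategies are only summarised above. The sketch below is a hypothetical, quantitative version of keyword-level matching: it builds a similarity matrix between legacy component descriptions and candidate domain services using TF-IDF vectors and cosine similarity. All component and service texts are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

legacy_components = ["student enrolment and course registration module",
                     "grade calculation and transcript printing routines"]
domain_services = ["course registration service",
                   "grading service",
                   "payment processing service"]

vec = TfidfVectorizer()
matrix = vec.fit_transform(legacy_components + domain_services)
n = len(legacy_components)
similarity = cosine_similarity(matrix[:n], matrix[n:])   # rows: components, cols: services

for comp, row in zip(legacy_components, similarity):
    best = row.argmax()
    print(f"{comp!r} -> {domain_services[best]!r} (score {row[best]:.2f})")
```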
APA, Harvard, Vancouver, ISO, and other styles
48

Van, Niekerk Daniel Rudolph. "Automatic speech segmentation with limited data / by D.R. van Niekerk." Thesis, North-West University, 2009. http://hdl.handle.net/10394/3978.

Full text
Abstract:
The rapid development of corpus-based speech systems, such as concatenative synthesis systems for under-resourced languages, requires an efficient, consistent and accurate solution with regard to phonetic speech segmentation. Manual development of phonetically annotated corpora is a time-consuming and expensive process which suffers from challenges regarding consistency and reproducibility, while automation of this process has only been satisfactorily demonstrated on large corpora of a select few languages by employing techniques requiring extensive and specialised resources. In this work we considered the problem of phonetic segmentation in the context of developing small prototypical speech synthesis corpora for new under-resourced languages. This was done through an empirical evaluation of existing segmentation techniques on typical speech corpora in three South African languages. In this process, the performance of these techniques was characterised under different data conditions, and the efficient application of these techniques was investigated in order to improve the accuracy of the resulting phonetic alignments. We found that the application of baseline speaker-specific Hidden Markov Models results in relatively robust and accurate alignments even under extremely limited data conditions, and we demonstrated how such models can be developed and applied efficiently in this context. The result is segmentation of sufficient quality for synthesis applications, with the quality of alignments comparable to manual segmentation efforts in this context. Finally, possibilities for further automated refinement of phonetic alignments were investigated and an efficient corpus development strategy was proposed, with suggestions for further work in this direction.
Thesis (M.Ing. (Computer Engineering))--North-West University, Potchefstroom Campus, 2009.
APA, Harvard, Vancouver, ISO, and other styles
49

Green, Charles A. "An empirical study on the effects of a collaboration-aware computer system and several communication media alternatives on product quality and time to complete in a co-authoring environment." Thesis, Virginia Tech, 1992. http://hdl.handle.net/10919/40617.

Full text
Abstract:
A new type of software, termed a "group editor", allows multiple users to create and simultaneously edit a single document; this software has ostensibly been developed to increase efficiency in co-authoring environments where users may not be co-located. However, questions remain as to the effectiveness of this type of communication aid, which is a member of the "groupware" family of tools used for some types of computer supported cooperative work. In particular, there has been very little objective data on any group editor because of the problems inherent in evaluating writing, as well as the few examples of group editors that exist. A method was developed to examine the effect of using a particular group editor, Aspects™ from Group Technologies in Arlington, Va., in conjunction with several communication media, on a simple dyad writing task. Six dyads of college students familiar with journalistic writing were matched on attributes of dominance and writing ability and were asked to write short news articles based on short video clips in a balanced two-factor within-subject analysis of variance design. Six conditions were tested based on communication media: audio only, audio plus video, and face-to-face; each of these with and without the availability of the group editor. Constraints inherent in the task attempted to enforce consistent document quality levels, measured by grammatical quality and content quality (correctness of information and chronological sequencing). Time to complete the articles was used as a measure of efficiency, independent from quality due to the consistent quality levels of the resulting work. Results from the time data indicated a significant effect of communication media, with the face-to-face conditions taking significantly less time to complete than either of the other media alternatives. The grammatical quality of the written articles was consistently high, as measured by a computerized grammar checker. Content quality of the documents did not differ significantly across conditions. A supplemental Latin square analysis showed additional significant differences in time to complete for trial means (a practice effect) and team differences. Further, significantly less variance was found in certain conditions which had the group editor than in other conditions which did not. Subjective data obtained from questionnaires supported these results and additionally showed that subjects significantly preferred trials with the group editor and considered them more productive. The face-to-face conditions may have been more efficient due to the nature of the task, or due to increased communication structure within dyads from practice with the group editor. The significant effect of team differences may have been due to consistent style differences between dyads that affected efficiency levels. The decreased variability in time to complete in certain group editor conditions may have been due to increased communication structure in these conditions, or perhaps due to leveling effects of group writing as opposed to individual writing with team member aid. These hypotheses need to be tested with further study, and the generalizability of the experimental task conditions and of the results from this particular group editor needs to be established as well. Face-to-face conditions clearly resulted in the most efficient performance on this task.
The results obtained concerning the group editor suggest possible efficiency or consistency benefits from the use of group editors by co-authoring persons when face-to-face communication is not practical. Perhaps group editors will become a useful method for surrogate travel for persons with disabilities.
Master of Science
APA, Harvard, Vancouver, ISO, and other styles
50

Munnecom, Lorenna, and Miguel Chaves de Lemos Pacheco. "Exploration of an Automated Motivation Letter Scoring System to Emulate Human Judgement." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-34563.

Full text
Abstract:
As the popularity of the master's in data science at Dalarna University increases, so does the number of applicants. The aim of this thesis was to explore different approaches to providing an automated motivation letter scoring system which could emulate human judgement and automate the process of candidate selection. Several steps, such as image processing and text processing, were required to enable the authors to retrieve numerous features which could lead to the identification of the factors graded by the program managers. Grammar-based features and advanced textual features were extracted from the motivation letters, followed by the application of topic modelling methods to extract the probability of each topic occurring within a motivation letter. Furthermore, correlation analysis was applied to quantify the association between the features and the different factors graded by the program managers, followed by ordinal logistic regression and random forest to build models with the most impactful variables. Finally, the Naïve Bayes algorithm, random forest and support vector machine were used, first for classification and then for prediction purposes. These results were not promising, as the factors were not accurately identified. Nevertheless, the authors suspect that the factors may be strongly related to the emphasis of specific topics within a motivation letter, which could lead to further research.
APA, Harvard, Vancouver, ISO, and other styles
