
Dissertations / Theses on the topic 'Mining software engineering data'

Consult the top 50 dissertations / theses for your research on the topic 'Mining software engineering data.'


1

Delorey, Daniel Pierce. "Observational Studies of Software Engineering Using Data from Software Repositories." Diss., Brigham Young University, 2007. http://contentdm.lib.byu.edu/ETD/image/etd1716.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Unterkalmsteiner, Michael. "Coordinating requirements engineering and software testing." Doctoral thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-663.

Full text
Abstract:
The development of large, software-intensive systems is a complex undertaking that is generally tackled by a divide and conquer strategy. Organizations face thereby the challenge of coordinating the resources which enable the individual aspects of software development, commonly solved by adopting a particular process model. The alignment between requirements engineering (RE) and software testing (ST) activities is of particular interest as those two aspects are intrinsically connected: requirements are an expression of user/customer needs while testing increases the likelihood that those needs are actually satisfied. The work in this thesis is driven by empirical problem identification, analysis and solution development towards two main objectives. The first is to develop an understanding of RE and ST alignment challenges and characteristics. Building this foundation is a necessary step that facilitates the second objective, the development of solutions relevant and scalable to industry practice that improve REST alignment. The research methods employed to work towards these objectives are primarily empirical. Case study research is used to elicit data from practitioners while technical action research and field experiments are conducted to validate the developed  solutions in practice. This thesis contains four main contributions: (1) An in-depth study on REST alignment challenges and practices encountered in industry. (2) A conceptual framework in the form of a taxonomy providing constructs that further our understanding of REST alignment. The taxonomy is operationalized in an assessment framework, REST-bench (3), that was designed to be lightweight and can be applied as a postmortem in closing development projects. (4) An extensive investigation into the potential of information retrieval techniques to improve test coverage, a common REST alignment challenge, resulting in a solution prototype, risk-based testing supported by topic models (RiTTM). REST-bench has been validated in five cases and has shown to be efficient and effective in identifying improvement opportunities in the coordination of RE and ST. Most of the concepts operationalized from the REST taxonomy were found to be useful, validating the conceptual framework. RiTTM, on the other hand, was validated in a single case experiment where it has shown great potential, in particular by identifying test cases that were originally overlooked by expert test engineers, improving effectively test coverage.
APA, Harvard, Vancouver, ISO, and other styles
3

Santamaría, Diego, and Álvaro de Ramón. "Data Mining Web-Tool Prototype Using Monte Carlo Simulations." Thesis, Blekinge Tekniska Högskola, Avdelningen för programvarusystem, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-3164.

Full text
Abstract:
Facilitating the decision-making process using models and patterns is viewed in this thesis as highly valuable, and data mining is one option for accomplishing this task. Data mining algorithms can reveal relations within given data, find rules and create behavior patterns. In this thesis seven different types of data mining algorithms are employed. Monte Carlo is a statistical method that is used in the developed prototype to obtain random data and to simulate different scenarios; Monte Carlo methods are useful for modeling phenomena with significant uncertainty in the inputs. This thesis presents the steps followed during the development of a web-tool prototype that uses data mining techniques to assist decision-makers in port planning to make better forecasts using data generated from Monte Carlo simulation. The prototype generates random port-planning forecasts using Monte Carlo simulation. These forecasts are then evaluated with several data mining algorithms, and decision-makers can examine the outcomes of the prototype (rules, decision trees and regressions) to make better decisions.
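The abstract combines Monte Carlo simulation of port-planning scenarios with data mining models such as regressions. As a rough, hypothetical illustration of that pipeline (the thesis's actual prototype, variables and distributions are not described here), the following Python sketch simulates yearly throughput under uncertain growth and fits a simple least-squares trend to one simulated series:

```python
import random

def simulate_throughput(base=100_000, years=10, growth_mu=0.04, growth_sigma=0.02, runs=1000):
    """Monte Carlo simulation of yearly port throughput under uncertain growth."""
    scenarios = []
    for _ in range(runs):
        level, path = base, []
        for _ in range(years):
            level *= 1 + random.gauss(growth_mu, growth_sigma)  # random yearly growth
            path.append(level)
        scenarios.append(path)
    return scenarios

def fit_trend(series):
    """Ordinary least-squares line y = a + b*t fitted to one simulated series."""
    n = len(series)
    t = list(range(n))
    t_mean, y_mean = sum(t) / n, sum(series) / n
    b = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, series)) / sum((ti - t_mean) ** 2 for ti in t)
    return y_mean - b * t_mean, b

scenarios = simulate_throughput()
mean_final = sum(s[-1] for s in scenarios) / len(scenarios)
a, b = fit_trend(scenarios[0])
print(f"mean simulated throughput after 10 years: {mean_final:,.0f} TEU")
print(f"trend of first scenario: {a:,.0f} + {b:,.0f} * year")
```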
APA, Harvard, Vancouver, ISO, and other styles
4

Waters, Robert Lee. "Obtaining Architectural Descriptions from Legacy Systems: The Architectural Synthesis Process (ASP)." Diss., Available online, Georgia Institute of Technology, 2004:, 2004. http://etd.gatech.edu/theses/available/etd-10272004-160115/unrestricted/waters%5Frobert%5Fl%5F200412%5Fphd.pdf.

Full text
Abstract:
Thesis (Ph. D.)--Computing, Georgia Institute of Technology, 2005.
Rick Kazman, Committee Member; Colin Potts, Committee Member; Mike McCracken, Committee Member; Gregory Abowd, Committee Chair; Spencer Rugaber, Committee Member. Includes bibliographical references.
APA, Harvard, Vancouver, ISO, and other styles
5

Matyja, Dariusz. "Applications of data mining algorithms to analysis of medical data." Thesis, Blekinge Tekniska Högskola, Avdelningen för programvarusystem, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-4253.

Full text
Abstract:
Medical datasets have reached enormous sizes. This data may contain valuable information that awaits extraction, encapsulated in various patterns and regularities hidden in the data. Such knowledge may prove priceless for future medical decision making. The data analyzed here comes from the Polish National Breast Cancer Prevention Program run in Poland in 2006. The aim of this master's thesis is the evaluation of the analytical data from the Program to see whether the domain can be a subject of data mining. The next step is to evaluate several data mining methods with respect to their applicability to the given data, to show which of the techniques are particularly usable for the given dataset. Finally, the research aims at extracting some tangible medical knowledge from the set. The research utilizes a data warehouse to store the data, which is loaded via an ETL process. The performance of the data mining models is measured with lift charts and confusion (classification) matrices. The medical knowledge is extracted based on the indications of the majority of the models. The experiments are conducted in Microsoft SQL Server 2005. The results of the analyses show that the Program did not deliver good-quality data: many missing values and various discrepancies make it especially difficult to build good models and draw any medical conclusions. It is very hard to decide unequivocally which method is particularly suitable for the given data, and it is advisable to test a set of methods prior to their application in real systems. The data mining models were not unanimous about patterns in the data, so the extracted medical knowledge is not certain and requires verification by medical experts. However, most of the models strongly associated patient's age, tissue type, hormonal therapies and disease in the family with the malignancy of cancers. The next step of the research is to present the findings to medical experts for verification. In the future the outcomes may constitute a good background for the development of a Medical Decision Support System.
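The abstract measures model performance with lift charts and confusion matrices. A minimal sketch of both devices, on invented screening labels rather than the Program's data, might look like this:

```python
from collections import Counter

def confusion_matrix(actual, predicted, positive="malignant"):
    """2x2 confusion matrix for a binary screening outcome."""
    counts = Counter(zip(actual, predicted))
    tp = counts[(positive, positive)]
    fn = sum(v for (a, p), v in counts.items() if a == positive and p != positive)
    fp = sum(v for (a, p), v in counts.items() if a != positive and p == positive)
    tn = sum(counts.values()) - tp - fn - fp
    return tp, fp, fn, tn

def lift_at(actual, scores, fraction=0.1, positive="malignant"):
    """Lift: positives found in the top-scored fraction vs. the overall positive rate."""
    ranked = [a for a, _ in sorted(zip(actual, scores), key=lambda x: -x[1])]
    k = max(1, int(len(ranked) * fraction))
    top_rate = ranked[:k].count(positive) / k
    base_rate = actual.count(positive) / len(actual)
    return top_rate / base_rate

actual = ["malignant", "benign", "benign", "malignant", "benign", "benign"]
predicted = ["malignant", "benign", "malignant", "malignant", "benign", "benign"]
scores = [0.9, 0.2, 0.6, 0.8, 0.1, 0.3]
print(confusion_matrix(actual, predicted))   # (tp, fp, fn, tn)
print(round(lift_at(actual, scores, 0.5), 2))
```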
APA, Harvard, Vancouver, ISO, and other styles
6

Imam, Ayad Tareq. "Relative-fuzzy : a novel approach for handling complex ambiguity for software engineering of data mining models." Thesis, De Montfort University, 2010. http://hdl.handle.net/2086/3909.

Full text
Abstract:
There are two main defined classes of uncertainty, namely fuzziness and ambiguity, where ambiguity is a 'one-to-many' relationship between the syntax and the semantics of a proposition. This definition seems to ignore the 'many-to-many' relationship type of ambiguity. In this thesis, we use the term complex uncertainty for the many-to-many relationship type of ambiguity. This research proposes a new approach for handling the complex ambiguity type of uncertainty that may exist in data, for the software engineering of predictive Data Mining (DM) classification models. The proposed approach is based on Relative-Fuzzy Logic (RFL), a novel type of fuzzy logic. RFL defines a new formulation of the problem of the ambiguity type of uncertainty in terms of States Of Proposition (SOP), and describes its membership (semantic) value using a new definition of the Domain of Proposition (DOP), which is based on the relativity principle as defined by possible-worlds logic. To propose RFL, a question needs to be answered: how can these two approaches, i.e. fuzzy logic and possible-worlds logic, be combined to produce a new membership value set (and later a logic) that is able to handle fuzziness and multiple viewpoints at the same time? Achieving this goal requires giving possible-worlds logic the ability to quantify multiple viewpoints, to model fuzziness in each of these viewpoints, and to express that in a new set of membership values. Furthermore, a new architecture of Hierarchical Neural Network (HNN) called ML/RFL-Based Net has been developed in this research, along with a new learning algorithm and a new recalling algorithm. The architecture, learning algorithm and recalling algorithm of ML/RFL-Based Net follow the principles of RFL, and this new type of HNN is considered to be an RFL computation machine. The ability of the relative-fuzzy-based DM prediction model to tackle the problem of the complex ambiguity type of uncertainty has been tested. Special-purpose Integrated Development Environment (IDE) software, called RFL4ASR, which generates a DM prediction model for speech recognition, has also been developed in this research; this special-purpose IDE is an extension of the definition of the traditional IDE. Using multiple sets of TIMIT speech data, the prediction model of type ML/RFL-Based Net achieves a classification accuracy of 69.2308%. This accuracy is higher than the best results achieved by WEKA data mining machines given the same speech data.
APA, Harvard, Vancouver, ISO, and other styles
7

Thun, Julia, and Rebin Kadouri. "Automating debugging through data mining." Thesis, KTH, Data- och elektroteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-203244.

Full text
Abstract:
Contemporary technological systems generate massive quantities of log messages. These messages can be stored, searched and visualized efficiently using log management and analysis tools, and the analysis of log messages offers insights into system behavior such as performance, server status and execution faults in web applications. iStone AB wants to explore the possibility of automating its debugging process: since iStone does most of its debugging manually, it takes time to find errors within the system. The aim was therefore to find different solutions to reduce the time it takes to debug. An analysis of log messages within access and console logs was made so that the most appropriate data mining techniques for iStone's system could be chosen. Data mining algorithms and log management and analysis tools were compared. The result of the comparisons showed that the ELK Stack, as well as a mixture of Eclat and a hybrid algorithm (Eclat and Apriori), were the most appropriate choices. To demonstrate their feasibility, the ELK Stack and Eclat were implemented. The produced results show that data mining and the use of a platform for log analysis can facilitate and reduce the time it takes to debug.
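The abstract names Eclat (and an Eclat/Apriori hybrid) as the chosen frequent-itemset miners for log analysis. A minimal, generic Eclat over invented log "transactions" (not iStone's data or the thesis's implementation) could look as follows:

```python
def eclat(transactions, min_support=2):
    """Minimal Eclat: frequent itemsets via intersection of vertical tid-lists."""
    # Build the vertical representation: item -> set of transaction ids.
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        for i, (item, tids) in enumerate(candidates):
            if len(tids) < min_support:
                continue
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Intersect with the remaining items to grow the prefix.
            deeper = [(other, tids & other_tids)
                      for other, other_tids in candidates[i + 1:]]
            extend(itemset, deeper)

    extend((), sorted(tidlists.items()))
    return frequent

# Hypothetical log "transactions": events that occurred in the same session.
logs = [{"timeout", "retry", "error500"},
        {"timeout", "error500"},
        {"login", "retry"},
        {"timeout", "retry", "error500"}]
for itemset, support in sorted(eclat(logs).items(), key=lambda kv: -kv[1]):
    print(itemset, support)
```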
APA, Harvard, Vancouver, ISO, and other styles
8

Sobolewska, Katarzyna-Ewa. "Web links utility assessment using data mining techniques." Thesis, Blekinge Tekniska Högskola, Avdelningen för programvarusystem, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-2936.

Full text
Abstract:
This thesis focuses on data mining solutions for the WWW, specifically on how they can be used for hyperlink evaluation. We concentrate on the hyperlinks used in web site systems and on the problem of evaluating their utility. Since hyperlinks reflect relations to other web pages, one can expect that there exists a way to verify whether users follow the desired navigation paths. The challenge is to use available techniques to discover usage behavior patterns and interpret them. We have evaluated hyperlinks of selected pages from the www.bth.se web site. With the help of a web expert, the usefulness of data mining as the basis of the assessment was validated. The outcome of the research shows that data mining gives decision support for changes in a web site's navigational structure.
APA, Harvard, Vancouver, ISO, and other styles
9

Saltin, Joakim. "Interactive visualization of financial data : Development of a visual data mining tool." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-181225.

Full text
Abstract:
In this project, a prototype visual data mining tool was developed, allowing users to interactively investigate large multi-dimensional datasets visually (using 2D visualization techniques) through so-called drill-down, roll-up and slicing operations. The project included all steps of the development, from writing specifications and designing the program to implementing and evaluating it. Using ideas from data warehousing, custom methods for storing pre-computed aggregations of data (commonly referred to as materialized views) and retrieving data from them were developed and implemented in order to achieve higher performance on large datasets. View materialization enables the program to easily fetch or calculate a view using other views, something which can yield significant performance gains if view sizes are much smaller than the underlying raw dataset. The choice of which views to materialize was made in an automated manner using a well-known algorithm, the greedy algorithm for view materialization, which selects the fraction of all possible views that is likely (but not guaranteed) to yield the best performance gain. The use of materialized views was shown to have good potential to increase performance for large datasets, with an average speedup (compared to on-the-fly queries) between 20 and 70 for a test dataset containing 500,000 rows. The end result was a program combining flexibility with good performance, which was also reflected by good scores in a user-acceptance test with participants from the company where this project was carried out.
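The abstract relies on the well-known greedy algorithm for view materialization. The following sketch is a simplified, hypothetical rendering of that greedy selection over a tiny invented aggregation lattice, not the tool's actual implementation:

```python
def greedy_view_selection(view_sizes, descendants, fact_view, k=2):
    """Greedy selection of k views to materialize (in the spirit of the classic
    greedy view-materialization algorithm): repeatedly pick the view whose
    materialization gives the largest total cost reduction for the views that
    can be answered from it."""
    materialized = {fact_view}

    def answer_cost(view):
        # Cheapest materialized view that the given view can be answered from.
        return min(view_sizes[m] for m in materialized if view in descendants[m] or m == view)

    chosen = []
    for _ in range(k):
        best, best_benefit = None, 0
        for v in view_sizes:
            if v in materialized:
                continue
            benefit = sum(max(0, answer_cost(w) - view_sizes[v])
                          for w in descendants[v] | {v})
            if benefit > best_benefit:
                best, best_benefit = v, benefit
        if best is None:
            break
        materialized.add(best)
        chosen.append((best, best_benefit))
    return chosen

# Hypothetical aggregation lattice over (customer, product, time), view sizes in rows.
view_sizes = {"cpt": 500_000, "cp": 60_000, "ct": 80_000, "pt": 40_000,
              "c": 5_000, "p": 1_000, "t": 300}
descendants = {"cpt": {"cp", "ct", "pt", "c", "p", "t"},
               "cp": {"c", "p"}, "ct": {"c", "t"}, "pt": {"p", "t"},
               "c": set(), "p": set(), "t": set()}
print(greedy_view_selection(view_sizes, descendants, fact_view="cpt", k=2))
```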
APA, Harvard, Vancouver, ISO, and other styles
10

Allahyari, Hiva. "On the concept of Understandability as a Property of Data mining Quality." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-6134.

Full text
Abstract:
This thesis reviews methods for evaluating and analyzing the comprehensibility and understandability of models generated from data in the context of data mining and knowledge discovery. The motivation for this study is the fact that the majority of previous work has focused on increasing the accuracy of models, ignoring user-oriented properties such as comprehensibility and understandability. Approaches for analyzing the understandability of data mining models have been discussed on two different levels: one regarding the type of the models' presentation and the other considering the structure of the models. In this study, we present a summary of existing assumptions regarding both approaches, followed by an empirical study that examines understandability from the user's point of view through a survey. From the results of the survey, we find that models represented as decision trees are more understandable than models represented as decision rules. Using the survey results on the understandability of a number of models in conjunction with quantitative measurements of the complexity of the models, we are able to establish a correlation between the complexity and the understandability of the models.
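The abstract establishes a correlation between model complexity and survey-based understandability. A minimal sketch of such a check, on invented complexity counts and survey ratings, might be:

```python
def pearson(xs, ys):
    """Pearson correlation between model complexity and survey understandability scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: complexity = number of nodes/rules, understandability = mean survey rating (1-5).
complexity = [5, 12, 20, 35, 60]
understandability = [4.6, 4.1, 3.5, 2.8, 2.1]
print(round(pearson(complexity, understandability), 3))   # strongly negative in this toy example
```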
APA, Harvard, Vancouver, ISO, and other styles
11

Gupta, Shweta. "Software Development Productivity Metrics, Measurements and Implications." Thesis, University of Oregon, 2018. http://hdl.handle.net/1794/23816.

Full text
Abstract:
The rapidly increasing capabilities and complexity of numerical software present a growing challenge to software development productivity. While many open source projects enable the community to share experiences, learn and collaborate, estimating individual developer productivity becomes more difficult as projects expand. In this work, we analyze several HPC software Git repositories with issue trackers and compute productivity metrics that can be used to better understand and potentially improve development processes. Evaluating productivity in these communities presents additional challenges because bug reports and feature requests are often made on mailing lists instead of in issue trackers, resulting in difficult-to-analyze unstructured data. For such data, we investigate automatic tag generation using natural language processing techniques. We aim to produce metrics that help quantify productivity improvement or degradation over the projects' lifetimes. We also provide an objective measurement of productivity based on effort estimation for the developers' work.
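The abstract computes productivity metrics from Git repositories with issue trackers. As a rough, hypothetical illustration (not the metrics defined in the thesis), the following sketch derives per-author commit counts and line churn from `git log --numstat`:

```python
import subprocess
from collections import defaultdict

def churn_by_author(repo_path="."):
    """Per-author commit count and line churn from `git log --numstat` output."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--pretty=format:@%ae"],
        capture_output=True, text=True, check=True).stdout
    stats = defaultdict(lambda: {"commits": 0, "added": 0, "deleted": 0})
    author = None
    for line in log.splitlines():
        if line.startswith("@"):
            author = line[1:]
            stats[author]["commits"] += 1
        elif line and author:
            added, deleted, _path = line.split("\t", 2)
            if added.isdigit() and deleted.isdigit():   # '-' marks binary files
                stats[author]["added"] += int(added)
                stats[author]["deleted"] += int(deleted)
    return stats

for author, s in sorted(churn_by_author().items(), key=lambda kv: -kv[1]["commits"])[:5]:
    print(f'{author}: {s["commits"]} commits, +{s["added"]}/-{s["deleted"]} lines')
```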
APA, Harvard, Vancouver, ISO, and other styles
12

Güneş, Serkan. "Investment and Financial Forecasting : A Data Mining Approach on Port Industry." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-5340.

Full text
Abstract:
This thesis examines and analyzes the use of data mining techniques and simulations as a forecasting tool. The decision-making process in business can be risky, and corporate decision makers have to make decisions that protect the company's interests and lower the risk. In order to evaluate a data mining approach to forecasting, a tool called IFF was developed for evaluating and simulating forecasts. Specifically, the ability of data mining techniques and simulation to predict, evaluate and validate port-industry forecasts is tested. Accuracy is calculated with data mining methods, and finally the confidence in the user's and the simulation model's forecasts is estimated. The results of the research indicate that the data mining approach to forecasting and the Monte Carlo method have the capability to produce forecasts for the port industry and, if properly analyzed, can give accurate results.
APA, Harvard, Vancouver, ISO, and other styles
13

Barysau, Mikalai. "Developers' performance analysis based on code review data : How to perform comparisons of different groups of developers." Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-13335.

Full text
Abstract:
Nowadays more and more IT companies are switching to a distributed development model. This trend has a number of advantages and disadvantages, which researchers study through different aspects of modern code development. One such aspect is code review, which is used by many companies and produces a large amount of data. A number of studies describe different data mining and data analysis approaches that are based on a link between code review data and performance. According to these studies, analysis of code review data can give good insight into development performance and help software companies detect a number of performance issues and improve the quality of their code. The main goal of this thesis was to collect reported knowledge about code review data analysis and implement a solution that helps to perform such analysis in a real industrial setting. During the research the author used multiple research techniques: a snowballing literature review, a case study and semi-structured interviews. The results of the research contain a list of code review data metrics extracted from the literature and a software tool for collecting and visualizing the data. The literature review showed that, among the sources related to code review, a relatively small number are related to the topic of this thesis, which exposes a field for future research. Application of the identified metrics showed that most of them can be used in the context of the studied environment. Presentation of the results and interviews with the company's representatives showed that the graphic plots are useful for observing trends and correlations in the development across the company's development sites and help the company improve its performance and decision-making process.
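The abstract describes collecting code review data and comparing different groups of developers. A hypothetical sketch of such a comparison, over invented review records and a made-up `site` grouping, could be:

```python
from statistics import mean, median
from collections import defaultdict

# Hypothetical code-review records, one dict per review, as they might be exported from a review tool.
reviews = [
    {"site": "site-A", "hours_to_first_response": 3.5, "comments": 4, "patch_sets": 2},
    {"site": "site-A", "hours_to_first_response": 7.0, "comments": 1, "patch_sets": 1},
    {"site": "site-B", "hours_to_first_response": 26.0, "comments": 9, "patch_sets": 5},
    {"site": "site-B", "hours_to_first_response": 14.0, "comments": 6, "patch_sets": 3},
]

def metrics_by_group(records, group_key="site"):
    """Aggregate per-group review metrics so different groups of developers can be compared."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r)
    summary = {}
    for group, recs in groups.items():
        summary[group] = {
            "reviews": len(recs),
            "median_response_h": median(r["hours_to_first_response"] for r in recs),
            "mean_comments": mean(r["comments"] for r in recs),
            "mean_patch_sets": mean(r["patch_sets"] for r in recs),
        }
    return summary

for group, m in metrics_by_group(reviews).items():
    print(group, m)
```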
APA, Harvard, Vancouver, ISO, and other styles
14

Polańska, Julia, and Michał Zyznarski. "Elaboration of a method for comparison of Business Intelligence Systems which support data mining process." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-2078.

Full text
Abstract:
Business Intelligence systems have become more and more popular in recent years, driven by the need to reuse data in order to gain potentially useful business information. These systems are advanced sets of tools, which leads to high purchase and licensing prices. Therefore, it is important to choose the system which best fits particular business needs. The aim of this thesis is to elaborate a method for the comparison of existing Business Intelligence systems that support data mining. The method consists of a quality model, built according to existing standards, and a set of steps which should be taken to choose a Business Intelligence system according to the particular requirements of its future user. The first part of the thesis focuses on the analysis of existing works providing a way to compare these software products. It is shown that there is no existing systematic approach to this problem; however, the criteria presented in those works, along with the description of quality model standards, were used for creating the quality model and proposing a set of basic measures. The phases of the evaluation process were also identified. The next part of the research is a case study whose purpose is to show the usefulness of the proposed evaluation method. The example is simple, but it has proven that the method can be easily modified for specific needs and used for the comparison of real Business Intelligence systems. The quality level measured in the case study turned out to be very similar for each system. The evaluation method may be extended in future work with more advanced measures or additional characteristics which were not taken into account in this research.
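The abstract builds a quality model and a set of steps for comparing Business Intelligence systems. A minimal weighted-sum sketch of how such a quality model might score candidate systems (characteristics, weights and ratings invented here) is:

```python
# Hypothetical quality characteristics (weights sum to 1) and expert ratings on a 1-5 scale.
weights = {"functionality": 0.35, "usability": 0.25, "performance": 0.20, "cost": 0.20}
ratings = {
    "BI system X": {"functionality": 4, "usability": 3, "performance": 4, "cost": 2},
    "BI system Y": {"functionality": 3, "usability": 4, "performance": 3, "cost": 4},
}

def weighted_score(system_ratings, weights):
    """Weighted-sum score of one system against the quality model."""
    return sum(weights[c] * system_ratings[c] for c in weights)

for system, r in ratings.items():
    print(system, round(weighted_score(r, weights), 2))
```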
APA, Harvard, Vancouver, ISO, and other styles
15

Aftarczuk, Kamila. "Evaluation of selected data mining algorithms implemented in Medical Decision Support Systems." Thesis, Blekinge Tekniska Högskola, Avdelningen för programvarusystem, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-6194.

Full text
Abstract:
The goal of this master's thesis is to identify and evaluate data mining algorithms which are commonly implemented in modern Medical Decision Support Systems (MDSS). These systems are used in various healthcare units all over the world, and the institutions that use them store large amounts of medical data which may contain relevant medical information hidden in various patterns buried among the records. Within the research several popular MDSSs are analyzed in order to determine the most common data mining algorithms they utilize. Three algorithms have been identified: Naïve Bayes, Multilayer Perceptron and C4.5. Prior to the analyses the algorithms are calibrated: several test configurations are evaluated in order to determine the best settings for the algorithms. Afterwards, a final comparison orders the algorithms with respect to their performance, based on a set of performance metrics. The analyses are conducted in WEKA on five UCI medical datasets: breast cancer, hepatitis, heart disease, dermatology and diabetes. The analyses have shown that it is very difficult to name a single data mining algorithm as the most suitable for medical data; the results obtained for the algorithms were very similar. However, the final evaluation of the outcomes allowed the Naïve Bayes to be singled out as the best classifier for the given domain, followed by the Multilayer Perceptron and C4.5.
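The abstract compares Naïve Bayes, Multilayer Perceptron and C4.5 in WEKA on UCI medical datasets. A rough Python analogue using scikit-learn (assumed to be installed; its DecisionTreeClassifier stands in for C4.5, and a bundled dataset stands in for the UCI files) might be:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in for the UCI medical datasets

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Multilayer Perceptron": MLPClassifier(max_iter=2000, random_state=0),
    "Decision tree (C4.5-like)": DecisionTreeClassifier(random_state=0),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name:28s} mean accuracy = {scores.mean():.3f}")
```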
APA, Harvard, Vancouver, ISO, and other styles
16

Cruzes, Daniela Soares. "Analise secundaria de estudos experimentais em engenharia de software." [s.n.], 2007. http://repositorio.unicamp.br/jspui/handle/REPOSIP/260999.

Full text
Abstract:
Advisors: Mario Jino, Manoel Gomes de Mendonça Neto, Victor Robert Basili
Doctoral thesis (doutorado) - Universidade Estadual de Campinas, Faculdade de Engenharia Eletrica e de Computação
Abstract: While it is clear that there are many sources of variation from one software development context to another, it is not clear a priori which specific variables will influence the effectiveness of a process, technique or method in a given context. For this reason, we argue that knowledge about software engineering must be built from many studies, in which related studies are run within similar contexts as well as very different ones. Previous work has discussed how to design related studies so as to document as precisely as possible the values of context variables and to be able to compare them with those observed in new studies. While such a planned approach is important, we argue that an opportunistic approach is also practical. The secondary analysis approach discussed in this work (SecESE) combines results from multiple individual studies conducted independently, after the fact, enabling the expansion of empirical software engineering knowledge from large evidence bases. In this dissertation, we describe a process to build empirical knowledge about software engineering based on encoding the information extracted from papers and experimental data into a structured base, which can then be mined to extract new knowledge in a simple and flexible way.
Doctorate
Computer Engineering
Doctor of Electrical Engineering
APA, Harvard, Vancouver, ISO, and other styles
17

Burji, Supreeth Jagadish. "Reverse Engineering of a Malware : Eyeing the Future of Computer Security." Akron, OH : University of Akron, 2009. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=akron1247447165.

Full text
Abstract:
Thesis (M.S.)--University of Akron, Dept. of Computer Science, 2009.
"August, 2009." Title from electronic thesis title page (viewed 11/11/2009) Advisor, Kathy J. Liszka; Faculty Readers, Timothy W. O'Neil, Wolfgang Pelz; Department Chair, Chien-Chung Chan; Dean of the College, Chand Midha; Dean of the Graduate School, George R. Newkome. Includes bibliographical references.
APA, Harvard, Vancouver, ISO, and other styles
18

Taylor, Quinn Carlson. "Analysis and Characterization of Author Contribution Patterns in Open Source Software Development." BYU ScholarsArchive, 2012. https://scholarsarchive.byu.edu/etd/2971.

Full text
Abstract:
Software development is a process fraught with unpredictability, in part because software is created by people. Human interactions add complexity to development processes, and collaborative development can become a liability if not properly understood and managed. Recent years have seen an increase in the use of data mining techniques on publicly-available repository data with the goal of improving software development processes, and by extension, software quality. In this thesis, we introduce the concept of author entropy as a metric for quantifying interaction and collaboration (both within individual files and across projects), present results from two empirical observational studies of open-source projects, identify and analyze authorship and collaboration patterns within source code, demonstrate techniques for visualizing authorship patterns, and propose avenues for further research.
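The abstract introduces author entropy as a metric for quantifying interaction and collaboration. A minimal sketch of the usual Shannon-entropy reading of that idea (the thesis's exact definition may differ), computed over hypothetical per-line authorship, is:

```python
import math
from collections import Counter

def author_entropy(line_authors):
    """Shannon entropy of author contributions to one file (in bits).
    0 = single author; log2(n) = n authors contributing equally."""
    counts = Counter(line_authors)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def normalized_author_entropy(line_authors):
    """Entropy scaled to [0, 1] so files with different author counts are comparable."""
    n_authors = len(set(line_authors))
    if n_authors < 2:
        return 0.0
    return author_entropy(line_authors) / math.log2(n_authors)

# Hypothetical per-line authorship of a file, e.g. taken from `git blame`.
blame = ["alice"] * 70 + ["bob"] * 25 + ["carol"] * 5
print(round(author_entropy(blame), 3), round(normalized_author_entropy(blame), 3))
```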
APA, Harvard, Vancouver, ISO, and other styles
19

Krüger, Franz David, and Mohamad Nabeel. "Hyperparameter Tuning Using Genetic Algorithms : A study of genetic algorithms impact and performance for optimization of ML algorithms." Thesis, Mittuniversitetet, Institutionen för informationssystem och –teknologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-42404.

Full text
Abstract:
As machine learning (ML) becomes more and more frequent in the business world, information gathering through data mining (DM) is on the rise, and DM practitioners generally use rules of thumb to avoid having to spend a substantial amount of time tuning the hyperparameters (the parameters that control the learning process) of an ML algorithm to obtain a high accuracy score. The proposal in this report is an approach that systematically optimizes ML algorithms using genetic algorithms (GA), and an evaluation of whether and how the model should be constructed to find global solutions for a specific data set. By implementing a GA approach on two ML algorithms, K-nearest neighbors and Random Forest, on two numerical data sets, the Iris data set and the Wisconsin breast cancer data set, the model is evaluated by its accuracy scores as well as its computational time, which is then compared against a search method, specifically exhaustive search. The results indicate that GA works well in finding good accuracy scores in a reasonable amount of time. There are some limitations, as a parameter's significance varies across ML algorithms.
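The abstract tunes K-nearest neighbors and Random Forest hyperparameters with a genetic algorithm and compares against exhaustive search. A toy GA over k-NN on the Iris data (scikit-learn assumed to be installed; operators and parameters invented here, not the report's implementation) could look like:

```python
import random
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

def fitness(genome):
    """Cross-validated accuracy of a k-NN classifier for one genome (set of hyperparameters)."""
    k, weights = genome
    clf = KNeighborsClassifier(n_neighbors=k, weights=weights)
    return cross_val_score(clf, X, y, cv=5).mean()

def random_genome():
    return (random.randint(1, 30), random.choice(["uniform", "distance"]))

def genetic_search(pop_size=10, generations=5, mutation_rate=0.3):
    population = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]                       # selection: keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = (a[0], b[1])                                # crossover: mix hyperparameters
            if random.random() < mutation_rate:                 # mutation: perturb k
                child = (max(1, child[0] + random.randint(-3, 3)), child[1])
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return best, fitness(best)

random.seed(0)
print(genetic_search())   # best genome found and its cross-validated accuracy
```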
APA, Harvard, Vancouver, ISO, and other styles
20

Chu, Justin. "CONTEXT-AWARE DEBUGGING FOR CONCURRENT PROGRAMS." UKnowledge, 2017. https://uknowledge.uky.edu/cs_etds/61.

Full text
Abstract:
Concurrency faults are difficult to reproduce and localize because they usually occur under specific inputs and thread interleavings. Most existing fault localization techniques focus on sequential programs but fail to identify faulty memory access patterns across threads, which are usually the root causes of concurrency faults; moreover, existing techniques for sequential programs cannot be adapted to identify faulty paths in concurrent programs. While concurrency fault localization techniques have been proposed to analyze passing and failing executions obtained from running a set of test cases in order to identify faulty access patterns, they primarily rely on statistical analysis. We present a novel approach to fault localization using feature selection techniques from machine learning. Our insight is that the concurrency access patterns obtained from a large volume of coverage data generally constitute high-dimensional data sets, yet existing statistical analysis techniques for fault localization are usually applied to low-dimensional data sets. Each additional failing or passing run can provide more diverse information, which can help localize faulty concurrency access patterns in code, and the patterns with maximum feature diversity information can point to the most suspicious pattern. We then apply data mining techniques to identify the interleaving patterns that occur most frequently and provide the possible faulty paths. We also evaluate the effectiveness of fault localization using test suites generated from different test adequacy criteria. We have evaluated Cadeco on 10 real-world multi-threaded Java applications. Results indicate that Cadeco outperforms state-of-the-art approaches for localizing concurrency faults.
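The abstract ranks concurrency access patterns with feature selection over passing and failing runs. The sketch below uses a plain information-gain score on invented coverage data as a stand-in; it is not Cadeco's actual feature-selection technique:

```python
import math

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            result -= p * math.log2(p)
    return result

def information_gain(runs, pattern):
    """How much knowing that `pattern` occurred reduces uncertainty about pass/fail."""
    fails = sum(1 for r in runs if r["failed"])
    base = entropy(fails, len(runs) - fails)
    with_p = [r for r in runs if pattern in r["patterns"]]
    without_p = [r for r in runs if pattern not in r["patterns"]]
    cond = 0.0
    for subset in (with_p, without_p):
        if subset:
            f = sum(1 for r in subset if r["failed"])
            cond += len(subset) / len(runs) * entropy(f, len(subset) - f)
    return base - cond

# Hypothetical coverage data: interleaving access patterns observed in passing/failing runs.
runs = [
    {"failed": True,  "patterns": {"W1-R2-W1", "R1-R2"}},
    {"failed": True,  "patterns": {"W1-R2-W1"}},
    {"failed": False, "patterns": {"R1-R2"}},
    {"failed": False, "patterns": {"R1-R2", "W2-W1-R1"}},
]
all_patterns = set().union(*(r["patterns"] for r in runs))
for p in sorted(all_patterns, key=lambda p: -information_gain(runs, p)):
    print(p, round(information_gain(runs, p), 3))
```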
APA, Harvard, Vancouver, ISO, and other styles
21

van, Schaik Sebastiaan Johannes. "A framework for processing correlated probabilistic data." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:91aa418d-536e-472d-9089-39bef5f62e62.

Full text
Abstract:
The amount of digitally-born data has surged in recent years. In many scenarios, this data is inherently uncertain (or: probabilistic), such as data originating from sensor networks, image and voice recognition, location detection, and automated web data extraction. Probabilistic data requires novel and different approaches to data mining and analysis, which explicitly account for the uncertainty and the correlations therein. This thesis introduces ENFrame: a framework for processing and mining correlated probabilistic data. Using this framework, it is possible to express both traditional and novel algorithms for data analysis in a special user language, without having to explicitly address the uncertainty of the data on which the algorithms operate. The framework will subsequently execute the algorithm on the probabilistic input, and perform exact or approximate parallel probability computation. During the probability computation, correlations and provenance are succinctly encoded using probabilistic events. This thesis contains novel contributions in several directions. An expressive user language – a subset of Python – is introduced, which allows a programmer to implement algorithms for probabilistic data without requiring knowledge of the underlying probabilistic model. Furthermore, an event language is presented, which is used for the probabilistic interpretation of the user program. The event language can succinctly encode arbitrary correlations using events, which are the probabilistic counterparts of deterministic user program variables. These highly interconnected events are stored in an event network, a probabilistic interpretation of the original user program. Multiple techniques for exact and approximate probability computation (with error guarantees) of such event networks are presented, as well as techniques for parallel computation. Adaptations of multiple existing data mining algorithms are shown to work in the framework, and are subsequently subjected to an extensive experimental evaluation. Additionally, a use-case is presented in which a probabilistic adaptation of a clustering algorithm is used to predict faults in energy distribution networks. Lastly, this thesis presents techniques for integrating a number of different probabilistic data formalisms for use in this framework and in other applications.
APA, Harvard, Vancouver, ISO, and other styles
22

Kamenieva, Iryna. "Research Ontology Data Models for Data and Metadata Exchange Repository." Thesis, Växjö University, School of Mathematics and Systems Engineering, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:vxu:diva-6351.

Full text
Abstract:

For research in the fields of data mining and machine learning, a necessary condition is the availability of various input data sets, and researchers now create databases of such sets. Examples of such systems are the UCI Machine Learning Repository, the Data Envelopment Analysis Dataset Repository, the XMLData Repository and the Frequent Itemset Mining Dataset Repository. Along with the statistical repositories mentioned above, a whole range of stores, from simple file stores to specialized repositories, can be used by researchers when solving applied tasks and investigating their own algorithms and scientific problems. It might seem that the only difficulty for the user is searching through and understanding the structure of such scattered information stores. However, a detailed study of such repositories reveals deeper problems in the use of the data: in particular, a complete mismatch between rigid data file structures and SDMX (Statistical Data and Metadata Exchange), the standard and structure used by many European organizations; the impossibility of preparing the data in advance for a concrete applied task; and the lack of a history of data usage for particular scientific and applied tasks.

There are now many data mining methods, as well as large quantities of data stored in various repositories. The repositories, however, contain no data mining (DM) methods, and the methods are not linked to application areas. An essential problem is therefore linking the subject (problem) domain, DM methods and the datasets appropriate for each method. In this work we consider the problem of building ontological models of DM methods, describing the interaction between methods and the corresponding data from repositories, and providing intelligent agents that allow the user of a statistical repository to choose the appropriate method and the data corresponding to the task being solved. A system structure is proposed, and an intelligent search agent operating on the ontological model of DM methods and taking the user's personal requests into account is implemented.

For the implementation of an intelligent data and metadata exchange repository, an agent-oriented approach has been selected, and the model uses a service-oriented architecture. The cross-platform programming language Java, the multi-agent platform Jadex, the database server Oracle Spatial 10g, and the ontology development environment Protégé version 3.4 are used.

APA, Harvard, Vancouver, ISO, and other styles
23

Shokat, Imran. "Computational Analyses of Scientific Publications Using Raw and Manually Curated Data with Applications to Text Visualization." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-78995.

Full text
Abstract:
Text visualization is a field dedicated to the visual representation of textual data by using computer technology. A large number of visualization techniques are available, and now it is becoming harder for researchers and practitioners to choose an optimal technique for a particular task among the existing techniques. To overcome this problem, the ISOVIS Group developed an interactive survey browser for text visualization techniques. ISOVIS researchers gathered papers which describe text visualization techniques or tools and categorized them according to a taxonomy. Several categories were manually assigned to each visualization technique. In this thesis, we aim to analyze the dataset of this browser. We carried out several analyses to find temporal trends and correlations of the categories present in the browser dataset. In addition, a comparison of these categories with a computational approach has been made. Our results show that some categories became more popular than before whereas others have declined in popularity. The cases of positive and negative correlation between various categories have been found and analyzed. Comparison between manually labeled datasets and results of computational text analyses were presented to the experts with an opportunity to refine the dataset. Data which is analyzed in this thesis project is specific to text visualization field, however, methods that are used in the analyses can be generalized for applications to other datasets of scientific literature surveys or, more generally, other manually curated collections of textual documents.
APA, Harvard, Vancouver, ISO, and other styles
24

Xiaojun, Chen, and Premlal Bhattrai. "A Method for Membership Card Generation Based on Clustering and Optimization Models in A Hypermarket." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-2227.

Full text
Abstract:
Context: Data mining is used to find interesting and valuable knowledge in the huge amounts of data stored in databases or data warehouses. It encompasses classification, clustering, association rule learning, etc., whose goals are to improve commercial decisions and behaviors in organizations. Among these, hierarchical clustering is commonly used in the data selection and preprocessing step for customer segmentation in business enterprises; however, this method does not handle overlapping or diverse clusters very well. We therefore attempt to combine clustering and optimization into an integrated and sequential approach that can be employed for segmenting customers and subsequently generating membership cards: clustering is used to segment customers into groups, while optimization aids in generating the required membership cards. Objectives: Our master's thesis project aims to develop a methodological approach for customer segmentation based on customer characteristics, in order to define membership cards based on a mathematical optimization model in a hypermarket. Methods: In this thesis, a literature review of articles was conducted using five reputable databases: IEEE, Google Scholar, Science Direct, Springer and Engineering Village. This was done to provide a background study and to gain knowledge about current research in the field of clustering- and optimization-based methods for membership card generation in a hypermarket. Further, we also employed video interviews and a proof-of-concept implementation of our solution as research methodologies. The interviews allowed us to collect raw data from the hypermarket, while testing on the data produced preliminary results; this was important because the data could be regarded as a guideline for evaluating the performance of customer segmentation and membership card generation. Results: We built clustering and optimization models as a two-step sequential method. In the first step, the clustering model was used to segment customers into different clusters; in the second step, our optimization model was used to produce different types of membership cards. We tested a dataset consisting of 100 customer records, consequently obtaining five clusters and five types of membership cards. Conclusions: This research provides a basis for customer segmentation and membership card generation in a hypermarket by way of data mining techniques and optimization. Through our research, an integrated and sequential approach combining clustering and optimization can suitably be used for customer segmentation and membership card generation.
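The abstract proposes a two-step sequence: cluster customers, then optimize membership card assignment. A heavily simplified, hypothetical sketch (a tiny 1-D k-means on invented spend figures, with a greedy tier assignment standing in for the optimization model) might be:

```python
import random
from statistics import mean

def kmeans(points, k=3, iterations=20):
    """Tiny k-means for 1-D customer features (e.g. monthly spend)."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: abs(p - centroids[i]))].append(p)
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

# Step 1: segment customers by monthly spend (hypothetical data).
random.seed(1)
spend = [12, 15, 18, 20, 80, 95, 110, 300, 350, 420]
centroids, clusters = kmeans(spend, k=3)

# Step 2: "optimization" stand-in -- order segments by average spend and assign card tiers.
tiers = ["bronze", "silver", "gold"]
ordered = sorted(range(len(centroids)), key=lambda i: centroids[i])
for tier, idx in zip(tiers, ordered):
    print(f"{tier:6s} card: {len(clusters[idx])} customers, avg spend ~{centroids[idx]:.0f}")
```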
APA, Harvard, Vancouver, ISO, and other styles
25

Pietruszewski, Przemyslaw. "Association rules analysis for objects hierarchy." Thesis, Blekinge Tekniska Högskola, Avdelningen för programvarusystem, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-3512.

Full text
Abstract:
Association rules are one of the most popular methods of data mining. This technique allows the discovery of interesting dependencies between objects. The thesis concerns association rules for a hierarchy of objects. The DBLP database, which contains bibliographic descriptions of scientific papers from conferences and journals in computer science, is used as the multi-level structure. The main goal of the thesis is the investigation of interesting patterns of co-authorship with respect to different levels of the hierarchy. To reach this goal, a custom extraction method is proposed.
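The abstract mines association rules over a hierarchy built from DBLP. A minimal support/confidence sketch over invented co-author sets (DBLP parsing and the hierarchy omitted) could be:

```python
from itertools import combinations
from collections import Counter

# Hypothetical co-author sets, one per paper (in the thesis these come from DBLP records).
papers = [
    {"Kowalski", "Nowak"},
    {"Kowalski", "Nowak", "Schmidt"},
    {"Schmidt", "Meyer"},
    {"Kowalski", "Nowak"},
    {"Meyer", "Nowak"},
]

def association_rules(itemsets, min_support=2, min_confidence=0.6):
    """Rules A -> B between single authors, with support and confidence."""
    single = Counter(a for s in itemsets for a in s)
    pair = Counter(frozenset(p) for s in itemsets for p in combinations(sorted(s), 2))
    rules = []
    for p, support in pair.items():
        if support < min_support:
            continue
        a, b = tuple(p)
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / single[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, round(confidence, 2)))
    return rules

for lhs, rhs, support, confidence in association_rules(papers):
    print(f"{lhs} -> {rhs}  support={support}  confidence={confidence}")
```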
APA, Harvard, Vancouver, ISO, and other styles
26

Kurin, Erik, and Adam Melin. "Data-driven test automation : augmenting GUI testing in a web application." Thesis, Linköpings universitet, Programvara och system, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-96380.

Full text
Abstract:
For many companies today, it is highly valuable to collect and analyse data in order to support decision making and functions of various sorts. However, this kind of data-driven approach is seldom applied to software testing, and there is often a lack of verification that the testing performed is relevant to how the system under test is used. Therefore, the aim of this thesis is to investigate the possibility of introducing a data-driven approach to test automation by extracting user behaviour data and curating it to form input for testing. A prestudy was initially conducted in order to collect and assess different data sources for augmenting the testing. After suitable data sources were identified, the required data, including data about user activity in the system, was extracted. This data was then processed, and three prototypes were built on top of it. The first prototype augments model-based testing by automatically creating models of the most common user behaviour using data mining algorithms. The second prototype tests the most frequently occurring client actions. The last prototype visualises which features of the system are not covered by automated regression testing. The data extracted and analysed in this thesis facilitates the understanding of the behaviour of the users in the system under test, and the three prototypes built on this data can be used to assist other testing methods by visualising test coverage and executing regression tests.
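The second prototype in the abstract tests the most frequently occurring client actions. A minimal, hypothetical sketch of that idea over invented usage events is:

```python
from collections import Counter

# Hypothetical client-action events extracted from usage logs of the system under test.
events = ["open_invoice", "search", "open_invoice", "export_pdf", "search",
          "open_invoice", "edit_customer", "search", "open_invoice"]

def test_priorities(events, top_n=3):
    """Most frequent user actions first, as candidates for automated regression tests."""
    return Counter(events).most_common(top_n)

for action, count in test_priorities(events):
    print(f"test_{action}: exercised {count} times by real users")
```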
APA, Harvard, Vancouver, ISO, and other styles
27

Macedo, Charles Mendes de. "Aplicação de algoritmos de agrupamento para descoberta de padrões de defeito em software JavaScript." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/100/100131/tde-29012019-152129/.

Full text
Abstract:
Applications developed in JavaScript are increasing every day, not only on the web (client-side) but also on the server (server-side) and on mobile devices. In this context, tools for identifying faults and code smells are fundamental to assist developers during the evolution of these applications. Most of these tools use a list of predefined faults that are discovered from the observation of programming best practices and developer intuition. To improve such tools, the automatic discovery of faults and code smells is important, because it allows identifying which ones actually occur in practice and how frequently. A tool that implements a semiautomatic strategy for discovering fault patterns by clustering the changes made during project development is BugAID. The objective of this work is to contribute to BugAID by extending it with improvements in the extraction of the characteristics used by the clustering algorithms; the extended module in charge of characteristic extraction is called BugAIDExtract++. In addition, this work evaluates several clustering algorithms for discovering fault patterns in JavaScript software.
APA, Harvard, Vancouver, ISO, and other styles
28

Åström, Gustav. "Kognitiva tjänster på en myndighet : Förstudie om hur Lantmäteriet kan tillämpa IBM Watson." Thesis, Mittuniversitetet, Avdelningen för informationssystem och -teknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-30902.

Full text
Abstract:
Many milestones have been passed in computer science, and we are currently on our way to passing yet another: artificial intelligence. One of the characteristics of AI is the ability to interpret so-called unstructured data, i.e., data that lacks structure. Unstructured data can be useful, and with the new tools within AI it is possible to interpret it and use it to solve problems. This has the potential to be useful in practical applications such as case handling and decision support. The work was done at Apendo AB, which has the Swedish National Land Survey (Lantmäteriet) as a customer, and investigates how AI-driven cognitive services through IBM Watson can be applied at the Swedish National Land Survey. The goal is to answer the following questions: Is it already possible to apply cognitive services through Watson's services to give decision support to the Swedish National Land Survey? In what ways can Watson's services be used to create decision support? How effective can the solution for the Swedish National Land Survey be, i.e., how much time and cost could be saved by using Watson's services for the chosen concept? As a practical part of the study of AI, a perceptron was developed and evaluated. Through an agile approach, tests and studies of IBM Watson took place in parallel with interviews with employees at the Swedish National Land Survey. The tests were performed in the PaaS service IBM Bluemix with both Node-RED and a custom-built web application. Through the interviews, the Watson service Retrieve and Rank emerged as interesting and was examined more closely. With Retrieve and Rank, questions can be answered by ranking selected corpus passages, which are then trained for better answers. Uploading the corpus with related questions resulted in 75% of the questions being answered correctly. A possible application for the Swedish National Land Survey is thus a cognitive, trainable search function that helps case administrators search for information in handbooks and the statute book.
APA, Harvard, Vancouver, ISO, and other styles
29

Alsalama, Ahmed. "A Hybrid Recommendation System Based on Association Rules." TopSCHOLAR®, 2013. http://digitalcommons.wku.edu/theses/1250.

Full text
Abstract:
Recommendation systems are widely used in e-commerce applications. The engine of a current recommendation system recommends items to a particular user based on user preferences and previous high ratings. Various recommendation schemes, such as collaborative filtering and content-based approaches, are used to build a recommendation system. Most current recommendation systems were developed to fit a certain domain such as books, articles or movies. We propose a hybrid recommendation framework to be applied to two-dimensional spaces (User × Item) with a large number of users and a small number of items. Moreover, our proposed framework makes use of both favorite and non-favorite items of a particular user. The proposed framework is built upon the integration of association rules mining and the content-based approach. The results of experiments show that our proposed framework can provide accurate recommendations to users.
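The abstract integrates association rule mining with a content-based approach and uses both favorite and non-favorite items. A minimal, rule-based sketch of the association side (invented user profiles; the content-based part omitted) might be:

```python
from collections import Counter
from itertools import combinations

# Hypothetical user profiles: items each user rated high (likes) and low (dislikes).
profiles = {
    "u1": {"likes": {"A", "B", "C"}, "dislikes": {"D"}},
    "u2": {"likes": {"A", "B"},      "dislikes": set()},
    "u3": {"likes": {"B", "C", "D"}, "dislikes": {"A"}},
    "u4": {"likes": {"A", "C"},      "dislikes": {"B"}},
}

def recommend(target, profiles, min_confidence=0.5):
    """Recommend items implied by association rules like {liked item} -> {item},
    skipping anything the target user already likes or explicitly dislikes."""
    liked_sets = [p["likes"] for p in profiles.values()]
    item_count = Counter(i for s in liked_sets for i in s)
    pair_count = Counter(frozenset(p) for s in liked_sets for p in combinations(sorted(s), 2))
    target_likes = profiles[target]["likes"]
    target_dislikes = profiles[target]["dislikes"]
    scores = {}
    for pair, support in pair_count.items():
        a, b = tuple(pair)
        for lhs, rhs in ((a, b), (b, a)):
            if lhs in target_likes and rhs not in target_likes | target_dislikes:
                confidence = support / item_count[lhs]
                if confidence >= min_confidence:
                    scores[rhs] = max(scores.get(rhs, 0.0), confidence)
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(recommend("u2", profiles))   # items suggested for user u2, ranked by rule confidence
```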
APA, Harvard, Vancouver, ISO, and other styles
30

Taylor, Phillip. "Data mining of vehicle telemetry data." Thesis, University of Warwick, 2015. http://wrap.warwick.ac.uk/77645/.

Full text
Abstract:
Driving is a safety-critical task that requires a high level of attention and workload from the driver. Despite this, people often perform secondary tasks such as eating or using a mobile phone, which increase workload levels and divert cognitive and physical attention from the primary task of driving. As well as these distractions, the driver may also be overloaded for other reasons, such as dealing with an incident on the road or holding conversations in the car. One solution to this distraction problem is to limit the functionality of in-car devices while the driver is overloaded. This can take the form of withholding an incoming phone call or delaying the display of a non-urgent piece of information about the vehicle. In order to design and build these adaptations in the car, we must first have an understanding of the driver's current level of workload. Traditionally, driver workload has been monitored using physiological sensors or camera systems in the vehicle. However, physiological systems are often intrusive and camera systems can be expensive and are unreliable in poor light conditions. It is important, therefore, to use methods that are non-intrusive, inexpensive and robust, such as sensors already installed on the car and accessible via the Controller Area Network (CAN)-bus. This thesis presents a data mining methodology for this problem, as well as for others in domains with similar types of data, such as human activity monitoring. It focuses on the variable selection stage of the data mining process, where inputs are chosen for models to learn from and make inferences. Selecting inputs from vehicle telemetry data is challenging because there are many irrelevant variables with a high level of redundancy. Furthermore, data in this domain often contain biases because only relatively small amounts can be collected and processed, leading to some variables appearing more relevant to the classification task than they really are. Over the course of this thesis, a detailed variable selection framework that addresses these issues for telemetry data is developed. A novel blocked permutation method is developed and applied to mitigate biases when selecting variables from potentially biased temporal data. This approach is computationally infeasible when variable redundancies are also considered, and so a novel permutation redundancy measure with similar properties is proposed. Finally, a known redundancy structure between features in telemetry data is used to enhance the feature selection process in two ways. First, the benefits of performing raw signal selection, feature extraction, and feature selection in different orders are investigated. Second, a two-stage variable selection framework is proposed and the two permutation-based methods are combined. Throughout the thesis, it is shown through classification evaluations and inspection of the features that these permutation-based selection methods are appropriate for use in selecting features from CAN-bus data.
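A hedged sketch of the blocked permutation idea, shuffling whole contiguous blocks of one variable so that short-range temporal structure survives and then measuring the drop in accuracy, is given below; the block size, model and synthetic data are illustrative assumptions, not the thesis's implementation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def blocked_permutation_importance(model, X, y, block_size=50, n_repeats=5, seed=0):
    """Shuffle whole contiguous blocks of one variable at a time so that
    temporal structure inside each block is preserved, then record the
    drop in accuracy as that variable's importance."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    n_blocks = int(np.ceil(len(X) / block_size))
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            blocks = [Xp[i * block_size:(i + 1) * block_size, j] for i in range(n_blocks)]
            order = rng.permutation(n_blocks)
            Xp[:, j] = np.concatenate([blocks[i] for i in order])
            drops.append(baseline - accuracy_score(y, model.predict(Xp)))
        importances[j] = np.mean(drops)
    return importances

# Usage sketch: fit a model on telemetry-like data, then score the variables.
X = np.random.default_rng(1).normal(size=(500, 6))
y = (X[:, 0] + 0.1 * np.random.default_rng(2).normal(size=500) > 0).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(blocked_permutation_importance(model, X, y).round(3))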
APA, Harvard, Vancouver, ISO, and other styles
31

Kanellopoulos, Yiannis. "Supporting software systems maintenance using data mining techniques." Thesis, University of Manchester, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.496254.

Full text
Abstract:
Data mining, with its ability to handle large amounts of data and uncover hidden patterns, has the potential to facilitate the comprehension and maintainability evaluation of a software system. Source code artefacts and measurement values can be used as input to data mining algorithms in order to provide insights into a system's structure or to create groups of artefacts with similar software measurements. This thesis investigates the applicability and suitability of data mining techniques to facilitate the comprehension and maintainability evaluation of a software system's source code.
APA, Harvard, Vancouver, ISO, and other styles
32

Maden, Engin. "Data Mining On Architecture Simulation." Master's thesis, METU, 2010. http://etd.lib.metu.edu.tr/upload/2/12611635/index.pdf.

Full text
Abstract:
Data mining is the process of extracting patterns from huge amounts of data. One of the branches of data mining is mining sequence data, where the data can be viewed as a sequence of events, each with an associated time of occurrence. Sequence data is modelled using episodes, and events are included in episodes. The aim of this thesis work is to analyse architecture simulation output data by applying episode mining techniques, showing the previously known relationships between the events in the architecture and providing an environment to predict the performance of a program on an architecture before executing the code. One of the most important points here is the application area of episode mining techniques. Architecture simulation data is a new domain in which to apply these techniques, and using their results to make predictions about the performance of programs on an architecture before execution can be considered a new approach. For this purpose, a data mining tool has been developed that implements three episode mining techniques: the WINEPI approach, the non-overlapping occurrence-based approach and the MINEPI approach. This tool has three main components: a data pre-processor, an episode miner and an output analyser.
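For readers unfamiliar with WINEPI, a simplified Python sketch of its window-based frequency estimate for a parallel episode is shown below; the event log is invented, and the handling of window boundaries follows one common convention rather than the thesis's exact definition.

def winepi_frequency(events, episode, window_width):
    """events: list of (timestamp, event_type); episode: set of event types.
    Returns the fraction of sliding windows of the given width that
    contain every event type of the (parallel) episode."""
    if not events:
        return 0.0
    times = sorted(t for t, _ in events)
    start, end = times[0] - window_width + 1, times[-1]
    total = end - start + 1
    hits = 0
    for w_start in range(start, end + 1):
        window = {e for t, e in events if w_start <= t < w_start + window_width}
        if episode <= window:
            hits += 1
    return hits / total

# Toy event sequence from an (illustrative) architecture simulation log.
events = [(1, "cache_miss"), (2, "stall"), (4, "cache_miss"), (5, "branch_miss"), (6, "stall")]
print(winepi_frequency(events, {"cache_miss", "stall"}, window_width=3))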
APA, Harvard, Vancouver, ISO, and other styles
33

Artchounin, Daniel. "Tuning of machine learning algorithms for automatic bug assignment." Thesis, Linköpings universitet, Programvara och system, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-139230.

Full text
Abstract:
In software development projects, bug triage consists mainly of assigning bug reports to software developers or teams (depending on the project). The partial or total automation of this task would have a positive economic impact on many software projects. This thesis introduces a systematic four-step method to find some of the best configurations of several machine learning algorithms intended to solve the automatic bug assignment problem. These four steps are used, respectively, to select a combination of pre-processing techniques, a bug report representation and a potential feature selection technique, and to tune several classifiers. The aforementioned method has been applied to three software projects: 66 066 bug reports of a proprietary project, 24 450 bug reports of Eclipse JDT and 30 358 bug reports of Mozilla Firefox. 619 configurations have been applied and compared on each of these three projects. In production, using the approach introduced in this work on the bug reports of the proprietary project would have increased the accuracy by up to 16.64 percentage points.
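The flavour of such a configuration search can be sketched with a small, assumed scikit-learn pipeline that jointly varies the bug-report representation and a classifier hyperparameter; the reports, teams and parameter grid below are purely illustrative and much smaller than anything studied in the thesis.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# Illustrative bug reports and the teams they were resolved by.
reports = ["NullPointerException in editor on save",
           "UI freezes when opening large project",
           "Crash in renderer when scrolling",
           "Editor loses unsaved changes on exit"]
teams = ["core", "ui", "ui", "core"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LinearSVC()),
])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],   # bug-report representation
    "clf__C": [0.1, 1.0, 10.0],               # classifier tuning
}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(reports, teams)
print(search.best_params_, search.best_score_)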
APA, Harvard, Vancouver, ISO, and other styles
34

Kagdi, Huzefa H. "Mining Software Repositories to Support Software Evolution." Kent State University / OhioLINK, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=kent1216149768.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Wang, Grant J. (Grant Jenhorn) 1979. "Algorithms for data mining." Thesis, Massachusetts Institute of Technology, 2006. http://hdl.handle.net/1721.1/38315.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.
Includes bibliographical references (p. 81-89).
Data of massive size are now available in a wide variety of fields and come with great promise. In theory, these massive data sets allow data mining and exploration on a scale previously unimaginable. However, in practice, it can be difficult to apply classic data mining techniques to such massive data sets due to their sheer size. In this thesis, we study three algorithmic problems in data mining with consideration to the analysis of massive data sets. Our work is both theoretical and experimental - we design algorithms and prove guarantees for their performance and also give experimental results on real data sets. The three problems we study are: 1) finding a matrix of low rank that approximates a given matrix, 2) clustering high-dimensional points into subsets whose points lie in the same subspace, and 3) clustering objects by pairwise similarities/distances.
by Grant J. Wang.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
36

Bala, Saimir. "Mining Projects from Structured and Unstructured Data." Jens Gulden, Selmin Nurcan, Iris Reinhartz-Berger, Widet Guédria, Palash Bera, Sérgio Guerreiro, Michael Fellman, Matthias Weidlich, 2017. http://epub.wu.ac.at/7205/1/ProjecMining%2DCamera%2DReady.pdf.

Full text
Abstract:
Companies working on safety-critical projects must adhere to strict rules imposed by the domain, especially when human safety is involved. These projects need to be compliant to standard norms and regulations. Thus, all the process steps must be clearly documented in order to be verifiable for compliance in a later stage by an auditor. Nevertheless, documentation often comes in the form of manually written textual documents in different formats. Moreover, the project members use diverse proprietary tools. This makes it difficult for auditors to understand how the actual project was conducted. My research addresses the project mining problem by exploiting logs from project-generated artifacts, which come from software repositories used by the project team.
APA, Harvard, Vancouver, ISO, and other styles
37

Dai, Jianyong. "Detecting malicious software by dynamic execution." Orlando, Fla. : University of Central Florida, 2009. http://purl.fcla.edu/fcla/etd/CFE0002798.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Liebchen, Gernot Armin. "Data cleaning techniques for software engineering data sets." Thesis, Brunel University, 2010. http://bura.brunel.ac.uk/handle/2438/5951.

Full text
Abstract:
Data quality is an important issue which has been addressed and recognised in research communities such as data warehousing, data mining and information systems. It has been agreed that poor data quality will impact the quality of results of analyses and that it will therefore impact on decisions made on the basis of these results. Empirical software engineering has neglected the issue of data quality to some extent. This fact poses the question of how researchers in empirical software engineering can trust their results without addressing the quality of the analysed data. One widely accepted definition for data quality describes it as `fitness for purpose', and the issue of poor data quality can be addressed by either introducing preventative measures or by applying means to cope with data quality issues. The research presented in this thesis addresses the latter with the special focus on noise handling. Three noise handling techniques, which utilise decision trees, are proposed for application to software engineering data sets. Each technique represents a noise handling approach: robust filtering, where training and test sets are the same; predictive filtering, where training and test sets are different; and filtering and polish, where noisy instances are corrected. The techniques were first evaluated in two different investigations by applying them to a large real world software engineering data set. In the first investigation the techniques' ability to improve predictive accuracy in differing noise levels was tested. All three techniques improved predictive accuracy in comparison to the do-nothing approach. The filtering and polish was the most successful technique in improving predictive accuracy. The second investigation utilising the large real world software engineering data set tested the techniques' ability to identify instances with implausible values. These instances were flagged for the purpose of evaluation before applying the three techniques. Robust filtering and predictive filtering decreased the number of instances with implausible values, but substantially decreased the size of the data set too. The filtering and polish technique actually increased the number of implausible values, but it did not reduce the size of the data set. Since the data set contained historical software project data, it was not possible to know the real extent of noise detected. This led to the production of simulated software engineering data sets, which were modelled on the real data set used in the previous evaluations to ensure domain specific characteristics. These simulated versions of the data set were then injected with noise, such that the real extent of the noise was known. After the noise injection the three noise handling techniques were applied to allow evaluation. This procedure of simulating software engineering data sets combined the incorporation of domain specific characteristics of the real world with the control over the simulated data. This is seen as a special strength of this evaluation approach. The results of the evaluation of the simulation showed that none of the techniques performed well. Robust filtering and filtering and polish performed very poorly, and based on the results of this evaluation they would not be recommended for the task of noise reduction. The predictive filtering technique was the best performing technique in this evaluation, but it did not perform significantly well either. 
An exhaustive systematic literature review has been carried out investigating to what extent the empirical software engineering community has considered data quality. The findings showed that the issue of data quality has been largely neglected by the empirical software engineering community. The work in this thesis highlights an important gap in empirical software engineering. It provided clarification and distinctions of the terms noise and outliers. Noise and outliers are overlapping, but they are fundamentally different. Since noise and outliers are often treated the same in noise handling techniques, a clarification of the two terms was necessary. To investigate the capabilities of noise handling techniques a single investigation was deemed as insufficient. The reasons for this are that the distinction between noise and outliers is not trivial, and that the investigated noise cleaning techniques are derived from traditional noise handling techniques where noise and outliers are combined. Therefore three investigations were undertaken to assess the effectiveness of the three presented noise handling techniques. Each investigation should be seen as a part of a multi-pronged approach. This thesis also highlights possible shortcomings of current automated noise handling techniques. The poor performance of the three techniques led to the conclusion that noise handling should be integrated into a data cleaning process where the input of domain knowledge and the replicability of the data cleaning process are ensured.
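A minimal sketch of the predictive-filtering idea described above, training on one part of the data and flagging instances elsewhere whose label disagrees with the prediction, is shown below; the use of scikit-learn, the fold count and the synthetic data are assumptions for illustration only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold

def predictive_filter(X, y, n_splits=5, seed=0):
    """Flag instances whose label disagrees with a decision tree trained
    on the other folds (training and test sets are different)."""
    noisy = np.zeros(len(y), dtype=bool)
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                     random_state=seed).split(X):
        tree = DecisionTreeClassifier(random_state=seed).fit(X[train_idx], y[train_idx])
        noisy[test_idx] = tree.predict(X[test_idx]) != y[test_idx]
    return noisy

# Illustrative use: drop (filter) or relabel (polish) the flagged instances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
y[:10] = 1 - y[:10]                      # inject some label noise
flags = predictive_filter(X, y)
X_clean, y_clean = X[~flags], y[~flags]  # filtering; polishing would relabel instead
print(flags[:10].sum(), "of the 10 injected noisy labels were flagged")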
APA, Harvard, Vancouver, ISO, and other styles
39

Gu, Zhuoer. "Mining previously unknown patterns in time series data." Thesis, University of Warwick, 2017. http://wrap.warwick.ac.uk/99207/.

Full text
Abstract:
The emerging importance of distributed computing systems raises the need to gain a better understanding of system performance. As a major indicator of system performance, analysing CPU host load helps evaluate system performance in many ways. Discovering similar patterns in CPU host load is very useful, since many applications rely on the patterns mined from the CPU host load, such as pattern-based prediction, classification and relative rule mining of CPU host load. Essentially, the problem of mining patterns in CPU host load is one of mining time series data. Due to the complexity of the problem, many traditional mining techniques for time series data are no longer suitable. Compared to mining known patterns in time series, mining unknown patterns is a much more challenging task. In this thesis, we investigate the major difficulties of the problem and develop techniques for mining unknown patterns by extending the traditional techniques for mining known patterns. We develop two different CPU host load discovery methods, the segment-based method and the reduction-based method, to optimize the pattern discovery process. The segment-based method works by extracting segment features while the reduction-based method works by reducing the size of the raw data. The segment-based pattern discovery method maps the CPU host load segments to a 5-dimensional space and then applies the DBSCAN clustering method to discover similar segments. The reduction-based method reduces the dimensionality and numerosity of the CPU host load to reduce the search space. A cascade method is proposed to support accurate pattern mining while maintaining efficiency. The investigations into the CPU host load data inspired us to further develop a pattern mining algorithm for general time series data. The method filters out unlikely starting positions for reoccurring patterns at an early stage and then iteratively locates all best-matching patterns. The results obtained by our method do not contain any meaningless patterns, which has been a problematic issue in this area for a long time. Compared to state-of-the-art techniques, our method is more efficient and effective in most scenarios.
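A hedged sketch of the segment-based method's general shape, mapping each host-load segment to a small feature vector and clustering with DBSCAN, follows; the particular features, parameters and synthetic load trace are illustrative and not the thesis's actual 5-dimensional mapping.

import numpy as np
from sklearn.cluster import DBSCAN

def segment_features(load, segment_length):
    """Map each host-load segment to a small feature vector
    (mean, std, min, max, slope); the exact features are illustrative."""
    feats = []
    for i in range(0, len(load) - segment_length + 1, segment_length):
        seg = load[i:i + segment_length]
        slope = np.polyfit(np.arange(segment_length), seg, 1)[0]
        feats.append([seg.mean(), seg.std(), seg.min(), seg.max(), slope])
    return np.array(feats)

# Synthetic trace alternating between idle and busy regimes.
rng = np.random.default_rng(0)
low = rng.normal(0.2, 0.02, 600)     # idle periods
high = rng.normal(0.8, 0.05, 600)    # busy periods
host_load = np.concatenate([low[:300], high[:300], low[300:], high[300:]])

features = segment_features(host_load, segment_length=60)
labels = DBSCAN(eps=0.2, min_samples=3).fit_predict(features)
print(labels)   # similar segments share a cluster label; -1 marks outliers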
APA, Harvard, Vancouver, ISO, and other styles
40

Somaraki, Vassiliki. "A framework for trend mining with application to medical data." Thesis, University of Huddersfield, 2013. http://eprints.hud.ac.uk/id/eprint/23482/.

Full text
Abstract:
This thesis presents research work conducted in the field of knowledge discovery. It presents an integrated trend-mining framework and SOMA, which is the application of the trend-mining framework in diabetic retinopathy data. Trend mining is the process of identifying and analysing trends in the context of the variation of support of the association/classification rules that have been extracted from longitudinal datasets. The integrated framework concerns all major processes from data preparation to the extraction of knowledge. At the pre-process stage, data are cleaned, transformed if necessary, and sorted into time-stamped datasets using logic rules. At the next stage, time-stamp datasets are passed through the main processing, in which the ARM technique of matrix algorithm is applied to identify frequent rules with acceptable confidence. Mathematical conditions are applied to classify the sequences of support values into trends. Afterwards, interestingness criteria are applied to obtain interesting knowledge, and a visualization technique is proposed that maps how objects are moving from the previous to the next time stamp. A validation and verification (external and internal validation) framework is described that aims to ensure that the results at the intermediate stages of the framework are correct and that the framework as a whole can yield results that demonstrate causality. To evaluate the thesis, SOMA was developed. The dataset is, in itself, also of interest, as it is very noisy (in common with other similar medical datasets) and does not feature a clear association between specific time stamps and subsets of the data. The Royal Liverpool University Hospital has been a major centre for retinopathy research since 1991. Retinopathy is a generic term used to describe damage to the retina of the eye, which can, in the long term, lead to visual loss. Diabetic retinopathy is used to evaluate the framework, to determine whether SOMA can extract knowledge that is already known to the medics. The results show that those datasets can be used to extract knowledge that can show causality between patients’ characteristics such as the age of patient at diagnosis, type of diabetes, duration of diabetes, and diabetic retinopathy.
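The core bookkeeping of trend mining, tracking a rule's support across time-stamped datasets and labelling the resulting sequence, can be sketched as follows; the trend labels, tolerance and toy snapshots are assumptions for illustration, not the framework's mathematical conditions.

def rule_support(dataset, rule):
    """Fraction of records in one time-stamped dataset that satisfy the rule
    (here a rule is simply a set of items, for illustration)."""
    return sum(rule <= record for record in dataset) / len(dataset)

def classify_trend(supports, tolerance=0.02):
    diffs = [b - a for a, b in zip(supports, supports[1:])]
    if all(d > tolerance for d in diffs):
        return "increasing"
    if all(d < -tolerance for d in diffs):
        return "decreasing"
    if all(abs(d) <= tolerance for d in diffs):
        return "constant"
    return "jumping"

# Three time stamps of an illustrative longitudinal dataset.
snapshots = [
    [{"a", "b"}, {"a"}, {"b"}, {"a"}],
    [{"a", "b"}, {"a", "b"}, {"b"}, {"a"}],
    [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"b"}],
]
supports = [rule_support(s, {"a", "b"}) for s in snapshots]
print(supports, classify_trend(supports))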
APA, Harvard, Vancouver, ISO, and other styles
41

Dai, Jianyong. "DETECTING MALICIOUS SOFTWARE BY DYNAMICEXECUTION." Doctoral diss., University of Central Florida, 2009. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2849.

Full text
Abstract:
The traditional way to detect malicious software is based on signature matching. However, signature matching only detects known malicious software. In order to detect unknown malicious software, it is necessary to analyze the software for its impact on the system when the software is executed. In one approach, the software code can be statically analyzed for any malicious patterns. Another approach is to execute the program and determine the nature of the program dynamically. Since the execution of malicious code may have a negative impact on the system, the code must be executed in a controlled environment. For that purpose, we have developed a sandbox to protect the system. Potential malicious behavior is intercepted by hooking Win32 system calls. Using the developed sandbox, we detect unknown viruses using dynamic instruction sequence mining techniques. By collecting runtime instruction sequences in basic blocks, we extract instruction sequence patterns based on instruction associations. We build classification models with these patterns. By applying these classification models, we predict the nature of an unknown program. We compare our approach with several other approaches such as simple heuristics, NGram and static instruction sequences. We have also developed a method to identify a family of malicious software utilizing the system call trace. We construct a structural system call diagram from captured dynamic system call traces. We generate a smart system call signature using a profile hidden Markov model (PHMM) based on modularized system call blocks. The smart system call signature weakly identifies a family of malicious software.
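As a loose illustration of classifying programs from runtime instruction sequences, the sketch below builds instruction 2-gram features and fits a simple classifier; the traces, labels and choice of Naive Bayes are invented for the example and do not reproduce the dissertation's models.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative runtime instruction traces (space-separated mnemonics).
traces = ["push mov call pop ret", "mov xor jmp call ret",
          "push push call int ret", "xor xor jmp int call"]
labels = ["benign", "benign", "malicious", "malicious"]

# Instruction 2-grams stand in for the instruction-association patterns.
vectorizer = CountVectorizer(ngram_range=(2, 2), token_pattern=r"\S+")
X = vectorizer.fit_transform(traces)
model = MultinomialNB().fit(X, labels)
print(model.predict(vectorizer.transform(["push call int ret"])))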
Ph.D.
School of Electrical Engineering and Computer Science
Engineering and Computer Science
Computer Science PhD
APA, Harvard, Vancouver, ISO, and other styles
42

Roberts, J. (Juho). "Iterative root cause analysis using data mining in software testing processes." Master's thesis, University of Oulu, 2016. http://urn.fi/URN:NBN:fi:oulu-201604271548.

Full text
Abstract:
In order to remain competitive, companies need to be constantly vigilant and aware of the current trends in the industry in which they operate. The terms big data and data mining have exploded in popularity in recent years, and will continue to do so with the launch of the internet of things (IoT) and the 5th generation of mobile networks (5G) in the next decade. Companies need to recognize the value of the big data they are generating in their day-to-day operations, and learn how and why to exploit data mining techniques to extract the most knowledge out of the data their customers and the company itself are generating. The root cause analysis of faults uncovered during base station system testing is a difficult process due to the profound complexity caused by the multi-disciplinary nature of a base station system, and the sheer volume of log data outputted by the numerous system components. The goal of this research is to investigate whether data mining can be exploited to conduct root cause analysis. It took the form of action research and was conducted in industry at an organisational unit responsible for the research and development of mobile base station equipment. In this thesis, we survey existing literature on how data mining has been used to address root cause analysis. We then propose a novel approach that introduces data mining into iterations of the root cause analysis process. We use the data mining tool Splunk in this thesis as an example; however, the practices presented in this research can be applied to other similar tools. We conduct root cause analysis by mining system logs generated by mobile base stations, to investigate which system component is causing the base station to fall short of its performance specifications. We then evaluate and validate our hypotheses by conducting a training session for the test engineers to collect feedback on the suitability of data mining in their work. The results from the evaluation show that, amongst other benefits, data mining makes root cause analysis more efficient, but also makes bug reporting in the target organisation more complete. We conclude that data mining techniques can be a significant asset in root cause analysis. The efficiency gains are significant in comparison to the manual root cause analysis which is currently being conducted at the target organisation.
APA, Harvard, Vancouver, ISO, and other styles
43

Poyias, Andreas. "Engineering compact dynamic data structures and in-memory data mining." Thesis, University of Leicester, 2018. http://hdl.handle.net/2381/42282.

Full text
Abstract:
Compact and succinct data structures use space that approaches the information-theoretic lower bound on the space required to represent the data. In practice, their memory footprint is orders of magnitude smaller than that of normal data structures, and at the same time they are competitive in speed. A main drawback of many of these data structures is that they do not support dynamic operations efficiently. It can be exceedingly expensive to rebuild a static data structure each time an update occurs. In this thesis, we propose a number of novel compact dynamic data structures, including m-Bonsai, which is a compact tree representation, and compact dynamic rewritable (CDRW) arrays, which are a compact representation of variable-length bit-strings. These data structures can answer queries efficiently and perform updates fast while maintaining their small memory footprint. In addition to designing these data structures, we analyze them theoretically, implement them and finally test them to show their good practical performance. Many data mining algorithms require data structures that can query and dynamically update data in memory. One such algorithm is FP-growth. It is one of the fastest algorithms for the solution of Frequent Itemset Mining, which is one of the most fundamental problems in data mining. FP-growth reads the entire data into memory, updates the data structures in memory and performs a series of queries on the given data. We propose a compact implementation of the FP-growth algorithm, PFP-growth. Based on our experimental evaluation, our implementation is one order of magnitude more space efficient than the classic implementation of FP-growth and 2-3 times more space efficient than a more recent, carefully engineered implementation. At the same time it is competitive in terms of speed.
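For context, the output that FP-growth (and thus PFP-growth) computes, namely the frequent itemsets of a transaction database, can be illustrated with a deliberately naive Python sketch; the transactions and threshold are invented, and real FP-growth avoids this brute-force enumeration by building a compact prefix tree.

from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force frequent itemset counting; FP-growth computes the same
    result far more efficiently, but the output is what matters here."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

transactions = [{"bread", "milk"}, {"bread", "beer", "eggs"},
                {"milk", "beer", "cola"}, {"bread", "milk", "beer"},
                {"bread", "milk", "cola"}]
for itemset, support in sorted(frequent_itemsets(transactions, 0.6).items()):
    print(itemset, support)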
APA, Harvard, Vancouver, ISO, and other styles
44

Kriukov, Illia. "Multi-version software quality analysis through mining software repositories." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-74424.

Full text
Abstract:
The main objective of this thesis is to identify how software repository features influence software quality during software evolution. To do that, the mining software repositories field was drawn upon. This field analyzes the rich data from software repositories to extract interesting and actionable information about software systems, projects and software engineering. The ability to measure code quality and analyze the impact of software repository features on software quality allows us to better understand project history, the project's quality state and development processes, and to conduct future project analysis. Existing work in the area of software quality describes software quality analysis without a connection to software repository features. It thus loses important information that can be used for preventing bugs, decision-making and optimizing development processes. To conduct the analysis, a specific tool was developed which covers quality measurement and repository feature extraction. During the research, a general procedure for software quality analysis was defined, described and applied in practice. It was found that there is no single most influential repository feature; a correlation between software quality and software repository features exists, but it is too small to have a real influence.
APA, Harvard, Vancouver, ISO, and other styles
45

Kidwell, Billy R. "MiSFIT: Mining Software Fault Information and Types." UKnowledge, 2015. http://uknowledge.uky.edu/cs_etds/33.

Full text
Abstract:
As software becomes more important to society, the number, age, and complexity of systems grow. Software organizations require continuous process improvement to maintain the reliability, security, and quality of these software systems. Software organizations can utilize data from manual fault classification to meet their process improvement needs, but they often lack the expertise or resources to implement it correctly. This dissertation addresses the need for the automation of software fault classification. Validation results show that automated fault classification, as implemented in the MiSFIT tool, can group faults of similar nature. The resulting classifications show good agreement for common software faults with no manual effort. To evaluate the method and tool, I develop and apply an extended change taxonomy to classify the source code changes that repaired software faults from an open source project. MiSFIT clusters the faults based on the changes. I manually inspect a random sample of faults from each cluster to validate the results. The automatically classified faults are used to analyze the evolution of a software application over seven major releases. The contributions of this dissertation are an extended change taxonomy for software fault analysis, a method to cluster faults by the syntax of the repair, empirical evidence that fault distribution varies according to the purpose of the module, and the identification of project-specific trends from the analysis of the changes.
APA, Harvard, Vancouver, ISO, and other styles
46

Tibbetts, Kevin (Kevin Joseph). "Data mining for structure type prediction." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/34413.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Materials Science and Engineering, 2004.
Includes bibliographical references (p. 41-42).
Determining the stable structure types of an alloy is critical to determining many properties of that material. This can be done through experiment or computation. Both methods can be expensive and time consuming. Computational methods require energy calculations of hundreds of structure types. Computation time would be greatly improved if this large number of possible structure types was reduced. A method is discussed here to predict the stable structure types for an alloy based on compiled data. This would include experimentally observed stable structure types and calculated energies of structure types. In this paper I will describe the state of this technology. This will include an overview of past and current work. Curtarolo et al. showed a factor of three improvement in the number of calculations required to determine a given percentage of the ground state structure types for an alloy system by using correlations among a database of over 6000 calculated energies. I will show correlations among experimentally determined stable structure types appearing in the same alloy system through statistics computed from the Pauling File Inorganic Materials Database Binaries edition. I will compare a method to predict stable structure types based on correlations among pairs of structure types that appear in the same alloy system with a method based simply on the frequency of occurrence of each structure type. I will show a factor of two improvement in the number of calculations required to determine the ground state structure types between these two methods. This paper will examine the potential market value for a software tool used to predict likely stable structure types. A timeline for introduction of this product and an analysis of the market for such a tool will be included. There is no established market for structure type prediction software, but the market will be similar to that of materials database software and energy calculation software. The potential market is small, but the production and maintenance costs are also small. These small costs, combined with the potential of this tool to improve greatly over time, make this a potentially promising investment. These methods are still in development. The key to the value of this tool lies in the accuracy of the prediction methods developed over the next few years.
by Kevin Tibbetts.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
47

Hu, Weikun. "Overdue invoice forecasting and data mining." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/104327.

Full text
Abstract:
Thesis: S.M. in Transportation, Massachusetts Institute of Technology, Department of Civil and Environmental Engineering, 2016.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 64-67).
Accounts receivable is one of the main challenges in business operations. With poor management of the invoice-to-cash collection process, overdue invoices may pile up, and the increasing amount of unpaid invoices may lead to cash flow problems. In this thesis, I addressed a proactive approach to improving accounts receivable management using predictive modeling. To complete the task, I built supervised learning models to identify delayed invoices in advance and made recommendations on improving the performance of the order-to-cash collection process. The main procedures of the research work are data cleaning and processing, statistical analysis, building machine learning models and evaluating model performance. The analysis and modeling of the study are based on real-world invoice data from a Fortune 500 company. The thesis also discusses approaches to dealing with imbalanced data, which include sampling techniques, performance measurements and ensemble algorithms. The invoice data used in this thesis are imbalanced, because the on-time and delayed invoice classes are not approximately equally represented. The cost-sensitive learning techniques demonstrate a favorable improvement in classification results. The results of the thesis reveal that the supervised machine learning models can predict the potential late payment of invoices with high accuracy.
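One common form of cost-sensitive learning for such imbalanced invoice data is to reweight the rare class during training; the sketch below assumes scikit-learn and synthetic data and is only indicative of the kind of model the thesis describes, not its actual features or algorithms.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic imbalanced invoice data: only a small fraction is paid late.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 4))                       # e.g. amount, customer history, terms
late = (X[:, 0] + 0.5 * rng.normal(size=n) > 1.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, late, test_size=0.3,
                                          stratify=late, random_state=0)
# class_weight="balanced" penalises mistakes on the rare "late" class more,
# one simple form of cost-sensitive learning for imbalanced data.
model = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))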
by Weikun Hu.
S.M. in Transportation
APA, Harvard, Vancouver, ISO, and other styles
48

Kim, Edward Soo. "Data-mining natural language materials syntheses." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/122075.

Full text
Abstract:
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Materials Science and Engineering, 2019
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references.
Discovering, designing, and developing a novel material is an arduous task, involving countless hours of human effort and ingenuity. While some aspects of this process have been vastly accelerated by the advent of first-principles-based computational techniques and high throughput experimental methods, a vast ocean of untapped historical knowledge lies dormant in the scientific literature. Namely, the precise methods by which many inorganic compounds are synthesized are recorded only as text within journal articles. This thesis aims to realize the potential of this data for informing the syntheses of inorganic materials through the use of data-mining algorithms. Critically, the methods used and produced in this thesis are fully automated, thus maximizing the impact for accelerated synthesis planning by human researchers.
There are three primary objectives of this thesis: 1) aggregate and codify synthesis knowledge contained within scientific literature, 2) identify synthesis "driving factors" for different synthesis outcomes (e.g., phase selection) and 3) autonomously learn synthesis hypotheses from the literature and extend these hypotheses to predicted syntheses for novel materials. Towards the first goal of this thesis, a pipeline of algorithms is developed in order to extract and codify materials synthesis information from journal articles into a structured, machine readable format, analogous to existing databases for materials structures and properties. To efficiently guide the extraction of materials data, this pipeline leverages domain knowledge regarding the allowable relations between different types of information (e.g., concentrations often correspond to solutions).
Both unsupervised and supervised machine learning algorithms are also used to rapidly extract synthesis information from the literature. To examine the autonomous learning of driving factors for morphology selection during hydrothermal syntheses, TiO₂ nanotube formation is found to be correlated with NaOH concentrations and reaction temperatures, using models that are given no internal chemistry knowledge. Additionally, the capacity for transfer learning is shown by predicting phase symmetry in materials systems unseen by models during training, outperforming heuristic physically-motivated baseline strategies, and again with chemistry-agnostic models. These results suggest that synthesis parameters possess some intrinsic capability for predicting synthesis outcomes. The nature of this linkage between synthesis parameters and synthesis outcomes is then further explored by performing virtual synthesis parameter screening using generative models.
Deep neural networks (variational autoencoders) are trained to learn low-dimensional representations of synthesis routes on augmented datasets, created by aggregated synthesis information across materials with high structural similarity. This technique is validated by predicting ion-mediated polymorph selection effects in MnO₂, using only data from the literature (i.e., without knowledge of competing free energies). This method of synthesis parameter screening is then applied to suggest a new hypothesis for solvent-driven formation of the rare TiO₂ phase, brookite. To extend the capability of synthesis planning with literature-based generative models, a sequence-based conditional variational autoencoder (CVAE) neural network is developed. The CVAE allows a materials scientist to query the model for synthesis suggestions of arbitrary materials, including those that the model has not observed before.
In a demonstrative experiment, the CVAE suggests the correct precursors for literature-reported syntheses of two perovskite materials using training data published more than a decade prior to the target syntheses. Thus, the CVAE is used as an additional materials synthesis screening utility that is complementary to techniques driven by density functional theory calculations. Finally, this thesis provides a broad commentary on the status quo for the reporting of written materials synthesis methods, and suggests a new format which improves both human and machine readability. The thesis concludes with comments on promising future directions which may build upon the work described in this document.
by Edward Soo Kim.
Ph. D.
Ph.D. Massachusetts Institute of Technology, Department of Materials Science and Engineering
APA, Harvard, Vancouver, ISO, and other styles
49

Wang, Jie. "MATRIX DECOMPOSITION FOR DATA DISCLOSURE CONTROL AND DATA MINING APPLICATIONS." UKnowledge, 2008. http://uknowledge.uky.edu/gradschool_diss/677.

Full text
Abstract:
Access to huge amounts of various data with private information brings out a dual demand for the preservation of data privacy and the correctness of knowledge discovery, which are two apparently contradictory tasks. Low-rank approximations generated by matrix decompositions are a fundamental element in this dissertation for privacy-preserving data mining (PPDM) applications. Two categories of PPDM are studied: data value hiding (DVH) and data pattern hiding (DPH). A matrix-decomposition-based framework is designed to incorporate matrix decomposition techniques into data preprocessing to distort original data sets. With respect to the challenge in DVH, namely how to protect sensitive/confidential attribute values without jeopardizing underlying data patterns, we propose singular value decomposition (SVD)-based and nonnegative matrix factorization (NMF)-based models. Some discussion on data distortion and data utility metrics is presented. Our experimental results on benchmark data sets demonstrate that our proposed models have the potential to outperform standard data perturbation models regarding the balance between data privacy and data utility. Based on an equivalence between NMF and K-means clustering, a simultaneous data value and pattern hiding strategy is developed for data mining activities using K-means clustering. Three schemes are designed to make slight alterations to submatrices such that user-specified cluster properties of data subjects are hidden. Performance evaluation demonstrates the efficacy of the proposed strategy, since some optimal solutions can be computed with zero side effects on nonconfidential memberships. Accordingly, the protection of privacy is simplified to one modified data set, with enhanced performance from this dual privacy protection. In addition, an improved incremental SVD-updating algorithm is applied to speed up the real-time performance of the SVD-based model for frequent data updates. The performance and effectiveness of the improved algorithm have been examined on synthetic and real data sets. Experimental results indicate that the introduction of the incremental matrix decomposition produces a significant speedup. It also provides potential support for the use of the SVD technique in On-Line Analytical Processing for business data analysis.
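The basic value-distortion mechanism behind an SVD-based model can be sketched as replacing the data with a low-rank approximation; the rank, synthetic data and utility proxy below are illustrative assumptions rather than the dissertation's actual metrics or experiments.

import numpy as np

def svd_distort(data, rank):
    """Replace the data with its rank-k approximation: individual values are
    perturbed while the dominant patterns that mining algorithms rely on remain."""
    U, s, Vt = np.linalg.svd(data, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
original = latent @ rng.normal(size=(3, 8)) + 0.5 * rng.normal(size=(100, 8))
distorted = svd_distort(original, rank=3)

# Individual values change (a crude privacy indicator) ...
print("mean absolute change:", np.abs(original - distorted).mean().round(3))
# ... while the column correlation structure (a crude utility proxy) stays close.
print("mean correlation shift:",
      np.abs(np.corrcoef(original.T) - np.corrcoef(distorted.T)).mean().round(3))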
APA, Harvard, Vancouver, ISO, and other styles
50

Dondero, Robert Michael Jr Hislop Gregory W. "Predicting software change coupling /." Philadelphia, Pa. : Drexel University, 2008. http://hdl.handle.net/1860/2759.

Full text
APA, Harvard, Vancouver, ISO, and other styles
