
Dissertations / Theses on the topic 'Knowledge Discovery in Databases (KDD)'


Consult the top 50 dissertations / theses for your research on the topic 'Knowledge Discovery in Databases (KDD).'


1

Storti, Emanuele. "KDD process design in collaborative and distributed environments." Doctoral thesis, Università Politecnica delle Marche, 2012. http://hdl.handle.net/11566/242061.

Full text
Abstract:
Knowledge Discovery in Databases (KDD), like scientific experimentation in e-Science, is a complex and computationally intensive process aimed at gaining knowledge from huge volumes of data. Often performed in distributed settings, KDD projects usually involve deep interaction among heterogeneous tools and several users with specific expertise, including technical specialists and experts in the domain under analysis. Given the high complexity of the process, such users need effective support to achieve their goal of knowledge extraction. This work presents the Knowledge Discovery in Database Virtual Mart (KDDVM), a user- and knowledge-centric framework aimed at supporting the design of KDD processes in a highly distributed and collaborative scenario, in which computational resources and actors dynamically interoperate to share and elaborate knowledge. The contribution of the work is twofold. Firstly, a conceptual systematisation of the relevant knowledge is provided, with the aim of formalising, through semantic technologies and at various levels of abstraction, each element taking part in the design and execution of a KDD process, including computational resources, data and actors. Secondly, an implementation of the framework is proposed as an open, modular and extensible service-oriented platform, in which several services are available both to perform basic operations of data manipulation (preprocessing, modelling and postprocessing) and to support more advanced functionalities, among them the deployment and activation of heterogeneous tools, the syntactic and semantic discovery of services in repositories, and intelligent support for the semi-automatic composition of services into KDD processes. Since the cooperative design and execution of a distributed KDD process typically require several skills, both technical and managerial, collaboration can easily become a source of complexity if it is not supported by some form of coordination. For this reason, a set of platform functionalities is specifically addressed to supporting collaboration within a distributed team, by providing an environment in which users can work on the same project and share processes, results and ideas.
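As an illustration of the kind of semi-automatic, service-oriented process composition described in this abstract, the following minimal Python sketch chains KDD services by matching their declared input and output types. The service names, artefact types and greedy planner are hypothetical illustrations and are not taken from the KDDVM platform.

```python
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    consumes: str  # artefact type required as input
    produces: str  # artefact type produced as output

# Hypothetical service catalogue (not the actual KDDVM registry).
CATALOG = [
    Service("csv_loader", consumes="file", produces="raw_table"),
    Service("missing_value_imputer", consumes="raw_table", produces="clean_table"),
    Service("discretizer", consumes="clean_table", produces="prepared_table"),
    Service("decision_tree_learner", consumes="prepared_table", produces="model"),
    Service("model_evaluator", consumes="model", produces="report"),
]

def compose(start: str, goal: str, catalog=CATALOG):
    """Greedily chain services whose input type matches the current artefact
    type until the goal type is produced (a toy stand-in for semi-automatic
    process composition)."""
    current, plan = start, []
    while current != goal:
        candidates = [s for s in catalog if s.consumes == current]
        if not candidates:
            raise ValueError(f"no service consumes artefact type '{current}'")
        service = candidates[0]          # a real planner would rank candidates
        plan.append(service.name)
        current = service.produces
    return plan

if __name__ == "__main__":
    # Build a preprocessing -> modelling -> postprocessing pipeline.
    print(compose("file", "report"))
    # ['csv_loader', 'missing_value_imputer', 'discretizer',
    #  'decision_tree_learner', 'model_evaluator']
```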
2

Huynh, Xuan-Hiep. "Interestingness Measures for Association Rules in a KDD Process : PostProcessing of Rules with ARQAT Tool." PhD thesis, Université de Nantes, 2006. http://tel.archives-ouvertes.fr/tel-00482649.

Full text
Abstract:
This work takes place in the framework of Knowledge Discovery in Databases (KDD), often called "data mining". This domain is both a major research topic and an application field in companies. KDD aims at discovering previously unknown and useful knowledge in large databases. In the last decade many studies have been published about association rules, which are frequently used in data mining. Association rules, which are implicative tendencies in the data, have the advantage of being an unsupervised model; on the other hand, they often deliver a large number of rules. As a consequence, a postprocessing task is required to help the user understand the results. One way to reduce the number of rules - to validate or to select the most interesting ones - is to use interestingness measures adapted both to the user's goals and to the dataset studied. Selecting the right interestingness measures is an open problem in KDD. Many measures have been proposed to extract knowledge from large databases, and many authors have introduced interestingness properties for selecting a suitable measure for a given application: some measures are adequate for some applications but not for others. In this thesis, we propose to study the set of interestingness measures available in the literature, in order to evaluate their behaviour according to the nature of the data and the preferences of the user. The final objective is to guide the user's choice towards the measures best adapted to his or her needs and, ultimately, to select the most interesting rules. For this purpose, we propose a new approach implemented in a new tool, ARQAT (Association Rule Quality Analysis Tool), in order to facilitate the analysis of the behaviour of about 40 interestingness measures. In addition to elementary statistics, the tool allows a thorough analysis of the correlations between measures using correlation graphs based on the coefficients suggested by Pearson, Spearman and Kendall. These graphs are also used to identify clusters of similar measures. Moreover, we propose a series of comparative studies of the correlations between interestingness measures on several datasets. We discovered a set of correlations that are not very sensitive to the nature of the data used, which we called stable correlations. Finally, 14 graphical and complementary views structured on 5 levels of analysis - ruleset analysis, correlation and clustering analysis, most-interesting-rules analysis, sensitivity analysis, and comparative analysis - are illustrated in order to show the interest of both the exploratory approach and the use of complementary views.
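The correlation-graph idea described above can be sketched in a few lines of Python: given a table of measure values computed for each rule, compute pairwise rank correlations and link the measures whose correlation exceeds a threshold. This is a generic illustration, not the ARQAT implementation; the measure columns, the 0.85 threshold and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical values of four interestingness measures over 200 rules.
n_rules = 200
conf = rng.uniform(0.1, 1.0, n_rules)
measures = pd.DataFrame({
    "confidence": conf,
    "lift": conf * rng.uniform(0.8, 1.2, n_rules) + 0.5,
    "conviction": 1.0 / (1.0 - 0.95 * conf),
    "noise": rng.normal(size=n_rules),
})

def correlation_graph(df: pd.DataFrame, method: str = "kendall", tau: float = 0.85):
    """Return the edges of the correlation graph: pairs of measures whose
    absolute rank correlation is at least `tau`."""
    corr = df.corr(method=method)  # pandas supports pearson/spearman/kendall
    cols = corr.columns
    return [(a, b, round(corr.loc[a, b], 2))
            for i, a in enumerate(cols) for b in cols[i + 1:]
            if abs(corr.loc[a, b]) >= tau]

# Measures connected here behave similarly and can be grouped into one cluster.
print(correlation_graph(measures))
```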
3

Ribeiro, Lamark dos Santos. "Uma abordagem semântica para seleção de atributos no processo de KDD." Universidade Federal da Paraíba, 2010. http://tede.biblioteca.ufpb.br:8080/handle/tede/6048.

Full text
Abstract:
Two topics of great importance to computing are currently being used together in an increasingly visible way: Knowledge Discovery in Databases (KDD) and ontologies. With the improvement of the ways in which data are stored, the amount of information available for analysis has grown exponentially, making techniques necessary to analyse these data and obtain knowledge for the most diverse purposes. In this context, the KDD process introduces stages that enable the discovery of useful, novel knowledge with characteristics that usually cannot be seen by inspecting the data in raw form. In a complementary field, knowledge discovery in databases can benefit from ontologies, which are able to store "knowledge" about particular domains in a model of high semantic expressiveness; this knowledge can be retrieved through inference over classes, descriptions, properties and constraints. Among the phases of the knowledge discovery process, attribute selection allows the analysis space for data mining algorithms to be improved with the attributes most relevant to the problem at hand. Sometimes, however, these selection methods do not eliminate irrelevant attributes satisfactorily, because they do not allow a prior analysis of the domain being treated. To address this problem, this work proposes a system that uses ontologies to store prior knowledge about a specific domain, enabling a semantic analysis that was previously not feasible with conventional methodologies. An ontology specific to the medical domain, with specifications common to the main areas of medicine, was built by reusing several ontology repositories available on the Web. To introduce semantics into attribute selection, a mapping is first performed between the database attributes and the classes of the ontology. Once this mapping is done, the user can select attributes by semantic category, reduce the dimensionality of the data and also visualise redundancies between semantically related attributes.
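The mapping-and-selection step described above can be illustrated with a small Python sketch: database columns are mapped to ontology classes, attributes are then selected by semantic category, and columns that map to the same class are flagged as potentially redundant. The attribute names, ontology classes and mapping are hypothetical examples, not taken from the thesis.

```python
from collections import defaultdict

# Hypothetical mapping from database columns to ontology (category, class) pairs.
ATTRIBUTE_TO_CLASS = {
    "systolic_bp":  ("CardiovascularSign", "BloodPressure"),
    "diastolic_bp": ("CardiovascularSign", "BloodPressure"),
    "heart_rate":   ("CardiovascularSign", "HeartRate"),
    "glucose_mgdl": ("LaboratoryResult",   "BloodGlucose"),
    "hba1c_pct":    ("LaboratoryResult",   "BloodGlucose"),
    "zip_code":     ("Demographic",        "Address"),
}

def select_by_category(category: str):
    """Keep only the attributes whose ontology class belongs to `category`."""
    return [a for a, (cat, _) in ATTRIBUTE_TO_CLASS.items() if cat == category]

def semantic_redundancies(attributes):
    """Group attributes that map to the same ontology class; groups with more
    than one member are candidates for dimensionality reduction."""
    groups = defaultdict(list)
    for a in attributes:
        groups[ATTRIBUTE_TO_CLASS[a][1]].append(a)
    return {cls: attrs for cls, attrs in groups.items() if len(attrs) > 1}

selected = select_by_category("CardiovascularSign")
print(selected)                          # ['systolic_bp', 'diastolic_bp', 'heart_rate']
print(semantic_redundancies(selected))   # {'BloodPressure': ['systolic_bp', 'diastolic_bp']}
```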
4

Storopoli, José Eduardo. "O uso do Knowledge Discovery in Database (KDD) de informações patentárias sobre ensino a distância: contribuições para instituições de ensino superior." Universidade Nove de Julho, 2016. http://bibliotecatede.uninove.br/handle/tede/1517.

Full text
Abstract:
Distance learning (DL) has a long history of successes and failures and has existed since at least the end of the 18th century. Higher-education DL began in Brazil around 1994, with the expansion of the internet as its main driver. The search for innovations and new models related to the DL process has become important, from both an operational and a strategic standpoint. Facing these challenges, the information available in patent databases can contribute significantly to the definition of DL strategies in higher education institutions (HEIs). The objective of this thesis is therefore to analyse the use of Knowledge Discovery in Databases (KDD) on patent information and its possible contributions to DL in HEIs. The method employed was the KDD framework for the exploration, analysis, selection, pre-processing, cleaning, transformation, data mining, interpretation and evaluation of patent information from the European Patent Office (EPO) database, which holds approximately 90 million documents. Data were collected with the crawler software Patent2Net v.2, and the sample of patents obtained with refined search expressions resulted in 3,090 patents, which were analysed by means of pivot tables, network analysis, mind maps, content analysis and clustering. The main results: (1) provided a diagnosis of DL-related patents worldwide; (2) developed a methodology for using KDD to analyse the content of DL patent information for HEIs; (3) mapped DL patents held by universities; and, finally, (4) evaluated the use of patent information in formulating strategies for adopting DL in HEIs, in the light of the Resource-Based View.
5

Oliveira, Robson Butaca Taborelli de. "O processo de extração de conhecimento de base de dados apoiado por agentes de software." Universidade de São Paulo, 2000. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-23092001-231242/.

Full text
Abstract:
Nowadays, commercial and scientific application systems generate huge amounts of data that cannot easily be analysed without appropriate techniques and tools. Many of these applications are also Internet-oriented, with their data distributed across the network, which makes tasks such as data collection even more difficult. The field of Knowledge Discovery in Databases concerns the techniques and tools used to discover, automatically, knowledge embedded in data. In a computer network environment, some steps of the KDD process, such as data collection and processing, are harder to carry out, so new technologies can be used to help execute the knowledge discovery process. Software agents are computer programs with properties such as autonomy, reactivity and mobility that can be used for this purpose. In this context, the goal of this work is to present the proposal of a multi-agent system, called Minador, designed to support the execution and management of the Knowledge Discovery in Databases process.
6

Scarinci, Rui Gureghian. "SES : sistema de extração semântica de informações." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 1997. http://hdl.handle.net/10183/18398.

Full text
Abstract:
Among the areas of computer science that have developed most in recent years are those related to the growth of the Internet, which connects millions of users all over the world. This network offers users an enormous variety and quantity of information, mostly data stored in unstructured or semi-structured form. However, such volume and heterogeneity make the manipulation of data retrieved from the Internet very difficult, and this problem motivated the present work. Even with the help of several Internet search tools, when researching specific subjects the user still has to handle a large amount of information on his personal computer, because these tools do not perform a detailed selection process: much of the retrieved data is of no interest to the user. There is also a great diversity of subjects and of standards for transferring and storing information, creating highly heterogeneous environments for searching and querying data; because of this heterogeneity, the user must know a whole set of standards and tools in order to obtain the desired information. The greatest difficulty, however, lies in unstructured or weakly structured storage formats such as text files, e-mail messages and news articles. In these formats, understanding a document requires reading it, which often wastes time, since the document may not be of interest, or may be of interest but only need to be read in full later. Much information, such as calls for papers, product prices and economic statistics, has a temporal validity, and other information is updated periodically. Some of these temporal characteristics are explicit, while others are implicit among other types of data, which makes retrieving such information automatically very difficult and often leads to the use of outdated information or to missed opportunities. The large volume of personal files obtained from the Internet therefore creates a complex management task, as a consequence of the unstructured nature of the retrieved documents and of the complexity of analysing the validity period inherent in these data. With the objective of satisfying the user's needs for selecting and then manipulating the information available locally (on a personal computer), this work describes a system for extracting and summarising these data, using concepts from Information Extraction (IE) and Knowledge-Based Systems. The data processed are partially structured or unstructured and are handled by an extractor configured from knowledge bases created by the user of the system. The final objective of this dissertation is the implementation of the Semantic Information Extraction System (SES), which classifies the extracted data into classes that are meaningful to the user and determines the temporal validity of these data, generating a structured database.
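A minimal Python sketch of the two ideas named above, classifying a document into user-defined classes and checking the temporal validity of extracted dates, is shown below. The keyword rules, date pattern and sample text are hypothetical and far simpler than the knowledge bases used in SES.

```python
import re
from datetime import date

# Hypothetical user-defined knowledge base: class name -> trigger keywords.
CLASSES = {
    "call_for_papers": ["call for papers", "submission deadline", "workshop"],
    "price_list":      ["price", "discount", "order now"],
}
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")  # ISO dates only, for brevity

def classify(text: str):
    """Return the classes whose keywords appear in the document."""
    lowered = text.lower()
    return [name for name, kws in CLASSES.items()
            if any(kw in lowered for kw in kws)]

def temporal_validity(text: str, today: date):
    """Extract dates and report whether the latest one is still in the future."""
    dates = [date(int(y), int(m), int(d)) for y, m, d in DATE_RE.findall(text)]
    if not dates:
        return None
    return {"latest_date": max(dates), "still_valid": max(dates) >= today}

doc = "Call for papers: KDD applications workshop. Submission deadline 2026-03-15."
print(classify(doc))                               # ['call_for_papers']
print(temporal_validity(doc, today=date(2025, 12, 1)))
```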
7

Moretti, Caio Benatti. "Análise de grandezas cinemáticas e dinâmicas inerentes à hemiparesia através da descoberta de conhecimento em bases de dados." Universidade de São Paulo, 2016. http://www.teses.usp.br/teses/disponiveis/18/18149/tde-13062016-184240/.

Full text
Abstract:
As a result of higher life expectancy worldwide, the probability of natural accidents and physical trauma in everyday life grows, which leads to an increasing demand for rehabilitation. Physical therapy, under the paradigm of robotic rehabilitation with serious games, offers the patient greater motivation and engagement in the treatment, and its use has been recommended by the American Heart Association (AHA), which assigned it the highest rating (Level A) for inpatients and outpatients. However, the rich analytical potential of the data collected by the robotic devices involved is little explored, and information that could be of great value to treatment goes unextracted. The focus of this work is the application of knowledge discovery techniques to classify the performance of patients diagnosed with chronic hemiparesis. The patients were placed in a robotic rehabilitation environment and exercised with the InMotion ARM, a robotic device for upper-limb rehabilitation that also collects performance data. A knowledge-discovery-in-databases roadmap was applied to the collected data, performing preprocessing, transformation (feature extraction) and then data mining with machine learning algorithms. The strategy of this work culminated in a pattern classifier able to distinguish hemiparetic sides with an accuracy of 94%, with eight attributes feeding the input of the resulting model. Interpreting this set of attributes showed that force-related data are the most significant, comprising half of the composition of a sample.
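The classification step described above can be sketched with a standard scikit-learn pipeline: scale the extracted features, train a classifier to distinguish the hemiparetic side, estimate accuracy by cross-validation and inspect which features carry the most weight. The synthetic data, feature names and choice of a random forest are illustrative assumptions, not the models or attributes used in the thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical features extracted from robot sessions: 4 kinematic + 4 force-related.
feature_names = ["mean_speed", "path_error", "smoothness", "reach_time",
                 "mean_force", "peak_force", "force_variability", "work_done"]
n = 300
X = rng.normal(size=(n, len(feature_names)))
y = (X[:, 4] + 0.8 * X[:, 5] + 0.3 * X[:, 0] + rng.normal(scale=0.5, size=n)) > 0
# y = 1 encodes "left side hemiparetic", 0 "right side" (purely illustrative labels).

model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(n_estimators=200, random_state=0))
scores = cross_val_score(model, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")

# Which attributes dominate? Fit once and rank the feature importances.
model.fit(X, y)
importances = model.named_steps["randomforestclassifier"].feature_importances_
for name, imp in sorted(zip(feature_names, importances), key=lambda p: -p[1]):
    print(f"{name:20s} {imp:.3f}")
```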
8

Schneider, Luís Felipe. "Aplicação do processo de descoberta de conhecimento em dados do poder judiciário do estado do Rio Grande do Sul." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2003. http://hdl.handle.net/10183/8968.

Full text
Abstract:
The exploration of relationships among data opened the way for the search for useful, previously unknown knowledge and information in large sets of stored data. This field was named Knowledge Discovery in Databases (KDD) and was formalised in 1989. KDD consists of a process of stages or phases of an iterative and interactive nature; this work was based on the CRISP-DM methodology. Regardless of the methodology used, this process has one phase that may be considered the core of KDD, data mining (or modelling, in CRISP-DM terms), with which the concept of "class of problem type" is associated, as well as the techniques and algorithms that may be employed in a KDD application. We highlight the association and clustering classes, the techniques associated with them, and the Apriori and k-means algorithms. All of this is covered by the chosen data mining tool, Weka (Waikato Environment for Knowledge Analysis). The research plan centres on applying the KDD process to the Judiciary Power's core activity, the judgement of court cases, looking for discoveries about the influence of the procedural classification on the incidence of proceedings, the duration of proceedings, the types of sentences pronounced and the presence of a hearing. The search for defendants' profiles in criminal proceedings, according to characteristics such as sex, marital status, education level, profession and race, is also explored. Chapters 2 and 3 present the theoretical grounding of KDD and detail the CRISP-DM methodology; chapter 4 explores the full application carried out on the Judiciary Power's data; and, finally, chapter 5 presents the conclusions.
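The two algorithm classes named above can be illustrated with a compact Python stand-in for the Weka workflow: a brute-force search for frequent pairs and their confidence (the core idea behind Apriori) on hypothetical case records, followed by k-means clustering of numeric case features. The attribute values, thresholds and features are invented for illustration and do not come from the Judiciary data.

```python
from itertools import combinations
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical transactional view of court cases: one set of attribute=value items per case.
cases = [
    {"class=civil", "hearing=yes", "sentence=upheld"},
    {"class=civil", "hearing=yes", "sentence=upheld"},
    {"class=criminal", "hearing=no", "sentence=dismissed"},
    {"class=civil", "hearing=no", "sentence=upheld"},
    {"class=criminal", "hearing=yes", "sentence=dismissed"},
]

min_support, min_confidence = 0.4, 0.7
item_counts = Counter(i for c in cases for i in c)
pair_counts = Counter(frozenset(p) for c in cases for p in combinations(sorted(c), 2))

for pair, count in pair_counts.items():
    support = count / len(cases)
    if support < min_support:
        continue
    for antecedent in pair:
        consequent = next(iter(pair - {antecedent}))
        confidence = count / item_counts[antecedent]
        if confidence >= min_confidence:
            print(f"{antecedent} -> {consequent}  support={support:.2f} conf={confidence:.2f}")

# k-means on hypothetical numeric features (duration in days, number of hearings).
X = np.array([[120, 1], [150, 2], [700, 5], [90, 1], [650, 4]], dtype=float)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster labels:", labels)
```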
9

Yu, Xiaobo. "Knowledge discovery in Internet databases." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1998. http://www.collectionscanada.ca/obj/s4/f2/dsk3/ftp04/mq30577.pdf.

Full text
10

Howard, Craig M. "Tools and techniques for knowledge discovery." Thesis, University of East Anglia, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.368357.

Full text
11

Wu, Fei. "Knowledge discovery in time-series databases." Versailles-St Quentin en Yvelines, 2001. http://www.theses.fr/2001VERS0023.

Full text
Abstract:
This thesis addresses three problems in the context of temporal databases: clustering, similarity, and the extraction of strategies. Several questions remain for future work, for example how to achieve gradual clustering with other algorithms. It would be interesting to cluster sequences based on our new model, but the open questions are whether to choose an existing algorithm or whether an entirely new one is needed. To build a strategy, it would also be possible to pre-define our actions and then find the relations between the actions and the corresponding indicators in order to generate strategies...
12

Krogel, Mark-André. "On propositionalization for knowledge discovery in relational databases." [S.l. : s.n.], 2005. http://deposit.ddb.de/cgi-bin/dokserv?idn=976835835.

Full text
13

Corzo, F. A. "Abstraction and structure in knowledge discovery in databases." Thesis, University College London (University of London), 2011. http://discovery.ucl.ac.uk/1126395/.

Full text
Abstract:
Knowledge discovery in databases is the field of computer science that combines different computational, statistical and mathematical techniques in order to create systems that support the process of finding new knowledge in databases. This thesis investigates adaptability and reusability techniques, namely abstraction and structure tactics, applicable to software solutions in the field of knowledge discovery in databases. The research is driven by two business problems in operational loss, specifically fraud and system failure. Artificial Intelligence (AI) techniques, used in computer science to mimic or draw on aspects of human behaviour within information systems, are increasingly being applied to complex and dynamic business processes. Business processes that require analytical processing of large volumes of data, highly changing and complex domain-knowledge-driven data analysis, knowledge management and knowledge discovery are examples of problems that would typically be addressed using AI techniques. To control the business, data and software complexity, the industry has responded with a wide variety of products that have specific software architectures and user environments and include the latest AI trends. Specific fields of research such as knowledge discovery in databases (KDD) [1][2] have been created in order to address the different challenges of supporting the discovery of new knowledge from raw data using AI-related techniques (e.g. data mining). Regardless of all this academic and commercial effort, solutions for specific business processes suffer from adaptability, flexibility and reusability limitations: their software architectures and user-interfacing environments are not adaptable and flexible enough to cope with business process changes or the need to reuse accumulated knowledge (i.e. prior analyses). Consequently, the lifetime of some of these solutions is severely reduced, or increasing effort is required to keep them running. This research is driven by a specific business domain and is conducted in two phases. The first phase focuses on a single intelligent and analytical system solution and aims to capture specific problem-domain requirements that drive the definition of a business-domain-specific KDD reference architecture; through a case study, a detailed analysis of the semantics of fraud detection is carried out and the elements, components and services of an intelligent and analytical fraud detection system are investigated. The second phase takes the architectural observations from phase I to the wider and more generic KDD challenges, defines an operational loss domain model and a reference architecture, and tests its reuse in a different type of operational loss business problem. Software-related KDD challenges are revised and addressed in the reference architecture. A second application, in the domain of detection and prevention of operational loss due to data-related system failure, is analysed through a second case study and is used to test and refine the architecture. The software architectures defined in the different phases of this research are analysed using the Architecture Tradeoff Analysis Method (ATAM) [3], an architecture analysis method that focuses on quality attributes and use cases, in order to evaluate risks and compare their adaptability, flexibility and reusability properties. This thesis makes the following contributions: (1) it constitutes one of the first investigations of adaptability and reusability in business-domain-specific KDD software architecture from an abstraction and structure viewpoint; (2) it defines the TRANSLATIONAL architectural style for high-data-volume and intensive data-analysis systems, which supports the balancing of flexibility, reusability and performance; (3) using the TRANSLATIONAL architectural style, it defines and implements OL-KDA, a reference architecture that can be applied to problems in operational loss, namely fraud and data-related system failure, and supports the complexity and dynamicity challenges; (4) it develops and implements a method for supporting data, dataflow and rules in KDD pre-processing and post-processing tasks; (5) it defines a data manipulation and maintenance model that favours performance and adaptability in specific KDD tasks; and (6) two substantial case studies were developed and analysed in order to understand and subsequently test the defined techniques and reference architecture in business domains.
14

Grissa, Dhouha. "Etude comportementale des mesures d'intérêt d'extraction de connaissances." PhD thesis, Université Blaise Pascal - Clermont-Ferrand II, 2013. http://tel.archives-ouvertes.fr/tel-01023975.

Full text
Abstract:
The search for interesting association rules is an important and active field in data mining. Since the algorithms used in knowledge discovery in databases (KDD) tend to generate a large number of rules, it is difficult for the user to select by himself the knowledge that is genuinely interesting. To address this problem, automatic post-filtering of the rules is essential in order to reduce their number drastically, hence the many interestingness measures proposed in the literature, among which the user is expected to choose the one best suited to his goals. Since interestingness depends both on the user's preferences and on the data, the measures have been divided into two categories: subjective (user-oriented) measures and objective (data-oriented) measures. We focus on the study of objective measures. However, there is a plethora of objective measures in the literature, which does not make the user's choice any easier. Our objective is therefore to help the user with the problem of selecting objective measures through a categorisation approach. The thesis develops two approaches to assist the user in choosing objective measures: (1) a formal study based on the definition of a set of measure properties that lead to a sound evaluation of the measures; (2) an experimental study of the behaviour of the various interestingness measures from a data-analysis point of view. For the first approach, we carry out a thorough theoretical study of a large number of measures according to several formal properties. To do so, we first propose a formalisation of these properties in order to remove any ambiguity about them. We then study, for different objective interestingness measures, the presence or absence of appropriate characteristic properties. The evaluation of the measures then serves as a starting point for categorising them. Different classification methods were applied: (i) non-overlapping methods (agglomerative hierarchical clustering and k-means), which yield disjoint groups of measures; (ii) an overlapping method (Boolean factor analysis), which yields groups of measures that overlap. For the second approach, we propose an empirical study of the behaviour of about sixty measures on datasets of different natures. We thus propose an experimental methodology in which we seek to identify the groups of measures that empirically behave in a similar way. We then compare the two classification results, formal and empirical, in order to validate and highlight our first approach. The two approaches are complementary, with the aim of helping the user make the right choice of the interestingness measure suited to his application.
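As a concrete illustration of what an objective interestingness measure is, the short Python function below computes a few classical measures (support, confidence, lift, Piatetsky-Shapiro, Jaccard) from the 2x2 contingency counts of a rule A -> B, and checks one formal property discussed in such studies, namely the value taken at independence. The counts in the example are invented; the function is a generic sketch, not the categorisation tool developed in the thesis.

```python
def objective_measures(n: int, n_a: int, n_b: int, n_ab: int) -> dict:
    """Classical objective interestingness measures of the rule A -> B,
    computed from contingency counts: n transactions in total, n_a containing A,
    n_b containing B, n_ab containing both A and B."""
    p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
    return {
        "support":           p_ab,
        "confidence":        p_ab / p_a,
        "lift":              p_ab / (p_a * p_b),
        "piatetsky_shapiro": n * (p_ab - p_a * p_b),
        "jaccard":           n_ab / (n_a + n_b - n_ab),
    }

# Invented counts: 1000 transactions, 200 with A, 300 with B, 150 with both.
print(objective_measures(1000, 200, 300, 150))

# Property check: at independence (P(AB) = P(A)P(B)), lift equals 1 and
# the Piatetsky-Shapiro measure equals 0.
independent = objective_measures(1000, 200, 300, 60)   # 60 = 1000 * 0.2 * 0.3
assert abs(independent["lift"] - 1.0) < 1e-9
assert abs(independent["piatetsky_shapiro"]) < 1e-9
```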
15

Rydzi, Daniel. "Metodika vývoje a nasazování Business Intelligence v malých a středních podnicích." Doctoral thesis, Vysoká škola ekonomická v Praze, 2005. http://www.nusl.cz/ntk/nusl-77060.

Full text
Abstract:
This dissertation deals with the development and implementation of Business Intelligence (BI) solutions for small and medium-sized enterprises (SMEs) in the Czech Republic. It represents the culmination of the author's efforts to date to complete a methodological model for developing this kind of application for SMEs using in-house skills and a minimum of external resources and costs. The thesis can be divided into five major parts. The first part, which describes the technologies used, is divided into two chapters: the first describes the contemporary state of the Business Intelligence concept and contains an original taxonomy of BI solutions; the second describes two Knowledge Discovery in Databases (KDD) techniques that were used to build the BI solutions introduced in the case studies. The second part describes the landscape of Czech SMEs, the environment in which the thesis was written and to which it is meant to contribute; one chapter defines how SMEs differ from large corporations and explains the author's reasons for focusing on this area. The third part introduces the results of a survey conducted among Czech SMEs with the support of the Department of Information Technologies of the Faculty of Informatics and Statistics of the University of Economics in Prague. The survey had three objectives: to map the readiness of Czech SMEs for developing and deploying BI solutions; to determine the major problems and consequent decisions of Czech SMEs that could be supported by BI solutions; and to determine the top factors preventing SMEs from developing and deploying BI solutions. The fourth part is the core of the thesis: in two chapters, the original methodology for the development and deployment of BI solutions by SMEs is described, along with the other methodologies that were studied; the original methodology is partly based on the well-known CRISP-DM methodology. Finally, the last part describes a particular company that became a testing ground for the author's theories and supports his research, and presents case studies of the development and deployment of BI solutions in this company, built using contemporary BI and KDD techniques in accordance with the original methodology. In that sense, the case studies verified the theoretical methodology in real use.
16

Ghoorah, Anisah W. "Extraction de connaissances pour la modélisation tri-dimensionnelle de l'interactome structural." Thesis, Université de Lorraine, 2012. http://www.theses.fr/2012LORR0204/document.

Full text
Abstract:
Understanding how the protein interactome works at a structural level could provide useful insights into the mechanisms of diseases. Comparative homology modelling and ab initio protein docking are two computational methods for modelling the three-dimensional (3D) structures of protein-protein interactions (PPIs). Previous studies have shown that both methods give significantly better predictions when they incorporate experimental PPI information. In general, however, PPI information is often not available in an easily accessible way and cannot be re-used by 3D PPI modelling algorithms, so there is currently a need for a reliable framework that facilitates the reuse of PPI data. This thesis presents a systematic knowledge-based approach for representing, describing and manipulating 3D interactions in order to study PPIs on a large scale and to facilitate knowledge-based modelling of protein-protein complexes. The main contributions of this thesis are: (1) it describes the design and implementation of KBDOCK, an integrated database of non-redundant 3D hetero domain-domain interactions (DDIs); (2) it presents a novel method of describing and clustering DDIs according to the spatial orientations of the binding partners, thus introducing the notion of "domain family-level binding sites" (DFBSs); (3) it proposes a structural classification of DFBSs similar to the CATH classification of protein folds, and presents a study of the secondary-structure propensities of DFBSs and their interaction preferences; (4) it introduces a systematic case-based reasoning approach to model, on a large scale, the 3D structures of protein complexes from existing structural DDIs. All these contributions have been made publicly available through a web server (http://kbdock.loria.fr).
17

Hertkorn, Peter. "Knowledge discovery in databases auf der Grundlage dimensionshomogener Funktionen /." Stuttgart : Univ., Inst. f. Statik u. Dynamik d. Luft- u. Raumfahrtkonstruktionen, 2005. http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&doc_number=014636277&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA.

Full text
18

Hertkorn, Peter. "Knowledge discovery in databases auf der Grundlage dimensionshomogener Funktionen." Stuttgart ISD, 2004. http://deposit.ddb.de/cgi-bin/dokserv?id=2710474&prov=M&dok_var=1&dok_ext=htm.

Full text
19

Chowdhury, Israt Jahan. "Knowledge discovery from tree databases using balanced optimal search." Thesis, Queensland University of Technology, 2016. https://eprints.qut.edu.au/92263/1/Israt%20Jahan_Chowdhury_Thesis.pdf.

Full text
Abstract:
This research is a step forward in discovering knowledge from databases with complex structure, such as trees or graphs. Several data mining algorithms are developed based on a novel representation called Balanced Optimal Search for extracting implicit, unknown and potentially useful information, such as patterns, similarities and various relationships, from tree data; these algorithms are also shown to be advantageous in analysing big data. The thesis focuses on analysing unordered tree data, which is robust to data inconsistency, irregularity and swift information changes and has therefore become a popular and widely used data model in the era of big data.
20

Ghoorah, Anisah W. "Extraction de connaissances pour la modélisation tri-dimensionnelle de l'interactome structural." Electronic Thesis or Diss., Université de Lorraine, 2012. http://www.theses.fr/2012LORR0204.

Full text
21

Xie, Tian. "Knowledge discovery and machine learning for capacity optimization of Automatic Milking Rotary System." Thesis, KTH, Kommunikationsteori, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-199630.

Full text
Abstract:
Dairy farming, as one part of agriculture, has a history of thousands of years. The increasing demand for dairy products and the rapid development of technology have brought tremendous changes to dairy farming: starting from hand milking, it has gone through vacuum bucket milking and pipeline milking to today's parlour milking. Automated, technology-based milking systems provide farmers with high-efficiency milking, effective herd management and, above all, growing income. The DeLaval Automatic Milking Rotary (AMRTM) is the world's leading automatic milking rotary system, an ultimate combination of technology and machinery that brings significant benefits to dairy farming. Its technical milking capacity is 90 cows per hour; however, constrained by farm management, cow condition and system configuration, the actual capacity is lower than this technical value. In this thesis, an optimisation system is designed to analyse and improve AMRTM performance, focusing on cow behaviour and AMRTM robot timeouts. By applying knowledge discovery in databases (KDD), building a machine learning system for predicting cow behaviour and developing modelling methods for system simulation, optimisation solutions are proposed and validated.
APA, Harvard, Vancouver, ISO, and other styles
22

Schneider, Ulrike, and Joachim Hagleitner. "Knowledge Discovery in Databases am Beispiel des österreichischen Nonprofit Sektors." Institut für Sozialpolitik, WU Vienna University of Economics and Business, 2005. http://epub.wu.ac.at/1352/1/document.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Healy, Jerome V. "Computational knowledge discovery techniques and their application to options market databases." Thesis, London Metropolitan University, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.426594.

Full text
Abstract:
Financial options are central to controlling investor's risk exposure. However, since 1987 parametric option pricing models have performed poorly in assessing risk levels. Also, electronic trading systems were introduced in this period, and these produce option price quotations at a rate of up to several times per second. There is a large and rapidly expanding amount of data to be analysed. A new generation of techniques for pattern recognition in large datasets has evolved, collectively termed 'computational knowledge discovery techniques' in this work. Preliminary evidence suggests that certain of these techniques are superior to parametric approaches in pricing options. Statistical confidence in models is of paramount importance in finance, hence there is a need for a systems framework for their effective deployment. In this thesis, a dedicated computational framework is developed, for the application of computational knowledge discovery techniques to options market databases. The framework incorporates practical procedures, methods, and algorithms, applicable to many different domains, to determine statistical significance and confidence for data mining models and predictions. To enable a fuller evaluation of the uncertainty of model predictions, these include a new method for estimating pointwise prediction errors, which is computationally efficient for large datasets, and robust to problems of regression and heteroskedasticity typical of options market data. A number of case study examples are used to demonstrate that computational knowledge discovery techniques can yield useful knowledge for the domain, when applied using the framework, its components, and appropriate statistical and diagnostic tests. They address an omission in the literature documenting the application of these techniques to option pricing, which reports few findings based on hypothesis testing. A contribution to the field of nonparametric density estimation is made, by an application of neural nets to the recovery of risk-neutral distributions from put option prices. The findings are also new contributions for finance. Finally, in a discussion of software implementation issues emerging technology trends are identified. Also, a case is made that future vertical data mining solutions for options market applications, should incorporate statistical analysis within the tool, and should provide access to values of partial derivatives of the models.
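The pointwise prediction errors mentioned above can be illustrated with a generic residual-modelling scheme; this is a common textbook device, not the thesis's own method, and everything below (including the "moneyness" feature) is a synthetic stand-in. One model fits the conditional mean, a second model fits the squared residuals, and the latter gives a local error estimate that accommodates heteroskedastic options data:

```python
# Generic illustration (not the thesis's specific method): estimate pointwise
# prediction errors by modelling squared residuals as a function of the inputs,
# which accommodates heteroskedastic noise such as that in options data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0.8, 1.2, size=(5000, 1))          # e.g. moneyness (hypothetical)
noise_sd = 0.02 + 0.1 * (X[:, 0] - 0.8)            # noise grows with moneyness
y = np.maximum(X[:, 0] - 1.0, 0) + rng.normal(0, noise_sd)

mean_model = GradientBoostingRegressor().fit(X, y)
resid2 = (y - mean_model.predict(X)) ** 2
var_model = GradientBoostingRegressor().fit(X, resid2)   # local error variance

X_new = np.array([[0.9], [1.0], [1.1]])
pred = mean_model.predict(X_new)
stderr = np.sqrt(np.clip(var_model.predict(X_new), 0, None))
for m, p, s in zip(X_new[:, 0], pred, stderr):
    print(f"moneyness={m:.2f}  prediction={p:.4f}  pointwise std. error~={s:.4f}")
```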
APA, Harvard, Vancouver, ISO, and other styles
24

Amirbekyan, Artak. "Protocols and Data Structures for Knowledge Discovery on Distributed Private Databases." Thesis, Griffith University, 2007. http://hdl.handle.net/10072/367447.

Full text
Abstract:
Data mining has developed many techniques for automatic analysis of today’s rapidly collected data. Yahoo collects 12 TB daily of query logs and this is a quarter of what Google collects. For many important problems, the data is actually collected in distributed format by different institutions and organisations, and it can relate to businesses and individuals. The accuracy of knowledge that data mining brings for decision making depends on considering the collective datasets that describe a phenomenon. But privacy, confidentiality and trust emerge as major issues in the analysis of partitioned datasets among competitors, governments and other data holders that have conflicts of interest. Managing privacy is of the utmost importance in the emergent applications of data mining. For example, data mining has been identified as one of the most useful tools for the global collective fight on terror and crime [80]. Parties holding partitions of the database are very interested in the results, but may not trust the others with their data, or may be reluctant to release their data freely without some assurances regarding privacy. Data mining technology that reveals patterns in large databases could compromise the information that an individual or an organisation regards as private. The aim is to find the right balance between maximising analysis results (that are useful for each party) and keeping the inferences that disclose private information about organisations or individuals at a minimum. We address two core data analysis tasks, namely clustering and regression. For these to be solvable in the privacy context, we focus on the protocol’s efficiency and practicality. Because associative queries are central to clustering (and to many other data mining tasks), we provide protocols for privacy-preserving k-nearest neighbour (k-NN) queries. Our methods improve previous methods for k-NN queries in privacy-preserving data mining (which are based on Fagin’s A0 algorithm) because we leak at least an order of magnitude fewer candidates and we achieve logarithmic performance on average. The foundations of our methods for k-NN queries are two pillars: firstly, data structures and secondly, metrics. This thesis provides protocols for privacy-preserving computation of various common metrics and for construction of necessary data structures. We present here new algorithms for secure multiparty computation of some basic operations (like a new solution for Yao’s comparison problem and new protocols to perform linear algebra, in particular the scalar product). These algorithms will be used for the construction of protocols for different metrics (we provide protocols for all Minkowski metrics, the cosine metric and the chessboard metric) and for performing associative queries in the privacy context. In order to be efficient, our protocols for associative queries are supported by specific data structures. Thus, we present the construction of privacy-preserving data structures like R-Trees [42, 7], KD-Trees [8, 53, 33] and the SASH [8, 60]. We demonstrate the use of all these tools, and we provide a new version of the well known clustering algorithm DBSCAN [42, 7]. This new version is now suitable for applications that demand privacy. Similarly, we apply our machinery and provide new multi-linear regression protocols that are now suitable for privacy applications. Our algorithms are more efficient than earlier methods and protocols.
In particular, the cost associated with ensuring privacy provides only a linear-cost overhead for most of the protocols presented here. That is, our methods are essentially as costly as concentrating all the data in one site, performing the data-mining task, and disregarding privacy. However, in some cases we make use of a trusted third party. This is not a problem when more than two parties are involved, since there is always one party that can act as the third.
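The scalar-product protocols mentioned above are the basic building block for the private metrics and k-NN queries. Below is a minimal sketch of a textbook commodity-server style secure scalar product (in the spirit of, but not identical to, the protocols developed in the thesis), with a semi-trusted third party that only distributes correlated randomness and never sees the private vectors:

```python
# Illustrative commodity-server (Du-Atallah-style) secure scalar product:
# Alice and Bob end up with additive shares of x . y without revealing
# their vectors to each other. This is a textbook variant, not the thesis's.
import numpy as np

rng = np.random.default_rng(42)
d = 8
x = rng.integers(0, 100, d)      # Alice's private vector
y = rng.integers(0, 100, d)      # Bob's private vector

# Semi-trusted commodity server: correlated randomness only.
ra, rb = rng.integers(0, 100, d), rng.integers(0, 100, d)
sa = rng.integers(0, 10000)
sb = int(ra @ rb) - sa           # invariant: sa + sb = ra . rb

# One round of masked exchange.
x_masked = x + ra                # Alice -> Bob
y_masked = y + rb                # Bob -> Alice

# Each party computes its additive share of the scalar product.
u_alice = int(x @ y_masked) + sa
u_bob = -int(x_masked @ rb) + sb

assert u_alice + u_bob == int(x @ y)
print("reconstructed scalar product:", u_alice + u_bob, "==", int(x @ y))
```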
Thesis (PhD Doctorate)
Doctor of Philosophy (PhD)
School of Information and Communication Technology
Science, Environment, Engineering and Technology
Full Text
APA, Harvard, Vancouver, ISO, and other styles
25

Páircéir, Rónán. "Knowledge discovery from distributed aggregate data in data warehouses and statistical databases." Thesis, University of Ulster, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.274398.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Prášil, Zdeněk. "Využití data miningu v řízení podniku." Master's thesis, Vysoká škola ekonomická v Praze, 2010. http://www.nusl.cz/ntk/nusl-150279.

Full text
Abstract:
The thesis is focused on data mining and its use in the management of an enterprise. It is structured into a theoretical and a practical part. The aim of the theoretical part was to find out: 1/ the most used methods of data mining, 2/ typical application areas, 3/ typical problems solved in those application areas. The aim of the practical part was: 1/ to demonstrate the use of data mining in a small Czech e-shop to understand the structure of its sales data, 2/ to demonstrate how data mining analysis can help to improve marketing results. In my analysis of the literature I found that decision trees, linear and logistic regression, neural networks, segmentation methods and association rules are the most used methods of data mining analysis. CRM and marketing, financial institutions, insurance and telecommunication companies, retail trade and production are the application areas using data mining the most. The specific data mining tasks focus on relationships between sales and customers in order to do better business. In the analysis of the e-shop data I revealed the types of goods which are bought together. Based on this fact I proposed that a strategy supporting this type of shopping is crucial for business success. In conclusion, I showed that data mining methods are appropriate also for a small e-shop and have the capacity to improve its marketing strategy.
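As a hedged illustration of the "bought together" analysis described above (the transactions below are invented, not Alza.cz data), here is a minimal market-basket sketch that computes pairwise support and confidence:

```python
# Minimal market-basket sketch: pairwise support and confidence over
# hypothetical e-shop transactions.
from itertools import combinations
from collections import Counter

transactions = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "charger"},
    {"laptop", "mouse", "bag"},
]

n = len(transactions)
item_counts = Counter(i for t in transactions for i in t)
pair_counts = Counter(frozenset(p) for t in transactions
                      for p in combinations(sorted(t), 2))

for pair, cnt in pair_counts.most_common():
    a, b = tuple(pair)
    support = cnt / n
    confidence = cnt / item_counts[a]        # confidence of the rule a -> b
    if support >= 0.4:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```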
APA, Harvard, Vancouver, ISO, and other styles
27

Orlygsdottir, Brynja. "Using knowledge discovery to identify potentially useful patterns of health promotion behavior of 10-12 year old Icelandic children." Diss., University of Iowa, 2008. http://ir.uiowa.edu/etd/6.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Hayward, John T. "Mining Oncology Data: Knowledge Discovery in Clinical Performance of Cancer Patients." Worcester, Mass. : Worcester Polytechnic Institute, 2006. http://www.wpi.edu/Pubs/ETD/Available/etd-081606-083026/.

Full text
Abstract:
Thesis (M.S.)--Worcester Polytechnic Institute.
Keywords: Clinical Performance; Databases; Cancer; oncology; Knowledge Discovery in Databases; data mining. Includes bibliographical references (leaves 267-270).
APA, Harvard, Vancouver, ISO, and other styles
29

Chang, Namsik. "Knowledge discovery in databases with joint decision outcomes: A decision-tree induction approach." Diss., The University of Arizona, 1995. http://hdl.handle.net/10150/187227.

Full text
Abstract:
Inductive symbolic learning algorithms have been used successfully over the years to build knowledge-based systems. One of these, a decision-tree induction algorithm, has formed the central component in several commercial packages because of its particular efficiency, simplicity, and popularity. However, the decision-tree induction algorithms developed thus far are limited to domains where each decision instance's outcome belongs to only a single decision outcome class. Their goal is merely to specify the properties necessary to distinguish instances pertaining to different decision outcome classes. These algorithms are not readily applicable to many challenging new types of applications in which decision instances have outcomes belonging to more than one decision outcome class (i.e., joint decision outcomes). Furthermore, when applied to domains with a single decision outcome, these algorithms become less efficient as the number of the pre-defined outcome classes increases. The objective of this dissertation is to modify previous decision-tree induction techniques in order to apply them to applications with joint decision outcomes. We propose a new decision-tree induction approach called the Multi-Decision-Tree Induction (MDTI) approach. Data was collected for a patient image retrieval application where more than one prior radiological examination would be retrieved based on characteristics of the current examination and patient status. We present empirical comparisons of the MDTI approach with the Backpropagation network algorithm and the traditional knowledge-engineer-driven knowledge acquisition approach, using the same set of cases. These comparisons are made in terms of recall rate, precision rate, average number of prior examinations suggested, and understandability of the acquired knowledge. The results show that the MDTI approach outperforms the Backpropagation network algorithms and is comparable to the traditional approach in all performance measures considered, while requiring much less learning time than either approach. To gain analytical and empirical insights into MDTI, we have compared this approach with the two best known symbolic learning algorithms (i.e., ID3 and AQ) using data domains with a single decision outcome. It has been found analytically that rules generated by the MDTI approach are more general and supported by more instances in the training set. Four empirical experiments have supported the findings.
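The joint-decision-outcome setting that MDTI targets can be pictured as a multi-label learning problem. The sketch below only illustrates that setting, with one binary decision tree per outcome class on synthetic data; the actual MDTI algorithm grows its trees differently:

```python
# Hedged illustration of learning trees for joint (multi-label) decision
# outcomes: one binary decision tree per outcome class on synthetic data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
# Each instance can belong to several outcome classes at once.
Y = np.column_stack([
    X[:, 0] + X[:, 1] > 0,
    X[:, 2] > 0.5,
    (X[:, 0] > 0) & (X[:, 3] < 0),
]).astype(int)

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)
model = MultiOutputClassifier(DecisionTreeClassifier(max_depth=4)).fit(X_tr, Y_tr)
print("per-label accuracy:", (model.predict(X_te) == Y_te).mean(axis=0))
```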
APA, Harvard, Vancouver, ISO, and other styles
30

Cowley, Jonathan Bowes. "The use of knowledge discovery databases in the identification of patients with colorectal cancer." Thesis, University of Hull, 2012. http://hydra.hull.ac.uk/resources/hull:7082.

Full text
Abstract:
Colorectal cancer is one of the most common forms of malignancy with 35,000 new patients diagnosed annually within the UK. Survival figures show that outcomes are less favourable within the UK when compared with the USA and Europe, with 1 in 4 patients having incurable disease at presentation as of data from 2000. Epidemiologists have demonstrated that the incidence of colorectal cancer is highest in the industrialised western world with numerous contributory factors. These range from a genetic component to concurrent medical conditions and personal lifestyle. In addition, data also demonstrate that environmental changes play a significant role, with immigrants rapidly reaching the incidence rates of the host country. Detection of colorectal cancer remains an important and evolving aspect of healthcare with the aim of improving outcomes by earlier diagnosis. This process was initially revolutionised within the UK in 2002 with the ACPGBI 2 week wait guidelines to facilitate referrals from primary care and has subsequently seen other schemes such as bowel cancer screening introduced to augment earlier detection rates. Whereas the national screening programme is dependent on FOBT, the standard referral practice is dependent upon a number of trigger symptoms that qualify for an urgent referral to a specialist for further investigations. This process only identifies 25-30% of those with colorectal cancer and remains a labour intensive process, with only 10% of those seen in the 2 week wait clinics having colorectal cancer. This thesis investigates whether using a patient symptom questionnaire in conjunction with knowledge discovery techniques such as data mining and artificial neural networks could identify patients at risk of colorectal cancer who therefore warrant urgent further assessment. Artificial neural networks and data mining methods are used widely in industry to detect consumer patterns through an inbuilt ability to learn from previous examples within a dataset and to model often complex, non-linear patterns. Within medicine these methods have been utilised in a host of diagnostic applications, from myocardial infarcts to the Papnet cervical smear programme for cervical cancer detection. A Likert-based questionnaire of those attending the 2 week wait fast track colorectal clinic was used to produce a ‘symptoms’ database. This was then correlated with individual patient diagnoses upon completion of their clinical assessment. A total of 777 patients were included in the study and their diagnosis categorised into a dichotomous variable to create a selection of datasets for analysis. These data sets were then taken by the author and used to create a total of four primary databases based on all questions, 2 week wait trigger symptoms, best knowledge questions and symptoms identified in univariate analysis as significant. Each of these databases was entered into an artificial neural network programme, altering the number of hidden units and layers to obtain a selection of outcome models that could be further tested based on a selection of set dichotomous outcomes. Outcome models were compared for sensitivity, specificity and risk. Further experiments were carried out with data mining techniques and the WEKA package to identify the most accurate model. Both would then be compared with the accuracy of a colorectal specialist and GP. Analysis of the data identified that 24% of those referred on the 2 week wait referral pathway failed to meet referral criteria as set out by the ACPGBI.
The incidence of those with colorectal cancer was 9.5% (74), which is in keeping with other studies, and the main symptoms were rectal bleeding, change in bowel habit and abdominal pain. The optimal knowledge discovery database model was a backpropagation ANN using all variables for the outcomes cancer/not cancer, with sensitivity of 0.9, specificity of 0.97 and LR 35.8. Artificial neural networks remained the more accurate modelling method for all the dichotomous outcomes. The comparison of GPs and colorectal specialists at predicting outcome demonstrated that the colorectal specialists were the more accurate predictors of cancer/not cancer with sensitivity 0.27 and specificity 0.97 (95% CI 0.6-0.97, PPV 0.75, NPV 0.83) and LR 10.6. When compared to the KDD models for predicting the same outcome, once again the ANN models were more accurate, with the optimal model having sensitivity 0.63, specificity 0.98 (95% CI 0.58-1, PPV 0.71, NPV 0.96) and LR 28.7. The results demonstrate that diagnosing colorectal cancer remains a challenging process, both for clinicians and for computational models. KDD models have been shown to be consistently more accurate in the prediction of those with colorectal cancer than clinicians alone when used solely in conjunction with a questionnaire. It would be ill-conceived to suggest that KDD models could be used as a replacement for clinician-patient interaction, but they may aid in the acceleration of some patients to further investigations or ‘straight to test’ if used on those referred as routine patients.
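The performance figures quoted above (sensitivity, specificity, predictive values and likelihood ratios) all derive from a 2x2 confusion matrix. The small helper below shows the arithmetic; the counts are illustrative choices, roughly consistent with the reported figures but not taken from the study itself:

```python
# Derive the diagnostic metrics quoted in the abstract from a 2x2 confusion
# matrix. The counts are illustrative, not the study's actual table.
def diagnostic_metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)                        # positive predictive value
    npv = tn / (tn + fn)                        # negative predictive value
    lr_plus = sensitivity / (1 - specificity)   # positive likelihood ratio
    return dict(sensitivity=sensitivity, specificity=specificity,
                ppv=ppv, npv=npv, lr_plus=lr_plus)

print(diagnostic_metrics(tp=47, fn=27, fp=18, tn=685))
```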
APA, Harvard, Vancouver, ISO, and other styles
31

Iglesia, Beatriz de la. "The development and application of heuristic techniques for the data mining task of nugget discovery." Thesis, University of East Anglia, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.368386.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Ponsan, Christiane. "Computing with words for data mining." Thesis, University of Bristol, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.310744.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Aydin, Ugur [Verfasser]. "Interstitial solution enthalpies derived from first-principles : knowledge discovery using high-throughput databases / Ugur Aydin." Paderborn : Universitätsbibliothek, 2016. http://d-nb.info/1098210433/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Hamed, Ahmed A. "An Exploratory Analysis of Twitter Keyword-Hashtag Networks and Knowledge Discovery Applications." ScholarWorks @ UVM, 2014. http://scholarworks.uvm.edu/graddis/325.

Full text
Abstract:
The emergence of social media has impacted the way people think, communicate, behave, learn, and conduct research. In recent years, a large number of studies have analyzed and modeled this social phenomena. Driven by commercial and social interests, social media has become an attractive subject for researchers. Accordingly, new models, algorithms, and applications to address specific domains and solve distinct problems have erupted. In this thesis, we propose a novel network model and a path mining algorithm called HashnetMiner to discover implicit knowledge that is not easily exposed using other network models. Our experiments using HashnetMiner have demonstrated anecdotal evidence of drug-drug interactions when applied to a drug reaction context. The proposed research comprises three parts built upon the common theme of utilizing hashtags in tweets. 1 Digital Recruitment on Twitter. We build an expert system shell for two different studies: (1) a nicotine patch study where the system reads streams of tweets in real time and decides whether to recruit the senders to participate in the study, and (2) an environmental health study where the system identifies individuals who can participate in a survey using Twitter. 2 Does Social Media Big Data Make the World Smaller? This work provides an exploratory analysis of large-scale keyword-hashtag networks (K-H) generated from Twitter. We use two different measures, (1) the number of vertices that connect any two keywords, and (2) the eccentricity of keyword vertices, a well-known centrality and shortest path measure. Our analysis shows that K-H networks conform to the phenomenon of the shrinking world and expose hidden paths among concepts. 3 We pose the following biomedical web science question: Can patterns identified in Twitter hashtags provide clinicians with a powerful tool to extrapolate a new medical therapies and/or drugs? We present a systematic network mining method HashnetMiner, that operates on networks of medical concepts and hashtags. To the best of our knowledge, this is the first effort to present Biomedical Web Science models and algorithms that address such a question by means of data mining and knowledge discovery using hashtag-based networks.
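The keyword-hashtag (K-H) networks and the two measures mentioned above (the vertices connecting two keywords, and keyword eccentricity) can be illustrated with a toy graph; the edges below are invented and only meant to show the computation:

```python
# Toy keyword-hashtag (K-H) network with the two measures described above.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("warfarin", "#bleeding"), ("aspirin", "#bleeding"),
    ("aspirin", "#heart"), ("statin", "#heart"),
    ("statin", "#musclepain"), ("warfarin", "#interaction"),
    ("aspirin", "#interaction"),
])

# Hashtag vertices that connect two given keywords.
connectors = set(G["warfarin"]) & set(G["aspirin"])
print("hashtags linking warfarin and aspirin:", connectors)

# Eccentricity (longest shortest path from each vertex) on the connected graph.
ecc = nx.eccentricity(G)
print({node: e for node, e in ecc.items() if not node.startswith("#")})
```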
APA, Harvard, Vancouver, ISO, and other styles
35

Dopitová, Kateřina. "Empirické porovnání systémů dobývání znalostí z databází." Master's thesis, Vysoká škola ekonomická v Praze, 2010. http://www.nusl.cz/ntk/nusl-18159.

Full text
Abstract:
This diploma thesis deals with an empirical comparison of knowledge discovery in databases systems. Basic terms and methods of the knowledge discovery in databases domain are defined, and the criteria used for system comparison are determined. The tested software products are also briefly described in the thesis. The results of processing a real task are reported for each system. Within the framework of the thesis, the individual systems are compared according to the previously determined criteria, and the competitiveness of commercial and non-commercial knowledge discovery in databases systems is assessed.
APA, Harvard, Vancouver, ISO, and other styles
36

Beth, Madariaga Daniel Guillermo. "Identificación de las tendencias de reclamos presentes en reclamos.cl y que apunten contra instituciones de educación y organizaciones públicas." Tesis, Universidad de Chile, 2012. http://www.repositorio.uchile.cl/handle/2250/113396.

Full text
Abstract:
Ingeniero Civil Industrial
This thesis seeks to verify, by means of a practical and applied experience, whether the use of Web Opinion Mining (WOM) techniques and software tools makes it possible to determine the general trends present in a set of opinions published on the Web; in particular, the complaints published on the website Reclamos.cl that are directed against institutions belonging to the national Education and Government sectors. Consumers are increasingly using the Web to publish the positive and negative assessments they hold about what they acquire in the market, which turns it into a gold mine for many institutions, especially for identifying the strengths and weaknesses of the products and services they offer, their public image, and several other aspects. Concretely, the experiment is carried out through the construction and execution of a software application that integrates and implements WOM concepts, such as Knowledge Discovery from Data (KDD), as the methodological framework for reaching the stated objective, and Latent Dirichlet Allocation (LDA), for detecting topics within the contents of the complaints under study. Object-oriented programming based on the Python language and data storage in relational databases are also used, and prefabricated tools are incorporated in order to simplify certain required tasks. Running the application made it possible to download the web pages containing the complaints of interest for the experiment, detecting 6,460 such complaints, directed at 245 institutions and published between 13 July 2006 and 5 December 2011. The application, using stop-word lists and lemmatisation tools, also processed the contents of the complaints, keeping only the canonical forms of the words that constituted them and contributed meaning. The application then carried out several LDA analyses over these contents, arbitrarily defined to be executed for each detected institution, both over the full set of its complaints and over segments grouped by year of publication, generating for each analysis results composed of 20 topics of 30 words each. With the results of the LDA analyses, and through a methodology of manual reading and interpretation of the words making up each of the obtained topic sets, phrases and sentences were generated to link them together, in order to obtain an interpretation reflecting the trend towards which the complaints represented in those results pointed. From this it could be concluded that it is possible to detect the general trends of the complaints using WOM techniques, but with caveats: since the determination of the trends arises from a manual interpretation process, subjectivity can be introduced regarding the object towards which those trends point, depending on the interests, experience and other characteristics of the person carrying out the interpretation of the results.
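A hedged sketch of the LDA step described above, using scikit-learn rather than whichever library the thesis used, a tiny invented English corpus instead of the Spanish complaints, and 2 topics instead of 20:

```python
# Minimal LDA topic-detection sketch: fit topics over complaint texts and
# list the top words per topic. The corpus here is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

complaints = [
    "enrollment fee charged twice and no refund from the institute",
    "no response from customer service about my certificate",
    "refund promised but never paid by the university",
    "certificate delayed for months without explanation",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(complaints)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {k}: {', '.join(top)}")
```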
APA, Harvard, Vancouver, ISO, and other styles
37

Bogorny, Vania. "Enhancing spatial association rule mining in geographic databases." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2006. http://hdl.handle.net/10183/7841.

Full text
Abstract:
A técnica de mineração de regras de associação surgiu com o objetivo de encontrar conhecimento novo, útil e previamente desconhecido em bancos de dados transacionais, e uma grande quantidade de algoritmos de mineração de regras de associação tem sido proposta na última década. O maior e mais bem conhecido problema destes algoritmos é a geração de grandes quantidades de conjuntos freqüentes e regras de associação. Em bancos de dados geográficos o problema de mineração de regras de associação espacial aumenta significativamente. Além da grande quantidade de regras e padrões gerados a maioria são associações do domínio geográfico, e são bem conhecidas, normalmente explicitamente representadas no esquema do banco de dados. A maioria dos algoritmos de mineração de regras de associação não garantem a eliminação de dependências geográficas conhecidas a priori. O resultado é que as mesmas associações representadas nos esquemas do banco de dados são extraídas pelos algoritmos de mineração de regras de associação e apresentadas ao usuário. O problema de mineração de regras de associação espacial pode ser dividido em três etapas principais: extração dos relacionamentos espaciais, geração dos conjuntos freqüentes e geração das regras de associação. A primeira etapa é a mais custosa tanto em tempo de processamento quanto pelo esforço requerido do usuário. A segunda e terceira etapas têm sido consideradas o maior problema na mineração de regras de associação em bancos de dados transacionais e tem sido abordadas como dois problemas diferentes: “frequent pattern mining” e “association rule mining”. Dependências geográficas bem conhecidas aparecem nas três etapas do processo. Tendo como objetivo a eliminação dessas dependências na mineração de regras de associação espacial essa tese apresenta um framework com três novos métodos para mineração de regras de associação utilizando restrições semânticas como conhecimento a priori. O primeiro método reduz os dados de entrada do algoritmo, e dependências geográficas são eliminadas parcialmente sem que haja perda de informação. O segundo método elimina combinações de pares de objetos geográficos com dependências durante a geração dos conjuntos freqüentes. O terceiro método é uma nova abordagem para gerar conjuntos freqüentes não redundantes e sem dependências, gerando conjuntos freqüentes máximos. Esse método reduz consideravelmente o número final de conjuntos freqüentes, e como conseqüência, reduz o número de regras de associação espacial.
The association rule mining technique emerged with the objective of finding novel, useful, and previously unknown associations in transactional databases, and a large number of association rule mining algorithms have been proposed in the last decade. Their main drawback, which is a well known problem, is the generation of large amounts of frequent patterns and association rules. In geographic databases the problem of mining spatial association rules increases significantly. Besides the large amount of generated patterns and rules, many patterns are well known geographic domain associations, normally explicitly represented in geographic database schemas. The majority of existing algorithms do not warrant the elimination of all well known geographic dependences. The result is that the same associations represented in geographic database schemas are extracted by spatial association rule mining algorithms and presented to the user. The problem of mining spatial association rules from geographic databases requires at least three main steps: compute spatial relationships, generate frequent patterns, and extract association rules. The first step is the most effort-demanding and time-consuming task in the rule mining process, but has received little attention in the literature. The second and third steps have been considered the main problem in transactional association rule mining and have been addressed as two different problems: frequent pattern mining and association rule mining. Well known geographic dependences which generate well known patterns may appear in the three main steps of the spatial association rule mining process. Aiming to eliminate well known dependences and generate more interesting patterns, this thesis presents a framework with three main methods for mining frequent geographic patterns using knowledge constraints. Semantic knowledge is used to avoid the generation of patterns that are previously known as non-interesting. The first method reduces the input problem, and all well known dependences that can be eliminated without losing information are removed in data preprocessing. The second method eliminates combinations of pairs of geographic objects with dependences during the frequent set generation. A third method presents a new approach to generate non-redundant frequent sets, the maximal generalized frequent sets without dependences. This method reduces the number of frequent patterns very significantly and, by consequence, the number of association rules.
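A hedged sketch of the constraint-based pruning idea behind the second method above: candidate pairs listed as known geographic dependences are discarded before support counting. The feature types, transactions and threshold are all invented for illustration:

```python
# Drop candidate pairs of geographic feature types that are known a-priori
# dependences (e.g. schema constraints) before counting frequent sets.
from itertools import combinations
from collections import Counter

known_dependences = {frozenset({"island", "water_body"}),
                     frozenset({"gas_station", "road"})}

transactions = [
    {"city", "road", "gas_station", "hospital"},
    {"island", "water_body", "port"},
    {"city", "road", "gas_station"},
    {"city", "hospital", "road"},
]

min_support = 0.5
n = len(transactions)
counts = Counter(
    frozenset(p)
    for t in transactions
    for p in combinations(sorted(t), 2)
    if frozenset(p) not in known_dependences      # constraint-based pruning
)
frequent = {tuple(p): c / n for p, c in counts.items() if c / n >= min_support}
print(frequent)
```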
APA, Harvard, Vancouver, ISO, and other styles
38

TALEBZADEH, SAEED. "Data Mining in Scientific Databases for Knowledge Discovery, the Case of Interpreting Support Vector Machines via Genetic Programming as Simple Understandable Terms." Doctoral thesis, Università degli Studi di Roma "Tor Vergata", 2015. http://hdl.handle.net/2108/202257.

Full text
Abstract:
The Support Vector Machines (SVM) method is a powerful classification technique and a widely used data mining application. This methodology has been applied in much scientific research in a wide variety of fields in recent decades and has proved its high efficiency, performance, and accuracy. In spite of all the advantages of Support Vector Machines, the method has an important drawback related to the difficulty of interpreting the produced results, which are not easily understandable. This disadvantage is especially pronounced in scientific cases in which the main purpose is knowledge discovery from the databases, and not just classifying the data. This research is an attempt to address this weakness with an innovative and practical technique. In this methodology, an adequate number of points on the classification decision boundary must first be identified; afterwards, a Symbolic Regression (SR) technique is applied to the obtained points. In the developed method, Genetic Programming (GP), an evolutionary algorithm, is used for the Symbolic Regression part of the procedure. The resulting procedure and algorithm is called the SVM-GP methodology: a technique for presenting the results of Support Vector Machines as a simple and easily understandable algebraic equation. The performance and efficiency of the developed methodology have been tested on several synthetic databases under different conditions, which led to a versatile code for solving high-dimensional problems. Next, the algorithm has been applied to the classification of three real-world cases and interesting knowledge was extracted from them. The discovered information is useful for the related research; meanwhile, it provides evidence for the importance of the SVM-GP methodology and its high performance, efficiency, and accuracy for scientific cases.
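A hedged, simplified sketch of the two-stage SVM-GP idea on synthetic 2D data: points close to a trained SVM's decision boundary are collected, and a symbolic expression is fitted to them. A plain polynomial fit stands in for the genetic-programming stage purely to keep the example short:

```python
# Simplified sketch of the SVM-GP pipeline (synthetic data; polynomial fit
# stands in for the genetic-programming symbolic regression stage).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 1] > 0.5 * X[:, 0] ** 2 - 0.5).astype(int)   # true boundary: a parabola

svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)

# Step 1: collect grid points lying closest to the SVM decision boundary.
xx, yy = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
grid = np.column_stack([xx.ravel(), yy.ravel()])
scores = svm.decision_function(grid)
boundary = grid[np.argsort(np.abs(scores))[:500]]

# Step 2 (stand-in for GP): fit an interpretable expression x2 = a*x1^2 + b*x1 + c.
a, b, c = np.polyfit(boundary[:, 0], boundary[:, 1], deg=2)
print(f"recovered boundary: x2 = {a:.2f}*x1^2 + {b:.2f}*x1 + {c:.2f}")
```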
APA, Harvard, Vancouver, ISO, and other styles
39

Otine, Charles. "HIV Patient Monitoring Framework Through Knowledge Engineering." Doctoral thesis, Blekinge Tekniska Högskola [bth.se], School of Planning and Media Design, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-00540.

Full text
Abstract:
Uganda has registered more than a million deaths since the HIV virus was first officially reported in the country over 3 decades ago. The governments in partnership with different groups have implemented different programmes to address the epidemic. The support from different donors and the reduction in prices of treatment resulted in a focus on antiretroviral therapy access for those affected. Presently only a quarter of the approximately 1 million infected by HIV in Uganda are undergoing antiretroviral therapy. The number of patients poses a challenge in monitoring therapy, given the overall resource needs for health care in the country. Furthermore, the numbers on antiretroviral therapy are set to increase, in addition to the stringent requirements for tracking and monitoring each individual patient during therapy. This research aimed at developing a framework for adopting knowledge engineering in information systems for monitoring HIV/AIDS patients. An open source approach was adopted due to the resource constrained context of the study to ensure a cost effective and sustainable solution. The research was motivated by the inconclusive literature on open source dimensional models for data warehouses and data mining for monitoring antiretroviral therapy. The first phase of the research involved a situational analysis of HIV in health care and different health care information systems in the country. An analysis of the strengths, weaknesses and opportunities of the health care system to adopt knowledge bases was done. It proposed a dimensional model for implementing a data warehouse focused on monitoring HIV patients. The second phase involved the development of a knowledge base in the form of an open source data warehouse, its simulation and testing. The study involved interdisciplinary collaboration between different stakeholders in the research domain and adopted a participatory action research methodology. This involved identification of the most appropriate technologies to foster this collaboration. An analysis was done of how stakeholders can take ownership of a basic HIV health information system architecture as their expertise in managing the systems grows, and make changes to reflect even better results out of system functionality. Data mining simulations were done on the data warehouse, out of which two machine learning algorithms (regression and classification) were developed and tested using data from the data warehouse. The algorithms were used to predict patient viral load from CD4 count test figures and to classify cases of treatment failure with 83% accuracy. The research additionally presents an open source dimensional model for monitoring antiretroviral therapy and the status of information systems in health care. An architecture showing the integration of the different knowledge engineering components in the study, including the data warehouse, the data mining platform and user interaction, is presented.
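A purely synthetic illustration of the two model types mentioned above: a regression from CD4 count to (log) viral load and a treatment-failure classifier. The numbers and the relationship are simulated, so nothing here carries clinical meaning:

```python
# Synthetic illustration only: regression of log viral load on CD4 count and
# a treatment-failure classifier. Data and coefficients are invented.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
cd4 = rng.normal(350, 150, 600).clip(10, 1200).reshape(-1, 1)
log_viral_load = 5.5 - 0.004 * cd4[:, 0] + rng.normal(0, 0.6, 600)
failure = (log_viral_load > 4.0).astype(int)        # crude synthetic label

reg = LinearRegression().fit(cd4, log_viral_load)
clf = LogisticRegression().fit(cd4, failure)

new_cd4 = np.array([[150], [450]])
print("predicted log10 viral load:", reg.predict(new_cd4))
print("treatment-failure probability:", clf.predict_proba(new_cd4)[:, 1])
```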
APA, Harvard, Vancouver, ISO, and other styles
40

Abeysekara, Thusitha Bernad. "A proposal for the protection of digital databases in Sri Lanka." Thesis, University of Exeter, 2013. http://hdl.handle.net/10871/14172.

Full text
Abstract:
Economic development in Sri Lanka has relied heavily on foreign and domestic investment. Digital databases are a new and attractive area for this investment. This thesis argues that investment needs protection and this is crucial to attract future investment. The thesis therefore proposes a digital database protection mechanism with a view to attracting investment in digital databases to Sri Lanka. The research examines various existing protection measures whilst mainly focusing on the sui generis right protection which confirms the protection of qualitative and/or quantitative substantial investment in the obtaining, verification or presentation of the contents of digital databases. In digital databases, this process is carried out by computer programs which establish meaningful and useful data patterns through their data mining process, and subsequently use those patterns in Knowledge Discovery within database processes. Those processes enhance the value and/or usefulness of the data/information. Computer programs need to be protected, as this thesis proposes, by virtue of patent protection because the process carried out by computer programs is that of a technical process - an area for which patents are particularly suitable for the purpose of protecting. All intellectual property concepts under the existing mechanisms address the issue of investment in databases in different ways. These include Copyright, Contract, Unfair Competition law and Misappropriation and Sui generis right protection. Since the primary objective of the thesis is to introduce a protection system for encouraging qualitative and quantitative investment in digital databases in Sri Lanka, this thesis suggests a set of mechanisms and rights which comprises of existing intellectual protection mechanisms for databases. The ultimate goal of the proposed protection mechanisms and rights is to improve the laws pertaining to the protection of digital databases in Sri Lanka in order to attract investment, to protect the rights and duties of the digital database users and owners/authors and, eventually, to bring positive economic effects to the country. Since digital database protection is a new concept in the Sri Lankan legal context, this research will provide guidelines for policy-makers, judges and lawyers in Sri Lanka and throughout the South Asian region.
APA, Harvard, Vancouver, ISO, and other styles
41

Fihn, John, and Johan Finndahl. "A Framework for How to Make Use of an Automatic Passenger Counting System." Thesis, Uppsala universitet, Datorteknik, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-158139.

Full text
Abstract:
Most modern cities are today facing tremendous traffic congestion, a consequence of the increasing use of private motor vehicles in the cities. Public transport plays a crucial role in reducing this traffic, but to be an attractive alternative to the use of private motor vehicles, public transport needs to provide services that suit citizens' requirements for travelling. A system that can provide transit agencies with rapid feedback about the usage of their transport network is the Automatic Passenger Counting (APC) system, a system that registers the number of passengers boarding and alighting a vehicle. Knowledge about passengers' travel behaviour can be used by transit agencies to adapt and improve their services to satisfy these requirements, but to gain this knowledge transit agencies need to know how to use an APC system. This thesis investigates how a transit agency can make use of an APC system. The research has taken place in Melbourne, where Yarra Trams, operator of the tram network, is now putting effort into how to utilise the APC system. A theoretical framework based on theories about Knowledge Discovery from Data, System Development, and Human Computer Interaction is built, tested, and evaluated in a case study at Yarra Trams. The case study resulted in a software system that can process and model Yarra Trams' APC data. The result of the research is a proposal of a framework consisting of different steps and events that can be used as a guide for a transit agency that wants to make use of an APC system.
APA, Harvard, Vancouver, ISO, and other styles
42

Beskyba, Jan. "Automatizace předzpracování dat za využití doménových znalosti." Master's thesis, Vysoká škola ekonomická v Praze, 2014. http://www.nusl.cz/ntk/nusl-193429.

Full text
Abstract:
In this work we propose a solution that helps automate part of the knowledge discovery in databases process. Domain knowledge plays an important role in the automation process and needs to be included in the proposed program for data preparation. In the introduction to this work, we focus on the theoretical basis of knowledge discovery in databases with an emphasis on domain knowledge. Next, we focus on the basic principles of data pre-processing and on the scripting language LMCL, which could be part of the design of newly built applications for automated data preparation. Subsequently, we deal with the design of an application for data pre-processing, which is verified on data from the House of Commons.
APA, Harvard, Vancouver, ISO, and other styles
43

Nigita, Giovanni. "Knowledge bases and stochastic algorithms for mining biological data: applications on A-to-I RNA editing and RNAi." Doctoral thesis, Università di Catania, 2014. http://hdl.handle.net/10761/1555.

Full text
Abstract:
Until the second half of the twentieth century, the connection between Biology and Computer Science was not so close, and data were usually collected on perishable materials such as paper and then stored in filing cabinets. This situation changed thanks to Bioinformatics, a relatively novel field that aims to deal with biological problems by making use of computational approaches. This interdisciplinary science has two particular fields of action: on the one hand, the construction of biological databases in order to store the huge amount of data in a rational way, and, on the other hand, the development and application of algorithms, including approximate ones, for extracting predictive patterns from such data. This thesis presents novel results on both of the above aspects. It introduces three new databases called miRandola, miReditar and VIRGO, respectively. All of them have been developed as open source and equipped with user-friendly web interfaces. Then, some results concerning the application of stochastic approaches to microRNA targeting and A-to-I RNA editing and RNA interference are introduced.
APA, Harvard, Vancouver, ISO, and other styles
44

Sanavia, Tiziana. "Biomarker lists stability in genomic studies: analysis and improvement by prior biological knowledge integration into the learning process." Doctoral thesis, Università degli studi di Padova, 2012. http://hdl.handle.net/11577/3422197.

Full text
Abstract:
The analysis of high-throughput sequencing, microarray and mass spectrometry data has been demonstrated extremely helpful for the identification of those genes and proteins, called biomarkers, helpful for answering to both diagnostic/prognostic and functional questions. In this context, robustness of the results is critical both to understand the biological mechanisms underlying diseases and to gain sufficient reliability for clinical/pharmaceutical applications. Recently, different studies have proved that the lists of identified biomarkers are poorly reproducible, making the validation of biomarkers as robust predictors of a disease a still open issue. The reasons of these differences are referable to both data dimensions (few subjects with respect to the number of features) and heterogeneity of complex diseases, characterized by alterations of multiple regulatory pathways and of the interplay between different genes and the environment. Typically in an experimental design, data to analyze come from different subjects and different phenotypes (e.g. normal and pathological). The most widely used methodologies for the identification of significant genes related to a disease from microarray data are based on computing differential gene expression between different phenotypes by univariate statistical tests. Such approach provides information on the effect of specific genes as independent features, whereas it is now recognized that the interplay among weakly up/down regulated genes, although not significantly differentially expressed, might be extremely important to characterize a disease status. Machine learning algorithms are, in principle, able to identify multivariate nonlinear combinations of features and have thus the possibility to select a more complete set of experimentally relevant features. In this context, supervised classification methods are often used to select biomarkers, and different methods, like discriminant analysis, random forests and support vector machines among others, have been used, especially in cancer studies. Although high accuracy is often achieved in classification approaches, the reproducibility of biomarker lists still remains an open issue, since many possible sets of biological features (i.e. genes or proteins) can be considered equally relevant in terms of prediction, thus it is in principle possible to have a lack of stability even by achieving the best accuracy. This thesis represents a study of several computational aspects related to biomarker discovery in genomic studies: from the classification and feature selection strategies to the type and the reliability of the biological information used, proposing new approaches able to cope with the problem of the reproducibility of biomarker lists. The study has highlighted that, although reasonable and comparable classification accuracy can be achieved by different methods, further developments are necessary to achieve robust biomarker lists stability, because of the high number of features and the high correlation among them. In particular, this thesis proposes two different approaches to improve biomarker lists stability by using prior information related to biological interplay and functional correlation among the analyzed features. Both approaches were able to improve biomarker selection. The first approach, using prior information to divide the application of the method into different subproblems, improves results interpretability and offers an alternative way to assess lists reproducibility. 
The second, integrating prior information in the kernel function of the learning algorithm, improves lists stability. Finally, the interpretability of results is strongly affected by the quality of the biological information available and the analysis of the heterogeneities performed in the Gene Ontology database has revealed the importance of providing new methods able to verify the reliability of the biological properties which are assigned to a specific feature, discriminating missing or less specific information from possible inconsistencies among the annotations. These aspects will be more and more deepened in the future, as the new sequencing technologies will monitor an increasing number of features and the number of functional annotations from genomic databases will considerably grow in the next years.
L’analisi di dati high-throughput basata sull’utilizzo di tecnologie di sequencing, microarray e spettrometria di massa si è dimostrata estremamente utile per l’identificazione di quei geni e proteine, chiamati biomarcatori, utili per rispondere a quesiti sia di tipo diagnostico/prognostico che funzionale. In tale contesto, la stabilità dei risultati è cruciale sia per capire i meccanismi biologici che caratterizzano le malattie sia per ottenere una sufficiente affidabilità per applicazioni in campo clinico/farmaceutico. Recentemente, diversi studi hanno dimostrato che le liste di biomarcatori identificati sono scarsamente riproducibili, rendendo la validazione di tali biomarcatori come indicatori stabili di una malattia un problema ancora aperto. Le ragioni di queste differenze sono imputabili sia alla dimensione dei dataset (pochi soggetti rispetto al numero di variabili) sia all’eterogeneità di malattie complesse, caratterizzate da alterazioni di più pathway di regolazione e delle interazioni tra diversi geni e l’ambiente. Tipicamente in un disegno sperimentale, i dati da analizzare provengono da diversi soggetti e diversi fenotipi (e.g. normali e patologici). Le metodologie maggiormente utilizzate per l’identificazione di geni legati ad una malattia si basano sull’analisi differenziale dell’espressione genica tra i diversi fenotipi usando test statistici univariati. Tale approccio fornisce le informazioni sull’effetto di specifici geni considerati come variabili indipendenti tra loro, mentre è ormai noto che l’interazione tra geni debolmente up/down regolati, sebbene non differenzialmente espressi, potrebbe rivelarsi estremamente importante per caratterizzare lo stato di una malattia. Gli algoritmi di machine learning sono, in linea di principio, capaci di identificare combinazioni non lineari delle variabili e hanno quindi la possibilità di selezionare un insieme più dettagliato di geni che sono sperimentalmente rilevanti. In tale contesto, i metodi di classificazione supervisionata vengono spesso utilizzati per selezionare i biomarcatori, e diversi approcci, quali discriminant analysis, random forests e support vector machines tra altri, sono stati utilizzati, soprattutto in studi oncologici. Sebbene con tali approcci di classificazione si ottenga un alto livello di accuratezza di predizione, la riproducibilità delle liste di biomarcatori rimane ancora una questione aperta, dato che esistono molteplici set di variabili biologiche (i.e. geni o proteine) che possono essere considerati ugualmente rilevanti in termini di predizione. Quindi in teoria è possibile avere un’insufficiente stabilità anche raggiungendo il massimo livello di accuratezza. Questa tesi rappresenta uno studio su diversi aspetti computazionali legati all’identificazione di biomarcatori in genomica: dalle strategie di classificazione e di feature selection adottate alla tipologia e affidabilità dell’informazione biologica utilizzata, proponendo nuovi approcci in grado di affrontare il problema della riproducibilità delle liste di biomarcatori. Tale studio ha evidenziato che sebbene un’accettabile e comparabile accuratezza nella predizione può essere ottenuta attraverso diversi metodi, ulteriori sviluppi sono necessari per raggiungere una robusta stabilità nelle liste di biomarcatori, a causa dell’alto numero di variabili e dell’alto livello di correlazione tra loro. 
In particolare, questa tesi propone due diversi approcci per migliorare la stabilità delle liste di biomarcatori usando l’informazione a priori legata alle interazioni biologiche e alla correlazione funzionale tra le features analizzate. Entrambi gli approcci sono stati in grado di migliorare la selezione di biomarcatori. Il primo approccio, usando l’informazione a priori per dividere l’applicazione del metodo in diversi sottoproblemi, migliora l’interpretabilità dei risultati e offre un modo alternativo per verificare la riproducibilità delle liste. Il secondo, integrando l’informazione a priori in una funzione kernel dell’algoritmo di learning, migliora la stabilità delle liste. Infine, l’interpretabilità dei risultati è fortemente influenzata dalla qualità dell’informazione biologica disponibile e l’analisi delle eterogeneità delle annotazioni effettuata sul database Gene Ontology rivela l’importanza di fornire nuovi metodi in grado di verificare l’attendibilità delle proprietà biologiche che vengono assegnate ad una specifica variabile, distinguendo la mancanza o la minore specificità di informazione da possibili inconsistenze tra le annotazioni. Questi aspetti verranno sempre più approfonditi in futuro, dato che le nuove tecnologie di sequencing monitoreranno un maggior numero di variabili e il numero di annotazioni funzionali derivanti dai database genomici crescer`a considerevolmente nei prossimi anni.
APA, Harvard, Vancouver, ISO, and other styles
45

Černý, Ján. "Implementace procedur pro předzpracování dat v systému Rapid Miner." Master's thesis, Vysoká škola ekonomická v Praze, 2014. http://www.nusl.cz/ntk/nusl-193216.

Full text
Abstract:
Knowledge Discovery in Databases (KDD) is gaining importance with the rising amount of data being collected; despite this, analytic software systems often provide only the basic and most used procedures and algorithms. The aim of this thesis is to extend RapidMiner, one of the most frequently used systems, with some new procedures for data preprocessing. To understand and develop the procedures, it is important to be acquainted with KDD, with emphasis on the data preparation phase. It is also important to describe the analytical procedures themselves. To be able to develop an extension for RapidMiner, it is necessary to become acquainted with the process of creating the extension and the tools that are used. Finally, the resulting extension is introduced and tested.
APA, Harvard, Vancouver, ISO, and other styles
46

Válek, Martin. "Analýza reálných dat produktové redakce Alza.cz pomocí metod DZD." Master's thesis, Vysoká škola ekonomická v Praze, 2014. http://www.nusl.cz/ntk/nusl-198448.

Full text
Abstract:
This thesis deals with data analysis using methods of knowledge discovery in databases. The goal is to select appropriate methods and tools for the implementation of a specific project based on real data from the Alza.cz product department. Data analysis is performed using association rules and decision rules in LISp-Miner and decision trees in RapidMiner. The methodology used is CRISP-DM. The thesis is divided into three main sections. The first section is focused on a theoretical summary of KDD: basic terms are defined and the types of KDD tasks and methods are described. The second section introduces the CRISP-DM methodology. The practical part first introduces the company Alza.cz and its goals for this task. Afterwards, the basic structure of the data and the preparation for the next step (data mining) are described. In conclusion, the results are evaluated and the possibility of their use is outlined.
APA, Harvard, Vancouver, ISO, and other styles
47

Razavi, Amir Reza. "Applications of Knowledge Discovery in Quality Registries - Predicting Recurrence of Breast Cancer and Analyzing Non-compliance with a Clinical Guideline." Doctoral thesis, Linköping : Univ, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-10142.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Mohd, Saudi Madihah. "A new model for worm detection and response : development and evaluation of a new model based on knowledge discovery and data mining techniques to detect and respond to worm infection by integrating incident response, security metrics and apoptosis." Thesis, University of Bradford, 2011. http://hdl.handle.net/10454/5410.

Full text
Abstract:
Worms have improved and a range of sophisticated techniques has been integrated into them, which makes the detection and response processes much harder and longer than in the past. Therefore, in this thesis, a STAKCERT (Starter Kit for Computer Emergency Response Team) model is built to detect worm attacks in order to respond to worms more efficiently. The novelty and the strengths of the STAKCERT model lie in the method implemented, which consists of the STAKCERT KDD processes and the development of the STAKCERT worm classification, the STAKCERT relational model and the STAKCERT worm apoptosis algorithm. The new concept introduced in this model, named apoptosis, is borrowed from the human immune system and has been mapped into a security perspective. Furthermore, the encouraging results achieved by this research are validated by applying security metrics for assigning the weight and severity values that trigger the apoptosis. In order to optimise the performance, the standard operating procedures (SOP) for worm incident response, which involve static and dynamic analyses, knowledge discovery in databases (KDD) techniques for modeling the STAKCERT model, and data mining algorithms were used. The STAKCERT model has produced encouraging results and outperformed comparable existing work for worm detection. It produces an overall accuracy rate of 98.75%, with a 0.2% false positive rate and a 1.45% false negative rate. Worm response resulted in an accuracy rate of 98.08%, which can later be used by other researchers as a comparison with their work in future.
APA, Harvard, Vancouver, ISO, and other styles
49

Kolafa, Ondřej. "Reálná úloha dobývání znalostí." Master's thesis, Vysoká škola ekonomická v Praze, 2012. http://www.nusl.cz/ntk/nusl-200136.

Full text
Abstract:
The major objective of this thesis is to perform a real data mining task of classifying term deposit account holders. For this task, data on anonymous bank customers with low funds positions are used. In correspondence with the CRISP-DM methodology, the work is guided through these steps: business understanding, data understanding, data preparation, modeling, evaluation and deployment. The RapidMiner application is used for modeling. The methods and procedures used in the actual task are described in the theoretical part, which introduces the basic concepts of data mining with special respect to the CRM segment, as well as the CRISP-DM methodology and techniques suitable for this task. The difference in the proportions of term deposit holders and non-holders meant the data set had to be balanced in favour of holders. At the final stage, twelve models are built. According to the chosen criteria (area under the curve and F-measure), the two best models (logistic regression and Bayesian network) were selected. In the last stage of the data mining process, a possible real-world utilisation is mentioned. The task is developed only in the form of recommendations, because it could not be applied to the real situation.
APA, Harvard, Vancouver, ISO, and other styles
50

Aldas, Cem Nuri. "An Analysis Of Peculiarity Oriented Interestingness Measures On Medical Data." Master's thesis, METU, 2008. http://etd.lib.metu.edu.tr/upload/12609856/index.pdf.

Full text
Abstract:
Peculiar data are patterns which are significantly distinguishable from other records and relatively few in number, and they are accepted as one of the most striking aspects of the interestingness concept. In the clinical domain, peculiar records are likely signals of malignancy or disorder that call for immediate intervention. The investigation of the rules and mechanisms which lie behind these records would be a meaningful contribution to improved clinical decision support systems. In order to discover the most interesting records and patterns, many peculiarity-oriented interestingness measures, each fulfilling a specific requirement, have been developed. In this thesis, the well-known peculiarity-oriented interestingness measures Local Outlier Factor (LOF), Cluster-Based Local Outlier Factor (CBLOF) and Record Peculiar Factor (RPF) are compared. The insights derived from the theoretical foundations of the algorithms were evaluated by experiments on synthetic and real-world medical data. The results are discussed from the interestingness perspective, and some departure points for building a more developed methodology for knowledge discovery in databases are proposed.
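Of the three measures compared above, LOF has a readily available scikit-learn implementation; below is a small, hedged demonstration on synthetic data with a few planted peculiar records (CBLOF and RPF are not shown):

```python
# Flag planted peculiar records with the Local Outlier Factor on synthetic data.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(7)
normal = rng.normal(0, 1, size=(200, 2))
peculiar = np.array([[6.0, 6.0], [-5.5, 4.5], [7.0, -6.0]])
X = np.vstack([normal, peculiar])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # -1 marks outliers
scores = -lof.negative_outlier_factor_       # higher = more peculiar
top = np.argsort(scores)[::-1][:3]
print("most peculiar records (indices):", top, "flagged:", labels[top])
```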
APA, Harvard, Vancouver, ISO, and other styles
