Dissertations / Theses on the topic 'Tabular data'
Consult the top 49 dissertations / theses for your research on the topic 'Tabular data.'
Xu, Lei. "Synthesizing tabular data using conditional GAN." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/128349.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 89-93).
In data science, the ability to model the distribution of rows in tabular data and generate realistic synthetic data enables various important applications including data compression, data disclosure, and privacy-preserving machine learning. However, because tabular data usually contains a mix of discrete and continuous columns, building such a model is a non-trivial task. Continuous columns may have multiple modes, while discrete columns are sometimes imbalanced, making modeling difficult. To address this problem, I took two major steps. (1) I designed SDGym, a thorough benchmark, to compare existing models, identify different properties of tabular data and analyze how these properties challenge different models. Our experimental results show that statistical models, such as Bayesian networks, that are constrained to a fixed family of available distributions cannot model tabular data effectively, especially when both continuous and discrete columns are included. Recently proposed deep generative models are capable of modeling more sophisticated distributions, but cannot outperform Bayesian network models in practice, because the network structure and learning procedure are not optimized for tabular data which may contain non-Gaussian continuous columns and imbalanced discrete columns. (2) To address these problems, I designed CTGAN, which uses a conditional generative adversarial network to address the challenges in modeling tabular data. Because CTGAN uses reversible data transformations and is trained by re-sampling the data, it can address common challenges in synthetic data generation. I evaluated CTGAN on the benchmark and showed that it consistently and significantly outperforms existing statistical and deep learning models.
by Lei Xu. S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science.
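The training-by-sampling step that the abstract credits for handling imbalanced discrete columns can be illustrated with a minimal sketch: a condition (a one-hot vector over one discrete column) is drawn with rare categories over-weighted, and a real row matching that condition is drawn for the critic. This is a plain NumPy illustration written for this listing, not Xu's code; the column, its category frequencies and the log-frequency rule are assumptions based on the published CTGAN description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete column with a heavy imbalance, as in the motivation above.
values = np.array(["A"] * 950 + ["B"] * 45 + ["C"] * 5)
categories, counts = np.unique(values, return_counts=True)

# Sample a category in proportion to the log of its frequency,
# so the rare category "C" is conditioned on far more often than its raw share.
log_freq = np.log(counts)
probs = log_freq / log_freq.sum()

def sample_condition():
    k = rng.choice(len(categories), p=probs)
    cond = np.zeros(len(categories))   # one-hot conditional vector fed to the generator
    cond[k] = 1.0
    return categories[k], cond

def sample_real_row(category):
    # The critic compares generated rows against real rows that satisfy the same condition.
    idx = rng.choice(np.where(values == category)[0])
    return values[idx]

category, cond_vec = sample_condition()
print(category, cond_vec, sample_real_row(category))
```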
Liu, Zhicheng. "Network-based visual analysis of tabular data." Diss., Georgia Institute of Technology, 2012. http://hdl.handle.net/1853/43687.
Caspár, Sophia. "Visualization of tabular data on mobile devices." Thesis, Luleå tekniska universitet, Institutionen för system- och rymdteknik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-68036.
Braunschweig, Katrin. "Recovering the Semantics of Tabular Web Data." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2015. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-184502.
Cappuzzo, Riccardo. "Deep learning models for tabular data curation." Electronic Thesis or Diss., Sorbonne université, 2022. http://www.theses.fr/2022SORUS047.
Data curation is a pervasive and far-reaching topic, affecting everything from academia to industry. Current solutions rely on manual work by domain users, but they are not adequate. We investigate how to apply deep learning to tabular data curation. We focus our work on developing unsupervised data curation systems and on designing curation systems that intrinsically model categorical values in their raw form. We first implement EmbDI to generate embeddings for tabular data and address the tasks of entity resolution and schema matching. We then turn to the data imputation problem, using graph neural networks in a multi-task learning framework called GRIMP.
Baxter, Jay. "BayesDB : querying the probable implications of tabular data." Thesis, Massachusetts Institute of Technology, 2014. http://hdl.handle.net/1721.1/91451.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 93-95).
BayesDB, a Bayesian database table, lets users query the probable implications of their tabular data as easily as an SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with little statistics knowledge can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries. BayesDB is suitable for analyzing complex, heterogeneous data tables with no preprocessing or parameter adjustment required. This generality rests on the model independence provided by BQL, analogous to the physical data independence provided by the relational model. SQL enables data filtering and aggregation tasks to be described independently of the physical layout of data in memory and on disk. Non-experts rely on generic indexing strategies for good-enough performance, while experts customize schemes and indices for performance-sensitive applications. Analogously, BQL enables analysis tasks to be described independently of the models used to solve them. Non-statisticians can rely on a general-purpose modeling method called CrossCat to build models that are good enough for a broad class of applications, while experts can customize the schemes and models when needed. This thesis defines BQL, describes an implementation of BayesDB, quantitatively characterizes its scalability and performance, and illustrates its efficacy on real-world data analysis problems in the areas of healthcare economics, statistical survey data analysis, web analytics, and predictive policing.
by Jay Baxter.
M. Eng.
Jiang, Ji Chu. "High Precision Deep Learning-Based Tabular Data Extraction." Thesis, Université d'Ottawa / University of Ottawa, 2021. http://hdl.handle.net/10393/41699.
Rahman, Md Anisur. "Tabular Representation of Schema Mappings: Semantics and Algorithms." Thèse, Université d'Ottawa / University of Ottawa, 2011. http://hdl.handle.net/10393/20032.
Baena Mirabete, Daniel. "Exact and heuristic methods for statistical tabular data protection." Doctoral thesis, Universitat Politècnica de Catalunya, 2017. http://hdl.handle.net/10803/456809.
One of the main objectives of National Statistical Institutes (NSIs) is to provide citizens and researchers with a large amount of reliable and accurate statistical data. At the same time, NSIs must guarantee statistical confidentiality, so that no personal data can be obtained from the disseminated statistical outputs. The discipline of Statistical Disclosure Control (SDC) is concerned with guaranteeing that no individual data can be derived from the published statistical outputs while, at the same time, preserving as much of the richness of the data as possible. NSIs work with two types of data: microdata and tabular data. Microdata are files with individual records of persons or enterprises, each with a set of attributes; for example, the national census collects attributes such as age, sex, address or salary, among others. Tabular data are aggregated data obtained by crossing one or more categorical attributes of the microdata files. Several SDC methods are available for avoiding statistical disclosure in microdata or tabular data files. This thesis focuses on tabular data protection, although the research carried out can also be applied to other types of problems. The Controlled Tabular Adjustment (CTA) method and the Cell Suppression Problem (CSP) have concentrated most of the research done in the field of tabular data protection. Both methods formulate Mixed Integer Linear Programming (MILP) problems that are difficult to solve for tables of moderate size; even finding initial feasible solutions can be very hard. Given that many end users give priority to obtaining fast, good solutions even if they are not optimal, the first contribution of the thesis presents an improvement of a well-known and successful heuristic for finding feasible solutions of MILPs, called the feasibility pump. The new approach, based on the computation of analytic centers, is called the Analytic Center Feasibility Pump. The second contribution consists of applying the fix-and-relax (FR) heuristic to the CTA method. FR (alone or in combination with other heuristics) is shown to be competitive with CPLEX branch-and-cut in terms of quickly finding feasible solutions or good upper bounds. The last contribution of this thesis deals with the general Benders decomposition problem, improving it through the application of stabilization techniques. We present a method called stabilized Benders decomposition, which focuses on finding new solutions close to points previously considered good. This approach has been efficiently applied to the CSP, obtaining very good results on real tabular data and improving on other known alternatives for the CSP. The first two contributions have already been published in indexed journals (Operations Research Letters and Computers and Operations Research). We are currently working on the publication of the third contribution, which will shortly be submitted for review.
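For readers unfamiliar with CTA, the kind of MILP involved can be sketched in its generic textbook form, with upward/downward cell adjustments, additivity constraints and protection levels for sensitive cells; this is a standard-form sketch, not necessarily the exact model used in the thesis.

```latex
\[
\begin{aligned}
\min_{z^{+},\,z^{-},\,y}\quad & \sum_{i} w_i \,\bigl(z_i^{+} + z_i^{-}\bigr) \\
\text{s.t.}\quad & A\,\bigl(a + z^{+} - z^{-}\bigr) = b
  && \text{(additivity of the adjusted table)} \\
& z_i^{+} \ge \mathrm{upl}_i\, y_i, \qquad z_i^{-} \ge \mathrm{lpl}_i\,(1 - y_i)
  && \text{for every sensitive cell } i \\
& 0 \le z^{+} \le u^{+}, \quad 0 \le z^{-} \le u^{-}, \quad y_i \in \{0,1\},
\end{aligned}
\]
```

where \(a\) are the original cell values, \(w\) the cell weights, and \(\mathrm{upl}_i, \mathrm{lpl}_i\) the required upper and lower protection levels; the binary \(y_i\) chooses the direction of protection.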
Karlsson, Anton, and Torbjörn Sjöberg. "Synthesis of Tabular Financial Data using Generative Adversarial Networks." Thesis, KTH, Matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-273633.
Digitalisation has brought with it large amounts of available customer data and created opportunities for data-driven innovation. To protect the privacy of customers, however, the data must be handled carefully. Generative Adversarial Networks (GANs) are a promising recent development in generative modelling. They can be used to synthesize data that facilitates data analysis while preserving customer privacy. Previous research on GANs has shown promising results on image data. In this thesis we investigate the viability of GANs within the financial industry. We examine two prominent GANs designed to synthesize tabular data, TGAN and CTGAN, as well as a simpler GAN model that we call WGAN. A comprehensive framework for evaluating synthetic datasets is developed to enable comparison between different GANs. The results indicate that GANs are able to synthesize high-quality datasets that preserve the statistical properties of the underlying data, enabling a viable and reproducible downstream analysis. All of the tested models, however, exhibited problems with reproducing numerical data.
OLIVEIRA, Hugo Santos. "CSVValidation: uma ferramenta para validação de arquivos CSV a partir de metadados." Universidade Federal de Pernambuco, 2015. https://repositorio.ufpe.br/handle/123456789/18413.
Tabular data models have been widely used for publishing data on the Web because of their simplicity of representation and ease of manipulation. However, in some cases the data are not arranged in tabular files appropriately, which can cause data processing problems. Thus, the W3C proposed a standard specification for representing data in tabular format. In this context, the main objective of this work is to propose a solution to the problem of validating tabular data files, represented as CSV files and described by metadata represented as JSON files, according to the specification proposed by the W3C. The main contribution of this work is the definition of a tabular data file validation process and of the algorithms necessary for its implementation, as well as a prototype that validates tabular data as specified by the W3C. Another important contribution is the execution of experiments with data sources available on the Web in order to evaluate the proposed approach.
CREMASCHI, MARCO. "ENABLING TABULAR DATA UNDERSTANDING BY HUMANS AND MACHINES THROUGH SEMANTIC INTERPRETATION." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2020. http://hdl.handle.net/10281/263555.
A significant number of documents, reports and Web pages (one analysis reports 233M relational tables within the Common Crawl repository of 1.81 billion documents) make use of tables to convey information that cannot be easily processed by humans or understood by computers. To address this issue, we propose a new approach that allows computers to interpret the semantics of a table, and provides humans with a more accessible representation of the data contained in a table. To achieve this objective, the general problem has been broken down into three sub-problems: (i) define a method to provide a semantic interpretation of table data; (ii) define a descriptive model that allows computers to understand and share table data; and (iii) define processes, techniques and algorithms to generate natural language representations of the table data. Regarding sub-problem (i), the semantic representation of the data has been obtained through the application of table interpretation techniques, which support users in identifying, in a semi-automatic way, the meaning of the data in the table and the relationships between them. Such techniques take a table and a Knowledge Graph (KG) as input, and deliver as output an RDF representation, i.e. a set of tuples.
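To make the target output concrete, the sketch below shows what such an RDF representation of one annotated table row could look like, built with rdflib and DBpedia-style URIs; the row, the entities and the properties are invented for illustration and are not taken from the thesis.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

DBR = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")

g = Graph()

# One table row: | Italy | Rome | 60360000 |
# Column annotations: subject column -> dbo:Country, "Capital" -> dbo:capital,
# "Population" -> dbo:populationTotal.
country = DBR["Italy"]
g.add((country, RDF.type, DBO["Country"]))
g.add((country, DBO["capital"], DBR["Rome"]))
g.add((country, DBO["populationTotal"], Literal(60360000)))

print(g.serialize(format="turtle"))
```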
Radulovic, Nedeljko. "Post-hoc Explainable AI for Black Box Models on Tabular Data." Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAT028.
Current state-of-the-art Artificial Intelligence (AI) models have proven to be very successful in solving various tasks, such as classification, regression, Natural Language Processing (NLP), and image processing. The resources that we have at hand today allow us to train very complex AI models to solve different problems in almost any field: medicine, finance, justice, transportation, forecasting, etc. With the popularity and widespread use of AI models, the need to ensure trust in them has also grown. Complex as they come today, these AI models are impossible for humans to interpret and understand. In this thesis, we focus on the specific area of research, namely Explainable Artificial Intelligence (xAI), that aims to provide approaches to interpret complex AI models and explain their decisions. We present two approaches, STACI and BELLA, which focus on classification and regression tasks, respectively, for tabular data. Both methods are deterministic model-agnostic post-hoc approaches, which means that they can be applied to any black-box model after its creation. In this way, interpretability provides added value without the need to compromise on the black-box model's performance. Our methods provide accurate, simple and general interpretations of both the whole black-box model and its individual predictions. We confirmed their high performance through extensive experiments and a user study.
Alexandersson, Calle. "An evaluation of HTML5 components for web-based manipulation of tabular data." Thesis, Umeå universitet, Institutionen för datavetenskap, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-108274.
Troisemaine, Colin. "Novel class discovery in tabular data : an application to network fault diagnosis." Electronic Thesis or Diss., Ecole nationale supérieure Mines-Télécom Atlantique Bretagne Pays de la Loire, 2024. http://www.theses.fr/2024IMTA0422.
This thesis focuses on Novel Class Discovery (NCD) in the context of tabular data. The Novel Class Discovery problem consists in extracting knowledge from a labeled set of already known classes in order to more accurately partition an unlabeled set of new classes. Although NCD has recently received a lot of attention from the community, it is generally addressed in computer vision problems and sometimes under unrealistic conditions. In particular, the number of novel classes is often assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods based on these assumptions are not applicable to real-world scenarios. Thus, in this thesis we focus on solving NCD in tabular data when no a priori knowledge is available. The methods developed in the thesis are applied to a real-world case: automatic fault diagnosis in telecommunication networks, with a focus on fiber-optic access networks. The aim is to achieve efficient fault management, particularly at the diagnosis stage, when unknown faults (new classes) may appear.
PANFILO, DANIELE. "Generating Privacy-Compliant, Utility-Preserving Synthetic Tabular and Relational Datasets Through Deep Learning." Doctoral thesis, Università degli Studi di Trieste, 2022. http://hdl.handle.net/11368/3030920.
Two trends have rapidly been redefining the artificial intelligence (AI) landscape over the past several decades. The first of these is the rapid technological developments that make increasingly sophisticated AI feasible. From a hardware point of view, this includes increased computational power and efficient data storage. From a conceptual and algorithmic viewpoint, fields such as machine learning have undergone a surge and synergies between AI and other disciplines have resulted in considerable developments. The second trend is the growing societal awareness around AI. While institutions are becoming increasingly aware that they have to adopt AI technology to stay competitive, issues such as data privacy and explainability have become part of public discourse. Combined, these developments result in a conundrum: AI can improve all aspects of our lives, from healthcare to environmental policy to business opportunities, but invoking it requires the use of sensitive data. Unfortunately, traditional anonymization techniques do not provide a reliable solution to this conundrum. They are insufficient in protecting personal data, but also reduce the analytic value of data through distortion. However, the emerging study of deep-learning generative models (DLGM) may form a more refined alternative to traditional anonymization. Originally conceived for image processing, these models capture probability distributions underlying datasets. Such distributions can subsequently be sampled, giving new data points not present in the original dataset. However, the overall distribution of synthetic datasets, consisting of data sampled in this manner, is equivalent to that of the original dataset. In our research activity, we study the use of DLGM as an enabling technology for wider AI adoption. To do so, we first study legislation around data privacy with an emphasis on the European Union. In doing so, we also provide an outline of traditional data anonymization technology. We then provide an introduction to AI and deep learning. Two case studies are discussed to illustrate the field's merits, namely image segmentation and cancer diagnosis. We then introduce DLGM, with an emphasis on variational autoencoders. The application of such methods to tabular and relational data is novel and involves innovative preprocessing techniques. Finally, we assess the developed methodology in reproducible experiments, evaluating both the analytic utility and the degree of privacy protection through statistical metrics.
Hedbrant, Per. "Towards a fully automated extraction and interpretation of tabular data using machine learning." Thesis, Uppsala universitet, Avdelningen för systemteknik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-391490.
Miles, David B. L. "A User-Centric Tabular Multi-Column Sorting Interface For Intact Transposition Of Columnar Data." Diss., Brigham Young University, 2006. http://contentdm.lib.byu.edu/ETD/image/etd1160.pdf.
Liu, Jixiong. "Semantic Annotations for Tabular Data Using Embeddings : Application to Datasets Indexing and Table Augmentation." Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS529.
With the development of Open Data, a large number of data sources are made available to communities (including data scientists and data analysts). This data is the treasure of digital services as long as it is cleaned, unbiased, and combined with explicit and machine-processable semantics in order to foster exploitation. In particular, structured data sources (CSV, JSON, XML, etc.) are the raw material for many data science processes. However, this data derives from different domains with which consumers are not always familiar (knowledge gap), which complicates their appropriation, while this is a critical step in creating machine learning models. Semantic models (in particular, ontologies) make it possible to explicitly represent the implicit meaning of data by specifying the concepts and relationships present in the data. The provision of semantic labels on datasets facilitates the understanding and reuse of data by providing documentation that can easily be used by a non-expert. Moreover, semantic annotation opens the way to search modes that go beyond simple keywords, allowing queries at a high conceptual level over both the content and the structure of datasets, while overcoming the problems of syntactic heterogeneity encountered in tabular data. This thesis introduces a complete pipeline for the extraction, interpretation, and applications of tables in the wild with the help of knowledge graphs. We first revisit the existing definition of tables from the perspective of table interpretation and develop systems for collecting and extracting tables from the Web and local files. Three table interpretation systems are then proposed, based on either heuristic rules or graph representation models, to face the challenges observed in the literature. Finally, we introduce and evaluate two table augmentation applications based on semantic annotations, namely data imputation and schema augmentation.
Ayad, Célia. "Towards Reliable Post Hoc Explanations for Machine Learning on Tabular Data and their Applications." Electronic Thesis or Diss., Institut polytechnique de Paris, 2024. http://www.theses.fr/2024IPPAX082.
As machine learning continues to demonstrate robust predictive capabilities, it has emerged as a very valuable tool in several scientific and industrial domains. However, as ML models evolve to achieve higher accuracy, they also become increasingly complex and require more parameters. Being able to understand the inner complexities and to establish trust in the predictions of these machine learning models has therefore become essential in various critical domains including healthcare and finance. Researchers have developed explanation methods to make machine learning models more transparent, helping users understand why predictions are made. However, these explanation methods often fall short in accurately explaining model predictions, making it difficult for domain experts to utilize them effectively. It is crucial to identify the shortcomings of ML explanations, enhance their reliability, and make them more user-friendly. Additionally, with many ML tasks becoming more data-intensive and the demand for widespread integration rising, there is a need for methods that deliver strong predictive performance in a simpler and more cost-effective manner. In this dissertation, we address these problems in two main research thrusts: 1) We propose a methodology to evaluate various explainability methods in the context of specific data properties, such as noise levels, feature correlations, and class imbalance, and offer guidance for practitioners and researchers on selecting the most suitable explainability method based on the characteristics of their datasets, revealing where these methods excel or fail. Additionally, we provide clinicians with personalized explanations of cervical cancer risk factors based on their desired properties such as ease of understanding, consistency, and stability. 2) We introduce Shapley Chains, a new explanation technique designed to overcome the lack of explanations of multi-output predictions in the case of interdependent labels, where features may have indirect contributions to predict subsequent labels in the chain (i.e. the order in which these labels are predicted). Moreover, we propose Bayes LIME Chains to enhance the robustness of Shapley Chains.
Bandyopadhyay, Bortik. "Querying Structured Data via Informative Representations." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1595447189545086.
Braunschweig, Katrin [Verfasser], Wolfgang [Akademischer Betreuer] Lehner, and Stefan [Akademischer Betreuer] Conrad. "Recovering the Semantics of Tabular Web Data / Katrin Braunschweig. Betreuer: Wolfgang Lehner. Gutachter: Wolfgang Lehner ; Stefan Conrad." Dresden : Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2015. http://d-nb.info/1078205256/34.
Kanerva, Anton, and Fredrik Helgesson. "On the Use of Model-Agnostic Interpretation Methods as Defense Against Adversarial Input Attacks on Tabular Data." Thesis, Blekinge Tekniska Högskola, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-20085.
Context. Machine learning is an area of artificial intelligence that is under constant development. The number of domains in which we deploy machine learning models keeps growing, and these systems quietly spread into our daily lives through various electronic devices. Over the years, much time and effort has been devoted to increasing the predictive performance of these models, which has overshadowed the risk of vulnerabilities in the core of such systems: the trained model itself. A relatively new attack, the adversarial input attack, whose goal is to fool a model into making incorrect decisions, has been studied almost exclusively in image recognition. However, the threat posed by adversarial input attacks extends beyond image data to other data domains, such as the tabular domain, which is the most common data domain in industry. Methods for interpreting complex machine learning models can help people understand the behaviour of these systems and the decisions they make. Understanding a model's behaviour is an important component in detecting, understanding and mitigating its vulnerabilities. Objectives. This study attempts to reduce the research gap concerning adversarial input attacks and corresponding defence methods in the tabular domain. The goal is to analyse how model-agnostic interpretation methods can be used to mitigate and detect adversarial input attacks against tabular data. Method. The goal is reached through three consecutive experiments in which model interpretation methods are analysed, adversarial input attacks are evaluated and visualised, and a new interpretation-based method is proposed for detecting adversarial input attacks, together with a new mitigation technique in which feature selection is used defensively to reduce the size of the attack vector. Results. The proposed detection method achieves state-of-the-art results with over 86% accuracy. The proposed mitigation technique was shown to successfully harden the model against adversarial input attacks by reducing their attack strength by 33% without degrading the model's classification performance. Conclusions. This study contributes useful methods for detecting and mitigating adversarial input attacks, as well as methods for evaluating and visualising hard-to-perceive attacks against tabular data.
Mosquera, Evinton Antonio Cordoba. "Uma nova metáfora visual escalável para dados tabulares e sua aplicação na análise de agrupamentos." Universidade de São Paulo, 2017. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-07022018-082548/.
The rapid evolution of computing resources has enabled large datasets to be stored and retrieved. However, exploring, understanding and extracting useful information from them is still a challenge. Among the computational tools that address this problem, information visualization techniques enable data analysis that employs the human visual ability through graphic representations of the data set, while data mining provides automatic processes for the discovery and interpretation of patterns. Despite the recent popularity of information visualization methods, a recurring problem is their low visual scalability when analyzing large data sets, resulting in context loss and visual clutter. To represent large datasets while reducing the loss of relevant information, aggregation is used. Aggregation decreases the amount of data to be represented while preserving the distribution and trends of the original dataset. Regarding data mining, information visualization has become an essential tool in the interpretation of computational models and their results, especially for unsupervised techniques such as clustering. This is because, in these techniques, the only way the user interacts with the mining process is through parameterization, which limits the insertion of domain knowledge into the process. In this thesis, we propose and develop a new visual metaphor based on the Table Lens that employs aggregation to create more scalable representations of tabular data. As an application, we use the developed metaphor in the analysis of the results of clustering techniques. The resulting framework not only supports the analysis of large databases but also provides insights into how data attributes contribute to clustering in terms of cohesion and separation of the resulting groups.
Janga, Prudhvi. "Integration of Heterogeneous Web-based Information into a Uniform Web-based Presentation." University of Cincinnati / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1397467105.
Ognjenović, Višnja. "Aproksimativna diskretizacija tabelarno organizovanih podataka." PhD thesis, Univerzitet u Novom Sadu, Tehnički fakultet Mihajlo Pupin u Zrenjaninu, 2016. https://www.cris.uns.ac.rs/record.jsf?recordId=101259&source=NDLTD&language=en.
This dissertation analyses the influence of data distribution on the results of discretization algorithms within the process of machine learning. Based on the chosen databases and on discretization algorithms from rough set theory and decision trees, the relation between data distribution and cuts within a given discretization has been researched. Changes in the consistency of a discretized table, depending on the position of the reduced cut on the histogram, have been monitored. Fixed cuts have been defined as dependent on the multimodal segmentation, on the basis of which it is possible to reduce the remaining cuts. To determine the fixed cuts, an algorithm called FixedPoints has been constructed, which determines these points in accordance with a rough segmentation of the multimodal distribution. An algorithm for approximate discretization, APPROX MD, has been constructed for cut reduction, using cuts obtained through the maximum discernibility (MD-Heuristic) algorithm and parameters related to the percentage of imprecise rules, the total classification percentage and the number of reduced cuts. The algorithm has been compared to the MD algorithm and to the MD algorithm with approximate solutions for α = 0.95.
Da Silva Carvalho, Paulo. "Plateforme visuelle pour l'intégration de données faiblement structurées et incertaines." Thesis, Tours, 2017. http://www.theses.fr/2017TOUR4020/document.
We hear a lot about Big Data, Open Data, Social Data, Scientific Data, etc. The importance currently given to data is, in general, very high; we are living in the era of massive data. The analysis of these data is important if the objective is to successfully extract value from them so that they can be used. The work presented in this thesis is related to the understanding, assessment, correction/modification, management and, finally, integration of data, in order to allow their exploitation and reuse. Our research focuses exclusively on Open Data and, more precisely, on Open Data organized in tabular form (CSV being one of the most widely used formats in the Open Data domain). The term Open Data first appeared in 1995, when the GCDIS group (Global Change Data and Information System, from the United States) used this expression to encourage entities having the same interests and concerns to share their data [Data et System, 1995]. However, the Open Data movement has only recently undergone a sharp increase and has become a popular phenomenon all over the world. Being recent, it is a field that is still growing and whose importance is very strong. The encouragement given by governments and public institutions to have their data published openly plays an important role at this level.
Huřťák, Ladislav. "Kontingenční tabulky a jejich využití ve výzkumu." Master's thesis, Vysoká škola ekonomická v Praze, 2008. http://www.nusl.cz/ntk/nusl-3562.
Bršlíková, Jana. "Analýza úmrtnostních tabulek pomocí vybraných vícerozměrných statistických metod." Master's thesis, Vysoká škola ekonomická v Praze, 2015. http://www.nusl.cz/ntk/nusl-201859.
Kocáb, Jan. "Statistické usuzování v analýze kategoriálních dat." Master's thesis, Vysoká škola ekonomická v Praze, 2010. http://www.nusl.cz/ntk/nusl-76171.
Keclík, David. "Návrh změn informačního systému firmy." Master's thesis, Vysoké učení technické v Brně. Fakulta podnikatelská, 2011. http://www.nusl.cz/ntk/nusl-222834.
Čejková, Lenka. "Možnosti využití moderních technologií ve výuce ekonomických předmětů na SŠ." Master's thesis, Vysoká škola ekonomická v Praze, 2015. http://www.nusl.cz/ntk/nusl-205949.
Šmerda, Vojtěch. "Grafický editor metadat pro OLAP." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2008. http://www.nusl.cz/ntk/nusl-235893.
Branderský, Gabriel. "Knihovna znovupoužitelných komponent a utilit pro framework Angular 2." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2017. http://www.nusl.cz/ntk/nusl-363860.
Bažant, Milan. "Software pro detekci, analýzu a opravu kolizních objednávek v CRM systému." Master's thesis, Vysoké učení technické v Brně. Fakulta podnikatelská, 2009. http://www.nusl.cz/ntk/nusl-222115.
Vlach, Petr. "Grafický podsystém v prostředí internetového prohlížeče." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2008. http://www.nusl.cz/ntk/nusl-235437.
Hoopes, Daniel Matthew. "The ContexTable: Building and Testing an Intelligent, Context-Aware Kitchen Table." BYU ScholarsArchive, 2004. https://scholarsarchive.byu.edu/etd/12.
Pepuchová, Valéria. "Návrh systému stimulace pracovníků." Master's thesis, Vysoké učení technické v Brně. Fakulta podnikatelská, 2011. http://www.nusl.cz/ntk/nusl-222832.
Chou, Hsu-Heng, and 周盱衡. "A Tree Based (m,k)-Anonymity Privacy Preserving Technique For Tabular Data." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/646dv4.
Full text國立中興大學
電機工程學系所
107
Data publishing contributes to the advancement of data science and to the application of knowledge-based decision making. However, data publishing faces the problem of privacy leakage. Once data is published, sensitive information may be mined, resulting in virtual or physical threats and attacks. For example, the identity of an anonymous user may be recognized and information that he or she is reluctant to bring to light may be revealed. More seriously, revealing one's physical location could endanger his or her personal safety. Therefore, data should be carefully examined and should go through a privacy protection process before being released. Nowadays, k-anonymity is still one of the most frequently used privacy preserving models, and generalization and perturbation are the common anonymization techniques. However, most generalization or perturbation techniques do not consider the characteristics of high-dimensional data and thus lead to low data utility. For the privacy preserving problem of high-dimensional data, we consider that it is not easy for adversaries to obtain many data attributes with which to mount privacy attacks; on the other hand, it is not easy to determine the quasi-identifier attributes. Instead of making all attributes k-anonymized, ensuring that any m sub-dimensions of the data attributes conform to the k-anonymity condition is a reasonable compromise between privacy preservation and data utility. Therefore, to handle the (m,k)-anonymity problem for tabular data, we propose an (m,k)-anonymity algorithm with a Combination-Tree (C-Tree). The (m,k)-anonymity algorithm searches the C-Tree in a greedy, top-down manner to generalize the attributes of unqualified data records. The C-Tree is built based on the Pascal theorem to summarize the data, for ease of searching the unqualified data and of figuring out the equivalence classes for local generalization. We also propose the Taxonomy Index Support (TIS) to speed up the generalization process. To validate our methods, we conduct experiments with real datasets to study the key factors that influence information loss and utility. According to the experimental results, our method outperforms previous methods in achieving k-anonymity with lower information loss. However, the experimental results also show long computing times, which is due to the high computational complexity. Future work includes designing efficient data structures and algorithms to make the technique serviceable.
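To pin down the privacy condition being targeted, the sketch below checks whether a table satisfies (m,k)-anonymity, i.e. whether every value combination over every subset of m attributes is shared by at least k records. It is a naive restatement of the definition (it enumerates all attribute subsets), not the C-Tree algorithm of the thesis, and the example columns are made up.

```python
from itertools import combinations
import pandas as pd

def satisfies_mk_anonymity(df: pd.DataFrame, m: int, k: int) -> bool:
    """Naive check: every combination of m columns may only contain groups of size >= k."""
    for cols in combinations(df.columns, m):
        group_sizes = df.groupby(list(cols)).size()
        if (group_sizes < k).any():
            return False
    return True

# Hypothetical published table.
df = pd.DataFrame({
    "age_band":  ["20-29", "20-29", "20-29", "30-39", "30-39", "30-39"],
    "zip":       ["402",   "402",   "402",   "403",   "403",   "403"],
    "education": ["BSc",   "BSc",   "BSc",   "MSc",   "MSc",   "MSc"],
})

print(satisfies_mk_anonymity(df, m=2, k=2))  # True: every pair of attributes is 2-anonymous
```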
Chen, Jiun-Lin, and 陳俊霖. "A Study on Structure Extraction, Identification and Data Extraction in Tabular-form Document Images Processing." Thesis, 2001. http://ndltd.ncl.edu.tw/handle/11910018098098981706.
National Chiao Tung University, Department of Computer Science and Information Engineering (academic year 89).
In this thesis, we propose methods for solving several problems in an automatic tabular-form document processing system. These problems include form structure extraction, form line removal, form document identification and field data grouping. First, a strip projection method is presented for extracting the form structure. We first segment input form images into uniform vertical and horizontal strips. Since most form lines are vertical or horizontal, we can locate form lines by projecting the image of each vertical strip horizontally and that of each horizontal strip vertically. The peak positions in these projection profiles denote possible locations of lines in form images. We then extract the lines starting from the possible line positions in the source image. After all lines have been extracted, redundant lines are removed using a line-verification algorithm and broken lines are linked using a line-merging algorithm. Experimental results show that the proposed method can extract form structures from A4-sized documents in about 3 seconds, which is very efficient compared with methods based on the Hough transform and run-based line-detection algorithms. Second, a form document identification algorithm is proposed based on the similarity values of extracted form fields. The field similarity is defined by normalizing the differences of the top-left corner points and of the bottom-right corner points between the template field and the input field. Since the input form image can be de-skewed according to the results of form structure extraction, each extracted field is a rectangle; thus, a field can be represented by its top-left and bottom-right corner points. Besides, since a short boundary line does not mean that the field with this line is small, such short form lines can introduce difficulties in form document identification if some of them are not correctly extracted. By comparing the extracted fields, our method is much more efficient than methods that compare extracted form lines and is less sensitive to mis-extracted form lines. A slice-removing method with filled-in data preservation for form line removal is also proposed in this thesis. In this method, we first go through a given form line to estimate its width. Then, we go through the line again to calculate the effective line width at each position and compare it with the estimated line width. The effective line width is the maximum length of the line traced from the slice position in the direction orthogonal to the form line. If the effective line width at a position is larger than the estimated line width, we preserve the line slice at this position, since the slice should be located on filled-in data; otherwise, the slice is removed. This thesis also proposes a novel approach to grouping Chinese handwritten field data filled into form documents using a gravitation-based algorithm. This algorithm is developed to extract handwritten field data that may be written outside form fields. First, form lines are extracted and removed from input form images. Connected components are then detected from the remaining data, and the gravitation for each connected component is computed using the black pixel counts as their mass. Next, we move connected components according to their gravitation. As is generally known, filled-in data have the locality property, i.e., data of the same field are normally written in a local area consecutively. Therefore, the relationship between these connected components can be determined by this property. Repeatedly moving the connected components according to their neighboring components allows us to determine which connected components should be extracted together for a particular field. Experimental results demonstrate the effectiveness of the proposed method in extracting field data.
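The strip-projection idea can be sketched for a synthetic binary page image as follows; the strip count and threshold are arbitrary illustrative choices, not values from the thesis.

```python
import numpy as np

def horizontal_line_candidates(binary_img: np.ndarray, n_strips: int = 8, ratio: float = 0.6):
    """Split the page into vertical strips, project each strip horizontally,
    and keep the rows that look like a horizontal form line in every strip."""
    h, w = binary_img.shape
    strip_w = w // n_strips
    votes = np.zeros(h, dtype=int)
    for s in range(n_strips):
        strip = binary_img[:, s * strip_w:(s + 1) * strip_w]
        profile = strip.sum(axis=1)                      # horizontal projection of this strip
        votes += (profile > ratio * strip.shape[1]).astype(int)
    return np.where(votes == n_strips)[0]                # rows dark in all strips

# Synthetic 100x200 page with two horizontal lines at rows 20 and 70.
page = np.zeros((100, 200), dtype=np.uint8)
page[20, :] = 1
page[70, :] = 1
print(horizontal_line_candidates(page))                  # -> [20 70]
```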
Ferreira, Francisco Martins. "Anonymizing Private Information: From Noise to Data." Master's thesis, 2021. http://hdl.handle.net/10316/95554.
In the Information Age, data has become more important for all types of organizations. The information carried by large datasets enables the creation of intelligent systems that overcome inefficiencies and create a safer and better quality of life. Because of this, organizations have come to see data as a competitive advantage. Fraud detection solutions are one example of intelligent systems that are highly dependent on having access to large amounts of data. These solutions receive information about monetary transactions and classify them as legitimate or fraudulent in real time. This field has benefitted from a higher availability of data, allowing the application of Machine Learning (ML) algorithms that leverage the information in datasets to find fraudulent activity in real time. In a context of systematic gathering of information, privacy dictates how data can be used and shared in order to protect the information of users and organizations. In order to retain the utility of data, a growing amount of effort has been dedicated to creating and exploring avenues for privacy-conscious data sharing. Generating synthetic datasets that carry the same information as real data allows for the creation of ML solutions while respecting the limitations placed on data usage. In this work, we introduce Duo-GAN and DW-GAN as frameworks for synthetic data generation that learn the specificities of financial transaction data and generate fictitious data that keeps the utility of the original collections of data. Both frameworks use two generators, one for generating fraudulent instances and one for generating legitimate instances. This allows each generator to learn the distribution of each class, avoiding the problems created by highly unbalanced data. Duo-GAN achieves positive results, in some instances achieving a disparity of only 4% in F1 score between classifiers trained with synthetic data and classifiers trained with real data, both tested on the same real data. DW-GAN also presents positive results, with a disparity of 3% in F1 score under the same conditions.
This work is partially funded by national funds through the FCT - Foundation for Science and Technology, I.P., within the scope of the project CISUC - UID/CEC/00326/2020, by the European Social Fund through the Regional Operational Program Centro 2020, and by the CMU|Portugal project CAMELOT (POCI-01-0247-FEDER-045915).
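The two-generator idea can be made concrete with a short sketch: partition the training data by class, fit one tabular generator per class, then sample each class separately and recombine. The open-source ctgan package is used here only as a stand-in generator (its class name and fit/sample API are an assumption that may differ between versions); the actual Duo-GAN and DW-GAN architectures from the thesis are not reproduced.

```python
import pandas as pd
from ctgan import CTGAN  # assumed API of the open-source ctgan package; may differ by version

def fit_per_class_generators(df: pd.DataFrame, label_col: str, discrete_cols, epochs: int = 300):
    """Duo-GAN-style setup (sketch): one generator per class of a highly unbalanced dataset."""
    generators = {}
    for cls, part in df.groupby(label_col):
        gen = CTGAN(epochs=epochs)
        gen.fit(part.drop(columns=[label_col]), discrete_cols)
        generators[cls] = gen
    return generators

def sample_synthetic(generators, counts, label_col: str):
    """Sample each class separately and recombine, e.g. counts={0: 9500, 1: 500}."""
    parts = []
    for cls, n in counts.items():
        sampled = generators[cls].sample(n)
        sampled[label_col] = cls
        parts.append(sampled)
    return pd.concat(parts, ignore_index=True)
```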
Santos, Ana Rita Albuquerque. "A client focused business intelligence & analytics solution for the hospitality sector." Master's thesis, 2020. http://hdl.handle.net/10362/106502.
One of the greatest needs of today's businesses is to know the customer, or the type of customer they want to reach, which makes a customer database a strategic weapon and one of the most important investments a company can make. The business world is becoming more competitive every day; we are constantly overwhelmed with advertisements for products we may like, promotions on products we usually buy, or discounts on the next purchase if we subscribe to the company's newsletter. All of this creates customization for the client, and any company that is not able to do this cannot keep up with its competition. This report details the project developed at Pestana Hotel Group, which consisted of a Business Intelligence solution, more specifically the development of a customer database through the creation of two tabular models using SQL Server tools, one specific to loyal customers and another, more general, with information about all Pestana customers, and two Power BI reports that allow the information obtained to be visualized in an effective and simplified way. This report contains a literature review that situates the reader on the subjects addressed in this project, a chapter dedicated to the data modeling used to create the tabular models, and another on the creation of the reports.
Sarvghad, Batn Moghaddam Ali. "Tracking and visualizing dimension space coverage for exploratory data analysis." Thesis, 2016. http://hdl.handle.net/1828/7442.
Full textGraduate
0984
ali.sarvghad@gmail.com
Cardoso, Francisco Pereira. "Gestão e otimização de dados para a criação de análises informativas." Master's thesis, 2019. http://hdl.handle.net/10362/120489.
With the constant accumulation of data by companies over time, it becomes complex to maintain an organized and efficient structure for storing and analysing this data, and the lack of informative analyses to assist business decision making becomes evident. The client for whom the solution will be developed faces these same problems and seeks a solution that will solve them. Initially, it is necessary to analyze the data stored by the customer and understand the most efficient way to process and organize it in an analytical solution, in order to promote its frequent consumption by the various users. In addition, it is necessary to design the data extraction, transformation and loading process, and also to design and implement a structured, efficient and scalable architecture that meets the client's needs. The final solution aims to promote greater use of the stored data. Another major objective is to reduce the time and effort required to create business reports, thus contributing to reducing company costs and increasing business productivity.
Slavíková, Karolina. "Netradiční fyzikální tabulky." Master's thesis, 2013. http://www.nusl.cz/ntk/nusl-328146.
Toušková, Daniela. "Hodnotící tabulka jako nástroj pro měření makroekonomických nerovnováh." Master's thesis, 2013. http://www.nusl.cz/ntk/nusl-324950.
Brodec, Václav. "Hledání a vytváření relací mezi sloupci v CSV souborech s využitím Linked Dat." Master's thesis, 2019. http://www.nusl.cz/ntk/nusl-393117.
Smutný, Lukáš. "M-technologie ve výuce a v řízení základních škol v rámci Moravskoslezského kraje." Master's thesis, 2014. http://www.nusl.cz/ntk/nusl-335128.
MATĚJŮ, Petr. "Videosekvence a jejich využití při výuce fyziky na ZŠ." Master's thesis, 2011. http://www.nusl.cz/ntk/nusl-81502.