
Dissertations / Theses on the topic 'Tabular data'

Consult the top 49 dissertations / theses for your research on the topic 'Tabular data.'

1

Xu, Lei S. M. Massachusetts Institute of Technology. "Synthesizing tabular data using conditional GAN." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/128349.

Full text
Abstract:
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2020
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 89-93).
In data science, the ability to model the distribution of rows in tabular data and generate realistic synthetic data enables various important applications including data compression, data disclosure, and privacy-preserving machine learning. However, because tabular data usually contains a mix of discrete and continuous columns, building such a model is a non-trivial task. Continuous columns may have multiple modes, while discrete columns are sometimes imbalanced, making modeling difficult. To address this problem, I took two major steps. (1) I designed SDGym, a thorough benchmark, to compare existing models, identify different properties of tabular data and analyze how these properties challenge different models. Our experimental results show that statistical models, such as Bayesian networks, that are constrained to a fixed family of available distributions cannot model tabular data effectively, especially when both continuous and discrete columns are included. Recently proposed deep generative models are capable of modeling more sophisticated distributions, but cannot outperform Bayesian network models in practice, because the network structure and learning procedure are not optimized for tabular data which may contain non-Gaussian continuous columns and imbalanced discrete columns. (2) To address these problems, I designed CTGAN, which uses a conditional generative adversarial network to address the challenges in modeling tabular data. Because CTGAN uses reversible data transformations and is trained by re-sampling the data, it can address common challenges in synthetic data generation. I evaluated CTGAN on the benchmark and showed that it consistently and significantly outperforms existing statistical and deep learning models.
by Lei Xu, S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science.
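As a rough illustration of the conditional-GAN workflow described in the abstract above, the sketch below fits a generator to a small mixed-type table and samples synthetic rows. It assumes the open-source `ctgan` Python package that grew out of this line of work; class and method names may differ across versions, and the toy data frame and column names are made up.

```python
# Minimal sketch: fit a conditional GAN to a mixed-type table and sample synthetic rows.
import pandas as pd
from ctgan import CTGAN

real = pd.DataFrame({
    "age":    [23, 35, 41, 58, 29, 62],                                   # continuous, possibly multimodal
    "income": [28e3, 52e3, 61e3, 90e3, 33e3, 75e3],                       # continuous
    "job":    ["nurse", "clerk", "clerk", "manager", "nurse", "manager"], # imbalanced discrete
})

model = CTGAN(epochs=10)                           # tiny training budget, illustration only
model.fit(real, discrete_columns=["job"])          # discrete columns drive the conditional sampling
synthetic = model.sample(1000)                     # rows drawn from the learned distribution
print(synthetic.head())
```

In practice the synthetic table would then be compared against the real one with a benchmark such as the SDGym-style evaluation described in the abstract.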
2

Liu, Zhicheng. "Network-based visual analysis of tabular data." Diss., Georgia Institute of Technology, 2012. http://hdl.handle.net/1853/43687.

Full text
Abstract:
Tabular data is pervasive in the form of spreadsheets and relational databases. Although tables often describe multivariate data without explicit network semantics, it may be advantageous to explore the data modeled as a graph or network for analysis. Even when a given table design conveys some static network semantics, analysts may want to look at multiple networks from different perspectives, at different levels of abstraction, and with different edge semantics. This dissertation is motivated by the observation that a general approach for performing multi-dimensional and multi-level network-based visual analysis on multivariate tabular data is necessary. We present a formal framework based on the relational data model that systematically specifies the construction and transformation of graphs from relational data tables. In the framework, a set of relational operators provide the basis for rich expressive power for network modeling. Powered by this relational algebraic framework, we design and implement a visual analytics system called Ploceus. Ploceus supports flexible construction and transformation of networks through a direct manipulation interface, and integrates dynamic network manipulation with visual exploration for a seamless analytic experience.
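To make the idea of deriving networks from a multivariate table concrete, here is a small sketch with hypothetical columns: the same table yields both a bipartite graph and a projected co-occurrence graph, two of the many network views an analyst might want to explore.

```python
# Two network views of one table: a bipartite graph and its weighted projection.
import pandas as pd
from networkx import Graph
from networkx.algorithms import bipartite

papers = pd.DataFrame({
    "author":     ["Ann", "Ann", "Bob", "Cara", "Bob"],
    "conference": ["VIS", "CHI", "VIS", "VIS", "KDD"],
})

B = Graph()
B.add_nodes_from(papers["author"], part="author")
B.add_nodes_from(papers["conference"], part="conference")
B.add_edges_from(papers.itertuples(index=False, name=None))   # (author, conference) edges

# Alternative view of the same table: authors linked when they attend the same conference.
coattendance = bipartite.weighted_projected_graph(B, set(papers["author"]))
print(coattendance.edges(data=True))
```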
3

Caspár, Sophia. "Visualization of tabular data on mobile devices." Thesis, Luleå tekniska universitet, Institutionen för system- och rymdteknik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-68036.

Full text
Abstract:
This thesis evaluates various ways of displaying tabular data on mobile devices using different responsive table solutions. It also presents a tool to help web developers and designers in the process of choosing and implementing a suitable table approach. The proposed solution is a web system called The Visualizing Wizard, which allows the user to answer some questions about the intended table and then receive a recommended responsive table solution generated from the answers. The system uses a rule-based approach via Prolog to match the answers against a set of rules and provide an appropriate result. In order to determine which table solutions are more appropriate for which type of data, a statistical analysis and user tests were performed. The statistical analysis investigates the most common table approaches and data types used on various websites. The results indicate that solutions such as "squish", "collapse by rows", "click" and "scroll" are the most common, and that the most common table categories are product comparison, product offerings, sports and stock market/statistics. This information was used to design and run user tests to collect feedback and opinions. The data and statistics gathered from the user tests were mapped into sets of rules to answer the question of which responsive table solution is more appropriate for which type of data. This serves as the foundation for The Visualizing Wizard.
4

Braunschweig, Katrin. "Recovering the Semantics of Tabular Web Data." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2015. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-184502.

Full text
Abstract:
The Web provides a platform for people to share their data, leading to an abundance of accessible information. In recent years, significant research effort has been directed especially at tables on the Web, which form a rich resource for factual and relational data. Applications such as fact search and knowledge base construction benefit from this data, as it is often less ambiguous than unstructured text. However, many traditional information extraction and retrieval techniques are not well suited for Web tables, as they generally do not consider the role of the table structure in reflecting the semantics of the content. Tables provide a compact representation of similarly structured data. Yet, on the Web, tables are very heterogeneous, often with ambiguous semantics and inconsistencies in the quality of the data. Consequently, recognizing the structure and inferring the semantics of these tables is a challenging task that requires a designated table recovery and understanding process. In the literature, many important contributions have been made to implement such a table understanding process that specifically targets Web tables, addressing tasks such as table detection or header recovery. However, the precision and coverage of the data extracted from Web tables is often still quite limited. Due to the complexity of Web table understanding, many techniques developed so far make simplifying assumptions about the table layout or content to limit the amount of contributing factors that must be considered. Thanks to these assumptions, many sub-tasks become manageable. However, the resulting algorithms and techniques often have a limited scope, leading to imprecise or inaccurate results when applied to tables that do not conform to these assumptions. In this thesis, our objective is to extend the Web table understanding process with techniques that enable some of these assumptions to be relaxed, thus improving the scope and accuracy. We have conducted a comprehensive analysis of tables available on the Web to examine the characteristic features of these tables, but also identify unique challenges that arise from these characteristics in the table understanding process. To extend the scope of the table understanding process, we introduce extensions to the sub-tasks of table classification and conceptualization. First, we review various table layouts and evaluate alternative approaches to incorporate layout classification into the process. Instead of assuming a single, uniform layout across all tables, recognizing different table layouts enables a wide range of tables to be analyzed in a more accurate and systematic fashion. In addition to the layout, we also consider the conceptual level. To relax the single concept assumption, which expects all attributes in a table to describe the same semantic concept, we propose a semantic normalization approach. By decomposing multi-concept tables into several single-concept tables, we further extend the range of Web tables that can be processed correctly, enabling existing techniques to be applied without significant changes. Furthermore, we address the quality of data extracted from Web tables, by studying the role of context information. Supplementary information from the context is often required to correctly understand the table content, however, the verbosity of the surrounding text can also mislead any table relevance decisions. 
We first propose a selection algorithm to evaluate the relevance of context information with respect to the table content in order to reduce the noise. Then, we introduce a set of extraction techniques to recover attribute-specific information from the relevant context in order to provide a richer description of the table content. With the extensions proposed in this thesis, we increase the scope and accuracy of Web table understanding, leading to a better utilization of the information contained in tables on the Web.
5

Cappuzzo, Riccardo. "Deep learning models for tabular data curation." Electronic Thesis or Diss., Sorbonne université, 2022. http://www.theses.fr/2022SORUS047.

Full text
Abstract:
La conservation des données est un sujet omniprésent et de grande envergure, qui touche tous les domaines, du monde universitaire à l'industrie. Les solutions actuelles reposent sur le travail manuel des utilisateurs du domaine, mais elles ne sont pas adaptées. Nous étudions comment appliquer l'apprentissage profond à la conservation des données tabulaires. Nous concentrons notre travail sur le développement de systèmes de curation de données non supervisés et sur la conception de systèmes de curation qui modélisent intrinsèquement les valeurs catégorielles dans leur forme brute. Nous implémentons d'abord EmbDI pour générer des embeddings pour les données tabulaires, et nous traitons les tâches de résolution d'entités et de correspondance de schémas. Nous passons ensuite au problème de l'imputation des données en utilisant des réseaux neuronaux graphiques dans un cadre d'apprentissage multi-tâches appelé GRIMP
Data curation is a pervasive and far-reaching topic, affecting everything from academia to industry. Current solutions rely on manual work by domain users, but they are not adequate. We investigate how to apply deep learning to tabular data curation. We focus our work on developing unsupervised data curation systems and on designing curation systems that intrinsically model categorical values in their raw form. We first implement EmbDI to generate embeddings for tabular data, and address the tasks of entity resolution and schema matching. We then turn to the data imputation problem, using graph neural networks in a multi-task learning framework called GRIMP.
6

Baxter, Jay. "BayesDB : querying the probable implications of tabular data." Thesis, Massachusetts Institute of Technology, 2014. http://hdl.handle.net/1721.1/91451.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2014.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 93-95).
BayesDB, a Bayesian database table, lets users query the probable implications of their tabular data as easily as an SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with little statistics knowledge can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries. BayesDB is suitable for analyzing complex, heterogeneous data tables with no preprocessing or parameter adjustment required. This generality rests on the model independence provided by BQL, analogous to the physical data independence provided by the relational model. SQL enables data filtering and aggregation tasks to be described independently of the physical layout of data in memory and on disk. Non-experts rely on generic indexing strategies for good-enough performance, while experts customize schemes and indices for performance-sensitive applications. Analogously, BQL enables analysis tasks to be described independently of the models used to solve them. Non-statisticians can rely on a general-purpose modeling method called CrossCat to build models that are good enough for a broad class of applications, while experts can customize the schemes and models when needed. This thesis defines BQL, describes an implementation of BayesDB, quantitatively characterizes its scalability and performance, and illustrates its efficacy on real-world data analysis problems in the areas of healthcare economics, statistical survey data analysis, web analytics, and predictive policing.
by Jay Baxter, M. Eng.
7

Jiang, Ji Chu. "High Precision Deep Learning-Based Tabular Data Extraction." Thesis, Université d'Ottawa / University of Ottawa, 2021. http://hdl.handle.net/10393/41699.

Full text
Abstract:
The advancements of AI methodologies and computing power enable automation and propel the Industry 4.0 phenomenon. Information and data are digitized more than ever; millions of documents are processed every day, fueled by the growth of institutions, organizations, and their supply chains. Processing documents is a time-consuming, laborious task, so automating data processing is highly important for optimizing supply-chain efficiency across all industries. Document analysis for data extraction is an impactful field, and this thesis aims to achieve the vital steps of an ideal data extraction pipeline. Data is often stored in tables, since it is a structured format in which the user can easily associate values and attributes. Tables can contain vital information such as specifications, dimensions, cost, etc. Focusing on table analysis and recognition in documents is therefore a cornerstone of data extraction. This thesis applies deep learning methodologies to automate the two main problems within table analysis for data extraction: table detection and table structure detection. Table detection is identifying and localizing the boundaries of the table. The output of the table detection model is fed into the table structure detection model for structure format analysis, so the table detection model must have high localization performance or it will degrade the rest of the data extraction pipeline. Our table detection improves bounding box localization performance by incorporating a Kullback–Leibler loss function that measures the divergence between the probability distributions of the ground truth and predicted bounding boxes, and by adding a voting procedure to the non-maximum suppression step to produce better-localized merged bounding box proposals. This model improved the precision of table detection by 1.2% while achieving the same recall as other state-of-the-art models on the public ICDAR2013 dataset, and achieved state-of-the-art precision of 99.8% on the ICDAR2017 dataset. Furthermore, our model showed large improvements especially at higher intersection over union (IoU) thresholds; at 95% IoU an improvement of 10.9% can be seen for the ICDAR2013 dataset and an improvement of 8.4% for the ICDAR2017 dataset. Table structure detection is recognizing the internal layout of a table. Researchers often approach this by detecting the rows and columns. However, for correct mapping of each individual cell's data location in the semantic extraction step, the rows and columns would have to be combined into a matrix, which introduces additional degrees of error. We instead propose a model that directly detects each individual cell. Our model is an ensemble of state-of-the-art components: Hybrid Task Cascade as the detector and dual ResNeXt101 backbones arranged in a CBNet architecture. There is a lack of quality labeled data for table cell structure detection, so we hand-labeled the ICDAR2013 dataset and wish to establish a strong baseline for it. Our model was compared with other state-of-the-art models that excel at table or table structure detection, and yielded a precision of 89.2% and recall of 98.7% on the ICDAR2013 cell structure dataset.
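The sketch below illustrates the general shape of a KL-divergence-style bounding-box regression loss in which the network predicts a Gaussian (mean and log-variance) per coordinate and the ground truth is treated as a Dirac delta. It is an illustrative sketch of this family of losses, not necessarily the exact formulation used in the thesis; the smooth-L1 threshold and the example numbers are assumptions.

```python
import numpy as np

def kl_box_loss(pred_mean, pred_log_var, gt, smooth_l1_beta=1.0):
    """KL-style loss for one box coordinate: prediction ~ N(pred_mean, exp(pred_log_var)),
    ground truth = Dirac at gt. Up to constants, KL reduces to a variance-weighted
    regression error plus a penalty on over-confidence."""
    err = np.abs(gt - pred_mean)
    var = np.exp(pred_log_var)
    # Smooth-L1 flavour commonly used to stabilise large errors.
    reg = np.where(err < smooth_l1_beta,
                   0.5 * err**2,
                   smooth_l1_beta * (err - 0.5 * smooth_l1_beta))
    return reg / var + 0.5 * pred_log_var

# The same localisation error is penalised more when the model claims high confidence.
print(kl_box_loss(pred_mean=10.0, pred_log_var=-2.0, gt=12.0))  # confident and wrong: large loss
print(kl_box_loss(pred_mean=10.0, pred_log_var=1.0,  gt=12.0))  # uncertain: smaller loss
```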
8

Rahman, Md Anisur. "Tabular Representation of Schema Mappings: Semantics and Algorithms." Thèse, Université d'Ottawa / University of Ottawa, 2011. http://hdl.handle.net/10393/20032.

Full text
Abstract:
Our thesis investigates a mechanism for representing schema mappings in tabular form and examines the utility of the new representation. A schema mapping is a high-level specification that describes the relationship between two database schemas. Schema mappings constitute essential building blocks of data integration, data exchange and peer-to-peer data sharing systems. Global-and-local-as-view (GLAV) is one of the approaches for specifying schema mappings. Tableaux are used for expressing queries and functional dependencies on a single database in a tabular form. In our thesis, we first introduce a tabular representation of GLAV mappings. We find that this tabular representation helps to solve many mapping-related algorithmic and semantic problems. For example, a well-known problem is to find the minimal instance of the target schema for a given instance of the source schema and a set of mappings between the source and the target schema. Second, we show that our proposed tabular mapping can be used as an operator on an instance of the source schema to produce an instance of the target schema that is "minimal" and "most general" in nature. There exists a tableaux-based mechanism for finding the equivalence of two queries. Third, we extend that mechanism to deduce equivalence between two schema mappings using their corresponding tabular representations. Sometimes there are redundant conjuncts in a schema mapping, which makes data exchange, data integration and data sharing operations more time-consuming. Fourth, we present an algorithm that utilizes the tabular representations to reduce the number of constraints in a schema mapping. At present, either schema-level mappings or data-level mappings are used for data sharing purposes. Fifth, we introduce and give the semantics of bi-level mappings, which combine schema-level and data-level mappings. We also show that bi-level mappings are more effective for data sharing systems. Finally, we implemented our algorithms and developed a software prototype to evaluate our proposed strategies.
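For readers unfamiliar with GLAV mappings, the following is a small illustrative example of the kind of constraint being represented in tabular form: a source-to-target tuple-generating dependency with hypothetical relation names (not taken from the thesis).

\[
\forall e\,\forall d\;\mathrm{Emp}(e,d)\;\rightarrow\;\exists m\;\big(\mathrm{Dept}(d,m)\wedge \mathrm{ReportsTo}(e,m)\big)
\]

That is, every employee-department fact in the source implies the existence, in the target, of a department tuple with some manager to whom the employee reports; the existential variable \(m\) is what makes the mapping GLAV rather than purely global-as-view.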
9

Baena, Mirabete Daniel. "Exact and heuristic methods for statistical tabular data protection." Doctoral thesis, Universitat Politècnica de Catalunya, 2017. http://hdl.handle.net/10803/456809.

Full text
Abstract:
One of the main purposes of National Statistical Agencies (NSAs) is to provide citizens and researchers with a large amount of trustworthy, high-quality statistical information. NSAs must guarantee that no confidential individual information can be obtained from the released statistical outputs. The discipline of Statistical Disclosure Control (SDC) aims to prevent confidential information from being derived from released data while, at the same time, maintaining the data utility as much as possible. NSAs work with two types of data: microdata and tabular data. Microdata files contain records of individuals or respondents (persons or enterprises) with attributes. For instance, a national census might collect attributes such as age, address, salary, etc. Tabular data contains aggregated information obtained by crossing one or more categorical variables from those microdata files. Several SDC methods are available to ensure that no confidential individual information can be obtained from the released microdata or tabular data. This thesis focuses on tabular data protection, although the research carried out can be applied to other classes of problems. Controlled Tabular Adjustment (CTA) and the Cell Suppression Problem (CSP) have concentrated most of the recent research in the tabular data protection field. Both methods formulate Mixed Integer Linear Programming problems (MILPs) which are challenging for tables of moderate size. Even finding a feasible initial solution may be a challenging task for large instances. Because many end users give priority to fast executions and are thus satisfied, in practice, with suboptimal solutions, as a first result of this thesis we present an improvement of a known and successful heuristic for finding feasible solutions of MILPs, called the feasibility pump. The new approach, based on the computation of analytic centers, is named the Analytic Center Feasibility Pump. The second contribution consists in the application of the fix-and-relax heuristic (FR) to the CTA method. FR (alone or in combination with other heuristics) is shown to be competitive compared to CPLEX branch-and-cut in terms of quickly finding either a feasible solution or a good upper bound. The last contribution of this thesis deals with general Benders decomposition, which is improved with the application of stabilization techniques. A stabilized Benders decomposition is presented, which focuses on finding new solutions in the neighborhood of "good" points. This approach is efficiently applied to the solution of realistic and real-world CSP instances, outperforming alternative approaches. The first two contributions have already been published in indexed journals (Operations Research Letters and Computers and Operations Research). The third contribution is a working paper to be submitted soon.
Un dels principals objectius dels Instituts Nacionals d'Estadística (INEs) és proporcionar, als ciutadans o als investigadors, una gran quantitat de dades estadístiques fiables i precises. Al mateix temps els INEs deuen garantir la confidencialitat estadística i que cap dada personal pot ser obtinguda gràcies a les dades estadístiques disseminades. La disciplina Control de revelació estadística (en anglès Statistical Disclosure Control, SDC) s'ocupa de garantir que cap dada individual pot derivar-se dels outputs de estadístics publicats però intentant al mateix temps mantenir el màxim possible de riquesa de les dades. Els INEs treballen amb dos tipus de dades: microdades i dades tabulars. Les microdades son arxius amb registres individuals de persones o empreses amb un conjunt d'atributs. Per exemple, el censos nacional recull atributs tals com l'edat, sexe, adreça o salari entre d'altres. Les dades tabulars són dades agregades obtingudes a partir del creuament d’un o més atributs o variables categòriques dels fitxers de microdades. Varis mètodes CRE són disponibles per evitar la revelació estadística en fitxers de microdades o dades tabulars. Aquesta tesi es centra en la protecció de dades tabulars tot i que la recerca duta a terme pot ser aplicada també a altres tipus de problemes. Els mètodes CTA (en anglès Controlled Tabular Adjustment) i CSP (en anglès Cell Suppression Problem) ha centrat la major part de la recerca feta en el camp de protecció de dades tabulars. Tots dos mètodes formulen problemes MILP (Mixed Integer Linear Programming problems) difícils de solucionar en taules de mida moderada. Fins i tot trobar solucions inicials factibles pot resultar molt difícil. Donat el fet que molts usuaris finals donen prioritat a tenir solucions ràpides i bones tot i que aquestes no siguin les òptimes, la primera contribució de la tesis presenta una millora en una coneguda i exitosa heurística per trobar solucions factibles de MILPs, anomenada feasibility pump. La nova aproximació, basada en el càlcul de centres analítics, s'anomena Analytic Center Feasibility Pump. La segona contribució consisteix en l'aplicació de la heurística fix-and-relax (FR) al mètode CTA. FR (sol o en combinació amb d'altres heurístiques) es mostra com a competitiu davant CPLEX branch-and-cut en termes de trobar ràpidament solucions factibles o bons upper bounds. La darrera contribució d’aquesta tesi tracta sobre el problema general de descomposició de Benders, aportant una millora amb l'aplicació de tècniques d’estabilització. Presentem un mètode anomenat stabilized Benders decomposition que es centra en trobar noves solucions properes a punts considerats prèviament com a bons. Aquesta aproximació ha estat eficientment aplicada al problema CSP, obtenint molt bons resultats en dades tabulars reals, millorant altres alternatives conegudes del mètode CSP. Les dues primeres contribucions ja han estat publicades en revistes indexades (Operations Research Letters and Computers and Operations Research). Actualment estem treballant en la publicació de la tercera contribució i serà en breu enviada a revisar.
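To give a feel for the heuristic being improved in the first contribution, here is a minimal sketch of the classic feasibility pump for a pure 0-1 feasibility problem (Ax <= b, x binary), using scipy's LP solver. It is the textbook variant, not the Analytic Center Feasibility Pump of the thesis, and the anti-cycling perturbation is greatly simplified.

```python
import numpy as np
from scipy.optimize import linprog

def feasibility_pump(A_ub, b_ub, n_iter=50, seed=0):
    """Alternate LP projection and rounding until an integer-feasible point is found."""
    n = A_ub.shape[1]
    rng = np.random.default_rng(seed)
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n)
    if not res.success:
        return None                                   # LP relaxation already infeasible
    x_lp = res.x
    for _ in range(n_iter):
        x_int = np.round(x_lp)                        # rounding step
        if np.all(A_ub @ x_int <= b_ub + 1e-9):       # integral and feasible: done
            return x_int
        # Projection step: minimise the L1 distance to x_int, which is linear for
        # binary targets: sum_{x_int_j = 0} x_j + sum_{x_int_j = 1} (1 - x_j).
        c = np.where(x_int == 0, 1.0, -1.0)
        res = linprog(c=c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n)
        if not res.success:
            break
        if np.allclose(np.round(res.x), x_int):       # cycling: flip a few coordinates
            flip = rng.choice(n, size=max(1, n // 10), replace=False)
            res.x[flip] = 1.0 - x_int[flip]
        x_lp = res.x
    return None                                       # no integer-feasible point within the budget
```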
10

Karlsson, Anton, and Torbjörn Sjöberg. "Synthesis of Tabular Financial Data using Generative Adversarial Networks." Thesis, KTH, Matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-273633.

Full text
Abstract:
Digitalization has led to tons of available customer data and possibilities for data-driven innovation. However, the data needs to be handled carefully to protect the privacy of the customers. Generative Adversarial Networks (GANs) are a promising recent development in generative modeling. They can be used to create synthetic data which facilitate analysis while ensuring that customer privacy is maintained. Prior research on GANs has shown impressive results on image data. In this thesis, we investigate the viability of using GANs within the financial industry. We investigate two state-of-the-art GAN models for synthesizing tabular data, TGAN and CTGAN, along with a simpler GAN model that we call WGAN. A comprehensive evaluation framework is developed to facilitate comparison of the synthetic datasets. The results indicate that GANs are able to generate quality synthetic datasets that preserve the statistical properties of the underlying data and enable a viable and reproducible subsequent analysis. It was however found that all of the investigated models had problems with reproducing numerical data.
Digitaliseringen har fört med sig stora mängder tillgänglig kunddata och skapat möjligheter för datadriven innovation. För att skydda kundernas integritet måste dock uppgifterna hanteras varsamt. Generativa Motstidande Nätverk (GANs) är en ny lovande utveckling inom generativ modellering. De kan användas till att syntetisera data som underlättar dataanalys samt bevarar kundernas integritet. Tidigare forskning på GANs har visat lovande resultat på bilddata. I det här examensarbetet undersöker vi gångbarheten av GANs inom finansbranchen. Vi undersöker två framstående GANs designade för att syntetisera tabelldata, TGAN och CTGAN, samt en enklare GAN modell som vi kallar för WGAN. Ett omfattande ramverk för att utvärdera syntetiska dataset utvecklas för att möjliggöra jämförelse mellan olika GANs. Resultaten indikerar att GANs klarar av att syntetisera högkvalitativa dataset som bevarar de statistiska egenskaperna hos det underliggande datat, vilket möjliggör en gångbar och reproducerbar efterföljande analys. Alla modellerna som testades uppvisade dock problem med att återskapa numerisk data.
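One ingredient of the kind of evaluation framework described above is checking that the synthetic table preserves the marginal distributions of the real one. The sketch below does this per numerical column with a two-sample Kolmogorov-Smirnov test; the dataframe names in the usage comment are hypothetical.

```python
import pandas as pd
from scipy.stats import ks_2samp

def marginal_fidelity(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Compare each numerical column of a real and a synthetic table with a KS test."""
    rows = []
    for col in real.select_dtypes("number").columns:
        stat, p_value = ks_2samp(real[col].dropna(), synthetic[col].dropna())
        rows.append({"column": col, "ks_statistic": stat, "p_value": p_value})
    return pd.DataFrame(rows).sort_values("ks_statistic")

# Small KS statistics suggest the generator reproduced the marginals well:
# report = marginal_fidelity(real_transactions, generated_transactions)
```

A full framework would of course also compare pairwise dependencies and downstream model performance, as the thesis abstract indicates.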
11

OLIVEIRA, Hugo Santos. "CSVValidation: uma ferramenta para validação de arquivos CSV a partir de metadados." Universidade Federal de Pernambuco, 2015. https://repositorio.ufpe.br/handle/123456789/18413.

Full text
Abstract:
Modelos de dados tabulares têm sido amplamente utilizados para a publicação de dados na Web, devido a sua simplicidade de representação e facilidade de manipulação. Entretanto, nem sempre os dados são dispostos em arquivos tabulares de maneira adequada, o que pode causar dificuldades no momento do processamento dos dados. Dessa forma, o consórcio W3C tem trabalhado em uma proposta de especificação padrão para representação de dados em formatos tabulares. Neste contexto, este trabalho tem como objetivo geral propor uma solução para o problema de validação de arquivos de Dados Tabulares. Estes arquivos, são representados no formato CSV e descritos por metadados, os quais são representados em JSON e definidos de acordo com a especificação proposta pelo W3C. A principal contribuição deste trabalho foi a definição do processo de validação de arquivos de dados tabulares e dos algoritmos necessários para a execução desse processo, além da implementação de um protótipo que tem por objetivo realizar a validação dos dados tabulares, conforme especificado pelo W3C. Outra importante contribuição foi a realização de experimentos com fontes de dados disponíveis na Web, com o objetivo de avaliar a abordagem proposta neste trabalho.
Tabular data models have been widely used for publishing data on the Web because of their simplicity of representation and ease of manipulation. However, in some cases the data is not laid out in tabular files appropriately, which can cause problems during data processing. Thus, the W3C has proposed a standard specification for representing data in tabular formats. In this context, the main objective of this work is to propose a solution to the problem of validating tabular data files, represented as CSV files and described by metadata represented as JSON files, according to the specification proposed by the W3C. The main contribution of this work is the definition of a tabular data file validation process and of the algorithms necessary for its execution, as well as the implementation of a prototype that validates tabular data as specified by the W3C. Another important contribution is the execution of experiments with data sources available on the Web in order to evaluate the approach proposed in this work.
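As a rough illustration of the validation idea, the sketch below checks a CSV file against a simplified, hand-written subset of the W3C "CSV on the Web" (CSVW) metadata vocabulary. The file name, columns and supported datatypes are assumptions; the thesis prototype covers the full specification.

```python
import csv

# Simplified CSVW-style description (subset of the W3C tabular-metadata vocabulary).
metadata = {
    "url": "cities.csv",  # hypothetical file
    "tableSchema": {
        "columns": [
            {"name": "city", "datatype": "string", "required": True},
            {"name": "population", "datatype": "integer", "required": True},
        ]
    },
}

CASTS = {"string": str, "integer": int, "number": float}

def validate(csv_path, meta):
    """Check header names, required values and datatypes; return a list of error messages."""
    cols = meta["tableSchema"]["columns"]
    errors = []
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        expected = [c["name"] for c in cols]
        if reader.fieldnames != expected:
            errors.append(f"header mismatch: {reader.fieldnames} != {expected}")
        for i, row in enumerate(reader, start=2):          # line 1 is the header
            for c in cols:
                value = (row.get(c["name"]) or "").strip()
                if not value:
                    if c.get("required"):
                        errors.append(f"line {i}: missing required value for '{c['name']}'")
                    continue
                try:
                    CASTS[c["datatype"]](value)
                except (KeyError, ValueError):
                    errors.append(f"line {i}: '{value}' is not a valid {c['datatype']} for '{c['name']}'")
    return errors

# print(validate(metadata["url"], metadata))
```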
12

CREMASCHI, MARCO. "ENABLING TABULAR DATA UNDERSTANDING BY HUMANS AND MACHINES THROUGH SEMANTIC INTERPRETATION." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2020. http://hdl.handle.net/10281/263555.

Full text
Abstract:
Esiste un numero significativo di documenti, report e pagine Web – un'analisi riporta 233 milioni di tabelle relazionali nel repository Common Crawl contenente un totale 2,85 miliardi di documenti – che fanno uso di tabelle per fornire informazioni che non possono essere facilmente elaborate dagli umani o capite dai computer. Per risolvere questo problema proponiamo un nuovo approccio che permetterà ai computer di interpretare la semantica di una tabella, e fornirà agli umani una rappresentazione più accessibile dei dati contenuti in essa. Per conseguire questo obiettivo, il problema principale è stato suddiviso in tre sotto-problemi: (i) la definizione di un metodo per fornire un'interpretazione semantica dei dati di una tabella; (ii) la definizione di un modello descrittivo che permetta ai computer di capire e condividere dati di una tabella; e (iii) la definizione di processi, tecniche e algoritmi per generare rappresentazioni dei dati in linguaggio naturale. Per quanto riguarda il sotto-problema (i), la rappresentazione semantica dei dati è stata ottenuta attraverso l'applicazione di tecniche di interpretazione di tabelle (table interpretation), che aiuta gli utenti ad identificare, in una maniera semi-automatica, il significato dei dati di una tabella e le relazioni tra di essi. Queste tecniche considerano in input una tabella e un Knowledge Graph, e restituiscono una rappresentazione RDF – un set di tuple – del contenuto della tabella, facendo riferimento ai concetti e alle proprietà del KG. Questa dissertazione presenta un nuovo approccio che, a partire dai lavori presenti in letteratura, ha portato allo sviluppo di un nuovo strumento, chiamato MantisTable, che effettua automaticamente un'interpretazione semantica completa della tabella. Gli esperimenti condotti hanno mostrato buoni risultati, rispetto alle tecniche e ai tool simili. Il sotto-problema (ii) è stato affrontato con la definizione di nuovi modi di rappresentazione dei dati: è stato definito un nuovo tipo di descrizione che combina la specifica OpenAPI con il linguaggio JSON-LD. I risultati delle tecniche di interpretazione semantica delle tabelle vengono così sfruttati per migliorare un formato già popolare, permettendo il recupero e il processamento dei dati tabellari. Il sotto-problema (iii) è stato affrontato definendo una tecnica di generazione del linguaggio naturale che utilizza una rete neurale per trasformare dati RDF, ottenuti dall'interpretazione delle tabelle, in frasi. Grazie a queste frasi, è possibile creare una rappresentazione testuale del contenuto delle tabelle. Questa è poi estendibile con informazioni aggiuntive provenienti da fonti che possono essere selezionate automaticamente utilizzando l'annotazione semantica.
A significant number of documents, reports and Web pages (one analysis reports 233M relational tables within the Common Crawl repository of 1.81 billion documents) make use of tables to convey information that cannot be easily processed by humans or understood by computers. To address this issue, we propose a new approach that allows computers to interpret the semantics of a table, and provides humans with a more accessible representation of the data contained in a table. To achieve this objective, the general problem has been broken down into three sub-problems: (i) define a method to provide a semantic interpretation of table data; (ii) define a descriptive model that allows computers to understand and share table data; and (iii) define processes, techniques and algorithms to generate natural language representations of the table data. Regarding sub-problem (i), the semantic representation of the data has been obtained through the application of table interpretation techniques, which support users in identifying, in a semi-automatic way, the meaning of the data in the table and the relationships between them. Such techniques take a table and a Knowledge Graph (KG) as input, and deliver as output an RDF representation (a set of tuples) in which the input table is annotated with KG concepts and properties. This thesis presents a new approach, rooted in the existing literature, that lays the foundations for a new tool, called MantisTable, which automatically performs a complete semantic interpretation of a table. The conducted experiments have shown good results compared to similar techniques. Sub-problem (ii) has been tackled by defining new ways of representing data: a new kind of description that combines the OpenAPI specification with JSON-LD. The results of semantic table interpretation techniques are exploited to enhance a popular description format and allow automatic retrieval and processing of table data. Sub-problem (iii) has been addressed by defining a natural language generation technique that uses a neural network to translate RDF data obtained from table interpretation into sentences. Thanks to these sentences, it is possible to create a textual representation of the content of the table, and possibly extend it with additional information from data sources that can be selected automatically using semantic annotations.
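The toy sketch below illustrates only the column-annotation step of semantic table interpretation: cell values are looked up in a (here, hard-coded) knowledge graph and the most frequent entity type is taken as the column concept, from which simple triples are emitted. The miniature KG, prefixes and voting rule are assumptions; real systems such as the one described above use far richer matching and disambiguation.

```python
from collections import Counter

KG = {  # hypothetical entity -> type index
    "Rome": "dbo:City", "Paris": "dbo:City", "Berlin": "dbo:City",
    "Italy": "dbo:Country", "France": "dbo:Country",
}

def annotate_column(cells):
    """Link cells to KG entities and vote for the column's concept."""
    links = {cell: KG.get(cell) for cell in cells}
    types = Counter(t for t in links.values() if t)
    concept = types.most_common(1)[0][0] if types else None
    return concept, links

concept, links = annotate_column(["Rome", "Paris", "Berlin", "Madrid"])
print(concept)  # dbo:City, despite the unmatched cell "Madrid"
triples = [(cell, "rdf:type", concept) for cell, uri in links.items() if uri]
print(triples)
```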
13

Radulovic, Nedeljko. "Post-hoc Explainable AI for Black Box Models on Tabular Data." Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAT028.

Full text
Abstract:
Les modèles d'intelligence artificielle (IA) actuels ont fait leurs preuves dans la résolution de diverses tâches, telles que la classification, la régression, le traitement du langage naturel (NLP) et le traitement d'images. Les ressources dont nous disposons aujourd'hui nous permettent d'entraîner des modèles d'IA très complexes pour résoudre différents problèmes dans presque tous les domaines : médecine, finance, justice, transport, prévisions, etc. Avec la popularité et l'utilisation généralisée des modèles d'IA, la nécessite d'assurer la confiance dans ces modèles s'est également accrue. Aussi complexes soient-ils aujourd'hui, ces modèles d'IA sont impossibles à interpréter et à comprendre par les humains. Dans cette thèse nous nous concentrons sur un domaine de recherche spécifique, à savoir l'intelligence artificielle explicable (xAI), qui vise à fournir des approches permettant d'interpréter les modèles d'IA complexes et d'expliquer leurs décisions. Nous présentons deux approches, STACI et BELLA, qui se concentrent sur les tâches de classification et de régression, respectivement, pour les données tabulaires. Les deux méthodes sont des approches post-hoc agnostiques au modèle déterministe, ce qui signifie qu'elles peuvent être appliquées à n'importe quel modèle boîte noire après sa création. De cette manière, l'interopérabilité présente une valeur ajoutée sans qu'il soit nécessaire de faire des compromis sur les performances du modèle de boîte noire. Nos méthodes fournissent des interprétations précises, simples et générales à la fois de l'ensemble du modèle boîte noire et de ses prédictions individuelles. Nous avons confirmé leur haute performance par des expériences approfondies et étude d'utilisateurs
Current state-of-the-art Artificial Intelligence (AI) models have proven to be very successful in solving various tasks, such as classification, regression, Natural Language Processing (NLP), and image processing. The resources at our disposal today allow us to train very complex AI models to solve problems in almost any field: medicine, finance, justice, transportation, forecasting, etc. With the popularity and widespread use of AI models, the need to ensure trust in them has also grown. Complex as they are today, these AI models are impossible for humans to interpret and understand. In this thesis, we focus on a specific area of research, namely Explainable Artificial Intelligence (xAI), which aims to provide approaches to interpret complex AI models and explain their decisions. We present two approaches, STACI and BELLA, which focus on classification and regression tasks, respectively, for tabular data. Both methods are deterministic, model-agnostic, post-hoc approaches, which means that they can be applied to any black-box model after its creation. In this way, interpretability presents an added value without the need to compromise on the black-box model's performance. Our methods provide accurate, simple and general interpretations of both the whole black-box model and its individual predictions. We confirmed their high performance through extensive experiments and a user study.
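To make the notion of a post-hoc, model-agnostic explanation concrete, the sketch below fits a sparse linear surrogate to a black-box model in the neighborhood of a single instance and uses its coefficients as feature attributions. This is a generic local-surrogate sketch in the spirit of such approaches, not the STACI or BELLA algorithm itself; the perturbation scale and regularization strength are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def local_explanation(black_box_predict, x, n_samples=500, scale=0.1, alpha=0.01, seed=0):
    """Explain one prediction of a black box on tabular features.
    black_box_predict: callable taking a 2-D array and returning 1-D predictions."""
    rng = np.random.default_rng(seed)
    X_local = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))  # neighborhood of x
    y_local = black_box_predict(X_local)                                # query the black box
    surrogate = Lasso(alpha=alpha).fit(X_local, y_local)                # sparse linear surrogate
    return surrogate.coef_, surrogate.intercept_                        # feature attributions

# Usage with any fitted regressor `model` and a test row `X_test[0]` (hypothetical names):
# coefs, bias = local_explanation(model.predict, X_test[0])
```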
14

Alexandersson, Calle. "An evaluation of HTML5 components for web-based manipulation of tabular data." Thesis, Umeå universitet, Institutionen för datavetenskap, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-108274.

Full text
Abstract:
HTML5 is a promising technology that is on its way to becoming a standard for the web. Companies that have built their web application components using plugins now have to move to an entirely new JavaScript environment. One such component is the data grid, or table, which is the focus of this report. In this report I propose evaluation criteria for tabular components in JavaScript frameworks. Using these criteria, grid components in some of the market-leading frameworks are evaluated. Furthermore, for one of these frameworks I present a test implementation and a performance test focusing on load time with and without UI virtualization.
15

Troisemaine, Colin. "Novel class discovery in tabular data : an application to network fault diagnosis." Electronic Thesis or Diss., Ecole nationale supérieure Mines-Télécom Atlantique Bretagne Pays de la Loire, 2024. http://www.theses.fr/2024IMTA0422.

Full text
Abstract:
Cette thèse porte sur la découverte de nouvelles classes dans le contexte de données tabulaires. Le problème de Novel Class Discovery (NCD) consiste à extraire d’un ensemble étiqueté de classes déjà connues des connaissances qui permettront de partitionner plus précisément un ensemble non étiqueté de nouvelles classes. Bien que le NCD ait récemment fait l’objet d’une grande attention de la part de la communauté, il est généralement résolu sur des problèmes de vision par ordinateur et parfois dans des conditions irréalistes. En particulier, le nombre de nouvelles classes est souvent supposé étant connu à l’avance, et leurs étiquettes sont parfois utilisées pour ajuster les hyperparamètres. Les méthodes qui reposent sur ces hypothèses ne sont pas applicables aux scénarios du monde réel. C’est pourquoi dans cette thèse nous nous concentrons sur la résolution de découverte dans les données tabulaires lorsqu’aucune connaissance a priori n’est disponible. Les méthodes développées au cours de la thèse sont appliquées à un cas réel : le diagnostic automatique des pannes dans les réseaux de télécommunication, spécifiquement les réseaux d’accès à fibre optique. Le but est d’avoir une gestion efficace des pannes, en particulier au stade du diagnostic lorsque des pannes inconnues (nouvelles classes) peuvent apparaitre
This thesis focuses on Novel Class Discovery (NCD) in the context of tabular data. The Novel Class Discovery problem consists in extracting knowledge from a labeled set of already known classes in order to more accurately partition an unlabeled set of new classes. Although NCD has recently received a lot of attention from the community, it is generally addressed in computer vision problems and sometimes under unrealistic conditions. In particular, the number of novel classes is often assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods based on these assumptions are not applicable to real-world scenarios. Thus, in this thesis we focus on novel class discovery in tabular data when no a priori knowledge is available. The methods developed in the thesis are applied to a real-world case: automatic fault diagnosis in telecommunication networks, with a focus on fiber optic access networks. The aim is to achieve efficient fault management, particularly at the diagnosis stage, when unknown faults (new classes) may appear.
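The sketch below conveys the NCD setting on tabular data with a deliberately simple baseline: a discriminative projection is learned on the labeled known classes and the unlabeled set is then clustered in that space, with the number of novel classes chosen by silhouette score rather than assumed known. It is a toy baseline under these assumptions, not the method developed in the thesis.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def discover_novel_classes(X_known, y_known, X_novel, max_k=10):
    """Transfer knowledge from the known classes, then cluster the unlabeled novel set."""
    lda = LinearDiscriminantAnalysis().fit(X_known, y_known)
    Z = lda.transform(X_novel)                 # representation shaped by the known classes
    best_k, best_score, best_labels = None, -1.0, None
    for k in range(2, max_k + 1):              # number of novel classes is not known a priori
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)
        score = silhouette_score(Z, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```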
16

PANFILO, DANIELE. "Generating Privacy-Compliant, Utility-Preserving Synthetic Tabular and Relational Datasets Through Deep Learning." Doctoral thesis, Università degli Studi di Trieste, 2022. http://hdl.handle.net/11368/3030920.

Full text
Abstract:
Due tendenze hanno rapidamente ridefinito il panorama dell'intelligenza artificiale (IA) negli ultimi decenni. La prima è il rapido sviluppo tecnologico che rende possibile un'intelligenza artificiale sempre più sofisticata. Dal punto di vista dell'hardware, ciò include una maggiore potenza di calcolo ed una sempre crescente efficienza di archiviazione dei dati. Da un punto di vista concettuale e algoritmico, campi come l'apprendimento automatico hanno subito un'impennata e le sinergie tra l'IA e le altre discipline hanno portato a sviluppi considerevoli. La seconda tendenza è la crescente consapevolezza della società nei confronti dell'IA. Mentre le istituzioni sono sempre più consapevoli di dover adottare la tecnologia dell'IA per rimanere competitive, questioni come la privacy dei dati e la possibilità di spiegare il funzionamento dei modelli di apprendimento automatico sono diventate parte del dibattito pubblico. L'insieme di questi sviluppi genera però una sfida: l'IA può migliorare tutti gli aspetti della nostra vita, dall'assistenza sanitaria alla politica ambientale, fino alle opportunità commerciali, ma poterla sfruttare adeguatamente richiede l'uso di dati sensibili. Purtroppo, le tecniche di anonimizzazione tradizionali non forniscono una soluzione affidabile a suddetta sfida. Non solo non sono sufficienti a proteggere i dati personali, ma ne riducono anche il valore analitico a causa delle inevitabili distorsioni apportate ai dati. Tuttavia, lo studio emergente dei modelli generativi ad apprendimento profondo (MGAP) può costituire un'alternativa più raffinata all'anonimizzazione tradizionale. Originariamente concepiti per l'elaborazione delle immagini, questi modelli catturano le distribuzioni di probabilità sottostanti agli insiemi di dati. Tali distribuzioni possono essere successivamente campionate, fornendo nuovi campioni di dati, non presenti nel set di dati originale. Tuttavia, la distribuzione complessiva degli insiemi di dati sintetici, costituiti da dati campionati in questo modo, è equivalente a quella del set dei dati originali. In questa tesi, verrà analizzato l'uso dei MGAP come tecnologia abilitante per una più ampia adozione dell'IA. A tal scopo, verrà ripercorsa prima di tutto la legislazione sulla privacy dei dati, con particolare attenzione a quella relativa all'Unione Europea. Nel farlo, forniremo anche una panoramica delle tecnologie tradizionali di anonimizzazione dei dati. Successivamente, verrà fornita un'introduzione all'IA e al deep-learning. Per illustrare i meriti di questo campo, vengono discussi due casi di studio: uno relativo alla segmentazione delle immagini ed uno reltivo alla diagnosi del cancro. Si introducono poi i MGAP, con particolare attenzione agli autoencoder variazionali. L'applicazione di questi metodi ai dati tabellari e relazionali costituisce una utile innovazione in questo campo che comporta l’introduzione di tecniche innovative di pre-elaborazione. Infine, verrà valutata la metodologia sviluppata attraverso esperimenti riproducibili, considerando sia l'utilità analitica che il grado di protezione della privacy attraverso metriche statistiche.
Two trends have rapidly been redefining the artificial intelligence (AI) landscape over the past several decades. The first of these is the rapid technological developments that make increasingly sophisticated AI feasible. From a hardware point of view, this includes increased computational power and efficient data storage. From a conceptual and algorithmic viewpoint, fields such as machine learning have undergone a surge and synergies between AI and other disciplines have resulted in considerable developments. The second trend is the growing societal awareness around AI. While institutions are becoming increasingly aware that they have to adopt AI technology to stay competitive, issues such as data privacy and explainability have become part of public discourse. Combined, these developments result in a conundrum: AI can improve all aspects of our lives, from healthcare to environmental policy to business opportunities, but invoking it requires the use of sensitive data. Unfortunately, traditional anonymization techniques do not provide a reliable solution to this conundrum. They are insufficient in protecting personal data, but also reduce the analytic value of data through distortion. However, the emerging study of deep-learning generative models (DLGM) may form a more refined alternative to traditional anonymization. Originally conceived for image processing, these models capture probability distributions underlying datasets. Such distributions can subsequently be sampled, giving new data points not present in the original dataset. However, the overall distribution of synthetic datasets, consisting of data sampled in this manner, is equivalent to that of the original dataset. In our research activity, we study the use of DLGM as an enabling technology for wider AI adoption. To do so, we first study legislation around data privacy with an emphasis on the European Union. In doing so, we also provide an outline of traditional data anonymization technology. We then provide an introduction to AI and deep-learning. Two case studies are discussed to illustrate the field’s merits, namely image segmentation and cancer diagnosis. We then introduce DLGM, with an emphasis on variational autoencoders. The application of such methods to tabular and relational data is novel and involves innovative preprocessing techniques. Finally, we assess the developed methodology in reproducible experiments, evaluating both the analytic utility and the degree of privacy protection through statistical metrics.
17

Hedbrant, Per. "Towards a fully automated extraction and interpretation of tabular data using machine learning." Thesis, Uppsala universitet, Avdelningen för systemteknik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-391490.

Full text
Abstract:
Motivation: A challenge for researchers at CBCS is the ability to efficiently manage the different data formats that frequently change. This handling includes importing data into a common format, regardless of the output of the various instruments used. There are commercial solutions available for this process, but to our knowledge all of them require prior generation of templates to which data must conform. A significant amount of time is spent on manual pre-processing, converting from one format to another, and there are currently no solutions that use pattern recognition to locate and automatically recognise data structures in a spreadsheet.
Problem definition: The desired solution is to build a self-learning Software-as-a-Service (SaaS) for automated recognition and loading of data stored in arbitrary formats. The aim of this study is three-fold: A) investigate whether unsupervised machine learning methods can be used to label different types of cells in spreadsheets; B) investigate whether a hypothesis-generating algorithm can be used to label different types of cells in spreadsheets; C) advise on choices of architecture and technologies for the SaaS solution.
Method: A pre-processing framework is built that can read and pre-process any type of spreadsheet into a feature matrix. Different datasets are read and clustered, and the usefulness of reducing the dimensionality is investigated. A hypothesis-driven algorithm is built and adapted to two of the data formats CBCS uses most frequently. Choices of architecture and technologies for the SaaS solution are discussed, including system design patterns, web development framework and database.
Result: The reading and pre-processing framework is in itself a valuable result, due to its general applicability. No satisfying results are found when using the mini-batch K-means clustering method. When only reading data from one format, the dimensionality can be reduced from 542 to around 40 dimensions. The hypothesis-driven algorithm can consistently interpret the format it is designed for; more work is needed to make it more general.
Implication: The study contributes to the desired solution in the short term through the hypothesis-generating algorithm, and in a more generalisable way through the unsupervised learning approach. The study also contributes by initiating a conversation around the system design choices.
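The sketch below illustrates the unsupervised branch of the study: every non-empty spreadsheet cell is described by a small feature vector and the cells are clustered with mini-batch K-means, in the hope that clusters align with roles such as headers, numeric readouts or metadata. The feature choices, cluster count and file name are illustrative assumptions, not the thesis's actual feature matrix.

```python
import numpy as np
from openpyxl import load_workbook
from sklearn.cluster import MiniBatchKMeans

def cell_features(ws):
    """Turn a worksheet into a feature matrix, one row per non-empty cell."""
    feats, positions = [], []
    for row in ws.iter_rows():
        for cell in row:
            if cell.value is None:
                continue
            text = str(cell.value)
            feats.append([
                cell.row, cell.column,                          # position in the sheet
                float(isinstance(cell.value, (int, float))),    # numeric vs. textual
                len(text),
                float(text.isupper()),                          # headers are often upper-case
            ])
            positions.append(cell.coordinate)
    return np.array(feats), positions

# wb = load_workbook("plate_readout.xlsx")        # hypothetical instrument export
# X, cells = cell_features(wb.active)
# labels = MiniBatchKMeans(n_clusters=4, n_init=10).fit_predict(X)
```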
18

Miles, David B. L. "A User-Centric Tabular Multi-Column Sorting Interface For Intact Transposition Of Columnar Data." Diss., CLICK HERE for online access, 2006. http://contentdm.lib.byu.edu/ETD/image/etd1160.pdf.

Full text
19

Liu, Jixiong. "Semantic Annotations for Tabular Data Using Embeddings : Application to Datasets Indexing and Table Augmentation." Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS529.

Full text
Abstract:
Avec le développement de l'Open Data, un grand nombre de sources de données sont mises à disposition des communautés (notamment les data scientists et les data analysts). Ces données constituent des sources importantes pour les services numériques sous réserve que les données soient nettoyées, non biaisées, et combinées à une sémantique explicite et compréhensible par les algorithmes afin de favoriser leur exploitation. En particulier, les sources de données structurées (CSV, JSON, XML, etc.) constituent la matière première de nombreux processus de science des données. Cependant, ces données proviennent de différents domaines pour lesquels l'expertise des consommateurs des données peut être limitée (knowledge gap). Ainsi, l'appropriation des données, étape critique pour la création de modèles d'apprentissage automatique de qualité, peut être complexe.Les modèles sémantiques (en particulier, les ontologies) permettent de représenter explicitement le sens des données en spécifiant les concepts et les relations présents dans les données. L'association d'étiquettes sémantiques aux ensembles de données facilite la compréhension et la réutilisation des données en fournissant une documentation sur les données qui peut être facilement utilisée par un non-expert. De plus, l'annotation sémantique ouvre la voie à des modes de recherche qui vont au-delà de simples mots-clés et permettent l'expression de requêtes d'un haut niveau conceptuel sur le contenu des jeux de données mais aussi leur structure tout en surmontant les problèmes d'hétérogénéité syntaxique rencontrés dans les données tabulaires. Cette thèse introduit un pipeline complet pour l'extraction, l'interprétation et les applications de tableaux de données à l'aide de graphes de connaissances. Nous rappelons tout d'abord la définition des tableaux du point de vue de leur interprétation et nous développons des systèmes de collecte et d'extraction de tableaux sur le Web et dans des fichiers locaux. Nous proposons ensuite trois systèmes d'interprétation de tableaux basés sur des règles heuristiques ou sur des modèles de représentation de graphes, afin de relever les défis observés dans la littérature. Enfin, nous présentons et évaluons deux applications d'augmentation des tables tirant parti des annotations sémantiques produites: l'imputation des données et l'augmentation des schémas
With the development of Open Data, a large number of data sources are made available to communities (including data scientists and data analysts). This data is a valuable resource for digital services, provided it is cleaned, unbiased, and combined with explicit, machine-processable semantics in order to foster exploitation. In particular, structured data sources (CSV, JSON, XML, etc.) are the raw material for many data science processes. However, this data comes from different domains with which consumers are not always familiar (a knowledge gap), which complicates its appropriation, even though this is a critical step in creating machine learning models. Semantic models (in particular, ontologies) make it possible to explicitly represent the implicit meaning of data by specifying the concepts and relationships present in the data. Providing semantic labels on datasets facilitates the understanding and reuse of data by supplying documentation that can easily be used by a non-expert. Moreover, semantic annotation opens the way to search modes that go beyond simple keywords and allow queries at a high conceptual level over both the content and the structure of datasets, while overcoming the problems of syntactic heterogeneity encountered in tabular data. This thesis introduces a complete pipeline for the extraction, interpretation, and applications of tables in the wild with the help of knowledge graphs. We first revisit the existing definition of tables from the perspective of table interpretation and develop systems for collecting and extracting tables from the Web and from local files. Three table interpretation systems are then proposed, based on either heuristic rules or graph representation models, to address the challenges observed in the literature. Finally, we introduce and evaluate two table augmentation applications that leverage the produced semantic annotations, namely data imputation and schema augmentation.
APA, Harvard, Vancouver, ISO, and other styles
20

Ayad, Célia. "Towards Reliable Post Hoc Explanations for Machine Learning on Tabular Data and their Applications." Electronic Thesis or Diss., Institut polytechnique de Paris, 2024. http://www.theses.fr/2024IPPAX082.

Full text
Abstract:
Alors que l’apprentissage automatique continue de démontrer de solides capacités prédictives, il est devenu un outil très précieux dans plusieurs domaines scientifiques et industriels. Cependant, à mesure que les modèles ML évoluent pour atteindre une plus grande précision, ils deviennent également de plus en plus complexes et nécessitent davantage de paramètres.Être capable de comprendre les complexités internes et d’établir une confiance dans les prédictions de ces modèles d’apprentissage automatique est donc devenu essentiel dans divers domaines critiques, notamment la santé et la finance.Les chercheurs ont développé des méthodes d’explication pour rendre les modèles d’apprentissage automatique plus transparents, aidant ainsi les utilisateurs à comprendre pourquoi les prédictions sont faites. Cependant, ces méthodes d’explication ne parviennent souvent pas à expliquer avec précision les prédictions des modèles, ce qui rend difficile leur utilisation efficace par les experts du domaine. Il est crucial d'identifier les lacunes des explications du ML, d'améliorer leur fiabilité et de les rendre plus conviviales. De plus, alors que de nombreuses tâches de ML sont de plus en plus gourmandes en données et que la demande d'intégration généralisée augmente, il existe un besoin pour des méthodes offrant de solides performances prédictives de manière plus simple et plus rentable.Dans cette thèse, nous abordons ces problèmes dans deux axes de recherche principaux:1) Nous proposons une méthodologie pour évaluer diverses méthodes d'explicabilité dans le contexte de propriétés de données spécifiques, telles que les niveaux de bruit, les corrélations de caractéristiques et le déséquilibre de classes, et proposons des conseils aux praticiens et aux chercheurs pour sélectionner la méthode d'explicabilité la plus appropriée en fonction des caractéristiques de leurs ensembles de données, révélant où ces méthodes excellent ou échouent.De plus, nous fournissons aux cliniciens des explications personnalisées sur les facteurs de risque du cancer du col de l’utérus en fonction de leurs propriétés souhaitées telles que la facilité de compréhension, la cohérence et la stabilité.2) Nous introduisons Shapley Chains, une nouvelle technique d'explication conçue pour surmonter le manque d'explications conçues pour les cas à sorties multiples où les étiquettes sont interdépendantes, où les caractéristiques peuvent avoir des contributions indirectes pour prédire les étiquettes ultérieures dans la chaîne (l'ordre dans lequel ces étiquettes sont prédit). De plus, nous proposons Bayes LIME Chains pour améliorer la robustesse de Shapley Chains
As machine learning continues to demonstrate robust predictive capabilities, it has emerged as a very valuable tool in several scientific and industrial domains. However, as ML models evolve to achieve higher accuracy, they also become increasingly complex and require more parameters. Being able to understand the inner complexities and to establish trust in the predictions of these machine learning models has therefore become essential in various critical domains including healthcare and finance. Researchers have developed explanation methods to make machine learning models more transparent, helping users understand why predictions are made. However, these explanation methods often fall short in accurately explaining model predictions, making it difficult for domain experts to utilize them effectively. It is crucial to identify the shortcomings of ML explanations, enhance their reliability, and make them more user-friendly. Additionally, with many ML tasks becoming more data-intensive and the demand for widespread integration rising, there is a need for methods that deliver strong predictive performance in a simpler and more cost-effective manner. In this dissertation, we address these problems in two main research thrusts: 1) We propose a methodology to evaluate various explainability methods in the context of specific data properties, such as noise levels, feature correlations, and class imbalance, and offer guidance for practitioners and researchers on selecting the most suitable explainability method based on the characteristics of their datasets, revealing where these methods excel or fail. Additionally, we provide clinicians with personalized explanations of cervical cancer risk factors based on their desired properties such as ease of understanding, consistency, and stability. 2) We introduce Shapley Chains, a new explanation technique designed to overcome the lack of explanations of multi-output predictions in the case of interdependent labels, where features may have indirect contributions to predict subsequent labels in the chain (i.e. the order in which these labels are predicted). Moreover, we propose Bayes LIME Chains to enhance the robustness of Shapley Chains.
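As a concrete point of reference for the kind of post hoc explanation discussed above, the following sketch computes Shapley-value attributions for a tabular classifier with the shap library; the dataset and model are stand-ins, and this is not the Shapley Chains or Bayes LIME Chains method proposed in the thesis.

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Stand-in tabular task and model (the thesis works with other data and models).
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)               # Shapley-value estimator for tree models
shap_values = explainer.shap_values(X.iloc[:100])   # per-feature contribution for each row
shap.summary_plot(shap_values, X.iloc[:100])        # global view of which features drive predictions
```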
APA, Harvard, Vancouver, ISO, and other styles
21

Bandyopadhyay, Bortik. "Querying Structured Data via Informative Representations." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1595447189545086.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Braunschweig, Katrin [Verfasser], Wolfgang [Akademischer Betreuer] Lehner, and Stefan [Akademischer Betreuer] Conrad. "Recovering the Semantics of Tabular Web Data / Katrin Braunschweig. Betreuer: Wolfgang Lehner. Gutachter: Wolfgang Lehner ; Stefan Conrad." Dresden : Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2015. http://d-nb.info/1078205256/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Kanerva, Anton, and Fredrik Helgesson. "On the Use of Model-Agnostic Interpretation Methods as Defense Against Adversarial Input Attacks on Tabular Data." Thesis, Blekinge Tekniska Högskola, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-20085.

Full text
Abstract:
Context. Machine learning is a constantly developing subfield within the artificial intelligence field. The number of domains in which we deploy machine learning models is constantly growing, and the systems using these models spread almost unnoticeably into our daily lives through different devices. In previous years, a great deal of time and effort has been put into increasing the performance of these models, overshadowing the significant risks of attacks targeting the very core of the systems, the trained machine learning models themselves. A specific attack with the aim of fooling the decision-making of a model, called the adversarial input attack, has almost exclusively been researched for models processing image data. However, the threat of adversarial input attacks stretches beyond systems using image data, to e.g. the tabular domain, which is the most common data domain used in industry. Methods used for interpreting complex machine learning models can help humans understand the behavior and predictions of these systems. Understanding the behavior of a model is an important component in detecting, understanding and mitigating vulnerabilities of the model. Objectives. This study aims to reduce the research gap concerning adversarial input attacks and defenses targeting machine learning models in the tabular data domain. The goal of this study is to analyze how model-agnostic interpretation methods can be used in order to mitigate and detect adversarial input attacks on tabular data. Methods. The goal is reached by conducting three consecutive experiments where model interpretation methods are analyzed and adversarial input attacks are evaluated as well as visualized in terms of perceptibility. Additionally, a novel method for adversarial input attack detection based on model interpretation is proposed together with a novel way of defensively using feature selection to reduce the attack vector size. Results. The adversarial input attack detection showed state-of-the-art results with an accuracy of over 86%. The proposed feature selection-based mitigation technique was successful in hardening the model against adversarial input attacks by reducing their scores by 33% without decreasing the performance of the model. Conclusions. This study contributes satisfactory and useful methods for adversarial input attack detection and mitigation, as well as methods for evaluating and visualizing the imperceptibility of attacks on tabular data.
Kontext. Maskininlärning är ett område inom artificiell intelligens som är under konstant utveckling. Mängden domäner som vi sprider maskininlärningsmodeller i växer sig allt större och systemen sprider sig obemärkt nära inpå våra dagliga liv genom olika elektroniska enheter. Genom åren har mycket tid och arbete lagts på att öka dessa modellers prestanda vilket har överskuggat risken för sårbarheter i systemens kärna, den tränade modellen. En relativt ny attack, kallad "adversarial input attack", med målet att lura modellen till felaktiga beslutstaganden har nästan uteslutande forskats på inom bildigenkänning. Men, hotet som adversarial input-attacker utgör sträcker sig utom ramarna för bilddata till andra datadomäner som den tabulära domänen vilken är den vanligaste datadomänen inom industrin. Metoder för att tolka komplexa maskininlärningsmodeller kan hjälpa människor att förstå beteendet hos dessa komplexa maskininlärningssystem samt de beslut som de tar. Att förstå en modells beteende är en viktig komponent för att upptäcka, förstå och mitigera sårbarheter hos modellen. Syfte. Den här studien försöker reducera det forskningsgap som adversarial input-attacker och motsvarande försvarsmetoder i den tabulära domänen utgör. Målet med denna studie är att analysera hur modelloberoende tolkningsmetoder kan användas för att mitigera och detektera adversarial input-attacker mot tabulär data. Metod. Det uppsatta målet nås genom tre på varandra följande experiment där modelltolkningsmetoder analyseras, adversarial input-attacker utvärderas och visualiseras samt där en ny metod baserad på modelltolkning föreslås för detektion av adversarial input-attacker tillsammans med en ny mitigeringsteknik där feature selection används defensivt för att minska attackvektorns storlek. Resultat. Den föreslagna metoden för detektering av adversarial input-attacker visar state-of-the-art-resultat med över 86% träffsäkerhet. Den föreslagna mitigeringstekniken visades framgångsrik i att härda modellen mot adversarial input attacker genom att minska deras attackstyrka med 33% utan att degradera modellens klassifieringsprestanda. Slutsats. Denna studie bidrar med användbara metoder för detektering och mitigering av adversarial input-attacker såväl som metoder för att utvärdera och visualisera svårt förnimbara attacker mot tabulär data.
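A minimal sketch of the general idea in the abstract above, using feature attributions as a detection signal: inputs whose explanation vectors lie unusually far from those computed on clean reference data are flagged. The Mahalanobis distance and the threshold are assumptions for this sketch, not the detector proposed in the thesis.

```python
import numpy as np

def fit_reference(attributions: np.ndarray):
    """attributions: (n_samples, n_features) explanation vectors computed on clean data."""
    mean = attributions.mean(axis=0)
    cov = np.cov(attributions, rowvar=False) + 1e-6 * np.eye(attributions.shape[1])
    return mean, np.linalg.inv(cov)

def attribution_distance(x_attr: np.ndarray, mean: np.ndarray, inv_cov: np.ndarray) -> float:
    d = x_attr - mean
    return float(np.sqrt(d @ inv_cov @ d))   # Mahalanobis distance in explanation space

def looks_adversarial(x_attr, mean, inv_cov, threshold: float = 3.0) -> bool:
    # threshold is an assumed cut-off; in practice it would be tuned on validation data
    return attribution_distance(x_attr, mean, inv_cov) > threshold

# Usage: mean, inv_cov = fit_reference(clean_attributions); looks_adversarial(new_attr, mean, inv_cov)
```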
APA, Harvard, Vancouver, ISO, and other styles
24

Mosquera, Evinton Antonio Cordoba. "Uma nova metáfora visual escalável para dados tabulares e sua aplicação na análise de agrupamentos." Universidade de São Paulo, 2017. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-07022018-082548/.

Full text
Abstract:
A rápida evolução dos recursos computacionais vem permitindo que grandes conjuntos de dados sejam armazenados e recuperados. No entanto, a exploração, compreensão e extração de informação útil ainda são um desafio. Com relação às ferramentas computacionais que visam tratar desse problema, a Visualização de Informação possibilita a análise de conjuntos de dados por meio de representações gráficas e a Mineração de Dados fornece processos automáticos para a descoberta e interpretação de padrões. Apesar da recente popularidade dos métodos de visualização de informação, um problema recorrente é a baixa escalabilidade visual quando se está analisando grandes conjuntos de dados, resultando em perda de contexto e desordem visual. Com intuito de representar grandes conjuntos de dados reduzindo a perda de informação relevante, o processo de agregação visual de dados vem sendo empregado. A agregação diminui a quantidade de dados a serem representados, preservando a distribuição e as tendências do conjunto de dados original. Quanto à mineração de dados, visualização de informação vêm se tornando ferramental essencial na interpretação dos modelos computacionais e resultados gerados, em especial das técnicas não-supervisionados, como as de agrupamento. Isso porque nessas técnicas, a única forma do usuário interagir com o processo de mineração é por meio de parametrização, limitando a inserção de conhecimento de domínio no processo de análise de dados. Nesta dissertação, propomos e desenvolvemos uma metáfora visual baseada na TableLens que emprega abordagens baseadas no conceito de agregação para criar representações mais escaláveis para a interpretação de dados tabulares. Como aplicação, empregamos a metáfora desenvolvida na análise de resultados de técnicas de agrupamento. O ferramental resultante não somente suporta análise de grandes bases de dados com reduzida perda de contexto, mas também fornece subsídios para entender como os atributos dos dados contribuem para a formação de agrupamentos em termos da coesão e separação dos grupos formados.
The rapid evolution of computing resources has enabled large datasets to be stored and retrieved. However, exploring, understanding and extracting useful information from them is still a challenge. Among the computational tools that address this problem, information visualization techniques support data analysis by exploiting human visual ability through graphic representations of the data set, and data mining provides automatic processes for the discovery and interpretation of patterns. Despite the recent popularity of information visualization methods, a recurring problem is low visual scalability when analyzing large data sets, resulting in loss of context and visual clutter. To represent large datasets while reducing the loss of relevant information, data aggregation is employed. Aggregation decreases the amount of data to be represented, preserving the distribution and trends of the original dataset. Regarding data mining, information visualization has become an essential tool in the interpretation of computational models and generated results, especially for unsupervised techniques such as clustering. This is because, in these techniques, the only way the user interacts with the mining process is through parameterization, limiting the insertion of domain knowledge into the process. In this thesis, we propose and develop a new visual metaphor based on the TableLens that employs aggregation-based approaches to create more scalable representations of tabular data. As an application, we use the developed metaphor in the analysis of the results of clustering techniques. The resulting framework not only supports the analysis of large databases with reduced loss of context, but also provides insights into how data attributes contribute to cluster formation in terms of the cohesion and separation of the resulting groups.
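A small sketch of the aggregation idea described above: bucket the sorted rows of a large table so it can be drawn with far fewer visual marks while roughly preserving per-column distributions and trends. The bucket count and column handling are illustrative assumptions, not the metaphor implemented in the thesis.

```python
import numpy as np
import pandas as pd

def aggregate_for_display(df: pd.DataFrame, sort_by: str, n_buckets: int = 50) -> pd.DataFrame:
    """Sort by one column, split the rows into equal-size buckets, and summarise each bucket."""
    df = df.sort_values(sort_by).reset_index(drop=True)
    summaries = []
    for bucket in np.array_split(df.index, n_buckets):
        chunk = df.loc[bucket]
        row = {f"{col}_mean": chunk[col].mean() for col in df.select_dtypes("number").columns}
        row["rows"] = len(chunk)
        summaries.append(row)
    return pd.DataFrame(summaries)   # at most n_buckets rows, whatever the original size

# Example: aggregate_for_display(big_table, sort_by="age", n_buckets=100)
```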
APA, Harvard, Vancouver, ISO, and other styles
25

Janga, Prudhvi. "Integration of Heterogeneous Web-based Information into a Uniform Web-based Presentation." University of Cincinnati / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1397467105.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Višnja, Ognjenović. "Aproksimativna diskretizacija tabelarno organizovanih podataka." Phd thesis, Univerzitet u Novom Sadu, Tehnički fakultet Mihajlo Pupin u Zrenjaninu, 2016. https://www.cris.uns.ac.rs/record.jsf?recordId=101259&source=NDLTD&language=en.

Full text
Abstract:
Disertacija se bavi analizom uticaja raspodela podataka na rezultate algoritama diskretizacije u okviru procesa mašinskog učenja. Na osnovu izabranih baza i algoritama diskretizacije teorije grubih skupova i stabala odlučivanja, istražen je uticaj odnosa raspodela podataka i tačaka reza određene diskretizacije.Praćena je promena konzistentnosti diskretizovane tabele u zavisnosti od položaja redukovane tačke reza na histogramu. Definisane su fiksne tačke reza u zavisnosti od segmentacije multimodal raspodele, na osnovu kojih je moguće raditi redukciju preostalih tačaka reza. Za određivanje fiksnih tačaka konstruisan je algoritam FixedPoints koji ih određuje u skladu sa grubom segmentacijom multimodal raspodele.Konstruisan je algoritam aproksimativne diskretizacije APPROX MD za redukciju tačaka reza, koji koristi tačke reza dobijene algoritmom maksimalne razberivosti i parametre vezane za procenat nepreciznih pravila, ukupni procenat klasifikacije i broj tačaka redukcije. Algoritam je kompariran u odnosu na algoritam maksimalne razberivosti i u odnosu na algoritam maksimalne razberivosti sa aproksimativnim rešenjima za α=0,95.
This dissertation analyses the influence of data distribution on the results of discretization algorithms within the process of machine learning. Based on the chosen databases and the discretization algorithms within rough set theory and decision trees, the influence of the relation between data distributions and the cut points of a given discretization has been researched. Changes in the consistency of a discretized table, depending on the position of the reduced cut on the histogram, have been monitored. Fixed cuts have been defined, depending on the segmentation of the multimodal distribution, on the basis of which it is possible to reduce the remaining cuts. To determine the fixed cuts, an algorithm called FixedPoints has been constructed, determining these points in accordance with the rough segmentation of the multimodal distribution. An algorithm for approximate discretization, APPROX MD, has been constructed for cut reduction, using cuts obtained through the maximum discernibility (MD-Heuristic) algorithm and parameters related to the percentage of imprecise rules, the total classification percentage and the number of reduced cuts. The algorithm has been compared to the MD algorithm and to the MD algorithm with approximate solutions for α=0.95.
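As a rough illustration of anchoring cuts to the shape of the data distribution, the sketch below takes local minima ("valleys") of a histogram as candidate fixed cut points between modes; the bin count and the valley rule are assumptions, and this is not the FixedPoints or APPROX MD algorithm from the dissertation.

```python
import numpy as np

def valley_cut_points(values: np.ndarray, bins: int = 30) -> list:
    """Candidate cut points at local minima ("valleys") of the value histogram."""
    counts, edges = np.histogram(values, bins=bins)
    cuts = []
    for i in range(1, len(counts) - 1):
        if counts[i] < counts[i - 1] and counts[i] <= counts[i + 1]:
            cuts.append(float((edges[i] + edges[i + 1]) / 2))   # centre of the valley bin
    return cuts

# Example: a bimodal sample should yield a cut point near the gap between the two modes.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 1, 500)])
print(valley_cut_points(sample))
```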
APA, Harvard, Vancouver, ISO, and other styles
27

Da, Silva Carvalho Paulo. "Plateforme visuelle pour l'intégration de données faiblement structurées et incertaines." Thesis, Tours, 2017. http://www.theses.fr/2017TOUR4020/document.

Full text
Abstract:
Nous entendons beaucoup parler de Big Data, Open Data, Social Data, Scientific Data, etc. L’importance qui est apportée aux données en général est très élevée. L’analyse de ces données est importante si l’objectif est de réussir à en extraire de la valeur pour pouvoir les utiliser. Les travaux présentés dans cette thèse concernent la compréhension, l’évaluation, la correction/modification, la gestion et finalement l’intégration de données, pour permettre leur exploitation. Notre recherche étudie exclusivement les données ouvertes (DOs - Open Data) et plus précisément celles structurées sous format tabulaire (CSV). Le terme Open Data est apparu pour la première fois en 1995. Il a été utilisé par le groupe GCDIS (Global Change Data and Information System) (États-Unis) pour encourager les entités, possédant les mêmes intérêts et préoccupations, à partager leurs données [Data et System, 1995]. Le mouvement des données ouvertes étant récent, il s’agit d’un champ qui est actuellement en grande croissance. Son importance est actuellement très forte. L’encouragement donné par les gouvernements et institutions publiques à ce que leurs données soient publiées a sans doute un rôle important à ce niveau
We hear a lot about Big Data, Open Data, Social Data, Scientific Data, etc. The importance currently given to data is, in general, very high. We are living in the era of massive data. The analysis of these data is important if the objective is to successfully extract value from them so that they can be used. The work presented in this thesis concerns the understanding, assessment, correction/modification, management and, finally, the integration of data, in order to allow their exploitation and reuse. Our research focuses exclusively on Open Data and, more precisely, on Open Data organized in tabular form (CSV being one of the most widely used formats in the Open Data domain). The term Open Data first appeared in 1995, when the group GCDIS (Global Change Data and Information System) (from the United States) used this expression to encourage entities having the same interests and concerns to share their data [Data et System, 1995]. However, the Open Data movement has only recently undergone a sharp increase and has become a popular phenomenon all over the world. As a recent movement, Open Data is a field that is still growing, and its importance is considerable. The encouragement given by governments and public institutions to have their data published openly plays an important role at this level.
APA, Harvard, Vancouver, ISO, and other styles
28

Huřťák, Ladislav. "Kontingenční tabulky a jejich využití ve výzkumu." Master's thesis, Vysoká škola ekonomická v Praze, 2008. http://www.nusl.cz/ntk/nusl-3562.

Full text
Abstract:
This diploma thesis deals with contingency tables and their use in sociological and marketing research. First, the history and present state of such research are briefly outlined. Then the types of variables are introduced. Next, problems associated with the categorization of data are described, and attention is subsequently devoted to the presentation and analysis of categorical data. The most commonly used tests of independence and measures of association are discussed. Using a four-fold (2 x 2) table, different approaches to statistical inference are shown, comparing the Bayesian and the classical way of reasoning. One chapter is devoted to correspondence analysis. The applied part of the thesis consists of a sociological analysis of the work orientation of US citizens. The data were obtained from the General Social Survey 2006. Factors influencing motivation to work are examined. The study has two main themes: the retention of proven employees in a company and the dependence of annual income on the level of education attained.
APA, Harvard, Vancouver, ISO, and other styles
29

Bršlíková, Jana. "Analýza úmrtnostních tabulek pomocí vybraných vícerozměrných statistických metod." Master's thesis, Vysoká škola ekonomická v Praze, 2015. http://www.nusl.cz/ntk/nusl-201859.

Full text
Abstract:
Mortality is historically one of the most important demographic indicators and clearly reflects the maturity of each country. The objective of this diploma thesis is to compare mortality rates in the analyzed countries around the world, over time and with each other, using principal component analysis, which allows the data to be assessed in a different way. The big advantage of this method is the minimal loss of information and a quite understandable interpretation of mortality in each country. The thesis offers several interesting graphical outputs, which, for example, confirm higher mortality rates in Eastern European countries compared to Western European countries and show that the Czech Republic is the country where mortality fell the most among post-communist countries between 1990 and 2010. The source of the data is the Human Mortality Database, and all data were processed in the statistical tool SPSS.
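For orientation, a minimal sketch of the kind of analysis described: principal component analysis of age-specific mortality rates, one row per country-year. The synthetic data, its shape, and the log transform are assumptions; the thesis uses Human Mortality Database data prepared in SPSS.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data: rows = country-year observations, columns = age-specific death rates.
rng = np.random.default_rng(0)
rates = rng.lognormal(mean=-6, sigma=0.5, size=(120, 24))

pca = PCA(n_components=2)
scores = pca.fit_transform(StandardScaler().fit_transform(np.log(rates)))
print(pca.explained_variance_ratio_)   # share of mortality variation captured by the two axes
print(scores[:5])                      # coordinates on which countries/years can be compared
```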
APA, Harvard, Vancouver, ISO, and other styles
30

Kocáb, Jan. "Statistické usuzování v analýze kategoriálních dat." Master's thesis, Vysoká škola ekonomická v Praze, 2010. http://www.nusl.cz/ntk/nusl-76171.

Full text
Abstract:
This thesis introduces statistical methods for categorical data. These methods are especially used in social sciences such as sociology, psychology and political science, but their importance has also increased in medical and technical sciences. The first part covers statistical inference for a proportion, describing classical, exact and Bayesian methods for estimation and hypothesis testing. With a large sample, the exact distribution can be approximated by the normal distribution, but with a small sample this approximation cannot be used and a discrete distribution is necessary, which makes inference more complicated. The second part deals with the analysis of two categorical variables in contingency tables. It explains measures of association for 2 x 2 contingency tables, such as the difference of proportions and the odds ratio, and shows how independence can be tested for both large and small samples. For small samples, classical chi-squared tests cannot be used and alternative methods are necessary. This part also covers a variety of exact tests of independence and a Bayesian approach for the 2 x 2 table. The end of this part discusses the table for two dependent samples, where the interest is in whether the two variables give identical results, which occurs when the marginal proportions are equal. In the last part, the methods are applied to data and the results are discussed.
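A short sketch of the 2 x 2 measures mentioned above, computed with SciPy; the counts are made-up illustrative data.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

table = np.array([[30, 10],    # e.g. exposed group:   success / failure
                  [20, 25]])   #      unexposed group:  success / failure

diff_of_proportions = table[0, 0] / table[0].sum() - table[1, 0] / table[1].sum()
odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])

chi2, p_large_sample, dof, expected = chi2_contingency(table)   # classical large-sample test
odds_ratio_exact, p_small_sample = fisher_exact(table)          # exact test for small counts
print(diff_of_proportions, odds_ratio, p_large_sample, p_small_sample)
```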
APA, Harvard, Vancouver, ISO, and other styles
31

Keclík, David. "Návrh změn informačního systému firmy." Master's thesis, Vysoké učení technické v Brně. Fakulta podnikatelská, 2011. http://www.nusl.cz/ntk/nusl-222834.

Full text
Abstract:
The main aim of this diploma thesis is to propose changes to the company's information system. The document focuses on the analysis, implementation and use of a data warehouse, and on applying advanced optimization methods to deliver added value to users. The first part of the thesis deals with the choice of a suitable data warehouse concept, its implementation and possible problems. In the second part, I compare the benefits of this system with the costs of its development and administration.
APA, Harvard, Vancouver, ISO, and other styles
32

Čejková, Lenka. "Možnosti využití moderních technologií ve výuce ekonomických předmětů na SŠ." Master's thesis, Vysoká škola ekonomická v Praze, 2015. http://www.nusl.cz/ntk/nusl-205949.

Full text
Abstract:
This thesis is thematically focused on the use of modern technology in economic education at secondary schools. The theoretical part discusses the modernization of education, including a presentation of the Ministry of Education document Strategy for Digital Education to 2020. It then analyzes in detail the different types of modern technology, and the last chapter of this part deals with current projects designed to support the modernization of education. The empirical part analyzes a survey carried out among teachers of business subjects at secondary schools as well as pupils at a secondary school. The aim of this section is to use the survey to determine the attitudes of teachers and students towards modern technologies, and then to examine the possibilities of new modern technologies in teaching business subjects.
APA, Harvard, Vancouver, ISO, and other styles
33

Šmerda, Vojtěch. "Grafický editor metadat pro OLAP." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2008. http://www.nusl.cz/ntk/nusl-235893.

Full text
Abstract:
This thesis describes OLAP and data mining technologies and the ways they communicate with users through dynamic tables. Key theoretical and technical information is also included. The next part focuses on the particular implementation of dynamic tables used in the Vema portal solution. The last parts cover the analysis and implementation of the metadata editor, which enables metadata to be designed effectively.
APA, Harvard, Vancouver, ISO, and other styles
34

Branderský, Gabriel. "Knihovna znovupoužitelných komponent a utilit pro framework Angular 2." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2017. http://www.nusl.cz/ntk/nusl-363860.

Full text
Abstract:
This thesis deals with the creation of a library of reusable components and utilities intended for use in data-intensive applications. One typical component of such applications is the table, which is considered the main component of the library. To ensure high cohesion, all other components and utilities are closely connected to it. The resulting set of components can be used in a declarative way and allows various configurations. The user interface is also tailored to data-intensive applications with various elements.
APA, Harvard, Vancouver, ISO, and other styles
35

Bažant, Milan. "Software pro detekci,analýzu a opravu kolizních objednávek v CRM systému." Master's thesis, Vysoké učení technické v Brně. Fakulta podnikatelská, 2009. http://www.nusl.cz/ntk/nusl-222115.

Full text
Abstract:
This master's thesis focuses on the creation of automatic error-cleansing software that clears the errors generated by the ordering system. Based on an analysis of the errors (e.g. their causes and the possibilities for their correction), the software enables automatic error correction without the need for any manual action, resulting in minimal delay to customers' orders.
APA, Harvard, Vancouver, ISO, and other styles
36

Vlach, Petr. "Grafický podsystém v prostředí internetového prohlížeče." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2008. http://www.nusl.cz/ntk/nusl-235437.

Full text
Abstract:
This master's thesis is divided into two sections. The first part compares existing commercial and non-commercial systems for OLAP presentation using contingency tables or graphs, with the main focus on graphs. The observations from part one are then used to implement a graphic subsystem within the environment of an internet browser. A user-friendly interface and a clear arrangement of the displayed data are the most important goals.
APA, Harvard, Vancouver, ISO, and other styles
37

Hoopes, Daniel Matthew. "The ContexTable: Building and Testing an Intelligent, Context-Aware Kitchen Table." BYU ScholarsArchive, 2004. https://scholarsarchive.byu.edu/etd/12.

Full text
Abstract:
The purpose of this thesis was to design and evaluate The ContexTable, a context-aware system built into a kitchen table. After establishing the current status of the field of context-aware systems and the hurdles and problems being faced, a functioning prototype system was designed and built. The prototype makes it possible to explore established, untested theory and novel solutions to problems faced in the field.
APA, Harvard, Vancouver, ISO, and other styles
38

Pepuchová, Valéria. "Návrh systému stimulace pracovníků." Master's thesis, Vysoké učení technické v Brně. Fakulta podnikatelská, 2011. http://www.nusl.cz/ntk/nusl-222832.

Full text
Abstract:
This master's thesis deals with the problem of employee satisfaction at Copy General. I present a new version of the questionnaire, created by optimizing the original version, which was designed according to the circumstances and desired outcomes. I also analyze the current state of employee satisfaction. In the thesis, I present solutions based on the outputs of the research and on conversations with managers.
APA, Harvard, Vancouver, ISO, and other styles
39

Chou, Hsu-Heng, and 周盱衡. "A Tree Based (m,k)-Anonymity Privacy Preserving Technique For Tabular Data." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/646dv4.

Full text
Abstract:
Master's thesis
National Chung Hsing University
Department of Electrical Engineering
107 (ROC calendar year)
Data publishing contributes to the advancement of data science and to the application of knowledge-based decision making. However, data publishing faces the problem of privacy leakage. Once the data is published, sensitive information may be mined, resulting in virtual or physical threats and attacks. For example, the identity of an anonymous user may be recognized and information that he or she is reluctant to bring to light may be revealed. More seriously, revealing one's physical location could endanger his or her personal safety. Therefore, data should be carefully examined and go through a privacy protection process before being released. Nowadays, k-anonymity is still one of the most frequently used privacy preserving models, and generalization and perturbation are the common anonymization techniques. However, most generalization or perturbation techniques do not consider the high dimensionality of the data, which leads to low data utility. In handling the privacy preserving problem of high dimensional data, we consider that it is not easy for adversaries to obtain many data attributes with which to mount privacy attacks. On the other hand, it is not easy to determine the quasi-identifier attributes. Instead of making all attributes k-anonymized, ensuring that any m sub-dimensions of the data attributes conform to the k-anonymity condition is a reasonable compromise between privacy preservation and data utility. Therefore, to handle the (m,k)-anonymity problem for tabular data, we propose an (m,k)-anonymity algorithm with a Combination-Tree (C-Tree). The (m,k)-anonymity algorithm searches the C-Tree in a greedy, top-down manner to generalize the attributes of unqualified data records. The C-Tree is built based on the Pascal theorem to summarize the data, for ease of searching for unqualified data and identifying the equivalence classes for local generalization. We also propose the Taxonomy Index Support (TIS) to speed up the generalization process. To validate our methods, we conduct experiments with a real dataset to study the key factors that influence information loss and utility. According to the experimental results, our method outperforms previous methods in achieving k-anonymity with lower information loss. However, the experimental results also show a long computing time, which is due to the high computational complexity. Future work includes designing efficient data structures or algorithms to make the technique serviceable.
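A minimal sketch of the (m,k)-anonymity condition described above: every combination of m attributes must place each record in a group of at least k identical records. Only the check is shown; the thesis's C-Tree search and generalization steps are not reproduced here.

```python
from itertools import combinations
import pandas as pd

def satisfies_mk_anonymity(df: pd.DataFrame, m: int, k: int) -> bool:
    """True if every combination of m attributes puts each record in a group of >= k rows."""
    for attrs in combinations(df.columns, m):
        group_sizes = df.groupby(list(attrs)).size()
        if (group_sizes < k).any():
            return False   # some m-dimensional projection isolates fewer than k records
    return True

# Example: check (m=2, k=3)-anonymity on a small table before publishing it.
toy = pd.DataFrame({"age": [30, 30, 30, 40, 40, 40],
                    "zip": ["75", "75", "75", "69", "69", "69"],
                    "sex": ["F", "F", "F", "M", "M", "M"]})
print(satisfies_mk_anonymity(toy, m=2, k=3))   # -> True
```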
APA, Harvard, Vancouver, ISO, and other styles
40

Chen, Jiun-Lin, and 陳俊霖. "A Study on Structure Extraction, Identification and Data Extraction in Tabular-form Document Images Processing." Thesis, 2001. http://ndltd.ncl.edu.tw/handle/11910018098098981706.

Full text
Abstract:
Doctoral thesis
National Chiao Tung University
Department of Computer Science and Information Engineering
89 (ROC calendar year)
In this thesis, we propose methods for solving several problems in an automatic tabular form document processing system. These problems include form structure extraction, form line removal, form document identification and field data grouping. First, a strip projection method is presented for extracting the form structure. We first segment input form images into uniform vertical and horizontal strips. Since most form lines are vertical or horizontal, we can locate form lines by projecting the image of each vertical strip horizontally and that of each horizontal strip vertically. The peak positions in these projection profiles denote possible locations of lines in form images. We then extract the lines starting from the possible line positions in the source image. After all lines have been extracted, redundant lines are removed using a line-verification algorithm and broken lines are linked using a line-merging algorithm. Experimental results show that the proposed method can extract form structures from A4-sized documents in about 3 seconds, which is very efficient compared with methods based on the Hough transformation and run-based line-detection algorithms. Second, a form document identification algorithm is proposed based on the similarity values of the extracted form fields. The field similarity is defined by normalizing the differences of the top-left corner points and of the bottom-right corner points between the template field and the input field. Since the input form image can be de-skewed according to the results of form structure extraction, each extracted field is a rectangle; thus, a field can be represented by its top-left and bottom-right corner points. Besides, since a short boundary line does not mean that the field containing it is small, short form lines can introduce certain difficulties in form document identification if some of them are not correctly extracted. By comparing the extracted fields, our method is much more efficient than methods that compare extracted form lines, and it is less sensitive to mis-extracted form lines. A slice-removal method that preserves filled-in data during form line removal is also proposed in this thesis. In this method, we first go through a given form line to estimate its width. Then, we go through the line again to calculate the effective line width at each position and compare it with the estimated line width. The effective line width is the maximum length of the line traced from the slice position in the direction orthogonal to this form line. If the effective line width at a position is larger than the estimated line width, we preserve the line slice at this position, since this slice should be located on filled-in data. Otherwise, the slice is removed. This thesis also proposes a novel approach to grouping Chinese handwritten field data filled in form documents using a gravitation-based algorithm. This algorithm is developed to extract handwritten field data which may be written outside the form fields. First, form lines are extracted and removed from input form images. Connected components are then detected from the remaining data, and the gravitation for each connected component is computed using the black pixel counts as their mass. Next, we move connected components according to their gravitation. As is generally known, filled-in data have the locality property, i.e., data of the same field are normally written consecutively in a local area. Therefore, the relationship of these connected components can be determined by this property. Repeatedly moving these connected components according to their neighboring components allows us to determine which connected components should be extracted together for a particular field. Experimental results demonstrate the effectiveness of the proposed method in extracting field data.
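A compact sketch of the projection idea described above: sum dark pixels along the rows and columns of a binarised form image and treat strong peaks as candidate horizontal and vertical line positions. The fill threshold is an illustrative assumption; the thesis's strip-based segmentation and verification steps are not reproduced.

```python
import numpy as np

def candidate_line_positions(binary_img: np.ndarray, min_fill: float = 0.6):
    """binary_img: 2-D array with 1 for dark (ink) pixels and 0 for background."""
    h, w = binary_img.shape
    horizontal_profile = binary_img.sum(axis=1)   # ink count in each image row
    vertical_profile = binary_img.sum(axis=0)     # ink count in each image column
    horizontal_lines = np.where(horizontal_profile >= min_fill * w)[0]
    vertical_lines = np.where(vertical_profile >= min_fill * h)[0]
    return horizontal_lines, vertical_lines

# Example: a tiny synthetic "form" with one horizontal and one vertical line.
img = np.zeros((10, 12), dtype=int)
img[4, :] = 1      # horizontal line on row 4
img[:, 7] = 1      # vertical line on column 7
print(candidate_line_positions(img))   # -> (array([4]), array([7]))
```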
APA, Harvard, Vancouver, ISO, and other styles
41

Ferreira, Francisco Martins. "Anonymizing Private Information: From Noise to Data." Master's thesis, 2021. http://hdl.handle.net/10316/95554.

Full text
Abstract:
Master's dissertation in Informatics Engineering presented to the Faculty of Sciences and Technology
In the Information Age, data has become more important for all types of organizations. The information carried by large datasets enables the creation of intelligent systems that overcome inefficiencies and create a safer and better quality of life. Because of this, organizations have come to see data as a competitive advantage. Fraud detection solutions are one example of intelligent systems that are highly dependent on having access to large amounts of data. These solutions receive information about monetary transactions and classify them as legitimate or fraudulent in real time. This field has benefitted from the higher availability of data, allowing the application of Machine Learning (ML) algorithms that leverage the information in datasets to find fraudulent activity in real time. In a context of systematic gathering of information, privacy dictates how data can be used and shared, in order to protect the information of users and organizations. In order to retain the utility of data, a growing amount of effort has been dedicated to creating and exploring avenues for privacy-conscious data sharing. Generating synthetic datasets that carry the same information as real data allows for the creation of ML solutions while respecting the limitations placed on data usage. In this work, we introduce Duo-GAN and DW-GAN as frameworks for synthetic data generation that learn the specificities of financial transaction data and generate fictitious data that keeps the utility of the original collections of data. Both frameworks use two generators, one for generating fraudulent instances and one for generating legitimate instances. This allows each generator to learn the distribution of each class, avoiding the problems created by highly unbalanced data. Duo-GAN achieves positive results, in some instances achieving a disparity of only 4% in F1 score between classifiers trained with synthetic data and classifiers trained with real data, both tested on the same real data. DW-GAN also presents positive results, with a disparity of 3% in F1 score under the same conditions.
Na Idade da Informação os dados tornaram-se mais importantes para todos os tipos de organizações. A informação contida pelos grandes datasets permite a criação de sistemas inteligentes que ultrapassam ineficiências e criam qualidade de vida melhor e mais segura. Devido a isto, as organizações começaram a ver os dados com uma vantagem competitiva.As soluções de Deteção de Fraude são exemplos de sistemas inteligentes que dependem do acesso a grandes quantidades de dados. Estas soluções recebem informação relativas a transações monetárias e atribuem classificações de legítimas ou fraudulentas em tempo real. Este é um dos campos que beneficiou da maior disponibilidade de dados, sendo capaz de aplicar algoritmos de Machine Learning que utilizam a informação contida nos datasets para detetar atividade fraudulenta em tempo real.Num contexto de agregação sistemática de informação, a privacidade dita como os dados podem ser utilizados e partilhados, com o objetivo de proteger a informação dos utilizadores de sistemas e de organizações. De forma a reter a utilidade dos dados, uma quantidade crescente de esforço tem sido dispendido em criar e explorar avenidas para a partilha de dados respeitando a privacidade.A geração de dados sintéticos que contém a mesma informação que os dados reais permite a criação de soluções de Machine Learning (ML) mantendo o respeito pelas limitações colocadas sobre a utilização de dados.Neste trabalho introduzimos Duo-GAN e DW-GAN como frameworks para geração de dados sintéticos que aprendem as especificidades dos dados de transações financeiras e geram dados fictícios que retém a utilidade das coleções de dados originais. Ambos os frameworks utilizam dois geradores, um para gerar instâncias fraudulentas e outro para gerar instâncias legítimas. Isto permite que cada gerador aprenda a distribuição de cada uma das classes, evitando assim os problemas criados por datasets desiquilibrados. O Duo- GAN atinge resultados positivos, em certos casos atingindo uma disparidade de apenas 4% no F1 score entre classificadores treinados com dados sintéticos e classificadores treinados com dados reais, e ambos testados nos mesmos dados reais. O DW-GAN também apresenta resultados positivos, com disparidade de 3% no F1 score para as mesmas condições.
Other - This work is partially funded by national funds through the FCT - Foundation for Science and Technology, I.P., within the scope of the project CISUC - UID/CEC/00326/2020, by the European Social Fund through the Regional Operational Program Centro 2020, and by the CMU|Portugal project CAMELOT (POCI-01-0247-FEDER-045915).
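A toy sketch of the per-class generator idea described in the abstract above, with one generative model per class so that the minority (fraud) class is not drowned out by the majority class; Gaussian mixtures stand in for the adversarial networks, so this is not the Duo-GAN or DW-GAN architecture itself.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_per_class_generators(X: np.ndarray, y: np.ndarray, n_components: int = 10):
    """Fit one generative model per class label (e.g. fraudulent vs. legitimate)."""
    return {label: GaussianMixture(n_components=n_components, random_state=0).fit(X[y == label])
            for label in np.unique(y)}

def sample_synthetic(generators: dict, n_per_class: int):
    """Sample each class independently, so the minority class is represented as desired."""
    Xs, ys = [], []
    for label, gm in generators.items():
        samples, _ = gm.sample(n_per_class)
        Xs.append(samples)
        ys.append(np.full(n_per_class, label))
    return np.vstack(Xs), np.concatenate(ys)
```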
APA, Harvard, Vancouver, ISO, and other styles
42

Santos, Ana Rita Albuquerque. "A client focused business intelligence & analytics solution for the hospitality sector." Master's thesis, 2020. http://hdl.handle.net/10362/106502.

Full text
Abstract:
Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence
One of the greatest needs of today's businesses is to know the customer, or the type of customer they want to reach, which makes a customer database a strategic weapon and one of the most important investments a company can make. The business world is becoming more competitive every day: we are constantly overwhelmed with advertisements for products we may like, promotions on products we usually buy, or discounts on the next purchase if we subscribe to the company's newsletter. All of this creates personalization for the client, and any company that is not able to do this cannot keep up with its competition. This report details the project developed at Pestana Hotel Group, which consisted of a Business Intelligence solution, more specifically the development of a customer database with the creation of two tabular models using SQL Server tools, one specific to loyal customers and another, more general, with information about all Pestana customers, and two Power BI reports that allow the information obtained to be visualized in an effective and simplified way. This report contains a literature review that situates the reader in the subject addressed in this project, a chapter dedicated to the data modeling used to create the tabular models, and another on the creation of the reports.
Uma das maiores necessidades dos negócios atuais é conhecer o seu cliente ou o tipo de cliente que quer atingir, o que torna uma base de dados de cliente uma arma estratégica e um dos mais importantes investimentos. O mundo empresarial está cada dia mais competitivo, somos constantemente assoberbados com anúncios de produtos que podemos gostar, promoções de produtos que costumamos comprar ou descontos na próxima compra caso subscrevamos a newsletter. Tudo isto cria uma personalização para o cliente, e qualquer empresa que não o consiga fazer não conseguirá acompanhar a concorrência. Este relatório detalha o projeto feito no Pestana Hotel Group, que consistiu numa solução de Business Intelligence, mais especificamente na construção de uma base de dados do cliente com a criação de dois modelos tabulares através de ferramentas do SQL Server, um específico para clientes fidelizados e outro mais geral com informação sobre todos os clientes Pestana, e dois relatórios em Power BI que permitem a visualização da informação obtida de uma forma eficaz e simplificada. O relatório contém uma revisão de literatura que situa o leitor sobre os assuntos abordados neste projeto, um capítulo dedicado à modelação dos dados de forma a criar os modelos tabulares e outro sobre a criação dos relatórios.
APA, Harvard, Vancouver, ISO, and other styles
43

Sarvghad, Batn Moghaddam Ali. "Tracking and visualizing dimension space coverage for exploratory data analysis." Thesis, 2016. http://hdl.handle.net/1828/7442.

Full text
Abstract:
In this dissertation, I investigate interactive visual history for collaborative exploratory data analysis (EDA). In particular, I examine the use of analysis history for improving awareness of dimension space coverage to better support data exploration. Commonly, interactive history tools facilitate data analysis by capturing and representing information about the analysis process. These tools can support a wide range of use-cases, from simple undo and redo to complete reconstructions of the visualization pipeline. In the context of exploratory collaborative Visual Analytics (VA), history tools are commonly used for reviewing and reusing past states/actions and do not efficiently support other use-cases such as understanding the past analysis from the angle of dimension space coverage. However, such knowledge is essential for exploratory analysis, which requires constant formulation of new questions about data. To carry out exploration, an analyst needs to understand "what has been done" versus "what is remaining" to explore. Lack of such insight can result in premature fixation on certain questions, compromising the coverage of the data set and the breadth of exploration [80]. In addition, exploration of large data sets sometimes requires collaboration between a group of analysts who might be in different time/location settings. In this case, in addition to personal analysis history, each team member needs to understand what aspects of the problem his or her collaborators have explored. Such scenarios are common in domains such as science and business [34], where analysts explore large multi-dimensional data sets in search of relationships, patterns and trends. Currently, analysts typically rely on memory and/or externalization to keep track of investigated versus uninvestigated aspects of the problem. Although analysis history mechanisms have the potential to assist analysts with this problem, most common visual representations of history are geared towards reviewing and reusing the visualization pipeline or visualization states. I started this research with an observational user study to gain a better understanding of analysts' history needs in the context of collaborative exploratory VA. This study showed that understanding the coverage of the dimension space by using linear history was cumbersome and inefficient. To address this problem, I investigated how alternate visual representations of analysis history could support this use-case. First, I designed and evaluated Footprint-I, a visual history tool that represented analysis from the angle of dimension space coverage (i.e. the history of investigation of data dimensions; specifically, this approach revealed which dimensions had been previously investigated and in which combinations). I performed a user study that evaluated participants' ability to recall the scope of past analysis using my proposed design versus a linear representation of analysis history. I measured participants' task duration and accuracy in answering questions about a past exploratory VA session. Findings of this study showed that participants with access to dimension space coverage information were both faster and more accurate in understanding the scope of past analysis. Next, I studied the effects of providing coverage information on collaboration. To investigate this question, I designed and implemented Footprint-II, the next version of Footprint-I. In this version, I redesigned the representation of dimension space coverage to be more usable and scalable. I conducted a user study that measured the effects of presenting history from the angle of dimension space coverage on task coordination (the tacit breakdown of a common task between collaborators). I asked each participant to assume the role of a business data analyst and continue an exploratory analysis that had been started by a collaborator. The results of this study showed that providing dimension space coverage information helped participants to focus on dimensions that were not investigated in the initial analysis, hence improving tacit task coordination. Finally, I investigated the effects of providing live dimension space coverage information on VA outcomes. To this end, I designed and implemented a standalone prototype VA tool with a visual history module. I used scented widgets [76] to incorporate real-time dimension space coverage information into the GUI widgets. Results of a user study showed that providing live dimension space coverage information increased the number of top-level findings. Moreover, it expanded the breadth of exploration (without compromising the depth) and helped analysts to formulate and ask more questions about their data.
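A minimal sketch of what tracking dimension space coverage might look like programmatically: record which combinations of dimensions each view has used so that unexplored combinations can be surfaced. The class, its API, and the example dimensions are assumptions for illustration, not the Footprint tools described above.

```python
from itertools import combinations

class CoverageTracker:
    """Record which combinations of dimensions each view has used during an EDA session."""

    def __init__(self, dimensions):
        self.dimensions = list(dimensions)
        self.visited = set()   # frozensets of dimensions shown together in some view

    def log_view(self, used_dimensions):
        self.visited.add(frozenset(used_dimensions))

    def unexplored_pairs(self):
        """Pairs of dimensions never shown together in any view so far."""
        return [pair for pair in combinations(self.dimensions, 2)
                if not any(set(pair) <= visited for visited in self.visited)]

tracker = CoverageTracker(["price", "region", "segment", "date"])
tracker.log_view(["price", "region"])
print(tracker.unexplored_pairs())   # remaining pairwise combinations to look at
```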
APA, Harvard, Vancouver, ISO, and other styles
44

Cardoso, Francisco Pereira. "Gestão e otimização de dados para a criação de análises informativas." Master's thesis, 2019. http://hdl.handle.net/10362/120489.

Full text
Abstract:
Com a constante acumulação de dados ao longo do tempo por parte das empresas, é complexo manter uma estrutura organizada e eficiente de armazenamento e análise dos mesmos, sendo por isso notória a falta de aproveitamento de uma análise informativa relevante dos dados para auxiliar nas tomadas de decisão de negócio. O projeto que se pretende desenvolver enquadra-se nesta temática e é dirigido a um cliente que revela esses problemas, procurando uma resolução dos mesmos. Inicialmente, é necessário fazer uma análise aos dados armazenados pelo cliente e perceber a forma mais eficiente de os tratar e organizar numa solução analítica, de forma a promover um consumo frequente pelos diversos utilizadores. Além disso, é necessário desenhar o processo de extração, tratamento e carregamento dos dados para a solução e definir e implementar uma arquitetura estruturada, eficiente e escalável que satisfaça as necessidades do cliente. A solução final tem como objetivo promover um maior e melhor aproveitamento dos dados armazenados, assim como a consulta dos mesmos. Outro grande objetivo passa por reduzir o tempo e esforço necessários para a criação de relatórios de negócio, contribuindo assim para a redução de custos da empresa e aumento da produtividade do negócio.
With the constant accumulation of data by companies over time, it is complex to maintain an organized and efficient structure for storing and analyzing this data, and the failure to exploit relevant analytical information to support business decision making is therefore evident. The client for whom the solution is being developed has these same problems and is looking for a way to resolve them. Initially, it is necessary to analyze the data stored by the customer and understand the most efficient way to process and organize it in an analytical solution, in order to promote frequent consumption by the various users. In addition, it is necessary to design the data extraction, transformation and loading process, and to define and implement a structured, efficient and scalable architecture that meets the client's needs. The final solution aims to promote a greater and better use of the stored data, as well as easier consultation of it. Another major objective is to reduce the time and effort required to create business reports, thus contributing to reducing company costs and increasing business productivity.
APA, Harvard, Vancouver, ISO, and other styles
45

Slavíková, Karolina. "Netradiční fyzikální tabulky." Master's thesis, 2013. http://www.nusl.cz/ntk/nusl-328146.

Full text
Abstract:
Title: Unusual physical tables. Author: Karolina Slavíková. Department: Department of Physics Education. Supervisor: Mgr. Jakub Jermář, Department of Physics Education. Abstract: The main outcome of this diploma thesis is a set of electronic physical tables containing over 300 items - objects and their properties (physical quantities). These tables are part of the thesis and are placed at its end as an appendix. The thesis deals with the selection of suitable real-life objects and the use of their typical properties for creating new physics problems. It includes a survey of secondary-school textbooks, which formed the basis for the selection of the objects and their properties, as well as the creation of the electronic physical tables and their use in teaching. The thesis contains four sample problems for primary and secondary schools and instructions for using the electronic tables. Part of the thesis deals with views on the concept of a point mass and with methods of measuring and estimating physical quantities using simple aids. Keywords: physical quantities, electronic tables, tabulated values
APA, Harvard, Vancouver, ISO, and other styles
46

Toušková, Daniela. "Hodnotící tabulka jako nástroj pro měření makroekonomických nerovnováh." Master's thesis, 2013. http://www.nusl.cz/ntk/nusl-324950.

Full text
Abstract:
This thesis examines the ability of the scoreboard indicators created by the European Commission to capture macroeconomic imbalances, expressed as changes in GDP. We conducted an empirical analysis on panel data for 27 EU countries over the 1997-2011 period. We adopted three different dynamic panel data models based on three estimators: the Arellano-Bond, the Arellano-Bover and the corrected LSDV estimator. Our results suggest that, despite some undesirable characteristics of our dataset, we can conclude that some of the indicators, such as the 3-year average of the current account balance or the percentage change in export market shares, seem to be inadequate for measuring the imbalances. Moreover, the indicators proved unable to predict the occurrence of imbalances.
APA, Harvard, Vancouver, ISO, and other styles
47

Brodec, Václav. "Hledání a vytváření relací mezi sloupci v CSV souborech s využitím Linked Dat." Master's thesis, 2019. http://www.nusl.cz/ntk/nusl-393117.

Full text
Abstract:
A large amount of the data produced by governmental organizations is accessible in the form of tables encoded as CSV files. Semantic table interpretation (STI) strives to transform them into linked data in order to make them more useful. As a significant portion of the tabular data is of a statistical nature, and therefore consists predominantly of numeric values, it is paramount to possess effective means for interpreting relations between the entities and their numeric properties as captured in the tables. As the current general-purpose STI tools infer the annotations of the columns almost exclusively from numeric objects of RDF triples already present in the linked data knowledge bases, they are unable to handle unknown input values. This leaves them with weak evidence for their suggestions. On the other hand, known techniques focusing on the numeric values also have their downsides. Either their background knowledge representation is built in a top-down manner from general knowledge bases, which do not reflect the domain of the input and in turn do not contain the values in a recognizable form, or they do not make use of the context provided by the general STI tools. This causes them to mismatch annotations of columns consisting of similar values but of entirely different meaning. This thesis addresses the...
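One way to make the numeric-matching idea above concrete is sketched below: compare the distribution of a numeric column against samples of candidate numeric properties from a knowledge base and keep the closest one. The Kolmogorov-Smirnov statistic and the input format are assumptions for the sketch, not the approach developed in the thesis.

```python
import numpy as np
from scipy.stats import ks_2samp

def best_matching_property(column_values, candidates):
    """candidates: mapping of property name -> sample of known values from the knowledge base."""
    scored = {name: ks_2samp(column_values, values).statistic
              for name, values in candidates.items()}
    return min(scored, key=scored.get)   # smaller KS statistic = closer distributions

# Toy example: a column of human-height-like numbers matched against two candidate properties.
rng = np.random.default_rng(0)
column = rng.normal(175, 8, 300)
candidates = {"height_cm": rng.normal(170, 10, 1000), "weight_kg": rng.normal(75, 15, 1000)}
print(best_matching_property(column, candidates))   # -> height_cm
```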
APA, Harvard, Vancouver, ISO, and other styles
48

Smutný, Lukáš. "M-technologie ve výuce a v řízení základních škol v rámci Moravskoslezského kraje." Master's thesis, 2014. http://www.nusl.cz/ntk/nusl-335128.

Full text
Abstract:
The diploma thesis "M-technologies in education and in the management of Primary schools in the Moravian-Silesian region" deals with modern technologies in the context of education and school management. The theoretical part summarizes basic information about m-technologies and other modern technologies; the research part that follows focuses on the availability and use of modern ICT in education and in the management of primary schools in the Moravian-Silesian region. Questionnaire results and research on the use of modern technologies in education and in school management are the focus of the practical part. The thesis examines the level of implementation and use of m-technologies and other modern technologies in primary schools. KEYWORDS: information and communication technologies; m-technologies; tablet; smartphone; eReader; interactive whiteboard; data projector; data box; information system
APA, Harvard, Vancouver, ISO, and other styles
49

MATĚJŮ, Petr. "Videosekvence a jejich využití při výuce fyziky na ZŠ." Master's thesis, 2011. http://www.nusl.cz/ntk/nusl-81502.

Full text
Abstract:
The thesis "Videosequences and their use in teaching physics at the elementary school" provides a basic understanding of the use of audiovisual technology in the elementary school environment, focusing on the integration of video into physics teaching at the second stage. It describes available video sources on the Internet and gives instructions for preparing them and for creating one's own. The thesis also draws on practical experience, namely the inclusion of specific videos in two selected topics in physics teaching.
APA, Harvard, Vancouver, ISO, and other styles