Dissertations / Theses on the topic 'Data anonymization'

Consult the top 40 dissertations / theses for your research on the topic 'Data anonymization.'


1

Lasko, Thomas A. (Thomas Anton) 1965. "Spectral anonymization of data." Thesis, Massachusetts Institute of Technology, 2007. http://hdl.handle.net/1721.1/42055.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Includes bibliographical references (p. 87-96).
Data anonymization is the process of conditioning a dataset such that no sensitive information can be learned about any specific individual, but valid scientific analysis can nevertheless be performed on it. It is not sufficient to simply remove identifying information because the remaining data may be enough to infer the individual source of the record (a reidentification disclosure) or to otherwise learn sensitive information about a person (a predictive disclosure). The only known way to prevent these disclosures is to remove additional information from the dataset. Dozens of anonymization methods have been proposed over the past few decades; most work by perturbing or suppressing variable values. None have been successful at simultaneously providing perfect privacy protection and allowing perfectly accurate scientific analysis. This dissertation makes the new observation that the anonymizing operations do not need to be made in the original basis of the dataset. Operating in a different, judiciously chosen basis can improve privacy protection, analytic utility, and computational efficiency. I use the term 'spectral anonymization' to refer to anonymizing in a spectral basis, such as the basis provided by the data's eigenvectors. Additionally, I propose new measures of reidentification and prediction risk that are more generally applicable and more informative than existing measures. I also propose a measure of analytic utility that assesses the preservation of the multivariate probability distribution. Finally, I propose the demanding reference standard of nonparticipation in the study to define adequate privacy protection. I give three examples of spectral anonymization in practice. The first example improves basic cell swapping from a weak algorithm to one competitive with state-of-the-art methods merely by a change of basis.
(cont) The second example demonstrates avoiding the curse of dimensionality in microaggregation. The third describes a powerful algorithm that reduces computational disclosure risk to the same level as that of nonparticipants and preserves at least 4th order interactions in the multivariate distribution. No previously reported algorithm has achieved this combination of results.
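The core idea, performing the anonymizing operation in a spectral rather than the original basis, can be conveyed with a minimal sketch. The following is an illustration only, not Lasko's algorithm: it applies cell swapping to the columns of an SVD score matrix instead of the raw data, which preserves column means exactly and the covariance structure approximately while breaking record-level linkage.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_swap(X):
    """Cell swapping performed in a spectral basis instead of on raw data.

    Projects the centered data onto its singular-vector basis, permutes
    each spectral coordinate independently across records, and maps back.
    Column means are preserved exactly; the covariance structure is
    preserved approximately, while record-level linkage is destroyed.
    """
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    scores = U * s                        # records expressed in the spectral basis
    for j in range(scores.shape[1]):      # swap within each spectral column
        scores[:, j] = rng.permutation(scores[:, j])
    return scores @ Vt + mu               # back to the original basis

X = rng.normal(size=(200, 3))
X[:, 2] += 0.8 * X[:, 0]                  # induce a correlation to preserve
Xa = spectral_swap(X)
print(np.allclose(X.mean(axis=0), Xa.mean(axis=0)))  # True: means survive exactly
```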
by Thomas Anton Lasko.
Ph.D.
2

Reje, Niklas. "Synthetic Data Generation for Anonymization." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-276239.

Full text
Abstract:
Because of regulations, but also to keep participants willing to take part in surveys, any released data needs some form of privacy preservation. Privacy preservation, however, always requires some reduction of the data's utility; how much varies with the method. Synthetic data generation seeks to be a privacy-preserving alternative that protects participants by generating new records that do not correspond to any real individuals or organizations but still preserve the relationships and information within the original dataset. For a method to see wide adoption, however, it needs to be shown to be useful: even a perfectly privacy-preserving method will never be used if it cannot support usable research. We investigated four methods for synthetic data generation: Parametric methods, Decision Trees, Saturated Model with Parametric, and Saturated Model with Decision Trees, and how the choice of dataset affects those methods with regard to utility, subject to restrictions on how much data can be released and on generation time. By comparing inferences made on the original and the synthetic datasets, we saw that a large number of synthetic datasets, about 10 or more, need to be released for good utility, and that the more datasets are released, the more stable the inferences become. Using as many variables as possible in the imputation process for each variable is best when generating synthetic datasets for general use, but being selective about which variables are used for each imputation can be better for specific inferences that match the preserved relationships. Being selective also helps keep down the time complexity of generating synthetic datasets.
When compared with k-anonymity, we found that the results depended heavily on how many variables we included as quasi-identifiers, but regardless, the synthetic data generation methods could produce inferences at least as close to the original as inferences made from the k-anonymized datasets, and the synthetic data more often performed better. We found that Saturated Model with Decision Trees is the overall best method, with high utility and stable generation time regardless of the dataset. Decision Trees on its own was second, with results very close to Saturated Model with Decision Trees but slightly worse on categorical variables. Third was Saturated Model with Parametric, which often had good utility, but not on datasets with few categorical variables, and occasionally a very long generation time. Parametric was the worst, with poor utility on all datasets and an unstable generation time that could also be very long.
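The "Decision Trees" style of sequential synthesis can be illustrated with a simplified sketch. This is not the thesis's implementation; it assumes numeric variables and uses scikit-learn's CART trees, with resampled residuals standing in for the noise model. Each column is synthesized from the already-synthesized earlier columns, and several synthetic datasets are released, as the abstract recommends.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

def synthesize(df, n_sets=10):
    """Sequentially synthesize each column from previously synthesized ones.

    Column 0 is bootstrapped; each later column is predicted from the
    synthetic values of the earlier columns by a CART model, with noise
    resampled from the model's residuals on the real data.
    """
    cols = list(df.columns)
    out = []
    for _ in range(n_sets):
        syn = pd.DataFrame({cols[0]: rng.choice(df[cols[0]], size=len(df))})
        for i in range(1, len(cols)):
            tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
            tree.fit(df[cols[:i]], df[cols[i]])
            resid = df[cols[i]] - tree.predict(df[cols[:i]])
            syn[cols[i]] = tree.predict(syn[cols[:i]]) + rng.choice(resid, size=len(df))
        out.append(syn)
    return out

df = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
df["c"] = 0.5 * df["a"] + df["c"]          # a relationship the synthesis should keep
synthetic = synthesize(df, n_sets=10)      # release ~10 sets for stable inferences
print(len(synthetic), synthetic[0].shape)
```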
3

Miracle, Jacob M. "De-Anonymization Attack Anatomy and Analysis of Ohio Nursing Workforce Data Anonymization." Wright State University / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=wright1482825210051101.

Full text
4

Sivakumar, Anusha. "Enhancing Privacy Of Data Through Anonymization." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177349.

Full text
Abstract:
A steep rise in the availability of personal data has resulted in endless opportunities for data scientists who utilize this open data for research. However, such easy availability of complex personal data challenges the privacy of the individuals represented in the data. To protect privacy, traditional methods such as using pseudonyms or blurring the identity of individuals are applied before releasing data. These traditional methods alone are not sufficient, because combining released data with other publicly available data or background knowledge can identify individuals. A potential solution to this privacy loss is to anonymize the data so that it cannot be linked to the individuals represented in it. In research involving personal data, anonymization becomes more important than ever. If we alter data to preserve the privacy of research participants, the resulting data can become almost useless for much research. Therefore, preserving the privacy of the individuals represented in the data while minimizing the data loss caused by privacy preservation is vital. In this project, we first study the different cases in which attacks take place, the different forms of attacks, and existing solutions to prevent them. After carefully examining the literature and the problem at hand, we propose a solution that preserves the privacy of research participants as much as possible while keeping the data useful to researchers. To support our solution, we consider the case of Digital Footprints, which collects and publishes Facebook data with the consent of the users.
5

Folkesson, Carl. "Anonymization of directory-structured sensitive data." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-160952.

Full text
Abstract:
Data anonymization is a relevant and important field within data privacy, which tries to find a good balance between utility and privacy in data. The field is especially relevant since the GDPR came into force, because the GDPR does not regulate anonymous data. This thesis focuses on anonymization of directory-structured data, that is, data structured into a tree of directories. In the thesis, four of the most common models for anonymization of tabular data, k-anonymity, ℓ-diversity, t-closeness and differential privacy, are adapted for anonymization of directory-structured data. This adaptation is done by creating three different approaches for anonymizing directory-structured data: SingleTable, DirectoryWise and RecursiveDirectoryWise. These models and approaches are compared and evaluated using five metrics and three attack scenarios. The results show that there is always a trade-off between utility and privacy when anonymizing data. In particular, it was concluded that the differential privacy model with the RecursiveDirectoryWise approach gives the highest privacy, but also the highest information loss. Conversely, the k-anonymity model with the SingleTable approach, or the t-closeness model with the DirectoryWise approach, gives the lowest information loss, but also the lowest privacy. The differential privacy model and the RecursiveDirectoryWise approach were also shown to give the best protection against the chosen attacks. Finally, it was concluded that the differential privacy model with the RecursiveDirectoryWise approach was the most suitable combination for complying with the GDPR when anonymizing directory-structured data.
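The k-anonymity model evaluated above can be illustrated with a minimal check-and-generalize sketch. The table, the choice of quasi-identifiers, and the age bands are invented for illustration and are not from the thesis.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True if every combination of quasi-identifier values occurs >= k times."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return min(counts.values()) >= k

def generalize_age(rows, width):
    """Replace exact ages with [lo-hi) bands of the given width."""
    out = []
    for r in rows:
        lo = (r["age"] // width) * width
        out.append({**r, "age": f"[{lo}-{lo + width})"})
    return out

rows = [
    {"age": 34, "zip": "12345", "diagnosis": "flu"},
    {"age": 36, "zip": "12345", "diagnosis": "cold"},
    {"age": 35, "zip": "12345", "diagnosis": "flu"},
    {"age": 52, "zip": "12345", "diagnosis": "cold"},
    {"age": 57, "zip": "12345", "diagnosis": "flu"},
]
print(is_k_anonymous(rows, ["age", "zip"], 2))                      # False: exact ages are unique
print(is_k_anonymous(generalize_age(rows, 10), ["age", "zip"], 2))  # True after decade bands
```

Generalizing ages into decade-wide bands trades precision (information loss) for privacy, the same utility/privacy trade-off the abstract describes.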
6

Cohen, Aloni (Aloni Jonathan). "New guarantees for cryptographic circuits and data anonymization." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/122737.

Full text
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 305-320).
The first part of this thesis presents new definitions and constructions for three modern problems in cryptography: watermarking cryptographic circuits, updatable cryptographic circuits, and proxy reencryption. The second part is dedicated to advancing the understanding of data anonymization. We examine what it means for a data anonymization mechanism to prevent singling out in a data release, a necessary condition for the data to be considered effectively anonymized under the European Union's General Data Protection Regulation. We also demonstrate that heretofore theoretical privacy attacks against ad-hoc privacy-preserving technologies are in fact realistic and practical.
by Aloni Jonathan Cohen.
Ph. D.
7

Hassan, Fadi Abdulfattah Mohammed. "Utility-Preserving Anonymization of Textual Documents." Doctoral thesis, Universitat Rovira i Virgili, 2021. http://hdl.handle.net/10803/672012.

Full text
Abstract:
Every day, people post a significant amount of data on the Internet, such as tweets, reviews, photos, and videos. Organizations collecting these types of data use them to extract information in order to improve their services or for commercial purposes. Yet, if the collected data contain sensitive personal information, they cannot be shared with third parties or released publicly without consent or adequate protection of the data subjects. Privacy-preserving mechanisms provide ways to sanitize data so that identities and/or confidential attributes are not disclosed. A great variety of mechanisms have been proposed to anonymize structured databases with numerical and categorical attributes; however, automatically protecting unstructured textual data has received much less attention. In general, textual data anonymization requires, first, detecting pieces of text that may disclose sensitive information and, then, masking those pieces via suppression or generalization. In this work, we leverage several technologies to anonymize textual documents. We first improve state-of-the-art techniques based on sequence labeling. After that, we extend them to align them better with the notion of privacy risk and with privacy requirements. Finally, we propose a complete framework based on word embedding models that captures a broader notion of data protection and provides flexible protection driven by privacy requirements. We also leverage ontologies to preserve the utility of the masked text, that is, its semantics and readability. Extensive experimental results show that our methods outperform the state of the art by providing more robust anonymization while reasonably preserving the utility of the protected outcomes.
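The detect-then-mask pipeline described above can be sketched as follows, assuming an upstream detector (e.g. a sequence-labeling model) has already produced sensitive spans. The text, spans, and category labels below are invented for illustration; they are not from the dissertation.

```python
def mask_text(text, spans, mode="generalize"):
    """Mask detected sensitive spans (start, end, category) in a text.

    'suppress' removes each span entirely; 'generalize' replaces it with
    its category label -- the two masking operations named in the abstract.
    """
    out, last = [], 0
    for start, end, category in sorted(spans):
        out.append(text[last:start])
        if mode == "generalize":
            out.append(f"[{category}]")
        last = end
    out.append(text[last:])
    return "".join(out)

def span(text, fragment, category):
    """Helper: build a (start, end, category) span for a known fragment."""
    i = text.find(fragment)
    return (i, i + len(fragment), category)

text = "John Smith was treated at St. Mary Hospital in Boston."
spans = [span(text, "John Smith", "PERSON"),
         span(text, "St. Mary Hospital", "ORGANIZATION"),
         span(text, "Boston", "CITY")]
print(mask_text(text, spans))  # [PERSON] was treated at [ORGANIZATION] in [CITY].
```

Generalization keeps the sentence readable and semantically informative, which is exactly the utility the ontology-based replacement in the thesis aims to preserve beyond this crude category-label stand-in.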
8

Michel, Axel. "Personalising privacy constraints in Generalization-based Anonymization Models." Thesis, Bourges, INSA Centre Val de Loire, 2019. http://www.theses.fr/2019ISAB0001/document.

Full text
Abstract:
The benefits of performing big data computations over individuals' microdata are manifold, in the medical, energy and transportation fields to cite only a few, and this interest is growing with the emergence of smart-disclosure initiatives around the world. However, these computations often expose microdata to privacy leakages, explaining the reluctance of individuals to participate in studies despite the privacy guarantees promised by statistical institutes. To regain individuals' trust, it becomes essential to propose user-empowerment solutions, that is to say, to allow individuals to control the privacy parameters used in computations over their microdata. This work proposes a novel concept of personalized anonymization based on data generalization and user empowerment. Firstly, this manuscript proposes a novel approach to push personalized privacy guarantees into the processing of database queries, so that individuals can disclose different amounts of information (i.e. data at different levels of accuracy) depending on their own perception of the risk. Moreover, we propose a decentralized computing infrastructure based on secure hardware that enforces these personalized privacy guarantees throughout the query execution process. Secondly, this manuscript studies the personalization of anonymity guarantees when publishing data. We propose the adaptation of existing heuristics and a new approach based on constraint programming. Experiments have been carried out to show the impact of such personalization on data quality. Individuals' privacy constraints were constructed and simulated realistically, based on results from sociological studies.
9

Sakpere, Aderonke Busayo. "Usability heuristics for fast crime data anonymization in resource-constrained contexts." Doctoral thesis, University of Cape Town, 2018. http://hdl.handle.net/11427/28157.

Full text
Abstract:
This thesis considers the case of mobile crime-reporting systems that have emerged as an effective and efficient data collection method in low- and middle-income countries. Analyzing the data can be helpful in addressing crime. Since law enforcement agencies in resource-constrained contexts typically do not have the expertise to handle these tasks, a cost-effective strategy is to outsource the data analytics tasks to third-party service providers. However, because of the sensitivity of the data, it is expedient to consider the issue of privacy. More specifically, this thesis considers the issue of finding computationally low-intensive solutions to protecting the data even from an "honest-but-curious" service provider, while at the same time generating datasets that can be queried efficiently and reliably. This thesis offers a three-pronged solution. The first step is the creation of a mobile application to facilitate crime reporting in a usable, secure and privacy-preserving manner. The second step proposes a streaming data anonymization algorithm, which analyses reported data based on occurrence rate rather than at a preset time on a static repository. The third step considers the use of privacy preferences in creating anonymized datasets. By taking user preferences into account, the efficiency of the anonymization process is improved, which is beneficial for fast data anonymization. Results from the prototype implementation and usability tests indicate that having a usable and covert crime-reporting application encourages users to report crime occurrences. Anonymizing streaming data contributes to faster crime resolution times, and user privacy preferences are helpful in relaxing privacy constraints, which makes for more usable data from the querying perspective.
This research presents considerable evidence that a three-pronged solution to the issue of anonymity during crime reporting in a resource-constrained environment is promising. This solution can further assist law enforcement agencies in partnering with third parties to derive useful crime-pattern knowledge without infringing on users' privacy. In the future, this research can be extended to other low- and middle-income countries.
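The occurrence-rate-driven release idea behind the streaming step can be sketched as a buffer that emits a group as soon as k reports share a quasi-identifier value, rather than on a fixed timer. This is a simplified illustration, not the thesis's algorithm; the area labels and crime records are invented.

```python
from collections import defaultdict

class StreamAnonymizer:
    """Buffer streamed reports; release a group as soon as k records share
    a quasi-identifier value, instead of waiting for a preset time window."""

    def __init__(self, k):
        self.k = k
        self.buffer = defaultdict(list)

    def add(self, quasi_id, record):
        """Add one report; return a released group once k accumulate, else None."""
        self.buffer[quasi_id].append(record)
        if len(self.buffer[quasi_id]) >= self.k:
            group = self.buffer.pop(quasi_id)
            return [{**r, "area": quasi_id} for r in group]  # released together
        return None  # still buffered

anon = StreamAnonymizer(k=3)
released = []
for area, crime in [("north", "theft"), ("south", "assault"),
                    ("north", "burglary"), ("north", "theft"),
                    ("south", "theft")]:
    out = anon.add(area, {"crime": crime})
    if out:
        released.extend(out)
print(len(released))  # 3: the three 'north' reports are released as one group
```

Because release is triggered by occurrence rate, frequently reported areas are published quickly while sparse areas wait until they are safely groupable, which is the behavior the abstract credits with faster crime resolution.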
10

Ji, Shouling. "Evaluating the security of anonymized big graph/structural data." Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/54913.

Full text
Abstract:
We studied the security of anonymized big graph data. Our main contributions include: new De-Anonymization (DA) attacks; comprehensive anonymity, utility, and de-anonymizability quantifications; and a secure graph data publishing/sharing system, SecGraph. New DA Attacks. We present two novel graph DA frameworks: cold start single-phase Optimization-based DA (ODA) and De-anonymizing Social-Attribute Graphs (De-SAG). Unlike existing seed-based DA attacks, ODA does not require prior knowledge. In addition, ODA's DA results can facilitate existing DA attacks by providing more seed information. De-SAG is the first attack that takes into account both graph structure and attribute information. Through extensive evaluations leveraging real world graph data, we validated the performance of both ODA and De-SAG. Graph Anonymity, Utility, and De-anonymizability Quantifications. We developed new techniques that enable comprehensive graph data anonymity, utility, and de-anonymizability evaluation. First, we proposed the first seed-free graph de-anonymizability quantification framework under a general data model which provides the theoretical foundation for seed-free DA attacks. Second, we conducted the first seed-based quantification on the perfect and partial de-anonymizability of graph data. Our quantification closes the gap between seed-based DA practice and theory. Third, we conducted the first attribute-based anonymity analysis for Social-Attribute Graph (SAG) data. Our attribute-based anonymity analysis together with existing structure-based de-anonymizability quantifications provide data owners and researchers a more complete understanding of the privacy of graph data. Fourth, we conducted the first graph Anonymity-Utility-De-anonymizability (AUD) correlation quantification and provided closed forms to explicitly demonstrate such correlation.
Finally, based on our quantifications, we conducted large-scale evaluations leveraging 100+ real world graph datasets generated by various computer systems and services. Using the evaluations, we demonstrated the datasets’ anonymity, utility, and de-anonymizability, as well as the significance and validity of our quantifications. SecGraph. We designed, implemented, and evaluated the first uniform and open-source Secure Graph data publishing/sharing (SecGraph) system. SecGraph enables data owners and researchers to conduct accurate comparative studies of anonymization/DA techniques, and to comprehensively understand the resistance/vulnerability of existing or newly developed anonymization techniques, the effectiveness of existing or newly developed DA attacks, and graph and application utilities of anonymized data.
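The flavor of seed-free structural de-anonymization can be conveyed with a deliberately naive degree-matching sketch: nodes whose degree is unique in both the anonymized and the auxiliary graph are linked purely from structure. Real attacks such as ODA optimize over far richer structural features; the graphs below are invented for illustration.

```python
from collections import defaultdict

def degree_match(edges_anon, edges_aux):
    """Link anonymized nodes to known-identity nodes by unique degree.

    A naive, structure-only heuristic: compute node degrees in both
    graphs and match the nodes whose degree is unambiguous (appears
    exactly once) on each side.
    """
    def degrees(edges):
        d = defaultdict(int)
        for u, v in edges:
            d[u] += 1
            d[v] += 1
        return d

    da, dx = degrees(edges_anon), degrees(edges_aux)
    by_deg_a, by_deg_x = defaultdict(list), defaultdict(list)
    for n, k in da.items():
        by_deg_a[k].append(n)
    for n, k in dx.items():
        by_deg_x[k].append(n)
    # link only unambiguous (unique-degree) nodes on both sides
    return {by_deg_a[k][0]: by_deg_x[k][0]
            for k in by_deg_a
            if len(by_deg_a[k]) == 1 and len(by_deg_x.get(k, [])) == 1}

# same topology, pseudonymized node labels
anon = [(1, 2), (1, 3), (1, 4), (2, 3)]
aux = [("alice", "bob"), ("alice", "carol"), ("alice", "dave"), ("bob", "carol")]
print(degree_match(anon, aux))  # {1: 'alice', 4: 'dave'}
```

Even this crude heuristic re-identifies the hub and the pendant node, which is why simply pseudonymizing node labels is not a sound anonymization of graph data.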
11

Sehatkar, Morvarid. "Towards a Privacy Preserving Framework for Publishing Longitudinal Data." Thesis, Université d'Ottawa / University of Ottawa, 2014. http://hdl.handle.net/10393/31629.

Full text
Abstract:
Recent advances in information technology have enabled public organizations and corporations to collect and store huge amounts of individuals' data in data repositories. Such data are powerful sources of information about an individual's life such as interests, activities, and finances. Corporations can employ data mining and knowledge discovery techniques to extract useful knowledge and interesting patterns from large repositories of individuals' data. The extracted knowledge can be exploited to improve strategic decision making, enhance business performance, and improve services. However, person-specific data often contain sensitive information about individuals and publishing such data poses potential privacy risks. To deal with these privacy issues, data must be anonymized so that no sensitive information about individuals can be disclosed from published data while distortion is minimized to ensure usefulness of data in practice. In this thesis, we address privacy concerns in publishing longitudinal data. A data set is longitudinal if it contains information of the same observation or event about individuals collected at several points in time. For instance, the data set of multiple visits of patients of a hospital over a period of time is longitudinal. Due to temporal correlations among the events of each record, potential background knowledge of adversaries about an individual in the context of longitudinal data has specific characteristics. None of the previous anonymization techniques can effectively protect longitudinal data against an adversary with such knowledge. In this thesis we identify the potential privacy threats on longitudinal data and propose a novel framework of anonymization algorithms in a way that protects individuals' privacy against both identity disclosure and attribute disclosure, and preserves data utility. 
Particularly, we propose two privacy models: (K,C)^P-privacy and (K,C)-privacy, and for each of these models we propose efficient algorithms for anonymizing longitudinal data. An extensive experimental study demonstrates that our proposed framework can effectively and efficiently anonymize longitudinal data.
APA, Harvard, Vancouver, ISO, and other styles
12

Gidofalvi, Gyözö. "Spatio-Temporal Data Mining for Location-Based Services." Doctoral thesis, Geomatic ApS - Center for Geoinformatics, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-86310.

Full text
Abstract:
Largely driven by advances in communication and information technology, such as the increasing availability and accuracy of GPS technology and the miniaturization of wireless communication devices, Location–Based Services (LBS) are continuously gaining popularity. Innovative LBSes integrate knowledge about the users into the service. Such knowledge can be derived by analyzing the location data of users. Such data contain two unique dimensions, space and time, which need to be analyzed. The objectives of this thesis are three–fold. First, to extend popular data mining methods to the spatio–temporal domain. Second, to demonstrate the usefulness of the extended methods and the derived knowledge in two promising LBS examples. Finally, to eliminate privacy concerns in connection with spatio–temporal data mining by devising systems for privacy–preserving location data collection and mining.   To this extent, Chapter 2 presents a general methodology, pivoting, to extend a popular data mining method, namely rule mining, to the spatio–temporal domain. By considering the characteristics of a number of real–world data sources, Chapter 2 also derives a taxonomy of spatio–temporal data, and demonstrates the usefulness of the rules that the extended spatio–temporal rule mining method can discover. In Chapter 4 the proposed spatio–temporal extension is applied to find long, sharable patterns in trajectories of moving objects. Empirical evaluations show that the extended method and its variants, using high–level SQL implementations, are effective tools for analyzing trajectories of moving objects. Real–world trajectory data about a large population of objects moving over extended periods within a limited geographical space is difficult to obtain. To aid the development in spatio–temporal data management and data mining, Chapter 3 develops a Spatio–Temporal ACTivity Simulator (ST–ACTS). 
ST–ACTS uses a number of real–world geo–statistical data sources and intuitive principles to effectively generate realistic spatio–temporal activities of mobile users.   Chapter 5 proposes an LBS in the transportation domain, namely cab–sharing. To deliver an effective service, a unique spatio–temporal grouping algorithm is presented and implemented as a sequence of SQL statements. Chapter 6 identifies a scalability bottleneck in the grouping algorithm. To eliminate the bottleneck, the chapter expresses the grouping algorithm as a continuous stream query in a data stream management system, and then devises simple but effective spatio–temporal partitioning methods for streams to parallelize the computation. Experimental results show that parallelization through adaptive partitioning methods leads to speed–ups of orders of magnitude without significantly affecting the quality of the grouping. Spatio–temporal stream partitioning is expected to be an effective method to scale computation–intensive spatial queries and spatial analysis methods for streams.   Location–Based Advertising (LBA), the delivery of relevant commercial information to mobile consumers, is considered to be one of the most promising business opportunities amongst LBSes. To this end, Chapter 7 describes an LBA framework and an LBA database that can be used for the management of mobile ads. Using a simulated but realistic mobile consumer population and a set of mobile ads, the LBA database is used to estimate the capacity of the mobile advertising channel. The estimates show that the channel capacity is extremely large, which is evidence for a strong business case, but it also necessitates adequate user controls.   When data about users is collected and analyzed, privacy naturally becomes a concern.
To eliminate the concerns, Chapter 8 first presents a grid–based framework in which location data is anonymized through spatio–temporal generalization, and then proposes a system for collecting and mining anonymous location data. Experimental results show that the privacy–preserving data mining component discovers patterns that, while probabilistic, are accurate enough to be useful for many LBSes.   To eliminate any uncertainty in the mining results, Chapter 9 proposes a system for collecting exact trajectories of moving objects in a privacy–preserving manner. In the proposed system there are no trusted components and anonymization is performed by the clients in a P2P network via data cloaking and data swapping. Realistic simulations show that under reasonable conditions and privacy/anonymity settings the proposed system is effective.
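The grid-based spatio-temporal generalization described in Chapter 8 can be illustrated in a few lines: an exact position is replaced by the grid cell and time window it falls into, so locations are only reported at a coarse granularity. The cell size and window length below are illustrative assumptions, not the thesis's parameters.

```python
def generalize(lat, lon, t, cell_deg=0.01, window_s=300):
    """Map an exact (lat, lon, timestamp) to a coarse (cell, window) id.

    cell_deg: grid cell size in degrees; window_s: time window in seconds.
    Both values are illustrative, not taken from the thesis.
    """
    cell = (int(lat // cell_deg), int(lon // cell_deg))
    window = int(t // window_s)
    return cell, window

# Two nearby readings taken seconds apart fall into the same anonymous cell,
# so the collector never sees the exact positions.
a = generalize(57.7089, 11.9746, 1000)
b = generalize(57.7091, 11.9749, 1100)
print(a == b)  # True: both map to the same spatio-temporal cell
```

The coarser the grid, the stronger the anonymity and the more probabilistic the mined patterns become, which matches the trade-off the abstract describes.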
APA, Harvard, Vancouver, ISO, and other styles
13

Raza, Ali. "Test Data Extraction and Comparison with Test Data Generation." DigitalCommons@USU, 2011. https://digitalcommons.usu.edu/etd/982.

Full text
Abstract:
Testing an integrated information system that relies on data from multiple sources can be a challenge, particularly when the data is confidential. This thesis describes a novel test data extraction approach, called semantic-based test data extraction for integrated systems (iSTDE) that solves many of the problems associated with creating realistic test data for integrated information systems containing confidential data. iSTDE reads a consistent cross-section of data from the production databases, manipulates that data to obscure individual identities while still preserving overall semantic data characteristics that are critical to thorough system testing, and then moves that test data to an external test environment. This thesis also presents a theoretical study that compares test-data extraction with a competing technique, named test-data generation. Specifically, this thesis a) describes a comparison method that includes a comprehensive list of characteristics essential for testing the database applications organized into seven different areas, b) presents an analysis of the relative strengths and weaknesses of the different test-data creation techniques, and c) reports a number of specific conclusions that will help testers make appropriate choices.
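One core requirement iSTDE addresses, obscuring identities while preserving the semantic relationships needed for testing, can be illustrated with consistent pseudonymization: the same production key always maps to the same fake key, so joins across extracted tables keep working. The helper below is an illustration of that idea, not iSTDE's actual algorithm.

```python
import hashlib

def pseudonym(value, secret="test-env-key"):
    """Deterministic pseudonym: the same production value always maps to
    the same fake ID, so referential integrity across tables survives.
    The secret and ID format are illustrative assumptions."""
    return "ID-" + hashlib.sha256((secret + value).encode()).hexdigest()[:8]

# Toy cross-section of two production tables sharing a confidential key.
patients = [("SSN-123", "diabetes"), ("SSN-456", "asthma")]
visits = [("SSN-123", "2011-03-01"), ("SSN-123", "2011-04-09")]

anon_patients = [(pseudonym(s), d) for s, d in patients]
anon_visits = [(pseudonym(s), d) for s, d in visits]

# Referential integrity preserved: visit rows still join to patient rows.
print(anon_visits[0][0] == anon_patients[0][0])  # True
```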
APA, Harvard, Vancouver, ISO, and other styles
14

Brunet, Solenn. "Conception de mécanismes d'accréditations anonymes et d'anonymisation de données." Thesis, Rennes 1, 2017. http://www.theses.fr/2017REN1S130/document.

Full text
Abstract:
The emergence of personal mobile devices, with communication and positioning features, is leading to new use cases and personalized services. However, they imply a significant collection of personal data and therefore require appropriate security solutions. Indeed, users are not always aware of the personal and sensitive information that can be inferred from their use. The main objective of this thesis is to show how cryptographic mechanisms and data anonymization techniques can reconcile privacy, security requirements and the utility of the service provided. In the first part, we study keyed-verification anonymous credentials, which guarantee the anonymity of users with respect to a given service provider: a user proves that she is granted access to its services without revealing any additional information. We introduce new such primitives that offer different properties and are of independent interest. We use these constructions to design three privacy-preserving systems: a keyed-verification anonymous credentials system, a coercion-resistant electronic voting scheme and an electronic payment system. Each of these solutions is practical and proven secure; for two of these contributions, implementations on SIM cards have been carried out. Nevertheless, some kinds of services still require using or storing personal data for compliance with a legal obligation or for the provision of the service. In the second part, we study how to preserve users' privacy in such services. To this end, we propose an anonymization process for mobility traces based on differential privacy. It allows us to provide anonymous databases while limiting the added noise. Such databases can then be exploited for scientific, economic or societal purposes, for instance.
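The differential privacy underlying the thesis's mobility-trace anonymization is commonly realized with the Laplace mechanism; a generic textbook sketch for a counting query (not the thesis's actual algorithm) looks like this:

```python
import math
import random

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with epsilon-differential privacy using the
    Laplace mechanism: add Laplace noise of scale sensitivity/epsilon,
    drawn here via inverse transform sampling."""
    u = random.random() - 0.5  # uniform in [-0.5, 0.5)
    noise = -(sensitivity / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

random.seed(0)
# e.g. the number of visits recorded in one location cell; a larger
# epsilon means less noise and weaker privacy.
print(round(dp_count(120, epsilon=1.0)))  # 121 with this seed
```

Limiting the added noise while keeping the privacy guarantee, as the abstract emphasizes, amounts to choosing the noise scale (here `sensitivity / epsilon`) as small as the privacy budget allows.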
APA, Harvard, Vancouver, ISO, and other styles
15

Brown, Emily Elizabeth. "Adaptable Privacy-preserving Model." Diss., NSUWorks, 2019. https://nsuworks.nova.edu/gscis_etd/1069.

Full text
Abstract:
Current data privacy-preservation models lack the ability to aid data decision makers in processing datasets for publication. The proposed algorithm allows data processors to simply provide a dataset and state their criteria to recommend an xk-anonymity approach. Additionally, the algorithm can be tailored to a preference and gives the precision range and maximum data loss associated with the recommended approach. This dissertation report outlined the research’s goal, what barriers were overcome, and the limitations of the work’s scope. It highlighted the results from each experiment conducted and how it influenced the creation of the end adaptable algorithm. The xk-anonymity model built upon two foundational privacy models, the k-anonymity and l-diversity models. Overall, this study had many takeaways on data and its power in a dataset.
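For context, the two foundational models can be checked mechanically on a toy table: k-anonymity requires every quasi-identifier group to contain at least k records, and l-diversity additionally requires at least l distinct sensitive values per group. The attribute names and values below are illustrative, not taken from the dissertation.

```python
from collections import defaultdict

def check_k_l(rows, quasi_ids, sensitive, k, l):
    """True iff every quasi-identifier group has >= k rows (k-anonymity)
    and >= l distinct sensitive values (l-diversity)."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in quasi_ids)
        groups[key].append(row[sensitive])
    return all(len(g) >= k and len(set(g)) >= l for g in groups.values())

# A generalized table: ages bucketed, ZIP codes partially suppressed.
rows = [
    {"age": "30-39", "zip": "476**", "disease": "flu"},
    {"age": "30-39", "zip": "476**", "disease": "cancer"},
    {"age": "30-39", "zip": "476**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "asthma"},
    {"age": "20-29", "zip": "130**", "disease": "flu"},
]
print(check_k_l(rows, ["age", "zip"], "disease", k=2, l=2))  # True
```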
APA, Harvard, Vancouver, ISO, and other styles
16

Paulson, Jörgen. "The Effect of 5-anonymity on a classifier based on neural network that is applied to the adult dataset." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-17918.

Full text
Abstract:
Privacy issues relating to having data made public is relevant with the introduction of the GDPR. To limit problems related to data becoming public, intentionally or via an event such as a security breach, anonymization of datasets can be employed. In this report, the impact of the application of 5-anonymity to the adult dataset on a classifier based on a neural network predicting whether people had an income exceeding $50,000 was investigated using precision, recall and accuracy. The classifier was trained using the non-anonymized data, the anonymized data, and the non-anonymized data with those attributes which were suppressed in the anonymized data removed. The result was that average accuracy dropped from 0.82 to 0.76, precision from 0.58 to 0.50, and recall increased from 0.82 to 0.87. The average values and distributions seem to support the estimation that the majority of the performance impact of anonymization in this case comes from the suppression of attributes.
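The three measures reported above follow the standard confusion-matrix definitions; as a reference, with made-up counts chosen only to illustrate the arithmetic (not the study's actual confusion matrix):

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Hypothetical counts for the >$50,000 class on an anonymized test set.
p, r, a = metrics(tp=50, fp=36, fn=11, tn=100)
print(round(p, 2), round(r, 2), round(a, 2))  # 0.58 0.82 0.76
```

Note how suppressing attributes can simultaneously raise recall and lower precision: a coarser feature space pushes the classifier toward predicting the positive class more often.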
APA, Harvard, Vancouver, ISO, and other styles
17

Affonso, Elaine Parra [UNESP]. "A insciência do usuário na fase de coleta de dados: privacidade em foco." Universidade Estadual Paulista (UNESP), 2018. http://hdl.handle.net/11449/154737.

Full text
Abstract:
Data collection has become a predominant activity across digital media, in which computer networks, especially the Internet, are essential for this phase. In order to minimize the complexity involved in the use of applications and communication media, the relationship between user and technology has been supported by ever friendlier interfaces, which often lets data collection occur imperceptibly. This leaves the user unaware of the collection performed by data holders, a situation that may harm the right to privacy of the user and of referenced individuals. To give users awareness of data collection, digital environments publish privacy policies with information on this phase, seeking compliance with the laws and regulations that protect personal data, widely represented in the academic literature through models and techniques for anonymization. This lack of awareness of data collection shapes how individuals perceive threats to their privacy and what actions they should take to extend the protection of their data, and it is also fostered by the shortage of actions and research across several areas of knowledge. In view of the above, the objective of this thesis is to characterize the context that favors the user's lack of awareness as the target of data collection phases in digital environments, considering privacy implications. To that end, exploratory-descriptive research with a qualitative approach was adopted, using methodological triangulation across a theoretical framework covering anonymization in the data collection phase, the legislation that protects personal data, and the data collection performed by technologies.
The results show that, regarding research on data protection by anonymization, there is a shortage of work on the data collection phase, since most research has concentrated on measures for sharing anonymized data; when anonymization is performed at collection time, the emphasis has been on location data. When addressing elements involved in the collection phase, legislation often presents these concepts in a generalized way, especially regarding consent to collection; even the collection activity itself appears in most laws only under the term "processing" (tratamento). Most laws have no specific provision for data collection, a factor that can deepen users' unawareness of the collection of their data. Technical terms such as anonymization, cookies and traffic data appear in the laws only sparsely and are often not specifically linked to the collection phase. Quasi-identifier data stand out in the data collected by digital environments, a scenario that can further extend threats to privacy through the possibility of correlating such data and thereby building profiles of individuals. The opacity promoted by abstraction in data collection by technological devices goes beyond the user's unawareness, causing incalculable threats to privacy and undoubtedly widening the informational asymmetry between data holders and users. It is concluded that users' unawareness of their interaction with digital environments can reduce their autonomy to control their data and accentuate privacy breaches. However, privacy in data collection is strengthened when users are aware of the actions linked to their data, which should be determined by privacy policies, laws and academic research: three elements highlighted in this work as contributing to the scenario that produces the user's lack of awareness.
APA, Harvard, Vancouver, ISO, and other styles
18

Heinrich, Jan-Philipp, Carsten Neise, and Andreas Müller. "Ähnlichkeitsmessung von ausgewählten Datentypen in Datenbanksystemen zur Berechnung des Grades der Anonymisierung." Universitätsbibliothek Chemnitz, 2018. http://nbn-resolving.de/urn:nbn:de:bsz:ch1-qucosa-233422.

Full text
Abstract:
A mathematical model for computing deviations between different data types in relational database systems is introduced and tested. The basis of this model is similarity measures for different data types. We first examine the data types relevant to this work, and then define, for those data types, an algebra that forms the foundation for computing the degree of anonymization θ. The model is to be applied to measure the degree of anonymization, above all of personal data, between test and production data. This measurement is useful in the course of the introduction of the EU General Data Protection Regulation (EU-DSGVO) in May 2018 and is intended to help identify personal data with a high degree of similarity.
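As a rough analogy for a degree of anonymization θ between production and test values, one can measure dissimilarity with the standard library's sequence matcher; the work defines its own per-datatype algebra, so this is only an illustration, not its model.

```python
from difflib import SequenceMatcher

def anonymization_degree(original, anonymized):
    """Theta in [0, 1]: 0.0 means the anonymized value is identical to
    the production value, 1.0 means it shares nothing with it.
    Illustrative only; the thesis defines per-datatype measures."""
    return 1.0 - SequenceMatcher(None, original, anonymized).ratio()

print(anonymization_degree("Erika Mustermann", "Erika Mustermann"))  # 0.0
print(anonymization_degree("Erika Mustermann", "E. M."))  # larger: less similar
```

A threshold on θ could then flag test records that remained too similar to the personal data they were derived from.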
APA, Harvard, Vancouver, ISO, and other styles
19

Nilsson, Mattias, and Sebastian Olsson. "Bluetooth-enheter i offentliga rummet och anonymisering av data." Thesis, Malmö högskola, Fakulteten för teknik och samhälle (TS), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20025.

Full text
Abstract:
The Internet of Things (IoT) provides great opportunities to collect data for different purposes, such as estimating the number of people present in order to control the heat in a room. Furthermore, IoT systems can automate tasks that help us humans. This study examines what type of data is worth gathering in order to estimate the number of people in a public place, and how sensitive data can be anonymized when gathered. To do this, Bluetooth devices were chosen to investigate how well MAC addresses work for estimating the number of people. To collect MAC addresses, a proof-of-concept system was developed in which an Android application collected MAC addresses and anonymized them before storing them in a database. The application anonymizes the unique MAC address at three levels of security. Field studies were conducted in which the number of people was first counted visually and anonymized collections of MAC addresses were then made. The conclusion was that Bluetooth is difficult to use for estimating the number of people, because not everyone has Bluetooth switched on. The application developed demonstrates that data can be collected safely, without violating privacy.
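Anonymizing a MAC address at several levels of strictness can be sketched as follows; the three levels shown are illustrative assumptions, since the abstract does not specify the application's exact scheme:

```python
import hashlib

def anonymize_mac(mac, level, salt="study-secret"):
    """Three illustrative anonymization levels for a MAC address:
    1: salted hash, stable per device for the whole study (counting works)
    2: keep the vendor prefix (OUI), hash the device-specific half
    3: drop the address entirely, record only that a device was seen
    """
    if level == 1:
        return hashlib.sha256((salt + mac).encode()).hexdigest()[:12]
    if level == 2:
        oui, dev = mac[:8], mac[9:]
        return oui + ":" + hashlib.sha256((salt + dev).encode()).hexdigest()[:6]
    return "device-seen"

mac = "a4:5e:60:1b:2c:3d"
print(anonymize_mac(mac, 2))  # vendor prefix kept, device identity hidden
```

Level 1 still allows distinct devices to be counted per collection round, while level 3 trades all linkability for maximal privacy, mirroring the utility/privacy trade-off the study discusses.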
APA, Harvard, Vancouver, ISO, and other styles
20

Clause, James Alexander. "Enabling and supporting the debugging of software failures." Diss., Georgia Institute of Technology, 2011. http://hdl.handle.net/1853/39514.

Full text
Abstract:
This dissertation evaluates the following thesis statement: Program analysis techniques can enable and support the debugging of failures in widely-used applications by (1) capturing, replaying, and, as much as possible, anonymizing failing executions and (2) highlighting subsets of failure-inducing inputs that are likely to be helpful for debugging such failures. To investigate this thesis, I developed techniques for recording, minimizing, and replaying executions captured from users' machines, anonymizing execution recordings, and automatically identifying failure-relevant inputs. I then performed experiments to evaluate the techniques in realistic scenarios using real applications and real failures. The results of these experiments demonstrate that the techniques can reduce the cost and difficulty of debugging.
APA, Harvard, Vancouver, ISO, and other styles
21

Nuñez, del Prado Cortez Miguel. "Inference attacks on geolocated data." Thesis, Toulouse, INSA, 2013. http://www.theses.fr/2013ISAT0028/document.

Full text
Abstract:
In recent years, we have observed the development of connected and nomadic devices such as smartphones, tablets and even laptops that allow individuals to use location-based services (LBSs), which personalize the service they offer according to the user's position, on a daily basis. Nonetheless, LBSs raise serious privacy issues, which are often not perceived by end users. In this thesis, we are interested in understanding the privacy risks related to the dissemination and collection of location data. To address this issue, we developed inference attacks such as the extraction of points of interest (POIs) and their semantics, the prediction of the next location, and the de-anonymization of mobility traces, based on a mobility model that we have coined the mobility Markov chain. Afterwards, we propose a classification of inference attacks in the context of location data based on the objectives of the adversary. In addition, we evaluated the effectiveness of some sanitization measures in limiting the efficiency of inference attacks. Finally, we developed a generic platform called GEPETO (for GEoPrivacy-Enhancing TOolkit) that can be used to test the inference attacks developed.
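A first-order mobility Markov chain of the kind used here for next-location prediction can be sketched in a few lines (the POI labels and trace are illustrative):

```python
from collections import Counter, defaultdict

def fit_markov(trace):
    """First-order mobility Markov chain: count the transitions
    between points of interest (POIs) observed in a location trace."""
    trans = defaultdict(Counter)
    for cur, nxt in zip(trace, trace[1:]):
        trans[cur][nxt] += 1
    return trans

def predict_next(trans, current):
    """Predict the most frequent successor of the current POI
    (assumes the POI was seen during fitting)."""
    return trans[current].most_common(1)[0][0]

trace = ["home", "work", "gym", "home", "work", "home", "work", "gym"]
model = fit_markov(trace)
print(predict_next(model, "work"))  # gym: the most frequent move after work
```

The same transition counts that enable prediction also enable de-anonymization: a trace's transition distribution acts as a behavioral fingerprint that can be matched against known users.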
APA, Harvard, Vancouver, ISO, and other styles
22

Delanaux, Rémy. "Intégration de données liées respectueuse de la confidentialité." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE1303.

Full text
Abstract:
Individual privacy is a major and largely unexplored concern when publishing new datasets in the context of Linked Open Data (LOD). The LOD cloud forms a network of interconnected and publicly accessible datasets in the form of graph databases modeled using the RDF format and queried using the SPARQL language. This heavily standardized context is nowadays extensively used by academics, public institutions and some private organizations to make their data available. Yet, some industrial and private actors may be discouraged by potential privacy issues. To this end, we introduce and develop a declarative framework for privacy-preserving Linked Data publishing in which privacy and utility constraints are specified as policies, that is, sets of SPARQL queries. Our approach is data-independent and only inspects the privacy and utility policies in order to determine the sequence of anonymization operations applicable to any graph instance for satisfying the policies. We prove the soundness of our algorithms and gauge their performance through experimental analysis. Another aspect to take into account is that a new dataset published to the LOD cloud is inherently exposed to privacy breaches due to possible linkage to objects already existing in other LOD datasets. In the second part of this thesis, we thus focus on the problem of building safe anonymizations of an RDF graph to guarantee that linking the anonymized graph with any external RDF graph will not cause privacy breaches. Given a set of privacy queries as input, we study the data-independent safety problem and the sequence of anonymization operations necessary to enforce it. We provide sufficient conditions under which an anonymization instance is safe given a set of privacy queries. Additionally, we show that our algorithms are robust in the presence of sameAs links that can be explicit or inferred by additional knowledge.
To conclude, we evaluate the impact of this safety-preserving solution on given input graphs through experiments. We focus on the performance and the utility loss of this anonymization framework on both real-world and artificial data. We first discuss and select utility measures to compare the original graph to its anonymized counterpart, then define a method to generate new privacy policies from a reference one by inserting incremental modifications. We study the behavior of the framework on four carefully selected RDF graphs. We show that our anonymization technique is effective with reasonable runtime on quite large graphs (several million triples) and is gradual: the more specific the privacy policy, the smaller its impact. Finally, using structural graph-based metrics, we show that our algorithms are not very destructive even when privacy policies cover a large part of the graph. By designing a simple and efficient way to ensure privacy and utility in plausible usages of RDF graphs, this new approach suggests many extensions and, in the long run, more work on privacy-preserving data publishing in the context of Linked Open Data.
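The policy-driven approach described above can be illustrated with a minimal, hypothetical sketch: triples are plain tuples, a privacy query is a triple pattern, and the anonymization operation suppresses matched sensitive values by replacing them with blank nodes. This only conveys the suppression idea, not the thesis's actual algorithms; all names and data are invented.

```python
# Triples are plain (subject, predicate, object) tuples; a privacy
# query is a triple pattern in which None plays the role of a variable.
def matches(triple, pattern):
    return all(p is None or t == p for t, p in zip(triple, pattern))

def suppress(graph, pattern):
    """Anonymization operation: replace the object of every triple
    matched by the privacy pattern with a fresh blank node, hiding
    the sensitive value while keeping the graph's shape."""
    return [(s, p, f"_:b{i}") if matches((s, p, o), pattern) else (s, p, o)
            for i, (s, p, o) in enumerate(graph)]

graph = [
    (":alice", ":hasDisease", ":flu"),
    (":alice", ":worksAt", ":acme"),
    (":bob", ":hasDisease", ":cold"),
]
anon = suppress(graph, (None, ":hasDisease", None))  # policy: hide all diseases
```

A real implementation would operate on SPARQL query bodies and RDF stores; the sketch only shows how a data-independent pattern drives the operation on any graph instance.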
APA, Harvard, Vancouver, ISO, and other styles
23

Eisoldt, Martin, Carsten Neise, and Andreas Müller. "Analyse verschiedener Distanzmetriken zur Messung des Anonymisierungsgrades theta." Technische Universität Chemnitz, 2019. https://monarch.qucosa.de/id/qucosa%3A34715.

Full text
Abstract:
The existing concept for evaluating the anonymization of test data is examined further in this thesis, showing its advantages and disadvantages compared to existing distance metrics. Furthermore, the influence of parameter changes on the results is investigated.
APA, Harvard, Vancouver, ISO, and other styles
24

Trujillo, Rasúa Rolando. "Privacy in rfid and mobile objects." Doctoral thesis, Universitat Rovira i Virgili, 2012. http://hdl.handle.net/10803/86942.

Full text
Abstract:
Radio Frequency Identification (RFID) is a technology aimed at efficiently identifying and tracking goods and assets. Such identification may be performed without requiring line-of-sight alignment or physical contact between the RFID tag and the RFID reader, whilst tracking is naturally achieved due to the short interrogation field of RFID readers. That is why the reduction in price of RFID tags has been accompanied by increasing attention paid to this technology. However, since tags are resource-constrained devices sending identification data wirelessly, designing secure and private RFID identification protocols is a challenging task. This scenario is even more complex when scalability must be met by those protocols. Assuming the existence of a lightweight, secure, private and scalable RFID identification protocol, there exist other concerns surrounding RFID technology. Some of them arise from the technology itself, such as distance checking, but others are related to the potential of RFID systems to gather huge amounts of tracking data. Publishing and mining such moving-object data is essential to improve the efficiency of supervisory control, asset management and localisation, transportation, etc. However, obvious privacy threats arise if an individual can be linked with some of those published trajectories. The present dissertation contributes to the design of algorithms and protocols aimed at dealing with the issues explained above. First, we propose a set of protocols and heuristics based on a distributed architecture that improve the efficiency of the identification process without compromising privacy or security. Moreover, we present a novel distance-bounding protocol based on graphs that is extremely low-resource consuming. Finally, we present two trajectory anonymisation methods aimed at preserving individuals' privacy when their trajectories are released.
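The trajectory anonymisation mentioned in the last sentence rests on k-anonymity achieved through microaggregation. A minimal sketch of that general idea, assuming equal-length trajectories, a simple point-wise Euclidean distance (the dissertation defines its own trajectory distance), and a number of trajectories divisible by k:

```python
import math

def traj_dist(a, b):
    # Assumed distance: sum of point-wise Euclidean distances
    # between two equal-length trajectories.
    return sum(math.dist(p, q) for p, q in zip(a, b))

def microaggregate(trajs, k):
    """Greedy k-member microaggregation: cluster trajectories into
    groups of k and publish each cluster's centroid, so every released
    trajectory is shared by at least k individuals."""
    remaining = list(range(len(trajs)))
    published = [None] * len(trajs)
    while remaining:
        seed = remaining.pop(0)
        # group the seed with its k-1 nearest remaining trajectories
        remaining.sort(key=lambda i: traj_dist(trajs[seed], trajs[i]))
        cluster = [seed] + remaining[:k - 1]
        remaining = remaining[k - 1:]
        centroid = [
            (sum(trajs[i][t][0] for i in cluster) / len(cluster),
             sum(trajs[i][t][1] for i in cluster) / len(cluster))
            for t in range(len(trajs[seed]))
        ]
        for i in cluster:
            published[i] = centroid
    return published

trajs = [
    [(0, 0), (1, 0)],
    [(0, 1), (1, 1)],
    [(10, 0), (11, 0)],
    [(10, 1), (11, 1)],
]
anonymized = microaggregate(trajs, k=2)
```

After anonymization, each published trajectory is the centroid of its cluster, so at least k individuals are indistinguishable from one another.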
APA, Harvard, Vancouver, ISO, and other styles
25

Klimek, Martin. "Neuroinformatika a sdílení dat z lékařských zobrazovacích systémů." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2010. http://www.nusl.cz/ntk/nusl-218660.

Full text
Abstract:
The presented master's thesis deals with the issue of storing and sharing data from medical imaging systems. Among other things, it covers organizational and informatics aspects of handling medical imaging data in multicentric studies containing MRI brain images. The thesis also includes the technical design of a web-based application for image data sharing, including a web interface suitable for manipulating the image data stored in a database.
APA, Harvard, Vancouver, ISO, and other styles
26

Šejvlová, Ludmila. "Porovnání přístupů ke generování umělých dat." Master's thesis, Vysoká škola ekonomická v Praze, 2017. http://www.nusl.cz/ntk/nusl-358804.

Full text
Abstract:
The diploma thesis deals with synthetic data, selected approaches to their generation, and a practical data generation task. The goal of the thesis is to describe the selected approaches to data generation, capture their key advantages and disadvantages, and compare the individual approaches with each other. The practical part of the thesis describes the generation of synthetic data for teaching knowledge discovery in databases. The thesis includes a basic description of synthetic data and thoroughly explains the process of their generation. The approaches selected for further examination are random data generation, the statistical approach, data generation languages, and the ReverseMiner tool. The thesis also describes the practical usage of synthetic data and the suitability of each approach for certain purposes. Within this thesis, the educational dataset Hotel SD was created using the ReverseMiner tool. The data contain relations discoverable with SD (set-difference) GUHA procedures.
APA, Harvard, Vancouver, ISO, and other styles
27

Pàmies, Estrems David. "Contributions to Lifelogging Protection In Streaming Environments." Doctoral thesis, Universitat Rovira i Virgili, 2020. http://hdl.handle.net/10803/669809.

Full text
Abstract:
Every day, more than five billion people generate some kind of data over the Internet. As a tool for accessing that information, we need to use search services, either in the form of Web Search Engines or through Personal Assistants. On each interaction with them, our record of actions, via logs, is used to offer a more useful experience. For companies, logs are also very valuable since they offer a way to monetize the service. Monetization is achieved by selling data to third parties; however, query logs could potentially expose sensitive user information: identifiers, sensitive data about users (such as diseases, sexual tendencies, religious beliefs), or be used for what is called "life-logging": a continuous record of one's daily activities. Current regulations oblige companies to protect this personal information. Protection systems for closed data sets have previously been proposed, most of them working with atomic files or structured data. Unfortunately, those systems do not fit when used in the growing real-time unstructured data environment posed by Internet services. This thesis aims to design techniques to protect the user's sensitive information in a non-structured real-time streaming environment, guaranteeing a trade-off between data utility and protection. In this regard, three proposals have been made for efficient log protection. The first is a new method to anonymize query logs, based on probabilistic k-anonymity, together with some de-anonymization tools to determine possible data leaks. The second method improves on it with a configurable trade-off between privacy and usability, achieving a great improvement in terms of data utility. Our final contribution concerns Internet-based Personal Assistants. The information generated by these devices is likely to be considered life-logging, and it can increase the user's privacy risks. The proposal is a protection scheme that combines log anonymization and sanitizable signatures.
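The first contribution is query-log anonymization based on probabilistic k-anonymity. As a hedged illustration, the deterministic core of such a guarantee can be sketched as follows: a query is released only if at least k distinct users issued it, so any released query has at least k candidate sources (the thesis's actual method is probabilistic and more refined).

```python
from collections import defaultdict

def anonymize_log(log, k):
    """Release a query only if at least k distinct users issued it;
    user ids are dropped from the output entirely, so rare, highly
    identifying queries are suppressed."""
    users_per_query = defaultdict(set)
    for user, query in log:
        users_per_query[query].add(user)
    return [query for _, query in log if len(users_per_query[query]) >= k]

log = [
    ("u1", "weather tomorrow"),
    ("u2", "weather tomorrow"),
    ("u3", "rare disease X clinic"),   # unique to u3: suppressed
    ("u1", "weather tomorrow"),
]
released = anonymize_log(log, k=2)
```

The trade-off between privacy and usability mentioned in the abstract corresponds to tuning k: a higher k suppresses more of the log but leaves each released query with more candidate users.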
APA, Harvard, Vancouver, ISO, and other styles
28

Skřivánková, Barbora. "Anonymizace SPZ vozidel." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255442.

Full text
Abstract:
While browsing an online map server, continuous photographs of certain places can be browsed as well. When the map service takes pictures of a public space, some personal data are captured as well (e.g. faces, car licence plates). The goal of this thesis is the design of an automated car licence plate anonymization system, optimized for the Panorama service provided by the Seznam.cz a.s. corporation. The process of car licence plate anonymization is divided into two parts: the first solves the detection of cars and the second solves car licence plate localization in the selected image. Car detection is based on a deep neural network approach; car licence plate localization is solved using a fully connected neural network performing a regression task. The thesis aims to overcome the disadvantages of the commercial solution used nowadays, namely false positive results and high computational complexity. The results of this thesis are not as good as expected. The reason could be the dataset provided by the Seznam.cz a.s. corporation, which seemed robust enough in the beginning but in the end turned out to be insufficient for training the neural network.
APA, Harvard, Vancouver, ISO, and other styles
29

Dölle, Lukas. "Der Schutz der Privatsphäre bei der Anfragebearbeitung in Datenbanksystemen." Doctoral thesis, Humboldt-Universität zu Berlin, Mathematisch-Naturwissenschaftliche Fakultät, 2016. http://dx.doi.org/10.18452/17531.

Full text
Abstract:
Over the last ten years many techniques for privacy-preserving data publishing have been proposed. Most of them anonymize a complete data table such that sensitive values cannot clearly be assigned to individuals. Their privacy is considered to be adequately protected if an adversary cannot discover the actual value from a given set of at least k values. For this thesis we assume that users interact with a database by issuing a sequence of queries against one table. The system returns a sequence of results that contains sensitive values. The goal of this thesis is to check whether adversaries are able to uniquely link sensitive values to individuals despite anonymized result sets. So far, there exist algorithms to prevent deanonymization only for aggregate queries. Our novel approach prevents deanonymization for arbitrary queries. We show that our approach can be transformed into matching problems in special graphs. However, finding maximum matchings in these graphs is NP-complete. Therefore, we develop several approximation algorithms which compute specific matchings in polynomial time while still maintaining privacy.
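The threat this abstract addresses can be illustrated before any matching machinery: an adversary intersects, across the whole query sequence, the candidate sensitive values each individual could have received. A minimal sketch of that intersection attack (illustrative only; the thesis itself works with matchings in special graphs, where the exact problem is NP-complete):

```python
def candidate_values(results):
    """Each anonymized answer maps an individual to the set of sensitive
    values it could belong to; an adversary intersects these sets
    across the whole sequence of query results."""
    candidates = {}
    for result in results:
        for person, values in result.items():
            if person in candidates:
                candidates[person] &= set(values)
            else:
                candidates[person] = set(values)
    return candidates

def breached(results):
    """Privacy is breached for anyone left with exactly one candidate."""
    return {p for p, vals in candidate_values(results).items() if len(vals) == 1}

# Hypothetical anonymized answers to two queries (values are invented):
results = [
    {"alice": {"flu", "cold"}, "bob": {"flu", "cold"}},
    {"alice": {"flu", "hiv"}},
]
```

Here each single answer looks 2-anonymous, yet the combination pins alice's value down, which is exactly the cross-query leakage the thesis's algorithms are designed to prevent.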
APA, Harvard, Vancouver, ISO, and other styles
30

Coufal, Zdeněk. "Korelace dat na vstupu a výstupu sítě Tor." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-235412.

Full text
Abstract:
Communication in public networks based on the IP protocol is not really anonymous because it is possible to determine the source and destination IP address of each packet. Users who want to be anonymous are forced to use anonymization networks, such as Tor. In case such a user is the target of lawful interception, this presents a problem for those systems because they only see that the user communicated with the anonymization network and have a suspicion that the data stream at the output of the anonymization network belongs to the same user. The aim of this master's thesis was to design a correlation method to determine the dependence between the data streams at the input and the output of the Tor network. The proposed method analyzes network traffic and compares characteristics of data streams extracted from metadata, such as the time of occurrence and the size of packets. The method specializes in correlating data flows of the HTTP protocol, specifically web server responses. It was tested on real data from the Tor network and successfully recognized dependencies between data flows.
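The correlation of entry and exit flows by packet metadata can be sketched, for illustration, with a simple Pearson correlation over packet-size sequences (the thesis's method also uses timing and is more elaborate; all flow data below is invented):

```python
import math

def correlation(xs, ys):
    """Pearson correlation of two equal-length packet-size sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def best_match(entry_flow, exit_flows):
    """Pick the exit-side flow whose size profile correlates most
    strongly with the observed entry-side flow."""
    return max(exit_flows, key=lambda name: correlation(entry_flow, exit_flows[name]))

entry = [1500, 60, 1500, 1500, 60]          # sizes observed at the entry side
exit_flows = {
    "flow-a": [1400, 80, 1400, 1500, 70],   # same shape, slightly distorted
    "flow-b": [60, 1500, 60, 60, 1500],     # inverted shape
}
```

Even though Tor re-packetizes traffic, the coarse shape of a web server response tends to survive, which is what makes this kind of input/output correlation feasible.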
APA, Harvard, Vancouver, ISO, and other styles
31

Raad, Eliana. "Towards better privacy preservation by detecting personal events in photos shared within online social networks." Thesis, Dijon, 2015. http://www.theses.fr/2015DIJOS079/document.

Full text
Abstract:
Today, social networking has considerably changed the way people take pictures, all the time and everywhere they go. More than 500 million photos are uploaded and shared every day, along with more than 200 hours of video every minute. More particularly, with the ubiquity of smartphones, social network users are now taking photos of events in their lives, travels, experiences, etc. and instantly uploading them online. Such public data sharing puts the users' privacy at risk and exposes them to surveillance that is growing at a very rapid rate. Furthermore, new techniques are used today to extract publicly shared data and combine it with other data in ways never before thought possible. However, social network users do not realize the wealth of information gathered from image data, which could be used to track all their activities at every moment (e.g., the case of cyberstalking). Therefore, in many situations (such as politics, fraud fighting, cultural critique, etc.), it becomes extremely hard to maintain individuals' anonymity when the authors of the published data need to remain anonymous. Thus, the aim of this work is to provide a privacy-preserving constraint (de-linkability) to bound the amount of information that can be used to re-identify individuals using online profile information. Firstly, we provide a framework able to quantify the re-identification threat and sanitize multimedia documents to be published and shared. Secondly, we propose a new approach to enrich the profile information of the individuals to protect. To this end, we exploit personal events in the individuals' own posts as well as those shared by their friends/contacts. Specifically, our approach is able to detect and link users' elementary events using photos (and related metadata) shared within their online social networks. A prototype has been implemented and several experiments have been conducted to validate our different contributions.
APA, Harvard, Vancouver, ISO, and other styles
32

Chen, Yi-Jie, and 陳羿傑. "Privacy Information Protection through Data Anonymization." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/3bpp9z.

Full text
Abstract:
Master's thesis
Chung Yuan Christian University
Graduate Institute of Information Engineering
103
Mobile apps are the driving force behind the prevalence of intelligent mobile devices, which in turn fuel the exponentially growing number of mobile apps being developed. The personalized and ubiquitous characteristics of intelligent mobile devices, together with their variety of record-taking and data-sensing capabilities, become a serious threat to user privacy when linked with the communication ability of the devices. How to enjoy all these conveniences and services without privacy risk is an important issue for all users of mobile devices. The available privacy protection schemes and methods either require changes to the mobile device system framework and core, or demand complicated technical processes and skills. In this thesis, we propose a proxy-server-based approach to develop a solution practical for ordinary users. A prototype has been implemented to demonstrate the practicality and usability of the privacy protection mechanism.
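A proxy-based protection of this kind can be illustrated with a hypothetical sketch in which the proxy strips identifying fields from outgoing app requests before forwarding them; the header names below are invented for illustration and are not from the thesis.

```python
# Hypothetical identifying fields a privacy proxy might strip.
SENSITIVE_HEADERS = {"x-device-id", "x-location", "cookie"}

def sanitize_request(headers):
    """Drop request fields that could identify the device or user;
    the proxy forwards only what remains to the real service."""
    return {k: v for k, v in headers.items()
            if k.lower() not in SENSITIVE_HEADERS}

clean = sanitize_request({
    "Host": "api.example.com",
    "X-Device-Id": "abc-123",
    "Cookie": "sid=42",
})
```

The appeal of the proxy design is exactly what the abstract claims: the filtering happens outside the device, so no change to the mobile OS framework or core is required.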
APA, Harvard, Vancouver, ISO, and other styles
33

AbuSharkh, Hani. "Border-based Anonymization Method of Sharing Spatial-Temporal Data." Thesis, 2011. http://spectrum.library.concordia.ca/15131/1/AbuSharkh_MASc_F2011.pdf.

Full text
Abstract:
Many location-based software applications have been developed for mobile devices. Consequently, location-based service providers often have a detailed trajectory history of their service recipients. The collected spatial-temporal information of their service recipients can be invaluable to other organizations and companies in many ways; for example, it can be used for direct marketing, market analysis, and consumer behaviour analysis. Yet, releasing the spatial-temporal data together with other user-specific data in its raw format often leads to privacy threats to the service recipients. In this thesis, we study the problem of spatial-temporal data publishing with the consideration of preserving both privacy protection and information utility for data mining. The contributions are twofold. First, we propose a service-oriented architecture to determine an appropriate location-based service provider for a given data request. Second, we present a border-based data anonymization method to transform a raw spatial-temporal data table into an anonymous version that preserves both privacy and information utility. Experimental results suggest that our proposed method can efficiently and effectively preserve the information required for data mining.
APA, Harvard, Vancouver, ISO, and other styles
34

Lin, Wang Chih Jui, and 林王智瑞. "Multiple Release Anonymization for Time-Series Social Network Data." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/53612112549688181096.

Full text
Abstract:
Master's thesis
National Tsing Hua University
Department of Computer Science
100 (ROC academic year)
Nowadays, social networks have gained popularity around the world. Social networks hold a lot of data about interactions among entities, and these data may contain private individual content. Accordingly, many studies have been proposed to protect privacy in social networks. Previous works focus only on developing privacy-preserving methods for releasing a single anonymized graph used to represent a social network. However, a single anonymized graph may not be enough for analyzing the evolution of the whole network. Therefore, in this thesis we address the novel problem of preserving the privacy of interactions among entities for multiple releases of time-series social network data; that is, we release multiple anonymized graphs to represent a social network with time-series data. We point out that privacy may be revealed across the multiple releases if existing methods are applied to generate an individual anonymized graph for the network at each timestamp. We provide an anonymization method for releasing multiple anonymized graphs at one time for time-series social network data. We detail our experimental steps and evaluate the utility of the anonymized graphs by answering a series of aggregate queries. The results show that the multiple releases generated by our method answer the queries accurately.
APA, Harvard, Vancouver, ISO, and other styles
35

Gao, Tianchong. "Privacy Preserving in Online Social Network Data Sharing and Publication." Thesis, 2019. http://hdl.handle.net/1805/20972.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
Following the trend of online data sharing and publishing, researchers raise their concerns about the privacy problem. Online Social Networks (OSNs), for example, often contain sensitive information about individuals. Therefore, anonymizing network data before releasing it becomes an important issue. This dissertation studies the privacy preservation problem from the perspectives of both attackers and defenders. To defenders, preserving the private information while keeping the utility of the published OSN is essential in data anonymization. At one extreme, the final data equals the original one, which contains all the useful information but has no privacy protection. At the other extreme, the final data is random, which has the best privacy protection but is useless to third parties. Hence, the defenders aim to explore multiple potential methods to strike a desirable tradeoff between privacy and utility in the published data. This dissertation starts from the most fundamental problem, the definition of utility and privacy, and builds on the design of the privacy criterion, the graph abstraction model, the utility method, and the anonymization method to further address the balance between utility and privacy. To attackers, extracting meaningful information from the collected data is essential in data de-anonymization. De-anonymization mechanisms utilize the similarities between attackers’ prior knowledge and published data to catch the targets. This dissertation focuses on settings where the published data is periodic, anonymized, and does not cover the target persons. There are two thrusts in studying the de-anonymization attacks: the design of the seed mapping method and the innovation of the generating-based attack method. To conclude, this dissertation studies the online data privacy problem from both defenders’ and attackers’ points of view and introduces privacy and utility enhancement mechanisms from different novel angles.
APA, Harvard, Vancouver, ISO, and other styles
36

(7428566), Tianchong Gao. "Privacy Preserving in Online Social Network Data Sharing and Publication." Thesis, 2019.

Find full text
Abstract:

Following the trend of online data sharing and publishing, researchers raise their concerns about the privacy problem. Online Social Networks (OSNs), for example, often contain sensitive information about individuals. Therefore, anonymizing network data before releasing it becomes an important issue. This dissertation studies the privacy preservation problem from the perspectives of both attackers and defenders.


To defenders, preserving the private information while keeping the utility of the published OSN is essential in data anonymization. At one extreme, the final data equals the original one, which contains all the useful information but has no privacy protection. At the other extreme, the final data is random, which has the best privacy protection but is useless to third parties. Hence, the defenders aim to explore multiple potential methods to strike a desirable tradeoff between privacy and utility in the published data. This dissertation starts from the most fundamental problem, the definition of utility and privacy, and builds on the design of the privacy criterion, the graph abstraction model, the utility method, and the anonymization method to further address the balance between utility and privacy.


To attackers, extracting meaningful information from the collected data is essential in data de-anonymization. De-anonymization mechanisms utilize the similarities between attackers’ prior knowledge and published data to catch the targets. This dissertation focuses on settings where the published data is periodic, anonymized, and does not cover the target persons. There are two thrusts in studying the de-anonymization attacks: the design of the seed mapping method and the innovation of the generating-based attack method. To conclude, this dissertation studies the online data privacy problem from both defenders’ and attackers’ points of view and introduces privacy and utility enhancement mechanisms from different novel angles.

APA, Harvard, Vancouver, ISO, and other styles
37

Li, Yidong. "Preserving privacy in data publishing and analysis." Thesis, 2011. http://hdl.handle.net/2440/68556.

Full text
Abstract:
As data collection and storage techniques have been greatly improved, data analysis is becoming an increasingly important issue in many business and academic collaborations, enhancing their productivity and competitiveness. Multiple techniques for data analysis, such as data mining, business intelligence, statistical analysis and predictive analytics, have been developed in different science, commerce and social science domains. To ensure quality data analysis, effective information sharing between organizations becomes a vital requirement in today’s society. However, the shared data often contains person-specific and sensitive information like medical records. As more and more real-world datasets are released publicly, there is a growing concern about privacy breaches for the entities involved. To respond to this challenge, this thesis discusses the problem of eliminating privacy threats while, at the same time, preserving useful information in the released database for data analysis. The first part of this thesis discusses the problem of privacy preservation on relational data. Due to the inherent drawbacks of applying equi-depth data swapping in distance-based data analysis, we study efficient swapping algorithms based on equi-width partitioning for relational data publishing. We develop effective methods for both univariate and multivariate data swapping. With extensive theoretical analysis and experimental validation, we show that Equi-Width Swapping (EWS) can achieve performance in privacy preservation similar to that of Equi-Depth Swapping (EDS) if the number of partitions is sufficiently large (e.g. ≳ √n, where n is the size of the dataset). In addition, our analysis shows that the multivariate EWS algorithm has much lower computational complexity, O(n), than that of the multivariate EDS (which is essentially O(n³)), while it still provides good protection for sensitive information.
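The equi-width swapping idea summarized in this abstract can be illustrated with a minimal sketch (a toy illustration under our own assumptions, not the thesis's implementation): split the attribute range into equal-width intervals and permute values only within each interval, so every value stays inside its original partition.

```python
import random

def equi_width_swap(values, num_partitions, seed=None):
    """Toy equi-width swapping: partition the attribute range into
    equal-width intervals and shuffle values within each interval."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_partitions or 1  # guard against a constant column
    # Group record indices by the partition their value falls into
    buckets = {}
    for i, v in enumerate(values):
        b = min(int((v - lo) / width), num_partitions - 1)
        buckets.setdefault(b, []).append(i)
    # Permute values inside each partition only
    out = list(values)
    for idxs in buckets.values():
        vals = [values[i] for i in idxs]
        rng.shuffle(vals)
        for i, v in zip(idxs, vals):
            out[i] = v
    return out
```

Because a value can move only within its own interval, distance-based analyses are perturbed by at most one partition width, which is why a sufficiently large number of partitions preserves utility while still unlinking values from records.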
The second part of this thesis focuses on solving the problem of privacy preservation on graphs, which has increasing significance as more and more real-world graphs modelling complex systems such as social networks are released publicly. We point out that the real labels of a large portion of nodes can easily be re-identified with some weight-related attacks in a weighted graph, even when the graph is perturbed with weight-independent invariants like degree. Two concrete attacks have been identified based on the following elementary weight invariants: 1) volume: the sum of adjacent weights for a vertex; and 2) histogram: the neighborhood weight distribution of a vertex. In order to protect a graph from these attacks, we formalize a general model for weighted graph anonymization and provide efficient methods with respect to a two-step framework comprising property anonymization and graph reconstruction. Moreover, we theoretically prove that the histogram anonymization problem is NP-hard in the general case, and present an efficient heuristic algorithm for this problem running in near-quadratic time in the graph size. The final part of this thesis turns to exploring efficient privacy-preserving techniques for hypergraphs while maintaining the quality of community detection. We first model a background knowledge attack based on the so-called rank, which is one of the important properties of hyperedges. Then, we show empirically how high the disclosure risk is when this attack is used to breach real-world data. We formalize a general model for rank-based hypergraph anonymization and justify its hardness. As a solution, we extend the two-step framework for graph anonymization to our new problem and propose efficient algorithms that perform well in preserving data privacy. Also, we explore the issue of constructing a hypergraph with a specified rank set, for the first time as far as we know.
The proposed construction algorithm also has the characteristic of minimizing the bias of community detection between the original and the perturbed hypergraphs. In addition, we consider two de-anonymizing schemes that may be used to attack an anonymized hypergraph and verify that both schemes fail to breach the privacy of a hypergraph with rank anonymity in the real-world case.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 2011
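The volume invariant named in this abstract (the sum of weights adjacent to a vertex) already suffices for a simple re-identification attack: an attacker who knows the weights around a target can look for the anonymized vertex with a matching, unique volume. A hypothetical sketch, assuming weighted adjacency lists (names and data structures are ours, not the thesis's):

```python
def volume_signature(adj):
    """Volume invariant: sum of adjacent edge weights per vertex.
    `adj` maps each vertex to a list of (neighbor, weight) pairs."""
    return {v: sum(w for _, w in nbrs) for v, nbrs in adj.items()}

def reidentify_by_volume(known, anonymized, tol=0.0):
    """Match known vertices to anonymized vertices whose volume is
    (near-)unique: a unique candidate counts as a re-identification."""
    known_vol = volume_signature(known)
    anon_vol = volume_signature(anonymized)
    matches = {}
    for u, vu in known_vol.items():
        candidates = [x for x, vx in anon_vol.items() if abs(vx - vu) <= tol]
        if len(candidates) == 1:  # unique volume -> vertex exposed
            matches[u] = candidates[0]
    return matches
```

Relabeling vertices does not change volumes, which is why degree-only (weight-independent) perturbation leaves this invariant intact and motivates the property-anonymization step above.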
APA, Harvard, Vancouver, ISO, and other styles
38

Viscardi, Cecilia. "Approximate Bayesian Computation and Statistical Applications to Anonymized Data: an Information Theoretic Perspective." Doctoral thesis, 2021. http://hdl.handle.net/2158/1236316.

Full text
Abstract:
Realistic statistical modelling of complex phenomena often leads to considering several latent variables and nuisance parameters. In such cases, the Bayesian approach to inference requires the computation of challenging integrals or summations over high dimensional spaces. Monte Carlo methods are a class of widely used algorithms for performing simulated inference. In this thesis, we consider the problem of sample degeneracy in Monte Carlo methods focusing on Approximate Bayesian Computation (ABC), a class of likelihood-free algorithms allowing inference when the likelihood function is analytically intractable or computationally demanding to evaluate. In the ABC framework sample degeneracy arises when proposed values of the parameters, once given as input to the generative model, rarely lead to simulations resembling the observed data and are hence discarded. Such "poor" parameter proposals, i.e., parameter values having an (exponentially) small probability of producing simulation outcomes close to the observed data, do not contribute at all to the representation of the parameter's posterior distribution. This leads to a very large number of required simulations and/or a waste of computational resources, as well as to distortions in the computed posterior distribution. To mitigate this problem, we propose two algorithms, referred to as the Large Deviations Approximate Bayesian Computation algorithms (LD-ABC), where the typical ABC rejection step is avoided altogether. We adopt an information theoretic perspective resorting to the Method of Types formulation of Large Deviations, thus first restricting our attention to models for i.i.d. discrete random variables and then extending the method to parametric finite state Markov chains. We experimentally evaluate our method through proof-of-concept implementations. Furthermore, we consider statistical applications to anonymized data.
We adopt the point of view of an evaluator interested in publishing data about individuals in an anonymized form that allows balancing the learner’s utility against the risk posed by an attacker, potentially targeting individuals in the dataset. Accordingly, we present a unified Bayesian model applying to data anonymized employing group-based schemes and a related MCMC method to learn the population parameters. This allows relative threat analysis, i.e., an analysis of the risk for any individual in the dataset to be linked to a specific sensitive value beyond what is implied for the general population. Finally, we show the performance of the ABC methods in this setting and test LD-ABC at work on a real-world obfuscated dataset.
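For context, the plain rejection-ABC scheme whose sample degeneracy the LD-ABC algorithms are designed to avoid can be sketched as follows (function names and the toy model are ours, not the thesis's): propose a parameter from the prior, simulate data from the generative model, and keep the proposal only if a summary of the simulation lands within epsilon of the observed summary.

```python
import random

def abc_rejection(observed_summary, prior_sample, simulate, summary,
                  distance, epsilon, num_proposals, seed=None):
    """Plain rejection ABC: accept proposed parameters whose simulated
    data has a summary within `epsilon` of the observed summary."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(num_proposals):
        theta = prior_sample(rng)        # draw a proposal from the prior
        sim = simulate(theta, rng)       # run the generative model
        if distance(summary(sim), observed_summary) <= epsilon:
            accepted.append(theta)       # proposal survives the rejection step
    return accepted
```

The accepted proposals approximate the posterior; "poor" proposals far from the data are discarded, which is exactly the wasteful step the abstract refers to when epsilon is small or the model is high dimensional.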
APA, Harvard, Vancouver, ISO, and other styles
39

Peng, Wei. "Seed and Grow: An Attack Against Anonymized Social Networks." 2012. http://hdl.handle.net/1805/2884.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
Digital traces left by a user of an on-line social networking service can be abused by a malicious party to compromise the person’s privacy. This is exacerbated by the increasing overlap in user bases among various services. To demonstrate the feasibility of abuse and raise public awareness of this issue, I propose an algorithm, Seed and Grow, to identify users from an anonymized social graph based solely on graph structure. The algorithm first identifies a seed sub-graph either planted by an attacker or divulged by collusion of a small group of users, and then grows the seed larger based on the attacker’s existing knowledge of the users’ social relations. This work identifies and relaxes implicit assumptions made by previous works, eliminates arbitrary parameters, and improves identification effectiveness and accuracy. Experimental results on real-world collected datasets further corroborate my expectations and claims.
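The "grow" phase described here, extending a known seed mapping by pairing unmapped vertices according to how strongly they connect to already-mapped ones, can be illustrated with a toy greedy sketch (our simplification, not Peng's actual algorithm; graphs are assumed to be adjacency dicts of neighbor sets):

```python
def grow_mapping(g1, g2, seed_map):
    """Toy 'grow' step: repeatedly map the (known, anonymized) vertex
    pair whose already-mapped neighborhoods overlap the most."""
    mapping = dict(seed_map)
    progress = True
    while progress:
        progress = False
        used = set(mapping.values())
        best = (0, None, None)
        for u in g1:
            if u in mapping:
                continue
            for x in g2:
                if x in used:
                    continue
                # Count neighbors of u whose image under the mapping
                # is a neighbor of candidate x in the anonymized graph
                score = sum(1 for n in g1[u] if mapping.get(n) in g2[x])
                if score > best[0]:
                    best = (score, u, x)
        score, u, x = best
        if score > 0:
            mapping[u] = x
            progress = True
    return mapping
```

Each newly mapped pair enlarges the evidence available for the next round, which is how a small planted or colluded seed can cascade into identifying a large portion of the graph.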
APA, Harvard, Vancouver, ISO, and other styles
40

Duaso, Calés Rosario. "La protection des données personnelles contenues dans les documents publics accessibles sur Internet : le cas des données judiciaires." Thèse, 2002. http://hdl.handle.net/1866/2435.

Full text
Abstract:
The upheavals generated by the new means of disseminating public data, together with the multiple possibilities offered by the Internet, such as information storage, comprehensive memory tools and the use of search engines, give rise to major issues related to privacy protection. The dissemination of public data in digital format causes a shift in our scales of time and space, and changes the traditional concept of public nature previously associated with the "paper" universe. We will study the means of protecting privacy, and the conditions for accessing and using the personal information, sometimes of a "sensitive" nature, contained in the public documents posted on the Internet. The characteristics of the information available through judicial data banks require special protection solutions, so that the necessary balance can be found between the principle of judicial transparency and the right to privacy.
"Thesis presented to the Faculty of Graduate Studies for the degree of Master of Laws (LL.M.)"
APA, Harvard, Vancouver, ISO, and other styles
