Dissertations / Theses: 'Data management and data science'

1

Yang, Ying. "Interactive Data Management and Data Analysis." Thesis, State University of New York at Buffalo, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10288109.

Abstract:

Everyone today has a big data problem. Data is everywhere and in different formats, variously referred to as data lakes, data streams, or data swamps. To extract knowledge or insights from the data or to support decision-making, we need to go through a process of collecting, cleaning, managing and analyzing the data. In this process, data cleaning and data analysis are two of the most important and time-consuming components.

One common challenge in these two components is a lack of interaction. Data cleaning and data analysis are typically done as a batch process, operating on the whole dataset without any feedback. This leads to long, frustrating delays during which users have no idea if the process is effective. Without interaction, human experts must decide up front which algorithms or parameters to use in the systems for these two components.

We should teach computers to talk to humans, not the other way around. This dissertation focuses on building two systems, Mimir and CIA, that help users conduct data cleaning and analysis through interaction. Mimir is a system that allows users to clean big data in a cost- and time-efficient way through interaction, a process I call on-demand ETL. Convergent inference algorithms (CIA) are a family of inference algorithms in probabilistic graphical models (PGM) that enjoy the benefits of both exact and approximate inference algorithms through interaction.

Mimir provides a general language for users to express different data cleaning needs. It acts as a shim layer that wraps around the database, making it possible for the bulk of the ETL process to remain within a classical deterministic system. Mimir also helps users to measure the quality of an analysis result and provides rankings for cleaning tasks to improve the result quality in a cost-efficient manner. CIA focuses on providing user interaction through the process of inference in PGMs. The goal of CIA is to free users from the upfront commitment to either approximate or exact inference, and to give users more control over time/accuracy trade-offs to direct decision-making and computation instance allocations. This dissertation describes the Mimir and CIA frameworks to demonstrate that it is feasible to build efficient interactive data management and data analysis systems.

2

Dedge, Parks Dana M. "Defining Data Science and Data Scientist." Scholar Commons, 2017. http://scholarcommons.usf.edu/etd/7014.

Abstract:

The world’s data sets are growing exponentially every day due to the large number of devices generating data residue across the multitude of global data centers. What to do with these massive data stores, how to manage them, and who performs these tasks have not been adequately defined and agreed upon by academics and practitioners. Data science is a cross-disciplinary amalgam of skills, techniques and tools that allows business organizations to identify trends and build assumptions which lead to key decisions. It is in an evolutionary state as new technologies with capabilities are still being developed and deployed. The data science tasks and the data scientist skills needed in order to be successful with analytics across the data stores are defined in this document. The research, conducted across twenty-two academic articles, one book, eleven interviews and seventy-eight surveys, is combined to articulate the convergence on the term data science. In addition, the research identified five key skill categories (themes) comprising fifty-five competencies that are used globally by data scientists to successfully perform the art and science activities of data science. Unspecified portions of statistics, technology programming, development of models and calculations are combined to determine outcomes which lead global organizations to make strategic decisions every day. This research is intended to provide a constructive summary of the topics data science and data scientist in order to spark the dialogue needed to formally finalize the definitions and ultimately change the world by establishing set guidelines on how data science is performed and measured.

3

Wason, Jasmin Lesley. "Automating data management in science and engineering." Thesis, University of Southampton, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.396143.

4

Wang, Yi. "Data Management and Data Processing Support on Array-Based Scientific Data." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1436157356.

5

Anumalla, Kalyani. "DATA PREPROCESSING MANAGEMENT SYSTEM." University of Akron / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=akron1196650015.

6

Fernández, Moctezuma Rafael J. "A Data-Descriptive Feedback Framework for Data Stream Management Systems." PDXScholar, 2012. https://pdxscholar.library.pdx.edu/open_access_etds/116.

Abstract:

Data Stream Management Systems (DSMSs) provide support for continuous query evaluation over data streams. Data streams provide processing challenges due to their unbounded nature and varying characteristics, such as rate and density fluctuations. DSMSs need to adapt stream processing to these changes within certain constraints, such as available computational resources and minimum latency requirements in producing results. The proposed research develops an inter-operator feedback framework, where opportunities for run-time adaptation of stream processing are expressed in terms of descriptions of substreams and actions applicable to the substreams, called feedback punctuations. Both the discovery of adaptation opportunities and the exploitation of these opportunities are performed in the query operators. DSMSs are also concerned with state management, in particular, state derived from tuple processing. The proposed research also introduces the Contracts Framework, which provides execution guarantees about state purging in continuous query evaluation for systems with and without inter-operator feedback. This research provides both theoretical and design contributions. The research also includes an implementation and evaluation of the feedback techniques in the NiagaraST DSMS, and a reference implementation of the Contracts Framework.

7

Nguyen, Benjamin. "Privacy-Centric Data Management." Habilitation à diriger des recherches, Université de Versailles-Saint Quentin en Yvelines, 2013. http://tel.archives-ouvertes.fr/tel-00936130.

Abstract:

This document will focus on my core computer science research since 2010, covering the topic of data management and privacy. More specifically, I will present the following topics:
- A new paradigm, called Trusted Cells, for privacy-centric personal data management based on the Asymmetric Architecture composed of trusted or open (low power) distributed hardware devices acting as personal data servers and a highly powerful, highly available supporting server, such as a cloud (Chapter 2).
- Adapting aggregate data computation techniques to the Trusted Cells environment, with the example of Privacy-Preserving Data Publishing (Chapter 3).
- Minimizing the data that leaves a Trusted Cell, i.e. enforcing the general privacy principle of Limited Data Collection (Chapter 4).
This document contains only results that have already been published. As such, rather than focus on the details and technicalities of each result, I have tried to provide an easy way to gain a global understanding of the context behind the work, explain the problem the work addresses, and give a summary of the main scientific results and impact.

8

Tran, Viet-Trung. "Scalable data-management systems for Big Data." Phd thesis, École normale supérieure de Cachan - ENS Cachan, 2013. http://tel.archives-ouvertes.fr/tel-00920432.

Abstract:

Big Data can be characterized by 3 V's:
* Big Volume refers to the unprecedented growth in the amount of data.
* Big Velocity refers to the growth in the speed of moving data in and out of management systems.
* Big Variety refers to the growth in the number of different data formats.
Managing Big Data requires fundamental changes in the architecture of data management systems. Data storage should continue to be innovated in order to adapt to the growth of data. Storage systems need to be scalable while maintaining high performance for data accesses. This thesis focuses on building scalable data management systems for Big Data. Our first and second contributions address the challenge of providing efficient support for Big Volume of data in data-intensive high performance computing (HPC) environments. In particular, we address the shortcomings of existing approaches in handling atomic, non-contiguous I/O operations in a scalable fashion. We propose and implement a versioning-based mechanism that can be leveraged to offer isolation for non-contiguous I/O without the need to perform expensive synchronizations. In the context of parallel array processing in HPC, we introduce Pyramid, a large-scale, array-oriented storage system. It revisits the physical organization of data in distributed storage systems for scalable performance. Pyramid favors multidimensional-aware data chunking, which closely matches the access patterns generated by applications. Pyramid also favors distributed metadata management and versioning concurrency control to eliminate synchronization under concurrency. Our third contribution addresses Big Volume at the scale of geographically distributed environments. We consider BlobSeer, a distributed versioning-oriented data management service, and we propose BlobSeer-WAN, an extension of BlobSeer optimized for such geographically distributed environments. BlobSeer-WAN takes into account the latency hierarchy by favoring local metadata accesses.
BlobSeer-WAN features asynchronous metadata replication and a vector-clock implementation for collision resolution. To cope with the Big Velocity characteristic of Big Data, our last contribution features DStore, an in-memory document-oriented store that scales vertically by leveraging the large memory capacity of multicore machines. DStore demonstrates fast and atomic complex transaction processing in data writing, while maintaining high throughput read access. DStore follows a single-threaded execution model to execute update transactions sequentially, while relying on a versioning concurrency control to enable a large number of simultaneous readers.
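
As a rough illustration of the vector-clock idea mentioned in this abstract, the following minimal Python sketch shows how two sites can detect that their metadata updates were concurrent (a collision) rather than ordered. It is a generic textbook vector clock, not BlobSeer-WAN's actual implementation; the function names and the site labels "A" and "B" are assumptions made for the example.

```python
# Generic vector-clock sketch (hypothetical names; not BlobSeer-WAN code).
# A clock is a dict mapping a site name to the number of updates seen from it.

def increment(clock, site):
    """Return a copy of the clock with the given site's counter advanced."""
    updated = dict(clock)
    updated[site] = updated.get(site, 0) + 1
    return updated

def merge(a, b):
    """Combine two clocks by taking the per-site maximum."""
    return {s: max(a.get(s, 0), b.get(s, 0)) for s in a.keys() | b.keys()}

def compare(a, b):
    """Classify a relative to b: 'equal', 'before', 'after', or 'concurrent'."""
    sites = a.keys() | b.keys()
    a_le_b = all(a.get(s, 0) <= b.get(s, 0) for s in sites)
    b_le_a = all(b.get(s, 0) <= a.get(s, 0) for s in sites)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: the updates collided

# Two sites update the same metadata independently, then reconcile.
site_a = increment({}, "A")
site_b = increment({}, "B")
collision = compare(site_a, site_b)
reconciled = merge(site_a, site_b)
```

A replica that sees "concurrent" knows it must apply some resolution policy; which policy the thesis uses is not stated in the abstract.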

9

Nyström, Dag. "Data Management in Vehicle Control-Systems." Doctoral thesis, Mälardalen University, Department of Computer Science and Electronics, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-66.

Abstract:

As the complexity of vehicle control-systems increases, the amount of information that these systems are intended to handle also increases. This thesis provides concepts relating to real-time database management systems to be used in such control-systems. By integrating a real-time database management system into a vehicle control-system, data management on a higher level of abstraction can be achieved. Current database management concepts are not sufficient for use in vehicles, and new concepts are necessary. A case-study at Volvo Construction Equipment Components AB in Eskilstuna, Sweden, presented in this thesis, together with a survey of existing database platforms, confirms this. The thesis specifically addresses data access issues by introducing: (i) a data access method, denoted database pointers, which enables data in a real-time database management system to be accessed efficiently. Database pointers, which resemble regular pointer variables, permit individual data elements in the database to be directly pointed out, without risking a violation of the database integrity. (ii) two concurrency-control algorithms, denoted 2V-DBP and 2V-DBP-SNAP, which enable critical (hard real-time) and non-critical (soft real-time) data accesses to co-exist, without blocking of the hard real-time data accesses or risking unnecessary abortions of soft real-time data accesses. The thesis shows that 2V-DBP significantly outperforms a standard real-time concurrency control algorithm both with respect to lower response-times and minimized abortions. (iii) two concepts, denoted substitution and subscription queries, that enable service- and diagnostics-tools to stimulate and monitor a control-system during run-time. The concepts presented in this thesis form a basis on which a data management concept suitable for embedded real-time systems, such as vehicle control-systems, can be built.

A modern vehicle is today, in principle, entirely controlled by embedded computers. As the functionality of vehicles increases, the software in these computers becomes more and more complex. Complex software is difficult and costly to construct. To handle this complexity and ease construction, industry is now investing in finding methods for constructing these systems at a higher level of abstraction. These methods aim to structure the software into its different functional parts, for example by using so-called component-based software development. However, these methods are not effective at handling the growing amount of information that comes with the increasing functionality of the systems. Examples of information to be handled are data from sensors spread throughout the car (temperatures, pressures, engine speeds, etc.), control data from the driver (e.g. steering angle and throttle position), parameter data, and log data used for service diagnostics. This information can be classified as safety-critical since it is used to control the behaviour of the vehicle. Recently, however, the amount of non-safety-critical information has increased, for example in comfort systems such as multimedia, navigation and passenger ergonomics systems.

This thesis aims to show how a data management system for embedded systems, such as vehicle systems, can be constructed. By using a real-time database management system to lift data management to a higher level of abstraction, vehicle systems can handle large amounts of information in a much simpler way than in current systems. Such a data management system gives system architects the ability to structure and model the information in a logical and easily surveyed manner. The information can then be read and updated through standardized interfaces adapted to different types of functionality. The thesis specifically addresses the problem of how information in the database can, with the help of a concurrency-control algorithm, be shared by both safety-critical and non-safety-critical system functions in the vehicle. It also discusses how information can be distributed both between different computer systems in the vehicle and to diagnostics and service tools that can be connected to the vehicle.

10

Karras, Panagiotis. "Data structures and algorithms for data representation in constrained environments." Thesis, University of Hong Kong, 2007. http://sunzi.lib.hku.hk/hkuto/record/B38897647.

11

Tatarinov, Igor. "Semantic data sharing with a peer data management system." Thesis, University of Washington, 2004. http://hdl.handle.net/1773/6942.

12

Matus, Castillejos Abel. "Management of Time Series Data." University of Canberra. Information Sciences & Engineering, 2006. http://erl.canberra.edu.au./public/adt-AUC20070111.095300.

Abstract:

Every day large volumes of data are collected in the form of time series. Time series are collections of events or observations, predominantly numeric in nature, sequentially recorded on a regular or irregular time basis. Time series are becoming increasingly important in nearly every organisation and industry, including banking, finance, telecommunication, and transportation. Banking institutions, for instance, rely on the analysis of time series for forecasting economic indices, elaborating financial market models, and registering international trade operations. More and more time series are being used in this type of investigation and becoming a valuable resource in today’s organisations. This thesis investigates and proposes solutions to some current and important issues in time series data management (TSDM), using Design Science Research Methodology. The thesis presents new models for mapping time series data to relational databases which optimise the use of disk space, can handle different time granularities and status attributes, and facilitate time series data manipulation in a commercial Relational Database Management System (RDBMS). These new models provide a good solution for current time series database applications with RDBMS and are tested with a case study and prototype with financial time series information. Also included is a temporal data model for illustrating time series data lifetime behaviour based on a new set of time dimensions (confidentiality, definitiveness, validity, and maturity times) specially targeted to manage time series data, which are introduced to correctly represent the different statuses of time series data in a timeline. The proposed temporal data model gives a clear and accurate picture of the time series data lifecycle. Formal definitions of these time series dimensions are also presented. In addition, a time series grouping mechanism in an extensible commercial relational database system is defined, illustrated, and justified.
The extension consists of a new data type and its corresponding rich set of routines that support modelling and operating on time series information at a higher level of abstraction. It extends the capability of the database server to organise and manipulate time series into groups. Thus, this thesis presents a new data type, referred to as GroupTimeSeries, and its corresponding architecture, support functions, and operations. Implementation options for the GroupTimeSeries data type in relational-based technologies are also presented. Finally, a framework for TSDM that is expressive enough to capture the main requirements of time series applications and the management of their data is defined. The framework aims at providing initial domain know-how and requirements of time series data management, avoiding the impracticability of designing a TSDM system on paper from scratch. Many aspects of time series applications, including the way time series data are organised at the conceptual level, are addressed. The central abstractions of the proposed domain-specific framework are the notions of business sections, groups of time series, and the time series itself. The framework integrates a comprehensive specification of the structural and functional aspects of time series data management. A formal framework specification using conceptual graphs is also explored.

13

Vijayakumar, Nithya Nirmal. "Data management in distributed stream processing systems." [Bloomington, Ind.] : Indiana University, 2007. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3278228.

Abstract:

Thesis (Ph.D.)--Indiana University, Dept. of Computer Science, 2007.
Source: Dissertation Abstracts International, Volume: 68-09, Section: B, page: 6093. Adviser: Beth Plale. Title from dissertation home page (viewed May 9, 2008).

14

Agbaw, Catherine E. (Catherine Ebenye). "Management data collection in a distributed environment." Thesis, McGill University, 1995. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=22713.

Abstract:

The goal of this research is to explore the feasibility of building management information gateways that relay management information at regular intervals between the Simple Network Management Protocol (SNMP) and the Common Management Information Protocol (CMIP). This management information is stored in an object-oriented Management Information Base (MIB) that can be accessed by Common Management Information Service (CMIS) management applications.
An approach for polling based on a variable polling frequency is proposed. A stateful model for a simple version of a CMIP proxy agent for SNMP, which requires management information collected from SNMP agents to be stored in the proxy agent's MIB, is also proposed. The proxy agent is implemented using the OSIMIS-3.0 software package, which implements CMIP, and an existing SNMP application. The proxy agent uses a policy of variable polling frequency based on the cost of polling, the cost of losing relevant management information, and the frequency of update of new information. The agent is tested on a distributed network consisting of a LAN at McGill University and another LAN at the University of Montreal.
The test results show that this model of a proxy agent between CMIP and SNMP yields a better response time than the stateless proxy agent model used by the Network Management Forum (NMF93), and provides up-to-date information about the network to a CMIS manager during critical situations.

15

Zou, Beibei 1974. "Data mining with relational database management systems." Thesis, McGill University, 2005. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=82456.

Abstract:

With the increasing demand for transforming raw data into information and knowledge, data mining has become an important field for the discovery of useful information and hidden patterns in huge datasets. Both machine learning and database research have made major contributions to the field of data mining. However, there is still little effort made to improve the scalability of algorithms applied in data mining tasks. Scalability is crucial for data mining algorithms, since they often have to handle large datasets. In this thesis we take a step in this direction by extending a popular machine learning software package, Weka3.4, to handle large datasets that cannot fit into main memory by relying on relational database technology. Weka3.4-DB is implemented to store the data into and access the data from DB2, in general with a loose coupling approach. Additionally, a semi-tight coupling is applied to optimize the data manipulation methods by implementing core functionalities within the database. Based on the DB2 storage implementation, Weka3.4-DB achieves better scalability, but still provides a general interface for developers to implement new algorithms without the need for database or SQL knowledge.

16

Ma, Xuesong 1975. "Data mining using relational database management system." Thesis, McGill University, 2005. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=98757.

Abstract:

With the wide availability of huge amounts of data and the imminent demands to transform the raw data into useful information and knowledge, data mining has become an important research field in both the database and machine learning areas. Data mining is defined as the process of solving problems by analyzing data already present in the database and discovering knowledge in the data. Database systems provide efficient data storage, fast access structures and a wide variety of indexing methods to speed up data retrieval. Machine learning provides theory support for most of the popular data mining algorithms. Weka-DB combines properties of these two areas to improve the scalability of Weka, which is an open source machine learning software package. Weka implements most of the machine learning algorithms using main-memory-based data structures, so it cannot handle large datasets that do not fit into main memory. Weka-DB is implemented to store the data into and access the data from DB2, so it achieves better scalability than Weka. However, Weka-DB is much slower than Weka because secondary storage access is more expensive than main memory access. In this thesis we extend Weka-DB with a buffer management component to improve its performance. We also increase the scalability of Weka-DB by putting further data structures into the database and using a buffer to access them. Finally, we explore another method to improve the speed of the algorithms, which takes advantage of the data access properties of machine learning algorithms.

17

Tatikonda, Shirish. "Towards Efficient Data Analysis and Management of Semi-structured Data." The Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1275414859.

18

Kumar, Aman. "Metadata-Driven Management of Scientific Data." The Ohio State University, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=osu1243898671.

19

Quintero, Michael C. "Constructing a Clinical Research Data Management System." Thesis, University of South Florida, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10640886.

Abstract:

Clinical study data is usually collected without knowing in advance what kind of data is going to be collected. In addition, the set of all possible data points that can apply to a patient in any given clinical study is almost always a superset of the data points that are actually recorded for a given patient. As a result, clinical data resembles a set of sparse data with an evolving data schema. To help researchers at the Moffitt Cancer Center better manage clinical data, a tool called GURU was developed that uses the Entity Attribute Value model to handle sparse data and allow users to manage a database entity’s attributes without any changes to the database table definition. The Entity Attribute Value model’s read performance gets faster as the data gets sparser, but it was observed to perform many times worse than a wide table if the attribute count is not sufficiently large. Ultimately, the design trades read performance for flexibility in the data schema.
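
To make the Entity Attribute Value idea in this abstract concrete, here is a minimal Python sketch using the standard-library `sqlite3` module. It is not GURU's schema or code; the table name `eav`, the helper functions, and the sample attributes are all assumptions chosen for illustration. The point it shows is that a new attribute is just a new row, so the table definition never changes, while reading one entity requires gathering its rows back together.

```python
import sqlite3

def make_eav_store():
    """Create an in-memory database with one generic EAV table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE eav ("
        " entity_id INTEGER, attribute TEXT, value TEXT,"
        " PRIMARY KEY (entity_id, attribute))"
    )
    return conn

def set_value(conn, entity_id, attribute, value):
    """Record one data point; unseen attributes need no schema change."""
    conn.execute(
        "INSERT OR REPLACE INTO eav (entity_id, attribute, value) VALUES (?, ?, ?)",
        (entity_id, attribute, str(value)),
    )

def read_entity(conn, entity_id):
    """Reassemble one entity's sparse record as an attribute-to-value dict."""
    rows = conn.execute(
        "SELECT attribute, value FROM eav WHERE entity_id = ?", (entity_id,)
    )
    return dict(rows.fetchall())

# Two patients with different, sparse sets of recorded attributes.
conn = make_eav_store()
set_value(conn, 1, "age", 54)
set_value(conn, 1, "tumor_stage", "II")
set_value(conn, 2, "smoker", "yes")
patient_1 = read_entity(conn, 1)
patient_2 = read_entity(conn, 2)
```

Values are stored as text here for simplicity; a real design would need a typing strategy, which the abstract does not describe.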

20

Busack, Nancy Long. "The intelligent data object and its data base interface." Thesis, Kansas State University, 1985. http://hdl.handle.net/2097/9825.

21

Ma, Yu. "A composable data management architecture for scientific applications." [Bloomington, Ind.] : Indiana University, 2006. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3243773.

Abstract:

Thesis (Ph.D.)--Indiana University, Dept. of Computer Science, 2006.
Title from PDF t.p. (viewed Nov. 18, 2008). Source: Dissertation Abstracts International, Volume: 67-12, Section: B, page: 7170. Adviser: Randall Bramley.

22

Onolaja, Olufunmilola Oladunni. "Dynamic data-driven framework for reputation management." Thesis, University of Birmingham, 2012. http://etheses.bham.ac.uk//id/eprint/3824/.

Abstract:

The landscape of security has been changed by the increase in online market places, and the rapid growth of mobile and wireless networks. Users are now exposed to greater risks as they interact anonymously in these domains. Despite the existing security paradigms, trust among users remains a problem. Reputation systems have now gained popularity because of their effectiveness in providing trusted interactions. We argue that managing reputation by relying on history alone and/or biased opinions is inadequate for security, because such an approach exposes the domain to vulnerabilities. Alternatively, the use of historical, recent and anticipated events supports effective reputation management. We investigate how the dynamic data-driven application systems paradigm can aid reputation management. We suggest the use of the paradigm's primitives, which include the use of controller and simulation components for performing computations and predictions. We demonstrate how a dynamic framework can provide effective reputation management that is not influenced by biased observations. This is an online decision support system that can enable stakeholders to make informed judgments. To highlight the framework's usefulness, we report on its predictive performance through an evaluation stage. Our results indicate that a dynamic data-driven approach can lead to effective reputation management in trust-reliant domains.

23

Kelley, Ian Robert. "Data management in dynamic distributed computing environments." Thesis, Cardiff University, 2012. http://orca.cf.ac.uk/44477/.

Abstract:

Data management in parallel computing systems is a broad and increasingly important research topic. As network speeds have surged, so too has the movement to transition storage and computation loads to wide-area network resources. The Grid, the Cloud, and Desktop Grids all represent different aspects of this movement towards highly-scalable, distributed, and utility computing. This dissertation contends that a peer-to-peer (P2P) networking paradigm is a natural match for data sharing within and between these heterogeneous network architectures. Peer-to-peer methods such as dynamic discovery, fault-tolerance, scalability, and ad-hoc security infrastructures provide excellent mappings for many of the requirements in today’s distributed computing environment. In recent years, volunteer Desktop Grids have seen a growth in data throughput as application areas expand and new problem sets emerge. These increasing data needs require storage networks that can scale to meet future demand while also facilitating expansion into new data-intensive research areas. Current practices are to mirror data from centralized locations, a technique that is not practical for growing data sets, dynamic projects, or data-intensive applications. The fusion of Desktop and Service Grids provides an ideal use-case to research peer-to-peer data distribution strategies in a hybrid environment. Desktop Grids have a data management gap, while integration with Service Grids raises new challenges with regard to cross-platform design. The work undertaken here is two-fold: first it explores how P2P techniques can be leveraged to meet the data management needs of Desktop Grids, and second, it shows how the same distribution paradigm can provide migration paths for Service Grid data. 
The result of this research is a Peer-to-Peer Architecture for Data-Intensive Cycle Sharing (ADICS) that is capable not only of distributing volunteer computing data, but also of providing a transitional platform and storage space for migrating Service Grid jobs to Desktop Grid environments.

24

Branco, Miguel. "Distributed data management for large scale applications." Thesis, University of Southampton, 2009. https://eprints.soton.ac.uk/72283/.


Abstract:

Improvements in data storage and network technologies, the emergence of new high-resolution scientific instruments, the widespread use of the Internet and the World Wide Web and even globalisation have contributed to the emergence of new large scale data-intensive applications. These applications require new systems that allow users to store, share and process data across computing centres around the world. Worldwide distributed data management is particularly important when there is a lot of data, more than can fit in a single computer or even in a single data centre. Designing systems to cope with the demanding requirements of these applications is the focus of the present work. This thesis presents four contributions. First, it introduces a set of design principles that can be used to create distributed data management systems for data-intensive applications. Second, it describes an architecture and implementation that follows the proposed design principles, and which results in a scalable, fault tolerant and secure system. Third, it presents the system evaluation, which occurred under real operational conditions using close to one hundred computing sites and with more than 14 petabytes of data. Fourth, it proposes novel algorithms to model the behaviour of file transfers on a wide-area network. This work also presents a detailed description of the problem of managing distributed data, ranging from the collection of requirements to the identification of the uncertainty that underlies a large distributed environment. This includes a critique of existing work and the identification of practical limits to the development of transfer algorithms on a shared distributed environment. The motivation for this work has been the ATLAS Experiment for the Large Hadron Collider (LHC) at CERN, where the author was responsible for the development of the data management middleware.


25

Strand, Mattias. "External Data Incorporation into Data Warehouses." Doctoral thesis, Kista : Skövde : Dept. of computer and system sciences, Stockholm University : School of humanities and informatics, University of Skövde, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-660.


26

Kairouz, Joseph. "Patient data management system medical knowledge-base evaluation." Thesis, McGill University, 1996. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=24060.


Abstract:

The purpose of this thesis is to evaluate the medical data management expert system at the Pediatric Intensive Care Unit of the Montreal Children's Hospital. The objective of this study is to provide a systematic method to evaluate and progressively improve the knowledge embedded in the medical expert system.
Following a literature survey on evaluation techniques and the architecture of existing expert systems, an overview of the Patient Data Management System hardware and software components is presented. The design of the Expert Monitoring System is elaborated. Following its installation in the Intensive Care Unit, the performance of the Expert Monitoring System was evaluated on real vital sign data, and corrections were formulated. A progressive evaluation technique, a new methodology for evaluating an expert system knowledge-base, is proposed for subsequent corrections and evaluations of the Expert Monitoring System.


27

Su, Yu. "Big Data Management Framework based on Virtualization and Bitmap Data Summarization." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1420738636.


28

Li, Yujiang. "Development architecture for industrial data management." Licentiate thesis, KTH, Datorsystem för konstruktion och tillverkning, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-132244.


Abstract:

Standardized information modeling is important for interoperability of CAx systems. Existing information standards such as ISO 10303 STEP have been proposed and developed for decades for this purpose. Comprehensive data structures and various implementation methodologies make such standards strong in support of different industry domains, information types, and technical requirements. However, this also increases implementation complexity and workload for CAx system developers. This licentiate thesis proposes a development architecture, STEP Toolbox, to help users implement standards with a simplified development process and minimal knowledge of the standards. Implementation difficulties for individuals are identified by analysing the implementation of the information standards in three aspects: tasks, users, and technology. The toolbox is then introduced with an illustration of the design of its behavior and structure. Case studies are performed to validate the toolbox with prototypes. Internal and external observation has shown that using this architecture skips a learning process of around two months and greatly reduces the implementation workload.


29

Tibbetts, Richard S. (Richard Singleton) 1979. "Linear Road : benchmarking stream-based data management systems." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/18017.


Abstract:

Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003.
Includes bibliographical references (p. 57-61).
This thesis describes the design, implementation, and execution of the Linear Road benchmark for stream-based data management systems. The motivation for benchmarking and the selection of the benchmark application are described. Test harness implementation is discussed, as are experiences using the benchmark to evaluate the Aurora engine. Effects of this work on the evolution of the Aurora engine are also discussed. Streams consist of continuous feeds of data from external data sources such as sensor networks or other monitoring systems. Stream data management systems execute continuous and historical queries over these streams, producing query results in real-time. This benchmark provides a means of comparing the functionality and performance of stream-based data management systems relative to each other and to relational systems. The benchmark presented is motivated by the increasing prevalence of "variable tolling" on highway systems throughout the world. Variable tolling uses dynamically determined factors such as congestion levels and accident proximity to calculate tolls. Linear Road specifies a variable tolling system for a fictional urban area, including such features as accident detection and alerts, traffic congestion measurements, toll calculations, and ad hoc requests for travel time predictions and account balances. This benchmark has already been adopted in the Aurora [ACC⁺03] and STREAM [MWA⁺03] streaming data management systems.
by Richard S. Tibbetts, III.
M.Eng.


30

Yip, Alexander Siumann 1979. "Improving web site security with data flow management." Thesis, Massachusetts Institute of Technology, 2009. http://hdl.handle.net/1721.1/54647.


Abstract:

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 91-98).
This dissertation describes two systems, RESIN and BFlow, whose goal is to help Web developers build more secure Web sites. RESIN and BFlow use data flow management to help reduce the security risks of using buggy or malicious code. RESIN provides programmers with language-level mechanisms to track and manage the flow of data within the server. These mechanisms make it easy for programmers to catch server-side data flow bugs that result in security vulnerabilities, and prevent these bugs from being exploited. BFlow is a system that adds information flow control, a restrictive form of data flow management, both to the Web browser and to the interface between a browser and a server. BFlow makes it possible for a Web site to combine confidential data with untrusted JavaScript in its Web pages, without risking leaks of that data. This work makes a number of contributions. RESIN introduces the idea of a data flow assertion and demonstrates how to build such assertions using three language-level mechanisms: policy objects, data tracking, and filter objects. We built prototype implementations of RESIN in both the PHP and Python runtimes. We adapt seven real off-the-shelf applications and implement 11 different security policies in RESIN which thwart at least 27 real security vulnerabilities. BFlow introduces an information flow control model that fits the JavaScript communication mechanisms, and a system that maps that model to JavaScript's existing isolation system. Together, these techniques allow untrusted JavaScript to read, compute with, and display confidential data without the risk of leaking that data, yet require only minor changes to existing software. We built a prototype of the BFlow system and three different applications including a social networking application, a novel shared-data Web platform, and BFlogger, a third-party JavaScript platform similar to that of Blogger.com. We ported several untrusted JavaScript extensions from Blogger.com to BFlogger, and show that the extensions cannot leak data as they can in Blogger.com.
by Alexander Siumann Yip.
Ph.D.
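
The three mechanisms named in the abstract above (policy objects, data tracking, filter objects) can be illustrated with a minimal Python sketch. This is not RESIN's actual API: the class names `PasswordPolicy`, `Tracked`, and `OutputFilter`, the `export_check` method, and the context dictionary are all hypothetical, chosen only to show how a policy might travel with a string and be checked at an output boundary.

```python
class PasswordPolicy:
    """Hypothetical policy object: the data may only leave by email to its owner."""

    def __init__(self, owner_email):
        self.owner_email = owner_email

    def export_check(self, context):
        if context.get("type") == "email" and context.get("to") == self.owner_email:
            return
        raise PermissionError("password may not leave via this channel")


class Tracked(str):
    """Data tracking: a string that carries its policy objects through concatenation."""

    def __new__(cls, value, policies=()):
        obj = super().__new__(cls, value)
        obj.policies = tuple(policies)
        return obj

    def __add__(self, other):
        merged = self.policies + getattr(other, "policies", ())
        return Tracked(str(self) + str(other), merged)

    def __radd__(self, other):
        merged = getattr(other, "policies", ()) + self.policies
        return Tracked(str(other) + str(self), merged)


class OutputFilter:
    """Filter object: sits on an output channel and consults every attached policy."""

    def __init__(self, context):
        self.context = context
        self.sent = []

    def write(self, data):
        for policy in getattr(data, "policies", ()):
            policy.export_check(self.context)  # raises if the flow is not allowed
        self.sent.append(str(data))


password = Tracked("s3cret", [PasswordPolicy("alice@example.com")])
message = "Your password is " + password  # the policy follows the derived string
```

Writing `message` to `OutputFilter({"type": "email", "to": "alice@example.com"})` succeeds, while writing it to a filter whose context is `{"type": "http"}` raises `PermissionError`. The sketch only propagates policies through `+`; a real runtime would have to cover every string operation.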


31

Johnston, Steven. "Encouraging collaboration through a new data management approach." Thesis, University of Southampton, 2006. https://eprints.soton.ac.uk/65549/.


Abstract:

The ability to store large volumes of data is increasing faster than processing power. Some existing data management methods often result in data loss, inaccessibility or repetition of simulations. We propose a framework which promotes collaboration and simplifies data management. In particular we have demonstrated the proposed framework in the scenario of handling large scale data generated from biomolecular simulations in a multi-institutional global collaboration. The framework has extended the ability of the Python problem solving environment to manage data files and metadata associated with simulations. We provide a transparent and seamless environment for user submitted code to analyse and post-process data stored in the framework. Based on this scenario we have further enhanced and extended the framework to deal with the more generic case of enabling any existing data file to be post processed from any .NET enabled programming language.


32

Roger, Kathleen Mary Louise. "A nursing workload manager for a patient data management system /." Thesis, McGill University, 1992. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=61047.


Abstract:

This thesis presents the design and implementation of a Nursing Workload Manager module for a Patient Data Management System in an intensive care unit. The Nursing Workload Manager aids in the planning and documentation of the nurse's workload. It automates the generation of the nursing care plan and automatically assigns a score to the care plan based on a nursing workload measurement system. In the thesis a literature survey of patient data management systems, nursing workload measurement systems and system evaluation methods is presented. This is followed by an overview of the work environment of an intensive care unit. The functionality of the Nursing Workload Manager is described and details of the software environment and application implementation are discussed. Finally, the results of a user evaluation of the module are presented, and future work on the module is discussed.


33

Vellanki, Vivekanand. "Extending caching for two applications : disseminating live data and accessing data from disks." Diss., Georgia Institute of Technology, 2001. http://hdl.handle.net/1853/9243.


34

Lee, Jong Sik. "Space-based data management for high-performance distributed simulation." Diss., The University of Arizona, 2001. http://hdl.handle.net/10150/279803.


Abstract:

There is a rapidly growing demand to model and simulate complex large-scale distributed systems and to collaboratively share geographically dispersed data assets and computing resources to perform such distributed simulation with reasonable communication and computation resources. Interest management schemes have been studied in the literature. In this dissertation we propose an interest-based quantization scheme that is created by combining a quantization scheme and an interest management scheme. We show that this approach provides a superior solution to reduce message traffic and network data transmission load. As an environmental platform for data distribution management, we extended the DEVS/HLA distributed modeling and simulation environment. This environment allows us to study interest-based quantization schemes in order to achieve effective reduction of data communication in distributed simulation. In this environment, system modeling is provided by the DEVS (Discrete Event System Specification) formalism and supports effective modeling based on hierarchical and modular object-oriented technology. Distributed simulation is performed by a highly reliable facility using the HLA (High Level Architecture). The extended DEVS/HLA environment, called DEVS/GDDM (Generic Data Distribution Management), provides a high level abstraction to specify a set of interest-based quantization schemes. This dissertation presents a performance analysis of centralized and distributed configurations to study the scalability of the interest-based quantization schemes. These results illustrate the advantages of using space-based quantization in reducing both network load and overall simulation execution time. A real world application, relating to ballistic missiles simulation, demonstrates the operation of the DEVS/GDDM environment. 
Theoretical and empirical results of the ballistic missiles application show that the space-based quantization scheme, especially with predictive and multiplexing extensions, is very effective and scalable due to reduced local computation demands and extremely favorable communication data reduction with a reasonably small potential for error. This realistic case study establishes that the DEVS/GDDM environment can provide scalable distributed simulation for practical, real-world applications.
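
As a rough illustration of why quantization reduces message traffic, the sketch below sends a state update only when the value has moved at least one quantum since the last transmitted value, and picks the quantum from the receiver's distance. It is a generic sketch, not the DEVS/GDDM implementation: the names `Quantizer`, `interest_quantum`, and `count_messages`, and the near/far rule with its constants, are assumptions made for the example.

```python
class Quantizer:
    """Transmit a value only when it differs from the last sent value by >= quantum."""

    def __init__(self, quantum):
        self.quantum = quantum
        self.last_sent = None

    def update(self, value):
        """Return the value if it should be transmitted, otherwise None."""
        if self.last_sent is None or abs(value - self.last_sent) >= self.quantum:
            self.last_sent = value
            return value
        return None


def interest_quantum(distance, base=1.0, near=10.0, far_factor=5.0):
    """Hypothetical interest rule: nearby receivers get fine updates, distant ones coarse."""
    return base if distance <= near else base * far_factor


def count_messages(values, quantum):
    """Count how many of the values would actually be sent with the given quantum."""
    quantizer = Quantizer(quantum)
    return sum(1 for v in values if quantizer.update(v) is not None)


# A position sampled every 0.5 units from 0.0 to 10.0 (21 samples).
trajectory = [i * 0.5 for i in range(21)]
near_count = count_messages(trajectory, interest_quantum(distance=5.0))
far_count = count_messages(trajectory, interest_quantum(distance=50.0))
```

With these assumed constants, the nearby receiver gets 11 of the 21 samples and the distant receiver gets 3. The price is the bounded error the abstract mentions: a receiver's view can lag the true value by up to one quantum.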


35

Lofstead, Gerald Fredrick. "Extreme scale data management in high performance computing." Diss., Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/37232.


Abstract:

Extreme scale data management in high performance computing requires consideration of the end-to-end scientific workflow process. Of particular importance for runtime performance, the write-read cycle must be addressed as a complete unit. Any optimization made to enhance writing performance must consider the subsequent impact on reading performance. Only by addressing the full write-read cycle can scientific productivity be enhanced. The ADIOS middleware developed as part of this thesis provides an API nearly as simple as the standard POSIX interface, but with the flexibility to choose what transport mechanism(s) to employ at or during runtime. The accompanying BP file format is designed for high performance parallel output with limited coordination overheads while incorporating features to accelerate subsequent use of the output for reading operations. These paired optimizations of the output mechanism and the output format are designed so that they either do not negatively impact or greatly improve subsequent reading performance when compared to popular self-describing file formats. This end-to-end advantage of the ADIOS architecture is further enhanced through techniques to better enable asynchronous data transports, affording the incorporation of 'in flight' data processing operations and pseudo-transport mechanisms that can trigger workflows or other operations.


36

Weigel, Tobias [Verfasser], and Thomas [Akademischer Betreuer] Ludwig. "Persistent Identifiers for Earth Science Data Management / Tobias Weigel. Betreuer: Thomas Ludwig." Hamburg : Staats- und Universitätsbibliothek Hamburg, 2016. http://d-nb.info/1097561712/34.


37

Weigel, Tobias [Verfasser], and Thomas [Akademischer Betreuer] Ludwig. "Persistent Identifiers for Earth Science Data Management / Tobias Weigel. Betreuer: Thomas Ludwig." Hamburg : Staats- und Universitätsbibliothek Hamburg, 2016. http://d-nb.info/1097561712/34.


38

Rosenfeld, Abraham M. "Data collection and management of a mobile sensor platform." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/85486.


Abstract:

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.
Cataloged from PDF version of thesis.
Includes bibliographical references (page 53).
This thesis explores the development of a platform to better collect and manage data from multiple sensor inputs mounted on a car sensor platform. Specifically, it focuses on the collection and synchronization of multiple forms of data across a single mobile sensor system. The project will be implemented for three versions of a light-sensing platform, and will cover the different methods of data collection and different types of sensor devices implemented in each version. It will also cover the different technical challenges faced when collecting and managing data across multiple mobile sensors.
by Abraham M. Rosenfeld.
M. Eng.


39

Lisanskiy, Ilya 1976. "A data model for the Haystack document management system." Thesis, Massachusetts Institute of Technology, 1999. http://hdl.handle.net/1721.1/80103.


Abstract:

Thesis (S.B. and M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.
Includes bibliographical references (p. 97-98).
by Ilya Lisanskiy.
S.B. and M.Eng.


40

Lu, Kaiyuan. "Data distribution management schemes for HLA-compliant distributed simulation systems." Thesis, University of Ottawa (Canada), 2006. http://hdl.handle.net/10393/27151.


Abstract:

Data Distribution Management (DDM), one of the six services provided by High Level Architecture and Run-Time Infrastructure, provides an efficient and scalable mechanism for data routing among hosts in distributed simulations. Traditionally, DDM schemes are classified into two main types: region-based methods and grid-based methods. Currently, the time, computation and communication overhead of DDM schemes are still issues for large-scale simulations. We propose two new DDM schemes addressing these issues. Our first algorithm, which we refer to as the optimized dynamic grid-based DDM scheme, aims at further reducing the irrelevant data that might be received by simulation entities in the dynamic grid-based approach [11], by enforcing a second level of sender-side data filtering. Our second algorithm, which we refer to as grid-filtered region-based DDM, uses a threshold value of coverage percentage to determine if exact matching is necessary. In this thesis, we present and discuss the implementation of our proposed DDM algorithms, and report on their performance based on an extensive set of simulation experiments. Last but not least, we present the preliminary work we have done on a real-time enabling scheme for the RTI in HLA-compliant simulations.
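
One way to read the coverage-threshold idea in the grid-filtered region-based scheme is sketched below in Python. The abstract does not give the algorithm's details, so everything here is an assumption for illustration: regions are 2D axis-aligned rectangles, the grid has square cells, and a pair of regions is accepted without exact matching when both cover some shared cell by at least the threshold fraction. The names `Rect`, `coverage`, `cells_touched`, and `matches` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_area(a, b):
    """Area of the intersection of two rectangles (0.0 if they do not overlap)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0.0


def coverage(region, cell):
    """Fraction of the cell's area that the region covers."""
    return overlap_area(region, cell) / overlap_area(cell, cell)


def cells_touched(region, cell_size):
    """Yield the grid cells (as Rects) that the region overlaps."""
    for i in range(math.floor(region.x0 / cell_size), math.ceil(region.x1 / cell_size)):
        for j in range(math.floor(region.y0 / cell_size), math.ceil(region.y1 / cell_size)):
            yield Rect(i * cell_size, j * cell_size, (i + 1) * cell_size, (j + 1) * cell_size)


def matches(update, subscription, cell_size=10.0, threshold=0.6):
    """Return (matched, used_exact_test) for an update and a subscription region."""
    sub_cells = set(cells_touched(subscription, cell_size))
    shared = [c for c in cells_touched(update, cell_size) if c in sub_cells]
    if not shared:
        return False, False  # no common grid cell, so no match is possible
    for cell in shared:
        if coverage(update, cell) >= threshold and coverage(subscription, cell) >= threshold:
            return True, False  # high coverage: accept without exact matching
    return overlap_area(update, subscription) > 0, True  # low coverage: test exactly
```

Under these assumptions, a threshold above 0.5 can never produce a false match, because two regions that each cover more than half of the same cell must intersect. A lower threshold would skip more exact tests at the cost of occasionally delivering irrelevant data.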


41

Fumai, Nicola. "A database for an intensive care unit patient data management system." Thesis, McGill University, 1992. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=22500.


Abstract:

Computerization has had a large impact on hospital intensive care units, allowing continuous monitoring and display of physiological patient data. Treatment of the critically ill patient, however, now requires assimilating large amounts of patient data.
Computers can help by processing the data and displaying the information in easy to understand formats. Also, knowledge-based systems can provide advice in diagnosis and treatment of patients. If these systems are to be effective, they must be integrated into the total hospital information system and the separate computer data must be jointly integrated into a new database which will become the primary medical record.
This thesis presents the design and implementation of a computerized database for an intensive care unit patient data management system being developed for the Montreal Children's Hospital. The database integrates data from the various PDMS components into one logical information store. The patient data currently managed includes physiological parameter data, patient administrative data and fluid balance data.
A simulator design is also described, which allows for thorough validation and verification of the Patient Data Management System. This simulator can easily be extended for use as a teaching and training tool for PDMS users.
The database and simulator were developed in C and implemented under the OS/2 operating system environment. The database is based on the OS/2 Extended Edition relational Database Manager.


42

Yang, Haofan. "Reputation modelling in citizen science for environmental acoustic data analysis." Thesis, Queensland University of Technology, 2012. https://eprints.qut.edu.au/54657/1/Haofan_Yang_Thesis.pdf.


Abstract:

Citizen Science projects are initiatives in which members of the general public participate in scientific research projects and perform or manage research-related tasks such as data collection and/or data annotation. Citizen Science is technologically possible and scientifically significant. However, as the gathered information is from the crowd, the data quality is always hard to manage. There are many ways to manage data quality, and reputation management is one of the common approaches. In recent years, many research teams have deployed audio or image sensors in natural environments in order to monitor the status of animals or plants. The collected data will be analysed by ecologists. However, as the amount of collected data is exceedingly large and the number of ecologists is very limited, it is impossible for scientists to manually analyse all these data. The functions of existing automated tools to process the data are still very limited and the results are still not very accurate. Therefore, researchers have turned to recruiting general citizens who are interested in helping scientific research to do the pre-processing tasks such as species tagging. Although research teams can save time and money by recruiting general citizens to volunteer their time and skills to help data analysis, the reliability of contributed data varies a lot. Therefore, this research aims to investigate techniques to enhance the reliability of data contributed by general citizens in scientific research projects, especially for acoustic sensing projects. In particular, we aim to investigate how to use reputation management to enhance data reliability. Reputation systems have been used to reduce uncertainty and improve data quality in many marketing and E-Commerce domains. The commercial organizations which have chosen to embrace reputation management and implement the technology have gained many benefits. 
Data quality issues are significant to the domain of Citizen Science due to the quantity and diversity of people and devices involved. However, research on reputation management in this area is relatively new. We therefore start our investigation by examining existing reputation systems in different domains. Then we design novel reputation management approaches for Citizen Science projects to categorise participants and data. We have investigated some critical elements which may influence data reliability in Citizen Science projects. These elements include personal information such as location and education and performance information such as the ability to recognise certain bird calls. The designed reputation framework is evaluated by a series of experiments involving many participants for collecting and interpreting data, in particular, environmental acoustic data. Our research in exploring the advantages of reputation management in Citizen Science (or crowdsourcing in general) will help increase awareness among organizations that are unacquainted with its potential benefits.


43

Wang, Yanchao. "Protein Structure Data Management System." Digital Archive @ GSU, 2007. http://digitalarchive.gsu.edu/cs_diss/20.


Abstract:

With advances in laboratory instruments and experimental techniques, the volume of protein data is growing explosively. How to efficiently store, retrieve and modify protein data is therefore becoming a challenging issue that most biological scientists have to face and solve. Traditional data models such as the relational database lack support for complex data types, which is a big issue for protein data applications. Hence many scientists switch to object-oriented databases, since the object-oriented nature of life science data matches the architecture of object-oriented databases well, but there are still many problems to be solved in order to apply OODB methodologies to manage protein data. One major problem is that general-purpose OODBs do not have any built-in data types for biological research or built-in biological domain-specific functional operations. In this dissertation, we present an application system with built-in data types and built-in biological domain-specific functional operations that extends the Object-Oriented Database (OODB) system by adding domain-specific layers, Protein-QL, Protein Algebra Architecture and Protein-OODB, above the OODB to manage protein structure data. This system is composed of three parts: 1) A Client API that provides easy usage for different users. 2) Middleware, including Protein-QL, Protein Algebra Architecture and Protein-OODB, that implements a protein domain-specific query language and optimizes complex queries; it also encapsulates the details of the implementation so that users can easily understand and master Protein-QL. 3) Data Storage, which is used to store the protein data. This system is for the protein domain, but it can be easily extended into other biological domains to build a bio-OODBMS. 
In this system, protein, primary, secondary, and tertiary structures are defined as internal data types to simplify the queries in Protein-QL so that domain scientists can easily master the query language and formulate data requests, and EyeDB is used as the underlying OODB to communicate with Protein-OODB. In addition, protein data is usually stored in the PDB format, which is old, ambiguous, and inadequate; PDB data curation is therefore discussed in detail in the dissertation.


44

Nowak, Hans II (Hans Antoon). "Strategic capacity planning using data science, optimization, and machine learning." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/126914.


Abstract:

Thesis: M.B.A., Massachusetts Institute of Technology, Sloan School of Management, in conjunction with the Leaders for Global Operations Program at MIT, May, 2020
Thesis: S.M., Massachusetts Institute of Technology, Department of Mechanical Engineering, in conjunction with the Leaders for Global Operations Program at MIT, May, 2020
Cataloged from the official PDF of thesis.
Includes bibliographical references (pages 101-104).
Raytheon's Circuit Card Assembly (CCA) factory in Andover, MA is Raytheon's largest factory and the largest Department of Defense (DOD) CCA manufacturer in the world. With over 500 operations, it manufactures over 7000 unique parts with a high degree of complexity and varying levels of demand. Recently, the factory has seen an increase in demand, making the ability to continuously analyze factory capacity and strategically plan for future operations much needed. This study seeks to develop a sustainable strategic capacity optimization model and capacity visualization tool that integrates demand data with historical manufacturing data. Through automated data mining algorithms of factory data sources, capacity utilization and overall equipment effectiveness (OEE) for factory operations are evaluated. Machine learning methods are then assessed to gain an accurate estimate of cycle time (CT) throughout the factory. Finally, a mixed-integer nonlinear program (MINLP) integrates the capacity utilization framework and machine learning predictions to compute the optimal strategic capacity planning decisions. Capacity utilization and OEE models are shown to be able to be generated through automated data mining algorithms. Machine learning models are shown to have a mean absolute error (MAE) of 1.55 on predictions for new data, which is 76.3% lower than the current CT prediction error. Finally, the MINLP is solved to optimality within a tolerance of 1.00e-04 and generates resource and production decisions that can be acted upon.
by Hans Nowak II.
M.B.A.
S.M.
M.B.A. Massachusetts Institute of Technology, Sloan School of Management
S.M. Massachusetts Institute of Technology, Department of Mechanical Engineering
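
For readers checking the error figures in the abstract above, MAE is conventionally the mean absolute error. The short derivation below restates that standard definition and works out the baseline error implied by the reported numbers. It assumes that "76.3% lower" is a relative reduction measured in the same units as the reported 1.55; the abstract does not state the units, and the symbol $E_{\text{current}}$ is introduced here only for illustration.

```latex
\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right|
\]
% y_i: observed cycle time, \hat{y}_i: predicted cycle time, n: number of predictions.
\[
\mathrm{MAE}_{\text{model}} = (1 - 0.763)\,E_{\text{current}}
\quad\Longrightarrow\quad
E_{\text{current}} = \frac{1.55}{0.237} \approx 6.54
\]
```

Under that reading, the existing cycle time prediction error would be roughly 6.5 in the same units.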


45

Ahmad, Yasmeen. "Management, visualisation & mining of quantitative proteomics data." Thesis, University of Dundee, 2012. https://discovery.dundee.ac.uk/en/studentTheses/6ed071fc-e43b-410c-898d-50529dc298ce.


Abstract:

Exponential data growth in life sciences demands cross-disciplinary work that brings together computing and life sciences in a usable manner that can enhance knowledge and understanding in both fields. High-throughput approaches, advances in instrumentation and the overall complexity of mass spectrometry data have made it impossible for researchers to manually analyse data using existing market tools. By applying a user-centred approach to effectively capture the domain knowledge and experience of biologists, this thesis has bridged the gap between computation and biology through the PepTracker software (http://www.peptracker.com). This software provides a framework for the systematic detection and analysis of proteins that can be correlated with biological properties to expand the functional annotation of the genome. The tools created in this study aim to place analysis capabilities back in the hands of biologists, who are experts in evaluating their data. Another major advantage of the PepTracker suite is the implementation of a data warehouse, which manages and collates highly annotated experimental data from numerous experiments carried out by many researchers. This repository captures the collective experience of a laboratory, which can be accessed via user-friendly interfaces. Rather than viewing datasets as isolated components, this thesis explores the potential that can be gained from collating datasets in a “super-experiment” ideology, leading to the formation of broad-ranging questions and promoting biology-driven lines of questioning. This has been uniquely implemented by integrating tools and techniques from the field of Business Intelligence with Life Sciences and successfully shown to aid in the analysis of proteomic interaction experiments. Having established a means of documenting a static proteomics snapshot of cells, the proteomics field is progressing towards understanding the extremely complex nature of cell dynamics. PepTracker facilitates this by providing the means to gather and analyse many protein properties to generate new biological insight, as demonstrated by the identification of novel protein isoforms.

46

Sridharan, Vaikunth. "Sensor Data Streams Correlation Platform for Asthma Management." Wright State University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=wright1527546937956439.

47

Ousterhout, Amy (Amy Elizabeth). "Flexplane : a programmable data plane for resource management in datacenters." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/101584.

Abstract:

Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 47-51).
Network resource management schemes can significantly improve the performance of datacenter applications. However, it is difficult to experiment with and evaluate these schemes today because they require modifications to hardware routers. To address this, we introduce Flexplane, a programmable network data plane for datacenters. Flexplane enables users to express their schemes in a high-level language (C++) and then run real datacenter applications over them at hardware rates. We demonstrate that Flexplane can accurately reproduce the behavior of schemes already supported in hardware (e.g., RED, DCTCP) and can be used to experiment with new schemes not yet supported in hardware, such as HULL. We also show that Flexplane is scalable and has the potential to support large networks.

48

Cates, Josh 1977. "Robust and efficient data management for a distributed hash table." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/87381.

49

Tsai, Eva Y. (Eva Yi-hua). "Inter-database data quality management : a relational-model based approach." Thesis, Massachusetts Institute of Technology, 1996. http://hdl.handle.net/1721.1/40202.

50

Mukkara, Anurag. "Techniques to improve dynamic cache management with static data classification." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/105962.

Abstract:

Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 55-59).
Cache hierarchies are increasingly non-uniform and difficult to manage. Several techniques, such as scratchpads or reuse hints, use static information about how programs access data to manage the memory hierarchy. Static techniques are effective on regular programs, but because they set fixed policies, they are vulnerable to changes in program behavior or available cache space. Instead, most systems rely on dynamic caching policies that adapt to observed program behavior. Unfortunately, dynamic policies spend significant resources trying to learn how programs use memory, and yet they often perform worse than a static policy. This thesis presents Whirlpool, a novel approach that combines static information with dynamic policies to reap the benefits of each. Whirlpool statically classifies data into pools based on how the program uses memory. Whirlpool then uses dynamic policies to tune the cache to each pool. Hence, rather than setting policies statically, Whirlpool uses static analysis to guide dynamic policies. Whirlpool provides both an API that lets programmers specify pools manually and a profiling tool that discovers pools automatically in unmodified binaries. On a state-of-the-art NUCA cache, Whirlpool significantly outperforms prior approaches: on sequential programs, Whirlpool improves performance by up to 38% and reduces data movement energy by up to 53%; on parallel programs, Whirlpool improves performance by up to 67% and reduces data movement energy by up to 2.6x.
