Dissertations / Theses on the topic 'Database Scalability'


Listed below are the top 28 dissertations / theses on the topic 'Database Scalability.'


1

Yee, Wai Gen. "Improving the performance and scalability of intermittently synchronized database systems." Diss., Georgia Institute of Technology, 2003. http://hdl.handle.net/1853/8311.

Full text
2

Håkansson, Kristina, and Andreas Rosenqvist. "Evaluation of CockroachDB in a cloud-native environment." Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-21671.

Full text
Abstract:
The increased demand for large databases that scale easily and stay consistent requires service providers to find new solutions for storing data. One solution that has emerged is cloud-native databases. Service providers who can transition effectively to cloud-native databases will benefit from new enterprise applications, industrial automation, Internet of Things (IoT), and consumer services such as gaming and AR/VR. This changes the requirements on a database's architecture and infrastructure in terms of compatibility with the services deployed in a cloud-native environment - this is where CockroachDB comes into the picture. CockroachDB is relatively new and is built from the ground up to run in a cloud-native environment. It is built up of nodes that work as individual machines, and these nodes form a cluster. The authors of this report aim to evaluate the characteristics of the Cockroach database to understand what it offers companies that are in a cloud-infrastructure transition phase. For the scope of characteristics, this report focuses on performance, throughput, stress testing, version hot-swapping, horizontal/vertical scaling, and node disruptions. To do this, a CockroachDB database was deployed on a Kubernetes cluster, in which simulated traffic was conducted. For the throughput measurement, the TPC-C transaction processing benchmark was used. For scaling, version hot-swapping, and node disruptions, an experimental method was performed. The results of the study confirm the expected outcome: CockroachDB does in fact scale easily, both horizontally and vertically, with minimal effort. They also show that throughput remains the same when the cluster is scaled up and out, since CockroachDB does not have a master write-node, which is the case with some other databases. CockroachDB also has built-in functionality to handle configuration changes like version hot-swapping and node disruptions. This study concluded that CockroachDB lives up to its promises regarding the subjects handled in the report, and can be seen as a robust, easily scalable database that can be deployed in a cloud-native environment.
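As a flavor of the kind of client-side workload driver used in such an evaluation, the sketch below connects to a CockroachDB cluster over its PostgreSQL-compatible wire protocol and retries transactions on serialization conflicts (SQLSTATE 40001), which CockroachDB's documentation recommends handling client-side. The host, database, table, and credentials are hypothetical.

```python
# Minimal sketch: driving CockroachDB from Python via psycopg2.
# CockroachDB speaks the PostgreSQL wire protocol; host/db/table are hypothetical.
import psycopg2

def run_with_retry(conn, stmt, args=(), max_retries=5):
    """Execute one transaction, retrying on serialization conflicts (40001)."""
    for _ in range(max_retries):
        try:
            with conn.cursor() as cur:
                cur.execute(stmt, args)
            conn.commit()
            return
        except psycopg2.Error as e:
            conn.rollback()
            if e.pgcode != "40001":      # only retry serialization failures
                raise
    raise RuntimeError("transaction kept failing after retries")

conn = psycopg2.connect(host="cockroach.example", port=26257,
                        dbname="bench", user="bench", sslmode="disable")
run_with_retry(conn, "INSERT INTO kv (k, v) VALUES (%s, %s)", ("a", "1"))
```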
3

Mathiason, Gunnar. "Virtual Full Replication for Scalable Distributed Real-Time Databases." Doctoral thesis, Linköpings universitet, Institutionen för datavetenskap, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-20661.

Full text
Abstract:
A fully replicated distributed real-time database provides high availability and predictable access times, independent of user location, since all the data is available at each node. However, full replication requires that all updates are replicated to every node, resulting in exponential growth of bandwidth and processing demands with the number of nodes and objects added. To eliminate this scalability problem, while retaining the advantages of full replication, this thesis explores Virtual Full Replication (ViFuR); a technique that gives database users a perception of using a fully replicated database while only replicating a subset of the data. We use ViFuR in a distributed main memory real-time database where timely transaction execution is required. ViFuR enables scalability by replicating only data used at the local nodes. Also, ViFuR enables flexibility by adaptively replicating the currently used data, effectively providing logical availability of all data objects. Hence, ViFuR substantially reduces the problem of non-scalable resource usage of full replication, while allowing timely execution and access to arbitrary data objects. In the thesis we pursue ViFuR by exploring the use of database segmentation. We give a scheme (ViFuR-S) for static segmentation of the database prior to execution, where access patterns are known a priori. We also give an adaptive scheme (ViFuR-A) that changes segmentation during execution to meet the evolving needs of database users. Further, we apply an extended approach of adaptive segmentation (ViFuR-ASN) in a wireless sensor network - a typical dynamic large-scale and resource-constrained environment. We use up to several hundreds of nodes and thousands of objects per node, and apply a typical periodic transaction workload with operation modes where the used data set changes dynamically. We show that when replacing full replication with ViFuR, resource usage scales linearly with the required number of concurrent replicas, rather than exponentially with the system size.
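To make the resource argument concrete, here is a toy cost model (my own illustration, not the thesis's formal model) contrasting the update-propagation cost of full replication with that of replicating an object only to the nodes that actually use it:

```python
# Toy cost model (illustrative only): messages needed to propagate updates.
def full_replication_cost(nodes, updates):
    # every update must reach every other node
    return updates * (nodes - 1)

def virtual_full_replication_cost(replicas_per_object, updates):
    # updates only reach the nodes holding a replica of the object
    return updates * (replicas_per_object - 1)

print(full_replication_cost(nodes=100, updates=10_000))            # 990000
print(virtual_full_replication_cost(replicas_per_object=3,
                                    updates=10_000))               # 20000
```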
4

Umair, Muhammad. "Performance Evaluation and Elastic Scaling of an IP Multimedia Subsystem Implemented in a Cloud." Thesis, KTH, Radio Systems Laboratory (RS Lab), 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-124578.

Full text
Abstract:
The IP Multimedia Subsystem (IMS) framework is a Next Generation Network (NGN) technology which enables telecommunication operators to provide multimedia services over fixed and mobile networks. All of the IMS infrastructure protocols work over IP, which makes IMS easy to deploy on a cloud platform. The purpose of this thesis is to analyze a novel technique of “cloudifying” the OpenIMS core infrastructure. The primary goal of running OpenIMS in the cloud is to enable a highly available and horizontally scalable Home Subscriber Server (HSS). The resulting database should offer high availability and high scalability. The prototype developed in this thesis project demonstrates a virtualized OpenIMS core with an integrated horizontally scalable HSS. Functional and performance measurements of the system under test (i.e. the virtualized OpenIMS core with horizontally scalable HSS) were conducted. The results of this testing include an analysis of benchmarking scenarios, the CPU utilization, and the available memory of the virtual machines. Based on these results we conclude that it is both feasible and desirable to deploy the OpenIMS core in a cloud.
5

Mukhammadov, Ruslan. "A scalable database for a remote patient monitoring system." Thesis, KTH, Radio Systems Laboratory (RS Lab), 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-124603.

Full text
Abstract:
Today one of the fastest growing social services is the ability for doctors to monitor patients in their residences. The proposed highly scalable database system is designed to support a Remote Patient Monitoring system (RPMS). In an RPMS, a wide range of applications are enabled by collecting health-related measurement results from a number of medical devices in the patient's home, parsing and formatting these results, and transmitting them from the patient's home to specific data stores. Subsequently, another set of applications communicates with these data stores to provide clinicians with the ability to observe, examine, and analyze these health-related measurements in (near) real-time. Because of the rapid expansion in the number of patients utilizing an RPMS, it is becoming a challenge to store, manage, and process the very large number of health-related measurements being collected. The primary reason for this problem is that most RPMSs are built on top of traditional relational databases, which are inefficient when dealing with this very large amount of data (often called "big data"). This thesis project analyzes scalable data management to support RPMSs, introduces a new set of open-source technologies that efficiently store and manage any amount of data, which might be used in conjunction with such a scalable RPMS based upon HBase, implements these technologies, and, as a proof of concept, compares the prototype data management system with the performance of a traditional relational database (specifically MySQL). This comparison considers both a single node and a multi-node cluster, and evaluates several critical parameters, including performance, scalability, and load balancing (in the case of multiple nodes). The amount of data used for testing input/output (read/write) and data statistics performance is 1, 10, 50, 100, and 250 GB. The thesis presents several ways of dealing with large amounts of data and develops and evaluates a highly scalable database that could be used with an RPMS. Several software suites were used to compare both relational and non-relational systems, and these results are used to evaluate the performance of the prototype of the proposed RPMS. The results of benchmarking show that MySQL is better than HBase in terms of read performance, while HBase is better in terms of write performance. Which of these types of databases should be used to implement an RPMS is a function of the expected ratio of reads and writes. Learning this ratio should be the subject of a future thesis project.
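As a sketch of how the non-relational side of such a read/write comparison might be driven, the snippet below writes and reads patient measurements through HBase's Thrift gateway using the happybase library. The host, table name, and column family are hypothetical assumptions.

```python
# Minimal sketch: timing HBase writes/reads with happybase (Thrift gateway).
# Host, table name, and column family are hypothetical.
import time
import happybase

conn = happybase.Connection("hbase.example")   # requires a running Thrift server
table = conn.table("measurements")

start = time.perf_counter()
for i in range(1000):
    table.put(f"patient42-{i}".encode(),
              {b"vitals:heart_rate": str(60 + i % 40).encode()})
write_s = time.perf_counter() - start

start = time.perf_counter()
for i in range(1000):
    table.row(f"patient42-{i}".encode())
read_s = time.perf_counter() - start

print(f"writes: {1000/write_s:.0f} ops/s, reads: {1000/read_s:.0f} ops/s")
```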
6

Vrbík, Tomáš. "Srovnání distribuovaných "NoSQL" databází s důrazem na výkon a škálovatelnost." Master's thesis, Vysoká škola ekonomická v Praze, 2011. http://www.nusl.cz/ntk/nusl-124673.

Full text
Abstract:
This paper focuses on NoSQL database systems. These systems currently serve as a supplement to, rather than a replacement for, relational database systems. The aim of the paper is to compare four selected NoSQL database systems (MongoDB, Apache Cassandra, Apache HBase and Redis) with a main focus on performance and scalability. The performance comparison is done using a simulated workload in a four-node cluster environment. One relational SQL database is also benchmarked, to provide a comparison between the classic and the modern way of maintaining structured data. As the result of the comparison, I found that none of these database systems can be labeled "the best", as each of the compared systems is suitable for a different production deployment.
7

RODRIGUES, JUNIOR Paulo Lins. "Upper: uma ferramenta para escolha de servidor e estimação de gatilhos de escalabilidade de banco de dados relacionais na plataforma Amazon AWS." Universidade Federal de Pernambuco, 2013. https://repositorio.ufpe.br/handle/123456789/17509.

Full text
Abstract:
The scalability of an application is of vital importance to the success of a business, and it is considered one of the most important attributes of today's applications. Many applications are now data-centric, which makes the database a critical layer in the whole system structure. Among the existing types of databases, relational databases stand out, primarily for providing the level of consistency that most of these applications need. Projecting infrastructure and scalability triggers is a complex task even for senior professionals, and errors in these tasks can result in significant business losses. The cloud computing platform, in particular the infrastructure-as-a-service model, is advantageous because it provides a low initial investment and on-demand scaling. To benefit from the advantages offered by the platform, system administrators still have the difficult task of selecting the appropriate server, as well as estimating the right time to scale, meeting the needs of the application while ensuring efficient resource allocation. This work proposes a simulation environment to aid in choosing the appropriate server and the scalability triggers of a database server on Amazon Web Services, a leading platform for cloud computing services. The main contribution of this tool, called Upper, is to facilitate the work of the system administrator, enabling the estimation task to be performed faster and more accurately.
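As a toy illustration of what a scalability trigger amounts to (my own sketch, not Upper's actual model), the function below fires a scale-up decision when a rolling average of CPU utilization crosses a threshold; the window size and threshold are hypothetical:

```python
# Toy scalability trigger (illustrative): scale up when the rolling average
# of CPU utilization stays above a threshold. Parameters are hypothetical.
from collections import deque

def make_trigger(window=5, threshold=0.75):
    samples = deque(maxlen=window)
    def observe(cpu_utilization):
        samples.append(cpu_utilization)
        full = len(samples) == window
        return full and sum(samples) / window > threshold
    return observe

should_scale = make_trigger()
for cpu in [0.55, 0.70, 0.80, 0.85, 0.90, 0.92]:
    if should_scale(cpu):
        print("trigger: provision a larger database server")
```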
8

Xiong, Fanfan. "Resource Efficient Parallel VLDB with Customizable Degree of Redundancy." Diss., Temple University Libraries, 2009. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/43445.

Full text
Abstract:
This thesis focuses on the practical use of very large scale relational databases. It leverages two recent breakthroughs in parallel and distributed computing: a) synchronous transaction replication technologies by Justin Y. Shi and Suntain Song; and b) the Stateless Parallel Processing principle pioneered by Justin Y. Shi. These breakthroughs enable scalable performance and reliability of database service using multiple redundant shared-nothing database servers. This thesis presents a Functional Horizontal Partitioning method with a customizable degree of redundancy to address practical very large scale database application problems. The prototype VLDB implementation is designed for transparent, non-intrusive deployments. The prototype system supports Microsoft SQL Server databases. Computational experiments are conducted using an industry-standard benchmark (TPC-E).
9

Gottemukkala, Vibby. "Scalability issues in distributed and parallel databases." Diss., Georgia Institute of Technology, 1996. http://hdl.handle.net/1853/8176.

Full text
10

Mathew, Ajit. "Multicore Scalability Through Asynchronous Work." Thesis, Virginia Tech, 2020. http://hdl.handle.net/10919/104116.

Full text
Abstract:
With the end of Moore's Law, computer architects have turned to multicore architectures to provide high performance. Unfortunately, to achieve higher performance, multicores require programs to be parallelized, which is an untamed problem. Amdahl's law states that the maximum theoretical speedup of a program is dictated by the size of its non-parallelizable section. Hence, to achieve higher performance, programmers need to reduce the size of the sequential code in the program. This thesis explores asynchronous work as a means to reduce the sequential portions of programs. Using asynchronous work, a programmer can remove tasks which do not affect data consistency from the critical path and perform them using a background thread. Using this idea, the thesis introduces two systems. First, a synchronization mechanism, Multi-Version Read-Log-Update (MV-RLU), which extends Read-Log-Update (RLU) through multi-versioning. At the core of the MV-RLU design is a concurrent garbage collection algorithm which reclaims obsolete versions asynchronously, reducing blocking of threads. Second, a concurrent and highly scalable index structure for multicores called Hydralist. The key idea behind the design of Hydralist is that an index structure can be divided into two components (a search layer and a data layer), where updates to the data layer are done synchronously while updates to the search layer are propagated asynchronously using background threads.

Up until the mid-2000s, Moore's law predicted that CPU performance doubled every two years, because improvements in transistor technology allowed smaller transistors that could switch at higher frequencies, leading to faster CPU clocks. But faster clocks lead to higher heat dissipation, and as chips reached their thermal limits, computer architects could no longer increase clock speeds. Hence they moved to multicore architectures, wherein a single die contains multiple CPUs, to allow higher performance. Programmers are now required to parallelize their code to take advantage of all the CPUs in a chip, which is a non-trivial problem. The theoretical speedup achieved by a program on a multicore architecture is dictated by Amdahl's law, which identifies the non-parallelizable code in a program as the limiting factor for speedup. For example, a program with 99% parallelizable code can achieve a speedup of 20, whereas a program with 50% parallelizable code can only achieve a speedup of 2. Therefore, to achieve high speedup, programmers need to reduce the size of the serial section in their program. One way to do this is to remove non-critical tasks from the sequential section and perform them asynchronously using a background thread. This thesis explores this technique in two systems. First, a synchronization mechanism used to coordinate access to shared resources, Multi-Version Read-Log-Update (MV-RLU), which achieves high performance by removing garbage collection from the critical path and performing it asynchronously using a background thread. Second, an index structure, Hydralist, based on the insight that an index structure can be decomposed into two components, a search layer and a data layer, and that decoupling updates to the two layers allows higher performance: updates to the data layer are done synchronously, while updates to the search layer are propagated asynchronously using background threads. Evaluation shows that both systems perform better than state-of-the-art competitors on a variety of workloads.
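The idea of moving non-critical work off the critical path can be sketched in a few lines: a writer hands obsolete versions to a queue and continues immediately, while a background thread reclaims them. This is a generic Python analogue of the pattern, and assumes nothing about MV-RLU's actual implementation.

```python
# Generic sketch of asynchronous reclamation: the writer never blocks on
# garbage collection; a background thread drains retired versions instead.
import queue
import threading

retired = queue.Queue()

def reclaimer():
    while True:
        version = retired.get()
        if version is None:          # shutdown sentinel
            break
        # ... free or recycle the obsolete version here ...
        retired.task_done()

threading.Thread(target=reclaimer, daemon=True).start()

def update(old_version, new_version):
    # critical path: install the new version, then defer cleanup
    retired.put(old_version)         # O(1); no blocking on reclamation
    return new_version

update({"v": 1}, {"v": 2})
retired.join()                       # wait for the background GC to catch up
```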
11

Kuruganti, NSR Sankaran. "Distributed databases for Multi Mediation : Scalability, Availability & Performance." Thesis, Blekinge Tekniska Högskola, Institutionen för kommunikationssystem, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-1018.

Full text
Abstract:
Context: Multi Mediation is a process of collecting data from network(s) and network elements, pre-processing this data, and distributing it to various systems like Big Data analysis, billing systems, network monitoring systems, and service assurance. With the growing demand for networks and the emergence of new services, the data collected from networks is growing. There is a need to organize this data efficiently, and this can be done using databases. Although RDBMS offers scale-up solutions to handle voluminous data and concurrent requests, this approach is expensive, so alternatives like distributed databases are an attractive solution. A suitable distributed database for Multi Mediation needs to be investigated. Objectives: In this research we analyze two distributed databases in terms of performance, scalability and availability, and also analyze the inter-relations between these three properties. The distributed databases analyzed are MySQL Cluster 7.4.4 and Apache Cassandra 2.0.13. Performance, scalability and availability are quantified, and measurements are made in the context of a Multi Mediation system. Methods: The methods used to carry out this research are both qualitative and quantitative. A qualitative study is made for the selection of databases for evaluation. A benchmarking harness application is designed to quantitatively evaluate the performance of each distributed database in the context of Multi Mediation. Several experiments are designed and performed using the benchmarking harness on the database cluster. Results: The results collected include the average response time and average throughput of the distributed databases in various scenarios. The average throughput and average INSERT response time results favor the Apache Cassandra low-availability configuration. MySQL Cluster's average SELECT response time is better than Apache Cassandra's for greater numbers of client threads, in both high-availability and low-availability configurations. Conclusions: Although Apache Cassandra outperforms MySQL Cluster, support for transactions and ACID compliance should not be forgotten when selecting a database. Apart from the contextual benchmarks, organizational choices, development costs, resource utilization etc. are more influential parameters for the selection of a database within an organization. There is still a need for further evaluation of distributed databases.
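A benchmarking harness of this kind ultimately boils down to timing client requests against the cluster. The sketch below measures average INSERT response time with the DataStax Python driver; the contact points, keyspace, and table are hypothetical.

```python
# Minimal sketch of one benchmarking-harness step: average INSERT latency
# against a Cassandra cluster. Contact points/keyspace/table are hypothetical.
import time
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra1.example", "cassandra2.example"])
session = cluster.connect("mediation")
insert = session.prepare("INSERT INTO records (id, payload) VALUES (?, ?)")

latencies = []
for i in range(1000):
    start = time.perf_counter()
    session.execute(insert, (i, "cdr-payload"))
    latencies.append(time.perf_counter() - start)

print(f"avg INSERT response time: {sum(latencies)/len(latencies)*1000:.2f} ms")
```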
12

Cary, Ariel. "Scaling Geospatial Searches in Large Spatial Databases." FIU Digital Commons, 2011. http://digitalcommons.fiu.edu/etd/548.

Full text
Abstract:
Modern geographical databases store a rich set of aspatial attributes in addition to geographic data. Retrieving spatial records constrained on spatial and aspatial attributes gives users the ability to perform more interesting spatial analyses via composite spatial searches; e.g., in a real estate database, "Find the nearest homes for sale to my current location that have a backyard and whose prices are between $50,000 and $80,000". Efficient processing of such composite searches requires combined indexing strategies over multiple types of data. Existing spatial query engines commonly apply a two-filter approach (a spatial filter followed by a non-spatial filter, or vice versa), which can incur large performance overheads. At the same time, the amount of geolocation data in databases is rapidly increasing, due in part to advances in geolocation technologies (e.g., GPS-enabled mobile devices) that allow location data to be associated with nearly every object or event; hence, practical spatial databases may face data ingestion challenges from large data volumes. In this dissertation, we first show how indexing spatial data with R-trees (a typical data pre-processing task) can be scaled in MapReduce, a well-adopted parallel programming model developed by Google for data-intensive problems. Close to linear scalability was observed in index construction tasks over large spatial datasets. Subsequently, we develop novel techniques for simultaneously indexing spatial data together with textual and numeric data to process k-nearest-neighbor searches with aspatial Boolean selection constraints. In particular, numeric ranges are compactly encoded and explicitly indexed. Experimental evaluations with real spatial databases showed query response times within acceptable ranges for interactive search systems.
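The map/reduce split for bulk R-tree construction can be pictured as: map each rectangle to a spatial partition key, then have each reducer bulk-load its partition into its own R-tree. Below is a single-process Python sketch of those two phases using the rtree package; the grid-based partitioning, cell size, and data are my own illustrative assumptions, not the dissertation's actual scheme.

```python
# Single-process sketch of MapReduce-style R-tree construction:
# map: rectangle -> grid cell (partition key); reduce: one R-tree per cell.
from collections import defaultdict
from rtree import index   # libspatialindex bindings

rectangles = [(1, (0.1, 0.1, 0.2, 0.2)), (2, (5.0, 5.0, 5.5, 5.5)),
              (3, (0.4, 0.3, 0.6, 0.5))]   # (id, (minx, miny, maxx, maxy))

def map_phase(rects, cell=1.0):
    for rid, (minx, miny, maxx, maxy) in rects:
        key = (int(minx // cell), int(miny // cell))   # partition key
        yield key, (rid, (minx, miny, maxx, maxy))

def reduce_phase(grouped):
    trees = {}
    for key, items in grouped.items():
        idx = index.Index()
        for rid, bbox in items:
            idx.insert(rid, bbox)
        trees[key] = idx
    return trees

grouped = defaultdict(list)
for key, value in map_phase(rectangles):
    grouped[key].append(value)
trees = reduce_phase(grouped)
print({k: len(list(t.intersection((0, 0, 10, 10)))) for k, t in trees.items()})
```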
13

Mesmoudi, Amin. "Declarative parallel query processing on large scale astronomical databases." Thesis, Lyon 1, 2015. http://www.theses.fr/2015LYO10326.

Full text
Abstract:
This work was carried out in the framework of the PetaSky project, whose objective is to provide a set of tools for managing tens of peta-bytes of data from astronomical observations. Our work is concerned with the design of new, scalable systems, and our contributions cover three aspects: benchmarking of existing systems, design of a new system, and optimization of that system. We first analyzed the ability of MapReduce-based systems supporting SQL to manage the LSST data and to optimize certain types of queries, studying the impact of data partitioning, indexing and compression on query performance. From our experiments, it follows that there is no "magic" technique to partition, store and index data; the efficiency of dedicated techniques depends mainly on the type of queries and the typology of the data considered. Based on this benchmarking work, we identified several techniques to be integrated into a large-scale data management system. We then designed a new system that supports multiple partitioning mechanisms and several evaluation operators, using BSP (Bulk Synchronous Parallel) as the parallel computation model. Unlike the MapReduce model, we send intermediate results to workers that can continue their processing. Data is logically represented as a graph, and queries are evaluated by exploring the data graph using forward and backward edges. We also offer a semi-automatic partitioning approach, i.e., we provide the system administrator with a set of tools for choosing how to partition data using the database schema and domain knowledge. The first experiments show that our approach provides a significant performance improvement with respect to MapReduce systems.
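A BSP evaluation loop of the kind described alternates local computation with a global barrier. Here is a minimal, generic sketch (not PetaSky's actual engine): in each superstep, every vertex processes its inbox and emits messages along graph edges until no messages remain.

```python
# Generic BSP sketch: supersteps over a graph until quiescence.
graph = {"a": ["b"], "b": ["c"], "c": []}       # forward edges (hypothetical)
state = {v: 0 for v in graph}
inbox = {"a": [1]}                              # initial message(s)

superstep = 0
while inbox:                                    # global barrier per iteration
    outbox = {}
    for vertex, messages in inbox.items():
        state[vertex] += sum(messages)          # local compute
        for neighbor in graph[vertex]:          # send along forward edges
            outbox.setdefault(neighbor, []).append(state[vertex])
    inbox = outbox
    superstep += 1

print(superstep, state)   # 3 supersteps; the value propagated a -> b -> c
```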
14

Wu, Hengzhi. "Techniques for improving efficiency and scalability for the integration of information retrieval and databases." Thesis, Queen Mary, University of London, 2010. http://qmro.qmul.ac.uk/xmlui/handle/123456789/374.

Full text
Abstract:
This thesis is on the topic of the integration of Information Retrieval (IR) and Databases (DB), with a particular focus on improving the efficiency and scalability of integrated IR and DB technology (IR+DB). The main purpose of this study is to develop efficient and scalable techniques for supporting integrated IR and DB technology, which is a popular approach today for handling complex queries over text and structured data. Our specific interest in this thesis is how to efficiently handle queries over large-scale text and structured data. The work is based on a technology that integrates probability theory and relational algebra, where retrievals over text and data are expressed in probabilistic logical programs such as probabilistic relational algebra (PRA) or probabilistic Datalog. To support efficient processing of probabilistic logical programs, we propose three optimization techniques covering the logical and physical layers: scoring-driven query optimization using scoring expressions, query processing with a top-k incorporated pipeline, and indexing with a relational inverted index. Specifically, scoring expressions are proposed for expressing the scoring or probabilistic semantics of the implied scoring functions of PRA expressions, so that efficient query execution plans can be generated by a rule-based scoring-driven optimizer. Secondly, to balance efficiency and effectiveness and thereby improve query response time, we study methods for incorporating top-k algorithms into the pipelined query execution engine of IR+DB systems. Thirdly, the proposed relational inverted index integrates an IR-style inverted index with a DB-style tuple-based index, and can be used to support efficient probability estimation and aggregation as well as conventional relational operations. Experiments were carried out to investigate the performance of the proposed techniques. The experimental results showed that the efficiency and scalability of an IR+DB prototype were improved, and that the system can handle queries efficiently on considerably large data sets for a number of IR tasks.
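The core of a top-k incorporated pipeline is keeping only the k best candidates while streaming scored tuples. Here is a minimal generic sketch with heapq (an illustration of the idea, not the thesis's engine):

```python
# Minimal top-k operator sketch: stream scored tuples, keep the k best.
import heapq

def topk(scored_tuples, k):
    heap = []                              # min-heap of (score, item)
    for score, item in scored_tuples:
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:           # beats the current k-th best
            heapq.heapreplace(heap, (score, item))
    return sorted(heap, reverse=True)      # best first

stream = [(0.3, "d1"), (0.9, "d2"), (0.5, "d3"), (0.7, "d4")]
print(topk(stream, k=2))   # [(0.9, 'd2'), (0.7, 'd4')]
```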
15

Pecsérke, Róbert. "Podpora MongoDB pro UnifiedPush Server." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255415.

Full text
Abstract:
This master's thesis deals with the design and implementation of an extension for the UnifiedPush Server that allows the server to access the non-relational database MongoDB, exploiting the horizontal-scalability potential of non-relational databases. The work also includes the design of performance tests and a comparison of performance when running on one node and on several nodes, the design of a migration scenario from MySQL to MongoDB, and the identification of bottlenecks. The application is implemented in Java and uses the Java Persistence API to access databases; for access to non-relational databases it uses Hibernate OGM, an implementation of the JPA standard.
16

Petera, Martin. "Srovnání distribuovaných "No-SQL" databází s důrazem na výkon a škálovatelnost." Master's thesis, Vysoká škola ekonomická v Praze, 2014. http://www.nusl.cz/ntk/nusl-193233.

Full text
Abstract:
This thesis deals with the issue of NoSQL database performance. The aim of the paper is to compare the most common representatives of distributed database systems, with emphasis on performance and scalability. The Yahoo! Cloud Serving Benchmark (YCSB) is used to accomplish this aim; the YCSB tool allows performance testing through indicators like throughput and response time. The thesis then thoroughly explains how to work with this tool, giving readers the opportunity to test the performance of, or compare, distributed database systems other than those described here. It also enables readers to create a testing environment and apply the testing method described in this thesis should they need it. This paper can serve as an aid when making the difficult choice of a specific system, from the wide variety of NoSQL database systems, for an intended solution.
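YCSB is driven from the command line with a load phase followed by a run phase; a typical pair of invocations against a Cassandra target looks like the sketch below, wrapped in Python here for consistency with the other examples. The YCSB installation path and host are hypothetical; "workloada" is one of YCSB's bundled core workloads.

```python
# Sketch: invoking YCSB's load and run phases from Python.
# Paths and hosts are hypothetical; flags follow YCSB's documented CLI.
import subprocess

YCSB = "/opt/ycsb/bin/ycsb"
common = ["-P", "/opt/ycsb/workloads/workloada",
          "-p", "hosts=cassandra1.example"]

subprocess.run([YCSB, "load", "cassandra-cql", *common], check=True)
result = subprocess.run([YCSB, "run", "cassandra-cql", *common],
                        capture_output=True, text=True, check=True)
print(result.stdout)   # includes [OVERALL] Throughput(ops/sec) and latencies
```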
17

Li, Honghao. "Interpretable biological network reconstruction from observational data." Electronic Thesis or Diss., Université Paris Cité, 2021. http://www.theses.fr/2021UNIP5207.

Full text
Abstract:
This thesis is focused on constraint-based methods, one of the basic types of causal structure learning algorithms. We use the PC algorithm as a representative, for which we propose a simple and general modification that is applicable to any PC-derived method. The modification ensures that all separating sets, used during the skeleton reconstruction step to remove edges between conditionally independent variables, remain consistent with respect to the final graph. It consists in iterating the structure learning algorithm while restricting the search for separating sets to those that are consistent with respect to the graph obtained at the end of the previous iteration. The restriction can be achieved with limited computational complexity with the help of a block-cut tree decomposition of the graph skeleton. Enforcing separating-set consistency is found to increase the recall of constraint-based methods at the cost of precision, while keeping similar or better overall performance. It also improves the interpretability and explainability of the obtained graphical model. We then introduce the recently developed constraint-based method MIIC, which adopts ideas from the maximum likelihood framework to improve the robustness and overall performance of the obtained graph. We discuss the characteristics and the limitations of MIIC, and propose several modifications that emphasize the interpretability of the obtained graph and the scalability of the algorithm. In particular, we implement the iterative approach to enforce separating-set consistency, opt for a conservative orientation rule, and exploit the orientation probability feature of MIIC to extend the edge notation in the final graph to illustrate different causal implications. The MIIC algorithm is applied to a dataset of about 400,000 breast cancer records from the SEER database as a large-scale, real-life benchmark.
18

Zawirski, Marek. "Cohérence à terme fiable avec des types de données répliquées." Thesis, Paris 6, 2015. http://www.theses.fr/2015PA066638/document.

Full text
Abstract:
Eventually consistent replicated databases offer excellent responsiveness and fault-tolerance, but expose applications to the complexity of concurrency and failures. Recent databases encapsulate these problems behind a stronger interface, supporting causal consistency, which protects the application from ordering anomalies, and/or Replicated Data Types (RDTs), which ensure convergent semantics of concurrent updates through an object interface. However, dependable algorithms for RDTs and causal consistency come at a cost in metadata size. This thesis studies the design of such algorithms with minimized metadata, and the limits of the design space. Our first contribution is a study of the metadata complexity of RDTs. RDTs use metadata to provide rich semantics, and many existing RDT implementations incur high overhead in storage space. We design optimized set and register RDTs with metadata overhead reduced to the number of replicas. We also demonstrate metadata lower bounds for six RDTs, thereby proving the optimality of four implementations. Our second contribution is the design of SwiftCloud, a replicated causally-consistent RDT object database for client-side applications. We devise algorithms to support high numbers of client-side partial replicas backed by the cloud, in a fault-tolerant manner, with small metadata. We demonstrate how to support availability and consistency at the expense of some slight data staleness; i.e., our approach trades freshness for scalability (small metadata, parallelism) and availability (the ability to fail over between data centers). We validate our approach with experiments involving thousands of client replicas.
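For a flavor of what an RDT's convergent semantics means, here is a minimal sketch of a grow-only set (G-Set), one of the simplest replicated data types: merging is set union, so concurrent updates commute and replicas converge. This is a generic illustration, not one of the thesis's optimized designs.

```python
# Minimal G-Set RDT sketch: add-only; replicas converge by set union.
class GSet:
    def __init__(self):
        self.items = set()
    def add(self, x):             # local update
        self.items.add(x)
    def merge(self, other):       # commutative, associative, idempotent
        self.items |= other.items

a, b = GSet(), GSet()
a.add("x"); b.add("y")            # concurrent updates at two replicas
a.merge(b); b.merge(a)            # anti-entropy exchange
assert a.items == b.items == {"x", "y"}   # convergence
```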
19

Gorisse, David. "Passage à l’échelle des méthodes de recherche sémantique dans les grandes bases d’images." Thesis, Cergy-Pontoise, 2010. http://www.theses.fr/2010CERG0519/document.

Full text
Abstract:
With the digital revolution of the last decade, the quantity of digital photos available to everyone has grown faster than the processing capacity of computers. Current search tools were designed to handle small data volumes; their complexity generally makes it impossible to search large corpora with computation times acceptable to users. In this thesis, we propose solutions for scaling content-based image search engines. First, we considered automatic search engines that handle images indexed as global histograms. Scalability for these systems is obtained by introducing a new index structure adapted to this context, which lets us perform approximate but more efficient nearest-neighbor searches. Second, we turned to more sophisticated engines that improve search quality by working with local indexes such as interest points. Finally, we proposed a strategy to reduce the computational complexity of interactive search engines, which improve results by using annotations that users supply to the system during search sessions. Our strategy quickly selects the most relevant images to annotate by optimizing an active learning method.
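Approximate nearest-neighbor search of the kind used to scale such engines is often built on hashing. Below is a tiny random-hyperplane LSH sketch (a generic illustration, not the thesis's index structure): feature vectors are bucketed by the sign pattern of random projections, so a query only scans its own bucket.

```python
# Tiny LSH sketch: bucket vectors by signs of random projections, then
# search only the query's bucket (approximate nearest neighbors).
import numpy as np

rng = np.random.default_rng(0)
planes = rng.normal(size=(8, 64))           # 8 hyperplanes over 64-d features

def bucket(v):
    return tuple((planes @ v > 0).astype(int))

db = {i: rng.random(64) for i in range(10_000)}
table = {}
for i, v in db.items():
    table.setdefault(bucket(v), []).append(i)

query = rng.random(64)
candidates = table.get(bucket(query), [])   # tiny fraction of the database
best = min(candidates, key=lambda i: np.linalg.norm(db[i] - query), default=None)
print(len(candidates), best)
```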
20

Zawirski, Marek. "Cohérence à terme fiable avec des types de données répliquées." Electronic Thesis or Diss., Paris 6, 2015. http://www.theses.fr/2015PA066638.

Full text
Abstract:
Eventually consistent replicated databases offer excellent responsiveness and fault-tolerance, but expose applications to the complexity of concurrency and failures. Recent databases encapsulate these problems behind a stronger interface, supporting causal consistency, which protects the application from ordering anomalies, and/or Replicated Data Types (RDTs), which ensure convergent semantics of concurrent updates through an object interface. However, dependable algorithms for RDTs and causal consistency come at a cost in metadata size. This thesis studies the design of such algorithms with minimized metadata, and the limits of the design space. Our first contribution is a study of the metadata complexity of RDTs. RDTs use metadata to provide rich semantics, and many existing RDT implementations incur high overhead in storage space. We design optimized set and register RDTs with metadata overhead reduced to the number of replicas. We also demonstrate metadata lower bounds for six RDTs, thereby proving the optimality of four implementations. Our second contribution is the design of SwiftCloud, a replicated causally-consistent RDT object database for client-side applications. We devise algorithms to support high numbers of client-side partial replicas backed by the cloud, in a fault-tolerant manner, with small metadata. We demonstrate how to support availability and consistency at the expense of some slight data staleness; i.e., our approach trades freshness for scalability (small metadata, parallelism) and availability (the ability to fail over between data centers). We validate our approach with experiments involving thousands of client replicas.
21

Rafiq, Taha. "Elasca: Workload-Aware Elastic Scalability for Partition Based Database Systems." Thesis, 2013. http://hdl.handle.net/10012/7525.

Full text
Abstract:
Providing the ability to increase or decrease allocated resources on demand as the transactional load varies is essential for database management systems (DBMS) deployed on today's computing platforms, such as the cloud. The need to maintain consistency of the database, at very large scales, while providing high performance and reliability, makes elasticity particularly challenging. In this thesis, we exploit data partitioning as a way to provide elastic DBMS scalability. We assert that the flexibility provided by a partitioned, shared-nothing parallel DBMS can be used to implement elasticity. Our idea is to start with a small number of servers that manage all the partitions, and to elastically scale out by dynamically adding new servers and redistributing database partitions among these servers as the load varies. Implementing this approach requires (a) efficient mechanisms for the addition/removal of servers and the migration of partitions, and (b) policies to efficiently determine the optimal placement of partitions on the given servers, as well as plans for partition migration. This thesis presents Elasca, a system that implements both these features in an existing shared-nothing DBMS (namely VoltDB) to provide automatic elastic scalability. Elasca consists of a mechanism for enabling elastic scalability, and a workload-aware optimizer for determining optimal partition placement and migration plans. Our optimizer minimizes the computing resources required and balances load effectively without compromising system performance, even in the presence of variations in the intensity and skew of the load. The results of our experiments show that Elasca is able to achieve performance close to that of a fully provisioned system while saving 35% of resources on average. Furthermore, Elasca's workload-aware optimizer performs up to 79% less data movement than a greedy approach to resource minimization, and also balances load much more effectively.
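A useful baseline for the partition-placement problem that such an optimizer improves on is the classic greedy heuristic: repeatedly assign the heaviest partition to the currently least-loaded server. A minimal sketch of that greedy baseline follows (my own illustration, not Elasca's optimizer); the partition loads are hypothetical request rates.

```python
# Greedy (LPT) partition placement sketch: the heaviest remaining partition
# goes to the currently least-loaded server. Loads are hypothetical.
import heapq

def greedy_placement(partition_loads, n_servers):
    servers = [(0.0, s, []) for s in range(n_servers)]   # (load, id, partitions)
    heapq.heapify(servers)
    for pid, load in sorted(partition_loads.items(), key=lambda kv: -kv[1]):
        total, sid, parts = heapq.heappop(servers)       # least-loaded server
        parts.append(pid)
        heapq.heappush(servers, (total + load, sid, parts))
    return sorted(servers, key=lambda s: s[1])

for load, sid, parts in greedy_placement(
        {"p1": 30, "p2": 10, "p3": 25, "p4": 20}, n_servers=2):
    print(f"server {sid}: {parts} (load {load})")
```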
22

Leszczyński, Paweł. "An update propagator for joint scalable storage." Doctoral thesis, 2012. http://depotuw.ceon.pl/handle/item/140.

Full text
Abstract:
In recent years, the scalability of web applications has become critical. Web sites get more dynamic and customized, which increases servers' workload. Furthermore, the future increase of load is difficult to predict. Thus, the industry seeks solutions that scale well. With current technology, almost all items of system architectures can be multiplied when necessary. There are, however, problems with databases in this respect. The traditional approach with a single relational database has become insufficient. In order to achieve scalability, architects add a number of different kinds of storage facilities. This can be error prone because of inconsistencies in the stored data. In this paper, we present a novel method to assemble systems with multiple storages. We propose an algorithm for update propagation among different storages like multi-column, key-value, and relational databases. We also apply this algorithm to consistent object caching, which reduces database workload and makes web applications perform significantly better. Next, we describe PropScale, a proof-of-concept implementation of the proposed algorithm. Using this system we have conducted an experimental evaluation of our solution. The results prove its robustness.
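The heart of such an update propagator is fanning a single logical write out to every registered store while invalidating caches so readers never see stale data. A minimal in-memory sketch of the pattern follows (generic; PropScale's actual algorithm also has to handle ordering and failures):

```python
# Minimal update-propagation sketch: one logical write fans out to all
# registered stores; the cache is invalidated before propagation.
class Propagator:
    def __init__(self):
        self.stores = []          # e.g., relational, key-value, multi-column
        self.cache = {}
    def register(self, store):
        self.stores.append(store)
    def write(self, key, value):
        self.cache.pop(key, None)               # invalidate first
        for store in self.stores:
            store[key] = value                  # propagate to every backend
    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.stores[0][key]   # fill from primary store
        return self.cache[key]

p = Propagator()
relational, kv = {}, {}
p.register(relational); p.register(kv)
p.write("user:1", {"name": "Ada"})
assert p.read("user:1") == relational["user:1"] == kv["user:1"]
```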
23

Du, Toit Petrus. "An evaluation of non-relational database management systems as suitable storage for user generated text-based content in a distributed environment." Diss., 2016. http://hdl.handle.net/10500/21613.

Full text
Abstract:
Non-relational database management systems address some of the limitations relational database management systems have when storing large volumes of unstructured, user-generated text-based data in distributed environments. They follow different approaches through the data model they use, their ability to scale data storage over distributed servers, and the programming interface they provide. An experimental approach was followed to measure how these alternative database management systems address the limitations of relational databases, in terms of their capability to store unstructured text-based data, their data warehousing capabilities, their ability to scale data storage across distributed servers, and the level of programming abstraction they provide. The results of the research highlighted the limitations of relational database management systems. The different database management systems do address certain limitations, but not all. Document-oriented databases provide the best results and successfully address the need to store large volumes of user-generated text-based data in a distributed environment.
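For a sense of the document model that fared best, the sketch below stores and queries schemaless user-generated posts with pymongo; the host, database name, collection, and document fields are hypothetical.

```python
# Minimal sketch: storing unstructured user-generated text in MongoDB.
# Host, database, collection, and document fields are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo.example:27017")
posts = client.ugc.posts

posts.insert_one({
    "user": "alice",
    "text": "Loving the new release!",
    "tags": ["release", "feedback"],      # no fixed schema required
})
for doc in posts.find({"tags": "release"}).limit(5):
    print(doc["user"], doc["text"])
```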
24

Minhas, Umar Farooq. "Scalable and Highly Available Database Systems in the Cloud." Thesis, 2013. http://hdl.handle.net/10012/7194.

Full text
Abstract:
Cloud computing allows users to tap into a massive pool of shared computing resources such as servers, storage, and network. These resources are provided as a service to the users allowing them to “plug into the cloud” similar to a utility grid. The promise of the cloud is to free users from the tedious and often complex task of managing and provisioning computing resources to run applications. At the same time, the cloud brings several additional benefits including: a pay-as-you-go cost model, easier deployment of applications, elastic scalability, high availability, and a more robust and secure infrastructure. One important class of applications that users are increasingly deploying in the cloud is database management systems. Database management systems differ from other types of applications in that they manage large amounts of state that is frequently updated, and that must be kept consistent at all scales and in the presence of failure. This makes it difficult to provide scalability and high availability for database systems in the cloud. In this thesis, we show how we can exploit cloud technologies and relational database systems to provide a highly available and scalable database service in the cloud. The first part of the thesis presents RemusDB, a reliable, cost-effective high availability solution that is implemented as a service provided by the virtualization platform. RemusDB can make any database system highly available with little or no code modifications by exploiting the capabilities of virtualization. In the second part of the thesis, we present two systems that aim to provide elastic scalability for database systems in the cloud using two very different approaches. The three systems presented in this thesis bring us closer to the goal of building a scalable and reliable transactional database service in the cloud.
APA, Harvard, Vancouver, ISO, and other styles
25

Guo, Zhaochen. "Entity resolution for large relational datasets." Master's thesis, 2010. http://hdl.handle.net/10048/924.

Full text
Abstract:
Thesis (M.Sc.)--University of Alberta, 2010. Title from PDF file main screen (viewed on Apr. 16, 2010). A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Master of Science, Department of Computing Science, University of Alberta. Includes bibliographical references.
APA, Harvard, Vancouver, ISO, and other styles
26

Xie, Junyi. "Handling Resource Constraints and Scalability in Continuous Query Processing." Diss., 2007. http://hdl.handle.net/10161/455.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Karyakin, Alexey. "Dynamic Scale-out Mechanisms for Partitioned Shared-Nothing Databases." Thesis, 2011. http://hdl.handle.net/10012/6312.

Full text
Abstract:
For a database system used in pay-per-use cloud environments, elastic scaling becomes an essential feature, allowing costs to be minimized while accommodating fluctuations in load. One approach to scalability involves horizontal database partitioning and dynamic migration of partitions between servers. We define a scale-out operation as the combination of provisioning a new server followed by the migration of one or more partitions to the newly allocated server. In this thesis, we study the efficiency of different implementations of the scale-out operation in the context of online transaction processing (OLTP) workloads. We designed and implemented three migration mechanisms featuring different strategies for data transfer. The first is based on SnowFlock, a modification of the Xen hypervisor, and uses on-demand block transfers for both server provisioning and partition migration. The second is implemented in a database management system (DBMS) and uses bulk transfers for partition migration, optimized for higher bandwidth utilization. The third is a conventional application that uses SQL commands to copy partitions between servers. We perform an experimental comparison of these scale-out mechanisms for disk-bound and CPU-bound configurations. When comparing the mechanisms, we analyze their impact on whole-system performance and on the experience of individual clients.
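To make the third mechanism concrete, here is a rough Python sketch of a SQL-level partition copy between servers, in the spirit of the "conventional application" the abstract describes. The DSNs, table name, partition column, and provision_server helper are illustrative assumptions, and the sketch ignores concurrent updates, which a real migration mechanism must handle.

import psycopg2

def provision_server() -> str:
    # Stand-in for allocating a new server in the cloud; a real system
    # would call the provider's API. Returns the new server's DSN.
    return "dbname=shard2 host=10.0.0.12 user=app"

def migrate_partition(src_dsn: str, dst_dsn: str, table: str, part: int) -> None:
    with psycopg2.connect(src_dsn) as src, psycopg2.connect(dst_dsn) as dst:
        with src.cursor() as rd, dst.cursor() as wr:
            # Row-at-a-time copy over SQL: simple, but far from the
            # bandwidth-optimized bulk transfer of the DBMS-level mechanism.
            rd.execute(f"SELECT * FROM {table} WHERE partition_id = %s", (part,))
            for row in rd:
                marks = ", ".join(["%s"] * len(row))
                wr.execute(f"INSERT INTO {table} VALUES ({marks})", row)
        # Once the copy has succeeded, remove the partition from the source.
        with src.cursor() as rd:
            rd.execute(f"DELETE FROM {table} WHERE partition_id = %s", (part,))

# Scale-out = provision a new server, then migrate one or more partitions.
migrate_partition("dbname=shard1 host=10.0.0.11 user=app",
                  provision_server(), "orders", part=7)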
APA, Harvard, Vancouver, ISO, and other styles
28

Soares, João Paulo da Conceição. "Scaling In-Memory databases on multicores." Doctoral thesis, 2015. http://hdl.handle.net/10362/17095.

Full text
Abstract:
Current computer systems have evolved from featuring only a single processing unit and limited RAM, on the order of kilobytes or a few megabytes, to include several multicore processors, offering on the order of several tens of concurrent execution contexts, and main memory on the order of several tens to hundreds of gigabytes. This allows all data of many applications to be kept in main memory, leading to the development of in-memory databases. Compared to disk-backed databases, in-memory databases (IMDBs) are expected to provide better performance by incurring less I/O overhead. In this dissertation, we present a scalability study of two general-purpose IMDBs on multicore systems. The results show that current general-purpose IMDBs do not scale on multicores, due to contention among threads running concurrent transactions. In this work, we explore different directions to overcome the scalability issues of IMDBs on multicores, while enforcing strong isolation semantics. First, we present a solution that requires no modification to either database systems or applications, called MacroDB. MacroDB replicates the database among several engines, using a master-slave replication scheme, where update transactions execute on the master, while read-only transactions execute on slaves. This reduces contention, allowing MacroDB to offer scalable performance under read-only workloads, while update-intensive workloads suffer a performance loss when compared to the standalone engine. Second, we delve into the database engine and identify the concurrency control mechanism used by the storage sub-component as a scalability bottleneck. We then propose a new locking scheme that allows the removal of such mechanisms from the storage sub-component. This modification offers performance improvements under all workloads, when compared to the standalone engine, while scalability remains limited to read-only workloads. Next, we address the scalability limitations for update-intensive workloads, and propose reducing the locking granularity from the table level to the attribute level. This further improves performance for intensive and moderate update workloads, at a slight cost for read-only workloads. Scalability remains limited to read-intensive and read-only workloads. Finally, we investigate the impact applications have on the performance of database systems, by studying how the order of operations inside transactions influences database performance. We then propose a Read-before-Write (RbW) interaction pattern, under which transactions perform all read operations before executing write operations. The RbW pattern allowed TPC-C to achieve scalable performance on our modified engine for all workloads. Additionally, the RbW pattern allowed our modified engine to achieve scalable performance on multicores, almost up to the total number of cores, while enforcing strong isolation.
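The RbW idea is easy to see in miniature. Below is a hedged Python sketch using the standard sqlite3 module; the account schema and transfer logic are invented for illustration and are not taken from the thesis's modified engine or its TPC-C code.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])

def transfer_interleaved(cur, src, dst, amount):
    # Reads and writes interleave: under lock-based concurrency control,
    # write locks can be taken early and held while later reads run,
    # increasing contention between concurrent transactions.
    bal = cur.execute("SELECT balance FROM account WHERE id=?", (src,)).fetchone()[0]
    cur.execute("UPDATE account SET balance=? WHERE id=?", (bal - amount, src))
    bal2 = cur.execute("SELECT balance FROM account WHERE id=?", (dst,)).fetchone()[0]
    cur.execute("UPDATE account SET balance=? WHERE id=?", (bal2 + amount, dst))

def transfer_rbw(cur, src, dst, amount):
    # RbW: the same logical transaction, restructured so every read happens
    # before the first write, shrinking the window in which writes block others.
    bal = cur.execute("SELECT balance FROM account WHERE id=?", (src,)).fetchone()[0]
    bal2 = cur.execute("SELECT balance FROM account WHERE id=?", (dst,)).fetchone()[0]
    cur.execute("UPDATE account SET balance=? WHERE id=?", (bal - amount, src))
    cur.execute("UPDATE account SET balance=? WHERE id=?", (bal2 + amount, dst))

with conn:
    transfer_rbw(conn.cursor(), 1, 2, 25)
print(conn.execute("SELECT * FROM account").fetchall())  # [(1, 75), (2, 75)]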
APA, Harvard, Vancouver, ISO, and other styles