Dissertations / Theses on the topic 'Mapreduce'

Consult the top 50 dissertations / theses for your research on the topic 'Mapreduce.'
1

Gault, Sylvain. "Improving MapReduce Performance on Clusters." Thesis, Lyon, École normale supérieure, 2015. http://www.theses.fr/2015ENSL0985/document.

Full text
Abstract:
Nowadays, more and more scientific fields rely on the analysis and mining of massive datasets to produce new results. These raw data are produced at ever-increasing rates by various instruments such as DNA sequencers in biology, the Large Hadron Collider (LHC), which produced 25 petabytes per year as of 2012, or large telescopes such as the Large Synoptic Survey Telescope (LSST), which is expected to produce 30 petabytes per night. High-resolution scanners in medical imaging and social-network analysis also produce huge volumes of data. This data deluge raises several challenges in terms of storage and computer processing. In 2004, Google proposed using the MapReduce computing model to distribute the computation across many machines. This thesis focuses mainly on improving the performance of a MapReduce environment. In order to easily replace the software building blocks needed to improve performance, a modular and adaptable design of a MapReduce environment is necessary. This is why a component-based approach is studied for designing such a programming environment. In order to study the performance of a MapReduce application, the platform, the application, and their performance must be modeled. These models should be precise enough for the algorithms using them to produce meaningful results, yet simple enough to be analyzed. A state of the art of existing models is given and a new model matching the optimization needs is defined. To optimize a MapReduce environment, the first approach studied is a global optimization, which results in a computation time reduced by up to 47%. The second approach focuses on the shuffle phase of MapReduce, in which every node may send data to every other node. Several algorithms are defined and studied for the case where the network is the bottleneck of the data transfers. These algorithms are evaluated on the Grid'5000 experimental platform and usually exhibit behavior close to the lower bound, whereas the naive approach is far from it.
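The shuffle algorithms studied in the thesis are not reproduced here, but the all-to-all pattern they target can be sketched. In this minimal Python illustration (the function name and the round-robin schedule are generic assumptions, not the thesis's algorithms), each node sends to exactly one distinct receiver per round, the classic way to avoid contention when the network is the bottleneck:

```python
# Illustrative sketch only: one classic way to organise an all-to-all
# shuffle so that every node talks to exactly one peer per round.

def shuffle_rounds(n):
    """Yield rounds of (sender, receiver) pairs covering all n*(n-1)
    ordered transfers, one distinct partner per node per round."""
    for step in range(1, n):
        yield [(i, (i + step) % n) for i in range(n)]

if __name__ == "__main__":
    for r, pairs in enumerate(shuffle_rounds(4)):
        print(f"round {r}: {pairs}")
```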
2

Polo, Jordà. "Multi-constraint scheduling of MapReduce workloads." Doctoral thesis, Universitat Politècnica de Catalunya, 2014. http://hdl.handle.net/10803/276174.

Full text
Abstract:
In recent years there has been an extraordinary growth of large-scale data processing and related technologies in both industry and academic communities. This trend is mostly driven by the need to explore the increasingly large amounts of information that global companies and communities are able to gather, and has led to the introduction of new tools and models, most of which are designed around the idea of handling huge amounts of data. A good example of this trend towards improved large-scale data processing is MapReduce, a programming model intended to ease the development of massively parallel applications, and which has been widely adopted to process large datasets thanks to its simplicity. While the MapReduce model was originally used primarily for batch data processing in large static clusters, nowadays it is mostly deployed along with other kinds of workloads in shared environments in which multiple users may be submitting concurrent jobs with completely different priorities and needs: from small, almost interactive executions, to very long applications that take hours to complete. Scheduling and selecting tasks for execution is extremely relevant in MapReduce environments since it governs a job's opportunity to make progress and determines its performance. However, only basic primitives to prioritize between jobs are available at the moment, constantly causing either under- or over-provisioning, as the amount of resources needed to complete a particular job is not obvious a priori. This thesis aims to address both the lack of management capabilities and the increased complexity of the environments in which MapReduce is executed. To that end, new models and techniques are introduced in order to improve the scheduling of MapReduce in the presence of different constraints found in real-world scenarios, such as completion time goals, data locality, hardware heterogeneity, or availability of resources. The focus is on improving the integration of MapReduce with the computing infrastructures in which it usually runs, allowing alternative techniques for dynamic management and provisioning of resources. More specifically, the thesis focuses on three scenarios that are incremental in scope. First, it studies the prospects of using high-level performance criteria to manage and drive the performance of MapReduce applications, taking advantage of the fact that MapReduce is executed in controlled environments in which the status of the cluster is known. Second, it examines the feasibility and benefits of making the MapReduce runtime more aware of the underlying hardware and the characteristics of applications. And finally, it also considers the interaction between MapReduce and other kinds of workloads, proposing new techniques to handle these increasingly complex environments.
Following the three scenarios described above, this thesis contributes to the management of MapReduce workloads by 1) proposing a performance model for MapReduce workloads and a scheduling algorithm that leverages the proposed model and is able to adapt depending on the various needs of its users in the presence of completion time constraints; 2) proposing a new resource model for MapReduce and a placement algorithm aware of the underlying hardware as well as the characteristics of the applications, capable of improving cluster utilization while still being guided by job performance metrics; and 3) proposing a model for shared environments in which MapReduce is executed along with other kinds of workloads such as transactional applications, and a scheduler aware of these workloads and their expected demand for resources, capable of improving resource utilization across machines while observing completion time goals.
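A minimal sketch of the kind of deadline-driven reasoning a completion-time-goal scheduler performs (the function and the figures below are illustrative assumptions, not the thesis's algorithm): estimate how many concurrent slots a job needs from its remaining work and the time left until its goal.

```python
import math

# Illustrative only: estimate the slots a job needs to meet its
# completion time goal from its remaining work.

def slots_needed(pending_tasks, avg_task_seconds, seconds_to_goal):
    """Concurrent slots required to finish pending_tasks before the goal."""
    if seconds_to_goal <= 0:
        return pending_tasks              # goal already missed: max out
    remaining_work = pending_tasks * avg_task_seconds
    return math.ceil(remaining_work / seconds_to_goal)

# 120 remaining tasks averaging 30 s, 15 minutes to the goal -> 4 slots
print(slots_needed(120, 30.0, 900))
```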
3

Nilsson, Johan. "Hadoop MapReduce in Eucalyptus Private Cloud." Thesis, Umeå universitet, Institutionen för datavetenskap, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-51309.

Full text
Abstract:
This thesis investigates how a private cloud can be set up using the Eucalyptus Cloud system, along with its usability, requirements, and limitations as an open-source cloud platform providing private cloud solutions. It also studies whether using the MapReduce framework, through Apache Hadoop's implementation, on top of the private Eucalyptus Cloud can provide near-linear scalability in terms of time and the number of virtual machines in the cluster. Analysis has shown that Eucalyptus is lacking in a few usability areas when setting up the cloud infrastructure, in terms of private networking and DNS lookups, yet the API that Eucalyptus provides gives benefits when migrating from public clouds like Amazon. The MapReduce framework shows an initially near-linear relation, which declines as the number of virtual machines approaches the maximum capacity of the cloud infrastructure.
4

Kloss, Fernando Cesar. "Motor de transformações baseado em Mapreduce." Repositório Institucional da UFPR, 2013. http://hdl.handle.net/1884/35083.

Full text
Abstract:
The search for agility in the software development process has driven the growing adoption of model-driven technologies, paradigms, and approaches (Model-Driven Engineering). These solutions shift the focus from coding to modeling, where models are used to describe different aspects of a system at different levels of abstraction. A number of languages, standards, and tools have emerged to automate the construction and modification of models and thus support the main operation performed in this scenario: model transformations. The introduction of very large models into this context exposed a limitation of the methodology, namely its ability to handle models of that size. Scalability problems arise when models on the order of thousands of elements are used in software development processes. Recent work aimed at solving the scalability problem has explored and focused on different approaches such as model storage, fragmentation, and persistence, but little has been done with respect to model transformation tools. Building on work from other domains, we developed a model transformation engine that runs in a distributed fashion in a cloud. The solution consists of adapting a model transformation tool for distributed execution by integrating it with MapReduce. Two architecturally distinct implementations are presented, one based on transformation rules and the other based on model transformation operations. The results obtained are promising, especially for transforming large and complex models.
5

Memon, Neelam. "Anonymizing large transaction data using MapReduce." Thesis, Cardiff University, 2016. http://orca.cf.ac.uk/97342/.

Full text
Abstract:
Publishing transaction data is important to applications such as marketing research and biomedical studies. Privacy is a concern when publishing such data since they often contain person-specific sensitive information. To address this problem, different data anonymization methods have been proposed. These methods have focused on protecting the associated individuals from different types of privacy leaks as well as preserving utility of the original data. However, all these methods are sequential and are designed to process data on a single machine, hence they do not scale to large datasets. Recently, MapReduce has emerged as a highly scalable platform for data-intensive applications. In this work, we consider how MapReduce may be used to provide scalability in large transaction data anonymization. More specifically, we consider how set-based generalization methods such as RBAT (Rule-Based Anonymization of Transaction data) may be parallelized using MapReduce. Set-based generalization methods have some desirable features for transaction anonymization, but their highly iterative nature makes parallelization challenging. RBAT is a good representative of such methods. We propose a method for transaction data partitioning and representation. We also present two MapReduce-based parallelizations of RBAT. Our methods ensure scalability when the number of transaction records and the domain of items are large. Our preliminary results show that a direct parallelization of RBAT by partitioning data alone can result in significant overhead, which can offset the gains from parallel processing. We propose MR-RBAT, which generalizes our direct parallel method and allows parallelization overhead to be controlled. Our experimental results show that MR-RBAT can scale linearly to large datasets and to the available resources while retaining good data utility.
6

Hammoud, Suhel. "MapReduce network enabled algorithms for classification based on association rules." Thesis, Brunel University, 2011. http://bura.brunel.ac.uk/handle/2438/5833.

Full text
Abstract:
There is growing evidence that integrating classification and association rule mining can produce more efficient and accurate classifiers than traditional techniques. This thesis introduces a new MapReduce-based association rule miner for extracting strong rules from large datasets. This miner is later used to develop a new large-scale classifier. A new MapReduce simulator was also developed to evaluate the scalability of the proposed algorithms on MapReduce clusters. The developed association rule miner inherits MapReduce's scalability to huge datasets and to thousands of processing nodes. For finding frequent itemsets, it uses a hybrid approach between miners that use counting methods on horizontal datasets and miners that use set intersections on datasets in vertical formats. The new miner generates the same rules that are usually generated by Apriori-like algorithms because it uses the same definitions of the confidence and support thresholds. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR, MMAC and others. This thesis also introduces a new MapReduce classifier that is based on MapReduce associative rule mining. This algorithm employs different approaches in rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation methods. The new classifier works on multi-class datasets and is able to produce multi-label predictions with probabilities for each predicted label. To evaluate the classifier, 20 different datasets from the UCI data collection were used. Results show that the proposed approach is an accurate and effective classification technique, highly competitive and scalable compared with other traditional and associative classification approaches. In addition, a MapReduce simulator was developed to measure the scalability of MapReduce-based applications easily and quickly, and to capture the behaviour of algorithms on cluster environments. This also allows optimizing the configuration of MapReduce clusters to obtain better execution times and hardware utilization.
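As a rough illustration of the counting pass that support-based miners build on, here is a toy, single-machine mimic of one MapReduce iteration (function names and data are made up; the thesis's hybrid miner and its data structures are more sophisticated):

```python
from collections import defaultdict
from itertools import combinations

# Toy mimic of one MapReduce counting pass for frequent itemsets.

def map_phase(transactions, k):
    for t in transactions:
        for itemset in combinations(sorted(t), k):
            yield itemset, 1

def reduce_phase(pairs, min_support):
    counts = defaultdict(int)
    for itemset, one in pairs:
        counts[itemset] += one
    return {s: c for s, c in counts.items() if c >= min_support}

data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(reduce_phase(map_phase(data, 2), min_support=2))
```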
7

Deolikar, Piyush P. "Lecture Video Search Engine Using Hadoop MapReduce." Thesis, California State University, Long Beach, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10638908.

Full text
Abstract:

With the advent of the Internet and the ease of uploading video content to video libraries and social networking sites, video data availability increased very rapidly during this decade. Universities are uploading video tutorials in their online courses. Companies like Udemy, Coursera, Lynda, etc. have made video tutorials available over the Internet. We propose and implement a scalable solution that helps find relevant videos with respect to a query provided by the user. Our solution maintains an updated list of the available videos on the web and assigns each a rank according to its relevance. The proposed solution consists of three main components that can mutually interact. The first component, called the crawler, continuously visits and locally stores the relevant information of all the webpages with videos available on the Internet. The crawler has several threads concurrently parsing webpages. The second component obtains the inverted index of the webpages stored by the crawler. Given a query, the inverted index is used to obtain the videos that contain the words in the query. The third component computes the rank of each video. This rank is then used to display the results in order of relevance. We implement a scalable solution in the Apache Hadoop framework. Hadoop is a distributed framework that provides a distributed file system able to handle large files as well as distributed computation among the participants.
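The second component follows the classic MapReduce inverted-index pattern, which can be sketched as follows (a toy in-process mimic with made-up pages, not the thesis's Hadoop code):

```python
from collections import defaultdict

# Sketch of the classic MapReduce inverted-index pattern: map emits
# (word, url) pairs, reduce groups urls under each word.

def map_phase(pages):
    for url, text in pages.items():
        for word in set(text.lower().split()):
            yield word, url

def reduce_phase(pairs):
    index = defaultdict(list)
    for word, url in pairs:
        index[word].append(url)
    return index

pages = {"p1": "hadoop video tutorial", "p2": "hadoop lecture video"}
index = reduce_phase(map_phase(pages))
print(index["video"])   # -> ['p1', 'p2']
```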

8

Kolb, Lars. "Effiziente MapReduce-Parallelisierung von Entity Resolution-Workflows." Doctoral thesis, Universitätsbibliothek Leipzig, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-157163.

Full text
Abstract:
In recent years, the newly emerged paradigm of Infrastructure as a Service has massively changed the IT world. The provisioning of computing infrastructure by external providers makes it possible to acquire large amounts of computing power, storage, and bandwidth on demand and at short notice, without upfront investment. At the same time, both the amount of freely available data and the amount of data to be managed within companies are growing dramatically. The need to manage and analyze these data volumes efficiently required existing IT technologies to evolve, and led to the emergence of new research fields and a multitude of innovative systems. A typical characteristic of these systems is distributed storage and data processing in large clusters of commodity hardware. The MapReduce programming model in particular has gained importance over the past ten years. It enables distributed processing of large data volumes and abstracts from the details of distributed computing and the handling of hardware failures. This dissertation focuses on using the MapReduce concept to automatically parallelize computationally intensive entity resolution tasks. Entity resolution is an important subfield of information integration whose goal is to discover records, within one or several data sources, that describe the same real-world object. The dissertation incrementally presents techniques that solve various subproblems of the MapReduce-based execution of entity resolution workflows. To detect duplicates, entity resolution techniques usually compare pairs of records using several similarity measures. Evaluating the Cartesian product of n records leads to a quadratic complexity of O(n²) and is therefore only practical for small to medium-sized data sources. For data sources with more than 100,000 records, runtimes of several hours arise even with distributed execution. Therefore, so-called blocking techniques are used to reduce the search space. The underlying assumption is that records falling below a certain minimum similarity need not be compared with each other. This work presents a MapReduce-based implementation of the evaluation of the Cartesian product as well as of several well-known blocking techniques. After the records have been compared, the candidate pairs are finally classified as match or non-match. With a growing number of attribute values and similarity measures, manually specifying a high-quality strategy for combining the resulting similarity values becomes hardly manageable. For this reason, the work investigates the integration of machine learning techniques into MapReduce-based entity resolution workflows. Implementing blocking techniques with MapReduce requires partitioning the set of pairs to be compared and assigning the partitions to the available processes. The assignment is based on a semantic key that is derived from the records' attribute values according to the concrete blocking strategy. For example, when deduplicating product records, one could compare only products of the same manufacturer with each other.
Having a single process handle all records with the same key leads to severe load balancing problems under data skew, aggravated by the inherent quadratic complexity. This drastically reduces the runtime efficiency and scalability of the corresponding MapReduce programs, since a large portion of a cluster's resources remains idle while a few processes have to do most of the work. Providing various techniques for evenly utilizing the available resources is another focus of this work. Blocking strategies must always trade efficiency against data quality. A large reduction of the search space promises a significant speedup, but causes similar records, e.g. ones with erroneous attribute values, not to be compared with each other. It is therefore helpful to generate, for each record, several semantic keys derived from different attributes. However, this causes similar records to be unnecessarily compared with each other multiple times under different keys. The work therefore presents algorithms that avoid such redundant similarity computations. As the outcome of this work, the entity resolution framework Dedoop is presented, which abstracts from the developed MapReduce algorithms and enables a high-level specification of complex entity resolution workflows. Dedoop combines all techniques and optimizations presented in this work in a user-friendly system. The prototype automatically translates user-defined workflows into a set of MapReduce jobs and manages their parallel execution on MapReduce clusters. Through the full integration of the cloud services Amazon EC2 and Amazon S3 into Dedoop, and by making it publicly available, end users without MapReduce knowledge can run complex entity resolution workflows on private or dynamically provisioned external MapReduce clusters.
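The blocking idea (the map side emits a semantic key; the reduce side compares only records that share it) can be sketched as follows. This is a toy mimic with made-up product records; the dissertation's algorithms additionally handle load balancing and redundant comparisons:

```python
from collections import defaultdict
from itertools import combinations

# Sketch of MapReduce-style blocking for entity resolution: compare
# only records sharing a semantic blocking key (here: manufacturer).

def map_phase(records):
    for rec in records:
        yield rec["manufacturer"].lower(), rec

def reduce_phase(pairs, similar):
    blocks = defaultdict(list)
    for key, rec in pairs:
        blocks[key].append(rec)
    matches = []
    for recs in blocks.values():
        for a, b in combinations(recs, 2):   # quadratic only per block
            if similar(a, b):
                matches.append((a["id"], b["id"]))
    return matches

records = [
    {"id": 1, "manufacturer": "Acme", "name": "Widget 2000"},
    {"id": 2, "manufacturer": "acme", "name": "Widget-2000"},
    {"id": 3, "manufacturer": "Globex", "name": "Widget 2000"},
]
same_name = lambda a, b: a["name"].replace("-", " ") == b["name"].replace("-", " ")
print(reduce_phase(map_phase(records), same_name))   # -> [(1, 2)]
```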
9

Dyer, James. "Secure computation in the cloud using MapReduce." Thesis, University of Manchester, 2018. https://www.research.manchester.ac.uk/portal/en/theses/secure-computation-in-the-cloud-using-mapreduce(8f63dc8e-dc35-43ec-a083-9f3a6230c142).html.

Full text
Abstract:
Processing large volumes of data has become increasingly important to businesses and government [218]. One of the popular tools used in processing large sets of data is MapReduce. As new MapReduce applications are developed to be deployed in untrusted environments, such as public clouds, or to process sensitive data, or to be deployed across data centres, well-understood security measures may be deployed, such as authentication and authorisation or encryption of messages passed between cluster nodes. However, there may be situations where authorised individuals cannot be trusted, such as "rogue" system administrators, or, in the cloud, where the MapReduce cluster nodes have been compromised. Where the input data is sensitive, such as medical data, we require a means to protect this data from exposure. Furthermore, we may need to protect the intermediate data and details of the computation from snoopers to prevent information leakage. To take full advantage of MapReduce cloud computing services, we require a means to process the data securely on such a platform. We designate such a computation secure computation in the cloud (SCC). SCC should not expose input or output data to any other party, including the cloud service provider. Furthermore, the details of the computation should not allow any other party to deduce its inputs and outputs. Most importantly, we require the computations to be performed in practically reasonable time and space. The ability to perform MapReduce computations on encrypted data would offer a solution to these problems. However, this poses a significant problem in that many encryption schemes transform the original data in such a way that meaningful computation on the encrypted data is impossible. Our work aims to provide a practical SCC system (CryptMR) inspired by CryptDB [272]. Our solution details several novel cryptographic methods suitable for implementation in a secure computation in the cloud solution using MapReduce. We encrypt integer data using a novel somewhat homomorphic encryption (SHE) scheme. This SHE scheme can be made fully homomorphic. We also provide novel order-preserving encryption (OPE) and searchable symmetric encryption (SSE) schemes for the purposes of sorting and searching. Our OPE scheme is, to our knowledge, the first scheme to rely on a computationally hard primitive rather than on a security model. We have implemented all of these encryption schemes and devised experiments to test their suitability for large-scale distributed computing by integrating them into Hadoop MapReduce (MR) applications. In addition to the work on encryption schemes, we have provided a novel probabilistic method for verifying that mappers and reducers are correctly computing and reporting their results (see chapter 10). This work uses random sampling to minimise the probability that a cheating mapper or reducer can successfully report a false result. We evaluated our "proof-of-concept" implementation in a small-scale cloud environment. Our results show that sampling 5 to 10% of intermediate or final data allows us to detect cheating with very strong probability.
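The sampling-based verification idea can be sketched as follows (an illustrative toy with a made-up recompute oracle; the thesis's method and its probability analysis are more involved):

```python
import random

# Illustrative sketch of sampling-based result checking: recompute a
# random fraction of a worker's reported results and flag mismatches.
# Detection probability grows with the sample fraction and with the
# number of falsified records.

def verify(reported, recompute, fraction=0.1, rng=random):
    keys = list(reported)
    sample = rng.sample(keys, max(1, int(len(keys) * fraction)))
    return all(reported[k] == recompute(k) for k in sample)

truth = {k: k * k for k in range(1000)}
cheats = dict(truth)
for k in range(0, 1000, 20):      # falsify 5% of the results
    cheats[k] = -1

print(verify(truth, lambda k: k * k))    # True
print(verify(cheats, lambda k: k * k))   # False with ~99% probability
```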
10

Elteir, Marwa Khamis. "A MapReduce Framework for Heterogeneous Computing Architectures." Diss., Virginia Tech, 2012. http://hdl.handle.net/10919/28786.

Full text
Abstract:
Nowadays, an increasing number of computational systems are equipped with heterogeneous compute resources, i.e., resources following different architectures. This applies at the level of a single chip, a single node, and even supercomputers and large-scale clusters. With their impressive price-to-performance ratio as well as power efficiency compared to traditional multicore processors, graphics processing units (GPUs) have become an integral part of these systems. GPUs deliver high peak performance; however, efficiently exploiting their computational power requires the exploration of a multi-dimensional space of optimization methodologies, which is challenging even for the well-trained expert. The complexity of this multi-dimensional space arises not only from the traditionally well-known but arduous task of architecture-aware GPU optimization at design and compile time, but also from the partitioning and scheduling of the computation across these heterogeneous resources. Even with programming models like the Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL), the developer still needs to manage the data transfer between host and device and vice versa, orchestrate the execution of several kernels, and, more arduously, optimize the kernel code. In this dissertation, we aim to deliver a transparent parallel programming environment for heterogeneous resources by leveraging the power of the MapReduce programming model and the OpenCL programming language. We propose a portable architecture-aware framework that efficiently runs an application across heterogeneous resources, specifically AMD GPUs and NVIDIA GPUs, while hiding complex architectural details from the developer. To further enhance performance portability, we explore approaches for asynchronously and efficiently distributing the computations across heterogeneous resources. When applied to benchmarks and representative applications, our proposed framework significantly enhances performance, including up to 58% improvement over traditional approaches to task assignment and up to a 45-fold improvement over state-of-the-art MapReduce implementations.
Ph. D.
11

Wang, Guanying. "Evaluating MapReduce System Performance: A Simulation Approach." Diss., Virginia Tech, 2012. http://hdl.handle.net/10919/28820.

Full text
Abstract:
The scale of data generated and processed is exploding in the Big Data era. The MapReduce system popularized by open-source Hadoop is a powerful tool for the exploding data problem, and is widely employed in many areas involving large amounts of data. In many circumstances, hypothetical MapReduce systems must be evaluated, e.g. to provision a new MapReduce system to meet a certain performance goal, to upgrade a currently running system to meet increasing business demands, or to evaluate a novel network topology, new scheduling algorithms, or resource arrangement schemes. The traditional trial-and-error solution involves a time-consuming and costly process in which a real cluster is first built and then benchmarked. In this dissertation, we propose to simulate MapReduce systems and evaluate hypothetical MapReduce systems using simulation. This simulation approach offers significantly lower turn-around time and lower cost than experiments. Simulation cannot entirely replace experiments, but can be used as a preliminary step to reveal potential flaws and gain critical insights. We studied MapReduce systems in detail and developed a comprehensive performance model for MapReduce, including sub-task phase-level performance models for both map and reduce tasks and a model for resource contention between multiple processes running concurrently. Based on the performance model, we developed a comprehensive simulator for MapReduce, MRPerf. MRPerf is the first full-featured MapReduce simulator. It supports both workload simulation and resource contention, and it still offers the most complete feature set among all MapReduce simulators to date. Using MRPerf, we conducted two case studies to evaluate scheduling algorithms in MapReduce and shared storage in MapReduce, without building real clusters. Furthermore, in order to further integrate simulation and performance prediction into MapReduce systems and leverage predictions to improve system performance, we developed an online prediction framework for MapReduce, which periodically runs simulations within a live Hadoop MapReduce system. The framework can predict task execution within a window in the near future. These predictions can be used by other components in MapReduce systems in order to improve performance. Our results show that the framework can achieve high prediction accuracy and incurs negligible overhead. We present two potential use cases, prefetching and a dynamically adapting scheduler.
Ph. D.
12

Akokhia, Emmanuel Oshoke. "NETFLIX films database clustering using MAPREDUCE technology." Thesis, Тернопільський національний технічний університет імені Івана Пулюя, 2017. http://elartu.tntu.edu.ua/handle/123456789/19533.

Full text
Abstract:
This thesis illustrates that MapReduce is attractive because it abstracts parallel and distributed concepts in such a way that it allows novice programmers to take advantage of cluster computing without needing to be familiar with associated complexities such as data dependency, mutual exclusion, replication, and reliability. However, the challenge is that problems must be expressed in such a way that they can be solved using MapReduce. This often involves carefully designing the inputs and outputs of MapReduce jobs, as the outputs of one MapReduce job are often used as the inputs to another; a sketch of such chaining follows.
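A minimal, in-process sketch of that chaining pattern (the helper and the ratings data are made up for illustration; they are not from the thesis):

```python
from collections import defaultdict

# Sketch of chaining: the output of job 1 (ratings per film) becomes
# the input of job 2 (how many films fall in each rating-count bucket).

def mapreduce(records, mapper, reducer):
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    return [reducer(k, vs) for k, vs in groups.items()]

ratings = [("film1", 5), ("film1", 3), ("film2", 4), ("film3", 2), ("film3", 1)]

# Job 1: count ratings per film.
job1 = mapreduce(ratings,
                 lambda r: [(r[0], 1)],
                 lambda film, ones: (film, sum(ones)))
# Job 2: histogram of rating counts, consuming job 1's output.
job2 = mapreduce(job1,
                 lambda r: [(r[1], 1)],
                 lambda count, ones: (count, sum(ones)))
print(job1)   # [('film1', 2), ('film2', 1), ('film3', 2)]
print(job2)   # [(2, 2), (1, 1)]
```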
13

Темирбекова, Ж. Е., and Ж. М. Меренбаев. "Параллельное масштабирование изображений в технологии mapreduce hadoop." Thesis, Сумский государственный университет, 2015. http://essuir.sumdu.edu.ua/handle/123456789/40775.

Full text
Abstract:
Digital image processing is widely used in virtually all branches of industry. Its use often makes it possible to reach a qualitatively new technological level of production. The most difficult issues, however, are those related to automatically extracting information from an image and interpreting it, which forms the basis for decision-making when managing production processes.
14

Lam, Wilma Samhita Samuel. "A MapReduce Performance Study of XML Shredding." University of Cincinnati / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1467126954.

Full text
15

Lu, Peng. "Application profiling and resource management for MapReduce." Thesis, The University of Sydney, 2015. http://hdl.handle.net/2123/13969.

Full text
Abstract:
The scale of data generated and processed is growing exponentially in the Big Data era. It poses a challenge that is far beyond the capability of a single computing system. Processing such a vast amount of data on a single machine is impracticable in terms of time or cost. Hence, distributed systems, which can harness very large clusters of commodity computers and process data within restrictive time deadlines, are imperative. In this thesis, we target two aspects of distributed systems: application profiling and resource management. We study a MapReduce system in detail, as a programming paradigm for large-scale distributed computing, and present solutions to tackle three key problems. Firstly, this thesis analyzes the characteristics of jobs running on the MapReduce system to reveal a problem: the application scope of MapReduce has been extended beyond its original design goal of large-scale data processing. This observation enables us to present a Workload Characteristic Oriented Scheduler (WCO), which strives to co-locate tasks of possibly different MapReduce jobs with complementary resource usage characteristics. Secondly, this thesis studies the current job priority mechanism with a focus on resource management. In the MapReduce system, job priority only exists at the scheduling level. High-priority jobs are placed at the front of the scheduling queue and dispatched first. Resources, however, are fairly shared among jobs running at the same worker node without any consideration of their priorities. In order to resolve this, this thesis presents a non-intrusive slot layering solution, which dynamically allocates resources between running jobs based on their priority, and efficiently reduces the execution time of high-priority jobs while improving overall throughput. Lastly, based on the observed underutilization of resources at individual worker nodes, this thesis proposes Local Resource Shaper (LRS), a new way to smooth the resource consumption of each individual job by automatically tuning the execution of concurrent jobs to maximize resource utilization while minimizing resource contention.
16

Li, Lei. "Rolling Window Time Series Prediction Using MapReduce." Thesis, The University of Sydney, 2014. http://hdl.handle.net/2123/13552.

Full text
Abstract:
Prediction of time series data is an important application in many domains. Despite their inherent advantages, traditional databases and the MapReduce methodology are not ideally suited for this type of processing, due to the dependencies introduced by the sequential nature of time series. In this thesis a novel framework is presented to facilitate retrieval and rolling window prediction of irregularly sampled large-scale time series data. By introducing a new index pool data structure, processing of time series can be efficiently parallelised. The proposed framework is implemented in the R programming environment and utilises Hadoop to support parallelisation and fault tolerance. A systematic multi-predictor selection model is designed and applied in order to choose the best-fit algorithm for different circumstances. Additionally, a boosting method is deployed as post-processing to further optimise the predictive results. Experimental results on a cloud-based platform indicate that the proposed framework scales linearly up to 32 nodes and performs efficiently with relatively optimised prediction.
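The rolling-window idea itself is simple to sketch: each window carries the slice it needs, so windows can be evaluated independently and therefore in parallel. In this illustrative toy (not the thesis's R/Hadoop framework), a moving average stands in for the multi-predictor selection:

```python
# Sketch of rolling-window prediction: each window is a self-contained
# (slice, target) pair, so windows can be scored independently.

def rolling_windows(series, width):
    for i in range(len(series) - width):
        yield series[i:i + width], series[i + width]   # (window, target)

def moving_average(window):
    return sum(window) / len(window)

series = [3.0, 4.0, 5.0, 4.5, 5.5, 6.0, 6.5]
for window, actual in rolling_windows(series, width=3):
    print(f"predict {moving_average(window):.2f}  actual {actual}")
```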
17

Alkan, Sertan. "A Distributed Graph Mining Framework Based On Mapreduce." Master's thesis, METU, 2010. http://etd.lib.metu.edu.tr/upload/12611588/index.pdf.

Full text
Abstract:
The frequent patterns hidden in a graph can reveal crucial information about the network the graph represents. Existing techniques to mine the frequent subgraphs in a graph database generally rely on the premise that the data can fit into the main memory of the device where the computation takes place. Even though some algorithms are designed using highly optimized methods, many lack a solution to the problem of scalability. In this thesis work, our aim is to find and enumerate the subgraphs that are at least as frequent as a designated threshold in a given graph. Here, we propose a new distributed algorithm for the frequent subgraph mining problem that can scale horizontally as the computing cluster size increases. The method described here uses a partitioning method and the Map/Reduce programming model to distribute the computation of frequent subgraphs. At the core of this algorithm, we make use of an existing graph partitioning method to split the given data in the distributed file system and to merge and join the computed subgraphs without losing information. The frequent subgraph computation in each split is done using another known method that can enumerate the frequent patterns. Although current algorithms can efficiently find frequent patterns, they are not parallel or distributed algorithms, in that even when they partition the data, they are designed to work on a single machine. Furthermore, these algorithms are computationally expensive, not fault tolerant, and not designed to work on a distributed file system. Using the Map/Reduce paradigm, we distribute the computation of frequent patterns to every machine in a cluster. Our algorithm first bi-partitions the data via successive Map/Reduce jobs, then invokes another Map/Reduce job to compute the subgraphs in the partitions using CloseGraph, and finally recovers the whole set by invoking a series of Map/Reduce jobs that merge-join the previously found patterns. The implementation uses an open-source Map/Reduce environment, Hadoop. In our experiments, our method can scale up to large graphs, and as the graph data size gets bigger, this method performs better than the existing algorithms.
18

Wang, Yongzhi. "Constructing Secure MapReduce Framework in Cloud-based Environment." FIU Digital Commons, 2015. http://digitalcommons.fiu.edu/etd/2238.

Full text
Abstract:
MapReduce, a parallel computing paradigm, has been gaining popularity in recent years as cloud vendors offer MapReduce computation services on their public clouds. However, companies are still reluctant to move their computations to the public cloud due to the following reason: In the current business model, the entire MapReduce cluster is deployed on the public cloud. If the public cloud is not properly protected, the integrity and the confidentiality of MapReduce applications can be compromised by attacks inside or outside of the public cloud. From the result integrity's perspective, if any computation nodes on the public cloud are compromised, those nodes can return incorrect task results and therefore render the final job result inaccurate. From the algorithmic confidentiality's perspective, when more and more companies devise innovative algorithms and deploy them to the public cloud, malicious attackers can reverse engineer those programs to detect the algorithmic details and, therefore, compromise the intellectual property of those companies. In this dissertation, we propose to use the hybrid cloud architecture to defeat the above two threats. Based on the hybrid cloud architecture, we propose separate solutions to address the result integrity and the algorithmic confidentiality problems. To address the result integrity problem, we propose the Integrity Assurance MapReduce (IAMR) framework. IAMR performs the result checking technique to guarantee high result accuracy of MapReduce jobs, even if the computation is executed on an untrusted public cloud. We implemented a prototype system for a real hybrid cloud environment and performed a series of experiments. Our theoretical simulations and experimental results show that IAMR can guarantee a very low job error rate, while maintaining a moderate performance overhead. To address the algorithmic confidentiality problem, we focus on the program control flow and propose the Confidentiality Assurance MapReduce (CAMR) framework. CAMR performs the Runtime Control Flow Obfuscation (RCFO) technique to protect the predicates of MapReduce jobs. We implemented a prototype system for a real hybrid cloud environment. The security analysis and experimental results show that CAMR defeats static analysis-based reverse engineering attacks, raises the bar for the dynamic analysis-based reverse engineering attacks, and incurs a modest performance overhead.
19

Pugsley, Seth Hintze. "Opportunities for near data computing in MapReduce workloads." Thesis, The University of Utah, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3704952.

Full text
Abstract:

In-memory big data applications are growing in popularity, including in-memory versions of the MapReduce framework. The move away from disk-based datasets shifts the performance bottleneck from slow disk accesses to memory bandwidth. MapReduce is a data-parallel application, and is therefore amenable to being executed on as many parallel processors as possible, with each processor requiring high amounts of memory bandwidth. We propose using Near Data Computing (NDC) as a means to develop systems that are optimized for in-memory MapReduce workloads, offering high compute parallelism and even higher memory bandwidth. This dissertation explores three different implementations and styles of NDC to improve MapReduce execution. First, we use 3D-stacked memory+logic devices to process the Map phase on compute elements in close proximity to database splits. Second, we attempt to replicate the performance characteristics of the 3D-stacked NDC using only commodity memory and inexpensive processors to improve performance of both Map and Reduce phases. Finally, we incorporate fixed-function hardware accelerators to improve sorting performance within the Map phase. This dissertation shows that it is possible to improve in-memory MapReduce performance by potentially two orders of magnitude by designing system and memory architectures that are specifically tailored to that end.

20

Wang, Liqiang. "An Efficient Platform for Large-Scale MapReduce Processing." ScholarWorks@UNO, 2009. http://scholarworks.uno.edu/td/963.

Full text
Abstract:
In this thesis we propose and implement MMR, a new open-source MapReduce model using MPI for parallel and distributed programming. MMR combines Pthreads, MPI and Google's MapReduce processing model to support multi-threaded as well as distributed parallelism. Experiments show that our model significantly outperforms the leading open-source solution, Hadoop. It demonstrates linear scaling for CPU-intensive processing and even super-linear scaling for indexing-related workloads. In addition, we designed an MMR live DVD which facilitates the automatic installation and configuration of a Linux cluster with an integrated MMR library that enables the development and execution of MMR applications.
21

Izurieta, Iván Carrera. "Performance modeling of MapReduce applications for the cloud." Biblioteca Digital de Teses e Dissertações da UFRGS, 2014. http://hdl.handle.net/10183/99055.

Full text
Abstract:
In recent years, Cloud Computing has become a key technology that makes it possible to run applications without deploying a physical infrastructure, with the advantage of lowering costs by charging users only for the computational resources their applications actually use. The challenge with deploying distributed applications in Cloud Computing environments is planning the virtual machine infrastructure so that it is both time- and cost-effective. In recent years we have also seen the amount of data produced by applications grow larger than ever. This data contains valuable information that has to be extracted using tools like MapReduce. MapReduce has been an important framework for analyzing large amounts of data since it was proposed by Google and made open source by Apache with its Hadoop implementation. The goal of this work is to show that the execution time of a distributed application, namely a MapReduce application, in a Cloud Computing environment can be predicted using a mathematical model based on theoretical specifications. This prediction is meant to help users of the Cloud Computing environment plan their deployments, i.e., quantify the number of virtual machines and their characteristics in order to reduce cost and/or time. After measuring the application execution time while varying the parameters stated in the mathematical model, and then applying a linear regression technique, the goal is achieved by finding a model of the execution time, which was then used to predict the execution time of MapReduce applications with satisfactory results. The experiments were conducted in several configurations: private and public clusters, as well as commercial cloud infrastructures, running different MapReduce applications and varying the number of nodes composing the cluster as well as the amount of workload given to the application. The experiments showed a clear relation with the theoretical model, revealing that the model is in fact able to predict the execution time of MapReduce applications. The developed model is generic, meaning that it uses theoretical abstractions for the computing capacity of the environment and the computing cost of the MapReduce application. Further work extending this approach to other types of distributed applications is encouraged, as is including this mathematical model in cloud services offering MapReduce platforms, in order to help users plan their deployments.
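The modelling step can be sketched in a few lines: measure runs, fit a linear model, and predict. The measurements and the particular linear form below are made-up illustrations, not the thesis's model:

```python
import numpy as np

# Illustrative only: fit execution time as a linear function of
# workload size and inverse cluster size, then predict a new setup.

# columns: workload in GB, 1/nodes; rows: observed runs (made up)
X = np.array([[10, 1/4], [20, 1/4], [10, 1/8], [40, 1/8], [20, 1/16]])
t = np.array([120.0, 230.0, 70.0, 250.0, 75.0])      # seconds

A = np.column_stack([X, np.ones(len(X))])            # add intercept
coef, *_ = np.linalg.lstsq(A, t, rcond=None)

predict = lambda gb, nodes: coef @ [gb, 1 / nodes, 1]
print(f"predicted time for 30 GB on 8 nodes: {predict(30, 8):.0f} s")
```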
22

Aljarah, Ibrahim Mithgal. "Mapreduce-Enabled Scalable Nature-Inspired Approaches for Clustering." Diss., North Dakota State University, 2014. https://hdl.handle.net/10365/27094.

Full text
Abstract:
The increasing volume of data to be analyzed imposes new challenges to the data mining methodologies. Traditional data mining such as clustering methods do not scale well with larger data sizes and are computationally expensive in terms of memory and time. Clustering large data sets has received attention in the last few years in several application areas such as document categorization, which is in urgent need of scalable approaches. Swarm intelligence algorithms have self-organizing features, which are used to share knowledge among swarm members to locate the best solution. These algorithms have been successfully applied to clustering, however, they suffer from the scalability issue when large data is involved. In order to satisfy these needs, new parallel scalable clustering methods need to be developed. The MapReduce framework has become a popular model for parallelizing data-intensive applications due to its features such as fault-tolerance, scalability, and usability. However, the challenge is to formulate the tasks with map and reduce functions. This dissertation firstly presents a scalable particle swarm optimization (MR-CPSO) clustering algorithm that is based on the MapReduce framework. Experimental results reveal that the proposed algorithm scales very well with increasing data set sizes while maintaining good clustering quality. Moreover, a parallel intrusion detection system using the MR-CPSO is introduced. This system has been tested on a real large-scale intrusion data set to confirm its scalability and detection quality. In addition, the MapReduce framework is utilized to implement a parallel glowworm swarm optimization (MR-GSO) algorithm to optimize difficult multimodal functions. The experiments demonstrate that MR-GSO can achieve high function peak capture rates. Moreover, this dissertation presents a new clustering algorithm based on GSO (CGSO). CGSO takes into account the multimodal search capability to locate optimal centroids in order to enhance the clustering quality without the need to provide the number of clusters in advance. The experimental results demonstrate that CGSO outperforms other well-known clustering algorithms. In addition, a MapReduce GSO clustering (MRCGSO) algorithm version is introduced to evaluate the algorithm's scalability with large scale data sets. MRCGSO achieves a good speedup and utilization when more computing nodes are used.
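The data-parallel core of swarm-based clustering, evaluating a particle's candidate centroids over all points, maps naturally onto MapReduce and can be sketched as follows (an illustrative toy; MR-CPSO's actual jobs and update rules are more elaborate):

```python
import math
from collections import defaultdict

# Sketch of the data-parallel part of swarm-based clustering: map
# emits each point's distance to its nearest candidate centroid;
# reduce sums these distances into the particle's fitness.

def map_phase(points, centroids):
    for p in points:
        d = min(math.dist(p, c) for c in centroids)
        yield "fitness", d

def reduce_phase(pairs):
    total = defaultdict(float)
    for key, d in pairs:
        total[key] += d
    return total["fitness"]

points = [(0, 0), (1, 0), (9, 9), (10, 10)]
centroids = [(0.5, 0.0), (9.5, 9.5)]
print(f"fitness = {reduce_phase(map_phase(points, centroids)):.3f}")
```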
APA, Harvard, Vancouver, ISO, and other styles
24

Sugandharaju, Ravi Kumar Chatnahalli. "Gaussian Deconvolution and MapReduce Approach for Chipseq Analysis." University of Cincinnati / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1307323875.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Zhang, Yue. "A Workload Balanced MapReduce Framework on GPU Platforms." Wright State University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=wright1450180042.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Xu, Xiaoyong. "QoS-guaranteed resource provisioning for cloud-based MapReduce." Thesis, Queensland University of Technology, 2016. https://eprints.qut.edu.au/97990/1/Xiaoyong_Xu_Thesis.pdf.

Full text
Abstract:
This PhD project has investigated how to guarantee the quality of MapReduce services in cloud computing while minimizing the operational cost of these services through dynamic resource provisioning. A framework for dynamic resource provisioning has been developed; theoretical results for dynamic resource provisioning have been derived, and a set of efficient and effective algorithms used in the framework has been proposed.
APA, Harvard, Vancouver, ISO, and other styles
27

Grythe, Knut Auvor. "Automated tuning of MapReduce performance in Vespa Document Store." Thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2007. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9584.

Full text
Abstract:

MapReduce is a programming model for distributed processing, originally designed by Google Inc. It is designed to simplify the implementation and deployment of distributed programs. Vespa Document Store (VDS) is a distributed document storage solution developed by Yahoo! Technologies Norway. VDS does not currently have any feature allowing distributed aggregation of data; therefore, a prototype of the MapReduce distributed programming model was previously developed. However, the implementation requires manual tuning of several parameters before each deployment. The goal of this thesis is to allow as many of these parameters as possible to be either automatically configured or set to universally suitable defaults. We have created a working MapReduce implementation based on previous work, and a framework for monitoring VDS nodes. Various VDS features have been documented in detail, and this documentation has been used to analyse how the performance of these features may be improved. We have also performed various experiments to validate the analysis and gain additional insight. Numerous configuration options, for either VDS in general or the MapReduce implementation, have been considered, and recommended settings have been proposed. The propositions take the form of either default values or algorithms for computing the most suitable setting. Finally, we provide a list of suggested further work, with suggestions for both general VDS improvements and MapReduce-specific research.

APA, Harvard, Vancouver, ISO, and other styles
28

Fries, Sergej [Verfasser]. "Efficient clustering of massive data with MapReduce / Sergej Fries." Aachen : Hochschulbibliothek der Rheinisch-Westfälischen Technischen Hochschule Aachen, 2015. http://d-nb.info/1074562143/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Lakshminarayanan, Mahalakshmi. "ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce." Thesis, The University of Toledo, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1554211.

Full text
Abstract:

Similarity Join is an important operation for data mining, with a diverse range of real-world applications. Three efficient MapReduce algorithms for performing Similarity Joins between multisets are proposed in this thesis. Filtering techniques for similarity joins minimize the number of pairs of entities joined, and hence they are vital for improving the efficiency of the algorithm. Multisets represent real-world data better, since they take the frequency of elements into account. Prior serial algorithms incorporate filtering techniques only for sets, not multisets, while prior MapReduce algorithms either do not incorporate any filtering technique or incorporate prefix filtering inefficiently, with poor scalability.

This work extends the filtering techniques, namely the prefix, size, positional and suffix filters, to multisets, and also achieves the challenging task of incorporating them efficiently in the shared-nothing MapReduce model. Incorporating the filtering techniques in a strategic sequence minimizes the pairs generated and joined, resulting in I/O, network and computational efficiency. In the SSS algorithm, prefix, size and positional filtering are incorporated in the MapReduce framework, and the pairs that survive filtering are joined smoothly in the third stage, the Similarity Join Stage, using a Multiset File generated in the second stage. We also developed a technique to enhance the scalability of the algorithm as a contingency.

In the ESSJ algorithm, all the filtering techniques, namely prefix, size, positional and suffix filtering, are incorporated in the MapReduce framework. It is designed with a seamless and scalable Similarity Join Stage, where the similarity joins are performed without depending on a file.

In the EASE algorithm, all the filtering techniques, namely prefix, size, positional and suffix, are likewise incorporated in the MapReduce framework. However, it is tailored as a hybrid algorithm that exploits the strategies of both SSS and ESSJ for performing the joins: some multiset pairs are joined using the Multiset File, as in SSS, and some are joined without it, as in ESSJ. The algorithm harvests the benefits of both strategies.

The SSS and ESSJ algorithms were developed using Hadoop and tested on real-world Twitter data. For both SSS and ESSJ, experimental results demonstrate performance gains of over 70% in comparison with the competing state-of-the-art algorithm.
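The prefix filter at the core of such algorithms is easiest to see for plain sets; the sketch below is a minimal serial version under a Jaccard threshold, leaving out the multiset extension and the MapReduce staging that the thesis actually contributes. Token lists are assumed sorted by a global token order.

```python
import math
from collections import defaultdict

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix_len(size, t):
    # A record can only reach Jaccard >= t with records that share at least
    # one of its first |s| - ceil(t * |s|) + 1 tokens (in canonical order).
    return size - math.ceil(t * size) + 1

def similarity_join(records, t):
    index = defaultdict(list)          # token -> ids whose prefix contains it
    candidates = set()
    for rid, toks in enumerate(records):
        for tok in toks[:prefix_len(len(toks), t)]:
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].append(rid)
    return [(i, j, jaccard(records[i], records[j]))
            for i, j in sorted(candidates)
            if jaccard(records[i], records[j]) >= t]

recs = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["c", "d", "e", "f"]]
print(similarity_join(recs, 0.6))  # only the first two records survive
```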

APA, Harvard, Vancouver, ISO, and other styles
30

HAPP, PATRICK NIGRI. "A DISTRIBUTED REGION GROWING IMAGE SEGMENTATION BASED ON MAPREDUCE." PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2015. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=34941@1.

Full text
Abstract:
Image segmentation is a critical step in image analysis, and generally involves a high computational cost, especially when dealing with large volumes of data. Given the significant increase in the spatial, spectral and temporal resolutions of remote sensing imagery in recent years, current sequential and parallel solutions fail to deliver the expected performance and scalability. This work proposes a distributed image segmentation method capable of handling very large high-resolution images in an efficient and scalable way. The proposed solution is based on the MapReduce model, which offers a highly scalable and reliable framework for storing and processing massive data in cluster environments and in private and public computing clouds. The proposed method is extendable to any region-growing algorithm and can be adapted to other models. The solution was implemented and validated using the Hadoop platform. Experimental results attest to the viability of performing distributed segmentation over the MapReduce model through cloud computing.
APA, Harvard, Vancouver, ISO, and other styles
31

Wottrich, Rodolfo Guilherme 1990. "Loop parallelization in the cloud using OpenMP and MapReduce." [s.n.], 2014. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275500.

Full text
Abstract:
Advisors: Guido Costa Souza de Araújo, Rodolfo Jardim de Azevedo
Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação
The pursuit of parallelism has always been an important goal in the design of computer systems, driven mainly by the constant interest in reducing program execution time. Parallel programming is an active research area, which has grown in interest due to the emergence of multicore architectures. On the other hand, harnessing the large computing and storage capabilities of the cloud and its desirable flexibility and scaling features offers a number of interesting opportunities to address some relevant research problems in scientific computing. Unfortunately, in many cases the implementation of applications on the cloud demands specific knowledge of parallel programming interfaces and APIs, which may become a burden when programming complex applications. To overcome such limitations, in this work we propose OpenMR, an execution model based on the syntax and principles of the OpenMP API which eases the task of programming distributed systems (i.e. local clusters or remote cloud). Specifically, this work addresses the problem of performing loop parallelization, using OpenMR, in a distributed environment, through the mapping of loop iterations to MapReduce nodes. By doing so, the cloud programming interface becomes the programming language itself, freeing the developer from the task of worrying about the details of distributing workload and data. To assess the validity of the proposal, we modified benchmarks from the SPEC OMP2012 suite to fit the proposed model, developed other I/O-bound toy benchmarks and executed them in two settings: (a) a computer cluster locally available through a standard LAN; and (b) clusters remotely available through the Amazon AWS services. We compare the results to the execution using OpenMP in an SMP architecture and show that the proposed parallelization technique is feasible and demonstrates good scalability.
Master's degree in Computer Science
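As a rough serial illustration of the iteration-mapping idea in the abstract above (not OpenMR's actual API), the sketch below splits a parallel loop into chunks, runs each chunk as a "map" task, and combines partial results in a "reduce" step; the chunking scheme and the dot-product loop body are assumptions for the sketch.

```python
def chunks(n_iters, n_tasks):
    # split the iteration space [0, n_iters) into n_tasks contiguous ranges
    step = (n_iters + n_tasks - 1) // n_tasks
    return [(lo, min(lo + step, n_iters)) for lo in range(0, n_iters, step)]

def map_task(bounds, x, y):
    lo, hi = bounds
    return sum(x[i] * y[i] for i in range(lo, hi))   # the parallel loop body

def reduce_task(partials):
    return sum(partials)                             # combine partial results

x = list(range(1000)); y = list(range(1000))
partials = [map_task(b, x, y) for b in chunks(len(x), 4)]  # one per MR node
print(reduce_task(partials) == sum(i * i for i in range(1000)))  # True
```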
APA, Harvard, Vancouver, ISO, and other styles
32

Neves, Marcelo Veiga. "Application-aware software-defined networking to accelerate mapreduce applications." Pontifícia Universidade Católica do Rio Grande do Sul, 2015. http://hdl.handle.net/10923/7074.

Full text
Abstract:
The rise of Internet of Things sensors, social networking and mobile devices has led to an explosion of available data. Gaining insights into this data has led to the area of Big Data analytics. The MapReduce (MR) framework, as implemented in Hadoop, has become the de facto standard for Big Data analytics. It also forms a base platform for a plurality of Big Data technologies that are used today. To handle the ever-increasing data size, Hadoop is a scalable framework that allows dedicated, seemingly unbounded numbers of servers to participate in the analytics process. Response time of an analytics request is an important factor for time to value/insights. While the compute and disk I/O requirements can be scaled with the number of servers, scaling the system leads to increased network traffic. Arguably, the communication-heavy phase of MR contributes significantly to the overall response time. This problem is further aggravated if communication patterns are heavily skewed, as is not uncommon in many MR workloads. MR applications normally run in large data centers (DCs) employing dense network topologies (e.g., multi-rooted trees) with multiple paths available between any pair of hosts. These DC network designs, combined with recent software-defined networking (SDN) programmability, offer a new opportunity to dynamically and intelligently configure the network to achieve shorter application runtime. The initial intuition motivating our work is that the well-defined structure of MR and the rich traffic demand information available in Hadoop's log and meta-data files could be used to guide the network control. We therefore conjecture that an application-aware network control (i.e., one that knows the application-level semantics and traffic demands) can improve MR applications' performance when compared to state-of-the-art application-agnostic network control. To confirm our thesis, we first studied MR systems in detail and identified typical communication patterns and common causes of network-related performance bottlenecks in MR applications. Then, we studied the state of the art in DC networks and evaluated its ability to handle MapReduce-like communication patterns. Our results confirmed the assumption that existing techniques are not able to deal with MR communication patterns, mainly because of the lack of visibility of application-level information. Based on these findings, we proposed an architecture for an application-aware network control for DCs running MR applications. We implemented a prototype within an SDN controller and used it to successfully accelerate MR applications. Depending on the network oversubscription ratio, we demonstrated a 2% to 58% reduction in the job completion time for popular MR benchmarks, when compared to ECMP (the de facto flow allocation algorithm in multipath DC networks), thus confirming the thesis. Other contributions include a method to predict network demands in MR applications, algorithms to identify the critical communication path in the MR shuffle and to dynamically allocate paths to flows in a multipath network, and an emulation-based testbed for realistic MR workloads.
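The gap between ECMP and demand-aware control can be illustrated with a toy allocation experiment: ECMP hashes flows to paths regardless of their size, while a controller that knows the shuffle's flow sizes can place the heaviest flows on the least-loaded paths. The flow sizes, the stand-in hash, and the path count below are assumptions for the sketch, not the thesis's experimental setup.

```python
import random

random.seed(0)
PATHS = 4
flows = [random.choice([1, 1, 2, 50]) for _ in range(20)]  # skewed shuffle flows

def ecmp(flows):
    load = [0.0] * PATHS
    for fid, size in enumerate(flows):
        load[(fid * 2654435761) % PATHS] += size  # demand-oblivious hashing
    return max(load)

def demand_aware(flows):
    load = [0.0] * PATHS
    for size in sorted(flows, reverse=True):      # heaviest flows first
        load[load.index(min(load))] += size       # onto the least-loaded path
    return max(load)

# the most loaded path bounds shuffle completion time
print("ECMP bottleneck:", ecmp(flows))
print("demand-aware bottleneck:", demand_aware(flows))
```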
APA, Harvard, Vancouver, ISO, and other styles
33

Lakshminarayanan, Mahalakshmi. "ACE: Agile,Contingent and Efficient Similarity Joins Using MapReduce." University of Toledo / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1383931387.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Liu, Wei, and 柳維. "Privacy Preserving for MapReduce Netwok using SDN with MapReduce Virtual Management." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/09411923656857766696.

Full text
Abstract:
Master's thesis
National Chung Cheng University
Graduate Institute of Communications Engineering
2015 (ROC year 104)
MapReduce, the heart of Hadoop, has become a new programming paradigm with good scalability for processing big data in distributed environments. However, data privacy poses a new challenge in cloud computing environments. Although many approaches have been proposed to improve the privacy of data in MapReduce networks, man-in-the-middle attacks remain unaddressed. We propose a novel approach, called MapReduce Virtual Management (MVM), which uses a Software-Defined Networking (SDN) structure to enhance data privacy protection. MVM is a virtualized SDN application with the benefits of simplicity, agility, and automation across the system. Experimental results show that MVM is able to route data properly with a low eavesdropping probability under different network topologies.
APA, Harvard, Vancouver, ISO, and other styles
35

Chi, Wen-Chun, and 紀玟君. "Parallel QBL-PSO Using MapReduce." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/42584912905270860447.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

WU, DONG-YUAN, and 吳東原. "Set-similarity joins using MapReduce." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/4g35s3.

Full text
Abstract:
Master's thesis
Hsuan Chuang University
Master's Program, Department of Information Management
2017 (ROC year 106)
Data querying and massive data analysis are now ubiquitous, and both require techniques for comparing data, which this study supports. As our groundwork, we adopted an algorithm for set-similarity joins in the MapReduce framework, referred to as the RF comparing algorithm. We addressed the defects of the RF comparing algorithm and developed a new, more efficient algorithm named the Prefix Accumulating algorithm. Our solution identifies similarities between data sets within the MapReduce framework and outputs a table of the similarities found. The algorithm has two phases. In the first MapReduce pass, we use prefix filtering to pick out, from a large amount of data, the records that could possibly match each other, and we collect each candidate pair while accumulating their common elements. In the second pass, we verify the remaining portion of the data for each candidate pair, then combine the union and intersection of the records to calculate their similarity. Our experiments show that the Prefix Accumulating algorithm is faster than the RF comparing algorithm. The advantage of the Prefix Accumulating algorithm is that it does not need to compare the complete data again after prefix filtering; its disadvantage is that the more the data is partitioned, the higher the cost of integrating the results.
APA, Harvard, Vancouver, ISO, and other styles
37

Yadav, Rakesh. "Genetic Algorithms Using Hadoop MapReduce." Thesis, 2015. http://ethesis.nitrkl.ac.in/7790/1/2015_Genetic_Yadav.pdf.

Full text
Abstract:
Data-Intensive Computing (DIC) plays an important role in processing large data sets using parallel computing, and DIC models have been shown to process huge amounts of data, on the order of petabytes or more, on a daily basis. This motivates investigating how DIC can support evolutionary (genetic) algorithms. Here we explain step by step how genetic algorithms (GAs), in different implementation forms, can be translated into the Hadoop MapReduce framework. The results detail why Hadoop is a good choice for running genetic algorithms on large data set problems and show how speedup increases with parallel computing. MapReduce is designed for large volumes of data; it was introduced for Big Data analysis and has been used for many algorithms, such as breadth-first search, the traveling salesman problem, and shortest-path problems. The framework has two key components, the mapper and the reducer. The map phase divides the data in parallel across many cluster nodes, where the data takes the form of key-value pairs. The output of the map phase goes into an intermediate phase where the data is shuffled and sorted; a partitioner then divides the data in parallel among the clusters according to the user's configuration, with the number of partitions depending on the number of reducers. The reducers take all the iterations of data and produce the results in the form of values. In this thesis we also compare our implementation with an implementation presented in an existing model. The two implementations are compared on the OneMax (bit-counting) problem, using as criteria fitness convergence, stability with a fixed number of nodes, quality of the final solution, cloud resource utilization, and algorithm scalability.
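A toy version of the map/reduce split for a GA on OneMax is sketched below: the map phase scores individuals (the part that parallelizes across mappers) and the reduce phase performs selection and variation. All parameter values and the truncation-selection scheme are illustrative assumptions, not the thesis's configuration.

```python
import random

random.seed(1)
POP, BITS = 40, 32

def map_phase(population):
    # emit (fitness, individual) pairs; OneMax fitness = number of 1 bits
    return [(sum(ind), ind) for ind in population]

def reduce_phase(scored):
    scored.sort(key=lambda kv: kv[0], reverse=True)
    parents = [ind for _, ind in scored[:POP // 2]]      # truncation selection
    children = []
    while len(children) < POP:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, BITS)
        child = a[:cut] + b[cut:]                        # one-point crossover
        child[random.randrange(BITS)] ^= 1               # point mutation
        children.append(child)
    return children

pop = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]
for _ in range(30):
    pop = reduce_phase(map_phase(pop))
print(max(sum(ind) for ind in pop), "of", BITS)          # approaches 32
```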
APA, Harvard, Vancouver, ISO, and other styles
38

Xiang, Jingen. "Scalable Scientific Computing Algorithms Using MapReduce." Thesis, 2013. http://hdl.handle.net/10012/7830.

Full text
Abstract:
Cloud computing systems, like MapReduce and Pregel, provide a scalable and fault tolerant environment for running computations at massive scale. However, these systems are designed primarily for data intensive computational tasks, while a large class of problems in scientific computing and business analytics are computationally intensive (i.e., they require a lot of CPU in addition to I/O). In this thesis, we investigate the use of cloud computing systems, in particular MapReduce, for computationally intensive problems, focusing on two classic problems that arise in scientific computing and also in analytics: maximum clique and matrix inversion. The key contribution that enables us to effectively use MapReduce to solve the maximum clique problem on dense graphs is a recursive partitioning method that partitions the graph into several subgraphs of similar size and running time complexity. After partitioning, the maximum cliques of the different partitions can be computed independently, and the computation is sped up using a branch and bound method. Our experiments show that our approach leads to good scalability, which is unachievable by other partitioning methods since they result in partitions of different sizes and hence lead to load imbalance. Our method is more scalable than an MPI algorithm, and is simpler and more fault tolerant. For the matrix inversion problem, we show that a recursive block LU decomposition allows us to effectively compute in parallel both the lower triangular (L) and upper triangular (U) matrices using MapReduce. After computing the L and U matrices, their inverses are computed using MapReduce. The inverse of the original matrix, which is the product of the inverses of the L and U matrices, is also obtained using MapReduce. Our technique is the first matrix inversion technique that uses MapReduce. We show experimentally that our technique has good scalability, and it is simpler and more fault tolerant than MPI implementations such as ScaLAPACK.
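The block recursion behind such an approach can be sketched serially using the block LU / Schur complement identity; in the thesis the block computations run as MapReduce jobs, whereas here they are plain recursive calls. This is an illustrative sketch, not the thesis's implementation.

```python
import numpy as np

def block_inverse(M, leaf=2):
    # Recursive 2x2 block inversion: with S = D - C A^{-1} B,
    #   [[A, B], [C, D]]^{-1} = [[Ai + Ai B Si C Ai, -Ai B Si],
    #                            [-Si C Ai,           Si     ]]
    # where Ai = A^{-1} and Si = S^{-1}.
    n = M.shape[0]
    if n <= leaf:
        return np.linalg.inv(M)
    k = n // 2
    A, B, C, D = M[:k, :k], M[:k, k:], M[k:, :k], M[k:, k:]
    Ai = block_inverse(A, leaf)                  # in the thesis, each block
    Si = block_inverse(D - C @ Ai @ B, leaf)     # computation is a MR job
    top = np.hstack([Ai + Ai @ B @ Si @ C @ Ai, -Ai @ B @ Si])
    bot = np.hstack([-Si @ C @ Ai, Si])
    return np.vstack([top, bot])

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8)) + 8 * np.eye(8)  # well-conditioned test matrix
print(np.allclose(block_inverse(M) @ M, np.eye(8)))  # True
```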
APA, Harvard, Vancouver, ISO, and other styles
39

Côrte-Real, Joana. "A MapReduce Construct for Yap Prolog." Dissertação, 2013. http://hdl.handle.net/10216/73862.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Peng, Hao-Ting, and 彭晧廷. "A Block-Oriented MapReduce Programming Environment." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/hns5hz.

Full text
Abstract:
Master's thesis
National Kaohsiung University of Applied Sciences
Graduate Program, Department of Electrical Engineering
2015 (ROC year 104)
This study develops a block-oriented MapReduce programming environment called Blockly-MR, based on Google Blockly. In this environment, users create MapReduce programs by dragging and dropping blocks. Blockly-MR automatically translates the blocks into Java source programs and sends them to a backend Hadoop cluster for compilation and execution. Users can view the source code and execution results of the programs in front-end web pages. With the support of Blockly-MR, users can easily develop MapReduce programs without having to learn Java programming or the Hadoop API. Consequently, the proposed programming environment effectively reduces the complexity of MapReduce programming, which is useful for encouraging users to learn it.
APA, Harvard, Vancouver, ISO, and other styles
41

Costa, Pedro Alexandre Reis Sá da Costa. "Hadoop MapReduce tolerante a faltas bizantinas." Master's thesis, 2011. http://hdl.handle.net/10451/8695.

Full text
Abstract:
Master's thesis in Informatics, presented to the Universidade de Lisboa through the Faculdade de Ciências, 2011
MapReduce is often used to run critical jobs such as scientific data analysis. However, evidence in the literature shows that arbitrary faults do occur and can corrupt the results of MapReduce jobs. MapReduce runtimes like Hadoop tolerate crash faults, but not arbitrary or Byzantine faults. In this work, a Byzantine fault-tolerant (BFT) Hadoop MapReduce algorithm and prototype are presented. An experimental evaluation shows that the execution of a job with the implemented algorithm uses twice the resources of the original Hadoop, instead of the 3 or 4 times more that would result from a direct application of common Byzantine fault-tolerance paradigms. This cost is believed to be acceptable for critical applications that require this level of fault tolerance.
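One way to see where the "twice the resources" figure can come from is the replicate-and-compare pattern sketched below: each task runs on two workers, and a third execution is launched only when their output digests disagree. This is an illustrative sketch of the general idea, not the prototype's actual protocol.

```python
import hashlib, json

def digest(result):
    return hashlib.sha256(json.dumps(result, sort_keys=True).encode()).hexdigest()

def run_replicated(task, data, replicas):
    # replicas: callables standing in for executions on distinct workers
    out = [rep(task, data) for rep in replicas[:2]]
    if digest(out[0]) == digest(out[1]):
        return out[0]                       # common case: 2 executions suffice
    out.append(replicas[2](task, data))     # arbitrate with a third replica
    for r in out:
        if sum(digest(o) == digest(r) for o in out) >= 2:
            return r                        # majority result
    raise RuntimeError("no majority; task must be retried elsewhere")

word_count = lambda _t, text: {w: text.split().count(w) for w in set(text.split())}
faulty = lambda _t, text: {"corrupted": 1}  # models a Byzantine worker

ok = run_replicated("wc", "a b a", [word_count, word_count])
fixed = run_replicated("wc", "a b a", [word_count, faulty, word_count])
print(ok == fixed == {"a": 2, "b": 1})      # True
```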
APA, Harvard, Vancouver, ISO, and other styles
42

da, Costa Pedro Alexandre Reis Sá. "Hadoop mapreduce tolerante a faltas bizantinas." Master's thesis, 2011. http://hdl.handle.net/10451/13903.

Full text
Abstract:
MapReduce is often used to run critical jobs such as scientific data analysis. However, evidence in the literature shows that arbitrary faults do occur and can corrupt the results of MapReduce jobs. MapReduce runtimes like Hadoop tolerate crash faults, but not arbitrary or Byzantine faults. In this work, a MapReduce algorithm and prototype that tolerate these faults are presented. An experimental evaluation shows that the execution of a job with the implemented algorithm uses twice the resources of the original Hadoop, instead of the 3 or 4 times more that would be achieved with the direct application of common Byzantine fault-tolerance paradigms. This cost is believed to be acceptable for critical applications that require that level of fault tolerance.
APA, Harvard, Vancouver, ISO, and other styles
43

Côrte-Real, Joana Sílvia Santos. "A MapReduce Construct for Yap Prolog." Master's thesis, 2013. https://repositorio-aberto.up.pt/handle/10216/68038.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Elnikety, Eslam. "iHadoop: Asynchronous Iterations Support for MapReduce." Thesis, 2011. http://hdl.handle.net/10754/209389.

Full text
Abstract:
MapReduce is a distributed programming framework designed to ease the development of scalable data-intensive applications for large clusters of commodity machines. Most machine learning and data mining applications involve iterative computations over large datasets, such as the Web hyperlink structures and social network graphs. Yet, the MapReduce model does not efficiently support this important class of applications. The architecture of MapReduce, most critically its dataflow techniques and task scheduling, is completely unaware of the nature of iterative applications; tasks are scheduled according to a policy that optimizes the execution for a single iteration which wastes bandwidth, I/O, and CPU cycles when compared with an optimal execution for a consecutive set of iterations. This work presents iHadoop, a modified MapReduce model, and an associated implementation, optimized for iterative computations. The iHadoop model schedules iterations asynchronously. It connects the output of one iteration to the next, allowing both to process their data concurrently. iHadoop's task scheduler exploits inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine allowing a fast local data transfer. For those iterative applications that require satisfying certain criteria before termination, iHadoop runs the check concurrently during the execution of the subsequent iteration to further reduce the application's latency. This thesis also describes our implementation of the iHadoop model, and evaluates its performance against Hadoop, the widely used open source implementation of MapReduce. Experiments using different data analysis applications over real-world and synthetic datasets show that iHadoop performs better than Hadoop for iterative algorithms, reducing execution time of iterative applications by 25% on average. Furthermore, integrating iHadoop with HaLoop, a variant Hadoop implementation that caches invariant data between iterations, reduces execution time by 38% on average.
APA, Harvard, Vancouver, ISO, and other styles
45

Chiang, Hsuan-Yu, and 江炫佑. "MapReduce-based Nearest Window Cluster Query Processing." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/19737665874236805972.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Institute of Computer Science and Engineering
2016 (ROC year 105)
With the growing development of Geographic Information Systems (GIS) and Location-Based Services (LBS), various spatial query applications have been proposed. However, due to the wide use of mobile devices and the explosive growth in users, a huge amount of data is generated as time passes, and distributed computing techniques are used to enable efficient query processing over such huge data. In this paper, we focus on a spatial query called the Nearest Window Cluster Query (NWCQ). Given a query point, a desired window size, and a desired number of data objects, NWCQ returns a group of objects within a window of the desired size. Previous work used the R-tree for spatial indexing, but R-tree index nodes may overlap, which is inappropriate for distributed computing. Therefore, based on the MapReduce framework, we propose a grid-based indexing algorithm to index data objects and a companion query processing algorithm for NWCQ.
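A minimal sketch of grid-based indexing in a MapReduce style follows: each object is keyed by its grid cell, so objects in the same region reach the same reducer, and a window query touches only the overlapping cells. The cell size and the data points are assumptions for illustration, not the thesis's parameters.

```python
CELL = 10.0

def cell_of(x, y):
    return (int(x // CELL), int(y // CELL))

def map_phase(objects):
    # emit (cell, object) pairs, the key-value output of the map step
    return [(cell_of(x, y), (x, y)) for x, y in objects]

def window_cells(qx, qy, w, h):
    # all cells overlapping a w x h window centred on the query point
    (cx0, cy0) = cell_of(qx - w / 2, qy - h / 2)
    (cx1, cy1) = cell_of(qx + w / 2, qy + h / 2)
    return {(cx, cy) for cx in range(cx0, cx1 + 1)
                     for cy in range(cy0, cy1 + 1)}

pts = [(3, 4), (12, 9), (25, 31), (14, 16)]
pairs = map_phase(pts)
hits = [obj for cell, obj in pairs if cell in window_cells(10, 10, 20, 20)]
print(hits)  # objects whose cells overlap the query window
```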
APA, Harvard, Vancouver, ISO, and other styles
46

WU, I.-CHUN, and 吳亦鈞. "Data Mining Based on MapReduce Technology." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/ak4v53.

Full text
Abstract:
Master's thesis
National Yunlin University of Science and Technology
Department of Computer Science and Information Engineering
2017 (ROC year 106)
The widespread use of the Internet makes data easy to share and disseminate, but finding important information in massive data becomes an issue. Data mining is one of the technologies for finding such information, and mining results can be obtained faster when more computers are used. Our data mining method is therefore designed to run on the Hadoop distributed platform to speed up mining. Before mining, each item of each transaction in the original dataset is transformed to 1 or 0: 1 means that the transaction contains the corresponding item, and 0 means it does not. Transforming items into this binary notation speeds up the mining work considerably, since mostly only simple logical operations need to be executed. This paper proposes two algorithms, Brute Force and Candidate Itemset, to find the frequent itemsets. Brute Force uses an exhaustive method, while Candidate Itemset is based on the Apriori method and uses the Apriori property to prune unnecessary candidate itemsets, avoiding the enumeration of all item combinations that Brute Force performs. In addition, a sequence mining algorithm is proposed to find sequential relationships among frequent items. The mushrooms, chess, and c10d20k datasets [26] are run on Hadoop with our methods and those of Emin [21] and Li [15]. The experimental results show that for the mushrooms dataset, our mining time is 48% faster than Emin when the threshold is 0.15 and 63% faster than Li when the threshold is 0.45. For the chess dataset, our mining time is 22% and 97% faster than Emin and Li, respectively, when the threshold is 0.55. For the c10d20k dataset, our mining time is 50% and 99% faster than Emin and Li when the threshold is 0.15.
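The binary encoding can be sketched as follows: each item becomes one bit of a transaction mask, so the support of an itemset reduces to a bitwise AND plus a count. The small dataset below is an illustrative assumption.

```python
items = ["bread", "milk", "beer", "eggs"]
bit = {it: 1 << i for i, it in enumerate(items)}

transactions = [
    {"bread", "milk"},
    {"bread", "beer", "eggs"},
    {"bread", "milk", "beer"},
    {"milk", "eggs"},
]
masks = [sum(bit[it] for it in t) for t in transactions]  # one int per row

def support(itemset):
    # a transaction contains the itemset iff its mask covers the query bits
    query = sum(bit[it] for it in itemset)
    return sum((m & query) == query for m in masks)

print(support({"bread", "milk"}))  # 2: contained in transactions 0 and 2
```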
APA, Harvard, Vancouver, ISO, and other styles
47

Côrte-Real, Joana Sílvia Santos. "A MapReduce Construct for Yap Prolog." Dissertação, 2013. https://repositorio-aberto.up.pt/handle/10216/68038.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Elgohary, Ahmed. "Scalable Embeddings for Kernel Clustering on MapReduce." Thesis, 2014. http://hdl.handle.net/10012/8262.

Full text
Abstract:
There is an increasing demand from businesses and industries to make the best use of their data. Clustering is a powerful tool for discovering natural groupings in data. The k-means algorithm is the most commonly used data clustering method, having gained popularity for its effectiveness on various data sets and ease of implementation on different computing architectures. It assumes, however, that data are available in an attribute-value format, and that each data instance can be represented as a vector in a feature space where the algorithm can be applied. These assumptions are impractical for real data, and they hinder the use of complex data structures in real-world clustering applications. The kernel k-means is an effective method for data clustering which extends the k-means algorithm to work on a similarity matrix over complex data structures. The kernel k-means algorithm is, however, computationally very complex, as it requires the complete similarity (kernel) matrix to be calculated and stored. Further, the kernelized nature of the kernel k-means algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. This thesis defines a family of kernel-based low-dimensional embeddings that allows for scaling kernel k-means on MapReduce via an efficient and unified parallelization strategy. Three practical methods for low-dimensional embedding that adhere to our definition of the embedding family are then proposed. Combining the proposed parallelization strategy with any of the three embedding methods constitutes a complete scalable and efficient MapReduce algorithm for kernel k-means. The efficiency and the scalability of the presented algorithms are demonstrated analytically and empirically.
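One member of the kind of embedding family described here is a Nystrom-style map, sketched serially below: landmark points define a low-dimensional embedding on which ordinary k-means is run. The RBF kernel, the landmark count, the seeding, and the data are illustrative assumptions, not the thesis's exact constructions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

def rbf(A, B, gamma=1.0):
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

m = 10
L = X[rng.choice(len(X), m, replace=False)]        # landmark points
W = rbf(L, L)
vals, vecs = np.linalg.eigh(W + 1e-8 * np.eye(m))  # W^{-1/2} via eigendecomposition
E = rbf(X, L) @ (vecs @ np.diag(vals ** -0.5) @ vecs.T)  # n x m embedding

# plain k-means (k = 2) on the embedded points; this is the part that a
# MapReduce parallelization strategy would run as a scalable job
C = E[[0, -1]].copy()                              # one seed from each blob
for _ in range(20):
    labels = ((E[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
    C = np.array([E[labels == j].mean(0) for j in range(2)])
print((labels[:50] == labels[0]).all() and (labels[50:] == labels[-1]).all()
      and labels[0] != labels[-1])                 # True: the blobs separate
```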
APA, Harvard, Vancouver, ISO, and other styles
49

Liu, Yu-Yang, and 劉育瑒. "Parallel Genetic-Fuzzy Mining with MapReduce Architecture." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/eq783m.

Full text
Abstract:
Master's thesis
National Sun Yat-sen University
Department of Computer Science and Engineering
2014 (ROC year 103)
Fuzzy data mining can successfully find hidden linguistic association rules by transforming quantitative information into fuzzy membership values. In the derivation process, good membership functions play a key role in the quality of the final results. In the past, some approaches were proposed to train membership functions with genetic algorithms and could indeed improve the quality of the found rules; such methods, however, suffered from long execution times in the training phase. Besides, after appropriate fuzzy membership functions are found, mining the frequent itemsets from them is as time-consuming a process as traditional data mining. In this thesis, we thus propose a series of approaches based on the MapReduce architecture to speed up the GA-fuzzy mining process. The contributions can be divided into three parts, all performed with MapReduce: data preprocessing, membership-function training by GA, and fuzzy association-rule derivation. For data preprocessing, the proposed approach not only transforms the original data into the key-value format required by MapReduce, but also efficiently reduces redundant database scans by joining the quantities into lists. For membership-function training by GA, the fitness evaluation, which is the most time-consuming step, is distributed to shorten the execution time. Finally, a distributed fuzzy rule mining approach based on FP-growth is designed to improve the time efficiency of finding fuzzy association rules. The performance of using a single processor versus using MapReduce is compared and discussed through experiments, and the results show that our approaches can efficiently reduce the execution time of the whole process.
APA, Harvard, Vancouver, ISO, and other styles
50

Hsieh, Cheng-Han, and 謝承翰. "Parallel Black Hole Clustering Based on MapReduce." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/db9m35.

Full text
Abstract:
Master's thesis
National Sun Yat-sen University
Department of Computer Science and Engineering
2015 (ROC year 104)
One of the key reasons traditional clustering methods are inefficient for analyzing large-scale datasets is that most of them are designed for centralized systems: if the size of the input data exceeds the storage or memory of such a system, clustering on it becomes very difficult. To mitigate this problem, an efficient clustering algorithm called MapReduce Black Hole (MRBH) is presented in this thesis, leveraging the strength of the black hole algorithm and the MapReduce programming model of Hadoop to accelerate clustering through both software and hardware. Using MapReduce, MRBH divides a large dataset into a number of small data sets and clusters them in parallel. Moreover, MRBH inherits the characteristics of the black hole algorithm, meaning that no parameters have to be set manually, so the implementation is easy. To evaluate the performance of the proposed algorithm, several datasets are used with different numbers of nodes. Experimental results show that the proposed algorithm provides a significant speedup as the number of nodes increases.
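A toy serial version of the black hole heuristic that MRBH parallelizes is sketched below; in MRBH the per-star fitness evaluations are what MapReduce distributes. The data and parameters are illustrative assumptions, and only the movement and event-horizon rules follow the published black hole algorithm.

```python
import math, random

random.seed(2)
data = [(random.gauss(0, 0.4), random.gauss(0, 0.4)) for _ in range(40)] + \
       [(random.gauss(4, 0.4), random.gauss(4, 0.4)) for _ in range(40)]
K, STARS, LO, HI = 2, 12, -1.0, 5.0

def fitness(star):  # total distance of points to their nearest centroid
    return sum(min(math.dist(p, c) for c in star) for p in data)

def random_star():
    return [(random.uniform(LO, HI), random.uniform(LO, HI)) for _ in range(K)]

stars = [random_star() for _ in range(STARS)]
for _ in range(60):
    fits = [fitness(s) for s in stars]     # the map phase in MRBH
    bh = stars[fits.index(min(fits))]      # best star becomes the black hole
    radius = min(fits) / sum(fits)         # event horizon
    for i, s in enumerate(stars):
        if s is bh:
            continue
        moved = [tuple(x + random.random() * (b - x) for x, b in zip(c, bc))
                 for c, bc in zip(s, bh)]  # pull each centroid toward the hole
        crossed = sum(math.dist(c, bc) for c, bc in zip(moved, bh)) < radius
        stars[i] = random_star() if crossed else moved
print(min(stars, key=fitness))             # centroids near (0, 0) and (4, 4)
```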
APA, Harvard, Vancouver, ISO, and other styles
