Dissertations / Theses on the topic 'MapReduce'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations / theses for your research on the topic 'MapReduce.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Gault, Sylvain. "Improving MapReduce Performance on Clusters." Thesis, Lyon, École normale supérieure, 2015. http://www.theses.fr/2015ENSL0985/document.
Nowadays, more and more scientific fields rely on data mining to produce new results. These raw data are produced at an increasing rate by tools such as DNA sequencers in biology, the Large Hadron Collider (LHC) in physics, which produced 25 petabytes per year as of 2012, or the Large Synoptic Survey Telescope (LSST), which should produce 30 petabytes of data per night. High-resolution scanners in medical imaging and social networks also produce huge amounts of data. This data deluge raises several challenges in terms of storage and computer processing. Google proposed in 2004 to use the MapReduce model in order to distribute the computation across several computers. This thesis focuses mainly on improving the performance of a MapReduce environment. In order to easily replace the software parts needed to improve the performance, designing a modular and adaptable MapReduce environment is necessary; this is why a component-based approach is studied in order to design such a programming environment. In order to study the performance of a MapReduce application, modeling the platform, the application and their performance is mandatory. These models should be precise enough for the algorithms using them to produce meaningful results, but also simple enough to be analyzed. A state of the art of the existing models is given and a new model adapted to the needs is defined. In order to optimise a MapReduce environment, the first approach studied is a global optimization, which results in a computation time reduced by up to 47%. The second approach focuses on the shuffle phase of MapReduce, when all the nodes may send some data to every other node. Several algorithms are defined and studied for the case where the network is the bottleneck of the data transfers. These algorithms are tested on the Grid'5000 experiment platform and usually show behavior close to the lower bound, while the trivial approach is far from it.
Polo, Jordà. "Multi-constraint scheduling of MapReduce workloads." Doctoral thesis, Universitat Politècnica de Catalunya, 2014. http://hdl.handle.net/10803/276174.
Nilsson, Johan. "Hadoop MapReduce in Eucalyptus Private Cloud." Thesis, Umeå universitet, Institutionen för datavetenskap, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-51309.
Kloss, Fernando Cesar. "Motor de transformações baseado em Mapreduce [A MapReduce-based transformation engine]." Repositório Institucional da UFPR, 2013. http://hdl.handle.net/1884/35083.
Memon, Neelam. "Anonymizing large transaction data using MapReduce." Thesis, Cardiff University, 2016. http://orca.cf.ac.uk/97342/.
Hammoud, Suhel. "MapReduce network enabled algorithms for classification based on association rules." Thesis, Brunel University, 2011. http://bura.brunel.ac.uk/handle/2438/5833.
Deolikar, Piyush P. "Lecture Video Search Engine Using Hadoop MapReduce." Thesis, California State University, Long Beach, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10638908.
With the advent of the Internet and the ease of uploading video content to video libraries and social networking sites, video data availability has increased very rapidly during this decade. Universities are uploading video tutorials for their online courses, and companies like Udemy, Coursera, Lynda, etc. have made video tutorials available over the Internet. We propose and implement a scalable solution that helps find relevant videos with respect to a query provided by the user. Our solution maintains an updated list of the available videos on the web and assigns each a rank according to its relevance. The proposed solution consists of three main components that can mutually interact. The first component, called the crawler, continuously visits and locally stores the relevant information of the webpages with videos available on the Internet; the crawler has several threads concurrently parsing webpages. The second component builds the inverted index of the web pages stored by the crawler; given a query, the inverted index is used to obtain the videos that contain the words in the query. The third component computes the rank of each video, which is then used to display the results in order of relevance. We implement this scalable solution in the Apache Hadoop framework. Hadoop is a distributed computing framework that provides a distributed file system able to handle large files as well as distributed computation among the participating nodes.
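The second component described in this abstract is a classic MapReduce pattern. As a rough sketch of the idea only (not the thesis's code), an inverted-index job in Hadoop's Java API could look like the following, assuming input records of the form docId<TAB>page text as produced by KeyValueTextInputFormat:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (docId, page text) -> (word, docId) for every word on the page.
// Assumes KeyValueTextInputFormat, which splits each line at the first tab.
public class InvertedIndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text docId, Text page, Context context)
            throws IOException, InterruptedException {
        for (String word : page.toString().toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word), docId);
            }
        }
    }
}

// Reduce: (word, [docId, ...]) -> (word, comma-separated posting list),
// which is the inverted index consulted at query time.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> docIds, Context context)
            throws IOException, InterruptedException {
        StringBuilder postings = new StringBuilder();
        for (Text id : docIds) {
            if (postings.length() > 0) postings.append(',');
            postings.append(id);
        }
        context.write(word, new Text(postings.toString()));
    }
}
```

A query is then answered by intersecting the posting lists of its words, with the third component's rank deciding display order.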
Kolb, Lars. "Effiziente MapReduce-Parallelisierung von Entity Resolution-Workflows [Efficient MapReduce parallelization of entity resolution workflows]." Doctoral thesis, Universitätsbibliothek Leipzig, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-157163.
Dyer, James. "Secure computation in the cloud using MapReduce." Thesis, University of Manchester, 2018. https://www.research.manchester.ac.uk/portal/en/theses/secure-computation-in-the-cloud-using-mapreduce(8f63dc8e-dc35-43ec-a083-9f3a6230c142).html.
Elteir, Marwa Khamis. "A MapReduce Framework for Heterogeneous Computing Architectures." Diss., Virginia Tech, 2012. http://hdl.handle.net/10919/28786.
Wang, Guanying. "Evaluating MapReduce System Performance: A Simulation Approach." Diss., Virginia Tech, 2012. http://hdl.handle.net/10919/28820.
Akokhia, Emmanuel Oshoke. "NETFLIX films database clustering using MAPREDUCE technology." Thesis, Тернопільський національний технічний університет імені Івана Пулюя, 2017. http://elartu.tntu.edu.ua/handle/123456789/19533.
Темирбекова, Ж. Е., and Ж. М. Меренбаев. "Параллельное масштабирование изображений в технологии mapreduce hadoop [Parallel image scaling with Hadoop MapReduce technology]." Thesis, Сумский государственный университет, 2015. http://essuir.sumdu.edu.ua/handle/123456789/40775.
Lam, Wilma Samhita Samuel. "A MapReduce Performance Study of XML Shredding." University of Cincinnati / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1467126954.
Lu, Peng. "Application profiling and resource management for MapReduce." Thesis, The University of Sydney, 2015. http://hdl.handle.net/2123/13969.
Li, Lei. "Rolling Window Time Series Prediction Using MapReduce." Thesis, The University of Sydney, 2014. http://hdl.handle.net/2123/13552.
Alkan, Sertan. "A Distributed Graph Mining Framework Based On Mapreduce." Master's thesis, METU, 2010. http://etd.lib.metu.edu.tr/upload/12611588/index.pdf.
Wang, Yongzhi. "Constructing Secure MapReduce Framework in Cloud-based Environment." FIU Digital Commons, 2015. http://digitalcommons.fiu.edu/etd/2238.
Pugsley, Seth Hintze. "Opportunities for near data computing in MapReduce workloads." Thesis, The University of Utah, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3704952.
In-memory big data applications are growing in popularity, including in-memory versions of the MapReduce framework. The move away from disk-based datasets shifts the performance bottleneck from slow disk accesses to memory bandwidth. MapReduce is a data-parallel application, and is therefore amenable to being executed on as many parallel processors as possible, with each processor requiring high amounts of memory bandwidth. We propose using Near Data Computing (NDC) as a means to develop systems that are optimized for in-memory MapReduce workloads, offering high compute parallelism and even higher memory bandwidth. This dissertation explores three different implementations and styles of NDC to improve MapReduce execution. First, we use 3D-stacked memory+logic devices to process the Map phase on compute elements in close proximity to database splits. Second, we attempt to replicate the performance characteristics of the 3D-stacked NDC using only commodity memory and inexpensive processors to improve performance of both Map and Reduce phases. Finally, we incorporate fixed-function hardware accelerators to improve sorting performance within the Map phase. This dissertation shows that it is possible to improve in-memory MapReduce performance by potentially two orders of magnitude by designing system and memory architectures that are specifically tailored to that end.
Wang, Liqiang. "An Efficient Platform for Large-Scale MapReduce Processing." ScholarWorks@UNO, 2009. http://scholarworks.uno.edu/td/963.
Izurieta, Iván Carrera. "Performance modeling of MapReduce applications for the cloud." Biblioteca Digital de Teses e Dissertações da UFRGS, 2014. http://hdl.handle.net/10183/99055.
In recent years, Cloud Computing has become a key technology that made it possible to run applications without needing to deploy a physical infrastructure, with the advantage of lowering costs to the user by charging only for the computational resources used by the application. The challenge in deploying distributed applications in Cloud Computing environments is that the virtual machine infrastructure should be planned in a way that is time- and cost-effective. Also, in recent years we have seen the amount of data produced by applications grow larger than ever. This data contains valuable information that has to be extracted using tools like MapReduce. MapReduce is an important framework for analyzing large amounts of data since it was proposed by Google and made open source by Apache with its Hadoop implementation. The goal of this work is to show that the execution time of a distributed application, namely a MapReduce application, in a Cloud Computing environment can be predicted using a mathematical model based on theoretical specifications. This prediction is made to help users of the Cloud Computing environment plan their deployments, i.e., quantify the number of virtual machines and their characteristics in order to lower cost and/or time. After measuring the application execution time while varying the parameters stated in the mathematical model, and then applying a linear regression technique, the goal is achieved by finding a model of the execution time, which was then applied to predict the execution time of MapReduce applications with satisfactory results. The experiments were conducted in several configurations: private and public clusters, as well as commercial cloud infrastructures, running different MapReduce applications, and varying the number of nodes composing the cluster as well as the amount of workload given to the application. Experiments showed a clear relation with the theoretical model, revealing that the model is in fact able to predict the execution time of MapReduce applications. The developed model is generic, meaning that it uses theoretical abstractions for the computing capacity of the environment and the computing cost of the MapReduce application. Further work on extending this approach to fit other types of distributed applications is encouraged, as well as on including this mathematical model in Cloud services offering MapReduce platforms, in order to aid users in planning their deployments.
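The model itself is not reproduced in this listing, but the shape the abstract describes — execution time as a linear function of workload and cluster parameters, with coefficients fitted by regression — can be sketched as follows; the concrete terms here are our illustration, not the author's actual equation:

```latex
% Illustrative sketch only (not the thesis's exact model):
% predicted execution time T of a MapReduce job as a linear function
% of input workload W, number of virtual machines n, and per-VM
% computing capacity c, with \alpha and \beta fitted by linear
% regression over measured runs.
T(W, n) \approx \alpha + \beta \cdot \frac{W}{n \cdot c}
```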
Aljarah, Ibrahim Mithgal. "Mapreduce-Enabled Scalable Nature-Inspired Approaches for Clustering." Diss., North Dakota State University, 2014. https://hdl.handle.net/10365/27094.
Sugandharaju, Ravi Kumar Chatnahalli. "Gaussian Deconvolution and MapReduce Approach for Chipseq Analysis." University of Cincinnati / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1307323875.
Zhang, Yue. "A Workload Balanced MapReduce Framework on GPU Platforms." Wright State University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=wright1450180042.
Xu, Xiaoyong. "QoS-guaranteed resource provisioning for cloud-based MapReduce." Thesis, Queensland University of Technology, 2016. https://eprints.qut.edu.au/97990/1/Xiaoyong_Xu_Thesis.pdf.
Grythe, Knut Auvor. "Automated tuning of MapReduce performance in Vespa Document Store." Thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2007. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9584.
MapReduce is a programming model for distributed processing, originally designed by Google Inc. It is designed to simplify the implementation and deployment of distributed programs. Vespa Document Store (VDS) is a distributed document storage solution developed by Yahoo! Technologies Norway. VDS does not currently have any feature allowing distributed aggregation of data; therefore, a prototype of the MapReduce distributed programming model was previously developed. However, the implementation requires manual tuning of several parameters before each deployment. The goal of this thesis is to allow as many as possible of these parameters to be either automatically configured or set to universally suitable defaults. We have created a working MapReduce implementation based on previous work, and a framework for monitoring of VDS nodes. Various VDS features have been documented in detail, and this documentation has been used to analyse how the performance of these features may be improved. We have also performed various experiments to validate the analysis and gain additional insight. Numerous configuration options for either VDS in general or the MapReduce implementation have been considered, and recommended settings have been proposed. The propositions are either in the form of default values or algorithms for computing the most suitable setting. Finally, we provide a list of suggested further work, with suggestions for both general VDS improvements and MapReduce-specific research.
Fries, Sergej. "Efficient clustering of massive data with MapReduce." Aachen: Hochschulbibliothek der Rheinisch-Westfälischen Technischen Hochschule Aachen, 2015. http://d-nb.info/1074562143/34.
Lakshminarayanan, Mahalakshmi. "ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce." Thesis, The University of Toledo, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1554211.
Similarity Join is an important operation for data mining, with a diverse range of real-world applications. Three efficient MapReduce algorithms for performing Similarity Joins between multisets are proposed in this thesis. Filtering techniques for similarity joins minimize the number of pairs of entities joined and are hence vital for improving the efficiency of the algorithm. Multisets represent real-world data better by considering the frequency of their elements. Prior serial algorithms incorporate filtering techniques only for sets, not multisets, while prior MapReduce algorithms either do not incorporate any filtering technique or incorporate prefix filtering inefficiently, with poor scalability.
This work extends the filtering techniques, namely the prefix, size, positional and suffix filters, to multisets, and also achieves the challenging task of efficiently incorporating them in the shared-nothing MapReduce model. Incorporating the filtering techniques in a strategic sequence minimizes the pairs generated and joined, resulting in I/O, network and computational efficiency. In the SSS algorithm, prefix, size and positional filtering are incorporated in the MapReduce framework. The pairs that survive filtering are joined in the third Similarity Join Stage, utilizing a Multiset File generated in the second stage. We also developed a technique to enhance the scalability of the algorithm as a contingency need.
In the ESSJ algorithm, all the filtering techniques, namely prefix, size, positional as well as suffix filtering, are incorporated in the MapReduce framework. It is designed with a seamless and scalable Similarity Join Stage, where the similarity joins are performed without dependency on a file.
In the EASE algorithm, all the filtering techniques, namely prefix, size, positional and suffix, are incorporated in the MapReduce framework. However, it is tailored as a hybrid algorithm that exploits the strategies of both SSS and ESSJ for performing the joins: some multiset pairs are joined utilizing the Multiset File, as in SSS, and some are joined without it, as in ESSJ. The algorithm harvests the benefits of both strategies.
SSS and ESSJ algorithms were developed using Hadoop and tested using real-world Twitter data. For both SSS and ESSJ, experimental results demonstrate phenomenal performance gains of over 70% in comparison to the competing state-of-the-art algorithm.
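For readers unfamiliar with the filters named above: prefix filtering rests on the observation that, once every record's tokens are sorted in one global order, two records can only reach a Jaccard threshold t if their short prefixes share at least one token. The sketch below shows the idea for plain sets on a single machine; the thesis's contribution is extending such filters to multisets and placing them efficiently in MapReduce, and the code here is ours, for illustration only:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Prefix filtering for set-similarity joins: with tokens sorted in a
// global order, two sets with Jaccard(x, y) >= t must share at least
// one token among their first |r| - ceil(t * |r|) + 1 tokens.
public class PrefixFilter {
    static int prefixLength(int size, double t) {
        return size - (int) Math.ceil(t * size) + 1;
    }

    // Index each record id under its prefix tokens only. Records that
    // share no prefix token are never even paired, let alone verified.
    // In an SSS-style job this loop is the map phase: the token is the
    // key, and a reducer joins the records grouped under it.
    static Map<Integer, List<Integer>> buildPrefixIndex(
            List<int[]> records, double t) {
        Map<Integer, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < records.size(); id++) {
            int[] r = records.get(id); // tokens sorted in the global order
            int p = prefixLength(r.length, t);
            for (int i = 0; i < p && i < r.length; i++) {
                index.computeIfAbsent(r[i], k -> new ArrayList<>()).add(id);
            }
        }
        return index;
    }
}
```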
Happ, Patrick Nigri. "A Distributed Region Growing Image Segmentation Based on MapReduce." Pontifícia Universidade Católica do Rio de Janeiro, 2015. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=34941@1.
Image segmentation is a critical step in image analysis, and it generally involves a high computational cost, especially when dealing with large volumes of data. Given the significant increase in the spatial, spectral and temporal resolutions of remote sensing imagery in recent years, current sequential and parallel solutions fail to deliver the expected performance and scalability. This work proposes a distributed image segmentation method capable of handling very large high-resolution images in an efficient and scalable way. The proposed solution is based on the MapReduce model, which offers a highly scalable and reliable framework for storing and processing massive data in cluster environments and in private and public computing clouds. The proposed method is extendable to any region-growing algorithm and can be adapted to other models. The solution was implemented and validated using the Hadoop platform. Experimental results attest to the viability of performing distributed segmentation over the MapReduce model through cloud computing.
Wottrich, Rodolfo Guilherme. "Loop parallelization in the cloud using OpenMP and MapReduce." [s.n.], 2014. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275500.
Master's dissertation - Universidade Estadual de Campinas, Instituto de Computação
The pursuit of parallelism has always been an important goal in the design of computer systems, driven mainly by the constant interest in reducing program execution time. Parallel programming is an active research area, which has grown in interest due to the emergence of multicore architectures. On the other hand, harnessing the large computing and storage capabilities of the cloud and its desirable flexibility and scaling features offers a number of interesting opportunities to address some relevant research problems in scientific computing. Unfortunately, in many cases the implementation of applications on the cloud demands specific knowledge of parallel programming interfaces and APIs, which may become a burden when programming complex applications. To overcome such limitations, in this work we propose OpenMR, an execution model based on the syntax and principles of the OpenMP API which eases the task of programming distributed systems (i.e. local clusters or remote cloud). Specifically, this work addresses the problem of performing loop parallelization, using OpenMR, in a distributed environment, through the mapping of loop iterations to MapReduce nodes. By doing so, the cloud programming interface becomes the programming language itself, freeing the developer from the task of worrying about the details of distributing workload and data. To assess the validity of the proposal, we modified benchmarks from the SPEC OMP2012 suite to fit the proposed model, developed other I/O-bound toy benchmarks and executed them in two settings: (a) a computer cluster locally available through a standard LAN; and (b) clusters remotely available through the Amazon AWS services. We compare the results to the execution using OpenMP in an SMP architecture and show that the proposed parallelization technique is feasible and demonstrates good scalability.
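The central mapping in this proposal — loop iterations to MapReduce nodes — can be pictured with a small sketch. The input format, class names and loop body below are our own illustrative assumptions, not OpenMR itself: each input record is an iteration range, and a reducer plays the role of OpenMP's reduction clause.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each input line is a half-open iteration range "begin,end" -- the unit
// of work a chunk of a parallelized loop would cover on one worker.
public class LoopChunkMapper
        extends Mapper<LongWritable, Text, NullWritable, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text range, Context context)
            throws IOException, InterruptedException {
        String[] parts = range.toString().split(",");
        long begin = Long.parseLong(parts[0].trim());
        long end = Long.parseLong(parts[1].trim());
        double partial = 0.0;
        for (long i = begin; i < end; i++) {
            partial += body(i); // the loop body being distributed
        }
        // A reducer sums the partials, like OpenMP's reduction(+:...) clause.
        context.write(NullWritable.get(), new DoubleWritable(partial));
    }

    private static double body(long i) {
        return 1.0 / (i + 1); // placeholder loop body for illustration
    }
}
```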
Neves, Marcelo Veiga. "Application-aware software-defined networking to accelerate mapreduce applications." Pontifícia Universidade Católica do Rio Grande do Sul, 2015. http://hdl.handle.net/10923/7074.
The rise of Internet of Things sensors, social networking and mobile devices has led to an explosion of available data. Gaining insights into this data has led to the area of Big Data analytics. The MapReduce (MR) framework, as implemented in Hadoop, has become the de facto standard for Big Data analytics. It also forms a base platform for a plurality of Big Data technologies that are used today. To handle the ever-increasing data size, Hadoop is a scalable framework that allows dedicated, seemingly unbound numbers of servers to participate in the analytics process. Response time of an analytics request is an important factor for time to value/insights. While the compute and disk I/O requirements can be scaled with the number of servers, scaling the system leads to increased network traffic. Arguably, the communication-heavy phase of MR contributes significantly to the overall response time. This problem is further aggravated if communication patterns are heavily skewed, as is not uncommon in many MR workloads. MR applications normally run in large data centers (DCs) employing dense network topologies (e.g., multi-rooted trees) with multiple paths available between any pair of hosts. These DC network designs, combined with recent software-defined networking (SDN) programmability, offer a new opportunity to dynamically and intelligently configure the network to achieve shorter application runtime. The initial intuition motivating our work is that the well-defined structure of MR and the rich traffic demand information available in Hadoop's log and meta-data files could be used to guide the network control. We therefore conjecture that an application-aware network control (i.e., one that knows the application-level semantics and traffic demands) can improve MR applications' performance when compared to state-of-the-art application-agnostic network control. To confirm our thesis, we first studied MR systems in detail and identified typical communication patterns and common causes of network-related performance bottlenecks in MR applications. Then, we studied the state of the art in DC networks and evaluated its ability to handle MapReduce-like communication patterns. Our results confirmed the assumption that existing techniques are not able to deal with MR communication patterns, mainly because of the lack of visibility of application-level information. Based on these findings, we proposed an architecture for an application-aware network control for DCs running MR applications. We implemented a prototype within an SDN controller and used it to successfully accelerate MR applications. Depending on the network oversubscription ratio, we demonstrated a 2% to 58% reduction in the job completion time for popular MR benchmarks, when compared to ECMP (the de facto flow allocation algorithm in multipath DC networks), thus confirming the thesis. Other contributions include a method to predict network demands in MR applications, algorithms to identify the critical communication path in the MR shuffle and dynamically allocate paths to flows in a multipath network, and an emulation-based testbed for realistic MR workloads.
Lakshminarayanan, Mahalakshmi. "ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce." University of Toledo / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1383931387.
Liu, Wei, and 柳維. "Privacy Preserving for MapReduce Netwok using SDN with MapReduce Virtual Management." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/09411923656857766696.
國立中正大學 (National Chung Cheng University)
通訊工程研究所 (Institute of Communications Engineering)
104
MapReduce, as the heart of Hadoop, has become a new programming paradigm with good scalability for processing big data in a distributed environment. However, data privacy poses a new challenge in cloud computing environments. Although many approaches have been proposed to improve data privacy in MapReduce networks, the man-in-the-middle attack remains unaddressed. We propose a novel approach, called MapReduce virtual management (MVM), which uses a Software-Defined Networking (SDN) structure to enhance data privacy protection. MVM is a virtualized SDN application with the benefits of simplicity, agility, and automation across the system. Experimental results show that MVM is able to route data properly with a low eavesdropping probability under different network topologies.
Chi, Wen-Chun, and 紀玟君. "Parallel QBL-PSO Using MapReduce." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/42584912905270860447.
Wu, Dong-Yuan, and 吳東原. "Set-similarity joins using MapReduce." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/4g35s3.
玄奘大學 (Hsuan Chuang University)
資訊管理學系碩士班 (Master's Program, Department of Information Management)
106
Data querying and massive data analysis are now ubiquitous, and both require techniques for comparing data, which this study supports. As our groundwork we adopted an algorithm from the literature on set-similarity joins in the MapReduce framework, called the RF comparing algorithm. We modified the RF comparing algorithm to address its defects and developed a new, more efficient algorithm called the Prefix Accumulating algorithm. Our solution identifies similarities between data sets with the MapReduce framework and outputs a table of the similarities between them. The algorithm has two phases. In the first MapReduce pass, we use prefix filtering to pick out, from a large amount of data, the records that could possibly match each other, then collect each candidate pair while accumulating their common elements. In the second pass, we verify the remaining part of the data for each candidate pair, then combine the union and intersection sizes to calculate similarities. Our experiments show that the Prefix Accumulating algorithm is faster than the RF comparing algorithm. The advantage of the Prefix Accumulating algorithm is that it does not need to compare the complete data again after prefix filtering; its disadvantage is that the more the data is partitioned, the higher the cost of integrating the data.
Yadav, Rakesh. "Genetic Algorithms Using Hadoop MapReduce." Thesis, 2015. http://ethesis.nitrkl.ac.in/7790/1/2015_Genetic_Yadav.pdf.
Xiang, Jingen. "Scalable Scientific Computing Algorithms Using MapReduce." Thesis, 2013. http://hdl.handle.net/10012/7830.
Côrte-Real, Joana. "A MapReduce Construct for Yap Prolog." Dissertation, 2013. http://hdl.handle.net/10216/73862.
Peng, Hao-Ting, and 彭晧廷. "A Block-Oriented MapReduce Programming Environment." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/hns5hz.
國立高雄應用科技大學 (National Kaohsiung University of Applied Sciences)
電機工程系博碩士班 (Master's and Doctoral Program, Department of Electrical Engineering)
104
This study aims to develop a block-oriented MapReduce programming environment called Blockly-MR, based on Google Blockly. In this environment, users can create MapReduce programs by dragging and dropping blocks. Blockly-MR automatically translates the blocks into Java source programs and sends the source programs to backend Hadoop clusters for compilation and execution. Users can view the source code and execution results of the programs in front-end web pages. With the support of Blockly-MR, users can easily develop MapReduce programs without needing to learn Java programming or the Hadoop API. Consequently, the proposed programming environment can effectively reduce the complexity of MapReduce programming, which is useful for encouraging users to learn MapReduce programming.
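As an illustration of what "translating blocks into Java source programs" means in the Hadoop world, the following is a plain word-count job of the kind a block-based front end could plausibly generate and submit; it is our sketch, not Blockly-MR's actual output.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GeneratedWordCount {
    // "map" block: split each line into words, emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(line.toString());
            while (it.hasMoreTokens()) {
                ctx.write(new Text(it.nextToken()), ONE);
            }
        }
    }

    // "reduce" block: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
        }
    }

    // Driver: the part such a tool would assemble and submit for the user.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "blockly-wordcount");
        job.setJarByClass(GeneratedWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```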
Costa, Pedro Alexandre Reis Sá da. "Hadoop MapReduce tolerante a faltas bizantinas [Byzantine fault-tolerant Hadoop MapReduce]." Master's thesis, 2011. http://hdl.handle.net/10451/8695.
MapReduce is often used to run critical jobs such as scientific data analysis. However, evidence in the literature shows that arbitrary faults do occur and can probably corrupt the results of MapReduce jobs. MapReduce runtimes like Hadoop tolerate crash faults, but not arbitrary or Byzantine faults. In this work, a MapReduce algorithm and prototype that tolerate these faults are presented. An experimental evaluation shows that the execution of a job with the implemented algorithm uses twice the resources of the original Hadoop, instead of the 3 or 4 times more that would be achieved with the direct application of common Byzantine fault-tolerance paradigms. It is believed that this cost is acceptable for critical applications that require that level of fault tolerance.
da Costa, Pedro Alexandre Reis Sá. "Hadoop mapreduce tolerante a faltas bizantinas [Byzantine fault-tolerant Hadoop MapReduce]." Master's thesis, 2011. http://hdl.handle.net/10451/13903.
Côrte-Real, Joana Sílvia Santos. "A MapReduce Construct for Yap Prolog." Master's thesis, 2013. https://repositorio-aberto.up.pt/handle/10216/68038.
Elnikety, Eslam. "iHadoop: Asynchronous Iterations Support for MapReduce." Thesis, 2011. http://hdl.handle.net/10754/209389.
Chiang, Hsuan-Yu, and 江炫佑. "MapReduce-based Nearest Window Cluster Query Processing." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/19737665874236805972.
國立交通大學 (National Chiao Tung University)
資訊科學與工程研究所 (Institute of Computer Science and Engineering)
105
With the growing development of Geographical Information Systems (GIS) and Location-Based Services (LBS), various applications of spatial queries have been proposed. However, due to the wide use of mobile devices and the explosive growth of users, a huge amount of data is generated over time. Distributed computing techniques are used to facilitate efficient query processing on such huge data. In this paper, we focus on a spatial query called the Nearest Window Cluster Query (NWCQ). Given a query point, a desired window size and an amount of data objects, NWCQ returns a group of objects within the desired window range. Previous work used the R-tree for spatial indexing, but R-tree index nodes may overlap, which is inappropriate for distributed computing. Therefore, based on the MapReduce framework, we propose a grid-based indexing algorithm to index data objects and a companion query processing algorithm for NWCQ.
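A grid index avoids the node overlap mentioned above because cells partition space disjointly, so each cell can be handled by an independent task. A minimal sketch of the cell-assignment map step follows; the cell size, naming and the assumption of non-negative coordinates are ours for illustration, and the thesis's actual scheme may differ:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assign each point "x,y" to a fixed-size grid cell. Unlike R-tree nodes,
// cells never overlap, so downstream tasks can index them independently.
// Assumes non-negative coordinates within a known extent.
public class GridIndexMapper
        extends Mapper<LongWritable, Text, IntWritable, Text> {
    private static final double CELL_SIZE = 0.01; // illustrative granularity
    private static final int CELLS_PER_ROW = 36000;

    @Override
    protected void map(LongWritable offset, Text point, Context context)
            throws IOException, InterruptedException {
        String[] xy = point.toString().split(",");
        double x = Double.parseDouble(xy[0]);
        double y = Double.parseDouble(xy[1]);
        int cellId = (int) (x / CELL_SIZE)
                   + CELLS_PER_ROW * (int) (y / CELL_SIZE);
        context.write(new IntWritable(cellId), point);
    }
}
```

Answering an NWCQ then only needs to touch the cells that intersect the candidate window around the query point.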
WU, I.-CHUN, and 吳亦鈞. "Data Mining Based on MapReduce Technology." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/ak4v53.
國立雲林科技大學 (National Yunlin University of Science and Technology)
資訊工程系 (Department of Computer Science and Information Engineering)
106
The widespread use of the Internet makes data easy to share and disseminate, but how to find important information in massive data becomes an issue. Data mining is one of the technologies for finding such information, and mining results are obtained faster when more computers are used. Thus, our data mining method is designed to run on the Hadoop distributed platform to speed up the mining time. Before mining, each item of a transaction in the original dataset is transformed to 1 or 0; 1 means that the transaction contains the corresponding item and 0 means it does not. This transformation of items into binary notation efficiently speeds up the mining work, since mostly only simple logic operations need to be executed. This paper proposes two algorithms, Brute Force and Candidate Itemset, to find the frequent itemsets. Brute Force uses the exhaustive method, while Candidate Itemset is based on the Apriori method and uses the Apriori property to prune unnecessary candidate itemsets rather than running all combinations of items as Brute Force does. In addition, a sequence mining algorithm is proposed to find sequential relationships among frequent items. The mushrooms, chess and c10d20k [26] datasets are run on Hadoop with our methods and those of Emin [21] and Li [15]. The experimental results show that for the mushrooms dataset, our mining time is 48% faster than Emin when the threshold is 0.15 and 63% faster than Li when the threshold is 0.45. For the chess dataset, our mining time is 22% and 97% faster than Emin and Li when the threshold is 0.55. For the c10d20k dataset, our mining time is 50% and 99% faster than Emin and Li when the threshold is 0.15.
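The 1/0 transformation this abstract describes maps each transaction onto a bit vector over the item universe, after which support counting is a matter of bitwise ANDs. Below is a single-machine sketch of that encoding — our illustration of the stated idea, not the thesis's code; on Hadoop the support loop would be split across mappers and the counts summed in a reducer:

```java
import java.util.BitSet;
import java.util.List;

// Each transaction becomes a BitSet: bit i is 1 iff the transaction
// contains item i. The support of an itemset is then counted with ANDs.
public class BinaryTransactions {
    static BitSet encode(List<Integer> transaction, int numItems) {
        BitSet bits = new BitSet(numItems);
        for (int item : transaction) bits.set(item);
        return bits;
    }

    // Count the transactions containing every item of the candidate itemset.
    static int support(List<BitSet> transactions, BitSet candidate) {
        int count = 0;
        for (BitSet t : transactions) {
            BitSet and = (BitSet) candidate.clone();
            and.and(t);                         // simple logic operation
            if (and.equals(candidate)) count++; // all candidate bits present
        }
        return count;
    }
}
```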
Elgohary, Ahmed. "Scalable Embeddings for Kernel Clustering on MapReduce." Thesis, 2014. http://hdl.handle.net/10012/8262.
Liu, Yu-Yang, and 劉育瑒. "Parallel Genetic-Fuzzy Mining with MapReduce Architecture." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/eq783m.
國立中山大學 (National Sun Yat-sen University)
資訊工程學系研究所 (Department of Computer Science and Engineering)
103
Fuzzy data mining can successfully find hidden linguistic association rules by transforming quantitative information into fuzzy membership values. In the derivation process, good membership functions play a key role in the quality of the final results. In the past, some approaches were proposed to train membership functions by genetic algorithms and could indeed improve the quality of the found rules. Those methods, however, suffered from long execution times in the training phase. Besides, after appropriate fuzzy membership functions are found, mining the frequent itemsets from them is also a very time-consuming process, as in traditional data mining. In this thesis, we thus propose a series of approaches based on the MapReduce architecture to speed up the GA-fuzzy mining process. The contributions can be divided into three parts: data preprocessing, membership-function training by GA, and fuzzy association-rule derivation, all performed by MapReduce. For data preprocessing, the proposed approach not only transforms the original data into key-value format to fit the requirements of MapReduce, but also efficiently reduces redundant database scans by joining the quantities into lists. For membership-function training by GA, the fitness evaluation, which is the most time-costly process, is distributed to shorten the execution time. At last, a distributed fuzzy rule mining approach based on FP-growth is designed to improve the time efficiency of finding fuzzy association rules. The performance of using a single processor and using MapReduce is compared and discussed in experiments, and the results show that our approaches can efficiently reduce the execution time of the whole process.
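The quantity-to-membership transformation underlying GA-fuzzy mining is commonly done with triangular membership functions, whose three corner points are exactly the kind of parameters a GA can encode and train. A small sketch under that assumption follows; the function shape and linguistic terms are illustrative, not the thesis's trained functions:

```java
// Triangular membership function: the degree rises linearly from a to the
// peak b, falls linearly from b to c, and is 0 outside (a, c). A GA can
// encode the (a, b, c) corners of each linguistic term as genes and
// evaluate their fitness over the data -- the step distributed in this work.
public class TriangularMembership {
    final String term;    // linguistic label, e.g. "low", "middle", "high"
    final double a, b, c; // assumed a < b < c

    TriangularMembership(String term, double a, double b, double c) {
        this.term = term; this.a = a; this.b = b; this.c = c;
    }

    double degree(double quantity) {
        if (quantity <= a || quantity >= c) return 0.0;
        return quantity <= b ? (quantity - a) / (b - a)
                             : (c - quantity) / (c - b);
    }
}
```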
Hsieh, Cheng-Han, and 謝承翰. "Parallel Black Hole Clustering Based on MapReduce." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/db9m35.
國立中山大學 (National Sun Yat-sen University)
資訊工程學系研究所 (Department of Computer Science and Engineering)
104
One of the key reasons that traditional clustering methods are inefficient for analyzing large-scale datasets is that most of them are designed for a centralized system: if the size of the input data exceeds the storage or memory of such a system, clustering on it becomes very difficult. To mitigate this problem, an efficient clustering algorithm, called MapReduce Black Hole (MRBH), is presented in this thesis to leverage the strength of the black hole algorithm and the MapReduce programming model of Hadoop to accelerate clustering by both software and hardware. Using MapReduce, MRBH divides a large dataset into a number of small data sets and clusters these smaller data sets in parallel. Moreover, MRBH inherits the characteristics of the black hole algorithm, meaning that no parameters need to be set manually; thus, the implementation is easy. To evaluate the performance of the proposed algorithm, several datasets are used with different numbers of nodes. Experimental results show that the proposed algorithm provides a significant speedup as the number of nodes increases.