Academic literature on the topic 'MapReduce'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'MapReduce.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "MapReduce"

1. Garg, Uttama. "Data Analytic Models That Redress the Limitations of MapReduce." International Journal of Web-Based Learning and Teaching Technologies 16, no. 6 (November 2021): 1–15. http://dx.doi.org/10.4018/ijwltt.20211101.oa7.

Abstract: The amount of data in today's world is increasing exponentially, and effectively analyzing big data is a very complex task. The MapReduce programming model, created by Google in 2004, revolutionized the big-data computing market. Nowadays the model is used by many for scientific and research analysis as well as for commercial purposes. The MapReduce model, however, is quite a low-level programming model and has many limitations, and active research is being undertaken to build models that overcome or remove these limitations. In this paper we study some popular data analytic models that redress some of the limitations of MapReduce, namely ASTERIX and Pregel (Giraph). We discuss these models briefly and, through the discussion, highlight how they are able to overcome MapReduce's limitations.

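For readers unfamiliar with how Pregel (Giraph) departs from MapReduce, the contrast the abstract draws rests on vertex-centric supersteps: each vertex holds state, receives messages from the previous superstep, and sends messages for the next, which suits iterative graph algorithms that chained map/reduce passes serve poorly. Below is a toy single-machine sketch of that superstep loop, using single-source shortest paths; the dict-based message passing is an illustrative stand-in, not Giraph's actual API:

```python
# Toy vertex-centric ("Pregel-style") single-source shortest paths.
# Illustrative sketch of the superstep model only, not Giraph's API.
INF = float("inf")

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
dist = {v: (0 if v == "a" else INF) for v in graph}

# Superstep 0: the source sends its distance to its neighbours.
messages = {v: [] for v in graph}
for nbr, w in graph["a"]:
    messages[nbr].append(dist["a"] + w)

while any(messages.values()):            # run supersteps until no messages
    next_messages = {v: [] for v in graph}
    for v, inbox in messages.items():
        if inbox and min(inbox) < dist[v]:   # vertex improves its value...
            dist[v] = min(inbox)
            for nbr, w in graph[v]:          # ...and notifies its neighbours
                next_messages[nbr].append(dist[v] + w)
    messages = next_messages

print(dist)  # {'a': 0, 'b': 1, 'c': 3}
```
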
2

Zhang, Yulun, Chenxu Zhang, Lei Yang, and Hongyang Li. "Large-scale Data Mining Method based on Clustering Algorithm Combined with MAPREDUCE." Transactions on Computer Science and Intelligent Systems Research 2 (December 21, 2023): 9–13. http://dx.doi.org/10.62051/8p9b3106.

Full text
Abstract:
With the continuous deepening and development of information technology, the diversity and amount of information in data continue to grow. Effectively mining these text data to extract valuable content has become an urgent task in the field of data research. This study combines the MapReduce distributed system with the K-means clustering algorithm to meet the challenges of large-scale data mining. At the same time, the paper use a distributed caching mechanism to solve the problem of repeated application of resources for multiple MapReduce collaborative operations and improve data mining efficiency. The combination of MapReduce's distributed computing and the advantages of K-means clustering algorithm provides an efficient and scalable method for large-scale data mining. Experimental results combining internal and external indicators show that the advantage of combining K-means with MapReduce is to fully utilize the distributed and parallel computing characteristics of MapReduce, providing users with an efficient and scalable data mining tool. Through this research, the paper provide new methods and insights for large-scale data mining, improving the efficiency and accuracy of data mining.
APA, Harvard, Vancouver, ISO, and other styles
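The K-means-on-MapReduce pairing described above follows a well-known pattern: the map phase assigns each point to its nearest centroid and the reduce phase recomputes each centroid as the mean of its assigned points, with a driver iterating until convergence. A minimal single-machine sketch of one such iteration, assuming in-memory lists rather than the paper's distributed setup or caching mechanism:

```python
import math
from collections import defaultdict

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans_iteration(points, centroids):
    """One MapReduce-style K-means step: map assigns, reduce re-averages."""
    # Map: emit (centroid_index, point) for every input point.
    groups = defaultdict(list)
    for p in points:
        groups[nearest_centroid(p, centroids)].append(p)
    # Reduce: for each key, the new centroid is the mean of its points.
    new_centroids = list(centroids)
    for idx, pts in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return new_centroids

points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.8, 8.2)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(5):  # a real driver would loop until centroids converge
    centroids = kmeans_iteration(points, centroids)
print(centroids)  # approximately [(1.1, 0.9), (7.9, 8.1)]
```
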
3. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce." Communications of the ACM 51, no. 1 (January 2008): 107–13. http://dx.doi.org/10.1145/1327452.1327492.

4. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce." Communications of the ACM 53, no. 1 (January 2010): 72–77. http://dx.doi.org/10.1145/1629175.1629198.

5. Zhang, Guigang, Chao Li, Yong Zhang, and Chunxiao Xing. "A Semantic++ MapReduce Parallel Programming Model." International Journal of Semantic Computing 08, no. 03 (September 2014): 279–99. http://dx.doi.org/10.1142/s1793351x14400091.

Abstract: Big data is playing a more and more important role in every area, such as medical health, internet finance, culture, and education. How to process such big data efficiently is a huge challenge. MapReduce is a good parallel programming model for processing big data, but it has many shortcomings: for example, it cannot handle complex computations and is ill-suited to real-time computing. To overcome these shortcomings of MapReduce and its variants, in this paper we propose a Semantic++ MapReduce parallel programming model. The study comprises the following parts: (1) the Semantic++ MapReduce parallel programming model, including its physical and logical frameworks; (2) a semantic++ extraction and management method for big data; (3) the Semantic++ MapReduce parallel computing framework, including semantic++ map, semantic++ reduce, and semantic++ shuffle; (4) Semantic++ MapReduce for multi-data centers, including its basic framework and its application framework; and (5) a case study of Semantic++ MapReduce across multiple data centers.

6. Wang, Zhong, Bo Suo, and Zhuo Wang. "MRScheduling: An Effective Technique for Multi-Tenant Meeting Deadline in MapReduce." Applied Mechanics and Materials 644-650 (September 2014): 4482–86. http://dx.doi.org/10.4028/www.scientific.net/amm.644-650.4482.

Abstract: The problem of scheduling multi-tenant jobs on the MapReduce framework has become more and more significant in contemporary society. Existing scheduling approaches and algorithms no longer fit well in scenarios where numerous jobs are submitted by multiple users at the same time. Therefore, with a view to enlarging MapReduce job throughput, we propose MRScheduling, which focuses on meeting each job's respective deadline. Considering the various parameters that determine the execution time of a MapReduce job, we present a simple time-cost model for quantifying the number of map slots and reduce slots to assign to a job. The MRScheduling algorithm is then discussed in detail. Finally, we evaluate our approach on both real and synthetic data on a real distributed cluster to verify its effectiveness and efficiency.

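A "simple time-cost model" of the kind this abstract mentions can be illustrated with wave arithmetic: a job with M map tasks of average duration t_m run on s_m slots takes roughly ceil(M/s_m)*t_m, and likewise for reduces, so a scheduler can search for the smallest slot allocation whose estimate meets the deadline. The sketch below is a hypothetical illustration of that idea, not the paper's actual model or parameters:

```python
import math

def estimated_runtime(n_map, t_map, n_reduce, t_reduce, map_slots, reduce_slots):
    """Rough wave-based runtime estimate for one MapReduce job (seconds)."""
    map_waves = math.ceil(n_map / map_slots)
    reduce_waves = math.ceil(n_reduce / reduce_slots)
    return map_waves * t_map + reduce_waves * t_reduce

def min_slots_for_deadline(n_map, t_map, n_reduce, t_reduce, deadline, max_slots):
    """Smallest (map_slots, reduce_slots) meeting the deadline, if any."""
    for total in range(2, 2 * max_slots + 1):      # prefer fewer slots overall
        for m in range(1, min(total, max_slots + 1)):
            r = total - m
            if 1 <= r <= max_slots and estimated_runtime(
                    n_map, t_map, n_reduce, t_reduce, m, r) <= deadline:
                return m, r
    return None  # deadline infeasible with the available slots

# 120 map tasks of ~30 s, 20 reduce tasks of ~60 s, 600 s deadline:
print(min_slots_for_deadline(120, 30, 20, 60, deadline=600, max_slots=50))
# (10, 5): 12 map waves * 30 s + 4 reduce waves * 60 s = 600 s
```
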
7. Chen, Rong, and Haibo Chen. "Tiled-MapReduce." ACM Transactions on Architecture and Code Optimization 10, no. 1 (April 2013): 1–30. http://dx.doi.org/10.1145/2445572.2445575.

8. Friedman, Eric, Peter Pawlowski, and John Cieslewicz. "SQL/MapReduce." Proceedings of the VLDB Endowment 2, no. 2 (August 2009): 1402–13. http://dx.doi.org/10.14778/1687553.1687567.

9. Garcia, Christopher. "Demystifying MapReduce." Procedia Computer Science 20 (2013): 484–89. http://dx.doi.org/10.1016/j.procs.2013.09.307.

10. Al-Badarneh, Amer, Amr Mohammad, and Salah Harb. "A Survey on MapReduce Implementations." International Journal of Cloud Applications and Computing 6, no. 1 (January 2016): 59–87. http://dx.doi.org/10.4018/ijcac.2016010104.

Abstract: As a distinguished and successful platform for parallel data processing, MapReduce is attracting significant momentum from both academia and industry as the volume of data to capture, transform, and analyse grows rapidly. Although MapReduce is used in many applications to analyse large-scale data sets, there is still a lot of debate among scientists and researchers on its efficiency, performance, and usability to support more classes of applications. This survey presents a comprehensive review of various implementations of the MapReduce framework. The authors first give an overview of the MapReduce programming model. They then present a broad description of various technical aspects of the most successful implementations of the MapReduce framework reported in the literature and discuss their main strengths and weaknesses. Finally, the authors conclude by introducing a comparison between MapReduce implementations and discussing open issues and challenges in enhancing MapReduce.


Dissertations / Theses on the topic "MapReduce"

1. Gault, Sylvain. "Improving MapReduce Performance on Clusters." Thesis, Lyon, École normale supérieure, 2015. http://www.theses.fr/2015ENSL0985/document.

Abstract: Nowadays, more and more scientific fields rely on data mining to produce new results. These raw data are produced at an increasing rate by various instruments: DNA sequencers in biology, the Large Hadron Collider (LHC) in physics, which produced 25 petabytes per year as of 2012, or the Large Synoptic Survey Telescope (LSST), which should produce 30 petabytes of data per night. High-resolution scanners in medical imaging and social network analysis also produce huge amounts of data. This data deluge raises several challenges in terms of storage and computer processing. In 2004, Google proposed the MapReduce model to distribute such computations across many machines. This thesis focuses mainly on improving the performance of a MapReduce environment. Easily replacing the software building blocks that performance improvements require calls for a modular and adaptable design, so a component-based approach is studied for building such a programming environment. Studying the performance of a MapReduce application also requires models of the platform, the application, and their performance; these models must be precise enough for the algorithms that use them to produce meaningful results, yet simple enough to be analyzed. A state of the art of existing models is given, and a new model matching the optimization needs is defined. The first optimization approach studied is a global optimization, which reduces computation time by up to 47%. The second approach focuses on the shuffle phase of MapReduce, in which every node may send data to every other node. Several algorithms are defined and studied for the case where the network is the bottleneck of the data transfers. These algorithms are evaluated on the Grid'5000 experimental platform and usually behave close to the lower bound, while the naive approach is far from it.

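The shuffle phase studied in this thesis is an all-to-all exchange in which every node may send data to every other node; when the network is the bottleneck, one classic scheduling strategy (shown below as an illustrative sketch, not the thesis's own algorithms) staggers the transfers in rounds so that each node sends to exactly one distinct receiver per round and no receiver is hit by several senders at once:

```python
def round_robin_shuffle_schedule(n_nodes):
    """Schedule all-to-all transfers in n-1 rounds: in round r, node i
    sends to node (i + r) mod n. Every round is a permutation, so no
    receiver becomes a hotspot, and all ordered pairs are covered."""
    rounds = []
    for r in range(1, n_nodes):
        rounds.append([(i, (i + r) % n_nodes) for i in range(n_nodes)])
    return rounds

for r, transfers in enumerate(round_robin_shuffle_schedule(4), start=1):
    print(f"round {r}: {transfers}")
# round 1: [(0, 1), (1, 2), (2, 3), (3, 0)]
# round 2: [(0, 2), (1, 3), (2, 0), (3, 1)]
# round 3: [(0, 3), (1, 0), (2, 1), (3, 2)]
```
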
2. Polo, Jordà. "Multi-constraint scheduling of MapReduce workloads." Doctoral thesis, Universitat Politècnica de Catalunya, 2014. http://hdl.handle.net/10803/276174.

Abstract: In recent years there has been an extraordinary growth of large-scale data processing and related technologies in both industry and academic communities. This trend is mostly driven by the need to explore the increasingly large amounts of information that global companies and communities are able to gather, and it has led to the introduction of new tools and models, most of which are designed around the idea of handling huge amounts of data. A good example of this trend towards improved large-scale data processing is MapReduce, a programming model intended to ease the development of massively parallel applications, which has been widely adopted to process large datasets thanks to its simplicity. While the MapReduce model was originally used primarily for batch data processing in large static clusters, nowadays it is mostly deployed along with other kinds of workloads in shared environments in which multiple users may be submitting concurrent jobs with completely different priorities and needs: from small, almost interactive executions to very long applications that take hours to complete. Scheduling and selecting tasks for execution is extremely relevant in MapReduce environments since it governs a job's opportunity to make progress and determines its performance. However, only basic primitives to prioritize between jobs are available at the moment, constantly causing either under- or over-provisioning, as the amount of resources needed to complete a particular job is not obvious a priori. This thesis aims to address both the lack of management capabilities and the increased complexity of the environments in which MapReduce is executed. To that end, new models and techniques are introduced in order to improve the scheduling of MapReduce in the presence of different constraints found in real-world scenarios, such as completion time goals, data locality, hardware heterogeneity, or availability of resources. The focus is on improving the integration of MapReduce with the computing infrastructures in which it usually runs, allowing alternative techniques for dynamic management and provisioning of resources. More specifically, it focuses on three scenarios of incremental scope. First, it studies the prospects of using high-level performance criteria to manage and drive the performance of MapReduce applications, taking advantage of the fact that MapReduce is executed in controlled environments in which the status of the cluster is known. Second, it examines the feasibility and benefits of making the MapReduce runtime more aware of the underlying hardware and the characteristics of applications. And finally, it also considers the interaction between MapReduce and other kinds of workloads, proposing new techniques to handle these increasingly complex environments. Following these three items, the thesis contributes to the management of MapReduce workloads by 1) proposing a performance model for MapReduce workloads and a scheduling algorithm that leverages the proposed model and is able to adapt to the various needs of its users in the presence of completion time constraints; 2) proposing a new resource model for MapReduce and a placement algorithm aware of the underlying hardware as well as the characteristics of the applications, capable of improving cluster utilization while still being guided by job performance metrics; and 3) proposing a model for shared environments in which MapReduce is executed along with other kinds of workloads such as transactional applications, and a scheduler aware of these workloads and their expected resource demand, capable of improving resource utilization across machines while observing completion time goals.

3. Nilsson, Johan. "Hadoop MapReduce in Eucalyptus Private Cloud." Thesis, Umeå universitet, Institutionen för datavetenskap, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-51309.

Abstract: This thesis investigates how a private cloud can be set up using the Eucalyptus Cloud system, along with its usability, requirements, and limitations as an open-source cloud platform providing private cloud solutions. It also studies whether running the MapReduce framework, through Apache Hadoop's implementation, on top of the private Eucalyptus cloud can provide near-linear scalability in terms of time and the number of virtual machines in the cluster. Analysis has shown that Eucalyptus is lacking in a few usability areas when setting up the cloud infrastructure, in terms of private networking and DNS lookups, yet the API that Eucalyptus provides brings benefits when migrating from public clouds like Amazon. The MapReduce framework shows an initially near-linear relation that declines as the number of virtual machines approaches the maximum of the cloud infrastructure.

4. Kloss, Fernando Cesar. "Motor de transformações baseado em Mapreduce." Repositório Institucional da UFPR, 2013. http://hdl.handle.net/1884/35083.

Abstract: The search for agility in the software development process has driven the growing adoption of model-driven technologies, paradigms, and approaches (Model-Driven Engineering). These solutions shift the focus from coding to modeling, where models are used to describe different aspects of a system at different levels of abstraction. A series of languages, standards, and tools have emerged to automate the construction and modification of models and thereby support the main operation performed in this setting: model transformation. The introduction of very large models into this context exposed a limitation of the methodology: its capacity to handle models of that size. Scalability problems arise when models on the order of thousands of elements are used in software development processes. Recent work on the scalability problem has explored and focused on different approaches such as model storage, fragmentation, and persistence, but little has been done regarding model transformation tools. Based on work from other domains, we developed a model transformation engine that runs in a distributed fashion in a cloud. The solution consists of adapting a model transformation tool for distributed execution through integration with MapReduce. Two architecturally distinct implementations are presented, one based on transformation rules and the other based on model transformation operations. The results obtained are promising, especially for the transformation of large and complex models.

5. Memon, Neelam. "Anonymizing large transaction data using MapReduce." Thesis, Cardiff University, 2016. http://orca.cf.ac.uk/97342/.

Abstract: Publishing transaction data is important to applications such as marketing research and biomedical studies. Privacy is a concern when publishing such data, since they often contain person-specific sensitive information. To address this problem, different data anonymization methods have been proposed. These methods have focused on protecting the associated individuals from different types of privacy leaks as well as preserving the utility of the original data. But all these methods are sequential and designed to process data on a single machine, hence not scalable to large datasets. Recently, MapReduce has emerged as a highly scalable platform for data-intensive applications. In this work, we consider how MapReduce may be used to provide scalability in large transaction data anonymization. More specifically, we consider how set-based generalization methods such as RBAT (Rule-Based Anonymization of Transaction data) may be parallelized using MapReduce. Set-based generalization methods have some desirable features for transaction anonymization, but their highly iterative nature makes parallelization challenging, and RBAT is a good representative of such methods. We propose a method for transaction data partitioning and representation, and we present two MapReduce-based parallelizations of RBAT. Our methods ensure scalability when the number of transaction records and the domain of items are large. Our preliminary results show that a direct parallelization of RBAT by partitioning data alone can result in significant overhead, which can offset the gains from parallel processing. We therefore propose MR-RBAT, which generalizes our direct parallel method and allows parallelization overhead to be controlled. Our experimental results show that MR-RBAT can scale linearly to large datasets and to the available resources while retaining good data utility.

6. Hammoud, Suhel. "MapReduce network enabled algorithms for classification based on association rules." Thesis, Brunel University, 2011. http://bura.brunel.ac.uk/handle/2438/5833.

Abstract: There is growing evidence that integrating classification and association rule mining can produce more efficient and accurate classifiers than traditional techniques. This thesis introduces a new MapReduce-based association rule miner for extracting strong rules from large datasets; this miner is later used to develop a new large-scale classifier. A new MapReduce simulator was also developed to evaluate the scalability of the proposed algorithms on MapReduce clusters. The developed associative rule miner inherits MapReduce's scalability to huge datasets and to thousands of processing nodes. For finding frequent itemsets, it uses a hybrid approach combining miners that use counting methods on horizontal datasets with miners that use set intersections on vertical-format datasets. The new miner generates the same rules usually generated by Apriori-like algorithms because it uses the same definitions of confidence and support thresholds. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR, MMAC, and others. This thesis also introduces a new MapReduce classifier based on MapReduce associative rule mining. This algorithm employs different approaches for rule discovery, rule ranking, rule pruning, rule prediction, and rule evaluation. The new classifier works on multi-class datasets and is able to produce multi-label predictions with probabilities for each predicted label. To evaluate the classifier, 20 different datasets from the UCI data collection were used. Results show that the proposed approach is an accurate and effective classification technique, highly competitive and scalable compared with other traditional and associative classification approaches. The MapReduce simulator measures the scalability of MapReduce-based applications easily and quickly and captures the behaviour of algorithms on cluster environments; this also allows optimizing the configuration of MapReduce clusters to get better execution times and hardware utilization.

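Since the miner keeps the usual support and confidence definitions, the distributed core of any such MapReduce rule miner is support counting: mappers emit (itemset, 1) for every candidate itemset contained in a transaction, and reducers sum the counts. A minimal sketch, simulating the map and reduce phases in-process and assuming per-transaction enumeration of size-k candidates rather than the thesis's hybrid horizontal/vertical method:

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
]

# Map: each transaction emits every size-k candidate itemset it contains.
def map_phase(transaction, k):
    for itemset in combinations(sorted(transaction), k):
        yield itemset, 1

# Shuffle + Reduce: sum the counts per itemset to obtain its support.
support = Counter()
for t in transactions:
    for itemset, one in map_phase(t, k=2):
        support[itemset] += one

min_support = 2
frequent = {s: c for s, c in support.items() if c >= min_support}
print(frequent)  # {('bread', 'milk'): 2, ('butter', 'milk'): 2}
```
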
7. Deolikar, Piyush P. "Lecture Video Search Engine Using Hadoop MapReduce." Thesis, California State University, Long Beach, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10638908.

Abstract: With the advent of the Internet and the ease of uploading video content to video libraries and social networking sites, the availability of video data increased very rapidly during this decade. Universities upload video tutorials in their online courses, and companies like Udemy, Coursera, Lynda, etc. have made video tutorials available over the Internet. We propose and implement a scalable solution that helps find relevant videos with respect to a query provided by the user. Our solution maintains an updated list of the videos available on the web and assigns each a rank according to its relevance. The proposed solution consists of three main components that interact with one another. The first component, called the crawler, continuously visits and locally stores the relevant information of all the webpages with videos available on the Internet; the crawler has several threads concurrently parsing webpages. The second component builds the inverted index of the web pages stored by the crawler; given a query, the inverted index is used to obtain the videos that contain the words in the query. The third component computes the rank of each video, which is then used to display the results in order of relevance. We implement a scalable solution in the Apache Hadoop framework. Hadoop is a distributed framework that provides a distributed file system able to handle large files as well as distributed computation among the participants.

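The inverted-index component described above is the textbook MapReduce example: map emits (word, document-id) pairs from each page, and reduce groups them into per-word posting lists that answer queries. A compact in-process sketch, assuming plain-text page bodies keyed by URL (the names and URLs below are illustrative, not the thesis's code):

```python
from collections import defaultdict

pages = {
    "http://example.edu/v1": "hadoop mapreduce lecture intro",
    "http://example.edu/v2": "mapreduce shuffle and sort lecture",
}

# Map: for each (url, text) record, emit one (word, url) pair per word.
mapped = [(word, url) for url, text in pages.items() for word in text.split()]

# Shuffle + Reduce: group by word; each word's values form its posting list.
index = defaultdict(set)
for word, url in mapped:
    index[word].add(url)

print(sorted(index["mapreduce"]))
# ['http://example.edu/v1', 'http://example.edu/v2']
print(sorted(index["shuffle"]))
# ['http://example.edu/v2']
```
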
8. Kolb, Lars. "Effiziente MapReduce-Parallelisierung von Entity Resolution-Workflows." Doctoral thesis, Universitätsbibliothek Leipzig, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-157163.

Abstract: In recent years, the newly emerged paradigm of Infrastructure as a Service has massively changed the IT world. The provision of computing infrastructure by external providers makes it possible to acquire, on demand and at short notice, large amounts of computing power, storage, and bandwidth without upfront investment. At the same time, both the amount of freely available data and the amount of data to be managed within companies are growing dramatically. The need to manage and analyze these data volumes efficiently required the further development of existing IT technologies and led to the emergence of new research fields and a multitude of innovative systems. A typical characteristic of these systems is distributed storage and data processing in large clusters built from commodity hardware. The MapReduce programming model in particular has gained importance over the past ten years: it enables the distributed processing of large data volumes and abstracts from the details of distributed computing and the handling of hardware failures. This dissertation focuses on using the MapReduce concept to automatically parallelize compute-intensive entity resolution tasks. Entity resolution is an important subfield of information integration whose goal is to discover records, in one or more data sources, that describe the same real-world object. The dissertation presents, step by step, methods that solve the various subproblems of MapReduce-based execution of entity resolution workflows. To detect duplicates, entity resolution methods usually compare pairs of records using several similarity measures. Evaluating the Cartesian product of n records leads to a quadratic complexity of O(n²) and is therefore practical only for small to medium-sized data sources; for sources with more than 100,000 records, runtimes of several hours arise even with distributed execution. So-called blocking techniques are therefore used to reduce the search space, under the assumption that records falling below a certain minimum similarity need not be compared with each other. The thesis presents a MapReduce-based implementation of the evaluation of the Cartesian product as well as of several well-known blocking methods. After the records have been compared, the candidate pairs are classified as match or non-match. With a growing number of attribute values and similarity measures, manually defining a high-quality strategy for combining the resulting similarity values becomes barely manageable, so the thesis also investigates the integration of machine learning methods into MapReduce-based entity resolution workflows. Implementing blocking with MapReduce requires partitioning the set of pairs to be compared and assigning the partitions to the available processes. The assignment is based on a semantic key derived from the records' attribute values according to the chosen blocking strategy; when deduplicating product records, for example, one might compare only products of the same manufacturer.

Processing all records with the same key in a single process leads, under data skew, to severe load-balancing problems, which the inherent quadratic complexity aggravates. This drastically reduces the runtime efficiency and scalability of the corresponding MapReduce programs, since most of a cluster's resources sit idle while a few processes must do most of the work. Providing methods for evenly utilizing the available resources is therefore another focus of the thesis. Blocking strategies must always trade efficiency against data quality: a large reduction of the search space promises a significant speedup but causes similar records, e.g. ones with erroneous attribute values, to escape comparison. It is therefore helpful to generate several semantic keys per record, derived from different attributes; this, however, causes similar records to be compared multiple times, redundantly, under different keys. The thesis therefore presents algorithms that avoid such redundant similarity computations. As the result of this work, the entity resolution framework Dedoop is presented, which abstracts from the developed MapReduce algorithms and enables the high-level specification of complex entity resolution workflows. Dedoop combines all the techniques and optimizations presented in this thesis in a user-friendly system. The prototype automatically translates user-defined workflows into a set of MapReduce jobs and manages their parallel execution on MapReduce clusters. Thanks to the full integration of the cloud services Amazon EC2 and Amazon S3 into Dedoop, and to its public availability, end users without MapReduce expertise can run complex entity resolution workflows on private or dynamically provisioned external MapReduce clusters.

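The blocking idea at the core of this workflow maps each record to a semantic key (the abstract's example: compare only products of the same manufacturer) so that the reduce phase compares pairs only within a block, shrinking the O(n²) Cartesian product. A minimal sketch of key-based pair generation, leaving out the thesis's load-balancing and redundancy-avoidance techniques:

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "manufacturer": "acme", "name": "Phone X"},
    {"id": 2, "manufacturer": "acme", "name": "Phone X dual sim"},
    {"id": 3, "manufacturer": "globex", "name": "Phone X"},
]

# Map: emit (blocking_key, record); here the key is the manufacturer.
blocks = defaultdict(list)
for rec in records:
    blocks[rec["manufacturer"]].append(rec)

# Reduce: compare record pairs only within a block, never across blocks.
candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # [(1, 2)] -- 1 pair instead of the 3 in the full product
```
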
9. Dyer, James. "Secure computation in the cloud using MapReduce." Thesis, University of Manchester, 2018. https://www.research.manchester.ac.uk/portal/en/theses/secure-computation-in-the-cloud-using-mapreduce(8f63dc8e-dc35-43ec-a083-9f3a6230c142).html.

Abstract: Processing large volumes of data has become increasingly important to businesses and government [218]. One of the popular tools used in processing large sets of data is MapReduce. As new MapReduce applications are developed to be deployed in untrusted environments such as public clouds, to process sensitive data, or to be deployed across data centres, well-understood security measures may be deployed, such as authentication and authorisation or encryption of messages passed between cluster nodes. However, there may be situations where authorised individuals cannot be trusted, such as "rogue" system administrators, or, in the cloud, where the MapReduce cluster nodes have been compromised. Where the input data is sensitive, such as medical data, we require a means to protect this data from exposure. Furthermore, we may need to protect the intermediate data and details of the computation from snoopers to prevent information leakage. To take full advantage of MapReduce cloud computing services, we require a means to process the data securely on such a platform; we designate such a computation secure computation in the cloud (SCC). SCC should not expose input or output data to any other party, including the cloud service provider, and the details of the computation should not allow any other party to deduce its inputs and outputs. Most importantly, we require the computations to be performed in practically reasonable time and space. The ability to perform MapReduce computations on encrypted data would offer a solution to these problems. However, this poses a significant problem in that many encryption schemes transform the original data in such a way that meaningful computation on the encrypted data is impossible. Our work aims to provide a practical SCC system, CryptMR, inspired by CryptDB [272]. Our solution details several novel cryptographic methods suitable for implementation in a secure-computation-in-the-cloud solution using MapReduce. We encrypt integer data using a novel somewhat homomorphic encryption (SHE) scheme, which can be made fully homomorphic. We also provide novel order-preserving encryption (OPE) and searchable symmetric encryption (SSE) for the purposes of sorting and searching. Our OPE scheme is, to our knowledge, the first to rely on a computationally hard primitive rather than a security model. We have implemented all of these encryption schemes and devised experiments to test their suitability for large-scale distributed computing by integrating them into Hadoop MapReduce applications. In addition to the work on encryption schemes, we provide a novel probabilistic method for verifying that mappers and reducers are correctly computing and reporting their results (see chapter 10). This work uses random sampling to minimise the probability that a cheating mapper or reducer can successfully report a false result. We evaluated our proof-of-concept implementation in a small-scale cloud environment. Our results show that sampling 5 to 10% of intermediate or final data allows us to detect cheating with very strong probability.

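The sampling-based verification described in this abstract admits a simple back-of-the-envelope analysis: if a worker falsifies a fraction f of its outputs, a uniform sample of k recomputed results misses every falsified one with probability about (1-f)^k, so even modest samples catch cheats with high probability. The sketch below illustrates that principle only and is not the thesis's protocol:

```python
import random

def verify_by_sampling(inputs, reported, recompute, sample_rate=0.1):
    """Recompute a random sample of reported results; False on any mismatch."""
    k = max(1, int(len(inputs) * sample_rate))
    for key in random.sample(list(inputs), k):
        if recompute(key, inputs[key]) != reported[key]:
            return False  # caught a false report
    return True  # sample showed no discrepancy (cheating may still hide)

# A mapper that should square its input but falsifies 20% of its outputs:
inputs = {i: i for i in range(100)}
reported = {i: (i * i if i % 5 else -1) for i in range(100)}
print(verify_by_sampling(inputs, reported, recompute=lambda k, v: v * v))
# Prints False roughly 9 runs in 10: a 10% sample misses 20% cheating
# with probability about (1 - 0.2) ** 10, i.e. around 0.1
```
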

Books on the topic "MapReduce"

1. Lin, Jimmy. Data-intensive text processing with MapReduce. [San Rafael, Calif.]: Morgan & Claypool Publishers, 2010.

2. Lin, Jimmy, and Chris Dyer. Data-Intensive Text Processing with MapReduce. Cham: Springer International Publishing, 2010. http://dx.doi.org/10.1007/978-3-031-02136-7.

3. Writing and querying MapReduce views in CouchDB. Sebastopol, Calif.: O'Reilly Media, 2011.

4. Vavilapalli, Vinod Kumar, Doug Eadline, Joseph Niemiec, and Jeff Markham, eds. Apache Hadoop YARN: Moving beyond MapReduce and batch processing with Apache Hadoop 2. Upper Saddle River, NJ: Addison-Wesley, 2014.

5. Kaur, Gurpreet. MapReduce: Introduction. Independently Published, 2022.

6. Hadoop MapReduce Cookbook. Packt Publishing, 2013.

7. MapReduce Design Patterns. O'Reilly Media, 2012.

8. Enterprise Hadoop and MapReduce. Pearson Education, Limited, 2025.

9. Tannir, Khaled. Optimizing Hadoop for MapReduce. Packt Publishing, 2014.

10. Chalkiopoulos, Antonios. Programming MapReduce with Scalding. Packt Publishing, Limited, 2014.

Book chapters on the topic "MapReduce"

1. Wayne, Hillel. "MapReduce." In Practical TLA+, 167–97. Berkeley, CA: Apress, 2018. http://dx.doi.org/10.1007/978-1-4842-3829-5_11.

2. Huang, Qunying. "MapReduce." In Encyclopedia of GIS, 1–7. Cham: Springer International Publishing, 2016. http://dx.doi.org/10.1007/978-3-319-23519-6_1608-1.

3. Wu, Sai. "MapReduce." In Encyclopedia of Database Systems, 1–5. New York, NY: Springer New York, 2016. http://dx.doi.org/10.1007/978-1-4899-7993-3_80802-1.

4. Nita, Stefania Loredana, and Marius Mihailescu. "MapReduce." In Practical Concurrent Haskell, 237–45. Berkeley, CA: Apress, 2017. http://dx.doi.org/10.1007/978-1-4842-2781-7_16.

5. Huang, Qunying. "MapReduce." In Encyclopedia of GIS, 1170–76. Cham: Springer International Publishing, 2017. http://dx.doi.org/10.1007/978-3-319-17885-1_1608.

6. Padua, David, Amol Ghoting, John A. Gunnels, Mark S. Squillante, José Meseguer, James H. Cownie, Duncan Roweth, et al. "MapReduce." In Encyclopedia of Parallel Computing, 1089. Boston, MA: Springer US, 2011. http://dx.doi.org/10.1007/978-0-387-09766-4_20000.

7. Wu, Sai. "MapReduce." In Encyclopedia of Database Systems, 2206–10. New York, NY: Springer New York, 2018. http://dx.doi.org/10.1007/978-1-4614-8265-9_80802.

8. Vassilvitskii, Sergei. "MapReduce Algorithmics." In Lecture Notes in Computer Science, 524. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. http://dx.doi.org/10.1007/978-3-642-40104-6_45.

9. Miličić, Dejan. "MapReduce Indexes." In Introducing RavenDB, 141–64. Berkeley, CA: Apress, 2022. http://dx.doi.org/10.1007/978-1-4842-8919-8_6.

10. Williams, Andreas, Pavlos Mitsoulis-Ntompos, and Damianos Chatziantoniou. "Tagged MapReduce: Efficiently Computing Multi-analytics Using MapReduce." In Data Warehousing and Knowledge Discovery, 240–51. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011. http://dx.doi.org/10.1007/978-3-642-23544-3_18.

Conference papers on the topic "MapReduce"

1. Ferrera, Pedro, Ivan de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, and Giovanna Di Marzo Serugendo. "Tuple MapReduce: Beyond Classic MapReduce." In 2012 IEEE 12th International Conference on Data Mining (ICDM). IEEE, 2012. http://dx.doi.org/10.1109/icdm.2012.141.

2. Lahmer, Ibrahim, and Ning Zhang. "MapReduce." In the 7th International Conference. New York, New York, USA: ACM Press, 2014. http://dx.doi.org/10.1145/2659651.2659722.

3. Langhans, Philipp, Christoph Wieser, and François Bry. "Crowdsourcing MapReduce." In the 22nd International Conference. New York, New York, USA: ACM Press, 2013. http://dx.doi.org/10.1145/2487788.2487915.

4. Chen, Rong, Haibo Chen, and Binyu Zang. "Tiled-MapReduce." In the 19th international conference. New York, New York, USA: ACM Press, 2010. http://dx.doi.org/10.1145/1854273.1854337.

5. Malewicz, Greg. "Beyond MapReduce." In the second international workshop. New York, New York, USA: ACM Press, 2011. http://dx.doi.org/10.1145/1996092.1996098.

6. Ullman, Jeff. "MapReduce Algorithms." In CODS-IKDD '15: 2nd IKDD Conference on Data Sciences. New York, NY, USA: ACM, 2015. http://dx.doi.org/10.1145/2778865.2778866.

7. Mantha, Pradeep Kumar, Andre Luckow, and Shantenu Jha. "Pilot-MapReduce." In the third international workshop. New York, New York, USA: ACM Press, 2012. http://dx.doi.org/10.1145/2287016.2287020.

8. Li, Songze, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr. "Coded MapReduce." In 2015 53rd Annual Allerton Conference on Communication, Control and Computing (Allerton). IEEE, 2015. http://dx.doi.org/10.1109/allerton.2015.7447112.

9. Martha, V. S., Weizhong Zhao, and Xiaowei Xu. "h-MapReduce: A Framework for Workload Balancing in MapReduce." In 2013 IEEE 27th International Conference on Advanced Information Networking and Applications (AINA). IEEE, 2013. http://dx.doi.org/10.1109/aina.2013.48.

10. Tao, Yufei, Wenqing Lin, and Xiaokui Xiao. "Minimal MapReduce algorithms." In the 2013 international conference. New York, New York, USA: ACM Press, 2013. http://dx.doi.org/10.1145/2463676.2463719.

Reports on the topic "MapReduce"

1. Troisi, Louis R. Clustering Systems with Kolmogorov Complexity and MapReduce. Fort Belvoir, VA: Defense Technical Information Center, June 2011. http://dx.doi.org/10.21236/ada547540.

2. Chen, Yanpei, Sara Alspaugh, and Randy H. Katz. Design Insights for MapReduce from Diverse Production Workloads. Fort Belvoir, VA: Defense Technical Information Center, January 2012. http://dx.doi.org/10.21236/ada555881.

3. Chen, Yanpei, Sara Alspaugh, and Randy H. Katz. Interactive Query Processing in Big Data Systems: A Cross Industry Study of MapReduce Workloads. Fort Belvoir, VA: Defense Technical Information Center, April 2012. http://dx.doi.org/10.21236/ada561769.