Dissertations / Theses on the topic 'Hadoop'

Consult the top 50 dissertations / theses for your research on the topic 'Hadoop.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Raja, Anitha. "A Coordination Framework for Deploying Hadoop MapReduce Jobs on Hadoop Cluster." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-196951.

Full text
Abstract:
Apache Hadoop is an open source framework that delivers reliable, scalable, and distributed computing. Hadoop services are provided for distributed data storage, data processing, data access, and security. MapReduce is the heart of the Hadoop framework and was designed to process vast amounts of data distributed over a large number of nodes. MapReduce has been used extensively to process structured and unstructured data in diverse fields such as e-commerce, web search, social networks, and scientific computation. Understanding the characteristics of Hadoop MapReduce workloads is the key to achieving improved configurations and refining system throughput. Thus far, MapReduce workload characterization in a large-scale production environment has not been well studied. In this thesis project, the focus is mainly on composing a Hadoop cluster (as an execution environment for data processing) to analyze two types of Hadoop MapReduce (MR) jobs via a proposed coordination framework. This coordination framework is referred to as a workload translator. The outcome of this work includes: (1) a parametric workload model for the target MR jobs, (2) a cluster specification to develop an improved cluster deployment strategy using the model and coordination framework, and (3) better scheduling and hence better performance of jobs (i.e., shorter job completion time). We implemented a prototype of our solution using Apache Tomcat on (OpenStack) Ubuntu Trusty Tahr, which uses RESTful APIs to (1) create a Hadoop cluster (version 2.7.2) and (2) scale up and scale down the number of workers in the cluster. The experimental results showed that with well-tuned parameters, MR jobs can achieve a reduction in the job completion time and improved utilization of the hardware resources. The target audience for this thesis is developers. As future work, we suggest adding additional parameters to develop a more refined workload model for MR and similar jobs.
Apache Hadoop är ett öppen källkods system som levererar pålitlig, skalbar och distribuerad användning. Hadoop tjänster hjälper med distribuerad data förvaring, bearbetning, åtkomst och trygghet. MapReduce är en viktig del av Hadoop system och är designad att bearbeta stora data mängder och även distribuerad i flera leder. MapReduce är använt extensivt inom bearbetning av strukturerad och ostrukturerad data i olika branscher bl. a e-handel, webbsökning, sociala medier och även vetenskapliga beräkningar. Förståelse av MapReduces arbetsbelastningar är viktiga att få förbättrad konfigurationer och resultat. Men, arbetsbelastningar av MapReduce inom massproduktions miljö var inte djup-forskat hittills. I detta examensarbete, är en hel del fokus satt på ”Hadoop cluster” (som en utförande miljö i data bearbetning) att analysera två typer av Hadoop MapReduce (MR) arbeten genom ett tilltänkt system. Detta system är refererad som arbetsbelastnings översättare. Resultaten från denna arbete innehåller: (1) en parametrisk arbetsbelastningsmodell till inriktad MR arbeten, (2) en specifikation att utveckla förbättrad kluster strategier med båda modellen och koordinations system, och (3) förbättrad planering och arbetsprestationer, d.v.s kortare tid att utföra arbetet. Vi har realiserat en prototyp med Apache Tomcat på (OpenStack) Ubuntu Trusty Tahr som använder RESTful API (1) att skapa ”Hadoop cluster” version 2.7.2 och (2) att båda skala upp och ner antal medarbetare i kluster. Forskningens resultat har visat att med vältrimmad parametrar, kan MR arbete nå förbättringar dvs. sparad tid vid slutfört arbete och förbättrad användning av hårdvara resurser. Målgruppen för denna avhandling är utvecklare. I framtiden, föreslår vi tilläggning av olika parametrar att utveckla en allmän modell för MR och liknande arbeten.
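To make the parameter tuning mentioned in the abstract more concrete, the sketch below shows how MapReduce job settings are commonly supplied through the standard Hadoop 2.x configuration API. It is a generic illustration rather than code from the thesis: the memory, sort-buffer and reducer values are placeholder assumptions, and the identity Mapper and Reducer stand in for a real workload.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TunedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Standard Hadoop 2.x per-task settings; the values are placeholders, not tuned results.
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        conf.setInt("mapreduce.task.io.sort.mb", 256);

        Job job = Job.getInstance(conf, "tuned-mr-job");
        job.setJarByClass(TunedJobDriver.class);
        job.setMapperClass(Mapper.class);      // identity mapper, stands in for a real workload
        job.setReducerClass(Reducer.class);    // identity reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(8);              // reducer parallelism is another common tuning knob

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```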
2

Savvidis, Evangelos. "Searching Metadata in Hadoop." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177467.

Full text
Abstract:
The rapid expansion of the Internet has led to the Big Data era. Companies that provide services which deal with Big Data face two major issues: i) storing petabytes of data and ii) manipulating this data. On the one hand, the open source Hadoop ecosystem, and particularly its distributed file system HDFS, takes care of the former issue by providing persistent storage for unprecedented amounts of data. For the latter, there are many approaches when it comes to data analytics, from MapReduce jobs to information retrieval and data discovery. This thesis provides a novel approach to information discovery, firstly by providing the means to create, manage and associate metadata with HDFS files, and secondly by searching for files through their metadata using Elasticsearch. The work is composed of three parts. The first one is the metadata designer/manager, which is the AngularJS front end. The second part is the J2EE back end, which enables the front end to perform all the managing actions on metadata using WebSockets. The third part is the indexing of data into Elasticsearch, the distributed and scalable open source search engine. Our work has shown that this approach works and that it greatly helps in finding information in the vast sea of data in HDFS.
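As a rough illustration of the indexing step the abstract describes, the sketch below pushes one metadata document into Elasticsearch over its REST API using only the Java standard library. The endpoint, index and type names, and the document fields are assumptions for illustration; the thesis's actual AngularJS/J2EE code and mapping are not reproduced here.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class MetadataIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical metadata document and index/type names; the real mapping would differ.
        String doc = "{\"path\":\"/user/alice/data.csv\",\"owner\":\"alice\",\"tags\":[\"csv\",\"sales\"]}";
        URL url = new URL("http://localhost:9200/hdfs-metadata/file/1");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(doc.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Elasticsearch responded with HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```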
3

Bux, Marc Nicolas. "Scientific Workflows for Hadoop." Doctoral thesis, Humboldt-Universität zu Berlin, 2018. http://dx.doi.org/10.18452/19321.

Full text
Abstract:
Scientific Workflows bieten flexible Möglichkeiten für die Modellierung und den Austausch komplexer Arbeitsabläufe zur Analyse wissenschaftlicher Daten. In den letzten Jahrzehnten sind verschiedene Systeme entstanden, die den Entwurf, die Ausführung und die Verwaltung solcher Scientific Workflows unterstützen und erleichtern. In mehreren wissenschaftlichen Disziplinen wachsen die Mengen zu verarbeitender Daten inzwischen jedoch schneller als die Rechenleistung und der Speicherplatz verfügbarer Rechner. Parallelisierung und verteilte Ausführung werden häufig angewendet, um mit wachsenden Datenmengen Schritt zu halten. Allerdings sind die durch verteilte Infrastrukturen bereitgestellten Ressourcen häufig heterogen, instabil und unzuverlässig. Um die Skalierbarkeit solcher Infrastrukturen nutzen zu können, müssen daher mehrere Anforderungen erfüllt sein: Scientific Workflows müssen parallelisiert werden. Simulations-Frameworks zur Evaluation von Planungsalgorithmen müssen die Instabilität verteilter Infrastrukturen berücksichtigen. Adaptive Planungsalgorithmen müssen eingesetzt werden, um die Nutzung instabiler Ressourcen zu optimieren. Hadoop oder ähnliche Systeme zur skalierbaren Verwaltung verteilter Ressourcen müssen verwendet werden. Diese Dissertation präsentiert neue Lösungen für diese Anforderungen. Zunächst stellen wir DynamicCloudSim vor, ein Simulations-Framework für Cloud-Infrastrukturen, welches verschiedene Aspekte der Variabilität adäquat modelliert. Im Anschluss beschreiben wir ERA, einen adaptiven Planungsalgorithmus, der die Ausführungszeit eines Scientific Workflows optimiert, indem er Heterogenität ausnutzt, kritische Teile des Workflows repliziert und sich an Veränderungen in der Infrastruktur anpasst. Schließlich präsentieren wir Hi-WAY, eine Ausführungsumgebung die ERA integriert und die hochgradig skalierbare Ausführungen in verschiedenen Sprachen beschriebener Scientific Workflows auf Hadoop ermöglicht.
Scientific workflows provide a means to model, execute, and exchange the increasingly complex analysis pipelines necessary for today's data-driven science. Over the last decades, scientific workflow management systems have emerged to facilitate the design, execution, and monitoring of such workflows. At the same time, the amounts of data generated in various areas of science outpaced hardware advancements. Parallelization and distributed execution are generally proposed to deal with increasing amounts of data. However, the resources provided by distributed infrastructures are subject to heterogeneity, dynamic performance changes at runtime, and occasional failures. To leverage the scalability provided by these infrastructures despite the observed aspects of performance variability, workflow management systems have to progress: Parallelization potentials in scientific workflows have to be detected and exploited. Simulation frameworks, which are commonly employed for the evaluation of scheduling mechanisms, have to consider the instability encountered on the infrastructures they emulate. Adaptive scheduling mechanisms have to be employed to optimize resource utilization in the face of instability. State-of-the-art systems for scalable distributed resource management and storage, such as Apache Hadoop, have to be supported. This dissertation presents novel solutions for these aspirations. First, we introduce DynamicCloudSim, a cloud computing simulation framework that is able to adequately model the various aspects of variability encountered in computational clouds. Secondly, we outline ERA, an adaptive scheduling policy that optimizes workflow makespan by exploiting heterogeneity, replicating bottlenecks in workflow execution, and adapting to changes in the underlying infrastructure. Finally, we present Hi-WAY, an execution engine that integrates ERA and enables the highly scalable execution of scientific workflows written in a number of languages on Hadoop.
4

Wu, Yuanyuan. "HADOOP-EDF: LARGE-SCALE DISTRIBUTED PROCESSING OF ELECTROPHYSIOLOGICAL SIGNAL DATA IN HADOOP MAPREDUCE." UKnowledge, 2019. https://uknowledge.uky.edu/cs_etds/88.

Full text
Abstract:
A rapidly growing volume of electrophysiological signals is being generated for clinical research in neurological disorders. European Data Format (EDF) is a standard format for storing electrophysiological signals. However, the bottleneck of existing signal analysis tools for handling large-scale datasets is the sequential way of loading large EDF files before performing an analysis. To overcome this, we develop Hadoop-EDF, a distributed signal processing tool to load EDF data in a parallel manner using Hadoop MapReduce. Hadoop-EDF uses a robust data partition algorithm that makes EDF data processable in parallel. We evaluate Hadoop-EDF's scalability and performance by leveraging two datasets from the National Sleep Research Resource and running experiments on Amazon Web Services clusters. On a 20-node cluster, Hadoop-EDF improves performance 27-fold and 47-fold over sequential processing of 200 small files and 200 large files, respectively. The results demonstrate that Hadoop-EDF is more suitable and effective in processing large EDF files.
5

Büchler, Peter. "Indexing Genomic Data on Hadoop." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177298.

Full text
Abstract:
In recent years Hadoop has been used as a standard backend for big data applications. Its best-known application, MapReduce, provides a powerful parallel programming paradigm. Big companies storing petabytes of data, like Facebook and Yahoo, have deployed their own Hadoop distributions for data analytics, interactive services, etc. Nevertheless, the simplicity of MapReduce's map stage always leads to a full scan of the input data and thus potentially wastes resources. Recently, new sources of big data, e.g. the 4K video format or genomic data, have appeared. Genomic data in its raw file format (FastQ) can take up to hundreds of gigabytes per file. Simply using MapReduce for a population analysis would easily end up as a full data scan over terabytes of data. Obviously there is a need for more efficient ways of accessing the data by reducing the amount of data considered for the computation. Existing approaches introduce indexing structures into their respective Hadoop distributions. While some of them are specifically made for certain data structures, e.g. key-value pairs, others strongly depend on the existence of a MapReduce framework. To overcome these problems we integrated an indexing structure into Hadoop's file system, the Hadoop Distributed File System (HDFS), working independently of MapReduce. This structure supports the definition of custom input formats and individual indexing strategies. The building of an index is integrated into the file-writing process and is independent of software working in higher layers of Hadoop. As a proof of concept, though, MapReduce has been enabled to make use of these indexing structures by simply adding a new parameter to its job definition. A prototype and its evaluation show the advantages of using these structures, with genomic data (FastQ and SAM files) as a use case.
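The underlying idea, avoiding map tasks over data that cannot contribute to a result, can be approximated at file granularity with a standard Hadoop input path filter, sketched below. This is only a coarse stand-in for the block-level HDFS index the thesis proposes; the .fastq naming convention and a flat input directory are assumptions.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Accepts only files that follow an assumed .fastq naming convention, so the job never
// schedules map tasks for anything else. Assumes a flat input directory of data files.
public class FastqOnlyFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        return path.getName().endsWith(".fastq");
    }
}

// In the job driver, the filter is registered on the input format:
//   org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPathFilter(job, FastqOnlyFilter.class);
```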
6

Schätzle, Alexander [Verfasser], and Georg [Akademischer Betreuer] Lausen. "Distributed RDF Querying on Hadoop." Freiburg : Universität, 2017. http://d-nb.info/1128574187/34.

Full text
7

Tabatabaei, Mahsa. "Evaluation of Security in Hadoop." Thesis, KTH, Kommunikationsnät, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-160269.

Full text
Abstract:
There are different ways to store and process large amounts of data. Hadoop is widely used and is one of the most popular platforms for storing huge amounts of data and processing them in parallel. When storing sensitive data, security plays an important role in keeping it safe. Security was not much considered when Hadoop was initially designed. The initial use of Hadoop was managing large amounts of public web data, so confidentiality of the stored data was not an issue. Initially, users and services in Hadoop were not authenticated; Hadoop is designed to run code on a distributed cluster of machines, so without proper authentication anyone could submit code and it would be executed. Different projects have started to improve the security of Hadoop. Two of these projects are called Project Rhino and Project Sentry [1]. Project Rhino implements a splittable crypto codec to provide encryption for the data that is stored in the Hadoop distributed file system. It also develops centralized authentication by implementing Hadoop single sign-on, which prevents repeated authentication of users accessing the same services many times. From the authorization point of view, Project Rhino provides cell-based authorization for HBase [2]. Project Sentry provides fine-grained access control by supporting role-based authorization, to which different services can be bound to provide authorization for their users [3]. It is possible to combine the security enhancements made in Project Rhino and Project Sentry to further improve performance and provide better mechanisms to secure Hadoop. In this thesis, the security of the system in Hadoop version 1 and Hadoop version 2 is evaluated and different security enhancements are proposed, considering the security improvements made by the two aforementioned projects, Project Rhino and Project Sentry, in terms of encryption, authentication, and authorization. This thesis suggests some high-level security improvements on the centralized authentication system (Hadoop single sign-on) implementation made by Project Rhino.
8

Дикий, В. С. "Сутність та особливості використання Hadoop." Thesis, Київський національний універститет технологій та дизайну, 2017. https://er.knutd.edu.ua/handle/123456789/10420.

Full text
9

Brotánek, Jan. "Apache Hadoop jako analytická platforma." Master's thesis, Vysoká škola ekonomická v Praze, 2017. http://www.nusl.cz/ntk/nusl-358801.

Full text
Abstract:
This diploma thesis focuses on integrating the Hadoop platform into a current data warehouse architecture. In the theoretical part, the properties of Big Data are described together with methods and models for processing it. The Hadoop framework, its components, and its distributions are discussed. Moreover, components which enable end users, developers, and analysts to access a Hadoop cluster are described. The practical part of the thesis discusses a case study of batch data extraction from the current data warehouse on the Oracle platform with the aid of the Sqoop tool, the transformation of the data in relational structures of the Hive component, and uploading it back to the original source. Compression of data and the efficiency of queries depending on various storage formats are also discussed. Quality and consistency of the manipulated data are checked during all phases of the process. Part of the practical section discusses ways of storing and capturing stream data. For this purpose the Flume tool is used to capture stream data, which is further transformed with the Pig tool. The purpose of implementing the process is to move part of the data and its processing from the current data warehouse to the Hadoop cluster. Therefore, a process for integrating the current data warehouse with the Hortonworks Data Platform and its components was designed.
10

Nilsson, Johan. "Hadoop MapReduce in Eucalyptus Private Cloud." Thesis, Umeå universitet, Institutionen för datavetenskap, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-51309.

Full text
Abstract:
This thesis investigates how a private cloud can be set up using the Eucalyptus Cloud system, along with its usability, requirements, and limitations as an open-source cloud platform providing private cloud solutions. It also studies whether using the MapReduce framework, through Apache Hadoop's implementation on top of the private Eucalyptus cloud, can provide near-linear scalability in terms of time and the number of virtual machines in the cluster. Analysis has shown that Eucalyptus is lacking in a few usability areas when setting up the cloud infrastructure, in terms of private networking and DNS lookups, yet the API that Eucalyptus provides gives benefits when migrating from public clouds like Amazon. The MapReduce framework shows an initially near-linear relation which declines as the number of virtual machines approaches the maximum of the cloud infrastructure.
11

Johannsen, Fabian, and Mattias Hellsing. "Hadoop Read Performance During Datanode Crashes." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-130466.

Full text
Abstract:
This bachelor thesis evaluates the impact of datanode crashes on the performance of the read operations of a Hadoop Distributed File System, HDFS. The goal is to better understand how datanode crashes, as well as certain parameters, affect the performance of the read operation by looking at the execution time of the get command. The parameters used are the number of crashed nodes, block size, and file size. By setting up a Linux test environment with ten virtual machines running Hadoop and running tests on it, data has been collected in order to answer these questions. From this data the average execution time and standard deviation of the get command were calculated. The network activity during the tests was also measured. The results showed that neither the number of crashed nodes nor the block size had any significant effect on the execution time. They also demonstrated that the execution time of the get command was not directly proportional to the size of the fetched file: a four times larger file sometimes resulted in more than four times the execution time, up to 4.5 times as long. However, the consequences of a datanode crash while fetching a small file appear to be much greater than with a large file. The average execution time increased by up to 36% when a large file was fetched, but by as much as 85% when fetching a small file.
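For context, a read measurement of this kind can be reproduced with the standard HDFS Java API. The sketch below times a single file fetch in roughly the way the get command is timed; the namenode address and paths are placeholders, and it is not the authors' test harness.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsGetTimer {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode address and paths; replace with the cluster's real values.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        long start = System.nanoTime();
        // Programmatic equivalent of 'hdfs dfs -get <src> <localDst>'.
        fs.copyToLocalFile(new Path("/benchmarks/input/file1.bin"), new Path("/tmp/file1.bin"));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("get took " + elapsedMs + " ms");
        fs.close();
    }
}
```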
12

Ergenekon, Emre Berge, and Petter Eriksson. "Big Data Archiving with Splunk and Hadoop." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-137374.

Full text
Abstract:
Splunk is a software product that handles large amounts of data every day. With data constantly growing, there is a need to phase out old data to keep the software from running slowly. However, some of Splunk's customers have retention policies that require the data to be stored longer than Splunk can offer. This thesis investigates how to create a solution for archiving large amounts of data. We present the problems with archiving data, the properties of the data we are archiving, and the types of file systems suitable for archiving. By carefully considering data safety and reliability, and by using the Apache Hadoop project to support multiple distributed file systems, we create a flexible, reliable, and scalable archiving solution.
Splunk är en mjukvara som hanterar stora mängder data varje dag. Eftersom datavolymen ökar med tiden, finns det ett behov att flytta ut gammalt data från programmet så att det inte blir segt. Men vissa av Spunks kunder har datalagringspolicies som kräver att datat lagras längre än vad Splunk kan erbjuda. Denna rapport undersöker hur man kan lagra stora mängder data. Vi presenterar problemen som finns med att arkivera data, egenskaperna av datat som ska arkiveras och typer av filsystem som passar för arkivering. Vi skapar en flexibel, tillförlitlig och skalbar lösning för arkivering genom att noga studera datasäkerhet, tillförlitlighet och genom att använda Apache Hadoop för att stödja flera distribuerade filsystem.
13

Cassales, Guilherme Weigert. "Escalonamento adaptativo para o Apache Hadoop." Universidade Federal de Santa Maria, 2016. http://repositorio.ufsm.br/handle/1/12025.

Full text
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES
Many alternatives have been employed in order to process, in a timely manner, all the data generated by current applications. One of these alternatives, Apache Hadoop, combines parallel and distributed processing with the MapReduce paradigm in order to provide an environment that is able to process a huge data volume using a simple programming model. However, Apache Hadoop was designed for dedicated and homogeneous clusters, a limitation that creates challenges for those who wish to use the framework in other circumstances. Often, acquiring a dedicated cluster can be impracticable due to the cost, and the acquisition of replacement parts can be a threat to the homogeneity of a cluster. In these cases, an option commonly used by companies is to use the idle computing resources in their network; however, the original distribution of Hadoop would show serious performance issues in these conditions. Thus, this study aimed to improve Hadoop's capacity to adapt to pervasive and shared environments, where the availability of resources undergoes variations during execution. To this end, context-awareness techniques were used to collect information about the available capacity on each worker node, and distributed communication techniques were used to update this information on the scheduler. The joint use of both techniques aimed at minimizing and/or eliminating the overload that would occur on shared nodes, resulting in an improvement of up to 50% in performance on a shared cluster when compared to the original distribution, and indicated that a simple solution can positively impact scheduling, increasing the variety of environments where the use of Hadoop is possible.
Diversas alternativas têm sido empregadas para o processamento, em tempo hábil, da grande quantidade de dados que é gerada pelas aplicações atuais. Uma destas alternativas, o Apache Hadoop, combina processamento paralelo e distribuído com o paradigma MapReduce para fornecer um ambiente capaz de processar um grande volume de informações através de um modelo de programação simplificada. No entanto, o Apache Hadoop foi projetado para utilização em clusters dedicados e homogêneos, uma limitação que gera desafios para aqueles que desejam utilizá-lo sob outras circunstâncias. Muitas vezes um cluster dedicado pode ser inviável pelo custo de aquisição e a homogeneidade pode ser ameaçada devido à dificuldade de adquirir peças de reposição. Em muitos desses casos, uma opção encontrada pelas empresas é a utilização dos recursos computacionais ociosos em sua rede, porém a distribuição original do Hadoop apresentaria sérios problemas de desempenho nestas condições. Sendo assim, este estudo propôs melhorar a capacidade do Hadoop em adaptar-se a ambientes, pervasivos e compartilhados, onde a disponibilidade de recursos sofrerá variações no decorrer da execução. Para tanto, utilizaram-se técnicas de sensibilidade ao contexto para coletar informações sobre a capacidade disponível nos nós trabalhadores e técnicas de comunicação distribuída para atualizar estas informações no escalonador. A utilização conjunta dessas técnicas teve como objetivo a minimização e/ou eliminação da sobrecarga que seria causada em nós com compartilhamento, resultando em uma melhora de até 50% no desempenho em um cluster compartilhado, quando comparado com a distribuição original, e indicou que uma solução simples pode impactar positivamente o escalonamento, aumentando a variedade de ambientes onde a utilização do Hadoop é possível.
14

Lorente, Leal Alberto. "KTHFS Orchestration : PaaS orchestration for Hadoop." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-128935.

Full text
Abstract:
Platform as a Service (PaaS) has had a huge impact on how we can offer easy and scalable software that adapts to the needs of its users. It has opened up the possibility of systems being capable of easily configuring themselves on customer demand. Based on these features, a large interest has emerged in offering virtualized Hadoop solutions based on Infrastructure as a Service (IaaS) architectures in order to easily deploy completely functional Hadoop clusters on platforms like Amazon EC2 or OpenStack. Throughout the thesis work, we studied the possibility of enhancing the capabilities of KTHFS, a modified Hadoop platform under development, to allow automatic configuration of a whole functional cluster on IaaS platforms. In order to achieve this, we study different proposals for similar PaaS platforms from companies like VMware or Amazon, analyze existing node orchestration techniques for configuring nodes at cloud providers like Amazon or OpenStack, and later automate this process. This is the starting point of the work, which leads to the development of our own orchestration language for KTHFS and two artifacts: (i) a simple web portal to launch the KTHFS Dashboard on the supported IaaS platforms, and (ii) an integrated component in the Dashboard in charge of analyzing a cluster definition file and initializing the configuration and deployment of a cluster using Chef. Lastly, we discover new issues related to scalability and performance when integrating the new components into the Dashboard. This forces us to analyze solutions for optimizing the performance of our deployment architecture, which allows us to reduce the deployment time by introducing a few modifications to the architecture. Finally, we conclude with a few words about ongoing and future work.
15

Čecho, Jaroslav. "Optimalizace platformy pro distribuované výpočty Hadoop." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2012. http://www.nusl.cz/ntk/nusl-236464.

Full text
Abstract:
This thesis focuses on possibilities for improving the Apache Hadoop framework by offloading some computation to a graphics card using the NVIDIA CUDA technology. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model called MapReduce. NVIDIA CUDA is a platform which allows one to use a graphics card for general computation. This thesis contains a description and experimental implementations of suitable computations inside the Hadoop framework that can benefit from being executed on a graphics card.
16

Gupta, Puja Makhanlal. "Characterization of Performance Anomalies in Hadoop." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1429878722.

Full text
17

Björk, Kim, and Jonatan Bodvill. "Data streaming in Hadoop : A STUDY OF REAL TIME DATA PIPELINE INTEGRATION BETWEEN HADOOP ENVIRONMENTS AND EXTERNAL SYSTEMS." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-186380.

Full text
Abstract:
The field of distributed computing is growing and quickly becoming a natural part of large as well as smaller enterprises' IT processes. Driving the progress are the cost-effectiveness of distributed systems compared to centralized options, the physical limitations of single machines, and reliability concerns. There are frameworks within the field which aim to create a standardized platform to facilitate the development and implementation of distributed services and applications. Apache Hadoop is one of those projects. Hadoop is a framework for distributed processing and data storage. It contains support for many different modules for different purposes, such as distributed database management, security, data streaming, and processing. In addition to offering storage much cheaper than traditional centralized relational databases, Hadoop supports powerful methods of handling very large amounts of data as it streams through and is stored on the system. These methods are widely used for all kinds of big data processing in large IT companies with a need for low-latency, high-throughput processing of data. More and more companies are looking towards implementing Hadoop in their IT processes. One of them is Unomaly, a company which offers agnostic, proactive anomaly detection. The anomaly detection system analyses system logs to detect discrepancies and is reliant on large amounts of data to build an accurate image of the target system. Integration with Hadoop would make it possible to consume very large amounts of data as it is streamed to Hadoop storage or other parts of the system. In this degree project an integration layer application has been developed to allow Hadoop integration with Unomaly's system. Research has been conducted throughout the project in order to determine the best way of implementing the integration. The first part of the result of the project is a proof-of-concept (PoC) application for real-time data pipelining between Hadoop clusters and the Unomaly system. The second part is a recommendation of how the integration should be designed, based on the studies conducted in the thesis work.
Distribuerade system blir allt vanligare inom både stora och små företags IT-system. Anledningarna till denna utveckling är kostnadseffektivitet, feltolerans och tekniska fysiska begränsningar på centraliserade system. Det finns ramverk inom området som ämnar att skapa en standardiserad plattform för att underlätta för utveckling och implementation av distribuerade tjänster och applikationer. Apache Hadoop är ett av dessa projekt. Hadoop är ett ramverk för distribuerade beräkningar och distribuerad datalagring. Hadoop har stöd för många olika moduler med olika syften, t.ex. för hantering av distribuerade databaser, datasäkerhet, dataströmmning och beräkningar. Utöver att erbjuda mycket billigare lagring än centraliserade alternativ så erbjuder Hadoop kraftulla sätt att hantera väldigt stora mängder data när den strömmas genom, och lagras på, systemet. Dessa metoder används för en stor mängd olika syften på IT-företag som har ett behov av snabb och kraftfull datahantering. Fler och fler företag implementerar Hadoop i sina IT-processer. Ett av dessa företag är Unomaly. Unomaly är företag som erbjuder generisk, förebyggande avvikelsedetektering. Deras system fungerar genom att aggregera stora volymer systemloggar från godtyckliga ITsystem. Avvikelsehanteringssystemet är beroende av stora mängder loggar för att kunna bygga upp en korrekt bild av värdsystemet. Integration med Hadoop skulle låta Unomaly konsumera väldigt stora mängder loggdata när den strömmar genom värdsystemets Hadooparkitektur. I dettta kandidatexamensarbete har ett integrationslager mellan Hadoop och Unomalys avvikelsehanteringssystem utvecklats. Studier har också gjorts för att identifiera den bästa lösningen för integraion mellan avvikelsehanteringssystem och Hadoop Arbetet har resulterat i en applikationsprototyp som erbjuder realtids datatransportering mellan Hadoop och Unomalys system. Arbetet har även resulterat i en studie som diskuterar det bästa tillvägagångsättet för hur en integration av detta slag ska implementeras.
18

Lopes, Bezerra Aprigio Augusto. "Planificación de trabajos en clusters hadoop compartidos." Doctoral thesis, Universitat Autònoma de Barcelona, 2015. http://hdl.handle.net/10803/285573.

Full text
Abstract:
La industria y los científicos han buscado alternativas para procesar con eficacia el gran volumen de datos que se generan en diferentes áreas del conocimiento. MapReduce se presenta como una alternativa viable para el procesamiento de aplicaciones intensivas de datos. Los archivos de entrada se dividen en bloques más pequeños. Posteriormente, se distribuyen y se almacenan en los nodos donde serán procesados. Entornos Hadoop han sido utilizados para ejecutar aplicaciones MapReduce. Hadoop realiza automáticamente la división y distribución de los archivos de entrada, la división del trabajo en tareas Map y Reduce, la planificación de tareas entre los nodos, el control de fallos de nodos; y gestiona la necesidad de comunicación entre los nodos del cluster. Sin embargo, algunas aplicaciones MapReduce tienen un conjunto de características que no permiten que se beneficien plenamente de las políticas de planificación de tareas construídas para Hadoop. Los archivos de entrada compartidos entre múltiples trabajos y aplicaciones con grandes volúmenes de datos intermedios son las características de las aplicaciones que manejamos en nuestra investigación. El objetivo de nuestro trabajo es implementar una nueva política de planificación de trabajos que mejore el tiempo de makespan de lotes de trabajos Hadoop de dos maneras: en un nivel macro (nivel de planificación de trabajos), agrupar los trabajos que comparten los mismos archivos de entrada y procesarlos en lote; y en un nivel micro (nivel de planificación de tareas) las tareas de los diferentes trabajos procesados en el mismo lote, que manejan los mismos bloques de datos, se agrupan para ser ejecutas en el mismo nodo donde se asignó el bloque. La política de planificación de trabajos almacena los archivos compartidos de entrada y los datos intermedios en una RAMDISK, durante el procesamiento de cada lote.
Industry and scientists have sought alternatives to effectively process the large volume of data generated in different areas of knowledge. MapReduce is presented as a viable alternative for the processing of data-intensive applications. Input files are broken into smaller blocks, which are then distributed and stored on the nodes where they will be processed. Hadoop clusters have been used to execute MapReduce applications. The Hadoop framework automatically performs the division and distribution of the input files, the division of a job into Map and Reduce tasks, the scheduling of tasks among the nodes, and the handling of node failures, and it manages the necessary communication between nodes in the cluster. However, some MapReduce applications have a set of features that do not allow them to benefit fully from the default Hadoop job scheduling policies. Input files shared between multiple jobs, and applications with large volumes of intermediate data, are the characteristics of the applications we handle in our research. The objective of our work is to improve execution efficiency in two ways. On a macro level (job scheduler level), we group the jobs that share the same input files and process them in batches; we then store shared input files and intermediate data on a RAMDISK during batch processing. On a micro level (task scheduler level), tasks of different jobs processed in the same batch that handle the same data blocks are grouped to be executed on the same node where the block was allocated.
19

Deolikar, Piyush P. "Lecture Video Search Engine Using Hadoop MapReduce." Thesis, California State University, Long Beach, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10638908.

Full text
Abstract:

With the advent of the Internet and the ease of uploading video content to video libraries and social networking sites, the availability of video data increased very rapidly during this decade. Universities are uploading video tutorials in their online courses. Companies like Udemy, Coursera, Lynda, etc. have made video tutorials available over the Internet. We propose and implement a scalable solution which helps to find relevant videos with respect to a query provided by the user. Our solution maintains an updated list of the available videos on the web and assigns each a rank according to its relevance. The proposed solution consists of three main components that can mutually interact. The first component, called the crawler, continuously visits and locally stores the relevant information of all the webpages with videos available on the Internet. The crawler has several threads concurrently parsing webpages. The second component obtains the inverted index of the webpages stored by the crawler. Given a query, the inverted index is used to obtain the videos that contain the words in the query. The third component computes the rank of each video. This rank is then used to display the results in order of relevance. We implement a scalable solution in the Apache Hadoop framework. Hadoop is a distributed computing framework that provides a distributed file system able to handle large files as well as distributed computation among the participating nodes.
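The inverted-index component mentioned above follows a classic MapReduce pattern. The sketch below is a minimal, generic version of that pattern (term to list of source files), not the thesis's implementation; it assumes plain-text input whose file names serve as document identifiers.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

    public static class TokenMapper extends Mapper<Object, Text, Text, Text> {
        private final Text word = new Text();
        private final Text docId = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Use the source file name as the document identifier.
            docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
            for (String token : value.toString().toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, docId);
                }
            }
        }
    }

    public static class PostingsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Set<String> docs = new HashSet<>();          // deduplicate documents per term
            for (Text v : values) {
                docs.add(v.toString());
            }
            context.write(key, new Text(String.join(",", docs)));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "inverted-index");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(PostingsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```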

20

Svedlund, Nordström Johan. "A Global Ecosystem for Datasets on Hadoop." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-205297.

Full text
Abstract:
The immense growth of the web has led to the age of Big Data. Companies like Google, Yahoo and Facebook generate massive amounts of data every day. In order to gain value from this data, it needs to be effectively stored and processed. Hadoop, a Big Data framework, can store and process Big Data in a scalable and performant fashion. Both Yahoo and Facebook, two major IT companies, deploy Hadoop as their solution to the Big Data problem. Many application areas for Big Data would benefit from the ability to share datasets across cluster boundaries. However, Hadoop does not support searching for datasets, either locally on a single Hadoop cluster or across many Hadoop clusters. Similarly, there is only limited support for copying datasets between Hadoop clusters (using DistCp). This project presents a solution to this weakness using the Hadoop distribution Hops and its frontend Hopsworks. Clusters advertise their peer-to-peer and search endpoints to a central server called Hops-Site. The advertised endpoints build a global Hadoop ecosystem and give clusters the ability to participate in public search or peer-to-peer sharing of datasets. Hopsworks users are given a choice to write data into Kafka as it is being downloaded. This opens up new possibilities for data scientists, who can interactively analyse remote datasets without having to download everything in advance. By writing data into Kafka as it is being downloaded, it can be consumed by entities like Spark Streaming or Flink.
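To illustrate the Kafka hand-off described above, the sketch below forwards a single record with the standard Kafka Java producer. The broker address, topic name and payload are placeholder assumptions; it is not the Hopsworks integration itself. Once records land on the topic, they can be picked up by Spark Streaming or Flink consumers, as the abstract notes.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DatasetChunkForwarder {
    public static void main(String[] args) {
        // Hypothetical broker address and topic name.
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each chunk of a remote dataset being downloaded could be forwarded like this.
            String chunk = "example payload read from the remote dataset";
            producer.send(new ProducerRecord<>("shared-dataset-stream", chunk));
        }
    }
}
```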
21

RajuladeviKasi, UdayKiran. "Location-aware replication in virtual Hadoop environment." Thesis, Wichita State University, 2012. http://hdl.handle.net/10057/5609.

Full text
Abstract:
MapReduce is a framework for processing highly distributable tasks across huge datasets using a large number of compute nodes. As an implementation of MapReduce, Hadoop is widely used in industry. Hadoop is a software platform that enables the distributed processing of big data across a cluster of servers. Virtualization of a Hadoop cluster shows great potential, as it is easy to configure and economical to use. With advantages like rapid provisioning, security, and efficient resource utilization, virtualization can be a great tool to increase the efficiency of a Hadoop cluster. However, data redundancy, which is a critical part of the Hadoop architecture, can be compromised when traditional Hadoop data allocation methods are used. MapReduce, which is known for its I/O-intensive applications, faces a problem with the decrease in data redundancy and the unbalanced load in the virtual Hadoop cluster. In this research, the authors consider a Hadoop cluster where multiple virtual machines (VMs) co-exist on several physical machines to analyze the data allocation problem in a virtual environment. The authors also design a strategy for file block allocation which is compatible with the native Hadoop data allocation method. This research shows the serious implications of the native Hadoop data redundancy method and proposes a new algorithm that can correct the data placement on the nodes and maintain redundancy in the Hadoop cluster.
Thesis (M.S.)--Wichita State University, College of Engineering, Dept. of Electrical Engineering and Computer Science
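For readers who want to see where the redundancy discussed above is controlled, the sketch below reads and raises a file's replication factor through the standard HDFS API. The namenode address and path are placeholders, and the placement of the extra replicas remains with the cluster's block placement policy; the thesis's corrected placement algorithm is not reproduced here.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical namenode address and file path.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        Path file = new Path("/data/events/part-00000");

        short current = fs.getFileStatus(file).getReplication();
        System.out.println("current replication factor: " + current);

        // Raising the factor asks the namenode to create additional replicas,
        // which it places according to the active block placement policy.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}
```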
22

Темирбекова, Ж. Е., and Ж. М. Меренбаев. "Параллельное масштабирование изображений в технологии mapreduce hadoop." Thesis, Сумский государственный университет, 2015. http://essuir.sumdu.edu.ua/handle/123456789/40775.

Full text
Abstract:
Digital image processing is widely used in virtually every branch of industry. Its use often makes it possible to reach a qualitatively new technological level of production. The most difficult issues, however, are those related to automatically extracting information from an image and interpreting it, which forms the basis for decision-making in the management of production processes.
23

Capitão, Micael José Pedrosa. "Mediator framework for inserting data into hadoop." Master's thesis, Universidade de Aveiro, 2014. http://hdl.handle.net/10773/14697.

Full text
Abstract:
Master's degree in Computer and Telematics Engineering
Data has always been one of the most valuable resources for organizations. From it we can extract information and, with enough information on a subject, we can build knowledge. However, the data first needs to be stored for later processing. Over the last decades we have been witnessing what has been called the "information explosion". With the advent of new technologies, the volume, velocity and variety of data have increased exponentially, becoming what is known today as big data. Telecommunications operators gather, using network monitoring equipment, millions of network event records, the Call Detail Records (CDRs) and the Event Detail Records (EDRs), commonly known as xDRs. These records are stored and later processed to compute network performance and quality of service metrics. With the ever increasing number of telecommunications subscribers, the volume of generated xDRs that need to be stored and processed has increased exponentially, making the current solutions based on relational databases no longer suitable, and so they face a big data problem. To handle that problem, many contributions have been made in recent years that have resulted in solid and innovative solutions. Among them, Hadoop and its vast ecosystem stand out. Hadoop integrates new methods of storing and processing high volumes of data in a robust and cost-effective way, using commodity hardware. This dissertation presents a platform that enables the systems currently inserting data into relational databases to keep doing so transparently when those databases are migrated to Hadoop. The platform has to, as with relational databases, give delivery guarantees, support unique constraints and be fault tolerant. As a proof of concept, the developed platform was integrated with Altaia, a system specifically designed for the computation of performance and quality of service metrics from xDRs. The performance tests have shown that the platform fulfils and exceeds the requirements for the insertion rate of records. During the tests, the behaviour of the platform when trying to insert duplicate records and in failure scenarios has also been evaluated. The results for both situations were as expected.
“Dados” sempre foram um dos mais valiosos recursos das organizações. Com eles pode-se extrair informação e, com informação suficiente, pode-se criar conhecimento. No entanto, é necessário primeiro conseguir guardar esses dados para posteriormente os processar. Nas últimas décadas tem-se assistido ao que foi apelidado de “explosão de informação”. Com o advento das novas tecnologias, o volume, velocidade e variedade dos dados tem crescido exponencialmente, tornando-se no que é hoje conhecido como big data. Os operadores de telecomunicações obtêm, através de equipamentos de monitorização da rede, milhões de registos relativos a eventos da rede, os Call Detail Records (CDRs) e os Event Detail Records (EDRs), conhecidos como xDRs. Esses registos são armazenados e depois processados para deles se produzirem métricas relativas ao desempenho da rede e à qualidade dos serviços prestados. Com o aumento dos utilizadores de telecomunicações, o volume de registos gerados que precisam de ser armazenados e processados cresceu exponencialmente, inviabilizando as soluções que assentam em bases de dados relacionais, estando-se agora perante um problema de big data. Para tratar esse problema, múltiplas contribuições foram feitas ao longo dos últimos anos que resultaram em soluções sólidas e inovadores. De entre elas, destaca-se o Hadoop e o seu vasto ecossistema. O Hadoop incorpora novos métodos de guardar e tratar elevados volumes de dados de forma robusta e rentável, usando hardware convencional. Esta dissertação apresenta uma plataforma que possibilita aos actuais sistemas que inserem dados em bases de dados relacionais, que o continuem a fazer de forma transparente quando essas migrarem para Hadoop. A plataforma tem de, tal como nas bases de dados relacionais, dar garantias de entrega, suportar restrições de chaves únicas e ser tolerante a falhas. Como prova de conceito, integrou-se a plataforma desenvolvida com um sistema especificamente desenhado para o cálculo de métricas de performance e de qualidade de serviço a partir de xDRs, o Altaia. Pelos testes de desempenho realizados, a plataforma cumpre e excede os requisitos relativos à taxa de inserção de registos. Durante os testes também se avaliou o seu comportamento perante tentativas de inserção de registos duplicados e perante situações de falha, tendo o resultado, para ambas as situações, sido o esperado.
24

Justice, Matthew Adam. "Optimizing MongoDB-Hadoop Performance with Record Grouping." Thesis, The University of Arizona, 2012. http://hdl.handle.net/10150/244396.

Full text
Abstract:
Computational cloud computing is more important than ever. Since time is literally money on cloud platforms, performance is the primary focus of researchers and programmers alike. Although distributed computing platforms today do a fine job of optimizing most types of workflows, there are some types, specifically those which are not computation-oriented, that are left out. After introducing important players in the world of computational cloud computing, this paper explores a possible performance enhancement for these types of workflows by reducing the overhead that platform designers assumed was acceptable. The enhancement is tested in two environments: an actual distributed computing platform and an environment that simulates that platform. Along the way it becomes clear that computational cloud computing is far from perfect and its use can often deliver surprising results. Regardless, the presented solution remains viable and is capable of increasing the performance of particular types of jobs by up to twenty percent.
25

Lorenzetto, Luca <1988>. "Evaluating performance of Hadoop Distributed File System." Master's Degree Thesis, Università Ca' Foscari Venezia, 2014. http://hdl.handle.net/10579/4773.

Full text
Abstract:
In recent years, a huge quantity of data produced by multiple sources has appeared. Dealing with this data has given rise to the so-called "big data problem", which can be faced only with new computing paradigms and platforms. Many vendors compete in this field, but to this day the de facto standard platform for big data is the open source framework Apache Hadoop. Inspired by Google's private cluster platform, some independent developers created Hadoop and, following the structure published by Google's engineering team, a complete set of components for big data processing has been developed. One of these components, and one of the core ones, is the Hadoop Distributed File System. In this thesis work, we analyze its performance and identify some action points that can be tuned to improve its behavior in a real implementation.
26

Lu, Yue. "CloudNotes: Annotation Management in Cloud-Based Platforms." Digital WPI, 2014. https://digitalcommons.wpi.edu/etd-theses/273.

Full text
Abstract:
We present an annotation management system for cloud-based platforms, called "CloudNotes". CloudNotes enables the annotation management feature in the scalable Hadoop and MapReduce platforms. In the CloudNotes system, every piece of data may have one or more annotations associated with it, and these annotations are propagated when the data is transformed through MapReduce jobs. Such an annotation management system is important for understanding the provenance and quality of data, especially in applications that deal with the integration of scientific and biological data at unprecedented scale and complexity. We propose several extensions to the Hadoop platform that allow end-users to add and retrieve annotations seamlessly. Annotations in CloudNotes are generated, propagated and managed in a distributed manner. We address several challenges that include attaching annotations to data at various granularities in Hadoop, annotating data in flat files with no known schema until query time, and creating and storing the annotations in a distributed fashion. We also present new storage mechanisms and novel indexing techniques that enable adding annotations in small increments even though Hadoop's file system is optimized for large batch processing.
27

Shetty, Kartik. "Evaluating Clustering Techniques over Big Data in Distributed Infrastructures." Digital WPI, 2018. https://digitalcommons.wpi.edu/etd-theses/1226.

Full text
Abstract:
Clustering is defined as the process of grouping a set of objects in such a way that objects in the same group are more similar, in some sense, to each other than to those in other groups. It is used in many fields including machine learning, image recognition, pattern recognition and knowledge discovery. In this era of Big Data, we can leverage the computing power of a distributed environment to perform clustering over large datasets. It can be achieved through various algorithms, but in general they have high time complexities. We see that for large datasets the scalability and the parameters of the environment in which the algorithm runs become issues which need to be addressed. A brute-force implementation is therefore not scalable over large datasets even in a distributed environment, which calls for an approximation technique or optimization to make it scalable. We study three clustering techniques, CURE, DBSCAN and k-means, over a distributed environment like Hadoop. For each of these algorithms we examine their performance trade-offs and bottlenecks and then propose enhancements, optimizations or an approximation technique to make them scalable in Hadoop. Finally we evaluate their performance and suitability for datasets of different sizes and distributions.
28

Kakantousis, Theofilos. "Scaling YARN: A Distributed Resource Manager for Hadoop." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177200.

Full text
Abstract:
In recent years, there has been a growing need for computer systems that are capable of handling unprecedented amounts of data. To this end, Hadoop HDFS and Hadoop YARN have become the de facto standard for meeting demanding storage requirements and for managing applications that can process this data. Although YARN is a major advancement from its predecessor MapReduce in terms of scalability and fault-tolerance, its Resource Manager component that performs resource allocation introduces a potential single point of failure and a performance bottleneck due to its centralized architecture. This thesis presents a novel architecture in which the Resource Manager runs on a distributed network of stateless commodity machines as its state is migrated to MySQL Cluster, a relational write-scalable and highly available in-memory database. By doing so, the Resource Manager becomes more scalable as it can now run on multiple nodes as well as more fault-tolerant as arbitrary node failures do not result in state loss. In this work we implemented the proposed architecture for the Resource Tracker service which performs cluster node management for the Resource Manager. Experimental results validate the correctness of our proposal, demonstrate how it scales well by utilizing stateless Resource Manager machines and evaluate its performance in terms of request throughput, system resource and database utilization.
29

Lindberg, Johan. "Big Data och Hadoop : Nästa generation av lagring." Thesis, Mittuniversitetet, Avdelningen för informationssystem och -teknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-31079.

Full text
Abstract:
The goal of this report and study is to determine, at a theoretical level, the possibilities for Försäkringskassan IT to change its platform for storage of the data used in its daily activities. Försäkringskassan collects immense amounts of data every day, containing personal information, lines of programming code, payments and customer service tickets. Today, everything is stored in large relational databases, which leads to problems with scalability and performance. The new platform studied in this report is built on a storage technology named Hadoop. Hadoop is developed to store and process data distributed over what are called clusters, which consist of commodity server hardware. The platform promises near-linear scalability, the possibility to store all data with high fault tolerance, and the ability to handle massive amounts of data. The study is done through theoretical studies as well as a proof of concept. The theory studies focus on the background of Hadoop, its structure and what to expect in the future. The platform used at Försäkringskassan today is specified and compared to the new platform. A proof of concept is conducted in a test environment at Försäkringskassan running a Hadoop platform from Hortonworks. Its purpose is to show how storing data is done as well as to show that unstructured data can be stored. The study shows that no theoretical problems have been found and that a move to the new platform should be possible. It does, however, move the handling of the data from before storage to after. This is because today's platform relies on relational databases that require data to be neatly structured before it can be stored, whereas Hadoop stores all data as-is but requires more work and knowledge to retrieve and work with it.
Målet med rapporten och undersökningen är att på en teoretisk nivå undersöka möjligheterna för Försäkringskassan IT att byta plattform för lagring av data och information som används i deras dagliga arbete. Försäkringskassan samlar på sig oerhörda mängder data på daglig basis innehållandes allt från personuppgifter, programkod, utbetalningar och kundtjänstärenden. Idag lagrar man allt detta i stora relationsdatabaser vilket leder till problem med skalbarhet och prestanda. Den nya plattformen som undersöks bygger på en lagringsteknik vid namn Hadoop. Hadoop är utvecklat för att både lagra och processerna data distribuerat över så kallade kluster bestående av billigare serverhårdvara. Plattformen utlovar näst intill linjär skalbarhet, möjlighet att lagra all data med hög feltolerans samt att hantera enorma datamängder. Undersökningen genomförs genom teoristudier och ett proof of concept. Teoristudierna fokuserar på bakgrunden på Hadoop, dess uppbyggnad och struktur samt hur framtiden ser ut. Dagens upplägg för lagring hos Försäkringskassan specificeras och jämförs med den nya plattformen. Ett proof of concept genomförs på en testmiljö hos Försäkringskassan där en Hadoop plattform från Hortonworks används för att påvisa hur lagring kan fungera samt att så kallad ostrukturerad data kan lagras. Undersökningen påvisar inga teoretiska problem i att byta till den nya plattformen. Dock identifieras ett behov av att flytta hanteringen av data från inläsning till utläsning. Detta beror på att dagens lösning med relationsdatabaser kräver väl strukturerad data för att kunna lagra den medan Hadoop kan lagra allt utan någon struktur. Däremot kräver Hadoop mer handpåläggning när det kommer till att hämta data och arbeta med den.
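As a minimal sketch of the kind of proof of concept described in this abstract, the fragment below stores a local file in HDFS through the Java FileSystem API. The NameNode address and the file paths are placeholders, not details from the thesis.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsIngestSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
            FileSystem fs = FileSystem.get(conf);
            // Unstructured or semi-structured data can be written as-is;
            // structure is applied later, at read time.
            fs.copyFromLocalFile(new Path("./tickets.json"),
                                 new Path("/landing/tickets/tickets.json"));
            fs.close();
        }
    }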
APA, Harvard, Vancouver, ISO, and other styles
30

Brito, José Benedito de Souza. "Modelo para estimar performance de um Cluster Hadoop." reponame:Repositório Institucional da UnB, 2014. http://repositorio.unb.br/handle/10482/17180.

Full text
Abstract:
Dissertation (Master's) — Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2014.
O volume, a variedade e a velocidade dos dados apresenta um grande desafio para extrair informações úteis em tempo hábil, sem gerar grandes impactos nos demais processamentos existentes nas organizações, impulsionando a utilização de clusters para armazenamento e processamento, e a utilização de computação em nuvem. Este cenário é propício para o Hadoop, um framework open source escalável e eficiente, para a execução de cargas de trabalho sobre Big Data. Com o advento da computação em nuvem um cluster com o framework Hadoop pode ser alocado em minutos, todavia, garantir que o Hadoop tenha um desempenho satisfatório para realizar seus processamentos apresenta vários desafios, como as necessidades de ajustes das configurações do Hadoop às cargas de trabalho, alocar um cluster apenas com os recursos necessários para realizar determinados processamentos e definir os recursos necessários para realizar um processamento em um intervalo de tempo conhecido. Neste trabalho, foi proposta uma abordagem que busca otimizar o framework Hadoop para determinada carga de trabalho e estimar os recursos computacionais necessários para realizar um processamento em determinado intervalo de tempo. A abordagem proposta é baseada na coleta de informações, base de regras para ajustes de configurações do Hadoop, de acordo com a carga de trabalho, e simulações. A simplicidade e leveza do modelo permite que a solução seja adotada como um facilitador para superar os desafios apresentados pelo Big Data, e facilitar a definição inicial de um cluster para o Hadoop, mesmo por usuários com pouca experiência em TI. O modelo proposto trabalha com o MapReduce para definir os principais parâmetros de configuração e determinar recursos computacionais dos hosts do cluster para atender aos requisitos desejados de tempo de execução para determinada carga de trabalho.
The volume, variety and velocity of data present a great challenge to extracting useful information in a timely manner without impacting the other processes that already exist in organizations, promoting the use of clusters for storage and processing and the use of cloud computing. This is a good scenario for Hadoop, an open source framework that is scalable and efficient for running workloads on Big Data. With the advent of cloud computing, a cluster with the Hadoop framework can be allocated in minutes; however, ensuring that Hadoop performs well enough for its processing presents several challenges, such as the need to tune Hadoop's settings to the workload, to allocate a cluster with only the resources needed to perform certain processing, and to define the resources required to complete a processing job within a known time interval. In this work, an approach is proposed that seeks to optimize Hadoop for a given workload and to estimate the computational resources required to complete a processing job in a given time interval. The approach is based on collecting information, a rule base for adjusting Hadoop settings to a given workload, and simulations. The simplicity and lightness of the model allow the solution to be adopted as a facilitator to overcome the challenges presented by Big Data and to ease the use of Hadoop, even by users with little IT experience. The proposed model works with MapReduce to define the main configuration parameters and to determine the computational resources of the cluster hosts needed to meet the desired runtime requirements for a given workload.
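As a hedged sketch of what applying such rule-based adjustments can look like in practice, the fragment below sets a few real Hadoop 2.x configuration keys in a job driver; the specific values are illustrative and are not the rule base proposed in the dissertation.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TunedJobDriver {
        public static Job buildJob() throws Exception {
            Configuration conf = new Configuration();
            // Example knobs a rule base might set according to the workload profile
            conf.setInt("mapreduce.task.io.sort.mb", 256);          // map-side sort buffer
            conf.setInt("mapreduce.task.io.sort.factor", 64);       // streams merged at once
            conf.setFloat("mapreduce.map.sort.spill.percent", 0.85f);
            conf.setInt("mapreduce.job.reduces", 8);                // number of reducers
            conf.setBoolean("mapreduce.map.output.compress", true); // compress intermediate data
            return Job.getInstance(conf, "workload-tuned-job");
        }
    }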
APA, Harvard, Vancouver, ISO, and other styles
31

Бабич, А. С., and Елена Петровна Черных. "Использование Apache Hadoop для обработки больших наборов данных" [Using Apache Hadoop to process large data sets]. Thesis, Національний технічний університет "Харківський політехнічний інститут", 2015. http://repository.kpi.kharkov.ua/handle/KhPI-Press/45546.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Hou, Jun. "Using Hadoop to Cluster Data in Energy System." University of Dayton / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=dayton1430092547.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Sodhi, Bir Apaar Singh. "DATA MINING: TRACKING SUSPICIOUS LOGGING ACTIVITY USING HADOOP." CSUSB ScholarWorks, 2016. https://scholarworks.lib.csusb.edu/etd/271.

Full text
Abstract:
In this modern, highly interconnected era, an organization's top priority is to protect itself from the major security breaches that occur frequently within a communication environment, yet organizations often fail to do so. Every week there are new headlines about information being forged, funds being stolen, credit cards being misused, and so on. Personal computers are turned into "zombie machines" by hackers to steal confidential and financial information without disclosing the hacker's true identity. These identity thieves rob private data and defeat the very purpose of privacy. The purpose of this project is to identify suspicious user activity by analyzing a log file, which can later help an investigation agency like the FBI track and monitor anonymous users who look for weaknesses in vulnerable parts of a system in order to gain access to it. The project also emphasizes the potential damage that a malicious activity could have on the system. This project uses the Hadoop framework to search and store log files of logging activities and then runs a MapReduce program to compute and analyze the results.
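To make the MapReduce step concrete, here is a minimal sketch (not the project's code) of a job that counts failed-login events per source IP in a log file; the log layout and the "Failed password" marker are assumptions used only for illustration.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FailedLoginCount {
        public static class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String record = line.toString();
                // "Failed password" and the token position of the IP are assumptions
                // about an sshd-style log line.
                if (record.contains("Failed password")) {
                    String[] fields = record.split("\\s+");
                    if (fields.length < 6) return;
                    String sourceIp = fields[fields.length - 4];
                    ctx.write(new Text(sourceIp), ONE);
                }
            }
        }
        public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text ip, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable c : counts) {
                    total += c.get();
                }
                ctx.write(ip, new IntWritable(total)); // unusually high totals flag suspicious sources
            }
        }
    }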
APA, Harvard, Vancouver, ISO, and other styles
34

Palummo, Alexandra Lina. "Supporto SQL al sistema Hadoop per big data analytics." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2016.

Find full text
Abstract:
In recent years there has been more and more talk of Big Data, referring not only to the large volumes of data generated by phenomena such as the explosion of social networks and the unprecedented acceleration of technological development; the expression also covers a set of new needs and the resulting challenges, known as the Three Vs: Volume, Velocity and Variety. In order to analyze and extract information from these large volumes of data, resources and technologies different from conventional data storage and management systems have been developed. One of the most successful of these technologies is Apache Hadoop, an open source framework from Apache. This thesis presents an overview of Hadoop, which was conceived to support distributed applications and to simplify the storage and management of very large datasets, providing an alternative to relational DBMSs, which are poorly suited to Big Data transformations. Hadoop also provides tools capable of analyzing and processing large amounts of information, among them Hive, Impala and BigSQL 3.0, which are described in the second part of the thesis. Comparing the performance of these three systems in an experiment run on the TPC-DS benchmark on a Hadoop platform showed that BigSQL 3.0 achieves the best performance.
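As a hedged illustration of how such SQL-on-Hadoop engines are queried from client code, the sketch below submits a query to HiveServer2 over JDBC; the host, credentials and the TPC-DS-style query are placeholders rather than the setup used in the thesis.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuerySketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hive", ""); // assumed endpoint
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT d_year, SUM(ss_net_paid) FROM store_sales " +
                     "JOIN date_dim ON ss_sold_date_sk = d_date_sk GROUP BY d_year")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }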
APA, Harvard, Vancouver, ISO, and other styles
35

Fischer, e. Silva Renan. "E-EON : Energy-Efficient and Optimized Networks for Hadoop." Doctoral thesis, Universitat Politècnica de Catalunya, 2018. http://hdl.handle.net/10803/586061.

Full text
Abstract:
Energy efficiency and performance improvements have been two of the major concerns of current Data Centers. With the advent of Big Data, more information is generated year after year, and even the most aggressive predictions of the largest network equipment manufacturer have been surpassed due to the non-stop growth of network traffic generated by current Big Data frameworks. As one of the most famous and discussed frameworks designed to store, retrieve and process the information that is constantly generated by users and machines, Hadoop has gained a lot of attention from industry in recent years, and presently its name describes a whole ecosystem designed to tackle the most varied requirements of today's cloud applications. This thesis relates to Hadoop clusters, mainly focused on their interconnects, which are commonly considered to be the bottleneck of such an ecosystem. We conducted research focusing on energy efficiency and also on performance optimizations, namely improvements in cluster throughput and network latency. Regarding energy consumption, a significant proportion of a data center's energy consumption is caused by the network, which accounts for 12% of the total system power at full load. With network traffic growing non-stop, both industry and the academic community want network energy consumption to be proportional to network utilization. Considering cluster performance, although Hadoop is a network throughput-sensitive workload with less stringent requirements for network latency, there is an increasing interest in running batch and interactive workloads concurrently on the same cluster. Doing so maximizes system utilization, to obtain the greatest benefits from the capital and operational expenditures. For this to happen, cluster throughput should not be impacted when network latency is minimized. The two biggest challenges faced during the development of this thesis were related to achieving near-proportional energy consumption for the interconnects and to improving the network latency found in Hadoop clusters, while having virtually no loss in cluster throughput. Such challenges led to a comparably sized opportunity: proposing new techniques that solve these problems for the current generation of Hadoop clusters. We named the set of techniques presented in this work E-EON, which stands for Energy Efficient and Optimized Networks for Hadoop. E-EON can be used to reduce network energy consumption and, at the same time, to reduce network latency while cluster throughput is improved. Furthermore, such techniques are not exclusive to Hadoop, and they are also expected to have similar benefits if applied to any other Big Data framework infrastructure that fits the problem characterization we presented throughout this thesis. With E-EON we were able to reduce energy consumption by up to 80% compared to the state-of-the-art technique. We were also able to reduce network latency by up to 85% and, in some cases, even improve cluster throughput by 10%. Although these were the two major accomplishments of this thesis, we also present minor benefits which translate to easier configuration compared to the state-of-the-art techniques. Finally, we enrich the discussions found in this thesis with recommendations targeting network administrators and network equipment manufacturers.
La eficiencia energética y las mejoras de rendimiento han sido dos de las principales preocupaciones de los Data Centers actuales. Con el arribo del Big Data, se genera más información año con año, incluso las predicciones más agresivas de parte del mayor fabricante de dispositivos de red se han superado debido al continuo tráfico de red generado por los sistemas de Big Data. Actualmente, uno de los más famosos y discutidos frameworks desarrollado para almacenar, recuperar y procesar la información generada consistentemente por usuarios y máquinas, Hadoop acaparó la atención de la industria en los últimos años y actualmente su nombre describe a todo un ecosistema diseñado para abordar los requisitos más variados de las aplicaciones actuales de Cloud Computing. Esta tesis profundiza sobre los clusters Hadoop, principalmente enfocada a sus interconexiones, que comúnmente se consideran el cuello de botella de dicho ecosistema. Realizamos investigaciones centradas en la eficiencia energética y también en optimizaciones de rendimiento como mejoras en el throughput de la infraestructura y de latencia de la red. En cuanto al consumo de energía, una porción significativa de un Data Center es causada por la red, representada por el 12 % de la potencia total del sistema a plena carga. Con el tráfico constantemente creciente de la red, la industria y la comunidad académica busca que el consumo energético sea proporcional a su uso. Considerando las prestaciones del cluster, a pesar de que Hadoop mantiene una carga de trabajo sensible al rendimiento de red aunque con requisitos menos estrictos sobre la latencia de la misma, existe un interés creciente en ejecutar aplicaciones interactivas y secuenciales de manera simultánea sobre dicha infraestructura. Al hacerlo, se maximiza la utilización del sistema para obtener los mayores beneficios al capital y gastos operativos. Para que esto suceda, el rendimiento del sistema no puede verse afectado cuando se minimiza la latencia de la red. Los dos mayores desafíos enfrentados durante el desarrollo de esta tesis estuvieron relacionados con lograr un consumo energético cercano a la cantidad de interconexiones y también a mejorar la latencia de red encontrada en los clusters Hadoop al tiempo que la perdida del rendimiento de la infraestructura es casi nula. Dichos desafíos llevaron a una oportunidad de tamaño semejante: proponer técnicas novedosas que resuelven dichos problemas a partir de la generación actual de clusters Hadoop. Llamamos a E-EON (Energy Efficient and Optimized Networks) al conjunto de técnicas presentadas en este trabajo. E-EON se puede utilizar para reducir el consumo de energía y la latencia de la red al mismo tiempo que el rendimiento del cluster se mejora. Además tales técnicas no son exclusivas de Hadoop y también se espera que tengan beneficios similares si se aplican a cualquier otra infraestructura de Big Data que se ajuste a la caracterización del problema que presentamos a lo largo de esta tesis. Con E-EON pudimos reducir el consumo de energía hasta en un 80% en comparación con las técnicas encontradas en la literatura actual. También pudimos reducir la latencia de la red hasta en un 85% y, en algunos casos, incluso mejorar el rendimiento del cluster en un 10%. Aunque estos fueron los dos principales logros de esta tesis, también presentamos beneficios menores que se traducen en una configuración más sencilla en comparación con las técnicas más avanzadas. 
Finalmente, enriquecimos las discusiones encontradas en esta tesis con recomendaciones dirigidas a los administradores de red y a los fabricantes de dispositivos de red.
APA, Harvard, Vancouver, ISO, and other styles
36

Benslimane, Ziad. "Optimizing Hadoop Parameters Based on the Application Resource Consumption." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-200144.

Full text
Abstract:
The interest in analyzing the growing amounts of data has encouraged the deployment of large-scale parallel computing frameworks such as Hadoop. In other words, data analytics is the main reason behind the success of distributed systems; this is due to the fact that data might not fit on a single disk and that processing can be very time consuming, which makes parallel input analysis very useful. Hadoop relies on the MapReduce programming paradigm to distribute work among the machines, so a good load balance will eventually influence the execution time of these kinds of applications. This paper introduces a technique to tune Hadoop by optimizing some configuration parameters based on the application's CPU utilization. The theories stated and proved in this work rely on the fact that the CPUs should be neither over-utilized nor under-utilized; in other words, the conclusion takes the form of an equation for the parameter to be optimized in terms of the cluster infrastructure. Future research on this topic is planned to focus on tuning other Hadoop parameters and on using more accurate tools to analyze cluster performance; moreover, it would also be interesting to investigate possible ways to optimize Hadoop parameters based on other consumption criteria, such as input/output statistics and network traffic.
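The tiny sketch below illustrates the general idea of expressing a tuning parameter as a function of the cluster hardware; the heuristic (concurrent tasks per node proportional to cores, scaled by a target CPU utilization) is an illustrative assumption, not the equation derived in the thesis.

    public class SlotHeuristic {
        static int maxConcurrentTasksPerNode(int physicalCores, double targetUtilization) {
            // e.g. 8 cores at a 0.9 utilization target -> 7 concurrent map/reduce tasks,
            // so the CPUs are neither idle nor oversubscribed
            int tasks = (int) Math.floor(physicalCores * targetUtilization);
            return Math.max(1, tasks);
        }

        public static void main(String[] args) {
            System.out.println(maxConcurrentTasksPerNode(8, 0.9));
        }
    }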
APA, Harvard, Vancouver, ISO, and other styles
37

Donepudi, Harinivesh. "An Apache Hadoop Framework for Large-Scale Peptide Identification." TopSCHOLAR®, 2015. http://digitalcommons.wku.edu/theses/1527.

Full text
Abstract:
Peptide identification is an essential step in protein identification, and the Peptide Spectrum Match (PSM) data set is huge, which makes it time consuming to process on a single machine. In a typical run of a peptide identification method, PSMs are ranked by a cross correlation, a statistical score, or a likelihood that the match between the trial and hypothetical spectra is correct and unique. This process takes a long time to execute, and there is a demand for increased performance to handle large peptide data sets. Distributed frameworks are needed to reduce the processing time, but this comes at the price of complexity in developing and executing them. In distributed computing, the program may be divided into multiple parts to be executed. The work in this thesis describes the implementation of the Apache Hadoop framework for large-scale peptide identification using C-Ranker. The Apache Hadoop data processing software is immersed in a complex environment composed of massive machine clusters, large data sets, and several processing jobs. The framework uses the Apache Hadoop Distributed File System (HDFS) and Apache MapReduce to store and process the peptide data, respectively. The proposed framework uses a peptide processing algorithm named C-Ranker which takes peptide data as input and identifies the correct PSMs. The framework has two steps: execute the C-Ranker algorithm on a Hadoop cluster, and compare the correct PSM data generated via the Hadoop approach with that of the normal execution of C-Ranker. The goal of this framework is to process large peptide datasets using the Apache Hadoop distributed approach.
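A hedged sketch of the kind of job driver such a framework needs is shown below: it wires HDFS input and output paths into a MapReduce job. The paths are placeholders, and the identity Mapper/Reducer classes stand in for the C-Ranker scoring logic, which is not reproduced here.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PsmJobDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "psm-ranking");
            job.setJarByClass(PsmJobDriver.class);
            job.setMapperClass(Mapper.class);   // placeholder: a custom Mapper would score PSMs
            job.setReducerClass(Reducer.class); // placeholder: a custom Reducer would filter them
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("/psm/input"));    // assumed HDFS path
            FileOutputFormat.setOutputPath(job, new Path("/psm/output")); // assumed HDFS path
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }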
APA, Harvard, Vancouver, ISO, and other styles
38

Vijayakumar, Sruthi. "Hadoop Based Data Intensive Computation on IAAS Cloud Platforms." UNF Digital Commons, 2015. http://digitalcommons.unf.edu/etd/567.

Full text
Abstract:
Cloud computing is a relatively new form of computing which uses virtualized resources. It is dynamically scalable and is often provided as a pay-per-use service over the Internet or an intranet, or both. With increasing demand for data storage in the cloud, the study of data-intensive applications is becoming a primary focus. Data-intensive applications are those which involve high CPU usage and process large volumes of data, typically hundreds of gigabytes, terabytes or petabytes in size. The research in this thesis is focused on Amazon's Elastic Compute Cloud (EC2) and Amazon Elastic MapReduce (EMR) using the HiBench Hadoop benchmark suite, which is used for performing and evaluating Hadoop-based data-intensive computation on both of these cloud platforms. Both quantitative and qualitative comparisons of Amazon EC2 and Amazon EMR are presented, along with their pricing models and suggestions for future research.
APA, Harvard, Vancouver, ISO, and other styles
39

SHELLY. "IRIS RECOGNITION ON HADOOP." Thesis, 2011. http://dspace.dtu.ac.in:8080/jspui/handle/repository/13885.

Full text
Abstract:
M.TECH
Iris recognition is a type of pattern recognition which recognizes a user by analyzing the physical structure of an individual's iris. A unique iris pattern is extracted from a digitized image of the eye and encoded into an iris template. An iris template contains unique information about an individual and is stored in a database. To identify an individual with an iris recognition system, the individual's eye image is captured using a video camera and converted into an iris template, which is compared with the stored iris templates in the database. If the templates match, the user is said to be genuine; otherwise, an imposter. Iris recognition offers advantages over traditional recognition methods (ID cards, PIN numbers) because the person to be identified has no need to remember or carry any information. The iris pattern remains stable throughout a person's life, which makes it very attractive as a biometric for identifying individuals. Iris recognition is deployed for verification and/or identification in applications such as access control, border management, and identification systems. With increasing security concerns, biometric database sizes are growing very fast, and technologies like iris recognition have very large databases to compare against. Iris recognition algorithms are implemented on general-purpose sequential processing systems, and existing relational database systems are not enough to handle data of this size in a reasonable time. In this thesis, a parallel processing alternative using cloud computing is proposed, offering an opportunity to increase speed and to operate on huge databases. The open source Hadoop framework for cloud computing is used to implement the proposed system. The Hadoop Distributed File System (HDFS) is used to handle large data sets by breaking them into blocks and replicating the blocks on various machines in the cloud. Template comparison is done independently on different blocks of data by various machines in parallel. The MapReduce programming model is used for processing the large data sets; it processes the data in key/value format. The iris database is stored in a text format. Mappers process the input and produce an intermediate output; a reducer takes the intermediate output and produces the final result. This research work shows how the most time-consuming operations (the matching process) of a modern iris recognition algorithm are parallelized. In particular, template matching is parallelized on a cloud-based system with a demonstrated speedup gain.
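A hedged sketch of how the parallel matching step can look as a Hadoop mapper is given below: each input line is assumed to hold a user identifier and a binary template string separated by a tab, and the probe template is passed through the job configuration. The record format and the configuration key are illustrative assumptions, not the thesis' actual encoding.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class IrisMatchMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private String probe;

        @Override
        protected void setup(Context ctx) {
            // the probe template is assumed to be passed in through the configuration
            probe = ctx.getConfiguration().get("iris.probe.template");
        }

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            if (probe == null) return;
            String[] parts = value.toString().split("\t"); // assumed "userId<TAB>templateBits"
            if (parts.length != 2 || parts[1].length() != probe.length()) return;
            int differing = 0;
            for (int i = 0; i < probe.length(); i++) {
                if (probe.charAt(i) != parts[1].charAt(i)) differing++;
            }
            double hammingDistance = (double) differing / probe.length();
            ctx.write(new Text(parts[0]), new DoubleWritable(hammingDistance));
        }
    }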
APA, Harvard, Vancouver, ISO, and other styles
40

Yang, Ming-hsien, and 楊明憲. "High-Performance Heterogeneous Hadoop Architecture." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/33987755238266822134.

Full text
Abstract:
Master's
National Taiwan University of Science and Technology
Graduate Institute of Automation and Control
101
Native Hadoop is a two-layered structure composed of one master and many slaves, where each slave can be seen as the combination of a DataNode and a TaskTracker, while the master is in charge of managing the slave nodes. Since users may add more slave nodes to Hadoop to increase the efficiency of parallel computing over massive data, Hadoop has been viewed as a key technology for massive data processing. Although Hadoop improves efficiency, a Hadoop cluster composed of a large number of slaves makes the physical installation larger and consumes more energy. Therefore, this study combines ARM and x86 to form a new “Heterogeneously Three-Layered” Hadoop structure, exploiting ARM's characteristics: energy saving, high performance in massive data processing, and a small footprint. Moreover, the concept of a “Dynamically Managing Block Algorithm” is introduced into the task scheduler. This design not only improves on shortcomings in native Hadoop but also effectively reduces MapReduce operation time by more than 22%.
APA, Harvard, Vancouver, ISO, and other styles
41

Hsin-YingLee and 李信穎. "A Transparent Hadoop Data Service." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/y2s2g2.

Full text
Abstract:
Master's
National Cheng Kung University
Department of Computer Science and Information Engineering
106
Hadoop is an open source distributed processing and storage framework for big data. Its storage layer is called the Hadoop Distributed File System (HDFS). HDFS is designed for storing very large files with streaming data access patterns, but it cannot effectively handle large numbers of small files. Although there are many ways to solve the small-file problem, users still need to do a lot of extra processing. HBase is a distributed database that is often paired with Hadoop and provides efficient random access; it can be used to solve the small-file problem, but both systems take time to learn and are written in Java, which is a high barrier to entry for data analysts. This thesis proposes a transparent distributed storage system designed to solve the small-file problem on HDFS, which supports the HDFS interface so that it remains compatible with the Hadoop ecosystem, such as Hive and Spark. Users can access the data directly without changing any code, and the system also provides a simple Web API that hides the data platform's complex operations and lets users migrate data into the platform from other file servers through the API.
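As a hedged sketch of one common way to handle the small-file problem mentioned above, the fragment below stores the contents of a small file as a single HBase row keyed by its logical path; the table and column family names are assumptions, not the design of the thesis.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SmallFileStore {
        public static void store(String path, byte[] content) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("small_files"))) { // assumed table
                Put put = new Put(Bytes.toBytes(path));             // row key = logical file path
                put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("data"), content); // assumed family/qualifier
                table.put(put);
            }
        }
    }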
APA, Harvard, Vancouver, ISO, and other styles
42

Yadav, Rakesh. "Genetic Algorithms Using Hadoop MapReduce." Thesis, 2015. http://ethesis.nitrkl.ac.in/7790/1/2015_Genetic_Yadav.pdf.

Full text
Abstract:
Data-Intensive Computing (DIC) plays an important role for large data sets by utilizing parallel computing. DIC models have been shown to process large amounts of data, on the order of petabytes or zettabytes, day to day. This motivates attempts to examine how DIC can support evolutionary (genetic) algorithms. This thesis gives a step-by-step explanation of how Genetic Algorithms (GA), in different implementation forms, can be translated into the Hadoop MapReduce framework. The results detail why Hadoop is a good choice for running genetic algorithms on large-dataset problems and show how speedup increases with parallel computing. MapReduce is designed for large volumes of data. It was introduced for Big Data analysis and has been used for many algorithms, such as breadth-first search, the traveling salesman problem, and shortest-path problems. The framework has two key phases, map and reduce. The map phase divides the data in parallel across the cluster, where each record takes the form of a key and a value. The output of the map phase goes into an intermediate phase where the data is shuffled and sorted. A partitioner then divides the data in parallel among the reducers as specified by the user; the number of partitions depends on the number of reducers. The reducers iterate over the data and produce the results as values. This thesis also compares our implementation with an implementation presented in an existing model; the two implementations are compared on the ONEMAX (bit-counting) problem. The comparison criteria are fitness convergence, stability with a fixed number of nodes, quality of the final solution, cloud resource utilization, and algorithm scalability.
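To make the fitness-evaluation step concrete, here is a hedged sketch of a mapper for the OneMax problem: each input record is assumed to be one candidate bit string, and its fitness is simply the number of 1 bits. Selection and crossover, which would happen in the reducer or between generations, are omitted.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class OneMaxFitnessMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text chromosome, Context ctx)
                throws IOException, InterruptedException {
            String bits = chromosome.toString().trim();
            int fitness = 0;
            for (int i = 0; i < bits.length(); i++) {
                if (bits.charAt(i) == '1') fitness++;   // OneMax: count the 1s
            }
            ctx.write(new Text(bits), new IntWritable(fitness));
        }
    }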
APA, Harvard, Vancouver, ISO, and other styles
43

Lee, Yao-Cheng, and 李曜誠. "Exploiting Parallelism in Heterogeneous Hadoop System." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/69673317513666799922.

Full text
Abstract:
Master's
National Taiwan University
Graduate Institute of Networking and Multimedia
102
With the rise of big data, Apache Hadoop has been attracting increasing attention. There are two primary components at the core of Apache Hadoop: the Hadoop Distributed File System (HDFS) and the MapReduce framework. MapReduce is a programming model for processing large datasets with a parallel, distributed algorithm on a cluster. However, the MapReduce framework is not efficient enough: despite the parallelism across mappers, the implementation of Hadoop MapReduce does not fully exploit parallelism to enhance performance, because each mapper adopts a serial processing algorithm instead of a parallel one. To solve these problems, this thesis proposes a new Hadoop framework which fully exploits parallelism through parallel processing. For better performance, we utilize GPGPU computational power to accelerate the program. In addition, in order to utilize both the CPU and the GPU to reduce the overall execution time, we also propose a scheduling policy that dynamically dispatches computation to the appropriate device. Our experimental results show that our system can achieve a speedup of 1.45X over Hadoop on the benchmarks.
APA, Harvard, Vancouver, ISO, and other styles
44

Wang, Kuang-Hsin, and 王廣新. "Cloud-based surveillance system under Hadoop." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/59297974316482725855.

Full text
Abstract:
Master's
National Chiao Tung University
Computer Science Degree Program, College of Computer Science
100
Information technology has been widely used to solve problems in many domains. Many cameras have been installed at street corners to maintain public order or monitor traffic status. The recorded video data is stored in a traditional surveillance system or NVR in public departments for future inquiry. However, a traditional surveillance system cannot process and store the large amount of data generated by a large number of cameras. Moreover, users must add NVRs manually. Thus, scaling up a traditional surveillance system increases the cost and complexity of management. In addition, a traditional surveillance system is not capable of handling exception break situations, such as blackouts and system crashes, while storing streaming data to storage nodes. The purpose of this thesis is therefore to propose a cloud-based surveillance system to solve these problems. The proposed system integrates Hadoop to store large amounts of streaming data and provides a backup mechanism to handle exception break situations. By integrating the Hadoop Distributed File System (HDFS), the approach of this thesis can easily scale up the system to process the correspondingly large amount of data generated by the cameras. The central Hadoop cluster is integrated with a novel backup mechanism for handling exception break situations. The evaluation results show that the proposed system can easily process an increasing number of video streams and loses only a few frames while handling exception break situations. The results suggest that replacing NVRs with the proposed cloud-based surveillance system under Hadoop, together with our backup mechanism, is practicable.
APA, Harvard, Vancouver, ISO, and other styles
45

Cheng, Ya-Wen, and 鄭雅文. "Improving Fair Scheduling Performance on Hadoop." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/16860934128679350588.

Full text
Abstract:
Master's
National Dong Hwa University
Department of Computer Science and Information Engineering
103
Cloud computing and big data are both prominent topics around the world. Cloud computing not only supports a storage platform through which we can access big data, but also provides a practical technique to process truly large amounts of data at the same time. Therefore, this thesis chooses the open-source Hadoop for study. Our study is focused on improving Hadoop performance by using fair scheduling. Our goal is to consult many real-time parameters and use them to decide which job should receive system resources first. In addition, we adjust the related parameters dynamically, for example job priority and delay time. We hope to increase job runtime speed and improve system performance. This thesis proposes the following mechanisms: job classification, pool resource assignment, job sorting based on FIFO, job sorting based on fairness, dynamic delay time adjustment, and dynamic job priority adjustment. These strategies consult the real system status and use it to influence system performance. Finally, our proposed mechanisms do improve fair scheduling performance. The experiments confirm that our method is better than the original Hadoop fair scheduling, and the results show a clear improvement.
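From the client side, the knobs this kind of study works with look roughly like the hedged sketch below, which assigns a job to a scheduler pool/queue and sets its priority; the queue name is a placeholder, and the thesis' actual contribution (modifying the scheduler itself) is not shown here.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobPriority;

    public class FairPoolSubmission {
        public static Job prepare() throws Exception {
            Configuration conf = new Configuration();
            conf.set("mapreduce.job.queuename", "interactive"); // assumed pool/queue name
            Job job = Job.getInstance(conf, "pool-aware-job");
            job.setPriority(JobPriority.HIGH); // the thesis adjusts priority dynamically on the server side
            return job;
        }
    }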
APA, Harvard, Vancouver, ISO, and other styles
46

Lin, Jiun-Yu, and 林君豫. "A Hadoop-Based Endpoint Protection Mechanism." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/10751644664014425491.

Full text
Abstract:
Master's
National Chiao Tung University
Information Management Program, College of Management
103
Today the number of end users in enterprises and organizations is clearly larger than in the past twenty years; in particular, almost all employees can use their own devices to connect to the intranet of the enterprise or organization, which makes endpoint security control and management more difficult for the IT department. Strengthening security management software may increase the hidden costs borne by companies and organizations. Various anti-virus software vendors have begun to address this by improving their products and implementing advanced threat prevention technologies in a single agent program, building on the achievements of the original anti-virus software to provide organizations with broad endpoint defense capabilities and more comprehensive, safer protection. However, most of them focus only on functionality, monitoring services, the establishment of exception states, and virus classification and statistics, and thus fail to offer an effective way to control internal information security in the enterprise or organization. This study implements the cloud analysis tool Hadoop within the Symantec Endpoint Protection Manager (SEPM) architecture to analyze events, notify managers, and generate a prioritized checking list. Finally, it assesses effectiveness before and after the change, and the result shows a significant benefit. This study uses the cloud analysis software Hadoop to effectively strengthen information security event handling and reduce the threats that may occur, and serves as a reference for other companies or organizations that want to raise their overall information security level.
APA, Harvard, Vancouver, ISO, and other styles
47

Zhuo, Ye-Qi, and 卓也琦. "Optimization of Hadoop System Configuration Parameters." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/29088821473196350661.

Full text
Abstract:
Master's
National Taiwan University
Graduate Institute of Computer Science and Information Engineering
103
The Hadoop system has been very popular in recent years; it is a software framework for distributed processing of large-scale data sets on a cluster of machines using the MapReduce programming model. However, there are still two essential challenges for Hadoop users managing the system: (1) tuning the parameters appropriately, and (2) dealing with the dozens of configuration parameters that affect performance. This thesis focuses on optimizing Hadoop MapReduce job performance. Our approach has two key models: prediction and optimization. The prediction model estimates the execution time of a MapReduce job, and the optimization model searches for approximately optimal configuration parameters by invoking the prediction model repeatedly. This analytical method chooses approximately optimal configuration parameters to improve users' job performance. Besides configuration parameter tuning, the relevance of each parameter and the evaluation of our methods are also discussed. This work may provide users with a better method to improve Hadoop system performance and save hardware resources.
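A hedged sketch of the predict-then-search idea is shown below: candidate settings are enumerated and the combination with the lowest predicted run time is kept. The predictRunTime() cost model here is a hypothetical stand-in for the analytical prediction model built in the thesis.

    import java.util.HashMap;
    import java.util.Map;

    public class ConfigSearchSketch {
        static double predictRunTime(int sortMb, int reducers) {
            // placeholder cost model, standing in for the analytical predictor
            return 1000.0 / reducers + Math.abs(256 - sortMb) * 0.5;
        }

        public static void main(String[] args) {
            double best = Double.MAX_VALUE;
            Map<String, Integer> bestConf = new HashMap<>();
            for (int sortMb : new int[] {100, 256, 512}) {
                for (int reducers : new int[] {4, 8, 16, 32}) {
                    double predicted = predictRunTime(sortMb, reducers);
                    if (predicted < best) {
                        best = predicted;
                        bestConf.put("mapreduce.task.io.sort.mb", sortMb);
                        bestConf.put("mapreduce.job.reduces", reducers);
                    }
                }
            }
            System.out.println(bestConf + " -> predicted " + best + " s");
        }
    }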
APA, Harvard, Vancouver, ISO, and other styles
48

CHIANG, YU-PING, and 江宇平. "A Hadoop-based Password Cracking System." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/89505687663130189313.

Full text
Abstract:
Master's
Ming Chuan University
Master's Program, Department of Information and Communication Engineering
104
With the fast development of computers, the amount of stored data keeps increasing. One of the most important issues is how to analyze big data sets in a shorter time. The Hadoop software library allows distributed processing of large data sets across clusters of computers. The MapReduce programming paradigm lends itself well to these data-intensive analytics jobs, given its ability to scale out and leverage several machines for parallel processing. Traditional password-cracking systems usually use only a single computing process to crack passwords. The password-cracking algorithms are brute-force cracking and dictionary attacks. Brute force, in which a computer tries every possible combination until it succeeds, is the lowest common denominator of password cracking. The advantage of brute-force cracking is that it only requires CPU computation and does not use storage to save data. The disadvantage is that it might take several days to crack a password. A dictionary attack is based on trying all the strings in a pre-arranged list. Its advantage is that it can crack passwords faster than brute-force cracking, but it may require large storage for the wordlists. HDFS can distribute the wordlists across slaves, and MapReduce can process them in parallel to reduce cracking time. We design a password-cracking system based on Hadoop. The adopted methods for cracking MD5 hashes are dictionary attacks and brute-force attacks. The contribution of this research is partitioning the brute-force computation into many parts and distributing 10 GB of wordlists on HDFS, which improves cracking performance.
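To illustrate the dictionary-attack half of such a system, here is a hedged sketch of a mapper that hashes each candidate word with MD5 and compares it to a target hash supplied through the job configuration; the configuration key and the one-word-per-line record layout are assumptions, not the thesis implementation.

    import java.io.IOException;
    import java.security.MessageDigest;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DictionaryCrackMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String targetHash;

        @Override
        protected void setup(Context ctx) {
            // the target MD5 digest is assumed to be passed in through the configuration
            targetHash = ctx.getConfiguration().get("crack.target.md5");
        }

        @Override
        protected void map(LongWritable key, Text word, Context ctx)
                throws IOException, InterruptedException {
            if (targetHash == null) return;
            try {
                MessageDigest md = MessageDigest.getInstance("MD5");
                byte[] digest = md.digest(word.toString().trim().getBytes("UTF-8"));
                StringBuilder hex = new StringBuilder();
                for (byte b : digest) {
                    hex.append(String.format("%02x", b));
                }
                if (hex.toString().equalsIgnoreCase(targetHash)) {
                    ctx.write(new Text(targetHash), word); // emit only when a candidate matches
                }
            } catch (Exception e) {
                throw new IOException(e);
            }
        }
    }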
APA, Harvard, Vancouver, ISO, and other styles
49

Dias, Henrique José Rosa. "Augmenting data warehousing architectures with hadoop." Master's thesis, 2018. http://hdl.handle.net/10362/28933.

Full text
Abstract:
Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management
As the volume of available data increases exponentially, traditional data warehouses struggle to transform this data into actionable knowledge. Data strategies that include the creation and maintenance of data warehouses have a lot to gain by incorporating technologies from the Big Data spectrum. Hadoop, as a transformation tool, can add a theoretically infinite dimension of data processing, feeding transformed information into traditional data warehouses that ultimately will retain their value as central components in organizations’ decision support systems. This study explores the potentialities of Hadoop as a data transformation tool in the setting of a traditional data warehouse environment. Hadoop’s execution model, which is oriented for distributed parallel processing, offers great capabilities when the amounts of data to be processed require the infrastructure to expand. Horizontal scalability, which is a key aspect in a Hadoop cluster, will allow for proportional growth in processing power as the volume of data increases. Through the use of Hive on Tez in a Hadoop cluster, this study transforms television viewing events, extracted from Ericsson’s Mediaroom Internet Protocol Television infrastructure, into pertinent audience metrics, like Rating, Reach and Share. These measurements are then made available in a traditional data warehouse, supported by a traditional Relational Database Management System, where they are presented through a set of reports. The main contribution of this research is a proposed augmented data warehouse architecture where the traditional ETL layer is replaced by a Hadoop cluster, running Hive on Tez, with the purpose of performing the heaviest transformations that convert raw data into actionable information. Through a typification of the SQL statements responsible for the data transformation processes, we were able to understand that Hadoop, and its distributed processing model, delivers outstanding performance results associated with the analytical layer, namely in the aggregation of large data sets. Ultimately, we demonstrate, empirically, the performance gains that can be extracted from Hadoop, in comparison to an RDBMS, regarding speed, storage usage and scalability potential, and suggest how this can be used to evolve data warehouses into the age of Big Data.
APA, Harvard, Vancouver, ISO, and other styles
50

Costa, Pedro Alexandre Reis Sá da Costa. "Hadoop MapReduce tolerante a faltas bizantinas." Master's thesis, 2011. http://hdl.handle.net/10451/8695.

Full text
Abstract:
Master's thesis in Informatics, presented to the Universidade de Lisboa through the Faculdade de Ciências, 2011
O MapReduce é frequentemente usado para executar tarefas críticas, tais como análise de dados científicos. No entanto, evidências na literatura mostram que as faltas ocorrem de forma arbitrária e podem corromper os dados. O Hadoop MapReduce está preparado para tolerar faltas acidentais, mas não tolera faltas arbitrárias ou Bizantinas. Neste trabalho apresenta-se um protótipo do Hadoop MapReduce Tolerante a Faltas Bizantinas (BFT). Uma avaliação experimental mostra que a execução de um trabalho com o algoritmo implementado usa o dobro dos recursos do Hadoop original, em vez de mais 3 ou 4 vezes, como seria alcançado com uma aplicação directa dos paradigmas comuns à tolerância a faltas Bizantinas. Acredita-se que este custo seja aceitável para aplicações críticas que requerem este nível de tolerância a faltas.
MapReduce is often used to run critical jobs such as scientific data analysis. However, evidence in the literature shows that arbitrary faults do occur and can probably corrupt the results of MapReduce jobs. MapReduce runtimes like Hadoop tolerate crash faults, but not arbitrary or Byzantine faults. In this work, a MapReduce algorithm and prototype that tolerate these faults are presented. An experimental evaluation shows that the execution of a job with the implemented algorithm uses twice the resources of the original Hadoop, instead of the 3 or 4 times more that would be required by a direct application of common Byzantine fault-tolerance paradigms. It is believed that this cost is acceptable for critical applications that require that level of fault tolerance.
APA, Harvard, Vancouver, ISO, and other styles

To the bibliography