Dissertations / Theses on the topic 'Hadoop'
Consult the top 50 dissertations / theses for your research on the topic 'Hadoop.'
Raja, Anitha. "A Coordination Framework for Deploying Hadoop MapReduce Jobs on Hadoop Cluster." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-196951.
Apache Hadoop is an open-source system that delivers reliable, scalable, distributed computing. Hadoop services support distributed data storage, processing, access and security. MapReduce is a key component of the Hadoop system, designed to process large data sets in several distributed stages. MapReduce is used extensively for processing structured and unstructured data in areas such as e-commerce, web search, social media and scientific computing. Understanding MapReduce workloads is important for obtaining better configurations and results, but MapReduce workloads in production environments have so far not been studied in depth. This thesis focuses on the Hadoop cluster (as an execution environment for data processing) and analyses two types of Hadoop MapReduce (MR) jobs through a proposed system, referred to as a workload translator. The results of this work include: (1) a parametric workload model for the targeted MR jobs, (2) a specification for developing improved cluster strategies using both the model and a coordination framework, and (3) improved scheduling and job performance, i.e. shorter job completion times. We implemented a prototype with Apache Tomcat on (OpenStack) Ubuntu Trusty Tahr that uses a RESTful API (1) to create a Hadoop cluster, version 2.7.2, and (2) to scale the number of workers in the cluster up and down. The results show that with well-tuned parameters, MR jobs achieve improvements, i.e. shorter completion times and better utilisation of hardware resources. The target audience of this thesis is developers. As future work, we suggest adding further parameters to develop a general model for MR and similar jobs.
Savvidis, Evangelos. "Searching Metadata in Hadoop." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177467.
Bux, Marc Nicolas. "Scientific Workflows for Hadoop." Doctoral thesis, Humboldt-Universität zu Berlin, 2018. http://dx.doi.org/10.18452/19321.
Full textScientific workflows provide a means to model, execute, and exchange the increasingly complex analysis pipelines necessary for today's data-driven science. Over the last decades, scientific workflow management systems have emerged to facilitate the design, execution, and monitoring of such workflows. At the same time, the amounts of data generated in various areas of science outpaced hardware advancements. Parallelization and distributed execution are generally proposed to deal with increasing amounts of data. However, the resources provided by distributed infrastructures are subject to heterogeneity, dynamic performance changes at runtime, and occasional failures. To leverage the scalability provided by these infrastructures despite the observed aspects of performance variability, workflow management systems have to progress: Parallelization potentials in scientific workflows have to be detected and exploited. Simulation frameworks, which are commonly employed for the evaluation of scheduling mechanisms, have to consider the instability encountered on the infrastructures they emulate. Adaptive scheduling mechanisms have to be employed to optimize resource utilization in the face of instability. State-of-the-art systems for scalable distributed resource management and storage, such as Apache Hadoop, have to be supported. This dissertation presents novel solutions for these aspirations. First, we introduce DynamicCloudSim, a cloud computing simulation framework that is able to adequately model the various aspects of variability encountered in computational clouds. Secondly, we outline ERA, an adaptive scheduling policy that optimizes workflow makespan by exploiting heterogeneity, replicating bottlenecks in workflow execution, and adapting to changes in the underlying infrastructure. Finally, we present Hi-WAY, an execution engine that integrates ERA and enables the highly scalable execution of scientific workflows written in a number of languages on Hadoop.
Wu, Yuanyuan. "HADOOP-EDF: LARGE-SCALE DISTRIBUTED PROCESSING OF ELECTROPHYSIOLOGICAL SIGNAL DATA IN HADOOP MAPREDUCE." UKnowledge, 2019. https://uknowledge.uky.edu/cs_etds/88.
Büchler, Peter. "Indexing Genomic Data on Hadoop." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177298.
Schätzle, Alexander [Verfasser], and Georg [Akademischer Betreuer] Lausen. "Distributed RDF Querying on Hadoop." Freiburg : Universität, 2017. http://d-nb.info/1128574187/34.
Tabatabaei, Mahsa. "Evaluation of Security in Hadoop." Thesis, KTH, Kommunikationsnät, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-160269.
Дикий, В. С. "Сутність та особливості використання Hadoop." Thesis, Київський національний університет технологій та дизайну, 2017. https://er.knutd.edu.ua/handle/123456789/10420.
Brotánek, Jan. "Apache Hadoop jako analytická platforma." Master's thesis, Vysoká škola ekonomická v Praze, 2017. http://www.nusl.cz/ntk/nusl-358801.
Nilsson, Johan. "Hadoop MapReduce in Eucalyptus Private Cloud." Thesis, Umeå universitet, Institutionen för datavetenskap, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-51309.
Johannsen, Fabian, and Mattias Hellsing. "Hadoop Read Performance During Datanode Crashes." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-130466.
ERGENEKON, EMRE BERGE, and PETTER ERIKSSON. "Big Data Archiving with Splunk and Hadoop." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-137374.
Splunk is a software product that handles large amounts of data every day. Since the data volume grows over time, there is a need to move old data out of the application so that it does not become sluggish. However, some of Splunk's customers have data retention policies that require data to be stored longer than Splunk can offer. This report investigates how large amounts of data can be stored. We present the problems involved in archiving data, the properties of the data to be archived, and the types of file systems suited to archiving. We create a flexible, reliable and scalable archiving solution by carefully studying data safety and reliability and by using Apache Hadoop to support several distributed file systems.
Cassales, Guilherme Weigert. "Escalonamento adaptativo para o Apache Hadoop." Universidade Federal de Santa Maria, 2016. http://repositorio.ufsm.br/handle/1/12025.
Full textMany alternatives have been employed in order to process all the data generated by current applications in a timely manner. One of these alternatives, the Apache Hadoop, combines parallel and distributed processing with the MapReduce paradigm in order to provide an environment that is able to process a huge data volume using a simple programming model. However, Apache Hadoop has been designed for dedicated and homogeneous clusters, a limitation that creates challenges for those who wish to use the framework in other circumstances. Often, acquiring a dedicated cluster can be impracticable due to the cost, and the acquisition of reposition parts can be a threat to the homogeneity of a cluster. In these cases, an option commonly used by the companies is the usage of idle computing resources in their network, however the original distribution of Hadoop would show serious performance issues in these conditions. Thus, this study was aimed to improve Hadoop’s capacity of adapting to pervasive and shared environments, where the availability of resources will undergo variations during the execution. Therefore, context-awareness techniques were used in order to collect information about the available capacity in each worker node and distributed communication techniques were used to update this information on scheduler. The joint usage of both techniques aimed at minimizing and/or eliminating the overload that would happen on shared nodes, resulting in an improvement of up to 50% on performance in a shared cluster, when compared to the original distribution, and indicated that a simple solution can positively impact the scheduling, increasing the variety of environments where the use of Hadoop is possible.
Diversas alternativas têm sido empregadas para o processamento, em tempo hábil, da grande quantidade de dados que é gerada pelas aplicações atuais. Uma destas alternativas, o Apache Hadoop, combina processamento paralelo e distribuído com o paradigma MapReduce para fornecer um ambiente capaz de processar um grande volume de informações através de um modelo de programação simplificada. No entanto, o Apache Hadoop foi projetado para utilização em clusters dedicados e homogêneos, uma limitação que gera desafios para aqueles que desejam utilizá-lo sob outras circunstâncias. Muitas vezes um cluster dedicado pode ser inviável pelo custo de aquisição e a homogeneidade pode ser ameaçada devido à dificuldade de adquirir peças de reposição. Em muitos desses casos, uma opção encontrada pelas empresas é a utilização dos recursos computacionais ociosos em sua rede, porém a distribuição original do Hadoop apresentaria sérios problemas de desempenho nestas condições. Sendo assim, este estudo propôs melhorar a capacidade do Hadoop em adaptar-se a ambientes, pervasivos e compartilhados, onde a disponibilidade de recursos sofrerá variações no decorrer da execução. Para tanto, utilizaram-se técnicas de sensibilidade ao contexto para coletar informações sobre a capacidade disponível nos nós trabalhadores e técnicas de comunicação distribuída para atualizar estas informações no escalonador. A utilização conjunta dessas técnicas teve como objetivo a minimização e/ou eliminação da sobrecarga que seria causada em nós com compartilhamento, resultando em uma melhora de até 50% no desempenho em um cluster compartilhado, quando comparado com a distribuição original, e indicou que uma solução simples pode impactar positivamente o escalonamento, aumentando a variedade de ambientes onde a utilização do Hadoop é possível.
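As a rough illustration of the kind of context information such a capacity-aware scheduler consumes (not code from the thesis), the sketch below samples a worker node's spare CPU and memory through the standard JMX beans; the class name and the reporting format are assumptions.

```java
import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

/** Illustrative only: samples the spare CPU and memory of a worker node,
 *  the kind of context information a capacity-aware Hadoop scheduler could use. */
public class NodeCapacityProbe {

    public static void main(String[] args) {
        OperatingSystemMXBean os = (OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();

        double cpuLoad = os.getSystemCpuLoad();          // 0.0 - 1.0, or -1 if unavailable
        long freeMem   = os.getFreePhysicalMemorySize(); // bytes
        int cores      = os.getAvailableProcessors();

        // A simple "available capacity" summary that each worker could publish
        // periodically (e.g. piggybacked on heartbeats) to the scheduler.
        System.out.printf("cores=%d spareCpu=%.2f freeMemMB=%d%n",
                cores,
                cpuLoad < 0 ? 0.0 : (1.0 - cpuLoad) * cores,
                freeMem / (1024 * 1024));
    }
}
```

In a shared, non-dedicated cluster like the one described above, the scheduler would weight task assignments by the capacity actually left over by other tenants on each node rather than assuming homogeneous, idle machines.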
Lorente, Leal Alberto. "KTHFS Orchestration : PaaS orchestration for Hadoop." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-128935.
Čecho, Jaroslav. "Optimalizace platformy pro distribuované výpočty Hadoop." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2012. http://www.nusl.cz/ntk/nusl-236464.
Gupta, Puja Makhanlal. "Characterization of Performance Anomalies in Hadoop." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1429878722.
Björk, Kim, and Jonatan Bodvill. "Data streaming in Hadoop : A STUDY OF REAL TIME DATA PIPELINE INTEGRATION BETWEEN HADOOP ENVIRONMENTS AND EXTERNAL SYSTEMS." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-186380.
Distributed systems are becoming increasingly common in the IT systems of both large and small companies. The reasons for this development are cost efficiency, fault tolerance and the technical, physical limitations of centralized systems. There are frameworks in this area that aim to create a standardized platform to facilitate the development and deployment of distributed services and applications. Apache Hadoop is one of these projects. Hadoop is a framework for distributed computation and distributed data storage. It supports many different modules with different purposes, for example for managing distributed databases, data security, data streaming and computation. In addition to offering much cheaper storage than centralized alternatives, Hadoop provides powerful means of handling very large amounts of data as it is streamed through, and stored on, the system. These methods are used for a wide range of purposes at IT companies that need fast and powerful data processing, and more and more companies are incorporating Hadoop into their IT processes. One of these companies is Unomaly, which offers generic, preventive anomaly detection. Its system works by aggregating large volumes of system logs from arbitrary IT systems, and the anomaly detection depends on large amounts of logs to build an accurate picture of the host system. Integration with Hadoop would let Unomaly consume very large amounts of log data as it streams through the host system's Hadoop architecture. In this bachelor's thesis, an integration layer between Hadoop and Unomaly's anomaly detection system has been developed, and studies have been carried out to identify the best solution for integrating anomaly detection systems with Hadoop. The work has resulted in an application prototype that provides real-time data transport between Hadoop and Unomaly's system, as well as a study that discusses the best approach for implementing an integration of this kind.
Lopes, Bezerra Aprigio Augusto. "Planificación de trabajos en clusters hadoop compartidos." Doctoral thesis, Universitat Autònoma de Barcelona, 2015. http://hdl.handle.net/10803/285573.
Full textIndustry and scientists have sought alternatives to process effectively the large volume of data generated in different areas of knowledge. MapReduce is presented as a viable alternative for the processing of data intensive application. Input files are broken into smaller blocks. So they are distributed and stored in the nodes where they will be processed. Hadoop clusters have been used to execute MapReduce applications. The Hadoop framework automatically performs the division and distribution of the input files, the division of a job into Map and Reduce tasks, the scheduling tasks among the nodes, the failures control of nodes; and manages the need for communication between nodes in the cluster. However, some MapReduce applications have a set of features that do not allow them to benefit fully from the default Hadoop job scheduling policies. Input files shared between multiple jobs and applications with large volumes of intermediate data are the characteristics of the applications we handle in our research. The objective of our work is to improve execution efficiency in two ways: On a macro level (job scheduler level), we group the jobs that share the same input files and process them in batch. Then we store shared input files and intermediate data on a RAMDISK during batch processing. On a micro level (task scheduler level) tasks of different jobs processed in the same batch that handle the same data blocks are grouped to be executed on the same node where the block was allocated.
Deolikar, Piyush P. "Lecture Video Search Engine Using Hadoop MapReduce." Thesis, California State University, Long Beach, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10638908.
Full textWith the advent of the Internet and ease of uploading video content over video libraries and social networking sites, the video data availability was increased very rapidly during this decade. Universities are uploading video tutorials in the online courses. Companies like Udemy, coursera, Lynda, etc. made video tutorials available over the Internet. We propose and implement a scalable solution, which helps to find relevant videos with respect to a query provided by the user. Our solution maintains an updated list of the available videos on the web and assigns a rank according to their relevance. The proposed solution consists of three main components that can mutually interact. The first component, called the crawler, continuously visits and locally stores the relevant information of all the webpages with videos available on the Internet. The crawler has several threads, concurrently parsing webpages. The second component obtains the inverted index of the web pages stored by the crawler. Given a query, the inverted index is used to obtain the videos that contain the words in the query. The third component computes the rank of the video. This rank is then used to display the results in the order of relevance. We implement a scalable solution in the Apache Hadoop Framework. Hadoop is a distributed operating system that provides a distributed file system able to handle large files as well as distributed computation among the participants.
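The inverted-index component described above maps naturally onto a Hadoop MapReduce job. The sketch below is a minimal, generic inverted-index job (not the thesis code): it assumes input lines of the form "docId<TAB>page text" and emits, for each token, the list of documents that contain it.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Minimal inverted-index job: input lines are "docId<TAB>page text". */
public class InvertedIndex {

    public static class TokenMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;
            Text docId = new Text(parts[0]);
            for (String token : parts[1].toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) ctx.write(new Text(token), docId);
            }
        }
    }

    public static class PostingsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docs, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder postings = new StringBuilder();
            for (Text d : docs) postings.append(d).append(' ');
            ctx.write(word, new Text(postings.toString().trim()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "inverted index");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(PostingsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A ranking component like the one described would then consume this index as a second job; that step is omitted here.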
Svedlund, Nordström Johan. "A Global Ecosystem for Datasets on Hadoop." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-205297.
RajuladeviKasi, UdayKiran. "Location-aware replication in virtual Hadoop environment." Thesis, Wichita State University, 2012. http://hdl.handle.net/10057/5609.
Thesis (M.S.)--Wichita State University, College of Engineering, Dept. of Electrical Engineering and Computer Science
Темирбекова, Ж. Е., and Ж. М. Меренбаев. "Параллельное масштабирование изображений в технологии mapreduce hadoop." Thesis, Сумский государственный университет, 2015. http://essuir.sumdu.edu.ua/handle/123456789/40775.
Capitão, Micael José Pedrosa. "Mediator framework for inserting data into hadoop." Master's thesis, Universidade de Aveiro, 2014. http://hdl.handle.net/10773/14697.
Full textData has always been one of the most valuable resources for organizations. With it we can extract information and, with enough information on a subject, we can build knowledge. However, it is first needed to store that data for later processing. On the last decades we have been assisting what was called “information explosion”. With the advent of the new technologies, the volume, velocity and variety of data has increased exponentially, becoming what is known today as big data. Telecommunications operators gather, using network monitoring equipment, millions of network event records, the Call Detail Records (CDRs) and the Event Detail Records (EDRs), commonly known as xDRs. These records are stored and later processed to compute network performance and quality of service metrics. With the ever increasing number of telecommunications subscribers, the volume of generated xDRs needing to be stored and processed has increased exponentially, making the current solutions based on relational databases not suited any more and so, they are facing a big data problem. To handle that problem, many contributions have been made on the last years that have resulted in solid and innovative solutions. Among them, Hadoop and its vast ecosystem stands out. Hadoop integrates new methods of storing and process high volumes of data in a robust and cost-effective way, using commodity hardware. This dissertation presents a platform that enables the current systems inserting data into relational databases, to keep doing it transparently when migrating those to Hadoop. The platform has to, like in the relational databases, give delivery guarantees, support unique constraints and, be fault tolerant. As proof of concept, the developed platform was integrated with a system specifically designed to the computation of performance and quality of service metrics from xDRs, the Altaia. The performance tests have shown the platform fulfils and exceeds the requirements for the insertion rate of records. During the tests the behaviour of the platform when trying to insert duplicated records and when in failure scenarios have also been evaluated. The results for both situations were as expected.
“Dados” sempre foram um dos mais valiosos recursos das organizações. Com eles pode-se extrair informação e, com informação suficiente, pode-se criar conhecimento. No entanto, é necessário primeiro conseguir guardar esses dados para posteriormente os processar. Nas últimas décadas tem-se assistido ao que foi apelidado de “explosão de informação”. Com o advento das novas tecnologias, o volume, velocidade e variedade dos dados tem crescido exponencialmente, tornando-se no que é hoje conhecido como big data. Os operadores de telecomunicações obtêm, através de equipamentos de monitorização da rede, milhões de registos relativos a eventos da rede, os Call Detail Records (CDRs) e os Event Detail Records (EDRs), conhecidos como xDRs. Esses registos são armazenados e depois processados para deles se produzirem métricas relativas ao desempenho da rede e à qualidade dos serviços prestados. Com o aumento dos utilizadores de telecomunicações, o volume de registos gerados que precisam de ser armazenados e processados cresceu exponencialmente, inviabilizando as soluções que assentam em bases de dados relacionais, estando-se agora perante um problema de big data. Para tratar esse problema, múltiplas contribuições foram feitas ao longo dos últimos anos que resultaram em soluções sólidas e inovadores. De entre elas, destaca-se o Hadoop e o seu vasto ecossistema. O Hadoop incorpora novos métodos de guardar e tratar elevados volumes de dados de forma robusta e rentável, usando hardware convencional. Esta dissertação apresenta uma plataforma que possibilita aos actuais sistemas que inserem dados em bases de dados relacionais, que o continuem a fazer de forma transparente quando essas migrarem para Hadoop. A plataforma tem de, tal como nas bases de dados relacionais, dar garantias de entrega, suportar restrições de chaves únicas e ser tolerante a falhas. Como prova de conceito, integrou-se a plataforma desenvolvida com um sistema especificamente desenhado para o cálculo de métricas de performance e de qualidade de serviço a partir de xDRs, o Altaia. Pelos testes de desempenho realizados, a plataforma cumpre e excede os requisitos relativos à taxa de inserção de registos. Durante os testes também se avaliou o seu comportamento perante tentativas de inserção de registos duplicados e perante situações de falha, tendo o resultado, para ambas as situações, sido o esperado.
Justice, Matthew Adam. "Optimizing MongoDB-Hadoop Performance with Record Grouping." Thesis, The University of Arizona, 2012. http://hdl.handle.net/10150/244396.
Lorenzetto, Luca <1988>. "Evaluating performance of Hadoop Distributed File System." Master's Degree Thesis, Università Ca' Foscari Venezia, 2014. http://hdl.handle.net/10579/4773.
Lu, Yue. "CloudNotes: Annotation Management in Cloud-Based Platforms." Digital WPI, 2014. https://digitalcommons.wpi.edu/etd-theses/273.
Shetty, Kartik. "Evaluating Clustering Techniques over Big Data in Distributed Infrastructures." Digital WPI, 2018. https://digitalcommons.wpi.edu/etd-theses/1226.
Kakantousis, Theofilos. "Scaling YARN: A Distributed Resource Manager for Hadoop." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177200.
Lindberg, Johan. "Big Data och Hadoop : Nästa generation av lagring." Thesis, Mittuniversitetet, Avdelningen för informationssystem och -teknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-31079.
The goal of this report and study is to investigate, at a theoretical level, the possibilities for Försäkringskassan IT to change the platform used for storing the data and information used in its daily work. Försäkringskassan collects enormous amounts of data every day, containing everything from personal records, program code and payments to customer service cases. Today all of this is stored in large relational databases, which leads to problems with scalability and performance. The new platform under investigation is built on a storage technology called Hadoop. Hadoop is designed to both store and process data distributed across so-called clusters of inexpensive server hardware. The platform promises near-linear scalability, the ability to store all data with high fault tolerance, and the capacity to handle enormous data volumes. The study is carried out through a literature review and a proof of concept. The literature part focuses on the background of Hadoop, its design and structure, and its future outlook. Försäkringskassan's current storage setup is specified and compared with the new platform. A proof of concept is carried out in a test environment at Försäkringskassan, where a Hadoop platform from Hortonworks is used to demonstrate how storage can work and that so-called unstructured data can be stored. The study reveals no theoretical obstacles to switching to the new platform. However, it identifies a need to move data handling from load time to read time: today's relational database solution requires well-structured data in order to store it, whereas Hadoop can store everything without any predefined structure, but requires more manual work when the data is retrieved and used.
Brito, José Benedito de Souza. "Modelo para estimar performance de um Cluster Hadoop." reponame:Repositório Institucional da UnB, 2014. http://repositorio.unb.br/handle/10482/17180.
O volume, a variedade e a velocidade dos dados apresenta um grande desa o para extrair informações úteis em tempo hábil, sem gerar grandes impactos nos demais processamentos existentes nas organizações, impulsionando a utilização de clusters para armazenamento e processamento, e a utilização de computação em nuvem. Este cenário é propício para o Hadoop, um framework open source escalável e e ciente, para a execução de cargas de trabalho sobre Big Data. Com o advento da computação em nuvem um cluster com o framework Hadoop pode ser alocado em minutos, todavia, garantir que o Hadoop tenha um desempenho satisfatório para realizar seus processamentos apresenta vários desa os, como as necessidades de ajustes das con gurações do Hadoop às cargas de trabalho, alocar um cluster apenas com os recursos necessários para realizar determinados processamentos e de nir os recursos necessários para realizar um processamento em um intervalo de tempo conhecido. Neste trabalho, foi proposta uma abordagem que busca otimizar o framework Hadoop para determinada carga de trabalho e estimar os recursos computacionais necessário para realizar um processamento em determinado intervalo de tempo. A abordagem proposta é baseada na coleta de informações, base de regras para ajustes de con gurações do Hadoop, de acordo com a carga de trabalho, e simulações. A simplicidade e leveza do modelo permite que a solução seja adotada como um facilitador para superar os desa os apresentados pelo Big Data, e facilitar a de nição inicial de um cluster para o Hadoop, mesmo por usuários com pouca experiência em TI. O modelo proposto trabalha com o MapReduce para de nir os principais parâmetros de con guração e determinar recursos computacionais dos hosts do cluster para atender aos requisitos desejados de tempo de execução para determinada carga de trabalho. _______________________________________________________________________________ ABSTRACT
The volume, variety and velocity of data present a great challenge for extracting useful information in a timely manner without impacting the other processes running in organizations, which promotes the use of clusters for storage and processing and the use of cloud computing. This is a good scenario for Hadoop, an open-source, scalable and efficient framework for running workloads on big data. With the advent of cloud computing, a cluster with the Hadoop framework can be allocated in minutes; however, ensuring that Hadoop performs well presents several challenges, such as the need to tweak Hadoop's settings for the workload at hand, to allocate a cluster with just the resources needed for certain processing, and to define the resources required to complete a processing run within a known time interval. In this work, an approach is proposed that seeks to optimize Hadoop for a given workload and to estimate the computational resources required to complete a processing run in a given time interval. The approach is based on collecting information, on a rule base for adjusting Hadoop settings to a given workload, and on simulations. The simplicity and lightness of the model allow the solution to be adopted as a facilitator for overcoming the challenges presented by big data and make Hadoop easier to use, even for users with little IT experience. The proposed model works with MapReduce to define the main configuration parameters and to determine the computational resources of the cluster nodes needed to meet the desired runtime requirements for a given workload.
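To make concrete what "main configuration parameters" usually refers to, the sketch below sets a few real per-job MapReduce knobs programmatically; the specific values are placeholders, not recommendations from the model described above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

/** Example of tuning a few per-job MapReduce knobs programmatically.
 *  The values are arbitrary; a model such as the one described above
 *  would choose them per workload. */
public class TunedJobFactory {
    public static Job newTunedJob(String name) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 256);            // map-side sort buffer
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.85f); // spill threshold
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        conf.setInt("mapreduce.map.memory.mb", 2048);             // container sizes
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        Job job = Job.getInstance(conf, name);
        job.setNumReduceTasks(8);                                 // reduce parallelism
        return job;
    }
}
```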
Бабич, А. С., and Елена Петровна Черных. "Использование Apache Hadoop для обработки больших наборов данных." Thesis, Національний технічний університет "Харківський політехнічний інститут", 2015. http://repository.kpi.kharkov.ua/handle/KhPI-Press/45546.
Hou, Jun. "Using Hadoop to Cluster Data in Energy System." University of Dayton / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=dayton1430092547.
Sodhi, Bir Apaar Singh. "DATA MINING: TRACKING SUSPICIOUS LOGGING ACTIVITY USING HADOOP." CSUSB ScholarWorks, 2016. https://scholarworks.lib.csusb.edu/etd/271.
Palummo, Alexandra Lina. "Supporto SQL al sistema Hadoop per big data analytics." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2016.
Fischer, e. Silva Renan. "E-EON : Energy-Efficient and Optimized Networks for Hadoop." Doctoral thesis, Universitat Politècnica de Catalunya, 2018. http://hdl.handle.net/10803/586061.
Energy efficiency and performance improvements have been two of the main concerns of today's data centers. With the arrival of Big Data, more information is generated every year; even the most aggressive predictions of the largest network equipment manufacturer have been exceeded by the continuous network traffic generated by Big Data systems. Hadoop, one of the most famous and widely discussed frameworks developed to store, retrieve and process the information constantly generated by users and machines, has captured the attention of the industry in recent years, and today its name describes a whole ecosystem designed to address the most varied requirements of current cloud computing applications. This thesis takes an in-depth look at Hadoop clusters, focusing mainly on their interconnects, which are commonly considered the bottleneck of this ecosystem. We carried out research centered on energy efficiency as well as on performance optimizations such as improvements in infrastructure throughput and network latency. Regarding energy consumption, a significant portion of a data center's power is drawn by the network, representing 12% of the total system power at full load. With constantly growing network traffic, industry and academia are seeking energy consumption that is proportional to usage. Regarding cluster performance, although Hadoop workloads are sensitive to network throughput but have less strict latency requirements, there is growing interest in running interactive and batch applications simultaneously on the same infrastructure, maximizing system utilization to obtain the greatest return on capital and operating expenses. For this to happen, system throughput cannot be degraded when network latency is minimized. The two biggest challenges faced during this thesis were achieving energy consumption proportional to interconnect usage and improving the network latency found in Hadoop clusters while keeping the loss of infrastructure throughput close to zero. These challenges led to an opportunity of similar size: proposing novel techniques that solve these problems for the current generation of Hadoop clusters. We call the set of techniques presented in this work E-EON (Energy Efficient and Optimized Networks). E-EON can be used to reduce energy consumption and network latency while cluster performance is improved. Moreover, these techniques are not exclusive to Hadoop and are expected to bring similar benefits when applied to any other Big Data infrastructure that fits the problem characterization presented throughout this thesis. With E-EON we were able to reduce energy consumption by up to 80% compared to techniques found in the current literature. We were also able to reduce network latency by up to 85% and, in some cases, even improve cluster performance by 10%. Although these were the two main achievements of this thesis, we also present smaller benefits that translate into simpler configuration compared to state-of-the-art techniques.
Finally, we enrich the discussions in this thesis with recommendations aimed at network administrators and network equipment manufacturers.
Benslimane, Ziad. "Optimizing Hadoop Parameters Based on the Application Resource Consumption." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-200144.
Donepudi, Harinivesh. "An Apache Hadoop Framework for Large-Scale Peptide Identification." TopSCHOLAR®, 2015. http://digitalcommons.wku.edu/theses/1527.
Vijayakumar, Sruthi. "Hadoop Based Data Intensive Computation on IAAS Cloud Platforms." UNF Digital Commons, 2015. http://digitalcommons.unf.edu/etd/567.
SHELLY. "IRIS RECOGNITION ON HADOOP." Thesis, 2011. http://dspace.dtu.ac.in:8080/jspui/handle/repository/13885.
Full textIris Recognition is a type of pattern recognition which recognizes a user by determining the physical structure of an individual's Iris. A unique Iris pattern is extracted from a digitized image of the eye, and encoded into an iris template. Iris template contains unique information of an individual and is stored in a database. To identify an individual by iris recognition system, an individual’s eye image is captured using video camera and converted into iris template. These templates are compared with stored iris templates in database. If templates are matches then user is said to be genuine, otherwise imposter. Iris Recognition offers advantages over traditional recognition methods (ID cards, PIN numbers) because the person to be identified has no need to remember or carry any information. Iris pattern remains stable throughout life of a person. This characteristic makes it very attractive for use as a biometric for identifying individual. Iris recognition is deployed for verification and/or identification in applications such as access control, border management, and Identification systems. With increasing security concerns, biometric database size is growing very fast and technologies like iris recognition has a very huge database for comparison. Iris recognition algorithms are implemented on general purpose sequential processing systems, and also existing relational database systems are not enough to handle this huge size of data in some reasonable time. In this thesis, a parallel processing alternative using cloud computing, offering an opportunity to increase speed and use it on huge database is proposed. An open source Hadoop framework for cloud computing is used to implement the proposed system. Hadoop Distributed File System (HDFS) is used to handle large data sets, by breaking it into blocks and replicating blocks on various machines in cloud. Template comparison is done independently on different blocks of data by various machines in parallel. Map/Reduce programming model is used for processing large data sets. Map/Reduce process the data in
Yang, Ming-hsien, and 楊明憲. "High-Performance Heterogeneous Hadoop Architecture." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/33987755238266822134.
國立臺灣科技大學
自動化及控制研究所
101
Native Hadoop is a two-layered structure composed of one master and many slaves. A slave can be seen as the combination of a DataNode and a TaskTracker, while the master is in charge of managing the slave nodes. Because users can add slave nodes to increase the efficiency of parallel processing of massive data, Hadoop has come to be viewed as a key technology for massive data processing. However, a Hadoop cluster composed of a large number of slaves occupies more physical space and consumes more energy. This study therefore combines ARM and x86 machines into a new heterogeneous, three-layered Hadoop structure that exploits ARM's characteristics: energy efficiency, good performance on massive data processing, and a small physical footprint. In addition, a "Dynamically Managing Block Algorithm" is introduced into the task scheduler. The design not only addresses shortcomings of native Hadoop but also reduces Map/Reduce execution time by more than 22%.
Hsin-YingLee and 李信穎. "A Transparent Hadoop Data Service." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/y2s2g2.
國立成功大學
資訊工程學系
106
Hadoop is an open-source framework for distributed processing and storage of big data. Its storage layer is the Hadoop Distributed File System (HDFS). HDFS is designed for storing very large files with streaming data access patterns, but it cannot handle large numbers of small files effectively. Although there are many ways to work around the small-file problem, they all require considerable extra processing from users. HBase, a distributed database often paired with Hadoop, provides efficient random access and can be used to solve the small-file problem, but both systems take time to learn and are written in Java, which is a high barrier for data analysts. This paper proposes a transparent distributed storage system that solves the small-file problem on HDFS and supports the HDFS interface, so that it remains compatible with the Hadoop ecosystem, such as Hive and Spark. Users can access their data directly without changing any code, and the system also provides a simple Web API that hides the platform's complex operations and lets users migrate data into it from other file servers.
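A minimal sketch of the HBase side of such a design is shown below, assuming small records are stored as cells keyed by their logical path instead of as individual HDFS files; the table and column names are invented for illustration, and this is not the system built in the thesis.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

/** Stores small files as HBase cells keyed by their logical path,
 *  instead of creating one tiny HDFS file each. Names are illustrative. */
public class SmallFileStore {
    private static final byte[] CF = Bytes.toBytes("f");
    private static final byte[] COL = Bytes.toBytes("data");

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("small_files"))) {

            // Write: one row per small file.
            Put put = new Put(Bytes.toBytes("/logs/2018/03/01/sensor-42.json"));
            put.addColumn(CF, COL, Bytes.toBytes("{\"temp\": 21.5}"));
            table.put(put);

            // Read it back by path.
            Result r = table.get(new Get(Bytes.toBytes("/logs/2018/03/01/sensor-42.json")));
            System.out.println(Bytes.toString(r.getValue(CF, COL)));
        }
    }
}
```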
Yadav, Rakesh. "Genetic Algorithms Using Hadoop MapReduce." Thesis, 2015. http://ethesis.nitrkl.ac.in/7790/1/2015_Genetic_Yadav.pdf.
Lee, Yao-Cheng, and 李曜誠. "Exploiting Parallelism in Heterogeneous Hadoop System." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/69673317513666799922.
國立臺灣大學
資訊網路與多媒體研究所
102
With the rise of big data, Apache Hadoop has been attracting increasing attention. There are two primary components at the core of Apache Hadoop: the Hadoop Distributed File System (HDFS) and the MapReduce framework. MapReduce is a programming model for processing large datasets with a parallel, distributed algorithm on a cluster. However, the MapReduce framework is not efficient enough: although mappers run in parallel, the implementation of Hadoop MapReduce does not fully exploit parallelism, because each individual mapper processes its input serially rather than in parallel. To address this, the thesis proposes a new Hadoop framework that fully exploits parallelism through parallel processing inside tasks. For better performance, we use GPGPU computing power to accelerate the program. In addition, to use both the CPU and the GPU to reduce overall execution time, we propose a scheduling policy that dynamically dispatches computation to the appropriate device. Our experimental results show that the system achieves a speedup of 1.45X over Hadoop on the benchmarks.
Wang, Kuang-Hsin, and 王廣新. "Cloud-based surveillance system under Hadoop." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/59297974316482725855.
國立交通大學
資訊學院資訊學程
100
Information technology has been widely used to solve problems in many domains. Cameras are installed at street corners to maintain public order and monitor traffic, and the recorded video is stored in traditional surveillance systems or NVRs in public departments for later inquiry. However, a traditional surveillance system cannot process and store the large amount of data generated by a large number of cameras, and NVRs must be added manually, so scaling up a traditional system increases cost and management complexity. In addition, a traditional surveillance system cannot handle interruptions such as blackouts or system crashes while streaming data is being written to storage nodes. The purpose of this thesis is therefore to propose a cloud-based surveillance system that solves these problems. The proposed system integrates Hadoop to store large amounts of streaming data and provides a backup mechanism to handle interruptions. By integrating the Hadoop Distributed File System (HDFS), the approach can easily scale the system to process the correspondingly large amount of data generated by the cameras, and the Hadoop cluster is combined with a novel backup mechanism for handling interruptions. The evaluation shows that the proposed system easily handles a growing number of video streams and loses only a few frames while handling an interruption, suggesting that replacing NVRs with the proposed Hadoop-based cloud surveillance system and backup mechanism is practicable.
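As a hedged illustration of the storage step only (not the thesis' implementation), the sketch below writes one camera's incoming byte stream to a per-camera HDFS file using the standard FileSystem API; the paths, replication factor and block size are arbitrary examples.

```java
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/** Illustrative sketch: writes one camera's stream to a new HDFS file. */
public class HdfsStreamWriter {
    public static void store(InputStream cameraStream, String cameraId) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path("/surveillance/" + cameraId + "/"
                + System.currentTimeMillis() + ".ts");
        // Example settings: overwrite, 4 KB buffer, 3 replicas, 128 MB blocks.
        try (FSDataOutputStream out = fs.create(target, true, 4096,
                (short) 3, 128 * 1024 * 1024L)) {
            IOUtils.copyBytes(cameraStream, out, 4096, false);
        }
    }
}
```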
Cheng, Ya-Wen, and 鄭雅文. "Improving Fair Scheduling Performance on Hadoop." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/16860934128679350588.
國立東華大學
資訊工程學系
103
Cloud computing and big data are both prominent topics worldwide. Cloud computing not only provides a storage platform through which big data can be accessed, but also offers practical techniques for processing truly large amounts of data. This thesis therefore studies the open-source Hadoop platform, focusing on improving Hadoop performance through fair scheduling. Our goal is to consult a number of real-time parameters and use them to decide which job should receive system resources first, and to adjust the related parameters, such as job priority and delay time, dynamically, in order to speed up job execution and improve system performance. The thesis presents several mechanisms: job classification, pool resource assignment, job sorting based on FIFO, job sorting based on fairness, dynamic delay-time adjustment and dynamic job-priority adjustment. These strategies consult the real system status and let it influence scheduling decisions. The experiments show that the proposed mechanisms improve fair-scheduling performance and clearly outperform the original Hadoop fair scheduler.
Lin, Jiun-Yu, and 林君豫. "A Hadoop-Based Endpoint Protection Mechanism." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/10751644664014425491.
國立交通大學
管理學院資訊管理學程
103
The number of end-users in enterprises and organizations is clearly much larger than it was twenty years ago, and most employees can now use their own devices to connect to the corporate intranet, which makes endpoint security and management much harder for IT departments to control. Adding more security management software can increase the hidden costs borne by companies and organizations. Anti-virus vendors have responded by improving their products and packaging advanced threat-prevention technologies into a single agent, giving organizations broader endpoint defense capabilities and more comprehensive protection. However, most of these products focus only on functionality, monitoring services, exception handling, and virus classification and statistics, and thus fail to offer an effective way to manage an enterprise's or organization's internal information security. This study integrates the cloud analysis tool Hadoop into the Symantec Endpoint Protection Manager (SEPM) architecture to analyze security events, notify managers and generate a prioritized checking list, and then assesses the effectiveness before and after the change. The results show a significant benefit: using Hadoop for cloud-based analysis effectively strengthens the handling of information security events and reduces potential threats, and can serve as a reference for other companies or organizations that want to raise their overall information security level.
Zhuo, Ye-Qi, and 卓也琦. "Optimization of Hadoop System Configuration Parameters." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/29088821473196350661.
國立臺灣大學
資訊工程學研究所
103
Hadoop, a software framework for distributed processing of large-scale data sets on a cluster of machines using the MapReduce programming model, has become very popular in recent years. However, Hadoop users still face two essential challenges in managing the system: (1) tuning the parameters appropriately, and (2) dealing with the dozens of configuration parameters that affect its performance. This thesis focuses on optimizing Hadoop MapReduce job performance. Our approach has two key models: prediction and optimization. The prediction model estimates the execution time of a MapReduce job, and the optimization model searches for approximately optimal configuration parameters by invoking the prediction model repeatedly, using an analytical method to choose settings that improve users' job performance. Besides configuration parameter tuning, the relevance of each parameter and the evaluation of our methods are also discussed. The work provides users with a better way to improve Hadoop performance and save hardware resources.
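The optimization loop described above can be pictured as a search over candidate settings driven by the prediction model. The sketch below is a simplified, assumed version of that idea: it grid-searches two real Hadoop parameters and keeps the combination with the lowest predicted runtime, with the predictor left as a stub to be supplied by whatever model is available.

```java
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Brute-force search over a small candidate grid of Hadoop settings, keeping the
 *  combination with the lowest predicted job runtime. The predictor is a stub. */
public class ConfigSearch {

    public static Map<String, Integer> best(ToDoubleFunction<Map<String, Integer>> predictedRuntime) {
        int[] sortMb = {100, 256, 512};      // candidate mapreduce.task.io.sort.mb values
        int[] reducers = {4, 8, 16, 32};     // candidate mapreduce.job.reduces values
        Map<String, Integer> best = null;
        double bestTime = Double.MAX_VALUE;
        for (int mb : sortMb) {
            for (int r : reducers) {
                Map<String, Integer> candidate =
                        Map.of("mapreduce.task.io.sort.mb", mb, "mapreduce.job.reduces", r);
                double t = predictedRuntime.applyAsDouble(candidate);
                if (t < bestTime) { bestTime = t; best = candidate; }
            }
        }
        return best;  // settings to apply to the real job configuration
    }
}
```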
CHIANG, YU-PING, and 江宇平. "A Hadoop-based Password Cracking System." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/89505687663130189313.
銘傳大學
資訊傳播工程學系碩士班
104
With the fast development of computing, the amount of stored data keeps increasing, and one of the most important issues is how to analyze big data sets in a short time. The Hadoop software library allows distributed processing of large data sets across clusters of computers, and the MapReduce programming paradigm lends itself well to such data-intensive analytics jobs, given its ability to scale out and use several machines in parallel. Traditional password-cracking systems usually use only a single process to crack passwords. The common algorithms are brute-force cracking and dictionary attacks. Brute force, in which a computer tries every possible candidate until it succeeds, is the lowest common denominator of password cracking; its advantage is that it only needs CPU time and no storage, while its disadvantage is that cracking a single password may take days. A dictionary attack instead tries all the strings in a pre-arranged list; it can crack passwords faster than brute force, but may require large storage for the wordlists. HDFS can distribute wordlists across the slave nodes, and MapReduce can process them in parallel to reduce cracking time. We design a Hadoop-based password-cracking system in which MD5 hashes are attacked with both dictionary and brute-force methods. The contribution of this research is partitioning the brute-force computation into many parts and distributing 10 GB of wordlists on HDFS, which improves cracking performance.
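A minimal sketch of the dictionary-attack part, assuming a single target MD5 digest passed through the job configuration, is shown below; it is a generic Hadoop mapper written for illustration, not the system built in the thesis, and the configuration key name is made up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Each input line is one candidate password from the distributed wordlist.
 *  The target MD5 digest is passed in the job configuration (hex-encoded). */
public class DictionaryCrackMapper extends Mapper<Object, Text, Text, Text> {
    private String targetHex;

    @Override
    protected void setup(Context ctx) {
        targetHex = ctx.getConfiguration().get("crack.target.md5", "").toLowerCase();
    }

    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String candidate = value.toString().trim();
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(candidate.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            if (hex.toString().equals(targetHex)) {
                ctx.write(new Text(targetHex), new Text(candidate)); // match found
            }
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }
}
```

Because each mapper works on its own HDFS block of the wordlist, the comparison work is spread over the cluster in exactly the way the abstract describes.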
Dias, Henrique José Rosa. "Augmenting data warehousing architectures with hadoop." Master's thesis, 2018. http://hdl.handle.net/10362/28933.
Full textAs the volume of available data increases exponentially, traditional data warehouses struggle to transform this data into actionable knowledge. Data strategies that include the creation and maintenance of data warehouses have a lot to gain by incorporating technologies from the Big Data’s spectrum. Hadoop, as a transformation tool, can add a theoretical infinite dimension of data processing, feeding transformed information into traditional data warehouses that ultimately will retain their value as central components in organizations’ decision support systems. This study explores the potentialities of Hadoop as a data transformation tool in the setting of a traditional data warehouse environment. Hadoop’s execution model, which is oriented for distributed parallel processing, offers great capabilities when the amounts of data to be processed require the infrastructure to expand. Horizontal scalability, which is a key aspect in a Hadoop cluster, will allow for proportional growth in processing power as the volume of data increases. Through the use of a Hive on Tez, in a Hadoop cluster, this study transforms television viewing events, extracted from Ericsson’s Mediaroom Internet Protocol Television infrastructure, into pertinent audience metrics, like Rating, Reach and Share. These measurements are then made available in a traditional data warehouse, supported by a traditional Relational Database Management System, where they are presented through a set of reports. The main contribution of this research is a proposed augmented data warehouse architecture where the traditional ETL layer is replaced by a Hadoop cluster, running Hive on Tez, with the purpose of performing the heaviest transformations that convert raw data into actionable information. Through a typification of the SQL statements, responsible for the data transformation processes, we were able to understand that Hadoop, and its distributed processing model, delivers outstanding performance results associated with the analytical layer, namely in the aggregation of large data sets. Ultimately, we demonstrate, empirically, the performance gains that can be extracted from Hadoop, in comparison to an RDBMS, regarding speed, storage usage and scalability potential, and suggest how this can be used to evolve data warehouses into the age of Big Data.
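As an illustration of how such a transformation can be driven from the traditional environment, the sketch below submits an aggregation (a simplified reach-style count) to HiveServer2 over JDBC with Tez as the execution engine; the host, table and columns are invented placeholders, not the schema used in the thesis.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Submits an aggregation to HiveServer2 over JDBC, forcing the Tez engine.
 *  Host, table and columns are placeholders for illustration. */
public class HiveReachQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-server:10000/default", "etl", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("SET hive.execution.engine=tez");
            // Hypothetical viewing-events table: one row per (household, channel, day).
            ResultSet rs = stmt.executeQuery(
                "SELECT channel, COUNT(DISTINCT household_id) AS reach " +
                "FROM viewing_events WHERE view_date = '2017-01-01' " +
                "GROUP BY channel");
            while (rs.next()) {
                System.out.println(rs.getString("channel") + " -> " + rs.getLong("reach"));
            }
        }
    }
}
```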
Costa, Pedro Alexandre Reis Sá da Costa. "Hadoop MapReduce tolerante a faltas bizantinas." Master's thesis, 2011. http://hdl.handle.net/10451/8695.
Full textO MapReduce é frequentemente usado para executar tarefas críticas, tais como análise de dados científicos. No entanto, evidências na literatura mostram que as faltas ocorrem de forma arbitrária e podem corromper os dados. O Hadoop MapReduce está preparado para tolerar faltas acidentais, mas não tolera faltas arbitrárias ou Bizantinas. Neste trabalho apresenta-se um protótipo do Hadoop MapReduce Tolerante a Faltas Bizantinas(BFT). Uma avaliaçãao experimental mostra que a execução de um trabalho com o algoritmo implementado usa o dobro dos recursos do Hadoop original, em vez de mais 3 ou 4 vezes, como seria alcançado com uma aplicação directa dos paradigmas comuns a tolerância a faltas Bizantinas. Acredita-se que este custo seja aceitável para aplicações críticas que requerem este nível de tolerância a faltas.
MapReduce is often used to run critical jobs such as scientific data analysis. However, evidence in the literature shows that arbitrary faults do occur and can corrupt the results of MapReduce jobs. MapReduce runtimes like Hadoop tolerate crash faults, but not arbitrary or Byzantine faults. This work presents a MapReduce algorithm and prototype that tolerate these faults. An experimental evaluation shows that executing a job with the implemented algorithm uses twice the resources of the original Hadoop, instead of the 3 or 4 times more that would result from a direct application of common Byzantine fault-tolerance paradigms. This cost is believed to be acceptable for critical applications that require that level of fault tolerance.