Dissertations / Theses on the topic 'Hadoop Distributed File System'
Consult the top 50 dissertations / theses for your research on the topic 'Hadoop Distributed File System.'
Lorenzetto, Luca <1988>. "Evaluating performance of Hadoop Distributed File System." Master's Degree Thesis, Università Ca' Foscari Venezia, 2014. http://hdl.handle.net/10579/4773.
Polato, Ivanilton. "Energy savings and performance improvements with SSDs in the Hadoop Distributed File System." Universidade de São Paulo, 2016. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-31102016-155908/.
Over the last decade, energy issues have drawn strong attention from society, reaching the IT infrastructures used for data processing. These infrastructures must now adjust to this responsibility, adapting existing platforms to achieve acceptable performance while reducing energy consumption. Considered a standard for Big Data processing, Apache Hadoop has evolved significantly over recent years, with more than 60 releases. By implementing the MapReduce programming paradigm together with HDFS, its distributed file system, Hadoop has become a fault-tolerant, reliable middleware for parallel and distributed computing over large data sets. Nevertheless, Hadoop can lose performance under certain workloads, resulting in high energy consumption. Increasingly, users demand that sustainability and controlled energy consumption be an intrinsic part of high-performance computing solutions. In this thesis we present HDFSH, a hybrid storage system for HDFS that uses a combination of hard disks and solid-state disks to achieve higher performance while saving energy in Hadoop applications. HDFSH brings to the middleware the best of HDDs (affordable cost per GB and large storage capacity) and of SSDs (high performance and low energy consumption) in a configurable fashion, using dedicated storage zones for each storage device type. We implemented our mechanism as a block placement policy for HDFS and evaluated it on six recent Hadoop releases with different software architectures. The results indicate that our approach increases overall application performance while reducing energy consumption under most of the hybrid configurations evaluated. The results also show that, in many cases, storing only part of the data on SSDs yields significant energy savings and faster execution.
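The zone-based placement idea above is easy to picture in code. Below is a minimal, hypothetical sketch of such a decision rule, not the HDFSH implementation itself; the StorageZone type, the access-rate cutoff, and the 90% headroom figure are all invented for illustration.

```java
// Minimal sketch of a zone-aware block placement choice in the spirit of
// HDFSH: hot blocks go to an SSD zone, the rest to an HDD zone.
// StorageZone and the selection rule are hypothetical, not the thesis code.
public class HybridPlacementSketch {
    enum StorageZone { SSD, HDD }

    // Pick a zone for a new block given how "hot" its file is expected to be
    // and how full the SSD zone already is.
    static StorageZone chooseZone(double expectedAccessRate, double ssdUsageRatio) {
        boolean ssdHasRoom = ssdUsageRatio < 0.9;  // keep headroom on SSDs (assumed policy)
        boolean isHot = expectedAccessRate > 10.0; // accesses/hour, arbitrary cutoff
        return (isHot && ssdHasRoom) ? StorageZone.SSD : StorageZone.HDD;
    }

    public static void main(String[] args) {
        System.out.println(chooseZone(25.0, 0.4));  // SSD: hot file, SSD zone has room
        System.out.println(chooseZone(25.0, 0.95)); // HDD: SSD zone nearly full
        System.out.println(chooseZone(1.0, 0.4));   // HDD: cold file
    }
}
```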
Musatoiu, Mihai. "An approach to choosing the right distributed file system : Microsoft DFS vs. Hadoop DFS." Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-844.
Bhat, Adithya. "RDMA-based Plugin Design and Profiler for Apache and Enterprise Hadoop Distributed File system." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1440188090.
Cheng, Lu. "Concentric layout, a new scientific data layout for matrix data set in Hadoop file system." Master's thesis, University of Central Florida, 2010. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/4545.
Full textID: 029051151; System requirements: World Wide Web browser and PDF reader.; Mode of access: World Wide Web.; Thesis (M.S.)--University of Central Florida, 2010.; Includes bibliographical references (p. 56-58).
M.S.
Masters
Department of Electrical Engineering and Computer Science
Engineering
Sodhi, Bir Apaar Singh. "DATA MINING: TRACKING SUSPICIOUS LOGGING ACTIVITY USING HADOOP." CSUSB ScholarWorks, 2016. https://scholarworks.lib.csusb.edu/etd/271.
Johannsen, Fabian, and Mattias Hellsing. "Hadoop Read Performance During Datanode Crashes." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-130466.
Full textCareres, Gutierrez Franco Jesus. "Towards an S3-based, DataNode-lessimplementation of HDFS." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-291125.
Full textRelevansen av databehandling och analys idag kan inte överdrivas. Konvergensen av flera tekniska framsteg har främjat spridningen av system och infrastruk-tur som tillsammans stöder generering, överföring och lagring av nästan 15,000 exabyte digitala, analyserbara data. Hadoop Distributed File System (HDFS) är ett öppen källkodssystem som är utformat för att utnyttja lagringskapaciteten hos tusentals servrar och är filsystemkomponenten i ett helt ekosystem av verktyg för att omvandla och analysera massiva datamängder. HDFS används av organisationer i alla storlekar, men mindre är inte lika lämpade för att organiskt växa sina kluster för att tillgodose deras ständigt växande datamängder och behandlingsbehov. Detta beror på att större kluster är samtidigt med högre investeringar i servrar, större misslyckanden att återhämta sig från och behovet av att avsätta mer resurser i underhålls- och administrationsuppgifter. Detta utgör en potentiell begränsning på vägen för organisationer, och det kan till och med avskräcka en del från att våga sig helt in i datavärlden. Denna avhandling behandlar denna fråga genom att presentera en ny implementering av HopsFS, en redan förbättrad version av HDFS, som inte kräver några användarhanterade dataservrar. Istället förlitar sig det på S3, en ledande objektlagringstjänst, för alla dess användardata lagringsbehov. Vi jämförde prestandan för både S3-baserade och vanliga kluster och fann att sådan arkitektur inte bara är möjlig, utan också helt livskraftig när det gäller läs- och skrivgenomströmningar, i vissa fall till och med bättre än dess ursprungliga motsvarighet. Dessutom ger vår lösning förstklassig elasticitet, tillförlitlighet och tillgänglighet, samtidigt som den är anmärkningsvärt billigare.
Caceres, Gutierrez Franco Jesus. "Towards an S3-based, DataNode-less implementation of HDFS." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-291125.
The relevance of data processing and analytics today cannot be overstated. The convergence of several technological advances has fostered the proliferation of systems and infrastructure that together support the generation, transfer, and storage of nearly 15,000 exabytes of digital, analyzable data. The Hadoop Distributed File System (HDFS) is an open source system designed to leverage the storage capacity of thousands of servers, and it is the file system component of an entire ecosystem of tools for transforming and analyzing massive data sets. HDFS is used by organizations of all sizes, but smaller ones are less well positioned to grow their clusters organically to meet their ever-growing data and processing needs. This is because larger clusters entail higher investments in servers, larger failures to recover from, and more resources devoted to maintenance and administration. This poses a potential limitation on organizations' path forward, and it may even deter some from venturing fully into the data world. This thesis addresses the issue by presenting a new implementation of HopsFS, an already improved version of HDFS, that requires no user-managed data servers; instead, it relies on S3, a leading object storage service, for all of its user-data storage needs. We compared the performance of S3-based and regular clusters and found that such an architecture is not only possible but entirely viable in terms of read and write throughput, in some cases even outperforming its original counterpart. Moreover, our solution provides first-class elasticity, reliability, and availability while being remarkably cheaper.
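In a DataNode-less design like the one described above, block I/O ultimately reduces to object put/get calls. A minimal sketch with the AWS SDK for Java v2 follows; the bucket name and the blocks/&lt;id&gt; key layout are assumptions for illustration, not the thesis's actual scheme.

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// Sketch: store and fetch a file-system block as an S3 object.
// Bucket and key layout ("blocks/<blockId>") are illustrative assumptions.
public class S3BlockStoreSketch {
    private final S3Client s3 = S3Client.create();
    private final String bucket = "example-hopsfs-blocks"; // hypothetical bucket

    void writeBlock(long blockId, byte[] data) {
        s3.putObject(PutObjectRequest.builder()
                        .bucket(bucket).key("blocks/" + blockId).build(),
                RequestBody.fromBytes(data));
    }

    byte[] readBlock(long blockId) {
        return s3.getObjectAsBytes(GetObjectRequest.builder()
                .bucket(bucket).key("blocks/" + blockId).build()).asByteArray();
    }
}
```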
Benkő, Krisztián. "Zpracování velkých dat z rozsáhlých IoT sítí." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2019. http://www.nusl.cz/ntk/nusl-403820.
Yeager, Philip S. "A distributed file system for distributed conferencing system." [Gainesville, Fla.] : University of Florida, 2003. http://purl.fcla.edu/fcla/etd/UFE0001123.
Full textJayaraman, Prashant. "A distributed file system (DFS)." [Gainesville, Fla.] : University of Florida, 2006. http://purl.fcla.edu/fcla/etd/UFE0014040.
Full textWasif, Malik. "A Distributed Namespace for a Distributed File System." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-101482.
Full textHuchton, Scott. "Secure mobile distributed file system (MDFS)." Thesis, Monterey, California. Naval Postgraduate School, 2011. http://hdl.handle.net/10945/5758.
Full textThe goal of this research is to provide a way for frontline troops to securely store and exchange sensitive information on a network of mobile devices with resiliency. The first portion of the thesis is the design of a file system to meet military mission specific security and resiliency requirements. The design integrates advanced concepts including erasure coding, Shamir's threshold based secret sharing algorithm, and symmetric AES cryptography. The resulting system supports two important properties: (1) data can be recovered only if some minimum number of devices are accessible, and (2) sensitive data remains protected even after a small number of devices are compromised. The second part of the thesis is to implement the design on Android mobile devices and demonstrate the system under real world conditions. We implement and demonstrate a functional version of MDFS on Android hardware. Due to the device's limited resources, there are some issues that must be explored before MDFS could be deployed as a viable distributed file system.
Li, Haoyuan. "Alluxio: A Virtual Distributed File System." Thesis, University of California, Berkeley, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10814792.
Full textThe world is entering the data revolution era. Along with the latest advancements of the Internet, Artificial Intelligence (AI), mobile devices, autonomous driving, and Internet of Things (IoT), the amount of data we are generating, collecting, storing, managing, and analyzing is growing exponentially. To store and process these data has exposed tremendous challenges and opportunities.
Over the past two decades, we have seen significant innovation in the data stack. For example, in the computation layer, the ecosystem started from the MapReduce framework, and grew to many different general and specialized systems such as Apache Spark for general data processing, Apache Storm, Apache Samza for stream processing, Apache Mahout for machine learning, Tensorflow, Caffe for deep learning, Presto, Apache Drill for SQL workloads. There are more than a hundred popular frameworks for various workloads and the number is growing. Similarly, the storage layer of the ecosystem grew from the Apache Hadoop Distributed File System (HDFS) to a variety of choices as well, such as file systems, object stores, blob stores, key-value systems, and NoSQL databases to realize different tradeoffs in cost, speed and semantics.
This increasing complexity in the stack creates challenges in multi-fold. Data is siloed in various storage systems, making it difficult for users and applications to find and access the data efficiently. For example, for system developers, it requires more work to integrate a new compute or storage component as a building block to work with the existing ecosystem. For data application developers, understanding and managing the correct way to access different data stores becomes more complex. For end users, accessing data from various and often remote data stores often results in performance penalty and semantics mismatch. For system admins, adding, removing, or upgrading an existing compute or data store or migrating data from one store to another can be arduous if the physical storage has been deeply coupled with all applications.
To address these challenges, this dissertation proposes an architecture to have a Virtual Distributed File System (VDFS) as a new layer between the compute layer and the storage layer. Adding VDFS into the stack brings many benefits. Specifically, VDFS enables global data accessibility for different compute frameworks, efficient in-memory data sharing and management across applications and data stores, high I/O performance and efficient use of network bandwidth, and the flexible choice of compute and storage. Meanwhile, as the layer to access data and collect data metrics and usage patterns, it also provides users insight into their data and can also be used to optimize the data access based on workloads.
We achieve these goals through an implementation of VDFS called Alluxio (formerly Tachyon). Alluxio presents a set of disparate data stores as a single file system, greatly reducing the complexity of the storage APIs and semantics exposed to applications. Alluxio is designed with a memory-centric architecture, enabling applications to leverage memory-speed I/O simply by using Alluxio. Alluxio has been deployed at hundreds of leading companies in production, serving critical workloads. Its open source community has attracted more than 800 contributors worldwide from over 200 companies.
In this dissertation, we also investigate lineage as an important technique in the VDFS to improve write performance, and we propose DFS-Perf, a scalable distributed file system performance evaluation framework to help researchers and developers better design and implement systems in the Alluxio ecosystem.
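The central VDFS idea, one logical namespace routed onto many backing stores, can be illustrated with a small sketch. The Store interface and the longest-prefix mount table below are hypothetical simplifications of what Alluxio calls mounts.

```java
import java.util.TreeMap;

// Sketch of the core VDFS idea: one logical namespace routed onto many
// backing stores by longest path-prefix match. Store is a stand-in
// interface, not a real Alluxio type.
public class VfsRouterSketch {
    interface Store { byte[] read(String path); }

    private final TreeMap<String, Store> mounts = new TreeMap<>();

    void mount(String prefix, Store store) { mounts.put(prefix, store); }

    // Longest matching prefix wins, so /data/archive can override /data.
    Store resolve(String logicalPath) {
        String best = null;
        for (String prefix : mounts.keySet())
            if (logicalPath.startsWith(prefix) && (best == null || prefix.length() > best.length()))
                best = prefix;
        if (best == null) throw new IllegalArgumentException("no mount for " + logicalPath);
        return mounts.get(best);
    }
}
```

Mounting "/data" to an HDFS-backed store and "/data/archive" to an S3-backed one would then route reads transparently, which is the single-namespace benefit the abstract describes.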
Purdin, Titus Douglas Mahlon. "ENHANCING FILE AVAILABILITY IN DISTRIBUTED SYSTEMS (THE SAGUARO FILE SYSTEM)." Diss., The University of Arizona, 1987. http://hdl.handle.net/10150/184161.
Full textPradeep, Aakash. "P2PHDFS: AN IMPLEMENTATION OF STATISTIC MULTIPLEXED COMPUTING ARCHITECTURE IN HADOOP FILE SYSTEM." Master's thesis, Temple University Libraries, 2012. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/214757.
Full textM.S.
The Peer to Peer Hadoop Distributed File System (P2PHDFS) is designed to store and process extremely large-scale data sets reliably. It is a first attempt at implementing the Statistic Multiplexed Computing Architecture concept proposed by Dr. Shi for the existing Hadoop File System (HDFS), with the aim of eliminating all single points of failure. Unlike HDFS, in P2PHDFS every node is designed to be equal and behaves as a file system server as well as a slave, which enables it to attain higher performance and higher reliability at the same time as the infrastructure scales up. Due to its data-intensive nature, a full implementation of P2PHDFS must address the challenges of the CAP theorem; this MS project is only intended as a groundbreaking first step, using only sequential replication at this time.
Temple University--Theses
Merritt, John W. "Distributed file systems in an authentication system." Thesis, Kansas State University, 1986. http://hdl.handle.net/2097/9938.
Full textMukhopadhyay, Meenakshi. "Performance analysis of a distributed file system." PDXScholar, 1990. https://pdxscholar.library.pdx.edu/open_access_etds/4198.
Full textMeth, Halli Elaine. "DecaFS: A Modular Distributed File System to Facilitate Distributed Systems Education." DigitalCommons@CalPoly, 2014. https://digitalcommons.calpoly.edu/theses/1206.
Full textRao, Ananth K. "The DFS distributed file system : design and implementation." Online version of thesis, 1989. http://hdl.handle.net/1850/10500.
Full textLindroth, Fredrik. "Designing a distributed peer-to-peer file system." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-200380.
Full textWennergren, Oscar, Mattias Vidhall, and Jimmy Sörensen. "Transparency analysis of Distributed file systems : With a focus on InterPlanetary File System." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-15727.
Full textZhang, Junyao. "Researches on reverse lookup problem in distributed file system." Master's thesis, University of Central Florida, 2010. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/4638.
Full textID: 029049697; System requirements: World Wide Web browser and PDF reader.; Mode of access: World Wide Web.; Thesis (M.S.)--University of Central Florida, 2010.; Includes bibliographical references (p. 46-48).
M.S.
Masters
School of Electrical Engineering and Computer Science
Engineering and Computer Science
Stenkvist, Joel. "S3-HopsFS: A Scalable Cloud-native Distributed File System." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-254664.
Data has been called the new oil of the modern world. It comes from everywhere, from how you shop online to where you travel, and companies depend on analyzing this data to make informed business decisions and improve their products and services. Storing this enormous amount of data for analysis is very expensive. Current distributed file systems use commodity hardware to provide strong, consistent data storage for big data analytics frameworks such as Hadoop and Spark, but these storage clusters can be very costly: storing 100 TB in an HDFS cluster on AWS EC2 is estimated at $47,000 per month. Cloud storage with Amazon S3, by contrast, costs only about $3,000 per month for 100 TB, but S3 alone is not sufficient because of its eventual consistency and low performance. Combining the two is therefore optimal for a cheap, consistent, and fast file system. The research in this thesis designs and builds a new class of distributed file system that uses cloud block storage, such as Amazon S3, as its data layer instead of commodity hardware. AWS recently increased the bandwidth between S3 and EC2 from 5 Gbps to 25 Gbps, renewing interest in this area. The new system is built on top of HopsFS, a hierarchical distributed file system with extended metadata that leverages an in-memory distributed database called NDB to dramatically increase the file system's scalability. Combined with native cloud storage, this new file system cuts deployment cost by up to 15 times, at a performance cost of 25% of the original HopsFS system (it is four times slower); tests in this study show, however, that S3-HopsFS can be improved to 38% of the original performance, judging by a comparison with using S3 alone. In addition to the new HopsFS version, S3Guard was extended to use NDB instead of Amazon's DynamoDB to store the file system metadata. S3Guard is a tool that lets big data analytics frameworks such as Hive use S3 instead of HDFS. The eventual consistency problems of S3 are now solved, and tests show a 36% performance improvement when listing and deleting files and directories. S3Guard is sufficient to support analytics applications such as Hive, but we lose all the benefits of HopsFS, such as performance, scalability, and extended metadata, which is why a new file system combining both solutions is needed.
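For context, S3Guard plugs its metadata store into Hadoop's s3a connector through the fs.s3a.metadatastore.impl property (that part is real; it is how S3Guard selects DynamoDB). A hedged sketch of selecting an NDB-backed store, whose class name here is purely hypothetical, might look like this:

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of how an S3Guard-style metadata store is selected in Hadoop's s3a
// connector. The property name is real; the NDB-backed class is a
// hypothetical stand-in for the thesis's NDB port.
public class S3GuardConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.metadatastore.impl",
                 "com.example.s3guard.NdbMetadataStore"); // hypothetical NDB store
        conf.set("fs.defaultFS", "s3a://example-bucket/"); // placeholder bucket
        System.out.println(conf.get("fs.s3a.metadatastore.impl"));
    }
}
```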
Ledung, Gabriel, and Johan Andersson. "Darknet file sharing : application of a private peer-to-peer distributed file system concept." Thesis, Uppsala universitet, Informationssystem, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-129908.
File sharing applications using peer-to-peer technology have been a tremendous success among end users and have therefore received much attention in academia and industry, as has illegal file sharing in the media. Private file sharing among friends, co-workers, and colleagues, however, has not received the same attention from the research community. Current applications restrict users by not allowing natural interaction with user applications. In this thesis we explore how private file sharing can be made fast, scalable, and secure without restricting users in that respect. We demonstrate a concept for private file sharing that uses a decentralized peer-to-peer architecture, through a prototype developed with extreme programming as methodology. To maximize user freedom, a virtual file system is used as the interface. The prototype shows that our approach works in practice, and we hope the reader can use our work as a platform for further development in this area.
Yee, Adam J. "Sharing the love : a generic socket API for Hadoop Mapreduce." Scholarly Commons, 2011. https://scholarlycommons.pacific.edu/uop_etds/772.
Full textZhao, Pei. "E-CRADLE v1.1 - An improved distributed system for Photovoltaic Informatics." Case Western Reserve University School of Graduate Studies / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=case1449001689.
Full textSim, Hyogi. "AnalyzeThis: An Analysis Workflow-Aware Storage System." Thesis, Virginia Tech, 2014. http://hdl.handle.net/10919/76927.
Full textMaster of Science
Kerkinos, Ioannis. "Evaluation and benchmarking of Tachyon as a memory-centric distributed storage system for Apache Hadoop." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-189571.
Full textOnishchuk, А. "Creating Highly Available Distributed File System for Maui Family Job Schedulers." Thesis, Sumy State University, 2017. http://essuir.sumdu.edu.ua/handle/123456789/55757.
Full textNarasimhan, Srivatsan. "Reliable, Efficient and Distributed Cooperative Caching for Improving File System Performance." University of Cincinnati / OhioLINK, 2001. http://rave.ohiolink.edu/etdc/view?acc_num=ucin985880302.
Full textSchleinzer, Benjamin [Verfasser]. "A File System for Wireless Mesh Networks : A New Approach for a Scalable, Secure, and Distributed File System / Benjamin Schleinzer." Aachen : Shaker, 2012. http://d-nb.info/1069044520/34.
Full textPatil, Swapnil. "Scale and Concurrency of Massive File System Directories." Research Showcase @ CMU, 2013. http://repository.cmu.edu/dissertations/250.
Full textClabough, Douglas M. "An electronic calendar system in a distributed UNIX environment." Thesis, Kansas State University, 1986. http://hdl.handle.net/2097/9906.
Full textHoffman, P. Kuyper. "A file server for the DistriX prototype : a multitransputer UNIX system." Master's thesis, University of Cape Town, 1989. http://hdl.handle.net/11427/17188.
Full textThe DISTRIX operating system is a multiprocessor distributed operating system based on UNIX. It consists of a number of satellite processors connected to central servers. The system is derived from the MINIX operating system, compatible with UNIX Version 7. A remote procedure call interface is used in conjunction with a system wide, end-to-end communication protocol that connects satellite processors to the central servers. A cached file server provides access to all files and devices at the UNIX system call level. The design of the file server is discussed in depth and the performance evaluated. Additional information is given about the software and hardware used during the development of the project. The MINIX operating system has proved to be a good choice as the software base, but certain features have proved to be poorer. The Inmos transputer emerges as a processor with many useful features that eased the implementation.
AlShaikh, Raed A. "Towards building a fault tolerant and conflict-free distributed file system for mobile clients." Thesis, University of Ottawa (Canada), 2006. http://hdl.handle.net/10393/27109.
Full textOriani, André 1984. "Uma solução de alta disponibilidade para o sistema de arquivos distribuidos do Hadoop." [s.n.], 2013. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275641.
Full textDissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Computação
Abstract: System designers generally adopt cluster-based file systems as the storage solution for high-performance computing environments, because they provide data with reliability, consistency, and high throughput. But most of those file systems employ a centralized architecture, which compromises their availability. This work focuses on a specimen of such systems, the Hadoop Distributed File System (HDFS). A hot standby for the master node of HDFS is proposed in order to bring high availability to the system. The hot standby was achieved by (i) extending the master's state replication performed by its checkpointer helper, the Backup Node; and by (ii) introducing an automatic failover mechanism. Step (i) took advantage of the message duplication technique developed by another high-availability solution for HDFS named AvatarNodes. Step (ii) employed ZooKeeper, a distributed coordination service. That approach resulted in small code changes, around 0.18% of the original code, which makes the solution easy to understand and to maintain. Experiments showed that the overhead implied by replication did not increase the average resource consumption of system nodes by more than 11%, nor did it diminish the data throughput compared to the original version of HDFS. The complete transition to the hot standby can take up to 60 seconds on workloads dominated by I/O operations, but less than 0.4 seconds when metadata requests predominate. Those results show that the solution developed in this work achieved its goals: a high-availability solution for HDFS with low overhead and a short reaction time to failures.
Master's in Computer Science (Mestre em Ciência da Computação)
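The failover mechanism of step (ii) above follows a standard ZooKeeper pattern: the active master holds an ephemeral znode, and the standby watches it. A minimal sketch with illustrative paths and timeouts, not the dissertation's actual code:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of ZooKeeper-based failover: the active NameNode holds an ephemeral
// znode; when its session dies, the standby's watch fires and it can take
// over. Assumes the /hdfs parent znode already exists.
public class FailoverSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});
        String lock = "/hdfs/active-namenode"; // hypothetical path
        try {
            zk.create(lock, "standby-host:8020".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("became active NameNode");
        } catch (KeeperException.NodeExistsException e) {
            // Someone else is active: watch the znode and react when it vanishes.
            zk.exists(lock, event -> System.out.println("active died, trigger failover"));
            System.out.println("running as hot standby");
        }
    }
}
```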
Lin, Tsai S. (Tsai Shooumeei). "A Highly Fault-Tolerant Distributed Database System with Replicated Data." Thesis, University of North Texas, 1994. https://digital.library.unt.edu/ark:/67531/metadc278403/.
Full textJones, Michael Angus Scott. "Using AFS as a distributed file system for computational and data grids in high energy physics." Thesis, University of Manchester, 2005. http://www.manchester.ac.uk/escholar/uk-ac-man-scw:181210.
Full textCuce, Simon. "GLOMAR : a component based framework for maintaining consistency of data objects within a heterogeneous distributed file system." Monash University, School of Computer Science and Software Engineering, 2003. http://arrow.monash.edu.au/hdl/1959.1/5743.
Full textLin, Jenglung. "The Implementation and Integration of the Interactive Markup Language to the Distributed Component Object Model Protocol in the Application of Distributed File System Security." NSUWorks, 1999. http://nsuworks.nova.edu/gscis_etd/671.
Full textVenkateswaran, Jayendran. "PRODUCTION AND DISTRIBUTION PLANNING FOR DYNAMIC SUPPLY CHAINS USING MULTI-RESOLUTION HYBRID MODELS." Diss., Tucson, Arizona : University of Arizona, 2005. http://etd.library.arizona.edu/etd/GetFileServlet?file=file:///data1/pdf/etd/azu%5Fetd%5F1185%5F1%5Fm.pdf&type=application/pdf.
Full textLiao, Jhih-Kai, and 廖治凱. "Fault-Tolerant Management Framework for Hadoop Distributed File System." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/20941508217703176221.
Full text淡江大學
資訊工程學系碩士班
101
Due to the rapid development of the modern Internet, the mode of operation of many applications has changed from a single machine to a cluster of machines connected over the network. This trend has also driven the development of cloud computing technology: Google invented the MapReduce framework, the Google File System (GFS), and BigTable, and Yahoo invested in the open-source Hadoop project to implement the technologies proposed by Google. The Hadoop Distributed File System (HDFS) is based on the master/slave model to manage the entire file system. Specifically, a single NameNode acting as the master manages a large number of slaves called DataNodes. Since the NameNode is responsible for maintaining a great deal of important metadata, a NameNode crash can render the entire file system unusable; that is, the NameNode forms a Single Point of Failure (SPOF). In addition, in the master/slave model all requests and responses have to go through the master, so without load sharing the NameNode obviously forms a performance bottleneck. Therefore, in this research we propose to allocate Sub_NameNodes dynamically for each MapReduce job, in order to relieve network congestion and accelerate communication between the master and the slaves. Our approach also reduces the risk of data loss by replicating the metadata to the Sub_NameNodes: once the NameNode fails, its state can be reconstructed from them. The simulation results show significant reductions in both the number of communication hops and the communication time.
CHO, CHIH-YUAN (卓志遠). "Performance Comparison of Hadoop Distributed File System and Ceph." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/27230766807802849865.
Tunghai University, Department of Computer Science and Information Engineering, ROC academic year 102 (2013-2014).
Cloud computing refers to on-demand services that can be accessed anytime, anywhere, from any device: a model that provides convenient network access to a shared pool of computing resources, including networks, servers, storage, applications, and services. As cloud computing services grow in popularity, they produce enormous amounts of information and data, and processing and analyzing massive data sets has become a key research direction; storing and handling such large amounts of data is impractical without distributed computing and a distributed file system. This thesis compares two open-source systems, the Hadoop Distributed File System and Ceph, in the areas of file upload/download performance, transmission capacity, and fault tolerance across a range of file sizes. In transmission tests over 60 different file sizes, Ceph clearly outperformed Hadoop in only two cases; the rest of the experimental data showed better performance from Hadoop. Hadoop, though more stable and better performing in these tests, is still at the proving stage with industry; Ceph is not currently recommended for production environments, but it has great potential for future development.
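A transmission test of this kind boils down to timing uploads of files of varying sizes against each file system. A sketch of the HDFS side using Hadoop's FileSystem API, with a placeholder cluster URI and paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of an upload timing run against HDFS: copy one local file and
// measure wall-clock time. Cluster URI and paths are placeholders.
public class UploadBenchSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder cluster
        FileSystem fs = FileSystem.get(conf);

        long start = System.nanoTime();
        fs.copyFromLocalFile(new Path("/tmp/testfile.bin"),
                             new Path("/bench/testfile.bin"));
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println("upload took " + micros + " us");
    }
}
```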
Lin, Ying-Chen (林映辰). "A Load-Balancing Algorithm for Hadoop Distributed File System." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/79778516225414074113.
Tamkang University, Master's Program, Department of Computer Science and Information Engineering, ROC academic year 103 (2014-2015).
With the advancement of the Internet and increasing data demands, many enterprises are offering cloud services to their customers. Among the various cloud computing platforms, the Apache Hadoop project has been widely adopted by large organizations and enterprises. In the Hadoop ecosystem, the Hadoop Distributed File System (HDFS), Hadoop MapReduce, and HBase are open source equivalents of Google's Google File System (GFS), MapReduce framework, and BigTable, respectively. To meet the requirement of horizontally scaling storage in the big data era, HDFS has received significant attention among researchers. HDFS clusters have a master-slave architecture: there is a single NameNode and a number of DataNodes in each cluster. The NameNode is the master, responsible for managing the DataNodes and the client accesses; DataNodes are slaves, responsible for storing data. As the name suggests, HDFS stores files distributedly: files are divided into fixed-size blocks, and in the default configuration each block has three replicas stored on three different DataNodes to ensure the fault tolerance of HDFS. However, Hadoop's default strategy for allocating new blocks does not take DataNodes' utilization into account, which can lead to load imbalance in HDFS. To cope with the problem, the NameNode has a built-in tool called Balancer, which can be executed by the system administrator. Balancer iteratively moves blocks from DataNodes with high utilization to those with low utilization, to keep each DataNode's disk utilization within a configurable range centered at the average utilization. The primary cost of using Balancer to achieve load balance is the bandwidth consumed while moving blocks. Besides, previous research shows that the NameNode is the performance bottleneck of HDFS, so frequent execution of Balancer by the NameNode may degrade HDFS performance. Therefore, in this research we design a new load-balancing algorithm that considers the situations that may influence the load-balancing state. The proposed algorithm introduces a new role named BalanceNode to help match heavy-loaded and light-loaded nodes, so that light-loaded nodes can share part of the load from heavy-loaded ones. The simulation results show that our algorithm not only achieves a good load-balancing state in HDFS, but does so with minimized movement cost.
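The matching step that a BalanceNode would perform can be sketched as follows; the utilization numbers and the 10% threshold are invented, and the real algorithm considers more factors than this:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Balancer-style pairing: nodes above the average utilization
// (plus a threshold) are matched with nodes below it, as a BalanceNode
// broker might do. Numbers are illustrative only.
public class BalancePairingSketch {
    record Node(String name, double used, double capacity) {
        double utilization() { return used / capacity; }
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(new Node("dn1", 90, 100),
                                   new Node("dn2", 40, 100),
                                   new Node("dn3", 20, 100));
        double avg = nodes.stream().mapToDouble(Node::utilization).average().orElse(0);
        double threshold = 0.10; // allowed deviation around the average

        List<Node> heavy = new ArrayList<>(), light = new ArrayList<>();
        for (Node n : nodes) {
            if (n.utilization() > avg + threshold) heavy.add(n);
            else if (n.utilization() < avg - threshold) light.add(n);
        }
        // Match each over-utilized node with an under-utilized one.
        for (int i = 0; i < Math.min(heavy.size(), light.size()); i++)
            System.out.println("move blocks: " + heavy.get(i).name()
                               + " -> " + light.get(i).name());
    }
}
```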
Huang, Hsin-Yi (黃心怡). "Realizing Prioritized MapReduce Service in Hadoop Distributed File System." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/85801068064786371658.
Fu Jen Catholic University, Master's Program, Department of Computer Science and Information Engineering, ROC academic year 104 (2015-2016).
Hadoop is a widely used and highly scalable platform: a distributed system that can handle large amounts of data with high fault tolerance. Like other application software, Hadoop must be built on an operating system, and must communicate and coordinate with the hardware through it. With the rise of Cloud Computing and Big Data, the cloud software platform has become very important for implementing cloud services. Hadoop has a mechanism for allocating resources to submitted jobs: job groups can be assigned different levels of resource allocation, and jobs in a group with a high allocation have more chances of obtaining resources and run with higher priority than those in a group with a low allocation. However, there is no way to give one particular job precedence over the other jobs in the same resource-allocation group. When Hadoop is busy, many jobs at the same allocation level wait in line, and even a job with a high resource allocation has no guarantee of quickly getting enough resources to complete earlier. This research presents a Hadoop environment in which users can set different priority levels for different jobs, by adding a priority mechanism to the disk CFQ scheduler and to memory replacement; the execution of a high-priority program can be accelerated accordingly. In the experiments, we ran multiple Hadoop applications simultaneously to simulate a busy environment, and gave specific programs high priority to see how much faster they executed than with normal priority. Our results show that for high-priority programs, execution time can be reduced by between 30% and 80%.
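As a point of comparison, on a Linux node using the CFQ scheduler a process's disk priority can be raised from user space with ionice; the thesis instead adds the priority mechanism inside CFQ and memory replacement, so the sketch below is only an approximation of the idea, with a placeholder jar and class name:

```java
import java.io.IOException;

// Sketch: launch a Hadoop job under ionice so CFQ services its disk I/O
// first. The jar and main class are placeholders; the thesis modifies the
// kernel's CFQ scheduler rather than using ionice.
public class PriorityLaunchSketch {
    public static void main(String[] args) throws IOException {
        new ProcessBuilder(
                "ionice", "-c", "2", "-n", "0",   // best-effort class, highest level
                "hadoop", "jar", "job.jar", "com.example.UrgentJob")
            .inheritIO()
            .start();
    }
}
```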
Fan, Kuo-Zheng (范國拯). "Dynamic De-duplication Decision in a Hadoop Distributed File System." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/12180320597103126420.
National Dong Hwa University, Department of Computer Science and Information Engineering, ROC academic year 101 (2012-2013).
Nowadays, data is generated and updated every second, and coping with such tremendously fast-growing, multiform volumes of data is a heavy challenge. The Hadoop Distributed File System (HDFS) is the first-choice solution for most people. However, data is usually protected against loss by keeping many replicas, and HDFS does exactly this. Obviously, these duplicates occupy a lot of storage space, which means investing substantial funding in infrastructure; that is not a good option for everybody, since it may be unaffordable. De-duplication technology can use storage space much more effectively, has been gaining increasing attention in research and in products, and is applied in our implementation as well. In this thesis, we propose a dynamic de-duplication decision mechanism, running on HDFS, to improve storage utilization. Under a storage-space limitation, the system formulates a proper de-duplication strategy according to the capability of the cluster and the utilization of the storage space. By doing so, the usage of storage systems can be improved.
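The core de-duplication step, indexing chunks by content hash so identical chunks are stored once, can be sketched briefly; the in-memory map stands in for HDFS storage, and the chunking policy of the thesis's dynamic decision mechanism is omitted:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;

// Sketch of content-hash de-duplication: a chunk is stored only once per
// SHA-256 digest. The map is a stand-in for a real block store on HDFS.
public class DedupSketch {
    private final Map<String, byte[]> chunkStore = new HashMap<>();

    String put(byte[] chunk) throws Exception {
        String key = HexFormat.of().formatHex(
                MessageDigest.getInstance("SHA-256").digest(chunk));
        chunkStore.putIfAbsent(key, chunk); // duplicate chunks stored once
        return key;
    }

    public static void main(String[] args) throws Exception {
        DedupSketch store = new DedupSketch();
        byte[] data = "same bytes".getBytes(StandardCharsets.UTF_8);
        System.out.println(store.put(data).equals(store.put(data))); // true
        System.out.println(store.chunkStore.size());                 // 1
    }
}
```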
Queirós, Jorge Afonso Barandas. "Implementing Hadoop distributed file system (hdfs) Cluster for BI Solution." Master's thesis, 2021. https://hdl.handle.net/10216/133038.
Full textQueirós, Jorge Afonso Barandas. "Implementing Hadoop distributed file system (hdfs) Cluster for BI Solution." Dissertação, 2021. https://hdl.handle.net/10216/133038.
Full text