
Dissertations / Theses on the topic 'Hadoop Distributed File System'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Hadoop Distributed File System.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Lorenzetto, Luca <1988>. "Evaluating performance of Hadoop Distributed File System." Master's Degree Thesis, Università Ca' Foscari Venezia, 2014. http://hdl.handle.net/10579/4773.

Full text
Abstract:
In recent years, a huge quantity of data produced by multiple sources has appeared. Dealing with this data has given rise to the so-called "big data problem", which can be faced only with new computing paradigms and platforms. Many vendors compete in this field, but to this day the de facto standard platform for big data is the open-source framework Apache Hadoop. Inspired by Google's private cluster platform, some independent developers created Hadoop and, following the structure published by Google's engineering team, developed a complete set of components for big data processing. One of its core components is the Hadoop Distributed File System (HDFS). In this thesis work, we analyze its performance and identify some action points that can be tuned to improve its behavior in a real deployment.
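A minimal sketch of the kind of measurement such an evaluation involves, using the standard Hadoop FileSystem Java API (the cluster address, file path, and buffer size below are illustrative, not taken from the thesis):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadThroughput {
    public static void main(String[] args) throws Exception {
        // Hypothetical cluster address and test file; adjust for a real deployment.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/benchmarks/testfile");

        byte[] buffer = new byte[64 * 1024];   // client buffer size is one tunable
        long bytesRead = 0;
        long start = System.nanoTime();
        try (FSDataInputStream in = fs.open(file)) {
            int n;
            while ((n = in.read(buffer)) > 0) {
                bytesRead += n;
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Read %d bytes in %.2f s (%.2f MB/s)%n",
                bytesRead, seconds, bytesRead / seconds / (1024 * 1024));
    }
}
```

Client buffer size, block size, and replication factor are the kind of action points experiments like this typically tune.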
APA, Harvard, Vancouver, ISO, and other styles
2

Polato, Ivanilton. "Energy savings and performance improvements with SSDs in the Hadoop Distributed File System." Universidade de São Paulo, 2016. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-31102016-155908/.

Full text
Abstract:
Energy issues have gathered strong attention over the past decade, reaching IT data processing infrastructures, which now need to cope with that responsibility by adjusting existing platforms to reach acceptable performance while promoting energy consumption reduction. As the de facto platform for Big Data, Apache Hadoop has evolved significantly over the last years, with more than 60 releases bringing new features. By implementing the MapReduce programming paradigm and leveraging HDFS, its distributed file system, Hadoop has become a reliable and fault-tolerant middleware for parallel and distributed computing over large datasets. Nevertheless, Hadoop may struggle under certain workloads, resulting in poor performance and high energy consumption. Users increasingly demand that high-performance computing solutions address sustainability and limit energy consumption. In this thesis, we introduce HDFSH, a hybrid storage mechanism for HDFS, which uses a combination of hard disks and solid-state disks to achieve higher performance while saving power in Hadoop computations. HDFSH brings to the middleware the best from HDs (affordable cost per GB and high storage capacity) and SSDs (high throughput and low energy consumption) in a configurable fashion, using dedicated storage zones for each storage device type. We implemented our mechanism as a block placement policy for HDFS, and assessed it over six recent releases of Hadoop with different architectural properties. Results indicate that our approach increases overall job performance while decreasing energy consumption under most hybrid configurations evaluated. Our results also showed that, in many cases, storing only part of the data in SSDs results in significant energy savings and execution speedups.
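The thesis realizes HDFSH as an HDFS block placement policy with dedicated storage zones per device type. The sketch below only illustrates that zone idea; it is not the thesis code, and it deliberately avoids the real (and much larger) HDFS BlockPlacementPolicy interface:

```java
import java.util.List;

// Illustrative only: a stand-in for the decision an HDFSH-like policy makes
// when it assigns a block to a storage zone.
public class HybridZoneChooser {
    public enum Zone { SSD, HDD }

    private final double ssdFraction; // configurable share of blocks kept on SSD

    public HybridZoneChooser(double ssdFraction) {
        this.ssdFraction = ssdFraction;
    }

    /** Route the first ssdFraction of a file's blocks to the SSD zone. */
    public Zone chooseZone(long blockIndex, long totalBlocks) {
        return blockIndex < Math.round(ssdFraction * totalBlocks) ? Zone.SSD : Zone.HDD;
    }

    /** Pick replica targets from the datanodes registered in the chosen zone. */
    public List<String> chooseTargets(Zone zone, List<String> ssdNodes,
                                      List<String> hddNodes, int replication) {
        List<String> pool = (zone == Zone.SSD) ? ssdNodes : hddNodes;
        return pool.subList(0, Math.min(replication, pool.size()));
    }
}
```

Routing only a configurable fraction of blocks to the SSD zone mirrors the thesis's finding that partial SSD placement already yields significant energy savings and speedups.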
APA, Harvard, Vancouver, ISO, and other styles
3

Musatoiu, Mihai. "An approach to choosing the right distributed file system : Microsoft DFS vs. Hadoop DFS." Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-844.

Full text
Abstract:
Context. An important goal of most IT groups is to manage server resources in such a way that their users are provided with fast, reliable and secure access to files. The modern needs of organizations imply that resources are often distributed geographically, calling for new design solutions if file systems are to remain highly available and efficient. This is where distributed file systems (DFSs) come into the picture. A distributed file system (DFS), as opposed to a "classical", local file system, is accessible across some kind of network and allows clients to access files remotely as if they were stored locally. Objectives. This paper has the goal of comparatively analyzing two distributed file systems, Microsoft DFS (MSDFS) and Hadoop DFS (HDFS). The two systems come from different "worlds" (proprietary in the case of Microsoft DFS vs. open-source in the case of Hadoop DFS); the abundance of solutions and the variety of choices that exist today make such a comparison all the more relevant. Methods. The comparative analysis is done on a cluster of 4 computers running dual installations of Microsoft Windows Server 2012 R2 (the MSDFS environment) and Linux Ubuntu 14.04 (the HDFS environment). The comparison is done on read and write operations on files and sets of files of increasing sizes, as well as on a set of key usage scenarios. Results. Comparative results are produced for reading and writing operations of files of increasing size (1 MB, 2 MB, 4 MB and so on up to 4096 MB) and of sets of small files (64 KB each) amounting to totals of 128 MB, 256 MB and so on up to 4096 MB. The results expose the behavior of the two DFSs under different types of stressful activity (when the size of the transferred file increases, as well as when the quantity of data is divided into (tens of) thousands of small files). The behavior in the case of key usage scenarios is observed and analyzed. Conclusions. HDFS performs better at writing large files, while MSDFS is better at writing many small files. On read operations, the two show similar performance, with a slight advantage for MSDFS. In the key usage scenarios, HDFS shows more flexibility, but MSDFS could be the better choice depending on the needs of the users (for example, most of the common functions can be configured through the graphical user interface).
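A sketch of the write side of such a benchmark, with file sizes doubling from 1 MB to 4096 MB as in the study; it targets any mounted file system via java.nio, and all paths are placeholders:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteScalingBench {
    public static void main(String[] args) throws IOException {
        Path dir = Path.of(args[0]);          // a directory on the DFS under test
        byte[] chunk = new byte[1024 * 1024]; // 1 MB of zeroes

        for (long sizeMb = 1; sizeMb <= 4096; sizeMb *= 2) {
            Path file = dir.resolve("bench-" + sizeMb + "mb");
            long start = System.nanoTime();
            try (OutputStream out = Files.newOutputStream(file)) {
                for (long i = 0; i < sizeMb; i++) {
                    out.write(chunk);         // write the file as 1 MB chunks
                }
            }
            double s = (System.nanoTime() - start) / 1e9;
            System.out.printf("%5d MB written in %7.2f s (%.1f MB/s)%n",
                    sizeMb, s, sizeMb / s);
        }
    }
}
```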
APA, Harvard, Vancouver, ISO, and other styles
4

Bhat, Adithya. "RDMA-based Plugin Design and Profiler for Apache and Enterprise Hadoop Distributed File system." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1440188090.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Cheng, Lu. "Concentric layout, a new scientific data layout for matrix data set in Hadoop file system." Master's thesis, University of Central Florida, 2010. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/4545.

Full text
Abstract:
The amount of data generated by scientific simulations, sensors, monitors, and optical telescopes has increased at a dramatic speed. In order to analyze the raw data in a time- and space-efficient way, a data pre-processing step is needed to achieve better performance in the data analysis phase. Current research shows an increasing trend of adopting the MapReduce framework for large-scale data processing. However, the data access patterns generally applied to scientific data sets are not directly supported by the current MapReduce framework. The gap between the requirements of analytics applications and the properties of the MapReduce framework motivates us to provide support for these data access patterns in the MapReduce framework. In our work, we studied the data access patterns in matrix files and propose a new concentric data layout solution to facilitate matrix data access and analysis in the MapReduce framework. Concentric data layout is a data layout that maintains the dimensional property at the chunk level. Contrary to the contiguous data layout adopted by default in the current Hadoop framework, the concentric data layout stores the data from the same sub-matrix in one chunk. This matches well with common matrix operations. The concentric data layout preprocesses the data beforehand and optimizes the subsequent runs of MapReduce applications. The experiments indicate that the concentric data layout improves overall performance, reducing the execution time by 38% when the file size is 16 GB; it also relieves the data overhead phenomenon and increases the effective data retrieval rate by 32% on average.
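A toy illustration of the layout change: instead of the default contiguous (row-major) placement, each c x c sub-matrix is packed into its own chunk (a sketch, not the thesis implementation):

```java
// Illustrative re-chunking: store each c x c sub-matrix contiguously, so that
// a MapReduce task reading one chunk gets a complete sub-matrix instead of
// c rows scattered across a row-major file.
public class ConcentricLayout {
    public static double[][] toChunks(double[][] m, int c) {
        int rows = m.length, cols = m[0].length;
        int chunksPerRow = (cols + c - 1) / c;
        int chunksPerCol = (rows + c - 1) / c;
        double[][] chunks = new double[chunksPerRow * chunksPerCol][c * c];

        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                int chunk = (i / c) * chunksPerRow + (j / c);    // which sub-matrix
                int offset = (i % c) * c + (j % c);              // position inside it
                chunks[chunk][offset] = m[i][j];
            }
        }
        return chunks;
    }
}
```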
ID: 029051151; System requirements: World Wide Web browser and PDF reader.; Mode of access: World Wide Web.; Thesis (M.S.)--University of Central Florida, 2010.; Includes bibliographical references (p. 56-58).
M.S.
Masters
Department of Electrical Engineering and Computer Science
Engineering
APA, Harvard, Vancouver, ISO, and other styles
6

Sodhi, Bir Apaar Singh. "DATA MINING: TRACKING SUSPICIOUS LOGGING ACTIVITY USING HADOOP." CSUSB ScholarWorks, 2016. https://scholarworks.lib.csusb.edu/etd/271.

Full text
Abstract:
In this modern, highly interconnected era, an organization's top priority is to protect itself from the major security breaches that occur frequently within a communication environment. But they often seem to fail in doing so: every week there are new headlines about information being forged, funds being stolen, credit-card fraud, and so on. Personal computers are turned into "zombie machines" by hackers to steal confidential and financial information from sources without disclosing the hacker's true identity. These identity thieves rob private data and ruin the very purpose of privacy. The purpose of this project is to identify suspicious user activity by analyzing a log file, which can later help an investigative agency like the FBI track and monitor anonymous users who look for weaknesses to attack vulnerable parts of a system and gain access to it. The project also emphasizes the potential damage that a malicious activity could have on the system. This project uses the Hadoop framework to search and store log files of logging activity and then runs a MapReduce program to compute and analyze the results.
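A compact MapReduce job of the kind the project describes, counting failed-login events per source IP; the log format and the field positions are assumptions made for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FailedLoginCount {
    public static class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumed format: "<timestamp> <ip> <event>"; emit (ip, 1) for failures.
            String[] f = line.toString().split("\\s+");
            if (f.length >= 3 && f[2].equals("LOGIN_FAILED")) {
                ctx.write(new Text(f[1]), ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text ip, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(ip, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "failed-login-count");
        job.setJarByClass(FailedLoginCount.class);
        job.setMapperClass(LogMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

IPs with counts far above the norm are the kind of suspicious activity such an analysis would flag for investigation.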
APA, Harvard, Vancouver, ISO, and other styles
7

Johannsen, Fabian, and Mattias Hellsing. "Hadoop Read Performance During Datanode Crashes." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-130466.

Full text
Abstract:
This bachelor thesis evaluates the impact of datanode crashes on the performance of the read operations of a Hadoop Distributed File System, HDFS. The goal is to better understand how datanode crashes, as well as certain parameters, affect the performance of the read operation by looking at the execution time of the get command. The parameters used are the number of crashed nodes, block size and file size. By setting up a Linux test environment with ten virtual machines with Hadoop installed on them and running tests on it, data has been collected in order to answer these questions. From this data, the average execution time and standard deviation of the get command were calculated. The network activity during the tests was also measured. The results showed that neither the number of crashed nodes nor the block size had any significant effect on the execution time. They also demonstrated that the execution time of the get command was not directly proportional to the size of the fetched file: a four times larger file sometimes resulted in more than four times the execution time, up to 4.5 times as long. However, the consequences of a datanode crash while fetching a small file appear to be much greater than with a large file. The average execution time increased by up to 36% when a large file was fetched, but by as much as 85% when fetching a small file.
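A sketch of how such a measurement could be collected programmatically; fs.copyToLocalFile performs the same transfer as the `hadoop fs -get` command timed in the thesis, and the paths and sample size are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetLatency {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path remote = new Path(args[0]);           // file on HDFS
        int runs = 20;                             // sample size per configuration
        double[] t = new double[runs];

        for (int i = 0; i < runs; i++) {
            Path local = new Path("/tmp/get-" + i);
            long start = System.nanoTime();
            fs.copyToLocalFile(remote, local);     // what `hadoop fs -get` does
            t[i] = (System.nanoTime() - start) / 1e9;
        }
        // Sample mean and standard deviation across the runs.
        double mean = 0, var = 0;
        for (double x : t) mean += x / runs;
        for (double x : t) var += (x - mean) * (x - mean) / (runs - 1);
        System.out.printf("mean %.3f s, stddev %.3f s%n", mean, Math.sqrt(var));
    }
}
```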
APA, Harvard, Vancouver, ISO, and other styles
8

Caceres, Gutierrez Franco Jesus. "Towards an S3-based, DataNode-less implementation of HDFS." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-291125.

Full text
Abstract:
The relevance of data processing and analysis today cannot be overstated. The convergence of several technological advancements has fostered the proliferation of systems and infrastructure that together support the generation, transmission, and storage of nearly 15,000 exabytes of digital, analyzable data. The Hadoop Distributed File System (HDFS) is an open source system designed to leverage the storage capacity of thousands of servers, and is the file system component of an entire ecosystem of tools to transform and analyze massive data sets. While HDFS is used by organizations of all sizes, smaller ones are not as well-suited to organically grow their clusters to accommodate their ever-expanding data sets and processing needs. This is because larger clusters are concomitant with higher investment in servers, greater rates of failures to recover from, and the need to allocate more resources to maintenance and administration tasks. This poses a potential limitation down the road for organizations, and it might even deter some from venturing into the data world altogether. This thesis addresses this matter by presenting a novel implementation of HopsFS, an already improved version of HDFS, that requires no user-managed data servers. Instead, it relies on S3, a leading object storage service, for all its user-data storage needs. We compared the performance of both S3-based and regular clusters and found that such an architecture is not only feasible, but also perfectly viable in terms of read and write throughputs, in some cases even outperforming its original counterpart. Furthermore, our solution provides first-class elasticity, reliability, and availability, all while being remarkably more affordable.
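HopsFS's S3 integration is internal to the file system, but the general direction is already visible in Hadoop's own file system abstraction, where S3 can stand in behind the same client API. A sketch with placeholder bucket and credentials (requires the hadoop-aws module on the classpath):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AsStorageLayer {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; in practice these come from the environment
        // or an instance role rather than hard-coded values.
        conf.set("fs.s3a.access.key", "ACCESS_KEY");
        conf.set("fs.s3a.secret.key", "SECRET_KEY");

        // The same FileSystem API a program would use against an hdfs:// URI.
        FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/hello.txt"))) {
            out.writeUTF("stored in S3, no DataNodes involved");
        }
        System.out.println(fs.getFileStatus(new Path("/data/hello.txt")).getLen());
    }
}
```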
APA, Harvard, Vancouver, ISO, and other styles
9

Benkő, Krisztián. "Zpracování velkých dat z rozsáhlých IoT sítí." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2019. http://www.nusl.cz/ntk/nusl-403820.

Full text
Abstract:
The goal of this diploma thesis is to design and develop a system for collecting, processing, and storing data from large IoT networks. The developed system introduces a comprehensive solution able to process data from various IoT networks using the Apache Hadoop ecosystem. The data are processed in real time and stored in a NoSQL database, but they are also stored in the file system for potential later processing. The system is optimized and tested using data from an IQRF network. The data stored in the NoSQL database are visualized, and the system periodically generates derived predictions. Users are connected to this system via an information system, which can automatically generate notifications when monitored values are out of range.
APA, Harvard, Vancouver, ISO, and other styles
10

Yeager, Philip S. "A distributed file system for distributed conferencing system." [Gainesville, Fla.] : University of Florida, 2003. http://purl.fcla.edu/fcla/etd/UFE0001123.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Jayaraman, Prashant. "A distributed file system (DFS)." [Gainesville, Fla.] : University of Florida, 2006. http://purl.fcla.edu/fcla/etd/UFE0014040.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Wasif, Malik. "A Distributed Namespace for a Distributed File System." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-101482.

Full text
Abstract:
Due to the rapid growth of data in recent years, distributed file systems have gained widespread adoption. The new breed of distributed file systems reliably store petabytes of data on commodity hardware, and also provide rich abstractions for massively parallel data analytics. The Hadoop Distributed File System (HDFS) is one such system, providing the storage layer for MapReduce, Hive, HBase and Mahout. The metadata server in HDFS, called the NameNode, is a centralized server which stores information about the whole namespace. The centralized architecture not only makes the NameNode a bottleneck and a single point of failure, but also restricts the overall capacity of the file system. To solve the availability and scalability issues of HDFS, a new architecture is required. In this report, we propose a distributed implementation of the HDFS NameNode, where the file system metadata is stored in a distributed, in-memory, replicated database called MySQL Cluster. The NameNodes are stateless, and the throughput of the system can be increased either by adding NameNodes or by adding more data nodes in NDB. HDFS clients can access the metadata by connecting to any one of the NameNodes. The evaluation section shows that the new architecture, compared to HDFS, can handle more requests per second, store ten times more files and recover from failures within a few seconds.
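The core schema idea is that inodes become rows keyed by (parent id, name) in the shared database, so any stateless NameNode can resolve a path with a chain of primary-key lookups. A toy stand-in for that lookup, with a plain map in place of MySQL Cluster:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of a distributed-namespace lookup: inodes are rows keyed by
// (parentId, name), so any stateless NameNode front-end can resolve a path
// as a sequence of primary-key reads against the shared database.
public class NamespaceTable {
    record Inode(long id, boolean directory) {}

    private final Map<String, Inode> rows = new ConcurrentHashMap<>();
    private long nextId = 2; // id 1 is reserved for "/"

    private static String key(long parentId, String name) {
        return parentId + "/" + name;
    }

    public synchronized Inode mkdir(long parentId, String name) {
        return rows.computeIfAbsent(key(parentId, name),
                k -> new Inode(nextId++, true));
    }

    /** Resolve "/a/b/c" component by component, as a chain of key lookups. */
    public Inode resolve(String path) {
        Inode cur = new Inode(1, true); // start at the root inode
        for (String comp : path.split("/")) {
            if (comp.isEmpty()) continue;
            cur = rows.get(key(cur.id(), comp));
            if (cur == null) return null;
        }
        return cur;
    }
}
```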
APA, Harvard, Vancouver, ISO, and other styles
13

Huchton, Scott. "Secure mobile distributed file system (MDFS)." Thesis, Monterey, California. Naval Postgraduate School, 2011. http://hdl.handle.net/10945/5758.

Full text
Abstract:
Approved for public release; distribution is unlimited
The goal of this research is to provide a way for frontline troops to securely store and exchange sensitive information on a network of mobile devices with resiliency. The first portion of the thesis is the design of a file system to meet military mission-specific security and resiliency requirements. The design integrates advanced concepts including erasure coding, Shamir's threshold-based secret sharing algorithm, and symmetric AES cryptography. The resulting system supports two important properties: (1) data can be recovered only if some minimum number of devices are accessible, and (2) sensitive data remains protected even after a small number of devices are compromised. The second part of the thesis implements the design on Android mobile devices and demonstrates the system under real-world conditions. We implement and demonstrate a functional version of MDFS on Android hardware. Due to the devices' limited resources, there are some issues that must be explored before MDFS could be deployed as a viable distributed file system.
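A self-contained sketch of the Shamir (k, n) threshold step that the design integrates: split a secret into n shares so that any k of them reconstruct it. This is simplified relative to MDFS, which combines secret sharing with erasure coding and AES:

```java
import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;

public class Shamir {
    // Field prime; the secret must be smaller than P.
    static final BigInteger P = BigInteger.probablePrime(256, new SecureRandom());

    /** Share i is (x=i, y=f(i)) for a random degree k-1 polynomial with f(0)=secret. */
    static List<BigInteger[]> split(BigInteger secret, int n, int k) {
        SecureRandom rnd = new SecureRandom();
        BigInteger[] coeff = new BigInteger[k];
        coeff[0] = secret;
        for (int j = 1; j < k; j++) coeff[j] = new BigInteger(255, rnd);

        List<BigInteger[]> shares = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            BigInteger x = BigInteger.valueOf(i), y = BigInteger.ZERO;
            for (int j = k - 1; j >= 0; j--) y = y.multiply(x).add(coeff[j]).mod(P); // Horner
            shares.add(new BigInteger[]{x, y});
        }
        return shares;
    }

    /** Lagrange interpolation at x=0 from any k shares recovers the secret. */
    static BigInteger reconstruct(List<BigInteger[]> shares) {
        BigInteger secret = BigInteger.ZERO;
        for (BigInteger[] si : shares) {
            BigInteger num = BigInteger.ONE, den = BigInteger.ONE;
            for (BigInteger[] sj : shares) {
                if (si == sj) continue;
                num = num.multiply(sj[0].negate()).mod(P);
                den = den.multiply(si[0].subtract(sj[0])).mod(P);
            }
            secret = secret.add(si[1].multiply(num).multiply(den.modInverse(P))).mod(P);
        }
        return secret;
    }
}
```

With fewer than k shares, every candidate secret remains equally likely, which is exactly the "protected even after a small number of devices are compromised" property.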
APA, Harvard, Vancouver, ISO, and other styles
14

Li, Haoyuan. "Alluxio: A Virtual Distributed File System." Thesis, University of California, Berkeley, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10814792.

Full text
Abstract:

The world is entering the data revolution era. Along with the latest advancements of the Internet, Artificial Intelligence (AI), mobile devices, autonomous driving, and the Internet of Things (IoT), the amount of data we are generating, collecting, storing, managing, and analyzing is growing exponentially. Storing and processing these data has exposed tremendous challenges and opportunities.

Over the past two decades, we have seen significant innovation in the data stack. For example, in the computation layer, the ecosystem started from the MapReduce framework and grew to many different general and specialized systems, such as Apache Spark for general data processing, Apache Storm and Apache Samza for stream processing, Apache Mahout for machine learning, TensorFlow and Caffe for deep learning, and Presto and Apache Drill for SQL workloads. There are more than a hundred popular frameworks for various workloads, and the number is growing. Similarly, the storage layer of the ecosystem grew from the Apache Hadoop Distributed File System (HDFS) to a variety of choices as well, such as file systems, object stores, blob stores, key-value systems, and NoSQL databases, to realize different tradeoffs in cost, speed and semantics.

This increasing complexity in the stack creates challenges on multiple fronts. Data is siloed in various storage systems, making it difficult for users and applications to find and access the data efficiently. For example, for system developers, it requires more work to integrate a new compute or storage component as a building block to work with the existing ecosystem. For data application developers, understanding and managing the correct way to access different data stores becomes more complex. For end users, accessing data from various and often remote data stores often results in performance penalties and semantics mismatches. For system admins, adding, removing, or upgrading an existing compute or data store, or migrating data from one store to another, can be arduous if the physical storage has been deeply coupled with all applications.

To address these challenges, this dissertation proposes an architecture with a Virtual Distributed File System (VDFS) as a new layer between the compute layer and the storage layer. Adding VDFS to the stack brings many benefits. Specifically, VDFS enables global data accessibility for different compute frameworks, efficient in-memory data sharing and management across applications and data stores, high I/O performance and efficient use of network bandwidth, and a flexible choice of compute and storage. Meanwhile, as the layer that accesses data and collects data metrics and usage patterns, it also provides users with insight into their data and can be used to optimize data access based on workloads.

We achieve these goals through an implementation of VDFS called Alluxio (formerly Tachyon). Alluxio presents a set of disparate data stores as a single file system, greatly reducing the complexity of the storage APIs and semantics exposed to applications. Alluxio is designed with a memory-centric architecture, enabling applications to leverage memory-speed I/O by simply using Alluxio. Alluxio has been deployed at hundreds of leading companies in production, serving critical workloads. Its open source community has attracted more than 800 contributors worldwide from over 200 companies.

In this dissertation, we also investigate lineage as an important technique in the VDFS to improve write performance, and also propose DFS-Perf, a scalable distributed file system performance evaluation framework to help researchers and developers better design and implement systems in the Alluxio ecosystem.

APA, Harvard, Vancouver, ISO, and other styles
15

Purdin, Titus Douglas Mahlon. "ENHANCING FILE AVAILABILITY IN DISTRIBUTED SYSTEMS (THE SAGUARO FILE SYSTEM)." Diss., The University of Arizona, 1987. http://hdl.handle.net/10150/184161.

Full text
Abstract:
This dissertation describes the design and implementation of the file system component of the Saguaro operating system for computers connected by a local-area network. Systems constructed on such an architecture have the potential advantage of increased file availability due to their inherent redundancy. In Saguaro, this advantage is made available through two mechanisms that support semi-automatic file replication and access: reproduction sets and metafiles. A reproduction set is a collection of files that the system attempts to keep identical on a "best effort" basis, relying on the user to handle unusual situations that may arise. A metafile is a special file that contains symbolic path names of other files; when a metafile is opened, the system selects an available constituent file and opens it instead. These mechanisms are especially appropriate for situations that do not require guaranteed consistency or a large number of copies. Other interesting aspects of the Saguaro file system design are also described. The logical file system forms a single tree, yet any file can be placed in any of the physical file systems. This organization allows the creation of a logical association among files that is quite different from their physical association. In addition, the broken path algorithm is described. This algorithm makes it possible to bypass elements in a path name that are on inaccessible physical file systems. Thus, any accessible file can be made available, regardless of the availability of directories in its path. Details are provided on the implementation of the Saguaro file system. The servers of which the system is composed are described individually, and a comprehensive operational example is supplied to illustrate their interaction. The underlying data structures of the file system are presented. The virtual roots, which contain information used by the broken path algorithm, are the most novel of these. Finally, an implementation of reproduction sets and metafiles for interconnected networks running Berkeley UNIX is described. This implementation demonstrates the broad applicability of these mechanisms. It also provides insight into the way in which mechanisms to facilitate user-controlled replication of files can be inexpensively added to existing file systems. Performance measurements for this implementation are also presented.
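The metafile mechanism is easy to picture: a metafile holds symbolic path names, and opening it means opening the first constituent that is currently accessible. A sketch under that description (not the Saguaro sources):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class Metafile {
    /**
     * A metafile lists one constituent path per line; opening it means
     * opening the first constituent that is currently accessible.
     */
    public static InputStream open(Path metafile) throws IOException {
        List<String> constituents = Files.readAllLines(metafile);
        for (String candidate : constituents) {
            try {
                return Files.newInputStream(Path.of(candidate.trim()));
            } catch (IOException unavailable) {
                // e.g. the physical file system holding this copy is down; try next
            }
        }
        throw new IOException("no constituent of " + metafile + " is available");
    }
}
```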
APA, Harvard, Vancouver, ISO, and other styles
16

Pradeep, Aakash. "P2PHDFS: AN IMPLEMENTATION OF STATISTIC MULTIPLEXED COMPUTING ARCHITECTURE IN HADOOP FILE SYSTEM." Master's thesis, Temple University Libraries, 2012. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/214757.

Full text
Abstract:
Computer and Information Science
M.S.
The Peer-to-Peer Hadoop Distributed File System (P2PHDFS) is designed to store and process extremely large-scale data sets reliably. It is a first-attempt implementation of the Statistic Multiplexed Computing Architecture concept proposed by Dr. Shi for the existing Hadoop File System (HDFS), intended to eliminate all single points of failure. Unlike HDFS, in P2PHDFS every node is designed to be equal and behaves as a file system server as well as a slave, which enables it to attain higher performance and higher reliability at the same time as the infrastructure scales up. Due to its data-intensive nature, a full implementation of P2PHDFS must address the challenges of the CAP theorem. This MS project is intended only as a first groundbreaking step, using only sequential replication at this time.
Temple University--Theses
APA, Harvard, Vancouver, ISO, and other styles
17

Merritt, John W. "Distributed file systems in an authentication system." Thesis, Kansas State University, 1986. http://hdl.handle.net/2097/9938.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Mukhopadhyay, Meenakshi. "Performance analysis of a distributed file system." PDXScholar, 1990. https://pdxscholar.library.pdx.edu/open_access_etds/4198.

Full text
Abstract:
An important design goal of a distributed file system, a component of many distributed systems, is to provide UNIX file access semantics, e.g., the result of any write system call is visible to all processes as soon as the call completes. In a distributed environment, these semantics are difficult to implement because processes on different machines do not share kernel cache and data structures. Strong data consistency guarantees may be provided only at the expense of performance. This work investigates the time costs paid by AFS 3.0, which uses a callback mechanism to provide consistency guarantees, and those paid by AFS 4.0, which uses typed tokens for synchronization. AFS 3.0 provides moderately strong consistency guarantees, but they are not UNIX-like because data are written back to the server only after a file is closed. AFS 4.0 writes data back to the server whenever there are other clients wanting to access it, the effect being like UNIX file access semantics. Also, AFS 3.0 does not guarantee synchronization of multiple writers, whereas AFS 4.0 does.
APA, Harvard, Vancouver, ISO, and other styles
19

Meth, Halli Elaine. "DecaFS: A Modular Distributed File System to Facilitate Distributed Systems Education." DigitalCommons@CalPoly, 2014. https://digitalcommons.calpoly.edu/theses/1206.

Full text
Abstract:
Data quantity, speed requirements, reliability constraints, and other factors encourage industry developers to build distributed systems and use distributed services. Software engineers are therefore exposed to distributed systems and services daily in the workplace. However, distributed computing is hard to teach in Computer Science courses due to the complexity distribution brings to all problem spaces. This presents a gap in education where students may not fully understand the challenges introduced with distributed systems. Teaching students distributed concepts would help better prepare them for industry development work. DecaFS, Distributed Educational Component Adaptable File System, is a modular distributed file system designed for educational use. The goal of the system is to teach distributed computing concepts to undergraduate and graduate level students by allowing them to develop small, digestible portions of the system. The system is broken up into layers, and each layer is broken up into modules so that students can build or modify different components in small, assignment-sized portions. Students can replace modules or entire layers by following the DecaFS APIs and recompiling the system. This allows the behavior of the DFS (Distributed File System) to change based on student implementation, while providing base functionality for students to work from. Our implementation includes a code base of core DecaFS Modules that students can work from and basic implementations of non-core DecaFS Modules. Our basic non-core modules can be modified to implement more complex distribution techniques without modifying core modules. We have shown the feasibility of developing a modular DFS, while adhering to requirements such as configurable sizes (file, stripe, chunk) and support of multiple data replication strategies.
APA, Harvard, Vancouver, ISO, and other styles
20

Rao, Ananth K. "The DFS distributed file system : design and implementation." Online version of thesis, 1989. http://hdl.handle.net/1850/10500.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Lindroth, Fredrik. "Designing a distributed peer-to-peer file system." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-200380.

Full text
Abstract:
Currently, most companies and institutions rely on dedicated file servers to provide both shared and personal files to employees. Meanwhile, many desktop machines have a lot of unused hard drive space, especially if most files are stored on these servers. This report sets out to create a file system that can be deployed in an existing infrastructure and is completely managed and replicated on machines that would normally hold nothing more than an operating system and a few personal files. The report discusses distributed file systems, files, and directories within the context of a UNIX-based local area network (LAN), and how file operations, such as opening, reading, writing, and locking, can be performed on these distributed objects.
APA, Harvard, Vancouver, ISO, and other styles
22

Wennergren, Oscar, Mattias Vidhall, and Jimmy Sörensen. "Transparency analysis of Distributed file systems : With a focus on InterPlanetary File System." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-15727.

Full text
Abstract:
IPFS claims to be the replacement of HTTP and aims to be used globally. However, our study shows that in terms of scalability, performance, and security, IPFS is inadequate. This is the result of our experimental and qualitative study of the transparency of IPFS version 0.4.13. Moreover, since IPFS is a distributed file system, it should fulfill all aspects of transparency, but according to our study, this is not the case. From our small-scale analysis, we speculate that nested files are the main cause of the performance issues and that replication amplifies these problems even further.
APA, Harvard, Vancouver, ISO, and other styles
23

Zhang, Junyao. "Researches on reverse lookup problem in distributed file system." Master's thesis, University of Central Florida, 2010. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/4638.

Full text
Abstract:
Recent years have witnessed an increasing demand for super data clusters. Super data clusters have reached the petabyte scale and can consist of thousands or tens of thousands of storage nodes at a single site. For this architecture, reliability is becoming a great concern. In order to achieve high reliability, data recovery and node reconstruction are a must. Although extensive research has investigated how to sustain high performance and high reliability in case of node failures at large scale, a reverse lookup problem, namely finding the list of objects stored on a failed node, remains open. This is especially true for storage systems with high requirements on data integrity and availability, such as scientific research data clusters. Existing solutions are either time-consuming or expensive. Meanwhile, replication-based block placement can be used to realize fast reverse lookup, but such schemes are designed for centralized, small-scale storage architectures. In this thesis, we propose a fast and efficient reverse lookup scheme named Group-based Shifted Declustering (G-SD) layout that is able to locate the whole content of a failed node. G-SD extends our previous shifted declustering layout and applies to large-scale file systems. Our mathematical proofs and real-life experiments show that G-SD is a scalable reverse lookup scheme that is up to one order of magnitude faster than existing schemes.
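A much simplified model of the underlying idea: if replica r of block b is placed at node (b + r * s) mod N for a fixed shift s, the contents of a failed node can be enumerated arithmetically instead of scanning all metadata. G-SD's actual group-based layout is more elaborate:

```java
import java.util.ArrayList;
import java.util.List;

public class ShiftedDeclustering {
    final int nodes, replicas, shift;

    ShiftedDeclustering(int nodes, int replicas, int shift) {
        this.nodes = nodes; this.replicas = replicas; this.shift = shift;
    }

    /** Forward mapping: where replica r of block b lives. */
    int place(long block, int r) {
        return (int) ((block + (long) r * shift) % nodes);
    }

    /** Reverse lookup: all (block, replica) pairs hosted by a failed node. */
    List<long[]> blocksOn(int failedNode, long totalBlocks) {
        List<long[]> hosted = new ArrayList<>();
        for (int r = 0; r < replicas; r++) {
            // Solve (b + r*shift) ≡ failedNode (mod nodes) for b directly.
            long first = Math.floorMod(failedNode - (long) r * shift, nodes);
            for (long b = first; b < totalBlocks; b += nodes) {
                hosted.add(new long[]{b, r});
            }
        }
        return hosted;
    }
}
```

Because the placement is a closed-form function, the reverse lookup needs no metadata scan at all, which is where the order-of-magnitude speedup over scanning schemes comes from.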
ID: 029049697; System requirements: World Wide Web browser and PDF reader.; Mode of access: World Wide Web.; Thesis (M.S.)--University of Central Florida, 2010.; Includes bibliographical references (p. 46-48).
M.S.
Masters
School of Electrical Engineering and Computer Science
Engineering and Computer Science
APA, Harvard, Vancouver, ISO, and other styles
24

Stenkvist, Joel. "S3-HopsFS: A Scalable Cloud-native Distributed File System." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-254664.

Full text
Abstract:
Data has been regarded as the new oil of today's modern world. Data is generated everywhere, from how you shop online to where you travel. Companies rely on analyzing this data to make informed business decisions and improve their products and services. However, storing this massive amount of data can be very expensive. Current distributed file systems rely on commodity hardware to provide strongly consistent data storage for big data analytics applications, such as Hadoop and Spark. Running these storage clusters can be very costly; it is estimated that storing 100 TB in an HDFS cluster on AWS EC2 costs $47,000 per month. On the other hand, using cloud storage such as Amazon's S3 to store 100 TB only costs about $3,000 per month; however, S3 alone is not sufficient due to eventual consistency and low performance. Combining these two solutions is therefore optimal for a cheap, consistent, and fast file system. This thesis outlines and builds a new class of distributed file system that utilizes cloud-native storage, such as Amazon's S3, as the data layer. AWS recently increased the bandwidth from S3 to EC2 from 5 Gbps to 25 Gbps, sparking new interest in this area. The new system is built on top of HopsFS, a hierarchical, distributed file system with a scale-out metadata layer utilizing an in-memory, distributed database called NDB, which dramatically increases the scalability of the file system. In combination with native cloud storage, this new file system reduces the price of deployment by up to 15 times, but at a performance of about 25% of the original HopsFS system (four times slower). However, tests in this research show that S3-HopsFS can be improved towards 38% of the original performance, as shown by comparison with using S3 by itself. In addition to the new HopsFS version, S3Guard was developed to use NDB instead of Amazon's DynamoDB to store the file tree hierarchy metadata. S3Guard is a tool that allows big data analytics applications such as Hive to use S3 as a direct input and output source for queries. The eventual consistency problems of S3 have been solved, and tests show a 36% performance boost when listing and deleting files and directories. S3Guard is sufficient to support some big data analytics applications like Hive, but we lose all the benefits of HopsFS, such as performance, scalability, and extended metadata; therefore we need a new file system combining both solutions.
APA, Harvard, Vancouver, ISO, and other styles
25

Ledung, Gabriel, and Johan Andersson. "Darknet file sharing : application of a private peer-to-peer distributed file system concept." Thesis, Uppsala universitet, Informationssystem, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-129908.

Full text
Abstract:
Peer-to-peer network applications have been a tremendous success among end users and have therefore received much attention in academia and industry, as has illegal public file sharing in the media. However, private peer-to-peer file sharing between family, friends, and co-workers has attracted little interest from the research community. Existing approaches also limit users by not allowing native interaction with userspace applications. In this paper we explore how private file sharing can be made safe, fast, and scalable without constraining users in this respect. We demonstrate the concept of a private file sharing application utilizing a decentralized peer-to-peer network overlay by creating a prototype with extreme programming as the methodology. To maximize the freedom of users, the network is accessed through a virtual file-system interface. The prototype proves this to be a valid approach, and we hope readers can use this paper as a platform for further developments in this area.
APA, Harvard, Vancouver, ISO, and other styles
26

Yee, Adam J. "Sharing the love : a generic socket API for Hadoop Mapreduce." Scholarly Commons, 2011. https://scholarlycommons.pacific.edu/uop_etds/772.

Full text
Abstract:
Hadoop is a popular software framework written in Java that performs data-intensive distributed computations on a cluster. It includes Hadoop MapReduce and the Hadoop Distributed File System (HDFS). HDFS has known scalability limitations due to its single NameNode, which holds the entire file system namespace in RAM on one computer. Therefore, the NameNode can only store a limited number of file names, depending on RAM capacity. The solution to furthering scalability is distributing the namespace, similar to how file data is divided into chunks and stored across cluster nodes. Hadoop has an abstract file system API which is extended to integrate HDFS, but it has also been extended to integrate the file systems S3, CloudStore, Ceph and PVFS. The file systems Ceph and PVFS already distribute the namespace, while others such as Lustre are making the conversion. Google announced in 2009 that they had been implementing a distributed namespace for the Google File System to achieve greater scalability. The Generic Hadoop API is created from Hadoop's abstract file system API. It speaks a simple communication protocol that can integrate any file system which supports TCP sockets. By providing a file-system-agnostic API, future work with other file systems might provide ways of surpassing Hadoop's current scalability limitations. Furthermore, the new API eliminates the need to customize Hadoop's Java implementation, and instead moves the implementation to the file system itself. Thus, developers wishing to integrate their new file system with Hadoop are not responsible for understanding the details of Hadoop's internal operation. The API is tested on a homogeneous, four-node cluster with OrangeFS. Initial OrangeFS I/O throughputs compared to HDFS are 67% of HDFS's write throughput and 74% of HDFS's read throughput. But, compared with an alternate method of integrating with OrangeFS (a POSIX kernel interface), write and read throughput are increased by 23% and 7%, respectively.
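The abstract does not spell out the wire protocol, so the framing below is invented purely for illustration: a fixed opcode plus length-prefixed fields over a TCP socket, which is the general shape such a file-system-agnostic API implies:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;

// Hypothetical wire format: [opcode][UTF path][offset][length] -> [length][bytes]
public class SocketFsClient implements AutoCloseable {
    private static final byte OP_READ = 1;
    private final Socket socket;
    private final DataOutputStream out;
    private final DataInputStream in;

    public SocketFsClient(String host, int port) throws IOException {
        socket = new Socket(host, port);
        out = new DataOutputStream(socket.getOutputStream());
        in = new DataInputStream(socket.getInputStream());
    }

    public byte[] read(String path, long offset, int length) throws IOException {
        out.writeByte(OP_READ);
        out.writeUTF(path);
        out.writeLong(offset);
        out.writeInt(length);
        out.flush();
        byte[] data = new byte[in.readInt()];   // server replies with actual length
        in.readFully(data);
        return data;
    }

    @Override
    public void close() throws IOException { socket.close(); }
}
```

A file system only needs to answer this kind of request over TCP to plug into such an API; the Hadoop-side glue stays the same for every backend.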
APA, Harvard, Vancouver, ISO, and other styles
27

Zhao, Pei. "E-CRADLE v1.1 - An improved distributed system for Photovoltaic Informatics." Case Western Reserve University School of Graduate Studies / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=case1449001689.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Sim, Hyogi. "AnalyzeThis: An Analysis Workflow-Aware Storage System." Thesis, Virginia Tech, 2014. http://hdl.handle.net/10919/76927.

Full text
Abstract:
Supercomputing application simulations on hundreds of thousands of cores produce vast amounts of data that need to be analyzed on smaller-scale clusters to glean insights. The process is referred to as an end-to-end workflow. Extant workflow systems are stymied by the storage wall, resulting both from the disk-based parallel file system (PFS) failing to keep pace with the compute and memory subsystems and from the inefficiencies in end-to-end workflow processing. In the post-petaflop era, supercomputers are provisioned with flash devices, as an intermediary between compute nodes and the PFS, enabling novel paradigms not just for expediting I/O, but also for the in-situ analysis of the simulation output data on the flash device. An array of such active flash elements allows us to fundamentally rethink the way data analysis workflows interact with storage systems. By blending the flash storage array and data analysis together in a seamless fashion, we create an analysis workflow-aware storage system, AnalyzeThis. Our guiding principle is that analysis-awareness be deeply ingrained in each and every layer of the storage system—active flash fabric, analysis object abstraction layer, scheduling layer within the storage, and an easy-to-use file system interface—thereby elevating data analyses as first-class citizens. Together, these concepts transform AnalyzeThis into a potent analytics-aware appliance.
Master of Science
APA, Harvard, Vancouver, ISO, and other styles
29

Kerkinos, Ioannis. "Evaluation and benchmarking of Tachyon as a memory-centric distributed storage system for Apache Hadoop." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-189571.

Full text
Abstract:
Hadoop was developed as an open-source software framework that initially leveraged the MapReduce programming model and was therefore able to efficiently analyse and process large datasets. At the core of Hadoop is the Hadoop Distributed File System, or HDFS, which is used as the default storage across the cluster. Hadoop can also be used with other types of storage, with or without HDFS, such as Amazon S3, Windows Azure Storage Blobs, GlusterFS, Tachyon, etc. This thesis focuses on Tachyon, a distributed file system that claims to enable reliable data sharing at memory speed across cluster computing frameworks. We benchmark and evaluate HDFS with and without Tachyon with regard to performance. To do so, we used TestDFSIO as a benchmark to simulate different MapReduce workloads, and an in-production Spark job from Spotify. Tachyon's different write types were also put to the test and evaluated. To see how cloud solutions compare, we perform the same evaluations of Tachyon over Google Cloud Storage.
APA, Harvard, Vancouver, ISO, and other styles
30

Onishchuk, A. "Creating Highly Available Distributed File System for Maui Family Job Schedulers." Thesis, Sumy State University, 2017. http://essuir.sumdu.edu.ua/handle/123456789/55757.

Full text
Abstract:
This article describes a way to implement a distributed file system for the MAUI job scheduler, which solves the problems of low scalability and unreliability of data storage, as well as the problem of data inaccessibility due to failures in software or hardware. An architecture suitable for MAUI GRID systems is suggested.
APA, Harvard, Vancouver, ISO, and other styles
31

Narasimhan, Srivatsan. "Reliable, Efficient and Distributed Cooperative Caching for Improving File System Performance." University of Cincinnati / OhioLINK, 2001. http://rave.ohiolink.edu/etdc/view?acc_num=ucin985880302.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Schleinzer, Benjamin [Verfasser]. "A File System for Wireless Mesh Networks : A New Approach for a Scalable, Secure, and Distributed File System / Benjamin Schleinzer." Aachen : Shaker, 2012. http://d-nb.info/1069044520/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Patil, Swapnil. "Scale and Concurrency of Massive File System Directories." Research Showcase @ CMU, 2013. http://repository.cmu.edu/dissertations/250.

Full text
Abstract:
File systems store data in files and organize these files in directories. Over decades, file systems have evolved to handle increasingly large files: they distribute files across a cluster of machines, they parallelize access to these files, they decouple data access from metadata access, and hence they provide scalable file access for high-performance applications. Sadly, most cluster-wide file systems lack any sophisticated support for large directories. In fact, most cluster file systems continue to use directories that were designed for humans, not for large-scale applications. The former use-case typically involves hundreds of files and infrequent concurrent mutations in each directory, while the latter use-case consists of tens of thousands of concurrent threads that simultaneously create large numbers of small files in a single directory at very high speeds. As a result, most cluster file systems exhibit very poor file create rates in a directory, either due to limited scalability from using a single centralized directory server or due to reduced concurrency from using a system-wide synchronization mechanism. This dissertation proposes a directory architecture called GIGA+ that enables a directory in a cluster file system to store millions of files and sustain hundreds of thousands of concurrent file creations every second. GIGA+ makes two contributions: a concurrent indexing technique to scale out a growing directory on many servers and an efficient layered design to scale up performance. GIGA+ uses a hash-based, incremental partitioning algorithm that enables highly concurrent directory indexing through asynchrony and eventual consistency of the internal indexing state (while providing strong consistency guarantees to the application data). This dissertation analyzes several trade-offs between data migration overhead, load balancing effectiveness, directory scan performance, and entropy of indexing state made by the GIGA+ design, and compares them with policies used in other systems. GIGA+ also demonstrates a modular implementation that separates directory distribution from directory representation. It layers a client-server middleware, which spreads work among many GIGA+ servers, on top of a backend storage system, which manages the on-disk directory representation. This dissertation studies how system behavior is tightly dependent on both the indexing scheme and the on-disk implementations, and evaluates how the system performs for different backend configurations, including local and shared-disk stores. The GIGA+ prototype delivers highly scalable directory performance (that exceeds the most demanding Petascale-era requirements), provides the traditional UNIX file system interface (that can run applications without any modifications) and offers new functionality layered on existing cluster file systems (that lack support for distributed directories).
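A toy version of the hash-splitting idea at the core of GIGA+: entries hash to partitions by their low-order bits, and an overfull partition splits independently by consuming one more bit, without global coordination. The real system adds server mapping, asynchrony, and correction of stale client views:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GigaLikeDirectory {
    static final int SPLIT_THRESHOLD = 4; // tiny, so splits are easy to observe

    /** A partition owns the names whose hash suffix matches (id, depth). */
    static class Partition {
        final int id, depth;
        final List<String> entries = new ArrayList<>();
        Partition(int id, int depth) { this.id = id; this.depth = depth; }
    }

    final Map<Integer, Partition> partitions = new HashMap<>();

    public GigaLikeDirectory() { partitions.put(0, new Partition(0, 0)); }

    Partition find(String name) {
        int h = name.hashCode();
        // Probe from the deepest possible split downwards to the root partition.
        for (int d = 30; d >= 0; d--) {
            Partition p = partitions.get(h & ((1 << d) - 1));
            if (p != null && p.depth == d) return p;
        }
        return partitions.get(0);
    }

    public void create(String name) {
        Partition p = find(name);
        p.entries.add(name);
        if (p.entries.size() > SPLIT_THRESHOLD) split(p);
    }

    /** Split: half the entries move to a sibling partition one bit deeper. */
    void split(Partition p) {
        int newDepth = p.depth + 1;
        Partition stay = new Partition(p.id, newDepth);
        Partition move = new Partition(p.id | (1 << p.depth), newDepth);
        for (String e : p.entries) {
            Partition t = ((e.hashCode() >> p.depth) & 1) == 0 ? stay : move;
            t.entries.add(e);
        }
        partitions.put(stay.id, stay);
        partitions.put(move.id, move);
    }
}
```

Each split is a local decision, so many partitions on many servers can split concurrently; clients with a stale view are simply redirected, which is the eventual consistency of the indexing state the abstract mentions.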
APA, Harvard, Vancouver, ISO, and other styles
35

Clabough, Douglas M. "An electronic calendar system in a distributed UNIX environment." Thesis, Kansas State University, 1986. http://hdl.handle.net/2097/9906.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Hoffman, P. Kuyper. "A file server for the DistriX prototype : a multitransputer UNIX system." Master's thesis, University of Cape Town, 1989. http://hdl.handle.net/11427/17188.

Full text
Abstract:
Bibliography: pages 90-94.
The DISTRIX operating system is a multiprocessor distributed operating system based on UNIX. It consists of a number of satellite processors connected to central servers. The system is derived from the MINIX operating system and is compatible with UNIX Version 7. A remote procedure call interface is used in conjunction with a system-wide, end-to-end communication protocol that connects satellite processors to the central servers. A cached file server provides access to all files and devices at the UNIX system call level. The design of the file server is discussed in depth and its performance evaluated. Additional information is given about the software and hardware used during the development of the project. The MINIX operating system proved to be a good choice as the software base, although certain of its features proved weaker. The Inmos transputer emerges as a processor with many useful features that eased the implementation.
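As a rough illustration of the cached file server idea (the class and block size below are hypothetical, not from the thesis), a read path can route block requests through a cache so repeated accesses avoid disk I/O:

    from functools import lru_cache

    class CachedFileServer:
        """Toy cached file server: reads go through a block cache so
        repeated requests for the same block avoid hitting the disk."""
        BLOCK_SIZE = 4096

        def __init__(self, backing_dir: str):
            self.backing_dir = backing_dir

        @lru_cache(maxsize=1024)   # cache up to 1024 blocks per server
        def read_block(self, path: str, block_no: int) -> bytes:
            with open(f"{self.backing_dir}/{path}", "rb") as f:
                f.seek(block_no * self.BLOCK_SIZE)
                return f.read(self.BLOCK_SIZE)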
APA, Harvard, Vancouver, ISO, and other styles
37

AlShaikh, Raed A. "Towards building a fault tolerant and conflict-free distributed file system for mobile clients." Thesis, University of Ottawa (Canada), 2006. http://hdl.handle.net/10393/27109.

Full text
Abstract:
The rising demand for mobile computing has created a need for improved file systems that support mobile clients. Current file systems with support for mobility provide availability through file replicas that are cached at the client side. However, mobile clients may experience various obstacles with the local cache, such as intermittent connections and serious conflicts when synchronizing back to the server. In this thesis, we present a comprehensive classification of distributed and mobile file systems, and propose a novel distributed file system model for mobile clients with cacheless wireless devices. We discuss the implementation of our model, investigate its high-availability functions, and report on its performance evaluation using a cluster of workstations as a test-bed. Our test results clearly indicate that our technique achieves a significant degree of automation and a conflict-free mobile file system. Last but not least, we propose a novel scheme based on the FBR (frequency-based replacement) scheme and file pre-fetching to enhance the server-side caching strategy of our distributed file system. We present our scheme and discuss its performance evaluation using an extensive set of experiments.
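A minimal sketch of the frequency-based replacement (FBR) idea the scheme builds on, assuming the classic design in which hits in a most-recently-used "new section" do not increase reference counts (all sizes are illustrative, and the thesis's pre-fetching extension is omitted):

    from collections import OrderedDict

    class FBRCache:
        """Simplified FBR cache: blocks carry reference counts, but hits in
        the MRU 'new section' do not bump the count, damping short-term
        locality; eviction picks the least-frequently-used old block."""
        def __init__(self, capacity: int = 100, new_section: int = 25):
            self.capacity, self.new_section = capacity, new_section
            self.blocks = OrderedDict()   # block_id -> ref count, MRU last

        def access(self, block_id):
            if block_id in self.blocks:
                keys = list(self.blocks)
                if block_id not in keys[-self.new_section:]:
                    self.blocks[block_id] += 1   # count only 'old' hits
                self.blocks.move_to_end(block_id)
            else:
                if len(self.blocks) >= self.capacity:
                    # Evict the least-frequently-used among the oldest blocks.
                    old = list(self.blocks.items())[:-self.new_section]
                    victim = min(old, key=lambda kv: kv[1])[0]
                    del self.blocks[victim]
                self.blocks[block_id] = 1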
APA, Harvard, Vancouver, ISO, and other styles
38

Oriani, André 1984. "Uma solução de alta disponibilidade para o sistema de arquivos distribuidos do Hadoop." [s.n.], 2013. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275641.

Full text
Abstract:
Advisor: Islene Calciolari Garcia
Dissertation (Master's) - Universidade Estadual de Campinas, Instituto de Computação
System designers generally adopt cluster-based file systems as the storage solution for high-performance computing environments, because such systems provide data with reliability, consistency, and high throughput. But most of those file systems employ a centralized architecture, which compromises their availability. This work focuses on one such system, the Hadoop Distributed File System (HDFS). A hot standby for the master node of HDFS is proposed in order to bring high availability to the system. The hot standby was achieved by (i) extending the master's state replication performed by its checkpoint helper, the Backup Node; and by (ii) introducing an automatic failover mechanism. Step (i) took advantage of the message duplication technique developed by another high-availability solution for HDFS named AvatarNodes. Step (ii) employed ZooKeeper, a distributed coordination service. This approach resulted in small code changes, around 0.18% of the original code, which makes the solution easy to understand and to maintain. Experiments showed that the overhead implied by replication did not increase the average resource consumption of system nodes by more than 11%, nor did it diminish the data throughput compared to the original version of HDFS. The complete transition to the hot standby can take up to 60 seconds on workloads dominated by I/O operations, but less than 0.4 seconds when metadata requests predominate. These results show that the solution developed in this work achieved its goals: a high-availability solution for HDFS with low overhead and a short reaction time to failures.
Master's
Computer Science
Master in Computer Science
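A hedged sketch of ZooKeeper-based automatic failover of the kind step (ii) describes, using the third-party kazoo Python client (the ensemble address, znode path, and callbacks are illustrative, not the dissertation's implementation): the active master holds an ephemeral znode, and the standby takes over when that znode disappears.

    from kazoo.client import KazooClient
    from kazoo.exceptions import NodeExistsError

    ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"   # assumed ZooKeeper ensemble
    ACTIVE_PATH = "/hdfs/active-namenode"     # illustrative znode path

    def run_failover(my_id: str, become_active):
        zk = KazooClient(hosts=ZK_HOSTS)
        zk.start()

        def try_to_become_active():
            try:
                # An ephemeral znode vanishes when its owner's session dies,
                # signalling that the active master has failed.
                zk.create(ACTIVE_PATH, my_id.encode(),
                          ephemeral=True, makepath=True)
                become_active()
            except NodeExistsError:
                # Another node is active: watch for the znode's deletion.
                @zk.DataWatch(ACTIVE_PATH)
                def on_change(data, stat):
                    if stat is None:           # znode deleted: fail over
                        try_to_become_active()
                        return False           # stop this watch

        try_to_become_active()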
APA, Harvard, Vancouver, ISO, and other styles
39

Lin, Tsai S. (Tsai Shooumeei). "A Highly Fault-Tolerant Distributed Database System with Replicated Data." Thesis, University of North Texas, 1994. https://digital.library.unt.edu/ark:/67531/metadc278403/.

Full text
Abstract:
Because of the high cost and impracticality of a high-connectivity network, most recent research in transaction processing has focused on distributed replicated database systems. In such a system, multiple copies of a data item are created and stored at several sites in the network, so that the system is able to tolerate more crash and communication failures and attain higher data availability. However, the multiple copies also introduce a global inconsistency problem, especially in a partitioned network. In this dissertation a tree quorum algorithm is proposed to solve this problem, imposing a logical tree structure along with dynamic system reconfiguration on all the copies of each data item. The proposed algorithm can be viewed as a dynamic voting technique which, with the help of an appropriate concurrency control algorithm, exhibits the major advantages of quorum-based replica control algorithms and of the available copies algorithm, so that a single copy is read for a read operation and a quorum of copies is written for a write operation. In addition, read and write quorums are computed dynamically and independently. As a result, expensive read operations, like those that require several copies of a data item to be read in most quorum schemes, are eliminated. Furthermore, the message costs of read and write operations are reduced by the use of smaller quorum sizes. Quorum sizes can be reduced to a constant in a lightly loaded system, to log n in a failure-free network, and to ⌈(n+1)/2⌉ in a partitioned network in a heavily loaded system. On average, our algorithm requires fewer messages than the best known tree quorum algorithm, while still maintaining the same upper bound on quorum size. One-copy serializability is guaranteed with higher data availability and the highest degree of fault tolerance (up to n - 1 site failures).
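A small sketch of the logical-tree quorum idea (following the generic tree-quorum construction; the node layout and names are illustrative): a live root alone forms a quorum, and an unreachable node is substituted by a majority of its children, recursively.

    class Node:
        def __init__(self, id, children=()):
            self.id, self.children = id, list(children)

    def tree_quorum(node, alive):
        """Return a set of copy ids forming a quorum, or None if impossible."""
        if node.id in alive:
            return {node.id}
        if not node.children:
            return None
        majority = len(node.children) // 2 + 1
        collected, quorum = 0, set()
        for child in node.children:
            sub = tree_quorum(child, alive)
            if sub is not None:
                quorum |= sub
                collected += 1
                if collected == majority:
                    return quorum
        return None

    # Example: if the root is down, two of its three children suffice.
    a, b, c = Node("a"), Node("b"), Node("c")
    root = Node("r", [a, b, c])
    print(tree_quorum(root, alive={"a", "c"}))   # -> {'a', 'c'}

This shows why a read can often complete with a single live copy (the root), while deeper failures enlarge the quorum gradually rather than jumping straight to a majority of all n copies.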
APA, Harvard, Vancouver, ISO, and other styles
40

Jones, Michael Angus Scott. "Using AFS as a distributed file system for computational and data grids in high energy physics." Thesis, University of Manchester, 2005. http://www.manchester.ac.uk/escholar/uk-ac-man-scw:181210.

Full text
Abstract:
The use of the distributed file system AFS as a solution to the “input/output sandbox” problem in grid computing is studied. A computational grid middleware, designed primarily to accommodate the environment of the BaBar Computing Model, has been written and is presented. A summary of the existing grid middleware and resources is given. A number of benchmarks (one written for this thesis) are used to test the performance of AFS over the wide-area network and grid environment. The performance of AFS is also tested using a straightforward BaBar analysis code on real data. Secure web-based and command-line interfaces created to monitor job submission and grid fabric are presented.
APA, Harvard, Vancouver, ISO, and other styles
41

Cuce, Simon. "GLOMAR : a component based framework for maintaining consistency of data objects within a heterogeneous distributed file system." Monash University, School of Computer Science and Software Engineering, 2003. http://arrow.monash.edu.au/hdl/1959.1/5743.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Lin, Jenglung. "The Implementation and Integration of the Interactive Markup Language to the Distributed Component Object Model Protocol in the Application of Distributed File System Security." NSUWorks, 1999. http://nsuworks.nova.edu/gscis_etd/671.

Full text
Abstract:
This dissertation concerns the implementation and integration of an interactive markup language with the Distributed Component Object Model (DCOM) protocol, applied to modeling distributed file system security. Among the numerous research efforts in network security, the file system usually plays the least important role of the spectrum. From the simple Disk Operating System (DOS) to the modern Network Operating System (NOS), the file system relies only on one or more login passwords to protect it from being misused. Today the most thorough protection scheme for the file system comes from virus protection and removal applications, but these do not prevent a hostile yet well-behaved program from deleting files or formatting the hard disk. There are several network-monitoring systems that provide packet-level examination, although they impose significant degradation in system performance. To accomplish the objective, an interactive markup language is implemented and integrated with the DCOM protocol. The framework is also associated with a network security model for protecting the file system against unfriendly users or programs. The research utilizes a comprehensive set of methods that include software signatures, caller identification, backup of vital files, and encryption of selected system files. It is expected that the results of this work are sufficient for component objects to be implemented that support the integration definitions given in this dissertation. In addition, it is expected that the extensions and techniques defined in this work may find further use in similar theoretical and applied problem domains.
APA, Harvard, Vancouver, ISO, and other styles
43

Venkateswaran, Jayendran. "PRODUCTION AND DISTRIBUTION PLANNING FOR DYNAMIC SUPPLY CHAINS USING MULTI-RESOLUTION HYBRID MODELS." Diss., Tucson, Arizona : University of Arizona, 2005. http://etd.library.arizona.edu/etd/GetFileServlet?file=file:///data1/pdf/etd/azu%5Fetd%5F1185%5F1%5Fm.pdf&type=application/pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Liao, Jhih-Kai, and 廖治凱. "Fault-Tolerant Management Framework for Hadoop Distributed File System." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/20941508217703176221.

Full text
Abstract:
Master's thesis
Tamkang University
Master's Program, Department of Computer Science and Information Engineering
101 (ROC academic year)
Due to the rapid development of the modern Internet, the mode of operation of a large number of applications has changed from a single machine to a cluster of machines over the network. This trend also contributed to the development of cloud computing technology: Google invented the MapReduce framework, the Google File System (GFS), and BigTable, and Yahoo invested in the open-source Hadoop project to implement the technologies proposed by Google. The Hadoop Distributed File System (HDFS) is based on the master/slave model to manage the entire file system. Specifically, a single NameNode acting as the master manages a large number of slaves called DataNodes. Since the NameNode is responsible for maintaining a lot of important metadata, a NameNode crash can render the entire file system unusable; that is, the NameNode forms a Single Point of Failure (SPOF). In addition, in the master/slave model, all requests and responses have to go through the master, so without load sharing the NameNode obviously forms a performance bottleneck. Therefore, in this research we propose to allocate Sub_NameNodes dynamically for each MapReduce job, in order to relieve network congestion and accelerate communication between the master and the slaves. Our approach also reduces the risk of data loss by replicating the metadata to the Sub_NameNodes: if the NameNode fails, its state can be reconstructed from the Sub_NameNodes. The simulation results show a significant reduction in both the number of communication hops and the communication time.
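A hedged sketch of the Sub_NameNode idea as the abstract describes it (the selection policy, class names, and replication details below are our illustrative assumptions, not the thesis's actual design):

    import random

    class DataNode:
        def __init__(self, name):
            self.name = name
            self.replica_metadata = {}

    class NameNode:
        def __init__(self, datanodes):
            self.datanodes = datanodes      # all slaves in the cluster
            self.metadata = {}              # path -> block locations
            self.sub_namenodes = {}         # job_id -> DataNode acting as Sub_NameNode

        def allocate_sub_namenode(self, job_id, job_datanodes):
            # Illustrative policy: let one of the job's own DataNodes double
            # as its Sub_NameNode, keeping master/slave traffic local.
            sub = random.choice(job_datanodes)
            sub.replica_metadata = dict(self.metadata)   # seed the replica
            self.sub_namenodes[job_id] = sub
            return sub

        def update_metadata(self, path, locations):
            self.metadata[path] = locations
            for sub in self.sub_namenodes.values():      # mirror every update
                sub.replica_metadata[path] = locations

        def recover_from_subs(self):
            # After a crash, rebuild NameNode state from the Sub_NameNodes.
            merged = {}
            for sub in self.sub_namenodes.values():
                merged.update(sub.replica_metadata)
            self.metadata = merged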
APA, Harvard, Vancouver, ISO, and other styles
45

CHO, CHIH-YUAN, and 卓志遠. "Performance Comparison of Hadoop Distributed File System and Ceph." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/27230766807802849865.

Full text
Abstract:
Master's thesis
Tunghai University
Department of Computer Science and Information Engineering
102 (ROC academic year)
Cloud computing refers to services available anytime, anywhere, on demand, from any device: a model in which network-provided computer resources, including networks, servers, storage, applications, and services, can be easily accessed according to need. As cloud computing services grow in popularity they produce huge amounts of data, and processing and analyzing massive data has become a key research direction; storing and handling such volumes is infeasible without distributed computing and a distributed file system, which have therefore become a focal point. In this thesis, the open-source Hadoop Distributed File System and Ceph were compared in terms of file upload/download performance, transmission capacity, and fault tolerance across different file sizes. In transmission tests over 60 different file sizes, Ceph clearly outperformed Hadoop in only 2 cases; the rest of the experimental data showed better performance with Hadoop. The more stable and better-performing Hadoop, though still at the proof stage here, has yet to be implemented by the industry. Ceph is not currently recommended for production environments; however, it has great potential for future growth.
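A comparison of this kind needs a timing harness around upload and download; below is a minimal sketch for the HDFS side using the standard hdfs dfs CLI (the Ceph side, e.g. through a CephFS mount, would be measured analogously and is omitted):

    import subprocess, time

    def time_hdfs_upload(local_path: str, hdfs_path: str) -> float:
        """Return seconds taken to upload a file into HDFS."""
        start = time.monotonic()
        subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_path],
                       check=True)
        return time.monotonic() - start

    def time_hdfs_download(hdfs_path: str, local_path: str) -> float:
        """Return seconds taken to download a file from HDFS."""
        start = time.monotonic()
        subprocess.run(["hdfs", "dfs", "-get", hdfs_path, local_path],
                       check=True)
        return time.monotonic() - start

Repeating such measurements over many file sizes and averaging several runs per size is what makes results like the "60 different file sizes" experiment comparable across the two systems.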
APA, Harvard, Vancouver, ISO, and other styles
46

Lin, Ying-Chen, and 林映辰. "A Load-Balancing Algorithm for Hadoop Distributed File System." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/79778516225414074113.

Full text
Abstract:
Master's thesis
Tamkang University
Master's Program, Department of Computer Science and Information Engineering
103 (ROC academic year)
With the advancement of the Internet and increasing data demands, many enterprises are offering cloud services to their customers. Among various cloud computing platforms, the Apache Hadoop project has been widely adopted by many large organizations and enterprises. In the Hadoop ecosystem, the Hadoop Distributed File System (HDFS), Hadoop MapReduce, and HBase are open-source equivalents of Google's Google File System (GFS), MapReduce framework, and BigTable, respectively. To meet the requirement of horizontal scaling of storage in the big data era, HDFS has received significant attention among researchers. HDFS clusters have a master-slave architecture: there is a single NameNode and a number of DataNodes in each cluster. The NameNode is the master, responsible for managing the DataNodes and the client accesses; DataNodes are slaves, responsible for storing data. As the name suggests, HDFS stores files in a distributed manner. Files are divided into fixed-size blocks, and in the default configuration each block has three replicas stored in three different DataNodes to ensure the fault-tolerance capability of HDFS. However, Hadoop's default strategy for allocating new blocks does not take DataNodes' utilization into account, which can lead to load imbalance in HDFS. To cope with the problem, the NameNode has a built-in tool called Balancer, which can be executed by the system administrator. Balancer iteratively moves blocks from DataNodes with high utilization to those with low utilization, to keep each DataNode's disk utilization within a configurable range centered at the average utilization. The primary cost of using Balancer to achieve load balance is the bandwidth consumed during the movement of blocks. Besides, previous research shows that the NameNode is the performance bottleneck of HDFS; frequent execution of Balancer by the NameNode may therefore degrade the performance of HDFS. In this research we design a new load-balancing algorithm that considers the situations that may influence the load-balancing state. In the proposed algorithm a new role named BalanceNode is introduced to help match heavy-loaded and light-loaded nodes, so that light-loaded nodes can take part of the load from heavy-loaded ones. The simulation results show that our algorithm not only achieves a good load-balancing state in HDFS, but does so with minimized movement cost.
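An illustrative sketch of the balancing computation such an algorithm performs (the threshold and pairing policy are our assumptions, not the thesis's BalanceNode protocol): classify DataNodes against the mean utilization plus or minus a threshold, then pair over-utilized sources with under-utilized destinations and plan block moves until both fall within the band.

    def plan_moves(used, capacity, threshold=0.10):
        """used/capacity: dicts mapping DataNode -> bytes.
        Returns (src, dst, bytes) moves that pull outliers toward the mean."""
        mean = sum(used.values()) / sum(capacity.values())
        over = {n: used[n] - (mean + threshold) * capacity[n]
                for n in used if used[n] / capacity[n] > mean + threshold}
        under = {n: (mean - threshold) * capacity[n] - used[n]
                 for n in used if used[n] / capacity[n] < mean - threshold}
        moves = []
        # Drain the most over-utilized nodes first.
        for src, surplus in sorted(over.items(), key=lambda kv: -kv[1]):
            for dst in list(under):
                if surplus <= 0:
                    break
                amount = min(surplus, under[dst])
                moves.append((src, dst, int(amount)))
                surplus -= amount
                under[dst] -= amount
                if under[dst] <= 0:
                    del under[dst]
        return moves

Because the planned bytes moved are exactly the surplus above the band, a tighter threshold yields better balance at the price of more block-transfer bandwidth, which is the trade-off the thesis aims to minimize.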
APA, Harvard, Vancouver, ISO, and other styles
47

Huang, Hsin-Yi, and 黃心怡. "Realizing Prioritized MapReduce Service in Hadoop Distributed File System." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/85801068064786371658.

Full text
Abstract:
Master's thesis
Fu Jen Catholic University
Master's Program, Department of Computer Science and Information Engineering
104 (ROC academic year)
Hadoop is widely used, highly scalable platform software; it is a distributed system that can handle large amounts of data with high fault tolerance. Like other application software, the Hadoop system must be built on an operating system, and must communicate and coordinate with the hardware through the operating system. With the rise of cloud computing and big data, the cloud software platform has become very important for implementing cloud services. Hadoop has a mechanism for allocating resources to submitted work. Job groups submitted under this mechanism are assigned to different levels of resource allocation, and jobs with a high resource allocation may have more chances of getting resources and a higher execution priority than those with a low allocation. However, one cannot give a particular job precedence over other jobs in the same resource-allocation group. When Hadoop is busy, many jobs with the same level of resource allocation wait in line; even for a job with a high resource allocation, there is no guarantee that it can quickly get more resources and complete earlier. This research presents a Hadoop environment in which users can assign different priority levels to different jobs, by adding a priority mechanism to the disk CFQ scheduler and to memory replacement. As a result, the execution of a program with high priority can be accelerated accordingly. In the experiments, we ran multiple Hadoop applications simultaneously to simulate a busy environment, and set specific programs to high priority to see how much faster they execute than with normal priority. Our results show that for programs with high priority, execution time can be reduced by 30% to 80%.
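The thesis adds priority inside the CFQ scheduler and memory replacement themselves; as a rough user-space analogue (not the thesis's kernel modification), Linux's existing ionice and nice tools can already bias CFQ I/O priority and CPU scheduling for a Hadoop job:

    import subprocess

    def run_with_priority(job_cmd, io_level=0, niceness=-5):
        # ionice class 2 is CFQ best-effort; levels run 0 (highest) to 7.
        # Negative niceness values require root privileges.
        cmd = ["ionice", "-c", "2", "-n", str(io_level),
               "nice", "-n", str(niceness)] + list(job_cmd)
        return subprocess.run(cmd, check=True)

    # e.g. run_with_priority(["hadoop", "jar", "wordcount.jar", "in", "out"])

The kernel-level approach in the thesis goes further, since it can prioritize the job's I/O requests across all the processes a MapReduce job spawns, not just the launching command.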
APA, Harvard, Vancouver, ISO, and other styles
48

Fan, Kuo-Zheng, and 范國拯. "Dynamic De-duplication Decision in a Hadoop Distributed File System." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/12180320597103126420.

Full text
Abstract:
Master's thesis
National Dong Hwa University
Department of Computer Science and Information Engineering
101 (ROC academic year)
Nowadays, data is generated and updated every second, which makes coping with such tremendously fast-growing and varied data a serious challenge. The Hadoop Distributed File System (HDFS) is the first-choice solution for most people. However, data loss is usually prevented by keeping many backups, and HDFS does this too. Obviously, these duplicates occupy a lot of storage space, which means sufficient funding must be invested in infrastructure; this is not viable for everybody, since it may be unaffordable. De-duplication technology can improve storage space utilization effectively, and it has been gaining increasing attention in research and products; it is also applied in our implementation. In this paper, we propose a dynamic de-duplication decision, running on HDFS, to improve storage space utilization. Under storage space limitations, the system can formulate a proper de-duplication strategy according to the capability of the cluster and the utilization of the storage space. By doing so, the usage of storage systems can be improved.
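A hedged sketch of a dynamic de-duplication decision of the kind described (the watermark policy and fingerprinting choice are illustrative assumptions, not the thesis's algorithm): blocks are fingerprinted by content hash, and de-duplication is only switched on once storage utilization crosses a watermark, trading CPU for space.

    import hashlib

    class DedupStore:
        def __init__(self, capacity_bytes, dedup_watermark=0.8):
            self.capacity = capacity_bytes
            self.watermark = dedup_watermark   # enable dedup above 80% full
            self.blocks = {}                   # key -> block data
            self.refs = {}                     # key -> reference count
            self.used = 0

        def _dedup_enabled(self):
            return self.used / self.capacity >= self.watermark

        def put(self, data: bytes) -> str:
            if self._dedup_enabled():
                key = hashlib.sha256(data).hexdigest()
                if key in self.blocks:
                    self.refs[key] += 1        # duplicate: store a reference only
                    return key
            else:
                # Space is plentiful: skip hashing and store a plain copy.
                key = f"raw-{len(self.blocks)}"
            self.blocks[key] = data
            self.refs[key] = 1
            self.used += len(data)
            return key

Skipping fingerprinting while space is plentiful keeps the CPU cost near zero on lightly loaded clusters, which matches the abstract's goal of adapting the strategy to the cluster's capability.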
APA, Harvard, Vancouver, ISO, and other styles
49

Queirós, Jorge Afonso Barandas. "Implementing Hadoop distributed file system (hdfs) Cluster for BI Solution." Master's thesis, 2021. https://hdl.handle.net/10216/133038.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Queirós, Jorge Afonso Barandas. "Implementing Hadoop distributed file system (hdfs) Cluster for BI Solution." Dissertation, 2021. https://hdl.handle.net/10216/133038.

Full text
APA, Harvard, Vancouver, ISO, and other styles