Academic literature on the topic 'Big Data, Hadoop, Business Intelligence, MapReduce'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Big Data, Hadoop, Business Intelligence, MapReduce.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Big Data, Hadoop, Business Intelligence, MapReduce"

1

Xu, Yi Qiao. "Massive Data Analysis Based MapReduce Structure on Hadoop System." Advanced Materials Research 981 (July 2014): 262–66. http://dx.doi.org/10.4028/www.scientific.net/amr.981.262.

Abstract:
Massive data analysis is becoming increasingly prominent in a variety of application fields, ranging from scientific studies to business research. In this paper, we demonstrate the necessity and feasibility of using the MapReduce [1] module on a Hadoop system [2]. Furthermore, we used the MapReduce module to implement clustering algorithms [3] on our Hadoop system [4] and sharply improved their efficiency. We show how to design parallel clustering algorithms based on the Hadoop system. Experiments with different data sizes demonstrate that our proposed clustering algorithms perform well on speed-up, scale-up, and size-up, making them suitable for big data mining and analysis.
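As a rough illustration of how one clustering pass maps onto Hadoop, here is a minimal Hadoop Streaming sketch of a single k-means iteration in Python. It is hypothetical code, not the paper's implementation: the CENTROIDS list is a stand-in for centers that a real job would load from a side file, and a driver would rerun the mapper/reducer pair until the centers stop moving.

```python
#!/usr/bin/env python3
# mapper.py -- assign each input point "x y" to its nearest centroid id.
import sys

CENTROIDS = [(0.0, 0.0), (5.0, 5.0)]  # assumed; normally shipped as a side file

def nearest(x, y):
    return min(range(len(CENTROIDS)),
               key=lambda i: (x - CENTROIDS[i][0]) ** 2 + (y - CENTROIDS[i][1]) ** 2)

for line in sys.stdin:
    x, y = map(float, line.split())
    print(f"{nearest(x, y)}\t{x},{y}")
```

```python
#!/usr/bin/env python3
# reducer.py -- average the points per centroid id; Hadoop delivers keys sorted.
import sys
from itertools import groupby

def rows():
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        x, y = map(float, value.split(","))
        yield key, x, y

for key, group in groupby(rows(), key=lambda r: r[0]):
    pts = [(x, y) for _, x, y in group]
    n = len(pts)
    print(f"{key}\t{sum(x for x, _ in pts) / n},{sum(y for _, y in pts) / n}")
```

The pair would be submitted with the standard Hadoop Streaming jar (-mapper mapper.py -reducer reducer.py, plus -input and -output paths); the speed-up the abstract reports comes from distributing the assignment step over the cluster.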
2

Meddah, Ishak H. A., Khaled Belkadi, and Mohamed Amine Boudia. "Parallel Mining Small Patterns from Business Process Traces." International Journal of Software Science and Computational Intelligence 8, no. 1 (January 2016): 32–45. http://dx.doi.org/10.4018/ijssci.2016010103.

Abstract:
Hadoop MapReduce was developed to solve the problem of processing big data in parallel: with this framework, the authors can analyze and process data of large size by distributing the work across a cluster of machines in two main steps, map and reduce. They apply the MapReduce framework to problems in process mining, which provides a bridge between data mining and business process analysis by mining information from process traces. Process mining involves two steps: correlation definition and process inference. The work first mines small patterns, the workflows of the process, from execution traces; these patterns represent the work or history of each part of the process. The authors' small patterns are represented by finite state automata or their regular expressions; only two patterns are used, to simplify the process, and the general representation of the process is the combination of the small mined patterns. The patterns are represented by the regular expressions (ab)* and (ab*c)*. Second, the authors compute the patterns and combine them using the Hadoop MapReduce framework in two general steps: in the map step they mine small patterns (small models) from the business process, and in the reduce step they combine these models. The authors use the business processes of two web applications, Skype and Viber. The overall results show that the parallel distributed process using the Hadoop MapReduce framework is scalable and minimizes execution time.
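To make the map/reduce split concrete, here is a tiny self-contained Python sketch of the idea described above; it is assumed code, not the authors' implementation. The map step tests each execution trace against the two regular expressions (ab)* and (ab*c)*, and the reduce step combines the per-trace matches into one summary of the process.

```python
import re
from collections import Counter

# The two small patterns from the paper, as regular expressions.
PATTERNS = {"(ab)*": re.compile(r"(ab)*"), "(ab*c)*": re.compile(r"(ab*c)*")}

def map_trace(trace):
    """Map step: emit (pattern, 1) for each small pattern matching a whole trace."""
    return [(name, 1) for name, rx in PATTERNS.items() if rx.fullmatch(trace)]

def reduce_counts(pairs):
    """Reduce step: combine per-trace matches into a single model summary."""
    totals = Counter()
    for name, n in pairs:
        totals[name] += n
    return totals

traces = ["abab", "abbbc", "ababab", "ac"]  # toy stand-ins for real process traces
pairs = [kv for t in traces for kv in map_trace(t)]
print(reduce_counts(pairs))  # Counter({'(ab)*': 2, '(ab*c)*': 2})
```

In a real Hadoop job the traces would be split across mappers and the Counter merge would run as the reduce phase; the sequential version above only shows the logic.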
3

Srinivasan, Sujatha, and T. Thirumalai Kumari. "Big data analytics tools a review." International Journal of Engineering & Technology 7, no. 3.3 (June 8, 2018): 685. http://dx.doi.org/10.14419/ijet.v7i2.33.15476.

Abstract:
Big data is the hottest trending term all over the globe and the internet. Big organizations are trying to make use of the large amounts of data they have collected and stored in big memory storage. Large amounts of data are also being produced every millisecond all over the world by users of computing devices, by satellites of all kinds, by scientific research, by governments, and by big organizations that deal with huge numbers of customers, especially financial institutions. These data lie there for exploration and exploitation, to gain knowledge, or rather intelligence, and to turn it into wisdom for better decision making. Traditional data mining tools are not able to handle this big data. Hadoop and MapReduce are the first tools of their kind used to handle big data. Additional data mining and machine learning capabilities have been added to Hadoop and MapReduce through various plug-ins, by open source projects as well as vendor tools, for big data analytics (BDA). Further, big organizations have created, or are in the process of creating, BDA tools, most of which come with a price tag. This study gives a short review of the available BDA tools, taking into consideration their different characteristics. Possible solutions for existing challenges related to big data analytics are discussed.
4

Chiang, Dai-Lun, Sheng-Kuan Wang, Yu-Ying Wang, Yi-Nan Lin, Tsang-Yen Hsieh, Cheng-Ying Yang, Victor R. L. Shen, and Hung-Wei Ho. "Modeling and Analysis of Hadoop MapReduce Systems for Big Data Using Petri Nets." Applied Artificial Intelligence 35, no. 1 (November 14, 2020): 80–104. http://dx.doi.org/10.1080/08839514.2020.1842111.

5

Meddah, Ishak H. A., Khaled Belkadi, and Mohamed Amine Boudia. "Efficient Implementation of Hadoop MapReduce based Business Process Dataflow." International Journal of Decision Support System Technology 9, no. 1 (January 2017): 49–60. http://dx.doi.org/10.4018/ijdsst.2017010104.

Abstract:
Hadoop MapReduce is one of the solutions for processing large and big data; with it the authors can analyze and process data by distributing the computation across a large set of machines. Process mining provides an important bridge between data mining and business process analysis; its techniques allow mining information from event logs. First, the work mines small patterns from log traces; these patterns are the workflows of the execution traces of the business process. The authors' work improves on existing techniques, which mine only one general workflow; the workflow represents the general traces of two web applications. The patterns are represented by finite state automata, and the final model is the combination of only two types of patterns, represented by regular expressions. Second, the authors compute these patterns in parallel and then combine them using MapReduce: in the map step they mine patterns from execution traces, and in the reduce step they combine these small patterns. The results are promising; they show that the approach is scalable, general, and precise, and that it reduces execution time through the use of the Hadoop MapReduce framework.
6

Wang, C., F. Hu, X. Hu, S. Zhao, W. Wen, and C. Yang. "A HADOOP-BASED DISTRIBUTED FRAMEWORK FOR EFFICIENT MANAGING AND PROCESSING BIG REMOTE SENSING IMAGES." ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences II-4/W2 (July 10, 2015): 63–66. http://dx.doi.org/10.5194/isprsannals-ii-4-w2-63-2015.

Abstract:
Various sensors on airborne and satellite platforms are producing large volumes of remote sensing images for mapping, environmental monitoring, disaster management, military intelligence, and other uses. However, it is challenging to efficiently store, query, and process such big data due to data- and computing-intensive issues. In this paper, a Hadoop-based framework is proposed to manage and process big remote sensing data in a distributed and parallel manner. In particular, remote sensing data can be directly fetched from other data platforms into the Hadoop Distributed File System (HDFS). The Orfeo Toolbox, a ready-to-use tool for large image processing, is integrated into MapReduce to provide a rich set of image processing operations. With the integration of HDFS, the Orfeo Toolbox, and MapReduce, these remote sensing images can be processed directly and in parallel in a scalable computing environment. The experimental results show that the proposed framework can efficiently manage and process such big remote sensing data.
7

Tyagi, Adhishtha, and Sonia Sharma. "A Framework of Security and Performance Enhancement for Hadoop." International Journal of Advanced Research in Computer Science and Software Engineering 7, no. 7 (July 30, 2017): 437. http://dx.doi.org/10.23956/ijarcsse/v7i6/0171.

Abstract:
The Hadoop framework has emerged as the most effective and widely adopted framework for big data processing, and the MapReduce programming model is used for processing as well as generating large data sets. Data security has become an important issue as far as storage is concerned: Hadoop has no security mechanism by default, yet it is the first choice of business analysts and industrialists for storing and managing data, so security solutions need to be introduced to protect important data in the Hadoop environment. We implemented and evaluated the Dynamic Task Splitting Scheduler (DTSS), which explores the tradeoffs between fairness and performance by splitting tasks dynamically before processing in Hadoop, together with AES-MR (an Advanced Encryption Standard based encryption using MapReduce) in the MapReduce paradigm. This paper should be useful to beginners and researchers for understanding DTSS scheduling along with security.
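To show what encrypting records inside the map phase can look like, here is a short PyCryptodome sketch; it is an illustrative stand-in for the AES-MR idea, not the paper's scheme, and the locally generated key is an assumption (a real deployment would fetch key material from a key-management service).

```python
from Crypto.Cipher import AES          # PyCryptodome
from Crypto.Random import get_random_bytes

KEY = get_random_bytes(16)  # assumed local key; replace with managed key material

def map_encrypt(record: bytes) -> bytes:
    """Map step: encrypt one record, packing nonce + tag + ciphertext together."""
    cipher = AES.new(KEY, AES.MODE_EAX)
    ciphertext, tag = cipher.encrypt_and_digest(record)
    return cipher.nonce + tag + ciphertext

def decrypt(blob: bytes) -> bytes:
    """Inverse used on the read path; verifies the authentication tag."""
    nonce, tag, ciphertext = blob[:16], blob[16:32], blob[32:]
    cipher = AES.new(KEY, AES.MODE_EAX, nonce=nonce)
    return cipher.decrypt_and_verify(ciphertext, tag)

blob = map_encrypt(b"customer,42,confidential")
assert decrypt(blob) == b"customer,42,confidential"
```

Running encryption in the mappers parallelizes it across the cluster, which is the point of pairing AES with MapReduce in the first place.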
8

Song, Miao Miao, Zhe Li, Bin Zhou, and Chao Ling Li. "Cloud Computing Model for Big Geological Data Processing." Applied Mechanics and Materials 475-476 (December 2013): 306–11. http://dx.doi.org/10.4028/www.scientific.net/amm.475-476.306.

Abstract:
Geological data come in many varieties and in huge, complex formats; the analysis and processing of geological data is mainly divided into three parts: mine forecasting, mine evaluation, and mine positioning. The traditional geological data analysis model is limited by storage space and computational efficiency and cannot meet the need for fast operations on large amounts of geological data. Big data technology provides an ideal solution for the management, information extraction, and comprehensive analysis of vast amounts of geological data. To obtain the mass storage capacity and high-speed computing power that big data technology requires, we built an intelligent system for the analysis of geological data based on a cloud computing model with double parallel processing, combining MapReduce and GPUs. A Hadoop cluster system solves the problem of storing large amounts of geological data, and an efficient parallel processing method based on GPUs (Graphics Processing Units) is designed and applied within the MapReduce framework, completing the MapReduce-plus-GPU double parallel processing cloud computing model and improving the operating speed of the system. Theoretical modeling and experimental verification indicate that the system can meet the requirements of geological data analysis in operational precision, data volume, and speed.
9

Manogaran, Gunasekaran, and Daphne Lopez. "Disease Surveillance System for Big Climate Data Processing and Dengue Transmission." International Journal of Ambient Computing and Intelligence 8, no. 2 (April 2017): 88–105. http://dx.doi.org/10.4018/ijaci.2017040106.

Abstract:
Ambient intelligence is an emerging platform that provides advances in sensors and sensor networks, pervasive computing, and artificial intelligence to capture real-time climate data. This continuously generates several exabytes of unstructured sensor data, often called big climate data. Nowadays, researchers are trying to use big climate data to monitor and predict climate change and possible diseases. Traditional data processing techniques and tools are not capable of handling such a huge amount of climate data, so there is a need to develop an advanced big data architecture for processing real-time climate data. The purpose of this paper is to propose a big data based surveillance system that analyzes spatial climate big data and performs continuous monitoring of the correlation between climate change and Dengue. The proposed disease surveillance system has been implemented with the help of Apache Hadoop MapReduce and its supporting tools.
10

Bu, Lingrui, Hui Zhang, Haiyan Xing, and Lijun Wu. "Research on parallel data processing of data mining platform in the background of cloud computing." Journal of Intelligent Systems 30, no. 1 (January 1, 2021): 479–86. http://dx.doi.org/10.1515/jisys-2020-0113.

Abstract:
The efficient processing of large-scale data has very important practical value. In this study, a data mining platform based on the Hadoop distributed file system was designed, and the K-means algorithm was improved with the idea of max-min distance. On the Hadoop distributed file system platform, parallelization was realized with MapReduce. Finally, the data processing performance of the algorithm was analyzed on the Iris data set. The results showed that the parallel algorithm classified more samples correctly than the traditional algorithm; in a single-machine environment, the parallel algorithm ran longer; in the face of large data sets, the traditional algorithm ran out of memory while the parallel algorithm completed the calculation; and the speed-up of the parallel algorithm rose with the expansion of cluster size and data set size, showing a good parallel effect. The experimental results verify the reliability of the parallel algorithm in big data processing and contribute to further improving the efficiency of data mining.
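The max-min distance idea the abstract mentions is a seeding heuristic for K-means: after the first center, each new initial center is the point whose distance to its nearest already-chosen center is largest. A minimal sketch under assumed 2-D points:

```python
import math

def maxmin_centers(points, k):
    """Pick k initial centers by repeatedly taking the point farthest
    (in nearest-center distance) from the centers chosen so far."""
    centers = [points[0]]  # assumed: start from the first point
    while len(centers) < k:
        gap = lambda p: min(math.dist(p, c) for c in centers)
        centers.append(max(points, key=gap))
    return centers

pts = [(0, 0), (0.5, 0.2), (9, 9), (8.8, 9.1), (4, 5)]
print(maxmin_centers(pts, 3))  # [(0, 0), (9, 9), (4, 5)]
```

Spreading the initial centers out this way avoids the poor random seeds that make plain K-means converge to bad partitions, which is what the improved accuracy reported above relies on.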

Dissertations / Theses on the topic "Big Data, Hadoop, Business Intelligence, MapReduce"

1

Marchi, Francesca. "Progettazione e sviluppo di una soluzione Hadoop per il calcolo di Big Data Analytics." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2015. http://amslaurea.unibo.it/8591/.

Abstract:
This thesis concerns the design and development of a Hadoop solution for computing Big Data analytics. Within a bottle-cooler monitoring project, the need to process continuously growing volumes of data required a solution capable of replacing traditional ETL techniques, which are no longer sufficient for processing Big Data. The aim of the thesis is to evaluate and compare the processing performance obtained, on one hand, by the traditional ETL flow and, on the other, by the Hadoop solution implemented on the MapReduce framework.
2

Besson, Henrik. "Konsulters beskrivning av Big Data och dess koppling till Business Intelligence." Thesis, Linnéuniversitetet, Institutionen för datavetenskap, fysik och matematik, DFM, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-22747.

Abstract:
Most of us constantly come into contact with various data flows, which have become a natural part of today's information society. Companies operate in a constantly changing environment, and the handling of data and information has become an increasingly important competitive factor as the total amount of data in the digital world has grown sharply in recent years. A term for gigantic data volumes is Big Data, which has become a popular concept in the IT industry. Big Data brings entirely new possibilities for analysis, but many companies have turned out to be worried about how to handle and make use of the growing data volumes. The purpose of this study has been to contribute knowledge to the relatively unexplored field of Big Data, using an inductive approach based on interviews. The problems that come with Big Data are usually described from three perspectives: data occurs in large volumes, with varying data types and sources, and data is generated at varying velocity. The results of the study show that Big Data as a concept touches many different areas, and its meaning, capability, ambition, and scope can vary greatly between companies in different industries. Traditional technologies for data storage and extraction are not sufficient to handle the data referred to as Big Data. As new technology has been developed and older solutions upgraded, it has become possible to view information management and analysis from entirely new perspectives. Since Big Data essentially has the same purpose as Business Intelligence, these solutions can suitably be integrated. A major challenge with Big Data is that it is not possible to know exactly what data collection and analysis will achieve; after data has been collected, a business case should be developed with guidelines for what is to be achieved. There is great potential in this growing but still relatively immature market. Information management will become increasingly important, and for companies it is a matter of keeping up with rapid development and gaining a good understanding of new trends in the IT world.
3

Miloš, Marek. "Nástroje pro Big Data Analytics." Master's thesis, Vysoká škola ekonomická v Praze, 2013. http://www.nusl.cz/ntk/nusl-199274.

Abstract:
The thesis covers the specific data analysis term Big Data. It first defines the term and explains why it arose from the rising need for deeper data processing and analysis tools and methods. The thesis also covers some technical aspects of Big Data tools, focusing on Apache Hadoop in detail. The later chapters contain a Big Data market analysis and describe the biggest Big Data competitors and tools. The practical part of the thesis presents a way of using Apache Hadoop to analyze data from Twitter, with the results visualized in Tableau.
4

Šoltýs, Matej. "Big Data v technológiách IBM." Master's thesis, Vysoká škola ekonomická v Praze, 2014. http://www.nusl.cz/ntk/nusl-193914.

Abstract:
This diploma thesis presents Big Data technologies and their possible use cases and applications. The theoretical part initially focuses on the definition of the term Big Data and afterwards on Big Data technology, particularly the Hadoop framework. The principles of Hadoop, such as distributed storage and data processing, and its individual components are described, and the largest vendors of Big Data technologies are presented. At the end of this part, possible use cases of Big Data technologies and some case studies are described. The practical part describes the implementation of a demo example of Big Data technologies and is divided into two chapters: the first deals with the conceptual design of the demo example, the products used, and the architecture of the solution, and the second describes the implementation itself, from preparation of the demo environment to the creation of applications. The goals of this thesis are the description and characterization of Big Data, the presentation of the largest vendors and their Big Data products, the description of possible use cases of Big Data technologies, and especially the implementation of a demo example in Big Data tools from IBM.
5

Firsov, Vitaly. "Big Data a jejích potenciál pro bankovní sektor." Master's thesis, Vysoká škola ekonomická v Praze, 2013. http://www.nusl.cz/ntk/nusl-165114.

Abstract:
In this thesis, I explore modern (2012/2013) trends in Business Intelligence and focus specifically on the rapidly evolving and, in my opinion, very promising area of the analysis and use of Big Data in large enterprises. The introductory part contains general information and formal matters such as the aims of the work, its intended audience, and where it could be used; it also describes inputs and outputs, structure, methods for achieving the objectives, potential benefits, and limitations. Because I also work as a data analyst at the largest bank in the Czech Republic, Czech Savings Bank, I focus on the use of Big Data in banking, where I believe great benefits can be achieved from collecting and analyzing Big Data. The thesis itself is divided into three parts (chapters 2, 3-4, and 5). The second chapter covers how the area of BI developed historically, what BI is today, and what future experts such as the world-famous and respected analyst firm Gartner predict for it. The third chapter focuses on Big Data itself: what the term means, how Big Data differs from traditional business information available from ERP, ECM, DMS, and other enterprise systems, ways to store and process this type of data, and existing technologies for Big Data analysis. In the fourth chapter I focus on the use of Big Data in business; this chapter reflects my personal views on the potential of Big Data, based on my experience at Czech Savings Bank. The final part summarizes the thesis, assesses how well I fulfilled the objectives defined at the beginning, and expresses my opinion on the prospects of the Big Data analytics trend, based on the information and knowledge analyzed while writing this thesis.
6

Kiška, Vladislav. "Integrace Big Data a datového skladu." Master's thesis, Vysoká škola ekonomická v Praze, 2017. http://www.nusl.cz/ntk/nusl-359181.

Abstract:
This master thesis deals with the problem of data integration between a Big Data platform and an enterprise data warehouse. The main goal is to create a complete transfer system to move data from a data warehouse to this platform using a suitable tool, and to store and manage all metadata about previous transfers. The theoretical part describes the concepts of Big Data, gives a brief introduction to their history, and presents the factors that led to the need for this new approach. The next chapters describe the main principles and attributes of these technologies and discuss the benefits of their implementation within an enterprise. The thesis also describes the technologies known as Business Intelligence, their typical use cases, and their relation to Big Data. A short chapter presents the main components of the Hadoop system and the most popular related applications. The practical part consists of the implementation of a system to execute and manage transfers from a traditional relational database, here representing a data warehouse, to a cluster of computers running Hadoop. This part also includes a summary of the applications most commonly used to move data into Hadoop and the design of a metadata database schema used to manage these transfers and store transfer metadata.
7

Chiossi, Antony. "Progettazione e prototipazione di un sistema di Social Business Intelligence con Hadoop Impala." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2015. http://amslaurea.unibo.it/9683/.

Abstract:
This thesis concerns the performance analysis and porting of an SBI (Social Business Intelligence) system to Cloudera's Hadoop distribution. Specifically, the data of the WebPolEU project were ported, and the performance of the Impala query engine was then compared with that of ElasticSearch which, unlike Oracle, runs on the same hardware (cluster).
8

Brotánek, Jan. "Apache Hadoop jako analytická platforma." Master's thesis, Vysoká škola ekonomická v Praze, 2017. http://www.nusl.cz/ntk/nusl-358801.

Abstract:
This diploma thesis focuses on integrating the Hadoop platform into a current data warehouse architecture. In the theoretical part, the properties of Big Data are described together with methods and models for processing them. The Hadoop framework, its components, and its distributions are discussed, as well as the components that enable end users, developers, and analysts to access a Hadoop cluster. The practical part presents a case study of batch data extraction from a current data warehouse on the Oracle platform with the aid of the Sqoop tool, the transformation of the data into relational structures in the Hive component, and the uploading of the data back to the original source. The compression of data and the efficiency of queries depending on various storage formats are also discussed, and the quality and consistency of the manipulated data are checked during all phases of the process. Part of the practical section discusses ways of storing and capturing stream data: the Flume tool is used to capture stream data, which are then transformed with the Pig tool. The purpose of the implemented process is to move part of the data and its processing from the current data warehouse to the Hadoop cluster; therefore, a process for integrating the current data warehouse with the Hortonworks Data Platform and its components was designed.
9

Silva Neto, Arlindo Rodrigues da. "GoldBI: uma solução de Business Intelligence como serviço." Programa de Pós-Graduação em Engenharia de Software, 2016. https://repositorio.ufrn.br/jspui/handle/123456789/22304.

Abstract:
This work creates a BI (Business Intelligence) tool available in the cloud through SaaS (Software as a Service), using ETL (Extract, Transform, Load) techniques and Big Data technologies, with the intention of facilitating decentralized extraction and the processing of data in large quantities. Currently, it is practically impossible to conduct a consistent analysis without the aid of software for reporting and statistics; achieving concrete results for decision making requires data analysis strategies and consolidated variables. From this standpoint, this study emphasizes Business Intelligence (BI) in order to simplify the analysis of management information and statistics and to provide indicators through graphs or dynamic lists of management data. With the exponential growth of data, it becomes increasingly difficult to obtain results quickly and consistently, making it necessary to work with new techniques and tools for large-scale data processing. This work is technical in nature, creating a Software Engineering product grounded in the state of the art of the area and in a comparison with the main tools on the market, showing the advantages and disadvantages of the created solution.
10

Ghesmoune, Mohammed. "Apprentissage non supervisé de flux de données massives : application aux Big Data d'assurance." Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCD061/document.

Abstract:
The research outlined in this thesis concerns the development of approaches based on growing neural gas (GNG) for the clustering of data streams. We propose three algorithmic extensions of the GNG approach: sequential, distributed and parallel, and hierarchical, as well as a model for scalability using MapReduce and its application to learning clusters from real insurance Big Data in the form of a data stream. We first propose the G-Stream method. G-Stream, as a "sequential" clustering method, is a one-pass data stream clustering algorithm that allows us to discover clusters of arbitrary shapes without any assumptions on the number of clusters. G-Stream uses an exponential fading function to reduce the impact of old data whose relevance diminishes over time. The links between the nodes are also weighted, and a reservoir is used to hold distant observations temporarily in order to reduce the movements of the nearest nodes toward those observations. The batchStream algorithm is a micro-batch method for clustering data streams which defines a new cost function taking into account that subsets of observations arrive in discrete batches. The minimization of this function, which leads to a topological clustering, is carried out using dynamic clusters in two steps: an assignment step which assigns each observation to a cluster, followed by an optimization step which computes the prototype for each node. A scalable model using MapReduce is then proposed: it consists of decomposing the data stream clustering problem into the elementary functions Map and Reduce, so that the observations received in each sub-dataset (within a time interval) are processed through deterministic parallel operations to produce the intermediate states or the final clusters. The batchStream algorithm is validated on the insurance Big Data, and a predictive analysis system is proposed by combining the clustering results of batchStream with decision trees. The architecture and these different modules form the computational core of our Big Data project, called Square Predict. GH-Stream, our third extension, uses a hierarchical and topological structure for both visualization and clustering tasks.
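Since the thesis implements its MapReduce decomposition on the Spark platform, a minimal PySpark sketch of one micro-batch update may help: assignment happens in the map, prototype updates in the reduce. This is illustrative only and omits the fading function and reservoir described above.

```python
from pyspark import SparkContext

def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

sc = SparkContext("local[*]", "micro-batch-clustering-sketch")
centers = [(0.0, 0.0), (10.0, 10.0)]            # current prototypes
batch = [(1.0, 1.0), (0.0, 2.0), (9.0, 11.0)]   # one micro-batch of observations

new_centers = (sc.parallelize(batch)
    .map(lambda p: (nearest(p, centers), (p, 1)))                 # Map: assign
    .reduceByKey(lambda a, b:
        (tuple(x + y for x, y in zip(a[0], b[0])), a[1] + b[1]))  # sum, count
    .mapValues(lambda s: tuple(x / s[1] for x in s[0]))           # Reduce: update
    .collect())
print(new_centers)  # e.g. [(0, (0.5, 1.5)), (1, (9.0, 11.0))]
sc.stop()
```

Each arriving batch repeats this cycle against the latest prototypes, which is the essence of the batchStream design.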

Books on the topic "Big Data, Hadoop, Business Intelligence, MapReduce"

1

Big Data Analytics with Microsoft HDInsight in 24 Hours, Sams Teach Yourself: Big Data, Hadoop, and Microsoft Azure for Better Business Intelligence. Pearson Education, 2015.

2

Russell, John. Getting Started with Impala: Interactive SQL for Apache Hadoop. O'Reilly Media, Incorporated, 2014.


Book chapters on the topic "Big Data, Hadoop, Business Intelligence, MapReduce"

1

Furtado, Pedro. "Scalability and Realtime on Big Data, MapReduce, NoSQL and Spark." In Business Intelligence, 79–104. Cham: Springer International Publishing, 2017. http://dx.doi.org/10.1007/978-3-319-61164-8_4.

2

Savvas, Ilias K., Georgia N. Sofianidou, and M.-Tahar Kechadi. "Applying the K-Means Algorithm in Big Raw Data Sets with Hadoop and MapReduce." In Business Intelligence, 1220–43. IGI Global, 2016. http://dx.doi.org/10.4018/978-1-4666-9562-7.ch062.

Abstract:
Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for the distributed processing of large data sets, HDFS is a distributed file system that provides high-throughput access for data-driven applications, and MapReduce is a software framework for the distributed computing of large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular techniques of data mining is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results of the technique prove its efficiency; thus, HDFS and MapReduce can be applied to big data with very promising results.
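A standard refinement for MapReduce K-means of the kind this chapter studies is to emit per-centroid partial sums from each mapper (a combine step), so that only k small records per mapper cross the network instead of every point. A small sketch under assumed 2-D data, not the chapter's code:

```python
from collections import defaultdict

def combine(assigned):
    """Local combine step: per centroid id, fold raw points into (sum, count)."""
    acc = defaultdict(lambda: [0.0, 0.0, 0])
    for cid, (x, y) in assigned:
        acc[cid][0] += x
        acc[cid][1] += y
        acc[cid][2] += 1
    return {cid: ((sx, sy), n) for cid, (sx, sy, n) in acc.items()}

def reduce_centers(partials):
    """Reduce step: merge partial sums from all mappers into new centroids."""
    acc = defaultdict(lambda: [0.0, 0.0, 0])
    for part in partials:
        for cid, ((sx, sy), n) in part.items():
            acc[cid][0] += sx
            acc[cid][1] += sy
            acc[cid][2] += n
    return {cid: (sx / n, sy / n) for cid, (sx, sy, n) in acc.items()}

m1 = combine([(0, (1.0, 2.0)), (0, (3.0, 4.0)), (1, (10.0, 10.0))])
m2 = combine([(1, (12.0, 14.0))])
print(reduce_centers([m1, m2]))  # {0: (2.0, 3.0), 1: (11.0, 12.0)}
```

Because the mean is just a ratio of sums, the combiner changes nothing about the result while cutting shuffle traffic, which is where much of the efficiency in distributed K-means comes from.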
3

Samadi, Yassir, Mostapha Zbakh, and Amine Haouari. "Big Data Processing on Cloud Computing Using Hadoop Mapreduce and Apache Spark." In Advances in Business Information Systems and Analytics, 224–50. IGI Global, 2018. http://dx.doi.org/10.4018/978-1-5225-3038-1.ch009.

Abstract:
The size of the data used by enterprises has been growing at exponential rates for the last few years, and handling such huge data from various sources is a challenge for businesses. In addition, Big Data has become one of the major areas of research for cloud service providers, due to the large amount of data produced every day and the inefficiency of traditional algorithms and technologies at handling these large amounts of data. To resolve the aforementioned problems and to meet the increasing demand for high-speed, data-intensive computing, several solutions have been developed by researchers and developers. Among these solutions are cloud computing tools such as Hadoop MapReduce and Apache Spark, which work on the principles of parallel computing. This chapter focuses on how big data processing challenges can be handled by using cloud computing frameworks, and on the importance of cloud computing for businesses.
4

Manogaran, Gunasekaran, and Daphne Lopez. "Disease Surveillance System for Big Climate Data Processing and Dengue Transmission." In Web Services, 490–509. IGI Global, 2019. http://dx.doi.org/10.4018/978-1-5225-7501-6.ch028.

Abstract:
Ambient intelligence is an emerging platform that provides advances in sensors and sensor networks, pervasive computing, and artificial intelligence to capture real-time climate data. This continuously generates several exabytes of unstructured sensor data, often called big climate data. Nowadays, researchers are trying to use big climate data to monitor and predict climate change and possible diseases. Traditional data processing techniques and tools are not capable of handling such a huge amount of climate data, so there is a need to develop an advanced big data architecture for processing real-time climate data. The purpose of this paper is to propose a big data based surveillance system that analyzes spatial climate big data and performs continuous monitoring of the correlation between climate change and Dengue. The proposed disease surveillance system has been implemented with the help of Apache Hadoop MapReduce and its supporting tools.
5

Pal, Kamalendu. "Quality Assurance Issues for Big Data Applications in Supply Chain Management." In Predictive Intelligence Using Big Data and the Internet of Things, 51–76. IGI Global, 2019. http://dx.doi.org/10.4018/978-1-5225-6210-8.ch003.

Abstract:
Heterogeneous data types, widely distributed data sources, huge data volumes, and large-scale business-alliance partners describe typical global supply chain operational environments. Mobile and wireless technologies add an extra layer of data sources to this technology-enriched supply chain operation, and the environment also needs to provide its end users with access to data anywhere, anytime. This new type of data set originating from the global retail supply chain is commonly known as big data because of its huge volume, resulting from the velocity with which it arrives in the global retail business environment. Such environments empower, and necessitate, decision makers to act or react more quickly to all decision tasks. Academics and practitioners are researching and building the next generation of big-data-based application software systems, based on complex data analysis algorithms (i.e., on data that does not adhere to standard relational data models). Traditional software testing methods are insufficient for big-data-based applications, and testing them is one of the biggest challenges faced by modern software design and development communities because of a lack of knowledge about what to test and how much data to test. Developers of big-data-based applications face a daunting task in defining the best strategies for structured and unstructured data validation, setting up an optimal test environment, and working with non-relational database testing approaches. This chapter focuses on big-data-based software testing and quality-assurance-related issues in the context of Hadoop, an open source framework. It includes discussion of several challenges with respect to massively parallel data generation from multiple sources, testing methods for validation of pre-Hadoop processing, software application quality factors, and some of the software testing mechanisms for this new breed of applications.
6

Jakobczak, Dariusz Jacek, and Ahan Chatterjee. "The Rise of “Big Data” in the Field of Cloud Analytics." In Advances in Data Mining and Database Management, 204–25. IGI Global, 2021. http://dx.doi.org/10.4018/978-1-7998-4706-9.ch008.

Abstract:
The huge burst of data that occurred with the arrival of affordable internet access led to the rise of the cloud computing market, which stores this data, and obtaining results from these data led to the growth of the "big data" industry, which analyzes this humongous amount of data and derives conclusions using various algorithms. Hadoop, as a big data platform, uses the MapReduce framework to produce analysis reports over big data. The term "big data" can be defined as a modern technique to store, capture, and manage data sets on the scale of petabytes or larger, with high velocity and varied structure. Addressing this massive growth of data requires huge computing capacity to ensure fruitful results through data processing, and cloud computing is a technology that can perform large-scale and very complex computation. Cloud analytics enables organizations to perform better business intelligence, data warehouse operations, and online analytical processing (OLAP).

Conference papers on the topic "Big Data, Hadoop, Business Intelligence, MapReduce"

1

Akthar, Nadeem, Mohd Vasim Ahamad, and Shahbaz Khan. "Clustering on Big Data Using Hadoop MapReduce." In 2015 International Conference on Computational Intelligence and Communication Networks (CICN). IEEE, 2015. http://dx.doi.org/10.1109/cicn.2015.161.

2

Paul, Rajdeep. "Big data analysis of Indian premier league using Hadoop and MapReduce." In 2017 International Conference on Computational Intelligence in Data Science (ICCIDS). IEEE, 2017. http://dx.doi.org/10.1109/iccids.2017.8272628.

3

"Changing Paradigms of Technical Skills for Data Engineers." In InSITE 2018: Informing Science + IT Education Conferences: La Verne California. Informing Science Institute, 2018. http://dx.doi.org/10.28945/4001.

Abstract:
Aim/Purpose: [This Proceedings paper was revised and published in the 2018 issue of the journal Issues in Informing Science and Information Technology, Volume 15.] This paper investigates the new technical skills that are needed for Data Engineering. Past research is compared to new research, which creates a list of the top 20 technical skills required by a Data Engineer. The growing availability of Data Engineering jobs is discussed. The research methodology describes the gathering of sample data and then the use of Pig and MapReduce on AWS (Amazon Web Services) to count occurrences of Data Engineering technical skills from 100 Indeed.com job advertisements in July 2017.
Background: A decade ago, Data Engineering relied heavily on the technology of Relational Database Management Systems (RDBMS). For example, Grisham, P., Krasner, H., and Perry, D. (2006) described an Empirical Software Engineering Lab (ESEL) that introduced relational database concepts to students with hands-on learning that they called "Data Engineering Education with Real-World Projects." However, as seismic improvements occurred in the processing of large distributed datasets, big data analytics has moved into the forefront of the IT industry. As a result, the definition of Data Engineering has broadened and evolved to include newer technology that supports the distributed processing of very large amounts of data (e.g., the Hadoop ecosystem and NoSQL databases). This paper examines the technical skills that are needed to work as a Data Engineer in today's rapidly changing technical environment. Research is presented that reviews 100 job postings for Data Engineers from Indeed (2017) during July 2017 and then ranks the technical skills in order of importance. The results are compared to earlier research by Stitch (2016) that ranked the top technical skills for Data Engineers in 2016, using LinkedIn to survey 6,500 people who identified themselves as Data Engineers.
Methodology: A sample of 100 Data Engineering job postings was collected and analyzed from Indeed during July 2017. The job postings were pasted into a text file, and related words were grouped together to make phrases. For example, the word "data" was put into context with other related words to form phrases such as "Big Data", "Data Architecture", and "Data Engineering". A text editor was used for this task, and the find/replace functionality of the editor proved very useful. After making phrases, the large text file was uploaded to the Amazon cloud (AWS), and a Pig batch job using MapReduce was leveraged to count the occurrences of phrases and words within the text file. The resulting phrases and words with their occurrence counts were downloaded to a personal computer (PC) and loaded into an Excel spreadsheet, which enabled the phrases and words to be sorted by occurrence count and facilitated the filtering out of irrelevant words. Another data preparation task involved combining phrases or words that were synonymous: for example, the occurrence counts for the acronyms ELT and ETL were added together to make an overall ELT/ETL occurrence count (ETL is a data warehousing acronym for Extracting, Transforming and Loading data). This task required knowledge of the subject area. Also, some words were counted in lower case and again in mixed or upper case, producing two or three occurrence counts for the same word; these different counts were added together to make an overall occurrence count for the word (e.g., counts for Python and python were added together). Finally, the Indeed occurrence counts were sorted to identify a list of the top 20 technical skills needed by a Data Engineer.
Contribution: Provides new information about the technical skills needed by Data Engineers.
Findings: Twelve of the 20 phrases/words from the Stitch (2016) report matched the technical skills found in the Indeed research; C, C++ and Java were considered a match to the broader category of Programming in the Indeed data. Although the ranked order of the two lists did not match, the top five ranked technical skills in both lists are similar. The reader might consider SQL, Python, and Hadoop/HDFS to be very important technical skills for a Data Engineer. Although the programming language R is very popular with Data Scientists, it did not make the top 20 skills for Data Engineering, though it was in the overall Indeed list. The R language is oriented toward analytical processing (used by Data Scientists), whereas Python is a scripting and object-oriented language that facilitates the creation of data pipelines (used by Data Engineers). Because the data was collected one year apart and from very different data sources, the timing of the data collection and the different data sources could account for some of the differences in the ranked lists. It is worth noting that the Indeed ranked list introduced the technical skills of design skills, Spark, AWS (Amazon Web Services), data modeling, Kafka, Scala, cloud computing, data pipelines, APIs, and AWS Redshift data warehousing to the top 20 list. The Stitch (2016) report had no matches in the Indeed (2017) sample data for Linux, Databases, MySQL, Business Intelligence, Oracle, Microsoft SQL Server, Data Analysis, and Unix; although many of these Stitch top 20 technical skills were on the Indeed list, they did not make its top 20.
Recommendations for Practitioners: Some of the skills needed for database technologies are transferable to Data Engineering.
Recommendations for Researchers: None.
Impact on Society: There is not much peer-reviewed literature on the subject of Data Engineering; this paper adds new information to the subject area.
Future Research: I am developing a specialization in Data Engineering for the MS in Data Science degree at our university.
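The phrase-counting step in the methodology above is essentially the classic MapReduce word count. A minimal Hadoop Streaming version in Python could look like the sketch below; the PHRASES list is hypothetical (the study built its phrase list by hand first), and the substring matching is deliberately naive (e.g., "sql" would also match inside "mysql").

```python
#!/usr/bin/env python3
# mapper.py -- emit one count per skill phrase found in each job-posting line.
import sys

PHRASES = ["big data", "data engineering", "hadoop", "sql", "python"]  # assumed

for line in sys.stdin:
    text = line.lower()
    for phrase in PHRASES:
        for _ in range(text.count(phrase)):
            print(f"{phrase}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sum counts per phrase; input arrives grouped and sorted by key.
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = key, 0
    total += int(value)
if current is not None:
    print(f"{current}\t{total}")
```

Sorting the resulting counts in descending order and merging synonyms (ETL with ELT, Python with python) reproduces the spreadsheet steps the authors describe.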
4

"Keynote - Performance Impact of Data Locality in MapReduce on Hadoop." In 2017 5th Intl Conf on Applied Computing and Information Technology/4th Intl Conf on Computational Science/Intelligence and Applied Informatics/2nd Intl Conf on Big Data, Cloud Computing, Data Science (ACIT-CSII-BCD). IEEE, 2017. http://dx.doi.org/10.1109/acit-csii-bcd.2017.88.

5

Huang, Su-yu, and Bo Zhang. "Research on Improved k-Means Clustering Algorithm Based on Hadoop Platform." In 2019 International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI). IEEE, 2019. http://dx.doi.org/10.1109/mlbdbi48998.2019.00067.

6

Yulong, Zhao, and Lin Weiting. "A Research on Battlefield Situation Analysis and Decision-making Modeling based on a Hadoop Framework." In 2020 2nd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI). IEEE, 2020. http://dx.doi.org/10.1109/mlbdbi51377.2020.00083.

7

Wahid, Ali, Steven Munkeby, and Samuel Sambasivam. "Machine Learning-based Flu Forecasting Study Using the Official Data from the Centers for Disease Control and Prevention and Twitter Data." In InSITE 2021: Informing Science + IT Education Conferences. Informing Science Institute, 2021. http://dx.doi.org/10.28945/4773.

Abstract:
Aim/Purpose: In the United States, the Centers for Disease Control and Prevention (CDC) tracks disease activity using data collected from medical practices on a weekly basis, which leads to a lag time of approximately two weeks before any viable action can be planned. This study addressed the two-week delay by creating machine learning models to predict flu outbreaks.
Background: The two-week delay problem was addressed by correlating the flu trends identified from Twitter data with official flu data from the CDC, and by creating a machine learning model using both data sources to predict flu outbreaks.
Methodology: A quantitative correlational study was performed using a quasi-experimental design. Flu trends from the CDC portal and tweets mentioning flu and influenza from the state of Georgia were used over a period of 22 weeks, from December 29, 2019 to May 30, 2020.
Contribution: This research contributed to the body of knowledge by using a simple bag-of-words method for sentiment analysis, followed by the combination of CDC and Twitter data to generate a flu prediction model with higher accuracy than using CDC data only.
Findings: The study found that (a) there is no correlation between official flu data from the CDC and tweets mentioning flu, and (b) there is an improvement in the performance of a flu forecasting model based on a machine learning algorithm using both official CDC flu data and tweets mentioning flu.
Recommendations for Practitioners: Since no correlation was found between official CDC flu data and the count of tweets mentioning flu, tweets alone should be used with caution to predict a flu outbreak. Based on the findings of this study, social media data can be used as an additional variable to improve the accuracy of flu prediction models; fourth-order polynomial and support vector regression models offered the best accuracy.
Recommendations for Researchers: Open-source data, such as the Twitter feed, can be mined for useful intelligence benefiting society, and machine learning-based prediction models can be improved by adding open-source data to the primary data set.
Impact on Society: A key implication of this study for practitioners is to use social media postings to identify neighborhoods and geographic locations affected by seasonal outbreaks such as influenza, which would help reduce the spread of the disease and ultimately lead to containment. Social media data will help health authorities detect seasonal outbreaks earlier than official CDC channels of disease and illness reporting from physicians and labs, empowering health officials to plan their responses swiftly and allocate their resources optimally to the most affected areas.
Future Research: Future researchers could use more complex deep learning algorithms, such as artificial neural networks and recurrent neural networks, to evaluate the accuracy of flu outbreak prediction models compared to the regression models used in this study. They could apply other sentiment analysis techniques, such as natural language processing and deep learning, to identify context-sensitive emotion, concept extraction, and sarcasm detection for the identification of self-reported flu tweets. They could also expand the scope by continuously collecting tweets on a public cloud and applying big data applications, such as Hadoop and MapReduce, to perform predictions using several months or even years of historical data for a larger geographical area.
[This Proceedings paper was revised and published in the journal Issues in Informing Science and Information Technology, 18, 63-81.]
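As a toy illustration of the two best-performing model families named above (a fourth-order polynomial and support vector regression), here is a short NumPy/scikit-learn sketch. The data is synthetic, purely to show the shape of the fit; it is not the study's CDC or Twitter data.

```python
import numpy as np
from sklearn.svm import SVR

weeks = np.arange(22)  # 22 observation weeks, mirroring the study period
cases = 50 + 30 * np.sin(weeks / 4.0) + np.random.default_rng(0).normal(0, 3, 22)

# Fourth-order polynomial fit, then a one-week-ahead extrapolation.
poly = np.polynomial.Polynomial.fit(weeks, cases, deg=4)
print("polynomial forecast, week 22:", poly(22))

# Support vector regression over the same series.
svr = SVR(kernel="rbf", C=100.0)
svr.fit(weeks.reshape(-1, 1), cases)
print("SVR forecast, week 22:", svr.predict([[22]])[0])
```

In the study itself the predictors also included the tweet counts, which is what lifted accuracy over the CDC-only models.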