Academic literature on the topic "Apache Structured Streaming"

Browse the thematic lists of articles, books, theses, conference proceedings, and other scholarly sources on the topic "Apache Structured Streaming".

Journal articles on the topic "Apache Structured Streaming"

1

Ilbeigipour, Sadegh, Amir Albadvi, and Elham Akhondzadeh Noughabi. "Real-Time Heart Arrhythmia Detection Using Apache Spark Structured Streaming". Journal of Healthcare Engineering 2021 (April 22, 2021): 1–13. http://dx.doi.org/10.1155/2021/6624829.

Abstract
One of the major causes of death in the world is cardiac arrhythmias. In the field of healthcare, physicians use the patient’s electrocardiogram (ECG) records, which indicate the electrical activity of the patient’s heart, to detect arrhythmias. The problem is that the symptoms do not always appear and the physician may be mistaken in the diagnosis. Therefore, patients need continuous monitoring through real-time ECG analysis to detect arrhythmias in a timely manner and prevent an eventual incident that threatens the patient’s life. In this research, we used, for the first time, the Structured Streaming module built on top of the open-source Apache Spark platform to implement a machine learning pipeline for real-time cardiac arrhythmia detection and to evaluate the impact of using this new module on classification performance metrics and the rate of delay in arrhythmia detection. The ECG data were collected from the MIT/BIH database for the detection of three class labels: normal beats, RBBB, and atrial fibrillation arrhythmias. We also developed three multiclass classifiers (decision tree, random forest, and logistic regression), of which the random forest classifier showed better classification performance than the other two. The results show previous results in performance metrics of the classification model and a significant decrease in pipeline runtime by using more class labels compared to previous studies.
2

Estévez-Pereira, Julio J., Diego Fernández, and Francisco J. Novoa. "Network Anomaly Detection Using Machine Learning Techniques". Proceedings 54, no. 1 (August 19, 2020): 8. http://dx.doi.org/10.3390/proceedings2020054008.

Abstract
While traditional network security methods have proven useful until now, the flexibility of machine learning techniques makes them a solid candidate in the current state of our networks. In this paper, we assess how well the latter can detect security threats in a corporate network. To that end, we configure and compare several models to find the one that best fits our needs. Furthermore, we distribute the computational load and storage so we can handle extensive volumes of data. The algorithms that we use to create our models, Random Forest, Naive Bayes, and Deep Neural Networks (DNN), are both divergent and tested in other papers, which makes our comparison richer. For the distribution phase, we operate with Apache Structured Streaming, PySpark, and MLlib. As for the results, it is relevant to mention that our dataset has been found to be effectively modelable with just a reduced number of features. Finally, given the outcomes obtained, we find this line of research encouraging and, therefore, this approach worth pursuing.
3

Hafsa, Mounir, and Farah Jemili. "Comparative Study between Big Data Analysis Techniques in Intrusion Detection". Big Data and Cognitive Computing 3, no. 1 (December 20, 2018): 1. http://dx.doi.org/10.3390/bdcc3010001.

Abstract
Cybersecurity Ventures expects that cyber-attack damage costs will rise to $11.5 billion in 2019 and that a business will fall victim to a cyber-attack every 14 seconds. Notice here that the time frame for such an event is seconds. With petabytes of data generated each day, this is a challenging task for traditional intrusion detection systems (IDSs). Protecting sensitive information is a major concern for both businesses and governments. Therefore, a real-time, large-scale and effective IDS is a must. In this work, we present a cloud-based, fault-tolerant, scalable and distributed IDS that uses Apache Spark Structured Streaming and its Machine Learning library (MLlib) to detect intrusions in real time. To demonstrate the efficacy and effectiveness of this system, we implement it within Microsoft Azure Cloud, as it provides both processing power and storage capabilities. A decision tree algorithm is used to predict the nature of incoming data. For this task, the use of the MAWILab dataset as a data source will give better insights about the system capabilities against cyber-attacks. The experimental results showed a 99.95% accuracy and more than 55,175 events per second were processed by the proposed system on a small cluster.
4

Moertini, Veronica, and Mariskha Adithia. "Uncovering Active Communities from Directed Graphs on Distributed Spark Frameworks, Case Study: Twitter Data". Big Data and Cognitive Computing 5, no. 4 (September 22, 2021): 46. http://dx.doi.org/10.3390/bdcc5040046.

Abstract
Directed graphs can be prepared from big data containing people’s interaction information. In these graphs the vertices represent people, while the directed edges denote the interactions among them. The number of interactions at certain intervals can be included as the edges’ attribute. Thus, the larger the count, the more frequently the people (vertices) interact with each other. Subgraphs which have a count larger than a threshold value can be created from these graphs, and temporal active communities can then be mined from each of these subgraphs. Apache Spark has been recognized as a data processing framework that is fast and scalable for processing big data. It provides DataFrames, GraphFrames, and GraphX APIs which can be employed for analyzing big graphs. We propose three kinds of active communities, namely, Similar interest communities (SIC), Strong-interacting communities (SC), and Strong-interacting communities with their “inner circle” neighbors (SCIC), along with algorithms needed to uncover them. The algorithm design and implementation are based on these APIs. We conducted experiments on a Spark cluster using ten machines. The results show that our proposed algorithms are able to uncover active communities from public big graphs as well as from Twitter data collected using Spark Structured Streaming. In some cases, the execution time of the algorithms that are based on GraphFrames’ motif findings is faster.
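Editorial illustration: the thresholding step described in the abstract above (count interactions per directed edge, then keep only edges whose count exceeds a threshold) can be sketched in plain Python. This is a minimal single-machine sketch, not the authors' Spark implementation; the function names and the sample interactions are hypothetical, and the community-mining algorithms (SIC, SC, SCIC) are not reproduced here.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source person, target person)


def count_interactions(interactions: Iterable[Edge]) -> Dict[Edge, int]:
    """Count how many times each directed interaction occurs in one interval."""
    return dict(Counter(interactions))


def threshold_subgraph(edge_counts: Dict[Edge, int], threshold: int) -> Dict[Edge, int]:
    """Keep only the edges whose interaction count is larger than the threshold."""
    return {edge: n for edge, n in edge_counts.items() if n > threshold}


def vertices_of(edges: Iterable[Edge]) -> Set[str]:
    """Collect every person that appears as a source or target of an edge."""
    return {person for edge in edges for person in edge}


# Hypothetical interactions collected during a single interval.
sample = [("ann", "bob")] * 3 + [("bob", "ann")] * 2 + [("cy", "ann")]
active = threshold_subgraph(count_interactions(sample), threshold=1)
```

With these sample interactions, the edge from "cy" to "ann" occurs only once and is dropped, so the remaining subgraph contains only "ann" and "bob".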
5

Al Jawarneh, Isam Mashhour, Paolo Bellavista, Antonio Corradi, Luca Foschini, and Rebecca Montanari. "QoS-Aware Approximate Query Processing for Smart Cities Spatial Data Streams". Sensors 21, no. 12 (June 17, 2021): 4160. http://dx.doi.org/10.3390/s21124160.

Abstract
Large amounts of georeferenced data streams arrive daily at stream processing systems. This is attributable to the overabundance of affordable IoT devices. In addition, interested practitioners desire to exploit Internet of Things (IoT) data streams for strategic decision-making purposes. However, mobility data are highly skewed and their arrival rates fluctuate. This poses an extra challenge for data stream processing systems, which are required to achieve pre-specified latency and accuracy goals. In this paper, we propose ApproxSSPS, a system for approximate processing of georeferenced mobility data at scale with quality of service guarantees. We focus on stateful aggregations (e.g., means, counts) and top-N queries. ApproxSSPS features a controller that interactively learns the latency statistics and calculates proper sampling rates to meet latency and/or accuracy targets. An overarching trait of ApproxSSPS is its ability to strike a plausible balance between latency and accuracy targets. We evaluate ApproxSSPS on Apache Spark Structured Streaming with real mobility data. We also compared ApproxSSPS against a state-of-the-art online adaptive processing system. Our extensive experiments prove that ApproxSSPS can fulfill latency and accuracy targets with varying sets of parameter configurations and load intensities (i.e., transient peaks in data loads versus slow arriving streams). Moreover, our results show that ApproxSSPS outperforms the baseline counterpart by significant magnitudes. In short, ApproxSSPS is a novel spatial data stream processing system that can deliver real accurate results in a timely manner, by dynamically specifying the limits on data samples.
6

Xiao, Wen, and Juan Hu. "SWEclat: a frequent itemset mining algorithm over streaming data using Spark Streaming". Journal of Supercomputing 76, no. 10 (February 4, 2020): 7619–34. http://dx.doi.org/10.1007/s11227-020-03190-5.

Abstract
Finding frequent itemsets in continuous streaming data is an important data mining task which is widely used in network monitoring, Internet of Things data analysis and so on. In the era of big data, it is necessary to develop a distributed frequent itemset mining algorithm to meet the needs of massive streaming data processing. Apache Spark is a unified analytics engine for massive data processing which has been successfully used in many data mining fields. In this paper, we propose a distributed algorithm for mining frequent itemsets over massive streaming data named SWEclat. The algorithm uses a sliding window to process streaming data and uses a vertical data structure to store the dataset in the sliding window. The algorithm is implemented on Apache Spark and uses Spark RDDs to store the streaming data and the dataset in vertical data format, so that these RDDs can be divided into partitions for distributed processing. Experimental results show that the SWEclat algorithm has good acceleration, parallel scalability and load balancing.
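Editorial illustration: the two ideas named in the abstract above, a sliding window over the stream and a vertical layout (each item mapped to the set of transaction ids that contain it), can be sketched on a single machine as follows. This is not the SWEclat implementation and uses no Spark RDDs; the class and function names are hypothetical, and a count-based window is an assumption, since the abstract does not say how the window is defined.

```python
from collections import deque
from typing import Deque, Dict, FrozenSet, Iterable, List, Set, Tuple

Candidate = Tuple[str, Set[int]]  # (item, ids of the transactions containing it)


def eclat(tidsets: Dict[str, Set[int]], min_support: int) -> Dict[FrozenSet[str], int]:
    """Mine frequent itemsets from a vertical layout by intersecting tid sets."""
    frequent: Dict[FrozenSet[str], int] = {}

    def extend(prefix: FrozenSet[str], candidates: List[Candidate]) -> None:
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            # Intersect with each later candidate; keep the ones still frequent.
            narrowed: List[Candidate] = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if len(common) >= min_support:
                    narrowed.append((other, common))
            if narrowed:
                extend(itemset, narrowed)

    start = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent


class SlidingWindowMiner:
    """Keep the last `window_size` transactions in vertical format (assumed count-based window)."""

    def __init__(self, window_size: int, min_support: int) -> None:
        self.window_size = window_size
        self.min_support = min_support
        self.window: Deque[Tuple[int, FrozenSet[str]]] = deque()
        self.tidsets: Dict[str, Set[int]] = {}
        self.next_tid = 0

    def add(self, transaction: Iterable[str]) -> None:
        """Append a transaction and evict the oldest one if the window is full."""
        items = frozenset(transaction)
        tid = self.next_tid
        self.next_tid += 1
        self.window.append((tid, items))
        for item in items:
            self.tidsets.setdefault(item, set()).add(tid)
        if len(self.window) > self.window_size:
            old_tid, old_items = self.window.popleft()
            for item in old_items:
                self.tidsets[item].discard(old_tid)
                if not self.tidsets[item]:
                    del self.tidsets[item]

    def mine(self) -> Dict[FrozenSet[str], int]:
        """Return the frequent itemsets of the current window with their supports."""
        return eclat(self.tidsets, self.min_support)


miner = SlidingWindowMiner(window_size=3, min_support=2)
for tx in [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]:
    miner.add(tx)
result = miner.mine()
```

After the fourth transaction arrives, the first one has left the window, so the pair {"a", "b"} is supported by only one remaining transaction and is no longer reported as frequent.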
7

"Designing Framework for Real Time Twitter Data Analytics using Apache Flume and Pig". International Journal of Recent Technology and Engineering 8, no. 6 (March 30, 2020): 4474–77. http://dx.doi.org/10.35940/ijrte.f7726.038620.

Abstract
In the world of technology, people prefer social media to express themselves. Records say Twitter has more than 321 million active users, with 100 million users posting approximately 340 million tweets a day. Twitter is the largest source of breaking news on social issues, especially election-related ones, where people can express their views and also suggest their opinions. Twitter is generating unlimited unstructured text data. Hadoop is one of the finest tools accessible for analyzing Twitter data because it supports processing of distributed big data, streaming data, time-stamped data, text data, etc., whereas Apache Flume is used to extract real-time Twitter data into HDFS. This study attempts to establish an analytical framework to derive and interpret structured as well as unstructured Twitter data. The proposed framework comprises real-time Twitter data insertion, its processing, and data visualization utilizing Apache Flume and Pig. In this project we fetch positive and negative tweets on election data from Twitter and analyze the party status and the probability of winning the election.

Theses on the topic "Apache Structured Streaming"

1

Rexa, Denis. "Výpočetní úlohy pro řešení paralelního zpracování dat". Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2019. http://www.nusl.cz/ntk/nusl-400899.

Abstract
The goal of this diploma thesis was to create four laboratory exercises for the subject "Parallel Data Processing", in which students will try out the options and capabilities of Apache Spark as a parallel computing platform. The work also includes basic setup and use of Apache Kafka technology and the NoSQL Apache Cassandra database. The other two lab assignments focus on working with the Travelling Salesman Problem. The first lab was designed to demonstrate the difficulty of a task where the student will face an exponential increase in complexity. The second task consists of an optimization algorithm to solve the problem in a cluster. This algorithm is subjected to performance measurements in clusters. The conclusion of the thesis contains recommendations for optimization as well as a comparison of running with different numbers of computing devices.
2

Cannalire, Pietro. "Geo-distributed multi-layer stream aggregation". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-230217.

Abstract
The standard processing architectures are enough to satisfy a lot of applications by employing already existing stream processing frameworks which are able to manage distributed data processing. In some specific cases, having geographically distributed data sources requires distributing the processing even further over a large area by employing a geographically distributed architecture. The issue addressed in this work is the reduction of data movement across the network, which in a geo-distributed architecture flows continuously from streaming sources to the processing location and among processing entities within the same distributed cluster. Reduction of data movement can be critical for decreasing bandwidth costs, since accessing links placed in the middle of the network can be costly and the cost can increase as the amount of data exchanged increases. In this work we want to create a different concept to deploy geographically distributed architectures by relying on Apache Spark Structured Streaming and Apache Kafka. The features needed for an algorithm to run on a geo-distributed architecture are provided. The algorithms to be executed on this architecture apply the windowing and the data synopses techniques to produce summaries of the input data and to address issues of the geographically distributed architecture. The computation of the average and the Misra-Gries algorithm are then implemented to test the designed architecture. This thesis work contributes a new model for building geographically distributed architectures. The experimental results show that, for the algorithms running on top of the geo-distributed architecture, the computation time is reduced on average by 70% compared to the distributed setup. Similarly, the amount of data exchanged across the network is reduced on average by 99% compared to the distributed setup.
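Editorial illustration: the abstract above names the Misra-Gries algorithm as one of the data synopses used to cut network traffic. A minimal sketch of that general idea follows: each edge site keeps a small frequency summary, and only the summaries are sent on and merged. This is the textbook Misra-Gries summary with the standard merge rule for mergeable summaries, not the thesis code; the function names, the sample streams, and the two-site setup are hypothetical, and no Spark or Kafka is involved.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Summarize a stream with at most k - 1 counters.

    Each reported count underestimates the true count by at most n / k,
    where n is the number of items seen.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def merge_summaries(
    a: Dict[Hashable, int], b: Dict[Hashable, int], k: int
) -> Dict[Hashable, int]:
    """Merge two summaries built with the same k into one with at most k - 1 counters."""
    merged = Counter(a)
    merged.update(b)
    if len(merged) >= k:
        # Subtract the k-th largest count and keep only the positive remainders.
        kth = sorted(merged.values(), reverse=True)[k - 1]
        merged = Counter({key: n - kth for key, n in merged.items() if n > kth})
    return dict(merged)


# Two hypothetical edge sites summarize their local streams; only the
# summaries (not the raw items) would cross the network to be merged.
site_a = misra_gries(["a"] * 5 + ["b"] * 3 + ["c", "d"], k=3)
site_b = misra_gries(["a"] * 4 + ["e"] * 2, k=3)
combined = merge_summaries(site_a, site_b, k=3)
```

The design point is that a summary has a fixed size (at most k - 1 counters) regardless of how many items a site has seen, which is why forwarding summaries instead of raw records reduces the amount of data exchanged.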

Books on the topic "Apache Structured Streaming"

1

Maas, Gerard, and Francois Garillot. Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming. O'Reilly Media, 2019.

2

Luu, Hien. Beginning Apache Spark 2: With Resilient Distributed Datasets, Spark SQL, Structured Streaming and Spark Machine Learning library. Apress, 2018.


Book chapters on the topic "Apache Structured Streaming"

1

Chellappan, Subhashini, and Dharanitharan Ganesan. "Spark Structured Streaming". In Practical Apache Spark, 157–74. Berkeley, CA: Apress, 2018. http://dx.doi.org/10.1007/978-1-4842-3652-9_6.

2

Elliott, Ed. "Structured Streaming". In Introducing .NET for Apache Spark, 171–84. Berkeley, CA: Apress, 2021. http://dx.doi.org/10.1007/978-1-4842-6992-3_9.


Conference proceedings on the topic "Apache Structured Streaming"

1

Guimarães, Lucas Chagas de Brito, Gabriel Antonio Fontes Rebello, Felipe Schreiber Fernandes, Gustavo Franco Camilo, Lucas Airam Castro de Souza, Danyel Clinário dos Santos, Luiz Gustavo Costa Marques de Oliveira, and Otto Carlos Muniz Bandeira Duarte. "TeMIA-NT: Monitoramento e Análise Inteligente de Ameaças de Tráfego de Rede". In XXXVIII Simpósio Brasileiro de Redes de Computadores e Sistemas Distribuídos. Sociedade Brasileira de Computação, 2020. http://dx.doi.org/10.5753/sbrc_estendido.2020.12402.

Abstract
Cyberattacks have become increasingly common and cause great harm to people and organizations. Late detection of these attacks increases the likelihood of irreparable damage, with high financial losses being a common occurrence. This paper proposes TeMIA-NT: Monitoramento e Análise Inteligente de Ameaças de Tráfego de Rede (Intelligent Monitoring and Analysis of Network Traffic Threats), a tool for real-time traffic analysis using parallel stream processing on a cluster. The main contributions of TeMIA-NT are: i) a modular architecture for real-time network intrusion detection that supports high traffic rates, ii) the use of the Apache Spark Structured Streaming library, and iii) two modes of operation: online and offline. The offline mode makes it possible to evaluate the performance of multiple machine learning algorithms on a given dataset, including metrics such as accuracy, F1-score, and area under the ROC curve. In online mode, the tool uses DataFrame structures and the Structured Streaming library in continuous mode, which enables real-time threat detection and rapid reaction to attacks. In order to minimize the damage caused, TeMIA-NT reaches flow processing rates of up to 50 GB/s.