
Journal articles on the topic 'Massive data set post-processing'



Consult the top 50 journal articles for your research on the topic 'Massive data set post-processing.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Singh, Gurinderbeer, Sreeraman Rajan, and Shikharesh Majumdar. "A Fast-Iterative Data Association Technique for Multiple Object Tracking." International Journal of Semantic Computing 12, no. 02 (June 2018): 261–85. http://dx.doi.org/10.1142/s1793351x18400135.

Full text
Abstract:
A massive amount of video data is recorded daily for forensic post-analysis and computer vision applications. Analysis of this data often requires multiple object tracking (MOT). Advancements in image analysis algorithms and global optimization techniques have improved the accuracy of MOT, often at the cost of slow processing speed, which limits its application to small video datasets. With the focus on speed, the authors introduce a fast-iterative data association technique (FIDA) for MOT that uses a tracking-by-detection paradigm and finds a locally optimal solution with low computational overhead. Performance analyses conducted on a set of benchmark video datasets show that the proposed technique is significantly faster (50–600 times) than existing state-of-the-art techniques while producing comparable tracking accuracy.
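The abstract does not spell out FIDA's internals, so the sketch below is only a generic tracking-by-detection baseline for the data-association step it accelerates: detections in consecutive frames are matched by solving a linear assignment problem over an IoU cost matrix. The function names, threshold, and toy boxes are illustrative assumptions, not the paper's method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, min_iou=0.3):
    """Match existing tracks to new detections; returns (track_idx, det_idx) pairs."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)      # optimal assignment for one frame
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - min_iou]

tracks = [(10, 10, 50, 80), (100, 40, 160, 120)]
detections = [(12, 11, 52, 83), (300, 300, 340, 380)]
print(associate(tracks, detections))              # -> [(0, 0)]
```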
APA, Harvard, Vancouver, ISO, and other styles
2

Nagy, Máté, János Tapolcai, and Gábor Rétvári. "R3D3: A Doubly Opportunistic Data Structure for Compressing and Indexing Massive Data." Infocommunications journal, no. 2 (2019): 58–66. http://dx.doi.org/10.36244/icj.2019.2.7.

Full text
Abstract:
Opportunistic data structures are used extensively in big data practice to break down the massive storage space requirements of processing large volumes of information. A data structure is called (singly) opportunistic if it takes advantage of the redundancy in the input in order to store it in information-theoretically minimum space. Yet, efficient data processing requires a separate index alongside the data, whose size often substantially exceeds that of the compressed information. In this paper, we introduce doubly opportunistic data structures to not only attain best possible compression on the input data but also on the index. We present R3D3, which encodes a bitvector of length n and Shannon entropy H0 to nH0 bits and the accompanying index to nH0(1/2 + O(log C/C)) bits, thus attaining provably minimum space (up to small error terms) on both the data and the index, and supports a rich set of queries to arbitrary positions in the compressed bitvector in O(C) time when C = o(log n). Our R3D3 prototype attains a several-fold space reduction beyond known compression techniques on a wide range of synthetic and real data sets, while it supports operations on the compressed data at comparable speed.
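As a minimal illustration of why a rank index normally accompanies a bitvector (the overhead R3D3 attacks), the sketch below builds a block-level popcount index over a plain, uncompressed bitvector; it is not the R3D3 encoding, and the block size is an arbitrary assumption.

```python
import numpy as np

BLOCK = 64  # index granularity; an arbitrary choice for illustration

def build_rank_index(bits):
    """Cumulative popcounts at block boundaries: the 'index alongside the data'."""
    block_sums = np.add.reduceat(bits, np.arange(0, len(bits), BLOCK))
    return np.concatenate(([0], np.cumsum(block_sums)))

def rank1(bits, index, pos):
    """Number of 1-bits in bits[0:pos]: one block lookup plus a short scan."""
    b, r = divmod(pos, BLOCK)
    return int(index[b] + bits[b * BLOCK : b * BLOCK + r].sum())

bits = np.random.randint(0, 2, size=10_000, dtype=np.uint8)
idx = build_rank_index(bits)
assert rank1(bits, idx, 4321) == bits[:4321].sum()
```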
APA, Harvard, Vancouver, ISO, and other styles
3

Chang, Xu, Shan Shan Pei, and Na Su. "Research on Real-Time Network Forensics Based on Improved Data Mining Algorithm." Applied Mechanics and Materials 380-384 (August 2013): 1881–85. http://dx.doi.org/10.4028/www.scientific.net/amm.380-384.1881.

Full text
Abstract:
Real-time network forensics requires high precision and the processing of massive amounts of data, while the traditional Apriori algorithm suffers from scanning the data set many times. This paper improves the Apriori algorithm: the data set is divided into blocks that are processed in parallel, a dynamic itemset-counting method weights each block to construct a tree, and a depth-first search of the tree marks the data blocks that have been partitioned out. All counted itemsets are then evaluated dynamically to obtain the frequent itemsets, which reduces the number of scans and improves the data-processing capability of network forensics. A K-medoids algorithm is used for secondary mining to improve accuracy, reduce network data loss, and strengthen the legal value of network crime evidence.
APA, Harvard, Vancouver, ISO, and other styles
4

Lam, Ping-Man, Chi-Sing Leung, and Tien-Tsin Wong. "A compression method for a massive image data set in image-based rendering." Signal Processing: Image Communication 19, no. 8 (September 2004): 741–54. http://dx.doi.org/10.1016/j.image.2004.04.007.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Jia, Chen, and Hong Wei Chen. "The HGEDA Hybrid Algorithm for OLAP Data Cubes." Applied Mechanics and Materials 130-134 (October 2011): 3158–62. http://dx.doi.org/10.4028/www.scientific.net/amm.130-134.3158.

Full text
Abstract:
On-Line Analytical Processing (OLAP) tools are frequently used in business, science and health to extract useful knowledge from massive databases. An important and hard optimization problem in OLAP data warehouses is the view selection problem, consisting of selecting a set of aggregate views of the data for speeding up future query processing. In this paper we present a new approach, named HGEDA, a hybrid algorithm based on genetic and estimation of distribution algorithms. The objective is to obtain the benefits of both approaches. Experimental results show that HGEDA is competitive with the genetic algorithm on a variety of problem instances, often finding approximately optimal solutions in a reasonable amount of time.
APA, Harvard, Vancouver, ISO, and other styles
6

Chen, Jia, Hong Wei Chen, and Xin Rong Hu. "Simulation for View Selection in Data Warehouse." Advanced Materials Research 748 (August 2013): 1028–32. http://dx.doi.org/10.4028/www.scientific.net/amr.748.1028.

Full text
Abstract:
On-Line Analytical Processing (OLAP) tools are frequently used in business, science and health to extract useful knowledge from massive databases. An important and hard optimization problem in OLAP data warehouses is the view selection problem, consisting of selecting a set of aggregate views of the data for speeding up future query processing. We apply an Estimation of Distribution Algorithm (EDA) to view selection under a size constraint. Our emphasis is to determine the suitability of combining EDAs with constraint handling for the view selection problem, compared to a widely used genetic algorithm. The EDAs are competitive with the genetic algorithm on a variety of problem instances, often finding approximately optimal solutions in a reasonable amount of time.
APA, Harvard, Vancouver, ISO, and other styles
7

Wang, Wei. "Optimization of Intelligent Data Mining Technology in Big Data Environment." Journal of Advanced Computational Intelligence and Intelligent Informatics 23, no. 1 (January 20, 2019): 129–33. http://dx.doi.org/10.20965/jaciii.2019.p0129.

Full text
Abstract:
At present, storage technology cannot save all data completely. Therefore, in such a big data environment, data mining technology needs to be optimized for intelligent data. First, in the face of massive intelligent data, the potential relationships between data items in the database are described by association rules. Data items are measured by support and confidence, the itemsets satisfying minimum support are found, and strong association rules are obtained according to the confidence level given by users. Second, in order to improve the scanning speed of data items, an optimized association mining technique based on hashing and optimized transaction compression is proposed. A hash function counts each candidate itemset, and if its count is less than the support threshold it is pruned; transaction compression then deletes the items and transactions unrelated to the remaining itemsets, improving the processing efficiency of the association rules. Experiments show that the optimized data mining technology can significantly improve the efficiency of obtaining valuable intelligent data.
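As a minimal, self-contained illustration of the support and confidence measures the abstract relies on (not the paper's hash- and compression-based optimization), the following sketch counts itemsets over a toy transaction list; the thresholds and items are assumptions chosen only for the example.

```python
from itertools import combinations

transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
]
MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.6   # illustrative thresholds

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# frequent 2-itemsets: support >= MIN_SUPPORT
items = set().union(*transactions)
frequent = [set(c) for c in combinations(sorted(items), 2) if support(set(c)) >= MIN_SUPPORT]

# strong rules A -> B: confidence = support(A ∪ B) / support(A)
for pair in frequent:
    for a in pair:
        b = (pair - {a}).pop()
        conf = support(pair) / support({a})
        if conf >= MIN_CONFIDENCE:
            print(f"{a} -> {b}  support={support(pair):.2f}  confidence={conf:.2f}")
```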
APA, Harvard, Vancouver, ISO, and other styles
8

Annapoorani, S., and B. Srinivasan. "Implementation of Effective Data Emplacement Algorithm in Heterogeneous Cloud Environment." Asian Journal of Computer Science and Technology 8, S1 (February 5, 2019): 87–88. http://dx.doi.org/10.51983/ajcst-2019.8.s1.1944.

Full text
Abstract:
This paper studies and implements an effective data emplacement algorithm for large collections of data (Big Data) and proposes a model for improving the efficiency of data processing and storage utilization under dynamic load imbalance among nodes in a heterogeneous cloud environment. In the era of explosive information growth, more and more fields need to deal with massive, large-scale data. The proposed method, an improved data placement algorithm called the Effective Data Emplacement Algorithm, takes the computing capacity of each node as the predominant factor, improving the efficiency of processing large data sets in a short time. The adaptability of the proposed model is obtained by minimizing processing time according to the computing capacity of each node in the cluster. Experimental results with word-count applications show that the proposed solution improves the performance of the heterogeneous cluster environment by effectively distributing data based on performance-oriented sampling.
APA, Harvard, Vancouver, ISO, and other styles
9

Wang, De Wen, and Lin Xiao He. "A Fault Diagnosis Model for Power Transformer Using Association Rule Mining-Based on Rough Set." Applied Mechanics and Materials 519-520 (February 2014): 1169–72. http://dx.doi.org/10.4028/www.scientific.net/amm.519-520.1169.

Full text
Abstract:
With the development of on-line monitoring technology for electric power equipment, and the accumulation of both on-line monitoring data and off-line testing data, the data available for fault diagnosis of power transformers is bound to be massive. How to utilize these massive data reasonably is an issue that urgently needs study. Since on-line monitoring technology is not fully mature, the monitoring data are incomplete, noisy, and partly erroneous, so preprocessing the initial data with rough sets is necessary. Furthermore, as the problem scale grows, the computational cost of association rule mining grows dramatically and easily causes data expansion, so an attribute reduction algorithm from rough set theory is needed. Taking these two points into account, this paper proposes a fault diagnosis model for power transformers using association rule mining based on rough sets.
APA, Harvard, Vancouver, ISO, and other styles
10

Nguyen Mau Quoc, Hoan, Martin Serrano, Han Mau Nguyen, John G. Breslin, and Danh Le-Phuoc. "EAGLE—A Scalable Query Processing Engine for Linked Sensor Data." Sensors 19, no. 20 (October 9, 2019): 4362. http://dx.doi.org/10.3390/s19204362.

Full text
Abstract:
Recently, many approaches have been proposed to manage sensor data using semantic web technologies for effective heterogeneous data integration. However, our empirical observations revealed that these solutions primarily focused on semantic relationships and unfortunately paid less attention to spatio–temporal correlations. Most semantic approaches do not have spatio–temporal support. Some of them have attempted to provide full spatio–temporal support, but have poor performance for complex spatio–temporal aggregate queries. In addition, while the volume of sensor data is rapidly growing, the challenge of querying and managing the massive volumes of data generated by sensing devices still remains unsolved. In this article, we introduce EAGLE, a spatio–temporal query engine for querying sensor data based on the linked data model. The ultimate goal of EAGLE is to provide an elastic and scalable system which allows fast searching and analysis with respect to the relationships of space, time and semantics in sensor data. We also extend SPARQL with a set of new query operators in order to support spatio–temporal computing in the linked sensor data context.
APA, Harvard, Vancouver, ISO, and other styles
11

Zeng, Jun. "A Dynamic Clustering Querying Algorithm Based on Grid in Manufacturing System." Advanced Materials Research 323 (August 2011): 89–93. http://dx.doi.org/10.4028/www.scientific.net/amr.323.89.

Full text
Abstract:
This article presents a grid-based dynamic clustering query algorithm for manufacturing systems. The algorithm divides the space into grids based on the locations of nodes, computes the clustering center of each grid, and then answers queries based on the clustering at each station. The processing speed of this method is independent of the size of the data set and is fast; it can handle massive, multi-density data sets and performs better in terms of querying accuracy and efficiency.
APA, Harvard, Vancouver, ISO, and other styles
12

FAN, XIAOCONG, and MENG SU. "MULTI-AGENT DIFFUSION OF DECISION EXPERIENCES." International Journal on Artificial Intelligence Tools 22, no. 05 (October 2013): 1360001. http://dx.doi.org/10.1142/s0218213013600014.

Full text
Abstract:
Diffusion geometry offers a fresh perspective on multi-scale information analysis, which is critical to multiagent systems that need to process massive data sets. A recent study has shown that when the "diffusion distance" concept is applied to human decision experiences, its performance on solution synthesis can be significantly better than using Euclidean distance. However, as a data set expands over time, it can quickly exceed the processing capacity of a single agent. In this paper, we propose a multi-agent diffusion approach where a massive data set is split into several subsets and each diffusion agent only needs to work with one subset in the diffusion computation. We conducted experiments with different splitting strategies applied to a set of decision experiences. The results indicate that the multi-agent diffusion approach is beneficial, and that it is even possible to benefit from using a larger group of diffusion agents if their subsets have common experiences and pairwise-shared experiences. Our study also shows that system performance can be affected significantly by the splitting granularity (the size of each splitting unit). This study paves the way for applying the multi-agent diffusion approach to massive data analysis.
APA, Harvard, Vancouver, ISO, and other styles
13

Vengadeswaran, S., and S. R. Balasundaram. "An Optimal Data Placement Strategy for Improving System Performance of Massive Data Applications Using Graph Clustering." International Journal of Ambient Computing and Intelligence 9, no. 3 (July 2018): 15–30. http://dx.doi.org/10.4018/ijaci.2018070102.

Full text
Abstract:
This article describes how the time taken to execute a query and return the results increases exponentially as the data size increases, leading to longer waiting times for the user. Hadoop, with its distributed processing capability, is considered an efficient solution for processing such large data. Hadoop's Default Data Placement Strategy (HDDPS) allocates the data blocks randomly across the cluster of nodes without considering any execution parameters. This results in non-availability of the blocks required for execution on the local machine, so that data has to be transferred across the network for execution, leading to a data locality issue. It is also commonly observed that most data-intensive applications show grouping semantics, so that during query execution only a part of the Big Data set is utilized. Since such execution parameters and grouping behavior are not considered, the default placement does not perform well, resulting in several shortcomings such as decreased local map task execution, increased query execution time, and query latency. In order to overcome such issues, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. Initially, the user history log is dynamically analyzed to identify access patterns, which are depicted as a graph. Markov clustering, a graph clustering algorithm, is applied to identify groupings within the dataset. Then, an Optimal Data Placement Algorithm (ODPA) is proposed based on statistical measures estimated from the clustered graph. This in turn re-organizes the default data layouts in HDFS to achieve improved performance for Big Data sets in a heterogeneous distributed environment. The proposed strategy is tested in a 15-node cluster placed in a single-rack topology. The results prove it to be more efficient for massive datasets, reducing query execution time by 26% and significantly improving data locality by 38% compared to HDDPS.
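Markov clustering (MCL), which the abstract applies to the access-pattern graph, alternates expansion (matrix squaring) and inflation (element-wise powering with column re-normalization) until the flow matrix converges. The sketch below is a bare-bones NumPy version run on a toy adjacency matrix; the inflation parameter, convergence test, and cluster-extraction rule are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def mcl(adj, inflation=2.0, iters=50, tol=1e-6):
    """Plain Markov clustering on a symmetric adjacency matrix (self-loops added)."""
    m = adj.astype(float) + np.eye(len(adj))        # self-loops stabilize the flow
    m /= m.sum(axis=0, keepdims=True)               # make columns stochastic
    for _ in range(iters):
        prev = m.copy()
        m = m @ m                                   # expansion
        m = m ** inflation                          # inflation
        m /= m.sum(axis=0, keepdims=True)
        if np.abs(m - prev).max() < tol:
            break
    # surviving rows act as attractors; their non-zero columns form the clusters
    clusters = {tuple(np.nonzero(row > 1e-8)[0]) for row in m if row.max() > 1e-8}
    return sorted(clusters)

# two obvious groups: {0, 1, 2} and {3, 4}
adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 1, 0, 0],
                [1, 1, 0, 0, 0],
                [0, 0, 0, 0, 1],
                [0, 0, 0, 1, 0]])
print(mcl(adj))   # -> [(0, 1, 2), (3, 4)]
```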
APA, Harvard, Vancouver, ISO, and other styles
14

Hou, Kaihua, Chengqi Cheng, Bo Chen, Chi Zhang, Liesong He, Li Meng, and Shuang Li. "A Set of Integral Grid-Coding Algebraic Operations Based on GeoSOT-3D." ISPRS International Journal of Geo-Information 10, no. 7 (July 19, 2021): 489. http://dx.doi.org/10.3390/ijgi10070489.

Full text
Abstract:
As the amount of collected spatial information (2D/3D) increases, the real-time processing of these massive data is among the urgent issues that need to be dealt with. Discretizing the physical earth into a digital gridded earth and assigning an integral computable code to each grid has become an effective way to accelerate real-time processing. Researchers have proposed optimization algorithms for spatial calculations in specific scenarios. However, a complete set of algorithms for real-time processing using grid coding is still lacking. To address this issue, a carefully designed, integral grid-coding algebraic operation framework for GeoSOT-3D (a multilayer latitude and longitude grid model) is proposed. By converting traditional floating-point calculations based on latitude and longitude into binary operations, the complexity of the algorithm is greatly reduced. We then present the detailed algorithms that were designed, including basic operations, vector operations, code conversion operations, spatial operations, metric operations, topological relation operations, and set operations. To verify the feasibility and efficiency of the above algorithms, we developed an experimental platform using C++ language (including major algorithms, and more algorithms may be expanded in the future). Then, we generated random data and conducted experiments. The experimental results show that the computing framework is feasible and can significantly improve the efficiency of spatial processing. The algebraic operation framework is expected to support large geospatial data retrieval and analysis, and experience a revival, on top of parallel and distributed computing, in an era of large geospatial data.
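GeoSOT-3D codes follow a specific multilayer latitude/longitude/height subdivision that the abstract does not spell out, so the sketch below only illustrates the underlying idea of the framework: quantize coordinates to integers, interleave their bits into a single grid code, and replace floating-point containment tests with integer prefix comparisons. The 21-bit resolution and the degree/metre scaling are assumptions for the example, not the GeoSOT-3D definition.

```python
def interleave3(x: int, y: int, z: int, bits: int = 21) -> int:
    """Interleave the low `bits` of three integers into one Morton-style code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def grid_code(lon, lat, height_m, bits=21):
    """Quantize coordinates to integers and encode them as one integral grid code."""
    x = int((lon + 180.0) / 360.0 * (1 << bits))
    y = int((lat + 90.0) / 180.0 * (1 << bits))
    z = int(height_m)                      # toy vertical quantization: 1 m cells
    return interleave3(x, y, z, bits)

def same_cell(code_a, code_b, level, bits=21):
    """Two codes share a cell at `level` if their top 3*level bits agree."""
    shift = 3 * (bits - level)
    return (code_a >> shift) == (code_b >> shift)

a = grid_code(116.40, 39.90, 50)   # two nearby points
b = grid_code(116.41, 39.91, 52)
print(same_cell(a, b, level=8))    # True at this coarse level; finer levels differ
```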
APA, Harvard, Vancouver, ISO, and other styles
15

Battino, U., A. Tattersall, C. Lederer-Woods, F. Herwig, P. Denissenkov, R. Hirschi, R. Trappitsch, J. W. den Hartogh, and M. Pignatari. "NuGrid stellar data set – III. Updated low-mass AGB models and s-process nucleosynthesis with metallicities Z= 0.01, Z = 0.02, and Z = 0.03." Monthly Notices of the Royal Astronomical Society 489, no. 1 (August 20, 2019): 1082–98. http://dx.doi.org/10.1093/mnras/stz2158.

Full text
Abstract:
The production of the neutron-capture isotopes beyond iron that we observe today in the Solar system is the result of the combined contribution of the r-process, the s-process, and possibly the i-process. Low-mass asymptotic giant branch (AGB) (1.5 < M/M⊙ < 3) and massive (M > 10 M⊙) stars have been identified as the main site of the s-process. In this work we consider the evolution and nucleosynthesis of low-mass AGB stars. We provide an update of the NuGrid Set models, adopting the same general physics assumptions but using an updated convective-boundary-mixing model accounting for the contribution from internal gravity waves. The combined data set includes the initial masses MZAMS/M⊙ = 2, 3 for Z = 0.03, 0.02, 0.01. These new models are computed with the mesa stellar code and the evolution is followed up to the end of the AGB phase. The nucleosynthesis was calculated for all isotopes in post-processing with the NuGrid mppnp code. The convective-boundary-mixing model leads to the formation of a 13C-pocket three times wider compared to the one obtained in the previous set of models, bringing the simulation results now in closer agreement with observations. Using these new models, we discuss the potential impact of other processes inducing mixing, like rotation, adopting parametric models compatible with theory and observations. Complete yield data tables, derived data products, and online analytic data access are provided.
APA, Harvard, Vancouver, ISO, and other styles
16

Fonseca, Daniel J., and Isaac Heim. "Development of an Automated MATLAB Based Platform for the Analysis of Massive EEG Datasets." International Journal of Emerging Technology and Advanced Engineering 10, no. 11 (November 30, 2020): 7–11. http://dx.doi.org/10.46338/ijetae1120_02.

Full text
Abstract:
EEG studies consist of multiple sessions, and individually loading and processing each data set is repetitive and time consuming. Finding the action markers embedded in the data can be difficult since the times of these markers vary greatly between individuals. These markers indicate sections that need to be analyzed, and based on the particularities of each marker, individual data segments or sections have to be treated differently or independently from each other. Therefore, splitting an EEG data file in order to compare sections, or even to analyze one section independently, can be beneficial. However, de-identified and uniquely named files can be difficult to run through a program since they usually follow non-uniform naming conventions. While file re-naming is an option, it adds additional steps to identifying the nature and uniqueness of the data, and manual renaming of data files is prone to error in larger studies. All of these challenges add up to considerable hurdles when processing and filtering data originating from multiple EEG sessions. A series of MATLAB scripts developed by the authors of this paper addresses these problems.
APA, Harvard, Vancouver, ISO, and other styles
17

Xu, Fangqin, and Haifeng Lu. "The Application of FP-Growth Algorithm Based on Distributed Intelligence in Wisdom Medical Treatment." International Journal of Pattern Recognition and Artificial Intelligence 31, no. 04 (February 2, 2017): 1759005. http://dx.doi.org/10.1142/s0218001417590054.

Full text
Abstract:
FP-Growth is an association rule mining algorithm that does not generate candidate itemsets, so it has high practical value in the face of the rapid growth of data volume in smart ("wisdom") medical treatment. However, because FP-Growth is a memory-resident algorithm, it becomes ineffective when applied to massive data sets. The paper combines Hadoop with the FP-Growth algorithm and, through analysis of traditional Chinese medicine (TCM) data, compares performance in two environments: stand-alone and distributed. The experimental results show that the FP-Growth algorithm gains a great advantage in processing massive data under the MapReduce parallel model, giving it better development prospects for intelligent medical treatment.
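The paper's Hadoop/MapReduce port is not detailed in the abstract; as a hedged illustration of running parallel FP-Growth on a cluster, the sketch below uses the FP-Growth implementation that ships with Spark MLlib (a different engine from plain MapReduce). The column names, thresholds, and toy transactions are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fp-growth-demo").getOrCreate()

# toy 'prescription' transactions; real TCM data would be loaded from HDFS
df = spark.createDataFrame(
    [(0, ["ginseng", "licorice", "ginger"]),
     (1, ["ginseng", "licorice"]),
     (2, ["licorice", "ginger"]),
     (3, ["ginseng", "ginger"])],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(df)                 # FP-tree mining runs distributed across executors

model.freqItemsets.show()          # frequent itemsets with their counts
model.associationRules.show()      # rules with confidence and lift
spark.stop()
```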
APA, Harvard, Vancouver, ISO, and other styles
18

Zhang, Xuelong. "Research on Data Mining Algorithm Based on Pattern Recognition." International Journal of Pattern Recognition and Artificial Intelligence 34, no. 06 (October 4, 2019): 2059015. http://dx.doi.org/10.1142/s0218001420590156.

Full text
Abstract:
With the advent of the era of big data, people are eager to extract valuable knowledge from rapidly expanding data so that massive stored data can be used more effectively. Traditional data processing technology can only achieve basic functions such as data query and statistics, and cannot achieve the goal of extracting the knowledge present in the data to predict future trends. Therefore, along with the rapid development of database technology and the rapid improvement of computing power, data mining (DM) came into existence. Research on DM algorithms draws on knowledge from various fields such as databases, statistics, pattern recognition and artificial intelligence. Pattern recognition mainly extracts features from known data samples. A DM algorithm using pattern recognition technology is an effective way to obtain useful information from massive data, thus providing decision support, and has good application prospects. The support vector machine (SVM) is a pattern recognition algorithm proposed in recent years that avoids the curse of dimensionality through dimension raising and linearization. On this basis, this paper studies DM algorithms based on pattern recognition and proposes a DM algorithm based on SVM. The algorithm divides the vectors of the support-vector set into two different types and iterates repeatedly to obtain a classifier that converges to the final result. Finally, cross-validation simulation experiments show that the DM algorithm based on pattern recognition can effectively reduce training time and solve the mining problem of massive data, demonstrating the algorithm's rationality and feasibility.
APA, Harvard, Vancouver, ISO, and other styles
19

Morris, Scott, Shuang Li, Tony Dupont, and John D. Grace. "Batch automated image processing of 2D seismic data for salt discrimination and basin-wide mapping." GEOPHYSICS 84, no. 6 (November 1, 2019): O113—O123. http://dx.doi.org/10.1190/geo2018-0569.1.

Full text
Abstract:
We have explored the technical utility of analyzing massive sets of digital 2D seismic data, collected and processed in dozens of different surveys, conducted more than 25 years ago, using batch, automated and unsupervised pattern recognition techniques to produce a basin-wide map of the top of salt. This workflow was developed for the United States portion of the Gulf of Mexico to detect top-salt boundaries on 2D poststack migrated lines. Texture-based attributes as well as novel, reflector-based attributes were used to discriminate between salt and nonsalt on each seismic line. Explicit measures of accuracy were not calculated because the data are unlabeled, but an assessment of confidence was used to score the boundaries. The depth to the top of the salt was estimated for more than 67% of the study area ([Formula: see text] or [Formula: see text]), 17% of the study area had insufficient data for processing and analysis, and 16% of the area did not meet confidence requirements for inclusion. The final results compared well with published maps of salt and the locations of salt-trapped fields. Reliable mapping of salt deeper than 6 s two-way time could not be achieved with this data set and approach because many seismic images had indistinguishable features at this depth. The computing time was greater than linear in the number of lines, but parallelization and changes in hardware configuration could reduce the run time of about three weeks to about three days.
APA, Harvard, Vancouver, ISO, and other styles
20

Li, Jian, Hong Yuan Fang, Yu Rong Ma, and Hai Bo Yang. "Research on Point Cloud Data Management Based on Spatial Index and Database." Advanced Materials Research 850-851 (December 2013): 685–88. http://dx.doi.org/10.4028/www.scientific.net/amr.850-851.685.

Full text
Abstract:
Laser point sets contain a massive quantity of data, which not only increases system load but also greatly reduces the efficiency of follow-up processing. Considering the difficulty of organizing and managing massive point cloud data, this paper proposes a grid index for point cloud data based on a decimal linear quadtree. The point cloud data is segmented and encoded by means of a decimal linear quadtree and stored in SQL Server; the point cloud can then be blocked and indexed by Morton code or rectangular region, combining the advantages of a spatial index and a database to manage the point cloud efficiently and safely. The results show that the proposed method effectively solves the problem of point cloud data organization and management.
APA, Harvard, Vancouver, ISO, and other styles
21

Dennis, Jack B. "Static Mapping of Functional Programs: An Example in Signal Processing." Scientific Programming 5, no. 2 (1996): 121–35. http://dx.doi.org/10.1155/1996/360960.

Full text
Abstract:
Complex signal-processing problems are naturally described by compositions of program modules that process streams of data. In this article we discuss how such compositions may be analyzed and mapped onto multiprocessor computers to effectively exploit the massive parallelism of these applications. The methods are illustrated with an example of signal processing for an optical surveillance problem. Program transformation and analysis are used to construct a program description tree that represents the given computation as an acyclic interconnection of stream-processing modules. Each module may be mapped to a set of threads run on a group of processing elements of a target multiprocessor. Performance is considered for two forms of multiprocessor architecture, one based on conventional DSP technology and the other on a multithreaded-processing element design.
APA, Harvard, Vancouver, ISO, and other styles
22

Wang, Shuchun, Xiaoguang Sun, Jianyu Geng, Yuan Han, Chunyong Zhang, and Weihua Zhang. "The Key Techniques of Constructing the Database of Treatment Measures for Hidden Troubles in Electric Power System." E3S Web of Conferences 185 (2020): 01027. http://dx.doi.org/10.1051/e3sconf/202018501027.

Full text
Abstract:
This article adopts a research method that combines theoretical analysis and system design. It analyzes multiple dimensions of common safety hazard events, uses support vector machine classification algorithms to filter and mine valuable information from massive data, and establishes a feature set of common safety hazard events. This technology can realize the automatic classification of safety hazards and extract their characteristic information, introduces semantic analysis and word-segmentation clustering to determine typical processing measures for events, and generates knowledge items that add methods and models to the processing measures database, which helps solve issues such as standardizing grid safety hazard data entry and enabling in-depth analysis.
APA, Harvard, Vancouver, ISO, and other styles
23

CHRISTOPHE, BENOIT. "MANAGING MASSIVE DATA OF THE INTERNET OF THINGS THROUGH COOPERATIVE SEMANTIC NODES." International Journal of Semantic Computing 06, no. 04 (December 2012): 389–408. http://dx.doi.org/10.1142/s1793351x12400120.

Full text
Abstract:
The Internet of Things refers to extending the Internet to physical entities of interest (EoI) to humans (e.g., a table, a room or another human being) sensed as a set of properties that can be observed, measured, accessed or triggered by devices such as actuators, sensors or other smart components. In this vision, the IoT foresees novel types of applications dynamically finding the associations between devices and EoIs around a common feature of interest (e.g., temperature of a room) to provide meaningful information as well as rich services to users about the things they are interested in. Growing interest in providing sensors and actuators has led to billions of services or data offered through different platforms, some of them wrapped with semantic descriptions to realize aforementioned associations through accurate search processes. However, due to the ubiquitous aspect of the IoT and the potential mobility of the devices that enable it, a centralized approach does not allow designing scalable processes to efficiently search and manage these associations or the devices and EoIs that compose them. As location seems to be an important parameter when searching the IoT, we believe that designing a framework composed of geographically distributed nodes with local reasoning capabilities is a much more scalable approach to realize the IoT vision. We describe our approach of such a vision by creating a federated network composed of such nodes that declare their location based on a formal model. In this vision, each node is capable of processing semantic descriptions of devices or EoIs to share deduced associations with other peers that are selected based on their location nearness.
APA, Harvard, Vancouver, ISO, and other styles
24

Diaz-del-Pino, Sergio, Pablo Rodriguez-Brazzarola, Esteban Perez-Wohlfeil, and Oswaldo Trelles. "Combining Strengths for Multi-genome Visual Analytics Comparison." Bioinformatics and Biology Insights 13 (January 2019): 117793221882512. http://dx.doi.org/10.1177/1177932218825127.

Full text
Abstract:
The eclosion of data acquisition technologies has shifted the bottleneck in molecular biology research from data acquisition to data analysis. Such is the case in Comparative Genomics, where sequence analysis has transitioned from genes to genomes of several orders of magnitude larger. This fact has revealed the need to adapt software to work with huge experiments efficiently and to incorporate new data-analysis strategies to manage results from such studies. In previous works, we presented GECKO, a software to compare large sequences; now we address the representation, browsing, data exploration, and post-processing of the massive amount of information derived from such comparisons. GECKO-MGV is a web-based application organized as client-server architecture. It is aimed at visual analysis of the results from both pairwise and multiple sequences comparison studies combining a set of common commands for image exploration with improved state-of-the-art solutions. In addition, GECKO-MGV integrates different visualization analysis tools while exploiting the concept of layers to display multiple genome comparison datasets. Moreover, the software is endowed with capabilities for contacting external-proprietary and third-party services for further data post-processing and also presents a method to display a timeline of large-scale evolutionary events. As proof-of-concept, we present 2 exercises using bacterial and mammalian genomes which depict the capabilities of GECKO-MGV to perform in-depth, customizable analyses on the fly using web technologies. The first exercise is mainly descriptive and is carried out over bacterial genomes, whereas the second one aims to show the ability to deal with large sequence comparisons. In this case, we display results from the comparison of the first Homo sapiens chromosome against the first 5 chromosomes of Mus musculus.
APA, Harvard, Vancouver, ISO, and other styles
25

Paz, Hellen, Mateus Maia, Fernando Moraes, Ricardo Lustosa, Lilia Costa, Samuel Macêdo, Marcos E. Barreto, and Anderson Ara. "Local Processing of Massive Databases with R: A National Analysis of a Brazilian Social Programme." Stats 3, no. 4 (October 19, 2020): 444–64. http://dx.doi.org/10.3390/stats3040028.

Full text
Abstract:
The analysis of massive databases is a key issue for most applications today and the use of parallel computing techniques is one of the suitable approaches for that. Apache Spark is a widely employed tool within this context, aiming at processing large amounts of data in a distributed way. For the Statistics community, R is one of the preferred tools. Despite its growth in the last years, it still has limitations for processing large volumes of data in single local machines. In general, the data analysis community has difficulty handling a massive amount of data on local machines, often requiring high-performance computing servers. One way to perform statistical analyses over massive databases is combining both tools (Spark and R) via the sparklyr package, which allows for an R application to use Spark. This paper presents an analysis of Brazilian public data from the Bolsa Família Programme (BFP—conditional cash transfer), comprising a large data set with 1.26 billion observations. Our goal was to understand how this social program acts in different cities, as well as to identify potentially important variables reflecting its utilization rate. Statistical modeling was performed using random forest to predict the utilization rate of BFP. Variable selection was performed through a recent method based on the importance and interpretation of variables in the random forest model. Among the 89 variables initially considered, the final model presented a high predictive performance capacity with 17 selected variables, as well as indicated high importance of some variables for the observed utilization rate in income, education, job informality, and inactive youth, namely: family income, education, occupation and density of people in the homes. In this work, using a local machine, we highlighted the potential of aggregating Spark and R for analysis of a large database of 111.6 GB. This can serve as proof of concept or reference for other similar works within the Statistics community, and our case study can provide important evidence for further analysis of this important social support programme.
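The paper works in R with sparklyr and a specific importance-based selection method; as a loose, scaled-down illustration of the same idea (fit a random forest, rank variables by importance, refit on the top ones), here is a scikit-learn sketch on synthetic data. The dataset, the column count, and the "keep the top k features" rule are assumptions, not the authors' procedure.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# synthetic stand-in for the municipal covariates (89 candidate variables in the paper)
X, y = make_regression(n_samples=5000, n_features=89, n_informative=17, noise=5.0,
                       random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# rank variables by impurity-based importance and keep the strongest ones
top = np.argsort(full.feature_importances_)[::-1][:17]
reduced = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr[:, top], y_tr)

print("R2, all 89 features :", round(full.score(X_te, y_te), 3))
print("R2, top 17 features :", round(reduced.score(X_te[:, top], y_te), 3))
```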
APA, Harvard, Vancouver, ISO, and other styles
26

Savorskiy, V., E. Lupyan, I. Balashov, M. Burtsev, A. Proshin, V. Tolpin, D. Ermakov, et al. "Basic technologies of web services framework for research, discovery, and processing the disparate massive Earth observation data from heterogeneous sources." ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XL-4 (April 23, 2014): 223–28. http://dx.doi.org/10.5194/isprsarchives-xl-4-223-2014.

Full text
Abstract:
Both the development and the application of remote sensing involve a considerable expenditure of material and intellectual resources. It is therefore important to use high-tech means of distributing remote sensing data and processing results in order to facilitate access for as many researchers as possible. This should be accompanied by the creation of capabilities for more thorough and comprehensive, i.e. ultimately deeper, acquisition and complex analysis of information about the state of Earth's natural resources. An objective need for a higher degree of Earth observation (EO) data assimilation also arises from the conditions of satellite observation, in which the observed objects are in an uncontrolled state. Progress in addressing this problem is determined to a large extent by how the distributed EO information system (IS) functions; namely, it depends largely on reducing the cost of communication (data transfer) between spatially distributed IS nodes and data users. One of the most effective ways to improve the efficiency of data exchange is the creation of an integrated EO IS optimized for running distributed data processing procedures. An effective EO IS implementation should be based on a specific software architecture.
APA, Harvard, Vancouver, ISO, and other styles
27

Brédif, M., L. Caraffa, M. Yirci, and P. Memari. "PROVABLY CONSISTENT DISTRIBUTED DELAUNAY TRIANGULATION." ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences V-2-2020 (August 3, 2020): 195–202. http://dx.doi.org/10.5194/isprs-annals-v-2-2020-195-2020.

Full text
Abstract:
This paper deals with the distributed computation of Delaunay triangulations of massive point sets, mainly motivated by the needs of a scalable out-of-core surface reconstruction workflow from massive urban LIDAR datasets. Such data often correspond to a huge point cloud represented through a set of tiles of relatively homogeneous point sizes. This will be the input of our algorithm which will naturally partition this data across multiple processing elements. The distributed computation and communication between processing elements is orchestrated efficiently through an uncentralized model to represent, manage and locally construct the triangulation corresponding to each tile. Initially inspired by the star splaying approach, we review the Tile & Merge algorithm for computing Distributed Delaunay Triangulations on the cloud, provide a theoretical proof of correctness of this algorithm, and analyse the performance of our Spark implementation in terms of speedup and strong scaling in both synthetic and real use case datasets. An HPC implementation (e.g. using MPI), left for future work, would benefit from its more efficient message passing paradigm but lose the robustness and failure resilience of our Spark approach.
APA, Harvard, Vancouver, ISO, and other styles
28

Li, Deguang, and Zhanyou Cui. "A Parallel Attribute Reduction Method Based on Classification." Complexity 2021 (April 10, 2021): 1–8. http://dx.doi.org/10.1155/2021/9989471.

Full text
Abstract:
Parallel processing has become a development trend as a method to improve computer performance. Based on rough set theory and the divide-and-conquer idea of knowledge reduction, this paper proposes a classification method that supports parallel attribute reduction. The method makes the relative positive region, which would otherwise need to be calculated repeatedly, independent, so that these independent positive-region calculations can be processed in parallel; thus, attribute reduction can be handled in parallel based on this classification method. Finally, the proposed algorithm and the traditional algorithm are analyzed and compared experimentally, and the results show that the proposed method has an advantage in time efficiency, demonstrating that it can improve the processing efficiency of attribute reduction and is more suitable for massive data sets.
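For readers unfamiliar with the rough-set notion the abstract leans on, the sketch below computes the positive region of a decision table (the objects whose condition-attribute equivalence class is decision-consistent) for several candidate attribute subsets in parallel worker processes. The toy table and the use of multiprocessing are illustrative assumptions; the paper's own classification-based decomposition is not reproduced here.

```python
from multiprocessing import Pool

import pandas as pd

# toy decision table: condition attributes a, b, c and decision d
table = pd.DataFrame({
    "a": [0, 0, 1, 1, 1, 0],
    "b": [1, 1, 0, 0, 1, 0],
    "c": [0, 1, 0, 1, 1, 1],
    "d": [0, 0, 1, 1, 0, 1],
})

def positive_region_size(attrs):
    """Count objects whose equivalence class on `attrs` has a single decision value."""
    consistent = table.groupby(list(attrs))["d"].transform("nunique") == 1
    return tuple(attrs), int(consistent.sum())

if __name__ == "__main__":
    candidates = [("a",), ("b",), ("c",), ("a", "b"), ("a", "c"), ("b", "c")]
    with Pool(processes=3) as pool:              # each subset evaluated independently
        for attrs, size in pool.map(positive_region_size, candidates):
            print(attrs, "-> positive region size", size)
```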
APA, Harvard, Vancouver, ISO, and other styles
29

Dejnožková, Eva, and Petr Dokládal. "A PARALLEL ARCHITECTURE FOR CURVE-EVOLUTION PARTIAL DIFFERENTIAL EQUATIONS." Image Analysis & Stereology 22, no. 2 (May 3, 2011): 121. http://dx.doi.org/10.5566/ias.v22.p121-132.

Full text
Abstract:
The computation of the distance function is a crucial and limiting element in many applications of image processing. This is particularly true for PDE-based methods, where the distance is used to compute various geometric properties of the travelling curve. Massive Marching is a parallel algorithm that computes the distance function by propagating the solution from the sources while permitting the simultaneous spreading of component labels in the influence zones. Its hardware implementation is conceivable since no sorted data structures are used. The feasibility is demonstrated here on a set of processing units operating in parallel, arranged in a linear array. The text concludes with a study of the accuracy and the implementation cost.
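Massive Marching itself is a parallel, hardware-oriented scheme; as a quick sequential reference for what it computes (a distance map plus per-pixel labels of the nearest source, i.e. the influence zones), the sketch below uses SciPy's Euclidean distance transform. The grid size and source positions are arbitrary example values.

```python
import numpy as np
from scipy import ndimage

# labelled source components on a small grid: 0 = background, 1..k = source ids
sources = np.zeros((128, 128), dtype=np.int32)
sources[10, 10] = 1
sources[100, 40] = 2
sources[60, 110] = 3

# distance to the nearest source, plus the coordinates of that source for every pixel
dist, nearest = ndimage.distance_transform_edt(sources == 0, return_indices=True)
influence = sources[tuple(nearest)]        # label of the nearest source = influence zones

print(dist[64, 64], influence[64, 64])     # distance and owning source at one pixel
```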
APA, Harvard, Vancouver, ISO, and other styles
30

Leong, Darrell, and Anand Bahuguni. "Auto-control model building using machine learning regression for extreme response prediction." APPEA Journal 60, no. 1 (2020): 155. http://dx.doi.org/10.1071/aj19239.

Full text
Abstract:
The long-term forecast of extreme response presents a daunting practical problem for offshore structures. These installations are subject to varying sea conditions, which amplify the need to account for the uncertainties of wave heights and periods across a given sea state. Analysis of each sea state involves numerically intensive non-linear dynamic analysis, leading to massive computational expense across the environmental scatter diagram. Recent research has proposed several effective solutions to predict long-term extreme responses, but not without drawbacks, such as the limitation to specific failure locations and the absence of error estimates. This paper explores the practical implementation of control variates as an efficiency enhancing post-processing technique. The model building framework exhibits the advantage of being fully defined from existing simulation results, without the need for external inputs to set up a control experiment. A composite machine learning regression model is developed and investigated for performance in correlating against Monte Carlo data. The sampling methodology presented possesses a crucial advantage of being independent of failure characteristics, allowing for the concurrent extreme response analyses of multiple components across the global structure without the need for re-analysis. The approach is applied on a simulated floating production storage and offloading unit in a site located in the hurricane-prone Gulf of Mexico, vulnerable to heavy-tailed extreme load events.
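Control variates reduce Monte Carlo variance by exploiting a correlated quantity with known mean: the adjusted estimate is mean(Y - c*(X - E[X])) with c chosen as cov(Y, X)/var(X). The sketch below demonstrates this on a synthetic response, standing in for the paper's regression-based control model built from existing simulation results; the toy response function and sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# X: a cheap correlated quantity with analytically known mean (e.g., a linearized load proxy)
x = rng.normal(loc=2.0, scale=1.0, size=n)
ex = 2.0                                    # E[X]

# Y: the expensive 'response' whose mean we want (toy nonlinear function + noise)
y = np.exp(0.3 * x) + rng.normal(scale=0.2, size=n)

c = np.cov(y, x)[0, 1] / np.var(x, ddof=1)  # optimal control-variate coefficient
y_cv = y - c * (x - ex)                     # same expectation, lower variance

print("plain MC estimate :", y.mean(),   " est. variance:", y.var() / n)
print("control variates  :", y_cv.mean(), " est. variance:", y_cv.var() / n)
```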
APA, Harvard, Vancouver, ISO, and other styles
31

Wang, Xinyan, and Guie Jiao. "Research on association rules of course grades based on parallel FP-Growth algorithm." Journal of Computational Methods in Sciences and Engineering 20, no. 3 (September 30, 2020): 759–69. http://dx.doi.org/10.3233/jcm-194079.

Full text
Abstract:
With the rapid growth of massive data in all walks of life, massive data faces enormous challenges in storage capacity and computing power. In Chinese universities, traditional analysis of student course data cannot meet the growing demand posed by increasing data size and the real-time computation of big data. In this paper, a split-based parallel FP-Growth algorithm is proposed. The established FP-Tree is split into blocks, and the split FP-Trees are divided equally among different nodes. A monitoring node is set up to monitor the operation of the other nodes, dynamically migrate tasks, and maintain load balancing. Experiments show that each node achieves good load balancing at the given support degree, and the improved algorithm has better running performance than the classic FP-Growth algorithm in parallel processing. Finally, the split-based parallel FP-Growth algorithm is implemented on Hadoop to mine association rules between course grades. The mining process includes data preprocessing, mining results, and analysis. The association rules between course grades provide suggestions for the way students learn and the way teachers teach.
APA, Harvard, Vancouver, ISO, and other styles
32

Navarro, Cristóbal A., Nancy Hitschfeld-Kahler, and Luis Mateu. "A Survey on Parallel Computing and its Applications in Data-Parallel Problems Using GPU Architectures." Communications in Computational Physics 15, no. 2 (February 2014): 285–329. http://dx.doi.org/10.4208/cicp.110113.010813a.

Full text
Abstract:
Parallel computing has become an important subject in the field of computer science and has proven to be critical when researching high performance solutions. The evolution of computer architectures (multi-core and many-core) towards a higher number of cores can only confirm that parallelism is the method of choice for speeding up an algorithm. In the last decade, the graphics processing unit, or GPU, has gained an important place in the field of high performance computing (HPC) because of its low cost and massive parallel processing power. Super-computing has become, for the first time, available to anyone at the price of a desktop computer. In this paper, we survey the concept of parallel computing and especially GPU computing. Achieving efficient parallel algorithms for the GPU is not a trivial task, there are several technical restrictions that must be satisfied in order to achieve the expected performance. Some of these limitations are consequences of the underlying architecture of the GPU and the theoretical models behind it. Our goal is to present a set of theoretical and technical concepts that are often required to understand the GPU and its massive parallelism model. In particular, we show how this new technology can help the field of computational physics, especially when the problem is data-parallel. We present four examples of computational physics problems: n-body, collision detection, Potts model and cellular automata simulations. These examples well represent the kind of problems that are suitable for GPU computing. By understanding the GPU architecture and its massive parallelism programming model, one can overcome many of the technical limitations found along the way, design better GPU-based algorithms for computational physics problems and achieve speedups that can reach up to two orders of magnitude when compared to sequential implementations.
APA, Harvard, Vancouver, ISO, and other styles
33

Yavorskaya, Liliya. "New Data on the Role of Madzhar in Golden Horde Trade of Skin and Leather Products: Archeozoological Aspect." Nizhnevolzhskiy Arheologicheskiy Vestnik, no. 1 (July 2020): 202–10. http://dx.doi.org/10.15688/nav.jvolsu.2020.1.11.

Full text
Abstract:
Documentary sources on Italian sea trade of the 13th–14th centuries report the export of large volumes of animal skins and processed leather from the Golden Horde. A significant problem has been identifying, by archaeological methods, the places where livestock was slaughtered and raw materials of animal origin were processed, since organic remains are not preserved in the cultural layers of Golden Horde settlements and cities. The article analyzes three collections of animal bones from the 2014–2017 archaeological excavations in the craftsmen's quarter of the Golden Horde city of Madzhar. In the 2014 collection from excavation site no. X (10), where a master bone carver lived, a deliberate selection of goat and ram horns, whose horn covers were used to manufacture products, was identified alongside production waste of dense horn. In pit no. 2 (2016), archaeozoological research revealed a specific anatomical set of domestic ungulate remains, shattered heads and distal parts of the legs, which can only be formed as a result of massive slaughter of livestock to obtain skins. At excavation site no. XIII (2017), archaeozoological research recorded not only a specific anatomical set but also traces of the use of small cattle bones in leather-processing devices, which, combined with the archaeological context, made it possible to identify a specialized seasonal leather workshop on this site. It was established that cattle were slaughtered right in the cities and that artisans processed the obtained skins at specially equipped seasonal workshop sites. Thus, archaeozoological research shows that Madzhar, like other cities, participated in the production of animal skins and leather, which subsequently became among the most important export products of the Golden Horde state.
APA, Harvard, Vancouver, ISO, and other styles
34

Marinakis, Vangelis. "Big Data for Energy Management and Energy-Efficient Buildings." Energies 13, no. 7 (March 27, 2020): 1555. http://dx.doi.org/10.3390/en13071555.

Full text
Abstract:
European buildings are producing a massive amount of data from a wide spectrum of energy-related sources, such as smart meters’ data, sensors and other Internet of things devices, creating new research challenges. In this context, the aim of this paper is to present a high-level data-driven architecture for buildings data exchange, management and real-time processing. This multi-disciplinary big data environment enables the integration of cross-domain data, combined with emerging artificial intelligence algorithms and distributed ledgers technology. Semantically enhanced, interlinked and multilingual repositories of heterogeneous types of data are coupled with a set of visualization, querying and exploration tools, suitable application programming interfaces (APIs) for data exchange, as well as a suite of configurable and ready-to-use analytical components that implement a series of advanced machine learning and deep learning algorithms. The results from the pilot application of the proposed framework are presented and discussed. The data-driven architecture enables reliable and effective policymaking, as well as supports the creation and exploitation of innovative energy efficiency services through the utilization of a wide variety of data, for the effective operation of buildings.
APA, Harvard, Vancouver, ISO, and other styles
35

Padirayon, Lourdes M., Melvin S. Atayan, Jose Sherief Panelo, and Carlito R. Fagela, Jr. "Mining the crime data using naïve Bayes model." Indonesian Journal of Electrical Engineering and Computer Science 23, no. 2 (August 1, 2021): 1084. http://dx.doi.org/10.11591/ijeecs.v23.i2.pp1084-1092.

Full text
Abstract:
A massive number of documents on crime is handled by police departments worldwide, and today's criminals are becoming technologically sophisticated. One obstacle faced by law enforcement is the complexity of processing voluminous crime data. Approximately 439 crimes have been registered in the Sanchez Mira municipality in the past seven years. Police officers have no clear view of the crime patterns in the municipality, the peak hours and months of commission, or the locations where crimes are concentrated. The naïve Bayes model, a classification algorithm, is applied through the RapidMiner Auto Model to analyze the crime data set. This approach helps recognize crime trends: most of the crimes committed were violations of special penal laws, May has the highest count for both index and non-index crimes, and Tuesday is the most common day for crimes. Hotspots were Barangay Centro 1 for non-index crimes and Barangay Centro 2 for index crimes. Among index crimes, rape was recorded most frequently and usually occurs at 2 o'clock in the afternoon. The findings support various decisions aimed at maximizing the efficacy of crime solutions.
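The study itself runs naïve Bayes inside RapidMiner's Auto Model; as a rough stand-in showing the same kind of categorical crime classification, the sketch below trains scikit-learn's CategoricalNB on a tiny fabricated table. The feature names, categories, and records are invented purely for illustration and bear no relation to the Sanchez Mira data.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# fabricated records: day, barangay, hour block -> crime type (index / non-index)
df = pd.DataFrame({
    "day":      ["Tue", "Tue", "Mon", "Fri", "Tue", "Sat"],
    "barangay": ["Centro 2", "Centro 1", "Centro 1", "Centro 2", "Centro 2", "Centro 1"],
    "hours":    ["afternoon", "evening", "evening", "afternoon", "afternoon", "morning"],
    "type":     ["index", "non-index", "non-index", "index", "index", "non-index"],
})

enc = OrdinalEncoder()
X = enc.fit_transform(df[["day", "barangay", "hours"]]).astype(int)
y = df["type"]

model = CategoricalNB().fit(X, y)

query = enc.transform(pd.DataFrame([["Tue", "Centro 2", "afternoon"]],
                                   columns=["day", "barangay", "hours"])).astype(int)
print(model.predict(query), model.predict_proba(query))
```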
APA, Harvard, Vancouver, ISO, and other styles
36

Vo, A. V., D. F. Laefer, M. Trifkovic, C. N. L. Hewage, M. Bertolotto, N. A. Le-Khac, and U. Ofterdinger. "A HIGHLY SCALABLE DATA MANAGEMENT SYSTEM FOR POINT CLOUD AND FULL WAVEFORM LIDAR DATA." ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLIII-B4-2020 (August 25, 2020): 507–12. http://dx.doi.org/10.5194/isprs-archives-xliii-b4-2020-507-2020.

Full text
Abstract:
The massive amounts of spatio-temporal information often present in LiDAR data sets make their storage, processing, and visualisation computationally demanding. There is an increasing need for systems and tools that support all the spatial and temporal components and the three-dimensional nature of these datasets for effortless retrieval and visualisation. In response to these needs, this paper presents a scalable, distributed database system that is designed explicitly for retrieving and viewing large LiDAR datasets on the web. The ultimate goal of the system is to provide rapid and convenient access to a large repository of LiDAR data hosted in a distributed computing platform. The system is composed of multiple, share-nothing nodes operating in parallel. Namely, each node is autonomous and has a dedicated set of processors and memory. The nodes communicate with each other via an interconnected network. The data management system presented in this paper is implemented based on Apache HBase, a distributed key-value datastore within the Hadoop eco-system. HBase is extended with new data encoding and indexing mechanisms to accommodate both the point cloud and the full waveform components of LiDAR data. The data can be consumed by any desktop or web application that communicates with the data repository using the HTTP protocol. The communication is enabled by a web servlet. In addition to the command line tool used for administration tasks, two web applications are presented to illustrate the types of user-facing applications that can be coupled with the data system.
APA, Harvard, Vancouver, ISO, and other styles
37

Alexopoulos, Athanasios, Georgios Drakopoulos, Andreas Kanavos, Phivos Mylonas, and Gerasimos Vonitsanos. "Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark." Algorithms 13, no. 3 (March 24, 2020): 71. http://dx.doi.org/10.3390/a13030071.

Full text
Abstract:
At the dawn of the 10V or big data era, there are a considerable number of sources such as smart phones, IoT devices, social media, smart city sensors, as well as the health care system, all of which constitute but a small portion of the data lakes feeding the entire big data ecosystem. This 10V data growth poses two primary challenges, namely storing and processing. Concerning the latter, new frameworks have been developed including distributed platforms such as the Hadoop ecosystem. Classification is a major machine learning task typically executed on distributed platforms and as a consequence many algorithmic techniques have been developed tailored for these platforms. This article extensively relies in two ways on classifiers implemented in MLlib, the main machine learning library for the Hadoop ecosystem. First, a vast number of classifiers is applied to two datasets, namely Higgs and PAMAP. Second, a two-step classification is performed ab ovo on the same datasets. Specifically, the singular value decomposition of the data matrix determines first a set of transformed attributes which in turn drive the classifiers of MLlib. The twofold purpose of the proposed architecture is to reduce complexity while maintaining a similar if not better level of the metrics of accuracy, recall, and F1. The intuition behind this approach stems from the engineering principle of breaking down complex problems into simpler and more manageable tasks. The experiments based on the same Spark cluster indicate that the proposed architecture outperforms the individual classifiers with respect to both complexity and the abovementioned metrics.
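A minimal single-machine analogue of the two-step architecture can clarify the idea: reduce the attribute space with an SVD, then train the classifier on the transformed attributes. The sketch below uses scikit-learn and a synthetic dataset instead of Spark MLlib, Higgs or PAMAP, so it only illustrates the shape of the pipeline.

```python
# Single-machine analogue of the two-step pipeline described above: an SVD of
# the data matrix produces a reduced set of attributes that then drive a
# classifier. The Spark/MLlib deployment is replaced here by scikit-learn on a
# synthetic data set purely for illustration.
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import numpy as np

X, y = make_classification(n_samples=5000, n_features=100, n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: project the attributes onto the top singular directions.
svd = TruncatedSVD(n_components=20, random_state=0)
Z_train = svd.fit_transform(X_train)
Z_test = svd.transform(X_test)

# Step 2: train the classifier on the reduced representation.
clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
print("F1 on SVD features:", round(f1_score(y_test, clf.predict(Z_test)), 3))

# Baseline: the same classifier on the raw attributes, for comparison.
base = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("F1 on raw features:", round(f1_score(y_test, base.predict(X_test)), 3))
```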
APA, Harvard, Vancouver, ISO, and other styles
38

Zyprych-Walczak, J., A. Szabelska, L. Handschuh, K. Górczak, K. Klamecka, M. Figlerowicz, and I. Siatkowski. "The Impact of Normalization Methods on RNA-Seq Data Analysis." BioMed Research International 2015 (2015): 1–10. http://dx.doi.org/10.1155/2015/621690.

Full text
Abstract:
High-throughput sequencing technologies, such as the Illumina Hi-seq, are powerful new tools for investigating a wide range of biological and medical problems. Massive and complex data sets produced by the sequencers create a need for development of statistical and computational methods that can tackle the analysis and management of data. The data normalization is one of the most crucial steps of data processing and this process must be carefully considered as it has a profound effect on the results of the analysis. In this work, we focus on a comprehensive comparison of five normalization methods related to sequencing depth, widely used for transcriptome sequencing (RNA-seq) data, and their impact on the results of gene expression analysis. Based on this study, we suggest a universal workflow that can be applied for the selection of the optimal normalization procedure for any particular data set. The described workflow includes calculation of the bias and variance values for the control genes, sensitivity and specificity of the methods, and classification errors as well as generation of the diagnostic plots. Combining the above information facilitates the selection of the most appropriate normalization method for the studied data sets and determines which methods can be used interchangeably.
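To make the notion of a sequencing-depth normalization concrete, the sketch below computes median-of-ratios size factors (the DESeq-style method, typical of those compared in such workflows) on a synthetic count matrix; it is not the authors' workflow, and real analyses would use an established package.

```python
# Sketch of one widely used sequencing-depth normalization, the median-of-ratios
# ("DESeq-style") size factors, typical of the methods compared in such
# workflows. The count matrix below is synthetic (genes x samples).
import numpy as np

rng = np.random.default_rng(0)
counts = rng.negative_binomial(n=5, p=0.3, size=(1000, 4)) * rng.integers(1, 4, size=(1, 4))

def size_factors(counts: np.ndarray) -> np.ndarray:
    """Median-of-ratios size factor per sample (columns are samples)."""
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts.astype(float))
    # Keep only genes with non-zero counts in every sample.
    finite = np.all(np.isfinite(log_counts), axis=1)
    log_geo_means = log_counts[finite].mean(axis=1, keepdims=True)
    # A sample's factor is the median ratio of its counts to the gene-wise
    # geometric means, computed on the log scale.
    return np.exp(np.median(log_counts[finite] - log_geo_means, axis=0))

sf = size_factors(counts)
normalized = counts / sf  # counts rescaled to a common sequencing depth
print("size factors:", np.round(sf, 3))
```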
APA, Harvard, Vancouver, ISO, and other styles
39

Grønbech, Christopher Heje, Maximillian Fornitz Vording, Pascal N. Timshel, Casper Kaae Sønderby, Tune H. Pers, and Ole Winther. "scVAE: variational auto-encoders for single-cell gene expression data." Bioinformatics 36, no. 16 (May 16, 2020): 4415–22. http://dx.doi.org/10.1093/bioinformatics/btaa293.

Full text
Abstract:
Abstract Motivation Models for analysing and making relevant biological inferences from massive amounts of complex single-cell transcriptomic data typically require several individual data-processing steps, each with their own set of hyperparameter choices. With deep generative models one can work directly with count data, make likelihood-based model comparison, learn a latent representation of the cells and capture more of the variability in different cell populations. Results We propose a novel method based on variational auto-encoders (VAEs) for analysis of single-cell RNA sequencing (scRNA-seq) data. It avoids data preprocessing by using raw count data as input and can robustly estimate the expected gene expression levels and a latent representation for each cell. We tested several count likelihood functions and a variant of the VAE that has a priori clustering in the latent space. We show for several scRNA-seq datasets that our method outperforms recently proposed scRNA-seq methods in clustering cells and that the resulting clusters reflect cell types. Availability and implementation Our method, called scVAE, is implemented in Python using the TensorFlow machine-learning library, and it is freely available at https://github.com/scvae/scvae. Supplementary information Supplementary data are available at Bioinformatics online.
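The following sketch shows a drastically simplified count-based VAE in PyTorch with a Poisson decoder, only to illustrate the ingredients named above (raw counts as input, a latent representation per cell, a likelihood-based loss). scVAE itself is a TensorFlow implementation with richer likelihoods, so none of the code below should be read as its API.

```python
# Highly simplified sketch of a variational auto-encoder over raw count data,
# in the spirit of the method described above. The Poisson decoder, layer sizes
# and random training data are illustrative assumptions only.
import torch
import torch.nn as nn

class CountVAE(nn.Module):
    def __init__(self, n_genes: int, latent_dim: int = 10, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_genes)
        )

    def forward(self, x):
        h = self.encoder(torch.log1p(x))               # log1p stabilizes raw counts
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        rate = torch.exp(self.decoder(z))               # Poisson rate per gene
        return rate, mu, logvar

def elbo_loss(x, rate, mu, logvar):
    # Negative Poisson log-likelihood (reconstruction) plus KL divergence to N(0, I).
    recon = torch.nn.functional.poisson_nll_loss(rate, x, log_input=False, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Train briefly on random counts just to show the loop; real use would feed an
# scRNA-seq count matrix (cells x genes).
x = torch.poisson(torch.rand(256, 200) * 5.0)
model = CountVAE(n_genes=200)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(5):
    opt.zero_grad()
    rate, mu, logvar = model(x)
    loss = elbo_loss(x, rate, mu, logvar)
    loss.backward()
    opt.step()
    print(epoch, round(loss.item() / len(x), 2))
```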
APA, Harvard, Vancouver, ISO, and other styles
40

Lu, Wenjuan, Aiguo Liu, and Chengcheng Zhang. "Research and implementation of big data visualization based on WebGIS." Proceedings of the ICA 2 (July 10, 2019): 1–6. http://dx.doi.org/10.5194/ica-proc-2-79-2019.

Full text
Abstract:
Abstract. With the development of geographic information technology, the means of acquiring geographical information are constantly expanding and the volume of spatio-temporal data is exploding, so more and more scholars have turned to the field of spatio-temporal data processing and analysis. Traditional data visualization techniques are popular and easy to understand: simple pie charts and histograms can reveal and analyze the characteristics of the data themselves, but they cannot be combined with maps to display the hidden temporal and spatial information and realize its application value. How to fully explore the spatio-temporal information contained in massive data and accurately characterize the spatial distribution and variation rules of geographical objects and phenomena is a key research problem at present. Based on this, this paper designs and constructs a general thematic data visual analysis system that supports the full set of functions of data warehousing, data management, data analysis and data visualization. Taking Weifang city as the research area and starting from rainfall interpolation analysis and comprehensive population analysis of Weifang, the authors realize fast and efficient display of a large data set and fully present the characteristics of spatial and temporal data through thematic visualization. At the same time, the Cassandra distributed database is adopted in this research to store, manage and analyze big data. To a certain extent, this reduces the pressure of front-end map drawing and provides good query and analysis efficiency and fast processing ability.
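One of the analyses mentioned, rainfall interpolation, can be illustrated with a short inverse-distance-weighting (IDW) sketch; the station coordinates and rainfall values below are invented, and the paper may well use a different interpolation method.

```python
# Sketch of inverse-distance-weighted (IDW) interpolation of rainfall stations
# onto a grid, a common basis for the kind of rainfall thematic layer mentioned
# above. Station coordinates and values are invented for illustration.
import numpy as np

stations = np.array([[119.10, 36.70], [119.45, 36.55], [118.95, 36.40], [119.30, 36.85]])
rainfall = np.array([42.0, 55.5, 38.2, 61.0])   # mm, hypothetical observations

def idw(points, values, targets, power=2.0, eps=1e-12):
    """Interpolate values at target coordinates by inverse-distance weighting."""
    d = np.linalg.norm(targets[:, None, :] - points[None, :, :], axis=2)
    w = 1.0 / np.maximum(d, eps) ** power
    return (w @ values) / w.sum(axis=1)

# Build a coarse grid over the study area and interpolate.
lon, lat = np.meshgrid(np.linspace(118.9, 119.5, 30), np.linspace(36.3, 36.9, 30))
grid = np.column_stack([lon.ravel(), lat.ravel()])
surface = idw(stations, rainfall, grid).reshape(lon.shape)
print(surface.shape, surface.min().round(1), surface.max().round(1))
```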
APA, Harvard, Vancouver, ISO, and other styles
41

Ross, M. K., Wei Wei, and L. Ohno-Machado. "“Big Data” and the Electronic Health Record." Yearbook of Medical Informatics 23, no. 01 (August 2014): 97–104. http://dx.doi.org/10.15265/iy-2014-0003.

Full text
Abstract:
Summary Objectives: Implementation of Electronic Health Record (EHR) systems continues to expand. The massive number of patient encounters results in high amounts of stored data. Transforming clinical data into knowledge to improve patient care has been the goal of biomedical informatics professionals for many decades, and this work is now increasingly recognized outside our field. In reviewing the literature for the past three years, we focus on “big data” in the context of EHR systems and we report on some examples of how secondary use of data has been put into practice. Methods: We searched PubMed database for articles from January 1, 2011 to November 1, 2013. We initiated the search with keywords related to “big data” and EHR. We identified relevant articles and additional keywords from the retrieved articles were added. Based on the new keywords, more articles were retrieved and we manually narrowed down the set utilizing predefined inclusion and exclusion criteria. Results: Our final review includes articles categorized into the themes of data mining (pharmacovigilance, phenotyping, natural language processing), data application and integration (clinical decision support, personal monitoring, social media), and privacy and security. Conclusion: The increasing adoption of EHR systems worldwide makes it possible to capture large amounts of clinical data. There is an increasing number of articles addressing the theme of “big data”, and the concepts associated with these articles vary. The next step is to transform healthcare big data into actionable knowledge.
APA, Harvard, Vancouver, ISO, and other styles
42

Wang, Xu, and Shi Fei Ding. "An Overview of Quotient Space Theory." Advanced Materials Research 187 (February 2011): 326–31. http://dx.doi.org/10.4028/www.scientific.net/amr.187.326.

Full text
Abstract:
Granular computing (GrC) is another method for solving artificial intelligence problems, following neural networks, fuzzy set theory, genetic algorithms, evolutionary algorithms and so on. GrC involves all the theories, methodologies and techniques of granularity, providing a powerful tool for the solution of complex problems, massive data mining, and fuzzy information processing. Quotient space theory is a representative model of granular computing. In this paper, first the current situation and the development prospects of quotient space theory are introduced, then the basic theory of quotient-space granular computing is presented and the stratification and synthesis principles of granularity are summarized. Finally, we discuss some important issues such as the application and promotion of quotient space theory.
APA, Harvard, Vancouver, ISO, and other styles
43

Qu, Zhijian, Hanxin Liu, Hanlin Wang, Xinqiang Chen, Rui Chi, and Zixiao Wang. "Cluster equilibrium scheduling method based on backpressure flow control in railway power supply systems." PLOS ONE 15, no. 12 (December 9, 2020): e0243543. http://dx.doi.org/10.1371/journal.pone.0243543.

Full text
Abstract:
The purpose of the study is to address two problems in the scheduling and monitoring center of a railway network: the increasingly significant processing delay of massive monitoring data and imbalanced tasks. To tackle these problems, a smooth weighted round-robin scheduling method based on backpressure flow control (BF-SWRR) is proposed. The method is developed based on a model for message queues and real-time streaming computing. Using telemetry data flows as the input, the fields of the data sources are segmented into different sets in a distributed, parallel stream-computing model. Moreover, the round-robin (RR) scheduling method for the distributed server is improved. The parallelism, memory occupancy, and system delay are tested by taking a high-speed train section of a certain line as an example. The results show that the BF-SWRR method for clusters can control the delay to within 1 s. When the parallelism of distributed clusters is set to 8, occupancy rates of the CPU and memory can be decreased by about 15%. In this way, the overall load of the cluster during stream computing is more balanced.
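For readers unfamiliar with smooth weighted round-robin, the sketch below shows the standard selection step together with a naive backpressure adjustment in which a node's effective weight shrinks as its queue grows; the adjustment rule and the node names are assumptions, not the exact BF-SWRR formulation.

```python
# Sketch of smooth weighted round-robin selection with a naive backpressure
# adjustment: a node's effective weight is scaled down as its queue grows. The
# scaling rule is an illustrative assumption, not the exact BF-SWRR formula.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    weight: int               # configured weight
    queue_len: int = 0        # current backlog, used as the backpressure signal
    current: float = field(default=0.0, init=False)

    def effective_weight(self, max_queue: int = 100) -> float:
        # Shrink the weight linearly as the queue fills up (assumed rule).
        return self.weight * max(0.1, 1.0 - self.queue_len / max_queue)

def pick(nodes: list[Node]) -> Node:
    """One smooth-WRR step: raise every current value, pick the max, lower it."""
    total = sum(n.effective_weight() for n in nodes)
    for n in nodes:
        n.current += n.effective_weight()
    best = max(nodes, key=lambda n: n.current)
    best.current -= total
    return best

nodes = [Node("worker-1", 5, queue_len=10), Node("worker-2", 3, queue_len=80), Node("worker-3", 2)]
schedule = [pick(nodes).name for _ in range(10)]
print(schedule)   # worker-2 is chosen less often because of its long queue
```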
APA, Harvard, Vancouver, ISO, and other styles
44

Brunt, Scott, Heather Solomon, Kathleen Brown, and April Davis. "Feline and Canine Rabies in New York State, USA." Viruses 13, no. 3 (March 10, 2021): 450. http://dx.doi.org/10.3390/v13030450.

Full text
Abstract:
In New York State, domestic animals are no longer considered rabies vector species, but given their ubiquity with humans, rabies cases in dogs and cats often result in multiple individuals requiring post-exposure prophylaxis. For over a decade, the New York State rabies laboratory has variant-typed these domestic animals to aid in epidemiological investigations, determine exposures, and generate demographic data. We produced a data set that outlined vaccination status, ownership, and rabies results. Our data demonstrate that a large percentage of felines submitted for rabies testing were not vaccinated or did not have a current rabies vaccination, while canines were largely vaccinated. Despite massive vaccination campaigns, free clinics, and education, these companion animals still occasionally contract rabies. Barring translocation events, we note that rabies-positive cats and dogs in New York State have exclusively contracted a raccoon variant. While the United States has made tremendous strides in reducing its rabies burden, we hope these data will encourage responsible pet ownership including rabies vaccinations to reduce unnecessary animal mortality, long quarantines, and post-exposure prophylaxis in humans.
APA, Harvard, Vancouver, ISO, and other styles
45

Riyadh, Musaab, and Dina Riadh Alshibani. "Intrusion detection system based on machine learning techniques." Indonesian Journal of Electrical Engineering and Computer Science 23, no. 2 (August 1, 2021): 953. http://dx.doi.org/10.11591/ijeecs.v23.i2.pp953-961.

Full text
Abstract:
Recently, data flow over the internet has increased exponentially due to the massive growth of the computer networks connected to it. Some of these data can be classified as malicious activity that cannot be captured by firewalls and anti-malware tools. Because of this, intrusion detection systems are urgently needed to recognize malicious activity and preserve data integrity and availability. In this study, an intrusion detection system based on cluster-feature concepts and a KNN classifier is suggested to handle various challenging issues in data, such as incomplete data, mixed-type data, and noisy data. To strengthen the proposed system, a special kind of pattern similarity measure is supported to deal with these types of challenges. The experimental results show that the classification accuracy of the suggested system is better than that of the K-nearest neighbor (KNN) and support vector machine classifiers when processing incomplete data sets, in spite of a drop in the overall detection accuracy.
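A hedged sketch of the surrounding pipeline may help: impute missing values, encode mixed-type attributes, and classify flows with KNN. The cluster-feature representation and the special pattern similarity measures of the proposed system are not reproduced, and the flow records below are invented.

```python
# Sketch of the KNN classification stage with basic handling of incomplete and
# mixed-type records via imputation and one-hot encoding. The cluster-feature
# representation and similarity measures of the proposed system are not shown.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

flows = pd.DataFrame({
    "duration": [0.1, 2.3, np.nan, 0.4, 5.9, 0.2],
    "bytes":    [120, 98000, 430, np.nan, 250000, 310],
    "protocol": ["tcp", "tcp", "udp", "icmp", "tcp", np.nan],
    "label":    ["normal", "attack", "normal", "normal", "attack", "normal"],
})

numeric = ["duration", "bytes"]
categorical = ["protocol"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline([("prep", preprocess), ("knn", KNeighborsClassifier(n_neighbors=3))])
model.fit(flows[numeric + categorical], flows["label"])
print(model.predict(flows[numeric + categorical]))
```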
APA, Harvard, Vancouver, ISO, and other styles
46

Li, Qin, Shaobo Li, Sen Zhang, Jie Hu, and Jianjun Hu. "A Review of Text Corpus-Based Tourism Big Data Mining." Applied Sciences 9, no. 16 (August 12, 2019): 3300. http://dx.doi.org/10.3390/app9163300.

Full text
Abstract:
With the massive growth of the Internet, text data has become one of the main formats of tourism big data. As an effective expression means of tourists’ opinions, text mining of such data has big potential to inspire innovations for tourism practitioners. In the past decade, a variety of text mining techniques have been proposed and applied to tourism analysis to develop tourism value analysis models, build tourism recommendation systems, create tourist profiles, and make policies for supervising tourism markets. The successes of these techniques have been further boosted by the progress of natural language processing (NLP), machine learning, and deep learning. With the understanding of the complexity due to this diverse set of techniques and tourism text data sources, this work attempts to provide a detailed and up-to-date review of text mining techniques that have been, or have the potential to be, applied to modern tourism big data analysis. We summarize and discuss different text representation strategies, text-based NLP techniques for topic extraction, text classification, sentiment analysis, and text clustering in the context of tourism text mining, and their applications in tourist profiling, destination image analysis, market demand, etc. Our work also provides guidelines for constructing new tourism big data applications and outlines promising research areas in this field for incoming years.
APA, Harvard, Vancouver, ISO, and other styles
47

Henriques, João, Filipe Caldeira, Tiago Cruz, and Paulo Simões. "Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets." Electronics 9, no. 7 (July 17, 2020): 1164. http://dx.doi.org/10.3390/electronics9071164.

Full text
Abstract:
Computing and networking systems traditionally record their activity in log files, which have been used for multiple purposes, such as troubleshooting, accounting, post-incident analysis of security breaches, capacity planning and anomaly detection. In earlier systems those log files were processed manually by system administrators, or with the support of basic applications for filtering, compiling and pre-processing the logs for specific purposes. However, as the volume of these log files continues to grow (more logs per system, more systems per domain), it is becoming increasingly difficult to process those logs using traditional tools, especially for less straightforward purposes such as anomaly detection. On the other hand, as systems continue to become more complex, the potential of using large datasets built of logs from heterogeneous sources for detecting anomalies without prior domain knowledge becomes higher. Anomaly detection tools for such scenarios face two challenges. First, devising appropriate data analysis solutions for effectively detecting anomalies from large data sources, possibly without prior domain knowledge. Second, adopting data processing platforms able to cope with the large datasets and complex data analysis algorithms required for such purposes. In this paper we address those challenges by proposing an integrated scalable framework that aims at efficiently detecting anomalous events on large amounts of unlabeled data logs. Detection is supported by clustering and classification methods that take advantage of parallel computing environments. We validate our approach using the well-known NASA Hypertext Transfer Protocol (HTTP) logs datasets. Fourteen features were extracted in order to train a k-means model for separating anomalous and normal events in highly coherent clusters. A second model, making use of the XGBoost system implementing a gradient tree boosting algorithm, uses the previous binary clustered data for producing a set of simple interpretable rules. These rules represent the rationale for generalizing its application over a massive number of unseen events in a distributed computing environment. The classified anomaly events produced by our framework can be used, for instance, as candidates for further forensic and compliance auditing analysis in security management.
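The two-stage idea can be sketched compactly: k-means separates unlabeled events into clusters, and a gradient-boosted tree trained on the cluster labels yields inspectable rules. The sketch below uses synthetic features rather than the fourteen features extracted from the NASA HTTP logs, so it only mirrors the structure of the framework.

```python
# Sketch of the two-stage idea described above: k-means separates unlabeled
# events into clusters, then a gradient-boosted tree is trained on the binary
# cluster assignment so its trees can be inspected as simple rules. The feature
# extraction from the NASA HTTP logs is not reproduced; features are synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
normal = rng.normal(loc=[200, 0.2, 5], scale=[50, 0.05, 2], size=(950, 3))
anomal = rng.normal(loc=[5000, 0.9, 60], scale=[800, 0.05, 10], size=(50, 3))
X = StandardScaler().fit_transform(np.vstack([normal, anomal]))  # e.g. bytes, error rate, burst size

# Stage 1: unsupervised separation into two clusters.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Stage 2: supervised model on the cluster labels; its boosted trees act as
# interpretable rules that generalize the clustering to unseen events.
booster = XGBClassifier(n_estimators=20, max_depth=3, eval_metric="logloss")
booster.fit(X, clusters)
print(booster.get_booster().get_dump()[0])   # text dump of the first tree (the "rules")
```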
APA, Harvard, Vancouver, ISO, and other styles
48

Hegerova, Livia, Jeffrey P. Anderson, and Colleen T. Morton. "A Systems-Based Approach to Improve Fibrinogen Testing and Treatment of Low Fibrinogen in Patients Receiving Massive Transfusions: A Quality Improvement Initiative." Blood 128, no. 22 (December 2, 2016): 2335. http://dx.doi.org/10.1182/blood.v128.22.2335.2335.

Full text
Abstract:
Abstract Background: Uncontrolled hemorrhage is the most common treatable cause of death, and four of every ten trauma patients die as a result of exsanguination, or its late effects (Curry et al. Scand J Trauma 2014). There is an increasing understanding of the state of acute coagulopathy and the role that fibrinogen plays in major hemorrhage (Wikkelso A et al. Cochrane Syst Rev 2013). Fibrinogen is a critical protein for hemostasis and clot formation. Low fibrinogen is a risk factor for hemorrhage in patients with major hemorrhage including surgical, obstetric and trauma patients. Observational studies have reported improved survival with higher fibrinogen:RBC transfusion ratios in trauma. At Regions Hospital, St. Paul, MN, the Transfusion Committee observed that many patients receiving massive transfusions did not have fibrinogen activity tested. Aim: To improve fibrinogen testing and treatment of low fibrinogen in patients receiving massive transfusions by using a hospital-wide, electronic medical record (EMR)-based Massive Transfusion Protocol (MTP) order set. Outcomes, including survival and transfusion requirements, will also be evaluated. Methods: Retrospective analysis of data from existing databases identified 127 patients who had massive hemorrhage as defined by activation of the massive transfusion protocol (MTP) at Regions Hospital between 2014 and 2016. We performed chart reviews to assess fibrinogen replacement practice 6 months before (n=68) and 6 months (n=59) after implementation of an EMR-based MTP order set in a quality improvement model. The order set automatically orders fibrinogen activity, in addition to hemoglobin, platelet count, INR, and PTT. Once the order set is activated, it will alert the provider to a low fibrinogen activity result using a best practice alert. The alert then directs therapy by opening the order for administration of cryoprecipitate. To evaluate the impact of this order set on fibrinogen testing and clinical outcomes, we constructed multivariable logistic regression models. Results: During the study period, 127 patients had the MTP activated. The median age was 51 years and 67% were male. The majority of MTPs were activated for trauma (57%) located primarily in ED (64%). The common admitting diagnoses were motor vehicle accident (29%), heart surgery/procedure (18%), or GI bleed (16%). The admitting hemoglobin, platelet count, INR, and PTT were similar pre and post-intervention. Prior to the use of the MTP order set, only 32% of patients receiving the MTP had fibrinogen tested. Of the patients with a fibrinogen activity tested, over one-third had a low fibrinogen and of those 56% did not receive cryoprecipitate. Fibrinogen testing increased after the intervention (61% vs 32%, p=0.001), and among patients with low fibrinogen, transfusion of cryoprecipitate occurred more often (70% vs 44%, p=0.370). Blood transfusion requirements for red blood cells (7.0 vs 9.9, p=0.133), fresh-frozen plasma (4.9 vs 6.7, p=0.063), and platelets (1.2 vs 1.6, p=0.068) decreased post-intervention. In multivariate analysis, patients were approximately 3 times more likely to have fibrinogen activity tested after the intervention (OR 3.06, p=0.003). Deaths within 24 hours of MTP were more likely to occur among patients in the pre-intervention period (OR=1.45; 95% CI 0.42-5.00) and those with low fibrinogen (OR=1.34; 95% CI 0.26-7.08), however, due to the limited number of events, these estimates did not reach statistical significance.
Conclusions: A systems-based approach with a hospital-wide EMR order set for the MTP improved the testing for and treatment of low fibrinogen in patients with massive hemorrhage. This resulted in a trend towards improved outcomes. We did not achieve 100% fibrinogen testing after the intervention because the MTP can still be activated without using the order set, and this will be corrected in a future update. The treatment of patients with traumatic hemorrhage remains challenging and varies widely between trauma centers. Standardized treatment, automation of lab ordering, and the use of alerts can help providers improve the quality of care and clinical outcomes for patients. Disclosures No relevant conflicts of interest to declare.
APA, Harvard, Vancouver, ISO, and other styles
49

Pimpalkar, Amit Purushottam, and R. Jeberson Retna Raj. "Influence of Pre-Processing Strategies on the Performance of ML Classifiers Exploiting TF-IDF and BOW Features." ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal 9, no. 2 (June 18, 2020): 49–68. http://dx.doi.org/10.14201/adcaij2020924968.

Full text
Abstract:
Data analytics and its associated applications have recently become important fields of study. The subject of concern for researchers nowadays is the massive amount of data produced every minute and second as people constantly share thoughts and opinions about things that are associated with them. Social media information, however, is still unstructured, dispersed and hard to handle, and a strong foundation needs to be developed so that it can be utilized as valuable information on a particular topic. Processing such unstructured data, with its noise, co-relevance, emoticons, folksonomies and slang, is quite challenging and therefore requires proper data pre-processing before the right sentiments can be extracted. The dataset is extracted from Kaggle and Twitter, pre-processing is performed using NLTK and Scikit-learn, and feature selection and extraction are done for the Bag of Words (BOW), Term Frequency (TF) and Inverse Document Frequency (IDF) schemes. For polarity identification, we evaluated five different Machine Learning (ML) algorithms, viz. Multinomial Naive Bayes (MNB), Logistic Regression (LR), Decision Trees (DT), XGBoost (XGB) and Support Vector Machines (SVM). We performed a comparative analysis of these algorithms in order to decide which works best for the given dataset in terms of recall, accuracy, F1-score and precision. We assess the effects of various pre-processing techniques on two datasets, one domain-specific and the other not. It is demonstrated that the SVM classifier outperformed the other classifiers, with superior evaluations of 73.12% and 94.91% for accuracy and precision respectively. It is also highlighted in this research that the selection and representation of features, along with various pre-processing techniques, have a positive impact on the performance of the classification. The ultimate outcome indicates an improvement in sentiment classification, and we note that the pre-processing approaches clearly improve the efficiency of the classifiers.
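A compact sketch of the kind of pipeline evaluated in the study, pre-processing plus TF-IDF features plus an SVM, is shown below; the handful of in-line documents merely stand in for the Kaggle and Twitter data, and the exact pre-processing steps of the paper are not reproduced.

```python
# Sketch of the kind of pipeline evaluated above: light text pre-processing,
# TF-IDF features, and an SVM classifier. The in-line documents are placeholders
# for the Kaggle/Twitter data, which is not reproduced here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train_texts = ["I love this phone, great battery :)",
               "worst service ever, totally disappointed",
               "amazing camera and fast delivery",
               "do not buy, broke after two days"]
train_labels = ["positive", "negative", "positive", "negative"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2))),
    ("svm", LinearSVC()),
])
model.fit(train_texts, train_labels)

test_texts = ["battery life is great", "totally broke, very disappointed"]
print(model.predict(test_texts))
```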
APA, Harvard, Vancouver, ISO, and other styles
50

Marouf, Djamila, Djamila Hamdadou, and Karim Bouamrane. "The MAV-ES Data Integration Approach for Decisional Information Systems (DIS)." International Journal of Healthcare Information Systems and Informatics 11, no. 4 (October 2016): 32–55. http://dx.doi.org/10.4018/ijhisi.2016100102.

Full text
Abstract:
Massive data to facilitate decision making for organizations and their corporate users exist in many forms, types and formats. Importantly, the acquisition and retrieval of relevant supporting information should be timely, precise and complete. Unfortunately, due to differences in syntax and semantics, the extraction and integration of available semi-structured data from different sources often fail. The need for seamless and effective data integration so as to access, retrieve and use information from diverse data sources cannot be overemphasized. Moreover, information external to organizations may also often have to be sourced for the intended users through a smart data integration system. Owing to the open, dynamic and heterogeneous nature of data, data integration is becoming an increasingly complex process. A new data integration approach encapsulating mediator systems and a data warehouse is proposed here. Aside from the heterogeneity of data sources, other data integration design problems include defining the global schema, the mappings and the query processing. In order to meet all of these challenges, the authors of this paper advocate an approach named MAV-ES, which is characterized by an architecture based on a global schema, partial schemas and a set of sources. The primary benefit of this architecture is that it combines the two basic GAV and LAV approaches so as to realize the added-value benefits of a mixed approach.
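To make the GAV side of such a mixed architecture concrete, the toy sketch below defines a global relation as views over two heterogeneous sources and unfolds a global query into source calls. The source structures, field names and helper functions are hypothetical and do not describe the MAV-ES implementation.

```python
# Minimal sketch of the GAV side of a mediator: each global relation is defined
# as a view over the sources, so a query on the global schema unfolds into calls
# to the source wrappers. All structures and names are toy assumptions used only
# to illustrate the idea, not the MAV-ES design itself.
from typing import Callable, Dict, Iterable, List

# Two heterogeneous "sources" with different field names.
source_hospital = [{"pid": 1, "fullname": "A. Smith", "dob": "1980-02-01"}]
source_clinic = [{"patient_id": 2, "name": "B. Jones", "birth": "1975-07-19"}]

# GAV mapping: the global relation Patient(id, name, birth_date) is defined as
# a union of per-source transformations (the "views").
GLOBAL_VIEWS: Dict[str, List[Callable[[], Iterable[dict]]]] = {
    "Patient": [
        lambda: ({"id": r["pid"], "name": r["fullname"], "birth_date": r["dob"]}
                 for r in source_hospital),
        lambda: ({"id": r["patient_id"], "name": r["name"], "birth_date": r["birth"]}
                 for r in source_clinic),
    ],
}

def query(relation: str, predicate=lambda row: True) -> List[dict]:
    """Unfold a global-schema query into the mapped source views and filter."""
    rows = (row for view in GLOBAL_VIEWS[relation] for row in view())
    return [row for row in rows if predicate(row)]

print(query("Patient", lambda r: r["birth_date"] > "1978"))
```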
APA, Harvard, Vancouver, ISO, and other styles
