To see the other types of publications on this topic, follow the link: Quality of datasets.

Journal articles on the topic 'Quality of datasets'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 journal articles for your research on the topic 'Quality of datasets.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Chen, Yijun, Shenxin Zhao, Lihua Zhang, and Qi Zhou. "Quality Assessment of Global Ocean Island Datasets." ISPRS International Journal of Geo-Information 12, no. 4 (2023): 168. http://dx.doi.org/10.3390/ijgi12040168.

Abstract:
Ocean island data are essential to the conservation and management of islands and coastal ecosystems, and have also been adopted by the United Nations as a sustainable development goal (SDG 14). Currently, two categories of island datasets, i.e., the global shoreline vector (GSV) and OpenStreetMap (OSM), are freely available on a global scale. However, few studies have focused on assessing and comparing the data quality of these two datasets, which is the main purpose of our study. Specifically, the two datasets were assessed over four 100 × 100 km2 study areas, in terms of three aspects of measures, i.e., accuracy (including overall accuracy (OA), precision, recall and F1), completeness (including area completeness and count completeness) and shape complexity. The results showed that: (1) Both datasets perform well in terms of OA (98% or above) and F1 (0.9 or above); the OSM dataset performs better in terms of precision, but the GSV dataset performs better in terms of recall. (2) The area completeness is almost 100%, but the count completeness is much higher than 100%, indicating that the total areas of the two datasets are almost the same, but there are many more islands in the OSM dataset. (3) In most cases, the fractal dimension of the OSM dataset is larger than that of the GSV dataset in terms of shape complexity, indicating that the OSM dataset has more detail in the island boundary or coastline. We concluded that both datasets (GSV and OSM) are effective for island mapping, but the OSM dataset can identify more small islands and captures more detail.
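The accuracy measures this abstract names (OA, precision, recall, F1) are standard confusion-matrix statistics. A minimal sketch for comparing a binary island mask against a reference (the function name and array layout are illustrative, not taken from the paper):

```python
import numpy as np

def segmentation_metrics(truth, pred):
    """OA, precision, recall and F1 for a binary island/water mask.

    `truth` and `pred` are boolean arrays (True = island pixel).
    """
    truth = np.asarray(truth, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    tp = np.sum(truth & pred)       # island correctly mapped as island
    tn = np.sum(~truth & ~pred)     # water correctly mapped as water
    fp = np.sum(~truth & pred)
    fn = np.sum(truth & ~pred)
    oa = (tp + tn) / truth.size     # overall accuracy
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return oa, precision, recall, f1
```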
2

Waller, John. "Data Location Quality at GBIF." Biodiversity Information Science and Standards 3 (June 13, 2019): e35829. https://doi.org/10.3897/biss.3.35829.

Abstract:
I will cover how the Global Biodiversity Information Facility (GBIF) handles data quality issues, with specific focus on coordinate location issues, such as gridded datasets (Fig. 1) and country centroids. I will highlight the challenges GBIF faces in identifying potential data quality problems and what we and others (Zizka et al. 2019) are doing to discover and address them. GBIF, the largest open-data portal of biodiversity data, is a large network of individual datasets (>40k) from various sources and publishers. Since these datasets are variable both within themselves and dataset-to-dataset, this creates a challenge for users wanting to use data collected from museums, smartphones, atlases, satellite tracking, DNA sequencing, and various other sources for research or analysis. Data quality at GBIF will always be a moving target (Chapman 2005), and GBIF already handles many obvious errors such as zero/impossible coordinates, empty or invalid data fields, and fuzzy taxon matching. Since GBIF primarily (but not exclusively) serves lat-lon location information, there is an expectation that occurrences fall somewhat close to where the species actually occurs. This is not always the case. Occurrence records can be hundreds of kilometers away from where the species naturally occurs, and there can be multiple reasons for this, which might not be entirely obvious to users. One reason is that many GBIF datasets are gridded. Gridded datasets are datasets that have low resolution due to equally-spaced sampling. This can be a data quality issue because a user might assume an occurrence record was recorded exactly at its coordinates. Country centroids are another reason why a species occurrence record might be far from where it occurs naturally. GBIF does not yet flag country centroids, which are records where the dataset publisher has entered the lat-long center of a country instead of leaving the field blank.
I will discuss the challenges surrounding locating these issues and the current solutions (such as the CoordinateCleaner R package). I will touch on how existing DWCA terms like coordinateUncertaintyInMeters and footprintWKT are being utilized to highlight low coordinate resolution. Finally, I will highlight some other emerging data quality issues and how GBIF is beginning to experiment with dataset-level flagging. Currently we have flagged around 500 datasets as gridded and around 400 datasets as citizen science, but there are many more potential dataset flags.
3

Diamant, Roee, Ilan Shachar, Yizhaq Makovsky, Bruno Miguel Ferreira, and Nuno Alexandre Cruz. "Cross-Sensor Quality Assurance for Marine Observatories." Remote Sensing 12, no. 21 (2020): 3470. http://dx.doi.org/10.3390/rs12213470.

Abstract:
Measuring and forecasting changes in coastal and deep-water ecosystems and climates requires sustained long-term measurements from marine observation systems. One of the key considerations in analyzing data from marine observatories is quality assurance (QA). The data acquired by these infrastructures accumulates into Giga and Terabytes per year, necessitating an accurate automatic identification of false samples. A particular challenge in the QA of oceanographic datasets is the avoidance of disqualification of data samples that, while appearing as outliers, actually represent real short-term phenomena, that are of importance. In this paper, we present a novel cross-sensor QA approach that validates the disqualification decision of a data sample from an examined dataset by comparing it to samples from related datasets. This group of related datasets is chosen so as to reflect upon the same oceanographic phenomena that enable some prediction of the examined dataset. In our approach, a disqualification is validated if the detected anomaly is present only in the examined dataset, but not in its related datasets. Results for a surface water temperature dataset recorded by our Texas A&M—Haifa Eastern Mediterranean Marine Observatory (THEMO)—over a period of 7 months, show an improved trade-off between accurate and false disqualification rates when compared to two standard benchmark schemes.
4

Gao, Chenxi. "Generative Adversarial Networks-based solution for improving medical data quality and insufficiency." Applied and Computational Engineering 49, no. 1 (2024): 167–75. http://dx.doi.org/10.54254/2755-2721/49/20241086.

Abstract:
As big data brings intelligent solutions and innovations to various fields, the goal of this research is to solve the problem of poor-quality and insufficient datasets in the medical field, and to help under-resourced areas access high-quality, rich medical datasets as well. This study addresses the problem by utilizing two variants of the generative adversarial network: the Super-Resolution Generative Adversarial Network (SRGAN) and the Deep Convolutional Generative Adversarial Network (DCGAN). In this study, OpenCV is employed to introduce blur to the Brain Tumor MRI Dataset, resulting in a blurred dataset. Subsequently, the research utilizes both the unaltered and blurred datasets to train the SRGAN model, which is then applied to enhance the low-quality dataset through inpainting. The original dataset, the low-quality dataset, and the improved dataset are then each used independently to train the DCGAN model. To compare the difference between the produced image datasets and the real dataset, the Fréchet Inception Distance (FID) score is computed separately for each. The study found that, by training DCGAN with the SRGAN-repaired medical dataset, the medical image dataset is visibly clearer and the FID score is reduced. Therefore, by using SRGAN and DCGAN, the current problem of low quality and small quantity of datasets in the medical field can be addressed, which increases the potential of big data in the artificial-intelligence field of medicine.
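The FID score used for comparison here measures the distance between Gaussian fits of two sets of image features: FID = ||μ1 − μ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^½). A numpy-only sketch (in practice the features come from an Inception network, which is omitted here):

```python
import numpy as np

def fid(feats_real, feats_gen):
    """Fréchet Inception Distance between two feature arrays (n_samples, dim)."""
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_gen, rowvar=False)
    # Tr((C1 C2)^(1/2)) equals the sum of square roots of the eigenvalues of
    # C1 @ C2, which are real and non-negative for covariance matrices.
    eig = np.linalg.eigvals(c1 @ c2)
    covmean_trace = np.sqrt(np.clip(eig.real, 0, None)).sum()
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(c1) + np.trace(c2) - 2 * covmean_trace)
```

Identical feature distributions give FID 0; a pure shift of the features by d in every dimension adds dim·d² to the score.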
5

Quarati, Alfonso, Monica De Martino, and Sergio Rosim. "Geospatial Open Data Usage and Metadata Quality." ISPRS International Journal of Geo-Information 10, no. 1 (2021): 30. http://dx.doi.org/10.3390/ijgi10010030.

Abstract:
Open Government Data (OGD) portals, thanks to the thousands of geo-referenced datasets they host, are of great interest for any analysis or process relating to the territory. For this to happen, users must be able to access these datasets and reuse them. An element often considered to hinder the full dissemination of OGD is the quality of its metadata. Starting from an experimental investigation conducted on over 160,000 geospatial datasets belonging to six national and international OGD portals, this work has as its first objective to provide an overview of the usage of these portals, measured in terms of dataset views and downloads. Furthermore, to assess the possible influence of metadata quality on the use of geospatial datasets, an assessment of the metadata for each dataset was carried out, and the correlation between these two variables was measured. The results showed a significant underutilization of geospatial datasets and a generally poor quality of their metadata. In addition, only a weak correlation was found between the use and the quality of the metadata, not strong enough to assert with certainty that the latter is a determining factor of the former.
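The usage-versus-metadata-quality correlation reported here can be illustrated with a rank correlation such as Spearman's ρ, which tolerates the heavy-tailed view/download counts typical of portals. A minimal sketch (no tie correction; the paper does not specify its exact correlation measure):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks.
    Ranks are argsort-based, so ties are broken arbitrarily -- fine for
    an illustration, not for tied real data."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])
```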
6

Scarpetta, Marco, Luisa De Palma, Attilio Di Nisio, Maurizio Spadavecchia, Paolo Affuso, and Nicola Giaquinto. "Optimizing Satellite Imagery Datasets for Enhanced Land/Water Segmentation." Sensors 25, no. 6 (2025): 1793. https://doi.org/10.3390/s25061793.

Abstract:
This paper presents an automated procedure for optimizing datasets used in land/water segmentation tasks with deep learning models. The proposed method employs the Normalized Difference Water Index (NDWI) with a variable threshold to automatically assess the quality of annotations associated with multispectral satellite images. By systematically identifying and excluding low-quality samples, the method enhances dataset quality and improves model performance. Experimental results on two different publicly available datasets—the SWED and SNOWED—demonstrate that deep learning models trained on optimized datasets outperform those trained on baseline datasets, achieving significant improvements in segmentation accuracy, with up to a 10% increase in mean intersection over union, despite a reduced dataset size. Therefore, the presented methodology is a promising scalable solution for improving the quality of datasets for environmental monitoring and other remote sensing applications.
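The NDWI screening step can be sketched as follows; the index itself is standard, while the threshold value and the disagreement-based exclusion rule below are illustrative simplifications of the paper's procedure:

```python
import numpy as np

def ndwi_water_mask(green, nir, threshold=0.0):
    """NDWI = (Green - NIR) / (Green + NIR); pixels above the threshold are
    treated as water. The threshold is the parameter the paper varies."""
    green = np.asarray(green, dtype=float)
    nir = np.asarray(nir, dtype=float)
    ndwi = (green - nir) / (green + nir + 1e-12)  # avoid division by zero
    return ndwi > threshold

def annotation_disagreement(mask_ndwi, mask_label):
    """Fraction of pixels where the NDWI mask and the annotation disagree;
    a sample could be excluded when this exceeds some tolerance."""
    return float(np.mean(np.asarray(mask_ndwi, bool) != np.asarray(mask_label, bool)))
```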
7

Pan, Hangyu, Yaoyi Xi, Ling Wang, Yu Nan, Zhizhong Su, and Rong Cao. "Dataset construction method of cross-lingual summarization based on filtering and text augmentation." PeerJ Computer Science 9 (March 28, 2023): e1299. http://dx.doi.org/10.7717/peerj-cs.1299.

Abstract:
Existing cross-lingual summarization (CLS) datasets suffer from inconsistent sample quality and small scale. To address these problems, we propose a method that jointly supervises quality and scale to build CLS datasets. In terms of quality supervision, the method adopts a multi-strategy filtering algorithm to remove low-quality samples of monolingual summarization (MS) from the perspectives of characters and semantics, thereby improving the quality of the MS dataset. In terms of scale supervision, the method adopts a text augmentation algorithm based on a pretrained model to increase the size of CLS datasets with quality assurance. This method was used to build an English-Chinese CLS dataset, which was then evaluated with a reasonable data quality evaluation framework. The evaluation results show that the dataset is of good quality and large size. These outcomes show that the proposed method may comprehensively improve quality and scale, resulting in a high-quality, large-scale CLS dataset at a lower cost.
8

Seol, Sujin, Jaewoo Yoon, Jungeun Lee, and Byeongwoo Kim. "Metrics for Evaluating Synthetic Time-Series Data of Battery." Applied Sciences 14, no. 14 (2024): 6088. http://dx.doi.org/10.3390/app14146088.

Abstract:
The advancements in artificial intelligence have encouraged the application of deep learning in various fields. However, the accuracy of deep learning algorithms is influenced by the quality of the dataset used. Therefore, a high-quality dataset is critical for deep learning. Data augmentation algorithms can generate large, high-quality datasets. The dataset quality is mainly assessed through qualitative and quantitative evaluations. However, conventional qualitative evaluation methods lack the objective and quantitative parameters necessary for battery synthetic datasets. Therefore, this study proposes the application of the rate of change in linear regression correlation coefficients, Dunn index, and silhouette coefficient as clustering indices for quantitatively evaluating the quality of synthetic time-series datasets of batteries. To verify the reliability of the proposed method, we first applied the TimeGAN algorithm to an open-source battery dataset, generated a synthetic battery dataset, and then compared its similarity to the original dataset using the proposed evaluation method. The silhouette coefficient was confirmed as the most reliable index. Furthermore, the similarity of datasets increased as the silhouette index decreased from 0.1053 to 0.0073 based on the number of learning iterations. The results demonstrate that the insufficient quality of datasets used for deep learning can be overcome and supplemented. Furthermore, data similarity can be efficiently evaluated regardless of the learning environment. In conclusion, we present a new synthetic time-series dataset evaluation method that is more reliable than the conventional representative evaluation method (the training loss rate).
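The silhouette coefficient the authors settle on can be computed directly from pairwise distances. A numpy-only sketch of the textbook definition (equivalent in spirit to scikit-learn's `silhouette_score`; assumes at least two clusters):

```python
import numpy as np

def silhouette_coefficient(X, labels):
    """Mean silhouette over all points: s_i = (b_i - a_i) / max(a_i, b_i),
    where a_i is the mean intra-cluster distance of point i and b_i the
    mean distance to the nearest other cluster."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i, lab in enumerate(labels):
        same = labels == lab
        same[i] = False                 # exclude the point itself
        if not same.any():              # singleton cluster: silhouette is 0
            scores.append(0.0)
            continue
        a = d[i, same].mean()
        b = min(d[i, labels == other].mean()
                for other in set(labels.tolist()) - {lab})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Well-separated clusters score near 1; for the synthetic-versus-original comparison in the paper, lower values indicate the datasets blend together, i.e., higher similarity.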
9

Levy, Matan, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. "Data Roaming and Quality Assessment for Composed Image Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 4 (2024): 2991–99. http://dx.doi.org/10.1609/aaai.v38i4.28081.

Abstract:
The task of Composed Image Retrieval (CoIR) involves queries that combine image and text modalities, allowing users to express their intent more effectively. However, current CoIR datasets are orders of magnitude smaller than other vision and language (V&L) datasets. Additionally, some of these datasets have noticeable issues, such as queries containing redundant modalities. To address these shortcomings, we introduce the Large Scale Composed Image Retrieval (LaSCo) dataset, a new CoIR dataset which is ten times larger than existing ones. Pre-training on our LaSCo shows a noteworthy improvement in performance, even in zero-shot settings. Furthermore, we propose a new approach for analyzing CoIR datasets and methods, which detects modality redundancy or necessity in queries. We also introduce a new CoIR baseline, the Cross-Attention driven Shift Encoder (CASE). This baseline allows for early fusion of modalities using a cross-attention module and employs an additional auxiliary task during training. Our experiments demonstrate that this new baseline outperforms the current state-of-the-art methods on established benchmarks like FashionIQ and CIRR.
10

Nam, Ki Hyun. "Application of Serial Crystallography for Merging Incomplete Macromolecular Crystallography Datasets." Crystals 14, no. 12 (2024): 1012. http://dx.doi.org/10.3390/cryst14121012.

Abstract:
In macromolecular crystallography (MX), a complete diffraction dataset is essential for determining the three-dimensional structure. However, collecting a complete experimental dataset using a single crystal is frequently unsuccessful due to poor crystal quality or radiation damage, resulting in the collection of multiple incomplete datasets. This issue can be solved by merging incomplete diffraction datasets to generate a complete dataset. This study introduced a new approach for merging incomplete datasets from MX to generate a complete dataset using serial crystallography (SX). Six incomplete diffraction datasets of β-glucosidase from Thermoanaerobacterium saccharolyticum (TsaBgl) were processed using CrystFEL, an SX program. The statistics of the merged data, such as completeness, CC, CC*, Rsplit, Rwork, and Rfree, demonstrated a complete dataset, indicating improved quality compared with the incomplete datasets and enabling structural determination. Also, the merging of the incomplete datasets was processed using four different indexing algorithms, and their statistics were compared. In conclusion, this approach for generating a complete dataset using SX will provide a new opportunity for determining the crystal structure of macromolecules using multiple incomplete MX datasets.
11

Lacagnina, Carlo, Francisco Doblas-Reyes, Gilles Larnicol, et al. "Quality Management Framework for Climate Datasets." Data Science Journal 21, no. 1 (2022): 10. http://dx.doi.org/10.5334/dsj-2022-010.

12

Gaillard, Fabienne, Emmanuelle Autret, Virginie Thierry, Philippe Galaup, Christine Coatanoan, and Thomas Loubrieu. "Quality Control of Large Argo Datasets." Journal of Atmospheric and Oceanic Technology 26, no. 2 (2009): 337–51. http://dx.doi.org/10.1175/2008jtecho552.1.

Abstract:
Argo floats have significantly improved the observation of the global ocean interior, but as the size of the database increases, so does the need for efficient tools to perform reliable quality control. It is shown here how the classical method of optimal analysis can be used to validate very large datasets before operational or scientific use. The analysis system employed is the one implemented at the Coriolis data center to produce the weekly fields of temperature and salinity, and the key data are the analysis residuals. The impacts of the various sensor errors are evaluated and twin experiments are performed to measure the system capacity in identifying these errors. It appears that for a typical data distribution, the analysis residuals extract 2/3 of the sensor error after a single analysis. The method has been applied on the full Argo Atlantic real-time dataset for the 2000–04 period (482 floats) and 15% of the floats were detected as having salinity drifts or offset. A second test was performed on the delayed mode dataset (120 floats) to check the overall consistency, and except for a few isolated anomalous profiles, the corrected datasets were found to be globally good. The last experiment performed on the Coriolis real-time products takes into account the recently discovered problem in the pressure labeling. For this experiment, a sample of 36 floats, mixing well-behaved and anomalous instruments of the 2003–06 period, was considered and the simple test designed to detect the most common systematic anomalies successfully identified the deficient floats.
13

Karia, Adrian Jackob, Juma Said Ally, and Stanley Leonard. "Enhancing Coffee Leaf Rust Detection Using DenseNet201: A Comprehensive Analysis of the Mbozi and Public Datasets in Songwe, Tanzania." African Journal of Empirical Research 6, no. 1 (2025): 171–88. https://doi.org/10.51867/ajernet.6.1.17.

Abstract:
Coffee Leaf Rust (CLR) is a devastating fungal disease that threatens coffee production worldwide, upsetting economies and farmers' livelihoods. Traditional methods of detecting CLR rely heavily on machine-learning (ML) models trained on poorly collected datasets and on physical inspection, which is tedious, time-consuming, and subject to human error. This study explores the performance of the DenseNet201 model using three datasets: Mbozi, Public, and Combined (a merger of the Mbozi and Public datasets). Machine Learning Theory guided this research. The study's objectives are to assess the influence of dataset quality on CLR detection, analyze the Mbozi and Public datasets using DenseNet201, and enhance robustness by merging the two datasets. A study on CLR severity was conducted using systematic sampling techniques. Leaves from multiple coffee farms were collected, representing different levels of infection. The Mbozi dataset, sourced from high-resolution images captured at Tanzania's Songwe coffee plantations, was analyzed for quality under controlled conditions, including environmental factors, image clarity, resolution, labeling consistency, and class balance, based on data completeness, image quality score, visual inspection, and model performance. DenseNet201 was trained and validated on each dataset, achieving its highest training accuracy with the Mbozi dataset at 98.72%, with a validation accuracy of 97.65%, demonstrating the importance of consistent image quality and accurate annotations. In contrast, the Public dataset suffered from inconsistencies in resolution and labeling, resulting in lower training and validation accuracies of 96.86% and 96.42%, respectively. The Combined dataset, which integrated the strengths of both, exhibited stronger generalization with an accuracy of 97.48% and a validation accuracy of 97.49%, balancing the need for high-quality images with environmental variability.
The study shows improved CLR detection speed and accuracy due to high-quality, consistently labeled images from the Mbozi dataset. It recommends that future models integrate regionally relevant, high-resolution datasets for robust performance in real-world agricultural conditions, providing coffee farmers with timely disease-intervention tools for better production management and economic stability in coffee-growing regions.
14

Pando, Francisco. "Quantifying quality: the "Apparent Quality Index", a measure of data quality for occurrence datasets." Proceedings of TDWG 1 (August 23, 2017): e20533. https://doi.org/10.3897/tdwgproceedings.1.20533.

Abstract:
When making an initial assessment of a dataset originating from an unfamiliar source, a user typically relies on the visible properties of the dataset as a whole, such as the title, the publisher, and the size of the dataset. Aspects of data quality are usually out of view, beyond some intuitions and hard-to-compare assertions. In 2007 at GBIF Spain we tried to correct that by developing an index that enables a user to assess the quality of Darwin Core datasets published by GBIF-Spain, and to track improvements in quality over time. Our goal was to create an index that is explicit, easy to understand, and easy to obtain. We dubbed that index "ICA" (GBIF Spain 2010), for its name in Spanish "Índice de Calidad Aparente" (Apparent Quality Index). We say ICA measures "apparent quality" because, although unlikely, a dataset can have a high ICA while its records are actually a poor reflection of the reality to which they refer. ICA summarizes data quality on the three primary dimensions of biodiversity data: taxonomic, geospatial and temporal. In this contribution we present the rationale behind ICA, how it is calculated, how it works within the Darwin Test tool (Ortega-Maqueda and Pando 2008), how it is integrated in the data publication processes of GBIF Spain, and some discussion and results about its utility and potential. We also compare ICA to the emerging framework for data quality assessment (TDWG Data Quality Interest Group 2016).
15

Huang, Xiaoyuan, Silvia Mirri, and Su-Kit Tang. "Macao-ebird: A Curated Dataset for Artificial-Intelligence-Powered Bird Surveillance and Conservation in Macao." Data 10, no. 6 (2025): 84. https://doi.org/10.3390/data10060084.

Abstract:
Artificial intelligence (AI) currently exhibits considerable potential within the realm of biodiversity conservation. However, high-quality regionally customized datasets remain scarce, particularly within urban environments. The existing large-scale bird image datasets often lack a dedicated focus on endangered species endemic to specific geographic regions, as well as a nuanced consideration of the complex interplay between urban and natural environmental contexts. Therefore, this paper introduces Macao-ebird, a novel dataset designed to advance AI-driven recognition and conservation of endangered bird species in Macao. The dataset comprises two subsets: (1) Macao-ebird-cls, a classification dataset with 7341 images covering 24 bird species, emphasizing endangered and vulnerable species native to Macao; and (2) Macao-ebird-det, an object detection dataset generated through AI-agent-assisted labeling using grounding DETR with improved denoising anchor boxes (DINO), significantly reducing manual annotation effort while maintaining high-quality bounding-box annotations. We validate the dataset’s utility through baseline experiments with the You Only Look Once (YOLO) v8–v12 series, achieving a mean average precision (mAP50) of up to 0.984. Macao-ebird addresses critical gaps in the existing datasets by focusing on region-specific endangered species and complex urban–natural environments, providing a benchmark for AI applications in avian conservation.
16

Salhab, M., and A. Basiri. "SPATIAL DATA QUALITY EVALUATION FOR LAND COVER CLASSIFICATION APPROACHES." ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences V-3-2020 (August 3, 2020): 681–87. http://dx.doi.org/10.5194/isprs-annals-v-3-2020-681-2020.

Abstract:
Data gaps and poor data quality may lead to flawed conclusions and data-driven policies and decisions, such as in the measurement of Sustainable Development Goals progress. This is particularly important for land cover data, an essential source of data for a wide range of applications and real-world challenges including climate change mitigation, food security planning, resource allocation and mobilization. While global land cover datasets are available, their usability is limited by their coarse spatial and temporal resolutions. Furthermore, a good understanding of their fitness for purpose is imperative. This paper compares two datasets from a spatial data quality perspective: (1) a global land cover map, and (2) a fit-for-purpose training dataset generated using visual inspection of very high-resolution satellite data. The latter dataset is created using Google Earth Engine (GEE), a cloud-based computing platform and data repository. We systematically evaluate the two datasets from a spatial data quality (SDQ) perspective, using the Analytic Hierarchy Process (AHP) to prioritise the SDQ criteria. To validate the results, land cover classifications are conducted using both datasets, also within GEE. Based on the results of the SDQ evaluation and land cover classification, we find that the second training dataset significantly outperformed the global land cover map. Our study also shows that cloud-based computing platforms and publicly available data repositories can provide an effective approach to filling land cover data gaps in data-scarce regions.
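The AHP step used here to prioritise the quality criteria derives weights from a pairwise-comparison matrix via its principal eigenvector. A minimal sketch (the criteria and comparison values would be the study's SDQ elements, which are not reproduced here):

```python
import numpy as np

def ahp_weights(pairwise):
    """AHP priority weights as the normalized principal eigenvector of a
    pairwise-comparison matrix (entry [i, j] = importance of criterion i
    relative to criterion j)."""
    pairwise = np.asarray(pairwise, float)
    eigvals, eigvecs = np.linalg.eig(pairwise)
    principal = eigvecs[:, np.argmax(eigvals.real)].real
    weights = np.abs(principal)        # eigenvector sign is arbitrary
    return weights / weights.sum()
```

For a perfectly consistent matrix (every entry equals the ratio of the underlying weights), the eigenvector recovers those weights exactly.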
17

Wilde, Henry, Vincent Knight, and Jonathan Gillard. "Evolutionary dataset optimisation: learning algorithm quality through evolution." Applied Intelligence 50, no. 4 (2019): 1172–91. http://dx.doi.org/10.1007/s10489-019-01592-4.

Abstract:
In this paper we propose a novel method for learning how algorithms perform. Classically, algorithms are compared on a finite number of existing (or newly simulated) benchmark datasets based on some fixed metric. The algorithm(s) with the smallest value of this metric are chosen to be the 'best performing'. We offer a new approach to flip this paradigm. We instead aim to gain a richer picture of the performance of an algorithm by generating artificial data through genetic evolution, the purpose of which is to create populations of datasets for which a particular algorithm performs well on a given metric. These datasets can be studied so as to learn what attributes lead to a particular performance of a given algorithm. Following a detailed description of the algorithm as well as a brief description of an open-source implementation, a case study in clustering is presented. This case study demonstrates the performance and nuances of the method, which we call Evolutionary Dataset Optimisation. In this study, a number of known properties of preferable datasets for the clustering algorithms k-means and DBSCAN are realised in the generated datasets.
18

Li, Zongjie, Daoyuan Wu, Shuai Wang, and Zhendong Su. "API-Guided Dataset Synthesis to Finetune Large Code Models." Proceedings of the ACM on Programming Languages 9, OOPSLA1 (2025): 786–815. https://doi.org/10.1145/3720449.

Abstract:
Large code models (LCMs), pre-trained on vast code corpora, have demonstrated remarkable performance across a wide array of code-related tasks. Supervised fine-tuning (SFT) plays a vital role in aligning these models with specific requirements and enhancing their performance in particular domains. However, synthesizing high-quality SFT datasets poses a significant challenge due to the uneven quality of datasets and the scarcity of domain-specific datasets. Inspired by APIs as high-level abstractions of code that encapsulate rich semantic information in a concise structure, we propose DataScope, an API-guided dataset synthesis framework designed to enhance the SFT process for LCMs in both general and domain-specific scenarios. DataScope comprises two main components: Dslt and Dgen. On the one hand, Dslt employs API coverage as a core metric, enabling efficient dataset synthesis in general scenarios by selecting subsets of existing (uneven-quality) datasets with higher API coverage. On the other hand, Dgen recasts domain dataset synthesis as a process of using API-specified high-level functionality and deliberately constituted code skeletons to synthesize concrete code. Extensive experiments demonstrate DataScope’s effectiveness, with models fine-tuned on its synthesized datasets outperforming those tuned on unoptimized datasets five times larger. Furthermore, a series of analyses on model internals, relevant hyperparameters, and case studies provide additional evidence for the efficacy of our proposed methods. These findings underscore the significance of dataset quality in SFT and advance the field of LCMs by providing an efficient, cost-effective framework for constructing high-quality datasets, which in turn lead to more powerful and tailored LCMs for both general and domain-specific scenarios.
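The API-coverage idea behind Dslt can be illustrated with a greedy subset selection over the APIs each sample calls; this is a hypothetical reconstruction for intuition, not DataScope's actual implementation:

```python
def select_by_api_coverage(samples, budget):
    """Greedily pick up to `budget` samples maximizing cumulative API coverage.

    `samples` maps a sample id to the set of APIs its code calls.
    Classic greedy set cover: each round, take the sample that adds the
    most APIs not yet covered.
    """
    covered, chosen = set(), []
    for _ in range(min(budget, len(samples))):
        best = max((s for s in samples if s not in chosen),
                   key=lambda s: len(samples[s] - covered))
        if not samples[best] - covered:
            break                      # no remaining sample adds new APIs
        chosen.append(best)
        covered |= samples[best]
    return chosen, covered
```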
APA, Harvard, Vancouver, ISO, and other styles
19

Chaouche, Sabrina, Yoann Randon, Faouzi Adjed, Nadira Boudjani, and Mohamed Ibn Khedher. "DQM: Data Quality Metrics for AI Components in the Industry." Proceedings of the AAAI Symposium Series 4, no. 1 (2024): 24–31. http://dx.doi.org/10.1609/aaaiss.v4i1.31767.

Full text
Abstract:
In industrial settings, measuring the quality of the data used to represent an intended domain of use and its operating conditions is crucial and challenging. This paper therefore presents a set of metrics addressing this data quality issue in the form of a library, named DQM (Data Quality Metrics), for Machine Learning (ML) use. Additional metrics specific to industrial applications are developed in the proposed library. This work also aims to assess various data and dataset types. These metrics are used to characterize the training and evaluation datasets involved in the process of building ML models for industrial use cases. Two categories of metrics are implemented in DQM: inherent data metrics, which evaluate the quality of a given dataset independently of the ML model, such as statistical properties and attributes; and model-dependent metrics, which measure the quality of the dataset by considering the ML model outputs, such as the gap between two datasets with respect to a given ML model. DQM is used in the scope of the Confiance.ai program to evaluate datasets used for industrial purposes such as autonomous driving.
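A minimal sketch of what an inherent (model-independent) data metric might look like; the metric set and the naming below are illustrative assumptions, far simpler than DQM's actual library:

```python
import statistics

def inherent_metrics(rows):
    """Model-independent quality metrics for a tabular dataset:
    per-column missing-value rate and basic statistics.
    (Illustrative only; a real metric library is much richer.)"""
    cols = {k for row in rows for k in row}
    report = {}
    for c in sorted(cols):
        values = [row.get(c) for row in rows]
        present = [v for v in values if v is not None]
        report[c] = {
            "missing_rate": 1 - len(present) / len(values),
            "mean": statistics.mean(present) if present else None,
            "stdev": statistics.stdev(present) if len(present) > 1 else None,
        }
    return report

# hypothetical sensor readings with gaps
data = [{"speed": 10.0, "temp": 21.0}, {"speed": 12.0, "temp": None}, {"speed": 14.0}]
report = inherent_metrics(data)
```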
APA, Harvard, Vancouver, ISO, and other styles
20

Krommyda, Maria, and Verena Kantere. "Semantic Analysis for Conversational Datasets: Improving Their Quality Using Semantic Relationships." International Journal of Semantic Computing 14, no. 03 (2020): 395–422. http://dx.doi.org/10.1142/s1793351x2050004x.

Full text
Abstract:
As more and more datasets become available, their utilization in different applications increases in popularity. Their volume and production rate, however, mean that their quality and content control is in most cases non-existent, resulting in many datasets that contain inaccurate information of low quality. Especially in the field of conversational assistants, where the datasets come from many heterogeneous sources with no quality assurance, the problem is aggravated. We present here an integrated platform that creates task- and topic-specific conversational datasets to be used for training conversational agents. The platform explores available conversational datasets, extracts information based on semantic similarity and relatedness, and applies a weight-based score function to rank the information based on its value for the specific task and topic. The finalized dataset can then be used for the training of an automated conversational assistant over accurate data of high quality.
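The weight-based score function described above might be sketched as follows, assuming similarity and relatedness scores have already been computed upstream; the weights and field names are hypothetical, not the paper's actual formulation:

```python
def rank_candidates(candidates, w_sim=0.7, w_rel=0.3):
    """Rank candidate dialogue snippets by a weighted score over
    (precomputed) semantic similarity and relatedness to the target
    task and topic. Weights and score form are illustrative assumptions."""
    scored = [
        (c["text"], w_sim * c["similarity"] + w_rel * c["relatedness"])
        for c in candidates
    ]
    # highest-value information first
    return sorted(scored, key=lambda t: t[1], reverse=True)

candidates = [
    {"text": "How do I reset my router?", "similarity": 0.9, "relatedness": 0.6},
    {"text": "What's the weather today?", "similarity": 0.2, "relatedness": 0.1},
]
ranked = rank_candidates(candidates)
```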
APA, Harvard, Vancouver, ISO, and other styles
21

Houskeeper, Henry F., and Raphael M. Kudela. "Ocean Color Quality Control Masks Contain the High Phytoplankton Fraction of Coastal Ocean Observations." Remote Sensing 11, no. 18 (2019): 2167. http://dx.doi.org/10.3390/rs11182167.

Full text
Abstract:
Satellite estimation of oceanic chlorophyll-a content has enabled characterization of global phytoplankton stocks, but the quality of retrieval for many ocean color products (including chlorophyll-a) degrades with increasing phytoplankton biomass in eutrophic waters. Quality control of ocean color products is achieved primarily through the application of masks based on standard thresholds designed to identify suspect or low-quality retrievals. This study compares the masked and unmasked fractions of ocean color datasets from two Eastern Boundary Current upwelling ecosystems (the California and Benguela Current Systems) using satellite proxies for phytoplankton biomass that are applicable to satellite imagery without correction for atmospheric aerosols. Evaluation of the differences between the masked and unmasked fractions indicates that high biomass observations are preferentially masked in National Aeronautics and Space Administration (NASA) ocean color datasets as a result of decreased retrieval quality for waters with high concentrations of phytoplankton. This study tests whether dataset modification persists into the default composite data tier commonly disseminated to science end users. Further, this study suggests that statistics describing a dataset’s masked fraction can be helpful in assessing the quality of a composite dataset and in determining the extent to which retrieval quality is linked to biological processes in a given study region.
APA, Harvard, Vancouver, ISO, and other styles
22

Ferenc, Rudolf, Zoltán Tóth, Gergely Ladányi, István Siket, and Tibor Gyimóthy. "A public unified bug dataset for java and its assessment regarding metrics and bug prediction." Software Quality Journal 28, no. 4 (2020): 1447–506. http://dx.doi.org/10.1007/s11219-020-09515-0.

Full text
Abstract:
Bug datasets have been created and used by many researchers to build and validate novel bug prediction models. In this work, our aim is to collect existing public source code metric-based bug datasets and unify their contents. Furthermore, we wish to assess the plethora of collected metrics and the capabilities of the unified bug dataset in bug prediction. We considered 5 public datasets and we downloaded the corresponding source code for each system in the datasets and performed source code analysis to obtain a common set of source code metrics. This way, we produced a unified bug dataset at class and file level as well. We investigated the divergence of metric definitions and values across the different bug datasets. Finally, we used a decision tree algorithm to show the capabilities of the dataset in bug prediction. We found that there are statistically significant differences in the values of the original and the newly calculated metrics; furthermore, notations and definitions can severely differ. We compared the bug prediction capabilities of the original and the extended metric suites (within-project learning). Afterwards, we merged all classes (and files) into one large dataset which consists of 47,618 elements (43,744 for files) and we evaluated the bug prediction model built on this large dataset as well. Finally, we also investigated cross-project capabilities of the bug prediction models and datasets. We made the unified dataset publicly available for everyone. By using a public unified dataset as an input for different bug prediction related investigations, researchers can make their studies reproducible, thus able to be validated and verified.
APA, Harvard, Vancouver, ISO, and other styles
23

Akgül, İsmail, Volkan Kaya, and Özge Zencir Tanır. "A novel hybrid system for automatic detection of fish quality from eye and gill color characteristics using transfer learning technique." PLOS ONE 18, no. 4 (2023): e0284804. http://dx.doi.org/10.1371/journal.pone.0284804.

Full text
Abstract:
Fish remains one of the most popular sources of the body's most essential nutrients, as it contains protein and polyunsaturated fatty acids. It is extremely important to choose fish according to the season and the freshness of the fish to be purchased. It is very difficult to distinguish non-fresh fish from fresh fish mixed together on fish stalls. In addition to the traditional methods used to determine meat freshness, significant success has been achieved in studies on fresh fish detection with artificial intelligence techniques. In this study, two different types of fish (anchovy and horse mackerel) were used to determine fish freshness with convolutional neural networks, one of the artificial intelligence techniques. Images of fresh fish and of non-fresh fish were taken, and two new datasets (Dataset1: Anchovy, Dataset2: Horse mackerel) were created. A novel hybrid model structure has been proposed to determine fish freshness using the fish eye and gill regions on these two datasets. In the proposed model, the Yolo-v5, Inception-ResNet-v2, and Xception model structures are used through transfer learning. Fish freshness was successfully detected by both the Yolo-v5 + Inception-ResNet-v2 (Dataset1: 97.67%, Dataset2: 96.0%) and the Yolo-v5 + Xception (Dataset1: 88.00%, Dataset2: 94.67%) hybrid models created using these model structures. The proposed model will make an important contribution to future studies of fish freshness across different storage days and of the estimation of fish size.
APA, Harvard, Vancouver, ISO, and other styles
24

Kim, Seul-ki, and Yong-ju Jeon. "Development of a Python Library to Generate Synthetic Datasets for Artificial Intelligence Education." International Journal on Advanced Science, Engineering and Information Technology 14, no. 3 (2024): 936–45. http://dx.doi.org/10.18517/ijaseit.14.3.18158.

Full text
Abstract:
This study aims to improve the quality of AI education for the AI era by developing an educational dataset library and exploring its applicability. Reflecting the needs of teachers engaged in AI educational activities, the dataset library emphasizes the diversity of topics, forms, and sizes of datasets provided. Additionally, it is designed with a feature to generate outliers and missing values suitable for students' accessibility and educational purposes. The library developed in this research is based on Python and employs the random forest modeling method to generate high-quality synthetic datasets. The functionality and suitability of this library for AI education have been evaluated by an expert panel, which has endorsed its applicability in the field. In detailed assessments of the synthetic datasets generated, the library demonstrated its capability to accurately mirror the statistical characteristics of original datasets, achieving high levels of accuracy and cosine similarity in the modeling results. These outcomes confirm the library's efficacy in reconstructing educational datasets specifically for AI education purposes and crafting high-quality synthetic datasets. This approach offers a practical solution to the existing shortage of educational datasets and substantially enhances the overall quality of education. This research proves immensely beneficial for educators and learners, laying a foundation for ongoing and future research focused on creating and utilizing educational datasets in AI. This paves the way for expanding the possibilities and scope of their application in the educational field, potentially transforming AI education practices.
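The library's option to generate outliers and missing values for educational datasets could look roughly like this; the function name, parameters, and defaults are illustrative assumptions, not the library's API:

```python
import random

def degrade(values, missing_frac=0.1, outlier_frac=0.05, outlier_scale=10.0, seed=0):
    """Inject missing values (None) and scaled outliers into a numeric
    column, mimicking the option to generate outliers and missing values
    for teaching data cleaning. Parameter names are hypothetical."""
    rng = random.Random(seed)
    out = list(values)
    n = len(out)
    # blank out a fraction of entries
    for i in rng.sample(range(n), int(n * missing_frac)):
        out[i] = None
    # scale a fraction of the remaining entries into outliers
    candidates = [i for i, v in enumerate(out) if v is not None]
    for i in rng.sample(candidates, int(n * outlier_frac)):
        out[i] = out[i] * outlier_scale
    return out

clean = [float(i) for i in range(100)]
dirty = degrade(clean)
```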
APA, Harvard, Vancouver, ISO, and other styles
25

Orlando, Nathan, Igor Gyacskov, Derek J. Gillies, et al. "Effect of dataset size, image quality, and image type on deep learning-based automatic prostate segmentation in 3D ultrasound." Physics in Medicine & Biology 67, no. 7 (2022): 074002. http://dx.doi.org/10.1088/1361-6560/ac5a93.

Full text
Abstract:
Three-dimensional (3D) transrectal ultrasound (TRUS) is utilized in prostate cancer diagnosis and treatment, necessitating time-consuming manual prostate segmentation. We have previously developed an automatic 3D prostate segmentation algorithm involving deep learning prediction on radially sampled 2D images followed by 3D reconstruction, trained on a large, clinically diverse dataset with variable image quality. As large clinical datasets are rare, widespread adoption of automatic segmentation could be facilitated with efficient 2D-based approaches and the development of an image quality grading method. The complete training dataset of 6761 2D images, resliced from 206 3D TRUS volumes acquired using end-fire and side-fire acquisition methods, was split to train two separate networks using either end-fire or side-fire images. Split datasets were reduced to 1000, 500, 250, and 100 2D images. For deep learning prediction, modified U-Net and U-Net++ architectures were implemented and compared using an unseen test dataset of 40 3D TRUS volumes. A 3D TRUS image quality grading scale with three factors (acquisition quality, artifact severity, and boundary visibility) was developed to assess the impact on segmentation performance. For the complete training dataset, U-Net and U-Net++ networks demonstrated equivalent performance, but when trained using split end-fire/side-fire datasets, U-Net++ significantly outperformed the U-Net. Compared to the complete training datasets, U-Net++ trained using reduced-size end-fire and side-fire datasets demonstrated equivalent performance down to 500 training images. For this dataset, image quality had no impact on segmentation performance for end-fire images but did have a significant effect for side-fire images, with boundary visibility having the largest impact. Our algorithm provided fast (<1.5 s) and accurate 3D segmentations across clinically diverse images, demonstrating generalizability and efficiency when employed on smaller datasets, supporting the potential for widespread use, even when data is scarce. The development of an image quality grading scale provides a quantitative tool for assessing segmentation performance.
APA, Harvard, Vancouver, ISO, and other styles
26

Chang, Allen, Matthew C. Fontaine, Serena Booth, Maja J. Matarić, and Stefanos Nikolaidis. "Quality-Diversity Generative Sampling for Learning with Synthetic Data." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 18 (2024): 19805–12. http://dx.doi.org/10.1609/aaai.v38i18.29955.

Full text
Abstract:
Generative models can serve as surrogates for some real data sources by creating synthetic training datasets, but in doing so they may transfer biases to downstream tasks. We focus on protecting quality and diversity when generating synthetic training datasets. We propose quality-diversity generative sampling (QDGS), a framework for sampling data uniformly across a user-defined measure space, despite the data coming from a biased generator. QDGS is a model-agnostic framework that uses prompt guidance to optimize a quality objective across measures of diversity for synthetically generated data, without fine-tuning the generative model. Using balanced synthetic datasets generated by QDGS, we first debias classifiers trained on color-biased shape datasets as a proof-of-concept. By applying QDGS to facial data synthesis, we prompt for desired semantic concepts, such as skin tone and age, to create an intersectional dataset with a combined blend of visual features. Leveraging this balanced data for training classifiers improves fairness while maintaining accuracy on facial recognition benchmarks. Code available at: https://github.com/Cylumn/qd-generative-sampling.
APA, Harvard, Vancouver, ISO, and other styles
27

Koch, Martin, and Michael Wiese. "Quality Visualization of Microarray Datasets Using Circos." Microarrays 1, no. 2 (2012): 84–94. http://dx.doi.org/10.3390/microarrays1020084.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Ciesielski, Paul E., Patrick T. Haertel, Richard H. Johnson, Junhong Wang, and Scot M. Loehrer. "Developing High-Quality Field Program Sounding Datasets." Bulletin of the American Meteorological Society 93, no. 3 (2012): 325–36. http://dx.doi.org/10.1175/bams-d-11-00091.1.

Full text
Abstract:
Enormous resources of time, effort, and finances are expended in collecting field program rawinsonde (sonde) datasets. Correcting the data and performing quality control (QC) in a timely fashion after the field phase of an experiment are important for facilitating scientific research while interest is still high and funding is available. However, a variety of issues (different sonde types, ground station software, data formats, quality control issues, sonde errors, etc.) often makes working with these datasets difficult and time consuming. Our experience working with sounding data for several field programs has led to the design of a general procedure for creating user-friendly, bias-reduced, QCed sonde datasets. This paper describes the steps in this procedure, gives examples for the various processing stages, and provides access to software tools to aid in this process.
APA, Harvard, Vancouver, ISO, and other styles
29

Schmieder, R., and R. Edwards. "Quality control and preprocessing of metagenomic datasets." Bioinformatics 27, no. 6 (2011): 863–64. http://dx.doi.org/10.1093/bioinformatics/btr026.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Quan, T. Phuong. "daiquiri: Data Quality Reporting for Temporal Datasets." Journal of Open Source Software 7, no. 80 (2022): 5034. http://dx.doi.org/10.21105/joss.05034.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Albakri, Maythm, and Duaa Salman Hussien. "Evaluating the Quality of Authoritative Geospatial Datasets." Journal of Engineering 23, no. 11 (2017): 113–29. http://dx.doi.org/10.31026/j.eng.2017.11.09.

Full text
Abstract:
General Directorate of Surveying is considered one of the most important sources of maps in Iraq. It has produced digital maps for the whole of Iraq over the last six years. These maps are produced from different data sources with unknown accuracy; therefore, their quality needs to be assessed. The main aim of this study is to evaluate the positional accuracy of the digital maps produced by the General Directorate of Surveying. Two different study areas were selected: AL-Rusafa and AL-Karkh in Baghdad / Iraq, with areas of 172.826 and 135.106 square kilometers, respectively. Different statistical analyses were conducted to calculate the elements of positional accuracy assessment (mean µ, root mean square error RMSE, and minimum and maximum errors). According to the obtained results, the maps of the General Directorate of Surveying can be used in reconnaissance or in works that require low or specified positional accuracy (e.g., ±5 m), but they cannot be used for applications that need high accuracy (e.g., precise surveying).
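The positional accuracy elements reported in this study (mean, RMSE, and minimum and maximum errors) can be computed as in the sketch below; the checkpoint coordinates are hypothetical:

```python
import math

def positional_accuracy(reference, tested):
    """Compute positional-accuracy statistics over checkpoint pairs (x, y):
    mean error, RMSE, and min/max error between reference (higher-accuracy)
    and tested map coordinates."""
    errors = [
        math.hypot(xr - xt, yr - yt)
        for (xr, yr), (xt, yt) in zip(reference, tested)
    ]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {
        "mean": sum(errors) / len(errors),
        "rmse": rmse,
        "min": min(errors),
        "max": max(errors),
    }

# hypothetical checkpoints (metres in a projected coordinate system)
ref = [(0.0, 0.0), (10.0, 10.0)]
tst = [(3.0, 4.0), (10.0, 10.0)]
stats = positional_accuracy(ref, tst)
```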
APA, Harvard, Vancouver, ISO, and other styles
32

de Ávila Mendes, Renê, and Leandro Augusto da Silva. "Modeling the combined influence of complexity and quality in supervised learning." Intelligent Data Analysis 26, no. 5 (2022): 1247–74. http://dx.doi.org/10.3233/ida-215962.

Full text
Abstract:
Data classification is a data mining task in which an algorithm, adjusted using a training dataset, is used to predict the class of an unclassified object under analysis. A significant part of the performance of the classification algorithm depends on the dataset's complexity and quality. Data Complexity involves the investigation of the effects of dimensionality, the overlap of descriptive attributes, and the separability of the classes. Data Quality focuses on aspects such as noisy data (outliers) and missing values. The factors of Data Complexity and Data Quality are fundamental to classification performance. However, the literature has very few studies on the relationship between these factors or highlighting their significance. This paper applies Structural Equation Modeling and the Partial Least Squares Structural Equation Modeling (PLS-SEM) algorithm and, in an innovative manner, associates the contributions of Data Complexity and Data Quality to Classification Quality. Experimental analysis with 178 datasets obtained from the OpenML repository showed that controlling complexity improves the classification results more than data quality does. Additionally, the paper presents a visual tool for analyzing datasets from the classification performance perspective in the dimensions proposed to represent the structural model.
APA, Harvard, Vancouver, ISO, and other styles
33

Xavier, Emerson M. A., Francisco J. Ariza-López, and Manuel A. Ureña-Cámara. "Evaluación automática de la calidad de datos geoespaciales mediante servicios web." Revista Cartográfica, no. 98 (June 27, 2019): 59–73. http://dx.doi.org/10.35424/rcarto.i98.141.

Full text
Abstract:
The geomatics sector is going through a data overload scenario in which new geospatial datasets are generated almost daily. However, there is little or no information about the quality of these datasets, and they should be evaluated in order to provide users with some information about their quality. In this context, we propose a solution for the automatic quality evaluation of geospatial datasets using a web services platform. The approach comprises automatic evaluation procedures for quality control of topological consistency, completeness, and positional accuracy described in the Brazilian quality standard. Some procedures require an external dataset for comparison purposes; hence, we provide a set of synthetic datasets and apply an experimental design over them, aiming to select suitable methods for finding the correspondences between datasets. The solution has an interoperability tier that links users and automatic procedures using the standardized interface of the Web Processing Service (WPS). Our results showed that the automatic procedure performs very similarly to the manual one.
APA, Harvard, Vancouver, ISO, and other styles
34

Shi, Haoxiang, Jun Ai, Jingyu Liu, and Jiaxi Xu. "Improving Software Defect Prediction in Noisy Imbalanced Datasets." Applied Sciences 13, no. 18 (2023): 10466. http://dx.doi.org/10.3390/app131810466.

Full text
Abstract:
Software defect prediction is a popular method for optimizing software testing and improving software quality and reliability. However, software defect datasets usually have quality problems, such as class imbalance and data noise. Oversampling by generating minority-class samples is one of the best-known methods of improving the quality of datasets; however, it often introduces overfitting noise into datasets. To better improve the quality of these datasets, this paper proposes a method called US-PONR, which uses undersampling to remove duplicate samples from version iterations and then uses oversampling through propensity score matching to reduce class imbalance and noise samples in datasets. The effectiveness of this method was validated in a software prediction experiment that involved 24 versions of software data in 11 projects from PROMISE, in noisy environments that varied from a 0% to 30% noise level. The experiments showed a significant improvement in the quality of datasets pre-processed by US-PONR in noisy imbalanced datasets, especially the noisiest ones, compared with 12 other advanced dataset processing methods. The experiments also demonstrated that the US-PONR method can effectively identify label noise samples and remove them.
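The undersampling half of US-PONR, removing duplicate samples carried across version iterations, can be sketched as below (the feature vectors are hypothetical; the oversampling half, propensity score matching, is not shown):

```python
def undersample_duplicates(samples):
    """Remove samples whose feature vectors repeat across version
    iterations, keeping the first occurrence. This sketches only the
    undersampling step of US-PONR, not the full method."""
    seen, kept = set(), []
    for features, label in samples:
        key = tuple(features)
        if key not in seen:
            seen.add(key)
            kept.append((features, label))
    return kept

versions = [
    ([3, 120, 0.4], 0),
    ([3, 120, 0.4], 0),   # unchanged module carried over to a new version
    ([5, 200, 0.9], 1),
]
deduped = undersample_duplicates(versions)
```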
APA, Harvard, Vancouver, ISO, and other styles
35

Van Hulse, Jason, Taghi M. Khoshgoftaar, and Amri Napolitano. "Evaluating the Impact of Data Quality on Sampling." Journal of Information & Knowledge Management 10, no. 03 (2011): 225–45. http://dx.doi.org/10.1142/s021964921100295x.

Full text
Abstract:
Learning from imbalanced training data can be a difficult endeavour, and the task is made even more challenging if the data is of low quality or the size of the training dataset is small. Data sampling is a commonly used method for improving learner performance when data is imbalanced. However, little effort has been put forth to investigate the performance of data sampling techniques when data is both noisy and imbalanced. In this work, we present a comprehensive empirical investigation of the impact of changes in four training dataset characteristics — dataset size, class distribution, noise level and noise distribution — on data sampling techniques. We present the performance of four common data sampling techniques using 11 learning algorithms. The results, which are based on an extensive suite of experiments for which over 15 million models were trained and evaluated, show that: (1) even for relatively clean datasets, class imbalance can still hurt learner performance, (2) data sampling, however, may not improve performance for relatively clean but imbalanced datasets, (3) data sampling can be very effective at dealing with the combined problems of noise and imbalance, (4) both the level and distribution of class noise among the classes are important, as either factor alone does not cause a significant impact, (5) when sampling does improve the learners (i.e. for noisy and imbalanced datasets), RUS and SMOTE are the most effective at improving the AUC, while SMOTE performed well relative to the F-measure, (6) there are significant differences in the empirical results depending on the performance measure used, and hence it is important to consider multiple metrics in this type of analysis, and (7) data sampling rarely hurt the AUC, but only significantly improved performance when data was at least moderately skewed or noisy, while for the F-measure, data sampling often resulted in significantly worse performance when applied to slightly skewed or noisy datasets, but did improve performance when data was either severely noisy or skewed, or contained moderate levels of both noise and imbalance.
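The two sampling techniques singled out in the findings can be sketched in simplified form; note that `smote_like` below interpolates between random minority pairs rather than toward one of the k nearest neighbours, as real SMOTE does:

```python
import random

def random_undersample(majority, minority, seed=0):
    """RUS: randomly discard majority-class samples down to minority size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

def smote_like(minority, n_new, seed=0):
    """SMOTE-style synthesis: create new minority samples by linear
    interpolation between two randomly chosen minority samples
    (a simplification of SMOTE's k-nearest-neighbour interpolation)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

majority = [[i, i + 1.0] for i in range(20)]
minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
maj_s, min_s = random_undersample(majority, minority)
new_points = smote_like(minority, 4)
```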
APA, Harvard, Vancouver, ISO, and other styles
36

Idrissou, Al, Frank van Harmelen, and Peter van den Besselaar. "Network metrics for assessing the quality of entity resolution between multiple datasets." Semantic Web 12, no. 1 (2020): 21–40. http://dx.doi.org/10.3233/sw-200410.

Full text
Abstract:
Matching entities between datasets is a crucial step for combining multiple datasets on the semantic web. A rich literature exists on different approaches to this entity resolution problem. However, much less work has been done on how to assess the quality of such entity links once they have been generated. Evaluation methods for link quality are typically limited to either comparison with a ground truth dataset (which is often not available), manual work (which is cumbersome and prone to error), or crowd sourcing (which is not always feasible, especially if expert knowledge is required). Furthermore, the problem of link evaluation is greatly exacerbated for links between more than two datasets, because the number of possible links grows rapidly with the number of datasets. In this paper, we propose a method to estimate the quality of entity links between multiple datasets. We exploit the fact that the links between entities from multiple datasets form a network, and we show how simple metrics on this network can reliably predict their quality. We verify our results in a large experimental study using six datasets from the domain of science, technology and innovation studies, for which we created a gold standard. This gold standard, available online, is an additional contribution of this paper. In addition, we evaluate our metric on a recently published gold standard to confirm our findings.
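One simple network metric in the spirit of this paper: identity links among three or more datasets should form closed triangles, so the fraction of closed triangles in the link network can hint at link quality. This toy metric and the example identifiers are illustrative, not the authors' actual metrics:

```python
from itertools import combinations

def triangle_closure(links):
    """Fraction of neighbour pairs that are themselves linked: for a
    correct entity cluster spanning three datasets, every pair of its
    members should be connected, so low closure suggests weak links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    closed = total = 0
    for node, neigh in adj.items():
        for u, v in combinations(sorted(neigh), 2):
            total += 1
            closed += v in adj.get(u, set())
    return closed / total if total else 1.0

# hypothetical entity IDs from three datasets A, B, C
good = [("A:1", "B:7"), ("B:7", "C:3"), ("A:1", "C:3")]  # closed triangle
weak = [("A:1", "B:7"), ("B:7", "C:3")]                  # missing link
```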
APA, Harvard, Vancouver, ISO, and other styles
37

Iskandaryan, Ditsuhi, Francisco Ramos, and Sergio Trilles. "Features Exploration from Datasets Vision in Air Quality Prediction Domain." Atmosphere 12, no. 3 (2021): 312. http://dx.doi.org/10.3390/atmos12030312.

Full text
Abstract:
Air pollution and its consequences are negatively impacting the world population and the environment, which makes air quality monitoring and forecasting techniques essential tools to combat this problem. To predict air quality with maximum accuracy, it is crucial to consider not only the implemented models and the quantity of the data, but also the dataset types. This study selected a set of research works in the field of air quality prediction and concentrates on the exploration of the datasets utilised in them. The most significant findings of this research work are: (1) meteorological datasets were used in 94.6% of the papers, far ahead of the other dataset types, and were complemented with others, such as temporal data, spatial data, and so on; (2) the usage of various dataset combinations began in 2009; and (3) the utilisation of open data began in 2012; 32.3% of the studies used open data, and 63.4% of the studies did not provide their data.
APA, Harvard, Vancouver, ISO, and other styles
38

Gao, Wuliang. "CAugment: An Approach to Diversifying Dataset by Combining Image Processing Operations." Information Technology and Control 52, no. 4 (2023): 996–1009. http://dx.doi.org/10.5755/j01.itc.52.4.33828.

Full text
Abstract:
In deep learning, model quality is extremely important. Consequently, the quality and sufficiency of the datasets used for training models have attracted considerable attention from both industry and academia. Automatic data augmentation, which provides a means of using image processing operators to generate data from existing datasets, is quite effective in searching for mutants of the images and expanding the training datasets. However, existing automatic data augmentation techniques often fail to fully exploit the potential of the data and to balance search efficiency against model accuracy. This paper presents CAugment, a novel approach to diversifying image datasets by combining image processing operators. Given a training image dataset, CAugment is composed of: 1) the three-level evolutionary algorithm (TLEA), which employs three levels of atomic operations for augmenting the dataset and an adaptive strategy for decreasing granularity, and 2) a design that uses the three-dimensional evaluation method (TDEM) and a dHash algorithm to measure the diversity of the dataset. The search space can thus be expanded, which further improves model accuracy during training. We use CAugment to augment the CIFAR-10/100 and SVHN datasets and use the augmented datasets to train the WideResNet and Shake-Shake models. Our results show that the amount of data increases linearly with the training epochs; in addition, the models trained on the CAugment-augmented datasets outperform those trained on datasets augmented by the other techniques by up to 17.9% in accuracy on the SVHN dataset.
APA, Harvard, Vancouver, ISO, and other styles
39

Feeney, Kevin Chekov, Declan O'Sullivan, Wei Tai, and Rob Brennan. "Improving Curated Web-Data Quality with Structured Harvesting and Assessment." International Journal on Semantic Web and Information Systems 10, no. 2 (2014): 35–62. http://dx.doi.org/10.4018/ijswis.2014040103.

Full text
Abstract:
This paper describes a semi-automated process, framework and tools for harvesting, assessing, improving and maintaining high-quality linked-data. The framework, known as DaCura, provides dataset curators, who may not be knowledge engineers, with tools to collect and curate evolving linked data datasets that maintain quality over time. The framework encompasses a novel process, workflow and architecture. A working implementation has been produced and applied firstly to the publication of an existing social-sciences dataset, then to the harvesting and curation of a related dataset from an unstructured data-source. The framework's performance is evaluated using data quality measures that have been developed to measure existing published datasets. An analysis of the framework against these dimensions demonstrates that it addresses a broad range of real-world data quality concerns. Experimental results quantify the impact of the DaCura process and tools on data quality through an assessment framework and methodology which combines automated and human data quality controls.
APA, Harvard, Vancouver, ISO, and other styles
40

Jiang, Hongyan, Dianjun Fang, Klaus Spicher, Feng Cheng, and Boxing Li. "A New Period-Sequential Index Forecasting Algorithm for Time Series Data." Applied Sciences 9, no. 20 (2019): 4386. http://dx.doi.org/10.3390/app9204386.

Full text
Abstract:
A period-sequential index algorithm with sigma-pi neural network technology, called the SPNN-PSI method, is proposed for the prediction of time series datasets. Using the SPNN-PSI method, the cumulative electricity output (CEO) dataset, the Volkswagen sales (VS) dataset, and the electric motors exports (EME) dataset are tested. The results show that, in contrast to the moving average (MA), exponential smoothing (ES), and autoregressive integrated moving average (ARIMA) methods, the proposed SPNN-PSI method shows satisfactory forecasting quality due to its lower error, and is more suitable for the prediction of time series datasets. It is also concluded that the higher the correlation coefficient of the reference historical datasets, the higher the prediction quality of the SPNN-PSI method tends to be, and that a correlation coefficient above 0.4 helps to increase the probability of achieving higher forecasting accuracy and to produce more accurate forecasts for big datasets.
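The MA and ES baselines that SPNN-PSI is compared against can be sketched as one-step forecasts; the window size, smoothing factor, and series values below are arbitrary illustrative choices:

```python
def moving_average_forecast(series, window=3):
    """MA baseline: forecast the next value as the mean of the last
    `window` observations."""
    return sum(series[-window:]) / window

def exponential_smoothing_forecast(series, alpha=0.5):
    """Simple ES baseline: the exponentially weighted level serves as
    the one-step-ahead forecast."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

series = [10.0, 12.0, 14.0, 16.0]
ma = moving_average_forecast(series)       # mean of the last three values
es = exponential_smoothing_forecast(series)
```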
APA, Harvard, Vancouver, ISO, and other styles
41

Khreis, Haneen, Kees de Hoogh, Josias Zietsman, and Mark J. Nieuwenhuijsen. "The Impact of Different Validation Datasets on Air Quality Modeling Performance." Transportation Research Record: Journal of the Transportation Research Board 2672, no. 25 (2018): 57–66. http://dx.doi.org/10.1177/0361198118780682.

Full text
Abstract:
Many studies rely on air pollution modeling such as land use regression (LUR) or atmospheric dispersion (AD) modeling in epidemiological and health impact assessments. Generally, these models are only validated using one validation dataset and their estimates at select receptor points are generalized to larger areas. The primary objective of this paper was to explore the effect of different validation datasets on the validation of air quality models. The secondary objective was to explore the effect of the model estimates’ spatial resolution on the models’ validity at different locations. Annual NOx and NO2 were generated using a LUR and an AD model. These estimates were validated against four measurement datasets, once when estimates were made at the exact locations of the validation points and once when estimates were made at the centroid of the 100m×100m grid in which the validation point fell. The validation results varied substantially based on the model and validation dataset used. The LUR models’ R2 ranged between 21% and 58%, based on the validation dataset. The AD models’ R2 ranged between 13% and 56% based on the validation dataset and the use of constant or varying background NOx. The validation results based on model estimates at the exact validation site locations were much better than those based on a 100m×100m grid. This paper demonstrated the value of validating modeled air quality against various datasets and suggested that the spatial resolution of the models’ estimates has a significant influence on the validity at the application point.
APA, Harvard, Vancouver, ISO, and other styles
42

Tigga, Onima, Jaya Pal, and Debjani Mustafi. "A Novel Data Handling Technique for Wine Quality Analysis using ML Techniques." International Journal of Experimental Research and Review 45, Spl Vol (2024): 25–40. https://doi.org/10.52756/ijerr.2024.v45spl.003.

Full text
Abstract:
In this era, wine is a regularly consumed beverage, and industries are seeing increased sales due to product quality certification. This research identifies key wine characteristics that contribute to significant outcomes through the application of machine learning classification techniques, specifically Random Forest (RF), Decision Tree (DT) and Multi-Layer Perceptron (MLP), using white and red wine datasets sourced from the UCI Machine Learning repository. It develops a multiclass classification model using machine learning (ML) to accurately assess the quality of a balanced wine dataset comprising both white and red wines. The dataset is balanced by random oversampling to avoid the bias of ML techniques toward the majority class of the imbalanced multiclass dataset (IMD). Furthermore, a Yeo-Johnson transformation (YJT) is applied to the datasets to reduce skewness. The ML algorithms' results are validated using a 10-fold cross-validation approach; RF yielded the highest overall accuracy of 93.14%, within a range of 75% to 94%. With the proposed approach, balanced white wine dataset accuracy is 93.14% using RF, 90.83% using DT, and 75.49% using MLP. Similarly, for the balanced red wine dataset, accuracy is 89.36% using RF, 85.36% using DT, and 78.00% using MLP. The proposed approach improves accuracy by 23% for RF, 30% for DT, and 21% for MLP on the white wine dataset. Similarly, on the red wine dataset, RF accuracy remained the same, while DT improved by 10% and MLP by 22%. Additionally, the proposed approach's RF, DT, and MLP yield mean squared error (MSE) values of 0.080, 0.151, and 0.443 for the white wine dataset and 0.143, 0.221, and 0.396 for the red wine dataset. The RF accuracy of the proposed technique is the highest among all specified classifiers for both the white and red wine datasets.
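The random-oversampling step used above to balance the multiclass dataset can be sketched in a few lines (a hypothetical helper, not the paper's code; in practice libraries such as imbalanced-learn provide equivalent behaviour):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Balance a multiclass dataset by resampling every minority class
    with replacement until all classes match the majority-class count."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx_parts = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        # draw extra indices (with replacement) to reach the majority count
        extra = rng.choice(idx, size=n_max - idx.size, replace=True)
        idx_parts.append(np.concatenate([idx, extra]))
    idx_all = np.concatenate(idx_parts)
    return X[idx_all], y[idx_all]
```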
APA, Harvard, Vancouver, ISO, and other styles
43

Riley, Merilyn, Kerin Robinson, Monique F. Kilkenny, and Sandra G. Leggat. "The knowledge and reuse practices of researchers utilising government health information assets, Victoria, Australia, 2008–2020." PLOS ONE 19, no. 2 (2024): e0297396. http://dx.doi.org/10.1371/journal.pone.0297396.

Full text
Abstract:
Background Using government health datasets for secondary purposes is widespread; however, little is known on researchers’ knowledge and reuse practices within Australia. Objectives To explore researchers’ knowledge and experience of governance processes, and their data reuse practices, when using Victorian government health datasets for research between 2008–2020. Method A cross-sectional quantitative survey was conducted with authors who utilised selected Victorian, Australia, government health datasets for peer-reviewed research published between 2008–2020. Information was collected on researchers’: data reuse practices; knowledge of government health information assets; perceptions of data trustworthiness for reuse; and demographic characteristics. Results When researchers used government health datasets, 45% linked their data, 45% found the data access process easy and 27% found it difficult. Government-curated datasets were significantly more difficult to access compared to other-agency curated datasets (p = 0.009). Many respondents received their data in less than six months (58%), in aggregated or de-identified form (76%). Most reported performing their own data validation checks (70%). To assist in data reuse, almost 71% of researchers utilised (or created) contextual documentation, 69% a data dictionary, and 62% limitations documentation. Almost 20% of respondents were not aware if data quality information existed for the dataset they had accessed. Researchers reported data was managed by custodians with rigorous confidentiality/privacy processes (94%) and good data quality processes (76%), yet half lacked knowledge of what these processes entailed. Many respondents (78%) were unaware if dataset owners had obtained consent from the dataset subjects for research applications of the data. Conclusion Confidentiality/privacy processes and quality control activities undertaken by data custodians were well-regarded. Many respondents included data linkage to additional government datasets in their research. Ease of data access was variable. Some documentation types were well provided and used, but improvement is required for the provision of data quality statements and limitations documentation. Provision of information on participants’ informed consent in a dataset is required.
APA, Harvard, Vancouver, ISO, and other styles
44

Wickett, Karen M., Manika Lamba, and Jarrett Newman. "Putting People First in Data Quality: Feminist Data Ethics for Open Government Datasets." Proceedings of the Association for Information Science and Technology 61, no. 1 (2024): 1135–37. http://dx.doi.org/10.1002/pra2.1209.

Full text
Abstract:
Open government information systems offer great potential for advancing civic life and democracy, but they also reflect and reinforce the biases and systematic inequalities faced by members of socially marginalized groups. We present results from a critical data modeling project that uses a data quality framework to examine open datasets published by police departments in order to understand how data modeling choices shape the social impact of these datasets. Using an arrest record dataset published by the Los Angeles Police Department as a case study, we present results detailing the representation of racial data and the presence of children in the dataset. We argue that current data quality frameworks for open government data are insufficient for critical data studies due to an orientation around institutional and computational interests. Incorporating feminist data ethics into data quality analysis provides an approach to data quality that centers people and communities. We propose a definition of data quality for open government datasets based on an ethics of care that centers the needs of vulnerable populations and the accountability of institutions toward their communities.
APA, Harvard, Vancouver, ISO, and other styles
45

Aherwadi, Nagnath, Usha Mittal, Jimmy Singla, N. Z. Jhanjhi, Abdulsalam Yassine, and M. Shamim Hossain. "Prediction of Fruit Maturity, Quality, and Its Life Using Deep Learning Algorithms." Electronics 11, no. 24 (2022): 4100. http://dx.doi.org/10.3390/electronics11244100.

Full text
Abstract:
Fruit that has reached maturity is ready to be harvested. The prediction of fruit maturity and quality is important not only for farmers or the food industry but also for small retail stores and supermarkets where fruits are sold and purchased. Fruit maturity classification is the process by which fruits are classified according to their maturity in their life cycle. Nowadays, deep learning (DL) has been applied in many applications of smart agriculture such as water and soil management, crop planting, crop disease detection, weed removal, crop distribution, strong fruit counting, crop harvesting, and production forecasting. This study aims to find the best deep learning algorithms which can be used for the prediction of fruit maturity and quality for the shelf life of fruit. In this study, two datasets of banana fruit are used, where we create the first dataset, and the second dataset is taken from Kaggle, named Fruit 360. Our dataset contains 2100 images in 3 categories: ripe, unripe, and over-ripe, each of 700 images. An image augmentation technique is used to maximize the dataset size to 18,900. Convolutional neural networks (CNN) and AlexNet techniques are used for building the model for both datasets. The original dataset achieved an accuracy of 98.25% for the CNN model and 81.75% for the AlexNet model, while the augmented dataset achieved an accuracy of 99.36% for the CNN model and 99.44% for the AlexNet model. The Fruit 360 dataset achieved an accuracy of 81.96% for CNN and 81.75% for the AlexNet model. We concluded that for all three datasets of banana images, the proposed CNN model is the best suitable DL algorithm for bananas’ fruit maturity classification and quality detection.
APA, Harvard, Vancouver, ISO, and other styles
46

Short, Andrew R., Theofanis G. Orfanoudakis, and Helen C. Leligou. "Improving Security and Fairness in Federated Learning Systems." International Journal of Network Security & Its Applications 13, no. 6 (2021): 37–53. http://dx.doi.org/10.5121/ijnsa.2021.13604.

Full text
Abstract:
The ever-increasing use of Artificial Intelligence applications has made apparent that the quality of the training datasets affects the performance of the models. To this end, Federated Learning aims to engage multiple entities to contribute to the learning process with locally maintained data, without requiring them to share the actual datasets. Since the parameter server does not have access to the actual training datasets, it becomes challenging to offer rewards to users by directly inspecting the dataset quality. Instead, this paper focuses on ways to strengthen user engagement by offering “fair” rewards, proportional to the model improvement (in terms of accuracy) they offer. Furthermore, to enable objective judgment of the quality of contribution, we devise a point system to record user performance assisted by blockchain technologies. More precisely, we have developed a verification algorithm that evaluates the performance of users’ contributions by comparing the resulting accuracy of the global model against a verification dataset and we demonstrate how this metric can be used to offer security improvements in a Federated Learning process. Further on, we implement the solution in a simulation environment in order to assess the feasibility and collect baseline results using datasets of varying quality.
APA, Harvard, Vancouver, ISO, and other styles
47

Wang, Lin, Yibing Wang, Jian Chen, Shuangqing Zhang, and Lanhong Zhang. "Research on CC-SSBLS Model-Based Air Quality Index Prediction." Atmosphere 15, no. 5 (2024): 613. http://dx.doi.org/10.3390/atmos15050613.

Full text
Abstract:
Establishing reliable and effective prediction models is a major research priority for air quality parameter monitoring and prediction and is utilized extensively in numerous fields. The sample dataset of air quality metrics often established has missing data and outliers because of certain uncontrollable causes. A broad learning system based on a semi-supervised mechanism is built to address some of the dataset’s data-missing issues, hence reducing the air quality model prediction error. Several air parameter sample datasets in the experiment were discovered to have outlier issues, and the anomalous data directly impact the prediction model’s stability and accuracy. Furthermore, the correlation entropy criteria perform better when handling the sample data’s outliers. Therefore, the prediction model in this paper consists of a semi-supervised broad learning system based on the correlation entropy criterion (CC-SSBLS). This technique effectively solves the issue of unstable and inaccurate prediction results due to anomalies in the data by substituting the correlation entropy criterion for the mean square error criterion in the BLS algorithm. Experiments on the CC-SSBLS algorithm and comparative studies with models like Random Forest (RF), Support Vector Regression (V-SVR), BLS, SSBLS, and Categorical and Regression Tree-based Broad Learning System (CART-BLS) were conducted using sample datasets of air parameters in various regions. In this paper, the root mean square error (RMSE) and mean absolute percentage error (MAPE) are used to judge the advantages and disadvantages of the proposed model. Through the experimental analysis, RMSE and MAPE reached 8.68 μg·m−3 and 0.24% in the Nanjing dataset. It is possible to conclude that the CC-SSBLS algorithm has superior stability and prediction accuracy based on the experimental results.
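The two error measures used above to rank the prediction models are standard; a minimal sketch of both (the function names are ours):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)
```

Lower values of either measure indicate a better-fitting prediction model.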
APA, Harvard, Vancouver, ISO, and other styles
48

Hashim, N. M., A. H. Omar, K. M. Omar, M. A. Abbas, M. A. Mustafar, and S. A. Sulaiman. "CADASTRAL POSITIONING ACCURACY IMPROVEMENT (PAI): A CASE STUDY OF PRE-REQUISITE DATA QUALITY ASSURANCE." ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLII-4/W16 (October 1, 2019): 255–60. http://dx.doi.org/10.5194/isprs-archives-xlii-4-w16-255-2019.

Full text
Abstract:
Abstract. Nowadays, there is an increasing need for comprehensive spatial data management especially digital cadastral database (DCDB). Previously, the cadastral database is in hard copy map, then converted into digital format and subsequently updated. Theoretically, these legacy datasets have relatively low positional accuracy caused by limitation of traditional measurement, adjustment technique and technology changes over time. With the growth of spatial based technology especially Geographical Information System (GIS) and Global Navigation Satellite System (GNSS) the Positional Accuracy Improvement (PAI) to the legacy cadastral database is inevitable. PAI is the refining process of the geometry feature in a geospatial dataset through integration between legacy and higher accuracy dataset to improve its actual position. However, by merely integrating both datasets will lead to a distortion of the relative geometry. Thus, an organized method is required to minimize inherent errors in fitting to the new accurate dataset. The focus of this study is to design a comprehensive data preparation for legacy cadastral datasets improvement. The elements of datum traceability, cadastral error propagation and weightage setting in adjustment will be focused to achieve the targeted objective. The proposed result can be applied as a foundation for PAI approach in cadastral database modernization.
APA, Harvard, Vancouver, ISO, and other styles
49

Bondžulić, Boban, Boban Pavlović, Nenad Stojanović, and Vladimir Petrović. "Picture-wise just noticeable difference prediction model for JPEG image quality assessment." Vojnotehnicki glasnik 70, no. 1 (2022): 62–86. http://dx.doi.org/10.5937/vojtehg70-34739.

Full text
Abstract:
Introduction/purpose: The paper presents interesting research related to the performance analysis of the picture-wise just noticeable difference (JND) prediction model and its application in the quality assessment of images with JPEG compression. Methods: The performance analysis of the JND model was conducted in an indirect way by using the publicly available results of subject-rated image datasets with the separation of images into two classes (above and below the threshold of visible differences). In the performance analysis of the JND prediction model and image quality assessment, five image datasets were used, four of which come from the visible wavelength range, and one dataset is intended for remote sensing and surveillance with images from the infrared part of the electromagnetic spectrum. Results: The paper shows that using a picture-wise JND model, subjective image quality assessment scores can be estimated with better accuracy, leading to significant performance improvements of the traditional peak signal-to-noise ratio (PSNR). The gain achieved by introducing the picture-wise JND model in the objective assessment depends on the chosen dataset and the results of the initial simple to compute PSNR measure, and it was obtained on all five datasets. The mean linear correlation coefficient (for five datasets) between subjective and PSNR objective quality estimates increased from 74% (traditional PSNR) to 90% (picture-wise JND PSNR). Conclusion: Further improvement of the JND-based objective measure can be obtained by improving the picture-wise model of JND prediction.
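The baseline PSNR measure that the JND model improves upon is simple to compute; a sketch assuming 8-bit images (the helper name is illustrative, not from the paper):

```python
import numpy as np

def psnr(reference, distorted, max_val=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized images."""
    ref = np.asarray(reference, dtype=np.float64)
    dist = np.asarray(distorted, dtype=np.float64)
    mse = np.mean((ref - dist) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A picture-wise JND model would then rescale or threshold such raw PSNR values per image before correlating them with subjective scores.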
APA, Harvard, Vancouver, ISO, and other styles
50

Brendlin, Andreas S., Arne Estler, David Plajer, et al. "AI Denoising Significantly Enhances Image Quality and Diagnostic Confidence in Interventional Cone-Beam Computed Tomography." Tomography 8, no. 2 (2022): 933–47. http://dx.doi.org/10.3390/tomography8020075.

Full text
Abstract:
(1) To investigate whether interventional cone-beam computed tomography (cbCT) could benefit from AI denoising, particularly with respect to patient body mass index (BMI); (2) From 1 January 2016 to 1 January 2022, 100 patients with liver-directed interventions and peri-procedural cbCT were included. The unenhanced mask run and the contrast-enhanced fill run of the cbCT were reconstructed using weighted filtered back projection. Additionally, each dataset was post-processed using a novel denoising software solution. Place-consistent regions of interest measured signal-to-noise ratio (SNR) per dataset. Corrected mixed-effects analysis with BMI subgroup analyses compared objective image quality. Multiple linear regression measured the contribution of “Radiation Dose”, “Body-Mass-Index”, and “Mode” to SNR. Two radiologists independently rated diagnostic confidence. Inter-rater agreement was measured using Spearman correlation (r); (3) SNR was significantly higher in the denoised datasets than in the regular datasets (p < 0.001). Furthermore, BMI subgroup analysis showed significant SNR deteriorations in the regular datasets for higher patient BMI (p < 0.001), but stable results for denoising (p > 0.999). In regression, only denoising contributed positively towards SNR (0.6191; 95%CI 0.6096 to 0.6286; p < 0.001). The denoised datasets received overall significantly higher diagnostic confidence grades (p = 0.010), with good inter-rater agreement (r ≥ 0.795, p < 0.001). In a subgroup analysis, diagnostic confidence deteriorated significantly for higher patient BMI (p < 0.001) in the regular datasets but was stable in the denoised datasets (p ≥ 0.103); (4) AI denoising can significantly enhance image quality in interventional cone-beam CT and effectively mitigate diagnostic confidence deterioration for rising patient BMI.
APA, Harvard, Vancouver, ISO, and other styles
