Academic literature on the topic 'Quality of datasets'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Quality of datasets.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "Quality of datasets"

1

Chen, Yijun, Shenxin Zhao, Lihua Zhang, and Qi Zhou. "Quality Assessment of Global Ocean Island Datasets." ISPRS International Journal of Geo-Information 12, no. 4 (2023): 168. http://dx.doi.org/10.3390/ijgi12040168.

Full text
Abstract:
Ocean island data are essential to the conservation and management of islands and coastal ecosystems, and have also been adopted by the United Nations as a sustainable development goal (SDG 14). Currently, two categories of island datasets, i.e., the global shoreline vector (GSV) and OpenStreetMap (OSM), are freely available on a global scale. However, few studies have focused on assessing and comparing the data quality of these two datasets, which is the main purpose of our study. Specifically, these two datasets were assessed using four 100 × 100 km² study areas, in terms of three aspects of measures, i.e., accuracy (including overall accuracy (OA), precision, recall and F1), completeness (including area completeness and count completeness) and shape complexity. The results showed that: (1) Both datasets perform well in terms of OA (98% or above) and F1 (0.9 or above); the OSM dataset performs better in terms of precision, but the GSV dataset performs better in terms of recall. (2) The area completeness is almost 100%, but the count completeness is much higher than 100%, indicating that the total areas of the two datasets are almost the same, but there are many more islands in the OSM dataset. (3) In most cases, the fractal dimension of the OSM dataset is relatively larger than that of the GSV dataset in terms of shape complexity, indicating that the OSM dataset has more detail in terms of island boundaries or coastlines. We concluded that both datasets (GSV and OSM) are effective for island mapping, but the OSM dataset can identify more small islands and has more detail.
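The accuracy and completeness measures listed in this abstract reduce to simple ratios over matched and unmatched islands; a minimal Python sketch with illustrative counts (not the authors' code):

```python
def accuracy_measures(tp, fp, fn, tn):
    """Overall accuracy, precision, recall and F1 from confusion-matrix counts."""
    oa = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return oa, precision, recall, f1

def completeness(candidate_area, reference_area, candidate_count, reference_count):
    """Area and count completeness (%) of a candidate dataset against a reference."""
    return (100 * candidate_area / reference_area,
            100 * candidate_count / reference_count)

# A count completeness far above 100% means many more (small) islands in the
# candidate dataset than in the reference, as reported here for OSM vs. GSV.
print(accuracy_measures(tp=950, fp=20, fn=30, tn=9000))
print(completeness(101.2, 100.0, 2400, 1500))
```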
APA, Harvard, Vancouver, ISO, and other styles
2

Waller, John. "Data Location Quality at GBIF." Biodiversity Information Science and Standards 3 (June 13, 2019): e35829. https://doi.org/10.3897/biss.3.35829.

Full text
Abstract:
I will cover how the Global Biodiversity Information Facility (GBIF) handles data quality issues, with specific focus on coordinate location issues, such as gridded datasets (Fig. 1) and country centroids. I will highlight the challenges GBIF faces identifying potential data quality problems and what we and others (Zizka et al. 2019) are doing to discover and address them. GBIF is the largest open-data portal of biodiversity data, which is a large network of individual datasets (> 40k) from various sources and publishers. Since these datasets are variable both within themselves and dataset-to-dataset, this creates a challenge for users wanting to use data collected from museums, smartphones, atlases, satellite tracking, DNA sequencing, and various other sources for research or analysis. Data quality at GBIF will always be a moving target (Chapman 2005), and GBIF already handles many obvious errors such as zero/impossible coordinates, empty or invalid data fields, and fuzzy taxon matching. Since GBIF primarily (but not exclusively) serves lat-lon location information, there is an expectation that occurrences fall somewhat close to where the species actually occurs. This is not always the case. Occurrence data can be hundreds of kilometers away from where the species naturally occurs, and there can be multiple reasons why this happens, which might not be entirely obvious to users. One reason is that many GBIF datasets are gridded. Gridded datasets are datasets that have low resolution due to equally-spaced sampling. This can be a data quality issue because a user might assume an occurrence record was recorded exactly at its coordinates. Country centroids are another reason why a species occurrence record might be far from where it occurs naturally. GBIF does not yet flag country centroids, which are records where the dataset publisher has entered the lat-long center of a country instead of leaving the field blank. I will discuss the challenges surrounding locating these issues and the current solutions (such as the CoordinateCleaner R package). I will touch on how existing DWCA terms like coordinateUncertaintyInMeters and footprintWKT are being utilized to highlight low coordinate resolution. Finally, I will highlight some other emerging data quality issues and how GBIF is beginning to experiment with dataset-level flagging. Currently we have flagged around 500 datasets as gridded and around 400 datasets as citizen science, but there are many more potential dataset flags.
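One way to flag a gridded dataset in the spirit described here (not GBIF's production pipeline; the rounding, thresholds and minimum point count are assumptions) is to look for a dominant, regular spacing among the unique coordinate values:

```python
from collections import Counter

def looks_gridded(latitudes, longitudes, min_share=0.5, min_points=20):
    """Heuristic: a dataset looks gridded if one positive spacing dominates
    the nearest-neighbour differences of its unique coordinate values."""
    def dominant_spacing_share(values):
        uniq = sorted(set(round(v, 4) for v in values))
        diffs = [round(b - a, 4) for a, b in zip(uniq, uniq[1:]) if b - a > 0]
        if not diffs:
            return 0.0
        _, count = Counter(diffs).most_common(1)[0]
        return count / len(diffs)

    if len(latitudes) < min_points:
        return False
    return (dominant_spacing_share(latitudes) >= min_share
            and dominant_spacing_share(longitudes) >= min_share)

# Points sampled on a regular 0.5-degree grid are flagged.
grid_lat = [40 + 0.5 * i for i in range(30)]
grid_lon = [10 + 0.5 * i for i in range(30)]
print(looks_gridded(grid_lat, grid_lon))  # True
```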
APA, Harvard, Vancouver, ISO, and other styles
3

Diamant, Roee, Ilan Shachar, Yizhaq Makovsky, Bruno Miguel Ferreira, and Nuno Alexandre Cruz. "Cross-Sensor Quality Assurance for Marine Observatories." Remote Sensing 12, no. 21 (2020): 3470. http://dx.doi.org/10.3390/rs12213470.

Full text
Abstract:
Measuring and forecasting changes in coastal and deep-water ecosystems and climates requires sustained long-term measurements from marine observation systems. One of the key considerations in analyzing data from marine observatories is quality assurance (QA). The data acquired by these infrastructures accumulates into giga- and terabytes per year, necessitating accurate automatic identification of false samples. A particular challenge in the QA of oceanographic datasets is avoiding the disqualification of data samples that, while appearing as outliers, actually represent real short-term phenomena of importance. In this paper, we present a novel cross-sensor QA approach that validates the disqualification decision of a data sample from an examined dataset by comparing it to samples from related datasets. This group of related datasets is chosen so as to reflect the same oceanographic phenomena and thereby enable some prediction of the examined dataset. In our approach, a disqualification is validated if the detected anomaly is present only in the examined dataset, but not in its related datasets. Results for a surface water temperature dataset recorded by our Texas A&M-Haifa Eastern Mediterranean Marine Observatory (THEMO) over a period of 7 months show an improved trade-off between accurate and false disqualification rates when compared to two standard benchmark schemes.
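The core decision rule, validating a disqualification only when the anomaly is absent from the related datasets, can be sketched with simple z-score flags (illustrative only; the paper's detectors and choice of related datasets are more elaborate):

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Boolean mask of samples whose absolute z-score exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    z = np.abs((x - x.mean()) / (x.std() + 1e-12))
    return z > threshold

def validated_disqualifications(examined, related_series, threshold=3.0):
    """Disqualify a sample only if it is anomalous in the examined dataset
    but not anomalous at the same time index in any related dataset."""
    examined_flags = zscore_outliers(examined, threshold)
    related_flags = np.any(
        [zscore_outliers(s, threshold) for s in related_series], axis=0)
    return examined_flags & ~related_flags

temp_surface = np.r_[np.random.normal(24, 0.3, 200), [40.0]]  # spurious spike
temp_deep = np.r_[np.random.normal(18, 0.3, 200), [18.1]]     # no matching event
print(np.where(validated_disqualifications(temp_surface, [temp_deep]))[0])  # [200]
```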
APA, Harvard, Vancouver, ISO, and other styles
4

Gao, Chenxi. "Generative Adversarial Networks-based solution for improving medical data quality and insufficiency." Applied and Computational Engineering 49, no. 1 (2024): 167–75. http://dx.doi.org/10.54254/2755-2721/49/20241086.

Full text
Abstract:
As big data brings intelligent solutions and innovations to various fields, the goal of this research is to solve the problem of poor-quality and insufficient datasets in the medical field, and to help under-resourced areas gain access to high-quality, rich medical datasets as well. This study focuses on solving the current problem by utilizing variants of the generative adversarial network: the Super Resolution Generative Adversarial Network (SRGAN) and the Deep Convolutional Generative Adversarial Network (DCGAN). In this study, OpenCV is employed to introduce blur to the Brain Tumor MRI Dataset, resulting in a blurred dataset. Subsequently, the research utilizes both the unaltered and blurred datasets to train the SRGAN model, which is then applied to enhance the low-quality dataset through inpainting. Moving forward, the original dataset, the low-quality dataset, and the improved dataset are each used independently to train the DCGAN model. In order to compare the difference between the produced image datasets and the real dataset, the FID score is computed separately for each. The results of the study found that by training DCGAN with the SRGAN-repaired medical dataset, the medical image dataset is visibly clearer and there is a reduction in the Fréchet Inception Distance (FID) score. Therefore, by using SRGAN and DCGAN, the current problem of low quality and small quantity of datasets in the medical field can be addressed, which increases the potential of big data in the artificial intelligence field of medicine.
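The degradation step described here, blurring an image dataset with OpenCV to create a low-quality counterpart, can be reproduced roughly as follows (directory names, file extension and kernel size are assumptions, not taken from the paper):

```python
import glob
import os

import cv2  # opencv-python

def build_blurred_copy(src_dir, dst_dir, kernel=(9, 9), sigma=0):
    """Write a Gaussian-blurred copy of every image in src_dir to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for path in glob.glob(os.path.join(src_dir, "*.jpg")):
        image = cv2.imread(path)
        if image is None:
            continue  # skip unreadable files
        blurred = cv2.GaussianBlur(image, kernel, sigma)
        cv2.imwrite(os.path.join(dst_dir, os.path.basename(path)), blurred)

# build_blurred_copy("brain_mri/original", "brain_mri/blurred")
```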
APA, Harvard, Vancouver, ISO, and other styles
5

Quarati, Alfonso, Monica De Martino, and Sergio Rosim. "Geospatial Open Data Usage and Metadata Quality." ISPRS International Journal of Geo-Information 10, no. 1 (2021): 30. http://dx.doi.org/10.3390/ijgi10010030.

Full text
Abstract:
Open Government Data (OGD) portals, thanks to the presence of thousands of geo-referenced datasets containing spatial information, are of extreme interest for any analysis or process relating to the territory. For this to happen, users must be enabled to access these datasets and reuse them. An element often considered to hinder the full dissemination of OGD data is the quality of their metadata. Starting from an experimental investigation conducted on over 160,000 geospatial datasets belonging to six national and international OGD portals, this work has as its first objective to provide an overview of the usage of these portals, measured in terms of dataset views and downloads. Furthermore, to assess the possible influence of the quality of the metadata on the use of geospatial datasets, an assessment of the metadata for each dataset was carried out, and the correlation between these two variables was measured. The results obtained showed a significant underutilization of geospatial datasets and a generally poor quality of their metadata. In addition, a weak correlation was found between the use and quality of the metadata, not such as to assert with certainty that the latter is a determining factor of the former.
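A minimal way to reproduce this kind of analysis, scoring metadata completeness per dataset and correlating it with usage, might look like the sketch below (the required fields and record layout are assumptions, not the paper's scheme):

```python
from scipy.stats import spearmanr

REQUIRED_FIELDS = ["title", "description", "license", "keywords",
                   "spatial_coverage", "temporal_coverage", "contact"]

def metadata_quality(record):
    """Share of required metadata fields that are present and non-empty."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f))
    return filled / len(REQUIRED_FIELDS)

def quality_usage_correlation(records):
    """Spearman correlation between metadata quality and dataset views."""
    quality = [metadata_quality(r) for r in records]
    views = [r.get("views", 0) for r in records]
    rho, p_value = spearmanr(quality, views)
    return rho, p_value

records = [
    {"title": "Rivers", "description": "Hydro network", "license": "CC-BY",
     "keywords": "water", "views": 120},
    {"title": "Land use", "description": "Land cover classes", "views": 40},
    {"title": "Wells", "views": 3},
    {"title": "Soil", "license": "CC0", "views": 15},
]
print(quality_usage_correlation(records))
```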
APA, Harvard, Vancouver, ISO, and other styles
6

Scarpetta, Marco, Luisa De Palma, Attilio Di Nisio, Maurizio Spadavecchia, Paolo Affuso, and Nicola Giaquinto. "Optimizing Satellite Imagery Datasets for Enhanced Land/Water Segmentation." Sensors 25, no. 6 (2025): 1793. https://doi.org/10.3390/s25061793.

Full text
Abstract:
This paper presents an automated procedure for optimizing datasets used in land/water segmentation tasks with deep learning models. The proposed method employs the Normalized Difference Water Index (NDWI) with a variable threshold to automatically assess the quality of annotations associated with multispectral satellite images. By systematically identifying and excluding low-quality samples, the method enhances dataset quality and improves model performance. Experimental results on two different publicly available datasets—the SWED and SNOWED—demonstrate that deep learning models trained on optimized datasets outperform those trained on baseline datasets, achieving significant improvements in segmentation accuracy, with up to a 10% increase in mean intersection over union, despite a reduced dataset size. Therefore, the presented methodology is a promising scalable solution for improving the quality of datasets for environmental monitoring and other remote sensing applications.
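The annotation check described here hinges on the NDWI, (green − NIR) / (green + NIR); a hedged sketch of using it to score how well a provided water mask agrees with an index-based mask (the band layout, threshold and agreement cut-off are assumptions, not the paper's exact procedure):

```python
import numpy as np

def ndwi(green, nir):
    """Normalized Difference Water Index for two co-registered bands."""
    green = green.astype(float)
    nir = nir.astype(float)
    return (green - nir) / (green + nir + 1e-12)

def annotation_agreement(green, nir, water_mask, threshold=0.0):
    """Intersection-over-union between the annotated water mask and the mask
    obtained by thresholding the NDWI; low values flag low-quality samples."""
    index_mask = ndwi(green, nir) > threshold
    intersection = np.logical_and(index_mask, water_mask).sum()
    union = np.logical_or(index_mask, water_mask).sum()
    return intersection / union if union else 1.0

# Keep only samples whose annotation agrees well enough with the index:
# kept = [s for s in samples if annotation_agreement(s.green, s.nir, s.mask) > 0.8]
```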
APA, Harvard, Vancouver, ISO, and other styles
7

Pan, Hangyu, Yaoyi Xi, Ling Wang, Yu Nan, Zhizhong Su, and Rong Cao. "Dataset construction method of cross-lingual summarization based on filtering and text augmentation." PeerJ Computer Science 9 (March 28, 2023): e1299. http://dx.doi.org/10.7717/peerj-cs.1299.

Full text
Abstract:
Existing cross-lingual summarization (CLS) datasets suffer from inconsistent sample quality and limited scale. To address these problems, we propose a method that jointly supervises quality and scale to build CLS datasets. In terms of quality supervision, the method adopts a multi-strategy filtering algorithm to remove low-quality samples of monolingual summarization (MS) from the perspectives of character and semantics, thereby improving the quality of the MS dataset. In terms of scale supervision, the method adopts a text augmentation algorithm based on a pretrained model to increase the size of CLS datasets with quality assurance. This method was used to build an English-Chinese CLS dataset and evaluate it with a reasonable data quality evaluation framework. The evaluation results show that the dataset is of good quality and large size. These outcomes show that the proposed method may comprehensively improve quality and scale, thereby resulting in a high-quality and large-scale CLS dataset at a lower cost.
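A character-level filter of the kind this abstract alludes to might look like the rough sketch below; the length and overlap thresholds are assumptions, and the paper's semantic filters and pretrained-model augmentation are not reproduced here:

```python
def keep_sample(text, summary, min_len=30, max_ratio=0.8, min_overlap=0.1):
    """Very rough character-level quality filter for (text, summary) pairs:
    drop tiny texts, summaries nearly as long as the text, and summaries
    that share almost no vocabulary with the text."""
    if len(text) < min_len or not summary:
        return False
    if len(summary) / len(text) > max_ratio:
        return False
    text_tokens, summary_tokens = set(text.split()), set(summary.split())
    if not summary_tokens:
        return False
    overlap = len(text_tokens & summary_tokens) / len(summary_tokens)
    return overlap >= min_overlap

pairs = [("a long enough source document about dataset quality and filtering methods",
          "dataset quality filtering"),
         ("short text", "a summary that is longer than its own source text somehow")]
print([keep_sample(t, s) for t, s in pairs])  # [True, False]
```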
APA, Harvard, Vancouver, ISO, and other styles
8

Seol, Sujin, Jaewoo Yoon, Jungeun Lee, and Byeongwoo Kim. "Metrics for Evaluating Synthetic Time-Series Data of Battery." Applied Sciences 14, no. 14 (2024): 6088. http://dx.doi.org/10.3390/app14146088.

Full text
Abstract:
The advancements in artificial intelligence have encouraged the application of deep learning in various fields. However, the accuracy of deep learning algorithms is influenced by the quality of the dataset used. Therefore, a high-quality dataset is critical for deep learning. Data augmentation algorithms can generate large, high-quality datasets. The dataset quality is mainly assessed through qualitative and quantitative evaluations. However, conventional qualitative evaluation methods lack the objective and quantitative parameters necessary for battery synthetic datasets. Therefore, this study proposes the application of the rate of change in linear regression correlation coefficients, Dunn index, and silhouette coefficient as clustering indices for quantitatively evaluating the quality of synthetic time-series datasets of batteries. To verify the reliability of the proposed method, we first applied the TimeGAN algorithm to an open-source battery dataset, generated a synthetic battery dataset, and then compared its similarity to the original dataset using the proposed evaluation method. The silhouette coefficient was confirmed as the most reliable index. Furthermore, the similarity of datasets increased as the silhouette index decreased from 0.1053 to 0.0073 based on the number of learning iterations. The results demonstrate that the insufficient quality of datasets used for deep learning can be overcome and supplemented. Furthermore, data similarity can be efficiently evaluated regardless of the learning environment. In conclusion, we present a new synthetic time-series dataset evaluation method that is more reliable than the conventional representative evaluation method (the training loss rate).
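The silhouette-based comparison described here can be approximated with scikit-learn by labelling real and synthetic samples as two clusters; a lower silhouette score then means the synthetic data is harder to separate from the real data. A sketch under that assumption (feature values are made up; this is not the paper's exact pipeline):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def real_vs_synthetic_silhouette(real, synthetic):
    """Silhouette coefficient of the two-cluster labelling {real, synthetic}.
    Values near 0 indicate the synthetic samples blend in with the real ones."""
    features = np.vstack([real, synthetic])
    labels = np.r_[np.zeros(len(real)), np.ones(len(synthetic))]
    return silhouette_score(features, labels)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 8))         # e.g. battery cycle features
good_synth = rng.normal(0.05, 1.0, size=(200, 8))  # close to the real distribution
bad_synth = rng.normal(3.0, 1.0, size=(200, 8))    # clearly different
print(real_vs_synthetic_silhouette(real, good_synth))  # close to 0
print(real_vs_synthetic_silhouette(real, bad_synth))   # much larger
```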
APA, Harvard, Vancouver, ISO, and other styles
9

Levy, Matan, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. "Data Roaming and Quality Assessment for Composed Image Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 4 (2024): 2991–99. http://dx.doi.org/10.1609/aaai.v38i4.28081.

Full text
Abstract:
The task of Composed Image Retrieval (CoIR) involves queries that combine image and text modalities, allowing users to express their intent more effectively. However, current CoIR datasets are orders of magnitude smaller than other vision and language (V&L) datasets. Additionally, some of these datasets have noticeable issues, such as queries containing redundant modalities. To address these shortcomings, we introduce the Large Scale Composed Image Retrieval (LaSCo) dataset, a new CoIR dataset which is ten times larger than existing ones. Pre-training on our LaSCo shows a noteworthy improvement in performance, even in zero-shot settings. Furthermore, we propose a new approach for analyzing CoIR datasets and methods, which detects modality redundancy or necessity in queries. We also introduce a new CoIR baseline, the Cross-Attention driven Shift Encoder (CASE). This baseline allows for early fusion of modalities using a cross-attention module and employs an additional auxiliary task during training. Our experiments demonstrate that this new baseline outperforms the current state-of-the-art methods on established benchmarks like FashionIQ and CIRR.
APA, Harvard, Vancouver, ISO, and other styles
10

Nam, Ki Hyun. "Application of Serial Crystallography for Merging Incomplete Macromolecular Crystallography Datasets." Crystals 14, no. 12 (2024): 1012. http://dx.doi.org/10.3390/cryst14121012.

Full text
Abstract:
In macromolecular crystallography (MX), a complete diffraction dataset is essential for determining the three-dimensional structure. However, collecting a complete experimental dataset using a single crystal is frequently unsuccessful due to poor crystal quality or radiation damage, resulting in the collection of multiple incomplete datasets. This issue can be solved by merging incomplete diffraction datasets to generate a complete dataset. This study introduced a new approach for merging incomplete datasets from MX to generate a complete dataset using serial crystallography (SX). Six incomplete diffraction datasets of β-glucosidase from Thermoanaerobacterium saccharolyticum (TsaBgl) were processed using CrystFEL, an SX program. The statistics of the merged data, such as completeness, CC, CC*, Rsplit, Rwork, and Rfree, demonstrated a complete dataset, indicating improved quality compared with the incomplete datasets and enabling structural determination. Also, the merging of the incomplete datasets was processed using four different indexing algorithms, and their statistics were compared. In conclusion, this approach for generating a complete dataset using SX will provide a new opportunity for determining the crystal structure of macromolecules using multiple incomplete MX datasets.
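Completeness in this context is simply the fraction of theoretically possible unique reflections that appear at least once after merging; a toy, set-based illustration (the Miller indices are made up, and real merging is done by CrystFEL, not by code like this):

```python
def completeness(observed_reflections, expected_reflections):
    """Percentage of expected unique reflections observed at least once."""
    observed_unique = set(observed_reflections)
    return 100.0 * len(observed_unique & set(expected_reflections)) / len(expected_reflections)

expected = {(h, k, l) for h in range(5) for k in range(5) for l in range(5)}
partial_a = [(h, k, l) for (h, k, l) in expected if h < 3]   # incomplete dataset A
partial_b = [(h, k, l) for (h, k, l) in expected if h >= 2]  # incomplete dataset B

print(completeness(partial_a, expected))              # 60.0
print(completeness(partial_b, expected))              # 60.0
print(completeness(partial_a + partial_b, expected))  # 100.0 after merging
```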
APA, Harvard, Vancouver, ISO, and other styles
More sources

Dissertations / Theses on the topic "Quality of datasets"

1

Koukoletsos, T. "A framework for quality evaluation of VGI linear datasets." Thesis, University College London (University of London), 2012. http://discovery.ucl.ac.uk/1359907/.

Full text
Abstract:
Spatial data collection, processing, distribution and understanding have traditionally been handled by professionals. However, as technology advances, non-experts can now collect Geographic Information (GI), create spatial databases and distribute GI through web applications. This Volunteered Geographic Information (VGI), as it is called, seems to be a promising spatial data source. However, the most concerning issue is its unknown and heterogeneous quality, which cannot be handled by traditional quality measurement methods; the quality elements that these methods measure were standardised long before the appearance of VGI and they assume uniform quality behaviour. The lack of a suitable quality evaluation framework with an appropriate level of automation, which would enable the repetition of the quality assessment when VGI is updated, renders the choice of using it difficult or risky for potential users. This thesis proposes a framework for quality evaluation of linear VGI datasets, used to represent networks. The suggested automated methodology is based on a comparison of a VGI dataset with a dataset of known quality. The heterogeneity issue is handled by producing individual results for small areal units, using a tessellation grid. The quality elements measured are data completeness, attribute and positional accuracy, considered as most important for VGI. Compared to previous research, this thesis includes an automated data matching procedure, specifically designed for VGI. It combines geometric and thematic constraints, shifting the scale of importance from geometry to non-spatial attributes, depending on their existence in the VGI dataset. Based on the data matching results, all quality elements are then measured for corresponding objects, providing a more accurate quality assessment. The method is tested on three case studies. Data matching proves to be quite efficient, leading to more accurate quality results. The data completeness approach also tackles VGI over-completeness, which broadens the method usage for data fusion purposes.
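The per-cell quality reporting described in this thesis can be illustrated with a simple length-based completeness ratio computed per grid cell; the data structures and cell size below are made up for illustration, not the thesis implementation:

```python
from collections import defaultdict

def completeness_per_cell(vgi_segments, reference_segments, cell_size=1000.0):
    """Ratio of VGI to reference road length per square grid cell.
    Each segment is (x, y, length_m) with (x, y) a representative point."""
    def totals(segments):
        acc = defaultdict(float)
        for x, y, length in segments:
            cell = (int(x // cell_size), int(y // cell_size))
            acc[cell] += length
        return acc

    vgi, ref = totals(vgi_segments), totals(reference_segments)
    return {cell: vgi.get(cell, 0.0) / ref_len
            for cell, ref_len in ref.items() if ref_len > 0}

vgi = [(150, 220, 400.0), (180, 300, 250.0), (1800, 500, 90.0)]
ref = [(160, 230, 500.0), (1850, 480, 600.0)]
print(completeness_per_cell(vgi, ref))  # {(0, 0): 1.3, (1, 0): 0.15}
```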
APA, Harvard, Vancouver, ISO, and other styles
2

Ivanovic, Stefan. "Quality based approach for updating geographic authoritative datasets from crowdsourced GPS traces." Thesis, Paris Est, 2018. http://www.theses.fr/2018PESC1068/document.

Full text
Abstract:
Nowadays, the need for very up-to-date authoritative spatial data has significantly increased. Thus, to fulfill this need, a continuous update of authoritative spatial datasets is a necessity. This task has become highly demanding in both its technical and financial aspects. In terms of road networks, three types of roads are particularly challenging to update continuously: footpaths, tractor roads and bicycle roads. They are challenging due to their intermittent nature (they appear and disappear very often) and the variety of landscapes they cross (forest, high mountains, seashore, etc.). Simultaneously, GPS data voluntarily collected by the crowd is widely available in large quantities. The number of people recording GPS data, such as GPS traces, has been steadily increasing, especially during sport and spare-time activities. The traces are made openly available and popularized on social networks, blogs, and the websites of sport and touristic associations. However, their current use is limited to very basic metric analysis such as the total time of a trace, average speed, average elevation, etc. The main reasons for that are the high variation of spatial quality from point to point along a trace, as well as a lack of protocols and metadata (e.g. the precision of the GPS device used). The global context of our work is the use of GPS hiking and mountain bike traces collected by volunteers (VGI traces) to detect potential updates of footpaths, tractor and bicycle roads in authoritative datasets. Particular attention is paid to roads that exist in reality but are not represented in authoritative datasets (missing roads). The approach we propose consists of three phases. The first phase evaluates and improves the quality of the VGI traces. The quality of traces is improved by filtering outlying points (a machine learning based approach) and points that result from secondary human behaviour (activities outside the main itinerary). The remaining points are then evaluated in terms of their accuracy by classifying them into low- or high-accuracy points using rule-based machine learning classification. The second phase deals with the detection of potential updates. For that purpose, a growing-buffer data matching solution is proposed. The size of the buffers is adapted to the results of the GPS point accuracy classification in order to handle the large variations in VGI trace accuracy. As a result, parts of traces unmatched to the authoritative road network are obtained and considered as candidates for missing roads. Finally, in the third phase we propose a decision method by which the "missing road" candidates are accepted as updates or not. This decision method is a multi-criteria process in which potential missing roads are qualified according to their degree of confidence. The approach was tested on multi-sourced VGI GPS traces from the Vosges area. Missing roads in the IGN authoritative database BDTopo® were successfully detected and proposed as potential updates.
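The growing-buffer matching idea can be sketched with shapely: a GPS point matches the reference network if it falls within a buffer whose size reflects the point's accuracy class. The buffer sizes and data layout below are assumptions, not the thesis implementation:

```python
from shapely.geometry import LineString, Point
from shapely.ops import unary_union

BUFFER_BY_CLASS = {"high": 10.0, "low": 40.0}  # metres; assumed values

def unmatched_points(trace, accuracy_classes, reference_roads):
    """Return trace points lying outside the buffered reference network;
    long runs of such points are candidates for missing roads."""
    network = unary_union([LineString(road) for road in reference_roads])
    leftovers = []
    for (x, y), accuracy in zip(trace, accuracy_classes):
        if network.distance(Point(x, y)) > BUFFER_BY_CLASS[accuracy]:
            leftovers.append((x, y))
    return leftovers

roads = [[(0, 0), (100, 0)], [(100, 0), (100, 100)]]
trace = [(5, 3), (50, 8), (60, 70)]        # the last point is far from any road
accuracy = ["high", "low", "high"]
print(unmatched_points(trace, accuracy, roads))  # [(60, 70)]
```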
APA, Harvard, Vancouver, ISO, and other styles
3

Shaw, Gavin. "Discovery & effective use of quality association rules in multi-level datasets." Thesis, Queensland University of Technology, 2010. https://eprints.qut.edu.au/41731/1/Gavin_Shaw_Thesis.pdf.

Full text
Abstract:
In today's electronic world vast amounts of knowledge are stored within many datasets and databases. Often the default format of this data means that the knowledge within is not immediately accessible, but rather has to be mined and extracted. This requires automated tools, and they need to be effective and efficient. Association rule mining is one approach to obtaining knowledge stored within datasets / databases, which includes frequent patterns and association rules between the items / attributes of a dataset with varying levels of strength. However, this is also association rule mining's downside: the number of rules that can be found is usually very large. In order to effectively use the association rules (and the knowledge within), the number of rules needs to be kept manageable, thus it is necessary to have a method to reduce the number of association rules. However, we do not want to lose knowledge through this process. Thus the idea of non-redundant association rule mining was born. A second issue with association rule mining is determining which rules are interesting. The standard approach has been to use support and confidence, but they have their limitations. Approaches which use information about the dataset's structure to measure association rules are limited, but could yield useful association rules if tapped. Finally, while it is important to be able to obtain interesting association rules from a dataset in a manageable size, it is equally important to be able to apply them in a practical way, where the knowledge they contain can be taken advantage of. Association rules show items / attributes that appear together frequently. Recommendation systems also look at patterns and items / attributes that occur together frequently in order to make a recommendation to a person. It should therefore be possible to bring the two together. In this thesis we look at these three issues and propose approaches to help. For discovering non-redundant rules we propose enhanced approaches to rule mining in multi-level datasets that allow hierarchically redundant association rules to be identified and removed, without information loss. When it comes to discovering interesting association rules based on the dataset's structure, we propose three measures for use in multi-level datasets. Lastly, we propose and demonstrate an approach that allows association rules to be practically and effectively used in a recommender system, while at the same time improving the recommender system's performance. This especially becomes evident when looking at the user cold-start problem for a recommender system. In fact our proposal helps to solve this serious problem facing recommender systems.
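Support and confidence, the standard interestingness measures this thesis starts from, are straightforward to compute; a minimal sketch with a toy transaction set:

```python
def support(transactions, itemset):
    """Share of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
print(support(baskets, {"bread", "milk"}))       # 0.5
print(confidence(baskets, {"bread"}, {"milk"}))  # ~0.67
```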
APA, Harvard, Vancouver, ISO, and other styles
4

Rintala, Jonathan, and Erik Skogetun. "Designing a Mobile User Interface for Crowdsourced Verification of Datasets." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-239035.

Full text
Abstract:
During the last decade machine learning has spread rapidly in computer science and beyond, and a central issue for machine learning is data quality. This study was carried out in the intersection of business and human-computer interaction, examining how an interface may be developed for crowdsourced verification of datasets. The interface is developed for efficiency and enjoyability through research on areas such as usability, information presentation models and gamification. The interface was developed iteratively, drawing from the needs of potential users as well as the machine learning industry. More specifically, the process involved a literature study, expert interviews, a user survey on the Kenyan market and user tests. The study was divided into a conceptual phase and a design phase, each constituting a clearly bounded part of the study, with a prototype being developed in each stage. The results of this study give an interesting insight into which usability factors are important when designing a practical tool-type mobile application while balancing efficiency and enjoyability. The resulting novel interface indicated more effective performance than a conventional grid layout and is more enjoyable to use according to the users. In addition, rapid serial visual presentation can be deemed a well-functioning model for tool-type mobile applications which require a large number of binary decisions in a short time. The study highlights the importance of iterative, user-driven processes, allowing a new innovation or idea to merge with the needs and skills of users. The results may be of interest to anyone developing tool-type mobile applications, and certainly if binary decision making on images is central.
The study implements a development process inspired by design thinking methodology and thus explores the intersection of business and human-computer interaction. Initially, business opportunities for mobile crowdsourcing applications in an East African context are examined, and based on the results of this pre-study, an interface for mobile crowdsourcing is developed. The interface is intended to handle verification of image-based datasets by collecting decisions from the users. The goal was to design for two main criteria, enjoyability and efficiency, which was achieved through research on usability, speed-reading methodologies and gamification principles. The interface was developed iteratively, based on requirements from potential users as well as input from the machine learning industry. More specifically, the process involved a literature study, expert interviews, a user study on the Kenyan market and iterative user tests. The conceptual phase was about identifying the problem and delivering a relevant idea of how the solution should be designed; thus, a novel 'Touch-Hold-Release' interface was developed. The results of this study give an interesting insight into which usability factors are important when designing a practical tool-type application while balancing efficiency and enjoyability. The novel interface indicates more effective performance than the conventional grid layout and is more enjoyable to use according to the users. In addition, 'rapid serial visual presentation' can be considered a well-functioning model for tool-type mobile applications that require large numbers of binary decisions in a short time. The study underlines the importance of working iteratively, with user-focused processes that allow new innovations and ideas to meet users' actual needs and skills. The results may be of interest to anyone developing a tool-type mobile application, particularly when binary decision making is fundamental.
APA, Harvard, Vancouver, ISO, and other styles
5

Monsen, Julius. "Building high-quality datasets for abstractive text summarization : A filtering‐based method applied on Swedish news articles." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-176352.

Full text
Abstract:
With an increasing amount of information on the internet, automatic text summarization could potentially make content more readily available for a larger variety of people. Training and evaluating text summarization models require datasets of sufficient size and quality. Today, most such datasets are in English, and for minor languages such as Swedish, it is not easy to obtain corresponding datasets with handwritten summaries. This thesis proposes methods for compiling high-quality datasets suitable for abstractive summarization from a large amount of noisy data through characterization and filtering. The data used consists of Swedish news articles and their preambles which are here used as summaries. Different filtering techniques are applied, yielding five different datasets. Furthermore, summarization models are implemented by warm-starting an encoder-decoder model with BERT checkpoints and fine-tuning it on the different datasets. The fine-tuned models are evaluated with ROUGE metrics and BERTScore. All models achieve significantly better results when evaluated on filtered test data than when evaluated on unfiltered test data. Moreover, models trained on the most filtered dataset with the smallest size achieves the best results on the filtered test data. The trade-off between dataset size and quality and other methodological implications of the data characterization, the filtering and the model implementation are discussed, leading to suggestions for future research.
APA, Harvard, Vancouver, ISO, and other styles
6

Granato, Italo Stefanine Correia. "snpReady and BGGE: R packages to prepare datasets and perform genome-enabled predictions." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/11/11137/tde-21062018-134207/.

Full text
Abstract:
The use of molecular markers allows an increase in the efficiency of selection as well as a better understanding of genetic resources in breeding programs. However, with the increase in the number of markers, it is necessary to process them before they are ready to use. Also, to explore genotype x environment (GE) interaction in the context of genomic prediction, some covariance matrices need to be set up before the prediction step. Thus, aiming to facilitate the introduction of genomic practices into breeding program pipelines, we developed two R packages. The first is called snpReady and is designed to prepare datasets for genomic studies. This package offers three functions to reach this objective, from organizing data and applying quality control, to building the genomic relationship matrix and summarizing population genetics parameters. Furthermore, we present a new imputation method for missing markers. The second is the BGGE package, which was built to generate kernels for some GE genomic models and perform predictions. It consists of two functions (getK and BGGE). The former is helpful to create kernels for the GE genomic models, and the latter performs genomic predictions with some features for GE kernels that decrease the computational time. The features covered in the two packages present a fast and straightforward option to help the introduction and usage of genomic analysis in breeding program pipelines.
APA, Harvard, Vancouver, ISO, and other styles
7

Awwad, Tarek. "Context-aware worker selection for efficient quality control in crowdsourcing." Thesis, Lyon, 2018. http://www.theses.fr/2018LYSEI099/document.

Full text
Abstract:
Crowdsourcing has proved its ability to address large-scale data collection tasks at a low cost and in a short time. However, due to the dependence on unknown workers, the quality of the crowdsourcing process is questionable and must be controlled. Indeed, maintaining the efficiency of crowdsourcing requires the time and cost overhead related to this quality control to stay low. Current quality control techniques suffer from high time and budget overheads and from their dependency on prior knowledge about individual workers.
In this thesis, we address these limitations by proposing the CAWS (Context-Aware Worker Selection) method, which operates in two phases: in an offline phase, the correlations between the worker declarative profiles and the task types are learned. Then, in an online phase, the learned profile models are used to select the most reliable online workers for the incoming tasks depending on their types. Using declarative profiles helps eliminate any probing process, which reduces the time and the budget while maintaining the crowdsourcing quality. In order to evaluate CAWS, we introduce an information-rich dataset called CrowdED (Crowdsourcing Evaluation Dataset). The generation of CrowdED relies on a constrained sampling approach that allows the production of a dataset which respects the requester's budget and type constraints. Through its generality and richness, CrowdED also helps in plugging the benchmarking gap present in the crowdsourcing community. Using CrowdED, we evaluate the performance of CAWS in terms of quality, time and budget gain. Results show that automatic grouping is able to achieve a learning quality similar to job-based grouping, and that CAWS is able to outperform the state-of-the-art profile-based worker selection when it comes to quality, especially when strong budget and time constraints exist. Finally, we propose CREX (CReate Enrich eXtend), which provides the tools to select and sample input tasks and to automatically generate custom crowdsourcing campaign sites in order to extend and enrich CrowdED.
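The online selection step, matching an incoming task's cluster prototype against the declarative profiles of connected workers, can be sketched as a cosine-similarity ranking; the vector encodings and example values below are assumptions, since the thesis learns these profiles offline:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two profile vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_workers(task_prototype, worker_profiles, k=2):
    """Rank online workers by similarity of their profile vector to the
    prototype profile of the task's cluster and keep the top k."""
    ranked = sorted(worker_profiles.items(),
                    key=lambda item: cosine(task_prototype, item[1]),
                    reverse=True)
    return [worker for worker, _ in ranked[:k]]

prototype = [0.9, 0.1, 0.7]  # e.g. (image tasks, audio tasks, domain expertise)
workers = {"w1": [0.8, 0.2, 0.6], "w2": [0.1, 0.9, 0.2], "w3": [0.7, 0.0, 0.9]}
print(select_workers(prototype, workers))  # ['w1', 'w3']
```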
APA, Harvard, Vancouver, ISO, and other styles
8

Lush, Victoria. "Visualisation of quality information for geospatial and remote sensing data : providing the GIS community with the decision support tools for geospatial dataset quality evaluation." Thesis, Aston University, 2015. http://publications.aston.ac.uk/25795/.

Full text
Abstract:
The evaluation of geospatial data quality and trustworthiness presents a major challenge to geospatial data users when making a dataset selection decision. The research presented here therefore focused on defining and developing a GEO label – a decision support mechanism to assist data users in efficient and effective geospatial dataset selection on the basis of quality, trustworthiness and fitness for use. This thesis thus presents six phases of research and development conducted to: (1) identify the informational aspects upon which users rely when assessing geospatial dataset quality and trustworthiness; (2) elicit initial user views on the GEO label role in supporting dataset comparison and selection; (3) evaluate prototype label visualisations; (4) develop a Web service to support GEO label generation; (5) develop a prototype GEO label-based dataset discovery and intercomparison decision support tool; and (6) evaluate the prototype tool in a controlled human-subject study. The results of the studies revealed, and subsequently confirmed, eight geospatial data informational aspects that were considered important by users when evaluating geospatial dataset quality and trustworthiness, namely: producer information, producer comments, lineage information, compliance with standards, quantitative quality information, user feedback, expert reviews, and citations information. Following an iterative user-centred design (UCD) approach, it was established that the GEO label should visually summarise availability and allow interrogation of these key informational aspects. A Web service was developed to support generation of dynamic GEO label representations and integrated into a number of real-world GIS applications. The service was also utilised in the development of the GEO LINC tool – a GEO label-based dataset discovery and intercomparison decision support tool. The results of the final evaluation study indicated that (a) the GEO label effectively communicates the availability of dataset quality and trustworthiness information and (b) GEO LINC successfully facilitates ‘at a glance’ dataset intercomparison and fitness for purpose-based dataset selection.
APA, Harvard, Vancouver, ISO, and other styles
9

Girgin, Serkan. "Development Of Gis-based National Hydrography Dataset, Sub-basin Boundaries, And Water Quality/quantity Data Analysis System For Turkey." Master's thesis, METU, 2003. http://etd.lib.metu.edu.tr/upload/3/1223338/index.pdf.

Full text
Abstract:
Computerized data visualization and analysis tools, especially Geographic Information Systems (GIS), constitute an important part of today's water resources development and management studies. In order to obtain satisfactory results from such tools, accurate and comprehensive hydrography datasets are needed that include both spatial and hydrologic information on surface water resources and watersheds. If present, such datasets may support many applications, such as hydrologic and environmental modeling, impact assessment, and construction planning. The primary purposes of this study are the production of prototype national hydrography and watershed datasets for Turkey, and the development of GIS-based tools for the analysis of local water quality and quantity data. For these purposes, national hydrography datasets and analysis systems of several countries are reviewed, and based on the gained experience: (1) sub-watershed boundaries of 26 major national basins are derived from the digital elevation model of the country by using raster-based analysis methods, and these watersheds are named according to the coding system of the European Union; (2) a prototype hydrography dataset with built-in connectivity and water flow direction information is produced from publicly available data sources; (3) GIS-based spatial tools are developed to facilitate navigation through streams and watersheds in the hydrography dataset; and (4) a state-of-the-art GIS-based stream flow and water quality data analysis system is developed, which is based on the structure of nationally available data and includes advanced statistical and spatial analysis capabilities. All datasets and developed tools are gathered in a single graphical user interface within GIS and made available to the end-users.
APA, Harvard, Vancouver, ISO, and other styles
10

Abbas, Nacira. "Formal Concept Analysis for Discovering Link Keys in the Web of Data." Electronic Thesis or Diss., Université de Lorraine, 2023. http://www.theses.fr/2023LORR0202.

Full text
Abstract:
The Web of data is a global data space that can be seen as an additional layer interconnected with the Web of documents. Data interlinking is the task of discovering identity links across RDF (Resource Description Framework) datasets over the Web of data. We focus on a specific approach for data interlinking, which relies on "link keys". A link key has the form of two sets of pairs of properties associated with a pair of classes. For example, the link key ({(designation,title)}, {(designation,title), (creator,author)}, (Book,Novel)) states that whenever an instance "a" of the class "Book" and an instance "b" of the class "Novel" share at least one value for the properties "creator" and "author", and "a" and "b" have the same values for the properties "designation" and "title", then "a" and "b" denote the same entity. Then (a,owl:sameAs,b) is an identity link over the two datasets. However, link keys are not always provided, and various algorithms have been developed to automatically discover these keys. First, these algorithms focus on finding "link key candidates". The quality of these candidates is then evaluated using appropriate measures, and valid link keys are selected accordingly. Formal Concept Analysis (FCA) has been closely associated with the discovery of link key candidates, leading to the proposal of an FCA-based algorithm for this purpose. Nevertheless, existing algorithms for link key discovery have certain limitations. First, they do not explicitly specify the associated pairs of classes for the discovered link key candidates, which can lead to inaccurate evaluations. Additionally, the selection strategies employed by these algorithms may also produce less accurate results. Furthermore, redundancy is observed among the sets of discovered candidates, which presents challenges for their visualization, evaluation, and analysis. To address these limitations, we propose to extend the existing algorithms in several aspects. Firstly, we introduce a method based on Pattern Structures, an FCA generalization that can handle non-binary data. This approach allows for explicitly specifying the associated pairs of classes for each link key candidate. Secondly, based on the proposed Pattern Structure, we present two methods for link key selection. The first method is guided by the associated pairs of classes of link keys, while the second method utilizes the lattice generated by the Pattern Structure. These two methods improve the selection compared to the existing strategy. Finally, to address redundancy, we introduce two methods. The first method involves a Partition Pattern Structure, which identifies and merges link key candidates that generate the same partitions. The second method is based on hierarchical clustering, which groups candidates producing similar link sets into clusters and selects a representative for each cluster. This approach effectively minimizes redundancy among the link key candidates.
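A link key candidate of the form given in this abstract can be applied to two small RDF-like datasets with a few lines of Python; the property names echo the example above, and this is only the application of one candidate, not the FCA-based discovery itself:

```python
def apply_link_key(instances_a, instances_b, eq_props, in_props):
    """Generate identity links (a, b) for a link key candidate.
    eq_props: pairs of properties whose value sets must be equal.
    in_props: pairs of properties that must share at least one value."""
    links = []
    for a, props_a in instances_a.items():
        for b, props_b in instances_b.items():
            eq_ok = all(set(props_a.get(p, [])) == set(props_b.get(q, []))
                        for p, q in eq_props)
            in_ok = all(set(props_a.get(p, [])) & set(props_b.get(q, []))
                        for p, q in in_props)
            if eq_ok and in_ok:
                links.append((a, b))
    return links

books = {"a1": {"designation": ["Germinal"], "creator": ["Zola", "É. Zola"]}}
novels = {"b1": {"title": ["Germinal"], "author": ["Zola"]},
          "b2": {"title": ["Nana"], "author": ["Zola"]}}
print(apply_link_key(books, novels,
                     eq_props=[("designation", "title")],
                     in_props=[("creator", "author")]))  # [('a1', 'b1')]
```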
APA, Harvard, Vancouver, ISO, and other styles
More sources

Books on the topic "Quality of datasets"

1

Johnson, L. Marvin. Quality assurance evaluator's handbook: Check lists and datasheets. 5th ed. L. Marvin Johnson & Associates, 1990.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
2

Baker, Nancy T. National Stream Quality Accounting Network and National Monitoring Network Basin Boundary Geospatial Dataset, 2008-13. U.S. Dept. of the Interior, U.S. Geological Survey, 2011.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
3

Kaiyō Kenkyū Kaihatsu Kikō (Japan). Sentanteki yojigen taiki kaiyō rikuiki ketsugō dēta dōka shisutemu no kaihatsu to kōseido kikō hendō yosoku ni hitsuyō na shokichika saikaiseki tōgō dētasetto no kōchiku: Heisei 17-nendo kenkyū seika hōkokusho = Research development of advanced four-dimensional data assimilation system using a climate model toward construction of high-quality reanalysis datasets for climate prediction. Monbu Kagakushō̄ Kenkyū Kaihatsukyoku, 2006.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
4

Kaiyō Kenkyū Kaihatsu Kikō (Japan), Hokkaidō Daigaku, and Japan. Monbu Kagakushō. Kenkyū Kaihatsukyoku., eds. Sentanteki yojigen taiki kaiyō rikuiki ketsugō dēta dōka shisutemu no kaihatsu to kōseido kikō hendō yosoku ni hitsuyō na shokichika saikaiseki tōgō dētasetto no kōchiku: Heisei 18-nendo kenkyū seika hōkokusho = Research development of advanced four-dimensional data assimilation system using a climate model toward construction of high-quality reanalysis datasets for climate prediction. Monbu Kagakushō̄ Kenkyū Kaihatsukyoku, 2007.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
5

Sentā, Kaiyō Kagaku Gijutsu. Sentanteki yojigen taiki kaiyō rikuiki ketsugō dēta dōka shisutemu no kaihatsu to kōseido kikō hendō yosoku ni hitsuyō na shokichika saikaiseki tōgō dētasetto no kōchiku: Heisei 14-nendo kenkyū seika hōkokusho = Research development of advanced four-dimensional data assimilation system using a climate model toward construction of high-quality reanalysis datasets for climate prediction. Monbu Kagakushō Kenkyū Kaihatsukyoku, 2003.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
6

Patrinos, Harry Anthony, and Noam Angrist. Global Dataset on Education Quality: A Review and Update (2000–2017). World Bank, Washington, DC, 2018. http://dx.doi.org/10.1596/1813-9450-8592.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Villez, Kris, Daniel Aguado, Janelcy Alferes, Queralt Plana, Maria Victoria Ruano, and Oscar Samuelsson, eds. Metadata Collection and Organization in Wastewater Treatment and Wastewater Resource Recovery Systems. IWA Publishing, 2024. http://dx.doi.org/10.2166/9781789061154.

Full text
Abstract:
In recent years, the wastewater treatment field has undergone an instrumentation revolution. Thanks to increased efficiency of communication networks and extreme reductions in data storage costs, wastewater plants have entered the era of big data. Meanwhile, artificial intelligence and machine learning tools have enabled the extraction of valuable information from large-scale datasets. Despite this potential, the successful deployment of AI and automation depends on the quality of the data produced and the ability to analyze it usefully in large quantities. Metadata, including a quantification of the data quality, is often missing, so vast amounts of collected data quickly become useless. Ultimately, data-dependent decisions supported by machine learning and AI will not be possible without data readiness skills accounting for all the Vs of big data: volume, velocity, variety, and veracity. Metadata Collection and Organization in Wastewater Treatment and Wastewater Resource Recovery Systems provides recommendations to handle these challenges, and aims to clarify metadata concepts and provide advice on their practical implementation in water resource recovery facilities. This includes guidance on the best practices to collect, organize, and assess data and metadata, based on existing standards and state-of-the-art algorithmic tools. This Scientific and Technical Report offers a great starting point for improved data management and decision making, and will be of interest to a wide audience, including sensor technicians, operational staff, data management specialists, and plant managers. ISBN: 9781789061147 (Paperback) ISBN: 9781789061154 (eBook) ISBN: 9781789061161 (ePub)
APA, Harvard, Vancouver, ISO, and other styles
8

Taberlet, Pierre, Aurélie Bonin, Lucie Zinger, and Eric Coissac. Environmental DNA. Oxford University Press, 2018. http://dx.doi.org/10.1093/oso/9780198767220.001.0001.

Full text
Abstract:
Environmental DNA (eDNA), i.e., DNA released into the environment by any living form, represents a formidable opportunity to gather high-throughput and standardized information on the distribution or feeding habits of species. It therefore has great potential for applications in ecology and biodiversity management. However, this research field is fast-moving, involves different areas of expertise, and currently lacks standard approaches, which calls for an up-to-date and comprehensive synthesis. Environmental DNA for biodiversity research and monitoring covers current methods based on eDNA, with a particular focus on “eDNA metabarcoding”. Intended for scientists and managers, it provides the background information needed to design sound experiments. It revisits all the steps necessary to produce high-quality metabarcoding data, such as sampling, metabarcode design, optimization of PCR and sequencing protocols, and analysis of large sequencing datasets. All these steps are presented by discussing the potential and current challenges of eDNA-based approaches to infer parameters on biodiversity or ecological processes. The last chapters of the book review how DNA metabarcoding has been used so far to unravel novel patterns of diversity in space and time, to detect particular species, and to answer new ecological questions in various ecosystems and for various organisms. Environmental DNA for biodiversity research and monitoring constitutes essential reading for all graduate students, researchers, and practitioners who do not have a strong background in molecular genetics and who are willing to use eDNA approaches in ecology and biomonitoring.
APA, Harvard, Vancouver, ISO, and other styles
9

Kassem, Moulay Abdelmajid. QTL Mapping with Python and R/qtl: A Reproducible Pipeline for Crop Genetics. Atlas Publishing, LLC, 2025. https://doi.org/10.5147/books.ap1.

Full text
Abstract:
QTL Mapping with Python and R/qtl: A Reproducible Pipeline for Crop Genetics presents a comprehensive and hands-on guide to identifying quantitative trait loci (QTL) in recombinant inbred line (RIL) populations using a reproducible workflow that bridges Python and R. Designed for researchers, students, and breeders, this resource walks readers through every stage of the QTL mapping process, from data preparation to advanced visualization and interpretation. The book begins with foundational concepts in QTL mapping and its importance in soybean breeding, using the well-characterized Forrest × Williams 82 RIL population as a case study. It then introduces the required data structures and demonstrates robust data cleaning and formatting techniques in Python, including optional marker filtering and quality control steps. Integration with R/qtl is explained using the rpy2 interface and native R workflows, ensuring a smooth transition between environments. Subsequent chapters cover QTL analysis methods such as single-QTL scans and composite interval mapping (CIM), the use of permutation tests for determining significance thresholds, and strategies for extracting and visualizing key results. Readers also learn to define QTL intervals, generate effect plots, explore multi-trait QTL overlap, and prepare figures suitable for publication. Special attention is given to troubleshooting common issues, scaling to large datasets, and the possibility of extending the pipeline using machine learning (ML) techniques. A full-length case study illustrates the application of the pipeline to seed traits such as protein, oil, fatty acids, and isoflavones in the FxW82 population. Comparative results are discussed in light of previous studies, including Knizia et al. (2021), and the implications for marker-assisted selection are explored. The final chapters highlight future directions, broader applications across crops, and provide a curated list of resources for extended learning. By combining practical code, real data, and clear biological context, this eBook equips its readers with the tools and understanding needed to implement robust QTL mapping pipelines in modern crop genomics.
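As a hedged illustration of the Python-to-R bridge the book describes, a single-QTL scan with a permutation threshold could be driven from Python via rpy2 roughly as follows. The file name, phenotype column, genotype codes, and parameter values are placeholders, not the book's actual pipeline.

```python
# Minimal sketch (assumptions: a CSV-format cross file "fxw82_cross.csv" and a
# phenotype column named "protein"); illustrative only, not the book's code.
import rpy2.robjects as ro

ro.r("""
library(qtl)

# Load the cross and compute genotype probabilities.
cross <- read.cross(format = "csv", file = "fxw82_cross.csv", genotypes = c("A", "B"))
cross <- calc.genoprob(cross, step = 1)

# Single-QTL genome scan (Haley-Knott regression) for the "protein" phenotype.
scan  <- scanone(cross, pheno.col = "protein", method = "hk")

# Permutation test to obtain a genome-wide significance threshold.
perms <- scanone(cross, pheno.col = "protein", method = "hk", n.perm = 1000)
print(summary(scan, perms = perms, alpha = 0.05))
""")
```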
APA, Harvard, Vancouver, ISO, and other styles
10

Chan, Ho Fai, Mohammad Wangsit Supriyadi, and Benno Torgler. Trust and Tax Morale. Edited by Eric M. Uslaner. Oxford University Press, 2017. http://dx.doi.org/10.1093/oxfordhb/9780190274801.013.23.

Full text
Abstract:
This empirical chapter examines the relation between trust and tax morale at both country and individual levels using a combined World Values Survey and European Values Study dataset covering 400,000 observations across 108 countries. The results overall indicate that although vertical trust matters, horizontal trust in the form of generalized trust is not linked to tax morale. We do, however, identify intercountry differences that warrant further exploration. We also demonstrate that generalized trust uncertainty, in contrast to vertical trust uncertainty, is negatively correlated with tax morale. Lastly, we provide some evidence that generalized trust varies under different vertical and governance conditions, but we are unable to identify any indirect path from generalized trust to tax morale using governance quality as a mediator.
APA, Harvard, Vancouver, ISO, and other styles

Book chapters on the topic "Quality of datasets"

1

Alrashed, Tarfah, Dimitris Paparas, Omar Benjelloun, Ying Sheng, and Natasha Noy. "Dataset or Not? A Study on the Veracity of Semantic Markup for Dataset Pages." In The Semantic Web – ISWC 2021. Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-88361-4_20.

Full text
Abstract:
Semantic markup, such as schema.org, allows providers on the Web to describe content using a shared controlled vocabulary. This markup is invaluable in enabling a broad range of applications, from vertical search engines, to rich snippets in search results, to actions on emails, to many others. In this paper, we focus on semantic markup for datasets, specifically in the context of developing a vertical search engine for datasets on the Web, Google’s Dataset Search. Dataset Search relies on schema.org markup to identify pages that describe datasets. While this markup was the core enabling technology for this vertical search, we also discovered that we need to address the following problem: pages from 61% of internet hosts that provide markup do not actually describe datasets. We analyze the veracity of dataset markup for Dataset Search’s Web-scale corpus and categorize pages where this markup is not reliable. We then propose a way to drastically increase the quality of the dataset metadata corpus by developing a deep neural-network classifier that identifies whether or not a page with markup is a dataset page. Our classifier achieves 96.7% recall at the 95% precision point. This level of precision enables Dataset Search to circumvent the noise in semantic markup and to use the metadata to provide high-quality results to users.
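As an aside on how an operating point such as “96.7% recall at the 95% precision point” is typically selected, the sketch below uses scikit-learn's precision-recall curve to find the score threshold that reaches a target precision. The scores and labels are synthetic stand-ins, not the paper's corpus or classifier.

```python
# Minimal sketch: pick the score threshold that reaches a target precision,
# then report the recall achieved at that threshold. Data are synthetic.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)                        # 1 = "dataset page"
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, 10_000), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
target_precision = 0.95
ok = precision[:-1] >= target_precision    # precision/recall have one extra entry
threshold = thresholds[ok][0]              # lowest threshold meeting the target
print(f"threshold={threshold:.3f}, recall at >=95% precision={recall[:-1][ok][0]:.3f}")
```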
APA, Harvard, Vancouver, ISO, and other styles
2

Li, Yuezun, Pu Sun, Honggang Qi, and Siwei Lyu. "Toward the Creation and Obstruction of DeepFakes." In Handbook of Digital Face Manipulation and Detection. Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-030-87664-7_4.

Full text
Abstract:
AI-synthesized face-swapping videos, commonly known as DeepFakes, are an emerging problem threatening the trustworthiness of online information. The need to develop and evaluate DeepFake detection algorithms calls for large-scale datasets. However, current DeepFake datasets suffer from low visual quality and do not resemble the DeepFake videos circulated on the Internet. We present a new large-scale challenging DeepFake video dataset, Celeb-DF, which contains 5,639 high-quality DeepFake videos of celebrities generated using an improved synthesis process. We conduct a comprehensive evaluation of DeepFake detection methods and datasets to demonstrate the escalated level of challenge posed by Celeb-DF. We then introduce Landmark Breaker, the first dedicated method to disrupt facial landmark extraction, and apply it to obstruct the generation of DeepFake videos. The experiments are conducted on three state-of-the-art facial landmark extractors using our Celeb-DF dataset.
APA, Harvard, Vancouver, ISO, and other styles
3

Staron, Miroslaw, Wilhelm Meding, Ola Söder, and Miroslaw Ochodek. "Improving Quality of Code Review Datasets – Token-Based Feature Extraction Method." In Software Quality: Future Perspectives on Software Engineering Quality. Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-65854-0_7.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Pons, J. P., F. Ségonne, J. D. Boissonnat, L. Rineau, M. Yvinec, and R. Keriven. "High-Quality Consistent Meshing of Multi-label Datasets." In Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2007. http://dx.doi.org/10.1007/978-3-540-73273-0_17.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Debattista, Jeremy, Santiago Londoño, Christoph Lange, and Sören Auer. "Quality Assessment of Linked Datasets Using Probabilistic Approximation." In The Semantic Web. Latest Advances and New Domains. Springer International Publishing, 2015. http://dx.doi.org/10.1007/978-3-319-18818-8_14.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Boudjerida, Fatima, Atidel Lahoulou, and Zahid Akhtar. "Analysis and Comparison of Audiovisual Quality Assessment Datasets." In Advances in Computing Systems and Applications. Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-69418-0_31.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Wentzel, Bianca, Fabian Kirstein, Torben Jastrow, Raphael Sturm, Michael Peters, and Sonja Schimmler. "An Extensive Methodology and Framework for Quality Assessment of DCAT-AP Datasets." In Lecture Notes in Computer Science. Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-41138-0_17.

Full text
Abstract:
The DCAT Application Profile for Data Portals (DCAT-AP) is a crucial cornerstone for publishing and reusing Open Data in Europe. It supports the harmonization and interoperability of Open Data by providing an expressive set of properties, guidelines, and reusable vocabularies. However, a qualitative and accurate implementation by Open Data providers remains challenging. To improve the informative value and the compliance with RDF-based specifications, we propose a methodology to measure and assess the quality of DCAT-AP datasets. Our approach is based on the FAIR and the 5-star principles for Linked Open Data. We define a set of metrics, each covering a specific quality aspect, for example whether a certain property has a compliant value, whether mandatory vocabularies are applied, or whether the actual data is available. The values for the metrics are stored in a custom data model based on the Data Quality Vocabulary and are used to calculate an overall quality score for each dataset. We implemented our approach as a scalable and reusable Open Source solution to demonstrate its feasibility. It is applied in a large-scale production environment (data.europa.eu), where it constantly checks more than 1.6 million DCAT-AP datasets and delivers quality reports.
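The kind of roll-up the chapter describes (per-metric checks aggregated into an overall score per dataset) can be sketched as follows. The metric names, weights, and example record are invented for illustration and do not reflect the authors' actual metric set or their Data Quality Vocabulary serialization.

```python
# Minimal sketch: boolean per-metric checks on a DCAT-AP-like record, rolled up
# into a weighted overall quality score. Metrics and weights are hypothetical.
CONTROLLED_FORMATS = {"CSV", "JSON", "XML", "RDF"}   # stand-in for a controlled vocabulary

METRICS = {
    "has_licence":        (0.3, lambda d: bool(d.get("licence"))),
    "format_in_vocab":    (0.3, lambda d: d.get("format") in CONTROLLED_FORMATS),
    "access_url_present": (0.4, lambda d: bool(d.get("accessURL"))),
}

def quality_score(dataset: dict) -> float:
    """Weighted share of satisfied metrics, in [0, 1]."""
    total = sum(w for w, _ in METRICS.values())
    return sum(w for w, check in METRICS.values() if check(dataset)) / total

record = {"licence": "CC-BY-4.0", "format": "CSV", "accessURL": ""}
print(round(quality_score(record), 2))   # 0.6: two of the three checks pass
```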
APA, Harvard, Vancouver, ISO, and other styles
8

Sejdiu, Gezim, Anisa Rula, Jens Lehmann, and Hajira Jabeen. "A Scalable Framework for Quality Assessment of RDF Datasets." In Lecture Notes in Computer Science. Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-30796-7_17.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Hernández, Netzahualcóyotl, Luis A. Castro, Jesús Favela, Layla Michán, and Bert Arnrich. "Data Quality in Mobile Sensing Datasets for Pervasive Healthcare." In Handbook of Large-Scale Distributed Computing in Smart Healthcare. Springer International Publishing, 2017. http://dx.doi.org/10.1007/978-3-319-58280-1_9.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Etcheverry, Lorena, Shahan Khatchadourian, and Mariano Consens. "Quality Assessment of MAGE-ML Genomic Datasets Using DescribeX." In Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2010. http://dx.doi.org/10.1007/978-3-642-15120-0_15.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "Quality of datasets"

1

Ivanova, Rositsa V., Thomas Huber, and Christina Niklaus. "Let’s discuss! Quality Dimensions and Annotated Datasets for Computational Argument Quality Assessment." In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2024. http://dx.doi.org/10.18653/v1/2024.emnlp-main.1155.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Angelis, G. F., A. Emvoliadis, T. I. Theodorou, A. Zamichos, A. Drosou, and D. Tzovaras. "Regional Datasets for Air Quality Monitoring in European Cities." In IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2024. http://dx.doi.org/10.1109/igarss53475.2024.10640879.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Han, Seung-Ho, Jeongyun Han, Dongkun Lee, and Ho-Jin Choi. "Quality Assurance Framework for Multimodal Assessment Datasets on AI Risk Factors." In 2025 IEEE International Conference on Big Data and Smart Computing (BigComp). IEEE, 2025. https://doi.org/10.1109/bigcomp64353.2025.00082.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Chandaliya, Praveen Kumar, Kiran Raja, Haoyu Zhang, Raghavendra Ramachandra, and Christoph Busch. "Synthetic Ethnicity Alteration for Diversifying Face Datasets - Investigating Recognition and Quality." In 2024 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2024. https://doi.org/10.1109/wifs61860.2024.10810684.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Li, Jian, Bowen Xu, and Sören Schwertfeger. "High-Quality, ROS Compatible Video Encoding and Decoding for High-Definition Datasets." In 2024 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 2024. https://doi.org/10.1109/robio64047.2024.10907468.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Mirza, Samiha, Apurva Gala, Pandu Devarakota, Pranav Mantini, and Shishir Shah. "Integrating Image Quality Assessment Metrics for Enhanced Segmentation Performance in Reconstructed Imaging Datasets." In 20th International Conference on Computer Vision Theory and Applications. SCITEPRESS - Science and Technology Publications, 2025. https://doi.org/10.5220/0013166400003912.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Kim, Taehyoun, Duksan Ryu, and Jongmoon Baik. "Enhancing Software Reliability Growth Modeling: A Comprehensive Analysis of Historical Datasets and Optimal Model Selections." In 2024 IEEE 24th International Conference on Software Quality, Reliability and Security (QRS). IEEE, 2024. http://dx.doi.org/10.1109/qrs62785.2024.00024.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Setiyanti, Michelle, Genrawan Hoendarto, and Jimmy Tjen. "Enhancing Water Potability Identification through Random Forest Regression and Genetic Algorithm Optimization." In INTERNATIONAL CONFERENCE ON APPLIED TECHNOLOGY 2024. Trans Tech Publications Ltd, 2025. https://doi.org/10.4028/p-2fikqf.

Full text
Abstract:
Water quality is important for both environmental sustainability and public health. This research introduces an innovative method for forecasting water quality using Random Forest Regression, optimized through Genetic Algorithm (GA) techniques. The goal is to enhance prediction accuracy and offer meaningful insights for better water resource management. The study employed the “Water Quality Data” dataset, encompassing 11 essential water quality parameters from different locations. After thorough data preprocessing, the Random Forest model, refined with GA optimization, achieved a Mean Squared Error (MSE) of 0.3476 and an accuracy rate of 91.77%, surpassing conventional methods. This approach highlights the effectiveness of merging machine learning algorithms with evolutionary optimization techniques to achieve superior predictive outcomes. Although the dataset was of moderate size, the results show considerable improvements in model accuracy. This work advances the field of water quality prediction by leveraging sophisticated algorithms and emphasizes the significance of hyperparameter tuning. Future research should focus on using larger datasets and examining the specific regions from which the data is collected.
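A hedged sketch of the general approach described here (Random Forest Regression tuned by a small genetic algorithm) is shown below. The file name, target column, hyperparameter ranges, and GA settings are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): Random Forest Regression with a tiny
# genetic-algorithm loop for hyperparameter tuning. Dataset name and column
# ("water_quality.csv", "potability") are hypothetical placeholders.
import random
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("water_quality.csv").dropna()
X, y = df.drop(columns=["potability"]), df["potability"]

SPACE = {"n_estimators": (50, 500), "max_depth": (2, 30), "min_samples_leaf": (1, 10)}

def random_individual():
    return {k: random.randint(lo, hi) for k, (lo, hi) in SPACE.items()}

def fitness(params):
    # 5-fold cross-validated MSE (lower is better).
    model = RandomForestRegressor(random_state=0, **params)
    return -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()

def crossover(a, b):
    return {k: random.choice([a[k], b[k]]) for k in SPACE}

def mutate(ind):
    k = random.choice(list(SPACE))
    lo, hi = SPACE[k]
    child = dict(ind)
    child[k] = random.randint(lo, hi)
    return child

population = [random_individual() for _ in range(10)]
for generation in range(15):
    scored = sorted(population, key=fitness)
    parents = scored[:4]                                               # keep the fittest
    children = [mutate(crossover(*random.sample(parents, 2))) for _ in range(6)]
    population = parents + children

best = min(population, key=fitness)
print("best hyperparameters:", best, "MSE:", fitness(best))
```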
APA, Harvard, Vancouver, ISO, and other styles
9

Croft, Roland, M. Ali Babar, and M. Mehdi Kholoosi. "Data Quality for Software Vulnerability Datasets." In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023. http://dx.doi.org/10.1109/icse48619.2023.00022.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Diaz, Catalina, Saul Calderon-Ramirez, and Luis Diego Mora Aguilar. "Data Quality Metrics for Unlabelled Datasets." In 2022 IEEE 4th International Conference on BioInspired Processing (BIP). IEEE, 2022. http://dx.doi.org/10.1109/bip56202.2022.10032475.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Reports on the topic "Quality of datasets"

1

Elko, Nicole, Katherine Brutsché, Quin Robertson, Michael Hartman, and Zhifei Dong. USACE Navigation Sediment Placement : An RSM Program Database (1998 – 2019). Engineer Research and Development Center (U.S.), 2022. http://dx.doi.org/10.21079/11681/44703.

Full text
Abstract:
This US Army Corps of Engineers, Regional Sediment Management, technical note describes a geodatabase of federal coastal and inland navigation projects developed to determine the extent to which RSM goals have been implemented across the USACE at the project and district levels. The effort 1) quantified the volume of sediment dredged from federal navigation channels by both contract and USACE-owned dredges and 2) identified the placement type and whether sediment was placed beneficially. The majority of the dredging data used to populate the geodatabase were based on the USACE Dredging Information System (DIS) database, but when available, the geodatabase was expanded to include more detailed USACE district-specific data that were not included in the DIS database. Two datasets were developed in this study: the National Dataset and the District-Specific and Quality-Checked Dataset. The National Dataset is based on statistics extracted from the combined DIS Contract and Government Plant data; it is largely unedited and combines the two available USACE datasets. Due to varying degrees of data completeness in these two datasets, this study undertook a data refinement process to improve the information. This was done through interviews with the districts, a literature search, and the inclusion of additional district-specific data provided by individual districts, which often represent more detailed information on dredging activities. The District-Specific and Quality-Checked Dataset is a customized database generated by this study. An interactive web-based tool was developed that accesses both datasets and displays them on a national map that can be viewed at the district or project scale.
APA, Harvard, Vancouver, ISO, and other styles
2

Glandon, S. Ross, Casey L. Lorenzen, William F. Farthing, et al. Analysis tools and techniques for evaluating quality in synthetic data generated by the Virtual Autonomous Navigation Environment. US Army Engineer Research and Development Center, 2025. https://doi.org/10.21079/11681/49708.

Full text
Abstract:
The capability to produce high-quality labeled synthetic image data is an important tool for building and maintaining machine learning datasets. However, ensuring computer-generated data is of high quality is very challenging. This report describes an effort to evaluate and improve synthetic image data generated by the Virtual Autonomous Navigation Environment’s Environment and Sensor Engine (VANE::ESE), and documents a set of tools developed to process, analyze, and train models from image datasets generated by VANE::ESE. Additionally, the results of several experiments are presented, including an investigation into using explainable AI techniques and direct comparisons of various models trained on multiple synthetic datasets.
APA, Harvard, Vancouver, ISO, and other styles
3

Tait, Emma, David Gudex-Cross, and James Duncan. Standardized Spatial Datasets for Exploring the Connection Between Forest Cover and Water Quality in the Northeast. Forest Ecosystem Monitoring Cooperative, 2020. http://dx.doi.org/10.18125/5xqty5.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Andersen, Kamilla Heimar, Anna Marszal-Pomianowska, Henrik N. Knudsen, et al. Room-based Indoor Environment Measurements and Occupancy Ground Truth Datasets from Five Residential Apartments in a Nordic Climate. Department of the Built Environment, 2023. http://dx.doi.org/10.54337/aau550646548.

Full text
Abstract:
This document describes the developed and curated datasets: 1) indoor environmental quality (IEQ) measurements with occupancy ground truth (one week of data) and 2) long-term IEQ monitoring in rooms of apartments in a low-energy multi-story residential building (8 months, February 2023 to August 2023).
APA, Harvard, Vancouver, ISO, and other styles
5

Juvik, John A., Avri Bar Zur, and Torbert R. Rocheford. Breeding for Quality in Vegetable Maize Using Linked Molecular Markers. United States Department of Agriculture, 1993. http://dx.doi.org/10.32747/1993.7568764.bard.

Full text
Abstract:
Recently, the vegetable corn industry has shifted from traditional cultivars carrying the sugary1 (su1) endosperm mutation to newer hybrids homozygous for the shrunken2 (sh2) or sugary enhancer1 (se1) genes. With greater kernel sucrose content, these hybrids are preferred by consumers and retain sugar for longer post-harvest periods, giving the industry more time to market products with superior quality. Commercialization has been hindered, however, by reduced field emergence and the establishment of stands with heterogeneous uniformity and maturities. This investigation was conducted to identify key biochemical and physiological characteristics in sh2 and se1 maize kernels associated with improved emergence and stand establishment, and, in immature ears at fresh-harvest maturity, properties associated with eating quality. The locations of genes or QTL controlling these kernel characteristics and other traits were then mapped to specific chromosomal regions by their linkage to molecular markers, using two segregating F2:3 populations. This database was used to compare the efficiency of marker-assisted selection of key alleles with phenotypic selection for trait improvement. A model designed to uncover and quantify digenic interactions was applied to the datasets to evaluate the role of epistasis in the inheritance of quantitative traits.
APA, Harvard, Vancouver, ISO, and other styles
6

Hart, Carl R., D. Keith Wilson, Chris L. Pettit, and Edward T. Nykaza. Machine-Learning of Long-Range Sound Propagation Through Simulated Atmospheric Turbulence. U.S. Army Engineer Research and Development Center, 2021. http://dx.doi.org/10.21079/11681/41182.

Full text
Abstract:
Conventional numerical methods can capture the inherent variability of long-range outdoor sound propagation. However, their computational memory and time requirements are high. In contrast, machine-learning models provide very fast predictions by learning from experimental observations or surrogate data. Yet it is unknown what type of surrogate data is most suitable for machine learning. This study used a Crank-Nicholson parabolic equation (CNPE) for generating the surrogate data. The CNPE input data were sampled with the Latin hypercube technique. Two separate datasets comprised 5000 samples of model input. The first dataset consisted of transmission loss (TL) fields for single realizations of turbulence. The second dataset consisted of average TL fields for 64 realizations of turbulence. Three machine-learning algorithms were applied to each dataset, namely ensemble decision trees, neural networks, and cluster-weighted models. Observational data come from a long-range (out to 8 km) sound propagation experiment. In comparison to the experimental observations, the regression predictions have a median absolute error of 5–7 dB. Surrogate data quality depends on an accurate characterization of refractive and scattering conditions. Predictions obtained through a single realization of turbulence agree better with the experimental observations.
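As a brief illustration of the sampling step mentioned in this abstract, a Latin hypercube design over a few propagation parameters could be drawn as follows. The parameter names and bounds are placeholders, not the study's actual CNPE inputs.

```python
# Minimal sketch: draw 5000 Latin hypercube samples over hypothetical CNPE-style
# input parameters (names and bounds are illustrative placeholders).
from scipy.stats import qmc

parameters = ["wind_speed_mps", "temperature_gradient_K_per_m", "turbulence_strength"]
lower = [0.0, -0.10, 0.0]
upper = [15.0, 0.10, 1.0]

sampler = qmc.LatinHypercube(d=len(parameters), seed=42)
unit_samples = sampler.random(n=5000)            # samples in the unit hypercube
samples = qmc.scale(unit_samples, lower, upper)  # rescale to the physical bounds

print(samples.shape)                             # (5000, 3), one row per surrogate run
```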
APA, Harvard, Vancouver, ISO, and other styles
7

Arnold, Zachary, Joanne Boisson, Lorenzo Bongiovanni, Daniel Chou, Carrie Peelman, and Ilya Rahkovsky. Using Machine Learning to Fill Gaps in Chinese AI Market Data. Center for Security and Emerging Technology, 2021. http://dx.doi.org/10.51593/20200064.

Full text
Abstract:
In this proof-of-concept project, CSET and Amplyfi Ltd. used machine learning models and Chinese-language web data to identify Chinese companies active in artificial intelligence. Most of these companies were not labeled or described as AI-related in two high-quality commercial datasets. The authors' findings show that using structured data alone—even from the best providers—will yield an incomplete picture of the Chinese AI landscape.
APA, Harvard, Vancouver, ISO, and other styles
8

Kaiser, Kendra E. Pacific Northwest Streamflow Data Landscape. Boise State University, Albertsons Library, 2023. http://dx.doi.org/10.18122/geo_facpubs.805.boisestate.

Full text
Abstract:
This project was funded by the U.S. Geological Survey Northwest Climate Adaptation Science Center to catalog the location, temporal extent, and purpose of non-USGS streamflow datasets. As part of this project, roundtable meetings convened local, state, and federal agencies and nonprofits to explore the complexity of gathering and integrating the identified datasets and to identify issues surrounding data sharing across organizations. This report synthesizes the state roundtable discussions convened in the spring of 2022 and highlights common challenges and needs across the region. Additional information from organizations that were not able to be present at the meetings was added after one-on-one discussions with organization members. Information gathered through these discussions highlights the importance of streamflow data, the multitude of data purposes, the need for additional data, and support for data management and quality assurance.
APA, Harvard, Vancouver, ISO, and other styles
9

Musser, Micah, Rebecca Gelles, Catherine Aiken, and Andrew Lohn. “The Main Resource is the Human”. Center for Security and Emerging Technology, 2023. http://dx.doi.org/10.51593/20210071.

Full text
Abstract:
Progress in artificial intelligence (AI) depends on talented researchers, well-designed algorithms, quality datasets, and powerful hardware. The relative importance of these factors is often debated, with many recent “notable” models requiring massive expenditures of advanced hardware. But how important is computational power for AI progress in general? This data brief explores the results of a survey of more than 400 AI researchers to evaluate the importance and distribution of computational needs.
APA, Harvard, Vancouver, ISO, and other styles
10

Ron, Alexa. Pilot Terrestrial Vegetation Monitoring in the Southeastern United States, 2009-2010 - Data Release Report. National Park Service, 2025. https://doi.org/10.36967/2303058.

Full text
Abstract:
Data Release Reports (DRR) are created by the National Park Service and provide detailed descriptions of valuable research datasets in a human-readable format, including the methods used to collect the data and technical analyses supporting the quality of the measurements. DRRs focus on helping others reuse data rather than presenting results, testing hypotheses, or presenting new interpretations and in-depth analyses. Pilot terrestrial vegetation monitoring occurred in eleven Southeast Coast Network (SECN) parks in 2009 and 2010 and evaluated trends in plant cover, frequency, diversity and distribution. After QA/QC, the data were processed by the Inventory and Monitoring Division (IMD) to comply with Executive Order 13642 (Making Open and Machine Readable the New Default for Government Information). One data cleaning script was produced, resulting in six datasets: canopy, events, frequency, locations, shrubs and treedbh. The script was used to format the data for clarity and consistency, and to generate cleaned CSV files for the associated data package. This DRR describes the data package for the pilot terrestrial vegetation monitoring in the Southeast Coast Network in 2009 and 2010, including how and where to access the data, collection methods, processing steps, data quality evaluation, and usage notes. The Data Package this DRR refers to is: Corbett SL, Byrne MW, Ron A. 2024. Pilot Terrestrial Vegetation Monitoring in the Southeastern United States, 2009-2010 - Data Package. National Park Service. Fort Collins CO https://doi.org/10.57830/2303037
APA, Harvard, Vancouver, ISO, and other styles
