Journal articles on the topic 'Dataset Curation'

Consult the top 50 journal articles for your research on the topic 'Dataset Curation.'

You can also download the full text of each publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles across a wide variety of disciplines and organise your bibliography correctly.

1

Koshoffer, Amy, Amy E. Neeser, Linda Newman, and Lisa R. Johnston. "Giving datasets context: a comparison study of institutional repositories that apply varying degrees of curation." International Journal of Digital Curation 13, no. 1 (2018): 15–34. http://dx.doi.org/10.2218/ijdc.v13i1.632.

Abstract:
This research study compared four academic libraries’ approaches to curating the metadata of dataset submissions in their institutional repositories and classified them in one of four categories: no curation, pre-ingest curation, selective curation, and post-ingest curation. The goal is to understand the impact that curation may have on the quality of user-submitted metadata. The findings were that 1) the metadata elements varied greatly between institutions, 2) repositories with more options for authors to contribute metadata did not result in more metadata being contributed, 3) a pre- or post-ingest curation process could have a measurable impact on the metadata, though this is difficult to separate from other factors, and 4) datasets submitted to a repository with pre- or post-ingest curation more often included documentation.
2

Xu, Jinda, Yuhao Song, Daming Wang, et al. "Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 20 (2025): 21761–69. https://doi.org/10.1609/aaai.v39i20.35481.

Abstract:
In an era overwhelmed by vast amounts of data, the effective curation of web-crawl datasets is essential for optimizing model performance. This paper tackles the challenges associated with the unstructured and heterogeneous nature of such datasets. Traditional heuristic curation methods often inadequately capture complex features, resulting in biases and the exclusion of relevant data. We introduce an advanced, learning-driven approach, Ensemble Curation Of DAta ThroUgh Multimodal Operators, called EcoDatum, which employs a novel quality-guided deduplication method to balance feature distribution. EcoDatum strategically integrates various unimodal and multimodal data curation operators within a weak supervision ensemble framework, utilizing automated optimization to effectively score each data point. EcoDatum, which significantly improves the data curation quality and efficiency, outperforms existing state-of-the-art (SOTA) techniques, ranking 1st on the DataComp leaderboard with an average performance score of 0.182 across 38 diverse evaluation datasets. This represents a 28% improvement over the DataComp baseline method, demonstrating its effectiveness in improving dataset curation and model training efficiency.
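
The operator-ensemble idea above can be illustrated with a small sketch. This is not the authors' EcoDatum code: the operators, weights, scores, and the 60% keep-rate below are all invented for illustration, and the paper's weak-supervision framework learns its weighting rather than fixing it by hand.

```python
import numpy as np

def ensemble_scores(operator_scores: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Combine per-operator quality scores (n_samples x n_operators)
    into one score per data point via a weighted average."""
    # Normalize each operator's scores to [0, 1] so no single operator
    # dominates purely because of its scale.
    mins = operator_scores.min(axis=0, keepdims=True)
    maxs = operator_scores.max(axis=0, keepdims=True)
    normed = (operator_scores - mins) / np.maximum(maxs - mins, 1e-12)
    return normed @ (weights / weights.sum())

# Toy example: 5 samples scored by 3 hypothetical curation operators
# (e.g., image quality, caption fluency, image-text alignment).
scores = np.array([
    [0.9, 0.8, 0.7],
    [0.2, 0.1, 0.3],
    [0.6, 0.9, 0.8],
    [0.4, 0.3, 0.2],
    [0.8, 0.7, 0.9],
])
weights = np.array([1.0, 0.5, 2.0])   # assumed operator weights
combined = ensemble_scores(scores, weights)
keep = np.argsort(combined)[::-1][: int(0.6 * len(scores))]  # keep top 60%
print(sorted(keep.tolist()))
```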
3

Gordon, Ben, Jake Barrett, Clara Fennessy, et al. "Development of a data utility framework to support effective health data curation." BMJ Health & Care Informatics 28, no. 1 (2021): e100303. http://dx.doi.org/10.1136/bmjhci-2020-100303.

Abstract:
Objectives: The value of healthcare data is being increasingly recognised, including the need to improve health dataset utility. There is no established mechanism for evaluating healthcare dataset utility, making it difficult to evaluate the effectiveness of activities improving the data. We describe the method for generating and involving the user community in developing a proposed framework for evaluation and communication of healthcare dataset utility for given research areas. Methods: An initial version of a matrix to review datasets across a range of dimensions was developed based on previous published findings regarding healthcare data. This was used to initiate a design process through interviews and surveys with data users representing a broad range of user types and use cases, to help develop a focused framework for characterising datasets. Results: Following 21 interviews, 31 survey responses and testing on 43 datasets, five major categories and 13 subcategories were identified as useful for a dataset, including Data Model, Completeness and Linkage. Each sub-category was graded to facilitate rapid and reproducible evaluation of dataset utility for specific use-cases. Testing of applicability to >40 existing datasets demonstrated potential usefulness for subsequent evaluation in real-world practice. Discussion: The research has developed an evidence-based initial approach for a framework to understand the utility of a healthcare dataset. It is likely to require further refinement following wider application, and additional categories may be required. Conclusion: The process has resulted in a user-centred designed framework for objectively evaluating the likely utility of specific healthcare datasets, and therefore should be of value both for potential users of health data and for data custodians to identify the areas that provide the optimal value for data curation investment.
4

Hagar, Nick, and Jack Bandy. "Practical Datasets for Analyzing LLM Corpora Derived from Common Crawl." Proceedings of the International AAAI Conference on Web and Social Media 19 (June 7, 2025): 2454–64. https://doi.org/10.1609/icwsm.v19i1.35948.

Abstract:
Large language models (LLMs) rely heavily on web-derived training datasets, yet understanding how filtering and curation decisions affect these datasets remains challenging. This paper presents two complementary datasets designed to enable systematic analysis of LLM training data composition. The first dataset captures domain-level statistics across 96 Common Crawl snapshots, providing baseline data about web content distribution before filtering. The second dataset contains standardized URL information from three major LLM training corpora (C4, Falcon RefinedWeb, and CulturaX), allowing researchers to analyze how different filtering approaches affect content inclusion. By making these datasets publicly available in a consistent format, we aim to (1) facilitate research into training data composition, (2) enable systematic auditing of filtering effects, and (3) support more transparent approaches to dataset development. Our datasets can help researchers investigate questions related to content diversity, source representation, and the impact of different filtering decisions on training data composition. Overall, this work provides a foundation for understanding how curation choices shape the content that ultimately trains widely-deployed language models.
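
As a toy illustration of the kind of audit these datasets enable, the sketch below tallies host-level counts for a raw URL list and a filtered one, to see which domains survive curation. The URLs are invented, and the released datasets have their own schemas.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_counts(urls):
    """Tally host names for a list of URLs."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host:
            counts[host] += 1
    return counts

# Which hosts survive filtering from a raw snapshot into a curated corpus?
# (All URLs here are made up for illustration.)
raw = ["https://en.wikipedia.org/wiki/Data", "http://spam.example.com/x",
       "https://en.wikipedia.org/wiki/Curation", "https://blog.example.org/post"]
curated = ["https://en.wikipedia.org/wiki/Data",
           "https://en.wikipedia.org/wiki/Curation"]
raw_c, cur_c = domain_counts(raw), domain_counts(curated)
for host in raw_c:
    print(f"{host}: {cur_c.get(host, 0)}/{raw_c[host]} pages retained")
```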
5

Khaleel, Mohammad A., Amer Hayat Khan, S. M. Sheikh Ghadzi, and Sami Alshakhshir. "Curation of an international drug proprietary names dataset." Data in Brief 40 (February 2022): 107701. http://dx.doi.org/10.1016/j.dib.2021.107701.

6

Celino, Irene. "Geospatial dataset curation through a location-based game." Semantic Web 6, no. 2 (2015): 121–30. http://dx.doi.org/10.3233/sw-130129.

7

Sicilia, Miguel-Angel, Elena García-Barriocanal, and Salvador Sánchez-Alonso. "Community Curation in Open Dataset Repositories: Insights from Zenodo." Procedia Computer Science 106 (2017): 54–60. http://dx.doi.org/10.1016/j.procs.2017.03.009.

8

Kleebauer, Maximilian, Stefan Karamanski, Doron Callies, and Martin Braun. "A Wind Turbines Dataset for South Africa: OpenStreetMap Data, Deep Learning Based Geo-Coordinate Correction and Capacity Analysis." ISPRS International Journal of Geo-Information 14, no. 6 (2025): 232. https://doi.org/10.3390/ijgi14060232.

Abstract:
Accurate and detailed spatial data on wind energy infrastructure is essential for renewable energy planning, grid integration, and system analysis. However, publicly available datasets often suffer from limited spatial accuracy, missing attributes, and inconsistent metadata. To address these challenges, this study presents a harmonized and spatially refined dataset of wind turbines in South Africa, combining OpenStreetMap (OSM) data with high-resolution satellite imagery, deep learning-based coordinate correction, and manual curation. The dataset includes 1487 turbines across 42 wind farms, representing over 3.9 GW of installed capacity as of 2025. Of this, more than 3.6 GW is currently operational. The geo-coordinates were validated and corrected using a RetinaNet-based object detection model applied to both Google and Bing satellite imagery. Instead of relying solely on spatial precision, the curation process emphasized attribute completeness and consistency. Through systematic verification and cross-referencing with multiple public sources, the final dataset achieves a high level of attribute completeness and internal consistency across all turbines, including turbine type, rated capacity, and commissioning year. The resulting dataset is the most accurate and comprehensive publicly available dataset on wind turbines in South Africa to date. It provides a robust foundation for spatial analysis, energy modeling, and policy assessment related to wind energy development. The dataset is publicly available.
9

Barbosa, Susana, Nuno Dias, Carlos Almeida, et al. "The SAIL dataset of marine atmospheric electric field observations over the Atlantic Ocean." Earth System Science Data 17, no. 4 (2025): 1393–405. https://doi.org/10.5194/essd-17-1393-2025.

Abstract:
A unique dataset of marine atmospheric electric field observations over the Atlantic Ocean is described. The data are relevant not only for atmospheric electricity studies, but more generally for studies of the Earth's atmosphere and climate variability, as well as space–Earth interaction studies. In addition to the atmospheric electric field data, the dataset includes simultaneous measurements of other atmospheric variables, including gamma radiation, visibility, and solar radiation. These ancillary observations not only support interpretation and understanding of the atmospheric electric field data, but also are of interest in themselves. The entire framework from data collection to final derived datasets has been duly documented to ensure traceability and reproducibility of the whole data curation chain. All the data, from raw measurements to final datasets, are preserved in data repositories with a corresponding assigned DOI. Final datasets are available from the Figshare repository (https://figshare.com/projects/SAIL_Data/178500, SAIL Data, 2025), and computational notebooks containing the code used at every step of the data curation chain are available from the Zenodo repository (https://zenodo.org/communities/sail, Project SAIL community, 2025).
10

Landry, Latrice, Mary Lucas, Anietie Andy, and Ebelechukwu Nwafor. "Artificial Intelligence Assisted Curation of Population Groups in Biomedical Literature." International Journal of Digital Curation 18, no. 1 (2024): 9. http://dx.doi.org/10.2218/ijdc.v18i1.950.

Abstract:
Curation of the growing body of published biomedical research is of great importance to both the synthesis of contemporary science and the archiving of historical biomedical literature. Each of these tasks has become increasingly challenging given the expansion of journal titles, preprint repositories and electronic databases. Added to this challenge is the need for curation of biomedical literature across population groups to better capture study populations for improved understanding of the generalizability of findings. To address this, our study aims to explore the use of generative artificial intelligence (AI) in the form of large language models (LLMs) such as GPT-4 as an AI curation assistant for the task of curating biomedical literature for population groups. We conducted a series of experiments which qualitatively and quantitatively evaluate the performance of OpenAI’s GPT-4 in curating population information from biomedical literature. Using OpenAI’s GPT-4 and curation instructions, executed through prompts, we evaluate the ability of GPT-4 to classify study ‘populations’, ‘continents’ and ‘countries’ from a previously curated dataset of public health COVID-19 studies. Using three different experimental approaches, we examined performance by: A) evaluation of accuracy (concordance with human curation) using both exact and approximate string matches within a single experimental approach; B) evaluation of accuracy across experimental approaches; and C) conducting a qualitative phenomenology analysis to describe and classify the nature of difference between human curation and GPT curation. Our study shows that GPT-4 has the potential to provide assistance in the curation of population groups in biomedical literature. Additionally, phenomenology provided key information for prompt design that further improved the LLM’s performance in these tasks. Future research should aim to improve prompt design, as well as explore other generative AI models to improve curation performance. An increased understanding of the populations included in research studies is critical for the interpretation of findings, and we believe this study provides keen insight on the potential to increase the scalability of population curation in biomedical studies.
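
A minimal sketch of the concordance measurement described above, assuming simple string labels: exact matching plus an approximate match based on a similarity ratio. The labels, the 0.85 threshold, and the use of Python's difflib are illustrative assumptions, not the authors' protocol.

```python
from difflib import SequenceMatcher

def concordance(human: list[str], model: list[str], threshold: float = 0.85):
    """Compare model-curated labels against human curation using
    exact matches and an approximate (similarity-ratio) match."""
    exact = sum(h.strip().lower() == m.strip().lower()
                for h, m in zip(human, model))
    approx = sum(SequenceMatcher(None, h.lower(), m.lower()).ratio() >= threshold
                 for h, m in zip(human, model))
    n = len(human)
    return exact / n, approx / n

# Invented example labels for a 'countries' curation task.
human_labels = ["United States", "pregnant women", "South Korea"]
model_labels = ["USA", "Pregnant Women", "South Korea"]
print(concordance(human_labels, model_labels))
```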
11

Madden, Frances, Jan Ashton, and Jez Cope. "Building the Picture Behind a Dataset." International Journal of Digital Curation 15, no. 1 (2020): 9. http://dx.doi.org/10.2218/ijdc.v15i1.702.

Abstract:
As part of the European Commission-funded FREYA project, the British Library wanted to explore the possibility of developing provenance information in datasets derived from the British Library’s collections, the data.bl.uk collection. Provenance information is defined in this context as ‘information relating to the origin, source and curation of the datasets’. Provenance information is also identified within the FAIR principles as an important aspect of being able to reuse and understand research datasets. According to the FAIR principles, the aim is to understand how to cite and acknowledge the dataset as well as understanding how the dataset was created and has been processed. There is also reference to the importance of this metadata being machine readable. By enhancing the metadata of these datasets with additional persistent identifiers and metadata, a fuller picture of the datasets and their content could be understood. This also adds to the veracity and understanding of the dataset by end users of data.bl.uk.
12

Pathmakumar, Thejus, Mohan Rajesh Elara, Shreenhithy V. Soundararajan, and Balakrishnan Ramalingam. "Toward a Comprehensive Domestic Dirt Dataset Curation for Cleaning Auditing Applications." Sensors 22, no. 14 (2022): 5201. http://dx.doi.org/10.3390/s22145201.

Abstract:
Cleaning is a task practiced in every domain and of prime importance. The significance of cleaning has led to several newfangled technologies in the domestic and professional cleaning domain. However, strategies for auditing the cleanliness delivered by the various cleaning methods remain manual and often ignored. This work presents a novel domestic dirt image dataset for cleaning auditing applications, including AI-based dirt analysis and robot-assisted cleaning inspection. One of the significant challenges in AI-based, robot-aided cleaning auditing is the absence of a comprehensive dataset for dirt analysis. We bridge this gap by identifying nine classes of commonly occurring domestic dirt and curating a labeled dataset consisting of 3000 microscope dirt images from a semi-indoor environment. The dirt dataset, gathered using the adhesive dirt lifting method, can enhance current dirt sensing and dirt composition estimation for cleaning auditing. The dataset’s quality is analyzed by AI-based dirt analysis and a robot-aided cleaning auditing task using six standard classification models. The models trained with the dirt dataset were capable of yielding a classification accuracy above 90% in the offline dirt analysis experiment and 82% in real-time test results.
13

Alqasab, Mariam, Suzanne M. Embury, and Sandra De F. Mendes Sampaio. "Amplifying Data Curation Efforts to Improve the Quality of Life Science Data." International Journal of Digital Curation 12, no. 1 (2017): 1–12. http://dx.doi.org/10.2218/ijdc.v12i1.495.

Abstract:
In the era of data science, datasets are shared widely and used for many purposes unforeseen by the original creators of the data. In this context, defects in datasets can have far reaching consequences, spreading from dataset to dataset, and affecting the consumers of data in ways that are hard to predict or quantify. Some form of waste is often the result. For example, scientists using defective data to propose hypotheses for experimentation may waste their limited wet lab resources chasing the wrong experimental targets. Scarce drug trial resources may be used to test drugs that actually have little chance of giving a cure. Because of the potential real world costs, database owners care about providing high quality data. Automated curation tools can be used to an extent to discover and correct some forms of defect. However, in some areas human curation, performed by highly-trained domain experts, is needed to ensure that the data represents our current interpretation of reality accurately. Human curators are expensive, and there is far more curation work to be done than there are curators available to perform it. Tools and techniques are needed to enable the full value to be obtained from the curation effort currently available. In this paper, we explore one possible approach to maximising the value obtained from human curators, by automatically extracting information about data defects and corrections from the work that the curators do. This information is packaged in a source independent form, to allow it to be used by the owners of other databases (for which human curation effort is not available or is insufficient). This amplifies the efforts of the human curators, allowing their work to be applied to other sources, without requiring any additional effort or change in their processes or tool sets. We show that this approach can discover significant numbers of defects, which can also be found in other sources.
14

Alabduljabbar, Abdulrahman, Sajid Ullah Khan, Anas Alsuhaibani, Fahdah Almarshad, and Youssef N. Altherwy. "Medical imaging datasets, preparation, and availability for artificial intelligence in medical imaging." Journal of Alzheimer's Disease Reports 8, no. 1 (2024): 1471–83. https://doi.org/10.3233/adr-240129.

Abstract:
Background: Artificial intelligence (AI) persists as a focal subject within the realm of medical imaging, heralding a multitude of prospective applications that span the comprehensive imaging lifecycle. However, a key hurdle to the development and real-world application of AI algorithms is the necessity for large amounts of well-organized and carefully planned training data, including professional annotations (labelling). Modern supervised AI techniques require thorough data curation to efficiently train, validate, and test models. Objective: The proper processing of medical images for use by AI-driven solutions is a critical component in the development of dependable and resilient AI algorithms. Currently, research organizations and corporate entities frequently confront data access limits, working with small amounts of data from restricted geographic locations. Methods: This study provides an in-depth examination of the publicly accessible datasets in the field of medical imaging. It also determines the methods required for preparing medical imaging data for the development of AI algorithms and emphasizes current limitations in dataset curation. Furthermore, it explores inventive strategies to address the challenge of data availability, offering a detailed overview of data curation technologies. Results: This study provides a comprehensive evaluation of medical imaging datasets, emphasizing their vital significance in improving diagnostic accuracy and AI models, while also addressing key problems such as dataset diversity, labelling, and ethical implications. Conclusions: The paper concludes with an insightful discussion and analysis of challenges in medical image analysis, along with potential future directions in the field.
15

Hummel, Riët, Joost den Boer, Geert van der Heijden, Wil van der Sanden, and Josef Bruers. "Longitudinal patterns of provided oral healthcare services to Dutch young patients: An observational study." PLOS ONE 19, no. 2 (2024): e0299470. http://dx.doi.org/10.1371/journal.pone.0299470.

Abstract:
General dental practitioners (GDPs) differ in the preventive and curative care they provide to their young patients. This may be related to variation in the caries risk of patients, but also to differing opinions among GDPs about ’proper care’. Longitudinal data offers the possibility to make care patterns of GDPs comparable and to reveal possible treatment variation between GDPs. GDPs who participated in this study delivered data on the oral healthcare services (OHS) they provided to young patients during the period 2013–2017. Subsequently, data from patients who received regular OHS for 4 to 5 years were used in the analyses. Based on this, longitudinal preventive and curative care patterns were distinguished. Patients were divided into 3 preventive care patterns: no prevention, occasional prevention, and regular prevention. Furthermore, 3 curative care patterns were distinguished: no curation, curation in 1 year, and curation in several years. These care patterns were then combined. In addition, patients were classified into caries risk categories based on the caries-related treatments they received over a 2-year period: low (no procedures), elevated (1 procedure), and high (2 or more procedures). The caries risk based on the first 2 years and the last 2 years in the dataset were combined into a longitudinal caries risk profile. The most frequent combined care pattern (35.8%) was no curation and occasional or regular prevention. The most common longitudinal caries risk profile was low at beginning and end (45.2%). Dental practices varied considerably in the distribution of curative and preventive care patterns. Thereby, no relationship was shown between curative care patterns and provided preventive care. There was also a large spread in the provided OHS within the various caries risk profiles. These diversities indicated treatment variation between GDPs, which is unwarranted if less or more care is provided than necessary.
16

Minocha, Sanchit, and Faisal Hossain. "GRILSS: opening the gateway to global reservoir sedimentation data curation." Earth System Science Data 17, no. 4 (2025): 1743–59. https://doi.org/10.5194/essd-17-1743-2025.

Abstract:
Reservoir sedimentation poses a significant challenge to freshwater management, leading to declining storage capacity and inefficient reservoir operations for various purposes. However, trustworthy and independently verifiable information on declining storage capacity or sedimentation rates around the world is sparse and suffers from inconsistent metadata and curation to allow global-scale archiving and analyses. The Global Reservoir Inventory of Lost Storage by Sedimentation (GRILSS) dataset addresses this challenge by providing organized, well-curated, and open-source data on sedimentation rates and capacity loss for 1013 reservoirs in 75 major river basins across 54 countries. This publicly accessible dataset captures the complexities of reservoir sedimentation, influenced by regional factors such as climate, topography, and land use. By curating the information from numerous sources with disparate formats in a homogenized data structure, GRILSS serves as an invaluable resource for water managers, policymakers, and researchers for improved sediment management strategies. The open-source nature of GRILSS promotes collaboration and contributions from the global community to grow the dataset. By providing essential reference data on sedimentation to understand the global challenge of reservoir sedimentation, the GRILSS dataset represents a gateway for the global community to share sedimentation and storage loss data for the sustainable operation of the world's reservoirs for future generations. The dataset is publicly available at OSFHome (https://doi.org/10.17605/OSF.IO/W4UG8, Minocha and Hossain, 2025).
17

Kim, Tae Kyung, Paul H. Yi, Gregory D. Hager, and Cheng Ting Lin. "Refining dataset curation methods for deep learning-based automated tuberculosis screening." Journal of Thoracic Disease 12, no. 9 (2020): 5078–85. http://dx.doi.org/10.21037/jtd.2019.08.34.

18

Scheuerman, Morgan Klaus, Alex Hanna, and Emily Denton. "Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development." Proceedings of the ACM on Human-Computer Interaction 5, CSCW2 (2021): 1–37. http://dx.doi.org/10.1145/3476058.

Abstract:
Data is a crucial component of machine learning. The field is reliant on data to train, validate, and test models. With increased technical capabilities, machine learning research has boomed in both academic and industry settings, and one major focus has been on computer vision. Computer vision is a popular domain of machine learning increasingly pertinent to real-world applications, from facial recognition in policing to object detection for autonomous vehicles. Given computer vision's propensity to shape machine learning research and impact human life, we seek to understand disciplinary practices around dataset documentation - how data is collected, curated, annotated, and packaged into datasets for computer vision researchers and practitioners to use for model tuning and development. Specifically, we examine what dataset documentation communicates about the underlying values of vision data and the larger practices and goals of computer vision as a field. To conduct this study, we collected a corpus of about 500 computer vision datasets, from which we sampled 114 dataset publications across different vision tasks. Through both a structured and thematic content analysis, we document a number of values around accepted data practices, what makes desirable data, and the treatment of humans in the dataset construction process. We discuss how computer vision dataset authors value efficiency at the expense of care; universality at the expense of contextuality; impartiality at the expense of positionality; and model work at the expense of data work. Many of the silenced values we identify sit in opposition with social computing practices. We conclude with suggestions on how to better incorporate silenced values into the dataset creation and curation process.
19

Tomelleri, Enrico, Luca Belelli Marchesini, Alexey Yaroslavtsev, Shahla Asgharinia, and Riccardo Valentini. "Toward a Unified TreeTalker Data Curation Process." Forests 13, no. 6 (2022): 855. http://dx.doi.org/10.3390/f13060855.

Abstract:
The Internet of Things (IoT) development is revolutionizing environmental monitoring and research in macroecology. This technology allows for the deployment of sizeable diffuse sensing networks capable of continuous monitoring. Because of this property, the data collected from IoT networks can provide a testbed for scientific hypotheses across large spatial and temporal scales. Nevertheless, data curation is a necessary step to make large and heterogeneous datasets exploitable for synthesis analyses. This process includes data retrieval, quality assurance, standardized formatting, storage, and documentation. TreeTalkers are an excellent example of IoT applied to ecology. These are smart devices for synchronously measuring trees’ physiological and environmental parameters. A set of devices can be organized in a mesh and permit data collection from a single tree to plot or transect scale. The deployment of such devices over large-scale networks needs a standardized approach for data curation. For this reason, we developed a unified processing workflow according to the user manual. In this paper, we first introduce the concept of a unified TreeTalker data curation process. The idea was formalized into an R-package, and it is freely available as open software. Secondly, we present the different functions available in “ttalkR”, and, lastly, we illustrate the application with a demonstration dataset. With such a unified processing approach, we propose a necessary data curation step to establish a new environmental cyberinfrastructure and allow for synthesis activities across environmental monitoring networks. Our data curation concept is the first step for supporting the TreeTalker data life cycle by improving accessibility and thus creating unprecedented opportunities for TreeTalker-based macroecological analyses.
20

Wang, Yuqi, Zihan Cai, and Qinghong Zhang. "The Effect of Dataset Imbalance on the Performance of Image-to-Cartoon Generative Adversarial Networks." Applied and Computational Engineering 132, no. 1 (2025): 193–99. https://doi.org/10.54254/2755-2721/2024.20706.

Abstract:
This report investigates the impact of dataset imbalance on AI-powered image-to-anime style transfer, focusing on the AnimeGANv3 model. Despite the common perception of AI as free of human bias, we highlight that machine learning systems inherently reflect societal prejudices through their training data. The anime art style, popular worldwide but limited in its representation of diverse ethnicities and cultures, serves as a case study for this phenomenon. We analysed AnimeGANv3's training datasets and compared its performance on over- and underrepresented image classes using quantitative and qualitative metrics. Results demonstrate that users from minority groups likely experience inferior outcomes due to dataset imbalance. The study emphasises the need for transparent and responsible dataset curation for machine learning systems to ensure ethical AI development and improved model performance across all user groups.
21

Feeney, Kevin Chekov, Declan O'Sullivan, Wei Tai, and Rob Brennan. "Improving Curated Web-Data Quality with Structured Harvesting and Assessment." International Journal on Semantic Web and Information Systems 10, no. 2 (2014): 35–62. http://dx.doi.org/10.4018/ijswis.2014040103.

Abstract:
This paper describes a semi-automated process, framework and tools for harvesting, assessing, improving and maintaining high-quality linked-data. The framework, known as DaCura, provides dataset curators, who may not be knowledge engineers, with tools to collect and curate evolving linked data datasets that maintain quality over time. The framework encompasses a novel process, workflow and architecture. A working implementation has been produced and applied firstly to the publication of an existing social-sciences dataset, then to the harvesting and curation of a related dataset from an unstructured data-source. The framework's performance is evaluated using data quality measures that have been developed to measure existing published datasets. An analysis of the framework against these dimensions demonstrates that it addresses a broad range of real-world data quality concerns. Experimental results quantify the impact of the DaCura process and tools on data quality through an assessment framework and methodology which combines automated and human data quality controls.
22

Gundert-Remy, U., M. Batke, A. Bitsch, et al. "Optimization of curation of the dataset with data on repeated dose toxicity." Toxicology Letters 238, no. 2 (2015): S166. http://dx.doi.org/10.1016/j.toxlet.2015.08.566.

23

Witt, Michael, Jacob Carlson, D. Scott Brandt, and Melissa H. Cragin. "Constructing Data Curation Profiles." International Journal of Digital Curation 4, no. 3 (2009): 93–103. http://dx.doi.org/10.2218/ijdc.v4i3.117.

Abstract:
This paper presents a brief literature review and then introduces the methods, design, and construction of the Data Curation Profile, an instrument that can be used to provide detailed information on particular data forms that might be curated by an academic library. These data forms are presented in the context of the related sub-disciplinary research area, and they provide the flow of the research process from which these data are generated. The profiles also represent the needs for data curation from the perspective of the data producers, using their own language. As such, they support the exploration of data curation across different research domains in real and practical terms. With the sponsorship of the Institute of Museum and Library Services, investigators from Purdue University and the University of Illinois interviewed 19 faculty subjects to identify needs for discovery, access, preservation, and reuse of their research data. For each subject, a profile was constructed that includes information about his or her general research, data forms and stages, value of data, data ingest, intellectual property, organization and description of data, tools, interoperability, impact and prestige, data management, and preservation. Each profile also presents a specific dataset supplied by the subject to serve as a concrete example. The Data Curation Profiles are being published to a public wiki for questions and discussion, and a blank template will be disseminated with guidelines for others to create and share their own profiles. This study was conducted primarily from the viewpoint of librarians interacting with faculty researchers; however, it is expected that these findings will complement a wide variety of data curation research and practice outside of librarianship and the university environment.
24

van der Voort, Sebastian R., Marion Smits, and Stefan Klein. "DeepDicomSort: An Automatic Sorting Algorithm for Brain Magnetic Resonance Imaging Data." Neuroinformatics 19, no. 1 (2020): 159–84. http://dx.doi.org/10.1007/s12021-020-09475-7.

Abstract:
With the increasing size of datasets used in medical imaging research, the need for automated data curation is arising. One important data curation task is the structured organization of a dataset for preserving integrity and ensuring reusability. Therefore, we investigated whether this data organization step can be automated. To this end, we designed a convolutional neural network (CNN) that automatically recognizes eight different brain magnetic resonance imaging (MRI) scan types based on visual appearance. Thus, our method is unaffected by inconsistent or missing scan metadata. It can recognize pre-contrast T1-weighted (T1w), post-contrast T1-weighted (T1wC), T2-weighted (T2w), proton density-weighted (PDw) and derived maps (e.g. apparent diffusion coefficient and cerebral blood flow). In a first experiment, we used scans of subjects with brain tumors: 11065 scans of 719 subjects for training, and 2369 scans of 192 subjects for testing. The CNN achieved an overall accuracy of 98.7%. In a second experiment, we trained the CNN on all 13434 scans from the first experiment and tested it on 7227 scans of 1318 Alzheimer's subjects. Here, the CNN achieved an overall accuracy of 98.5%. In conclusion, our method can accurately predict scan type, and can quickly and automatically sort a brain MRI dataset virtually without the need for manual verification. In this way, our method can assist with properly organizing a dataset, which maximizes the shareability and integrity of the data.
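
For orientation, here is a minimal PyTorch sketch of a CNN that maps a 2D slice to one of eight scan-type labels. It is not the published DeepDicomSort architecture or preprocessing pipeline; the layer sizes, input resolution, and random test input are placeholders.

```python
import torch
import torch.nn as nn

NUM_SCAN_TYPES = 8  # e.g. T1w, T1wC, T2w, PDw, derived maps, ...

class ScanTypeCNN(nn.Module):
    """Tiny CNN mapping a single grayscale slice to a scan-type label.
    A sketch only -- the published model and input pipeline differ."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(32 * 4 * 4, NUM_SCAN_TYPES)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = ScanTypeCNN()
slices = torch.randn(2, 1, 128, 128)   # two fake grayscale slices
print(model(slices).argmax(dim=1))     # predicted scan-type indices
```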
25

McWilliams, Chris, Joshua Inoue, Philip Wadey, Graeme Palmer, Raul Santos-Rodriguez, and Christopher Bourdeaux. "Curation of an intensive care research dataset from routinely collected patient data in an NHS trust." F1000Research 8 (August 19, 2019): 1460. http://dx.doi.org/10.12688/f1000research.20193.1.

Abstract:
In this data note we provide the details of a research database of 4831 adult intensive care patients who were treated in the Bristol Royal Infirmary, UK between 2015 and 2019. The purposes of this publication are to describe the dataset for external researchers who may be interested in making use of it, and to detail the methods used to curate the dataset in order to help other intensive care units make secondary use of their routinely collected data. The curation involves linkage between two critical care datasets within our hospital and the accompanying code is available online. For reasons of data privacy the data cannot be shared without researchers obtaining appropriate ethical consents. In the future we hope to obtain a data sharing agreement in order to publicly share the de-identified data, and to link our data with other intensive care units who use a Philips clinical information system.
26

Porras Millán, Pablo, Margaret Duesbury, Maximilian Koch, and Sandra Orchard. "The MINTAct Archive for Mutations Influencing Molecular Interactions." Genomics and Computational Biology 4, no. 1 (2017): 100053. http://dx.doi.org/10.18547/gcb.2018.vol4.iss1.e100053.

Abstract:
The MINTAct archive for mutations affecting interactions holds results of over 28,000 events describing the influence of a change of protein sequence on physical interaction outcome. All data has been manually curated from experimental evidence found in more than 4,100 publications, following the IMEx consortium (www.imexconsortium.org) high-detail curation standards. The dataset contains data from about 300 different organisms, with a predominance of events related to human proteins, and it is freely available at the IntAct database website using this link: www.ebi.ac.uk/intact/resources/datasets#mutationDs.
27

Lim, Li Cen, Yee Ying Lim, and Yee Siew Choong. "Data curation to improve the pattern recognition performance of B-cell epitope prediction by support vector machine." Pure and Applied Chemistry 93, no. 5 (2021): 571–77. http://dx.doi.org/10.1515/pac-2020-1107.

Abstract:
B-cell epitopes are recognized by and bind to receptors on the surface of B-lymphocytes to trigger an immune response, and are thus vital elements in the field of epitope-based vaccine design, antibody production and therapeutic development. However, the experimental approaches to mapping epitopes are time-consuming and costly. Computational prediction can offer an unbiased preliminary selection to reduce the number of epitopes for experimental validation. The B-cell epitopes deposited in databases are peptides experimentally determined to be positive or negative, and some are ambiguous as a result of different experimental methods. Prior to the development of a B-cell epitope prediction module, the available dataset needs to be handled with care. In this work, we first pre-processed the B-cell epitope dataset prior to B-cell epitope prediction based on pattern recognition using a support vector machine (SVM). Using only the absolute epitopes and non-epitopes, the datasets were classified into five pathogen categories, working on 6-mer peptide sequences. The pre-processing of the datasets improved the B-cell epitope prediction performance up to 99.1 % accuracy and showed significant improvement in cross-validation results. It could be useful when incorporated with physicochemical propensity ranking in the future for the development of a B-cell epitope prediction module.
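
A small sketch of the general setup, under the assumption that 6-mer peptides are one-hot encoded and fed to an SVM with cross-validation; the peptides, labels, and hyperparameters are invented, and the authors' actual feature encoding may differ.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {a: i for i, a in enumerate(AMINO_ACIDS)}

def one_hot_6mer(peptide: str) -> np.ndarray:
    """Encode a 6-residue peptide as a flat 6 x 20 one-hot vector."""
    vec = np.zeros((6, 20))
    for pos, aa in enumerate(peptide):
        vec[pos, AA_INDEX[aa]] = 1.0
    return vec.ravel()

# Toy data: unambiguous epitope (1) vs non-epitope (0) 6-mers.
# Sequences and labels are invented for illustration only.
peptides = ["ACDEFG", "GHIKLM", "NPQRST", "VWYACD", "KLMNPQ", "RSTVWY",
            "DEFGHI", "MNPQRS"]
labels   = [1, 0, 1, 0, 1, 0, 1, 0]
X = np.stack([one_hot_6mer(p) for p in peptides])
y = np.array(labels)

clf = SVC(kernel="rbf", C=1.0)
print(cross_val_score(clf, X, y, cv=2).mean())
```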
28

Kohli, Marc, James J. Morrison, Judy Wawira, et al. "Creation and Curation of the Society of Imaging Informatics in Medicine Hackathon Dataset." Journal of Digital Imaging 31, no. 1 (2017): 9–12. http://dx.doi.org/10.1007/s10278-017-0003-5.

29

Korot, Edward, Zeyu Guan, Daniel Ferraz, et al. "Code-free deep learning for multi-modality medical image classification." Nature Machine Intelligence 3, no. 4 (2021): 288–98. http://dx.doi.org/10.1038/s42256-021-00305-2.

Abstract:
A number of large technology companies have created code-free cloud-based platforms that allow researchers and clinicians without coding experience to create deep learning algorithms. In this study, we comprehensively analyse the performance and feature set of six platforms, using four representative cross-sectional and en-face medical imaging datasets to create image classification models. The mean (s.d.) F1 scores across platforms for all model–dataset pairs were as follows: Amazon, 93.9 (5.4); Apple, 72.0 (13.6); Clarifai, 74.2 (7.1); Google, 92.0 (5.4); MedicMind, 90.7 (9.6); Microsoft, 88.6 (5.3). The platforms demonstrated uniformly higher classification performance with the optical coherence tomography modality. Potential use cases given proper validation include research dataset curation, mobile ‘edge models’ for regions without internet access, and baseline models against which to compare and iterate bespoke deep learning approaches.
30

Scheuerman, Morgan Klaus, Katy Weathington, Tarun Mugunthan, Emily Denton, and Casey Fiesler. "From Human to Data to Dataset: Mapping the Traceability of Human Subjects in Computer Vision Datasets." Proceedings of the ACM on Human-Computer Interaction 7, CSCW1 (2023): 1–33. http://dx.doi.org/10.1145/3579488.

Abstract:
Computer vision is a "data hungry" field. Researchers and practitioners who work on human-centric computer vision, like facial recognition, emphasize the necessity of vast amounts of data for more robust and accurate models. Humans are seen as a data resource which can be converted into datasets. The necessity of data has led to a proliferation of gathering data from easily available sources, including "public" data from the web. Yet the use of public data has significant ethical implications for the human subjects in datasets. We bridge academic conversations on the ethics of using publicly obtained data with concerns about privacy and agency associated with computer vision applications. Specifically, we examine how practices of dataset construction from public data-not only from websites, but also from public settings and public records-make it extremely difficult for human subjects to trace their images as they are collected, converted into datasets, distributed for use, and, in some cases, retracted. We discuss two interconnected barriers current data practices present to providing an ethics of traceability for human subjects: awareness and control. We conclude with key intervention points for enabling traceability for data subjects. We also offer suggestions for an improved ethics of traceability to enable both awareness and control for individual subjects in dataset curation practices.
31

Słowik, Agnieszka, Léon Bottou, Sean B. Holden, and Mateja Jamnik. "On the Relation between Distributionally Robust Optimization and Data Curation (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 11 (2022): 13053–54. http://dx.doi.org/10.1609/aaai.v36i11.21663.

Abstract:
Machine learning systems based on minimizing average error have been shown to perform inconsistently across notable subsets of the data, which is not exposed by a low average error for the entire dataset. In consequential social and economic applications, where data represent people, this can lead to discrimination of underrepresented gender and ethnic groups. Distributionally Robust Optimization (DRO) seemingly addresses this problem by minimizing the worst expected risk across subpopulations. We establish theoretical results that clarify the relation between DRO and the optimization of the same loss averaged on an adequately weighted training dataset. A practical implication of our results is that neither DRO nor curating the training set should be construed as a complete solution for bias mitigation.
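
In standard group-DRO notation (assumed here; the paper's own formulation may differ), the relation described above can be written as follows: group DRO minimizes the worst-case risk over G groups, and because the inner maximum over the simplex is attained at some weighting q*, the same optimum is reached by minimizing a suitably re-weighted average loss.

```latex
% Group DRO: minimize the worst expected risk across groups g = 1..G
\min_{\theta}\; \max_{q \in \Delta_G}\; \sum_{g=1}^{G} q_g\,
  \mathbb{E}_{(x,y)\sim P_g}\!\left[\ell(\theta; x, y)\right]
% The inner maximum is attained at some weighting q* in the simplex,
% so the DRO optimum coincides with an adequately weighted average loss:
\min_{\theta}\; \sum_{g=1}^{G} q^{*}_{g}\,
  \mathbb{E}_{(x,y)\sim P_g}\!\left[\ell(\theta; x, y)\right],
\qquad q^{*} \in \Delta_G
```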
32

Thomer, Andrea K., Dharma Akmon, Jeremy J. York, et al. "The Craft and Coordination of Data Curation: Complicating Workflow Views of Data Science." Proceedings of the ACM on Human-Computer Interaction 6, CSCW2 (2022): 1–29. http://dx.doi.org/10.1145/3555139.

Abstract:
Data curation is the process of making a dataset fit-for-use and archivable. It is critical to data-intensive science because it makes complex data pipelines possible, studies reproducible, and data reusable. Yet the complexities of the hands-on, technical, and intellectual work of data curation are frequently overlooked or downplayed. Obscuring the work of data curation not only renders the labor and contributions of data curators invisible but also hides the impact that curators' work has on the later usability, reliability, and reproducibility of data. To better understand the work and impact of data curation, we conducted a close examination of data curation at a large social science data repository, the Inter-university Consortium for Political and Social Research (ICPSR). We asked: What does curatorial work entail at ICPSR, and what work is more or less visible to different stakeholders and in different contexts? And, how is that curatorial work coordinated across the organization? We triangulated accounts of data curation from interviews and records of curation in Jira tickets to develop a rich and detailed account of curatorial work. While we identified numerous curatorial actions performed by ICPSR curators, we also found that curators rely on a number of craft practices to perform their jobs. The reality of their work practices defies the rote sequence of events implied by many life cycle or workflow models. Further, we show that craft practices are needed to enact data curation best practices and standards. The craft that goes into data curation is often invisible to end users, but it is well recognized by ICPSR curators and their supervisors. Explicitly acknowledging and supporting data curators as craftspeople is important in creating sustainable and successful curatorial infrastructures.
33

Mayfield, Alex, Margaret Frei, Daryl Ireland, and Eugenio Menegon. "The China Historical Christian Database: A Dataset Quantifying Christianity in China from 1550 to 1950." Data 9, no. 6 (2024): 76. http://dx.doi.org/10.3390/data9060076.

Abstract:
The era of digitization is revolutionizing traditional humanities research, presenting both novel methodologies and challenges. This field harnesses quantitative techniques to yield groundbreaking insights, contingent upon comprehensive datasets on historical subjects. The China Historical Christian Database (CHCD) exemplifies this trend, furnishing researchers with a rich repository of historical, relational, and geographical data about Christianity in China from 1550 to 1950. The study of Christianity in China confronts formidable obstacles, including the mobility of historical agents, fluctuating relational networks, and linguistic disparities among scattered sources. The CHCD addresses these challenges by curating an open-access database built in Neo4j that records information about Christian institutions in China and those that worked inside of them. Drawing on historical sources, the CHCD contains temporal, relational, and geographic data. The database currently has over 40,000 nodes and 200,000 relationships, and continues to grow. Beyond its utility for religious studies, the CHCD encompasses broader interdisciplinary inquiries including social network analysis, geospatial visualization, and economic modeling. This article introduces the CHCD’s structure, and explains the data collection and curation process.
34

Armijos Carrion, Angelo D., Damien D. Hinsinger, and Joeri S. Strijk. "ECuADOR—Easy Curation of Angiosperm Duplicated Organellar Regions, a tool for cleaning and curating plastomes assembled from next generation sequencing pipelines." PeerJ 8 (April 7, 2020): e8699. http://dx.doi.org/10.7717/peerj.8699.

Abstract:
Background: With the rapid increase in availability of genomic resources offered by Next-Generation Sequencing (NGS) and the availability of free online genomic databases, efficient and standardized metadata curation approaches have become increasingly critical for the post-processing stages of biological data. Especially in organelle-based studies using circular chloroplast genome datasets, the assembly of the main structural regions in random order and orientation represents a major limitation in our ability to easily generate “ready-to-align” datasets for phylogenetic reconstruction, at both small and large taxonomic scales. In addition, current practices discard the most variable regions of the genomes to facilitate the alignment of the remaining coding regions. Nevertheless, no software is currently available to perform curation to such a degree, through simple detection, organization and positioning of the main plastome regions, making it a time-consuming and error-prone process. Here we introduce ECuADOR, a fast and user-friendly Perl script specifically designed to automate the detection and reorganization of newly assembled plastomes obtained from any source available (NGS, Sanger sequencing or assembler output). Methods: ECuADOR uses a sliding-window approach to detect long repeated sequences in draft sequences, which then identifies the inverted repeat regions (IRs), even in case of artifactual breaks or sequencing errors, and automates the rearrangement of the sequence to the widely used LSC–IRb–SSC–IRa order. This facilitates rapid post-editing steps such as creation of genome alignments, detection of variable regions, SNP detection and phylogenomic analyses. Results: ECuADOR was successfully tested on plant families throughout the angiosperm phylogeny by curating 161 chloroplast datasets. ECuADOR first identified and reordered the central regions (LSC–IRb–SSC–IRa) for each dataset and then produced a new annotation for the chloroplast sequences. The process took less than 20 min with a maximum memory requirement of 150 MB and an accuracy of over 99%. Conclusions: ECuADOR is the sole de novo one-step recognition and reordering tool that facilitates the post-processing analysis of extra-nuclear genomes from NGS data. The program is available at https://github.com/BiodivGenomic/ECuADOR/.
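
To make the sliding-window idea concrete, here is a toy Python sketch that locates an inverted repeat by searching for the reverse complement of each window. ECuADOR itself is a Perl tool that also tolerates artifactual breaks and sequencing errors and performs the actual LSC–IRb–SSC–IRa rearrangement; none of that is attempted here, and the sequences are invented.

```python
from typing import Optional, Tuple

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMP)[::-1]

def find_inverted_repeat(genome: str, window: int = 20) -> Optional[Tuple[int, int]]:
    """Slide a window along the sequence and report the first position
    whose reverse complement also occurs downstream -- a crude stand-in
    for locating the IR boundaries of a plastome."""
    for i in range(len(genome) - window + 1):
        probe = genome[i:i + window]
        j = genome.find(revcomp(probe), i + window)
        if j != -1:
            return i, j
    return None

# Toy plastome: LSC + IRb + SSC + IRa (all sequences invented).
ir = "ATGCGTACGTTAGCCATGAC"
plastome = "AAAA" * 10 + ir + "CCCC" * 5 + revcomp(ir)
print(find_inverted_repeat(plastome))  # start positions of IRb and IRa
```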
35

Jittawiriyanukoon, Chanintorn. "Granularity analysis of classification and estimation for complex datasets with MOA." International Journal of Electrical and Computer Engineering (IJECE) 9, no. 1 (2019): 409–16. https://doi.org/10.11591/ijece.v9i1.pp409-416.

Abstract:
Dispersed and unstructured datasets make it difficult to determine the exact amount of space they require. Depending upon the size and the data distribution, especially if the classes are significantly associated, the level of granularity needed for a precise classification of the datasets increases. Data complexity is one of the major attributes governing the proper value of the granularity, as it has a direct impact on performance. Dataset classification is a vital step in complex data analytics, designed to ensure that a dataset is ready to be efficiently scrutinized. Data collections always contain missing, noisy and out-of-range values, and data analytics applied to datasets that have not been wisely classified for such problems can produce unreliable outcomes. Hence, classification of complex data sources helps safeguard the accuracy of gathered datasets for machine learning algorithms. Dataset complexity and pre-processing time reflect the effectiveness of an individual algorithm. Once the complexity of datasets has been characterized, comparatively simpler datasets can be further investigated with a parallelism approach. Speedup is measured by executing MOA simulations. Our proposed classification approach outperforms alternatives and improves the granularity level of complex datasets.
36

Pinter, Anthony T., Jacob M. Paul, Jessie Smith, and Jed R. Brubaker. "P4KxSpotify: A Dataset of Pitchfork Music Reviews and Spotify Musical Features." Proceedings of the International AAAI Conference on Web and Social Media 14 (May 26, 2020): 895–902. http://dx.doi.org/10.1609/icwsm.v14i1.7355.

Abstract:
Algorithmically driven curation and recommendation systems like those employed by Spotify have become more ubiquitous for surfacing content that people might want to hear. However, expert reviews continue to have a measurable impact on what people choose to listen to and the subsequent commercial success and cultural staying power of those artists. One such site, Pitchfork, is particularly known in the music community for its ability to catapult an artist to stardom based on the review that an album receives. In this paper, we present P4KxSpotify: a dataset of Pitchfork album reviews with the corresponding Spotify audio features for those albums. We describe our data collection and dataset creation process, including the ethics of such a collection. We present basic information and descriptive statistics about the dataset. Finally, we offer several possible avenues for research that might utilize this new dataset.
37

Ventura, Lucas, Antoine Yang, Cordelia Schmid, and Gül Varol. "CoVR: Learning Composed Video Retrieval from Web Video Captions." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 6 (2024): 5270–79. http://dx.doi.org/10.1609/aaai.v38i6.28334.

Abstract:
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.
APA, Harvard, Vancouver, ISO, and other styles
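The core of the described methodology, mining caption pairs that differ slightly, can be sketched with off-the-shelf sentence embeddings. The model choice and similarity threshold below are assumptions, and the LLM step that writes the modification text is left as a placeholder.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Toy caption pool; the paper mines the WebVid2M collection at scale.
captions = [
    "a red car driving down a coastal road",
    "a blue car driving down a coastal road",
    "a chef chopping vegetables in a kitchen",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption
emb = model.encode(captions, convert_to_numpy=True, normalize_embeddings=True)
sim = emb @ emb.T  # cosine similarity, since embeddings are normalized

# Mine near-duplicate caption pairs above a similarity threshold (threshold is arbitrary).
pairs = [(i, j) for i in range(len(captions))
         for j in range(i + 1, len(captions)) if sim[i, j] > 0.8]
for i, j in pairs:
    # In the paper, a large language model generates the modification text
    # for each mined pair; here we only print the pair as a placeholder.
    print(captions[i], "->", captions[j], f"(sim={sim[i, j]:.2f})")
```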
38

Neto, Luís, Nádia Pinto, Alberto Proença, António Amorim, and Eduardo Conde-Sousa. "4SpecID: Reference DNA Libraries Auditing and Annotation System for Forensic Applications." Genes 12, no. 1 (2021): 61. http://dx.doi.org/10.3390/genes12010061.

Full text
Abstract:
Forensic genetics is a fast-growing field that frequently requires DNA-based taxonomy, namely, when the evidence consists of parts of specimens, often highly processed in food, potions, or ointments. Reference DNA-sequence libraries, such as BOLD or GenBank, are imperative tools for taxonomic assignment, particularly when morphology is inadequate for classification. The auditing and curation of these datasets require reliable mechanisms, preferably with automated data preprocessing. Existing software tools grade these datasets using the number of records as the primary criterion, which is not compliant with forensic standards, where the priority is validation from independent sources. Here we present 4SpecID, an efficient and freely available software tool developed to audit and annotate reference libraries, specifically designed for forensic applications. Its intuitive, user-friendly interface virtually accesses any database and includes specific data mining functions tuned for the widespread BOLD repositories. The tool was evaluated on a MacBook laptop and a dual-Xeon server with a large BOLD dataset (Culicidae, 36,115 records); the best execution time to grade the dataset on the laptop was 0.28 s. Datasets of the Bovidae and Felidae families were used to evaluate the quality of the tool and the relevance of independent-source validation.
APA, Harvard, Vancouver, ISO, and other styles
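To illustrate the grading principle the abstract contrasts with raw record counts, the sketch below grades species by the number of independent contributing institutions; the column names and letter-grade thresholds are illustrative, not 4SpecID's actual scheme.

```python
import pandas as pd

# Toy reference-library records; column names are hypothetical, not the BOLD/4SpecID schema.
records = pd.DataFrame({
    "species": ["Aedes aegypti"] * 3 + ["Culex pipiens"] * 2,
    "institution": ["Lab A", "Lab B", "Lab A", "Lab C", "Lab C"],
})

# Grade by number of independent institutions per species rather than raw record counts.
independent = records.groupby("species")["institution"].nunique()

def grade(n_sources: int) -> str:
    # Illustrative thresholds only; 4SpecID defines its own grading scheme.
    return "A" if n_sources >= 2 else "C"

print(independent.map(grade))
```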
39

Piquer-Esteban, Samuel, Vicente Arnau, Wladimiro Diaz, and Andrés Moya. "OMD Curation Toolkit: a workflow for in-house curation of public omics datasets." BMC Bioinformatics 25, no. 1 (2024). http://dx.doi.org/10.1186/s12859-024-05803-9.

Full text
Abstract:
Background: Major advances in sequencing technologies and the sharing of data and metadata in science have resulted in a wealth of publicly available datasets. However, working with and especially curating public omics datasets remains challenging despite these efforts. While a growing number of initiatives aim to re-use previous results, these present limitations that often lead to the need for further in-house curation and processing. Results: Here, we present the Omics Dataset Curation Toolkit (OMD Curation Toolkit), a python3 package designed to accompany and guide the researcher during the curation process of metadata and fastq files of public omics datasets. This workflow provides a standardized framework with multiple capabilities (collection, control check, treatment and integration) to facilitate the arduous task of curating public sequencing data projects. While centered on the European Nucleotide Archive (ENA), the majority of the provided tools are generic and can be used to curate datasets from different sources. Conclusions: Thus, it offers valuable tools for the in-house curation previously needed to re-use public omics data. Due to its workflow structure and capabilities, it can be easily used and benefit investigators in developing novel omics meta-analyses based on sequencing data.
APA, Harvard, Vancouver, ISO, and other styles
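One "control check" such a workflow typically performs is checksum verification of downloaded fastq files. The following sketch assumes ENA-style published MD5 sums; the file name and expected hash in the usage comment are hypothetical.

```python
import hashlib
from pathlib import Path

def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so large fastq files never need to fit in memory."""
    h = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def control_check(fastq: Path, expected_md5: str) -> bool:
    # ENA publishes MD5 checksums alongside fastq files; comparing them
    # is a typical integrity check in a curation workflow.
    return md5sum(fastq) == expected_md5

# Hypothetical usage:
# ok = control_check(Path("SRR000001_1.fastq.gz"), "d41d8cd98f00b204e9800998ecf8427e")
```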
40

Hemphill, Libby, Andrea Thomer, Sara Lafia, Lizhou Fan, David Bleckley, and Elizabeth Moss. "A dataset for measuring the impact of research data and their curation." Scientific Data 11, no. 1 (2024). http://dx.doi.org/10.1038/s41597-024-03303-2.

Full text
Abstract:
Science funders, publishers, and data archives make decisions about how to responsibly allocate resources to maximize the reuse potential of research data. This paper introduces a dataset developed to measure the impact of archival and data curation decisions on data reuse. The dataset describes 10,605 social science research datasets, their curation histories, and reuse contexts in 94,755 publications that cover 59 years from 1963 to 2022. The dataset was constructed from study-level metadata, citing publications, and curation records available through the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan. The dataset includes information about study-level attributes (e.g., PIs, funders, subject terms); usage statistics (e.g., downloads, citations); archiving decisions (e.g., curation activities, data transformations); and bibliometric attributes (e.g., journals, authors) for citing publications. This dataset provides information on factors that contribute to long-term data reuse, which can inform the design of effective evidence-based recommendations to support high-impact research data curation decisions.
APA, Harvard, Vancouver, ISO, and other styles
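A first analysis one might run on such a dataset is a rank correlation between curation intensity and reuse indicators. The sketch below uses made-up rows and hypothetical field names; the real variables are documented with the release.

```python
import pandas as pd

# Toy stand-in for the ICPSR-derived dataset; real field names are documented with the data.
studies = pd.DataFrame({
    "study_id": [1, 2, 3, 4],
    "curation_activities": [2, 5, 1, 7],   # count of recorded curation actions
    "downloads": [120, 900, 60, 1500],
    "citations": [3, 25, 1, 40],
})

# A first-pass look at whether heavier curation co-occurs with more reuse.
print(studies[["curation_activities", "downloads", "citations"]].corr(method="spearman"))
```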
41

Rutherford, Michael, Seong K. Mun, Betty Levine, et al. "A DICOM dataset for evaluation of medical image de-identification." Scientific Data 8, no. 1 (2021). http://dx.doi.org/10.1038/s41597-021-00967-y.

Full text
Abstract:
We developed a DICOM dataset that can be used to evaluate the performance of de-identification algorithms. DICOM objects (a total of 1,693 CT, MRI, PET, and digital X-ray images) were selected from datasets published in the Cancer Imaging Archive (TCIA). Synthetic Protected Health Information (PHI) was generated and inserted into selected DICOM Attributes to mimic typical clinical imaging exams. The DICOM Standard and TCIA curation audit logs guided the insertion of synthetic PHI into standard and non-standard DICOM data elements. A TCIA curation team tested the utility of the evaluation dataset. With this publication, the evaluation dataset (containing synthetic PHI) and de-identified evaluation dataset (the result of TCIA curation) are released on TCIA in advance of a competition, sponsored by the National Cancer Institute (NCI), for algorithmic de-identification of medical image datasets. The competition will use a much larger evaluation dataset constructed in the same manner. This paper describes the creation of the evaluation datasets and guidelines for their use.
APA, Harvard, Vancouver, ISO, and other styles
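The PHI-insertion step described here can be pictured with pydicom. The sketch below builds a minimal in-memory dataset, seeds synthetic PHI into standard attributes, and places a copy in a private tag, the kind of non-standard location de-identification tools must also scrub; the specific tag number is arbitrary.

```python
from pydicom.dataset import Dataset  # pip install pydicom

# Build a minimal in-memory DICOM dataset and insert synthetic PHI,
# mimicking how an evaluation dataset seeds attributes for de-id testing.
ds = Dataset()
ds.PatientName = "Doe^Jane"            # synthetic, not real PHI
ds.PatientID = "TEST-0001"
ds.PatientBirthDate = "19700101"
ds.InstitutionName = "Example Hospital"

# Non-standard placement: PHI can also hide in private tags,
# which is exactly what de-identification algorithms must catch.
ds.add_new(0x00091001, "LO", "Jane Doe, MRN TEST-0001")

print(ds)
```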
42

Mayrhofer-Hufnagl, Ingrid, and Benjamin Ennemoser. "Advancing justice in a city’s complex systems using designs enabled by space." International Journal of Architectural Computing, March 30, 2023, 147807712311682. http://dx.doi.org/10.1177/14780771231168223.

Full text
Abstract:
Understanding the importance of data is crucial for realizing the full potential of AI in architectural design. Satellite images are extremely numerous, continuous, high resolution, and accessible, allowing nuanced experimentation through dataset curation. Combining deep learning with remote-sensing technologies, this study poses the following questions. Do newly available datasets uncover ideas about the city previously hidden because urban theory is predominantly Eurocentric? Do extensive and continuous datasets promise a more refined examination of datasets’ effects on outcomes? Generative adversarial networks can endlessly generate new designs based on a curated dataset, but architectural evaluation has been questionable. We employ quantitative and qualitative assessment metrics to investigate human collaboration with AI, producing results that contribute to understanding AI-based urban design models and the significance of dataset curation.
APA, Harvard, Vancouver, ISO, and other styles
43

Jiao, Yu. "Improving Marvel Hero Classification through Dataset Curation." Science and Technology of Engineering, Chemistry and Environmental Protection 1, no. 8 (2024). http://dx.doi.org/10.61173/3gzt6303.

Full text
Abstract:
This study explores the development of a computer vision model for classifying Marvel superheroes such as Black Widow, Hulk, Iron Man, and Spider-Man. Utilizing a curated dataset sourced from Kaggle, the research emphasizes the critical role of dataset quality in refining model accuracy. Insights gained include adjustments to neural network configurations and leveraging Edge Impulse for enhanced performance. The findings highlight effective strategies for optimizing classification accuracy in complex image recognition tasks. PART 2: This part explores the application of Bayesian logistic regression to model the relationship between temperature and the probability of O-ring failure. Leveraging Bayesian inference techniques, it analyzes historical data to quantify the risk associated with temperature variations and emphasizes the importance of probabilistic approaches in safety-critical decision-making. The Space Shuttle Challenger disaster on January 28, 1986, remains a poignant case study in aerospace engineering failure. The investigation concluded that the failure of O-ring seals in cold temperatures led to the tragic loss of the shuttle and its crew.
APA, Harvard, Vancouver, ISO, and other styles
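Part 2's method, Bayesian logistic regression of failure probability against temperature, can be reproduced in miniature with a random-walk Metropolis sampler. The launch data below are synthetic stand-ins (the real Challenger O-ring data are published elsewhere), and the priors and proposal scale are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (synthetic) launch data: temperature in °F and failure indicator.
temp = np.array([53, 57, 63, 66, 67, 68, 70, 70, 72, 75, 76, 79, 81], dtype=float)
fail = np.array([1,  1,  1,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0], dtype=float)
x = temp - temp.mean()  # center the predictor for a better-behaved sampler

def log_post(a, b):
    """Log posterior: Bernoulli likelihood plus weak N(0, 10^2) priors on (a, b)."""
    logits = a + b * x
    ll = np.sum(fail * logits - np.log1p(np.exp(logits)))
    lp = -0.5 * (a**2 + b**2) / 10.0**2
    return ll + lp

# Random-walk Metropolis over (intercept, slope).
samples, theta = [], np.zeros(2)
cur = log_post(*theta)
for _ in range(20000):
    prop = theta + rng.normal(scale=0.3, size=2)
    new = log_post(*prop)
    if np.log(rng.uniform()) < new - cur:
        theta, cur = prop, new
    samples.append(theta.copy())
post = np.array(samples[5000:])  # drop burn-in

# Posterior probability of failure at a cold launch temperature (e.g., 31 °F).
xc = 31 - temp.mean()
p31 = 1 / (1 + np.exp(-(post[:, 0] + post[:, 1] * xc)))
print(f"posterior mean P(failure | 31°F) ≈ {p31.mean():.2f}")
```

The payoff of the Bayesian treatment is the last two lines: instead of a point estimate, the sampler yields a full posterior over the failure probability at temperatures far colder than any observed launch.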
44

Luong, Hoa Q., Colleen Fallaw, Genevieve Schmitt, Susan M. Braxton, and Heidi Imker. "Responding to Reality: Evolving Curation Practices and Infrastructure at the University of Illinois at Urbana-Champaign." Journal of eScience Librarianship 10, no. 3 (2021). http://dx.doi.org/10.7191/jeslib.2021.1202.

Full text
Abstract:
Objective: The Illinois Data Bank provides Illinois researchers with the infrastructure to publish research data publicly. During a five-year review of the Research Data Service at the University of Illinois at Urbana-Champaign, it was recognized as the most useful service offering in the unit. Internal metrics are captured and used to monitor growth, document curation workflows, and surface technical challenges faced as we assist our researchers. Here we present examples of these curation challenges and the solutions chosen to address them. Methods: Some Illinois Data Bank metrics are collected internally within the system, but most of the curation metrics reported here are tracked separately in a Google spreadsheet. The curator logs the required information after curation is complete for each dataset. While the data are sometimes ambiguous (e.g., depending on researcher uptake of suggested actions), our curation data provide a general understanding of our data repository and have been useful in assessing our workflows and services. These metrics also help prioritize development needs for the Illinois Data Bank. Results and Conclusions: The curatorial services polish and improve the datasets, which contributes to the spirit of data reuse. Although we continue to see challenges in our processes, curation makes a positive impact on datasets. Continued development and adaptation of the technical infrastructure allows for an ever-better experience for curators and users. These improvements have helped our repository more effectively support the data sharing process by successfully fostering depositor engagement with curators to improve datasets and facilitating easy transfer of very large files.
APA, Harvard, Vancouver, ISO, and other styles
45

Lim, Nathaniel, Stepan Tesar, Manuel Belmadani, et al. "Curation of over 10 000 transcriptomic studies to enable data reuse." Database 2021 (January 1, 2021). http://dx.doi.org/10.1093/database/baab006.

Full text
Abstract:
Vast amounts of transcriptomic data reside in public repositories, but effective reuse remains challenging. Issues include unstructured dataset metadata, inconsistent data processing and quality control, and inconsistent probe–gene mappings across microarray technologies. Thus, extensive curation and data reprocessing are necessary prior to any reuse. The Gemma bioinformatics system was created to help address these issues. Gemma consists of a database of curated transcriptomic datasets, analytical software, a web interface and web services. Here we present an update on Gemma’s holdings, data processing and analysis pipelines, our curation guidelines, and software features. As of June 2020, Gemma contains 10 811 manually curated datasets (primarily human, mouse and rat), over 395 000 samples and hundreds of curated transcriptomic platforms (both microarray and RNA sequencing). Dataset topics were represented with 10 215 distinct terms from 12 ontologies, for a total of 54 316 topic annotations (mean topics/dataset = 5.2). While Gemma has broad coverage of conditions and tissues, it captures a large majority of available brain-related datasets, accounting for 34% of its holdings. Users can access the curated data and differential expression analyses through the Gemma website, RESTful service and an R package. Database URL: https://gemma.msl.ubc.ca/home.html
APA, Harvard, Vancouver, ISO, and other styles
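Programmatic access along the lines the abstract describes might look like the following; note that the endpoint path, query parameters, and response shape are assumptions on my part, so the Gemma REST documentation should be treated as authoritative.

```python
import requests

# Sketch of querying Gemma's RESTful service; the exact endpoint path and
# response shape below are assumptions -- consult the API docs at gemma.msl.ubc.ca.
BASE = "https://gemma.msl.ubc.ca/rest/v2"

resp = requests.get(f"{BASE}/datasets",
                    params={"query": "hippocampus", "limit": 5}, timeout=30)
resp.raise_for_status()
for ds in resp.json().get("data", []):
    print(ds.get("shortName"), "-", ds.get("name"))
```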
46

Yang, Yili, Heidi Rodenhizer, Brendan M. Rogers, et al. "A Collaborative and Scalable Geospatial Data Set for Arctic Retrogressive Thaw Slumps with Data Standards." Scientific Data 12, no. 1 (2025). https://doi.org/10.1038/s41597-025-04372-7.

Full text
Abstract:
Arctic permafrost is undergoing rapid changes due to climate warming in high latitudes. Retrogressive thaw slumps (RTS) are one of the most abrupt and impactful thermal-denudation events that change Arctic landscapes and accelerate carbon feedbacks. Their spatial distribution remains poorly characterised due to time-intensive conventional mapping methods. While numerous RTS studies have published standalone digitisation datasets, the lack of a centralised, unified database has limited their utilisation, affecting the scale of RTS studies and the generalisation ability of deep learning models. To address this, we established the Arctic Retrogressive Thaw Slumps (ARTS) dataset containing 23,529 RTS-present and 20,434 RTS-absent digitisations from 20 standalone datasets. We also proposed a Data Curation Framework as a working standard for RTS digitisations. This dataset is designed to be comprehensive, accessible, contributable, and adaptable for various RTS-related studies. This dataset and its accompanying curation framework establish a foundation for enhanced collaboration in RTS research, facilitating standardised data sharing and comprehensive analyses across the Arctic permafrost research community.
APA, Harvard, Vancouver, ISO, and other styles
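Working with such a release could start with a sketch like this; the file name and attribute names ("label", "source_dataset") are assumptions, as the actual schema is defined by the dataset's Data Curation Framework.

```python
import geopandas as gpd  # pip install geopandas

# Hypothetical file name and attribute names; the ARTS release documents its own schema.
arts = gpd.read_file("arts_dataset.geojson")

# Keep RTS-present digitisations and count them per contributing source dataset.
present = arts[arts["label"] == "RTS-present"]
print(present.groupby("source_dataset").size().sort_values(ascending=False))
```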
47

Tryggestad, E., A. Anand, C. Beltran, et al. "Scalable radiotherapy data curation infrastructure for deep-learning based autosegmentation of organs-at-risk: A case study in head and neck cancer." Frontiers in Oncology 12 (August 29, 2022). http://dx.doi.org/10.3389/fonc.2022.936134.

Full text
Abstract:
In this era of patient-centered, outcomes-driven and adaptive radiotherapy, deep learning is now being successfully applied to tackle imaging-related workflow bottlenecks such as autosegmentation and dose planning. These applications typically require supervised learning approaches enabled by relatively large, curated radiotherapy datasets which are highly reflective of the contemporary standard of care. However, little has been previously published describing technical infrastructure, recommendations, methods or standards for radiotherapy dataset curation in a holistic fashion. Our radiation oncology department has recently embarked on a large-scale project in partnership with an external partner to develop deep-learning-based tools to assist with our radiotherapy workflow, beginning with autosegmentation of organs-at-risk. This project will require thousands of carefully curated radiotherapy datasets comprising all body sites we routinely treat with radiotherapy. Given such a large project scope, we have approached the need for dataset curation rigorously, with an aim towards building infrastructure that is compatible with efficiency, automation and scalability. Focusing on our first use-case pertaining to head and neck cancer, we describe our developed infrastructure and novel methods applied to radiotherapy dataset curation, inclusive of personnel and workflow organization, dataset selection, expert organ-at-risk segmentation, quality assurance, patient de-identification, data archival and transfer. Over the course of approximately 13 months, our expert multidisciplinary team generated 490 curated head and neck radiotherapy datasets. This task required approximately 6000 human-expert hours in total (not including planning and infrastructure development time). This infrastructure continues to evolve and will support ongoing and future project efforts.
APA, Harvard, Vancouver, ISO, and other styles
48

Lai, Po-Ting, Elisabeth Coudert, Lucila Aimo, et al. "EnzChemRED, a rich enzyme chemistry relation extraction dataset." Scientific Data 11, no. 1 (2024). http://dx.doi.org/10.1038/s41597-024-03835-7.

Full text
Abstract:
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts where enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F1 score) and to extract the chemical conversions (86.66% F1 score) and the enzymes that catalyze those conversions (83.79% F1 score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.
APA, Harvard, Vancouver, ISO, and other styles
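A minimal picture of the fine-tuning the abstract reports is a single supervised step of a token-classification model. The base model and the three-label scheme below are illustrative assumptions; the dummy all-"O" labels stand in for real EnzChemRED annotations.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Minimal fine-tuning step for tagging enzymes/chemicals in text.
# Model choice and the 3-label scheme (O, B-CHEM, B-PROT) are illustrative assumptions.
name = "distilbert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=3)

enc = tok("Hexokinase phosphorylates glucose.", return_tensors="pt")
labels = torch.zeros_like(enc["input_ids"])  # dummy all-"O" labels for the sketch

out = model(**enc, labels=labels)
out.loss.backward()  # one gradient step's worth of training signal
print(f"token-classification loss: {out.loss.item():.3f}")
```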
49

Quaresma, Andreia, Markus J. Ankenbrand, Carlos Ariel Yadró Garcia, et al. "Semi-automated sequence curation for reliable reference datasets in ITS2 vascular plant DNA (meta-)barcoding." Scientific Data 11, no. 1 (2024). http://dx.doi.org/10.1038/s41597-024-02962-5.

Full text
Abstract:
One of the most critical steps for accurate taxonomic identification in DNA (meta)-barcoding is to have an accurate DNA reference sequence dataset for the marker of choice. Therefore, developing such a dataset has been a long-term ambition, especially in the Viridiplantae kingdom. Typically, reference datasets are constructed with sequences downloaded from general public databases, which can carry taxonomic and other relevant errors. Herein, we constructed a curated (i) global dataset, (ii) European crop dataset, and (iii) 27 datasets for the EU countries for the ITS2 barcoding marker of vascular plants. To that end, we first developed a pipeline script that entails (i) an automated curation stage comprising five filters, (ii) manual taxonomic correction for misclassified taxa, and (iii) manual addition of newly sequenced species. The pipeline allows easy updating of the curated datasets. With this approach, 13% of the sequences, corresponding to 7% of species originally imported from GenBank, were discarded. Further, 259 sequences were manually added to the curated global dataset, which now comprises 307,977 sequences of 111,382 plant species.
APA, Harvard, Vancouver, ISO, and other styles
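Two of the kinds of automated filters such a curation stage applies, a length window and an ambiguous-base cap, can be sketched with Biopython as below; the thresholds and file names are assumptions, not the paper's exact five-filter pipeline.

```python
from Bio import SeqIO  # pip install biopython

MIN_LEN, MAX_LEN = 100, 700          # plausible ITS2 length window (assumption)
MAX_AMBIGUOUS_FRACTION = 0.01        # tolerated fraction of N bases (assumption)

def passes_filters(record) -> bool:
    seq = str(record.seq).upper()
    if not (MIN_LEN <= len(seq) <= MAX_LEN):
        return False
    if seq.count("N") / len(seq) > MAX_AMBIGUOUS_FRACTION:
        return False
    return True

# Hypothetical input/output file names.
kept = [r for r in SeqIO.parse("its2_raw.fasta", "fasta") if passes_filters(r)]
SeqIO.write(kept, "its2_curated.fasta", "fasta")
```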
50

Krishnasamy, Nandhini, Nilima Zade, Dhruvi Khambholia, Rabinder Henry, and Aditya Gupte. "Ensemble Deep Learning Framework for Hybrid Facial Datasets Using Landmark Detection: State-of-the-Art Tools." Journal of Computational and Cognitive Engineering, April 1, 2025. https://doi.org/10.47852/bonviewjcce52024451.

Full text
Abstract:
Autonomous face emotion recognition (FER) with landmarks has become an important field of research for human–computer interaction, and significant advances have been achieved through deep learning algorithms in recent times. Faces can be recognized using an end-to-end approach with deep learning techniques, which learns a mapping from raw pixels to the target label. In the field of emotion classification, the research community has extensively utilized 98 and 68 facial landmarks. In particular, pre-trained convolutional neural networks such as ResNet-50 (with a random sampler), VGG-16, and MobileNet, including their ensemble versions, are popular among researchers due to their ability to handle complex data. Researchers have mostly evaluated models on a single dataset, which poses a challenge in developing a generalized model capable of capturing the full versatility of emotions; a further challenge is that a single emotion is represented in multiple facial expressions with low-resolution images. This study uses a combined dataset (CK+, KDEF, and FER-2013), which is more challenging than a single dataset, and offers a comprehensive analysis involving 68 and 98 landmarks with different FER deep models, examining how landmarking and different network architectures contribute to emotion recognition accuracy. The study also addresses the overfitting and class imbalance of the proposed ensemble model, which improves its performance through batch-wise feature extraction. Results show 78% accuracy with 98 landmarks and 75% with 68 landmarks. Overall, the model significantly reduces the gap between training and testing accuracy for both single and combined datasets.
APA, Harvard, Vancouver, ISO, and other styles
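One standard remedy for the class imbalance this abstract mentions is weighted resampling. The PyTorch sketch below balances batches over seven emotion classes using dummy landmark features; the dimensions and class count are assumptions based on the described setup.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Dummy data standing in for landmark features; shapes are assumptions.
features = torch.randn(1000, 68 * 2)          # e.g., flattened (x, y) coordinates of 68 landmarks
labels = torch.randint(0, 7, (1000,))          # 7 emotion classes (assumption)

counts = torch.bincount(labels, minlength=7).float()
weights = (1.0 / counts)[labels]               # rarer classes get sampled more often

sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(features, labels), batch_size=64, sampler=sampler)

xb, yb = next(iter(loader))
print(torch.bincount(yb, minlength=7))  # batches are now roughly class-balanced
```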