Contents

Journal articles
Books
Book chapters
Conference papers
Reports

Academic literature on the topic 'Dataset Curation'

Author: Grafiati

Published: 7 June 2025

Last updated: 11 July 2025

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Dataset Curation.'

Journal articles on the topic "Dataset Curation"

1

Koshoffer, Amy, Amy E. Neeser, Linda Newman, and Lisa R. Johnston. "Giving datasets context: a comparison study of institutional repositories that apply varying degrees of curation." International Journal of Digital Curation 13, no. 1 (2018): 15–34. http://dx.doi.org/10.2218/ijdc.v13i1.632.

Abstract:

This research study compared four academic libraries’ approaches to curating the metadata of dataset submissions in their institutional repositories and classified them in one of four categories: no curation, pre-ingest curation, selective curation, and post-ingest curation. The goal is to understand the impact that curation may have on the quality of user-submitted metadata. The findings were 1) the metadata elements varied greatly between institutions, 2) repositories with more options for authors to contribute metadata did not result in more metadata contributed, 3) pre- or post-ingest curation processes could have a measurable impact on the metadata but are difficult to separate from other factors, and 4) datasets submitted to a repository with pre- or post-ingest curation more often included documentation.

2

Xu, Jinda, Yuhao Song, Daming Wang, et al. "Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 20 (2025): 21761–69. https://doi.org/10.1609/aaai.v39i20.35481.

Abstract:

In an era overwhelmed by vast amounts of data, the effective curation of web-crawl datasets is essential for optimizing model performance. This paper tackles the challenges associated with the unstructured and heterogeneous nature of such datasets. Traditional heuristic curation methods often inadequately capture complex features, resulting in biases and the exclusion of relevant data. We introduce an advanced, learning-driven approach, Ensemble Curation Of DAta ThroUgh Multimodal Operators, called EcoDatum, which employs a novel quality-guided deduplication method to balance feature distribution. EcoDatum strategically integrates various unimodal and multimodal data curation operators within a weak supervision ensemble framework, utilizing automated optimization to effectively score each data point. EcoDatum, which significantly improves the data curation quality and efficiency, outperforms existing state-of-the-art (SOTA) techniques, ranking 1st on the DataComp leaderboard with an average performance score of 0.182 across 38 diverse evaluation datasets. This represents a 28% improvement over the DataComp baseline method, demonstrating its effectiveness in improving dataset curation and model training efficiency.

3

Gordon, Ben, Jake Barrett, Clara Fennessy, et al. "Development of a data utility framework to support effective health data curation." BMJ Health & Care Informatics 28, no. 1 (2021): e100303. http://dx.doi.org/10.1136/bmjhci-2020-100303.

Abstract:

Objectives: The value of healthcare data is being increasingly recognised, including the need to improve health dataset utility. There is no established mechanism for evaluating healthcare dataset utility, making it difficult to evaluate the effectiveness of activities improving the data. The objective was to describe the method for generating and involving the user community in developing a proposed framework for evaluation and communication of healthcare dataset utility for given research areas. Methods: An initial version of a matrix to review datasets across a range of dimensions was developed based on previous published findings regarding healthcare data. This was used to initiate a design process through interviews and surveys with data users representing a broad range of user types and use cases, to help develop a focused framework for characterising datasets. Results: Following 21 interviews, 31 survey responses and testing on 43 datasets, five major categories and 13 subcategories were identified as useful for a dataset, including Data Model, Completeness and Linkage. Each sub-category was graded to facilitate rapid and reproducible evaluation of dataset utility for specific use-cases. Testing of applicability to >40 existing datasets demonstrated potential usefulness for subsequent evaluation in real-world practice. Discussion: The research has developed an evidenced-based initial approach for a framework to understand the utility of a healthcare dataset. It is likely to require further refinement following wider application, and additional categories may be required. Conclusion: The process has resulted in a user-centred designed framework for objectively evaluating the likely utility of specific healthcare datasets, and therefore, should be of value both for potential users of health data, and for data custodians to identify the areas to provide the optimal value for data curation investment.

4

Hagar, Nick, and Jack Bandy. "Practical Datasets for Analyzing LLM Corpora Derived from Common Crawl." Proceedings of the International AAAI Conference on Web and Social Media 19 (June 7, 2025): 2454–64. https://doi.org/10.1609/icwsm.v19i1.35948.

Abstract:

Large language models (LLMs) rely heavily on web-derived training datasets, yet understanding how filtering and curation decisions affect these datasets remains challenging. This paper presents two complementary datasets designed to enable systematic analysis of LLM training data composition. The first dataset captures domain-level statistics across 96 Common Crawl snapshots, providing baseline data about web content distribution before filtering. The second dataset contains standardized URL information from three major LLM training corpora (C4, Falcon RefinedWeb, and CulturaX), allowing researchers to analyze how different filtering approaches affect content inclusion. By making these datasets publicly available in a consistent format, we aim to (1) facilitate research into training data composition, (2) enable systematic auditing of filtering effects, and (3) support more transparent approaches to dataset development. Our datasets can help researchers investigate questions related to content diversity, source representation, and the impact of different filtering decisions on training data composition. Overall, this work provides a foundation for understanding how curation choices shape the content that ultimately trains widely-deployed language models.

5

Khaleel, Mohammad A., Amer Hayat Khan, S. M. Sheikh Ghadzi, and Sami Alshakhshir. "Curation of an international drug proprietary names dataset." Data in Brief 40 (February 2022): 107701. http://dx.doi.org/10.1016/j.dib.2021.107701.

6

Celino, Irene. "Geospatial dataset curation through a location-based game." Semantic Web 6, no. 2 (2015): 121–30. http://dx.doi.org/10.3233/sw-130129.

7

Sicilia, Miguel-Angel, Elena García-Barriocanal, and Salvador Sánchez-Alonso. "Community Curation in Open Dataset Repositories: Insights from Zenodo." Procedia Computer Science 106 (2017): 54–60. http://dx.doi.org/10.1016/j.procs.2017.03.009.

8

Kleebauer, Maximilian, Stefan Karamanski, Doron Callies, and Martin Braun. "A Wind Turbines Dataset for South Africa: OpenStreetMap Data, Deep Learning Based Geo-Coordinate Correction and Capacity Analysis." ISPRS International Journal of Geo-Information 14, no. 6 (2025): 232. https://doi.org/10.3390/ijgi14060232.

Abstract:

Accurate and detailed spatial data on wind energy infrastructure is essential for renewable energy planning, grid integration, and system analysis. However, publicly available datasets often suffer from limited spatial accuracy, missing attributes, and inconsistent metadata. To address these challenges, this study presents a harmonized and spatially refined dataset of wind turbines in South Africa, combining OpenStreetMap (OSM) data with high-resolution satellite imagery, deep learning-based coordinate correction, and manual curation. The dataset includes 1487 turbines across 42 wind farms, representing over 3.9 GW of installed capacity as of 2025. Of this, more than 3.6 GW is currently operational. The Geo-Coordinates were validated and corrected using a RetinaNet-based object detection model applied to both Google and Bing satellite imagery. Instead of relying solely on spatial precision, the curation process emphasized attribute completeness and consistency. Through systematic verification and cross-referencing with multiple public sources, the final dataset achieves a high level of attribute completeness and internal consistency across all turbines, including turbine type, rated capacity, and commissioning year. The resulting dataset is the most accurate and comprehensive publicly available dataset on wind turbines in South Africa to date. It provides a robust foundation for spatial analysis, energy modeling, and policy assessment related to wind energy development. The dataset is publicly available.

9

Barbosa, Susana, Nuno Dias, Carlos Almeida, et al. "The SAIL dataset of marine atmospheric electric field observations over the Atlantic Ocean." Earth System Science Data 17, no. 4 (2025): 1393–405. https://doi.org/10.5194/essd-17-1393-2025.

Abstract:

A unique dataset of marine atmospheric electric field observations over the Atlantic Ocean is described. The data are relevant not only for atmospheric electricity studies, but more generally for studies of the Earth's atmosphere and climate variability, as well as space–Earth interaction studies. In addition to the atmospheric electric field data, the dataset includes simultaneous measurements of other atmospheric variables, including gamma radiation, visibility, and solar radiation. These ancillary observations not only support interpretation and understanding of the atmospheric electric field data, but also are of interest in themselves. The entire framework from data collection to final derived datasets has been duly documented to ensure traceability and reproducibility of the whole data curation chain. All the data, from raw measurements to final datasets, are preserved in data repositories with a corresponding assigned DOI. Final datasets are available from the Figshare repository (https://figshare.com/projects/SAIL_Data/178500, SAIL Data, 2025), and computational notebooks containing the code used at every step of the data curation chain are available from the Zenodo repository (https://zenodo.org/communities/sail, Project SAIL community, 2025).

10

Landry, Latrice, Mary Lucas, Anietie Andy, and Ebelechukwu Nwafor. "Artificial Intelligence Assisted Curation of Population Groups in Biomedical Literature." International Journal of Digital Curation 18, no. 1 (2024): 9. http://dx.doi.org/10.2218/ijdc.v18i1.950.

Abstract:

Curation of the growing body of published biomedical research is of great importance to both the synthesis of contemporary science and the archiving of historical biomedical literature. Each of these tasks has become increasingly challenging given the expansion of journal titles, preprint repositories and electronic databases. Added to this challenge is the need for curation of biomedical literature across population groups to better capture study populations for improved understanding of the generalizability of findings. To address this, our study aims to explore the use of generative artificial intelligence (AI) in the form of large language models (LLMs) such as GPT-4 as an AI curation assistant for the task of curating biomedical literature for population groups. We conducted a series of experiments which qualitatively and quantitatively evaluate the performance of OpenAI’s GPT-4 in curating population information from biomedical literature. Using OpenAI’s GPT-4 and curation instructions, executed through prompts, we evaluate the ability of GPT-4 to classify study ‘populations’, ‘continents’ and ‘countries’ from a previously curated dataset of public health COVID-19 studies. Using three different experimental approaches, we examined performance by: A) evaluation of accuracy (concordance with human curation) using both exact and approximate string matches within a single experimental approach; B) evaluation of accuracy across experimental approaches; and C) conducting a qualitative phenomenology analysis to describe and classify the nature of difference between human curation and GPT curation. Our study shows that GPT-4 has the potential to provide assistance in the curation of population groups in biomedical literature. Additionally, phenomenology provided key information for prompt design that further improved the LLM’s performance in these tasks. 
Future research should aim to improve prompt design, as well as explore other generative AI models to improve curation performance. An increased understanding of the populations included in research studies is critical for the interpretation of findings, and we believe this study provides keen insight on the potential to increase the scalability of population curation in biomedical studies.

Books on the topic "Dataset Curation"

1

Databrarianship: The Academic Data Librarian in Theory and Practice. American Library Association, 2016.

2

Mons, Barend. Data Stewardship for Open Science: Implementing FAIR Principles. Taylor & Francis Group, 2018.

3

Mons, Barend. Data Stewardship for Open Science. Taylor & Francis Group, 2021.

Book chapters on the topic "Dataset Curation"

1

Andronov, Mikhail, Natalia Andronova, Michael Wand, Jürgen Schmidhuber, and Djork-Arné Clevert. "Curating Reagents in Chemical Reaction Data with an Interactive Reagent Space Map." In Lecture Notes in Computer Science. Springer Nature Switzerland, 2024. http://dx.doi.org/10.1007/978-3-031-72381-0_3.

Abstract:

The increasing use of machine learning and artificial intelligence in chemical reaction studies demands high-quality reaction data, necessitating specialized tools enabling data understanding and curation. Our work introduces a novel methodology for reaction data examination centered on reagents - essential molecules in reactions that do not contribute atoms to products. We propose an intuitive tool for creating interactive reagent space maps using distributed vector representations, akin to word2vec in Natural Language Processing, capturing the statistics of reagent usage within datasets. Our approach enables swift assessment of reagent action patterns and identification of erroneous reagent entries, which we demonstrate using the USPTO dataset. Our contributions include an open-source web application for visual reagent pattern analysis and a table cataloging around six hundred of the most frequent reagents in USPTO annotated with detailed roles. Our method aims to support organic chemists and cheminformatics experts in reaction data curation routine.

2

Scholl, Philipp M., Benjamin Völker, Bernd Becker, and Kristof Van Laerhoven. "A Multi-media Exchange Format for Time-Series Dataset Curation." In Human Activity Sensing. Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-13001-5_8.

3

Singh, Param, Kamlesh Dutta, Robert Kaye, and Suyash Garg. "Music Listening History Dataset Curation and Distributed Music Recommendation Engines Using Collaborative Filtering." In Proceedings of ICETIT 2019. Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-30577-2_55.

4

Joshi, Keyur, Angelina Aziz, Philip Dietrich, and Markus König. "Efficient Data Curation Using Active Learning for a Video-Based Fire Detection." In CONVR 2023 - Proceedings of the 23rd International Conference on Construction Applications of Virtual Reality. Firenze University Press, 2023. http://dx.doi.org/10.36253/979-12-215-0289-3.60.

Abstract:

Video-based fire detection is a crucial object detection problem that relies on accurate and reliable data to detect fires. However, collecting and labeling fire-related data can be time-consuming and expensive, making it difficult to obtain sufficient data for training machine learning models. To address this challenge, uncertainty-based active learning techniques can be used to iteratively select the most informative samples for labeling. This can reduce the amount of labeled data needed to achieve high model performance and has the potential to even prune the training data with fewer informative samples. The traditional sampling-based uncertainty estimation methods are computationally expensive. Hence, an efficient prior network-based ensemble distillation State-of-the-Art approach is evaluated on an internal dataset which still requires relatively higher overhead computation making it difficult for production deployment. A biased softmax differencing-based uncertainty approach and a feature-based hard data mining approach are proposed and compared with the distillation approach. The novel approaches are found to have a very low overhead uncertainty estimation time compared to the ensemble distillation approach and traditional sampling techniques. The methods are evaluated in the context of curating the unlabeled pool data and improving the training data. For completeness, the experiments are performed on three different data sizes, and overall, the frame-wise selection strategy is proved to be better than the sequence-wise querying strategy. The Principal Component Analysis (PCA)-based hard data mining outperformed other methods and improved the model performance by 16.33% with AUC2% metric when compared with the random selection of data. The approach even outperformed the main network trained on full data by 7.33%, henceforth improving the training data by using informative 26.39% data. 
The results indicate that novel data mining provides efficient training and pool data curation.

5

Wolff, Benjamin, Eva Seidlmayer, and Konrad U. Förstner. "Enriched BERT Embeddings for Scholarly Publication Classification." In Lecture Notes in Computer Science. Springer Nature Switzerland, 2024. http://dx.doi.org/10.1007/978-3-031-65794-8_16.

Abstract:

With the rapid expansion of academic literature and the proliferation of preprints, researchers face growing challenges in manually organizing and labeling large volumes of articles. The NSLP 2024 FoRC Shared Task I addresses this challenge organized as a competition. The goal is to develop a classifier capable of predicting one of 123 predefined classes from the Open Research Knowledge Graph (ORKG) taxonomy of research fields for a given article. This paper presents our results. Initially, we enrich the dataset (containing English scholarly articles sourced from ORKG and arXiv), then leverage different pre-trained language models (PLMs), specifically BERT, and explore their efficacy in transfer learning for this downstream task. Our experiments encompass feature-based and fine-tuned transfer learning approaches using diverse PLMs, optimized for scientific tasks, including SciBERT, SciNCL, and SPECTER2. We conduct hyperparameter tuning and investigate the impact of data augmentation from bibliographic databases such as OpenAlex, Semantic Scholar, and Crossref. Our results demonstrate that fine-tuning pre-trained models substantially enhances classification performance, with SPECTER2 emerging as the most accurate model. Moreover, enriching the dataset with additional metadata improves classification outcomes significantly, especially when integrating information from S2AG, OpenAlex and Crossref. Our best-performing approach achieves a weighted F1-score of 0.7415. Overall, our study contributes to the advancement of reliable automated systems for scholarly publication categorization, offering a potential solution to the laborious manual curation process, thereby facilitating researchers in efficiently locating relevant resources.

6

Bayraktar, M., Y. E. Bacik, O. Sert, A. Aldemir, and B. Güldür Erkal. "A Curation of Image Datasets for Urban Segmentation Applications." In Lecture Notes in Civil Engineering. Springer Nature Switzerland, 2024. http://dx.doi.org/10.1007/978-3-031-57357-6_43.

7

Ambure, Pravin, and M. Natália Dias Soeiro Cordeiro. "Importance of Data Curation in QSAR Studies Especially While Modeling Large-Size Datasets." In Methods in Pharmacology and Toxicology. Springer US, 2020. http://dx.doi.org/10.1007/978-1-0716-0150-1_5.

8

Gonçalves, Rafael, Filipe Gouveia, Inês Lynce, and José Fragoso Santos. "Proxy Attribute Discovery in Machine Learning Datasets via Inductive Logic Programming." In Lecture Notes in Computer Science. Springer Nature Switzerland, 2025. https://doi.org/10.1007/978-3-031-90653-4_17.

Abstract:

The issue of fairness is a well-known challenge in Machine Learning (ML) that has gained increased importance with the emergence of Large Language Models (LLMs) and generative AI. Algorithmic bias can manifest during the training of ML models due to the presence of sensitive attributes, such as gender or racial identity. One approach to mitigate bias is to avoid making decisions based on these protected attributes. However, indirect discrimination can still occur if sensitive information is inferred from proxy attributes. To prevent this, there is a growing interest in detecting potential proxy attributes before training ML models. In this case study, we report on the use of Inductive Logic Programming (ILP) to discover proxy attributes in training datasets, with a focus on the ML classification problem. While ILP has established applications in program synthesis and data curation, we demonstrate that it can also advance the state of the art in proxy attribute discovery by removing the need for prior domain knowledge. Our evaluation shows that this approach is effective at detecting potential sources of indirect discrimination, having successfully identified proxy attributes in several well-known datasets used in fairness-awareness studies.

9

Guziolowski, Carito, Jeremy Gruel, Ovidiu Radulescu, and Anne Siegel. "Curating a Large-Scale Regulatory Network by Evaluating Its Consistency with Expression Datasets." In Computational Intelligence Methods for Bioinformatics and Biostatistics. Springer Berlin Heidelberg, 2009. http://dx.doi.org/10.1007/978-3-642-02504-4_13.

Conference papers on the topic "Dataset Curation"

1

Yanuka, Moran, Morris Alper, Hadar Averbuch-Elor, and Raja Giryes. "ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation." In Findings of the Association for Computational Linguistics ACL 2024. Association for Computational Linguistics, 2024. http://dx.doi.org/10.18653/v1/2024.findings-acl.657.

2

Ientilucci, Emmett J., and Ahmed Shayer Andalib. "GRSS Data Curation: Benchmark UAV Dataset for Hyperspectral Target Detection Studies." In IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2024. http://dx.doi.org/10.1109/igarss53475.2024.10641190.

3

Shukla, Shruti, Dimitris A. Pados, Kavita Varma, George Sklivanitis, Elizabeth S. Bentley, and Michael J. Medley. "AI/ML curation of AI/ML training datasets." In Machine Learning from Challenging Data 2025, edited by George Sklivanitis, Panagiotis Markopoulos, and Bing Ouyang. SPIE, 2025. https://doi.org/10.1117/12.3055515.

4

Impiö, Mikko, Philipp M. Rehsen, and Jenni Raitoharju. "Efficient Curation of Invertebrate Image Datasets Using Feature Embeddings and Automatic Size Comparison." In 2025 IEEE Symposia on Computational Intelligence for Energy, Transport and Environmental Sustainability (CIETES). IEEE, 2025. https://doi.org/10.1109/cietes63869.2025.10995211.

5

Reddy, Divya D., Niloufar Saadat, James M. Holcomb, et al. "Advancing Brain Tumor Analysis: Curating a High-Quality MRI Dataset for Deep Learning-Based Molecular Marker Profiling." In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2024. http://dx.doi.org/10.1109/cvprw63382.2024.00243.

6

Parajuli, Paridhi, Rajat Shinde, Iksha Gurung, Manil Maskey, and Rahul Ramachandran. "Curating AI-Ready Datasets for Equity and Environmental Justice: A Data-Centric AI Case Study." In IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2024. http://dx.doi.org/10.1109/igarss53475.2024.10641786.

7

Sadat, Abbas, Sean Segal, Sergio Casas, et al. "Diverse Complexity Measures for Dataset Curation in Self-Driving." In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021. http://dx.doi.org/10.1109/iros51168.2021.9636869.

8

Scholl, Philipp M., and Kristof Van Laerhoven. "A multi-media exchange format for time-series dataset curation." In UbiComp '16: The 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2016. http://dx.doi.org/10.1145/2968219.2968278.

9

Calatroni, Alberto, Daniel Roggen, and Gerhard Troster. "Collection and curation of a large reference dataset for activity recognition." In 2011 IEEE International Conference on Systems, Man and Cybernetics - SMC. IEEE, 2011. http://dx.doi.org/10.1109/icsmc.2011.6083638.

10

Zhou, Tong, Yubo Chen, Pengfei Cao, Kang Liu, Shengping Liu, and Jun Zhao. "Oasis: Data Curation and Assessment System for Pretraining of Large Language Models." In Thirty-Third International Joint Conference on Artificial Intelligence {IJCAI-24}. International Joint Conferences on Artificial Intelligence Organization, 2024. http://dx.doi.org/10.24963/ijcai.2024/1048.

Abstract:

Data is one of the most critical elements in building a large language model. However, existing systems either fail to customize a corpus curation pipeline or neglect to leverage comprehensive corpus assessment for iterative optimization of the curation. To this end, we present a pretraining corpus curation and assessment platform called Oasis — a one-stop system for data quality improvement and quantification with user-friendly interactive interfaces. Specifically, the interactive modular rule filter module can devise customized rules according to explicit feedback. The debiased neural filter module builds the quality classification dataset in a negative-centric manner to remove the undesired bias. The adaptive document deduplication module could execute large-scale deduplication with limited memory resources. These three parts constitute the customized data curation module. And in the holistic data assessment module, a corpus can be assessed in local and global views, with three evaluation means including human, GPT-4, and heuristic metrics. We exhibit a complete process to use Oasis for the curation and assessment of pretraining data. In addition, an 800GB bilingual corpus curated by Oasis is publicly released.

Reports on the topic "Dataset Curation"

1

Soenen, Karen, Danie Kinkade, Adam Shepherd, et al. Fitting square pegs into a round hole. Curating heterogeneous oceanographic data at BCO-DMO. Woods Hole Oceanographic Institution, 2024. http://dx.doi.org/10.1575/1912/67676.

Abstract:

BCO-DMO is a domain-specific repository containing 18 years of curated, heterogeneous oceanographic data. Data managers are at the core of the repository, applying the F.A.I.R. principles to every dataset coming in. This talk steers the audience through such a curated dataset, covering the advancements and challenges that come with domain curation.
