Academic literature on the topic 'TDWG BDQIG "Data Quality" "Fitness for Use" Framework "Tests and Assertions" quality vocabularies'


Journal articles on the topic 'TDWG BDQIG "Data Quality" "Fitness for Use" Framework "Tests and Assertions" quality vocabularies'

1

Chapman, Arthur, Antonio Saraiva, Lee Belbin, et al. "Fitness for Use: The BDQIG aims for improved Stability and Consistency." Biodiversity Information Science and Standards 1 (August 14, 2017): e20240. https://doi.org/10.3897/tdwgproceedings.1.20240.

Abstract:
The process of choosing data for a project and then determining what subset of records is suitable for use has become one of the most important concerns for biodiversity researchers in the 21st century. The rise of large data aggregators such as GBIF (Global Biodiversity Information Facility), iDigBio (Integrated Digitized Biocollections), the ALA (Atlas of Living Australia) and its many clones, OBIS (Ocean Biogeographic Information System), SIBBr (Sistema de Informação sobre a Biodiversidade Brasileira), CRIA (Centro de Referência em Informação Ambiental) and many others has made access to large volumes of data easier, but choosing which data are fit for use remains a more difficult task. There has been no consistency between the various aggregators on how best to clean and document data quality – how tests are run, or how annotations are stored and reported. Feedback to data custodians on possible errors has been minimal and inconsistent, and adherence to recommendations and controlled vocabularies (where they exist) has been haphazard, to say the least. The TDWG Data Quality Interest Group is addressing these issues, either alone or in conjunction with other Interest Groups (Annotations, Darwin Core, Invasive Species, Citizen Science and Vocabulary Maintenance), to develop a framework, tests and assertions, use cases, and controlled vocabularies. The Interest Group is also working closely with the data aggregators toward consistent implementations. The practical work is being done through five Task Groups. A published framework is leading to a user-friendly Fitness for Use Backbone (FFUB) and data quality profiles by which users can document the quality they need for a project. A standard set of core tests and assertions has been developed around the Darwin Core standard and is currently being tested and integrated into several aggregators. A use case library has been compiled, and these cases will lead to themed data quality profiles as part of the FFUB. Two new Task Groups are being established to develop controlled vocabularies to address the inconsistencies in the values of at least 40 Darwin Core terms. These inconsistencies make the evaluation of fitness for use far more difficult than it would be if controlled vocabularies were used. The first TG is looking at vocabularies generally, while the second is looking at those pertaining just to Invasive Species. It is not just the aggregators, though, that are the stakeholders in this work. The data custodians and even the collectors have a vested interest in ensuring their data and metadata are of the highest quality, and therefore in seeing their data used widely. It is only after aggregation that many uses of the data become apparent, and most collectors aren't aware of these uses at the time of collecting. Issues of data quality at the time of collection can restrict the range of later uses of the data. Feeding information on suspect records back to the data custodians from users and aggregators is essential, and this is where annotations and reporting back on the results of tests conducted by aggregators are important. The project is also generating standard code and test data for the tests and assertions so that data custodians can readily integrate them into their own procedures. It is far cheaper to correct errors at the source than to try to rectify them further down the line. A lot of progress has been made, but we still have a long way to go – join us in making biodiversity data quality a product of which we can all be proud.
2

Belbin, Lee, Arthur Chapman, John Wieczorek, Paula Zermoglio, and Paul Morris. "Data Quality Task Group 2: Tests and Assertions." Biodiversity Information Science and Standards 3 (July 10, 2019): e35626. https://doi.org/10.3897/biss.3.35626.

Abstract:
The 'Data Quality Tests and Assertions' Task Group 2 (https://www.tdwg.org/community/bdq/tg-2/) has taken another year to clarify the 102 tests (https://github.com/tdwg/bdq/issues?q=is%3Aissue+is%3Aopen+label%3ATest). The original mandate to develop a core suite of tests that could be widely applied from data collection to user evaluation of aggregated data seemed straightforward. Two years down the track, we have proven that to be incorrect. Among the final tests are complexities that none of the core group anticipated, for example, the need for a definition of 'empty', or for specifying the 'Expected response' of a test under various scenarios. The record-based tests apply to Darwin Core terms (https://dwc.tdwg.org/terms/) and have been classified by output type as validation (66), amendment (29), notification (3) or measure (5). Validations test one or more Darwin Core terms against known characteristics, for example, VALIDATION_MONTH_NOTSTANDARD. Amendments may be applied to Darwin Core terms where we can unambiguously offer an improvement to the record, for example, AMENDMENT_MONTH_STANDARDIZED. Notifications are made where we believe a flag will help alert users to an issue that needs evaluation, for example, NOTIFICATION_DATAGENERALIZATIONS_NOTEMPTY. Measures are summaries of test outcomes at the record level, for example, MEASURE_AMENDMENTS_PROPOSED. We note that 41 tests require some parameters to be established at the time of test implementation, 20 tests require access to a currently accepted vocabulary, and 3 tests rely on ISO/DCMI standards. The dependency on vocabularies to circumscribe permissible values for Darwin Core terms led to the establishment by Paula Zermoglio of DQ Task Group 4 (https://github.com/tdwg/bdq/tree/master/Vocabularies). A vocabulary of 154 terms associated with the tests and assertions has been developed. At the time of writing this abstract, test data and a demonstration code implementation of each test are yet to be completed. We hope these will be finalized by the time of this presentation.
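The validation/amendment pairing described in this abstract lends itself to a compact illustration. Below is a minimal Python sketch, not the Task Group's reference implementation: the test names VALIDATION_MONTH_NOTSTANDARD and AMENDMENT_MONTH_STANDARDIZED come from the abstract, while the response dictionaries and status labels are assumptions that only loosely follow the framework's 'Expected response' phrasing (e.g., "internal prerequisites not met", "not compliant").

```python
# Illustrative sketch only, not the official TDWG/BDQ implementation.
# Response statuses and result labels are assumptions modelled on the
# framework's 'Expected response' phrasing.

MONTH_NAMES = {"january": "1", "february": "2", "march": "3", "april": "4",
               "may": "5", "june": "6", "july": "7", "august": "8",
               "september": "9", "october": "10", "november": "11",
               "december": "12"}

def validation_month_notstandard(month):
    """VALIDATION: is dwc:month an integer in the range 1..12?"""
    if month is None or str(month).strip() == "":
        return {"status": "INTERNAL_PREREQUISITES_NOT_MET",
                "comment": "dwc:month is empty"}
    try:
        value = int(str(month).strip())
    except ValueError:
        return {"status": "RUN_HAS_RESULT", "result": "NOT_COMPLIANT"}
    compliant = 1 <= value <= 12
    return {"status": "RUN_HAS_RESULT",
            "result": "COMPLIANT" if compliant else "NOT_COMPLIANT"}

def amendment_month_standardized(month):
    """AMENDMENT: propose an unambiguous standard value for dwc:month."""
    if month is None or str(month).strip() == "":
        return {"status": "INTERNAL_PREREQUISITES_NOT_MET"}
    key = str(month).strip().lower()
    if key in MONTH_NAMES:
        return {"status": "AMENDED", "result": {"dwc:month": MONTH_NAMES[key]}}
    return {"status": "NOT_AMENDED"}

print(validation_month_notstandard("13"))   # NOT_COMPLIANT
print(amendment_month_standardized("May"))  # proposes dwc:month = "5"
```

Note that the amendment does not overwrite the record: it proposes a value that a curator or downstream system may accept, consistent with the abstract's framing of amendments as offered improvements.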
3

Chapman, Arthur, Lee Belbin, Paula Zermoglio, et al. "Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data." Biodiversity Information Science and Standards 4 (March 20, 2020): e50889. https://doi.org/10.3897/biss.4.50889.

Abstract:
The quality of biodiversity data publicly accessible via aggregators such as GBIF (Global Biodiversity Information Facility), the ALA (Atlas of Living Australia), iDigBio (Integrated Digitized Biocollections), and OBIS (Ocean Biogeographic Information System) is often questioned, especially by the research community. The Data Quality Interest Group, established by Biodiversity Information Standards (TDWG) and GBIF, has been engaged in four main activities: developing a framework for the assessment and management of data quality using a fitness for use approach; defining a core set of standardised tests and associated assertions based on Darwin Core terms; gathering and classifying user stories to form contextually themed use cases, such as species distribution modelling, agrobiodiversity, and invasive species; and developing a standardised format for building and managing controlled vocabularies of values. Using the developed framework, data quality profiles have been built from use cases to represent user needs. Quality assertions can then be used to filter data suitable for a purpose. The assertions can also be used to provide feedback to data providers and custodians to assist in improving data quality at the source. A case study, using two different implementations of tests and assertions based around the Darwin Core "Event Date" terms, was also run against GBIF data to demonstrate that the tests are implementation agnostic, can be run on large aggregated datasets, and can make biodiversity data more fit for typical research uses.
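The pipeline this abstract describes (use cases yield data quality profiles, and assertions against a profile filter records fit for a purpose) can be made concrete in a few lines. The profile name SDM_PROFILE and both validation helpers below are hypothetical illustrations, not the published profiles or tests:

```python
# Hypothetical sketch of a data quality profile: a named set of
# validations whose assertions filter records fit for a given use.

def validation_eventdate_notempty(record):
    """Assert that dwc:eventDate carries a value."""
    return bool(str(record.get("dwc:eventDate", "")).strip())

def validation_decimallatitude_inrange(record):
    """Assert that dwc:decimalLatitude parses and lies in [-90, 90]."""
    try:
        return -90.0 <= float(record["dwc:decimalLatitude"]) <= 90.0
    except (KeyError, ValueError):
        return False

# An illustrative profile for, say, species distribution modelling.
SDM_PROFILE = [validation_eventdate_notempty,
               validation_decimallatitude_inrange]

def fit_for_use(record, profile):
    """A record is fit for this purpose if every validation passes."""
    return all(test(record) for test in profile)

records = [{"dwc:eventDate": "2020-03-20", "dwc:decimalLatitude": "-33.9"},
           {"dwc:eventDate": "", "dwc:decimalLatitude": "95"}]
print([fit_for_use(r, SDM_PROFILE) for r in records])  # [True, False]
```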
4

Belbin, Lee, Arthur Chapman, John Wieczorek, Paul J. Morris, and Paula Zermoglio. "Task Group 2 – Data Quality Tests and Assertions." Biodiversity Information Science and Standards 4 (October 1, 2020): e58982. https://doi.org/10.3897/biss.4.58982.

Abstract:
Motivation
Other than data availability, 'Data Quality' is probably the most significant issue for users of biodiversity data, and this is especially so for the research community. The Data Quality Tests and Assertions Task Group (TG-2) of the Biodiversity Information Standards (TDWG) Biodiversity Data Quality Interest Group is reviewing practical aspects relating to 'data quality' with a goal of providing a current best practice at the key interface between data users and data providers: tests and assertions. If an internationally agreed standard suite of core tests and resulting assertions can be used by all data providers and aggregators, and hopefully data collectors, then greater and more appropriate use could be made of biodiversity data. By adopting this suite of core tests, data providers and particularly aggregators such as the Global Biodiversity Information Facility (GBIF) and its nodes would have increased credibility with the user communities and could provide more effective information for evaluating 'fitness for use'.

Goals, Outputs and Outcomes
- A standard core (fundamental) set of tests and associated assertions based around Darwin Core terms
- A standard suite of descriptive fields for each test
- Broad deployment of the tests, from collector to aggregator
- A set of basic principles for the creation of tests/assertions
- Software that provides an example implementation of each test
- Data that can be used to validate an implementation of the tests
- A publication that captures the knowledge built during the creation of the tests/assertions

Strategy
The tests and rules generating assertions at the record level are more fundamental than the tools or workflows that will be based on them. The priority is to create a fully documented suite of core tests that define a framework for ready extension across terms and domains.

Status 2019-2020
The core tests have proven to be far more complex than any of the team had anticipated. Several times over the past three years, we believed we had finalized the tests, only to find new issues that have required a fresh understanding and subsequent edits, e.g., the recent dropping of the two tests related to dwc:identificationQualifier: TG2-VALIDATION_IDENTIFICATIONQUALIFIER_DETECTED and TG2-AMENDMENT_IDENTIFICATIONQUALIFIER_FROM_TAXON. This decision resulted from a review of dwc:identificationQualifier values in GBIF records and an evaluation of expected values based on the Darwin Core definition of the term. Aside from there being many values, the term expects the qualifier in relation to a given taxonomic name, and the rules of open nomenclature are too unevenly adopted across data records for these tests to reliably parse and detect dwc:identificationQualifier. A similar situation occurs for dwc:scientificName, where we have resorted to the term "polynomial" to refer to the non-authorship part of dwc:scientificName.

What has occurred during the past year?
- Months of work on discussions and edits to the GitHub issues (mainly the tests), conducted mainly via Zoom and email.
- We had hoped to have a face-to-face meeting in Bariloche, Argentina early in 2020, but the coronavirus pandemic stopped that. This was unfortunate, as we needed that meeting to discuss the remaining complex issues noted above; attempting to address such issues by Zoom has been far less efficient.
- We are occasionally re-visiting decisions made years earlier, an indication that we have been doing this work for (too) many years.
- We have now standardized all the test parameters for the 99 CORE tests. Much work has gone into standardizing the phrasing and terminology within the 'Expected response' field of the tests, the field that most clearly defines each test.
- Two of the test fields that have taken most of our time to resolve have been 'Parameters' and what we now call 'bdq:sourceAuthority' (Chapman et al. 2020a). These are now complete. The work on 'Parameters' has fed into Task Group 4 on Vocabularies of Values (see Vocabularies needed for Darwin Core terms, prepared by TG4).
- We have published the work from the Data Quality Interest and Task Groups: Chapman et al. 2020b.
- We have extended the vocabulary used for the Tests and Assertions.
- Development of the datasets that validate the implementation of the tests continues.
- We recognize the dependence on the work of the Annotations Interest Group for the results from the tests to have maximal impact; it is important that test results stay with the records.

We will provide details of the challenges, the breakdown of the tests, and the advances of the project.
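The 'Parameters' and bdq:sourceAuthority fields discussed above can be pictured as a test whose behaviour is configured rather than hard-coded. In the hedged sketch below, the function name and the truncated country-code set are illustrative stand-ins; an actual implementation would consult ISO 3166 or whatever source authority the parameter designates:

```python
# Illustrative sketch of a parameterised test with a bdq:sourceAuthority.
# The default authority here is a truncated stand-in, not ISO 3166 itself.

ISO_COUNTRY_CODES = {"AR", "AU", "BR", "US"}  # stand-in authority

def validation_countrycode_standard(record, source_authority=ISO_COUNTRY_CODES):
    """Is dwc:countryCode found in the configured source authority?"""
    value = str(record.get("dwc:countryCode", "")).strip()
    if not value:
        return {"status": "INTERNAL_PREREQUISITES_NOT_MET",
                "comment": "dwc:countryCode is empty"}
    result = "COMPLIANT" if value in source_authority else "NOT_COMPLIANT"
    return {"status": "RUN_HAS_RESULT", "result": result}

# The parameter lets implementations behave differently in clearly
# defined ways, e.g. substituting a national checklist for the default.
print(validation_countrycode_standard({"dwc:countryCode": "AU"}))
```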
5

Belbin, Lee, Arthur Chapman, Paul J. Morris, and John Wieczorek. "It Takes Years for a Good Wine to Mature: Task Group 2 - data quality tests and assertions." Biodiversity Information Science and Standards 6 (August 1, 2022): e91078. https://doi.org/10.3897/biss.6.91078.

Abstract:
Data Quality Task Group 2 was established to create a suite of core tests and associated assertions about the 'quality' of biodiversity informatics data (Chapman et al. 2020). The group has been active since January 2017, about four years longer than its four main members would have anticipated. We all thought "How hard could it be?" The answer was "Harder than we thought!" We have invested well over two years full time into this project. There were multiple times over the past five years when we thought we were 95% done, but we were wrong. Were we dumb? I doubt it! The authors (other than the lead author) are highly experienced in biodiversity data quality, Darwin Core and data testing. Neither were we lazy.

Why has it gone so slowly? It is mostly due to the complexity of the task and the inability to meet face-to-face. Zoom just doesn't cut it for this type of work. We achieved the most at our one face-to-face meeting in Gainesville (Florida) in 2018. Our advances over the past year have come from rounds of feedback between the test specifications, test implementation, development of data for validating the tests, and comparison between the results from implementations and the expectations of the validation data. There are hopefully useful lessons in this for similar projects.

We now have a solid base where future evolution, such as tests for specific environments, will be made relatively easy. The major components of this project are the 99 tests themselves, the parameters for these tests (see https://github.com/tdwg/bdq/issues/122), a vocabulary of the terms used in the framework, and test data for validating implementations of the tests. We remain focused on what we call core tests: those that provide power in evaluating 'fitness for use', are widely applicable, and are relatively easy to implement. The test descriptions we have settled on are:
- A human readable label (split into a test class, a target Darwin Core term and an 'action');
- A Globally Unique Identifier for the test (a GUID);
- A simple English description;
- Test class from the Fitness-For-Use Framework (Data Quality Task Group 1): Validation, Amendment, Measure or Issue;
- Resource Type (all of the Core tests operate on a single record);
- Information Elements (specified as the applicable Darwin Core Class and as a list of specific Darwin Core terms required as inputs for the test);
- Specification (an explanation of how the test works from an implementation perspective);
- Data quality dimension (from the Fitness-for-Use Framework);
- Warning type (ambiguous, amended, incomplete, invalid, issue, report, unlikely);
- Parameters (options that allow implementations to behave differently in clearly defined ways, such as the use of a national species list);
- Source Authority (external references required by the test);
- An example;
- Source (the origin of the test);
- References;
- Link to reference implementations;
- Link to source code; and
- Notes (explanations of subtle or not so subtle aspects of the test).

The composition of the core tests has been stable for over a year. We have generated most of the test data using the template: the applicable test, a unique identifier, input data, expected output data, the response status (e.g., "internal prerequisites not met"), the response result (e.g., "not compliant"), and an optional comment. What remains to be done? We need to complete the test data, produce normative and non-normative documentation, and transform our work into a TDWG Technical Specification. While TG2 is over 95% complete, we would still welcome contributions from anyone interested in learning about biodiversity data quality.
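The test-data template named in the closing paragraph (applicable test, unique identifier, input data, expected output, response status, response result, optional comment) might be rendered roughly as follows. Field names, identifiers and the toy implementation are hypothetical, not the Task Group's published validation dataset:

```python
# Hypothetical rendering of the validation-data template described above.

VALIDATION_DATA = [
    {"test": "VALIDATION_MONTH_NOTSTANDARD", "id": "tg2-demo-001",
     "input": {"dwc:month": "5"},
     "expected": {"status": "RUN_HAS_RESULT", "result": "COMPLIANT"},
     "comment": None},
    {"test": "VALIDATION_MONTH_NOTSTANDARD", "id": "tg2-demo-002",
     "input": {"dwc:month": ""},
     "expected": {"status": "INTERNAL_PREREQUISITES_NOT_MET", "result": None},
     "comment": "empty dwc:month"},
]

def check_implementation(impl, rows):
    """Compare an implementation's responses against the expected data."""
    failures = []
    for row in rows:
        got = impl(row["input"])
        for key, want in row["expected"].items():
            if want is not None and got.get(key) != want:
                failures.append((row["id"], key, want, got.get(key)))
    return failures

def demo_impl(data):
    """A toy month validation used only to exercise the harness."""
    value = str(data.get("dwc:month", "")).strip()
    if not value:
        return {"status": "INTERNAL_PREREQUISITES_NOT_MET"}
    ok = value.isdigit() and 1 <= int(value) <= 12
    return {"status": "RUN_HAS_RESULT",
            "result": "COMPLIANT" if ok else "NOT_COMPLIANT"}

print(check_implementation(demo_impl, VALIDATION_DATA))  # [] when conformant
```

This mirrors the feedback loop the abstract credits for the past year's progress: run an implementation against the validation data, compare responses with expectations, and refine the specification where they diverge.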
6

Belbin, Lee, Arthur Chapman, John Wieczorek, Paula Zermoglio, Alex Thompson, and Paul Morris. "Data Quality Task Group 2: Tests and Assertions." Biodiversity Information Science and Standards 2 (May 18, 2018): e25608. https://doi.org/10.3897/biss.2.25608.

Abstract:
Task Group 2 of the TDWG Data Quality Interest Group aims to provide a standard suite of tests and resulting assertions that can assist with filtering occurrence records for as many applications as possible. Currently, 'data aggregators' such as the Global Biodiversity Information Facility (GBIF), the Atlas of Living Australia (ALA) and iDigBio run their own suites of tests over records received and report the results of these tests (the assertions); there is, however, no standard reporting mechanism. We reasoned that the availability of an internationally agreed set of tests would encourage implementations by the aggregators and at the data sources (museums, herbaria and others), so that issues could be detected and corrected early in the process. All the tests are limited to Darwin Core terms. The ~95 tests, refined from over 250 in use around the world, were classified into four output types: validations, notifications, amendments and measures. Validations test one or more Darwin Core terms, for example, that dwc:decimalLatitude is in a valid range (i.e. between -90 and +90 inclusive). Notifications report a status that a user of the record should know about, for example, if there is a user annotation associated with the record. Amendments are made to one or more Darwin Core terms when the information across the record can be improved, for example, if there is no value for dwc:scientificName, it can be filled in from a valid dwc:taxonID. Measures report values that may be useful for assessing the overall quality of a record, for example, the number of validation tests passed. Evaluation of the tests was complex and time-consuming, but the important parameters of each test have been consistently documented. Each test has a globally unique identifier, a label, an output type, a resource type, the Darwin Core terms used, a description, a dimension (from the Framework on Data Quality from TG1), an example, references, implementations (if any), test prerequisites and notes. For each test, generic code is being written that should be easy for institutions to implement – be they aggregators or data custodians. A valuable product of the work of TG2 has been a set of general principles. One example is: "Darwin Core terms are either: literal verbatim (e.g., dwc:verbatimLocality) and cannot be assumed capable of validation, open-ended (e.g., dwc:behavior) and cannot be assumed capable of validation, or bounded by an agreed vocabulary or extents, and therefore capable of validation (e.g., dwc:countryCode)". Another is: "Criteria for including tests are that they are informative, relatively simple to implement, mandatory for amendments and have power in that they will not likely result in 0% or 100% of all record hits." A third: "Do not ascribe precision where it is unknown." GBIF, the ALA and iDigBio have committed to implementing the tests once they have been finalized. We are confident that many museums and herbaria will also implement the tests over time. We anticipate that demonstration code and a test dataset that will validate the code will be available on project completion.
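As a rough illustration of how a measure summarises test outcomes at the record level, consider the sketch below. The latitude range check and the count of validations passed follow the examples in the abstract, but the function names and response format are assumptions, not the Task Group's generic code:

```python
# Illustrative sketch: two validations and a record-level measure.

def validation_decimallatitude_inrange(record):
    """Is dwc:decimalLatitude between -90 and +90 inclusive?"""
    try:
        return -90.0 <= float(record.get("dwc:decimalLatitude")) <= 90.0
    except (TypeError, ValueError):
        return False

def validation_month_inrange(record):
    """Is dwc:month an integer between 1 and 12 inclusive?"""
    value = str(record.get("dwc:month", "")).strip()
    return value.isdigit() and 1 <= int(value) <= 12

VALIDATIONS = [validation_decimallatitude_inrange, validation_month_inrange]

def measure_validations_passed(record):
    """MEASURE: count how many validations this record passes."""
    return sum(1 for test in VALIDATIONS if test(record))

record = {"dwc:decimalLatitude": "-42.88", "dwc:month": "14"}
print(measure_validations_passed(record))  # 1 of 2 validations passed
```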
7

Chapman, Arthur. "Data Quality – Whose Responsibility is it?" Biodiversity Information Science and Standards 2 (June 13, 2018): e26084. https://doi.org/10.3897/biss.2.26084.

Abstract:
The quality of biodiversity data is an on-going issue. Early efforts to improve quality go back at least four decades, but the issue has never risen to the level of importance that it should have. For far too long, the push to database more and more data, regardless of its quality, has taken priority. So I pose the question: what is the use of having lots of data if 1) we don't know what its quality is, and 2) much of it is not fit for use? When databasing of herbarium and museum collections began in the 1970s, many taxonomists saw the only use of the data as being for taxonomic purposes. But as more and more data have become digitally available, so too have the uses to which the data can be put. It has also become increasingly important that the data we have in our herbaria and museums be put to more uses to justify on-going support and funding. But whose responsibility is data quality? To answer that, I take you to general data quality principles – i.e. that the difficulty and the cost of improving the quality of the data increase the further you move from its source. Responsibility for data quality rests with everyone:
- Collectors of the specimens
- Database designers and builders
- Data entry operators
- Data curators and managers
- Those responsible for exchanging/exporting the data
- Data aggregators
- Data publishers
- Data users

We all have responsibilities. So, what can we each do to play our part? We need to work together at all levels of the data chain. We need to develop systems whereby feedback on quality, from wherever it comes, can be documented and fed back. It is no use continually making corrections to the data down the line if those corrections never get back to the data curators and data custodians. It is also of little use if the information fed back goes nowhere and nothing is done with it. The TDWG Data Quality Interest Group is working on standards and tools to help make this possible. We have developed a Framework for Data Quality, we have developed a set of core tests for data quality and assertions for feeding information back to custodians and forward to users, and we are beginning a process to deal with vocabularies of values for biodiversity data.