Relevant bibliographies by topics / Pre-training corpora

Journal articles
Dissertations / Theses
Books
Book chapters
Conference papers
Reports

Academic literature on the topic 'Pre-training corpora'

Author: Grafiati

Published: 25 May 2024

Last updated: 31 July 2025

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Pre-training corpora.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as pdf and read online its abstract whenever available in the metadata.

Journal articles on the topic "Pre-training corpora"

Sun, Yu, Shuohuan Wang, Yukun Li, et al. "ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 8968–75. http://dx.doi.org/10.1609/aaai.v34i05.6428.

Full text

Abstract:

Recently pre-trained models have achieved state-of-the-art results in various language understanding tasks. Current pre-training procedures usually focus on training the model with several simple tasks to grasp the co-occurrence of words or sentences. However, besides co-occurring information, there exists other valuable lexical, syntactic and semantic information in training corpora, such as named entities, semantic closeness and discourse relations. In order to extract the lexical, syntactic and semantic information from training corpora, we propose a continual pre-training framework named E

APA, Harvard, Vancouver, ISO, and other styles

Moodaley, Wayne, and Arnesh Telukdarie. "A Conceptual Framework for Subdomain Specific Pre-Training of Large Language Models for Green Claim Detection." European Journal of Sustainable Development 12, no. 4 (2023): 319. http://dx.doi.org/10.14207/ejsd.2023.v12n4p319.

Full text

Abstract:

Detection of false or misleading green claims (referred to as “greenwashing”) within company sustainability disclosures is challenging for a number of reasons, which include the textual and qualitative nature, volume, and complexity of such disclosures. In recent years, notable progress made in the fields of artificial intelligence and specifically, large language models (LLMs), has showcased the capacity of these tools to effectively analyse extensive and intricate textual data, including the contents of sustainability disclosures. Transformer-based LLMs, such as Google’s BERT architecture, w

APA, Harvard, Vancouver, ISO, and other styles

Hussain, Rida Ghafoor. "RiskBERT: A Pre-Trained Insurance-Based Language Model for Text Classification." International Journal of Innovative Technology and Exploring Engineering 14, no. 7 (2025): 12–18. https://doi.org/10.35940/ijitee.f1097.14070625.

Full text

Abstract:

The rapid growth of insurance-related documents has increased the need for efficient and accurate text classification techniques. Advances in natural language processing (NLP) and deep learning have enabled the extraction of valuable insights from textual data, particularly in specialised domains such as insurance, legal, and scientific documents. While Bidirectional Encoder Representations from Transformers (BERT) models have demonstrated state-of-theart performance across various NLP tasks, their application to domain-specific corpora often results in suboptimal accuracy due to linguistic an

APA, Harvard, Vancouver, ISO, and other styles

Liu, Yinhan, Jiatao Gu, Naman Goyal, et al. "Multilingual Denoising Pre-training for Neural Machine Translation." Transactions of the Association for Computational Linguistics 8 (November 2020): 726–42. http://dx.doi.org/10.1162/tacl_a_00343.

Full text

Abstract:

This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART—a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective (Lewis et al., 2019 ). mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, whereas previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete mod

APA, Harvard, Vancouver, ISO, and other styles

Dean, Roger Thornton, and Marcus Thomas Pearce. "Algorithmically-generated Corpora that use Serial Compositional Principles Can Contribute to the Modeling of Sequential Pitch Structure in Non-tonal Music." Empirical Musicology Review 11, no. 1 (2016): 27. http://dx.doi.org/10.18061/emr.v11i1.4900.

Full text

Abstract:

We investigate whether pitch sequences in non-tonal music can be modeled by an information-theoretic approach using algorithmically-generated melodic sequences, made according to 12-tone serial principles, as the training corpus. This is potentially useful, because symbolic corpora of non-tonal music are not readily available. A non-tonal corpus of serially-composed melodies was constructed algorithmically using classic principles of 12-tone music, including prime, inversion, retrograde and retrograde inversion transforms. A similar algorithm generated a tonal melodic corpus of tonal transform

APA, Harvard, Vancouver, ISO, and other styles

Kreutzer, Julia, Isaac Caswell, Lisa Wang, et al. "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets." Transactions of the Association for Computational Linguistics 10 (2022): 50–72. http://dx.doi.org/10.1162/tacl_a_00447.

Full text

Abstract:

Abstract With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/am

APA, Harvard, Vancouver, ISO, and other styles

Yuan, Sha, Hanyu Zhao, Zhengxiao Du, et al. "WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models." AI Open 2 (2021): 65–68. http://dx.doi.org/10.1016/j.aiopen.2021.06.001.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Qian, Jing, Yong Yue, Katie Atkinson, and Gangmin Li. "Understanding Chinese Moral Stories with Further Pre-Training." International Journal on Natural Language Computing 12, no. 2 (2023): 01–12. http://dx.doi.org/10.5121/ijnlc.2023.12201.

Full text

Abstract:

The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is compacted into a single statement without involving any characters within the original text, necessitating a more astute language model that can comprehend connotative morality and exhibit commonsense reasoning. The “pretraining + fine-tuning” paradigm is widely embraced in neural language models. In this paper, we propose an intermediary phase to establish an improved paradigm of “pre-training + further pre

APA, Harvard, Vancouver, ISO, and other styles

Jing, Qian, Yue Yong, Atkinson Katie, and Li Gangmin. "Understanding Chinese Moral Stories with Further Pre-Training." International Journal on Natural Language Computing (IJNLC) 12, no. 2 (2023): 12. https://doi.org/10.5281/zenodo.7929155.

Full text

Abstract:

The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is compacted into a single statement without involving any characters within the original text, necessitating a more astute language model that can comprehend connotative morality and exhibit commonsense reasoning. The “pre-training + fine-tuning” paradigm is widely embraced in neural language models. In this paper, we propose an intermediary phase to establish an improved paradigm of “pre-tra

APA, Harvard, Vancouver, ISO, and other styles

Chukhno, Olena, and Nataliia Tuchyna. "OVERCOMING DIFFICULATIES IN USING LINGUISTIC CORPORA FOR TEACHING ENGLISH TO PRE-SERVICE TEACHERS." Education. Innovation. Practice 12, no. 7 (2024): 91–105. http://dx.doi.org/10.31110/2616-650x-vol12i7-014.

Full text

Abstract:

The rapid pace of technological advancement necessitates that Ukrainian graduates possess advanced digital literacy and critical thinking skills as well as lifelong learning abilities. Within this context, using linguistic corpora can be considered an effective approach which contributes to developing professional communicative competence by engaging students with authentic language data and promoting critical analysis and independent learning. The article addresses the challenges of integrating the direct corpus-based approach into pre-service English language teacher education. These include

APA, Harvard, Vancouver, ISO, and other styles

More sources

Dissertations / Theses on the topic "Pre-training corpora"

Ortiz, Suarez Pedro. "A Data-driven Approach to Natural Language Processing for Contemporary and Historical French." Electronic Thesis or Diss., Sorbonne université, 2022. http://www.theses.fr/2022SORUS155.

Full text

Abstract:

Depuis plusieurs années, les approches neuronales ont régulièrement amélioré l'état de l'art du traitement automatique des langues (TAL) sur une grande variété de tâches. L'un des principaux facteurs ayant permis ces progrès continus est l'utilisation de techniques d'apprentissage par transfert. Ces méthodes consistent à partir d'un modèle pré-entraîné et à le réutiliser, avec peu ou pas d'entraînement supplémentaire, pour traiter d'autres tâches. Même si ces modèles présentent des avantages évidents, leur principal inconvénient est la quantité de données nécessaire pour les pré-entraîner. Ain

APA, Harvard, Vancouver, ISO, and other styles

Books on the topic "Pre-training corpora"

Humphreys, S. C. Kinship in Ancient Athens. Oxford University Press, 2018. http://dx.doi.org/10.1093/oso/9780198788249.001.0001.

Full text

Abstract:

The book covers Athenian kinship from Drakon and Solon to Menander (with some references to later developments). It uses a wide range of sources: epigraphic, literary/forensic, and archaeological. It provides an ethnographic ‘thick description’ of Athenians’ interaction with their kin in all contexts: legal relations (adoption, guardianship, marriage, inheritance, disputes in and out of court); economic interaction (property, economic independence/dependence of sons in relation to fathers); training in specialist skills (doctors, actors, artists), loans, guarantees, etc.; rituals (naming, rite

APA, Harvard, Vancouver, ISO, and other styles

Peters, Thomas A. Library Programs Online. ABC-CLIO, LLC, 2009. http://dx.doi.org/10.5040/9798400679216.

Full text

Abstract:

Meet your library patrons where they increasingly live and work-online. This guide introduces you to the exciting possibilities online programs offer, and shows you how to set up online programs in your library-whether one-time stand-alone or half-day, full-day, or multi-day workshops and conferences. Public programs-from lectures, demonstrations, and interviews to book discussions and story hours can be delivered in real time (live) primarily over the web, utilizing a variety of interactive communication tools, including voice-over-IP, text chatting, and co-browsing. Furthermore, online progr

APA, Harvard, Vancouver, ISO, and other styles

Book chapters on the topic "Pre-training corpora"

Perełkiewicz, Michał, and Rafał Poświata. "A Review of the Challenges with Massive Web-Mined Corpora Used in Large Language Models Pre-training." In Lecture Notes in Computer Science. Springer Nature Switzerland, 2025. https://doi.org/10.1007/978-3-031-81596-6_14.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Nag, Arijit, Bidisha Samanta, Animesh Mukherjee, Niloy Ganguly, and Soumen Chakrabarty. "Effect of Unknown and Fragmented Tokens on the Performance of Multilingual Language Models at Low-Resource Tasks." In Event Analytics across Languages and Communities. Springer Nature Switzerland, 2024. http://dx.doi.org/10.1007/978-3-031-64451-1_5.

Full text

Abstract:

AbstractMultilingual language models (MLLMs) like mBERT promise to extend the benefits of NLP research to low-resource languages (LRLs). However, LRL vocabulary is often seriously under-represented in the workpiece dictionaries of MLLMs. This leads to many LRL words being replaced by UNK (unknown tokens) or concatenated from morphologically unrelated wordpieces, consequently leading to low task accuracy. Pre-training MLLMs after including LRL documents is extremely resource-intensive in terms of both human inputs and computational resources. In this chapter, we study intuitive strategies to se

APA, Harvard, Vancouver, ISO, and other styles

Mahamoud, Ibrahim Souleiman, Mickaël Coustaty, Aurélie Joseph, Vincent Poulain d’Andecy, and Jean-Marc Ogier. "KAP: Pre-training Transformers for Corporate Documents Understanding." In Document Analysis and Recognition – ICDAR 2023 Workshops. Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-41501-2_5.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Siva Raju, S., and Khushboo Ahire. "Enhancing the Quality of Pre-school Education Through Training of Anganwadi Workers: A CSR Initiative." In Corporate Social Responsibility in India. Springer Singapore, 2017. http://dx.doi.org/10.1007/978-981-10-3902-7_5.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Naumenko, Maksym, Iryna Hrashchenko, Tetiana Tsalko, Svitlana Nevmerzhytska, Svitlana Krasniuk, and Yurii Kulynych. "Innovative technological modes of data mining and modelling for adaptive project management of food industry competitive enterprises in crisis conditions." In PROJECT MANAGEMENT: INDUSTRY SPECIFICS. TECHNOLOGY CENTER PC, 2024. https://doi.org/10.15587/978-617-8360-03-0.ch2.

Full text

Abstract:

Developed in this research scientific and practical applied project solutions regarding Data Mining for enterprises and companies (on the example of food industry) involve the application of advanced cybernetic computing methods/algorithms, technological modes and scenarios (for integration, pre-processing, machine learning, testing and in-depth comprehensive interpretation of the results) of analysis and analytics of large structured and semi-structured data sets for training high-quality descriptive, predictive and even prescriptive models. The proposed by authors multi-mode adaptive Data Mi

APA, Harvard, Vancouver, ISO, and other styles

Ho, Shaun. "Impacts of Continued Legal Pre-Training and IFT on LLMs’ Latent Representations of Human-Defined Legal Concepts." In Frontiers in Artificial Intelligence and Applications. IOS Press, 2024. https://doi.org/10.3233/faia241259.

Full text

Abstract:

This paper aims to offer AI & Law researchers and practitioners a more detailed understanding of whether and how continued pre-training and instruction fine-tuning (IFT) of large language models (LLMs) on legal corpora increases their utilization of human-defined legal concepts when developing global contextual representations of input sequences. We compared three models: Mistral 7B, SaulLM-7B-Base (Mistral 7B with continued pre-training on legal corpora), and SaulLM-7B-Instruct (with further IFT). This preliminary assessment examined 7 distinct text sequences from recent AI & Law lite

APA, Harvard, Vancouver, ISO, and other styles

Tufiş Dan. "Algorithms and Data Design Issues for Basic NLP Tools." In NATO Science for Peace and Security Series - D: Information and Communication Security. IOS Press, 2009. https://doi.org/10.3233/978-1-58603-954-7-3.

Full text

Abstract:

This chapter presents some of the basic language engineering pre-processing steps (tokenization, part-of-speech tagging, lemmatization, and sentence and word alignment). Tagging is among the most important processing steps and its accuracy significantly influences any further processing. Therefore, tagset design, validation and correction of training data and the various techniques for improving the tagging quality are discussed in detail. Since sentence and word alignment are prerequisite operations for exploiting parallel corpora for a multitude of purposes such as machine translation, bilin

APA, Harvard, Vancouver, ISO, and other styles

Stevens, Meg, Georgina Kennedy, and Timothy Churches. "Applying and Improving a Publicly Available Medication NER Pipeline in a Clinical Cancer EMR." In Studies in Health Technology and Informatics. IOS Press, 2024. http://dx.doi.org/10.3233/shti231051.

Full text

Abstract:

Clinical NLP can be applied to extract medication information from free-text notes in EMRs, using NER pipelines. Publicly available annotated data for clinical NLP are scarce, and research annotation budgets are often low. Fine-tuning pre-trained pipelines containing a Transformer layer can produce quality results with relatively small training corpora. We examine the transferability of a publicly available, pre-trained NER pipeline with a Transformer layer for medication targets. The pipeline performs poorly when directly validated but achieves an F1-score of 92% for drug names after fine-tun

APA, Harvard, Vancouver, ISO, and other styles

Jiang, Eric P. "Automatic Text Classification from Labeled and Unlabeled Data." In Intelligent Data Analysis for Real-Life Applications. IGI Global, 2012. http://dx.doi.org/10.4018/978-1-4666-1806-0.ch013.

Full text

Abstract:

Automatic text classification is a process that applies information retrieval technology and machine learning algorithms to build models from pre-labeled training samples and then deploys the models to previously unseen documents for classification. Text classification has been widely applied in many fields ranging from Web page indexing, document filtering, and information security, to business intelligence mining. This chapter presents a semi-supervised text classification framework that is based on the radial basis function (RBF) neural networks. The framework integrates an Expectation Maxi

APA, Harvard, Vancouver, ISO, and other styles

Liu, Ran, Ming Liu, Min Yu, et al. "GLIMMER: Incorporating Graph and Lexical Features in Unsupervised Multi-Document Summarization." In Frontiers in Artificial Intelligence and Applications. IOS Press, 2024. http://dx.doi.org/10.3233/faia240930.

Full text

Abstract:

Pre-trained language models are increasingly being used in multi-document summarization tasks. However, these models need large-scale corpora for pre-training and are domain-dependent. Other non-neural unsupervised summarization approaches mostly rely on key sentence extraction, which can lead to information loss. To address these challenges, we propose a lightweight yet effective unsupervised approach called GLIMMER: a Graph and LexIcal features based unsupervised Multi-docuMEnt summaRization approach. It first constructs a sentence graph from the source documents, then automatically identifi

APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "Pre-training corpora"

Vu, Thuy-Trang, Xuanli He, Gholamreza Haffari, and Ehsan Shareghi. "Koala: An Index for Quantifying Overlaps with Pre-training Corpora." In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, 2023. http://dx.doi.org/10.18653/v1/2023.emnlp-demo.7.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Liu, Zhuang, Degen Huang, Kaiyu Huang, Zhuang Li, and Jun Zhao. "FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining." In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence {IJCAI-PRICAI-20}. International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/622.

Full text

Abstract:

There is growing interest in the tasks of financial text mining. Over the past few years, the progress of Natural Language Processing (NLP) based on deep learning advanced rapidly. Significant progress has been made with deep learning showing promising results on financial text mining models. However, as NLP models require large amounts of labeled training data, applying deep learning to financial text mining is often unsuccessful due to the lack of labeled training data in financial fields. To address this issue, we present FinBERT (BERT for Financial Text Mining) that is a domain specific la

APA, Harvard, Vancouver, ISO, and other styles

Qian, Jing, Yong Yue, Katie Atkinson, and Gangmin Li. "Knowledge-Enriched Moral Understanding upon Continual Pre-training." In 10th International Conference on Computer Networks & Communications (CCNET 2023). Academy and Industry Research Collaboration Center (AIRCC), 2023. http://dx.doi.org/10.5121/csit.2023.130414.

Full text

Abstract:

The aim of moral understanding is to comprehend the abstract concepts that hide in a story by seeing through concrete events and vivid characters. To be specific, the story is highly summarized in one sentence without covering any characters in the original story, which requires the machine to behave more intelligently with the abilities of moral perception and commonsense reasoning. The paradigm of “pre-training + fine-tuning” is generally accepted for applying neural language models. In this paper, we suggest adding an intermediate stage to build the flow of “pre-training + continual pre-tra

APA, Harvard, Vancouver, ISO, and other styles

Lu, Jinliang, Yu Lu, and Jiajun Zhang. "Take a Closer Look at Multilinguality! Improve Multilingual Pre-Training Using Monolingual Corpora Only." In Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, 2023. http://dx.doi.org/10.18653/v1/2023.findings-emnlp.190.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Xu, Yipei, Dakuan Lu, Jiaqing Liang, et al. "Source Prompt: Coordinated Pre-training of Language Models on Diverse Corpora from Multiple Sources." In CIKM '24: The 33rd ACM International Conference on Information and Knowledge Management. ACM, 2024. http://dx.doi.org/10.1145/3627673.3679835.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Wang, Xin'ao, Huan Li, Ke Chen, and Lidan Shou. "FedBFPT: An Efficient Federated Learning Framework for Bert Further Pre-training." In Thirty-Second International Joint Conference on Artificial Intelligence {IJCAI-23}. International Joint Conferences on Artificial Intelligence Organization, 2023. http://dx.doi.org/10.24963/ijcai.2023/483.

Full text

Abstract:

This study proposes FEDBFPT (Federated BERT Further Pre-Training), a Federated Learning (FL) framework for further pre-training the BERT language model in specialized domains while addressing privacy concerns. FEDBFPT enables multiple clients to collaboratively train the shallower layers of BERT, which are crucial in the pre-training stage, without the need to share private data. To achieve this, FEDBFPT involves building a local model for each client, progressively training the shallower layers of local models while sampling deeper layers, and aggregating trained parameters on a server to cre

APA, Harvard, Vancouver, ISO, and other styles

Qu, Yuanbin, Peihan Liu, Wei Song, Lizhen Liu, and Miaomiao Cheng. "A Text Generation and Prediction System: Pre-training on New Corpora Using BERT and GPT-2." In 2020 IEEE 10th International Conference on Electronics Information and Emergency Communication (ICEIEC). IEEE, 2020. http://dx.doi.org/10.1109/iceiec49280.2020.9152352.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Zan, Daoguang, Bei Chen, Dejian Yang, et al. "CERT: Continual Pre-training on Sketches for Library-oriented Code Generation." In Thirty-First International Joint Conference on Artificial Intelligence {IJCAI-22}. International Joint Conferences on Artificial Intelligence Organization, 2022. http://dx.doi.org/10.24963/ijcai.2022/329.

Full text

Abstract:

Code generation is a longstanding challenge, aiming to generate a code snippet based on a natural language description. Usually, expensive text-code paired data is essential for training a code generation model. Recently, thanks to the success of pre-training techniques, large language models are trained on large unlabelled code corpora and perform well in generating code. In this paper, we investigate how to leverage an unlabelled code corpus to train a model for library-oriented code generation. Since it is a common practice for programmers to reuse third-party libraries, in which case the t

APA, Harvard, Vancouver, ISO, and other styles

Edwards, Aleksandra, Jose Camacho-Collados, Hélène De Ribaupierre, and Alun Preece. "Go Simple and Pre-Train on Domain-Specific Corpora: On the Role of Training Data for Text Classification." In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, 2020. http://dx.doi.org/10.18653/v1/2020.coling-main.481.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Full text

APA, Harvard, Vancouver, ISO, and other styles

Reports on the topic "Pre-training corpora"

Rosenblat, Sruly, Tim O'Reilly, and Ilan Strauss. Beyond Public Access in LLM Pre-Training Data: Non-public book content in OpenAI’s Models. AI Disclosures Project, Social Science Research Council, 2025. https://doi.org/10.35650/aidp.4111.d.2025.

Full text

Abstract:

Using a legally obtained dataset of 34 copyrighted O’Reilly Media books, we apply the DE-COP membership inference attack method to investigate whether OpenAI’s large language models were trained on copyrighted content without consent. Our AUROC scores show that GPT-4o, OpenAI’s more recent and capable model, demonstrates strong recognition of paywalled O’Reilly book content (AUROC = 82%), compared to OpenAI’s earlier model GPT-3.5 Turbo. In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples. GPT-4o Mini, as a much smaller model, shows no kno

APA, Harvard, Vancouver, ISO, and other styles

Strauss, Ilan, Isobel Moure, Tim O’Reilly, and Sruly Rosenblat. The State of AI Governance Research: AI Safety and Reliability in Real World Commercial Deployment. AI Disclosures Project, Social Science Research Council, 2025. https://doi.org/10.35650/aidp.4112.d.2025.

Full text

Abstract:

Drawing on 1,178 safety and reliability papers from 9,439 generative AI papers (Jan- uary 2020 - March 2025), we compare research outputs of leading AI companies (An- thropic, Google DeepMind, Meta, Microsoft, and OpenAI) and AI universities (CMU, MIT, NYU, Stanford, UC Berkeley, and University of Washington). We find that cor- porate AI research increasingly concentrates on pre-deployment areas — model align- ment and testing & evaluation — while attention to deployment-stage issues, such as model bias, has waned, as commercial imperatives and existential risks have come into focus. We fi

APA, Harvard, Vancouver, ISO, and other styles

We offer discounts on all premium plans for authors whose works are included in thematic literature selections. Contact us to get a unique promo code!

Contents

Academic literature on the topic 'Pre-training corpora'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Journal articles on the topic "Pre-training corpora"

Dissertations / Theses on the topic "Pre-training corpora"

Books on the topic "Pre-training corpora"

Book chapters on the topic "Pre-training corpora"

Conference papers on the topic "Pre-training corpora"

Reports on the topic "Pre-training corpora"