Journal articles on the topic 'Named entity recognition legal documents transformer'

Consult the top 38 journal articles for your research on the topic 'Named entity recognition legal documents transformer.'

1

Yulianti, Evi, Naradhipa Bhary, Jafar Abdurrohman, Fariz Wahyuzan Dwitilas, Eka Qadri Nuranti, and Husna Sarirah Husin. "Named entity recognition on Indonesian legal documents: a dataset and study using transformer-based models." International Journal of Electrical and Computer Engineering (IJECE) 14, no. 5 (2024): 5489. http://dx.doi.org/10.11591/ijece.v14i5.pp5489-5501.

Abstract:
The large volume of court decision documents in Indonesia poses a challenge for researchers seeking to help legal practitioners extract useful information from these documents. This information can also benefit the general public by improving legal transparency, law enforcement, and people's understanding of how the law is implemented in Indonesia. The natural language processing task that extracts important information from a document is called named entity recognition (NER). In this study, the NER task is applied to the legal domain and is referred to as the legal entity recognition (LER) task. In this task, important legal entities, such as judges, prosecutors, and advocates, are extracted from decision documents. A new Indonesian LER dataset is built, called IndoLER, consisting of approximately 1K decision documents annotated with 20 types of fine-grained legal entities. Transformer-based models, namely multilingual bidirectional encoder representations from transformers (M-BERT), Indonesian BERT (IndoBERT), Indonesian robustly optimized BERT pretraining approach (IndoRoBERTa), and cross-lingual XLM-RoBERTa (XLM-R), are then proposed to solve the Indonesian LER task on this dataset. Our experimental results show that the RoBERTa-based models, XLM-R and IndoRoBERTa, outperform state-of-the-art deep-learning baselines based on BiLSTM (bidirectional long short-term memory) and BiLSTM-CRF (BiLSTM with a conditional random field layer) by 7.2% to 7.9% and 2.1% to 2.6%, respectively. XLM-R is the best-performing model, achieving an F1-score of 0.9295.
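
As a rough illustration of the approach this abstract describes, the sketch below fine-tunes XLM-RoBERTa for token classification with Hugging Face Transformers. The label subset, toy sentence, and setup are illustrative assumptions, not the authors' exact IndoLER configuration.

```python
# Hypothetical sketch: XLM-R token classification for legal NER.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-JUDGE", "I-JUDGE", "B-PROSECUTOR", "I-PROSECUTOR"]  # IndoLER defines 20 types
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels))

def align_labels(tokens, word_tags):
    # Word-level tags must be re-aligned to subwords; special tokens and
    # subword continuations get -100 so the loss ignores them.
    enc = tokenizer(tokens, is_split_into_words=True, truncation=True)
    aligned, prev = [], None
    for w in enc.word_ids():
        aligned.append(-100 if w is None or w == prev else word_tags[w])
        prev = w
    enc["labels"] = aligned
    return enc

example = align_labels(["Hakim", "Budi", "Santoso"], [0, 1, 2])  # toy sentence
print(example["labels"])
```

From here, batches of such encodings would feed the usual `Trainer` loop; the BiLSTM and BiLSTM-CRF baselines mentioned above replace the transformer encoder with recurrent layers.
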
2

Dong, Hongsong, Yuehui Kong, Wenlian Gao, and Jihua Liu. "Named Entity Recognition for Public Interest Litigation Based on a Deep Contextualized Pretraining Approach." Scientific Programming 2022 (October 11, 2022): 1–14. http://dx.doi.org/10.1155/2022/7682373.

Abstract:
Named entity recognition (NER) in the field of public interest litigation can assist prosecutors in handling cases and provide them with specific entities when drafting legal documents. Previously, context-free deep learning models were used to capture semantics, obtaining static word vectors without considering context. Moreover, such methods rely on word segmentation technology and cannot avoid the error propagation caused by inaccurate segmentation, which poses great challenges for the Chinese NER task. To tackle these issues, an entity recognition method based on pretraining is proposed. First, based on the basic entities, three legal ontologies, NERP, NERCGP, and NERFPP, are developed to expand the named entity recognition corpus in the judicial field. Second, a variant of the pretrained model BERT (Bidirectional Encoder Representations from Transformers) called BERT-WWM-EXT (whole-word masking, extended) is introduced to capture character-level features and bidirectional context features, which effectively addresses the problem of entity boundary division. Then, to further improve recognition, the general knowledge learned by the pretrained model is fed into a downstream BiLSTM (bidirectional long short-term memory) network, and at the end of the architecture a CRF (conditional random field) layer is introduced to constrain label transitions. Finally, the experimental results show that the proposed method is more effective than existing methods, reaching 96% and 90% F1 on NER and NERP entities, respectively.
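
The stack described here (pretrained encoder, BiLSTM, CRF on top) can be sketched in a few lines of PyTorch. This is a minimal skeleton assuming the community `pytorch-crf` package and the public `hfl/chinese-bert-wwm-ext` checkpoint; the paper's actual layer sizes and training details may differ.

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf
from transformers import AutoModel

class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags, encoder="hfl/chinese-bert-wwm-ext", hidden=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(encoder)   # contextual character features
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)      # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)       # constrains label transitions

    def forward(self, input_ids, attention_mask, tags=None):
        x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        x, _ = self.lstm(x)
        emissions = self.proj(x)
        mask = attention_mask.bool()
        if tags is not None:
            return -self.crf(emissions, tags, mask=mask, reduction="mean")  # NLL loss
        return self.crf.decode(emissions, mask=mask)  # best tag sequence per sentence
```
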
3

Aejas, Bajeela, Abdelhak Belhi, and Abdelaziz Bouras. "Using AI to Ensure Reliable Supply Chains: Legal Relation Extraction for Sustainable and Transparent Contract Automation." Sustainability 17, no. 9 (2025): 4215. https://doi.org/10.3390/su17094215.

Abstract:
Efficient contract management is essential for ensuring sustainable and reliable supply chains; yet, traditional methods remain manual, error-prone, and inefficient, leading to delays, financial risks, and compliance challenges. AI and blockchain technology offer a transformative alternative, enabling the establishment of automated, transparent, and self-executing smart contracts that enhance efficiency and sustainability. As part of AI-driven smart contract automation, we previously implemented contractual clause extraction using question answering (QA) and named entity recognition (NER). This paper presents the next step in the information extraction process, relation extraction (RE), which aims to identify relationships between key legal entities and convert them into structured business rules for smart contract execution. To address RE in legal contracts, we present a novel hierarchical transformer model that captures sentence- and document-level dependencies. It incorporates global and segment-based attention mechanisms to extract complex legal relationships spanning multiple sentences. Given the scarcity of publicly available contractual datasets, we also introduce the contractual relation extraction (ContRE) dataset, specifically curated to support relation extraction tasks in legal contracts, that we use to evaluate the proposed model. Together, these contributions enable the structured automation of legal rules from unstructured contract text, advancing the development of AI-powered smart contracts.
4

Ajay Mukund, S., and K. S. Easwarakumar. "Optimizing Legal Text Summarization Through Dynamic Retrieval-Augmented Generation and Domain-Specific Adaptation." Symmetry 17, no. 5 (2025): 633. https://doi.org/10.3390/sym17050633.

Abstract:
Legal text summarization presents distinct challenges due to the intricate and domain-specific nature of legal language. This paper introduces a novel framework integrating dynamic Retrieval-Augmented Generation (RAG) with domain-specific adaptation to enhance the accuracy and contextual relevance of legal document summaries. The proposed Dynamic Legal RAG system achieves a vital form of symmetry between information retrieval and content generation, ensuring that retrieved legal knowledge is both comprehensive and precise. Using the BM25 retriever with top-3 chunk selection, the system optimizes relevance and efficiency, minimizing redundancy while maximizing legally pertinent content. A key design feature is the compression ratio constraint (0.05 to 0.5), maintaining structural symmetry between the original judgment and its summary by balancing representation and information density. Extensive evaluations establish BM25 as the most effective retriever, striking an optimal balance between precision and recall. A comparative analysis of transformer-based (decoder-only) models—DeepSeek-7B, LLaMA 2-7B, and LLaMA 3.1-8B—demonstrates that LLaMA 3.1-8B, enriched with Legal Named Entity Recognition (NER) and the Dynamic RAG system, achieves superior performance with a BERTScore of 0.89. This study lays a strong foundation for future research in hybrid retrieval models, adaptive chunking strategies, and legal-specific evaluation metrics, with practical implications for case law analysis and automated legal drafting.
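
The retrieval side of this pipeline is easy to make concrete. Below is a minimal sketch, assuming the `rank_bm25` package and whitespace tokenization; the chunking, prompting, and generation steps of the actual system are omitted.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

chunks = ["the appellant was convicted under section 13 of the act",
          "the trial court erred in admitting the confession",
          "sentencing must consider mitigating circumstances",
          "the appeal is allowed and the sentence set aside"]
bm25 = BM25Okapi([c.split() for c in chunks])
top3 = bm25.get_top_n("conviction under section 13".split(), chunks, n=3)

def within_compression_bounds(summary, judgment, lo=0.05, hi=0.5):
    # The compression-ratio constraint described above: summary length
    # must fall between 5% and 50% of the source judgment.
    ratio = len(summary.split()) / max(1, len(judgment.split()))
    return lo <= ratio <= hi
```
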
5

Lu, Rui, and Linying Li. "Named Entity Recognition Method of Chinese Legal Documents Based on Parallel Instance Query Network." International Journal of Digital Crime and Forensics 16, no. 1 (2025): 1–19. https://doi.org/10.4018/ijdcf.367470.

Abstract:
Legal Named Entity Recognition (NER) is crucial in intelligent judiciary systems, focusing on identifying case-specific entities in legal texts. It helps convert unstructured legal documents into structured data, improving e-discovery efficiency. However, challenges arise from insufficient understanding of legal terminology, leading to errors in identifying long and nested entity boundaries. To address this, a Legal NER method based on a parallel instance query network is proposed. This method uses learnable instance queries to extract entities in parallel, with a BERT+BiLSTM+attention structure to encode context and query information. Entity prediction is performed using a pointer network to identify span boundaries and entity types. A linear label assignment mechanism aligns legal entities with queries for more accurate labeling. Experimental results show that the model outperforms existing methods, and further validation through ablation experiments and case studies supports its effectiveness, offering valuable insights for advancing legal NER research.
6

Baviskar, Dipali, Swati Ahirrao, and Ketan Kotecha. "Multi-Layout Invoice Document Dataset (MIDD): A Dataset for Named Entity Recognition." Data 6, no. 7 (2021): 78. http://dx.doi.org/10.3390/data6070078.

Abstract:
The day-to-day working of an organization produces a massive volume of unstructured data in the form of invoices, legal contracts, mortgage processing forms, and many more. Organizations can utilize the insights concealed in such unstructured documents for their operational benefit. However, analyzing and extracting insights from such numerous and complex unstructured documents is a tedious task. Hence, the research in this area is encouraging the development of novel frameworks and tools that can automate the key information extraction from unstructured documents. However, the availability of standard, best-quality, and annotated unstructured document datasets is a serious challenge for accomplishing the goal of extracting key information from unstructured documents. This work expedites the researcher’s task by providing a high-quality, highly diverse, multi-layout, and annotated invoice documents dataset for extracting key information from unstructured documents. Researchers can use the proposed dataset for layout-independent unstructured invoice document processing and to develop an artificial intelligence (AI)-based tool to identify and extract named entities in the invoice documents. Our dataset includes 630 invoice document PDFs with four different layouts collected from diverse suppliers. As far as we know, our invoice dataset is the only openly available dataset comprising high-quality, highly diverse, multi-layout, and annotated invoice documents.
7

Nastou, Katerina, Mikaela Koutrouli, Sampo Pyysalo, and Lars Juhl Jensen. "Improving dictionary-based named entity recognition with deep learning." Bioinformatics 40, Supplement_2 (2024): ii45—ii52. http://dx.doi.org/10.1093/bioinformatics/btae402.

Abstract:
Motivation: Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter have so far been created by identifying cases that cause many false positives through manual inspection of individual names, a process that scales poorly. Results: In this work, we aim to improve block lists by automatically identifying names to block, based on the context in which they appear. By comparing results of three well-established biomedical NER methods, we generated a dataset of over 12.5 million text spans where the methods agree on the boundaries and type of entity tagged. These were used to generate positive and negative examples of contexts for four entity types (genes, diseases, species, and chemicals), which were used to train a Transformer-based model (BioBERT) to perform entity type classification. Application of the best model (F1-score = 96.7%) allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. In addition, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes boosted text mining precision by ∼5.5% on average, and over 8.5% for chemical and 7.5% for gene names, positively affecting several biological databases utilizing this NER system, such as the STRING database, with only a minor drop in recall (0.6%). Availability and implementation: All resources are available through Zenodo https://doi.org/10.5281/zenodo.11243139 and GitHub https://doi.org/10.5281/zenodo.10289360.
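
The span-agreement step that generates the training data can be illustrated with a toy function: a tagged span is kept only when all methods agree on both boundaries and entity type. The tagger outputs below are made-up stand-ins for the three NER systems.

```python
from collections import Counter

def consensus_spans(runs):
    """runs: one list of (start, end, type) spans per NER method."""
    counts = Counter(span for run in runs for span in set(run))
    return {span for span, c in counts.items() if c == len(runs)}

method_a = [(0, 4, "GENE"), (10, 17, "CHEMICAL")]
method_b = [(0, 4, "GENE"), (10, 17, "DISEASE")]   # type disagreement
method_c = [(0, 4, "GENE")]
print(consensus_spans([method_a, method_b, method_c]))  # {(0, 4, 'GENE')}
```
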
8

Mazur, Pawel, and Robert Dale. "Handling conjunctions in named entities." Lingvisticæ Investigationes. International Journal of Linguistics and Language Resources 30, no. 1 (2007): 49–68. http://dx.doi.org/10.1075/li.30.1.05maz.

Abstract:
Although the literature contains reports of very high accuracy figures for the recognition of named entities in text, there are still some named entity phenomena that remain problematic for existing text processing systems. One of these is the ambiguity of conjunctions in candidate named entity strings, an all-too-prevalent problem in corporate and legal documents. In this paper, we distinguish four uses of the conjunction in these strings, and explore the use of a supervised machine learning approach to conjunction disambiguation trained on a very limited set of ‘name internal’ features that avoids the need for expensive lexical or semantic resources. We achieve 84% correctly classified examples using k-fold evaluation on a data set of 600 instances. We argue that further improvements are likely to require the use of wider domain knowledge and name external features.
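
A minimal version of the classifier setup this paper describes could look as follows, using only cheap 'name internal' features. The feature set and toy labels are illustrative guesses, not the authors' feature inventory.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def name_internal_features(name):
    # Features computed from the candidate name string alone.
    left, _, right = name.partition(" and ")
    return {
        "n_tokens": len(name.split()),
        "left_cap_tokens": sum(t[0].isupper() for t in left.split()),
        "right_cap_tokens": sum(t[0].isupper() for t in right.split()),
        "right_corp_suffix": right.strip().endswith(("Ltd", "Inc", "LLP")),
    }

names = ["Marks and Spencer", "Smith and Jones LLP", "John and Mary Smith"]
y = ["single_name", "single_name", "two_entities"]  # hypothetical classes
clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit([name_internal_features(n) for n in names], y)
```
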
9

van Toledo, Chaïm, Friso van Dijk, and Marco Spruit. "Dutch Named Entity Recognition and De-Identification Methods for the Human Resource Domain." International Journal on Natural Language Computing 9, no. 6 (2020): 23–34. http://dx.doi.org/10.5121/ijnlc.2020.9602.

Abstract:
The human resource (HR) domain contains various types of privacy-sensitive textual data, such as e-mail correspondence and performance appraisals. Doing research on these documents brings several challenges, one of them being anonymisation. In this paper, we evaluate current Dutch text de-identification methods for the HR domain in four steps. First, one of these methods is updated with the latest named entity recognition (NER) models. The result is that the NER model based on the CoNLL 2002 corpus, in combination with the BERTje transformer, gives the best combination for suppressing persons (recall 0.94) and locations (recall 0.82). For suppressing gender, DEDUCE performs best (recall 0.53). The second evaluation is based on strict de-identification of entities (a person must be suppressed as a person), and the third on a loose sense of de-identification (no matter how a person is suppressed, as long as it is suppressed). In the fourth and last step, a new kind of NER dataset is tested for recognising job titles in texts.
10

Zhao, Liupeng. "Legal Impact of Digital Information Technology on the Chain of Evidence in Criminal Cases." Journal of Combinatorial Mathematics and Combinatorial Computing 123, no. 1 (2024): 103–21. https://doi.org/10.61091/jcmcc123-08.

Abstract:
Criminal evidence serves as the foundation for criminal proceedings, with evidence used to ascertain the facts of cases being critical to achieving fairness and justice. This study explores the application of digital information technology in building a data resource base for criminal cases, formulating standard evidence guideline rules, and optimizing evidence verification procedures. A named entity recognition model based on the SVM-BiLSTM-CRF framework is proposed, coupled with an evidence relationship extraction model using the Transformer framework to improve evidence information extraction through sequential features and global feature capturing. Results show that the F1 value for entity recognition in criminal cases reaches 94.19%, and the evidence extraction model achieves an F1 value of 81.83% on the CAIL-A dataset. These results are utilized to construct evidence guidelines, helping case handlers increase case resolution rates to approximately 99%. The application of digital technology enhances evidence collection efficiency, accelerates case closures, and offers a pathway to improving judicial credibility.
11

Avhad, Prasanna, Parag Jadhav, Sudarshan Madbhavi, Ganesh Devnale, and C. A. Ghuge. "Survey on Intelligent Document Processing: A Comprehensive Approach to Summarization, NER, Language Conversion, and Plagiarism Detection." International Journal of Scientific Research in Engineering and Management 09, no. 01 (2025): 1–9. https://doi.org/10.55041/ijsrem40453.

Abstract:
Intelligent document processing (IDP) technologies are transforming information management across domains, education and law among them. This survey provides a comprehensive overview of four critical modules: PDF/DOCX summarization, named entity recognition, English-to-Hindi language translation, and plagiarism detection. In education, IDP technologies strengthen academic integrity, streamline research processes, and help bridge language gaps for an inclusive learning environment. From a legal perspective, such technologies improve the efficiency of document review, contract analysis, and research, helping professionals stay ahead of nonconformities. Major challenges include domain-specific solutions, data quality, ethical concerns, and user-centric design. The paper strongly emphasizes the need for further research and development of integrated, effective, and accessible IDP tools that meet user requirements over time. Keywords: Intelligent Document Processing, PDF/DOCX Summarization, Named Entity Recognition, Language Conversion, Plagiarism Detection, Education, Legal Applications, Natural Language Processing, Academic Integrity, Document Review.
12

Kuznetsov, M. D. "Recognition and Extraction of Named Entities from the User Agreements Corpus." LETI Transactions on Electrical Engineering & Computer Science 18, no. 3 (2025): 78–86. https://doi.org/10.32603/2071-8985-2025-18-3-78-86.

Abstract:
Data analysis and mining are used to solve a variety of different problems, but their effective use requires high-quality and large datasets. Open publication of such datasets is not always possible in accordance with the law. The presence of personal data in datasets necessitates their processing and cleaning before open publication. In particular, the PPInRussian text dataset created in 2024 for studying aspects of personal data processing cannot be published, but it has the potential to become a useful tool for both computer security researchers and legal scholars. This paper discusses modern methods of named entity recognition that can be used to clean a text corpus, tests them, and evaluates their applicability in the context of cleaning legal documents. In addition, the paper proposes a rule-based text corpus cleaning technique that shows more accurate results compared to more general-purpose tools. The application of this technique will clean the corpus of user agreements and, thus, make it possible to publish it for interested researchers.
13

Csányi, Gergely Márk, Dániel Nagy, Renátó Vági, János Pál Vadász, and Tamás Orosz. "Challenges and Open Problems of Legal Document Anonymization." Symmetry 13, no. 8 (2021): 1490. http://dx.doi.org/10.3390/sym13081490.

Abstract:
Data sharing is a central aspect of judicial systems. The openly accessible documents can make the judiciary system more transparent. On the other hand, the published legal documents can contain much sensitive information about the involved persons or companies. For this reason, the anonymization of these documents is obligatory to prevent privacy breaches. General Data Protection Regulation (GDPR) and other modern privacy-protecting regulations have strict definitions of private data containing direct and indirect identifiers. In legal documents, there is a wide range of attributes regarding the involved parties. Moreover, legal documents can contain additional information about the relations between the involved parties and rare events. Hence, the personal data can be represented by a sparse matrix of these attributes. The application of Named Entity Recognition methods is essential for a fair anonymization process but is not enough. Machine learning-based methods should be used together with anonymization models, such as differential privacy, to reduce re-identification risk. On the other hand, the information content (utility) of the text should be preserved. This paper aims to summarize and highlight the open and symmetrical problems from the fields of structured and unstructured text anonymization. The possible methods for anonymizing legal documents are discussed and illustrated by case studies from the Hungarian legal practice.
14

Zhu, Guicun, Meihui Hao, Changlong Zheng, and Linlin Wang. "Design of Knowledge Graph Retrieval System for Legal and Regulatory Framework of Multilevel Latent Semantic Indexing." Computational Intelligence and Neuroscience 2022 (July 19, 2022): 1–11. http://dx.doi.org/10.1155/2022/6781043.

Abstract:
Latent semantic analysis (LSA) is a natural language statistical model, which is considered as a method to acquire, generalize, and represent knowledge. Compared with other retrieval models based on concept dictionaries or concept networks, the retrieval model based on LSA has the advantages of strong computability and less human participation. LSA establishes a latent semantic space through truncated singular value decomposition. Words and documents in the latent semantic space are projected onto the dimension representing the latent concept, and then the semantic relationship between words can be extracted to present the semantic structure in natural language. This paper designs the system architecture of the public prosecutorial knowledge graph. Combining the graph data storage technology and the characteristics of the public domain ontology, a knowledge graph storage method is designed. By building a prototype system, the functions of knowledge management, knowledge query, and knowledge push are realized. A named entity recognition method based on bidirectional long-short-term memory (bi-LSTM) combined with conditional random field (CRF) is proposed. Bi-LSTM-CRF performs named entity recognition based on character-level features. CRF can use the transition matrix to further obtain the relationship between each position label, so that bi-LSTM-CRF not only retains the context information but also considers the influence between the current position and the previous position. The experimental results show that the LSTM-entity-context method proposed in this paper improves the representation ability of text semantics compared with other algorithms. However, this method only introduces relevant entity information to supplement the semantic representation of the text. The order in the case is often ignored, especially when it comes to the time series of the case characteristics, and the “order problem” may eventually affect the final prediction result. The knowledge graph of legal documents of theft cases based on ontology can be updated and maintained in real time. The knowledge graph can conceptualize, share, and perpetuate knowledge related to procuratorial organs and can also reasonably utilize and mine many useful experiences and knowledge to assist in decision-making.
15

Melnikova, Antonina V., Marina S. Vorobeva, and Anna V. Glazkova. "Comparison of pre-trained models for domain-specific entity extraction from student report documents." Modeling and Analysis of Information Systems 32, no. 1 (2025): 66–79. https://doi.org/10.18255/1818-1015-2025-1-66-79.

Abstract:
The authors propose a methodology for extracting domain-specific entities from student report documents in the Russian language using pre-trained transformer-based language models. Extracting domain-specific entities from student report documents is a relevant task, since the obtained data can be used for various purposes, ranging from the formation of project teams to the personalization of learning pathways. Additionally, automating the document processing workflow reduces the labor costs associated with manual processing. As training material, expert-annotated student report documents were used. These documents were created by students in information technology programs between 2019 and 2022 for project-based and practical disciplines and theses. The domain-specific entity extraction task is approached as two subtasks: named entity recognition (NER) and annotated text generation. A comparative analysis was conducted among encoder-only NER models (ruBERT, ruRoBERTa), encoder-decoder models (ruT5, mBART), and decoder-only models (ruGPT, T-lite) for text generation. The effectiveness of the models was evaluated using the F1-score, along with an analysis of common errors. The highest F1-score on the test set was achieved by mBART (93.55%). This model also showed the lowest error rate in domain-specific entity identification during text generation and annotation. The NER models demonstrated a lower tendency for errors but tended to extract domain-specific entities in a fragmented manner. The obtained results indicate the applicability of the examined models for solving the stated tasks, considering the specific requirements of the problem.
16

El Moussaoui, Taoufiq, Chakir Loqman, and Jaouad Boumhidi. "Exploring the Impact of Annotation Schemes on Arabic Named Entity Recognition across General and Specific Domains." Engineering, Technology & Applied Science Research 15, no. 2 (2025): 21918–24. https://doi.org/10.48084/etasr.10205.

Abstract:
Named Entity Recognition (NER) is a fundamental task in natural language processing (NLP) that involves identifying and classifying entities into predefined categories. Despite its importance, the impact of annotation schemes and their interaction with domain types on NER performance, particularly for Arabic, remains underexplored. This study examines the influence of seven annotation schemes (IO, BIO, IOE, BIOES, BI, IE, and BIES) on Arabic NER performance using the general-domain ANERCorp dataset and a domain-specific Moroccan legal corpus. Three models were evaluated: Logistic Regression (LR), Conditional Random Fields (CRF), and the transformer-based Arabic Bidirectional Encoder Representations from Transformers (AraBERT) model. Results show that the impact of annotation schemes on performance is independent of domain type. Traditional Machine Learning (ML) models such as LR and CRF perform best with simpler annotation schemes like IO due to their computational efficiency and balanced precision-recall metrics. On the other hand, AraBERT excels with more complex schemes (BIOES, BIES), achieving superior performance in tasks requiring nuanced contextual understanding and intricate entity relationships, though at the cost of higher computational demands and execution time. These findings underscore the trade-offs between annotation scheme complexity and computational requirements, offering valuable insights for designing NER systems tailored to both general and domain-specific Arabic NLP applications.
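
The annotation schemes compared here are mechanical re-encodings of the same spans, which is easy to see in code. A small sketch converting BIO tags to IO and BIOES (assuming well-formed BIO input):

```python
def bio_to_io(tags):
    # IO drops boundary information entirely.
    return ["I-" + t[2:] if t != "O" else "O" for t in tags]

def bio_to_bioes(tags):
    # BIOES adds S- for single-token entities and E- for entity-final tokens.
    out = []
    for i, t in enumerate(tags):
        if t == "O":
            out.append(t)
            continue
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        ends = nxt != "I-" + t[2:]
        prefix = ("S-" if ends else "B-") if t.startswith("B-") else ("E-" if ends else "I-")
        out.append(prefix + t[2:])
    return out

print(bio_to_bioes(["B-PER", "I-PER", "O", "B-LOC"]))  # ['B-PER', 'E-PER', 'O', 'S-LOC']
```
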
17

Subowo, Edy, Imam Bukhori, and Warto. "Corpus Development and NER Model for Identification of Legal Entities (Articles, Laws, and Sanctions) in Corruption Court Decisions in Indonesia." Transactions on Informatics and Data Science 2, no. 1 (2025): 27–39. https://doi.org/10.24090/tids.v2i1.13592.

Abstract:
This study aims to develop an annotated corpus and a deep learning-based Named Entity Recognition (NER) model to identify legal entities in Indonesian corruption court rulings. The corpus was constructed from 450 Supreme Court documents related to the Anti-Corruption Law (Law No. 31/1999), collected via web scraping, with semi-automatic annotation (regex) and validation by legal experts. A total of 12,000 entities (Article, Laws, Sanctions) were tagged in IOB format, creating the first specialized dataset for Indonesian corruption law. The NER model combines the IndoBERT (pre-trained language model) architecture with a CRF layer, fine-tuned to handle legal text complexities such as hierarchical article references (paragraphs, clauses) and amended law citations (jo.). Evaluation using 10-fold cross-validation revealed that the model achieved an F1-score of 92.3%, outperforming standalone CRF (85.1%) and BiLSTM+CRF (88.7%), particularly in detecting ARTICLE entities (F1: 93.8%). Error analysis highlighted challenges in recognizing SANCTIONS entities (F1: 87.4%) due to sentence structure variability and conjunctions. The model's implementation could accelerate judicial decision analysis, identify violation patterns, and support sanctions recommendation systems for law enforcement. This research also provides legal entity annotation guidelines adaptable to other legal domains. Future work should expand to other laws (e.g., the ITE Law, Criminal Code) via transfer learning and integrate knowledge graphs to enhance entity relation detection.
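
The semi-automatic (regex) pre-annotation step can be illustrated with simplified patterns for Indonesian citation forms; the patterns below are rough guesses, not the authors' actual rules, and real rulings need many more variants (ayat, huruf, jo. chains).

```python
import re

ARTICLE = re.compile(r"Pasal\s+\d+[A-Za-z]?(?:\s+ayat\s+\(\d+\))?")
LAW = re.compile(r"Undang-Undang\s+Nomor\s+\d+\s+Tahun\s+\d{4}")

text = ("Terdakwa terbukti melanggar Pasal 2 ayat (1) jo. Pasal 18 "
        "Undang-Undang Nomor 31 Tahun 1999.")
for pattern, label in [(ARTICLE, "ARTICLE"), (LAW, "LAWS")]:
    for m in pattern.finditer(text):
        print(label, m.span(), m.group(0))  # character offsets feed the IOB tagging
```
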
18

Majdik, Zoltan P., S. Scott Graham, Jade C. Shiva Edward, et al. "Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study." JMIR AI 3 (May 16, 2024): e52095. http://dx.doi.org/10.2196/52095.

Abstract:
Background: Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking. Objective: This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements. Methods: A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density. Results: Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant, with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS, with point estimates between 1.36 and 1.38. Conclusions: Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.
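
The core analysis is a small regression of model F1 on training-set size and entity density. A toy sketch with fabricated illustrative points (the study's actual data come from the 2500 fine-tuning subsamples):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

sentences = np.array([100, 200, 400, 600, 800])   # training-set size
eps = np.array([1.10, 1.20, 1.30, 1.35, 1.40])    # entities per sentence
f1 = np.array([0.80, 0.86, 0.91, 0.93, 0.94])     # fine-tuned model scores

X = np.column_stack([sentences, eps])
reg = LinearRegression().fit(X, f1)
print(reg.coef_, reg.score(X, f1))  # per-predictor slopes and R^2
```

The paper's threshold (segmented) regressions extend this by fitting a breakpoint beyond which extra sentences add little.
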
19

Vági, Renátó. "How Could Semantic Processing and Other NLP Tools Improve Online Legal Databases?" TalTech Journal of European Studies 13, no. 2 (2023): 138–51. http://dx.doi.org/10.2478/bjes-2023-0018.

Abstract:
The spread of online databases and the increasingly sophisticated search solutions of the past 10–15 years have opened up many new opportunities for lawyers to find relevant documents. However, it is still a common problem that the various legal databases and legal search engines face an information crisis. Legal database providers use various information extraction solutions, especially named entity recognition (NER), to mitigate this problem. These solutions can improve the relevance of the lists of results. Their limitation, however, is that they can only extract and create searchable metadata entities if the latter have a well-defined location or regularity in the text. Therefore, the next era of search support for legal databases is semantic processing. Semantic processing solutions are fundamentally different from information extraction and NER because they do not only extract and make visible and/or searchable a specific information element contained in the text but allow for analytical treatment of the text as a whole. In addition, in many cases, legal database developments using machine learning can be a significant burden on a company, as it is not always known what kind of AI solution is needed and how providers could compare the different solutions. Legal database providers need to customize the processing of their documents and texts in the most optimal way possible, considering all their legal, linguistic, statistical, or other characteristics. This is where text-processing pipelines can help. The article therefore reviews the two main natural language processing (NLP) solutions that can help legal database providers increase the value of legal data within legal databases. The article then shows the importance of text-processing pipelines and frameworks in the era of digitized documents and presents the digital-twin-distiller.
20

Gabud, Roselyn, Nelson Pampolina, Vladimir Mariano, and Riza Batista-Navarro. "Extracting Reproductive Condition and Habitat Information from Text Using a Transformer-based Information Extraction Pipeline." Biodiversity Information Science and Standards 7 (September 11, 2023): e112505. https://doi.org/10.3897/biss.7.112505.

Abstract:
Understanding the biology underpinning the natural regeneration of plant species in order to make plans for effective reforestation is a complex task. This can be aided by providing access to databases that contain long-term and wide-scale geographical information on species distribution, habitat, and reproduction. Although there exist widely used biodiversity databases that contain structured information on species and their occurrences, such as the Global Biodiversity Information Facility (GBIF) and the Atlas of Living Australia (ALA), the bulk of knowledge about biodiversity still remains embedded in textual documents. Unstructured information can be made more accessible and useful for large-scale studies if there are tools and services that automatically extract meaningful information from text and store it in structured formats, e.g., open biodiversity databases, ready to be consumed for analysis (Thessen et al. 2022). We aim to enrich biodiversity occurrence databases with information on species reproductive condition and habitat, derived from text. In previous work, we developed unsupervised approaches to extract related habitats and their locations, and related reproductive conditions and temporal expressions (Gabud and Batista-Navarro 2018). We built a new unsupervised hybrid approach for relation extraction (RE), a combination of classical rule-based pattern-matching methods and transformer-based language models, that frames our RE task as a natural language inference (NLI) task. Using our hybrid approach for RE, we were able to extract related biodiversity entities from text even without a large training dataset. In this work, we implement an information extraction (IE) pipeline comprised of a named entity recognition (NER) tool and our hybrid relation extraction (RE) tool. The NER tool is a transformer-based language model that was pretrained on scientific text and then fine-tuned using COPIOUS (Conserving Philippine Biodiversity by Understanding big data; Nguyen et al. 2019), a gold-standard corpus containing named entities relevant to species occurrence. We applied the NER tool to automatically annotate geographical location, temporal expression, and habitat information contained within sentences. A dictionary-based approach is then used to identify mentions of reproductive conditions in text (e.g., phrases such as "fruited heavily" and "mass flowering"). We then use our hybrid RE tool to extract reproductive condition - temporal expression and habitat - geographical location entity pairs. We test our IE pipeline on the forestry compendium available in the CABI Digital Library (Centre for Agricultural and Biosciences International), and show that our work enables the enrichment of descriptive information on reproductive and habitat conditions of species. This work is a step towards enhancing a biodiversity database with the inclusion of habitat and reproductive condition information extracted from text.
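
Framing relation extraction as natural language inference, as described here, can be sketched with an off-the-shelf zero-shot pipeline; the model choice and hypothesis templates below are assumptions for illustration, not the authors' exact setup.

```python
from transformers import pipeline

nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
sentence = ("Mass flowering of Shorea contorta was observed in April "
            "in lowland dipterocarp forest of Luzon.")
templates = ["A reproductive condition is linked to a time expression.",
             "A habitat is linked to a geographical location."]
result = nli(sentence, candidate_labels=templates, multi_label=True)
print(result["labels"], result["scores"])  # entailment-style score per relation
```
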
21

Gaurav, Kumar Sinha. "Democratized Exploration Insights using Augmented Analytics and NLP." Journal of Scientific and Engineering Research 9, no. 7 (2022): 122–33. https://doi.org/10.5281/zenodo.11219784.

Abstract:
The petroleum sector is swamped with an overwhelming quantity of disorganized data, originating from varied sources such as seismic investigations, geological studies, and reservoir simulations. Extracting meaningful information from this diverse and intricate data presents a significant challenge in analytics. Often, vital links are obscured within numerous documents and databases, which obstructs the efficiency of exploration and the strategizing of development. This study introduces an innovative cloud-based analytics enhancement platform that offers open access to crucial exploration insights through the use of natural language processing (NLP), mappings of knowledge graphs, and the creation of automated queries. Initially, data experts utilized Transformer models to train specialized language models that comprehend the specific jargon found within the petroleum engineering field. A knowledge graph, driven by ontology, was developed utilizing embeddings from GPT-3 to interconnect various entities such as geological basins, rock formations, and drilling equipment. Subsequently, automated web crawlers were employed to collect and catalogue textual reports into AWS data reservoirs, while assigning metadata labels through the application of named entity recognition. Interactive interfaces enable geologists to effortlessly search the database via textual conversations or through navigating visual representations of linked networks. Natural Language Querying (NLQ) engines translate inquiries into standard database queries, which serve up ranked sets of data cards that combine insights from both structured repositories and disorganized texts. Analyses of networks reveal previously unnoticed interconnections. The findings of this research signify that advancements in augmented analytics have the potential to considerably boost productivity in industries that deal with large volumes of unstructured information.
22

K, Santhanalakshmi, A Jameer Basha, R Geetha Rajakumari, and Premkumar C D. "INTELLIDOC - An Adaptive Transformer-Powered Pipeline For Intelligent Document Processing And Entity Extraction." International Journal of Computational and Experimental Science and Engineering 11, no. 3 (2025). https://doi.org/10.22399/ijcesen.2481.

Abstract:
Efficient and accurate processing of unstructured document data is crucial for legal, enterprise, and academic applications, where vast amounts of textual information must be extracted, summarized, and analyzed. Traditional Optical Character Recognition (OCR) and Named Entity Recognition (NER) methods often face challenges in handling handwritten text, scanned documents, and complex legal structures, leading to data loss and misclassification. To address these limitations, we propose IntelliDoc, an adaptive, transformer-powered document processing pipeline designed to enhance accuracy, efficiency, and contextual understanding of document intelligence. IntelliDoc employs a hybridized multi-stage pipeline that integrates an adaptive OCR layer, which dynamically adjusts to different document characteristics, ensuring high extraction accuracy for diverse document types. Experimental evaluations on a benchmark dataset comprising legal, financial, and administrative documents demonstrate that IntelliDoc achieves an OCR accuracy of 98.2%, NER precision of 94.7%, and a summarization coherence score of 91.5%, significantly outperforming conventional document processing frameworks. Additionally, the parallel architecture reduces processing time by 35% compared to sequential models, making IntelliDoc suitable for real-time applications. Future work will explore integrating domain-specific large language models to further enhance interpretability and accuracy across specialized document categories.
23

Izzidien, Ahmed. "Using the interest theory of rights and Hohfeldian taxonomy to address a gap in machine learning methods for legal document analysis." Humanities and Social Sciences Communications 10, no. 1 (2023). http://dx.doi.org/10.1057/s41599-023-01693-z.

Abstract:
Rights and duties are essential features of legal documents. Machine learning algorithms have been increasingly applied to extract information from such texts. Currently, their main focus is on named entity recognition, sentiment analysis, and the classification of court cases to predict court outcomes. In this paper it is argued that until the essential features of such texts are captured, their analysis can remain bottle-necked by the very technology being used to assess them. As such, the use of legal theory to identify the most pertinent dimensions of such texts is proposed: specifically, the interest theory of rights and the first-order Hohfeldian taxonomy of legal relations. These principal legal dimensions allow for a stratified representation of knowledge, making them ideal for the abstractions needed for machine learning. This study considers how such dimensions may be identified. To do so, it implements a novel heuristic grounded in philosophy coupled with language models. Hohfeldian relations of 'rights-duties' vs. 'privileges-no-rights' are determined to be identifiable. Each type of relation is classified with an accuracy of 92.5% using Sentence Bidirectional Encoder Representations from Transformers. Testing is carried out on religious discrimination policy texts in the United Kingdom.
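
A compact sketch of the sentence-level classification reported here: embed sentences with Sentence-BERT and fit a lightweight classifier over the embeddings. The model name, sentences, and labels below are placeholders, not the paper's data.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["The employer must accommodate religious observance.",
             "The employee may wear religious symbols at work."]
labels = ["rights-duties", "privileges-no-rights"]  # first-order Hohfeldian pairs

clf = LogisticRegression().fit(encoder.encode(sentences), labels)
print(clf.predict(encoder.encode(["Schools must allow prayer breaks."])))
```
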
24

Mentzingen, Hugo, Nuno António, and Fernando Bacao. "Effectiveness in retrieving legal precedents: exploring text summarization and cutting-edge language models toward a cost-efficient approach." Artificial Intelligence and Law, February 20, 2025. https://doi.org/10.1007/s10506-025-09440-2.

Abstract:
This study examines the interplay between text summarization techniques and embeddings from Language Models (LMs) in constructing expert systems dedicated to the retrieval of legal precedents, with an emphasis on achieving cost-efficiency. Grounded in the growing domain of Artificial Intelligence (AI) in law, our research confronts the perennial challenges of computational resource optimization and the reliability of precedent identification. Through Named Entity Recognition (NER) and part-of-speech (POS) tagging, we juxtapose various summarization methods to distill legal documents into a convenient form that retains their essence. We investigate the effectiveness of these methods in conjunction with state-of-the-art embeddings based on Large Language Models (LLMs), particularly ADA from OpenAI, which is trained on a wide range of general-purpose texts. Utilizing a dataset from one of Brazil's administrative courts, we explore the efficacy of embeddings derived from a Transformer model tailored to legal corpora against those from ADA, gauging the impact of parameter size, training corpora, and context window on retrieving legal precedents. Our findings suggest that while the full text embedded with ADA's extensive context window leads in retrieval performance, a balanced combination of POS-derived summaries and ADA embeddings presents a compelling trade-off between performance and resource expenditure, advocating for an efficient, scalable, intelligent system suitable for broad legal applications. This study contributes to the literature by delineating an optimal approach that harmonizes the dual imperatives of computational frugality and retrieval accuracy, propelling the legal field toward more strategic AI utilization.
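
At its core, the retrieval comparison reduces to ranking precedent embeddings by cosine similarity to a query case. Any encoder (the legal Transformer or ADA) can be plugged into the sketch below, which uses random vectors purely as stand-ins for real embeddings.

```python
import numpy as np

def rank_precedents(query_vec, doc_vecs):
    # Normalize, then rank by cosine similarity (dot product of unit vectors).
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(D @ q))  # indices of precedents, most similar first

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(5, 8))              # stand-in summary embeddings
query_vec = doc_vecs[2] + 0.05 * rng.normal(size=8)
print(rank_precedents(query_vec, doc_vecs))     # precedent 2 should rank first
```
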
25

Çetindağ, Can, Berkay Yazıcıoğlu, and Aykut Koç. "Named-entity recognition in Turkish legal texts." Natural Language Engineering, July 11, 2022, 1–28. http://dx.doi.org/10.1017/s1351324922000304.

Abstract:
Natural language processing (NLP) technologies and applications in legal text processing are gaining momentum. Being one of the most prominent tasks in NLP, named-entity recognition (NER) can provide great convenience for NLP in law due to the variety of named entities in the legal domain and their accentuated importance in legal documents. However, domain-specific NER models in the legal domain are not well studied. We present a NER model for Turkish legal texts with a custom-made corpus as well as several NER architectures based on conditional random fields and bidirectional long short-term memories (BiLSTMs) to address the task. We also study several combinations of different word embeddings consisting of GloVe, Morph2Vec, and neural network-based character feature extraction techniques either with BiLSTM or convolutional neural networks. We report a 92.27% F1-score with a hybrid word representation of GloVe and Morph2Vec with character-level features extracted with BiLSTM. Since Turkish is an agglutinative language, its morphological structure is also considered. To the best of our knowledge, our work is the first legal domain-specific NER study in Turkish and also the first study for an agglutinative language in the legal domain. Thus, our work can also have implications beyond the Turkish language.
26

Ardon Kotey, Allan Almeida, Hariaksh Pandya, et al. "NER Based Law Entity Privacy Protection." International Journal of Scientific Research in Computer Science, Engineering and Information Technology, December 10, 2023, 322–35. http://dx.doi.org/10.32628/cseit2390665.

Abstract:
Within the field of legal AI, named entity recognition (NER) is an essential step that must be completed before moving on to subsequent processing stages. In this paper, we present the creation of a dataset for training natural language understanding models in the legal domain. The dataset is produced by identifying and establishing a complete set of legal entities that goes beyond commonly employed entities such as person, organization, and location. These additional entities give annotators the means to effectively tag a wide variety of legal documents. The authors tried out several text annotation tools before settling on the one that proved most effective for this study. The completed annotations are saved in the JavaScript Object Notation (JSON) format, which makes the data more readable and easier to manipulate. The resulting dataset includes approximately thirty documents and five thousand sentences. These data are then used to train a pre-trained SpaCy pipeline for accurate legal named entity prediction. The accuracy of legal named entity recognition may be further improved by fine-tuning pre-trained models on legal texts.
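
Turning JSON annotations like these into spaCy training data is a short conversion; the record below and the COURT label are hypothetical examples of the dataset's format, not taken from the paper.

```python
import spacy
from spacy.tokens import DocBin

records = [{"text": "The petition was filed before the Bombay High Court.",
            "entities": [(34, 51, "COURT")]}]  # (start, end, label) char offsets

nlp = spacy.blank("en")
db = DocBin()
for rec in records:
    doc = nlp.make_doc(rec["text"])
    spans = [doc.char_span(s, e, label=lbl, alignment_mode="contract")
             for s, e, lbl in rec["entities"]]
    doc.ents = [s for s in spans if s is not None]  # drop misaligned spans
    db.add(doc)
db.to_disk("train.spacy")  # input for `python -m spacy train`
```
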
27

Muniz Belém, Fabiano, Cláudio Valiense, Celso França, et al. "Contextual Reinforcement, Entity Delimitation and Generative Data Augmentation for Entity Recognition and Relation Extraction in Official Documents." Journal of Information and Data Management 14, no. 1 (2023). http://dx.doi.org/10.5753/jidm.2023.3180.

Abstract:
Transformer architectures have become the main component of various state-of-the-art methods for natural language processing tasks, such as Named Entity Recognition and Relation Extraction (NER+RE). As these architectures rely on semantic (contextual) aspects of word sequences, they may fail to accurately identify and delimit entity spans when there is little semantic context surrounding the named entities. This is the case for entities composed only of digits and punctuation, such as IDs and phone numbers, as well as long composed names. In this article, we propose new techniques for contextual reinforcement and entity delimitation based on pre- and post-processing to provide a richer semantic context, improving SpERT, a state-of-the-art Span-based Entity and Relation Transformer. To provide further context to the training process of NER+RE, we propose a data augmentation technique based on Generative Pretrained Transformers (GPT). We evaluate our strategies using real data from public administration documents (official gazettes and biddings) and court lawsuits. Our results show that our pre- and post-processing strategies, when used jointly, allow significant improvements in NER+RE effectiveness, and we also show the benefits of using GPT for training data augmentation.
28

Păiș, Vasile, Maria Mitrofan, Carol Luca Gasan, et al. "LegalNERo: A linked corpus for named entity recognition in the Romanian legal domain." Semantic Web, June 5, 2023, 1–14. http://dx.doi.org/10.3233/sw-233351.

Abstract:
LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time expressions and legal resources mentioned in legal documents. Furthermore, GeoNames identifiers are provided. The resource is available in multiple formats, including span-based, token-based and RDF. The Linked Open Data version is available for both download and querying using SPARQL.
29

Nastou, Katerina, Mikaela Koutrouli, Sampo Pyysalo, and Lars Juhl Jensen. "CoNECo: A Corpus for Named Entity Recognition and Normalization of Protein Complexes." Bioinformatics Advances, August 20, 2024. http://dx.doi.org/10.1093/bioadv/vbae116.

Abstract:
Motivation: Despite significant progress in biomedical information extraction, there is a lack of resources for Named Entity Recognition (NER) and Normalization (NEN) of protein-containing complexes. Current resources inadequately address the recognition of protein-containing complex names across different organisms, underscoring the crucial need for a dedicated corpus. Results: We introduce the Complex Named Entity Corpus (CoNECo), an annotated corpus for NER and NEN of complexes. CoNECo comprises 1,621 documents with 2,052 entities, 1,976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets and trained both a transformer-based and a dictionary-based tagger on them. Evaluation on the test set demonstrated robust performance, with F-scores of 73.7% and 61.2%, respectively. Subsequently, we applied the best taggers for comprehensive tagging of the entire openly accessible biomedical literature. Availability: All resources, including the annotated corpus, training data, and code, are available to the community through Zenodo https://zenodo.org/records/11263147 and GitHub https://zenodo.org/records/10693653.
30

Naik, Varsha, Rajeswari K, and Purvang Patel. "Enhancing Semantic Searching of Legal Documents Through LSTM-Based Named Entity Recognition and Semantic Classification." International Journal for the Semiotics of Law - Revue internationale de Sémiotique juridique, April 27, 2024. http://dx.doi.org/10.1007/s11196-024-10157-9.

31

Oliveira, Vitor, Gabriel Nogueira, Thiago Faleiros, and Ricardo Marcacini. "Combining prompt-based language models and weak supervision for labeling named entity recognition on legal documents." Artificial Intelligence and Law, February 15, 2024. http://dx.doi.org/10.1007/s10506-023-09388-1.

32

Li, Jianfu, Qiang Wei, Omid Ghiasvand, et al. "A comparative study of pre-trained language models for named entity recognition in clinical trial eligibility criteria from multiple corpora." BMC Medical Informatics and Decision Making 22, S3 (2022). http://dx.doi.org/10.1186/s12911-022-01967-7.

Abstract:
Background: Clinical trial protocols are the foundation for advancing medical sciences; however, the extraction of accurate and meaningful information from the original clinical trials is very challenging due to the complex and unstructured texts of such documents. Named entity recognition (NER) is a fundamental and necessary step to process and standardize the unstructured text in clinical trials using Natural Language Processing (NLP) techniques. Methods: In this study we fine-tuned pre-trained language models to support the NER task on clinical trial eligibility criteria. We systematically investigated four pre-trained contextual embedding models for the biomedical domain (i.e., BioBERT, BlueBERT, PubMedBERT, and SciBERT) and two models for the open domain (BERT and SpanBERT) for NER tasks using three existing clinical trial eligibility criteria corpora. In addition, we also investigated the feasibility of data augmentation approaches and evaluated their performance. Results: Our evaluation results using tenfold cross-validation show that domain-specific transformer models achieved better performance than the general transformer models, with the best performance obtained by the PubMedBERT model (F1-scores of 0.715, 0.836, and 0.622 for the three corpora, respectively). The data augmentation results show that it is feasible to leverage additional corpora to improve NER performance. Conclusions: Findings from this study not only demonstrate the importance of contextual embeddings trained from domain-specific corpora, but also shed light on the benefits of leveraging multiple data sources for the challenging NER task in clinical trial eligibility criteria text.
APA, Harvard, Vancouver, ISO, and other styles
34

Cutforth, Murray, Hannah Watson, Cameron Brown, et al. "Acute stroke CDS: automatic retrieval of thrombolysis contraindications from unstructured clinical letters." Frontiers in Digital Health 5 (June 14, 2023). http://dx.doi.org/10.3389/fdgth.2023.1186516.

Full text
Abstract:
Introduction: Thrombolysis treatment for acute ischaemic stroke can lead to better outcomes if administered early enough. However, contraindications exist which put the patient at greater risk of a bleed (e.g. recent major surgery, anticoagulant medication). Therefore, clinicians must check a patient's past medical history before proceeding with treatment. In this work we present a machine learning approach for accurate automatic detection of this information in unstructured text documents such as discharge letters or referral letters, to support the clinician in making a decision about whether to administer thrombolysis. Methods: We consulted local and national guidelines for thrombolysis eligibility, identifying 86 entities which are relevant to the thrombolysis decision. A total of 8,067 documents from 2,912 patients were manually annotated with these entities by medical students and clinicians. Using this data, we trained and validated several transformer-based named entity recognition (NER) models, focusing on transformer models which have been pre-trained on a biomedical corpus, as these have shown most promise in the biomedical NER literature. Results: Our best model was a PubMedBERT-based approach, which obtained a lenient micro/macro F1 score of 0.829/0.723. Ensembling 5 variants of this model gave a significant boost to precision, obtaining micro/macro F1 of 0.846/0.734, which approaches the human annotator performance of 0.847/0.839. We further propose numeric definitions for the concepts of name regularity (similarity of all spans which refer to an entity) and context regularity (similarity of all context surrounding mentions of an entity), using these to analyse the types of errors made by the system and finding that the name regularity of an entity is a stronger predictor of model performance than raw training set frequency. Discussion: Overall, this work shows the potential of machine learning to provide clinical decision support (CDS) for the time-critical decision of thrombolysis administration in ischaemic stroke by quickly surfacing relevant information, leading to prompt treatment and hence to better patient outcomes.
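The precision gain from ensembling reads like span-level voting. As a hedged sketch (the span format and voting threshold are assumptions, not the paper's published method), strict-majority voting over five models' predicted spans could look like this:

```python
# Hedged sketch: majority-vote ensembling of entity spans from several NER
# model variants, one plausible way to trade recall for precision.
# The (start, end, label) span format is an assumption about the predictions.
from collections import Counter

def ensemble_spans(predictions_per_model, min_votes=3):
    """Keep a (start, end, label) span only if at least min_votes models
    predicted exactly that span; with 5 models, min_votes=3 is a strict
    majority, which tends to raise precision at some cost to recall."""
    votes = Counter()
    for spans in predictions_per_model:
        votes.update(set(spans))   # each model votes at most once per span
    return sorted(s for s, n in votes.items() if n >= min_votes)

# Toy usage: three of five models agree on the anticoagulant mention
models_out = [
    [(10, 21, "ANTICOAG")], [(10, 21, "ANTICOAG")], [(10, 21, "ANTICOAG")],
    [(10, 25, "ANTICOAG")], [],
]
print(ensemble_spans(models_out))   # -> [(10, 21, 'ANTICOAG')]
```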
APA, Harvard, Vancouver, ISO, and other styles
35

Bourdois, Loick, Marta Avalos, Gabrielle Chenais, et al. "De-identification of Emergency Medical Records in French: Survey and Comparison of State-of-the-Art Automated Systems." International FLAIRS Conference Proceedings 34, no. 1 (2021). http://dx.doi.org/10.32473/flairs.v34i1.128480.

Full text
Abstract:
In France, structured data from emergency room (ER) visits are aggregated at the national level to build a syndromic surveillance system for several health events. For visits motivated by a traumatic event, information on the causes is stored in free-text clinical notes. To exploit these data, an automated de-identification system guaranteeing protection of privacy is required. In this study we review available tools to de-identify free-text clinical documents in French. A key point is how to overcome the resource barrier that hampers NLP applications in languages other than English. We compare rule-based, named entity recognition, new transformer-based deep learning, and hybrid systems using, when required, a fine-tuning set of 30,000 unlabeled clinical notes. The evaluation is performed on a test set of 3,000 manually annotated notes. Hybrid systems, combining capabilities in complementary tasks, show the best performance. This work is a first step in the foundation of a national surveillance system based on the exhaustive collection of ER visit reports for automated trauma monitoring.
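As an illustration of the hybrid idea, the sketch below combines rule-based patterns (reliable for regular formats such as dates and phone numbers) with spans from a statistical NER model; the patterns, labels, and `model_spans` interface are assumptions for illustration, not the systems actually benchmarked in the survey.

```python
# Hedged sketch: a hybrid de-identification pass combining regex rules with
# spans from a statistical NER model (better suited to names). model_spans
# is a placeholder for the output of a French clinical NER model, and the
# sketch assumes the collected spans do not overlap.
import re

RULES = {
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),   # French phone format
}

def deidentify(text: str, model_spans=()):
    spans = list(model_spans)                     # e.g., [(start, end, "PER")]
    for label, pattern in RULES.items():
        spans += [(m.start(), m.end(), label) for m in pattern.finditer(text)]
    # Replace from the end of the text so earlier offsets stay valid
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text

# "Patient seen on 12/03/2021, call back at 06 12 34 56 78."
note = "Patient vu le 12/03/2021, rappeler au 06 12 34 56 78."
print(deidentify(note))  # -> "Patient vu le <DATE>, rappeler au <PHONE>."
```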
APA, Harvard, Vancouver, ISO, and other styles
36

Gabud, Roselyn, Nelson Pampolina, Vladimir Mariano, and Riza Batista-Navarro. "Extracting Reproductive Condition and Habitat Information from Text Using a Transformer-based Information Extraction Pipeline." Biodiversity Information Science and Standards 7 (September 11, 2023). http://dx.doi.org/10.3897/biss.7.112505.

Full text
Abstract:
Understanding the biology underpinning the natural regeneration of plant species in order to make plans for effective reforestation is a complex task. This can be aided by providing access to databases that contain long-term and wide-scale geographical information on species distribution, habitat, and reproduction. Although there exist widely used biodiversity databases that contain structured information on species and their occurrences, such as the Global Biodiversity Information Facility (GBIF) and the Atlas of Living Australia (ALA), the bulk of knowledge about biodiversity still remains embedded in textual documents. Unstructured information can be made more accessible and useful for large-scale studies if there are tools and services that automatically extract meaningful information from text and store it in structured formats, e.g., open biodiversity databases, ready to be consumed for analysis (Thessen et al. 2022). We aim to enrich biodiversity occurrence databases with information on species reproductive condition and habitat, derived from text. In previous work, we developed unsupervised approaches to extract related habitats and their locations, and related reproductive conditions and temporal expressions (Gabud and Batista-Navarro 2018). We built a new unsupervised hybrid approach for relation extraction (RE), a combination of classical rule-based pattern-matching methods and transformer-based language models that framed our RE task as a natural language inference (NLI) task. Using our hybrid approach for RE, we were able to extract related biodiversity entities from text even without a large training dataset. In this work, we implement an information extraction (IE) pipeline comprising a named entity recognition (NER) tool and our hybrid RE tool. The NER tool is a transformer-based language model that was pretrained on scientific text and then fine-tuned using COPIOUS (Conserving Philippine Biodiversity by Understanding big data; Nguyen et al. 2019), a gold standard corpus containing named entities relevant to species occurrence. We applied the NER tool to automatically annotate geographical location, temporal expression, and habitat information contained within sentences. A dictionary-based approach is then used to identify mentions of reproductive conditions in text (e.g., phrases such as "fruited heavily" and "mass flowering"). We then use our hybrid RE tool to extract reproductive condition - temporal expression and habitat - geographical location entity pairs. We test our IE pipeline on the forestry compendium available in the CABI Digital Library (Centre for Agricultural and Biosciences International), and show that our work enables the enrichment of descriptive information on reproductive and habitat conditions of species. This work is a step towards enhancing a biodiversity database with the inclusion of habitat and reproductive condition information extracted from text.
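To illustrate how RE can be framed as NLI, here is a minimal sketch using an off-the-shelf zero-shot classification pipeline; the model choice, example sentence, and hypothesis wording are illustrative assumptions rather than the authors' exact setup.

```python
# Hedged sketch: relation extraction framed as natural language inference via
# a zero-shot classification pipeline. Model and hypotheses are illustrative.
from transformers import pipeline

nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sentence = ("Mass flowering of Shorea guiso was observed in April "
            "in lowland dipterocarp forest near Mount Makiling.")

# Candidate relations expressed as natural-language hypotheses; the NLI model
# scores how strongly the sentence entails each one
hypotheses = [
    "The reproductive condition is related to the temporal expression.",
    "The habitat is related to the geographical location.",
]
result = nli(sentence, candidate_labels=hypotheses, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{score:.2f}  {label}")   # entailment-style confidence per relation
```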
APA, Harvard, Vancouver, ISO, and other styles
37

Iscoe, Mark, Vimig Socrates, Aidan Gilson, et al. "Identifying signs and symptoms of urinary tract infection from emergency department clinical notes using large language models." Academic Emergency Medicine, April 3, 2024. http://dx.doi.org/10.1111/acem.14883.

Full text
Abstract:
Background: Natural language processing (NLP) tools, including recently developed large language models (LLMs), have myriad potential applications in medical care and research, including the efficient labeling and classification of unstructured text such as electronic health record (EHR) notes. This opens the door to large-scale projects that rely on variables that are not typically recorded in a structured form, such as patient signs and symptoms. Objectives: This study is designed to acquaint the emergency medicine research community with the foundational elements of NLP, highlighting essential terminology, annotation methodologies, and the intricacies involved in training and evaluating NLP models. Symptom characterization is critical to urinary tract infection (UTI) diagnosis, but identification of symptoms from the EHR has historically been challenging, limiting large-scale research, public health surveillance, and EHR-based clinical decision support. We therefore developed and compared two NLP models to identify UTI symptoms from unstructured emergency department (ED) notes. Methods: The study population consisted of patients aged ≥ 18 who presented to an ED in a northeastern U.S. health system between June 2013 and August 2021 and had a urinalysis performed. We annotated a random subset of 1250 ED clinician notes from these visits for a list of 17 UTI symptoms. We then developed two task-specific models to perform named entity recognition: a convolutional neural network-based model (spaCy) and a transformer-based model designed to process longer documents (Clinical Longformer). Models were trained on 1000 notes and tested on a holdout set of 250 notes. We compared model performance (precision, recall, F1 measure) at identifying the presence or absence of UTI symptoms at the note level. Results: A total of 8135 entities were identified in 1250 notes; 83.6% of notes included at least one entity. The overall F1 measure for note-level symptom identification weighted by entity frequency was 0.84 for the spaCy model and 0.88 for the Longformer model. The F1 measure for identifying the presence or absence of any UTI symptom in a clinical note was 0.96 (232/250 correctly classified) for the spaCy model and 0.98 (240/250 correctly classified) for the Longformer model. Conclusions: The study demonstrated the utility of NLP models, and transformer-based models in particular, for extracting UTI symptoms from unstructured ED clinical notes; models were highly accurate for detecting the presence or absence of any UTI symptom at the note level, with variable performance for individual symptoms.
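Since the headline numbers are note-level, rolling entity predictions up to a note-level presence/absence label is the key aggregation step. The sketch below shows one plausible way to do it; the symptom label set and entity tuple format are assumptions, not the study's published code.

```python
# Hedged sketch: aggregating entity-level NER predictions into note-level
# presence/absence of any UTI symptom, the outcome the study scores.
# Entity tuples (start, end, label) are an assumed prediction format, and
# UTI_SYMPTOMS is an illustrative subset of the study's 17 symptom labels.
UTI_SYMPTOMS = {"DYSURIA", "FREQUENCY", "URGENCY", "HEMATURIA"}

def note_has_symptom(entities) -> bool:
    return any(label in UTI_SYMPTOMS for _, _, label in entities)

def note_level_scores(gold_notes, pred_notes):
    # gold_notes / pred_notes: list of entity lists, one per clinical note
    tp = fp = fn = 0
    for gold, pred in zip(gold_notes, pred_notes):
        g, p = note_has_symptom(gold), note_has_symptom(pred)
        tp += int(g and p)
        fp += int(not g and p)
        fn += int(g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```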
APA, Harvard, Vancouver, ISO, and other styles
38

Kang, Tian, Jessica Munger, Erik T. Mueller, Arpita Saha, and Victoria L. Chiou. "A natural language processing (NLP) approach for optical character recognition (OCR)-resilient extraction, correction, and structuring of karyotype data in oncology clinical notes." Journal of Clinical Oncology 43, no. 16_suppl (2025). https://doi.org/10.1200/jco.2025.43.16_suppl.e13644.

Full text
Abstract:
e13644 Background: Cytogenetics drives precision oncology by uncovering genetic abnormalities that contribute to cancers. Karyotypes, essential for cytogenetic analysis, are documented in the International System for Human Cytogenetic Nomenclature (ISCN) and prevalent in various free-text clinical documents, posing challenges for manual abstraction and computational processing. Furthermore, OCR technology used to digitize these documents often introduces errors and compromises the secondary use of health data, which is especially problematic for ISCN notation, where a single character change can alter meaning. In response, we present a novel NLP approach to extract and structure karyotype data from clinical notes using automated OCR error correction. Methods: We developed a cancer-type-agnostic NLP pipeline by training two semi-supervised models on randomly sampled clinical notes (> 85% from oncology patients, including breast, lung, and hematopoietic cancers) in the Tempus Database (Tempus AI, Inc., Chicago, IL): a named entity recognition (NER) model to identify karyotype strings in ISCN notation, and T5, a transformer-based model, for OCR error correction in identified karyotypes. We employed two-tiered fine-tuning of T5 for OCR error correction to reduce the need for manual curation: first on a public karyotype database with synthetic OCR errors, then on real OCR errors from clinical notes. The pipeline then standardized and structured the karyotype string into an in-house common data model. NLP model performance was evaluated against Gemma-2-27b, a state-of-the-art large language model, on curated labels from a mixture of clinical records. Results: The karyotype extraction model, trained as an NER task, achieved precision, recall, and F1 scores of 0.86, 0.92, and 0.90, respectively, on a test set of 2800 curated ISCN karyotype strings. Our fine-tuned T5 model significantly outperformed Gemma-2, correcting 95% of synthetic OCR errors in a test set of 44,756 karyotype strings and 84% of real OCR errors from clinical notes in a test set of 2,790 karyotype strings, compared to Gemma-2's 41% and 43%, respectively. Error analysis showed Gemma-2's tendency to edit uncommon but correct karyotypes to common ones and to inaccurately extend short karyotypes. Conclusions: To our knowledge, this is the first NLP-driven method for extracting and structuring karyotypes in clinical notes using fully automated OCR error correction, irrespective of cancer type or document type. The model outperformed a state-of-the-art LLM in OCR error correction, accelerating the abstraction of cytogenetic information from clinical notes at scale. This advancement provides actionable cytogenetic information to oncology healthcare teams, enhancing the delivery of patient care.
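The first fine-tuning tier, synthetic OCR corruption of valid ISCN strings, lends itself to a compact sketch; the confusion table, prompt prefix, and base checkpoint below are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch: injecting synthetic OCR errors into valid ISCN karyotype
# strings to create (corrupted -> clean) pairs for a T5-style corrector.
import random
from transformers import T5TokenizerFast, T5ForConditionalGeneration

# Visually confusable characters typical of OCR output (illustrative table)
OCR_CONFUSIONS = {"1": "l", "l": "1", "0": "O", "O": "0", "(": "{", ";": ":"}

def corrupt(karyotype: str, rate: float = 0.1) -> str:
    # Independently swap each character for a confusable one with prob. `rate`
    return "".join(
        OCR_CONFUSIONS.get(c, c) if random.random() < rate else c
        for c in karyotype
    )

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

clean = "46,XX,t(9;22)(q34;q11.2)"
noisy = corrupt(clean)

# One supervised pair; the loss is computed against the clean string
inputs = tokenizer("fix karyotype: " + noisy, return_tensors="pt")
targets = tokenizer(clean, return_tensors="pt").input_ids
loss = model(**inputs, labels=targets).loss
loss.backward()   # one gradient step of the fine-tuning loop (optimizer omitted)
```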
APA, Harvard, Vancouver, ISO, and other styles