Journal articles on the topic 'Word2Vec embedding'

Consult the top 50 journal articles for your research on the topic 'Word2Vec embedding.'

1

Lu, Zihao, Xiaohui Hu, and Yun Xue. "Dual-Word Embedding Model Considering Syntactic Information for Cross-Domain Sentiment Classification." Mathematics 10, no. 24 (2022): 4704. http://dx.doi.org/10.3390/math10244704.

Abstract:
The purpose of cross-domain sentiment classification (CDSC) is to fully utilize the rich labeled data in the source domain to help the target domain perform sentiment classification even when labeled data are insufficient. Most of the existing methods focus on obtaining domain transferable semantic information but ignore syntactic information. The performance of BERT may decrease because of domain transfer, and traditional word embeddings, such as word2vec, cannot obtain contextualized word vectors. Therefore, achieving the best results in CDSC is difficult when only BERT or word2vec is used. In this paper, we propose a Dual-word Embedding Model Considering Syntactic Information for Cross-domain Sentiment Classification. Specifically, we obtain dual-word embeddings using BERT and word2vec. After performing BERT embedding, we pay closer attention to semantic information, mainly using self-attention and TextCNN. After word2vec word embedding is obtained, the graph attention network is used to extract the syntactic information of the document, and the attention mechanism is used to focus on the important aspects. Experiments on two real-world datasets show that our model outperforms other strong baselines.
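
A minimal sketch of the dual-embedding idea described above: pair contextual BERT vectors with static word2vec vectors for the same sentence. The model names, the local word2vec file, and the toy sentence are illustrative assumptions, not the authors' exact pipeline.

    # Hedged sketch: one contextual and one static view of the same tokens.
    import numpy as np
    import torch
    from transformers import AutoModel, AutoTokenizer
    from gensim.models import KeyedVectors

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")
    w2v = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)  # assumed local file

    sentence = "the battery life is great"
    tokens = sentence.split()

    with torch.no_grad():
        enc = tokenizer(sentence, return_tensors="pt")
        contextual = bert(**enc).last_hidden_state[0, 1:-1]  # drop [CLS]/[SEP]

    static = np.stack([w2v[t] if t in w2v else np.zeros(w2v.vector_size)
                       for t in tokens])
    # One branch of such a model would consume `contextual` (semantics), the
    # other `static` (syntax, via a graph attention network in the paper).
    print(tuple(contextual.shape), static.shape)
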
2

Liu, Ruoyu. "Exploring the Impact of Word2Vec Embeddings Across Neural Network Architectures for Sentiment Analysis." Applied and Computational Engineering 97, no. 1 (2024): 93–98. http://dx.doi.org/10.54254/2755-2721/97/2024melb0085.

Abstract:
Sentiment analysis is crucial for understanding public opinion, gauging customer satisfaction, and making informed business decisions based on the emotional tone of textual data. This study investigates the performance of different Word2Vec-based embedding strategies (static, non-static, and multichannel) for sentiment analysis across various neural network architectures, including Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRUs). Despite the rise of advanced contextual embedding methods such as Bidirectional Encoder Representations from Transformers (BERT), Word to Vector (Word2Vec) retains its importance due to its simplicity and lower computational demands, making it ideal for use in settings with limited resources. The goal is to evaluate the impact of fine-tuning Word2Vec embeddings on the accuracy of sentiment classification. Using the Internet Movie Database (IMDb), this work finds that multichannel embeddings, which combine static and non-static representations, provide the best performance across most architectures, while static embeddings continue to deliver strong results in specific sequential models. These findings highlight the balance between efficiency and accuracy in traditional word embeddings, particularly when advanced models are not feasible.
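
The three strategies are easy to picture in code. Below is a hedged Keras sketch with made-up sizes: a frozen (static) channel, a trainable (non-static) channel, and their multichannel combination feeding a small CNN. It illustrates the idea, not the authors' implementation.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    vocab_size, dim, seq_len = 20000, 300, 200
    emb = np.random.rand(vocab_size, dim).astype("float32")  # stand-in for word2vec weights
    init = keras.initializers.Constant(emb)

    inp = layers.Input(shape=(seq_len,))
    static = layers.Embedding(vocab_size, dim, embeddings_initializer=init,
                              trainable=False)(inp)          # static channel
    nonstatic = layers.Embedding(vocab_size, dim, embeddings_initializer=init,
                                 trainable=True)(inp)        # fine-tuned channel
    multichannel = layers.Concatenate()([static, nonstatic]) # both channels together

    x = layers.Conv1D(128, 5, activation="relu")(multichannel)
    x = layers.GlobalMaxPooling1D()(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
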
3

Liu, Ruoyu. "Exploring the Impact of Word2Vec Embeddings Across Neural Network Architectures for Sentiment Analysis." Applied and Computational Engineering 94, no. 1 (2024): 106–11. http://dx.doi.org/10.54254/2755-2721/94/2024melb0085.

Abstract:
Sentiment analysis is crucial for understanding public opinion, gauging customer satisfaction, and making informed business decisions based on the emotional tone of textual data. This study investigates the performance of different Word2Vec-based embedding strategies (static, non-static, and multichannel) for sentiment analysis across various neural network architectures, including Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRUs). Despite the rise of advanced contextual embedding methods such as Bidirectional Encoder Representations from Transformers (BERT), Word to Vector (Word2Vec) retains its importance due to its simplicity and lower computational demands, making it ideal for use in settings with limited resources. The goal is to evaluate the impact of fine-tuning Word2Vec embeddings on the accuracy of sentiment classification. Using the Internet Movie Database (IMDb), this work finds that multichannel embeddings, which combine static and non-static representations, provide the best performance across most architectures, while static embeddings continue to deliver strong results in specific sequential models. These findings highlight the balance between efficiency and accuracy in traditional word embeddings, particularly when advanced models are not feasible.
4

Tahmasebi, Nina. "A Study on Word2Vec on a Historical Swedish Newspaper Corpus." Digital Humanities in the Nordic and Baltic Countries Publications 1, no. 1 (2018): 25–37. http://dx.doi.org/10.5617/dhnbpub.11007.

Abstract:
Detecting word sense changes can be of great interest in the field of digital humanities. Thus far, most investigations and automatic methods have been developed and carried out on English text, and most recent methods make use of word embeddings. This paper presents a study on using Word2Vec, a neural word embedding method, on a Swedish historical newspaper collection. Our study includes a set of 11 words, and our focus is the quality and stability of the word vectors over time. We investigate whether a word embedding method like Word2Vec can be effectively used on texts where the volume and quality are limited.
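
A minimal sketch of such a stability check, with an invented toy corpus standing in for the newspaper collection: train one model per time slice and compare a word's nearest neighbours across slices.

    from gensim.models import Word2Vec

    # Illustrative stand-in corpora; real input would be tokenised newspaper text.
    slices = {
        "1880s": [["nya", "telegraf", "linjen"], ["telegraf", "station"]] * 50,
        "1890s": [["nya", "telefon", "linjen"], ["telefon", "apparat"]] * 50,
    }

    def neighbours(sentences, word, topn=3):
        model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, seed=1)
        return {w for w, _ in model.wv.most_similar(word, topn=topn)} if word in model.wv else set()

    per_period = {period: neighbours(sents, "linjen") for period, sents in slices.items()}

    # Stability: overlap of a word's nearest neighbours between adjacent periods.
    periods = sorted(per_period)
    for a, b in zip(periods, periods[1:]):
        shared = per_period[a] & per_period[b]
        print(f"{a} -> {b}: {len(shared)} shared neighbours {shared}")
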
5

Akshata, Upadhye. "A Deep Dive into Word2Vec and Doc2Vec Models in Natural Language Processing." Journal of Scientific and Engineering Research 7, no. 3 (2020): 244–49. https://doi.org/10.5281/zenodo.10902940.

Abstract:
In the field of natural language processing, the advent of word2vec and doc2vec models has reshaped the paradigm of language representation. This paper provides a comprehensive exploration of these distributed embedding models, tracing their historical development, key contributions, and advancements. The literature review provides the intricate details of word2vec and doc2vec, which act as the foundation for understanding their operational principles and variations. A critical analysis in the comparison section presents the strengths and weaknesses of both models and offers insights into their suitability for different applications. Real-world case studies are summarized to highlight the effectiveness of word2vec and doc2vec in several fields. Additionally, the challenges and limitations of these models are discussed to provide a holistic view of their capabilities. Finally, future perspectives on potential developments, including advancements in embedding techniques and domain-specific embeddings, are presented, and emerging trends such as the continued growth of contextual embeddings, ethical considerations, and interpretability are discussed. In conclusion, this paper offers a comprehensive overview of word2vec and doc2vec models, helpful for the ongoing exploration of distributed representations in natural language understanding.
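
For orientation, a small gensim sketch contrasting the two models surveyed above on a toy corpus: Word2Vec learns one vector per word, while Doc2Vec additionally learns one vector per document.

    from gensim.models import Word2Vec
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        ["word", "embeddings", "capture", "meaning"],
        ["document", "vectors", "capture", "topics"],
    ]

    w2v = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)
    print(w2v.wv.most_similar("capture", topn=2))   # word-level neighbours

    tagged = [TaggedDocument(words, [i]) for i, words in enumerate(corpus)]
    d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
    print(d2v.dv[0][:5])                             # learned vector for document 0
    print(d2v.infer_vector(["new", "document", "about", "meaning"])[:5])
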
6

Li, Saihan, and Bing Gong. "Word embedding and text classification based on deep learning methods." MATEC Web of Conferences 336 (2021): 06022. http://dx.doi.org/10.1051/matecconf/202133606022.

Abstract:
Traditional manual text classification methods can no longer cope with today's huge data volumes, and advances in deep learning have accelerated text classification technology. Against this background, we present different word embedding methods such as word2vec, doc2vec, TF-IDF, and a trainable embedding layer. After word embedding, we demonstrate 8 deep learning models that classify news text automatically and compare the accuracy of all the models; the "2-layer GRU model with pretrained word2vec embeddings" achieved the highest accuracy. Automatic text classification can help people summarize text accurately and quickly from a mass of textual information. It is a topic worth discussing in both academia and industry.
7

Romanyuk, Andriy. "Vector Representations of Ukrainian Words." Ukraina Moderna 27, no. 27 (2019): 46–72. http://dx.doi.org/10.30970/uam.2019.27.1062.

Abstract:
In this paper, Ukrainian word embeddings and their properties are examined. Provided are a theoretical description, a brief account of the most common technologies used to produce an embedding, and lists of implemented algorithms. Word2vec, the first technology for calculating word embeddings, is used to demonstrate modern neural approaches to computing them. Word2vec and FastText, which evolved from word2vec, are compared, and FastText's benefits are described. Word embeddings have been applied to solving the majority of practical tasks in natural language processing. One of the latest such applications has been the automatic construction of translation dictionaries. A previous analysis indicates that most of the words found in English-Ukrainian dictionaries are absent from the Great Electronic Dictionary of the Ukrainian Language (VESUM) project. For Ukrainian embeddings based on word2vec, GloVe, LexVec, and FastText, the Gensim open-source library was used to demonstrate the potential of the calculated models, and the results of repeating known calculation experiments are provided. They indicate that the hypothesis about the existence of biases and stereotypes in such models does not pertain to the Ukrainian language. The quality of the word embeddings is assessed on the basis of testing analogies, and adapting lexical data from a Ukrainian associative dictionary to construct a data selection for assessing the quality of word embeddings is proposed. Necessary tasks for future research in creating and utilizing Ukrainian word embeddings are listed.
8

Alachram, Halima, Hryhorii Chereda, Tim Beißbarth, Edgar Wingender, and Philip Stegmaier. "Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks." PLOS ONE 16, no. 10 (2021): e0258623. http://dx.doi.org/10.1371/journal.pone.0258623.

Abstract:
Biomedical and life science literature is an essential way to publish experimental results. With the rapid growth of the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can extract this knowledge and make it accessible to aid scientists in discovering new relationships between biological entities and answering biological questions. Making use of the word2vec approach, we generated word vector representations based on a corpus consisting of over 16 million PubMed abstracts. We developed a text mining pipeline to produce word2vec embeddings with different properties and performed validation experiments to assess their utility for biomedical analysis. An important pre-processing step consisted of substituting synonymous terms with their preferred terms from biomedical databases. Furthermore, we extracted gene-gene networks from two embedding versions and used them as prior knowledge to train Graph-Convolutional Neural Networks (CNNs) on large breast cancer gene expression data and on other cancer datasets. The performance of the resulting models was compared to Graph-CNNs trained with protein-protein interaction (PPI) networks or with networks derived using other word embedding algorithms. We also assessed the effect of corpus size on the variability of word representations. Finally, we created a web service with a graphical and a RESTful interface to extract and explore relations between biomedical terms using annotated embeddings. Comparisons to biological databases showed that relations between entities such as known PPIs, signaling pathways and cellular functions, or narrower disease ontology groups correlated with higher cosine similarity. Graph-CNNs trained with word2vec-embedding-derived networks performed sufficiently well on the metastatic event prediction tasks compared to other networks, which validates the utility of our generated word embeddings in constructing biological networks. Word representations produced by text mining algorithms like word2vec are therefore able to capture biologically meaningful relations between entities. Our generated embeddings are publicly available at https://github.com/genexplain/Word2vec-based-Networks/blob/main/README.md.
9

JP, Sanjanasri, Vijay Krishna Menon, Soman KP, Rajendran S, and Agnieszka Wolk. "Generation of Cross-Lingual Word Vectors for Low-Resourced Languages Using Deep Learning and Topological Metrics in a Data-Efficient Way." Electronics 10, no. 12 (2021): 1372. http://dx.doi.org/10.3390/electronics10121372.

Abstract:
Linguists have been focused on a qualitative comparison of the semantics from different languages. Evaluation of the semantic interpretation among disparate language pairs like English and Tamil is an even more formidable task than for Slavic languages. The concept of word embedding in Natural Language Processing (NLP) has enabled a felicitous opportunity to quantify linguistic semantics. Multi-lingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of the other. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe and FastText. A novel evaluation paradigm was devised for the generation of embeddings to assess their effectiveness, using the original embeddings as ground truths. Transferability across other target languages of the proposed model was assessed via pre-trained Word2Vec embeddings from Hindi and Chinese languages. We empirically prove that with a bilingual dictionary of a thousand words and a corresponding small monolingual target (Tamil) corpus, useful embeddings can be generated by transfer learning from a well-trained source (English) embedding. Furthermore, we demonstrate the usability of generated target embeddings in a few NLP use-case tasks, such as text summarization, part-of-speech (POS) tagging, and bilingual dictionary induction (BDI), bearing in mind that those are not the only possible applications.
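
The transfer function in the paper is a deep network; the classic baseline it generalises is a linear map fitted on a seed bilingual dictionary, sketched here with random stand-in vectors in place of real English and Tamil embeddings.

    import numpy as np

    # X: English vectors, Y: Tamil vectors for the same dictionary entries.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 300))   # stand-ins for real embeddings
    Y = rng.normal(size=(1000, 300))

    W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # minimise ||XW - Y||

    def translate(vec_en, tamil_vocab_vectors):
        """Project an English vector into Tamil space; return the nearest row."""
        proj = vec_en @ W
        sims = tamil_vocab_vectors @ proj / (
            np.linalg.norm(tamil_vocab_vectors, axis=1) * np.linalg.norm(proj) + 1e-9)
        return int(np.argmax(sims))

    print(translate(X[0], Y))   # with real embeddings, ideally index 0
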
10

Ahn, Yoonjoo, Eugene Rhee, and Jihoon Lee. "Dual embedding with input embedding and output embedding for better word representation." Indonesian Journal of Electrical Engineering and Computer Science 27, no. 2 (2022): 1091–99. https://doi.org/10.11591/ijeecs.v27.i2.pp1091-1099.

Abstract:
Recent studies on distributed vector representations for words offer a variety of ways to represent words. We propose several ways of using input embeddings and output embeddings to represent words better than a single model. We compared performance in terms of word analogy and word similarity for the individual input and output embeddings and for various dual embeddings that combine the two. Performance evaluation results show that the proposed dual embeddings outperform each single embedding, especially when the input and output embeddings are simply added together. This paper establishes two things: i) not only the input embedding but also the output embedding carries meaning useful for representing words, and ii) combining input and output embeddings into a dual embedding outperforms using either embedding individually.
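
In gensim terms, the input and output matrices of a negative-sampling Word2Vec model are directly accessible, so the additive combination the paper found strongest can be sketched as follows; the toy corpus is illustrative.

    from gensim.models import Word2Vec

    corpus = [["king", "queen", "royal"], ["man", "woman", "person"]] * 50
    model = Word2Vec(corpus, vector_size=50, min_count=1, sg=1, negative=5)

    input_emb = model.wv.vectors   # one row per word (input side)
    output_emb = model.syn1neg     # same shape (output side, negative sampling)
    dual = input_emb + output_emb  # "input + output" dual embedding

    # Rows of both matrices share the same word order (key_to_index).
    idx = model.wv.key_to_index["queen"]
    print(dual[idx][:5])
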
11

Santana, Isabel N., Raphael S. Oliveira, and Erick G. S. Nascimento. "Text Classification of News Using Transformer-based Models for Portuguese." Journal of Systemics, Cybernetics and Informatics 20, no. 5 (2022): 33–59. http://dx.doi.org/10.54808/jsci.20.05.33.

Abstract:
This work proposes the use of a fine-tuned Transformer-based Natural Language Processing (NLP) model called BERTimbau to generate word embeddings from texts published in a Brazilian newspaper, in order to create a robust NLP model to classify news in Portuguese, a task that is costly for humans to perform on large amounts of data. To assess this approach, besides the generation of embeddings by the fine-tuned BERTimbau, a comparative analysis was conducted using the Word2Vec technique. The first step of the work was to rearrange the news from nineteen into ten categories to reduce class imbalance in the corpus, using the K-means and TF-IDF techniques. In the Word2Vec step, the CBOW and Skip-gram architectures were applied. In the BERTimbau and Word2Vec steps, the Doc2Vec method was used to represent each news item as a unique embedding, generating a document embedding for each news item. Accuracy, weighted accuracy, precision, recall, F1-score, ROC AUC, and PRC AUC were used to evaluate the results. The fine-tuned BERTimbau captured distinctions in the texts of the different categories, showing that the classification model based on it performs better than the other explored techniques.
12

Raheem, Mafas, and Yi Chien Chong. "E-Commerce Fake Reviews Detection Using LSTM with Word2Vec Embedding." Journal of Computing and Information Technology 32, no. 2 (2024): 65–80. http://dx.doi.org/10.20532/cit.2024.1005803.

Abstract:
Customer reviews inform potential buyers' decisions, but fake reviews in e-commerce can skew perceptions as customers may feel pressured to leave positive feedback. Detecting fake reviews in e-commerce platforms is a critical challenge, impacting online shopping and deceiving customers. Effective detection strategies, employing deep learning architectures and word embeddings, are essential to combat this issue. Specifically, the study presented in this paper employed a 1-layer Simple LSTM model, a 1D Convolutional model, and a combined CNN+LSTM model. These models were trained using different pre-trained word embeddings including Word2Vec, GloVe, FastText, and Keras embeddings, to convert the text data into vector form. The models were evaluated based on accuracy and F1-score to provide a comprehensive measure of their performance. The results indicated that the Simple LSTM model with Word2Vec embeddings achieved an accuracy of nearly 91% and an F1-score of 0.9024, outperforming all other model-embedding combinations. The 1D convolutional model performed best without any embeddings, suggesting its ability to extract meaningful features from the raw text. The transformer-based models, BERT and DistilBERT, showed progressive learning but struggled with generalization, indicating the need for strategies such as early stopping, dropout, or regularization to prevent overfitting. Notably, the DistilBERT model consistently outperformed the LSTM model, achieving optimal performance with an accuracy of 96% and an F1-score of 0.9639 using a batch size of 32 and a learning rate of 4.00E-05.
13

Siti Khomsah, Rima Dias Ramadhani, and Sena Wijaya. "The Accuracy Comparison Between Word2Vec and FastText On Sentiment Analysis of Hotel Reviews." Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) 6, no. 3 (2022): 352–58. http://dx.doi.org/10.29207/resti.v6i3.3711.

Abstract:
Word embedding vectorization is more efficient than Bag-of-Words in word vector size. Word embedding also overcomes the loss of information related to sentence context, word order, and semantic relationships between words in sentences. Several kinds of word embedding are often considered for sentiment analysis, such as Word2Vec and FastText. FastText works on character n-grams, while Word2Vec is based on whole words. This research aims to compare the accuracy of sentiment analysis models using Word2Vec and FastText. Both models are tested on sentiment analysis of Indonesian hotel reviews using a dataset from TripAdvisor. Word2Vec and FastText both use the skip-gram model with the same parameters: number of features, minimum word count, number of parallel threads, and context window size. The vectorizers are combined with ensemble learning (Random Forest, Extra Tree, and AdaBoost), with a Decision Tree as the baseline for measuring the performance of both models. The results showed that both FastText and Word2Vec increase accuracy with Random Forest and Extra Tree, and FastText reached higher accuracy than Word2Vec when using Extra Tree and Random Forest as classifiers. FastText improved accuracy by 8 percentage points over the Decision Tree baseline (85%), reaching 93% with 100 estimators.
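
The experimental setup is straightforward to reproduce in gensim. A hedged sketch with stand-in reviews and shared skip-gram parameters, also showing FastText's n-gram handling of unseen words:

    from gensim.models import FastText, Word2Vec

    # Stand-in tokenised reviews; real input would be the TripAdvisor data.
    reviews = [["kamar", "bersih", "nyaman"], ["pelayanan", "ramah", "cepat"]] * 100
    params = dict(vector_size=100, window=5, min_count=2, workers=4, sg=1)

    w2v = Word2Vec(reviews, **params)
    ft = FastText(reviews, **params)

    # FastText composes vectors from character n-grams, so it can embed words
    # it never saw during training; plain Word2Vec cannot.
    print("kamarnya" in w2v.wv)    # False: out of vocabulary
    print(ft.wv["kamarnya"][:5])   # still returns a vector
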
14

Shilpi, Kulshretha, and Lodha Lokesh. "Performance Evaluation of Word Embedding Algorithms." Performance Evaluation of Word Embedding Algorithms 8, no. 12 (2023): 7. https://doi.org/10.5281/zenodo.10443962.

Abstract:
This study explores the field of word embedding and thoroughly examines and contrasts various word embedding algorithms. Words retain their semantic relationships and meaning when they are transformed into vectors using word embedding models. Numerous methods have been put forth, each with unique benefits and drawbacks. Making wise choices when using word embedding for NLP tasks requires an understanding of these methods and their relative efficacy. The study presents the methodology and potential uses of each technique and discusses its advantages and disadvantages. The fundamental ideas and workings of well-known word embedding methods, such as Word2Vec, GloVe, FastText, and the contextual embeddings ELMo and BERT, are evaluated in this paper. The performance of these algorithms is evaluated on three datasets on the basis of word similarity and word analogy, and the results are compared. Keywords: Embedding, Word2Vec, Global Vectors for Word Representation (GloVe), Embeddings from Language Models (ELMo), BERT.
15

Fan, Yadan, Serguei Pakhomov, Reed McEwan, Wendi Zhao, Elizabeth Lindemann, and Rui Zhang. "Using word embeddings to expand terminology of dietary supplements on clinical notes." JAMIA Open 2, no. 2 (2019): 246–53. http://dx.doi.org/10.1093/jamiaopen/ooz007.

Abstract:
Objective: The objective of this study is to demonstrate the feasibility of applying word embeddings to expand the terminology of dietary supplements (DS) using over 26 million clinical notes. Methods: Word embedding models (i.e., word2vec and GloVe) trained on clinical notes were used to predefine a list of the top 40 semantically related terms for each of 14 commonly used DS. Each list was further evaluated by experts to generate semantically similar terms. We investigated the effect of corpus size and other settings (i.e., vector size and window size), as well as the 2 word embedding models, on performance for DS term expansion. We compared the number of clinical notes (and the patients they represent) retrieved using the word-embedding-expanded terms to those retrieved with both the baseline terms and terms expanded from external DS sources. Results: Using the word embedding models trained on clinical notes, we could identify 1-12 semantically similar terms for each DS. Using the expanded terms, we retrieved on average 8.39% more clinical notes and 11.68% more patients for each DS compared with the other 2 sets of terms. Increasing the corpus size resulted in more misspellings, but not more semantic variants and brand names. The word2vec model was also found to be more capable of detecting semantically similar terms than GloVe. Conclusion: Our study demonstrates the utility of word embeddings trained on clinical notes for terminology expansion of 14 DS. We propose that this method can potentially be applied to create a DS vocabulary for downstream applications, such as information extraction.
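
The expansion step itself reduces to a nearest-neighbour query. A sketch assuming a hypothetical model file trained on clinical notes; the supplement name is only an example:

    from gensim.models import Word2Vec

    model = Word2Vec.load("word2vec_clinical_notes.model")  # hypothetical file
    # Nearest neighbours become candidate lexical variants (misspellings,
    # brand names) for expert review.
    candidates = model.wv.most_similar("ginkgo", topn=40)
    for term, score in candidates[:10]:
        print(f"{term}\t{score:.3f}")
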
16

Kamath, S., K. G. Karibasappa, Anvitha Reddy, Arati M. Kallur, B. B. Priyanka, and B. P. Bhagya. "Improving the Relation Classification Using Convolutional Neural Network." IOP Conference Series: Materials Science and Engineering 1187, no. 1 (2021): 012004. http://dx.doi.org/10.1088/1757-899x/1187/1/012004.

Abstract:
Relation extraction has been an emerging research topic in the field of Natural Language Processing. The proposed work classifies relations among the data by considering the semantic relevance of words, using word2vec embeddings to train a convolutional neural network. We use the semantic relevance of the words in a document to enrich the learning of the embeddings for improved classification. We designed a framework to automatically extract the relations between entities using deep learning techniques. The framework includes pre-processing, extraction of feature vectors using word2vec embeddings, and classification using convolutional neural networks. We perform extensive experimentation on benchmark datasets and show improved classification accuracy in comparison with state-of-the-art methodologies, including on the additional relations.
17

Kim, Jeongin, Taekeun Hong, and Pankoo Kim. "Replacing Out-of-Vocabulary Words with an Appropriate Synonym Based on Word2VnCR." Mobile Information Systems 2021 (July 16, 2021): 1–7. http://dx.doi.org/10.1155/2021/5548426.

Abstract:
The most typical problem in the analysis of natural language is finding synonyms of out-of-vocabulary (OOV) words. When someone tries to understand a sentence containing an OOV word, the person determines the most appropriate meaning of a replacement word using the meanings of co-occurring words in the same context, based on the conceptual system they have learned. In this study, a word-to-vector and conceptual relationship (Word2VnCR) algorithm is proposed that replaces an OOV word leading to an erroneous morphemic analysis with an appropriate synonym. The Word2VnCR algorithm improves on the conventional Word2Vec algorithm, which cannot by itself determine the similarity needed to suggest a replacement word. After word-embedding learning is conducted using the learning dataset, replacement word candidates for the OOV word are extracted. The semantic similarities of the extracted candidates are measured against the neighboring words surrounding the OOV word, and the candidate with the highest similarity value is selected as the replacement. To evaluate the performance of the proposed Word2VnCR algorithm, a comparative experiment was conducted using the Word2VnCR and Word2Vec algorithms. The experimental results indicate that the proposed algorithm shows higher accuracy than the Word2Vec algorithm.
18

Adrian, Muhammad Ghifari, Sri Suryani Prasetyowati, and Yuliant Sibaroni. "Effectiveness of Word Embedding GloVe and Word2Vec within News Detection of Indonesian Using LSTM." JURNAL MEDIA INFORMATIKA BUDIDARMA 7, no. 3 (2023): 1180. http://dx.doi.org/10.30865/mib.v7i3.6411.

Abstract:
In recent years the use of social media platforms in Indonesia has continued to increase. This growth has advantages and disadvantages: news is easily accessible to anyone, but much of the information that spreads is hoax news. Hoax news must be detected because it spreads false and misleading information, undermining the integrity of the information available to the public. By detecting hoax news, we can ensure the information being disseminated is accurate and trustworthy. In this study, we detect hoax news in Indonesian news media on Twitter using LSTM with the word embeddings GloVe and Word2Vec, and compare the two embeddings to find which performs best in the LSTM model. GloVe and Word2Vec were chosen for comparison because both represent words as vectors, yet their performance may differ: Word2Vec may better capture semantic relationships between words, whereas GloVe may better capture distributional relationships and word co-occurrence. This study shows that LSTM with Word2Vec performs better than LSTM with GloVe in detecting Indonesian-language news: LSTM with Word2Vec produced an average accuracy of 95%, while LSTM with GloVe produced an average accuracy of 90%.
19

Prasetyo, Teguh, Arya Adhyaksa Waskita, and Taswanda Taryo. "Analisis Sentimen Pengguna Mobil Listrik di Media Sosial Twitter Menggunakan Metode Klasifikasi Naïve Bayes, K-Nearest Neighbor (KNN), dan Decision Tree." Jurnal SISKOM-KB (Sistem Komputer dan Kecerdasan Buatan) 8, no. 2 (2025): 108–17. https://doi.org/10.47970/siskom-kb.v8i1.783.

Abstract:
This study aims to analyze the sentiment of electric vehicle users using three machine learning classification algorithms: Naïve Bayes, Decision Tree, and K-Nearest Neighbors. The data were taken from the social media platform Twitter, known for its large and diverse volume of data. The object of the research is public opinion expressed on Twitter, with the subjects being tweets collected via the Twitter API, yielding 2000 records, of which 1869 remained after preprocessing. Data analysis included text extraction and preprocessing, covering data cleaning, tokenization, stopword removal, and stemming. The results show the following sentiment distribution: neutral sentiment dominates with 53.5% of all tweets, followed by positive sentiment at 35.8% and negative sentiment at 10.7%. Among the tested models, the Decision Tree with TF-IDF embedding performed best, reaching 66% accuracy, while the Decision Tree with Word2Vec embedding performed worst at 46% accuracy. Meanwhile, KNN with TF-IDF embedding performed reasonably well at 58% accuracy, higher than KNN with Word2Vec embedding at 57%. Naïve Bayes with TF-IDF embedding achieved 53% accuracy, higher than Naïve Bayes with Word2Vec embedding at 48%. Although results varied across the algorithms and word embeddings used, no method consistently classified sentiment across all areas. This research contributes significantly to mapping sentiment about electric vehicle use through social media data analysis and provides insight into the effectiveness of various machine learning algorithms and word embeddings for sentiment analysis.
20

Nurdin, Arliyanti, Bernadus Anggo Seno Aji, Anugrayani Bustamin, and Zaenal Abidin. "PERBANDINGAN KINERJA WORD EMBEDDING WORD2VEC, GLOVE, DAN FASTTEXT PADA KLASIFIKASI TEKS." Jurnal Tekno Kompak 14, no. 2 (2020): 74. http://dx.doi.org/10.33365/jtk.v14i2.732.

Abstract:
The unstructured nature of text is a challenge for feature extraction in the field of text processing. This study aims to compare the performance of word embeddings such as Word2Vec, GloVe, and FastText, with classification by a Convolutional Neural Network. These three methods were chosen because, compared with traditional feature engineering such as Bag of Words, they can capture semantic and syntactic meaning, word order, and even the context around a word. The word embeddings produced by these methods are compared on news classification using the 20 Newsgroups and Reuters Newswire datasets. Performance is evaluated using F-measure. FastText performed best, outperforming the other two word embeddings with an F-measure of 0.979 on 20 Newsgroups and 0.715 on Reuters. However, the small performance differences between the three word embeddings show that they are competitive; the choice depends heavily on the dataset used and the problem to be solved. Keywords: word embedding, word2vec, glove, fasttext, text classification, convolutional neural network, cnn.
21

Xie, Zhongwei, Ling Liu, Yanzhao Wu, Luo Zhong, and Lin Li. "Learning Text-image Joint Embedding for Efficient Cross-modal Retrieval with Deep Feature Engineering." ACM Transactions on Information Systems 40, no. 4 (2022): 1–27. http://dx.doi.org/10.1145/3490519.

Abstract:
This article introduces a two-phase deep feature engineering framework for efficient learning of a semantics-enhanced joint embedding, which clearly separates the deep feature engineering in data preprocessing from training the text-image joint embedding model. We use the Recipe1M dataset for the technical description and empirical validation. In preprocessing, we combine deep feature engineering with semantic context features derived from the raw text-image input data. We leverage LSTM to identify key terms; deep NLP models from the BERT family, TextRank, or TF-IDF to produce ranking scores for key terms; and Word2vec to generate the vector representation for each key term. We leverage Wide ResNet50 and Word2vec to extract and encode the image category semantics of food images to help semantic alignment of the learned recipe and image embeddings in the joint latent space. In joint embedding learning, we perform deep feature engineering by optimizing the batch-hard triplet loss function with soft margin and double negative sampling, taking into account also the category-based alignment loss and discriminator-based alignment loss. Extensive experiments demonstrate that our SEJE approach with deep feature engineering significantly outperforms the state-of-the-art approaches.
22

Kang, Hyungsuc, and Janghoon Yang. "Performance Comparison of Word2vec and fastText Embedding Models." Journal of Digital Contents Society 21, no. 7 (2020): 1335–43. http://dx.doi.org/10.9728/dcs.2020.21.7.1335.

23

Adewumi, Tosin, Foteini Liwicki, and Marcus Liwicki. "Word2Vec: Optimal hyperparameters and their impact on natural language processing downstream tasks." Open Computer Science 12, no. 1 (2022): 134–41. http://dx.doi.org/10.1515/comp-2022-0236.

Abstract:
Word2Vec is a prominent model for natural language processing tasks. Similar inspiration is found in the distributed embeddings (word vectors) of recent state-of-the-art deep neural networks. However, the wrong combination of hyper-parameters can produce embeddings of poor quality. The objective of this work is to show empirically that an optimal combination of Word2Vec hyper-parameters exists, and to evaluate various combinations. We compare them with the publicly released, original Word2Vec embedding. Both intrinsic and extrinsic (downstream) evaluations are carried out, including named entity recognition and sentiment analysis. Our main contributions include showing that the best model is usually task-specific, that high analogy scores do not necessarily correlate positively with F1 scores, and that performance is not dependent on data size alone. If ethical considerations to save time, energy, and the environment are made, then relatively smaller corpora may do just as well or even better in some cases. Increasing the dimension size of embeddings beyond a point leads to poor quality or performance. In addition, using a relatively small corpus, we obtain better WordSim scores, corresponding Spearman correlation, and better downstream performance (with significance tests) compared to the original model, which was trained on a 100-billion-word corpus.
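
A hedged sketch of such a hyper-parameter sweep with gensim's bundled analogy test for intrinsic evaluation; the corpus is an assumed, reasonably large list of tokenised sentences (a toy corpus is too small to score meaningfully).

    from itertools import product
    from gensim.models import Word2Vec
    from gensim.test.utils import datapath

    def sweep(corpus):
        """Train Word2Vec under several hyper-parameter combinations and score
        each on the analogy set that ships with gensim."""
        for dim, window, sg in product([100, 300], [5, 10], [0, 1]):
            model = Word2Vec(corpus, vector_size=dim, window=window, sg=sg,
                             negative=5, epochs=5, workers=4)
            score, _ = model.wv.evaluate_word_analogies(
                datapath("questions-words.txt"))
            print(f"dim={dim} window={window} sg={sg} analogy={score:.3f}")

    # sweep(tokenised_sentences)  # corpus assumed available
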
24

Chen, Xi. "Performance analysis of robustness of BERT model under attack." Journal of Physics: Conference Series 2580, no. 1 (2023): 012022. http://dx.doi.org/10.1088/1742-6596/2580/1/012022.

Abstract:
With the aim of testing the robustness of machine learning models, this paper tests the performance of five classification models on the IMDB dataset. Two types of sentence embeddings, generated by word2vec and BERT, are perturbed with normally distributed noise of different intensities and fed into a Support Vector Machine for testing. The experimental results show that model performance decreases slowly as the noise intensity increases, and that the BERT-based sentence embedding degrades less than the word2vec-based one. This paper takes this experimental phenomenon as evidence supporting the robustness of BERT representation learning. At the same time, the paper notes that previous work perturbed the BERT model by replacing words in the original text of the IMDB dataset, causing BERT performance to drop sharply, whereas the experiment here tested robustness through embedding tampering and obtained stable results. Since both word substitution and embedding interference are plausible situations when operating a model, more detailed experiments are planned in future work to examine these two different phenomena and check the robustness of the model.
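
The perturbation protocol can be sketched with stand-in features: fit an SVM on clean embeddings, then score it on copies carrying Gaussian noise of increasing intensity. Random vectors stand in for the word2vec/BERT sentence embeddings of IMDB reviews.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 128))    # stand-in sentence embeddings
    y = (X[:, 0] > 0).astype(int)       # stand-in sentiment labels

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = SVC().fit(X_tr, y_tr)

    # Accuracy should fall gradually as the noise scale grows.
    for sigma in [0.0, 0.1, 0.5, 1.0, 2.0]:
        noisy = X_te + rng.normal(scale=sigma, size=X_te.shape)
        print(f"sigma={sigma}: accuracy={clf.score(noisy, y_te):.3f}")
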
25

Maslennikova, Yulia, and Vladimir Bochkarev. "Evaluation of word embedding models used for diachronic semantic change analysis." Journal of Physics: Conference Series 2701, no. 1 (2024): 012082. http://dx.doi.org/10.1088/1742-6596/2701/1/012082.

Abstract:
In the last decade, the quantitative analysis of diachronic changes in language and of lexical semantic change has become the subject of active research, with a significant role played by the development of new, effective word embedding techniques. This direction has been demonstrated effectively in a number of studies, some focusing on the optimal type of word2vec model, hyperparameters for training, and evaluation techniques. In this research, we used the Corpus of Historical American English (COHA). The paper presents the results of multiple training runs and a comparison of word2vec models with different hyperparameter settings used for lexical semantic change detection. In addition to traditional word similarity and analogical reasoning tests, we tested on an extended set of synonyms, evaluating word2vec models on more than 100,000 English synsets randomly selected from the WordNet database. We show that changing the word2vec model parameters (such as the embedding dimension, context window size, model type, and word discard rate) can significantly impact the resulting word embedding vector space and the detected lexical semantic changes. Additionally, the results depended strongly on properties of the corpus, such as its word frequency distribution.
26

Kalogeropoulos, Nikitas-Rigas, Dimitris Ioannou, Dionysios Stathopoulos, and Christos Makris. "On Embedding Implementations in Text Ranking and Classification Employing Graphs." Electronics 13, no. 10 (2024): 1897. http://dx.doi.org/10.3390/electronics13101897.

Abstract:
This paper aims to enhance the Graphical Set-based model (GSB) for ranking and classification tasks by incorporating node and word embeddings. The model integrates a textual graph representation with a set-based model for information retrieval. Initially, each document in a collection is transformed into a graph representation. The proposed enhancement involves augmenting the edges of these graphs with embeddings, which can be pretrained or generated using Word2Vec and GloVe models. Additionally, an alternative aspect of our proposed model consists of the Node2Vec embedding technique, which is applied to a graph created at the collection level through the extension of the set-based model, providing edges based on the graph’s structural information. Core decomposition is utilized as a method for pruning the graph. As a byproduct of our information retrieval model, we explore text classification techniques based on our approach. Node2Vec embeddings are generated by our graphs and are applied in order to represent the different documents in our collections that have undergone various preprocessing methods. We compare the graph-based embeddings with the Doc2Vec and Word2Vec representations to elaborate on whether our approach can be implemented on topic classification problems. For that reason, we then train popular classifiers on the document embeddings obtained from each model.
27

Alkaabi, Hussein, Ali Kadhim Jasim, and Ali Darroudi. "From Static to Contextual: A Survey of Embedding Advances in NLP." PERFECT: Journal of Smart Algorithms 2, no. 2 (2025): 57–66. https://doi.org/10.62671/perfect.v2i2.77.

Abstract:
Embedding techniques have been a cornerstone of Natural Language Processing (NLP), enabling machines to represent textual data in a form that captures semantic and syntactic relationships. Over the years, the field has witnessed a significant evolution—from static word embeddings, such as Word2Vec and GloVe, which represent words as fixed vectors, to dynamic, contextualized embeddings like BERT and GPT, which generate word representations based on their surrounding context. This survey provides a comprehensive overview of embedding techniques, tracing their development from early methods to state-of-the-art approaches. We discuss the strengths and limitations of each paradigm, their applications across various NLP tasks, and the challenges they address, such as polysemy and out-of-vocabulary words. Furthermore, we highlight emerging trends, including multimodal embeddings, domain-specific representations, and efforts to mitigate embedding bias. By synthesizing the advancements in this rapidly evolving field, this paper aims to serve as a valuable resource for researchers and practitioners while identifying open challenges and future directions for embedding research in NLP.
28

Karyaeva, Maria S., Pavel I. Braslavski, and Valery A. Sokolov. "Word Embedding for Semantically Relative Words: an Experimental Study." Modeling and Analysis of Information Systems 25, no. 6 (2018): 726–33. http://dx.doi.org/10.18255/1818-1015-2018-6-726-733.

Abstract:
The ability to identify semantic relations between words has made the word2vec model widely used in NLP tasks. The idea of word2vec is based on a simple rule: higher similarity is reached when two words have similar contexts. Each word can be represented as a vector, so vectors with the closest coordinates can be interpreted as similar words. This makes it possible to establish semantic relations (synonymy, hypernymy and hyponymy, and other semantic relations) by automatic extraction. Extracting semantic relations by hand is a time-consuming and biased task, requiring a large amount of time and the help of experts. Unfortunately, the word2vec model provides an associative list of words that does not consist of related words only. In this paper, we show some additional criteria that may be applicable to solve this problem. Observations and experiments with well-known characteristics, such as word frequency and position in an associative list, might be useful for improving results on the task of extracting semantic relations for the Russian language using word embeddings. In the experiments, a word2vec model trained on the Flibusta corpus is used, with pairs from Wiktionary as examples of semantic relationships. Semantically related words are applicable to thesauri, ontologies, and intelligent systems for natural language processing.
29

Pertiwi, Ayu, Azhari Azhari, and Sri Mulyana. "Fast2Vec, a modified model of FastText that enhances semantic analysis in topic evolution." PeerJ Computer Science 11 (May 19, 2025): e2862. https://doi.org/10.7717/peerj-cs.2862.

Abstract:
Background: Topic modeling approaches, such as latent Dirichlet allocation (LDA) and its successor, the dynamic topic model (DTM), are widely used to identify specific topics by extracting words with similar frequencies from documents. However, these topics often require manual interpretation, which poses challenges for constructing semantic topic evolution, mainly when topics contain negations, synonyms, or rare terms. Neural-network-based word embeddings, such as Word2Vec and FastText, have advanced semantic understanding but have their limitations: Word2Vec struggles with out-of-vocabulary (OOV) words, and FastText generates suboptimal embeddings for infrequent terms. Methods: This study introduces Fast2Vec, a novel model that integrates the semantic capabilities of Word2Vec with the subword analysis strength of FastText to enhance semantic analysis in topic modeling. The model was evaluated using research abstracts from the Science and Technology Index (SINTA) journal database and validated using twelve public word similarity benchmarks covering diverse semantic and syntactic dimensions. Evaluation metrics include Spearman and Pearson correlation coefficients to assess alignment with human judgments. Results: Experimental findings demonstrated that Fast2Vec outperforms or closely matches Word2Vec and FastText across most benchmark datasets, particularly in tasks requiring fine-grained semantic similarity. In OOV scenarios, Fast2Vec improved semantic similarity by 39.64% compared to Word2Vec and by 6.18% compared to FastText. Even in scenarios without OOV terms, Fast2Vec achieved a 7.82% improvement over FastText and a marginal 0.087% improvement over Word2Vec. Additionally, the model effectively categorized topics into four distinct evolution patterns (diffusion, shifting, moderate fluctuation, and stability), enabling a deeper understanding of evolving topic interests and their dynamic characteristics. Conclusion: Fast2Vec presents a robust and generalizable word embedding framework for semantic-based topic modeling. By combining the contextual sensitivity of Word2Vec with the subword flexibility of FastText, Fast2Vec effectively addresses prior limitations in handling OOV terms and semantic variation, and demonstrates strong potential for broader applications in natural language processing tasks.
30

Cahyana, Nur Heri, Yuli Fauziah, Wisnalmawati Wisnalmawati, Agus Sasmito Aribowo, and Shoffan Saifullah. "The Evaluation of Effects of Oversampling and Word Embedding on Sentiment Analysis." JURNAL INFOTEL 17, no. 1 (2025): 54–67. https://doi.org/10.20895/infotel.v17i1.1077.

Abstract:
Opinion datasets for sentiment analysis are generally unbalanced, and unbalanced data tends to bias classification toward the majority class. Balancing the data by adding synthetic samples to the minority class requires an oversampling strategy. This research aims to overcome this imbalance by combining oversampling and word embedding (Word2Vec or FastText). We convert each opinion into a sentence vector and then apply an oversampling method. We use 5 (five) datasets of comments on YouTube videos that differ in terms, number of records, and imbalance conditions. We observed increased sentiment analysis accuracy when combining Word2Vec or FastText with 3 (three) oversampling methods: SMOTE, Borderline SMOTE, or ADASYN. Random Forest is used as the machine learning classifier, and a confusion matrix is used for validation. Model performance is measured with accuracy and F-measure. After testing on the five datasets, the performance of Word2Vec is almost equal to FastText, and the best oversampling method is Borderline SMOTE. Combining Word2Vec or FastText with Borderline SMOTE can be the best choice, with accuracy and F-measure reaching 91.0%-91.3%. It is hoped that a sentiment analysis model using Word2Vec or FastText with Borderline SMOTE can serve as a high-performance alternative.
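
A sketch of the pipeline on toy data: average word2vec vectors per comment, oversample the minority class in embedding space, then train the Random Forest. Plain SMOTE is used here for robustness on tiny data; imblearn's BorderlineSMOTE, which the paper found best, is a drop-in swap.

    import numpy as np
    from gensim.models import Word2Vec
    from imblearn.over_sampling import SMOTE  # BorderlineSMOTE is a drop-in swap
    from sklearn.ensemble import RandomForestClassifier

    # Toy comments; real input would be tokenised YouTube comments.
    comments = ([["bagus", "sekali"], ["sangat", "bagus"], ["biasa", "saja"]] * 50
                + [["jelek"], ["sangat", "jelek"]] * 10)
    labels = np.array([1] * 150 + [0] * 20)   # imbalanced: few negatives

    w2v = Word2Vec(comments, vector_size=50, min_count=1, seed=1)
    X = np.array([np.mean([w2v.wv[w] for w in c], axis=0) for c in comments])

    # Oversample the minority class in embedding space, then classify.
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, labels)
    clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
    print(X.shape, "->", X_bal.shape)   # minority upsampled to match majority
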
31

Ayo-Soyemi, Olusola. "Market Sentiment Analysis Using NLP: Understanding Trends and Buyer Preferences in Real Estate and Environmental Sectors." Technix International Journal for Engineering Research 12, no. 3 (2025): 974–88. https://doi.org/10.5281/zenodo.15120636.

Abstract:
The explosive growth of digital platforms has produced an abundance of unstructured textual data, such as customer reviews, social media posts, and feedback forms, which can provide significant insights into consumer preferences and industry trends. To improve market sentiment research in the real estate and environmental sectors, this study investigates the use of Natural Language Processing (NLP) approaches, including word embedding models like Word2Vec, FastText, GloVe, and custom-developed embeddings. The study's goal is to use these models to transform raw textual data into structured insights that reveal consumer attitudes and behavior patterns. The research highlights the unique challenges of sentiment analysis, including handling domain-specific language, contextual nuances, and accurately classifying neutral sentiments. To address these issues, the study compares the performance of different word embedding models using intrinsic metrics (e.g., semantic similarity) and extrinsic measures (e.g., sentiment classification accuracy). The findings demonstrate that FastText excels in handling rare words and morphologically rich languages, while GloVe provides strong semantic representations. However, a custom embedding model incorporating BiLSTM and self-attention mechanisms outperforms the others in domain-specific sentiment analysis tasks with limited labeled data. This research underscores the importance of selecting appropriate embedding techniques based on task requirements and resource constraints. It also proposes potential solutions for improving neutral sentiment classification, such as enhanced dataset labeling and domain-specific embeddings. The study's findings have significant implications for businesses seeking to leverage sentiment analysis for product development, marketing strategies, customer engagement, and competitive positioning. By bridging advanced NLP techniques with practical applications, this research contributes to democratizing access to market sentiment analysis tools and enabling more responsive, consumer-centric business strategies. Index Terms: Natural Language Processing (NLP), Market Sentiment Analysis (MSA), Word Embedding Models (WEM), Word2Vec, FastText, GloVe.
32

Arif, Md Ariful Islam, Md Mahbubur Rahman, Md Golam Rabiul Alam, and M. Akhtaruzzaman. "Analyzing the Performance of Deep Learning Models for Detecting Hate Speech on Social Media Platforms." MIST INTERNATIONAL JOURNAL OF SCIENCE AND TECHNOLOGY 12 (December 26, 2024): 39–52. https://doi.org/10.47981/j.mijst.12(02)2024.466(39-52).

Abstract:
Social media and online platforms have become a major source of cyberbullying and hate speech, affecting people and communities in harmful ways. Hate speech on social media is rising in Bangladesh, creating a need for effective tools to prevent and detect these incidents. This study introduces deep learning models for identifying hate speech in text using three word embedding methods: Word2Vec, FastText, and BERT. The text data was labeled as hate speech or non-hate speech content and preprocessed by removing punctuation and symbols to help improve model accuracy. Five models (Bi-GRU-LSTM-CNN, Bi-LSTM, CNN, LSTM, and XGBoost) were trained to classify the text as hate speech or non-hate speech. The study found that the LSTM model achieved the highest accuracy, 95.66%, with the Word2Vec embedding method, while CNN reached 87.70% with FastText embeddings. Word2Vec is effective for capturing word meanings in general text classification, while FastText works well with rare words and languages that have complex word forms. These findings help advance effective hate speech detection techniques and could promote more respectful and inclusive interactions on social media, helping to stop cyberbullying and hate speech.
33

Susanty, Meredita, and Sahrul Sukardi. "Perbandingan Pre-trained Word Embedding dan Embedding Layer untuk Named-Entity Recognition Bahasa Indonesia." Petir 14, no. 2 (2021): 247–57. http://dx.doi.org/10.33322/petir.v14i2.1164.

Abstract:
Named-Entity Recognition (NER) is used to extract information from text by identifying entities such as the names of people, organizations, locations, times, and other entities. Recently, machine learning approaches, particularly deep learning, have been widely used to recognize patterns of entities in sentences. Embedding, the process of converting text data into numbers or vectors of numbers, translates high-dimensional representations into a relatively low-dimensional space. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. The embedding process can be performed using a supervised learning method, which requires a large amount of labeled data, or an unsupervised learning approach. This study compares the two embedding methods: a trainable embedding layer (supervised learning) and pre-trained word embeddings (unsupervised learning). The trainable embedding layer uses the embedding layer provided by the Keras library, while the pre-trained word embeddings use word2vec, GloVe, and fastText, to build NER models with the BiLSTM architecture. The results show that GloVe performed better than the other embedding techniques, with a micro-averaged F1 score of 76.48.
34

Zou, Zhuo. "Performance analysis of using multimodal embedding and word embedding transferred to sentiment classification." Applied and Computational Engineering 5, no. 1 (2023): 417–22. http://dx.doi.org/10.54254/2755-2721/5/20230610.

Abstract:
Multimodal machine learning is one of artificial intelligence's most important research topics. Contrastive Language-Image Pretraining (CLIP) is one application of multimodal machine learning and is widely applied in computer vision, but there is a research gap in applying CLIP to natural language processing. Therefore, based on the IMDB dataset, this paper applies the multimodal features of CLIP alongside three other pre-trained word vectors, GloVe, Word2vec, and BERT, to compare their effects on sentiment classification and to test the performance of CLIP's multimodal features in natural language processing. The results show that the multimodal features of CLIP do not produce a significant advantage on sentiment classification, and the other embeddings achieve better results. The highest accuracy is produced by BERT, the word embedding of CLIP yields the lowest of the four accuracies, and GloVe and Word2vec are relatively close. The reason may be that the pre-trained CLIP model learns SOTA image representations from pictures and their descriptions, which is unsuitable for sentiment classification tasks. The specific reason remains untested.
APA, Harvard, Vancouver, ISO, and other styles
35

Kim, Harang, and Hyun Min Song. "Lightweight IDS Framework Using Word Embeddings for In-Vehicle Network Security." Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications 15, no. 2 (2024): 1–13. http://dx.doi.org/10.58346/jowua.2024.i2.001.

Full text
Abstract:
As modern vehicle systems evolve into advanced cyber-physical systems, vehicle vulnerability to cyber threats has significantly increased. This paper discusses the need for advanced security in the Controller Area Network (CAN), which currently lacks security features. We propose a novel Intrusion Detection System (IDS) utilizing word embedding techniques from Natural Language Processing (NLP) for effective sequential pattern representations to improve intrusion detection in CAN traffic. This method transforms CAN identifiers into multi-dimensional vectors, enabling the model to capture complex sequential patterns of CAN traffic behaviors. Our methodology focuses on a lightweight neural network adaptable for automotive systems with limited computational resources. First, a Word2Vec model is trained to build the embedding matrix of CAN IDs. Then, using the pre-trained embedding layer extracted from the Word2Vec network, the classifier analyzes embeddings from CAN data to detect intrusions. This model is viable for resource-constrained environments due to its low computational expense and memory usage. Key contributions of this research are (1) the application of word embeddings for intrusion detection in CAN traffic, (2) a streamlined neural network that balances accuracy with efficiency, and (3) a comprehensive evaluation showing our model’s competitive performance compared to relatively heavy deep learning models. Experimental results using the Car-Hacking dataset, widely used for automotive security research, demonstrate that our IDS effectively detects four different types of attacks on CAN. This work advances vehicle security technologies, contributing to safer transportation systems.
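A minimal sketch of the CAN-ID embedding step, assuming gensim 4.x; the hexadecimal IDs and window length below are illustrative, not taken from the Car-Hacking dataset:

```python
from gensim.models import Word2Vec

# Toy stream of CAN identifiers as they might appear on the bus.
can_log = ["0x316", "0x153", "0x220", "0x316", "0x18f", "0x153", "0x220"]

# Slice the ID stream into fixed-length "sentences" so IDs that co-occur
# in normal traffic end up close together in embedding space.
window_len = 4
sentences = [can_log[i:i + window_len]
             for i in range(len(can_log) - window_len + 1)]

w2v = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1)  # skip-gram
embedding_matrix = w2v.wv.vectors            # rows reusable as a frozen layer
print(w2v.wv.most_similar("0x316", topn=2))  # IDs with similar traffic context
```

The resulting `embedding_matrix` could then initialize the embedding layer of a lightweight classifier, mirroring the two-stage scheme the abstract outlines.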
APA, Harvard, Vancouver, ISO, and other styles
36

Pahendra, Muhammad Agung Maugi, Siska Anraeni, and Lutfi Budi Ilmawan. "Perbandingan Kinerja Word Embedding dalam Analisis Sentimen Ulasan Pengguna Aplikasi Perjalanan." Jurnal Teknik Informatika dan Sistem Informasi 11, no. 1 (2025): 49–62. https://doi.org/10.28932/jutisi.v11i1.9681.

Full text
Abstract:
Traveloka, one of the leading travel booking platforms, has reached more than 50 million downloads on the Google Play Store. This achievement reflects users' strong interest in and trust toward the services offered. However, user reviews indicate several issues with the application's performance and stability that deserve attention. This study compares the performance of the Word2Vec and ELMo word embedding methods using a BiLSTM model for sentiment analysis of Traveloka application reviews. The results show that the BiLSTM model with Word2Vec achieved an accuracy of 76.13%, precision of 75.22%, and F1-measure of 76.58%, outperforming the model with ELMo, which achieved an accuracy of 74.38%, precision of 70.49%, and F1-measure of 74.40%. The BiLSTM model with Word2Vec is more effective for sentiment analysis of Traveloka reviews, helping to identify and address user issues in order to improve service quality and user satisfaction.
APA, Harvard, Vancouver, ISO, and other styles
37

Lumbantoruan, Rosni, Maria Puspita Sari Nababan, and Letare Aiglien Saragih. "Analisis Perbandingan FastText dan Word2Vec pada Sistem Temu Balik Informasi." PROSIDING SEMINAR NASIONAL SAINS DATA 4, no. 1 (2024): 1033–41. https://doi.org/10.33005/senada.v4i1.416.

Full text
Abstract:
Information retrieval systems that use machine learning approaches generally rely on word embeddings to represent documents and user queries. The choice of word embedding is a key factor affecting the performance of an information retrieval system, especially when processing text or sentences with unstructured characteristics. In this study, the two most widely used word embeddings, FastText and Word2Vec, are compared in terms of capturing and retrieving the semantic meaning of words. Experiments comparing the two approaches were conducted by applying each to two different datasets, Internet News and Movie Plots. The results show that each approach has its own characteristics: FastText, with its character n-gram word representations, can capture words that are similar in character composition, while Word2Vec finds similarity between words based on how often they co-occur in documents.
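The character-n-gram difference described above can be seen in a small gensim sketch; the corpus and the out-of-vocabulary word are toy placeholders:

```python
from gensim.models import FastText, Word2Vec

corpus = [["berita", "internet", "film"], ["plot", "film", "berita"]]

w2v = Word2Vec(corpus, vector_size=32, min_count=1)
ft = FastText(corpus, vector_size=32, min_count=1)

print("beritakan" in w2v.wv)   # False: Word2Vec has no vector for unseen words
vec = ft.wv["beritakan"]       # FastText composes one from character n-grams
```

This is why FastText can still rank documents for query words it never saw during training, while Word2Vec must fall back on exact vocabulary matches.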
APA, Harvard, Vancouver, ISO, and other styles
38

Lee, Jin-Hyeok, and Sang-Tae Han. "A study on a fake news identification model based on word embedding method." Korean Data Analysis Society 26, no. 6 (2024): 1847–53. https://doi.org/10.37727/jkdas.2024.26.6.1847.

Full text
Abstract:
With the spread of information technology and online media, fake news has become a persistent problem in our society. By developing a model to detect fake news, we aim to support the delivery of reliable information. Among natural language processing methods, we introduce several word embedding approaches for identifying fake news and compare the performance of deep learning models built on each of them. Word embedding extracts meaningful features from news text data and captures the meaning of, and relationships between, words; this helps identify information that does not match the actual content of a news article and flag it as fake news. After generating the embedding matrix for each word embedding method, TF-IDF, Word2Vec, and FastText, and combining the embedding layer with a deep learning LSTM model for fake news identification, we compare the accuracy of the models to determine the superior embedding method. Comparing the performance of the models in this study, we show that the Word2Vec method outperforms TF-IDF and FastText.
APA, Harvard, Vancouver, ISO, and other styles
39

Liaquathali, Shaheetha, and Vadivazhagan Kadirvelu. "Integration of natural language processing methods and machine learning model for malicious webpage detection based on web contents." IAES International Journal of Robotics and Automation (IJRA) 14, no. 1 (2025): 47. https://doi.org/10.11591/ijra.v14i1.pp47-57.

Full text
Abstract:
Malicious actors continually exploit vulnerabilities in web systems to distribute malware, launch phishing attacks, steal sensitive information, and perpetrate various forms of cybercrime. Traditional signature-based methods for detecting malicious webpages often struggle to keep pace with the rapid evolution of malware and cyber threats. As a result, there is a growing demand for more advanced and proactive approaches that can effectively identify malicious web content based on its characteristics and behavior. Detection based on web content is crucial because malicious webpages can be designed to mimic legitimate ones, making them difficult to identify through traditional means. By analyzing the content of webpages, it becomes possible to uncover patterns, anomalies, and malicious intent that may not be evident from surface-level inspection. The proposed approach integrates a pre-trained Word2Vec model with seven distinct machine learning classifiers to enhance malicious webpage detection. Initially, web contents (documents) are encoded using the Word2Vec model, followed by the computation of average Word2Vec embeddings for each document. Subsequently, each classifier is trained on the extracted average Word2Vec embedding features. The results demonstrate that the Word2Vec model significantly enhances detection accuracy, achieving an accuracy of 94.8% and an F1-score of 94.9% with the random forest classifier, and an accuracy of 94.6% and an F1-score of 94.7% with the extreme gradient boosting classifier.
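For illustration, the average-embedding features plus a classifier could be sketched as follows, assuming gensim and scikit-learn; the documents and labels are toy placeholders, not the paper's data:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier

docs = [["free", "login", "verify", "account"], ["weather", "news", "today"]]
labels = [1, 0]  # 1 = malicious page content (toy labels)

w2v = Word2Vec(docs, vector_size=50, min_count=1)

def avg_embedding(tokens):
    # Average the vectors of in-vocabulary tokens into one document feature.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

X = np.vstack([avg_embedding(d) for d in docs])
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
```

Any of the seven classifiers the abstract mentions could be dropped in place of the random forest, since all train on the same averaged features.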
APA, Harvard, Vancouver, ISO, and other styles
40

Moudhich, Ihab, and Abdelhadi Fennan. "Evaluating sentiment analysis and word embedding techniques on Brexit." IAES International Journal of Artificial Intelligence (IJ-AI) 13, no. 1 (2024): 695–702. https://doi.org/10.11591/ijai.v13.i1.pp695-702.

Full text
Abstract:
In this study, we investigate the effectiveness of pre-trained word embeddings for sentiment analysis on a real-world topic, namely Brexit. We compare the performance of several popular word embedding models, such as global vectors for word representation (GloVe), FastText, word2vec, and embeddings from language models (ELMo), on a dataset of tweets related to Brexit and evaluate their ability to classify the sentiment of the tweets as positive, negative, or neutral. We find that pre-trained word embeddings provide useful features for sentiment analysis and can significantly improve the performance of machine learning models. We also discuss the challenges and limitations of applying these models to complex, real-world texts such as those related to Brexit.
APA, Harvard, Vancouver, ISO, and other styles
41

Yulianti, Evi, Nicholas Pangestu, and Meganingrum Arista Jiwanggi. "Enhanced TextRank using weighted word embedding for text summarization." International Journal of Electrical and Computer Engineering (IJECE) 13, no. 5 (2023): 5472. http://dx.doi.org/10.11591/ijece.v13i5.pp5472-5482.

Full text
Abstract:
The length of a news article may influence people's interest to read the article. In this case, text summarization can help to create a shorter representative version of an article to reduce people's read time. This paper proposes to use weighted word embedding based on Word2Vec, FastText, and bidirectional encoder representations from transformers (BERT) models to enhance the TextRank summarization algorithm. The use of weighted word embedding is aimed to create better sentence representation, in order to produce more accurate summaries. The results show that using (unweighted) word embedding significantly improves the performance of the TextRank algorithm, with the best performance gained by the summarization system using BERT word embedding. When each word embedding is weighted using term frequency-inverse document frequency (TF-IDF), the performance of all systems using unweighted word embedding further significantly improves, with the biggest improvement achieved by the systems using Word2Vec (with 6.80% to 12.92% increase) and FastText (with 7.04% to 12.78% increase). Overall, our systems using weighted word embedding can outperform the TextRank method by up to 17.33% in ROUGE-1 and 30.01% in ROUGE-2. This demonstrates the effectiveness of weighted word embedding in the TextRank algorithm for text summarization.
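A minimal sketch of TF-IDF-weighted sentence vectors of this kind, assuming gensim and scikit-learn; for brevity it uses IDF values as the per-word weights, which is a simplification of the paper's TF-IDF weighting, and the sentences are toy examples:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [["stocks", "fell", "sharply"], ["markets", "rallied", "today"]]

w2v = Word2Vec(sentences, vector_size=50, min_count=1)
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens).fit(sentences)
weights = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def weighted_sentence_vector(tokens):
    # Weighted average: rarer (higher-IDF) words pull the sentence vector more.
    toks = [t for t in tokens if t in w2v.wv]
    num = sum(weights.get(t, 1.0) * w2v.wv[t] for t in toks)
    den = sum(weights.get(t, 1.0) for t in toks)
    return num / den

sent_vecs = np.vstack([weighted_sentence_vector(s) for s in sentences])
# Pairwise cosine similarities over `sent_vecs` would then weight the TextRank graph.
```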
APA, Harvard, Vancouver, ISO, and other styles
42

Yulianti, Evi, Nicholas Pangestu, and Jiwanggi Meganingrum Arista. "Enhanced TextRank using weighted word embedding for text summarization." International Journal of Electrical and Computer Engineering (IJECE) 13, no. 5 (2023): 5472–82. https://doi.org/10.11591/ijece.v13i5.pp5472-5482.

Full text
Abstract:
The length of a news article may influence people's interest to read the article. In this case, text summarization can help to create a shorter representative version of an article to reduce people's read time. This paper proposes to use weighted word embedding based on Word2Vec, FastText, and bidirectional encoder representations from transformers (BERT) models to enhance the TextRank summarization algorithm. The use of weighted word embedding is aimed to create better sentence representation, in order to produce more accurate summaries. The results show that using (unweighted) word embedding significantly improves the performance of the TextRank algorithm, with the best performance gained by the summarization system using BERT word embedding. When each word embedding is weighted using term frequency-inverse document frequency (TF-IDF), the performance of all systems using unweighted word embedding further significantly improves, with the biggest improvement achieved by the systems using Word2Vec (with 6.80% to 12.92% increase) and FastText (with 7.04% to 12.78% increase). Overall, our systems using weighted word embedding can outperform the TextRank method by up to 17.33% in ROUGE-1 and 30.01% in ROUGE-2. This demonstrates the effectiveness of weighted word embedding in the TextRank algorithm for text summarization.
APA, Harvard, Vancouver, ISO, and other styles
43

Mahbub, Mahir, Suravi Akhter, Ahmedul Kabir, and Zerina Begum. "Context-based Bengali Next Word Prediction: A Comparative Study of Different Embedding Methods." Dhaka University Journal of Applied Science and Engineering 7, no. 2 (2023): 8–15. http://dx.doi.org/10.3329/dujase.v7i2.65088.

Full text
Abstract:
Next word prediction is a helpful feature for various typing subsystems. It is also convenient to have suggestions while typing to speed up the writing of digital documents. Therefore, researchers have long been trying to enhance the capability of such prediction systems. Knowledge of the inner meaning of words, along with a contextual understanding of the sequence, can enhance next word prediction. With the advancement of Natural Language Processing (NLP), these ideas have proven applicable in real scenarios: word embeddings can capture various relations among words and express their inner meaning, while sequence modeling can capture contextual information. In this paper, we investigate which embedding method works best for Bengali next word prediction. The embeddings compared are word2vec skip-gram, word2vec CBOW, fastText skip-gram, and fastText CBOW. We applied them in a deep learning sequential model based on LSTM, trained on a large corpus of Bengali texts. The results reveal useful insights about contextual and sequential information gathering that will help to implement a context-based Bengali next word prediction system.
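For illustration, an LSTM next-word model over pre-trained embeddings might be sketched as below, assuming tf.keras; the vocabulary size and the random matrix standing in for skip-gram/CBOW vectors are placeholders:

```python
import numpy as np
from tensorflow.keras import layers, models

vocab_size, dim, seq_len = 2000, 100, 5
emb_matrix = np.random.rand(vocab_size, dim)  # stand-in for skip-gram/CBOW vectors

model = models.Sequential([
    layers.Embedding(vocab_size, dim),
    layers.LSTM(128),
    layers.Dense(vocab_size, activation="softmax"),  # distribution over next word
])
model.build(input_shape=(None, seq_len))
model.layers[0].set_weights([emb_matrix])  # inject the pre-trained embeddings
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Training data: X of shape (n, seq_len) token windows, y the following word's index.
```

Swapping `emb_matrix` among the four embedding variants reproduces the comparison the abstract describes.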
APA, Harvard, Vancouver, ISO, and other styles
44

Aphrodite, Tan Tamarine Myrna. "AUGMENTING ABUSIVE WORD IN SOCIAL MEDIA WITH WORD EMBEDDING." Proxies : Jurnal Informatika 8, no. 1 (2024): 34–43. http://dx.doi.org/10.24167/proxies.v8i1.12476.

Full text
Abstract:
The recent increase in the use of abusive language on social media is alarming. Many parties direct abusive words at an object, either an individual or a group; abusive words themselves can take the form of sexism, attacks on flaws or disabilities, and others. Activity on social media has become so negative that it does more harm than good. We use Word2Vec together with several algorithms to detect abusive words in hate speech on social media, to determine which algorithm so far works best in combination with Word2Vec. The dataset comes from Kaggle.com; for implementation, it goes through data preprocessing, with steps such as word embedding, so that the best possible results can be obtained. The final result of this project is presented as a confusion matrix: the calculated average F1 value is 86% and the accuracy rate is also 86%. With that result, we find that the most suitable algorithm for this dataset is XGBoost, but the algorithm most suitable in combination with Word2Vec is K-Nearest Neighbor.
APA, Harvard, Vancouver, ISO, and other styles
45

Lassner, David, Stephanie Brandl, Anne Baillot, and Shinichi Nakajima. "Domain-Specific Word Embeddings with Structure Prediction." Transactions of the Association for Computational Linguistics 11 (March 27, 2023): 320–35. http://dx.doi.org/10.1162/tacl_a_00538.

Full text
Abstract:
Complementary to finding good general word embeddings, an important question for representation learning is to find dynamic word embeddings, for example, across time or domain. Current methods do not offer a way to use or predict information on structure between sub-corpora, time, or domain, and dynamic embeddings can only be compared after post-alignment. We propose novel word embedding methods that provide general word representations for the whole corpus, domain-specific representations for each sub-corpus, sub-corpus structure, and embedding alignment simultaneously. We present an empirical evaluation on New York Times articles and two English Wikipedia datasets with articles on science and philosophy. Our method, called Word2Vec with Structure Prediction (W2VPred), provides better performance than baselines in terms of the general analogy tests, domain-specific analogy tests, and multiple specific word embedding evaluations, as well as structure prediction performance when no structure is given a priori. As a use case in the field of Digital Humanities we demonstrate how to raise novel research questions for high literature from the German Text Archive.
APA, Harvard, Vancouver, ISO, and other styles
46

Anil, Kumar Jadon, and Suresh Kumar. "Emotion detection using Word2Vec and convolution neural networks." Indonesian Journal of Electrical Engineering and Computer Science 33, no. 3 (2024): 1812–19. https://doi.org/10.11591/ijeecs.v33.i3.pp1812-1819.

Full text
Abstract:
Emotion detection from text plays a critical role in many domains, including customer service, social media analysis, healthcare, financial services, education, human-computer interaction, and psychology. Deep learning techniques have become popular due to their ability to capture complex patterns and insights inherent in raw data. In this paper, we use the Word2Vec embedding approach, which takes care of the semantic and contextual understanding of text, making emotion detection more realistic. These embeddings act as input to a convolutional neural network (CNN) that captures insights using feature maps. The Word2Vec and CNN models applied to the International Survey on Emotion Antecedents and Reactions (ISEAR) dataset outperform models in the literature in terms of accuracy and F1-score as model evaluation metrics. The proposed approach not only obtains high accuracy in emotion detection tasks but also generates interpretable representations that contribute to the understanding of emotions in textual data. These findings carry significant implications for applications in diverse domains, such as social media analysis, market research, clinical assessment and counseling, and tailored recommendation systems.
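A minimal sketch of a Word2Vec-plus-CNN classifier of this kind, assuming tf.keras; the random embedding matrix and input length are placeholders, and seven output classes mirror ISEAR's emotion labels:

```python
import numpy as np
from tensorflow.keras import layers, models

vocab_size, dim, n_classes = 10000, 100, 7    # ISEAR covers seven emotions
emb_matrix = np.random.rand(vocab_size, dim)  # placeholder Word2Vec matrix

model = models.Sequential([
    layers.Embedding(vocab_size, dim),
    layers.Conv1D(128, kernel_size=3, activation="relu"),  # n-gram feature maps
    layers.GlobalMaxPooling1D(),
    layers.Dense(n_classes, activation="softmax"),
])
model.build(input_shape=(None, 60))        # e.g. 60-token padded inputs
model.layers[0].set_weights([emb_matrix])  # load the Word2Vec vectors
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The convolution slides over the embedded token sequence, so each filter acts as a learned n-gram detector, which is what the abstract calls capturing insights via feature maps.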
APA, Harvard, Vancouver, ISO, and other styles
47

Kadermyatova, Leysan Maratovna, and Elena Victorovna Tutubalina. "Analysis of Word Embeddings for Semantic Role Labeling of Russian Texts." Russian Digital Libraries Journal 23, no. 5 (2020): 1026–43. http://dx.doi.org/10.26907/1562-5419-2020-23-5-1026-1043.

Full text
Abstract:
Currently, there are a huge number of works dedicated to semantic role labeling of English texts [1–3]. However, semantic role labeling of Russian texts remained an unexplored area for many years due to the lack of training and test corpora; it became widespread after the appearance of the FrameBank corpus [4]. In this work, we analyzed the influence of word embedding models on the quality of semantic role labeling of Russian texts. Micro- and macro-F1 scores for the word2vec [5], fastText [6], and ELMo [7] embedding models were calculated. The set of experiments has shown that fastText models performed on average slightly better than word2vec models when applied to the Russian FrameBank corpus. Higher micro- and macro-F1 scores were obtained with the deep contextualized word representation model ELMo than with the classical shallow embedding models.
APA, Harvard, Vancouver, ISO, and other styles
48

Nadia Ristya Dewi, Eva Yulia Puspaningrum, and Hendra Maulana. "Analisis Sentimen Tweet Vaksinasi Covid-19 Menggunakan RNN Dengan Metode TF-IDF Dan Word2Vec." Jurnal Informatika dan Sistem Informasi 3, no. 1 (2022): 56–65. http://dx.doi.org/10.33005/jifosi.v3i1.449.

Full text
Abstract:
Social media is a virtual communication medium that is now widely used. Social media contains a great deal of information carrying opinions, which often creates a gap between the information being discussed and the information as interpreted, along with the widespread circulation of false information (hoaxes), so the sentiment being conveyed cannot truly be understood; in the last two years especially, accurate information on health topics has been badly needed. This study aims to determine the accuracy of machine learning experiments applying the TF-IDF and Word2Vec word embedding methods to sentiment analysis of tweets on the topic of Covid-19 vaccination. Both word embedding methods were applied to a Recurrent Neural Network (RNN) algorithm using 6,490 tweets, split 7:3 into training and testing data. The experiment using RNN-Word2Vec yielded an accuracy of 51.71%, while the experiment using RNN-TF-IDF yielded an accuracy of 50.73%.
APA, Harvard, Vancouver, ISO, and other styles
49

Desai, Antonio, Aurora Zumbo, Mauro Giordano, et al. "Word2vec Word Embedding-Based Artificial Intelligence Model in the Triage of Patients with Suspected Diagnosis of Major Ischemic Stroke: A Feasibility Study." International Journal of Environmental Research and Public Health 19, no. 22 (2022): 15295. http://dx.doi.org/10.3390/ijerph192215295.

Full text
Abstract:
Background: The possible benefits of using semantic language models in the early diagnosis of major ischemic stroke (MIS) based on artificial intelligence (AI) are still underestimated. The present study strives to assay the feasibility of the word2vec word embedding-based model in decreasing the risk of false negatives during the triage of patients with suspected MIS in the emergency department (ED). Methods: The main ICD-9 codes related to MIS were used for the 7-year retrospective data collection of patients managed at the ED with a suspected diagnosis of stroke. The data underwent “tokenization” and “lemmatization”. The word2vec word-embedding algorithm was used for text data vectorization. Results: Out of 648 MIS, the word2vec algorithm successfully identified 83.9% of them, with an area under the curve of 93.1%. Conclusions: Natural language processing (NLP)-based models in triage have the potential to improve the early detection of MIS and to actively support the clinical staff.
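For illustration, the tokenization, lemmatization, and word2vec vectorization steps might look like this sketch, assuming gensim and NLTK; the note text is an invented English stand-in for real emergency department records:

```python
import nltk
from nltk.stem import WordNetLemmatizer
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec

nltk.download("wordnet", quiet=True)  # lexicon needed by the lemmatizer

notes = ["Patient presented with sudden facial drooping and slurred speech."]
lemmatizer = WordNetLemmatizer()

# Tokenize each triage note, then lemmatize every token.
tokenized = [[lemmatizer.lemmatize(tok) for tok in simple_preprocess(n)]
             for n in notes]

w2v = Word2Vec(tokenized, vector_size=50, min_count=1)  # vectors for the triage model
```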
APA, Harvard, Vancouver, ISO, and other styles
50

S, Sushma, Sasmita Kumari Nayak, and M. Vamsi Krishna. "Enhanced toxic comment detection model through Deep Learning models using Word embeddings and transformer architectures." Future Technology 4, no. 3 (2025): 76–84. https://doi.org/10.55670/fpll.futech.4.3.8.

Full text
Abstract:
The proliferation of harmful and toxic comments on social media platforms necessitates the development of robust methods for automatically detecting and classifying such content. This paper investigates the application of natural language processing (NLP) and ML techniques for toxic comment classification using the Jigsaw Toxic Comment Dataset. Several deep learning models, including recurrent neural networks (RNN, LSTM, and GRU), are evaluated in combination with feature extraction methods such as TF-IDF, Word2Vec, and BERT embeddings. The text data is pre-processed using both Word2Vec and TF-IDF techniques for feature extraction. Rather than implementing a combined ensemble output, the study conducts a comparative evaluation of model-embedding combinations to determine the most effective pairings. Results indicate that integrating BERT with traditional models (RNN+BERT, LSTM+BERT, GRU+BERT) leads to significant improvements in classification accuracy, precision, recall, and F1-score, demonstrating the effectiveness of BERT embeddings in capturing nuanced text features. Among all configurations, LSTM combined with Word2Vec and LSTM with BERT yielded the highest performance. This comparative approach highlights the potential of combining classical recurrent models with transformer-based embeddings as a promising direction for detecting toxic comments. The findings of this work provide valuable insights into leveraging deep learning techniques for toxic comment detection, suggesting future directions for refining such models in real-world applications.
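As a hedged illustration of pairing BERT embeddings with a recurrent model, here is a sketch assuming Hugging Face transformers and PyTorch; the classifier head and example comments are placeholders, and the six outputs mirror the Jigsaw dataset's toxicity labels:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

lstm = torch.nn.LSTM(input_size=768, hidden_size=128, batch_first=True)
head = torch.nn.Linear(128, 6)  # six Jigsaw toxicity labels

batch = tok(["you are awful", "have a nice day"],
            padding=True, return_tensors="pt")
with torch.no_grad():                      # BERT kept frozen as a feature extractor
    emb = bert(**batch).last_hidden_state  # shape: (batch, seq_len, 768)

_, (h, _) = lstm(emb)
logits = head(h[-1])  # per-comment scores; apply sigmoid for multi-label output
```

Replacing `AutoModel` embeddings with a fixed Word2Vec or TF-IDF matrix gives the other model-embedding pairings the study compares.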
APA, Harvard, Vancouver, ISO, and other styles