
Journal articles on the topic 'And Web Scrapper'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 journal articles for your research on the topic 'And Web Scrapper.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Muthee, Mutwiri George, Mutua Makau, and Omamo Amos. "SwaRegex: a lexical transducer for the morphological segmentation of swahili verbs." African Journal of Science, Technology and Social Sciences 1, no. 2 (2022): 77–84. http://dx.doi.org/10.58506/ajstss.v1i2.119.

Full text
Abstract:
The morphological syntax of the Swahili verb comprises 10 slots. In this work, we present SwaRegex, a novel rule-based model for the morphological segmentation of Swahili verbs. This model is designed as a lexical transducer, which accepts a verb as an input string and outputs the morphological slot occupied by each morpheme in the input string. SwaRegex is based on regular expressions developed using the C# programming language. To test the model, we designed a web scraper that obtained verbs from an online Swahili dictionary. The scraper separated the corpus into two datasets: dataset A, comprising 163 verbs of Bantu origin; and dataset B, containing the entire set of 715 non-Arabic verb entries obtained by the web scraper. The performance of the model was tested against a similar model designed using the Xerox Finite State Tools (XFST). The regular expressions used in both models were the same. SwaRegex outperformed the XFST model on both datasets, achieving a 98.77% accuracy on dataset A, better than the XFST model by 41.1%, and a 68.67% accuracy on dataset B, better than the XFST model by 38.46%. This work benefits prospective learners of Swahili by helping them understand the syntax of Swahili verbs and is an integral teaching aid for the language. Search engines will benefit from the lexical transducer by leveraging its finite state network when lemmatizing search terms. This work also opens opportunities for further research on Swahili.
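Below is a toy Python sketch of the regex-as-transducer idea the abstract describes: named groups stand for morphological slots, and a match reports which slot each morpheme occupies. The slot inventory, morpheme lists, and example verbs are illustrative only and far smaller than SwaRegex's actual C# rule set.

```python
import re

# Toy slot-based segmentation with named groups. This is NOT SwaRegex itself;
# it only illustrates how a regex can act as a simple lexical transducer.
VERB_PATTERN = re.compile(
    r"^(?P<subject>ni|u|a|tu|m|wa)"   # subject prefix slot
    r"(?P<tense>na|li|ta|me)"         # tense/aspect marker slot
    r"(?P<root>\w+?)"                 # verb root slot
    r"(?P<final>a|e|i)$"              # final vowel slot
)

def segment(verb):
    """Return the slot occupied by each morpheme, or None if no match."""
    m = VERB_PATTERN.match(verb)
    return m.groupdict() if m else None

print(segment("ninasoma"))   # {'subject': 'ni', 'tense': 'na', 'root': 'som', 'final': 'a'}
print(segment("walicheza"))  # {'subject': 'wa', 'tense': 'li', 'root': 'chez', 'final': 'a'}
```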
APA, Harvard, Vancouver, ISO, and other styles
2

Onoma, Paul Avweresuo, Joy Agboi, Victor Ochuko Geteloma, et al. "Investigating an Anomaly-based Intrusion Detection via Tree-based Adaptive Boosting Ensemble." Journal of Fuzzy Systems and Control 3, no. 1 (2025): 90–97. https://doi.org/10.59247/jfsc.v3i1.279.

Full text
Abstract:
The eased accessibility, mobility, and portability of smartphones have caused a consequent rise in users' vulnerability to a variety of phishing attacks. Some users are more vulnerable due to factors such as personality and behavioral traits, media presence, and other factors. Our study seeks to reveal the cues utilized by successful attacks by identifying web content as genuine or malicious. We explore a sentiment-based extreme gradient boost learner with data collected over social platforms, scraped using the Python Google Scrapper. Our results show that AdaBoost yields a prediction accuracy of 0.9989, correctly classifying 2,148 cases and misclassifying 25. The results show the tree-based AdaBoost ensemble can effectively identify phishing cues and efficiently classify phishing lures, protecting unsuspecting users from access to malicious content.
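As a rough illustration of the tree-based boosting setup the abstract describes, the sketch below trains an AdaBoost ensemble on TF-IDF features from a handful of placeholder texts; the data, labels, and hyperparameters are assumptions, not the study's.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Placeholder scraped post texts and labels (1 = phishing lure, 0 = genuine)
texts = [
    "urgent: verify your account now at this link",
    "congratulations, you won a prize, claim here",
    "minutes of yesterday's project meeting attached",
    "lecture notes for next week are on the portal",
]
labels = [1, 1, 0, 0]

# TF-IDF features feeding a boosted ensemble of decision stumps (the sklearn default)
model = make_pipeline(TfidfVectorizer(), AdaBoostClassifier(n_estimators=100, random_state=42))
model.fit(texts, labels)

print(model.predict(["click this link to claim your free reward"]))
```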
APA, Harvard, Vancouver, ISO, and other styles
3

Okpor, Margaret Dumebi, Fidelis Obukohwo Aghware, Maureen Ifeanyi Akazue, et al. "Pilot Study on Enhanced Detection of Cues over Malicious Sites Using Data Balancing on the Random Forest Ensemble." Journal of Future Artificial Intelligence and Technologies 1, no. 2 (2024): 109–23. http://dx.doi.org/10.62411/faith.2024-14.

Full text
Abstract:
The frontiers of the digital revolution have rippled across society today, with various web content shared online as users seek to promote monetization and asset exchange, and with clients constantly seeking improved alternatives at lower costs to meet their value demands. From item upgrades to their replacement, businesses adopt retention strategies to help curb the challenge of customer attrition. The birth of smartphones has brought feats such as mobility, ease of accessibility, and portability, which in turn have continued to drive their adoption while exposing user devices to vulnerability, as they are quite susceptible to phishing. With some users classified as more susceptible than others due to online presence and personality traits, studies have sought to reveal the lures/cues exploited by adversaries to enhance phishing success and to classify web content as genuine or malicious. Our study explores the tree-based Random Forest to effectively identify phishing cues via sentiment analysis on phishing website datasets scraped from user accounts on social network sites. The dataset is scraped via the Python Google Scrapper and divided into train/test subsets to classify content as genuine or malicious, with data balancing and feature selection techniques applied. With Random Forest as the machine learning model of choice, the results show the ensemble yields a prediction accuracy of 97% with an F1-score of 98.19%, correctly classifying 2,089 instances and misclassifying 85 instances in the test dataset.
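A minimal sketch of such a pipeline follows, combining TF-IDF features, chi-squared feature selection, and a class-balanced Random Forest; the texts, labels, and the specific balancing choice (class weights rather than resampling) are assumptions for illustration, not the paper's exact configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline

# Placeholder scraped site/post texts, 1 = malicious, 0 = genuine
texts = [
    "your password expires today, log in here immediately",
    "limited offer: confirm your bank details to win",
    "team outing photos are now in the shared folder",
    "the quarterly report draft is ready for review",
]
labels = [1, 1, 0, 0]

# Chi-squared feature selection plus class_weight='balanced' as one simple
# way to counter class imbalance; resampling (e.g. SMOTE) is another option.
model = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=10),
    RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42),
)
model.fit(texts, labels)
print(model.predict(["verify your details to keep your account active"]))
```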
APA, Harvard, Vancouver, ISO, and other styles
4

Ramadandi, Rizki, Novi Yusliani, Osvari Arsalan, Rizki Kurniati, and Rahmat Fadli Isnanto. "Pemodelan Topik Menggunakan Metode Latent Dirichlet Allocation dan Gibbs Sampling." Generic 14, no. 2 (2022): 74–79. http://dx.doi.org/10.18495/generic.v14i2.142.

Full text
Abstract:
Topic modeling is a tool used to discover latent topics in a collection of documents. In this study, topic modeling was carried out using the Latent Dirichlet Allocation method and Gibbs Sampling. Six Indonesian-language news articles were collected from the detiknews news portal using a web scraper. The news articles were divided into two main categories: drugs and COVID-19. The LDA model was analyzed using the topic coherence method with the UCI score, and the results show that five optimal topics were obtained in both test configurations.
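As one possible way to reproduce this kind of analysis, the sketch below fits an LDA model and computes a UCI coherence score with the gensim library; the toy tokenised documents are placeholders, and gensim's default variational inference stands in for the Gibbs sampling estimator used in the paper.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Placeholder tokenised documents standing in for the scraped news articles
docs = [
    ["police", "seize", "drugs", "network", "arrest"],
    ["drugs", "smuggling", "suspect", "arrest", "court"],
    ["covid", "vaccine", "hospital", "patients", "cases"],
    ["covid", "cases", "rise", "government", "restrictions"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit LDA (gensim uses variational inference; Gibbs sampling is a different
# estimator for the same model)
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20, random_state=42)

# UCI coherence score, the criterion the abstract uses to pick the topic count
coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_uci")
print(coherence.get_coherence())
```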
APA, Harvard, Vancouver, ISO, and other styles
5

Gultom, Edra Arkananta, Nurafni Eltivia, and Nur Indah Riwajanti. "Shares Price Forecasting Using Simple Moving Average Method and Web Scrapping." Journal of Applied Business, Taxation and Economics Research 2, no. 3 (2023): 288–97. http://dx.doi.org/10.54408/jabter.v2i3.164.

Full text
Abstract:
The fluctuation of share prices in a secondary market allows investors/traders to gain profits through the difference in share prices (capital gain). In order to obtain these benefits, it is necessary to analyze shares before buying them, through fundamental and technical analysis. One of several methods in technical analysis is the Simple Moving Average method. This method can predict (forecast) share prices by calculating the moving average of the share price history. Historical share prices can be obtained in real time using a web scraper, so the results are obtained more quickly and accurately. Using the MAPE (Mean Absolute Percent Error) method, the level of forecasting accuracy can be calculated. As a result, the program ran successfully and displayed the forecast values and the level of accuracy for all of the data tested in LQ45. Moreover, forecasting with a value of N = 5 has the highest level of accuracy, reaching 97.6%, while the lowest uses the value of N = 30, at 95.0%.
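The core calculation is easy to sketch: a rolling mean of the previous N closing prices as the forecast, and MAPE to score it. The prices below are placeholders, not LQ45 data.

```python
import pandas as pd

# Placeholder closing prices standing in for scraped LQ45 price history
prices = pd.Series([4150, 4180, 4165, 4200, 4230, 4215, 4250, 4275, 4260, 4300])

N = 5  # window size; the abstract reports N = 5 as the most accurate
forecast = prices.rolling(window=N).mean().shift(1)  # SMA of the previous N days

# MAPE over the days where a forecast exists
valid = forecast.notna()
mape = ((prices[valid] - forecast[valid]).abs() / prices[valid]).mean() * 100
print(f"MAPE: {mape:.2f}%  accuracy: {100 - mape:.2f}%")
```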
APA, Harvard, Vancouver, ISO, and other styles
6

Anggraeni, Dessy Tri. "FORECASTING HARGA SAHAM MENGGUNAKAN METODE SIMPLE MOVING AVERAGE DAN WEB SCRAPPING." Jurnal Ilmiah Matrik 21, no. 3 (2019): 234–41. http://dx.doi.org/10.33557/jurnalmatrik.v21i3.726.

Full text
Abstract:
The fluctuation of stock prices in a secondary market provides the possibility for investors/traders to gain profits through the difference in stock prices (capital gain). In order to obtain these benefits, it is necessary to analyze stocks before buying shares, through fundamental and technical analysis. One of several methods in technical analysis is the Simple Moving Average method. This method can be used to predict (forecast) stock prices by calculating the moving average of the stock price history. Historical stock prices can be obtained in real time using a web scraper, so the results are obtained more quickly and accurately. Using the MAPE (Mean Absolute Percent Error) method, the level of forecasting accuracy can be calculated. As a result, the program ran successfully and was able to display the forecast values and the level of accuracy for all of the data tested in LQ45. Moreover, forecasting with a value of N = 5 has the highest level of accuracy, reaching 97.6%, while the lowest uses the value of N = 30, at 95.0%.
APA, Harvard, Vancouver, ISO, and other styles
7

Anggraeni, Dessy Tri. "Peramalan Harga Saham Menggunakan Metode Autoregressive Dan Web Scrapping Pada Indeks Saham Lq45 Dengan Python." Rabit : Jurnal Teknologi dan Sistem Informasi Univrab 5, no. 2 (2020): 137–44. http://dx.doi.org/10.36341/rabit.v5i2.1401.

Full text
Abstract:
The stock exchange offers investors the possibility of making a profit (capital gain) or suffering a loss (capital loss) because share prices fluctuate. This uncertainty can be addressed by applying forecasting methods to predict future share prices. One forecasting method that can be used is the Autoregressive method, which uses past stock data to derive a formula for predicting future prices. Stock price history can be viewed in real time on several stock data provider sites, and these data can be retrieved automatically using a web scraping technique, so forecasting results can be obtained more quickly, easily, and accurately. Forecasting accuracy is measured using the MAPE (Mean Absolute Percent Error) method, chosen because it is easier for lay users to understand. As a result, the forecasting application is able to display stock price predictions along with their accuracy. The data tested in this study were all LQ45 stocks. The average accuracy obtained was 94.62%; the highest accuracy was for the issuer BKSL, at 99.92%, and the lowest was for the issuer ASRI, at 90.13%.
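A minimal sketch of an autoregressive forecast evaluated with MAPE is shown below using statsmodels; the price series, lag order, and train/test split are placeholder assumptions rather than the paper's setup.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

# Placeholder daily closing prices standing in for one LQ45 ticker
prices = pd.Series([1010.0, 1022.0, 1018.0, 1035.0, 1040.0, 1028.0, 1045.0, 1052.0, 1049.0, 1060.0])
train, test = prices[:-3], prices[-3:]

# Fit an autoregressive model on the price history and forecast the held-out days
model = AutoReg(train, lags=2).fit()
pred = model.predict(start=len(train), end=len(prices) - 1)

mape = np.mean(np.abs((test.values - pred.values) / test.values)) * 100
print(f"MAPE: {mape:.2f}%  accuracy: {100 - mape:.2f}%")
```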
APA, Harvard, Vancouver, ISO, and other styles
8

Prestianta, Albertus Magnus. "Mapping the ASEAN YouTube Uploaders." Jurnal ASPIKOM 6, no. 1 (2021): 1. http://dx.doi.org/10.24329/aspikom.v6i1.761.

Full text
Abstract:
YouTube can now be categorized as mainstream media. It can be seen as a disruptive force in business and society, particularly concerning young people. There have been several recent studies about YouTube, providing essential insights on YouTube videos, viewers, social behavior, video traffic, and recommendation systems. However, little research has been done on YouTube uploaders, especially uploaders from ASEAN countries. Using a combination of web content mining and content analysis, this paper reviews 600 YouTube uploaders using data on the top 100 favorite YouTube uploaders in six ASEAN countries (Indonesia, Singapore, Malaysia, Thailand, Vietnam, and the Philippines), retrieved from NoxInfluencer. The study aims to provide a wider picture of the characteristics of YouTube uploaders from six ASEAN countries. This study also provides useful information about how to automatically retrieve web documents using Google Web Scrapper. The study found that the entertainment category dominated the top 100 positions of the NoxInfluencer ranking. In almost every country analyzed, channels related to news and politics are less attractive to YouTube users. For YouTube uploaders, YouTube can be a potential revenue source through advertising or in collaboration with specific brands. Through the analysis, we discovered that engagement, in the form of likes, dislikes, and comments, is the critical factor in generating income.
APA, Harvard, Vancouver, ISO, and other styles
9

Divyam, Pithawa, Nahar Sarthak, Sharma Shivam, and Nikhil Chaturvedi Er. "Data Set of AI Jobs." Advancement of Computer Technology and its Applications 5, no. 3 (2022): 1–7. https://doi.org/10.5281/zenodo.7330062.

Full text
Abstract:
The automated, targeted extraction of information from websites is known as web scraping. The similar technology used by search engines is known as "web crawling." Although manual data collection is possible, automation is frequently faster, more efficient, and less prone to mistakes. Online job portals frequently collect a substantial amount of data in the form of resumes and job openings, which may be a useful source of knowledge on the features of market demand. Web scraping may be divided into three steps: the web scraper finds the needed links on the internet; the data is then scraped from the source links; and finally, the data is written to a CSV file. The Python language is used for the scrape. As part of the job series of datasets, this dataset can be helpful for finding a job as an AI engineer!
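A minimal sketch of those three steps with requests and BeautifulSoup is shown below; the URL, CSS selectors, and field names are placeholders, since a real job portal's markup would have to be inspected first.

```python
import csv
import requests
from bs4 import BeautifulSoup

# Placeholder URL and selectors: a real job portal needs its own markup inspected
BASE_URL = "https://example.com/jobs?q=ai-engineer"

# Step 1: fetch the listing page and collect the links to individual postings
soup = BeautifulSoup(requests.get(BASE_URL, timeout=30).text, "html.parser")
links = [a["href"] for a in soup.select("a.job-link")]

# Step 2: scrape each posting's title and company
rows = []
for url in links:
    page = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    rows.append({
        "title": page.select_one("h1.title").get_text(strip=True),
        "company": page.select_one("span.company").get_text(strip=True),
        "url": url,
    })

# Step 3: write the results to a CSV file
with open("ai_jobs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "company", "url"])
    writer.writeheader()
    writer.writerows(rows)
```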
APA, Harvard, Vancouver, ISO, and other styles
10

Wijaya, Arie, and Prihandoko. "ANALISIS SENTIMEN REVIEW PENGGUNA APLIKASI DEPOK SINGLE WINDOW DI GOOGLE PLAY MENGGUNAKAN ALGORITMA SUPPORT VECTOR MACHINE." Jurnal Ilmiah Informatika Komputer 28, no. 1 (2023): 77–87. http://dx.doi.org/10.35760/ik.2023.v28i1.7902.

Full text
Abstract:
Technology is developing rapidly, including in the world of government. Local governments build web- or mobile-based applications with the aim of helping people obtain the services the community deserves. The Depok government created a mobile-based public service application called Depok Single Window. Given the importance of user reviews for the continuity of the DSW application, it is necessary to analyze the sentiment of reviews of the Depok Single Window application on the Google Play Store. Sentiment analysis is carried out using the Support Vector Machine. The data used in this study were 733 reviews obtained through scraping. The scraping was carried out using a Python library, google-play-scraper, to retrieve the data. The result of this research is an accuracy value of 89.23% for the sentiment analysis of the Depok Single Window application, which means the Support Vector Machine is well suited to classifying the Depok Single Window review data into positive, negative, and neutral.
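A condensed sketch of that pipeline is shown below: the google-play-scraper package's reviews helper pulls review text and star ratings, ratings are turned into sentiment labels, and a linear SVM is trained on TF-IDF features. The package id is a placeholder, and the rating-to-label rule and model settings are assumptions rather than the paper's exact choices.

```python
from google_play_scraper import Sort, reviews
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# The package id below is a placeholder; the real Depok Single Window id would go here
result, _ = reviews("com.example.dsw", lang="id", country="id",
                    sort=Sort.NEWEST, count=733)

texts = [r["content"] for r in result]
# Simple rating-based labelling (an assumption): >3 positive, 3 neutral, <3 negative
labels = ["positive" if r["score"] > 3 else "neutral" if r["score"] == 3 else "negative"
          for r in result]

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

vec = TfidfVectorizer()
clf = SVC(kernel="linear")
clf.fit(vec.fit_transform(X_train), y_train)
print(accuracy_score(y_test, clf.predict(vec.transform(X_test))))
```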
APA, Harvard, Vancouver, ISO, and other styles
11

Pereira, Brenda Braga, and Sangwoo Ha. "ENVIRONMENTAL ISSUES ON TIKTOK: TOPICS AND CLAIMS OF MISLEADING INFORMATION." Journal of Baltic Science Education 23, no. 1 (2024): 131–50. http://dx.doi.org/10.33225/jbse/24.23.131.

Full text
Abstract:
In light of the increasing frequency of misleading information on social media regarding environmental issues, this study aimed to identify misleading information spread through TikTok videos and to discuss why such content is considered misleading, drawing on relevant literature. Hashtags with large numbers of views, such as #climatechange, #sustainability, #pollution, #biodiversity, #environmentalprotection, #environmentalissues, #energysource, and #environmentalproblems, were used for data collection through a web scraper called Apify (https://apify.com/). A total of 29 misleading videos were found. Content analysis was applied to identify and classify the topics and misleading claims. The topics of the misleading videos, by frequency of mention, were energy sources, followed by climate change, pollution, biodiversity, and environmental degradation. Among the misleading claims, videos related to pyramids as non-pollutant power plants and conspiracies related to pollution exhibited the highest frequency. The results show various misleading claims in videos related to environmental topics. The study also emphasized the importance of science education in addressing misleading information and reinforced the importance of an interdisciplinary approach to environmental issues. Keywords: TikTok videos, misleading information, environmental issues, content analysis, science education
APA, Harvard, Vancouver, ISO, and other styles
12

Noor, Ibrahim Moge. "Sentiment Analysis on New Currency in Kenya Using Twitter Dataset." Proceeding International Conference on Science and Engineering 3 (April 30, 2020): 237–40. http://dx.doi.org/10.14421/icse.v3.503.

Full text
Abstract:
Social media sites have recently become popular; they clearly have a major influence on society, and almost one third of the world's population uses social media. They have become platforms where people express their feelings, share their ideas and wisdom, and give feedback on an event or a product, and new technology gives us an opportunity to analyse this content easily. Twitter is one of these sites, full of people's opinions, where one can track the sentiment expressed about different kinds of topics; instead of spending time and energy on long surveys, advances in sentiment analysis let us collect a huge amount of opinion data. Sentiment analysis is one of the most interesting research areas nowadays. In this paper, we focus on sentiment insight into the 2019 Kenya currency replacement. The Kenyan government announced that the country's currency was to be replaced with a new generation of banknotes and ordered citizens to return the old 1,000-shilling notes ($10) to banks by 1 October 2019, in a bid to fight corruption and money laundering. Kenyan citizens expressed their reactions to the new banknotes. We perform sentiment analysis of the tweets using the Multinomial Naïve Bayes algorithm, utilizing data from one social media platform, Twitter: 1,122 tweets collected during this period of demonetization using a web scraper with the help of Twitter advanced search.
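The classification step the abstract names is easy to sketch with scikit-learn: bag-of-words counts feeding a Multinomial Naïve Bayes model. The tweets and labels below are invented placeholders, not the collected dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder tweets and labels standing in for the 1,122 collected tweets
tweets = [
    "the new notes look great, good move against corruption",
    "happy to see action on money laundering",
    "this deadline is too short, people will lose their savings",
    "rushed demonetization will hurt small traders",
]
labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words counts feeding a Multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(tweets, labels)
print(model.predict(["great step by the central bank"]))
```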
APA, Harvard, Vancouver, ISO, and other styles
13

Marconi, Gabriele. "Content removal bias in web scraped data: A solution applied to real estate ads." Open Economics 5, no. 1 (2022): 30–42. http://dx.doi.org/10.1515/openec-2022-0119.

Full text
Abstract:
I propose a solution to content removal bias in statistics from web scraped data. Content removal bias occurs when data is removed from the web before a scraper is able to collect it. The solution I propose is based on inverse probability weights, derived from the parameters of a survival function with complex forms of data censoring. I apply this solution to the calculation of the proportion of newly built dwellings with web scraped data on Luxembourg, and I run a counterfactual experiment and a Monte Carlo simulation to confirm the findings. The results show that the extent of content removal bias is relatively small if the scraping occurs frequently compared with the online permanence of the data, and that it grows larger with less frequent scraping.
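To make the weighting idea concrete, the sketch below assumes, purely for illustration, an exponential survival function for how long an ad stays online and weights each observed ad by the inverse of its probability of still being online at scrape time; the paper instead estimates a survival model with complex censoring rather than assuming one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumption (not from the paper): online permanence follows an exponential
# survival function S(t) = exp(-t / mean_lifetime).
mean_lifetime_days = 30.0

# Placeholder scraped ads: days elapsed between posting and the scrape,
# and whether the ad concerns a newly built dwelling
days_before_scrape = rng.uniform(0, 14, size=200)
new_dwelling = rng.binomial(1, 0.3, size=200)

# Probability each ad survived long enough to be observed, and its inverse weight
p_observed = np.exp(-days_before_scrape / mean_lifetime_days)
weights = 1.0 / p_observed

print(f"unweighted share: {new_dwelling.mean():.3f}")
print(f"weighted share:   {np.average(new_dwelling, weights=weights):.3f}")
```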
APA, Harvard, Vancouver, ISO, and other styles
14

Mathew, Alex, Harish Balakrishnan, and Saravanan Palani. "Scrapple: a Flexible Framework to Develop Semi-Automatic Web Scrapers." International Review on Computers and Software (IRECOS) 10, no. 5 (2015): 475. http://dx.doi.org/10.15866/irecos.v10i5.5864.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Azhari, Nor Khadijah Mohd, Radziah Mahmud, and Noor Ashikin Basarudin. "Developing a Framework of Equity Crowdfunding for Small and Medium Enterprises (SMEs) in Malaysia: A Conceptual Paper." International Journal of Research and Innovation in Social Science VIII, no. X (2024): 1635–42. http://dx.doi.org/10.47772/ijriss.2024.8100142.

Full text
Abstract:
The Covid-19 pandemic has caused global economic turmoil, leading to a worldwide recession. This adversely affects the sustainability of Malaysia's small and medium enterprises (SMEs). Companies with less than three years of operations face various challenges tapping into funding facilities, as banks prioritise existing customers with good track records. An alternative avenue for funding and financing is necessary to survive in business. To gain a holistic understanding of equity crowdfunding (ECF), this study has four objectives: (1) to uncover the roles of ECF for SMEs in a time of crisis; (2) to investigate the crucial factors influencing the success of crowdfunding campaigns; (3) to explore the role of regulation in protecting investors and issuers in equity crowdfunding; and (4) to develop a framework for fundraisers to ensure a successful campaign. This study will employ a mixed-method approach to achieve the objectives. Interviews will be conducted with registered ECF operators, and a questionnaire will be distributed to ECF fundraisers. Since different data sources will be used, data preparation is divided into web scraping and text extraction. The Python programming language and a scraping library, Beautiful Soup, will be used to obtain the companies' details, while text extraction will be performed on the PDF files using the Python library PyPDF2. It is hoped that the findings and proposed framework can assist the government and policymakers in enhancing successful applicant campaigns and improving the legal framework for investor protection. In line with the national agenda, namely Sustainable Development Goal (SDG) 8 and Strategic Thrust 1 of the Shared Prosperity Vision 2030, successful SMEs are vital as they can create job opportunities for the local community and foster economic growth through a larger pool of investments.
APA, Harvard, Vancouver, ISO, and other styles
16

Graciano, Helton Luiz dos Santos, and Rogério Aparecido Sá Ramalho. "ScraperCI: um web scraper para coleta de dados científicos." Encontros Bibli: revista eletrônica de biblioteconomia e ciência da informação 28 (May 17, 2023): 1–18. http://dx.doi.org/10.5007/1518-2924.2023.e92471.

Full text
Abstract:
Objective: The technological development of recent decades has driven the massive production of information resources and significant changes in data collection and management processes in virtually every field. This scenario is no different in science, where the proper collection and treatment of data has become a challenge for researchers. This research aimed to present a web scraper prototype, called ScraperCI, and to analyze the potential of using computational tools such as this for collecting from databases available on the web. Method: The research is characterized as applied, exploratory, and descriptive in nature, with a qualitative approach that seeks to identify the potential of using web scrapers in the data collection process. Results: It is concluded that the developed prototype enables considerable advances in the automation of scientific data collection and that such tools make it possible to automate retrieval processes, favoring greater productivity in the extraction of information resources on the web. Conclusions: It is hoped that this research can encourage information professionals to develop new skills and see innovative possibilities in their professional fields, playing a leading role in this interdisciplinary environment.
APA, Harvard, Vancouver, ISO, and other styles
17

Park, Youngki, and Youhyun Shin. "Novel Scratch Programming Blocks for Web Scraping." Electronics 11, no. 16 (2022): 2584. http://dx.doi.org/10.3390/electronics11162584.

Full text
Abstract:
Although Scratch is the most widely used block-based educational programming language, it is not easy for students to create various types of Scratch programs based on real-life data because it does not provide web scraping capabilities. In this paper, we present novel Scratch blocks for web scraping. Using these blocks, students can not only scrape the contents of HTML elements in a web page by using CSS selectors but also automate their keyboard and mouse in a number of ways, such as by using XPaths, the coordinates of the mouse, input strings, keys, or hot keys. We also present file access blocks that allow students to easily store and retrieve the scraped data in the form of key–value pairs. We conducted two lectures for a total of 15 primary/secondary school (K-12) teachers, allowing them to make ten web scraping example applications. As a result of a survey of the teachers, the proposed web scraping blocks achieved high scores for all evaluation measures.
APA, Harvard, Vancouver, ISO, and other styles
18

Ravikiran, Sanga, Valasa Kavitha, Mohammad Ismail, and Pallinti Ramalakshmi. "PYTHON-POWERED DATA ANALYSIS THROUGH WEB SCRAPING." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 10, no. 3 (2019): 1696–703. http://dx.doi.org/10.61841/turcomat.v10i3.14667.

Full text
Abstract:
Standard information investigations, such as the rational approach to developing extrapolation tests and small-scale, subjective, and quantitative examinations, are founded on cause-and-effect relationships. The contrast between the web scraper's methods and techniques explains how the scraper operates. The three steps of the process are as follows: first, the web scraper gathers the necessary links from the internet; second, the data is extracted from the source links; and third, the data is saved in a CSV file. The task is performed using the Python programming language. With this, together with knowledge of the relevant libraries and practical know-how, we can obtain the scraper we need. Python's extensive community, library resources, and elegant coding style make it an ideal language for extracting the required data from the target website.
APA, Harvard, Vancouver, ISO, and other styles
19

Eyzenakh, Denis, Anton Rameykov, and Igor Nikiforov. "High Performance Distributed Web-Scraper." Proceedings of the Institute for System Programming of the RAS 33, no. 3 (2021): 87–100. http://dx.doi.org/10.15514/ispras-2021-33(3)-7.

Full text
Abstract:
Over the past decade, the Internet has become a gigantic and rich source of data. The data is used for the extraction of knowledge by performing machine learning analysis. In order to perform data mining of web information, the data should be extracted from the source and placed in analytical storage. This is the ETL process. Different web sources provide different ways to access their data: either an API over the HTTP protocol or HTML source code parsing. The article is devoted to an approach for high-performance data extraction from sources that do not provide an API to access the data. Distinctive features of the proposed approach are: load balancing, two levels of data storage, and separating the process of downloading files from the process of scraping. The approach is implemented in a solution with the following technologies: Docker, Kubernetes, Scrapy, Python, MongoDB, Redis Cluster, and CephFS. The results of testing the solution are described in this article as well.
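For context, a minimal Scrapy spider of the kind such a distributed setup would run in each worker is sketched below; the target site is a public scraping sandbox used here only as a placeholder, and none of the orchestration (Docker, Kubernetes, Redis, MongoDB) from the paper is shown.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Minimal spider: run with `scrapy runspider spider.py -o out.json`."""
    name = "quotes"
    # Placeholder sandbox site; any HTML source without an API could be targeted instead
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract items from the current page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination so the crawl continues across pages
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```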
APA, Harvard, Vancouver, ISO, and other styles
20

Swami, Shridevi A., and Pujashree S. Vidap. "Towards Automatic Web Data Scraper and Aligner (WDSA)." INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY 13, no. 3 (2014): 4308–18. http://dx.doi.org/10.24297/ijct.v13i3.2762.

Full text
Abstract:
The web is an immense and fast-emerging source of information. Web browsers, along with search engines, have become popular tools for retrieving and accessing the information present on the web. The enormous growth of the web has made data extraction from it harder than ever. This paper presents the Automatic Web Data Scraper and Aligner (WDSA). Automatic WDSA extracts the web data of interest present in a dynamically generated web page received from a search engine when a user issues a query. Automatic web data scraping is necessary because a human being can identify the query-relevant content in a query result page, but this is tricky for computer applications. The extracted web data can then be transferred into a format suitable for use in applications such as comparison shopping, data integration, and value-added services. WDSA does this by aligning the extracted web data pairwise as well as holistically in a table. The novel aspect of Automatic WDSA is that the Data Scraper and Aligner use a new approach that combines the similarity of both tags and values for the extraction and alignment process. The Data Scraper also handles data that appears in a non-contiguous fashion due to the presence of auxiliary information such as advertisement banners, navigational links, and pop-ups. Experimental results show that Automatic WDSA achieves high precision and recall. Automatic WDSA was further compared with existing widely used tools such as Helium Scraper, OutWit Hub, and Screen Scraper. During the comparison, we observed that manual labeling or extraction patterns for the desired data must be specified for the existing tools to work, while Automatic WDSA requires no user involvement, making it fully automatic.
APA, Harvard, Vancouver, ISO, and other styles
21

İnce, Ferhat, and Emircan Özdemir. "How Has COVID-19 Affected Airline Passenger Satisfaction? Evaluating The Passenger Satisfaction of European Short-Haul Low-Cost Carriers Pre- and Post-COVID-19." Eskişehir Osmangazi Üniversitesi Sosyal Bilimler Dergisi 25, no. 2 (2024): 482–507. http://dx.doi.org/10.17494/ogusbd.1473138.

Full text
Abstract:
This paper investigates whether there has been a change in passenger satisfaction drivers for the three largest short-haul low-cost carriers in Europe before and after COVID-19. User-generated content on the Skytrax platform was used as the data source for passenger satisfaction, and these secondary data were scraped using the Web Scraper tool. Binary logistic regression was used for the classification model related to passenger satisfaction, and ROC analysis was used to evaluate the classification performance of the model. The findings suggested that the service attributes of seat comfort, cabin staff services, and ground services are significant predictors of value for money, and the value for money is a significant determinant of overall satisfaction in both periods. Additionally, it was revealed that ground service is the most important determinant of the value for money perception. The results also indicate that in the post-COVID-19 period, the predictive power of seat comfort has decreased while the predictive power of ground services has increased.
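As a schematic of the modelling step, the sketch below fits a binary logistic regression on a few hypothetical service-attribute ratings and scores it with ROC AUC; the column names, values, and the use of in-sample AUC are illustrative assumptions, not the paper's variables or validation design.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Placeholder review ratings standing in for the scraped Skytrax data;
# column names are hypothetical, not the paper's exact variables.
df = pd.DataFrame({
    "seat_comfort":   [4, 2, 5, 1, 3, 2, 4, 1],
    "cabin_staff":    [5, 2, 4, 1, 4, 3, 5, 2],
    "ground_service": [4, 1, 5, 2, 3, 1, 4, 1],
    "recommended":    [1, 0, 1, 0, 1, 0, 1, 0],  # binary overall satisfaction
})

X, y = df[["seat_comfort", "cabin_staff", "ground_service"]], df["recommended"]

# Binary logistic regression and ROC-based evaluation of its classification power
model = LogisticRegression().fit(X, y)
print("coefficients:", dict(zip(X.columns, model.coef_[0].round(2))))
print("ROC AUC:", roc_auc_score(y, model.predict_proba(X)[:, 1]))
```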
APA, Harvard, Vancouver, ISO, and other styles
22

Mahmood, Yasir Ali, and Bassim Mahmood. "A Web Scraper for Data Mining Purposes." SISTEMASI 13, no. 3 (2024): 1243. http://dx.doi.org/10.32520/stmsi.v13i3.4107.

Full text
Abstract:
The current revolution in technology makes data a crucial part of real-life applications due to its importance in decision making. In the era of big data and the massive expansion of data streams on Internet networks and platforms, the process of data collection, mining, and analysis has become no easy matter. Therefore, auxiliary applications for data mining and gathering have become a necessity. Usually, companies offer special APIs to collect data from particular destinations, which comes at a high cost. Generally, there is a severe lack of approaches in the literature that offer flexible, low-cost, or free tools for web scraping. Hence, this article provides a free tool that can be used for data mining and data collection from the web. Specifically, an efficient Google Scholar web scraper is introduced. The extracted data can be used for analysis and for making decisions about an issue of interest. The proposed scraper can also be modified to crawl web links and retrieve specific data from a particular website. It can also formalize the collected data as a ready dataset to be used in the analysis phase. The efficiency of the proposed scraper is tested in terms of time consumption, accuracy, and quality of the data collected. The findings showed that the proposed approach is highly feasible for data collection and can be adopted by data analysts.
APA, Harvard, Vancouver, ISO, and other styles
23

Yashwanth M and Laxmi B Rananavare. "Fake News Identification for Web Scrapped Data." Journal of Advanced Zoology 44, S6 (2023): 971–77. http://dx.doi.org/10.17762/jaz.v44is6.2329.

Full text
Abstract:
The majority of people are affected by misleading stories spread through different posts on social media and forward them assuming they are facts. Nowadays, social media is used as a weapon to create havoc in society by spreading fake news. Such havoc can be controlled by using machine-learning algorithms. Various machine learning and deep learning techniques are used to identify false stories. There is a need to identify and control fake news posts, which have increased at an alarming rate. Here we use the Passive-Aggressive Classifier for fake news identification on two datasets: the Kaggle fake news dataset and a dynamically web-scraped dataset from the politifact.com website. We achieved 88.66% accuracy using the Passive-Aggressive Classifier.
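The named classifier is available in scikit-learn, so a minimal sketch of the approach is easy to give: TF-IDF features with a Passive-Aggressive linear classifier. The headlines and labels are invented placeholders standing in for the Kaggle and politifact.com data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Placeholder headlines standing in for the Kaggle and politifact.com data
headlines = [
    "scientists confirm new vaccine passes phase three trials",
    "government releases official budget figures for the year",
    "celebrity secretly replaced by clone, insiders claim",
    "miracle fruit cures every known disease overnight",
]
labels = ["REAL", "REAL", "FAKE", "FAKE"]

# TF-IDF features with a Passive-Aggressive linear classifier, as in the abstract
model = make_pipeline(TfidfVectorizer(stop_words="english"),
                      PassiveAggressiveClassifier(max_iter=1000))
model.fit(headlines, labels)
print(model.predict(["doctors shocked by this one weird trick that cures everything"]))
```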
APA, Harvard, Vancouver, ISO, and other styles
24

Nevgi, Shubham, Sahil Kadam, Sahil Haldankar, Sakshi Jadhav, and Prof Rashmi More. "AI-Powered Web Scraping and Parsing: A Browser Extension Using LLMs for Adaptive Data Extraction." INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT 09, no. 04 (2025): 1–9. https://doi.org/10.55041/ijsrem44113.

Full text
Abstract:
In the era of information overload, the need for extracting meaningful and structured data from unstructured web sources has grown significantly. Traditional web scraping tools often require significant manual effort to parse and format data, especially when dealing with complex or dynamic websites. To address this challenge, this project presents a Generative AI-based Web Scraping Browser Extension, an innovative tool that combines the power of browser automation, HTML parsing, and generative artificial intelligence to extract and interpret data intelligently. This browser extension allows users to input any URL and extract structured information from the web page using an intuitive interface. Unlike traditional scrapers that rely heavily on predefined rules or regular expressions, the system uses Generative AI models to understand the structure and context of web content. The backend, developed using FastAPI, integrates BeautifulSoup and Selenium for handling both static and dynamic web pages, while AI parsing is powered by transformer models (e.g., LLaMA 3.3). The data extracted can be visualized in a tabular format and downloaded in multiple formats, including CSV, JSON, XML, and Excel. One of the major highlights of the system is its ability to learn patterns from previously scraped data and intelligently adapt to new page layouts, significantly reducing the need for manual intervention. This enhances productivity and provides an accessible solution for both technical and non-technical users who need structured data for research, analytics, or business intelligence. Keywords— Web Scraping, Generative AI, Data Extraction, FastAPI, Selenium, AI Parsing, Browser Extension, Automation.
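To ground the architecture described here, a stripped-down sketch of the fetch-and-parse backend is given below using FastAPI and BeautifulSoup; the generative-AI parsing stage is reduced to a placeholder response, and the endpoint name and behaviour are assumptions rather than the project's actual API.

```python
# Minimal FastAPI + BeautifulSoup scaffolding of the fetch-and-parse step only;
# the LLM parsing stage described in the abstract is reduced to a placeholder.
import requests
from bs4 import BeautifulSoup
from fastapi import FastAPI

app = FastAPI()

@app.get("/scrape")
def scrape(url: str):
    # Fetch the page (Selenium would be needed for JavaScript-heavy pages)
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Placeholder "parsing": return title and visible text; a generative model
    # would instead turn this raw content into structured fields.
    return {
        "title": soup.title.string if soup.title else None,
        "text": soup.get_text(separator=" ", strip=True)[:1000],
    }

# Run with: uvicorn scraper_api:app --reload   (assuming this file is scraper_api.py)
```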
APA, Harvard, Vancouver, ISO, and other styles
25

Bradley, Alex, and Richard J. E. James. "Web Scraping Using R." Advances in Methods and Practices in Psychological Science 2, no. 3 (2019): 264–70. http://dx.doi.org/10.1177/2515245919859535.

Full text
Abstract:
The ubiquitous use of the Internet in daily life means that there are now large reservoirs of data that can provide fresh insights into human behavior. One of the key barriers preventing more researchers from utilizing online data is that they do not have the skills to access the data. This Tutorial addresses this gap by providing a practical guide to scraping online data using the popular statistical language R. Web scraping is the process of automatically collecting information from websites. Such information can take the form of numbers, text, images, or videos. This Tutorial shows readers how to download web pages, extract information from those pages, store the extracted information, and do so across multiple pages of a website. A website has been created to assist readers in learning how to web-scrape. This website contains a series of examples that illustrate how to scrape a single web page and how to scrape multiple web pages. The examples are accompanied by videos describing the processes involved and by exercises to help readers increase their knowledge and practice their skills. Example R scripts have been made available at the Open Science Framework.
APA, Harvard, Vancouver, ISO, and other styles
26

Fikri, Muhammad Ramadan, Rahmadya Trias Handayanto, and Dadan Irwan. "Web Scraping Situs Berita Menggunakan Bahasa Pemograman Python." Journal of Students‘ Research in Computer Science 3, no. 1 (2022): 123–36. http://dx.doi.org/10.31599/jsrcs.v3i1.1514.

Full text
Abstract:
Currently, the rapid development of technology drives innovation, one example being the technique of obtaining information from web portals, known as web scraping. This application meets data needs in the form of information, where information retrieved from sites is later used to observe behavior and perceptions in order to obtain the right market segmentation. Most data collection is currently still done manually; as a result, this method has several limitations, namely a lengthy data collection process that slows down market segment analysis, with the risk of not obtaining the right market segmentation. To solve this problem, a news-site web scraper is needed. In this study, a news-site web scraper was created using the Python programming language, with the Flask library used for the web scraping display. In addition, the Selenium library is used to simplify application creation, facilitate interaction with the web, and provide facilities to control a web browser. The program can retrieve data based on keywords, where the results include the title, posting date, and summary, and the collected data is automatically saved to a CSV file.
 Keywords: Internet, News, Python, Scraping, Website
 
APA, Harvard, Vancouver, ISO, and other styles
27

Sagade, Omkar Dhananjay, and Dhanuja Dhananjay Sagade. "Restaurant Data Scraper: An Automated Tool for Extracting Restaurant Information Using Python Html, CSS and Selenium." INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT 08, no. 11 (2024): 1–6. https://doi.org/10.55041/ijsrem39222.

Full text
Abstract:
In this paper, a web scraping solution for conveniently obtaining restaurant data from an HTML file, a Python-based Restaurant Data Scraper, is developed. BeautifulSoup is used for effective HTML parsing, and Selenium WebDriver is used to automate interactions with a web page. Python, HTML, CSS, Selenium, and BeautifulSoup are all used in the implementation. A structured CSV file containing the extracted data, such as restaurant names, ratings, descriptions, and addresses, allows for additional analysis. By offering a solution that can be modified for different restaurant listing formats, this project responds to the growing need for effective data extraction techniques from semi-structured HTML files. Scalable and UTF-8 encoded to handle special characters, the program provides a platform for more comprehensive data collection. Keywords: web scraping solution; Python-based Restaurant Data Scraper; CSV file; HTML parsing; data extraction techniques.
APA, Harvard, Vancouver, ISO, and other styles
28

Lutfi, Pratama Yogaswara, and Puspitarani Yan. "Implementation of Web Scraping in Inventory Management System for Drop-Shipping." International Journal of Innovative Science and Research Technology 7, no. 8 (2022): 1416–21. https://doi.org/10.5281/zenodo.7073437.

Full text
Abstract:
Drop-shipping is a simplified business model for order fulfilment, which allows retailers to sell products without keeping any physical inventory. The product order is sent to the supplier, who ships the order directly to the customer. Currently, around 27% of online retailers are turning to the drop-shipping business model as a method of fulfilling customer orders. Despite its advantages, the drop-shipping business model still has disadvantages, such as inventory problems. This paper proposes an approach to solve the inventory problem in drop-shipping by using a web scraper. The web scraper extracts product information from the supplier's website and stores it in a database. To manage the extracted product information and upload it to the marketplace, we propose an Inventory Management System, implemented as a web-based application. As a result, users can save time doing drop-shipping, without having to extract product information by visiting each product page, manage product stock, or upload to the marketplace manually.
APA, Harvard, Vancouver, ISO, and other styles
29

Kauffman, Stuart. "Innovation and The Evolution of the Economic Web." Entropy 21, no. 9 (2019): 864. http://dx.doi.org/10.3390/e21090864.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Hoarfrost, Adrienne, Nick Brown, C. Titus Brown, and Carol Arnosti. "Sequencing data discovery with MetaSeek." Bioinformatics 35, no. 22 (2019): 4857–59. http://dx.doi.org/10.1093/bioinformatics/btz499.

Full text
Abstract:
Summary: Sequencing data resources have increased exponentially in recent years, as has interest in large-scale meta-analyses of integrated next-generation sequencing datasets. However, curation of integrated datasets that match a user’s particular research priorities is currently a time-intensive and imprecise task. MetaSeek is a sequencing data discovery tool that enables users to flexibly search and filter on any metadata field to quickly find the sequencing datasets that meet their needs. MetaSeek automatically scrapes metadata from all publicly available datasets in the Sequence Read Archive, cleans and parses messy, user-provided metadata into a structured, standard-compliant database and predicts missing fields where possible. MetaSeek provides a web-based graphical user interface and interactive visualization dashboard, as well as a programmatic API to rapidly search, filter, visualize, save, share and download matching sequencing metadata. Availability and implementation: The MetaSeek online interface is available at https://www.metaseek.cloud/. The MetaSeek database can also be accessed via API to programmatically search, filter and download all metadata. MetaSeek source code, metadata scrapers and documents are available at https://github.com/MetaSeek-Sequencing-Data-Discovery/metaseek/.
APA, Harvard, Vancouver, ISO, and other styles
31

Zhekova, Mariya, and Emir Yumer. "JavaScript Web Scraping Tool for Extraction Information from Agriculture Websites." BIO Web of Conferences 102 (2024): 03008. http://dx.doi.org/10.1051/bioconf/202410203008.

Full text
Abstract:
Extracting information from an information platform, site, or system is possible if the information is structured or annotated in a way that is convenient for subsequent analysis and data processing, decision making, and reasoning. The goal of this paper is to review and categorize various techniques, tools, and libraries for extracting information from unstructured web content (platforms, sites, systems), and to develop a JavaScript application that crawls and extracts data from dynamic web pages without the need to browse, read, and search the page content. The paper presents an implementation of a particular JavaScript web scraper that retrieves a list of news headlines from the official European Union Agriculture and Rural Development website without the content of the document having to be read by users. The web scraper is configured to extract the searched content directly from the source HTML code of the document, regardless of whether the information is explicit or implicit. It also searches all pages related to the document and finally exports the data in a proper format. The benefits of such a tool for extracting web content from source code lie in saving time and manual labour and in providing a means of generating quality content in the biotech and agriculture industry.
APA, Harvard, Vancouver, ISO, and other styles
32

Ghosh Dastidar, Bhaskar, Devanjan Banerjee, and Subhabrata Sengupta. "An Intelligent Survey of Personalized Information Retrieval using Web Scraper." International Journal of Education and Management Engineering 6, no. 5 (2016): 24–31. http://dx.doi.org/10.5815/ijeme.2016.05.03.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Ashouri, Sajad, Arash Hajikhani, Arho Suominen, Lukas Pukelis, and Scott W. Cunningham. "Measuring digitalization at scale using web scraped data." Technological Forecasting and Social Change 207 (October 2024): 123618. http://dx.doi.org/10.1016/j.techfore.2024.123618.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Juszczak, Adam. "The use of web-scraped data to analyze the dynamics of footwear prices." Journal of Economics and Management 43 (2021): 251–69. http://dx.doi.org/10.22367/jem.2021.43.12.

Full text
Abstract:
Aim/purpose – Web scraping is a technique used to automatically extract data from websites. With the rise of online shopping, it allows the acquisition of information about the prices of goods sold by retailers such as supermarkets or internet shops. This study examines the possibility of using web-scraped data from one clothing store. It aims at comparing known price index formulas applied to the web-scraping case and verifying their sensitivity to the choice of data filter type. Design/methodology/approach – The author uses price data scraped from one of the biggest online shops in Poland. The data were obtained as part of the eCPI (electronic Consumer Price Index) project conducted by the National Bank of Poland. The author selected three types of products for this analysis – female ballerinas, male shoes, and male oxfords – to compare their prices over a period of more than one year. Six price indexes were used for the calculation – the Jevons and Dutot indexes with their chained and GEKS (an acronym from the names of the creators – Gini–Éltető–Köves–Szulc) versions. Apart from the analysis conducted on the full data set, the author introduced filters to remove outliers. Findings – Clothing and footwear are considered one of the most difficult groups of goods for measuring price change indexes due to high product churn, which undermines the possibility of using the traditional Jevons and Dutot indexes. However, it is possible to use chained indexes and GEKS indexes instead. Still, these indexes are fairly sensitive to large price changes. As observed for both product groups, the results provided by the GEKS and chained versions of the indexes differed, which could lead to the conclusion that even though they yield promising results, they may be better suited to other COICOP (Classification of Individual Consumption by Purpose) groups. Research implications/limitations – The findings of the paper showed that the use of filters did not significantly reduce the difference between price indexes based on the GEKS and chain formulas. Originality/value/contribution – The use of web-scraped data is a fairly new topic in the literature. Research on the possibility of using different price indexes provides useful insights for future use of these data by statistical offices. Keywords: inflation, CPI, web-scraping, online shopping, big data. JEL Classification: C43, C49.
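For reference, the two bilateral formulas named here are straightforward to compute; the sketch below evaluates the Jevons index (geometric mean of price relatives) and the Dutot index (ratio of mean prices) on placeholder matched prices.

```python
import numpy as np

# Matched prices for the same products in the base and comparison period
# (placeholder values; real web-scraped data would be matched by product id)
p0 = np.array([120.0, 89.0, 210.0, 75.0])   # base-period prices
p1 = np.array([130.0, 85.0, 220.0, 79.0])   # comparison-period prices

# Jevons index: geometric mean of price relatives
jevons = np.exp(np.mean(np.log(p1 / p0)))

# Dutot index: ratio of arithmetic mean prices
dutot = p1.mean() / p0.mean()

print(f"Jevons: {jevons:.4f}  Dutot: {dutot:.4f}")
# Chained and GEKS versions extend these bilateral formulas across many periods
# to cope with the product churn mentioned in the abstract.
```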
APA, Harvard, Vancouver, ISO, and other styles
35

Betesda, Betesda, Hari Purwanto, Hepi Nuryadi, Dimpo Sinaga, Sayyid Jamal Al Din, and Dhian Yusuf Al Afghani. "ANALISA SENTIMEN DATA ULASAN PADA GOOGLE PLAY DENGAN MENGGUNAKAN ALGORITMA NAÏVE BAYES DAN SUPPORT VECTOR MACHINE." Jurnal Sains dan Teknologi ISTP 22, no. 01 (2024): 08–15. https://doi.org/10.59637/jsti.v22i01.423.

Full text
Abstract:
Reviews on the Google Play website are an application's user feedback, containing users' ratings and comments. Review data describe user sentiment towards the application according to the ratings and comments given. In practice, there is often a discrepancy between the rating and the comments given, resulting in biased sentiment, so it is necessary to analyze the reviews to find out the sentiment contained therein. The Naïve Bayes method and the Support Vector Machine are among the methods often used to carry out sentiment analysis, because they have a good level of accuracy in the sentiment analysis process. Data were collected from the Google Play site using the web scraping technique with the google-play-scraper package for Python. Reviews that were successfully scraped then went through a preprocessing stage so that the data set would be more structured. In the next stage, the data set was labeled based on the rating and weighted using TF-IDF. After classifying using the Naïve Bayes and Support Vector Machine methods, evaluation was performed using the confusion matrix and validation using K-Fold Cross Validation. The results of using the Naïve Bayes method and the Support Vector Machine for sentiment analysis on the Google Play website show that the Naïve Bayes method produces 87.82% accuracy, 58.90% precision, and 60.08% recall, while the Support Vector Machine method produces 90.01% accuracy, 61.89% precision, and 60.18% recall.
APA, Harvard, Vancouver, ISO, and other styles
36

Mahajan Gaurav, More Pratik, Mirpagar Yash, Mohane Aditya, and Gangawane Manish. "Flexi-Pass Toll Tax Management Using CCTV Camera." International Research Journal on Advanced Engineering Hub (IRJAEH) 3, no. 03 (2025): 1088–91. https://doi.org/10.47392/irjaeh.2025.0156.

Full text
Abstract:
The proposed CCTV Toll Tax Management System aims to revolutionize toll collection by leveraging advanced surveillance technology and internet connectivity. This system involves strategically placing CCTV cameras on poles at 50 km intervals along highways. These cameras, connected to the internet, continuously capture vehicle images and transmit the data to a central database server. The server processes the data to identify vehicles and calculate the toll charges automatically. To enhance user convenience and operational efficiency, the system includes a mobile application for users and a web portal for RTO (Regional Transport Office) officials. The mobile app allows users to view their toll history, receive notifications, and make payments seamlessly. Meanwhile, the web portal provides RTO officials with real-time access to toll data, enabling efficient monitoring and management of toll operations. Additionally, the system incorporates unauthorized number plate detection and scrapped vehicle identification features. These functionalities help detect unauthorized vehicles and scrapped vehicles, enhancing road safety and ensuring compliance with regulations. This integrated approach reduces manual intervention and provides a smooth and transparent toll collection process, ultimately contributing to improved traffic flow and reduced congestion at toll plazas.
APA, Harvard, Vancouver, ISO, and other styles
37

Midhu Bala, G., and K. Chitra. "Data Extraction and Scratching Information Using R." Shanlax International Journal of Arts, Science and Humanities 8, no. 3 (2021): 140–44. http://dx.doi.org/10.34293/sijash.v8i3.3588.

Full text
Abstract:
Web scraping is the process of automatically extracting multiple web pages from the World Wide Web. It is a field with active developments that shares a common goal with text processing, the semantic web vision, semantic understanding, machine learning, artificial intelligence, and human-computer interaction. Current web scraping solutions range from ad-hoc approaches requiring human effort to fully automated systems that are able to extract the required unstructured information and convert it into structured information, with limitations. This paper describes a method for developing a web scraper using R programming that locates files on a website, then extracts the filtered data and stores it. The modules used and the algorithm for automating the navigation of a website via links are described in this paper. The output can further be used for data analytics.
APA, Harvard, Vancouver, ISO, and other styles
38

Anisa Rahma Salsabila, Muhammad Daffa, Muhammad Kandias Happy Maulana, and Eka Dyar Wahyuni. "IMPLEMENTASI TEKNIK WEB SCRAPING UNTUK MENAMPILKAN DATA TIM ENGLISH PREMIER LEAGUE." Prosiding Seminar Nasional Teknologi dan Sistem Informasi 2, no. 1 (2022): 40–45. http://dx.doi.org/10.33005/sitasi.v2i1.264.

Full text
Abstract:
Technological developments have a direct impact on human life, both positive and negative. There are several ways to integrate a system to be built, one of which is to apply user-interface-level integration. User-interface-level integration can be carried out with the web scraping technique. In this study, user-interface-level integration is applied using the web scraping technique with the help of the Web Scraper application, a Google Chrome extension, to produce a CSV file. The research data are taken from the English Premier League website. The CSV data produced through web scraping consist of English Premier League team data, which are then processed using ETL (Extraction, Transformation, Loading) in the Pentaho Data Integration application. The output of the process described above is visualized in the form of a website page.
APA, Harvard, Vancouver, ISO, and other styles
39

Ivana Elfirdaus and Eka Dyar Wahyuni. "IMPLEMENTASI WEB SCRAPING UNTUK PENGAMBILAN DATA REKOMENDASI FILM PADA IMDB." Prosiding Seminar Nasional Teknologi dan Sistem Informasi 3, no. 1 (2023): 327–33. http://dx.doi.org/10.33005/sitasi.v3i1.647.

Full text
Abstract:
Technology that continues to develop today has many impacts on society. The appropriate application of technology can overcome various existing problems, one of which is the limited access to data on several platforms for data retrieval. To overcome this problem, various integration techniques can be applied, especially at the user interface level. This study applies user-interface-level integration using the web scraping technique on the IMDb site, with the Web Scraper extension for Google Chrome used to produce a CSV file. The result of this web scraping is detailed data on the most popular recommended films, which are then transformed using Microsoft Excel and imported into MySQL. The data obtained in this study are then visualized in the form of a website built with a Bootstrap template.
APA, Harvard, Vancouver, ISO, and other styles
40

Siddhant, Vinayak Chanda, and A. Arivoli. "Web Scraping in Finance using Python." International Journal of Engineering and Advanced Technology (IJEAT) 9, no. 5 (2020): 255–62. https://doi.org/10.35940/ijeat.E9457.069520.

Full text
Abstract:
The objective of this paper is to highlight different ways to extract financial data (Balance Sheet, Income Statement and Cash Flow) of different companies from Yahoo Finance and to present an elaborate model that provides an economical, reliable and time-efficient tool for this purpose. It aims at aiding business analysts who are not well versed with coding but need quantitative outputs to analyse, predict and make market decisions, by automating the generation of financial data. A Python model is used, which scrapes the required data from Yahoo Finance and presents it in a precise and concise manner in the form of an Excel sheet. A web application is built using Python with a minimalistic and simple user interface to facilitate this process. The proposed method not only removes the chance of human error caused by manual extraction of data but also improves the overall productivity of analysts by drastically reducing the time it takes to generate the data, saving a substantial amount of human hours for the consumer. We also discuss the importance of data mining and scraping technologies in the finance industry, which is highly dependent on generated data for analysis and decision-making, different methods of scraping online data, and the legal aspects of web scraping.
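This is not the authors' code; as one hedged illustration of pulling the three statements into an Excel workbook, the sketch below uses the third-party yfinance package as a stand-in for a hand-rolled Yahoo Finance scraper. The ticker symbol and output file name are arbitrary, and an Excel writer backend such as openpyxl is assumed.

```python
# Illustrative export of the three financial statements to an Excel workbook.
import pandas as pd
import yfinance as yf

ticker = yf.Ticker("MSFT")   # arbitrary example ticker

# Each attribute returns a pandas DataFrame of the corresponding statement.
with pd.ExcelWriter("MSFT_financials.xlsx") as writer:   # requires e.g. openpyxl
    ticker.balance_sheet.to_excel(writer, sheet_name="Balance Sheet")
    ticker.financials.to_excel(writer, sheet_name="Income Statement")
    ticker.cashflow.to_excel(writer, sheet_name="Cash Flow")
```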
APA, Harvard, Vancouver, ISO, and other styles
41

Castillo-Zúñiga, Iván, Francisco-Javier Luna-Rosas, and Jaime-Iván López-Veyna. "Detection of traits in students with suicidal tendencies on Internet applying Web Mining." Comunicar 30, no. 71 (2022): 105–17. http://dx.doi.org/10.3916/c71-2022-08.

Full text
Abstract:
This article presents an Internet data analysis model based on Web Mining with the aim of extracting knowledge from large amounts of data in cyberspace. To test the proposed method, suicide web pages were analyzed as a case study to identify and detect traits in students with suicidal tendencies. The procedure uses a Web Scraper to locate and download information from the Internet, as well as Natural Language Processing techniques to retrieve the words. To explore the information, a dataset based on Dynamic Tables and Semantic Ontologies was constructed, specifying the predictive variables for young people with suicidal inclination. Finally, to evaluate the efficiency of the model, Machine Learning and Deep Learning algorithms were used. It should be noted that the procedures for constructing the dataset (using Genetic Algorithms) and obtaining the knowledge (using Parallel Computing and GPU acceleration) were optimized. The results reveal an accuracy of 96.28% in the detection of characteristics in adolescents with suicidal tendencies, with the best result reached by a Recurrent Neural Network at 98% accuracy. It is inferred that the model is viable as a basis for mechanisms of action and prevention of suicidal behaviors, which can be implemented by educational institutions or other social actors.
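The paper's best model is a recurrent neural network trained on an optimized dataset; as a minimal, purely illustrative stand-in for the text-classification stage, the sketch below fits a TF-IDF plus logistic-regression pipeline. The texts and labels are hypothetical placeholders, not data from the study.

```python
# Minimal stand-in for the text-classification stage on scraped page text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["placeholder scraped page containing risk-related wording",
         "placeholder scraped page about everyday topics"]
labels = [1, 0]   # 1 = at-risk traits present (hypothetical labels)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["another scraped page text"]))
```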
APA, Harvard, Vancouver, ISO, and other styles
42

Ayoubkhani, Daniel, and Heledd Thomas. "Estimating Weights for Web-Scraped Data in Consumer Price Indices." Journal of Official Statistics 38, no. 1 (2022): 5–21. http://dx.doi.org/10.2478/jos-2022-0002.

Full text
Abstract:
In recent years, there has been much interest among national statistical agencies in using web-scraped data in consumer price indices, potentially supplementing or replacing manually collected price quotes. Yet one challenge that has received very little attention to date is the estimation of expenditure weights in the absence of quantity information, which would enable the construction of weighted item-level price indices. In this article we propose the novel approach of predicting sales quantities from their ranks (for example, when products are sorted ‘by popularity’ on consumer websites) via appropriate statistical distributions. Using historical transactional data supplied by a UK retailer for two consumer items, we assessed the out-of-sample accuracy of the Pareto, log-normal and truncated log-normal distributions, finding that the last of these resulted in an index series that most closely approximated an expenditure-weighted benchmark. Our results demonstrate the value of supplementing web-scraped price quotes with a simple set of retailer-supplied summary statistics relating to quantities, allowing statistical agencies to realise the benefits of freely available internet data whilst placing minimal burden on retailers. However, further research would need to be undertaken before the approach could be implemented in the compilation of official price indices.
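To make the rank-to-weight idea concrete, here is a toy sketch that converts popularity ranks into predicted quantity shares and then expenditure weights. A simple power-law (Zipf-type) form is used purely for brevity; the paper finds a truncated log-normal works best, and the prices and shape parameter below are made up.

```python
# Toy illustration: popularity ranks -> predicted quantities -> expenditure weights.
import numpy as np

prices = np.array([12.0, 9.5, 20.0, 7.0, 15.5])   # price quotes (placeholder values)
ranks = np.arange(1, len(prices) + 1)             # items sorted 'by popularity'
alpha = 1.2                                        # assumed shape parameter

q_hat = ranks.astype(float) ** (-alpha)            # predicted quantities up to scale
weights = prices * q_hat
weights /= weights.sum()                           # expenditure weights summing to 1
print(weights.round(3))
```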
APA, Harvard, Vancouver, ISO, and other styles
43

Juszczak, Adam. "The use of web-scraped data to analyse the dynamics of clothing and footwear prices." Wiadomości Statystyczne. The Polish Statistician 2023, no. 9 (2023): 15–33. http://dx.doi.org/10.59139/ws.2023.09.2.

Full text
Abstract:
Web scraping is a technique that makes it possible to obtain information from websites automatically. As online shopping grows in popularity, it became an abundant source of information on the prices of goods sold by retailers. The use of scraped data usually allows, in addition to a significant reduction of costs of price research, the improvement of the precision of inflation estimates and real-time tracking. For this reason, web scraping is a popular research tool both for statistical centers (Eurostat, British Office of National Statistics, Belgian Statbel) and universities (e.g. the Billion Prices Project conducted at Massachusetts Institute of Technology). However, the use of scraped data to calculate inflation brings about many challenges at the stage of their collection, processing, and aggregation. The aim of the study is to compare various methods of calculating price indices of clothing and footwear on the basis of scraped data. Using data from one of the largest online stores selling clothing and footwear for the period of February 2018–November 2019, the author compared the results of the Jevons chain index, the GEKS-J index and the GEKS-J expanding and updating window methods. As a result of the calculations, a high chain index drift was confirmed, and very similar results were found using the extension methods and the updated calculation window (excluding the FBEW method).
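For readers unfamiliar with the indices being compared, the sketch below computes a bilateral Jevons index and a GEKS-J index on a small made-up price panel (rows are products, columns are months). The numbers are illustrative only and do not come from the study's data.

```python
# Bilateral Jevons and GEKS-J indices on a tiny illustrative price panel.
import numpy as np
import pandas as pd

prices = pd.DataFrame(
    {"2018-02": [10.0, 25.0, 8.0], "2018-03": [11.0, 24.0, 8.5], "2018-04": [10.5, 26.0, 9.0]},
    index=["shirt", "boots", "socks"],
)

def jevons(p0, p1):
    """Unweighted geometric mean of price relatives over matched products."""
    rel = (p1 / p0).dropna()
    return float(np.exp(np.log(rel).mean()))

periods = list(prices.columns)
base, target = periods[0], periods[-1]

# GEKS-J: geometric mean over all link periods l of P(base, l) * P(l, target).
links = [jevons(prices[base], prices[l]) * jevons(prices[l], prices[target]) for l in periods]
geks_j = float(np.exp(np.mean(np.log(links))))
print(jevons(prices[base], prices[target]), geks_j)
```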
APA, Harvard, Vancouver, ISO, and other styles
44

Benedetti, Ilaria, Tiziana Laureti, Luigi Palumbo, and Brandon M. Rose. "Computation of High-Frequency Sub-National Spatial Consumer Price Indexes Using Web Scraping Techniques." Economies 10, no. 4 (2022): 95. http://dx.doi.org/10.3390/economies10040095.

Full text
Abstract:
The development of Information and Communications Technology and digital economies has contributed to changes in the consumption of goods and services in various areas of life, raising users' expectations of price statistics. It is therefore important to provide timely information on differences in consumer prices across space and over time. Web-scraped data, obtained by automatically collecting large amounts of data from the web, offer the potential to greatly improve the quality and efficiency of consumer price indices. In this paper, we explore the use of web-scraped data for compiling high-frequency price indexes for groups of products by using the time-interaction-region product model. We computed monthly average prices for five entry-level items according to the Consumer Price Index for All Urban Consumers (CPI-U) classification and tracked their evolution over time in the 11 US cities covered by our dataset. Even though our dataset covers a small percentage of the CPI-U index, the results show how web-scraped data may provide timely estimates of sub-national SPI evolution and unveil seasonal trends for specific categories.
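The exact specification of the authors' time-interaction-region product model is not given in the abstract; as a generic sketch in the same spirit, the regression below projects log prices on product dummies plus city-by-month dummies, in the style of a CPD model. The small data frame is a made-up placeholder.

```python
# Generic CPD-style regression: log price on product and city-month dummies.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price":   [2.5, 2.7, 2.4, 3.1, 1.2, 1.3],
    "product": ["milk", "milk", "milk", "milk", "bread", "bread"],
    "city":    ["NYC", "NYC", "LA", "LA", "NYC", "LA"],
    "month":   ["2021-01", "2021-02", "2021-01", "2021-02", "2021-01", "2021-01"],
})
df["city_month"] = df["city"] + ":" + df["month"]

X = pd.get_dummies(df[["product", "city_month"]], drop_first=True).astype(float)
X.insert(0, "const", 1.0)
beta, *_ = np.linalg.lstsq(X.values, np.log(df["price"].values), rcond=None)

# Exponentiated city-month coefficients can be read as relative price levels.
print(dict(zip(X.columns, np.round(np.exp(beta), 3))))
```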
APA, Harvard, Vancouver, ISO, and other styles
45

Prabhdeep Singh Bagga, Narinder Kaur. "Offline Route planner using Web Automation." International Journal for Modern Trends in Science and Technology 6, no. 12 (2020): 365–69. http://dx.doi.org/10.46501/ijmtst061268.

Full text
Abstract:
Automation refers to reducing repetitive human work and tedious tasks while minimizing errors. With the correct automation tools, it is possible to automate browser tasks, web testing and online data extraction: to fill forms, scrape data, transfer data between applications and generate reports. This research project focuses on automating the task of placing an order for a particular set of items from an online website. The main aim of the project is planning the most efficient route to visit all of your stops and reach your destination. It automates the process of finding the most optimal route and saves a PDF of the commute details to your disk.
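The project's own site and selectors are not reproduced here; as a generic browser-automation sketch of the form-filling step, the snippet below uses Selenium with placeholder element names and URL, and assumes a Chrome driver is available.

```python
# Generic Selenium sketch of automating an online order form (placeholders only).
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                      # assumes a Chrome driver is available
try:
    driver.get("https://example.org/order")      # placeholder URL
    driver.find_element(By.NAME, "item").send_keys("milk")
    driver.find_element(By.NAME, "quantity").send_keys("2")
    driver.find_element(By.ID, "submit").click() # hypothetical submit button
finally:
    driver.quit()
```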
APA, Harvard, Vancouver, ISO, and other styles
46

Xiao, Geoffrey. "Data Misappropriation." Science and Technology Law Review 24, no. 1 (2023): 125–72. http://dx.doi.org/10.52214/stlr.v24i1.10456.

Full text
Abstract:
Data scraping (also called web scraping, screen scraping, or web crawling) is a technique that uses “bots” to automate the collection of information from publicly available websites. Fundamentally, data scraping is data copying. Intellectual property (“IP”) law—namely, copyright—typically handles disputes involving copying. However, copyright law largely fails to protect data and databases (i.e., compilations of data). Instead, plaintiff websites assert contract law, Computer Fraud and Abuse Act (“CFAA”), and state unfair competition law (common law misappropriation, unjust enrichment, conversion, and trespass to chattel) claims against data scrapers. This Note proceeds as follows. First, this Note examines how scrapers can be liable under trade secret law for scraping data from publicly accessible websites. Initially, trade secret law seems incongruous with data scraping because the core concept of trade secret law—secrecy—is seemingly at odds with public accessibility. If a website is publicly available, how can a scraper be liable for trade secret misappropriation of the website’s data? This Note explains how a recent Eleventh Circuit case, Compulife Software Inc. v. Newman, laid the groundwork for a trade secret cause of action. This Note reconciles Compulife with existing trade secret jurisprudence, argues that Compulife was rightly decided as a matter of both law and policy, and provides a roadmap for courts to apply trade secret law to data scraping cases. Second, this Note explains why courts and litigators should use trade secret law to adjudicate data scraping disputes. Specifically, this Note argues that, compared to the existing alternatives, trade secret law is best suited to handle the various policy issues surrounding data scraping. This Note explains how contract law and the CFAA have filled the database void left by copyright law: contract law and the CFAA have become “quasi-IP” regimes, granting websites property rights in databases otherwise unprotected by copyright law. In response to the emergence of quasi-IP, this Note argues for reconceptualizing the data scraping problem by reframing data scraping as data copying—reframing data scraping with an intellectual property lens. Trade secret law offers a framework for that reconceptualization. In contrast to contract law and the CFAA (an anti-hacking law premised on criminal trespass principles), trade secret law provides courts and litigators with the appropriate IP-based doctrinal levers to analyze data scraping cases. Finally, this Note analyzes how EU law filled the database gap by creating an IP right, the sui generis database right. This Note argues that Compulife’s trade secret theory emulates many aspects of the EU sui generis database right. In this sense, Compulife’s trade secret theory can be seen as the United States’ attempt to fashion its own sui generis database right to fill the database gap left by copyright.
APA, Harvard, Vancouver, ISO, and other styles
47

KURNIAWAN, ROBI, and AULIA APRILIANI. "ANALISIS SENTIMEN MASYARAKAT TERHADAP VIRUS CORONA BERDASARKAN OPINI DARI TWITTER BERBASIS WEB SCRAPER." Jurnal INSTEK (Informatika Sains dan Teknologi) 5, no. 1 (2020): 67. http://dx.doi.org/10.24252/instek.v5i1.13686.

Full text
Abstract:
Indonesia is one of the countries with a fairly high number of daily active Twitter users, which makes Twitter a suitable medium for sentiment analysis on the corona topic. Sentiment analysis is a branch of text mining that classifies documents or text. This study aims to determine the impact of the corona virus in Indonesia according to public opinion on Twitter. Data were collected with a web scraper technique, yielding 1,000 records from 20 January to 1 February 2020; the scraped data were then analysed following the text mining stages of case folding, tokenizing and filtering. The results show the distribution of public opinion on the corona virus to be 79% negative, 11% neutral and 10% positive. Keywords: corona, sentiment analysis, twitter.
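As a minimal sketch of the three preprocessing stages named in the abstract, the function below applies case folding, tokenizing and filtering to a placeholder tweet. The stop-word set is a tiny illustrative subset, not a full Indonesian list.

```python
# Case folding, tokenizing and filtering of a tweet (illustrative only).
import re

STOPWORDS = {"yang", "di", "dan", "ini", "itu"}   # illustrative subset only

def preprocess(tweet: str) -> list[str]:
    folded = tweet.lower()                                  # case folding
    folded = re.sub(r"http\S+|[^a-z\s]", " ", folded)       # strip URLs and punctuation
    tokens = folded.split()                                 # tokenizing
    return [t for t in tokens if t not in STOPWORDS]        # filtering

print(preprocess("Virus corona ini menyebar di Indonesia!"))
```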
APA, Harvard, Vancouver, ISO, and other styles
48

Briney, Kristin A. "Measuring data rot: An analysis of the continued availability of shared data from a Single University." PLOS ONE 19, no. 6 (2024): e0304781. http://dx.doi.org/10.1371/journal.pone.0304781.

Full text
Abstract:
To determine where data is shared and what data is no longer available, this study analyzed data shared by researchers at a single university. 2166 supplemental data links were harvested from the university’s institutional repository and web scraped using R. All links that failed to scrape or could not be tested algorithmically were tested for availability by hand. Trends in data availability by link type, age of publication, and data source were examined for patterns. Results show that researchers shared data in hundreds of places. About two-thirds of links to shared data were in the form of URLs and one-third were DOIs, with several FTP links and links directly to files. A surprising 13.4% of shared URL links pointed to a website homepage rather than a specific record on a website. After testing, 5.4% of the 2166 supplemental data links were found to be no longer available. DOIs were the type of shared link that was least likely to disappear with a 1.7% loss, with URL loss at 5.9% averaged over time. Links from older publications were more likely to be unavailable, with a data disappearance rate estimated at 2.6% per year, as well as links to data hosted on journal websites. The results support best practice guidance to share data in a data repository using a permanent identifier.
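The study scrapes and tests links in R; as a rough stand-in for the availability check only, the sketch below classifies each link as a DOI or plain URL and probes it with an HTTP request. The example links are placeholders, and servers that reject HEAD requests would need a fallback GET.

```python
# Rough availability check for shared data links (placeholder links).
import requests

links = ["https://doi.org/10.5281/zenodo.123456", "https://example.org/data.csv"]

for link in links:
    kind = "DOI" if "doi.org/" in link else "URL"
    try:
        resp = requests.head(link, allow_redirects=True, timeout=15)
        available = resp.status_code < 400
    except requests.RequestException:
        available = False
    print(kind, link, "available" if available else "unavailable")
```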
APA, Harvard, Vancouver, ISO, and other styles
49

Mehrhoff, Jens. "Introduction – The Value Chain of Scanner and Web Scraped Data." Economie et Statistique / Economics and Statistics, no. 509 (September 17, 2019): 5–11. http://dx.doi.org/10.24187/ecostat.2019.509.1980.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Defauw, Szoc, Bardadym, et al. "Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach." Informatics 6, no. 3 (2019): 35. http://dx.doi.org/10.3390/informatics6030035.

Full text
Abstract:
To build state-of-the-art Neural Machine Translation (NMT) systems, high-quality parallel sentences are needed. Typically, large amounts of data are scraped from multilingual web sites and aligned into datasets for training. Many tools exist for automatic alignment of such datasets. However, the quality of the resulting aligned corpus can be disappointing. In this paper, we present a tool for automatic misalignment detection (MAD). We treated the task of determining whether a pair of aligned sentences constitutes a genuine translation as a supervised regression problem. We trained our algorithm on a manually labeled dataset in the FR–NL language pair. Our algorithm used shallow features and features obtained after an initial translation step. We showed that both the Levenshtein distance between the target and the translated source, as well as the cosine distance between sentence embeddings of the source and the target were the two most important features for the task of misalignment detection. Using gold standards for alignment, we demonstrated that our model can increase the quality of alignments in a corpus substantially, reaching a precision close to 100%. Finally, we used our tool to investigate the effect of misalignments on NMT performance.
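To illustrate the two features the paper identifies as most important, the sketch below computes an edit distance between the target and a translated source, and a cosine distance between sentence embeddings. The `translate` and `embed` callables are hypothetical stand-ins for an MT step and an embedding model; the edit distance is a plain dynamic program.

```python
# Two misalignment features: edit distance and embedding cosine distance.
import numpy as np

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def misalignment_features(source: str, target: str, translate, embed):
    translated = translate(source)                          # MT of the source sentence
    return (levenshtein(target, translated),                # edit-distance feature
            cosine_distance(embed(source), embed(target)))  # embedding-distance feature
```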
APA, Harvard, Vancouver, ISO, and other styles