Dissertations / Theses on the topic 'Data Scraping'
Carle, Victor. "Web Scraping using Machine Learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-281344.
This report investigates what is required to build a robust web scraper, designed to keep scraping a specific website even as its underlying HTML code changes. An algorithm is presented that is suited to websites with a repetitive HTML structure. A repetitive HTML structure typically means that items such as news articles, videos or books are displayed: the same HTML code is reused several times, since the only thing distinguishing the items is, for example, their titles. YouTube is a good example. The scraper works by text classification of the words found in the HTML code, so that the machine learning algorithm, a support vector machine, can recognize the code surrounding the data sought on the website. To make this possible, the HTML code and relevant metadata are converted into vectors using the bag-of-words model. After the conversion, the vectors can be fed into the machine learning model to classify the data. The algorithm is tested on older versions of the website taken from a web archive, in order to get a good picture of what future performance might look like. The algorithm achieves varying results depending on a large number of variables in the website and in the older versions of the pages. It performed best on Yahoo News, where it reached 90% accuracy on older pages.
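To make the method this abstract describes concrete, here is a minimal sketch of its classification core: HTML fragments turned into bag-of-words vectors and fed to a support vector machine with scikit-learn. The fragments, labels, and token pattern are illustrative assumptions, not material from the thesis.

```python
# Sketch: classify HTML fragments with a bag-of-words representation
# and a support vector machine. Training snippets are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# label 1 = fragment wraps a wanted item (e.g. a video title),
# label 0 = unrelated boilerplate.
fragments = [
    '<a id="video-title" class="yt-simple-endpoint">How to ...</a>',
    '<div class="footer-links"><a href="/about">About</a></div>',
    '<a id="video-title" class="yt-simple-endpoint">Review of ...</a>',
    '<nav class="top-bar"><a href="/login">Sign in</a></nav>',
]
labels = [1, 0, 1, 0]

# Tokenize tag names, attribute values and visible text alike.
model = make_pipeline(
    CountVectorizer(token_pattern=r"[A-Za-z][A-Za-z0-9_-]+"),
    LinearSVC(),
)
model.fit(fragments, labels)

# A later version of the page can then be scanned fragment by fragment.
print(model.predict(['<a id="video-title">Another clip</a>']))
```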
Färholt, Fredric. "Less Detectable Web Scraping Techniques." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-104887.
Web scraping is an effective way to collect data, and it has become an activity that is easy to carry out with a high chance of success. Users no longer need to be technology enthusiasts to scrape data; there is now a wealth of different, easy-to-use platform services. This study runs experiments to see how one can scrape undetectably using a popular and capable JavaScript library (Puppeteer). Three web scraping algorithms, two of which use movement patterns from real web users, demonstrate how information can be collected. The algorithms were run against a website set up for the experiment with tangible security, a honeypot, and activity logging, which made it possible to collect and evaluate data from both the algorithms and the website. The results show that it may be possible to scrape undetectably using Puppeteer. One of the algorithms also reveals the possibility of controlling performance by using Puppeteer's built-in methods.
Legaspi, Ramos Xurxo. "Scraping Dynamic Websites for Economical Data : A Framework Approach." Thesis, Linnéuniversitetet, Institutionen för datavetenskap (DV), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-57070.
Oucif, Kadday. "Evaluation of web scraping methods : Different automation approaches regarding web scraping using desktop tools." Thesis, KTH, Data- och elektroteknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-188418.
A great deal of information can be found and extracted in various formats from the semantic web through web scraping, with many techniques having emerged over time. This report is written with the goal of evaluating different web scraping methods in order to develop an automated, performance-reliable, easily implemented, and solid extraction process. A number of parameters are defined to evaluate and compare existing web scraping techniques. A matrix of desktop tools is explored and two are selected for evaluation. The evaluation also covers the approach to learning how to set up different web scraping processes with so-called agents. A number of links are scraped for data, with and without executing the JavaScript on the web pages. Prototypes built with the selected techniques are tested and presented, with the web scraping tool Content Grabber as the final solution. The result is a better understanding of the subject as well as a cost-effective extraction process consisting of mixed techniques and methods, where good knowledge of the structure of the web pages facilitates data collection. Finally, the result is presented and discussed with respect to the chosen parameters.
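For readers unfamiliar with the distinction this abstract draws, a minimal static extraction pass (no JavaScript execution) looks roughly like the sketch below; the URL and selectors are placeholders. Pages that assemble their content with JavaScript need a browser engine (for example a headless browser) before the same parsing step.

```python
# Sketch: static fetch and parse, without executing page JavaScript.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/listing", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
for row in soup.select("div.item"):          # hypothetical container class
    title = row.select_one("h2")
    if title is not None:
        print(title.get_text(strip=True))
```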
Rodrigues, Lanny Anthony, and Srujan Kumar Polepally. "Creating Financial Database for Education and Research: Using WEB SCRAPING Technique." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-36010.
Cosman, Vadim, and Kailash Chowdary. "End user interface for collecting and evaluating company data : Real-time data collection through web-scraping." Thesis, Högskolan Dalarna, Institutionen för information och teknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:du-37740.
Ceccaroni, Giacomo. "Raccolta di dati eterogenei e multi-sorgente per data visualization dei rapporti internazionali dell'Ateneo di Bologna." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2017. http://amslaurea.unibo.it/13940/.
Franchini, Giulia. "Associazioni non profit e linked open data: un esperimento." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2015. http://amslaurea.unibo.it/8350/.
Holm, Andreas, and Oscar Ahlm. "Skrapa Facebook : En kartläggning över hur data kan samlas in från Facebook." Thesis, Malmö universitet, Institutionen för datavetenskap och medieteknik (DVMT), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-43326.
A vast amount of data is shared daily on social media platforms. Data that, if it can be collected and sorted, can prove valuable as a basis for research work, especially in countries where social media constitutes the only possible place for citizens to make their voices heard. Facebook is one of the most frequently used social media platforms and thus can be a potentially rich source from which data can be collected. But Facebook has become more restrictive about who gets access to the data on their platform. This has created an interest in ways to get access to the data that is shared on Facebook's platform without getting explicit approval from Facebook. At the same time it raises questions about the ethics and the legality of it. This work intended to investigate different aspects, such as technical, ethical, and legal, related to the collecting of data from Facebook's platform by performing a literature review and experiments. The literature review showed that it was difficult to find material regarding technical measures taken by Facebook to prevent web scraping. The experiments that were performed identified some of these measures, among others that the structure of the HTML code changes and that ids of HTML elements update when different events occur on the web page, which makes web scraping increasingly difficult. The literature review also showed that it is troublesome to know which data is legal to scrape from Facebook and which is not. This is partly due to the fact that different countries have different laws to which one must conform when scraping web data, and partly that it can be difficult to know what counts as personal data and thus is protected by GDPR among other laws.
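One common answer to the changing element ids that the experiments identified is to anchor selectors on more stable cues such as visible text or aria-labels. A hedged sketch with invented markup (not Facebook's actual HTML):

```python
# Sketch: select by stable cues instead of generated ids/classes.
from bs4 import BeautifulSoup

html = """
<div class="x1a2b3c" id="mount_0_0_aB">
  <span aria-label="Like button">Like</span>
  <div class="x9z8y7"><span>Shared publicly</span>2,413 comments</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Generated ids ("mount_0_0_aB") change between sessions; the aria-label
# and visible text tend to be far more stable anchors.
like = soup.find("span", attrs={"aria-label": "Like button"})
comments = soup.find(string=lambda s: s and "comments" in s)
print(like.get_text(), "|", comments.strip())
```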
Mascellaro, Maria Maddalena. "Integrazione di sorgenti eterogenee per un sistema di Data Visualization." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2018. http://amslaurea.unibo.it/16818/.
Ventura, Pedro Côrte-Real Machado. "Dashboard de tráfego nos websites de empresas europeias." Master's thesis, Instituto Superior de Economia e Gestão, 2019. http://hdl.handle.net/10400.5/19475.
Business intelligence services offer many ways to process and analyze the richness of a data set in business today. In this project work, Microsoft Power BI and a web scraping tool were used to develop a business intelligence solution with a data set within the scope of website traffic. To provide theoretical context on how to develop a business intelligence solution from a semi-structured data set, key principles of multidimensional modeling are introduced. The results of the development process are shown in the description and discussion of examples of dashboards created for the solution. As this is a tool for daily use, technical performance and supporting documentation are important for its positive adoption by users. Functionality and performance aspects were analyzed and optimized based on the literature review, and the tool was put to work correctly.
Hidén, Filip, and Magnus Qvarnström. "En jämförelse av prestanda mellan centraliserad och decentraliserad datainsamling." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-291266.
In the modern world, data and information are used on a larger scale than ever before. Much of this information and data can be found on the internet in many different forms, such as articles, files, and web pages. Anyone trying to start a new project or company that depends on parts of this data needs a way to efficiently search through it, sort out what is sought, and collect it for processing. A common way to do this is a method called web scraping, which can be implemented in several different ways to search for and collect the found data. For small companies this can become a costly venture, as web scraping is an intensive process that usually requires paying to run a sufficiently powerful server that can handle the data. The purpose of this report is to investigate whether there are valid, cheaper alternatives for implementing web scraping solutions that do not require access to costly server solutions. To answer this, a study was carried out on web scraping and on the different system architectures used to develop such systems in the current market, and on how they can be implemented. With this knowledge, a web scraping application was developed, adapted to collect ingredients from recipe articles on the internet. This implementation was then adapted for two different solutions: one centralized on a server and one decentralized, for Android devices. Finally, all the collected facts were summarized, together with unit tests performed on the test implementations, to obtain a result. The conclusion drawn from this result was that decentralized Android implementations are a valid and functional solution for web scraping today, but the difference in performance means that it is not always a usable solution; instead, the choice must be made according to a company's needs and specifications. Moreover, research on this topic is limited, and further investigation is required to improve knowledge and implementations in this area in the future.
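The recipe-ingredient scraper at the core of both deployments could be reduced to something like the following sketch; the URL and CSS selector are hypothetical, and a real site would require its own rules (and permission). The same function could run centralized on a server or be bundled into a mobile client.

```python
# Sketch: the shared extraction step behind both deployments.
import requests
from bs4 import BeautifulSoup

def scrape_ingredients(url: str) -> list[str]:
    page = requests.get(url, timeout=10)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    # Many recipe sites mark ingredients up as list items in one section.
    items = soup.select("ul.ingredients li")      # placeholder selector
    return [li.get_text(" ", strip=True) for li in items]

if __name__ == "__main__":
    for line in scrape_ingredients("https://example.com/recipes/42"):
        print(line)
```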
Michalakidis, Georgios. "Appreciation of structured and unstructured content to aid decision making : from web scraping to ontologies and data dictionaries in healthcare." Thesis, University of Surrey, 2016. http://epubs.surrey.ac.uk/812261/.
Jakupovic, Edin. "Alternative Information Gathering on Mobile Devices." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-210712.
Searching for and gathering information on specific topics is a time-consuming but necessary practice. With mobile's continuous growth having overtaken desktop devices' share, the mobile market is becoming an important area to consider. Given the mobility of portable devices, some tasks become harder to perform than on desktop devices. Searching for information on the Internet is generally slower on mobile devices than on desktops. The biggest challenges of searching for information on the Internet with mobile devices are the smaller screen sizes and the time spent navigating between sources and search results in a browser. These challenges can be addressed by an application that focuses on relevant search results, summarizes their content, and presents them in a single view. The purpose of this study is to find an alternative information gathering method that creates a faster and simpler search experience. This method quickly finds and collects data requested via a user's search term. The data is then analyzed and presented to the user in summarized form, eliminating the need to visit the content's source. A survey was carried out in which a small target group of users answered a questionnaire. The results showed that the method was fast, the results were often relevant, and the summaries reduced the need to visit the source page. But while the method had potential for future development, it is hindered by the ethical problems associated with the use of web scrapers.
Blázquez Soriano, María Desamparados. "Design and Evaluation of Web-Based Economic Indicators: A Big Data Analysis Approach." Doctoral thesis, Universitat Politècnica de València, 2020. http://hdl.handle.net/10251/116836.
In the Digital Era, the increasing use of the Internet and digital devices is completely transforming the way of interacting in the economic and social framework. Myriad individuals, companies and public organizations use the Internet for their daily activities, generating a stream of fresh data ("Big Data") principally accessible through the World Wide Web (WWW), which has become the largest repository of information in the world. These digital footprints can be tracked and, if properly processed and analyzed, could help to monitor in real time a wide range of economic variables. In this context, the main goal of this PhD thesis is to generate economic indicators, based on web data, which are able to provide regular, short-term predictions ("nowcasting") about some business activities that are basic for the growth and development of an economy. Concretely, three web-based economic indicators have been designed and evaluated: first, an indicator of firms' export orientation, which is based on a model that predicts if a firm is an exporter; second, an indicator of firms' engagement in e-commerce, which is based on a model that predicts if a firm offers e-commerce facilities in its website; and third, an indicator of firms' survival, which is based on two models that indicate the probability of survival of a firm and its hazard rate. To build these indicators, a variety of data from corporate websites have been retrieved manually and automatically, and subsequently have been processed and analyzed with Big Data analysis techniques. Results show that the selected web data are highly related to the economic variables under study, and the web-based indicators designed in this thesis are capturing to a great extent their real values, thus being valid for their use by the academia, firms and policy-makers. Additionally, the digital and online nature of web-based indicators makes it possible to provide timely, inexpensive predictions about the economy. This way, they are advantageous with respect to traditional indicators. This PhD thesis has contributed to generating knowledge about the viability of producing economic indicators with data coming from corporate websites. The indicators that have been designed are expected to contribute to the modernization of official statistics and to help in making earlier, more informed decisions to policy-makers and business managers.
Wöldern, Lars. "Discovery and Analysis of Social Media Data : How businesses can create customized filters to more effectively use public data." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-75275.
Wu, Yongliang. "Aggregating product reviews for the Chinese market." Thesis, KTH, Kommunikationssystem, CoS, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-91484.
By December 2007 the number of Internet users in China had grown to 210 million people. The annual growth rate reached 53.3 percent in 2008, with the average number of Internet users increasing by 200,000 people every day. China's Internet population is currently slightly below the 215 million Internet users in the United States. [1] Despite the rapid growth of the Chinese economy in the global Internet market, China's e-commerce does not follow the traditional pattern of commerce, but has instead developed based on user demand. This growth has extended to all areas of the Internet. In the West, expert reviews have proven to be an important part of a user's purchase decision. The higher the quality of the product reviews that customers receive, the more products they buy from online stores. As the number of products and options increases, Chinese customers need impersonal, impartial, and detailed product reviews. This thesis focuses on online reviews and how they affect Chinese customers' purchase decisions. E-commerce is a complex system. Taking a typical e-commerce model, we examine a business-to-consumer (B2C) online sales site and consider a number of factors, including some seemingly subtle factors that can affect the customer's eventual decision to shop on the site. Specifically, this thesis examines aggregated reviews from different online sources by analyzing some existing Western companies. The thesis then shows how to aggregate product reviews for an e-commerce website. During this thesis work we found that existing data mining techniques made it straightforward to collect reviews. These reviews were stored in a database, and web applications can query this database to provide a user with a set of relevant product reviews. One of the important issues, just as with search engines, is to provide relevant product reviews and determine the order in which they should be presented. In our work we selected reviews based on matching the product (though in some cases there are ambiguities as to whether two products are really identical or not) and ordered the matching reviews by date, with the most recent reviews presented first. Some of the open questions that remain for the future are: (1) improving matching, to avoid ambiguities as to whether reviews concern the same product or not, and (2) determining whether the reviews actually influence a Chinese user's choice to buy a product.
Pettersson, Emeli, and Albin Carlson. "Att hitta en nål i en höstack: Metoder och tekniker för att sålla och gradera stora mängder ostrukturerad textdata." Thesis, Malmö universitet, Fakulteten för teknik och samhälle (TS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20105.
Big Data is a popular topic these days and can be utilized for numerous purposes. It can, for instance, be used to analyse data made available online in hopes of identifying violations of human rights. By applying techniques from such areas as Artificial Intelligence (AI), Information Retrieval (IR), and Visual Analytics, the company Globalworks Ltd. aims to identify single voices in social media expressing grievances concerning such violations. Artificial Intelligence and Information Retrieval are broad topics, however, and have been an active area of research for quite some time. We have therefore chosen to conduct a systematic literature review in hopes of mapping together existing research covering these areas. By presenting a literary compilation, we provide an ontological view of how an information system utilizing techniques within these areas could be structured, in addition to how such a system could deploy said techniques.
De Luca, Gabriele. "PARLEN: uno strumento modulare per l’analisi di articoli e il riconoscimento di entità." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/10905/.
Andersson, Pontus. "Developing a Python based web scraper : A study on the development of a web scraper for TimeEdit." Thesis, Mittuniversitetet, Institutionen för informationssystem och –teknologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-43140.
The concept of scraping the web is not new; however, with modern programming languages it is possible to build web scrapers that can collect unstructured data and save it in a structured way. TimeEdit, a scheduling platform used by Mid Sweden University, has no feasible way to count how many hours have been scheduled in any given week for a specific course, student, or professor. The goal of this thesis is to build a Python-based web scraper that collects data from TimeEdit and saves it in a structured manner. Users can then upload this text file to a dynamic website where the data is extracted from the file and saved into a predetermined database, unique to that user. The user can then get this data presented in a fast, efficient, and user-friendly way. This platform is developed and evaluated, with the resulting platform being a good and fast way to scan a TimeEdit schedule and evaluate the extracted data. With the platform built, future work is recommended to make it a finished product ready for live use by all types of users.
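The hour-counting idea reduces to summing durations per course once schedule entries are scraped. Below is a sketch with an invented entry format and made-up course codes; a real TimeEdit export would need its own parser before this aggregation step.

```python
# Sketch: total scheduled hours per course from scraped schedule entries.
from collections import defaultdict
from datetime import datetime

entries = [  # (course, start, end) — illustrative data only
    ("DT002G", "2021-03-01 08:15", "2021-03-01 10:00"),
    ("DT002G", "2021-03-03 13:15", "2021-03-03 15:00"),
    ("IK060G", "2021-03-02 10:15", "2021-03-02 12:00"),
]

fmt = "%Y-%m-%d %H:%M"
hours_per_course: dict[str, float] = defaultdict(float)
for course, start, end in entries:
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    hours_per_course[course] += delta.total_seconds() / 3600

for course, hours in sorted(hours_per_course.items()):
    print(f"{course}: {hours:.2f} scheduled hours")
```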
Morgan, Justin L. "Clustering Web Users By Mouse Movement to Detect Bots and Botnet Attacks." DigitalCommons@CalPoly, 2021. https://digitalcommons.calpoly.edu/theses/2304.
Johansson, Richard, and Heino Otto Engström. "Topic propagation over time in internet security conferences : Topic modeling as a tool to investigate trends for future research." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-177748.
Mulazzani, Alberto. "Social media sensing: Twitter e Reddit come casi di studio e comparazione applicati ai test prenatali." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2018.
Yu, Andrew Seohwan. "NBA ON-BALL SCREENS: AUTOMATIC IDENTIFICATION AND ANALYSIS OF BASKETBALL PLAYS." Cleveland State University / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=csu14943636475232.
Kefurt, Pavel. "Získávání znalostí z veřejných semistrukturovaných dat na webu." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255386.
Kolečkář, David. "Systém pro integraci webových datových zdrojů." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2020. http://www.nusl.cz/ntk/nusl-417239.
Bonde-Hansen, Martin. "The Dynamics of Rent Gap Formation in Copenhagen : An empirical look into international investments in the rental market." Thesis, Malmö universitet, Malmö högskola, Institutionen för Urbana Studier (US), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-41157.
Dorka, Moritz. "On the domain-specific formalization of requirement specifications - a case study of ETCS." Master's thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2015. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-182866.
This thesis presents software for the automated extraction of requirements from documents in Microsoft Word format using domain knowledge. In a subsequent step, these requirements are enriched for implementation purposes and finally stored as ReqIF, an XML-based file format for the exchange of specification documents. ReqIF is supported by numerous industry-standard requirements management tools. The enrichment formalizes the structure as well as selected parts of the natural-language content of the document. The current version of the software was developed specifically for processing Subset-026, a conceptually demanding requirements document describing the core functionality of the Europe-wide ETCS train protection system. Despite this original intention, the two-part design of the thesis allows a general application of its results: Section 2 outlines the fundamental challenges of weakly structured requirements documents and discusses in detail the derivation of unique yet human-readable requirement identifiers. Section 3 deals in more depth with the domain-specific properties, the text processing options, and the concrete implementation of the new software. Since the software was developed under open-source principles, adapting it to other use cases is possible with relatively little effort.
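As an illustration of the extraction step (not the thesis's actual implementation), paragraphs can be read from a Word document with python-docx and wrapped in a minimal XML structure; the file name and tag names below are placeholders, not the real ReqIF schema.

```python
# Sketch: read .docx paragraphs and emit a minimal XML requirement list.
import xml.etree.ElementTree as ET
from docx import Document

doc = Document("subset026_chapter.docx")   # hypothetical input file

root = ET.Element("requirements")
for index, paragraph in enumerate(doc.paragraphs):
    text = paragraph.text.strip()
    if not text:
        continue   # skip empty layout paragraphs
    req = ET.SubElement(root, "requirement", id=f"REQ-{index:04d}")
    req.text = text

ET.ElementTree(root).write(
    "requirements.xml", encoding="utf-8", xml_declaration=True
)
```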
Jílek, Radim. "Služba pro ověření spolehlivosti a pečlivosti českých advokátů." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2017. http://www.nusl.cz/ntk/nusl-363772.
Vivès, Rémi. "Three essays on the role of expectations in business cycles." Thesis, Aix-Marseille, 2019. http://www.theses.fr/2019AIXM0453.
In this thesis, I investigate the role of expectations in business cycles by studying three different kinds of expectations. First, I focus on a theoretical explanation of business cycles generated by changes in expectations which turn out to be self-fulfilling. This chapter addresses a puzzle from the sunspot literature, thereby giving more evidence towards an interpretation of business cycles based on self-fulfilling prophecies. Second, I empirically analyze the propagation mechanisms of central bank announcements through changes in market participants' beliefs. This chapter shows that credible announcements about future unconventional monetary policies can be used as a coordination device in a sovereign debt crisis framework. Third, I study a broader concept of expectations and investigate the predictive power of political climate on the pricing of sovereign risk. This chapter shows that political climate provides additional predictive power beyond the traditional determinants of sovereign bond spreads. In order to interrogate the role of expectations in business cycles from multiple angles, I use a variety of methodologies in this thesis, including theoretical and empirical analyses, web scraping, machine learning, and textual analysis. In addition, this thesis uses innovative data from the social media platform Twitter. Regardless of my methodology, all my results convey the same message: expectations matter, both for economic research and for economically sound policy-making.
Tadisetty, Srikanth. "Prediction of Psychosis Using Big Web Data in the United States." Kent State University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=kent1532962079970169.
Santos, João Manuel Azevedo. "Real Estate Market Data Scraping and Analysis for Financial Investments." Master's thesis, 2018. https://repositorio-aberto.up.pt/handle/10216/116510.
Hsu, Ning (徐寧). "Intellectual Property Law and Competition Law Regimes on Data Collection in the Era of Big Data: Focusing on Web Scraping." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/qtavez.
National Chengchi University, Institute of Technology Management and Intellectual Property.
In the era of Big Data, an unprecedented scale of digital data is being generated, which leads to an explosion of "publicly available" content on websites. In order to obtain those data from the Web, an automatic and efficient data extraction technology, commonly referred to as "web scraping", has been created. It has become one of the indispensable technologies for gaining access to data sources outside of a firm. Web scraping, however, often involves unauthorized use of scraped data for commercial purposes. Data scrapers thus face potential legal liability for copyright infringement or for contravening unfair competition law. As the lawfulness of web scraping is highly fact sensitive, legal uncertainty might hinder innovative data-driven business models. This paper examines the commercial use of web scraping technologies that retrieve data from public websites. It examines copyright infringement claims in cases such as Kelly v. Arriba, Field v. Google, and AP v. Meltwater. It then reviews the leading cases in the United States, China, and Taiwan involving famous digital companies such as Google, Yelp, and Baidu. Lastly, the paper explores and provides recommendations on how to govern web scraping to better achieve the balance between the free flow of information and the interests of different market participants.
Fabrício, Gustavo de Souza Machado. "Does sacking a coach really help? Evidence from a Difference-in-Differences approach." Master's thesis, 2022. http://hdl.handle.net/10362/136015.
This project evaluates whether football clubs should change their coach in order to improve their performance in the national league. For this analysis I selected three of the most important European football leagues: La Liga (Spain), Serie A (Italy) and the Premier League (England). The data used in this project was taken from the Transfermarkt website, a large football platform. The data cover the seasons 2005-06 to 2019-20 and contain information about individual game results and squad value by player. The steps before the analysis were cleaning and consolidating the data, creating new features such as a performance measure, and selecting the cases of interest based on club and coach profile. Numeric variables were standardized to be on the same scale and make different seasons comparable. K-means was applied to group clubs according to their investments, which correlate proportionally with performance. Finally, a difference-in-differences analysis was applied to evaluate whether a club obtains a performance gain if it sacks its coach between games twelve and twenty-six of the season after underperforming relative to squad price. As a general conclusion, clubs in both the treatment and comparison groups on average recover their performance after a period of underperforming, but the recovery of the clubs that sack their coach is lower than that of the clubs that keep them.
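The difference-in-differences estimate described here is, in its simplest form, the coefficient on a treated-by-post interaction term. A sketch with synthetic data using statsmodels; the variable names are mine, not the project's:

```python
# Sketch: difference-in-differences via an interaction term.
import pandas as pd
import statsmodels.formula.api as smf

# 'treated' marks clubs that sacked their coach, 'post' the games after
# the sacking window, 'points' the per-game performance measure.
df = pd.DataFrame({
    "points":  [1.1, 0.9, 1.0, 1.4, 1.2, 1.0, 0.8, 1.5],
    "treated": [1,   1,   0,   0,   1,   1,   0,   0],
    "post":    [0,   0,   0,   0,   1,   1,   1,   1],
})

# The coefficient on treated:post is the DiD estimate of the effect
# of sacking the coach.
model = smf.ols("points ~ treated * post", data=df).fit()
print(model.summary().tables[1])
```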
Cunha, Paulo Ricardo Gonçalves da. "Strategies for extracting web data: practical case." Master's thesis, 2018. http://hdl.handle.net/1822/59299.
Nowadays, the task of collecting data from Web sources is becoming increasingly complex. This complexity arises partly from the large (and ever-growing) data volume, and partly from the proliferation of platforms that make data available. Based on this premise, this dissertation project had as its main objective the identification of strategies for extracting data from Web sources. To reach this goal, the following tasks were defined: identification of tools and frameworks that aid the data extraction process; tests with the identified tools and frameworks; development of a framework that illustrates possible strategies for data extraction; and finally the application of the proposed framework in a practical case. The proposed framework consists of a methodology with possible strategies for extracting data from web sources. The practical case was carried out on the ALGORITMI Research Centre of the University of Minho. First, data on the authors in the ALGORITMI Research Centre are collected. Other data, such as their publications, are then collected from other sources and stored in a relational database. The collection steps and decisions taken during the case study are based on the application of the proposed framework. Bringing the data obtained from different sources into a single location creates a single entry point for reading data, that is, a single data source. This unique data source allows the user to access all the desired data without spending time trying to locate it. The present work is organized in five chapters: introduction (with a brief description of the problem and objectives of the work), literature review (concepts, methodologies and strategies for obtaining data from Web sources), framework proposal, application of the proposed framework in a practical case focusing on the ALGORITMI Research Centre, and conclusion (with some final considerations and proposals for future work).
Freire, Filipe Manuel Leitão Gonçalves. "Recolha de contratos de despesa pública e segmentação dos perfis de despesa a nível municipal." Master's thesis, 2020. http://hdl.handle.net/10362/97480.
Due to the need to analyze how public capital is invested in Portuguese municipalities across the various types of contracts for the acquisition of goods and services, it is essential to create tools that support the understanding of these investments. It is desirable to understand how these investments vary as a function of population size. In this project, the objective is to collect data available on the web about contracts and to create a segmentation of the various types of public expenditure that makes it possible to detect anomalous deviations in the relationship between municipal public expenditure and population size. For this purpose, a web crawler was developed in the Python programming language to automatically extract public contracts from the site http://www.base.gov.pt/. The collected data were analyzed and a log-log relationship between population and public expenditure was detected. Subsequently, a segmentation analysis based on the residuals of this relationship was performed using data mining techniques. Several clustering algorithms were used, in particular K-Medoids, from which two distinct groups of expenditure types were generated.
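The two analytical steps — a log-log fit of expenditure on population, then clustering of the residuals — can be sketched as below with synthetic figures. Note that K-Medoids is swapped for scikit-learn's KMeans here to stay within one common dependency; the thesis itself used K-Medoids.

```python
# Sketch: log-log regression, then clustering of the residuals.
import numpy as np
from sklearn.cluster import KMeans

population = np.array([1_200, 5_400, 20_000, 75_000, 240_000, 510_000])
expenditure = np.array([0.9e6, 3.1e6, 9.8e6, 4.1e7, 1.2e8, 2.4e8])

# Ordinary least squares on the log-log scale.
x, y = np.log(population), np.log(expenditure)
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Group municipalities by how far they sit from the fitted line.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    residuals.reshape(-1, 1)
)
print("elasticity:", round(slope, 2), "clusters:", labels)
```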
Fiorani, Matteo. "Mixed-input second-hand car price estimation model based on scraped data." Master's thesis, 2022. http://hdl.handle.net/10362/134276.
The number of second-hand cars is growing year by year. More and more people prefer to buy a second-hand car rather than a new one due to the increasing cost of new cars and their fast devaluation in price. Consequently, there has also been an increase in online marketplaces for peer-to-peer (P2P) second-hand car trades. A robust price estimation is needed both for dealers, to have a good idea of how to price their cars, and for buyers, to understand whether a listing is overpriced or not. Price estimation for second-hand cars has, to my knowledge, so far only been explored with numerical and categorical features such as mileage driven, brand or production year. An approach that also uses image data has yet to be developed. This work aims to investigate the use of a multi-input price estimation model for second-hand cars that takes advantage of a convolutional neural network (CNN), to extract features from car images, combined with an artificial neural network (ANN), dealing with the categorical-numerical features, and to assess whether this method improves accuracy in price estimation over more traditional single-input methods. To train and evaluate the model, a dataset of second-hand car images and textual features is scraped from a marketplace and curated such that more than 700 images can be used for training.
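A mixed-input architecture of the kind described can be expressed compactly with the Keras functional API: a small CNN branch for photos concatenated with a dense branch for tabular features. All shapes and layer sizes below are illustrative assumptions, not the thesis's configuration.

```python
# Sketch: mixed-input price model (CNN for images + dense for tabular).
import tensorflow as tf
from tensorflow.keras import Model, layers

image_in = tf.keras.Input(shape=(128, 128, 3), name="photo")
x = layers.Conv2D(32, 3, activation="relu")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

tabular_in = tf.keras.Input(shape=(8,), name="features")  # mileage, age, ...
t = layers.Dense(32, activation="relu")(tabular_in)

merged = layers.concatenate([x, t])
merged = layers.Dense(64, activation="relu")(merged)
price_out = layers.Dense(1, name="price")(merged)

model = Model(inputs=[image_in, tabular_in], outputs=price_out)
model.compile(optimizer="adam", loss="mae")
model.summary()
```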
Dailey, Ryan Merrill. "Automated Discovery of Real-Time Network Camera Data from Heterogeneous Web Pages." Thesis, 2021.
Fejfar, Petr. "Interaktivní procházení webu a extrakce dat." Master's thesis, 2018. http://www.nusl.cz/ntk/nusl-389671.
Botelho, Miguel Tavares. "Unfolding the influencing factors and dynamics of overall hotel scores." Master's thesis, 2019. http://hdl.handle.net/10071/19456.
The hospitality and tourism industry has been boosted by hotel review websites, which leads to increasingly demanding tourists. We extracted more than thirty thousand reviews from TripAdvisor to understand the variations in customer perceptions of high-end/low-end and chain/independent hotels, and in which aspects this variation is most evident. We used sentiment analysis to assign a score to the aspects of each review. We compared machine learning algorithms, namely random forest, decision tree, and decision tree with AdaBoost, to predict the overall score. We then used the Gini index to understand which aspects most influence the overall score. Finally, we compared reviews across time windows using the Jaccard index to characterize the dynamics of customer satisfaction, focusing on three aspects: Service, Location, and Sleep. By correlating hotel responses with reviews, we aimed to demonstrate their impact on customers' perception of hotel quality. The best performance was achieved by the decision tree, which indicated that Service is the most influential aspect for satisfaction, while Location and Sleep were considered the least important aspects. By identifying moments of drastic change, we found that Service is also the aspect most related to the overall score. These analyses allow hotel management to follow trends in tourist evaluations in each category. In general, a focus on service should be adopted. However, for an individual hotel, an analysis of the dynamics of the overall score compared against its category would be advantageous.
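The Gini-importance step this abstract mentions can be sketched as follows: fit a decision tree classifier on per-aspect sentiment scores and read off its feature importances. The scores, and the dominance of 'Service', are fabricated to mirror the reported finding, not taken from the thesis data.

```python
# Sketch: rank review aspects by Gini importance with a decision tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

aspects = ["Service", "Location", "Sleep", "Value", "Cleanliness"]
rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=(400, len(aspects)))   # per-aspect sentiment
# Fabricated overall score (1-5), driven mostly by 'Service'.
y = np.clip(np.rint(0.8 * X[:, 0] + 0.05 * X[:, 1:].sum(axis=1)), 1, 5)

tree = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=0)
tree.fit(X, y.astype(int))
for name, importance in zip(aspects, tree.feature_importances_):
    print(f"{name:12s} {importance:.2f}")
```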