LibGuides: Digital Scholarship and Digital Humanities: Web Scraping

The web is full of data, but not much of it is usable as-is. In order to conduct research, we often need to collect information from web pages ourselves: text, imagery, conversation threads, data from tables, etc. While countless sources of downloadable data are available, some data must still be collected by "scraping" it off web pages. The tools on this page are intended to help you do that.

The content in this section of the guide is still being developed, but will be available soon!

Theory & Methods

General Web Scraping Tools

Data Miner

free plan limited to 500 pages/month | data collection only | web-based, using browser extension

Description

Data Miner is a commercial web scraping service that uses a browser extension as its primary interface. Many popular websites have existing templates you can use without doing any work, and building a customized scraper to pull data from tables is straightforward. See this video overview and "how it works" page.

Tutorials

Data Miner overview

Octoparse

free plan limited to 10k records per export | data collection only | Windows only

Description

Octoparse pairs desktop software with a large set of preconfigured templates to enable web scraping from websites, social media platforms, and more. You can also build a custom web scraper using visual tools. Data can be exported to CSV and Excel formats.

Tutorials

Extracting and visualizing Korean coronavirus transmission data

Morph.io

free | data collection only | programming required if not using existing data sets

Description

Morph.io interfaces with GitHub to facilitate the creation and sharing of scripts for data scraping in Python, Node.js, PHP, Ruby, and Perl. The system sets up a basic GitHub project for you, with a template in your chosen language. You then customize it (programming expertise required) to perform a specific data scraping task. Once a scraper has been built for a specific data source, it is added to a searchable directory. The site currently lists more than 10,000 publicly available scrapers and data sets.

Tutorials

Morph documentation
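
To give a sense of the customization step described above, here is a minimal sketch of a table scraper in Python, using only the standard library. It is not Morph.io's own template. The sample HTML, the class and function names, and the table layout are all hypothetical, and the sketch parses a built-in string rather than downloading a page. Morph.io scrapers conventionally save their results to a SQLite file; the sketch writes to an in-memory database instead, and the table name `data` is an assumption you should check against the Morph documentation.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download the HTML instead.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Moby-Dick</td><td>1851</td></tr>
  <tr><td>Middlemarch</td><td>1871</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return one dict per body row, keyed by the header row's cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def save_records(conn, records):
    """Store the scraped rows in a table named 'data' (an assumed name)."""
    conn.execute("CREATE TABLE IF NOT EXISTS data (title TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO data VALUES (?, ?)",
        [(r["Title"], int(r["Year"])) for r in records],
    )
    conn.commit()


records = scrape_table(SAMPLE_HTML)
conn = sqlite3.connect(":memory:")  # a hosted scraper would use a file instead
save_records(conn, records)
print(conn.execute("SELECT title, year FROM data").fetchall())
```

In practice, most of the work in a custom scraper goes into the parsing step, because every site structures its pages differently. Before scraping a real site, check its terms of use.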
