Digital Scholarship and Digital Humanities: Web Scraping

Resources and information for students, researchers and faculty who are incorporating technology into their research, scholarship, and teaching.

Web Scraping

The web is full of data, but not much of it is usable as-is. To conduct research, we often need to collect information directly from web pages: text, images, conversation threads, data from tables, and so on. While countless sources of downloadable data are available, some data must still be collected by "scraping" it off web pages. The tools on this page are intended to help you do that.
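To make the idea of "scraping" concrete, here is a minimal sketch in Python that pulls the rows out of an HTML table and converts them to CSV. It uses only the standard library and parses a small made-up HTML snippet rather than a live page; the names `TableParser`, `table_to_csv`, and `SAMPLE_HTML` are hypothetical and chosen only for illustration. Real pages are usually messier and may need a more forgiving parser.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Made-up sample markup standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Moby-Dick</td><td>1851</td></tr>
  <tr><td>Middlemarch</td><td>1871</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_csv(SAMPLE_HTML))
```

The tools listed below automate this kind of extraction through templates or visual interfaces, so writing code like this is optional for most of them.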

Related activities: Social Media Analysis, Text Analysis, Web Crawling & Archiving

The content in this section of the guide is still being developed, but will be available soon!

Theory & Methods

General Web Scraping Tools

Web Scraping Tools

Data Miner

free plan limited to 500 pages/month | data collection only | web-based, using browser extension

Data Miner is a commercial service for web scraping that uses a browser extension as its primary interface. Many popular websites have existing templates you can use without doing any work, and building a customized scraper to pull data from tables is also very straightforward. See this video overview and "how it works" page.

Octoparse

free plan limited to 10k records per export | data collection only | Windows only

Octoparse uses desktop software in conjunction with a large set of pre-configured templates to enable web scraping from websites, social media platforms, and more. You can also build a custom web scraper using visually oriented tools. Data can be exported to CSV and Excel formats.

Morph.io

free | data collection only | programming required if not using existing data sets

Morph.io interfaces with GitHub to facilitate the creation and sharing of scripts for data scraping in Python, Node.js, PHP, Ruby, and Perl. The system sets up a basic GitHub project for you, with a template in your chosen language. You then customize it (programming expertise required) to perform a specific data scraping task. Once a scraper has been built for a specific data source, it is added to a searchable directory. The site currently lists more than 10,000 publicly available scrapers and data sets.
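As a rough illustration of the kind of customization involved, the sketch below shows the storage half of a Python scraper: saving already-extracted records to a SQLite database. It assumes the convention, which you should verify against Morph.io's own documentation, that a scraper writes its results to a file named `data.sqlite` in a table named `data`. The function name `save_records`, the `title` and `year` columns, and the sample records are hypothetical.

```python
import sqlite3


def save_records(records, db_path="data.sqlite"):
    """Store scraped records in SQLite and return the total row count.

    The file name "data.sqlite" and table name "data" are assumptions
    about the hosting platform's convention; the columns are made up.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS data "
                "(title TEXT PRIMARY KEY, year INTEGER)"
            )
            # INSERT OR REPLACE keeps repeated runs from duplicating rows.
            conn.executemany(
                "INSERT OR REPLACE INTO data (title, year) "
                "VALUES (:title, :year)",
                records,
            )
        return conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
    finally:
        conn.close()


# Made-up records standing in for the output of a scraping step.
SAMPLE_RECORDS = [
    {"title": "Moby-Dick", "year": 1851},
    {"title": "Middlemarch", "year": 1871},
]

if __name__ == "__main__":
    # ":memory:" keeps this demonstration from writing a file to disk.
    print(save_records(SAMPLE_RECORDS, ":memory:"))
```

Using the title as the primary key is one simple way to make the scraper safe to re-run on a schedule; a real data source may need a different unique identifier.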

Project Examples

Related Techniques