Date(s) - Apr 09, 2021
12:00 pm - 2:00 pm
Title: “Scraping web data for social science: an introduction to Helena for web automation and review of research applications”
Speaker: Chris Hess
Workshop description: “Many data, such as social media posts, classified ads and search results, reside on the internet, but only in a semi-structured form with no clear mechanism for collection like a file form or application programming interface (API). While these data could be useful for quantitative and/or qualitative analyses and may even contain geolocation information that could facilitate merges with other data, conventional approaches to automating web page navigation require considerable programming knowledge in languages like Python, which might deter users from pursuing this research. This workshop will illustrate how to use Helena, a novel programming-by-demonstration web scraping tool, for collecting web data in both a one-off and ongoing capacity. After reviewing how to generate structured data using this tool, Chris Hess will discuss how he and his colleagues have used web data for basic and applied research and describe some of the common challenges to using scraped data for social science research.”
Bio: Chris Hess is a postdoctoral associate at Cornell University in the Department of Policy Analysis and Management and a recent PhD graduate of the University of Washington Department of Sociology. His research investigates the housing search process and the changing spatial structure of neighborhood inequalities in the United States through a combination of conventional (longitudinal surveys, census estimates) and novel data sources (scraped ads, administrative records). Over the past three years, he and his colleagues have scaled their web scraping project based on Helena from a single source (Craigslist) in one location (Seattle) to include many major platforms for all core-based statistical areas in the United States.
Zoom Information to come.