SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

Preface

This Knowledge Base text is under construction.

There are numerous definitions of Data Wrangling. While the definitions vary to some extent, there is a fair amount of similarity among those we have seen. Most of the differences appear to lie in the focus or approach favored by the author. We use the following Wikipedia text (retrieved 11/4/2018) as our definition: "Data Wrangling is the process of transforming and mapping data from one raw data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics." (See the Wikipedia article "Data wrangling.")

Data Wrangling is not a new set of skills. Many of the skills used in wrangling data existed long before the name did. If you have worked with languages such as Stata, R, Python, or SAS, you have likely done work that would be considered data wrangling when preparing your data for analysis. What distinguishes data wrangling from typical data preparation is its focus and the tools used. Wrangling focuses more on gathering data from multiple, varied sources that need to be integrated. These sources include non-static ones such as databases and the web, where data may come from APIs or from page content. There is also a focus on making the work applicable to more than one analysis. That is, the same code can be run as a production process, preparing new data as it becomes available without needing any changes.

The R and Python communities have developed tools designed to wrangle tabular data: the tidyverse collection of packages in R and the pandas package in Python. The intuitive design of these tools makes them easy to learn and makes the resulting code easy to read and understand. They allow researchers to complete data preparation quickly and accurately for a wide variety of analyses. The application of these packages and their approaches to wrangling is the subject of this book.

The title Data Wrangling Essentials was chosen to emphasize both the use of these new tools and the importance of the work of gathering and preparing data.