Making Internet Things, Part 1: Working with Data

How to Make Dope Shit

This is the first installment of a multi-part series designed to help you familiarize yourself with the tools used to make visual, data-driven essays.

Introduction

Every so often, readers email us asking how they can gain the skills necessary to create the sorts of data-driven, visual essays that we publish on The Pudding. The deficit in clear guides to becoming visual journalists is understandable: this form of storytelling has only recently begun to gather momentum, and unlike traditional essay-writing, reporting, or editorial work, has neither a well-defined form, nor a formalized set of tools. Moreover, resources for those willing to learn are fragmented, tending to focus exclusively on data analysis, or solely dealing with data visualization, making it difficult to find a comprehensive and unified guide.

There is, however, a broader issue at play: we have a curious tendency to assume that people who can do certain things that we cannot are imbued with superior innate talents (if you’d like to hear more about this in the context of programming, I’d recommend this talk on experts and beginners by Jacob Kaplan-Moss, or this heartening talk by Julia Evans). This may be especially common for the sort of code-driven interactive data visualizations which we work on, since they rely on an odd grab-bag of skills (critical thought, design, writing, and programming) that people in many other professions may have neither a full awareness of, nor full expertise in.

In this series of blog posts, I’ll attempt to present a clear, non-technical introduction to the tools used in visual, data-driven storytelling, and provide you with a map to the field’s general landscape. Reading this should give you a sense of how we work, and where you can turn to hone your skills to start working on visualizations like ours. More broadly, however, I’m hoping that my putting this guide together will help remove some of the unnecessary mystique surrounding data viz, and demonstrate that the only things that separate a beginner from a speaker on the conference circuit are time and practice.

The Pudding workflow

Members of the Polygraph team have vastly different backgrounds, which include business, computer science, psychology, marine biology, and journalism. Consequently, we’ve all picked up our data skills in myriad ways. While each of us has some manner of specialty or preference, we are, broadly, generalists, each of whom uses a general purpose programming language, some combination of data analysis tools, and JavaScript for web programming and data visualization.

In addition to a common purpose, these tools share something else: they’re all free. While there are many proprietary programs and software packages for data and visualization-related work, our team relies exclusively on open-source tools that anyone can use. In addition to focusing on these tools because they are currently the industry standard, I’m also hoping to show that the largest barriers to entry into the data-visualization/storytelling world are time and persistence, rather than tool cost.

Broadly, the three components of our work are:

Data
Visualization
Writing

This first blog post is dedicated to data.

Getting comfortable with programming

Newcomers to programming often wonder what the best language to learn is, and the answer largely depends on what you’d like to do.

I’m heavily in favor of learning Python for data work, since Python syntax is relatively easy for beginners to understand, and the language has a plethora of fantastic pre-written content (generally referred to as libraries) that you can seamlessly incorporate into your code to get around nearly any data issue.
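To give a rough sense of what "incorporating a library" looks like in practice, here is a minimal sketch that uses only modules bundled with Python (csv, io, and statistics). The sample data, the column names, and the summarize_plays function are all invented for illustration; they are not taken from any Pudding project.

```python
import csv
import io
import statistics

# Made-up sample data, standing in for a CSV file you might have on disk.
SAMPLE_CSV = """title,year,plays
Song A,1999,120
Song B,2005,340
Song C,2012,200
"""


def summarize_plays(csv_text):
    """Return (mean, median) of the 'plays' column in a CSV string."""
    # csv.DictReader turns each row into a dictionary keyed by column name.
    reader = csv.DictReader(io.StringIO(csv_text))
    # CSV values arrive as text, so convert them to integers first.
    plays = [int(row["plays"]) for row in reader]
    # The statistics library does the arithmetic for us.
    return statistics.mean(plays), statistics.median(plays)


mean_plays, median_plays = summarize_plays(SAMPLE_CSV)
print("mean plays:", mean_plays)
print("median plays:", median_plays)
```

The three import lines at the top are the whole "incorporation" step: parsing the CSV and computing the averages are handled by code someone else has already written and tested.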

Beginning with Python has another advantage: the basics you’ll learn here will prove invaluable when it comes to compiling data sets and coding actual visualizations. Having said this, however, it’s worth mentioning that others on the Pudding team lean towards other tools: Russell prefers using JavaScript/Node for data processing, while Amber’s tool of choice is R, a language whose roots lie at Bell Labs, where its predecessor was first created with data analysis in mind; Matt, meanwhile, is a strong proponent of using MySQL for high-level number crunching.