September 17, 2015
Data scientists often come from diverse backgrounds and frequently don't have much, if any, in the way of formal training in computer science or software development. That being said, most data scientists at some point will find themselves in discussions with software engineers because of some code that already is or will be touching production code.
This conversation will probably go something like this:
SE: "You didn't check your code and your tests into master without a code review, did you?"
DS: "Checked my what into what without a what?"
As it turns out, there are a number of skills that software developers often take for granted that new data scientists don't possess -- and may not even have heard of. I did a quick poll on Twitter about what these skills might be. I'll walk through the most common responses below, but I'd say the unifying theme for all of them is that many new data scientists don't know how to effectively collaborate. Perhaps they used to be academics and only worked alone or with a single other collaborator. Perhaps they're self-taught. Regardless of the reasons, you are now writing code in an environment where many other people (and "other people" includes yourself at some later date) will be looking at, trying to understand, and using your code or things that your code produces.
You may be used to your code living on your hard drive or perhaps in a shared Dropbox folder. Now your code will be routinely checked into a repo (more on that below) where anyone can take a look at it. This is a little unnerving at first, and it initially causes you to want to only ever check in perfect code. That's generally a bad idea.
Each of the following topics speaks to that idea in some way. I'm not saying that data scientists need to be experts in all of these fields right away, but some level of proficiency in each of them will be necessary sooner rather than later. You won't find most of these topics in "Introduction to Data Science in Python" or "Machine Learning in R" books -- these are the taken-for-granted skills.
Writing modular, reusable code
Many data scientists are self-taught programmers or learned to program as part of a research project. Programming was a tool that one acquired to achieve a certain goal, like estimating a regression, modeling the movement of stars, or simulating atmospheric conditions. Rather than "programming" being a skill that has its own norms, best practices, and so forth, writing code was about learning the right commands to type in the right order to produce the output that could then be lovingly arranged in LaTeX.
Oftentimes, research projects looked different enough from one another or were sufficiently simple (code-wise) that you started from scratch each time or just copied and pasted the bits and pieces you needed from old projects. Your code was often very imperative in style, and could be read start-to-finish to get an idea of what needed to be done. "First load the data, then do this, then do that, then print the results. The end."
Those days are over.
You should learn a principle called DRY, which stands for Don't Repeat Yourself. The basic idea is that many tasks can be abstracted into a function or piece of code that can be reused regardless of the specific task. This is more efficient from a "lines of code" perspective, but also in terms of your time. It can be taken to an illogical extreme, where code becomes very difficult to follow, but there is a happy medium to strive for. A good rule of thumb: if you find yourself writing the same line of code with only minor changes each time, think about how you can turn that code into a function that takes the changes as parameters. Avoid hard-coding values into your code. It is also good practice to revisit code you've written in the past to see if the code can be made cleaner, more efficient, or more modular and reusable. This is called refactoring.
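As a minimal sketch of that rule of thumb, suppose you had been computing the same summary for several columns by copying a line and changing only the column name. The function name summarize_column and the sample data below are hypothetical, made up purely for illustration.

```python
from statistics import mean, median

# Hypothetical sample data; in practice this would come from your own dataset.
data = {
    "price": [10.0, 12.5, 11.0, 14.5],
    "quantity": [3, 5, 2, 6],
}

# Repetitive version: the same logic copied with minor changes each time.
# price_mean = mean(data["price"])
# price_median = median(data["price"])
# quantity_mean = mean(data["quantity"])
# quantity_median = median(data["quantity"])


def summarize_column(table, column):
    """Return the mean and median of one column as a dict."""
    values = table[column]
    return {"mean": mean(values), "median": median(values)}


# DRY version: the part that changes (the column name) becomes a parameter.
summaries = {column: summarize_column(data, column) for column in data}
```

If a third column shows up later, nothing needs to be copied: the loop picks it up, and any fix to the summary logic happens in exactly one place.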
Chances are good that you'll be asked to submit your code for a code review at some point. This is normal and doesn't mean people are skeptical of your work. Many organizations require that code be reviewed by multiple people before it is merged into production code. Writing clean, reusable code will make this process easier for everyone and will lower the probability that you will be rewriting significant portions of your code following a code review.
Further reading: Chris DuBois on becoming a full-stack statistician, The Pragmatic Programmer, Clean Code
Documentation / commenting
Because other people are going to be reading your code, you need to learn how to write meaningful, informative comments as well as documentation for the code that you write. It is a very good practice (although one you probably won't follow) to write comments and documentation before you actually write the code. This is the coding equivalent of writing an outline before you write a paper.
[Aside: Some seasoned programmers will argue that you shouldn't write the comments until the code is complete, because this will force you to write clear, self-explanatory code and the only comments you will have to write are for the situations that are not crystal clear. As a beginning software developer, you should probably ignore this advice.]
Comments are non-executed text in your code that explains what you are doing and why you are doing it. Good comments make the purpose of code clearer; they don't just restate what's obvious in the code. If you're writing clean, well-styled code, the names of your functions, variables, objects, and so on should be fairly self-explanatory.
You've probably heard many times that you should comment your code. So, you wrote things like this:
    # import packages
    import pandas as pd
    # load some data
    df = pd.read_csv('data.csv', skiprows=2)
These are bad comments. They don't add any information. Why is that skiprows parameter set to 2? Are there comments at the beginning of data.csv? Something like this might be preferable:
    # Data contains two lines of description text, skip to avoid errors.
    df = pd.read_csv('data.csv', skiprows=2)
It's very important that you update your comments as you update your code. Using the example from above, let's say the data source for your CSVs has changed, and there are no longer any description lines. You modify the read_csv call, but don't remove the comment, which produces:
    # Data contains two lines of description text, skip to avoid errors.
    df = pd.read_csv('data.csv')
Now whoever is reading your code has no idea if the comment is right or if the code is right, which means they have to execute the code to find out. Then they're effectively debugging your code for you, and no one appreciates that.
If you write a function, write a docstring (or whatever your language of choice calls the attribute of a function that describes what it does) that clearly states what the function does, what parameters it takes, and what it returns.
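For example, here is a sketch of a Python function with a docstring covering those three things. The function scale_to_unit_range is hypothetical, and the section layout (Parameters / Returns / Raises) is the NumPy-style convention, which is only one of several common docstring styles; your team may use a different one.

```python
def scale_to_unit_range(values):
    """Rescale a sequence of numbers to the range [0, 1].

    Parameters
    ----------
    values : list of float
        The numbers to rescale. Must contain at least two distinct values.

    Returns
    -------
    list of float
        The rescaled numbers, in the same order as the input.

    Raises
    ------
    ValueError
        If the list is empty or all values are identical, because the
        range would be zero.
    """
    if not values:
        raise ValueError("values must not be empty")
    lowest, highest = min(values), max(values)
    if lowest == highest:
        raise ValueError("values must contain at least two distinct numbers")
    return [(v - lowest) / (highest - lowest) for v in values]
```

Calling help(scale_to_unit_range) in an interactive session prints the docstring, so a reader can learn how to use the function without opening the source file.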
Unlike comments, documentation is a document written in English (or whatever language you speak), rather than in a programming language, that explains the purpose of the code you are writing, how it operates, example use cases, who to contact for support, and so on. This can range from a simple README that sits in the directory where your code lives to a full-fledged manual that will be printed and given to users.
Version control
In my informal Twitter poll, version control (also known as source or revision control) was the most oft-cited skill that new data scientists need to learn. It is also probably one of the most confusing. In your former life, "version control" probably meant you had a folder somewhere on your hard drive that contained project3_final_do_not_delete_final_revised.py and so on.
Version control provides a centralized way for one to many people to work on a common codebase at the same time without writing over each other's work. Each person "checks out" a copy of the code and makes changes to it on a local "branch" which they can then "commit" and "merge" back into the common codebase. There's a lot of specialized vocabulary, but it starts to make (some) sense after a while. Version control also allows you to easily "revert" changes that you made that broke something.
Many people use git as their version control system, although you may encounter others (such as svn). The terminology and exact workflows will differ slightly, but the basic premise is usually the same. All of the code is stored in one or more repositories (repos), and within each repo you may have several branches -- different versions of the code. In each repo, the branch that everyone treats as the starting/reference point is called the master branch. GitHub is a service that hosts repos (both public and private) and provides a common set of tools for interacting with them.
There are only three certainties in your life as a data scientist: death, taxes, and an inevitable git clusterfuck. You will find yourself typing git reset --hard and hitting enter while sighing at least once. That's OK.
If you're not familiar with version control, start now. Install git (it works on pretty much every operating system) and start using it to manage your own code. Commit frequently, write meaningful commit messages (which are just comments), and get to know the system. Create a GitHub account and check your code into a remote repo.
Further reading: Code School.
Testing
There's a good possibility that if you have no formal computer science training, you don't even know what I mean when I say "testing." I'm talking about writing code that checks your code for bugs across a variety of situations. The most basic type of test that you can write is called a unit test.
In the past, you probably ran most of your code interactively, either by typing it in line-by-line or by writing a script and sending portions of that script to an interpreter of some kind. You're moving to a position where you may not even be awake when your code runs. Maybe you've built a recommender system and you want to generate the recommendations in batch every night for customers that might visit the next day. You write a script that will be run at 2am and will dump the recommendations into a database.
What happens if the product list that you use for recommendations has an error and returns too few columns? What about if a column that used to be an integer suddenly becomes a floating point value? Do you want to be the one on the hook when there are no recommendations in the database the next day?
You write tests that describe the expected behavior of your code and that fail when that behavior is not produced. I'm working on another post about testing for data scientists, so I won't go into too much detail here, but it's a very important topic. Your code is going to be interacting with other code and messy data. You need to know what will happen when it's run on data that isn't as pristine as the data you are working with now.
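To make that concrete, here is a minimal sketch using Python's built-in unittest module. It covers the two failure modes mentioned above: a product row with too few columns, and an ID that arrives as a float instead of an integer. The schema in EXPECTED_COLUMNS and the function validate_product_row are hypothetical, invented for this illustration rather than taken from any real system.

```python
import unittest

# Hypothetical schema for one row of the product list.
EXPECTED_COLUMNS = ["product_id", "name", "price"]


def validate_product_row(row):
    """Return the row unchanged, or raise ValueError if it breaks the schema."""
    if len(row) != len(EXPECTED_COLUMNS):
        raise ValueError(
            f"expected {len(EXPECTED_COLUMNS)} columns, got {len(row)}"
        )
    product_id = row[0]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(product_id, int) or isinstance(product_id, bool):
        raise ValueError(f"product_id must be an integer, got {product_id!r}")
    return row


class TestValidateProductRow(unittest.TestCase):
    def test_accepts_well_formed_row(self):
        row = [101, "widget", 9.99]
        self.assertEqual(validate_product_row(row), row)

    def test_rejects_row_with_too_few_columns(self):
        with self.assertRaises(ValueError):
            validate_product_row([101, "widget"])

    def test_rejects_float_product_id(self):
        with self.assertRaises(ValueError):
            validate_product_row([101.0, "widget", 9.99])
```

Saved in a file such as test_products.py (the file name is also just an example), these tests can be run with python -m unittest test_products. Each test states one expectation about the code's behavior and fails loudly when that expectation is not met.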
Logging
In the scenario above, your code is running at 2am, and you're not around to see what happens when (and it's definitely when, not if) it breaks. For this you need logging. Logging is just a record of what happened as your code was executed. It includes information about what parts of your program executed successfully, what parts didn't, and any other diagnostic information you'd like to include. Like comments, documentation, and testing, this is extra code you'll have to write in addition to the actual executable code that you care about, but it's totally worth it.
When you get to work in the morning and find that your code barfed, you'll want to know what happened without re-running all of the code -- and that's not even guaranteed to reproduce the error, since it may have been due to another piece of data that has since been corrected. Logging lets you immediately identify the source of the problem (if your logging code is well-written, that is) and quickly figure out what to do about it.
For instance, if your logs tell you that the code didn't run because the file containing the products wasn't found, you immediately know to try and figure out if the file was deleted, or if the network was down, and so on. If the code partially runs and fails on a specific product or customer, you can go inspect that particular piece of data, and fix your code so it won't happen again.
Disk space is cheap, so log generously. It's a lot easier (and faster) to grep through a big directory of logs than to try to reproduce an unusual error on a large codebase or dataset. Make logging work for you -- log more things than you think you'll need. Be deliberate about it, too: log when functions are called and when each step in your program is executed.
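Here is a rough sketch of what that can look like with Python's standard logging module, using the missing-product-file scenario from above. The logger name nightly_recommendations, the function load_products, and the one-product-per-line file format are all assumptions made for the example.

```python
import logging

# Configure logging once, near the start of the script. By default the
# messages go to stderr; passing filename="..." writes them to a file instead.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("nightly_recommendations")  # hypothetical job name


def load_products(path):
    """Read product names from a text file, one per line, logging each step."""
    logger.info("Loading products from %s", path)
    try:
        with open(path) as handle:
            products = [line.strip() for line in handle if line.strip()]
    except FileNotFoundError:
        # logger.exception records the message plus the full traceback.
        logger.exception("Product file not found: %s", path)
        return []
    logger.info("Loaded %d products", len(products))
    return products
```

Whether a missing file should return an empty list, as here, or stop the whole job is a decision that depends on your pipeline. Either way, the log line tells you in the morning exactly which file was missing and when.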
Further reading: Logging HOWTO (Python)
There are lots of things I didn't cover here:
- how to conduct code reviews
- refactoring code
- navigating a *nix terminal, adding your ssh keys, setting up a dev environment
- working with distributed resources like AWS
- IDE choices
- programming paradigms (functional, object-oriented, etc.)
Posts like this one often balloon into a laundry list of skills and languages and make it seem impossible that any one person could ever master all of them. I've tried to avoid that and focus on things that will help you write better code, interact better with software developers, and ultimately save you time and headaches. You don't need to have them all mastered your first day on the job, and some of them are more important at some companies than at others, but you will encounter all of them at some point.