As we launched another multifaceted geographic data linkage study, our multi-institution team, which includes researchers at Drexel University, Columbia University, and the University of Washington, developed a set of commandments to streamline and harmonize our data management, variable naming, and data coding processes.
- Thou shalt not transmit HIPAA/IRB protected data, nor data protected by licensing agreement without PI approval.
Clearly, we want both to be responsible custodians of the data entrusted to us and to avoid getting into trouble. For additional discussion of cautions around the common practice of using online tools to characterize addresses, see our recent commentary.
- Thou shalt always use YYYYMMDD when formatting date variable values, stored as a string.
Date storage was much discussed by our group, but ultimately we wanted a solution that would sort chronologically, be readable to humans, and work seamlessly across software packages that use different reference (epoch) dates for their internal date values.
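As a minimal sketch of why YYYYMMDD strings meet these goals, the Python snippet below formats a few dates and shows that plain string sorting gives chronological order. The function names and the visit dates are hypothetical, chosen only for illustration.

```python
from datetime import date, datetime


def to_yyyymmdd(d: date) -> str:
    """Format a date as an 8-character YYYYMMDD string."""
    return d.strftime("%Y%m%d")


def from_yyyymmdd(s: str) -> date:
    """Parse a YYYYMMDD string back into a date (raises ValueError if invalid)."""
    return datetime.strptime(s, "%Y%m%d").date()


# Hypothetical visit dates, deliberately out of order.
visits = [date(2003, 11, 5), date(1998, 2, 14), date(2003, 1, 30)]
as_strings = [to_yyyymmdd(d) for d in visits]

# Plain string sorting matches chronological order, with no
# software-specific internal date number involved.
sorted_strings = sorted(as_strings)
print(sorted_strings)
```

Because the value is stored as a string, it passes between packages unchanged, and each package can convert it to its own date type when date arithmetic is needed.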
- Thou shalt always use YYYY when using a year in a variable name.
Given that our studies of adult health frequently span both the 1990s and 2000s, using four digits (rather than two) for the year wherever possible allows for easier conversion from wide to long format and for sorting in chronological order.
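The sorting problem can be sketched with a small Python example. The variable names are hypothetical, and the pattern assumes the year is the final underscore-separated part of the name and falls in the 1900s or 2000s.

```python
import re

# Hypothetical variable names for the same measure in three years.
two_digit = ["income_02", "income_98", "income_10"]
four_digit = ["income_2002", "income_1998", "income_2010"]

# With two-digit years, alphabetical order puts 2002 and 2010 before 1998.
sorted_two_digit = sorted(two_digit)

# With four-digit years, alphabetical order is chronological order.
sorted_four_digit = sorted(four_digit)

# Assumed convention: <stem>_<YYYY>, with YYYY in 1900-2099.
YEAR_SUFFIX = re.compile(r"^(?P<stem>[a-z0-9_]+)_(?P<year>(?:19|20)\d{2})$")


def split_stem_and_year(name: str) -> tuple:
    """Split a name such as 'income_1998' into ('income', 1998)."""
    match = YEAR_SUFFIX.match(name)
    if match is None:
        raise ValueError(f"no four-digit year suffix in {name!r}")
    return match.group("stem"), int(match.group("year"))


print(sorted_two_digit)
print(sorted_four_digit)
print(split_stem_and_year("income_1998"))
```

A four-digit suffix can be parsed without guessing the century, which is what makes the wide-to-long conversion straightforward.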
- Thou shalt prefer use of tall rather than wide data formats to avoid storing empty data and simplify query expressions.
As we move to using longitudinal data on where people live and how their environment has changed over time, the structure of the data becomes more complex. A tall (long) format avoids storing fields for which many observations have no data. However, the overarching goal is efficiency and usability, which may at times favor a wide format instead.
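To illustrate the trade-off, here is a minimal Python sketch that reshapes a wide table into a tall one and drops the empty cells. The field names (`person_id`, `tract_YYYY`) and the values are hypothetical, and `None` stands in for missing data.

```python
# Hypothetical wide table: one row per person, one column per year.
wide_rows = [
    {"person_id": 1, "tract_1998": "a", "tract_2002": "b", "tract_2010": None},
    {"person_id": 2, "tract_1998": None, "tract_2002": "c", "tract_2010": None},
]


def wide_to_long(rows, id_field, stem):
    """Return one row per (id, year) pair, skipping missing values."""
    prefix = stem + "_"
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name.startswith(prefix) and value is not None:
                year = int(name[len(prefix):])
                long_rows.append({id_field: row[id_field], "year": year, stem: value})
    return long_rows


long_rows = wide_to_long(wide_rows, "person_id", "tract")

# A query by year is a simple filter rather than a choice of column.
in_2002 = [r for r in long_rows if r["year"] == 2002]
print(long_rows)
print(in_2002)
```

In this toy case the wide table holds six year cells, three of them empty, while the tall table holds only the three observed rows.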
- Thou shalt always use lowercase for variable names to avoid case sensitivity issues when jumping between software.
Inconsistent capitalization in variable names is a source of frustration for users of case-sensitive software such as STATA. A typical scenario is that you have working syntax, receive an updated dataset with differences in capitalization (which a user of a less case-sensitive package such as SAS may not notice), and have to spend time troubleshooting and editing to get it to work again. While conventions vary, we decided the simplest thing would be to use only lowercase in our variable names.
- Thou shalt only use letters, numbers, and underscores “_” when naming variables and datasets.
There are special characters such as “*” (often a wildcard) which are best reserved for their specific use rather than used in variable names. Likewise, a space used for separation within a variable name is more problematic than an underscore. Rather than list out forbidden special characters, we simply ask that variable names use only (lowercase) letters, numbers, and “_”.
- Thou shalt not use spaces in variable names, nor dataset names, but rather use an underscore.
In case any ambiguity was left by the above commandment, we wanted to re-emphasize that spaces are not to be used in variable names.
- Thou shalt always use longer, more descriptive variable names, rather than shorter, harder to interpret names.
Many of us remember when variable names were quite restricted, perhaps to 8 characters. However, in discussion across our group we found that none of the software we use imposes such restrictions at present. Thus, we prefer when possible to use longer names which are easier for human users to recognize. The thinking is that we will avoid mistakes by making the content and origin of each variable clear.
- Thou shalt not create a variable name of more than 32 characters in length.
While allowable variable names have gotten longer, limits remain. To avoid problems from particularly long and unwieldy names, we propose a strict maximum of 32 characters.
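The naming commandments above (lowercase only; letters, numbers, and underscores; no spaces; at most 32 characters) are easy to check automatically. The following Python function is one possible sketch of such a check. The function name and the wording of the messages are our own illustration, not part of any package.

```python
import re

MAX_NAME_LENGTH = 32
# Anything other than letters, digits, underscore, or space is reported
# as a special character; spaces and uppercase are reported separately.
OTHER_CHARACTERS = re.compile(r"[^A-Za-z0-9_ ]")


def name_problems(name: str) -> list:
    """Return a list of naming-rule violations (empty if the name is fine)."""
    problems = []
    if not name:
        problems.append("name is empty")
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"longer than {MAX_NAME_LENGTH} characters")
    if any(ch.isupper() for ch in name):
        problems.append("contains uppercase letters")
    if " " in name:
        problems.append("contains spaces")
    if OTHER_CHARACTERS.search(name):
        problems.append("contains characters other than letters, numbers, and _")
    return problems


# Hypothetical variable names.
for candidate in ["pop_density_km2_2010", "Pop Density", "rate*100"]:
    print(candidate, name_problems(candidate))
```

A check like this can be run over every column name before a dataset is shared, so that problems are caught at the source rather than in a collaborator's syntax.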
- Thou shalt store missing data in STATA as “.”, in R as “NA”, in SQL as “NULL”, and be mindful of the differences between missing data and values stored as “0”.
Missing data are virtually omnipresent in real data. Storing them as a number (e.g., -1 or 999) is common, but can occasionally cause errors if not addressed prior to analysis. We prefer to keep missing values in the format recognized by each software as missing, and particularly highlight the necessity of distinguishing missing from “0” or “none”.
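A small Python sketch shows how much the choice matters. Here `None` plays the role of the native missing value (Python's counterpart of “.”, “NA”, or “NULL”), and the park counts are made up for illustration.

```python
from statistics import mean

# Hypothetical counts of parks near each home.
# None = not measured; 0 = measured, and no parks were found.
park_counts = [2, 0, None, 5, None, 1]

# Native missing values are easy to exclude explicitly.
observed = [v for v in park_counts if v is not None]
mean_observed = mean(observed)

# Treating missing as 0 silently pulls the mean down.
missing_as_zero = [0 if v is None else v for v in park_counts]
mean_missing_as_zero = mean(missing_as_zero)

# A numeric code such as 999 inflates the mean if it is not recoded first.
missing_as_999 = [999 if v is None else v for v in park_counts]
mean_missing_as_999 = mean(missing_as_999)

print(mean_observed, mean_missing_as_zero, mean_missing_as_999)
```

The three means differ widely even though the observed data are identical, which is why we keep missing values in each software's native form and never fold them into “0”.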
- Thou shalt write scripts that contain adequate commenting, where each logical section has an explanation, and that are readable at the graduate student level, allowing for easy translation to another coding/scripting language.
Even in looking back to one’s own syntax, commenting is invaluable. For sharing syntax across a team and over time, commenting can make the difference between successfully adapting syntax for updated use and starting from scratch.
- Thou shalt write scripts that have a header containing the script name, a description of how the script functions, its requirements/dependencies to run, author, and date.
As above, making the utility of a particular script clear, along with any dependencies, will increase the future utility of that script. This is particularly crucial when results need to be replicated in the course of moving research through the phases of publication, and also when bringing new users of the data up to speed.
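As one possible layout, a header covering the required items might look like the sketch below, written here as a Python docstring. The field labels, the script name, and the small checking function are all hypothetical, and the same fields can be written as comments in any other language.

```python
# Assumed labels for the five required header items.
REQUIRED_FIELDS = ("Script:", "Description:", "Requirements:", "Author:", "Date:")

EXAMPLE_HEADER = '''"""
Script:       link_addresses_to_tracts.py
Description:  Reads a long-format address file and attaches a census
              tract identifier to each address-year record.
Requirements: Python 3.8+; input file addresses_long.csv
Author:       <author name>
Date:         <YYYYMMDD>
"""
'''


def missing_header_fields(script_text: str, n_lines: int = 20) -> list:
    """List required header fields absent from the first n_lines of a script."""
    head = "\n".join(script_text.splitlines()[:n_lines])
    return [field for field in REQUIRED_FIELDS if field not in head]


print(missing_header_fields(EXAMPLE_HEADER))
print(missing_header_fields("x = 1\n"))
```

A check of this kind could be run over a shared script folder to flag files that were saved without a complete header.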
- Thou shalt always use km2 (square kilometers) when creating density and area variables.
In geographic research in the US, there is often a mixture of metric (SI) and imperial units used. This is particularly problematic when documentation is inadequate to signal which is used. In any case, given the requirements by some journals to use metric units, and their predominant global use, we propose to consistently use metric units such as square kilometers.
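When a source file reports area in square miles, the conversion can be done once, at import, and the unit recorded in the variable name. The sketch below assumes the international mile (exactly 1.609344 km). The function names, the area, and the population count are hypothetical.

```python
KM_PER_MILE = 1.609344  # international mile, exact by definition
KM2_PER_SQ_MILE = KM_PER_MILE ** 2  # about 2.59 km2 per square mile


def sq_miles_to_km2(area_sq_miles: float) -> float:
    """Convert an area from square miles to square kilometers."""
    return area_sq_miles * KM2_PER_SQ_MILE


def density_per_km2(count: float, area_km2: float) -> float:
    """Return a density (count per square kilometer)."""
    if area_km2 <= 0:
        raise ValueError("area must be positive")
    return count / area_km2


# Hypothetical neighborhood: 10,000 residents in 2 square miles.
area_km2 = sq_miles_to_km2(2.0)
pop_density_km2 = density_per_km2(10000, area_km2)
print(round(area_km2, 3), round(pop_density_km2, 1))
```

Putting `km2` in the variable name, as in `pop_density_km2`, keeps the unit visible even when the documentation is not at hand.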
- Thou shalt accept that these commandments supersede other naming conventions, but the ultimate goal is to support efficient and replicable use of the data.
We have used these commandments to capture the aspects of variable naming and data coding that were deemed most crucial to our successful collaboration, and will be on the lookout for additional commandments as we move forward.