As we launched another multifaceted geographic data linkage study our multi-institution team, that includes researchers at Drexel University, Columbia University and the University of Washington, has developed a set of commandments to streamline and harmonize our data management, variable naming and data coding processes.
- Thou shalt not transmit HIPAA/IRB protected data, nor data protected by licensing agreement without PI approval.
Clearly, we both want to be responsible custodians of the data entrusted to us, and avoid getting into trouble. For additional discussion of cautions around the common practice of using online tools to characterize addresses, see our recent commentary.
- Thou shalt always use YYYYMMDD when formatting date variable values, stored as a string.
The date storage was much discussed by our group, but ultimately we wanted a solution that would sort chronologically, be readable to humans, and be usable seamlessly across software that use a different sentinel date.
- Thou shalt always use YYYY when using a year in a variable name.
Given that our studies of adult health frequently span both the 1990s and 2000s, using 4 digits (versus 2 digit) for year when possible allows for easier conversion from wide to long format, and sorting in chronological order.
- Thou shalt prefer use of tall rather than wide data formats to avoid storing empty data and simplify query expressions.
As we move to using longitudinal data on where people live, and how their environment has changed over time, the structure of data becomes more complex. Long format avoids storing fields for which many observations have no data. However, the overarching goal is efficiency and usability, which may at times favor a wide format instead.
- Thou shalt always use lowercase for variable names to avoid case sensitivity issues when jumping between software.
Inconsistent capitalization in variable names is a source of frustration for users of software such as STATA. A typical scenario is that you have working syntax, receive an updated dataset with differences in capitalization (which a user of less case sensitive software packages such as SAS may not be attentive to), and have to spend time troubleshooting and editing to get it to work again. While conventions vary, we decided the simplest thing would be to use only lowercase in our variable names. Continue reading