Like most researchers investigating neighborhood determinants of health, we are excited that both government and the private sector are making more and more spatially located data available. But even as new data sources allow us to characterize study subjects’ environments more completely, the sheer number of potentially interacting contextual variables we can now study introduces analytic complexity. Drawing an analogy with genomic research, we propose the ‘neighborhood environment-wide association study’ (or NE-WAS) as one approach to address the complexity.
Neighborhood research is increasingly a high-volume, high-variety ‘Big Data’ endeavor. Even as neighborhood research mainstays like the US Census and American Community Survey continue to be updated, new forms and sources of data like social media, remote sensing, and commercial aggregators are providing increasingly detailed insight into neighborhood conditions. Furthermore, through GIS tools and spatial analytic approaches, researchers are defining neighborhoods in creative new ways including network buffers, pill buffers, and neighborhood hulls. With more data at more spatial resolutions, we can characterize study subjects’ neighborhoods in high-dimensional space – for example, one dataset we work with has 1,485 separate variables that describe some aspect of each subject’s residential neighborhood.
With all these variables, it’s easy to worry we might be missing something. Defining a neighborhood is complex, and the appropriate scale may be different for different measured characteristics. For example, one study may operationalize neighborhood crime as reported neighborhood crime in an administrative area such as a county or zip code, whereas another may define neighborhood crime by asking subjects to report perceptions of neighborhood safety. It is possible that a third operationalization of neighborhood crime, such as kriging all reported crimes to estimate levels within a .25km circular buffer around subject homes, would more accurately reflect the true impediment to activity. If this last approach were closest to the true aggregate influence on subjects, then both the zip code and self-report studies would underestimate the true effects of neighborhood crime on physical activity.
Since neighborhoods are complex, dynamic environments and many neighborhood characteristics are highly correlated (for example, median incomes are correlated with racial composition), correct model specification in neighborhood research is a topic of ongoing debate. Picking only one neighborhood characteristic for consideration when a true causal effect is due to an interaction between characteristics could result in findings that cannot be replicated in studies with different distributions and may fail to identify optimal intervention targets.
In this sense, neighborhood research may be analogous to genomic and other ‘-omic’ research paradigms: a wealth of data exists, but characterization of the data for analysis remains problematic due to the interacting, dynamic nature of the underlying data generating process. In the context of genetic research, two paradigms have emerged: first, hypothesis-driven ‘candidate gene approaches’, wherein analysis focuses on polymorphisms and mutations of a particular gene whose effects have previously been profiled, and second, genome-wide association study (GWAS) approaches, wherein hypothesis-free empirical analytic approaches are used to search the whole genome for the strongest genetic associations. Recently, this paradigm has been replicated, first in the various high-throughput ‘-omic’ fields focused on biomarker discovery, and more recently in the environmental sciences with the advent of ‘environment wide association studies’ (EWAS). We propose that neighborhood research may benefit by developing the analogous NE-WAS: a ‘neighborhood environment-wide association study’.
For example, we have recently used a lasso estimator, a penalized regression technique, to explore how census tract characteristics predict the change in presence of physical activity venues over two decades in the 23-county New York City region. We used ten-fold cross-validation to select the model that minimized out-of-sample error, finding both expected results, such as that more socioeconomic resources were associated with greater odds of an activity facility opening and unexpected results, such as that race and ethnicity were associated with probability of facility increase controlling for population income and poverty levels. We are also in the early stages of exploring the use of a clustering-aware penalized logistic regression technique originally developed for epigenetic data to explore predictors of physical activity profiles in older adults. As with the lasso approach, we expect to use cross-validation to empirically select a maximally predictive model.
Modern neighborhood health effects research draws from intellectual traditions that value theory-driven analysis, including sociology, social work, and epidemiology. Contrary to some of the ‘Big Data’ hype, we do not anticipate that empirical techniques will supplant the need for theory. But we do anticipate that as the breadth of neighborhood data grows, empirical model selection techniques will become increasingly attractive to researchers motivated by not only the desire to learn from the new data but also sheer curiosity about patterns present in the built environment and society. By developing the NE-WAS now and exploring its strengths, weaknesses, and relation to more traditional theory-driven research, we can help establish the value that empiricism plays in understanding the processes through which neighborhoods influence health.
– Steven Mooney & David Wutchiett