At the Built Environment and Health group, we try hard to measure neighborhood characteristics accurately. We systematically audit Street View imagery, we use LiDAR scans to assess tree canopy, and we use business registration records to profile neighborhood retail. A lot of these measures are spatially interpolated. For example, it’s not feasible to collect pollen counts everywhere in the city, but we can take pollen count samples at a few locations and use those samples to estimate the pollen counts we would have measured in places we couldn’t measure.
One technique we use for spatial interpolation is ordinary kriging. Ordinary kriging uses the spatial correlation between sampled points – that is, on average, how similar are pairs of points at any given distance – to estimate, with confidence levels, measures at unobserved locations. Ordinary kriging was initially developed in geology – miners sampled minerals in locations that appeared promising, then analyzed the spatial variation in mineral content between the samples in order to identify potential gold deposits. We and others have borrowed this technique for neighborhood measures, like when we used ordinary kriging to estimate physical disorder levels throughout cities.
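To make "how similar are pairs of points at any given distance" concrete, here is a minimal sketch of an empirical semivariogram, the summary that kriging models are usually fit to. The function name, the bin width, and the five sample scores are all made up for illustration; this is not our production code.

```python
import math
from collections import defaultdict

def empirical_semivariogram(points, values, bin_width):
    """Return {bin_index: semivariance} for pairs grouped by separation distance.

    Bin k holds pairs whose distance d satisfies k*bin_width <= d < (k+1)*bin_width.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            k = int(d // bin_width)
            sums[k] += (values[i] - values[j]) ** 2
            counts[k] += 1
    # Semivariance is half the mean squared difference within each bin.
    return {k: sums[k] / (2 * counts[k]) for k in sorted(counts)}

# Hypothetical disorder scores at five sampled street segments along a line.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
values = [1.0, 1.5, 2.5, 4.0, 6.0]
gamma = empirical_semivariogram(points, values, bin_width=1.0)
for k, g in gamma.items():
    print(f"distance bin {k}: semivariance {g:.4f}")
```

In this toy sample the semivariance rises with distance, which is the pattern kriging relies on: nearby points resemble each other more than distant ones do.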
But a key assumption underlying an ordinary kriging model is that similarity falls off smoothly with distance – on average, mineral content at a given location looks more like the mineral content 50 feet away than the mineral content 100 feet away. We started wondering whether this assumption of continuity holds for neighborhood disorder. For example, if physical conditions are worse on the ‘wrong side of the tracks’, then a measure of nearby conditions that happens to be on the opposite side of the tracks might tell us less than a measure that’s further away but on the same side of the tracks. Maybe, we thought, if we pull external information like the side of the tracks into our interpolation models, we can interpolate more accurately.
To make this more concrete: in New York City, local Community Boards advocate for neighborhood needs. If Community Boards differ in how highly they prioritize, say, graffiti cleanup, then the boundaries of their districts might affect the spatial distribution of disorder, and so including the district as a covariate might improve spatial estimation models. More generally, a lot of spatially located administrative data is available on sites like NYC Open Data – maybe a ‘Big Data’ approach can improve our measures at low cost?
One way to pull in this additional information is to use an extension of ordinary kriging called universal kriging (also sometimes called regression kriging or kriging with external drift, depending on the final specification). Universal kriging essentially supplements an ordinary kriging spatial model with additional covariates measured at the location where the value is being estimated. So we used cross-validation to compare two ways of estimating our CANVAS/Street View-based disorder measure in Detroit, San Jose, Philadelphia, and New York: ordinary kriging (as we’ve done previously) and universal kriging incorporating census measures like housing vacancy and unemployment rate.
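The comparison described above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the models from our paper: it uses the regression-kriging form (a least-squares trend on one covariate, then ordinary kriging of the residuals), an exponential variogram with fixed, assumed parameters rather than fitted ones, leave-one-out cross-validation, and hypothetical data. All function names are ours for this example.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def variogram(d, sill=1.0, rng=3.0):
    """Exponential semivariogram with no nugget (sill and range are assumed, not fitted)."""
    return sill * (1.0 - math.exp(-d / rng))

def ordinary_krige(points, values, target):
    """Predict the value at `target` as a weighted mean whose weights sum to 1."""
    n = len(points)
    # Semivariances between samples, plus a Lagrange row and column that
    # constrain the weights to sum to 1.
    a = [[variogram(math.dist(points[i], points[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [variogram(math.dist(p, target)) for p in points] + [1.0]
    weights = solve(a, b)[:n]
    return sum(w * v for w, v in zip(weights, values))

def regression_krige(points, values, covariate, target, target_covariate):
    """Fit a linear trend on one covariate, krige the residuals, add the trend back."""
    n = len(values)
    mx, my = sum(covariate) / n, sum(values) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(covariate, values))
             / sum((x - mx) ** 2 for x in covariate))
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(covariate, values)]
    trend = intercept + slope * target_covariate
    return trend + ordinary_krige(points, residuals, target)

def loocv_rmse(points, values, covariate=None):
    """Leave-one-out RMSE; uses regression kriging when a covariate is supplied."""
    squared_errors = []
    for i in range(len(points)):
        pts = points[:i] + points[i + 1:]
        vals = values[:i] + values[i + 1:]
        if covariate is None:
            pred = ordinary_krige(pts, vals, points[i])
        else:
            cov = covariate[:i] + covariate[i + 1:]
            pred = regression_krige(pts, vals, cov, points[i], covariate[i])
        squared_errors.append((pred - values[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical data: six sampled locations, a disorder score at each, and the
# census housing vacancy rate of the tract containing each location.
points = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
vacancy = [0.05, 0.30, 0.10, 0.25, 0.08, 0.35]
disorder = [1.1, 2.4, 1.3, 2.0, 1.2, 2.7]

rmse_ok = loocv_rmse(points, disorder)
rmse_rk = loocv_rmse(points, disorder, vacancy)
print(f"ordinary kriging RMSE:   {rmse_ok:.3f}")
print(f"regression kriging RMSE: {rmse_rk:.3f}")
```

A real analysis would fit the variogram to the data (to the residuals, in the regression-kriging case) and would probably rely on an established geostatistics package rather than a hand-written solver. The point here is only the shape of the comparison: same samples, same held-out points, with and without the covariate.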
We recently presented a paper describing this investigation at the 2016 Population Association of America conference. Surprisingly, we found that universal kriging was barely, if at all, beneficial, except when incorporating housing vacancy in Detroit and Philadelphia, cities where there has been a lot of abandonment. Even there, the universal kriging model with the census housing vacancy measure was only about 4-5% more accurate with the full sample. Maybe more importantly, that universal kriging model needed only about half the sample points to be as accurate as the ordinary kriging model, indicating that we might be able to survey more cities for the same cost given a better interpolation model.
These results are encouraging us to explore universal kriging more deeply as a potential low-cost way to improve our disorder measure’s efficiency and accuracy as we use CANVAS to survey in more spatial contexts.