Spatial Data Analysis
An introduction to spatial autocorrelation and spatial regression analysis
January 2015
Johan Blomme | Leenstraat 11 | 8340 Damme-Sijsele
j.blomme@telenet.be
www.johanblomme.com
Many research questions require analysis of complex patterns of interrelated social,
behavioral, economic and environmental phenomena. In addressing these questions, it is
increasingly argued that both spatial thinking and spatial analytical perspectives have an
important role to play. Indeed, research on social stratification and inequality, health,
mortality and fertility and many other issues depends on the collection and analysis of
individual and context-level data.
The geospatial and methodological development environment has changed. The volume,
sources and forms of available geospatial data are growing rapidly. The flow of information
from a host of sensors has grown exponentially in recent years to the point that many
observations can be geo-referenced. Data storage and handling (e.g. cloud computing)
change what, how and when we collect data on individuals and their environments.
In a world where information is increasingly seen through geographic filters, spatial thinking
grows in importance. More and more examples show that space and place matter and
stress the leverage of place-based politics. For example,
conventional approaches in health research underestimate the contribution of place to
disease risk. Several studies reinforce the view that neighborhood context is an important
condition of human well-being. Place emerges as an important contextual framework for
considering a number of critical societal issues. Place as a social context is deeply connected
to larger patterns of social advantage and disadvantage.
Since the mid-1990s, there has been renewed interest in the much earlier tradition of spatial
demography that focuses on areal aggregates as units of analysis. Trends in technology
during the 1980s and 1990s brought sophistication to the world of spatial demography.
Contributing factors were :
– U.S. Census Bureau’s TIGER files ;
– extensive natural resource, crime and epidemiological databases ;
– powerful GIS software for integrating and mapping spatial data ;
– computing hardware platforms.
These factors altered the way in which spatial demography research was carried out. Other
trends that emerged were :
– the use of exploratory spatial data analysis (ESDA) ;
– the role of regression analysis in spatial demography ;
– the special nature of spatial data, which requires modification of the
standard regression model (e.g. the role of geographically
weighted regression for exploring spatial variation) ;
– the need for attention to both global and local diagnostic
tools.
When analyzing spatial data from a large number of units (e.g. counties), it is the natural
inclination of researchers to move from simple descriptive analysis to asking questions
such as : How might these data be modeled ? How well can we account for variability in attribute
values among geographic units ?
To answer these questions, analysts turned to multivariate regression modeling, the
common methodology in the social sciences. However, the application of the standard
regression approach to data tied to spatial units brings special complications because
“spatial is special”. Attention has been drawn to the fact that spatial data require special
analytic approaches.
Two properties are particularly important in the analysis of spatial data. The first, spatial
dependence, refers to the tendency for spatial data to exhibit spatial autocorrelation. For
most social phenomena mapped in space, local proximity usually results in value similarity.
High values tend to be located near other high values, while low values tend to be located
near other low values, thus exhibiting positive spatial autocorrelation. Less often, high
values may tend to be co-located with low values (or vice versa), as islands of dissimilarity
(negative spatial autocorrelation).
In either case, the units of analysis in spatial demography likely fail a key assumption of
classical statistics : independence among observations. With respect to statistical analysis
that presumes such independence (e.g. standard regression analysis), positive
autocorrelation means that the spatially autocorrelated observations bring less information
to the model estimation process than would the same number of independent observations.
The greater the extent of spatial autocorrelation, the more severe is the information loss.
A quick explanation for the presence of spatial autocorrelation can be found in the oft-cited
“first law of geography” enunciated by Tobler in 1970 : “Everything is related to everything else,
but near things are more related than distant things” (Tobler, 1970 : 236). Tobler’s first law is
somewhat unsatisfying because it doesn’t tell us why this phenomenon arises in practice.
The answer to this question can only be approximated with models of the spatial process
and the analyst’s theory about the process.
The second property is spatial heterogeneity, the tendency for phenomena
distributed in space to be statistically nonstationary (a lack of stability across space of
one or more attribute values). Spatial heterogeneity confounds attempts to generalize
because results of an analysis of a limited area will change when the boundaries of the area
are shifted.
One of the more recent and fascinating developments in the design of local statistics is the
theoretical background and associated software to explore how regression parameters and
regression model performance vary across a study region.
Geographically weighted regression (GWR) is similar to a global regression model in that the
familiar constant, regression coefficients and error term are all present within the regression
specification. There are two ways in which GWR differs from standard (global) regression.
First is the fact that a separate regression is carried out at each location (observation) using
only the other observations that lie within a user-specified distance from that location.
Second, the regression specification includes a statistical device which weights the attributes
of nearby geographical units more highly than it does the attributes of distant geographical
units. The result is a set of local regression parameters for each geographical unit. The
regression is thus localized.
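The localized fit described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any GWR package : a single predictor, a Gaussian kernel truncated at the user-specified distance, and hypothetical names (`gwr_local_fit`, `bandwidth`). As a simplifying choice, the focal observation is included in its own local fit.

```python
import math

def gwr_local_fit(coords, x, y, i, bandwidth):
    """Fit y = a + b*x by weighted least squares around observation i."""
    # Gaussian kernel weight per observation; zero beyond the bandwidth
    weights = []
    for point in coords:
        d = math.dist(coords[i], point)
        weights.append(math.exp(-0.5 * (d / bandwidth) ** 2) if d <= bandwidth else 0.0)
    total = sum(weights)
    mean_x = sum(w * xv for w, xv in zip(weights, x)) / total
    mean_y = sum(w * yv for w, yv in zip(weights, y)) / total
    sxy = sum(w * (xv - mean_x) * (yv - mean_y) for w, xv, yv in zip(weights, x, y))
    sxx = sum(w * (xv - mean_x) ** 2 for w, xv in zip(weights, x))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope  # (local intercept, local slope)

# Toy data (invented): the slope is 1 in the western cluster and 3 in the eastern one
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
x = [1, 2, 3, 1, 2, 3]
y = [1, 2, 3, 3, 6, 9]
local_fits = [gwr_local_fit(coords, x, y, i, bandwidth=3.0) for i in range(len(coords))]
```

A global regression on these six points would report a single compromise slope, whereas the local fits recover a different slope for each cluster, which is the kind of parameter variation GWR is meant to expose.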
A GWR approach to regression analysis is a highly useful exploratory device for
understanding parameter heterogeneity in one’s data. The output of GWR enables the
researcher to examine and map local parameter estimates and local regression diagnostics,
thereby enabling assessment of the utility of the model for various portions of the larger
study region.
In the first part of this guide, we provide a general introduction to performing spatial regression
and spatial autocorrelation analysis. We use GeoDa, software developed by the Arizona
State University’s GeoDa Center for Geospatial Analysis and Computation
(http://geodacenter.asu.edu). In the second part, we model spatial data with geographically
weighted regression to explain local variations in relationships.
CONTENTS
Part 1
An introduction to spatial autocorrelation and spatial regression with GeoDa
1. Manipulating data
2. Mapping and exploratory data analysis
3. Spatial autocorrelation
4. Spatial regression
Part 2
Analyzing spatial heterogeneity with geographically weighted regression
Part 1
An introduction to spatial autocorrelation and
spatial regression analysis with GeoDa
• The development of specialized software for spatial data analysis has seen rapid growth since
the late 1980s.
• A substantial collection of spatial data analysis software is available, ranging from niche
programs and commercial statistical and GIS packages to open source software environments
such as R, Java and Python.
• GeoDa, for example, is the result of the effort to facilitate spatial data analysis. The main
objective of the software is to provide the user with a path starting with simple mapping and
geovisualization moving to spatial autocorrelation analysis and ending up with spatial
regression.
1. Manipulating Spatial Data
Creating point shape files from .dbf-file
Creating Thiessen polygons as shape files
Tools → Shape → Points to polygon
Thiessen polygons are created as a polygon shape file
derived from a point shape file. Each Thiessen polygon
encloses one of the original points in such a way that every
location in the polygon is closer to the enclosed point than to any
other point. This corresponds to the notion of a geographic
market area.
Thiessen polygons allow the computation of contiguity
based spatial weights for point data, using the boundaries
of the polygons to establish contiguity.
Area and perimeter calculations are only supported for
projected coordinates (Euclidean distance). For point shape
files in unprojected latitude and longitude, the results will
not be correct.
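The defining property of Thiessen polygons can be illustrated without constructing any geometry : a location belongs to the polygon of whichever original point is nearest to it. The sketch below is only an illustration of that property, assuming projected (planar) coordinates ; the function name `nearest_seed` and the sample points are hypothetical, and this is not how GeoDa builds the polygon shape file.

```python
import math

def nearest_seed(location, seeds):
    """Index of the seed point whose Thiessen polygon contains `location`."""
    return min(range(len(seeds)), key=lambda k: math.dist(location, seeds[k]))

# Three original points in projected coordinates (invented for the example)
seeds = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]

# Each test location falls in the polygon of its closest seed
assignments = [nearest_seed(p, seeds) for p in [(1.0, 1.0), (9.0, 1.0), (2.0, 8.0)]]
```

Because the assignment relies on Euclidean distance, it inherits the caveat stated above : with unprojected latitude and longitude the distances, and therefore the polygons, would be distorted.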
Computing spatially lagged variables
Spatially lagged variables are weighted averages of the values for neighboring locations, as specified by a spatial
weights matrix.
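As a minimal sketch of this computation, assume the spatial weights are given as a list of neighbor lists and are row-standardized, so that the lag is simply the mean of the neighboring values. The names `spatial_lag`, `values` and `neighbors` are illustrative, not GeoDa identifiers.

```python
def spatial_lag(values, neighbors):
    """Row-standardized spatial lag: the mean of each location's neighbors."""
    lag = []
    for nbrs in neighbors:
        # A location without neighbors (an "island") gets a lag of 0 here
        lag.append(sum(values[j] for j in nbrs) / len(nbrs) if nbrs else 0.0)
    return lag

# Four locations along a line; each is a neighbor of the adjacent ones
values = [10.0, 20.0, 30.0, 40.0]
neighbors = [[1], [0, 2], [1, 3], [2]]
lagged = spatial_lag(values, neighbors)
```

Other weighting schemes (for example inverse distance) would replace the plain mean by a weighted average.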
The changes and additions made to a table only reside in memory and are not permanent. In order
to make them permanent, the table must be saved to a new file :
File → Save as → Shapefile name to save as
This results in three files to be saved, with file extensions .shp, .shx and .dbf.
2. Mapping and Exploratory Data Analysis
Univariate EDA
resource deprivation index (1970)
Hinge value of 1.5 : values lying more than 1.5 times the interquartile range beyond the quartiles are flagged as outliers
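The hinge rule can be sketched as follows. This is an illustration only : quartile conventions differ between programs, so the fences computed here with Python's `statistics.quantiles` need not coincide exactly with those of GeoDa's box plot. The function name and the sample data are hypothetical.

```python
import statistics

def boxplot_outliers(data, hinge=1.5):
    """Values lying more than hinge * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - hinge * iqr, q3 + hinge * iqr
    return [v for v in data if v < lower or v > upper]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
outliers = boxplot_outliers(sample)
```

Raising the hinge to 3.0 widens the fences, so that only more extreme values are flagged.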
sort on variable to find outliers
Multivariate EDA
Homicide data for counties around St Louis
Quintile maps : homicide rate and resource deprivation
scatterplot
parallel coordinate plot (PCP)
Linking and brushing
Analyzing changes over time :
Cartogram crime rate
Cartogram Gini inequality
Ohio counties, total lung cancer deaths for
White females, 1968
selecting a rate variable from the data set (reveals the problem of
variance instability)
both the event and the population at risk are
specified and the rate is calculated on the fly
Rate Smoothing
A commonly used notion in public health analysis is the concept of a standardized mortality ratio (SMR), i.e. the ratio of the observed
mortality rate to a national (or regional) standard. GeoDa implements this in the form of an excess risk map.
The excess risk is the ratio of the observed rate to the average rate computed for all the data. Note that this average is not the
average of the county rates (instead, it is calculated as the ratio of the total sum of all events over the total sum of all populations
at risk).
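The calculation just described can be sketched in Python. The counts are invented for the example and the function name `excess_risk` is hypothetical.

```python
def excess_risk(events, population):
    """Ratio of each area's raw rate to the overall rate (total events / total population)."""
    overall_rate = sum(events) / sum(population)
    return [(e / p) / overall_rate for e, p in zip(events, population)]

# Three hypothetical counties: event counts and populations at risk
events = [2, 10, 8]
population = [100, 1000, 900]
risk = excess_risk(events, population)

# The reference rate differs from the plain mean of the county rates
overall_rate = sum(events) / sum(population)
mean_of_rates = sum(e / p for e, p in zip(events, population)) / len(events)
```

Values above 1 indicate a risk higher than the overall average, values below 1 a lower risk.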
risk is higher than state average
risk is lower than state average
saved to the table (right click on previous map)
no difference between rescaled raw rates
and raw rates
a new outlier is added
Empirical Bayes consists of computing a weighted average between the raw rate for each county and the state average,
with weights proportional to the underlying population at risk. Small counties will tend to have their rates adjusted
considerably, whereas for larger counties the rates will barely change.
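A minimal sketch of this shrinkage, using one common method-of-moments formulation of the Empirical Bayes estimator. Whether GeoDa uses exactly these estimators for the prior mean and variance is an assumption here ; the data and the name `eb_smooth` are invented for the example.

```python
def eb_smooth(events, population):
    """Shrink raw rates toward the overall rate; small populations shrink the most."""
    n = len(events)
    total_pop = sum(population)
    mu = sum(events) / total_pop                      # overall (prior) mean rate
    raw = [e / p for e, p in zip(events, population)]
    # Method-of-moments estimate of the prior variance, truncated at zero
    between = sum(p * (r - mu) ** 2 for p, r in zip(population, raw)) / total_pop
    sigma2 = max(between - mu / (total_pop / n), 0.0)
    smoothed = []
    for r, p in zip(raw, population):
        w = sigma2 / (sigma2 + mu / p)                # weight on the raw rate grows with p
        smoothed.append(w * r + (1 - w) * mu)
    return raw, smoothed, mu

events = [5, 100, 180, 1]
population = [100, 10000, 20000, 50]
raw, smoothed, mu = eb_smooth(events, population)
```

In this toy data the county with 100 inhabitants is pulled strongly toward the overall rate, while the county with 20,000 inhabitants barely moves.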
Spatial rate smoothing consists of computing the rate in a moving window that includes the county as well as its neighbors.
In GeoDa neighbors are defined by means of a spatial weights file.
We will construct a simple spatial weights file consisting of the 8 nearest neighbors for each county in the Ohio shapefile.
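The moving-window rate can be sketched as below, assuming the neighbors of each county are available as a list of indices (in GeoDa they come from the spatial weights file). The names and numbers are illustrative only.

```python
def spatial_rate(events, population, neighbors):
    """Rate computed over each area together with its neighbors."""
    rates = []
    for i, nbrs in enumerate(neighbors):
        window = [i] + list(nbrs)                     # the area itself plus its neighbors
        rates.append(sum(events[j] for j in window) /
                     sum(population[j] for j in window))
    return rates

events = [1, 4, 5]
population = [100, 200, 200]
neighbors = [[1], [0, 2], [1]]
smoothed_rates = spatial_rate(events, population, neighbors)
```

Pooling events and populations over the window stabilizes the rates of small areas, at the cost of blurring local detail.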
A spatially smoothed box map emphasizes broad regional patterns.
Note how there are no more outliers.
3. Spatial autocorrelation
• Spatial autocorrelation is a measure of spatial dependency that quantifies the degree of spatial
clustering or dispersion in the values of a variable measured across a set of locations.
• There are two basic types of spatial autocorrelation statistics : global measures identify whether the
values of a variable exhibit a significant overall pattern of regional clustering, whereas local measures
identify the location of significant high and low value clusters.
• Basics : Steps in determining the extent of spatial autocorrelation :
– choose a neighborhood criterion : which areas are linked ?
– assign weights to the areas that are linked : create a spatial weights matrix
– run statistical tests, using weights matrix, to examine spatial autocorrelation
• Spatial autocorrelation measures the correlation of a variable with itself through space. Spatial
autocorrelation can be positive or negative. Positive spatial autocorrelation occurs when similar values
occur near one another. Negative spatial autocorrelation occurs when dissimilar values occur near one
another.
• Spatial weights are essential for the computation of spatial autocorrelation statistics.
• Spatial weights can be based on contiguity from polygon boundary files or calculated from the distance
between points.
Contiguity Based Weights
Contiguity-based weights (.GAL file), computed from polygon shape files :
– rook contiguity : uses only common boundaries to define neighbors ;
– queen contiguity : uses all common points (denser connectedness structure) ;
– 1st order or higher order contiguity : the construction of higher order weights removes
redundancies and circularities.
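The difference between the two criteria is easiest to see on a regular grid, where rook neighbors share an edge and queen neighbors share an edge or a corner. The sketch below is an illustration on such a grid, with a hypothetical function name ; real polygon files require geometric tests on shared boundaries and vertices.

```python
def grid_neighbors(rows, cols, criterion="rook"):
    """Neighbor lists for the cells of a regular grid under rook or queen contiguity."""
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]               # shared edges
    if criterion == "queen":
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]        # shared corners as well
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            neighbors[(r, c)] = [(r + dr, c + dc) for dr, dc in offsets
                                 if 0 <= r + dr < rows and 0 <= c + dc < cols]
    return neighbors

rook = grid_neighbors(3, 3, "rook")
queen = grid_neighbors(3, 3, "queen")
```

On a 3 x 3 grid the central cell has 4 rook neighbors and 8 queen neighbors, which illustrates the denser connectedness structure of the queen criterion.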
Header line of the .GAL file : flag, number of observations, name of polygon shape file, name of the key variable
Rook contiguity
Queen contiguity
Comparison of connectedness structure for rook and queen contiguity
Higher Order Contiguity
Rook contiguity (1st order)
Pure 2nd order rook contiguity
Cumulative 2nd order rook contiguity
Locations with 5 first order rook neighbors
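The distinction between pure and cumulative second order contiguity can be sketched as follows, assuming first order neighbors are stored in a dictionary. Pure second order neighbors are the neighbors of neighbors, excluding the location itself and its first order neighbors ; the cumulative version adds the first order neighbors back in. The function name and the small chain of four locations are hypothetical.

```python
def second_order(neighbors, cumulative=False):
    """Second order contiguity derived from first order neighbor lists."""
    result = {}
    for i, first in neighbors.items():
        second = set()
        for j in first:
            second.update(neighbors[j])               # neighbors of neighbors
        second -= {i} | set(first)                    # drop self and first order neighbors
        result[i] = sorted(set(first) | second) if cumulative else sorted(second)
    return result

# Four locations in a chain: 0 - 1 - 2 - 3
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
pure = second_order(chain)
cumul = second_order(chain, cumulative=True)
```

Removing the location itself and its first order neighbors is what avoids the redundancies and circularities that a naive "neighbors of neighbors" rule would introduce.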
Distance_Based Weights
Distance-based weights (.GWT file), computed from point or polygon shape files :
– threshold distance : GeoDa calculates the minimum distance required to ensure that each observation
has at least one neighbor ;
– k-nearest neighbors.
Spatial weights based on a distance threshold can lead to a very unbalanced connectedness structure (especially
when spatial units have very different areas, with small areas having many neighbors while larger ones may
have only a few). A commonly used alternative consists of considering the k-nearest neighbors.
In contrast to contiguity weights, distance-based spatial weights can be calculated for both point shape files and
polygon shape files. For polygon files, if no coordinate variables are specified, the polygon centroids will be used as the
basis for distance calculation. When polygon shape files are used, maps must be projected (e.g. UTM) for proper computation of
centroids. For unprojected maps, the resulting centroids will only be approximate.
the minimum distance required to ensure that each location has at least one neighbor
if the points are in latitude and longitude, select the <Arc Distance> option
Connectivity for distance-based weights
distance between neighbor pairs
The distribution has a much broader range compared to contiguity-based weights.
Some points are clustered while others are far apart. The minimum threshold needed
to avoid islands may be too large for many or most locations in the data set. In such
cases, care is needed in the specification of the distance threshold, and the use of
k-nearest neighbor weights may be more appropriate.
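The trade-off between a distance threshold and k-nearest neighbors can be sketched with a few invented points in projected coordinates. The function names are hypothetical ; the minimum threshold is computed as the largest nearest-neighbor distance, which is the smallest cutoff that leaves no location without a neighbor.

```python
import math

def min_threshold(points):
    """Smallest cutoff that gives every point at least one neighbor."""
    return max(min(math.dist(p, q) for j, q in enumerate(points) if j != i)
               for i, p in enumerate(points))

def threshold_neighbors(points, cutoff):
    """Neighbors within the cutoff distance of each point."""
    return [[j for j, q in enumerate(points) if j != i and math.dist(p, q) <= cutoff]
            for i, p in enumerate(points)]

def knn_neighbors(points, k):
    """The k nearest points to each point, closest first."""
    return [sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(p, points[j]))[:k]
            for i, p in enumerate(points)]

# Three clustered points and one remote point
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
cutoff = min_threshold(points)
by_distance = threshold_neighbors(points, cutoff)
by_knn = knn_neighbors(points, k=2)
```

The remote point forces a large cutoff, which gives the clustered points more neighbors than the remote point itself, whereas k-nearest neighbors assigns every location the same number of neighbors.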
Spatially Lagged Variables
Spatially lagged variables are an essential part of the computation of spatial autocorrelation tests and the specification
of spatial regression models. GeoDa computes these variables on the fly, but in some instances it is useful to calculate
spatially lagged variables explicitly.
We will calculate a spatially lagged variable for the variable HH_INC (census tract median household income) in the Sacramento
file.
The first thing we do is open the spatial weights file we created.
Then we create a new field that is added to the table.
The value of the spatially lagged variable “W_HH_INC” for this
location is the mean of the HH_INC values of its neighbors
• Global spatial autocorrelation is handled in GeoDa by means of Moran’s I spatial autocorrelation statistic
and its visualization in the form of a scatterplot.
• Global spatial autocorrelation requires a spatial weights file and a variable must be specified.
• Spatial autocorrelation analysis is implemented in its traditional univariate form as well as in a bivariate
form.
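For reference, a minimal sketch of Moran's I with row-standardized weights, in which case the statistic equals the slope of the regression of the spatial lag on the variable (both in deviations from the mean). The neighbor lists and function name are illustrative, and every location is assumed to have at least one neighbor.

```python
def morans_i(values, neighbors):
    """Moran's I for row-standardized weights given as neighbor lists."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                          # deviations from the mean
    lag = [sum(z[j] for j in nbrs) / len(nbrs) for nbrs in neighbors]
    return sum(zi * li for zi, li in zip(z, lag)) / sum(zi * zi for zi in z)

neighbors = [[1], [0, 2], [1, 3], [2]]                      # four locations along a line
clustered = morans_i([1, 2, 3, 4], neighbors)               # similar values adjacent
alternating = morans_i([1, 4, 1, 4], neighbors)             # dissimilar values adjacent
```

The first layout yields a positive value and the second a negative one, matching the two signs of spatial autocorrelation.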
Global Spatial Autocorrelation
Moran’s I for Columbus data
(variable = crime ; spatial weights file =
rook-based contiguity file)
Moran scatterplot : the four quadrants, numbered (1) to (4), distinguish positive from negative spatial autocorrelation
Moran’s I
reference distribution calculated for spatially random layouts with the same data as observed
(none of the simulated values is larger than the observed 0.52)
Moran’s I = 0.479487
the slope of the regression line changes
as specific locations (in this case 1 location)
are excluded from the calculation
Inference for Moran’s I is based on a random
permutation procedure, which recalculates the
statistic many times to generate a reference
distribution. The obtained statistic is then
compared to this reference distribution and a
pseudo significance level is computed.
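The permutation procedure can be sketched as follows. This is a minimal one-sided version under stated assumptions : row-standardized weights given as neighbor lists, 999 permutations, and a fixed seed for reproducibility. The function names are hypothetical and the details of GeoDa's own implementation may differ.

```python
import random

def morans_i(values, neighbors):
    """Moran's I for row-standardized weights given as neighbor lists."""
    mean = sum(values) / len(values)
    z = [v - mean for v in values]
    lag = [sum(z[j] for j in nbrs) / len(nbrs) for nbrs in neighbors]
    return sum(zi * li for zi, li in zip(z, lag)) / sum(zi * zi for zi in z)

def pseudo_p_value(values, neighbors, permutations=999, seed=12345):
    """Share of random layouts with a statistic at least as large as the observed one."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    extreme = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)                       # reassign the values to random locations
        if morans_i(shuffled, neighbors) >= observed:
            extreme += 1
    return observed, (extreme + 1) / (permutations + 1)

# Twenty locations along a line with steadily increasing values
values = list(range(20))
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < 20] for i in range(20)]
observed, p = pseudo_p_value(values, neighbors)
```

The smallest attainable pseudo significance level is 1 / (permutations + 1), so it depends on the number of permutations chosen.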
A county’s spatial lag is a weighted average of
the resource deprivation of its neighboring localities.
• Global measures : global spatial autocorrelation (Moran’s I) : a single value which applies to the entire
data set (the same pattern or process occurs over the entire geographical area ; an average for the entire
area).
• Local measures : local spatial autocorrelation (LISA) : a value calculated for each observation unit
(different patterns or processes may occur in different parts of the region ; a unique number for each
location).
Local Spatial Autocorrelation
• Local spatial autocorrelation is based on local Moran LISA statistics. This yields a measure of spatial
autocorrelation for each individual location.
• Both univariate and multivariate LISA are included in GeoDa.
• The input needed for local spatial autocorrelation is the same as for global spatial autocorrelation.
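A minimal sketch of the local Moran statistic, assuming row-standardized weights given as neighbor lists. Each location receives its own statistic and a quadrant label based on the sign of its own deviation and that of its spatial lag. The function name is hypothetical, and the significance of each local statistic (obtained by permutation) is not computed here.

```python
def local_moran(values, neighbors):
    """Local Moran statistic and quadrant label for each location."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(zi * zi for zi in z) / n               # variance of the deviations
    results = []
    for zi, nbrs in zip(z, neighbors):
        lag = sum(z[j] for j in nbrs) / len(nbrs)   # mean deviation of the neighbors
        if zi >= 0:
            quadrant = "high-high" if lag >= 0 else "high-low"
        else:
            quadrant = "low-low" if lag < 0 else "low-high"
        results.append((zi * lag / m2, quadrant))
    return results

neighbors = [[1], [0, 2], [1, 3], [2]]
local = local_moran([1, 2, 3, 4], neighbors)
global_i = sum(stat for stat, _ in local) / len(local)
```

With row-standardized weights, the average of the local statistics equals the global Moran's I, which is why the local statistics can be read as a decomposition of the global one.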
The significance map shows the locations with significant local Moran statistics.
The high-high and low-low locations (positive local spatial autocorrelation) are typically
referred to as spatial clusters, while the low-high and high-low locations are termed spatial
outliers (while outliers are single locations by definition, this is not the case for
clusters).
The result for univariate LISA is a special choropleth map showing those locations with a significant local Moran statistic
(depending on the significance level). In the map below, the significance map is shown for the CRIME variable in the Columbus
data set, using rook contiguity.
The result of the cluster map is a special choropleth map showing those locations with a significant local Moran statistic,
classified by type of spatial correlation : bright red for the high-high association and bright blue for low-low.
The high-high and low-low locations suggest clustering of similar values, while the high-low and low-high locations
indicate spatial outliers.
It is strongly recommended that sensitivity analysis
be carried out before interpreting results of
LISA maps as “significant” clusters.
The randomization option provides a way to
address numerical stability of the results.
The significance filter is designed to assess how
conclusions depend on the chosen significance
level.
LISA maps after applying a significance filter.
When Moran’s I statistic is calculated for rates or proportions, the underlying assumption of stationarity may be
violated by the intrinsic instability of rates. The latter follows when the population at risk (the base) varies
considerably across observations. The variance instability may lead to spurious inferences for Moran’s I.
To correct for this, GeoDa implements the Empirical Bayes (EB) standardization. This is implemented for both the global
(Moran scatter plot) and local spatial autocorrelation statistics.
To illustrate this, we will use the Scottish lip cancer data set and associated weights file to compare the results of
calculating Moran’s I based on the non-standardized rates with the results of the EB standardization.
The value for Moran’s I of 0.527 differs somewhat from
the statistic for the unstandardized rates (0.479).
More important is to assess whether or not inference is
affected. The resulting permutation distribution still
suggests a highly significant statistic.
• Practice : Spatial patterns of rural poverty : An exploratory analysis in the São Francisco
River Basin, Brazil (Nova Economia, Belo Horizonte, 21 (1), 45-66, January-April 2011).
This study uses recently released municipio-level data on rural poverty in Brazil to identify and analyze spatial
patterns of rural poverty in the SFRB.
Moran’s I statistics are generated and used to test for spatial autocorrelation, and to prepare cluster maps that
locate rural poverty “hot spots” and “cold spots”.
The results indicate that poverty reduction in the SFRB should take into account the spatial distribution of
poverty. Not only is poverty in the SFRB clustered spatially, but the bulk of the basin’s poor resides in
municipios that comprise the poverty “hot spots” the study identifies. These clusters did not correspond to
state-level boundaries, so scope may exist for geographically refocusing poverty reduction efforts to make
them more efficient.
• Information on spatial patterns of rural poverty in the SFRB may shed light on the importance of location
as a causal factor per se. Municipios may be more likely to have high (or low) rural poverty rates
depending on where they are located geographically :
– one obvious reason is the stock of natural resources (natural resources are not evenly distributed
across space) : for farm activities, for example, good soils and easy access to water may improve
agricultural conditions, productivity and income ;
– job and income providers such as firms and service-oriented businesses tend to concentrate in space
in order to benefit from large markets (economies of scale) and the availability of specialized skilled
labor.
• The value of Moran’s I is equal to 0.72, which suggests a strong positive spatial autocorrelation of rural
poverty. This number suggests that for the SFRB, there are more locations with high (low) rural poverty
rates surrounded by locations with high (low) rural poverty rates than would be the case if poverty were
distributed randomly.
• The value of Moran’s I also suggests that poverty in the SFRB is spatially distributed in clusters and
that poverty in one area increases the likelihood of poverty in its neighbors. However,
the value of Moran’s I does not tell us where rural poverty clusters might be, but rather suggests that the
spatial pattern of poverty is not random (there is more similarity in poverty, or in its absence, than
would be expected if the pattern were random).
• Making use of EB standardization to reduce variance instability delivers a coefficient of 0.83, compared to
0.72 in the initial calculation of Moran’s I. This indicates that the correlation between rural poverty rates in
location i and neighboring locations is stronger when rates are standardized. Hence, increasing the
precision with which rural poverty is measured will likely increase the spatial correlation among rural
poverty rates in the SFRB.
• Although a Moran’s I of 0.83 strongly suggests that the spatial distribution of rural poverty is not random, it
does not locate poverty clusters.
• To locate “hot spots” and “cold spots”, local indicators of spatial autocorrelation must be used (LISA). LISA
provides location-specific information and estimates the extent of spatial autocorrelation between the
value of a given variable (rural poverty) in a particular location and the values of the same variable in
locations around it. This makes it possible to identify spatial clusters of rural poverty.
• Three clusters of rural poverty in the SFRB are detected by LISA. Clusters 1 and 2 are rural poverty “hot spots”
and correspond to positive, high-high spatial autocorrelation, indicating spatial clusters of locations
with above-average rural poverty rates. Cluster 3 is a “cold spot” and also corresponds to positive, but
low-low, spatial autocorrelation, indicating a cluster of locations with below-average rural poverty rates.
• As mentioned before, the clusters of rural poverty may be attributable to several reasons. But further
analysis is required to determine the causes of spatial patterns of rural poverty in the SFRB. Multivariate
regression analysis that takes into account the variables that may explain poverty is the appropriate
approach to the analysis of the spatial determinants of patterns of rural poverty in the SFRB.
• The results of this study suggest that poverty reduction policies in the SFRB should take into account the
spatial distribution of poverty. The analysis suggests that location as a causal factor per se is important
and locations are indeed more likely to have high (or low) rural poverty rates depending on where they are
located in the basin. This may be due to obvious reasons such as stock of natural resources, soil quality,
access to water, etc.
• More importantly, the analysis shows that poverty in one location is affected by (or affects) poverty in
neighboring locations. That is, there are spillovers, either positive or negative externalities that make
locations more or less likely to get out of poverty. These spillovers may be associated with the
concentration (or lack of concentration) of firms, technology and knowledge. These results set the stage
for identifying factors that influence rural poverty in the SFRB, factors that may themselves be spatially
correlated.
4. Spatial regression
• When moving from simple descriptive analyses to data modeling, analysts turn to multivariate regression
modeling to account for variability in attribute values among geographic units by identifying other
covariates of the attribute of interest.
• Attributes of spatially referenced data generally violate at least one of the assumptions underlying the
standard regression model, which necessitates both caution regarding these violations and attention to
methods designed to correct for them.
• Spatial variation : spatial heterogeneity versus spatial dependence
• When undertaking initial EDA of spatial data, it is worthwhile to develop a sense of the spatial distribution
of the attribute values. By mapping the distributions of variables across space, a distinction can be made
between two types of spatial variation.
• Spatial heterogeneity : large-scale regional differentiation (among attribute values) is an important
component of spatial variation. Spatial heterogeneity is the lack of stability across space of one or more
attribute values. Heterogeneity gives recognition to the common observation that values of a variable are
not the same across space.
• Spatial heterogeneity follows from the intrinsic uniqueness of each location. Spatial heterogeneity is
consistent with the description of how places are particular moments of intersecting social relations. The
unique combination of social forces together in one place may produce effects which would not happen
otherwise. These social forces include nonmaterial forces (e.g. cultural and/or historical processes) that
cannot easily or always be quantified, yet these forces shape otherwise measurable social relationships.
The spatial regime approach permits the analyst to move beyond geography per se, by focusing on social,
economic and demographic factors (or, combined, sociological factors) that comprise the context of
place. This approach is intended to enable the analyst to address the “so what” question : what is it about
a place that distinguishes it from other places ?
• Spatial dependence refers to small-scale spatial effects that manifest a lack of independence among
observations (spatial clustering). The assumption is that dependence among the observations derives
from spatial interaction among the units of analysis which can be defended theoretically and which can be
statistically captured by a spatially lagged “neighborhood” effect.
• Two forms of spacial models are commonly used to improve regressions on spatially correlated data :
– The spacial lag model : if two locations are adjacent, the value of the dependent variable of the first
locations can be influenced by the value of the dependent variable of the other. This means that
there is a contagion or dispersion effect, represented best by a spatial lag model.
– The spacial error model : if the error residuals of locations are influenced by one another, this means
that the phenomenon under study is not analysed at the correct geographical level, or that there
might be an unobserved variable correlated with the spatial structure of the data. This would imply
a clustering effect and this has to be studied by a spatial error model.
• A spatial lag model is appropriate if neighboring locations influence one another ; the spatial error model
documents that locations geographically cluster but for an unknown reason.
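In matrix notation, the two specifications are usually written as follows. This is the standard textbook form rather than a formula taken from the slides. The symbols are assumptions introduced here : W is a row-standardised spatial weights matrix, ρ is the spatial autoregressive coefficient (reported by GeoDa as Rho) and λ is the spatial error coefficient (reported as Lambda).

```latex
% Spatial lag model: the neighbors' outcomes Wy enter the right-hand side
y = \rho W y + X\beta + \varepsilon

% Spatial error model: the spatial structure sits in the disturbance term
y = X\beta + u, \qquad u = \lambda W u + \varepsilon
```

With ρ = 0, respectively λ = 0, both specifications reduce to the standard OLS model y = Xβ + ε.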
Spatial distribution of population change among Great Plains Counties, 1990-2000
Source : P.R. Voss, K.J. Curtis White & R.B. Hammer, Explorations in spatial demography, in W.A. Kandel & D.L. Brown, Population change and rural society, Springer, 2006, pp. 407-429.
[Figure annotations : spatial heterogeneity across counties and spatial dependence (clustering) ; Moran scatterplot of population change]
• A model with spatial lags is able to borrow information from neighboring observations because of the
spatial autocorrelation among the units of analysis. The units of analysis likely fail a formal statistical test
of randomness and thus fail to meet a key assumption of classical statistics : independence among
observations. With respect to statistical techniques that presume such independence (e.g. standard
regression analysis), positive autocorrelation means that the spatially autocorrelated observations bring
less information to the model estimation process than would the same number of independent
observations.
• A carefully selected variable can account for spatial heterogeneity in the data and might boost the
explanatory value of the model and largely remove the large-scale spatial process, but spatial
autocorrelation would persist if a spatial dependence process were also indicated. There would remain in
the data a more complicated, interactive spatial relationship among neighbors that suggests the
requirement of some type of autoregressive term in the regression specification.
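The spatial autocorrelation discussed above is commonly summarised with the global Moran's I statistic. The following is a minimal sketch with NumPy, assuming a row-standardised weights matrix ; the six-area layout and the attribute values are hypothetical and serve only to contrast clustering with dispersion.

```python
import numpy as np

def row_standardise(w):
    """Scale each row of a binary contiguity matrix so that it sums to 1."""
    w = np.asarray(w, dtype=float)
    row_sums = w.sum(axis=1, keepdims=True)
    # Rows without neighbors (islands) are left as all zeros.
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)

def morans_i(y, w):
    """Global Moran's I = (n / S0) * (z'Wz) / (z'z), with z = y - mean(y)."""
    y = np.asarray(y, dtype=float)
    z = y - y.mean()
    s0 = w.sum()  # sum of all weights
    return (len(y) / s0) * (z @ w @ z) / (z @ z)

# Hypothetical layout: six areas in a row, each adjacent to its direct neighbors.
n = 6
contiguity = np.zeros((n, n))
for i in range(n - 1):
    contiguity[i, i + 1] = contiguity[i + 1, i] = 1
w = row_standardise(contiguity)

clustered = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])    # like values are adjacent
alternating = np.array([1.0, 5.0, 1.0, 5.0, 1.0, 5.0])  # unlike values are adjacent

print(round(morans_i(clustered, w), 3))    # positive: clustering of like values
print(round(morans_i(alternating, w), 3))  # negative: dispersion
```

A value near zero would indicate a spatially random pattern. Significance is usually judged against a permutation distribution, which this sketch omits.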
• The aim of the researcher is to specify and estimate a model that reasonably accounts for or incorporates
the spatial effects present in the data. These effects can be modeled as spatial heterogeneity and spatial
dependence. When first examining a spatial relationship, the researcher must ask whether the association
appears to be a reaction to some geophysical, cultural, social or economic force that works to create
spatial patterning (spatial heterogeneity), or an interaction, indicative of spatial dependence.
• If the association is merely a reaction to some general force, then a modeling strategy with a standard
regression structure may be appropriate.
• If, on the other hand, the association is an interaction suggesting some type of formal dependency among
units, then a modeling strategy with a spatially dependent covariance structure is the way to proceed. In
this instance, heterogeneity likely will not fully remove the spatial effects within the data. An alternative is
needed – a spatially oriented approach that formally incorporates a spatially lagged dependent variable
or spatially lagged error term.
• Spatial dependency modeling : example 1
• The shapefile newyork.shp is the map of Manhattan in New York City with Census 2000 data*. These are socioeconomic attributes for
297 Census tracts. It includes the following variables:
POLYID    Polygon ID
STATE     State FIPS
COUNTY    County FIPS
TRACT     Census Tract ID
sctrct00  FIPSID
hvalue    Median housing value
t0_pop    Total population
pctnhw    Percent non-Hispanic white persons
pctnhb    Percent non-Hispanic black persons
pcthsp    Percent Hispanic persons
pctasn    Percent Asian persons
t0p_own   Percent homeowners
t0p_coll  Percent college educated
t0p_prf   Percent of people employed in professional/managerial occupations
t0p_uemp  Percent of people unemployed
t0p_for   Percent foreign born persons
t0p_rec   Percent recent immigrants
t0_minc   Median household income
t0p_poor  Percent total population below poverty
* Source : http://www.s4.brown.edu/S4/Training/Modul2/GeoDa3FINAL.pdf
• Before starting a regression, create a weights file :
• In this example, we will predict neighborhood homeownership with several indicators :
insignificant effects
Test of multicollinearity of the model : one should be alarmed when the condition number is greater than 20.
The Jarque-Bera test is used to examine the normality of the distribution of the errors. The low probability of the test score suggests a non-normal distribution of the error term.
The low probabilities of the three tests point to the existence of heteroscedasticity. Error variance can be affected by spatial dependence in the data.
Moran’s I suggests spatial autocorrelation of the residuals.
Both the lag test and the error test are significant, indicating the presence of spatial dependence.
The robust tests help us understand what type of spatial dependence may be at work. The robust measure for error is still significant, but the
robust lag test becomes insignificant, which means that once error dependence is accounted for, the lag dependence disappears. This points toward the spatial error specification.
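The reading of the Lagrange Multiplier diagnostics described above follows a decision rule that is often recommended in the spatial econometrics literature. A minimal sketch of that rule is given here ; the function name, the 0,05 threshold and the p-values in the usage lines are hypothetical, since the actual GeoDa output is not reproduced in this text.

```python
def choose_spatial_model(p_lm_lag, p_lm_error, p_rlm_lag, p_rlm_error, alpha=0.05):
    """Suggest a specification from the p-values of the four LM diagnostics."""
    lag_sig = p_lm_lag < alpha
    error_sig = p_lm_error < alpha
    if not lag_sig and not error_sig:
        return "OLS"                      # no evidence of spatial dependence
    if lag_sig and not error_sig:
        return "spatial lag"
    if error_sig and not lag_sig:
        return "spatial error"
    # Both standard tests are significant: consult the robust versions.
    robust_lag_sig = p_rlm_lag < alpha
    robust_error_sig = p_rlm_error < alpha
    if robust_lag_sig and not robust_error_sig:
        return "spatial lag"
    if robust_error_sig and not robust_lag_sig:
        return "spatial error"
    # Robust tests agree with each other: prefer the smaller p-value
    # (ties fall to the spatial error model in this sketch).
    return "spatial lag" if p_rlm_lag < p_rlm_error else "spatial error"

# Hypothetical p-values mirroring the pattern described above:
# both LM tests significant, robust error significant, robust lag not.
print(choose_spatial_model(0.0001, 0.0001, 0.40, 0.001))  # spatial error
```

Such a rule is only a statistical guide ; a theoretical argument for spillover or for omitted spatially structured variables should weigh at least as heavily.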
• After identifying the presence of spatial dependence, we will use GeoDa to re-estimate the model when
controlling for spatial dependence.
The spatial lag term of homeownership (W_TOP_OWN) appears as an additional indicator. It has a positive effect and is highly significant. As a result, the model fit is improved (higher R-square ; cf. R2 = 0.495 with OLS regression).
Coefficient Rho reflects the spatial dependence in the sample data, measuring the average influence on observations by their neighboring observations.
Although the introduction of the spatial lag term improved the model fit, it didn’t make the spatial effects go away.
• Now let’s review the results for the spatial error model.
The coefficient of the spatially correlated errors is positive. The model fit is improved (higher R2).
Heteroscedasticity remains significant. Also, spatial error stays significant.
Although allowing the error terms to be spatially correlated improved the model fit, it didn’t make the spatial effects go away.
• Comparing the spatial lag and spatial error models, we can see that both models yield improvement to the
original OLS model. Therefore, controlling for spatial dependence improves model performance.
• Now the question is which of the two models is better ? To some extent, this is an open question. The
general advice is first to look for a theoretical basis to inform your choice. When it is not so clear
theoretically, you can compare the model performance parameters : the R-squared and log likelihood. In
this example, the spatial error model has greater R-squared and log likelihood values. That provides a
statistical basis to adopt this solution.
• Spatial dependency modeling : example 2
• Analysis of poverty in the U.S. *
* Source : http://csde.washington.edu/services/gis/workshops/SPREG.html
violation of regression
assumptions
spatial error model
Model           R2      Log Likelihood
OLS             0,780   2323,69
Spatial Lag     0,822   2457,37
Spatial Error   0,847   2504,64
The spatial error form results in a substantial reduction of spatial autocorrelation.
Part 2
Analyzing spatial heterogeneity with
geographically weighted regression
• Traditional regression analysis describes a modelled relationship between a dependent variable and a set
of independent variables. When applied to spatial data, the regression analysis often assumes that the
modelled relationship is stationary over space and produces a global model which is supposed to describe
the relationship at every location in the study area. This would be misleading, however, if relationships
being modelled are intrinsically different across space. One of the spatial statistical methods that attempts
to solve this problem and explain local variation in complex relationships is Geographically Weighted
Regression (GWR).
• In a global regression model, the dependent variable is often modelled as a linear combination of
independent variables, where a parameter belonging to each variable is assumed to be stationary over the
whole area (i.e. the model returns one value for each parameter). GWR extends this framework by
dropping the stationarity assumption: the parameters are assumed to be continuous functions of location.
The result of the GWR analysis is a set of continuous localised parameter estimate surfaces, which
describe the geography of the parameter space. These estimates are usually mapped or analysed
statistically to examine the plausibility of the stationarity assumption of the traditional regression and
different possible causes of non-stationarity.
The definitive text on GWR is : Fotheringham, A.S., Brunsdon, C. & Charlton, M.E., Geographically Weighted Regression : The Analysis of
Spatially Varying Relationships, Chichester, Wiley, 2002.
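The step from a global to a local model can be written compactly. The notation below is the form commonly used in the GWR literature, not a formula quoted from the slides ; (u_i, v_i) denotes the coordinates of location i, b the kernel bandwidth and d_ij the distance between locations i and j, all of which are symbols assumed here. The Gaussian kernel is one common choice among several.

```latex
% Global model: one coefficient per variable for the whole study area
y_i = \beta_0 + \sum_{k} \beta_k x_{ik} + \varepsilon_i

% GWR: every coefficient is a function of location (u_i, v_i)
y_i = \beta_0(u_i, v_i) + \sum_{k} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i

% Local estimator: weighted least squares with location-specific weights
\hat{\beta}(u_i, v_i) = \left( X^{\top} W(u_i, v_i)\, X \right)^{-1} X^{\top} W(u_i, v_i)\, y

% Gaussian kernel: the diagonal entries of W(u_i, v_i)
w_{ij} = \exp\!\left( -\tfrac{1}{2} \left( d_{ij} / b \right)^{2} \right)
```

When the bandwidth b grows very large, all weights approach 1 and the local estimates converge to the global OLS estimates.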
Geographically Weighted Regression
• The use of linear regression is common in many areas of science. Ordinary linear regression implicitly
assumes spatial stationarity of the regression model, that is, the relationships between the variables
remain constant over geographical space. We refer to a model in which the parameter estimates for every
observation in the sample are identical as a global model.
• Spatial non-stationarity occurs when a relationship (or pattern) that applies in one region does not apply
in another. Global models are statements about processes or patterns which are assumed to be stationary
and as such are location-independent, i.e. are assumed to apply to all locations. In contrast, local models are
spatial disaggregations of global models, the results of which are location-specific. The template of the
model is the same : the model is a linear regression model with certain variables, but the coefficients alter
geographically. If the parameter estimates are allowed to vary across the study area such that every
observation has its own separate set of parameter estimates we have a local model.
• GWR does not assume the relationships between independent and dependent variables are constant
across space. Instead, GWR explores whether the relationships between a set of predictors and an
outcome vary by geographical location. GWR is suggested to be a powerful tool for investigating spatial
non-stationarity in the relationship between predictors and the outcome variable.
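The difference between a global and a local model can be illustrated with a few lines of NumPy. This is a minimal sketch of the GWR idea (one weighted least-squares fit per location with a fixed Gaussian kernel), not the algorithm implemented in any particular GWR software. The function name, the bandwidth and the simulated data are hypothetical.

```python
import numpy as np

def gwr_local_coefficients(coords, X, y, bandwidth):
    """Fit one weighted least-squares regression per location (Gaussian kernel)."""
    coords = np.asarray(coords, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    betas = np.empty((len(y), design.shape[1]))
    for i, point in enumerate(coords):
        dist = np.linalg.norm(coords - point, axis=1)
        weights = np.exp(-0.5 * (dist / bandwidth) ** 2)  # nearby observations count more
        xtw = design.T * weights                          # X'W without building W
        betas[i] = np.linalg.solve(xtw @ design, xtw @ y)
    return betas

# Hypothetical data: the slope of x drifts from about -1 in the west to +1 in the east.
rng = np.random.default_rng(0)
n = 200
coords = rng.uniform(0, 10, size=(n, 2))
x = rng.normal(size=n)
true_slope = (coords[:, 0] - 5) / 5
y = 2.0 + true_slope * x + rng.normal(scale=0.1, size=n)

local = gwr_local_coefficients(coords, x, y, bandwidth=1.5)
global_beta = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)[0]

print("global slope :", round(float(global_beta[1]), 2))
print("local slopes :", round(float(local[:, 1].min()), 2), "to", round(float(local[:, 1].max()), 2))
```

In this simulated case the single global slope averages out opposite local effects, while the local slopes recover the west-east gradient. In practice the bandwidth is not fixed by hand but selected by cross-validation or an information criterion.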
• GWR4 is a new release of a Microsoft Windows-based application for calibrating geographically weighted
regression models, which can be used to explore geographically varying relationships between
dependent/response variables and independent/explanatory variables.
1. Give the session a name
2. Specify regression type and variable settings
3. Choose a geographic kernel type
4. Specify names for files storing the modelling results
5. Execute the session
For an extensive review of these 5 steps, see T. Nakaya, GWR4 User Manual, update 7 May 2012.
An introduction to macro-level spatial nonstationarity : A geographically weighted regression analysis of diabetes and poverty
• Type II diabetes is a growing health problem. Because the burden of diabetes falls disproportionately on
less advantaged individuals, poverty is one of the most important risk factors for diabetes.
• Micro-level (individual-level) research has consistently found positive associations between diabetes and
poverty. Poverty and diabetes may be related because economic disadvantage may limit people to poorer
diets and more sedentary lifestyles.
• Macro-level (context-level) investigations have also found a positive association between diabetes and
poverty. Rates of diabetes are higher in areas with higher economic deprivation.
• What follows is a study of the geographical variability in the relationship between poverty and
diabetes. We first show how a classical ordinary least squares regression captures the “global” and
positive relationship between diabetes and poverty (an increase in the concentration of poverty is
accompanied by an increase in the prevalence of diabetes). We then make use of an exploratory
geographically weighted regression to specify a local model. The findings reveal that the macro-level
diabetes-poverty relationship varies over geographical space.
• Theoretically, spatial non-stationarity is based on the concept of the social construction of space. The
interaction between individuals with each other and their physical environment produces space. Human
beings are just as much spatial as temporal beings. By spatial, we mean that we are most influenced
by what is immediate in space. What happens near us matters more than non-proximal events. Humans’
spatiality and temporality are essential and equally powerful in explaining human behavior. Consequently,
everything that is social is inherently spatial, just as everything spatial is inherently socialized.
• From this perspective, we analyse how the macro-level relationship between diabetes and poverty unfolds
over geographical space.
• Investigations on spatial non-stationarity focus on the phenomenon that two measurements taken from
geographically close locations are often more similar than measurements from more widely separated
locations (Tobler’s law (1970, p. 236) : “Everything is related to everything else, but near things are more
related than distant things”).
• For this reason, measures of spatial autocorrelation have been developed to deal with the tendency toward
interdependence among spatial data. Investigating diabetes prevalence requires that we expand our
understanding of how macro-level relationships vary as a function of geographical distance.
• In a global model, we can hypothesize that poverty and diabetes are positively related. In a local model,
we can hypothesize that the diabetes-poverty macro-level relationship will be spatially non-stationary.
Tobler, W.R., A computer movie simulating urban growth in the Detroit region, Economic Geography, 46, 1970, pp. 234-240.
• Traditional non-spatial research, including the OLS approach, assumes that the nature of statistical
relationships is the same for all points within the entire study area. With GWR, we can explore how the
diabetes-poverty relationship varies over space. The OLS results are thus the “global model” findings,
while the GWR outputs are the “local” analysis results.
• We first execute an OLS multivariate regression to show the linear association between diabetes and
poverty in US counties in the South Atlantic area (N=588)*. The goal of this “global model” is to verify the
positive association found in previous studies. In the OLS model we use the percentage of diabetes in the
county as the dependent variable and the percentage in poverty as the independent variable. We control
the relationship between poverty and diabetes prevalence for median income of households and the
percentage of people who completed high school. We then develop a GWR-model to account for spatial
variations. The GWR model contains the same variables used in the OLS regression.
* Source : http://www.ers.usda.gov/data-products/county-level-data-sets.aspx
We focus on the 588 contiguous counties because GWR analysis requires that all polygons be physically adjacent or in near physical proximity to at least
one other polygon with data on the variables of interest.
US counties South Atlantic : N = 588
Global results
• Poverty is positively associated with diabetes. The results of OLS-model 1 demonstrate that an
increase of one percentage point in the poverty concentration of a county is associated with a 0,15
percentage point increase in diabetes prevalence.
• Model 1 has an R2 of 0,262. While diabetes prevalence and percent in poverty are statistically
significantly related, a substantial proportion of the variation in diabetes prevalence remains
unexplained.
• After adding median income of households and the percentage of people who completed high
school to the regression equation, the effect of poverty is substantially reduced and no longer
significant and even the sign of the coefficient for poverty changes from positive to negative. The
R-square value for model 2 achieves a respectable 0,395*.
• We also note a problem : the residuals of the regression equation show strong spatial autocorrelation (Moran’s I =
0,328 ; p < 0,001)**, a clear indication that the model is in violation of at least one of the
assumptions underlying standard linear regression. The Moran test tells us that the residuals are
not independent. Moreover, the Koenker-Bassett test for heteroscedasticity indicates that the
residuals also are not distributed identically.
* Collinearity diagnostics were estimated using SPSS 20.0, and no problems of multicollinearity were found among the independent variables.
The collinearity diagnostics used were the variance inflation factors (VIF) and tolerances for individual variables.
Multicollinearity is said to exist if the VIF is 5 or higher (or equivalently, tolerances of 0,20 or less). The highest VIF in this analysis was 3,314 and
the lowest tolerance was 0,302 for median income of households.
** Moran’s I is strongly positive, indicating powerful positive autocorrelation (clustering of like values). LISA analysis demonstrates that most counties are
found in the high-high and low-low quadrants.
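The collinearity diagnostics mentioned in the first footnote can be reproduced outside SPSS. For predictor j, the tolerance is 1 - R2 of the regression of that predictor on all other predictors, and the VIF is its reciprocal, which is why the reported VIF of 3,314 corresponds to a tolerance of about 0,302. The following is a minimal NumPy sketch ; the function name and the simulated predictors are hypothetical and do not reproduce the county data.

```python
import numpy as np

def vif_and_tolerance(X):
    """Return (VIF, tolerance) for every column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    tolerance = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept and all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        tolerance[j] = ss_res / ss_tot  # equals 1 - R2
    return 1.0 / tolerance, tolerance

# Hypothetical predictors: poverty is constructed to correlate with income.
rng = np.random.default_rng(0)
income = rng.normal(size=500)
poverty = -0.8 * income + 0.6 * rng.normal(size=500)
education = rng.normal(size=500)
X_demo = np.column_stack([poverty, income, education])

vifs, tolerances = vif_and_tolerance(X_demo)
for name, v, t in zip(["poverty", "income", "education"], vifs, tolerances):
    print(f"{name:10s} VIF = {v:.3f}  tolerance = {t:.3f}")
```

Under the rule of thumb used here, a VIF of 5 or higher (tolerance of 0,20 or less) would signal multicollinearity.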
• Comparing the residual spatial autocorrelation (I = 0,328) with the spatial autocorrelation for the dependent
variable (I = 0,454) tells us that spatial autocorrelation in one or more independent variables “explains” a
portion of the spatial autocorrelation in the dependent variable*.
• It is frequently the case that the independent variables in a regression model can almost completely
account for the spatial autocorrelation in a dependent variable, thus removing a problematic spatially
autocorrelated residual. However, in the present case, the regressors have not satisfactorily accounted for
spatial dependence in the data, and a correction to the model clearly is necessary. But what type of
correction ? Might there be spillover effects among counties that influence the diabetes prevalence of their
neighbors (spatial lag model) ? Or does the residual dependence in the model likely stem from omitted
variables on the right-hand side of the regression equation, thus suggesting a spatial error model ?
* Moran’s I is calculated by specifying a matrix of weights that characterizes the structure of local dependence. In this analysis “neighbors” are defined
under the “first-order queen” convention, meaning that the neighbors for any given county “A” are those other counties that share a common
boundary with “A” (or single point of contact with “A”). Importantly, “A” is not considered a neighbor of itself and is excluded from the average.
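The first-order queen convention described in this footnote can be made concrete on a regular grid, where every cell sharing an edge or a corner with a cell counts as its neighbor. This is an illustrative sketch only : real counties are irregular polygons whose contiguity is derived from the shapefile, and the function name and values below are hypothetical.

```python
import numpy as np

def queen_weights(n_rows, n_cols):
    """Row-standardised first-order queen weights for a grid with at least two cells."""
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # (0, 0) is the cell itself, which is never its own neighbor.
                    if (dr, dc) != (0, 0) and 0 <= rr < n_rows and 0 <= cc < n_cols:
                        w[i, rr * n_cols + cc] = 1.0
    return w / w.sum(axis=1, keepdims=True)

w = queen_weights(3, 3)
values = np.arange(9, dtype=float)  # hypothetical attribute value per cell
spatial_lag = w @ values            # average of the neighbors' values

print("neighbors of the centre cell :", int(np.count_nonzero(w[4])))
print("neighbors of a corner cell   :", int(np.count_nonzero(w[0])))
print("spatial lag of the centre    :", spatial_lag[4])
```

Because the weights are row-standardised, the spatial lag of a unit is simply the mean of its neighbors' values, which is the quantity plotted on the vertical axis of a Moran scatterplot.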
• We used a spatial regression model to control for the spatial autocorrelation. We chose which spatial
dependence model to use (spatial lag or spatial error) using Lagrange Multiplier tests. Although both
models exhibit significant spatial dependence, we used the model with the highest test statistic, in this
case, the spatial error model.
• Aside from the remaining heteroscedasticity, the spatial error model appears to be a plausible alternative
to the OLS specification. The AIC score is lower and the explanatory power of the model increases
considerably over the OLS regression, with an R2 of 0,538.
• In contrast with OLS-model 2, the effect of poverty on diabetes is statistically significant, independent
from the median income of households and the percentage of people who completed high school.
• It is still not clear if spatial non-stationarity is a concern in our analysis. It is necessary to investigate the
homoscedastic assumptions underlying the OLS with local modeling.
OLS and spatial regression models predicting the prevalence of diabetes in US South Atlantic counties (N=588)
                                OLS (1)                 OLS (2)                    Spatial Error
independent variables           coeff.       std.err.   coeff.         std.err.    coeff.         std.err.
constant                        9,066**      0,185      18,662**                   20,146**       1,036
% poverty                       0,151**      0,010      -0,007         0,017       -0,040*        0,016
median income of households                             -0,000068**    0,000008    -0,000077**    0,000009
% completed high school                                 -0,051**       0,012       -0,059*        0,012
spatial error (Lambda)                                                             0,530**
heteroscedasticity              30,240**□               55,547**□                  48,399**●
R2                              0,262                   0,395                      0,538
AIC                             2233,690                2120,650                   2002,780

Lagrange Multiplier (Lag) : 72,872**
Robust LM (Lag) : 1,642
Lagrange Multiplier (Error) : 141,604**
Robust LM (Error) : 70,375**

* p<0,05 ** p<0,01
□ Koenker-Bassett test for heteroscedasticity
● Breusch-Pagan test for heteroscedasticity
OLS models and the spatial error model are estimated by making use of Open GeoDa 1.2.0 (August 2012) ©Luc Anselin, 2011, 2012
Local results
• In using spatial regression models we assume that the spatial process accounting for diabetes levels is the
same across the study area. That is, the relationship is spatially stationary. However, few social processes
will be found to be so constant over space. Global models will hide potential heterogeneity, or spatial non-stationarity,
in the determinants of diabetes. GWR provides a method to assess the degree to which the
relationship between the potential determinants and the prevalence of diabetes varies across space.
• The spatial non-stationarity of the relationship of each independent variable to the dependent variable
can be assessed to determine whether the GWR method offers any improvement over a global regression
model. The variability in the observed GWR estimates for the spatial units is compared to the variability of
the GWR results from a large number of random allocations of the analytical data across the units. Where one
finds a significant difference between the variability of an observed estimate and that computed using the
randomized data, spatial non-stationarity for that independent variable is indicated.
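The randomisation logic described above can be sketched as follows. This is a simplified illustration with one predictor, a fixed Gaussian kernel and the standard deviation of the local slopes as the measure of variability ; it is not the exact procedure of GWR4. All names, the bandwidth, the number of permutations and the simulated data are hypothetical.

```python
import numpy as np

def local_slopes(coords, x, y, bandwidth):
    """Local GWR slope of x at every location (Gaussian kernel, one predictor)."""
    design = np.column_stack([np.ones(len(y)), x])
    slopes = np.empty(len(y))
    for i, point in enumerate(coords):
        dist = np.linalg.norm(coords - point, axis=1)
        xtw = design.T * np.exp(-0.5 * (dist / bandwidth) ** 2)
        slopes[i] = np.linalg.solve(xtw @ design, xtw @ y)[1]
    return slopes

def monte_carlo_p_value(coords, x, y, bandwidth, n_perm=99, seed=0):
    """Share of random reallocations whose slope variability reaches the observed one."""
    rng = np.random.default_rng(seed)
    observed = local_slopes(coords, x, y, bandwidth).std()
    exceed = 0
    for _ in range(n_perm):
        # Randomly reallocate the observations (x, y) across the locations.
        shuffled = coords[rng.permutation(len(y))]
        if local_slopes(shuffled, x, y, bandwidth).std() >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Hypothetical data with a slope that drifts from west to east.
rng = np.random.default_rng(1)
n = 200
coords = rng.uniform(0, 10, size=(n, 2))
x = rng.normal(size=n)
y = 2.0 + ((coords[:, 0] - 5) / 5) * x + rng.normal(scale=0.1, size=n)

p_value = monte_carlo_p_value(coords, x, y, bandwidth=2.0)
print("Monte Carlo p-value for the slope :", p_value)
```

A small p-value means that the observed spread of the local estimates is rarely matched when the spatial arrangement is destroyed, which is the evidence for non-stationarity of that coefficient.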
• We first made use of a local Moran’s I cluster analysis of the residuals of the GWR model as a diagnostic
for the spatial dependence of the GWR residuals. We found no violations of residual independence.
• The GWR results can best be summarized through the maps of the parameter estimates and the Monte
Carlo tests. We provide maps of the local R2 values and for each of the independent variables with a
significant Monte Carlo test.
• The Monte Carlo tests for spatial variability of parameters indicate that the associations between the
independent variables and diabetes are all non-stationary across space. Explicitly, the associations we
found in OLS cannot be generalized to every location in the South Atlantic region. In contrast to OLS
(R2 = 0,395 for model 2), the GWR model explains 62,2 % of the total variance.
• As shown on the map of the local R2 values, the total variance explained by the local model ranges from
16,1 % to 81,1 %. The model fits the data well in the northern counties. Especially among the southern
counties, there are areas that may benefit from a model with additional covariates. Herein lies
the value of the GWR approach : without the ability to map the local R2, we would not know where our
model could be improved with additional covariates.
• The model results of the GWR can be interpreted in two ways. Those interested in a particular area can
use the model results for that place to get a multivariate understanding of key local determinants of the
diabetes prevalence. We will not do this here. An alternative way to examine the results is by considering
for each determinant the varying nature across the counties of the South Atlantic region.
• For example, the GWR coefficient for the percentage of poverty ranges from -0,33 to 0,32 which signals
that the poverty-diabetes macro-level association is spatially non-stationary. The blue marked counties
indicate areas where an increase in poverty predicts lower diabetes prevalence. The shift to light-blue
marked areas captures the spatially non-stationary relationship between poverty and diabetes. The
poverty-diabetes relationship fluctuates from negative to positive as a function of geographical location.
Similar results exist for the relationships between diabetes and, respectively, median household income and educational
attainment. In short, after accounting for location, we find that macro-level associations between
predictor variables and diabetes fluctuate as a function of geography.
• The previous analysis demonstrates that GWR addresses the need for place-specific or place-sensitive forms
of analysis.
• Effective locational decision making is essential for properly addressing many socio-economic, demographic
and health related concerns. Presently, these decisions are supported by quantitative models, which are
potentially powerful tools, but whose estimates are often affected by uncertainty, which reduces their
reliability.
• Uncertainty in the model parameters stems from two properties of geographical phenomena :
– spatial dependence : near things are more related than distant things ;
– spatial non-stationarity : variability over space.
• These two properties are mutually related, and most observed processes exhibit both, simultaneously.
• Advanced spatial analytical methods exist to correct for the effects of each property. However, despite the
recognized simultaneity of their occurrence, each advanced spatial method is designed to address only one
property. Spatial autoregressive methods address spatial dependence but do not account for non-
stationarity ; geographically weighted regression addresses non-stationarity but does not account for spatial
dependence.
Jan Kierzyk
 
PPTX
加拿大尼亚加拉学院毕业证书{Niagara在读证明信Niagara成绩单修改}复刻
Taqyea
 
PPTX
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
PPTX
Resmed Rady Landis May 4th - analytics.pptx
Adrian Limanto
 
PPTX
Exploring Multilingual Embeddings for Italian Semantic Search: A Pretrained a...
Sease
 
PPTX
Rocket-Launched-PowerPoint-Template.pptx
Arden31
 
PDF
AUDITABILITY & COMPLIANCE OF AI SYSTEMS IN HEALTHCARE
GAHI Youssef
 
PDF
2_Management_of_patients_with_Reproductive_System_Disorders.pdf
motbayhonewunetu
 
PPTX
Human-Action-Recognition-Understanding-Behavior.pptx
nreddyjanga
 
PDF
Context Engineering for AI Agents, approaches, memories.pdf
Tamanna
 
PDF
Data Chunking Strategies for RAG in 2025.pdf
Tamanna
 
PDF
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
PDF
Early_Diabetes_Detection_using_Machine_L.pdf
maria879693
 
MATRIX_AMAN IRAWAN_20227479046.docbbbnnb
vanitafiani1
 
AI Presentation Tool Pitch Deck Presentation.pptx
ShyamPanthavoor1
 
WEF_Future_of_Global_Fintech_Second_Edition_2025.pdf
AproximacionAlFuturo
 
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
apidays Helsinki & North 2025 - APIs in the healthcare sector: hospitals inte...
apidays
 
AI/ML Applications in Financial domain projects
Rituparna De
 
MusicVideoProjectRubric Animation production music video.pdf
ALBERTIANCASUGA
 
What does good look like - CRAP Brighton 8 July 2025
Jan Kierzyk
 
加拿大尼亚加拉学院毕业证书{Niagara在读证明信Niagara成绩单修改}复刻
Taqyea
 
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
Resmed Rady Landis May 4th - analytics.pptx
Adrian Limanto
 
Exploring Multilingual Embeddings for Italian Semantic Search: A Pretrained a...
Sease
 
Rocket-Launched-PowerPoint-Template.pptx
Arden31
 
AUDITABILITY & COMPLIANCE OF AI SYSTEMS IN HEALTHCARE
GAHI Youssef
 
2_Management_of_patients_with_Reproductive_System_Disorders.pdf
motbayhonewunetu
 
Human-Action-Recognition-Understanding-Behavior.pptx
nreddyjanga
 
Context Engineering for AI Agents, approaches, memories.pdf
Tamanna
 
Data Chunking Strategies for RAG in 2025.pdf
Tamanna
 
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
Early_Diabetes_Detection_using_Machine_L.pdf
maria879693
 

Spatial data analysis

  • 1. Spatial Data Analysis: An introduction to spatial autocorrelation and spatial regression analysis. January 2015. Johan Blomme | Leenstraat 11 | 8340 Damme-Sijsele | j.blomme@telenet.be | www.johanblomme.com
  • 2. Many research questions require analysis of complex patterns of interrelated social, behavioral, economic and environmental phenomena. In addressing these questions, it is increasingly argued that both spatial thinking and spatial analytical perspectives have an important role to play. Indeed, research on social stratification and inequality, health, mortality and fertility and many other issues depends on the collection and analysis of individual and context-level data. The geospatial and methodological development environment has changed. The volume, sources and forms of available geospatial data are growing rapidly. The flow of information from a host of sensors has grown exponentially in recent years to the point that many observations can be geo-referenced. Data storage and handling (e.g. cloud computing) change what, how and when we collect data on individuals and their environments. In a world where information is increasingly seen through geographic filters, the importance of spatial thinking is being recognized. More and more instances show that space and place are important elements and stress the leverage of place-based politics. For example, conventional approaches in health research underestimate the contribution of place to disease risk. Several studies reinforce the view that neighborhood context is an important condition of human well-being. Place emerges as an important contextual framework for considering a number of critical societal issues. Place as a social context is deeply connected to larger patterns of social advantage and disadvantage. Since the mid-1990s, there has been renewed interest in the much earlier tradition of spatial demography that focuses on areal aggregates as units of analysis. Trends in technology during the 1980s and 1990s brought sophistication to the world of spatial demography. Contributing factors were : – the U.S. Census Bureau’s TIGER files ; – extensive natural resource, crime and epidemiological databases ; – powerful GIS software for integrating and mapping spatial data ; – computing hardware platforms. These factors altered the way in which spatial demography research was carried out. Other trends that emerged were : – the use of exploratory spatial data analysis (ESDA) ; – the role of regression analysis in spatial demography ; – the special nature of spatial data that requires modification of the standard regression model (e.g. the role of geographically weighted regression for exploring spatial variation) ; – the need for attention to both global and local diagnostic tools. When analyzing spatial data from a large number of units (e.g. counties), it is the natural inclination of researchers to move from simple descriptive analysis to asking questions such as : How might these data be modeled ? How well can we account for variability in attribute values among geographic units ?
  • 3. To answer these questions, analysts turned to multivariate regression modeling, the common methodology in the social sciences. However, the application of the standard regression approach to data tied to spatial units brings special complications because “spatial is special”. Attention has been drawn to the fact that spatial data require special analytic approaches. Two properties are particularly important in the analysis of spatial data. The first, spatial dependence, refers to the tendency for spatial data to exhibit spatial autocorrelation. For most social phenomena mapped in space, local proximity usually results in value similarity. High values tend to be located near other high values, while low values tend to be located near other low values, thus exhibiting positive spatial autocorrelation. Less often, high values may tend to be co-located with low values (or vice versa), as islands of dissimilarity (negative spatial autocorrelation). In either case, the units of analysis in spatial demography likely fail a key assumption of classical statistics : independence among observations. With respect to statistical analysis that presumes such independence (e.g. standard regression analysis), positive autocorrelation means that the spatially autocorrelated observations bring less information to the model estimation process than would the same number of independent observations. The greater the extent of spatial autocorrelation, the more severe is the information loss. A quick explanation for the presence of spatial autocorrelation can be found in the oft-cited “first law of geography” enunciated by Tobler in 1970 : “Everything is related to everything else, but near things are more related than distant things” (Tobler, 1970 : 236). Tobler’s first law is somewhat unsatisfying because it doesn’t tell us why this phenomenon arises in practice. The answer to this question can only be approximated with models of the spatial process and the analyst’s theory about the process. The second property is spatial heterogeneity, the tendency for phenomena distributed in space to be statistically nonstationary (a lack of stability across space of one or more attribute values). Spatial heterogeneity confounds attempts to generalize because results of an analysis of a limited area will change when the boundaries of the area are shifted. One of the more recent and fascinating developments in the design of local statistics is the theoretical background and associated software to explore how regression parameters and regression model performance vary across a study region. Geographically weighted regression (GWR) is similar to a global regression model in that the familiar constant, regression coefficients and error term are all present within the regression specification. There are two ways in which GWR differs from standard (global) regression. First, a separate regression is carried out at each location (observation) using only the other observations that lie within a user-specified distance from that location. Second, the regression specification includes a statistical device which weights the attributes of nearby geographical units more highly than it does the attributes of distant geographical units. The result is a set of local regression parameters for each geographical unit. The regression is thus localized.
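The two GWR ingredients described above (a local regression per location, with distance-decaying weights) can be illustrated with a minimal sketch. It assumes a single covariate, a Gaussian kernel and a hard distance cutoff, and it includes the focal observation in its own regression; the function name `gwr_simple` and the parameters `bandwidth` and `cutoff` are hypothetical and are not taken from any GWR package.

```python
import math

def gwr_simple(coords, x, y, bandwidth, cutoff):
    """Fit y = a + b*x by weighted least squares at every location.

    Observations farther than `cutoff` are ignored; the rest get a
    Gaussian weight that decays with distance (assumed kernel).
    Returns one (intercept, slope) pair per location.
    """
    results = []
    for ci in coords:
        pts = []
        for cj, xj, yj in zip(coords, x, y):
            d = math.dist(ci, cj)
            if d <= cutoff:
                pts.append((math.exp(-0.5 * (d / bandwidth) ** 2), xj, yj))
        sw = sum(w for w, _, _ in pts)
        mx = sum(w * xj for w, xj, _ in pts) / sw   # weighted mean of x
        my = sum(w * yj for w, _, yj in pts) / sw   # weighted mean of y
        sxx = sum(w * (xj - mx) ** 2 for w, xj, _ in pts)
        sxy = sum(w * (xj - mx) * (yj - my) for w, xj, yj in pts)
        slope = sxy / sxx   # assumes x varies within the local window
        results.append((my - slope * mx, slope))
    return results

# Ten locations on a line; the x-y relationship differs between west and east.
coords = [(float(i), 0.0) for i in range(10)]
x = [float(i % 4) for i in range(10)]
y = [1 + 2 * xi if i < 5 else 1 - xi for i, xi in enumerate(x)]
for i, (a, b) in enumerate(gwr_simple(coords, x, y, bandwidth=1.0, cutoff=2.0)):
    print(i, round(a, 3), round(b, 3))
```

A global regression on these data would report a single compromise slope, whereas the local slopes change sign from the western to the eastern end of the study area, which is the kind of parameter heterogeneity GWR is meant to reveal.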
  • 4. A GWR approach to regression analysis is a highly useful exploratory device for understanding parameter heterogeneity in one’s data. The output of GWR enables the researcher to examine and map local parameter estimates and local regression diagnostics, thereby enabling assessment of the utility of the model for various portions of the larger study region. In the first part of this guide, we provide a general introduction to performing spatial autocorrelation and spatial regression analysis. We use GeoDa, software developed by the GeoDa Center for Geospatial Analysis and Computation at Arizona State University (http://geodacenter.asu.edu). In the second part, we model spatial data with geographically weighted regression to explain local variations in relationships. CONTENTS Part 1 An introduction to spatial autocorrelation and spatial regression with GeoDa : 1. Manipulating data ; 2. Mapping and exploratory data analysis ; 3. Spatial autocorrelation ; 4. Spatial regression. Part 2 Analyzing spatial heterogeneity with geographically weighted regression.
  • 5. Part 1 An introduction to spatial autocorrelation and spatial regression analysis with GeoDa
  • 6. • The development of specialized software for spatial data analysis has seen rapid growth since the late 1980s. • A substantial collection of spatial data analysis software is available, ranging from niche programs and commercial statistical and GIS packages to open source software environments such as R, Java and Python. • GeoDa, for example, is the result of an effort to facilitate spatial data analysis. The main objective of the software is to provide the user with a path that starts with simple mapping and geovisualization, moves to spatial autocorrelation analysis and ends with spatial regression.
  • 8. 1. Manipulating Spatial Data
  • 9. Manipulating Spatial Data: Creating point shape files from a .dbf file
  • 10. Manipulating Spatial Data: Creating Thiessen polygons as shape files (Tools → Shape → Points to polygon). Thiessen polygons are created as a polygon shape file derived from a point shape file. Each Thiessen polygon encloses one original point in such a way that every location within the polygon is closer to the enclosed point than to any other point. This corresponds to the notion of a geographic market area. Thiessen polygons allow the computation of contiguity-based spatial weights for point data, using the boundaries of the polygons to establish contiguity. Area and perimeter calculations are only supported for projected coordinates (Euclidean distance). For point shape files in unprojected latitude and longitude, the results will not be correct.
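The defining property of a Thiessen polygon (every location inside it is closer to its own generating point than to any other) can be checked with a small sketch. It assumes projected coordinates and Euclidean distance; the function name `nearest_generator` is hypothetical and this is not how GeoDa constructs the polygon boundaries.

```python
import math

def nearest_generator(location, points):
    """Return the index of the generating point whose Thiessen polygon
    contains `location`, i.e. the nearest point in Euclidean distance."""
    return min(range(len(points)), key=lambda i: math.dist(location, points[i]))

# Two hypothetical store locations; a customer belongs to the nearer store's market area.
stores = [(0.0, 0.0), (10.0, 0.0)]
print(nearest_generator((4.0, 1.0), stores))  # falls in the polygon of store 0
print(nearest_generator((6.0, 0.0), stores))  # falls in the polygon of store 1
```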
  • 11. Manipulating Spatial Data: Computing spatially lagged variables. Spatially lagged variables are weighted averages of the values for neighboring locations, as specified by a spatial weights matrix. The changes and additions made to a table only reside in memory and are not permanent. To make them permanent, the table must be saved to a new file : File → Save as → Shapefile name to save as. This results in three files being saved, with file extensions .shp, .shx and .dbf.
  • 12. 2. Mapping and Exploratory Data Analysis
  • 15. Univariate EDA: resource deprivation index (1970). A hinge value of 1.5 means that 1.5 times the interquartile range is used to define outliers.
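The hinge rule can be sketched as follows. The quartile convention used here (Python's `statistics.quantiles` with `method="inclusive"`) is an assumption and may differ slightly from the one GeoDa applies; the function names and the sample values are hypothetical.

```python
import statistics

def box_fences(values, hinge=1.5):
    """Lower and upper fences: quartiles extended by hinge * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - hinge * iqr, q3 + hinge * iqr

def outliers(values, hinge=1.5):
    """Values lying outside the fences."""
    low, high = box_fences(values, hinge)
    return [v for v in values if v < low or v > high]

index_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
print(box_fences(index_values))   # fences for hinge = 1.5
print(outliers(index_values))     # observations flagged as outliers
```

Raising the hinge to 3.0 widens the fences, so only more extreme observations are flagged.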
  • 16. Univariate EDA: sort on the variable to find outliers
  • 18. Multivariate EDA: Homicide data for counties around St Louis. Quintile map of the homicide rate ; quintile map of resource deprivation.
  • 19. Multivariate EDA: scatterplot and parallel coordinate plot (PCP)
  • 20. Multivariate EDA: Linking and brushing
  • 21. Multivariate EDA: Analyzing changes over time
  • 22. Multivariate EDA: Cartogram of the crime rate ; cartogram of Gini inequality
  • 23. Rate Smoothing: Ohio counties, total lung cancer deaths for White females, 1968. Selecting a rate variable from the data set reveals the problem of variance instability. Both the event and the population at risk are specified and the rate is calculated on the fly.
  • 24. Rate Smoothing: A commonly used notion in public health analysis is the concept of a standardized mortality ratio (SMR), i.e. the ratio of the observed mortality rate to a national (or regional) standard. GeoDa implements this in the form of an excess risk map. The excess risk is the ratio of the observed rate to the average rate computed for all the data. Note that this average is not the average of the county rates (instead, it is calculated as the ratio of the total sum of all events over the total sum of all populations at risk). Map legend : risk is higher than the state average ; risk is lower than the state average.
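The excess risk calculation described above can be sketched in a few lines. The function name `excess_risk` and the county figures are hypothetical; the reference rate is the ratio of total events to total population at risk, not the mean of the county rates.

```python
def excess_risk(events, pops):
    """Ratio of each area's raw rate to the overall rate (sum of events / sum of populations)."""
    reference = sum(events) / sum(pops)
    return [(e / p) / reference for e, p in zip(events, pops)]

events = [2, 10, 8]          # hypothetical death counts per county
pops = [100, 400, 500]       # hypothetical populations at risk
print(excess_risk(events, pops))

# The reference rate differs from the unweighted mean of the county rates:
print(sum(events) / sum(pops), sum(e / p for e, p in zip(events, pops)) / len(events))
```

Values above 1 indicate a risk higher than the overall average, values below 1 a lower risk.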
  • 25. Rate Smoothing: saved to the table (right click on the previous map) ; there is no difference between rescaled raw rates and raw rates
  • 26. Rate Smoothing: A new outlier is added. Empirical Bayes smoothing consists of computing a weighted average between the raw rate for each county and the state average, with weights proportional to the underlying population at risk. Small counties will tend to have their rates adjusted considerably, whereas for larger counties the rates will barely change.
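A minimal sketch of this shrinkage idea follows. It uses one common method-of-moments formulation of the Empirical Bayes estimator; whether GeoDa applies exactly these formulas is an assumption, and the function name `eb_smooth` and the county figures are hypothetical.

```python
def eb_smooth(events, pops):
    """Shrink raw rates toward the overall rate; small populations shrink most.

    Assumed method-of-moments estimates:
      mu  = total events / total population (prior mean)
      var = population-weighted variance of raw rates minus mu / mean population
    The weight on the raw rate is var / (var + mu / population).
    """
    n = len(events)
    total_p = sum(pops)
    mu = sum(events) / total_p
    rates = [e / p for e, p in zip(events, pops)]
    var = sum(p * (r - mu) ** 2 for p, r in zip(pops, rates)) / total_p - mu / (total_p / n)
    var = max(var, 0.0)  # a negative estimate is truncated to zero
    smoothed = []
    for r, p in zip(rates, pops):
        denom = var + mu / p
        w = var / denom if denom > 0 else 0.0
        smoothed.append(w * r + (1 - w) * mu)
    return smoothed

events = [5, 100, 300]
pops = [100, 10000, 20000]
for raw, eb in zip([e / p for e, p in zip(events, pops)], eb_smooth(events, pops)):
    print(round(raw, 5), round(eb, 5))
```

In this example the rate of the county with 100 inhabitants moves most of the way toward the overall rate, while the rate of the largest county barely changes.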
  • 27. Rate Smoothing: Spatial rate smoothing consists of computing the rate in a moving window that includes the county as well as its neighbors. In GeoDa, neighbors are defined by means of a spatial weights file. We will construct a simple spatial weights file consisting of the 8 nearest neighbors for each county in the Ohio shapefile.
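The moving-window calculation can be sketched as below, assuming neighbors are the k nearest centroids in Euclidean distance (the slide uses k = 8; the toy example uses k = 1). The function name `spatial_rate_smooth` and the data are hypothetical.

```python
import math

def spatial_rate_smooth(coords, events, pops, k):
    """Rate over a window made of each area plus its k nearest neighbors."""
    smoothed = []
    for i, p in enumerate(coords):
        others = sorted((j for j in range(len(coords)) if j != i),
                        key=lambda j: math.dist(p, coords[j]))
        window = [i] + others[:k]
        # Pool events and populations over the window before dividing.
        smoothed.append(sum(events[j] for j in window) / sum(pops[j] for j in window))
    return smoothed

coords = [(0, 0), (1, 0), (2, 0), (10, 0)]   # hypothetical county centroids
events = [1, 3, 2, 4]
pops = [10, 10, 20, 40]
print(spatial_rate_smooth(coords, events, pops, k=1))
```

Because events and populations are pooled before the division, each smoothed rate rests on a larger base than the raw rate, which is why the smoothed map shows broader regional patterns.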
  • 28. Rate Smoothing: A spatially smoothed box map emphasizes broad regional patterns. Note how there are no more outliers.
  • 30. Spatial Autocorrelation: • Spatial autocorrelation is a measure of spatial dependency that quantifies the degree of spatial clustering or dispersion in the values of a variable measured across a set of locations. • There are two basic types of spatial autocorrelation statistics : global measures identify whether the values of a variable exhibit a significant overall pattern of regional clustering, whereas local measures identify the location of significant high and low value clusters.
  • 31. Spatial Autocorrelation: • Basics : Steps in determining the extent of spatial autocorrelation : – choose a neighborhood criterion : which areas are linked ? – assign weights to the areas that are linked : create a spatial weights matrix – run statistical tests, using the weights matrix, to examine spatial autocorrelation
  • 32. Spatial Autocorrelation: • Spatial autocorrelation measures the correlation of a variable with itself through space. Spatial autocorrelation can be positive or negative. Positive spatial autocorrelation occurs when similar values occur near one another. Negative spatial autocorrelation occurs when dissimilar values occur near one another. • Spatial weights are essential for the computation of spatial autocorrelation statistics. • Spatial weights can be based on contiguity from polygon boundary files or calculated from the distance between points.
  • 33. Contiguity Based Weights (.GAL file ; polygon shape files): Rook contiguity uses only common boundaries to define neighbors. Queen contiguity uses all common points (boundaries and vertices), which gives a denser connectedness structure. Weights can be of 1st order or higher order ; the higher-order weights construction removes redundancies and circularities.
  • 34. Rook Contiguity: the header line of the .GAL file contains a flag, the number of observations, the name of the polygon shape file and the name of the key variable
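As an illustration of this file layout, the sketch below reads a small GAL-style text. It assumes the commonly documented layout in which the header line is followed, for each observation, by one line with the observation id and its number of neighbors and one line listing the neighbor ids. The sample content (shape file name `example`, key variable `POLYID`) is hypothetical, as is the function name `parse_gal`.

```python
def parse_gal(text):
    """Parse GAL-style text into a header dict and a neighbor dictionary."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    flag, n_obs, shapefile, key_var = lines[0]
    header = {"flag": flag, "n": int(n_obs), "shapefile": shapefile, "key": key_var}
    neighbors = {}
    i = 1
    while i < len(lines):
        obs_id, count = lines[i][0], int(lines[i][1])
        if count > 0:
            neighbors[obs_id] = lines[i + 1]   # ids of the neighboring polygons
            i += 2
        else:
            neighbors[obs_id] = []             # an isolated observation ("island")
            i += 1
    return header, neighbors

# Rook contiguity for a hypothetical 2 x 2 grid of polygons numbered 1..4.
sample = """0 4 example POLYID
1 2
2 3
2 2
1 4
3 2
1 4
4 2
2 3
"""
header, neighbors = parse_gal(sample)
print(header)
print(neighbors)
```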
  • 37. Contiguity Based Weights: Comparison of connectedness structure for rook and queen contiguity
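The difference in connectedness between the two criteria is easiest to see on a regular lattice. The sketch below is an illustration under that simplifying assumption (real polygon files need geometry tests for shared edges and vertices); the function name `lattice_neighbors` is hypothetical.

```python
def lattice_neighbors(rows, cols, criterion="rook"):
    """First-order neighbors of every cell (r, c) on a regular lattice.

    rook  : cells sharing an edge
    queen : cells sharing an edge or a corner
    """
    rook_steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    corner_steps = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    steps = rook_steps + (corner_steps if criterion == "queen" else [])
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            neighbors[(r, c)] = [
                (r + dr, c + dc)
                for dr, dc in steps
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
    return neighbors

rook = lattice_neighbors(3, 3, "rook")
queen = lattice_neighbors(3, 3, "queen")
print(len(rook[(1, 1)]), len(queen[(1, 1)]))   # center cell
print(len(rook[(0, 0)]), len(queen[(0, 0)]))   # corner cell
```

Every rook neighbor is also a queen neighbor, so the queen criterion can only add links.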
  • 38. Higher Order Contiguity: rook contiguity
  • 39. Higher Order Contiguity: pure 2nd order rook contiguity
  • 40. Higher Order Contiguity: cumulative 2nd order rook contiguity
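The distinction between pure and cumulative higher-order contiguity can be sketched with a breadth-first search over first-order neighbors. Taking the order of a neighbor to be the smallest number of steps needed to reach it is an assumption made here as one way to avoid redundant and circular paths; the function names are hypothetical.

```python
from collections import deque

def contiguity_order(neighbors, start):
    """Smallest number of first-order steps from `start` to every reachable unit."""
    order = {start: 0}
    queue = deque([start])
    while queue:
        unit = queue.popleft()
        for nb in neighbors[unit]:
            if nb not in order:          # first visit = shortest path
                order[nb] = order[unit] + 1
                queue.append(nb)
    return order

def higher_order(neighbors, start, k, cumulative=False):
    """Pure k-th order neighbors, or all neighbors of order 1..k if cumulative."""
    order = contiguity_order(neighbors, start)
    if cumulative:
        return sorted(u for u, o in order.items() if 1 <= o <= k)
    return sorted(u for u, o in order.items() if o == k)

# Four units in a row: 0 - 1 - 2 - 3
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(higher_order(chain, 0, 2))                   # pure 2nd order
print(higher_order(chain, 0, 2, cumulative=True))  # 1st and 2nd order together
```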
  • 41. Higher Order Contiguity: locations with 5 first-order rook neighbors
  • 42. Distance-Based Weights (.GWT file ; point or polygon shape files): threshold distance or k-nearest neighbors ; 1st order or higher order. GeoDa calculates the minimum distance required to ensure that each observation has at least one neighbor. Spatial weights based on a distance threshold can lead to a very unbalanced connectedness structure (especially when spatial units have very different areas, with small areas having many neighbors while larger ones may have only a few). A commonly used alternative consists of considering the k-nearest neighbors.
  • 43. Distance-Based Weights: In contrast to contiguity weights, distance-based spatial weights can be calculated for both point shape files and polygon shape files. For polygon files, if no coordinate variables are specified, the polygon centroids will be used as the basis for the distance calculation. When polygon shape files are used, maps must be projected (e.g. UTM) for proper computation of centroids. For unprojected maps, the resulting centroids will only be approximate. The threshold shown is the minimum distance required to ensure that each location has at least one neighbor. If the points are in latitude and longitude, select the <Arc Distance> option.
  • 44. Distance-Based Weights: Connectivity for distance-based weights (distance between neighbor pairs). The distribution of the number of neighbors has a much broader range compared to contiguity-based weights. Some points are clustered while others are far apart. The minimum threshold needed to avoid islands may be too large for many or most locations in the data set. In such cases, care is needed in the specification of the distance threshold, and the use of k-nearest neighbor weights may be more appropriate.
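The trade-off between a distance threshold and k-nearest neighbors can be illustrated with a short sketch, assuming projected coordinates and Euclidean distance. The minimum threshold that leaves no observation without a neighbor is the largest nearest-neighbor distance in the data set. Function names and coordinates are hypothetical.

```python
import math

def min_threshold(coords):
    """Largest nearest-neighbor distance: the smallest band without islands."""
    return max(
        min(math.dist(p, q) for j, q in enumerate(coords) if j != i)
        for i, p in enumerate(coords)
    )

def distance_band(coords, threshold):
    """Neighbors are all other points within `threshold`."""
    return {
        i: [j for j, q in enumerate(coords) if j != i and math.dist(p, q) <= threshold]
        for i, p in enumerate(coords)
    }

def knn(coords, k):
    """Neighbors are the k nearest other points, so every unit has exactly k."""
    return {
        i: sorted((j for j in range(len(coords)) if j != i),
                  key=lambda j: math.dist(p, coords[j]))[:k]
        for i, p in enumerate(coords)
    }

# Three clustered points and one remote point.
coords = [(0, 0), (1, 0), (2, 0), (10, 0)]
t = min_threshold(coords)
print(t)
print(distance_band(coords, t))   # uneven numbers of neighbors
print(knn(coords, 2))             # exactly two neighbors each
```

The remote point forces a large threshold, which gives the clustered points more neighbors than the remote one; the k-nearest neighbor criterion avoids this imbalance at the cost of linking units that may be far apart.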
  • 45. Spatially Lagged Variables: Spatially lagged variables are an essential part of the computation of spatial autocorrelation tests and the specification of spatial regression models. GeoDa computes these variables on the fly, but in some instances it is useful to calculate spatially lagged variables explicitly. We will calculate a spatially lagged variable for the variable HH_INC (census tract median household income) in the Sacramento file. The first thing we do is open the spatial weights file we created. Then we create a new field that is added to the table. The value of the spatially lagged variable “W_HH_INC” for a location is the mean of the HH_INC values of its neighbors.
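With row-standardized weights, the spatial lag reduces to the mean of the neighboring values, which a few lines can make concrete. The function name `spatial_lag`, the neighbor structure and the income values are hypothetical and are not taken from the Sacramento file.

```python
def spatial_lag(values, neighbors):
    """Row-standardized spatial lag: the mean of each unit's neighboring values."""
    return [
        sum(values[j] for j in neighbors[i]) / len(neighbors[i])
        for i in range(len(values))
    ]

hh_inc = [10, 20, 40]                        # hypothetical tract incomes (in thousands)
neighbors = {0: [1, 2], 1: [0], 2: [0, 1]}   # hypothetical neighbor lists
w_hh_inc = spatial_lag(hh_inc, neighbors)
print(w_hh_inc)
```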
  • 47. Global Spatial Autocorrelation: • Global spatial autocorrelation is handled in GeoDa by means of Moran’s I spatial autocorrelation statistic and its visualization in the form of a scatterplot. • Global spatial autocorrelation analysis requires a spatial weights file, and a variable must be specified. • Spatial autocorrelation analysis is implemented in its traditional univariate form as well as in a bivariate form.
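For reference, the usual textbook definition of Moran's I is given below; the notation ($w_{ij}$ for the spatial weights, $z_i$ for deviations from the mean, $S_0$ for the sum of all weights) is introduced here for illustration and does not appear on the slides.

```latex
I = \frac{n}{S_0}\,
    \frac{\sum_{i}\sum_{j} w_{ij}\, z_i z_j}{\sum_{i} z_i^{2}},
\qquad z_i = x_i - \bar{x},
\qquad S_0 = \sum_{i}\sum_{j} w_{ij}
```

With row-standardized weights (each row of weights sums to one) $S_0 = n$, so that $I = \sum_i z_i (Wz)_i / \sum_i z_i^2$. This is the slope of the regression of the spatial lag $Wz$ on $z$, which is why Moran's I can be read off the scatterplot as the slope of the fitted line.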
  • 48. Global Spatial Autocorrelation: Moran’s I for Columbus data (variable = crime ; spatial weights file = rook-based contiguity file)
  • 49. Global Spatial Autocorrelation: [Figure: Moran scatterplot with its four quadrants numbered (1) to (4), each labeled as indicating either positive or negative autocorrelation.]
  • 50. Global Spatial Autocorrelation: Reference distribution of Moran’s I, calculated for spatially random layouts with the same data as observed (none of the simulated values is larger than the observed 0.52)
  • 51. Global Spatial Autocorrelation: Moran’s I = 0.479487
  • 52. Global Spatial Autocorrelation: the slope of the regression line changes as specific locations (in this case 1 location) are excluded from the calculation
  • 53. Global Spatial Autocorrelation: Inference for Moran’s I is based on a random permutation procedure, which recalculates the statistic many times to generate a reference distribution. The obtained statistic is then compared to this reference distribution and a pseudo significance level is computed.
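The permutation procedure can be sketched as follows, assuming row-standardized weights given as neighbor lists and a one-sided pseudo p-value computed as (number of permuted statistics at least as large as the observed one + 1) / (number of permutations + 1). Function names, the seed and the toy data are hypothetical.

```python
import random

def morans_i(values, neighbors):
    """Moran's I with row-standardized weights (every unit needs >= 1 neighbor)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    lag = [sum(z[j] for j in neighbors[i]) / len(neighbors[i]) for i in range(n)]
    return sum(zi * li for zi, li in zip(z, lag)) / sum(zi * zi for zi in z)

def permutation_test(values, neighbors, permutations=999, seed=0):
    """Observed Moran's I and its pseudo p-value under random relabeling."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)            # spatially random layout of the same data
        if morans_i(shuffled, neighbors) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (permutations + 1)

# Ten units in a row with steadily increasing values: strong positive autocorrelation.
n = 10
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
values = list(range(n))
print(permutation_test(values, chain))
```

The number of permutations bounds the smallest attainable pseudo p-value: with 999 permutations it cannot fall below 0.001.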
  • 55. Local Spatial Autocorrelation: • Global measures : global spatial autocorrelation (Moran’s I) : a single value which applies to the entire data set (the same pattern or process occurs over the entire geographical area ; an average for the entire area). • Local measures : local spatial autocorrelation (LISA) : a value calculated for each observation unit (different patterns or processes may occur in different parts of the region ; a unique number for each location).
  • 56. Local Spatial Autocorrelation: • Local spatial autocorrelation is based on local Moran LISA statistics. This yields a measure of spatial autocorrelation for each individual location. • Both univariate and multivariate LISA are included in GeoDa. • The input needed for local spatial autocorrelation is the same as for global spatial autocorrelation.
  • 57. Local Spatial Autocorrelation: The significance map shows the locations with significant local Moran statistics. The high-high and low-low locations (positive local spatial autocorrelation) are typically referred to as spatial clusters, while the low-high and high-low locations are termed spatial outliers (while outliers are single locations by definition, this is not the case for clusters).
  • 59. Local Spatial Autocorrelation: The result for univariate LISA is a special choropleth map showing those locations with a significant local Moran statistic (depending on the significance level). In the map below, the significance map is shown for the CRIME variable in the Columbus data set, using rook contiguity.
  • 60. Local Spatial Autocorrelation: The cluster map is a special choropleth map showing those locations with a significant local Moran statistic, classified by type of spatial correlation : bright red for the high-high association and bright blue for low-low. The high-high and low-low locations suggest clustering of similar values, while the high-low and low-high locations indicate spatial outliers.
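The classification behind such a cluster map can be sketched as follows, assuming row-standardized weights and the common form of the local Moran statistic, I_i = z_i · lag(z)_i / m2 with m2 the mean squared deviation. The sketch only assigns quadrant labels; the significance filtering (conditional permutation) is not shown. Function names and data are hypothetical.

```python
def local_moran(values, neighbors):
    """Local Moran statistic and quadrant label (HH, LL, HL, LH) per location."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(zi * zi for zi in z) / n
    result = []
    for i in range(n):
        lag = sum(z[j] for j in neighbors[i]) / len(neighbors[i])
        own = "H" if z[i] > 0 else "L"       # value at the location itself
        around = "H" if lag > 0 else "L"     # average of the neighboring values
        result.append((z[i] * lag / m2, own + around))
    return result

# Six units in a row: high values on the left, low values on the right.
n = 6
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
values = [10, 9, 8, 1, 2, 1]
for stat, label in local_moran(values, chain):
    print(round(stat, 3), label)
```

With row-standardized weights the average of the local statistics equals the global Moran's I, which links the local and the global measure.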
  • 61. Local Spatial Autocorrelation: It is strongly recommended that a sensitivity analysis be carried out before interpreting the results of LISA maps as “significant” clusters. The randomization option provides a way to assess the numerical stability of the results. The significance filter is designed to assess how conclusions depend on the chosen significance level.
  • 62. Local Spatial Autocorrelation: LISA maps after applying a significance filter.
  • 63. Local Spatial Autocorrelation: When Moran’s I statistic is calculated for rates or proportions, the underlying assumption of stationarity may be violated by the intrinsic variance instability of rates. This occurs when the population at risk (the base) varies considerably across observations. The variance instability may lead to spurious inferences for Moran’s I. To correct for this, GeoDa implements the Empirical Bayes (EB) standardization. This is implemented for both the global (Moran scatter plot) and local spatial autocorrelation statistics. To illustrate this, we will use the Scottish lip cancer data set and associated weights file to compare the results of calculating Moran’s I based on the non-standardized rates with the results of the EB standardization.
  • 64. Local Spatial Autocorrelation: The value for Moran’s I of 0.527 differs somewhat from the statistic for the unstandardized rates (0.479). More important is to assess whether or not inference is affected. The resulting permutation distribution still suggests a highly significant statistic.
  • 65. Spatial Autocorrelation: • Practice : Spatial patterns of rural poverty : An exploratory analysis in the São Francisco River Basin, Brazil (Nova Economia, Belo Horizonte, 21 (1), 45-66, January-April 2011). This study uses recently released municipio-level data on rural poverty in Brazil to identify and analyze spatial patterns of rural poverty in the São Francisco River Basin (SFRB). Moran’s I statistics are generated and used to test for spatial autocorrelation, and to prepare cluster maps that locate rural poverty “hot spots” and “cold spots”. The results indicate that poverty reduction in the SFRB should take into account the spatial distribution of poverty. Not only is poverty in the SFRB clustered spatially, but the bulk of the basin’s poor reside in municipios that comprise the poverty “hot spots” the study identifies. These clusters did not correspond to state-level boundaries, so scope may exist for geographically refocusing poverty reduction efforts to make them more efficient.
  • 68. Spatial Autocorrelation: • Information on spatial patterns of rural poverty in the SFRB may shed light on the importance of location as a causal factor per se. Municipios may be more likely to have high (or low) rural poverty rates depending on where they are located geographically : – one obvious reason is the stock of natural resources (natural resources are not evenly distributed across space) : for farm activities, for example, good soils and easy access to water may improve agricultural conditions, productivity and income ; – job and income providers such as firms and service-oriented businesses tend to concentrate in space in order to benefit from large markets (economies of scale) and the availability of specialized skilled labor.
  • 69. Spatial Autocorrelation: • The value of Moran’s I is equal to 0.72, which suggests a strong positive spatial autocorrelation of rural poverty. This number suggests that for the SFRB, there are more locations with high (low) rural poverty rates surrounded by locations with high (low) rural poverty rates than would be the case if poverty were distributed randomly. • The value of Moran’s I also suggests that poverty in the SFRB is spatially distributed in clusters and that poverty in one area increases the likelihood of poverty in its neighbors. However, the value of Moran’s I does not tell us where rural poverty clusters might be, but rather suggests that the spatial pattern of poverty is not random (there is more similarity in poverty, or in the absence of it, than would be expected if the pattern were random). • Using EB standardization to reduce variance instability yields a coefficient of 0.83, compared with 0.72 in the initial calculation of Moran’s I. This indicates that the correlation between rural poverty rates in location i and neighboring locations is stronger when rates are standardized. Hence, increasing the precision with which rural poverty is measured will likely increase the spatial correlation among rural poverty rates in the SFRB.
  • 70. Spatial Autocorrelation: • Although a Moran’s I of 0.83 strongly indicates that the spatial distribution of rural poverty is not random, it does not locate poverty clusters. • To locate “hot spots” and “cold spots”, local indicators of spatial autocorrelation (LISA) must be used. LISA provides location-specific information and estimates the extent of spatial autocorrelation between the value of a given variable (rural poverty) in a particular location and the values of the same variable in locations around it. This makes it possible to identify spatial clusters of rural poverty. • LISA detects 3 clusters of rural poverty in the SFRB. Clusters 1 and 2 are rural poverty “hot spots” and correspond to positive, high-high spatial autocorrelation, indicating spatial clusters of locations with above-average rural poverty rates. Cluster 3 is a “cold spot” and corresponds to positive, but low-low spatial autocorrelation, indicating a cluster of locations with below-average rural poverty rates.
  • 72. Spatial Autocorrelation: • As mentioned before, the clusters of rural poverty may be attributable to several reasons, but further analysis is required to determine the causes of spatial patterns of rural poverty in the SFRB. Multivariate regression analysis that takes into account the variables that may explain poverty is the appropriate approach to the analysis of the spatial determinants of patterns of rural poverty in the SFRB. • The results of this study suggest that poverty reduction policies in the SFRB should take into account the spatial distribution of poverty. The analysis suggests that location as a causal factor per se is important and that locations are indeed more likely to have high (or low) rural poverty rates depending on where they are located in the basin. This may be due to obvious reasons such as the stock of natural resources, soil quality, access to water, etc. • More importantly, the analysis shows that poverty in one location is affected by (or affects) poverty in neighboring locations. That is, there are spillovers, either positive or negative externalities, that make locations more or less likely to get out of poverty. These spillovers may be associated with the concentration (or lack of concentration) of firms, technology and knowledge. These results set the stage for identifying factors that influence rural poverty in the SFRB, factors that may themselves be spatially correlated.
  • 74. • When moving from simple descriptive analyses to data modeling, analysts turn to multivariate regression modeling to account for variability in attribute values among geographic units by identifying other covariates of the attribute of interest. • Attributes of spatially referenced data generally violate at least one of the assumptions underlying the standard regression model, which necessitates both caution regarding these violations and attention to methods designed to correct for them. Spatial Regression www.johanblomme.com 70
  • 75. • Spatial variation : spatial heterogeneity versus spatial dependence • When undertaking an initial EDA of spatial data, it is worthwhile to develop a sense of the spatial distribution of the attribute values. By mapping the distributions of variables across space, a distinction can be made between two types of spatial variation. • Spatial heterogeneity : large-scale regional differentiation (among attribute values) is an important component of spatial variation. Spatial heterogeneity is the lack of stability across space of one or more attribute values. Heterogeneity gives recognition to the common observation that values of a variable are not the same across space. • Spatial heterogeneity follows from the intrinsic uniqueness of each location. It is consistent with the description of places as particular moments of intersecting social relations. The unique combination of social forces in one place may produce effects which would not happen otherwise. These social forces include nonmaterial forces (e.g. cultural and/or historical processes) that cannot easily or always be quantified, yet they shape otherwise measurable social relationships. The spatial regime approach permits the analyst to move beyond geography per se by focusing on the social, economic and demographic factors (combined : sociological factors) that comprise the context of place. This approach is intended to enable the analyst to address the “so what” question : what is it about a place that distinguishes it from other places ?
  • 76. • Spatial dependence refers to small-scale spatial effects that manifest a lack of independence among observations (spatial clustering). The assumption is that dependence among the observations derives from spatial interaction among the units of analysis, which can be defended theoretically and captured statistically by a spatially lagged “neighborhood” effect. • Two forms of spatial models are commonly used to improve regressions on spatially correlated data : – The spatial lag model : if two locations are adjacent, the value of the dependent variable in the first location can be influenced by the value of the dependent variable in the other. This means that there is a contagion or diffusion effect, which is best represented by a spatial lag model. – The spatial error model : if the residuals of locations are influenced by one another, this means that the phenomenon under study is not analysed at the correct geographical level, or that there might be an unobserved variable correlated with the spatial structure of the data. This implies a clustering effect, which has to be studied with a spatial error model. • A spatial lag model is appropriate if neighboring locations influence one another ; the spatial error model documents that locations cluster geographically, but for an unknown reason.
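The two specifications are usually written as follows. The slides do not give the equations, so the notation is an assumption taken from the standard spatial econometrics literature : y is the vector of the dependent variable, X the matrix of covariates, W the (row-standardized) spatial weights matrix, and ε an independent error term. The symbols ρ and λ correspond to the “Rho” and “Lambda” coefficients reported in the GeoDa output discussed further on.

```latex
% Spatial lag model: the neighbors' values of y enter the right-hand side
y = \rho W y + X\beta + \varepsilon

% Spatial error model: the dependence sits in the disturbance term
y = X\beta + u, \qquad u = \lambda W u + \varepsilon
```

When ρ = 0 or λ = 0, each model reduces to the standard regression model y = Xβ + ε.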
  • 77. [Figure] Spatial distribution of population change among Great Plains counties, 1990-2000 : spatial heterogeneity across counties and spatial dependence (clustering). [Figure] Moran scatterplot of population change. Source : P.R. Voss, K.J. Curtis White & R.B. Hammer, Explorations in spatial demography, in W.A. Kandel & D.L. Brown, Population change and rural society, Springer, 2006, pp. 407-429.
  • 78. • A model with spatial lags is able to borrow information from neighborhood observations because of the spatial autocorrelation among the units of analysis. The units of analysis likely fail a formal statistical test of randomness and thus fail to meet a key assumption of classical statistics : independence among observations. With respect to statistical techniques that presume such independence (e.g. standard regression analysis), positive autocorrelation means that the spatially autocorrelated observations bring less information to the model estimation process than would the same number of independent observations. • A carefully selected variable can account for spatial heterogeneity in the data and might boost the explanatory value of the model and largely remove the large-scale spatial process, but spatial autocorrelation would persist if a spatial dependence process were also indicated. There would remain in the data a more complicated, interactive spatial relationship among neighbors that suggests the requirement of some type of autoregressive term in the regression specification. Spatial Regression www.johanblomme.com 74
  • 79. • The aim of the researcher is to specify and estimate a model that reasonably accounts for or incorporates the spatial effects present in the data. These effects can be modeled as spatial heterogeneity and spatial dependence. When first examining a spatial relationship, the researcher must ask whether the association appears to be a reaction to some geophysical, cultural, social or economic force that works to create spatial patterning (spatial heterogeneity), or an interaction, indicative of spatial dependence. • If the association is merely a reaction to some general force, then a modeling strategy with a standard regression structure may be appropriate. • If, on the other hand, the association is an interaction suggesting some type of formal dependency among units, then a modeling strategy with a spatially dependent covariance structure is the way to proceed. In this instance, modeling heterogeneity alone will likely not fully remove the spatial effects within the data. An alternative is needed : a spatially oriented approach that formally incorporates a spatially lagged dependent variable or a spatially lagged error term.
  • 80. • Spatial dependency modeling : example 1 • The shapefile newyork.shp is the map of Manhattan in New York City with Census 2000 data*. It contains socioeconomic attributes for 297 Census tracts and includes the following variables :
    – POLYID : Polygon ID
    – STATE : State FIPS
    – COUNTY : County FIPS
    – TRACT : Census Tract ID
    – sctrct00 : FIPSID
    – hvalue : Median housing value
    – t0_pop : Total population
    – pctnhw : Percent non-Hispanic white persons
    – pctnhb : Percent non-Hispanic black persons
    – pcthsp : Percent Hispanic persons
    – pctasn : Percent Asian persons
    – t0p_own : Percent homeowners
    – t0p_coll : Percent college educated
    – t0p_prf : Percent of people employed in professional/managerial occupations
    – t0p_uemp : Percent of people unemployed
    – t0p_for : Percent foreign born persons
    – t0p_rec : Percent recent immigrants
    – t0_minc : Median household income
    – t0p_poor : Percent total population below poverty
    * Source : http://www.s4.brown.edu/S4/Training/Modul2/GeoDa3FINAL.pdf
  • 81. • Before starting a regression, create a weights file :
  • 82. • In this example, we will predict neighborhood homeownership with several indicators :
  • 84. Test of multicollinearity of the model : one should be alarmed when the condition number is greater than 20. The Jarque-Bera test is used to examine the normality of the distribution of the errors. The low probability of the test score suggests a non-normal distribution of the error term. The low probabilities of the three tests point to the existence of heteroscedasticity. Error variance can be affected by spatial dependence in the data. Moran’s I suggests spatial autocorrelation of the residuals. Both the lag and the error test are significant, indicating the presence of spatial dependence. The robust tests help us understand what type of spatial dependence may be at work. The robust measure for error is still significant, but the robust lag test becomes insignificant, which suggests that the lag dependence disappears once spatial error dependence is accounted for.
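As an illustration of how the normality diagnostic works, the Jarque-Bera statistic can be computed from the skewness and kurtosis of the residuals. This is a minimal sketch with made-up residuals, not the GeoDa implementation ; the function name is an assumption, and the p-value relies on the asymptotic chi-square distribution with 2 degrees of freedom, which is only approximate in small samples.

```python
import math


def jarque_bera(residuals):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments of the residuals.
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square variable with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Hypothetical residuals: a strongly right-skewed set with two large outliers.
skewed_residuals = [-1.0] * 18 + [9.0] * 2
jb, p = jarque_bera(skewed_residuals)
print(round(jb, 2), p)
```

A low p-value, as in the regression output discussed above, leads to rejecting the hypothesis of normally distributed errors.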
  • 85. • After identifying the presence of spatial dependence, we will use GeoDa to re-estimate the model while controlling for spatial dependence.
  • 86. The spatial lag term of homeownership (W_TOP_OWN) appears as an additional indicator. It has a positive effect and is highly significant. As a result, the model fit is improved (higher R-square ; cf. R2 = 0.495 with OLS regression). The coefficient Rho reflects the spatial dependence in the sample data, measuring the average influence on observations by their neighboring observations. Although the introduction of the spatial lag term improved the model fit, it did not make the spatial effects go away.
  • 87. • Now let’s review the results for the spatial error model.
  • 88. The coefficient of the spatially correlated errors is positive. The model fit is improved (higher R2). Heteroscedasticity remains significant, and the spatial error term also stays significant. Although allowing the error terms to be spatially correlated improved the model fit, it did not make the spatial effects go away.
  • 89. • Comparing the spatial lag and spatial error models, we can see that both yield an improvement over the original OLS model. Controlling for spatial dependence therefore improves model performance. • The question now is which of the two models is better. To some extent, this is an open question. The general advice is first to look for a theoretical basis to inform your choice. When theory does not settle the matter, you can compare the model performance parameters : the R-squared and the log likelihood. In this example, the spatial error model has the greater R-squared and log likelihood values. That provides a statistical basis for adopting this solution.
  • 91. • Spatial dependency modeling : example 2 • Analysis of poverty in the U.S.* * Source : http://csde.washington.edu/services/gis/workshops/SPREG.html
  • 93. violation of regression assumptions
  • 96. spatial error model
    Model            R2       Log Likelihood
    OLS              0,780    2323,69
    Spatial Lag      0,822    2457,37
    Spatial Error    0,847    2504,64
  • 97. The spatial error form results in a substantial reduction of spatial autocorrelation.
  • 98. Part 2 Analyzing spatial heterogeneity with geographically weighted regression
  • 99. • Traditional regression analysis describes a modelled relationship between a dependent variable and a set of independent variables. When applied to spatial data, the regression analysis often assumes that the modelled relationship is stationary over space and produces a global model which is supposed to describe the relationship at every location in the study area. This would be misleading, however, if relationships being modelled are intrinsically different across space. One of the spatial statistical methods that attempts to solve this problem and explain local variation in complex relationships is Geographically Weighted Regression (GWR). • In a global regression model, the dependent variable is often modelled as a linear combination of independent variables, where a parameter belonging to each variable is assumed to be stationary over the whole area (i.e. the model returns one value for each parameter). GWR extends this framework by dropping the stationarity assumption: the parameters are assumed to be continuous functions of location. The result of the GWR analysis is a set of continuous localised parameter estimate surfaces, which describe the geography of the parameter space. These estimates are usually mapped or analysed statistically to examine the plausibility of the stationarity assumption of the traditional regression and different possible causes of non-stationarity. www.johanblomme.com 95 The definitive text on GWR is : Fotheringham, A.S., Brunsdon, C. & Charlton, M.E., Geographically Weighted Regression : The Analysis of Spatially Varying Relationships, Chichester, Wiley, 2002. Geographically Weighted Regression
  • 101. • The use of linear regression is common in many areas of science. Ordinary linear regression implicitly assumes spatial stationarity of the regression model, that is, the relationships between the variables remain constant over geographical space. We refer to a model in which the parameter estimates for every observation in the sample are identical as a global model. • Spatial non-stationarity occurs when a relationship (or pattern) that applies in one region does not apply in another. Global models are statements about processes or patterns which are assumed to be stationary and as such are location independent, i.e. are assumed to apply to all locations. In contrast, local models are spatial disaggregations of global models, the results of which are location-specific. The template of the model is the same : the model is a linear regression model with certain variables, but the coefficients vary geographically. If the parameter estimates are allowed to vary across the study area such that every observation has its own separate set of parameter estimates, we have a local model. • GWR does not assume that the relationships between independent and dependent variables are constant across space. Instead, GWR explores whether the relationships between a set of predictors and an outcome vary by geographical location. GWR has been proposed as a powerful tool for investigating spatial non-stationarity in the relationship between predictors and the outcome variable.
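The idea of a local model can be sketched in a few lines : at every location a separate weighted least squares regression is fitted, in which nearby observations receive more weight than distant ones. The sketch below is an illustration under simplifying assumptions (one predictor, a fixed Gaussian kernel, a hand-picked bandwidth, made-up data) and is not the GWR4 implementation ; the function name `gwr_one_predictor` is hypothetical.

```python
import math


def gwr_one_predictor(coords, x, y, bandwidth):
    """Fit y = a(u, v) + b(u, v) * x at every location with Gaussian distance weights."""
    estimates = []
    for ui, vi in coords:
        sw = swx = swy = swxx = swxy = 0.0
        for (uj, vj), xj, yj in zip(coords, x, y):
            distance = math.hypot(ui - uj, vi - vj)
            w = math.exp(-0.5 * (distance / bandwidth) ** 2)  # Gaussian kernel
            sw += w
            swx += w * xj
            swy += w * yj
            swxx += w * xj * xj
            swxy += w * xj * yj
        # Closed-form weighted least squares for an intercept and one slope.
        slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
        intercept = (swy - slope * swx) / sw
        estimates.append((intercept, slope))
    return estimates


# Hypothetical data: ten locations on a line; the relationship between x and y
# is positive in the western half and negative in the eastern half.
coords = [(float(i), 0.0) for i in range(10)]
x = [1.0, 3.0, 2.0, 4.0, 1.0, 3.0, 2.0, 4.0, 1.0, 3.0]
y = [xi if i < 5 else -xi for i, xi in enumerate(x)]

local = gwr_one_predictor(coords, x, y, bandwidth=1.0)
almost_global = gwr_one_predictor(coords, x, y, bandwidth=1e6)
for (u, _), (_, slope) in zip(coords, local):
    print(u, round(slope, 3))
```

With a very large bandwidth all observations receive practically the same weight and every location returns the same coefficients, so the global model appears as a special case of the local one. In practice the bandwidth is chosen by a criterion such as cross-validation or AIC rather than by hand.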
  • 102. • GWR4 is a new release of a Microsoft Windows based application for calibrating geographically weighted regression models, which can be used to explore geographically varying relationships between dependent/response variables and independent/explanatory variables.
  • 103. The five steps of a GWR4 session : (1) give the session a name ; (2) specify the regression type and variable settings ; (3) choose a geographic kernel type ; (4) specify names for the files storing the modelling results ; (5) execute the session. For an extensive review of these 5 steps, see T. Nakaya, GWR4 User Manual, update 7 May 2012.
  • 104. An introduction to macro-level spatial nonstationarity : A geographically weighted regression analysis of diabetes and poverty • Type II diabetes is a growing health problem. Because the burden of diabetes falls disproportionately on less advantaged individuals, poverty is one of the most important risk factors for diabetes. • Micro-level (individual-level) research has consistently found positive associations between diabetes and poverty. Poverty and diabetes may be related because economic disadvantage may limit people to poorer diets and more sedentary lifestyles. • Macro-level (context-level) investigations have also found a positive association between diabetes and poverty. Rates of diabetes are higher in areas with higher economic deprivation. • What follows is a study of the geographical variability in the relationship between poverty and diabetes. We first show how a classical ordinary least squares regression captures the “global”, positive relationship between diabetes and poverty (an increase in the concentration of poverty is accompanied by an increase in the prevalence of diabetes). We then make use of an exploratory geographically weighted regression to specify a local model. The findings reveal that the macro-level diabetes-poverty relationship varies over geographical space.
  • 105. • Theoretically, spatial non-stationarity is based on the concept of the social construction of space. The interaction of individuals with each other and with their physical environment produces space. Human beings are just as much spatial as temporal beings. By spatial, we mean that we are most influenced by what is immediate in space. What happens near us matters more than non-proximal events. Humans’ spatiality and temporality are essential and equally powerful in explaining human behavior. Consequently, everything that is social is inherently spatial, just as everything spatial is inherently socialized. • From this perspective, we analyse how the macro-level relationship between diabetes and poverty unfolds over geographical space.
  • 106. • Investigations of spatial non-stationarity focus on the phenomenon that two measurements taken from geographically close locations are often more similar than measurements from more widely separated locations (Tobler’s law (1970, p. 236) : “Everything is related to everything else, but near things are more related than distant things”). • For this reason, spatial autocorrelation analysis has been developed to deal with the tendency toward interdependence among spatial data. Investigating diabetes prevalence requires that we expand our understanding of how macro-level relationships vary as a function of geographical distance. • In a global model, we can hypothesize that poverty and diabetes are positively related. In a local model, we can hypothesize that the macro-level diabetes-poverty relationship will be spatially non-stationary. Tobler, W.R., A computer movie simulating urban growth in the Detroit region, Economic Geography, 46, 1970, pp. 234-240.
  • 107. • Traditional non-spatial research, including the OLS approach, assumes that the nature of statistical relationships is the same for all points within the entire study area. With GWR, we can explore how the diabetes-poverty relationship varies over space. The OLS results are thus the “global model” findings, while the GWR outputs are the “local” analysis results. • We first execute an OLS multivariate regression to show the linear association between diabetes and poverty in US counties in the South Atlantic area (N=588)*. The goal of this “global model” is to verify the positive association found in previous studies. In the OLS model we use the percentage of diabetes in the county as the dependent variable and the percentage in poverty as the independent variable. We control the relationship between poverty and diabetes prevalence for the median income of households and the percentage of people who completed high school. We then develop a GWR model to account for spatial variations. The GWR model contains the same variables used in the OLS regression. * Source : http://www.ers.usda.gov/data-products/county-level-data-sets.aspx We focus on the 588 contiguous counties because GWR analysis requires that all polygons be physically adjacent or in near physical proximity to at least one other polygon with data on the variables of interest.
  • 108. [Map] US counties, South Atlantic : N = 588
  • 109. Global results • Poverty is positively associated with diabetes. The results of OLS model 1 demonstrate that an increase of one percentage point in the poverty concentration of a county is associated with a 0,15 percentage point increase in diabetes prevalence. • Model 1 has an R2 of 0,262. While diabetes prevalence and percent in poverty are statistically significantly related, a substantial proportion of the variation in diabetes prevalence remains unexplained. • After adding the median income of households and the percentage of people who completed high school to the regression equation, the effect of poverty is substantially reduced and no longer significant, and even the sign of the coefficient for poverty changes from positive to negative. The R-square value for model 2 reaches a respectable 0,395*. • We also note a problem : the residuals of the regression equation show strong spatial autocorrelation (Moran’s I = 0,328 ; p < 0,001)**, a clear indication that the model violates at least one of the assumptions underlying standard linear regression. The Moran test tells us that the residuals are not independent. Moreover, the Koenker-Bassett test for heteroscedasticity indicates that the residuals are not identically distributed either. * Collinearity diagnostics were estimated using SPSS 20.0, and no problems of multicollinearity were found among the independent variables. The collinearity diagnostics used were the variance inflation factors (VIF) and tolerances for individual variables. Multicollinearity is said to exist if the VIF is 5 or higher (or equivalently, if the tolerance is 0,20 or less). The highest VIF in this analysis was 3,314 and the lowest tolerance was 0,302, both for the median income of households. ** Moran’s I is strongly positive, indicating powerful positive autocorrelation (clustering of like values). LISA analysis demonstrates that most counties are found in the high-high and low-low quadrants.
  • 111. • Comparing the residual spatial autocorrelation (I = 0,328) with the spatial autocorrelation for the dependent variable (I = 0,454) tells us that spatial autocorrelation in one or more independent variables “explains” a portion of the spatial autocorrelation in the dependent variable*. • It is frequently the case that the independent variables in a regression model can almost completely account for the spatial autocorrelation in a dependent variable, thus removing a problematic spatially autocorrelated residual. However, in the present case, the regressors have not satisfactorily accounted for spatial dependence in the data, and a correction to the model clearly is necessary. But what type of correction ? Might there be spillover effects among counties that influence the diabetes prevalence of their neighbors (spatial lag model) ? Or does the residual dependence in the model likely stem from omitted variables on the right-hand side of the regression equation, thus suggesting a spatial error model ? * Moran’s I is calculated by specifying a matrix of weights that characterizes the structure of local dependence. In this analysis “neighbors” are defined under the “first-order queen” convention, meaning that the neighbors for any given county “A” are those other counties that share a common boundary with “A” (or single point of contact with “A”). Importantly, “A” is not considered a neighbor of itself and is excluded from the average. www.johanblomme.com 107 Geographically Weighted Regression
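The first-order queen convention described in the footnote above can be sketched on a regular grid, where the “counties” are cells and two cells are neighbors when they share an edge or a corner. The grid, the function names and the example values are illustrative assumptions ; for real county polygons the neighbor lists would be derived from the shapefile by GIS software such as GeoDa. The sketch assumes every unit has at least one neighbor.

```python
def queen_neighbors(n_rows, n_cols):
    """First-order queen contiguity for the cells of an n_rows x n_cols grid."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            neighbors[r * n_cols + c] = [
                rr * n_cols + cc
                for rr in range(r - 1, r + 2)
                for cc in range(c - 1, c + 2)
                # A cell is never its own neighbor; stay inside the grid.
                if (rr, cc) != (r, c) and 0 <= rr < n_rows and 0 <= cc < n_cols
            ]
    return neighbors


def spatial_lag(values, neighbors):
    """Row-standardized spatial lag: the average value among each unit's neighbors."""
    return [
        sum(values[j] for j in neighbors[i]) / len(neighbors[i])
        for i in range(len(values))
    ]


# Hypothetical attribute values on a 3 x 3 grid (cells numbered row by row).
grid_neighbors = queen_neighbors(3, 3)
values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
lags = spatial_lag(values, grid_neighbors)
print(grid_neighbors[0], grid_neighbors[4])
print(lags)
```

The spatial lag computed this way is the quantity that Moran's I correlates with the value of the unit itself, and it is the term that enters the spatial lag model.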
  • 112. • We used a spatial regression model to control for the spatial autocorrelation. We chose which spatial dependence model to use (spatial lag or spatial error) on the basis of Lagrange Multiplier tests. Although both tests indicate significant spatial dependence, we used the model with the highest test statistic, in this case the spatial error model. • Aside from the remaining heteroscedasticity, the spatial error model appears to be a plausible alternative to the OLS specification. The AIC score is lower and the explanatory power of the model increases considerably over the OLS regression, with an R2 of 0,538. • In contrast with OLS model 2, the effect of poverty on diabetes is statistically significant, independently of the median income of households and the percentage of people who completed high school. • It is still not clear whether spatial non-stationarity is a concern in our analysis. The homoscedasticity assumption underlying OLS still needs to be investigated with local modeling.
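The model choice can be written down as a small decision rule. The sketch below follows the rule of thumb commonly taught with GeoDa (look at the standard LM tests first and turn to the robust versions only when both are significant) ; this is an assumption about procedure, since the text above only states that the model with the highest test statistic was chosen. The function name is hypothetical, and 3,841 is the 5 % critical value of a chi-square distribution with 1 degree of freedom.

```python
CHI2_1DF_CRITICAL_5PCT = 3.841  # 95th percentile of chi-square with 1 df


def choose_spatial_model(lm_lag, lm_error, robust_lm_lag, robust_lm_error,
                         critical=CHI2_1DF_CRITICAL_5PCT):
    """Pick 'ols', 'spatial lag' or 'spatial error' from the LM diagnostics."""
    lag_significant = lm_lag > critical
    error_significant = lm_error > critical
    if not lag_significant and not error_significant:
        return "ols"
    if lag_significant and not error_significant:
        return "spatial lag"
    if error_significant and not lag_significant:
        return "spatial error"
    # Both standard tests are significant: consult the robust versions.
    robust_lag_significant = robust_lm_lag > critical
    robust_error_significant = robust_lm_error > critical
    if robust_lag_significant and not robust_error_significant:
        return "spatial lag"
    if robust_error_significant and not robust_lag_significant:
        return "spatial error"
    # Still undecided: fall back on the larger robust statistic.
    return "spatial lag" if robust_lm_lag > robust_lm_error else "spatial error"


# LM statistics reported for the diabetes model (decimal points instead of commas).
print(choose_spatial_model(lm_lag=72.872, lm_error=141.604,
                           robust_lm_lag=1.642, robust_lm_error=70.375))
```

Applied to the reported statistics, the robust lag test falls below the critical value while the robust error test stays far above it, so both this rule and the “highest test statistic” rule point to the spatial error model.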
  • 113. OLS and spatial regression models predicting the prevalence of diabetes in US South Atlantic counties (N=588)
                                   OLS (1)                 OLS (2)                    Spatial Error
    independent variables          coeff.      std.err.    coeff.        std.err.     coeff.        std.err.
    constant                       9,066**     0,185       18,662**                   20,146**      1,036
    % poverty                      0,151**     0,010       -0,007        0,017        -0,040*       0,016
    median income of households                            -0,000068**   0,000008     -0,000077**   0,000009
    % completed high school                                -0,051**      0,012        -0,059*       0,012
    spatial error (Lambda)                                                            0,530**
    heteroscedasticity             30,240**□               55,547**□                  48,399**●
    R2                             0,262                   0,395                      0,538
    AIC                            2233,690                2120,650                   2002,780
    Spatial dependence diagnostics : Lagrange Multiplier (Lag) 72,872** ; Robust LM (Lag) 1,642 ; Lagrange Multiplier (Error) 141,604** ; Robust LM (Error) 70,375**
    * p<0,05 ** p<0,01 □ Koenker-Bassett test for heteroscedasticity ● Breusch-Pagan test for heteroscedasticity
    OLS models and the spatial error model are estimated by making use of Open GeoDa 1.2.0 (august 2012) ©Luc Anselin, 2011, 2012
  • 114. Local results • In using spatial regression models we assume that the spatial process accounting for diabetes levels is the same across the study area. That is, the relationship is spatially stationary. However, few social processes will be found to be so constant over space. Global models will hide potential heterogeneity, or spatial non-stationarity, in the determinants of diabetes. GWR provides a method to assess the degree to which the relationship between the potential determinants and the prevalence of diabetes varies across space. • The spatial non-stationarity of the relationship of each independent variable to the dependent variable can be assessed to determine whether the GWR method offers any improvement over a global regression model. The variability in the observed GWR estimates for the spatial units is compared to the variability of the GWR results obtained from a large number of random allocations of the analytical data across the units. Where one finds a significant difference between the variability of an observed estimate and that of the estimates computed using the randomized data, spatial non-stationarity for that independent variable is indicated.
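The randomization procedure described above can be sketched as a permutation test. The code is a simplified illustration, not the GWR4 routine : it uses one predictor, a fixed Gaussian kernel, the variance of the local slopes as the measure of variability, and made-up data ; all function names and the choice of 199 permutations are assumptions.

```python
import math
import random


def local_slopes(coords, x, y, bandwidth):
    """Local slope of y on x at each location (Gaussian-kernel weighted least squares)."""
    slopes = []
    for ui, vi in coords:
        sw = swx = swy = swxx = swxy = 0.0
        for (uj, vj), xj, yj in zip(coords, x, y):
            w = math.exp(-0.5 * (math.hypot(ui - uj, vi - vj) / bandwidth) ** 2)
            sw += w
            swx += w * xj
            swy += w * yj
            swxx += w * xj * xj
            swxy += w * xj * yj
        slopes.append((sw * swxy - swx * swy) / (sw * swxx - swx ** 2))
    return slopes


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def monte_carlo_nonstationarity(coords, x, y, bandwidth, n_perm=199, seed=0):
    """Compare the observed variability of the local slopes with randomized data."""
    rng = random.Random(seed)
    observed = variance(local_slopes(coords, x, y, bandwidth))
    at_least_as_variable = 0
    shuffled = list(coords)
    for _ in range(n_perm):
        # Randomly reallocate the observations (x, y pairs) across the locations.
        rng.shuffle(shuffled)
        if variance(local_slopes(shuffled, x, y, bandwidth)) >= observed:
            at_least_as_variable += 1
    pseudo_p = (at_least_as_variable + 1) / (n_perm + 1)
    return observed, pseudo_p


# Hypothetical data: 20 locations on a line; the slope is +1 in the first half
# of the study area and -1 in the second half.
coords = [(float(i), 0.0) for i in range(20)]
x = [float((2 * i) % 5 + 1) for i in range(20)]
y = [xi if i < 10 else -xi for i, xi in enumerate(x)]

observed_variance, pseudo_p = monte_carlo_nonstationarity(coords, x, y, bandwidth=1.5)
print(round(observed_variance, 3), pseudo_p)
```

A small pseudo p-value means that few random allocations produce local estimates as variable as the observed ones, which is the evidence of spatial non-stationarity referred to above.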
  • 115. • We first made use of a local Moran’s I cluster analysis of the residuals of the GWR model as a diagnostic for spatial autocorrelation in the GWR residuals. We found no violations of residual independence.
  • 117. • The GWR results can best be summarized through maps of the parameter estimates and the Monte Carlo tests. We provide maps of the local R2 values and of the estimates for each of the independent variables with a significant Monte Carlo test. • The Monte Carlo tests for spatial variability of parameters indicate that the associations between the independent variables and diabetes are all non-stationary across space. In other words, the associations found with OLS cannot be generalized to every location in the South Atlantic region. In contrast to OLS, the GWR model explains 62,2 % of the total variance. • As shown on the map of the local R2 values, the total variance explained by the local model ranges from 16,1 % to 81,1 %. The model fits the data well in the northern counties. Especially among the southern counties, there are areas that may benefit from a model with additional covariates. Herein lies the value of the GWR approach : without the ability to map the local R2, we would not know where our model could be improved with additional covariates.
  • 119. • The model results of the GWR can be interpreted in two ways. Those interested in a particular area can use the model results for that place to get a multivariate understanding of the key local determinants of diabetes prevalence. We will not do this here. An alternative way to examine the results is to consider, for each determinant, how its effect varies across the counties of the South Atlantic region. • For example, the GWR coefficient for the percentage in poverty ranges from -0,33 to 0,32, which signals that the macro-level poverty-diabetes association is spatially non-stationary. The blue marked counties indicate areas where an increase in poverty predicts lower diabetes prevalence. The shift to light-blue marked areas captures the spatially non-stationary relationship between poverty and diabetes. The poverty-diabetes relationship fluctuates from negative to positive as a function of geographical location. Similar results exist for the relationship between median household income and diabetes and for the relationship between educational attainment and diabetes. In short, after accounting for location, we find that macro-level associations between predictor variables and diabetes fluctuate as a function of geography.
  • 121. • The previous analysis demonstrates that GWR addresses the need for place-specific or place-sensitive forms of analysis. • Effective locational decision making is essential for properly addressing many socio-economic, demographic and health related concerns. Presently, these decisions are supported by quantitative models, which are potentially powerful tools, but whose estimates are often affected by uncertainty, which reduces their reliability. • Uncertainty in the model parameters stems from two properties of geographical phenomena : – spatial dependence : near things are more related than distant things ; – spatial non-stationarity : variability over space. • These two properties are mutually related, and most observed processes exhibit both simultaneously. • Advanced spatial analytical methods exist to correct for the effects of each property. However, despite the recognized simultaneity of their occurrence, each advanced spatial method is designed to address only one property. Spatial autoregressive methods address spatial dependence but do not account for non-stationarity ; geographically weighted regression addresses non-stationarity but does not account for spatial dependence.