An Introduction to strm

#load libraries
#load strm
library(strm)
library(rmarkdown)
#load package dependencies for the vignette
library(tidyr)
library(dplyr)
library(rgdal)
library(spdep)
library(rgeos)
library(sf)
library(splm)
library(knitr)
library(Ecdat)

strm is an R package that fits spatio-temporal regression model based on Chi & Zhu Spatial Regression Models for the Social Sciences (2019). The approach here fits a spatial error model while incorporating a temporally lagged response variable and temporally lagged explanatory variables.

This package builds on the errorsarlm() function from the spatialreg package.

This package is still under development. Please report bugs or constructive tips to issues here.

Introduction

Fit a spatial error model but include a temporally lagged response variable, temporally lagged explanatory variable, and temporally and spatially lagged explanatory variables.

Spatial error model at time \(t\):

\[Y_t = X_t \beta + u_t, u_t=\rho Wu_t + \varepsilon\]

Adding in a temporally lagged response variable and temporally lagged explanatory variable:

\[Y_t = X_t\beta_t + \beta_2Y_{t-1} + X_{t-1}\beta_3 + u_t, u_t=\rho Wu_t + \varepsilon\]

This becomes:

\[Y_t = Y_{t-1} \beta_2 + \rho W Y_t - \rho\beta_2 WY_{t-1} + X_t \beta_1 + X_{t-1}\beta_3 - W X_t \rho \beta_1 -WX_{t-1} \rho \beta_3 + \varepsilon\]

Example 1: Produce in the United States

(This example has been adapted from the splm package vignette.)

The first data set we will use is the Produc data set from the Ecdat package. This data set is a panel of United States production data from 1970-1986. There are 816 observations by county in the United States. The variables we will use are:

We also load usaww from the splm package, a spatial weights matrix of the 48 contintental United States based on second-order neighbors.

#load data example
data("Produc")
data("usaww")

#explore the data structures
str(Produc)
## 'data.frame':    816 obs. of  10 variables:
##  $ state: Factor w/ 48 levels "ALABAMA","ARIZONA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year : int  1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
##  $ pcap : num  15033 15502 15972 16406 16763 ...
##  $ hwy  : num  7326 7526 7765 7908 8026 ...
##  $ water: num  1656 1721 1765 1742 1735 ...
##  $ util : num  6051 6255 6442 6756 7002 ...
##  $ pc   : num  35794 37300 38670 40084 42057 ...
##  $ gsp  : int  28418 29375 31303 33430 33749 33604 35764 37463 39964 40979 ...
##  $ emp  : num  1010 1022 1072 1136 1170 ...
##  $ unemp: num  4.7 5.2 4.7 3.9 5.5 7.7 6.8 7.4 6.3 7.1 ...
str(usaww)
##  num [1:48, 1:48] 0 0 0 0 0 0 0 0.5 0.2 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:48] "ALABAMA" "ARIZONA" "ARKANSAS" "CALIFORNIA" ...
##   ..$ : chr [1:48] "ALABAMA" "ARIZONA" "ARKANSAS" "CALIFORNIA" ...

Single Year Lag

We next convert the spatial weights matrix to a list of spatial weights using the mat2listw() function from the spdep package and check the class() of the object:

#create list of spatial weights
usalw <- mat2listw(usaww)
class(usalw)
## [1] "listw" "nb"

Next, we perform the spatio-temporal regression model. We first create a formula using as.formula(). For this model, our response is log(gsp) and our explanatory variables are log(pcap), log(pc), log(emp), unemp.

For this analysis, we will limit the data to only 1970 and 1971 - two time periods.

We use the strm() function from the strm package. The first argument is the model formula. Given that the data is in long format, strm will create the lagged values for you as well as the lagged response in the right-hand side of the model. The next argument is id, which we set to "state" as each observation is taken at the state level. We then specify the name of the data set (data=Produc), the list of spatial weights (listw = usalw), time=2, we tell strm that the data is in long format by setting wide=FALSE, and lastly we pass an argument to filter observations where year is equal to either 1970 or 1971 (2 time periods).

formula1 <-as.formula(log(gsp)  ~ log(pcap) + log(pc) + log(emp) + unemp)
out <- strm(formula1, id="state", data=Produc, listw = usalw, time=2,wide=FALSE,
            filter_options="year==1970 | year==1971")
## The spatio-temporal regression model fitted:
## loggsp ~ logpcap + logpc + logemp + unemp + logpcap.Tlag1 + logpc.Tlag1 +     logemp.Tlag1 + unemp.Tlag1 + loggsp.Tlag1
summary(out)
## 
## Call:spatialreg::errorsarlm(formula = modframe, listw = listw)
## 
## Residuals:
##         Min          1Q      Median          3Q         Max 
## -0.03855189 -0.00739710 -0.00061512  0.00585547  0.05119821 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                  Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)    1.9946e-01  7.6150e-02  2.6193 0.0088117
## logpcap       -1.8587e-01  2.3251e-01 -0.7994 0.4240508
## logpc          1.4288e-05  2.5371e-01  0.0001 0.9999551
## logemp         8.2984e-01  1.9872e-01  4.1758 2.969e-05
## unemp          3.3262e-03  6.0468e-03  0.5501 0.5822638
## logpcap.Tlag1  1.9192e-01  2.2597e-01  0.8493 0.3956987
## logpc.Tlag1    1.7700e-02  2.5045e-01  0.0707 0.9436582
## logemp.Tlag1  -7.5794e-01  1.9659e-01 -3.8555 0.0001155
## unemp.Tlag1   -4.5510e-03  6.8513e-03 -0.6643 0.5065302
## loggsp.Tlag1   9.1257e-01  2.5405e-02 35.9204 < 2.2e-16
## 
## Lambda: -0.0093887, LR test value: 0.00085418, p-value: 0.97668
## Asymptotic standard error: 0.20491
##     z-value: -0.045819, p-value: 0.96345
## Wald statistic: 0.0020994, p-value: 0.96345
## 
## Log likelihood: 128.249 for error model
## ML residual variance (sigma squared): 0.00027975, (sigma: 0.016726)
## Number of observations: 48 
## Number of parameters estimated: 12 
## AIC: -232.5, (AIC for lm: -234.5)

From the model summary output, we see that strm has included lagged variables (*.Tlag1) for each of the explanatory variables initially specified (log(pcap), log(pc), log(emp), unemp), as well as a lagged variable for the response (log(gsp)).

With 2 Lags

As the number of time periods in the data increase, so do the number of lags. If we now use the same model, but instead include an additional year (1972) and set time=3.

formula2 <-as.formula(log(gsp)  ~ log(pcap) + log(pc) + log(emp) + unemp)
out <- strm(formula2, id="state", data=Produc, listw= usalw, time=3,
            wide=FALSE,filter_options="year==1970 | year==1971 | year ==1972")
## The spatio-temporal regression model fitted:
## loggsp ~ logpcap + logpc + logemp + unemp + logpcap.Tlag1 + logpc.Tlag1 +     logemp.Tlag1 + unemp.Tlag1 + logpcap.Tlag2 + logpc.Tlag2 +     logemp.Tlag2 + unemp.Tlag2 + loggsp.Tlag1 + loggsp.Tlag2

After executing the strm() model, we apply the summary() function to the results object, out.

summary(out)
## 
## Call:spatialreg::errorsarlm(formula = modframe, listw = listw)
## 
## Residuals:
##         Min          1Q      Median          3Q         Max 
## -0.02179096 -0.00525540  0.00041786  0.00461342  0.01653898 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                 Estimate Std. Error z value  Pr(>|z|)
## (Intercept)    0.2516030  0.0462216  5.4434 5.227e-08
## logpcap        0.0980899  0.1760534  0.5572   0.57742
## logpc          0.2284076  0.3163419  0.7220   0.47028
## logemp         0.2484005  0.1105107  2.2478   0.02459
## unemp         -0.0056502  0.0035709 -1.5823   0.11359
## logpcap.Tlag1 -0.2784667  0.3731087 -0.7463   0.45546
## logpc.Tlag1   -0.4567652  0.3651608 -1.2509   0.21099
## logemp.Tlag1   0.5172162  0.2297628  2.2511   0.02438
## unemp.Tlag1    0.0099326  0.0042575  2.3330   0.01965
## logpcap.Tlag2  0.1929118  0.2093529  0.9215   0.35681
## logpc.Tlag2    0.2527881  0.1388406  1.8207   0.06865
## logemp.Tlag2  -0.6862555  0.1397530 -4.9105 9.085e-07
## unemp.Tlag2   -0.0037672  0.0033893 -1.1115   0.26636
## loggsp.Tlag1   0.7593464  0.0720738 10.5357 < 2.2e-16
## loggsp.Tlag2   0.1317338  0.0685188  1.9226   0.05453
## 
## Lambda: -0.77659, LR test value: 4.9975, p-value: 0.025385
## Asymptotic standard error: 0.18694
##     z-value: -4.1542, p-value: 3.2636e-05
## Wald statistic: 17.258, p-value: 3.2636e-05
## 
## Log likelihood: 162.5098 for error model
## ML residual variance (sigma squared): 5.8561e-05, (sigma: 0.0076525)
## Number of observations: 48 
## Number of parameters estimated: 17 
## AIC: -291.02, (AIC for lm: -288.02)

From the model summary output, we see that strm has now included two lagged variables (*.Tlag1 and *.Tlag2) for each of the explanatory variables initially specified, as well as two lags for the response variable.

Example 2: Minor Civil Divisions in Wisconsin

We use the example from Spatial Regression Models for the Social Sciences Chi & Zhu (2019). The example uses population growth data from 2000 to 2010. Data are at the minor civil division (MCD) level in Wisconsin. There are two years of data: 2000 and 2010. The variables we will use are:

We load the sptdmg3 .RData file that comes with the strm package using data(sptdmg3) and explore its class() and names():

data(sptdmg3)
class(sptdmg3)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
names(sptdmg3)
## [1] "STFID"   "X_COORD" "Y_COORD" "LNP1000" "LNP0090" "POLD00"  "POLD90"

First, fit a standard linear regression model with population growth from 2000 to 2010 as the response variable and population growth from 1990 to 2000, the old population in 2000, and the old population in 1990 as the explanatory variables.

formula2 <- as.formula(LNP1000 ~ LNP0090 + POLD00 + POLD90)
m2 <- lm(formula2, data = sptdmg3)
summary(m2)
## 
## Call:
## lm(formula = formula2, data = sptdmg3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69307 -0.06942 -0.01170  0.05636  1.08243 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.109237   0.009174  11.907  < 2e-16 ***
## LNP0090      0.061267   0.020858   2.937 0.003352 ** 
## POLD00      -0.335543   0.089100  -3.766 0.000171 ***
## POLD90      -0.214722   0.082246  -2.611 0.009109 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1348 on 1833 degrees of freedom
## Multiple R-squared:  0.05626,    Adjusted R-squared:  0.05471 
## F-statistic: 36.42 on 3 and 1833 DF,  p-value: < 2.2e-16

We first convert the SpatialPolygonsDataFrame to a simple features or sf object. We use the st_as_sf() function from the sf package.

#convert to simple features
sptdmg3_sf <- sf::st_as_sf(sptdmg3)
class(sptdmg3_sf)
## [1] "sf"         "data.frame"

Next, we create a spatial weights matrix based on 4-nearest neighbors. We extract the coordinates from the MCD centroids using st_centroid() to extract the centroid of each MCD and st_coordinates() to extract the coordinates. We then use knearneigh() from the spdep package and specify k=4 to create a matrix of the 4 nearest neighbors; knn2nb() creates a neighborhood structure object and nb2listw() creates a spatial weights list, where we set the style to be row-standardized weights and the zero policy to be TRUE. Finally, we plot the MCD and their neighbors.

#knn4
coords <-st_coordinates(st_centroid(sptdmg3_sf)) #extract centroid of each  minor civil division
col.knn <- knearneigh(coords, k=4)
nbknn <- knn2nb(col.knn, row.names = sptdmg3$STFID)
listwk <- nb2listw(nbknn, style="W")
plot(sptdmg3, main="Wisconsin: 4 Nearest-Neighbors"); plot(nbknn, coords,col="forestgreen", add=TRUE)

We are now ready to perform the spatio-temporal regression model. In this example, the data is already in WIDE format, meaning that every variable-year is a column in the dataset. Because of this, we directly include the lagged variables for all of the covariates as well as the lagged variable for the response, just as we had done in the simple linear model above.

#since this data is in wide format, include temporal lag for both explanatory and response variable manually
res <- strm(formula2, id="STFID", data=sptdmg3_sf, listw = listwk, time=2,wide=TRUE)
## The spatio-temporal regression model fitted:
## LNP1000 ~ LNP0090 + POLD00 + POLD90
summary(res)
## 
## Call:spatialreg::errorsarlm(formula = modframe, listw = listw)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.667816 -0.068463 -0.010631  0.055872  1.053517 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.0856035  0.0099058  8.6418  < 2e-16
## LNP0090      0.0312120  0.0212166  1.4711  0.14126
## POLD00      -0.1992946  0.0863089 -2.3091  0.02094
## POLD90      -0.1665145  0.0789090 -2.1102  0.03484
## 
## Lambda: 0.28547, LR test value: 78.083, p-value: < 2.22e-16
## Asymptotic standard error: 0.030315
##     z-value: 9.4169, p-value: < 2.22e-16
## Wald statistic: 88.678, p-value: < 2.22e-16
## 
## Log likelihood: 1115.203 for error model
## ML residual variance (sigma squared): 0.017069, (sigma: 0.13065)
## Number of observations: 1837 
## Number of parameters estimated: 6 
## AIC: -2218.4, (AIC for lm: -2142.3)