Generating syntax for structural equation models

tidySEM offers a user-friendly, tidy workflow for generating syntax for structural equation models (SEM). The workflow is top-down: syntax is generated from conceptual model elements, such as a measurement model. In many cases the generated syntax will suffice, but it can always be customized. The workflow also tries to guess which variables belong together, and these defaults can be overridden.

The tidySEM workflow

The workflow underlying syntax generation in tidySEM is as follows:

  1. Give the variables in your data object short, informative names that are easily machine-readable
  2. Convert the data to a tidy_sem object by running model <- tidy_sem(data)
  3. Add elements of syntax
    • E.g., measurement(model)
  4. Optionally, access the dictionary, data, and syntax elements in the tidy_sem object by calling dictionary(model), get_data(model), or syntax(model)
  5. Optionally, modify the dictionary, data, and syntax elements in the tidy_sem object using dictionary(model) <- ..., get_data(model) <- ..., and syntax(model) <- ...
  6. Run the analysis, either by:
    • Converting the tidy_sem object to lavaan syntax using as_lavaan(model) and using that as input for the lavaan functions sem, lavaan, or cfa
    • Converting the tidy_sem object to Mplus using as_mplus(model), and using that as input for MplusAutomation::mplusObject()
    • Using the functions estimate_lavaan(model) or estimate_mplus(model)

All elements of the tidy_sem object (the data, dictionary, and syntax) are “tidy” data, i.e., tabular data.frames, and can be modified using the familiar suite of functions in the ‘tidyverse’.

Example: Running a CFA

Step 1: Check the variable names

As an example, let’s generate syntax for a classic lavaan tutorial example for CFA. First, we check the variable names:

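library(tidySEM)
library(lavaan)  # provides the HolzingerSwineford1939 data
library(dplyr)   # provides %>%, mutate() and filter(), used below
df <- HolzingerSwineford1939
names(df)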
#>  [1] "id"     "sex"    "ageyr"  "agemo"  "school" "grade"  "x1"     "x2"    
#>  [9] "x3"     "x4"     "x5"     "x6"     "x7"     "x8"     "x9"

These names are not informative, as the items named x1 to x9 are indicators of three different latent variables. We will rename them accordingly:

names(df)[grepl("^x", names(df))] <- c("vis_1", "vis_2", "vis_3", "tex_1", "tex_2", "tex_3", "spe_1", "spe_2", "spe_3")
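
When there are many items, typing every new name by hand becomes error-prone. As a minimal sketch in base R, the same nine names can be built programmatically from the scale prefixes. The objects `old_names`, `scales`, and `new_names` are introduced here for illustration only, and the example works on a plain character vector rather than on the data itself:

```r
# The original column names of the example data, written out by hand
old_names <- c("id", "sex", "ageyr", "agemo", "school", "grade",
               paste0("x", 1:9))

# Three scales with three items each, in the order of x1 to x9
scales <- c("vis", "tex", "spe")
new_items <- paste0(rep(scales, each = 3), "_", rep(1:3, times = 3))

# Replace only the names that start with "x"
new_names <- old_names
new_names[grepl("^x", old_names)] <- new_items
new_names
```

This relies on the items being stored in scale order, which is the case for x1 to x9 in this dataset.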

Guidelines for naming variables

In general, it is good practice to build variable names from the following information: the scale name, the measurement occasion or respondent (where applicable), and the item number. Roughly speaking, these elements should be ordered from “slow-changing” to “fast-changing”; i.e., there are only a few scales, with possibly several measurement occasions or respondents, and many items.
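
To illustrate why this ordering makes names machine-readable, the sketch below splits a set of hypothetical variable names into their elements using base R. The names (`dep_t1_1` and so on) and the three-part scale/occasion/item layout are assumptions made for this example. The code is not a description of how tidySEM parses names internally:

```r
# Hypothetical names: scale, then measurement occasion, then item number
vars <- c("dep_t1_1", "dep_t1_2", "dep_t2_1", "dep_t2_2",
          "anx_t1_1", "anx_t2_1")

# Split each name on the underscore and bind the pieces into a matrix
parts <- do.call(rbind, strsplit(vars, "_", fixed = TRUE))

parsed <- data.frame(
  name     = vars,
  scale    = parts[, 1],
  occasion = parts[, 2],
  item     = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)
parsed
```

Because every name follows the same order, a single split recovers which scale, occasion, and item each variable belongs to.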

Step 2: Generate a dictionary

A dictionary indicates which variables in the data belong to, for example, the same scale. When the data have informative names, it is possible to construct a data dictionary automatically:

model <- tidy_sem(df)
model
#> A tidy_sem object
#> v    $dictionary
#> v    $data
#> o    $syntax

Step 3: Generate syntax

We can automatically add basic syntax to the tidy_sem object by passing it to a syntax-generating function like measurement(), which adds a measurement model for any scales in the object:

model %>%
  measurement() -> model
model
#> A tidy_sem object
#> v    $dictionary
#> v    $data
#> v    $syntax

Step 4: Run the model

The resulting model can be converted to ‘Mplus’ syntax or ‘lavaan’ syntax, using the as_mplus() and as_lavaan() functions. For example, using lavaan:

res <- lavaan(as_lavaan(model), data = df)
summary(res, estimates = FALSE)
#> lavaan 0.6-7 ended normally after 35 iterations
#> 
#>   Estimator                                         ML
#>   Optimization method                           NLMINB
#>   Number of free parameters                         21
#>                                                       
#>   Number of observations                           301
#>                                                       
#> Model Test User Model:
#>                                                       
#>   Test statistic                                85.306
#>   Degrees of freedom                                24
#>   P-value (Chi-square)                           0.000

Alternatively, the model can be estimated using the lavaan wrapper estimate_lavaan():

model %>%
  estimate_lavaan()
#> lavaan 0.6-7 ended normally after 35 iterations
#> 
#>   Estimator                                         ML
#>   Optimization method                           NLMINB
#>   Number of free parameters                         21
#>                                                       
#>   Number of observations                           301
#>                                                       
#> Model Test User Model:
#>                                                       
#>   Test statistic                                85.306
#>   Degrees of freedom                                24
#>   P-value (Chi-square)                           0.000
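
The reported number of free parameters and degrees of freedom can be checked by hand. The following base R sketch counts the parameters of a three-factor model with nine indicators, assuming (as in the output above) that no mean structure is estimated and that the first loading of each factor is fixed to 1. All object names are introduced here for illustration:

```r
p <- 9                          # observed indicators
n_moments <- p * (p + 1) / 2    # unique variances and covariances: 45

n_free <- (9 - 3) +  # loadings, minus one fixed loading per factor
          9 +        # residual variances of the indicators
          3 +        # latent variable variances
          3          # covariances between the latent variables

df_model <- n_moments - n_free
c(free_parameters = n_free, df = df_model)
```

This gives 21 free parameters and 24 degrees of freedom, matching the lavaan output.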

The same model can be estimated in ‘Mplus’ through the R-package MplusAutomation. This requires ‘Mplus’ to be installed.

library(MplusAutomation)
model %>%
  estimate_mplus()
#> Estimated using ML 
#> Number of obs: 301, number of (free) parameters: 30 
#> 
#> Model: Chi2(df = 24) = 85.306, p = 0 
#> Baseline model: Chi2(df = 36) = 918.852, p = 0 
#> 
#> Fit Indices: 
#> 
#> CFI = 0.931, TLI = 0.896, SRMR = 0.06 
#> RMSEA = 0.092, 90% CI [0.071, 0.114], p < .05 = 0.001 
#> AIC = 7535.49, BIC = 7646.703
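
As a rough check on this output, the fit indices can be recomputed from the two chi-square statistics using common textbook formulas. The sketch below is base R only, with values copied from the output above. It assumes that the 30 parameters reported here are the 21 covariance-structure parameters plus 9 indicator intercepts, which is a plausible reading but is not stated in the output. Software packages differ in small details, for example whether the RMSEA uses N or N - 1, but both choices round to the same value here:

```r
# Values copied from the Mplus output above
chisq   <- 85.306;  df_m <- 24   # fitted model
chisq_b <- 918.852; df_b <- 36   # baseline model
n       <- 301                   # number of observations

# Assumption: mean structure adds 9 intercepts to the 21 parameters
p <- 9
n_par <- 21 + p                              # 30
df_check <- (p * (p + 1) / 2 + p) - n_par    # 54 moments - 30 = 24

# Textbook formulas for the incremental fit indices and RMSEA
cfi   <- 1 - max(chisq - df_m, 0) / max(chisq_b - df_b, chisq - df_m, 0)
tli   <- (chisq_b / df_b - chisq / df_m) / (chisq_b / df_b - 1)
rmsea <- sqrt(max(chisq - df_m, 0) / (df_m * n))

round(c(CFI = cfi, TLI = tli, RMSEA = rmsea), 3)
```

The rounded results, 0.931, 0.896, and 0.092, agree with the values reported above.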

Optional step 5: Access the dictionary, data, and syntax

The dictionary and syntax can be examined using dictionary(model) and syntax(model):

dictionary(model)
#>      name scale      type  label
#> 1      id  <NA>  observed     id
#> 2     sex  <NA>  observed    sex
#> 3   ageyr  <NA>  observed  ageyr
#> 4   agemo  <NA>  observed  agemo
#> 5  school  <NA>  observed school
#> 6   grade  <NA>  observed  grade
#> 7   vis_1   vis indicator  vis_1
#> 8   vis_2   vis indicator  vis_2
#> 9   vis_3   vis indicator  vis_3
#> 10  tex_1   tex indicator  tex_1
#> 11  tex_2   tex indicator  tex_2
#> 12  tex_3   tex indicator  tex_3
#> 13  spe_1   spe indicator  spe_1
#> 14  spe_2   spe indicator  spe_2
#> 15  spe_3   spe indicator  spe_3
#> 16    vis  <NA>    latent    vis
#> 17    tex  <NA>    latent    tex
#> 18    spe  <NA>    latent    spe
syntax(model)
#>      lhs op   rhs block free label ustart plabel
#> 1    vis =~ vis_1     1    0            1   .p1.
#> 2    vis =~ vis_2     1    1           NA   .p2.
#> 3    vis =~ vis_3     1    1           NA   .p3.
#> 4    tex =~ tex_1     1    0            1   .p4.
#> 5    tex =~ tex_2     1    1           NA   .p5.
#> 6    tex =~ tex_3     1    1           NA   .p6.
#> 7    spe =~ spe_1     1    0            1   .p7.
#> 8    spe =~ spe_2     1    1           NA   .p8.
#> 9    spe =~ spe_3     1    1           NA   .p9.
#> 10 vis_1 ~~ vis_1     1    1           NA  .p10.
#> 11 vis_2 ~~ vis_2     1    1           NA  .p11.
#> 12 vis_3 ~~ vis_3     1    1           NA  .p12.
#> 13 tex_1 ~~ tex_1     1    1           NA  .p13.
#> 14 tex_2 ~~ tex_2     1    1           NA  .p14.
#> 15 tex_3 ~~ tex_3     1    1           NA  .p15.
#> 16 spe_1 ~~ spe_1     1    1           NA  .p16.
#> 17 spe_2 ~~ spe_2     1    1           NA  .p17.
#> 18 spe_3 ~~ spe_3     1    1           NA  .p18.
#> 19   vis ~~   vis     1    1           NA  .p19.
#> 20   tex ~~   tex     1    1           NA  .p20.
#> 21   spe ~~   spe     1    1           NA  .p21.
#> 22   vis ~~   tex     1    1           NA  .p22.
#> 23   vis ~~   spe     1    1           NA  .p23.
#> 24   tex ~~   spe     1    1           NA  .p24.
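
Read row by row, the syntax table describes a standard three-factor CFA. As a point of comparison, the sketch below writes the same measurement model by hand in lavaan model syntax and fits it with cfa(). It assumes lavaan is installed; `hand_written` and `fit_hand` are names introduced here. Only the loadings (rows 1 to 9) are written out, because cfa() by default fixes the first loading of each factor to 1 and adds the variances and latent covariances (rows 10 to 24) automatically:

```r
library(lavaan)

# Same data preparation as in Step 1
df <- HolzingerSwineford1939
names(df)[grepl("^x", names(df))] <- c("vis_1", "vis_2", "vis_3",
                                       "tex_1", "tex_2", "tex_3",
                                       "spe_1", "spe_2", "spe_3")

# Hand-written counterpart of the "=~" rows in the syntax table
hand_written <- "
  vis =~ vis_1 + vis_2 + vis_3
  tex =~ tex_1 + tex_2 + tex_3
  spe =~ spe_1 + spe_2 + spe_3
"

fit_hand <- cfa(hand_written, data = df)
fitMeasures(fit_hand, c("npar", "df"))
```

If the two specifications are indeed equivalent, this fit should report the same number of free parameters and degrees of freedom as the model generated by measurement().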

Optional step 6: Modify the dictionary and syntax

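At this stage, we may want to modify the basic syntax slightly. The functions dictionary(model) <- ... and syntax(model) <- ... can be used to modify the dictionary and syntax. The following pipeline relabels the “vis” latent variable; note that it only prints the modified dictionary, and the result must be assigned back with dictionary(model) <- ... to store it in the model: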

dictionary(model) %>%
  mutate(label = ifelse(label == "vis", "Visual", label))
#>      name scale      type  label
#> 1      id  <NA>  observed     id
#> 2     sex  <NA>  observed    sex
#> 3   ageyr  <NA>  observed  ageyr
#> 4   agemo  <NA>  observed  agemo
#> 5  school  <NA>  observed school
#> 6   grade  <NA>  observed  grade
#> 7   vis_1   vis indicator  vis_1
#> 8   vis_2   vis indicator  vis_2
#> 9   vis_3   vis indicator  vis_3
#> 10  tex_1   tex indicator  tex_1
#> 11  tex_2   tex indicator  tex_2
#> 12  tex_3   tex indicator  tex_3
#> 13  spe_1   spe indicator  spe_1
#> 14  spe_2   spe indicator  spe_2
#> 15  spe_3   spe indicator  spe_3
#> 16    vis  <NA>    latent Visual
#> 17    tex  <NA>    latent    tex
#> 18    spe  <NA>    latent    spe

For example, imagine we want to change the model so that all items of the “spe” subscale load on the “tex” latent variable. We would first replace the latent variable “spe” with “tex” in the factor loadings, and then remove all remaining mentions of the “spe” latent variable:

syntax(model) %>%
  mutate(lhs = ifelse(lhs == "spe" & op == "=~", "tex", lhs)) %>%
  filter(!(lhs == "spe" | rhs == "spe")) -> syntax(model)
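
To make the effect of these two steps concrete, here is the same pipeline applied to a small hand-made table. The data.frame `toy` is a made-up stand-in for a few rows of the syntax table (it is not produced by tidySEM), and the sketch assumes dplyr is installed:

```r
library(dplyr)

# A few rows in the style of the syntax table: loadings and (co)variances
toy <- data.frame(
  lhs  = c("tex",   "spe",   "spe",   "spe", "tex", "vis"),
  op   = c("=~",    "=~",    "=~",    "~~",  "~~",  "~~"),
  rhs  = c("tex_1", "spe_1", "spe_2", "spe", "spe", "spe"),
  free = c(0,       0,       1,       1,     1,     1),
  stringsAsFactors = FALSE
)

modified <- toy %>%
  # Step 1: loadings of "spe" now belong to "tex"
  mutate(lhs = ifelse(lhs == "spe" & op == "=~", "tex", lhs)) %>%
  # Step 2: drop every remaining row that still mentions "spe"
  filter(!(lhs == "spe" | rhs == "spe"))

modified
```

Only the three loading rows remain, all with “tex” on the left-hand side. Note that relabelling lhs leaves the free column untouched, so in this toy table both tex_1 and spe_1 keep a fixed loading (free equal to 0).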

The modified model can then be estimated:

estimate_lavaan(model)
#> lavaan 0.6-7 ended normally after 28 iterations
#> 
#>   Estimator                                         ML
#>   Optimization method                           NLMINB
#>   Number of free parameters                         18
#>                                                       
#>   Number of observations                           301
#>                                                       
#> Model Test User Model:
#>                                                       
#>   Test statistic                               330.994
#>   Degrees of freedom                                27
#>   P-value (Chi-square)                           0.000

Optional step 7: Adding paths

In addition to editing the data.frame with model syntax, as described in Step 6, it is also possible to add (or modify) paths by adding lavaan syntax. For example, imagine that, instead of having “vis” and “tex” correlate, we want to add a regression path between them:

model %>%
  add_paths("vis ~ tex") %>%
  estimate_lavaan() %>%
  summary(estimates = TRUE)
#> lavaan 0.6-7 ended normally after 27 iterations
#> 
#>   Estimator                                         ML
#>   Optimization method                           NLMINB
#>   Number of free parameters                         18
#>                                                       
#>   Number of observations                           301
#>                                                       
#> Model Test User Model:
#>                                                       
#>   Test statistic                               330.994
#>   Degrees of freedom                                27
#>   P-value (Chi-square)                           0.000
#> 
#> Parameter Estimates:
#> 
#>   Standard errors                             Standard
#>   Information                                 Expected
#>   Information saturated (h1) model          Structured
#> 
#> Latent Variables:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>   vis =~                                              
#>     vis_1             1.000                           
#>     vis_2             0.558    0.104    5.361    0.000
#>     vis_3             0.711    0.116    6.143    0.000
#>   tex =~                                              
#>     tex_1             1.000                           
#>     tex_2             1.448    0.099   14.657    0.000
#>     tex_3             1.234    0.084   14.696    0.000
#>     spe_1             1.000                           
#>     spe_2             0.319    0.083    3.860    0.000
#>     spe_3             0.437    0.082    5.326    0.000
#> 
#> Regressions:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>   vis ~                                               
#>     tex               0.591    0.092    6.408    0.000
#> 
#> Variances:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>    .vis_1             0.538    0.126    4.280    0.000
#>    .vis_2             1.127    0.102   11.006    0.000
#>    .vis_3             0.860    0.094    9.151    0.000
#>    .tex_1             0.512    0.050   10.281    0.000
#>    .tex_2             0.485    0.065    7.447    0.000
#>    .tex_3             0.343    0.047    7.335    0.000
#>    .spe_1             1.364    0.118   11.587    0.000
#>    .spe_2             0.965    0.079   12.160    0.000
#>    .spe_3             0.908    0.075   12.055    0.000
#>    .vis               0.625    0.134    4.673    0.000
#>     tex               0.560    0.073    7.638    0.000

This function accepts both quoted (character) and unquoted arguments. So, for example, if we want to add a cross-loading of “spe_1” on “vis”, in addition to the regression path above, we could use the following code:

model %>%
  add_paths("vis ~ tex", vis =~ spe_1) %>%
  estimate_lavaan()
#> Warning in lav_object_post_check(object): lavaan WARNING: some estimated ov
#> variances are negative
#> lavaan 0.6-7 ended normally after 64 iterations
#> 
#>   Estimator                                         ML
#>   Optimization method                           NLMINB
#>   Number of free parameters                         18
#>                                                       
#>   Number of observations                           301
#>                                                       
#> Model Test User Model:
#>                                                       
#>   Test statistic                               407.730
#>   Degrees of freedom                                27
#>   P-value (Chi-square)                           0.000