`vtreat::prepare(scale=TRUE)` is a variation of `vtreat::prepare()`
intended to prepare data frames so all the derived input or independent
(`x`) variables are fully in outcome or dependent variable (`y`) units.
This is in the sense of a linear regression for numeric `y`’s
(`vtreat::designTreatmentsN` and `vtreat::mkCrossFrameNExperiment`).

For classification problems (or categorical `y`’s), as of version
`0.5.26` and newer, scaling is established either through a logistic
regression “in link units” or against a 0/1 indicator, depending on the
setting of the `catScaling` argument in `vtreat::designTreatmentsC` or
`vtreat::mkCrossFrameCExperiment`. Prior to this version, for
classification problems the scaling calculation (and only the scaling
calculation) was always handled as a linear regression against a 0/1
`y`-indicator. `catScaling=FALSE` can be a bit faster, as the underlying
linear regression can be a bit quicker than a logistic regression.

This is the appropriate preparation before a geometry/metric sensitive modeling step such as principal components analysis or clustering (such as k-means clustering).

Normally (with `vtreat::prepare(scale=FALSE)`) vtreat passes through a
number of variables with minimal alteration (cleaned numeric), builds
0/1 indicator variables for various conditions (categorical levels,
presence of NAs, and so on), and builds some “in y-units” variables
(catN, catB) that are in fact sub-models. With
`vtreat::prepare(scale=TRUE)` all of these numeric variables are then
re-processed to have mean zero, and slope 1 (when possible) when
appropriately regressed against the y-variable.

This is easiest to illustrate with a concrete example.

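```
library('vtreat')
dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
                      y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE,
                                 catScaling=FALSE,
                                 verbose=FALSE)
# standard (unscaled) treated frame
dTrainCTreatedUnscaled <- prepare(treatmentsC,dTrainC,pruneSig=c(),scale=FALSE)
# y-aware scaled treated frame, used in the checks further down
dTrainCTreatedScaled <- prepare(treatmentsC,dTrainC,pruneSig=c(),scale=TRUE)
```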

```
## Warning in prepare.treatmentplan(treatmentsC, dTrainC, pruneSig = c(), scale =
## FALSE): possibly called prepare() on same data frame as
## designTreatments*()/mkCrossFrame*Experiment(), this can lead to over-fit. To
## avoid this, please use mkCrossFrame*Experiment$crossFrame.
```

```
## Warning in prepare.treatmentplan(treatmentsC, dTrainC, pruneSig = c(), scale =
## TRUE): possibly called prepare() on same data frame as
## designTreatments*()/mkCrossFrame*Experiment(), this can lead to over-fit. To
## avoid this, please use mkCrossFrame*Experiment$crossFrame.
```

Note we have set `catScaling=FALSE` to ask that we treat `y` as a 0/1
indicator and scale using linear regression. The standard vtreat treated
frame converts the original data from this:

```
## x y
## 1 a FALSE
## 2 a FALSE
## 3 a TRUE
## 4 b FALSE
## 5 b TRUE
## 6 <NA> TRUE
```

into this:

```
## x_catP x_catB x_lev_NA x_lev_x_a x_lev_x_b y
## 1 0.5000000 -0.6930972 0 1 0 FALSE
## 2 0.5000000 -0.6930972 0 1 0 FALSE
## 3 0.5000000 -0.6930972 0 1 0 TRUE
## 4 0.3333333 0.0000000 0 0 1 FALSE
## 5 0.3333333 0.0000000 0 0 1 TRUE
## 6 0.1666667 9.2104404 1 0 0 TRUE
```

This is the “standard way” to run vtreat, with the exception that for
this example we set `pruneSig` to `NULL` to suppress variable pruning,
instead of setting it to a value in the interval `(0,1)`. The principle
is: vtreat inflicts the minimal possible alterations on the data,
leaving as much as possible to the downstream machine learning code.
This does turn out to already be a lot of alteration. Mostly vtreat is
taking only steps that are unsafe to leave for later: re-encoding of
large categoricals, re-coding of aberrant values, and bulk pruning of
variables.
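To make the simpler columns of the unscaled frame concrete, the sketch below rebuilds the indicator and prevalence columns by hand in base R. It is only an illustration under the assumption that `x_catP` is the fraction of training rows sharing the row's level (with `NA` treated as its own level); the helper names `level_key` and `by_hand` are hypothetical and this is not vtreat's implementation. The `x_catB` sub-model column is deliberately left out.

```r
# Original training column from the example above
x <- c("a", "a", "a", "b", "b", NA)

# 0/1 indicators for each condition; NA is treated as its own level
x_lev_NA  <- as.numeric(is.na(x))
x_lev_x_a <- as.numeric(!is.na(x) & x == "a")
x_lev_x_b <- as.numeric(!is.na(x) & x == "b")

# Prevalence of each row's level (assumed meaning of x_catP)
level_key  <- ifelse(is.na(x), "missing_level", x)
prevalence <- table(level_key) / length(x)
x_catP     <- as.numeric(prevalence[level_key])

by_hand <- data.frame(x_catP, x_lev_NA, x_lev_x_a, x_lev_x_b)
print(by_hand)
```

The printed columns can be compared against `x_catP` and the `x_lev_*` columns in the unscaled treated frame shown earlier.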

However some procedures, in particular principal components analysis
or geometric clustering, assume all of the columns have been fully
transformed. The usual assumption (“more honored in the breach than the
observance”) is that the columns are centered (mean zero) and scaled.
The non y-aware meaning of “scaled” is unit variance. However, vtreat is
designed to emphasize y-aware processing and we feel the y-aware sense
of scaling should be: unit slope when regressed against y. If you want
standard scaling you can use the standard frame produced by vtreat and
scale it yourself. If you want vtreat style y-aware scaling (which we
strongly think is the right thing to do) you can use
`vtreat::prepare(scale=TRUE)`, which produces a frame that looks like
the following:

```
## x_catP x_catB x_lev_NA x_lev_x_a x_lev_x_b y
## 1 -0.2 -0.11976374 -0.1 -0.1666667 4.807407e-17 FALSE
## 2 -0.2 -0.11976374 -0.1 -0.1666667 4.807407e-17 FALSE
## 3 -0.2 -0.11976374 -0.1 -0.1666667 4.807407e-17 TRUE
## 4 0.1 -0.07564865 -0.1 0.1666667 -9.614813e-17 FALSE
## 5 0.1 -0.07564865 -0.1 0.1666667 -9.614813e-17 TRUE
## 6 0.4 0.51058851 0.5 0.1666667 4.807407e-17 TRUE
```
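As a rough illustration of what “unit slope when regressed against y” means, the `x_lev_NA` column of this scaled frame can be reproduced by hand in base R. The sketch assumes the scaling is simply the centered column multiplied by its one-variable linear regression slope against the 0/1 outcome; `yaware_scale` is a hypothetical helper written for this illustration, not a vtreat function.

```r
# Unscaled x_lev_NA indicator and the outcome as a 0/1 indicator
x <- c(0, 0, 0, 0, 0, 1)
y <- c(0, 0, 1, 0, 1, 1)

# Hypothetical helper: center the column, then multiply by the slope
# of the linear regression of y on the original column
yaware_scale <- function(x, y) {
  slope <- coef(lm(y ~ x))[[2]]
  (x - mean(x)) * slope
}

x_scaled <- yaware_scale(x, y)
print(x_scaled)                      # approximately -0.1 (five times), then 0.5
print(mean(x_scaled))                # approximately 0
print(coef(lm(y ~ x_scaled))[[2]])   # approximately 1
```

Regressing `y` on the rescaled column gives slope 1 because multiplying a column by its own regression slope divides the refit slope by that same factor.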

First we can check the claims. Are the variables mean-zero and slope 1 when regressed against y?

```
slopeFrame <- data.frame(varName = treatmentsC$scoreFrame$varName,
stringsAsFactors = FALSE)
slopeFrame$mean <-
vapply(dTrainCTreatedScaled[, slopeFrame$varName, drop = FALSE], mean,
numeric(1))
slopeFrame$slope <- vapply(slopeFrame$varName,
function(c) {
lm(paste('y', c, sep = '~'),
data = dTrainCTreatedScaled)$coefficients[[2]]
},
numeric(1))
slopeFrame$sig <- vapply(slopeFrame$varName,
function(c) {
treatmentsC$scoreFrame[treatmentsC$scoreFrame$varName == c, 'sig']
},
numeric(1))
slopeFrame$badSlope <-
ifelse(is.na(slopeFrame$slope), TRUE, abs(slopeFrame$slope - 1) > 1.e-8)
print(slopeFrame)
```

```
## varName mean slope sig badSlope
## 1 x_catP 1.850372e-17 1 0.1547700 FALSE
## 2 x_catB -1.156482e-17 1 0.5160763 FALSE
## 3 x_lev_NA -6.938894e-18 1 0.2076623 FALSE
## 4 x_lev_x_a -2.775558e-17 1 0.4097258 FALSE
## 5 x_lev_x_b 4.108149e-33 0 1.0000000 TRUE
```

The above claims are true with the exception of the derived variable
`x_lev_x_b`. This is because the outcome variable `y` has an identical
distribution when the original variable `x=='b'` and when `x!='b'` (it
is `TRUE` half the time in both cases). This means `y` is perfectly
independent of `x=='b'` and the regression slope must be zero (thus,
cannot be 1). vtreat now treats this as needing to scale by a
multiplicative factor of zero. Note also that the significance level
associated with `x_lev_x_b` is large, making this variable easy to
prune. The `varMoves` and significance facts in
`treatmentsC$scoreFrame` are about the un-scaled frame (where
`x_lev_x_b` does in fact move).
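The zero slope can be checked directly on the six training rows. This is a small base-R sketch (the names `is_b`, `rates`, and `slope` are introduced here for illustration only):

```r
# Training data from the example
x <- c("a", "a", "a", "b", "b", NA)
y <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

# Indicator for x == 'b'; the NA row counts as "not b"
is_b <- !is.na(x) & x == "b"

# Rate of y == TRUE within each group
rates <- tapply(y, is_b, mean)
print(rates)

# Slope of the linear regression of the 0/1 outcome on the indicator
slope <- coef(lm(as.numeric(y) ~ as.numeric(is_b)))[[2]]
print(slope)
```

Both groups have the same rate of `TRUE` outcomes, so the fitted slope is zero up to floating point error and no rescaling can turn it into 1.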

For a good discussion of the application of *y*-aware scaling
to Principal Components Analysis please see here.

Previous versions of vtreat (0.5.22 and earlier) would copy variables that could not be sensibly scaled into the treated frame unaltered. This was considered the “most faithful” thing to do. However we now feel that this practice was not safe for many downstream procedures, such as principal components analysis and geometric clustering.

As of version `0.5.26`, `vtreat` also supports a “scaling mode for
categorical outcomes.” In this mode scaling is performed using the
coefficient of a logistic regression fit on a categorical outcome,
instead of the coefficient of a linear fit (with the outcome encoded as
a zero/one indicator).

The idea is that with this mode on we are scaling as a logistic regression would, so we are in logistic regression “link space” (where logistic regression assumes effects are additive). The mode may be well suited for principal components analysis or principal components regression where the target variable is categorical (i.e. classification tasks).

To get this effect we set the argument `catScaling=TRUE` in
`vtreat::designTreatmentsC` or `vtreat::mkCrossFrameCExperiment`. We
demonstrate this below.

```
treatmentsC2 <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE,
catScaling=TRUE,
verbose=FALSE)
dTrainCTreatedScaled2 <- prepare(treatmentsC2,dTrainC,pruneSig=c(),scale=TRUE)
```

```
## Warning in prepare.treatmentplan(treatmentsC2, dTrainC, pruneSig = c(), :
## possibly called prepare() on same data frame as
## designTreatments*()/mkCrossFrame*Experiment(), this can lead to over-fit. To
## avoid this, please use mkCrossFrame*Experiment$crossFrame.
```

```
## x_catP x_catB x_lev_NA x_lev_x_a x_lev_x_b y
## 1 -0.9396225 -1.894112 -3.161922 -0.6931472 0 FALSE
## 2 -0.9396225 -1.894112 -3.161922 -0.6931472 0 FALSE
## 3 -0.9396225 -1.894112 -3.161922 -0.6931472 0 TRUE
## 4 0.4698112 -1.196414 -3.161922 0.6931472 0 FALSE
## 5 0.4698112 -1.196414 -3.161922 0.6931472 0 TRUE
## 6 1.8792449 8.075166 15.809611 0.6931472 0 TRUE
```

Notice the new scaled frame is on a different scale than the original scaled frame. Which scaling is more appropriate or useful likely depends on the problem domain.

The new scaled columns are again mean-0 (so they are not exactly the logistic link values, which may not have been so shifted). The new scaled columns do not necessarily have linear model slope 1 as the original scaled columns did as we see below:

```
## x_catP x_catB x_lev_NA x_lev_x_a x_lev_x_b
## 3.700743e-16 -3.700743e-17 -2.220446e-16 1.110223e-16 0.000000e+00
## y
## 5.000000e-01
```

```
##
## Call:
## lm(formula = y ~ x_lev_NA, data = dTrainCTreatedScaled)
##
## Coefficients:
## (Intercept) x_lev_NA
## 0.5 1.0
```

```
##
## Call:
## lm(formula = y ~ x_lev_NA, data = dTrainCTreatedScaled2)
##
## Coefficients:
## (Intercept) x_lev_NA
## 0.50000 0.03163
```

The new scaled columns, however, are in good logistic link units.

```
vapply(slopeFrame$varName,
function(c) {
glm(paste('y', c, sep = '~'),family=binomial,
data = dTrainCTreatedScaled2)$coefficients[[2]]
},
numeric(1))
```

```
## x_catP x_catB x_lev_NA x_lev_x_a x_lev_x_b
## 1 1 1 1 NA
```
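To see where link-unit values of this kind come from, the `x_lev_x_a` column of the `catScaling=TRUE` frame can be approximated by hand. This base-R sketch assumes the scaling is the centered indicator multiplied by its one-variable logistic regression coefficient; it is an illustration, not vtreat's code, and the names `link_slope`, `x_link`, `refit_glm`, and `refit_lm` are introduced here.

```r
# Unscaled x_lev_x_a indicator and the outcome as a 0/1 indicator
x <- c(1, 1, 1, 0, 0, 0)
y <- c(0, 0, 1, 0, 1, 1)

# Coefficient of a one-variable logistic regression of y on the indicator
link_slope <- coef(glm(y ~ x, family = binomial))[[2]]

# Center, then scale by the logistic coefficient
x_link <- (x - mean(x)) * link_slope
print(x_link)   # compare with x_lev_x_a in the catScaling=TRUE frame

# Refit: the logistic coefficient is now close to 1, the linear slope is not
refit_glm <- coef(glm(y ~ x_link, family = binomial))[[2]]
refit_lm  <- coef(lm(y ~ x_link))[[2]]
print(c(logistic = refit_glm, linear = refit_lm))
```

The `x_lev_NA` column is avoided here because that indicator perfectly separates the outcome on this tiny data set, which makes a plain `glm` fit unstable.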

The intended applications of scale mode include preparing data for metric sensitive applications such as KNN classification/regression and Principal Components Analysis/Regression. Please see here for an article series describing such applications.

Overall the advice is to first use the following pattern:

- Significance prune incoming variables.
- Use *y*-aware scaling.
- Significance prune resulting latent variables.
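The three steps above can be sketched in base R on synthetic data. Everything here is an assumption made for illustration: the data, the helpers `slope_sig` and `yaware`, and the pruning threshold of one over the number of candidate variables. In practice vtreat supplies the significances and the scaling.

```r
set.seed(2023)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- d$x1 + d$x2 + rnorm(n, sd = 0.1)   # x3 is pure noise
vars <- c("x1", "x2", "x3")

# Hypothetical helper: p-value of the slope in a one-variable linear fit
slope_sig <- function(v, y) {
  summary(lm(y ~ v))$coefficients[2, 4]
}

# Step 1: significance prune incoming variables (assumed threshold)
sigs <- vapply(vars, function(v) slope_sig(d[[v]], d$y), numeric(1))
kept <- vars[sigs < 1 / length(vars)]

# Step 2: y-aware scaling (center, then multiply by the regression slope)
yaware <- function(x, y) {
  (x - mean(x)) * coef(lm(y ~ x))[[2]]
}
scaled <- as.data.frame(lapply(d[kept], yaware, y = d$y))

# Step 3: principal components without further rescaling, then compute
# the significance of each latent variable for pruning
pcs <- prcomp(as.matrix(scaled), center = TRUE, scale. = FALSE)
latent_sigs <- apply(pcs$x, 2, slope_sig, y = d$y)
print(kept)
print(latent_sigs)
```

Latent variables whose significance is large would then be dropped before modeling.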

However, practitioners experienced in principal components analysis
may be uncomfortable with the range of eigenvalues or singular values
returned by *y*-aware analysis. If a more familiar scale is
desired we suggest performing the *y*-aware scaling against an
additional scaled and centered *y* to try to get ranges closer
to the traditional unit ranges. This can be achieved as shown below.

```
set.seed(235235)
dTrainN <- data.frame(x1=rnorm(100),
x2=rnorm(100),
x3=rnorm(100),
stringsAsFactors=FALSE)
dTrainN$y <- 1000*(dTrainN$x1 + dTrainN$x2)
cEraw <- vtreat::mkCrossFrameNExperiment(dTrainN,
c('x1','x2','x3'),'y',
scale=TRUE)
```

```
## [1] "vtreat 1.6.4 start initial treatment design Sat Aug 19 12:10:23 2023"
## [1] " start cross frame work Sat Aug 19 12:10:23 2023"
## [1] " vtreat::mkCrossFrameNExperiment done Sat Aug 19 12:10:23 2023"
```

`## [1] "x1" "x2" "x3"`

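```
# names of the treated variables (definition assumed here: taken from
# the treatment plan's score frame)
newvars <- cEraw$treatments$scoreFrame$varName
dM1 <- as.matrix(cEraw$crossFrame[, newvars])
pCraw <- stats::prcomp(dM1,
                       scale.=FALSE,center=TRUE)
print(pCraw)
```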

```
## Standard deviations (1, .., p=3):
## [1] 1160.3144 1057.6874 101.1756
##
## Rotation (n x k) = (3 x 3):
## PC1 PC2 PC3
## x1 0.9653602255 -0.260919781 -0.0007092447
## x2 0.2609205097 0.965359437 0.0012824611
## x3 -0.0003500566 0.001423093 -0.9999989261
```

```
dTrainN$yScaled <- scale(dTrainN$y,center=TRUE,scale=TRUE)
cEscaled <- vtreat::mkCrossFrameNExperiment(dTrainN,
c('x1','x2','x3'),'yScaled',
scale=TRUE)
```

```
## [1] "vtreat 1.6.4 start initial treatment design Sat Aug 19 12:10:23 2023"
## [1] " start cross frame work Sat Aug 19 12:10:23 2023"
## [1] " vtreat::mkCrossFrameNExperiment done Sat Aug 19 12:10:23 2023"
```

`## [1] "x1" "x2" "x3"`

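```
# names of the treated variables (definition assumed here: taken from
# the treatment plan's score frame)
newvars_s <- cEscaled$treatments$scoreFrame$varName
dM2 <- as.matrix(cEscaled$crossFrame[, newvars_s])
pCscaled <- stats::prcomp(dM2,
                          scale.=FALSE,center=TRUE)
print(pCscaled)
```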

```
## Standard deviations (1, .., p=3):
## [1] 0.7866757 0.6880818 0.1097586
##
## Rotation (n x k) = (3 x 3):
## PC1 PC2 PC3
## x1 0.9700658 -0.24208741 0.01913148
## x2 0.2417583 0.97016953 0.01800061
## x3 -0.0229185 -0.01283658 0.99965492
```

Notice the second application of `stats::prcomp` has more standard
scaling of the reported standard deviations (though we still do not
advise choosing latent variables based on mere comparisons to unit
magnitude).
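The change in magnitude follows from how the regression slope responds to rescaling the outcome. As a minimal base-R sketch (independent of vtreat, with names such as `b_raw` and `b_scaled` introduced only for this illustration), standardizing `y` divides each one-variable regression slope by `sd(y)`, so slope-scaled columns shrink by that same factor:

```r
set.seed(235235)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 1000 * (x1 + x2)

# Centered and unit-variance copy of the outcome
y_scaled <- as.numeric(scale(y, center = TRUE, scale = TRUE))

# One-variable regression slopes against the raw and the scaled outcome
b_raw    <- coef(lm(y ~ x1))[[2]]
b_scaled <- coef(lm(y_scaled ~ x1))[[2]]

# The two slopes differ exactly by the factor sd(y)
ratio <- b_raw / b_scaled
print(c(ratio = ratio, sd_y = sd(y)))
```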