This package includs all features that has been extracted from Position-Specific Scoring Matrix (PSSM) in POSSUM website: http://possum.erc.monash.edu/ and some of other features from several articles. each function in this package corresponds to one feature that has extracted from PSSM Matrix.

Feature extraction or feature encoding is a fundamental step in the construction of high-quality machine learning-based models. Specifically, this is a key step for determining the effectiveness of the trained models in bioinformatics applications. In the last two decades, a variety of feature encoding schemes have been proposed in order to exploit useful patterns from protein sequences. Such schemes are often based on sequence information or physicochemical properties of amino acids. Although direct features derived from sequences themselves (such as amino acid compositions, dipeptide compositions, and counting of k-mers) are regarded as essential for training models, an increasing number of studies have shown that evolutionary information in the form of PSSM profiles is much more informative than sequence information alone. Accordingly, PSSM-based feature descriptors have been commonly used as indispensable primary features to construct models, filling a major gap in the current bioinformatics research. For example, PSSM-based feature descriptors have successfully improved the prediction performance of structural and functional properties of proteins across a wide spectrum of bioinformatics applications. These predictions can be applied for protein fold recognition and the prediction of protein structural classes, protein-protein interactions, protein subcellular localization, RNA-binding sites, and protein functions. But at the same time, there is no comprehensive, simple tool in the R programming language for extracting all of these features from the PSSM and displaying it in the output. PSSMCOOL package here is developed in R for these purposes. First, in Figure 1 which is a table that lists all of the features implemented in this package with their feature-lengths is brought. Then each one of these features will be explained in full detail.

PSSMCOOL Package is currently available on CRAN website:

https://CRAN.R-project.org/package=PSSMCOOL

for issues about this package :

https://github.com/BioCool-Lab/PSSMCOOL/issues

Figure 1: List of implemented features in PSSMCOOL package

*feature vector length depends on the choice of parameter

**these features produce Matrix of features which its dimension depends on choice of parameter

`library(PSSMCOOL)`

This feature, which stands for auto-covariance transformation, for column j, calculates the average of this column as shown in Figure 2, Then subtracts the resulting number from the elements on the rows i and i + g of this column, and finally multiplies them and calculates the sum by changing the variable i from 1 to L-g. Because the variable j changes between 1 and 20 and the variable g changes between 1 and 10, eventually a feature vector of length 200 will be obtained.

Figure 2: process of extracting PSSM-AC feature vector from PSSM Matrix

```
w<-PSSMAC(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(w, n = 50)
```

```
## [1] 0.0092 0.0070 0.0181 0.0072 -0.0032 0.0054 -0.0068 -0.0065 -0.0015
## [10] -0.0032 0.0057 0.0102 0.0059 0.0087 0.0176 0.0072 0.0141 -0.0080
## [19] -0.0132 0.0109 0.0271 0.0064 0.0080 0.0107 0.0130 0.0024 0.0062
## [28] 0.0126 0.0108 0.0061 0.0159 -0.0024 0.0157 0.0102 -0.0052 0.0141
## [37] -0.0023 -0.0097 0.0244 0.0333 0.0094 0.0132 -0.0003 -0.0006 0.0158
## [46] 0.0068 0.0094 -0.0069 0.0085 -0.0109
```

This feature stands for dipeptide composition, which multiplies the values that are located in two consecutive rows and two different columns. Having calculated these values for different rows and columns, they are summed. Next, for both columns, the sum is divided by L-1. Since the result depends on two different columns, eventually a feature vector of length 400, according to Figure 3 and following equation ,will be obtained.

Figure 3: process of extracting DPC-PSSM feature vector from PSSM Matrix

In the above equation, ’s, are the PSSM elements and ’s are the 20x20 matrix elements, which by placing the rows of this matrix next to each other, DPC-PSSM feature vector of length 400 is obtained.

```
ss<-DPC_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(ss, n = 50)
```

```
## [1] 0.3773 0.2965 0.3371 0.3049 0.1816 0.4356 0.3395 0.2866 0.3109 0.2576
## [11] 0.2622 0.3150 0.2776 0.3187 0.2383 0.3846 0.2755 0.1791 0.2180 0.2808
## [21] 0.1528 0.1218 0.1387 0.1063 0.0794 0.1766 0.1235 0.1070 0.1410 0.1055
## [31] 0.1085 0.1235 0.1078 0.1354 0.0858 0.1631 0.1174 0.0823 0.0939 0.1186
## [41] 0.1241 0.0928 0.1133 0.0890 0.0416 0.1248 0.0982 0.0916 0.0916 0.0746
```

This feature vector is of length 8000, which is extracted from the PSSM. If we multiply elements available in three consecutive rows and three different columns of the PSSM by each other, and apply this to all rows (all three consecutive rows) and then sum these numbers, eventually one of the elements of feature vector with length 8000 corresponding to the three selected columns will be obtained. Because we have 20 different columns, The final feature vector will be of length 8000 = 20 * 20 * 20. Figure 4: shows these steps. For example, in this figure for three marked rows and columns, the numbers obtained from the intersection of these rows and columns marked with a blue dotted circle around them, are multiplied to each other.

Figure 4: process of extracting trigram-PSSM feature vector from PSSM

```
as<-trigrame_pssm(paste0(system.file("extdata",package="PSSMCOOL"),"/C7GSI6.txt.pssm"))
head(as, n = 50)
```

```
## [1] 6.8369 3.7511 4.1531 2.6852 2.6406 4.3763 2.8829 4.2257 3.7228 5.1379
## [11] 5.2524 4.4484 5.1002 3.8583 5.1569 7.5018 5.2072 2.4309 2.8760 5.6669
## [21] 3.5204 3.0895 2.7871 1.9459 0.7295 3.4388 2.1569 2.4319 3.2735 1.6680
## [31] 1.9269 3.1770 2.1466 2.0473 3.1579 4.1613 2.8492 1.5444 1.8998 1.9281
## [41] 3.3102 2.3229 2.7297 1.6388 1.3118 3.0535 1.9647 2.2095 2.7645 2.8273
```

The length of this feature vector is 320. The first 20 numbers of this feature vector are the mean of 20 columns in PSSM, and the next values for each column are the mean squares of the difference between the elements of row i and i + lag in this column. Because the lag value varies between 1 and 15, the final feature vector will have a length of 320. Figure 5: and following equation shows the process of this function and the corresponding mathematical equation, respectively.

Figure 5: process of extracting Pse-PSSM feature vector from PSSM

```
v<-pse_pssm(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(v, n = 50)
```

```
## [1] 0.3773 0.2965 0.3371 0.3049 0.1816 0.4356 0.3395 0.2866 0.3109 0.2576
## [11] 0.2622 0.3150 0.2776 0.3187 0.2383 0.3846 0.2755 0.1791 0.2180 0.2808
## [21] 0.1636 0.1969 0.1304 0.2129 0.1649 0.1271 0.1766 0.1524 0.1649 0.1478
## [31] 0.1197 0.1964 0.1259 0.1896 0.1973 0.1181 0.0903 0.1807 0.1477 0.1717
## [41] 0.1693 0.1875 0.1729 0.2508 0.1585 0.1495 0.1832 0.1635 0.1666 0.1338
```

This feature is almost identical to the DPC feature, and in fact, the DPC feature is part of this feature (for k = 1) and for two different columns, it considers rows that have distance k.

Figure 6: process of extracting K-separated-bigam-PSSM feature vector from PSSM

```
w<-k_seperated_bigrame(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),5)
head(w, n = 50)
```

```
## [1] 17.8176 13.2148 15.5094 14.4319 8.3346 22.1273 16.1736 13.4963 13.7240
## [10] 11.5597 11.8734 14.7134 12.4310 17.7145 11.2181 18.3770 12.4118 8.3759
## [19] 10.2801 13.0504 14.3991 12.7802 13.3367 12.7504 6.5913 16.3527 13.2403
## [28] 9.4264 10.8272 8.9290 7.9045 12.3299 7.7908 8.3621 10.7674 15.1463
## [37] 10.1184 6.4232 6.2908 10.3672 14.7014 13.7350 16.3146 14.6568 7.6190
## [46] 20.1827 15.6412 13.8415 13.7924 8.5747
```

In this group of features, in order to use uniform dimensions to show proteins of different lengths, in the first step, the average evolutionary score between adjacent residues is calculated using the following equations:

Where L is equal to the length of the protein, is the mean score of positions i and i-1, and is the mean score of positions i and i+1. The evolutionary difference formula (EDF) is then defined as follows:

represents the mean of evolutionary difference between the residues of a given protein sequence. According to EDF, a given protein can be expressed by a 20 x 20 matrix called ED-PSSM, which is defined by the equations

Using this ED-PSSM, the three features EDP, EEDP, and MEDP are defined by the equations following equation, EDP has a length of 20, EEDP has a length of 400, and the MEDP feature is obtained by merging these two feature vectors.

Figure 7: also shows the process of this work. It is noteworthy that in the following equation the value of is removed during the subtraction of and , and therefore this value is not shown in Figure 7.

Figure 7: process of extracting EDP-EEDP-MEDP feature vectors from PSSM

```
as<-EDP_MEDP(paste0(system.file("extdata",package="PSSMCOOL"),"/C7GS61.txt.pssm"))
head(as, n = 50)
```

```
## [1] 0.0470 0.0270 0.0327 0.0310 0.0275 0.0284 0.0307 0.0380 0.0325 0.0425
## [11] 0.0394 0.0267 0.0329 0.0373 0.0334 0.0567 0.0558 0.0383 0.0350 0.0485
## [21] 0.0266 0.0521 0.0464 0.0549 0.0455 0.0409 0.0429 0.0520 0.0574 0.0428
## [31] 0.0425 0.0419 0.0354 0.0564 0.0492 0.0423 0.0436 0.0762 0.0549 0.0359
## [41] 0.0515 0.0095 0.0211 0.0154 0.0123 0.0171 0.0174 0.0290 0.0149 0.0347
```

This feature consists of two types of feature vectors. At first, each protein sequence is divided into 20 equal parts, each of which is called a block, and in each block, the row vectors of the PSSM related to that block are added together. The resulting final vector is divided by the length of that block, which is equal to 5% of protein length. Finally, by placing these 20 vectors side by side, the first feature vector of length 400 is obtained. The second feature for each amino acid in each column is the average of the positive numbers in that column and for each block, and these 20 values, corresponding to 20 blocks, are placed next to each other, and therefore for each of the 20 types of amino acids, a vector of length 20 is obtained, and by placing these together,the second feature vector of length 400, is obtained. Figure 8 represents this process.

Figure 8: process of extracting AB-PSSM feature vectors from PSSM

```
zz<- AB_PSSM(system.file("extdata","C7GRQ3.txt.pssm",package="PSSMCOOL"))
head(zz[1], n = 50)
```

```
## [[1]]
## [1] 0.3595 0.1202 0.3442 0.2908 0.0545 0.3715 0.3510 0.0816 0.1202 0.5049
## [11] 0.3361 0.3861 0.4256 0.2303 0.1354 0.5385 0.5939 0.0278 0.1983 0.4214
## [21] 0.6790 0.3226 0.6670 0.4566 0.1837 0.4365 0.5000 0.3562 0.4365 0.3546
## [31] 0.2157 0.4615 0.3340 0.0664 0.4495 0.7409 0.6774 0.0327 0.0963 0.3460
## [41] 0.4246 0.0571 0.3928 0.3598 0.1810 0.1769 0.3844 0.1568 0.1325 0.1810
## [51] 0.2183 0.1967 0.1040 0.2167 0.5664 0.6090 0.4246 0.0095 0.1126 0.2028
## [61] 0.3562 0.3961 0.2456 0.5310 0.0379 0.6522 0.7260 0.0519 0.2773 0.3794
## [71] 0.2448 0.6587 0.1332 0.1951 0.2488 0.5583 0.6236 0.1798 0.1836 0.3659
## [81] 0.4784 0.1351 0.2863 0.3325 0.4861 0.3323 0.3744 0.0590 0.1585 0.4641
## [91] 0.4197 0.3057 0.3361 0.2889 0.0308 0.4194 0.4896 0.0191 0.1836 0.4511
## [101] 0.6639 0.2440 0.5651 0.6509 0.1202 0.5000 0.6774 0.3460 0.2591 0.2591
## [111] 0.2992 0.4750 0.1821 0.0963 0.2076 0.7159 0.6540 0.0425 0.1083 0.2591
## [121] 0.3725 0.1707 0.5000 0.6774 0.2623 0.4864 0.6654 0.3611 0.5583 0.1668
## [131] 0.1332 0.3628 0.1701 0.3242 0.4116 0.5635 0.4230 0.2857 0.5197 0.2087
## [141] 0.4615 0.1772 0.4929 0.5136 0.0914 0.5000 0.4750 0.3057 0.1788 0.1837
## [151] 0.1701 0.4631 0.1452 0.2053 0.3731 0.6139 0.6188 0.2013 0.1322 0.3476
## [161] 0.2461 0.1869 0.4197 0.2167 0.3073 0.3979 0.3941 0.1986 0.3708 0.5298
## [171] 0.3996 0.3390 0.2960 0.1684 0.0547 0.3947 0.5120 0.0659 0.1739 0.4951
## [181] 0.5586 0.3356 0.3976 0.1937 0.0447 0.4760 0.5003 0.1960 0.2209 0.3008
## [191] 0.3614 0.5616 0.1582 0.0547 0.1665 0.4763 0.3812 0.0142 0.0278 0.3627
## [201] 0.3760 0.3426 0.2342 0.0959 0.0346 0.3812 0.2154 0.0451 0.3520 0.0891
## [211] 0.1357 0.6648 0.2956 0.2079 0.3536 0.5045 0.4763 0.0278 0.2264 0.0491
## [221] 0.3073 0.0240 0.2523 0.0343 0.3199 0.1380 0.0282 0.0207 0.2364 0.4197
## [231] 0.2254 0.0233 0.1234 0.2630 0.0231 0.4732 0.5395 0.1674 0.2817 0.4977
## [241] 0.5884 0.3346 0.3812 0.3947 0.1923 0.5404 0.5216 0.4313 0.3193 0.4583
## [251] 0.3861 0.4880 0.3210 0.1273 0.2737 0.5651 0.5385 0.0327 0.0833 0.4534
## [261] 0.6654 0.2575 0.6269 0.5635 0.1572 0.5000 0.6405 0.6774 0.5770 0.2190
## [271] 0.2076 0.5000 0.2326 0.1468 0.5385 0.7560 0.6155 0.0474 0.1322 0.2960
## [281] 0.3210 0.4029 0.5071 0.3682 0.4511 0.2206 0.2542 0.2610 0.4145 0.2639
## [291] 0.2704 0.3611 0.2430 0.1635 0.1684 0.5838 0.3845 0.0327 0.1582 0.2618
## [301] 0.5169 0.1037 0.0756 0.1557 0.1458 0.0888 0.0418 0.3445 0.2564 0.2096
## [311] 0.1406 0.0717 0.1231 0.3712 0.0299 0.6062 0.3536 0.2284 0.3184 0.2779
## [321] 0.1866 0.2329 0.5026 0.5224 0.0884 0.2542 0.4347 0.2264 0.1582 0.0247
## [331] 0.0146 0.4253 0.0205 0.0059 0.1610 0.2759 0.3356 0.0040 0.0156 0.1360
## [341] 0.1701 0.1836 0.1869 0.0341 0.3225 0.2553 0.1133 0.0191 0.5418 0.3611
## [351] 0.4163 0.2128 0.2300 0.5189 0.0222 0.3562 0.3679 0.3881 0.6778 0.4246
## [361] 0.4864 0.2825 0.6524 0.7289 0.0425 0.4615 0.7611 0.7645 0.2342 0.0963
## [371] 0.0714 0.6654 0.1083 0.0628 0.3595 0.5000 0.3980 0.0964 0.1034 0.1023
## [381] 0.2007 0.2046 0.1458 0.1235 0.2044 0.1362 0.1407 0.0557 0.2154 0.6054
## [391] 0.4998 0.1505 0.5000 0.3558 0.0226 0.2677 0.2905 0.1722 0.3651 0.5821
```

In this feature, at first, a TPM matrix is constructed from the PSSM, which has represented by a vector corresponding to the following equation:

Where the components are as follows:

In the above equation, the numerator is the same as the equation related to DPC-PSSM feature without considering its coefficient. By placing these components together, a TPC feature vector of length 400 is obtained, and if we add the AAC feature vector of length 20 which is the average of columns of the PSSM to the beginning of this vector, AATP feature vector of length 420 is obtained.

```
as<-AATP_TPCC(paste0(system.file("extdata",package="PSSMCOOL"),"/C7GQS7.txt.pssm"))
head(as, n = 50)
```

```
## [1] 0.3773 0.2965 0.3371 0.3049 0.1816 0.4356 0.3395 0.2866 0.3109 0.2576
## [11] 0.2622 0.3150 0.2776 0.3187 0.2383 0.3846 0.2755 0.1791 0.2180 0.2808
## [21] 0.0640 0.0510 0.0580 0.0445 0.0332 0.0739 0.0517 0.0448 0.0590 0.0442
## [31] 0.0454 0.0517 0.0451 0.0567 0.0359 0.0683 0.0492 0.0344 0.0393 0.0496
## [41] 0.0706 0.0528 0.0645 0.0507 0.0237 0.0710 0.0559 0.0521 0.0521 0.0425
```

This feature consists of a combination of several types of features, and in general, the obtained feature vector would be of length 700. Here all parts of this feature are described separately.

First, the consensus sequence is obtained from the PSSM according to the following equation, then from this consensus sequence next feature vectors are obtained

where is the index for the largest element in row i of PSSM, and the ith component in the consensus sequence equal to the ’th amino acid in the standard amino acid alphabet which the column names of PSSM are labeled by them. Now, using the following equation, the feature vector of length 20 is obtained:

Here shows the number of j-th amino acid occurrences in the consensus sequence

This feature vector is obtained using the following equation from the consensus sequence:

Here shows the number of i-th amino acid occurrences in the consensus sequence and indicates the j-th position of the i-th amino acid in the consensus sequence.

Here the PSSM is divided into n segments, which corresponds to dividing the initial protein sequence into n segments, if n = 2:

Where L represents the length of the initial protein and indexed L’s indicate the length of the first and second segments, respectively. Now, using the following equations, the feature vector components of length 200 are obtained

And if n = 3 then we have:

Therefore, using the following equations, the components of a feature vector with length 180 are obtained as follows:

In total, using the previous feature vector, a feature vector of length 380 is obtained for this group.

In this group, using the previous equations and the following equations, the feature vector of length 280 could be obtained. when n=2:

when n=3:

If we connect all these feature vectors together, we will get a feature vector with a length of 700, which is reduced by PCA method and is used as input for the support vector machine classifier.

```
A<-CS_PSe_PSSM(system.file("extdata", "C7GSI6.txt.pssm", package="PSSMCOOL"),"total")
head(A, n = 50)
```

```
## [1] 0.0833 0.0278 0.0694 0.0139 0.0556 0.0139 0.0000 0.0278 0.0139 0.1250
## [11] 0.0833 0.0417 0.0556 0.0556 0.1111 0.0833 0.0278 0.0139 0.0417 0.0556
## [21] 0.0350 0.0139 0.0299 0.0082 0.0401 0.0104 0.0000 0.0211 0.0012 0.0841
## [31] 0.0401 0.0335 0.0280 0.0411 0.0149 0.0327 0.0125 0.0121 0.0196 0.0358
## [41] 0.4777 0.2352 0.2671 0.1705 0.1904 0.2805 0.2288 0.2839 0.2522 0.3366
```

If we sum the numbers of each column in the PSSM, we get a feature vector of length 20 as follows:

If we remove the negative elements somehow of the PSSM and call the resulting new matrix FPSSM and then calculate this feature vector for FPSSM, the components of this vector will depend on the length of the original protein, so to eliminate this dependency, we normalize the components of this vector using the following equation:

Where, min and max represent the smallest and largest values of the previous vector components, respectively, and L represents the length of the original protein. The second feature vector with length 400 is obtained as follows:

If we name the columns of the FPSSM from to in the order from left to right, then is equal to the sum of those members in the j-th column in the FPSSM whose corresponding row amino acid is equal to . Figure 9 schematically shows these steps:

```
q<-FPSSM(system.file("extdata","C7GQS7.txt.pssm",package="PSSMCOOL"),20)
head(q, n = 50)
```

```
## [[1]]
## [1] 62 65 48 87 105 124 61 73 75 49 63 74 48 116 83 53 21 80 55
## [20] 54
##
## [[2]]
## [1] 14 6 1 2 5 5 0 7 7 1 2 3 1 6 2 2 2 0 0 1 3 11 2 0 2
## [26] 0 0 0 3 0 1 1 1 0 0 1 1 0 1 0 9 1 11 12 0 3 1 14 0 1
## [51] 0 1 0 9 4 5 2 0 0 3 3 4 4 41 0 7 14 1 6 0 3 5 0 0 0
## [76] 3 0 0 0 0 0 0 0 0 87 0 0 0 0 0 0 0 0 5 0 2 0 0 0 0
## [101] 3 10 6 8 0 52 12 6 15 0 1 2 10 7 21 10 1 1 0 0 0 1 0 3 0
## [126] 4 5 4 0 0 0 0 0 0 1 0 0 1 0 0 6 0 7 2 1 4 2 23 2 2
## [151] 1 4 4 5 0 0 0 6 0 1 0 2 4 1 0 16 3 3 17 0 0 8 5 1 0
## [176] 0 0 0 0 0 1 0 0 0 1 0 0 0 0 16 9 6 4 4 0 0 0 1 0 9
## [201] 2 0 0 5 2 0 5 0 0 11 23 0 4 2 0 0 0 0 1 6 2 19 4 5 6
## [226] 7 10 3 7 0 0 40 0 0 0 5 2 6 2 0 0 0 0 0 0 0 0 0 0 0
## [251] 3 0 9 0 0 0 0 0 0 0 5 2 1 0 1 8 0 1 6 4 14 0 6 46 0
## [276] 0 1 12 11 4 5 1 1 7 0 5 0 10 3 0 0 0 0 1 52 1 0 7 1 0
## [301] 5 3 4 0 0 5 4 0 3 0 1 3 1 1 1 20 1 9 1 3 4 0 0 1 0
## [326] 1 1 1 1 0 0 0 0 5 1 3 11 2 2 0 0 0 0 0 0 0 0 0 0 2
## [351] 1 0 0 7 0 0 0 16 2 0 0 0 0 0 0 2 0 0 2 2 0 0 0 9 0
## [376] 0 0 16 30 3 0 5 3 0 0 5 4 0 3 10 4 1 3 8 1 1 0 3 4 24
```

To generate this feature vector, the consensus sequence corresponding to the protein sequence is extracted using the PSSM. Then, by placing these two sequences next to each other, a matrix with dimensions of 2 * L will be created. In the next step, each component in the upper row of this matrix is connected to two components in the lower row of this matrix, and thus a graph similar to a bipartite graph could be created. Now in this graph, each path of length 2 specifies a 3-mer and each path of length 1 denotes a 2-mer corresponding to these two sequences. Now if we consider a table consisting of two rows and 8000 columns so that the first row contains all possible 3-mers of 20 amino acids, then for every 3-mer obtained from this graph, we put number 1 below the corresponding cell With that 3-mer in the aforementioned table and 0 in other cells. so This gives us a vector of length 8000. For the 2-mers obtained from this graph, a vector of length 400 is obtained in a similar way. figures 10, 11 show these processes.

```
zz<- scsh2(system.file("extdata","C7GRQ3.txt.pssm",package="PSSMCOOL"),2)
head(zz, n = 200)
```

```
## [1] 1 1 1 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0
## [38] 1 0 1 0 0 1 1 1 1 1 1 1 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 0 1 1 0 0 1 0 1
## [75] 1 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 1 1 1 1
## [112] 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 1 0 1 1 1 1 1
## [149] 1 1 0 1 1 1 0 1 1 1 1 0 1 0 1 0 1 0 1 0 1 1 0 1 0 1 0 1 1 0 1 1 1 1 1 1 0
## [186] 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1
```

If we represent the PSSM as follows:

The indices will show the standard 20 amino acids. If we assume that our primary protein has length L, each of the above columns is as follows:

Now, using the following equations, we merge the columns of the PSSM and obtain a matrix with dimensions :

Now using this new matrix we get a feature vector of length 10 as follows:

where

now using following equations; will create a feature vector of length 100 and by combining the feature vector of length 10 mentioned previously, the final feature vector of length 110 will be created.

```
w<-rpssm(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(w, n = 50)
```

```
## [1] 0.0620 0.0528 0.0646 0.0674 0.0806 0.0810 0.0883 0.0824 0.0988 0.0950
## [11] 0.0621 0.0482 0.0640 0.0575 0.0896 0.1044 0.0888 0.0752 0.1037 0.1039
## [21] 0.0787 0.0618 0.0731 0.0671 0.0982 0.1050 0.0912 0.0805 0.1200 0.1085
## [31] 0.0559 0.0589 0.0685 0.0373 0.0523 0.0643 0.0633 0.0727 0.0819 0.0853
## [41] 0.0693 0.0864 0.1022 0.0645 0.0493 0.0570 0.0811 0.1101 0.0714 0.0823
```

This feature, which is similar to the PSSM-AC feature, stands for cross-covariance transformation. for column , Calculates the average of this column as shown in Figure 12, and then subtract the result from the number on the i-th row in this column. Similarly, the feature calculates the average for the column and then subtracts the resulting number from the value on row i + g of this column and finally multiplies them. By changing the variable i from 1 to L-g, it calculates the sum of these, because the variable changes between 1 and 20 and the variable changes in the same interval (1,20) except for the number selected for the variable , eventually feature vector of length 380 will be obtained.

```
aa<-pssm_cc(system.file("extdata","C7GQS7.txt.pssm",package="PSSMCOOL"),18)
head(aa, n = 50)
```

```
## [1] 0.0150 0.0009 -0.0080 -0.0197 0.0277 -0.0021 0.0034 0.0222 -0.0033
## [10] -0.0087 0.0161 -0.0024 -0.0144 0.0161 0.0050 -0.0056 -0.0024 -0.0087
## [19] -0.0165 0.0004 -0.0035 0.0101 0.0057 -0.0256 -0.0114 -0.0009 -0.0179
## [28] 0.0027 -0.0024 -0.0127 -0.0113 -0.0155 -0.0056 -0.0020 0.0040 -0.0121
## [37] -0.0061 0.0026 -0.0045 0.0110 0.0048 0.0076 -0.0167 -0.0103 -0.0117
## [46] 0.0049 -0.0052 -0.0062 0.0077 -0.0117
```

Discrete cosine transforms can be described as follows:

where:

![(u)={ \[\begin{array}{ll}\sqrt{\frac{1}{M}} & u=0\\ \sqrt{\frac{2}{M}} & 1\leq{u}\leq{M-1}\end{array}\] .\ (v)={ \[\begin{array}{ll}\sqrt{\frac{1}{N}} & v=0\\ \sqrt{\frac{2}{N}} & 1\leq{v}\leq{N-1}\end{array}\].](https://latex.codecogs.com/svg.latex?%5Clarge%20%5Crho%28u%29%3D%5Cleft%5C%7B%5Cbegin%7Barray%7D%7Bll%7D%5Csqrt%7B%5Cfrac%7B1%7D%7BM%7D%7D%20%26%20u%3D0%5C%5C%20%5Csqrt%7B%5Cfrac%7B2%7D%7BM%7D%7D%20%26%201%5Cleq%7Bu%7D%5Cleq%7BM-1%7D%5Cend%7Barray%7D%5Cright.%5C%5C%20%5Crho%28v%29%3D%5Cleft%5C%7B%5Cbegin%7Barray%7D%7Bll%7D%5Csqrt%7B%5Cfrac%7B1%7D%7BN%7D%7D%20%26%20v%3D0%5C%5C%20%5Csqrt%7B%5Cfrac%7B2%7D%7BN%7D%7D%20%26%201%5Cleq%7Bv%7D%5Cleq%7BN-1%7D%5Cend%7Barray%7D%5Cright.)

In above Equation, the matrix is the input signal and here represents the PSSM with dimensions . According to the aforementioned equation, it is clear that the length of the resulting feature vector depends on the length of the original protein, so in most articles that have used this feature vector, the final feature vector DCT, which encodes a protein sequence by choosing the first 400 coefficients is obtained.

```
as<-Discrete_Cosine_Transform(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(as, n = 50)
```

```
## [1] 3.7586 -0.8904 -1.8115 0.9043 1.3727 -1.3320 -0.1494 0.4772 0.3272
## [10] 0.2038 -0.3245 -0.9033 1.0473 0.2255 -1.2477 0.3651 1.1561 -0.8085
## [19] -0.7821 1.1143 9.4695 -1.3575 -0.4747 0.3296 1.2757 0.6152 -0.9732
## [28] 1.0494 -0.5279 0.2443 -0.7019 -0.8720 -1.4273 0.1932 -0.0210 0.5049
## [37] -0.8403 -1.4087 0.6963 -0.1914 6.2212 -0.6490 -0.3823 0.3074 0.6949
## [46] -0.7493 -0.3240 1.9688 -0.0276 0.2205
```

Wavelet transform (WT) is defined as the signal image on the wavelet function according to the following equation:

Where a is a scale variable and b is a transition variable. is the analyze wavelet function. is the conversion factor found for both specific locations on the signal as well as specific wavelet periods. Discrete wavelet transform can decompose amino acid sequences into coefficients in different states and then remove the noise component from the profile.Assuming that the discrete signal is equal to , where the length of the discrete signal is equal to N, we have the following equations:

In these equations, g is the low-pass filter and h is the high-pass filter. is the approximate coefficient (low-frequency components) of the signal and is the exact coefficient (high-frequency components) of the signal.This decomposition is repeated to further increase the frequency resolution, and the approximate coefficients are decomposed with high and low pass filters and then are sampled lower. By increasing the level of j decomposition, we can see more accurate characteristics of the signal. We use level 4 DWT and calculate the maximum, minimum, mean, and standard deviation of different scales (4 levels of both coefficients High and low frequency). Because high-frequency components have high noise, only low-frequency components are more important.A schematic diagram of a level 4 DWT is shown in Figure 13:

The PSSM has 20 columns. Therefore, the PSSM consists of 20 types of discrete signals (L-length). Therefore, we used the level 4 DWT as mentioned above to analyze these discrete signals from PSSM (each column) and to extract the PSSM-DWT feature vector from the PSSM, which is a feature vector of length 80.

```
as<-dwt_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(as, n = 50)
```

```
## [1] -1.2032 2.1352 0.0547 0.4630 -0.9770 2.0423 0.0309 0.4129 -0.7134
## [10] 2.3055 0.0884 0.4222 -0.6967 1.7063 0.0818 0.4345 -0.7764 1.0817
## [19] 0.0411 0.3327 -1.2524 3.9385 0.0943 0.5317 -0.8283 2.3753 0.0723
## [28] 0.4349 -0.8160 2.0562 0.0561 0.4058 -0.7764 2.6698 0.0822 0.4252
## [37] -0.6910 2.2393 0.0964 0.3850 -0.7962 2.7484 0.0878 0.4019 -0.6288
## [46] 1.9567 0.0677 0.4157 -0.7807 1.9298
```

For the purpose of predicting disulfide bond in protein at first, the total number of cysteine amino acids in the protein sequence is counted and their position in the protein sequence is identified. Then, using a sliding window with a length of 13, moved on the PSSM from top to bottom so that the middle of the window is on the amino acid cysteine, then the rows below the matrix obtained from the PSSM with the dimension of 13 x 20 are placed next to each other to get a feature vector with a length of 260 = 20 * 13 per cysteine. If the position of the first and last cysteine in the protein sequence is such that the middle of sliding window is not on cysteine residue while moving on PSSM, then the required number of zero rows from top and bottom is added to the PSSM matrix to achieve this goal.Thus, for every cysteine amino acid presented in protein sequence, a feature vector with a length of 260 is formed.Then all the pairwise combinations of these cysteines is wrote in the first column of a table. In front of each of these pairwise combinations, the corresponding feature vectors are stuck together to get a feature vector of length 520 for each of these compounds.Finally, the table obtained in this way will have the number of rows equal to the number of all pairwise combinations of these cysteines and the number of columns will be equal to 521 (the first column includes the name of these pair combinations). It is easy to divide this table into training and testing data and predict the desired disulfide bonds between cysteines.Figure 14 shows a schematic of this process:

```
aq<-disulfid(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
head(aq[,1:50])
```

```
## 1 2 3 4 5 6 7 8 9 10 11
## 1 c1c2 0.2689 0.982 0.7311 0.1192 0.018 0.9526 0.7311 0.0474 0.8808 0.0474
## 2 c1c3 0.2689 0.982 0.7311 0.1192 0.018 0.9526 0.7311 0.0474 0.8808 0.0474
## 3 c1c4 0.2689 0.982 0.7311 0.1192 0.018 0.9526 0.7311 0.0474 0.8808 0.0474
## 4 c1c5 0.2689 0.982 0.7311 0.1192 0.018 0.9526 0.7311 0.0474 0.8808 0.0474
## 5 c1c6 0.2689 0.982 0.7311 0.1192 0.018 0.9526 0.7311 0.0474 0.8808 0.0474
## 6 c1c7 0.2689 0.982 0.7311 0.1192 0.018 0.9526 0.7311 0.0474 0.8808 0.0474
## 12 13 14 15 16 17 18 19 20 21 22 23
## 1 0.1192 0.7311 0.5 0.1192 0.1192 0.7311 0.5 0.018 0.1192 0.0474 0.5 0.2689
## 2 0.1192 0.7311 0.5 0.1192 0.1192 0.7311 0.5 0.018 0.1192 0.0474 0.5 0.2689
## 3 0.1192 0.7311 0.5 0.1192 0.1192 0.7311 0.5 0.018 0.1192 0.0474 0.5 0.2689
## 4 0.1192 0.7311 0.5 0.1192 0.1192 0.7311 0.5 0.018 0.1192 0.0474 0.5 0.2689
## 5 0.1192 0.7311 0.5 0.1192 0.1192 0.7311 0.5 0.018 0.1192 0.0474 0.5 0.2689
## 6 0.1192 0.7311 0.5 0.1192 0.1192 0.7311 0.5 0.018 0.1192 0.0474 0.5 0.2689
## 24 25 26 27 28 29 30 31 32 33 34 35
## 1 0.8808 0.1192 0.0474 0.1192 0.0474 0.0474 0.2689 0.5 0.5 0.1192 0.5 0.8808
## 2 0.8808 0.1192 0.0474 0.1192 0.0474 0.0474 0.2689 0.5 0.5 0.1192 0.5 0.8808
## 3 0.8808 0.1192 0.0474 0.1192 0.0474 0.0474 0.2689 0.5 0.5 0.1192 0.5 0.8808
## 4 0.8808 0.1192 0.0474 0.1192 0.0474 0.0474 0.2689 0.5 0.5 0.1192 0.5 0.8808
## 5 0.8808 0.1192 0.0474 0.1192 0.0474 0.0474 0.2689 0.5 0.5 0.1192 0.5 0.8808
## 6 0.8808 0.1192 0.0474 0.1192 0.0474 0.0474 0.2689 0.5 0.5 0.1192 0.5 0.8808
## 36 37 38 39 40 41 42 43 44 45 46 47
## 1 0.2689 0.5 0.2689 0.9526 0.982 0.8808 0.2689 0.1192 0.2689 0.9526 0.0474 0.5
## 2 0.2689 0.5 0.2689 0.9526 0.982 0.8808 0.2689 0.1192 0.2689 0.9526 0.0474 0.5
## 3 0.2689 0.5 0.2689 0.9526 0.982 0.8808 0.2689 0.1192 0.2689 0.9526 0.0474 0.5
## 4 0.2689 0.5 0.2689 0.9526 0.982 0.8808 0.2689 0.1192 0.2689 0.9526 0.0474 0.5
## 5 0.2689 0.5 0.2689 0.9526 0.982 0.8808 0.2689 0.1192 0.2689 0.9526 0.0474 0.5
## 6 0.2689 0.5 0.2689 0.9526 0.982 0.8808 0.2689 0.1192 0.2689 0.9526 0.0474 0.5
## 48 49 50
## 1 0.8808 0.0474 0.2689
## 2 0.8808 0.0474 0.2689
## 3 0.8808 0.0474 0.2689
## 4 0.8808 0.0474 0.2689
## 5 0.8808 0.0474 0.2689
## 6 0.8808 0.0474 0.2689
```

The extraction of this feature is obtained using the following equations from the PSSM: