Multivariate Tests With Two or More Continuous Variables
Tests on One or Two Mean Vectors

Multivariate Versus Univariate Tests

Why do we prefer multivariate tests over separate univariate tests?
- The use of \(p\) univariate tests inflates the Type I error rate, \(\alpha\), whereas the multivariate test preserves the exact \(\alpha\) level. For example, if we do \(p = 10\) separate univariate tests at the .05 level, the probability of at least one false rejection is greater than .05. If the variables were independent (they rarely are), we would have, under \(H_0\), \(P(\text{at least one rejection}) = 1-(.95)^{10} \approx .40\) (see the short calculation after this list).
- The univariate tests completely ignore the correlations among the variables, whereas the multivariate tests make direct use of the correlations.
- The multivariate test is more powerful in many cases. The power of a test is the probability of rejecting \(H_0\) when it is false. In some cases, all \(p\) of the univariate tests fail to reach significance, but the multivariate test is significant because small effects on some of the variables combine to jointly indicate significance. However, for a given sample size, there is a limit to the number of variables a multivariate test can handle without losing power. This is discussed further in Section 5.3.2.
- Many multivariate tests involving means have, as a by-product, the construction of a linear combination of variables that reveals more about how the variables unite to reject the hypothesis.
Test on \(\mu\) with \(\Sigma\) Known

Multivariate Test for \(H_{0}:\mu=\mu_{0}\) with \(\Sigma\) Known

\[ H_{0}:\left(\begin{array}{c} \mu_{1}\\ \mu_{2}\\ \vdots\\ \mu_{p} \end{array}\right)=\left(\begin{array}{c} \mu_{01}\\ \mu_{02}\\ \vdots\\ \mu_{0p} \end{array}\right),\qquad H_{1}:\left(\begin{array}{c} \mu_{1}\\ \mu_{2}\\ \vdots\\ \mu_{p} \end{array}\right)\neq\left(\begin{array}{c} \mu_{01}\\ \mu_{02}\\ \vdots\\ \mu_{0p} \end{array}\right) \] To test \(H_{0}\), we use a random sample of \(n\) observation vectors \(y_{1}, y_{2}, \dots, y_{n}\) from \(N_{p}(\mu, \Sigma)\), with \(\Sigma\) known, and calculate \(\bar{y}=\sum_{i=1}^{n}y_{i}/n\). The test statistic is:
\[ Z^{2}=n\left(\bar{y}-\mu_{0}\right)^{'}\Sigma^{-1}\left(\bar{y}-\mu_{0}\right) \]
If \(H_{0}\) is true, \(Z^{2}\) is distributed as \(\chi_{p}^{2}\), and we therefore reject \(H_{0}\) if \(Z^{2} >\chi_{\alpha,p}^{2}\).
Example 5.2.2
In Table 3.1, height and weight were given for a sample of 20 college-age males. Let us assume that this sample originated from the bivariate normal \(N_{2}\left(\mu,\Sigma\right)\), where
\[ \Sigma=\left(\begin{array}{cc} 20 & 100\\ 100 & 1000 \end{array}\right) \]
Suppose we wish to test \(H_{0}: \mu = (70, 170)'\).
Recovering the dataset from Table 3.1 (20 college-age males, height and weight) and setting up \(\mu_{0}\) and \(\Sigma\):
height <- c(69, 74, 68, 70, 72, 67, 66, 70, 76, 68, 72, 79, 74, 67, 66, 71, 74, 75, 75, 76)
weight <- c(153, 175, 155, 135, 172, 150, 115, 137, 200, 130, 140, 265, 185, 112, 140, 150, 165, 185, 210, 220)
data522 <- cbind(height, weight) %>% as.data.frame()
knitr::kable(data522, caption = 'College-Age Males, Exercise 5.2.2', booktabs = TRUE)
height | weight |
---|---|
69 | 153 |
74 | 175 |
68 | 155 |
70 | 135 |
72 | 172 |
67 | 150 |
66 | 115 |
70 | 137 |
76 | 200 |
68 | 130 |
72 | 140 |
79 | 265 |
74 | 185 |
67 | 112 |
66 | 140 |
71 | 150 |
74 | 165 |
75 | 185 |
75 | 210 |
76 | 220 |
# mu0
mu522 <- matrix(c(70, 170), nrow = 2, ncol = 1)
mu522
##      [,1]
## [1,]   70
## [2,]  170
sigma522 <- matrix(c(20, 100, 100, 1000), nrow = 2, ncol = 2)
sigma522
##      [,1] [,2]
## [1,]   20  100
## [2,]  100 1000
Solution:
test_mu_sigKnown <- function(db, mu0, sigma, alpha){
  # This function computes Z^2 (eq. 5.2).
  # Parameters:
  #   db:    database containing columns of variables
  #   mu0:   vector of means in H0
  #   sigma: known covariance matrix
  #   alpha: significance level
  n <- nrow(db)
  p <- ncol(db)
  y_hat <<- as.matrix(apply(db, MARGIN = 2, mean))  # sample mean for each column
  Z2 <- n * t(y_hat - mu0) %*% solve(sigma) %*% (y_hat - mu0)
  # performing the test
  print(paste("Z2 value:", round(Z2, 5), sep = " "))
  print(paste("chi2:", round(qchisq(1 - alpha, df = p), 5), "with p =", p, "degrees of freedom", sep = " "))
  print(paste("reject H0:", Z2 > qchisq(1 - alpha, df = p)))
}
Using \(\alpha=0.05\) and knowing that \(p=2\) (because we have two variables, height and weight), the critical value is \(\chi_{\alpha, p}^{2}=\chi_{.05, 2}^{2}\). Now we can compute the test:
test_mu_sigKnown(db=data522, mu0 = mu522, sigma = sigma522, alpha= 0.05)
## [1] "Z2 value: 8.4026" ## [1] "chi2: 5.99146 with p = 2 degrees of freedom" ## [1] "reject H0: TRUE"
Tests on \(\mu\) when \(\Sigma\) is unknown
Example 5.3.2.
In Table 3.3 we have \(n=10\) observations on \(p=3\) variables. Desirable levels for \(y_1\) and \(y_2\) are \(15\) and \(6\), respectively, and the expected level of \(y_3\) is \(2.85\). We can, therefore, test the hypothesis
\[H_0: \mu=\begin{pmatrix} 15\\ 6\\ 2.85 \end{pmatrix}\]
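Since \(\Sigma\) is now unknown, the statistic used below is Hotelling's one-sample \(T^2\) (the "eq. 5.5" referenced in the code; stated here for completeness):

\[T^{2}=n\left(\bar{y}-\mu_{0}\right)'\mathbf{S}^{-1}\left(\bar{y}-\mu_{0}\right),\qquad \frac{\nu-p+1}{\nu p}\,T^{2}\sim F_{p,\;\nu-p+1},\qquad \nu=n-1,\]

so we reject \(H_0\) at level \(\alpha\) when the rescaled statistic exceeds \(F_{\alpha,\,p,\,\nu-p+1}\).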
Data from Table 3.3:
y_1 <- c(35, 35, 40, 10, 6, 20, 35, 35, 35, 30)
y_2 <- c(4.9, 3.5, 30, 2.8, 2.7, 2.8, 4.6, 10.9, 8, 1.6)
y_3 <- c(2.7, 2.8, 4.38, 3.21, 2.73, 2.81, 2.88, 2.9, 3.28, 3.2)
data532 <- as.matrix(cbind(y_1, y_2, y_3))
knitr:: kable( data532, caption = 'Data Exercise 5.32', booktabs = TRUE )
y_1 | y_2 | y_3 |
---|---|---|
35 | 4.9 | 2.70 |
35 | 3.5 | 2.80 |
40 | 30.0 | 4.38 |
10 | 2.8 | 3.21 |
6 | 2.7 | 2.73 |
20 | 2.8 | 2.81 |
35 | 4.6 | 2.88 |
35 | 10.9 | 2.90 |
35 | 8.0 | 3.28 |
30 | 1.6 | 3.20 |
# mu0
mu532 <- matrix(c(15.0, 6.0, 2.85), nrow = 3, ncol = 1)
mu532
##       [,1]
## [1,] 15.00
## [2,]  6.00
## [3,]  2.85
db <- data532
mu0 <- mu532
n <- nrow(db)
v <- n - 1
p <- ncol(db)
y_hat <- as.matrix(apply(db, MARGIN = 2, mean))  # sample mean for each column
S <- as.matrix(cov(db))
S
y_hat
##            y_1       y_2       y_3
## y_1 140.544444 49.680000 1.9412222
## y_2  49.680000 72.248444 3.6760889
## y_3   1.941222  3.676089 0.2501211
##       [,1]
## y_1 28.100
## y_2  7.180
## y_3  3.089
T2 <- n * t(y_hat - mu0) %*% solve(S) %*% (y_hat - mu0)
test_mu_sigUnknown <- function(db, mu0, alpha){
  library(ICSNP)
  # This function computes T^2 (eq. 5.5) and its F-adjusted version.
  # Parameters:
  #   * db:    database containing columns of variables
  #   * mu0:   vector of means in H0
  #   * alpha: significance level
  n <- nrow(db)
  v <- n - 1
  p <- ncol(db)
  y_hat <- as.matrix(apply(db, MARGIN = 2, mean))  # sample mean for each column
  S <- as.matrix(cov(db))
  T2 <- n * t(y_hat - mu0) %*% solve(S) %*% (y_hat - mu0)
  T2_adj <- T2 * (v - p + 1) / (v * p)
  # performing the test
  print(paste("T2 value:", round(T2, 5), sep = " "))
  print(paste("T2 F-adjusted value:", round(T2_adj, 5), sep = " "))
  print(paste("F-critial value:", round(qf(1 - alpha, df1 = p, df2 = v - p + 1), 5), "with", p, "and", v - p + 1, " degrees of freedom", sep = " "))
  print(paste("reject H0:", T2_adj > qf(1 - alpha, df1 = p, df2 = v - p + 1)))
  print(paste("p-value:", pf(T2_adj, df1 = p, df2 = v - p + 1, lower.tail = FALSE)))
  print("***Comparing results using HotellingsT2 from ICSNP library - F Test***")
  print(HotellingsT2(db, Y = NULL, mu = mu0, test = "f"))
  print("***Comparing results using HotellingsT2 from ICSNP library - Chi Test***")
  print(HotellingsT2(db, Y = NULL, mu = mu0, test = "chi"))
}
test_mu_sigUnknown(db=data532, mu0=mu532, alpha= 0.05)
## [1] "T2 value: 24.55891" ## [1] "T2 F-adjusted value: 6.36712" ## [1] "F-critial value: 4.34683 with 3 and 7 degrees of freedom" ## [1] "reject H0: TRUE" ## [1] "p-value: 0.0206801514119681" ## [1] "***Comparing results using HotellingsT2 from ICSNP library - F Test***" ## ## Hotelling's one sample T2-test ## ## data: db ## T.2 = 6.3671, df1 = 3, df2 = 7, p-value = 0.02068 ## alternative hypothesis: true location is not equal to c(15,6,2.85) ## ## [1] "***Comparing results using HotellingsT2 from ICSNP library - Chi Test***" ## ## Hotelling's one sample T2-test ## ## data: db ## T.2 = 24.559, df = 3, p-value = 1.909e-05 ## alternative hypothesis: true location is not equal to c(15,6,2.85)
Comparing two mean vectors
Example 5.4.2
Four psychological tests were given to \(32\) men and \(32\) women. The data are recorded in Table \(5.1\) (Beall 1945). The variables are:
\[y_{1}=\text{pictorial inconsistencies}\]
\[y_{2}=\text{paper form board}\]
\[y_{3}=\text{tool recognition}\]
\[y_{4}= \text{vocabulary}\]
We wish to test:
\[H_{0}: \mu_{1}=\mu_{2} \ \text{vs} \ H_{1}: \mu_{1} \neq \mu_{2}\]
We assume that the two samples are independent and that \(\Sigma_{1}=\Sigma_{2}=\Sigma\), say, with \(\Sigma\) unknown. These assumptions are necessary in order for the \(T^{2}\) statistic in (5.9) to have a \(T^2\)-distribution.
- First, we load the required data.
# psychological data of males
y_1m <- c(17, 15, 15, 13, 20, 15, 15, 13, 14, 17, 17, 17, 15, 18, 18, 15, 18, 10, 18, 18, 13, 16, 11, 16, 16, 18, 16, 15, 18, 18, 17, 19)
y_2m <- c(15, 17, 14, 12, 17, 21, 13, 5, 7, 15, 17, 20, 15, 19, 18, 14, 17, 14, 21, 21, 17, 16, 15, 13, 13, 18, 15, 16, 19, 16, 20, 19)
y_3m <- c(32, 24, 29, 10, 26, 26, 26, 22, 30, 30, 26, 28, 29, 32, 31, 26, 33, 19, 30, 34, 30, 16, 25, 26, 23, 34, 28, 29, 32, 33, 21, 30)
y_4m <- c(26, 14, 23, 16, 28, 21, 22, 22, 17, 27, 20, 24, 24, 28, 27, 21, 26, 17, 29, 26, 24, 16, 23, 16, 21, 24, 27, 24, 23, 23, 21, 28)
data542_male <- data.frame( y_1m, y_2m, y_3m, y_4m )
knitr:: kable( data542_male, caption = 'Data Male, Exercise 5.42', booktabs = TRUE )
y_1m | y_2m | y_3m | y_4m |
---|---|---|---|
17 | 15 | 32 | 26 |
15 | 17 | 24 | 14 |
15 | 14 | 29 | 23 |
13 | 12 | 10 | 16 |
20 | 17 | 26 | 28 |
15 | 21 | 26 | 21 |
15 | 13 | 26 | 22 |
13 | 5 | 22 | 22 |
14 | 7 | 30 | 17 |
17 | 15 | 30 | 27 |
17 | 17 | 26 | 20 |
17 | 20 | 28 | 24 |
15 | 15 | 29 | 24 |
18 | 19 | 32 | 28 |
18 | 18 | 31 | 27 |
15 | 14 | 26 | 21 |
18 | 17 | 33 | 26 |
10 | 14 | 19 | 17 |
18 | 21 | 30 | 29 |
18 | 21 | 34 | 26 |
13 | 17 | 30 | 24 |
16 | 16 | 16 | 16 |
11 | 15 | 25 | 23 |
16 | 13 | 26 | 16 |
16 | 13 | 23 | 21 |
18 | 18 | 34 | 24 |
16 | 15 | 28 | 27 |
15 | 16 | 29 | 24 |
18 | 19 | 32 | 23 |
18 | 16 | 33 | 23 |
17 | 20 | 21 | 21 |
19 | 19 | 30 | 28 |
# psychological data of females
y_1f <- c(14, 13, 12, 12, 11, 12, 10, 10, 12, 11, 12, 14, 14, 13, 14, 13, 16, 14, 16, 13, 2, 14, 17, 16, 15, 12, 14, 13, 11, 7, 12, 6)
y_2f <- c(12, 14, 19, 13, 20, 9, 13, 8, 20, 10, 18, 18, 10, 16, 8, 16, 21, 17, 16, 16, 6, 16, 17, 13, 14, 10, 17, 15, 16, 7, 15, 5)
y_3f <- c(14, 12, 21, 10, 16, 14, 18, 13, 19, 11, 25, 13, 25, 8, 13, 23, 26, 14, 15, 23, 16, 22, 22, 16, 20, 12, 24, 18, 18, 19, 7, 6)
y_4f <- c(26, 21, 21, 16, 16, 18, 24, 23, 23, 27, 25, 26, 28, 14, 25, 28, 26, 14, 23, 24, 21, 26, 28, 14, 26, 9, 23, 20, 28, 18, 28, 13)
data542_female <- data.frame( y_1f, y_2f, y_3f, y_4f )
knitr:: kable( data542_female, caption = 'Data Female, Exercise 5.42', booktabs = TRUE )
y_1f | y_2f | y_3f | y_4f |
---|---|---|---|
14 | 12 | 14 | 26 |
13 | 14 | 12 | 21 |
12 | 19 | 21 | 21 |
12 | 13 | 10 | 16 |
11 | 20 | 16 | 16 |
12 | 9 | 14 | 18 |
10 | 13 | 18 | 24 |
10 | 8 | 13 | 23 |
12 | 20 | 19 | 23 |
11 | 10 | 11 | 27 |
12 | 18 | 25 | 25 |
14 | 18 | 13 | 26 |
14 | 10 | 25 | 28 |
13 | 16 | 8 | 14 |
14 | 8 | 13 | 25 |
13 | 16 | 23 | 28 |
16 | 21 | 26 | 26 |
14 | 17 | 14 | 14 |
16 | 16 | 15 | 23 |
13 | 16 | 23 | 24 |
2 | 6 | 16 | 21 |
14 | 16 | 22 | 26 |
17 | 17 | 22 | 28 |
16 | 13 | 16 | 14 |
15 | 14 | 20 | 26 |
12 | 10 | 12 | 9 |
14 | 17 | 24 | 23 |
13 | 15 | 18 | 20 |
11 | 16 | 18 | 28 |
7 | 7 | 19 | 18 |
12 | 15 | 7 | 28 |
6 | 5 | 6 | 13 |
We can also visualize the data
if(! require("plotly")){ install.packages("plotly") library(ICSNP) }
f1 <- plot_ly(data542_male, type = "box") %>%
  add_boxplot(y = ~y_1m) %>%
  add_boxplot(y = ~y_2m) %>%
  add_boxplot(y = ~y_3m) %>%
  add_boxplot(y = ~y_4m) %>%
  layout(title = "Male Psychological Scores By Each Variable",
         xaxis = list(title = "Variables"),
         yaxis = list(title = "Scores"))
f1

f2 <- plot_ly(data542_female, type = "box") %>%
  add_boxplot(y = ~y_1f) %>%
  add_boxplot(y = ~y_2f) %>%
  add_boxplot(y = ~y_3f) %>%
  add_boxplot(y = ~y_4f) %>%
  layout(title = "Female Psychological Scores By Each Variable",
         xaxis = list(title = "Variables"),
         yaxis = list(title = "Scores"))
f2
- We find the mean vectors and covariance matrices of the two samples.
# Determining the mean of each variable of the male population
y_mean_male <- as.matrix(apply(data542_male, 2, FUN = mean))
y_mean_male
dim(y_mean_male)
##          [,1]
## y_1m 15.96875
## y_2m 15.90625
## y_3m 27.18750
## y_4m 22.75000
## [1] 4 1
# Determining the mean of each variable of the female population
y_mean_female <- as.matrix(apply(data542_female, 2, FUN = mean))
y_mean_female
dim(y_mean_female)
##          [,1]
## y_1f 12.34375
## y_2f 13.90625
## y_3f 16.65625
## y_4f 21.93750
## [1] 4 1
# Determining the covariance matrix of the male population
sigma_male <- cov(data542_male)
sigma_male
##          y_1m      y_2m      y_3m      y_4m
## y_1m 5.192540  4.545363  6.522177  5.250000
## y_2m 4.545363 13.184476  6.760081  6.266129
## y_3m 6.522177  6.760081 28.673387 14.467742
## y_4m 5.250000  6.266129 14.467742 16.645161
# Determining the covariance matrix of the female population
sigma_female <- cov(data542_female)
sigma_female
##          y_1f      y_2f      y_3f      y_4f
## y_1f 9.136089  7.549395  4.863911  4.151210
## y_2f 7.549395 18.603831 10.224798  5.445565
## y_3f 4.863911 10.224798 30.039315 13.493952
## y_4f 4.151210  5.445565 13.493952 27.995968
- We find the pooled covariance matrix, an unbiased estimator of the common population covariance matrix \(\Sigma\).
\[S_{pl}= \frac{1}{n_{1}+n_{2}-2}[(n_{1}-1)\textbf{S}_{1}+(n_{2}-1)\textbf{S}_{2}]\]
# function for finding the common (pooled) population covariance matrix
common_population_variance <- function(data1, data2) {
  n_1 <- nrow(data1)
  n_2 <- nrow(data2)
  sigma_1 <- cov(data1)
  sigma_2 <- cov(data2)
  sigma_population <- (1 / (n_1 + n_2 - 2)) * ((n_1 - 1) * sigma_1 + (n_2 - 1) * sigma_2)
  print("The pooled covariance matrix is:")
  print(sigma_population)
}
common_population_variance(data542_female, data542_male)
## [1] "The pooled covariance matrix is:" ## y_1f y_2f y_3f y_4f ## y_1f 7.164315 6.047379 5.693044 4.700605 ## y_2f 6.047379 15.894153 8.492440 5.855847 ## y_3f 5.693044 8.492440 29.356351 13.980847 ## y_4f 4.700605 5.855847 13.980847 22.320565
- Then, to test \(H_{0}: \mu_{1}=\mu_{2}\) vs \(H_{1}: \mu_{1} \neq \mu_{2}\), we find:
\[T^2= \frac{ n_{1}n_{2}}{n_{1}+n_{2}}(\bar{y}_{1}- \bar{y}_2)' \textbf{S}_{pl}^{-1}(\bar{y}_{1}- \bar{y}_2)\]
# function for finding T^2 in a multivariate two-sample setting
T2_two_sample <- function(data1, data2) {
  n_1 <<- nrow(data1)
  n_2 <<- nrow(data2)
  sigma_1 <- cov(data1)
  sigma_2 <- cov(data2)
  y_mean_1 <- as.matrix(apply(data1, 2, FUN = mean))
  y_mean_2 <- as.matrix(apply(data2, 2, FUN = mean))
  sigma_population <- (1 / (n_1 + n_2 - 2)) * ((n_1 - 1) * sigma_1 + (n_2 - 1) * sigma_2)
  sigma_population_inverse <- solve(sigma_population)
  T2 <<- ((n_1 * n_2) / (n_1 + n_2)) * t(y_mean_1 - y_mean_2) %*% sigma_population_inverse %*% (y_mean_1 - y_mean_2)
  print("The pooled covariance matrix is:")
  print(sigma_population)
  print("The value of T2 is:")
  print(T2)
}
T2_two_sample(data542_female, data542_male)
## [1] "The pooled covariance matrix is:" ## y_1f y_2f y_3f y_4f ## y_1f 7.164315 6.047379 5.693044 4.700605 ## y_2f 6.047379 15.894153 8.492440 5.855847 ## y_3f 5.693044 8.492440 29.356351 13.980847 ## y_4f 4.700605 5.855847 13.980847 22.320565 ## [1] "The value of T2 is:" ## [,1] ## [1,] 97.6015
Transforming the \(T^2\)-statistic to an \(F\)-statistic
The \(T^2\)-statistic can be readily transformed to an F-statistic using:
\[ \frac{n_{1}+ n_{2}-p-1}{(n_1+n_2-2)p}\,T^2 \sim F_{p,\; n_1+n_2-p-1} \]
where again the dimension \(p\) of the \(T^2\)-statistic becomes the first degree-of-freedom parameter for the \(F\)-statistic.
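As a quick numerical check with the values from this example (a sketch; the numbers are taken from the output above and below):

# T^2 to F conversion for Example 5.4.2
T2 <- 97.6015
n1 <- 32; n2 <- 32; p <- 4
(n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2  # about 23.22, matching the HotellingsT2 F-test below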
- Finally, we integrate everything mentioned above (plus some extra pieces of the \(T^2\) test) into a function that tests whether the means of the two samples are equal and compares the result with an existing R package.
# Function for the multivariate two-sample T^2-test
hotelling_two_independent_samples <- function(data1, data2, alpha) {
  # Considering that p variables are measured on two samples, this function tests
  # H0: mu_1 = mu_2 vs H1: mu_1 != mu_2. We assume that the two samples are
  # independent and that sigma_1 = sigma_2 = sigma.
  # Input:
  #   + data1: dataset from sample 1
  #   + data2: dataset from sample 2
  #   + alpha: significance level
  # Output:
  #   + sigma_population: pooled covariance matrix
  #   + T2: Hotelling's T2
  #   + F_stat: T2-statistic transformed to an F-statistic
  #   + F-critical value
  #   + Reject H0: TRUE if we reject H0, otherwise FALSE
  #   + P-value
  # package for Hotelling testing
  if (!require("ICSNP")) {
    install.packages("ICSNP")
    library(ICSNP)
  }
  n_1 <<- nrow(data1)
  n_2 <<- nrow(data2)
  p <- ncol(data1)  # p must be defined before it is used below
  sigma_1 <- cov(data1)
  sigma_2 <- cov(data2)
  y_mean_1 <- as.matrix(apply(data1, 2, FUN = mean))
  y_mean_2 <- as.matrix(apply(data2, 2, FUN = mean))
  sigma_population <- (1 / (n_1 + n_2 - 2)) * ((n_1 - 1) * sigma_1 + (n_2 - 1) * sigma_2)
  sigma_population_inverse <- solve(sigma_population)
  T2 <<- ((n_1 * n_2) / (n_1 + n_2)) * t(y_mean_1 - y_mean_2) %*% sigma_population_inverse %*% (y_mean_1 - y_mean_2)
  # F_stat (not F, which would shadow FALSE) holds the transformed statistic
  F_stat <- round(T2 * (n_1 + n_2 - p - 1) / (p * (n_1 + n_2 - 2)), 2)
  print("The pooled covariance matrix is:")
  print(sigma_population)
  print(paste0("The value of T2 is: ", T2))
  print(paste0("T2-statistic transformed to an F-statistic is: ", F_stat))
  print(paste0("F-critical value: ", round(qf(1 - alpha, df1 = p, df2 = n_1 + n_2 - p - 1), 5), " with ", p, " and ", n_1 + n_2 - p - 1, " degrees of freedom"))
  print(paste0("Reject H0: ", F_stat > qf(1 - alpha, df1 = p, df2 = n_1 + n_2 - p - 1)))
  print(paste0("P-value: ", pf(F_stat, df1 = p, df2 = n_1 + n_2 - p - 1, lower.tail = FALSE)))
  print("***Comparing results using HotellingsT2 from ICSNP library - F-TEST***")
  print(HotellingsT2(data1, data2, mu = NULL, test = "f"))
  print("***Comparing results using HotellingsT2 from ICSNP library - CHI-TEST***")
  print(HotellingsT2(data1, data2, mu = NULL, test = "chi"))
}
- The solution to problem \(5.42\) is:
hotelling_two_independent_samples(data542_female, data542_male, 0.05)
## [1] "The pooled covariance matrix is:" ## y_1f y_2f y_3f y_4f ## y_1f 7.164315 6.047379 5.693044 4.700605 ## y_2f 6.047379 15.894153 8.492440 5.855847 ## y_3f 5.693044 8.492440 29.356351 13.980847 ## y_4f 4.700605 5.855847 13.980847 22.320565 ## [1] "The value of T2 is: 97.6014965277689" ## [1] "T2-statistic transformed to an F-statistic is: 31.48" ## [1] "F-critical value: 2.52791 with 4 and 59 degrees of freedom " ## [1] "Reject H0: TRUE" ## [1] "P-value: 0.999999999999951" ## [1] "***Comparing results using HotellingsT2 from ICSNP library - F-TEST***" ## ## Hotelling's two sample T2-test ## ## data: data1 and data2 ## T.2 = 23.22, df1 = 4, df2 = 59, p-value = 1.464e-11 ## alternative hypothesis: true location difference is not equal to c(0,0,0,0) ## ## [1] "***Comparing results using HotellingsT2 from ICSNP library - CHI-TEST***" ## ## Hotelling's two sample T2-test ## ## data: data1 and data2 ## T.2 = 97.601, df = 4, p-value < 2.2e-16 ## alternative hypothesis: true location difference is not equal to c(0,0,0,0)
Tests on individual variables conditional on rejection of \(H_{0}\) by the \(T^2\)-test.
If the hypothesis \(H_{0}: \mu_{1} = \mu_{2}\) is rejected, the implication is that \(\mu_{1j} \neq \mu_{2j}\) for at least one \(j=1,2,\dots,p\). But there is no guarantee that \(H_{0}: \mu_{1j} = \mu_{2j}\) will be rejected for some \(j\) by a univariate test. However, if we consider a linear combination of the variables, \(z = a'y\), then there is at least one coefficient vector \(a\) for which a univariate test will reject the corresponding hypothesis \(H_{0}: a'\mu_{1}=a'\mu_{2}\). In fact, if \(H_{0}: \mu_{1} = \mu_{2}\) is rejected by \(T^2\), then \(z = a'y\) with \(a=S_{pl}^{-1}(\bar{y}_{1}-\bar{y}_{2})\) will lead to rejection of \(H_{0}: a'\mu_{1}=a'\mu_{2}\).
We can then examine each coefficient \(a_{j}\) for an indication of the contribution of the corresponding \(y_{j}\) to the rejection of \(H_{0}\).
Example 5.5
Returning to the psychological data of Table \(5.1\), we already obtained \(\bar{y}_{1}- \bar{y}_{2}\) and \(\textbf{S}_{pl}\). The discriminant function coefficient vector is obtained as:
\[a=\textbf{S}_{pl}^{-1}(\bar{y}_{1}- \bar{y}_{2})\]
discriminant_function <- function(data1, data2) {
  # This function determines the discriminant function coefficient vector.
  # Input:
  #   + data1: dataset from sample 1
  #   + data2: dataset from sample 2
  # Output:
  #   + a: discriminant function coefficient vector
  n_1 <<- nrow(data1)
  n_2 <<- nrow(data2)
  sigma_1 <- cov(data1)
  sigma_2 <- cov(data2)
  y_mean_1 <- as.matrix(apply(data1, 2, FUN = mean))
  y_mean_2 <- as.matrix(apply(data2, 2, FUN = mean))
  sigma_population <- (1 / (n_1 + n_2 - 2)) * ((n_1 - 1) * sigma_1 + (n_2 - 1) * sigma_2)
  sigma_population_inverse <- solve(sigma_population)
  a <- sigma_population_inverse %*% (y_mean_1 - y_mean_2)
  print("The discriminant coeficient vector is: ")
  print(round(a, 4))
}
discriminant_function(data542_male, data542_female)
## [1] "The discriminant coeficient vector is: " ## [,1] ## y_1m 0.5104 ## y_2m -0.2033 ## y_3m 0.4660 ## y_4m -0.3097
Thus the linear combination that best separates the two groups is:
\[a'y= 0.5104y_{1} - 0.2033y_{2} + 0.4660y_{3} - 0.3097y_{4}\]
Computation of \(T^2\)
Obtaining \(T^2\) from a MANOVA Program
One-way MANOVA involves a comparison of mean vectors from several samples. Typically, the number of samples is three or more, but the procedure will also accommodate two samples. The two-sample \(T^2\) test is thus a special case of MANOVA.
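For two groups the link between the MANOVA output and \(T^2\) is direct; a standard identity (stated here for reference, with \(\Lambda\) denoting Wilks' lambda from the MANOVA and \(\nu = n_1+n_2-2\)) is:

\[T^{2}=(n_1+n_2-2)\,\frac{1-\Lambda}{\Lambda},\]

so \(T^2\) can be recovered from a two-group MANOVA run.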
Obtaining \(T^2\) from Multiple Regression
Example 5.6.2
We illustrate the regression approach to computation of \(T^2\) using the psychological data in Table \(5.1.\)
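A minimal sketch of that regression approach (assuming the standard identity \(T^2=(n_1+n_2-2)\,R^2/(1-R^2)\), where \(R^2\) comes from regressing a group-indicator variable on the \(p\) response variables; the variable names below follow the earlier chunks):

# regression approach to T^2: regress a group code w on the responses
w <- c(rep(1, nrow(data542_female)), rep(0, nrow(data542_male)))  # any two distinct codes work
ys <- rbind(as.matrix(data542_female[, 1:4]), as.matrix(data542_male[, 1:4]))
R2 <- summary(lm(w ~ ys))$r.squared
(nrow(ys) - 2) * R2 / (1 - R2)  # should reproduce T^2 = 97.6015 from Example 5.4.2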
Paired Observation Test
Univariate Case
Suppose two samples are not independent because there exists a natural pairing between the \(i\)th observation \(y_{i}\) in the first sample and the \(i\)th observation \(x_{i}\) in the second sample for all \(i\); for example, when a treatment is applied twice to the same individual or when subjects are matched according to some criterion, such as IQ or family background. With such pairing, the samples are often referred to as paired observations or matched pairs.
The two samples thus obtained are correlated, and the two-sample test statistic is not appropriate because the samples must be independent in order for it to have a t-distribution.
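With differences \(d_i = y_i - x_i\), the familiar univariate paired t-test (stated here for completeness) is:

\[ \bar{d}=\frac{1}{n}\sum_{i=1}^{n}d_i,\qquad s_d^2=\frac{1}{n-1}\sum_{i=1}^{n}(d_i-\bar{d})^2,\qquad t=\frac{\bar{d}}{s_d/\sqrt{n}}, \]

which is compared with \(t_{\alpha/2,\,n-1}\).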
Multivariate Case

We assume \(y\) and \(x\) are correlated and have a multivariate normal distribution.
To test \(H_0: \mu_d=0\), which is equivalent to \(H_0: \mu_y=\mu_x\) since \(\mu_d=E(y-x)=\mu_y-\mu_x\), we find: \[\bar{d}=\frac{1}{n}\sum_{i=1}^{n}d_i\]
\[S_d=\frac{1}{n-1}\sum_1^n(d_i-\bar{d})(d_i-\bar{d})'\]
We then have:
\[T^2=\bar{d}'\left(\frac{S_d}{n}\right)^{-1}\bar{d}=n\,\bar{d}'S_d^{-1}\bar{d},\] which is distributed as \(T^2_{p,\,n-1}\) when \(H_0\) is true.
Example 5.7.2
To compare two types of coating for resistance to corrosion, 15 pieces of pipe were coated with each type of coating. Two pipes, one with each type of coating, were buried together and left for the same length of time at \(15\) different locations, providing a natural pairing of the observations. Corrosion for the first type of coating was measured by two variables:
\(y_1\) = maximum depth of pit in thousandths of an inch. \(y_2\) = number of pits.
with \(x_1\) and \(x_2\) defined analogously for the second coating. The data and differences are given in Table \(5.3\). Thus we have, for example, \(y_1'=(73, 31)\).
coat1_depth <- c(73, 43, 47, 53, 58, 47, 52, 38, 61, 56, 56, 34, 55, 65, 75)
coat1_numb  <- c(31, 19, 22, 26, 36, 30, 29, 36, 34, 33, 19, 19, 26, 15, 18)
coat2_depth <- c(51, 41, 43, 41, 47, 32, 24, 43, 53, 52, 57, 44, 57, 40, 68)
coat2_numb  <- c(35, 14, 19, 29, 34, 26, 19, 37, 24, 27, 14, 19, 30, 7, 13)
diff_depth <- coat1_depth - coat2_depth
diff_numb  <- coat1_numb - coat2_numb
data572  <- data.frame(coat1_depth, coat1_numb, coat2_depth, coat2_numb, diff_depth, diff_numb)
data572a <- tibble(coat1_depth, coat1_numb, coat2_depth, coat2_numb, diff_depth, diff_numb)
knitr:: kable( data572a, caption = 'Data Exercise 5.72', booktabs = TRUE )
coat1_depth | coat1_numb | coat2_depth | coat2_numb | diff_depth | diff_numb |
---|---|---|---|---|---|
73 | 31 | 51 | 35 | 22 | -4 |
43 | 19 | 41 | 14 | 2 | 5 |
47 | 22 | 43 | 19 | 4 | 3 |
53 | 26 | 41 | 29 | 12 | -3 |
58 | 36 | 47 | 34 | 11 | 2 |
47 | 30 | 32 | 26 | 15 | 4 |
52 | 29 | 24 | 19 | 28 | 10 |
38 | 36 | 43 | 37 | -5 | -1 |
61 | 34 | 53 | 24 | 8 | 10 |
56 | 33 | 52 | 27 | 4 | 6 |
56 | 19 | 57 | 14 | -1 | 5 |
34 | 19 | 44 | 19 | -10 | 0 |
55 | 26 | 57 | 30 | -2 | -4 |
65 | 15 | 40 | 7 | 25 | 8 |
75 | 18 | 68 | 13 | 7 | 5 |
Calculating the difference averages
# difference vector
d_bar_depth <- mean(data572$diff_depth)
print("depth average, d_bar_depth")
d_bar_depth
## [1] "depth average, d_bar_depth"
## [1] 8
d_bar_numb <- as.matrix(mean(data572$diff_numb))
print("depth average, d_bar_numb")
d_bar_numb
## [1] "depth average, d_bar_numb"
## [,1] ## [1,] 3.066667
## [1] "d_bar vector"
d_bar_vector <- c(d_bar_depth, d_bar_numb) d_bar_vector <- (as.matrix(d_bar_vector)) d_bar_vector
## [,1] ## [1,] 8.000000 ## [2,] 3.066667
## [,1] ## [1,] 8.000000 ## [2,] 3.066667
Function for calculating matrix \(S_{d}\)
calc_Sd <- function(data, y1 = 1, y2 = 2, x1 = 3, x2 = 4){
  # (see Table 5.3 for clarity)
  # data: dataframe
  # y1: column index of y1 (1, pit depth for coating 1)
  # y2: column index of y2 (2, number of pits for coating 1)
  # x1: column index of x1 (3, pit depth for coating 2)
  # x2: column index of x2 (4, number of pits for coating 2)
  d_bar_y1x1 <- mean((data[, y1]) - (data[, x1]))
  d_bar_y2x2 <- mean((data[, y2]) - (data[, x2]))
  d_bar_vector <- as.matrix(c(d_bar_y1x1, d_bar_y2x2))
  n <- nrow(data)  # number of rows
  d <- matrix(data = 0, nrow = n, 2)  # 2 columns because we have depth and number of pits
  sum_diffs <- 0  # sum of the matrices (d - dbar)(d - dbar)'
  for (i in seq(nrow(data))){
    y <- as.matrix(c(data[i, y1], data[i, y2]))
    x <- as.matrix(c(data[i, x1], data[i, x2]))
    diff_xy <- (y - x)
    diff_xy_d <- diff_xy - d_bar_vector
    aux <- diff_xy_d %*% t(diff_xy_d)
    sum_diffs <- sum_diffs + aux
  }
  Sd <- 1 / (n - 1) * sum_diffs
}
# testing the function to get Sd for the exercise
Sd <- calc_Sd(data572, y1 = 1, y2 = 2, x1 = 3, x2 = 4)
print("* matrix Sd")
Sd
## [1] "* matrix Sd"
##           [,1]     [,2]
## [1,] 121.57143 17.07143
## [2,]  17.07143 21.78095
Compute \(T^{2}\) using (5.24)
cal_T2_test <- function(data, y1 = 1, y2 = 2, x1 = 3, x2 = 4){
  # Calculate T2 of equation 5.24
  # data: dataframe
  # y1: column index of y1 (1, pit depth for coating 1)
  # y2: column index of y2 (2, number of pits for coating 1)
  # x1: column index of x1 (3, pit depth for coating 2)
  # x2: column index of x2 (4, number of pits for coating 2)
  d_bar_y1x1 <- mean((data[, y1]) - (data[, x1]))
  d_bar_y2x2 <- mean((data[, y2]) - (data[, x2]))
  # d_bar_vector: vector of mean differences
  d_bar_vector <- as.matrix(c(d_bar_y1x1, d_bar_y2x2))
  print(paste("d_bar_y1x1: ", d_bar_y1x1))
  print(paste("d_bar_y2x2: ", d_bar_y2x2))
  # n: number of observations
  n <- nrow(data)  # number of rows
  d <- matrix(data = 0, nrow = n, 2)  # 2 columns because we have depth and number of pits
  sum_diffs <- 0  # sum of the matrices (d - dbar)(d - dbar)'
  for (i in seq(nrow(data))){
    y <- as.matrix(c(data[i, y1], data[i, y2]))
    x <- as.matrix(c(data[i, x1], data[i, x2]))
    diff_xy <- (y - x)
    diff_xy_d <- diff_xy - d_bar_vector
    aux <- diff_xy_d %*% t(diff_xy_d)
    sum_diffs <- sum_diffs + aux
  }
  # Sd: covariance matrix of the differences
  Sd <- 1 / (n - 1) * sum_diffs
  print("* matrix Sd")
  print(Sd)
  T2 <- t(d_bar_vector) %*% solve(Sd / n) %*% d_bar_vector
  print(paste("T2 statistic: ", T2, "with p=2 and n-1: ", n - 1))
  print("see: table A7. If T2 > for a given alpha, reject H0 ")
}
T2 <- cal_T2_test(data572, y1= 1, y2= 2, x1= 3, x2= 4)
## [1] "d_bar_y1x1: 8" ## [1] "d_bar_y2x2: 3.06666666666667" ## [1] "* matrix Sd" ## [,1] [,2] ## [1,] 121.57143 17.07143 ## [2,] 17.07143 21.78095 ## [1] "T2 statistic: 10.8188985401983 with p=2 and n-1: 14" ## [1] "see: table A7. If T2 > for a given alpha, reject H0 "
For this case, Table A.7 gives the critical value \(T^2_{.05,\,2,\,14} = 8.197\). Since \(T^2 = 10.819 > 8.197\), we reject \(H_0\) and conclude that the two coatings differ in their effect on corrosion.
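The same critical value can be obtained without Table A.7, using the usual \(T^2\)-to-\(F\) relation \(T^2_{\alpha,p,\nu}=\dfrac{\nu p}{\nu-p+1}F_{\alpha,p,\nu-p+1}\) (a quick sketch; p and nu here are scratch values for this example only):

# critical value T^2_{.05, 2, 14} from the F distribution
p <- 2; nu <- 14; alpha <- 0.05
nu * p / (nu - p + 1) * qf(1 - alpha, df1 = p, df2 = nu - p + 1)  # about 8.197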
Test for Additional Information
In this section we again consider two independent samples, as in Section \(5.4\). We start with a basic \(p \times 1\) vector \(y\) of measurements on each sampling unit and ask whether a \(q \times 1\) subvector \(x\), measured in addition to \(y\) (on the same unit), will significantly increase the separation of the two samples as shown by \(T^2\).
Alternatively, we may want to determine whether some of the variables we already have are redundant, in the presence of the other variables, for separating the groups.
In short, we wish to test the hypothesis that \(x_{1}\) and \(x_{2}\) are redundant for separating the two groups: that is, that the extra \(q\) variables do not contribute anything significant beyond the information already available in \(y_{1}\) and \(y_{2}\) for separating the groups.
Notice! We are not asking whether the \(x\)'s can significantly separate the two groups by themselves, but whether they provide additional separation beyond the separation already achieved by the \(y\)'s.
\(H_{0}\): redundancy of x.
We reject the hypothesis of redundancy if \(F \ge F_{\alpha, q, \nu-p-q+1}\), as developed below.
If the \(x\)'s were independent of the \(y\)'s, we would have \(T^2_{p+q}=T^2_p+T^2_q\), but this does not hold because they are correlated. Therefore, we can only ask whether the increase from \(T^2_p\) to \(T^2_{p+q}\) is significant.
The \(T^2\)-statistic based on the full set of \(p+q\) variables is given by:
\[ T^2_{p+q}=\frac{n_1n_2}{n_1+n_2} \Big(\begin{bmatrix} \bar{y}_1\\ \bar{x}_1 \end{bmatrix} - \begin{bmatrix} \bar{y}_2\\ \bar{x}_2 \end{bmatrix}\Big)'\textbf{S}_{pl}^{-1}\Big(\begin{bmatrix} \bar{y}_1\\ \bar{x}_1 \end{bmatrix} - \begin{bmatrix} \bar{y}_2\\ \bar{x}_2 \end{bmatrix}\Big) \]
The \(T^2\)-statistic for the reduced set of \(p\) variables is:
\[T^2_p=\frac{n_1n_2}{n_1+n_2}(\bar{y}_1- \bar{y}_2)'\textbf{S}_{yy}^{-1}(\bar{y}_1- \bar{y}_2)\]
Then the test statistic for the significance of the increase from \(T^2_p\) to \(T^2_{p+q}\) is given by:
\[ T^2(x|y)= (\nu-p)\frac{T^2_{p+q}-T^2_p}{\nu+T^2_p} \]
which is distributed as \(T^2_{q, \nu-p}\). We reject the hypothesis of redundancy of \(x\) if \(T^2(x|y) \ge T^2_{\alpha, q,\nu-p}\).
\(T^2(x|y)\) can also be converted to an \(F\)-statistic:
\[ F=\frac{\nu-p-q+1}{q}\frac{T^2_{p+q}-T^2_p}{\nu+ T^2_p} \]
which is distributed as \(F_{q, \nu-p-q+1}\); we reject the hypothesis of redundancy if \(F \ge F_{\alpha, q, \nu-p-q+1}\).
Example 5.8
We use the psychological data of Table \(5.1\) to illustrate tests on subvectors. We begin by testing the significance of \(y_{3}\) and \(y_{4}\) above and beyond \(y_{1}\) and \(y_{2}\) (in the notation of the present section, \(y_{3}\) and \(y_{4}\) become \(x_{1}\) and \(x_{2}\)).

\(H_{0}\): \(y_3\) and \(y_4\) are redundant in the presence of \(y_1\) and \(y_2\).
For these subvectors, \(p = 2\) and \(q = 2\). The value of \(T^2_{p+q}\) for all four variables as given by (5.27) was obtained in Example \(5.4.2\) as \(97.6015\).
Thus, for \(y_{1}\) and \(y_{2}\) we obtain, by (5.28), the \(T^2\)-statistic for the reduced set of \(p\) variables:
\[ T_{p}^{2}= \frac{n_{1}n_{2}}{n_{1}+n_{2}}(\bar{y_1}-\bar{y_2})'S_{yy}^{-1}(\bar{y_{1}}-\bar{y_{2}}) \]
# multiplier n1 * n2 / (n1 + n2) that appears in the T^2 formulas (also computed inside the functions below)
n1n2_factor <- (nrow(data542_male) * nrow(data542_female)) / (nrow(data542_male) + nrow(data542_female))
First, we get the reduced data samples.
## filter variables of interest from each dataset
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library("tidyverse")
}
data542_female_reduced <- data542_female %>% select(y_1f, y_2f)  # variables of data1 we are interested in
data542_male_reduced   <- data542_male %>% select(y_1m, y_2m)    # variables of data2 we are interested in
Second, with the reduced version of the data samples, we have enough elements to find the \(T^2\)-statistic for the reduced set of \(p\) variables:
# function for finding the T^2-statistic for the reduced set of p variables
T2_two_sample_reduced <- function(data1_reduced, data2_reduced) {
  # Function to obtain T2 for the reduced set of p variables.
  n_1 <<- nrow(data1_reduced)
  n_2 <<- nrow(data2_reduced)
  sigma_1 <- cov(data1_reduced)
  sigma_2 <- cov(data2_reduced)
  y_mean_1 <- as.matrix(apply(data1_reduced, 2, FUN = mean))
  y_mean_2 <- as.matrix(apply(data2_reduced, 2, FUN = mean))
  sigma_population_reduced <- (1 / (n_1 + n_2 - 2)) * ((n_1 - 1) * sigma_1 + (n_2 - 1) * sigma_2)
  p <- ncol(data1_reduced)
  sigma_population_reduced_inverse <- solve(sigma_population_reduced)
  T2_reduced <<- ((n_1 * n_2) / (n_1 + n_2)) * t(y_mean_1 - y_mean_2) %*% sigma_population_reduced_inverse %*% (y_mean_1 - y_mean_2)
  print(paste0("The pooled covariance matrix of the reduced set of ", p, " variables is:"))
  print(sigma_population_reduced)
  print(paste0("The value of T2 for the reduced set of ", p, " variables is:"))
  print(round(T2_reduced, 4))
}
T2_two_sample_reduced(data542_female_reduced, data542_male_reduced)
## [1] "The pooled covariance matrix of the reduced set of 2 variables is:" ## y_1f y_2f ## y_1f 7.164315 6.047379 ## y_2f 6.047379 15.894153 ## [1] "The value of T2 for the reduced set of 2 variables is:" ## [,1] ## [1,] 31.0126
Third, we are ready to test the redundancy of the extra variables, that is, the significance of the increase from \(T_p^2\) to \(T_{p+q}^2\). The null hypothesis is redundancy of \(x\); we reject \(H_0\) if \(T^2(x|y) \ge T^2_{\alpha, q, \nu-p}\).
We can also convert \(T^2(x|y)\) to an F-statistic:
\[F=\frac{\nu-p-q+1}{q}\,\frac{T^2_{p+q}-T^2_p}{\nu+T^2_p}\]
Then, we reject \(H_{0}\) if \(F>=F_{\alpha, q, \nu-p-q+1}\).
# Function to test redundancy of variables, i.e., the significance of the increase from Tp^2 to Tp+q^2
test_redundancy_additional_information <- function(data1, data2, data1_reduced, data2_reduced, alpha) {
  # This function tests the statistical significance of the increase from T_p^2 to T_{p+q}^2.
  # H0: redundancy of x.
  # We reject H0 if T^2(x|y) >= T^2_{alpha, q, nu-p}, or, after converting T^2(x|y) to an
  # F-statistic, if F >= F_{alpha, q, nu-p-q+1}.
  # Input:
  #   + data1: dataset from sample 1
  #   + data2: dataset from sample 2
  #   + data1_reduced: dataset from sample 1 with the selected variables of interest
  #   + data2_reduced: dataset from sample 2 with the selected variables of interest
  #   + alpha: significance level
  # package for Hotelling testing
  if (!require("ICSNP")) {
    install.packages("ICSNP")
    library(ICSNP)
  }
  # full set of p + q variables
  n_1 <<- nrow(data1)
  n_2 <<- nrow(data2)
  sigma_1 <- cov(data1)
  sigma_2 <- cov(data2)
  y_mean_1 <- as.matrix(apply(data1, 2, FUN = mean))
  y_mean_2 <- as.matrix(apply(data2, 2, FUN = mean))
  sigma_population <- (1 / (n_1 + n_2 - 2)) * ((n_1 - 1) * sigma_1 + (n_2 - 1) * sigma_2)
  sigma_population_inverse <- solve(sigma_population)
  T2 <<- ((n_1 * n_2) / (n_1 + n_2)) * t(y_mean_1 - y_mean_2) %*% sigma_population_inverse %*% (y_mean_1 - y_mean_2)
  # reduced set of p variables
  n_1 <<- nrow(data1_reduced)
  n_2 <<- nrow(data2_reduced)
  sigma_1 <- cov(data1_reduced)
  sigma_2 <- cov(data2_reduced)
  y_mean_1 <- as.matrix(apply(data1_reduced, 2, FUN = mean))
  y_mean_2 <- as.matrix(apply(data2_reduced, 2, FUN = mean))
  sigma_population_reduced <- (1 / (n_1 + n_2 - 2)) * ((n_1 - 1) * sigma_1 + (n_2 - 1) * sigma_2)
  p <- ncol(data1_reduced)
  q <- ncol(data1) - ncol(data1_reduced)
  sigma_population_reduced_inverse <- solve(sigma_population_reduced)
  T2_reduced <<- ((n_1 * n_2) / (n_1 + n_2)) * t(y_mean_1 - y_mean_2) %*% sigma_population_reduced_inverse %*% (y_mean_1 - y_mean_2)
  niu <- n_1 + n_2 - 2
  T2_significance_increase <<- (niu - p) * (T2 - T2_reduced) / (niu + T2_reduced)
  # F_stat (not F, which would shadow FALSE) holds the F-transformed statistic
  F_stat <- ((niu - p - q + 1) / q) * (T2 - T2_reduced) / (niu + T2_reduced)
  df1 <- q
  df2 <- niu - p - q + 1
  print("The pooled covariance matrix is:")
  print(round(sigma_population, 4))
  print(paste0("The value of T2 is: ", round(T2, 4)))
  print(paste0("The pooled covariance matrix of the reduced set of ", p, " variables is:"))
  print(round(sigma_population_reduced, 4))
  print(paste0("The value of T2 for the reduced set of ", p, " variables is:"))
  print(round(T2_reduced, 4))
  print(paste0("The results of the test statistic for the significance of the increase from ", round(T2_reduced, 4), " to ", round(T2, 4), " is "))
  print(round(T2_significance_increase, 4))
  print(paste0("T2 test statistic transformed to an F-statistic is: ", round(F_stat, 4)))
  print(paste0("F-critical value: ", round(qf(1 - alpha, df1, df2), 5), " with ", df1, " and ", df2, " degrees of freedom"))
  print(paste0("Reject H0: ", F_stat > qf(1 - alpha, df1, df2)))
  print(paste0("P-value: ", pf(F_stat, df1, df2, lower.tail = FALSE)))
}
test_redundancy_additional_information(data542_female, data542_male, data542_female_reduced, data542_male_reduced, 0.05)
## [1] "The pooled covariance matrix is:" ## y_1f y_2f y_3f y_4f ## y_1f 7.1643 6.0474 5.6930 4.7006 ## y_2f 6.0474 15.8942 8.4924 5.8558 ## y_3f 5.6930 8.4924 29.3564 13.9808 ## y_4f 4.7006 5.8558 13.9808 22.3206 ## [1] "The value of T2 is: 97.6015" ## [1] "The pooled covariance matrix of the reduced set of 2 variables is:" ## y_1f y_2f ## y_1f 7.1643 6.0474 ## y_2f 6.0474 15.8942 ## [1] "The value of T2 for the reduced set of 2 variables is:" ## [,1] ## [1,] 31.0126 ## [1] "The results of the test statistic for the significance of the increase from 31.0126 to 97.6015 is " ## [,1] ## [1,] 42.9548 ## [1] "T2 test statistic transformed to an F-statistic is: 21.1194" ## [1] "F-critical value: 3.15312 with 2 and 59 degrees of freedom " ## [1] "Reject H0: TRUE" ## [1] "P-value: 0.999999879110234"
Therefore we conclude that \(x=(y_3, y_4)'\) adds a significant amount of separation to \(y=(y_1,y_2)'\).
If we want to test the effect of each variable adjusted for the other variables, we just need to redefine the reduced data sets that are pooled. Notice that when we test the effect of adding a single \(x\) (\(q=1\)), we have \(t^2_{\alpha/2,\, \nu -p}= F_{\alpha, 1, \nu-p}\).
For illustration purposes, we will only evaluate the effect of adding \(y_1\) to the other three variables.
## filter variables of interest from each dataset: we keep every variable but y_1
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library("tidyverse")
}
data542_female_reduced_y1 <- data542_female %>% select(y_2f, y_3f, y_4f)  # variables of data1 we are interested in
data542_male_reduced_y1   <- data542_male %>% select(y_2m, y_3m, y_4m)    # variables of data2 we are interested in
test_redundancy_additional_information(data542_female, data542_male, data542_female_reduced_y1, data542_male_reduced_y1, 0.05)
## [1] "The pooled covariance matrix is:" ## y_1f y_2f y_3f y_4f ## y_1f 7.1643 6.0474 5.6930 4.7006 ## y_2f 6.0474 15.8942 8.4924 5.8558 ## y_3f 5.6930 8.4924 29.3564 13.9808 ## y_4f 4.7006 5.8558 13.9808 22.3206 ## [1] "The value of T2 is: 97.6015" ## [1] "The pooled covariance matrix of the reduced set of 3 variables is:" ## y_2f y_3f y_4f ## y_2f 15.8942 8.4924 5.8558 ## y_3f 8.4924 29.3564 13.9808 ## y_4f 5.8558 13.9808 22.3206 ## [1] "The value of T2 for the reduced set of 3 variables is:" ## [,1] ## [1,] 78.8733 ## [1] "The results of the test statistic for the significance of the increase from 78.8733 to 97.6015 is " ## [,1] ## [1,] 7.8437 ## [1] "T2 test statistic transformed to an F-statistic is: 7.8437" ## [1] "F-critical value: 4.00398 with 1 and 59 degrees of freedom " ## [1] "Reject H0: TRUE" ## [1] "P-value: 0.993115461108083"
Profile Analysis
When multivariate data are the result of repeated measures, you might want to see how the mean is changing over time. As an example, you might have data on blood pressure taken once per month for the same set of patients.
Profile analysis can also be used when variables are not ordered in time (or any other way), but the values are on a similar scale, such as in psychological testing.
Technical Note: A profile plot places the subscript of the variable on the x-axis and the means \(\mu_{1}, \dots, \mu_{p}\) on the y-axis. The pattern obtained by plotting \(\mu_{1}, \mu_{2}, \dots, \mu_{p}\) as ordinates and connecting the points is called a profile. Profile analysis is an analysis of the profile or a comparison of two or more profiles.
In a one-sample setting, the null hypothesis that all means are equal, \(H_{0} : \mu_1 = \cdots = \mu_p\), can be interpreted graphically as the hypothesis that the profile plot is flat. The alternative hypothesis is \(H_{1} : \mu_{i}\neq \mu_j\) for some \(i,j\).
Note that these hypotheses are the same as for ANOVA, but the analysis will be different because the columns are not assumed to be independent. Here we have a single population but are sampling correlated variables from this population.
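For reference (a standard result, not derived here), this one-sample flatness hypothesis can be written as \(H_0: \mathbf{C}\mu = 0\), with \(\mathbf{C}\) a contrast matrix of adjacent differences as defined below, and tested with

\[T^2 = n\,(\mathbf{C}\bar{y})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{y}),\]

which is distributed as \(T^2_{p-1,\,n-1}\) when \(H_0\) is true.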
Parallelism
This is usually the main test of interest in profile analysis. It asks whether each segment of the profile (the change from one variable to the next) is the same across all groups.
For two samples, the null hypothesis that the two profiles have the same slopes (are parallel) can be tested by checking whether the differences between adjacent means are equal for the two groups. This can be written as:
\[H_{01}: \mu_{1,j}-\mu_{1,j-1}= \mu_{2,j}-\mu_{2,j-1}\] for \(j=2, \dots, p\). In matrix notation, this is:
\[H_{01}:\begin{pmatrix} \mu_{12}-\mu_{11} \\ \mu_{13}-\mu_{12} \\ \vdots\\ \mu_{1p}-\mu_{1, p-1} \end{pmatrix}=\begin{pmatrix} \mu_{22}-\mu_{21} \\ \mu_{23}-\mu_{22} \\ \vdots\\ \mu_{2p}-\mu_{2, p-1} \end{pmatrix}\]
which can be written as \(H_{01}:\mathbf{C\mu_1}=\mathbf{C\mu_2}\), using the contrast matrix:
\[\mathbf{C}= \begin{pmatrix} -1 & 1 & 0 & \dots & 0 & 0\\ 0 & -1 & 1 & \dots & 0 & 0 \\ \vdots& \vdots&\vdots & &\vdots & \vdots\\ 0&0&0 & \dots & -1 & 1 \end{pmatrix}\]
The test statistic is: \[T^2=\frac{n_1 n_2}{n_1+n_2}\left[\mathbf{C}(\bar{y}_1- \bar{y}_2)\right]'\left(\mathbf{C}\textbf{S}_{pl}\mathbf{C}'\right)^{-1}\mathbf{C}(\bar{y}_1- \bar{y}_2),\] which is distributed as \(T^2_{p-1,\,n_1+n_2-2}\) when \(H_{01}\) is true.
The biggest discrepancy in the slopes can be found by looking at the discriminant function, for which we use:
\[ \textbf{a}=(\textbf{C}\textbf{S}_{pl}\textbf{C}')^{-1}\textbf{C}(\bar{y}_1- \bar{y}_2) \] If the largest component of \(\textbf{a}\) is \(a_i\), then the \(i\)th segment (slope) shows the greatest departure from parallelism.
Same overall level
A second question is whether two samples have the same overall level, ignoring possible interaction.
This hypothesis can be expressed as:
\[H_{0}: \mu_{11}+\dots + \mu_{1p}=\mu_{21}+ \cdots +\mu_{2p}\] or:
\[H_{0}: j'(\mu_1-\mu_2)=0\]
In this case, the test statistic is:
\[t=\frac{j'(\bar{y}_{1}-\bar{y}_2)}{\sqrt{j'S_{pl}j(1/n_1+1/n_2)}}\] which has a t-distribution with \(n_1 +n_2 -2\) degrees of freedom.
Flatness
A third hypothesis is that the two profiles are both flat. Because a lack of parallelism automatically implies that at least one profile is not flat, the hypothesis that both profiles are flat is mainly of interest when the hypothesis of parallelism cannot be rejected.
Example 5.9.2
We use the psychological data in Table \(5.1\) to illustrate two-sample profile analysis. The values of \(\bar{y}_{1}\), \(\bar{y}_{2}\), and \(\textbf{S}_{pl}\) are given in Example \(5.4.2\). The profiles of the two mean vectors \(\bar{y}_{1}\) and \(\bar{y}_{2}\) are plotted in Figure \(5.8\); there appears to be a lack of parallelism.
To test for parallelism, \(H_{01}: \textbf{C}\mu_{1}= \textbf{C}\mu_{2}\), we use the contrast matrix \[\mathbf{C}=\begin{pmatrix} -1 & 1 & 0 & 0\\ 0 & -1 & 1 & 0\\ 0 & 0 & -1 & 1 \end{pmatrix}.\]
- First: data preparation, joining together both samples
library(dplyr)
data542_female <- data542_female %>%
  select(y1 = y_1f, y2 = y_2f, y3 = y_3f, y4 = y_4f) %>%
  mutate(gender = "female")
data542_male <- data542_male %>%
  select(y1 = y_1m, y2 = y_2m, y3 = y_3m, y4 = y_4m) %>%
  mutate(gender = "male")
data542 <- rbind(data542_female, data542_male)
knitr:: kable( data542, caption = 'Data Combine Exercise 5.42', booktabs = TRUE )
y1 | y2 | y3 | y4 | gender |
---|---|---|---|---|
14 | 12 | 14 | 26 | female |
13 | 14 | 12 | 21 | female |
12 | 19 | 21 | 21 | female |
12 | 13 | 10 | 16 | female |
11 | 20 | 16 | 16 | female |
12 | 9 | 14 | 18 | female |
10 | 13 | 18 | 24 | female |
10 | 8 | 13 | 23 | female |
12 | 20 | 19 | 23 | female |
11 | 10 | 11 | 27 | female |
12 | 18 | 25 | 25 | female |
14 | 18 | 13 | 26 | female |
14 | 10 | 25 | 28 | female |
13 | 16 | 8 | 14 | female |
14 | 8 | 13 | 25 | female |
13 | 16 | 23 | 28 | female |
16 | 21 | 26 | 26 | female |
14 | 17 | 14 | 14 | female |
16 | 16 | 15 | 23 | female |
13 | 16 | 23 | 24 | female |
2 | 6 | 16 | 21 | female |
14 | 16 | 22 | 26 | female |
17 | 17 | 22 | 28 | female |
16 | 13 | 16 | 14 | female |
15 | 14 | 20 | 26 | female |
12 | 10 | 12 | 9 | female |
14 | 17 | 24 | 23 | female |
13 | 15 | 18 | 20 | female |
11 | 16 | 18 | 28 | female |
7 | 7 | 19 | 18 | female |
12 | 15 | 7 | 28 | female |
6 | 5 | 6 | 13 | female |
17 | 15 | 32 | 26 | male |
15 | 17 | 24 | 14 | male |
15 | 14 | 29 | 23 | male |
13 | 12 | 10 | 16 | male |
20 | 17 | 26 | 28 | male |
15 | 21 | 26 | 21 | male |
15 | 13 | 26 | 22 | male |
13 | 5 | 22 | 22 | male |
14 | 7 | 30 | 17 | male |
17 | 15 | 30 | 27 | male |
17 | 17 | 26 | 20 | male |
17 | 20 | 28 | 24 | male |
15 | 15 | 29 | 24 | male |
18 | 19 | 32 | 28 | male |
18 | 18 | 31 | 27 | male |
15 | 14 | 26 | 21 | male |
18 | 17 | 33 | 26 | male |
10 | 14 | 19 | 17 | male |
18 | 21 | 30 | 29 | male |
18 | 21 | 34 | 26 | male |
13 | 17 | 30 | 24 | male |
16 | 16 | 16 | 16 | male |
11 | 15 | 25 | 23 | male |
16 | 13 | 26 | 16 | male |
16 | 13 | 23 | 21 | male |
18 | 18 | 34 | 24 | male |
16 | 15 | 28 | 27 | male |
15 | 16 | 29 | 24 | male |
18 | 19 | 32 | 23 | male |
18 | 16 | 33 | 23 | male |
17 | 20 | 21 | 21 | male |
19 | 19 | 30 | 28 | male |
- Second, we implement the profile analysis using the pbg() function from the profileR package.
if(! require("profileR")){ install.packages("profileR") library("pbg") "The pbg function implements three hypothesis tests. These tests are whether the profiles are parallel, have equal levels, and are flat across groups defined by the grouping variable. If parallelism is rejected, the other two tests are not necessary. " }
mod <- pbg(data = data542[, 1:4], group = data542[, 5], original.names = TRUE, profile.plot = TRUE)
summary(mod)
## 
## Data Summary:
##      female     male
## y1 12.34375 15.96875
## y2 13.90625 15.90625
## y3 16.65625 27.18750
## y4 21.93750 22.75000
## Call:
## pbg(data = data542[, 1:4], group = data542[, 5], original.names = TRUE,
##     profile.plot = TRUE)
## 
## Hypothesis Tests:
## $`Ho: Profiles are parallel`
##   Multivariate.Test Statistic Approx.F num.df den.df      p.value
## 1             Wilks  0.455078 23.94851      3     60 2.587417e-10
## 2            Pillai  0.544922 23.94851      3     60 2.587417e-10
## 3  Hotelling-Lawley  1.197425 23.94851      3     60 2.587417e-10
## 4               Roy  1.197425 23.94851      3     60 2.587417e-10
## 
## $`Ho: Profiles have equal levels`
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## group        1  287.9  287.94   28.04 1.66e-06 ***
## Residuals   62  636.6   10.27                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $`Ho: Profiles are flat`
##         F df1 df2      p-value
## 1 81.9367   3  60 3.401274e-21
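As a cross-check of the parallelism test (a minimal sketch; it assumes the standard two-group identity that the parallelism \(T^2\) equals \((n_1+n_2-2)\) times the Hotelling-Lawley statistic reported above), we can compute \(T^2\) directly from \(\mathbf{C}\) and \(\textbf{S}_{pl}\):

# parallelism T^2 computed by hand from C and the pooled covariance matrix
C <- matrix(c(-1, 1, 0, 0,
              0, -1, 1, 0,
              0, 0, -1, 1), nrow = 3, byrow = TRUE)
males   <- data542[data542$gender == "male", 1:4]
females <- data542[data542$gender == "female", 1:4]
n1 <- nrow(males); n2 <- nrow(females)
Spl <- ((n1 - 1) * cov(males) + (n2 - 1) * cov(females)) / (n1 + n2 - 2)
d <- C %*% (colMeans(males) - colMeans(females))
T2_par <- (n1 * n2 / (n1 + n2)) * t(d) %*% solve(C %*% Spl %*% t(C)) %*% d
T2_par  # should equal (n1 + n2 - 2) * 1.197425, i.e. about 74.24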
Source: https://dapivei.github.io/MCA/chapter5.html