# TU060 – Math 9102 – PSI – Assignment

#### Joseph O’Carroll

#### 06/01/2021

# Introduction:

The purpose of this report is to address all statistical concepts comprehensively and demonstrate a coherent understanding of statistical analysis through an independent project, as covered during the TU060 MATH9102 module. For this part of the assignment we will be using a dataset on the evolution of academic performance for engineering students. The dataset (E. Delahoz-Dominguez et al., 2020) contains the results of national assessments for secondary and university education for engineering students, along with academic, social and economic information for 12,411 students. The data was collected as part of the Master’s Degree in Engineering project of the Technological University of Bolívar (UTB), titled Academic Efficiency Analysis in Engineering students.

A full descriptor of the dataset is available online, and the dataset itself is available for download from Mendeley Data.

### Option A Selected

We have selected option A. The assignment is broken into different sections, each section addressing different requirements. Option A includes the following learning outcomes which are illustrated with references to the academic performance dataset provided:

- State (a) research question(s) which can be investigated using statistical analysis.
- Develop testable hypotheses by building a linear or a logistic regression model.
- Derive and present appropriate statistical evidence, using either linear regression or logistic regression:
  - Build a baseline regression model with at least 2 predictors.
  - Assess its fit and usefulness.
  - Test the assumptions of the approach.
  - Illustrate findings using appropriate examples from the data.
- Build at least one additional model which extends this baseline, either adding or removing predictors relevant to our hypotheses (the model must have at least 2 predictors):
  - Assess its fit and usefulness.
  - Test the assumptions of the approach.
  - Illustrate our findings using appropriate examples from the data.
- Compare the fit and usefulness of our successive models.
- Note: at least one of our models should investigate a differential effect.

# 1 Research Question(s)

This section captures the main research question for this independent project. We’ve included the research goal and broken the research question down into three parts that can be validated through statistical methods.

## 1.1 Research Goal:

To gain a deeper understanding of the nature of the relationship between engineering student academic performance and socio-economic factors.

## 1.2 Main Research Questions:

Can an engineering student’s final university performance in engineering be predicted from a combination of past performance in national standardised tests in the final year of high school? Is there a difference between private and public school students, and is there a differential effect for male and female students?

To answer this main research question, we also must look at a number of contributing descriptive, comparative and relational questions, which serve to explore specific aspects of the data.

### 1.2.1 Question 1

RQ: What is the relationship between student high school performance and university performance?

*Hypothesis*: Students who perform well overall in high school will perform well in engineering at university.

### 1.2.2 Question 2

RQ: What is the relationship between attending a private or public school and final educational achievement?

*Hypothesis*: Students who attend private high schools will perform better overall in engineering at university.

### 1.2.3 Question 3

RQ: What is the relationship between being male or female and final educational achievement?

*Hypothesis*: There will be a difference between male and female student performance.

# 2 Data understanding / Prepare

For this part of the assignment we will be using a dataset on the evolution of academic performance for engineering students. The dataset contains the results of national assessments for secondary and university education for engineering students, along with academic, social and economic information for 12,411 students. The sample contains 44 potential variables of interest.

## 2.1 Describe the Sample:

### 2.1.1 Import dataset

There are three files that should be imported. The RMD file attached to this report automatically downloads these datasets from GitHub.

- data_academic_performance.xlsx.
- statistical_data_types_sheet1.csv – Numeric variables.
- statistical_data_types_sheet2.csv – Categorical variables.

```r
############
# PART: Import data
############
# Dataset 1: Import the student-performance dataset
uri <- 'https://ocarrolj.github.io/TU060_MATH9102_DATASET_HOSTING/data_academic_performance.xlsx'
download.file(uri, 'data_academic_performance_downloaded.xlsx', mode = 'wb')
tbl_sperf_all <- read_excel('data_academic_performance_downloaded.xlsx') %>% as.data.frame()
names(tbl_sperf_all)[42] <- 'SECOND_DECILE' # Fix issue with the name of this field.

# Data descriptors 1: Import numeric variables
uri <- 'https://ocarrolj.github.io/TU060_MATH9102_DATASET_HOSTING/assignment_statistical_data_types_sheet1.csv'
tbl_sperf_desciption_numeric <- read.csv(uri)

# Data descriptors 2: Import categorical variables
uri <- 'https://ocarrolj.github.io/TU060_MATH9102_DATASET_HOSTING/assignment_statistical_data_types_sheet2.csv'
tbl_sperf_desciption_categorical <- read.csv(uri)
```

### 2.1.2 Describe the dataset

Attribute | Description | Statistical.Data.Type |
---|---|---|
MAT_S11 | Mathematics | Numeric: Interval ratio |
CR_S11 | Critical Reading | Numeric: Interval ratio |
CC_S11 | Citizen Competencies S11 | Numeric: Interval ratio |
BIO_S11 | Biology | Numeric: Interval ratio |
ENG_S11 | English | Numeric: Interval ratio |
QR_PRO | Quantitative Reasoning | Numeric: Interval ratio |
CR_PRO | Critical Reading | Numeric: Interval ratio |
CC_PRO | Citizen Competencies SPRO | Numeric: Interval ratio |
ENG_PRO | English | Numeric: Interval ratio |
WC_PRO | Written Communication | Numeric: Interval ratio |
FEP_PRO | Formulation of Engineering Projects | Numeric: Interval ratio |
G_SC | Global Score | Numeric: Interval ratio |
PERCENTILE | Percentile | Numeric: Interval ratio |
2ND_DECILE | Second Decile | Categorical: ordinal, data values 1 to 5 |
QUARTILE | Quartile | Categorical: ordinal, data values 1 to 4 |
SEL | Socioeconomic Level | Categorical: ordinal, data values 1 to 4 |
SEL_IHE | Socioeconomic Level of The Institution of Higher Education | Categorical: ordinal, data values 1 to 4 |

Attribute | Description | Statistical.Data.Type |
---|---|---|
GENDER | Gender | Binary: Categorical with values Female or Male. |
EDU_FATHER | Father’s education | Ordinal: Categorical with 12 levels. |
EDU_MOTHER | Mother’s education | Ordinal: Categorical with 12 levels. |
OCC_FATHER | Father’s occupation | Nominal: Categorical with 13 values |
OCC_MOTHER | Mother’s occupation | Nominal: Categorical with 13 values |
STRATUM | Stratum | Ordinal: Categorical with 7 levels. |
SISBEN | Sisben | Ordinal: Categorical with 6 levels. |
PEOPLE_HOUSE | People in the house | Ordinal: Categorical with 13 levels. |
INTERNET | Internet | Binary: Categorical with values Yes or No. |
TV | TV | Binary: Categorical with values Yes or No. |
COMPUTER | Computer | Binary: Categorical with values Yes or No. |
WASHING_MCH | Washing machine | Binary: Categorical with values Yes or No. |
MIC_OVEN | Microwave oven | Binary: Categorical with values Yes or No. |
CAR | Car | Binary: Categorical with values Yes or No. |
DVD | DVD | Binary: Categorical with values Yes or No. |
FRESH | Fresh | Binary: Categorical with values Yes or No. |
PHONE | Phone | Binary: Categorical with values Yes or No. |
MOBILE | Mobile | Binary: Categorical with values Yes or No. |
REVENUE | Revenue | Ordinal: Categorical with 3 levels. |
JOB | Job | Ordinal: Categorical with 8 levels. |
SCHOOL_NAME | School name | Nominal: Categorical with 4 values |
SCHOOL_NAT | Nature of School | Nominal: Categorical with 2 values |
SCHOOL_TYPE | Type of School | Nominal: Categorical with 4 values |
COD_SPRO | Code Saber Pro | Nominal: Unique student ID, not a statistical data type |
UNIVERSITY | University | Nominal: Categorical with 134 values |
ACADEMIC_PROGRAM | Academic Program | Nominal: Categorical with 23 values |
COD_S11 | Code Saber 11 | Nominal: Unique student ID, not a statistical data type |

## 2.2 Variables of interest to research question

Based on our research question, the concepts we are interested in are:

- Student performance in high school (represented by 5 continuous numerical variables)
- The nature of the high school attended (represented by 1 categorical variable, SCHOOL_NAT)
- The sex of the student (represented by 1 binary nominal categorical variable, GENDER)
- The final performance of the student (represented by the overall average score of the professional evaluation, G_SC)

### 2.2.1 Quantitative Data:

In the sample dataset there are 5 continuous numerical quantitative variables of interest related to student performance. The tables below contain statistics that describe the centre point (mean and median), the shape (skew and kurtosis) and the variability (standard deviation and variance). These statistics inform us about the overall shape and distribution of the variables under consideration. The shape of the distribution is important because it determines what sort of tests we can use during hypothesis testing; for instance, whether we can use parametric tests or must revert to the less statistically powerful non-parametric tests.

The first step in understanding our data is to validate the normality of the variables. To do this we generate summary statistics, histograms and Q-Q plots for each of the variables. Should we discover that a variable is not ideally normal, we quantify how far from normal it is by calculating the standardised skew and kurtosis.

| | median | mean | SE.mean | CI.mean.0.95 | var | std.dev | coef.var | std_skew | std_kurt | gt_2sd | gt_P99 | gt_3sd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MAT_S11 | 64 | 64.32076 | 0.1065813 | 0.2089158 | 140.9836 | 11.87365 | 0.1846006 | 18.17215 | 2.954064 | 4.447667 | 2.062686 | 0 |
| CR_S11 | 61 | 60.77842 | 0.0899951 | 0.1764044 | 100.5182 | 10.02588 | 0.1649578 | 9.74404 | 10.86751 | 5.575699 | 1.579244 | 0.3786963 |
| CC_S11 | 60 | 60.70518 | 0.0908447 | 0.1780697 | 102.4250 | 10.12052 | 0.1667160 | 15.78875 | 17.18771 | 5.487068 | 1.667875 | 0.4431553 |
| BIO_S11 | 64 | 63.95053 | 0.1001472 | 0.1963041 | 124.4757 | 11.15687 | 0.1744609 | 13.7976 | 6.800432 | 5.559584 | 1.490613 | 0.0322295 |
| ENG_S11 | 59 | 61.80106 | 0.1283409 | 0.2515681 | 204.4264 | 14.29778 | 0.2313516 | 27.60265 | -8.432497 | 5.648215 | 1.128032 | 0 |
| G_SC | 163 | 162.71050 | 0.2074642 | 0.4066620 | 534.1867 | 23.11248 | 0.1420466 | -4.339492 | -1.714035 | 4.060914 | 0.9265974 | 0.15309 |
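The std_skew and std_kurt columns in the table above are the raw skew and kurtosis statistics divided by their standard errors. As a language-agnostic illustration (the report’s own analysis is in R), a minimal Python sketch of standardised skew, using the common approximation SE ≈ sqrt(6/n) and invented scores:

```python
import statistics

def std_skew(scores):
    """Standardised skew: adjusted sample skew divided by its
    approximate standard error sqrt(6/n)."""
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    skew = sum(((x - mean) / sd) ** 3 for x in scores) * n / ((n - 1) * (n - 2))
    return skew / (6 / n) ** 0.5

print(std_skew([1, 2, 3, 4, 5]))    # symmetric data: standardised skew is 0
print(std_skew([1, 1, 1, 10]) > 0)  # a long right tail gives a positive skew
```

A standardised skew far beyond roughly +/- 2 suggests the distribution departs noticeably from normal, which is the heuristic applied to the table above.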

We converted the raw scores for our variables into standardised scores. Using the empirical rule, we can treat the data as normal if 95% of standardised scores fall within +/- 1.96 (or +/- 2.58 for sample sizes greater than 80). Since our sample is much larger than 80 cases, a case is considered an outlier if its standardised score is +/- 3.29 or beyond. For CR_S11, 0.38% of our data falls outside this range, and for CC_S11, 0.44% does; since our limit was 0.3% of the data, we cannot consider these variables normally distributed.

### 2.2.2 Categorical/Qualitative data:

In the sample dataset there are 26 categorical variables describing the social, economic and demographic attributes of students. For our analysis we are interested in Gender and School Nature. For categorical data it doesn’t make sense to describe the data in terms of average value or standard deviation, since any numerical values are just an encoding with no quantitative meaning. Instead, we describe categorical data in terms of possible values and the frequency of occurrence of those values. Important summary statistics include the count of distinct values, a list of possible values, the relative proportion in which each value occurs, and the most frequently occurring value. The following table and figures describe the summary statistics for the categorical variables in the sample dataset.

```
## Frequencies
## tbl_sperf_categorical_measurements$GENDER
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
## F 5043 40.63 40.63 40.63 40.63
## M 7368 59.37 100.00 59.37 100.00
## <NA> 0 0.00 100.00
## Total 12411 100.00 100.00 100.00 100.00
##
## tbl_sperf_categorical_measurements$SCHOOL_NAT
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------- ------- --------- -------------- --------- --------------
## PRIVATE 6565 52.90 52.90 52.90 52.90
## PUBLIC 5846 47.10 100.00 47.10 100.00
## <NA> 0 0.00 100.00
## Total 12411 100.00 100.00 100.00 100.00
```

## 2.3 Potential Issues and Shortcomings

### 2.3.1 Missing Data

Missing data is where we have a variable but are missing a value for that variable in some records; such records can behave like outliers in an analysis. It is common, particularly when dealing with data related to human beings, that not all variables will have values in all cases. In the engineering performance dataset there are variables with missing values; for example, approximately 1% of students did not indicate whether they had a job. For the variables of interest to our research question, all student records have values for all variables, and as such missing data is not considered to be an issue.
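Gaps like these are found by counting missing values per variable. A small illustrative sketch in Python, with hypothetical records rather than real dataset rows (the report’s own checks are in R):

```python
# Hypothetical student records; None marks a missing response
records = [
    {"G_SC": 163, "JOB": "No"},
    {"G_SC": 150, "JOB": None},
    {"G_SC": 171, "JOB": "Yes"},
]

# Count missing values per variable
missing = {key: sum(rec[key] is None for rec in records) for key in records[0]}
print(missing)  # {'G_SC': 0, 'JOB': 1}
```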

### 2.3.2 Representativeness

To make generalised statements about the results of national assessments for secondary and university education in the engineering student population, two necessary (though not sufficient) requirements are that the sample be big enough and representative.

“Big enough” means that whatever we are interested in investigating as part of our statistical analysis can be found in our sample if it is present in the population. The sample has records from 12,411 students across 3,649 schools, 134 universities and 21 academic programs, so the dataset is both broad and deep.

“Representative” means that the characteristics of our sample mirror those of the population in the same proportions. For example, if we are interested in privately versus publicly educated students, then the fraction of private-school and public-school students in our sample needs to be similar to that which prevails in the wider engineering student population. Without an understanding of the general population it is difficult to assert that the sample is representative; however, we can identify specific dimensions that are unlikely to be. For example, the number of responses per university varies significantly, from 1 to 639, so statistical analysis between universities, or within the population of a given university, may not be representative.

For the variables of interest, critical reading at high school level (CR_S11) and citizen competencies (CC_S11) appear to have a disproportionate number of students obtaining maximum scores. This might suggest an over-sampling of high achievers within the dataset. Similarly, many of the variables have no records for low scores. This is less surprising, since our population is high school students who attended and completed engineering courses at a university; students who failed high school or failed their university course would naturally not be represented in this dataset.

# 3 Results

## 3.1 Statistical Evidence

In this section we provide statistical evidence for the inclusion of variables in our hypothesis tests. The first step in hypothesis testing is to determine whether the variables are related to one another.

The first step is to collect data to see whether our hypotheses are accurate. In section 2 we already selected a number of variables for consideration. The next step is to investigate if our predictor variables are related to our outcome variable through correlation or if there is a differential effect for different groups. This will provide the justification for including these variables in our predictive models.

### 3.1.1 Correlation

One of the tests for correlation is the Pearson correlation. It requires that the variables be normally distributed and that the relationship be linear. The non-parametric Spearman or Kendall tests can be used when the data does not meet the criteria to be considered normal and the assumptions of Pearson are violated; Spearman does, however, require a monotonic relationship.
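The distinction matters because Spearman works on ranks: a monotonic but non-linear relationship gives a perfect Spearman correlation while Pearson stays below 1. A small self-contained Python sketch (tie-free invented data; the report’s own tests use R’s `cor.test`):

```python
def pearson(x, y):
    """Pearson's r: covariance scaled by the product of the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on ranks (assumes no ties)."""
    def ranks(v):
        order = sorted(v)
        return [order.index(a) + 1 for a in v]
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]      # monotonic but clearly non-linear
print(pearson(x, y))          # below 1: the relationship is not a straight line
print(spearman(x, y))         # (numerically) 1: the relationship is perfectly monotonic
```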

#### 3.1.1.1 Step 1 Check for Linearity and Homoscedasticity

Pearson correlation requires a linear relationship between the two variables, as Pearson uses the equation of a line to model the relationship. We validate this condition by inspecting a scatter plot, which should resemble a straight line rather than a curve. For homoscedasticity we want the values to be evenly spread around that line in a roughly rectangular band, rather than fanning out. The tube-like shape of the scatter plots for the normally distributed variables indicates homoscedasticity.

```r
############
# PART: Visualisation of linearity of covariance
############
plots <- list()

#------------------- Generate scatter plots -------------------#
gs <- tbl_sperf_continuous_scale %>% ggplot2::ggplot(aes(x = MAT_S11, y = G_SC))
gs <- gs +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", se = F) +
  labs(x = "Highschool Mathematics score (MAT_S11)", y = "Final University Global Score (G_SC)")
plots[["math"]] <- gs

gs <- tbl_sperf_continuous_scale %>% ggplot2::ggplot(aes(x = BIO_S11, y = G_SC))
gs <- gs +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", se = F) +
  labs(x = "Highschool Biology score (BIO_S11)", y = "Final University Global Score (G_SC)")
plots[["bio"]] <- gs

gs <- tbl_sperf_continuous_scale %>% ggplot2::ggplot(aes(x = ENG_S11, y = G_SC))
gs <- gs +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", se = F) +
  labs(x = "Highschool English score (ENG_S11)", y = "Final University Global Score (G_SC)")
plots[["eng"]] <- gs

plot_grid(plotlist = plots, labels = "", ncol = 3)
```

#### 3.1.1.2 Step 2 Run the Pearson correlation test

```r
############
# PART: Correlation test
############
# Pearson correlation
tbl_correlation_stats <- stats::cor.test(tbl_sperf_continuous_scale$MAT_S11,
                                         tbl_sperf_continuous_scale$G_SC,
                                         method = 'pearson')
show(tbl_correlation_stats)
```

```
##
## Pearson's product-moment correlation
##
## data: tbl_sperf_continuous_scale$MAT_S11 and tbl_sperf_continuous_scale$G_SC
## t = 93.733, df = 12409, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6334194 0.6540231
## sample estimates:
## cor
## 0.6438379
```

```r
tbl_correlation_stats <- stats::cor.test(tbl_sperf_continuous_scale$BIO_S11,
                                         tbl_sperf_continuous_scale$G_SC,
                                         method = 'pearson')
show(tbl_correlation_stats)
```

```
##
## Pearson's product-moment correlation
##
## data: tbl_sperf_continuous_scale$BIO_S11 and tbl_sperf_continuous_scale$G_SC
## t = 99.627, df = 12409, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6567441 0.6762966
## sample estimates:
## cor
## 0.666635
```

```r
tbl_correlation_stats <- stats::cor.test(tbl_sperf_continuous_scale$ENG_S11,
                                         tbl_sperf_continuous_scale$G_SC,
                                         method = 'pearson')
show(tbl_correlation_stats)
```

```
##
## Pearson's product-moment correlation
##
## data: tbl_sperf_continuous_scale$ENG_S11 and tbl_sperf_continuous_scale$G_SC
## t = 98.435, df = 12409, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6521731 0.6719345
## sample estimates:
## cor
## 0.6621689
```

#### 3.1.1.3 Step 3 Check for a Monotonic Relationship

Spearman correlation requires a monotonic relationship between the two variables. The graphs below show that this condition is met.

```r
############
# PART: Visualisation of non-normally distributed variables
############
plots <- list()

gs <- tbl_sperf_continuous_scale %>% ggplot2::ggplot(aes(x = CR_S11, y = G_SC))
gs <- gs +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", se = F) +
  labs(x = "Highschool Critical Reading score (CR_S11)", y = "Final University Global Score (G_SC)")
plots[["CR"]] <- gs

gs <- tbl_sperf_continuous_scale %>% ggplot2::ggplot(aes(x = CC_S11, y = G_SC))
gs <- gs +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", se = F) +
  labs(x = "Highschool Citizen Competencies score (CC_S11)", y = "Final University Global Score (G_SC)")
plots[["CC"]] <- gs

plot_grid(plotlist = plots, labels = "", ncol = 2)
```

#### 3.1.1.4 Step 4 Run the Spearman correlation test

```r
############
# PART: Correlation test
############
# Spearman correlation
tbl_correlation_stats <- stats::cor.test(tbl_sperf_continuous_scale$CR_S11,
                                         tbl_sperf_continuous_scale$G_SC,
                                         method = 'spearman')
show(tbl_correlation_stats)
```

```
##
## Spearman's rank correlation rho
##
## data: tbl_sperf_continuous_scale$CR_S11 and tbl_sperf_continuous_scale$G_SC
## S = 1.0562e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.6685184
```

```r
tbl_correlation_stats <- stats::cor.test(tbl_sperf_continuous_scale$CC_S11,
                                         tbl_sperf_continuous_scale$G_SC,
                                         method = 'spearman')
show(tbl_correlation_stats)
```

```
##
## Spearman's rank correlation rho
##
## data: tbl_sperf_continuous_scale$CC_S11 and tbl_sperf_continuous_scale$G_SC
## S = 1.1158e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.6498049
```

#### 3.1.1.5 Reporting Correlation

The relationship between high school maths score (MAT_S11) and final university performance (G_SC) was investigated using a Pearson correlation. A strong positive correlation was found (r = .644, n = 12411, p < .001). There is therefore evidence to reject the null hypothesis and accept the alternative hypothesis that there is a relationship between high school maths score and final university performance.

The relationship between high school biology score (BIO_S11) and final university performance (G_SC) was investigated using a Pearson correlation. A strong positive correlation was found (r = .667, n = 12411, p < .001). There is therefore evidence to reject the null hypothesis and accept the alternative hypothesis that there is a relationship between high school biology score and final university performance.

The relationship between high school English score (ENG_S11) and final university performance (G_SC) was investigated using a Pearson correlation. A strong positive correlation was found (r = .662, n = 12411, p < .001). There is therefore evidence to reject the null hypothesis and accept the alternative hypothesis that there is a relationship between high school English score and final university performance.

The relationship between high school critical reading score (CR_S11) and final university performance (G_SC) was investigated using a Spearman correlation. A strong positive correlation was found (rho = .669, p < .001). There is therefore evidence to reject the null hypothesis and accept the alternative hypothesis that there is a relationship between high school critical reading score and final university performance.

The relationship between high school citizen competencies score (CC_S11) and final university performance (G_SC) was investigated using a Spearman correlation. A strong positive correlation was found (rho = .65, p < .001). There is therefore evidence to reject the null hypothesis and accept the alternative hypothesis that there is a relationship between high school citizen competencies score and final university performance.

### 3.1.2 Difference tests

Before we can build a predictive model, we need to work out which groupings to include in our hypothesis testing. The choice of test we perform to investigate whether differential effects exist depends on the measurement level of the variable and the shape of the data. If our data is at the interval level of measurement and normally distributed, then we can use the independent samples t-test, provided the observations are independent. If our data is ordinal, or is continuous scale data that does not conform to the normal distribution, then we must use a non-parametric test such as the Mann-Whitney U test.

We will use a number of heuristics to justify our assessment of differential effects. A benchmark suggested by Cohen (1988) is to consider an effect size d as small (d = 0.2), medium (d = 0.5) or large (d = 0.8) for the independent samples t-test.
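Cohen's d for two independent groups is the difference in means divided by the pooled standard deviation. A minimal Python sketch with invented groups (illustrative only; the report's effect sizes come from the R analysis):

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups: difference in means
    divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Two groups one unit apart with a pooled SD of ~1.58:
# d is about 0.63, between Cohen's medium (0.5) and large (0.8) benchmarks
print(cohens_d([2, 3, 4, 5, 6], [1, 2, 3, 4, 5]))
```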

#### 3.1.2.1 Step 1 Check for Homogeneity of Variance for different groups

Homogeneity of variance means that the pattern of variance around the mean is the same for each group. The t-test is robust and does not strictly require homogeneity of variance between the two groups; however, if we can assume homogeneity of variance, we increase the power of the t-test.
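Levene's test centred on the median (the Brown–Forsythe variant used in this report's R code) is a one-way ANOVA on the absolute deviations from each group's median. A small illustrative Python sketch for the two-group case, with invented data:

```python
import statistics

def levene_median_f(group_a, group_b):
    """F statistic of Levene's test (median-centred) for two groups:
    a one-way ANOVA on absolute deviations from each group's median."""
    z_a = [abs(x - statistics.median(group_a)) for x in group_a]
    z_b = [abs(x - statistics.median(group_b)) for x in group_b]
    n_a, n_b = len(z_a), len(z_b)
    m_a, m_b = statistics.mean(z_a), statistics.mean(z_b)
    grand = statistics.mean(z_a + z_b)
    ss_between = n_a * (m_a - grand) ** 2 + n_b * (m_b - grand) ** 2
    ss_within = sum((z - m_a) ** 2 for z in z_a) + sum((z - m_b) ** 2 for z in z_b)
    # k = 2 groups: df_between = 1, df_within = n_a + n_b - 2
    return (ss_between / 1) / (ss_within / (n_a + n_b - 2))

# Two groups with identical spread around different medians -> F is 0
print(levene_median_f([1, 2, 3], [11, 12, 13]))
```

A large F (small Pr(>F)) means the spreads differ and homogeneity of variance cannot be assumed.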

```r
############
# PART: Homoscedasticity
############
# Create a subset dataframe with just the variables of interest.
tbl_sperf_sex_nat_diff <- tbl_sperf_all %>%
  dplyr::select(MAT_S11, CR_S11, CC_S11, BIO_S11, ENG_S11, G_SC, GENDER, SCHOOL_NAT)

# -------------- Box plot --------------- #
# A quick eyeball test of variance and mean, to cross-check with Levene's test
tbl_sperf_sex_nat_diff %>%
  gather(MAT_S11, CR_S11, CC_S11, BIO_S11, ENG_S11, G_SC, key = "var", value = "value") %>%
  ggplot(aes(x = var, y = value, fill = GENDER)) +
  geom_boxplot() +
  theme_bw() +
  labs(y = "Grades", x = "Performance Variables",
       title = "Box plots to eyeball variance",
       subtitle = "Difference testing: Male and Female")
```

```r
# -------------- Box plot --------------- #
# A quick eyeball test of variance and mean, to cross-check with Levene's test
tbl_sperf_sex_nat_diff %>%
  gather(MAT_S11, CR_S11, CC_S11, BIO_S11, ENG_S11, G_SC, key = "var", value = "value") %>%
  ggplot(aes(x = var, y = value, fill = SCHOOL_NAT)) +
  geom_boxplot() +
  theme_bw() +
  labs(y = "Grades", x = "Performance Variables",
       title = "Box plots to eyeball variance",
       subtitle = "Difference testing: Public and Private Highschools")
```

```r
# -------------- Create summary statistics --------------- #
tbl_sperf_sex_diff_stats <- tbl_sperf_sex_nat_diff %>%
  dplyr::select(-(SCHOOL_NAT)) %>%
  psych::describeBy(tbl_sperf_sex_nat_diff$GENDER, mat = TRUE) %>%
  filter(!is.na(skew)) # removes categorical variables

tbl_sperf_nat_diff_stats <- tbl_sperf_sex_nat_diff %>%
  dplyr::select(-(GENDER)) %>%
  psych::describeBy(tbl_sperf_sex_nat_diff$SCHOOL_NAT, mat = TRUE) %>%
  filter(!is.na(skew)) # removes categorical variables

# Pretty print table
tbl_sperf_sex_diff_stats %>%
  kbl(caption = "Summary statistics for Performance by student Sex") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

| | item | group1 | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MAT_S111 | 1 | F | 1 | 5043 | 61.93238 | 11.169655 | 61 | 61.48699 | 11.8608 | 31 | 100 | 69 | 0.4310270 | 0.2484840 | 0.1572879 |
| MAT_S112 | 2 | M | 1 | 7368 | 65.95548 | 12.063491 | 65 | 65.51357 | 11.8608 | 26 | 100 | 74 | 0.3522215 | 0.0572731 | 0.1405394 |
| CR_S111 | 3 | F | 2 | 5043 | 60.89530 | 9.873563 | 61 | 60.74374 | 8.8956 | 28 | 100 | 72 | 0.2371788 | 0.5148966 | 0.1390367 |
| CR_S112 | 4 | M | 2 | 7368 | 60.69843 | 10.128695 | 61 | 60.54664 | 10.3782 | 24 | 100 | 76 | 0.2007250 | 0.4481850 | 0.1179991 |
| CC_S111 | 5 | F | 3 | 5043 | 59.95994 | 9.519411 | 60 | 59.73903 | 8.8956 | 29 | 100 | 71 | 0.3059556 | 0.5966242 | 0.1340496 |
| CC_S112 | 6 | M | 3 | 7368 | 61.21526 | 10.482293 | 61 | 60.90655 | 8.8956 | 0 | 100 | 100 | 0.3438518 | 0.7672250 | 0.1221185 |
| BIO_S111 | 7 | F | 4 | 5043 | 62.26829 | 10.529873 | 62 | 61.99232 | 10.3782 | 11 | 100 | 89 | 0.2912662 | 0.4374554 | 0.1482787 |
| BIO_S112 | 8 | M | 4 | 7368 | 65.10193 | 11.425222 | 65 | 64.81411 | 10.3782 | 20 | 100 | 80 | 0.2752888 | 0.1978239 | 0.1331036 |
| ENG_S111 | 9 | F | 5 | 5043 | 61.45112 | 13.788398 | 59 | 60.42082 | 13.3434 | 26 | 100 | 74 | 0.6377273 | -0.2513956 | 0.1941643 |
| ENG_S112 | 10 | M | 5 | 7368 | 62.04058 | 14.632306 | 59 | 61.00390 | 14.8260 | 30 | 100 | 70 | 0.5828105 | -0.4540746 | 0.1704661 |
| G_SC1 | 11 | F | 6 | 5043 | 161.27821 | 22.332939 | 162 | 161.49938 | 23.7216 | 76 | 242 | 166 | -0.0721601 | -0.2471797 | 0.3144861 |
| G_SC2 | 12 | M | 6 | 7368 | 163.69083 | 23.582616 | 164 | 163.95607 | 25.2042 | 37 | 247 | 210 | -0.1221049 | 0.0144190 | 0.2747370 |

```r
# Pretty print table
tbl_sperf_nat_diff_stats %>%
  kbl(caption = "Summary statistics for Performance by student highschool nature") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

| | item | group1 | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MAT_S111 | 1 | PRIVATE | 1 | 6565 | 66.88591 | 12.144541 | 66 | 66.47973 | 11.8608 | 31 | 100 | 69 | 0.3153056 | -0.0142747 | 0.1498869 |
| MAT_S112 | 2 | PUBLIC | 1 | 5846 | 61.44013 | 10.863289 | 60 | 61.01603 | 10.3782 | 26 | 100 | 74 | 0.4177939 | 0.3192681 | 0.1420797 |
| CR_S111 | 3 | PRIVATE | 2 | 6565 | 62.57944 | 9.875766 | 62 | 62.45555 | 8.8956 | 24 | 100 | 76 | 0.1946699 | 0.6232599 | 0.1218859 |
| CR_S112 | 4 | PUBLIC | 2 | 5846 | 58.75590 | 9.805907 | 58 | 58.55772 | 8.8956 | 26 | 100 | 74 | 0.2556649 | 0.4378005 | 0.1282503 |
| CC_S111 | 5 | PRIVATE | 3 | 6565 | 62.31988 | 10.092052 | 62 | 62.01999 | 8.8956 | 29 | 100 | 71 | 0.3640218 | 0.7101923 | 0.1245553 |
| CC_S112 | 6 | PUBLIC | 3 | 5846 | 58.89189 | 9.842347 | 58 | 58.62227 | 8.8956 | 0 | 100 | 100 | 0.3332656 | 0.8651224 | 0.1287269 |
| BIO_S111 | 7 | PRIVATE | 4 | 6565 | 66.07890 | 11.208219 | 65 | 65.78736 | 10.3782 | 11 | 100 | 89 | 0.2781122 | 0.3189615 | 0.1383309 |
| BIO_S112 | 8 | PUBLIC | 4 | 5846 | 61.56038 | 10.602146 | 61 | 61.27747 | 10.3782 | 20 | 100 | 80 | 0.3041901 | 0.2957655 | 0.1386642 |
| ENG_S111 | 9 | PRIVATE | 5 | 6565 | 67.08271 | 14.885313 | 67 | 66.56387 | 17.7912 | 31 | 100 | 69 | 0.2363625 | -0.8065631 | 0.1837133 |
| ENG_S112 | 10 | PUBLIC | 5 | 5846 | 55.86983 | 10.894422 | 54 | 54.88392 | 8.8956 | 26 | 100 | 74 | 0.8525358 | 0.6290913 | 0.1424869 |
| G_SC1 | 11 | PRIVATE | 6 | 6565 | 168.21584 | 22.823660 | 170 | 168.73786 | 23.7216 | 37 | 247 | 210 | -0.2040610 | 0.0242467 | 0.2816877 |
| G_SC2 | 12 | PUBLIC | 6 | 5846 | 156.52805 | 21.838175 | 157 | 156.56670 | 22.2390 | 72 | 228 | 156 | -0.0521594 | -0.0191213 | 0.2856189 |

```r
############
# PART: Levene test
############
# -------------- Levene's test --------------- #
# Conduct Levene's test for homogeneity of variance (car library). The null
# hypothesis is that the variances in the groups are equal, so to assume
# homogeneity we would expect the probability to not be statistically significant.

# Part 1: Iterate through the variables, centring on the median, for the Gender grouping
variable_count <- 6
result <- list()
for (n in 1:variable_count) {
  variable <- colnames(tbl_sperf_sex_nat_diff)[n]
  result[[variable]] <- car::leveneTest(y = tbl_sperf_sex_nat_diff[, n], # column n, not column 1
                                        group = as.factor(tbl_sperf_sex_nat_diff$GENDER),
                                        center = median)
}
# Pr(>F) is the probability - where it is statistically significant we cannot
# assume homogeneity of variance for that variable.
print.listof(result)
```

```
## MAT_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 30.904 2.767e-08 ***
## 12409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## CR_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 30.904 2.767e-08 ***
## 12409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## CC_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 30.904 2.767e-08 ***
## 12409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## BIO_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 30.904 2.767e-08 ***
## 12409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ENG_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 30.904 2.767e-08 ***
## 12409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## G_SC :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 30.904 2.767e-08 ***
## 12409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```r
# Part 2: Iterate through variables with median for School Nature
variable_count <- 6
result <- list()
for (n in 1:variable_count) {
  variable <- colnames(tbl_sperf_sex_nat_diff)[n]
  result[[variable]] <- car::leveneTest(y = tbl_sperf_sex_nat_diff[, n],
                                        group = as.factor(tbl_sperf_sex_nat_diff$SCHOOL_NAT),
                                        center = median)
}
# Pr(>F) is the probability - it is statistically significant for MAT_S11,
# BIO_S11, ENG_S11 and G_SC, so homogeneity cannot be assumed for those
# variables. There was no difference in outcome between using the mean and
# the median as the center, which is a good sign.
print.listof(result)
```

```
## MAT_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 74.954 < 2.2e-16 ***
## 12409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## CR_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.0262 0.8713
## 12409
##
## CC_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 1.9286 0.1649
## 12409
##
## BIO_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 12.464 0.0004163 ***
## 12409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ENG_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 936.77 < 2.2e-16 ***
## 12409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## G_SC :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 12.614 0.0003843 ***
## 12409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
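The decision rule used above can be illustrated outside R as well. The following is a minimal sketch on synthetic data (not the report's dataset), assuming `scipy.stats.levene` as the Python analogue of `car::leveneTest(..., center = median)`:

```python
# Minimal sketch (synthetic data, NOT the report's dataset) of the same
# median-centred Levene test; scipy.stats.levene is assumed as the Python
# analogue of car::leveneTest(..., center = median).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Two groups with equal spread, and two with clearly unequal spread.
a, b = rng.normal(60, 10, 5000), rng.normal(65, 10, 5000)
c, d = rng.normal(60, 10, 5000), rng.normal(65, 15, 5000)

_, p_equal_spread = stats.levene(a, b, center='median')
_, p_unequal_spread = stats.levene(c, d, center='median')

# A significant p-value rejects equal variances, so a Welch t-test
# (var.equal = FALSE in R) is the safer choice for the unequal case.
print(p_equal_spread > 0.05, p_unequal_spread < 0.001)
```

Note that Levene's test compares spread, not location: the two "equal spread" groups have different means, yet the test does not flag them.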

#### 3.1.2.2 Step 2 Select and run the T-Test for Gender

Based on the analysis above, it is appropriate to select a parametric independent-samples t-test to evaluate whether male and female students form two distinct groups for Maths, Biology, Engineering and Final Score. Because Levene's test found a statistically significant difference in variances, we will treat the variances as unequal (Welch's t-test) for each difference test.

```r
############
# PART 1: T-Test for Gender
############
# -------------- Conduct the t-test --------------- #
# Conduct the t-test from package stats.
# We use the var.equal = FALSE option to specify unequal variances (Welch).
tbl_test_result <- data.frame()
tbl_test_effectsize <- data.frame()
# Brittle code here: hardcoded variable-of-interest indices.
for (n in c(1, 4, 5, 6)) {
  variable <- colnames(tbl_sperf_sex_nat_diff)[n]
  test_result <- stats::t.test(tbl_sperf_sex_nat_diff[, n] ~ as.factor(tbl_sperf_sex_nat_diff$GENDER),
                               var.equal = FALSE) %>%
    broom::tidy() %>%
    as.data.frame()
  # Build output table
  row.names(test_result) <- variable
  tbl_test_result <- rbind(tbl_test_result, test_result)
  # ---------- Calculate Cohen's d effect size ---------- #
  effcd <- round(effectsize::t_to_d(t = test_result$statistic, test_result$parameter), 2)
  # ---------- Calculate eta-squared effect size ---------- #
  effes <- round((test_result$statistic * test_result$statistic) /
                   ((test_result$statistic * test_result$statistic) + test_result$parameter), 3)
  # Build output table
  tbl_merged_effectsize <- merge(effcd, effes)
  colnames(tbl_merged_effectsize)[5] <- 'eta.sq'  # merge() names this column 'y'
  row.names(tbl_merged_effectsize) <- variable
  tbl_test_effectsize <- rbind(tbl_test_effectsize, tbl_merged_effectsize)
}
# Tidy up column names
colnames(tbl_test_result)[1] <- 'mean.diff.est'
colnames(tbl_test_result)[2] <- paste('mean', levels(as.factor(tbl_sperf_sex_nat_diff$GENDER))[1], sep = ".")
colnames(tbl_test_result)[3] <- paste('mean', levels(as.factor(tbl_sperf_sex_nat_diff$GENDER))[2], sep = ".")
colnames(tbl_test_result)[6] <- 'df'
# The p-value is the probability - in this case every result was statistically
# significant at p < .05.
# -------------- Pretty Print Test statistics --------------- #
tbl_test_result %>%
  kbl(caption = "Summary of T-Test Statistics for the Male and Female student Groups") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

 | mean.diff.est | mean.F | mean.M | statistic | p.value | df | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|---|---
MAT_S11 | -4.0231017 | 61.93238 | 65.95548 | -19.073300 | 0.0000000 | 11353.66 | -4.436558 | -3.6096454 | Welch Two Sample t-test | two.sided |
BIO_S11 | -2.8336346 | 62.26829 | 65.10193 | -14.221044 | 0.0000000 | 11382.97 | -3.224212 | -2.4430576 | Welch Two Sample t-test | two.sided |
ENG_S11 | -0.5894605 | 61.45112 | 62.04058 | -2.281401 | 0.0225434 | 11239.88 | -1.095924 | -0.0829972 | Welch Two Sample t-test | two.sided |
G_SC | -2.4126178 | 161.27821 | 163.69083 | -5.777472 | 0.0000000 | 11207.16 | -3.231169 | -1.5940669 | Welch Two Sample t-test | two.sided |
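As a cross-check of the approach (again on synthetic data, not the report's), a minimal Welch two-sample t-test sketch, assuming `scipy.stats.ttest_ind` with `equal_var=False` as the Python analogue of `stats::t.test(..., var.equal = FALSE)`; the group means and SDs loosely echo the MAT_S11 row above:

```python
# Minimal sketch (synthetic data, NOT the report's dataset) of a Welch
# two-sample t-test; scipy.stats.ttest_ind(equal_var=False) is assumed as
# the Python analogue of stats::t.test(..., var.equal = FALSE).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Means and SDs loosely echo the MAT_S11 gender comparison reported above.
females = rng.normal(61.9, 11.2, 6000)
males = rng.normal(66.0, 12.1, 6000)

res = stats.ttest_ind(females, males, equal_var=False)  # Welch's t-test
print(res.pvalue < 0.001)
```

With samples this large, a mean difference of about four points is detected with an extremely small p-value, consistent with the table above.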

```r
# -------------- Pretty Print Test Effect size --------------- #
tbl_test_effectsize %>%
  kbl(caption = "Summary of T-Test Effectsize for the Male and Female student Groups") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

 | d | CI | CI_low | CI_high | eta.sq |
---|---|---|---|---|---
MAT_S11 | -0.36 | 0.95 | -0.40 | -0.32 | 0.031 |
BIO_S11 | -0.27 | 0.95 | -0.30 | -0.23 | 0.017 |
ENG_S11 | -0.04 | 0.95 | -0.08 | -0.01 | 0.000 |
G_SC | -0.11 | 0.95 | -0.15 | -0.07 | 0.003 |
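These effect sizes can be reproduced by hand from the t statistic and degrees of freedom, using the standard conversions d = 2t / √df and η² = t² / (t² + df). A quick Python check against the MAT_S11 row of the t-test table:

```python
# Hand-check of the reported effect sizes from the t statistic and df, using
# d = 2t / sqrt(df) and eta^2 = t^2 / (t^2 + df). Inputs are read off the
# MAT_S11 row of the gender t-test table above.
import math

t, df = -19.0733, 11353.66
d = 2 * t / math.sqrt(df)
eta_sq = t ** 2 / (t ** 2 + df)
print(round(d, 2), round(eta_sq, 3))  # → -0.36 0.031
```

Both values match the effect-size table, confirming the small effect of gender on the maths score.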

#### 3.1.2.3 Step 3 Select and run the T-Test for School Nature

Based on the analysis above, it is appropriate to select a parametric independent-samples t-test to evaluate whether two distinct groups exist for Maths, Biology, Engineering and Final Score on the basis of school nature (public or private). As before, we will treat the variances as unequal, since Levene's test was statistically significant for several of these variables.

```r
############
# PART 2: T-Test for School Nature
############
# -------------- Conduct the t-test --------------- #
# Conduct the t-test from package stats.
# We use the var.equal = FALSE option to specify unequal variances (Welch).
tbl_test_result <- data.frame()
tbl_test_effectsize <- data.frame()
# Brittle code here: hardcoded variable-of-interest indices.
for (n in c(1, 4, 5, 6)) {
  variable <- colnames(tbl_sperf_sex_nat_diff)[n]
  test_result <- stats::t.test(tbl_sperf_sex_nat_diff[, n] ~ as.factor(tbl_sperf_sex_nat_diff$SCHOOL_NAT),
                               var.equal = FALSE) %>%
    broom::tidy() %>%
    as.data.frame()
  # Build output table
  row.names(test_result) <- variable
  tbl_test_result <- rbind(tbl_test_result, test_result)
  # ---------- Calculate Cohen's d effect size ---------- #
  effcd <- round(effectsize::t_to_d(t = test_result$statistic, test_result$parameter), 2)
  # ---------- Calculate eta-squared effect size ---------- #
  effes <- round((test_result$statistic * test_result$statistic) /
                   ((test_result$statistic * test_result$statistic) + test_result$parameter), 3)
  # Build output table
  tbl_merged_effectsize <- merge(effcd, effes)
  colnames(tbl_merged_effectsize)[5] <- 'eta.sq'  # merge() names this column 'y'
  row.names(tbl_merged_effectsize) <- variable
  tbl_test_effectsize <- rbind(tbl_test_effectsize, tbl_merged_effectsize)
}
# Tidy up column names
colnames(tbl_test_result)[1] <- 'mean.diff.est'
colnames(tbl_test_result)[2] <- paste('mean', levels(as.factor(tbl_sperf_sex_nat_diff$SCHOOL_NAT))[1], sep = ".")
colnames(tbl_test_result)[3] <- paste('mean', levels(as.factor(tbl_sperf_sex_nat_diff$SCHOOL_NAT))[2], sep = ".")
colnames(tbl_test_result)[6] <- 'df'
# The p-value is the probability - in this case every result was statistically
# significant at p < .05.
# -------------- Pretty Print Test statistics --------------- #
tbl_test_result %>%
  kbl(caption = "Summary of T-Test Statistics for the Private and Public student Groups") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

 | mean.diff.est | mean.PRIVATE | mean.PUBLIC | statistic | p.value | df | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|---|---
MAT_S11 | 5.44578 | 66.88591 | 61.44013 | 26.36858 | 0 | 12408.75 | 5.040958 | 5.850602 | Welch Two Sample t-test | two.sided |
BIO_S11 | 4.51852 | 66.07890 | 61.56038 | 23.06953 | 0 | 12363.82 | 4.134594 | 4.902447 | Welch Two Sample t-test | two.sided |
ENG_S11 | 11.21289 | 67.08271 | 55.86983 | 48.22888 | 0 | 11971.48 | 10.757162 | 11.668610 | Welch Two Sample t-test | two.sided |
G_SC | 11.68779 | 168.21584 | 156.52805 | 29.13527 | 0 | 12345.14 | 10.901460 | 12.474117 | Welch Two Sample t-test | two.sided |

```r
# -------------- Pretty Print Test Effect size --------------- #
tbl_test_effectsize %>%
  kbl(caption = "Summary of T-Test Effectsize for the Private and Public student Groups") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

 | d | CI | CI_low | CI_high | eta.sq |
---|---|---|---|---|---
MAT_S11 | 0.47 | 0.95 | 0.44 | 0.51 | 0.053 |
BIO_S11 | 0.41 | 0.95 | 0.38 | 0.45 | 0.041 |
ENG_S11 | 0.88 | 0.95 | 0.84 | 0.92 | 0.163 |
G_SC | 0.52 | 0.95 | 0.49 | 0.56 | 0.064 |
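The same hand-check applies to the largest effect in this comparison, ENG_S11 (public versus private schools), again using d = 2t / √df and η² = t² / (t² + df) with inputs read off the t-test table above:

```python
# Hand-check of the largest reported effect (ENG_S11, public vs private)
# from its t statistic and df, using d = 2t / sqrt(df) and
# eta^2 = t^2 / (t^2 + df).
import math

t, df = 48.22888, 11971.48
d = 2 * t / math.sqrt(df)
eta_sq = t ** 2 / (t ** 2 + df)
print(round(d, 2), round(eta_sq, 3))  # → 0.88 0.163
```

An η² of roughly 0.163 suggests school nature accounts for about 16% of the variance in the English score, a large effect by common conventions.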

#### 3.1.2.4 Reporting gender difference with Cohen’s D

An independent-samples t-test was conducted to compare high-school maths final scores for male and female students. A statistically significant difference was found (M = 65.96, SD = 12.06 for male students; M = 61.93, SD = 11.17 for female students; t(11353) = -19.07, p < .001). Cohen's d indicated a small effect size (d = -0.36).

An independent-samples t-test was conducted to compare high-school biology final scores for male and female students. A statistically significant difference was found (M = 65.10, SD = 11.42 for male students; M = 62.27, SD = 10.53 for female students; t(11383) = -14.22, p < .001). Cohen's d indicated a small effect size (d = -0.27).

An independent-samples t-test was conducted to compare high-school