Digital Portfolio – Independent Project


Introduction:

The purpose of this report is to address all statistical concepts comprehensively and demonstrate coherent understanding of statistical analysis through an independent project, as covered during the TU060 MATH9102 Module. For this part of the assignment we will be using a dataset of academic performance evolution for engineering students. The dataset (E. Delahoz-Dominguez et al., 2020) contains the results of national assessments for secondary and university education for engineering students, covering academic, social and economic information for 12,411 students. The data was collected as part of the Master’s Degree in Engineering project of the Technological University of Bolívar (UTB), titled Academic Efficiency Analysis in Engineering Students.

A full descriptor is available here. The dataset is available for download from Mendeley.

Option A Selected

We have selected Option A. The assignment is broken into different sections, each addressing different requirements. Option A includes the following learning outcomes, which are illustrated with references to the academic performance dataset provided:

  • State (a) research question(s) which can be investigated using statistical analysis.
  • Develop testable hypotheses by building a linear or a logistic regression model.
  • Derive and present appropriate statistical evidence.
  • Using either linear regression or logistic regression, build a baseline regression model with at least 2 predictors.
    • Assess its fit and usefulness.
    • Test the assumptions of the approach.
    • Illustrate findings using appropriate examples from the data.
    • Build at least one additional model which extends this baseline either adding or removing predictors relevant to your hypotheses.
      • Note: The model must have at least 2 predictors.
      • Assess its fit and usefulness.
      • Test the assumptions of the approach.
      • Illustrate our findings using appropriate examples from the data.
    • Compare the fit and usefulness of our successive models.
  • Note: At least one of our models should investigate a differential effect.

1 Research Question(s)

This section captures the main research question for this independent project. We’ve included the research goal and broken the research question down into three parts that can be validated through statistical methods.

1.1 Research Goal:

To gain a deeper understanding of the nature of the relationship between engineering student academic performance and socio-economic factors.

1.2 Main Research Questions:

Can an engineering student’s final university performance be predicted from a combination of past performance in national standardised tests in the final year of high school, the type of high school attended, and the student’s sex? In particular, is there a difference between private and public school students, and is there a differential effect for male and female students?

To answer this main research question, we also must look at a number of contributing descriptive, comparative and relational questions, which serve to explore specific aspects of the data.

1.2.1 Question 1

RQ: What is the relationship between student high school performance and university performance?

Hypothesis: Students who perform well overall in high school will perform well in engineering at university.

1.2.2 Question 2

RQ: What is the relationship between attending a private or public school and final educational achievement?

Hypothesis: Students who attend private high schools will perform better overall in engineering at university.

1.2.3 Question 3

RQ: What is the relationship between being male or female and final educational achievement?

Hypothesis: There will be a difference between male and female student performance.

2 Data understanding / Prepare

For this part of the assignment we will be using a dataset of academic performance evolution for engineering students. The dataset contains the results of national assessments for secondary and university education for engineering students, covering academic, social and economic information for 12,411 students. The sample contains 44 potential variables of interest.

2.1 Describe the Sample:

2.1.1 Import dataset

There are three files to import. The RMD file attached to this report automatically downloads these datasets from GitHub.

############
# PART: Import data
############
# Dataset 1: Import student performance dataset
uri <- 'https://ocarrolj.github.io/TU060_MATH9102_DATASET_HOSTING/data_academic_performance.xlsx'
download.file(uri, 'data_academic_performance_downloaded.xlsx', mode = 'wb')
tbl_sperf_all <- read_excel('data_academic_performance_downloaded.xlsx') %>% as.data.frame()
names(tbl_sperf_all)[42] <- 'SECOND_DECILE' # Rename field 42: '2ND_DECILE' is not a valid R name.

# Data descriptors 1: Import numeric variable descriptions
uri <- 'https://ocarrolj.github.io/TU060_MATH9102_DATASET_HOSTING/assignment_statistical_data_types_sheet1.csv'
tbl_sperf_desciption_numeric <- read.csv(uri)

# Data descriptors 2: Import categorical variable descriptions
uri <- 'https://ocarrolj.github.io/TU060_MATH9102_DATASET_HOSTING/assignment_statistical_data_types_sheet2.csv'
tbl_sperf_desciption_categorical <- read.csv(uri)

2.1.2 Describe the dataset

Table: Statistical Data Types – Numeric

| Attribute | Description | Statistical Data Type |
|-----------|-------------|-----------------------|
| MAT_S11 | Mathematics | Numeric: interval/ratio |
| CR_S11 | Critical Reading | Numeric: interval/ratio |
| CC_S11 | Citizen Competencies S11 | Numeric: interval/ratio |
| BIO_S11 | Biology | Numeric: interval/ratio |
| ENG_S11 | English | Numeric: interval/ratio |
| QR_PRO | Quantitative Reasoning | Numeric: interval/ratio |
| CR_PRO | Critical Reading | Numeric: interval/ratio |
| CC_PRO | Citizen Competencies SPRO | Numeric: interval/ratio |
| ENG_PRO | English | Numeric: interval/ratio |
| WC_PRO | Written Communication | Numeric: interval/ratio |
| FEP_PRO | Formulation of Engineering Projects | Numeric: interval/ratio |
| G_SC | Global Score | Numeric: interval/ratio |
| PERCENTILE | Percentile | Numeric: interval/ratio |
| 2ND_DECILE | Second Decile | Categorical: ordinal, values 1 to 5 |
| QUARTILE | Quartile | Categorical: ordinal, values 1 to 4 |
| SEL | Socioeconomic Level | Categorical: ordinal, values 1 to 4 |
| SEL_IHE | Socioeconomic Level of the Institution of Higher Education | Categorical: ordinal, values 1 to 4 |
Table: Statistical Data Types – Categorical

| Attribute | Description | Statistical Data Type |
|-----------|-------------|-----------------------|
| GENDER | Gender | Binary: categorical with values Female or Male |
| EDU_FATHER | Father’s education | Ordinal: categorical with 12 levels |
| EDU_MOTHER | Mother’s education | Ordinal: categorical with 12 levels |
| OCC_FATHER | Father’s occupation | Nominal: categorical with 13 values |
| OCC_MOTHER | Mother’s occupation | Nominal: categorical with 13 values |
| STRATUM | Stratum | Ordinal: categorical with 7 levels |
| SISBEN | Sisben | Ordinal: categorical with 6 levels |
| PEOPLE_HOUSE | People in the house | Ordinal: categorical with 13 levels |
| INTERNET | Internet | Binary: categorical with values Yes or No |
| TV | TV | Binary: categorical with values Yes or No |
| COMPUTER | Computer | Binary: categorical with values Yes or No |
| WASHING_MCH | Washing machine | Binary: categorical with values Yes or No |
| MIC_OVEN | Microwave oven | Binary: categorical with values Yes or No |
| CAR | Car | Binary: categorical with values Yes or No |
| DVD | DVD | Binary: categorical with values Yes or No |
| FRESH | Fresh | Binary: categorical with values Yes or No |
| PHONE | Phone | Binary: categorical with values Yes or No |
| MOBILE | Mobile | Binary: categorical with values Yes or No |
| REVENUE | Revenue | Ordinal: categorical with 3 levels |
| JOB | Job | Ordinal: categorical with 8 levels |
| SCHOOL_NAME | School name | Nominal: categorical with 4 values |
| SCHOOL_NAT | Nature of School | Nominal: categorical with 2 values |
| SCHOOL_TYPE | Type of School | Nominal: categorical with 4 values |
| COD_SPRO | Code Saber Pro | Nominal: unique student ID, not a statistical data type |
| UNIVERSITY | University | Nominal: categorical with 134 values |
| ACADEMIC_PROGRAM | Academic Program | Nominal: categorical with 23 values |
| COD_S11 | Code Saber 11 | Nominal: unique student ID, not a statistical data type |

2.2 Variables of interest to research question

Based on our research question, the concepts we are interested in are:

  • Student performance in high school (represented by 5 continuous numerical variables)
  • The nature of the high school attended (represented by 1 categorical variable – SCHOOL_NAT)
  • The sex of the student (represented by 1 binary nominal categorical variable – GENDER)
  • The final performance of the student (represented by the overall average score of the professional evaluation, G_SC)

2.2.1 Quantitative Data:

In the sample dataset there are 5 continuous numerical quantitative variables of interest related to student performance. The tables below contain statistics that describe the center point (Mean and Median), the shape (skew, kurtosis) and the variability (Standard Deviation, Variance). These statistics inform us about the overall shape and distribution of the variables under consideration. The shape of the distribution is important because it will help us during hypothesis testing to determine what sort of tests we can use; for instance, if we can use parametric tests or if we have to revert to the less statistically powerful non-parametric tests.

The first step in the process to understand our data is to validate normality of variables. To do this we generate summary statistics, histograms and Q-Q Plots for each of the variables. Should we discover that the variables are not ideally normal, we will need to quantify how far away from normal the data is by calculating the standard skew and kurtosis.
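The standardised skew check described above can be sketched in a few lines of base R. The `std_skew` helper and the toy vector below are illustrative, not part of the report's own code; the helper divides the moment coefficient of skewness by its large-sample standard error, sqrt(6/n), which is the value compared against the +/-1.96 cutoff.

```r
# Sketch (illustrative helper): standardised skewness = skew / SE(skew)
std_skew <- function(x) {
  n <- length(x)
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))     # population SD, as used in moment-based skewness
  skew <- mean((x - m)^3) / s^3  # moment coefficient of skewness
  se_skew <- sqrt(6 / n)         # large-sample approximation of SE(skew)
  skew / se_skew
}

std_skew(c(1, 2, 3, 4, 5))  # perfectly symmetric toy data: standardised skew is 0
```

An analogous helper for kurtosis would use SE(kurtosis) ~ sqrt(24/n); in practice packages such as semTools or psych report these quantities directly.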

Table: Summary statistics for Performance

| | median | mean | SE.mean | CI.mean.0.95 | var | std.dev | coef.var | std_skew | std_kurt | gt_2sd | gt_P99 | gt_3sd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MAT_S11 | 64 | 64.32076 | 0.1065813 | 0.2089158 | 140.9836 | 11.87365 | 0.1846006 | 18.17215 | 2.954064 | 4.447667 | 2.062686 | 0 |
| CR_S11 | 61 | 60.77842 | 0.0899951 | 0.1764044 | 100.5182 | 10.02588 | 0.1649578 | 9.74404 | 10.86751 | 5.575699 | 1.579244 | 0.3786963 |
| CC_S11 | 60 | 60.70518 | 0.0908447 | 0.1780697 | 102.4250 | 10.12052 | 0.1667160 | 15.78875 | 17.18771 | 5.487068 | 1.667875 | 0.4431553 |
| BIO_S11 | 64 | 63.95053 | 0.1001472 | 0.1963041 | 124.4757 | 11.15687 | 0.1744609 | 13.7976 | 6.800432 | 5.559584 | 1.490613 | 0.0322295 |
| ENG_S11 | 59 | 61.80106 | 0.1283409 | 0.2515681 | 204.4264 | 14.29778 | 0.2313516 | 27.60265 | -8.432497 | 5.648215 | 1.128032 | 0 |
| G_SC | 163 | 162.71050 | 0.2074642 | 0.4066620 | 534.1867 | 23.11248 | 0.1420466 | -4.339492 | -1.714035 | 4.060914 | 0.9265974 | 0.15309 |
At first inspection, the histograms appear to show approximately normal distributions for all variables, but the boxplots, Q-Q plots and summary statistics show that we may have an issue with skewness, kurtosis and outliers. For the standardised skewness and kurtosis scores we use the George, D., & Mallery, M. (2003) heuristic, which states that a standardised score (value/std. error) for skewness (and kurtosis) between +/-2 (1.96 rounded) is acceptable in order to assume a normal univariate distribution. None of the variables of interest meet this criterion; all have excessive skew and kurtosis. Since the distribution shape is not acceptable, we need to determine the proportion of each distribution made up of outliers.

We converted the raw scores for our variables into standardised scores. Using the empirical rule, if 95% of the data falls within +/-1.96 standard deviations (+/-2.58 if the sample size is greater than 80) we can treat the data as normal. In our case the sample size is far larger than 80 cases, so a case is considered an outlier only if its standardised score is at or beyond +/-3.29. For CR_S11, 0.37% of the data falls outside this range, and for CC_S11, 0.47% does; since our limit was 0.3%, we cannot consider them normally distributed.
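The outlier-proportion check can be sketched in base R as follows. The `scores` vector is synthetic, standing in for a variable such as CR_S11: 997 well-behaved values plus three planted extreme cases.

```r
# Synthetic data: 997 evenly spread values plus 3 planted extreme scores
scores <- c(seq(-3, 3, length.out = 997), 10, -10, 12)
z <- (scores - mean(scores)) / sd(scores)  # standardise the raw scores
prop_outliers <- mean(abs(z) > 3.29)       # proportion at or beyond +/-3.29
prop_outliers * 100                        # as a percentage of the sample
```

Only the three planted values exceed the +/-3.29 cutoff here, giving 0.3% of the sample, i.e. exactly on the limit used in the text.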

2.2.2 Categorical/Qualitative data:

In the sample dataset there are 26 categorical variables describing the social, economic and demographic attributes of students. For our analysis we are interested in Gender and School Nature. For categorical data, it doesn’t make sense to describe the data in terms of average value or standard deviation, since the numerical values are just an encoding and have no quantitative meaning. As such, we describe categorical data in terms of possible values and the frequency of occurrence of those values. Important summary statistics include the count of distinct values, a list of possible values, the relative proportion in which each value occurs, and the most frequently occurring value. The following table and figures describe the summary statistics for the categorical variables in the sample dataset.
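The frequency tables that follow appear to come from a frequency-table helper such as summarytools::freq; the same proportions can be checked in base R. The `gender` vector below is reconstructed from the counts in the sample purely for illustration.

```r
# Reconstruct the GENDER counts from the sample (5,043 F + 7,368 M = 12,411)
gender <- c(rep("F", 5043), rep("M", 7368))
tab <- table(gender)                 # frequency of each category
round(100 * prop.table(tab), 2)      # relative proportions: F 40.63%, M 59.37%
```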

## Frequencies  
## tbl_sperf_categorical_measurements$GENDER  
## Type: Character  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##           F    5043     40.63          40.63     40.63          40.63
##           M    7368     59.37         100.00     59.37         100.00
##        <NA>       0                               0.00         100.00
##       Total   12411    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$SCHOOL_NAT  
## Type: Character  
## 
##                  Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------- ------- --------- -------------- --------- --------------
##       PRIVATE    6565     52.90          52.90     52.90          52.90
##        PUBLIC    5846     47.10         100.00     47.10         100.00
##          <NA>       0                               0.00         100.00
##         Total   12411    100.00         100.00    100.00         100.00

2.3 Potential Issues and Shortcomings

2.3.1 Missing Data

Missing data can be treated as a special kind of irregular value: we have a variable, but some records are missing a value for it. It is common, particularly when dealing with data related to human beings, that not all variables will have values in all cases. In the engineering performance dataset there are variables with missing values; for example, approximately 1% of students did not indicate whether they had a job. For the variables of interest to our research question, all student records are complete, and as such missing data is not considered to be an issue.
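The per-variable missingness check is a one-liner in base R via `colMeans(is.na(...))`. The mini data frame below is hypothetical; `JOB` and `G_SC` merely stand in for columns of the real dataset.

```r
# Hypothetical mini data frame standing in for the full dataset
df <- data.frame(JOB = c("Yes", NA, "No", "Yes"),
                 G_SC = c(160, 170, NA, 150))
round(100 * colMeans(is.na(df)), 1)  # percentage of missing values per variable
```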

2.3.2 Representativeness

To make generalised statements about the results of national assessments for secondary and university education in the engineering student population, two necessary (though not sufficient) requirements are that the sample be big enough and representative. “Big enough” means that whatever we are interested in investigating as part of our statistical analysis can be found in our sample if it is present in the population. The sample has records from 12,411 students across 3,649 schools, 134 universities and 21 academic programs; the dataset is both broad and deep. “Representative” means that the characteristics of our sample mirror those of the population in the same proportions. For example, if we are interested in privately versus publicly educated students, then the fractions of students from private and public schools in our sample need to be in proportion to those prevailing in the wider engineering student population. Without an understanding of the general population it is difficult to assert that the sample is representative, but we can identify specific dimensions that are unlikely to be; for example, the number of responses per university varies significantly, from 1 to 639. As such, if we wished to perform statistical analysis between universities, or within the population of a given university, our analysis might not be representative.

For the variables of interest, critical reading at high-school level (CR_S11) and citizen competencies (CC_S11) appear to have a disproportionate number of students obtaining maximum scores. This might suggest an over-sampling of high achievers within the dataset. Similarly, many of the variables have no records at low scores. This is less surprising, since our population is high-school students who went on to attend and complete engineering courses at a university; students who failed high school or failed their university course would naturally not be represented in this dataset.

3 Results

3.1 Statistical Evidence

In this section we provide statistical evidence to justify the inclusion of variables in our hypothesis tests. The first step in hypothesis testing is to determine whether the variables are related to one another.

The first step is to collect data to see whether our hypotheses are accurate. In section 2 we already selected a number of variables for consideration. The next step is to investigate if our predictor variables are related to our outcome variable through correlation or if there is a differential effect for different groups. This will provide the justification for including these variables in our predictive models.

3.1.1 Correlation

One of the tests for correlation is the Pearson correlation, which requires that the variables be normally distributed and the relationship linear. The non-parametric Spearman or Kendall tests can be used when the data does not meet the criteria to be considered normal and the assumptions of Pearson are therefore violated. Spearman does, however, require a monotonic relationship.

3.1.1.1 Step 1 Check for Linearity of Covariance

Pearson correlation requires a linear relationship between the two variables, as Pearson uses the equation of a line to model the relationship. We validate this condition by inspecting a scatter plot, which should resemble a straight line rather than a curve. For homoscedasticity (linearity of covariance) we want the values to be evenly spread around the line in a rectangular band. The scatterplots for the normally distributed variables are tube-like in shape, indicating homoscedasticity.

############
# PART: Visualisation of Linearity of Co Variance
############
plots <- list()

#------------------- Generate Plots-------------------#
#Create scatter plots
gs <- tbl_sperf_continuous_scale %>% ggplot2::ggplot(aes(x = MAT_S11, y = G_SC))
gs <- gs +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", se = F) +
  labs(x = "Highschool Mathematics score (MAT_S11)", y = "Final University Global Score (G_SC)")
plots[["math"]] <- gs

gs <- tbl_sperf_continuous_scale %>% ggplot2::ggplot(aes(x = BIO_S11, y = G_SC))
gs <- gs +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", se = F) +
  labs(x = "Highschool Biology score (BIO_S11)", y = "Final University Global Score (G_SC)")
plots[["bio"]] <- gs

gs <- tbl_sperf_continuous_scale %>% ggplot2::ggplot(aes(x = ENG_S11, y = G_SC))
gs <- gs +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", se = F) +
  labs(x = "Highschool English score (ENG_S11)", y = "Final University Global Score (G_SC)")
plots[["eng"]] <- gs

plot_grid(plotlist = plots, labels = "", ncol = 3)

3.1.1.2 Step 2 Run the Pearson correlation test

############
# PART: Correlation test
############
#Pearson Correlation
tbl_correlation_stats <- stats::cor.test(tbl_sperf_continuous_scale$MAT_S11, tbl_sperf_continuous_scale$G_SC, method = 'pearson')
show(tbl_correlation_stats)
## 
##  Pearson's product-moment correlation
## 
## data:  tbl_sperf_continuous_scale$MAT_S11 and tbl_sperf_continuous_scale$G_SC
## t = 93.733, df = 12409, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6334194 0.6540231
## sample estimates:
##       cor 
## 0.6438379

tbl_correlation_stats <- stats::cor.test(tbl_sperf_continuous_scale$BIO_S11, tbl_sperf_continuous_scale$G_SC, method = 'pearson')
show(tbl_correlation_stats)
## 
##  Pearson's product-moment correlation
## 
## data:  tbl_sperf_continuous_scale$BIO_S11 and tbl_sperf_continuous_scale$G_SC
## t = 99.627, df = 12409, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6567441 0.6762966
## sample estimates:
##      cor 
## 0.666635

tbl_correlation_stats <- stats::cor.test(tbl_sperf_continuous_scale$ENG_S11, tbl_sperf_continuous_scale$G_SC, method = 'pearson')
show(tbl_correlation_stats)
## 
##  Pearson's product-moment correlation
## 
## data:  tbl_sperf_continuous_scale$ENG_S11 and tbl_sperf_continuous_scale$G_SC
## t = 98.435, df = 12409, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6521731 0.6719345
## sample estimates:
##       cor 
## 0.6621689

3.1.1.3 Step 3 Check for a monotonic relationship

Spearman correlation requires a monotonic relationship between the two variables. The graphs below show that this condition is met.

############
# PART: Visualisation of non-normally distributed variables
############
plots <- list()

gs <- tbl_sperf_continuous_scale %>% ggplot2::ggplot(aes(x = CR_S11, y = G_SC))
gs <- gs +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", se = F) +
  labs(x = "Highschool Critical Reading score (CR_S11)", y = "Final University Global Score (G_SC)")
plots[["CR"]] <- gs

gs <- tbl_sperf_continuous_scale %>% ggplot2::ggplot(aes(x = CC_S11, y = G_SC))
gs <- gs +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", se = F) +
  labs(x = "Highschool Citizen Competencies score (CC_S11)", y = "Final University Global Score (G_SC)")
plots[["CC"]] <- gs


plot_grid(plotlist = plots, labels = "", ncol = 2)

3.1.1.4 Step 4 Run the Spearman correlation test

############
# PART: Correlation test
############
#Spearman Correlation
tbl_correlation_stats <- stats::cor.test(tbl_sperf_continuous_scale$CR_S11, tbl_sperf_continuous_scale$G_SC, method = 'spearman')
show(tbl_correlation_stats)
## 
##  Spearman's rank correlation rho
## 
## data:  tbl_sperf_continuous_scale$CR_S11 and tbl_sperf_continuous_scale$G_SC
## S = 1.0562e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.6685184

tbl_correlation_stats <- stats::cor.test(tbl_sperf_continuous_scale$CC_S11, tbl_sperf_continuous_scale$G_SC, method = 'spearman')
show(tbl_correlation_stats)
## 
##  Spearman's rank correlation rho
## 
## data:  tbl_sperf_continuous_scale$CC_S11 and tbl_sperf_continuous_scale$G_SC
## S = 1.1158e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.6498049

3.1.1.5 Reporting Correlation

The relationship between high school maths score (MAT_S11) and final university performance (G_SC) was investigated using a Pearson Correlation. A strong positive correlation was found (r = .644, n = 12411, p < .001). There is therefore evidence to reject the null hypothesis and accept the alternative hypothesis that there is a relationship between high school maths score and final university performance.

The relationship between high school biology score (BIO_S11) and final university performance (G_SC) was investigated using a Pearson Correlation. A strong positive correlation was found (r = .667, n = 12411, p < .001). There is therefore evidence to reject the null hypothesis and accept the alternative hypothesis that there is a relationship between high school biology score and final university performance.

The relationship between high school English score (ENG_S11) and final university performance (G_SC) was investigated using a Pearson Correlation. A strong positive correlation was found (r = .662, n = 12411, p < .001). There is therefore evidence to reject the null hypothesis and accept the alternative hypothesis that there is a relationship between high school English score and final university performance.

The relationship between high school critical reading score (CR_S11) and final university performance (G_SC) was investigated using a Spearman Correlation. A strong positive correlation was found (rho = .669, p < .001). There is therefore evidence to reject the null hypothesis and accept the alternative hypothesis that there is a relationship between high school critical reading score and final university performance.

The relationship between high school citizen competencies score (CC_S11) and final university performance (G_SC) was investigated using a Spearman Correlation. A strong positive correlation was found (rho = .650, p < .001). There is therefore evidence to reject the null hypothesis and accept the alternative hypothesis that there is a relationship between high school citizen competencies score and final university performance.
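To put the reported coefficients in context, squaring each correlation gives the approximate share of variance in G_SC associated with each predictor (the coefficient of determination). The values below are simply the correlations reported above:

```r
# Correlations reported above (Pearson r or Spearman rho, as appropriate)
r <- c(MAT_S11 = .644, BIO_S11 = .667, ENG_S11 = .662, CR_S11 = .669, CC_S11 = .650)
round(r^2, 2)  # e.g. MAT_S11 shares roughly 41% of its variance with G_SC
```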

3.1.2 Difference tests

Before we can build a predictive model, we need to work out which groupings to include in our hypothesis testing. The choice of test we perform to investigate whether differential effects exist depends on the measurement level of the variable and the shape of the data. If our data is at the interval level and normally distributed, then we can use the independent samples t-test, provided the observations are independent. If our data is ordinal, or is continuous scale data that does not conform to the normal distribution, then we must use a non-parametric test, such as the Mann-Whitney U-test.
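The two candidate tests can be sketched in base R as follows; the two groups here are simulated purely for illustration, not drawn from the dataset.

```r
# Simulated two-group comparison (illustrative data only)
set.seed(42)
g1 <- rnorm(50, mean = 10)  # hypothetical interval-level scores, group 1
g2 <- rnorm(50, mean = 11)  # group 2

t.test(g1, g2)       # independent samples t-test (Welch correction by default)
wilcox.test(g1, g2)  # Mann-Whitney U-test for the non-parametric case
```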

We will be using a number of heuristics to justify our assessment of differential effects. A benchmark suggested by Cohen (1988) is to consider an effect size d as small (d = 0.2), medium (d = 0.5), or large (d = 0.8) for the independent samples t-test.
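Cohen's d is not returned by `t.test` directly; a minimal sketch of the pooled-SD formulation, with a toy example, might look as follows (the `cohens_d` helper is illustrative, not the report's code):

```r
# Cohen's d for two independent groups using the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x); ny <- length(y)
  sp <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))  # pooled SD
  (mean(x) - mean(y)) / sp
}

cohens_d(c(10, 12, 14), c(8, 10, 12))  # means one pooled SD apart: d = 1 (large)
```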

3.1.2.1 Step 1 Check for Homogeneity of Variance for different groups

Homogeneity of variance means that the variance of the variable around the mean is the same for each group. The t-test is a robust test and does not strictly require homogeneity of variance between the two groups, but if we can assume homogeneity of variance we increase the power of the t-test.

############
# PART: Homoscedasticity
############
# Create a subset dataframe with just the variables of interest.
tbl_sperf_sex_nat_diff <- tbl_sperf_all %>% dplyr::select(MAT_S11, CR_S11, CC_S11, BIO_S11, ENG_S11, G_SC, GENDER, SCHOOL_NAT)
# -------------- Box Plot --------------- #
# A quick eyeball test of variance and mean to cross-check with Levene's test
tbl_sperf_sex_nat_diff %>%
  gather(MAT_S11, CR_S11, CC_S11, BIO_S11, ENG_S11, G_SC, key = "var", value = "value") %>%
  ggplot(aes(x = var, y = value, fill = GENDER)) +
  geom_boxplot() +
  theme_bw() +
  labs(y = "Grades", x = "Performance Variables", title = "Box Plots to eye ball variance", subtitle = "Difference testing: Male and Female")

# -------------- Box Plot --------------- #
# A quick eyeball test of variance and mean to cross-check with Levene's test
tbl_sperf_sex_nat_diff %>%
  gather(MAT_S11, CR_S11, CC_S11, BIO_S11, ENG_S11, G_SC, key = "var", value = "value") %>%
  ggplot(aes(x = var, y = value, fill = SCHOOL_NAT)) +
  geom_boxplot() +
  theme_bw() +
  labs(y = "Grades", x = "Performance Variables", title = "Box Plots to eye ball variance", subtitle = "Difference testing: Public and Private Highschools")

# -------------- Create summary statistics --------------- #
tbl_sperf_sex_diff_stats <- tbl_sperf_sex_nat_diff %>%
  dplyr::select(-(SCHOOL_NAT)) %>%
  psych::describeBy(tbl_sperf_sex_nat_diff$GENDER, mat = TRUE) %>%
  filter(!is.na(skew)) # removes categorical variables.

tbl_sperf_nat_diff_stats <- tbl_sperf_sex_nat_diff %>%
  dplyr::select(-(GENDER)) %>%
  psych::describeBy(tbl_sperf_sex_nat_diff$SCHOOL_NAT, mat = TRUE) %>%
  filter(!is.na(skew))

# Pretty print table
tbl_sperf_sex_diff_stats %>%
  kbl(caption = "Summary statistics for Performance by student Sex") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Table: Summary statistics for Performance by student Sex
item group1 vars n mean sd median trimmed mad min max range skew kurtosis se
MAT_S111 1 F 1 5043 61.93238 11.169655 61 61.48699 11.8608 31 100 69 0.4310270 0.2484840 0.1572879
MAT_S112 2 M 1 7368 65.95548 12.063491 65 65.51357 11.8608 26 100 74 0.3522215 0.0572731 0.1405394
CR_S111 3 F 2 5043 60.89530 9.873563 61 60.74374 8.8956 28 100 72 0.2371788 0.5148966 0.1390367
CR_S112 4 M 2 7368 60.69843 10.128695 61 60.54664 10.3782 24 100 76 0.2007250 0.4481850 0.1179991
CC_S111 5 F 3 5043 59.95994 9.519411 60 59.73903 8.8956 29 100 71 0.3059556 0.5966242 0.1340496
CC_S112 6 M 3 7368 61.21526 10.482293 61 60.90655 8.8956 0 100 100 0.3438518 0.7672250 0.1221185
BIO_S111 7 F 4 5043 62.26829 10.529873 62 61.99232 10.3782 11 100 89 0.2912662 0.4374554 0.1482787
BIO_S112 8 M 4 7368 65.10193 11.425222 65 64.81411 10.3782 20 100 80 0.2752888 0.1978239 0.1331036
ENG_S111 9 F 5 5043 61.45112 13.788398 59 60.42082 13.3434 26 100 74 0.6377273 -0.2513956 0.1941643
ENG_S112 10 M 5 7368 62.04058 14.632306 59 61.00390 14.8260 30 100 70 0.5828105 -0.4540746 0.1704661
G_SC1 11 F 6 5043 161.27821 22.332939 162 161.49938 23.7216 76 242 166 -0.0721601 -0.2471797 0.3144861
G_SC2 12 M 6 7368 163.69083 23.582616 164 163.95607 25.2042 37 247 210 -0.1221049 0.0144190 0.2747370

# Pretty print table
tbl_sperf_nat_diff_stats %>%
  kbl(caption = "Summary statistics for Performance by student highschool nature") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Table: Summary statistics for Performance by student highschool nature
item group1 vars n mean sd median trimmed mad min max range skew kurtosis se
MAT_S111 1 PRIVATE 1 6565 66.88591 12.144541 66 66.47973 11.8608 31 100 69 0.3153056 -0.0142747 0.1498869
MAT_S112 2 PUBLIC 1 5846 61.44013 10.863289 60 61.01603 10.3782 26 100 74 0.4177939 0.3192681 0.1420797
CR_S111 3 PRIVATE 2 6565 62.57944 9.875766 62 62.45555 8.8956 24 100 76 0.1946699 0.6232599 0.1218859
CR_S112 4 PUBLIC 2 5846 58.75590 9.805907 58 58.55772 8.8956 26 100 74 0.2556649 0.4378005 0.1282503
CC_S111 5 PRIVATE 3 6565 62.31988 10.092052 62 62.01999 8.8956 29 100 71 0.3640218 0.7101923 0.1245553
CC_S112 6 PUBLIC 3 5846 58.89189 9.842347 58 58.62227 8.8956 0 100 100 0.3332656 0.8651224 0.1287269
BIO_S111 7 PRIVATE 4 6565 66.07890 11.208219 65 65.78736 10.3782 11 100 89 0.2781122 0.3189615 0.1383309
BIO_S112 8 PUBLIC 4 5846 61.56038 10.602146 61 61.27747 10.3782 20 100 80 0.3041901 0.2957655 0.1386642
ENG_S111 9 PRIVATE 5 6565 67.08271 14.885313 67 66.56387 17.7912 31 100 69 0.2363625 -0.8065631 0.1837133
ENG_S112 10 PUBLIC 5 5846 55.86983 10.894422 54 54.88392 8.8956 26 100 74 0.8525358 0.6290913 0.1424869
G_SC1 11 PRIVATE 6 6565 168.21584 22.823660 170 168.73786 23.7216 37 247 210 -0.2040610 0.0242467 0.2816877
G_SC2 12 PUBLIC 6 5846 156.52805 21.838175 157 156.56670 22.2390 72 228 156 -0.0521594 -0.0191213 0.2856189

############
# PART: Levene test
############
# -------------- Levene's test --------------- #
# Conduct Levene's test for homogeneity of variance (library car) - the null hypothesis is that variances
# in the groups are equal, so to assume homogeneity we would expect the probability to not be
# statistically significant.
# Part 1: Iterate through variables with median for Gender
variable_count <- 6
result <- list()
for (n in 1:variable_count) {
  variable <- colnames(tbl_sperf_sex_nat_diff)[n]
  result[[variable]] <- car::leveneTest(y = tbl_sperf_sex_nat_diff[, n], group = as.factor(tbl_sperf_sex_nat_diff$GENDER), center = median)
}

# Pr(>F) is the probability - here it is statistically significant, so we cannot assume
# homogeneity of variance.
print.listof(result)
## MAT_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  30.904 2.767e-08 ***
##       12409                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## CR_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  30.904 2.767e-08 ***
##       12409                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## CC_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  30.904 2.767e-08 ***
##       12409                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## BIO_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  30.904 2.767e-08 ***
##       12409                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ENG_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  30.904 2.767e-08 ***
##       12409                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## G_SC :
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  30.904 2.767e-08 ***
##       12409                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Part 2: Iterate through variables with median for School Nature
variable_count <- 6
result <- list()
for (n in 1:variable_count) {
  variable <- colnames(tbl_sperf_sex_nat_diff)[n]
  result[[variable]] <- car::leveneTest(y = tbl_sperf_sex_nat_diff[, n],
                                        group = as.factor(tbl_sperf_sex_nat_diff$SCHOOL_NAT),
                                        center = median)
}
# Pr(>F) is your probability - here it is statistically significant for MAT_S11, BIO_S11, ENG_S11 and G_SC,
# but not for CR_S11 or CC_S11, so homogeneity of variance cannot be assumed for most variables.
# Using the mean as the center gave the same conclusions as the median, which is a good sign.
print.listof(result)
## MAT_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  74.954 < 2.2e-16 ***
##       12409                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## CR_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value Pr(>F)
## group     1  0.0262 0.8713
##       12409               
## 
## CC_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value Pr(>F)
## group     1  1.9286 0.1649
##       12409               
## 
## BIO_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  12.464 0.0004163 ***
##       12409                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ENG_S11 :
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  936.77 < 2.2e-16 ***
##       12409                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## G_SC :
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  12.614 0.0003843 ***
##       12409                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
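The mechanics behind `car::leveneTest` with `center = median` (the Brown-Forsythe variant) are a one-way ANOVA on the absolute deviations of each observation from its group median. A minimal pure-Python sketch using two made-up groups (not data from this study):

```python
from statistics import median, mean

def brown_forsythe_F(groups):
    """Levene's test statistic with center = median (Brown-Forsythe):
    a one-way ANOVA F statistic computed on the absolute deviations
    of each observation from its group median."""
    # Absolute deviations from each group's median
    z = [[abs(x - median(g)) for x in g] for g in groups]
    k = len(z)                       # number of groups
    n = sum(len(g) for g in z)       # total observations
    grand = mean([x for g in z for x in g])
    # Between-group and within-group sums of squares on the deviations
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in z)
    ss_within = sum((x - mean(g)) ** 2 for g in z for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Two toy groups with visibly different spread
F = brown_forsythe_F([[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]])
print(round(F, 3))  # → 8.249
```

Comparing this statistic against the F distribution with (k - 1, n - k) degrees of freedom yields the Pr(>F) column reported in the output above.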

3.1.2.2 Step 2 Select and run the T-Test for Gender

Based on the analysis above, it is appropriate to select a parametric independent-samples t-test to evaluate whether male and female students form two distinct groups on the Maths, Biology, Engineering and Final Score variables. For each difference test we will treat the variances as unequal (Welch's t-test), since a statistically significant difference in variance was found when Levene's test was run.

############
# PART 1: T-Test for Gender
############
# -------------- Conduct the t-test --------------- #
#Conduct the t-test from package stats
#In this case we use the var.equal = FALSE option to specify unequal variances (Welch's t-test).

tbl_test_result <- data.frame()
tbl_test_effectsize <- data.frame()

# Brittle code here, hardcoded variable of interest index.
for (n in c(1, 4, 5, 6)) {
  ## start
  variable    <- colnames(tbl_sperf_sex_nat_diff)[n]
  test_result <- stats::t.test(tbl_sperf_sex_nat_diff[, n] ~ as.factor(tbl_sperf_sex_nat_diff$GENDER), var.equal = FALSE) %>%
          broom::tidy() %>%
          as.data.frame()

  # Build output table
  row.names(test_result) <- variable
  tbl_test_result        <- rbind(tbl_test_result, test_result)

  #---------- Calculate Cohens D Effect size ---------- #
  effcd <- round(effectsize::t_to_d(t = test_result$statistic, test_result$parameter), 2)

  #---------- Calculate Eta Squared Effect size ---------- #
  #Eta squared calculation
  effes <- round((test_result$statistic * test_result$statistic) / ((test_result$statistic * test_result$statistic) + (test_result$parameter)), 3)

  # Build output table (merge() performs a cross join, naming the eta-squared value 'y')
  tbl_merged_effectsize            <- merge(effcd, effes)
  row.names(tbl_merged_effectsize) <- variable
  tbl_test_effectsize              <- rbind(tbl_test_effectsize, tbl_merged_effectsize)

}

## tidy up column names
colnames(tbl_test_result)[1] <- 'mean.diff.est'
colnames(tbl_test_result)[2] <- paste('mean', levels(as.factor(tbl_sperf_sex_nat_diff$GENDER))[1], sep = ".")
colnames(tbl_test_result)[3] <- paste('mean', levels(as.factor(tbl_sperf_sex_nat_diff$GENDER))[2], sep = ".")
colnames(tbl_test_result)[6] <- 'df'
#P-value is your probability - in this case every result was statistically significant at p < .05

# -------------- Pretty Print Test statistics --------------- #
tbl_test_result %>%
        kbl(caption = "Summary of T-Test Statistics for the Male and Female student Groups") %>%
        kable_styling(bootstrap_options = c("striped", "hover"))
Table: Summary of T-Test Statistics for the Male and Female student Groups

|         | mean.diff.est | mean.F | mean.M | statistic | p.value | df | conf.low | conf.high | method | alternative |
|---------|--------------:|-------:|-------:|----------:|--------:|---:|---------:|----------:|--------|-------------|
| MAT_S11 | -4.0231017 | 61.93238 | 65.95548 | -19.073300 | 0.0000000 | 11353.66 | -4.436558 | -3.6096454 | Welch Two Sample t-test | two.sided |
| BIO_S11 | -2.8336346 | 62.26829 | 65.10193 | -14.221044 | 0.0000000 | 11382.97 | -3.224212 | -2.4430576 | Welch Two Sample t-test | two.sided |
| ENG_S11 | -0.5894605 | 61.45112 | 62.04058 | -2.281401 | 0.0225434 | 11239.88 | -1.095924 | -0.0829972 | Welch Two Sample t-test | two.sided |
| G_SC | -2.4126178 | 161.27821 | 163.69083 | -5.777472 | 0.0000000 | 11207.16 | -3.231169 | -1.5940669 | Welch Two Sample t-test | two.sided |

# -------------- Pretty Print Test Effect size --------------- #
tbl_test_effectsize %>%
        kbl(caption = "Summary of T-Test Effectsize for the Male and Female student Groups") %>%
        kable_styling(bootstrap_options = c("striped", "hover"))
Table: Summary of T-Test Effectsize for the Male and Female student Groups

|         | d | CI | CI_low | CI_high | y (eta squared) |
|---------|--:|---:|-------:|--------:|----------------:|
| MAT_S11 | -0.36 | 0.95 | -0.40 | -0.32 | 0.031 |
| BIO_S11 | -0.27 | 0.95 | -0.30 | -0.23 | 0.017 |
| ENG_S11 | -0.04 | 0.95 | -0.08 | -0.01 | 0.000 |
| G_SC | -0.11 | 0.95 | -0.15 | -0.07 | 0.003 |
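The two effect sizes above are simple transformations of the t statistic and its degrees of freedom: for a between-groups design, `effectsize::t_to_d` uses d = 2t/√df, and eta squared is t²/(t² + df). A quick pure-Python check against the MAT_S11 row reported above:

```python
from math import sqrt

def t_to_d(t, df):
    # Cohen's d from a between-groups t statistic: d = 2t / sqrt(df)
    return 2 * t / sqrt(df)

def t_to_eta_sq(t, df):
    # Eta squared from t: t^2 / (t^2 + df)
    return t * t / (t * t + df)

# MAT_S11 row: t = -19.0733, df = 11353.66
t, df = -19.0733, 11353.66
print(round(t_to_d(t, df), 2))       # → -0.36
print(round(t_to_eta_sq(t, df), 3))  # → 0.031
```

Both values reproduce the d and eta-squared entries in the effect-size table.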

3.1.2.3 Step 3 Select and run the T-Test for School Nature

Based on the analysis above, it is appropriate to select a parametric independent-samples t-test to evaluate whether students from private and public schools form two distinct groups on the Maths, Biology, Engineering and Final Score variables. For each difference test we will again treat the variances as unequal (Welch's t-test), since Levene's test indicated statistically significant differences in variance for most of these variables.
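Welch's t-test needs only each group's mean, standard deviation and size, so its statistic and degrees of freedom can be reproduced directly from the ENG_S11 descriptives reported earlier (private: n = 6565, M = 67.08, SD = 14.89; public: n = 5846, M = 55.87, SD = 10.89). A pure-Python sketch of the calculation:

```python
from math import sqrt

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's two-sample t statistic and Welch-Satterthwaite
    degrees of freedom, computed from summary statistics."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2   # squared standard error per group
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# ENG_S11 summary stats: private (n = 6565) vs public (n = 5846) schools
t, df = welch_t(67.08271, 14.885313, 6565, 55.86983, 10.894422, 5846)
print(round(t, 2))   # → 48.23
print(round(df, 1))  # df ≈ 11971.5, matching the 11971.48 from stats::t.test below
```

This matches the t statistic and df that `stats::t.test(..., var.equal = FALSE)` reports for ENG_S11 (small discrepancies come from the rounded descriptives).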

############
# PART 2: T-Test for School Nature
############
# -------------- Conduct the t-test --------------- #
#Conduct the t-test from package stats
#In this case we use the var.equal = FALSE option to specify unequal variances (Welch's t-test).

tbl_test_result <- data.frame()
tbl_test_effectsize <- data.frame()

# Brittle code here, hardcoded variable of interest index.
for (n in c(1, 4, 5, 6)) {
  ## start
  variable    <- colnames(tbl_sperf_sex_nat_diff)[n]
  test_result <- stats::t.test(tbl_sperf_sex_nat_diff[, n] ~ as.factor(tbl_sperf_sex_nat_diff$SCHOOL_NAT), var.equal = FALSE) %>%
          broom::tidy() %>%
          as.data.frame()

  # Build output table
  row.names(test_result) <- variable
  tbl_test_result        <- rbind(tbl_test_result, test_result)

  #---------- Calculate Cohens D Effect size ---------- #
  effcd <- round(effectsize::t_to_d(t = test_result$statistic, test_result$parameter), 2)

  #---------- Calculate Eta Squared Effect size ---------- #
  #Eta squared calculation
  effes <- round((test_result$statistic * test_result$statistic) / ((test_result$statistic * test_result$statistic) + (test_result$parameter)), 3)

  # Build output table (merge() performs a cross join, naming the eta-squared value 'y')
  tbl_merged_effectsize            <- merge(effcd, effes)
  row.names(tbl_merged_effectsize) <- variable
  tbl_test_effectsize              <- rbind(tbl_test_effectsize, tbl_merged_effectsize)

}

## tidy up column names
colnames(tbl_test_result)[1] <- 'mean.diff.est'
colnames(tbl_test_result)[2] <- paste('mean', levels(as.factor(tbl_sperf_sex_nat_diff$SCHOOL_NAT))[1], sep = ".")
colnames(tbl_test_result)[3] <- paste('mean', levels(as.factor(tbl_sperf_sex_nat_diff$SCHOOL_NAT))[2], sep = ".")
colnames(tbl_test_result)[6] <- 'df'
#P-value is your probability - in this case every result was statistically significant at p < .05

# -------------- Pretty Print Test statistics --------------- #
tbl_test_result %>%
        kbl(caption = "Summary of T-Test Statistics for the Private and Public student Groups") %>%
        kable_styling(bootstrap_options = c("striped", "hover"))
Table: Summary of T-Test Statistics for the Private and Public student Groups

|         | mean.diff.est | mean.PRIVATE | mean.PUBLIC | statistic | p.value | df | conf.low | conf.high | method | alternative |
|---------|--------------:|-------------:|------------:|----------:|--------:|---:|---------:|----------:|--------|-------------|
| MAT_S11 | 5.44578 | 66.88591 | 61.44013 | 26.36858 | 0 | 12408.75 | 5.040958 | 5.850602 | Welch Two Sample t-test | two.sided |
| BIO_S11 | 4.51852 | 66.07890 | 61.56038 | 23.06953 | 0 | 12363.82 | 4.134594 | 4.902447 | Welch Two Sample t-test | two.sided |
| ENG_S11 | 11.21289 | 67.08271 | 55.86983 | 48.22888 | 0 | 11971.48 | 10.757162 | 11.668610 | Welch Two Sample t-test | two.sided |
| G_SC | 11.68779 | 168.21584 | 156.52805 | 29.13527 | 0 | 12345.14 | 10.901460 | 12.474117 | Welch Two Sample t-test | two.sided |

# -------------- Pretty Print Test Effect size --------------- #
tbl_test_effectsize %>%
        kbl(caption = "Summary of T-Test Effectsize for the Private and Public student Groups") %>%
        kable_styling(bootstrap_options = c("striped", "hover"))
Table: Summary of T-Test Effectsize for the Private and Public student Groups

|         | d | CI | CI_low | CI_high | y (eta squared) |
|---------|--:|---:|-------:|--------:|----------------:|
| MAT_S11 | 0.47 | 0.95 | 0.44 | 0.51 | 0.053 |
| BIO_S11 | 0.41 | 0.95 | 0.38 | 0.45 | 0.041 |
| ENG_S11 | 0.88 | 0.95 | 0.84 | 0.92 | 0.163 |
| G_SC | 0.52 | 0.95 | 0.49 | 0.56 | 0.064 |
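For interpretation, Cohen's conventional benchmarks treat |d| around 0.2 as small, 0.5 as medium and 0.8 as large; by that yardstick the school-nature effect on ENG_S11 (d = 0.88) is large, while the effects on the other variables are small to medium. A trivial helper, assuming those standard cut-offs:

```python
def cohen_label(d):
    """Classify an effect size using Cohen's conventional cut-offs
    (0.2 small, 0.5 medium, 0.8 large)."""
    m = abs(d)
    if m < 0.2:
        return "negligible"
    if m < 0.5:
        return "small"
    if m < 0.8:
        return "medium"
    return "large"

# d values for MAT_S11, BIO_S11, ENG_S11, G_SC from the table above
print([cohen_label(d) for d in (0.47, 0.41, 0.88, 0.52)])
# → ['small', 'small', 'large', 'medium']
```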

3.1.2.4 Reporting gender difference with Cohen’s D

An independent-samples t-test was conducted to compare high school Maths final scores for male and female students. A statistically significant difference was found (M = 65.96, SD = 12.06 for male students; M = 61.93, SD = 11.17 for female students; t(11354) = -19.07, p < .001). Cohen's d indicated a small effect size (d = -0.36).

An independent-samples t-test was conducted to compare high school Biology final scores for male and female students. A statistically significant difference was found (M = 65.10, SD = 11.42 for male students; M = 62.27, SD = 10.53 for female students; t(11383) = -14.22, p < .001). Cohen's d indicated a small effect size (d = -0.27).

An independent-samples t-test was conducted to compare high school