Digital Portfolio – Explore & Analyse Part 1

Reading Time: 12 minutes

Introduction:

The purpose of this section is to comprehensively address all required concepts and demonstrate a coherent understanding of the statistical analysis exploration phase, as covered during the TU060 MATH9102 module. In the preparation phase we came up with a question that needed to be answered and generated a number of hypotheses that could be tested through statistical analysis. In this phase we will explore data that has already been collected and identify candidate variables for testing our hypotheses. The steps for hypothesis testing are:

  • Determine whether variables are related to one another.
  • Investigate if differential effects exist for different groups.
  • Provide statistical evidence that justifies the inclusion of these variables in a predictive model.
The first step is to collect data to see whether our hypotheses are accurate. To do this, one or more variables need to be identified and appropriate statistical measurements need to be calculated, as set out in the following steps:
  • Selection, with justification, of the appropriate variables to investigate the theory identified in the preparation phase.
  • Description of statistical measures associated with variables under consideration.
  • Selection of appropriate assessments/tests to investigate issues with statistical measures.
  • Interpretation of findings and generation of appropriate conclusions.
Note: This chapter will only look at determining whether variables are related to one another through correlation, as a justification for inclusion in a predictive model. In the next section we will provide exploration and analysis to determine whether differential effects exist for different groups and provide statistical evidence that justifies causal relationships.

Import dataset

There are two datasets that must be imported:

############
# PART: Import data
############
# Dataset 1: Import sperformance-dataset
tbl_sperf_all <- read.csv('sperformance-dataset.csv', header = TRUE)
names(tbl_sperf_all)[1] <- 'School' # Fix issue with the name of the first field.

# Dataset 2: Import sperformance-dataset variable description, created by me
tbl_sperf_description_all <- read.csv('TU060_MATH9102_Student_variables_description.csv', header = TRUE)

Hypothesis testing – Correlation:

A general description of all variables in the dataset was provided as part of the preparation phase. Our goal here is to accurately summarize and describe the data, but before we can build a predictive model we need to work out what predictor variables to include in our hypothesis testing. We cannot include all variables in our tests because it would be difficult to see which variables truly influence our outcomes. Our starting point for the exploration is to answer the question: what type of co-variation occurs between variables? This will provide the justification for variable selection in hypothesis testing.

In the general linear statistical model, the theory is that the concepts of interest, as measured by their variables, are related to each other in a linear fashion. This means that when one variable increases the other increases or decreases proportionally. The null hypothesis is that any patterns in the data are random; the alternative hypothesis is that the variables have a linear relationship. Using the general linear statistical model, we use the equation of a line as our model of the pattern in the relationship between the two variables (bivariate correlation). For assessing this co-variation, the questions we need to answer for each pair of variables are:

  • Can the relationship between the two variables be modeled as a straight line?
  • What is the direction of the co-variation (positive or negative)?
  • What is the strength of the co-variation (weak, moderate, or strong)?
  • What is the statistical significance (likelihood the relationship we observe is occurring due to chance)?
The relationship between two variables is quantified by a statistical measure called the correlation coefficient (-1 to +1). From that we can calculate the shared variance (the square of the coefficient), which is how much of the variation in each variable is common to both. The direction of the correlation is given by the sign of the coefficient, which matches the sign of the slope of the line, and its strength by the coefficient's absolute value. The model contains an error term, which captures the variation observed in one variable that is not explained by variation in the predictor variable.
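
To make these terms concrete, below is a minimal sketch on simulated data (the variables x and y are hypothetical, not from our dataset) showing how the correlation coefficient, the shared variance, the slope and the error term relate to one another:

############
# PART: Illustration only - correlation terms on simulated data
############
set.seed(42)
x <- rnorm(100)                     # hypothetical predictor
y <- 0.8 * x + rnorm(100, sd = 0.5) # outcome = slope * predictor + error term

r <- cor(x, y)                  # correlation coefficient (-1 to +1)
r_squared <- r^2                # shared variance: variation common to both
fit <- lm(y ~ x)                # model the relationship as a line
slope <- coef(fit)[2]           # sign of the slope gives the direction
residual_var <- var(resid(fit)) # variation in y not explained by x

c(r = r, r_squared = r_squared, slope = slope, residual_var = residual_var)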

A Type I error would occur if we find that students’ performance in Portuguese and Maths are related when they are not; a Type II error would occur if we find there is no relationship when there really is one in the population.

A note on Heuristics:

We will be using a number of heuristics to justify our assessment of normality and correlation. For correlation we will use Cohen’s effect size heuristics. According to Cohen (1988), an absolute value of r of 0.1 is classified as small, an absolute value of 0.3 as medium, and an absolute value of 0.5 as large.
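
As a convenience, here is a small helper encoding these cut-offs (the function name, and the "negligible" band below 0.1, are my own additions for illustration):

# Classify an effect size per Cohen (1988); illustrative helper only.
cohen_effect_size <- function(r) {
  r <- abs(r)
  if (r >= 0.5) "large"
  else if (r >= 0.3) "medium"
  else if (r >= 0.1) "small"
  else "negligible" # band below 0.1 is my own label, not Cohen's
}
cohen_effect_size(0.896) # "large"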

For assessing normality of distribution for variables with outliers, Tabachnick and Fidell (2007) suggested the following heuristic: if missing data represent less than 5% of the total and are missing in a random pattern from a large data set, almost any procedure for handling missing values yields similar results, including simply omitting the outliers.
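
In code, this 5% check amounts to standardising a variable and measuring the share of scores beyond a cut-off. A minimal sketch (the function name is my own, and the commented call assumes the mG1 column imported above):

# Percentage of standardised scores beyond a cut-off (default +/- 1.96).
pct_outliers <- function(x, cutoff = 1.96) {
  z <- abs(scale(x)) # convert to absolute z-scores
  100 * mean(z > cutoff, na.rm = TRUE)
}
# pct_outliers(tbl_sperf_all$mG1) < 5 # acceptable if under 5%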

Correlation testing – Past Performance and Future Performance

From the preparation phase we had the Hypothesis:
Students who perform well as part of initial assessment in subjects will perform better overall.
In terms of testing for correlation, the null hypothesis is that there is no relationship between past performance and future student performance. The alternative hypothesis is that there is a relationship between these two variables.

For each of the two subjects we have two potential predictor variables, Grade 1 and Grade 2, and one outcome variable, Grade 3. As such we will conduct a total of four tests for correlation.

Step 1: Check for Normality of the Variables

One of the tests for correlation is the Pearson Correlation. This requires that the variables be normally distributed and the relationship linear. We validate the normality condition by generating summary statistics, histograms and Q-Q Plots for each of the variables. We’d most likely discover the variables are not ideally normal, so we will need to quantify how far away from normal the data is by calculating the standardised skew and kurtosis. If those are within acceptable bands (+/- 2.58 if our sample size is greater than 80 and we want a 99% cut-off) we can assume normality. If not, we need to look at the actual values in the variable, convert them to z-scores and calculate the percentage of those scores that can be considered outliers. If this percentage is within acceptable limits (no more than 5% of scores beyond +/- 1.96, and practically none beyond +/- 3.29 for larger samples) then we can go ahead and treat our data as approximately normal.

To that end, for each variable, we have completed the following steps:

  • Generated plots:
    • Histogram with normal curve overlaid
    • Q-Q Plot
  • Generated summary statistics.
  • Reviewed the statistical measures and plots to see how far away from normal the sample data is.
  • Generated standardised scores for skew and kurtosis and compared them to the acceptable range.
  • Generated standardised z-scores for the variables and compared them to the acceptable range.
  • Reported the correct statistics for this data based on an assessment of normality.
Summary statistics for Performance (not cleaned)
      vars   n     mean       sd median  trimmed    mad min max range       skew   kurtosis        se  IQR
mG1      1 382 10.86126 3.349167   10.5 10.74510 3.7065   3  19    16  0.2741912 -0.7061137 0.1713583 5.00
mG2      2 382 10.71204 3.832560   11.0 10.83007 2.9652   0  19    19 -0.3970490  0.4666241 0.1960908 4.75
mG3      3 382 10.38743 4.687242   11.0 10.81373 4.4478   0  20    20 -0.7003219  0.2415275 0.2398202 6.00
pG1      4 382 12.11257 2.556531   12.0 12.09150 2.9652   0  19    19 -0.1523548  0.6986985 0.1308035 4.00
pG2      5 382 12.23822 2.468341   12.0 12.15359 2.9652   5  19    14  0.2368869 -0.2038155 0.1262913 3.00
pG3      6 382 12.51571 2.945438   13.0 12.62092 2.9652   0  19    19 -0.9891237  3.3902597 0.1507017 3.00

Any outliers, skew, or kurtosis need to be investigated and explained. We can see above that there appears to be an issue with no grades being reported for some students. Not all student records contain grades 1 through 3: some records have grades 1 and 2 but no final grade, while others are missing grade 1 or 2 but have a final grade.

The reference paper highlighted that student performance is recorded manually using a paper-based filing system, so this might be a clerical error. In a later section we take a deeper look at the missing data and observe that clerical error doesn’t explain it, and that the missing data is not random. For illustration purposes we will assume a zero grade is invalid, and we will then regenerate our summary statistics and graphs on a cleaned dataset.

############
# PART: Normality
############
library(dplyr)      # select, filter and piping
library(kableExtra) # kbl and kable_styling, for pretty-printing tables

tbl_sperf_numerical_measurements <- tbl_sperf_all %>%
  select(contains('mG'), contains('pG')) %>%
  filter(mG1 != 0, mG2 != 0, mG3 != 0, pG1 != 0, pG2 != 0, pG3 != 0) # Filter out records with missing (zero) grades.

tbl_sperf_numerical_stats <- tbl_sperf_numerical_measurements %>% psych::describe(omit = TRUE, IQR = TRUE)

#-------- Iterate through each variable -------#
# Generate regular summary statistics - not as nice as the psych package but gives the p-value
st <- pastecs::stat.desc(tbl_sperf_numerical_measurements, basic = F)
tbl_sperf_numerical_stats_2 <- data.table::transpose(st)
colnames(tbl_sperf_numerical_stats_2) <- rownames(st)
rownames(tbl_sperf_numerical_stats_2) <- colnames(st)

# Initialise lists to hold the per-variable results
std_skew <- list()
std_kurt <- list()
gt_196 <- list()
gt_329 <- list()
variable_count <- nrow(tbl_sperf_numerical_stats_2)

# Iterate through variables
for (n in 1:variable_count) {
  variable <- row.names(tbl_sperf_numerical_stats_2)[n]

  tpskew               <- semTools::skew(tbl_sperf_numerical_measurements[[variable]])
  tpkurt               <- semTools::kurtosis(tbl_sperf_numerical_measurements[[variable]])
  std_skew[[variable]] <- tpskew[1] / tpskew[2] # standardised skew = estimate / std. error
  std_kurt[[variable]] <- tpkurt[1] / tpkurt[2] # standardised kurtosis = estimate / std. error
  z_score              <- abs(scale(tbl_sperf_numerical_measurements[[variable]]))
  gt_196[[variable]]   <- FSA::perc(as.numeric(z_score), 1.96, "gt") # 95% of scores should be within +/- 1.96
  gt_329[[variable]]   <- FSA::perc(as.numeric(z_score), 3.29, "gt") # 99.7% within +/- 3.29 for larger distributions
}

tbl_sperf_numerical_stats_2$std_skew <- std_skew
tbl_sperf_numerical_stats_2$std_kurt <- std_kurt
tbl_sperf_numerical_stats_2$gt_2sd <- gt_196
tbl_sperf_numerical_stats_2$gt_3sd <- gt_329

# Pretty print
tbl_sperf_numerical_stats_2 %>%
  kbl(caption = "Summary statistics for Performance (zero scores removed)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Summary statistics for Performance (zero scores removed)
    median     mean   SE.mean CI.mean.0.95       var  std.dev  coef.var  std_skew  std_kurt   gt_2sd    gt_3sd
mG1     11 11.28614 0.1758996    0.3459959 10.488890 3.238656 0.2869588  1.646218 -2.505487 3.539823         0
mG2     11 11.43953 0.1730551    0.3404007 10.152397 3.186283 0.2785327  1.551482 -2.093272 7.079646         0
mG3     11 11.61947 0.1769549    0.3480716 10.615123 3.258086 0.2803989  1.646578 -1.701956 3.834808         0
pG1     12 12.36578 0.1311432    0.2579596  5.830305 2.414602 0.1952648  1.024839 -1.390545 3.539823         0
pG2     12 12.47493 0.1302716    0.2562453  5.753068 2.398555 0.1922701  2.509813 -1.315284 3.244838         0
pG3     13 12.82301 0.1402359    0.2758450  6.666806 2.582016 0.2013580 -1.47584   3.064222 5.014749 0.2949853
[Figure: Visualisation Histograms]

Assessing Portuguese and Maths Final Grade Distribution – Do They Fit the Normal Distribution?

To illustrate the concept of assessing normality we selected the grades for Portuguese and Maths from the sample dataset. The Normal Quantile Plot (Q-Q Plot) shows that most observations lie on or around the reference line, with more observations towards the middle and fewer towards either end. As such, the variables are not ideally normal, as we have some outliers and gaps affecting the shape of our distribution.

Next, we quantify how far away from normal the distribution is. To do this, we calculate statistics for skew and kurtosis and standardise them (value/std. error) so we can compare them to heuristics. Standardised skewness scores between +/- 2 (1.96 rounded) are considered acceptable in order to assume a normal distribution. The standardised skewness for pG2 is not within this acceptable range, so we need to look into it further by exploring outliers: how many there are, and whether we can transform the variable to become more normal.

In terms of quantifying the proportion of the data that is not normal, we generated standardised z-scores for each variable and calculated the percentage of those scores that fall outside an acceptable range. No variable exceeded our acceptable limit at the 99.7% level (scores beyond +/- 3.29). The variables mG2 and pG3 had more than 5% of scores beyond +/- 1.96 (the 95% level), but as our number of examples exceeds 80 we can rely on the 99.7% level instead. The pG2 variable is within our acceptable range for outliers, so we can assume its excess skewness is not an issue for accepting normality, and pG3 was within our acceptance range for standardised skew.

Based on this assessment, all performance variables can be treated as approximately normally distributed once the missing-data outliers have been removed.

Step 2: Check for Linearity of Co-Variance

Pearson Correlation requires there to be a linear relationship between the two variables, as Pearson uses the equation of a line to model the relationship. We validate this condition through inspection of a scatter plot, which should resemble a straight line rather than a curve. For linearity we want the values to be evenly spread around the line in a rectangular space. Below we have graphed the predictor/explanatory variable on the X axis and the response/outcome variable on the Y axis.

After cleaning the dataset of missing values, as explained above, scatter plots were generated for each variable relationship we want to test. All of the scatter plots show a uniform distribution of values above and below the reference line with few outliers, and as such we can assume homoscedasticity. Based on this inspection, all grade relationships of interest can be treated as linear once the missing-data outliers have been removed.

############
# PART: Visualisation
############
library(ggplot2) # scatter plots
library(cowplot) # plot_grid, for arranging the plots in a grid

plots <- list()

# Initial Math Grade (mG1) vs Final Math Grade (mG3)
gs <- tbl_sperf_numerical_measurements %>% ggplot2::ggplot(aes(x = mG1, y = mG3))
gs <- gs +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", se = F) +
  labs(x = "Initial Math Grade (mG1)", y = "Final Math Grade (mG3)")
plots[["mG1 <-> mG3"]] <- gs

# Second Math Grade (mG2) vs Final Math Grade (mG3)
gs <- tbl_sperf_numerical_measurements %>% ggplot2::ggplot(aes(x = mG2, y = mG3))
gs <- gs +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", se = F) +
  labs(x = "Second Math Grade (mG2)", y = "Final Math Grade (mG3)")
plots[["mG2 <-> mG3"]] <- gs

# Initial Portuguese Grade (pG1) vs Final Portuguese Grade (pG3)
gs <- tbl_sperf_numerical_measurements %>% ggplot2::ggplot(aes(x = pG1, y = pG3))
gs <- gs +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", se = F) +
  labs(x = "Initial Portuguese Grade (pG1)", y = "Final Portuguese Grade (pG3)")
plots[["pG1 <-> pG3"]] <- gs

# Second Portuguese Grade (pG2) vs Final Portuguese Grade (pG3)
gs <- tbl_sperf_numerical_measurements %>% ggplot2::ggplot(aes(x = pG2, y = pG3))
gs <- gs +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", se = F) +
  labs(x = "Second Portuguese Grade (pG2)", y = "Final Portuguese Grade (pG3)")
plots[["pG2 <-> pG3"]] <- gs

plot_grid(plotlist = plots, labels = "auto", ncol = 2)

############
# PART: Linearity of Co-variant relationship
############
#Pearson Correlation
### mG1 correlated to mG3
tbl_correlation_stats <- stats::cor.test(tbl_sperf_numerical_measurements$mG1, tbl_sperf_numerical_measurements$mG3, method = 'pearson')
show(tbl_correlation_stats)

##
##  Pearson's product-moment correlation
##
## data:  tbl_sperf_numerical_measurements$mG1 and tbl_sperf_numerical_measurements$mG3
## t = 36.943, df = 337, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8722090 0.9147845
## sample estimates:
##       cor
## 0.8955274

### mG2 correlated to mG3
tbl_correlation_stats <- stats::cor.test(tbl_sperf_numerical_measurements$mG2, tbl_sperf_numerical_measurements$mG3, method = 'pearson')
show(tbl_correlation_stats)

##
##  Pearson's product-moment correlation
##
## data:  tbl_sperf_numerical_measurements$mG2 and tbl_sperf_numerical_measurements$mG3
## t = 68.944, df = 337, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9584693 0.9727246
## sample estimates:
##       cor
## 0.9663306

### pG1 correlated to pG3
tbl_correlation_stats <- stats::cor.test(tbl_sperf_numerical_measurements$pG1, tbl_sperf_numerical_measurements$pG3, method = 'pearson')
show(tbl_correlation_stats)

##
##  Pearson's product-moment correlation
##
## data:  tbl_sperf_numerical_measurements$pG1 and tbl_sperf_numerical_measurements$pG3
## t = 31.866, df = 337, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8372549 0.8907966
## sample estimates:
##       cor
## 0.8664967

### pG2 correlated to pG3
tbl_correlation_stats <- stats::cor.test(tbl_sperf_numerical_measurements$pG2, tbl_sperf_numerical_measurements$pG3, method = 'pearson')
show(tbl_correlation_stats)

##
##  Pearson's product-moment correlation
##
## data:  tbl_sperf_numerical_measurements$pG2 and tbl_sperf_numerical_measurements$pG3
## t = 43.489, df = 337, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9034209 0.9359535
## sample estimates:
##       cor
## 0.9212835
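
Since the four tests above repeat the same call, an equivalent loop can produce all of them at once. A minimal sketch (the pairs_to_test name and the compact output format are my own):

############
# PART: The same four correlation tests expressed as a loop
############
pairs_to_test <- list(c("mG1", "mG3"), c("mG2", "mG3"),
                      c("pG1", "pG3"), c("pG2", "pG3"))
for (p in pairs_to_test) {
  result <- stats::cor.test(tbl_sperf_numerical_measurements[[p[1]]],
                            tbl_sperf_numerical_measurements[[p[2]]],
                            method = 'pearson')
  # Report the coefficient and p-value for each predictor/outcome pair
  cat(p[1], "<->", p[2], ": r =", round(result$estimate, 3),
      ", p", format.pval(result$p.value, eps = .001), "\n")
}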

Just for Fun – Correlation Matrix for all Performance Variables

Just for fun I also generated the correlation coefficients for all grade variables at the same time. This can be seen below.

### Correlation matrix for all
library(Hmisc)    # rcorr, for the correlation matrix with p-values
library(corrplot) # corrplot, for plotting the matrix

correlation_matrix <- rcorr(as.matrix(tbl_sperf_numerical_measurements))

col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
title <- "Correlation matrix for all performance variables"
corrplot(correlation_matrix$r, method = "color", col = col(200), type = "upper", order = "hclust",
         addCoef.col = "black",                 # Add coefficient of correlation
         tl.col = "black", tl.srt = 45,         # Text label colour and rotation
         p.mat = correlation_matrix$p,          # Combine with significance
         sig.level = 0.01, insig = "blank",
         diag = FALSE,                          # Hide coefficients on the principal diagonal
         title = title, mar = c(0, 0, 2, 0))    # http://stackoverflow.com/a/14754408/54964
[Figure: Correlation matrix for all performance variables]

Reporting Correlation

Hypothesis test: relationship between Initial grade and Final grade

The relationship between initial grade in Maths (mG1, taken from school reports) and final Maths grade (mG3, taken from school reports) was investigated using a Pearson correlation. A strong positive correlation was found (r = .896, n = 339, p < .001). There is therefore evidence to reject the null hypothesis in favour of the alternative hypothesis that there is a relationship between initial Maths grade and final Maths grade.

The relationship between initial grade in Portuguese (pG1, taken from school reports) and final Portuguese grade (pG3, taken from school reports) was investigated using a Pearson correlation. A strong positive correlation was found (r = .866, n = 339, p < .001). There is therefore evidence to reject the null hypothesis in favour of the alternative hypothesis that there is a relationship between initial Portuguese grade and final Portuguese grade.

Hypothesis test: relationship between Intermediate grade and Final grade

The relationship between intermediate grade in Maths (mG2, taken from school reports) and final Maths grade (mG3, taken from school reports) was investigated using a Pearson correlation. A strong positive correlation was found (r = .966, n = 339, p < .001). There is therefore evidence to reject the null hypothesis in favour of the alternative hypothesis that there is a relationship between intermediate Maths grade and final Maths grade.

The relationship between intermediate grade in Portuguese (pG2, taken from school reports) and final Portuguese grade (pG3, taken from school reports) was investigated using a Pearson correlation. A strong positive correlation was found (r = .921, n = 339, p < .001). There is therefore evidence to reject the null hypothesis in favour of the alternative hypothesis that there is a relationship between intermediate Portuguese grade and final Portuguese grade.
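
For convenience, here is a hedged sketch of a small helper that formats these figures directly from a cor.test result (the report_correlation name is my own, not from a package):

# Format a Pearson correlation result in the style used above.
report_correlation <- function(x, y) {
  res <- stats::cor.test(x, y, method = 'pearson')
  n <- as.integer(res$parameter + 2) # Pearson df = n - 2
  sprintf("r = %.3f, n = %d, p %s", res$estimate, n,
          ifelse(res$p.value < .001, "< .001",
                 paste("=", sub("^0", "", format(round(res$p.value, 3))))))
}
# report_correlation(tbl_sperf_numerical_measurements$mG1,
#                    tbl_sperf_numerical_measurements$mG3)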

References:

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston, MA: Pearson Education.