Digital Portfolio – Model

Reading Time: 34 minutes

Introduction:

Before we can build a model using a predictor we need strong statistical as well as theoretical support. If we assert as part of our hypothesis that there is a relationship between two concepts, then we need to establish a) if the relationship is statistically likely to exist, and b) how strong that relationship is. We do this using correlation testing. If as part of our hypothesis we are saying that a given concept is experienced differently for different groups in our population, then we would have to establish statistical evidence to say that there is a difference in the experience of this concept for the different groups. We do this using different tests. Exploring, gathering and analysing the supporting statistical evidence for the inclusion of predictor variables in predictive models was the purpose of the explore and analyse phase. The purpose of this phase is to build and execute the predictive models, having already established justification for the selection of predictor variables and the hypothesis.

Linear Regression

Linear regression, at its most fundamental, is a hypothesised model of the relationship between two variables, using the equation of a line as our model equation. We model units of change in our outcome variable and attempt to determine how much each unit of change in one predictor contributes to that change. Our assumption is that each unit change in the predictor produces, on average, a uniform change in the outcome.

We can quantify how good our model is by comparing our linear model to the most basic reference model (the mean) to see how much extra explanation of the variance in the outcome variable we get. To do this, we square the difference between each point of our observed data and the mean and sum the result to get the total sum of squares (SSt), a measure of variability between the scores and the mean. We do the same for the differences between the observed data and the fitted regression line to get the residual sum of squares (SSr), a measure of variability between the scores and the regression model. The difference between the two is how much variability is accounted for by the model. Running an analysis of variance (ANOVA) produces an F-statistic, which is a ratio: it compares the amount of systematic variance in the data to the amount of unsystematic variance. Through a comparison to the standard F-distribution with the same degrees of freedom we can determine whether our model produces a statistically significant improvement.
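
To make the decomposition concrete, the sketch below works through these quantities by hand on a small simulated dataset (the variables x and y are hypothetical, not the student performance data) and confirms them against R's anova() output:

set.seed(42)
demo <- data.frame(x = rnorm(50, mean = 10, sd = 3))
demo$y <- 1.5 + 0.9 * demo$x + rnorm(50, sd = 1.5)
fit <- lm(y ~ x, data = demo)
SSt <- sum((demo$y - mean(demo$y))^2)  # total variability: scores versus the mean (baseline model)
SSr <- sum(residuals(fit)^2)           # residual variability: scores versus the regression line
SSm <- SSt - SSr                       # variability accounted for by the model
SSm / SSt                              # R-squared: proportion of variance explained
(SSm / 1) / (SSr / (nrow(demo) - 2))   # F: systematic variance over unsystematic variance
anova(fit)                             # R reports the same decomposition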

Simple Linear Regression of student performance

To illustrate this methodology we will make use of the student performance data set to build a model to predict final student grade on the basis of initial assessment.
HA: Students who perform well as part of initial assessment in subjects will perform better overall.

Step 1: Investigate Normality

Because simple linear regression is based on the equation for a line, our first step would be a test for normality of our continuous scale data. As part of the exploration phase we established that all performance variables in the dataset can be assumed to be approximately normally distributed, and this analysis is not repeated here. The summary stats are included for convenience of reporting.
Summary statistics for Performance (zero scores removed)
mean SE.mean CI.mean.0.95 var std.dev coef.var std_skew std_kurt gt_2sd gt_3sd
mG1 11.28614 0.1758996 0.3459959 10.488890 3.238656 0.2869588 1.646218 -2.505487 3.539823 0
mG2 11.43953 0.1730551 0.3404007 10.152397 3.186283 0.2785327 1.551482 -2.093272 7.079646 0
mG3 11.61947 0.1769549 0.3480716 10.615123 3.258086 0.2803989 1.646578 -1.701956 3.834808 0
pG1 12.36578 0.1311432 0.2579596 5.830305 2.414602 0.1952648 1.024839 -1.390545 3.539823 0
pG2 12.47493 0.1302716 0.2562453 5.753068 2.398555 0.1922701 2.509813 -1.315284 3.244838 0
pG3 12.82301 0.1402359 0.2758450 6.666806 2.582016 0.2013580 -1.47584 3.064222 5.014749 0.2949853

Step 2: Investigate relationship between predictor and response variable.

As part of the exploration phase, we established through Pearson correlation that there was a strong, statistically significant (p < .001), positive relationship between initial grade in a subject and final grade in that subject. We also established the same for intermediate grade and final grade. This analysis is not repeated here.

Step 3: Run Simple Linear Regression

Given the assumptions of normality and that a strong positive relationship exists, it is safe to use the student performance dataset to illustrate the simple linear regression predictive modelling technique.

Note: This is equivalent to doing a Pearson correlation again and is included for illustration

############
# PART: Linear Regression
############
# -------------- Linear Regression Math Grade 1 as a predictor for Math Grade 3 --------------- #
reg_result_model_lm <-lm(tbl_sperf_numerical_measurements$mG3~tbl_sperf_numerical_measurements$mG1)

# -------------- Determine how good our model is versus using the mean --------------- #
anova(reg_result_model_lm) ## Get sums of squares of variability

## Analysis of Variance Table
##
## Response: tbl_sperf_numerical_measurements$mG3
##                                       Df  Sum Sq Mean Sq F value    Pr(>F)
## tbl_sperf_numerical_measurements$mG1   1 2877.40 2877.40  1364.8 < 2.2e-16 ***
## Residuals                            337  710.52    2.11
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(reg_result_model_lm)

##
## Call:
## lm(formula = tbl_sperf_numerical_measurements$mG3 ~ tbl_sperf_numerical_measurements$mG1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.6590 -1.0644 -0.1635  1.1338  3.6383
##
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                           1.45179    0.28630   5.071 6.55e-07 ***
## tbl_sperf_numerical_measurements$mG1  0.90090    0.02439  36.943  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.452 on 337 degrees of freedom
## Multiple R-squared:  0.802,  Adjusted R-squared:  0.8014
## F-statistic:  1365 on 1 and 337 DF,  p-value: < 2.2e-16

# lm.beta::lm.beta(reg_result_model_lm) Not needed as units are the same for predictor and outcome
stargazer(reg_result_model_lm, type="text")

##
## ===============================================
##                         Dependent variable:
##                     ---------------------------
##                                 mG3
## -----------------------------------------------
## mG1                          0.901***
##                               (0.024)
##
## Constant                     1.452***
##                               (0.286)
##
## -----------------------------------------------
## Observations                    339
## R2                             0.802
## Adjusted R2                    0.801
## Residual Std. Error      1.452 (df = 337)
## F Statistic         1,364.757*** (df = 1; 337)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Report Simple Linear Regression

The relationship between initial grade in Maths (mG1, taken from school reports) and final Maths grade (mG3, taken from school reports) was modelled using a simple linear model. A strong positive relationship was found (R2 = .802, n = 339, p < .001), which matches our earlier Pearson correlation results. Our ANOVA test to evaluate goodness of fit found a statistically significant result (p < .001). The model has an R-squared value of .802, indicating that initial grade explains 80% of the variance in final grade in the population. Every 1 unit change in the initial grade in Maths contributes 0.901 units of change to the final student grade in the population. Therefore there is good justification for using regression to look at prediction. This analysis supports the use of simple linear regression as a model to predict student final performance on the basis of initial grade.

Multiple Linear Regression of student performance

Simple linear regression is a model to predict the value of one variable from another. Multiple regression is a natural extension of this model and is used to predict an outcome variable using multiple predictor variables, each of which contributes partially to the outcome. We are looking to determine how much the set of predictors collectively contributes to the outcome variable on average and, additionally, how much each individual predictor contributes on average. When we use categorical variables as predictors, what we are effectively looking for is a differential effect in the outcome variable as a result of being in one category or another.
HA: Students who perform well as part of initial assessment in subjects will perform better overall, and there is a difference between male and female students.
For our predictors we will use initial grade and whether a student is male or female.

Step 1: Investigate differential effects

We have already established, as part of the explore and analyse phase, that there is a differential effect between male and female students for performance. That analysis is not repeated here, but we can assume the variables of interest meet the criteria for multiple linear regression.

Step 2: Run Multiple Linear Regression

Given the assumptions of normality and that a strong positive relationship exists, it is safe to use the student performance dataset to illustrate the multiple linear regression predictive modelling technique.

tbl_sperf_sex_diff_stats <- tbl_sperf_sex_diff %>% psych::describe(omit = TRUE)

############
# PART: Multiple Linear Regression
############

# -------------- Regression Math Grade 1 as a predictor for Math Grade 3 and sex --------------- #
reg_result_model_multiple_lm <-lm(tbl_sperf_sex_diff$mG3~tbl_sperf_sex_diff$mG1+as.integer(tbl_sperf_sex_diff$sex=='M')) #Force female to be baseline.

# -------------- Determine how good our model is versus using the mean --------------- #
anova(reg_result_model_multiple_lm) ## Get sums of squares of variability

## Analysis of Variance Table
##
## Response: tbl_sperf_sex_diff$mG3
##                                            Df  Sum Sq Mean Sq   F value Pr(>F)
## tbl_sperf_sex_diff$mG1                      1 2877.40 2877.40 1362.3560 <2e-16 ***
## as.integer(tbl_sperf_sex_diff$sex == "M")   1    0.86    0.86    0.4071 0.5239
## Residuals                                 336  709.66    2.11
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(reg_result_model_multiple_lm)

##
## Call:
## lm(formula = tbl_sperf_sex_diff$mG3 ~ tbl_sperf_sex_diff$mG1 +
##     as.integer(tbl_sperf_sex_diff$sex == "M"))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.6170 -1.1102 -0.1102  1.0790  3.5857
##
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                1.42738    0.28910   4.937 1.25e-06 ***
## tbl_sperf_sex_diff$mG1                     0.89865    0.02466  36.438  < 2e-16 ***
## as.integer(tbl_sperf_sex_diff$sex == "M")  0.10179    0.15954   0.638    0.524
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.453 on 336 degrees of freedom
## Multiple R-squared:  0.8022, Adjusted R-squared:  0.801
## F-statistic: 681.4 on 2 and 336 DF,  p-value: < 2.2e-16

stargazer(reg_result_model_lm, reg_result_model_multiple_lm, type="text")

##
## =======================================================================
##                                     Dependent variable:
##                     ---------------------------------------------------
##                                mG3                       mG3
##                                (1)                       (2)
## -----------------------------------------------------------------------
## mG1                          0.901***
##                              (0.024)
##
## mG1                                                    0.899***
##                                                        (0.025)
##
## sex == "M")                                             0.102
##                                                        (0.160)
##
## Constant                     1.452***                  1.427***
##                              (0.286)                   (0.289)
##
## -----------------------------------------------------------------------
## Observations                   339                       339
## R2                            0.802                     0.802
## Adjusted R2                   0.801                     0.801
## Residual Std. Error      1.452 (df = 337)          1.453 (df = 336)
## F Statistic         1,364.757*** (df = 1; 337) 681.382*** (df = 2; 336)
## =======================================================================
## Note:                                       *p<0.1; **p<0.05; ***p<0.01

#Note: here sex was recoded manually into a dummy variable, with female as the reference (0) and male as the category of interest (1). If a factor is passed directly, R recodes it automatically, using the first level as the reference.

Report Multiple Linear Regression

The relationship between initial grade in Maths (mG1, taken from school reports) and final Maths grade (mG3, taken from school reports) for male and female students was modelled using a multiple linear model. The overall model was statistically significant (F(2, 336) = 681.38, p < .001). However, when we take into account the overall contribution from the initial Maths grade, a statistically significant result was not found for sex as a predictor (p = .524). There is almost no change in the contribution of initial Maths grade to the final Maths grade outcome (0.899 versus 0.901) compared to the simple linear model which did not account for sex. Our model still explains 80% of the variation on average in the outcome variable, the same as the simple linear model.

This result of the multiple linear regression raises the question of interaction effects. If a student is getting a differential effect for final maths grade based on sex, was that student also experiencing a differential effect for the initial grade? The concept here is that there is some relationship between predictors which needs to be accounted for in our model, and we use interaction terms to do so.

Dummy variables and Interaction Effects – Categorical Predictive Variables

Many variables we are interested in using as predictors are categorical. In the above example we used a categorical variable (sex) as a predictor but did not explain its treatment. Because categorical variables have no scale, it makes no sense to think of the effect of a unit of increase in them as it would for a continuous variable. What we want to see is whether there is a difference in outcome for a record belonging to one category versus another, and how much of the variation in the outcome can be explained by belonging to a category. To do this, we transform the categorical variable into a series of dummy variables which indicate whether a particular record has the specific characteristic represented by the dummy variable. Dummy variables are also referred to as indicator variables or switches.

When we include a dummy variable in our model we get a new intercept with the Y axis and a new line of best fit for the model. In this way we can have different predictive models for different groups. For dummy variables, we recode each category to 0 (reference/baseline category) or 1 (category of interest). We must have previously shown that a dummy variable has a differential effect on the outcome variable, through difference testing, in order to include it in our predictive model.

Models built with dummy variables assume that the gap between groups (lines of best fit) is constant, but we may find there is an intersection point for the lines representing each group in our model. For example, in our data set we might find that on average male students are better during initial assessment, but the trend narrows or even reverses for the final grade. To capture this effect in our predictive model we use interaction terms which change the slope of our model. We generate interaction terms by multiplying our dummy category variable by our continuous scale predictor variable and adding this term to our model. Our predictive model should include both the dummy variable and interaction term.

Illustrative example of Interaction effects

Using our student performance dataset we will illustrate interaction effects by building progressively more complex examples.

############
# PART: Create switches and interaction terms
############

tbl_sperf_sex_diff$int_sex_male <- tbl_sperf_sex_diff$mG1*as.integer(tbl_sperf_sex_diff$sex=='M')
tbl_sperf_sex_diff$int_sex_female <- tbl_sperf_sex_diff$mG1*as.integer(tbl_sperf_sex_diff$sex=='F')

tbl_sperf_sex_diff$medu_0 <- as.integer(tbl_sperf_sex_diff$Medu == 0)
tbl_sperf_sex_diff$medu_1 <- as.integer(tbl_sperf_sex_diff$Medu == 1)
tbl_sperf_sex_diff$medu_2 <- as.integer(tbl_sperf_sex_diff$Medu == 2)
tbl_sperf_sex_diff$medu_3 <- as.integer(tbl_sperf_sex_diff$Medu == 3)
tbl_sperf_sex_diff$medu_4 <- as.integer(tbl_sperf_sex_diff$Medu == 4)

## Stats
tbl_sperf_sex_diff_stats <- tbl_sperf_sex_diff %>% psych::describe(omit = TRUE)

First Model mG3 predicted by mG1

Our first model is a simple linear regression of final Math grade predicted by initial Math grade.

############
# PART: Demonstrate application of interaction terms
############
#------------ First Model: mG3 predicted by mG1 -----------#
model1<-lm(tbl_sperf_sex_diff$mG3~tbl_sperf_sex_diff$mG1)

## Pretty Print
stargazer::stargazer(model1, type="text") #Tidy output of all the required stats

##
## ===============================================
##                         Dependent variable:
##                     ---------------------------
##                                 mG3
## -----------------------------------------------
## mG1                          0.901***
##                               (0.024)
##
## Constant                     1.452***
##                               (0.286)
##
## -----------------------------------------------
## Observations                    339
## R2                             0.802
## Adjusted R2                    0.801
## Residual Std. Error      1.452 (df = 337)
## F Statistic         1,364.757*** (df = 1; 337)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

#------------ First Model: mG3 predicted by mG1 -----------#
equation1 <- function(x){coef(model1)[2]*x+coef(model1)[1]}
ggplot(tbl_sperf_sex_diff,aes(y=mG3,x=mG1,color=sex))+geom_point()+
        stat_function(fun=equation1,geom="line",color=scales::hue_pal()(2)[1])
plot of chunk model1 visualisation

Second Model mG3 predicted by mG1 using sex as a switch

Our second model is a multi-variable linear regression with final Maths grade predicted by initial Maths grade and student sex. We transform sex into a dummy/switch variable where male is treated as the reference category/baseline and female as the category of interest. This means in our model males are given the value of 0 and females a value of 1. This model still assumes a constant difference in performance between male and female students across all values of initial grade; interaction terms, which change the gradient of the model for female students, are introduced in the third model.

#------------ Second Model: mG3 predicted by mG1 using sex as a dummy variable -----------#
model2 <- lm(tbl_sperf_sex_diff$mG3~tbl_sperf_sex_diff$mG1+as.integer(tbl_sperf_sex_diff$sex=='F')) # Male baseline

## Pretty Print
stargazer::stargazer(model2, type="text") #Tidy output of all the required stats

##
## ===============================================
##                         Dependent variable:
##                     ---------------------------
##                                 mG3
## -----------------------------------------------
## mG1                          0.899***
##                               (0.025)
##
## sex == "F")                   -0.102
##                               (0.160)
##
## Constant                     1.529***
##                               (0.311)
##
## -----------------------------------------------
## Observations                    339
## R2                             0.802
## Adjusted R2                    0.801
## Residual Std. Error      1.453 (df = 336)
## F Statistic          681.382*** (df = 2; 336)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

stargazer::stargazer(model1, model2, type="text") #Quick model comparison

##
## =======================================================================
##                                     Dependent variable:
##                     ---------------------------------------------------
##                                             mG3
##                                (1)                       (2)
## -----------------------------------------------------------------------
## mG1                          0.901***                  0.899***
##                              (0.024)                   (0.025)
##
## sex == "F")                                             -0.102
##                                                        (0.160)
##
## Constant                     1.452***                  1.529***
##                              (0.286)                   (0.311)
##
## -----------------------------------------------------------------------
## Observations                   339                       339
## R2                            0.802                     0.802
## Adjusted R2                   0.801                     0.801
## Residual Std. Error      1.452 (df = 337)          1.453 (df = 336)
## F Statistic         1,364.757*** (df = 1; 337) 681.382*** (df = 2; 336)
## =======================================================================
## Note:                                       *p<0.1; **p<0.05; ***p<0.01

#------------ Second Model: mG3 predicted by mG1 using sex as a dummy variable -----------#
equation1_male <- function(x){coef(model1)[2]*x+coef(model1)[1]}
equation1_female <- function(x){coef(model2)[2]*x+coef(model2)[1] + coef(model2)[3]}
ggplot(tbl_sperf_sex_diff,aes(y=mG3,x=mG1,color=sex))+geom_point()+
        stat_function(fun=equation1_male,geom="line",color=scales::hue_pal()(2)[1]) +
        stat_function(fun=equation1_female,geom="line",color=scales::hue_pal()(2)[2])
plot of chunk model2 versus model1 visualisation

Third Model mG3 predicted by mG1 using sex as an interaction term

Our third model is a multi-variable linear regression of final Maths grade predicted by initial Maths grade, student sex and their interaction. Sex is again a dummy variable, with male treated as the reference category/baseline and female as the category of interest, so males are given the value of 0 and females a value of 1. Both the switch (dummy) variable and the interaction term are included, because the switch variable determines the intercept of the model with the Y axis (the starting point), while the interaction term changes the slope for female students.

#------------ Third Model: mG3 predicted by mG1 using sex as an interaction variable -----------#
model3 <- lm(tbl_sperf_sex_diff$mG3 ~ tbl_sperf_sex_diff$mG1 +
  as.integer(tbl_sperf_sex_diff$sex == 'F') +
  tbl_sperf_sex_diff$int_sex_female
) # Male baseline

## Pretty Print
stargazer::stargazer(model3, type = "text") #Tidy output of all the required stats

##
## ===============================================
##                         Dependent variable:
##                     ---------------------------
##                                 mG3
## -----------------------------------------------
## mG1                          0.884***
##                               (0.035)
##
## sex == "F")                   -0.437
##                               (0.580)
##
## int_sex_female                 0.030
##                               (0.049)
##
## Constant                     1.706***
##                               (0.428)
##
## -----------------------------------------------
## Observations                    339
## R2                             0.802
## Adjusted R2                    0.801
## Residual Std. Error      1.455 (df = 335)
## F Statistic          453.511*** (df = 3; 335)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

stargazer::stargazer(model3, model2, type = "text") #Quick model comparison

##
## =====================================================================
##                                    Dependent variable:
##                     -------------------------------------------------
##                                            mG3
##                               (1)                      (2)
## ---------------------------------------------------------------------
## mG1                         0.884***                 0.899***
##                             (0.035)                  (0.025)
##
## sex == "F")                  -0.437                   -0.102
##                             (0.580)                  (0.160)
##
## int_sex_female               0.030
##                             (0.049)
##
## Constant                    1.706***                 1.529***
##                             (0.428)                  (0.311)
##
## ---------------------------------------------------------------------
## Observations                  339                      339
## R2                           0.802                    0.802
## Adjusted R2                  0.801                    0.801
## Residual Std. Error     1.455 (df = 335)         1.453 (df = 336)
## F Statistic         453.511*** (df = 3; 335) 681.382*** (df = 2; 336)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01
plot of chunk model3 versus model2 visualisation

Double checking association

To confirm we hadn’t made an error, we reconfirmed our understanding of the associations for the variables. This is not strictly required but included for convenience.

Look at zero order correlations

#Get zero order correlations as well to fully explore the effect
cor.test(tbl_sperf_sex_diff$mG1, tbl_sperf_sex_diff$mG3)

##
##  Pearson's product-moment correlation
##
## data:  tbl_sperf_sex_diff$mG1 and tbl_sperf_sex_diff$mG3
## t = 36.943, df = 337, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8722090 0.9147845
## sample estimates:
##       cor
## 0.8955274

Look at differences in experience of concepts for respondents of different genders

#Look at differences in scores for gender
#Conduct Levene's test for homogeneity of variance in library car
#---------- mG1 ~ Sex Differential Effect  ---------- #
leveneTest(mG1 ~ sex, data=tbl_sperf_sex_diff)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.0552 0.8144
##       337

test_result <- stats::t.test(mG1~sex,var.equal=TRUE,data=tbl_sperf_sex_diff) # Levene's test non-significant, so equal variances assumed
test_result[["effcd"]] <- round(effectsize::t_to_d(t = test_result$statistic, test_result$parameter),2)
print(test_result)

##
##  Two Sample t-test
##
## data:  mG1 by sex
## t = -2.657, df = 337, p-value = 0.008259
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.6126906 -0.2406418
## sample estimates:
## mean in group F mean in group M
##        10.83237        11.75904

print(test_result[["effcd"]])

##     d |         95% CI
## ----------------------
## -0.29 | [-0.50, -0.07]

#---------- mG3 ~ Sex Differential Effect  ---------- #
leveneTest(mG3 ~ sex, data=tbl_sperf_sex_diff)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.1175  0.732
##       337

test_result <- stats::t.test(mG3 ~ sex,var.equal=TRUE,data=tbl_sperf_sex_diff) # Levene's test non-significant, so equal variances assumed
test_result[["effcd"]] <- round(effectsize::t_to_d(t = test_result$statistic, test_result$parameter),2)

print(test_result)

##
##  Two Sample t-test
##
## data:  mG3 by sex
## t = -2.6637, df = 337, p-value = 0.008099
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.6246401 -0.2444316
## sample estimates:
## mean in group F mean in group M
##        11.16185        12.09639

print(test_result[["effcd"]])

##     d |         95% CI
## ----------------------
## -0.29 | [-0.50, -0.08]

Reporting Switches and Interaction Effects for Variable: sex

The interaction effects between initial grade in Maths (mG1, taken from school reports) and sex were modelled using a multiple linear model. Being female was not found to be a statistically significant predictor of final performance (coefficient = -0.437, SE = 0.580). The interaction effect of being female and initial grade was also not found to be significant (coefficient = 0.030, SE = 0.049). We saw how the inclusion of the switch or dummy variable had the consequence of giving us two lines of best fit (one for male and one for female students) with differing intercepts. We also saw how the inclusion of the interaction term gave the two groups slightly different gradients.

Linear Regression with Interaction effects and more than two categories

From our explore and analysis phase we established there is a relationship between mother’s education (Medu) and student performance, with the strongest relationship found for students with mothers who obtained the highest level of education. Medu differs from sex, in that it is a variable that can take more than two values.

Fourth Model mG3 predicted by mG1 using sex and mother's education as dummy variables

Our fourth model is a multi-variable linear regression with final Maths grade predicted by initial Maths grade, student sex and mother's educational achievement. Sex is again a dummy variable, with male treated as the reference category/baseline and female as the category of interest, so males are given the value of 0 and females a value of 1. Mother's education has five possible values, and this is transformed into four switching (dummy) variables, with the lowest educational level (Medu = 0) taken as the reference category.

#------------ Fourth Model: mG3 predicted by mG1 using sex as an interaction variable and mothers education as a switch -----------#
model4 <- lm(tbl_sperf_sex_diff$mG3 ~ tbl_sperf_sex_diff$mG1 +
  as.integer(tbl_sperf_sex_diff$sex == 'F') +
  tbl_sperf_sex_diff$medu_4 +
  tbl_sperf_sex_diff$medu_1 +
  tbl_sperf_sex_diff$medu_2 +
  tbl_sperf_sex_diff$medu_3 ) # lowest education baseline, male baseline

## Pretty Print
stargazer::stargazer(model4, type = "text") #Tidy output of all the required stats

##
## ===============================================
##                         Dependent variable:
##                     ---------------------------
##                                 mG3
## -----------------------------------------------
## mG1                          0.894***
##                               (0.025)
##
## sex == "F")                   -0.103
##                               (0.160)
##
## medu_4                        -0.730
##                               (0.852)
##
## medu_1                        -0.977
##                               (0.872)
##
## medu_2                        -0.809
##                               (0.857)
##
## medu_3                        -0.667
##                               (0.857)
##
## Constant                      2.342**
##                               (0.906)
##
## -----------------------------------------------
## Observations                    339
## R2                             0.804
## Adjusted R2                    0.800
## Residual Std. Error      1.457 (df = 332)
## F Statistic          226.304*** (df = 6; 332)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

stargazer::stargazer(model4, model3, type = "text") #Quick model comparison

##
## =====================================================================
##                                    Dependent variable:
##                     -------------------------------------------------
##                                            mG3
##                               (1)                      (2)
## ---------------------------------------------------------------------
## mG1                         0.894***                 0.884***
##                             (0.025)                  (0.035)
##
## sex == "F")                  -0.103                   -0.437
##                             (0.160)                  (0.580)
##
## medu_4                       -0.730
##                             (0.852)
##
## medu_1                       -0.977
##                             (0.872)
##
## medu_2                       -0.809
##                             (0.857)
##
## medu_3                       -0.667
##                             (0.857)
##
## int_sex_female                                        0.030
##                                                      (0.049)
##
## Constant                    2.342**                  1.706***
##                             (0.906)                  (0.428)
##
## ---------------------------------------------------------------------
## Observations                  339                      339
## R2                           0.804                    0.802
## Adjusted R2                  0.800                    0.801
## Residual Std. Error     1.457 (df = 332)         1.455 (df = 335)
## F Statistic         226.304*** (df = 6; 332) 453.511*** (df = 3; 335)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01
plot of chunk model4 versus model3 visualisation

Report Interactive effects for Medu and Sex

A multiple linear predictive model of final Maths grade was built using initial grade in Maths (mG1, taken from school reports), sex and mother's education. When we take into account the overall contribution from the initial Maths grade, a statistically significant result was not found for sex or mother's education as predictors. We saw how the inclusion of the mother's education switch terms gave a separate intercept for each level of mother's education relative to the baseline (no education), while the slope with respect to initial grade remained common to all groups. The above analysis supports not including mother's education in a predictive set with initial grade.

Generalisation of Linear Regression Model

To generalise our model to the population there are a number of assumptions underpinning linear regression that must be validated. For every one unit of change in the independent variable there should be a consistent and uniform change in the dependent variable; this is what allows us to use the equation of a line to model the association between predictor and outcome variables. Before we build our model we need to validate our assumptions regarding associations. If we are asserting a relationship, we need to investigate whether there is any evidence of a relationship using correlation and make a decision based on the results (strength, direction etc.). If we are asserting a differential effect for different groups, we need to investigate whether there is any difference using the appropriate test and make a decision based on the result.

Using linear regression as a predictive model requires that a straight line be a good representative model of the underlying relationship. How well a straight line serves as a model is determined by the residuals. We want the residuals to follow a normal distribution, which minimises the opportunity for error to emerge simply because the data is skewed. We want constant error variance, because non-constant variance would pull our outcome in particular directions and increase our chances of a type 1 error. An outlier is any observation that is particularly large or small. A leverage point is an observation far from the average of the predictor values, with the potential to change the gradient of the line of best fit, and an influential observation is one whose removal would change the slope of the line. If we have a particularly large value in our data as an outlier, it is likely that we will have a particularly large residual in our regression model, which will pull the fit, and therefore the prediction, in a particular direction.

Step 1: Determine our outliers in terms of residuals

The first thing we need to do is identify our outliers in terms of our residuals. If we find outliers that are particularly extreme, it may be acceptable to delete them and repeat the regression without those cases included. The danger if we simply remove cases is that remaining cases that were not outliers may become outliers. It may also be that we have to retain an outlier because it is a particularly important case in our data set. It is a good idea to perform a robustness test to determine how susceptible the model is to assumptions regarding outliers. To do this we run the regression data analysis twice, once with and once without the outliers.
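
A minimal sketch of that robustness check, assuming a data frame dat with outcome y and predictor x (hypothetical names; the models actually used in this section are fitted further below):

fit_all  <- lm(y ~ x, data = dat)                              # fit using all cases
cooks    <- cooks.distance(fit_all)                            # influence of each case
fit_trim <- lm(y ~ x, data = dat[cooks <= 4 * mean(cooks), ])  # refit without influential cases
round(coef(fit_all) - coef(fit_trim), 3)                       # large shifts indicate sensitivity to outliers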

Next, we quantify how far away from normal the distribution is. To do this, we calculate statistics for skew and kurtosis and standardise them so we can compare them to heuristics. Standardised scores (value/std.error) for skewness between +/-2 (1.96 rounded) are considered acceptable in order to assume a normal distribution. Residuals for model 1: mG1 as a predictor of mG3, are within an acceptable range. We further investigated by exploring outliers: how many of them there are or whether we can transform the problematic variable to become more normal.

In terms of quantifying the proportion of the residual that is not normal, we generated standardised z scores for the variable and calculated the percentage of the standardised scores that are outside an acceptable range. No variable exceeded our acceptable range for the 95% significance level. Based on this assessment, residual skewness due to outliers is not an issue and residuals can be treated as normally distributed.

An analysis of standard residuals was carried out, showing that the data contained no additional outliers once zero scores had been removed. (Std. Residual Min = -3.61, Std. Residual Max = 3.59).

###########
# Part: Model
###########
#------------ Second Model: mG3 predicted by mG1 using sex as a dummy variable -----------#
model2 <- lm(tbl_sperf_sex_diff$mG3~tbl_sperf_sex_diff$mG1+as.integer(tbl_sperf_sex_diff$sex=='F')) # Male baseline
############
# PART: check
############
std_skew <- list()
std_kurt <- list()
gt_196 <- list()
gt_329 <- list()
#--------------- Check for percentage outliers ---------------#
variable <- "model2_stats"
st <- as.data.frame(pastecs::stat.desc(model2$residuals, basic = F))
model2_stats <- st %>% transpose()
colnames(model2_stats) <- rownames(st)
rownames(model2_stats) <-('Model 2: Residuals')

## Build standard statistics
tpskew <- semTools::skew(model2$residuals)
tpkurt <- semTools::kurtosis(model2$residuals)
std_skew[[variable]] <- tpskew[1] / tpskew[2]
std_kurt[[variable]] <- tpkurt[1] / tpkurt[2]
z_score <- abs(scale(model2$residuals))
gt_196[[variable]] <- FSA::perc(as.numeric(z_score), 1.96, "gt") # 95% within +/- 1.96
gt_329[[variable]] <- FSA::perc(as.numeric(z_score), 3.29, "gt") # 99.7% within +/- 3.29 for larger distributions

## Build output
model2_stats$std_skew <- std_skew
model2_stats$std_kurt <- std_kurt
model2_stats$gt_2sd <- gt_196
model2_stats$gt_3sd <- gt_329

#-------- Pretty print my table -------#
model2_stats %>%
  kbl(caption = "Summary statistics for Residuals - mG3 predicted by mG1 - sex as a dummy variable (zero scores removed)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Summary statistics for Residuals – mG3 predicted by mG1 – sex as a dummy variable (zero scores removed)
median mean SE.mean CI.mean.0.95 var std.dev coef.var std_skew std_kurt gt_2sd gt_3sd
Model 2: Residuals -0.1102192 0 0.0786984 0.1548003 2.099575 1.448991 1.537045e+17 1.161807 -1.42829 5.014749 0

############
# PART: Visualisation Histograms
############
# Plot single variable

binwidth <- .5
gs1 <- model2 %>% ggplot(aes(x = model2$residuals))
gs1 <- gs1 + geom_histogram(binwidth = binwidth, colour = "black", aes(y = ..density.., fill = ..count..))
gs1 <- gs1 + stat_function(fun = dnorm, color = "red", args = list(mean = model2_stats$mean, sd = model2_stats$std.dev), na.rm = TRUE)
gs1 <- gs1 + labs(x = "Model 2: mG3 predicted by mG1 + sex - Residuals")
gs1 <- gs1 + scale_fill_gradient("Count", low = "#DCDCDC", high = "#7C7C7C")

############
# PART: Visualisation Histograms Q-Q Plot
############

gs2 <- model2 %>% ggplot(aes(sample = model2$residuals)) +
        stat_qq() +
        stat_qq_line(linetype = "dotted", color = "red", size = 1) +
        theme_bw() +
        labs(title = "Q-Q Plot of Model 2 Residuals", subtitle = "mG1 as predictor of mG3 dummy sex")

plot_grid(gs1, gs2, labels = "auto", ncol = 2)
plot of chunk Visualisation Histograms for residuals

Influential Outliers

#--------------- Influential Outliers - Cook's distance ---------------#
cooksd<-sort(cooks.distance(model2))
# plot Cook's distance
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance")
abline(h = 4*mean(cooksd, na.rm=T), col="red")  # add cutoff line
text(x = seq_along(cooksd) +1, y =cooksd, labels =ifelse(cooksd>4*mean(cooksd, na.rm =T), names(cooksd), ""), col ="red")  # add labels
plot of chunk Influential outliers

#--------------- find rows related to influential observations ---------------#
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))])  # influential row numbers
stem(influential)

##
##   The decimal point is 2 digit(s) to the right of the |
##
##   0 | 0368
##   1 | 268999
##   2 | 034556
##   3 | 0

head(tbl_sperf_sex_diff[influential, ])  # influential observations.

##     mG1 mG2 mG3 pG1 pG2 pG3 sex Medu int_sex_male int_sex_female medu_0 medu_1 medu_2 medu_3 medu_4
## 157  11   8   8   7   8   8   M    2           11              0      0      0      1      0      0
## 262  14  12  11  14  15  15   M    3           14              0      0      0      0      1      0
## 202  11  15  15   9  11  11   M    4           11              0      0      0      0      0      1
## 194  15  16  18  14  14  14   M    4           15              0      0      0      0      0      1
## 304  15  12  12  10  10  11   M    3           15              0      0      0      0      1      0
## 181  10  13  14  12  13  12   M    4           10              0      0      0      0      0      1

head(tbl_sperf_sex_diff[influential, ]$mG3)  # influential observations - look at the values of mG3

## [1]  8 11 15 18 12 14

head(tbl_sperf_sex_diff[influential, ]$mG1)  # influential observations - look at the values of mG1

## [1] 11 14 11 15 15 10

car::outlierTest(model2) # Bonferonni p-value for most extreme obs - Are there any cases where the outcome variable has an unusual variable for its predictor values?

## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##      rstudent unadjusted p-value Bonferroni p
## 236 -2.518902           0.012237           NA

car::leveragePlots(model2) # leverage plots
plot of chunk Influential outliers

Step 2: Check residuals for Constant Variation of error (Homoscedasticity)

Sometimes our residuals can display heteroscedasticity, where the variance of the errors differs across the range of fitted values, for example larger error variance at higher values of a predictor than at lower values. In a well-fitting regression model the error variation around the reference line is constant (homoscedasticity). If we detect heteroscedasticity, multiple linear regression is not an acceptable predictive model, even if there are large and statistically significant coefficients, because of the possibility of a type 1 error.
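
As a formal complement to the residual plots below, the car package provides a test of non-constant error variance; a small p-value would indicate heteroscedasticity. This check was not part of the original output and is included here as an optional sketch:

#Optional formal check for non-constant error variance (Breusch-Pagan style test from the car package)
car::ncvTest(model2)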

The data also met the assumption of non-zero variances (Initial Maths grade variance = 10.49, sex = 0.25).

plot(model2, 1)
plot of chunk unnamed-chunk-5

plot(model2, 3)
plot of chunk unnamed-chunk-5

#The first plot shows residuals vs fitted values; in the second plot the standardised residuals are on the Y axis. If there is no heteroscedasticity, you should see a completely random, equal distribution of points throughout the range of the X axis and a flat red line. We really want to see no pattern in the residuals and that they are equally spread around the dashed y = 0 line.
#In our case the red line is slightly distorted in the middle on plot 1, but this is not a big problem. Looking at the second plot we can see that, while there is some unevenness, it is not severe. It would only be a concern if there were definite patterns.

#Create a QQ plot of the studentized residuals
car::qqPlot(model2, main = "QQ Plot")
plot of chunk unnamed-chunk-5

## [1]   2 236

var_mg1 <- var(as.numeric(tbl_sperf_sex_diff$mG1), na.rm = TRUE) %>% round(2)
var_sex <- var(as.numeric(tbl_sperf_sex_diff$sex=='M'), na.rm = TRUE) %>% round(2)

Step 3: Check for Collinearity

Collinearity occurs when two or more independent variables contain redundant information. If variables are collinear then it means there is not enough distinct information in these variables for MLR to operate and they are essentially measuring the same thing. The problem is that this concept could end up over represented in our model. A correlation coefficient above 0.8 suggests collinearity may be present. If it is, we need to analyse the variables in separate regression equations and then examine how they operate together.

Tests to see whether model 2 met the assumption of no collinearity indicated that multicollinearity was not a concern (Sex, Tolerance = .98, VIF = 1.02; Initial Maths grade, Tolerance = .98, VIF = 1.02).

#Calculate Collinearity
vifmodel<-car::vif(model2)
vifmodel

##                    tbl_sperf_sex_diff$mG1 as.integer(tbl_sperf_sex_diff$sex == "F")
##                                  1.020949                                  1.020949

#Calculate tolerance
1/vifmodel

##                    tbl_sperf_sex_diff$mG1 as.integer(tbl_sperf_sex_diff$sex == "F")
##                                 0.9794811                                 0.9794811

Step 4: Check for Causality

To assert causality we need to assume association, time order and non spuriousness. Statistical methods can justify association assumptions but not time order; that depends on our conceptual framework. We control for spuriousness by introducing controlling measures into our predictive model.

Reporting MLR – Model 2

For expedience, I have only reported on model 2. A multiple regression analysis was conducted to determine if a student’s score for initial maths grade and sex could predict a student’s final academic achievement.

In order to include sex in the regression model it was recoded into a single dummy variable (0 for male, 1 for female).

Examination of the histogram, the normal P-P plot of standardised residuals and the scatterplot of standardised residuals against the predictor (initial Maths grade) showed that some outliers existed even after removing the zero-score records we deemed invalid earlier. However, examination of the standardised residuals showed that none could be considered to have undue influence (95% within the limits of -1.96 to +1.96, and none with Cook's distance > 1, as outlined in Field (2013)).

Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5), as outlined in Tarling (2008). The scatterplot of standardised residuals showed that the data met the assumptions of homogeneity of variance and linearity. The data also meets the assumption of non-zero variances of the predictors.

Logistic Regression

In linear regression, our outcome variable is continuous and is predicted from one or more continuous or categorical predictor variables. If the outcomes we are interested in are categorical in nature, we cannot use linear regression, as there is no linear relationship between the predictor and outcome variables. To predict categorical outcomes, we use logistic regression. If the outcome variable has exactly two values, for example predicting pass or fail for a student, we use a binary logistic regression. When the outcome has more than two categories, we use a multinomial logistic regression. The output from a logistic regression is not an estimated value but rather a probability that the given input belongs to a certain category or not. Because we are looking at probability, we choose one category as the category of interest. The event we are interested in is the event where that category occurs. The reference event is when the event of interest does not occur, and we are interested in the probability of the event of interest occurring versus the probability of it not occurring.

Binary Logistic Regression

To illustrate this methodology we will make use of the student performance data set to build a model to predict if students sat the final maths examination, using gender as the predictor.
HA: Gender is a factor in determining if a student will sit the final maths examination
Our outcome variable is the probability that a student sat the exam. As such we give zero to the "no" category, which is our reference category, and 1 to our category of interest, which is "yes" they did sit the exam. For our predictor we used sex, which takes the value Male or Female.

Note: We’re building a model with one predictor to illustrate our understanding of how logistic regression works. The starting point is a simple logistic regression, which we will build upon. We gain no new insights from this over and above a difference test.

Step 1: Run Simple Logistic Regression
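
The output below was produced by a call of roughly the following form; the model object name logmodel1 is taken from the goodness-of-fit output later in this section, while logmodel_null and the use of the lmtest package for the likelihood ratio test are assumptions:

logmodel1 <- glm(tbl_sperf_sex_medu_diff$sat_maths ~ tbl_sperf_sex_medu_diff$sex,
                 data = tbl_sperf_sex_medu_diff,
                 na.action = na.exclude,
                 family = binomial(link = logit))
summary(logmodel1)                       # coefficients on the log-odds scale
logmodel_null <- glm(tbl_sperf_sex_medu_diff$sat_maths ~ 1,
                     data = tbl_sperf_sex_medu_diff,
                     na.action = na.exclude,
                     family = binomial(link = logit))
lmtest::lrtest(logmodel1, logmodel_null) # likelihood ratio test against the intercept-only model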

##
## Call:
## glm(formula = tbl_sperf_sex_medu_diff$sat_maths ~ tbl_sperf_sex_medu_diff$sex,
##     family = binomial(link = logit), data = tbl_sperf_sex_medu_diff,
##     na.action = na.exclude)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.2101   0.4265   0.4265   0.4970   0.4970
##
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)
## (Intercept)                    2.0293     0.2218   9.149   <2e-16 ***
## tbl_sperf_sex_medu_diff$sexM   0.3221     0.3430   0.939    0.348
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 251.86  on 381  degrees of freedom
## Residual deviance: 250.97  on 380  degrees of freedom
## AIC: 254.97
##
## Number of Fisher Scoring iterations: 5

## Likelihood ratio test
##
## Model 1: tbl_sperf_sex_medu_diff$sat_maths ~ tbl_sperf_sex_medu_diff$sex
## Model 2: tbl_sperf_sex_medu_diff$sat_maths ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   2 -125.48
## 2   1 -125.93 -1 0.8929     0.3447

The Pr(>|z|) column shows the two-tailed p-values testing the null hypothesis that each coefficient is equal to zero (i.e. no significant effect). The usual cut-off value is 0.05. From this model it appears that sex does not have a significant effect on the log-odds of the outcome variable, whether or not a student sat the final maths exam. The Estimate column shows how much of a contribution each predictor makes to the outcome. The results indicate that being male increases the expected log odds by 0.3221 in comparison to being female; however, the result is not statistically significant (p = 0.348).

Step 2: Calculate odds ratio and probability

To calculate the odds ratio, which indicates the change in odds resulting from a unit change in the predictor, we take the exponential of the coefficients. We see the odds of sitting the maths exam are 7.6 to 1 for female students, and that ratio increases by a factor of 1.38, to roughly 10.5 to 1, if a student is male. From this we can see the probability of sitting the final maths exam is .88 for female students and .91 for male students.
Summary of coefficients
Estimate Odds_ratio Probability
(Intercept) 2.0293 7.6087 0.8838384
tbl_sperf_sex_medu_diff$sexM 0.3221 1.3800 0.9130435
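
A sketch of how the odds ratios and probabilities above can be reproduced from the fitted model (assuming the logmodel1 object from the earlier sketch):

exp(coef(logmodel1))                  # odds ratios: ~7.61 (intercept) and ~1.38 (male)
odds_female <- exp(coef(logmodel1))[1]
odds_male   <- exp(sum(coef(logmodel1)))   # intercept plus male coefficient
odds_female / (1 + odds_female)       # probability of sitting the exam when female, ~.88
odds_male / (1 + odds_male)           # probability of sitting the exam when male, ~.91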

Step 3: Check goodness of fit of the model over the baseline model

In linear regression we check goodness of fit by comparing our results to the results we would have gotten using the mean as the baseline model. For logistic regression an omnibus test is used to check that the new model (with explanatory variables included) is an improvement over the baseline model which is a sum of frequencies of occurrence of each category. We use chi-square tests to see if there is a significant difference between the baseline model (null) and the model we have created.

Below we can see our model is not statistically significant (Pr(>Chisq) = 0.3447), which means it cannot be assumed to be a better predictor than the baseline model. This was not surprising since our predictor variable (sex) was shown to not be statistically significant.
Summary statistics for regression compared to baseline model
#Df LogLik Df Chisq Pr(>Chisq)
2 -125.4838 NA NA NA
1 -125.9302 -1 0.8929079 0.3446905

Step 4: Determine usefulness

We can get a pseudo R-squared value to determine how useful the model is. From linear regression we understood that R2 was a measure of how much of the outcome variable was explained by the predictor variable. Cox and Snell R2 and Nagelkerke R2 are pseudo R2 statistics we can generate. From these we see that student sex explains between 0.23% and 0.48% of whether a student will sit the final maths exam or not, but we need to keep in mind the result was not statistically significant.
Summary of Pseudo R-Squared values
CoxSnell Nagelkerke
R-Squared 0.0023 0.0048
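
These pseudo R-squared values can be reproduced directly from the model deviances; the sketch below uses the standard Cox and Snell and Nagelkerke formulas (packages such as DescTools offer the same calculation), again assuming the logmodel1 object:

null_dev <- logmodel1$null.deviance
mod_dev  <- logmodel1$deviance
n_obs    <- length(logmodel1$fitted.values)
cox_snell  <- 1 - exp((mod_dev - null_dev) / n_obs)      # ~0.0023
nagelkerke <- cox_snell / (1 - exp(-null_dev / n_obs))   # ~0.0048
round(c(CoxSnell = cox_snell, Nagelkerke = nagelkerke), 4)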

Step 5: Sensitivity Analysis and Summary Stats.

We need to understand how well the model is working. The two main measures are sensitivity and specificity. Sensitivity refers to the proportion of true positives that were correctly identified by the model (correctly identified that a student sat the exam), and specificity is the proportion of true negatives (predicted to be negative and actually negative). We assess this using a ROC chart. Below we can see that our true positive rate is 49.0% and our true negative rate is 59.0%. The positive predictive value is the proportion of cases classified by our model as sitting the exam that actually did; the negative predictive value is the corresponding proportion for cases classified as not sitting it. To quantify the performance of a classifier from the ROC curve we use the Area Under the Curve (AUC). We have an AUC of .54, which is considered weak.
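
A minimal sketch of an ROC check using the pROC package (the original chart may have been produced with a different package; logmodel1 is assumed from the earlier sketch):

roc_obj <- pROC::roc(tbl_sperf_sex_medu_diff$sat_maths, fitted(logmodel1))
pROC::auc(roc_obj)                    # area under the curve, ~.54 as reported above
plot(roc_obj, print.thres = "best")   # ROC curve with the best threshold highlighted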

plot of chunk roc charts

Step 6: Check the assumption of linearity of independent variables

We only have one predictor so we don’t need to check for collinearity. We check the assumption of linearity of independent variables and log odds using a Hosmer-Lemeshow test.
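
The output below matches the format of generalhoslem::logitgof, so the call was presumably of the following form (an assumption). Note that with a single binary predictor the fitted probabilities take only two distinct values, so the test is essentially degenerate here, which is why the output shows negative degrees of freedom and an NA p-value:

#Assumed call: Hosmer-Lemeshow goodness of fit via the generalhoslem package
generalhoslem::logitgof(tbl_sperf_sex_medu_diff$sat_maths, fitted(logmodel1))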

##
##  Hosmer and Lemeshow test (binary model)
##
## data:  tbl_sperf_sex_medu_diff$sat_maths, fitted(logmodel1)
## X-squared = 5.6128e-26, df = -1, p-value = NA

Reporting Binary Logistic Regression

A binary logistic regression model was built of student propensity to sit the final Maths examination as predicted by student sex. Our outcome variable is the probability of a student sitting the final Maths exam given their sex. As such, we give zero to the "no" category, which is our reference category, and 1 to our category of interest, which is "yes" they did sit the exam. For our predictor we used sex, with values Male and Female. Being male was not found to be a statistically significant (p = .348) predictor of whether a student will or will not sit the examination. Female students have an odds ratio of 7.60 to 1 of sitting the exam, while male students' odds were 1.38 times higher, which corresponds to probabilities of .88 and .91 respectively. R2 = .0023 (Cox-Snell), .0048 (Nagelkerke), Pr(>Chisq) = .34.

##
## =============================================
##                       Dependent variable:
##                   ---------------------------
##                            sat_maths
## ---------------------------------------------
## sexM                         0.322
##                             (0.343)
##
## Constant                   2.029***
##                             (0.222)
##
## ---------------------------------------------
## Observations                  382
## Log Likelihood             -125.484
## Akaike Inf. Crit.           254.968
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01
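The summary table above is consistent with the stargazer package; a minimal sketch of the call, assuming that package was used:

library(stargazer)

# Text summary of the fitted logistic regression
stargazer(logmodel1, type = "text")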

Extra: Extending the model to include Parents’ Education.

Similar to linear regression, we can extend logistic regression to include multiple predictors. Our goal is to explore whether we can predict if a student will sit the final Maths exam on the basis of their sex and their parents’ education. We introduce a new categorical variable representing the highest educational achievement of either parent, with 3 levels: at least one parent with a higher degree, at least one parent with a level 5 qualification (secondary school), and neither parent with a level 5 qualification. At least one parent with a higher degree was taken as the baseline reference category. A hypothetical sketch of how such a variable might be derived is shown below.
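The exact derivation of pedu_paired is not shown here; purely as an illustration, a hypothetical sketch of how such a paired variable might be built from the mother’s and father’s education columns (Medu and Fedu in the student performance dataset), assuming the usual 0 to 4 coding where 4 is a higher degree and 3 is a secondary (level 5) qualification:

# Hypothetical sketch: take the highest parental education level and recode it
highest <- pmax(tbl_sperf_sex_medu_diff$Medu, tbl_sperf_sex_medu_diff$Fedu)

tbl_sperf_sex_medu_diff$pedu_paired <- factor(
  ifelse(highest == 4, "At least one parent with a higher degree",
  ifelse(highest == 3, "At least one parent with a level 5 qualification",
                       "Neither parent with a level 5 qualification")),
  levels = c("At least one parent with a higher degree",
             "At least one parent with a level 5 qualification",
             "Neither parent with a level 5 qualification"))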

Step 1: Run General Logistic Regression

We find that the overall model is not statistically significant (Pr(>Chisq) = .077) but the category ‘Neither parent with a level 5 qualification’ is statistically significant at the P < 0.05 significance level.

##
## Call:
## glm(formula = sat_maths ~ sex + pedu_paired, family = binomial(link = logit),
##     data = tbl_sperf_sex_medu_diff, na.action = na.exclude)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.3552   0.3708   0.4099   0.5458   0.6193
##
## Coefficients:
##                                                              Estimate Std. Error z value Pr(>|z|)
## (Intercept)                                                   2.36843    0.34178   6.930 4.22e-12 ***
## sexM                                                          0.27464    0.34668   0.792   0.4283
## pedu_paired At least one parent with a level 5 qualification  0.06604    0.49488   0.133   0.8938
## pedu_paired Neither parent with a level 5 qualification      -0.81430    0.38806  -2.098   0.0359 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 251.86  on 381  degrees of freedom
## Residual deviance: 245.00  on 378  degrees of freedom
## AIC: 253
##
## Number of Fisher Scoring iterations: 5

## Likelihood ratio test
##
## Model 1: sat_maths ~ sex + pedu_paired
## Model 2: sat_maths ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   4 -122.50
## 2   1 -125.93 -3 6.8576    0.07658 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 2: Calculate odds ratio and probability

Student sex remained a statistically insignificant variable. The only level of parental educational achievement that was significant was where neither parent had a level 5 education.
Summary of coefficients
Estimate Odds_ratio
(Intercept) 2.3684 10.6806
sexM 0.2746 1.3161
pedu_paired At least one parent with a level 5 qualification 0.0660 1.0683
pedu_paired Neither parent with a level 5 qualification -0.8143 0.4430
Summary of coefficient probabilities
X.Intercept.
Female 0.91
Male 0.93
Female_level5 0.92
Male_level5 0.94
Female_nolevel5 0.83
Male_nolevel5 0.86

## Predicted levels same as naive model (majority level)
Confusion Matrix
Predicted 1
Actual 0 39
Actual 1 343
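A minimal sketch of how the odds ratios and probabilities above might be derived from the fitted model coefficients (the object name logmodel2 is taken from the test output later in this section; coefficients are indexed in the order shown in the model summary):

co <- coef(logmodel2)
exp(co)   # odds ratios for each coefficient

# The probability for a given profile is plogis() of the summed log odds,
# e.g. a male student where neither parent has a level 5 qualification
# (intercept + sexM + neither-parent term), which reproduces the .86 above
plogis(co[1] + co[2] + co[4])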

Step 3: Check goodness of fit of the model over the baseline model

Summary statistics for regression compared to baseline model
#Df LogLik Df Chisq Pr(>Chisq)
4 -122.5014 NA NA NA
1 -125.9302 -3 6.85764 0.0765764

Step 4: Determine usefulness

The Pseudo R squared statistics show that our extended model explains between 1.78% and 3.69% of the variation in the outcome variable. This is consistent with our findings above.
Summary of Pseudo R-Squared values
CoxSnell Nagelkerke
R-Squared 0.0178 0.0369

Step 5: Sensitivity Analysis and Summary Stats.

When we compare the performance of the models using sensitivity analysis, the AUC has increased to 0.61 compared to the first model, but both models are still considered weak.

Figure: ROC chart for the extended model (sex and parental education)

Step 6: Check the assumption of linearity of independent variables

We have two predictor variables so we need to check for collinearity. Conceptually, student sex and parental educational achievement should not be collinear, but the step is included for illustration. We check the assumption of linearity of independent variables and log odds using a Hosmer-Lemeshow test, and report the generalised VIF for each predictor together with its reciprocal (the tolerance) below.

##
##  Hosmer and Lemeshow test (binary model)
##
## data:  tbl_sperf_sex_medu_diff$sat_maths, fitted(logmodel2)
## X-squared = 6.3334, df = 3, p-value = 0.09647

##                 GVIF Df GVIF^(1/(2*Df))
## sex         1.005222  1        1.002608
## pedu_paired 1.005222  2        1.001303

##                  GVIF  Df GVIF^(1/(2*Df))
## sex         0.9948052 1.0       0.9973992
## pedu_paired 0.9948052 0.5       0.9986988
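The collinearity statistics above can be produced with the car package; a minimal sketch, assuming that package was used (the second block is simply the reciprocal of the first, i.e. the tolerance):

library(car)

vif(logmodel2)      # generalised VIF for each predictor
1 / vif(logmodel2)  # tolerance (reciprocal of the GVIF)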

Reporting Binary Logistic Regression with two Predictors

A binary logistic regression model was built of student propensity to sit the final Maths examination as predicted by student sex and parental educational achievement. Our outcome variable is the probability of a student sitting the final Maths exam given their sex and their parents’ highest educational achievement. We give zero to the “no” category, which is our reference category, and 1 to our category of interest, which is “yes” they did sit the exam. For our predictors we used sex, with values Male and Female, and parental education with 3 levels: at least one parent with a level 5 qualification, at least one parent with a higher degree, and neither parent with a level 5 qualification. At least one parent with a higher degree was our reference category. Neither parent having a level 5 education was shown to be statistically significant (P < .05). Being Male was still not found to be a statistically significant (P = .43) predictor of whether a student will or will not sit the examination. Female students have odds of 10.68 to 1 of sitting the exam, while Male students’ odds were 1.31 times higher, which corresponds to probabilities of .91 and .93 respectively. The lowest probability of sitting the Maths exam (.83) was for Female students where neither parent has a level 5 education. R2 = .0178 (Cox-Snell), .0369 (Nagelkerke), Pr(>Chisq) = .077.

Multinomial logistic regression

We can also use logistic regression to predict if a case will be a member of more than two categories. To illustrate this methodology, we will make use of the student performance data set to build a model to predict student alcohol consumption on the basis of performance.
HA: Subject performance, failures, and absences are good predictors of student tendency to consume alcohol

Step 1: Quick summary of the data

A general guideline is that we need a minimum of 10 cases with the least frequent value for each categorical variable in our model. A quick look at the data shows we meet this requirement.
Summary of Weekend Alcohol Consumption (Walc.p)
Freq % Valid % Valid Cum. % Total % Total Cum.
1 144 37.696335 37.69634 37.696335 37.69634
2 86 22.513089 60.20942 22.513089 60.20942
3 76 19.895288 80.10471 19.895288 80.10471
4 49 12.827225 92.93194 12.827225 92.93194
5 27 7.068063 100.00000 7.068063 100.00000
<NA> 0 NA NA 0.000000 100.00000
Total 382 100.000000 100.00000 100.000000 100.00000
Summary of Student Failures (failures.p)
n median mad min max range skew kurtosis se
X1 382 0 0 0 3 3 4.150294 17.77067 0.0262603
Summary of Student Absences (absences.p)
n median mad min max range skew kurtosis se
X1 382 2 2.9652 0 32 32 2.165519 6.17826 0.251011

Step 2: Run Multinomial Logistic Regression

##         Walc.p
## famrel.p  1  2  3  4  5
##        1  2  3  1  0  2
##        2  6  3  3  4  2
##        3 19 16 14 13  5
##        4 71 39 41 21 12
##        5 46 25 17 11  6

##          M       SD
## 1 13.13194 2.430133
## 2 12.68605 2.963550
## 3 12.36842 3.281634
## 4 11.48980 2.731929
## 5 10.96296 3.787444

##            M        SD
## 0  12.720000 3.4740070
## 1  14.333333 1.5275252
## 2  12.984615 2.3617546
## 3   8.500000 2.1213203
## 4  12.393443 2.4985242
## 6  12.657895 1.9765697
## 7  15.500000 0.7071068
## 8  12.263158 2.4909192
## 9  10.000000        NA
## 10 12.555556 3.0867099
## 12 10.428571 3.2586880
## 14 11.333333 3.0110906
## 15 11.000000 1.4142136
## 16 10.000000 1.7320508
## 18 13.500000 0.7071068
## 21 13.000000        NA
## 22  7.333333 2.3094011
## 30 16.000000        NA
## 32 14.000000        NA

## # weights:  20 (12 variable)
## initial  value 614.805283
## iter  10 value 556.514900
## final  value 549.886689
## converged

## Call:
## multinom(formula = Walc.p ~ absences.p + pG3, data = tbl_sperf_family_alc)
##
## Coefficients:
##   (Intercept) absences.p         pG3
## 2  0.01087414 0.04400762 -0.05210308
## 3  0.49526196 0.02913020 -0.09622110
## 4  0.96253579 0.06192398 -0.18320827
## 5  0.66035020 0.09538812 -0.22493513
##
## Std. Errors:
##   (Intercept) absences.p        pG3
## 2   0.7013638 0.03074781 0.05172324
## 3   0.7022372 0.03279891 0.05249118
## 4   0.7414868 0.03397290 0.05753070
## 5   0.8471884 0.03716125 0.06793020
##
## Residual Deviance: 1099.773
## AIC: 1123.773
Summary of predictor p-values Pr(>|z|)
(Intercept) absences.p pG3
2 0.9876299 0.1523603 0.3137696
3 0.4806466 0.3744626 0.0667891
4 0.1942475 0.0683418 0.0014499
5 0.4357083 0.0102621 0.0009287

## # weights:  10 (4 variable)
## initial  value 614.805283
## iter  10 value 563.601169
## iter  10 value 563.601167
## final  value 563.601167
## converged

## Likelihood ratio test
##
## Model 1: Walc.p ~ absences.p + pG3
## Model 2: Walc.p ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1  12 -549.89
## 2   4 -563.60 -8 27.429  0.0005959 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
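For reference, a minimal sketch of how this model and its baseline comparison might be fitted, assuming the nnet and lmtest packages and the object and column names shown in the output above:

library(nnet)
library(lmtest)

# Multinomial model: weekend alcohol consumption predicted by Portuguese
# absences and the final Portuguese grade
logmodel3 <- multinom(Walc.p ~ absences.p + pG3, data = tbl_sperf_family_alc)

# Baseline (intercept-only) model for comparison
baseline3 <- multinom(Walc.p ~ 1, data = tbl_sperf_family_alc)

# Likelihood ratio test of the fitted model against the baseline
lrtest(logmodel3, baseline3)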

Step 3: Calculate odds ratio and probability

The relative odds ratio for a one-unit increase in Portuguese final grade is .9492 for being a level 2 alcohol consumer versus level 1 (the likelihood is decreasing). The relative odds ratio for a one-unit increase in Portuguese absences is 1.045 for being a level 2 alcohol consumer versus level 1 (the likelihood is increasing).
Summary of coefficients (Odds Ratios)
X.Intercept. absences.p pG3
2 1.0109 1.0450 0.9492
3 1.6409 1.0296 0.9083
4 2.6183 1.0639 0.8326
5 1.9355 1.1001 0.7986
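These relative odds ratios are obtained by exponentiating the multinomial model coefficients; a minimal sketch, using the logmodel3 object named in the test output later in this section:

exp(coef(logmodel3))  # relative odds ratios versus the reference level (1)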

##   absences.p pG3 variable probability
## 1          0   0        1   0.1218671
## 2          1   1        1   0.1323312
## 3          2   2        1   0.1433149
## 4          3   3        1   0.1547898
## 5          4   4        1   0.1667205
## 6          5   5        1   0.1790644
Figure: predicted probabilities from the multinomial model across predictor values
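A minimal sketch of how such predicted probabilities might be generated and plotted (the grid of predictor values here is illustrative, matching the table above):

# Predicted probability of each alcohol-consumption level over a small grid
new_data <- data.frame(absences.p = 0:5, pG3 = 0:5)
probs <- predict(logmodel3, newdata = new_data, type = "probs")

# Plot the probability of the reference level (1, very low consumption)
plot(new_data$absences.p, probs[, 1], type = "b",
     xlab = "Predictor value (absences.p = pG3)", ylab = "P(Walc.p = 1)")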

Step 4: Check goodness of fit of the model over the baseline model

Below we can see our model is statistically significant (Pr(>Chisq) < .001), which means it can be assumed to be better than the baseline.

## # weights:  10 (4 variable)
## initial  value 614.805283
## iter  10 value 563.601169
## iter  10 value 563.601167
## final  value 563.601167
## converged
Summary statistics for regression compared to baseline model
#Df LogLik Df Chisq Pr(>Chisq)
12 -549.8867 NA NA NA
4 -563.6012 -8 27.42896 0.0005959

Step 5: Determine usefulness

The pseudo R-squared statistics show that our model explains between 6.93% and 7.31% of the variation in the outcome variable. This is consistent with our findings above.
Summary of Pseudo R-Squared values
CoxSnell Nagelkerke
R-Squared 0.0693 0.0731

Step 6: Check the assumption of linearity of independent variables

We checked the assumption of linearity of independent variables and log odds using a Hosmer-Lemeshow test and found it was not statistically significant. Multicollinearity analysis showed that for pG3 the tolerance and variance inflation factor measures were not within acceptable levels (tolerance > 0.4, VIF < 2.5) as outlined in Tarling (2008). As such we may have problems with collinearity.

##
##  Hosmer and Lemeshow test (multinomial model)
##
## data:  tbl_sperf_family_alc$Walc.p, fitted(logmodel3)
## X-squared = 34.594, df = 32, p-value = 0.345
Multicollinearity tolerance and VIF
VIF.Tolerance vifmodel
absences.p 0.4075424 2.453733
pG3 0.0335390 29.816033

Reporting Multinomial Logistic Regression

A multinomial logistic regression model of student alcohol consumption as predicted by student performance and absenteeism was built. Our outcome variable is the probability of a student having each of the alcohol consumption levels given their performance and absenteeism rates. Our reference outcome category was level 1, very low consumption. For our predictors, we used student performance in Portuguese and absences from Portuguese class. The model was statistically significant (Pr(>Chisq) < .001) and explains between 6.93% and 7.31% of the variation in the outcome variable. The level of collinearity between predictors was outside the acceptable range, however.

##
## =========================================================
##                             Dependent variable:
##                   ---------------------------------------
##                       2         3         4         5
##                      (1)       (2)       (3)       (4)
## ---------------------------------------------------------
## absences.p          0.044     0.029    0.062*    0.095**
##                    (0.031)   (0.033)   (0.034)   (0.037)
##
## pG3                -0.052    -0.096*  -0.183*** -0.225***
##                    (0.052)   (0.052)   (0.058)   (0.068)
##
## Constant            0.011     0.495     0.963     0.660
##                    (0.701)   (0.702)   (0.741)   (0.847)
##
## ---------------------------------------------------------
## Akaike Inf. Crit. 1,123.773 1,123.773 1,123.773 1,123.773
## =========================================================
## Note:                         *p<0.1; **p<0.05; ***p<0.01