# Introduction

Before we can build a model using a predictor we need strong statistical as well as theoretical support. If we assert as part of our hypothesis that there is a relationship between two concepts, then we need to establish a) whether the relationship is statistically likely to exist, and b) how strong that relationship is. We do this using correlation testing. If as part of our hypothesis we are saying that a given concept is experienced differently by different groups in our population, then we must establish statistical evidence that there is a difference in the experience of this concept for the different groups. We do this using difference tests. Exploring, gathering and analysing the supporting statistical evidence for the inclusion of predictor variables in predictive models was the purpose of the explore and analyse phase. The purpose of this phase is to build and execute the predictive models, having already established justification for the hypothesis and the selection of predictor variables.

# Linear Regression

Linear regression, at its most fundamental, is a hypothetical model between two variables, using the equation of a line as our model equation. We model units of change in our outcome variable and attempt to determine how much each unit of change in one predictor contributes to that change. Our assumption is that there is a uniform pattern when we look at the average change. We can quantify how good our model is by comparing our linear model to the most basic reference model (the mean) to see how much extra explanation of the variance in the outcome variable we get. To do this, we square the difference between each point of our observed data and the mean and sum the result to get the total sum of squares (SSt) as a measure of variability between the scores and the mean. We also sum the squared differences between the observed data and the regression line to get the residual sum of squares (SSr) as a measure of variability between the scores and the regression model. The difference between the two is how much variability is accounted for by the model. Running an analysis of variance (ANOVA) produces an F-statistic, which is a ratio: it compares the amount of systematic variance in the data to the amount of unsystematic variance. In other words, it compares what we see to what we expect for a distribution with the same degrees of freedom. Through comparison to the standard F-distribution we can determine whether our model produces a statistically significant result.

## Simple Linear Regression of student performance
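In symbols, the variance decomposition described in the previous section is (with $y_i$ the observed scores, $\bar{y}$ their mean, and $\hat{y}_i$ the model's fitted values):

$$SS_t = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad SS_r = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad SS_m = SS_t - SS_r$$

The F-statistic is the ratio of systematic to unsystematic variance, each scaled by its degrees of freedom:

$$F = \frac{SS_m / df_m}{SS_r / df_r}$$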

To illustrate this methodology we will make use of the student performance data set to build a model to predict final student grade on the basis of initial assessment.

HA: Students who perform well as part of initial assessment in subjects will perform better overall.

### Step 1: Investigate Normality

Because simple linear regression is based on the equation for a line, our first step would be a test for normality of our continuous scale data. As part of the exploration phase we established that all performance variables in the dataset can be assumed to be approximately normally distributed, and this analysis is not repeated here. The summary statistics are included for convenience of reporting.

| | mean | SE.mean | CI.mean.0.95 | var | std.dev | coef.var | std_skew | std_kurt | gt_2sd | gt_3sd |
|---|---|---|---|---|---|---|---|---|---|---|
| mG1 | 11.28614 | 0.1758996 | 0.3459959 | 10.488890 | 3.238656 | 0.2869588 | 1.646218 | -2.505487 | 3.539823 | 0 |
| mG2 | 11.43953 | 0.1730551 | 0.3404007 | 10.152397 | 3.186283 | 0.2785327 | 1.551482 | -2.093272 | 7.079646 | 0 |
| mG3 | 11.61947 | 0.1769549 | 0.3480716 | 10.615123 | 3.258086 | 0.2803989 | 1.646578 | -1.701956 | 3.834808 | 0 |
| pG1 | 12.36578 | 0.1311432 | 0.2579596 | 5.830305 | 2.414602 | 0.1952648 | 1.024839 | -1.390545 | 3.539823 | 0 |
| pG2 | 12.47493 | 0.1302716 | 0.2562453 | 5.753068 | 2.398555 | 0.1922701 | 2.509813 | -1.315284 | 3.244838 | 0 |
| pG3 | 12.82301 | 0.1402359 | 0.2758450 | 6.666806 | 2.582016 | 0.2013580 | -1.47584 | 3.064222 | 5.014749 | 0.2949853 |

### Step 2: Investigate relationship between predictor and response variable.

As part of the exploration phase, we established through Pearson correlation that there was a strong, statistically significant (p < .001), positive relationship between initial grade in a subject and final grade in that subject. We also established the same for intermediate grade and final grade. This analysis is not repeated here.

### Step 3: Run Simple Linear Regression

Given the assumptions of normality and that a strong positive relationship exists, it is safe to use the student performance dataset to illustrate the simple linear regression predictive modelling technique.

*Note:* This is equivalent to doing a Pearson correlation again and is included for illustration.

```r
############
# PART: Linear Regression
############

# -------------- Linear Regression Math Grade 1 as a predictor for Math Grade 3 --------------- #
reg_result_model_lm <- lm(tbl_sperf_numerical_measurements$mG3 ~ tbl_sperf_numerical_measurements$mG1)

# -------------- Determine how good our model is versus using the mean --------------- #
anova(reg_result_model_lm) ## Get sums of squares of variability
```

```
## Analysis of Variance Table
## 
## Response: tbl_sperf_numerical_measurements$mG3
##                                       Df  Sum Sq Mean Sq F value    Pr(>F)    
## tbl_sperf_numerical_measurements$mG1   1 2877.40 2877.40  1364.8 < 2.2e-16 ***
## Residuals                            337  710.52    2.11                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```r
summary(reg_result_model_lm)
```

```
## 
## Call:
## lm(formula = tbl_sperf_numerical_measurements$mG3 ~ tbl_sperf_numerical_measurements$mG1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6590 -1.0644 -0.1635  1.1338  3.6383 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           1.45179    0.28630   5.071 6.55e-07 ***
## tbl_sperf_numerical_measurements$mG1  0.90090    0.02439  36.943  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.452 on 337 degrees of freedom
## Multiple R-squared:  0.802,  Adjusted R-squared:  0.8014 
## F-statistic:  1365 on 1 and 337 DF,  p-value: < 2.2e-16
```

```r
# lm.beta::lm.beta(reg_result_model_lm) # Not needed as units are the same for predictor and outcome
stargazer(reg_result_model_lm, type = "text")
```

```
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 mG3            
## -----------------------------------------------
## mG1                           0.901***         
##                               (0.024)          
##                                                
## Constant                      1.452***         
##                               (0.286)          
##                                                
## -----------------------------------------------
## Observations                    339            
## R2                             0.802           
## Adjusted R2                    0.801           
## Residual Std. Error       1.452 (df = 337)     
## F Statistic         1,364.757*** (df = 1; 337) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
```
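As a cross-check, the R² reported above can be recovered directly from the ANOVA sums of squares (model sum of squares over total sum of squares):

$$R^2 = \frac{SS_m}{SS_t} = \frac{2877.40}{2877.40 + 710.52} \approx 0.802$$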

### Report Simple Linear Regression

The relationship between initial grade in Maths (mG1, taken from school reports) and final Maths grade (mG3, taken from school reports) was modelled using a simple linear model. A strong positive relationship was found (R² = .802, n = 339, p < .001). This matches our earlier results from Pearson correlation. Our ANOVA test to evaluate goodness of fit found a statistically significant result (p < .001). The model has an R-squared value of .802, indicating that initial grade explains 80% of the variance in final grade in the population. Every 1 unit of change in the initial grade in Maths contributes 0.901 units of change to the final student grade in the population. Therefore there is good justification for using regression to look at prediction. This analysis supports the use of simple linear regression as a model to predict student final performance on the basis of initial grade.

## Multiple Linear Regression of student performance

Simple linear regression is a model to predict the value of one variable from another. Multiple regression is a natural extension of this model and is used to predict an outcome variable using multiple predictor variables, each of which contributes partially to the outcome. We are looking to determine how much the set of predictors collectively contributes to the outcome variable on average and, additionally, how much each of the individual predictors contributes on average. When we use categorical variables as predictors, what we are effectively looking for is a differential effect in the outcome variable as a result of being in one category or another.

HA: Students who perform well as part of initial assessment in subjects will perform better overall, and there is a difference between male and female students.

For our predictors we will use initial grade and whether a student is male or female.
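In equation form, the model for this hypothesis extends the line equation with one slope per predictor ($b_0$ is the intercept and $\epsilon_i$ the error term):

$$mG3_i = b_0 + b_1 \, mG1_i + b_2 \, male_i + \epsilon_i$$

where $male_i$ is 1 for male students and 0 for female students (female as baseline).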

### Step 1: Investigate differential effects

We have already, as part of the explore and analyse phase, established that there is a differential effect between male and female students for performance. That analysis is not repeated here, but we can assume the variables of interest meet the criteria for multiple linear regression.

### Step 2: Run Multiple Linear Regression

Given the assumptions of normality and that a strong positive relationship exists, it is safe to use the student performance dataset to illustrate the multiple linear regression predictive modelling technique.

```r
tbl_sperf_sex_diff_stats <- tbl_sperf_sex_diff %>% psych::describe(omit = TRUE)

############
# PART: Multiple Linear Regression
############

# -------------- Regression Math Grade 1 as a predictor for Math Grade 3 and sex --------------- #
reg_result_model_multiple_lm <- lm(tbl_sperf_sex_diff$mG3 ~ tbl_sperf_sex_diff$mG1 + as.integer(tbl_sperf_sex_diff$sex == 'M')) # Force female to be baseline.

# -------------- Determine how good our model is versus using the mean --------------- #
anova(reg_result_model_multiple_lm) ## Get sums of squares of variability
```

```
## Analysis of Variance Table
## 
## Response: tbl_sperf_sex_diff$mG3
##                                            Df  Sum Sq Mean Sq   F value Pr(>F)    
## tbl_sperf_sex_diff$mG1                      1 2877.40 2877.40 1362.3560 <2e-16 ***
## as.integer(tbl_sperf_sex_diff$sex == "M")   1    0.86    0.86    0.4071 0.5239    
## Residuals                                 336  709.66    2.11                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```r
summary(reg_result_model_multiple_lm)
```

```
## 
## Call:
## lm(formula = tbl_sperf_sex_diff$mG3 ~ tbl_sperf_sex_diff$mG1 + 
##     as.integer(tbl_sperf_sex_diff$sex == "M"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6170 -1.1102 -0.1102  1.0790  3.5857 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                                1.42738    0.28910   4.937 1.25e-06 ***
## tbl_sperf_sex_diff$mG1                     0.89865    0.02466  36.438  < 2e-16 ***
## as.integer(tbl_sperf_sex_diff$sex == "M")  0.10179    0.15954   0.638    0.524    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.453 on 336 degrees of freedom
## Multiple R-squared:  0.8022, Adjusted R-squared:  0.801 
## F-statistic: 681.4 on 2 and 336 DF,  p-value: < 2.2e-16
```

```r
stargazer(reg_result_model_lm, reg_result_model_multiple_lm, type = "text")
```

```
## 
## =======================================================================
##                                      Dependent variable:               
##                     ---------------------------------------------------
##                                              mG3                       
##                                (1)                       (2)           
## -----------------------------------------------------------------------
## mG1                          0.901***                                  
##                              (0.024)                                   
##                                                                        
## mG1                                                    0.899***        
##                                                        (0.025)         
##                                                                        
## sex == "M")                                             0.102          
##                                                        (0.160)         
##                                                                        
## Constant                     1.452***                  1.427***        
##                              (0.286)                   (0.289)         
##                                                                        
## -----------------------------------------------------------------------
## Observations                   339                       339           
## R2                            0.802                     0.802          
## Adjusted R2                   0.801                     0.801          
## Residual Std. Error     1.452 (df = 337)           1.453 (df = 336)    
## F Statistic         1,364.757*** (df = 1; 337) 681.382*** (df = 2; 336)
## =======================================================================
## Note:                                       *p<0.1; **p<0.05; ***p<0.01
```

```r
# Note: R automatically recodes categorical variables to dummy variables; here female is the
# reference category (0) and male is the category of interest (1), as forced in the model above.
```

### Report Multiple Linear Regression

The relationship between initial grade in Maths (mG1, taken from school reports) and final Maths grade (mG3, taken from school reports) for male and female students was modelled using a multiple linear model. The overall model was statistically significant (F(2, 336) = 681.38, p < .001). However, when we take into account the contribution from the initial Maths grade, a statistically significant result was not found for sex as a predictor. There is virtually no change in the contribution of Maths initial grade to Maths final grade outcome compared to the simple linear model which did not account for sex. Our model still explains 80% of the variation in the outcome variable on average, the same as the simple linear model. This result of the multiple linear regression raises the question of interaction effects. If a student experiences a differential effect for final Maths grade based on sex, did that student also experience a differential effect for the initial grade? The concept here is that there may be some relationship between predictors which needs to be accounted for in our model, and we use interaction terms to do so.

# Dummy variables and Interaction Effects – Categorical Predictive Variables

Many variables we are interested in using as predictors are categorical. In the above example we used a categorical variable (sex) as a predictor but did not explain its treatment. Because categorical variables have no scale, it makes no sense to think of the effect of a unit of increase as it would for a continuous variable. What we want to see is whether there is a difference in outcome for a record belonging to one category versus another, and how much of the variation in the outcome can be explained by belonging to a category. To do this, we transform the categorical variable into a series of dummy variables which indicate whether a particular record has the specific characteristic represented by the dummy variable. Dummy variables are also referred to as indicator variables or switches. When we include a dummy variable in our model we get a new intercept with the Y axis and a new line of best fit for the model. In this way we can have different predictive models for different groups. For dummy variables, we recode each category to 0 (reference/baseline category) and 1 (category of interest). We must have previously shown that a dummy variable has a differential effect on the outcome variable through difference testing in order to include it in our predictive model.

Models built with dummy variables assume that the gap between groups (lines of best fit) is constant, but we may find there is an intersection point for the lines representing each group in our model. For example, in our data set we might find that on average male students are better during initial assessment, but the trend narrows or even reverses for the final grade. To capture this effect in our predictive model we use interaction terms, which change the slope of our model. We generate interaction terms by multiplying our dummy category variable by our continuous scale predictor variable and adding this term to our model.
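In general, with a continuous predictor $x$ and a dummy variable $D$ (0 = reference category, 1 = category of interest), the model with an interaction term is:

$$\hat{y} = b_0 + b_1 x + b_2 D + b_3 (x \cdot D)$$

For the reference group ($D = 0$) this reduces to $\hat{y} = b_0 + b_1 x$; for the category of interest ($D = 1$) the intercept becomes $b_0 + b_2$ and the slope becomes $b_1 + b_3$.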
Our predictive model should include both the dummy variable and the interaction term.

## Illustrative example of interaction effects

Using our student performance dataset we will illustrate interaction effects by building progressively more complex models.

```r
############
# PART: Create switches and interaction terms
############
tbl_sperf_sex_diff$int_sex_male   <- tbl_sperf_sex_diff$mG1 * as.integer(tbl_sperf_sex_diff$sex == 'M')
tbl_sperf_sex_diff$int_sex_female <- tbl_sperf_sex_diff$mG1 * as.integer(tbl_sperf_sex_diff$sex == 'F')
tbl_sperf_sex_diff$medu_0 <- as.integer(tbl_sperf_sex_diff$Medu == 0)
tbl_sperf_sex_diff$medu_1 <- as.integer(tbl_sperf_sex_diff$Medu == 1)
tbl_sperf_sex_diff$medu_2 <- as.integer(tbl_sperf_sex_diff$Medu == 2)
tbl_sperf_sex_diff$medu_3 <- as.integer(tbl_sperf_sex_diff$Medu == 3)
tbl_sperf_sex_diff$medu_4 <- as.integer(tbl_sperf_sex_diff$Medu == 4)

## Stats
tbl_sperf_sex_diff_stats <- tbl_sperf_sex_diff %>% psych::describe(omit = TRUE)
```

### First Model mG3 predicted by mG1

Our first model is a simple linear regression of final Maths grade predicted by initial Maths grade.

```r
############
# PART: Demonstrate application of interaction terms
############

#------------ First Model: mG3 predicted by mG1 -----------#
model1 <- lm(tbl_sperf_sex_diff$mG3 ~ tbl_sperf_sex_diff$mG1)

## Pretty Print
stargazer::stargazer(model1, type = "text") # Tidy output of all the required stats
```

```
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 mG3            
## -----------------------------------------------
## mG1                           0.901***         
##                               (0.024)          
##                                                
## Constant                      1.452***         
##                               (0.286)          
##                                                
## -----------------------------------------------
## Observations                    339            
## R2                             0.802           
## Adjusted R2                    0.801           
## Residual Std. Error       1.452 (df = 337)     
## F Statistic         1,364.757*** (df = 1; 337) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
```

```r
#------------ First Model: mG3 predicted by mG1 -----------#
equation1 <- function(x){coef(model1)[2] * x + coef(model1)[1]}

ggplot(tbl_sperf_sex_diff, aes(y = mG3, x = mG1, color = sex)) +
  geom_point() +
  stat_function(fun = equation1, geom = "line", color = scales::hue_pal()(2)[1])
```

### Second Model mG3 predicted by mG1 using sex as a switch

Our second model is a multiple linear regression with final Maths grade predicted by initial Maths grade and student sex. We transform sex into a dummy/switch variable where male is treated as the reference category/baseline and female as the category of interest. This means in our model males are given the value 0 and females the value 1. At this stage we still assume a constant performance difference between male and female students; interaction terms which change the gradient of the model for female students are introduced in the third model.

```r
#------------ Second Model: mG3 predicted by mG1 using sex as a dummy variable -----------#
model2 <- lm(tbl_sperf_sex_diff$mG3 ~ tbl_sperf_sex_diff$mG1 + as.integer(tbl_sperf_sex_diff$sex == 'F')) # Male baseline

## Pretty Print
stargazer::stargazer(model2, type = "text") # Tidy output of all the required stats
```

```
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 mG3            
## -----------------------------------------------
## mG1                           0.899***         
##                               (0.025)          
##                                                
## sex == "F")                    -0.102          
##                               (0.160)          
##                                                
## Constant                      1.529***         
##                               (0.311)          
##                                                
## -----------------------------------------------
## Observations                    339            
## R2                             0.802           
## Adjusted R2                    0.801           
## Residual Std. Error       1.453 (df = 336)     
## F Statistic           681.382*** (df = 2; 336) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
```

```r
stargazer::stargazer(model1, model2, type = "text") # Quick model comparison
```

```
## 
## =======================================================================
##                                      Dependent variable:               
##                     ---------------------------------------------------
##                                              mG3                       
##                                (1)                       (2)           
## -----------------------------------------------------------------------
## mG1                          0.901***                  0.899***        
##                              (0.024)                   (0.025)         
##                                                                        
## sex == "F")                                             -0.102         
##                                                        (0.160)         
##                                                                        
## Constant                     1.452***                  1.529***        
##                              (0.286)                   (0.311)         
##                                                                        
## -----------------------------------------------------------------------
## Observations                   339                       339           
## R2                            0.802                     0.802          
## Adjusted R2                   0.801                     0.801          
## Residual Std. Error     1.452 (df = 337)           1.453 (df = 336)    
## F Statistic         1,364.757*** (df = 1; 337) 681.382*** (df = 2; 336)
## =======================================================================
## Note:                                       *p<0.1; **p<0.05; ***p<0.01
```

```r
#------------ Second Model: mG3 predicted by mG1 using sex as a dummy variable -----------#
equation1_male   <- function(x){coef(model1)[2] * x + coef(model1)[1]}
equation1_female <- function(x){coef(model2)[2] * x + coef(model2)[1] + coef(model2)[3]}

ggplot(tbl_sperf_sex_diff, aes(y = mG3, x = mG1, color = sex)) +
  geom_point() +
  stat_function(fun = equation1_male,   geom = "line", color = scales::hue_pal()(2)[1]) +
  stat_function(fun = equation1_female, geom = "line", color = scales::hue_pal()(2)[2])
```

### Third Model mG3 predicted by mG1 using sex as an interaction term

Our third model is a multiple linear regression of final Maths grade predicted by initial Maths grade, student sex, and their interaction. We transform sex into an interaction term, where male is treated as the reference category/baseline and female as the category of interest. This means in our model males are given the value 0 and females the value 1. Both the switch and the interaction variable are included because the switch variable determines the intercept of the model with the Y axis (the starting point).

```r
#------------ Third Model: mG3 predicted by mG1 using sex as an interaction variable -----------#
model3 <- lm(tbl_sperf_sex_diff$mG3 ~
               tbl_sperf_sex_diff$mG1 +
               as.integer(tbl_sperf_sex_diff$sex == 'F') +
               tbl_sperf_sex_diff$int_sex_female) # Male baseline

## Pretty Print
stargazer::stargazer(model3, type = "text") # Tidy output of all the required stats
```

```
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 mG3            
## -----------------------------------------------
## mG1                           0.884***         
##                               (0.035)          
##                                                
## sex == "F")                    -0.437          
##                               (0.580)          
##                                                
## int_sex_female                  0.030          
##                               (0.049)          
##                                                
## Constant                      1.706***         
##                               (0.428)          
##                                                
## -----------------------------------------------
## Observations                    339            
## R2                             0.802           
## Adjusted R2                    0.801           
## Residual Std. Error       1.455 (df = 335)     
## F Statistic           453.511*** (df = 3; 335) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
```

```r
stargazer::stargazer(model3, model2, type = "text") # Quick model comparison
```

```
## 
## =====================================================================
##                                     Dependent variable:              
##                     -------------------------------------------------
##                                             mG3                      
##                               (1)                      (2)           
## ---------------------------------------------------------------------
## mG1                         0.884***                 0.899***        
##                             (0.035)                  (0.025)         
##                                                                      
## sex == "F")                  -0.437                   -0.102         
##                             (0.580)                  (0.160)         
##                                                                      
## int_sex_female               0.030                                   
##                             (0.049)                                  
##                                                                      
## Constant                    1.706***                 1.529***        
##                             (0.428)                  (0.311)         
##                                                                      
## ---------------------------------------------------------------------
## Observations                  339                      339           
## R2                           0.802                    0.802          
## Adjusted R2                  0.801                    0.801          
## Residual Std. Error     1.455 (df = 335)         1.453 (df = 336)    
## F Statistic         453.511*** (df = 3; 335) 681.382*** (df = 2; 336)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01
```

### Double checking association

To confirm we hadn't made an error, we reconfirmed our understanding of the associations between the variables. This is not strictly required but is included for convenience.

#### Look at zero order correlations

```r
# Get zero order correlations as well to fully explore the effect
cor.test(tbl_sperf_sex_diff$mG1, tbl_sperf_sex_diff$mG3)
```

```
## 
##  Pearson's product-moment correlation
## 
## data:  tbl_sperf_sex_diff$mG1 and tbl_sperf_sex_diff$mG3
## t = 36.943, df = 337, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8722090 0.9147845
## sample estimates:
##       cor 
## 0.8955274
```

#### Look at differences in experience of concepts for respondents of different gender

```r
# Look at differences in scores for gender
# Conduct Levene's test for homogeneity of variance in library car

#---------- mG1 ~ Sex Differential Effect ---------- #
leveneTest(mG1 ~ sex, data = tbl_sperf_sex_diff)
```

```
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.0552 0.8144
##       337
```

```r
test_result <- stats::t.test(mG1 ~ sex, var.equal = TRUE, data = tbl_sperf_sex_diff) # Variances are equal (Levene's p = .81)
test_result[["effcd"]] <- round(effectsize::t_to_d(t = test_result$statistic, test_result$parameter), 2)
print(test_result)
```

```
## 
##  Two Sample t-test
## 
## data:  mG1 by sex
## t = -2.657, df = 337, p-value = 0.008259
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.6126906 -0.2406418
## sample estimates:
## mean in group F mean in group M 
##        10.83237        11.75904
```

```r
print(test_result[["effcd"]])
```

```
## d     |         95% CI
## ----------------------
## -0.29 | [-0.50, -0.07]
```
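The effect size reported above can be reproduced from the t-to-d conversion for an independent-samples t-test:

$$d = \frac{2t}{\sqrt{df}} = \frac{2 \times (-2.657)}{\sqrt{337}} \approx -0.29$$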

```r
#---------- mG3 ~ Sex Differential Effect ---------- #
leveneTest(mG3 ~ sex, data = tbl_sperf_sex_diff)
```

```
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.1175  0.732
##       337
```

```r
test_result <- stats::t.test(mG3 ~ sex, var.equal = TRUE, data = tbl_sperf_sex_diff) # Variances are equal (Levene's p = .73)
test_result[["effcd"]] <- round(effectsize::t_to_d(t = test_result$statistic, test_result$parameter), 2)
print(test_result)
```

```
## 
##  Two Sample t-test
## 
## data:  mG3 by sex
## t = -2.6637, df = 337, p-value = 0.008099
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.6246401 -0.2444316
## sample estimates:
## mean in group F mean in group M 
##        11.16185        12.09639
```

```r
print(test_result[["effcd"]])
```

```
## d     |         95% CI
## ----------------------
## -0.29 | [-0.50, -0.08]
```

### Reporting switches and interaction effects for variable: sex

The interaction effect between initial grade in Maths (mG1, taken from school reports) and sex was modelled using a multiple linear model. Being female was not found to be a statistically significant predictor of final performance (b = -0.437, SE = 0.580). The interaction effect of being female and initial grade was also not found to be significant (b = 0.030, SE = 0.049). We saw how the inclusion of the switch or dummy variable had the consequence of giving us two lines of best fit (one for male and one for female students) with differing intercepts. We also saw how the inclusion of the interaction variable gave us a slightly different gradient.

## Linear Regression with Interaction effects and more than two categories

From our explore and analyse phase we established there is a relationship between mother's education (Medu) and student performance, with the strongest relationship found for students with mothers who obtained the highest level of education. Medu differs from sex in that it is a variable that can take more than two values.

### Fourth Model mG3 predicted by mG1 using sex and mother's education as switches

Our fourth model is a multiple linear regression with final Maths grade predicted by initial Maths grade, student sex, and mother's educational achievement. We transform sex into a dummy variable, where male is treated as the reference category/baseline and female as the category of interest: males are given the value 0 and females the value 1. Mother's education has five possible values, and this is transformed into four switching variables. The lowest educational level (Medu = 0) is taken as the reference category.

```r
#------------ Fourth Model: mG3 predicted by mG1 using sex and mothers education as switches -----------#
model4 <- lm(tbl_sperf_sex_diff$mG3 ~
               tbl_sperf_sex_diff$mG1 +
               as.integer(tbl_sperf_sex_diff$sex == 'F') +
               tbl_sperf_sex_diff$medu_4 +
               tbl_sperf_sex_diff$medu_1 +
               tbl_sperf_sex_diff$medu_2 +
               tbl_sperf_sex_diff$medu_3) # lowest education baseline, male baseline

## Pretty Print
stargazer::stargazer(model4, type = "text") # Tidy output of all the required stats
```

```
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 mG3            
## -----------------------------------------------
## mG1                           0.894***         
##                               (0.025)          
##                                                
## sex == "F")                    -0.103          
##                               (0.160)          
##                                                
## medu_4                         -0.730          
##                               (0.852)          
##                                                
## medu_1                         -0.977          
##                               (0.872)          
##                                                
## medu_2                         -0.809          
##                               (0.857)          
##                                                
## medu_3                         -0.667          
##                               (0.857)          
##                                                
## Constant                       2.342**         
##                               (0.906)          
##                                                
## -----------------------------------------------
## Observations                    339            
## R2                             0.804           
## Adjusted R2                    0.800           
## Residual Std. Error       1.457 (df = 332)     
## F Statistic           226.304*** (df = 6; 332) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
```

```r
stargazer::stargazer(model4, model3, type = "text") # Quick model comparison
```

```
## 
## =====================================================================
##                                     Dependent variable:              
##                     -------------------------------------------------
##                                             mG3                      
##                               (1)                      (2)           
## ---------------------------------------------------------------------
## mG1                         0.894***                 0.884***        
##                             (0.025)                  (0.035)         
##                                                                      
## sex == "F")                  -0.103                   -0.437         
##                             (0.160)                  (0.580)         
##                                                                      
## medu_4                       -0.730                                  
##                             (0.852)                                  
##                                                                      
## medu_1                       -0.977                                  
##                             (0.872)                                  
##                                                                      
## medu_2                       -0.809                                  
##                             (0.857)                                  
##                                                                      
## medu_3                       -0.667                                  
##                             (0.857)                                  
##                                                                      
## int_sex_female                                        0.030          
##                                                      (0.049)         
##                                                                      
## Constant                     2.342**                 1.706***        
##                             (0.906)                  (0.428)         
##                                                                      
## ---------------------------------------------------------------------
## Observations                  339                      339           
## R2                           0.804                    0.802          
## Adjusted R2                  0.800                    0.801          
## Residual Std. Error     1.457 (df = 332)         1.455 (df = 335)    
## F Statistic         226.304*** (df = 6; 332) 453.511*** (df = 3; 335)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01
```

### Report switch and interaction effects for Medu and sex

A multiple linear predictive model of final Maths grade was built using initial grade in Maths (mG1, taken from school reports), sex, and mother's education. When we take into account the overall contribution from the initial Maths grade, a statistically significant result was not found for sex or mother's education as predictors. We saw how the inclusion of the mother's education switch terms had the consequence of giving us a separate line of best fit for each level of mother's education, with different intercepts but a common slope. The above analysis supports not including mother's education in a predictor set with initial grade.

## Generalisation of Linear Regression Model

To generalise our model to the population there are a number of assumptions underpinning linear regression that must be validated. For every one unit of change in the independent variable there should be a consistent and uniform change in the dependent variable. This allows us to use the equation of a line to model the association between predictor variables and outcome variables.

Before we build our model we need to validate our assumptions regarding associations. If we are asserting a relationship, we need to investigate whether there is any evidence of a relationship using correlation and make a decision based on the results (strength, direction, etc.). If we are asserting a differential effect for different groups, we need to investigate whether there is any difference using the appropriate test and make a decision based on the result.

Using linear regression as a predictive model requires that a straight line be a good representative model of the underlying relationship. How well a straight line serves as a model is determined by the residuals. We want the residuals to follow a normal distribution, which minimises the opportunity for error to emerge simply because the data is skewed. We want constant error variance, because non-constant variance would pull our outcome in particular directions and increase our chances of a Type I error.

An outlier is any observation that is particularly large or small. A leverage point is an observation far from the average that can change the gradient of the line of best fit, and an influential observation is one that changes the slope of the line. If we have a particularly large value in our data as an outlier, it is likely that we will have a particularly large residual in our regression model, which will pull the prediction in a particular direction.

### Step 1: Determine our outliers in terms of residuals

The first thing we need to do is identify our outliers in terms of our residuals. If we find outliers that are particularly significant, it may be acceptable to delete them and repeat the regression without those cases included. The danger in simply removing cases is that remaining cases that were not outliers may become outliers. It may also be that we have to retain a case because it is particularly important in our data set. It is a good idea to perform a robustness test to determine how susceptible the model is to assumptions regarding outliers: we run the regression analysis twice, once with and once without the outliers.

Next, we quantify how far from normal the distribution of residuals is. To do this, we calculate statistics for skew and kurtosis and standardise them so we can compare them to heuristics. Standardised scores (value/std.error) for skewness between +/-2 (1.96 rounded) are considered acceptable in order to assume a normal distribution. Residuals for model 1 (mG1 as a predictor of mG3) are within an acceptable range. We investigated further by exploring outliers: how many there are, and whether we can transform the problematic variable to become more normal. To quantify the proportion of the residuals that is not normal, we generated standardised z scores and calculated the percentage of scores outside an acceptable range. No variable exceeded our acceptable range at the 95% level. Based on this assessment, residual skewness due to outliers is not an issue and residuals can be treated as normally distributed. An analysis of standardised residuals was carried out, showing that the data contained no additional outliers once zero scores had been removed (Std. Residual Min = -3.61, Std. Residual Max = 3.59).

```r
###########
# Part: Model
###########

#------------ Second Model: mG3 predicted by mG1 using sex as a dummy variable -----------#
model2 <- lm(tbl_sperf_sex_diff$mG3 ~ tbl_sperf_sex_diff$mG1 + as.integer(tbl_sperf_sex_diff$sex == 'F')) # Male baseline

############
# PART: check
############
std_skew <- list()
std_kurt <- list()
gt_196 <- list()
gt_329 <- list()

#--------------- Check for percentage outliers ---------------#
variable <- "model2_stats"
st <- as.data.frame(pastecs::stat.desc(model2$residuals, basic = F))
model2_stats <- st %>% transpose()
colnames(model2_stats) <- rownames(st)
rownames(model2_stats) <- ('Model 2: Residuals')

## Build standard statistics
tpskew <- semTools::skew(model2$residuals)
tpkurt <- semTools::kurtosis(model2$residuals)
std_skew[[variable]] <- tpskew[1] / tpskew[2]
std_kurt[[variable]] <- tpkurt[1] / tpkurt[2]
z_score <- abs(scale(model2$residuals))
gt_196[[variable]] <- FSA::perc(as.numeric(z_score), 1.96, "gt") # 95% within +/- 1.96
gt_329[[variable]] <- FSA::perc(as.numeric(z_score), 3.29, "gt") # 99.7% within +/- 3.29 for larger distributions

## Build output
model2_stats$std_skew <- std_skew
model2_stats$std_kurt <- std_kurt
model2_stats$gt_2sd <- gt_196
model2_stats$gt_3sd <- gt_329

#-------- Pretty print my table -------#
model2_stats %>%
  kbl(caption = "Summary statistics for Residuals - mG3 predicted by mG1 - sex as a dummy variable (zero scores removed)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

| | median | mean | SE.mean | CI.mean.0.95 | var | std.dev | coef.var | std_skew | std_kurt | gt_2sd | gt_3sd |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Model 2: Residuals | -0.1102192 | 0 | 0.0786984 | 0.1548003 | 2.099575 | 1.448991 | 1.537045e+17 | 1.161807 | -1.42829 | 5.014749 | 0 |

```r
############
# PART: Visualisation Histograms
############
# Plot single variable
binwidth <- .5
gs1 <- model2 %>% ggplot(aes(x = model2$residuals))
gs1 <- gs1 + geom_histogram(binwidth = binwidth, colour = "black",
                            aes(y = ..density.., fill = ..count..))
gs1 <- gs1 + stat_function(fun = dnorm, color = "red",
                           args = list(mean = model2_stats$mean,
                                       sd = model2_stats$std.dev), na.rm = TRUE)
gs1 <- gs1 + labs(x = "Model 2: mG3 predicted by mG1 Residuals") # corrected: was labelled "Model 1"
gs1 <- gs1 + scale_fill_gradient("Count", low = "#DCDCDC", high = "#7C7C7C")

############
# PART: Visualisation Q-Q Plot
############
gs2 <- model2 %>% ggplot(aes(sample = model2$residuals)) +
  stat_qq() +
  stat_qq_line(linetype = "dotted", color = "red", size = 1) +
  theme_bw() +
  labs(title = "Q-Q Plot of Model 2 Residuals",
       subtitle = "mG1 as predictor of mG3 dummy sex")

plot_grid(gs1, gs2, labels = "auto", ncol = 2)
```

#### Influential Outliers

```r
#--------------- Influential Outliers - Cook's distance ---------------#
cooksd <- sort(cooks.distance(model2))

# plot Cook's distance
plot(cooksd, pch = "*", cex = 2, main = "Influential Obs by Cooks distance")
abline(h = 4 * mean(cooksd, na.rm = TRUE), col = "red") # add cutoff line
text(x = seq_along(cooksd) + 1, y = cooksd,
     labels = ifelse(cooksd > 4 * mean(cooksd, na.rm = TRUE), names(cooksd), ""),
     col = "red") # add labels
```

```r
#--------------- find rows related to influential observations ---------------#
influential <- as.numeric(names(cooksd)[(cooksd > 4 * mean(cooksd, na.rm = TRUE))]) # influential row numbers
stem(influential)
```

```
## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##   0 | 0368
##   1 | 268999
##   2 | 034556
##   3 | 0
```

```r
head(tbl_sperf_sex_diff[influential, ]) # influential observations
```

```
##     mG1 mG2 mG3 pG1 pG2 pG3 sex Medu int_sex_male int_sex_female medu_0 medu_1 medu_2 medu_3 medu_4
## 157  11   8   8   7   8   8   M    2           11              0      0      0      1      0      0
## 262  14  12  11  14  15  15   M    3           14              0      0      0      0      1      0
## 202  11  15  15   9  11  11   M    4           11              0      0      0      0      0      1
## 194  15  16  18  14  14  14   M    4           15              0      0      0      0      0      1
## 304  15  12  12  10  10  11   M    3           15              0      0      0      0      1      0
## 181  10  13  14  12  13  12   M    4           10              0      0      0      0      0      1
```

```r
head(tbl_sperf_sex_diff[influential, ]$mG3) # influential observations - look at the values of mG3
```

```
## [1]  8 11 15 18 12 14
```

```r
head(tbl_sperf_sex_diff[influential, ]$mG1) # influential observations - look at the values of mG1
```

```
## [1] 11 14 11 15 15 10
```

```r
# Bonferroni p-value for the most extreme observation - are there any cases where
# the outcome variable has an unusual value for its predictor values?
car::outlierTest(model2)
```

```
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##      rstudent unadjusted p-value Bonferroni p
## 236 -2.518902           0.012237           NA
```

```r
car::leveragePlots(model2) # leverage plots
```

### Step 2: Check residuals for Constant Variation of error (Homoscedasticity)

Sometimes our residuals display heteroscedasticity in their error variance: the variation of error at higher values of a variable differs from that at lower values. In a well-fitting regression model the error variation around the reference line is constant (homoscedasticity). If we detect heteroscedasticity, multiple linear regression is not an acceptable predictive model, even if there are large and statistically significant coefficients, because of the possibility of a type I error. The data also met the assumption of non-zero variances (Initial Maths grade variance = 10.49, sex = 0.25).

```r
plot(model2, 1)
```

```r
plot(model2, 3)
```

The first plot charts residuals against fitted values; in the second plot the standardised residuals are on the y-axis. If there is no heteroscedasticity at all, you should see a completely random, equal distribution of points throughout the range of the x-axis and a flat red line. We really want to see no pattern in the residuals, with the points equally spread around the y = 0 line (the dashed line). In our case the red line is slightly distorted in the middle of plot 1, but this is not a big problem. Looking at the second plot, while there is some distortion, it is not huge; it would only be a concern if there were definite patterns.

```r
car::qqPlot(model2, main = "QQ Plot") # Q-Q plot for studentised residuals
```

```
## [1]   2 236
```

```r
# corrected: na.rm is an argument of var(), not as.numeric() (was rm.na inside as.numeric)
var_mg1 <- var(as.numeric(tbl_sperf_sex_diff$mG1), na.rm = TRUE) %>% round(2)
var_sex <- var(as.numeric(tbl_sperf_sex_diff$sex == 'M'), na.rm = TRUE) %>% round(2)
```

### Step 3: Check for Collinearity

Collinearity occurs when two or more independent variables contain redundant information. If variables are collinear there is not enough distinct information in them for MLR to operate; they are essentially measuring the same thing, and the concept they measure could end up over-represented in our model. A correlation coefficient above 0.8 suggests collinearity may be present. If it is, we need to analyse the variables in separate regression equations and then examine how they operate together. Tests to see if model 2 met the assumption of collinearity indicated that multicollinearity was not a concern (Sex: Tolerance = .98, VIF = 1.02; Initial Maths grade: Tolerance = .98, VIF = 1.02).

```r
# Calculate collinearity
vifmodel <- car::vif(model2)
vifmodel
```

```
##                    tbl_sperf_sex_diff$mG1 as.integer(tbl_sperf_sex_diff$sex == "F")
##                                  1.020949                                  1.020949
```

```r
# Calculate tolerance
1 / vifmodel
```

```
##                    tbl_sperf_sex_diff$mG1 as.integer(tbl_sperf_sex_diff$sex == "F")
##                                 0.9794811                                 0.9794811
```

### Step 4: Check for Causality

To assert causality we need to establish association, time order and non-spuriousness. Statistical methods can justify association assumptions but not time order; that depends on our conceptual framework. We control for spuriousness by introducing controlling measures into our predictive model.

## Reporting MLR – Model 2

For expedience, I have only reported on model 2. A multiple regression analysis was conducted to determine if a student's initial maths grade and sex could predict final academic achievement. In order to include sex in the regression model it was recoded into a single dummy variable (0 for male, 1 for female). Examination of the histogram, normal P-P plot of standardised residuals and the scatterplot of the dependent variable, initial maths grade, and standardised residuals showed that some outliers existed even after removing the zero-score records we deemed invalid earlier. However, examination of the standardised residuals showed that none could be considered to have undue influence (95% within the limits of -1.96 to +1.96 and none with Cook's distance > 1), as outlined in Field (2013). Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5), as outlined in Tarling (2008). The scatterplot of standardised residuals showed that the data met the assumptions of homogeneity of variance and linearity. The data also meets the assumption of non-zero variances of the predictors.

# Logistic Regression

In linear regression, our outcome variable is continuous and is predicted from one or more continuous or categorical predictor variables. If the outcomes we are interested in are categorical in nature, we cannot use linear regression, as there is no linear relationship between the predictor and outcome variables. To predict categorical outcomes, we use logistic regression. If the outcome variable has exactly two values, for example predicting pass or fail for a student, we use binary logistic regression. When the outcome has more than two categories, we use multinomial logistic regression. The output from a logistic regression is not an estimated value but rather the probability that a given input belongs to a certain category. Because we are looking at probability, we choose one category as the category of interest. The event we are interested in is the event where that category occurs; the reference event is when the event of interest does not occur, and we are interested in the probability of the event of interest occurring versus the probability of it not occurring.

## Binary Logistic Regression

To illustrate this methodology we will use the student performance data set to build a model to predict whether students sat the final maths examination, using gender as the predictor.

HA: Gender is a factor in determining if a student will sit the final maths examination

Our outcome variable is the probability that a student sat the exam. As such we give zero to the "no" category, which is our reference category, and 1 to our category of interest, "yes" they did sit the exam. For our predictor we used sex, which takes the value Male or Female. Note: we are building a model with one predictor to illustrate our understanding of how logistic regression works; the starting point is a simple logistic regression, which we will build upon. We gain no new insight from this over and above a difference test.

### Step 1: Run Simple Logistic Regression
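The fitting code is not shown in this extract, but the `Call:` line in the output below implies a `glm()` of the following shape. This is a sketch, assuming the `tbl_sperf_sex_medu_diff` data frame with `sat_maths` coded 0/1:

```r
# Binary logistic regression of exam-sitting on sex:
# glm with a binomial family and logit link
logmodel1 <- glm(tbl_sperf_sex_medu_diff$sat_maths ~ tbl_sperf_sex_medu_diff$sex,
                 data = tbl_sperf_sex_medu_diff,
                 family = binomial(link = logit),
                 na.action = na.exclude)
summary(logmodel1)

# Likelihood-ratio test of the fitted model against the null (intercept-only) model
lmtest::lrtest(logmodel1)
```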

```
## 
## Call:
## glm(formula = tbl_sperf_sex_medu_diff$sat_maths ~ tbl_sperf_sex_medu_diff$sex, 
##     family = binomial(link = logit), data = tbl_sperf_sex_medu_diff, 
##     na.action = na.exclude)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2101   0.4265   0.4265   0.4970   0.4970  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    2.0293     0.2218   9.149   <2e-16 ***
## tbl_sperf_sex_medu_diff$sexM   0.3221     0.3430   0.939    0.348    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 251.86  on 381  degrees of freedom
## Residual deviance: 250.97  on 380  degrees of freedom
## AIC: 254.97
## 
## Number of Fisher Scoring iterations: 5
```

```
## Likelihood ratio test
## 
## Model 1: tbl_sperf_sex_medu_diff$sat_maths ~ tbl_sperf_sex_medu_diff$sex
## Model 2: tbl_sperf_sex_medu_diff$sat_maths ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   2 -125.48                     
## 2   1 -125.93 -1 0.8929     0.3447
```

The Pr(>|z|) column shows the two-tailed p-values testing the null hypothesis that the coefficient is equal to zero (i.e. no significant effect); the usual cut-off value is 0.05. From this model it appears that sex does not have a significant effect on the log-odds of the outcome variable, whether or not a student sat the final maths exam. The Estimate column shows how much of a contribution each predictor makes to the outcome. The results indicate that when sex is male the expected change in the log odds is an increase of .3221 in comparison to being female; however, the result is not statistically significant (p = 0.348).

### Step 2: Calculate odds ratio and probability

To calculate the odds ratio, which indicates the change in odds resulting from a unit change in the predictor, we take the exponential of the coefficients. We see the odds of sitting the maths exam are 7.61 to 1, and that ratio increases by a factor of 1.38, to 10.50 to 1, if a student is male. From this we can see the probability of sitting the final maths exam when female is .88 and when male is .91.

| | Estimate | Odds_ratio | Probability |
|---|---|---|---|
| (Intercept) | 2.0293 | 7.6087 | 0.8838384 |
| tbl_sperf_sex_medu_diff$sexM | 0.3221 | 1.3800 | 0.9130435 |
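The arithmetic can be reproduced directly from the reported coefficient estimates; a minimal sketch:

```r
# Odds ratios and probabilities from the logistic coefficients
b0 <- 2.0293                  # intercept: log-odds of sitting the exam, female
b1 <- 0.3221                  # additional log-odds for male students

odds_f <- exp(b0)             # ~7.61 to 1: odds of sitting the exam, female
odds_m <- exp(b0 + b1)        # ~10.50 to 1: odds for male (7.61 * 1.38)

p_f <- odds_f / (1 + odds_f)  # ~0.88: probability, female
p_m <- odds_m / (1 + odds_m)  # ~0.91: probability, male
```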

### Step 3: Check goodness of fit of the model over the baseline model

In linear regression we check goodness of fit by comparing our results to those we would have obtained using the mean as the baseline model. For logistic regression an omnibus test is used to check that the new model (with explanatory variables included) is an improvement over the baseline model, which is based on the frequencies of occurrence of each category. We use chi-square tests to see if there is a significant difference between the baseline (null) model and the model we have created. Below we can see our model is not statistically significant (Pr(>Chisq) = 0.3447), which means it cannot be assumed to be a better predictor than the baseline model. This was not surprising, since our predictor variable (sex) was shown to not be statistically significant.

| #Df | LogLik | Df | Chisq | Pr(>Chisq) |
|---|---|---|---|---|
| 2 | -125.4838 | NA | NA | NA |
| 1 | -125.9302 | -1 | 0.8929079 | 0.3446905 |

### Step 4: Determine usefulness

We can get a pseudo R-squared value to determine how useful the model is. From linear regression we understood that R² measures how much of the outcome variable is explained by the predictor variables. Cox and Snell R² and Nagelkerke R² are pseudo-R² statistics we can generate. From these we see that student sex explains between 0.2% and 0.48% of whether a student will sit the final maths exam or not, but we need to keep in mind that the result was not statistically significant.

| | CoxSnell | Nagelkerke |
|---|---|---|
| R-Squared | 0.0023 | 0.0048 |
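These statistics follow directly from the log-likelihoods in the goodness-of-fit table; a sketch of the standard formulas:

```r
# Cox & Snell and Nagelkerke pseudo-R^2 from the reported log-likelihoods
ll0 <- -125.9302   # null (intercept-only) model
ll1 <- -125.4838   # fitted model
n   <- 382         # number of observations

cox_snell  <- 1 - exp(2 * (ll0 - ll1) / n)        # ~0.0023
nagelkerke <- cox_snell / (1 - exp(2 * ll0 / n))  # ~0.0048 (rescales Cox & Snell to a 0-1 maximum)
```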

### Step 5: Sensitivity Analysis and Summary Stats.

We need to understand how well the model is working. The two main measures are sensitivity and specificity. Sensitivity refers to the proportion of true positives correctly identified by the model (correctly identified that a student sat the exam) and specificity is the proportion of true negatives (predicted to be negative and actually negative). We assess this using a ROC chart. Below we can see that our true positive rate is 49.0% and our true negative rate is 59.0%. The positive predictive value is the percentage of cases classified correctly by our model (students who were predicted to sit the exam and did, plus students who were predicted not to sit the exam and didn't); the negative predictive value is 1 minus the positive predictive value. To use the ROC curve to quantify the performance of a classifier we use the Area Under the Curve (AUC). We have an AUC of .54, which is considered weak.

### Step 6: Check the assumption of linearity of independent variables

We only have one predictor so we do not need to check for collinearity. We check the assumption of linearity of independent variables and log odds using a Hosmer-Lemeshow test.

```
## 
## Hosmer and Lemeshow test (binary model)
## 
## data:  tbl_sperf_sex_medu_diff$sat_maths, fitted(logmodel1)
## X-squared = 5.6128e-26, df = -1, p-value = NA
```
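The test output above matches the print format of `generalhoslem::logitgof()`; a call of this shape would produce it (a sketch, assuming the `generalhoslem` package is installed):

```r
# Hosmer-Lemeshow goodness-of-fit test for the binary model,
# comparing observed outcomes against fitted probabilities
generalhoslem::logitgof(tbl_sperf_sex_medu_diff$sat_maths, fitted(logmodel1))
```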

### Reporting Binary Logistic Regression

A binary logistic regression model was built of student propensity to sit the final Maths examination as predicted by student sex. Our outcome variable is the probability of a student sitting the final Maths exam given their gender. As such, we give zero to the "no" category, which is our reference category, and 1 to our category of interest, "yes" they did sit the exam. For our predictor we used sex, with values Male and Female. Being male was not found to be a statistically significant (p = .348) predictor of whether a student will or will not sit the examination. Female students have an odds ratio of 7.60 to 1 of sitting the exam, while male students' odds were 1.38 times higher, corresponding to probabilities of .88 and .91 respectively. R² = .0023 (Cox-Snell), .0048 (Nagelkerke), Pr(>Chisq) = .34.

```
## 
## =============================================
##                       Dependent variable:    
##                   ---------------------------
##                            sat_maths         
## ---------------------------------------------
## sexM                         0.322           
##                             (0.343)          
##                                              
## Constant                    2.029***         
##                             (0.222)          
##                                              
## ---------------------------------------------
## Observations                  382            
## Log Likelihood             -125.484          
## Akaike Inf. Crit.           254.968          
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01
```

## Extra: Extending the model to include Parent’s Education.

Similar to linear regression, we can extend logistic regression to include multiple predictors. Our goal is to explore whether we can predict if a student will sit the final Maths exam on the basis of their parents' education. We introduce a new categorical variable representing the highest educational achievement of either parent, with three levels: at least one parent with a higher degree, at least one parent with a level 5 qualification (secondary school), and neither parent with a level 5 qualification. At least one parent with a higher degree was taken as the baseline reference category.

### Step 1: Run General Logistic Regression

We find that the overall model is not statistically significant (Pr(>Chisq) = .077), but the category *Neither parent with a level 5 qualification* is statistically significant at the p < 0.05 significance level.

```
## 
## Call:
## glm(formula = sat_maths ~ sex + pedu_paired, family = binomial(link = logit), 
##     data = tbl_sperf_sex_medu_diff, na.action = na.exclude)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3552   0.3708   0.4099   0.5458   0.6193  
## 
## Coefficients:
##                                                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                                                   2.36843    0.34178   6.930 4.22e-12 ***
## sexM                                                          0.27464    0.34668   0.792   0.4283    
## pedu_paired At least one parent with a level 5 qualification  0.06604    0.49488   0.133   0.8938    
## pedu_paired Neither parent with a level 5 qualification      -0.81430    0.38806  -2.098   0.0359 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 251.86  on 381  degrees of freedom
## Residual deviance: 245.00  on 378  degrees of freedom
## AIC: 253
## 
## Number of Fisher Scoring iterations: 5
```

```
## Likelihood ratio test
## 
## Model 1: sat_maths ~ sex + pedu_paired
## Model 2: sat_maths ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)  
## 1   4 -122.50                       
## 2   1 -125.93 -3 6.8576    0.07658 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

### Step 2: Calculate odds ratio and probability

Student sex remained a statistically insignificant variable. The only level of parental educational achievement that was significant was neither parent having a level 5 education.

| | Estimate | Odds_ratio |
|---|---|---|
| (Intercept) | 2.3684 | 10.6806 |
| sexM | 0.2746 | 1.3161 |
| pedu_paired At least one parent with a level 5 qualification | 0.0660 | 1.0683 |
| pedu_paired Neither parent with a level 5 qualification | -0.8143 | 0.4430 |

| | Probability |
|---|---|
| Female | 0.91 |
| Male | 0.93 |
| Female_level5 | 0.92 |
| Male_level5 | 0.94 |
| Female_nolevel5 | 0.83 |
| Male_nolevel5 | 0.86 |
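These group probabilities can be recovered from the coefficients of the extended model; a sketch using `plogis()`, R's inverse-logit (baseline group: female, at least one parent with a higher degree):

```r
# Probabilities of sitting the exam by group, from the model coefficients
b <- c(intercept = 2.36843,   # baseline log-odds
       sexM      = 0.27464,
       level5    = 0.06604,   # at least one parent with a level 5 qualification
       no_level5 = -0.81430)  # neither parent with a level 5 qualification

plogis(b["intercept"])                               # Female          ~0.91
plogis(b["intercept"] + b["sexM"])                   # Male            ~0.93
plogis(b["intercept"] + b["no_level5"])              # Female_nolevel5 ~0.83
plogis(b["intercept"] + b["sexM"] + b["no_level5"])  # Male_nolevel5   ~0.86
```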

## Predicted levels same as naive model (majority level)

| | Predicted 1 |
|---|---|
| Actual 0 | 39 |
| Actual 1 | 343 |

### Step 3: Check goodness of fit of the model over the baseline model

| #Df | LogLik | Df | Chisq | Pr(>Chisq) |
|---|---|---|---|---|
| 4 | -122.5014 | NA | NA | NA |
| 1 | -125.9302 | -3 | 6.85764 | 0.0765764 |

### Step 4: Determine usefulness

The pseudo-R² statistics show that our extended model explains between 1.78% and 3.69% of the variation in the outcome variable. This is consistent with our findings above.

| | CoxSnell | Nagelkerke |
|---|---|---|
| R-Squared | 0.0178 | 0.0369 |

### Step 5: Sensitivity Analysis and Summary Stats.

When we compared the performance of the models using sensitivity analysis, the AUC increased to 0.61 compared to the first model's AUC of .51, but both models are considered weak.

### Step 6: Check the assumption of linearity of independent variables

We have two predictor variables, so we need to check for collinearity. Conceptually, student sex and parental educational achievement should not be collinear, but the step is included for illustration. We check the assumption of linearity of independent variables and log odds using a Hosmer-Lemeshow test.

```
## 
## Hosmer and Lemeshow test (binary model)
## 
## data:  tbl_sperf_sex_medu_diff$sat_maths, fitted(logmodel2)
## X-squared = 6.3334, df = 3, p-value = 0.09647
```

```
##                 GVIF Df GVIF^(1/(2*Df))
## sex         1.005222  1        1.002608
## pedu_paired 1.005222  2        1.001303
```

```
##                  GVIF  Df GVIF^(1/(2*Df))
## sex         0.9948052 1.0       0.9973992
## pedu_paired 0.9948052 0.5       0.9986988
```

### Reporting Binary Logistic Regression with two Predictors

A binary logistic regression model was built of student propensity to sit the final Maths examination as predicted by student sex and parental educational achievement. Our outcome variable is the probability of a student sitting the final Maths exam given their gender and their parents' highest educational achievement. We give zero to the "no" category, which is our reference category, and 1 to our category of interest, "yes" they did sit the exam. For our predictor we used sex, with values Male and Female. For parental education we observe three levels: at least one parent with a level 5 qualification, at least one parent with a higher degree, and neither parent with a level 5 qualification; at least one parent with a higher degree was our reference category. Neither parent having a level 5 education was shown to be statistically significant (p < .05). Being male was still not found to be a statistically significant (p = .43) predictor of whether a student will or will not sit the examination. Female students have an odds ratio of 10.68 to 1 of sitting the exam, while male students' odds were 1.31 times higher, corresponding to probabilities of .91 and .93 respectively. The lowest probability of sitting the Maths exam was for female students for whom neither parent has a level 5 education. R² = .0178 (Cox-Snell), .0369 (Nagelkerke), Pr(>Chisq) = .077.

## Multinomial logistic regression

We can also use logistic regression to predict whether a case will be a member of more than two categories. To illustrate this methodology, we will use the student performance data set to build a model to predict student alcohol consumption on the basis of performance.

HA: Subject performance, failures, and absences are good predictors of student tendency to consume alcohol

### Step 1: Quick summary of the data

A general guideline is that we need a minimum of 10 cases of the least frequent value for each categorical variable in our model. A quick look at the data shows we meet this requirement.

| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|---|---|---|---|---|---|
| 1 | 144 | 37.696335 | 37.69634 | 37.696335 | 37.69634 |
| 2 | 86 | 22.513089 | 60.20942 | 22.513089 | 60.20942 |
| 3 | 76 | 19.895288 | 80.10471 | 19.895288 | 80.10471 |
| 4 | 49 | 12.827225 | 92.93194 | 12.827225 | 92.93194 |
| 5 | 27 | 7.068063 | 100.00000 | 7.068063 | 100.00000 |
| &lt;NA&gt; | 0 | NA | NA | 0.000000 | 100.00000 |
| Total | 382 | 100.000000 | 100.00000 | 100.000000 | 100.00000 |

| | n | median | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|
| X1 | 382 | 0 | 0 | 0 | 3 | 3 | 4.150294 | 17.77067 | 0.0262603 |

| | n | median | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|
| X1 | 382 | 2 | 2.9652 | 0 | 32 | 32 | 2.165519 | 6.17826 | 0.251011 |

### Step 2: Run Multi-nominal Logistic Regression

```
##         Walc.p
## famrel.p  1  2  3  4  5
##        1  2  3  1  0  2
##        2  6  3  3  4  2
##        3 19 16 14 13  5
##        4 71 39 41 21 12
##        5 46 25 17 11  6
```

```
##          M       SD
## 1 13.13194 2.430133
## 2 12.68605 2.963550
## 3 12.36842 3.281634
## 4 11.48980 2.731929
## 5 10.96296 3.787444
```

```
##            M        SD
## 0  12.720000 3.4740070
## 1  14.333333 1.5275252
## 2  12.984615 2.3617546
## 3   8.500000 2.1213203
## 4  12.393443 2.4985242
## 6  12.657895 1.9765697
## 7  15.500000 0.7071068
## 8  12.263158 2.4909192
## 9  10.000000        NA
## 10 12.555556 3.0867099
## 12 10.428571 3.2586880
## 14 11.333333 3.0110906
## 15 11.000000 1.4142136
## 16 10.000000 1.7320508
## 18 13.500000 0.7071068
## 21 13.000000        NA
## 22  7.333333 2.3094011
## 30 16.000000        NA
## 32 14.000000        NA
```

```
## # weights:  20 (12 variable)
## initial  value 614.805283 
## iter  10 value 556.514900
## final  value 549.886689 
## converged
```

```
## Call:
## multinom(formula = Walc.p ~ absences.p + pG3, data = tbl_sperf_family_alc)
## 
## Coefficients:
##   (Intercept) absences.p         pG3
## 2  0.01087414 0.04400762 -0.05210308
## 3  0.49526196 0.02913020 -0.09622110
## 4  0.96253579 0.06192398 -0.18320827
## 5  0.66035020 0.09538812 -0.22493513
## 
## Std. Errors:
##   (Intercept) absences.p        pG3
## 2   0.7013638 0.03074781 0.05172324
## 3   0.7022372 0.03279891 0.05249118
## 4   0.7414868 0.03397290 0.05753070
## 5   0.8471884 0.03716125 0.06793020
## 
## Residual Deviance: 1099.773 
## AIC: 1123.773
```
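The `Call:` line in the output above implies `nnet::multinom()`. `multinom()` reports only coefficients and standard errors, so the p-values that follow are computed with Wald z-tests; a sketch, assuming the `nnet` package and the `tbl_sperf_family_alc` data frame:

```r
library(nnet)

# Multinomial model: weekend alcohol consumption (Walc.p, reference level 1)
# predicted by Portuguese-class absences and final Portuguese grade
logmodel3 <- multinom(Walc.p ~ absences.p + pG3, data = tbl_sperf_family_alc)

# Two-tailed p-values from Wald z-tests (coefficient / standard error)
z <- summary(logmodel3)$coefficients / summary(logmodel3)$standard.errors
p <- (1 - pnorm(abs(z))) * 2
```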

Two-tailed p-values for the coefficients of each outcome level versus the reference level:

| | (Intercept) | absences.p | pG3 |
|---|---|---|---|
| 2 | 0.9876299 | 0.1523603 | 0.3137696 |
| 3 | 0.4806466 | 0.3744626 | 0.0667891 |
| 4 | 0.1942475 | 0.0683418 | 0.0014499 |
| 5 | 0.4357083 | 0.0102621 | 0.0009287 |

```
## # weights:  10 (4 variable)
## initial  value 614.805283 
## iter  10 value 563.601169
## iter  10 value 563.601167
## final  value 563.601167
## converged
```

```
## Likelihood ratio test
## 
## Model 1: Walc.p ~ absences.p + pG3
## Model 2: Walc.p ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1  12 -549.89                         
## 2   4 -563.60 -8 27.429  0.0005959 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

### Step 2: Calculate odds ratio and probability

The relative odds ratio for a one-unit increase in Portuguese final grade is .9492 for being a level 2 alcohol consumer versus level 1 (likelihood decreasing). The relative odds ratio for a one-unit increase in Portuguese absences is 1.045 for being a level 2 alcohol consumer versus level 1 (likelihood increasing).

| | (Intercept) | absences.p | pG3 |
|---|---|---|---|
| 2 | 1.0109 | 1.0450 | 0.9492 |
| 3 | 1.6409 | 1.0296 | 0.9083 |
| 4 | 2.6183 | 1.0639 | 0.8326 |
| 5 | 1.9355 | 1.1001 | 0.7986 |
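The relative odds ratios in the table are simply the exponentiated coefficients; a sketch (assumes the fitted multinomial model is called `logmodel3`):

```r
# Relative odds (risk) ratios for the multinomial model:
# e.g. exp(0.0440) ~ 1.045 for absences.p at level 2 versus level 1
exp(coef(logmodel3))
```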

```
##   absences.p pG3 variable probability
## 1          0   0        1   0.1218671
## 2          1   1        1   0.1323312
## 3          2   2        1   0.1433149
## 4          3   3        1   0.1547898
## 5          4   4        1   0.1667205
## 6          5   5        1   0.1790644
```

### Step 4: Check goodness of fit of the model over the baseline model

Below we can see our model is statistically significant (Pr(>Chisq) < .001), which means it can be assumed to be better than the baseline.

```
## # weights:  10 (4 variable)
## initial  value 614.805283 
## iter  10 value 563.601169
## iter  10 value 563.601167
## final  value 563.601167
## converged
```

| #Df | LogLik | Df | Chisq | Pr(>Chisq) |
|---|---|---|---|---|
| 12 | -549.8867 | NA | NA | NA |
| 4 | -563.6012 | -8 | 27.42896 | 0.0005959 |

### Step 5: Determine usefulness

The pseudo-R² statistics show that our model explains between 6.93% and 7.31% of the variation in the outcome variable. This is consistent with our findings above.

| | CoxSnell | Nagelkerke |
|---|---|---|
| R-Squared | 0.0693 | 0.0731 |

### Step 6: Check the assumption of linearity of independent variables

We checked the assumption of linearity of independent variables and log odds using a Hosmer-Lemeshow test and found it was not statistically significant. Multicollinearity analysis showed that the tolerance and variance inflation factor measures were not within acceptable levels (tolerance > 0.4, VIF < 2.5), as outlined in Tarling (2008). As such we may have problems with collinearity.

```
## 
## Hosmer and Lemeshow test (multinomial model)
## 
## data:  tbl_sperf_family_alc$Walc.p, fitted(logmodel3)
## X-squared = 34.594, df = 32, p-value = 0.345
```

| | Tolerance | VIF |
|---|---|---|
| absences.p | 0.4075424 | 2.453733 |
| pG3 | 0.0335390 | 29.816033 |

### Reporting Multinominal Logistic Regression

A multinomial logistic regression model of student alcohol consumption as predicted by student performance and absenteeism was built. Our outcome variable is the probability of a student having each of the alcohol consumption levels given their performance and absenteeism rates. Our reference outcome category was level 1, very low consumption. For our predictors, we used student performance in and absence from Portuguese class. The model was statistically significant (Pr(>Chisq) < .001) and explains between 6.93% and 7.31% of the variation in the output variable. The level of collinearity between predictors was outside the acceptable range, however.

```
## 
## =========================================================
##                          Dependent variable:             
##             ---------------------------------------------
##                  2          3          4          5      
##                 (1)        (2)        (3)        (4)     
## ---------------------------------------------------------
## absences.p     0.044      0.029     0.062*     0.095**   
##               (0.031)    (0.033)    (0.034)    (0.037)   
##                                                          
## pG3           -0.052     -0.096*   -0.183***  -0.225***  
##               (0.052)    (0.052)    (0.058)    (0.068)   
##                                                          
## Constant       0.011      0.495      0.963      0.660    
##               (0.701)    (0.702)    (0.741)    (0.847)   
##                                                          
## ---------------------------------------------------------
## Akaike Inf. Crit. 1,123.773 1,123.773 1,123.773 1,123.773
## =========================================================
## Note:                         *p<0.1; **p<0.05; ***p<0.01
```