Digital Portfolio – Prepare

Introduction:

The purpose of this section is to address the statistical concepts of the statistical analysis preparation phase, as covered during the TU060 MATH9102 Module, and to demonstrate a coherent understanding of them. The preparation phase includes the following concepts:
  • Formulation of a Research Question
  • Formulation of a Hypothesis and the purpose of Hypothesis testing
  • Populations and Samples
  • Describing a sample
  • Statistical measures
  • Identification of analysis challenges, limitations and constraints.
Understanding is illustrated by applying these concepts to the real-world Portuguese Secondary School Student Performance dataset provided. The following sections describe the sample data in detail.

Citation:

Populations and samples:

The purpose of statistical analysis is to make general statements about a complete collection of things, referred to as the population. However, we rarely have access to an entire population. We therefore conduct our analysis on a smaller subset of the population, known as a sample, and from that sample we infer things about the population as a whole. In this instance the sample was collected during the 2005-2006 school year from two public secondary schools in the Alentejo region of Portugal.
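
The relationship between a population and a sample can be sketched with simulated data. The population below is invented purely for illustration (it is not the student data), and the names `population` and `sample_grades` are assumptions of this sketch.

```r
# Simulated illustration only: this "population" is invented, not the real student data
set.seed(42)

# Hypothetical population of 10,000 grades on a 0-20 scale
population <- pmin(pmax(round(rnorm(10000, mean = 11, sd = 4)), 0), 20)

# Draw a simple random sample with the same size as the study sample
sample_grades <- sample(population, size = 382)

# The sample statistics serve as estimates of the population parameters
pop_mean    <- mean(population)
sample_mean <- mean(sample_grades)
std_error   <- sd(sample_grades) / sqrt(length(sample_grades))

cat("Population mean:", round(pop_mean, 2), "\n")
cat("Sample mean:    ", round(sample_mean, 2), "\n")
cat("Standard error: ", round(std_error, 3), "\n")
```

In practice the population mean is unknown; the standard error indicates how far the sample mean is likely to lie from it.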

The sample contains 382 records and 33 variables of interest. It was constructed by combining school records with a closed questionnaire that students answered to provide demographic data. The sample only contains performance measures for the core Mathematics and Portuguese language subjects. Initially, 788 students took part. During preprocessing by the dataset publisher, the record count was reduced to 382 by removing records that were missing student identification details and records that had performance results for only one subject.

Description of our statistical data types:

The variables of interest in the sample are described in the table below. The first part of the table describes variables related to demographic concepts and the last four rows denote the variables taken from the school reports related to performance and attendance. The following sections provide measures of the different statistical data types in sufficient detail for the reader to be able to understand what we have done and why we have done it without the need to access the student performance sample dataset themselves. We describe the sample by providing summary statistics about the variables of interest.

Later on we will illustrate inferential statistical methods by inferring population parameters from sample statistics. Note this exercise will only be done to illustrate the author’s understanding of the statistical methods since the portfolio brief clearly indicated the sample data should only be treated as training data to inform model selection, not to actually make statements of significance about the generalizability of findings.
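
As a small worked illustration of inferring a population parameter from sample statistics, the sketch below builds a 95% confidence interval for the mean final Mathematics grade from the n, mean and sd reported for mG3 in the summary table later in this section. The t-based interval is an assumption of this sketch (it relies on the sampling distribution of the mean being approximately normal); it is not necessarily the method used in later phases.

```r
# Values reported for mG3 in the summary statistics table of this section
n      <- 382
mean_x <- 10.38743
sd_x   <- 4.687242

# Standard error of the mean
se <- sd_x / sqrt(n)

# Two-sided 95% interval using the t distribution with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)
ci <- c(lower = mean_x - t_crit * se, upper = mean_x + t_crit * se)

print(round(se, 4))
print(round(ci, 2))
```

The computed standard error can be compared against the se column reported for mG3 in the summary table.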

Import dataset

Two datasets must be imported:

############
# PART: Import data
############
# Dataset 1: Import sperformance-dataset
tbl_sperf_all <- read.csv('sperformance-dataset.csv', header = TRUE)
names(tbl_sperf_all)[1] <- 'School' # Fix issue with the name of the first field.

# Dataset 2: Import sperformance-dataset variable description, created by me
tbl_sperf_description_all <- read.csv('TU060_MATH9102_Student_variables_description.csv',
                                      header = TRUE)

Statistical Data Types

Attribute   Description                       Statistical Data Type
----------  --------------------------------  ---------------------------------------------------------------------
sex         student’s sex                     Binary: Categorical with values Female or Male
age         student’s age                     Numeric: Interval ratio with values 15 to 22
school      student’s school                  Binary: Categorical with values Gabriel Pereira or Mousinho da Silveira
address     student’s home address type       Binary: Categorical with values urban or rural
Pstatus     parents’ cohabitation status      Binary: Categorical with values together or apart
Medu        mother’s education                Numeric ordinal: Categorical with values 0 to 4
Mjob        mother’s job                      Nominal: Categorical (example values: at_home, health etc.)
Fedu        father’s education                Numeric ordinal: Categorical with values 0 to 4
Fjob        father’s job                      Nominal: Categorical (example values: at_home, health etc.)
guardian    student’s guardian                Nominal: Categorical with values mother, father or other
famsize     family size                       Binary: Categorical with values ≤ 3 (LE3) or > 3 (GT3)
famrel      quality of family relationships   Numeric ordinal: Categorical with values 1 – very bad to 5 – excellent
reason      reason to choose this school      Nominal: Categorical (example values: reputation, course etc.)
traveltime  home to school travel time        Numeric ordinal: Categorical with values 1 to 4
studytime   weekly study time                 Numeric ordinal: Categorical with values 1 to 4
failures    number of past class failures     Numeric: with values 0 to 3
schoolsup   extra educational school support  Binary: Categorical with values yes or no
famsup      family educational support        Binary: Categorical with values yes or no
activities  extra-curricular activities       Binary: Categorical with values yes or no
paid        extra paid classes                Binary: Categorical with values yes or no
internet    Internet access at home           Binary: Categorical with values yes or no
nursery     attended nursery school           Binary: Categorical with values yes or no
higher      wants to take higher education    Binary: Categorical with values yes or no
romantic    in a romantic relationship        Binary: Categorical with values yes or no
freetime    free time after school            Numeric ordinal: Categorical with values 1 – very low to 5 – very high
goout       going out with friends            Numeric ordinal: Categorical with values 1 – very low to 5 – very high
Walc        weekend alcohol consumption       Numeric ordinal: Categorical with values 1 – very low to 5 – very high
Dalc        workday alcohol consumption       Numeric ordinal: Categorical with values 1 – very low to 5 – very high
health      current health status             Numeric ordinal: Categorical with values 1 – very low to 5 – very high
absences    number of school absences         Numeric: Interval ratio with values 0 to 93
G1          first period grade                Numeric: Interval ratio with values 0 to 20
G2          second period grade               Numeric: Interval ratio with values 0 to 20
G3          final grade                       Numeric: Interval ratio with values 0 to 20
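
The data types in the table above map naturally onto R types. The following is a minimal sketch on a hypothetical toy data frame (`students` and its four rows are invented for illustration; only the attribute names come from the table): binary and nominal variables become unordered factors, ordinal variables become ordered factors, and counts stay numeric.

```r
# Hypothetical toy records; only the attribute names come from the table above
students <- data.frame(
  sex      = c("F", "M", "F", "M"),
  famrel   = c(4, 5, 3, 4),
  Mjob     = c("teacher", "other", "health", "at_home"),
  absences = c(0, 6, 2, 10),
  stringsAsFactors = FALSE
)

# Binary and nominal variables: unordered factors
students$sex  <- factor(students$sex, levels = c("F", "M"))
students$Mjob <- factor(students$Mjob)

# Ordinal variable: ordered factor, so comparisons respect the ranking
students$famrel <- factor(students$famrel, levels = 1:5, ordered = TRUE)

# Count variable: stays numeric (stored as integer)
students$absences <- as.integer(students$absences)

str(students)
table(students$famrel)  # frequencies, including levels with zero observations
```

Declaring all five levels for famrel keeps unobserved categories visible in frequency tables instead of silently dropping them.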

Quantitative Data:

For numerical data we need to describe the center point of the data and its spread. Together these tell us the overall shape and distribution of the variable of interest. In the sample dataset there are three numerical variables of interest related to student performance: absences, grades in Mathematics and grades in Portuguese. The tables below contain statistics that describe, for each variable, the center point (mean and median), the spread (range, IQR, min and max) and the variability (standard deviation, median absolute deviation). The shape of the distribution is important because it helps us decide, during hypothesis testing, whether we can use parametric tests or have to fall back on the less statistically powerful non-parametric tests.

In the explore and analyse phase we will look more closely at the distribution of some of these variables, but we can already see that the grade variables approach a normal distribution, with some outliers, gaps and clusters, and may be candidates for parametric hypothesis tests. The absence variables are clearly not normally distributed, given their strong positive skew (4.01 for Mathematics and 2.17 for Portuguese), and we will likely need to use non-parametric tests to evaluate any hypotheses based on them.
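
One common way to support the choice between parametric and non-parametric tests is a formal normality test. The sketch below uses the Shapiro-Wilk test from base R on simulated data; both the choice of test and the helper `choose_test` are assumptions of this sketch rather than the method used in this portfolio, and with several hundred observations the test can flag even minor departures from normality, so it should be read alongside histograms and skew/kurtosis values.

```r
# Simulated data only: stand-ins for a roughly symmetric and a skewed variable
set.seed(1)
grades   <- rnorm(382, mean = 11, sd = 3.5)    # roughly symmetric
absences <- round(rexp(382, rate = 1 / 5))     # strongly right-skewed

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
sw_grades   <- shapiro.test(grades)
sw_absences <- shapiro.test(absences)

# Hypothetical helper: suggest a family of tests from the normality result
choose_test <- function(sw, alpha = 0.05) {
  if (sw$p.value > alpha) {
    "parametric (e.g. t-test)"
  } else {
    "non-parametric (e.g. Wilcoxon)"
  }
}

cat("grades:  ", choose_test(sw_grades), "\n")
cat("absences:", choose_test(sw_absences), "\n")
```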

The tables below also include measures of skew and kurtosis. These statistics help us to understand whether our data has unusual aspects, such as outliers, gaps or clusters, which may shift the distribution or make it flatter or more peaked. For continuous variables that appear to be normally distributed, the mean and sd are reported. For continuous variables that are not normally distributed, the median and IQR (and other measures of range) are reported. Summary statistics for non-continuous categorical variables are reported in a later section.

Summary statistics for Numerical Data Types with Normal distribution

     vars    n      mean        sd        skew    kurtosis         se   trimmed
mG1     1  382  10.86126  3.349167   0.2741912  -0.7061137  0.1713583  10.74510
mG2     2  382  10.71204  3.832560  -0.3970490   0.4666241  0.1960908  10.83007
mG3     3  382  10.38743  4.687242  -0.7003219   0.2415275  0.2398202  10.81373
pG1     4  382  12.11257  2.556531  -0.1523548   0.6986985  0.1308035  12.09150
pG2     5  382  12.23822  2.468341   0.2368869  -0.2038155  0.1262913  12.15359
pG3     6  382  12.51571  2.945438  -0.9891237   3.3902597  0.1507017  12.62092

Summary statistics for Numerical Data Types with non Normal distribution

            vars    n  median     mad  min  max  range       skew    kurtosis  IQR
age            1  382      17  1.4826   15   22      7  0.3953408   0.0648447    1
absences.m     2  382       3  4.4478    0   75     75  4.0116261  26.2928570    8
absences.p     3  382       2  2.9652    0   32     32  2.1655187   6.1782596    6
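
Statistics of the kind shown in the tables above can be computed with base R alone. The following is a minimal sketch: the helper `describe_numeric` and the toy vector are hypothetical, and the simple moment-based skew and kurtosis formulas used here are an assumption that may differ slightly from the estimator behind the reported values.

```r
# Hypothetical helper: summary statistics for one numeric vector
describe_numeric <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  n <- length(x)
  m <- mean(x)
  s <- sd(x)
  z <- (x - m) / s           # standardised values
  c(n        = n,
    mean     = m,
    sd       = s,
    se       = s / sqrt(n),  # standard error of the mean
    median   = median(x),
    mad      = mad(x),       # median absolute deviation (scaled by 1.4826)
    min      = min(x),
    max      = max(x),
    range    = max(x) - min(x),
    IQR      = IQR(x),
    skew     = mean(z^3),        # > 0 means a longer right tail
    kurtosis = mean(z^4) - 3)    # excess kurtosis; 0 for a normal curve
}

# Hypothetical toy data with one large value, mimicking a skewed absence count
absences_toy <- c(0, 0, 1, 2, 2, 3, 4, 6, 8, 20)
stats <- describe_numeric(absences_toy)
print(round(stats, 3))
```

For a right-skewed variable like this one, the mean is pulled above the median, which is why the median and IQR are the preferred summaries for the absence variables.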

############
# PART: Visualize the numerical variables
############
library(ggplot2)  # plotting (after_stat() needs ggplot2 >= 3.3.0)
library(cowplot)  # plot_grid()

# tbl_sperf_numerical_measurements is assumed to hold only the numerical
# variables of interest (created earlier from tbl_sperf_all)
variables <- colnames(tbl_sperf_numerical_measurements)

# Create one density-scaled histogram per variable, overlaid with a
# normal curve that uses the variable's own mean and standard deviation
plots <- lapply(variables, function(variable) {
  values <- tbl_sperf_numerical_measurements[[variable]]

  # Absences cover a wider range than grades, so use wider bins
  binwidth <- if (variable %in% c('absences.m', 'absences.p')) 2 else 1

  ggplot(tbl_sperf_numerical_measurements, aes(x = .data[[variable]])) +
    geom_histogram(aes(y = after_stat(density), fill = after_stat(count)),
                   binwidth = binwidth, colour = "black") +
    stat_function(fun   = dnorm,
                  color = "red",
                  args  = list(mean = mean(values, na.rm = TRUE),
                               sd   = sd(values, na.rm = TRUE))) +
    labs(x = variable) +
    scale_fill_gradient("Count", low = "#DCDCDC", high = "#7C7C7C")
})
names(plots) <- variables

# Arrange all the plots in a grid
plot_grid(plotlist = plots,
          labels   = "auto", ncol = 3
)
[Figure: Visualise the numerical variables (histograms with fitted normal curves)]

Categorical/Qualitative data:

Categorical variables are qualitative and describe our dataset. They allow us to segment our sample on its basic characteristics. In the dataset there are 28 categorical variables describing the demographics of students. Most of the categorical variables in the sample are nominal or ordinal and numerically encoded. For numerically encoded categorical data, it does not make sense to describe the data in terms of an average value or standard deviation, since the numerical values are just an encoding and have no quantitative meaning. Instead, we describe categorical data in terms of its possible values and how frequently those values occur. Important summary statistics include the count of distinct values, the list of possible values, the relative proportion with which each value occurs, and the most frequently occurring value. The following tables describe the summary statistics for the categorical variables in the sample dataset.

Note: The dataset contained repetition of demographic variables because students completed the survey in both their Mathematics and Portuguese classes. This repetition is included in the tables below (variables suffixed .m for Mathematics and .p for Portuguese) but excluded in the dataset.
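
The summary statistics described above for a categorical variable (distinct values, frequencies, relative proportions and the most frequent value) can be sketched in base R as follows. The toy vector is hypothetical and only borrows the category labels of the Mjob variable; the names `summary_tbl`, `n_distinct` and `modal_value` are assumptions of this sketch.

```r
# Hypothetical toy vector using the category labels of the Mjob variable
mjob <- c("other", "services", "other", "teacher", "at_home",
          "other", "health", "services", "other", "teacher")

# Frequency of each distinct value (missing values counted if present)
freq <- table(mjob, useNA = "ifany")

# Relative proportion of each value, as a percentage
pct <- round(100 * prop.table(freq), 2)

summary_tbl <- data.frame(
  value   = names(freq),
  freq    = as.vector(freq),
  percent = as.vector(pct),
  cum_pct = cumsum(as.vector(pct))
)

n_distinct  <- length(freq)                 # count of distinct values
modal_value <- names(freq)[which.max(freq)] # most frequently occurring value

print(summary_tbl)
cat("Distinct values:", n_distinct, " Mode:", modal_value, "\n")
```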

## Frequencies  
## tbl_sperf_categorical_measurements$School  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          GP    342     89.53          89.53     89.53          89.53
##          MS     40     10.47         100.00     10.47         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$sex  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           F    198     51.83          51.83     51.83          51.83
##           M    184     48.17         100.00     48.17         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$address  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           R     81     21.20          21.20     21.20          21.20
##           U    301     78.80         100.00     78.80         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$famsize  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##         GT3    278     72.77          72.77     72.77          72.77
##         LE3    104     27.23         100.00     27.23         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$Pstatus  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           A     38      9.95           9.95      9.95           9.95
##           T    344     90.05         100.00     90.05         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$Medu  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           0      3      0.79           0.79      0.79           0.79
##           1     51     13.35          14.14     13.35          14.14
##           2     98     25.65          39.79     25.65          39.79
##           3     95     24.87          64.66     24.87          64.66
##           4    135     35.34         100.00     35.34         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$Fedu  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           0      2      0.52           0.52      0.52           0.52
##           1     77     20.16          20.68     20.16          20.68
##           2    105     27.49          48.17     27.49          48.17
##           3     99     25.92          74.08     25.92          74.08
##           4     99     25.92         100.00     25.92         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$Mjob  
## Type: Character  
## 
##                  Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## -------------- ------ --------- -------------- --------- --------------
##        at_home     53     13.87          13.87     13.87          13.87
##         health     33      8.64          22.51      8.64          22.51
##          other    138     36.13          58.64     36.13          58.64
##       services     96     25.13          83.77     25.13          83.77
##        teacher     62     16.23         100.00     16.23         100.00
##           <NA>      0                               0.00         100.00
##          Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$Fjob  
## Type: Character  
## 
##                  Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## -------------- ------ --------- -------------- --------- --------------
##        at_home     16      4.19           4.19      4.19           4.19
##         health     17      4.45           8.64      4.45           8.64
##          other    211     55.24          63.87     55.24          63.87
##       services    107     28.01          91.88     28.01          91.88
##        teacher     31      8.12         100.00      8.12         100.00
##           <NA>      0                               0.00         100.00
##          Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$reason  
## Type: Character  
## 
##                    Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ---------------- ------ --------- -------------- --------- --------------
##           course    140     36.65          36.65     36.65          36.65
##             home    110     28.80          65.45     28.80          65.45
##            other     34      8.90          74.35      8.90          74.35
##       reputation     98     25.65         100.00     25.65         100.00
##             <NA>      0                               0.00         100.00
##            Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$nursery  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          no     72     18.85          18.85     18.85          18.85
##         yes    310     81.15         100.00     81.15         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$internet  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          no     58     15.18          15.18     15.18          15.18
##         yes    324     84.82         100.00     84.82         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$guardian.m  
## Type: Character  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------ ------ --------- -------------- --------- --------------
##       father     91     23.82          23.82     23.82          23.82
##       mother    275     71.99          95.81     71.99          95.81
##        other     16      4.19         100.00      4.19         100.00
##         <NA>      0                               0.00         100.00
##        Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$traveltime.m  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1    250     65.45          65.45     65.45          65.45
##           2    103     26.96          92.41     26.96          92.41
##           3     21      5.50          97.91      5.50          97.91
##           4      8      2.09         100.00      2.09         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$studytime.m  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1    103     26.96          26.96     26.96          26.96
##           2    190     49.74          76.70     49.74          76.70
##           3     62     16.23          92.93     16.23          92.93
##           4     27      7.07         100.00      7.07         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$failures.m  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           0    316     82.72          82.72     82.72          82.72
##           1     38      9.95          92.67      9.95          92.67
##           2     11      2.88          95.55      2.88          95.55
##           3     17      4.45         100.00      4.45         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$schoolsup.m  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          no    331     86.65          86.65     86.65          86.65
##         yes     51     13.35         100.00     13.35         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$famsup.m  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          no    144     37.70          37.70     37.70          37.70
##         yes    238     62.30         100.00     62.30         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$paid.m  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          no    205     53.66          53.66     53.66          53.66
##         yes    177     46.34         100.00     46.34         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$activities.m  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          no    181     47.38          47.38     47.38          47.38
##         yes    201     52.62         100.00     52.62         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$higher.m  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          no     18      4.71           4.71      4.71           4.71
##         yes    364     95.29         100.00     95.29         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$romantic.m  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          no    261     68.32          68.32     68.32          68.32
##         yes    121     31.68         100.00     31.68         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$famrel.m  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1      9      2.36           2.36      2.36           2.36
##           2     18      4.71           7.07      4.71           7.07
##           3     66     17.28          24.35     17.28          24.35
##           4    183     47.91          72.25     47.91          72.25
##           5    106     27.75         100.00     27.75         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$freetime.m  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1     18      4.71           4.71      4.71           4.71
##           2     62     16.23          20.94     16.23          20.94
##           3    156     40.84          61.78     40.84          61.78
##           4    109     28.53          90.31     28.53          90.31
##           5     37      9.69         100.00      9.69         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$goout.m  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1     24      6.28           6.28      6.28           6.28
##           2     99     25.92          32.20     25.92          32.20
##           3    123     32.20          64.40     32.20          64.40
##           4     82     21.47          85.86     21.47          85.86
##           5     54     14.14         100.00     14.14         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$Dalc.m  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1    268     70.16          70.16     70.16          70.16
##           2     73     19.11          89.27     19.11          89.27
##           3     24      6.28          95.55      6.28          95.55
##           4      8      2.09          97.64      2.09          97.64
##           5      9      2.36         100.00      2.36         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$Walc.m  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1    146     38.22          38.22     38.22          38.22
##           2     85     22.25          60.47     22.25          60.47
##           3     76     19.90          80.37     19.90          80.37
##           4     48     12.57          92.93     12.57          92.93
##           5     27      7.07         100.00      7.07         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$health.m  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1     46     12.04          12.04     12.04          12.04
##           2     43     11.26          23.30     11.26          23.30
##           3     83     21.73          45.03     21.73          45.03
##           4     64     16.75          61.78     16.75          61.78
##           5    146     38.22         100.00     38.22         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$guardian.p  
## Type: Character  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------ ------ --------- -------------- --------- --------------
##       father     91     23.82          23.82     23.82          23.82
##       mother    275     71.99          95.81     71.99          95.81
##        other     16      4.19         100.00      4.19         100.00
##         <NA>      0                               0.00         100.00
##        Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$traveltime.p  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1    250     65.45          65.45     65.45          65.45
##           2    102     26.70          92.15     26.70          92.15
##           3     22      5.76          97.91      5.76          97.91
##           4      8      2.09         100.00      2.09         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$studytime.p  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1    102     26.70          26.70     26.70          26.70
##           2    190     49.74          76.44     49.74          76.44
##           3     63     16.49          92.93     16.49          92.93
##           4     27      7.07         100.00      7.07         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$failures.p  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           0    348     91.10          91.10     91.10          91.10
##           1     21      5.50          96.60      5.50          96.60
##           2      6      1.57          98.17      1.57          98.17
##           3      7      1.83         100.00      1.83         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$schoolsup.p  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          no    332     86.91          86.91     86.91          86.91
##         yes     50     13.09         100.00     13.09         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$famsup.p  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          no    143     37.43          37.43     37.43          37.43
##         yes    239     62.57         100.00     62.57         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$paid.p  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          no    356     93.19          93.19     93.19          93.19
##         yes     26      6.81         100.00      6.81         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$activities.p  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          no    182     47.64          47.64     47.64          47.64
##         yes    200     52.36         100.00     52.36         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$higher.p  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          no     18      4.71           4.71      4.71           4.71
##         yes    364     95.29         100.00     95.29         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$romantic.p  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          no    259     67.80          67.80     67.80          67.80
##         yes    123     32.20         100.00     32.20         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$famrel.p  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1      8      2.09           2.09      2.09           2.09
##           2     18      4.71           6.81      4.71           6.81
##           3     67     17.54          24.35     17.54          24.35
##           4    184     48.17          72.51     48.17          72.51
##           5    105     27.49         100.00     27.49         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$freetime.p  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1     17      4.45           4.45      4.45           4.45
##           2     62     16.23          20.68     16.23          20.68
##           3    157     41.10          61.78     41.10          61.78
##           4    108     28.27          90.05     28.27          90.05
##           5     38      9.95         100.00      9.95         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$goout.p  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1     24      6.28           6.28      6.28           6.28
##           2     98     25.65          31.94     25.65          31.94
##           3    124     32.46          64.40     32.46          64.40
##           4     81     21.20          85.60     21.20          85.60
##           5     55     14.40         100.00     14.40         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$Dalc.p  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1    267     69.90          69.90     69.90          69.90
##           2     74     19.37          89.27     19.37          89.27
##           3     24      6.28          95.55      6.28          95.55
##           4      8      2.09          97.64      2.09          97.64
##           5      9      2.36         100.00      2.36         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$Walc.p  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1    144     37.70          37.70     37.70          37.70
##           2     86     22.51          60.21     22.51          60.21
##           3     76     19.90          80.10     19.90          80.10
##           4     49     12.83          92.93     12.83          92.93
##           5     27      7.07         100.00      7.07         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
## 
## tbl_sperf_categorical_measurements$health.p  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1     46     12.04          12.04     12.04          12.04
##           2     44     11.52          23.56     11.52          23.56
##           3     83     21.73          45.29     21.73          45.29
##           4     62     16.23          61.52     16.23          61.52
##           5    147     38.48         100.00     38.48         100.00
##        <NA>      0                               0.00         100.00
##       Total    382    100.00         100.00    100.00         100.00
[Figures: Visualisation of categorical measurement statistics, plots 1 to 3]

Research Goal, Questions and Hypothesis

The following section illustrates the author’s ability to formulate and communicate research questions that are suitable for investigation through statistical analysis.

Research Goal:

To gain a deeper understanding of the nature of the relationship between student performance in core secondary school subjects and student demographics.

Rationale:

There has been considerable discourse about student performance and the potential applicability of probability and statistical methods to predicting outcomes. By gaining a deeper understanding of the relationship between student performance and demographics, we can develop better tools for predicting performance, which may allow for more meaningful and timely interventions to improve overall student outcomes.

Main Research Question:

Can a student’s future performance in mathematics and Portuguese be predicted from a combination of past performance and parental education, and is there a differential effect for male and female students?
To answer this main research question, we must also look at a number of contributing descriptive, comparative and relational questions, which serve to explore specific aspects of the data.

Descriptive questions:

These questions ask what type of variation occurs within the variables of interest.
  • What is the average grade of students?
  • How many students have failed an exam?
  • What is the frequency distribution of Mother’s education?
  • What is the frequency distribution of Father’s education?
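
The descriptive questions above could be approached with a few lines of base R. The following is a minimal sketch on a small made-up data frame, not the real sample. The column names `G3.m` (final Math grade), `failures.m` (number of past class failures) and `Medu` (mother's education level) are assumptions based on the dataset's naming pattern, and "failed an exam" is interpreted here as having at least one recorded past failure.

```r
# Toy data standing in for the real sample; every value here is made up.
students <- data.frame(
  G3.m       = c(10, 12, 8, 15, 0, 14),  # final Math grade (assumed column name)
  failures.m = c(0, 0, 1, 0, 2, 0),      # past class failures (assumed column name)
  Medu       = c(2, 4, 1, 4, 2, 3)       # mother's education level (assumed column name)
)

# What is the average grade of students?
mean_grade <- mean(students$G3.m)

# How many students have at least one recorded failure?
n_failed <- sum(students$failures.m > 0)

# What is the frequency distribution of mother's education?
medu_freq <- table(students$Medu)                   # counts per level
medu_pct  <- round(100 * prop.table(medu_freq), 2)  # percentages per level

print(mean_grade)
print(n_failed)
print(medu_freq)
print(medu_pct)
```

The same pattern applies to father's education by swapping in the corresponding column.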

Comparative questions:

These questions ask what type of variation occurs within the variables of interest for different populations, or for segments within the same population.
  • What is the difference in outcomes for male and female students?
  • What is the most important factor in determining performance for different groupings?
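
The comparison of outcomes for male and female students could be sketched as a comparison of group means followed by a two-sample test. This is an illustrative sketch on made-up values; the column names `sex` and `G3.m` are assumptions, and Welch's t-test is only one possible choice, whose assumptions would need to be checked against the real data.

```r
# Toy data; values are invented purely for illustration.
students <- data.frame(
  sex  = factor(c("F", "F", "F", "F", "M", "M", "M", "M")),
  G3.m = c(11, 13, 9, 15, 10, 14, 8, 12)  # final Math grade (assumed column name)
)

# Mean final grade for each group
group_means <- tapply(students$G3.m, students$sex, mean)

# Welch two-sample t-test (the default in t.test, no equal-variance assumption)
welch <- t.test(G3.m ~ sex, data = students)

print(group_means)
print(welch$p.value)
```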

Relational questions:

Relationship questions ask what type of co-variation occurs between variables of interest, whether there is a causal relationship, and how strong that relationship is if one exists.

Question 1

RQ: Is past performance a good indicator of future performance?
Hypothesis: Students who perform well in the initial assessment of a subject will perform better overall.
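
One way Question 1 might be examined is with a correlation test between an early grade and the final grade. The sketch below uses made-up values and assumed column names `G1.m` (first-period Math grade) and `G3.m` (final Math grade). Pearson's correlation is shown for illustration only; a rank-based alternative may be more appropriate if the grades are not approximately normal, and a correlation on its own does not establish a causal relationship.

```r
# Toy data; values are invented purely for illustration.
students <- data.frame(
  G1.m = c(8, 10, 12, 14, 16, 9),  # first-period grade (assumed column name)
  G3.m = c(9, 10, 13, 15, 16, 8)   # final grade (assumed column name)
)

# Pearson correlation with a significance test
r_test <- cor.test(students$G1.m, students$G3.m, method = "pearson")

print(r_test$estimate)  # correlation coefficient r
print(r_test$p.value)   # p-value for H0: true correlation is zero
```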

Question 2

RQ: What is the relationship between alcohol consumption and final grade for Math students?
Hypothesis: Students with lower alcohol consumption will perform better at Math and Portuguese.

Question 3

RQ: What is the relationship between Portuguese and Math grades?
Hypothesis: Students who perform well in Portuguese will perform well in Math.
Hypothesis: Students who perform well in Math will perform well in Portuguese.

Question 4

RQ: What is the relationship between having at least one stay-at-home parent and student performance?
Hypothesis: Students whose father stays at home will perform better than those whose father does not.
Hypothesis: Students whose mother stays at home will perform better than those whose mother does not.

Question 5

RQ: What is the relationship between past performance and future performance for students, and is there a differential relationship between male and female students?
Hypothesis: Students who perform well during interim assessment in subjects will perform better overall.
Hypothesis: Male and female students will perform differently overall.
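
A differential relationship of the kind asked about in Question 5 is commonly explored with a linear model that includes an interaction term. The following is a sketch under stated assumptions rather than the analysis itself: the data are made up, the column names `G1.m`, `G3.m` and `sex` are assumed, and a linear model is only appropriate if its usual assumptions hold for the real sample.

```r
# Toy data; values are invented purely for illustration.
students <- data.frame(
  sex  = factor(c("F", "F", "F", "F", "M", "M", "M", "M")),
  G1.m = c(8, 10, 12, 14, 9, 11, 13, 15),  # interim grade (assumed column name)
  G3.m = c(9, 10, 13, 15, 8, 12, 13, 16)   # final grade (assumed column name)
)

# G1.m * sex expands to G1.m + sex + G1.m:sex.
# The interaction coefficient estimates how much the slope of G3.m on G1.m
# differs for male students relative to the reference level (female).
fit <- lm(G3.m ~ G1.m * sex, data = students)

print(coef(fit))
```

If the interaction coefficient were clearly different from zero in the real data, that would be consistent with past performance predicting final performance differently for male and female students.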

Question 6

RQ: What is the relationship between extra-curricular activity and student performance for male and female students?
Hypothesis: There are differences in respondents’ engagement in extra-curricular activities between measurements.

Potential Issues and Shortcomings

Representativeness:

The sample is not sufficient to be considered representative of the general population of students attending Portuguese secondary schools. To make generalised statements about the Portuguese secondary school student population, two necessary (though not sufficient) requirements are that the sample be big enough and representative. “Big enough” means that whatever we are interested in investigating as part of our statistical analysis can be found in our sample if it is present in the population.

The sample only contains data from two schools in the same region, and only a subset of students within those schools. “Representative” means that the characteristics of our sample mirror those of the population in the same proportions. For example, if we are interested in rural versus urban behaviours or outcomes for people in Ireland, then the fractions of people from urban and rural areas in our sample need to be in proportion to those that prevail in the wider population.
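
Where population proportions are known, the "same proportions" requirement can be checked with a chi-squared goodness-of-fit test. The sketch below is purely illustrative: both the sample counts and the population shares are invented, and no such figures are given for the student sample described here.

```r
# Hypothetical sample counts and hypothetical population shares (both invented).
sample_counts     <- c(urban = 70, rural = 30)
population_shares <- c(urban = 0.5, rural = 0.5)

# Goodness-of-fit test: do the sample counts match the population shares?
gof <- chisq.test(x = sample_counts, p = population_shares)

print(gof$expected)   # counts expected if the sample mirrored the population
print(gof$statistic)  # chi-squared statistic
print(gof$p.value)    # a small p-value suggests the sample proportions differ
```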

Validity:

Are the variables actually measuring what we think they are? The purpose of the grades in the sample dataset is to assess student capability and understanding of the core subject, but we cannot measure this directly, so we must use an examination score as a proxy. Some students may be very knowledgeable but struggle with examinations, or may have had health issues on the day of the exam which impacted their grades. Other students may have very little knowledge but excel at taking and passing examinations through rote memorization.

Another form of measurement error to consider is content validity with regard to the survey completed by students. For example, students are asked about alcohol consumption on the weekends, but a student’s concept of what a high level of alcohol consumption means may differ from what the survey was designed to evaluate.

Confounding Variables:

The demographics survey asked 37 questions (some questions were later discarded) and was reviewed and tested before being rolled out fully. Nevertheless, there is the potential for variables which have not been accounted for to have an impact on student performance. By failing to account for these confounding variables, we may draw incorrect conclusions from our analysis.

Accuracy of measurements:

The student grades were recorded in paper files as opposed to in an IT system. This form of manual paper-based record can be prone to errors such as misfiling or misreporting information.