Superstats

1
Univariate Statistics

An Introduction

Statistics can be defined as the science of collecting and analysing data. It can be split into two main categories:

Descriptive Statistics summarises and describes the sample. It helps you to understand the results as well as present them to others.
Inferential Statistics is the next stage and attempts to infer, deduce or reach conclusions about the entire population just from the sample.

Types of Statistical Data

Statistical data can be split up into a number of different types:

Type	Description	Example
Categorical	value is non-numerical	nationality
Ranked	variable is categorical, where category is an ordered position	100m finishing position
Discrete	variable can only take specific values in given range	number of medals won
Continuous	variable can take any value in given range	100m finishing time

It is very important to distinguish between data types because:

Different data types contain different amounts of information
Different types of statistical analysis can be carried out on different data types
If you use the wrong analysis you could end up with incorrect results, or you could be using your data in a sub-optimal way

The Number of Variables Analysed

In statistics, the number of variables analysed per individual in your sample can also vary:

Univariate statistics: A single variable analysed
Bivariate statistics: Two variables per ‘individual’ analysed
Multivariate statistics: Three or more variables per ‘individual’ analysed

Univariate data visualisation

We’re going to look at three common types of visualisation:

Dotplot
Histogram
Bar Chart

Using two data sets:

'fortyStudentHeightData.mat': heights of forty students
'fortyStudentGenderData.mat': gender of forty students

Note that the two data sets correspond, i.e. the first element in the height data and the first element in the gender data refer to the same individual.

To load the student height data into Matlab you can either double click on 'fortyStudentHeightData.mat' or:

>> load('fortyStudentHeightData.mat');

You can look at the data by just typing the variable name:

>> heights

1. Dotplot

There is not a specific Matlab command to produce a dotplot, but one can be created using the plot, for and if commands as shown in the function below:

function [] = dotplot( x )
% usage:
% [] = dotplot( x )
% x - input list of numbers as column matrix
% note no returned values, output is dotplot figure

ypos = zeros( length(x), 1);
for i = 1:length(x)
  for j = i:length(x)
    if x(j) == x(i)
      ypos(j) = ypos(j) + 1;
    end
  end
end
plot( x, ypos, 'o' );
title('Dotplot');
ylim([0 max(ypos)+1]);

This function can then be used to create a dotplot:

>> dotplot( heights );

Things to look for in the distribution

does the distribution look symmetric?
are there any outliers?
are the dots evenly distributed, or are there peaks?

2. Histogram

With a bit more processing of the data, a histogram can be formed from the height data (as shown below). Histograms are typically used for continuous and discrete data types, and they show the underlying distribution more clearly than the dotplot:

The key difference between a histogram and a dotplot is that a “binning” process has occurred. There are no specific rules as to how many bins to use, but as a rough guide the number of bins = √(number of data values), and not fewer than six.
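The rough guide for the number of bins can be sketched in Matlab as follows. This is only an illustration: the data is simulated rather than taken from the course files, the variable names exampleHeights and nBins are made up for the example, and rounding √n to the nearest whole number is an assumption, since the guide does not say how to round.

```matlab
% hypothetical stand-in data: 40 simulated heights in cm (not the course data)
rng(1);                                  % fix the random seed so the plot is repeatable
exampleHeights = 168 + 9*randn(40, 1);

% rough guide: sqrt(n) bins, but never fewer than six
% (rounding to the nearest integer is an assumption)
nBins = max(6, round(sqrt(numel(exampleHeights))));

% draw the histogram using nBins equal-width bins
hist(exampleHeights, nBins);
xlabel('Height in cm');
ylabel('Number of students');
```

For forty values √40 is about 6.3, so this sketch uses six bins.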

3. Bar Chart

Bar charts are most often used for categorical data. The command to draw a bar chart is bar. However, there is a small problem: the data (array gender) comes in the following format:

M F F F F F M M F F F F F F F F M F F M M F F F F F F F M F F F F F M M M M M M

But we need to know the number of Ms and Fs in the array. To do this we can use the find and length functions:

>> malePos = find(gender=='M')
malePos = 1 7 8 17 20 21 29 35 36 37 38 39 40

find returns an array where each element is a position within gender where there is an 'M'. But we still don’t know how many people in the class are male. An easy way to find out is to calculate the length of the array malePos:

>> length(malePos)
ans = 13

The entire bar chart script

% split gender into two separate arrays,
% one for each gender
malePos = find(gender=='M')
femalePos = find(gender=='F')

% use the length command to count the number of male and female
% students and then use this as input to the bar plotting function
bar([length(malePos), length(femalePos)])

% lastly label the x axis
set( gca, 'XTickLabel',{'Male','Female'})

The set command allows changes to be made to specified objects in the figure. In this case gca is used to access the axes of the current figure, and the remaining arguments specify exactly how each axis is altered. doc can be used to find out more about these commands.
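As an aside, the counting step can also be done without find. The sketch below is an alternative to the approach in these notes, not a replacement for it: it sums a logical array directly. The five-element exampleGender array is hypothetical and only mimics the format of gender.

```matlab
% hypothetical five-student sample, in the same character format as gender
exampleGender = 'MFFMF';

% comparing with 'M' gives a logical array; summing it counts the true entries
nMale = sum(exampleGender == 'M');
nFemale = sum(exampleGender == 'F');

% the find/length approach gives the same count
nMaleViaFind = length(find(exampleGender == 'M'));

bar([nMale, nFemale]);
set(gca, 'XTickLabel', {'Male', 'Female'});
```

Both approaches count two males and three females here. find is still the better choice when the positions themselves are needed later.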

Numerical measures of central tendency

If you want to summarise your data with a single number, a measure of central tendency is usually the most appropriate. 'Average' is a collective term for these measures.

The measures are:

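mean : $ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i} $
median : the central value. Line up all the data values from smallest to largest; the median is the (n + 1)/2th value. When n is even this position falls between two values, and the median is the mean of those two.
mode : the most common value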

Matlab has built in functions for each of these measures:

load('fortyStudentHeightData.mat');
mean(heights)
% ans = 168.2500

median(heights)
% ans = 167.5000

mode(heights)
% ans = 157
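To connect the built-in functions back to the definitions, here is a small sketch that computes the mean and median by hand and compares them with Matlab's answers. The seven-value sample x is hypothetical and was chosen so that the arithmetic is easy to check.

```matlab
% hypothetical sample of seven values, chosen so the answers can be checked by hand
x = [3 5 5 6 8 9 13];
n = numel(x);

% mean: sum of the values divided by how many there are, 49/7 = 7
meanByHand = sum(x) / n;

% median: n is odd here, so it is the (n+1)/2 = 4th sorted value, which is 6
sortedX = sort(x);
medianByHand = sortedX((n + 1)/2);

% the built-in functions agree with the hand calculations
builtIn = [mean(x), median(x), mode(x)]   % 7, 6 and 5 (5 appears twice)
```

The indexing trick for the median only works when n is odd; median handles the even case by averaging the two central values.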

Which measure to use?

Sometimes it is not possible to use all measures of central tendency with all data types:

categorical data: often mode is the only possibility
ranked data: median or mode possible
continuous or discrete: all can be used, mean and median are most common

Which Measure to Use for Continuous or Discrete Data?

Whether to use the mean or the median depends on whether the distribution is symmetric or skewed, and on whether the data has outliers (data points very different from the others being analysed):

use the mean if your data distribution is symmetric with no significant outliers
use the median if your data distribution is skewed or has significant outliers
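The effect of a single outlier on the two measures can be seen in a small sketch. The values below are hypothetical: five plausible heights, and the same list with the last value mistyped with an extra zero.

```matlab
% hypothetical data: five heights in cm
clean = [160 165 168 171 176];

% the same data with the last value mistyped (1760 instead of 176)
withOutlier = [160 165 168 171 1760];

meanClean = mean(clean);              % 168
medianClean = median(clean);          % 168
meanOutlier = mean(withOutlier);      % 484.8, dragged far from the typical value
medianOutlier = median(withOutlier);  % still 168
```

One bad value moves the mean by more than 300 cm, while the median does not change at all, which is why the median is preferred when outliers are present.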

A Measure of Skewness

Now that we have calculated the standard deviation s we can also define a numerical value for the skewness of the distribution as follows:

$skewness = \frac{3(\bar{x} - median)}{s}$

skewness > 1: ‘reasonable’ positive skew
skewness < –1: ‘reasonable’ negative skew
–1 < skewness < 1: roughly symmetric

In Matlab:

3*(mean(heights) - median(heights))/std(heights)
% ans = 0.2470
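The same calculation, together with the rule of thumb for interpreting it, can be sketched on a small made-up sample. The data in exampleData and the variable names skew and verdict are hypothetical; the name skew is used to avoid clashing with any existing function called skewness.

```matlab
% hypothetical sample with a long right-hand tail
exampleData = [1 1 1 1 2 5 10];

% skewness = 3 * (mean - median) / standard deviation
skew = 3*(mean(exampleData) - median(exampleData)) / std(exampleData);

% apply the rule of thumb from the notes
if skew > 1
    verdict = 'positive skew';
elseif skew < -1
    verdict = 'negative skew';
else
    verdict = 'roughly symmetric';
end
disp(verdict)
```

Here the mean (3) sits well above the median (1), so the value comes out above 1 and the sample is classed as positively skewed.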

Visualising Mean and Standard Deviation

It is common to use error bars of ± one standard deviation. It is important to state clearly what the error bars represent, as there are other common uses of error bars, such as the standard error of the mean, which we will cover in chapter 4.

Matlab Commands to Produce Error Bars

load('fortyStudentHeightData');
load('fortyStudentGenderData');

% form separate arrays containing male and female height data
maleheight = heights(find(gender=='M'));
femaleheight = heights(find(gender=='F'));

% calculate statistics, and place in arrays
means = [mean(maleheight), mean(femaleheight)]
stdevs = [std(maleheight), std(femaleheight)]

% plot data, and label axes
errorbar( means, stdevs, 'x')
set(gca,'XTick',1:2)
set(gca,'XTickLabel',{'Male','Female'}, 'Fontsize', 18)
ylabel('Height in cm', 'Fontsize', 18)

Matlab commands to derive box plot

load('fortyStudentHeightData');
load('fortyStudentGenderData');
boxplot( heights, gender )

% now label axis
ylabel('Height in cm', 'Fontsize', 18)

SUMMARY

Things you should know:

This chapter covered univariate descriptive statistical methods: how to calculate statistics and display data using Matlab.
Data comes in a number of different types. The data type affects which statistical analysis is appropriate.
Measures of central tendency (mean, median and mode): the most “typical” value.
Measures of variation (standard deviation and IQR): the spread of the data about the “typical” value.
If the distribution is symmetric with no significant outliers, use the mean and standard deviation.
If the distribution is skewed or has significant outliers, use the median and IQR.
Always visualise your data, don’t go straight for numerical methods.

Written by Tobias Whetton

2
Bivariate Statistics

Visualising Bivariate Statistics

The visualisation you can use depends on the types of data involved. Here are three common combinations:

Two categorical variables
Categorical and continuous
Two continuous

Two categorical variables

A common method is the joint contingency table. Datasets:

Gender of 40 students in class.
Type of degree (BA or BSc) of 40 students in class

	BSc	BA
Male	11	2
Female	20	7

No command is available, so a function can be written, for example 'frequencyTable.m'. The content of the function is not important at the moment, but how to use it is:

frequencyTable.m

>> help frequencyTable
usage: [] = frequencyTable( x, y )
x,y input variables: must have the same length
output is frequency table of unique values within x and y

>> load('fortyStudentDegreeData')
>> load('fortyStudentGenderData')
>> frequencyTable( degree, gender )

This is often also called a joint frequency table, and the values are often presented as a percentage of the data set size.
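Converting the counts to percentages is a one-line calculation. The sketch below types in the counts from the table above by hand rather than calling frequencyTable; the variable names counts, nStudents and percentages are made up for the example.

```matlab
% counts copied from the table above
% rows: Male, Female; columns: BSc, BA
counts = [11 2; 20 7];

% total number of students (40)
nStudents = sum(counts(:));

% each cell as a percentage of the whole data set
percentages = 100 * counts / nStudents
% Male:   27.5 (BSc)   5.0 (BA)
% Female: 50.0 (BSc)  17.5 (BA)

% row totals recover the number of male and female students (13 and 27)
genderTotals = sum(counts, 2);
```

The four percentages add up to 100, and the row totals match the male and female counts found in chapter 1.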

Categorical and Continuous Variables

You can also combine categorical and continuous variables using a box plot, as seen in chapter 1.

Two Continuous Variables

This covers three commonly used methods:

scatter plot
2D histogram
line graph

Scatter Plot

Datasets: 40 students in class:

Distance travelled to university
Height

One variable goes on the x-axis and the other on the y-axis, with a dot at the corresponding position for each individual in the class.

>> load('fortyStudentDistanceData.mat')
>> load('fortyStudentHeightData.mat')
>> scatter( heights, distance )
>> xlabel('Student height (cm)', 'fontsize', 18);
>> ylabel('Distance travelled to university (cm)', 'fontsize', 18);

2D Histograms

After binning, the data can be shown as a 2D histogram. The counts per bin are displayed as intensities, or as heights in a 3D plot.

load('fortyStudentDistanceData.mat')
load('fortyStudentHeightData.mat')
heightNdistance = [ heights, distance ];
% for intensity plot
histArray = hist3( heightNdistance );
colormap( gray )
imagesc( histArray );
% for 3D plot
hist3( heightNdistance )

Line Graphs

Line graphs are useful when each value of one variable corresponds to exactly one value of the other. They are commonly used when one of the variables represents time. Simply use the plot command, which was used in chapter 1:

load('interestRate.mat');
plot( Irate(:,1), Irate(:,2))
xlabel('Year', 'fontsize', 18 )
ylabel('Interest rate', 'fontsize', 18 )

Which Variable Should go on Which Axis?

Two main types of research design:

Experimental : the researcher alters the value of one variable (the independent variable) and then measures the value of a second variable (the dependent variable). In an experimental study, the independent variable should go on the horizontal x-axis.
Observational : the researcher has no control over either variable. In an observational study, there are no specific criteria.

Pearson’s Correlation Coefficient

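Pearson’s correlation coefficient is a numerical value for how linearly related two variables are. The correlation coefficient, r, for paired variables $x_i$ and $y_i$, i = 1,…,n (where n is the sample size) is given by:

$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$

where $\bar{x}$ and $\bar{y}$ are the sample means of the two variables. The value of r is interpreted as follows: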

r = +1 : perfect positive correlation
r = 0 : no linear correlation
r = –1 : perfect negative correlation
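As an illustration, the coefficient can be computed step by step from the standard definition (the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations) and compared with Matlab's built-in corrcoef. The paired data x and y below is hypothetical and is not taken from the course files.

```matlab
% hypothetical paired data, chosen so the sums are easy to check by hand
x = [1 2 3 4 5];
y = [2 4 5 4 5];

% deviations of each value from its own mean
dx = x - mean(x);
dy = y - mean(y);

% Pearson's r from the definition: 6 / sqrt(10 * 6), about 0.77
r = sum(dx .* dy) / sqrt(sum(dx.^2) * sum(dy.^2));

% corrcoef returns a 2-by-2 matrix; the off-diagonal entry is r
R = corrcoef(x, y);
rBuiltIn = R(1, 2);
```

The two values agree to within rounding error, so in practice corrcoef is the simpler route; the hand calculation is only there to show what it does.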

A strong correlation does not signify cause and effect. E.g. there is a strong correlation between ice cream sales and incidences of drowning. Does ice cream consumption cause drowning? No, both are related by a much stronger factor, daily temperature.

The correlation coefficient can be greatly affected by a few outliers.
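A small sketch shows how strong this effect can be. The data is hypothetical: five points with no linear trend at all, and then the same five points with one extreme pair added.

```matlab
% hypothetical data: five points with no linear relationship (r = 0)
x = [1 2 3 4 5];
y = [3 1 2 1 3];
Rclean = corrcoef(x, y);
rClean = Rclean(1, 2);

% the same points plus a single extreme pair at (20, 20)
xOut = [x 20];
yOut = [y 20];
Rout = corrcoef(xOut, yOut);
rOut = Rout(1, 2);          % about 0.97
```

A single point takes r from zero to nearly +1, which is one more reason to plot the data before relying on the number.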