Statistics can be found everywhere. Grasp the important concepts, widening your horizon with probability and calculations. The current chapters are:
Statistics can be defined as the science of collecting and analysing data. It can be split into two main categories:
Statistical data can be split up into a number of different types:
Type | Description | Example |
---|---|---|
Categorical | value is non-numerical | nationality |
Ranked | variable is categorical, where category is an ordered position | 100m finishing position |
Discrete | variable can only take specific values in given range | number of medals won |
Continuous | variable can take any value in given range | 100m finishing time |
It is very important to have different data types as:
In statistics, the number of variables analysed per individual in your sample can also vary:
We’re going to look at three common types of visualisation: dotplots, histograms and bar charts.
Using two data sets:
Note that the two data sets correspond, i.e. the first element of the height data and the first element of the gender data refer to the same individual.
To load the student height data into MATLAB you can either double-click on 'fortyStudentHeightData.mat' or type:
>> load('fortyStudentHeightData.mat');
You can look at the data by just typing the variable name:
>> heights
There is no specific MATLAB command to produce a dotplot, but one can be created using the plot, for and if commands, as shown in the function below:
function [] = dotplot( x )
% usage:
%   [] = dotplot( x )
%   x - input list of numbers as column matrix
% note no returned values, output is dotplot figure

% ypos(j) becomes the number of times the value x(j) has occurred
% up to and including position j, so repeated values stack vertically
ypos = zeros( length(x), 1);
for i = 1:length(x)
    for j = i:length(x)
        if x(j) == x(i)
            ypos(j) = ypos(j) + 1;
        end
    end
end
plot( x, ypos, 'o' );
title('Dotplot');
ylim([0 max(ypos)+1]);
This function can then be used to create a dotplot:
>> dotplot( heights );
Things to look for in the distribution
With a bit more processing of the data, a histogram can be formed from the height data (as shown below). Histograms are typically used for continuous and discrete data types, and they show the underlying distribution more clearly than the dotplot:
The key difference between a histogram and a dotplot is that a “binning” process has occurred: the range of the data is divided into intervals (bins) and the number of values falling in each bin is counted. There are no specific rules as to how many bins to use, but as a rough guide use number of bins = √(number of data values), and not fewer than six.
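The rough guide for the number of bins can be sketched as follows. The data values are made up for illustration (they are not the course data set), and rounding the square root to the nearest whole number is an assumption, since the guide does not say how to round.

```matlab
% Hypothetical sample of 40 values (NOT the course data set)
x = [150 152 155 157 157 158 160 161 162 163 ...
     164 165 165 166 167 167 168 168 169 170 ...
     170 171 172 172 173 174 175 176 177 178 ...
     179 180 181 182 183 185 186 188 190 193]';

% Rough guide: number of bins = sqrt(number of data values),
% but not fewer than six (rounding is an assumption made here)
nBins = max(6, round(sqrt(numel(x))));

% Count the values falling in each of nBins equal-width bins
[counts, centres] = hist(x, nBins);

% Draw the histogram from the counts
bar(centres, counts, 1);
xlabel('Value');
ylabel('Frequency');
```

With 40 values the square root is about 6.3, so the guide gives six bins. Calling hist(x, nBins) without output arguments draws the histogram directly.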
Bar charts are most often used for categorical data. The command to draw a bar chart is bar. However, there is a small problem: the data (array gender) comes in the following format:
M F F F F F M M F F F F F F F F M F F M M F F F F F F F M F F F F F M M M M M M
But we need to know the number of Ms and Fs in the array. To do this we can use the find and length functions:
>> malePos = find(gender=='M')
malePos = 1 7 8 17 20 21 29 35 36 37 38 39 40
find returns an array where each element equals a position within gender where there is an 'M'. But we still don't know how many people in the class are male. An easy way to find out is to calculate the length of the array malePos:
>> length(malePos)
ans = 13
% find the positions of each gender within the array,
% giving one array of positions per gender
malePos = find(gender=='M')
femalePos = find(gender=='F')
% use the length command to count the number of male and female
% students and then use this as input to the bar plotting function
bar([length(malePos), length(femalePos)])
% lastly label the x axis
set( gca, 'XTickLabel',{'Male','Female'})
The set command allows changes to be made to specified objects in the figure. In this case gca is used to access the axes of the current figure, and the remaining arguments specify exactly how each axis is altered. doc can be used to find out more about these commands.
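As an alternative to find and length, the counts can be obtained by summing a logical array. This is a minimal sketch using a short made-up gender array rather than the course data; the variable names nMale and nFemale are introduced here for illustration only.

```matlab
% Hypothetical gender array for five students (NOT the course data)
gender = ['M'; 'F'; 'F'; 'M'; 'F'];

% gender == 'M' is a logical array with a 1 wherever there is an 'M',
% so summing it counts the matches directly
nMale   = sum(gender == 'M');
nFemale = sum(gender == 'F');

% Same bar chart as before, built from the two counts
bar([nMale, nFemale]);
set(gca, 'XTickLabel', {'Male', 'Female'});
```

Both approaches give the same counts; the find version is useful when the positions themselves are needed later.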
If you want to summarise your data with a single number, a measure of central tendency is usually the most appropriate. 'Average' is a collective term for these measures.
The measures are the mean, the median and the mode.
MATLAB has built-in functions for each of these measures:
load('fortyStudentHeightData.mat');
mean(heights)
% ans = 168.2500
median(heights)
% ans = 167.5000
mode(heights)
% ans = 157
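It is not always possible to use every measure of central tendency with every data type. For example, a mean cannot be calculated for categorical data such as nationality, whereas the mode can.
Whether to use the mean or the median depends on the distribution (is it symmetric or skewed?) and on whether the data has outliers (data points very different from the others being analysed). The median is less affected than the mean by skew and by outliers.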
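The effect of an outlier on the mean and the median can be seen in a small example. The values below are made up for illustration and are not the course data; the extreme value stands in for something like a data-entry error.

```matlab
% Hypothetical heights in cm (NOT the course data)
x = [160 162 165 167 170];
meanBefore   = mean(x);
medianBefore = median(x);

% Add a single outlier, e.g. a data-entry error
xOut = [x 250];
meanAfter   = mean(xOut);
medianAfter = median(xOut);

fprintf('mean:   %.1f -> %.1f\n', meanBefore, meanAfter);
fprintf('median: %.1f -> %.1f\n', medianBefore, medianAfter);
```

One extreme value moves the mean by more than 14 cm, while the median moves by only 1 cm.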
Now that we have calculated the standard deviation s, we can also define a numerical value for the skewness of the distribution as follows:
$\text{skewness} = \frac{3(\bar{x} - \text{median})}{s}$
3*(mean(heights) - median(heights))/std(heights)
% ans = 0.2470
It is common to use error bars of ± one standard deviation. It is important to state clearly what the error bars represent, as there are other common uses of error bars, such as the standard error of the mean, which we will cover in chapter 4.
load('fortyStudentHeightData');
load('fortyStudentGenderData');
% form separate arrays containing male and female height data
maleheight = heights(find(gender=='M'));
femaleheight = heights(find(gender=='F'));
% calculate statistics, and place in arrays
means = [mean(maleheight), mean(femaleheight)]
stdevs = [std(maleheight), std(femaleheight)]
% plot data, and label axes
errorbar( means, stdevs, 'x')
set(gca,'XTick',1:2)
set(gca,'XTickLabel',{'Male','Female'}, 'Fontsize', 18)
ylabel('Height in cm', 'Fontsize', 18)
load('fortyStudentHeightData');
load('fortyStudentGenderData');
boxplot( heights, gender )
% now label axis
ylabel('Height in cm', 'Fontsize', 18)
Things you should know:
The visualisation you can use depends on the types of data involved. Here are three common combinations:
A common method is the joint contingency table. Datasets:
Gender | BSc | BA |
---|---|---|
Male | 11 | 2 |
Female | 20 | 7 |
No command is available, so a function can be written, for example 'frequencyTable.m'. The content of the function is not important at the moment, but how to use it is:
>> help frequencyTable
usage: [] = frequencyTable( x, y )
x,y input variables: must have the same length
output is frequency table of unique values within x and y
>> load('fortyStudentDegreeData')
>> load('fortyStudentGenderData')
>> frequencyTable( degree, gender )
This is often also called a joint frequency table, and the values are often presented as a percentage of the data set size.
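For readers curious how such a table can be built, here is one possible sketch. It is not the content of frequencyTable.m, which is not shown in these notes; the data are made up, and the variable names (gLabels, dLabels, counts, percent) are introduced here for illustration only.

```matlab
% Hypothetical data for six students (NOT the course data sets)
gender = ['M'; 'F'; 'F'; 'M'; 'F'; 'F'];
degree = {'BSc'; 'BA'; 'BSc'; 'BSc'; 'BSc'; 'BA'};

% unique returns the sorted labels and, for every student,
% the index of that student's label within the sorted list
[gLabels, ~, gIdx] = unique(cellstr(gender));
[dLabels, ~, dIdx] = unique(degree);

% Count how many students fall in each (gender, degree) cell;
% rows follow gLabels and columns follow dLabels
counts = accumarray([gIdx(:), dIdx(:)], 1, ...
                    [numel(gLabels), numel(dLabels)]);

% The same table expressed as a percentage of the data set size
percent = 100 * counts / numel(gIdx);

disp(counts)
disp(percent)
```

Because unique sorts the labels, the rows here are ordered F, M and the columns BA, BSc, which differs from the ordering used in the table above.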
You can also combine categorical and continuous variables using a box plot, as seen in chapter 1.
This covers three commonly used methods:
Datasets: 40 students in class:
In a scatter plot, one variable is placed on the x-axis and the other on the y-axis, with a dot marking the position of each individual in the class.
>> load('fortyStudentDistanceData.mat')
>> load('fortyStudentHeightData.mat')
>> scatter( heights, distance )
>> xlabel('Student height (cm)', 'fontsize', 18);
>> ylabel('Distance travelled to university (cm)', 'fontsize', 18);
After binning, the data can be shown as a 2D histogram. The counts per bin are displayed as intensities, or as heights in a 3D plot.
load('fortyStudentDistanceData.mat')
load('fortyStudentHeightData.mat')
heightNdistance = [ heights, distance ];
% for intensity plot
histArray = hist3( heightNdistance );
colormap( gray )
imagesc( histArray );
% for 3D plot
hist3( heightNdistance )
A line plot is useful when each value of one variable has only a single corresponding value of the second variable. It is commonly used when one of the variables represents time. Simply use the plot command, which was used in chapter 1:
load('interestRate.mat');
plot( Irate(:,1), Irate(:,2))
xlabel('Year', 'fontsize', 18 )
ylabel('Interest rate', 'fontsize', 18 )
Two main types of research design:
The correlation coefficient is a numerical value for how linearly related two variables are. The correlation coefficient, r, for paired variables $x_i$ and $y_i$, $i = 1, \dots, n$ (where n is the sample size) is given by:
Equation to be added soon
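Pending the equation itself, one standard way of writing the sample (Pearson) correlation coefficient is shown here. It is offered as an assumption about the intended formula, not as the author's own statement of it, and uses $\bar{x}$ and $\bar{y}$ for the sample means of the two variables:

$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$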
where, for the paired variables $x_i$ and $y_i$, $i = 1, \dots, n$ (n is the sample size):
A strong correlation does not signify cause and effect. E.g. there is a strong correlation between ice cream sales and incidences of drowning. Does ice cream consumption cause drowning? No, both are related by a much stronger factor, daily temperature.
The correlation coefficient can be greatly affected by a few outliers.
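The sensitivity to outliers can be illustrated with a small sketch using MATLAB's corrcoef function. The paired values are made up for illustration and are not taken from the course data sets.

```matlab
% Hypothetical paired data with only a weak linear relationship
x = [1 2 3 4 5 6]';
y = [2 1 3 2 1 3]';

% corrcoef returns a 2x2 matrix; the off-diagonal entry is r
R = corrcoef(x, y);
rBefore = R(1, 2);

% Add one extreme point far away from the rest of the data
xOut = [x; 30];
yOut = [y; 30];
ROut = corrcoef(xOut, yOut);
rAfter = ROut(1, 2);

fprintf('r without outlier: %.2f\n', rBefore);
fprintf('r with outlier:    %.2f\n', rAfter);
```

A single extreme point is enough to turn a weak correlation into one close to 1, which is why it is worth looking at a scatter plot before relying on r.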