Univariate Statistics



An Introduction

Statistics can be defined as the science of collecting and analysing data. It can be split into two main categories:

  • Descriptive Statistics summarises and describes the sample. It helps you to understand the results as well as present them to others.
  • Inferential Statistics is the next stage and attempts to infer, deduce or reach conclusions about the entire population from the sample alone.

Types of Statistical Data

Statistical data can be split up into a number of different types:

Type | Description | Example
Categorical | value is non-numerical | nationality
Ranked | variable is categorical, where category is an ordered position | 100m finishing position
Discrete | variable can only take specific values in given range | number of medals won
Continuous | variable can take any value in given range | 100m finishing time

It is very important to distinguish between the different data types because:

  • Different data types contain different amounts of information
  • Different types of statistical analysis can be carried out on different data types
  • If you use the wrong analysis you could end up with incorrect results, or you could be using your data in a sub-optimal way

The Number of Variables Analysed

In statistics, the number of variables analysed per individual in your sample can also vary:

  • Univariate statistics: A single variable analysed
  • Bivariate statistics: Two variables per ‘individual’ analysed
  • Multivariate statistics: Three or more variables per ‘individual’ analysed

Univariate data visualisation

We’re going to look at three common types of visualisation:

  1. Dotplot
  2. Histogram
  3. Bar Chart

Using two data sets:

  • 'fortyStudentHeightData.mat': heights of forty students
  • 'fortyStudentGenderData.mat': gender of forty students

Note that the two data sets correspond element by element, i.e. the first element in the height data and the first element in the gender data refer to the same individual.

To load the student height data into Matlab you can either double click on 'fortyStudentHeightData.mat' or type:

» load('fortyStudentHeightData.mat');

You can look at the data by just typing the variable name:

» heights

1. Dotplot

There is no specific Matlab command to produce a dotplot, but one can be created using the plot, for and if commands, as shown in the function below:

function [] = dotplot( x )
% usage:
% dotplot( x )
% x - input list of numbers as a column vector
% note: no values are returned, the output is a dotplot figure

ypos = zeros( length(x), 1);
for i = 1:length(x)
  for j = i:length(x)
    if x(j) == x(i)
      ypos(j) = ypos(j) + 1;
    end
  end
end
plot( x, ypos, 'o' );
title('Dotplot');
ylim([0 max(ypos)+1]);

This function can then be used to create a dotplot:

» dotplot( heights );
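
To see what the stacking loop in dotplot computes, here is a minimal sketch on a toy vector. The values are made up for illustration and are not the course data, and the loop is an equivalent single-pass formulation rather than the function's own code: each point's height is the number of equal values seen so far, including itself.

```matlab
% Toy input (made-up values, not the forty-student data)
x = [150; 152; 152; 155; 152];

% Height of each dot = how many equal values occur up to and including it
ypos = zeros(length(x), 1);
for i = 1:length(x)
    ypos(i) = sum(x(1:i) == x(i));
end

disp([x, ypos]);   % second column is 1 1 2 1 3
```

Repeated values therefore stack vertically, which is what makes peaks visible in the plot.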

Things to look for in the distribution

  • does the distribution look symmetric?
  • are there any outliers?
  • are the dots evenly distributed, or are there peaks?

2. Histogram

With a bit more processing of the data, a histogram can be formed from the height data (as shown below). Histograms are typically used for continuous and discrete data types, and they show the underlying distribution more clearly than the dotplot:

The key difference between a histogram and a dotplot is that a “binning” process has occurred. There are no specific rules as to how many bins to use, but as a rough guide use number of bins = √(number of data values), and never fewer than six.
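
The rough guide for the number of bins can be sketched as follows. The sample here is randomly generated purely for illustration (its centre and spread are arbitrary assumptions, not the course data), and rounding the square root to the nearest integer is also an assumption, since the guide does not say how to round. The histogram function is available in recent Matlab releases; older releases provide hist instead.

```matlab
% Hypothetical sample of 40 values, generated only for illustration
rng(1);                            % fix the seed so the figure is repeatable
sample = 168 + 9*randn(40, 1);

% Rough guide: sqrt(number of data values), but never fewer than six bins
nBins = max(6, round(sqrt(numel(sample))));   % sqrt(40) is about 6.3, so 6

histogram(sample, nBins);
title('Histogram');
```

It is worth trying a few values of nBins around the guide value, since too few bins hide structure and too many produce a noisy plot.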

3. Bar Chart

Bar charts are most often used for categorical data. The command to draw a bar chart is bar. However, there is a small problem: the data (array gender) comes in the following format:

M F F F F F M M F F F F F F F F M F F M M F F F F F F F M F F F F F M M M M M M

But we need to know the number of Ms and Fs in the array. To do this we can use the find and length functions:

» malePos = find(gender=='M')
malePos = 1 7 8 17 20 21 29 35 36 37 38 39 40

find returns an array where each element is a position within gender where there is an 'M'. But we still don’t know how many people in the class are male. An easy way to find out is to calculate the length of the array malePos:

» length(malePos)
ans = 13

The entire bar chart script

% find the positions of each gender in the array,
% one array of positions for each gender
malePos = find(gender=='M');
femalePos = find(gender=='F');

% use the length command to count the number of male and female
% students and then use this as input to the bar plotting function
bar([length(malePos), length(femalePos)])

% lastly label the x axis
set( gca, 'XTickLabel',{'Male','Female'})

The set command allows changes to be made to specified objects in the figure. In this case gca is used to access the axes of the current figure, and the remaining arguments specify exactly how each axis is altered. doc can be used to find out more about these commands.
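
As an alternative to find and length, the counts can be taken directly from a logical comparison. This is a minimal sketch using a short made-up gender array (not the forty-student data); the variable names nMale and nFemale are chosen here for illustration.

```matlab
% Short made-up gender array for illustration
gender = 'MFFMF';

isMale  = (gender == 'M');     % logical array, true where the entry is 'M'
nMale   = sum(isMale);         % summing a logical array counts the true entries
nFemale = sum(gender == 'F');

bar([nMale, nFemale]);
set(gca, 'XTickLabel', {'Male', 'Female'});
```

Both approaches give the same counts; the logical version simply skips the intermediate array of positions.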

Numerical measures of central tendency

If you want to summarise your data with a single number, a measure of central tendency is usually the most appropriate. ‘Average’ is a collective term for these measures.

The measures are:

  • mean : $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}$
  • median : central value. Line up all the values in the data from smallest to largest; the median is the ((n + 1)/2)th value. When n is even this position falls between two values, and the median is their mean.
  • mode : most common value

Matlab has built-in functions for each of these measures:

load('fortyStudentHeightData.mat');
mean(heights)
% ans = 168.2500

median(heights)
% ans = 167.5000

mode(heights)
% ans = 157
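
To connect the definitions to the built-in functions, the following sketch computes the mean and median by hand on a small made-up vector (not the course data) and compares them with mean, median and mode. The variable names are illustrative.

```matlab
x = [3 7 7 2 9 7 4 2];          % made-up values, n = 8
n = numel(x);

% mean: sum of the values divided by the number of values
meanByHand = sum(x) / n;

% median: middle value of the sorted data; for even n, the mean of
% the two middle values
sorted = sort(x);
if mod(n, 2) == 1
    medianByHand = sorted((n + 1)/2);
else
    medianByHand = (sorted(n/2) + sorted(n/2 + 1)) / 2;
end

fprintf('mean   %g (built-in %g)\n', meanByHand, mean(x));
fprintf('median %g (built-in %g)\n', medianByHand, median(x));
fprintf('mode   %g\n', mode(x));
```

For this vector the mean is 5.125, the median is 5.5 and the mode is 7.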

Which measure to use?

It is not always possible to use every measure of central tendency with every data type:

  • categorical data: often mode is the only possibility
  • ranked data: median or mode possible
  • continuous or discrete: all can be used, mean and median are most common

Which Measure to Use for Continuous or Discrete Data?

Whether to use the mean or the median depends on whether the distribution is symmetric or skewed. It also depends on whether the data has outliers (data points very different from the others being analysed):

  • use the mean if your data distribution is symmetric with no significant outliers
  • use the median if your data distribution is skewed or has significant outliers
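
The effect of an outlier on the two measures can be illustrated with a small made-up example (the values below are not the course data). Adding a single extreme value moves the mean a long way but barely changes the median.

```matlab
x    = [160 162 165 167 170];   % made-up values, roughly symmetric
xOut = [x 250];                 % same values plus one outlier

fprintf('without outlier: mean %.1f, median %.1f\n', mean(x), median(x));
fprintf('with outlier:    mean %.1f, median %.1f\n', mean(xOut), median(xOut));
% without outlier: mean 164.8, median 165.0
% with outlier:    mean 179.0, median 166.0
```

This is why the median is the safer summary when outliers are present.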

A Measure of Skewness

Now that we have calculated the standard deviation s, we can also define a numerical value for the skewness of the distribution as follows:

$\text{skewness} = \frac{3(\bar{x} - \text{median})}{s}$

  • skewness > 1: ‘reasonable’ positive skew
  • skewness < –1: ‘reasonable’ negative skew
  • –1 < skewness < 1: roughly symmetric

In Matlab:

3*(mean(heights) - median(heights))/std(heights)
% ans = 0.2470
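
As a quick check of how this measure behaves, the sketch below wraps the formula in an anonymous function (the name medianSkew is hypothetical, chosen here for illustration) and applies it to two made-up vectors. Note that std uses the sample standard deviation (dividing by n - 1) by default.

```matlab
% Hypothetical helper implementing 3*(mean - median)/s
medianSkew = @(x) 3*(mean(x) - median(x)) / std(x);

symmetric = [1 2 3 4 5];      % made-up, mean equals median
skewed    = [1 2 3 4 100];    % made-up, one large value drags the mean up

sSym  = medianSkew(symmetric);   % 0
sSkew = medianSkew(skewed);      % about 1.31, a 'reasonable' positive skew

fprintf('symmetric: %.2f, skewed: %.2f\n', sSym, sSkew);
```

The second vector exceeds the threshold of 1 given above, so the median would be the better measure of central tendency for it.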

Visualising Mean and Standard Deviation

It is common to use error bars showing ± one standard deviation about the mean. It is important to state clearly what the error bars represent, as there are other common uses of error bars, such as the standard error of the mean, which we will cover in chapter 4.

Matlab Commands to Produce Error Bars

load('fortyStudentHeightData');
load('fortyStudentGenderData');

% form separate arrays containing male and female height data
% (logical indexing; find is not needed here)
maleheight = heights(gender=='M');
femaleheight = heights(gender=='F');

% calculate statistics, and place in arrays
means = [mean(maleheight), mean(femaleheight)]
stdevs = [std(maleheight), std(femaleheight)]

% plot data, and label axes
errorbar( means, stdevs, 'x')
set(gca,'XTick',1:2)
set(gca,'XTickLabel',{'Male','Female'}, 'FontSize', 18)
ylabel('Height in cm', 'FontSize', 18)

Matlab Commands to Produce a Box Plot

load('fortyStudentHeightData');
load('fortyStudentGenderData');

% boxplot is part of the Statistics and Machine Learning Toolbox
boxplot( heights, gender )

% now label axis
ylabel('Height in cm', 'FontSize', 18)

SUMMARY

Things you should know:

  • We have covered univariate descriptive statistical methods, including how to calculate statistics and display data using Matlab.
  • Data comes in a number of different types. The data type affects which statistical analysis is appropriate.
  • Measures of central tendency (mean, median and mode): the most “typical” value.
  • Measures of variation (standard deviation and IQR): data spread about the “typical” value.
  • If the distribution is symmetric with no significant outliers, use the mean and standard deviation.
  • If the distribution is skewed or has significant outliers, use the median and IQR.
  • Always visualise your data; don’t go straight for numerical methods.


Written by Tobias Whetton