Bivariate Statistics


Visualising Bivariate Statistics

The visualisation you can use depends on the types of data involved. Here are three common combinations:

Two categorical variables
Categorical and continuous
Two continuous

Two categorical variables

A common method is a joint contingency table. Example datasets:

Gender of 40 students in class.
Type of degree (BA or BSc) of 40 students in class

	BSc	BA
Male	11	2
Female	20	7

No command is available for this, so a function can be written, for example 'frequencyTable.m'. The content of the function is not important at the moment, but how to use it is:

frequencyTable.m

>> help frequencyTable
usage: [] = frequencyTable( x, y )
x,y input variables: must have the same length
output is frequency table of unique values within x and y

>> load('fortyStudentDegreeData')
>> load('fortyStudentGenderData')
>> frequencyTable( degree, gender )

This is often also called a joint frequency table, and the values are often presented as a percentage of the data set size.
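As a minimal sketch of what a function like frequencyTable might do internally (an assumption, since the content of frequencyTable.m is not shown here), the counts can be built with the base MATLAB functions unique and accumarray. The short gender and degree vectors below are made-up illustration data, not the forty-student datasets.

```matlab
% Made-up illustration data (NOT the forty-student datasets)
gender = {'Male'; 'Female'; 'Female'; 'Male'; 'Female'; 'Female'};
degree = {'BSc';  'BSc';    'BA';     'BA';   'BSc';    'BSc'};

% Map each category to an integer index (unique returns sorted names)
[genderNames, ~, gIdx] = unique( gender );   % Female, Male
[degreeNames, ~, dIdx] = unique( degree );   % BA, BSc

% Count how often each (gender, degree) pair occurs
counts = accumarray( [gIdx, dIdx], 1, [numel(genderNames), numel(degreeNames)] );

% Express the counts as a percentage of the data set size
percentages = 100 * counts / numel( gender );

disp( counts )
disp( percentages )
```

Rows of counts follow genderNames and columns follow degreeNames, so the row and column labels should be read from those two outputs rather than assumed.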

Categorical and Continuous Variables

You can also combine categorical and continuous variables using a box plot, as seen in chapter 1.

Two Continuous Variables

This section covers three commonly used methods:

scatter plot
2D histogram
line graph

Scatter Plot

Datasets: 40 students in class:

Distance travelled to university
Height

One variable goes on the x-axis and the other on the y-axis, with a dot plotted at the corresponding position for each individual in the class.

>> load('fortyStudentDistanceData.mat')
>> load('fortyStudentHeightData.mat')
>> scatter( heights, distance )
>> xlabel('Student height (cm)', 'fontsize', 18);
>> ylabel('Distance travelled to university (cm)', 'fontsize', 18);

2D Histograms

After binning, the data can be shown as a 2D histogram. The counts per bin are displayed as intensities, or as heights in a 3D plot.

load('fortyStudentDistanceData.mat')
load('fortyStudentHeightData.mat')
heightNdistance = [ heights, distance ];
% for intensity plot
histArray = hist3( heightNdistance );
colormap( gray )
imagesc( histArray );
% for 3D plot
hist3( heightNdistance )
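To make the binning step explicit, the sketch below counts points per bin with histcounts2 (part of base MATLAB in recent releases) instead of hist3. The data values and bin edges are made up for illustration and are not taken from the forty-student datasets.

```matlab
% Made-up illustration data (NOT the forty-student datasets)
heights  = [150; 155; 162; 168; 171; 175; 180; 188];   % cm
distance = [  1;   4;   2;   7;   3;   9;   6;   8];   % arbitrary units

% Bin edges chosen for illustration only
heightEdges   = [150 160 170 180 190];
distanceEdges = [0 5 10];

% counts(i,j) = number of students in height bin i and distance bin j
counts = histcounts2( heights, distance, heightEdges, distanceEdges );

% Show the counts as intensities (rows run down the y-axis)
imagesc( counts );
colormap( gray );
colorbar;
xlabel('Distance bin');
ylabel('Height bin');
```

Choosing the edges by hand, rather than relying on the default binning, makes it easier to see how bin width changes the appearance of the histogram.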

Line Graphs

Useful when each value of one variable has only a single corresponding value of the second variable. Commonly used when one of the variables represents time. Simply use the plot command, which was used in chapter 1:

load('interestRate.mat');
plot( Irate(:,1), Irate(:,2))
xlabel('Year', 'fontsize', 18 )
ylabel('Interest rate', 'fontsize', 18 )

Which Variable Should go on Which Axis?

Two main types of research design:

Experimental: the researcher alters the value of one variable (the independent variable) and then measures the value of a second variable (the dependent variable). In an experimental study, the independent variable should go on the horizontal x-axis.
Observational: the researcher has no control over either variable. In an observational study, there are no specific criteria.

Pearson’s Correlation Coefficient

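A numerical value for how linearly related two variables are. The correlation coefficient, r, for paired variables x_i and y_i, i = 1,…,n (where n is the sample size) is given, in its standard sample form, by:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

where \bar{x} and \bar{y} are the sample means of the x_i and y_i. The value of r always lies between –1 and +1: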

r = +1 : perfect positive correlation
r = 0 : no linear correlation
r = –1 : perfect negative correlation
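As a rough illustration, the sketch below computes Pearson's r from its standard definition (the sum of products of the deviations from the means, divided by the square root of the product of the summed squared deviations) and cross-checks it against MATLAB's corrcoef. The x and y values are made up for illustration.

```matlab
% Made-up illustration data
x = [1; 2; 3; 4; 5];
y = [2.1; 3.9; 6.2; 7.8; 10.1];

% Deviations from the sample means
dx = x - mean( x );
dy = y - mean( y );

% Pearson's r: sum of products of deviations, normalised
r = sum( dx .* dy ) / sqrt( sum( dx.^2 ) * sum( dy.^2 ) );

% Cross-check against corrcoef, which returns a 2x2 matrix
R = corrcoef( x, y );
fprintf('r from definition: %.4f, r from corrcoef: %.4f\n', r, R(1,2));

% A perfectly linear decreasing relationship gives r = -1
yNeg = -2 * x + 7;
Rneg = corrcoef( x, yNeg );
fprintf('r for a perfectly decreasing line: %.4f\n', Rneg(1,2));
```

corrcoef returns the full correlation matrix, so the coefficient between the two inputs is the off-diagonal element R(1,2).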

A strong correlation does not signify cause and effect. For example, there is a strong correlation between ice cream sales and incidents of drowning. Does ice cream consumption cause drowning? No: both are related to a third, much stronger factor, daily temperature.

The correlation coefficient can be greatly affected by a few outliers.
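The effect of an outlier can be seen in a small sketch like the one below, which uses made-up data: ten points lying exactly on a straight line, followed by the same points with one extreme value added.

```matlab
% Made-up illustration data: y is exactly linear in x
x = (1:10)';
y = 2 * x + 1;

R = corrcoef( x, y );
rClean = R(1,2);            % exactly linear, so r is 1 (up to rounding)

% Add one extreme point far from the line
xOut = [x; 11];
yOut = [y; -50];

R = corrcoef( xOut, yOut );
rOutlier = R(1,2);          % a single point is enough to change r markedly

fprintf('r without outlier: %.3f\n', rClean);
fprintf('r with one outlier: %.3f\n', rOutlier);
```

In this constructed case a single added point is enough to change even the sign of r, which is why a scatter plot is worth inspecting before the coefficient is interpreted.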


Written by Tobias Whetton