**Statistical Methods**

- Probability theory
- Sampling
- Differences between groups:
  - assessing the size of differences
  - statistical significance of differences

- Associations
- Regression

**Probability theory and Population Distribution Curves:**

**Normal distribution**:

- 1. continuous symmetrical distribution;
- 2. mean lies at highest point of curve (as do the mode & median if it is not skewed);
- 3. shape of curve approx. bell-shaped;
- 4. y = [1/sqr.root(2 pi)] exp(-x^2/2) (the standard normal curve, with mean 0 & variance 1);
- 5. it cannot be integrated analytically (as is the case for most distribution curves in statistics), thus there is no simple formula for the probability of a random variable lying between given limits. These probabilities (areas under the curve) are obtained from tables.

- the **mean** (average):
  - sum of all observations divided by the no. of observations.

- the **variance** (variability):
  - = sum of squares about the mean / degrees of freedom
  - = sum(x_i - mean)^2 / (n-1)
  - = (sum(x_i^2) - (sum(x_i))^2/n) / (n-1)
  - as this is not in the units of the variable being observed, nor of the mean, the std.dev. is used.

- the **standard deviation**:
  - describes the dispersion or spread of data in the same units as the variable and the mean.
  - = sqr.root(variance)
  - If the data follow a "normal distribution", then 95% of the group's measurements are found within 2 st.dev. above and below the mean.
  - However, even if the distribution is not normal, at least 75% of all values will lie within 2 s.d. of the mean, & at least 88% will lie within 3 s.d. of the mean.
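The mean, variance and standard deviation above can be sketched in a few lines; the measurements here are hypothetical, chosen only to illustrate the "95% within 2 s.d." rule:

```python
import statistics

# hypothetical sample measurements (eg. a continuous parameter such as height)
data = [4.8, 5.1, 5.4, 5.6, 5.8, 6.0, 6.1, 6.3, 6.5, 7.0]

mean = statistics.mean(data)   # sum of observations / no. of observations
sd = statistics.stdev(data)    # sqr.root(sum of squares about mean / (n - 1))

# for roughly normal data, ~95% of values lie within mean +/- 2 s.d.
lower, upper = mean - 2 * sd, mean + 2 * sd
within = sum(lower <= x <= upper for x in data) / len(data)
```

For this small sample every value falls inside the 2 s.d. band, consistent with the rule of thumb.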

Percentage points of the normal distribution:

| One-sided P (%) | x | Two-sided P (%) | x |
|---|---|---|---|
| 50 | 0.00 | | |
| 25 | 0.67 | 50 | 0.67 |
| 10 | 1.28 | | |
| 5 | 1.64 | 10 | 1.64 |
| 2.5 | 1.96 | 5 | 1.96 |
| 1 | 2.33 | | |
| 0.5 | 2.58 | 1 | 2.58 |
| 0.1 | 3.09 | | |
| 0.05 | 3.29 | 0.1 | 3.29 |

**Binomial Distribution:**

- is the distribution followed by the number of successes in n independent trials when the probability of one trial being a success is p
- it is a discrete distribution
- Prob(r successes) = n! p^r (1-p)^(n-r) / (r!(n-r)!)
- eg. if the probability of surviving a disease is 0.90, & we have a sample of 20 patients, the number who survive follows a binomial distribution with p = 0.9, n = 20. The probability that all survive (ie. r = 20) is 0.9^20 = 0.12. (NB. 0! = 1)
- mean = np
- variance = np(1-p)
- for practical purposes, generally this (and the Poisson distribution) approximate Normal distribution if BOTH np & n(1-p) are greater than 5.
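The binomial formula and the worked survival example above translate directly to code; `binomial_prob` is just the formula, with no values beyond those in the text:

```python
from math import comb

def binomial_prob(r, n, p):
    """P(r successes in n independent trials, each with success probability p)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# the example from the text: P(all 20 patients survive) with p = 0.9
p_all = binomial_prob(20, 20, 0.9)   # = 0.9**20, approx 0.12

mean = 20 * 0.9          # np
variance = 20 * 0.9 * (1 - 0.9)   # np(1-p)
```

Note np = 18 but n(1-p) = 2 here, so the normal approximation mentioned above would not be appropriate for this example.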

**Poisson Distribution:**

- like the binomial, it is a discrete distribution
- for events that happen randomly & independently in time with a constant rate, then the number of events which happens in a fixed time interval follows the Poisson distribution
- mean = m, the rate at which events happen
- variance = m also
- Prob(r events occurring in unit time with rate m) = e^(-m) m^r / r!, where e = 2.718...
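The Poisson formula above in code; the rate of 4 events per unit time is a hypothetical figure for illustration:

```python
from math import exp, factorial

def poisson_prob(r, m):
    """P(r events in unit time when events occur randomly at mean rate m)."""
    return exp(-m) * m**r / factorial(r)

# hypothetical: events at a mean rate of 4 per unit time; P(exactly 2 events)
p2 = poisson_prob(2, 4.0)

# sanity check: probabilities over all r sum to 1
total = sum(poisson_prob(r, 4.0) for r in range(50))
```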

**Chi-squared Distribution:**

- chi-squared = sum(U_i^2), where
  - each U_i is a Standard Normal variable with mean of 0 and variance of 1.

- sqr.root(chi-squared):
  - mean = sqr.root(n - 1/2), where n is the degrees of freedom
  - variance = ~1/2

**Student's t Distribution:**

When one samples a population according to a parameter (eg. height), one needs to determine:

- the **mode**:
  - the most frequently occurring value

- the **median**:
  - the central value of the distribution
  - ie. the value below which 50% of the population lies = 50th quantile

- the **mean** (average) of the sample
- the **variance** (variability) of the sample
- the **standard deviation** of the sample
- the **error range of the sample's mean in estimating the population's mean**:
  - ie. if the sample's mean is 5.8, how does one estimate the range of values within which the population's mean will fall.
  - this is the **standard error** of the estimate of the population mean = standard dev. of the sample mean
  - the population mean will have:
    - 95% probability of lying in the range sample mean +/- 1.96 x std.error
      - this range is called the **95% confidence interval**
    - 99% probability of lying in the range sample mean +/- 2.58 x std.error
      - this range is called the **99% confidence interval**
- the standard error can be determined as follows:
  - **for a normally distributed population** (eg. estimating the mean value of a parameter for the whole population):
    - std.error = population std.dev. / sq.root(sample size)
    - mostly we do not know the population std.dev., but its sample estimate can be used.
  - **for a proportion of a population** (eg. estimating the proportion of the population with a condition):
    - this is a binomial distribution
    - std.error = sqr.root{p(1-p)/n}, where p = the proportion of individuals in the population with the condition & n = sample size
    - we can estimate p by r/n, where r = the number of individuals in the sample with the condition, but this normal approximation holds ONLY if both np & n(1-p) are greater than 5.
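Both standard errors above, with the text's sample mean of 5.8 reused; the sd of 1.2, n of 100 and the 30/100 proportion are hypothetical values for illustration:

```python
from math import sqrt

def se_mean(sd, n):
    """Standard error of a sample mean: sd / sqr.root(n)."""
    return sd / sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion (binomial): sqr.root(p(1-p)/n)."""
    return sqrt(p * (1 - p) / n)

# hypothetical sample: mean 5.8, sd 1.2, n = 100
se = se_mean(1.2, 100)
ci95 = (5.8 - 1.96 * se, 5.8 + 1.96 * se)   # 95% confidence interval

# hypothetical: 30 of 100 subjects have the condition (np & n(1-p) both > 5)
se_p = se_proportion(30 / 100, 100)
```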

**Assessing size of differences:**

**Mean** (average)

**Standard deviation**

**What size sample populations do you need to
get a significant difference?**

**Power of a test:**
- the probability that a test will produce a significant difference at a given significance level
- a good test has a power approaching 1.
- power = 1 - P(x), where:
  - P(x) is the standard normal probability of x (the area under the normal curve to the left of x), and,
  - x = 1.96 - (µ_1 - µ_2)/se, where
    - 1.96 is the critical value for a significance level of p = 0.05 (the central 95% of the normal curve lies within +/- 1.96); for p = 0.01, the figure to use is 2.58
    - µ_1 & µ_2 are the population means, &
    - se is the standard error of the difference of means
      - = sqr.root{(s_1^2/n_1) + (s_2^2/n_2)}, where s = std.dev. of sample of size n
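The power calculation above, using `math.erf` for the normal curve area; the group means, sds and sizes are hypothetical, chosen to match the worked sample-size example that follows (20-unit difference, sd 40, 84 per group):

```python
from math import erf, sqrt

def norm_cdf(x):
    """Area under the standard normal curve to the left of x."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power(mu1, mu2, s1, n1, s2, n2, z=1.96):
    """Power of a two-sample comparison at significance p = 0.05 (z = 1.96)."""
    se = sqrt(s1**2 / n1 + s2**2 / n2)   # std.error of difference of means
    x = z - (mu1 - mu2) / se
    return 1 - norm_cdf(x)

# hypothetical: detect a 20-unit difference, sd 40 in each group, 84 per group
pw = power(120, 100, 40, 84, 40, 84)   # approx 0.90
```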

**Sample size for a comparison:**
- just use the above equation for the power and solve for n
- eg. if we wish to have a probability of 90% of finding a difference (ie. power = 0.90) at a significance level of p = 0.05, and a pilot study indicated that the std.dev. of the population is 40 units, and our hypothesis is that the difference between the 2 groups will be at least half a std.dev. (ie. 20 units), then we have:
  - P(x) = 0.10 &, from the normal distribution table, x = -1.28, thus,
  - -1.28 = 1.96 - 20 / sq.root{(40^2/n) + (40^2/n)}
  - => n = 84 patients needed in each group

**NB. using a significance level of p=0.05 means that 1 in 20 independent studies of the same question will show a spuriously "significant" result when no true difference exists!!**
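Rearranging the power equation above for n gives a direct formula: n = 2 s^2 (z_alpha + z_power)^2 / d^2 per group. A sketch reproducing the worked example:

```python
from math import ceil

def sample_size(diff, sd, z_alpha=1.96, z_power=1.28):
    """n per group for a two-group comparison of means (equal sds & sizes).

    z_alpha: critical value for the significance level (1.96 for p = 0.05);
    z_power: normal deviate for the desired power (1.28 for 90%), ie. -x.
    """
    n = 2 * sd**2 * (z_alpha + z_power) ** 2 / diff**2
    return ceil(n)   # round up to whole patients

# the worked example: 20-unit difference, sd 40, power 90%, p = 0.05
n = sample_size(20, 40)   # 84 patients per group
```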

**Statistical significance of differences:**

**Considerations:**

**Comparisons**
- the no. of comparisons possible increases as the no. of groups increases:
  - ie. 2 groups = 1 comparison (pair-wise);
  - 3 groups = 3 comparisons;
  - 4 groups = 6 comparisons;
- Thus, the more groups, the greater the chance of finding a statistically significant difference by chance alone.

**Matching (Pairing)**
- are the individuals in the control and study groups paired to make the groups more uniform?

**One-tailed or two-tailed:**
- a one-tailed test is used if one is only interested in results on one side of the mean, whereas a two-tailed test is used if both sides of the mean are relevant.

**Data type:**

**interval scales (continuous)**
- the interval or distance between points on the scale has precise meaning, & a change of one unit at one scale point is the same as a change of one unit at another
- such scales are also ordinal
- eg. temperature, time

**ordinal scales**
- limited no. of categories with an inherent **ordering** of categories from lowest to highest.
- eg. much improved, mildly improved, same, mildly worse, much worse

**ordered nominal scales**
- grouping subjects into several ordered categories
- eg. group subjects by an ordinal scale value (eg. "improved" vs "worse", etc)

**nominal scales**
- limited no. of categories but no inherent ordering of the categories.
- eg. eye colour

**dichotomous scales**
- subjects are grouped into 2 categories (a special case of nominal scales)
- eg. died vs survived

Specific tests for statistical significance of differences:

Interval Data:

> 50 in each sample:

Normal Distribution for means:

Std error of the difference between 2 means:

- eg. mean PEFR for children without nocturnal cough vs with nocturnal cough
- = sqr.root{(s_1^2/n_1) + (s_2^2/n_2)}
- the difference of the means of the 2 groups then has a 95% confidence interval of (mean_1 - mean_2) +/- 1.96 x std.error
- if this confidence interval DOES NOT include zero, then there IS a significant difference between the 2 groups (at p = 0.05).
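A sketch of this confidence-interval comparison of two means; the PEFR means, sds and group sizes are hypothetical, invented only to show the mechanics:

```python
from math import sqrt

def diff_means_ci(mean1, s1, n1, mean2, s2, n2, z=1.96):
    """95% confidence interval for the difference between two sample means."""
    se = sqrt(s1**2 / n1 + s2**2 / n2)   # std.error of the difference
    d = mean1 - mean2
    return d - z * se, d + z * se

# hypothetical PEFR data: no cough 310 (sd 45, n 60) vs cough 290 (sd 50, n 55)
lo, hi = diff_means_ci(310, 45, 60, 290, 50, 55)
significant = not (lo <= 0 <= hi)   # CI excluding zero -> difference at p < 0.05
```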
Std error of the difference between 2 proportions:

- eg. comparing the proportions of 2 pop.ns with a condition

- eg. people with PH bronchiolitis vs no PH bronchiolitis & determine the proportion in each group who have asthma
- = sqr.root{(p_1(1-p_1)/n_1) + (p_2(1-p_2)/n_2)}
- where p = the proportion of individuals in each population & n = sample size
- we can estimate p by r/n, where r = the number of individuals in the sample with the condition, but this normal approximation holds ONLY if both np & n(1-p) are greater than 5.
- the difference of the proportions of the 2 groups then has a 95% confidence interval of (p_1 - p_2) +/- 1.96 x std.error
- if this confidence interval DOES NOT include zero, then there IS a difference between the 2 groups.
< 50 in each sample, with normal distribution:

t Distribution for means:

< 50 in each sample, non-Normal distribution:

Mann-Whitney U test

Nominal Data:

Chi-square

- the most commonly used test for nominal data;
- Can be used for one comparison or modified for more than one comparison;
- Uses a crosstabulation table and calculates the differences between the observed and expected values in each cell;
- Should NOT be used if expected numbers are small, thus:
  - cells with expected counts < 5 should make up < 20% of cells, and
  - the minimum expected count should be at least 1.
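The chi-square calculation on a 2x2 crosstabulation can be done by hand, as below; the table counts are hypothetical:

```python
def chi_square_2x2(a, b, c, d):
    """Chi-squared statistic for a 2x2 table [[a, b], [c, d]]:
    sum of (observed - expected)^2 / expected over the 4 cells,
    where expected = row total x column total / N."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# hypothetical table: 30/70 diseased among exposed vs 20/80 among unexposed
chi2 = chi_square_2x2(30, 70, 20, 80)
# compare with the critical value 3.84 (p = 0.05 at 1 d.f.)
```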
Fisher's exact test

- for use with small numbers, unmatched and only 1 comparison (2 groups).
McNemar's test

- a modification of the Chi-square for use with matched samples and large numbers.
Sign test

- used when numbers are too small for McNemar's.
Ordinal Data:

Nonparametric tests (1- & 2-way modifications of all tests) -

Mann-Whitney U or Median test:

- two groups, unmatched.
Wilcoxon matched pairs signed ranks test:

- two groups with matched samples;
Kruskal-Wallis 1-way analysis of variance:

- more than 2 groups, unmatched.
Friedman 2-way analysis of variance:

- more than 2 groups with matched samples.
Continuous Data:

T-test

- 2 groups, unmatched.
Matched t-test

- 2 groups with matched samples.
F test for analysis of variance:

- more than 2 groups, unmatched.
F test for analysis of var. with blocking or analysis of covariance:

- more than 2 groups with matched samples.
Approx. t-test:

- for sample >30, mean & s.d. available, then can rapidly assess the significance by seeing if the mean of one group is outside the 95% confidence limits of the other group
- 95% conf.limits = sample mean +/- 2x st.error mean
- st.error mean = st. dev. / sq.root (sample size)

**Aspects:**
- 1. What is the degree or strength of the association?
- 2. Is the association found statistically significant?
- 3. How much of the variation in the outcome in the study and control groups is explained by the association?

**Nominal data:**

**A. Prospective or experimental data:**

**1. Relative risk:**
- measures the strength of assoc.;
- Rel.Risk = (risk if factor present) / (risk if factor absent)

**2. Stat. significance of rel. risk:**

**3. Attributable risk:**
- measures how much risk is attributable to that factor -> a measure of the benefit of removing that factor.
- Attrib.risk = [(risk with factor) - (risk without factor)] / (risk without factor)

**B. Retrospective or cross-sectional data:**
- Because the researcher chooses a certain number of subjects with and without a disease, the numbers do not reflect natural incidence, and thus absolute and relative risks cannot be calculated, although approximations can be.

**1. Odds ratio:**
- an approximation of the relative risk.
- odds of the risk factor among those with the disease = (disease & factor)/(disease, no factor) = Y
- odds of the risk factor among those without the disease = (no disease & factor)/(no disease, no factor) = Z
- Odds ratio = Y/Z

**2. Approx. attrib. risk:**
- also needs a measure of the prevalence of the risk factor in the population;
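Relative risk (prospective data) and the odds ratio (retrospective data) side by side; the 2x2 counts are hypothetical:

```python
def relative_risk(a, b, c, d):
    """2x2 table: a = disease & factor, b = no disease & factor,
    c = disease & no factor, d = no disease & no factor (prospective data)."""
    risk_with = a / (a + b)       # risk if factor present
    risk_without = c / (c + d)    # risk if factor absent
    return risk_with / risk_without

def odds_ratio(a, b, c, d):
    """Y/Z approximation to relative risk, usable for retrospective data."""
    y = a / c   # odds of factor among diseased
    z = b / d   # odds of factor among non-diseased
    return y / z   # equivalently (a*d)/(b*c)

# hypothetical counts: 30 dis & factor, 70 no dis & factor,
#                      20 dis & no factor, 80 no dis & no factor
rr = relative_risk(30, 70, 20, 80)
or_ = odds_ratio(30, 70, 20, 80)
```

Note the odds ratio (about 1.71 here) overstates the relative risk (1.5) when the disease is common, which is why it is only an approximation.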

**C. Chi-square based measures of association:**
- Each of these measures attempts to modify chi-sq. to minimise the influence of sample size (**N**) and degrees of freedom, as well as to restrict the range of values of the measure to between 0 and 1.
- Without such adjustments, comparisons of chi-sq. values from tables of varying dimensions and sample sizes are meaningless.
- However, these measures are hard to interpret, although, when properly standardized, they can be used to compare the strength of association in several tables.

**Phi coefficient:**
- For 2x2 tables; Phi need not lie between 0 and 1, as chi-sq. may be greater than N. To obtain a measure between 0 and 1, Pearson suggested the use of **C**.
- Phi = sqr.root{chi-sq. / N}

**Contingency coefficient (C):**
- Value is always between 0 and 1, but cannot generally attain the value of 1. The max. value possible depends on the no. of rows and columns.
- eg. 4x4 table, max. value C = 0.87
- C = sqr.root{chi-sq. / (chi-sq. + N)}

**Cramer's V:**
- This can attain a maximum value of 1.
- V = sqr.root{chi-sq. / (N(k-1))}, where k is the smaller of the no. of rows & columns.
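The three chi-square based measures above in code; the chi-sq. value and N are hypothetical:

```python
from math import sqrt

def phi(chi2, n):
    """Phi coefficient: sqr.root(chi-sq./N); for 2x2 tables."""
    return sqrt(chi2 / n)

def contingency_c(chi2, n):
    """Pearson's contingency coefficient C: always between 0 and 1."""
    return sqrt(chi2 / (chi2 + n))

def cramers_v(chi2, n, rows, cols):
    """Cramer's V: can attain 1; k = smaller of rows & columns."""
    k = min(rows, cols)
    return sqrt(chi2 / (n * (k - 1)))

# hypothetical: chi-sq. = 2.67 from a 2x2 table with N = 200
p = phi(2.67, 200)
c = contingency_c(2.67, 200)
v = cramers_v(2.67, 200, 2, 2)   # for a 2x2 table, V equals Phi
```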
**D. Measures based on proportional reduction of error (PRE):**
- The meaning of the association is clearer than with chi-sq. based measures.
- These measures are all essentially ratios of a measure of error in predicting the values of one variable based on knowledge of that variable alone, and the same measure of error applied to predictions based on knowledge of an additional variable.

**Goodman & Kruskal Lambda:**
- Lambda x 100 = % reduction in error when that variable is used to predict the outcome of the dependent variable.
- Lambda always lies between 0 and 1.

**Ordinal & continuous data:**
- The techniques are the same for all 4 types of study design: prospective, retrospective, cross-sectional & experimental.
- The fundamental technique with these types of data is known as **correlation.**

**Pearson's Correlation Techniques:**
- Used if both measurements being correlated consist of continuous data, & a linear relationship exists between the variables being correlated.

**1. Pearson's Correl. Coeff. (r):**
- the degree of association;
- range of r is -1 to +1; 0 = no predictable change.

**2. Stat. signif. of r:**

**3. Pearson's coeff. of determination (r^2):**
- measures the extent of association;

**Nonparametric correlation:**
- Used if the data from any of the variables is ordinal, or when a linear relationship is not suspected.
- eg. consider 2 ordered variables a, b, and 2 cases 1, 2, taken from the sample, with values a1, a2, b1, b2:
  - if a1 & b1 are **both** greater (or both smaller) than a2 & b2 respectively, then the pair of cases is called **concordant.**
  - if a1 > a2 and b1 < b2, then the pair of cases is called **discordant.**
  - if a1 = a2, and b1 is not = b2, then it is **tied** on **a** but **not tied** on **b.**
  - if a1 = a2, and b1 = b2, then they are **tied** on **both variables.**
- if there is a preponderance of **concordant** pairs, then the association is said to be **positive.**
- if there is a preponderance of **discordant** pairs, then the association is **negative.**
- if the no. of concordant pairs = the no. of discordant pairs, then it is said there is no association.

**A. Spearman's rho:**

**B. Kendall's tau a, b & c:**
- range -1 to +1 for tau-c;

**C. G & K's gamma:**
- = [Pr(conc.) - Pr(disc.)] / [Pr(conc.) + Pr(disc.)], assuming no ties (tied pairs are ignored)

**Regression:**
- A series of techniques that are useful for studying the association between one dependent variable and many independent variables.
- These techniques are capable of measuring:
  - the strength of an association,
  - the statistical significance of the association, and,
  - the extent of the variation in the dependent variable that can be explained by the independent variables.

**Assumptions:**
- 1. For any fixed value of an independent variable X, the distribution of the dependent variable Y is **Normal**, with mean **u_y/x**. The subpopulations may have different means, but the same variance.
- 2. The dependent variable values are statistically **independent** of each other.
- 3. The mean values **u_y/x** all lie on a straight line, which is the **population regression line.**
**Single Independent Variable:**
- Y_i = b_0 + b_1 X_i + e_i, where b_0 = intercept, b_1 = slope,
- e_i = error or disturbance
  - = diff. b/n the observed Y_i and the subpopulation mean at point X_i
- b_0 and b_1 are unknown population parameters, and must be estimated from the sample as B_0 & B_1 using least-squares methods.
**Testing Hypotheses:**

**That there is no linear relationship b/n X & Y:**
- ie. that the slope of the population regression line = 0.
- t = B_1 / (std.error of B_1); t should fit Student's t distribution with N-2 d.f.

**That the intercept = 0:**
- t = B_0 / (std.error of B_0); t should fit Student's t distribution with N-2 d.f.
**95% Confidence interval of B_1:**
- 95% confidence means that, if repeated samples are drawn from a population under the same conditions, & 95% confidence intervals are calculated, 95% of the intervals will contain the unknown parameter b_1. Since the parameter value is unknown, it is not possible to determine whether or not a particular interval contains it.
**Goodness of Fit:**
- How well the model actually fits the data.

**The R coefficient:**
- **R** = Pearson correl. coeff. b/n predicted Y & observed Y
- **R^2** = R x R; sometimes called the **coefficient of determination**
- **Multiple R** = sq.root of R^2
- **Adjusted R^2** is a correction to more closely reflect goodness of fit.
- If all observations fall on the regression line, R^2 = 1;
- If there is **no linear** relationship b/n the dependent & independent variables, then R^2 = 0.
- R^2 = 0 does not necessarily mean that there is no association, but that there is no linear relationship.

**Analysis of Variance:**
- To test the hypothesis of no linear relationship b/n X & Y, several equivalent statistics can be computed.
- If there is a single independent variable:
  - the hypothesis R^2(pop.) = 0 is identical to the hypothesis that the population slope = 0;
  - if the probability (**signif. F**) associated with the **F** statistic is small, the hypothesis that R^2 = 0 is rejected.

**Searching for violations of assumptions:**
- 1. **Residuals:**
  - a residual is what is left after the model is fit: the diff. b/n an observed value & the value predicted by the model.
  - If the model is appropriate, the observed residuals **E_i**, which are estimates of the true errors **e_i**, should have similar characteristics, ie. normal distribution with mean of 0 and a constant variance.
- 2. **Linearity:**
- 3. **Equality of variance:**
- 4. **Independence of error:**
- 5. **Normality of residuals:**