
# a brief summary of statistics principles and tools

## Probability theory and Population Distribution Curves:

### Normal distribution

• 1. continuous symmetrical distribution;
• 2. mean lies at highest point of curve (as do the mode & median if it is not skewed);
• 3. shape of curve approx. bell-shaped;
• 4. $y = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)$ (the standard Normal curve, with mean 0 and variance 1)
• 5. it has no closed-form integral (as is the case for most distribution curves in statistics), thus there is no simple formula for the probability of a random variable lying between given limits. These probabilities (areas under the curve) are obtained from tables or computed numerically.
• the mean (average):
• sum of all observations divided by the no. of observations.
• the variance (variability):
• = sum of squares about the mean / degrees of freedom
• $= \sum (x_i - \bar{x})^2 / (n-1)$, where $\bar{x}$ is the mean
• $= \left(\sum x_i^2 - (\sum x_i)^2 / n\right) / (n-1)$
• as this is not in the units of the variable being observed, nor the mean, the std.dev. is used.
• the standard deviation
• describes the dispersion or spread of data in same units as the variable and the mean.
• = sqr.root(variance)
• If the data follows a “normal distribution”, then 95% of the group's measurements are found within 2 st.dev. above and below the mean.
• However, even if not normal distrib., at least 75% of all values will lie within 2 s.d. of mean, & at least 88% will lie within 3 s.d. of the mean.
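
The mean, variance and standard deviation formulas above can be checked with a few lines of Python. This is a minimal sketch: the observations are invented for illustration, and the function names are hypothetical rather than taken from any library.

```python
import math
import statistics

def variance_about_mean(xs):
    """Sum of squares about the mean, divided by the degrees of freedom (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_shortcut(xs):
    """Algebraically equivalent form that needs only sum(x) and sum(x^2)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

# Hypothetical observations, invented purely for illustration.
observations = [4.9, 5.4, 5.8, 6.1, 6.3]

mean = sum(observations) / len(observations)
variance = variance_about_mean(observations)
std_dev = math.sqrt(variance)  # same units as the observations and the mean

# Both hand-written forms should agree with the standard library.
print(mean, variance, variance_shortcut(observations), statistics.variance(observations))
# If the data were Normal, roughly 95% of values would fall in this range.
print(mean - 2 * std_dev, mean + 2 * std_dev)
```

The shortcut form is convenient by hand but can lose precision in floating point when the mean is large relative to the spread, so the first form is usually the safer choice in code.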

Percentage points of the normal distribution:

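| One-sided P (%) | x | Two-sided P (%) | x |
|---|---|---|---|
| 50 | 0.00 | | |
| 25 | 0.67 | 50 | 0.67 |
| 10 | 1.28 | | |
| 5 | 1.64 | 10 | 1.64 |
| 2.5 | 1.96 | 5 | 1.96 |
| 1 | 2.33 | | |
| 0.5 | 2.58 | 1 | 2.58 |
| 0.1 | 3.09 | | |
| 0.05 | 3.29 | 0.1 | 3.29 |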
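
Rather than reading percentage points from a table, they can be computed. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the two helper function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def one_sided_point(p_percent):
    """x such that the area to the right of x is p_percent % of the total."""
    return STANDARD_NORMAL.inv_cdf(1 - p_percent / 100)

def two_sided_point(p_percent):
    """x such that the area outside -x..+x is p_percent % of the total."""
    return STANDARD_NORMAL.inv_cdf(1 - p_percent / 200)

for p in (25, 10, 5, 2.5, 1, 0.5, 0.1, 0.05):
    print(p, round(one_sided_point(p), 2))

for p in (50, 10, 5, 1, 0.1):
    print(p, round(two_sided_point(p), 2))
```

A two-sided point for P% is simply the one-sided point for P/2 %, which is why 1.96 appears against 2.5% one-sided and 5% two-sided.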

### Binomial Distribution

• is the distribution followed by the number of successes in n independent trials when the probability of one trial being a success is p
• it is a discrete distribution
• Prob(r successes) = $\frac{n!}{r!\,(n-r)!}\, p^r (1-p)^{n-r}$
• eg. if probability of surviving a disease is 0.90, & we have a sample of 20 patients, the number who survive will be a binomial distribution with p=0.9, n = 20. The probability that all survive (ie. r=20), is 0.12. (NB. 0! = 1)
• mean = np
• variance = np(1-p)
• for practical purposes, generally this (and the Poisson distribution) approximate Normal distribution if BOTH np & n(1-p) are greater than 5.
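
The survival example above can be reproduced directly from the formula. A minimal sketch, assuming Python 3.8+ for `math.comb`; the function names are hypothetical.

```python
from math import comb

def binomial_probability(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

def normal_approximation_reasonable(n, p):
    """Rule of thumb from the text: both np and n(1-p) greater than 5."""
    return n * p > 5 and n * (1 - p) > 5

n, p = 20, 0.90  # 20 patients, each with survival probability 0.90

prob_all_survive = binomial_probability(20, n, p)
mean = n * p
variance = n * p * (1 - p)

print(round(prob_all_survive, 2))
print(mean, variance)
print(normal_approximation_reasonable(n, p))
```

For this example n(1-p) is only 2, so the Normal approximation would not be considered adequate under the rule of thumb.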

### Poisson Distribution:

• like the binomial, it is a discrete distribution
• for events that happen randomly & independently in time with a constant rate, then the number of events which happens in a fixed time interval follows the Poisson distribution
• mean = rate events happen
• variance = rate events happen
• Prob(r events occurring in unit time with rate m) = $e^{-m} m^r / r!$, where e = 2.718;
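
As a sketch of the Poisson formula, the snippet below evaluates the probabilities for an assumed rate of 3 events per unit time (an invented figure) and checks numerically that the mean and the variance both equal the rate. The function name is hypothetical.

```python
import math

def poisson_probability(r, m):
    """Probability of exactly r events in unit time when the rate is m."""
    return math.exp(-m) * m ** r / math.factorial(r)

rate = 3.0  # hypothetical rate, chosen only for illustration

# Probabilities beyond r = 59 are negligible for this rate.
probs = [poisson_probability(r, rate) for r in range(60)]

mean = sum(r * p for r, p in enumerate(probs))
variance = sum((r - mean) ** 2 * p for r, p in enumerate(probs))

print(probs[0], mean, variance)
```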

### Chi-squared Distribution:

• chi-squared = $\sum U_i^2$, where
• U is a Standard Normal variable with mean of 0 and variance of 1.
• sqr.root(chi-squared):
• mean = approx. $\sqrt{n - \tfrac{1}{2}}$, where n is degrees of freedom
• variance = ~1/2
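
The approximation for the square root of chi-squared can be checked by simulation. This is an illustrative sketch only: the seed, the 10 degrees of freedom and the number of replicates are arbitrary choices, and each chi-squared value is built as a sum of n squared standard Normal draws.

```python
import math
import random
import statistics

def chi_squared_sample(n, rng):
    """One chi-squared value: the sum of n squared standard Normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

rng = random.Random(12345)  # fixed seed so the run is repeatable
n = 10                      # degrees of freedom (arbitrary choice)

roots = [math.sqrt(chi_squared_sample(n, rng)) for _ in range(20000)]

mean_root = statistics.fmean(roots)
var_root = statistics.variance(roots)

print(mean_root, math.sqrt(n - 0.5))  # simulated vs. approximate mean
print(var_root)                       # should be close to 1/2
```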

## Sampling

When one samples a population according to a parameter (eg. height), one needs to determine:

• the mode:
• the most frequently occurring value
• the median:
• the central value of the distribution
• ie. value at which 50% of population lie below = 50th quantile
• the mean (average) of the sample
• the variance (variability) of the sample
• the standard deviation of the sample
• the error range of the sample's mean in estimating the population's mean:
• ie. if the sample's mean is 5.8, how does one estimate the range of values within which the population's mean will fall?
• this is the standard error of the estimate of the population mean = standard dev. of the sample mean
• the population mean will have:
• 95% probability of lying in range of sample mean +/- 1.96 x std.error
• this range is called the 95% confidence interval
• 99% probability of lying in range of sample mean +/- 2.58 x std.error
• this range is called the 99% confidence interval
• the standard error can be determined by:
• to determine for a normally distributed population:
• eg. what is the mean value of a parameter for the whole population
• std.error = population std.dev. / sq.root(sample size)
• mostly we do not know the population std.dev., but its estimate from the sample can be used.
• to determine for a proportion of a population:
• eg. what is the proportion of the population with a condition
• this is a binomial distribution
• std. error = sqr.root {p(1-p)/n},
• where p = the proportion of individuals in population & n = sample size
• we can estimate this by replacing p with r/n where r = number of individuals in the sample with the condition, but this Normal approximation is ONLY valid if both np & n(1-p) are greater than 5.
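
The standard error and confidence interval calculations above might be sketched as follows. The sample mean of 5.8 comes from the text; the standard deviation, sample sizes and counts are invented, and the function names are hypothetical.

```python
import math

def standard_error_mean(std_dev, n):
    """Standard error of a sample mean: std.dev. / sqrt(sample size)."""
    return std_dev / math.sqrt(n)

def standard_error_proportion(r, n):
    """Standard error of a proportion, estimating p by r/n."""
    p = r / n
    return math.sqrt(p * (1 - p) / n)

def confidence_interval(estimate, std_error, z=1.96):
    """z = 1.96 gives a 95% interval, z = 2.58 a 99% interval."""
    return estimate - z * std_error, estimate + z * std_error

# Mean: sample mean 5.8, with an assumed sample std.dev. of 0.5 and n = 100.
se_mean = standard_error_mean(0.5, 100)
ci95_mean = confidence_interval(5.8, se_mean)
ci99_mean = confidence_interval(5.8, se_mean, z=2.58)

# Proportion: assume 30 of 100 sampled individuals have the condition,
# so np = 30 and n(1-p) = 70 are both greater than 5.
se_prop = standard_error_proportion(30, 100)
ci95_prop = confidence_interval(30 / 100, se_prop)

print(ci95_mean, ci99_mean, ci95_prop)
```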

## Differences between groups:

Assessing size of differences:

• Mean (average)
• Standard deviation

What size sample populations do you need to get a significant difference?

• Power of a test:
• the probability that a test will produce a significant difference at a given significance level
• a good test has a power approaching 1.
• = 1 - P(x), where:
• P(x) is the normal distribution probability of x, and,
• x = 1.96 - (µ1 - µ2)/se, where
• 1.96 is the two-sided 5% point of the normal distribution
• ie. 95% of the area under the normal curve lies between -1.96 and +1.96, which gives a significance level of p = 0.05 (for p=0.01, the figure to use is 2.58)
• µ1 & µ2 are the population means &,
• se is the standard error of the difference of means
• = $\sqrt{s_1^2/n_1 + s_2^2/n_2}$, where
• s = std.dev. of sample of size n
• Sample size for a comparison:
• just use the above equation for the power to solve for n
• eg. if we wish to have a probability of 90% of finding a difference (ie. power = 0.90) with a significance level of p=0.05, and a pilot study indicated that the std.dev. of the population is 40 units and our hypothesis is that the difference will be at least half a std.dev. (ie. 20 units) between the 2 groups, then we have:
• P(x) = 0.10 & from normal distribution table, x = -1.28, thus,
• $-1.28 = 1.96 - 20 / \sqrt{40^2/n + 40^2/n}$
• ⇒ n = 83.98, ie. 84 patients needed in each group
• NB. using a significance level of p=0.05 means that, when there is no true difference, about 1 in 20 independent repeats of the same study will show a significant result by chance alone!!
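
The power equation and the sample size example can be sketched in code. The closed form for n comes from rearranging x = 1.96 - (µ1 - µ2)/se with equal group sizes and equal standard deviations. The function names are hypothetical, and exact Normal quantiles from `statistics.NormalDist` (Python 3.8+) are used in place of the rounded table values 1.96 and 1.28, so the unrounded n differs slightly from the hand calculation.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def se_difference(s1, n1, s2, n2):
    """Standard error of the difference between two means."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def power(difference, std_error, alpha=0.05):
    """Power = 1 - P(x), with x = z_alpha - difference / std_error."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return 1 - STANDARD_NORMAL.cdf(z_alpha - difference / std_error)

def sample_size_per_group(difference, std_dev, wanted_power=0.90, alpha=0.05):
    """Solve the power equation for n, assuming equal n and std.dev. in both groups."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STANDARD_NORMAL.inv_cdf(wanted_power)   # about 1.28 for power = 0.90
    return 2 * std_dev ** 2 * (z_alpha + z_power) ** 2 / difference ** 2

n_exact = sample_size_per_group(difference=20, std_dev=40)
n_each = round(n_exact)
achieved = power(20, se_difference(40, n_each, 40, n_each))

print(n_exact, n_each, achieved)
```

In practice many people round n up rather than to the nearest integer, so that the achieved power does not fall just below the target.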

## Statistical significance of differences:

• Considerations:
• Comparisons
• the no. of possible comparisons increases as the no. of groups increases.
• ie. 2 groups = 1 comparison (pair-wise);
• 3 groups = 3 comparisons;
• 4 groups = 6 comparisons;
• Thus, the more groups, the greater the chance of finding a statistically significant difference by chance alone.
• Matching (Pairing)
• are the individuals in the control and study groups paired to make the groups more uniform?
• One-tailed or two-tailed:
• a one-tailed test is used if one is only interested in results on one side of the mean, whereas, two-tailed test is used if both sides of the mean are relevant.
• Data type:
• interval scales (continuous)
• the interval or distance between points on the scale has precise meaning, & a change of one unit at one scale point is the same as a change of one unit at another
• such scales are also ordinal
• eg. temperature, time
• ordinal scales
• limited no. of categories with an inherent ordering of categories from lowest to highest.
• eg. much improved, mild improved, same, mild worse, much worse
• ordered nominal scales
• grouping subjects into several ordered categories
• eg. group subjects by an ordinal scale value (eg. “improved” vs “worse”, etc)
• nominal scales
• limited no. of categories but no inherent ordering of the categories.
• eg. eye colour
• dichotomous scales
• subjects are grouped into 2 categories (a special case of nominal scales)
• eg. died vs survived

### Specific tests for statistical significance of differences

• Interval Data:
• > 50 in each sample:
• Normal Distribution for means:
• Std error of the difference between 2 means:
• eg. mean PEFR for children without nocturnal cough vs with nocturnal cough
• = $\sqrt{s_1^2/n_1 + s_2^2/n_2}$
• the difference of the means of the two groups then has a 95% confidence interval of (mean1 - mean2) +/- 1.96 x std.error
• if this confidence interval DOES NOT include zero then there IS a statistically significant difference (at p = 0.05) between the 2 groups.
• Std error of the difference between 2 proportions:
• eg. comparing the proportions of 2 pop.ns with a condition
• eg. people with PH bronchiolitis vs no PH bronchiolitis & determine the proportion in each group who have asthma
• = $\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}$
• where p = the proportion of individuals in population & n = sample size; we can estimate this by replacing p with r/n where r = number of individuals in the sample with the condition, but this Normal approximation is ONLY valid if both np & n(1-p) are greater than 5.
• the difference of the proportions of the two groups then has a 95% confidence interval of (p1 - p2) +/- 1.96 x std.error
• if this confidence interval DOES NOT include zero then there IS a statistically significant difference (at p = 0.05) between the 2 groups.
• < 50 in each sample, with normal distribution:
• t Distribution for means:
• <50 in each sample, non-Normal distribution:
• Mann-Whitney U test
• Nominal Data:
• Chi-square
• the most commonly used test for nominal data;
• Can be used for one comparison or modified for more than one comparison;
• Uses a crosstabulation table and calculates the differences between the observed and expected values in each cell;
• Should NOT be used if expected numbers are small, thus:
• cells with an expected number < 5 should be < 20% of cells
• the min. expected number in any cell should be at least 1.
• Fisher's exact test
• for use with small numbers, unmatched and only 1 comparison (2 groups).
• McNemar's test
• a modification of the Chi-square for use with matched samples and large numbers.
• Sign test
• used when numbers are too small for McNemar's.
• Ordinal Data:
• Nonparametric tests (1 & 2 way modifications all tests) -
• Mann-Whitney U or Median test:
• two groups, unmatched.
• Wilcoxon matched pairs signed ranks test:
• two groups with matched samples;
• Kruskal-Wallis 1-way analysis of variance:
• more than 2 groups, unmatched.
• Friedman 2-way analysis of variance:
• more than 2 groups with matched samples.
• Continuous Data:
• T-test
• 2 groups, unmatched.
• Matched t-test
• 2 groups with matched samples.
• F test for analysis of variance:
• more than 2 groups, unmatched.
• F test for analysis of var. with blocking or analysis of covariance:
• more than 2 groups with matched samples.
• Approx. t-test:
• for sample >30, mean & s.d. available, then can rapidly assess the significance by seeing if the mean of one group is outside the 95% confidence limits of the other group
• 95% conf.limits = sample mean +/- 2x st.error mean
• st.error mean = st. dev. / sq.root (sample size)
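
Some of the large-sample tests above are simple enough to sketch by hand. The code below shows the 95% confidence interval for a difference of two means and of two proportions, plus the chi-square statistic for a crosstabulation with the small-expected-numbers check. All data are invented and the function names are hypothetical; turning the chi-square statistic into a p-value needs a chi-squared table or a statistics library, which is left out here.

```python
import math

def diff_means_ci(mean1, s1, n1, mean2, s2, n2, z=1.96):
    """95% CI (by default) for the difference between two sample means."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    diff = mean1 - mean2
    return diff - z * se, diff + z * se

def diff_proportions_ci(r1, n1, r2, n2, z=1.96):
    """95% CI (by default) for the difference between two sample proportions."""
    p1, p2 = r1 / n1, r2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

def chi_square(table):
    """Chi-square statistic and expected counts for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return statistic, expected

def expected_counts_ok(expected):
    """Fewer than 20% of cells with expected < 5, and no expected count below 1."""
    cells = [e for row in expected for e in row]
    small = sum(1 for e in cells if e < 5)
    return min(cells) >= 1 and small / len(cells) < 0.20

# Invented data: two groups of 100 with means 100 and 95, both with std.dev. 10.
means_ci = diff_means_ci(100, 10, 100, 95, 10, 100)

# Invented data: 30 of 50 vs 10 of 50 with the condition.
props_ci = diff_proportions_ci(30, 50, 10, 50)

# The same counts as a 2x2 crosstabulation (rows = groups, columns = outcome).
statistic, expected = chi_square([[30, 20], [10, 40]])

print(means_ci, props_ci, statistic, expected_counts_ok(expected))
```

Both intervals here exclude zero, which matches the rule in the text for declaring a difference between the 2 groups.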

## Associations:

• Aspects:
• 1. What is the degree or strength of the association?
• 2. Is the association found statistically significant?
• 3. How much of the variation in the outcome in the study and control groups is explained by the association?
• Nominal data:
• A. Prospective or experimental data:
• 1. Relative risk:
• measures the strength of assoc.;
• Rel.Risk = (risk if factor present) / (risk if absent)
• 2. Stat. significance of rel. risk:
• 3. Attributable risk:
• measures how much risk is attributable to that factor → a measure of benefit in removing that factor.
• Attrib.risk = [(risk with factor)-(risk without factor)] / (risk without factor)
• B. Retrospective or cross-sectional data:
• Because the researcher chooses a certain number of subjects with and without a disease, the numbers do not reflect natural incidence, and thus absolute and relative risks cannot be calculated although approximations can be calculated.
• 1. Odds ratio:
• an approx. of rel.risk.
• Y = odds of the risk factor among those with disease = (dis & factor)/(dis, no factor)
• Z = odds of the risk factor among those without disease = (no dis & factor)/(no dis, no factor)
• Odds ratio = Y/Z
• 2. Approx. attrib. risk:
• also need to measure the prevalence of risk factor in the population;
• C. Chi-square based measures of association:
• Each of these measures attempts to modify Chi-sq. to minimise the influence of sample size (N), and degrees of freedom, as well as restrict the range of values of measure to between 0 and 1.
• Without such adjustments, comparisons of χ² values from tables of varying dimensions and sample size are meaningless.
• However, these measures are hard to interpret, although, when properly standardized, they can be used to compare the strength of association in several tables.
• Phi coefficient:
• Mainly for 2×2 tables; in larger tables Phi need not lie between 0 and 1, as χ² may be greater than N. To obtain a measure between 0 and 1, Pearson suggested the use of C.
• Phi (φ) = $\sqrt{\chi^2 / N}$
• Contingency coefficient (C):
• Value is always between 0 and 1, but cannot generally attain value of 1. Max. value possible depends on no. of rows and columns.
• eg. 4×4 table, max. value C = 0.87
• C = $\sqrt{\chi^2 / (\chi^2 + N)}$
• Cramer V:
• This can attain a maximum value of 1.
• V = $\sqrt{\chi^2 / (N(k-1))}$,
• where k is the smaller no. of rows & columns.
• D. Measures based on proportional reduction of error (PRE):
• The meaning of the association is clearer than Chi-sq. based measures.
• These measures are all essentially ratios of a measure of error in predicting the values of one variable based on knowledge of that variable alone, and the same measure of error applied to predictions based on knowledge of an additional variable.
• Goodman & Kruskal Lambda:
• Lambda x 100 = % reduction in error when that variable is used to predict the outcome of the dependent variable.
• Lambda always between 0 and 1.
• Ordinal & continuous data:
• The techniques are the same for all 4 types of study designs: prospective, retrospective, cross-sectional & experimental.
• The fundamental techniques with these types of data are known as correlation.
• Pearson's Correlation Techniques:
• Used if both measurements being correlated consist of continuous data, & a linear relationship exists between the variables being correlated.
• 1. Pearson's Correl. Coeff. (r):
• the degree of association;
• range of r is -1 to +1, 0 = no predictable change.
• 2. Stat. signif. of r:
• 3. Pearson's coeff. of determination (r²):
• measures the extent of association;
• Nonparametric correlation:
• Used if the data from any of the variables is ordinal or when a linear relationship is not suspected.
• eg. consider 2 ordered variables a, b, and 2 cases 1,2, taken from the sample:
• values a1, a2, b1, b2
• if a1 & b1 are both greater (or smaller) than a2 & b2 respectively, then the pair of cases is called concordant.
• if a1 > a2 and b1 < b2, then the pair of cases is called discordant.
• if a1 = a2, and b1 not = b2, then it is tied on a but not tied on b.
• if a1 = a2, and b1 = b2, then they are tied on both variables.
• if there is a preponderance of concordant pairs, then the association is said to be positive.
• if there is a preponderance of discordant pairs, then the association is negative.
• if no. concordant pairs = no. discordant pairs, then it is said there is no association.
• A. Spearman's rho:
• B. Kendall's tau a,b & c: range -1 to +1, for tau-c;
• C. G & K's gamma: Pr(conc.)-Pr(disc.),assume no ties;
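
Several of the association measures above reduce to short formulas. The sketch below covers relative risk and odds ratio for a 2×2 table, the three chi-square based coefficients, and Goodman & Kruskal's gamma computed from concordant and discordant pairs as (conc. - disc.) / (conc. + disc.), with tied pairs ignored. The cell labels a, b, c, d, the function names and all data are assumptions made for illustration.

```python
import math
from itertools import combinations

# 2x2 layout assumed here:
#   a = factor & disease,    b = factor & no disease
#   c = no factor & disease, d = no factor & no disease

def relative_risk(a, b, c, d):
    """(risk if factor present) / (risk if factor absent)."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Y / Z, where Y and Z are the odds of the factor with and without disease."""
    y = a / c
    z = b / d
    return y / z

def phi(chi2, n):
    return math.sqrt(chi2 / n)

def contingency_coefficient(chi2, n):
    return math.sqrt(chi2 / (chi2 + n))

def cramers_v(chi2, n, rows, cols):
    k = min(rows, cols)  # the smaller of the no. of rows and columns
    return math.sqrt(chi2 / (n * (k - 1)))

def gamma(cases):
    """Goodman & Kruskal's gamma from (a, b) value pairs; ties are skipped."""
    concordant = discordant = 0
    for (a1, b1), (a2, b2) in combinations(cases, 2):
        direction = (a1 - a2) * (b1 - b2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Invented counts: 30 of 100 with the factor and 10 of 100 without develop disease.
rr = relative_risk(30, 70, 10, 90)
odds = odds_ratio(30, 70, 10, 90)

# Invented chi-square value of 50/3 from a 2x2 table of 100 subjects.
chi2, n = 50 / 3, 100
print(rr, odds, phi(chi2, n), contingency_coefficient(chi2, n), cramers_v(chi2, n, 2, 2))

# Invented ordinal scores for five cases on two variables.
print(gamma([(1, 1), (2, 3), (3, 2), (4, 4), (5, 5)]))
```

For a 2×2 table k - 1 = 1, so Cramer's V and Phi coincide, which the example shows.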

## Regression:

• A series of techniques that are useful for studying the association between one dependent variable and many independent variables.
• These techniques are capable of measuring:
• the strength of an association,
• the statistical significance of the association, and,
• the extent of the variation in the dependent variable that can be explained by the independent variable.
• Assumptions:
• 1. For any fixed value of an independent variable X, the distribution of the dependent variable Y is Normal, with mean µy|x (mean of Y for a given X) and a constant variance of σ². The distributions at different X values may have different means, but the same variance.
• 2. The dependent variable values are statistically independent of each other.
• 3. The mean values µy|x all lie on a straight line, which is the population regression line.
• Single Independent Variable:
• Yi = b0 + b1Xi + ei, where b0 = intercept, b1 = slope,
• ei = error or disturbance
• = diff. b/n observed Yi and the subpop. mean at point Xi
• b0 and b1 are unknown population parameters, and must be estimated from the sample as B0 & B1 using least-squares methods.
• Testing Hypotheses:
• That there is no linear relationship b/n X & Y:
• ie. that the slope of the pop. regression line = 0.
• t = B1 / (st.dev. B1), t should fit Student's t distrib. with N-2 d.f.
• That the intercept = 0:
• t = B0 / (st.dev. B0), t should fit Student's t distrib. with N-2 d.f.
• 95% Confidence interval of b1:
• 95% confidence means that, if repeated samples are drawn from a population under the same conditions, & 95% confidence intervals are calculated, 95% of the intervals will contain the unknown parameter b1. Since the parameter value is unknown, it is not possible to determine whether or not a particular interval contains it.
• Goodness of Fit:
• How well the model actually fits the data.
• The R coefficient:
• R² is sometimes called the coefficient of determination.
• R = Pearson correl. coeff. b/n predicted Y & observed Y
• R² = R x R, Multiple R = Sq.Root R².
• Adjusted R² is a correction to more closely reflect goodness of fit.
• If all obs. fall on regression line, R² = 1;
• If there is no linear relationship b/n the dependent & independent variables, then R² = 0.
• R² = 0 does not necessarily mean that there is no association, but that there is no linear relationship.
• Analysis of Variance:
• To test the hypothesis of no linear relationship b/n X & Y, several equivalent statistics can be computed.
• If single independent variable:
• Hypothesis R²(pop.) = 0, is identical to hypothesis population slope = 0;
• If the probability (signif. F) associated with the F statistic is small, the hypothesis that R² = 0 is rejected.
• Searching for violations of assumptions:
• 1. Residuals:
• a residual is what is left after the model is fit. The diff. b/n an observed value & the value predicted by the model.
• If the model is appropriate, the observed residuals E, which are estimates of the true errors ei, should have similar characteristics, ie. normal dist. with mean of 0 and a constant variance.
• 2. Linearity:
• 3. Equality of variance:
• 4. Independence of error:
• 5. Normality of residuals:
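
For a single independent variable, the least-squares estimates, their t statistics, R² and the residuals can be computed by hand. The sketch below uses invented data and a hypothetical function name; the t statistics would then be compared against Student's t distribution with N-2 d.f. from a table or a statistics library.

```python
import math

def fit_line(xs, ys):
    """Least-squares estimates B0, B1 with standard errors, t statistics and R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual: observed value minus the value predicted by the model.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(e * e for e in residuals)
    sst = sum((y - mean_y) ** 2 for y in ys)

    residual_variance = sse / (n - 2)  # N - 2 degrees of freedom
    se_slope = math.sqrt(residual_variance / sxx)
    se_intercept = math.sqrt(residual_variance * (1 / n + mean_x ** 2 / sxx))

    return {
        "B0": intercept,
        "B1": slope,
        "t_B0": intercept / se_intercept,
        "t_B1": slope / se_slope,
        "R2": 1 - sse / sst,
        "residuals": residuals,
    }

# Invented data that lie close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

fit = fit_line(xs, ys)
print(fit["B0"], fit["B1"], fit["t_B1"], fit["R2"])
```

Plotting the residuals against X or against the predicted values is the usual first step when checking linearity and equality of variance.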