Stats key points

SS02

Time series analysis

Seasonal variation

When asked for the type of variation

Seasonal variation: the variations from the fit line appear to follow a pattern

Random variation: they do not

Moving averages and seasonal effects

Moving averages: For $\{x_1, x_2, x_3, \ldots, x_n\}$, $a_i = \frac{x_{i-1} + x_i + x_{i+1}}{3}$

Centered moving average: The average of the current moving average and the surrounding moving averages

The seasonal effect is the average of the values of the difference between the actual values and the centered moving average.

Take the centered moving average and subtract it from the actual value for each item in a group. Then calculate the average of these values.
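The calculation can be sketched in Python. This is a minimal illustration, assuming a 3-point moving average on data with a three-period cycle (an odd number of points, so each average is already centred on an actual value). The function names and the data are hypothetical.

```python
def moving_averages(xs):
    """3-point moving averages: a_i = (x_{i-1} + x_i + x_{i+1}) / 3."""
    return [(xs[i - 1] + xs[i] + xs[i + 1]) / 3 for i in range(1, len(xs) - 1)]


def seasonal_effects(xs, period):
    """Average of (actual - moving average) for each position in the cycle."""
    averages = moving_averages(xs)
    sums = [0.0] * period
    counts = [0] * period
    for offset, average in enumerate(averages):
        i = offset + 1  # index of the actual value this average is centred on
        sums[i % period] += xs[i] - average
        counts[i % period] += 1
    return [s / c for s, c in zip(sums, counts)]


# Hypothetical data with a repeating three-period pattern.
data = [10, 14, 12, 13, 17, 15, 16, 20, 18]
print(moving_averages(data))
print(seasonal_effects(data, period=3))
```

Each entry of the result is the seasonal effect for one position in the cycle, ready to be added to a value read from the fit line.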

Estimation from seasonal effects

Once the seasonal effects have been calculated, the fit line can be used to find a value for a given time, and the seasonal effect for that point in the cycle can then be applied to this value to produce an estimate.

Sampling

Simple random samples

  1. Assign a range of values to the data
  2. Choose random values from the number table, starting from a random position

Each item has the same probability of being chosen.

If the data is in sections, each section may not be represented.
May not give correct representations of strata.

Stratified random sampling

There may often be factors which divide up the population into groups (strata), and we may expect the measurement of interest to vary among the different groups.
This can be accounted for when we select a sample from the population in order that we obtain a sample that is representative of the population.

We generally require that the proportion of each stratum in the sample should be the same as in the population.
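As a rough sketch of proportional allocation, the Python below sizes each stratum's share of the sample in proportion to its share of the population, then draws a simple random sample within each stratum. The function name, the strata and the sizes are hypothetical.

```python
import random


def stratified_sample(strata, sample_size, seed=None):
    """Draw a simple random sample from each stratum, sized in proportion."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation; rounding may shift the total slightly.
        k = round(sample_size * len(members) / population_size)
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical population: 60 juniors and 40 seniors, sample of 10.
strata = {"junior": list(range(60)), "senior": list(range(60, 100))}
picked = stratified_sample(strata, sample_size=10, seed=1)
print({name: len(members) for name, members in picked.items()})
```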

Stratified sampling techniques are generally used when the population is heterogeneous, or dissimilar, where certain homogeneous sub populations can be isolated.

Some reasons for using stratified sampling over simple random sampling are:

Cluster, quota, and systematic sampling

Cluster sampling

The data is divided into separate groups (clusters). Then a simple random sample of clusters is selected from the population.

The population is divided into N groups, called clusters

The researcher randomly selects n clusters to include in the sample

The number of observations within each cluster, $M_j$, is known, and $M = \sum_j M_j$.

One-stage sampling: All of the elements within selected clusters are included in the sample

Two-stage sampling: A subset of elements within selected clusters is randomly selected for inclusion in the sample

Sometimes, the cost per sample point is less for cluster sampling than for other sampling methods. Given a fixed budget, the researcher may be able to use a bigger sample with cluster sampling than with the other methods. When the increased sample size is sufficient to offset the loss in precision, cluster sampling may be the best choice.

Application:

  1. Select a cluster grouping as a sampling frame
  2. Mark each cluster with a unique number
  3. Choose a sample of clusters applying probability sampling

Quota sampling

Quota sampling requires that representative individuals are chosen from a specific subgroup

Advantages:

Disadvantages:

Systematic sampling

A sampling method in which the first position in the data set is randomly chosen, and every nth position after this is also chosen
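A minimal Python sketch of this, assuming the random start is taken from the first n positions so that every item can be chosen (the function name is hypothetical):

```python
import random


def systematic_sample(items, step, seed=None):
    """Pick a random start in the first `step` items, then every `step`-th item."""
    start = random.Random(seed).randrange(step)
    return items[start::step]


# Hypothetical data set of 20 items, taking every 5th item.
print(systematic_sample(list(range(20)), step=5, seed=0))
```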

Advantages:

Disadvantages:

Discrete probability distributions

Expectations and variance

Poisson

(Same as MS02)

Interpretation of data

Pie charts

Line diagrams

Box and whisker plots

Frequency diagrams

Scatter diagrams

Histograms not required

Hypothesis testing

Tests for means

Errors

Type I error: Rejecting a true null hypothesis

Type II error: Accepting a false null hypothesis

If it says test. State the fucking hypotheses!

Tests

| Test | Details | Case to reject H0 |
| --- | --- | --- |
| SS02 | | |
| Z/T test | Determines whether two population means are different | TV > CV |
| SS03 | | |
| Contingency tables | A test for independence | TV > CV |
| Sign test | Test for difference in medians | TV < CV |
| Wilcoxon | Test for difference in mean or median | TV < CV |
| Mann-Whitney | Test for equality of population | TV < CV |
| Kruskal-Wallis | Test for equality of population of two or more samples | TV > CV |
| Correlation coefficient | Test for existence of correlation between two random variables | TV > SL |
| SS04 | | |
| Poisson | Test for change in a Poisson variable | TV < SL |
| Proportion | Test whether sample proportion represents the population | TV < SL |
| SS05 | | |
| Variance | Tests a sample for a given population variance | TV > CV (upper) or TV < CV (lower) |
| Variance equality of samples (F) | Test for equality of the variances of the populations of two samples of two normally distributed random variables | TV > CV |
| Difference in mean (Two sample Z) | Test for the difference in the means of two independent populations | \|TV\| > \|CV\| |
| Difference in mean (Two sample T) | Test for the difference in the means of two independent populations with unknown variances | \|TV\| > \|CV\| |
| Goodness of fit | Test for the fit of a sample to a particular distribution | TV > CV |
| SS06 | | |
| Paired comparisons | Analysis of the difference between pairs of values sampled from two normal populations | TV > CV |
| Analysis of variance | An extension of F tests with more than 2 populations | TV > CV |
| Two way analysis of variance | An analysis of variance which accounts for a second factor | TV > CV |
| Latin squares | | |

SS03

Contingency tables

  1. State hypotheses (in context)
    i. H0: No association
    ii. H1: Association
  2. Calculate the expected values
    If any expected value is less than 5, merge the rows or columns
  3. Calculate the test statistic $\chi^2_t$
  4. Find the critical value $\chi^2_c$ for $\nu = (n-1)(m-1)$ degrees of freedom (for a table with n rows and m columns)
    This can either be done from the table or with CHISQUARE_ICDF(ν, 1 − α)
  5. If $\chi^2_t > \chi^2_c$, reject H0 in context
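Steps 2 to 4 can be sketched in Python. This assumes the usual expected value for each cell (row total × column total ÷ grand total) and the uncorrected statistic, the sum of (O − E)²/E. The function names and the table of frequencies are hypothetical, and the critical value is still read from tables.

```python
def expected_values(observed):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(observed):
    """Sum of (O - E)^2 / E over every cell, plus the degrees of freedom."""
    expected = expected_values(observed)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Hypothetical 2 x 3 table of observed frequencies.
table = [[20, 30, 50], [30, 30, 40]]
statistic, dof = chi_squared(table)
print(round(statistic, 3), dof)
```

For this table the statistic is about 3.111 with 2 degrees of freedom, below the 5% critical value of 5.991, so H0 would not be rejected.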

Yates’ correction

For a 2 by 2 table, Yates’ correction is used.

Rather than

$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$
the corrected formula is
$\chi^2 = \sum \frac{(|O_i - E_i| - 0.5)^2}{E_i}$

Distribution free methods

| Test | For | Use case |
| --- | --- | --- |
| Sign test | Median | When Wilcoxon cannot be used as data is not symmetrical or is non-numeric |
| Wilcoxon signed-rank test | Median or Mean | When a z or t test cannot be used |
| Mann-Whitney U test | Equality of populations of two samples | There are two samples |
| Kruskal-Wallis test | Equality of populations of two or more samples | There are more than two samples |

Sign test

The sign test checks for a difference in the median value by comparing each pair. It does not require a symmetric distribution and could be used on non-numeric data so long as the data can be assigned to two groups (e.g. boolean values of opinions)

  1. State hypotheses
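    i. H0: the population medians are equal, $\eta_1 = \eta_2$
    ii. H1: the population medians are not equal, $\eta_1 \ne \eta_2$ (or $\eta_1 > \eta_2$, or $\eta_1 < \eta_2$, for a one-tailed test)
  2. Find the differences between each pair, ignoring any equal values
  3. Count $T^+$, the number of positive differences, and $T^-$, the number of negative differences
  4. Find the value of $P(X \le \min(T^+, T^-))$, where $X \sim B(n, \frac{1}{2})$ and n is the number of non-zero differences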
  5. Compare the value above with the significance level. If it is less, reject H0
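The probability in step 4 can be sketched in Python as below. This uses the cumulative probability P(X ≤ min(T+, T−)) with X ~ B(n, 1/2), where n counts only the non-tied pairs. The function name and the paired data are hypothetical.

```python
from math import comb


def sign_test_probability(pairs):
    """P(X <= min(T+, T-)) for X ~ B(n, 1/2), ignoring tied pairs."""
    differences = [a - b for a, b in pairs if a != b]
    positives = sum(1 for d in differences if d > 0)
    negatives = len(differences) - positives
    n = len(differences)
    smaller = min(positives, negatives)
    return sum(comb(n, k) for k in range(smaller + 1)) / 2 ** n


# Hypothetical paired observations; the pair (9, 9) is a tie and is ignored.
pairs = [(12, 10), (15, 11), (9, 9), (14, 13), (10, 12),
         (13, 9), (16, 12), (11, 10), (12, 8), (14, 11)]
print(sign_test_probability(pairs))
```

Here there are 8 positive differences and 1 negative out of 9, giving a probability of 10/512 ≈ 0.0195.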

Wilcoxon Signed-rank test

The Wilcoxon test is similar to the sign test except that it ranks the differences ignoring their signs.

  1. State the hypotheses
    i. H0 Population average difference (in mean or median) of 0
    ii. H1 Population average difference not 0
  2. Rank the absolutes of the differences, giving each rank the sign of its respective difference
  3. Calculate $T^+$ and $T^-$, the sums of the positive and negative ranks respectively. Let $T_t = \min(T^+, T^-)$
  4. Find the critical value, Tc from the table
  5. Compare Tt to Tc, rejecting H0 if Tt is smaller
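Steps 2 and 3 can be sketched in Python. Two assumptions go beyond the notes: zero differences are dropped, and tied absolute differences share the average of the ranks they occupy. The function name and data are hypothetical.

```python
def signed_rank_sums(differences):
    """Return (T+, T-) from ranking |d|, averaging ranks across ties."""
    nonzero = sorted((d for d in differences if d != 0), key=abs)
    t_plus = t_minus = 0.0
    i = 0
    while i < len(nonzero):
        j = i
        while j < len(nonzero) and abs(nonzero[j]) == abs(nonzero[i]):
            j += 1
        rank = (i + 1 + j) / 2  # average of the ranks i+1 .. j
        for d in nonzero[i:j]:
            if d > 0:
                t_plus += rank
            else:
                t_minus += rank
        i = j
    return t_plus, t_minus


# Hypothetical differences between paired values.
t_plus, t_minus = signed_rank_sums([3, -1, 4, -2, 5, 6])
print(t_plus, t_minus, min(t_plus, t_minus))
```

The smaller of the two sums is the test statistic to compare with the tabulated critical value.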

Mann-Whitney U-test

The Mann-Whitney U test tests whether two samples were taken from the same population. It is used when a t test cannot be applied as the data is not normal

  1. State the hypotheses
    i. H0: The samples are from the same population, $\eta_1 = \eta_2$
    ii. H1: The samples are from different populations, $\eta_1 \ne \eta_2$
  2. Rank the entire dataset and calculate the sum of the ranks for each set
  3. Calculate the test statistic for each set, $U = T - \frac{n(n+1)}{2}$, where T is the sum of the ranks of the set and n is the size of the set.
    Let $U = \min(U_1, U_2)$
  4. Find the critical value for the significance level and the size of each data set
  5. If U is less than the critical value, reject H0
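Steps 2 and 3 can be sketched in Python, using U = T − n(n + 1)/2 for each sample and taking the smaller value. Tied values are given the average of the ranks they occupy, which is an assumption beyond the notes. The function name and samples are hypothetical.

```python
def mann_whitney_u(sample_1, sample_2):
    """Return min(U1, U2), where U = T - n(n + 1) / 2 for each sample."""
    pooled = sorted(sample_1 + sample_2)

    def rank(value):
        # Average 1-based position of `value` in the pooled data (handles ties).
        first = pooled.index(value) + 1
        last = first + pooled.count(value) - 1
        return (first + last) / 2

    u_values = []
    for sample in (sample_1, sample_2):
        rank_sum = sum(rank(value) for value in sample)
        n = len(sample)
        u_values.append(rank_sum - n * (n + 1) / 2)
    return min(u_values)


# Hypothetical samples: rank sums are 11 and 25, so U1 = 1 and U2 = 15.
print(mann_whitney_u([1, 2, 3, 5], [4, 6, 7, 8]))
```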

Kruskal-Wallis test

The Kruskal-Wallis test is a non-parametric version of the ANOVA test.
It determines a difference between samples

  1. State the hypotheses
    i. H0 All samples are from the same population
    ii. H1 Samples are from different populations
  2. Rank the entire dataset and calculate the sum of the ranks of each set
  3. Calculate the test statistic $H = \left(\frac{12}{N(N+1)} \sum \frac{T_i^2}{n_i}\right) - 3(N+1)$, where N is the sum of all sample sizes, and $T_i$ and $n_i$ are the rank sums and sizes of each sample
  4. Find the degrees of freedom, the number of samples minus one
  5. Calculate the critical value for the given significance level either from the tables or with CHISQUARE_ICDF(ν, 1 − α)
  6. If the test statistic is larger than the critical value, reject H0
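The statistic and its degrees of freedom can be sketched in Python. This assumes there are no tied values, so each value's rank is simply its position in the pooled, sorted data. The function name and samples are hypothetical.

```python
def kruskal_wallis_h(samples):
    """H = 12 / (N (N + 1)) * sum(T_i^2 / n_i) - 3 (N + 1), assuming no ties."""
    pooled = sorted(value for sample in samples for value in sample)
    total = len(pooled)
    rank_term = 0.0
    for sample in samples:
        rank_sum = sum(pooled.index(value) + 1 for value in sample)
        rank_term += rank_sum ** 2 / len(sample)
    h = 12 / (total * (total + 1)) * rank_term - 3 * (total + 1)
    return h, len(samples) - 1


# Hypothetical samples with rank sums 6, 15 and 24.
h, dof = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h, 3), dof)
```

Here H = 7.2 with 2 degrees of freedom, which exceeds the 5% critical value of 5.991, so H0 would be rejected.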

Correlation

Spearman’s Rank Correlation Coefficient

Spearman’s rank correlation coefficient is used when the data is ranked.

The value is given by $r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the ranks of a pair of values, and n is the (equal) size of each dataset
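A short Python sketch of the calculation, assuming no tied values within either dataset (the function name and data are hypothetical):

```python
def spearman_rs(xs, ys):
    """r_s = 1 - 6 * sum(d^2) / (n (n^2 - 1)), assuming no tied values."""
    def ranks(values):
        ordered = sorted(values)
        return [ordered.index(v) + 1 for v in values]

    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical paired data: the rank differences are 0, -1, 1, -1, 1.
print(spearman_rs([10, 20, 30, 40, 50], [1, 3, 2, 5, 4]))
```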

Testing with the correlation coefficient

A test can be carried out to determine whether or not there is a correlation between two random variables, assuming that the correlation coefficient, $r_s$, is already known.

  1. State the hypotheses
    i. H0: ρ = 0 (independent)
    ii. H1: ρ ≠ 0 (not independent)
  2. Find the critical value from the table
  3. If the correlation coefficient is greater than the critical value, reject H0

SS04

Linear combinations of independent normal variables

If the letters X and Y are variables and the letters a and b are constants then

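  1. $E(aX + b) = aE(X) + b$
  2. $E(X + Y) = E(X) + E(Y)$
  3. $E(a_0 + a_1X_1 + a_2X_2 + \ldots + a_nX_n) = a_0 + a_1E(X_1) + a_2E(X_2) + \ldots + a_nE(X_n)$
  4. $\mathrm{Var}(a + bX) = b^2\mathrm{Var}(X)$
  5. $\mathrm{Var}(X \pm Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ (for independent X and Y)
  6. $\mathrm{Var}(a_0 + a_1X_1 + a_2X_2 + \ldots + a_nX_n) = a_1^2\mathrm{Var}(X_1) + a_2^2\mathrm{Var}(X_2) + \ldots + a_n^2\mathrm{Var}(X_n)$ (for independent $X_i$)
  7. A linear combination of independent, normal variables will itself be normally distributed
  8. If $X_i \sim N(\mu_i, \sigma_i^2)$ independently, then $\sum_{i=1}^{n} c_i X_i \sim N\left(\sum_{i=1}^{n} c_i\mu_i, \sum_{i=1}^{n} c_i^2\sigma_i^2\right)$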

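Given a normal distribution $X \sim N(\bar{x}, \sigma^2)$, the total of k independent observations of X has the distribution $N(k\bar{x}, k\sigma^2)$. The standard deviation of the total is then $\sqrt{k\sigma^2}$. (A single observation multiplied by k is different: by rule 4 above, kX has variance $k^2\sigma^2$.)

Given two independent normal distributions $A \sim N(\bar{x}_A, \sigma_A^2)$ and $B \sim N(\bar{x}_B, \sigma_B^2)$, the normally distributed random variable $A \pm B$ has the distribution $N(\bar{x}_A \pm \bar{x}_B, \sigma_A^2 + \sigma_B^2)$.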

Approximating distributions

  1. The purpose of making an approximation is:
    • To reduce the amount of calculation
    • To allow tables to be used where they otherwise could not
    • To calculate confidence intervals
  2. The binomial distribution may be approximated by the Poisson distribution if n ≥ 50 and p ≤ 0.1
  3. The conditions for the approximations are rules of thumb. They are not sharp dividing lines between good approximations and bad approximations
  4. The binomial distribution may be approximated by the normal distribution if n ≥ 50 and np ≥ 10
  5. The Poisson distribution may be approximated by the normal distribution if λ ≥ 10

Confidence intervals

  1. An estimate of a population standard deviation calculated from a random sample of size n has n − 1 degrees of freedom
  2. If $\bar{X}$ is the mean of a random sample of size n from a normal distribution with mean μ, a 100(1 − α)% confidence interval for μ is given by $\bar{x} \pm t_{\frac{\alpha}{2}, n-1} \frac{s}{\sqrt{n}}$

Further confidence intervals

  1. If x is an observation from a Poisson distribution with mean λ then an approximate 100(1 − α)% confidence interval for λ is given by $x \pm z_{\frac{\alpha}{2}} \sqrt{x}$, provided that x is reasonably large, say >20
  2. If r is an observation from a binomial distribution with parameters n, p then an approximate 100(1 − α)% confidence interval for p is given by $\hat{p} \pm z_{\frac{\alpha}{2}} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$, where $\hat{p} = \frac{r}{n}$, provided r is reasonably large, say >20
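Both approximate intervals can be sketched with Python's standard library, taking the z value from `statistics.NormalDist` rather than from tables. The function names and the example observations are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def poisson_interval(x, confidence=0.95):
    """Approximate interval x +/- z * sqrt(x) for a Poisson mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return x - z * sqrt(x), x + z * sqrt(x)


def proportion_interval(r, n, confidence=0.95):
    """Approximate interval p_hat +/- z * sqrt(p_hat (1 - p_hat) / n)."""
    p_hat = r / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical observations: a Poisson count of 100, and 40 successes in 100.
print(poisson_interval(100))
print(proportion_interval(40, 100))
```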

Further hypothesis testing for means

To carry out a hypothesis test for a mean based on a sample from a normal distribution with an unknown standard deviation:

The test statistic is $\frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}}$ where $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Hypothesis tests for proportions and for the mean of a Poisson distribution

  1. To test hypotheses about a binomial population proportion, p, either:
    a. Determine the cumulative binomial probability from B(n, p)
    State H0 that the proportion remains the same, and H1 that it exhibits the expected change. Then find the probability of a result at least as extreme as the one observed, and reject H0 if this probability is lower than the significance level of the test.
    b. Use $\frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \sim N(0, 1)$
    The hypotheses are then dealt with like a regular normal hypothesis test.
  2. To test hypotheses about a Poisson population mean λ, either:
    a. Determine the cumulative Poisson probability from Po(λ), or
    b. Use $\frac{\hat{\lambda} - \lambda}{\sqrt{\lambda}} \sim N(0, 1)$, for λ > 10

SS05

Continuous probability distributions

  1. If the random variable X has probability density function $f(x) = \begin{cases} \frac{1}{b-a}, & a < x < b \\ 0, & \text{otherwise} \end{cases}$ where a and b are constants, it is said to follow a rectangular distribution
  2. The mean of X is $\frac{1}{2}(a + b)$ and the variance of X is $\frac{1}{12}(b - a)^2$
  3. The exponential distribution has probability density function $f(x) = \begin{cases} \lambda e^{-\lambda x}, & x > 0 \\ 0, & \text{otherwise} \end{cases}$
  4. The exponential distribution with parameter λ has mean $\frac{1}{\lambda}$ and standard deviation $\frac{1}{\lambda}$
  5. P(X<x) is known as the cumulative distribution function and is usually denoted F(x)
  6. For the exponential distribution with parameter λ, $F(x) = 1 - e^{-\lambda x}$, x > 0
  7. If a and b are two constants and a<b, the probability that X takes a value between a and b is $F(b) - F(a)$
  8. The intervals between successive events from a Poisson distribution with mean λ are distributed according to the exponential distribution with parameter λ

Estimation

  1. If $S^2$ denotes the variance estimate from a random sample of size n from a normal population with variance $\sigma^2$, then $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$
  2. The $\chi^2$ distribution is not symmetric, so both lower and upper percentage points need to be read from tables
  3. A 100(1 − α)% confidence interval for a normal population variance, $\sigma^2$, is given by the limits $\frac{(n-1)S^2}{\chi^2_{1-\frac{\alpha}{2}}}$ and $\frac{(n-1)S^2}{\chi^2_{\frac{\alpha}{2}}}$
  4. Confidence limits for a normal population standard deviation, σ are found by taking the square root of those calculated for the population variance

Hypothesis testing: one sample tests

  1. To test hypotheses about a normal population variance, $\sigma^2$, or standard deviation, σ, use $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$
  2. To test hypotheses about a normal population mean, μ, use $\frac{\bar{X} - \mu}{\frac{S}{\sqrt{n}}} \sim t_{n-1}$

Hypothesis testing: two-sample tests

  1. To test hypotheses about the equality of two normal population variances, or standard deviations, use $\frac{s_x^2}{s_y^2} \sim F_{n_x - 1,\, n_y - 1}$
  2. To test hypotheses about the equality of (or given difference in) two normal population means, based upon independent random samples and known population variances, use $\frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}} \sim N(0, 1)$
    Note that for $n_x > 30$ and $n_y > 30$ the requirement for normal populations can be relaxed and/or sample variances can be used as estimates of the population variances
  3. To test hypotheses about the equality of (or given difference in) two normal population means, based upon independent random samples and unknown but equal population variances, use $\frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{s_p^2\left(\frac{1}{n_x} + \frac{1}{n_y}\right)}} \sim t_{n_x + n_y - 2}$
    where $s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}$
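The pooled two-sample t statistic from point 3 can be sketched in Python. The critical value still comes from t tables with n_x + n_y − 2 degrees of freedom. The function name and samples are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(xs, ys, difference=0.0):
    """Two-sample t statistic with pooled variance, and its degrees of freedom."""
    nx, ny = len(xs), len(ys)
    # statistics.variance uses the n - 1 divisor, matching s_x^2 and s_y^2.
    pooled = ((nx - 1) * variance(xs) + (ny - 1) * variance(ys)) / (nx + ny - 2)
    t = (mean(xs) - mean(ys) - difference) / sqrt(pooled * (1 / nx + 1 / ny))
    return t, nx + ny - 2


# Hypothetical samples: means 7 and 2, sample variances 4 and 1.
t, dof = pooled_t_statistic([5, 7, 9], [1, 2, 3])
print(round(t, 4), dof)
```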

Testing for goodness of fit

  1. $\sum \frac{(O - E)^2}{E}$ may be approximated by a $\chi^2$ distribution provided that
    • The Os are frequencies,
    • The Es are at least five,
    • The classes form a sample space, that is, every possible observation fits into one and only one class
  2. The number of degrees of freedom is the number of classes, minus the number of independent pieces of information derived from the Os in order to calculate the Es
  3. If there are k classes and any necessary parameters are estimated from the data, the number of degrees of freedom is k − 2 for a Poisson, binomial, or exponential distribution, and k − 3 for a normal distribution

SS06

Experimental design

  1. Experimental error is the effect of factors other than those controlled by the experimenter
  2. In a paired comparison, experimental error is reduced by applying both treatments to the same subjects or in the same conditions
  3. The purpose of randomisation is to eliminate bias
  4. Blocking is used to reduce experimental error by applying treatments (usually more than two) to the same subjects or in the same conditions
  5. If a new treatment is applied to an experimental group, a control group, which receives no treatment or the standard treatment, is needed to act as a measure of the effect of not applying the new treatment
  6. A placebo is a pill or treatment which contains no active ingredient
  7. In a blind trial subjects do not know whether they are receiving the treatment or a placebo
  8. In a double blind trial neither the subject nor the person administering the treatment knows whether a placebo or an active drug is being given

Analysis of paired comparisons

If $\bar{D}$ and $S_d$ denote the mean and standard deviation, respectively, of a random sample of n differences that can be assumed to be normally distributed with mean $\mu_d$, then $\frac{\bar{D} - \mu_d}{\frac{S_d}{\sqrt{n}}} \sim t_{n-1}$

Analysis of variance (ANOVA)

  1. The assumptions for the three models considered, one and two factor ANOVAs, and Latin square designs, are that:
    a. The observations are obtained independently and randomly from populations at each factor level (combination)
    b. These populations are (approximately) normally distributed with common variance σ2
    c. When two or more factors are involved, there is no interaction between them

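  2. One way ANOVA table

    | Source of variation | Sum of squares | Degrees of freedom | Mean square | F ratio |
    | --- | --- | --- | --- | --- |
    | Between samples | SSB | k − 1 | MSB = SSB / (k − 1) | MSB / MSW |
    | Within samples | SSW = SST − SSB | n − k | MSW = SSW / (n − k) | |
    | Total | SST | n − 1 | | |
  3. Two way ANOVA table

    | Source of variation | Sum of squares | Degrees of freedom | Mean square | F ratio |
    | --- | --- | --- | --- | --- |
    | Between rows | SSR | m − 1 | MSR = SSR / (m − 1) | MSR / MSE |
    | Between columns | SSC | n − 1 | MSC = SSC / (n − 1) | MSC / MSE |
    | Error | SSE = SST − SSR − SSC | (m − 1)(n − 1) | MSE = SSE / ((m − 1)(n − 1)) | |
    | Total | SST | mn − 1 | | |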

Provided in the formulae booklet
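As an illustration of how the one way table is filled in, the Python sketch below computes the sums of squares directly from the data and divides each by its degrees of freedom (k − 1 between samples, n − k within samples). The function name and data are hypothetical, and the exam method may use the booklet's computational formulae instead.

```python
def one_way_anova(samples):
    """Return (SSB, SSW, F) for a list of samples, one per factor level."""
    values = [v for sample in samples for v in sample]
    n, k = len(values), len(samples)
    grand_mean = sum(values) / n
    sst = sum((v - grand_mean) ** 2 for v in values)
    ssb = sum(
        len(sample) * (sum(sample) / len(sample) - grand_mean) ** 2
        for sample in samples
    )
    ssw = sst - ssb
    f_ratio = (ssb / (k - 1)) / (ssw / (n - k))
    return ssb, ssw, f_ratio


# Hypothetical data: three samples with means 2, 3 and 6.
ssb, ssw, f_ratio = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(ssb, 3), round(ssw, 3), round(f_ratio, 3))
```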

Statistical process control

  1. Statistical process control may be used when a large number of similar items are being produced. Its purpose is to give a signal when the process mean has moved away from the target value or when item-to-item variability has increased
  2. For control charts for means:
    • Sample mean between warning limits- No action
    • Sample mean between warning and action limits- Take another sample immediately. If new sample mean outside warning limits take action
    • Sample mean outside action limits- Take action
  3. The warning limits are set at $\mu \pm 1.96\frac{\sigma}{\sqrt{n}}$, and the action limits at $\mu \pm 3.09\frac{\sigma}{\sqrt{n}}$, where μ is the target value, σ is the short-term standard deviation, and n is the sample size
  4. Variability may be controlled by plotting the sample ranges or standard deviations on control charts. The limits for these charts are found by multiplying the process short-term standard deviation by the factors given in the control charts for variability table (Table 12)
  5. When the standard deviation must be estimated from a number of small samples the average sample range can be calculated and a factor from Table 12 applied.
    Alternatively $s_i$ can be calculated for each sample and the formula $s = \sqrt{\frac{\sum s_i^2}{n}}$ evaluated, where n here is the number of samples
  6. If the tolerance width exceeds six standard deviations the process should be able to meet the tolerances consistently, provided the mean is kept on target
  7. For charts for proportion non-conforming, provided n is reasonably large:
    • The warning limits are $p \pm 1.96\sqrt{\frac{p(1-p)}{n}}$
    • The action limits are $p \pm 3.09\sqrt{\frac{p(1-p)}{n}}$
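The limits for a chart for means can be sketched in Python, using the 1.96 and 3.09 multipliers given above. The function name and the process values are hypothetical.

```python
from math import sqrt


def mean_chart_limits(target, sigma, n):
    """Warning (1.96) and action (3.09) limits for sample means."""
    standard_error = sigma / sqrt(n)
    warning = (target - 1.96 * standard_error, target + 1.96 * standard_error)
    action = (target - 3.09 * standard_error, target + 3.09 * standard_error)
    return warning, action


# Hypothetical process: target 50, short-term sd 2, samples of size 4.
warning, action = mean_chart_limits(target=50, sigma=2, n=4)
print(warning, action)
```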

Acceptance sampling

  1. Acceptance sampling may be applied to large batches of similar items. It is the process of deciding whether or not the batch is acceptable by testing a small sample of the items
  2. The operating characteristic for an acceptance sampling by attributes plan is a graph of probability of acceptance against proportion non-conforming in the batch
  3. The probabilities may be found from the binomial distribution provided the sample is random and the sample size is small compared to the batch
  4. In double sampling, the number of non-conforming items in the first sample will determine whether a decision is made immediately or whether it is delayed until a second sample has been inspected
  5. For acceptance sampling by variables the operating characteristic is a graph of probability of acceptance against batch mean

MS03

Bayes’ theorem

$P(A_j|B) = \frac{P(A_j)P(B|A_j)}{\sum_{i=1}^{n} P(A_i)P(B|A_i)}$

Example

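| Event | P(event) | P(B given event) |
| --- | --- | --- |
| C | 0.3 | 0.8 |
| D | 0.5 | 0.1 |
| E | 0.2 | 0.4 |

After event A (exactly one of C, D or E), event B occurred. Find the probability that event E occurred.

Let F be the event that B occurred.

$P(E|F) = \frac{P(E)P(F|E)}{P(C)P(F|C) + P(D)P(F|D) + P(E)P(F|E)} = \frac{0.2 \times 0.4}{0.3 \times 0.8 + 0.5 \times 0.1 + 0.2 \times 0.4} = \frac{8}{37}$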
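The same calculation can be checked in Python with exact fractions. The dictionary and function names are hypothetical; the numbers are those in the example table, read as P(event) and P(B given event).

```python
from fractions import Fraction

# Values from the example table: P(event) and P(B given event).
priors = {"C": Fraction(3, 10), "D": Fraction(1, 2), "E": Fraction(1, 5)}
likelihoods = {"C": Fraction(4, 5), "D": Fraction(1, 10), "E": Fraction(2, 5)}


def posterior(event):
    """P(event | B) = P(event) P(B | event) / sum of P(A_i) P(B | A_i)."""
    total = sum(priors[name] * likelihoods[name] for name in priors)
    return priors[event] * likelihoods[event] / total


print(posterior("E"))  # 8/37
```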

Linear combinations of random variables

Covariance is a measure of the joint variability of two random variables.

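$\mathrm{Cov}(X, Y) = E((X - \mu_x)(Y - \mu_y)) = E(XY) - \mu_x\mu_y$

The covariance can be used to find the product moment correlation coefficient of two random variables:

$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_x\sigma_y}$

The variance of a linear combination of two random variables can then be computed as

$\mathrm{Var}(aX \pm bY) = a^2\mathrm{Var}(X) + b^2\mathrm{Var}(Y) \pm 2ab\,\mathrm{Cov}(X, Y)$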

Distributional approximations

Mean and variance of binomial and Poisson distributions

Proof of μ=np for binomial

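Substituting y = x − 1 in the fourth line:

$$\begin{aligned} \mu = E(X) &= \sum x_i p_i \\ &= \sum_{x=1}^{n} x \times \frac{n!}{x!(n-x)!} \times p^x(1-p)^{n-x} \\ &= np\sum_{x=1}^{n} \frac{(n-1)!}{(x-1)!(n-x)!} \times p^{x-1}(1-p)^{n-x} \\ &= np\sum_{y=0}^{n-1} \frac{(n-1)!}{y!(n-y-1)!} \times p^y(1-p)^{n-y-1} \\ &= np\sum_{y=0}^{n-1} P(Y = y), \quad Y \sim B(n-1, p) \\ &= np \end{aligned}$$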

Proof of $\sigma^2 = np(1-p)$ for binomial