# Below a low level inversion visibility is often

Chapter 17: Inferences when SD is unknown

BIG IDEA:

The t-test is used when SD is unknown

There are fewer conditions for inferences about a mean

Data is SRS from a larger pop

Observations follow a Normal distribution

We will estimate standard error using s/√n

s is the sample SD

s is the estimate of the variation btw indiv

s/√n is how much the sample means vary

T Test

t = (x bar - mu)/(s/√n)

The t stat is more variable than the z stat because we have to estimate sigma with s

This means the t stat does not follow a Normal distr; it follows a t distr, which is more variable (wider)

df = n-1

Higher df = closer to Normal

steps:

Check that conditions are met

Calculate the t-test stat using x bar, s, n and mu null

Compute the prob of observing the test stat t or more extreme under the null hypothesis (p-value)

Interpret the p-value

T-Star:

For a 95% CI with 25 observations

t_star <- qt(p = 0.975, df = 24)

Formula

X Bar + or - t_star * s/√n
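The steps and the CI formula above can be sketched in R as follows. This is only an illustration: the data are simulated, and `x` and `mu_null` are made-up names and values, not from the course.

```r
# Hypothetical data: 25 simulated measurements (not from the notes)
set.seed(1)
x <- rnorm(25, mean = 52, sd = 6)
mu_null <- 50                        # assumed null value, for illustration only

n <- length(x)
se <- sd(x) / sqrt(n)                # estimated standard error s/sqrt(n)
t_stat <- (mean(x) - mu_null) / se   # t = (x bar - mu) / (s/sqrt(n))
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

t_star <- qt(p = 0.975, df = n - 1)  # critical value for a 95% CI
ci <- mean(x) + c(-1, 1) * t_star * se

# t.test() should reproduce the hand calculation
check <- t.test(x, mu = mu_null, alternative = "two.sided")
```

Comparing `t_stat`, `p_value` and `ci` with `check$statistic`, `check$p.value` and `check$conf.int` is a quick way to confirm the by-hand formulas.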

Robustness:

A procedure is robust if the CI or p-value does not change much when its conditions are violated

T procedure is quite robust against non-Normality, except when outliers or strong skew are present

T-procedure with outliers is ok if the sample is big enough

Assumptions:

Plot the data to see if there are outliers and if there is skew

SRS is more important than Normality

If n < 15, use t procedures only if the data appear close to Normal

If n is at least 15, use t unless there are outliers or strong skew

If n is at least 40, use t even if the data are clearly skewed

Chapter 17 Pt2 Paired T test

Used when observations are matched by design

This is a test of the mean difference within a subject

T = (x bar d - mu d)/(s d/√n), where mu d = 0 under the null

Observed value of the test stat is t = (x bar d - 0)/(s d/√n)

pull() vs select()

select() keeps the columns that you want, but the result remains a data frame

pull() extracts a single column as a plain vector, without the data frame

Find the quantile using q <- qt(p = 0.975, lower.tail = TRUE, df = 10)

paired_t <- t.test(chol_dat %>% pull(B), chol_dat %>% pull(A),

alternative = "two.sided", mu = 0, paired = TRUE)

Use this code to run the test; make sure paired = TRUE is set
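A minimal sketch of what the paired test does under the hood, assuming made-up before/after vectors (`before` and `after` are hypothetical and simulated, not the `chol_dat` columns): the paired t test is a one-sample t test on the within-subject differences.

```r
# Hypothetical before/after measurements on 11 subjects (simulated)
set.seed(2)
before <- rnorm(11, mean = 200, sd = 20)
after  <- before - rnorm(11, mean = 8, sd = 10)

d <- before - after                        # within-subject differences
t_by_hand <- (mean(d) - 0) / (sd(d) / sqrt(length(d)))

paired   <- t.test(before, after, paired = TRUE)
one_samp <- t.test(d, mu = 0)              # same test, run on the differences
```

With 11 subjects the test has df = 10, which matches the df used in the qt() call above.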

Paired t tests help remove confounding, since each subject serves as their own control

Must give the second treatment only after a wash-out period so effects don't carry over

Chapter 18: Comparing 2 pop means

We have used One sample tests

One sample and one variable

Now we will use two pops

H null: mu1-mu2=0

H alt: mu1-mu2 ≠ 0

Compare graphically

Make a histogram, one for each sample

Compare their shapes, centers and spreads

Or make two boxplots and compare their medians and IQRs

Conditions

We have two SRSs from two pops

The samples are independent

The same quantitative variable is measured in both samples

Both pops are Normally distributed, with no outliers

The SD of xbar1-xbar2 is √(sigma1^2/n1 + sigma2^2/n2)

Our estimate is SE = √(s1^2/n1 + s2^2/n2)

Two sample t-test is:

t = [(xbar1-xbar2) - (mu1-mu2)] / SE

Under the null (mu1-mu2 = 0) this is t = (xbar1-xbar2) / SE

Degrees of freedom formula is long

df = (s1^2/n1 + s2^2/n2)^2 / [ (1/(n1-1))*(s1^2/n1)^2 + (1/(n2-1))*(s2^2/n2)^2 ]

Confidence Interval

(xbar1-xbar2) + or - t_star * √(s1^2/n1 + s2^2/n2)

Example:

Infection of chickens with the avian flu is a threat to both poultry production and human health. A research team created transgenic chickens resistant to avian flu infection. Could the modification affect the chicken in other ways? The researchers compared the hatching weights (in grams) of 45 transgenic chickens and 54 independently selected commercial chickens of the same breed.

R does all of the above in one call

t.test(commercial_weight, transgenic_weight, alternative = "two.sided")
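To see that the one-line call agrees with the formulas, here is a sketch on simulated data. The vectors `g1` and `g2` are hypothetical stand-ins, since the actual chicken weights are not in these notes.

```r
# Hypothetical samples (simulated); sizes mirror the example (54 and 45)
set.seed(3)
g1 <- rnorm(54, mean = 45, sd = 4)
g2 <- rnorm(45, mean = 44, sd = 5)

v1 <- var(g1) / length(g1)            # s1^2 / n1
v2 <- var(g2) / length(g2)            # s2^2 / n2
se <- sqrt(v1 + v2)
t_stat <- (mean(g1) - mean(g2)) / se

# The long df formula
df <- (v1 + v2)^2 / (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))

t_star <- qt(0.975, df = df)
ci <- (mean(g1) - mean(g2)) + c(-1, 1) * t_star * se

# t.test() uses unequal variances by default (var.equal = FALSE)
welch <- t.test(g1, g2, alternative = "two.sided")
```

`welch$parameter` holds the df, which is usually not a whole number.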

Robustness

More robust than one sample procedures, esp if the data are skewed

When the samples are the same size and the two distributions have similar shapes, they can work for samples as small as 5

When the two pops have different shapes, you need larger samples

Chapter 19: Inference about a pop proportion

This is about binary data, as opposed to the continuous data from previous chapters

Large sample CI

p_hat + or - z_star * √(p_hat(1-p_hat)/n)

Coverage is not as good as Plus 4

Plus 4 Method

Used bc the large sample CI can have poor coverage on binary data

Add 2 fake successes and 2 fake failures

p_tilde = (number of successes + 2) / (n + 4)

SE = √(p_tilde(1-p_tilde) / (n + 4))

CI = p_tilde + or - z_star * SE

This should be used when n is 10 or more and the confidence level is 90% or more

Use this when doing by hand

Wilson Score [prop.test]

Similar to Plus 4, but with a continuity correction by default

Basically the R version of Plus 4

Clopper Pearson or Exact [binom.test]

Statistically conservative

Actual coverage is higher than the stated confidence level

Example: Suppose that 500 elderly individuals suffered hip fractures, of which 100 died within a year of their fracture. Compute the 95% CI for the proportion who died using:

| Method | 95% CI | How |
| --- | --- | --- |
| Large sample | 16.5% to 23.5% | by hand |
| Clopper Pearson | 16.6% to 23.8% | binom.test |
| Wilson Score | 16.6% to 23.8% | prop.test |
| Plus four | 16.7% to 23.7% | by hand |

Note that only the large sample CI is symmetric around 0.20

A CI does not need to be symmetric with binary data
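The four intervals for the hip fracture example (100 deaths out of 500) can be reproduced roughly like this. The variable names are made up for the sketch; the two "by hand" intervals follow the formulas in these notes.

```r
x <- 100                     # deaths within a year
n <- 500                     # individuals with hip fractures
z_star <- qnorm(0.975)       # about 1.96 for a 95% CI

# Large sample CI
p_hat <- x / n
large_sample <- p_hat + c(-1, 1) * z_star * sqrt(p_hat * (1 - p_hat) / n)

# Plus four CI: add 2 fake successes and 2 fake failures
p_tilde <- (x + 2) / (n + 4)
plus_four <- p_tilde + c(-1, 1) * z_star * sqrt(p_tilde * (1 - p_tilde) / (n + 4))

# Wilson score (continuity-corrected by default) and Clopper-Pearson
wilson <- prop.test(x, n)$conf.int
exact  <- binom.test(x, n)$conf.int

# All four intervals side by side, in percent
round(100 * rbind(large_sample, plus_four, wilson, exact), 1)
```

The printed matrix can be checked row by row against the table above.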

Finding a specific sample size

Let m = Margin of Error desired

m = z_star * √(p_hat(1-p_hat) / n)

We will have to guess for p using p_star

If you have no clue use p_star=0.5

But if true p is less than 0.3 or more than 0.7 then the resulting n will be bigger than needed

n = sample size wanted

n = (z_star/m)^2 * p_star * (1-p_star)

Ex. Suppose after the midterm vote, you were interested in estimating the number of STEM undergraduate students who voted. First you need to decide what margin of error you desire. Suppose it is 4 percentage points or m=0.04 for a 95% CI.

If we guess that the proportion is 25% then the formula gives

(1.96/0.04)^2 * 0.25 * (1-0.25) = 450.19, so n = 451

ALWAYS ROUND UP
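The same sample size calculation as a short R sketch. The names `m`, `p_star` and `n_conservative` are just labels chosen here; `qnorm(0.975)` is used in place of the rounded 1.96.

```r
z_star <- qnorm(0.975)   # about 1.96 for a 95% CI
m <- 0.04                # desired margin of error
p_star <- 0.25           # guessed proportion

n_raw <- (z_star / m)^2 * p_star * (1 - p_star)
n <- ceiling(n_raw)      # always round up

# Conservative choice when there is no guess for p: p_star = 0.5
n_conservative <- ceiling((z_star / m)^2 * 0.5 * (1 - 0.5))
```

Using p_star = 0.5 gives the largest n the formula can produce, which is why it is the safe default.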

P-Value

Plug the z value you get into pnorm (this gives the upper tail area)

pnorm(q = 3.06413, lower.tail = FALSE)

Chapter 20: Inference for comparing 2 proportions

Large sample CI for diff of 2 prop

Use when the numbers of successes and failures are each at least 10 for both samples

(p_hat1-p_hat2) + or - z_star * √(p_hat1(1-p_hat1)/n1 + p_hat2(1-p_hat2)/n2)

Actual coverage can be lower than the stated confidence level

Example: Patients in a randomized controlled trial who were severely immobilized were randomly assigned to receive either Fragamin (to prevent blood clots) or a placebo. The number of patients experiencing deep vein thrombosis (DVT) was recorded:

| | DVT | no DVT | Total | p^ |
| --- | --- | --- | --- | --- |
| Fragamin | 42 | 1476 | 1518 | 42/1518 = 2.77% |
| Placebo | 73 | 1400 | 1473 | 73/1473 = 4.96% |

Conditions:

Both samples have more than 10 successes and more than 10 failures

The estimate of the diff is 4.96% - 2.77% = 2.19%

Plus 4

Use when the large sample condition is not satisfied

Add 4 fake observations: 1 success and 1 failure to each sample

p_tilde1 = (successes in sample 1 + 1) / (n1 + 2)

p_tilde2 = (successes in sample 2 + 1) / (n2 + 2)

(p_tilde1-p_tilde2) + or - z_star * √(p_tilde1(1-p_tilde1)/(n1+2) + p_tilde2(1-p_tilde2)/(n2+2))

Use when each sample has at least 5 observations; can be used even when the number of successes or failures = 0

Much more accurate when sample sizes are small

May be conservative (higher coverage than advertised)

Z-Test

z = (p_hat1-p_hat2) / √(p_hat(1-p_hat)(1/n1 + 1/n2)), where p_hat is the pooled proportion (all successes / all observations)

Use this only when the counts of successes and failures are 5 or more for both samples

Use pnorm to find p-value

pnorm(q = 3.112881, lower.tail = FALSE)*2
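A sketch of the z test and the large sample CI using the DVT counts from the example. The variable names are invented here, and the difference is taken as placebo minus treatment so that z comes out positive.

```r
x1 <- 42; n1 <- 1518     # treatment group: DVT count and total
x2 <- 73; n2 <- 1473     # placebo group: DVT count and total

p1 <- x1 / n1
p2 <- x2 / n2

# Pooled proportion and SE under the null of no difference
p_pool <- (x1 + x2) / (n1 + n2)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z <- (p2 - p1) / se_pool
p_value <- pnorm(q = abs(z), lower.tail = FALSE) * 2

# Large sample 95% CI for the difference (unpooled SE)
se_ci <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci <- (p2 - p1) + c(-1, 1) * qnorm(0.975) * se_ci
```

Note that the test uses the pooled SE while the CI uses the unpooled SE.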

Chapter 21: The chi squared goodness of fit test

One categorical variable with more than 2 categories

GOODNESS OF FIT

Estimate how many observations we will expect in each cat

Compare the number of observations in each category to the exp value

Example

Suppose that the following number of people were selected for jury duty in the previous year, in a county where jury selection was supposed to be random.

| Ethnicity | White | Black | Latinx | Asian | Other | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Selected | 1920 | 347 | 19 | 84 | 130 | 2500 |

You want to take the percent of each race and check that each group is being represented proportionately

You can use a line graph to show the difference between expected and observed

Chi Squared Stat

Equal to the sum of

(observed - expected)^2 / expected

over all categories

Chi squared distribution

Like T distribution, the only parameter is degrees of freedom

df=number of groups minus 1

For this ex it would be 5-1=4

As df increases, the center of the distribution moves to the right

The chi-square stat is never negative, so always take the upper tail

P-Value

Once you find the chi squared value plug it into

pchisq(q = 1606.454, df = 4, lower.tail = FALSE)

Chi Square function

chisq.test(x = c(1920, 347, 19, 84, 130),

p = c(.422, .103, .251, .171, .053))
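The by-hand chi squared stat for the jury example can be checked against chisq.test() like this. The category labels on `observed` are assumed from the order of the table; `p_null` holds the population proportions used in the call above.

```r
observed <- c(White = 1920, Black = 347, Latinx = 19, Asian = 84, Other = 130)
p_null <- c(0.422, 0.103, 0.251, 0.171, 0.053)   # expected proportions under the null

expected <- sum(observed) * p_null               # expected count in each category
chi_sq <- sum((observed - expected)^2 / expected)

df <- length(observed) - 1                       # number of groups minus 1
p_value <- pchisq(q = chi_sq, df = df, lower.tail = FALSE)

# The built-in test should give the same statistic
gof <- chisq.test(x = observed, p = p_null)
```

`gof$expected` also returns the expected counts, which is handy for checking the conditions.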

Conditions for Chi Squared

Fixed number of n observations

All observations are independent of each other

Each observation falls into one of the k mutually exclusive categories

At least 80% of the cells have 5 or more expected observations

All k cells have expected counts of at least 1

Chapter 22: Inference two way tables

Last chapter was about one categorical variable, this is about 2

Ex

For example, what is the conditional probability of vaping among teens exposed to a JUUL advertisement vs. Teens unexposed?

| Group | Lung Cancer | No Lung Cancer | Row total |
| --- | --- | --- | --- |
| Smoker | 12 | 238 | 250 |
| Non-smoker | 7 | 743 | 750 |
| Column total | 19 | 981 | 1000 |

Then find the expected counts under the null hypothesis that the two variables are not associated

Shortcut for expected counts

Expected = (row total * col total) / overall total

Calculate the chi squared the same way

Sum of (observed - expected)^2 / expected over all cells

Calculate degrees of freedom

For this it is (number of rows - 1) times (number of columns - 1)

For this example it is (2-1)(2-1) = 1

Same R function to find the p value

pchisq(q = 15.04015, df = 1, lower.tail = FALSE) #df = (2-1)(2-1) = 1

Chi Squared test of independence

chisq.test(two_way, correct = FALSE)

chisq.test(two_way, correct = TRUE)

Use the correction whenever n < 100 or any observed value is less than 10
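A sketch of the whole two way calculation for the smoking table. How `two_way` is built here is an assumption (the notes do not show it); a matrix of the four observed counts is one straightforward option.

```r
# Observed counts only (no row or column totals)
two_way <- matrix(c(12, 238,
                     7, 743),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Group = c("Smoker", "Non-smoker"),
                                  Outcome = c("Lung cancer", "No lung cancer")))

# Expected = row total * col total / overall total, for every cell
expected <- outer(rowSums(two_way), colSums(two_way)) / sum(two_way)

chi_sq <- sum((two_way - expected)^2 / expected)
df <- (nrow(two_way) - 1) * (ncol(two_way) - 1)
p_value <- pchisq(q = chi_sq, df = df, lower.tail = FALSE)

# One expected count is below 5, so chisq.test() warns; silenced for this sketch
test <- suppressWarnings(chisq.test(two_way, correct = FALSE))
```

`test$expected` gives the same expected counts, which makes it easy to check the conditions listed for this test.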

Conditions for the chi square test of independence

Expected is at least 5 for at least 80% of the cells

All expected values are greater than 1

If table is 2X2 then all four cells need expected at least 5

Assumptions for the chi squared test of independence

Must have data from independent SRSs from at least 2 populations, with mutually exclusive categories

Or a single SRS with each individual classified according to each of two categorical variables

For a 2X2 table, z^2 from the two proportion z test is the same as the chi squared stat

The p values for the two sided z test and the chi squared test are the same

With a 2X2 table you may want to use a z test when you need a one sided test, because you cannot do a one sided test with chi squared

Use dodged bar charts to compare the conditional distributions of one variable across levels of another variable

Chapter 23: Inference for Regression

Recap of regression from pt1

Graph the data. Does the data look linear? What is the correlation coefficient?

Calculate the line of best fit w lm()

Use glance() and tidy() from library(broom) to summarize model findings

Interpret the slope (b_hat) and intercept (a_hat) parameters

Interpret the r squared value

Assumptions to check for regression inference

The relationship between x and y is linear in the pop

y varies normally around the line of best fit. That is, the residuals vary normally around the line of best fit

Residuals refer to the vertical distance between the line of best fit and the observed y value

Observations are independent

This cannot be checked on the plot, we need to know the study design

The sd of the responses is the same for all values of x

Vocab:

Observed value: y

Fitted value: y_hat = a_hat + b_hat*x

Estimated residual: r_hat = observed value - fitted value = y - (a_hat + b_hat*x)

Graphs used to check

Scatter Plot

Shows fitted regression line and the data. The estimated residuals are shown by the dashed lines. We want to see that residuals are positive and negative with no trend

QQPlot

Check if the residuals are Normally distributed

Fitted v. Residuals

Check to see random scatter

Amount Explained

Boxplot of the distribution of y v the distribution of the residuals. If x does a good job of describing y, then the box plot for the residuals will be much shorter

Note

Regression procedures are not too sensitive to lack of normality

Outliers are important since they can have a large effect

Chapter 23 Pt.2 Inference for Regression

tidy(your_lm) presents the output of the linear model

glance(lm) takes a quick one line look at fit stats

augment(lm) creates an augmented data frame that contains a column for the fitted y-values (y_hat) and the residuals (e_hat = y - y_hat) among other columns

New terminology: SSE

Sum of squared estimates of error

The SSE is the summation of the squared distance between each indiv’s y value and the fitted value based on the line of best fit

The higher the SSE, the worse the model

Regression standard error

Used to measure how well a model fits

s = √(SSE/(n-2))

A good fitting model should have a low regression standard error

Look at s after running a linear model to assess the model’s fit to the data

s is on the same scale as y, same units

glance(lm) will print s, denoted as sigma
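A base R sketch of SSE and the regression standard error, on simulated data (`x`, `y` and the true coefficients are made up). It uses `summary()` instead of broom so it runs without extra packages; `glance(fit)$sigma` should report the same number.

```r
# Hypothetical data (simulated)
set.seed(4)
x <- runif(20, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(20, sd = 1.5)

fit <- lm(y ~ x)

fitted_y <- coef(fit)[1] + coef(fit)[2] * x   # y_hat = a_hat + b_hat * x
resid_y <- y - fitted_y                       # estimated residuals

sse <- sum(resid_y^2)                         # sum of squared residuals
s <- sqrt(sse / (length(y) - 2))              # regression standard error

s_from_r <- summary(fit)$sigma                # what R reports as sigma
```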

Hypothesis testing for regression

We would like to know if the slope is different from 0

H_null: b = 0

There is no association between x and y

H_alt: b is not equal to 0 for a two sided test

There is an association

Know how to use R to find these values

tidy(lm)

estimate is the estimated slope coefficient b_hat

std.error is the standard error, SE b

statistic is the t test stat b_hat / SE b

Test will always have n-2 df

Use pt to find p value

pt(q = 6.7211302, df = 18, lower.tail = FALSE)*2

We can also use the tidy(lm) output to find a CI for the regression coefficient

b_hat + or - t_star * SE b

t_star <- qt(p = 0.975, df = 18)
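The slope test and CI can be rebuilt from the coefficient table as below. The data are simulated with n = 20 so that df = 18 as in the calls above; `summary(fit)$coefficients` is used here as a base R stand-in for the tidy() output.

```r
# Hypothetical data (simulated), n = 20 so df = n - 2 = 18
set.seed(5)
x <- runif(20, min = 0, max = 10)
y <- 1 + 0.8 * x + rnorm(20, sd = 2)

fit <- lm(y ~ x)
coefs <- summary(fit)$coefficients   # same numbers tidy(fit) would show

b_hat <- coefs["x", "Estimate"]
se_b <- coefs["x", "Std. Error"]
df <- length(y) - 2

t_stat <- b_hat / se_b
p_value <- pt(q = abs(t_stat), df = df, lower.tail = FALSE) * 2

t_star <- qt(p = 0.975, df = df)
ci <- b_hat + c(-1, 1) * t_star * se_b   # 95% CI for the slope
```

`confint(fit)` returns the same interval without the manual steps.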

Test for the lack of correlation

There is a lack of correlation if and only if there is no linear association between the explanatory and response variables

Thus if your hypothesis test does not reject the null (b = 0) then this also implies that you would not reject the hypothesis of no correlation between x and y

Chapter 24: ANOVA

Analysis of variance

When the ratio of between vs within variation is large enough, then we detect a difference between the groups

When the ratio is not large enough we do not detect the difference

The ratio is our test stat, denoted by F

Graphing

Use a box plot for each level of the grouping variable

Make a density plot for each level of the grouping variable

Histogram for each level of the grouping variable

Hypotheses

Null: mu1 = mu2 = … = muk

Alt: not all mu are equal

At least one mean differs from the rest

Ex

High-grade glioma is an aggressive type of brain cancer with a low long-term survival rate. Cannabinoids, chemical compounds found in cannabis, are thought to inhibit glioma cell growth. Researchers transplanted glioma cells into otherwise-healthy mice, and then randomly assigned these mice to 4 cancer treatments: irradiation alone, cannabinoids alone, irradiation combined with cannabinoids, or no treatment. The treatments were administered for 21 days, after which the glioma tumor volume (in cubic millimeters) was assessed in each mouse using brain imaging.

The Test Stat(ANOVA F)

F = variation among group means / variation among individuals in the same group

F = mean squares for groups / mean squares for error

Numerator: MSG

Let x_bar represent the overall sample mean

MSG = [n1(x_bar1-x_bar)^2 + … + nk(x_bark-x_bar)^2] / (k-1)

Denominator: MSE

MSE = [(n1-1)s1^2 + … + (nk-1)sk^2] / (Ntotal - k)

F = MSG/MSE

If the stat is high then there is relatively more variation among group means than there is among individuals in the same group

If the stat is less than 1, then there is more variation across individuals in the same group than there is among groups
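A sketch of MSG, MSE and F computed by hand and compared with aov(). The data frame `dat` and its 4 groups are simulated stand-ins, not the tumor data, and `pf()` is the base R function for the F distribution.

```r
# Hypothetical data: 4 groups of 6 (simulated)
set.seed(6)
dat <- data.frame(
  treatment = rep(c("A", "B", "C", "D"), each = 6),
  y = rnorm(24, mean = rep(c(10, 12, 9, 15), each = 6), sd = 2)
)

k <- length(unique(dat$treatment))   # number of groups
N <- nrow(dat)                       # total sample size
grand_mean <- mean(dat$y)            # overall sample mean x_bar

n_i <- tapply(dat$y, dat$treatment, length)
mean_i <- tapply(dat$y, dat$treatment, mean)
var_i <- tapply(dat$y, dat$treatment, var)

msg <- sum(n_i * (mean_i - grand_mean)^2) / (k - 1)   # variation among group means
mse <- sum((n_i - 1) * var_i) / (N - k)               # variation within groups
f_stat <- msg / mse
p_value <- pf(f_stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

fit <- aov(y ~ treatment, data = dat)
tab <- summary(fit)[[1]]             # ANOVA table: Df, Sum Sq, Mean Sq, F, p
```

The first row of `tab` is the grouping variable and the second is the residual (error) row.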

Anova in R

aov()

cancer_anova <- aov(formula = tumor_volume ~ treatment, data = cancer_data)

Use tidy() to display the results

df displays the numerator and denom degrees of freedom

sumsq displays the sum of squares for groups and sum of squares for error, meansq displays the MSG and MSE respectively

statistic is the F test stat

p.value is the p value

Finding which group is different from the rest

TUKEY’s honestly significant differences (HSD)

This test maintains a 5% experimentwise error rate

The error rate is 5% overall no matter how many tests we do

diffs <- TukeyHSD(cancer_anova, conf.level = 0.95) %>% tidy()

Each row in the table corresponds to a pairwise test

Note that when you have an adjusted test, you cannot use the CI to infer the value of the p-value

Conditions for ANOVA

1: independent SRSs, one from each of k populations

The most important assumption, bc inference from this method depends on having a random sample

2: Each of the populations has a Normal distribution w an unknown mean

This is less necessary

The ANOVA test is robust to non-Normality

Normality of the sample means is more important

If the sample size is small, 4-5 indiv per group, then need data that is roughly symmetric with no outliers

3: All of the populations have the same sd whose value is unknown

Hardest to satisfy and check

If this is not satisfied it is usually OK

Use group_by() and summarize() to calculate the sample SDs to see if they are similar and indicative that the population parameters are too

Rule of Thumb

Want the largest sample SD to be less than 2x the smallest one

No_Chapter Bootstrap Confidence Intervals

We will use this to find CI when data is not Normally distributed

Can also find the CI for the median, a quartile, or some other parameter

This method takes repeated samples with replacement from our sample

Steps

1. Find the median of the original sample. Denote this as m

2. Resample with replacement from the original sample a new sample of the same size (54 in this example)

3. Calculate the median based on resample #1. Call this median m1*

4. Resample again, calculate the median based on resample #2. Call this median m2*. Repeat this thousands of times

5. Make a histogram of all the m*. This histogram will approximate the sampling distribution for the median

6. Calculate the bounds such that the middle 95% of the m* values are between the lower and upper bounds. In R:

quantile(sample_median, 0.025) and

quantile(sample_median, 0.975)
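The six steps can be sketched in base R as below. The original sample is simulated here (a skewed sample of size 54), and `original` and `n_boot` are made-up names; only `sample_median` matches the name used above.

```r
# Hypothetical skewed sample of size 54 (simulated)
set.seed(7)
original <- rexp(54, rate = 1 / 30)

m <- median(original)                 # step 1: median of the original sample

n_boot <- 5000                        # number of resamples
sample_median <- replicate(
  n_boot,
  median(sample(original, size = length(original), replace = TRUE))  # steps 2-4
)

# step 5 would be hist(sample_median)

# step 6: middle 95% of the bootstrap medians
ci <- quantile(sample_median, probs = c(0.025, 0.975))
```

Swapping `median` for another function (a quartile, a mean, a SD) gives a bootstrap CI for that statistic instead.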

When to use Bootstrap

When we do not have a nice formula to calculate the CI or do not know what the formula is

When the underlying assumptions of the large sample formulas are not satisfied

We can make bootstrap CIs around any statistic we’ve learnt about

No_chapter Permutation Tests

Permutation tests are used when we do not have a large enough sample and/or our data is not from an SRS

Like bootstrapping but for hypothesis testing

Ex

Background: Malaria and alcohol consumption both represent major public health problems. Alcohol consumption is rising in developing countries and, as efforts to manage malaria are expanded, understanding the links between malaria and alcohol consumption becomes crucial. Our aim was to ascertain the effect of beer consumption on human attractiveness to malaria mosquitoes in semi field conditions in Burkina Faso

Volunteers are randomly assigned to beer or water

We COULD use a t test to determine if there is a difference in Mosquito attraction between the drinkers, OR we could mix up the labels and recompute the difference between drinkers

Permutation requires you to load library(infer)

We’ll use specify(), hypothesize(), generate() and calculate() from infer

library(infer)

null_distn <- mosq_data %>%

specify(response = num_mosquitos, explanatory = treatment) %>%

hypothesize(null = "independence") %>%

generate(reps = 1000, type = "permute") %>%

calculate(stat = "diff in means", order = c("beer", "water"))

head(null_distn)

Use get_pvalue() to get the p value

null_distn %>% get_pvalue(obs_stat = 23.6-19.22, direction = "two_sided")

If the null is true then the distribution of the response variable is the same for each level of the explanatory, so it should look the same after shuffling
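What the shuffling does can be shown without infer in a few lines of base R. Everything here is an assumption for illustration: the counts are simulated, the group sizes are invented, and the two sided p value is computed as the share of shuffled differences at least as extreme as the observed one, which may differ slightly from how infer computes it.

```r
# Hypothetical data (simulated), not the study's measurements
set.seed(8)
treatment <- rep(c("beer", "water"), times = c(25, 18))
num_mosquitos <- c(rpois(25, lambda = 24), rpois(18, lambda = 19))

# Difference in means, beer minus water
diff_in_means <- function(y, g) mean(y[g == "beer"]) - mean(y[g == "water"])

obs_stat <- diff_in_means(num_mosquitos, treatment)

# Shuffle the labels many times and recompute the difference each time
reps <- 1000
null_stats <- replicate(reps, diff_in_means(num_mosquitos, sample(treatment)))

# Two sided p value: how often a shuffled difference is as extreme as observed
p_value <- mean(abs(null_stats) >= abs(obs_stat))
```

A histogram of `null_stats` with a vertical line at `obs_stat` shows where the observed difference falls in the null distribution.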

Bonus_chapter Regression model with cat exposure

CODE TO KNOW

qt, pt, qnorm, pnorm, pchisq

Testing functions: t.test, binom.test, prop.test, chisq.test

broom: tidy, glance, augment; also lm, predict, confint, aov, TukeyHSD

Interpret

ggplot2, dplyr, what does R code do <-

Check Inference Formulas PDF

Example:

A study is designed to test whether there is a difference in mean daily calcium intake in adults with normal bone density, adults with osteopenia (a low bone density which may lead to osteoporosis) and adults with osteoporosis. Adults 60 years of age with normal bone density, osteopenia and osteoporosis are selected at random from hospital records and invited to participate in the study. Each participant’s daily calcium intake is measured based on reported food intake and supplements. The data are shown below.