In this post you will learn what a T-test is and how you can perform it in R. First you will see a simple function that runs the test with just one line of code. We will then explore the intuition behind the test and build it up step by step with data about the Titanic passengers. Enjoy the read!
1. What is a T-Test?
A T-test is a statistical procedure used to check whether the difference between the means of two groups is statistically significant or merely due to chance. In this post we will look at data on the Titanic passengers, divided into men and women. Suppose we want to test the hypothesis that men and women had the same average age. If our data show that women were on average 2 years younger than men, we have to ask ourselves: is this a real difference, or could it have happened by chance? The T-test helps us answer this question.
2. Why is a T-test important?
A T-test is important when we want to draw conclusions about a population based on a sample. For example, imagine that we study the demographics of ship passengers at the beginning of the twentieth century and want to use the Titanic sample to generalize findings to a wider population of passengers.
Of course, such conclusions can be biased because Titanic passengers may not perfectly represent all ship passengers of that time. Nevertheless, the sample can still offer valuable insights, as long as the context of both the sample and the population is carefully considered and clearly explained.
3. The Titanic passengers
We're going to use the titanic R library to access data about the Titanic passengers. In particular, we will work with the subset of passengers in the titanic_train dataset. Below you will find the code to load the data, compute the mean and standard deviation of age for men and women, and count how many passengers of each sex there are.
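library(titanic)
library(dplyr)    # %>%, select(), group_by(), summarize()
library(ggplot2)  # used for the plots further down
data('titanic_train')
# Keep only sex and age, and drop passengers with a missing age
df <- titanic_train %>%
  select(Sex, Age) %>%
  na.omit()
# Mean age, standard deviation and group size per sex
df %>% group_by(Sex) %>%
  summarize(mean(Age), sd(Age), n())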
| Sex | mean(Age) | sd(Age) | n() |
|---|---|---|---|
| female | 27.9 | 14.1 | 261 |
| male | 30.7 | 14.7 | 453 |
We can see that there is a difference of 2.8 years between the average age of men and women on the Titanic. Below you can also check the distribution of ages.
# One density curve of age per sex
ggplot(df, aes(x = Age, color = Sex)) +
  geom_density(size = 0.7) +
  scale_color_discrete("") +
  xlab("Age") +
  ylab("Density")
The two distributions do look very similar. In this case, our best option is to perform a T-test to see whether the mean ages really differ.
4. T-test in R
The T-test can be performed very simply in R with a function called t.test, whose first argument is a formula. In our case we would like to know how age varies between the sexes. Thomas Leeper wrote a very clear explanation of formulas on this page. What matters for us is that the formula consists of a dependent variable on the left (Age), followed by "~" and one or more independent variables on the right (Sex). The second argument is simply the data frame with the data that we want to test. This test assumes that the two samples are independent and that age is approximately normally distributed, which looks reasonable given the density plot above.
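Before running it on the Titanic data, the formula interface can be tried on a dataset that ships with R. The sketch below uses the built-in sleep data (not part of this post's analysis) and treats its two groups as independent purely for illustration, although those measurements are really paired. It also shows that t.test returns an object whose parts can be extracted by name.

```r
# `sleep` ships with base R: `extra` is the numeric outcome,
# `group` is a two-level factor. Assumption for illustration only:
# we ignore the pairing and treat the groups as independent.
res <- t.test(extra ~ group, data = sleep)

res$estimate   # mean of the outcome in each group
res$conf.int   # 95% confidence interval for the difference in means
res$p.value    # two-sided p-value

# By default t.test does not assume equal variances (Welch's test).
res$method
```

The same pattern applies to any data frame with one numeric column and one two-level grouping column.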
t.test(Age ~ Sex, data = df)
How to interpret these results?
- The p-value of 0.0118 means that if there were really no difference in average age between male and female passengers (that is, if the null hypothesis were true), there would only be a 1.18% chance of observing a difference as large as the one we found, or larger. Because this p-value is less than 0.05, we reject the null hypothesis at the 5% significance level (95% confidence level), suggesting that there is a real difference. However, if we had chosen a 99% confidence level, we would not reject the null hypothesis, because the p-value is greater than 0.01.
- Our confidence interval tells us that if we drew many samples like ours and built an interval from each one, about 95% of those intervals would contain the true difference between the means. Our interval runs from -5 to -0.62 (female mean minus male mean). It does not include 0, which is why we reject the null hypothesis and conclude that there is a difference between the average age of men and women.
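To see where these numbers come from, here is a minimal sketch that rebuilds the test by hand from the summary table in section 3. It assumes the unequal-variance (Welch) form that t.test uses by default, and it starts from the rounded table values, so the results will only approximately match the p-value of 0.0118 and the interval from -5 to -0.62 reported above.

```r
# Rounded summary statistics copied from the table in section 3
mean_f <- 27.9; sd_f <- 14.1; n_f <- 261   # female
mean_m <- 30.7; sd_m <- 14.7; n_m <- 453   # male

# Squared standard error of each group mean
se2_f <- sd_f^2 / n_f
se2_m <- sd_m^2 / n_m

# Standard error of the difference and the t statistic
se_diff <- sqrt(se2_f + se2_m)
t_stat  <- (mean_f - mean_m) / se_diff

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (se2_f + se2_m)^2 /
  (se2_f^2 / (n_f - 1) + se2_m^2 / (n_m - 1))

# Two-sided p-value and 95% confidence interval
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

t_stat; df_welch; p_value; ci
```

The t statistic is simply the observed difference divided by its standard error, which is the quantity the bootstrap section estimates by resampling.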
5. T-test with bootstrap
Doing the T-test with the bootstrap is a good way to understand the concepts needed to interpret the results of the T-test above. Everything is based on the central limit theorem, according to which, if I draw many samples from a population and compute the mean of each sample, then the distribution of all these means will:
(i) approximately follow a normal distribution;
(ii) have a mean that approaches the population mean;
(iii) have a standard deviation that is called the standard error.
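These three properties can be checked with a small simulation. The sketch below is not from the Titanic analysis: it assumes a deliberately skewed population (an exponential distribution with mean 1 and standard deviation 1) and arbitrary choices of sample size and number of repetitions.

```r
set.seed(42)

n    <- 50     # size of each sample (arbitrary choice)
reps <- 5000   # number of samples drawn (arbitrary choice)

# Draw `reps` samples from a skewed population and keep each sample mean
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean(sample_means)   # should be close to the population mean, 1
sd(sample_means)     # the standard error, estimated by simulation
1 / sqrt(n)          # theoretical standard error: sigma / sqrt(n)

# The histogram of the means looks roughly bell-shaped
# even though the population itself is strongly skewed
hist(sample_means, breaks = 40, main = "", xlab = "Sample mean")
```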
In our example we have one sample of passengers. Imagine that we could collect many such samples. If we could do that, the mean of all the sample means would approach the population parameter. The bootstrap is a technique for creating as many samples as we want from our single sample. In our example we have 714 ages after removing the NAs (261 women and 453 men). We can draw 714 observations from these values with replacement, so that values are allowed to repeat. That is the basic idea behind bootstrapping.
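Sampling with replacement is easiest to see on a tiny vector. The ages below are made-up toy values, used only to show the mechanics with base R's sample function.

```r
set.seed(1)

# Hypothetical toy ages, not taken from the Titanic data
ages <- c(22, 38, 26, 35, 54, 2, 27, 14)

# One bootstrap sample: same size as the original, drawn with replacement,
# so some values typically appear more than once and others not at all
boot_sample <- sample(ages, size = length(ages), replace = TRUE)

boot_sample
mean(ages)          # mean of the original sample
mean(boot_sample)   # mean of this bootstrap sample
```

Repeating this many times and recording the statistic of interest each time gives an approximation of its sampling distribution.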
To carry out this procedure, we will write a function that resamples our data frame. The first line of code uses slice_sample to draw n random rows from our data frame with replacement, so that the same row can be chosen more than once. Note that n is the number of rows of the data frame. Then we use dplyr to calculate the mean age per sex. Note that we are actually interested in the difference between the male mean and the female mean. That is what the last three lines of the function compute.
diff_means <- function(data) {
  # Resample all rows with replacement (one bootstrap sample)
  sample_df <- data %>% slice_sample(n = nrow(data), replace = TRUE)
  # Mean age per sex in the bootstrap sample
  means <- sample_df %>%
    group_by(Sex) %>%
    summarize(mean_age = mean(Age, na.rm = TRUE))
  male_mean <- means %>% filter(Sex == "male") %>% pull(mean_age)
  female_mean <- means %>% filter(Sex == "female") %>% pull(mean_age)
  return(male_mean - female_mean)
}
Now we can use the replicate function to run our procedure N times. For our purposes, 1000 times is sufficient. Note that replicate works like a loop. Before we do that, let's make a small adjustment so that we can also calculate our p-value. The p-value assumes that the null hypothesis is true. So, before resampling our data, we make the difference between the means equal to 0. To do that, we subtract the observed difference, 2.81, from the ages of all men.
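# Impose the null hypothesis: shift male ages so both groups have the same mean
df_null <- df %>%
  mutate(Age = ifelse(Sex == "male", Age - 2.81, Age))
set.seed(1308)
# 1000 bootstrap differences (male mean - female mean) under the null
diffs <- replicate(1000, diff_means(df_null))
sd(diffs)
mean(diffs)
# Histogram of the bootstrap differences, with the observed difference in red
ggplot() +
  geom_histogram(aes(x = diffs), color = "white", fill = "#2E3031") +
  geom_vline(xintercept = -2.8, color = "#A33F3F") +
  geom_vline(xintercept = 2.8, color = "#A33F3F") +
  xlab("Age Differences (Null Hypothesis)") +
  ylab("Number of Bootstrap Samples") +
  theme_bw()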
If we run the commands above, we find that the mean of the sampling distribution (as the distribution of a sample statistic is called) is approximately 0, as expected, and that its standard deviation, the standard error, is 1.1.
The histogram above shows what the differences between sample means would look like if the null hypothesis were true. The red lines mark the difference that we actually observed. Do you think we would be likely to observe what we observed under the null hypothesis? We would not, and you can calculate how unlikely it is with the code below:
sum(diffs>=2.81)/1000
sum(diffs<=-2.81)/1000
The code calculates the proportion of samples whose difference in means was at least as extreme as 2.81 (male age minus female age) or -2.81 (female age minus male age). Together this gives 9 samples out of 1,000, or 0.9%. This estimate is close to the p-value found with the R function t.test. Again, we can reject the null hypothesis and conclude that there is a difference between the average age of men and women.
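The whole procedure can also be sketched in base R without any packages. The example below uses simulated ages (hypothetical values, not the Titanic data) and, unlike diff_means above, resamples within each group so that the group sizes stay fixed. That is a design choice of this sketch, not the method used in this post.

```r
set.seed(2024)

# Hypothetical simulated ages, NOT the Titanic data
male   <- rnorm(200, mean = 31, sd = 14)
female <- rnorm(150, mean = 28, sd = 14)

obs_diff <- mean(male) - mean(female)

# Impose the null hypothesis: shift the male ages so both means are equal
male_null <- male - obs_diff

# One bootstrap replicate: resample each group with replacement
boot_diff <- function() {
  mean(sample(male_null, replace = TRUE)) -
    mean(sample(female, replace = TRUE))
}

diffs <- replicate(2000, boot_diff())

# Two-sided p-value: share of replicates at least as extreme as observed
p_boot  <- mean(abs(diffs) >= abs(obs_diff))
p_welch <- t.test(male, female)$p.value

c(bootstrap = p_boot, welch = p_welch)
```

Taking the mean of a logical vector gives the proportion of TRUE values, so mean(abs(diffs) >= abs(obs_diff)) is equivalent to adding the two one-sided proportions computed separately.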
In addition to helping us better understand the test, the bootstrap method has the advantage that it does not assume that age follows a normal distribution.
Use the comments below if there is a specific point of this post that you did not understand or if you have a suggestion to improve it.


