Using the Aggregate Command in R
If you use SQL, you are used to using a “select command” with “group by” in order to get calculations per group. For example, if you wanted to see the mean of some continuous variable in men vs. women, in SQL, you’d do a “select average” and you would group by gender.
In R, this type of analysis is handled by the aggregate command. This command has a “fun” characteristic – not fun, like “enjoyment” – fun like “function”. The function option in aggregate changes what you are calculating for each group.
Example of aggregate: FUN = Mean
Here is some example code. We are using an analytic file called “analytic” which consists of all the surveys from veterans in the 2014 Behavioral Risk Factor Surveillance System (BRFSS) data.
We are also using these two variables:
- SLEPTIM2 – this is a measure of sleep duration. It’s the answer to the question, “On average, how many hours of sleep do you get in a 24-hour period?”
- SEX – this is a measure of gender. It is measured as 1=men and 2=women.
The following code produces the mean SLEPTIM2 for each group in SEX:
aggregate(analytic$SLEPTIM2, by=list(analytic$SEX), FUN=mean, na.rm=TRUE)
Note the aggregate command construction:
- First, we state the continuous variable for which we are generating calculations, which is analytic$SLEPTIM2.
- In the next line, we say by=list, and put all the categories we want to aggregate by in the list. This is a simple example – we are just putting one variable, analytic$SEX, in there – but you could put more variables in there and get output about mean SLEPTIM2 by many other categorical variables.
- The FUN=mean line is the most important. In this case, we chose mean. We could choose other functions – median, minimum, maximum are a few.
- Finally, the na.rm=TRUE option tells R to ignore the missings (“remove NA’s”) from the calculation. SAS does this automatically with means, but you have to tell R to do it explicitly or you’ll get an error.
Here is the output produced by this code:
Group.1 x 1 1 7.140360 2 2 6.863178
For Group.1, R means the group SEX, with 1 being men and 2 being women. For x, in this case, R means the mean. So the mean sleep duration for men is 7.14, and for women, is 6.86.
Example of aggregate: FUN = Standard Deviation
Another metric we need often in health data analytics is standard deviation, or SD. We can get that out of the aggregate command by changing the function option to FUN=sd.
aggregate(analytic$SLEPTIM2, by=list(analytic$SEX), FUN=sd, na.rm=TRUE)
Group.1 x 1 1 1.462665 2 2 1.505267
This time, the x refers to the standard deviations in each group.
Yes, I agree it is kind of ugly. But like with SQL, you can post-process this output to make it look prettier. You can, for example, turn these results into a table using the arrow command. Then, you can do modifications and operations on that table to make it more presentable.
But if you are in a hurry, and you need a quick and dirty mean or standard deviation by group, aggregate’s your command!
Photograph of mice by Kapa65, available here.