Grouped summaries are powerful for exploring and wrangling data. The groups are naturally defined by categorical data with summaries like mean(), max(), and min() used on numeric data. But what if you want to summarize categorical data? This is an important question because text data is categorical.
Let’s look again at review_data. Another benefit of a tibble, the type of data frame shared across the tidyverse and related packages, is that each column has both a name and a type clearly listed. Here you can see that the stars column is of type dbl or double, meaning the column is filled with numeric data. Another column type you’ve probably seen is int or integer, which holds whole numbers only. You can also see that both the product and review columns are of type chr or character, which is one way that categorical data is stored.
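As a rough sketch of what those column types look like in code, the snippet below builds a tiny stand-in for review_data. The name toy_reviews, the product labels, and every value in it are made up for illustration; only the column names product, stars, and review come from the lesson.

```r
library(dplyr)

# Hypothetical stand-in for review_data (not the real data)
toy_reviews <- tibble(
  product = c("Roomba 650", "Roomba 880", "Roomba 880"),
  stars   = c(5, 4, 1),
  review  = c("Works well", "Great on carpet", "Stopped charging")
)

# Printing a tibble lists each column's type (<chr>, <dbl>) under its name
print(toy_reviews)

# The same information as a named character vector
col_types <- vapply(toy_reviews, typeof, character(1))
```

Here col_types should report "character" for product and review and "double" for stars.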
The basic summary of categorical data is a count. We can get a count of categorical data by summarizing with a function simply called n(), which computes the number of rows in the current group. Here we are summarizing the number of rows for a single group, the entire dataset, so we get a count of 1,833. Note that n() takes no arguments; it simply counts the rows of the data we’ve piped into summarize().
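A minimal sketch of this ungrouped count, assuming a small made-up tibble in place of review_data. The name toy_reviews, its values, and the output column name number_rows are all hypothetical; the real data has 1,833 rows, while this toy version has five.

```r
library(dplyr)

# Hypothetical stand-in for review_data (the real data has 1,833 rows)
toy_reviews <- tibble(
  product = c("Roomba 650", "Roomba 880", "Roomba 880", "Roomba 650", "Roomba 880"),
  stars   = c(5, 4, 1, 3, 5),
  review  = c("Works well", "Great on carpet", "Stopped charging", "Fine", "Love it")
)

# n() takes no arguments; with no grouping, the whole tibble is one group
total <- toy_reviews %>%
  summarize(number_rows = n())

total
```

The result is a one-row tibble whose single value is the number of rows piped into summarize().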
Now if we combine n() with group_by(), we can easily count the number of rows for each product. Here we see there are about twice as many rows, or reviews, for the 880 Roomba model as for the 650 model.
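The grouped version might look like the sketch below. It again uses a made-up toy_reviews tibble with hypothetical product labels and values, so the counts will not match the real review_data.

```r
library(dplyr)

# Hypothetical stand-in for review_data
toy_reviews <- tibble(
  product = c("Roomba 650", "Roomba 880", "Roomba 880", "Roomba 650", "Roomba 880"),
  stars   = c(5, 4, 1, 3, 5),
  review  = c("Works well", "Great on carpet", "Stopped charging", "Fine", "Love it")
)

# group_by() defines one group per product; n() then counts rows within each group
per_product <- toy_reviews %>%
  group_by(product) %>%
  summarize(number_rows = n())

per_product
```

The output has one row per product, with number_rows holding that product's review count.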
If that seems like a lot of work to just get a count, you’re right. Instead of a grouped summary using n(), we can use count(). Here we count the number of rows for each product directly. You can see that this is identical to the grouped summary using n(). The one difference is that the column with the actual counts is named n by default, a reference to the fact that the n() function is being used by count() in the background.
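To make the comparison concrete, here is a small sketch that runs both approaches on the same made-up data. As before, toy_reviews and its contents are hypothetical; count() and its default output column n are standard dplyr behavior.

```r
library(dplyr)

# Hypothetical stand-in for review_data
toy_reviews <- tibble(
  product = c("Roomba 650", "Roomba 880", "Roomba 880", "Roomba 650", "Roomba 880"),
  stars   = c(5, 4, 1, 3, 5),
  review  = c("Works well", "Great on carpet", "Stopped charging", "Fine", "Love it")
)

# Grouped summary, naming the count column n to match count()'s default
grouped <- toy_reviews %>%
  group_by(product) %>%
  summarize(n = n())

# The shortcut: count() groups by product and counts rows in one step
counted <- toy_reviews %>%
  count(product)

counted
```

Both results should contain the same products and the same counts, with the count column named n.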
While this two-row data frame is easy enough to read, we often want to arrange() the output by a certain value. By default, arrange() sorts in ascending order. To get descending order, we wrap the column name in the desc() helper function inside the call to arrange(). Here we are arranging the data frame in descending order by n, the count of rows for each product as output by the count() function.
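A sketch of sorting the counts, once more on a made-up toy_reviews tibble (hypothetical labels and values). Note that desc() goes around the column name n, inside arrange().

```r
library(dplyr)

# Hypothetical stand-in for review_data
toy_reviews <- tibble(
  product = c("Roomba 650", "Roomba 880", "Roomba 880", "Roomba 650", "Roomba 880"),
  stars   = c(5, 4, 1, 3, 5),
  review  = c("Works well", "Great on carpet", "Stopped charging", "Fine", "Love it")
)

# Count reviews per product, then sort so the largest count comes first
sorted_counts <- toy_reviews %>%
  count(product) %>%
  arrange(desc(n))

sorted_counts
```

With this toy data, the product with three reviews is listed ahead of the one with two.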
R packages are full of functions like the count() verb that automate more complicated code. These types of functions are called wrappers, because they are wrapped around other functions. Let’s practice counting categorical data.