Chapter 3 Transforming Data

{dplyr} basics

The book R for Data Science is a very helpful reference guide. Chapter 5 covers many of the topics covered in this section, and may be useful as a resource later or to dive deeper into a topic.

Topics:

  • Filter rows
    • Comparisons
    • Logical operators
    • Missing values
  • Arrange rows
  • Select columns
  • Add new variables
  • Grouped summaries
    • Combining multiple operations with the pipe
    • Missing values
    • Counts
    • Useful summary functions
    • Grouping by multiple variables
    • Ungrouping
  • Grouped mutates (and filters)
  • Extra: Working with factor variables

3.1 An example

The following example ties several of the above ideas together. Imagine that we have a set of grades from a course that we would like to convert to letter grades using a particular weighting and letter-grade cut-offs.

url <- "https://raw.githubusercontent.com/haleyjeppson/R4QM/main/data/CourseData.txt"
students <- readr::read_table(url)[, -1]
students
## # A tibble: 70 × 6
##    College  Year  Hmwk    E1    E2    E3
##    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 NU          1  83.8    79    89    59
##  2 NU          1  77.1    60    97    63
##  3 NU          1  94.5    83    88    69
##  4 NU          1  84.3    91    77    72
##  5 NU          1  73.4    68    79    76
##  6 NU          1  94.7    89    96    79
##  7 NU          1  77.7    76    43    84
##  8 NU          1  92.9    92    81    84
##  9 NU          1  95.9    84    85    84
## 10 NU          1  98.2    99    84    85
## # … with 60 more rows
head(students)
## # A tibble: 6 × 6
##   College  Year  Hmwk    E1    E2    E3
##   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NU          1  83.8    79    89    59
## 2 NU          1  77.1    60    97    63
## 3 NU          1  94.5    83    88    69
## 4 NU          1  84.3    91    77    72
## 5 NU          1  73.4    68    79    76
## 6 NU          1  94.7    89    96    79
tail(students)
## # A tibble: 6 × 6
##   College  Year  Hmwk    E1    E2    E3
##   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MC          2  83.2    69    77    93
## 2 MC          2  95.7    88    89    92
## 3 MC          2  79.3    80    61    97
## 4 MC          2  78.1    69    89    97
## 5 MC          2  99.1    84    89    96
## 6 MC          2  95.8    96   100   100

In this particular case we want the weighting to be:

  • 20% to Hmwk (col 3)
  • 25% to E1 (col 4)
  • 25% to E2 (col 5)
  • 30% to E3 (col 6)

But those weights could change later. A function can be written to take the data, and the weights and calculate the weighted average. We can then add that score to the students data frame, and add a final column with the actual letter grades on a 90-80-70-60 scale.

students <- students %>% 
  mutate(Final = .2*Hmwk + .25*E1 + .25*E2 + .3*E3,
         Grade = case_when(
           Final < 60 ~ "F",
           Final < 69 ~ "D",
           Final < 79 ~ "C",
           Final < 89 ~ "B",
           Final >= 90 ~ "A"))

head(students)
## # A tibble: 6 × 8
##   College  Year  Hmwk    E1    E2    E3 Final Grade
##   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 NU          1  83.8    79    89    59  76.5 C    
## 2 NU          1  77.1    60    97    63  73.6 C    
## 3 NU          1  94.5    83    88    69  82.4 B    
## 4 NU          1  84.3    91    77    72  80.5 B    
## 5 NU          1  73.4    68    79    76  74.2 C    
## 6 NU          1  94.7    89    96    79  88.9 B