Parts 1 & 2
OHSU OCTRI-BERD
2024-09-19
Open Google Doc https://bit.ly/berd_r_intro_2024_doc
Download folder of files using link https://bit.ly/berd_intro_2024_files
Do you have a PC? Right-click the .zip file and select EXTRACT ALL
Do you have a Mac? Double-click the .zip file and it will open
These files are hosted on github https://github.com/OHSU-OCTRI-BERD/R_Intro_2024_09
Open the slides file (html) in a web browser: slides_R_berd_2024.html
or use the link https://bit.ly/berd_intro_2024_slides
Open RStudio by double-clicking on the file berd_r_intro_2024_09.Rproj
For the history and details: Wikipedia
R is a programming language
RStudio is an integrated development environment (IDE) = an interface to use R (with perks!)
Read more about RStudio’s layout in Section 3.4 of “Getting Used to R, RStudio, and R Markdown” (Ismay and Kennedy 2021)
.qmd file = code + text → html
.qmd files contain code + markdown syntax, which can be “rendered” to other formats (html, pdf, Word, etc.)
.qmd file
html output
Create a .qmd file
Two options:
Pop-up window selections:
HTML output format (default)
Engine: Knitr
Use visual markdown editor
Create
After selecting Create, you should then see the following in your editor window:
Save the .qmd file
File -> Save
We create the html file by rendering the .qmd file.
Two options:
Note
.qmd file
html output
.qmd file
html output
An empty code chunk looks like this:
Visual editor
Source editor
Important
Note that a code chunk starts with ```{r} and ends with ```. Make sure there is no space before ```.
Run Selected Line(s), or use the keyboard shortcut:
Mac | Command + Return |
PC | Ctrl + Enter |
Note
3 options to create a code chunk
Click on the insert code chunk button at the top right of the editor window, or
Keyboard shortcut
Mac | Command + Option + I |
PC | Ctrl + Alt + I |
Visual editor: Select Insert -> Executable Cell -> R
A good analogy for R packages is that they are like apps you can download onto a mobile phone:
Two options to install packages:
install.packages() or
tidyverse: collection of many commonly used packages, including readr, forcats, and ggplot2 (see list here)
For loading datasets:
For summarizing data:
For wrangling data:
Load packages with the library() command
Use the library() command to load each required package.
Note
Quotation marks are not needed around the package name in the library() function, but they are required when using the install.packages() function.
Use Excel to open the dataset toy_data.xlsx
that is located in the data folder.
Terminology
How many observations are in this dataset?
What are the variable types in this dataset?
We can import data from many file types, including .csv, .txt, .xlsx, plus SAS or Stata files
Once imported, R typically stores data as data frames, or as tibbles if using the tidyverse packages.
SAVE THE CODE after using the point & click option
Open the data folder and select toy_data.xlsx.
Choose the Import Dataset... option,
enter NA in the “NA:” box, and
click the Import button on the bottom right of the pop-up window.
Use the read_excel command from the readxl package to load the xlsx file.
The file is located in the data folder.
The output of the read_excel() function (the data) is saved into an object named toy_data.
<- is the assignment operator.
toy_data will be in your Environment tab.
toy_data dataset
R variable type | Description |
---|---|
dbl: double | numbers; also num or int for numbers or integers |
chr: character | text, “strings” |
fct: factor | categorical variables stored with levels (groups) |
lgl: logical | boolean (TRUE, FALSE) |
Rows: 20
Columns: 11
$ id <dbl> 335340, 638618, 922382, 923122, 923963, 925603,…
$ age <chr> "17 years old", "16 years old", "14 years old",…
$ sex <chr> "Female", "Female", "Male", "Male", "Male", "Ma…
$ grade <chr> "10th", "9th", "9th", "9th", "10th", "10th", "1…
$ race4 <chr> "White", NA, "White", "White", "Black or Africa…
$ bmi <dbl> 27.5671, 29.3495, 18.1827, 21.3754, 19.5988, 22…
$ weight_kg <dbl> 66.23, 84.82, 57.61, 60.33, 63.50, 70.31, 45.36…
$ text_while_driving_30d <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "0 days…
$ smoked_ever <chr> NA, "Yes", "Yes", "Yes", "No", "No", "Yes", "No…
$ bullied_past_12mo <lgl> NA, NA, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, …
$ height_m <dbl> 1.550000, 1.699999, 1.779999, 1.680001, 1.79999…
[1] 20 11
[1] 20
[1] 11
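As a minimal sketch of these first-look commands, the example below builds a small made-up tibble and inspects it. The object toy_example and its values are hypothetical stand-ins, not the workshop's toy_data.

```r
library(dplyr)  # re-exports tibble() and glimpse()

# Hypothetical mini dataset standing in for toy_data
toy_example <- tibble(
  id = c(335340, 638618, 922382),
  age = c("17 years old", "16 years old", "14 years old"),
  bullied_past_12mo = c(NA, FALSE, TRUE)
)

glimpse(toy_example)  # one line per variable, showing its type (dbl, chr, lgl)
dim(toy_example)      # number of rows, then number of columns
nrow(toy_example)     # number of rows (observations)
ncol(toy_example)     # number of columns (variables)
```

The same four calls can be applied to any data frame or tibble once it is loaded.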
Data from the CDC’s Youth Risk Behavior Surveillance System (YRBSS)
The data in yrbss_demo.xlsx are a subset of data in the R package yrbss, which includes YRBSS from 1991-2013.
Use R code to answer the following questions.
Open yrbss_demo.xlsx in the data folder to familiarize yourself with it.
Load yrbss_demo.xlsx. Make sure the name of the loaded data is yrbss_data.
... yrbss_data?
First look at the data:
Rows: 20,000
Columns: 8
$ record <dbl> 931897, 333862, 36253, 1095530, 1303997, 261619, 926649, 1309…
$ age <chr> "15 years old", "17 years old", "18 years old or older", "15 …
$ sex <chr> "Female", "Female", "Male", "Male", "Male", "Male", "Male", "…
$ grade <chr> "10th", "12th", "11th", "10th", "9th", "9th", "11th", "12th",…
$ race4 <chr> "White", "White", "Hispanic/Latino", "Black or African Americ…
$ race7 <chr> "White", "White", "Hispanic/Latino", "Black or African Americ…
$ bmi <dbl> 17.1790, 20.2487, NA, 27.9935, 24.4922, NA, 20.5435, 19.2555,…
$ stweight <dbl> 54.43, 57.15, NA, 85.73, 66.68, NA, 70.31, 58.97, 123.38, NA,…
Click on yrbss_data in the Data list of the Environment window.
YRBSS dataset
Or run the View command below within RStudio to open the yrbss_data dataset.
The chunk option #| eval: FALSE is used so that this code is not run while rendering the .qmd file.
First 20 rows shown below:
%>%
%>% is the pipe, which is used to string together commands.
The pipe is read “and then” and tells R the next function will be applied to the data on the left side of the pipe:
# A tibble: 6 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 931897 15 years old Female 10th White White 17.2 54.4
2 333862 17 years old Female 12th White White 20.2 57.2
3 36253 18 years old or older Male 11th Hispanic/Lati… Hisp… NA NA
4 1095530 15 years old Male 10th Black or Afri… Blac… 28.0 85.7
5 1303997 14 years old Male 9th All other rac… Mult… 24.5 66.7
6 261619 17 years old Male 9th All other rac… <NA> NA NA
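A small sketch of what the pipe does, assuming a made-up tibble named scores (hypothetical data): piping the data into head() gives the same result as passing the data as the first argument.

```r
library(dplyr)  # provides %>% and tibble()

# Hypothetical data for illustration only
scores <- tibble(
  id = 1:8,
  bmi = c(17.2, 20.2, NA, 28.0, 24.5, NA, 20.5, 19.3)
)

# "Take scores, and then show the first 3 rows"
piped <- scores %>% head(3)

# The same request written without the pipe
nested <- head(scores, 3)

identical(piped, nested)
```

The piped form reads left to right, which becomes more helpful as more steps are added.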
The rstatix package has a useful function called get_summary_stats() for quick statistics of quantitative data.
# A tibble: 2 × 10
variable n min max median iqr mean sd se ci
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bmi 13542 13.2 53.9 22.3 5.52 23.5 4.99 0.043 0.084
2 stweight 13443 27.7 181. 64.0 19.0 67.5 16.9 0.145 0.285
The type options are
type = c("full", "common", "robust", "five_number", "mean_sd", "mean_se", "mean_ci", "median_iqr", "median_mad", "quantile", "mean", "median", "min", "max")
Stratified summaries using the pipe (%>%)
Use the group_by() function with the pipe %>%:
data %>% group_by(...) %>% get_summary_stats(...)
With %>% we can string together many commands in the order we want them applied. Example:
first tell R what dataset to use,
then have R group (stratify) the data by sex,
and then calculate summary statistics
# A tibble: 4 × 11
sex variable n min max median iqr mean sd se ci
<chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Female bmi 6622 13.2 53.3 22.1 5.33 23.3 4.96 0.061 0.119
2 Female stweight 6542 27.7 181. 59.0 15.9 61.7 14.2 0.175 0.343
3 Male bmi 6920 13.2 53.9 22.5 5.51 23.7 5.01 0.06 0.118
4 Male stweight 6901 35.4 181. 69.4 20.0 73.1 17.4 0.209 0.41
Add variables to group_by() to stratify the summaries by more variables
# A tibble: 20 × 12
sex grade variable n min max median iqr mean sd se ci
<chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Female 10th bmi 1604 13.6 52.3 22.0 5.29 23.0 4.75 0.119 0.233
2 Female 10th stweight 1582 37.2 132. 58.1 15.0 61.1 13.2 0.332 0.652
3 Female 11th bmi 1640 13.6 52.4 22.3 5.41 23.4 4.78 0.118 0.232
4 Female 11th stweight 1624 36.3 161. 59.0 15.0 62.5 14.1 0.35 0.687
5 Female 12th bmi 1675 15.4 53.3 22.7 5.47 23.9 5.23 0.128 0.251
6 Female 12th stweight 1664 36.3 154. 59.9 15.4 63.4 14.8 0.363 0.713
7 Female 9th bmi 1655 13.2 51.2 21.7 5.26 22.8 4.95 0.122 0.239
8 Female 9th stweight 1625 27.7 181. 56.7 15.4 59.8 14.1 0.35 0.686
9 Female <NA> bmi 48 17.0 47.6 21.2 4.35 23.1 5.69 0.821 1.65
10 Female <NA> stweight 47 35.8 113. 56.2 13.2 60.3 14.4 2.10 4.23
11 Male 10th bmi 1739 13.7 51.5 22.4 5.38 23.5 4.77 0.114 0.224
12 Male 10th stweight 1734 40.8 163. 69.0 18.1 72.3 16.5 0.397 0.778
13 Male 11th bmi 1704 14.0 53.9 23.0 5.58 24.1 4.97 0.12 0.236
14 Male 11th stweight 1704 40.8 172. 70.8 20.4 75.2 17.2 0.416 0.815
15 Male 12th bmi 1611 14.4 52.3 23.1 5.79 24.5 5.15 0.128 0.252
16 Male 12th stweight 1611 43.1 181. 72.6 20.4 76.9 17.9 0.446 0.876
17 Male 9th bmi 1773 13.2 52.9 21.7 5.58 22.8 4.88 0.116 0.227
18 Male 9th stweight 1761 35.8 181. 65.8 19.0 68.2 16.4 0.391 0.768
19 Male <NA> bmi 93 13.9 53.4 21.7 5.86 23.7 6.83 0.708 1.41
20 Male <NA> stweight 91 35.4 167. 69.0 20.0 73.1 21.6 2.27 4.50
Add gt() at the end to make the table prettier.
The gt() command is from the gt package.
yrbss_data %>%
group_by(sex, grade) %>%
get_summary_stats(
bmi, stweight,
type = "common"
) %>%
gt()
sex | grade | variable | n | min | max | median | iqr | mean | sd | se | ci |
---|---|---|---|---|---|---|---|---|---|---|---|
Female | 10th | bmi | 1604 | 13.578 | 52.269 | 21.972 | 5.288 | 22.985 | 4.753 | 0.119 | 0.233 |
Female | 10th | stweight | 1582 | 37.200 | 131.540 | 58.060 | 14.970 | 61.071 | 13.216 | 0.332 | 0.652 |
Female | 11th | bmi | 1640 | 13.578 | 52.431 | 22.269 | 5.406 | 23.448 | 4.780 | 0.118 | 0.232 |
Female | 11th | stweight | 1624 | 36.290 | 160.570 | 58.970 | 14.970 | 62.536 | 14.116 | 0.350 | 0.687 |
Female | 12th | bmi | 1675 | 15.375 | 53.265 | 22.656 | 5.470 | 23.884 | 5.231 | 0.128 | 0.251 |
Female | 12th | stweight | 1664 | 36.290 | 154.220 | 59.880 | 15.420 | 63.434 | 14.822 | 0.363 | 0.713 |
Female | 9th | bmi | 1655 | 13.161 | 51.208 | 21.660 | 5.261 | 22.819 | 4.954 | 0.122 | 0.239 |
Female | 9th | stweight | 1625 | 27.670 | 180.990 | 56.700 | 15.420 | 59.802 | 14.103 | 0.350 | 0.686 |
Female | NA | bmi | 48 | 16.991 | 47.610 | 21.216 | 4.349 | 23.092 | 5.687 | 0.821 | 1.651 |
Female | NA | stweight | 47 | 35.830 | 113.400 | 56.250 | 13.155 | 60.261 | 14.413 | 2.102 | 4.232 |
Male | 10th | bmi | 1739 | 13.664 | 51.540 | 22.430 | 5.379 | 23.453 | 4.766 | 0.114 | 0.224 |
Male | 10th | stweight | 1734 | 40.820 | 163.300 | 68.950 | 18.140 | 72.318 | 16.513 | 0.397 | 0.778 |
Male | 11th | bmi | 1704 | 14.003 | 53.947 | 22.983 | 5.585 | 24.129 | 4.973 | 0.120 | 0.236 |
Male | 11th | stweight | 1704 | 40.820 | 172.370 | 70.760 | 20.420 | 75.238 | 17.158 | 0.416 | 0.815 |
Male | 12th | bmi | 1611 | 14.397 | 52.284 | 23.102 | 5.789 | 24.470 | 5.150 | 0.128 | 0.252 |
Male | 12th | stweight | 1611 | 43.090 | 180.990 | 72.580 | 20.410 | 76.900 | 17.918 | 0.446 | 0.876 |
Male | 9th | bmi | 1773 | 13.194 | 52.882 | 21.671 | 5.579 | 22.809 | 4.883 | 0.116 | 0.227 |
Male | 9th | stweight | 1761 | 35.830 | 180.990 | 65.770 | 19.050 | 68.180 | 16.426 | 0.391 | 0.768 |
Male | NA | bmi | 93 | 13.896 | 53.432 | 21.698 | 5.862 | 23.744 | 6.827 | 0.708 | 1.406 |
Male | NA | stweight | 91 | 35.380 | 167.380 | 68.950 | 19.955 | 73.075 | 21.634 | 2.268 | 4.505 |
summarize function
Our previous OCTRI-BERD workshop, Introduction to R and RStudio for Exploratory Data Analysis: Part 2, includes data summary examples using the function summarize().
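For orientation, here is a minimal sketch of summarize() combined with group_by(). The tibble weights and the summary column names are made up for illustration; they are not taken from the earlier workshop.

```r
library(dplyr)

# Hypothetical data for illustration only
weights <- tibble(
  sex = c("Female", "Female", "Male", "Male", "Male"),
  stweight = c(54.4, 57.2, 85.7, 66.7, NA)
)

weight_summary <- weights %>%
  group_by(sex) %>%                               # stratify by sex
  summarize(
    n_nonmissing = sum(!is.na(stweight)),         # count of non-missing weights
    mean_weight = mean(stweight, na.rm = TRUE),   # na.rm = TRUE ignores NA values
    sd_weight = sd(stweight, na.rm = TRUE)
  )

weight_summary
```

Unlike get_summary_stats(), summarize() returns only the statistics you name, so each one has to be requested explicitly.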
Use the tabyl function from the janitor package (loaded above).
Default table:
age n percent valid_percent
12 years old or younger 137 0.00685 0.00692094
13 years old 96 0.00480 0.00484971
14 years old 2026 0.10130 0.10234908
15 years old 4290 0.21450 0.21672139
16 years old 4924 0.24620 0.24874968
17 years old 4988 0.24940 0.25198282
18 years old or older 3334 0.16670 0.16842637
<NA> 205 0.01025 NA
Make the output prettier using adorn_ options and adding on gt():
With adornments:
age n percent valid_percent
12 years old or younger 137 0.69% 0.69%
13 years old 96 0.48% 0.48%
14 years old 2026 10.13% 10.23%
15 years old 4290 21.45% 21.67%
16 years old 4924 24.62% 24.87%
17 years old 4988 24.94% 25.20%
18 years old or older 3334 16.67% 16.84%
<NA> 205 1.03% -
Total 20000 100.00% 100.00%
With adornments & gt():
yrbss_data %>%
tabyl(age) %>%
adorn_totals("row") %>%
adorn_pct_formatting(digits=2) %>%
gt() # from the gt package
age | n | percent | valid_percent |
---|---|---|---|
12 years old or younger | 137 | 0.69% | 0.69% |
13 years old | 96 | 0.48% | 0.48% |
14 years old | 2026 | 10.13% | 10.23% |
15 years old | 4290 | 21.45% | 21.67% |
16 years old | 4924 | 24.62% | 24.87% |
17 years old | 4988 | 24.94% | 25.20% |
18 years old or older | 3334 | 16.67% | 16.84% |
NA | 205 | 1.03% | - |
Total | 20000 | 100.00% | 100.00% |
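The one-way table workflow above can be sketched on a tiny made-up dataset. The tibble ages is hypothetical, and the sketch assumes the janitor package is installed.

```r
library(dplyr)
library(janitor)  # provides tabyl() and the adorn_ functions

# Hypothetical data for illustration only; note the missing value
ages <- tibble(
  age = c("14 years old", "15 years old", "15 years old", "16 years old", NA)
)

# Default table: counts, percent of all rows, and percent of non-missing rows
age_table <- ages %>% tabyl(age)

# Add a total row and format the proportions as percentages
age_table_pretty <- age_table %>%
  adorn_totals("row") %>%
  adorn_pct_formatting(digits = 2)

age_table
age_table_pretty
```

The valid_percent column appears because the variable contains a missing value.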
adorn_totals option
age/grade | 10th | 11th | 12th | 9th | NA_ | Total |
---|---|---|---|---|---|---|
12 years old or younger | 9 | 21 | 35 | 31 | 41 | 137 |
13 years old | 6 | 2 | 1 | 79 | 8 | 96 |
14 years old | 65 | 8 | 6 | 1922 | 25 | 2026 |
15 years old | 1909 | 104 | 22 | 2225 | 30 | 4290 |
16 years old | 2102 | 1899 | 157 | 717 | 49 | 4924 |
17 years old | 637 | 2178 | 1956 | 164 | 53 | 4988 |
18 years old or older | 163 | 670 | 2373 | 60 | 68 | 3334 |
NA | 16 | 9 | 27 | 21 | 132 | 205 |
Total | 4907 | 4891 | 4577 | 5219 | 406 | 20000 |
yrbss_data %>%
tabyl(age, grade) %>%
adorn_totals(c("row")) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits=0) %>%
adorn_ns() %>%
gt()
age | 10th | 11th | 12th | 9th | NA_ |
---|---|---|---|---|---|
12 years old or younger | 7% (9) | 15% (21) | 26% (35) | 23% (31) | 30% (41) |
13 years old | 6% (6) | 2% (2) | 1% (1) | 82% (79) | 8% (8) |
14 years old | 3% (65) | 0% (8) | 0% (6) | 95% (1,922) | 1% (25) |
15 years old | 44% (1,909) | 2% (104) | 1% (22) | 52% (2,225) | 1% (30) |
16 years old | 43% (2,102) | 39% (1,899) | 3% (157) | 15% (717) | 1% (49) |
17 years old | 13% (637) | 44% (2,178) | 39% (1,956) | 3% (164) | 1% (53) |
18 years old or older | 5% (163) | 20% (670) | 71% (2,373) | 2% (60) | 2% (68) |
NA | 8% (16) | 4% (9) | 13% (27) | 10% (21) | 64% (132) |
Total | 25% (4,907) | 24% (4,891) | 23% (4,577) | 26% (5,219) | 2% (406) |
We can also add a third variable, which creates two-way tables stratified by the third variable:
$Female
age 10th 11th 12th 9th NA_
12 years old or younger 3 5 14 15 21
13 years old 0 2 0 40 1
14 years old 30 4 2 963 8
15 years old 961 53 11 1070 9
16 years old 1016 957 81 311 13
17 years old 261 1055 1044 66 21
18 years old or older 57 285 1114 19 29
<NA> 4 4 11 8 24
$Male
age 10th 11th 12th 9th NA_
12 years old or younger 6 16 20 16 18
13 years old 4 0 1 38 6
14 years old 34 4 4 950 14
15 years old 942 51 10 1136 20
16 years old 1067 936 76 395 32
17 years old 369 1106 896 98 30
18 years old or older 105 379 1242 40 35
<NA> 12 4 14 11 40
$NA_
age 10th 11th 12th 9th NA_
12 years old or younger 0 0 1 0 2
13 years old 2 0 0 1 1
14 years old 1 0 0 9 3
15 years old 6 0 1 19 1
16 years old 19 6 0 11 4
17 years old 7 17 16 0 2
18 years old or older 1 6 17 1 4
<NA> 0 1 2 2 68
Make sure to scroll down in the output box above to see all of the output
tabyl and adorn_ options
See the tabyls vignette for more tabyl and adorn_ options:
https://cran.r-project.org/web/packages/janitor/vignettes/tabyls.html
Create a table of race4 and race7, with both percentages and counts. What is the difference between these two variables?
Create a table of race4 and race7, with both percentages and counts, stratified by grade.
ggplot2 package
The ggplot2 package gets loaded with the tidyverse, and thus we do not need to load it separately.
BMI:
horizontal boxplot (specify x = ...)
fill
Remove the NA category with drop_na() (we will talk about this in later slides)
Why use fill= here?
color (or colour) is used for lines and outlines
fill is used for interiors
alpha specifies the opacity / transparency
By default, barplots display counts (frequencies) on the vertical axis:
To show proportions, the code is more complicated.
ggplot(data = yrbss_data,
aes(x = age)) +
# specify aesthetics within the barplot to show proportions
geom_bar(aes(y = after_stat(prop), group = 1)) +
# Next line converts y-axis labels to percentages instead of proportions
scale_y_continuous(labels = scales::percent_format()) +
# "dodge" the x-axis labels
scale_x_discrete(guide = guide_axis(n.dodge = 2))
Stacked bars showing counts:
First, create a plot and save it as an R object:
Save the plot as a pdf (or “jpeg”, “tiff”, “png”, etc.)
You can specify the dpi when saving; many journals have dpi requirements.
See the ggsave webpage for more details.
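The two steps (save the plot as an object, then write it to a file) might look like the sketch below. The data frame bmi_example, the plot itself, and the output location (a temporary folder) are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical data for illustration only
bmi_example <- data.frame(
  sex = c("Female", "Male", "Female", "Male"),
  bmi = c(21.5, 23.1, 19.8, 25.4)
)

# Step 1: create a plot and save it as an R object
bmi_boxplot <- ggplot(bmi_example, aes(x = bmi, fill = sex)) +
  geom_boxplot()

# Step 2: save the plot; the file extension determines the format
out_file <- file.path(tempdir(), "bmi_boxplot.pdf")
ggsave(filename = out_file, plot = bmi_boxplot, width = 6, height = 4)

# For raster formats such as png, a dpi argument (e.g. dpi = 300) can be added
```

In a project you would typically replace the temporary path with a file name inside the project folder.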
naniar
https://allisonhorst.shinyapps.io/missingexplorer/
Load the naniar package if you haven’t already:
Total number of missing values in dataset:
Proportion of missing values in dataset:
Missingness by each variable:
fct = option
Example: Percent missingness stratified by age group
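A minimal sketch of the missingness summaries listed above, using a made-up data frame named survey (hypothetical) and assuming the naniar package is installed.

```r
library(naniar)

# Hypothetical data for illustration only
survey <- data.frame(
  age = c("15 years old", NA, "17 years old", "16 years old"),
  bmi = c(17.2, NA, NA, 28.0),
  stweight = c(54.4, 57.2, NA, 85.7)
)

total_missing <- n_miss(survey)        # total number of missing values
prop_missing <- prop_miss(survey)      # proportion of all cells that are missing
by_variable <- miss_var_summary(survey)  # missingness for each variable

total_missing
prop_missing
by_variable
```

These totals agree with the base R equivalents sum(is.na(survey)) and mean(is.na(survey)), which can serve as a cross-check.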
Create a histogram of the stweight variable. Describe the distribution shape.
What happens if you change x = to y = in the histogram code?
Plot stweight stratified by sex.
Visualize the missingness in toy_data using vis_miss(). What missingness patterns do you see?
arrange()
arrange() is a function that lets us sort a data.frame by a specified variable.
# A tibble: 20,000 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 30592 <NA> <NA> <NA> <NA> <NA> NA NA
2 30593 <NA> <NA> 9th Hispanic/Lat… Hisp… NA NA
3 30595 16 years old <NA> 10th White White NA NA
4 30599 12 years old or younger Male <NA> Hispanic/Lat… Hisp… NA NA
5 30601 12 years old or younger Male 11th White White NA NA
6 30604 12 years old or younger Male 9th Hispanic/Lat… Hisp… NA NA
7 30605 12 years old or younger Male 9th All other ra… Asian NA NA
8 30607 12 years old or younger Male 10th All other ra… Am I… NA NA
9 30608 12 years old or younger Male 9th All other ra… <NA> NA NA
10 30612 14 years old Male 9th All other ra… <NA> NA NA
# ℹ 19,990 more rows
To sort in descending order, use the desc() function:
# A tibble: 20,000 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1317069 18 years old or older Female 12th Black or Afr… Blac… 24.4 69.0
2 1317067 18 years old or older Female 12th White White 18.8 49.9
3 1317066 18 years old or older Female 12th White White 32.1 90.7
4 1317065 18 years old or older Female 12th White White 21.2 56.2
5 1317064 18 years old or older Female 12th White White 23.5 68.0
6 1317063 18 years old or older Female 12th White White 21.7 52.2
7 1317062 18 years old or older Female 12th White White 24.8 61.2
8 1317058 18 years old or older Female 11th Black or Afr… Blac… 17.7 49.9
9 1317057 18 years old or older Female 12th All other ra… Asian 19.8 44.4
10 1317056 18 years old or older Female 12th White White 21.7 61.2
# ℹ 19,990 more rows
Here we sort by age first, and then within age, we sort by record.
# A tibble: 20,000 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 30599 12 years old or younger Male <NA> Hispanic/La… Hisp… NA NA
2 30601 12 years old or younger Male 11th White White NA NA
3 30604 12 years old or younger Male 9th Hispanic/La… Hisp… NA NA
4 30605 12 years old or younger Male 9th All other r… Asian NA NA
5 30607 12 years old or younger Male 10th All other r… Am I… NA NA
6 30608 12 years old or younger Male 9th All other r… <NA> NA NA
7 36582 12 years old or younger Female 12th All other r… <NA> NA NA
8 36584 12 years old or younger Female <NA> Black or Af… Blac… NA NA
9 36585 12 years old or younger Female 12th All other r… Am I… NA NA
10 36586 12 years old or younger Female 9th Black or Af… Blac… NA NA
# ℹ 19,990 more rows
The order of the variables in arrange() matters!
# A tibble: 20,000 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 30592 <NA> <NA> <NA> <NA> <NA> NA NA
2 30593 <NA> <NA> 9th Hispanic/Lat… Hisp… NA NA
3 30595 16 years old <NA> 10th White White NA NA
4 30599 12 years old or younger Male <NA> Hispanic/Lat… Hisp… NA NA
5 30601 12 years old or younger Male 11th White White NA NA
6 30604 12 years old or younger Male 9th Hispanic/Lat… Hisp… NA NA
7 30605 12 years old or younger Male 9th All other ra… Asian NA NA
8 30607 12 years old or younger Male 10th All other ra… Am I… NA NA
9 30608 12 years old or younger Male 9th All other ra… <NA> NA NA
10 30612 14 years old Male 9th All other ra… <NA> NA NA
# ℹ 19,990 more rows
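The sorting variations shown in the outputs above can be sketched on a small made-up tibble (students is hypothetical data, not yrbss_data).

```r
library(dplyr)

# Hypothetical data for illustration only
students <- tibble(
  record = c(30604, 30592, 36582, 30599),
  age = c("14 years old", NA, "12 years old or younger", "12 years old or younger")
)

# Sort by one variable, ascending (the default)
by_record <- students %>% arrange(record)

# Sort by one variable, descending
by_record_desc <- students %>% arrange(desc(record))

# Sort by age first, then by record within each age; NA values are placed last
by_age_then_record <- students %>% arrange(age, record)

# Swapping the order changes the result: record is now the primary sort key
by_record_then_age <- students %>% arrange(record, age)

by_age_then_record
```

None of these calls change students itself; each result has to be assigned to a name to be kept.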
arrange() after count()-ing categorical data
When arrange() is useful
We used tabyl() to see what values exist in a categorical variable.
We can also use count() to count all of the unique values for a categorical variable.
Note the NA category, which is the special missing variable type:
The yrbss_data dataframe after arranging:
# A tibble: 6 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 931897 15 years old Female 10th White White 17.2 54.4
2 333862 17 years old Female 12th White White 20.2 57.2
3 36253 18 years old or older Male 11th Hispanic/Lati… Hisp… NA NA
4 1095530 15 years old Male 10th Black or Afri… Blac… 28.0 85.7
5 1303997 14 years old Male 9th All other rac… Mult… 24.5 66.7
6 261619 17 years old Male 9th All other rac… <NA> NA NA
To keep the sorted data, save it back to yrbss_data with the <- operator:
# A tibble: 20,000 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1317069 18 years old or older Female 12th Black or Afr… Blac… 24.4 69.0
2 1317067 18 years old or older Female 12th White White 18.8 49.9
3 1317066 18 years old or older Female 12th White White 32.1 90.7
4 1317065 18 years old or older Female 12th White White 21.2 56.2
5 1317064 18 years old or older Female 12th White White 23.5 68.0
6 1317063 18 years old or older Female 12th White White 21.7 52.2
7 1317062 18 years old or older Female 12th White White 24.8 61.2
8 1317058 18 years old or older Female 11th Black or Afr… Blac… 17.7 49.9
9 1317057 18 years old or older Female 12th All other ra… Asian 19.8 44.4
10 1317056 18 years old or older Female 12th White White 21.7 61.2
# ℹ 19,990 more rows
filter()ing data
filter()ing quantitative data
filter() is an extremely powerful function. It lets us subset our data according to specific criteria.
Filtering on a numeric variable, bmi:
Note the double equal signs! == is a test of whether two things are equal; it is not an assignment (in contrast, x = 5 assigns the value 5 to the object x).
# A tibble: 160 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1316985 18 years old or older Female 12th White White 42.4 127.
2 1316823 18 years old or older Female 12th Black or Afr… Blac… 43.3 118.
3 1316755 18 years old or older Female 12th White White 41.0 109.
4 1316476 18 years old or older Female 12th Hispanic/Lat… Hisp… 49.6 104.
5 1316448 18 years old or older Female 12th All other ra… Mult… 41.9 128.
6 1316336 18 years old or older Female 12th Black or Afr… Blac… 40.2 113.
7 1316330 18 years old or older Female 12th Hispanic/Lat… Hisp… 40.2 113.
8 1316235 18 years old or older Female 12th <NA> <NA> 45.1 120.
9 1316186 18 years old or older Female <NA> Hispanic/Lat… Hisp… 47.6 89.4
10 1315913 17 years old Female 12th Black or Afr… Blac… 53.3 142.
# ℹ 150 more rows
# A tibble: 0 × 8
# ℹ 8 variables: record <dbl>, age <chr>, sex <chr>, grade <chr>, race4 <chr>,
# race7 <chr>, bmi <dbl>, stweight <dbl>
# A tibble: 1 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1096565 15 years old Male 10th Black or African Americ… Blac… 13.9 56.2
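A minimal sketch of filtering on a numeric variable. The tibble students and the cutoff values are made up for illustration and are not necessarily the criteria behind the outputs above.

```r
library(dplyr)

# Hypothetical data for illustration only
students <- tibble(
  record = 1:5,
  bmi = c(17.2, 42.4, NA, 13.9, 28.0)
)

# Keep rows where bmi is greater than 40
high_bmi <- students %>% filter(bmi > 40)

# Keep rows where bmi is less than 14
low_bmi <- students %>% filter(bmi < 14)

# Keep rows where bmi equals a specific value (note the double equal signs)
exact_bmi <- students %>% filter(bmi == 28.0)

high_bmi
```

Rows where bmi is NA are dropped by all three filters, because a comparison with NA is neither TRUE nor FALSE.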
filter()ing categorical data
Note the double equal signs! == is a test of whether two things are equal; it is not an assignment (in contrast, x = 5 assigns the value 5 to the object x).
yrbss_data %>%
filter(grade == "10th") #We must include " " around the value for categorical variables
# A tibble: 4,907 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1317010 18 years old or older Female 10th Black or Afr… Blac… 28.3 61.2
2 1315996 17 years old Female 10th White White 16.3 44.4
3 1315971 17 years old Female 10th Hispanic/Lat… Hisp… 19.6 56.7
4 1315934 17 years old Female 10th Black or Afr… Blac… 23.5 70.3
5 1315928 17 years old Female 10th Black or Afr… Blac… 21.0 45.4
6 1315926 17 years old Female 10th Hispanic/Lat… Hisp… 25.9 66.2
7 1315884 17 years old Female 10th White White 51.4 132.
8 1315816 17 years old Female 10th All other ra… Mult… 22.2 59.0
9 1315792 17 years old Female 10th Hispanic/Lat… Hisp… 21.7 61.2
10 1315776 17 years old Female 10th Black or Afr… Blac… 20.6 50.8
# ℹ 4,897 more rows
!=
# A tibble: 14,687 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1317069 18 years old or older Female 12th Black or Afr… Blac… 24.4 69.0
2 1317067 18 years old or older Female 12th White White 18.8 49.9
3 1317066 18 years old or older Female 12th White White 32.1 90.7
4 1317065 18 years old or older Female 12th White White 21.2 56.2
5 1317064 18 years old or older Female 12th White White 23.5 68.0
6 1317063 18 years old or older Female 12th White White 21.7 52.2
7 1317062 18 years old or older Female 12th White White 24.8 61.2
8 1317058 18 years old or older Female 11th Black or Afr… Blac… 17.7 49.9
9 1317057 18 years old or older Female 12th All other ra… Asian 19.8 44.4
10 1317056 18 years old or older Female 12th White White 21.7 61.2
# ℹ 14,677 more rows
Multiple criteria can be combined with the & (AND) or | (OR) operators.
If we want participants who are Female and are in 10th grade, we use & to chain these criteria together:
sex == "Female" & grade == "10th"
& (AND) criteria
# A tibble: 2,332 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1317010 18 years old or older Female 10th Black or Afr… Blac… 28.3 61.2
2 1315996 17 years old Female 10th White White 16.3 44.4
3 1315971 17 years old Female 10th Hispanic/Lat… Hisp… 19.6 56.7
4 1315934 17 years old Female 10th Black or Afr… Blac… 23.5 70.3
5 1315928 17 years old Female 10th Black or Afr… Blac… 21.0 45.4
6 1315926 17 years old Female 10th Hispanic/Lat… Hisp… 25.9 66.2
7 1315884 17 years old Female 10th White White 51.4 132.
8 1315816 17 years old Female 10th All other ra… Mult… 22.2 59.0
9 1315792 17 years old Female 10th Hispanic/Lat… Hisp… 21.7 61.2
10 1315776 17 years old Female 10th Black or Afr… Blac… 20.6 50.8
# ℹ 2,322 more rows
# A tibble: 2,332 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1317010 18 years old or older Female 10th Black or Afr… Blac… 28.3 61.2
2 1315996 17 years old Female 10th White White 16.3 44.4
3 1315971 17 years old Female 10th Hispanic/Lat… Hisp… 19.6 56.7
4 1315934 17 years old Female 10th Black or Afr… Blac… 23.5 70.3
5 1315928 17 years old Female 10th Black or Afr… Blac… 21.0 45.4
6 1315926 17 years old Female 10th Hispanic/Lat… Hisp… 25.9 66.2
7 1315884 17 years old Female 10th White White 51.4 132.
8 1315816 17 years old Female 10th All other ra… Mult… 22.2 59.0
9 1315792 17 years old Female 10th Hispanic/Lat… Hisp… 21.7 61.2
10 1315776 17 years old Female 10th Black or Afr… Blac… 20.6 50.8
# ℹ 2,322 more rows
| (OR) criteria
If we wanted participants who were Female or in 10th grade, we would use a | to chain these criteria together.
# A tibble: 12,167 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1317069 18 years old or older Female 12th Black or Afr… Blac… 24.4 69.0
2 1317067 18 years old or older Female 12th White White 18.8 49.9
3 1317066 18 years old or older Female 12th White White 32.1 90.7
4 1317065 18 years old or older Female 12th White White 21.2 56.2
5 1317064 18 years old or older Female 12th White White 23.5 68.0
6 1317063 18 years old or older Female 12th White White 21.7 52.2
7 1317062 18 years old or older Female 12th White White 24.8 61.2
8 1317058 18 years old or older Female 11th Black or Afr… Blac… 17.7 49.9
9 1317057 18 years old or older Female 12th All other ra… Asian 19.8 44.4
10 1317056 18 years old or older Female 12th White White 21.7 61.2
# ℹ 12,157 more rows
Think about it:
Which of the AND vs. OR code blocks will return a larger number of participants?
See this slide from a previous BERD workshop.
Also read the “Data Transformation” chapter in R for Data Science.
This is a useful reference for all the different operators (both logical and comparison) that you can use: https://www.datamentor.io/r-programming/operator/
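The AND and OR criteria described above can be compared side by side on a small made-up tibble (students is hypothetical data).

```r
library(dplyr)

# Hypothetical data for illustration only
students <- tibble(
  sex = c("Female", "Female", "Male", "Male", NA),
  grade = c("10th", "12th", "10th", "9th", "10th")
)

# AND: both conditions must be TRUE
female_and_10th <- students %>%
  filter(sex == "Female" & grade == "10th")

# Separating conditions with a comma is equivalent to &
female_and_10th_comma <- students %>%
  filter(sex == "Female", grade == "10th")

# OR: at least one condition must be TRUE
female_or_10th <- students %>%
  filter(sex == "Female" | grade == "10th")

nrow(female_and_10th)
nrow(female_or_10th)
```

The OR filter can never return fewer rows than the AND filter, since every row that meets both conditions also meets at least one.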
drop_na() (1/2)
We may want to remove rows that have missing data, which are coded as NA.
See the drop_na() reference for examples (?drop_na).
naniar package:
drop_na() (2/2)
Pay attention to the number of rows for each of these results
# A tibble: 18,813 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1317069 18 years old or older Female 12th Black or Afr… Blac… 24.4 69.0
2 1317067 18 years old or older Female 12th White White 18.8 49.9
3 1317066 18 years old or older Female 12th White White 32.1 90.7
4 1317065 18 years old or older Female 12th White White 21.2 56.2
5 1317064 18 years old or older Female 12th White White 23.5 68.0
6 1317063 18 years old or older Female 12th White White 21.7 52.2
7 1317062 18 years old or older Female 12th White White 24.8 61.2
8 1317058 18 years old or older Female 11th Black or Afr… Blac… 17.7 49.9
9 1317057 18 years old or older Female 12th All other ra… Asian 19.8 44.4
10 1317056 18 years old or older Female 12th White White 21.7 61.2
# ℹ 18,803 more rows
# A tibble: 13,126 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1317069 18 years old or older Female 12th Black or Afr… Blac… 24.4 69.0
2 1317067 18 years old or older Female 12th White White 18.8 49.9
3 1317066 18 years old or older Female 12th White White 32.1 90.7
4 1317065 18 years old or older Female 12th White White 21.2 56.2
5 1317064 18 years old or older Female 12th White White 23.5 68.0
6 1317063 18 years old or older Female 12th White White 21.7 52.2
7 1317062 18 years old or older Female 12th White White 24.8 61.2
8 1317058 18 years old or older Female 11th Black or Afr… Blac… 17.7 49.9
9 1317057 18 years old or older Female 12th All other ra… Asian 19.8 44.4
10 1317056 18 years old or older Female 12th White White 21.7 61.2
# ℹ 13,116 more rows
# A tibble: 12,897 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1317069 18 years old or older Female 12th Black or Afr… Blac… 24.4 69.0
2 1317067 18 years old or older Female 12th White White 18.8 49.9
3 1317066 18 years old or older Female 12th White White 32.1 90.7
4 1317065 18 years old or older Female 12th White White 21.2 56.2
5 1317064 18 years old or older Female 12th White White 23.5 68.0
6 1317063 18 years old or older Female 12th White White 21.7 52.2
7 1317062 18 years old or older Female 12th White White 24.8 61.2
8 1317058 18 years old or older Female 11th Black or Afr… Blac… 17.7 49.9
9 1317057 18 years old or older Female 12th All other ra… Asian 19.8 44.4
10 1317056 18 years old or older Female 12th White White 21.7 61.2
# ℹ 12,887 more rows
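To see why the row counts differ, here is a sketch of drop_na() on a small made-up tibble. The data and the choice of columns are hypothetical and do not necessarily match the calls that produced the outputs above.

```r
library(dplyr)
library(tidyr)  # provides drop_na()

# Hypothetical data for illustration only
students <- tibble(
  record = 1:4,
  grade = c("9th", NA, "10th", "12th"),
  bmi = c(17.2, 20.2, NA, 24.5),
  stweight = c(54.4, 57.2, NA, NA)
)

# Drop rows that are missing bmi only
no_missing_bmi <- students %>% drop_na(bmi)

# Drop rows that are missing bmi or stweight
no_missing_bmi_weight <- students %>% drop_na(bmi, stweight)

# Drop rows that are missing a value in any column
complete_rows <- students %>% drop_na()

c(nrow(no_missing_bmi), nrow(no_missing_bmi_weight), nrow(complete_rows))
```

The more columns drop_na() checks, the more rows it can remove, so calling it with no arguments is the most restrictive option.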
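filter(): check for NA
Suppose we want to exclude the 10th grade category.
First look at the grade values using tabyl().
When filtering with != "10th" we lose our NAs!
We can keep them by adding an | statement:
grade n percent valid_percent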
10th 4907 0.24535 0.2504338
11th 4891 0.24455 0.2496172
12th 4577 0.22885 0.2335919
9th 5219 0.26095 0.2663570
<NA> 406 0.02030 NA
Lost the NAs in the filtering below!
Updated code to keep the NAs when filtering:
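A minimal sketch of the problem and the fix, using a made-up tibble (students is hypothetical data): the first filter silently drops the missing grades, and the second keeps them by adding an is.na() condition with |.

```r
library(dplyr)

# Hypothetical data for illustration only
students <- tibble(
  record = 1:5,
  grade = c("10th", "9th", NA, "12th", "10th")
)

# NA != "10th" evaluates to NA, so rows with a missing grade are dropped
not_10th <- students %>%
  filter(grade != "10th")

# Adding | is.na(grade) keeps the rows with a missing grade
not_10th_keep_na <- students %>%
  filter(grade != "10th" | is.na(grade))

nrow(not_10th)
nrow(not_10th_keep_na)
```

Running tabyl() or count() on the filtered variable afterwards is a quick way to confirm which categories remain.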
select()ing data
select()
The select() function allows us to select variables or columns from our dataset:
# A tibble: 20,000 × 3
record age race4
<dbl> <chr> <chr>
1 1317069 18 years old or older Black or African American
2 1317067 18 years old or older White
3 1317066 18 years old or older White
4 1317065 18 years old or older White
5 1317064 18 years old or older White
6 1317063 18 years old or older White
7 1317062 18 years old or older White
8 1317058 18 years old or older Black or African American
9 1317057 18 years old or older All other races
10 1317056 18 years old or older White
# ℹ 19,990 more rows
To remove a variable, put a - in front of that variable.
# A tibble: 20,000 × 7
record age sex grade race7 bmi stweight
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1317069 18 years old or older Female 12th Black or African A… 24.4 69.0
2 1317067 18 years old or older Female 12th White 18.8 49.9
3 1317066 18 years old or older Female 12th White 32.1 90.7
4 1317065 18 years old or older Female 12th White 21.2 56.2
5 1317064 18 years old or older Female 12th White 23.5 68.0
6 1317063 18 years old or older Female 12th White 21.7 52.2
7 1317062 18 years old or older Female 12th White 24.8 61.2
8 1317058 18 years old or older Female 11th Black or African A… 17.7 49.9
9 1317057 18 years old or older Female 12th Asian 19.8 44.4
10 1317056 18 years old or older Female 12th White 21.7 61.2
# ℹ 19,990 more rows
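As a rough illustration of keeping versus dropping columns, here is a sketch on a made-up tibble (toy_data is hypothetical; only the column names are borrowed from yrbss_data):

```r
library(dplyr)

# Hypothetical toy data
toy_data <- tibble(
  record = c(1, 2, 3),
  age    = c("15 years old", "16 years old", "17 years old"),
  race4  = c("White", "All other races", "White"),
  bmi    = c(21.2, 19.8, 24.8)
)

# Keep only the named columns, in the order listed
kept <- toy_data %>% select(record, age, race4)

# Drop one column by putting - in front of its name
dropped <- toy_data %>% select(-race4)

names(kept)    # "record" "age" "race4"
names(dropped) # "record" "age" "bmi"
```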
tidyselect
There are ways to search column names and to select them.
These are called the tidyselect helpers.
You can see some examples here: https://tidyselect.r-lib.org/reference/language.html
For instance, you can see how this might be useful, where we select all columns where the column name includes the word “race”:
# A tibble: 20,000 × 2
race4 race7
<chr> <chr>
1 Black or African American Black or African American
2 White White
3 White White
4 White White
5 White White
6 White White
7 White White
8 Black or African American Black or African American
9 All other races Asian
10 White White
# ℹ 19,990 more rows
# A tibble: 20,000 × 3
record race4 race7
<dbl> <chr> <chr>
1 1317069 Black or African American Black or African American
2 1317067 White White
3 1317066 White White
4 1317065 White White
5 1317064 White White
6 1317063 White White
7 1317062 White White
8 1317058 Black or African American Black or African American
9 1317057 All other races Asian
10 1317056 White White
# ℹ 19,990 more rows
# A tibble: 20,000 × 3
record age sex
<dbl> <chr> <chr>
1 1317069 18 years old or older Female
2 1317067 18 years old or older Female
3 1317066 18 years old or older Female
4 1317065 18 years old or older Female
5 1317064 18 years old or older Female
6 1317063 18 years old or older Female
7 1317062 18 years old or older Female
8 1317058 18 years old or older Female
9 1317057 18 years old or older Female
10 1317056 18 years old or older Female
# ℹ 19,990 more rows
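Selections like the three outputs above can be sketched on a small made-up tibble. This is only an illustration: toy_data is hypothetical, and the record:sex range is an assumption about how the third selection might be written.

```r
library(dplyr)

# Hypothetical toy data with the same column names as yrbss_data
toy_data <- tibble(
  record = c(1, 2),
  age    = c("16 years old", "17 years old"),
  sex    = c("Female", "Male"),
  race4  = c("White", "All other races"),
  race7  = c("White", "Asian"),
  bmi    = c(21.2, 19.8)
)

# All columns whose name includes "race"
race_cols <- toy_data %>% select(contains("race"))

# A named column combined with a tidyselect helper
id_and_race <- toy_data %>% select(record, contains("race"))

# A range of adjacent columns, from record through sex
first_three <- toy_data %>% select(record:sex)

names(race_cols)   # "race4" "race7"
names(id_and_race) # "record" "race4" "race7"
names(first_three) # "record" "age" "sex"
```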
select() and everything() are a useful combination for some quick rearranging of columns:
# A tibble: 20,000 × 8
record bmi age sex grade race4 race7 stweight
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1317069 24.4 18 years old or older Female 12th Black or Afr… Blac… 69.0
2 1317067 18.8 18 years old or older Female 12th White White 49.9
3 1317066 32.1 18 years old or older Female 12th White White 90.7
4 1317065 21.2 18 years old or older Female 12th White White 56.2
5 1317064 23.5 18 years old or older Female 12th White White 68.0
6 1317063 21.7 18 years old or older Female 12th White White 52.2
7 1317062 24.8 18 years old or older Female 12th White White 61.2
8 1317058 17.7 18 years old or older Female 11th Black or Afr… Blac… 49.9
9 1317057 19.8 18 years old or older Female 12th All other ra… Asian 44.4
10 1317056 21.7 18 years old or older Female 12th White White 61.2
# ℹ 19,990 more rows
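A small sketch of this rearranging trick on made-up data (toy_data is hypothetical): list the columns that should come first, then let everything() fill in the rest in their existing order.

```r
library(dplyr)

# Hypothetical toy data
toy_data <- tibble(
  record = c(1, 2),
  age    = c("16 years old", "17 years old"),
  sex    = c("Female", "Male"),
  bmi    = c(21.2, 19.8)
)

# record and bmi go first; everything() adds all remaining columns
bmi_second <- toy_data %>% select(record, bmi, everything())

names(bmi_second) # "record" "bmi" "age" "sex"
```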
You can also use relocate() to move around your column order.
tidyselect mini-practice
select(where(is.numeric)) selects all numeric columns
where() is a helper function that returns columns where the inside function is TRUE (here is.numeric returns the TRUE/FALSE)
# A tibble: 20,000 × 3
record bmi stweight
<dbl> <dbl> <dbl>
1 1317069 24.4 69.0
2 1317067 18.8 49.9
3 1317066 32.1 90.7
4 1317065 21.2 56.2
5 1317064 23.5 68.0
6 1317063 21.7 52.2
7 1317062 24.8 61.2
8 1317058 17.7 49.9
9 1317057 19.8 44.4
10 1317056 21.7 61.2
# ℹ 19,990 more rows
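To see where() at work outside the workshop data, here is a sketch on a made-up tibble (toy_data is hypothetical). It also shows that the same helper can be passed to dplyr's relocate(), which by default moves the chosen columns to the front.

```r
library(dplyr)

# Hypothetical toy data with a mix of character and numeric columns
toy_data <- tibble(
  age    = c("16 years old", "17 years old"),
  record = c(1, 2),
  sex    = c("Female", "Male"),
  bmi    = c(21.2, 19.8)
)

# Keep only the columns for which is.numeric() returns TRUE
numeric_only <- toy_data %>% select(where(is.numeric))

# Move the numeric columns to the front and keep all other columns
numeric_first <- toy_data %>% relocate(where(is.numeric))

names(numeric_only)  # "record" "bmi"
names(numeric_first) # "record" "bmi" "age" "sex"
```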
Use is.character to select all character columns, as well as the column bmi.
tidyselect (BONUS - on your own)
See some more examples in this slide from a previous BERD workshop
For more info and learning about tidyselect, please run this code in your console:
filter() and select()
filter() works on rows (think FILTER in Excel!), and
select() works on columns (select your relevant variables)
Keep that in mind!
rename()
You can rename columns with select() but it’s a bit easier to do this with the rename() function.
You just need to remember the new_name = old_name ordering.
# A tibble: 20,000 × 8
id age sex grade race_cat race7 bmi weight
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1317069 18 years old or older Female 12th Black or Afric… Blac… 24.4 69.0
2 1317067 18 years old or older Female 12th White White 18.8 49.9
3 1317066 18 years old or older Female 12th White White 32.1 90.7
4 1317065 18 years old or older Female 12th White White 21.2 56.2
5 1317064 18 years old or older Female 12th White White 23.5 68.0
6 1317063 18 years old or older Female 12th White White 21.7 52.2
7 1317062 18 years old or older Female 12th White White 24.8 61.2
8 1317058 18 years old or older Female 11th Black or Afric… Blac… 17.7 49.9
9 1317057 18 years old or older Female 12th All other races Asian 19.8 44.4
10 1317056 18 years old or older Female 12th White White 21.7 61.2
# ℹ 19,990 more rows
Note that we did not save the changes above, and thus we haven’t actually renamed the variables in the dataset yrbss_data!
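The difference between printing a renamed tibble and saving it can be sketched on made-up data (toy_data and toy_renamed are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data
toy_data <- tibble(
  record   = 1:3,
  stweight = c(68.95, 49.90, 90.72)
)

# Without assignment, the renamed tibble is only printed; toy_data is unchanged
toy_data %>% rename(id = record, weight = stweight)
names(toy_data) # still "record" "stweight"

# Assigning the result with <- keeps the new names (new_name = old_name)
toy_renamed <- toy_data %>% rename(id = record, weight = stweight)
names(toy_renamed) # "id" "weight"
```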
We will save the processed data in the output/ directory.
We will save it as an .xlsx file so that we can open it in Excel.
# chain together several data transformation steps with %>%
# use <- to assign the new data to a different name to avoid overwriting the original dataset
processed_data <- yrbss_data %>%
  select(-race7) %>%
  rename(id = record,
         race_cat = race4,
         weight = stweight) %>%
  filter(sex == "Female")
glimpse(processed_data)
Rows: 9,592
Columns: 7
$ id <dbl> 1317069, 1317067, 1317066, 1317065, 1317064, 1317063, 1317062…
$ age <chr> "18 years old or older", "18 years old or older", "18 years o…
$ sex <chr> "Female", "Female", "Female", "Female", "Female", "Female", "…
$ grade <chr> "12th", "12th", "12th", "12th", "12th", "12th", "12th", "11th…
$ race_cat <chr> "Black or African American", "White", "White", "White", "Whit…
$ bmi <dbl> 24.4296, 18.7813, 32.1429, 21.1713, 23.5433, 21.7107, 24.8448…
$ weight <dbl> 68.95, 49.90, 90.72, 56.25, 68.04, 52.16, 61.24, 49.90, 44.45…
Save the processed_data:
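One possible way to write a tibble to an .xlsx file is sketched below. It assumes the writexl package, which may not be the package used in the workshop, and it writes a made-up tibble (processed_toy) to a temporary folder rather than to output/.

```r
library(dplyr)

# Hypothetical toy data standing in for processed_data
processed_toy <- tibble(
  id     = 1:3,
  weight = c(68.95, 49.90, 90.72)
)

# Temporary location for this sketch; in a project this could be a path under output/
out_file <- file.path(tempdir(), "processed_toy.xlsx")

# Assumption: writexl is installed; write_xlsx() takes the data and a file path
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(processed_toy, path = out_file)
}
```

The resulting file opens directly in Excel.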
Perform some subsetting functions on the yrbss_data data:
Use filter() to select people who are in 9th or 10th grade and who have BMI less than 25. Save the resulting data as the yrbss_filtered tibble.
Use rename() to rename 2 of the columns of yrbss_filtered to something more meaningful.
Use select() to keep only the columns that denote record, sex, grade, bmi, and the 2 renamed columns.
Arrange the data by bmi, starting with the highest value and ending with the lowest value.
Save the resulting data set in the file data/practice4_data.xlsx.
mutate()
We can use select() to change columns in a data.frame:
# A tibble: 20,000 × 6
age grade race4 race7 bmi stweight
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 18 years old or older 12th Black or African American Black o… 24.4 69.0
2 18 years old or older 12th White White 18.8 49.9
3 18 years old or older 12th White White 32.1 90.7
4 18 years old or older 12th White White 21.2 56.2
5 18 years old or older 12th White White 23.5 68.0
6 18 years old or older 12th White White 21.7 52.2
7 18 years old or older 12th White White 24.8 61.2
8 18 years old or older 11th Black or African American Black o… 17.7 49.9
9 18 years old or older 12th All other races Asian 19.8 44.4
10 18 years old or older 12th White White 21.7 61.2
# ℹ 19,990 more rows
mutate() - A confusing name, a powerful dplyr verb
mutate() is one of the most useful dplyr verbs.
You can use mutate() to add new columns (to a data.frame) and/or change existing columns.
Think of this like adding a formula in Excel to calculate the value of a new column based on previous columns. You can do lots of things with it.
mutate to calculate a new variable based on other variables (1/3)
One use of mutate is to do Excel type calculations using other columns in the data.
Here stweight is treated as being in lbs and converted to kilograms.
Use = inside mutate, not == or <-.
Rows: 20,000
Columns: 9
$ record <dbl> 1317069, 1317067, 1317066, 1317065, 1317064, 1317063, 131706…
$ age <chr> "18 years old or older", "18 years old or older", "18 years …
$ sex <chr> "Female", "Female", "Female", "Female", "Female", "Female", …
$ grade <chr> "12th", "12th", "12th", "12th", "12th", "12th", "12th", "11t…
$ race4 <chr> "Black or African American", "White", "White", "White", "Whi…
$ race7 <chr> "Black or African American", "White", "White", "White", "Whi…
$ bmi <dbl> 24.4296, 18.7813, 32.1429, 21.1713, 23.5433, 21.7107, 24.844…
$ stweight <dbl> 68.95, 49.90, 90.72, 56.25, 68.04, 52.16, 61.24, 49.90, 44.4…
$ weight_kg <dbl> 31.26984, 22.63039, 41.14286, 25.51020, 30.85714, 23.65533, …
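The = rule can be seen in a small sketch on made-up data (toy_data and its weight_lb column are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with weights in pounds
toy_data <- tibble(
  record    = 1:3,
  weight_lb = c(150, 120, 200)
)

# new_column = formula: a single = names the new column
toy_new <- toy_data %>%
  mutate(weight_kg = weight_lb / 2.205)

# Using == or <- in place of = would not create a column named weight_kg
names(toy_new) # "record" "weight_lb" "weight_kg"
```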
mutate to calculate a new variable based on other variables (2/3)
Save the new data as yrbss_new.
bmi and weight_kg
yrbss_new <- yrbss_data %>%
  mutate(weight_kg = stweight/2.205,
         weight_g = weight_kg*1000)
glimpse(yrbss_new)
Rows: 20,000
Columns: 10
$ record <dbl> 1317069, 1317067, 1317066, 1317065, 1317064, 1317063, 131706…
$ age <chr> "18 years old or older", "18 years old or older", "18 years …
$ sex <chr> "Female", "Female", "Female", "Female", "Female", "Female", …
$ grade <chr> "12th", "12th", "12th", "12th", "12th", "12th", "12th", "11t…
$ race4 <chr> "Black or African American", "White", "White", "White", "Whi…
$ race7 <chr> "Black or African American", "White", "White", "White", "Whi…
$ bmi <dbl> 24.4296, 18.7813, 32.1429, 21.1713, 23.5433, 21.7107, 24.844…
$ stweight <dbl> 68.95, 49.90, 90.72, 56.25, 68.04, 52.16, 61.24, 49.90, 44.4…
$ weight_kg <dbl> 31.26984, 22.63039, 41.14286, 25.51020, 30.85714, 23.65533, …
$ weight_g <dbl> 31269.84, 22630.39, 41142.86, 25510.20, 30857.14, 23655.33, …
mutate to calculate a new variable based on other variables (3/3)
Figure examining the newly created variables using mutate:
factors
One data type that we haven’t yet looked at is the factor.
For the most part, you can use character and factors interchangeably for categorical data.
However, there is one main difference.
factors define the permissible values in a vector with an argument called levels.
character_vector <- c("Dog", "Dog", "Cat", "Mouse") # c() is the concatenate function
class(character_vector)
[1] "character"
character_vector n percent
Cat 1 0.25
Dog 2 0.50
Mouse 1 0.25
We can create a factor with the factor() function, supplying an argument called levels.
levels of a factor variable
The levels of a factor are the permissible values in a factor.
The order of the levels in a factor variable determines the order in which the values appear in tables and on the axes in a plot.
You can control the order of the categories in a factor by specifying the order of the categories in the levels argument.
You can find out the levels of the factor with the function levels() or tabyl().
Being able to specify the ordering is the main reason to use factors, at least in plotting and calculating counts.
Set the levels of factor_vector to be “Mouse”, “Cat”, “Dog”.
Check the result with tabyl() or levels().
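A minimal base R sketch of this step, reusing the character_vector values from above. It uses table() for the counts so that no extra package is needed; tabyl() from janitor would list the categories in the same level order.

```r
# Same values as character_vector in the slides
character_vector <- c("Dog", "Dog", "Cat", "Mouse")

# The levels argument sets both the permissible values and their order
factor_vector <- factor(character_vector,
                        levels = c("Mouse", "Cat", "Dog"))

levels(factor_vector) # "Mouse" "Cat" "Dog"
table(factor_vector)  # counts are listed in level order, not alphabetically
```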
mutate to transform character variables into factors
We can also use mutate() to make a character variable a factor.
Let’s convert age from character into factor:
# A tibble: 20,000 × 9
record age sex grade race4 race7 bmi stweight age_fac
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <fct>
1 1317069 18 years old or older Female 12th Blac… Blac… 24.4 69.0 18 yea…
2 1317067 18 years old or older Female 12th White White 18.8 49.9 18 yea…
3 1317066 18 years old or older Female 12th White White 32.1 90.7 18 yea…
4 1317065 18 years old or older Female 12th White White 21.2 56.2 18 yea…
5 1317064 18 years old or older Female 12th White White 23.5 68.0 18 yea…
6 1317063 18 years old or older Female 12th White White 21.7 52.2 18 yea…
7 1317062 18 years old or older Female 12th White White 24.8 61.2 18 yea…
8 1317058 18 years old or older Female 11th Blac… Blac… 17.7 49.9 18 yea…
9 1317057 18 years old or older Female 12th All … Asian 19.8 44.4 18 yea…
10 1317056 18 years old or older Female 12th White White 21.7 61.2 18 yea…
# ℹ 19,990 more rows
# A tibble: 20,000 × 8
record age sex grade race4 race7 bmi stweight
<dbl> <chr> <chr> <fct> <chr> <chr> <dbl> <dbl>
1 1317069 18 years old or older Female 12th Black or Afr… Blac… 24.4 69.0
2 1317067 18 years old or older Female 12th White White 18.8 49.9
3 1317066 18 years old or older Female 12th White White 32.1 90.7
4 1317065 18 years old or older Female 12th White White 21.2 56.2
5 1317064 18 years old or older Female 12th White White 23.5 68.0
6 1317063 18 years old or older Female 12th White White 21.7 52.2
7 1317062 18 years old or older Female 12th White White 24.8 61.2
8 1317058 18 years old or older Female 11th Black or Afr… Blac… 17.7 49.9
9 1317057 18 years old or older Female 12th All other ra… Asian 19.8 44.4
10 1317056 18 years old or older Female 12th White White 21.7 61.2
# ℹ 19,990 more rows
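The two outputs above differ in whether the factor goes into a new column or replaces an existing one. A sketch of both on made-up data (toy_data is hypothetical):

```r
library(dplyr)

# Hypothetical toy data
toy_data <- tibble(
  record = 1:4,
  age    = c("15 years old", "16 years old", "17 years old", "16 years old"),
  grade  = c("9th", "10th", "11th", "10th")
)

# New name on the left of =: a factor column is added and age stays character
with_new_col <- toy_data %>% mutate(age_fac = factor(age))

# Existing name on the left of =: grade itself is replaced by a factor
overwritten <- toy_data %>% mutate(grade = factor(grade))

class(with_new_col$age)     # "character"
class(with_new_col$age_fac) # "factor"
class(overwritten$grade)    # "factor"
```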
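Above we are taking a variable (grade), converting it to a factor (factor), and then reassigning the variable grade to our new set of values.
levels of a factor variable revisited
We can set the order of the categories with the levels argument in factor().
First check the current order with tabyl():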
grade n percent valid_percent
10th 4907 0.24535 0.2504338
11th 4891 0.24455 0.2496172
12th 4577 0.22885 0.2335919
9th 5219 0.26095 0.2663570
<NA> 406 0.02030 NA
yrbss_data %>%
  mutate(grade_fac = factor(grade,
                            levels = c("9th","10th","11th","12th"))) %>%
  tabyl(grade_fac)
grade_fac n percent valid_percent
9th 5219 0.26095 0.2663570
10th 4907 0.24535 0.2504338
11th 4891 0.24455 0.2496172
12th 4577 0.22885 0.2335919
<NA> 406 0.02030 NA
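One thing to watch when typing out levels by hand: any value that does not exactly match a level is turned into NA. A minimal base R sketch with a deliberate typo (the vector is made up):

```r
# Made-up grade values
grade <- c("9th", "10th", "11th", "12th")

# "12 th" is a deliberate typo, so the value "12th" matches no level
grade_typo <- factor(grade,
                     levels = c("9th", "10th", "11th", "12 th"))

sum(is.na(grade_typo)) # 1: the unmatched "12th" became NA
```

Comparing counts before and after creating the factor is a quick way to catch this.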
Warning
Remember spelling matters!
Values that do not match one of the levels become NA (missing).
Create a new variable height that calculates height based on bmi and weight_kg. Note the formula for BMI is weight/height^2 (weight is in kilograms and height is in meters). Create a scatterplot of bmi vs height.
Create a new variable sex_fac that orders the values of sex as Male, Female.
case_when()
First let’s start with a simple binary 2 category variable.
Syntax of case_when():
The left side of ~ is where we can specify how we define the category based on our column variable names.
The right side of ~ is where we can specify the category name (as a character).
case_when() example 1
Recode sex to a character variable called female:
sex == "Female" tests whether the column sex is Female, which is how we define the category 1 (given to the right of ~)
sex == "Male" tests whether the column sex is Male, which is how we define the category 0 (given to the right of ~)
Note how NA is handled here:
case_when() example 2
left side defines the condition ~ right side names the category
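A sketch of the condition ~ category pattern for a two-category recode, on made-up data (toy_data is hypothetical). It assumes no default is supplied, in which case rows that match none of the conditions, including rows where sex is NA, get NA.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy_data <- tibble(sex = c("Female", "Male", NA, "Female"))

recoded <- toy_data %>%
  mutate(
    female = case_when(
      sex == "Female" ~ "1", # left side: condition; right side: category name
      sex == "Male"   ~ "0"
    )
  )

recoded$female # "1" "0" NA "1"
```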
case_when() example 3
Next we recode bmi to have 3 levels:
<20 (not including exactly 20)
20-30 (not including exactly 30)
30+
Notice that for the middle category we use an & statement, similar to how we used logic in filter:
yrbss_data %>%
  mutate(
    bmi3cat = case_when(
      bmi < 20 ~ "<20",
      (bmi >= 20) & (bmi < 30) ~ "20-30",
      bmi >= 30 ~ "30+"
    )
  ) %>%
  mutate(bmi3cat = factor(bmi3cat)) %>% # make it factor after creating categories
  tabyl(bmi3cat)
bmi3cat n percent valid_percent
<20 3167 0.15835 0.2338650
20-30 9024 0.45120 0.6663713
30+ 1351 0.06755 0.0997637
<NA> 6458 0.32290 NA
Create a new variable grade_num in yrbss_data that converts grade to numeric.
Create a new variable race_cat that relabels race4 to have shorter race category names: W (White), H/L (Hispanic/Latino), B/AA (Black or African American), Other (All other races). Make race_cat a factor variable with levels ordered by the size of the group in descending order.
Make a tabyl of the new variable race_cat and race4 to check your work, including the ordering of the factors.
ggplot2
geom_boxplot()
A boxplot shows a categorical variable (x axis) against a numeric variable on the y axis.
The main differences from the scatterplots we created earlier are the geom type and the variables plotted.
We can change the color similarly to scatterplots.
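A minimal boxplot sketch on made-up data (toy_data is hypothetical; grade on the x axis and bmi on the y axis are assumptions about the variables plotted in the workshop):

```r
library(ggplot2)

# Hypothetical toy data: one categorical and one numeric column
toy_data <- data.frame(
  grade = rep(c("9th", "10th", "11th"), each = 4),
  bmi   = c(18.5, 21.0, 23.4, 27.1,
            19.2, 22.3, 24.8, 30.5,
            20.1, 21.7, 25.0, 32.1)
)

# Same structure as a scatterplot, with geom_boxplot() as the geom
p_box <- ggplot(toy_data, aes(x = grade, y = bmi)) +
  geom_boxplot()
```

Typing p_box in the console or in a code chunk draws the plot.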
Use fill and not color if we want to fill in the box with color:
Note the order of the grade levels in our plot.
facet_wrap()
We can add a facet_wrap() command to our plot.
In facet_wrap we facet on race4, using the vars() function to specify it as a variable.
Don’t try to facet on a continuous numeric variable - it won’t work.
Don’t forget to look at the help documentation (e.g., ?facet_wrap) to learn more about additional ways to customize your plots!
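Putting fill and facet_wrap() together, here is a sketch on made-up data (toy_data is hypothetical; the choice of grade, bmi, and race4 mirrors the column names in the slides but is otherwise an assumption):

```r
library(ggplot2)

# Hypothetical toy data
toy_data <- data.frame(
  grade = rep(c("9th", "10th"), times = 6),
  race4 = rep(c("White", "All other races"), each = 6),
  bmi   = c(18.5, 21.0, 23.4, 27.1, 19.2, 22.3,
            24.8, 30.5, 20.1, 21.7, 25.0, 32.1)
)

# fill colors the inside of each box; facet_wrap() makes one panel per race4 value
p_facet <- ggplot(toy_data, aes(x = grade, y = bmi, fill = grade)) +
  geom_boxplot() +
  facet_wrap(vars(race4))
```

Mapping color instead of fill would change only the box outlines.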
facet_grid()
Facet the boxplot below by age and grade, don’t forget the vars():
mutate: using mutate to replace missing values with replace_na
We can replace missing values with a value of our choosing, using the replace_na() function inside of mutate() to specify this.
yrbss_new <- yrbss_data %>%
  mutate(bmi_filled =
           replace_na(bmi,
                      mean(bmi, na.rm = TRUE)))
yrbss_new %>%
  select(contains("bmi")) %>%
  tail()
# A tibble: 6 × 2
bmi bmi_filled
<dbl> <dbl>
1 NA 23.5
2 NA 23.5
3 NA 23.5
4 NA 23.5
5 NA 23.5
6 NA 23.5
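replace_na() is not limited to numeric columns. A sketch with a character column on made-up data (toy_data and the "unknown" label are hypothetical choices for illustration):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with missing values in a character column
toy_data <- tibble(
  record = 1:4,
  sex    = c("Female", NA, "Male", NA)
)

# The replacement must have the same type as the column (here, character)
toy_filled <- toy_data %>%
  mutate(sex_filled = replace_na(sex, "unknown"))

toy_filled$sex_filled # "Female" "unknown" "Male" "unknown"
```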
Replace the missing values in the grade column with unknown:
Create a plot involving race4, faceted by sex.
Full list of keyboard shortcuts
action | mac | windows/linux |
---|---|---|
Run code in qmd (or script) | cmd + enter | ctrl + enter |
<- | option + - | alt + - |
interrupt currently running command | esc | esc |
keyboard shortcut help | option + shift + k | alt + shift + k |
Practice: Try typing code below in your qmd (with shortcut) and evaluating it (with shortcut):
From Garrett Grolemund’s Prologue of his book Hands-On Programming with R1:
As you learn to program, you are going to get frustrated. You are learning a new language, and it will take time to become fluent. But frustration is not just natural, it’s actually a positive sign that you should watch for. Frustration is your brain’s way of being lazy; it’s trying to get you to quit and go do something easy or fun. If you want to get physically fitter, you need to push your body even though it complains. If you want to get better at programming, you’ll need to push your brain. Recognize when you get frustrated and see it as a good thing: you’re now stretching yourself. Push yourself a little further every day, and you’ll soon be a confident programmer.