library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.3 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
You may (in fact, are encouraged to) use the Internet to look up any information or example code to help you with this assignment, though you must cite any external (i.e. non-course-related) websites that you use. Similarly, after attempting this assignment by yourself, you may collaborate with other students in the course, but you must each write your own code and acknowledge all students with whom you collaborated on each problem (you don’t need to cite by subpart). However, you may not post on Internet forums (e.g. Stack Exchange) for help with this assignment; doing so is considered an Honor Code violation.
Please provide your responses to each problem in this .Rmd file directly below each subpart, inserting additional R code chunks if needed. For text responses, place them in between the provided <p> tags (this puts them in a grey-background text box, to make them easier to grade). On Gradescope, you need only submit the .html file created by knitting the document with your responses.
On Canvas, droughts.csv contains the percentage area of each county in California that was in each of 5 possible categories of drought (D0, D1, D2, D3, and D4) at the end of each week between 2000 and 2020. Assume each observation is taken on the date represented by MapDate.
Download and read in the droughts.csv file on Canvas, as a tibble. Use dplyr::select() to drop the columns ValidStart and ValidEnd, and store the result in a variable called droughts.
For each variable in droughts, justify whether it is categorical or quantitative.
What was the average percent area of Santa Clara County that was in severe or worse drought (D2, D3, D4) in 2020? Hint: Figure out how to extract the year from MapDate. The question is asking you to average over all observations in 2020.
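As a minimal sketch of the year-extraction idea in the hint, the base R snippet below parses a few made-up date strings and pulls out the year. The vector dates and its ISO format are assumptions for illustration only; the actual format of MapDate may differ, so check it before adapting this.

```r
# Hypothetical date strings; the real MapDate column may use another format.
dates <- c("2019-12-31", "2020-01-07", "2020-06-30")

# Parse the strings into Date objects, then extract the year as an integer.
parsed <- as.Date(dates, format = "%Y-%m-%d")
years <- as.integer(format(parsed, "%Y"))

# Logical vector marking which entries fall in 2020.
in_2020 <- years == 2020
```

If the dates were instead stored as numbers such as 20200107, converting them with as.character() and parsing with format = "%Y%m%d" would follow the same pattern. lubridate::year() offers an alternative way to get the year from a Date.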
Generate a tibble that contains the percentage land area of California in severe or worse drought for each MapDate. Order the tibble by MapDate, from least recent to most recent, and make sure the tibble does not contain any extraneous information. You will need to use the area of each county in USA_counties.csv.
Repeat part (d), but for the percentage land area of the 9 Bay Area counties (Marin, Sonoma, Napa, Solano, Alameda, Contra Costa, Santa Clara, San Mateo, and San Francisco).
In what proportion of weeks from 2000-2020 did the Bay Area have a higher percentage of its land area under severe or worse drought than California as a whole?
Find the median percent of California’s land area in severe or worse drought in each of the 12 months of the year. You should have 12 numbers, one for each month. Do you see any seasonal trends?
medians.csv, in a folder where you won’t lose it.
We are going to explore the use of simulation to (approximately) answer difficult probability questions.
The line below seeds R’s random number generator so that we all get the exact same numerical results; this makes grading easier.
set.seed(2022)
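To see what seeding does, the short sketch below resets the generator to the same seed twice and compares the draws. The seed value 1 and the variable names are arbitrary choices for this illustration; try it in a separate R session, since calling set.seed() again inside your assignment would change your results.

```r
set.seed(1)          # arbitrary seed, used only for this illustration
first <- runif(3)    # three uniform random numbers

set.seed(1)          # resetting to the same seed replays the same stream
second <- runif(3)

identical(first, second)  # TRUE: both draws are exactly the same
```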
Read the help menu for the sample() function. Then, write a function called roll_dice() that takes in a single argument n and simulates the rolling of n ordinary six-sided dice (with faces numbered 1 through 6). It should return a named list of length 2 with names minimum and maximum, containing the minimum and maximum roll among the n dice.
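As a quick orientation to sample(), the sketch below uses coin flips rather than dice. The vectors flips and shuffled are made-up names for illustration; the point is the difference between sampling with and without replacement.

```r
# Draw 4 values from a two-element vector. replace = TRUE is required here
# because we draw more values than the vector contains.
flips <- sample(c("H", "T"), size = 4, replace = TRUE)

# With the defaults (no replacement, size equal to the vector's length),
# sample() returns a random permutation of its input.
shuffled <- sample(1:5)
```

Independent repeated trials generally call for replace = TRUE, since each draw should not affect the next.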
Using your function roll_dice(), create a data frame called rolls with 10,000 rows and 2 columns, named min_roll and max_roll, so that each row contains the minimum and maximum roll of a different set of 5 dice rolls. Hint: use purrr::map_dfr().
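For readers new to purrr::map_dfr(), here is a small sketch of how it stacks named lists into rows. The helper describe() is hypothetical and unrelated to the assignment; it only shows that the names of the returned list become the column names of the result.

```r
library(purrr)  # map_dfr() also needs dplyr installed for row-binding

# Hypothetical helper: returns a named list describing one number.
describe <- function(x) {
  list(value = x, squared = x^2)
}

# map_dfr() calls describe() on each element of 1:3 and row-binds the
# resulting named lists into one data frame with 3 rows and 2 columns.
result <- map_dfr(1:3, describe)
```

To repeat a function that ignores its input, one option is to map over an index vector and discard the index inside an anonymous function.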
Suppose I roll 5 dice. On average, what should I expect the ratio of the maximum roll to the minimum roll to be? Give a precise estimate based on part (b). Hint: You will want to average 10,000 different numbers.