In [None]:
# If you encounter any issues with the installation (the red block below will show an angry message) then let your TA know!

# Uncomment these lines if necessary!
# install.packages("testthat")
# install.packages("IRdisplay")
# install.packages("tidyverse")
# install.packages("tidymodels")
# install.packages("naniar")
# install.packages("plotly")

In [None]:
library(testthat)
library(IRdisplay)
library(tidyverse)
library(tidymodels)
library(naniar)
library(plotly)

<h1 style="text-align: center;">Data Visualization for People in a Hurry</h1>
<p style="text-align: center;"><i>an nwPlus workshop ✨</i></p>

Workshop slides available [here](https://docs.google.com/presentation/d/e/2PACX-1vSPf9e7YfHleOqbkfxuiwXBQNh59jhZoULyXrwL1X1TO8I9IGdlG5lFN4zAlvFEtH0CNnOM_WhpyasR/pub?start=true&loop=true&delayms=30000).

### Goals 🎯
- Produce your first visualization of data using the R programming language
- Learn about how data science can enhance your workflow in non-programming courses
- Explore data science at UBC, in the context of hackathons, and in the working world


### Links 🔗

- **GitHub repository:** Source code [here](https://github.com/michaelfromyeg/data-viz-for-people-in-a-hurry) (currently private).
- **Data set:** Available [here](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data), from Kaggle, or [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)), directly from UCI. It's also included in the repo!

**Credits**

I was inspired by a few other sources for this workshop that I recommend you check out!

- [R Workshop Notes](http://tutorials.iq.harvard.edu/R/Rintro/Rintro.html), *Harvard University*: good background to R
- [Learn X in Y Minutes: R](https://learnxinyminutes.com/docs/r/), *Learn X in Y Minutes*: good refresher for syntax
- [Introduction to Data Science](https://ubc-dsci.github.io/introduction-to-datascience/index.html), *Tiffany-Anne Timbers, Trevor Campbell, Melissa Lee*: UBC's DSCI 100 course textbook

##### PART I - 1 hour

### 1 - What is data science? 🤔
⏱ **10 mins**

See the [workshop slides](https://docs.google.com/presentation/d/e/2PACX-1vSPf9e7YfHleOqbkfxuiwXBQNh59jhZoULyXrwL1X1TO8I9IGdlG5lFN4zAlvFEtH0CNnOM_WhpyasR/pub?start=true&loop=true&delayms=30000) for more information.

#### 1.1 - An introduction to data science

#### 1.2 - A four part process

#### 1.3 - The big picture


### 2 - Wranglin' data 🤠
⏱ **20 mins**

#### 2.0 - R

R can sometimes be a bit tricky to read. Here's the basic syntax you need to know for today.

- Assigning a variable
- Function calls
- Parameters
- Printing to your notebook cell's output

We'll learn each of these as we go, but up front, let's get comfortable with variables.

#### Variables

In [None]:
# Variables

## Hold on to data (allow you to save a result or calculation)
## Can "change" (i.e., vary, hence variable)
## Once created, available anywhere in your program 
## (including later blocks of code)

## To assign a variable in R, we use a fancy arrow, "<-"
## The arrow means, take the thing on the right hand side and save it
## to the variable on the left

5 + 4        # Here, we're not saving the result of "5 + 4" anywhere
x <- 5 + 4   # Here, we save it to x

In [None]:
# Printing data

## To print any data in R, we can simply "put" the variable on its own line
## If you want to be more explicit about it you can also write `print(my_variable)`
## Note the syntax here and lack of spaces. We write print(...), where ... is the variable we want to print. 
## It's often said that the ... is "wrapped" by parentheses.

# All three of these lines will print the value of x

x # Remember: we can access x down here!
print(x)
print(x + 2)

In [None]:
# Your turn: in this cell, create a variable y that is equal to 7. 
# Then, create a variable z that is equal to the sum of x and y.
# Finally, print z

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
# There'll be tests throughout this notebook to help make sure you're on the right track! If the tests work
# (i.e., you solved the problem right), you should just see: "Test passed"

test_that("y is correct", {
  expect_equal(y, 7)
})
test_that("z is correct", {
  expect_equal(z, 16)
})

In [None]:
# Your turn: change the value of x. Does the value of z change? Why or why not?

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
test_that("x has been changed", {
  expect_true(x != 9)
})
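
If you are wondering why `z` kept its old value, here is a minimal sketch using throwaway names (`a` and `b` are hypothetical and not part of the workshop data). The arrow stores the result of the right-hand side at the moment the line runs; it does not store a formula that updates later.

```r
a <- 2
b <- a * 10   # b stores the number 20, not the recipe "a * 10"
a <- 3        # changing a afterwards does not touch b
b             # still 20

b <- a * 10   # re-running the assignment picks up the new value of a
b             # now 30
```

To refresh a variable that was computed from another one, re-run the line that computed it.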

In [None]:
# Your turn: try getting R to compute the square root of 4. Save it to a variable called root.
# Hint: square root shortens to sqrt

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
test_that("the sqrt of 4 has been saved to root", {
  expect_true(root == 2)
})

In [None]:
# We can work with much more than just numbers. We also have "strings", which are basically text. 
# Note: We wrap strings in quotes "" to signify that it's a value, not a variable name.

# Your turn: create a variable x, and assign it the value "hello". Create a variable y and assign it the value "world".

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
paste(x, y) # Should be 'hello world'

In [None]:
test_that("hello world prints", {
  expect_equal(paste(x, y), "hello world")
})

#### 2.1 - Importing data

The first step is to import our data set, but we have to learn a bit more about R to do that.

In [None]:
# Functions

## This concept is very similar to math. Your classic y = f(x) is very much the same concept in programming, 
## under the same name. We use functions to map 0-to-many inputs (called parameters, or arguments) to an 
## output, often called the result or return value. 

## The general form of a function *call* (this means we're using the function, that is, producing our y value) is,
## result <- my_function(argument1, argument2, ..., argumentN) [see how this mirrors y = f(x)?]

## When working with R, we very rarely create our own function (such as, saying h(x) = (x + 3) / 2), but we do use other
## people's functions. Let's practice that!

# This function converts a given argument temp_F, a temperature in Fahrenheit, to degrees Celsius
# If this looks scary, don't worry! We'll almost never have to write our own functions in R
fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

how_hot <- fahrenheit_to_celsius(32)
how_hot
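
To see the `result <- my_function(argument)` pattern once more, here is a small sketch going the other way. The function `celsius_to_fahrenheit` is a hypothetical helper written just for this illustration; it shows that an argument can be passed by position or by name.

```r
# Hypothetical helper: the reverse of the conversion shown above
celsius_to_fahrenheit <- function(temp_C) {
  temp_F <- temp_C * 9 / 5 + 32
  return(temp_F)
}

boiling <- celsius_to_fahrenheit(100)           # argument passed by position
freezing <- celsius_to_fahrenheit(temp_C = 0)   # same idea, argument passed by name

boiling    # 212
freezing   # 32
```

Naming the argument is optional here, but it can make calls easier to read once a function takes several inputs.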

In [None]:
## Your turn: call the print function twice. First, with the number 5, and second, with the string nwPlus.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Your turn: call the help function with a single argument, also just the text help (not the string "help"!)

### BEGIN SOLUTION ###
### END SOLUTION ###

## Psst... this is how you can access help in R. Try typing in help(print) or help(paste)

In [None]:
## Your turn: call the function select with two parameters: mtcars and mpg. Save it to a variable called mtmpg.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
test_that("Select is called correctly", {
  expect_equal(as.numeric(unlist(mtcars$mpg)), as.numeric(unlist(mtmpg)))
})

In [None]:
# Libraries

## To get access to more functions that are already "built in" to R, we need to access things called libraries.
## You can think of libraries as collections of functions that make R much more powerful. 
## For today's workshop, we just need one library, called tidyverse. To import a library, we use the library(...) function;
## it accepts a library's name as the parameter (not as a string, just the actual text name).

## P.S. Sometimes, we need to install a library first before we use it. To do this, just run install.packages("...") where
## ... is the name of the desired library.

# Here's an example

install.packages("dplyr")
library(dplyr)

In [None]:
## Your turn: install the caret package and import it 

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
dp <- createDataPartition(iris$Species)
test_that("caret is imported", {
  expect_equal(exists("dp"), TRUE)
})

In [None]:
## Now, finally, let's install and import the needed 
## libraries for this workshop.

install.packages("tidyverse")
library(tidyverse)

In [None]:
## With the tidyverse package, we get access to a function called `read_csv` that allows us to import data from a URL
## (https://www...) or a local file. The "csv" part means we're reading a csv file; you can think of that like an Excel
## spreadsheet. It's just comma-separated values organized in rows and columns.

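# Here, we read in our desired data set from a URL
# Note: wdbc.data has no header row, so read_csv (which expects one by default) uses the first record as
# column names. That is why the tests in this notebook expect 568 rows rather than the file's 569 records.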

cancer_data <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data")

# We add column names and then "select" the columns we care about--more on this later
colnames(cancer_data) <- c("id", "diagnosis", "radius", "texture", "perimeter", "area", "smoothness", "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension")
cancer_data <- select(cancer_data, "id", "diagnosis", "radius", "texture", "perimeter", "area", "smoothness", "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension" )
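
Whether a file has a header row changes how many rows `read_csv` returns. The sketch below uses three made-up records typed inline instead of a real file (the values and the names `raw_text`, `guessed_header`, and `no_header` are illustrative assumptions). Passing literal text with `I()` and the `show_col_types` argument both assume readr 2.0 or newer.

```r
library(readr)

# Three made-up records with no header line
raw_text <- "101,M,17.99\n102,B,13.54\n103,B,12.45"

# Default behaviour: the first line is treated as column names
guessed_header <- read_csv(I(raw_text), show_col_types = FALSE)

# Supplying the names ourselves keeps every line as data
no_header <- read_csv(I(raw_text),
                      col_names = c("id", "diagnosis", "radius"),
                      show_col_types = FALSE)

nrow(guessed_header)   # 2, because the first record became the header
nrow(no_header)        # 3
```

When a file has no header, supplying `col_names` is one way to avoid losing the first record.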


In [None]:
# Your turn: print out the data we just imported

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
test_that("cancer data is read in correctly", {
  expect_equal(nrow(cancer_data), 568)
})

In [None]:
# Your turn: read in the data from the URL stored in "url" and assign it to variable called old_faithful. Print it out.
# Notice that we don't need to assign column names!

url <- "https://raw.githubusercontent.com/barneygovan/from-data-with-love/master/data/faithful.csv"

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
test_that("old faithful data is read in correctly", {
  expect_equal(nrow(old_faithful), 272)
})

#### 2.2 - Exploring the data

Woah, that's a lot of data!

Before we know how to clean up the data or visualize it, we need to understand what form the data is in. But it's really hard to do that when R spews out that much.

Thankfully, there are a few functions we can use to help manage this.

In [None]:
## Here's an example preview with glimpse. Don't worry, 
## the other functions yield much prettier results!

glimpse(cancer_data)

In [None]:
## R provides two functions to visualize a nice "slice" of your data. They're called head and tail.
## Let's experiment with them.

## Your turn: call the function head with your data as the only argument.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Your turn: do the same with tail. 

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Your turn: call the head function with two arguments. First, the data like before. 
## Second, try adding a number (1, 2, 5, 10). What effect does this have?

### BEGIN SOLUTION ###
### END SOLUTION ###
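
As a recap of the last three cells, here is a short sketch on `mtcars`, a small data set that ships with R (using it here is my choice for illustration; the workshop data behaves the same way). With no second argument, `head` and `tail` return six rows.

```r
default_rows <- head(mtcars)      # first six rows by default
first_three <- head(mtcars, 3)    # first three rows
last_three <- tail(mtcars, 3)     # last three rows

nrow(default_rows)   # 6
first_three
last_three
```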

#### 2.3 - Filtering data

Great, now we understand what our data looks like! To produce the result we want, we might want to reduce the scope of our analysis, or filter out bad rows. What if we want to count and see the total number of benign tumors? Or get only tumors where the `smoothness` is over a certain value?

To solve problems like this, we need to use filter.

In [None]:
## Filter accepts two arguments. The first is your data, and the second is the condition. Conditions can be:
## exactly equal ==
## greater than > (or, greater than or equal to >=)
## less than < (or, less than or equal to <=)

## Conditions must produce a true or false value. Let's try working with true and false before we use filter.

## Here are some example boolean values in R

FALSE
TRUE
3 + 3 == 6
3 + 3 != 6 # This is "does not equal"
100 > 0
0 < 100

In [None]:
## Your turn: print the result of 5 == 4. Print the result of 5 == 5. Notice anything interesting about true and false
## (think: in terms of spelling, capitalization, or punctuation)

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Your turn: print the result of whether or not negative one hundred is greater than zero. Save that to a variable called
## is_colder

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
test_that("is_colder is correct", {
  expect_false(is_colder)
})

In [None]:
## Your turn: the variable weekday is true if it is a weekday, and the variable vacation is true if we are on vacation. 
## We sleep in if it is not a weekday or we're on vacation. Set sleep_in to its correct value.

## Hint: For this problem, you could use R's logical operators to combine boolean values. 
## They work how you want them to work, generally speaking. (You can also solve this problem by directly assigning
## the correct value.)

## "and" == x && y
## Returns TRUE only if x and y are both true; else it's false

TRUE && FALSE
TRUE && TRUE

## "or" == x || y
## Returns TRUE if either of x or y are true (or both are true); else it's false

TRUE || FALSE
FALSE || FALSE

## "not" == !x
## "Flips" the value of x

!FALSE # TRUE
!TRUE # FALSE

# Now... back to the problem at hand

weekday <- TRUE
vacation <- FALSE

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
test_that("sleep_in is correct", {
  expect_equal(sleep_in, "bbb" < "AAA")
})

In [None]:
## Here's an example filter call, looking for all the rows where
## the radius is > 20

## Note that filter takes two arguments: the data frame, and then an expression that returns either true or false based
## on a column name. Be careful with how you type the column name, casing matters!

cancer_data_filtered <- filter(cancer_data, radius > 20)
head(cancer_data_filtered)
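
It can help to see what the condition inside `filter` actually produces. The sketch below uses a made-up `weather` table (the cities and temperatures are invented for illustration). The condition is checked once per row, and `filter` keeps the rows where it is TRUE.

```r
library(dplyr)

# Made-up data for illustration
weather <- data.frame(
  city = c("Vancouver", "Kelowna", "Whistler", "Victoria"),
  temp_c = c(21.5, 32.0, 14.3, 25.1)
)

# The condition yields one TRUE/FALSE per row
weather$temp_c > 20   # TRUE TRUE FALSE TRUE

# filter keeps only the rows where the condition is TRUE
warm <- filter(weather, temp_c > 20)

# Inside filter, combine conditions with a single & (or |), which works row by row
warm_not_hot <- filter(weather, temp_c > 20 & temp_c < 30)

warm
warm_not_hot
```

The doubled forms `&&` and `||` are meant for single TRUE/FALSE values, so the single-character versions are the ones to reach for when filtering.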

In [None]:
## Now, it's your turn to try filtering!

## Your turn: filter all the rows where diagnosis == "B" (B stands for benign). Save this result to a value called 
## benign_rows.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
test_that("cancer data is filtered correctly", {
  expect_equal(nrow(benign_rows), 357)
})

In [None]:
## Your turn: do the same for malignant rows. (What's the filtering condition?)
## Save it to a variable called malignant_rows. Print the tail of that variable.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
test_that("cancer data is filtered correctly (2)", {
  expect_equal(nrow(malignant_rows), 211)
})

In [None]:
## R gives us a very fast way to check the "size" of a table (that is, the number of rows).
## Fittingly, the function is called nrow.

## Your turn: get the number of malignant tumors in the data set and save it to a variable called num_malignant. Do 
## the same for benign.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Your turn: compute the total number of rows in the data set. Save it to a variable called total.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
test_that("total is correct", {
  expect_equal(total, 568)
})

In [None]:
## Your turn: print the percentage benign, and percentage malignant in the data set.
## Hint: this is just a math problem.

# Challenge: Try printing it "well formatted". That is, only to two decimal places and with a '%' symbol at the end.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## No test here, but you should see roughly 40% of your data as malignant and 60% benign. If we wanted to use
## this data to make a prediction on new tumor data, what implications would this imbalance have?

## (Psst... this is a tough question! I'm foreshadowing quite a bit. If you can't think of anything, don't sweat it.)

### BEGIN SOLUTION ###
### END SOLUTION ###

#### 2.4 - Selecting data

There are a ton of columns associated with the cancer data. These are sometimes called variables, or features. We often want to do our analysis on just a few of those columns. R has an easy way of doing this!

In [None]:
## The function we use is called "select". Select's first parameter must be the data set.
## Then, every parameter after must be a column we want to keep.

## for example, selected_columns <- select(original_data, column1, column2, ..., columnN)
## You can select as many, or as few, columns as you want

cancer_only_ids <- select(cancer_data, id)
head(cancer_only_ids)

In [None]:
## Your turn: select only the diagnosis, radius, and smoothness columns from the (original) cancer data. Save
## it to a variable called cancer_select.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
test_that("cancer is selected correctly", {
  expect_equal(ncol(cancer_select), 3)
})

In [None]:
## We can also choose to select all columns *except* one (or two, or three, etc.) We do this through the minus symbol.

## for example, selected_columns <- select(original_data, -column1, -column2, ..., -columnN)
## This removes column1, column2, ..., columnN

## Your turn: select every column except the diagnosis column and id column. Save it to a variable called anonymous_data.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
test_that("anonymous data is selected correctly", {
  expect_equal(ncol(anonymous_data), 10)
})
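
To recap the two styles of `select`, here is a sketch on the built-in `mtcars` data set (chosen only for illustration). Naming columns keeps them; putting a minus sign in front drops them and keeps everything else.

```r
library(dplyr)

kept <- select(mtcars, mpg, cyl)        # keep only these two columns
dropped <- select(mtcars, -mpg, -cyl)   # keep everything except these two

ncol(mtcars)    # 11
ncol(kept)      # 2
ncol(dropped)   # 9
```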

#### tl;dr

This entire process of selecting, filtering, and modifying data is generally referred to as "wrangling." This process is extremely important, because working with good, clean data is vital to producing a good visualization.

Want to learn more about what "clean" data means? Come to part 2 of this workshop!

#### 2.X - Challenge

Sometimes data comes with imperfections. Here we learned the tools to "fix" those imperfections, but the data we're using doesn't have very many.

Firstly, let me add some imperfections into our data frame.

In [None]:
cancer_select <- replace_with_na(cancer_select, replace = list(radius = c(20.57, 19.69, 11.42, 20.29, 12.45, 18.25)))
head(cancer_select, n=10)

Oh no! Now our radius has some NA ("not available") values. This is trouble, because now some of our calculations won't work. For example, imagine we wanted to compute the mean radius in our data frame. We would do:

In [None]:
cancer_radius_mean <- mean(cancer_select$radius)
cancer_radius_mean

But instead, we get `<NA>`. This is obviously not correct; we want to find some way of removing all of the rows with NA values. How should we do that? (Google is your friend here!)
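
To see why a single missing value spoils the whole calculation, here is a tiny sketch with a made-up vector (`values` is a hypothetical name). Any arithmetic involving NA gives NA, and `is.na` tells you where the gaps are.

```r
values <- c(4, NA, 10)

plain_mean <- mean(values)                  # NA, because one input is unknown
missing_flags <- is.na(values)              # FALSE TRUE FALSE
mean_skipping_na <- mean(values, na.rm = TRUE)   # 7, the NA is skipped for this one call

plain_mean
missing_flags
mean_skipping_na
```

The `na.rm = TRUE` option only affects that single call and leaves the table unchanged; the exercise below asks you to drop the affected rows instead.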

In [None]:
## Your turn: produce a table that has all of the NAs removed.

## Hint: the function is.na returns whether or not a value is equal to NA. How could we use this within a function we've
## already used today?

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
test_that("the nas are removed", {
  expect_equal(nrow(cancer_select), 561)
})

Sometimes we want to add a new column by transforming the data in some way. For example, what if we wanted to convert the units of the radius column to metres? Or maybe even kilometres?

Thankfully, there's a way to do this in R. We can use the mutate function. Mutate takes two arguments. First, the data frame to be transformed. Second, the actual transformation. The transformation goes like this:

$$ new\_column\_name = A \times old\_column\_name + B $$

Note that "A * old + B" was arbitrarily chosen. You could do any math you like. Here's an example:

In [None]:
cancer_mutate <- mutate(cancer_data, new_column = radius * smoothness + concavity) # The new_column gets placed on the end!
head(cancer_mutate)
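
A unit conversion is the simplest case of the formula above, with B equal to zero. The sketch below uses a made-up table of lengths in centimetres (the `samples` table and its column names are assumptions for illustration) and adds a column in metres.

```r
library(dplyr)

# Made-up measurements in centimetres
samples <- data.frame(
  name = c("a", "b", "c"),
  length_cm = c(150, 20, 375)
)

# new_column = A * old_column + B, with A = 1/100 and B = 0
samples_m <- mutate(samples, length_m = length_cm / 100)
samples_m
```

The original `length_cm` column is kept, and `length_m` is appended on the right.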

In [None]:
## Your turn: mutate the cancer_data data frame so that it also includes a column 
## that has the radius squared and multiplied by pi. What does this value represent?

### BEGIN SOLUTION ###
### END SOLUTION ###

### 3 - Your first visualization 📊
⏱ **30 mins**

#### 3.0 - How the turns have tabled

Before we create a visualization, it's often helpful to create a "summary" table of our data, including each column's mean, median, mode, and potentially even sum.

We'll start by looking at some of the more "basic" numerical summaries you might want to produce. R is a programming language primarily used by statisticians and data scientists. As such, it has a lot of really powerful functions built in that allow you to do statistics!

In [None]:
## One common metric in statistics is the mean, or the average. To find the mean in our data set, we first select the column
## we'd like to compute the mean for. Then, we run the mean function on that column.

## To select your column (since we're only selecting one column), you can either use pull(dataframe, columnname) 
## or you can use the "$" operator. 

## Pull takes out a column and gives us just the values, getting rid of the extra information which would confuse
## the mean calculation.

## To select a column with "$", we can write dataframe$column_name.

## Here's an example mean computation two different ways.

cancer_radius_mean <- mean(cancer_data$radius)

cancer_radius_pull <- pull(cancer_data, radius)
cancer_radius_mean2 <- mean(cancer_radius_pull)

cancer_radius_mean == cancer_radius_mean2
cancer_radius_mean
cancer_radius_mean2
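
The difference between `select` and `pull` is easy to miss, so here is a sketch on the built-in `mtcars` data set (used only for illustration). `select` hands back a one-column table, while `pull` hands back the plain values that `mean` expects.

```r
library(dplyr)

as_table <- select(mtcars, mpg)   # still a data frame, with one column
as_values <- pull(mtcars, mpg)    # a plain numeric vector

is.data.frame(as_table)   # TRUE
is.numeric(as_values)     # TRUE

values_mean <- mean(as_values)                  # about 20.09
table_mean <- suppressWarnings(mean(as_table))  # NA: mean() does not know what to do with a table

values_mean
table_mean
```

Without `suppressWarnings`, the last call also prints a warning that the argument is not numeric or logical.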

In [None]:
## Your turn: try finding the median concavity in cancer_data. Save it to a variable called median_concavity.

### BEGIN SOLUTION ### 
### END SOLUTION ###

In [None]:
test_that("median concavity is correct", {
  # Recompute the median directly so the check does not rely on a hard-coded number
  expect_equal(median_concavity, median(cancer_data$concavity))
})

In [None]:
## There's one more cool function that's super useful for any stats work you're doing. It's called summary. It gives
## a breakdown of some key stats of all of your columns. The summary function takes one argument, your data frame.

## Your turn: print a summary of cancer_data.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Okay--one more function. nrow gives you the number of rows in a dataframe. You've seen this one before,
## I just want to make double-sure you remember how to use it!

## Your turn: find the number of rows in cancer_data. Save it to a variable called total_cancer_rows.

### BEGIN SOLUTION ###
### END SOLUTION ###

Sometimes, especially when working with data that has two or more distinct categories (in our case, malignant tumors and benign ones), we might want to calculate the mean for each group separately. Thankfully, R has a really handy way of doing this.

First, we want to make our groups. We do this with the group_by function. It takes two arguments: a dataframe, and which column we'd like to use to make our groupings. In our case, that'll be `diagnosis`.

Second, we need to produce a summary of our data and compute some statistic. We can do this with the summarize function. It takes two arguments: first, a grouped dataframe, and second, an expression which computes a new column based on an old column. Something like:

$$ new\_column = mean(old\_column) $$

This would create a new column for each group that is equal to the mean of the old column for each group, and then smush it all back together in one big dataframe.

In [None]:
## There's a lot going on with group_by and summarize. Here's an example of its usage.

cancer_data_groups <- group_by(cancer_data, diagnosis)
cancer_data_summarize <- summarize(cancer_data_groups, mean_radius = mean(radius))
cancer_data_summarize

## Is the final data frame what you expected? This pattern is super common in R, and can be great for producing statistical
## summaries that are grouped by some shared feature.
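
`summarize` can compute several statistics at once. The sketch below groups the built-in `mtcars` data by number of cylinders (a data set chosen purely for illustration) and asks for both a mean and a row count per group, using dplyr's `n()` helper.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)
cyl_summary <- summarize(by_cyl,
                         mean_mpg = mean(mpg),   # average fuel economy per group
                         n_cars = n())           # number of rows in each group
cyl_summary
```

The result has one row per group (4, 6, and 8 cylinders) and one column per statistic, plus the grouping column itself.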

#### 3.1 - ggplot

ggplot is a really fancy R function used to create lots of different kinds of visualizations, or "plots."

It is super powerful, and you can create *tons* of different charts with it. The only way to master ggplot is to experiment and try many different things. Rather than try to "explain" ggplot, I'm going to show you some example code, and we'll pick it apart as we go along.

Here, I'll try to give you a taste of what's possible!

In [None]:
## Here's a basic scatterplot of the data we've gathered

basic_plot <- ggplot(cancer_select, aes(x = radius, y = smoothness)) +
    geom_point()
basic_plot

ggplot looks scary, so let's try to break it down. First, recognize that ggplot is a function that takes a series of arguments. The first argument is your data set. The second is "aes", short for "aesthetic specifications". For our purposes, this is just a function that accepts the x and y columns we'd like to use.

What is the "+" symbol at the end doing? We'll get to that in the next section.

In [None]:
## Your turn: create a scatterplot with perimeter along the x axis and area along the y axis (both columns are in cancer_data).

## Hint: copy and paste is your friend.

### BEGIN SOLUTION ### 
### END SOLUTION ###

In [None]:
## Your turn: using your solution from the last cell, change "geom_point()" to "geom_violin()". What happens?

### BEGIN SOLUTION ###
### END SOLUTION ###

## Psst... see a complete list of different geom_...()s you can put here: https://ggplot2.tidyverse.org/reference/

Woo hoo! You've made your first few plots ever. With practice, you'll find using ggplot to create effective visualizations is tremendously easier than trying to do the same thing in Excel. 

#### 3.2 - Adding layers

We can add layers to our plot to supply additional graphics; this is also useful for adding additional data, such as a single point, to our graph. Again, let's look at some sample code and I'll help break it down for you.

Here's a rather involved example.

In [None]:
ggplot(cancer_select, aes(x = radius, 
                          y = smoothness)) +
    geom_point() + 
    xlab("Radius") +
    ylab("Smoothness") + 
    geom_smooth(method=lm,   # Add linear regression line
                se=FALSE) +  # Don't add shaded confidence region
    theme(text = element_text(size = 30))

Notice the "+" symbol at the end of every line? This is something special we do with ggplot, usually referred to as "adding layers". We can make our visualization more complex by adding one "layer" at a time. A layer typically refers to something like geom_point() (which adds a point layer to our graph), geom_smooth(), or the countless other layers we can add, but you can also use "+" to add things like a theme to your plot, a legend, labels for your x- and y- axes, and more. The possibilities are endless!
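
One way to get a feel for layers is to build a plot one step at a time and save each step. This sketch uses the built-in `mtcars` data set and made-up variable names, purely for illustration; each "+" returns a new plot object that you can keep adding to.

```r
library(ggplot2)

# Step 1: data and axes only, which draws an empty canvas
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg))

# Step 2: add a layer of points
with_points <- base_plot + geom_point()

# Step 3: add axis labels on top of that
with_labels <- with_points +
    xlab("Weight (1000 lbs)") +
    ylab("Miles per gallon")

with_labels
```

Printing `base_plot`, `with_points`, and `with_labels` in turn shows what each addition contributes.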

In [None]:
## Your turn: make the font size of basic_plot 60 by adding a layer. Save it to a variable called big_text_plot.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Your turn: add a new point to the plot you just created. It can be anywhere on the graph; note that you'll
## need to specify its x and y values.

## Hint: you'll need to add another geom_point() layer. Here's the documentation for geom_point: 
## https://ggplot2.tidyverse.org/reference/geom_point.html

## Set the new point's color to something distinct, like red. Also, make its size huge by setting its size and
## stroke value to 5.

### BEGIN SOLUTION ###
### END SOLUTION ###

#### 3.3 - Getting fancy

Scatterplots are cool and all, but what else can we do? What if we wanted to group the points? How can we create a more effective visualization?

One thing that might be natural with data like this, where we have a binary grouping (benign or malignant tumors) is to color the data. How do we do that?

With aes! Here's an example.

In [None]:
# This code just changes the plot size to make it easier to read
options(repr.plot.width = 10, repr.plot.height = 10, repr.plot.res = 100)

ggplot(cancer_select, aes(x = radius, 
                          y = smoothness, 
                          color = diagnosis)) +
    geom_point(size=2) + 
    xlab("Radius") +
    ylab("Smoothness") + 
    theme(text = element_text(size = 30))
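
There are two different ways to give points a color, and mixing them up is a common stumbling block. This sketch uses the built-in `mtcars` data set for illustration. Inside `aes`, color refers to a column, so each group gets its own color and a legend. Outside `aes`, color is a fixed value applied to every point.

```r
library(ggplot2)

# Mapped color: one color per group, with a legend.
# cyl is stored as a number, so factor() tells ggplot to treat it as categories.
mapped <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
    geom_point(size = 2)

# Fixed color: every point is red and no legend is drawn
fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point(color = "red", size = 2)

mapped
fixed
```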

In [None]:
## Your turn: create a scatterplot with perimeter along the x axis and area along the y axis. The plot should
## have benign and malignant tumors as different colors. The x- and y- axes should be labelled.

## Bonus: add a title to the plot (don't be afraid to use Google!)

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Your turn: produce a bar chart that represents the number of rows of malignant data and the number of rows of benign
## data. You'll have to do some data wrangling to get the data that we want.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Your turn: create some kind of other plot that is not a violin, and not a scatterplot, 
## using the skills you learned today. Show it off to the group!

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Your turn: create a visualization of the old_faithful data from earlier, but only include observations where the
## geyser has been waiting for 80 or more minutes!

## Here's the data frame again, so you can see it printed out.

head(old_faithful)

### BEGIN SOLUTION ###
### END SOLUTION ###

Now we know how to wrangle data from the internet and visualize it using the ggplot library in R. What else is left to do? Well, inference! 

Our next goal is to predict whether or not a new tumor is benign or malignant. Given the plots you made above, how do you think we could make this kind of prediction?

In [None]:
## Your turn: write down an idea you have for how we could predict whether or not a new tumor is malignant or benign

### BEGIN SOLUTION ###
### END SOLUTION ###

#### 3.X - Challenge

In [None]:
## Your turn: Create a 3D scatterplot. That is, use three predictors 
## (the third one is your choice, two should be radius and smoothness). The plot should still be colored.

## Hint: this is a really tough question. You'll have to use another library, plotly, to achieve this. 
## Learning how will require all the skills you learned today! 
## Check out a good starter article here: 
## https://www.datanovia.com/en/blog/how-to-create-a-ggplot-like-3d-scatter-plot-using-plotly/

## Remember: copy and paste is your friend! Don't reinvent the wheel. Also, don't be afraid to ask your TA for help.

### BEGIN SOLUTION ###
### END SOLUTION ###

##### PART II - 1 hour

Attend part 2 of this workshop series to see how you can extend what we've just done to predict whether or not a tumor is benign or malignant!

Thank you so much for attending part 1 ✨ I hope you enjoyed!