-
Notifications
You must be signed in to change notification settings - Fork 41
/
20-spreadsheets.Rmd
379 lines (242 loc) · 10.6 KB
/
20-spreadsheets.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
# (PART\*) Import {-}
# Spreadsheets
**Learning objectives:**
- We are going to learn about the various ways in which we can get data out of a spread sheet or google sheet.
- We are going to learn how to write data from spread sheet and google sheet.
## Data Organization in Spreadsheets {-}
[Data Organization in Spreadsheets](https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989) is a paper with some excellent tips:
- Be Consistent
- Choose Good Names for Things
- Write Dates as YYYY-MM-DD
- No Empty Cells
- Put Just One Thing in a Cell
- Make it a Rectangle
- Create a Data Dictionary
- No Calculations in the Raw Data Files
- Do Not Use Font Color or Highlighting as Data
- Make Backups
- Use Data Validation to Avoid Errors
- Save the Data in Plain Text Files
## Excel {-}
Microsoft Excel is a widely used spreadsheet software program where data are organized in worksheets inside of spreadsheet files.
## Prerequisites {-}
```{r,warning=FALSE,message=FALSE}
library(readxl)
library(tidyverse)
library(writexl)
```
## Getting started {-}
Most of readxl’s functions allow you to load Excel spreadsheets into R:
- **read_xls()** reads Excel files with **xls** format.
- **read_xlsx()** read Excel files with **xlsx** format.
- **read_excel()** can read files with both **xls** and **xlsx** format. It guesses the file type based on the input.
## Reading Excel spreadsheets {-}
![](https://r4ds.hadley.nz/screenshots/import-spreadsheets-students.png)
- read_excel() will read the file in as a tibble.
```{r}
students <- read_excel("data/students.xlsx")
students
```
## Column names {-}
The column names are all over the place. We can provide column names that follow a consistent format; we recommend snake_case using the col_names argument.
```{r}
read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age")
)
```
## Skiping the first row {-}
```{r}
read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1
)
```
## Dealing with the missing data {-}
```{r}
read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1,
na = c("", "N/A")
)
```
## Specify the column types {-}
```{r}
read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1,
na = c("", "N/A"),
col_types = c("numeric", "text", "text", "text", "numeric")
)
```
## Read file and then fix data {-}
```{r}
students <- read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1,
na = c("", "N/A"),
col_types = c("numeric", "text", "text", "text", "text")
)
students
students <- students |>
mutate(
age = if_else(age == "five", "5", age),
age = parse_number(age)
)
students
```
## Reading worksheets {-}
An important feature that distinguishes spreadsheets from flat files is the notion of multiple sheets.
![](https://r4ds.hadley.nz/screenshots/import-spreadsheets-penguins-islands.png)
## Read single worksheet {-}
You can read a single worksheet from a spreadsheet with the sheet argument in read_excel(). The default, which we’ve been relying on up until now, is the first sheet.
```{r}
read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
```
## read_excel with NAs {-}
Some variables that appear to contain numerical data are read in as characters due to the character string "NA" not being recognized as a true NA.
```{r}
penguins_torgersen <- read_excel("data/penguins.xlsx", sheet = "Torgersen Island", na = "NA")
penguins_torgersen
```
## excel_sheets for information {-}
Alternatively, we can use **excel_sheets()** to get information on all worksheets in an Excel spreadsheet, and then read the one(s) you’re interested in.
```{r}
excel_sheets("data/penguins.xlsx")
```
Once you know the names of the worksheets, you can read them in individually with read_excel().
```{r}
penguins_biscoe <- read_excel("data/penguins.xlsx", sheet = "Biscoe Island", na = "NA")
penguins_dream <- read_excel("data/penguins.xlsx", sheet = "Dream Island", na = "NA")
```
```{r}
dim(penguins_torgersen)
dim(penguins_biscoe)
dim(penguins_dream)
```
## Bind data for single dataframe {-}
We can put them together with bind_rows().
```{r}
penguins <- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
penguins
```
## Reading part of a sheet {-}
![](https://r4ds.hadley.nz/screenshots/import-spreadsheets-deaths.png)
## Example file deaths.xlsx {-}
This spreadsheet is one of the example spreadsheets provided in the readxl package. We can use the readxl_example() function to locate the spreadsheet on your system in the directory where the package is installed.
```{r}
deaths_path <- readxl_example("deaths.xlsx")
deaths <- read_excel(deaths_path)
deaths
```
## read_excel with range {-}
Here the data we want to read in starts in cell A5 and ends in cell F15. In spreadsheet notation, this is A5:F15, which we supply to the range argument:
```{r}
read_excel(deaths_path, range = "A5:F15")
```
## Data types {-}
In CSV files, all values are strings. This is not particularly true to the data, but it is simple: everything is a string.
The underlying data in Excel spreadsheets is more complex. A cell can be one of four things:
- A boolean, like TRUE, FALSE, or NA.
- A number, like “10” or “10.5”.
- A datetime, which can also include time like “11/1/21” or “11/1/21 3:00 PM”.
- A text string, like “ten”.
## Writing to Excel {-}
```{r}
bake_sale <- tibble(
item = factor(c("brownie", "cupcake", "cookie")),
quantity = c(10, 5, 8)
)
bake_sale
```
```{r, eval=FALSE}
write_xlsx(bake_sale, path = "data/bake-sale.xlsx")
```
- This is how the data look like in excel
![](https://r4ds.hadley.nz/screenshots/import-spreadsheets-bake-sale.png)
## Lost datatypes when reading {-}
Just like reading from a CSV, information on data type is lost when we read the data back in. This makes Excel files unreliable for caching interim results as well. For alternatives, see [Writing to a file](https://r4ds.hadley.nz/data-import.html#sec-writing-to-a-file)
```{r}
read_excel("data/bake-sale.xlsx")
```
## Formatted output {-}
The writexl package is a light-weight solution for writing a simple Excel spreadsheet, but if you’re interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the [openxlsx package.](https://ycphs.github.io/openxlsx) We won’t go into the details of using this package here, but we recommend reading[https://ycphs.github.io/openxlsx/articles/Formatting.html](https://ycphs.github.io/openxlsx/articles/Formatting.html)
## Google Sheets {-}
Google Sheets is another widely used spreadsheet program included. It’s free and web-based. Just like with Excel, in Google Sheets data are organized in worksheets (also called sheets) inside of spreadsheet files.
## Prerequisites {-}
```{r}
library(googlesheets4)
library(tidyverse)
```
A quick note about the name of the package: googlesheets4 uses v4 of the Sheets API v4 to provide an R interface to Google Sheets, hence the name.
## Getting started with Google Sheets {-}
- The main function of the **googlesheets4 package** is **read_sheet()**, which reads a Google Sheet from a URL or a file id. This function also goes by the name **range_read()**.
- You can create a brand new sheet with **gs4_create()** or write to an existing sheet with **sheet_write()** and friends.
- You can read from a public Google Sheet without authenticating with your Google account and with **gs4_deauth()**.
## Reading Google Sheets {-}
![](https://r4ds.hadley.nz/screenshots/import-googlesheets-students.png)
- The first argument to read_sheet() is the URL of the file to read, and it returns a tibble:
```{r gsheets01,message=FALSE,warning=FALSE}
gs4_deauth()
students_url <- "https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w"
read_sheet(students_url)
```
## Define datatypes {-}
```{r gsheets02,message=FALSE,warning=FALSE}
read_sheet(
students_url,
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1,
na = c("", "N/A"),
col_types = "dcccc"
)
```
- Note that we defined column types a bit differently here, using short codes. For example, “dcccc” stands for “double, character, character, character, character”.
## Read specific sheet {-}
```{r gsheets03,message=FALSE,warning=FALSE}
penguins_url <- "https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY"
read_sheet(penguins_url, sheet = "Torgersen Island")
```
## List of sheets {-}
We can obtain a list of all sheets within a Google Sheet with sheet_names():
```{r gsheets04,warning=FALSE,message=FALSE}
sheet_names(penguins_url)
```
## Read portion of the sheet {-}
```{r gsheets05,message=FALSE,warning=FALSE}
deaths_url <- gs4_example("deaths")
read_sheet(deaths_url, range = "A5:F15")
```
## Writing to Google Sheets {-}
We can write from R to Google Sheets with write_sheet(). The first argument is the data frame to write, and the second argument is the name (or other identifier) of the Google Sheet to write to:
```{r gsheets06,message=FALSE,warning=FALSE,eval=FALSE}
write_sheet(deaths)
```
## Authentication {-}
While we can read from a public Google Sheet without authenticating with your Google account, reading a private sheet or writing to a sheet requires authentication so that **googlesheets4** can view and manage your Google Sheets. We can check the [googlesheet4 vignette](https://googlesheets4.tidyverse.org/articles/auth.html) on how to setup the your account in R.
## Summary {-}
![](https://media.tenor.com/AAhRO0tvnT0AAAAM/excited-im.gif)
Microsoft Excel and Google Sheets are two of the most popular spreadsheet systems. Being able to interact with data stored in Excel and Google Sheets files directly from R is a superpower!
- In this chapter we learned how to read data into R from spreadsheets and from Excel with read_excel() from the readxl package and from Google Sheets with read_sheet() from the googlesheets4 package.
## Meeting Videos {-}
### Cohort 7 {-}
`r knitr::include_url("https://www.youtube.com/embed/1OeA2Ogi5is")`
### Cohort 8 {-}
`r knitr::include_url("https://www.youtube.com/embed/UPwib5lZ78o")`
<details>
<summary> Meeting chat log </summary>
```
00:08:22 Abdou: Hello
00:12:42 Abdou: Start
00:57:03 shamsuddeen: Thank you Ahmad
00:57:17 shamsuddeen: I will be using the googlesheet
00:57:24 shamsuddeen: From now
00:58:34 Abdou: Thanks
00:58:38 Abdou: end
00:58:45 shamsuddeen: See you next week
```
</details>