Here are detailed comments on the book. Elsewhere there is a review of the book.
How to read R For Dummies
In order to learn R you need to do something with it. After you have read a little of the book, find something to do. Mix reading and doing your project.
You cannot win if you do not play.
Two complementary documents
They are also complimentary.
Some hints for the R beginner
“Some hints for the R beginner” is a set of pages that give you the basics of the R language. Its approach is completely different from the one R For Dummies takes, so you may want to investigate it.
The R Inferno
If you are just at the beginning of learning R, you should ignore The R Inferno (except perhaps Circle 1).
When you start using R for real and run into problems, that is the time to pick it up and see if it helps.
Missing piece
There is one thing that I think is missing in R For Dummies. Actually it isn’t missing; it comes at the very end, while I think it should be at the start.
That piece is the search function. More specifically, it is the way that R operates that is highlighted by the results of the search function.
The start of “Some hints for the R beginner” talks about search and how R finds objects.
How to use these annotations
first learning
If you are new to R and first reading the book, then you should probably mostly ignore my comments. However, when you are confused by something in the book, you can look to see if there is a comment on that page that pertains to what you are confused about.
revising
On further reading, these comments are more likely to be of use. Some are clarifications, some are extensions.
Page by page comments
These comments are based on the first printing.
Page 10
There is more history in the Inferno-ish R presentation.
Page 11
distribution
I’m not a lawyer, but I think the phrasing about redistribution is not right. I think it should say “change and redistribute” rather than “change or redistribute”.
If what you do never leaves your entity, then you can do absolutely whatever you want. That is the free as in speech part. Legalities only come into play if what you do is made available to others. It is a common misunderstanding that you are restricted in what you do within your own world.
runs anywhere
The book highlights that R runs on many operating systems. It fails to make clear that the objects R creates are the same on all of them. You can start a project on a Linux machine at work, continue it while you commute with your Mac laptop, and then finish it on your Windows machine at home. No problem.
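For instance (a small sketch; the object and file name are made up), an object saved with save on one machine loads unchanged on another:
> x <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
> save(x, file = "myproject.rda")   # on the Linux machine at work
> # ... later, on the Mac or the Windows machine:
> load("myproject.rda")             # x comes back exactly as it was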
Page 12
The book should tell you not to be afraid of new words. New words like “vector”. You don’t need to make friends with them right away, but don’t be scared off.
(technical) Unhappily the word “vector” in R has several meanings — so it is unfortunate that it is the first new word. The meaning used throughout the book is the most common meaning. See The R Inferno (Circle 5.1) for the gory details.
Page 13
statistics
Pretty much everywhere in the book where it says “statistics” I would prefer “data analysis” instead. Statistics, in many people’s minds, is formal and academic, not like what they do. More people can feel comfortable doing data analysis than statistics.
In addition to the fear factor, there really is a (slight) difference between data analysis and statistics. I think data analysis is more important even though I’m trained as a statistician.
fields of study
There are additional fields of study where R is used that are not considered to be data hotbeds, such as music and literature. The flexibility of R becomes very important for data in non-traditional forms.
Page 23
vectors
If you are new to R, you shouldn’t expect yourself to understand this discussion. Just let it sink in over time.
Page 24
assignment operator
Always put spaces around the assignment operator. That makes the code much more readable.
The book tells you on page 63 that you can use = as well. You will see both used. They are mostly the same (differences are explained in The R Inferno, Circle 8.2.26). I agree with the book’s approach to use <-, but really you can use either.
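Here is a small illustration of why the spaces matter (my example, not the book’s):
> x <- 5     # assigns 5 to x
> x<-5       # also assigns, but is easy to misread
> x < -5     # compares x to -5 -- the space changes the meaning
[1] FALSE
> x = 5      # the = form of assignment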
Page 28
RStudio
A nice feature of the RStudio workspace view is that it categorizes the objects.
Page 29
Windows pathnames (technical)
The book implies that you cannot write Windows pathnames with backslashes. Actually you can; you just need to put a double backslash wherever you want a single backslash. Hence it is easier and (often) less confusing to use slashes rather than backslashes.
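For example (the path is made up):
> dat <- read.csv("C:\\work\\mydata.csv")   # double backslashes work
> dat <- read.csv("C:/work/mydata.csv")     # forward slashes work too
> # "C:\work\mydata.csv" would fail -- \w is an unrecognized escape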
Page 30
loading objects (technical)
It is possible to use attach instead of load. If you load an object, then it is put into your global environment. If you attach an object, it is put separately on the search list. If you modify an object that has been attached, then the modified version goes into your global environment.
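Here is a rough sketch of the difference (the file name is made up):
> x <- 1:5
> save(x, file = "mydata.rda")
> rm(x)
> attach("mydata.rda")   # x is now found via the search list, not the global environment
> x[2] <- 99             # modifying it puts a copy into the global environment
> ls()
[1] "x"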
Page 32
vectorization
There are different forms of vectorization, and the book doesn’t make that explicit. Vectorization can be put into three categories:
- vectorization along vectors
- summary
- vectorization across arguments
Functions like sum and mean are vectorized in the sense that they take a vector and summarize it. This is done in pretty much all languages; it is not special.
Vectorization as it is commonly spoken of in R is vectorization along vectors. An example is the addition operator, as seen on page 24. This is the form of vectorization that is so useful and powerful in R.
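For example (my example):
> 1:5 + 10                    # 10 is recycled along the vector
[1] 11 12 13 14 15
> c(1, 2, 3) * c(10, 20, 30)  # element-by-element multiplication
[1] 10 40 90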
You should not expect the third form of vectorization in R. However, it does exist in a few functions. The sum and mean functions do summary-type vectorization:
> sum(1:3)
[1] 6
> mean(1:3)
[1] 2
The sum function also does vectorization across arguments:
> sum(1, 2, 3)
[1] 6
That is basically anomalous. The mean function is more typical by not doing this form of vectorization:
> mean(1, 2, 3)  # WRONG
[1] 1
Unfortunately you don’t get an error or a warning in this case. Do not expect this form of vectorization.
Page 33
error message
Getting error messages can be frightening for a while. But it’s not the end of the world. Relax.
Page 36
names (technical)
In fact it is possible to get any name that you want, but you probably don’t want to.
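For the curious (my example): backquotes let you create, and later refer to, a name that is not syntactically valid:
> `my strange name!` <- 42
> `my strange name!`
[1] 42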
return (technical)
Actually return is not a reserved word, but you should treat it as if it were.
> break <- 1
Error in break <- 1 : invalid (NULL) left side of assignment
> while <- 1
Error: unexpected assignment in "while <-"
> return <- 1  # do NOT do this
>
Page 37
F and T
I wish to emphasize the advice in the book:
- never abbreviate TRUE and FALSE to T and F
- avoid using T and F as object names
Page 42
library
The book suggests (with a slight revision on page 361) loading packages with the library function. Some of us prefer require instead of library for this use. The best use of library is without arguments; this gives you a list of available packages.
> library(fortunes)  # load package
> require(fortunes)  # same thing
> library()          # get list of packages
> require()          # don't do this
Loading required package: 
Failed with error: ‘invalid package name’
contributed packages
I think the authors might be a little too polite in their description of the quality of contributed packages.
I find base R to be phenomenally clean code — it is hard to find commercial code that is less buggy. The quality of contributed packages varies widely. A few are up to the standards of base R, some are quite good, I’m sure there are a few dreadful ones.
With contributed packages you need to be more cautious than when only using base R functionality. Or perhaps I should say that you always need to be vigilant, but if you are using contributed packages, there is a larger chance that a problem is due to a package rather than being your own fault.
Without inspecting the code, I know of two clues to suggest a package is of good quality:
- widely used
- good documentation
A widely used package — such as those highlighted in the book — is an indication that a lot of problems with the code have been fixed or didn’t exist in the first place.
Many people use the test of the cleanliness of restaurant restrooms to infer the cleanliness of the kitchen. Likewise, carefully written documentation is likely to be a sign of clean code.
Page 46
exponentiation (technical)
It is not a good idea to use ** to mean exponentiation; it is not out of the question for that to go away. Stick to using the ^ operator.
Page 49
log and exp
The sentence a little below mid-page about creating the vector inside exp should say inside the log function.
Page 52
infinity
The last sentence on the page should say 10^309 and 10^310 rather than 10^308 and 10^309.
Page 54
table 4-3
You are unlikely to use any of these except for is.na, which you may use quite a lot.
Page 55
types of vectors
All of the types of vectors listed may have missing values (NA).
Page 56
integer versus double
One of the nice things about R is that you hardly ever need to worry about whether something is stored as an integer or a double.
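If you are ever curious, typeof will tell you (my example):
> typeof(1)      # an ordinary number is stored as double
[1] "double"
> typeof(1L)     # the L suffix creates an integer
[1] "integer"
> 1 == 1L        # and they still compare equal
[1] TRUE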
largest integer (technical)
We can see how big the biggest integer is in a couple different ways:
> format(2^31 - 1, big.mark=",")
[1] "2,147,483,647"
> .Machine$integer.max
[1] 2147483647
Page 59
indexing
What is called “indexing” in the book is more commonly called “subscripting”.
Page 64
missing value testing
It is a common mistake to try testing missing values with a command like:
> x == NA
That doesn’t work; you need to use is.na.
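A small demonstration (my example):
> x <- c(1, NA, 3)
> x == NA          # comparisons with NA give NA, never TRUE or FALSE
[1] NA NA NA
> is.na(x)
[1] FALSE  TRUE FALSE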
Page 65
any and all
The last sentence on the page is a false statement. The any and all functions are smart enough to know when they can know the answer and when they can’t:
> all(c(NA, FALSE))
[1] FALSE
> all(c(NA, TRUE))
[1] NA
> any(c(NA, FALSE))
[1] NA
> any(c(NA, TRUE))
[1] TRUE
Page 72
assigning to character (technical)
It is more correct to think of the mode as being character than to think of the class as being character.
Page 82
grep
Alternatively, you can use the value argument of grep:
> grep("New", state.name, value=TRUE)
[1] "New Hampshire" "New Jersey" "New Mexico"
[4] "New York"
Page 83
sub versus gsub
Here is an example that should make clear the difference between sub and gsub:
> gsub("e", "a", c("sheep", "cheap", "cheep"))
[1] "shaap" "chaap" "chaap"
> sub("e", "a", c("sheep", "cheap", "cheep"))
[1] "shaep" "chaap" "chaep"
Page 86
factor attributes (technical)
The book says:
[factors are] neither character vectors nor numeric vectors, although they have some attributes of both.
This sentence is using “attribute” in the non-technical sense. But attributes in the technical sense do come into play: factors have “class” and “levels” attributes.
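You can see those attributes directly (my example):
> f <- factor(c("low", "high", "low"))
> attributes(f)
$levels
[1] "high" "low"

$class
[1] "factor"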
Page 87
factor versus character
Notice how the factor is printed differently than the character vector.
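Here is a small example of the difference in printing (my example):
> c("a", "b", "a")          # character vector: quotes, no levels
[1] "a" "b" "a"
> factor(c("a", "b", "a"))  # factor: no quotes, plus a Levels line
[1] a b a
Levels: a b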
Page 91
American regions (off topic)
There is a brilliant analysis of North American regions called The Nine Nations of North America.
Page 94
date sequences
You might wonder what happens if you start on the thirty-first of the month rather than the first. If you wonder something, try it out to see what happens:
> myStart <- as.Date("2012-12-31")
> seq(myStart, by="1 month", length=6)
[1] "2012-12-31" "2013-01-31" "2013-03-03" "2013-03-31"
[5] "2013-05-01" "2013-05-31"
The result is rather literal-minded, and not to everyone’s taste. But perhaps we can do better:
> seq(myStart + 1, by="1 month", length=6) - 1
[1] "2012-12-31" "2013-01-31" "2013-02-28" "2013-03-31"
[5] "2013-04-30" "2013-05-31"
Wondering is great, experimenting is even greater.
Page 104
one-dimensional arrays (technical)
Regular vectors are not dimensional at all in the technical sense, though we think of them as being one-dimensional. However, there really are one-dimensional arrays. They are almost like plain vectors, but not quite.
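The difference shows up in the dim attribute (my example):
> a1 <- array(1:3)   # a one-dimensional array
> a1
[1] 1 2 3
> dim(a1)
[1] 3
> dim(1:3)           # a plain vector has no dim at all
NULL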
Page 106
playing with attributes
For large objects you often won’t like the response you get when you do:
> attributes(x)
Often better is to just look at what attributes the object has:
> names(attributes(x))
Page 109
extracting values from matrices
The flexibility of subscripting matrices (and data frames) as vectors is a curse as well as a blessing.
If you want to do:
> x[-2,]
and you do:
> x[-2]
then you will get an entirely different result. This can be a hard mistake to find — a few pixels difference on your screen can have a big impact.
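Here is a small example (with a made-up matrix) of how different the two results are:
> x <- matrix(1:6, nrow=3)
> x[-2,]      # drop the second row
     [,1] [,2]
[1,]    1    4
[2,]    3    6
> x[-2]       # drop the second element of the underlying vector
[1] 1 3 4 5 6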
Page 113
first.matrix
The example on this page assumes that first.matrix is as it was first created, not as it has been modified in the intervening exercises.
Page 114
matrix operations
So adding numbers by row is easy. How do you add them by column? One way is:
> fmat <- matrix(1:12, ncol=4)
> fmat + rep((1:4)*10, each=nrow(fmat))
     [,1] [,2] [,3] [,4]
[1,]   11   24   37   50
[2,]   12   25   38   51
[3,]   13   26   39   52
This uses the rep function to create a vector with as many elements as the matrix has (assuming the vector being replicated has length equal to the number of columns), and the replicated values are in the desired positions.
Page 116
inverting a matrix
The reason that the command to invert a matrix is not intuitive is that it is seldom a good idea to (explicitly) invert a matrix.
Page 117
vectors as arrays (technical)
Actually vectors, in general, are not arrays at all. The difference is of little consequence, however.
third array dimension (technical)
I call the items in the third dimension of an array “slices” rather than “tables”. I’m not aware of any standardized nomenclature. I don’t think “tables” is such a good choice because there are other meanings of “table” in R.
array filling (technical)
I’m not able to follow the sentence in the book describing how arrays are filled. How I think of it is that the first subscripts vary fastest (no matter how many dimensions are in the array).
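Here is a small example of that filling order (my example):
> array(1:12, dim=c(2, 3, 2))
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12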
Page 119
rows and columns (technical)
Maybe my brain went on strike, but I think that “rows” and “columns” are reversed in the first paragraph on the page.
Page 120
data frame structure
Note that all the vectors that make up the columns need to be the same length.
data frame structure (technical)
It is possible for a “column” of a data frame to be a matrix, in which case the number of rows needs to match.
data frame length
Note that the length of a data frame is different from the length of the equivalent matrix. The length of the data frame is the number of columns, while the length of the matrix is the number of columns times the number of rows.
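For example (my example):
> df <- data.frame(a=1:3, b=4:6)
> length(df)                # number of columns
[1] 2
> length(as.matrix(df))     # rows times columns
[1] 6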
Page 122
character versus factor
The book suggests always making sure that data frames hold character vectors instead of factors in order to reduce problems. The other main route to avoid frustration is to always assume that there are factors.
The thing you don’t want to do is assume that what is really a factor is a character vector.
naming variables
If, in the middle of the page where it says “In the previous section”, you don’t know what they are talking about, not to worry: you’re not alone.
as with matrices
I’m not clear on the reference to matrices at the very bottom of the page.
Page 124
data frame subscripting
You can get a column of a data frame using either the $ or [ form of subscripting. But there is a difference:
> baskets.df$Granny
[1] 12 4 5 6 9 3
> baskets.df[,Granny]
Error in `[.data.frame`(baskets.df, , Granny) : object 'Granny' not found
> baskets.df[,"Granny"]
[1] 12 4 5 6 9 3
Note the quotes or lack thereof.
Page 130
pieces of a list
I prefer calling the pieces of a list “components” rather than “elements”. One reason is that a component of a list can be another list, and hence not very elementary.
Page 139
The functions that you write are essentially the same as the inbuilt functions. They are first-class citizens.
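For instance (my example), a function you write can be handed to sapply just as a built-in function can:
> cube <- function(x) x^3
> sapply(1:4, cube)
[1]  1  8 27 64
> sapply(1:4, sqrt)
[1] 1.000000 1.414214 1.732051 2.000000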
Page 152
functional programming
You can very effectively use R without having a clue what “functional programming” means. The important idea behind functional programming is safety — the data that you want to use is almost surely the data that really is being used.
Page 153
calculation example
The object names were obviously changed midstream: fifty should be half, and hundred should be full.
Page 157
generic functions (technical)
A detail that only occasionally really matters is that the argument names in methods should match the argument names in the generic. You don’t want to have the argument called x in the generic but object in a method.
Page 171
looping without loops
Using apply functions is really hiding loops rather than eliminating them.
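For instance, these two computations do the same work; the loop in the first is merely out of sight (my example):
> x <- matrix(1:6, nrow=2)
> apply(x, 1, sum)
[1]  9 12
> ans <- numeric(nrow(x))
> for(i in seq_len(nrow(x))) ans[i] <- sum(x[i,])
> ans
[1]  9 12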
Page 172
number of apply functions
Not that it matters, but I count 8 apply functions in the base package in version 2.15.0. There is also a reasonably large number of apply functions in contributed packages.
Page 188
error checking (technical)
Another way to write the check for out of bounds values is:
stopifnot(all(x >= 0 & x <= 1))
This will create an appropriate error message if there is a violation.
This will take multiple conditions separated by commas. So you can have checks like:
stopifnot(is.matrix(x), is.data.frame(y))
to make sure that x is a matrix and y is a data frame.
Page 190
technical tip (technical)
The first sentence starts:
In fact, functions are generic …
It should read:
In fact, some functions are generic …
Page 192
factor to numeric
The book gives the efficient method of converting a factor to numeric:
as.numeric(levels(x))[x]
The slightly less efficient but easier to remember method is:
as.numeric(as.character(x))
Don’t forget the as.character; it matters.
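Here is the trap that both methods avoid (my example):
> f <- factor(c(10, 20, 30, 20))
> as.numeric(f)                  # just the internal codes -- not what you want
[1] 1 2 3 2
> as.numeric(as.character(f))
[1] 10 20 30 20
> as.numeric(levels(f))[f]
[1] 10 20 30 20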
problems with factors (technical)
Circle 8.2 of The R Inferno starts with a number of items about factors.
Page 193
documentation quality
Unfortunately, I think the authors are painting too rosy a picture of the quality of R documentation. There probably is a great document for any task or issue that you have, but you may have a significant search on your hands to find it.
Page 194
help files
It takes practice to learn how to use help files well. It doesn’t help that sections of the help files are in the wrong order (in my opinion). The “See also” and “Examples” sections should be near the top, and “Details” should be at the bottom.
The examples often are the most important part. The book implies that all examples are reproducible. Not all are, but many are.
You don’t need to understand the whole of a help file the first time around. The goal should be to improve your understanding of the function.
Page 199
Stack Overflow
It is possible to subscribe via RSS to R tags.
Page 200
cards
With the cards I’m used to, the command to create cards should include 2:10 rather than 1:9.
Page 202
session info
The book says that it is sometimes helpful to include the results of sessionInfo() in questions. I would change that from “sometimes” to “often”.
Page 210
reading in data
The start of Circle 8.3 in The R Inferno has a number of items about problems reading data in.
Page 216
changing directories
If you are using the RGui, there is a “change dir” item in the File menu.
Page 221
three subset operators
The [[ operator always gets one component. The result is often not a list.
In contrast, the [ operator can get any number of items and (except for dropping) gives you back the same type of object.
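Here is a small example with a made-up list:
> mylist <- list(a=1:3, b="hello", c=TRUE)
> mylist[["a"]]         # one component, pulled out of the list
[1] 1 2 3
> mylist["a"]           # a list of length one
$a
[1] 1 2 3
> mylist[c("a", "b")]   # [ can take several items
$a
[1] 1 2 3

$b
[1] "hello"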
Page 226
removing duplicates
The book shows the removal of duplicates using both logical subscripts and negative numeric subscripts. Be careful with the latter of these:
> vec <- 1:5
> dups <- duplicated(vec)
> vec[!dups]
[1] 1 2 3 4 5
> vec[-which(dups)]
integer(0)
If you create a vector of negative subscripts, you need to make sure it has at least one element. Otherwise you get nothing when you want everything.
Page 240
apply output
The book is in error when it says that the result of apply is always a vector. Other possible results include a matrix and a list.
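For instance (my example):
> x <- matrix(1:6, nrow=2)
> apply(x, 2, sum)      # a vector
[1]  3  7 11
> apply(x, 2, range)    # a matrix
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6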
Page 243
sapply example (technical)
The example at the very top of the page that uses ifelse would be more in the spirit of R if it instead used:
if(is.numeric(x)) mean(x) else NA
Page 245
aggregate (technical)
Alternatives to aggregate include the by function (if you have a data frame) and the data.table package.
Page 253
third paragraph
Something seems to have gone wrong. That the phrase “doesn’t make sense at all” appears in the paragraph seems apropos.
Page 254
checking data
Often checking data with graphics is best. Do plots look as expected?
Page 260
mode
There is a mode function in R, but it does not have the same meaning as in the discussion of location.
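The mode function reports the storage mode of an object (my example):
> mode(1:10)
[1] "numeric"
> mode("hello")
[1] "character"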
Page 270
missing values (technical)
You might think that "pairwise" should be the default choice since it uses the most data. The problem with it is that the resulting correlation matrix is not guaranteed to be positive definite.
Page 274
prop.table (technical)
I wondered if prop.table recognized a table that had added margins. The answer is no; it thinks the margins are part of the data.
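Here is a tiny demonstration of that (my experiment, not from the book):
> tab <- table(c("a", "a", "b"))
> prop.table(addmargins(tab))   # the Sum margin is treated as data
        a         b       Sum 
0.3333333 0.1666667 0.5000000 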
Page 312
multiple plots (technical)
If you want to put the graphics device back into a single-plot state without using the old.par trick, then say:
par(mfcol=c(1,1))
or
par(mfrow=c(1,1))
It doesn’t matter which you say.
Page 314
hardcopy graphics
If you are putting your graphics into a word processor, then pdf is often a good choice.
If you are putting your graphics onto a webpage or into a presentation, then png can be a good choice.
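Here is a sketch of the usual pattern (the file names are made up):
> pdf("scatter.pdf")      # for the word processor
> plot(1:10)
> dev.off()
> png("scatter.png")      # for the webpage or presentation
> plot(1:10)
> dev.off()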
Page 326
boxplots (technical)
To be clear: whiskers are at most 1.5 times the width of the box.
Page 332
changing directory (technical)
To change the working directory and then change it back to the original, you would do something like:
> origwd <- getwd()
> setwd("blah/blah")
> # do stuff
> setwd(origwd)
Page 359
CRAN mirrors (technical)
While all mirrors are conceptually the same as the primary CRAN site, it takes time for changes to propagate. This is unlikely to be an issue unless you are trying to get a brand new release.
Page 360
CRAN packages
As of 2012 October 14, CRAN has 4087 contributed packages.
Page 362
unloading packages
I’ve used R pretty much every day for over a decade and never unloaded a package. I doubt this will be a big issue for you.
Page 363
R-Forge
R-Forge also provides mailing lists. The immediate significance of this for you is that some of your favorite contributed packages might have a dedicated mailing list.
Page 364
own repository (technical)
You can even set up your own repository and fill it with packages that you write.
Page 1
Do you appreciate the meaning of:
knowledge <- apply(theory, 1, sum)
as promised?
Epilogue
I saw a little teddy bear.
Well, I said to myself,
“I know what I want. I gotta get a bear some way.”
from “You cannot win if you do not play” by Steve Forbert