All the tutorials involve using base R packages. While OK for a tutorial, that is not reflective of real-world data analysis, in either performance or usability, a lesson I have learned the hard way. (After using it for three semesters in college, I almost quit R completely in frustration.)
I recommend going straight to the Hadleyverse packages and reading their vignettes for the common use cases: readr for reading files, dplyr for manipulating data, ggplot2 for plotting (the most important package!), and tidyr for reshaping data into a ggplot2-friendly long form.
Wickham himself has a very beginner-friendly, free book for learning R using his packages: http://r4ds.had.co.nz/index.html (far better than any tutorial I could write)
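To make that concrete, here is a minimal sketch of the workflow those packages cover; the file name "sales.csv" and its columns (region, month, units, revenue) are hypothetical placeholders:

    # Read, manipulate, reshape, and plot with the packages mentioned above
    library(readr)
    library(dplyr)
    library(tidyr)
    library(ggplot2)

    sales <- read_csv("sales.csv")            # readr: fast, consistent file reading

    monthly <- sales %>%                      # dplyr: verbs for manipulation
      group_by(region, month) %>%
      summarise(units = sum(units), revenue = sum(revenue), .groups = "drop")

    long <- monthly %>%                       # tidyr: reshape to ggplot2-friendly long form
      pivot_longer(c(units, revenue), names_to = "metric", values_to = "value")

    ggplot(long, aes(month, value, colour = region)) +   # ggplot2: plotting
      geom_line() +
      facet_wrap(~ metric, scales = "free_y")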
Without experiencing base R, you won't appreciate the tidyverse packages, which tend to have more of a learning curve.
For example, you can just run boxplot(x) in base R and it will make you a plot. Only after trying to make modifications to it will you see the benefit of ggplot2.
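A minimal sketch of what that difference looks like:

    x <- rnorm(100)

    # Base R: one call, one plot
    boxplot(x)

    # ggplot2 equivalent: more typing up front, but modifications
    # (facets, themes, scales) compose cleanly as extra layers
    library(ggplot2)
    ggplot(data.frame(x = x), aes(y = x)) +
      geom_boxplot() +
      labs(title = "Same boxplot, built as layers")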
As you mention yourself, "I almost quit R completely in frustration". I believe that is exactly why you appreciate the other packages you mention.
Funny you mention this as an example. I find ggplot2 good for exploratory analysis on a data frame with many categorical variables that can be used for faceting/conditioning or grouping. When trying to make any modifications to its appearance, I usually return to base graphics.
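A rough sketch of that kind of exploratory faceting, using ggplot2's built-in mpg data (the base-graphics loop below is just one possible analogue, not the only way to do it):

    library(ggplot2)

    ggplot(mpg, aes(displ, hwy, colour = drv)) +   # colour by drive train
      geom_point() +
      facet_wrap(~ class)                          # one panel per vehicle class

    # Base graphics analogue: loop over the categories by hand
    op <- par(mfrow = c(2, 4))
    for (cl in unique(mpg$class)) {
      with(subset(mpg, class == cl), plot(displ, hwy, main = cl))
    }
    par(op)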
Same here. ggplot is nice, but whenever I need to do something more complex I suddenly find myself on Stack Overflow. Base graphics are super easy and usually get things done; even lattice is somehow more intuitive. ggplot, when it comes to the details, seems awkward to me.
Agreed, I use base R almost exclusively during exploratory work and move to ggplot only when I want to present a finding... I guess you could say it's inefficient, but it seems to work.
While I appreciate the hard work Hadley has put into this ecosystem, and I detest the language wars, I can't help but feel that the world would be a better place if Hadley had put his effort into python DS modules instead.
Hadley's work puts R almost at parity with python (at best!) for munging, and for academics, this long-term trend of domain languages like R or SAS becomes counterproductive. It's as if he's leading us to a local optimum.
S (R's ancestor) was developed in the mid-70s and 80s, at least a decade before Python was written. Then couldn't your argument be applied to those working on Python data science tools? Variety is the spice of life. :)
No, because the issue isn't that new programming languages and their libraries shouldn't be developed (and I'm a big fan of new languages with new paradigms) but whether the new languages should be general purpose or not. I'd argue that the development of S was a misstep, and one which would have been completely forgotten by now if not for R and the libraries written for it.
Before S there were onerous Fortran routines. Rich Becker, one of the authors of S gave a really interesting talk about this at the useR conference this year, you should check it out if you haven't already: http://blog.revolutionanalytics.com/2016/07/rick-becker-s-ta...
You say that developing S was a misstep, what decision would you have made instead? Build on Fortran?
option 1) S was created at AT&T, the home of UNIX and C. Imagine if statistical functionality had been made a standard part of C rather than developing a new language. While C is maybe not that popular anymore, current languages like Java and C# inherit from it, and we would have a world in which statistical functions were taken as given as part of the default libraries of any language.
option 2) Use another standard language like LISP. This actually was done in the excellent LISP-STAT system. Unfortunately it came out at kind of the wrong time for LISP. These days LISP is kind of undergoing a renaissance with Clojure, Racket, LFE, etc., but in the 1990s it was more associated with the failed AI efforts of the symbolic school and LISP was just seen as this weird thing with lots of parentheses by many people.
The thing is, I use R every day. But like everybody else, I bury it under lots of add-ons like Hadley's and others to make it palatable by hiding as much of the absurd R syntax as possible. That's like dumping ketchup on foul-tasting food. Is this really what we want to do?
That'd add a lot of baggage to C that's really beyond its core mission. It'd also slow down development of statistical interfaces (that lag far behind the stability of systems interfaces). Thank you for responding, but I'm not convinced.
It is interesting that Wes McKinney said that he had started Pandas so that R wouldn't dominate data science. (I am paraphrasing, but I think it was at the beginning of his pandas book).
Now I happen to be a pandas person myself, and I am glad he did it. If someone had advised Wes not to spend his time on Python and to just focus on making R better instead, I would hope he would not have listened to that person.
Hadley's packages seem geared more towards data munging and transformations (I regularly use plyr and ggplot from the 'verse) and not machine learning. In my experience R has better implementations of machine learning algorithms than sklearn, and RStudio is a better IDE than any I have found for Python.
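For what it's worth, a minimal sketch of the kind of modelling interface I mean, via the caret package (just one common entry point to R's many modelling packages; assumes caret and randomForest are installed):

    library(caret)

    set.seed(1)
    fit <- train(Species ~ ., data = iris,
                 method = "rf",                                 # random forest backend
                 trControl = trainControl(method = "cv", number = 5))  # 5-fold CV
    fit$results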
Not arguing that your statement is right or wrong, but I suggest checking out Spyder for python. I go back and forth between the two, and though I hate the syntactic gear shift, they are quite similar.
I could not disagree more. Base graphics are absolutely intuitive for users coming from an imperative style, and base graphics are extremely fast, which is seriously useful a lot of the time. This is not to say I dislike ggplot, I love it, only that it is not the starting point for learning R because its syntax really is not idiomatic R. You'd be imposing two orthogonal learning curves on the new user.
Second, while dplyr et al are fantastic, it is really important to understand the "functional" aspects of R which are much better learned via the simple apply families.
I disagree. The inconsistencies across the apply family makes them hard to learn, and the absence of an apply function for data frames is particularly frustrating.
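A small illustration of those inconsistencies, using nothing beyond base R:

    df <- data.frame(a = 1:3, b = 4:6)

    # Three different calls, three different argument orders / return types:
    lapply(df, mean)              # returns a list, one element per column
    sapply(df, mean)              # "simplifies" to a named numeric vector
    apply(df, 2, mean)            # coerces the data frame to a matrix first

    vapply(df, mean, numeric(1))  # type-safe, but yet another interface

    # mapply puts the function first, unlike the others
    mapply(function(x, y) x + y, df$a, df$b)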
I obviously also disagree on what is idiomatic R. If you know ggplot2, there are relatively few advantages to learning base graphics, if you're mostly interested in graphics for data analysis.
"If you know ggplot2"... but you need to make a lot of plots to get the hang of ggplot2. The "+" syntax (not sure what the proper name for that is) alone is completely foreign and intimidating.
If you want to make great graphs in R, you will need to learn ggplot2. If you just want to learn R, why not keep it simple at first?
Because chances are you're learning R to do data science/analysis. And you're best off spending a little extra work to learn the tidyverse; that investment pays off with an ecosystem of tools that all fit together to help you solve the problems you are most likely to want to solve.
I have to take this comment from whom it comes, i.e. the creator of the library obviously finds it intuitive. But there's definitely a big "brain paradigm shift" with ggplot2 which IMO would be a challenge to impose on the new user. I would argue that even you acknowledge this, since you start your Springer book with your own imperative qplot, and only get into the declarative grammar full-on in Part 2.
That's changed in the second edition of the book, based on the feedback I had from many people who were teaching ggplot2 to first time R users. If you've never used R before, neither base graphics nor ggplot2 is intuitive, so you're better off learning one paradigm and sticking to it.
Interesting, thanks Hadley. I have to say I have moved most of my advanced graphics to ggplot2 and my users absolutely love it. Yes, I bought your book several years ago. Here is an example of a complex plot of mine that successfully uses a 2d plane but multiple dimensions of data, using your excellent library. We are able to put a large amount of data, with multiple obliquely related distributions, all on the same plot. The thick white lines represent a 2-z score, fwiw. As you will gather, we are thereby able to superimpose two related but not linearly correlated distributions on the same plot, using colour to represent cheapness or dearness, and having both basis-point and z-score based visualization. One-stop relative value shop, thanks to ggplot2 ;-)
The Hadleyverse is the best thing that ever happened to R: sane, intuitive tools that do what you expect with common-sense usage and examples. And don't forget stringr!
While I would say it is really important to know some basics, such as data.frame, the *apply methods, etc., I also admit that as soon as possible I jump to data.table & dplyr when nobody's watching.
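For comparison, the same grouped summary in both, using the built-in mtcars data (just a sketch of the two syntaxes):

    library(dplyr)
    library(data.table)

    # dplyr: mean mpg by number of cylinders
    mtcars %>%
      group_by(cyl) %>%
      summarise(mpg = mean(mpg))

    # data.table: same result, terser syntax and by-reference semantics
    dt <- as.data.table(mtcars)
    dt[, .(mpg = mean(mpg)), by = cyl]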
A paid alternative is datacamp.com.
It has videos and interactive exercises you can do directly on the site. And btw, I am not affiliated with them, just a user.
These are interesting exercises. I would also add that what it means to learn, or begin to learn, a language in general but R in particular could vary widely.
Some people actually need to learn statistics while they are learning to express those concepts.
Some people need to learn idiomatic R. I really didn't think of R as very functional till I took Peng's (excellent) Coursera course.
Some people need to get a sense of the libraries available.
This is an intro to get some practice in the classical, original purpose for S, S-PLUS and R: linear algebra operations in R.
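A few of the base R linear algebra operations that practice covers:

    # Basic linear algebra in base R
    A <- matrix(c(2, 1, 1, 3), nrow = 2)   # 2x2 matrix, filled column-wise
    b <- c(1, 2)

    A %*% A                # matrix multiplication
    t(A)                   # transpose
    solve(A)               # inverse
    solve(A, b)            # solve A x = b
    eigen(A)$values        # eigenvalues
    crossprod(A)           # t(A) %*% A, computed efficiently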