All the tutorials involve using base R packages. While OK for a tutorial, that is not reflective of real-world data analysis, in either performance or usability, a lesson I have learned the hard way. (After using it for three semesters in college, I almost quit R completely in frustration.)
I recommend going straight to the Hadleyverse packages and reading their vignettes for the common use cases: readr for reading files, dplyr for manipulating data, ggplot2 for plotting (the most important package!), and tidyr for reshaping data into a ggplot2-friendly long form.
Wickham himself has a very beginner-friendly, free book for learning R using his packages: http://r4ds.had.co.nz/index.html (far better than any tutorial I could write)
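To make that concrete, here is a minimal sketch of the workflow those packages cover; the file name "sales.csv" and its columns (region, month, units, revenue) are hypothetical placeholders:

    # Read, manipulate, reshape, and plot with the packages mentioned above
    library(readr)
    library(dplyr)
    library(tidyr)
    library(ggplot2)

    sales <- read_csv("sales.csv")            # readr: fast, consistent file reading

    monthly <- sales %>%                      # dplyr: verbs for manipulation
      group_by(region, month) %>%
      summarise(units = sum(units), revenue = sum(revenue), .groups = "drop")

    long <- monthly %>%                       # tidyr: reshape to ggplot2-friendly long form
      pivot_longer(c(units, revenue), names_to = "metric", values_to = "value")

    ggplot(long, aes(month, value, colour = region)) +   # ggplot2: plotting
      geom_line() +
      facet_wrap(~ metric, scales = "free_y")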
Without experiencing base R, you won't appreciate the tidyverse packages, which tend to have more of a learning curve.
For example, you can just run boxplot(x) in base R and it will make you a plot. Only after trying to make modifications to it will you see the benefit of ggplot2.
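A minimal sketch of what that difference looks like:

    x <- rnorm(100)

    # Base R: one call, one plot
    boxplot(x)

    # ggplot2 equivalent: more typing up front, but modifications
    # (facets, themes, scales) compose cleanly as extra layers
    library(ggplot2)
    ggplot(data.frame(x = x), aes(y = x)) +
      geom_boxplot() +
      labs(title = "Same boxplot, built as layers")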
As you mention yourself, "I almost quit R completely in frustration". I believe that is exactly why you appreciate the other packages you mention.
Funny you mention this as an example. I find ggplot2 good for exploratory analysis on a data frame with many categorical variables that can be used for faceting/conditioning or grouping. When trying to make any modifications to its appearance, I usually return to base graphics.
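A rough sketch of that kind of exploratory faceting, using ggplot2's built-in mpg data (the base-graphics loop below is just one possible analogue, not the only way to do it):

    library(ggplot2)

    ggplot(mpg, aes(displ, hwy, colour = drv)) +   # colour by drive train
      geom_point() +
      facet_wrap(~ class)                          # one panel per vehicle class

    # Base graphics analogue: loop over the categories by hand
    op <- par(mfrow = c(2, 4))
    for (cl in unique(mpg$class)) {
      with(subset(mpg, class == cl), plot(displ, hwy, main = cl))
    }
    par(op)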
Same here. ggplot is nice, but whenever I need to do something more complex I suddenly find myself on Stack Overflow. Base graphics are super easy and usually get things done; even lattice is somehow more intuitive. ggplot, when it comes to the details, seems awkward to me.
Agreed, I use base R almost exclusively during exploratory work and move to ggplot only when I want to present a finding... I guess you could say it's inefficient, but it seems to work.
While I appreciate the hard work Hadley has put into this ecosystem, and I detest the language wars, I can't help but feel that the world would be a better place if Hadley had put his effort into python DS modules instead.
Hadley's work puts R almost at parity with python (at best!) for munging, and for academics, this long-term trend of domain languages like R or SAS becomes counterproductive. It's as if he's leading us to a local optimum.
S (R's ancestor) was developed in the mid-70s and 80s, at least a decade before Python was written. Then couldn't your argument be applied to those working on Python data science tools? Variety is the spice of life. :)
No, because the issue isn't that new programming languages and their libraries shouldn't be developed (and I'm a big fan of new languages with new paradigms) but whether the new languages should be general purpose or not. I'd argue that the development of S was a misstep, and one which would have been completely forgotten by now if not for R and the libraries written for it.
Before S there were onerous Fortran routines. Rich Becker, one of the authors of S gave a really interesting talk about this at the useR conference this year, you should check it out if you haven't already: http://blog.revolutionanalytics.com/2016/07/rick-becker-s-ta...
You say that developing S was a misstep, what decision would you have made instead? Build on Fortran?
option 1) S was created at AT&T, the home of UNIX and C. Imagine if statistical functionality had been made a standard part of C rather than developing a new language. While C is maybe not that popular anymore, current languages like Java and C# inherit from it, and we would have a world in which statistical functions were taken as given as part of the default libraries of any language.
option 2) Use another standard language like LISP. This actually was done in the excellent LISP-STAT system. Unfortunately it came out at kind of the wrong time for LISP. These days LISP is kind of undergoing a renaissance with Clojure, Racket, LFE, etc., but in the 1990s it was more associated with the failed AI efforts of the symbolic school and LISP was just seen as this weird thing with lots of parentheses by many people.
The thing is, I use R every day. But like everybody else, I bury it under lots of add-ons like Hadley's and others to make it palatable by hiding as much of the absurd R syntax as possible. That's like dumping ketchup on foul-tasting food. Is this really what we want to do?
That'd add a lot of baggage to C that's really beyond its core mission. It'd also slow down development of statistical interfaces (that lag far behind the stability of systems interfaces). Thank you for responding, but I'm not convinced.
It is interesting that Wes McKinney said that he had started Pandas so that R wouldn't dominate data science. (I am paraphrasing, but I think it was at the beginning of his pandas book).
Now I happen to be a pandas person myself, and I am glad he did it. If someone had advised Wes not to spend his time on Python and to just focus on making R better instead, I would hope he would not have listened to that person.
Hadley's packages seem geared more towards data munging and transformations (I regularly use plyr and ggplot from the 'verse) and not machine learning. In my experience R has better implementations of machine learning algorithms than sklearn, and RStudio is a better IDE than any I have found for Python.
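For what it's worth, a minimal sketch of the kind of modelling interface I mean, via the caret package (just one common entry point to R's many modelling packages; assumes caret and randomForest are installed):

    library(caret)

    set.seed(1)
    fit <- train(Species ~ ., data = iris,
                 method = "rf",                                 # random forest backend
                 trControl = trainControl(method = "cv", number = 5))  # 5-fold CV
    fit$results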
Not arguing that your statement is right or wrong, but I suggest checking out Spyder for python. I go back and forth between the two, and though I hate the syntactic gear shift, they are quite similar.
I could not disagree more. Base graphics are absolutely intuitive for users coming from an imperative style, and base graphics are extremely fast, which is seriously useful a lot of the time. This is not to say I dislike ggplot, I love it, only that it is not the starting point for learning R because its syntax really is not idiomatic R. You'd be imposing two orthogonal learning curves on the new user.
Second, while dplyr et al are fantastic, it is really important to understand the "functional" aspects of R which are much better learned via the simple apply families.
I disagree. The inconsistencies across the apply family makes them hard to learn, and the absence of an apply function for data frames is particularly frustrating.
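A small illustration of those inconsistencies, using nothing beyond base R:

    df <- data.frame(a = 1:3, b = 4:6)

    # Three different calls, three different argument orders / return types:
    lapply(df, mean)              # returns a list, one element per column
    sapply(df, mean)              # "simplifies" to a named numeric vector
    apply(df, 2, mean)            # coerces the data frame to a matrix first

    vapply(df, mean, numeric(1))  # type-safe, but yet another interface

    # mapply puts the function first, unlike the others
    mapply(function(x, y) x + y, df$a, df$b)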
I obviously also disagree on what is idiomatic R. If you know ggplot2, there are relatively few advantages to learning base graphics, if you're mostly interested in graphics for data analysis.
"If you know ggplot2"... but you need to make a lot of plots to get the hang of ggplot2. The "+" syntax (not sure what the proper name for that is) alone is completely foreign and intimidating.
If you want to make great graphs in R, you will need to learn ggplot2. If you just want to learn R, why not keep it simple at first?
Because chances are you're learning R to do data science/analysis. And you're best off spending a little extra work to learn the tidyverse; that investment pays off with an ecosystem of tools that all fit together to help you solve the problems you are most likely to want to solve.
I have to take this comment from whom it comes, i.e. the creator of the library obviously finds it intuitive. But there's definitely a big "brain paradigm shift" with ggplot2 which IMO would be a challenge to impose on the new user. I would argue that even you acknowledge this, since you start your Springer book with your own imperative qplot, and only get into the declarative grammar full-on in Part 2.
That's changed in the second edition of the book, based on the feedback I had from many people who were teaching ggplot2 to first time R users. If you've never used R before, neither base graphics nor ggplot2 is intuitive, so you're better off learning one paradigm and sticking to it.
Interesting, thanks Hadley. I have to say I have moved most of my advanced graphics to ggplot2 and my users absolutely love it. Yes, I bought your book several years ago. Here is an example of a complex plot of mine that successfully uses a 2d plane but multiple dimensions of data, using your excellent library. We are able to put a large amount of data, with multiple obliquely related distributions, all on the same plot. The thick white lines represent a 2-z score, fwiw. As you will gather, we are thereby able to superimpose two related but not linearly correlated distributions on the same plot, using colour to represent cheapness or dearness, and having both basis-point and z-score based visualization. One-stop relative value shop, thanks to ggplot2 ;-)
The Hadleyverse is the best thing that ever happened to R: sane, intuitive tools that do what you expect with common-sense usage and examples. And don't forget stringr!
While I would say it is really important to know some basics, such as data.frame, the *apply methods, etc., I also admit that as soon as possible I jump to data.table & dplyr when nobody's watching.
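For comparison, the same grouped summary in both, using the built-in mtcars data (just a sketch of the two syntaxes):

    library(dplyr)
    library(data.table)

    # dplyr: mean mpg by number of cylinders
    mtcars %>%
      group_by(cyl) %>%
      summarise(mpg = mean(mpg))

    # data.table: same result, terser syntax and by-reference semantics
    dt <- as.data.table(mtcars)
    dt[, .(mpg = mean(mpg)), by = cyl]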
A paid alternative is datacamp.com.
It has videos and interactive exercises you can do directly on the site. And btw, I am not affiliated with them, just a user.
These are interesting exercises. I would also add that what it means to learn, or begin to learn, a language in general but R in particular could vary widely.
Some people actually need to learn statistics while they are learning to express those concepts.
Some people need to learn idiomatic R. I really didn't think of R as very functional till I took Peng's (excellent) Coursera course.
Some people need to get a sense of the libraries available.
This is an intro to get some practice in the classical, original purpose for S, S-PLUS and R: linear algebra operations in R.
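A few of the base R linear algebra operations that practice covers:

    # Basic linear algebra in base R
    A <- matrix(c(2, 1, 1, 3), nrow = 2)   # 2x2 matrix, filled column-wise
    b <- c(1, 2)

    A %*% A                # matrix multiplication
    t(A)                   # transpose
    solve(A)               # inverse
    solve(A, b)            # solve A x = b
    eigen(A)$values        # eigenvalues
    crossprod(A)           # t(A) %*% A, computed efficiently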