I love R, and I think the insight people often overlook about R's success is pretty simple: the easy things are easy. Doing hard things in R can be very hard, but the easy things are easy. E.g., loading a CSV full of data and running a regression on it is two lines of code that are pretty easy to explain to people:
$ R
data <- read.csv(file='some_file', header=T, sep=',')
model <- lm(Y ~ COL1 + COL2 + COL3, data=data)
and if you want to use glm -- logistic regression, etc -- it's a trivial change:
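For example (a sketch, assuming Y is a 0/1 outcome):
model <- glm(Y ~ COL1 + COL2 + COL3, data=data, family=binomial)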
It really allows people to do quite powerful statistical analyses very simply. Note that they built a DSL for specifying regression equations -- and you don't have to bother with bullshit like quoting column names; quoting requirements are often hard to explain to new computer users.
R's other key feature is that it includes a full SQL-like data manipulation language for tabular data; it's so good that every other language that does stats copied it. If df is a dataframe, I can issue predicates on the rows before the comma and columns after the comma, e.g.
df[ df$col1 < 3 & df$col2 > 7, 'col4']
that takes my dataframe and subsets it so -- row predicates before the comma -- col1 is less than 3 and col2 is greater than 7 -- and column predicates after the comma -- just returns a new dataframe from the subset with col4 in it. It's incredibly powerful and fast.
I wouldn't necessarily consider the 'bracket notation' a query, and it's very un-SQL-like. It's simply three vector operations which return a boolean vector, e.g. (TRUE, FALSE, TRUE, ...), which is then used to select particular indices. I think this way of operating is more familiar to statisticians.
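To make that concrete, a rough sketch of the intermediate steps (df here is hypothetical):
mask <- df$col1 < 3 & df$col2 > 7   # element-wise comparisons, yields c(TRUE, FALSE, ...)
df[mask, 'col4']                    # the logical vector picks out the matching rows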
Yes it's completely unlike SQL. Rather than declaring what is wanted and being agnostic about implementation, it actually specifies the implementation of the query, and does so using data structures of length equal to the dimension of the full table, so it's completely unscalable.
I think of this as the statistics version of PHP's
<?
$paragraph = 'hello world';
echo '<p>'.$paragraph.'</p>';
?>
It makes basic tasks super easy, which allows people who don't know what they're doing to make mistakes. It's just a different philosophy that has some negatives.
There are people who think that there's elegance in R's design. Remember that it is as old as C (if not older!), yet it still feels like a modern language (warts and all). You don't compare C to, say, Clojure for language features; the comparison just shows the advances in language design over the years.
R says: 'everything is a vector, and vectors can have missing values'. This is profound. It was only recently that other matrix-oriented language extensions (say, pandas) got missing values, even though they are meat-and-potatoes for data analysis.
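A quick sketch of what that looks like in practice:
x <- c(1, 2, NA, 4)
mean(x)               # NA -- missingness propagates by default
mean(x, na.rm=TRUE)   # 2.333..., drop the NAs explicitly
is.na(x)              # FALSE FALSE TRUE FALSE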
Well, it's not older than C, but around the same time. Wikipedia shows S (R's commercial inspiration, for want of a better term) was first written in 1976, but C was around back in 1972. And C was adopted by lots of people a lot faster than S, though it kept chugging along.
Agree with the missing-values-in-native-datatypes point, although many libraries for other languages had workarounds for missing values that allowed you to get similar results even back in the day, way before Python and pandas.
I think the general consensus is that for every thing S and R got right, they got a bunch of other stuff not-so-right, but it was worth the pain to work around them because the good stuff was so very good.
Regarding the adoption rate of S, I think you hit upon a key source of the complaints about R.
R existed in one form or another for a (relatively) small, specialized audience for a long, long time. This included the period where transitioning from S to R included placing a very high priority on maintaining code from S, even being able to replicate the bugs from S. This made a lot of sense from the perspective of a (relatively) small group of statisticians making a data analysis language for their own uses.
Over the last 10 years, R has become massively more popular, and many, many more people are using it outside of its "base" audience. Not surprisingly, many of these new users are discovering that R was not designed with their personal use cases in mind.
Given these constraints, R has done a remarkably good job of adjusting (where it can) to accommodate the diversification in its user base. But you can only go so far without just re-writing it from scratch, and the interests of R Core will always bend towards themselves and their, well, "core" audience.
SAS has missing values as well and goes further -- there are multiple types of missing values (i.e., "not available", "doesn't pass quality control review", etc.). We added missing values to our product to interop with both R and SAS.
In SAS, beware that missing values are treated as the smallest possible value (e.g. -Inf). This means that statements like x < 10 return true if x is missing.
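For contrast, R refuses to order missing values at all (a small sketch):
x <- c(5, NA, 20)
x < 10                  # TRUE NA FALSE -- the comparison itself stays missing
x[!is.na(x) & x < 10]   # the usual idiom to filter missings out explicitly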
Most folks use IEEE floats as their value type, so in a way (via NaN) a whole lot of libraries have supported missing values for a long time. Many languages can use 32- and 64-bit floats; R can only use one or the other (you have to rebuild it to change this). This can make your big data twice as big as it should be. In that, R is in the company of such serious number-crunching languages as Lua, JavaScript, and Splunk.
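In R the two concepts stay distinct even though both live in a double (a sketch):
v <- c(1.5, NA, NaN)
is.na(v)    # FALSE TRUE TRUE -- NA and NaN both count as missing
is.nan(v)   # FALSE FALSE TRUE -- but only NaN is not-a-number
typeof(v)   # "double"; NA_real_ rides on a reserved NaN bit pattern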
My startup [1] does flavor profiling and statistical quality control for beer and bourbon producers - it's a fun job!
Our entire back-end is built in R, mostly within the Hadley-verse, and we use Shiny [2] as our web framework.
Our team works a bit differently than most, I suspect; our data scientists build features and analyses directly in R, and then add the functionality to our Shiny platform. Our "real devs" are all server + DB, or Android guys. This has created a great development system where all of the "cool findings" and "awesome visualizations" are immediately implemented in our system and made available for our clients!
Edited to add: R is a great language and is 100% suitable for production systems. It's older than Python(!) and, with some experience, can be made into high-performance code.
Agreed, I've seen production R code, and it just works. RStudio has some seriously good debugging tooling, and the testing frameworks are there. I do hear people calling R a toy language, but they're the ones who don't know much about it.
All of our APIs are built in R using Rserve,
and we share code and functions through an internal package hosted on bitbucket.
You can call R functions on a server from Android and return the results in a list or array - makes for some really cool internal API magic between our devs and data-scientists :)
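Roughly what that looks like on the R side (a sketch, assuming the Rserve and RSclient packages; the Android side talks the same protocol through the Java client):
# on the server
library(Rserve)
Rserve(args='--no-save')    # listens on TCP port 6311 by default

# from any client that speaks the Rserve protocol, e.g. RSclient in R
library(RSclient)
conn <- RS.connect(host='localhost')
RS.eval(conn, mean(rnorm(100)))   # evaluated on the server, result comes back as an R value
RS.close(conn)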
While I'm not as fond of ggplot2 as the author is, and actually prefer base graphics when making things for publication, I think he hits on a lot of strong points.
I'm rather fond of R as a language, and hop between it and Python as my tools of choice for a given task. I think the package ecosystem is its biggest plus - for statistical work, Python might have a package to do something, R almost certainly will.
I think the opinion on ggplot2 is unanimous: you can do so much more with it -- but Lord Almighty, what an expense of time it is to do anything specific! People like it or not depending on how they value their free time.
I always check in detail with newbies what they are trying to do before I mention that name (it used to be surprisingly hard to come across it randomly), because once I have, it's a rabbit hole -- and they generally have weird ideas that need nothing more than a few single-dimension graphs, easily done with hist() and plot(); however, anything a little subtle benefits so much from that flexibility.
I still don't understand why bucketed log scales for histograms and properly typed percentage scales (i.e. "10%" and not "0.1") are so hard to do, but I love impressing the one guy who tried by showing those casually.
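The usual suggestion people point to is the scales package -- a sketch, assuming a data frame d with a numeric column x and a proportion column p:
library(ggplot2)
library(scales)
ggplot(d, aes(x, p)) + geom_point() + scale_x_log10() + scale_y_continuous(labels=percent)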
I think it's a bit of a stretch to say that your opinion of ggplot2 is unanimous. Obviously I'm rather biased, but I think there is some evidence to suggest that once you understand the grammar of graphics based approach, you can create many EDA graphics faster with ggplot2 than base or lattice.
Tweaking plots for presentation is obviously more challenging, but I think it's a challenge with every graphics system.
For most of what I want to plot, I find ggplot code faster and easier to write than base R graphics (and prettier, but that's subjective). This includes using ggplot2::qplot for quick one-off plots.
That's one case, but it doesn't work well with user-defined buckets, has an inexplicable tendency to shift to "10.00%" when there is no room to do so, and works with only some of ggplot's many wonderful graphs... But yes, that one, when it works, is generally great.
I hadn't used data.table or plyr because the native R functions were giving me good performance even at tens of thousands of rows.
But now that I'm doing analysis on hundreds of thousands of rows, doing aggregation takes a while. This article convinced me to give those packages a try. If data.table and plyr aggregate functions are indeed parallelizable, that's a big deal, especially when implementing bootstrap resampling.
You should know that dplyr, plyr's replacement, is already fairly stable and worth using. For basic tasks, it's as fast as or faster than data.table in my experience, with the caveat that it is more likely to copy for some methods than data.table, which is very strict about this.
And being able to combine it with an RDBMS means that you can potentially do plyr-ish things on datasets that can't fit in memory. I think the SQL generated by dplyr tries to be smart / efficient too.
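A sketch of what that looks like (assuming dplyr with its database backend and the RSQLite package; the table and column names here are made up):
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), 'flights.sqlite3')
flights <- tbl(con, 'flights')
daily <- flights %>%
  group_by(day) %>%
  summarise(n = n(), avg_delay = mean(dep_delay))
show_query(daily)   # inspect the SQL dplyr generates
collect(daily)      # only the aggregated rows come back into R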
I use data.table and like it. It's fast (and I think memory efficient). plyr is IMO so slow as to be unusable on just hundreds of thousands of rows. dplyr, as mentioned, is the replacement that fixes this, but it was literally first released just a couple of months ago. I have no doubt Hadley will produce something brilliant, but I'm guessing it will undergo a lot of change to get there.
I've done some social science statistical analysis in R. I tried to reproduce the analysis using Python over the weekend, but the tools just aren't there. For example, doing a within-subjects ANOVA in R is maybe 3 lines of code. No native functions exist to do it in Python. Structural equation modelling? It doesn't exist in Python. I'm fairly sure that going forward, people will implement common social science statistical procedures in Python, but I need them today.
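For the curious, the R version is roughly this (a sketch with hypothetical columns subject, condition and rt):
d$subject <- factor(d$subject)
fit <- aov(rt ~ condition + Error(subject/condition), data=d)
summary(fit)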
I think Python may be closer to ready if you're doing Bayesian stats. Going through some quantitative coursework for a phd in a social science, my experience has been that for frequentist stats, R has everything and Python has a tiny fraction.
On the other hand, I'm taking a Bayesian course now, and am thinking that I could probably do the whole works in Python with little effort. That said, I'm not sure doing things in Python would actually buy me anything over R. If I were to do something other than R it would probably be something like Julia or Clojure, but that would also be more for my amusement rather than for any practical reason.
I just don't know if Python will actually be better once the tools are there. What R does is good; what Python will eventually do will be good too, but not necessarily better.
Random R gripe: it's hard to reuse code cleanly, because it lacks a nice import system like Python, Haskell, etc. Related: making a package is complicated (or was last time I looked).
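To illustrate (file and package names here are hypothetical):
source('utils.R')    # dumps every top-level object from the file into your workspace
library(mypkg)       # proper namespaces exist, but only once you've gone through building a package
mypkg::helper(x)     # qualified access works; there's just no lightweight 'from utils import helper' in between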
Does anyone know of a good explanation of how plyr, dplyr, data.table and *apply functions differ? I'd love to read an in-depth analysis of each and make an informed decision on which one to use going forward.
My current m.o. is to use data.frames as needed and plyr if I need to do any serious manipulation (which means that every time I use plyr, I need to read the docs). There's a lot of benefit to picking one direction and sticking with it...
In R, the SQL GROUP BY operation -- i.e., take a group of things, split them on an identifier, run a function on each group with a distinct identifier, and collect the results -- is called tapply. tapply has the signature tapply(X, INDEX, FUN, ...). In this case X is the data to be split, INDEX is the identifier, and FUN is the function. There are some limitations, including that X must be a vector.
a code example:
> df <- data.frame(id=c('a','a','a', 'b', 'b'), vals=rnorm(5,10))
> df
id vals
1 a 10.86507
2 a 10.71303
3 a 11.15321
4 b 10.78187
5 b 10.80042
> # calculate a mean on vals grouped by id
> tapply(df$vals, df$id, mean)
a b
10.91044 10.79114
> # similarly a median -- both mean and median are built in functions
> tapply(df$vals, df$id, median)
a b
10.86507 10.79114
>
> # now let's build our own function; I'm going to build a function that drops outliers
> f <- function(xs){ qs <- quantile(xs, probs=c(0.025, 0.975)); mean( xs[xs >= qs[1] & xs <= qs[2]])}
> tapply(df$vals, df$id, f)
a b
10.86507 NaN
>
> # well, this is just a demo and group b only had two values, both trimmed as 'outliers', so it got NaN -- but you see the idea
>
> # and plyr
> library(plyr)
> f2 <- function(dfs){ qs <- quantile(dfs$vals, probs=c(0.025, 0.975)); mean(dfs[ dfs$vals >= qs[1] & dfs$vals <= qs[2], 'vals'])}
> ddply(df, .(id), f2)
id V1
1 a 10.86507
2 b NaN
plyr relaxes that limitation, that is, X can be a data frame itself -- which brings the huge benefit that your group by logic can operate over more than one column. The function signature changes, but that's the basic innovation. The first letter indicates what X is, and the second letter indicates the output. Thus ddply runs an enhanced tapply over a data frame input (first d) and collects the output into a data frame (the second d). It also offers a bunch of nice enhancements; it's really a solid bit of work.
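For example, grouping on two columns at once (a sketch; grp is a hypothetical second grouping column added to the df above):
ddply(df, .(id, grp), summarise, m = mean(vals), n = length(vals))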
data.table, otoh, removes some of the speed problems with the built-in data frames. It offers keys/indices for quick lookup.
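A small taste, reusing the df from above (a sketch):
library(data.table)
dt <- as.data.table(df)
setkey(dt, id)
dt['a']                            # keyed lookup: binary search instead of a full vector scan
dt[, .(m = mean(vals)), by = id]   # fast grouped aggregation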
I haven't spent much time with dplyr, but I think it does a couple of things: (1) it moves the plyr code from R to C for performance reasons; and (2) it lets you write plyr operations (with all code written in R) and then translates most of that to SQL that can run against remote DBs (for the obvious reason that plyr/group-by operations produce summary stats, and those can be orders of magnitude smaller than the source data, so pulling everything into R only to immediately discard most of it sucks).
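The in-memory version of the earlier group-by, for comparison (a sketch):
library(dplyr)
df %>% group_by(id) %>% summarise(m = mean(vals))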
I had never put it together that a main functional difference between tapply and plyr is that you can group by multiple columns.
One thing that I've had to use plyr for:
dfrm has three columns: user_id, vals, date
I want to know the average val for the first day of each user_id, second day, and so on. (In other words, average of vals across users relativized to the rank of the date for that user).
This is when plyr is awesome:
dfrm <- ddply(dfrm,"user_id",transform,dateRank = order(date))
That one line split my dataframe by user_id, performed a function over every sub-dfrm (ranking by date), then combined them back together into one dfrm, with a new column named dateRank that is appropriate for that {user, date} pair.
These are very simple functions and I generally try to use them before moving on to plyr. They are easy to set up, easy to understand, and faster. Obviously they cannot do everything, though, which is why there is plyr.
data.table is the fastest and most performant for many core data frame operations.
plyr is wonderful for complex manipulations, but really slow as you start aggregating by a column with many (1000+) distinct values.
The base functions are super-fast (they're written in C), but they give you output that you need to manipulate into more useful formats. Additionally, there are some frustratingly subtle differences in syntax between them that make for some annoyances.
I agree with most of what they say here - except using data.table directly. I much prefer using data.table structures with dplyr, which has a far simpler, more familiar syntax.
Indeed data.table - the object implementation - is brilliant and should just replace data.frame ...if that were possible.
The real problem with data.table is that it has too much magic. It's great for interactive use, but it's hard to program to.
If, for example, you want to aggregate a table using a list of supplied variable names -- say, to abstract out some aggregation code -- you need to descend into horrible quote/substitute/etc. hackery to make it work.
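A sketch of the kind of workaround you end up with (dt and the column names are hypothetical):
library(data.table)
aggregate_by <- function(dt, by_cols, val_col) {
  # by= happily takes a character vector, but referring to a column held in a
  # variable inside j means reaching for get()/eval() instead of plain column names
  dt[, .(m = mean(get(val_col))), by = by_cols]
}
aggregate_by(dt, 'id', 'vals')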