My least favourite thing about R is its desire to keep on running when it should have errored on something about 50 lines earlier, happily spitting out some nonsense result - maybe with a warning, often not.
One of my previous jobs basically turned into being an in-house R consultant for a department in a pharmaceutical company, and I caught so many bugs while investigating some other issue which meant the results people were reporting were completely wrong. A really common one is multiplying 2 vectors of unequal length where broadcasting shouldn't be possible: R just recycles the shorter vector - but hey, it ran without error and there's an output, so many researchers don't notice.
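A minimal illustration of the recycling behaviour being described (toy numbers, nothing from a real analysis):

    x <- c(1, 2, 3, 4, 5, 6)
    y <- c(10, 100)
    x * y             # 10 200 30 400 50 600 - y silently recycled, no warning at all
    c(1, 2, 3) * y    # only a warning (not an error), because 3 isn't a multiple of 2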
Not to mention trying to handle errors is pretty miserable, if you want to catch a specific error you have to match the error string, unfortunately the error message changes depending on the locale the R session is running in.
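A sketch of the string-matching workaround being described; the message text of built-in errors is translated per locale, so a check like this can silently stop matching when the session runs under a different locale:

    res <- tryCatch(
      matrix(1:4, nrow = 2)[3, 1],   # errors with "subscript out of bounds"
      error = function(e) {
        if (grepl("subscript out of bounds", conditionMessage(e))) NA else stop(e)
      }
    )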
I can't recommend the "R for Data Science" (https://r4ds.had.co.nz) book enough, which is written by one of the creators of the tidyverse, Hadley Wickham. This opinion might get challenged here, but if you're going to use R primarily for data science/analysis and not for programming I think it's a better idea to start learning it with the tidyverse than with base R (beyond the basics, of course, which are also covered in the book).
I use R professionally for biostatistics and I can't remember the last time I had to use the base syntax because something couldn't be done with the tidyverse approach.
Would be interesting if you could expand.
I've used R (data.table) extensively over the last few years for biostatistics in a research organization. I was able to get away with not learning the tidyverse and stick to data.table.
The main reason for choosing data.table was speed - I'm working with tens to hundreds of GB of data at once.
What's worked for me is reading Hadley Wickham's "Tidy Data" paper[0] and then applying the concepts with data.table. The speed is nice, but I really love what's possible with data.table syntax and how many packages work with it. That's opposed to what many people have decided "tidy" means, with non-standard evaluation and functions that take whole tables and symbols of column names instead of vectors.
Compared to data.table, the tidyverse offers significantly better readability and ergonomics in exchange for worse computational and memory efficiency, with the magnitude of the performance gap ranging from negligible to catastrophic depending on the operation and your data volume. At that data volume, you're probably doing some things that would OOM or hang for days if you translated your data.table code to the corresponding tidyverse code.
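For anyone weighing the two, here is the same toy summary both ways (mtcars is just a stand-in; the performance gap only shows up on real data volumes):

    library(data.table)
    library(dplyr)

    dt <- as.data.table(mtcars)
    dt[mpg > 20, .(mean_hp = mean(hp), n = .N), by = cyl]   # data.table

    mtcars %>%                                              # tidyverse
      filter(mpg > 20) %>%
      group_by(cyl) %>%
      summarise(mean_hp = mean(hp), n = n())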
Agreed. IMO Tidyverse is a fantastic suite of R packages and worth learning after understanding how to use base R/with minimal dependencies. I personally started with base R and evolved to use tidyverse. Now I use base R when writing R packages and use tidyverse for data analysis/modeling workflows.
I’ll second this, though with some hesitation. If you just want to get stuff done, start with tidyverse. But if and when it’s time to start writing classes and packages, you may have to go back and gather some of the fundamentals.
I'm a base R purist personally, but that's mostly because of how long ago I picked it up - I don't get any improvement in development speed from dplyr verbs, with a few exceptions. But I disagree with this take for beginners, especially non-programmers: with the advent of the tidyverse it is incredible how fast newcomers pick up enough fluency to handle basic data massaging, analysis and visualisation.
I think exceptions where base-R is necessary can be taught as they arise.
There are several comments below that suggest not using tidyverse because "base R" is the foundation for everything.
I think it is important to use tidyverse because of the many quirks, surprises, and inconsistencies in base R. It would be helpful if others share their reasoning, or at least point to their favorite blog explanation, so that beginners can understand the problems they will face.
Unfortunately 5 minutes of Googling failed to produce a reference for me --- the start of some advanced R book that begins by asking "do you need to read this?" and showing examples whose results are predicted incorrectly by most people. Perhaps another user can provide the info.
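I can't say which book that is, but these are the kind of base-R surprises such quizzes tend to open with:

    df <- data.frame(x = 1:3, y = 4:6)
    class(df[, 1])                     # "integer" - single-column `[` drops to a bare vector
    class(df[, 1:2])                   # "data.frame"

    sapply(list(1:2, 3:4), identity)   # a 2x2 matrix
    sapply(list(1:2, 3:5), identity)   # a list - the return type depends on the data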
This depends on what you are using R for. Tidyverse is focused on handling data.frame objects and everything that comes with them. Even ggplot2 uses a data.frame as a default input. And tidyverse has a competitor - data.table, which can be substituted instead (given that you are familiar with base R).
However, some data are better suited to be represented in the form of matrices. Putting matrix-like data in a data.frame is silly, since performance will suffer and you would have to convert it back and forth for many matrix-friendly operations like PCA, tSNE, etc. The creator of data.table shares this opinion [1]. And similar opinions are generally given by people who are familiar with problems that fall outside the data.frame model [2].
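A small example of the kind of matrix-friendly operation meant here (iris is just a placeholder dataset):

    m <- as.matrix(iris[, 1:4])    # numeric measurements kept as a matrix
    pca <- prcomp(m, scale. = TRUE)
    head(pca$x, 3)                 # principal-component scores, still a matrix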
Is this really unique to R, or do all programming languages have some foibles? For example, I spent an hour recently debugging C++ because I forgot that it loves to do integer division despite the result going into an explicitly typed double. No error, no warning. You just have to know, and I highly doubt it's the desired behavior in most cases.
Most researchers are not programmers and don't care about programming. It's a tool to get the job done and I think you'd run into similar problems with other languages.
If you divide two integers, you get an integer. You can then cast it to whatever you want. Or, if you want some other type, you need to cast it before the operation is done.
Okay. But I'm storing it in a variable explicitly declared to be a double. That should be enough. If I divide two integers in python or R or Julia or a dollar store calculator I don't get an integer and I don't even have to explicitly type the variable. You have to know that C++ will do that. It's not common sense just like R recycling shorter vectors.
I agree with your point that all languages have their quirks. This is a very poor example however. If it automatically converted to float what would you do if you wanted integer division? I think automatic casting tends to get messy/be pretty evil in general but of course there are exceptions.
You could always do something like:
    int divRes = intA / intB;          // integer division, made explicit
    double something = divRes * 5.342;
At the very least it could warn me. I just tried it in Rust, and that will error out if you try to divide two ints and store the result in a float, which is fine by me.
Hi, would it be possible to contact you to ask some career questions related to the pharmaceutical industry and data science? I'm a biostatistician who uses R for everything and lately I've been thinking about doing a career change, but I'm a bit lost with all the available options.
My least favorite thing so far was indices starting at 1. It seems blasphemous, in a way.
On a more serious note, I agree that R being too charitable in interpreting things (seemingly without warning) seems to be a problem. You'll have to do some debugging to make sure it actually does what you intended it to do. I've only dabbled in it a bit though.
> My least favorite thing so far was indices starting at 1. It seems blasphemous, in a way.
In the real world we start counting from 1. CS people cannot stop complaining about it but it makes sense in languages used for mathematics and statistics. Zero-indexing is not very relevant if you don’t care about memory layout.
> It's a bit of a joke, like arguing over tabs vs. spaces though.
It is taken very seriously, though. This “issue” comes up very often when some people come and lecture others about how stupid the language they use is.
> May I recommend you this fabulous short essay by Dijkstra
That essay is not fabulous, it is obnoxious. I know you either love or hate Dijkstra, and he enjoyed being a contrarian, but he's unconvincing. The only point that surfaces during arguments on 0-indexing is iterating over 0..N-1 instead of 1..N. That's basically what he wrote himself. This could have been solved with just a bit of syntax if it were really a problem, and it remains largely because C did it that way to simplify pointer arithmetic. It does not change the fact that for the vast majority of people, the first element in a list is, well, first.
The proper way of handling this is to allow for arbitrary indices, because you will always find contexts where a different scheme makes sense (e.g. iterating from -10 to 10 is sometimes natural, and would otherwise require some index gymnastics). Insisting that one narrow view is the correct one is just annoying.
I dunno, it seems you misunderstood me. I clearly said that it is completely arbitrary to choose one over the other, and expressing a preference for either one is just a way of poking fun at people who are anal about choosing a specific one. So there isn't really any disagreement, though I'm always amazed at the lengths people go to to express what they think, when they're really just arguing about the definition of something.
> It is taken very seriously, though.
And those who do take it terribly seriously deserve being poked at ;)
Honestly indices starting from 1 fits really nicely in most situations. 1-based indexing together with ranges and inclusive range-based indexing makes loops and subsetting code really readable IMO
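For instance:

    x <- c("a", "b", "c", "d", "e")
    x[1]             # the first element really is element 1
    x[2:4]           # inclusive range: "b" "c" "d"
    x[2:length(x)]   # "everything after the first", no off-by-one bookkeeping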
R is frequently compared with Python and Julia, which are general purpose programming languages, but it is not really a proper comparison. Once you approach R as a domain specific language / system, its various quirks and peculiarities become more palatable and explainable: they are in a sense the price to pay for tapping a large domain of statistical analysis expertise that is not available elsewhere.
This is mental gymnastics. People have some job to do and are looking for an appropriate tool for it; sometimes that’s R and other times it isn’t. Who cares if you call it a DSL or a general purpose language. If I want to do something and the language makes it difficult, telling myself “oh but it’s a DSL” doesn’t get me any closer to solving my problem.
> If I want to do something and the language makes it difficult, telling myself “oh but it’s a DSL” doesn’t get me any closer to solving my problem.
Unless the thing that makes the language difficult is your expectations. In that case, offering you an alternative mental model that helps you make better decisions when using the language does get you closer to solving your problem.
Yes, sure, as long as you recognize that as a very subjective determination.
From the statistician's non-programmer POV, the syntax of R or some other language is similarly opaque. Learning one vs. another will present similar investments in time. From their perspective, R does not make things more difficult, and the fact that it's more of the lingua franca within the field has its own benefits.
The people I see complain about R are usually people who learned a different general purpose language first and find that, when work requires data analysis, they much prefer that language for working through the non-analytical portions of their work. (Especially with Python, where pandas and numpy have made the less specialized tasks much easier.)
From a statistician's POV the R syntax is great. Here is the t-test:
    t.test(x, y = NULL,
           alternative = c("two.sided", "less", "greater"),
           mu = 0, paired = FALSE, var.equal = FALSE,
           conf.level = 0.95, ...)
A statistician opens the vignette and already knows what all of these variables represent mathematically, and can begin producing analysis immediately.
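For instance, on R's built-in sleep data (the options below are spelled out only to show how directly they map to the statistical choices):

    t.test(extra ~ group, data = sleep,
           alternative = "two.sided", var.equal = FALSE, conf.level = 0.95)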
Yes, precisely. Very much not the pythonic way but that only matters if your prior background before R was python. If your background was SPSS then many of these would be drop downs or check boxes, and (IMO) it's superior to the SPSS scripting language as well.
Heck, my background before using R was python and SPSS and I still prefer R for precisely the example you gave: fine-grained control built in as above, specifying how to handle missing values etc.
It's important to keep this in mind though, because R (or rather S) is primarily meant to be used interactively. A prof of mine used to just launch the R REPL and go on from there. He called an editor from the REPL, wrote source files from the REPL, etc. Once you see someone working with R like that, you start seeing R for what it is.
As beautiful as it is to use interactively, it really takes a lot of practice to write reliable code that doesn't abort with some error now and then.
I think the point about interactivity is pretty well understood. Another comment in the thread pointed out how the majority of people who write R do it in RStudio and RStudio's defaults push an interactive workflow on the users (the nature of the work you do has a similar effect). So even for someone very new to the language it's pretty obvious.
Saying that R is a domain-specific language for statisticians, and thus its quirks are ignorable, is an incomplete answer. An R program is never just a series of calls to specialized library functions. Programs still need to ingest and emit data, manipulate data ad hoc, take conditional branches based on some runtime condition, and so on. And that glue code must still be written in R. I've had to write a lot of that glue code in R.
As someone who mostly writes not-R, my own R irritation comes from a handful of things:
- The dot character "." has no semantic meaning in identifiers. It's just a valid character for names. Looking at function names like "is.numeric" really messes with my reading comprehension.
- Ambiguously, "." is also what separates a generic function from the class it dispatches on in one of R's type systems (S3): in some cases, calling `foo(bar)` actually runs `foo.someclass(bar)` under the hood. But only in some cases (a few of these "." uses are sketched after this list).
- Even better, a popular R library defines a function `.()` (i.e., its name is just a single period character), whose job is to expose a surprising quote/unquote expression evaluation semantics.
- This is not to mention the special meaning of "." in formula literals, which are fairly ubiquitous in R.
- Different authors use different naming conventions. Base prefers "as.numeric," Tidyverse might have "to_factor," another library might prefer camel case.
- Finally, R has a surprisingly extensive syntax, exercised by different libraries to different extents, and a correspondingly rich semantics, with "types," "modes," multiple class systems, "expression" objects, immediate and lazy evaluation, expression quoting and unquoting, metaprogramming, and homoiconicity. It is a zoo of a language.
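A few of these "." uses side by side (the `.()` shown here is data.table's, which may or may not be the library the parent means):

    is.numeric(1)              # "." is just a character in this function's name
    print(data.frame(x = 1))   # ...but here print() dispatches to print.data.frame()

    library(data.table)
    dt <- data.table(g = c("a", "a", "b"), v = 1:3)
    dt[, .(total = sum(v)), by = g]    # data.table's `.()` is an alias for list()

    lm(mpg ~ ., data = mtcars)         # and in a formula, "." means "all other columns"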
Once you include the statistical packages, ggplot2, and dplyr, there is nothing that beats R in ease of prototyping for data exploration, model fits and sanity checks, and data visualisation of high dimensional data.
I don't know if you've heard about it, because it is a relatively recent development, but the tidymodels ecosystem of packages (https://www.tidymodels.org) is also bridging the gap from data exploration/visualization to advanced modeling and machine learning in a way that feels really natural if you're used to the tidyverse way of doing things. It's developed by RStudio as the improved version of caret. I've been using it for differential gene expression analysis and it's a game changer in how much time it saves me.
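If it helps anyone, a minimal sketch of what a tidymodels pipeline looks like (mtcars and a plain linear model are just stand-ins):

    library(tidymodels)

    split <- initial_split(mtcars, prop = 0.8)
    rec <- recipe(mpg ~ ., data = training(split)) %>%
      step_normalize(all_predictors())
    wf <- workflow() %>%
      add_recipe(rec) %>%
      add_model(linear_reg() %>% set_engine("lm"))
    fitted <- fit(wf, data = training(split))
    predict(fitted, new_data = testing(split))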
What about python and its countless packages? (Honest question, I 'grew up' using python in an academic setting, but haven't caught up with the latest developments)
Python isn't really an advancement. But it's a more obvious choice for people with a background in software engineering. I have some hopes for Julia though.
As someone who used Ruby (yes, real Ruby, not Rails) before Python or R, I definitely think R is better for data science and Ruby better for everything else. Sadly, I predict a future where Python rules over everything.
I've been using https://exploratory.io/ a lot, which is R in a really nice wrapper where you can do everything point-and-click, by writing code by hand, or a mix.
I love R. Once you get it, there is something beautiful about its functional approach. I like using either tidyverse or data.table with pipes, split, map, reduce. The code looks like layers of a filter that data flows through.
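For example, a pipeline in that style (mtcars just as a stand-in):

    library(dplyr)
    library(purrr)

    slopes <- mtcars %>%
      split(.$cyl) %>%                                  # one data frame per group
      map(function(d) lm(mpg ~ wt, data = d)) %>%       # one model per group
      map_dbl(function(m) coef(m)[["wt"]])              # one number per model
    reduce(slopes, `+`) / length(slopes)                # fold them back down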
On the other hand, if it were lispiness that was the issue, surely xlispstat would be the winner. I love xlispstat. I used it in grad school in the 1990s and even maintain the github repository https://github.com/jhbadger/xlispstat . But the fact is xlispstat never appealed to the general statistical community and R did.
I thought xlispstat was a big deal in statistics at its peak? I suppose both R and xlispstat are (to varying degrees) lisp-based, so another way of looking at it is that statisticians like lisp?
I don't think it got as much popularity in its day as R does now, but it was popular to a degree. But that was also because at the time it was pretty much the only free statistics programming environment -- at the time the choice was either xlispstat or pay for a licence for S-PLUS, SAS, or the like.
Seconded. I was taught ggplot by a great stats professor and the framing of visualizations as a language (gg actually stands for the grammar of graphics!) describing the relation between data and visual elements (layers in the graph) really made something click.
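For example, each `+` below adds another layer mapping data to visual elements:

    library(ggplot2)

    ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
      geom_point() +                            # a layer of points
      geom_smooth(method = "lm", se = FALSE) +  # a layer of fitted lines per colour
      labs(colour = "cylinders")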
The amount of consideration and careful design behind tidyverse APIs (tidyr, ggplot, dplyr) really astounds me. I've never felt the need to actually memorize any of them but they come to me so naturally whenever I type "library(tidyverse)". Very few DSLs, libraries or APIs have ever made me feel this way, and certainly NOT Python and the mess that pandas/matplotlib/scikit is. Even more impressive that he managed to build such a consistent layer atop the hack that is base R.
Note that I've nothing against base R. It really appeals to the hacker in me and it certainly has a ton of cool features (a condition system, multiple function evaluation forms - in what other language are `if`, `while`, `repeat` and even parentheses `(` and the BLOCK STATEMENT `{` all implemented as functions?) but damn if it isn't a mess of corner cases and gotchas.
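For anyone who hasn't seen it, those really are ordinary functions you can call (or even redefine):

    `if`(TRUE, "yes", "no")   # "yes"
    `{`(1, 2, 3)              # 3 - a braced block returns its last expression
    `(`(42)                   # 42
    `while`(FALSE, NULL)      # NULL, invisibly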
I don't have a source for this, but I think R as a language has one of the highest concentrations of users in a single IDE - and for good reason - something like 80% use the free (and amazing) RStudio IDE.
I use R with Vim. Usually the R script file is open on top and there is a :terminal buffer with R running below. And I use a small vim-plugin [1] for sending commands from the editor to the REPL.
This has a few advantages, the major one being that you can run any language with a dynamic REPL this way, without changing your setup. Or you can even have two files, written in two different languages, open side by side with a corresponding REPL running beneath each of them. The downside of course is that you miss out on auto-completion and other integrations like that. These are not impossible, but you would have to torture your Vim setup quite a bit in order to implement them.
However, you do indeed get autocompletion and many IDE amenities with the language server protocol. Naturally it’s not at the same level as RStudio. But one tool to play with any language is a very nice thing.
Guessing here: probably Jupyter notebooks, Emacs and vscode, and perhaps the (very minimal) R IDE (if we can call it that) that comes with the installation of base R.
I use R directly from the terminal quite a bit for any small jobs, like calculations, purely due to the <1000ms boot time.
In my domain q/kdb is used extensively. I don't have a decade to master obscure syntax/grammar just for one simple purpose of extracting some data set from a larger population and maybe do some basic statistics on it.
If you're like me, R is a godsend. You'll also love the tonnes of free packages. You can't go wrong with R if you appreciate simplicity and intuitiveness.
In practice the difference is almost non-existent, unless you start doing assignments within function calls, which is a popular style among some R stars, like Martin Machler [1]. But on the other hand some of them have resolved to just always use "=" everywhere, including one of R's creators - Ross Ihaka [2].
Anyhow, explaining the difference at that part of the tutorial is not easy, so I chose to omit it for now. But might introduce it later, along with "<<-" and "->>", probably after describing closures.
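A quick illustration of where the "<-" vs "=" difference actually bites, i.e. assignment inside a call:

    x = 5                      # same effect as x <- 5 at top level
    median(y <- rnorm(10))     # assigns y as a side effect, then takes its median
    # median(y = rnorm(10))    # would instead try to pass an argument named `y`,
                               # and fail because median()'s first argument is `x`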
Maybe someone can help me with this: how do you integrate R as a CLI tool? I'm in a mostly-R shop, but its integration with other tools is so confusing and/or bad that we usually just rewrite everything in Python for integration (which is obviously a huge waste of time). R packages etc. have me, as an outsider, confused, though they seem like the obvious choice?
I love R, nothing better for data analysis, stats and plotting. However, if I was making software for other people to use, repeatedly, I would probably pick another language. The R language does have breaking changes, especially in commonly used packages.
In this case you should probably use the "here" or the "rprojroot" packages (libraries in conventional R parlance). They both simplify the usage of relative paths inside a project/repository.
If you have a project root with folders like code, data, etc., and are running a project in /path/root/code, you can then just call data_dir <- here::here("data") for the data folder, as the here package uses several heuristics to find the root of a project (e.g., looking for a .git folder).
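A minimal sketch (the data file name is made up):

    library(here)                  # assumes the package is installed
    here()                         # the detected project root
    data_dir <- here("data")
    # read.csv(here("data", "measurements.csv"))   # hypothetical file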
Personally I use my programming language of choice to generate a ".r" script and then use the os exec system call of said language to call Rscript scriptname.r... If I'm understanding your question correctly.
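For the CLI question specifically, a small self-contained script run as `Rscript summarise.R input.csv` (both names are just examples) might look like:

    #!/usr/bin/env Rscript
    # summarise.R (example name): read a CSV given on the command line, print a summary
    args <- commandArgs(trailingOnly = TRUE)   # everything after the script name
    dat <- read.csv(args[[1]])
    print(summary(dat))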
Great overview. Slight shame it leaves out 'lapply' etc though (and says as much at the top). I just remember realising that you can have lists and run functions on them when I was learning R, and it seemed like a superpower.
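For anyone who hasn't had that realisation yet, it's essentially this:

    # one linear model per group, then one statistic per model - all via lists
    models <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))
    sapply(models, function(m) summary(m)$r.squared)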
This was probably written by a programmer, and for that reason (and reading the 'why R is bad' comments) it shows how misunderstood R is by most programmers. It's like giving someone an introduction to the English language by showing them the alphabet and listing the punctuation. Yes, technically all true, but none of it will stick.
Yes, there is a lot of R-bashing by people used to imperative languages designed for efficiency in repetitive tasks, not a functional language designed for numerical analysis. The complaints fall into these categories:
1. It's not zero-indexed (even though most numerical languages aren't)
2. Loops are slow (though if you're looping in R you're probably doing it wrong - see the sketch below)
3. It's inconsistent
4. The syntax is weird.
But people don't talk about the somewhat beautiful functional ability of the language to wrangle data almost magically. Its basis in lisp allows for the tidyverse and data.table to exist[1], and ggplot is a formidable analysis/plotting platform that Python doesn't come close to.
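On point 2 above, the usual illustration is that the idiomatic version is both shorter and far faster:

    x <- rnorm(1e6)
    squares <- numeric(length(x))
    for (i in seq_along(x)) squares[i] <- x[i]^2   # the loop version
    squares2 <- x^2                                # the vectorised version
    identical(squares, squares2)                   # same result, very different run time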
I attended an intro to R workshop and found it very confusing. Being "functional" had nothing to do with it. Inconsistent, yes very much so in my opinion. It felt like a lot of little separately developed tools thrown together into a bundle. But I think mostly my difficulty with R is that I'm not a researcher or statistician. My exposure to and experience with those domains was an undergrad class or two many decades ago. If you don't deeply understand the problem space for which R is intended, you will be lost and confused trying to learn it.
It's a very different language to imperative languages out there, so it's not surprising that an introductory course would be confusing. There are several ways to do things in R (for example subsetting data, or pulling out elements of structured data), but that doesn't mean it's inconsistent - they are convenience functions. As you say, you have to do some statistics 'in anger' to really get why R is so good. When I've taught introductory sessions on R I focus more on a very short analysis to demonstrate what it is good at.
It works and is IMO quite okay, because NA is not the same as NaN (not a number). NA _does_ actually stand for a number, it's just that we don't know which one.
Which is an interesting detail in R that should be mentioned anyway: the difference between NA and NaN. Anyone used to languages which have just NaN may confuse NA for that non-value.
Except that 1^NaN is also 1... now that, IMO, is wrong. But you can try the same in your browser's JS console and you will get 1 as a result too, so R is not the only one.
There are several NA values in R - NA_integer_, NA_real_, NA_complex_ and NA_character_, and the results will be different if you use some of them. NA_character_ and NA_complex_ will produce errors (different ones).
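For reference, a few of the behaviours being discussed in this subthread:

    is.na(NaN)    # TRUE  - NaN counts as missing
    is.nan(NA)    # FALSE - but NA is not NaN
    NA^0          # 1 - "NA stands for *some* number", and anything^0 is 1
    1^NA          # 1 - 1 to any power is 1
    1^NaN         # 1 - the case objected to above
    NA > 1        # NA - most other operations stay unknown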
Interesting. I must admit I've never used substitute.
I tried dims but: Error in dims(iris) : could not find function "dims"
I do find the occasional oddity. I've noticed more very useful messages/warnings (particularly in common tidyverse functions) recently, so I think they help.
To be fair, these quirks are generally very uncommon in day to day use.