R for Excel Users (yhat.com)
158 points by elisebreda on Feb 16, 2017 | 58 comments



The thing is though - R is just not Excel. For many tasks (blasphemy!), Excel is simply better.

Take quick scenario evaluation. In Excel, I change one or two cells and see everything update immediately; in R, you need to re-run your analysis and find the outputs you're after again. There is built-in support for scenarios in Excel. In R you have to code your analysis around it, and do the formatting for comparing, too.
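
To make the contrast concrete, the R-side equivalent of "change one or two cells" is roughly the sketch below (hypothetical names and numbers): the analysis gets wrapped in a function and explicitly re-run for each scenario.

    # Hypothetical sketch: scenario evaluation in R means wrapping the
    # analysis in a function and re-running it per scenario.
    evaluate <- function(price, volume, cost_rate = 0.6) {
      revenue <- price * volume
      profit  <- revenue * (1 - cost_rate)
      data.frame(price, volume, revenue, profit)
    }

    rbind(
      evaluate(price = 10, volume = 1000),   # base case
      evaluate(price = 12, volume = 900)     # the "changed two cells" scenario
    )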

Or take the libraries. Anything that is not statistics is just a pain in R. Everything is possible, sure - but high friction. PMT()/IPMT() functions in R? Good luck. Sure it's easy to code yourself (which is the advice you get when asking R users %| ) but I'm using something high level to not have to bother with that sort of thing!
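
For what it's worth, the annuity math behind PMT() is short to write out in R - a minimal sketch below, ignoring Excel's fv/type arguments and its sign convention - though that is exactly the "code it yourself" advice being complained about.

    # Minimal sketch of an Excel-style PMT(): payment per period for a loan of
    # present value `pv`, at `rate` per period, over `nper` periods.
    pmt <- function(rate, nper, pv) {
      if (rate == 0) return(pv / nper)
      pv * rate / (1 - (1 + rate)^-nper)
    }

    pmt(0.05 / 12, 360, 250000)   # monthly payment on a 30-year loan at 5%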

Graphs? Yeah, there's plot() which is straight out of 1960, or ggplot - which is easy for the simple things and then devolves into afternoons chasing obscure manual pages for this or that setting. Here's one: plot 360 degrees of a sine wave and its first and second derivatives. Then explain to an Excel user how that works. (This is both because anything non-stats is bolted onto R, and because ggplot is designed around stats graphing.)

Keeping a matrix of data, like a simple database? Sure R can read dozens of file formats from CSV to HDF, but actually editing/maintaining that data is a pain in the ass. Excel, just add a sheet and use vlookup - it will let you sort and filter and copy and validate, all without leaving one software package.

Yes, there are many things wrong with Excel. FFS, there is a conference on how to not screw things up in real life with Excel. But saying 'just use R' is silly. For most applications where starry-eyed grad students advocate R over Excel (because hey, nail/hammer, right?), Excel is just the better choice, even if that means workflows that make programming-literate folk die a little inside every time we have to work with them.


Unconvinced.

Excel is fine for opening up a table and doing some quick numberwang.

But, as soon as you have to take your piece of work and start making little variations and tacking bits on, or running it on different bits of data, or God forbid you want to actually test your code (and make no mistake, code is what you are making), well, it all involves rather a lot of clicking and opportunity for fuck-ups.

Of course, R itself hits limitations pretty quickly. While technically it's a general purpose programming language, trying to use it on non-tabular data or to assemble even a medium-sized program leads to pain.

My rough rule of thumb is:

* Checking if it's got vaguely sensible data in? column (https://linux.die.net/man/1/column).

* Just looking, maybe aggregating a column or doing a simple summarise? Excel (or Libreoffice, why not?).

* Giving outputs to someone else or might need to do it more than once? R.

* Need sensible data structures and useful abstractions? Python.

* Need more speed? C or Java or whatever.


Clearly Excel isn't always the best tool if you are free to choose between R or Python or C or Java. But if you're not a developer then Excel may be the only tool that you and all your colleagues are guaranteed to have. Even if I persuaded ICT to install R on my work PC, my colleagues wouldn't have it so work wouldn't be reproducible.

I find that approaching Excel as if it were a database or a programming language tends to reduce the replication problem. For example to analyse a daily file:

* set up a data connection to an exemplar file

* refresh the connection with the new file daily

* set connection option to copy formula down so each row is identical

* point to a separate spreadsheet for lookups

* use a recorded macro to paste the calculated values into a cumulative spreadsheet

* do pivots and change over time graphs in the cumulative spreadsheet

I don't think Excel is the best tool for this workflow, but a logical approach makes it "good enough".


>But, as soon as you have to take your piece of work and start making little variations and tacking bits on, or running it on different bits of data, or God forbid you want to actually test your code (and make no mistake, code is what you are making), well, it all involves rather a lot of clicking and opportunity for fuck-ups.

This isn't an accurate reflection of where Excel is currently. Excel can run OLAP cubes at this point, which is a huge multiplier for its utility as a quick BI solution. Comparisons to programming languages really miss the entire point of why Excel is so useful.


Yes, also Excel has a subset of Power BI's functionality for data cleaning. One of the great things about Power BI's data cleaning is that it presents a summary of all of the operations taken on the data in a linear format, so that you can easily see everything that was done to the file.


>Here's one: plot 360 degrees of a sine wave, and it's first and second derivatives. Then explain an Excel user how that works.

Seems pretty straightforward to me

    x  <- seq(from = 0, to = 2*pi, by = .05)   # 0-360 degrees, in radians
    y  <- sin(x)                               # the sine wave
    y2 <- cos(x)                               # first derivative
    y3 <- -sin(x)                              # second derivative

    plot(x, y, type = 'l')
    lines(x, y2, col = 'red')
    lines(x, y3, col = 'blue')


Sure, I didn't think my example through well enough - what I meant was a generic way of taking a derivative: going from a series of data to a function that returns a y value for a given x value. E.g. plot the derivatives of a normal distribution curve.
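
A hedged sketch of that more generic version, using base R's splinefun(): fit an interpolating spline to the (x, y) series, then evaluate its derivatives through the deriv argument.

    # Generic numeric derivatives from a data series: splinefun() returns a
    # function f(x, deriv = 0/1/2) fitted to the sample points.
    x <- seq(-4, 4, by = 0.05)
    y <- dnorm(x)                 # e.g. a normal density curve
    f <- splinefun(x, y)

    plot(x, f(x), type = 'l')                  # the curve itself
    lines(x, f(x, deriv = 1), col = 'red')     # first derivative
    lines(x, f(x, deriv = 2), col = 'blue')    # second derivative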


From the tone of your post I'm thinking that you've made up your mind on this point, but there really is a lot of richness and extensibility in R which isn't available in Excel. For instance, my blog is written in an R library called Blogdown; you can process 5 GB datasets on your laptop, send a reproducible analysis to a colleague, stuff like that.

I guess the thing I would ask you to consider when thinking about this stuff is the amount of time you've spent learning Excel. At the time I was learning R I probably had several thousand hours of focused Excel practice under my belt, and I could do a lot with the tool. So Excel was a way better tool for me than R because I was an Excel expert and an R novice. After now putting in about that same amount of time working with R I can say it's a much more powerful and extensible tool. But if you mostly work in areas where Excel...uh...excels, then there's no real reason to make the switch.


The things you've listed as R being better at are things that no typical Excel user contemplates doing (e.g. blogging, processing 5 GB data sets). You're also ignoring that, out of the box, Excel comes with a wide range of features that just don't exist in R. Most important for the typical user is being able to format and edit their "data" and calculations as they go, with immediate updates to dependent cells.

Take your typical Excel user and explain to them that R is significantly more powerful, and they will stop listening the first time they get the arrow wrong on a variable assignment. You probably won't even get that far, because the idea of typing a variable name is foreign even though they've been using "variables" hidden behind cell references. "Why am I assigning a variable... in Excel I just type my data where I want it and click when I want to use it."


You don't seem to have really answered any of the points, like how good it would be in R to interactively edit and view data and have the graphs update automatically. Also, in R I would love an Excel-style graph wizard -- it could even output the R code required.

I need to use R about once a year -- each time I've forgotten everything from last time, and use a combination of stackoverflow and swearing to do whatever I need to do.


Hmmm, I'm not sure there is an Excel-style graph wizard, although there's this package, which is close: https://www.r-statistics.com/2016/11/ggedit-interactive-ggpl...

For PMT() kinds of functions, I often just google something like "Excel PMT() function in R" and something usually turns up:

- https://cran.r-project.org/web/packages/optiRum/optiRum.pdf

- https://gist.github.com/econ-r/dcd503815bbb271484ff

Another good tactic is to follow some of the R quants on twitter. A really popular package for this is http://www.quantmod.com/


I've made up my mind about the things that I listed, yes - and a bunch of other use cases where Excel is better; but I'm not saying 'Excel is better' or 'R is better'. What I'm saying is that the proper applications of Excel and R are (almost) orthogonal; the number of Excel users who should 'upgrade' to R is very small compared to the total population. R is a statistics tool; it competes with SPSS and maybe some uses of SAS. For those uses, sure, use R; use R when teaching statistics. But saying R is some natural, 'better' progression of Excel is myopic and misunderstands the use case of Excel.

Not a month goes by that I don't try to switch some Excel sheet I have into R; I run into the problems with Excel every day. It's not as if I'm unaware of Excel's shortcomings. My point is that the solution to those problems isn't R. I don't know what is, but I know it's not R.

Like, the other day I tried to do a real estate investment analysis in R. I always do that in Excel; I build the model step by step, starting with some basics, then filling in the details of the case at hand as I go. All the time, I can focus on the numbers; adding an indicator or refinement is part of the natural workflow. In R, you always have to switch between 'code mode' and 'data mode'. In Excel, you don't have this difference. Which is at the same time also its weakness, of course.

Sharing R models is a pain in the ass. Others have to get the exact packages, you have to tell them what to look at out of all the variables, ... Shiny sucks for that, it's read-only. In Excel someone makes some changes and sends you back the sheet, boom done. In R you would set up a versioning repo, data is split over multiple files, packages need to be installed, scripts for reporting, ... Fine for big projects, not for the small analysis exercises that make up the bulk of the uses of Excel.

Sure you 'can' blog in R. Last week I didn't want to walk to the shed to get my hammer, so I used a brick to pound in a nail. Doesn't make it right. String manipulation in R is a joke. A function to concatenate strings? Please. Again, R is fine for statistics, but it's not a general purpose replacement of Excel.
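
(The function being referred to, for anyone wondering:)

    paste0("Report for ", 2017)   # "Report for 2017" -- vs. Excel's & operator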


> FFS, there is a conference on how to not screw things up in real life with Excel.

> In Excel, I change one or two cells and see everything update immediately

This is the problem - it's all too easy to change cells, misclick, and generally make mistakes. When you do things from a programming language, you need to be explicit about operations, and explicit about what changes you're making.

Spreadsheets should be for entering data only; R is for processing data.

All the examples of enter data, create a chart, oh let's change/update some data, oh let's add something to the file, etc..., are why you end up with incredibly convoluted spreadsheets that inevitably are filled with errors.

Enter your data with whatever tool you want (spreadsheet works for this), then process it (R works for this), then shove it in a database. Nice and easy, not error prone, and every tool can have easy access since databases are ubiquitous.
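
A minimal sketch of that pipeline, with hypothetical file, column, and table names, using base R plus the DBI/RSQLite packages:

    # Enter the data in a spreadsheet, read the export, process it in R,
    # then push the result into a database.
    library(DBI)
    library(RSQLite)

    raw   <- read.csv("entered_data.csv")                        # hand-entered data
    clean <- aggregate(amount ~ region, data = raw, FUN = sum)   # process it

    con <- dbConnect(RSQLite::SQLite(), "results.sqlite")
    dbWriteTable(con, "region_totals", clean, overwrite = TRUE)
    dbDisconnect(con)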


The user actions that lead to mistakes in Excel are the same with any other environment: typing formulas. If I'm doing an engineering or financial calculation and misplace a decimal point, flip a sign, or forget an ABS(), I'm going to get the same wrong result in R or Excel. The upside of Excel is that the feedback is immediate so you have a chance of spotting the error if you're expecting a certain output. There is no clean solution to that regardless of environment. It's just as easy to make mistakes when you're changing code or changing a spreadsheet.

R has the advantage of being easier to do a proper diff between files if a change was made in error, but most people don't think they made a change in error. The change was intentional; they just got it wrong.


> The upside of Excel is that the feedback is immediate so you have a chance of spotting the error if you're expecting a certain output.

Only some errors are obvious just from a glance at the result set. For other errors, Excel makes the developer's job much harder. Subtle bugs can hide in the code of an individual cell among thousands, and the user has no means of clearly abstracting that code out.


I completely agree. There are some things you can do in Excel that are nearly impossible to track down. There are auditing tools that help, but you can truly ruin a spreadsheet in undetectable ways if you want (or more likely, don't want).


I think the answer lies in the small plot in the article: Difficulty vs Complexity.

As the intrinsic complexity of the problem at hand increases, the (intrinsic) difficulty of using Excel just rockets after a certain point.

More experienced R users have a lower threshold for preferring R over Excel, since the (accidental) difficulty of using R is low enough for them to do even trivial stuff. It might be much smarter and faster to do the same thing in Excel for just about everyone else.

On the other hand, if people with no previous programming experience whatsoever keep building Excel workflows for larger and larger problems, it WILL eventually turn into a catastrophically incomprehensible mess, since they know no alternative other than hitting the complexity wall.

(For myself - a fairly decent professional programmer - it requires a bit of swearing and Stack Overflow to use R, since, although it has a solid core, a lot of the syntax and conventions are decidedly non-CS... But so does Excel.)


Most people can make small edits to an Excel spreadsheet. Most people can also glean, from some analysis, how the spreadsheet works. Communicability is Excel's advantage over R.


R and Excel have benefits that in many ways do not overlap; I use both to good effect. But Excel has been and will continue to be responsible for major errors in data analysis [1,2]. Its default autoformat/autocorrect behaviour will without fail screw up your data at some point. Therefore, although you make a valid point, I wouldn't go so far as to say it is the 'better' choice.

[1] http://retractionwatch.com/2013/04/18/influential-reinhart-r...

[2] https://genomebiology.biomedcentral.com/articles/10.1186/s13...


My breakthrough with R didn't really happen until I started programming functionally. If every step to acquire, process, aggregate, and graph the data is an output chained into the next, re-running everything is a pretty simple process. It's when you try to think too much about objects and state that R is just really clunky.
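
A minimal sketch of that style (hypothetical file and column names): each step's output feeds the next, so re-running everything is a single pipeline.

    library(dplyr)
    library(ggplot2)

    read.csv("sales.csv") %>%              # acquire
      filter(!is.na(amount)) %>%           # process
      group_by(region) %>%
      summarise(total = sum(amount)) %>%   # aggregate
      ggplot(aes(region, total)) +         # graph
      geom_col()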


This is a fantastic article for intermediate beginners. On HN, everyone is a senior data scientist working with Spark and Keras and Tensorflow and deep learning.

In the real world, there is a huge chasm between people just learning Excel and developers; not many people even understand why you would switch away from the former when it's so convenient. That's why the difficulty vs. complexity chart is so great, and may actually speak to people in an approachable way.

There are a lot of tutorials for how to do hard things and how to do easy things, but not a lot for how to think of the hard things in terms of the easy things, and this falls in that category. Another good book on this topic is Data Smart by John Foreman, where he goes over basic data science skills in Excel.


> This is a fantastic article for intermediate beginners. On HN, everyone is a senior data scientist working with Spark and Keras and Tensorflow and deep learning.

Although there is an ML/AI selection bias on HN, there are certainly a lot of people on HN who fall into the intermediate/beginner category (there is a lot of demand for R tutorials, which I have been working on), although I would argue that dplyr can legitimately be used at the advanced level. And certainly Keras/Tensorflow is overkill for common business problems.


There are definitely a lot of people who fall into those categories, but the articles/links give the impression that everyone is senior. Which is why it's great when articles like this come around.


So glad to see this. As someone who learned both Excel and SQL on the first job (it's called being an analyst at a trading firm), I always felt that SQL was a "more powerful but clunkier cousin" of Excel and wished there was a tutorial for SQL aimed at Excel ninjas.

Well, fast forward half a decade and nobody wrote it, so I did it myself a while back. Of all the things I wrote for work, it continues to be the most popular:

https://blog.treasuredata.com/blog/2014/12/05/learn-sql-by-c...


Check out my EasyMorph (http://easymorph.com) -- you can get the best of both worlds with it -- SQL-like logic applied to tabular datasets in an Excel-like fashion: visual, reactive, and with built-in visualizations.


Also, R is 1-indexed instead of 0-indexed. (Arrays are numbered [1,2,3] instead of [0,1,2], for the beginner.)

I forget that probably every 5 minutes when working in R.
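
A two-line reminder of the difference:

    x <- c("a", "b", "c")
    x[1]   # "a"           -- the first element is at index 1
    x[0]   # character(0)  -- index 0 silently returns an empty vector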


So are Fortran, Julia, MATLAB, etc...


It's only the Johnny-come-lately languages like Algol and its descendants that are zero-based :-)


Algol, like Ada, actually requires you to specify your bounds and is perfectly happy to have 0,1,87 or -62 as your first index.


So does Excel VBA.

But man, are arrays a pain in VBA.


Why do you say so? I find them pretty easy to handle. Variant arrays can be pulled and pushed straight from/to Range objects, you have LBound and UBound to iterate over arrays if you don't feel like you can trust the base, or otherwise you have For Each. And you have Collection and Dictionary if you have different needs - you can even define a class, instantiate it, and push it to a Dictionary, and now you have a key-value store full of your objects.


That's a nice tutorial for beginners, though sticking dplyr code in functions is skirting dangerously close to the world of hurt that is non-standard evaluation.

The example:

    join_and_summarize <- function(df, colour_df){
        left_join(df, colour_df, by = "cyl") %>%
        group_by(colour) %>%
        summarize(mean_displacement = mean(disp))
    }
Will go really badly wrong if someone following the tutorial simply replaces the `disp` with a function argument.
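
A hypothetical illustration of that failure mode (not from the tutorial): with dplyr's non-standard evaluation, a bare name inside the function body refers to a column, not to whatever the caller passed in.

    # Hypothetical example: `col` is evaluated inside the data frame, so this
    # does not compute mean(mtcars$disp) as the caller intended -- it ends up
    # as mean("disp"), which is NA (with a warning).
    library(dplyr)

    summarize_col <- function(df, col) {
      summarize(df, avg = mean(col))
    }

    summarize_col(mtcars, "disp")   # avg is NA, not the mean of the disp column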


That's a good point, I'm not exactly sure what the best practice is for using NSE vs. SE functions within functions. My practice is to just use the NSE version unless I have a good reason not to. Generally my goal with teaching beginners to use R is to just not be afraid of functions as the barrier tends to be feeling like they can use them at all rather than using them in the safest way.


Yeah, NSE sucks in dplyr, but it'll be fixed soon :)


The standard evaluation versions of all dplyr functions are available, just add an underscore to the end: filter_, select_, ...


I'm of the opinion that non-standard evaluation probably isn't the best introduction for converting people from Excel to R. As nice as dplyr's syntax is for beginners, you can quickly get into the whole ugly side of R.

Also it's often not as simple as just appending an underscore; you end up reading the lazyeval vignette and messing about with quote() and formula syntax.


A cool feature of select_ I discovered accidentally is that you can pass in a vector of column names to get the corresponding columns, which is much easier than abusing which() to look up indices.


Not sure if I'm misunderstanding, but you've always been able to slice on a vector of column names.

    cols <- c("colA", "colB")
    dataframe[, cols]


True, although not as a part of a dplyr chain, which is cleaner.


extract and extract2 from magrittr also work. I do normally use dplyr though.


This dude should learn to use PowerPivot; it will blow his mind and bring him back to his old beloved Excel. Most of his examples of "why move to R" can be easily solved with PowerPivot and PowerQuery, with a nice GUI and not a clunky command line.


Brings back great memories. Excel/VBA was my first programming "language"...a basic requirement of the supply chain management job I was doing. The limits of Excel for data analysis pushed me to R (+SQL!) which was my first real programming language. Since then I've done professional work in Clojure, Scala, OCaml, F#, Typescript, Prolog, and a handful of constraint/optimization languages like AMPL, Mosel, and Minizinc.


There is a lot to like in this article, but the artificial division between Excel and VBA for Excel is, well, artificial. I've since moved to other coding environments like R and Python, but in the past I frequently used VBA to parse and tabulate Excel files; Excel VBA is quite capable of a lot of complex manipulations...often done slow as hell...but capable.


Great article. But I'm an old spreadsheet jockey in non tech jobs now. I use Sheets for quick and dirty stuff.

Where does one "get" R?


For Linux users, it is probably installable through the distribution's package manager.

For others, the R language website: https://www.r-project.org

Also try the IDE from RStudio to make life more pleasant: https://www.rstudio.com

Good luck! It is a fun journey and a big, deep, wonderful, rewarding rabbit hole.
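
Once R itself is installed, add-on packages are installed from inside R, e.g.:

    install.packages("tidyverse")   # one-time install of the dplyr/ggplot2 family
    library(tidyverse)              # load it in each session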


The R for Data Science book really is the best, but I would also recommend https://www.datacamp.com/ for some great video tutorials.



Also check out: R for Data Science: http://r4ds.had.co.nz/

It's got a lot of good intro materials for R. Though having some understanding of another programming language would be pretty helpful.


Instead of choosing between the two extreme ends of the spectrum, Clicking Buttons vs. Writing Code, you can use a tool like Exploratory (https://exploratory.io) (disclaimer: I'm a co-founder of Exploratory), which provides a modern and interactive UI for accessing all the amazing things in R without needing to write R code. You can get immediate value out of R in the first few minutes with its grammar-based (dplyr verbs) data wrangling and beautiful, interactive visualization. Everything you do in Exploratory is recorded as an R function that you can see in the UI, so you can learn R while you work on data, and you can export your analysis as reproducible R scripts. Many of our users use Exploratory not just to learn R, but also to learn data analysis / science methods, which to me is more important than learning the tools themselves.


One of the best ways to get a good handle on basic R is to switch all the Excel stuff to R. Force yourself to use only R and you will find that you have a really good grasp of the environment in a couple months.


It would be great if Microsoft integrated native R/Python into Excel.


This is not "R for spreadsheet users", it's "why bother learning R in the first place". Was hoping for the former.


Has anyone used the Rodeo IDE they're selling?


Rodeo is oriented more toward Python users...as RStudio is an unbeatable IDE for R.


This article is pretty good instruction, but very good SEO. I know quite a few people asking this question.


Kudos for using the Kill Bill analogy!


I'm thinking of writing a package which displays R errors as Pai Mei gifs.


I would kill for an R package that displays R error messages as something useful. Especially if it's happening 200 lines deep in a library somewhere, it would be nice for that library to catch the error and display something that pertains to what I wrote instead of just throwing whatever error occurs on that line. Or at least tell me what library the error occurred in so I don't have to guess.



