
This is why it is a good idea to set options(warn=2), which turns warnings into errors and makes problems easy to spot.
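A minimal sketch of what that setting does, assuming a plain base R session. The coercion of "abc" is only an illustrative trigger, not something from the original post:

```r
# By default, as.integer("abc") returns NA with a warning and the script keeps running.
# With warn = 2, the same warning is promoted to an error.
old <- options(warn = 2)          # options() returns the previous values

result <- tryCatch(
  as.integer("abc"),              # would normally only warn
  error = function(e) conditionMessage(e)
)
print(result)                     # the error message, not NA

options(old)                      # restore the previous warning level
```

Restoring the old value at the end keeps the stricter behaviour from leaking into the rest of a session.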


Any other tricks that are helpful for a beginner to know?

    Use RStudio
    Include tidyverse
    Turn warnings into errors


I can't recommend the "R for Data Science" (https://r4ds.had.co.nz) book enough, which is written by one of the creators of the tidyverse, Hadley Wickham. This opinion might get challenged here, but if you're going to use R primarily for data science/analysis and not for programming I think it's a better idea to start learning it with the tidyverse than with base R (beyond the basics, of course, which are also covered in the book).

I use R professionally for biostatistics and I can't remember the last time I had to use the base syntax because something couldn't be done with the tidyverse approach.


Would be interesting if you could expand. I've used R (data.table) extensively over the last few years for biostatistics in a research organization. I was able to get away with not learning the tidyverse and sticking to data.table. The main reason for choosing data.table was speed: I'm working with tens to hundreds of GB of data at once.


What's worked for me is reading Hadley Wickham's "Tidy Data" paper[0] and then applying the concepts with data.table. The speed is nice, but I really love what's possible with data.table syntax and how many packages work with it. That's opposed to what many people have decided "tidy" means, with non-standard evaluation and functions that take whole tables and symbols of column names instead of vectors.

[0]: https://vita.had.co.nz/papers/tidy-data.html


Compared to data.table, tidyverse offers significantly better readability and ergonomics in exchange for worse computational and memory efficiency, with the magnitude of the performance penalty ranging from negligible to catastrophic depending on the operation and your data volume. At that data volume, you're probably doing some things that would OOM or hang for days if you translated your data.table code to the corresponding tidyverse code.


dtplyr is an option as well: it lets you write tidyverse (dplyr) syntax and runs it on a data.table backend, so you get both the speed and the syntax.
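A rough sketch of that workflow, assuming the data.table, dplyr, and dtplyr packages are installed. The toy table and its column names (g, x) are made up for illustration:

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Wrap a data.table so that dplyr verbs are recorded lazily instead of run eagerly
dt <- lazy_dt(data.table(g = c("a", "a", "b"), x = c(1, 2, 3)))

q <- dt %>%
  group_by(g) %>%
  summarise(mean_x = mean(x))

show_query(q)         # prints the data.table expression dtplyr generated
res <- as_tibble(q)   # this is the step that actually executes it
```

Nothing is computed until the result is collected, which is why the translated query can be inspected first.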

If you learned data.table, however, it's better to just stay in data.table. Nothing in the tidyverse can touch the efficiency of data.table.


Don't use tidyverse if you don't need to. Certainly don't start with it if you're a complete beginner. Base R goes a long way on its own.


Agreed. IMO Tidyverse is a fantastic suite of R packages and worth learning after understanding how to use base R/with minimal dependencies. I personally started with base R and evolved to use tidyverse. Now I use base R when writing R packages and use tidyverse for data analysis/modeling workflows.


I would say this is bad advice. Don’t learn base R, focus on tidyverse. Tidyverse is what people write and use.


I’ll second this, though with some hesitation. If you just want to get stuff done, start with tidyverse. But if and when it’s time to start writing classes and packages, you may have to go back and gather some of the fundamentals.


I agree with both you and GP. Doing heavy stats work in base is pointlessly painful.

Hadley's Advanced R is a great reference for getting down to those fundamentals.

https://adv-r.hadley.nz/


I'm a base R purist personally, but that's mostly because of how long ago I picked it up, and because I don't get any improvement in development speed from dplyr verbs, with a few exceptions. But I disagree with this take for beginners, especially non-programmers: with the advent of the tidyverse, it is incredible how fast newcomers pick up enough fluency to handle basic data massaging, analysis and visualisation.

I think exceptions where base-R is necessary can be taught as they arise.


There are several comments below that suggest not using tidyverse because "base R" is the foundation for everything.

I think it is important to use tidyverse because of the many quirks, surprises, and inconsistencies in base R. It would be helpful if others share their reasoning, or at least point to their favorite blog explanation, so that beginners can understand the problems they will face.

Unfortunately 5 minutes of Googling failed to produce a reference for me: the start of some advanced R book that begins by asking "do you need to read this?" and showing examples whose results are predicted incorrectly by most people. Perhaps another user can provide the info.

* Reference to good HN thread: https://news.ycombinator.com/item?id=20362626

* Particularly pointed notes on base R problems: https://news.ycombinator.com/item?id=20363806


This depends on what you are using R for. Tidyverse is focused on handling data.frame objects and everything that comes with them. Even ggplot2 uses a data.frame as a default input. And tidyverse has a competitor, data.table, which can be used instead (given that you are familiar with base R).

However, some data are better suited to be represented in the form of matrices. Putting matrix-like data in a data.frame is silly, since performance will suffer and you would have to convert it back and forth for many matrix-friendly operations like PCA, tSNE, etc. The creator of data.table shares this opinion [1]. And similar opinions are generally given by people who are familiar with problems that fall outside the data.frame model [2].

[1]: https://twitter.com/mattdowle/status/1037949621773844480?lan...

[2]: https://www.youtube.com/watch?v=9Objw9Tvhb4&t=225s
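To make the matrix point concrete, here is a small base R sketch on random data. The dimensions and variable names are arbitrary choices for illustration:

```r
set.seed(1)
m <- matrix(rnorm(200), nrow = 20, ncol = 10)   # 20 samples x 10 features

# Matrix-friendly operations work on the matrix directly
pca  <- prcomp(m, center = TRUE, scale. = TRUE)
gram <- crossprod(m)                            # t(m) %*% m, a 10 x 10 matrix

# The same values stored in a data.frame must be converted first:
# %*% refuses a data.frame operand
df <- as.data.frame(m)
attempt <- tryCatch(df %*% t(m), error = function(e) "error")
gram_df <- crossprod(as.matrix(df))             # works after conversion
```

The as.matrix() call copies the data, which is the back-and-forth cost the comment is describing.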


I recommend data.table instead of tidyverse. The syntax is harder to learn initially, but it's much faster.


The equivalent is dplyr, not tidyverse, which is a huge suite of tools.


Don't learn R from books that are more than 5 years old



