I’m a big fan of Python for data analysis, but even I get curious about what else is available. R has long been the go-to language for statistics, but the “Tidyverse” has given the language a serious makeover. Here’s why I’ve decided to learn both technologies.
R Is Popular in Academia and Industry
If you take statistics courses in college, you’ll probably learn the R language. The language, due to being created by statisticians, is widely used in statistics academia, as well as by academic researchers in other fields who do statistical analysis, such as the social sciences. If you pick up advanced statistics textbooks, mentioned later, you’ll find that most of the code examples will be in R. R is also used for analysis in the business world.
If you read the journals, while there is increasing mention of other languages like Python, most of the software discussed will run under R. R’s continuing dominance is demonstrated in The Journal of Statistical Software, an open-access academic journal covering, well, statistical software.
R is heavily influenced by an earlier language, S. S was created at Bell Labs to implement legendary statistician John Tukey’s idea of “exploratory data analysis.” As with Bell Labs’ other creation, Unix, S was licensed for practically nothing. R built on this with its open source license when it emerged in the ’90s. The relationship of R to S is thus similar to that of Linux to the original Unix.
The Tidyverse has built on this legacy of statistical computation with a suite of R libraries for plotting and manipulating data. The Tidyverse includes ggplot2 for plotting, dplyr for data manipulation, tidyr for tidying and reshaping data, readr for reading rectangular data from delimited files such as CSVs, purrr for functional programming, tibble for a modern take on data frames, stringr for working with strings, forcats for working with categorical variables, and lubridate for working with dates and times.
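To give a flavor of how these libraries read in practice, here is a minimal sketch of a dplyr pipeline. It assumes the dplyr package is installed and uses mtcars, a small dataset that ships with R. The name summary_by_cyl is just an illustrative variable I made up for this example.

```r
library(dplyr)

# mtcars is built into R, so there is nothing to download
summary_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                          # keep cars with more than 100 horsepower
  group_by(cyl) %>%                             # one group per cylinder count
  summarise(mean_mpg = mean(mpg), n = n()) %>%  # average fuel economy and group size
  arrange(desc(mean_mpg))                       # most economical group first

print(summary_by_cyl)
```

Each step takes a data frame and hands a new one to the next step, which is why these pipelines tend to read top to bottom like a recipe.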
Great-Looking Graphics With ggplot2
One of R’s claims to fame is its ability to create professional-quality statistical plots with a minimum of code. ggplot2, as part of the Tidyverse, could be its “killer app,” the thing you would want to use R and the Tidyverse for.
ggplot2 lets you create good-looking plots that you can actually publish. And professional organizations do: the BBC uses it for their infographics.
ggplot2 is based on the idea of a “grammar of graphics.” Instead of a dedicated command for something like making a scatterplot with a regression line, you build up a plot piece by piece out of a selection of elements. You define an “aesthetic” that maps columns of the data frame to visual properties, such as the x and y axes, and then you add layers like the scatterplot and the regression line. While this approach feels more complicated than plotting functions you might find in a spreadsheet program, it’s a lot more flexible.
Here’s an example using a dataset of tips that a waiter recorded while working in a restaurant:
ggplot(tips, aes(x = total_bill, y = tip)) + geom_point() + geom_smooth(method = "lm")
This code tells ggplot2 that I want the x-axis or independent variable to be the total bill, and the y-axis to be the corresponding tip. With the aesthetic defined, I then tell it to overlay a scatterplot and then draw a linear regression line over that.
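The tips data doesn’t come with R itself, so if you want something you can paste straight into a console, here is a rough sketch of the same idea using the built-in mtcars dataset instead. It assumes ggplot2 is installed, and it adds the layers one at a time to show how the plot is assembled. The variable p is just an illustrative name.

```r
library(ggplot2)

# Start with the data and the aesthetic mapping: weight on x, fuel economy on y
p <- ggplot(mtcars, aes(x = wt, y = mpg))

# Add a scatterplot layer
p <- p + geom_point()

# Add a linear regression line on top of the points
p <- p + geom_smooth(method = "lm", formula = y ~ x)

# Printing the object is what actually draws the plot
print(p)
```

Because the plot is an ordinary object, you can keep adding to it, such as labels or a theme, without rewriting what came before.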
Lots of Available Texts and Tutorials on R
Another reason that I’ve decided to add R programming to my repertoire is that, due to R’s existing popularity in academia, there’s a lot of material for learning more advanced concepts.
If you pick up more advanced statistics textbooks, you’ll often find that they have code examples that are written in R. While it would be fairly easy to translate these examples into Python with the right libraries, I would rather focus on learning the concepts and then try to apply them to Python later if I wanted to.
There are plenty of books and tutorials at different skill levels. For undergraduates taking an introductory statistics course, OpenIntro’s Introduction to Modern Statistics introduces them to statistics using R for labs, without students having to look up values in tables or memorize formulas.
The contributed documentation section on CRAN’s website, R’s answer to the Python Package Index or CPAN, hosts a lot of available texts. One that I stumbled upon was Practical Regression and Anova using R by Julian Faraway. Since linear regression is one of my go-to methods in Python, this could be a big help in learning more advanced techniques with R.
Lots of Packages in CRAN
Apart from the Tidyverse, the wide availability of packages for R is another selling point for the language. There are almost 23,000 packages listed on the Comprehensive R Archive Network, or CRAN. This shows how deeply loyal statisticians are to R. The Tidyverse packages are among them, but there are also “task views” for everything from econometrics to sports analytics.
There is a lot in CRAN to keep stats nerds busy for decades.
The Tidyverse Makes Data Cleaning Easy
If you’ve ever downloaded datasets from the internet, you know that they can be less than ideal. The Tidyverse takes its name from the idea that every column in a data frame should be a variable, every row should represent an observation, and every cell should hold a single value.
The problem is that when people create their own datasets in spreadsheets, they might not be thinking of these criteria. While it’s best to lay out data this way from the outset when possible, a lot of spreadsheets are used by people who aren’t trained in statistics or thinking ahead to how their data might be used in the future.
The Tidyverse includes libraries, chiefly tidyr, that can reshape datasets to fit the tidy data model. You can expand data out to a “wide” format with multiple columns, and you can squish it into a “long” format that’s more suitable for plotting.
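As a small sketch of what that reshaping looks like, here is a made-up spreadsheet-style table converted to the long format and back using tidyr’s pivot_longer() and pivot_wider(). It assumes tidyr is installed. The store names and sales figures are invented purely for illustration.

```r
library(tidyr)

# Hypothetical spreadsheet-style data: one row per store, one column per year
wide <- data.frame(
  store = c("north", "south"),
  y2022 = c(100, 80),
  y2023 = c(120, 90)
)

# Tidy it: one row per store-and-year observation
long <- pivot_longer(
  wide,
  cols = c(y2022, y2023),
  names_to = "year",
  values_to = "sales"
)

# And spread it back out to the original wide layout
wide_again <- pivot_wider(long, names_from = year, values_from = sales)

print(long)
print(wide_again)
```

The long version is the one ggplot2 generally wants, since the year becomes a single column you can map to an axis or a color.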
RStudio Is Great
I’ve been generally skeptical of IDEs in the past, preferring to work with separate components like an editor and a terminal, and an interpreter or compiler (but mostly an interpreter).
The main development tool of the Tidyverse is RStudio, a multi-platform IDE especially designed for statistical work. The name says “R” but it also officially supports other languages like Python. The outreach toward other languages is also why its developer changed its name from RStudio to Posit.
I’ve enjoyed working with RStudio so far. It doesn’t seem to get in the way. While I’m normally skeptical of monolithic developer tools, the job of data analysis might just be different from other programming tasks. It’s a lot more interactive. With R, you spend more time exploring data and trying things than working through an edit-and-debug cycle.
RStudio also has an attractive way of displaying plots from ggplot2 in a pane on the lower right-hand corner of the program. It’s also easy to save plots from this window.
Sometimes, Peer Pressure Is Good
New programmers are often advised to learn Lisp, even if they’ll use other languages in their daily work, because knowing it will change for the better how they approach problems in those other languages. I think R might occupy a similar role in data analysis. Python is better when you need to adapt your models to interface with other programs or work with the real world, but R, being designed by statisticians for statisticians, exerts a heavy influence over other data analysis tools.
pandas DataFrames in Python were clearly influenced by R’s data frames.
It’s ultimately a bad idea to be wedded to one language. As popular as Python is, I think knowing more than one data analysis language will help me in the long run.