Data Analysis with R Programming

Maung Agus Sutikno
6 min read · Jul 10, 2022


R is one of the most important tools in data analytics, especially for statistical work. This powerful language, written by statisticians for statisticians, is often called the king of statistical computing languages for analyzing and visualizing large data sets. For that reason, it is worth getting to know the programming language that rules data science. Let’s start with the basics!

Getting started

Compared to Python, another popular programming language in the data analytics world, R has slightly different characteristics. R is typically used by professionals who take a statistics-oriented approach to solving problems, for example scientists, statisticians, and engineers. Python is typically used by professionals looking for solutions in data analytics or who have to mine data heavily for answers, for example data scientists, machine learning specialists, and software developers. The comparison between R and Python is shown below.

If we are more familiar with Microsoft Excel or SQL, R has something in common with them: they all use functions. In spreadsheets, we use functions in formulas; in SQL, we include them in queries; in R, we use them in code. The differences are summarized in the table below.

Now that we have connected R to the data analytics tools we already know, let’s go deeper into R as a programming language. R is based on a programming language called S. In the 1970s, John Chambers created S for internal use at Bell Labs, a famous scientific research facility. In the 1990s, Ross Ihaka and Robert Gentleman developed R at the University of Auckland, New Zealand. The name R refers to the first names of its two authors and plays on the single-letter name of its predecessor, S.

RStudio, where we do our R programming and visualization, is an IDE (integrated development environment). R is open source, and we can either install RStudio for free on our desktop or access RStudio Cloud. R and RStudio are designed to handle large data sets that spreadsheets might not be able to manage. RStudio truly shines when the data consists of many categories or groups: it makes it easy to take a specific analysis step and perform it for each group, to create flexible visualizations for each group using its plotting features, and to compute summary statistics for each group.
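As a rough illustration of per-group summary statistics, the sketch below uses only base R and the built-in mtcars data set. The data set and the choice of cyl as the grouping column are assumptions made for this example, not something taken from the analysis above.

```r
# mtcars ships with R; cyl (number of cylinders) serves as the grouping column
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_by_cyl)

# Count how many cars fall into each group
cars_per_group <- table(mtcars$cyl)
print(cars_per_group)
```

The same pattern works for other summaries: swapping mean for median or sd in the FUN argument computes that statistic for each group instead.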

Having a person or place to turn to whenever we get stuck always helps the learning process. We can get support from the RStudio community or by following the Twitter feeds.

Technical Part

A data structure is like a house that contains data

A data structure is a format for organizing and storing data. Metaphorically, a data structure is like a house that is filled with our data. The data structures commonly used in R include vectors, data frames, matrices, and arrays.

A vector is a group of data elements of the same type, stored in a sequence. An atomic vector cannot hold more than one type. There are four primary types of vectors: logical, integer, double, and character (which contains strings). We can refer to this site for more information about vectors. Additionally, the lubridate package can convert different types of data in R into date and date-time formats. This site covers it in detail.
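The four primary vector types can be sketched as follows. The variable names and values are made up for illustration, and the lubridate step only runs if that package happens to be installed.

```r
# One vector of each primary type (values are invented for the example)
is_delayed <- c(TRUE, FALSE, TRUE)      # logical
seat_count <- c(150L, 180L, 220L)       # integer (note the L suffix)
distance_km <- c(512.5, 1200.0, 89.3)   # double
carrier <- c("AA", "UA", "DL")          # character

print(typeof(is_delayed))
print(typeof(seat_count))
print(typeof(distance_km))
print(typeof(carrier))

# Mixing types does not raise an error: R coerces every element to one common type
mixed <- c(1L, 2.5, "three")
print(typeof(mixed))

# Optional: parse a year-month-day string into a Date with lubridate, if available
if (requireNamespace("lubridate", quietly = TRUE)) {
  published <- lubridate::ymd("2022-07-10")
  print(class(published))
}
```

The coercion step is worth noticing: combining an integer, a double, and a string yields a character vector, which is how R keeps every vector to a single type.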

The next part is about operators. An operator is a symbol that identifies the type of operation or calculation to be performed in a formula. There are three primary logical operators, each returning a logical value (TRUE or FALSE): AND (&), OR (|), and NOT (!). You can find more about logical operators here.
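A minimal sketch of the three logical operators follows; the variable names and numbers are hypothetical and chosen only to make each result easy to check by hand.

```r
# Hypothetical values for a single flight
dep_delay <- 25
distance <- 800

# AND: TRUE only when both conditions hold
both_hold <- dep_delay > 15 & distance < 1000

# OR: TRUE when at least one condition holds
either_holds <- dep_delay > 60 | distance < 1000

# NOT: flips TRUE to FALSE and vice versa
not_delayed <- !(dep_delay > 15)

print(both_hold)
print(either_holds)
print(not_delayed)

# The operators also work element by element on whole vectors
delays <- c(5, 25, 70)
moderate_delay <- delays > 15 & delays < 60
print(moderate_delay)
```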

To keep our code readable, it is important to use a style that is clear, consistent, and free from errors. A consistent coding style helps when we work with collaborators or teammates, because everyone can easily read, edit, and build on each other’s code. If we work alone, it also makes it easier and faster to review our own code later on. Let’s go over a few of the most widely accepted stylistic conventions for coding in R.
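As one possible illustration, the sketch below contrasts two ways of writing the same line. The conventions shown (descriptive snake_case names, `<-` for assignment, spaces around operators and after commas) are ones commonly recommended for R, for example in the tidyverse style guide; the variable names are made up.

```r
# Harder to read: cryptic name, = for assignment, no spacing
x=c(12,7,30)

# Easier to read: descriptive snake_case name, <- for assignment,
# spaces around the operator and after commas
delay_minutes <- c(12, 7, 30)

# Both lines create the same vector; only the readability differs
same_result <- identical(x, delay_minutes)
print(same_result)
```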

To do data analysis in R, we often need to install R packages. Packages are units of reproducible R code that add more functionality to R. They are created and shared by the R community so that other users can access them. We can see the details of the primary packages here.
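The usual install-then-load workflow might look like the sketch below. ggplot2 is used as the example package only because it appears later in this article, and the install line is commented out so that the script does not download anything when run.

```r
# Install once per machine (commented out so nothing is downloaded on every run)
# install.packages("ggplot2")

# Load the package in each new R session; the check avoids an error if it is missing
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
}

# List the names of the packages currently installed
installed <- rownames(installed.packages())
print(head(installed))
```

Installing happens once, while library() has to be called again in every new session before the package's functions can be used by name.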

The data frames we work with often have many rows and columns. To show only the first 10 rows of a dataset, and only as many columns as fit on the screen, we can use tibbles. You can find more about tibbles here and about importing the data itself here. When it comes to naming data files, the following are some helpful do’s and don’ts.

Visualization

ggplot2 is a popular package for visualization in R; you can find the details here. It makes it possible to create different types of data visualizations right in our R workspace. An aesthetic is a visual property of an object in a plot, such as its position, color, size, or shape. One important note: besides unbalanced parentheses and quotation marks, mistakes in capitalization are among the most common coding errors we might encounter in ggplot2, because R is case sensitive. Here is an example of how aesthetic attributes are specified in R:

# Assumes ggplot2 is loaded and `data` is a data frame containing these columns
ggplot(data, aes(x = distance, y = dep_delay, color = carrier, size = air_time, shape = carrier)) +
  geom_point()
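For a version that runs without any extra data, the sketch below applies the same aesthetic mappings to mpg, a small example data set bundled with ggplot2. That data set and its columns (displ, hwy, drv, cyl) are stand-ins chosen for this illustration; they are not the flight data used in the example above.

```r
# Assumption: ggplot2 is installed; mpg is an example data set that ships with it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Map columns to aesthetics: position (x, y), color, size, and shape
  efficiency_plot <- ggplot(mpg, aes(x = displ, y = hwy, color = drv, size = cyl, shape = drv)) +
    geom_point()

  print(efficiency_plot)
}
```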

A scatter plot is the most common visualization for getting a first understanding of our data. However, it is hard to see trends from a scatter plot alone, so smoothing is needed. Smoothing makes a trend easier to detect by adding a smoothing line as another layer on the plot. Below is the example code.

ggplot(data, aes(x = distance, y = dep_delay)) + geom_point() + geom_smooth()

And the output visualization will be…

Lastly, all the code and visualizations can be saved and shared with stakeholders using R Markdown. It is really helpful to bookmark some resources to refer to later. For more about R Markdown, you can access the reference guide here and its cheat sheet here.

Additional important resources for getting started with the R programming language can be found here. Happy learning!
