Introduction to Data Analytics Using R

Admin

Welcome to another interesting episode of our Data Digest. In this episode, we will introduce data analytics using R.

Instructor: Paulina Boadiwaah Mensah | LinkedIn

A brief overview of R

The name R comes from the first names of its core developers, Robert Gentleman and Ross Ihaka. It’s also a play on the name of its parent language, S. R’s semantics, however, are closer to those of Scheme, a functional programming language.

R is a functional programming language in which functions are first-class objects. This means that you can do with a function anything you can do with any other R object: assign it to a variable, store it in a list, pass it as an argument, and return it as the result of another function. Everything that happens in R happens via a function call. Even assignment is really a function call.

However, R is not a “pure” functional language. It also has features of imperative programming languages, such as loops and assignments that change state, so it isn’t just a functional programming language.
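To make the idea of first-class functions concrete, here is a minimal sketch in base R. The names square, transforms, make_adder, and add_five are made up for this illustration.

```r
# A function is an ordinary R object, so it can be assigned to a variable.
square <- function(x) x^2

# Functions can be stored in a list.
transforms <- list(square = square, root = sqrt)

# Functions can be passed as arguments: sapply() calls `square` on each element.
squares <- sapply(1:3, square)

# Functions can be returned as the result of another function.
make_adder <- function(n) {
  function(x) x + n
}
add_five <- make_adder(5)

# Even assignment is a function call: this line does the same as y <- 10.
`<-`(y, 10)

print(squares)              # 1 4 9
print(transforms$root(16))  # 4
print(add_five(2))          # 7
print(y)                    # 10
```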

Why learn R?

These are some of the many benefits that R offers you:

R helps you create books and web apps (using Quarto and Shiny).
R thrives through a supportive community.
R supports multiple programming paradigms.
R is a very good tool for statistical analysis and data science.

Here are some of the career opportunities open to you as an R expert:

R programmer
Geostatistician
Researcher
Business intelligence analyst
Data analyst
etc.

Below are some of the companies that use R, alongside other software, in their tech stack.

Bank of America
Amazon
Facebook
JP Morgan
Google

Variables and data types

Variables are simply names or tags given to values stored in the memory of the computer that is executing the code.

Rules in naming variables

There are some rules that govern the naming of variables in R. Ensure that your variable name:

does not begin with a number (e.g. 29var is invalid),
contains only letters, numbers, dots, and underscores (e.g. var_2 is valid),
does not start with an underscore (e.g. _var is invalid),
does not start with a dot followed by a number (e.g. .2var is invalid).
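One way to test these rules yourself is with the base R function make.names(), which rewrites a string into a syntactically valid name. If the string comes back unchanged, it was already valid. This is only a sketch: the candidate .var2 is an extra example added here, and make.names() also alters reserved words such as if, which the rules above do not cover.

```r
# Candidate names taken from the rules above, plus .var2 as an extra example.
candidates <- c("29var", "var_2", "_var", ".2var", ".var2")

# A name is valid if make.names() has nothing to fix.
is_valid <- make.names(candidates) == candidates

print(data.frame(name = candidates, valid = is_valid))
```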

R has six basic data types. They are listed below with examples.

Integer: e.g. 186L (the L suffix marks the value as an integer; a plain 186 is stored as numeric)
Numeric: e.g. 20.2
Character: e.g. "r lessons @ Resagratia"
Logical: e.g. TRUE
Complex: e.g. 34 + 2i
Raw: e.g. charToRaw("R"); values are displayed as raw bytes
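You can confirm the type of any value with class(). The sketch below builds one value of each type listed above and checks them all at once. The list name values is only for illustration.

```r
# One example value for each of the six basic data types.
values <- list(
  integer   = 186L,
  numeric   = 20.2,
  character = "r lessons @ Resagratia",
  logical   = TRUE,
  complex   = 34 + 2i,
  raw       = charToRaw("R")
)

# class() reports the data type of each value.
types <- sapply(values, class)
print(types)

# Without the L suffix, a whole number is stored as numeric.
print(class(186))  # "numeric"
```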

Operators

R has multiple operators that allow you to perform a variety of tasks.

Below are examples of each type of operator.

Arithmetic Operators: Arithmetic operators let you perform arithmetic operations such as addition, multiplication, etc.
- a + b #Adds two variables
- a - b #Subtracts b from a
- a * b #Multiplies two variables
- a ^ b #Raises a to the power of b
- a / b #Divides a by b
- a %/% b #Integer division of a by b
- a %% b #Remainder after dividing a by b
Logical Operators: Logical operators are used for Boolean operations.
1. ! #Logical NOT
2. & #Element-wise logical AND
3. && #Logical AND (single values only)
4. | #Element-wise logical OR
5. || #Logical OR (single values only)
Relational Operators: Relational operators are used to compare values.
1. a == b #Tests for equality
2. a != b #Tests for inequality
3. a > b #Tests for greater than
4. a < b #Tests for less than
5. a >= b #Tests for greater than or equal to
6. a <= b #Tests for less than or equal to
Assignment operators:
1. x <- 1 #Assigns the value 1 to x
2. x = 1 #Assigns the value 1 to x
Other operators:
1. %in% #Tests whether an element belongs to a vector
2. $ #Accesses a named element stored within an object, such as a column of a data frame
3. %>% #The pipe from the magrittr package; it passes an object to a function as its first argument
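Here is a short base R sketch that tries out several of the operators listed above. The values of a and b and the person list are made up for this illustration. The %>% pipe is left out because it needs the magrittr package.

```r
a <- 17
b <- 5

# Arithmetic operators
quotient  <- a %/% b   # integer division: 3
remainder <- a %% b    # remainder: 2
power     <- a ^ 2     # exponentiation: 289

# & compares vectors element by element.
elementwise <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)

# && works on single values and returns a single TRUE or FALSE.
both_true <- (a > b) && (b > 0)

# Relational operator
not_equal <- a != b

# %in% checks membership in a vector.
is_member <- 5 %in% c(1, 3, 5)

# $ pulls a named element out of an object (hypothetical list).
person <- list(name = "Ada", age = 36)
person_name <- person$name

print(c(quotient, remainder, power))  # 3 2 289
print(elementwise)                    # TRUE FALSE FALSE
print(both_true)                      # TRUE
print(is_member)                      # TRUE
print(person_name)                    # "Ada"
```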

Fundamental data structures in R

A data structure is a particular way of organizing data in a computer so that it can be used effectively. The idea is to reduce the space and time complexity of different tasks. R has six fundamental data structures:

Vectors: Vectors are one-dimensional data structures. c() is a built-in function that combines items into a vector. A vector in R can contain objects of only one data type.
Matrices: Matrices are two-dimensional structures that resemble a table with rows and columns. Just like vectors, all cells in a matrix have to be of the same data type. To create a matrix in R, you use the function matrix().
Lists: Lists in R can contain objects of various data types. A typical list can consist of lists, vectors, matrices, and even functions. list() is the function that creates a list.
Factors: Factors are used to categorize data and store it as levels, which makes them useful for storing categorical data. To create a factor, you use the function factor(), passing it the vector to categorize.
Data frames: These are made of rows and columns and can contain objects of various data types. Every column in a data frame must have its own unique name and the same number of items, and each item in a single column must be of the same data type. To create a data frame, you use the function data.frame().
Arrays: Arrays are data structures that can have any number of dimensions: one, two, or more. Unlike matrices, which always have exactly two dimensions, an array is allowed to have more than two. An array is created using the array() function.
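The sketch below creates one small example of each of the six structures using the functions named above. All the values are made up for illustration.

```r
# Vector: one dimension, one data type.
vec <- c(10, 20, 30)

# Mixing types in c() forces everything to a single type (here, character).
mixed <- c(1, "two", TRUE)

# Matrix: two dimensions, one data type.
mat <- matrix(1:6, nrow = 2, ncol = 3)

# List: can hold different kinds of objects, even functions.
lst <- list(numbers = vec, label = "mixed bag", fn = mean)

# Factor: categorical data stored as levels.
fct <- factor(c("low", "high", "low"))

# Data frame: named columns of equal length, each with its own type.
df <- data.frame(id = 1:3, group = c("a", "b", "a"))

# Array: here with three dimensions.
arr <- array(1:24, dim = c(2, 3, 4))

print(class(mixed))  # "character"
print(dim(mat))      # 2 3
print(levels(fct))   # "high" "low"
print(dim(arr))      # 2 3 4
```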

References and resources

These are some of the platforms where you can learn more about R for data analytics and practice using it.

Datacamp
Hackerrank
R-bloggers.com
R-tutor.com

Main studio session

Let’s get our hands dirty and begin analyzing data in R!

In this hands-on session, it is recommended that you:

Install RStudio on your computer.
Install the carData package. This package contains various datasets, including the Salaries dataset that we will be using for the analysis.
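If you are unsure which packages are already installed, a sketch like the one below can help. The helper missing_packages() is a made-up name for this illustration, not a built-in function. The package list covers the packages loaded later in this session, plus ggthemes, which is an addition here because theme_wsj() in the scatterplot step comes from that package.

```r
# Hypothetical helper: returns the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this session (ggthemes is an assumption, see the note above).
needed <- c("carData", "tidyverse", "gtsummary", "scales", "ggthemes")
to_install <- missing_packages(needed)

# Install only what is missing, and only when running interactively,
# since installation needs an internet connection.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```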

Let’s follow the steps

Step I: Launch RStudio from the RStudio shortcut on your desktop.

You can press Ctrl+L to clear the console.

The console is where you write your code and where the output is displayed.

Step II: Set your working directory

The working directory is the folder from which you will load your data and where your files will be saved. Set your working directory to your own desired folder rather than copying and pasting the code below; the code only shows how to go about it.

#setwd('filepath')
setwd("C:/Users/pauli/OneDrive/Desktop")
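To check that the change worked, you can ask R where it is currently pointing with getwd(). In this sketch, tempdir() stands in for your own folder so that the code runs on any computer.

```r
# setwd() returns the previous working directory, so we can restore it later.
previous_dir <- setwd(tempdir())  # tempdir() is a stand-in for your own folder

# Confirm where R is now reading and writing files.
print(getwd())

# List the files in the working directory.
print(list.files())

# Go back to where we started.
setwd(previous_dir)
```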

Step III: Load the carData package

You can load the carData package by running this code in your console. Note that package names are case-sensitive. When you load a package, all its contents, be they functions or datasets, are made available to you.

library(carData)

Step IV: Read the data.

As mentioned earlier, the Salaries dataset from the carData package will be used for this analysis.

dataset <- Salaries
View(dataset)

View() displays the dataset as a table in a new tab.

You can also use head() or tail() to print the first six or last six rows of your dataset.

head(dataset)
tail(dataset)
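Both functions accept an n argument if you want a different number of rows. The sketch below uses iris, a dataset that ships with R, as a stand-in so that it runs without carData; the same calls work on dataset.

```r
# iris stands in for `dataset` here so the snippet runs without extra packages.
first_three <- head(iris, n = 3)  # first 3 rows instead of the default 6
last_three  <- tail(iris, n = 3)  # last 3 rows

print(first_three)
print(last_three)

# dim() gives the number of rows and columns.
print(dim(iris))

# str() gives a compact overview of every column.
str(iris)
```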

Step V: Obtain the descriptive statistics of the variables in the Salaries dataset.

summary(dataset)

Step VI: Let’s check the class (data type) of each variable (column) in this dataset.

class(dataset$discipline) 
class(dataset$rank)
class(dataset$yrs.since.phd)
class(dataset$yrs.service)
class(dataset$sex) 
class(dataset$salary)
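Instead of calling class() once per column, you can apply it to every column in one go with sapply(). The sketch again uses the built-in iris dataset as a stand-in; replacing iris with dataset gives the classes of the Salaries columns.

```r
# Apply class() to every column of the data frame at once.
column_classes <- sapply(iris, class)
print(column_classes)

# Pick out just the factor (categorical) columns.
factor_columns <- names(column_classes)[column_classes == "factor"]
print(factor_columns)
```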

Step VII: Load the tidyverse package.

library(tidyverse)

The tidyverse is a collection of packages used for importing, tidying, manipulating, and visualizing data.

Step VIII: Replace the elements in column discipline, changing A to Theoretical Department and B to Applied Department

dataset <- dataset %>% mutate(discipline=fct_recode(discipline, "Theoretical Department" ="A", "Applied Department" ="B"))
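If you want to see what the recoding does without the tidyverse, here is a base R sketch of the same idea. It is an alternative to fct_recode(), not the method used in this session, and the discipline vector is a made-up stand-in rather than the real Salaries column.

```r
# Toy stand-in for the discipline column (made-up values).
discipline <- factor(c("A", "B", "B", "A"))

# Map each old level to its new label.
new_labels <- c(A = "Theoretical Department", B = "Applied Department")

# Replace the levels; the values in the factor follow automatically.
levels(discipline) <- unname(new_labels[levels(discipline)])

print(levels(discipline))
print(table(discipline))
```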

Step IX: Now, let’s rename some column names.

We will capitalize the column names and also change the name of the third column to "YEARS since PHD".

dataset <- dataset %>% rename_with(toupper) 
dataset <- dataset %>% rename("YEARS since PHD" = YRS.SINCE.PHD)
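The same renaming can be sketched in base R, which shows what rename_with() and rename() are doing under the hood. The toy data frame below holds made-up values and only mimics three of the Salaries column names.

```r
# Toy stand-in with made-up values (not the real Salaries data).
toy <- data.frame(rank = "Prof", yrs.since.phd = 19L, salary = 100000L)

# Capitalize every column name.
names(toy) <- toupper(names(toy))

# Rename one column by matching its current name.
names(toy)[names(toy) == "YRS.SINCE.PHD"] <- "YEARS since PHD"

print(names(toy))

# A name containing spaces must be wrapped in backticks when used with $.
print(toy$`YEARS since PHD`)
```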

Step X: Let’s get a table summary of the dataset and do a little bit of adjustment to the table data

To do this, it is necessary to load the gtsummary and scales packages.

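library(gtsummary)
library(scales)
# label_number_si() is deprecated in recent scales releases;
# label_number() with cut_short_scale() gives the same style of labels
dollar_fxn <- label_number(accuracy = 0.1, prefix = "$", scale_cut = cut_short_scale())
# Each %>% must end a line, not start one, so that R keeps reading the pipeline
dataset %>% tbl_summary(digits = starts_with("SALARY") ~ dollar_fxn) %>%
        modify_caption("General Overview of the Salaries Dataset") %>%
        bold_labels()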

Step XI: Create a scatterplot

This plot displays the years since PhD versus the salary, colored by sex, with a fitted linear trend line for each group.

library(ggthemes)  # provides theme_wsj()
dataset %>% ggplot(aes(x = `YEARS since PHD`, y = SALARY, color = SEX)) +
   geom_point() + theme_wsj() + scale_color_manual(values = c("red", "black")) +
   geom_smooth(method = "lm", se = FALSE) +
   labs(title = "Resagratia R Session", subtitle = "Dataset from carData package")

Step XII: Now let’s create a report of what we have done so far using R Markdown

Click on the File tab and select R Markdown from the New File list, i.e. File > New File > R Markdown…
A dialog box will appear with some fields to fill in.
Set the author to Resagratia R Warriors and the title to General Overview of Salaries Dataset.
A new tab opens with some pre-written code and text. You can erase them, leaving the title, author, date, and output.
Now, embed the code written in the R script earlier in the R Markdown document.
Press Ctrl+Alt+I to create a new R code chunk.
Then copy and paste the necessary code into it.
After you are done, click on the Knit button to knit the document to HTML, PDF, or Word, whichever you prefer. For this session, the document is knitted to HTML. The result is displayed below.

We do hope that you found this episode of the Data Digest series exciting and insightful. For more access to such quality content, kindly sign up to the Resa platform.

Thank you for learning with us!
