R is a language and environment for statistical computing and graphics,
developed as a GNU project. It is similar to the S language, which originated
at Bell Laboratories, and can be regarded as an alternative implementation
of S: much S code runs unaltered under R. Offering diverse statistical
and graphical techniques, R supports linear and nonlinear modeling, classical
tests, time-series analysis, clustering, and more. Known for its extensibility,
R is widely used for research in statistical methodology. Notably, it facilitates
the creation of publication-quality plots with customizable options. Licensed under
the GNU GPL, R is available on UNIX, Linux, FreeBSD, Windows, and macOS platforms.
Data Structures in R Programming
R’s base data structures are often organized by their dimensionality (1D, 2D, or nD) and by whether they’re homogeneous (all elements must be of the same type) or heterogeneous (elements can be of different types). This classification gives rise to the six data structures most frequently used in data analysis.
The most essential data structures used in R include:
Vectors
A vector is an ordered collection of values of a basic data type, with a given length. The key constraint is that all elements of a vector must be of the same data type; that is, vectors are homogeneous data structures. Vectors are one-dimensional.
X = c(1, 3, 5, 7, 8)
print(X)
Output:
[1] 1 3 5 7 8
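The homogeneity rule can be seen directly: when values of different types are combined with c(), R coerces them all to one common type. The following is a minimal sketch; the variable names mixed and nums are illustrative and not part of the original example.

```r
# Mixing types in c() coerces every element to a single common type
mixed <- c(1, "a", TRUE)
print(class(mixed))   # "character"

# A purely numeric vector keeps its numeric type
nums <- c(1, 3, 5, 7, 8)
print(class(nums))    # "numeric"
print(length(nums))   # 5
print(nums[2])        # 3 (R indexes from 1)
print(nums * 2)       # arithmetic is applied element by element
```

Because every element shares one type, operations such as nums * 2 apply to the whole vector without an explicit loop.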
Lists
A list is a generic object consisting of an ordered collection of objects. Lists are heterogeneous, one-dimensional data structures: a single list can hold vectors, matrices, character strings, functions, and even other lists.
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(empId, empName, numberOfEmp)
print(empList)
Output:
[[1]]
[1] 1 2 3 4
[[2]]
[1] "Debi" "Sandeep" "Subham" "Shiba"
[[3]]
[1] 4
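List elements can also be named, which makes them easier to retrieve. The sketch below reuses the employee values from the example above; the element names id, name, and count are assumptions added for illustration.

```r
# A named list holding elements of different types and lengths
empList <- list(
  id    = c(1, 2, 3, 4),
  name  = c("Debi", "Sandeep", "Subham", "Shiba"),
  count = 4
)

print(empList$name)            # the character vector stored under "name"
print(empList[["id"]][2])      # second employee id: 2

# Single brackets return a sub-list; double brackets return the element itself
print(class(empList["id"]))    # "list"
print(class(empList[["id"]]))  # "numeric"
```

The difference between [ ] and [[ ]] is a common source of confusion: the first keeps the list wrapper, the second extracts the stored object.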
DataFrames
Data frames are generic data objects in R used to store tabular data. They are among the most popular data objects in R programming because tabular data is familiar and easy to read. They are two-dimensional, heterogeneous data structures: a data frame is a list of vectors of equal length.
Data frames have the following constraints placed upon them:
- A data frame must have column names, and every row must have a unique name.
- Each column must have the same number of items.
- Each item in a single column must be of the same data type.
- Different columns may have different data types.
To create a data frame, we use the data.frame() function.
Name = c("Amiya", "Raj", "Asish")
Language = c("R", "Python", "Java")
Age = c(22, 25, 45)
df = data.frame(Name, Language, Age)
print(df)
Output:
   Name Language Age
1 Amiya        R  22
2   Raj   Python  25
3 Asish     Java  45
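The constraints listed above can be checked on the same data. This is a small sketch using only base R; it rebuilds the data frame so that it runs on its own.

```r
# Rebuild the example data frame
df <- data.frame(
  Name     = c("Amiya", "Raj", "Asish"),
  Language = c("R", "Python", "Java"),
  Age      = c(22, 25, 45)
)

print(dim(df))            # 3 rows, 3 columns
print(sapply(df, class))  # each column has its own single type
print(is.list(df))        # TRUE: a data frame is a list of equal-length vectors

print(df$Age)             # extract one column as a vector
print(df[df$Age > 23, ])  # keep only the rows where Age exceeds 23
```

Row filtering with df[condition, ] is the base R counterpart of the dplyr filter() verb described later.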
Matrices
A matrix is a rectangular arrangement of values in rows and columns: rows run horizontally and columns run vertically. Matrices are two-dimensional, homogeneous data structures.
To create a matrix in R, use the matrix() function. Its first argument is the vector of elements, and nrow and ncol give the number of rows and columns. An important point to remember is that, by default, a matrix is filled in column-wise order; pass byrow = TRUE, as in the example below, to fill it row by row.
Example:
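A = matrix(
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),  # elements, taken in order
  nrow = 3, ncol = 3,            # 3 rows and 3 columns
  byrow = TRUE                   # fill row by row instead of column by column
)
print(A)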
Output:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
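To make the fill order concrete, the sketch below builds the same six values both ways. The names by_col and by_row are illustrative.

```r
# Default behaviour: values fill the matrix column by column
by_col <- matrix(1:6, nrow = 2, ncol = 3)

# With byrow = TRUE the same values fill the matrix row by row
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

print(by_col)
print(by_row)

print(by_col[1, ])  # first row: 1 3 5
print(by_row[1, ])  # first row: 1 2 3
print(dim(by_row))  # 2 3
print(t(by_row))    # transpose: 3 rows, 2 columns
```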
Arrays
Arrays are R data objects that can store data in more than two dimensions; they are n-dimensional, homogeneous data structures. For example, an array with dimensions (2, 3, 3) consists of 3 rectangular matrices, each with 2 rows and 3 columns.
Example:
A = array(
  c(1, 2, 3, 4, 5, 6, 7, 8),  # elements, filled column by column
  dim = c(2, 2, 2)            # 2 rows, 2 columns, 2 matrices
)
print(A)
Output:
, , 1
     [,1] [,2]
[1,]    1    3
[2,]    2    4
, , 2
     [,1] [,2]
[1,]    5    7
[2,]    6    8
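Array elements are addressed with one index per dimension. The sketch below uses the same 2 x 2 x 2 layout as the example above; the name arr is illustrative.

```r
# Same layout as the example: two 2 x 2 matrices stacked along a third dimension
arr <- array(1:8, dim = c(2, 2, 2))

print(dim(arr))            # 2 2 2
print(arr[1, 2, 1])        # row 1, column 2 of the first matrix: 3
print(arr[, , 2])          # the whole second matrix

# apply() over the third dimension sums each matrix separately
print(apply(arr, 3, sum))  # 10 26
```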
Factors
Factors are data objects used to categorize data and store it as levels. They are useful for storing categorical data and can be built from both strings and integers. Factors suit columns with a limited set of unique values, such as "TRUE" or "FALSE", or "MALE" or "FEMALE". They play a crucial role in statistical modeling.
Example:
fac = factor(c("Male", "Female", "Male",
"Male", "Female", "Male", "Female"))
print(fac)
Output:
[1] Male Female Male Male Female Male Female
Levels: Female Male
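Internally, a factor stores an integer code for each value plus a table of level labels. The following sketch inspects the factor from the example above using base R functions only.

```r
fac <- factor(c("Male", "Female", "Male",
                "Male", "Female", "Male", "Female"))

print(levels(fac))      # "Female" "Male" (sorted alphabetically by default)
print(nlevels(fac))     # 2
print(as.integer(fac))  # underlying codes: 2 1 2 2 1 2 1
print(table(fac))       # counts per level: Female 3, Male 4
```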
Data Manipulation in R
Data manipulation in R involves organizing and transforming data for analysis. The dplyr package is commonly used, providing functions for filtering, arranging, grouping, summarizing, and modifying variables. Common tasks include filtering rows, arranging data, grouping and summarizing, adding or modifying variables, and joining datasets. The pipe operator (%>%, or the native |> available since R 4.1) chains operations into a streamlined workflow.
Tidy Data
The foundation of the dplyr package is built on the principles of tidy data, where each variable is represented in a column, and each observation is in a row. Tidy data is essential for efficient data analysis and visualization.
Summarize Cases:
summarize(.data, ...): Creates a summary table, often used to compute aggregate statistics.
Example: mtcars |> summarize(avg = mean(mpg))
Count Cases:
count(.data, ..., wt = NULL): Counts rows in groups.
Example: mtcars |> count(cyl)
Group Cases:
group_by(.data, ...): Groups data based on specified variables.
Example: mtcars |> group_by(cyl) |> summarize(avg = mean(mpg))
Manipulate Cases:
filter(.data, ...): Extracts rows based on specified conditions.
distinct(.data, ...): Removes duplicate rows.
slice(.data, ...): Selects rows by position.
Logical Operations for Filtering:
Standard logical operations (==, <, >, !=, &, |, !, is.na(), %in%, xor()) are used for filtering.
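The row-level verbs and logical operators above can be sketched on the built-in mtcars dataset. This assumes dplyr is installed and R is version 4.1 or later (for the |> pipe); the result names are illustrative.

```r
library(dplyr)

# Conditions separated by commas are combined with AND
efficient <- mtcars |> filter(cyl == 4, mpg > 30)
print(nrow(efficient))

# %in% keeps rows whose value matches any element of a set
four_or_six <- mtcars |> filter(cyl %in% c(4, 6))
print(nrow(four_or_six))

# distinct() on one column keeps one row per unique value
cyl_values <- mtcars |> distinct(cyl)
print(cyl_values)

# slice() selects rows by position
first_three <- mtcars |> slice(1:3)
print(first_three)
```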
Arrange Cases:
arrange(.data, ...): Orders rows based on specified variables.
Example: mtcars |> arrange(mpg)
Add & Manipulate Variables:
Extract Variables:
pull(.data, var): Extracts column values as a vector.
select(.data, ...): Extracts specific columns.
relocate(.data, ...): Moves columns.
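As a brief illustration of the three column-extraction verbs, the sketch below again uses mtcars and assumes dplyr is installed; the result names are illustrative.

```r
library(dplyr)

# pull() returns the column as a plain vector, not a data frame
mpg_values <- mtcars |> pull(mpg)
print(head(mpg_values))

# select() keeps only the named columns, in the order given
small <- mtcars |> select(mpg, cyl, hp)
print(names(small))      # "mpg" "cyl" "hp"

# relocate() moves columns; by default they go to the front
reordered <- small |> relocate(hp)
print(names(reordered))  # "hp" "mpg" "cyl"
```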
Manipulate Multiple Variables:
across(.cols, .fns): Applies the same function(s) to multiple columns at once.
c_across(.cols): Combines values from multiple columns within each row; used with rowwise() for row-wise computations.
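The difference between the two helpers is easiest to see side by side: across() works down columns, while c_across() works along a row. A minimal sketch, assuming dplyr is installed; the column name total is hypothetical.

```r
library(dplyr)

# across(): apply mean() to several columns in one summarize() call
col_means <- mtcars |> summarize(across(c(mpg, hp), mean))
print(col_means)

# c_across(): inside rowwise(), combine several columns of the same row
row_totals <- mtcars |>
  rowwise() |>
  mutate(total = sum(c_across(c(gear, carb)))) |>
  ungroup()
print(head(row_totals$total))
```

Calling ungroup() at the end removes the row-wise grouping so that later verbs operate on the whole table again.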
Make New Variables:
mutate(.data, ...): Computes new columns.
rename(.data, ...): Renames columns.
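For example, mutate() and rename() can be combined in one pipeline. This sketch assumes dplyr is installed; the new column power_to_weight and the new name horsepower are chosen for illustration.

```r
library(dplyr)

cars2 <- mtcars |>
  mutate(power_to_weight = hp / wt) |>  # new column computed from two existing ones
  rename(horsepower = hp)               # rename uses new_name = old_name

print(head(cars2[, c("horsepower", "wt", "power_to_weight")]))
```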
Summary Functions:
summarize(): Computes summary statistics.
Common functions like mean(), median(), n(), sd(), min(), max(), n_distinct() are used.
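Several of these summary functions can be computed together, one row per group. A sketch on mtcars, assuming dplyr is installed; the output column names are illustrative.

```r
library(dplyr)

by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(
    n          = n(),              # number of rows in the group
    avg_mpg    = mean(mpg),
    median_mpg = median(mpg),
    sd_mpg     = sd(mpg),
    gear_types = n_distinct(gear)  # number of different gear values
  )

print(by_cyl)
```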
Joins & Set Operations:
Relational Data:
left_join(), right_join(), inner_join(), full_join()
semi_join(), anti_join(), nest_join()
Set Operations:
intersect(), setdiff(), union()
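The joins and set operations can be compared on two small tables. The data frames employees, languages, a, and b below are hypothetical and exist only for this sketch, which assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical tables sharing the key column "id"
employees <- data.frame(id = c(1, 2, 3), name = c("Amiya", "Raj", "Asish"))
languages <- data.frame(id = c(2, 3, 4), language = c("Python", "Java", "R"))

inner <- inner_join(employees, languages, by = "id")  # only ids found in both
left  <- left_join(employees, languages, by = "id")   # all employees, NA where no match
full  <- full_join(employees, languages, by = "id")   # every id from either table
anti  <- anti_join(employees, languages, by = "id")   # employees with no match

print(inner)
print(left)
print(full)
print(anti)

# Set operations compare whole rows of tables with the same columns
a <- data.frame(x = c(1, 2, 3))
b <- data.frame(x = c(2, 3, 4))
print(intersect(a, b))  # rows in both
print(union(a, b))      # rows in either, without duplicates
print(setdiff(a, b))    # rows in a but not in b
```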
dplyr
The dplyr package in the R programming language is a grammar of data manipulation that provides a consistent set of verbs for solving the most common data manipulation problems.
Installation & Loading:
Install: To use dplyr, install it on its own with install.packages("dplyr"), or as part of the tidyverse with install.packages("tidyverse").
Load: After installation, load the dplyr package with library(dplyr).
Transforming Data:
Column Operations:
select(): Choose specific columns in a data frame.
rename(): Rename existing columns.
Creating New Columns:
mutate(): Introduce new columns based on existing data.
add_count(): Add a column with counts for each unique value in another column.
Row Operations:
filter(): Extract rows based on specified conditions.
slice(): Select rows by their index.
arrange(): Reorder rows based on column values.
distinct(): Remove duplicate rows.
Aggregating Data:
summarise(): Compute summary statistics for data.
group_by(): Group data based on specific variables for subsequent summary operations.
Joining Tables:
inner_join(): Combine records from two tables, retaining only matching records.
left_join(): Keep every record from the left table and add matching records from the right table; unmatched rows get NA in the right table's columns.
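The verbs in this section are designed to be chained. The sketch below strings several of them into one pipeline on mtcars; it assumes dplyr is installed, and the threshold of 100 horsepower and the column name hp_per_cyl are arbitrary choices for illustration.

```r
library(dplyr)

result <- mtcars |>
  select(mpg, cyl, hp) |>           # keep three columns
  filter(hp > 100) |>               # keep the more powerful cars
  mutate(hp_per_cyl = hp / cyl) |>  # derive a new column
  add_count(cyl) |>                 # add column n: rows sharing the same cyl
  arrange(desc(mpg))                # highest mpg first

print(head(result))
```

Each step receives the output of the previous one, so the pipeline reads top to bottom in the order the operations are applied.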
Data Visualization
Data visualization is the technique of conveying insights from data using visual cues such as graphs, charts, and maps. It makes large quantities of data easier to understand intuitively and thereby supports better decisions.
R – Charts and Graphs
R – Line Graphs
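# Setup for every plot in this section: load ggplot2 and use the built-in mtcars data
library(ggplot2)
data <- mtcars
ggplot(data, aes(x = cyl, y = mpg)) +
  geom_line() +
  ggtitle("Line Plot")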
R – Bar Charts
ggplot(data, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar() +
  ggtitle("Bar Plot") +
  theme_minimal()
R – Scatter plots
ggplot(data, aes(x = cyl, y = mpg)) +
  geom_point() +
  ggtitle("Scatter Plot")
R – Histograms
ggplot(data, aes(x = mpg)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black", alpha = 0.7) +
  ggtitle("Histogram") +
  xlab("Miles per Gallon (mpg)") +
  ylab("Frequency") +
  theme_minimal()
R – Boxplots
ggplot(data, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "skyblue", color = "darkblue") +
  ggtitle("Boxplot") +
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon (mpg)") +
  theme_minimal()
R – Density Plots
den <- density(mtcars$mpg)
plot(den, frame = FALSE, col = "blue" , main = "Density plot")
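The density plot above uses base R graphics, unlike the other plots in this section. A ggplot2 version might look like the sketch below, which assumes ggplot2 is installed; the colors and the variable name p are illustrative.

```r
library(ggplot2)

# geom_density() estimates and draws the density curve in one step
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "lightblue", color = "blue", alpha = 0.5) +
  ggtitle("Density Plot") +
  theme_minimal()

print(p)
```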
R – Violin Plots
ggplot(data, aes(x = factor(cyl), y = mpg)) +
  geom_violin(fill = "orchid", color = "purple") +
  ggtitle("Violin Plot") +
  xlab("Number of Cylinders") +
  ylab("Miles per Gallon (mpg)") +
  theme_minimal()
R – Scatterplot with a Regression Line
ggplot(data, aes(x = mpg, y = cyl)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Scatter Plot with Regression Line") +
  theme_minimal()