R Data Reshaping
### Merging Data Frames
R uses the **merge()** function to combine data frames.
The syntax for the merge() function is as follows:
# S3 method
merge(x, y, ...)

# S3 method for data.frame
merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x", ".y"), no.dups = TRUE,
      incomparables = NULL, ...)
Common parameters explained:
* x, y: Data frames.
* by, by.x, by.y: Specify the column names to match in the two data frames. By default, it uses column names that are identical in both data frames.
* all: Logical value; `all = L` is shorthand for `all.x = L` and `all.y = L`, where `L` can be TRUE or FALSE.
* all.x: Logical value, default is FALSE. If TRUE, returns all rows from `x`, even if there is no matching row in `y`. Rows in `x` with no match in `y` will have NA in the columns from `y`.
* all.y: Logical value, default is FALSE. If TRUE, returns all rows from `y`, even if there is no matching row in `x`. Rows in `y` with no match in `x` will have NA in the columns from `x`.
* sort: Logical value, default is TRUE. Whether to sort the result by the `by` columns.
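The following is a minimal sketch of `by.x`, `by.y`, and `suffixes`, for the case where the key columns have different names in the two data frames. The data frames `sites` and `visits` and their column names are hypothetical and only serve as an illustration.

```r
# Hypothetical example data (not from the tutorial).
sites  <- data.frame(SiteId = c(1, 2, 3),
                     Name   = c("A", "B", "C"))
visits <- data.frame(Id   = c(2, 3, 4),
                     Name = c("b-visit", "c-visit", "d-visit"))

# The key is called SiteId in x and Id in y, so by.x and by.y are needed.
# Both frames also contain a non-key column named "Name"; suffixes controls
# how the two copies are renamed in the result.
merged <- merge(sites, visits,
                by.x = "SiteId", by.y = "Id",
                suffixes = c(".site", ".visit"))
print(merged)
```

With the default `all = FALSE`, only the keys present in both frames (2 and 3) should remain, and the key column keeps its name from `x`.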
The merge() function is very similar to SQL's JOIN functionality:
* **Natural join or INNER JOIN** (the default, `all = FALSE`): Returns only the rows whose key matches in both tables.
* **Left outer join or LEFT JOIN** (`all.x = TRUE`): Returns all rows from the left table, plus the matched rows from the right table. Columns from the right side are NA where there is no match.
* **Right outer join or RIGHT JOIN** (`all.y = TRUE`): Returns all rows from the right table, plus the matched rows from the left table. Columns from the left side are NA where there is no match.
* **Full outer join or FULL JOIN** (`all = TRUE`): Returns all rows from both tables. Columns from the missing side are NA where there is no match.
## Example
# data frame 1
df1 <- data.frame(SiteId = 1:6, Site = c("Google", "", "Taobao", "Facebook", "Zhihu", "Weibo"))
# data frame 2
df2 <- data.frame(SiteId = c(2, 4, 6, 7, 8), Country = c("CN", "USA", "CN", "USA", "IN"))
# Each result is stored under a new name so that df1 and df2 are not overwritten
# INNER JOIN
df_inner <- merge(x = df1, y = df2, by = "SiteId")
print("----- INNER JOIN -----")
print(df_inner)
# FULL JOIN
df_full <- merge(x = df1, y = df2, by = "SiteId", all = TRUE)
print("----- FULL JOIN -----")
print(df_full)
# LEFT JOIN
df_left <- merge(x = df1, y = df2, by = "SiteId", all.x = TRUE)
print("----- LEFT JOIN -----")
print(df_left)
# RIGHT JOIN
df_right <- merge(x = df1, y = df2, by = "SiteId", all.y = TRUE)
print("----- RIGHT JOIN -----")
print(df_right)
Executing the above code produces the following output:
[1] "----- INNER JOIN -----"
  SiteId     Site Country
1      2               CN
2      4 Facebook     USA
3      6    Weibo      CN
[1] "----- FULL JOIN -----"
  SiteId     Site Country
1      1   Google    <NA>
2      2               CN
3      3   Taobao    <NA>
4      4 Facebook     USA
5      5    Zhihu    <NA>
6      6    Weibo      CN
7      7     <NA>     USA
8      8     <NA>      IN
[1] "----- LEFT JOIN -----"
  SiteId     Site Country
1      1   Google    <NA>
2      2               CN
3      3   Taobao    <NA>
4      4 Facebook     USA
5      5    Zhihu    <NA>
6      6    Weibo      CN
[1] "----- RIGHT JOIN -----"
  SiteId     Site Country
1      2               CN
2      4 Facebook     USA
3      6    Weibo      CN
4      7     <NA>     USA
5      8     <NA>      IN
### Data Reshaping (Melt and Cast)
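R reshapes data with the **melt()** and **cast()** functions. They are not part of base R: the reshape package provides melt() and cast(), and its successor reshape2 provides melt() together with dcast() and acast().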
* melt(): Converts data from wide format to long format.
* cast(): Converts data from long format to wide format.
melt() and cast() work in opposite directions: melt() takes wide data to long format, and cast() takes long data back to wide format (detailed examples are provided later).
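melt() stacks the measured columns of the dataset into a single value column and records each value's source column in a separate variable column. The syntax of the generic function in reshape2 is: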
melt(data, ..., na.rm = FALSE, value.name = "value")
Parameter explanation:
* data: The dataset.
* ...: Other arguments passed to or from other methods.
* na.rm: Whether to remove NA values from the dataset.
* value.name: The name of the variable used to store the values.
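As a minimal sketch of `na.rm` and `value.name`, the example below melts a small data frame that contains one missing value. It assumes the reshape2 package is installed; the data frame `wide` and the column names `measure` and `score` are hypothetical. The call is written as `reshape2::melt()` so that it does not depend on which of reshape and reshape2 was loaded last.

```r
library(reshape2)

# Hypothetical wide data with one missing measurement.
wide <- data.frame(id = c(1, 2),
                   x1 = c(5, NA),
                   x2 = c(6, 1))

# id.vars names the identifier column(s); every other column is melted.
# variable.name / value.name rename the two generated columns,
# and na.rm = TRUE drops the row whose value is NA.
long <- reshape2::melt(wide,
                       id.vars = "id",
                       variable.name = "measure",
                       value.name = "score",
                       na.rm = TRUE)
print(long)
```

With `na.rm = FALSE` (the default) the result would keep all four rows, including the one whose `score` is NA.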
Before proceeding with the following operations, let's install the required packages:
# Install packages. MASS contains many statistical functions, tools, and datasets.
install.packages("MASS", repos = "https://mirrors.ustc.edu.cn/CRAN/")
# The melt() and cast() functions require the reshape2 and reshape packages.
install.packages("reshape2", repos = "https://mirrors.ustc.edu.cn/CRAN/")
install.packages("reshape", repos = "https://mirrors.ustc.edu.cn/CRAN/")
Test example:
## Example
# Load libraries
library(MASS)
library(reshape2)
library(reshape)
# Create a data frame
id <- c(1, 1, 2, 2)
time <- c(1, 2, 1, 2)
x1 <- c(5, 3, 6, 2)
x2 <- c(6, 5, 1, 4)
mydata <- data.frame(id, time, x1, x2)
# Original data frame
cat("Original data frame:\n")
print(mydata)
# Melt the data
md <- melt(mydata, id = c("id","time"))
cat("\nAfter melting:\n")
print(md)
Executing the above code produces the following output:
Original data frame:
  id time x1 x2
1  1    1  5  6
2  1    2  3  5
3  2    1  6  1
4  2    2  2  4

After melting:
  id time variable value
1  1    1       x1     5
2  1    2       x1     3
3  2    1       x1     6
4  2    2       x1     2
5  1    1       x2     6
6  1    2       x2     5
7  2    1       x2     1
8  2    2       x2     4
The cast functions reshape a melted data frame back into wide format. In the reshape package this is cast(); in reshape2 it is split into dcast(), which returns a data frame, and acast(), which returns a vector, matrix, or array.
The syntax of dcast() and acast() in reshape2 is:
dcast(data, formula, fun.aggregate = NULL, ...,
      margins = NULL, subset = NULL, fill = NULL, drop = TRUE,
      value.var = guess_value(data))
acast(data, formula, fun.aggregate = NULL, ...,
      margins = NULL, subset = NULL, fill = NULL, drop = TRUE,
      value.var = guess_value(data))
Parameter explanation:
* data: The melted data frame.
* formula: The formula for reshaping the data, similar to `x ~ y` format, where `x` is the row label and `y` is the column label.
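* fun.aggregate: An aggregation function (such as mean) applied to the value column. It is needed when the formula does not identify a single observation for each output cell.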
* margins: A vector of variable names (can include "grand_col" and "grand_row") for calculating margins. Set to TRUE to calculate all margins.
* subset: Conditional filtering on the result, format similar to **subset = .(variable=="length")**.
* drop: Whether missing combinations should be dropped (TRUE, the default) or kept (FALSE).
* value.var: The name of the column containing the values to be processed.
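The parameters above can be sketched with a small example that contrasts dcast() and acast(). It assumes the reshape2 package is installed; the data frame `long` is hypothetical and is built by hand rather than with melt(), so the snippet stands on its own.

```r
library(reshape2)

# Hypothetical long-format data: two measurements (x1, x2) per id and time.
long <- data.frame(id       = c(1, 1, 2, 2, 1, 1, 2, 2),
                   time     = c(1, 2, 1, 2, 1, 2, 1, 2),
                   variable = rep(c("x1", "x2"), each = 4),
                   value    = c(5, 3, 6, 2, 6, 5, 1, 4))

# id ~ variable leaves two values per output cell (one per time),
# so an aggregation function is required; here the mean is used.
wide_df  <- dcast(long, id ~ variable, fun.aggregate = mean, value.var = "value")
wide_mat <- acast(long, id ~ variable, fun.aggregate = mean, value.var = "value")

print(wide_df)          # a data frame with columns id, x1, x2
print(wide_mat)         # a numeric matrix with ids as row names
print(class(wide_df))
print(is.matrix(wide_mat))
```

Passing `value.var` explicitly avoids relying on `guess_value()`, which otherwise picks the value column and prints a message about its choice.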
## Example
# Load libraries
library(MASS)
library(reshape2)
library(reshape)
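# Create a data frame
id <- c(1, 1, 2, 2)
time <- c(1, 2, 1, 2)
x1 <- c(5, 3, 6, 2)
x2 <- c(6, 5, 1, 4)
mydata <- data.frame(id, time, x1, x2)
# Melt the data
md <- melt(mydata, id = c("id", "time"))
# Print the recast dataset using the cast() function from the reshape package
# Mean of x1 and x2 for each id
cast.data <- cast(md, id ~ variable, mean)
print(cast.data)
cat("\n")
# Mean of x1 and x2 for each time
time.cast <- cast(md, time ~ variable, mean)
print(time.cast)
cat("\n")
# Mean over x1 and x2 for each id and time
id.time <- cast(md, id ~ time, mean)
print(id.time)
cat("\n")
# One row per id and time: recovers the original layout
id.time.cast <- cast(md, id + time ~ variable)
print(id.time.cast)
cat("\n")
# One row per id and variable, one column per time
id.variable.time <- cast(md, id + variable ~ time)
print(id.variable.time)
cat("\n")
# One row per id, one column per variable and time combination
id.variable.time2 <- cast(md, id ~ variable + time)
print(id.variable.time2)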
Executing the above code produces the following output:
  id x1  x2
1  1  4 5.5
2  2  4 2.5

  time  x1  x2
1    1 5.5 3.5
2    2 2.5 4.5

  id   1 2
1  1 5.5 4
2  2 3.5 3

  id time x1 x2
1  1    1  5  6
2  1    2  3  5
3  2    1  6  1
4  2    2  2  4

  id variable 1 2
1  1       x1 5 3
2  1       x2 6 5
3  2       x1 6 2
4  2       x2 1 4

  id x1_1 x1_2 x2_1 x2_2
1  1    5    3    6    5
2  2    6    2    1    4
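As a quick check that casting undoes melting, the sketch below performs a round trip using only reshape2. It assumes reshape2 is installed and uses `reshape2::melt()` together with dcast() instead of the cast() function from the reshape package used above; the name `back` is chosen here for illustration.

```r
library(reshape2)

# Same data as in the example above.
mydata <- data.frame(id   = c(1, 1, 2, 2),
                     time = c(1, 2, 1, 2),
                     x1   = c(5, 3, 6, 2),
                     x2   = c(6, 5, 1, 4))

# Wide -> long: id and time identify each observation.
md <- reshape2::melt(mydata, id.vars = c("id", "time"))

# Long -> wide: each (id, time) pair has exactly one value per variable,
# so no aggregation function is needed.
back <- dcast(md, id + time ~ variable, value.var = "value")

# Compare the values cell by cell with the original data frame.
same <- all(as.matrix(back) == as.matrix(mydata))
print(same)
```

Because the formula `id + time ~ variable` identifies exactly one observation per cell, the recast data should hold the same values as `mydata`.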