R Data Reshaping

### Merging Data Frames

R uses the **merge()** function to combine data frames. Its syntax is:

    # S3 method
    merge(x, y, ...)

    # S3 method for data.frame
    merge(x, y, by = intersect(names(x), names(y)),
          by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
          sort = TRUE, suffixes = c(".x", ".y"), no.dups = TRUE,
          incomparables = NULL, ...)

Common parameters explained:

* x, y: The data frames to merge.
* by, by.x, by.y: The column names to match on in the two data frames. By default, merge() uses the column names that appear in both data frames.
* all: Logical value; `all = L` is shorthand for `all.x = L` and `all.y = L`, where `L` is TRUE or FALSE.
* all.x: Logical value, default FALSE. If TRUE, the result keeps every row of `x`, even those with no matching row in `y`; such rows have NA in the columns that come from `y`.
* all.y: Logical value, default FALSE. If TRUE, the result keeps every row of `y`, even those with no matching row in `x`; such rows have NA in the columns that come from `x`.
* sort: Logical value; whether to sort the result on the `by` columns.

The merge() function works much like a SQL JOIN:

* **Natural join or INNER JOIN**: Returns only the rows that have a match in both tables.
* **Left outer join or LEFT JOIN**: Returns all rows from the left table plus the matching rows from the right table. Columns from the right table are NA where there is no match.
* **Right outer join or RIGHT JOIN**: Returns all rows from the right table plus the matching rows from the left table. Columns from the left table are NA where there is no match.
* **Full outer join or FULL JOIN**: Returns all rows from both tables. Columns from the side without a match are NA.
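The parameter list above mentions `by.x`, `by.y`, and `suffixes`. The following is a minimal sketch of how they behave, assuming two small made-up data frames (`sites` and `visits` are hypothetical names used only for illustration) whose key columns are named differently and which share a non-key column `Rank`:

```r
# Hypothetical data: the key column has a different name in each data frame.
sites <- data.frame(SiteId = c(1, 2, 3),
                    Name   = c("Google", "Taobao", "Weibo"),
                    Rank   = c(1, 5, 9))
visits <- data.frame(Id   = c(2, 3, 4),
                     Rank = c(20, 30, 40))

# by.x / by.y name the key column in x and in y respectively.
# "Rank" exists in both inputs but is not a key, so merge() keeps both
# copies and appends the given suffixes to tell them apart.
merged <- merge(sites, visits,
                by.x = "SiteId", by.y = "Id",
                suffixes = c(".site", ".visit"))
print(merged)
```

Because `all` is left at its default of FALSE, this is an inner join: only SiteId 2 and 3 appear in the result, the key column takes its name from `x`, and the two `Rank` columns become `Rank.site` and `Rank.visit`.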
## Example

    # Data frame 1
    df1 <- data.frame(SiteId = c(1:6),
                      Site = c("Google", "", "Taobao", "Facebook", "Zhihu", "Weibo"))
    # Data frame 2
    df2 <- data.frame(SiteId = c(2, 4, 6, 7, 8),
                      Country = c("CN", "USA", "CN", "USA", "IN"))

    # Each result gets its own name so that df1 and df2 are never overwritten.
    # INNER JOIN
    inner <- merge(x = df1, y = df2, by = "SiteId")
    print("----- INNER JOIN -----")
    print(inner)

    # FULL JOIN
    full <- merge(x = df1, y = df2, by = "SiteId", all = TRUE)
    print("----- FULL JOIN -----")
    print(full)

    # LEFT JOIN
    left <- merge(x = df1, y = df2, by = "SiteId", all.x = TRUE)
    print("----- LEFT JOIN -----")
    print(left)

    # RIGHT JOIN
    right <- merge(x = df1, y = df2, by = "SiteId", all.y = TRUE)
    print("----- RIGHT JOIN -----")
    print(right)

Executing the above code produces the following output:

    [1] "----- INNER JOIN -----"
      SiteId     Site Country
    1      2               CN
    2      4 Facebook     USA
    3      6    Weibo      CN
    [1] "----- FULL JOIN -----"
      SiteId     Site Country
    1      1   Google    <NA>
    2      2               CN
    3      3   Taobao    <NA>
    4      4 Facebook     USA
    5      5    Zhihu    <NA>
    6      6    Weibo      CN
    7      7     <NA>     USA
    8      8     <NA>      IN
    [1] "----- LEFT JOIN -----"
      SiteId     Site Country
    1      1   Google    <NA>
    2      2               CN
    3      3   Taobao    <NA>
    4      4 Facebook     USA
    5      5    Zhihu    <NA>
    6      6    Weibo      CN
    [1] "----- RIGHT JOIN -----"
      SiteId     Site Country
    1      2               CN
    2      4 Facebook     USA
    3      6    Weibo      CN
    4      7     <NA>     USA
    5      8     <NA>      IN

### Data Reshaping (Melt and Cast)

R reshapes data with the **melt()** and **cast()** functions, which come from the reshape2 and reshape packages.

* melt(): Converts data from wide format to long format.
* cast(): Converts data from long format to wide format.

[Figure: melt() and cast() converting a dataset between wide and long format; detailed examples follow later.]

melt() stacks the measured columns of the dataset into a single column. The function syntax is:

    melt(data, ..., na.rm = FALSE, value.name = "value")

Parameter explanation:

* data: The dataset.
* ...: Other arguments passed to or from other methods.
* na.rm: Whether to remove NA values from the dataset.
* value.name: The name of the variable used to store the values.

Before proceeding with the following operations, install the required packages:

    # Install packages.
    # MASS contains many statistical functions, tools, and datasets.
    install.packages("MASS", repos = "https://mirrors.ustc.edu.cn/CRAN/")

    # The melt() and cast() functions require the reshape2 and reshape packages.
    install.packages("reshape2", repos = "https://mirrors.ustc.edu.cn/CRAN/")
    install.packages("reshape", repos = "https://mirrors.ustc.edu.cn/CRAN/")

## Example

    # Load libraries
    library(MASS)
    library(reshape2)
    library(reshape)

    # Create a data frame
    id <- c(1, 1, 2, 2)
    time <- c(1, 2, 1, 2)
    x1 <- c(5, 3, 6, 2)
    x2 <- c(6, 5, 1, 4)
    mydata <- data.frame(id, time, x1, x2)

    # Original data frame
    cat("Original data frame:\n")
    print(mydata)

    # Melt the data, keeping id and time as identifier columns
    md <- melt(mydata, id = c("id", "time"))
    cat("\nAfter melting:\n")
    print(md)

Executing the above code produces the following output:

    Original data frame:
      id time x1 x2
    1  1    1  5  6
    2  1    2  3  5
    3  2    1  6  1
    4  2    2  2  4

    After melting:
      id time variable value
    1  1    1       x1     5
    2  1    2       x1     3
    3  2    1       x1     6
    4  2    2       x1     2
    5  1    1       x2     6
    6  1    2       x2     5
    7  2    1       x2     1
    8  2    2       x2     4

The cast functions reshape a melted data frame back into wide format. In reshape2, dcast() returns a data frame, while acast() returns a vector, matrix, or array; the reshape package provides cast(). The syntax of dcast() and acast() is:

    dcast(data, formula, fun.aggregate = NULL, ..., margins = NULL,
          subset = NULL, fill = NULL, drop = TRUE,
          value.var = guess_value(data))

    acast(data, formula, fun.aggregate = NULL, ..., margins = NULL,
          subset = NULL, fill = NULL, drop = TRUE,
          value.var = guess_value(data))

Parameter explanation:

* data: The melted data frame.
* formula: The formula for reshaping the data, in a format like `x ~ y`, where `x` gives the row labels and `y` gives the column labels.
* fun.aggregate: An aggregation function applied to the value column when several values fall into the same cell.
* margins: A vector of variable names (can include "grand_col" and "grand_row") for calculating margins. Set to TRUE to calculate all margins.
* subset: A quoted expression used to filter the data before reshaping, in a format like **subset = .(variable=="length")**.
* drop: Whether missing combinations are dropped (TRUE, the default) or kept.
* value.var: The name of the column containing the values to be processed.

## Example

    # Load libraries
    library(MASS)
    library(reshape2)
    library(reshape)

    # Create a data frame
    id <- c(1, 1, 2, 2)
    time <- c(1, 2, 1, 2)
    x1 <- c(5, 3, 6, 2)
    x2 <- c(6, 5, 1, 4)
    mydata <- data.frame(id, time, x1, x2)

    # Melt the data
    md <- melt(mydata, id = c("id", "time"))

    # Recast the melted data with cast() from the reshape package
    # Mean of each variable per id
    cast.data <- cast(md, id ~ variable, mean)
    print(cast.data)
    cat("\n")

    # Mean of each variable per time
    time.cast <- cast(md, time ~ variable, mean)
    print(time.cast)
    cat("\n")

    # Mean over both variables for each id/time pair
    id.time <- cast(md, id ~ time, mean)
    print(id.time)
    cat("\n")

    # No aggregation needed: each cell holds exactly one value
    id.time.cast <- cast(md, id + time ~ variable)
    print(id.time.cast)
    cat("\n")

    id.variable.time <- cast(md, id + variable ~ time)
    print(id.variable.time)
    cat("\n")

    id.variable.time2 <- cast(md, id ~ variable + time)
    print(id.variable.time2)

Executing the above code produces the following output:

      id x1  x2
    1  1  4 5.5
    2  2  4 2.5

      time  x1  x2
    1    1 5.5 3.5
    2    2 2.5 4.5

      id   1 2
    1  1 5.5 4
    2  2 3.5 3

      id time x1 x2
    1  1    1  5  6
    2  1    2  3  5
    3  2    1  6  1
    4  2    2  2  4

      id variable 1 2
    1  1       x1 5 3
    2  1       x2 6 5
    3  2       x1 6 2
    4  2       x2 1 4

      id x1_1 x1_2 x2_1 x2_2
    1  1    5    3    6    5
    2  2    6    2    1    4
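The example above uses cast() from the reshape package, while the syntax section describes dcast() and acast() from reshape2. As a rough sketch of how those two differ in return type, assuming reshape2 is installed and reusing the same small data frame, one might write:

```r
library(reshape2)  # assumed to be installed, as in the installation step

mydata <- data.frame(id   = c(1, 1, 2, 2),
                     time = c(1, 2, 1, 2),
                     x1   = c(5, 3, 6, 2),
                     x2   = c(6, 5, 1, 4))

# Call reshape2::melt explicitly in case the reshape package is also
# loaded and masks melt().
md <- reshape2::melt(mydata, id.vars = c("id", "time"))

# dcast() returns a data frame: one row per id, mean of each variable.
wide_df <- dcast(md, id ~ variable, fun.aggregate = mean, value.var = "value")
print(wide_df)

# acast() returns a matrix whose row names are the id values.
wide_mat <- acast(md, id ~ variable, fun.aggregate = mean, value.var = "value")
print(wide_mat)
```

Both calls compute the same per-id means as `cast(md, id ~ variable, mean)`. The difference is the container: `wide_df` keeps `id` as an ordinary column, whereas `wide_mat` moves it into the row names, which can be convenient when the result feeds matrix-based functions.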
