Joining Data
This chapter will teach you how to integrate data from two data tables into a single data table.
There are numerous ways to perform this operation and a multitude of packages to simplify the process.
We will demonstrate data.table
and dplyr
.
base R: The merge()
Function
In this chapter, you will learn how to use the merge()
function to perform inner, full, left, and right joins.
This function is a built-in R function, but is upgraded in the data.table
package.
We will go through the different types of joins step by step using the following two data tables:
name sex age
1: Joe J. m 34
2: Laura M. f 56
3: Isobel P. f 23
4: Nina W. f 18
name id
1: Laura M. 1
2: Nina W. 2
3: Joe J. 3
4: Max C. 4
Inner Join
- An inner join only keeps observations with information in both data tables.
- By default, the
merge()
function does an inner join. - It requires two data tables as inputs: one for the
x
argument and one for they
arguments, together with the names of the key columns in each data table (by.x
andby.y
arguments).
merge(x = patients, y = patients_id, by.x = "name", by.y = "name")
output:
name sex age id
1: Joe J. m 34 3
2: Laura M. f 56 1
3: Nina W. f 18 2
To avoid entering the same column name repeatedly, use by
if the key columns in both data tables share the same name.
Full Join
- A full join contains all observations from both data tables.
- Observations found only in one data table will have missing values (=
NA
) in the columns of the other data table. - Set the argument
all
toTRUE
to perform a full join.
merge(x = patients, y = patients_id, by = "name", all = TRUE)
output:
name sex age id
1: Isobel P. f 23 NA
2: Joe J. m 34 3
3: Laura M. f 56 1
4: Max C. <NA> NA 4
5: Nina W. f 18 2
Left Join
- A left join keeps just data from the left data table (= set to the
x
argument) and adds data from the right data table (= set to they
argument) to the left one. - Set the argument
all.y
toTRUE
to perform a left join.
merge(x = patients, y = patients_id, by = "name", all.x = TRUE)
output:
name sex age id
1: Isobel P. f 23 NA
2: Joe J. m 34 3
3: Laura M. f 56 1
4: Nina W. f 18 2
Right Join
- A right join keeps just data from the right data table (= set to the
y
argument) and adds data from the left data table (= set to thex
argument) to the right one. - Set the argument
all.y
toTRUE
to perform a right join.
merge(x = patients, y = patients_id, by = "name", all.y = TRUE)
output:
name sex age id
1: Joe J. m 34 3
2: Laura M. f 56 1
3: Max C. <NA> NA 4
4: Nina W. f 18 2
Joining Data with data.table
Here we will demonstrate the data.table
package.
In order to use the data.table
syntax for joining data, you must first create table keys with the setkey()
function.
# Set the key to "name"
setkey(patients, "name")
setkey(patients_id, "name")
There are different ways to inspect existing table keys in your R session:
tables()
: This function lists all the data tables loaded in your R session, together with the number of their rows, the number and names of their columns, and how much memory they need. It also lists all the columns you have chosen as their keys.str(x)
: This function displays the type of each column in a single data table, as well as the types and first elements for each column. It will also show you the keys.key(x)
: This function shows you only the keys set in a data table.haskey(x)
: This function allows you to check whether a data table has a key or not. It returnsTRUE
orFALSE
.
x
= a data.table
name
To remove keys from a data table, use the argument NULL
within the setkey()
function as seen below:
# Remove keys from "patients"
setkey(patients, NULL)
The syntax for a join is X[Y]
.
The creator of data.table
, Matt Dowle, described the format as follows: "X[Y]
looks up X
rows using Y
as an index".
This means that matching rows from X
will be added to Y
.
Left Join
- For a left join, you have to put the left data table (=
patients
) inside the square brackets and the right data table (=patients_id
) outside. - This will add matching rows from
patients_id
topatients
.
patients_id[patients]
Right Join
- For a right join, you have to put the right data table (=
patients_id
) inside the square brackets and the left data table (=patients
) outside. - This will add matching rows from
patients
topatients_id
. - You could also see this as a left join, where
patients_id
is the left data table andpatients
is the right data table.
patients[patients_id]
Inner Join
- In order to perform an inner join, insert the argument
nomatch = 0
into the square brackets' syntax. - This filters out rows without matches.
- This time it does not matter which data table is inside and which is outside the brackets.
patients[patients_id, nomatch = 0]
# Same as
patients_id[patients, nomatch = 0]
Anti-Join
- An anti-join retains rows from the left data table and skips data that does not match the right data table.
In our example, this means that rows from
patients
, that do not matchpatients_id
, are displayed. - To perform an anti-join, enter the left data table outside the brackets and put an exclamation mark inside the square brackets, followed by the right data table.
patients[!patients_id]
output:
name id
1: Max C. 4
- You can also do this the other way round to display rows from
patients_id
that do not matchpatients
.
If you do not want to specify the keys in advance, you can simply use the following syntax:
X[Y, on = "key"]
In this way, the data tables are still connected via the column without having to set the key. Note, however, that this only works if both data tables have the same column name!
If they do not have the same column name, you can use c()
or .()
inside the on
argument:
X[Y, on = c("key1", "key2")]
# Same as
X[Y, on = .(key1, key2)]
Joining Data with dplyr
In this chapter, you will learn how to merge data with the dplyr
package, which is part of the Tidyverse.
Make sure you have installed and loaded the dplyr
package.
Join Functions
In the following, we will learn about the join functions of dplyr
and how they can be used.
We will use the data we used in the previous chapter, with the difference that we are now dealing with data frames and no longer with data tables.
- Mutating joins merge information from two different data sources:
- Inner joins
- Left joins
- Right joins
- Full joins
- Filtering joins keep information from the left data source and use the right data as filter:
- Semi-joins
- Anti-joins
inner_join()
- To perform an inner join with this function, we only need to specify the names of our two data frames and the column we want to combine over (with the
by
argument). - Another way to join data is to use the pipe operator (
%>%
). Enter the name of the first data frame followed by%>%
and theinner_join()
function including the second data frame and the key variable. - With an inner join, non-matching rows are excluded from the result.
inner_join(patients2, patients2_id, by = "name")
# Same as
patients2 %>% inner_join(patients2_id, by = "name")
left_join()
- Left joins can be done with the
left_join()
function. - All rows from the data frame that is first inserted into the function are retained. Matching rows from the second data frame are added to it.
left_join(patients2, patients2_id, by = "name")
# Same as
patients2 %>% left_join(patients2_id, by = "name")
right_join()
- Right joins can be done with the
right_join()
function. - All rows from the second/right data frame are retained while matching rows from the first/left data frame are added.
right_join(patients2, patients2_id, by = "name")
# Same as
patients2 %>% right_join(patients2_id, by = "name")
full_join()
- This function returns all rows and columns of the two input data frames and inserts
NA
for missing data.
full_join(patients2, patients2_id, by = "name")
# Same as
patients2 %>% full_join(patients2_id, by = "name")
semi_join()
- This function returns all rows from the first/left dataset with a match in the second/right dataset.
- Only columns of the left dataset are retained!
semi_join(patients2, patients2_id, by = "name")
# Same as
patients2 %>% semi_join(patients2_id, by = "name")
output:
name sex age
1 Joe J. m 34
2 Laura M. f 56
3 Nina W. f 18
anti_join()
- This function is the opposite to a semi-join.
- It returns all rows from the first/left dataset without a match in the second/right dataset.
- This means that only rows that are not present in the right dataset are returned and that only columns of the left data are retained!
anti_join(patients2, patients2_id, by = "name")
# Same as
patients2 %>% anti_join(patients2_id, by = "name")
output:
name sex age
1 Isobel P. f 23
Joining Multiple datasets
To demonstrate joins of multiple datasets, we add a third data frame:
name id occasion
1 Isobel P. 5 anaemia
2 Joe J. 3 cold
3 Max C. 4 abdominal pain
4 Laura M. 1 depression
In the following example, we perform a full join, but you can also use other functions for multiple joining.
First, we have to merge the first two data frames and enter the pipe operator (%>%
) afterwards.
Then we merge this newly created data frame, represented by the dot (.
), with the third.
full_join(patients2, patients2_id, by = "name") %>% full_join(., patients2_occ, by = "name")
output:
name sex age id.x id.y occasion
1 Isobel P. f 23 NA 5 anaemia
2 Joe J. m 34 3 3 cold
3 Laura M. f 56 1 1 depression
4 Nina W. f 18 2 NA <NA>
5 Max C. <NA> NA 4 4 abdominal pain
As you can see, the column id
has been duplicated because it exists in both data frames at the same time.
We will learn how to solve this problem in the next chapter.
Joining by Multiple Columns
If we want to combine two data frames based on several columns, we can select several joining variables for the by
argument at the same time:
full_join(patients2, patients2_id, by = "name") %>% full_join(., patients2_occ, by = c("name", "id"))
output:
name sex age id occasion
1 Isobel P. f 23 NA <NA>
2 Joe J. m 34 3 cold
3 Laura M. f 56 1 depression
4 Nina W. f 18 2 <NA>
5 Max C. <NA> NA 4 abdominal pain
6 Isobel P. <NA> NA 5 anaemia
As you can see, there is now only one id
column.
Note: The row Isobel P.
has been duplicated because it contained different values in the data frames.
Joining by Different Column Names
If you want to link a variable in the first dataset with another variable in the second data, you can equate them in the by
argument as follows:
# Rename "id" to "id_new"
patients2_occ <- patients2_occ %>% rename(id_new = id)
# Full join
patients2_occ %>% full_join(patients2_id, by = c("id_new" = "id"))
# Avoid duplicating the column "name"
patients2_occ %>% full_join(patients2_id, by = c("id_new" = "id", "name"))
output:
name id_new occasion
1 Isobel P. 5 anaemia
2 Joe J. 3 cold
3 Max C. 4 abdominal pain
4 Laura M. 1 depression
5 Nina W. 2 <NA>
Note that when joining via different column names, the columns must contain the same data type!
Example:
full_join(patients2, patients2_id, by = c("name" = "id"))
This would cause the following error:
Error in `full_join()`:
! Can't join `x$name` with `y$id` due to incompatible
types.
ℹ `x$name` is a <character>.
ℹ `y$id` is a <integer>.
Customizing Your Join
Suppose you want to create the following data frame:
patients2_occ %>% full_join(patients2_id, by = c("id_new" = "id"))
output:
name.x id_new occasion name.y
1 Isobel P. 5 anaemia <NA>
2 Joe J. 3 cold Joe J.
3 Max C. 4 abdominal pain Max C.
4 Laura M. 1 depression Laura M.
5 <NA> 2 <NA> Nina W.
As you can see, the output gives two name columns, name.x
and name.y
.
In some cases you might not want to merge these columns as seen in the previous chapter, e.g. if they contain different patient names.
In this case, you could easily customize your join for a more readable name by renaming the column names with a suffix
:
patients2_occ %>% full_join(patients2_id, by = c("id_new" = "id"), suffix = c("_new", "_old"))
output:
name_new id_new occasion name_old
1 Isobel P. 5 anaemia <NA>
2 Joe J. 3 cold Joe J.
3 Max C. 4 abdominal pain Max C.
4 Laura M. 1 depression Laura M.
5 <NA> 2 <NA> Nina W.
Joining Data Summary
base R
Join Type | merge() Function |
---|---|
Inner Join | merge(x = x, y = y, by = "xy") |
Full Join | merge(x = x, y = y, by = "xy", all = TRUE) |
Left Join | merge(x = x, y = y, by = "xy", all.x = TRUE) |
Right Join | merge(x = x, y = y, by = "xy", all.y = TRUE) |
Anti-Join | / |
Semi-Join | / |
data.table
Join Type | data.table Syntax |
---|---|
Inner Join | x[y, nomatch = 0] |
Full Join | / |
Left Join | y[x] |
Right Join | x[y] |
Anti-Join | x[!y] |
Semi-Join | / |
dplyr
Join Type | Join Functions | Pipe Operator Way |
---|---|---|
Inner Join | inner_join(x, y, by = "xy") | x %>% inner_join(y, by = "xy") |
Full Join | full_join(x, y, by = "xy") | x %>% full_join(y, by = "xy") |
Left Join | left_join(x, y, by = "xy") | x %>% left_join(y, by = "xy") |
Right Join | right_join(x, y, by = "xy") | x %>% right_join(y, by = "xy") |
Anti-Join | anti_join(x, y, by = "xy") | x %>% anti_join(y, by = "xy") |
Semi-Join | semi_join(x, y, by = "xy") | x %>% semi_join(y, by = "xy") |