R string split by pattern
November 24, 2021
When strings are not separated properly …
We may all have seen excel or csv files that stored all fields in one column. If the delimiter is consistent, in Excel, you can go to Data > Text to columns and use Convert Text to Column Wizard, specify the delimiter, and Excel will do the split for you. You can also use formula to do the split, as described in this article.
In this post, I’ll demonstrate how to do string split in a more customized way, including:
- split one string on pattern
- split one column into multiple columns, and add them in a data.frame
- split multiple columns by the same pattern, and add the new columns into a data.frame
First let’s make up some data:
t1 <- data.table::data.table(email_add = c('joe@abc.com', 'jane@email.com', NA))
print(t1)
## email_add
## 1: joe@abc.com
## 2: jane@email.com
## 3: <NA>
Assume that we want to split on @
sign and get the first part.
Split string
The most commonly used function is strsplit()
from baseR
. Another option would be stringr::str_split()
, which has very similar parameters as strsplit()
.
Here are some options to vertically extract the n-th element after string split:
use purrr::map()
splited_str <- strsplit(t1$email_add, "@")
res1 <- purrr::map(splited_str, 1L)
unlist(res1)
## [1] "joe" "jane" NA
The above formular splits column email_add
and returns a list, then uses purrr::map()
to get only the 1st part, which returns the 1st element of each nested list. However, note that this approach may not keep the rows that contain NA. In this case, when extracting the 1st part, NA is returned. But when extracting 2nd part, the last row will be omitted.
use vapply()
res2 <- vapply(splited_str, `[`, 1L, FUN.VALUE=character(1))
unlist(res2)
## [1] "joe" "jane" NA
The letter “v” indicates the apply function works vertically, ie: it applies to each row of the data table. vapply()
preserves all rows, so the NA will be reflected.
Split string and turn into multiple columns
To split string and turn one column into multiple columns, there are also different options.
method 1: use matrix()
matrix(unlist(splited_str), ncol = 2, byrow = TRUE)
## [,1] [,2]
## [1,] "joe" "abc.com"
## [2,] "jane" "email.com"
## [3,] NA "joe"
In matrix()
, the parameter ncol
allows you to specify how many columns you want to split into, and byrow
means to stack by rows. The problem with using matrix
is that it recycles row with NA, which will lead to wrong results.
method 2: use stringr::str_split_fixed()
stringr::str_split_fixed(t1$email_add, "@", 2L)
## [,1] [,2]
## [1,] "joe" "abc.com"
## [2,] "jane" "email.com"
## [3,] "" ""
Here the function fixed
represents that the number of column (which is the last parameter here) is pre-determined. This works best if you are sure how many columns will split into. Unlike matrix()
, function str_split_fixed()
correctly identified NA rows and returned blanks
method 3: use stringr::str_split()
stringr::str_split(t1$email_add, "@", n=2L, simplify = T)
## [,1] [,2]
## [1,] "joe" "abc.com"
## [2,] "jane" "email.com"
## [3,] NA ""
This gives very similar results as using str_split_fixed()
function. By specifying parameter simplify = T
, the output becomes a matrix.
method 4: use do.call(rbind)
do.call(rbind, splited_str)
## [,1] [,2]
## [1,] "joe" "abc.com"
## [2,] "jane" "email.com"
## [3,] NA NA
do.call
allows the function rbind
to run iteratively, which formed the matrix. In this case, the NA remains NA cells afterwards.
All the methods mentioned above turn the single column into a matrix. To convert a matrix to a dataframe, you can simply apply data.frame()
or data.table()
functions.
method 5: use data.table::tstrsplit()
t1[, c('user', 'domain') := data.table::tstrsplit(email_add, "@")]
print(t1)
## email_add user domain
## 1: joe@abc.com joe abc.com
## 2: jane@email.com jane email.com
## 3: <NA> <NA> <NA>
If you use the package data.table
, there is a handy function that allows you to do the column split and name the new columns all in place. NA remains NA cells after the split.
method 6: use reshape2::colsplit()
reshape2::colsplit(t1$email_add, "@", c("user", "domain"))
## user domain
## 1 joe abc.com
## 2 jane email.com
## 3
reshape2
package has a very specific function called colsplit()
that does the column split task, just like the name suggested. This approach turns the single column into new dataframe, instead of modifying the original dataframe. NA remains NA.
method 7: use vapply()
on a data.table
object
t1[, user:= vapply(strsplit(t1$email_add, "@"), `[`, 1, FUN.VALUE = character(1))]
print(t1)
## email_add user
## 1: joe@abc.com joe
## 2: jane@email.com jane
## 3: <NA> <NA>
If you rather just add one column on the original dataframe, you can use vapply()
, but make sure the dataframe has been converted to data.table in order to allow in-place change.
Split multiple columns
When we need to apply the same function on multiple columns, using lapply()
in conjunction with data.table
object allows in-place modification.
Let’s make up some data for this scenario:
t2 <- data.table::data.table(state_info = c('AL_Alabama', 'AK_Alaska', NA),
state_cap = c('Juneau_1990','Montgomery_1819', NA),
country = c('USA', 'USA', NA))
print(t2)
## state_info state_cap country
## 1: AL_Alabama Juneau_1990 USA
## 2: AK_Alaska Montgomery_1819 USA
## 3: <NA> <NA> <NA>
Suppose we want to split column “state_info” and “state_cap”, and keep the 1st part from both columns:
sp_cols <- c('state_info', 'state_cap')
new_cols <- c('state_abbr', 'state_cap_city')
t2[, (new_cols) := lapply(.SD, function(x) stringr::str_split(x, "_", n = 2L, simplify = TRUE)[, 1L]), .SDcols = sp_cols]
print(t2)
## state_info state_cap country state_abbr state_cap_city
## 1: AL_Alabama Juneau_1990 USA AL Juneau
## 2: AK_Alaska Montgomery_1819 USA AK Montgomery
## 3: <NA> <NA> <NA> <NA> <NA>
In this formula, [, 1L]
means to get the 1st part after split. If you want to replace the original columns with new values, instead of adding new columns, simply replace (new_cols)
with (sp_cols)
.
Summary
For splitting one column into multiple columns:
- If the number of columns is known, using method 1, 2, or 3 all work. However,
strsplit()
does not retain NAs when picking parts other than the 1st part. The two functions fromstringr
package avoid this issue - If the number of splitted columns is unknown, method 4, 5, 6, 7 do the job. But applying functions in
data.table
allows in-place modification.