R string split by pattern

November 24, 2021


When strings are not separated properly …

We may all have seen excel or csv files that stored all fields in one column. If the delimiter is consistent, in Excel, you can go to Data > Text to columns and use Convert Text to Column Wizard, specify the delimiter, and Excel will do the split for you. You can also use formula to do the split, as described in this article.

In this post, I’ll demonstrate how to do string split in a more customized way, including:

  • split one string on pattern
  • split one column into multiple columns, and add them in a data.frame
  • split multiple columns by the same pattern, and add the new columns into a data.frame

First let’s make up some data:

t1 <- data.table::data.table(email_add = c('joe@abc.com', 'jane@email.com', NA))
print(t1)
##         email_add
## 1:    joe@abc.com
## 2: jane@email.com
## 3:           <NA>

Assume that we want to split on @ sign and get the first part.

Split string

The most commonly used function is strsplit() from baseR. Another option would be stringr::str_split(), which has very similar parameters as strsplit().

Here are some options to vertically extract the n-th element after string split:

use purrr::map()

splited_str <- strsplit(t1$email_add, "@")
res1 <- purrr::map(splited_str, 1L)
unlist(res1)
## [1] "joe"  "jane" NA

The above formular splits column email_add and returns a list, then uses purrr::map() to get only the 1st part, which returns the 1st element of each nested list. However, note that this approach may not keep the rows that contain NA. In this case, when extracting the 1st part, NA is returned. But when extracting 2nd part, the last row will be omitted.

use vapply()

res2 <- vapply(splited_str, `[`, 1L, FUN.VALUE=character(1))
unlist(res2)
## [1] "joe"  "jane" NA

The letter “v” indicates the apply function works vertically, ie: it applies to each row of the data table. vapply() preserves all rows, so the NA will be reflected.

Split string and turn into multiple columns

To split string and turn one column into multiple columns, there are also different options.

method 1: use matrix()

matrix(unlist(splited_str), ncol = 2, byrow = TRUE)
##      [,1]   [,2]
## [1,] "joe"  "abc.com"
## [2,] "jane" "email.com"
## [3,] NA     "joe"

In matrix(), the parameter ncol allows you to specify how many columns you want to split into, and byrow means to stack by rows. The problem with using matrix is that it recycles row with NA, which will lead to wrong results.

method 2: use stringr::str_split_fixed()

stringr::str_split_fixed(t1$email_add, "@", 2L)
##      [,1]   [,2]
## [1,] "joe"  "abc.com"
## [2,] "jane" "email.com"
## [3,] ""     ""

Here the function fixed represents that the number of column (which is the last parameter here) is pre-determined. This works best if you are sure how many columns will split into. Unlike matrix(), function str_split_fixed() correctly identified NA rows and returned blanks

method 3: use stringr::str_split()

stringr::str_split(t1$email_add, "@", n=2L, simplify = T)
##      [,1]   [,2]
## [1,] "joe"  "abc.com"
## [2,] "jane" "email.com"
## [3,] NA     ""

This gives very similar results as using str_split_fixed() function. By specifying parameter simplify = T, the output becomes a matrix.

method 4: use do.call(rbind)

do.call(rbind, splited_str)
##      [,1]   [,2]
## [1,] "joe"  "abc.com"
## [2,] "jane" "email.com"
## [3,] NA     NA

do.call allows the function rbind to run iteratively, which formed the matrix. In this case, the NA remains NA cells afterwards.

All the methods mentioned above turn the single column into a matrix. To convert a matrix to a dataframe, you can simply apply data.frame() or data.table() functions.

method 5: use data.table::tstrsplit()

t1[, c('user', 'domain') := data.table::tstrsplit(email_add, "@")]
print(t1)
##         email_add user    domain
## 1:    joe@abc.com  joe   abc.com
## 2: jane@email.com jane email.com
## 3:           <NA> <NA>      <NA>

If you use the package data.table, there is a handy function that allows you to do the column split and name the new columns all in place. NA remains NA cells after the split.

method 6: use reshape2::colsplit()

reshape2::colsplit(t1$email_add, "@", c("user", "domain"))
##   user    domain
## 1  joe   abc.com
## 2 jane email.com
## 3

reshape2 package has a very specific function called colsplit() that does the column split task, just like the name suggested. This approach turns the single column into new dataframe, instead of modifying the original dataframe. NA remains NA.

method 7: use vapply() on a data.table object

t1[, user:= vapply(strsplit(t1$email_add, "@"), `[`, 1,  FUN.VALUE = character(1))]
print(t1)
##         email_add user
## 1:    joe@abc.com  joe
## 2: jane@email.com jane
## 3:           <NA> <NA>

If you rather just add one column on the original dataframe, you can use vapply(), but make sure the dataframe has been converted to data.table in order to allow in-place change.

Split multiple columns

When we need to apply the same function on multiple columns, using lapply() in conjunction with data.table object allows in-place modification.

Let’s make up some data for this scenario:

t2 <- data.table::data.table(state_info = c('AL_Alabama', 'AK_Alaska', NA),
                             state_cap = c('Juneau_1990','Montgomery_1819', NA),
                             country = c('USA', 'USA', NA))
print(t2)
##    state_info       state_cap country
## 1: AL_Alabama     Juneau_1990     USA
## 2:  AK_Alaska Montgomery_1819     USA
## 3:       <NA>            <NA>    <NA>

Suppose we want to split column “state_info” and “state_cap”, and keep the 1st part from both columns:

sp_cols <- c('state_info', 'state_cap')
new_cols <- c('state_abbr', 'state_cap_city')
t2[, (new_cols) := lapply(.SD, function(x) stringr::str_split(x, "_", n = 2L, simplify = TRUE)[, 1L]), .SDcols = sp_cols]
print(t2)
##    state_info       state_cap country state_abbr state_cap_city
## 1: AL_Alabama     Juneau_1990     USA         AL         Juneau
## 2:  AK_Alaska Montgomery_1819     USA         AK     Montgomery
## 3:       <NA>            <NA>    <NA>       <NA>           <NA>

In this formula, [, 1L] means to get the 1st part after split. If you want to replace the original columns with new values, instead of adding new columns, simply replace (new_cols) with (sp_cols).

Summary

For splitting one column into multiple columns:

  • If the number of columns is known, using method 1, 2, or 3 all work. However, strsplit() does not retain NAs when picking parts other than the 1st part. The two functions from stringr package avoid this issue
  • If the number of splitted columns is unknown, method 4, 5, 6, 7 do the job. But applying functions in data.table allows in-place modification.