Efficiently input files with R

Categories: R Data Analysis

November 27, 2021

Content

  • File path and directory
  • Read in file
    • copy-paste
    • read in excel
    • read in multiple files
  • Read other types of files
  • Ask for user input

File path and directory

Check files in current directory

Listing the files in the current directory is something we often need, because it lets us pattern-match the file names and pick the files we want to read in. Suppose we use an RStudio Project to organize all the files for a project. The project itself lives in the current_path/ directory (which is also the project's working directory), while all the data is stored under current_path/data/. Let’s take a look at what’s in the data/ folder:

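current_path <- getwd()  # project root, i.e. the working directory
# "\\.xlsx$" matches names that end in ".xlsx" (the dot is escaped)
file_list <- list.files(path = file.path(current_path, "data"), pattern = "\\.xlsx$")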
print(file_list)
## [1] "file1.xlsx" "file2.xlsx"

As the result shows, list.files() returns the names of all files stored under the path you specified (similar to ls on the command line). It does not descend into sub-directories by default, but you can enable that with recursive = TRUE. (all.files = TRUE does something different: it also lists hidden files.)
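As a minimal sketch of the difference, the snippet below builds a throwaway folder in tempdir() and lists it both ways. The folder and file names (list_files_demo, archive, old.xlsx, notes.txt) are made up for illustration; only the list.files() arguments matter.

```r
# Build a small directory tree so the example is self-contained
# (all names here are hypothetical).
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(file.path(demo_dir, "archive"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, c("file1.xlsx", "file2.xlsx", "notes.txt")))
file.create(file.path(demo_dir, "archive", "old.xlsx"))

# Default: only the top level is searched
top_level <- list.files(demo_dir, pattern = "\\.xlsx$")

# recursive = TRUE descends into sub-directories;
# full.names = TRUE prepends the directory, so the paths can go straight to a reader
all_xlsx <- list.files(demo_dir, pattern = "\\.xlsx$",
                       recursive = TRUE, full.names = TRUE)

print(top_level)
print(all_xlsx)
```

Using full.names = TRUE also removes the need to paste the folder name back onto each file name later.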

If you want your script to locate files relative to itself, which is especially useful when sharing the script with others, you can use rstudioapi::getSourceEditorContext()$path to get the path of the file currently open in the RStudio editor. This only works inside RStudio, and the returned path includes the file name itself.

To get only the directory path, wrap the file path in dirname().
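A small sketch of that last step is shown below. Because the RStudio call only works inside RStudio, a hypothetical path string stands in for its return value; in a real session you would pass the editor path instead.

```r
# Hypothetical script location, standing in for the path returned by RStudio
script_path <- "/home/user/project/analysis.R"

# dirname() drops the file name and keeps the folder
script_dir <- dirname(script_path)

# file.path() builds the data folder path with the right separator
data_dir <- file.path(script_dir, "data")

print(script_dir)
print(data_dir)
```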

Copy-paste data

Suppose I run into a post that contains the following data. I can simply copy it to the clipboard and transfer it into R:

Sepal.Width Petal.Length Petal.Width Species
3.5 1.4 0.2 setosa
3.0 1.4 0.2 setosa
3.2 1.3 0.2 setosa
3.1 1.5 0.2 setosa
3.6 1.4 0.2 setosa
3.9 1.7 0.4 setosa

use clipr::read_clip_tbl()

This is a handy function that turns whatever is in your clipboard into a data.frame. There is no need to put anything in the brackets; just make sure your data is in the clipboard. By default it treats \t as the delimiter. A related function is clipr::read_clip(), which returns the clipboard content as a character vector, one element per line.


Figure 1: read_clip_tbl() function

use data.table::fread()

This method requires you to paste the content, as a string, into the input= (or text=) argument, and it returns a data.table. This saves you one conversion step if you prefer working with data.table.
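As a rough sketch, the pasted-text approach might look like this, using the first rows of the sample above. The fread() call assumes the data.table package is installed, so it is guarded; the base read.table(text = ...) line is a dependency-free alternative added here, not something the original workflow relies on.

```r
# The copied data, pasted into a string
pasted <- "Sepal.Width Petal.Length Petal.Width Species
3.5 1.4 0.2 setosa
3.0 1.4 0.2 setosa
3.2 1.3 0.2 setosa"

# Base R alternative: read.table() splits on whitespace by default
df <- read.table(text = pasted, header = TRUE)

# data.table: fread() detects the separator on its own
# (runs only if the package is available)
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::fread(text = pasted)
  print(dt)
}

str(df)
```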

Read in excel file

use read_excel()

This function is from the readxl package, one of the most commonly used packages for reading Excel files. You can specify the sheet by either its name or its position:

# specify the sheet by name, and treat "--" as NA
readxl::read_excel("file_name.xlsx", sheet = "tab1", na = "--")

# specify by tab order
readxl::read_excel("file_name.xlsx", sheet = 1L)

use read.xlsx()

The functions read.xlsx() and read.xlsx2() (faster on big files) both come from the xlsx package. This package requires Java, so it may not work on macOS, or on computers where Java is out of date.

xlsx::read.xlsx("file_name.xlsx", sheetIndex = 1L, header = TRUE)  # read in the first tab

Read in multiple files

In the first section, we discussed how to list all matching file names. We will use those files as an example. When the data and the script are stored in different folders, prepend the relative path to the file names so that the files can be found.

To read in all the files, use lapply() to iterate through:

file_path <- file.path("data", file_list)
myfile <- lapply(file_path, readxl::read_excel)
print(myfile)
## [[1]]
## # A tibble: 3 × 1
##   `Sepal.Length Sepal.Width Petal.Length Petal.Width Species`
##   <chr>                                                      
## 1 1          5.1         3.5          1.4         0.2  setosa
## 2 2          4.9         3.0          1.4         0.2  setosa
## 3 3          4.7         3.2          1.3         0.2  setosa
## 
## [[2]]
## # A tibble: 3 × 1
##   `Sepal.Length Sepal.Width Petal.Length Petal.Width Species`
##   <chr>                                                      
## 1 4          4.6         3.1          1.5         0.2  setosa
## 2 5          5.0         3.6          1.4         0.2  setosa
## 3 6          5.4         3.9          1.7         0.4  setosa

This reads the files into a list, with each file as one element. To combine them into a single data frame, you can use different approaches, including:

  • do.call(rbind, myfile)
  • dplyr::bind_rows()
  • data.table::rbindlist(), which is fast for combining many large tables and pairs well with fread: rbindlist(lapply(list.files(pattern = "\\.csv$"), fread))

As an example, do.call(rbind, ) is a quick and handy way to combine the elements of a list:

df <- do.call(rbind, myfile)
print(df)
## # A tibble: 6 × 1
##   `Sepal.Length Sepal.Width Petal.Length Petal.Width Species`
##   <chr>                                                      
## 1 1          5.1         3.5          1.4         0.2  setosa
## 2 2          4.9         3.0          1.4         0.2  setosa
## 3 3          4.7         3.2          1.3         0.2  setosa
## 4 4          4.6         3.1          1.5         0.2  setosa
## 5 5          5.0         3.6          1.4         0.2  setosa
## 6 6          5.4         3.9          1.7         0.4  setosa
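One thing the combined table no longer shows is which file each row came from. A minimal sketch of one way to keep that information follows. It is an illustration, not the workflow above: it uses CSV files written to tempdir() so it runs without extra packages, and read_with_source and source_file are hypothetical names. The same pattern should work with readxl::read_excel swapped in for read.csv.

```r
# Write two small demo files (rows taken from the built-in iris data)
demo_dir <- file.path(tempdir(), "multi_file_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(iris[1:3, ], file.path(demo_dir, "file1.csv"), row.names = FALSE)
write.csv(iris[4:6, ], file.path(demo_dir, "file2.csv"), row.names = FALSE)

paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read one file and tag every row with the file it came from
read_with_source <- function(path) {
  out <- read.csv(path)
  out$source_file <- basename(path)
  out
}

# Read all files, then stack them into one data frame
combined <- do.call(rbind, lapply(paths, read_with_source))
print(combined)
```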

Read other types of files

  • for .txt file, use:

    • read.table()
      • specify args: sep, header, na.strings, stringsAsFactors
    • read.delim()
      • specify args: header, sep
  • for .csv file, use:

    • read.csv(), which converted non-numeric data to factors by default before R 4.0.0
      • specify args: header, as.is = TRUE (non-numeric will be strings, not factors), sep, stringsAsFactors, na.strings
    • read.delim()
      • specify args: sep = "," (the default delimiter is a tab)
    • readr::read_csv()
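To make the arguments above concrete, here is a small sketch that writes a tab-delimited text file to a temporary location and reads it back with base R. The column names and the "--" missing-value marker are made up for the example.

```r
# Write a tiny tab-delimited file; "--" marks a missing value
txt_path <- tempfile(fileext = ".txt")
writeLines(c("id\tscore\tgroup",
             "1\t3.5\ta",
             "2\t--\tb"), txt_path)

# read.table(): header defaults to FALSE, so set it explicitly
scores <- read.table(txt_path, header = TRUE, sep = "\t",
                     na.strings = "--", stringsAsFactors = FALSE)

# read.delim() is read.table() with header = TRUE and sep = "\t" preset
scores2 <- read.delim(txt_path, na.strings = "--")

str(scores)
```

Both calls read the "--" entry as NA, so the score column is numeric rather than character.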

Ask for user input

use readline()

The following example asks the user for two inputs and converts them to integers before adding them. Put the text shown to the user in the prompt= argument. When asking for multiple inputs, wrap the readline() calls in curly brackets so that the whole block is evaluated as one unit when run from the console; the “;” after each call is optional:

{
    var1 <- readline(prompt = "Enter 1st number: ");
    var2 <- readline(prompt = "Enter 2nd number: ");
}
res <- as.integer(var1) + as.integer(var2)

Figure 2: ask for user input
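readline() always returns a string, and in a non-interactive session (for example under Rscript) it returns "" straight away without waiting. The sketch below is one possible way to guard against both; to_integer is a hypothetical helper, and the fallback values "3" and "4" are arbitrary demo inputs.

```r
# Convert a raw input string to an integer, or stop with a clear message.
# Note: as.integer() truncates decimals, so "3.7" becomes 3.
to_integer <- function(x) {
  value <- suppressWarnings(as.integer(trimws(x)))
  if (is.na(value)) stop("Cannot convert to integer: '", x, "'", call. = FALSE)
  value
}

if (interactive()) {
  var1 <- readline(prompt = "Enter 1st number: ")
  var2 <- readline(prompt = "Enter 2nd number: ")
} else {
  # No console to read from: fall back to fixed demo values
  var1 <- "3"
  var2 <- "4"
}

res <- to_integer(var1) + to_integer(var2)
print(res)
```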

use scan()

If the user is supposed to enter multiple values, you can also use scan() to read a sequence of inputs. It keeps reading from the console until the user presses the ‘Enter’ key on an empty line.


Figure 3: scan method

scan() can also read from a file: pass the file name as the first argument. Its what= argument sets the type of the values read, for example double() (the default), integer(), or character(); passing "" is equivalent to character().
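The effect of what= can be tried without a console or a file by passing a string to scan()'s text= argument. This is a minimal sketch; the input strings are arbitrary.

```r
# Default type is double
nums <- scan(text = "1.5 2 3.25", what = double(), quiet = TRUE)

# Read whitespace-separated words as character strings
words <- scan(text = "red green blue", what = character(), quiet = TRUE)

# Read comma-separated integers by also setting sep
ints <- scan(text = "4,5,6", what = integer(), sep = ",", quiet = TRUE)

print(nums)
print(words)
print(ints)
```

quiet = TRUE only suppresses the "Read N items" message that scan() prints otherwise.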