I’m excited to announce that vroom 1.1.0 is now available on CRAN!

vroom reads rectangular data, such as comma-separated (csv), tab-separated (tsv) or fixed-width files (fwf), into R. It fills a similar role to functions like readr::read_csv(), data.table::fread() or read.csv(), but for many datasets vroom::vroom() can read them much, much faster (hence the name). Get the latest version with:

install.packages("vroom")

Then attach the package by running:

library(vroom)

Improvements in this release include a hex logo, support for big integer data, improved delimiter guessing, delimiters stored in column specifications, and streamlined reading from standard input.

See the change log for a full list of changes and bug fixes in this version.

Hex logo

Thanks to Allison Horst we now have an awesome hex logo for vroom!

Big integer support

R’s standard integers are stored in 32 bits, which means that the largest value they can hold is 2,147,483,647 (2^31 - 1). In most operations that mix integers with doubles, R implicitly converts the integers to 64-bit floating point values, which is why you may not have noticed this limitation before.

options(digits = 22)
x <- 2147483647L
str(x)
#>  int 2147483647
str(x + 1L)
#> Warning in x + 1L: NAs produced by integer overflow
#>  int NA
str(x + 1)
#>  num 2.15e+09
x + 1
#> [1] 2147483648

However, even 64-bit floating point values can only store consecutive integers up to 9,007,199,254,740,992 (2^53) without losing precision. You can see this by adding 1 to that number: the result is unchanged.

y <- 9007199254740992
z <- y + 1
z
#> [1] 9007199254740992
y == z
#> [1] TRUE
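
Both limits can also be checked from base R without typing the literals. The short sketch below uses only `.Machine` and ordinary arithmetic; the variable names `int_max` and `limit` are just illustrative.

```r
# Largest value a 32-bit signed integer can hold
int_max <- .Machine$integer.max
int_max == 2^31 - 1
#> [1] TRUE

# Number of bits in the significand of a double
.Machine$double.digits
#> [1] 53

# Integers are exact up to 2^53; one step past it is rounded back
limit <- 2^53
(limit - 1) == limit
#> [1] FALSE
(limit + 1) == limit
#> [1] TRUE
```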

To store consecutive integers bigger than this you need a 64-bit integer type. R does not have native support for 64-bit integers, but the bit64 package provides it. Integers this large rarely occur in measured, real-world data, but they are common in generated data such as database identifiers.

vroom 1.1.0 now supports reading these big integers into the integer64 type provided by bit64 with a new col_big_integer() column type (shortcut ‘I’).

x <- vroom("id\n9007199254740993\n", col_types = "I", delim = ",")
x
#> # A tibble: 1 x 1
#>   id              
#>   <int64>         
#> 1 9007199254740993

x$id + 1
#> integer64
#> [1] 9007199254740994
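
The same column type can be requested by name rather than with the ‘I’ shortcut. The following is a minimal sketch, assuming vroom (1.1.0 or later) and bit64 are installed; it writes a throwaway file so that it does not depend on how a given vroom version treats literal strings as input. The names `path`, `ids` and `next_id` are illustrative only.

```r
library(vroom)

# A tiny one-column file holding an id that a double cannot represent exactly
path <- tempfile(fileext = ".csv")
writeLines(c("id", "9007199254740993"), path)

# Ask for the big integer type by column name
ids <- vroom(path, delim = ",", col_types = cols(id = col_big_integer()))

# The column is an integer64 vector, so arithmetic stays exact
is_big <- bit64::is.integer64(ids$id)
next_id <- as.character(ids$id + 1L)
is_big
#> [1] TRUE
next_id
#> [1] "9007199254740994"

unlink(path)
```

Naming the column in cols() leaves any other columns to be guessed as usual, which can be convenient when only an identifier column needs the wider type.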

Improved delimiter guessing

The code to guess delimiters has been rewritten, which should make it more robust to most inputs. Previous versions of vroom would fall back to using a newline delimiter if a delimiter could not be guessed. vroom 1.1.0 instead throws an error.

vroom("x\n1\n")
#> Error: Could not guess the delimiter.
#> 
#> Use `vroom(delim =)` to specify one explicitly.
vroom("x\n1\n", delim = ",")
#> Rows: 1
#> Columns: 1
#> Delimiter: ","
#> dbl [1]: x
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 1 x 1
#>       x
#>   <dbl>
#> 1     1
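
In scripts that read files of unknown shape, one way to cope with the new error is to retry with an explicit delimiter. The helper below, `read_with_fallback()`, is hypothetical and not part of vroom; it is a rough sketch that catches any error from the first attempt, so a genuinely broken file will simply fail again on the retry.

```r
library(vroom)

# A single-column file gives vroom nothing to guess a delimiter from
path <- tempfile(fileext = ".csv")
writeLines(c("x", "1"), path)

# Hypothetical helper: let vroom guess first, then retry with `fallback`
read_with_fallback <- function(file, fallback = ",") {
  tryCatch(
    vroom(file, col_types = list()),
    error = function(e) vroom(file, delim = fallback, col_types = list())
  )
}

dat <- read_with_fallback(path)
dat$x
#> [1] 1
```

Passing col_types = list() here only serves to silence the column specification message.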

Delimiters in the specification

vroom now includes the delimiter in the specification object, which means you no longer have to separately provide the delimiter if you are using an existing specification.

# read a csv file, the delimiter is provided as ','
x <- vroom(vroom_example("mtcars.csv"), delim = ',')
#> Rows: 32
#> Columns: 12
#> Delimiter: ","
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
mtcars_spec <- spec(x)

# If the file is read again with the spec, no need to provide the delimiter
y <- vroom(vroom_example("mtcars.csv"), col_types = mtcars_spec)
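
A specification written by hand can carry a delimiter too. The sketch below assumes that cols() accepts a `.delim` argument for this purpose, as the vroom reference describes; the file contents and the names `path`, `semi_spec` and `cars` are made up for illustration.

```r
library(vroom)

# A small semicolon-delimited file
path <- tempfile(fileext = ".txt")
writeLines(c("model;mpg", "Mazda RX4;21.0", "Datsun 710;22.8"), path)

# Assumption: `.delim` stores the delimiter inside the specification
semi_spec <- cols(model = col_character(), mpg = col_double(), .delim = ";")

# No `delim` argument is passed here
cars <- vroom(path, col_types = semi_spec)
names(cars)
#> [1] "model" "mpg"
cars$mpg
#> [1] 21.0 22.8
```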

Reading from standard input

vroom makes it straightforward to read from the C standard input, as you would when calling R from the terminal command line. Simply use stdin() as your input. Let’s say you want to take the first few lines of the mtcars file and find the average horsepower.

head mtcars.tsv | Rscript -e 'hp <- vroom::vroom(stdin(), col_types = list())$hp; mean(hp)'
#> [1] 122.7778

Acknowledgements

This release also contains a number of bug fixes and improvements which should make it more robust than previous releases. See the change log for full details.

A big thanks to all contributors of code, issues and documentation to this release, including many who helped out at the tidyverse developer day in Toulouse, France!

@2005m, @atomman, @batpigandme, @blairj09, @Chris-M-P, @chsafouane, @CriscelyLP, @DyfanJones, @ecoquant, @edzer, @ericbrownaustin, @estroger34, @frm1789, @georgevbsantiago, @guiastrennec, @hadley, @HenrikBengtsson, @henry090, @jaapwalhout, @jimhester, @jonaszierer, @kiernann, @martindut, @meta00, @mgirlich, @mllg, @osiris08, @Plebejer, @R3myG, @randomgambit, @sanromd, @Shians, @stephen-hayne, @vjcitn, @wlattner, and @xiaodaigh.