John Mount speaking on rquery and rqdatatable

rquery and rqdatatable are new R packages for data wrangling, either at scale (in databases, or big data systems such as Apache Spark) or in-memory. They speed up both execution (through optimizations) and development (through a good mental model and up-front error checking) for data wrangling tasks. Win-Vector LLC’s John Mount will be speaking on the rquery and rqdatatable packages at the East Bay R Language Beginners Group on Tuesday, August 7, 2018 (Oakland, CA). One may ask: why share new packages with an “R Language Beginners Group”? (Though I prefer to use the term “part time R user”.) Because such packages can give such users easy, safe, consistent, and performant access to powerful data systems. rquery establishes a notation and links to external data providers such as PostgreSQL, Amazon Redshift, and Apache Spark. rqdatatable supplies an additional in-memory implementation…
Original Post: John Mount speaking on rquery and rqdatatable

Speed up your R Work

In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base R and other packages. For each of the above packages we speed up work by using wrapr::execute_parallel, which in turn uses wrapr::partition_tables to partition unrelated data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines. The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together for a correct calculation. We are going to try to demonstrate everything with simple code examples and minimal discussion. Keep in mind: unless the pipeline steps have…
Original Post: Speed up your R Work
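
The partition-then-distribute idea in the excerpt above can be sketched with base R's parallel package. This is only a minimal sketch under stated assumptions, not the implementation of wrapr::execute_parallel or rqdatatable::ex_data_table_parallel: the example data, the grouping column `g`, and the helper `summarize_group` are all hypothetical.

```r
library(parallel)

# Hypothetical example data: `g` is the grouping column that marks
# which rows must be kept together for a correct calculation.
d <- data.frame(
  g = c("a", "a", "b", "b", "c"),
  x = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Hypothetical per-partition step: each group is processed independently.
summarize_group <- function(part) {
  data.frame(
    g = as.character(part$g[[1]]),
    total = sum(part$x),
    stringsAsFactors = FALSE
  )
}

# Partition the rows by the grouping column.
parts <- split(d, d$g)

# Send the partitions to two worker processes, then shut the workers down.
cl <- makeCluster(2)
res_list <- parLapply(cl, parts, summarize_group)
stopCluster(cl)

# Reassemble the per-partition results into one data.frame.
res <- do.call(rbind, res_list)
rownames(res) <- NULL
print(res)
```

Because each partition holds complete groups, the per-group totals do not depend on which worker handled which partition.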

seplyr 0.5.8 Now Available on CRAN

We are pleased to announce that seplyr version 0.5.8 is now available on CRAN. seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself. Tools such as seplyr, wrapr or rlang are needed when you (the data scientist temporarily working on a programming sub-task) do not know the names of the columns you want your code to be working with. These are situations where you expect the column names to be made available later, in additional variables…
Original Post: seplyr 0.5.8 Now Available on CRAN
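
The situation described above, where column names only become available later in variables, can be illustrated in base R. This is a minimal sketch and does not use seplyr, wrapr, or rlang; the data, the variable names, and the helper `double_column` are hypothetical.

```r
# Hypothetical example data.
d <- data.frame(height = c(1.5, 1.8), weight = c(60, 80))

# The column names are not known when the function is written;
# they arrive later as strings held in variables.
value_col <- "weight"
new_col <- "weight_doubled"

# Hypothetical helper: works on whichever columns are named by `from` and `to`.
double_column <- function(data, from, to) {
  data[[to]] <- 2 * data[[from]]
  data
}

res <- double_column(d, value_col, new_col)
print(res)
```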

Big News: vtreat 1.2.0 is Available on CRAN, and it is now Big Data Capable

We here at Win-Vector LLC have some really big news that we would like the R community’s help in sharing. vtreat version 1.2.0 is now available on CRAN, and this version of vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as Apache Spark. vtreat is a very complete and rigorous tool for preparing messy real world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you. Thanks to the rquery package, this data preparation transform can now be directly applied to databases, or big data systems such as PostgreSQL, Amazon Redshift, Apache Spark, or Google BigQuery. Or, thanks to the data.table and rqdatatable packages, even…
Original Post: Big News: vtreat 1.2.0 is Available on CRAN, and it is now Big Data Capable

R Tip: Be Wary of “…”

R Tip: be wary of “…”. The following code example contains an easy error in using the R function unique().
    vec1 <- c("a", "b", "c")
    vec2 <- c("c", "d")
    unique(vec1, vec2) # [1] "a" "b" "c"
Notice none of the novel values from vec2 are present in the result. Our mistake was: we (improperly) tried to use unique() with multiple value arguments, as one would use union(). Also notice no error or warning was signaled. We used unique() incorrectly and nothing pointed this out to us. What compounded our error was R’s “…” function signature feature. In this note I will talk a bit about how to defend against this kind of mistake. I am going to apply the principle that a design that makes committing mistakes more difficult (or even impossible) is a good thing, and not a sign…
Original Post: R Tip: Be Wary of “…”
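
The mistake described above, and one possible defense, can be sketched in base R. The wrapper `unique_strict` is a hypothetical helper introduced only for illustration; it is not taken from the original post.

```r
vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")

# unique() does not treat its second argument as more data,
# so the novel value "d" never appears, and no warning is raised.
r_unique <- unique(vec1, vec2)
print(r_unique)

# union() is the function that actually takes two value arguments.
r_union <- union(vec1, vec2)
print(r_union)

# Hypothetical defensive wrapper: with a single named argument and no "..."
# in its signature, a stray extra argument becomes an error.
unique_strict <- function(x) {
  unique(x)
}

err <- tryCatch(
  unique_strict(vec1, vec2),
  error = function(e) "error"
)
print(err)
```

The design choice here is simply to leave “…” out of the wrapper's signature, so R's own argument matching rejects the mistaken call.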

wrapr 1.5.0 available on CRAN

The R package wrapr 1.5.0 is now available on CRAN. wrapr includes a lot of tools for writing better R code; I’ll be writing articles on a number of the new capabilities. For now I will just leave you with the nifty coalesce operator notation. Coalesce takes values from its arguments in left-to-right order, taking the first non-NA value (if any is available):
    NA %?% 5 # [1] 5
    1 %?% 5 # [1] 1
    5 %?% NA # [1] 5
For vectors each position is calculated independently, and scalars are recycled to vector sizes. This allows fairly complicated coalesce strategies (such as take from first two vectors if possible, else write in zero) to be expressed very succinctly:
    vec1 <- c(1, 2, NA, NA)
    vec2 <- c(10, NA, 20, NA)
    vec1 %?% vec2 %?% 0 # [1] 1 2…
Original Post: wrapr 1.5.0 available on CRAN
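
The position-by-position coalescing rule described above can be sketched in base R. This is an illustrative sketch only, not wrapr's implementation of %?%; the helper `coalesce_sketch` and the vectors `x` and `y` are hypothetical.

```r
# Hypothetical helper: take `a` where it is not NA, otherwise take `b`.
# Shorter inputs (such as a scalar) are recycled to the longer length.
coalesce_sketch <- function(a, b) {
  n <- max(length(a), length(b))
  a <- rep_len(a, n)
  b <- rep_len(b, n)
  ifelse(is.na(a), b, a)
}

# Hypothetical example vectors.
x <- c(NA, 4, NA)
y <- c(7, 8, NA)

# Take from x if possible, else from y, else write in zero.
res <- coalesce_sketch(coalesce_sketch(x, y), 0)
print(res)
# [1] 7 4 0
```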

R Tip: use isTRUE()

R Tip: use isTRUE(). A lot of R functions are type unstable, which means they return different types or classes depending on details of their values. For example, consider all.equal(): it returns the logical value TRUE when the items being compared are equal:
    all.equal(1:3, c(1, 2, 3)) # [1] TRUE
However, when the items being compared are not equal, all.equal() instead returns a message:
    all.equal(1:3, c(1, 2.5, 3)) # [1] "Mean relative difference: 0.25"
This can be inconvenient when using functions similar to all.equal() as tests in if()-statements and other program control structures. The saving function is isTRUE(). isTRUE() returns TRUE if its argument value is equivalent to TRUE, and returns FALSE otherwise. isTRUE() makes R programming much easier. Some examples of isTRUE() are given below:
    isTRUE(TRUE) # [1] TRUE
    isTRUE(FALSE) # [1] FALSE
    isTRUE(NULL) # [1] FALSE
    isTRUE(NA) # [1]…
Original Post: R Tip: use isTRUE()
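
The use of isTRUE() as a guard around a type-unstable function inside if() can be sketched as follows. The helper `check` is hypothetical and introduced only for illustration.

```r
# all.equal() returns TRUE on a match, but a character message on a mismatch.
cmp_same <- all.equal(1:3, c(1, 2, 3))
cmp_diff <- all.equal(1:3, c(1, 2.5, 3))

# Hypothetical helper: isTRUE() turns either kind of result into a
# single logical value that is safe to use in if().
check <- function(a, b) {
  if (isTRUE(all.equal(a, b))) {
    "equal"
  } else {
    "not equal"
  }
}

print(check(1:3, c(1, 2, 3)))
print(check(1:3, c(1, 2.5, 3)))
```

Without the isTRUE() wrapper, the second call would hand a character string to if(), which is an error rather than a FALSE branch.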

rqdatatable: rquery Powered by data.table

rquery is an R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on PostgreSQL and Apache Spark. rqdatatable is a new package that supplies a screaming fast implementation of the rquery system in-memory using the data.table package. rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaption supplied by rqdatatable). Teaching rquery and fully benchmarking it is a big task, so in this note we will limit ourselves to a single example and benchmark. Our intent is to use this example to promote rquery and rqdatatable, but frankly the biggest result of…
Original Post: rqdatatable: rquery Powered by data.table

WVPlots now at version 1.0.0 on CRAN!

Nina Zumel and I have been working on packaging our favorite graphing techniques in a more reusable way that emphasizes the analysis task at hand over the steps needed to produce a good visualization. We are excited to announce that WVPlots is now at version 1.0.0 on CRAN! The idea is: we sacrifice some of the flexibility and composability inherent to ggplot2 in R for a menu of prescribed presentation solutions. This is a package to produce plots while you are in the middle of another task. For example the plot below showing both an observed discrete empirical distribution (as stems) and a matching theoretical distribution (as bars) is a built-in “one liner.”
    set.seed(52523)
    d <- data.frame(wt=100*rnorm(100))
    WVPlots::PlotDistCountNormal(d,'wt','example')
The graph above is actually the product of a number of presentation decisions: Using a discrete histogram approach to summarize data…
Original Post: WVPlots now at version 1.0.0 on CRAN!

wrapr 1.4.1 now up on CRAN

wrapr 1.4.1 is now available on CRAN. wrapr is a really neat R package for organizing, meta-programming, and debugging R code. This update generalizes the dot-pipe feature’s dot S3 features. Please give it a try! wrapr is an R package that supplies powerful tools for writing and debugging R code. Introduction. Primary wrapr services include: let() (let block); %.>% (dot arrow pipe); build_frame()/draw_frame(); := (named map builder); DebugFnW() (function debug wrappers); λ() (anonymous function builder). let() allows execution of arbitrary code with substituted variable names (note this is subtly different than binding values for names as with base::substitute() or base::with()). The function is simple and powerful. It treats strings as variable names and re-writes expressions as if you had used the denoted variables. For example the following block of code is equivalent to having written “a + a”.
    library("wrapr")
    a…
Original Post: wrapr 1.4.1 now up on CRAN
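
The idea of treating a string as a variable name and re-writing an expression accordingly can be approximated in base R. This is a rough sketch only and is not wrapr's let(); the placeholder name `VAR`, the variable `col_name`, and the value assigned to `a` are hypothetical.

```r
# Hypothetical value to operate on.
a <- 7

# The variable name arrives as a string.
col_name <- "a"

# Re-write the expression, replacing the placeholder VAR with the
# symbol named by col_name.
expr <- substitute(VAR + VAR, list(VAR = as.name(col_name)))
print(expr)
# a + a

# Evaluate the re-written expression.
res <- eval(expr)
print(res)
# [1] 14
```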