wrapr Implementation Update

Introduction: The development version of our R helper function wrapr::let() has switched from string-based substitution to abstract syntax tree based substitution (AST based substitution, or language based substitution). I am looking for feedback from wrapr::let() users who are already doing substantial work with wrapr::let(). If you are already using wrapr::let(), please test whether the current development version of wrapr works with your code. If you run into problems: I apologize, and please file a GitHub issue. The substitution modes: The development version of wrapr::let() now has three substitution implementations. Language substitution (subsMethod='langsubs', the new default): in this mode user code is captured as an abstract syntax tree (or parse tree) and substitution is performed only on nodes known to be symbols. String substitution (subsMethod='stringsubs', the current CRAN default): in this mode user code is captured as text and then string…
Original Post: wrapr Implementation Update
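
The difference between the two substitution modes described in the excerpt above can be sketched in base R. This is not wrapr's implementation; it is a minimal illustration, using only base R's substitute() and gsub(), of why replacing symbol nodes in a parse tree is safer than editing code as text. The expression and the replacement name are made up for the example.

```r
# Illustration only: base R stand-ins for the two substitution styles,
# not the code wrapr::let() actually uses.

# Capture user code as a language object (a parse tree).
expr <- quote(x + xx)

# String-style substitution: a naive textual replace also rewrites
# the unrelated symbol `xx`.
string_result <- gsub("x", "y", deparse(expr), fixed = TRUE)

# Language-style substitution: only the symbol node `x` is replaced.
lang_result <- do.call(substitute, list(expr, list(x = as.name("y"))))

print(string_result)
print(deparse(lang_result))
```

The naive text replace turns `x + xx` into `y + yy`, while the symbol-level replace yields `y + xx`. A careful string implementation can avoid this particular collision, but the parse-tree approach avoids it by construction.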

Non-Standard Evaluation and Function Composition in R

In this article we will discuss composing standard-evaluation interfaces (SE) and composing non-standard-evaluation interfaces (NSE) in R. In R the package tidyeval/rlang is a tool for building domain specific languages intended to allow easier composition of NSE interfaces. To use it you must know some of its structure and notation. Here are some details paraphrased from the major tidyeval/rlang client, the package dplyr (vignette('programming', package = 'dplyr')). “:=” is needed to make left-hand-side re-mapping possible (adding yet another “more than one assignment type operator running around” notation issue). “!!” substitution requires parentheses to safely bind (so the notation is actually “(!! )”, not “!!”). Left-hand-sides of expressions are names or strings, while right-hand-sides are quosures/expressions. Example: Let’s apply tidyeval/rlang notation to the task of building re-usable generic code in R. # setup suppressPackageStartupMessages(library("dplyr")) packageVersion("dplyr") ## [1] ‘0.7.0’ vignette('programming', package =…
Original Post: Non-Standard Evaluation and Function Composition in R

An easy way to accidentally inflate reported R-squared in linear regression models

Here is an absolutely horrible way to confuse yourself and get an inflated reported R-squared on a simple linear regression model in R. We have written about this before, but we found a new twist on the problem (interactions with categorical variable encoding) which we would like to call out here. First let’s set up our problem with a data set where the quantity to be predicted (y) has no real relation to the independent variable (x). We will first build our example data: library("sigr"); library("broom"); set.seed(23255); d <- data.frame(y = runif(100), x = ifelse(runif(100) >= 0.5, 'a', 'b'), stringsAsFactors = FALSE) Now let’s build a model and look at the summary statistics returned as part of the model fitting process: m1 <- lm(y ~ x, data = d); t(broom::glance(m1)) ## [,1] ## r.squared 0.002177326 ## adj.r.squared -0.008004538 ## sigma 0.302851476 ## statistic 0.213843593 ## p.value 0.644796456 ## df…
Original Post: An easy way to accidentally inflate reported R-squared in linear regression models

Use a Join Controller to Document Your Work

This note describes a useful replyr tool we call a “join controller” (and is part of our “R and Big Data” series, please see here for the introduction, and here for one of our big data courses). When working on real world predictive modeling tasks in production, the ability to join data and document how you join data is paramount. There are very strong reasons to organize production data in something resembling one of the Codd normal forms. However, for machine learning we need a fully denormalized form (all columns populated into a single ready-to-go row, no matter what their provenance, keying, or stride). This is not an essential difficulty, as in relational data systems moving between these forms can be done by joining, and data stores such as PostgreSQL or Apache Spark are designed to provide…
Original Post: Use a Join Controller to Document Your Work

Managing intermediate results when using R/sparklyr

In our latest “R and big data” article we show how to manage intermediate results in non-trivial Apache Spark workflows using R, sparklyr, dplyr, and replyr. Handle management: Many sparklyr tasks involve creation of intermediate or temporary tables. This can be through dplyr::copy_to() and through dplyr::compute(). These handles can represent a reference leak and eat up resources. To help control handle lifetime the replyr package supplies record-retaining temporary name generators (and uses the same internally). The actual function is pretty simple: print(replyr::makeTempNameGenerator) ## function(prefix, ## suffix= NULL) { ## force(prefix) ## if((length(prefix)!=1)||(!is.character(prefix))) { ## stop("repyr::makeTempNameGenerator prefix must be a string") ## } ## if(is.null(suffix)) { ## alphabet ## For instance to join a few tables it can be a good idea to call compute after each join (else the generated SQL can become large and unmanageable). This sort…
Original Post: Managing intermediate results when using R/sparklyr
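
The idea of a record-retaining temporary name generator, as described in the excerpt above, can be sketched with a base R closure. This is a hypothetical illustration, not replyr's code: the names make_name_tracker, new_name, and issued are invented here, and the naming scheme (prefix plus a zero-padded counter) is an assumption.

```r
# Hypothetical sketch of a record-retaining temporary name generator.
# make_name_tracker, new_name, and issued are illustrative names,
# not part of replyr's API.
make_name_tracker <- function(prefix) {
  force(prefix)
  if ((length(prefix) != 1) || (!is.character(prefix))) {
    stop("make_name_tracker: prefix must be a single string")
  }
  count <- 0L                   # how many names have been issued so far
  names_issued <- character(0)  # the record of every name handed out
  list(
    new_name = function() {
      count <<- count + 1L
      nm <- paste0(prefix, "_", sprintf("%010d", count))
      names_issued <<- c(names_issued, nm)
      nm
    },
    issued = function() names_issued
  )
}

tracker <- make_name_tracker("TMP")
first <- tracker$new_name()
second <- tracker$new_name()
print(tracker$issued())
```

Because the generator remembers every name it has issued, a workflow can walk that record at the end and drop each intermediate table, rather than leaving handles behind.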

Campaign Response Testing no longer published on Udemy

Our free video course Campaign Response Testing is no longer published on Udemy. It remains available for free on YouTube with all source code available from GitHub. I’ll try to correct bad links as I find them. Please read on for the reasons. Udemy recently unilaterally instituted a new policy on free courses: “When a free course has a Recent Review Rating less than 4.1 and is flagged with a ‘high degree of confidence’ the course will be hidden from Udemy’s search.” Campaign Response Testing is a free course with an all-time average rating of 4.14 and a recent rating of 3.85. We have kept the code up to date and answered student questions (there was no backlog of student questions). Obviously others should have opinions of our public work (which may or may not be good). And we…
Original Post: Campaign Response Testing no longer published on Udemy

More on safe substitution in R

Let’s worry a bit about substitution in R. Substitution is very powerful, which means it can be both used and mis-used. However, that does not mean every use is unsafe or a mistake. From Advanced R: substitute: We can confirm the above code performs no substitution: a <- 1; b <- 2; substitute(a + b + z) ## a + b + z And it appears the effect is that substitute is designed to not take values from the global environment. So, as we see below, it isn’t so much what environment we are running in that changes substitute’s behavior, it is what environment the values are bound to that changes things. (function() { a <- 1; substitute(a + b + z, environment()) })() ## 1 + b + z We can in fact find many simple variations of substitute that…
Original Post: More on safe substitution in R
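
The behavior described in the excerpt above can be reproduced with base R alone. The sketch below passes globalenv() explicitly (an addition not in the excerpt) so that the first result does not depend on where the code is run; the third call, which supplies values as a list, is also an added illustration.

```r
# substitute() leaves symbols unchanged when the environment is the global one.
a <- 1
b <- 2
global_result <- substitute(a + b + z, globalenv())

# Inside a function, only symbols bound in the supplied environment are
# replaced: `a` is local, while `b` and `z` are not.
local_result <- (function() {
  a <- 1
  substitute(a + b + z, environment())
})()

# Values can also be supplied explicitly as a list.
list_result <- substitute(a + b + z, list(a = 1, b = 2))

print(global_result)
print(local_result)
print(list_result)
```

The three results are `a + b + z`, `1 + b + z`, and `1 + 2 + z`: what gets replaced depends on which bindings the supplied environment or list holds.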

There is usually more than one way in R

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”): There should be one– and preferably only one –obvious way to do it. Frankly in R (especially once you add many packages) there is usually more than one way. As an example we will talk about the common R functions: str(), head(), and the tibble package’s glimpse(). tibble::glimpse(): Consider the important task of inspecting a data.frame to see column types and a few example values. The dplyr/tibble/tidyverse way of doing this is as follows: library("tibble"); glimpse(mtcars) Observations: 32 Variables: 11 $ mpg 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10…. $ cyl 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4,…
Original Post: There is usually more than one way in R
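
For comparison with the glimpse() call in the excerpt above, here is a short sketch of the base R routes the excerpt names, str() and head(), run on the same mtcars data. The last step, a hand-rolled listing of column classes, is an added illustration and not something the excerpt proposes.

```r
# Base R ways to inspect a data.frame; mtcars ships with R.
str(mtcars)                 # column types plus the first few values of each column
print(head(mtcars, n = 3))  # the first three rows, as a data.frame

# A small hand-rolled listing of column classes, for comparison.
col_classes <- vapply(mtcars, function(col) class(col)[1], character(1))
print(col_classes)
```

str() is the closest base R analogue to glimpse(), giving one line per column, while head() shows whole rows.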

In defense of wrapr::let()

Saw this the other day: In defense of wrapr::let() (originally part of replyr, and still re-exported by that package) I would say: let() was deliberately designed for a single real-world use case: working with data when you don’t know the column names when you are writing the code (i.e., the column names will come later in a variable). We can re-phrase that as: there is deliberately less to learn as let() is adapted to a need (instead of one having to adapt to let()). The R community already has months of experience confirming let() working reliably in production while interacting with a number of different packages. let() will continue to be a very specific, consistent, reliable, and relevant tool even after dplyr 0.6.* is released, and the community gains experience with rlang/tidyeval in production. If rlang/tidyeval is your thing, by…
Original Post: In defense of wrapr::let()

Summarizing big data in R

Our next “R and big data tip” is: summarizing big data. We always say “if you are not looking at the data, you are not doing science”, and for big data you are very dependent on summaries (as you can’t actually look at everything). Simple question: is there an easy way to summarize big data in R? The answer is: yes, but we suggest you use the replyr package to do so. Let’s set up a trivial example. suppressPackageStartupMessages(library("dplyr")) packageVersion("dplyr") ## [1] ‘0.5.0’ library("sparklyr") packageVersion("sparklyr") ## [1] ‘0.5.5’ library("replyr") packageVersion("replyr") ## [1] ‘0.3.902’ sc <- sparklyr::spark_connect(version='2.0.2', master = "local"); diris <- copy_to(sc, iris, 'diris') The usual S3 summary() summarizes the handle, not the data. summary(diris) ## Length Class Mode ## src 1 src_spark list ## ops 3 op_base_remote list tibble::glimpse() throws. packageVersion("tibble") ## [1] ‘1.3.3’ # errors-out glimpse(diris) ## Observations: 150 ## Variables: 5…
Original Post: Summarizing big data in R