Upcoming data preparation and modeling article series

I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN. vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects. vtreat handles many common data-preparation problems in a statistically sound fashion. In our (biased) opinion vtreat has the best methodology and documentation for these important data cleaning and preparation steps. vtreat’s current public open-source implementation is for in-memory R analysis (we are considering ports, and certifying ports, of the package some time in the future, possibly for data.table, Spark, Python/Pandas, and SQL). vtreat brings a lot of power, sophistication, and convenience to your analyses, without a lot of trouble. A new feature of vtreat version 0.6.0 is called “custom coders.” Win-Vector LLC’s Dr. Nina…
Original Post: Upcoming data preparation and modeling article series
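The excerpt above describes vtreat without showing it in use; here is a minimal hedged sketch of a typical vtreat workflow, using function names from the vtreat documentation (the data and column names are our own illustration, not from the post):

```r
library("vtreat")

# toy data with a messy categorical column (includes an NA)
d <- data.frame(x = c("a", "a", "b", NA, "c"),
                y = c(1, 1, 0, 1, 0))

# design a treatment plan for a numeric outcome "y" from variable "x"
plan <- designTreatmentsN(d, varlist = "x", outcomename = "y")

# apply the plan to get a clean, all-numeric frame safe for modeling
dTreated <- prepare(plan, d)
```

The treated frame can then be handed directly to a modeling function, with missing values and novel levels already handled.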

My advice on dplyr::mutate()

There are substantial differences between ad-hoc analyses (be they machine learning research, data science contests, or other demonstrations) and production-worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often, once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time. Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark. “Character is what you are in the dark.” (John Whorfin, quoting Dwight L. Moody.) I have found that to deliver production-worthy data science and predictive analytic systems, one has to develop per-team and per-project field-tested recommendations…
Original Post: My advice on dplyr::mutate()

It is Needlessly Difficult to Count Rows Using dplyr

Question: how hard is it to count rows using the R package dplyr? Answer: surprisingly difficult. When trying to count rows using dplyr or dplyr-controlled data structures (remote tbls, such as sparklyr or dbplyr structures) one is sailing between Scylla and Charybdis, the task being to avoid dplyr corner cases and irregularities (a few of which I attempt to document in this “dplyr inferno”). Let’s take an example from sparklyr issue 973:

```r
suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.2.9000'
library("sparklyr")
packageVersion("sparklyr")
## [1] '0.6.2'
sc <- spark_connect(master = "local")
## * Using Spark: 2.1.0
db_drop_table(sc, 'extab', force = TRUE)
## [1] 0
DBI::dbGetQuery(sc, "DROP TABLE IF EXISTS extab")
DBI::dbGetQuery(sc, "CREATE TABLE extab (n TINYINT)")
DBI::dbGetQuery(sc, "INSERT INTO extab VALUES (1), (2), (3)")
dRemote <- tbl(sc, "extab")
print(dRemote)
## # Source:   table [?? x 1]
## # Database: spark_connection
##        n
## 1…
```
Original Post: It is Needlessly Difficult to Count Rows Using dplyr
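The excerpt stops before the workarounds; as a minimal hedged sketch (shown on a local data frame only, not the remote-table corner cases the post explores), one can count rows inside the pipeline itself rather than relying on nrow():

```r
library("dplyr")

d <- data.frame(x = c(1, 2, 3))

# count rows via summarize()/pull() inside the dplyr pipeline,
# instead of nrow(), which can return NA for remote tbls
n <- d %>% summarize(n = n()) %>% pull(n)
# n is 3 for this local example
```

For remote sources the same pattern pushes the count down to the database or Spark, which is usually what you want anyway.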

Permutation Theory In Action

While working on a large client project using sparklyr and multinomial regression, we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another. Providing good solutions to gaps like this is one of the things Win-Vector LLC does, both in our consulting and training practices. Let’s take a look at an example. Suppose our two orderings are o1 (the ordering Spark ML chooses) and o2 (the order the R user chooses).

```r
set.seed(326346)
symbols <- letters[1:7]
o1 <- sample(symbols, length(symbols), replace = FALSE)
o1
## [1] "e" "a" "b" "f" "d" "c" "g"
o2 <- sample(symbols, length(symbols), replace =…
```
Original Post: Permutation Theory In Action
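The excerpt cuts off before the solution; as a hedged sketch (not necessarily the derivation the post itself uses), base R's match() can build the permutation that carries one ordering onto the other:

```r
# Sketch: translate between two orderings of the same symbols using match().
# The variable names here are illustrative, not from the original post.
symbols <- c("a", "b", "c", "d")
o1 <- c("c", "a", "d", "b")   # order chosen by one system (e.g. Spark ML)
o2 <- c("a", "b", "c", "d")   # order the user wants

# perm[i] gives the position in o1 of the i-th element of o2,
# so indexing o1 by perm reproduces o2
perm <- match(o2, o1)
stopifnot(all(o1[perm] == o2))
```

The same index vector can then be used to re-order prediction columns or coefficient blocks produced in the other system's ordering.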

Why to use the replyr R package

Recently I noticed that the R package sparklyr had the following odd behavior:

```r
suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'
sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))
dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA
```

This means user code or user analyses that depend on any of dim(), ncol(), or nrow() can break. nrow() used to return something other than NA, so older work may not be reproducible. In fact, where I actually noticed this was deep in debugging a client project (not in a trivial example, such as the above). Tron: fights for the users. In my opinion, this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr…
Original Post: Why to use the replyr R package
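The excerpt does not reach replyr itself; as a hedged sketch, replyr supplies generic size queries that work uniformly on local and remote data. The function names below (replyr_dim, replyr_nrow) are taken from the replyr documentation, but treat the exact calls as assumptions:

```r
library("replyr")

d <- data.frame(x = 1:2)

# dimension and row-count queries that are designed to also work on
# remote tbls, where base dim()/nrow() can return NA
replyr_dim(d)
replyr_nrow(d)
```

On a Spark- or database-backed tbl the same calls would issue an actual count to the backend rather than returning NA.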

Neat New seplyr Feature: String Interpolation

The R package seplyr has a neat new feature: the function seplyr::expand_expr(), which implements what we call “the string algebra,” or string expression interpolation. The function takes an expression of mixed terms, including: variables referring to names, quoted strings, and general expression terms. It then “de-quotes” all of the variables referring to quoted strings and “dereferences” variables thought to be referring to names. The entire expression is then returned as a single string. This provides a powerful way to easily work complicated expressions into the seplyr data manipulation methods. The method is easiest to see with an example:

```r
library("seplyr")
## Loading required package: wrapr
ratio <- 2
compCol1 <- "Sepal.Width"
expr <- expand_expr("Sepal.Length" >= ratio * compCol1)
print(expr)
## [1] "Sepal.Length >= ratio * Sepal.Width"
```

expand_expr() works by capturing the user-supplied expression unevaluated, performing some transformations, and returning the…
Original Post: Neat New seplyr Feature: String Interpolation
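The interpolated string is meant to be fed to seplyr's standard-evaluation verbs; a hedged sketch (filter_se() is seplyr's string-based analogue of dplyr::filter(), but treat the exact call as an assumption, not something shown in the excerpt):

```r
library("seplyr")

ratio <- 2
compCol1 <- "Sepal.Width"

# build the expression string, then use it in a standard-evaluation verb
expr <- expand_expr("Sepal.Length" >= ratio * compCol1)
# expr is the string "Sepal.Length >= ratio * Sepal.Width"
res <- filter_se(iris, expr)
```

Because the condition is an ordinary string, it can be assembled programmatically before being handed to the verb, which is the point of the "string algebra."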

wrapr: R Code Sweeteners

wrapr is an R package that supplies powerful tools for writing and debugging R code. Primary wrapr services include:

- let()
- %.>% (the “dot arrow” pipe)
- := (the named map builder)
- λ() (the anonymous function builder)
- DebugFnW()

let() allows execution of arbitrary code with substituted variable names (note this is subtly different than binding values for names, as with base::substitute() or base::with()). The function is simple and powerful. It treats strings as variable names and re-writes expressions as if you had used the denoted variables. For example, the following block of code is equivalent to having written “a + a”:

```r
library("wrapr")
a <- 7
let(
  c(VAR = 'a'),
  VAR + VAR
)
# [1] 14
```

This is useful in re-adapting non-standard evaluation interfaces (NSE interfaces) so one can script or program over them. We are trying to make let() self-teaching and self…
Original Post: wrapr: R Code Sweeteners
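A second small sketch of how let() re-adapts an NSE interface, so a column name can come from a variable (the example and names are ours, not from the post):

```r
library("wrapr")

d <- data.frame(x = 1:3)
columnName <- "x"    # the column to work on, held as a string

# let() rewrites the symbol COL to x, so d$COL becomes d$x
let(
  c(COL = columnName),
  sum(d$COL)
)
# [1] 6
```

The `$` operator is an NSE interface (it takes a literal name, not a value), which is exactly the kind of interface let() makes programmable.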

Some Neat New R Notations

The R package seplyr supplies a few neat new coding notations. (Image: an abacus, which gives us the term “calculus.”) The first notation is an operator called the “named map builder.” This is a cute notation that essentially does the job of stats::setNames(). It allows for code such as the following:

```r
library("seplyr")
names <- c('a', 'b')
names := c('x', 'y')
#>   a   b
#> "x" "y"
```

This can be very useful when programming in R, as it allows indirection or abstraction on the left-hand side of inline name assignments (unlike c(a = 'x', b = 'y'), where all left-hand sides are concrete values, even if not quoted). A nifty property of the named map builder is that it commutes (in the sense of algebra or category theory) with R’s c() combine/concatenate function. That is: c('a' := 'x', 'b' := 'y') is the same…
Original Post: Some Neat New R Notations
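Since the excerpt says := essentially does the job of stats::setNames(), a base-R sketch may help make the correspondence concrete (this equivalence is our restatement, not code from the post):

```r
# Base-R sketch of what the named map builder computes:
# `names := values` is essentially setNames(values, names).
nm <- c("a", "b")
vals <- c("x", "y")
m <- setNames(vals, nm)
# m is the named vector c(a = "x", b = "y")
stopifnot(identical(m, c(a = "x", b = "y")))
```

The advantage of the := form is purely notational: the names sit on the left, where inline name assignment usually puts them.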

Is dplyr Easily Comprehensible?

dplyr is one of the most popular R packages. It is powerful and important. But is it in fact easily comprehensible? dplyr makes sense to those of us who use it a lot, and we can teach part-time R users a lot of the common good use patterns. But is it an easy task to study and characterize dplyr itself? Please take our advanced dplyr quiz to test your dplyr mettle. “Pop dplyr quiz, hot-shot! There is data in a pipe. What does each verb do?”
Original Post: Is dplyr Easily Comprehensible?

Thank You For The Very Nice Comment

Somebody nice reached out and gave us this wonderful feedback on our new Supervised Learning in R: Regression (paid) video course: “Thanks for a wonderful course on DataCamp on XGBoost and Random forest. I was struggling with Xgboost earlier and Vtreat has made my life easy now :)” Supervised Learning in R: Regression covers a lot, as it treats predicting probabilities as a type of regression. Nina and I are very proud of this course and think it is very much worth your time (for the beginning through advanced R user). vtreat is a statistically sound data cleaning and preparation tool, introduced towards the end of the course. R users who try vtreat find it makes training and applying models much easier. vtreat is distributed as a free open-source package available on CRAN. If you are doing predictive modeling in…
Original Post: Thank You For The Very Nice Comment