Supercharge your R code with wrapr

I would like to demonstrate some helpful wrapr R notation tools that really neaten up your R code.

Named Map Builder

First I will demonstrate wrapr's "named map builder": :=. The named map builder adds names to vectors and lists using nice "names on the left and values on the right" notation. For example, to build a named vector mapping names c("a", "b") to values c(1, 2) we could write the following R code.

c(a = 1, b = 2)
## a b 
## 1 2

Or we can write:

c("a" = 1, "b" = 2)
## a b 
## 1 2

Using wrapr we can write the same thing with quoted names using :=.

library("data.table")  # load data.table before wrapr to avoid := contention
suppressPackageStartupMessages(library("dplyr"))
library("wrapr")
## 
## Attaching package: 'wrapr'
## The following object is masked…
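To complete the truncated demonstration, here is a small sketch of the quoted-name form the excerpt is leading up to (my own completion, using only the := usage the post describes):

c("a" := 1, "b" := 2)
## a b 
## 1 2

# the names can also be supplied as a vector on the left
c("a", "b") := c(1, 2)
## a b 
## 1 2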
Original Post: Supercharge your R code with wrapr

Latest vtreat up on CRAN

There is a new version of the R package vtreat now up on CRAN. vtreat is an essential data preparation system for predictive modeling that helps defend your predictive modeling work against real-world data issues, including:

- High-cardinality categorical variables
- Rare levels (including new or novel levels during application) in categorical variables
- Missing data (random or systematic)
- Irrelevant variables/columns
- Nested model bias, and other over-fit issues

vtreat also includes excellent, citable documentation: vtreat: a data.frame Processor for Predictive Modeling.

For this release I want to thank everybody who generously donated their time to submit an issue or build a git pull-request. In particular:

- Vadim Khotilovich, who found and fixed a major performance problem in the y-stratified sampling
- Lawrence Wu, who has been donating documentation fixes
- Peter Hurford, who has been donating documentation fixes
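As a quick illustration of the kind of work vtreat does (a minimal sketch of my own, with made-up column names; not code from the release announcement):

library("vtreat")

# small example frame with a categorical variable, a missing value,
# and a binary outcome y
d <- data.frame(
  x = c("a", "a", "b", "b", NA, "c"),
  z = c(1, 2, NA, 4, 5, 6),
  y = c(1, 1, 0, 1, 0, 1)
)

# design variable treatments for a categorical (binary) outcome
treatments <- designTreatmentsC(d, varlist = c("x", "z"),
                                outcomename = "y", outcometarget = 1)

# apply the treatment plan: the result is an all-numeric, NA-free frame
# ready for most modeling functions
d_treated <- prepare(treatments, d, pruneSig = NULL)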
Original Post: Latest vtreat up on CRAN

Advisory on Multiple Assignment dplyr::mutate() on Databases

I currently advise R dplyr users to take care when using multiple-assignment dplyr::mutate() commands on databases. In this note I exhibit a troublesome example, and a systematic solution.

First let's set up dplyr, our database, and some example data.

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

packageVersion("dplyr")
## [1] '0.7.4'
packageVersion("dbplyr")
## [1] '1.2.0'

db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
d <- dplyr::copy_to(
  db,
  data.frame(xorig = 1:5, yorig = sin(1:5)),
  "d")

Now suppose somewhere in one of your projects somebody (maybe not even you) has written code that looks somewhat like the following.

d %>%
  mutate(
    delta = 0,
    x0 = xorig + delta,
    y0 = yorig…
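The post's systematic solution is truncated above. As an illustration only (my own sketch of a common workaround, not necessarily the post's exact fix), one way to avoid dependent assignments inside a single mutate() on a database is to break them into sequential mutate() steps, so each step refers only to columns that already exist:

d %>%
  mutate(delta = 0) %>%
  mutate(x0 = xorig + delta,
         y0 = yorig + delta) %>%
  collect()

Each mutate() then translates to its own well-defined stage of generated SQL, at the cost of a slightly longer pipeline.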
Original Post: Advisory on Multiple Assignment dplyr::mutate() on Databases

Data Reshaping with cdata

I've just shared a short webcast on data reshaping in R using the cdata package (link). We also have two really nifty articles on the theory and methods. Please give it a try! This is the material I recently presented at the January 2017 BARUG Meetup.
Original Post: Data Reshaping with cdata

Base R can be Fast

"Base R" (call it "Pure R", "Good Old R", just don't call it "Old R" or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that "packages written in C/C++ are faster than R code." The benchmark results of "rquery: Fast Data Manipulation in R" really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.

Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types. The graph summarizes the performance of four solutions to the "scoring logistic regression by hand" problem:

- Optimized Base R: a specialized "pre-allocate and work with vectorized indices" method. This is fast as it is able to express our particular task in…
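To make "scoring logistic regression by hand" concrete, here is a minimal vectorized base R sketch (my own illustration of the general idea, not the post's benchmark code):

set.seed(2018)
n <- 100000
x <- matrix(rnorm(2 * n), ncol = 2)   # feature matrix
beta <- c(0.5, -0.25)                 # coefficients
intercept <- 0.1

# one matrix multiply plus a vectorized sigmoid: no per-row loops,
# no repeated allocation inside a loop
link <- as.numeric(x %*% beta) + intercept
score <- 1 / (1 + exp(-link))
head(score)

The pre-allocation and vectorized-indexing tricks the post describes push this idea further for the full benchmark task.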
Original Post: Base R can be Fast

Setting up RStudio Server quickly on Amazon EC2

I have recently been working on projects using Amazon EC2 (elastic compute cloud), and RStudio Server. I thought I would share some of my working notes. Amazon EC2 supplies near instant access to on-demand disposable computing in a variety of sizes (billed in hours). RStudio Server supplies an interactive user interface to your remote R environment that is nearly indistinguishable from a local RStudio console. The idea is: for a few dollars you can work interactively on R tasks requiring hundreds of GB of memory and tens of CPUs and GPUs. If you are already an Amazon EC2 user with some Unix experience it is very easy to quickly stand up a powerful R environment, which is what I will demonstrate in this note. To follow these notes you must already have an Amazon EC2 account, some experience using the…
Original Post: Setting up RStudio Server quickly on Amazon EC2

rquery: Fast Data Manipulation in R

Win-Vector LLC recently announced the rquery R package, an operator-based query generator. In this note I want to share some exciting and favorable initial rquery benchmark timings.

Let's take a look at rquery's new "ad hoc" mode (made convenient through wrapr's new "wrapr_applicable" feature). This is where rquery works on in-memory data.frame data by sending it to a database, processing it on the database, and then pulling the data back. We concede this is a strange way to process data, and not rquery's primary purpose (the primary purpose being generation of safe, high-performance SQL for big data engines such as Spark and PostgreSQL). However, our experiments show that it is in fact a competitive technique.

We've summarized the results of several experiments (experiment details here) in the following graph (graphing code here). The benchmark task was hand-implementing logistic…
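To make the round trip concrete, here is a rough sketch of what "send to the database, process, pull back" amounts to (my own illustration against SQLite; rq_copy_to(), extend(), select_columns(), and execute() are names from later rquery releases, and the post's convenience ad hoc interface may differ):

library("rquery")
library("wrapr")   # supplies the %.>% pipe

db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
d_local <- data.frame(x = 1:5, y = c(1, 0, 1, 1, 0))

# copy the in-memory data to the database
d_remote <- rq_copy_to(db, "d", d_local,
                       temporary = TRUE, overwrite = TRUE)

# build an operator tree, run it on the database, pull the result back
ops <- d_remote %.>%
  extend(., score := 0.1 * x + y) %.>%
  select_columns(., c("x", "score"))
res <- execute(db, ops)

DBI::dbDisconnect(db)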
Original Post: rquery: Fast Data Manipulation in R

New wrapr R pipeline feature: wrapr_applicable

The R package wrapr now has a neat new feature: “wrapr_applicable”. This feature allows objects to declare a surrogate function to stand in for the object in wrapr pipelines. It is a powerful technique and allowed us to quickly implement a convenient new ad hoc query mode for rquery. A small effort in making a package “wrapr aware” appears to have a fairly large payoff.
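The excerpt does not show the mechanism itself. As a rough sketch of the kind of dispatch involved (my own example using wrapr's pipe-extension S3 generic apply_right() from later wrapr versions; the original "wrapr_applicable" interface may differ in detail), an object can supply its own behavior when it appears on the right-hand side of %.>%:

library("wrapr")

# a hypothetical object representing "add a constant"
adder <- list(increment = 5)
class(adder) <- "adder"

# S3 method the dot pipe consults for objects of this class
apply_right.adder <- function(pipe_left_arg,
                              pipe_right_arg,
                              pipe_environment,
                              left_arg_name,
                              pipe_string,
                              right_arg_name) {
  pipe_left_arg + pipe_right_arg$increment
}

7 %.>% adder   # should yield 12 if dispatch works as sketched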
Original Post: New wrapr R pipeline feature: wrapr_applicable

Big cdata News

I have some big news about our R package cdata. We have greatly improved the calling interface and Nina Zumel has just written the definitive introduction to cdata.

cdata is our general coordinatized data tool. It is what powers the deep learning performance graph (here demonstrated with R and Keras) that I announced a while ago. However, cdata is much more than that. cdata provides a family of general transforms that include pivot/unpivot (or tidyr::spread/tidyr::gather) as easy special cases.

Nina refused to write the article on it until we re-factored the API to be even more teachable (and therefore more learnable, and more useful). After her re-design (adding the concepts of both concrete records and abstract records to the coordinatized data theory) the system teaches itself. It is actually hard to remember you are graphically specifying potentially involved, difficult, and…
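For a sense of what these transforms look like in code, here is a minimal unpivot/pivot round trip (my own sketch; unpivot_to_blocks() and pivot_to_rowrecs() are names from a later cdata release, and the refactored interface the post refers to may differ in detail):

library("cdata")

# one "row record" per subject
d_wide <- data.frame(
  subject = c("s1", "s2"),
  test_score = c(0.8, 0.6),
  retest_score = c(0.9, 0.7)
)

# unpivot: move the measurement columns into key/value blocks
d_tall <- unpivot_to_blocks(
  d_wide,
  nameForNewKeyColumn   = "measurement",
  nameForNewValueColumn = "value",
  columnsToTakeFrom     = c("test_score", "retest_score"))

# pivot back: recover one row per subject
d_back <- pivot_to_rowrecs(
  d_tall,
  columnToTakeKeysFrom   = "measurement",
  columnToTakeValuesFrom = "value",
  rowKeyColumns          = "subject")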
Original Post: Big cdata News

Announcing rquery

We are excited to announce the rquery R package. rquery is Win-Vector LLC's currently-in-development big data query tool for R. rquery supplies a set of operators inspired by Edgar F. Codd's relational algebra (updated to reflect lessons learned from working with R, SQL, and dplyr at big data scale in production).

As an example: rquery operators allow us to write our earlier "treatment and control" example as follows.

dQ <- d %.>%
  extend_se(.,
            if_else_block(
              testexpr = "rand()>=0.5",
              thenexprs = qae(
                a_1 := 'treatment',
                a_2 := 'control'),
              elseexprs = qae(
                a_1 := 'control',
                a_2 := 'treatment'))) %.>%
  select_columns(., c("rowNum", "a_1", "a_2"))

rquery pipelines are first-class objects; so we can extend them, save them, and even print them.

cat(format(dQ))

table('d') %.>%
 extend(.,
  ifebtest_1 := rand() >= 0.5) %.>%
 extend(.,
  a_1 := ifelse(ifebtest_1, "treatment", a_1),
  a_2 := ifelse(ifebtest_1, "control", a_2)) %.>%
 extend(.,
  a_1 := ifelse(!( ifebtest_1…
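Because the pipeline is an ordinary R object, its generated SQL can also be inspected directly. A small follow-up sketch (my own addition, assuming db is a DBI database handle and dQ is the pipeline above):

sql <- rquery::to_sql(dQ, db)
cat(sql)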
Original Post: Announcing rquery