Advisory on Multiple Assignment dplyr::mutate() on Databases

I currently advise R dplyr users to take care when using multiple-assignment dplyr::mutate() commands on databases.

(image: Kingroyos, Creative Commons Attribution-Share Alike 3.0 Unported License)

In this note I exhibit a troublesome example, and a systematic solution. First let's set up dplyr, our database, and some example data.

library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
packageVersion("dplyr")
## [1] '0.7.4'
packageVersion("dbplyr")
## [1] '1.2.0'

db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
d <- dplyr::copy_to(
  db,
  data.frame(xorig = 1:5, yorig = sin(1:5)),
  "d")

Now suppose somewhere in one of your projects somebody (maybe not even you) has written code that looks somewhat like the following.

d %>% mutate(
  delta = 0,
  x0 = xorig + delta,
  y0 = yorig…
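The hazard is that a single mutate() both creates a column (delta) and then uses it in later assignments; on a database backend the later assignments may not see the value just created. As a minimal sketch of one common workaround (my illustration, not necessarily the post's systematic solution): split dependent assignments into sequential mutate() stages so that no stage reads a column created in that same stage.

# sketch only: each mutate() stage reads only columns that
# already exist on the database side before the stage runs
d_safe <- d %>%
  mutate(delta = 0) %>%            # stage 1: create delta
  mutate(x0 = xorig + delta,       # stage 2: now safe to use delta
         y0 = yorig + delta)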
Original Post: Advisory on Multiple Assignment dplyr::mutate() on Databases

Data Reshaping with cdata

I’ve just shared a short webcast on data reshaping in R using the cdata package. (link) We also have two really nifty articles on the theory and methods. Please give it a try! This is the material I recently presented at the January 2017 BARUG Meetup.
Original Post: Data Reshaping with cdata

Base R can be Fast

“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that “packages written in C/C++ are faster than R code.” The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.

Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types. The graph summarizes the performance of four solutions to the “scoring logistic regression by hand” problem:

- Optimized Base R: a specialized “pre-allocate and work with vectorized indices” method. This is fast as it is able to express our particular task in…
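For context, here is a minimal sketch (my illustration, not the benchmark code) of what "scoring logistic regression by hand" looks like in vectorized base R: the entire score is computed with whole-vector arithmetic, with no per-row loop.

# sketch only: vectorized "by hand" logistic scoring in base R
score_logistic <- function(X, beta) {
  # X: numeric matrix of explanatory variables
  # beta: coefficient vector, first element the intercept
  eta <- beta[[1]] + X %*% beta[-1]  # linear predictor, one vector op
  as.numeric(1 / (1 + exp(-eta)))    # sigmoid link, also fully vectorized
}

set.seed(2018)
X <- matrix(rnorm(10 * 2), ncol = 2)
head(score_logistic(X, c(0.5, -1, 2)))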
Original Post: Base R can be Fast

Setting up RStudio Server quickly on Amazon EC2

I have recently been working on projects using Amazon EC2 (elastic compute cloud) and RStudio Server. I thought I would share some of my working notes. Amazon EC2 supplies near-instant access to on-demand disposable computing in a variety of sizes (billed in hours). RStudio Server supplies an interactive user interface to your remote R environment that is nearly indistinguishable from a local RStudio console. The idea is: for a few dollars you can work interactively on R tasks requiring hundreds of GB of memory and tens of CPUs and GPUs. If you are already an Amazon EC2 user with some Unix experience, it is very easy to quickly stand up a powerful R environment, which is what I will demonstrate in this note. To follow these notes you must already have an Amazon EC2 account, some experience using the…
Original Post: Setting up RStudio Server quickly on Amazon EC2

rquery: Fast Data Manipulation in R

Win-Vector LLC recently announced the rquery R package, an operator-based query generator. In this note I want to share some exciting and favorable initial rquery benchmark timings. Let’s take a look at rquery’s new “ad hoc” mode (made convenient through wrapr’s new “wrapr_applicable” feature). This is where rquery works on in-memory data.frame data by sending it to a database, processing on the database, and then pulling the data back. We concede this is a strange way to process data, and not rquery’s primary purpose (the primary purpose being generation of safe, high-performance SQL for big data engines such as Spark and PostgreSQL). However, our experiments show that it is in fact a competitive technique. We’ve summarized the results of several experiments (experiment details here) in the following graph (graphing code here). The benchmark task was hand implementing logistic…
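To make "ad hoc" mode concrete, here is a minimal sketch under assumptions: the names (local_td, extend, select_columns) are taken from the released rquery package rather than quoted from this post, and I assume the documented behavior that piping a data.frame into a pipeline with wrapr's %.>% triggers the copy-to-database, execute, copy-back round trip.

# sketch, assuming rquery's documented ad hoc behavior
library("rquery")
library("wrapr")

d <- data.frame(rowNum = 1:4, assessmentTotal = c(5, 2, 3, 4))

# build an operator pipeline against a description of the table
ops <- local_td(d) %.>%
  extend(., probability := exp(0.237 * assessmentTotal)) %.>%
  select_columns(., c("rowNum", "probability"))

# ad hoc mode: pipe the in-memory data.frame directly into the pipeline;
# rquery ships it to a database, runs the generated SQL, and returns it
d %.>% ops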
Original Post: rquery: Fast Data Manipulation in R

New wrapr R pipeline feature: wrapr_applicable

The R package wrapr now has a neat new feature: “wrapr_applicable”. This feature allows objects to declare a surrogate function to stand in for the object in wrapr pipelines. It is a powerful technique, and it allowed us to quickly implement a convenient new ad hoc query mode for rquery. A small effort in making a package “wrapr aware” appears to have a fairly large payoff.
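A minimal sketch of the idea, with one caveat: I am using the apply_right() S3 generic through which current wrapr exposes right-dispatch, as a stand-in; the original "wrapr_applicable" declaration mechanics may differ in detail.

# sketch: an object declaring how "piping into it" should behave
library("wrapr")

# not a function, but it wants to act like one in a wrapr pipeline
scaler <- structure(list(factor = 10), class = "my_scaler")

# the surrogate behavior: piping x into a my_scaler multiplies by factor
apply_right.my_scaler <- function(pipe_left_arg, pipe_right_arg,
                                  pipe_environment, left_arg_name,
                                  pipe_string, right_arg_name) {
  pipe_left_arg * pipe_right_arg$factor
}

5 %.>% scaler  # returns 50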
Original Post: New wrapr R pipeline feature: wrapr_applicable

Big cdata News

I have some big news about our R package cdata. We have greatly improved the calling interface, and Nina Zumel has just written the definitive introduction to cdata. cdata is our general coordinatized data tool. It is what powers the deep learning performance graph (here demonstrated with R and Keras) that I announced a while ago. However, cdata is much more than that. cdata provides a family of general transforms that include pivot/unpivot (or tidyr::spread/tidyr::gather) as easy special cases. Nina refused to write the article until we re-factored the API to be even more teachable (and therefore more learnable, and more useful). After her re-design (adding the concepts of both concrete records and abstract records to the coordinatized data theory) the system teaches itself. It is actually hard to remember you are graphically specifying potentially involved, difficult, and…
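As a quick taste, here is a minimal sketch of the pivot/unpivot special cases; the function and argument names are my assumption from the released cdata package, not quoted from this announcement.

# sketch: pivot/unpivot as cdata special cases
library("cdata")

d <- data.frame(id = c(1, 2), AUC = c(0.7, 0.8), R2 = c(0.4, 0.5))

# unpivot (the tidyr::gather analogue): one row per measurement
long <- unpivot_to_blocks(
  d,
  nameForNewKeyColumn = "measure",
  nameForNewValueColumn = "value",
  columnsToTakeFrom = c("AUC", "R2"))

# pivot (the tidyr::spread analogue): back to one row per id
pivot_to_rowrecs(
  long,
  columnToTakeKeysFrom = "measure",
  columnToTakeValuesFrom = "value",
  rowKeyColumns = "id")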
Original Post: Big cdata News

Announcing rquery

We are excited to announce the rquery R package. rquery is Win-Vector LLC’s currently-in-development big data query tool for R. rquery supplies a set of operators inspired by Edgar F. Codd’s relational algebra (updated to reflect lessons learned from working with R, SQL, and dplyr at big data scale in production). As an example, rquery operators allow us to write our earlier “treatment and control” example as follows.

dQ <- d %.>%
  extend_se(.,
            if_else_block(
              testexpr = "rand()>=0.5",
              thenexprs = qae(
                a_1 := 'treatment',
                a_2 := 'control'),
              elseexprs = qae(
                a_1 := 'control',
                a_2 := 'treatment'))) %.>%
  select_columns(., c("rowNum", "a_1", "a_2"))

rquery pipelines are first-class objects, so we can extend them, save them, and even print them.

cat(format(dQ))

table('d') %.>%
  extend(., ifebtest_1 := rand() >= 0.5) %.>%
  extend(.,
         a_1 := ifelse(ifebtest_1, "treatment", a_1),
         a_2 := ifelse(ifebtest_1, "control", a_2)) %.>%
  extend(.,
         a_1 := ifelse(!( ifebtest_1…
Original Post: Announcing rquery

Plotting Deep Learning Model Performance Trajectories

I am excited to share a new deep learning model performance trajectory graph. Here is an example, produced based on Keras in R using ggplot2. The ideas include:

- We plot model performance as a function of training epoch, data set (training and validation), and metric.
- For legibility we facet on metric, and facets are adjusted so all facets have the same visual interpretation (“up is better”).
- The only solid horizontal curve is validation performance; training performance is indicated only as the top of a shaded region that depicts the degree of over-fit.

Obviously it is going to take some training and practice to read these graphs quickly: but that is pretty much true for all visualizations. The methods work with just about any staged machine learning algorithm (neural nets, deep learning, boosting, random forests, and more) and can also be adapted…
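A minimal sketch of such a graph (synthetic numbers standing in for a real Keras training history; this is my re-creation with ggplot2, not the package's plotting code):

# sketch: facet per metric; solid curve = validation;
# top of shaded ribbon = training; ribbon height = degree of over-fit
library("ggplot2")

epoch <- 1:20
history <- rbind(
  data.frame(epoch = epoch, metric = "minus binary cross entropy",
             validation = -0.40 - 2 * exp(-epoch / 3),
             training   = -0.30 - 2 * exp(-epoch / 2)),
  data.frame(epoch = epoch, metric = "accuracy",
             validation = 0.95 - 0.5 * exp(-epoch / 3),
             training   = 0.99 - 0.5 * exp(-epoch / 2)))

ggplot(history, aes(x = epoch)) +
  geom_ribbon(aes(ymin = validation, ymax = training),
              fill = "blue", alpha = 0.2) +        # over-fit region
  geom_line(aes(y = validation), color = "blue") + # validation: solid
  facet_wrap(~ metric, scales = "free_y") +        # "up is better" in both
  ggtitle("model performance by epoch, data set, and metric")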
Original Post: Plotting Deep Learning Model Performance Trajectories

How to Greatly Speed Up Your Spark Queries

For some time we have been teaching R users “when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis.” The idea behind the advice is that working with fewer columns makes for quicker queries.

(photo: Jacques Henri Lartigue, 1912)

The issue arises because wide tables (200 to 1000 columns) are quite common in big-data analytics projects. Often these are “denormalized marts” that are used to drive many different projects. For any one project, only a small subset of the columns may be relevant in a calculation. Some wonder if this is really an issue, or if it is something one can ignore in the hope that the downstream query optimizer fixes the problem. In this note we will show the effect is real. Let’s set up our experiment. The data…
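As a preview of the shape of the advice (a sketch with made-up data, not the benchmark code; it assumes sparklyr plus a local Spark install), narrowing early is just an up-front select():

# sketch: narrow a wide Spark table before doing any real work
library("dplyr")
library("sparklyr")

sc <- spark_connect(master = "local")

# stand-in "wide table": 100 payload columns plus a key column
d_local <- as.data.frame(matrix(runif(1000 * 100), ncol = 100))
d_local$subjectID <- rep(1:100, each = 10)
d_wide <- copy_to(sc, d_local, "d_wide")

# narrow early: everything after the select() moves only two columns
d_wide %>%
  select(subjectID, V1) %>%
  group_by(subjectID) %>%
  summarize(meanV1 = mean(V1, na.rm = TRUE)) %>%
  collect()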
Original Post: How to Greatly Speed Up Your Spark Queries