Thank You For The Very Nice Comment

Somebody nice reached out and gave us this wonderful feedback on our new Supervised Learning in R: Regression (paid) video course:

“Thanks for a wonderful course on DataCamp on XGBoost and Random forest. I was struggling with Xgboost earlier and Vtreat has made my life easy now :).”

Supervised Learning in R: Regression covers a lot, as it treats predicting probabilities as a type of regression. Nina and I are very proud of this course and think it is very much worth your time (for the beginning through advanced R user). vtreat is a statistically sound data cleaning and preparation tool introduced towards the end of the course. R users who try vtreat find it makes training and applying models much easier. vtreat is distributed as a free open-source package available on CRAN. If you are doing predictive modeling in…
Original Post: Thank You For The Very Nice Comment

Supervised Learning in R: Regression

We are very excited to announce a new (paid) Win-Vector LLC video training course: Supervised Learning in R: Regression, now available on DataCamp. The course is primarily authored by Dr. Nina Zumel (our chief of course design) with contributions from Dr. John Mount. This course will get you quickly up to speed, covering: What is regression? (Hint: it is the art of making good numeric predictions, one of the most important tasks in data science, machine learning, or statistics.) When does it work, and when does it not work? How to move fluidly from basic ordinary least squares to Kaggle-winning methods such as gradient boosted trees. All of this is demonstrated using R, with many worked examples and exercises. We worked very hard to make this course very much worth your time.
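As a small illustration of the starting point named above, here is a hedged sketch of ordinary least squares in base R. It is not taken from the course: the built-in mtcars data, the predictors wt and hp, and the names fit, new_cars, preds, and rmse are all assumptions made for this example.

```r
# Ordinary least squares with base R (illustrative only, not course material).
# Assumption: predict fuel economy (mpg) from weight (wt) and horsepower (hp).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Numeric predictions for two hypothetical new cars.
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
preds <- predict(fit, newdata = new_cars)
print(preds)

# Root mean squared error on the training data, one common quality measure.
rmse <- sqrt(mean(residuals(fit)^2))
print(rmse)
```

Training error such as this is optimistic; held-out data gives a more honest estimate of how well the predictions will work.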
Original Post: Supervised Learning in R: Regression

More on “The Part-Time R-User”

I have some more thoughts on the topic: “the part-time R-user.” I am thinking a bit more about the diversity of R users. It occurs to me that simply dividing R users into two groups, beginning and advanced, neglects a very important group: the part-time R user. This leaves us teachers and package developers with an unfortunate bias. The concept of “beginning R user” implies a user who has near infinite time to adapt to our advanced R user work style and other nonsense. “Beginning” is a transient state; one feels we can temporarily accommodate the beginners on our path to assuming them away. However, for a language such as R, which deliberately targets non-programmer populations (such as statisticians, scientists, medical professionals, and more), we must assume there is a permanent population of users that have other things going on in their…
Original Post: More on “The Part-Time R-User”

Let’s Have Some Sympathy For The Part-time R User

When I started writing about methods for better “parametric programming” interfaces for R dplyr users in December of 2016 I encountered three divisions in the audience: dplyr users who had such a need, and wanted such extensions; dplyr users who did not have such a need (“we always know the column names”); and dplyr users who found the then-current fairly complex “underscore” and lazyeval system sufficient for the task. Needing name substitution is a problem an advanced full-time R user can solve on their own. However, a part-time R user would greatly benefit from a simple, reliable, readable, documented, and comprehensible packaged solution.

Background: Roughly, I suggested two possible methods for making the task easier: Renaming views for data.frames. I have now implemented the idea as a call-scoped concept in replyr::replyr_apply_f_mapped() (“call-scoped”, meaning the re-mapping lasts for the duration…
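To make the name substitution problem concrete, here is a minimal sketch of a function whose column names arrive as strings at run time. The helper name mean_by and its arguments are hypothetical, and the code uses the rlang::sym() plus !! idiom available in dplyr 0.7 and later, not the replyr function named above.

```r
library(dplyr)

# Hypothetical helper: group_col and value_col are strings, so the
# column names are not known when the function is written.
mean_by <- function(data, group_col, value_col) {
  data %>%
    group_by(!!rlang::sym(group_col)) %>%
    summarise(mean_value = mean(!!rlang::sym(value_col), na.rm = TRUE)) %>%
    ungroup()
}

# The same function now works for any choice of columns.
res <- mean_by(mtcars, "cyl", "mpg")
print(res)
```

A user who “always knows the column names” never needs this; the need shows up once an analysis is wrapped into a reusable function.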
Original Post: Let’s Have Some Sympathy For The Part-time R User

More documentation for Win-Vector R packages

The Win-Vector public R packages now all have new pkgdown documentation sites! (And, a thank-you to Hadley Wickham for developing the pkgdown tool.) Please check them out (hint: vtreat is our favorite). The package sites: For more on all of these packages, please see the Win-Vector blog.
Original Post: More documentation for Win-Vector R packages

Tutorial: Using seplyr to Program Over dplyr

seplyr is an R package that makes it easy to program over dplyr 0.7.*. To illustrate this we will work an example. Suppose you had worked out a dplyr pipeline that performed an analysis you were interested in. For an example we could take something similar to one of the examples from the dplyr 0.7.0 announcement.

    suppressPackageStartupMessages(library("dplyr"))
    packageVersion("dplyr")
    ## [1] '0.7.2'

    cat(colnames(starwars), sep = '\n')
    ## name
    ## height
    ## mass
    ## hair_color
    ## skin_color
    ## eye_color
    ## birth_year
    ## gender
    ## homeworld
    ## species
    ## films
    ## vehicles
    ## starships

    starwars %>%
      group_by(homeworld) %>%
      summarise(mean_height = mean(height, na.rm = TRUE),
                mean_mass = mean(mass, na.rm = TRUE),
                count = n())
    ## # A tibble: 49 x 4
    ##   homeworld mean_height mean_mass count
    ##
    ## 1 Alderaan 176.3333 64.0 3
    ## 2 Aleen Minor 79.0000 15.0 1
    ## 3 Bespin 175.0000 79.0…
Original Post: Tutorial: Using seplyr to Program Over dplyr

seplyr update

The development version of my new R package seplyr is performing in practical applications with dplyr 0.7.* much better than even I (the seplyr package author) expected. I think I have hit a very good set of trade-offs, and I have now spent significant time creating documentation and examples. I wish there had been such a package weeks ago, and that I had started using this approach in my own client work at that time. If you are already a dplyr user I strongly suggest trying seplyr in your own analysis projects. Please see here for details.
Original Post: seplyr update

dplyr 0.7 Made Simpler

I have been writing a lot (too much) on the R topics dplyr/rlang/tidyeval lately. The reason is: major changes were recently announced. If you are going to use dplyr well and correctly going forward, you may need to understand some of the new issues (if you don’t use dplyr you can safely skip all of this). I am trying to work out (publicly) how to best incorporate the new methods into: real world analyses, reusable packages, and teaching materials. I think some of the apparent discomfort on my part comes from my feeling that dplyr never really gave standard evaluation (SE) a fair chance. In my opinion: dplyr is based strongly on non-standard evaluation (NSE, originally through lazyeval and now through rlang/tidyeval) more by taste and choice than by actual analyst benefit or need. dplyr isn’t my package,…
Original Post: dplyr 0.7 Made Simpler

Better Grouped Summaries in dplyr

For R dplyr users one of the promises of the new rlang/tidyeval system is an improved ability to program over dplyr itself. In particular, to add new verbs that encapsulate previously compound steps into better self-documenting atomic steps. Let’s take a look at this capability. First let’s start dplyr.

    suppressPackageStartupMessages(library("dplyr"))
    packageVersion("dplyr")
    ## [1] '0.7.1.9000'

A dplyr pattern that I have seen used often is the “group_by() %>% mutate()” pattern. This historically has been shorthand for a “group_by() %>% summarize()” followed by a join(). It is easiest to show by example. The following code:

    mtcars %>%
      group_by(cyl, gear) %>%
      mutate(group_mean_mpg = mean(mpg),
             group_mean_disp = mean(disp)) %>%
      select(cyl, gear, mpg, disp, group_mean_mpg, group_mean_disp) %>%
      head()
    ## # A tibble: 6 x 6
    ## # Groups: cyl, gear [4]
    ##   cyl gear mpg disp group_mean_mpg group_mean_disp
    ##
    ## ##…
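The equivalence described above can be checked directly. The following is a sketch under stated assumptions, not code from the original post: it computes the group mean of mpg both ways on mtcars and compares the results. The names by_mutate, group_means, by_join, and same are introduced only for this illustration.

```r
suppressPackageStartupMessages(library(dplyr))

# Form 1: grouped mutate adds the group mean to every row.
by_mutate <- mtcars %>%
  group_by(cyl, gear) %>%
  mutate(group_mean_mpg = mean(mpg)) %>%
  ungroup() %>%
  select(cyl, gear, mpg, group_mean_mpg)

# Form 2: grouped summarize builds one row per group ...
group_means <- mtcars %>%
  group_by(cyl, gear) %>%
  summarize(group_mean_mpg = mean(mpg)) %>%
  ungroup()

# ... and a join copies each group's value back onto the original rows.
by_join <- mtcars %>%
  left_join(group_means, by = c("cyl", "gear")) %>%
  select(cyl, gear, mpg, group_mean_mpg)

# Both forms keep the original row order, so the columns can be compared.
same <- isTRUE(all.equal(by_mutate$group_mean_mpg, by_join$group_mean_mpg))
print(same)
```

The two-step form is longer, but it makes the intermediate per-group table explicit, which is the compound step a new verb could encapsulate.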
Original Post: Better Grouped Summaries in dplyr

What is magrittr’s future in the tidyverse?

For many R users the magrittr pipe is a popular way to arrange computation and famously part of the tidyverse. The tidyverse itself is a rapidly evolving centrally controlled package collection. The tidyverse authors publicly appear to be interested in re-basing the tidyverse in terms of their new rlang/tidyeval package. So it is natural to wonder: what is the future of magrittr (a pre-rlang/tidyeval package) in the tidyverse? For instance: here is a draft of rlang/tidyeval based pipe (from one of the primary rlang/tidyeval authors). This pipe even fixes dplyr issue 2726:

    # do NOT perform the source step directly!
    # Instead, save and inspect the source code from:
    # https://gist.github.com/lionel-/10cd649b31f11512e4aea3b7a98fe381
    # Output printed is the result of an example in the gist.
    source("https://gist.githubusercontent.com/lionel-/10cd649b31f11512e4aea3b7a98fe381/raw/b8804e41424a4f721ce21292a7ec9c35b5f3689d/pipe.R")
    #> List of 1
    #> $ :List of 1
    #> ..$ :List of 2
    #>…
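For readers less familiar with the pipe under discussion, here is a minimal sketch of what the released magrittr %>% does: it passes the left-hand value as the first argument of the right-hand call. This uses the CRAN magrittr package, not the draft rlang/tidyeval pipe from the gist, and the names x, piped, and nested are assumptions made for this example.

```r
library(magrittr)

x <- c(1, 4, 9)

# Piped form: reads left to right, one step per stage.
piped <- x %>% sqrt() %>% sum()

# Equivalent nested form: reads inside out.
nested <- sum(sqrt(x))

print(identical(piped, nested))
```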
Original Post: What is magrittr’s future in the tidyverse?