Obstacles to performance in parallel programming

Making your code run faster is often the primary goal of parallel programming in R, but sometimes the effort of converting your code to a parallel framework leads only to disappointment, at least initially. Norman Matloff, author of Parallel Computing for Data Science: With Examples in R, C++ and CUDA, has shared chapter 2 of that book online, and it describes some of the issues that can lead to poor performance. They include:

- Communications overhead, particularly an issue with fine-grained parallelism consisting of a very large number of relatively small tasks
- Load balance, where the computing resources aren't contributing equally to the problem
- Impacts from use of RAM and virtual memory, such as cache misses and page faults
- Network effects, such as latency and bandwidth, that affect performance and communications overhead
- Interprocess conflicts and thread scheduling
- Data access and…
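The communications-overhead point is easy to see for yourself with nothing but base R's parallel package. In this small sketch (the timings will vary by machine), the same 10,000 trivial computations are dispatched first one task at a time, then pre-split into two large chunks:

```r
# Demonstrating communications overhead with base R's 'parallel' package.
# Sending many tiny tasks to workers individually can be slower than
# serial code; chunking the work amortizes the per-task overhead.
library(parallel)

cl <- makeCluster(2)

# Fine-grained: 10,000 trivial tasks, each a separate unit of work
fine <- system.time(parSapply(cl, 1:10000, function(i) i^2))

# Coarse-grained: the same work pre-split into 2 large chunks
chunks <- split(1:10000, rep(1:2, each = 5000))
coarse <- system.time(parLapply(cl, chunks, function(ch) ch^2))

stopCluster(cl)
# On most machines the chunked version is markedly faster, even though
# both compute exactly the same 10,000 squares.
```

The lesson generalizes: higher-level tools like foreach with an appropriate chunk size make the same trade-off for you.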
Original Post: Obstacles to performance in parallel programming

20 years of the R Core Group

The first “official” version of R, version 1.0.0, was released on February 29, 2000. But the R Project had already been underway for several years before then. R Core member Peter Dalgaard marked the anniversary yesterday with this tweet: It was twenty years ago today, Ross Ihaka got the band to play…. #rstats pic.twitter.com/msSpPz2kyA — Peter Dalgaard (@pdalgd) August 16, 2017 Twenty years ago, on August 16, 1997, the R Core Group was formed. Before that date, the committers to R were the project’s founders, Ross Ihaka and Robert Gentleman, along with Luke Tierney, Heiner Schwarte and Paul Murrell. The email above was the invitation for Kurt Hornik, Peter Dalgaard and Thomas Lumley to join as well. With the sole exception of Schwarte, all of the above remain members of the R Core Group, which has since expanded to 21 members.…
Original Post: 20 years of the R Core Group

How to build an image recognizer in R using just a few images

Microsoft Cognitive Services provides several APIs for image recognition, but if you want to build your own recognizer (or create one that works offline), you can use the new Image Featurizer capabilities of Microsoft R Server.  The process of training an image recognition system requires LOTS of images — millions and millions of them. The process involves feeding those images into a deep neural network, and during that process the network generates “features” from the image. These features might be versions of the image including just the outlines, or maybe the image with only the green parts. You could further boil those features down into a single number, say the length of the outline or the percentage of the image that is green. With enough of these “features”, you could use them in a traditional machine learning model to classify…
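To make the final step concrete — not the Image Featurizer API itself, but the idea of feeding image-derived features into a traditional model — here is a toy sketch in base R. The feature names and values are invented for illustration:

```r
# Toy sketch: once each image is reduced to a few numeric features,
# any traditional classifier can use them. These "features" are made up
# (hypothetical outline length and green percentage per image).
set.seed(42)
n <- 200
feats <- data.frame(
  outline_len = c(rnorm(n/2, 10, 2), rnorm(n/2, 13, 2)),
  pct_green   = c(rnorm(n/2, 0.2, 0.1), rnorm(n/2, 0.4, 0.1)),
  label       = factor(rep(c("cat", "plant"), each = n/2))
)

# A plain logistic regression as the "traditional machine learning model"
fit <- glm(label ~ outline_len + pct_green, data = feats, family = binomial)

# Score a new image from its features
p <- predict(fit, data.frame(outline_len = 13, pct_green = 0.4),
             type = "response")   # probability the image is "plant"
```

In practice the featurizer produces thousands of features per image rather than two, but the modeling step is exactly this shape.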
Original Post: How to build an image recognizer in R using just a few images

Buzzfeed trains an AI to find spy planes

Last year, Buzzfeed broke the story that US law enforcement agencies were using small aircraft to observe points of interest in US cities, thanks to analysis of public flight-records data. With the data journalism team no doubt realizing that the Flightradar24 data set hosted many more stories of public interest, the challenge lay in separating routine, day-to-day aircraft traffic from the more unusual, covert activities. So they trained a machine learning model to identify unusual flight paths in the data. The model, implemented in the R programming language, applies a random forest algorithm to identify flight patterns similar to those of covert aircraft identified in their earlier “Spies in the Skies” story. When that model was applied to the almost 20,000 flights in the Flightradar24 dataset, about 69 planes were flagged as possible surveillance aircraft. Several of those were…
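The general pattern — train a classifier on flight-level features, then flag flights with high predicted probability of being covert — can be sketched in a few lines of R. Buzzfeed used a random forest; here rpart (a decision-tree package shipped with R) stands in, and the features and data are invented:

```r
# Hedged sketch of the flag-the-outliers workflow. The features here
# (turn rate, loiter time) are hypothetical, and a single decision tree
# stands in for Buzzfeed's random forest.
library(rpart)
set.seed(1)

flights <- data.frame(
  turn_rate  = c(runif(95, 0, 2),  runif(5, 4, 6)),
  loiter_min = c(runif(95, 0, 10), runif(5, 60, 120)),
  covert     = factor(c(rep("no", 95), rep("yes", 5)))
)

fit <- rpart(covert ~ turn_rate + loiter_min, data = flights,
             method = "class",
             control = rpart.control(minsplit = 2, minbucket = 1))

# Flag flights whose predicted probability of "covert" is high
scores  <- predict(fit, flights)[, "yes"]
flagged <- which(scores > 0.5)
```

With labeled examples from the earlier story as training data, the same scoring step can rank thousands of unlabeled flights for human review.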
Original Post: Buzzfeed trains an AI to find spy planes

New Poll: Python vs R vs rest: What did you use in 2016-17 for Analytics, Data Science, Machine Learning tasks?

Python vs R vs Other – What did you use for Analytics, Data Science, Machine Learning work in 2016-17? Vote and we will analyze and report results and trends.
Original Post: New Poll: Python vs R vs rest: What did you use in 2016-17 for Analytics, Data Science, Machine Learning tasks?

Reproducibility: A cautionary tale from data journalism

Timo Grossenbacher, data journalist with Swiss Radio and TV in Zurich, had a bit of a surprise when he attempted to recreate the results of one of the R Markdown scripts published by SRF Data to accompany their data journalism story about vested interests of Swiss members of parliament. Upon re-running the analysis last week, he found that the results differed from those published in August 2015. There was no change to the R scripts or data in the intervening two-year period, so what caused the results to be different? Image credit: Timo Grossenbacher The version of R Timo was using had been updated, but that wasn’t the root cause of the problem. What had also changed was the version of the dplyr package used by the script: version 0.5.0 now, versus version 0.4.2 then. For some unknown…
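One minimal defensive step suggested by stories like this is to record the exact package versions alongside published results and check them on re-runs. A base-R sketch (dedicated tools such as checkpoint or packrat go much further):

```r
# Record the package versions an analysis was written against, and warn
# loudly on re-runs if the installed versions differ. The 'expected'
# vector here is illustrative.
expected <- c(dplyr = "0.4.2")   # versions used for the published results

check_versions <- function(expected) {
  for (pkg in names(expected)) {
    found <- as.character(packageVersion(pkg))
    if (found != expected[[pkg]])
      warning(sprintf("%s is %s, but results were produced with %s",
                      pkg, found, expected[[pkg]]))
  }
}

# check_versions(expected)   # run at the top of the analysis script
```

A warning at the top of the script won't prevent behavior changes like dplyr's, but it turns a silent discrepancy into a visible one.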
Original Post: Reproducibility: A cautionary tale from data journalism

In case you missed it: July 2017 roundup

In case you missed them, here are some articles from July of particular interest to R users. A tutorial on using the rsparkling package to apply H2O’s algorithms to data in HDInsight. Several exercises to learn parallel programming with the foreach package. A presentation on the R6 class system, by Winston Chang. Introducing “joyplots”, a ggplot2 extension for visualizing multiple time series or distributions (with a nod to Joy Division). SQL Server 2017, with many new R-related capabilities, is nearing release. Ali Zaidi on using neural embeddings with R and Spark to analyze Github comments. R ranks #6 in the 2017 IEEE Spectrum Top Programming Languages. Course materials on “Data Analysis for the Life Sciences”, from Rafael Irizarry. How to securely store API keys in R scripts with the “secret” package. An in-depth tutorial on implementing neural network algorithms in…
Original Post: In case you missed it: July 2017 roundup

dplyrXdf 0.10.0 beta prerelease

I’m happy to announce that version 0.10.0 beta of the dplyrXdf package is now available. You can get it from Github:

install_github("RevolutionAnalytics/dplyrXdf", build_vignettes=FALSE)

This is a major update to dplyrXdf that adds the following features:

- Support for the tidyeval framework that powers the latest version of dplyr
- Support for Spark and Hadoop clusters, and for files in HDFS
- Several utility functions to ease working with files and datasets
- Many bugfixes and workarounds for issues with the underlying RevoScaleR functions

This (pre-)release of dplyrXdf requires Microsoft R Server or Client version 8.0 or higher, and dplyr 0.7 or higher. If you’re using R Server, dplyr 0.7 won’t be in the MRAN snapshot that is your default repo, but you can get it from CRAN:

install.packages("dplyr", repos="https://cloud.r-project.org")

The tidyeval framework completely changes the way in which dplyr handles standard evaluation. Previously, if…
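The idea behind tidyeval's quasiquotation can be sketched with base R alone (dplyr's actual tools are quo(), enquo() and the !! operator): an expression is captured unevaluated, spliced into a larger expression, and evaluated later in a chosen context — which is what lets dplyr verbs, and their dplyrXdf equivalents, accept bare column names.

```r
# Quasiquotation sketched with base R's quote()/bquote()/eval(); this
# illustrates the concept only, not dplyr's actual API.
df <- data.frame(x = 1:5, y = 6:10)

col    <- quote(x)              # capture a "column name" unevaluated
expr   <- bquote(.(col) + y)    # splice it into a larger expression: x + y
result <- eval(expr, envir = df)  # evaluate with df's columns in scope
# result is df$x + df$y
```

dplyr 0.7's quo()/!! pair plays the role of quote()/.() here, with careful handling of the environments each expression came from.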
Original Post: dplyrXdf 0.10.0 beta prerelease

Tutorial: Deep Learning with R on Azure with Keras and CNTK

by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft) Microsoft’s Cognitive Toolkit (better known as CNTK) is a commercial-grade, open-source framework for deep learning tasks. At present CNTK does not have a native R interface, but it can be accessed through Keras, a high-level API that wraps several deep learning backends, including CNTK, TensorFlow, and Theano, and simplifies the construction of deep neural networks. The latest version of CNTK (2.1) supports Keras. The RStudio team has developed an R interface for Keras, making it possible to run different deep learning backends, including CNTK, from within an R session. This tutorial illustrates how to quickly spin up an Ubuntu-based Azure Data Science Virtual Machine (DSVM) and configure a Keras and CNTK environment. An Azure DSVM is a curated virtual machine image coming with an…
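On the R side, the setup boils down to a short configuration fragment — a hedged sketch, assuming the DSVM already has CNTK and the Python keras package installed:

```r
# Sketch of the R-side configuration (details will vary by environment).
install.packages("keras")    # RStudio's R interface to Keras

# Multi-backend Keras chooses its backend from the KERAS_BACKEND
# environment variable (or ~/.keras/keras.json), so set it before
# Keras is first loaded:
Sys.setenv(KERAS_BACKEND = "cntk")
library(keras)
```

From there, model definition and training use the same keras R functions regardless of which backend is doing the computation.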
Original Post: Tutorial: Deep Learning with R on Azure with Keras and CNTK

Tutorial: Publish an R function as a SQL Server stored procedure with the sqlrutils package

In SQL Server 2016 and later, you can publish an R function to the database as a stored procedure. This makes it possible to run your R function on the SQL Server itself, which makes the power of that server available for R computations, and also eliminates the time required to move data to and from the server. It also makes your R function available as a resource to DBAs for use in SQL queries, even if they don’t know the R language. Niels Berglund recently posted a detailed tutorial on using the sqlrutils package to publish an R function as a stored procedure. There are several steps to the process, but ultimately it boils down to calling registerStoredProcedure on your R function (and providing the necessary credentials). If you don’t have a connection (or the credentials) to publish to…
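The shape of that workflow looks roughly like the following sketch. It assumes the sqlrutils package and a SQL Server 2016+ instance with R Services enabled; the connection string is illustrative and argument details may differ from your setup, so consult the tutorial for the full steps:

```r
# Hedged sketch of publishing an R function as a stored procedure with
# sqlrutils. The function, server name, and connection string are
# placeholders for illustration.
library(sqlrutils)

# An ordinary R function returning a data frame
getTopScores <- function() {
  # ... computations that will run on the SQL Server ...
  data.frame(id = 1:3, score = c(0.9, 0.7, 0.5))
}

conn <- "Driver=SQL Server;Server=myserver;Database=mydb;Trusted_Connection=Yes"

# Wrap the function as a stored procedure object, then register it
sp <- StoredProcedure(getTopScores, spName = "spGetTopScores",
                      connectionString = conn)
registerStoredProcedure(sp, conn)
```

Once registered, a DBA can call spGetTopScores from plain T-SQL with no knowledge of R at all.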
Original Post: Tutorial: Publish an R function as a SQL Server stored procedure with the sqlrutils package