Obstacles to performance in parallel programming

Making your code run faster is often the primary goal when using parallel programming techniques in R, but sometimes the effort of converting your code to use a parallel framework leads only to disappointment, at least initially. Norman Matloff, author of Parallel Computing for Data Science: With Examples in R, C++ and CUDA, has shared chapter 2 of that book online, and it describes some of the issues that can lead to poor performance. They include:

- Communications overhead, particularly an issue with fine-grained parallelism consisting of a very large number of relatively small tasks
- Load balance, where the computing resources aren’t contributing equally to the problem
- Impacts from use of RAM and virtual memory, such as cache misses and page faults
- Network effects, such as latency and bandwidth, that impact performance and communication overhead
- Interprocess conflicts and thread scheduling
- Data access and…
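As a minimal illustration of the communications-overhead item above (this sketch is mine, not from Matloff's chapter), the code below contrasts shipping each tiny task to a worker individually with letting the base parallel package pre-split the work into one chunk per worker:

```r
# A minimal sketch (not from Matloff's chapter) of communication overhead,
# using only the base 'parallel' package.
library(parallel)

cl <- makeCluster(2)                 # two worker processes (PSOCK cluster)
x  <- 1:10000                        # a hypothetical workload of tiny tasks
tiny_task <- function(i) sqrt(i)     # each task does almost no real work

# Fine-grained: clusterApply() ships one element per job, so every task
# pays a full send/receive round trip to a worker process.
t_fine <- system.time(r1 <- clusterApply(cl, x, tiny_task))

# Coarse-grained: parLapply() splits x into one chunk per worker up front,
# so the same arithmetic needs only a handful of messages.
t_coarse <- system.time(r2 <- parLapply(cl, x, tiny_task))

stopCluster(cl)

# The two calls do identical arithmetic; any difference in elapsed time
# is essentially pure communication cost.
rbind(fine_grained = t_fine, coarse_grained = t_coarse)
```

On a typical machine the coarse-grained call finishes far sooner, which is exactly the fine-grained-parallelism trap the chapter warns about.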
Original Post: Obstacles to performance in parallel programming

20 years of the R Core Group

The first “official” version of R, version 1.0.0, was released on February 29, 2000. But the R Project had already been underway for several years before then. Sharing this tweet from yesterday, from R Core member Peter Dalgaard: It was twenty years ago today, Ross Ihaka got the band to play…. #rstats pic.twitter.com/msSpPz2kyA — Peter Dalgaard (@pdalgd) August 16, 2017 Twenty years ago, on August 16, 1997, the R Core Group was formed. Before that date, the committers to R were the project’s founders Ross Ihaka and Robert Gentleman, along with Luke Tierney, Heiner Schwarte and Paul Murrell. The email above was the invitation for Kurt Hornik, Peter Dalgaard and Thomas Lumley to join as well. With the sole exception of Schwarte, all of the above remain members of the R Core Group, which has since expanded to 21 members.…
Original Post: 20 years of the R Core Group

How to build an image recognizer in R using just a few images

Microsoft Cognitive Services provides several APIs for image recognition, but if you want to build your own recognizer (or create one that works offline), you can use the new Image Featurizer capabilities of Microsoft R Server. Training an image recognition system from scratch requires LOTS of images — millions and millions of them. It involves feeding those images into a deep neural network, which generates “features” from each image along the way. These features might be versions of the image including just the outlines, or maybe the image with only the green parts. You could further boil those features down into a single number, say the length of the outline or the percentage of the image that is green. With enough of these “features”, you could use them in a traditional machine learning model to classify…
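As a rough sketch of what the featurization step can look like in code: the transform names (loadImage, resizeImage, extractPixels, featurizeImage) come from the MicrosoftML package bundled with Microsoft R Server, while the file paths and the choice of pre-trained network below are hypothetical, so treat this as an outline rather than a drop-in recipe.

```r
# A hedged sketch of image featurization with Microsoft R Server's MicrosoftML
# package. The image paths and the "resnet18" model choice are hypothetical
# placeholders; check the installed package's help for exact argument names.
library(MicrosoftML)

# A data frame with one column of paths to a small set of training images
images <- data.frame(
  Image = c("cats/cat1.jpg", "cats/cat2.jpg", "dogs/dog1.jpg", "dogs/dog2.jpg"),
  stringsAsFactors = FALSE
)

# Push each image through a pre-trained deep network and keep the numeric
# features it emits, rather than training a network from scratch.
features <- rxFeaturize(
  data = images,
  mlTransforms = list(
    loadImage(vars = list(Features = "Image")),                 # read from disk
    resizeImage(vars = "Features", width = 224, height = 224),  # size the DNN expects
    extractPixels(vars = "Features"),                           # pixels -> numeric vector
    featurizeImage(var = "Features", dnnModel = "resnet18")     # pre-trained ResNet features
  )
)

# 'features' now has one row per image and hundreds of numeric columns, which
# can feed a traditional classifier trained on just a handful of labelled images.
head(features)
```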
Original Post: How to build an image recognizer in R using just a few images

Buzzfeed trains an AI to find spy planes

Last year, Buzzfeed broke the story that US law enforcement agencies were using small aircraft to observe points of interest in US cities, thanks to analysis of public flight-records data. With the data journalism team no doubt realizing that the Flightradar24 data set held many more stories of public interest, the challenge lay in separating routine, day-to-day aircraft traffic from the more unusual, covert activities. So they trained an artificial intelligence model to identify unusual flight paths in the data. The model, implemented in the R programming language, applies a random forest algorithm to flag flight patterns similar to those of the covert aircraft identified in their earlier “Spies in the Skies” story. When that model was applied to the almost 20,000 flights in the FlightRadar24 dataset, about 69 planes were flagged as possible surveillance aircraft. Several of those were false…
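The modelling step described above maps onto a familiar R workflow; here is a hypothetical sketch using the randomForest package, with invented feature names and simulated data standing in for BuzzFeed's real flight-history features:

```r
# A hypothetical sketch of the kind of classifier described above, using the
# randomForest package. The feature names and data are invented, not
# BuzzFeed's actual training set.
library(randomForest)

set.seed(42)
n <- 1000
flights <- data.frame(
  turn_rate     = runif(n, 0, 10),      # how often the aircraft circles
  mean_altitude = runif(n, 500, 40000),
  flight_hours  = runif(n, 0.2, 12),
  heading_share = runif(n),             # share of time in a given heading bin
  surveil       = factor(sample(c("yes", "no"), n, replace = TRUE,
                                prob = c(0.05, 0.95)))
)

# Train a random forest to separate likely surveillance flights from
# routine traffic, based on labelled examples.
fit <- randomForest(surveil ~ ., data = flights, ntree = 500)

# Score flights and flag those with a high predicted probability; flagged
# candidates would still need manual review, since false positives happen.
probs <- predict(fit, newdata = flights, type = "prob")[, "yes"]
candidates <- flights[probs > 0.5, ]
```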
Original Post: Buzzfeed trains an AI to find spy planes

Reproducibility: A cautionary tale from data journalism

Timo Grossenbacher, data journalist with Swiss Radio and TV in Zurich, had a bit of a surprise when he attempted to recreate the results of one of the R Markdown scripts published by SRF Data to accompany their data journalism story about vested interests of Swiss members of parliament. Upon re-running the analysis in R last week, he found that the results differed from those published in August 2015. There was no change to the R scripts or data in the intervening two-year period, so what caused the results to be different? (Image credit: Timo Grossenbacher.) The version of R Timo was using had been updated, but that wasn’t the root cause of the problem. What had also changed was the version of the dplyr package used by the script: version 0.5.0 now, versus version 0.4.2 then. For some unknown…
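One way to guard against this kind of silent drift (not necessarily what SRF Data ended up doing) is to pin the package versions an analysis was written against. Two hedged options are sketched below; the snapshot date is an example chosen to sit near the original analysis, not a value from the post.

```r
# Two hedged options for pinning package versions, so a script keeps running
# against the packages it was written for. Neither is claimed to be what
# SRF Data actually used.

# Option 1: the checkpoint package installs packages as they existed on CRAN
# on a given date and uses them for this session (date below is illustrative).
library(checkpoint)
checkpoint("2015-08-01")

# Option 2: install the exact dplyr release the script was developed with.
remotes::install_version("dplyr", version = "0.4.2")

# Either way, record the environment alongside the published results:
sessionInfo()
```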
Original Post: Reproducibility: A cautionary tale from data journalism