Gender roles in film direction, analyzed with R

What do women do in films? If you analyze the stage directions in film scripts, as Julia Silge, Russell Goldenberg and Amber Thomas have done for this visual essay for The Pudding, it seems that women (but not men) are written to snuggle, giggle and squeal, while men (but not women) shoot, gallop and strap things to other things. This is all based on an analysis of almost 2,000 film scripts, mostly from 1990 and after. The words come from pairs of words beginning with “he” and “she” in the stage directions (but not the dialogue) in the screenplays: directions like “she snuggles up to him, strokes his back” and “he straps on a holster under his sealskin cloak”. The essay also includes an analysis of words by the writer’s and character’s gender, and includes lots of lovely interactive…
Original Post: Gender roles in film direction, analyzed with R
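
The word-pair idea in the film-direction post above (take the word that follows “he” or “she” in a stage direction) can be sketched in a few lines. This is a minimal Python illustration, not the authors' R code: the function name `pronoun_pairs` and the simple regular expression are assumptions made for the example, and the two sample directions are the ones quoted in the post.

```python
import re
from collections import Counter

def pronoun_pairs(directions):
    """Count the word that follows 'he' or 'she' in each stage direction."""
    counts = {"he": Counter(), "she": Counter()}
    for line in directions:
        # Match "he" or "she" as a whole word, followed by the next word.
        for pronoun, word in re.findall(r"\b(he|she)\s+([a-z']+)", line.lower()):
            counts[pronoun][word] += 1
    return counts

# The two example directions quoted in the post.
directions = [
    "She snuggles up to him, strokes his back",
    "He straps on a holster under his sealskin cloak",
]

pairs = pronoun_pairs(directions)
print(pairs["she"].most_common(1))
print(pairs["he"].most_common(1))
```

Only the word directly after the pronoun is counted, so “strokes” in the first direction is ignored. A real analysis would also need to separate stage directions from dialogue, which this sketch does not attempt.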

Highlights of the Data Science Track at Microsoft Ignite

Original Post: Highlights of the Data Science Track at Microsoft Ignite

Obstacles to performance in parallel programming

Making your code run faster is often the primary goal when using parallel programming techniques in R, but sometimes the effort of converting your code to use a parallel framework leads only to disappointment, at least initially. Norman Matloff, author of Parallel Computing for Data Science: With Examples in R, C++ and CUDA, has shared chapter 2 of that book online, and it describes some of the issues that can lead to poor performance. They include: communications overhead, particularly an issue with fine-grained parallelism consisting of a very large number of relatively small tasks; load balance, where the computing resources aren’t contributing equally to the problem; impacts from use of RAM and virtual memory, such as cache misses and page faults; network effects, such as latency and bandwidth, that impact performance and communication overhead; interprocess conflicts and thread scheduling; data access and…
Original Post: Obstacles to performance in parallel programming
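
Two of the obstacles listed in the parallel-programming post above, communications overhead from many tiny tasks and uneven load balance, are often addressed by grouping work into a few near-equal chunks. The following is a minimal Python sketch of that idea, not an example from Matloff's book: the helper names `chunked` and `sum_of_squares` are made up for illustration, and threads are used only to show task granularity, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, n_chunks):
    """Split items into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def sum_of_squares(chunk):
    """The work done by one coarse-grained task."""
    return sum(x * x for x in chunk)

data = list(range(1000))
n_workers = 4

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    # Fine-grained: 1000 tiny tasks, each paying its own scheduling cost.
    fine = sum(pool.map(lambda x: x * x, data))
    # Coarse-grained: one task per worker, sized as evenly as possible.
    coarse = sum(pool.map(sum_of_squares, chunked(data, n_workers)))

print(fine == coarse)
```

Both versions compute the same total. The difference is that the coarse-grained version hands out 4 tasks instead of 1000, so the fixed cost of dispatching a task is paid far less often.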

20 years of the R Core Group

The first “official” version of R, version 1.0.0, was released on February 29, 2000. But the R Project had already been underway for several years before then. Sharing this tweet, from yesterday, from R Core member Peter Dalgaard: “It was twenty years ago today, Ross Ihaka got the band to play…. #rstats pic.twitter.com/msSpPz2kyA” (Peter Dalgaard, @pdalgd, August 16, 2017). Twenty years ago, on August 16 1997, the R Core Group was formed. Before that date, the committers to R were the project’s founders Ross Ihaka and Robert Gentleman, along with Luke Tierney, Heiner Schwarte and Paul Murrell. The email above was the invitation for Kurt Hornik, Peter Dalgaard and Thomas Lumley to join as well. With the sole exception of Schwarte, all of the above remain members of the R Core Group, which has since expanded to 21 members.…
Original Post: 20 years of the R Core Group

How to build an image recognizer in R using just a few images

Microsoft Cognitive Services provides several APIs for image recognition, but if you want to build your own recognizer (or create one that works offline), you can use the new Image Featurizer capabilities of Microsoft R Server.  The process of training an image recognition system requires LOTS of images — millions and millions of them. The process involves feeding those images into a deep neural network, and during that process the network generates “features” from the image. These features might be versions of the image including just the outlines, or maybe the image with only the green parts. You could further boil those features down into a single number, say the length of the outline or the percentage of the image that is green. With enough of these “features”, you could use them in a traditional machine learning model to classify…
Original Post: How to build an image recognizer in R using just a few images
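
To make the “boil a feature down to a single number” step in the image-recognizer post above concrete, here is a minimal Python sketch of one hand-made feature, the share of an image that is green. It is only an illustration of the idea and does not use the Image Featurizer in Microsoft R Server. The function name `green_fraction`, the rule for calling a pixel green, and the four-pixel sample image are all assumptions.

```python
def green_fraction(pixels):
    """Fraction of (r, g, b) pixels whose green channel exceeds red and blue."""
    green = sum(1 for r, g, b in pixels if g > r and g > b)
    return green / len(pixels)

# A hypothetical four-pixel "image": two greenish pixels, one red, one grey.
image = [(10, 200, 30), (250, 20, 20), (0, 255, 0), (90, 90, 90)]

print(green_fraction(image))  # 0.5
```

A handful of numbers like this one, computed for every image, is the kind of input a traditional classifier can work with.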

Buzzfeed trains an AI to find spy planes

Last year, Buzzfeed broke the story that US law enforcement agencies were using small aircraft to observe points of interest in US cities, thanks to analysis of public flight-records data. With the data journalism team no doubt realizing that the Flightradar24 data set hosted many more stories of public interest, the challenge lay in separating routine, day-to-day aircraft traffic from the more unusual, covert activities. So they trained an artificial intelligence model to identify unusual flight paths in the data. The model, implemented in the R programming language, applies a random forest algorithm to identify flight patterns similar to those of covert aircraft identified in their earlier “Spies in the Skies” story. When that model was applied to the almost 20,000 flights in the Flightradar24 dataset, about 69 planes were flagged as possible surveillance aircraft. Several of those were…
Original Post: Buzzfeed trains an AI to find spy planes

Reproducibility: A cautionary tale from data journalism

Timo Grossenbacher, data journalist with Swiss Radio and TV in Zurich, had a bit of a surprise when he attempted to recreate the results of one of the R Markdown scripts published by SRF Data to accompany their data journalism story about vested interests of Swiss members of parliament. When he re-ran the analysis in R last week, the results differed from those published in August 2015. There was no change to the R scripts or data in the intervening two-year period, so what caused the results to be different? The version of R Timo was using had been updated, but that wasn’t the root cause of the problem. What had also changed was the version of the dplyr package used by the script: version 0.5.0 now, versus version 0.4.2 then. For some unknown…
Original Post: Reproducibility: A cautionary tale from data journalism
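
One simple guard against the kind of surprise in the reproducibility post above is to record package versions alongside the published results and compare them before re-running. Here is a minimal Python sketch of that comparison, using the two dplyr versions mentioned in the post. The function name `version_drift` and the dictionary layout are assumptions for illustration. In R, the versions themselves could be read from `sessionInfo()` or `packageVersion("dplyr")`.

```python
def version_drift(recorded, current):
    """Return {package: (recorded_version, current_version)} for mismatches."""
    return {
        pkg: (ver, current.get(pkg))
        for pkg, ver in recorded.items()
        if current.get(pkg) != ver
    }

recorded = {"dplyr": "0.4.2"}  # version in use when the story was published
current = {"dplyr": "0.5.0"}   # version installed when the analysis was re-run

print(version_drift(recorded, current))  # {'dplyr': ('0.4.2', '0.5.0')}
```

An empty result means the recorded and current environments agree, at least for the packages that were written down. A package that is missing from the current environment is reported with `None` as its current version.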

Because it's Friday: The Shepard Tone

I haven’t seen the Dunkirk movie yet, but the video below makes me want to see it soon. It turns out it contains an auditory illusion: the “Shepard Tone”, which sounds like it’s continually rising but really isn’t. (It turns out that many of Christopher Nolan’s past movies have included it as well.) There’s more explanation in the accompanying article at Vox. That’s it from the blog for this week. Have a great weekend, and see you back here on Monday.
Original Post: Because it's Friday: The Shepard Tone

In case you missed it: July 2017 roundup

In case you missed them, here are some articles from July of particular interest to R users. A tutorial on using the rsparkling package to apply H2O’s algorithms to data in HDInsight. Several exercises to learn parallel programming with the foreach package. A presentation on the R6 class system, by Winston Chang. Introducing “joyplots”, a ggplot2 extension for visualizing multiple time series or distributions (with a nod to Joy Division). SQL Server 2017, with many new R-related capabilities, is nearing release. Ali Zaidi on using neural embeddings with R and Spark to analyze GitHub comments. R ranks #6 in the 2017 IEEE Spectrum Top Programming Languages. Course materials on “Data Analysis for the Life Sciences”, from Rafael Irizarry. How to securely store API keys in R scripts with the “secret” package. An in-depth tutorial on implementing neural network algorithms in…
Original Post: In case you missed it: July 2017 roundup