Is it faster to take a bike or taxi in NYC?

Taxis are plentiful and convenient in New York City, but the city is also served by a wide network of commuter bicycles (Citi Bikes). If you need to get from, say, the West Village to the Garment District, are you better off time-wise hailing a cab, or heading over to the nearest Citi Bike station? Data scientist Todd W. Schneider crunched the numbers on travel times for both taxis and Citi Bikes to figure out which was better. Neither is universally the best: for some trips taxis are most often faster, and for others bikes are. An interactive map (created with R) allows you to select the time of day and an origin neighborhood, and the map will then tell you the fraction of the time (according to the historical data) that a Citi Bike will outpace…
Original Post: Is it faster to take a bike or taxi in NYC?
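
The "fraction of the time" figure mentioned in the excerpt above can be illustrated with a minimal sketch. This is not necessarily the method used in the original analysis: it assumes trips have already been paired by route and time of day, and the function name and travel times are hypothetical.

```python
def bike_win_fraction(bike_minutes, taxi_minutes):
    """Fraction of paired trips in which the bike was strictly faster.

    Assumption: each position in the two lists refers to a comparable
    trip (same origin, destination, and time-of-day bucket).
    """
    if len(bike_minutes) != len(taxi_minutes):
        raise ValueError("both lists must have the same length")
    if not bike_minutes:
        raise ValueError("at least one paired trip is required")
    wins = sum(b < t for b, t in zip(bike_minutes, taxi_minutes))
    return wins / len(bike_minutes)


# Made-up travel times in minutes, for illustration only.
bike = [12.0, 15.5, 9.0, 20.0]
taxi = [14.0, 13.0, 11.5, 25.0]

fraction = bike_win_fraction(bike, taxi)
print(f"Bike was faster on {fraction:.0%} of paired trips")
```

With these made-up numbers the bike wins three of the four paired trips.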

Saving Snow Leopards with Artificial Intelligence

The snow leopard, the large cat native to the mountain ranges of Central and South Asia, is a highly endangered species. With an estimated 3900-6500 individuals left in the wild, conservation efforts led by the Snow Leopard Trust are focused on preserving this iconic animal. But the snow leopard is an elusive creature: given their range and remote habitat (including the highlands of the Himalayas), they are difficult to study. In order to gather data about the creatures, researchers have used camera traps to capture more than 1 million images. But not all of those images are of snow leopards. It’s a time-consuming process to classify those images as being of snow leopards, their prey, some other animal or nothing at all. To make things even more difficult, snow leopards have excellent camouflage, and can be difficult to spot even by…
Original Post: Saving Snow Leopards with Artificial Intelligence

My interview with ROpenSci

The ROpenSci team has started publishing a new series of interviews with the goal of “demystifying the creative and development processes of R community members”. I had the great pleasure of being interviewed by Kelly O’Briant earlier this year, and the interview was published on Friday. Thanks for being a great interviewer, Kelly! I’m looking forward to hearing from other R community members as the rest of the series is published. ROpenSci blog: .rprofile: David Smith
Original Post: My interview with ROpenSci

Because it's Friday: Line Rider

Line Rider is a simple web-based game: draw a line (or a series of lines), and watch an animated sledder ride along it like it was a snow slope. It’s remained much the same since it was created in 2006 by Boštjan Čadež as a student (although I note it does now support touch/stylus controls to draw lines). It might seem like a simple toy, but people have used it to create some amazing works, like this Line Rider course set to In The Hall of the Mountain King.  If you enjoyed that, you might also enjoy this entire movie created in Line Rider. The soundtrack is amazing.  That’s all from us for this week. We’ll be back with more on Monday. See you then!
Original Post: Because it's Friday: Line Rider

An AI pitches startup ideas

Take a look at this list of 13 hot startups, from a list compiled by Alex Bresler. Perhaps one of them is the next Juicero? FAR ATHERA: A CLINICAL AI PLATFORM THAT CAN BE ACCESSED ON DEMAND. ZAPSY: TRY-AT-HOME SERVICE FOR CONSUMER ELECTRONICS. MADESS: ON-DEMAND ACCESS TO CLEAN WATER. DEERG: AI RADIOLOGIST IN A HOME SPER: THE FASTEST, EASIEST WAY TO BUY A HOME WITHOUT THE USER HAVING TO WEAR ANYTHING. PLILUO: VENMO FOR B2B SAAS. LANTR: WE HELP DOCTORS COLLECT 2X MORE ON CANDLES AND KEROSENE. ABS: WE PROVIDE FULL-SERVICE SUPPORT FOR SUBLIME, VIM, AND EMACS. INSTABLE DUGIT: GITHUB FOR YOUR LOVED ONES. CREDITAY: BY REPLACING MECHANICAL PARTS WITH AIR, WE ELIMINATE INSTALLATION COMPLEXITY AND MAINTENANCE HEADACHES, LEADING TO SIGNIFICANT EFFICIENCY GAINS IN PRODUCTION COSTS AND HARVESTING TIME. CREDITANO: WE BUILD SOFTWARE TO ENABLE HIGH FUNCTIONALITY BIONICS FOR TREATING HEALTH…
Original Post: An AI pitches startup ideas

A cRyptic crossword with an R twist

Last week’s R-themed crossword from R-Ladies DC was popular, so here’s another R-related crossword, this time by Barry Rowlingson and published on page 39 of the June 2003 issue of R-news (now known as the R Journal). Unlike the last crossword, this one follows the conventions of a British cryptic crossword: the grid is symmetrical, and eschews 4×4 blocks of white or black squares. Most importantly, the clues are in the cryptic style: rather than being a direct definition, cryptic clues pair wordplay (homonyms, anagrams, etc.) with a hidden definition. (Wikipedia has a good introduction to the types of clues you’re likely to find.) Cryptic crosswords can be frustrating for the uninitiated, but are fun and rewarding once you get into them. In fact, if you’re unfamiliar with cryptic crosswords, this one is a great place to start. Not only…
Original Post: A cRyptic crossword with an R twist

Tutorial: Azure Data Lake analytics with R

The Azure Data Lake store is an Apache Hadoop file system compatible with HDFS, hosted and managed in the Azure Cloud. You can store and access the data within it directly via the API, by connecting the filesystem directly to Azure HDInsight services, or via HDFS-compatible open-source applications. And for data science applications, you can also access the data directly from R, as this tutorial explains. To interface with Azure Data Lake, you’ll use U-SQL, a SQL-like language extensible using C#. The R Extensions for U-SQL allow you to reference an R script from a U-SQL statement, and pass data from Data Lake into the R script. There’s a 500 MB limit for the data passed to R, but the basic idea is that you perform the main data munging tasks in U-SQL, and then pass the prepared data to R for analysis. With this…
Original Post: Tutorial: Azure Data Lake analytics with R

R's remarkable growth

Python has been getting some attention recently for its impressive growth in usage. Since both R and Python are used for data science, I sometimes get asked if R is falling by the wayside, or if R developers should switch course and learn Python. My answer to both questions is no. First, while Python is an excellent general-purpose data science tool, for applications where comparative inference and robust predictions are the main goal, R will continue to be the prime repository of validated statistical functions and cutting-edge research for a long time to come. Secondly, R and Python are both top-10 programming languages, and while Python has a larger userbase, R and Python are both growing rapidly — and at similar rates. The Stack Overflow blog runs the numbers in a post today, The Impressive Growth of R. Analysis of…
Original Post: R's remarkable growth

Announcing dplyrXdf 1.0

I’m delighted to announce the release of version 1.0.0 of the dplyrXdf package. dplyrXdf began as a simple (relatively speaking) backend to dplyr for Microsoft Machine Learning Server/Microsoft R Server’s Xdf file format, but has now become a broader suite of tools to ease working with Xdf files. This update to dplyrXdf brings the following new features: support for the new tidyeval framework that powers the current release of dplyr; support for Spark and Hadoop clusters, including integration with the sparklyr package to process Hive tables in Spark; integration with dplyr to process SQL Server tables in-database; simplified handling of parallel processing for grouped data; several utility functions for Xdf and file management; and workarounds for various glitches and unexpected behaviour in MRS and dplyr. Spark, Hadoop and HDFS: New in version 1.0.0 of dplyrXdf is support for Xdf files and datasets stored…
Original Post: Announcing dplyrXdf 1.0

Because it's Friday: Death Risk

Humans are notoriously bad at understanding risk, and perceptions of danger can vary wildly depending on how the possibility is presented. (David Spiegelhalter recently published an excellent review paper on this topic, and it’s a great read even for non-statisticians.) The media has an influence here: wall-to-wall coverage of rare, geographically-contained events like terrorist attacks inflates our sense of risk, especially compared to common, widespread factors like disease or accidents. YourCauseOfDeath.com presents risk in a very simple way. It chooses a random cause of death and presents it to you. It doesn’t correct for factors like sex, age, or medical history: it just rolls a random number and presents causes at a rate according to their actual prevalence. (I assume the data come from the CDC, so it probably does assume you live in the USA.) Click through a couple…
Original Post: Because it's Friday: Death Risk
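
The "rolls a random number and presents causes at a rate according to their actual prevalence" idea in the excerpt above amounts to weighted random sampling. Here is a minimal sketch of that technique, not the site's actual code: the function name, the cause labels, and the prevalence figures are all made up for illustration.

```python
import random


def pick_cause(prevalence, rng):
    """Draw one cause at random, with probability proportional to its weight."""
    causes = list(prevalence)
    weights = [prevalence[c] for c in causes]
    # random.Random.choices performs weighted sampling with replacement.
    return rng.choices(causes, weights=weights, k=1)[0]


# Hypothetical prevalence shares; NOT real mortality data.
prevalence = {"cause A": 0.6, "cause B": 0.3, "cause C": 0.1}

rng = random.Random(42)  # fixed seed so repeated runs give the same draws
draws = [pick_cause(prevalence, rng) for _ in range(10_000)]

# Over many draws, each cause's share should approach its weight.
share_a = draws.count("cause A") / len(draws)
print(f"Share of draws that were 'cause A': {share_a:.3f}")
```

Over many draws the common causes dominate and the rare ones appear only occasionally, which is the point the site makes by showing one draw at a time.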