In case you missed it: November 2017 roundup

In case you missed them, here are some articles from November of particular interest to R users. R 3.4.3 “Kite Eating Tree” has been released. Several approaches for generating a “Secret Santa” list with R. The “RevoScaleR” package from Microsoft R Server has now been ported to Python. The call for papers for the R/Finance 2018 conference in Chicago is now open. Give thanks to the volunteers behind R. Advice for R user groups from the organizer of R-Ladies Chicago. Use containers to build R clusters for parallel workloads in Azure with the doAzureParallel package. A collection of R scripts for interesting visualizations that fit into a 280-character Tweet. R is featured in a StackOverflow case study at the Microsoft Connect conference. The City of Chicago uses R to forecast water quality and issue beach safety alerts. A collection of…
Original Post: In case you missed it: November 2017 roundup

Unleash a faster Python on your data

Get real performance results and download the free Intel® Distribution for Python that includes everything you need for blazing-fast computing, analytics, machine learning, and more. Use Intel Python with existing code, and you’re all set for a significant performance boost.
Original Post: Unleash a faster Python on your data

Managing Machine Learning Workflows with Scikit-learn Pipelines Part 1: A Gentle Introduction

Scikit-learn’s Pipeline class is designed as a manageable way to apply a series of data transformations followed by the application of an estimator.
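As a rough illustration of that idea, here is a minimal sketch of a Pipeline that chains two transformations and a final estimator. The iris data set, the step names ("scale", "pca", "clf"), and the choice of StandardScaler, PCA, and LogisticRegression are assumptions made for this example, not taken from the original article.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data (an assumption for this sketch): the bundled iris data set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each step is a (name, object) pair. Every step except the last is a
# transformer; the last step is the estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),                 # standardize each feature
    ("pca", PCA(n_components=2)),                # project onto 2 components
    ("clf", LogisticRegression(max_iter=1000)),  # final classifier
])

# fit() fits and applies each transformer in order on the training data,
# then fits the classifier on the transformed result.
pipe.fit(X_train, y_train)

# score() pushes the test data through the already-fitted transformers
# before scoring the classifier, so no test-set statistics leak into training.
accuracy = pipe.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Because the whole chain behaves like a single estimator, it can be passed as-is to utilities such as cross_val_score or GridSearchCV.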
Original Post: Managing Machine Learning Workflows with Scikit-learn Pipelines Part 1: A Gentle Introduction

The British Ecological Society's Guide to Reproducible Science

The British Ecological Society has published a new volume in their Guides to Better Science series: A Guide to Reproducible Code in Ecology and Evolution (pdf). The introduction describes its scope: A Guide to Reproducible Code covers all the basic tools and information you will need to start making your code more reproducible. We focus on R and Python, but many of the tips apply to any programming language. Anna Krystalli introduces some ways to organise files on your computer and to document your workflows. Laura Graham writes about how to make your code more reproducible and readable. François Michonneau explains how to write reproducible reports. Tamora James breaks down the basics of version control. Finally, Mike Croucher describes how to archive your code. We have also included a selection of helpful tips from other scientists. The guide…
Original Post: The British Ecological Society's Guide to Reproducible Science

On the biases in data

Whether we’re developing statistical models, training machine learning recognizers, or developing AI systems, we start with data. And while the suitability of that data set is, lamentably, sometimes measured by its size, it’s always important to reflect on where those data come from. Data are not neutral: the data we choose to use have profound impacts on the resulting systems we develop. A recent article in Microsoft’s AI Blog discusses the inherent biases found in many data sets: “The people who are collecting the datasets decide that, ‘Oh this represents what men and women do, or this represents all human actions or human faces.’ These are types of decisions that are made when we create what are called datasets,” she said. “What is interesting about training datasets is that they will always bear the marks of history, that history will…
Original Post: On the biases in data

Machine Learning with Optimus on Apache Spark

The way most Machine Learning models work on Spark is not straightforward, and they need lots of feature engineering to work. That’s why we created the feature engineering section inside the Optimus Data Frame Transformer.
Original Post: Machine Learning with Optimus on Apache Spark