Many types of machine learning classifiers, including commonly used techniques like ensemble models and neural networks, are notoriously difficult to interpret. If the model produces a surprising label for a given case, it’s difficult to answer the question, “why that label, and not one of the others?” One approach to this dilemma is the technique known as LIME (Local Interpretable Model-Agnostic Explanations). The basic idea is that while for highly non-linear models it’s impossible to give a simple explanation of the relationship between any one variable and the predicted classes at a global level, it might be possible to assess which variables are most influential on the classification at a local level, in the neighborhood of a particular data point. A procedure for doing so is described in this 2016 paper by Ribeiro et al., and implemented in the R…
Original Post: Interpreting machine learning models with the lime package for R
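To make the "local level" idea concrete, here is a minimal sketch in plain Python. It is not the lime package and not the exact procedure from the paper. The `black_box` function, the sampling scale and the kernel width are all assumptions made up for illustration. The sketch perturbs one data point, weights each perturbed sample by its proximity to that point, and fits a weighted linear model whose coefficients act as the local explanation.

```python
import math
import random

def black_box(x1, x2):
    """Hypothetical non-linear classifier score in [0, 1] (made up for this sketch)."""
    return 1.0 / (1.0 + math.exp(-(x1 * x1 - 3.0 * x2)))

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def local_explanation(point, n_samples=2000, scale=0.3, kernel_width=0.5, seed=0):
    """Fit a proximity-weighted linear surrogate around `point`."""
    rng = random.Random(seed)
    xtwx = [[0.0] * 3 for _ in range(3)]  # accumulates X^T W X
    xtwy = [0.0] * 3                      # accumulates X^T W y
    for _ in range(n_samples):
        # Perturb the point of interest and query the black box there.
        d1 = rng.gauss(0.0, scale)
        d2 = rng.gauss(0.0, scale)
        y = black_box(point[0] + d1, point[1] + d2)
        # Exponential kernel: samples closer to the point get more weight.
        w = math.exp(-(d1 * d1 + d2 * d2) / (kernel_width ** 2))
        row = [1.0, d1, d2]
        for i in range(3):
            xtwy[i] += w * row[i] * y
            for j in range(3):
                xtwx[i][j] += w * row[i] * row[j]
    intercept, w1, w2 = solve_linear_system(xtwx, xtwy)
    return {"intercept": intercept, "x1": w1, "x2": w2}

explanation = local_explanation((1.0, 0.0))
print(explanation)
```

The sign and size of the `x1` and `x2` coefficients suggest which variable pushes the score up or down near this one point. They say nothing about the model's behavior elsewhere.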
There’s growing awareness that the data we collect, and in particular the variables we include as factors in our predictive models, can lead to unwanted bias in outcomes: from loan applications, to law enforcement, and in many other areas. In some instances, such bias is even directly regulated by laws like the Fair Housing Act in the US. But even if we explicitly remove “obvious” variables like sex, age or ethnicity from predictive models, unconscious bias might still be a factor in our predictions as a result of highly-correlated proxy variables that are included in our model. As a result, we need to be aware of the biases in our model and take steps to address them. For an excellent general overview of the topic, I highly recommend watching the recent presentation by Rachel Thomas, “Analyzing and Preventing Bias in ML”.…
Original Post: Detecting unconscious bias in models, with R
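One simple way to look for proxy variables is to check how strongly each remaining feature correlates with the attribute that was removed. The sketch below is only an illustration of that idea, not the method used in the post. The feature names, the data values and the 0.8 threshold are all invented for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant column carries no linear association
    return sxy / math.sqrt(sxx * syy)

def flag_proxies(protected, features, threshold=0.8):
    """Return names of features whose |correlation| with `protected` meets the threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(protected, values)) >= threshold
    ]

# Made-up data: a protected attribute that was dropped from the model,
# and two features that were kept.
protected = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "zip_code_income": [30, 32, 31, 29, 60, 58, 61, 62],  # hypothetical proxy
    "account_age": [5, 1, 4, 2, 3, 5, 1, 4],              # hypothetical non-proxy
}

flagged = flag_proxies(protected, features)
print(flagged)
```

A linear correlation check like this only catches single-variable, linear proxies. Combinations of features can still encode the protected attribute, so treat it as a first screen rather than a guarantee.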
On Monday, we learned about a serious issue with the installer for Microsoft R Open on Linux-based systems. (Thanks to Norbert Preining for reporting the problem.) The issue was that the installation and de-installation scripts would modify the system shell, and did not use the standard practices to create and restore symlinks for system applications. The Microsoft R team developed a solution to the problem with the help of some Debian experts at Microsoft, and last night issued a hotfix for Microsoft R Open 3.5.0 which is now available for download. With this fix, the MRO installer no longer relinks /bin/sh to /bin/bash, and instead uses dpkg-divert for Debian-based platforms and update-alternatives for RPM-based platforms. We will also request a discussion with the Debian maintainers of R to further review our installation process. Finally, with the next release, MRO 3.5.1, scheduled for…
Original Post: Hotfix for Microsoft R Open 3.5.0 on Linux
Microsoft R Open 3.5.0 is now available for download for Windows, Mac and Linux. This update includes the open-source R 3.5.0 engine, which is a major update with many new capabilities and improvements to R. In particular, it includes a new framework for handling data in R, with significant behind-the-scenes performance and memory-use benefits (and with further improvements expected in the future). Microsoft R Open 3.5.0 points to a fixed CRAN snapshot taken on June 1, 2018. This provides a reproducible experience when installing CRAN packages by default, but you can always change the default CRAN repository or use the built-in checkpoint package to access snapshots of packages from an earlier or later date. Relatedly, many new packages have been released since the last release of Microsoft R Open, and you can browse a curated list of some interesting ones on the Microsoft…
Original Post: Microsoft R Open 3.5.0 now available
In case you missed them, here are some articles from May of particular interest to R users. The R Consortium has announced a new round of grants for projects proposed by the R community. A look back at the ROpenSci unconference held in Seattle. Video of my European R Users Meeting talk, “Speeding up R with Parallel Programming in the Cloud”. Slides from my talk at the Microsoft Build conference, “Open-Source Machine Learning in Azure”. Discussions on Twitter: R packages by stage of data analysis; thinking differently about AI development; and, why is package management harder in Python than R? Our May 2018 roundup of AI and data science news. Panelist Francesca Lazzeri reviews the Mind Bytes AI conference in Chicago. And some general interest stories (not necessarily related to R): A really bad road in Nepal. The definitive answer…
Original Post: In case you missed it: May 2018 roundup
We find that 6 tools form the modern open-source Data Science / Machine Learning ecosystem; examine whether Python declared victory over R; and review which tools are most associated with Deep Learning and Big Data.
Original Post: The 6 components of Open-Source Data Science/ Machine Learning Ecosystem; Did Python declare victory over R?
Summer. Time to sit back and unwind. Or get your hands on some free machine learning and data science books and learn! Here is a great selection to get started.
Original Post: KDnuggets™ News 18:n22, June 6: 10 More Free Must-Read Books for Machine Learning and Data Science; Beginner Guide to Data Science Pipeline
If you don’t get enough joy from publishing scientific papers in your day job, or simply want to experience what it’s like to be in a publish-or-perish environment where the P-value is the only important part of a paper, you might want to try StatCheck: the board game where the object is to publish two papers before any of your opponents. As the game progresses, players combine “Test”, “Statistic” and “P-value” cards to form the statistical test featured in the paper (and of course, significant tests are worth more than non-significant ones). Opponents may then have the opportunity to play a “StatCheck” card to challenge the validity of the test, which can then be verified using a companion R package or online Shiny application. Other modifier cards include “Bayes Factor” (which can be used to boost the value of your…
Original Post: StatCheck the Game
In this post, we’ll use linear regression to build a model that predicts cherry tree volume from metrics that are much easier for folks who study trees to measure.
Original Post: Using Linear Regression for Predictive Modeling in R
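As a rough illustration of the kind of model described, here is a minimal least-squares fit in plain Python. The original post works in R. This sketch assumes a single predictor, called `girth` here, and uses invented numbers rather than real tree measurements.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up measurements for illustration only (not real cherry tree data).
girth = [8.0, 10.0, 12.0, 14.0, 16.0]     # easy-to-measure predictor
volume = [10.0, 19.0, 31.0, 40.0, 50.0]   # hard-to-measure response

intercept, slope = fit_simple_regression(girth, volume)

# Predict the volume of a new tree from its girth alone.
predicted_volume = intercept + slope * 13.0
print(f"volume = {intercept:.2f} + {slope:.2f} * girth")
print(f"predicted volume at girth 13.0: {predicted_volume:.2f}")
```

With more than one predictor the same idea applies, but the fit would need matrix algebra or a statistics library rather than these closed-form sums.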
Original Post: New round of R Consortium grants announced