A guide to working with character data in R

R is primarily a language for working with numbers, but we often need to work with text as well. Whether it’s formatting text for reports, or analyzing natural language data, R provides a number of facilities for working with character data. Handling Strings with R, a free (CC-BY-NC-SA) e-book by UC Berkeley’s Gaston Sanchez, provides an overview of the ways you can manipulate characters and strings with R. There are many useful sections in the book. Note that the book does not cover analysis of natural language data, for which you might want to check out the CRAN Task View on Natural Language Processing or the book Text Mining with R: A Tidy Approach. It’s also sadly silent on the topic of character encoding in R, a topic that often causes problems when dealing with text data,…
Original Post: A guide to working with character data in R

PYPL Language Rankings: Python ranks #1, R at #7 in popularity

The new PYPL Popularity of Programming Languages (June 2018) index ranks Python at #1 and R at #7. Like the similar TIOBE language index, the PYPL index uses Google search activity to rank language popularity. PYPL, however, focuses on people searching for tutorials in the respective languages as a proxy for popularity. By that measure, Python has always been more popular than R (as you’d expect from a more general-purpose language), but both have been growing at similar rates. The chart below includes the three data-oriented languages tracked by the index (and note the vertical scale is logarithmic). Another language ranking was also released recently: the annual KDnuggets Analytics, Data Science and Machine Learning Poll. These rankings, however, are derived not from search trends but from self-selected poll respondents, which perhaps explains the presence of Rapidminer at the #2 spot.
Original Post: PYPL Language Rankings: Python ranks #1, R at #7 in popularity

Interpreting machine learning models with the lime package for R

Many types of machine learning classifiers, not least commonly-used techniques like ensemble models and neural networks, are notoriously difficult to interpret. If the model produces a surprising label for any given case, it’s difficult to answer the question, “why that label, and not one of the others?”. One approach to this dilemma is the technique known as LIME (Local Interpretable Model-Agnostic Explanations). The basic idea is that while for highly non-linear models it’s impossible to give a simple explanation of the relationship between any one variable and the predicted classes at a global level, it might be possible to assess which variables are most influential on the classification at a local level, near the neighborhood of a particular data point. A procedure for doing so is described in this 2016 paper by Ribeiro et al., and implemented in the R…
Original Post: Interpreting machine learning models with the lime package for R
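
The local-approximation idea in the LIME excerpt above can be illustrated with a small sketch. This is not the lime package's API, nor the exact procedure from the Ribeiro et al. paper. It is a minimal Python illustration that assumes a hypothetical two-feature `black_box` classifier, Gaussian perturbations around the point of interest, and an exponential proximity kernel. All names and parameters (`explain_locally`, `scale`, `kernel_width`) are invented for the example.

```python
import math
import random


def black_box(x):
    """Hypothetical non-linear classifier: probability of the positive class."""
    score = math.sin(3.0 * x[0]) + 0.1 * x[1] ** 2
    return 1.0 / (1.0 + math.exp(-score))


def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


def explain_locally(model, point, n_samples=2000, scale=0.3,
                    kernel_width=0.3, seed=0):
    """Fit a proximity-weighted linear surrogate to `model` around `point`.

    Returns (intercept, slopes); each slope approximates how strongly the
    corresponding feature influences the prediction near `point`.
    """
    rng = random.Random(seed)
    d = len(point)
    xtwx = [[0.0] * (d + 1) for _ in range(d + 1)]  # X^T W X
    xtwy = [0.0] * (d + 1)                          # X^T W y
    for _ in range(n_samples):
        # Perturb the point, then weight the sample by its proximity.
        sample = [p + rng.gauss(0.0, scale) for p in point]
        dist2 = sum((s - p) ** 2 for s, p in zip(sample, point))
        weight = math.exp(-dist2 / kernel_width ** 2)
        row = [1.0] + [s - p for s, p in zip(sample, point)]
        y = model(sample)
        for i in range(d + 1):
            xtwy[i] += weight * row[i] * y
            for j in range(d + 1):
                xtwx[i][j] += weight * row[i] * row[j]
    coefs = solve_linear_system(xtwx, xtwy)
    return coefs[0], coefs[1:]


intercept, slopes = explain_locally(black_box, [0.0, 0.0])
print("local intercept:", intercept)
print("local slopes:", slopes)
```

Near the point (0, 0) the surrogate assigns a much larger slope to the first feature than to the second, which is the kind of local ranking of variables the excerpt describes. A different point can give a different ranking, since the explanation only holds in that neighborhood.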

Detecting unconscious bias in models, with R

There’s growing awareness that the data we collect, and in particular the variables we include as factors in our predictive models, can lead to unwanted bias in outcomes: from loan applications, to law enforcement, and in many other areas. In some instances, such bias is even directly regulated by laws like the Fair Housing Act in the US. But even if we explicitly remove “obvious” variables like sex, age or ethnicity from predictive models, unconscious bias might still be a factor in our predictions as a result of highly-correlated proxy variables that are included in our model. As a result, we need to be aware of the biases in our model and take steps to address them. For an excellent general overview of the topic, I highly recommend watching the recent presentation by Rachel Thomas, “Analyzing and Preventing Bias in ML”.…
Original Post: Detecting unconscious bias in models, with R
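
The proxy-variable concern in the excerpt above can be made concrete with a small sketch. This is not the method used in the original post. It is a minimal Python illustration that assumes a simple screening rule: flag any remaining feature whose Pearson correlation with the removed sensitive attribute exceeds a threshold. The data, the feature names, and the 0.8 threshold are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def flag_proxies(sensitive, features, threshold=0.8):
    """Return names of features strongly correlated with the sensitive attribute."""
    return sorted(
        name
        for name, values in features.items()
        if abs(pearson(sensitive, values)) >= threshold
    )


# Hypothetical toy data: the sensitive attribute has been dropped from the
# model, but one of the remaining features tracks it closely.
sensitive = [0, 0, 0, 1, 1, 1]
features = {
    "neighborhood_code": [1, 1, 2, 8, 9, 9],
    "income": [40, 85, 60, 55, 90, 45],
}

print(flag_proxies(sensitive, features))
```

A linear correlation check like this only catches simple proxies. Combinations of features can also encode a sensitive attribute, so a screen of this kind is a starting point rather than a guarantee.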

Hotfix for Microsoft R Open 3.5.0 on Linux

On Monday, we learned about a serious issue with the installer for Microsoft R Open on Linux-based systems. (Thanks to Norbert Preining for reporting the problem.) The issue was that the installation and de-installation scripts would modify the system shell, and did not use the standard practices to create and restore symlinks for system applications. The Microsoft R team developed a solution to the problem with the help of some Debian experts at Microsoft, and last night issued a hotfix for Microsoft R Open 3.5.0 which is now available for download. With this fix, the MRO installer no longer relinks /bin/sh to /bin/bash, and instead uses dpkg-divert for Debian-based platforms and update-alternatives for RPM-based platforms. We will also request a discussion with the Debian maintainers of R to further review our installation process. Finally, with the next release, MRO 3.5.1, scheduled for…
Original Post: Hotfix for Microsoft R Open 3.5.0 on Linux

Microsoft R Open 3.5.0 now available

Microsoft R Open 3.5.0 is now available for download for Windows, Mac and Linux. This update includes the open-source R 3.5.0 engine, which is a major update with many new capabilities and improvements to R. In particular, it includes a new framework for handling data in R, with significant behind-the-scenes performance and memory-use benefits (and with further improvements expected in the future). Microsoft R Open 3.5.0 points to a fixed CRAN snapshot taken on June 1 2018. This provides a reproducible experience when installing CRAN packages by default, but you can always change the default CRAN repository or use the built-in checkpoint package to access snapshots of packages from an earlier or later date. Relatedly, many new packages have been released since the last release of Microsoft R Open, and you can browse a curated list of some interesting ones on the Microsoft…
Original Post: Microsoft R Open 3.5.0 now available

In case you missed it: May 2018 roundup

In case you missed them, here are some articles from May of particular interest to R users. The R Consortium has announced a new round of grants for projects proposed by the R community. A look back at the rOpenSci unconference held in Seattle. Video of my European R Users Meeting talk, “Speeding up R with Parallel Programming in the Cloud”. Slides from my talk at the Microsoft Build conference, “Open-Source Machine Learning in Azure”. Discussions on Twitter: R packages by stage of data analysis; thinking differently about AI development; and, why is package management harder in Python than R? Our May 2018 roundup of AI and data science news. Panelist Francesca Lazzeri reviews the Mind Bytes AI conference in Chicago. And some general interest stories (not necessarily related to R): A really bad road in Nepal. The definitive answer…
Original Post: In case you missed it: May 2018 roundup

The 6 components of Open-Source Data Science/ Machine Learning Ecosystem; Did Python declare victory over R?

We find 6 tools form the modern open source Data Science / Machine Learning ecosystem; examine whether Python declared victory over R; and review which tools are most associated with Deep Learning and Big Data.
Original Post: The 6 components of Open-Source Data Science/ Machine Learning Ecosystem; Did Python declare victory over R?

KDnuggets™ News 18:n22, June 6: 10 More Free Must-Read Books for Machine Learning and Data Science; Beginner Guide to Data Science Pipeline

Summer. Time to sit back and unwind. Or get your hands on some free machine learning and data science books and learn! Here is a great selection to get started.
Original Post: KDnuggets™ News 18:n22, June 6: 10 More Free Must-Read Books for Machine Learning and Data Science; Beginner Guide to Data Science Pipeline

StatCheck the Game

If you don’t get enough joy from publishing scientific papers in your day job, or simply want to experience what it’s like to be in a publish-or-perish environment where the P-value is the only important part of a paper, you might want to try StatCheck: the board game where the object is to publish two papers before any of your opponents. As the game progresses, players combine “Test”, “Statistic” and “P-value” cards to form the statistical test featured in the paper (and of course, significant tests are worth more than non-significant ones). Opponents may then have the opportunity to play a “StatCheck” card to challenge the validity of the test, which can then be verified using a companion R package or online Shiny application. Other modifier cards include “Bayes Factor” (which can be used to boost the value of your…
Original Post: StatCheck the Game
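
The "StatCheck" challenge described above, checking whether a reported test is valid, can be sketched as a consistency check. This is not the companion R package's implementation. It is a minimal Python illustration that assumes the check amounts to recomputing a two-sided p-value from a reported z statistic and comparing it with the reported p-value up to rounding. The function names and the two-decimal rounding rule are assumptions made for the example.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def is_consistent(z, reported_p, decimals=2):
    """True if the recomputed p-value matches the reported one up to rounding."""
    tolerance = 0.5 * 10 ** (-decimals)
    return abs(two_sided_p_from_z(z) - reported_p) <= tolerance


# Hypothetical reported results: (z statistic, reported p-value).
reports = [(1.96, 0.05), (1.50, 0.03)]
for z, p in reports:
    verdict = "consistent" if is_consistent(z, p) else "inconsistent"
    print(f"z = {z:.2f}, reported p = {p:.2f}: {verdict}")
```

In this sketch the first report passes and the second does not, because a z of 1.50 corresponds to a two-sided p-value well above 0.03. Other test types (t, F, chi-squared) would need their own distribution functions.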