ML models: What they can’t learn?

What I love about conferences are the people who come up after your talk and say: it would be cool to add XYZ to your package/method/theorem. After eRum (a great conference, by the way) I was lucky to hear from Tal Galili: it would be cool to use DALEX for teaching, to show how different ML models learn relations. Cool idea! So let’s see what can and what cannot be learned by the most popular ML models. Here we will compare random forests against linear models against SVMs. Find the full example here. We simulate variables from the uniform U[0,1] distribution and calculate y from the following equation. In all figures below we compare PDP model responses against the true relation between variable x and the target variable y (pink color). All these plots are created with the DALEX package. For x1 we can…
Original Post: ML models: What they can’t learn?
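The gap between a flexible learner and a linear model can be sketched even without DALEX. Below is a minimal base-R sketch of the idea, assuming a made-up true relation y = sin(2*pi*x1) + x2 (the post's actual equation is not reproduced in this excerpt): a partial-dependence profile is just the model's prediction averaged over the data while one variable is swept along a grid, and for a linear model that profile is necessarily a straight line, whatever the truth looks like.

```r
# Hand-rolled partial-dependence sketch (not the DALEX API). We fit a plain
# linear model to an assumed nonlinear relation and average its predictions
# over a grid of x1 values, which is what a PDP does.
set.seed(1)
n  <- 500
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) + x2            # hypothetical "true" relation

fit <- lm(y ~ x1 + x2)

grid <- seq(0, 1, length.out = 21)
pdp  <- sapply(grid, function(g) {
  # prediction at x1 = g, averaged over the observed values of x2
  mean(predict(fit, data.frame(x1 = g, x2 = x2)))
})

# A linear model's profile for x1 has constant slope, so it cannot recover
# the sine-shaped truth -- that gap is what the post visualises with DALEX.
range(diff(pdp) / diff(grid))
```

Swapping `lm()` for a random forest or an SVM (and keeping the same averaging loop) is what makes the comparison in the post interesting.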

Rcpp 0.12.17: More small updates

Another bi-monthly update and the seventeenth release in the 0.12.* series of Rcpp landed on CRAN late on Friday, following nine (!!) days of gestation in the incoming/ directory of CRAN. And no complaints: we just wish CRAN were a little more forthcoming about what is happening when, and/or would let us help by supplying additional test information. Running a fairly insane amount of backtests prior to releases, only to then have to wait another week or more, is … not ideal. But again, we all owe CRAN an immense amount of gratitude for all they do, and do so well. So once more, this release follows the 0.12.0 release from July 2015, the 0.12.1 release in September 2015, the 0.12.2 release in November 2015, the 0.12.3 release in January 2016, the 0.12.4 release in March 2016, the 0.12.5…
Original Post: Rcpp 0.12.17: More small updates

Statistics Sunday: Welcome to Sentiment Analysis with “Hotel California”

As promised in last week’s post, this week: sentiment analysis, also with song lyrics. Sentiment analysis is a method of natural language processing that involves classifying words in a document based on whether a word is positive or negative, or whether it is related to a set of basic human emotions; the exact results differ based on the sentiment analysis method selected. The tidytext R package has 4 different sentiment analysis lexicons: “AFINN” (for Finn Årup Nielsen), which classifies words from -5 to +5 in terms of negative or positive valence; “bing” (for Bing Liu and colleagues), which classifies words as either positive or negative; “loughran” (for Loughran-McDonald), mostly for financial and nonfiction works, which classifies words as positive or negative, as well as into the topics of uncertainty, litigious, modal, and constraining; and “nrc” (for the NRC lexicon), which…
Original Post: Statistics Sunday: Welcome to Sentiment Analysis with “Hotel California”
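The mechanics of lexicon-based scoring are simple enough to sketch in a few lines of base R. The mini "lexicon" below is invented for illustration (the real AFINN lexicon ships with tidytext and scores thousands of words from -5 to +5):

```r
# Minimal sketch of lexicon-based sentiment scoring with a tiny, hand-made
# AFINN-style lexicon (hypothetical scores, for illustration only).
lexicon <- c(lovely = 3, welcome = 2, dark = -1, desert = -1, prisoners = -2)

line  <- "Welcome to the Hotel California such a lovely place"
words <- tolower(strsplit(line, " ")[[1]])

scores    <- lexicon[words]               # NA for words not in the lexicon
sentiment <- sum(scores, na.rm = TRUE)    # 2 ("welcome") + 3 ("lovely")
sentiment                                 # 5
```

The tidytext workflow does the same thing at scale: tokenise, join tokens against a lexicon, and aggregate the scores per line, stanza, or song.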

R/exams @ eRum 2018

Keynote lecture about R/exams at eRum 2018 (European R Users Meeting) in Budapest: Slides, video, e-learning, replication materials. Keynote lecture at eRum 2018: R/exams was presented in a keynote lecture by Achim Zeileis at eRum 2018, the European R Users Meeting, this time organized by a team around Gergely Daróczi in Budapest. It was a great event with many exciting presentations, reflecting the vibrant R community in Europe (and beyond). This blog post provides various resources accompanying the presentation, which may be of interest to those who did not attend the meeting as well as those who did and who want to explore the materials in more detail. Most importantly, the presentation slides are available in PDF format (under CC-BY). Video: the eRum organizers did a great job in making the meeting accessible to those useRs who could not make…
Original Post: R/exams @ eRum 2018

RcppGSL 0.3.5

A maintenance update of RcppGSL just brought version 0.3.5 to CRAN, a mere twelve days after the RcppGSL 0.3.4 release. Just like yesterday’s upload of inline 0.3.15, it was prompted by a CRAN request to update the per-package manual page; see the inline post for details. The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package. No user-facing new code or features were added. The NEWS file entries follow below: Changes in version 0.3.5 (2018-05-19): Update package manual page using references to DESCRIPTION file [CRAN request]. Courtesy of CRANberries, a summary of changes to the most recent release is available. More information is on the RcppGSL page. Questions, comments, etc. should go to the issue tickets at the GitHub repo. This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please…
Original Post: RcppGSL 0.3.5

openrouteservice – geodata!

The openrouteservice provides a new way to get geodata into R. It has an API (or a set of them), and an R package has been written to communicate with said API(s); it is available from GitHub. I’ve just been playing around with the examples on this page, with the thought of using it for a project (more on that later if I get anywhere with it). Anyway… onto the code… which is primarily a modification of the examples page I mentioned earlier (see that page for more examples). devtools::install_github("GIScience/openrouteservice-r") Load some libraries: library(openrouteservice) library(leaflet) Set the API key: ors_api_key("your-key-here") Set the locations of interest and send the request to the API, asking for the region that is accessible within a 15-minute drive of the coordinates. coordinates <- list(c(8.55, 47.23424), c(8.34234, 47.23424), c(8.44, 47.4)) x <- ors_isochrones(coordinates, range = 60*15, # maximum time…
Original Post: openrouteservice – geodata!

Create Code Metrics with cloc

The cloc Perl script (yes, Perl!) by Al Danial (https://github.com/AlDanial/cloc) has long been one of the go-to tools for generating code metrics. Given a single file, directory tree, archive, or git repo, cloc can speedily give you metrics on the count of blank lines, comment lines, and physical lines of source code in a vast array of programming languages. I don’t remember the full context, but someone in the R community asked about this type of functionality, and I had tossed together a small script-turned-package to thinly wrap the Perl cloc utility. Said package was, and is, unimaginatively named cloc. Thanks to some collaborative input from @ma_salmon, the package gained more features. Recently I added the ability to process R Markdown (Rmd) files (i.e. only count lines in code chunks) to the main cloc Perl script and was performing…
Original Post: Create Code Metrics with cloc

An East-West less divided?

With tensions heightened recently at the United Nations, one might wonder whether we’ve drawn closer, or farther apart, over the decades since the UN was established in 1945. We’ll see if we can garner a clue by performing cluster analysis on the General Assembly voting of five of the founding members. We’ll focus on the five permanent members of the Security Council. Then later on we can look at whether Security Council vetoes corroborate our findings. A prior article, entitled the “cluster of six”, employed unsupervised machine learning to discover the underlying structure of voting data. We’ll use related techniques here to explore the voting history of the General Assembly, the only organ of the United Nations in which all 193 member states have equal representation. By dividing the voting history into two equal parts, which we’ll label as the “early years”…
Original Post: An East-West less divided?
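The core move — clustering countries by the similarity of their voting profiles — can be sketched in base R. The votes below are made up for illustration (coded 1 / -1 / 0 for yes / no / abstain), not the real General Assembly records the post analyses:

```r
# Toy sketch of clustering voting profiles for the five permanent Security
# Council members. The vote matrix is invented for illustration.
votes <- rbind(
  US     = c( 1, -1,  1,  1, -1,  1),
  UK     = c( 1, -1,  1,  1, -1,  1),
  France = c( 1, -1,  1,  0, -1,  1),
  Russia = c(-1,  1, -1, -1,  1,  0),
  China  = c(-1,  1, -1, -1,  1, -1)
)

d  <- dist(votes)                 # Euclidean distance between voting rows
hc <- hclust(d, method = "ward.D2")

# Cutting the dendrogram at k = 2 recovers the East-West split
# built into this toy data.
cutree(hc, k = 2)
```

The real analysis works the same way, just on decades of roll-call votes, which is where the "early years" versus later-years comparison becomes interesting.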

Do Clustering by “Dimensional Collapse”

Problem: imagine that someone in a bank wants to find out whether some of the bank’s credit card holders are actually the same person. So, according to his experience, he sets a rule: people who share either the same address or the same phone number can reasonably be regarded as the same person. Just as in this example: library(tidyverse) a <- data_frame(id = 1:16, addr = c("a", "a", "a", "b", "b", "c", "d", "d", "d", "e", "e", "f", "f", "g", "g", "h"), phone = c(130L, 131L, 132L, 133L, 134L, 132L, 135L, 136L, 137L, 136L, 138L, 138L, 139L, 140L, 141L, 139L), flag = c(1L, 1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 3L)) head(a) ## id addr phone flag ## 1 a 130 1 ## 2 a 131 1 ## 3 a 132 1 ## 4 b…
Original Post: Do Clustering by “Dimensional Collapse”
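The rule (records linked through a shared address or a shared phone belong to the same person) is a transitive closure, so one way to reproduce the flag column is to treat records as graph nodes and take connected components. A small base-R union-find sketch over the post's toy data, with no package dependencies (this is one possible implementation, not necessarily the post's "dimensional collapse" method):

```r
# Union-find over records: link any two records that share an address,
# then any two that share a phone; components are the "same person" groups.
addr  <- c("a","a","a","b","b","c","d","d","d","e","e","f","f","g","g","h")
phone <- c(130, 131, 132, 133, 134, 132, 135, 136, 137, 136,
           138, 138, 139, 140, 141, 139)

parent <- seq_along(addr)
find  <- function(i) { while (parent[i] != i) i <- parent[i]; i }
link  <- function(i, j) parent[find(i)] <<- find(j)

for (key in list(addr, phone))          # collapse each dimension in turn
  for (v in unique(key)) {
    idx <- which(key == v)
    for (i in idx[-1]) link(idx[1], i)  # union records sharing value v
  }

roots <- sapply(seq_along(addr), find)
flag  <- match(roots, unique(roots))    # relabel components 1, 2, 3, ...
flag
# 1 1 1 2 2 1 3 3 3 3 3 3 3 4 4 3  -- matches the post's flag column
```

Records 1-3 and 6 collapse into one person through address "a" and phone 132, and the long chain 7-13 plus 16 collapses through shared phones, exactly as the flag column encodes.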

Decision Modelling in R Workshop in The Netherlands!

The Decision Analysis in R for Technologies in Health (DARTH) workgroup is hosting a two-day workshop on decision analysis in R in Leiden, The Netherlands, from June 7-8, 2018. A one-day introduction to R course will also be offered the day before the workshop, on June 6th. Decision models are mathematical simulation models that are increasingly being used in the health sciences to simulate the impact of policy decisions on population health. New methodological techniques around decision modeling are being developed that rely heavily on statistical and mathematical methods. R is becoming increasingly popular in decision analysis as it provides a flexible environment where advanced statistical methods can be combined with decision models of varying complexity. Also, the fact that R is freely available improves model transparency and reproducibility. The workshop will guide participants through building probabilistic decision trees, Markov models…
Original Post: Decision Modelling in R Workshop in The Netherlands!