A primer in using Java from R – part 1

The rJava package provides a low-level interface to the Java Virtual Machine. It allows you to create objects, call methods, and access fields of those objects, and it also makes it easy to include Java resources in R packages. Note the system requirements: Java JDK 1.2 or higher (JDK 1.4 or higher for JRI/REngine). After attaching the package, we also need to initialize a Java Virtual Machine (JVM):

Creating Java objects with rJava

We will now very quickly go through the basic uses of the package. The .jnew function is used to create a new Java object. Note that the class argument requires a fully qualified class name in Java Native Interface notation.

    # Creating a new object of java.lang class String
    sHello <- .jnew(class = "java/lang/String", "Hello World!")
    # Creating a new object of java.lang class Integer
    iOne <- .jnew(class =…
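The excerpt cuts off mid-example, but as a rough sketch of the workflow it describes (based on rJava's documented API, not the original post's full code), here is how initialization, object creation, and a method call fit together:

    library(rJava)
    .jinit()  # initialize the JVM before any other rJava call

    # create a java.lang.String and call its length() method;
    # "I" is the JNI signature for an int return type
    sHello <- .jnew(class = "java/lang/String", "Hello World!")
    .jcall(sHello, "I", "length")  # returns 12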
Original Post: A primer in using Java from R – part 1

Thanks for Reading!

As I’ve been blogging more about statistics, R, and research in general, I’ve been trying to increase my online presence, sharing my blog posts in groups of like-minded people. Those efforts seem to have paid off, based on my view counts over the past year. And based on read counts, here are my top 10 blog posts, most of which are stats-related:

Beautiful Asymmetry – none of us is symmetrical, and that’s okay
Statistical Sins: Stepwise Regression – just step away from stepwise regression
Statistics Sunday: What Are Degrees of Freedom? (Part 1) – and read Part 2 here
Working with Your Facebook Data in R
Statistics Sunday: Free Data Science and Statistics Resources
Statistics Sunday: What is Bootstrapping?
Statistical Sins: Know Your Variables (A Confession) – we all make mistakes, but we should learn from them
Statistical Sins: Not Making it…
Original Post: Thanks for Reading!

A guide to working with character data in R

R is primarily a language for working with numbers, but we often need to work with text as well. Whether it’s formatting text for reports or analyzing natural language data, R provides a number of facilities for working with character data. Handling Strings with R, a free (CC-BY-NC-SA) e-book by UC Berkeley’s Gaston Sanchez, provides an overview of the ways you can manipulate characters and strings with R. There are many useful sections in the book, but a few selections include:

Note that the book does not cover analysis of natural language data, for which you might want to check out the CRAN Task View on Natural Language Processing or the book Text Mining with R: A Tidy Approach. It’s also sadly silent on character encoding in R, an issue that often causes problems when dealing with text data, especially from…
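To give a flavour of the kind of string handling the book covers, here is a short illustration using base R functions (my own sketch, not drawn from the book):

    x <- "Handling Strings with R"
    nchar(x)                      # count characters: 23
    toupper(x)                    # "HANDLING STRINGS WITH R"
    substr(x, 1, 8)               # "Handling"
    gsub("Strings", "Text", x)    # "Handling Text with R"
    strsplit(x, " ")[[1]]         # split into a vector of words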
Original Post: A guide to working with character data in R

Using DataCamp’s Autograder to Teach R

Immediate and personalized feedback has been central to the learning experience on DataCamp since we launched the first courses. If students submit code that contains a mistake, they are told where the mistake is and how they can fix it. You can play around with it in our free Introduction to R course. The screenshot below is from our Intermediate R course. To check submissions and generate feedback, every exercise on DataCamp features a so-called Submission Correctness Test, or SCT. The SCT is a script of custom tests that assesses the code students submitted, along with the output and variables their code created. For every language that we teach on DataCamp, we have built a corresponding open-source package to easily verify all these elements of a student submission. For R exercises, this package is called testwhat. Over the…
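As a concrete illustration, here is a rough sketch of what an SCT script might look like (a hypothetical exercise; the check functions follow testwhat's documented API, which may differ by version):

    # Suppose the exercise asks the student to run:
    #   result <- mean(cars$speed, na.rm = TRUE)
    # The SCT checks both how they called mean() and what they stored.
    # (testwhat and magrittr's %>% are available inside SCT scripts.)
    ex() %>% check_function("mean") %>% check_arg("na.rm") %>% check_equal()
    ex() %>% check_object("result") %>% check_equal()
    success_msg("Well done! Your mean is correct.")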
Original Post: Using DataCamp’s Autograder to Teach R

Melt and cast the shape of your data.frame – Exercises

Datasets often arrive in a form different from what our modelling or visualisation functions need, and those functions don't necessarily all require the same format. Reshaping data.frames is a step that all analysts need but many struggle with. Practicing this meta-skill will in the long run result in more time to focus on the actual analysis. The solutions to this set will rely on data.table, mostly melt() and dcast(), which are originally from the reshape2 package. However, you can also get practice out of it using your favourite base R, tidyverse, or any other method and then compare the results. Solutions are available here.

Exercise 1
Take the following data.frame from this form:

    df <- data.frame(id = 1:2, q1 = c("A", "B"), q2 = c("C", "A"),
                     stringsAsFactors = FALSE)
    df
      id q1 q2
    1  1  A  C…
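The excerpt cuts off before the target shape, but as an illustration of the melt()/dcast() pattern these exercises practice (a sketch, not the official solution), here is how that data.frame reshapes to long form and back:

    library(data.table)
    dt <- as.data.table(df)

    # wide -> long: one row per (id, question) pair
    long <- melt(dt, id.vars = "id", measure.vars = c("q1", "q2"),
                 variable.name = "question", value.name = "answer")

    # long -> wide again
    dcast(long, id ~ question, value.var = "answer")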
Original Post: Melt and cast the shape of your data.frame – Exercises

Creating Slopegraphs with R

Presenting data results in the most informative and compelling manner is part of the role of the data scientist. It’s all well and good to master the arcana of some algorithm, to manipulate the numbers and bend them to your will to produce a “solution” that is both accurate and useful. But those activities are typically in pursuit of informing some decision, or at least providing information that serves a purpose. So taking those results and making them compelling and understandable to your audience is part of your job! This article will focus on my efforts to develop an R function designed to automate the process of producing a Tufte-style slopegraph using ggplot2 and dplyr. Tufte is often considered one of the pioneers of data visualization, and you are likely to have been influenced by techniques…
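To give a sense of the underlying approach (a minimal sketch with made-up data, not the post's actual function), a basic slopegraph in ggplot2 is essentially lines connecting paired points across two time columns, with direct labels instead of a legend:

    library(ggplot2)

    # hypothetical data: one value per group at each of two time points
    df <- data.frame(group = rep(c("A", "B", "C"), times = 2),
                     year  = rep(c("2017", "2018"), each = 3),
                     value = c(10, 20, 15, 12, 18, 21))

    ggplot(df, aes(x = year, y = value, group = group)) +
      geom_line() +
      geom_point() +
      geom_text(data = subset(df, year == "2017"),
                aes(label = group), hjust = 1.5) +
      theme_minimal()  # a spare theme in the spirit of Tufte's minimalism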
Original Post: Creating Slopegraphs with R

Parallelizing Linear Regression or Using Multiple Sources

My previous post explained how it is mathematically possible to parallelize the computation used to estimate the parameters of a linear regression. More specifically, we have an $n\times k$ matrix $X$ and an $n$-dimensional vector $y$, and we want to compute $\widehat{\beta}=(X^\top X)^{-1}X^\top y$ by splitting the job. Instead of using all $n$ observations at once, we’ve seen that it is possible to compute “something” using the first block of rows, then the next block, etc. Then, finally, we “aggregate” the objects created to get our overall estimate.

Parallelizing on multiple cores

Let us see how it works from a computational point of view, running each computation on a different core of the machine. Each core will act as a slave, computing what we’ve seen in the previous post. Here, the data we use are

    y = cars$dist
    X = data.frame(1, cars$speed)
    k = ncol(X)

On my…
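The excerpt stops before the parallel code, but the chunked idea is that $X^\top X=\sum_j X_j^\top X_j$ and $X^\top y=\sum_j X_j^\top y_j$, where $X_j$ and $y_j$ are the row blocks. A minimal sketch of this with the parallel package (my own illustration, not the post's code):

    library(parallel)

    y <- cars$dist
    X <- as.matrix(data.frame(1, cars$speed))

    # split the row indices into 4 blocks
    blocks <- split(seq_len(nrow(X)), cut(seq_len(nrow(X)), 4, labels = FALSE))

    # each core computes X_j' X_j and X_j' y_j on its block
    # (mclapply forks; on Windows, parLapply with a cluster is the equivalent)
    pieces <- mclapply(blocks, function(idx) {
      Xj <- X[idx, , drop = FALSE]
      list(xtx = crossprod(Xj), xty = crossprod(Xj, y[idx]))
    }, mc.cores = 2)

    # aggregate the blocks and solve for the overall estimate
    xtx <- Reduce(`+`, lapply(pieces, `[[`, "xtx"))
    xty <- Reduce(`+`, lapply(pieces, `[[`, "xty"))
    solve(xtx, xty)   # should match coef(lm(dist ~ speed, data = cars))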
Original Post: Parallelizing Linear Regression or Using Multiple Sources

Announcing new software review editors: Anna Krystalli and Lincoln Mullen

Part of rOpenSci’s mission is to create technical infrastructure in the form of carefully vetted R software tools that lower barriers to working with data sources on the web. Our system of open peer review for community-contributed software tools is a key component of this. As the rOpenSci community grows and more package authors submit their work for peer review, we need to expand our editorial board to maintain a speedy process. As our recent post shows, package submissions have grown every year since we started this experiment, and we see no reason they will slow down! Editors manage the review process, performing initial package checks, identifying reviewers, and moderating the process until the package is accepted by reviewers and transferred to rOpenSci. Anna Krystalli and Lincoln Mullen have both served as guest editors for rOpenSci and now they join as…
Original Post: Announcing new software review editors: Anna Krystalli and Lincoln Mullen

Idle thoughts lead to R internals: how to count function arguments

“Some R functions have an awful lot of arguments,” you think to yourself. “I wonder which has the most?” It’s not an original thought: the same question, applied to the R base package, is an exercise in the Functions chapter of the excellent Advanced R, and much of the information in this post came from there. There are lots of R packages; we’ll limit ourselves to those that ship with R and load on startup. Which ones are they?

What packages load on starting R?
Start a new R session and type search(). Here’s the result on my machine:

    search()
     [1] ".GlobalEnv"        "tools:rstudio"     "package:stats"     "package:graphics"
     [5] "package:grDevices" "package:utils"     "package:datasets"  "package:methods"
     [9] "Autoloads"         "package:base"

We’re interested in the packages with priority = base. Next question:

How can I see and filter for package priority?
You don’t need dplyr for this, but it helps.

    library(tidyverse)
    installed.packages()…
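The excerpt ends here, but the key step in answering the original question (a sketch consistent with the Advanced R exercise, not necessarily the post's exact code) is to pull every function from a package environment and count its formals():

    # count the formal arguments of every function in package:base
    base_env <- as.environment("package:base")
    funs     <- Filter(is.function, mget(ls(base_env), envir = base_env))
    n_args   <- vapply(funs, function(f) length(formals(f)), integer(1))

    # functions with the most arguments
    # (note: primitive functions have NULL formals, so they count as 0 here)
    head(sort(n_args, decreasing = TRUE))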
Original Post: Idle thoughts lead to R internals: how to count function arguments

A Comparative Review of the BlueSky Statistics GUI for R

Introduction
BlueSky Statistics’ desktop version is a free and open-source graphical user interface for the R software that focuses on beginners looking to point-and-click their way through analyses. A commercial version is also available; it includes technical support and a version for Windows Terminal Servers such as Remote Desktop or Citrix. Mac, Linux, or tablet users could run it via a terminal server. This post is one of a series of reviews that aim to help non-programmers choose the Graphical User Interface (GUI) that is best for them. These reviews also include a cursory description of the programming support that each GUI offers.

Terminology
There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any…
Original Post: A Comparative Review of the BlueSky Statistics GUI for R