Any programming environment should be optimized for its task, and not all tasks are alike. For example, if you are exploring uncharted mountain ranges, the portability of a tent is essential. However, when building a house to weather hurricanes, investing in a strong foundation is important. Similarly, when beginning a new data science programming project, it is prudent to assess how much effort should be put into ensuring the code is reproducible. Note that it is certainly possible to go back later and “shore up” the reproducibility of a project where it is weak. This is often the case when an “ad-hoc” project becomes an important production analysis. However, the first step in starting a project is to make a decision regarding the trade-off between the amount of time to set up the project and the probability that the project…

Original Post: Package Management for Reproducible R Code

# Posts by R Views

## Fitting a TensorFlow Linear Classifier with tfestimators

In a recent post, I mentioned three avenues for working with TensorFlow from R:

* The keras package, which uses the Keras API for building scalable deep learning models
* The tfestimators package, which wraps Google’s Estimators API for fitting models with pre-built estimators
* The tensorflow package, which provides an interface to Google’s low-level TensorFlow API

In this post, Edgar and I use the linear_classifier() function, one of six pre-built models currently in the tfestimators package, to train a linear classifier using data from the titanic package.

`library(tfestimators)`
`library(tensorflow)`
`library(tidyverse)`
`library(titanic)`

The titanic_train data set contains 12 fields of information on 891 passengers from the Titanic. First, we load the data, split it into training and test sets, and have a look at it.

`titanic_set <- titanic_train %>% filter(!is.na(Age))`
`# Split the data into training and test data sets`
`indices <-…`

Original Post: Fitting a TensorFlow Linear Classifier with tfestimators

## Introduction to Kurtosis

Happy 2018 and welcome to our first reproducible finance post of the year! What better way to ring in a new beginning than pondering/calculating/visualizing returns distributions? We ended 2017 by tackling skewness, and we will begin 2018 by tackling kurtosis. Kurtosis is a measure of the degree to which portfolio returns appear in the tails of our distribution. A normal distribution has a kurtosis of 3, which follows from the fact that a normal distribution does have some of its mass in its tails. A distribution with a kurtosis greater than 3 has more returns out in its tails than the normal, and one with kurtosis less than 3 has fewer returns in its tails than the normal. That matters to investors because more bad returns out in the tails means that our portfolio might be at risk of a rare…

Original Post: Introduction to Kurtosis
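The benchmark of 3 mentioned in the excerpt above can be checked with a few lines of base R. This is only a minimal sketch, not the code from the original post: the helper name `kurtosis_moment` and the simulated series are made up for illustration, and it uses the plain moment formula (fourth central moment divided by the squared variance). Packaged implementations may apply sample-size corrections or report excess kurtosis (kurtosis minus 3) instead.

```r
# Hypothetical helper: moment-based kurtosis,
# i.e. the fourth central moment divided by the squared second central moment.
kurtosis_moment <- function(x) {
  x <- x[!is.na(x)]          # drop missing returns
  d <- x - mean(x)           # deviations from the mean
  mean(d^4) / mean(d^2)^2
}

set.seed(2018)

# Simulated stand-ins for returns (assumed data, not real portfolio returns)
normal_returns      <- rnorm(1e5)        # normal: kurtosis close to 3
fat_tailed_returns  <- rt(1e5, df = 5)   # Student t: more mass in the tails
thin_tailed_returns <- runif(1e5)        # uniform: less mass in the tails

kurtosis_moment(normal_returns)
kurtosis_moment(fat_tailed_returns)
kurtosis_moment(thin_tailed_returns)
```

With these simulated inputs, the first value should land near 3, the second above it, and the third below it, which mirrors the three cases described in the excerpt.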

## Downtime Reading

Not everyone has the luxury of taking some downtime at the end of the year, but if you do have some free time, you may enjoy something on my short list of downtime reading. The books and articles here are not exactly “light reading”, nor are they literature for cuddling by the fire. Nevertheless, you may find something that catches your eye. The Syncfusion series of free eBooks contains more than a few gems on a variety of programming subjects, including James McCaffrey’s R Programming Succinctly and Barton Poulson’s R Succinctly. For a more ambitious read, mine the rich vein of SUNY Open Textbooks. My pick is Hiroki Sayama’s Introduction to the Modeling and Analysis of Complex Systems. If you just can’t get enough of data science, then a few articles that caught my attention are: Finally, if you really have…

Original Post: Downtime Reading

## Nov 2017: New Package Picks

Two hundred thirty-seven new packages made it to CRAN in November. Here are my picks for the “Top 40” organized into the categories: Computational Methods, Data, Data Science, Science, Social Science, Utilities and Visualizations.

### Data Science

* imbalance v0.1.1: Provides algorithms to treat unbalanced datasets. See the vignette for details.
* intrinsicdimension v1.1.0: Implements a variety of methods for estimating intrinsic dimension of data sets (i.e., the manifold or Hausdorff dimension of the support of the distribution that generated the data) as reviewed in Johnsson et al. (2015). The vignette provides an Overview.
* ppclust v0.1.0: Implements probabilistic clustering algorithms for partitioning datasets including Fuzzy C-Means (Bezdek, 1974), Possibilistic C-Means (Krishnapuram & Keller, 1993), Possibilistic Fuzzy C-Means (Pal et al., 2005), Possibilistic Clustering Algorithm (Yang et al., 2006), Possibilistic C-Means with Repulsion (Wachs et al., 2006) and other variants. There are vignettes on…

Original Post: Nov 2017: New Package Picks

## A Data Science Lab for R

In a previous post I described the role of analytic administrator as a data scientist who: onboards new tools, deploys solutions, supports existing standards, and trains other data scientists. In this post I will describe how someone in that role might set up a data science lab for R.

### Architecture

A data science lab is an environment for developing code and creating content. It should enhance the productivity of your data scientists and integrate with your existing systems. Your data science lab might live on your premises or in the cloud. It might be built with hardware, virtual machines, or containers. You may use it to support a single data scientist or hundreds of R developers. Here is one reference architecture of a data science lab based on server instances. Key components of this setup include: authentication; load balancing; a…

Original Post: A Data Science Lab for R

## Introduction to Skewness

In previous posts here, here, and here, we spent quite a bit of time on portfolio volatility, using the standard deviation of returns as a proxy for volatility. Today we will begin a two-part series on additional statistics that aid our understanding of return dispersion: skewness and kurtosis. Beyond being fancy words and required vocabulary for CFA level 1, these two concepts are both important and fascinating for lovers of returns distributions. For today, we will focus on skewness. Skewness is the degree to which returns are asymmetric around the mean. Since a normal distribution is symmetric around the mean, skewness can be taken as one measure of how returns are not distributed normally. Why does skewness matter? If portfolio returns are right, or positively, skewed, it implies numerous small negative returns and a few large positive returns. If…

Original Post: Introduction to Skewness
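The sign convention described in the excerpt above (many small negative returns plus a few large positive ones give positive skewness) can be illustrated in base R. This is a rough sketch rather than the post's own code: `skewness_moment` is a hypothetical helper using the plain moment formula (third central moment divided by the standard deviation cubed), and the return vectors are invented toy numbers.

```r
# Hypothetical helper: moment-based skewness,
# i.e. the third central moment divided by the cubed standard deviation.
skewness_moment <- function(x) {
  x <- x[!is.na(x)]          # drop missing returns
  d <- x - mean(x)           # deviations from the mean
  mean(d^3) / mean(d^2)^1.5
}

# Toy returns in percent (assumed values, for illustration only)
right_skewed <- c(-1, -1, -1, -1, 4)   # several small losses, one large gain
left_skewed  <- -right_skewed          # mirror image: small gains, one large loss
symmetric    <- c(-2, -1, 0, 1, 2)     # balanced around the mean

skewness_moment(right_skewed)   # positive
skewness_moment(left_skewed)    # negative
skewness_moment(symmetric)      # zero
```

Mirroring the series flips the sign of the statistic, and a series that is balanced around its mean has zero skewness, which is why skewness works as one rough check on departures from normality.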

## Connecting R to Keras and TensorFlow

It has always been the mission of R developers to connect R to the “good stuff”. As John Chambers puts it in his book Extending R: “One of the attractions of R has always been the ability to compute an interesting result quickly. A key motivation for the original S remains as important now: to give easy access to the best computations for understanding data.”

From the day it was announced a little over two years ago, it was clear that Google’s TensorFlow platform for Deep Learning is good stuff. This September (see announcement), J.J. Allaire, François Chollet, and the other authors of the keras package delivered on R’s “easy access to the best” mission in a big way. Data scientists can now build very sophisticated Deep Learning models from an R session while maintaining the flow that R users…

Original Post: Connecting R to Keras and TensorFlow

## How to Show R Inline Code Blocks in R Markdown

### Inline code with R Markdown

R Markdown is a well-known tool for reproducible science in R. In this article, I will focus on a few tricks with R inline code. Some time ago, I was writing a vignette for my package WordR. I was using R Markdown. At one point I wanted to show `r expression` in the output, exactly as it is shown here, as an inline code block. In both R Markdown and Markdown, we can write `abc` to show abc (I will use the blue colour for code blocks showing code as it appears in the Rmd file, whereas the default colour will be showing the rendered output). What is not obvious is that you can use double backticks to escape single backticks in the code block. So code like this: `` `abc` `` (mind the spaces!) produces…

Original Post: How to Show R Inline Code Blocks in R Markdown
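The double-backtick rule from the excerpt above can be turned into a small base R helper. This is a sketch under assumptions, not code from the original post: `wrap_code_span` is a hypothetical function that follows the general Markdown rule of delimiting a code span with a backtick run longer than any run inside it, padded with spaces when the content itself contains backticks. It only builds the Markdown text; how knitr treats inline R expressions is a separate matter and is not handled here.

```r
# Hypothetical helper: wrap text in a Markdown code span, choosing a
# delimiter one backtick longer than the longest backtick run in the text.
wrap_code_span <- function(x) {
  runs <- regmatches(x, gregexpr("`+", x))[[1]]            # backtick runs in x
  longest <- if (length(runs) == 0) 0 else max(nchar(runs))
  fence <- strrep("`", longest + 1)                        # delimiter string
  pad <- if (longest > 0) " " else ""                      # "mind the spaces!"
  paste0(fence, pad, x, pad, fence)
}

plain_span   <- wrap_code_span("abc")     # single backticks are enough
escaped_span <- wrap_code_span("`abc`")   # needs double backticks and spaces

cat(plain_span, "\n")
cat(escaped_span, "\n")
```

Text without backticks gets the usual single-backtick wrapper, while text that already contains single backticks is wrapped in double backticks with a space on each side, as in the excerpt.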

## October 2017 New Packages

Of the 182 new packages that made it to CRAN in October, here are my picks for the “Top 40”. They are organized into eight categories: Engineering, Machine Learning, Numerical Methods, Science, Statistics, Time Series, Utilities and Visualizations. Engineering is a new category, and its appearance may be an early signal for the expansion of R into a new domain. The Science category is well-represented this month. I think this is the result of the continuing trend for working scientists to wrap their specialized analyses into R packages.

### Engineering

* FlowRegEnvCost v0.1.1: Calculates the daily environmental costs of river-flow regulation by dams based on García de Jalon et al. (2017).
* rroad v0.0.4: Computes and visualizes the International Roughness Index (IRI) given a longitudinal road profile for a single road segment, or for a sequence of segments with a fixed length. For…

Original Post: October 2017 New Packages