How we built a Shiny App for 700 users?

One of our senior data scientists, Olga Mierzwa-Sulima, spoke at the useR! conference in Brussels to a packed house: the seats were full and audience members were spilling out the doors (source: https://twitter.com/matlabulous/status/882530484374392834). Olga’s talk, entitled ‘How we built a Shiny App for 700 users?’, went over the main challenges associated with scaling a Shiny application and the methods we used to resolve them, framed as a real-life case study from Appsilon’s experience of building an app used daily by 700 users, where our data science team tackled all of these problems. This was, to our knowledge, one of the biggest production deployments of a Shiny app. Shiny has proved itself a great tool for communicating data science teams’ results.…
Original Post: How we built a Shiny App for 700 users?

Creating interactive SVG tables in R

In this post we will explore how to make SVG tables in R using plotly. The tables are visually appealing and can be modified on the fly using simple drag and drop. Make sure you install the latest version of plotly, i.e. v 4.7.1.9, from GitHub using devtools::install_github("ropensci/plotly"). The easiest way to create an SVG table in R using plotly is by manually specifying headers and individual cell values. The following code snippet highlights this:
# library(devtools)
# install_github("ropensci/plotly")
library(plotly)
p <- plot_ly(
  type = 'table',  # Specify type of plot as table
  # header is a list and every parameter shown below needs
  # to be specified. Note that html tags can be used as well
  header = list(  # First specify table headers
    # Note the enclosure within 'list'
    values = list(list('EXPENSES'), list('Q1'), list('Q2'), list('Q3'), list('Q4')),
    # Formatting
    line =…
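For reference, a self-contained sketch of the same idea, continuing past where the excerpt cuts off; the cell values, colors, and sizes below are illustrative placeholders, not the post's actual figures:
library(plotly)

# Minimal sketch: a plotly table trace with a header row and one sub-list of
# cell values per column (placeholder expense figures)
p <- plot_ly(
  type = "table",
  header = list(
    values = c("EXPENSES", "Q1", "Q2", "Q3", "Q4"),
    align = c("left", rep("center", 4)),
    line = list(width = 1, color = "black"),
    fill = list(color = "#119DFF"),
    font = list(size = 14, color = "white")
  ),
  cells = list(
    values = list(
      c("Salaries", "Office", "Merchandise", "Legal", "TOTAL"),
      c(1200000, 20000, 80000, 2000, 1302000),
      c(1300000, 20000, 70000, 2000, 1392000),
      c(1300000, 20000, 120000, 2000, 1442000),
      c(1400000, 20000, 90000, 2000, 1512000)
    ),
    align = c("left", rep("center", 4)),
    line = list(width = 1, color = "black"),
    font = list(size = 12, color = "black")
  )
)
p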
Original Post: Creating interactive SVG tables in R

Unintentional data

A large part of the data we data scientists are asked to analyze was not collected with the specific analysis in mind, or perhaps any particular analysis. In this space, many assumptions of classical statistics no longer hold. The data scientist working today lives in what Brad Efron has termed the “era of scientific mass production,” of which he remarks, “But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind.” [1] Statistics, as a discipline, was largely developed in a small data world. Data was expensive to gather, and therefore decisions to collect data were generally well-considered. Implicitly, there was a prior belief about some interesting causal mechanism or an underlying hypothesis motivating…
Original Post: Unintentional data

ARIMA models and Intervention Analysis

In my previous tutorial Structural Changes in Global Warming I introduced the strucchange package and some basic examples to date structural breaks in time series. In the present tutorial, I am going to show how dating structural changes (if any) and then Intervention Analysis can help in finding better ARIMA models. Dating structural changes consists of determining whether there are any structural breaks in the time series data generating process, and, if so, their dates. Intervention analysis estimates the effect of an external or exogenous intervention on a time series; an example of such an intervention is a permanent level shift, as we will see in this tutorial. In our scenario, the external or exogenous intervention is not known (or assumed to be known) in advance; it is inferred from the structural break we will identify. The dataset considered for the analysis…
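To make the workflow concrete, here is a minimal sketch under assumed data: a simulated series with a level shift stands in for the tutorial's dataset, strucchange dates the break, and forecast::auto.arima fits an ARIMA model with the shift encoded as an external regressor.
library(strucchange)
library(forecast)

# Simulated monthly series with a permanent level shift at observation 61
# (a stand-in for the tutorial's dataset)
set.seed(123)
y <- ts(c(rnorm(60, mean = 10), rnorm(60, mean = 14)),
        start = c(2000, 1), frequency = 12)

# Date structural breaks in the mean level of the series
bp <- breakpoints(y ~ 1)
summary(bp)

# Encode the first estimated break date as a step (level-shift) regressor
shift <- as.numeric(seq_along(y) > bp$breakpoints[1])

# Fit an ARIMA model that treats the level shift as an intervention regressor
fit <- auto.arima(y, xreg = shift)
summary(fit)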
Original Post: ARIMA models and Intervention Analysis

My advice on dplyr::mutate()

There are substantial differences between ad-hoc analyses (be they machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time. Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark. “Character is what you are in the dark.” (John Whorfin, quoting Dwight L. Moody.) I have found that to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations…
Original Post: My advice on dplyr::mutate()

Data liquidity in the age of inference

Water ripples (source: Blazing Firebug via Pixabay). Save the dates for the Artificial Intelligence Conference in New York, happening April 29-May 2, 2018. The call for speakers is now open. It’s a special time in the evolutionary history of computing. Oft-used terms like big data, machine learning, and artificial intelligence have become popular descriptors of a broader underlying shift in information processing. While traditional rules-based computing isn’t going anywhere, a new computing paradigm is emerging around probabilistic inference, where digital reasoning is learned from sample data rather than hardcoded with Boolean logic. This shift is so significant that a new computing stack is forming around it, with emphasis on data engineering, algorithm development, and even novel hardware designs optimized for parallel computing workloads, both within data centers and at endpoints. A funny thing about probabilistic inference is that when models work…
Original Post: Data liquidity in the age of inference

You need an Analytics Center of Excellence

Ferris wheel (source: Skeeze via Pixabay). Check out Carme Artigas’ executive briefing “Analytics Centers of Excellence as a way to accelerate big data adoption by business” at the Strata Data Conference in Singapore, Dec. 5-7. Registration is now open. More than 10 years after big data emerged as a new technology paradigm, it is finally in a mature state and its business value throughout most industry sectors is established by a significant number of use cases. A couple of years ago, the discussion was still about how big data changed our way of capturing, processing, analyzing, and exploiting data in new and meaningful ways for business decision makers. Now many companies undertake analytical projects at a departmental level, redefining the relationship between business and IT by the adoption of Agile and DevOps methodologies. Real-time processing, machine learning algorithms, and even artificial…
Original Post: You need an Analytics Center of Excellence

Regression Analysis — What You Should’ve Been Taught But Weren’t, and Were Taught But Shouldn’t Have Been

The above title was the title of my talk this evening at our Bay Area R Users Group. I had been asked to talk about my new book, and I presented four of the myths that are dispelled in the book. Hadley also gave an interesting talk, “An introduction to tidy evaluation,” involving some library functions that are aimed at writing clearer, more readable R. The talk came complete with audience participation, very engaging and informative. The venue was GRAIL, a highly impressive startup. We will be hearing a lot more about this company, I am sure.
Original Post: Regression Analysis — What You Should’ve Been Taught But Weren’t, and Were Taught But Shouldn’t Have Been

Time Series Analysis in R Part 1: The Time Series Object

Many of the methods used in time series analysis and forecasting have been around for quite some time but have taken a back seat to machine learning techniques in recent years. Nevertheless, time series analysis and forecasting are useful tools in any data scientist’s toolkit. Some time series-based competitions have recently appeared on Kaggle, such as one hosted by Wikipedia where competitors are asked to forecast web traffic to various pages of the site. As an economist, I have been working with time series data for many years; however, I was largely unfamiliar with (and a bit overwhelmed by) R’s functions and packages for working with them. From the base ts objects to a whole host of other packages like xts, zoo, TTR, forecast, quantmod and tidyquant, R has a large infrastructure supporting time series analysis. I decided to…
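As a small illustrative sketch of the base ts object the series starts from (the monthly values below are made up, and the xts conversion is just one of the package routes the post mentions):
library(xts)

# Twelve made-up monthly observations stored as a base ts object
values <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118)
ts_obj <- ts(values, start = c(2016, 1), frequency = 12)

# The ts object carries its own time attributes
print(ts_obj)
start(ts_obj); end(ts_obj); frequency(ts_obj)

# Packages such as xts build richer, date-indexed structures on top of this
xts_obj <- as.xts(ts_obj)
head(xts_obj)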
Original Post: Time Series Analysis in R Part 1: The Time Series Object

Query the planet: Geospatial big data analytics at Uber

Uber’s Presto architecture (source: courtesy of Zhenxiao Luo). For more on efficient geospatial analysis, check out Zhenxiao Luo and Wei Yan’s session, “Geospatial big data analysis at Uber,” at the Strata Data Conference in New York, September 25-28, 2017. From determining the most convenient rider pickup points to predicting the fastest routes, Uber aims to use data-driven analytics to create seamless trip experiences. Within engineering, analytics inform decision-making processes across the board. One of the distinct challenges for Uber is analyzing geospatial big data. City locations, trips, and event information, for instance, provide insights that can improve business decisions and better serve users. Geospatial data analysis is particularly challenging in a big data setting, with queries such as computing how many rides start at a transit location, how many drivers are crossing state lines, and so on. For these analytical requests, we…
Original Post: Query the planet: Geospatial big data analytics at Uber