Interactive Statistical Graphics: Showing More By Showing Less

Version 4 of the R Hmisc package and version 5 of the R rms package interface with interactive plotly graphics, which is an interface to the D3 JavaScript graphics library.  This allows various results of statistical analyses to be viewed interactively, with pre-programmed drill-down information.  More examples will be added here.  We start with a video showing a new way to display survival curves.

Note that plotly graphics are best used with RStudio R Markdown HTML notebooks, and are distributed to reviewers as self-contained (but somewhat large) HTML files.  Printing is discouraged, but possible, using snapshots of the interactive graphics.

Concerning the second bullet point below, box plots have a high ink:information ratio and hide bimodality and other data features.  Many statisticians prefer to use dot plots and violin plots.  I liked those methods for a while, then started to have trouble with the choice of a smoothing…
Original Post: Interactive Statistical Graphics: Showing More By Showing Less

Preprocessing for Machine Learning with tf.Transform

Posted by Kester Tong, David Soergel, and Gus Katsiapis, Software Engineers

When applying machine learning to real-world datasets, a lot of effort is required to preprocess data into a format suitable for standard machine learning models, such as neural networks. This preprocessing takes a variety of forms, from converting between formats, to tokenizing and stemming text and forming vocabularies, to performing a variety of numerical operations such as normalization.

Today we are announcing tf.Transform, a library for TensorFlow that allows users to define preprocessing pipelines and run these using large-scale data processing frameworks, while also exporting the pipeline in a way that can be run as part of a TensorFlow graph. Users define a pipeline by composing modular Python functions, which tf.Transform then executes with Apache Beam, a framework for large-scale, efficient, distributed data processing. Apache Beam pipelines can be…
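The pattern described above, a full-pass analysis step whose results feed an instance-level transform that can later run at serving time, can be sketched in plain Python. This is an illustrative sketch of the idea only, not the actual tf.Transform or Apache Beam API; the field names and helpers are hypothetical:

```python
def analyze(rows):
    """Full pass over the data: compute the statistics the transform
    will need (here, min/max of a numeric feature and a vocabulary
    for a string feature)."""
    xs = [r["x"] for r in rows]
    vocab = sorted({r["s"] for r in rows})
    return {
        "x_min": min(xs),
        "x_max": max(xs),
        "vocab": {term: i for i, term in enumerate(vocab)},
    }

def transform(row, stats):
    """Instance-level transform: uses only the precomputed statistics,
    so the same logic could be replayed inside a serving graph."""
    span = (stats["x_max"] - stats["x_min"]) or 1.0
    return {
        "x_scaled": (row["x"] - stats["x_min"]) / span,
        "s_id": stats["vocab"][row["s"]],
    }

rows = [{"x": 1.0, "s": "cat"}, {"x": 3.0, "s": "dog"}, {"x": 5.0, "s": "cat"}]
stats = analyze(rows)
out = [transform(r, stats) for r in rows]
# out[0] == {"x_scaled": 0.0, "s_id": 0}
```

The key design point this illustrates is the separation of the expensive full-pass computation from the cheap per-example function, which is what lets the same preprocessing run both in a batch pipeline and in a deployed model.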
Original Post: Preprocessing for Machine Learning with tf.Transform

The CS Capacity Program – New Tools and SIGCSE 2017

Posted by Chris Stephenson, Head of Computer Science Education Strategy

The CS Capacity program was launched in March of 2015 to help address a dramatic increase in undergraduate computer science enrollments that is creating serious resource and pedagogical challenges for many colleges and universities. Over the last two years, a diverse group of universities have been working to develop successful strategies that support the expansion of high-quality CS programs at the undergraduate level. Their work focuses on innovations in teaching and technologies that support scaling while ensuring the engagement of women and underrepresented students. These innovations could provide assistance to many other institutions that are challenged to provide a high-quality educational experience to an increasing number of introductory-level students.

The cohort of CS Capacity institutions includes George Mason University, Mount Holyoke College, Rutgers University, and the University of California, Berkeley, which are working…
Original Post: The CS Capacity Program – New Tools and SIGCSE 2017

Software Engineering vs Machine Learning Concepts

Not all core concepts from software engineering translate into the machine learning universe. Here are some differences I’ve noticed.

Divide and Conquer

A key technique in software engineering is to break a problem down into simpler subproblems, solve those subproblems, and then compose them into a solution to the original problem. Arguably, this is the entire job, recursively applied until the solution can be expressed in a single line in whatever programming language is being used. The canonical pedagogical example is the Tower of Hanoi.

Unfortunately, in machine learning we never exactly solve a problem. At best, we approximately solve a problem. This is where the technique needs modification: in software engineering the subproblem solutions are exact, but in machine learning errors compound and the aggregate result can be complete rubbish. In addition, apparently paradoxical situations can arise where a component is…
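The Tower of Hanoi example can be made concrete in a few lines of Python. Each recursive call returns an exact solution to its subproblem, and composing those solutions yields an exact solution overall, which is precisely the property that approximate ML components lack:

```python
def hanoi(n, src="A", dst="C", aux="B"):
    """Divide and conquer: move n-1 discs out of the way, move the
    largest disc, then move the n-1 discs back on top of it.
    Returns the list of (from_peg, to_peg) moves."""
    if n == 0:
        return []
    return (hanoi(n - 1, src, aux, dst)   # subproblem 1: exact
            + [(src, dst)]                # base step: exact
            + hanoi(n - 1, aux, dst, src))  # subproblem 2: exact

moves = hanoi(3)
# 2**3 - 1 == 7 moves, and the composition is guaranteed correct
```

Because every subproblem solution is exact, correctness of the whole follows mechanically from correctness of the parts; replace each subcall with a 95%-accurate learned component and no such guarantee survives composition.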
Original Post: Software Engineering vs Machine Learning Concepts

An updated YouTube-8M, a video understanding challenge, and a CVPR workshop. Oh my!

Posted by Paul Natsev, Software Engineer

Last September, we released the YouTube-8M dataset, which spans millions of videos labeled with thousands of classes, in order to spur innovation and advancement in large-scale video understanding. More recently, other teams at Google have released datasets such as Open Images and YouTube-BoundingBoxes that, along with YouTube-8M, can be used to accelerate image and video understanding. To further these goals, today we are releasing an update to the YouTube-8M dataset, and in collaboration with Google Cloud Machine Learning and kaggle.com, we are also organizing a video understanding competition and an affiliated CVPR’17 Workshop.

An Updated YouTube-8M

The new and improved YouTube-8M includes cleaner and more verbose labels (twice as many labels per video, on average), a cleaned-up set of videos, and for the first time, the dataset includes pre-computed audio features, based on a state-of-the-art audio modeling…
Original Post: An updated YouTube-8M, a video understanding challenge, and a CVPR workshop. Oh my!

A Litany of Problems With p-values

In my opinion, null hypothesis testing and p-values have done significant harm to science.  The purpose of this note is to catalog the many problems caused by p-values.  As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress.

The American Statistical Association has done a great service by issuing its Statement on Statistical Significance and P-values.  Now it’s time to act.  To create the needed motivation to change, we need to fully describe the depth of the problem.

It is important to note that no statistical paradigm is perfect.  Statisticians should choose paradigms that solve the greatest number of real problems and have the fewest faults.  This is why I believe that the Bayesian and likelihood paradigms should replace frequentist inference.

Consider an assertion such as “the coin is fair”,…
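To make the "coin is fair" assertion concrete, here is a minimal Python sketch (illustrative only, not from the original post) of the exact two-sided p-value for observing k heads in n flips under the fair-coin null hypothesis:

```python
from math import comb

def two_sided_binomial_p(k, n, p=0.5):
    """Exact two-sided p-value: total probability of every outcome
    at least as extreme (i.e., no more probable) than the observed k,
    under the null Binomial(n, p)."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(q for q in probs if q <= probs[k] + 1e-12)

# 60 heads in 100 flips of a supposedly fair coin:
pval = two_sided_binomial_p(60, 100)
# pval is roughly 0.057: "not significant" at the 0.05 cutoff,
# yet hardly evidence that the coin IS fair
```

The example hints at one of the problems cataloged in the post: the p-value quantifies surprise under the null, not the probability that the null ("the coin is fair") is true.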
Original Post: A Litany of Problems With p-values