Crawling the internet: data science within a large engineering system

by BILL RICHOUX Critical decisions are being made continuously within large software systems. Often such decisions are the responsibility of a separate machine learning (ML) system. But there are instances when having a separate ML system is not ideal. In this blog post we describe one of these instances — Google search deciding when to check if web pages have changed. Through this example, we discuss some of the special considerations impacting a data scientist when designing solutions to improve decision-making deep within software infrastructure. Data scientists promote principled decision-making following several different arrangements. In some cases, data scientists provide executive level guidance, reporting insights and trends. Alternatively, guidance and insight may be delivered below the executive level to product managers and engineering leads, directing product feature development via metrics and A/B experiments. This post focuses on an even lower-level pattern, when…
Original Post: Crawling the internet: data science within a large engineering system

Should you learn R or Python for data science?

One of the most common questions I get asked is, “Should I learn R or Python?”. My general response is: it’s up to you! Both are popular open source data platforms with active, growing communities; both are highly sought after by employers, and both have a rich set of capabilities for working with data. It really depends most on your interests and the kind of employer you want to work for. If your interests lean more towards traditional statistical analysis and inference as used within industries like manufacturing, finance, and the life sciences, I’d lean towards R. If you’re more interested in machine learning and artificial intelligence applications, I’d lean towards Python. But even that’s not a hard-and-fast rule: R has excellent support for machine learning and deep learning frameworks, and Python is often used for traditional data science…
Original Post: Should you learn R or Python for data science?

Because it's Friday: Planes, pains, and automobiles

It’s conference season, so I’ve been travelling on a lot of planes recently. (And today, I’m heading to Budapest for eRum 2018.) So this resonates with me right now: There’s another video in the same vein about hotels which is quite funny as well. Anyway, that’s all from us for this week; we’ll be back next week reporting from eRum. Until then, have a great weekend!
Original Post: Because it's Friday: Planes, pains, and automobiles

goodpractice 1.0.2 on CRAN

Hannah Frick, Data Scientist We are excited to announce that the goodpractice package is now available on CRAN. The package gives advice about good practices when building R packages. Advice includes functions and syntax to avoid, package structure, code complexity, code formatting, etc. You can install the CRAN version via install.packages(“goodpractice”) Building R packages Building an R package is a great way of encapsulating code, documentation and data in a single testable and easily distributable unit. For a package to be distributed via CRAN, it needs to pass a set of checks implemented in R CMD check, such as: Is there minimal documentation, e.g., are all arguments of exported functions documented? Are all dependencies declared? These checks are helpful in developing a solid R package but they don’t check for several other good practices. For example, a package does not need to contain any tests but is it good practice…
Original Post: goodpractice 1.0.2 on CRAN

Five Takeaways from ODSC East 2018

The last four days in Boston have been nothing but attending talks and meeting great people. I was exposed to a variety of interesting topics, including data science and deep learning applications in healthcare and other fields, and technical discussions and training sessions at different levels. The bottom line is that ODSC definitely exceeded my expectations. Here I have compiled some resources that you can use, and I also want to share some of my takeaways.

Keywords: Generative Adversarial Networks, Transfer Learning, Deep Learning, Fake News Detection, TensorFlow, Kubernetes, Blockchain

Network analysis

How-to: Scroll down to the README file, find the Binder section, and click launch binder. This way you can run the notebooks without having to download anything. You need to understand basic Python to do the exercises.
This one was the most fun part of ODSC for me, not only because it was a well-designed course for everyone to get started with the topic, and not only because Eric Ma, the presenter, was such a nice presenter and a wonderful guy to talk with, but also because the next day we discussed at length how to apply machine learning on top of graphs using paper plates:

traditional machine learning vs machine learning on graphs vs link predictions using machine learning methods

Another big takeaway is that he couldn’t stress enough the importance of writing unit tests for data science work, which you all don’t do (just kidding, but I definitely don’t).
The tutorial on GitHub includes the students’ notebooks for exercises and the instructor’s notebooks with the answers.
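Since the point about unit testing data science code stuck with me, here is a minimal, self-invented sketch (not from Eric Ma's tutorial) of what testing a data transformation can look like in plain Python:

```python
import numpy as np

def standardize(x):
    """Scale a numeric array to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Unit tests: tiny, deterministic checks on the function's contract.
def test_standardize_mean_and_std():
    z = standardize([1.0, 2.0, 3.0, 4.0])
    assert abs(z.mean()) < 1e-12
    assert abs(z.std() - 1.0) < 1e-12

def test_standardize_preserves_order():
    z = standardize([3.0, 1.0, 2.0])
    assert z[1] < z[2] < z[0]

test_standardize_mean_and_std()
test_standardize_preserves_order()
```

With pytest installed, you can drop the explicit calls at the bottom and just run `pytest` on the file; it picks up the `test_` functions automatically.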

Generative Adversarial Networks

How-to: Download GANs.pdf and follow instructions.
This talk was given by Dan Becker, PhD, head of Kaggle Learn. He went through the basic concepts of GANs and the techniques for building generative networks and discriminative networks using Keras and the functional API in TensorFlow. Some interesting examples were shown during the presentation, such as transforming a video of a horse into a video of a zebra, and generating real-life images from simple sketches.
And in case you haven’t heard of it, Kaggle has a service called Kaggle Kernels that allows people to run notebooks on Kaggle’s infrastructure for free (including GPU support), which is super useful for playing around with the tutorials. You can find the links to the tutorial kernels in the pdf file.
Extra code available at if you want to modify the implementation of the generator and discriminator in the tutorial, such as adding dropout layers.

Deep learning for detecting fake news

The first and foremost question that should be asked is: why is a data scientist at Uber spending so much effort on this kind of stuff? Anyway, the talk raised an interesting point: identifying fake news is the wrong problem to solve, as fact checking is just hard for machines to do. Instead, the right problem to solve is to classify journalism vs. not journalism, sensationalism vs. objectivism, etc. Whether you like the argument or not, it does provide a viable way to apply existing natural language deep learning models to these kinds of problems. In short, they use (non-naive) doc2vec and LSTM models to extract features from the news articles, and build classifiers on top of them to categorize an article as journalism or not journalism.
They also created an application at, which you can use to test whether an article sounds like journalism or not. There are many other classifiers being tested, but those are only exposed to their developers at this point.
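Their actual pipeline is doc2vec plus LSTMs, which is more than a blog snippet can carry; as a rough stand-in, here is a toy sketch of the same "journalism vs. not journalism" framing using a bag-of-words vectorizer and a linear classifier (the texts and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy corpus: 1 = journalism, 0 = not journalism.
texts = [
    "Officials confirmed the budget figures at a press briefing on Tuesday.",
    "The committee released its findings after a six-month investigation.",
    "You WON'T BELIEVE what this celebrity did next!!!",
    "SHOCKING miracle cure that doctors don't want you to know about!",
]
labels = [1, 1, 0, 0]

# Vectorize the articles and fit a classifier on top, mirroring the
# "features in, journalism-or-not out" structure described in the talk.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["The mayor announced the results of the audit."]))
```

A real system would swap the TF-IDF features for document embeddings and the linear model for a recurrent network, but the classifier-on-top-of-features shape is the same.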

Docker, Kubernetes and distributed TensorFlow

If anyone is interested in container technology such as Docker, and how one can use it to enable machine learning training at scale, this would be helpful to you. Kubernetes is a container orchestrator for distributed applications. I know it sounds complicated, but basically it can create a cluster of nodes with containers containing your code to train a model. There were two sessions on the topic of using Kubernetes to distribute TensorFlow training. The first presenter is the founder of PipelineAI, a Silicon Valley start-up guy who started by offending all the women in the audience by saying east coast girls look better than west coast ones (and he admitted he has a problem with self-control). The second presenter, from Microsoft, spoke in a better manner, and the git repo is from him. It is Microsoft, though, so all the examples are tied to Azure.


Blockchain

Interestingly, while there were keynotes and marketing talks about blockchain and how it’s gonna impact your business, there was not a single technical session about it. Not one. I guess that says enough. The keynote presenter (from MIT) talked about how they could find the hidden network between bitcoin users, but to me that’s just network analysis on metadata in a blockchain rather than data science on encrypted data. He did talk about doing pattern recognition on encrypted data, but did not give any specific examples. Still, I guess data science on encrypted data is a real thing and could catch on soon. If you’re interested in these kinds of marketing stunts, go take a look at their website:


There was a talk about the evolution of color theory and technology and how to choose colors for data science projects. The presenter explained how opposite colors on the color wheel work, and why Monet was a master of using white plus just a little color to create a pleasing result. It happens that the Boston Museum of Fine Arts has the largest Monet collection in the States, so it was a natural decision to visit it after the talk. I also learned from the presentation that there’s an app called Adobe Capture which automatically turns photos into color palettes, which is quite fun:

Creating a color palette from Poppy Field Argenteuil by Claude Monet.

There was a guy talking about automated machine learning. I saw DataRobot, the company that does this sort of thing, getting a lot of exposure at ODSC, so I’ll just assume that it’s legit.

There was the founder of Julia talking about why Julia could become the next machine learning language by combining the ease of use of Python with the speed of C. I did not attend the talk, but its description says all of Python’s scikit-learn functionality can be imported into Julia, which is pretty cool.

There was a researcher talking about model interpretability using Locally Interpretable Model-agnostic Explanations (LIME). No material to share, but you can just google the term and see how it tries to interpret a complex model by forming locally linear problems.
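To make the "locally linear" idea concrete, here is a numpy-only sketch of LIME's core loop (not the lime library's API): perturb around the instance, query the black-box model, weight samples by proximity, and read local feature importances off a weighted linear fit. The model and numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    # Stand-in for a complex model: nonlinear in feature 0, weakly linear in feature 1.
    return np.sin(3 * X[:, 0]) + 0.1 * X[:, 1]

x_star = np.array([0.2, 0.5])                      # instance to explain
X = x_star + rng.normal(scale=0.1, size=(500, 2))  # local perturbations
y = black_box(X)

# Proximity kernel: samples closer to x_star get higher weight.
w = np.exp(-np.sum((X - x_star) ** 2, axis=1) / 0.02)

# Weighted least squares fit of a local linear surrogate model.
A = np.column_stack([np.ones(len(X)), X])
AtW = A.T * w
coef = np.linalg.solve(AtW @ A, AtW @ y)

# coef[1] approximates the local slope in feature 0 (about 3*cos(0.6) here),
# coef[2] the slope in feature 1 (about 0.1).
print(coef[1:])
```

The surrogate's coefficients are the "explanation": they describe how the black box behaves near this one instance, not globally.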

Because it's Friday: The eyes don't work

Spring has finally arrived around here, and a recent tweet reminded me of the depths of fear that Spring brought to 7-year-old me back in Australia: swooping magpies. These native birds, related to but quite different from the magpies of North America and Europe, get very aggressive in the Spring, and will attack anyone who walks into their territory, swooping in from behind and striking your head with their long sharp beak. (They can easily draw blood.) They’ll repeat their attacks over and over again until you leave. People try many things to prevent the attacks, like wearing sunglasses backwards or putting fake eyes on the back of their head, but as you can see from the video below they don’t really work: True story: my mum used to make me wear a plastic ice-cream tub…
Original Post: Because it's Friday: The eyes don't work

Compliance bias in mobile experiments

by DANIEL PERCIVAL Randomized experiments are invaluable in making product decisions, including on mobile apps. But what if users don’t immediately uptake the new experimental version? What if their uptake rate is not uniform? We’d like to be able to make decisions without having to wait for the long tail of users to experience the treatment to which they have been assigned. This blog post provides details for how we can make inferences without waiting for complete uptake. Background At Google, experimentation is an invaluable tool for making decisions and inference about new products and features. An experimenter, once their candidate product change is ready for testing, often needs only to write a few lines of configuration code to begin an experiment. Ready-made systems then perform standardized analyses on their work, giving a common and repeatable method of decision making. This…
Original Post: Compliance bias in mobile experiments

pytorch PU learning trick

I’m often using positive-unlabeled learning nowadays. In particular for observational dialog modeling, next utterance classification is a standard technique for training and evaluating models. In this setup the observed continuation of the conversation is considered a positive (since a human said it, it is presumed a reasonable thing to say at that point in the conversation) and other randomly chosen utterances are treated as unlabeled (they might be reasonable things to say at that point in the conversation).Suppose you have a model whose final layer is a dot product between a vector produced only from context and a vector produced only from response. I use models of this form as “level 1” models because they facilitate precomputation of a fast serving index, but note the following trick will not apply to architectures like bidirectional attention. Anyway for these models you…
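For concreteness, here is a hedged numpy sketch (not the post's pytorch code) of the model shape being described: the score of a (context, response) pair is a dot product between a context vector and a response vector, so every candidate response can be scored with one matrix-vector product against a precomputed index:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # embedding dimension (invented for illustration)

# Stand-ins for the outputs of learned context and response encoders.
context_vec = rng.normal(size=d)
response_vecs = rng.normal(size=(5, d))  # index 0: observed positive; rest: random "unlabeled"

# Scoring every candidate response is a single matrix-vector product,
# which is what makes a precomputed fast serving index possible.
scores = response_vecs @ context_vec

# Next-utterance classification: rank the observed continuation against
# the randomly drawn alternatives.
ranked = np.argsort(-scores)
print(ranked)
```

In the real setup the response vectors come from a trained encoder and can be indexed ahead of time; only the context vector is computed at serving time.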
Original Post: pytorch PU learning trick

Generative Adversarial Networks – An Experiment with Training Improvement Techniques (in Tensorflow)

Well, first: if you’re interested in deep learning but can’t be bothered to read this post at all, I recommend you take a look at the deep learning nanodegree offered by Udacity. It looks like a well-designed series of courses that covers all the major aspects of deep learning.


Generative Adversarial Networks (GANs) have become one of the most popular advancements in deep learning. The method is unique in that it has a generative model that generates synthetic data from noise, while another discriminative model tries to distinguish the synthetic data from the real data. For example, a well-trained GAN for image generation could produce images that look far more realistic than those generated by other deep learning models. Beyond images, generative models are also useful for tasks such as predicting rare events, as GANs can be used to increase the target sample size so that a predictive model yields better performance. This post is not intended to serve as an introduction to GANs, as there are many great articles covering that. For instance, this article uses a simple example to demonstrate the implementation of GANs in TensorFlow. The objective is to train a model that learns to generate a Gaussian distribution like this:

As a vanilla GAN model is hard to train, the article explored ways to improve training, one highlight being minibatch training (the article explains it well). Since the article was published, there have been further developments in GAN training techniques, so I took a look at a couple of them and applied them to this simple example to see how the new models perform. All the new code is written based on the original code here.

Original Models

The original article (which you’re recommended to read first) showed examples of distributions generated by models with and without the minibatch technique. I re-ran the two model training processes on my laptop:

No minibatch


The results look slightly different from the article’s. Minibatch is supposed to make the model better at generating a similar distribution, but here it didn’t work quite as intended.

Adding Noise

As explained here and here, adding Gaussian noise (with zero mean and tiny variance) to the input data of the discriminative network, i.e. the synthetic data points generated by the generative model and the data points sampled from the real Gaussian distribution, forces the generator output and the real distribution to spread out and overlap more, which makes training easier. I tweaked the original code so that the class DataDistribution can be used not only to sample data from the target distribution, but also to sample noise, by setting mu = 0 and sigma = 0.001 (or some other small number):

class DataDistribution(object):
    def __init__(self, mu=4, sigma=0.5):
        self.mu = mu
        self.sigma = sigma

    def sample(self, N):
        samples = np.random.normal(self.mu, self.sigma, N)
        return samples

In the train method, we can now add noise to the discriminator’s inputs:

for step in range(params.num_steps + 1):
        # update discriminator
        x = data.sample(params.batch_size)
        z = gen.sample(params.batch_size)
        # sample noise to add to the discriminator inputs
        n_x = noise.sample(params.batch_size)
        n_z = noise.sample(params.batch_size)
        loss_d, _ =[model.loss_d, model.opt_d], {
                model.x: np.reshape(x + n_x, (params.batch_size, 1)),
                model.z: np.reshape(z + n_z, (params.batch_size, 1))
        })
The results are as follows:

No minibatch, added noise (std = 0.001)

Minibatch, added noise (std = 0.001)

The model without minibatch is able to mimic the bell shape pretty well, but notice that it also leaves a long tail to the left, and the training loss of the generator actually increased from the first example. The minibatch model does look to have improved a lot from the first example: the output distribution is much less concentrated around the mean now.

Feature Matching

This post explains pretty well how feature matching works in training GANs. The basic idea is that, instead of using only the discriminator’s output (activation) layer to minimize the generator’s loss, it uses information from a hidden layer together with the output layer for better optimization. To implement this, we need to expose a hidden layer (h2) of the discriminator:

def discriminator(input, h_dim, minibatch_layer=True):
    h0 = tf.nn.relu(linear(input, h_dim * 2, 'd0'))
    h1 = tf.nn.relu(linear(h0, h_dim * 2, 'd1'))
    # without the minibatch layer, the discriminator needs an additional layer
    # to have enough capacity to separate the two distributions correctly
    if minibatch_layer:
        h2 = minibatch(h1)
    else:
        h2 = tf.nn.relu(linear(h1, h_dim * 2, scope='d2'))

    h3 = tf.sigmoid(linear(h2, 1, scope='d3'))
    return h3, h2

h2 will be fed into the generator’s loss function:

# Original loss function: self.loss_g = tf.reduce_mean(-log(self.D2))
self.loss_g = tf.sqrt(tf.reduce_sum(tf.pow(self.D1_h2 - self.D2_h2, 2)))

where D1_h2 and D2_h2 are the two h2 layers from the discriminators that take in the generator’s data and the real samples, respectively. Here are the results:

No minibatch, added noise (std = 0.001), feature matching

Minibatch, added noise (std = 0.001), feature matching

The model without minibatch improved from the last attempt, as you can tell the fat tail has disappeared, though it may not be an obvious improvement over the vanilla method. In contrast, the model with minibatch and added noise did not perform well.


The experiments yielded mixed results, but this really is just a toy project. The updated code can be found here. If you’re interested in learning more about GANs, the articles linked in this post are all really good starting points, provided you have prior knowledge of traditional deep networks:

GANs introduction and example: http://
Improvement techniques: http://
Adding Noises: http://
Feature matching: http://

ggplot2 Time Series Heatmaps: revisited in the tidyverse

I revisited my previous post on creating beautiful time series calendar heatmaps in ggplot, moving the code into the tidyverse. To obtain the following example, simply use the following code. I hope the commented code is self-explanatory – enjoy 🙂
Original Post: ggplot2 Time Series Heatmaps: revisited in the tidyverse