Five Takeaways from ODSC East 2018

The last four days in Boston have been nothing but attending talks and meeting great people. I was exposed to a variety of interesting topics, including data science and deep learning applications in healthcare and other fields, as well as technical discussions and training sessions at different levels. The bottom line is that ODSC definitely exceeded my expectations. Here I have compiled some resources that you can use, and I also want to share some of my takeaways.

Keywords: Generative Adversarial Networks, Transfer Learning, Deep Learning, Fake News Detection, TensorFlow, Kubernetes, Blockchain

Network analysis

How-to: Scroll down to the README file, find the Binder section and click the launch binder button. This way you can run the notebooks without having to download anything. You need to understand basic Python to do the exercises.
This was the most fun part of ODSC, not only because it was a well-designed course that lets anyone get started with the topic, and not only because Eric Ma, the presenter, was such a good speaker and a wonderful guy to talk with, but also because the next day we discussed at length how to apply machine learning on top of graphs, using paper plates:

traditional machine learning vs machine learning on graphs vs link predictions using machine learning methods

Another big takeaway is that he couldn’t stress enough the importance of writing unit tests for data science work, which you all don’t do (just kidding, but I definitely don’t).
The tutorial on GitHub includes student notebooks for the exercises and instructor notebooks for checking the answers.
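
To make the unit testing point concrete, here is a minimal sketch of what a test for a small data cleaning step might look like. The function and its name are made up for illustration and are not taken from the tutorial; the idea is only that a cleaning step gets a few assertions about its output.

```python
def impute_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = sorted(v for v in values if v is not None)
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mid = len(observed) // 2
    if len(observed) % 2 == 1:
        median = observed[mid]
    else:
        median = (observed[mid - 1] + observed[mid]) / 2.0
    return [median if v is None else v for v in values]


def test_impute_missing_with_median():
    cleaned = impute_missing_with_median([1.0, None, 3.0, 5.0])
    # The output keeps its length and has no missing entries left.
    assert len(cleaned) == 4
    assert all(v is not None for v in cleaned)
    # The median of [1, 3, 5] is 3, and observed values are untouched.
    assert cleaned == [1.0, 3.0, 3.0, 5.0]


test_impute_missing_with_median()
```

A test runner such as pytest would pick up the test function automatically; calling it directly, as above, works too.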

Generative Adversarial Networks

How-to: Download GANs.pdf and follow instructions.
This talk was given by Dan Becker, PhD, head of Kaggle Learn. He went through the basic concepts of GANs and the techniques for building generative and discriminative networks using Keras and the functional interface in TensorFlow. There were some interesting examples shown during the presentation, such as transforming a video of a horse into a video of a zebra, and generating realistic images from simple sketches.
And in case you haven’t heard of it, Kaggle has a service called Kaggle Kernels, which allows people to run notebooks on Kaggle’s infrastructure for free (including GPU support), which is super useful for playing around with the tutorials. You can find the links to the tutorial kernels in the PDF file.
Extra code is available if you want to modify the implementation of the generator and discriminator in the tutorial, for example by adding dropout layers.

Deep learning for detecting fake news

The first and foremost question that should be asked is: why is a data scientist at Uber spending so much effort on this kind of stuff? Anyway, the talk raised an interesting point: identifying fake news is the wrong problem to solve, because fact checking is just hard for machines to do. Instead, the right problem to solve is to classify journalism vs. non-journalism, sensationalism vs. objectivity, etc. Whether you like the argument or not, it does provide a viable way to apply existing deep learning models for natural language to these kinds of problems. In short, they use (non-naive) doc2vec and LSTM models to extract features from news articles, and build classifiers on top of them to categorize each article as journalism or not journalism.
They also created an application that you can use to test whether an article sounds like journalism or not. Many other classifiers are being tested, but those are only exposed to their developers at this point.

Docker, Kubernetes and distributed TensorFlow

If anyone is interested in container technology such as Docker, and how one can use it to enable machine learning training at scale, this would be helpful to you. Kubernetes is a container orchestrator for distributed applications. I know it sounds complicated, but basically it can create a cluster of nodes running containers that contain your code to train a model. There were two sessions on using Kubernetes to distribute TensorFlow training. The first presenter is the founder of PipelineAI, a Silicon Valley start-up guy who started by offending all women by saying East Coast girls look better than West Coast ones (and he admitted he has a problem with self-control). The second presenter, from Microsoft, spoke in a better manner, and the git repo is from him. It is Microsoft, though, so all the examples are tied to Azure.


Interestingly, while there were keynotes and marketing talks about blockchain and how it’s going to impact your business, there was not one technical session about it. Not one. I guess that says enough. The keynote presenter (from MIT) talked about how they could find the hidden network between bitcoin users, but to me that’s just network analysis on the metadata in a blockchain rather than data science on encrypted data. He did talk about how to do pattern recognition on encrypted data, but did not give any specific examples. Still, I guess doing data science on encrypted data is a real thing and could catch attention soon. If you’re interested in these kinds of marketing stunts, go take a look at their website:


There was a talk about the evolution of color theory and technology and how to choose colors for data science projects. The presenter explained how opposite colors on the color wheel work, and why Monet was a master of using white plus just a little color to create a pleasing result. It happens that the Boston Museum of Fine Arts has the largest Monet collection in the States, so it was a natural decision to visit it after the talk. I also learned from the presentation that there’s an app called Adobe Capture which automatically turns photos into color palettes, which is quite fun:

Creating a color palette from Poppy Field Argenteuil by Claude Monet.

There was a guy talking about automated machine learning. I saw DataRobot, a company that does this sort of thing, getting a lot of exposure at ODSC, so I’ll just assume that it’s legit.

There was the founder of Julia talking about why Julia could become the next Machine Learning language by combining the ease of use of Python with the speed of C. I did not attend the talk, but the description of the talk says all Python’s scikit-learn stuff could be imported into Julia, which is pretty cool.

There was a researcher talking about model interpretability using Local Interpretable Model-agnostic Explanations (LIME). There is no material to share, but you can just google the term and see how it tries to interpret a complex model by fitting locally linear approximations.
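
To give a feel for the idea, here is a minimal one-dimensional sketch of a local linear explanation. It is not the lime package and not the presenter’s code. The stand-in black_box model, the symmetric perturbation grid, and the Gaussian kernel width are all assumptions made for illustration. The recipe is: perturb the input around a point, query the model, weight the samples by proximity, and fit a weighted straight line.

```python
import math


def black_box(x):
    """Stand-in for a complex model: any function we can only query."""
    return x ** 2


def explain_locally(model, x0, width=0.5, n_points=201, span=2.0):
    """Fit a weighted straight line to `model` around x0; return (slope, intercept)."""
    # Perturb the input on a symmetric grid around x0.
    xs = [x0 + span * (2.0 * i / (n_points - 1) - 1.0) for i in range(n_points)]
    ys = [model(x) for x in xs]
    # Points close to x0 get larger weights (Gaussian kernel).
    ws = [math.exp(-((x - x0) ** 2) / (width ** 2)) for x in xs]

    # Weighted least squares for a single feature.
    w_sum = sum(ws)
    x_mean = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_mean = sum(w * y for w, y in zip(ws, ys)) / w_sum
    cov_xy = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(ws, xs, ys))
    var_x = sum(w * (x - x_mean) ** 2 for w, x in zip(ws, xs))
    slope = cov_xy / var_x
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = explain_locally(black_box, x0=3.0)
print("local slope near x0 = 3:", slope)
```

For this toy model the fitted slope approximates the derivative of x squared at x0, so the "explanation" says that near x0 = 3 the prediction grows by about 6 per unit of input.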

Generative Adversarial Networks – An Experiment with Training Improvement Techniques (in Tensorflow)

Well, first: if you’re interested in deep learning but can’t be bothered to read this post at all, I recommend taking a look at the deep learning nanodegree offered by Udacity. It looks like a well-designed series of courses that covers all major aspects of deep learning.


Generative Adversarial Networks (GANs) have become one of the most popular advancements in deep learning. The method is unique in that it has a generative model that generates synthetic data from noise, while another, discriminative model tries to distinguish the synthetic data from real data. For example, a well-trained GAN could generate images that look far more realistic than those generated by other deep learning models. In addition to image recognition, generative models are also useful for tasks such as predicting rare events, as GANs could be used to increase the target sample size so that a predictive model yields better performance. This post is not intended to serve as an introduction to GANs, as there are many great articles covering that. For instance, this article uses a simple example to demonstrate the implementation of GANs in TensorFlow. The objective is to train a model that learns to generate a Gaussian distribution like this:

As a vanilla GAN model is hard to train, the article explored ways to improve training, one highlight being minibatch discrimination (the article explains it well). Since the article was published, training techniques for GANs have developed further. Therefore, I took a look at a couple of techniques and applied them to this simple example to see how the new models perform. All the new code is based on the original code here.
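
For reference, the two losses that a vanilla GAN plays against each other can be sketched in plain Python. This only illustrates the standard objective, assuming the discriminator outputs probabilities strictly between 0 and 1. The function names are mine and are not taken from the original code.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Mean of -log D(x) - log(1 - D(G(z))) over a batch."""
    total = sum(-math.log(r) - math.log(1.0 - f) for r, f in zip(d_real, d_fake))
    return total / len(d_real)


def generator_loss(d_fake):
    """Mean of -log D(G(z)): small when the discriminator is fooled."""
    return sum(-math.log(f) for f in d_fake) / len(d_fake)


# A confident discriminator: real samples scored high, generated ones low.
confident_d = discriminator_loss([0.9, 0.9], [0.1, 0.1])
confident_g = generator_loss([0.1, 0.1])

# A confused discriminator: everything scored 0.5.
confused_d = discriminator_loss([0.5, 0.5], [0.5, 0.5])
confused_g = generator_loss([0.5, 0.5])

print(confident_d, confident_g)
print(confused_d, confused_g)
```

The generator’s loss drops as the discriminator gets confused, while the discriminator’s loss rises, which is the tug-of-war that makes training unstable.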

Original Models

The original article (which you’re recommended to read first) showed examples of generated distributions by models with or without minibatch technique. I re-ran the two model training processes on my laptop:

No minibatch


The results look slightly different from those in the article. Minibatch is supposed to make the model better at generating a similar distribution, but it didn’t work quite as well as intended.

Adding Noise

As explained here and here, adding Gaussian noise (with zero mean and tiny variance) to the inputs of the discriminative network, i.e. both the synthetic data points generated by the generative model and the data points sampled from the real Gaussian distribution, can force the generator output and the real distribution to spread out and overlap more, which makes training easier. I tweaked the original code so that the class DataDistribution can be used not only to sample data from the target distribution, but also to sample noise by setting mu = 0 and sigma = 0.001 (or some other small number):

import numpy as np


class DataDistribution(object):
    def __init__(self, mu=4, sigma=0.5):
        self.mu = mu
        self.sigma = sigma

    def sample(self, N):
        # draw N points from a Gaussian with mean mu and std sigma
        samples = np.random.normal(self.mu, self.sigma, N)
        return samples

In the train method, we can now add noise to the inputs of the discriminator:

# noise = DataDistribution(mu=0, sigma=0.001) is created before the loop;
# session is the TensorFlow session of the original train function
for step in range(params.num_steps + 1):
    # update discriminator
    x = data.sample(params.batch_size)
    z = gen.sample(params.batch_size)
    # sample noise
    n_x = noise.sample(params.batch_size)
    n_z = noise.sample(params.batch_size)
    loss_d, _, = session.run([model.loss_d, model.opt_d], {
        model.x: np.reshape(x + n_x, (params.batch_size, 1)),
        model.z: np.reshape(z + n_z, (params.batch_size, 1))
    })

The results are as follows:

No minibatch, added noise (std = 0.001)

Minibatch, added noise (std = 0.001)

The model without minibatch is able to mimic the bell shape pretty well, but notice that it also leaves a long tail on the left. The generator’s training loss actually increased compared with the first example. The minibatch model does appear to have improved a lot over the first example: the output distribution is now much less concentrated around the mean.

Feature Matching

This post explains pretty well how feature matching works in training GANs. The basic idea is that, instead of using only the output (activation) layer of the discriminator to minimize the loss of the generator, it also uses information from a hidden layer for better optimization. To implement this, we need to expose a hidden layer (h2) of the discriminator:

def discriminator(input, h_dim, minibatch_layer=True):
    h0 = tf.nn.relu(linear(input, h_dim * 2, 'd0'))
    h1 = tf.nn.relu(linear(h0, h_dim * 2, 'd1'))
    # without the minibatch layer, the discriminator needs an additional layer
    # to have enough capacity to separate the two distributions correctly
    if minibatch_layer:
        h2 = minibatch(h1)
    else:
        h2 = tf.nn.relu(linear(h1, h_dim * 2, scope='d2'))

    h3 = tf.sigmoid(linear(h2, 1, scope='d3'))
    # return the hidden layer as well so the generator loss can use it
    return h3, h2

h2 is then fed into the generator’s loss function:

# Original loss function: self.loss_g = tf.reduce_mean(-log(self.D2))
self.loss_g = tf.sqrt(tf.reduce_sum(tf.pow(self.D1_h2 - self.D2_h2, 2)))

Here D1_h2 and D2_h2 are the h2 layers of the discriminator when it takes in real samples and the generator’s data, respectively (D2 being the generator side, as in the original loss above). Here are the results:

No minibatch, added noise (std = 0.001), feature matching

Minibatch, added noise (std = 0.001), feature matching

The model without minibatch improved over the last attempt, as the fat tail has disappeared, though it is not an obvious improvement over the vanilla method. In contrast, the model with minibatch and added noise did not perform well.
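
One possible reason for the mixed results: the loss above takes the norm of the row-by-row difference between two batches of hidden activations, whereas feature matching as it is usually described compares the batch averages of the features. The plain-Python sketch below contrasts the two on made-up activations. The function names and numbers are mine, and this is an illustration rather than the code used for the plots.

```python
import math


def elementwise_l2(real_feats, fake_feats):
    """Norm of the row-by-row difference, as in the loss_g line above."""
    total = 0.0
    for real_row, fake_row in zip(real_feats, fake_feats):
        total += sum((r - f) ** 2 for r, f in zip(real_row, fake_row))
    return math.sqrt(total)


def column_means(rows):
    """Average each feature (column) over the batch."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_feature_matching(real_feats, fake_feats):
    """Squared distance between the batch-averaged features."""
    real_mean = column_means(real_feats)
    fake_mean = column_means(fake_feats)
    return sum((r - f) ** 2 for r, f in zip(real_mean, fake_mean))


# Two batches that contain the same rows in a different order.
real = [[1.0, 0.0], [0.0, 1.0]]
fake = [[0.0, 1.0], [1.0, 0.0]]

print(elementwise_l2(real, fake))         # depends on how samples are paired
print(mean_feature_matching(real, fake))  # only depends on batch statistics
```

The row-by-row version penalizes the generator for the arbitrary pairing of samples within a batch, while the batch-mean version only asks that the generated features match the real ones on average.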


The experiments yielded mixed results, but this really is just a toy project. The updated code can be found here. If you’re interested in learning more about GANs, the articles linked in this post are all really good starting points, provided you have prior knowledge of traditional deep networks:

GANs introduction and example: http://
Improvement techniques: http://
Adding Noises: http://
Feature matching: http://

Scoring H2O MOJO Models with Spark DataFrame and Dataset

by Jiankun Liu


H2O allows you to export models as POJOs or MOJOs (Model Object, Optimized) and deploy them in production later, presumably for scoring large datasets or building real-time applications. Theoretically this would work in a Spark application, but the official documentation does not go into detail, other than saying you can “create a map to call the POJO for each row and save the result to a new column, row-by-row.” One post showed how to import the dependencies, load the models, and make predictions in the Spark shell, but did not actually provide examples of scoring with a Spark DataFrame or Dataset. So I had to research on my own, and after some trial and error, I finally made it work.
The scenario I created below builds a model in R using H2O, exports the model as a MOJO, and then imports it into a Spark application to score a test data set. Since it’s Star Wars season again, I can’t help but make this post a bit relevant. So I used the starwars dataset to build a model that predicts the likelihood of a character being human based on their height and mass.
This post assumes that you have some experience with H2O and Spark.

Model training in R with H2O

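# Import libraries and initialize h2o
library(dplyr)
library(h2o)
h2o.init()

model_path <- "/path/to/model"  # Specify where to save the model
test_path <- "/path/to/test"  # Specify where to save the test set
# NOTE: the is.na() arguments were lost in the original listing; the columns
# used in filter() below are a best-guess reconstruction
starwars <- dplyr::starwars %>% 
  select(name, height, mass, species) %>% 
  filter(!is.na(species) & (!is.na(height) | !is.na(mass))) %>%
  mutate(mass = ifelse(is.na(mass), median(mass, na.rm = TRUE), mass)) %>% 
  mutate(is_human = ifelse(species == "Human", 1, 0)) %>%
  select(-species) %>%
  as.h2o()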

The data after preprocessing looks like this:

name            height  mass   is_human
Luke Skywalker  172     77.0   1
C-3PO           167     75.0   0
R2-D2           96      32.0   0
Darth Vader     202     136.0  1

To convince myself that Darth Vader can be classified as human, I had to resort to Stack Exchange, where someone’s research suggested that only 31.15 percent of Darth Vader was replaced by machine, so I was cool with that.
Next, we split the dataset into training and test set. The test set will also be used as an example for scoring in Spark. We then build a logistic regression model by calling h2o’s glm function.

starwars.split <- h2o.splitFrame(starwars, ratios = 0.75, seed = 1234)
train <- starwars.split[[1]]
test <- starwars.split[[2]]

# Save test data
h2o.exportFile(test, path = test_path)

# Fit a glm model
fit <- h2o.glm(x = c("height", "mass"), 
               y = "is_human", 
               training_frame = train, 
               validation_frame = test,
               family = "binomial")
# ** Reported on validation data. **

# MSE:  0.2616778
# RMSE:  0.5115445
# LogLoss:  0.7174052
# Mean Per-Class Error:  0.4583333
# AUC:  0.3863636
# Gini:  -0.2272727
# R^2:  -0.04869349
# Residual Deviance:  33.00064
# AIC:  39.00064

The performance is awful, which is totally expected. Seriously, the aliens in Star Wars are too human-like if you only take height and weight into account. Maybe skin color would be a better predictor, but that can be saved for another day.

Finally, we export the model to disk:

h2o.download_mojo(fit, path = model_path, get_genmodel_jar=TRUE)

This generates a mojo model in .zip as well as a .jar file that is later used as a dependency for scoring.

Scoring with Spark (and Scala)

You can use either spark-submit or spark-shell. If you use spark-submit, h2o-genmodel.jar needs to be put in the lib folder under the root directory of your Spark application so that it can be added as a dependency during compilation. The following code assumes you're running spark-shell. In order to use h2o-genmodel.jar, you need to add the jar file when launching spark-shell by providing the --jars flag. For example:

/usr/lib/spark/bin/spark-shell \
--conf spark.serializer="org.apache.spark.serializer.KryoSerializer" \
--conf spark.driver.memory="3g" \
--conf spark.executor.memory="10g" \
--conf spark.executor.instances=10 \
--conf spark.executor.cores=4 \
--jars /path/to/h2o-genmodel.jar

Now in the Spark shell, import the dependencies

import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import _root_.hex.genmodel.MojoModel

Using DataFrame

val modelPath = "/path/to/zip/file"
val dataPath = "/path/to/test/data"

// Import data
val dfStarWars = spark.read.option("header", "true").csv(dataPath)
// Import MOJO model
val mojo = MojoModel.load(modelPath)
val easyModel = new EasyPredictModelWrapper(mojo)

// score
val dfScore = dfStarWars.map {
  x =>
    val r = new RowData
    r.put("height", x.getAs[String](1))
    r.put("mass", x.getAs[String](2))
    val score = easyModel.predictBinomial(r).classProbabilities
    (x.getAs[String](0), score(1))
}.toDF("name", "isHumanScore")

The variable score is an array of two probabilities, one for level 0 and one for level 1. score(1) is the probability of level 1, which is "human". By default, the result of map has the generic column names "_1", "_2", etc. You can rename the columns by calling toDF.

Using Dataset

To use the Dataset API we just need to create two case classes, one for the input data, and one for the output.

case class StarWars (
  name: String,
  height: String,
  mass: String,
  is_human: String
)

case class Score (
  name: String,
  isHumanScore: Double
)

// Dataset
val dtStarWars = dfStarWars.as[StarWars]
val dtScore = dtStarWars.map {
  x =>
    val r = new RowData
    r.put("height", x.height)
    r.put("mass", x.mass)
    val score = easyModel.predictBinomial(r).classProbabilities
    Score(x.name, score(1))
}

With a Dataset you can get the value of a column by calling x.columnName directly. Just note that the column values have to be of type String, so you might need to cast them manually if they are defined with other types in the case class.



Build a multi-label document classification model for SHARE using One vs the rest classifiers

Jiankun Liu

A recent post talked about how we can label documents on SHARE with Natural Language Processing models. In this post I’m going to include more detail on how it was done. If you are interested in reading further, I recommend you read the previous post (link) first, which introduced the problem, the data set, and the workflow.

If you’d like some background on general machine learning text classification techniques, I highly recommend this awesome tutorial done by scikit-learn.

Our goal is to predict the subject area of documents as accurately as possible. In this blog post we are going to explain how to achieve this subject classification using machine learning models, how to build a framework that scales to large data, and the fun we can have using deep learning with TensorFlow.

Data and preprocessing

The data set I used is from the PLOS API, which I introduced in the previous post (link). I reused the code from here with a few changes to harvest all of the data from PLOS with subject areas. Although the taxonomy has a tree structure, the API does not provide a full list of terms for each document. The only way to get the labels for documents is to specify the subject area in each query, which makes fetching all possible terms for each document very inconvenient. Therefore I decided to start with the top-tier subject areas.

The basic processing steps:
– Harvest documents for each subject area
– Clean the text data using the code here
– Store both the raw documents and preprocessed documents into MongoDB
– Remove outliers, including documents with not enough words for accurate classification

Since this tutorial focuses on the training step, I won’t go over the code for the entire preprocessing step.

First, we map each of the 11 subject areas to a number from 0 to 10. Because each document can have multiple subject areas, those numbers will later be used to create binary classifications in the training step. In other words, we’ll create 11 different classifications for each document, one for each of the subject areas, and specify whether or not that document maps to that subject area.

  • Biology and life sciences
  • Computer and information sciences
  • Earth sciences
  • Ecology and environmental sciences
  • Engineering and technology
  • Medicine and health sciences
  • People and places
  • Physical sciences
  • Research and analysis methods
  • Science policy
  • Social sciences

Fetching data

Now, we’ll go to the source – PLOS – and gather training data.

In the examples below, we’ll assume that we have a Mongo database pre-loaded with all of the data from PLOS. The code for gathering this data is available here.

We’ll use x to represent explanatory variables, and y to represent response variables.

# Initialization.
from pymongo import MongoClient
from sklearn.model_selection import train_test_split
# settings and MongoProcessor come from the project's own code

# Default values
STOP_WORDS = 'english'
NGRAM_RANGE = (1, 2)
TEST_SIZE = 0.2

# Setup
client = MongoClient(settings.MONGO_URI)

mp = MongoProcessor()

ids = mp.fetch_ids()
train_ids, test_ids = train_test_split(ids, test_size=TEST_SIZE)
(X_train_text, y_train_all) = next(mp.batch_xy(train_ids, batch_size=len(train_ids), epochs=1, collection='input'))
(X_test_text, y_test_all) = next(mp.batch_xy(test_ids, batch_size=len(test_ids), epochs=1, collection='input'))

This code fetches all the document identifiers, splits them into a training group and a test group, then uses the identifiers to fetch the document text from MongoDB. This code is specific to the framework we are using, and thus can’t be generalized for normal use. However, the format of the data is just simple text, which is what we need for the text classification problem.

Here is an example of one document:

u”structural controllability and controlling centrality of temporal networks temporal networks are such networks where nodes and interactions may appear and disappear at various time scales with the evidence of ubiquity of temporal networks in our economy nature and society it urgent and significant to focus on its structural controllability as well as the corresponding characteristics which nowadays is still an untouched topic we develop graphic tools to study the structural controllability as well as its characteristics identifying the intrinsic mechanism of the ability of individuals in controlling a dynamic and large scale temporal network classifying temporal trees of a temporal network into different types we give both upper and lower analytical bounds of the controlling centrality which are verified by numerical simulations of both artificial and empirical temporal networks we find that the positive relationship between aggregated degree and controlling centrality as well as the scale free distribution of node controlling centrality are virtually independent of the time scale and types of datasets meaning the inherent robustness and heterogeneity of the controlling centrality of nodes within temporal networks”

Feature Extraction

We will use a count vectorizer in scikit-learn to extract features, that is, a detailed count of the words and word combinations that appear in each document. Scikit-learn has a good tutorial on the common text feature extraction process here.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(decode_error='ignore', ngram_range=NGRAM_RANGE, stop_words=STOP_WORDS)
X_train = vectorizer.fit_transform(X_train_text)

X_train holds the vector representations of the documents: each row is a document and each column is a term, which can be either a single word or a two-word term. Each entry records how often the term occurs in the document. These are the features we will use in the classification model.
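
To show what these counts look like for a single document, here is a small plain-Python sketch of unigram and bigram counting. It only mimics the idea: the whitespace tokenizer and the tiny stop word set are simplifications I chose for illustration, and scikit-learn’s own tokenization differs in the details.

```python
from collections import Counter


def ngram_counts(text, ngram_range=(1, 2), stop_words=frozenset()):
    """Count word n-grams in one document, dropping stop words first."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


doc = "temporal networks are networks where nodes appear and disappear"
counts = ngram_counts(doc, stop_words={"are", "where", "and"})

for term, count in sorted(counts.items()):
    print(term, count)
```

Each distinct term becomes one column of the matrix, and the counts for one document form one row.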


This is a multi-label classification problem. This means that each document can be classified into one or more categories. Most existing algorithms by default cannot deal with this problem. There are normally two solutions. Your first option is to change the loss function to adapt to the multi-label classification problem.

Second, you can train the model for each class separately by building One vs the rest classifiers. The advantage of the second option is that each classifier can be tuned separately to get the best results. One of the most common problems with document classification is imbalanced data. For example, let’s say that more than 95 percent of the documents have the label “biology and life sciences,” while less than 3 percent of the documents have the label “science policy.” This results in a biased model that would predict almost all documents as “biology and life sciences” and none as “science policy,” which seems accurate at first glance, but in reality ignores important documents. With separate binary classifiers, each one can be fine-tuned individually to tackle the imbalanced data problem.

The classifier we use from scikit-learn is the SGDClassifier, which supports several loss functions, including hinge loss, which is the loss function of SVM, and huber loss, which is for the regression model. We use hinge loss for demonstration.

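from sklearn.linear_model import SGDClassifier

cls_base_name = 'SGD'
cls_base = SGDClassifier
# one independent instance per subject ([cls_base()] * n would share one object)
cls_list = [cls_base() for _ in all_classes]

The above code creates 11 classifiers, one for each subject area.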


Now, we’ll start using the data to train the model.

In the example below, we’ll transform the multi-label variable into a binary response variable for each class. In other words, we’ll loop through each of the subject areas and decide whether the document can be classified under that subject.

import numpy as np


def OVR_transformer(classes, pos):
    """Transform multi-labels to binary classes for OneVsRest classifiers."""
    return 1 if pos in classes else 0


for j in range(len(all_classes)):
    # Create namespace for storing metrics.
    cls_name = cls_base_name + str(j)

    # transform y
    y_train = np.asarray([OVR_transformer(x, j) for x in y_train_all])
    cls_list[j].fit(X_train, y_train)

To build the one vs rest classifiers we need to transform the labels into 0 and 1 for each of the subject areas. A label of 0 means the document can’t be classified under the subject, and a label of 1 means that it can.

The OVR_transformer function transforms the labels into 0 and 1 based on which subject area we are considering as the true label.
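
As a small worked example of this transformation, the sketch below applies the same rule to three made-up documents. The label lists and the helper name are invented for illustration.

```python
def ovr_transform(label_sets, position):
    """Return 1 for documents tagged with `position`, else 0."""
    return [1 if position in labels else 0 for labels in label_sets]


# Three documents with made-up subject indices (0 to 10).
y_all = [[0, 5], [0], [9, 10]]

# One binary target vector per subject area of interest.
binary_targets = {j: ovr_transform(y_all, j) for j in (0, 5, 9)}

for subject, targets in binary_targets.items():
    print(subject, targets)
```

The first document counts as a positive example for both subject 0 and subject 5, which is exactly what makes the problem multi-label.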


Now that we’ve trained the model, it’s time to test.

from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

X_test = vectorizer.transform(X_test_text)
for j in range(len(all_classes)):
    cls_name = cls_base_name + str(j)
    # Test
    print("Test results")
    y_test = np.asarray([OVR_transformer(x, j) for x in y_test_all])
    y_pred = cls_list[j].predict(X_test)
    test_stats = precision_recall_fscore_support(y_test, y_pred, pos_label=1)
    accuracy = cls_list[j].score(X_test, y_test)
    print("Test metrics for {}: {}".format(cls_name, test_stats))
    print(confusion_matrix(y_test, y_pred))

The function precision_recall_fscore_support gives you the metrics needed to evaluate the models. In our case, precision is the most important one. It is the ratio of true positives to all positive predictions, including the misclassified ones. We want to avoid including wrong information in our metadata, while still tolerating some documents not being labelled at all. Another important metric is recall, the ratio of true positives to all documents that actually carry the label. If this ratio is too low, the classifier is basically doing nothing. The tradeoff between precision and recall is always a challenge. In our case, we would like to maximize precision while keeping a reasonable recall.

In our test, the classifiers achieved 60 to 90 percent precision. However, the recall of some classifiers is extremely low due to the imbalanced data. This is acceptable for the simplified model, but there is clearly room for improvement.
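
Precision and recall can also be computed by hand from the counts of true positives, false positives, and false negatives. The sketch below uses made-up labels for a classifier that is very cautious about predicting the positive class, which is roughly the situation described above.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Four documents truly carry the label, but only one is predicted positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here every positive prediction is correct, so precision is perfect, yet three of the four labelled documents are missed, so recall is only one quarter.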

How to improve the model

The above code is the most simplified version of a multi-label classification model. The biggest problem with the model is imbalanced data, which we explained above. To tackle the problem, we can consider training the classifiers separately using the undersampling strategy, or other methods.
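
A minimal sketch of random undersampling for one binary classifier is shown below. The function name, the fixed seed, and the 3-versus-97 split are my own choices for illustration. It keeps every example of the smaller class and an equally sized random sample of the larger one.

```python
import random


def undersample(labels, seed=0):
    """Return sorted indices of a class-balanced subset of binary labels."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for i, y in enumerate(labels):
        by_class[y].append(i)
    # Keep as many examples per class as the smaller class has.
    n_keep = min(len(by_class[0]), len(by_class[1]))
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


labels = [1] * 3 + [0] * 97
balanced = undersample(labels)
print(len(balanced), sum(labels[i] for i in balanced))
```

The returned indices can then be used to select the matching rows of the feature matrix before fitting that subject’s classifier.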

In terms of feature engineering, adding Term Frequency – Inverse Document Frequency (TF-IDF) is the first obvious choice. TF-IDF can tone down the influence of high-frequency words that don’t help with the classification, like “to,” “and,” “the,” etc. It is a must-have improvement unless you are dealing with very large data. Another method worth trying is Word2Vec, which transforms each word into a high-dimensional vector. You can then take the mean of all word vectors in a document and use that as a feature together with TF-IDF. In fact, many papers nowadays use the combination of TF-IDF and Word2Vec with an SVM model as the baseline.
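
To see why TF-IDF tones down common words, here is a plain-Python sketch using the textbook weighting, term count times log(N / document frequency). The three tiny documents are made up, and scikit-learn’s TfidfTransformer uses a smoothed variant of this formula by default, so its numbers would differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        term_counts = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


docs = [
    "the cell and the protein".split(),
    "the policy and the funding".split(),
    "the protein structure".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

A word that appears in every document gets a weight of zero, while a word that is specific to one document gets the largest weight.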

The framework

Here are some lessons I learned while working on this problem.

  • Be as generalized as possible to use in other cases
  • Make it scalable
  • Make sure your code is lightweight and simple

The training data we are dealing with is nowhere near “big data” at the TB or PB level, but storing it as flat files would definitely not help us move forward. Therefore, we used MongoDB instead of flat files to store the data.


Bonus: Text classification with Convolutional Neural Networks

Following the trend of deep learning in recent years, many papers have researched text classification with Convolutional Neural Networks, models whose structure is loosely inspired by neurons in the brain. I decided to give it a shot, but the performance was simply lacking. In case you are interested in how to do it, I suggest reading this awesome tutorial first, then reading the code of the prototype here.

In the tutorial, the author trained the word vectors from scratch. In our case, we used word vectors pre-trained on the GoogleNews corpus, which you can find online. Therefore, we need to change the code in the embedding layer to use them properly. You can find the code for this part here.

The conclusion, to put it simply, is that it costs you a significant amount of time and resources in exchange for little or no gain in precision. As fancy as Convolutional Neural Networks sound, you might want to try traditional machine learning techniques before jumping into more complicated and hyped methods.

All my code is available online at

Thanks to Erin Braswell for the editing work!

Classify SHARE documents with Natural Language Processing

*Edit: SHARE has published the post on their blog:
Jiankun Liu, 03/22/2016

Developers at the Center for Open Science working on the SHARE project are constantly looking for ways to improve SHARE’s metadata quality. One challenging task is to add subject areas so that users can have more options and control when searching and filtering documents. Since we have more than 5 million documents on SHARE, manually labeling the documents would be very tough. Therefore, we need to rely on an automated process that can achieve fairly high precision. That’s where machine learning comes in.

To tackle the problem, I built a multi-label document classification model using training data from the Public Library of Science API, which stores more than 160 thousand documents with subject areas defined by their taxonomy. The documents contain titles and abstracts that can be used to generate features for the classification model. The taxonomy has a hierarchical structure that contains more than ten thousand terms, but only the root-level terms are used as the subject areas in our model to begin with. Documents are cleaned and normalized before being stored in the database (MongoDB).


These documents provide abundant yet imbalanced training data for our supervised machine learning model (for example, more than 90 percent of the documents are labeled “Biology and life sciences”, which badly affects the predictions). A lot of preprocessing methods were used to address multiple issues in the dataset, which will be illustrated in a follow-up post, but here is a simplified workflow:


To begin with, the documents are transformed into bag-of-words vector representations by calculating the n-gram term frequency and tf-idf (term frequency – inverse document frequency) of each term. The features are used to fit the classifiers. Since this is a multi-label classification problem (each document can have multiple subject areas), we trained 11 One-vs-rest classifiers, where each classifier is used exclusively to identify whether a document belongs to one particular subject area. For example, when training the “Earth sciences” classifier, all documents that have “Earth sciences” as one of their subject areas are labeled 1 and the others are labeled 0. Training those classifiers separately provides more flexibility for tuning, so that we can deploy the models with good precision while we keep improving the other ones. The best classifiers could achieve over 90 percent precision (the number of true positives over all documents predicted as positive, which is what we cared about the most in this case), while others need further optimization. Nevertheless, we are confident the model will keep improving over time with more feature engineering (e.g. adding Word2vec), more diverse training data, and more parameter optimization.

Finally, we need to take into consideration the scalability of our framework. The traditional methods described above require all training data to be loaded into memory at once. To prepare for growing training data, I built a framework that can use batch training methods (or online learning) and feed in data one chunk at a time:
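
A minimal sketch of the chunk-at-a-time idea is shown below. It is not the framework’s actual code. The batch iterator and the toy RunningMean learner are invented for illustration, with the learner standing in for any model that can be updated incrementally (scikit-learn’s SGDClassifier, for instance, offers a partial_fit method for this style of training).

```python
def iter_batches(items, batch_size):
    """Yield successive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


class RunningMean:
    """Toy online learner that only tracks the mean of what it has seen."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def partial_fit(self, batch):
        # Update the estimate incrementally, without storing past batches.
        for value in batch:
            self.count += 1
            self.mean += (value - self.mean) / self.count
        return self


data = [0.0, 1.0, 2.0, 3.0, 4.0]
model = RunningMean()
for batch in iter_batches(data, batch_size=2):
    model.partial_fit(batch)

print(model.count, model.mean)
```

Only one batch needs to be in memory at any time, and the final estimate matches what a single pass over the full data would give.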


A follow-up post will further explain the detailed preprocessing, feature engineering, and modeling steps. As a bonus we will also show how to use Google’s TensorFlow to build a Convolutional Neural Network for the text classification problem.


Thanks to Katherine Schinkel who contributed to model selection and metrics.