Scoring H2O MOJO Models with Spark DataFrame and Dataset

by Jiankun Liu

Introduction

H2O allows you to export models as POJOs or MOJOs (Model Object, Optimized) that can later be deployed in production, presumably for scoring large datasets or building real-time applications. In theory this works in a Spark application, but the official documentation does not go into detail beyond saying you can "create a map to call the POJO for each row and save the result to a new column, row-by-row." One post showed how to import the dependencies, load the models, and make predictions in spark-shell, but did not actually provide examples of scoring a Spark DataFrame or Dataset. So I had to research on my own, and after some trial and error, I finally made it work.
In the scenario below, I build a model in R with H2O, export it as a MOJO, and then import it into a Spark application to score a test data set. Since it's the Star Wars season again, I can't help but make this post a bit relevant. So I used the starwars dataset to build a model that predicts the likelihood of a character being human based on their height and mass.
This post assumes that you have some experience with H2O and Spark.

Model training in R with H2O


# Import libraries and initialize H2O
library(dplyr)
library(magrittr)
library(h2o)
data("starwars")
h2o.init()

model_path <- "/path/to/model"  # Specify where to save the model
test_path <- "/path/to/test"  # Specify where to save the test set
starwars <- as.data.frame(starwars) %>% 
  select(name, height, mass, species) %>% 
  filter(!is.na(species) & (!is.na(mass) | !is.na(height))) %>%
  mutate(mass = ifelse(is.na(mass), median(mass, na.rm = TRUE), mass)) %>% 
  mutate(is_human = ifelse(species == "Human", 1, 0)) %>%
  select(-species) %>%
  as.h2o()
h2o.table(starwars$is_human)

The data after preprocessing looks like this:

name            height   mass  is_human
Luke Skywalker     172   77.0         1
C-3PO              167   75.0         0
R2-D2               96   32.0         0
Darth Vader        202  136.0         1
To convince myself that Darth Vader can be classified as human, I had to resort to Stack Exchange, where someone's research suggested that only 31.15 percent of Darth Vader was replaced by machine, so I was cool with that.
Next, we split the dataset into training and test sets. The test set will also serve as the example data for scoring in Spark. We then build a logistic regression model by calling H2O's glm function.


starwars.split <- h2o.splitFrame(starwars, ratios = 0.75, seed = 1234)
train <- starwars.split[[1]]
test <- starwars.split[[2]]

# Save test data
h2o.exportFile(test, path = test_path)

# Fit a glm model
fit <- h2o.glm(x = c("height", "mass"), 
               y = "is_human", 
               training_frame = train, 
               validation_frame = test,
               family = "binomial")
fit
# ** Reported on validation data. **

# MSE:  0.2616778
# RMSE:  0.5115445
# LogLoss:  0.7174052
# Mean Per-Class Error:  0.4583333
# AUC:  0.3863636
# Gini:  -0.2272727
# R^2:  -0.04869349
# Residual Deviance:  33.00064
# AIC:  39.00064

The performance is awful, which is totally expected. Seriously, the aliens in Star Wars are too human-like if you only take height and weight into account. Maybe skin color would be a better predictor, but that can be saved for another day.

Finally, we export the model to disk:


h2o.download_mojo(fit, path = model_path, get_genmodel_jar = TRUE)

This generates the MOJO model as a .zip file, as well as an h2o-genmodel.jar file that is later used as a dependency for scoring.
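Before moving on to Spark, you can sanity-check the exported MOJO locally. Here is a minimal sketch, assuming the h2o package's h2o.mojo_predict_df helper; the .zip file name below is a placeholder, so substitute the name actually printed by h2o.download_mojo():

# Sanity check: score the exported test set with the MOJO, outside of a cluster.
# "GLM_model_R_1.zip" is a placeholder file name.
preds <- h2o.mojo_predict_df(
  frame = read.csv(test_path),
  mojo_zip_path = file.path(model_path, "GLM_model_R_1.zip"),
  genmodel_jar_path = file.path(model_path, "h2o-genmodel.jar")
)
head(preds)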

Scoring with Spark (and Scala)

You can use either spark-submit or spark-shell. If you use spark-submit, h2o-genmodel.jar needs to be placed in the lib folder at the root of your Spark application so it can be picked up as a dependency during compilation (see the build.sbt sketch after the next code block). The code that follows assumes you are running spark-shell. To make h2o-genmodel.jar available there, append the jar file when launching spark-shell with the --jars flag. For example:


/usr/lib/spark/bin/spark-shell \
--conf spark.serializer="org.apache.spark.serializer.KryoSerializer" \
--conf spark.driver.memory="3g" \
--conf spark.executor.memory="10g" \
--conf spark.executor.instances=10 \
--conf spark.executor.cores=4 \
--jars /path/to/h2o-genmodel.jar
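
If you go the spark-submit route instead, the project setup can be as simple as dropping the jar into lib, since sbt treats jars in that folder as unmanaged dependencies. A minimal build.sbt sketch, with illustrative names and versions:

// build.sbt -- illustrative sketch; adjust the name and versions to your project.
// Jars placed in lib/ (such as h2o-genmodel.jar) are picked up automatically
// by sbt as unmanaged dependencies; no extra setting is needed for them.
name := "mojo-scoring"

scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"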

Now in the Spark shell, import the dependencies:

import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import _root_.hex.genmodel.MojoModel

Using DataFrame

val modelPath = "/path/to/zip/file"
val dataPath = "/path/to/test/data"

// Import data
val dfStarWars = spark.read.option("header", "true").csv(dataPath)
// Import MOJO model
val mojo = MojoModel.load(modelPath)
val easyModel = new EasyPredictModelWrapper(mojo)

// Score each row: wrap the feature values in a RowData and call the
// easy-predict API. Column indexes follow the CSV column order
// (0 = name, 1 = height, 2 = mass).
val dfScore = dfStarWars.map {
  x =>
    val r = new RowData
    r.put("height", x.getAs[String](1))
    r.put("mass", x.getAs[String](2))
    val score = easyModel.predictBinomial(r).classProbabilities
    (x.getAs[String](0), score(1))
}.toDF("name", "isHumanScore")

The variable score is an array of two class probabilities, one for level 0 and one for level 1; score(1) is the probability of level 1, which is "human". By default the map function returns a DataFrame with unspecified column names "_1", "_2", and so on, so we rename the columns by calling toDF.
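If you want to confirm which class each probability corresponds to, the easy-predict wrapper exposes the response domain; a quick check, per the h2o-genmodel EasyPredictModelWrapper API:

// classProbabilities follows the order of the response levels
println(easyModel.getResponseDomainValues.mkString(", "))  // e.g. "0, 1"

// Peek at a few scored rows
dfScore.show(5)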

Using Dataset

To use the Dataset API, we just need to define two case classes: one for the input data and one for the output.

case class StarWars (
  name: String,
  height: String,
  mass: String,
  is_human: String
)

case class Score (
  name: String,
  isHumanScore: Double
)
// Convert the DataFrame to a Dataset of StarWars rows and score it
val dtStarWars = dfStarWars.as[StarWars]
val dtScore = dtStarWars.map {
  x =>
    val r = new RowData
    r.put("height", x.height)  // case class fields are accessed by name
    r.put("mass", x.mass)
    val score = easyModel.predictBinomial(r).classProbabilities
    Score(x.name, score(1))
}

With a Dataset you can get the value of a column by calling x.columnName directly. Just note that the values put into a RowData have to be Strings, so if the case class defines the columns with other types you will need to convert them manually, as sketched below.
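For instance, if you load the test set from a typed source such as Parquet, the numeric columns can live in the case class as Doubles and be converted only when filling the RowData. A sketch under that assumption (StarWarsTyped and the casts are illustrative, not part of the original example):

// Illustrative variant: numeric fields in the case class, converted to
// String only when filling the RowData (which expects String values).
case class StarWarsTyped(
  name: String,
  height: Double,
  mass: Double
)

val dtTyped = dfStarWars
  .selectExpr("name", "cast(height as double) height", "cast(mass as double) mass")
  .as[StarWarsTyped]

val dtScoreTyped = dtTyped.map { x =>
  val r = new RowData
  r.put("height", x.height.toString)  // back to String for the MOJO wrapper
  r.put("mass", x.mass.toString)
  Score(x.name, easyModel.predictBinomial(r).classProbabilities(1))
}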
