This post was simultaneously published to Medium.

At Curalate, we use state-of-the-art deep learning and computer vision to add a layer of magic to our products. Intelligent Product Tagging, for example, identifies our clients’ products in user-generated photos. Being a startup, we need to build these deep learning and computer vision systems the same way we build the rest of our products: quickly.

Our computer vision systems are built in two phases, research and productization, and we require a deep learning framework that accelerates both. During the research phase, we need a framework that’s quick to get started with and is flexible enough to experiment with new ideas. Once we have a solution, we need a framework that can easily be integrated into a microservice and deployed to multiple production environments.

In the past, we used Caffe for experimentation and our own custom inference interface to deploy the trained models to production. Experimentation was slow due to Caffe’s dated Python API, lack of automatic differentiation, unreliable build/install process, and clunky support for advanced layers which required us to maintain our own custom fork. Productization of Caffe was challenging since we had to maintain our own JNI interface.

We needed a new, modern framework that fulfilled all of our needs while saving us from the shortcomings of Caffe. After a review of all the available options, we decided to move to MXNet. In this post, we’ll discuss why we migrated to MXNet as our deep learning framework of choice to speed up our experimentation, development, and deployment.

Training and Experimentation

Whenever we are faced with a new computer vision problem, we start by looking at existing state-of-the-art implementations. If we are lucky, the functionality of the service we are implementing is similar to an existing pre-trained model for MXNet. MXNet has a fairly fleshed-out and well-maintained Model Zoo that contains all of the standard pre-trained models that we would expect from any deep learning framework (ImageNet, PascalVOC, …). If we cannot find what we need there, MXNet is also one of the frameworks that supports ONNX and its cross-framework model format, giving us access to many more pre-trained models.

Converting and reusing old models and code is also possible with MXNet. MMdnn provides support for converting our old models into the MXNet model format if retraining is not needed. MXNet is also supported as a backend for Keras, the high-level neural net API, allowing us to run exactly the same Keras code developed with other frameworks on the quick MXNet backend instead.

When we have a more domain specific service in mind, where the data we are training/predicting is new but the model/task is well researched (e.g. image classification, object detection and instance segmentation), ideally we can avoid implementing a research paper from scratch and find an open source implementation to use and contribute to instead. MXNet maintains a long list of example projects in its own code base for most of the popular deep learning applications/models (neural style transfer, Faster R-CNN, speech-to-text). Outside of this, MXNet has a fairly large community of open source developers that maintain MXNet versions of most of the popular and state of the art research papers in the machine learning community.

When the state of the art is just not good enough, MXNet is still an excellent choice for researching novel deep learning models. MXNet offers both symbolic (static graph) and imperative (dynamic graph) APIs, allowing us to work with whichever paradigm is the most appropriate for the task. MXNet’s high-level imperative API, called Gluon, offers a full set of plug-and-play neural network building blocks, including data loaders, predefined layers, and losses. This saves us from implementing common layers and methods for our models, so we can spend more of our development time writing the new state-of-the-art secret sauce in natural Pythonic control flow. If we are working specifically on a computer vision or natural language processing task, Gluon has model toolkits that provide implementations of state-of-the-art deep learning algorithms in GluonCV and GluonNLP, respectively. When it comes to debugging our models during training, Gluon allows us to set breakpoints to help us analyze the internals and output of our deep models. On top of that, MXNet has its own (early in development) support for writing logs for TensorBoard with MXBoard.

If Python is not your data science/experimentation language of choice, MXNet also provides APIs for training in C++, Scala, Julia, Perl, and R. The Scala interface also includes early support for training in a distributed Apache Spark cluster for your big data needs.

Inference API and Scala

After training, our models get deployed to microservices that power various Curalate applications. At Curalate, we write our microservices in Scala using the Finatra framework. Before we switched to MXNet, we had to maintain JNA/Bridj borders between our Scala code and deep learning frameworks. This is where MXNet’s Scala API really sped up our development: any model we train in Python can be immediately loaded into our production web service framework.

Most of our deep learning microservices have a very similar data flow:

  • Load an input for the deep net (in our case, typically a set of images)
  • Pre-process the input data (e.g., scaling, cropping, etc.)
  • Perform inference on the deep net model
  • Return the results to the caller
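As a rough illustration of this data flow (a sketch under assumptions, not the production code), the four steps can be expressed as a small asynchronous pipeline. The class name `InferencePipeline` and the `load`, `preprocess`, and `infer` functions are hypothetical stand-ins for the real storage, image, and deep net code:

```scala
import scala.concurrent.{ExecutionContext, Future}

/**
 * Illustrative pipeline only. `load`, `preprocess` and `infer` are
 * hypothetical stand-ins for the input fetch, the image pre-processing
 * and the deep net inference steps.
 */
final class InferencePipeline[Raw, Tensor, Result](
    load: String => Future[Raw],
    preprocess: Raw => Tensor,
    infer: Seq[Tensor] => Future[Seq[Result]]
)(implicit ec: ExecutionContext) {

  /** Loads every input, pre-processes it, and runs one inference call on the batch. */
  def run(keys: Seq[String]): Future[Seq[Result]] =
    Future.traverse(keys)(load)   // 1. load the inputs concurrently
      .map(_.map(preprocess))     // 2. pre-process each input
      .flatMap(infer)             // 3. run inference; 4. the Future carries the results to the caller
}
```

Keeping the steps as plain functions means each one can be swapped for a stub in tests, which is handy when no GPU is available.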

When we first started using MXNet (version 0.11.0), the Scala API only had support for NDArrays and Modules. This was great, but we needed more functionality for higher-level operations such as:

  • Loading images from a network data store (in our case, S3)
  • Pre-processing images to match what is expected by the net (e.g., scaling, cropping, converting to raw RGB)
  • Mutex locking the GPU to avoid contention

We implemented this functionality in an easy-to-use, high-level inference API. (As of MXNet 1.2.0, the Scala API has added support for inference on images and thread management.) Though our actual deployment has some Curalate-specific logic, its general design is:

trait InferenceAPI {

    /**
     * Performs inference on the provided input images, and returns the results.
     */
    def predict(images: Iterable[BufferedImage]): Future[Iterable[Array[Float]]]

    /**
     * Loads the images from S3 pointed to by s3Bucket and s3Keys, performs inference, and returns the results.
     */
    def loadAndPredict(s3Bucket: String, s3Keys: Iterable[String]): Future[Iterable[Array[Float]]]

    /**
     * Pre-processes the provided image (e.g., scaling, center crop, etc.),
     * and converts it to a 32-bit floating point NDArray.
     */
    private def preprocess(image: BufferedImage): NDArray

    /**
     * Performs the inference on a batch of images in the provided NDArray.
     */
    private def predict(batchArray: NDArray): Future[Array[Float]]
}
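To make the pre-processing step more concrete, here is a minimal sketch that uses only the JDK imaging classes. It scales the shorter side of the image to a target size, center-crops a square, and lays the pixels out as channel-first (CHW) floats. The object name `ImagePreprocessing`, the method `toChw`, and the choice of bilinear scaling with raw 0 to 255 values are assumptions for illustration; a real network may expect mean subtraction or a different layout.

```scala
import java.awt.RenderingHints
import java.awt.image.BufferedImage

object ImagePreprocessing {

  /**
   * Scales the shorter side of `image` to `size`, center-crops a
   * size x size square, and returns the pixels as floats in
   * channel-first order: all red values, then green, then blue.
   */
  def toChw(image: BufferedImage, size: Int): Array[Float] = {
    val scale = size.toDouble / math.min(image.getWidth, image.getHeight)
    val w = math.max(size, math.round(image.getWidth * scale).toInt)
    val h = math.max(size, math.round(image.getHeight * scale).toInt)

    // Draw the source into an RGB buffer of the scaled dimensions.
    val scaled = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB)
    val g = scaled.createGraphics()
    g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
      RenderingHints.VALUE_INTERPOLATION_BILINEAR)
    g.drawImage(image, 0, 0, w, h, null)
    g.dispose()

    // Top-left corner of the centered crop window.
    val x0 = (w - size) / 2
    val y0 = (h - size) / 2
    val plane = size * size
    val out = new Array[Float](3 * plane)
    for (y <- 0 until size; x <- 0 until size) {
      val rgb = scaled.getRGB(x0 + x, y0 + y)
      val i = y * size + x
      out(i) = ((rgb >> 16) & 0xFF).toFloat             // red plane
      out(plane + i) = ((rgb >> 8) & 0xFF).toFloat      // green plane
      out(2 * plane + i) = (rgb & 0xFF).toFloat         // blue plane
    }
    out
  }
}
```

The resulting array has length 3 * size * size and could then be wrapped in an NDArray of shape (1, 3, size, size) before being handed to the network.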

One particularly interesting item is the mutex locking on the GPU. We achieved this by creating a singleton actor with Akka that acts as a gatekeeper to the GPU. The actor is defined as follows:

class InferenceActor(batchSize: Int, deepNet: module.Module) extends Actor {

  def receive: Receive = {
    case input: NDArray => {
      try {
        val start = System.currentTimeMillis()

        // Wrap the input in a single unlabeled batch with no padding.
        val batch = new DataBatch(
          data = IndexedSeq(input),
          label = IndexedSeq.empty,
          index = IndexedSeq.empty,
          pad = 0
        )

        val prediction = deepNet.predict(batch).head // only using one batch at a time.
        val result = prediction.toArray
        val duration = System.currentTimeMillis() - start
        sender ! (result, duration)
      } catch {
        case e: Throwable => {
          logger.error("Could not run prediction!!!", e)
          sender ! e
        }
      }
    }
  }
}

We then use Akka’s ask pattern to pass images to the InferenceActor. Not only does this let us mutex lock the GPU in a nice way, it also returns a Scala Future, which integrates well with our async Finatra patterns. The call to the actor looks like:

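private def predict(batchArray: NDArray): Future[Array[Float]] = {
    (actor ? batchArray).map {
        case (result: Array[Float], duration: Long) => {
            // record the GPU execution time
            logger.debug(s"GPU Execution took $duration milliseconds for a batch size of $batchSize")
            result
        }
        // the actor replies with the exception itself when prediction fails
        case e: Throwable => throw e
    }
}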
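The gatekeeping idea itself does not depend on Akka. As a dependency-free illustration (a different technique from the actor above, shown only to make the serialization explicit), a single-threaded executor gives the same guarantee that at most one task touches the device at a time while still returning a Scala Future. The class name `SerialGate` is hypothetical:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

/**
 * Serializes all work submitted through `run` onto one thread, so at
 * most one task uses the guarded resource (e.g. a GPU) at any time.
 */
class SerialGate {
  private val executor = Executors.newSingleThreadExecutor()
  private val ec = ExecutionContext.fromExecutor(executor)

  /** Runs `work` on the gate's single thread and returns (result, elapsed millis). */
  def run[A](work: => A): Future[(A, Long)] = Future {
    val start = System.currentTimeMillis()
    val result = work
    (result, System.currentTimeMillis() - start)
  }(ec)

  /** Stops the underlying thread once queued work has finished. */
  def shutdown(): Unit = executor.shutdown()
}
```

Callers queue up behind each other much like messages in an actor's mailbox. The actor approach adds supervision and fits naturally into an existing Akka system, which this sketch does not attempt to replicate.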

Though we built a custom image-centric API on top of the MXNet Scala interface for our needs, MXNet treating Scala as a first class language made this extremely easy. The resulting code was minimal and elegant.

Deployment

Continuous integration and deployment of microservices is inherently challenging, but deep learning systems add another layer of complexity due to their dependence on specialized hardware (i.e., GPUs) and native software stack (i.e., CUDA, cuDNN, and MXNet’s binary library). While our Scala-only microservices can be compiled into a war or jar and dropped into a Docker container, the story is not as simple for our deep learning services. Ideally, we’d like to deploy to these specialized environments while maintaining the agility and speed offered by continuous integration. In addition, we sometimes want to deploy to non-GPU environments (such as our development boxes, or applications where latency is OK) to save costs.

MXNet helps us solve all of these challenges because it separates its core library from its Scala bindings.

MXNet Binary Packages

First, let’s look at how we can enable fast, continuous integration and deployment. MXNet’s Scala API requires two binary files to be present in the environment: the core library and the Java/Scala bindings. At Curalate, we use Jenkins to build both Docker images and AMIs for deployment in AWS. Building MXNet or its Scala bindings from source each time we need an image (for, say, a new service or a different version) is time-consuming and puts a serious drag on our deployment process.

To alleviate this, we build custom MXNet Debian packages that contain both the core library and the Java/Scala bindings. Specifically, we maintain a bash script that:

  • Checks out a specific version of MXNet (provided by the user)
  • Compiles the MXNet base library
  • Compiles the MXNet Scala bindings
  • Packages all compiled binaries into a Debian file using checkinstall
  • Uploads the Debian file to our S3-based repository using deb-s3

We run this script inside a Docker container, so the resulting artifact is binary-compatible with the architecture of the container we compiled it in. We maintain separate repositories for different versions of Ubuntu and Debian, all of which are populated by the MXNet build script.

By compiling MXNet Debian files and maintaining our own repository, we can quickly install MXNet on any image that has access to our internal deb repository. When we set up a new service or update an existing one, our build process simply runs apt-get install libmxnet. In addition, this makes updating MXNet pretty simple: we just pass the version into our bash script and it becomes available in our repository.

Varying Environments

Our second challenge is how to build a proper abstraction over the different hardware environments we run our deep learning services on. In production, we want to run our models on GPUs for speed. This requires the host OS to have CUDA installed, and MXNet to be properly compiled against it. Often, the version of CUDA differs depending on the operating system we’re using (e.g., CUDA 8 on Ubuntu 14, CUDA 9 on Ubuntu 16). To complicate matters further, we develop on laptops without GPUs and would like to test our code in our IDEs as we build things.

To solve this, we again take advantage of MXNet’s separation of the Scala package from its core library. Specifically, the MXNet Scala API gets packaged into mxnet-core.jar, which depends on the core library being installed on the host operating system. So long as the MXNet versions match, we can package just mxnet-core.jar with our services and let the host operating system decide what stack is compiled in.

In other words, we again leverage our bash script from above to build multiple MXNet Debian files: one that is compiled for a GPU environment, and one for a CPU environment. We also compile a CPU build for OS X, which we install on our dev boxes via Homebrew. The build.sbt files in our Scala projects only bring in mxnet-core, so no binary code is included in our service builds.

This flexibility allows us to create one service artifact in Scala, and run it in different environments.
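A single artifact still has to decide at startup which device to ask for. One way this could be done, sketched here purely as an assumption, is to read the choice from the environment and fall back to the CPU. The `Device` types and the `INFERENCE_DEVICE` variable are hypothetical names; in MXNet terms the result would map to a CPU or GPU context.

```scala
import scala.util.Try

/** Where inference should run. */
sealed trait Device
case object Cpu extends Device
final case class Gpu(id: Int) extends Device

object Device {

  /**
   * Parses a setting such as "cpu", "gpu" or "gpu:1".
   * Anything missing or unrecognized falls back to the CPU,
   * which suits development machines without a GPU.
   */
  def parse(setting: Option[String]): Device =
    setting.map(_.trim.toLowerCase) match {
      case Some("gpu") => Gpu(0)
      case Some(s) if s.startsWith("gpu:") =>
        Try(s.drop(4).toInt).filter(_ >= 0).map(Gpu(_)).getOrElse(Cpu)
      case _ => Cpu
    }

  /** Reads the hypothetical INFERENCE_DEVICE environment variable. */
  def fromEnvironment(): Device = parse(sys.env.get("INFERENCE_DEVICE"))
}
```

With a setting like this, the same build can run on a GPU host in production and on a laptop during development without any code change.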

Conclusion

MXNet provides Curalate with the flexibility needed to research and build cutting-edge deep learning systems extremely quickly. The Python interface has the flexibility necessary to explore research ideas while executing extremely quickly on modern hardware. For productization, Scala is treated as a first-class language, providing us with an API that lets us use MXNet in our existing microservice architecture. Finally, the separation between the Scala API and the binary packages allows us to deploy to multiple CPU and GPU environments without recompiling or repackaging our Scala code.