Curalate Engineering Blog

Safely Modifying Your Hosts File with Gas Mask

Thu, 30 May 2019 00:00:00 +0000

Sometimes the DNS for a specific domain on your machine needs to point somewhere else. At Curalate, we test changes to microservices locally before shipping them, which can require redirecting requests to that local instance. One way to do this is by adding an entry like 127.0.0.1 some.service.curalate.com to /etc/hosts.

Why use a hosts file manager?

In most cases, it’s not advised to directly modify /etc/hosts. Because it’s buried deep in the filesystem, it’s easy to forget you’ve modified it, which can lead to problems ranging from annoying to dangerous. Danger aside, it gets messy if you have a lot of entries to manage. Think of even just fifteen lines you’re constantly commenting and uncommenting to represent the configuration you need at a given moment. This would be insanity. Gas Mask, a simple UI-based hosts file manager, lets you set up different hosts files while making it plainly obvious, via the macOS menu bar, which one is currently active on your system.

Installation instructions

Go to https://github.com/2ndalpha/gasmask and download the latest version.
Unpack and install.
On first-run, the only hosts file listed will be Original File which is the /etc/hosts file you’ll no longer be modifying.

Creating your first hosts file

Create a new hosts file and name it something that makes sense.
Add the test entry 127.0.0.1 google.com and save. The format of these entries is <target IP address> <hostname to redirect>.
Activate that hosts file. Gas Mask swaps this file in as /etc/hosts.
You may either need to flush your DNS cache or just restart the browser.
To test it out, go to google.com in the browser.
What happens now is, when the browser goes to get the IP for google.com, the OS sees the matching entry in your hosts file, then refers to 127.0.0.1 (your local computer) to make the request – which will fail.
Go ahead and reactivate your Original File, restart the browser, and you should be able to access google.com as expected.
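To double-check what a given hosts file actually says about a hostname, a small lookup helper can be handy. The following is a minimal sketch, not part of Gas Mask: the function name hosts_lookup and the sample entries are made up for illustration, and it reads a throwaway file rather than the real /etc/hosts.

```shell
#!/bin/sh
# hosts_lookup FILE HOSTNAME
# Prints the first IP address that a hosts-format FILE maps HOSTNAME to,
# or nothing if there is no active (uncommented) entry.
hosts_lookup() {
  awk -v host="$2" '
    { sub(/#.*/, "") }                 # drop comments; awk re-splits the fields
    NF >= 2 {
      for (i = 2; i <= NF; i++)
        if ($i == host) { print $1; exit }
    }
  ' "$1"
}

# Example against a throwaway file.
sample=$(mktemp)
printf '%s\n' \
  '127.0.0.1 localhost' \
  '# 127.0.0.1 some.service.curalate.com' \
  '127.0.0.1 google.com' > "$sample"

hosts_lookup "$sample" google.com                  # prints 127.0.0.1
hosts_lookup "$sample" some.service.curalate.com   # prints nothing (commented out)
rm -f "$sample"
```

Pointing the same function at /etc/hosts shows what the currently activated file resolves a name to.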

Use the Menubar icon

It’s easy to forget to flip your hosts file back. You’ll end up spending 45 minutes on what you think is a bug, that doesn’t reproduce for anyone else, only to realize your hosts file is sending requests someplace else.
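When a bug refuses to reproduce for anyone else, listing the active overrides is a quick sanity check. This is a rough sketch under assumptions: the function name hosts_overrides is hypothetical, and it treats localhost and broadcasthost as the only default names, which matches a stock macOS hosts file but may not match yours.

```shell
#!/bin/sh
# hosts_overrides FILE
# Prints "IP HOSTNAME" for every active entry in a hosts-format FILE,
# skipping the default localhost and broadcasthost names.
hosts_overrides() {
  awk '
    { sub(/#.*/, "") }                 # ignore comments
    NF >= 2 {
      for (i = 2; i <= NF; i++)
        if ($i != "localhost" && $i != "broadcasthost")
          print $1, $i
    }
  ' "$1"
}

# Example against a throwaway file.
sample=$(mktemp)
printf '%s\n' \
  '127.0.0.1 localhost' \
  '255.255.255.255 broadcasthost' \
  '127.0.0.1 google.com # test entry' > "$sample"

hosts_overrides "$sample"   # prints 127.0.0.1 google.com
rm -f "$sample"
```

Running hosts_overrides /etc/hosts should print nothing when the Original File is active and unmodified.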

Next time you reach to modify that hosts file, consider integrating Gas Mask into your development workflow to keep it maintained and safe from unintended state.

How to Setup a Scheduled Scala Spark Job

Wed, 27 Mar 2019 00:00:00 +0000

Have you written a Scala Spark job that processes a massive amount of data on an intimidating amount of RAM and you want to run it daily/weekly/monthly on a schedule on AWS? I had to do this recently, and couldn’t find a good tutorial on the full process to get the spark job running. Included in this article and accompanying repository is everything you need to get your Scala Spark job running on AWS Data Pipeline and EMR.

Code Repo

This tutorial is not going to walk you through the process of actually writing your specific Scala Spark job to do whatever number crunching you need. There are already plenty of resources available (1, 2, 3) to get you started on that. The code template for setting up a Spark Scala job is available in this GitHub repo.

Assuming that you have already written your Spark Job and are only using the AWS Java SDK to connect to your AWS data stores, drop your code in the Main function of SparkJob.scala and run the deploy.sh script to upload the fat jar to your S3 bucket.

If you do take other dependencies, then it may take some extra work on your part. To run a Scala Spark job on AWS you need to compile a fat jar that contains the byte code for your job and all of the libraries it needs to run. This project already has the sbt-assembly plugin and an assemblyMergeStrategy set up to package Spark, Hadoop, and the AWS SDK together in the fat jar. If you need to add other libraries that do not play well with each other, or use a version of Spark that is incompatible with this repo, there are a few good resources available to help you through the needed build.sbt modifications.

Beyond those changes, you need to set a few parameters in the deploy.sh script: point deploymentPath at your specific S3 bucket, add a profile to the AWS CLI command if the bucket is private, and change the resulting fat jar name if you please. The deploy script uses the AWS CLI to upload the fat jar to S3, so if you do not have it installed and configured you will need to go through the steps to do that before using the deploy.sh script.
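To make those parameters concrete, here is a minimal sketch of the upload step. It is not the repo's deploy.sh: the bucket, jar name, build output directory, and the upload_command helper are all assumptions for illustration, and the command is printed rather than executed so you can review it first.

```shell
#!/bin/sh
# Hypothetical values; replace with your own bucket, jar name, and profile.
deploymentPath="s3://my-spark-jobs/jars"
jarName="spark-job-assembly.jar"
awsProfile=""          # e.g. "analytics"; leave empty to use the default profile

# upload_command JAR_PATH DEST PROFILE
# Prints the AWS CLI command that would upload the fat jar.
upload_command() {
  jar_path=$1; dest=$2; profile=$3
  if [ -n "$profile" ]; then
    echo "aws s3 cp $jar_path $dest/ --profile $profile"
  else
    echo "aws s3 cp $jar_path $dest/"
  fi
}

# target/scala-2.11 is an assumed sbt output directory.
upload_command "target/scala-2.11/$jarName" "$deploymentPath" "$awsProfile"
```

With the values above this prints aws s3 cp target/scala-2.11/spark-job-assembly.jar s3://my-spark-jobs/jars/, which you would run after sbt assembly has produced the jar.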

Setting Up The Job On AWS

Once your fat jar is uploaded to S3 it’s time to set up the scheduled job on AWS Data Pipeline.

Create AWS Data Pipeline

Log on to the AWS dash, navigate to the AWS Data Pipeline console, and click the Create new pipeline button.

Load The Spark Job Template Definition

Add a name for your pipeline and select the Import a definition source option. Included in the GitHub repo for the Spark template is a datapipeline.json file that you can import that contains a pre-defined data pipeline for your Spark job that should simplify the setup processes. The definition contains the node configuration to fire the pipeline off on a schedule and notify you via an AWS SNS alarm on a success or failure run. All of your needed configuration options have been parameterized to simplify this process.

Set Your Parameters

Set the min parameter to the EMR step(s) option. Update the S3 path and fat jar name to point to the fat jar uploaded earlier with the deploy.sh script. Set EC2KeyPair so that the data pipeline has access to your EC2 instances. Select the master node instance type and the number of and type of core-nodes for the Spark cluster. Finally, we need to set the ARN for both the success and failure notifications for the Spark job so you will know if something goes wrong with your scheduled job.

Create AWS SNS

Open a new tab and navigate to the AWS SNS console. Click the Create Topic option.

In the pop-up window set a topic name and display name for your success alarm and hit the create topic button.

You will be navigated to the topic details page for your new SNS Topic. Click the create subscription button.

Here you can choose the type of notification you want. For this example, we will set up an email notification for this alarm. Place your email address in the Endpoint field and hit Create subscription.

Check your email inbox and confirm the subscription to your SNS alarm.

Once you have successfully confirmed your subscription in your email, you should see it listed in the Subscriptions table on the Topic details page for your SNS Alarm.

You will need to repeat this process for both your success and failure alarms. On the Topic details page for each alarm, copy the Topic ARN value and paste it into the corresponding parameter field in your Data Pipeline setup.
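If you would rather script this than click through the console, the same topics and subscriptions can be created with the AWS CLI. The sketch below only prints the commands; the topic names, the email address, and the sns_commands helper are placeholders, and the ARN in the second command has to be filled in from the output of the first.

```shell
#!/bin/sh
# sns_commands TOPIC_NAME EMAIL
# Prints the two AWS CLI commands that create a topic and subscribe an email
# address to it. Nothing is executed here.
sns_commands() {
  topic=$1; email=$2
  echo "aws sns create-topic --name $topic"
  echo "aws sns subscribe --topic-arn <TopicArn returned by create-topic> --protocol email --notification-endpoint $email"
}

# One topic per outcome, mirroring the success and failure alarms.
for topic in spark-job-success spark-job-failure; do
  sns_commands "$topic" "you@example.com"
done
```

The email subscription still has to be confirmed from your inbox before notifications are delivered.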

Schedule your Job

Once your ARN parameters are set for your alarms you can move on to scheduling when this pipeline will run. Set the cadence for when the job will run and make sure you set your starting and ending dates to something in the future. You can also set the job to run only on pipeline activation if you would rather manually start your job.

Finish The Pipeline Setup

Once you have your parameters set and scheduled all you need to do is set an S3 bucket location for the logs from the pipeline executions and you can click Activate at the bottom of the page. This will run a check on all of your settings and confirm that everything should work. Some things you might run into here could be setting improper values for the instance types for your clusters (Note: not all EC2 types work in EMR and Data Pipeline), or your schedule settings don’t make sense.

Test Your Job

Once your job activates successfully you should be done! If you are doing a test run first, just wait for the time that the pipeline is scheduled to start and it should run and email you on the result of the run. If you want to view the console/standard output of your job you can find it in the Emr Step Logs link and click the stdout.gz link to view/download the output. Here you will also find the error output if something goes wrong in stderr.gz. The logs from the Spark job in your pipeline are in the controller.gz file if you need to do some debugging with your pipeline.
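Once you have downloaded one of those gzipped logs, a quick filter saves scrolling through the whole file. This is a small sketch, assuming errors show up as lines containing ERROR or Exception; the function name step_log_errors and the sample log lines are made up.

```shell
#!/bin/sh
# step_log_errors FILE.gz
# Prints (with line numbers) the lines of a gzipped step log that mention
# ERROR or Exception. Prints nothing when there are none.
step_log_errors() {
  gunzip -c "$1" | grep -n -E 'ERROR|Exception' || true
}

# Demo on a throwaway file standing in for a downloaded stderr.gz.
tmpdir=$(mktemp -d)
printf '%s\n' 'INFO starting job' \
  'java.lang.RuntimeException: boom' \
  'INFO shutting down' | gzip -c > "$tmpdir/stderr.gz"

step_log_errors "$tmpdir/stderr.gz"   # prints 2:java.lang.RuntimeException: boom
rm -rf "$tmpdir"
```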

Hey, you busy? I have thousands of questions to ask you.

Mon, 31 Dec 2018 17:00:00 +0000

If you’re a brick-and-mortar business owner, you quickly identify patterns in your customer foot traffic, especially around the holidays when achieving your sales goals depends on both timely and quality service. If you’re an e-commerce business owner, like many of Curalate’s 1,000+ customers, it’s really no different: holiday sales are crucial to success and they depend heavily on your site’s reliability.

At Curalate, we take great pride in maintaining high availability and low latency for our client integrations throughout the year. But over the “Black Fiveday” period—Thanksgiving through Cyber Monday—and the week leading up to and including the day after Christmas, we see a roughly 5x increase in network requests to our APIs from our clients’ sites. Therefore it’s crucial that we both design our systems to handle that increased load and perform load testing on the systems ahead of time to prove that our designs work.

This post describes how we carried out load tests of our infrastructure to prepare for the holiday traffic increase on our APIs. Additionally, it highlights how our approach towards dynamic scalability reduces costs by avoiding over-provisioning.

Load Test Planning

Our first question was: what volume of traffic can we expect? To answer that, we consulted the last several years of data describing our holiday traffic load pattern. Second: what are the important dimensions of that traffic? For example, do we expect a majority of the traffic to be cached or uncached? Do total requests matter or only instantaneous load? Since a previous blog post discussed cached versus uncached testing, this post focuses on API request rate.

Total requests in a day is interesting, but only suggests an average requests-per-second (RPS) rate. The metric we’re mostly interested in is the daily peak RPS rate. This gives us an idea of the busiest moment in our day, and if we can handle that rate, we should have confidence that we can handle lesser request rates at other times.
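To see why the average hides the busiest moment, consider a toy example. The sketch below assumes a hypothetical input format of one request count per line, one line per second; the function name rps_summary and the numbers are invented for illustration and are not our real traffic.

```shell
#!/bin/sh
# rps_summary FILE
# FILE holds one number per line: the requests served in each second.
# Prints the total, the average RPS, and the peak RPS.
rps_summary() {
  awk '
    { total += $1; if ($1 > peak) peak = $1 }
    END { if (NR > 0) printf "total=%d average=%.2f peak=%d\n", total, total / NR, peak }
  ' "$1"
}

# Five seconds of made-up traffic with one busy second.
counts=$(mktemp)
printf '%s\n' 10 10 10 50 20 > "$counts"
rps_summary "$counts"   # prints total=100 average=20.00 peak=50
rm -f "$counts"
```

Here the average is 20 RPS, but a system sized for 20 would be overwhelmed during the second that saw 50.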

Peak requests-per-second

Consulting our historical data, we observed that Black Fiveday typically has a daily peak RPS that is 5x higher than the normal peak rate during September of the same year. With this in mind, we calculated the 5x peak RPS rate from our September 2018 data, and then used some simple bash scripts (described in the next section) and other tools to steadily increase our API request load to the predicted maximum rate.

The below graph shows the Daily Peak RPS rate on our API over the course of the entire fourth quarter (September through December, 2018). November 5th shows the internal load generated using the script below targeting our expected 5x traffic increase. November 24th (Black Friday) and December 5th were our two highest peaks for publicly-generated traffic to our API by end-users.

So, how’d we do? Well, we were spot-on with our prediction! Next year we’re predicting winning lottery numbers 😉. In truth, we didn’t expect our prediction to be that close to the exact peak traffic load, and it’s a reasonable question to ask if we should have tested a little higher given the actual load. I can understand arguments on both sides.

Load Test Execution

Our 2017 holiday load testing blog post already shared details on the different load tests we performed (cached & uncached, various API endpoints, etc.), as well as some example output of the tool, so this post reproduces just one of those scripts here for context. To execute the actual load test, we once again used Vegeta this year, as we enjoyed our experience using it last year.

The following bash script shows how we use Vegeta to slowly increase the request rate to a specified API endpoint:

#!/bin/bash

#  File: load-test.sh
# usage: ./load-test.sh <api endpoint> <start rate> <rate step increment> <step duration> <max rate> <test tag>

if [ $# -ne 6 ]; then
  echo "usage: $0 <api endpoint> <start rate> <rate step increment> <step duration> <max rate> <test tag>"
  exit 1
fi

Target=$1
StartRate=$2
RateStep=$3
StepDuration=$4
MaxRate=$5
Tag=$6

CurrentRate=$StartRate

while [ "$CurrentRate" -le "$MaxRate" ]; do
  OutputFile="test-$Tag-$MaxRate-$CurrentRate.bin"
  if [ "$CurrentRate" -lt "$MaxRate" ]; then
    echo "$Target" | ./vegeta attack -rate="$CurrentRate" -duration="$StepDuration" > "$OutputFile"
  else
    # final step: no -duration, so this attack runs until manually killed
    echo "$Target" | ./vegeta attack -rate="$MaxRate" > "$OutputFile"
  fi
  CurrentRate=$((CurrentRate+RateStep))
done

The above script will repeatedly execute Vegeta with an increasing request rate until it reaches the specified max rate. Each Vegeta execution will run for the specified duration and the final execution, with the maximum rate, will run until manually killed. Additionally, each execution will write a distinct Vegeta output file.
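Before kicking off a long test, it can help to preview which rates the loop will visit. The sketch below reproduces only the loop arithmetic from the script, with no Vegeta calls; the function name ramp_rates and the baseline of 100 RPS are made-up values for illustration.

```shell
#!/bin/sh
# ramp_rates START STEP MAX
# Prints each request rate the load-test loop would use, one per line.
ramp_rates() {
  rate=$1; step=$2; max=$3
  while [ "$rate" -le "$max" ]; do
    echo "$rate"
    rate=$((rate + step))
  done
}

baseline_peak=100                      # hypothetical normal peak RPS
max_rate=$((baseline_peak * 5))        # the 5x holiday target

ramp_rates 100 100 "$max_rate"         # prints 100 200 300 400 500, one per line
ramp_rates 100 150 "$max_rate"         # prints 100 250 400, one per line
```

The second call shows a caveat worth knowing: if the step does not land exactly on the max rate, the loop ends without ever running the final open-ended attack at the max rate, so pick a step that divides the range evenly.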

Elastic scaling example

Handling a 5x peak in your daily traffic load is great, and it’s really easy, right?

InstanceCount=$(echo "$InstanceCount * 5" | bc)

😛 Okay, so it’s really easy if you don’t care about your costs. For those of us who do care about costs, our goal is to dynamically scale up our computing resources as demand for our services increases, and then gracefully scale down our resources as demand wanes.

Dynamic Scaling

AWS obviously makes this very easy as long as you have things configured properly. While some of our legacy services still run as AMIs deployed within auto-scaling groups on EC2 instances, most of our microservices now run as containers on Amazon ECS. We wrote a blog post earlier this year detailing all the various aspects that come into play when running our production systems on ECS.

Since that post already explains how we run our systems on ECS, this post simply provides an example of that dynamic scaling in action. The graphs below show the request rate on one of our services called “Media API” (“media-api-service”) and the corresponding number of containers powering that service (“media-api-service_v3”) during the three day period from Christmas through December 27th.

As you can see, we were quick to add more containers to meet increased request demand, and efficient in removing them soon after the request rate dropped. While the bottom graph is not a direct representation of the underlying EC2 instances powering the ECS cluster, it’s a good estimate for how dynamic our EC2 capacity (and costs) would be if all services in the cluster employ a similar scaling approach.

Scaling metric

Speaking of scaling, what metric do we scale on? There are several options including CPU and memory utilization, as well as latency and queue size. In our experience, scaling on the CPU utilization works well enough for container-based services. When a given service’s containers have an average CPU utilization above some threshold, say, 75%, we add a new container and re-evaluate after some time. When CPU utilization drops below, say, 60%, we stop a container and re-evaluate after some time. We scale some of our services on memory utilization in the same way.
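The rule described above boils down to a small decision function. This is a simplified sketch of the logic, not our actual ECS configuration: the function name desired_count and the minimum and maximum container counts are assumptions added so the example cannot scale to zero or grow without bound, and CPU is taken as a whole-number percentage.

```shell
#!/bin/sh
# desired_count CPU_PERCENT CURRENT_COUNT [MIN] [MAX]
# Prints the container count to aim for after one evaluation:
# add one above 75% CPU, remove one below 60%, otherwise hold steady.
desired_count() {
  cpu=$1; current=$2; min=${3:-1}; max=${4:-10}
  if [ "$cpu" -gt 75 ] && [ "$current" -lt "$max" ]; then
    echo $((current + 1))
  elif [ "$cpu" -lt 60 ] && [ "$current" -gt "$min" ]; then
    echo $((current - 1))
  else
    echo "$current"
  fi
}

desired_count 82 4   # prints 5 (scale up)
desired_count 68 4   # prints 4 (between thresholds, no change)
desired_count 41 4   # prints 3 (scale down)
desired_count 41 1   # prints 1 (already at the minimum)
```

The gap between the two thresholds is what keeps the service from flapping: a container added at 76% does not get removed again the moment utilization dips to 74%.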

For brevity and clarity, this post skipped over a few points related to various aspects above. Let me address a few lingering questions and relevant details that are worth noting but not central to the theme of this post.

While total requests doesn’t directly describe peak requests-per-second, it can impact other resources sensitive to total data size such as event queues and logging/diagnostic data. Keep those types of resources in mind both when executing the load test and facing increased live traffic.
We talked about scaling up compute resources to meet the increased server request rate, but if your traffic is bursty in the extreme (e.g., it goes from 1x to 10x or higher in a few seconds) then you should consider using a CDN rather than trying to dynamically scale to meet that load. A CDN introduces some delay in pushing out new data but it can withstand orders of magnitude higher request rates because the response is statically determined by the request input.
Synthetic load tests on production infrastructure could impact live production traffic through dependent resources such as databases, caches and event queues. For example, if you run a “cache-missing” test on live infrastructure, you may end up evicting all of your real customers’ data and drastically reducing performance for your legitimate production traffic.
The beginning of this post mentions the importance of reliability during the holiday period. To see how Curalate’s reliability compared to several of our competitors during Black Fiveday 2018, check out this blog post.

—

Hopefully this post gives you some insight into how we prepare for the holiday traffic load spike at Curalate. If you value high availability and low latency in your production systems as we do, check out our jobs page, click the Join Us link on the top right of the page, or shoot us an email at hello@curalate.com.

Thanks for reading!

Uploading EC2 Logs to S3 on Shutdown

Tue, 04 Sep 2018 08:56:00 +0000

If you’ve ever used an auto scaling group (ASG) on AWS, you’ve probably had an EC2 instance fail and get removed from the ASG. While great for redundancy (the ASG launches a new instance to start handling requests), it makes debugging the failure difficult since the ASG terminates the bad instance, erasing any evidence of what went wrong. Below, I present a script that will upload relevant files to S3 after an instance is triggered to shutdown but before it terminates.

To achieve this, we make use of Linux’s runlevel scripts. The instructions below are for Ubuntu, but it should be straightforward to migrate to a different distro.

First, we make a script /etc/rc0.d/K01upload-logs. This script will run when the system is shutting down. You should change BUCKET, S3_PREFIX, and LOG_FILE to match your needs.

#!/bin/bash

source /etc/environment
# get strict after sourcing environment since we don't trust it...
set -euo pipefail
IFS=$'\n\t'

# what logs should I upload and where to?
LOG_FILE="/var/log/tomcat7/catalina.out"
BUCKET="my-logs-bucket"
# below we include the instance id in the path. That way it's easily findable.
HOST=$(/usr/bin/curl -s http://169.254.169.254/latest/meta-data/instance-id)
# named S3_PREFIX rather than PATH so we don't clobber the shell's command search path
S3_PREFIX="services/logs/$HOST/"

# upload the logs
/bin/echo "Uploading logs to s3://$BUCKET/$S3_PREFIX" | /usr/bin/wall
/usr/local/bin/aws s3 cp "$LOG_FILE" "s3://$BUCKET/$S3_PREFIX"

wait

After installing the script, you need to set the permissions:

chown root:root /etc/rc0.d/K01upload-logs
chmod +x /etc/rc0.d/K01upload-logs

And that’s it! The script will upload the logs to your S3 bucket when the ASG terminates an instance. We’ve found this extremely helpful for our deep learning infrastructure, where errors often originate in C++ code (and thus aren’t handled by the JVM or sent to our logging services).
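If more than one file is worth keeping, the same idea extends to a loop. The sketch below only computes where each file would land; the function name s3_destination, the second log file, and the instance ID are made-up examples, and the upload itself would still be an aws s3 cp per file.

```shell
#!/bin/sh
# s3_destination BUCKET INSTANCE_ID FILE
# Prints the S3 URI a log file would be uploaded to, keyed by instance ID.
s3_destination() {
  bucket=$1; instance_id=$2; file=$3
  echo "s3://$bucket/services/logs/$instance_id/$(basename "$file")"
}

# Hypothetical list of logs to preserve on shutdown.
for f in /var/log/tomcat7/catalina.out /var/log/syslog; do
  s3_destination my-logs-bucket i-0123456789abcdef0 "$f"
done
# prints:
# s3://my-logs-bucket/services/logs/i-0123456789abcdef0/catalina.out
# s3://my-logs-bucket/services/logs/i-0123456789abcdef0/syslog
```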

How Curalate uses MXNet on AWS for Deep Learning Magic

Wed, 01 Aug 2018 00:00:00 +0000

This post was simultaneously published to Medium.

At Curalate, we use state of the art deep learning and computer vision to add a layer of magic to our products. Intelligent Product Tagging, for example, identifies our clients’ products in user-generated photos. Being a startup, we need to build these deep learning and computer vision systems the same way we do the rest of our products: quickly.

Our computer vision systems are built in two phases, research and productization, and we require a deep learning framework that accelerates both. During the research phase, we need a framework that’s quick to get started with and is flexible enough to experiment with new ideas. Once we have a solution, we need a framework that can easily be integrated into a microservice and deployed to multiple production environments.

In the past, we used Caffe for experimentation and our own custom inference interface to deploy the trained models to production. Experimentation was slow due to Caffe’s dated Python API, lack of automatic differentiation, unreliable build/install process, and clunky support for advanced layers which required us to maintain our own custom fork. Productization of Caffe was challenging since we had to maintain our own JNI interface.

We needed a new, modern framework that fulfilled all of our needs while saving us from the shortcomings of Caffe. After a review of all the available options, we decided to move to MXNet. In this post, we’ll discuss why we migrated to MXNet as our deep learning framework of choice to facilitate our speed of experimentation, development, and deployment.

Training and Experimentation

Whenever we are faced with a new computer vision problem, we start by looking at existing state-of-the-art implementations. If we are lucky, the functionality of the service we are implementing is similar to an existing pre-trained model for MXNet. MXNet has a fairly fleshed out and maintained Model Zoo that contains all of the standard pre-trained models that we would expect from any deep learning framework (ImageNet, PascalVOC, …). If we cannot find what we need there, MXNet is also one of the frameworks that supports ONNX and its cross-framework model format, giving us access to many more pre-trained models.

Converting and reusing old models and code is also possible with MXNet. MMDNN provides support for converting our old models into the MXNet model format if retraining is not needed. MXNet is also supported as a backend for Keras, the high level neural net API, allowing us to run exactly the same Keras code developed with other frameworks with the quick MXNet backend instead.

When we have a more domain specific service in mind, where the data we are training/predicting is new but the model/task is well researched (e.g. image classification, object detection and instance segmentation), ideally we can avoid implementing a research paper from scratch and find an open source implementation to use and contribute to instead. MXNet maintains a long list of example projects in its own code base for most of the popular deep learning applications/models (neural style transfer, Faster R-CNN, speech-to-text). Outside of this, MXNet has a fairly large community of open source developers that maintain MXNet versions of most of the popular and state of the art research papers in the machine learning community.

When the state of the art is just not good enough, MXNet is still an excellent choice for researching novel deep learning models. MXNet offers both symbolic (static graph) and imperative (dynamic graph) APIs, allowing us to work with whichever paradigm is the most appropriate for the task. MXNet’s high-level imperative API, called Gluon, offers a full set of plug-and-play neural network building blocks including data loaders, predefined layers and losses. This saves us from implementing common layers/methods for our models and lets us spend more of our development time writing the new state of the art secret sauce in a natural Pythonic control flow. If we are working specifically with a computer vision or natural language processing task, Gluon has model toolkits that provide implementations of state-of-the-art deep learning algorithms in GluonCV and GluonNLP respectively. When it comes to debugging our models during training, Gluon allows us to set breakpoints to help us analyze the internals/output of our deep models, and on top of that MXNet has its own (early in development) support for writing logs out for Tensorboard with MXBoard.

If Python is not your data science/experimentation language of choice, MXNet also provides APIs for training in C++, Scala, Julia, Perl and R. The Scala interface also includes early support for training in a distributed Apache Spark cluster for your big data needs.

Inference API and Scala

After training, our models get deployed to microservices that power various Curalate applications. At Curalate, we write our microservices in Scala using the Finatra framework. Before we switched to MXNet, we had to maintain JNA/Bridj borders between our Scala code and deep learning frameworks. This is where MXNet’s Scala API really sped up our development: any model we train in Python can be immediately loaded into our production web service framework.

Most of our deep learning microservices have a very similar data flow:

Load an input for the deep net (in our case, typically a set of images)
Pre-process the input data (e.g., scaling, cropping)
Perform inference on the deep net model
Return the results to the caller

When we first started using MXNet, (version 0.11.0), the Scala API only had support for NDArrays and Modules. This was great, but we needed more functionality for higher level operations such as:

Loading images from a network data store (in our case s3)
Pre-processing images to match what is expected by the net (e.g., scaling, cropping, converting to raw RGB)
Mutex locking the GPU to avoid contention

We implemented this functionality in an easy-to-use, high-level inference API. (As of MXNet 1.2.0, the Scala API has added support for inference on images and thread management.) Though our actual deployment has some Curalate-specific logic, its general design is:

trait InferenceAPI {

    /**
     * Performs inference on the provided input images, and returns the results.
     */
    def predict(images: Iterable[BufferedImage]): Future[Iterable[Array[Float]]]

    /**
     * Loads the images from s3 pointed to by the given bucket and keys, performs inference, and returns the results.
     */
    def loadAndPredict(s3Bucket: String, s3Keys: Iterable[String]): Future[Iterable[Array[Float]]]

    /**
     * Pre-processes the provided image (i.e., scaling, center crop, etc), 
     * and converts it to a 32-bit floating point NDArray.
     */
    private def preprocess(image: BufferedImage): NDArray
    
    /**
     * Performs the inference on a batch of images in the provided NDArray.
     */
    private def predict(batchArray: NDArray): Future[Array[Float]]
}

One particularly interesting item is the mutex locking on the GPU. We achieved this by creating a singleton actor with Akka that acts as a gatekeeper to the GPU. The actor is:

class InferenceActor(batchSize: Int, deepNet: module.Module) extends akka.actor.Actor {

  def receive: Receive = {
    case input: NDArray => {
      try {
        val start = System.currentTimeMillis()

        val batch = new DataBatch(
          IndexedSeq(input),
          IndexedSeq.empty[NDArray],
          IndexedSeq.fill(batchSize)(0L),
          pad = 0
        )

        val prediction = deepNet.predict(batch).head // only using one batch at a time.
        val result = prediction.toArray
        prediction.dispose()
        val duration = System.currentTimeMillis() - start
        sender ! (result, duration)
      } catch {
        case e: Throwable => {
          logger.error(s"Could not run prediction!!!", e)
          sender ! e
        }
      }
    }
  }

}

We then use Akka’s ask pattern to pass images to the InferenceActor. Not only does this let us mutex lock the GPU in a nice way, it also returns a Scala Future, which integrates well with our async Finatra patterns. The call to the actor looks like:

private def predict(batchArray: NDArray): Future[Array[Float]] = {
    (actor ? batchArray).map {
        case (result: Array[Float], duration: Long) => {
            // record the GPU execution time
            logger.debug(s"GPU Execution took $duration milliseconds for a batch size of $batchSize")
            result
        }
    }
}

Though we built a custom image-centric API on top of the MXNet Scala interface for our needs, MXNet treating Scala as a first class language made this extremely easy. The resulting code was minimal and elegant.

Deployment

Continuous integration and deployment of microservices is inherently challenging, but deep learning systems add another layer of complexity due to their dependence on specialized hardware (i.e., GPUs) and native software stack (i.e., CUDA, cuDNN, and MXNet’s binary library). While our Scala-only microservices can be compiled into a war or jar and dropped into a Docker container, the story is not as simple for our deep learning services. Ideally, we’d like to deploy to these specialized environments while maintaining the agility and speed offered by continuous integration. In addition, we sometimes want to deploy to non-GPU environments (such as our development boxes, or applications where latency is OK) to save costs.

MXNet helps us solve all of these challenges since it separates its core library from its Scala bindings API.

MXNet Binary Packages

First, let’s look at how we can enable fast, continuous integration and deployment. MXNet’s Scala API requires two binary files be present in the environment: the core library libmxnet.so and the Java/Scala bindings libmxnet-scala.so. At Curalate, we use Jenkins to build both Docker images and AMIs for deployment in AWS. Building MXNet or its Scala bindings from source each time we need an image (for, say a new service or a different version) is time consuming and puts a serious drag on our deployment process.

To alleviate this, we build custom MXNet Debian packages that contain both the core library libmxnet.so and the Java/Scala bindings libmxnet-scala.so. Specifically, we maintain a bash script that:

Checks out a specific version of MXNet (provided by the user)
Compiles the MXNet base library libmxnet.so
Compiles the MXNet Scala bindings libmxnet-scala.so
Packages all compiled binaries into a Debian file using checkinstall.
Uploads the Debian file to our s3-based repository using deb-s3.
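The steps above can be outlined roughly as follows. This is a dry-run sketch, not our actual build script: it only prints the commands, and the repository URL, make flags, package name, .deb file name, and bucket name are assumptions for illustration. A real script would also need to point checkinstall at the right install step for the compiled libraries.

```shell
#!/bin/sh
# mxnet_package_steps VERSION FLAVOR
# Prints an outline of the build-and-publish commands for one MXNet version.
# FLAVOR is "cpu" or "gpu". Nothing is executed.
mxnet_package_steps() {
  version=$1; flavor=$2
  if [ "$flavor" = "gpu" ]; then
    make_flags="USE_CUDA=1 USE_CUDNN=1"
  else
    make_flags="USE_CUDA=0"
  fi
  cat <<EOF
git clone --recursive https://github.com/apache/incubator-mxnet.git mxnet
git -C mxnet checkout $version
git -C mxnet submodule update --init --recursive
make -C mxnet -j4 $make_flags
make -C mxnet scalapkg
checkinstall -D -y --pkgname=libmxnet --pkgversion=$version
deb-s3 upload --bucket my-apt-repo libmxnet_$version-1_amd64.deb
EOF
}

mxnet_package_steps 1.2.0 gpu
```

Printing the plan first makes it easy to review what a given version and flavor would build before spending the time on an actual compile.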

We run this script inside a Docker container, so the resulting artifact is binary-compatible with the architecture of the container we compiled it in. We maintain separate repositories for different versions of Ubuntu and Debian, all of which are populated by the MXNet build script.

By compiling MXNet Debian files and maintaining our own repository, we can quickly install MXNet on any image that has access to our internal deb repository. When we set up a new service or update an existing one, our build process simply runs apt-get install libmxnet. In addition, this makes updating MXNet pretty simple: we just pass the version into our bash script and it becomes available in our repository.

Varying Environments

Our second challenge is how to build a proper abstraction over the different hardware environments we run our deep learning services on. In production we want to run our models on GPUs for speed. This requires the host OS to have CUDA installed, and MXNet to be properly compiled against it. Often, the version of CUDA is different depending on the operating system we’re using (e.g., CUDA 8 on Ubuntu 14, CUDA 9 on Ubuntu 16). To complicate matters further, we develop on laptops without GPUs and would like to test our code in our IDEs as we build things.

To solve this, we again take advantage of MXNet’s separation of the Scala package and its core libraries. Specifically, the MXNet Scala API gets packaged into mxnet-core.jar, which depends on the library libmxnet-scala.so being installed on the host operating system. So long as the MXNet version is maintained, we can package just the mxnet-core.jar with our services, and let the host operating system decide what stack is compiled in libmxnet-scala.so and libmxnet.so.

In other words, we again leverage our bash script from above to build multiple MXNet Debian files: one that is compiled for a GPU environment, and one for a CPU environment. We also compile a CPU environment for OSX which we install on our dev boxes via homebrew. The build.sbt files in our Scala projects only bring in mxnet-core, so no binary code is included in our service builds.

This flexibility allows us to create one service artifact in Scala, and run it in different environments.

Conclusion

MXNet provides Curalate with the flexibility needed to research and build cutting-edge deep learning systems quickly. The Python interface has the flexibility necessary to explore research ideas while executing extremely quickly on modern hardware. For productization, Scala is treated as a first-class language, providing us with an API that enables us to use MXNet in our existing microservice architecture. Finally, the separation between the Scala API and the binary packages allows us to deploy to multiple CPU and GPU environments without recompiling or repackaging our Scala code.

Productionalizing ECS

Wed, 16 May 2018 09:45:39 +0000

In January of last year we decided as a company to move towards containerization and began migrating to AWS ECS. We pushed to move to containers, and off of AMI-based VM deployments, for four reasons: to speed up our deployments; to simplify our build tooling (since it only has to work on containers); to be able to run our production code in a sandbox, even locally on our dev machines (something you can’t easily do with AMIs); and to lower our costs by getting more out of the resources we’re already paying for.

However, making ECS production ready was actually quite the challenge. In this post I’ll discuss:

Scaling the underlying ECS cluster
Upgrading the backing cluster images
Monitoring our containers
Cleanup of images, container artifacts
Remote debugging of our JVM processes

This is a short summary of the problems we encountered and our solutions, which finally made ECS a set-it-and-forget-it system.

Scaling the cluster

The first thing we struggled with was how to scale our cluster. ECS is a container orchestrator, analogous to Kubernetes or Rancher, but you still need to have a set of EC2 machines to run as a cluster. The machines all need to have the ECS Docker agent installed on them, and ECS doesn’t provide a way to automatically scale and manage your cluster for you. While this has changed recently with the announcement of Fargate, Fargate’s pricing makes it cost prohibitive for organizations with a lot of containers.

The general recommendation that AWS gave with ECS was to scale based on CPU reservation limit OR memory limit. There’s no clear way to scale with a combination of the two, since auto scaling rules need to apply to a single CloudWatch metric or you face potential thrashing.

Our first attempt on scaling was to try and scale on container placement failures. ECS logs a message when containers are unable to be placed due to constraints (not enough memory on the cluster, or not enough CPU reservation left), but there is no way to actually capture that event programmatically (see this github issue). The goal here was to not preemptively scale, but instead scale on actual pressure. This way we wouldn’t be overpaying for machines in the cluster that aren’t heavily used. However we had to discard this idea since it wasn’t possible due to API limitations.

Our second attempt, and one that we have been using now in production, is to use an AWS Lambda function to monitor the memory and CPU reservation of the cluster and emit a compound metric to CloudWatch that we can scale on. We set a compound threshold with the logic of:

If either memory or CPU is above the max threshold, scale up
Else if both memory and CPU are below the min, scale down.
Else stay the same

We represent a scale up event with a CloudWatch value of 2, a scale down as value 0 and otherwise the nominal state as value 1.

The code for that is shown below:

package com.curalate.lambdas.ecs_scaling

import com.amazonaws.services.cloudwatch.AmazonCloudWatch
import com.amazonaws.services.cloudwatch.model._
import com.curalate.lambdas.ecs_scaling.ScaleMetric.ScaleMetric
import com.curalate.lambdas.ecs_scaling.config.ClusterScaling
import org.joda.time.DateTime
import scala.collection.JavaConverters._

object ScaleMetric extends Enumeration {
  type ScaleMetric = Value

  val ScaleDown = Value(0)
  val StaySame = Value(1)
  val ScaleUp = Value(2)
}

case class ClusterMetric(
  clusterName: String,
  scaleMetric: ScaleMetric,
  periodSeconds: Int
)

class MetricResolver(clusterName: String, cloudWatch: AmazonCloudWatch) {
  private lazy val now = DateTime.now()
  private lazy val start = now.minusMinutes(3)

  private val dimension = new Dimension().withName("ClusterName").withValue(clusterName)

  val periodSeconds = 60

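  // Returns the most recent one-minute Maximum for the given AWS/ECS cluster metric.
  protected def getMetric(name: String): Double = {
    val baseRequest = new GetMetricStatisticsRequest().withDimensions(dimension)

    val datapoints = cloudWatch.getMetricStatistics(baseRequest.
      withMetricName(name).
      withNamespace("AWS/ECS").
      withStartTime(start.toDate).
      withEndTime(now.toDate).
      withPeriod(periodSeconds).
      withStatistics(Statistic.Maximum)
    ).getDatapoints.asScala

    // CloudWatch does not return datapoints in chronological order, so select the latest explicitly.
    datapoints.maxBy(_.getTimestamp.getTime).getMaximum
  }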

  lazy val currentCpuReservation: Double = getMetric("CPUReservation")

  lazy val currentMemoryReservation: Double = getMetric("MemoryReservation")
}

class ClusterStatus(
  scaling: ClusterScaling,
  metricResolver: MetricResolver
) {

  protected val logger = org.slf4j.LoggerFactory.getLogger(getClass)

  def getCompositeMetric(): ClusterMetric = {
    logger.info(s"CPU: ${metricResolver.currentCpuReservation}, memory: ${metricResolver.currentMemoryReservation}")

    val state =
      if (metricResolver.currentCpuReservation >= scaling.CPU.max || metricResolver.currentMemoryReservation >= scaling.memory.max) {
        ScaleMetric.ScaleUp
      }
      else if (metricResolver.currentCpuReservation <= scaling.CPU.min && metricResolver.currentMemoryReservation <= scaling.memory.min) {
        ScaleMetric.ScaleDown
      } else {
        ScaleMetric.StaySame
      }

    ClusterMetric(scaling.name, state, metricResolver.periodSeconds)
  }
}

class CloudwatchEmitter(cloudWatch: AmazonCloudWatch) {
  def writeMetric(metric: ClusterMetric): Unit = {
    val cluster = new Dimension().withName("ClusterName").withValue(metric.clusterName)

    cloudWatch.putMetricData(new PutMetricDataRequest().
      withMetricData(new MetricDatum().
        withMetricName("ScaleStatus").
        withDimensions(cluster).
        withTimestamp(DateTime.now().toDate).
        withStorageResolution(metric.periodSeconds).
        withValue(metric.scaleMetric.id.toDouble)
      ).withNamespace("Curalate/AutoScaling"))
  }
}

Wiring in our ECS cluster to autoscale on this metric value in our Terraform configuration looks like

resource "aws_cloudwatch_metric_alarm" "cluster_scale_status_high_blue" {
  count               = "${var.autoscale_enabled}"
  alarm_name          = "${var.cluster_name}_ScaleStatus_high_blue"
  comparison_operator = "${lookup(var.alarm_high, "comparison_operator")}"
  evaluation_periods  = "${lookup(var.alarm_high, "evaluation_periods")}"
  period              = "${lookup(var.alarm_high, "period")}"
  statistic           = "${lookup(var.alarm_high, "statistic")}"
  threshold           = "${lookup(var.alarm_high, "threshold")}"
  metric_name         = "ScaleStatus"
  namespace           = "Curalate/AutoScaling"

  dimensions {
    ClusterName = "${var.cluster_name}"
  }

  alarm_description = "High cluster resource usage"
  alarm_actions     = ["${aws_autoscaling_policy.scale_up_blue.arn}"]
}

resource "aws_cloudwatch_metric_alarm" "cluster_scale_status_low_blue" {
  count               = "${var.autoscale_enabled}"
  alarm_name          = "${var.cluster_name}_ScaleStatus_low_blue"
  comparison_operator = "${lookup(var.alarm_low, "comparison_operator")}"
  evaluation_periods  = "${lookup(var.alarm_low, "evaluation_periods")}"
  period              = "${lookup(var.alarm_low, "period")}"
  statistic           = "${lookup(var.alarm_low, "statistic")}"
  threshold           = "${lookup(var.alarm_low, "threshold")}"
  metric_name         = "ScaleStatus"
  namespace           = "Curalate/AutoScaling"

  dimensions {
    ClusterName = "${var.cluster_name}"
  }

  alarm_description = "Low cluster resource usage"
  alarm_actions     = ["${aws_autoscaling_policy.scale_down_blue.arn}"]
}

variable "alarm_high" {
  type = "map"

  default = {
    comparison_operator = "GreaterThanThreshold"
    evaluation_periods  = 4
    period              = 60
    statistic           = "Maximum"
    threshold           = 1
  }
}

variable "alarm_low" {
  type = "map"

  default = {
    comparison_operator = "LessThanThreshold"
    evaluation_periods  = 10
    period              = 60
    statistic           = "Maximum"
    threshold           = 1
  }
}
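To make these alarm settings concrete, here is a minimal Python sketch of how the two alarms react to a stream of ScaleStatus values. It assumes the simplest alarm behaviour, where an alarm fires only when every one of the last evaluation_periods datapoints breaches the threshold, and it ignores missing data. The function and variable names are ours, for illustration only.

```python
import operator

# Comparison operators used by the two alarms above.
COMPARATORS = {
    "GreaterThanThreshold": operator.gt,
    "LessThanThreshold": operator.lt,
}

# Mirrors the defaults of the alarm_high and alarm_low variables.
ALARM_HIGH = {"comparison_operator": "GreaterThanThreshold", "evaluation_periods": 4, "threshold": 1}
ALARM_LOW = {"comparison_operator": "LessThanThreshold", "evaluation_periods": 10, "threshold": 1}


def alarm_fires(values, alarm):
    """Return True when the most recent `evaluation_periods` values all breach.

    `values` is the per-minute ScaleStatus series, oldest first
    (0 = scale down, 1 = stay the same, 2 = scale up).
    """
    periods = alarm["evaluation_periods"]
    window = values[-periods:]
    if len(window) < periods:
        return False  # not enough history yet
    breaches = COMPARATORS[alarm["comparison_operator"]]
    return all(breaches(v, alarm["threshold"]) for v in window)


# Four consecutive minutes of pressure trigger a scale up...
busy = [1, 1, 2, 2, 2, 2]
# ...but a single nominal minute inside the window holds the cluster steady.
flapping = [2, 2, 1, 2]
# Scaling down needs ten quiet minutes in a row.
quiet = [0] * 10

print(alarm_fires(busy, ALARM_HIGH))      # True
print(alarm_fires(flapping, ALARM_HIGH))  # False
print(alarm_fires(quiet, ALARM_LOW))      # True
```

The asymmetric windows (4 periods up, 10 periods down) mean the cluster grows quickly under pressure and shrinks only after a sustained quiet stretch.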

We made our Lambda dynamically configurable: it loads its settings from our configuration system, which lets us onboard new clusters to monitor and tune the threshold values on the fly.

You can see this in effect here:

Host draining and ECS rescheduling

This leads us to another problem. When the ASG scales down in response to a CloudWatch alarm, it puts the boxes into DRAINING. However, draining doesn’t necessarily mean that existing services have been rescheduled on other boxes! It just means that connections are drained from the existing hosts, and that the ECS scheduler now needs to move the containers elsewhere. This can be a real problem: if you scale down the 2 hosts that are serving both of your HA containers, your service can end up at 0 instances!

To solve this, we wired up a custom ASG lifecycle hook that polls the draining machines and makes sure that their containers are fully stopped, and that the active cluster contains at least the minimum number of running instances of each service (each service defines its minimum allowed running instances and the fraction of that capacity it can acceptably run at). For example, if a service can run at 50% capacity and its min is set to 20, then we need to verify that at least 10 are active before we allow the box to be removed, giving the ECS scheduler time to move things around.
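The check the lifecycle hook performs can be sketched in a few lines of Python. This is not our Lambda’s code; the ServiceState fields and function names are hypothetical, and the sketch assumes the hook has already counted tasks per service on draining and non-draining hosts.

```python
import math
from dataclasses import dataclass


@dataclass
class ServiceState:
    name: str
    min_instances: int              # the service's configured minimum
    min_healthy_fraction: float     # e.g. 0.5 if it can run at 50% capacity
    running_on_active_hosts: int    # tasks on hosts that are NOT draining
    running_on_draining_hosts: int  # tasks still running on draining hosts


def required_active(service):
    """Smallest number of tasks that must be running outside the draining hosts."""
    return math.ceil(service.min_instances * service.min_healthy_fraction)


def safe_to_terminate(services):
    """True once the draining hosts are empty and every service meets its floor."""
    return all(
        s.running_on_draining_hosts == 0
        and s.running_on_active_hosts >= required_active(s)
        for s in services
    )


api = ServiceState("api", 20, 0.5, running_on_active_hosts=9, running_on_draining_hosts=0)
print(required_active(api))      # 10
print(safe_to_terminate([api]))  # False: keep waiting for ECS to reschedule

api.running_on_active_hosts = 10
print(safe_to_terminate([api]))  # True
```

The hook would keep polling until safe_to_terminate returns True, and only then let the ASG remove the box.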

Cluster upgrades

Solving cluster scaling and draining just introduced the next question: how do we do zero downtime cluster upgrades? Because we now have many services running on the cluster, the blast radius for failure is much higher. If we fail a cluster upgrade we could take many of the services at Curalate down with us.

Our solution, while not fully automated, is beautiful in its simplicity. Leveraging the draining Lambda, we keep all our clusters grouped into ASGs labeled blue and green. To upgrade, we spin up the unused cluster with new backing AMIs and wait for it to reach a steady state. Then we tear down the old cluster and rely on the draining Lambda to prevent any race issues with the ECS scheduler.

Each time we need to do a cluster upgrade, either for security updates, base AMI changes, or other infrastructure wide sweeping changes, we do a manual toggle using Terraform to drive the base changes.

For example, our Terraform ECS cluster module in QA looks like this

module "ecs_cluster_default_bg" {
  source = "github.com/curalate/infra-modules.git//aws-ecs-cluster-blue-green?ref=2018.03.07.20.09.12"

  cluster_name                       = "${aws_ecs_cluster.default.name}"
  availability_zones                 = "${data.terraform_remote_state.remote_env_state.availability_zones}"
  environment                        = "${var.environment}"
  region                             = "${var.region}"
  iam_instance_profile               = "${data.terraform_remote_state.remote_env_state.iam_instance_profile}"
  key_name                           = "${data.terraform_remote_state.remote_env_state.key_name}"
  security_group_ids                 = "${data.terraform_remote_state.remote_env_state.ecs_security_groups}"
  subnet_ids                         = "${data.terraform_remote_state.remote_env_state.public_subnet_ids}"
  drain_hook_notification_target_arn = "${module.drain_hook.notification_target_arn}"
  drain_hook_role_arn                = "${module.drain_hook.role_arn}"
  autoscale_enabled                  = true

  root_volume_size = 50
  team             = "devops"

  blue = {
    image_id      = "ami-5ac76b27"
    instance_type = "c4.2xlarge"

    min_size         = 2
    max_size         = 15
    desired_capacity = 5
  }

  green = {
    image_id      = "ami-c868b6b5"
    instance_type = "c3.2xlarge"

    min_size         = 0
    max_size         = 0
    desired_capacity = 0
  }
}

Monitoring with statsd

Curalate uses Datadog as our graph visualization tool and we send metrics to Datadog via the dogstatsd agent that is installed on our boxes. Applications emit UDP events to the dogstatsd agent, which then aggregates and sends messages to Datadog over TCP.

In the containerized world we had 3 options for sending metrics

Send directly from the app
Deploy all containers with a sidecar of statsd
Proxy messages to the host box and leave dogstatsd on the host

We elected for option 3 since option 1 makes it difficult to do sweeping upgrades and option 2 uses extra resources on ECS we didn’t want.

However, we needed a way to deterministically write messages from a Docker container to its host. To do this we leveraged the docker0 bridge network:

# returns x.x.x.1 base ip of the docker0 bridge IP
get_data_dog_host(){
    # extracts the ip address from eth0 of 1.2.3.4 and splits off the last octet (returning 1.2.3)
    BASE_IP=`ifconfig | grep eth0 -A 1 | grep inet | awk '{print $2}' | sed "s/addr://" | cut -d. -f1-3`

    echo "${BASE_IP}.1"
}

And we configure our apps to use this IP to send messages to.
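For illustration, the same derivation and a fire-and-forget UDP send can be sketched in Python. Like the shell function above, it assumes the bridge gateway is simply the container’s address with the last octet replaced by 1. The sample IP, the metric name, and the use of port 8125 (the conventional statsd port) are assumptions, not values taken from our setup.

```python
import ipaddress
import socket


def bridge_gateway(container_ip):
    """Return the x.x.x.1 address for a container IP such as 172.17.0.5."""
    # IPv4Address validates the input before we split it into octets.
    first_three = str(ipaddress.IPv4Address(container_ip)).split(".")[:3]
    return ".".join(first_three + ["1"])


def counter_datagram(name, value=1):
    """Format a statsd counter increment, e.g. b'requests:1|c'."""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name, host, port=8125):
    """Send one counter increment to the agent on the host over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(counter_datagram(name), (host, port))


print(bridge_gateway("172.17.0.5"))  # 172.17.0.1
print(counter_datagram("requests"))  # b'requests:1|c'
```

An app inside the container would call send_counter("requests", bridge_gateway(own_ip)); because UDP is connectionless, the call does not block or fail when the agent is briefly unavailable.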

Cleanup

One thing we chose to do was to volume mount our log folders to the host system for semi-archival reasons. By mounting our application logs to the host, if the container crashed or was removed from Docker, we’d still have a record of what happened.

That said, containers are transient; they come and go all the time. The first question we had was “where do logs go?”. What folder do we mount them to? For us, we chose to mount all container logs in the following schema:

/var/log/curalate/<service-name>/containers/<container-sha>

This way we can back correlate the logs for a particular container in a particular folder. If we have multiple instances of the same image running on a host, the logs don’t stomp on each other.
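As a small sketch of that layout, the Python helpers below build a container’s log directory and recover the service name and container ID from a path. The helper names, the service name, and the container IDs are made up for the example.

```python
from pathlib import PurePosixPath

LOG_ROOT = PurePosixPath("/var/log/curalate")


def container_log_dir(service_name, container_id):
    """Build /var/log/curalate/<service-name>/containers/<container-sha>."""
    return LOG_ROOT / service_name / "containers" / container_id


def parse_container_log_dir(path):
    """Recover (service_name, container_id) from a container log directory."""
    parts = PurePosixPath(path).relative_to(LOG_ROOT).parts
    if len(parts) != 3 or parts[1] != "containers":
        raise ValueError(f"not a container log directory: {path}")
    return parts[0], parts[2]


# Two containers of the same (hypothetical) service get separate folders.
first = container_log_dir("media-api", "3f2a9c1b7d4e")
second = container_log_dir("media-api", "9b8c7d6e5f4a")

print(first)                           # /var/log/curalate/media-api/containers/3f2a9c1b7d4e
print(first != second)                 # True
print(parse_container_log_dir(first))  # ('media-api', '3f2a9c1b7d4e')
```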

We normally have a log rotator on our AMI boxes that handles long running log files, however in our AMI based deployments machines and clusters are immutable and singular. This means that as we do new deploys the old logs are removed with the box and only one service is allowed to sit on one EC2 host.

In the new world the infrastructure is immutable at the container level, not the VM level. So in this sense, the base VM also has a log rotator to rotate all the container logs, but we didn’t account for the fact that services will start and stop and deploy hundreds of times daily leaving hundreds of rotated log files in stale folders.

After the first disk alert though we added the following cron script:

ubuntu@ip-172-17-50-242:~$ crontab -l
# Chef Name: container-log-prune
*/10 * * * * /opt/curalate/docker/prune.rb
# Chef Name: volume-and-image-prune
0 0 * * * docker images -q | xargs docker rmi && docker system prune -f

We have 2 things happening here. The first is a Ruby script that checks for running containers and then deletes every directory matched by the recursive log glob whose container ID isn’t active anymore. Per the crontab above, it runs every ten minutes.

#!/usr/bin/env ruby

require 'set'
require 'fileutils'
require 'English'

# IDs of the running containers, one per line (log directories are named after these IDs)
containers = `docker ps --format '{{.ID}}'`.split("\n").to_set

unless $CHILD_STATUS.success?
  puts 'Failed to query docker'
  exit 1
end

dirs = Dir.glob('/var/log/**/containers/*')

to_delete = dirs.reject do |d|
  (containers.include? File.basename(d))
end

to_delete.each do |d|
  puts "Deleting #{d}"

  FileUtils.rm_rf d
end

The second script is pretty straightforward and we leverage the Docker system prune command to remove old volume overlay data, images that are unused, and any other system cleanup stuff. We run this daily because we want to leverage the existing images that are already downloaded on a box to speed up autoscaling. We’re ok with taking a once daily hit to download the image base layers at midnight if necessary during a scaling event.

JMXMP

JMX is a critical tool in our toolbox here at Curalate as nearly all of our services and applications are written using Scala on the JVM. Normally in our AMI deployments we control exactly which ports are open, and they are deterministic. If we open port 5555 it’s always open on that box. However, when we start to have many services running on the same host, we need to leverage Docker’s dynamic port routing, which makes knowing which port maps to what more difficult.

Normally this isn’t really an issue, as services that do need to expose ports to either each other or the public are routed through an ALB that manages that for us. But JMX is a different beast. JMX, in its ultimate wisdom, requires a 2 port handshake in order to connect. What this means is that the port you connect to on JMX is not the ultimate port you communicate over in JMX. When you make a JMX connection to the connection port it replies back with the communication port and you then communicate on that.

But in the world of dynamic port mappings, we can find the first port from the dynamic mapping, but there is no way for us to know the second port. This is because the container itself has no information about what its port mapping is, for all it knows its port is what it says it is!

Thankfully there is a solution using an extension to JMX called JMXMP. With some research from this blog post we rolled a JMXMP Java agent:

package com.curalate.agents.jmxmp;

import javax.management.MBeanServer;
import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;
import java.lang.instrument.Instrumentation;
import java.lang.management.ManagementFactory;
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.UnknownHostException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class Agent {
    public static void premain(String agentArgs, Instrumentation inst) {
        Boolean enableLogging = Boolean.valueOf(System.getProperty("javax.management.remote.jmx.enable_logging", "false"));

        Boolean bindEth0 = Boolean.valueOf(System.getProperty("javax.management.remote.jmx.bind_eth0", "true"));

        try {
            Map<String, String> jmxEnvironment = new HashMap<String, String>();

            jmxEnvironment.put("jmx.remote.server.address.wildcard", "false");

            final String defaultHostAddress = (bindEth0 ? getEth0() : getLocalHost()).replace("/","");

            JMXServiceURL jmxUrl = new JMXServiceURL(System.getProperty("javax.management.remote.jmxmp.url", "service:jmx:jmxmp://" + defaultHostAddress + ":5555"));

            MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();

            JMXConnectorServer jmxRemoteServer = JMXConnectorServerFactory.newJMXConnectorServer(jmxUrl, jmxEnvironment, mbs);

            if (enableLogging) {
                System.out.println("Starting jmxmp agent on '" + jmxUrl + "'");
            }

            jmxRemoteServer.start();
        }
        catch (Throwable e) {
            if (enableLogging) {
                e.printStackTrace();
            }
        }
    }

    private static String getEth0() {
        try {
            return Collections.list(NetworkInterface.getByName("eth0").getInetAddresses())
                              .stream()
                              .filter(x -> !x.isLoopbackAddress() && x instanceof Inet4Address)
                              .findFirst()
                              .map(Object::toString)
                              .orElse("127.0.0.1");
        }
        catch (Exception e) {
            return "127.0.0.1";
        }
    }

    private static String getLocalHost() {
        try {
            return InetAddress.getLocalHost().getHostName();
        }
        catch (UnknownHostException e) {
            return "127.0.0.1";
        }
    }
}

That we bundle in all our service startups:

exec java -agentpath:/usr/local/lib/libheapster.so -javaagent:agents/jmxmp-agent.jar $JVM_OPTS $JVM_ARGS -jar

JMXMP does basically the same thing as JMX, except it only requires one port to be open. By standardizing our ports on port 5555 we can look up the 5555 port mapping in ECS and connect to it via JMXMP and get all our “favorite” JMX goodies (if you’re doing JMX you’re already in a bad place).
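The port lookup itself is simple. The Python sketch below walks a task description shaped like the networkBindings section of an ECS DescribeTasks response, finds the host port mapped to container port 5555, and builds the JMXMP service URL. The sample task, host IP, and port numbers are invented for the example, and the helper names are ours.

```python
JMXMP_CONTAINER_PORT = 5555


def jmxmp_host_port(task, container_port=JMXMP_CONTAINER_PORT):
    """Find the dynamically assigned host port that maps to the container port."""
    for container in task.get("containers", []):
        for binding in container.get("networkBindings", []):
            if binding.get("containerPort") == container_port:
                return binding["hostPort"]
    raise LookupError(f"no binding for container port {container_port}")


def jmxmp_url(host_ip, task):
    """Build the JMXMP service URL for a task running on the given host."""
    return f"service:jmx:jmxmp://{host_ip}:{jmxmp_host_port(task)}"


# Trimmed-down, made-up task description for illustration.
sample_task = {
    "containers": [
        {
            "name": "my-service",
            "networkBindings": [
                {"containerPort": 8080, "hostPort": 32768, "protocol": "tcp"},
                {"containerPort": 5555, "hostPort": 32769, "protocol": "tcp"},
            ],
        }
    ]
}

print(jmxmp_url("10.0.1.23", sample_task))  # service:jmx:jmxmp://10.0.1.23:32769
```

Because JMXMP only needs this single port, the one mapping is enough to connect; with plain JMX the second, negotiated port would have no entry to look up.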

For full reference, all our dockerized Java apps have a main entrypoint that Docker executes, which is shown below. This gives us some sane default JVM settings but also exposes a way for us to manually override any of the settings via the JVM_ARGS env var (which we can set during our Terraform deployments):

#!/usr/bin/env bash

HOST_IP="localhost"

# Entrypoint for service start
main() {
    set_host_ip

    DATADOG_HOST=`get_data_dog_host`

    # locate the fat jar (take the first match if there are several)
    BIN_JAR=`ls /app/bin/*.jar | head -n 1`

    LOG_PATH="/var/log/${HOSTNAME}"

    mkdir -p ${LOG_PATH}
    mkdir -p /heap_dumps

    JVM_OPTS="""
        -server \
        -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/heap_dumps \
        -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled \
        -Xmx512m -Xms512m \
        -XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark \
        -Dsun.net.inetaddr.ttl=5 \
        -Dcom.sun.management.jmxremote.port=1045 \
        -Dcom.sun.management.jmxremote.authenticate=false \
        -Dcom.sun.management.jmxremote.ssl=false \
        -Dcontainer.id=${HOSTNAME} \
        -Dhostname=${HOST_IP} \
        -Dlog.service.output=${LOG_PATH}/service.log \
        -Dlog.access.output=${LOG_PATH}/access.log \
        -Dloggly.enabled=${LOGGLY_ENABLED} \
        -Ddatadog.host=${DATADOG_HOST} \
        -Ddatadog.defaultTags=application:${CLOUD_APP}
    """

    exec java -agentpath:/usr/local/lib/libheapster.so -javaagent:agents/jmxmp-agent.jar $JVM_OPTS $JVM_ARGS -jar ${BIN_JAR} $@
}

# Set the host IP variable of the EC2 host instance
# by querying the EC2 metadata api
# if there is no response after the timeout we'll default to localhost
set_host_ip () {
    if [ "200" == "$(/usr/bin/curl --connect-timeout 2 --max-time 2 -s -o /dev/null -w "%{http_code}" 169.254.169.254/latest/meta-data/local-ipv4)" ]
    then
        HOST_IP=$(curl 169.254.169.254/latest/meta-data/local-ipv4)
    else
        HOST_IP="$(hostname)"

        if [ "${HOST_IP}" = "${HOSTNAME}" ]
        then
            HOST_IP="localhost"
        fi
    fi
}

# returns x.x.x.1 base ip of the docker0 bridge IP
get_data_dog_host(){
    # extracts the ip address from eth0 of 1.2.3.4 and splits off the last octet (returning 1.2.3)
    BASE_IP=`ifconfig | grep eth0 -A 1 | grep inet | awk '{print $2}' | sed "s/addr://" | cut -d. -f1-3`

    echo "${BASE_IP}.1"
}

# execute main
main $@

Choosing how to group a cluster

One thing we wrestled with is how to choose where a service will go. For the most part we have a default cluster comprised of c4.2xl’s that everyone is allowed to deploy to.

I wanted to call out that choosing what service goes on what cluster and what machine types comprise a cluster can be tricky. For our GPU based services, the choice is obvious in that they go onto a cluster that has GPU acceleration. For other clusters we tried smaller machines with fewer containers, and we tried larger machines with more containers. We found that we preferred fewer larger machines since most of our services are not running at full capacity, so they get the benefit of extra IO and memory without overloading the host system. With smaller boxes we had less headroom and it was more difficult to pack services with varying degrees of memory/CPU reservation necessities.
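A toy packing exercise shows why small boxes strand capacity. The Python sketch below places memory reservations on hosts with a simple first-fit rule. It is a deliberate simplification (real ECS placement also weighs CPU and placement strategies), and the reservation sizes and host capacities are invented.

```python
def first_fit(reservations_gb, host_capacity_gb):
    """Place each reservation on the first host with room; return per-host usage."""
    hosts = []
    for reservation in reservations_gb:
        if reservation > host_capacity_gb:
            raise ValueError("reservation larger than a single host")
        for i, used in enumerate(hosts):
            if used + reservation <= host_capacity_gb:
                hosts[i] = used + reservation
                break
        else:
            # No existing host had room, so start a new one.
            hosts.append(reservation)
    return hosts


def stranded(hosts, host_capacity_gb):
    """Total capacity left unused across the hosts."""
    return sum(host_capacity_gb - used for used in hosts)


reservations = [6, 5, 5, 6, 5, 5]  # GB per service, 32 GB in total

small = first_fit(reservations, host_capacity_gb=8)
large = first_fit(reservations, host_capacity_gb=16)

print(len(small), stranded(small, 8))   # 6 16
print(len(large), stranded(large, 16))  # 2 0
```

In this example six 8 GB hosts (48 GB) are needed because no two services fit together, while two 16 GB hosts (32 GB) hold the same workload with nothing left over.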

On that note however, we’ve also chosen to segment some high priority applications onto their own clusters. These are services that under no circumstances can fail, or require more than average resources (whether IO or otherwise), or are particularly unstable. While we don’t get the cost savings by binpacking services onto that cluster, we still get the fast deploy/rollback/scaling properties with containers so we still consider it a net win.

Conclusion

ECS was really easy to get started on, but as with any production system there are always gotchas. Overall we’re really pleased with the experience, even though it wasn’t pain free. In the end, we can now deploy in seconds, roll back in seconds, and still enjoy a pseudo-immutable infrastructure that is simple to reason about as well as locally reproducible!

Choosing a Deep Learning library for developing and deploying your App/Service

Fri, 23 Mar 2018 10:11:36 +0000

Interest in deep learning is growing and growing and, with it at peak hype right now, a lot of people are looking to find the best deep learning library to build their new app or bring their company into the modern age. There are many deep learning toolkits to choose from ranging from the long used, supported, and robust academic libraries to the new state-of-the-art, industry backed platforms.

At Curalate, we’ve been working on deep learning problems since 2014, meaning we’ve had the chance to watch the deep learning community and its open source libraries grow. We have also had the fortunate (unfortunate?) experience of using a few of the deep learning libraries in our production services and applications, and along the way, we have learned a lot about what to look for in a deep learning library to build reliable, production-ready applications and services. In this post, I’ll share the lessons we’ve learned in hopes they will help you in your search for the perfect deep learning library match. You might even find that your best fit is using more than one!

Important factors

The specific needs of your application/service

The platform you are developing on and deploying to.

Develop in OSX? Linux? Windows? Plan on having your application run in a web browser? A smart phone? A massive multi-node GPU cluster? It’s not surprising that each of the libraries has prioritized different environments, and some will work much better for your specific situation.

The specific deep net architecture you are trying to implement

If you are just trying to implement a typical, pre-trained classification net, this factor may not be as important for you. Some libraries are more performant and appropriate for certain types of deep nets (LSTMs, RNNs), but more on this later.

API language requirements

If you already have a code base written in language A, you probably would like to keep it that way without having to figure out some convoluted way to fit a deep net interface in language B into it. Luckily, it seems that most of the common languages are covered at this point in at least one of the libraries, or in an external community project.

Codebase Quality

Is the code base actively maintained?

How healthy is the project in terms of maintainers? Is there a large group/company committing time and resources to the library’s development? If you find a bug or issue with the library, how long is it going to take for it to get addressed?

Release status of the library itself

Is the library or a certain feature/API you are going to need still considered to be in an Alpha or Beta state? Has the library been used enough to have most of the kinks ironed out?

Ease of Use

Train to production pipeline

Your model training code and production code do not have to run in the same environments or even the same language. Can you train your model with a quick-to-prototype language in a documented, version-controlled, repeatable way so you can research new and different models for your application? Then can you deploy your saved model in a fairly quick and painless fashion? That may be through the same library with a different language API, using a library’s prebuilt production-serving framework, or even converting your model from one library to another that is better suited for your target platform.

Keras support

Does the library have support for being used as a backend for Keras? Keras is not a deep learning library per se, but a library that sits on top of other deep learning libraries and provides a single, easy-to-use, high-level interface for writing and training deep learning models. What it lacks in optimizations it makes up for in approachability: it is great for beginners, with great documentation and a modular, object-oriented design.

Dynamic vs Static computation

Now we could write a whole blog post on this topic alone, but to keep it brief, do you want to work with a static computation graph API that follows a symbolic programming paradigm? Or do you want a dynamic computation graph API that follows an imperative programming paradigm?

Static Computation Graphing
- You define the deep net once, and use a session to execute ops in the net many times.
- The library can optimize the net before you use it, so the nets end up being more efficient with memory and speed.
- Good for fixed-size nets (feed-forward, CNNs)
- Leads to a more verbose API and a domain-specific language (DSL) that is harder to debug
- Offers better over loading and model management in regards to system resources.
Dynamic Computation Graphing
- Nets are built and rebuilt at runtime, and executed line by line how you define them. This lets you use standard imperative language (think Python) statements, features, and control structures.
- Tends to be more flexible and useful for when the net structure needs to change at runtime, like in RNNs
- Makes debugging easier, since an error is thrown at the specific line of the dynamic graph at run time rather than in a single call that executes the net after it’s compiled.
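The difference is easier to see in code. The library-free Python toy below contrasts the two styles: the static half builds a graph object first and executes it later against fed values, while the dynamic half is ordinary code whose structure is decided as it runs. The Node class and every function name here are invented for the illustration and do not mirror any particular library’s API.

```python
# --- Static (define-then-run): build a graph of nodes first, execute later ---
class Node:
    def __init__(self, op, inputs=(), name=None):
        self.op, self.inputs, self.name = op, inputs, name

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))


def placeholder(name):
    return Node("placeholder", name=name)


def run(node, feed):
    """Evaluate a pre-built graph against a dict of placeholder values."""
    if node.op == "placeholder":
        return feed[node.name]
    left, right = (run(i, feed) for i in node.inputs)
    return left + right if node.op == "add" else left * right


x, w, b = placeholder("x"), placeholder("w"), placeholder("b")
graph = x * w + b  # nothing is computed here, only the graph is built

print(run(graph, {"x": 2.0, "w": 3.0, "b": 1.0}))  # 7.0
print(run(graph, {"x": 4.0, "w": 3.0, "b": 1.0}))  # 13.0


# --- Dynamic (define-by-run): ordinary code executes immediately ---
def rnn_like(inputs, w, state=0.0):
    # The depth of the computation depends on len(inputs), decided at run time.
    for value in inputs:
        state = state * w + value
    return state


print(rnn_like([1.0, 2.0], w=0.5))       # 2.5
print(rnn_like([1.0, 2.0, 3.0], w=0.5))  # 4.25
```

In the static half a mistake in the graph only surfaces inside run, far from the line that built the faulty node; in the dynamic half a failure points at the exact statement that raised it.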

Support

Documentation

How good is the documentation? Are there coding examples that cover most of the use cases you need? Are you used to getting your documentation in a certain style from a specific company?

Community support

How large is the community? Just because a deep learning library is really good does not mean people are actually using it. Are you going to be able to find 3rd party blog posts, code samples, and tutorials using the library? If you run into a problem, what is the chance you are going to find someone on Stack Overflow with the answer to your problem?

Research

Does the research community actively use the library to develop state-of-the-art deep learning models and solutions? A lot of state-of-the-art discoveries made by the academic community require modification to the deep learning libraries themselves and it’s pretty common for research groups to release their source code for conference papers to the public. Most of these new models will be released as pretrained models and listed in a Model Zoo specific to the library. Porting these solutions between libraries is not a trivial task if you are not comfortable reimplementing the research paper.

Performance

Performance with specific network structures

How fast does your planned network structure run on each of the deep learning libraries? Will you be able to train and prototype your models faster on one vs another? If you are deploying to a service, how many requests per second can you expect to run through the library?

Scalability

How well does the library scale when you start providing it with more resources to meet your production load? Can you save money by using a more efficient scaling library over another? (Cloud GPU instances can be really expensive)

The Libraries

Caffe
C++, Python, Matlab
UC Berkeley
Watches: 2,161 | Stars: 23,338
Forks: 14,247
Avg Issue Resolution: 61 Days
Open issues: 15%^*
No Keras support
Research Citations: 7081
Model zoo

Caffe, with its unparalleled performance and well-tested C++ codebase, was basically the first mainstream, production-grade deep learning library. Caffe is good for implementing CNNs, image processing, and for fine-tuning pre-trained nets. In fact, you can do all of these things while writing little to no code. You just place your training/validation data (mainly pictures) in a specific folder, set up config files for the deep net and its training parameters, and then call a precompiled Caffe binary that trains your net.

Being first to market means that a lot of early research and models were written with Caffe, and the research that built off of that forked and continued to use the same code base. Because of this, you will find a lot of state-of-the-art work, even to this day, still using Caffe despite its limitations. A lot of these models can be found in the Caffe Model Zoo, which is one of the first and largest (if not the largest) model zoos.

But now we have to start talking about its limitations. Caffe was built and designed around an original intended use case: conventional CNN applications. Because of this, Caffe is not very flexible. Overall, it’s not very good for RNNs and LSTM networks. Even with its adoption of CMake, building the library can still be a pain (especially for non-Linux environments). It has little support for multiple GPUs (training only) and can only be deployed to a server environment. The configuration files that define the deep net structure are very cumbersome. The prototxt for ResNet-152 is 6775 lines long!

In Caffe, the deep net is treated as a collection of layers, as opposed to nodes of single tensor operations. Layers can be thought of as a composition of multiple tensor operations. These layers are not very flexible and there are a lot of them that duplicate similar logic internally. Because Caffe does not support auto differentiation, if you want to develop new layer types, you have to define the full forward and backward gradient updates. You can define these layers in Caffe’s Python interface, but unlike other libraries where the Python interface is accelerated by their underlying C implementations, Caffe Python layers run in Python.
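
To make the cost of missing auto differentiation concrete, here is a minimal sketch in plain Python of a layer whose forward and backward passes are both written by hand. It does not use Caffe's actual layer API; the class, its methods, and the finite-difference helper are hypothetical names chosen for illustration.

```python
class ScaledReluLayer:
    """Hypothetical layer computing y = scale * max(0, x) element-wise."""

    def __init__(self, scale):
        self.scale = scale
        self.last_input = None

    def forward(self, xs):
        # Remember the input, because the backward pass needs it.
        self.last_input = list(xs)
        return [self.scale * max(0.0, x) for x in xs]

    def backward(self, grad_out):
        # Hand-derived chain rule: dy/dx is `scale` where x > 0, else 0.
        return [g * self.scale if x > 0 else 0.0
                for g, x in zip(grad_out, self.last_input)]


def numerical_grad(layer, xs, index, eps=1e-6):
    # Central finite difference of sum(forward(xs)) with respect to xs[index].
    up = list(xs)
    up[index] += eps
    down = list(xs)
    down[index] -= eps
    return (sum(layer.forward(up)) - sum(layer.forward(down))) / (2 * eps)


layer = ScaledReluLayer(scale=2.0)
inputs = [-1.0, 0.5, 3.0]
layer.forward(inputs)
analytic = layer.backward([1.0, 1.0, 1.0])
numeric = [numerical_grad(layer, inputs, i) for i in range(len(inputs))]
print(analytic)  # [0.0, 2.0, 2.0]
```

With auto differentiation, only `forward` would need to be written; the finite-difference check here is the kind of extra verification a hand-written `backward` tends to require.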

So should you use Caffe? If you are looking to reimplement some specific model from a research paper from 2015 using existing, open source code, it is not a bad library. If you are looking for raw performance and are not opposed to using a C++ library and API on a GPU server for your service/app, Caffe is still one of the fastest libraries around for fully connected networks.

But because of its limitations and technical debt, a lot of the community and its efforts have moved on from Caffe in some form or another. Caffe is a special case when it comes to model converters, in that it is the best supported library with converters to almost all other deep learning libraries making it easier to move your work off of it. The creator of Caffe has been hired by Google to work on their deep learning library TensorFlow, and now by Facebook to create a successor to Caffe in the appropriately named Caffe2.

Torch
Lua, C++
Deepmind, NYU, IDIAP
Watches: 675 | Stars: 7,761
Forks: 2,254
Avg Issue Resolution: 55 Days
Open issues: 33%^*
No Keras support
Research Citations: 955
Model zoo

PyTorch
Python
Facebook
Watches: 690 | Stars: 13,111
Forks: 2,795
Avg Issue Resolution: 2 Days
Open issues: 18%^*
No Keras support
Research Citations: 16
Model zoo

Torch and PyTorch are related by much more than just their name. Torch was one of the original, academic-created deep learning libraries. While it may not have as much research citing its use, it still has a very large community around it. Many of the researchers who originally worked on Torch moved to Facebook. Unsurprisingly, Facebook has since developed the successor to Torch in the form of PyTorch. PyTorch and Torch use the same underlying C libraries, TH, THC, THNN, and THCUNN, which provide them with very similar performance characteristics. When it comes to typical deep learning architectures, Torch offers some of the fastest, but not the fastest, performance around with GPU scaling efficiency that matches the best.

Where Torch and PyTorch differ is in their interface, API, and graphing paradigms. Torch was written with a Lua API, which can be a major barrier to entry for most people. While you can do research and development in Lua, it doesn’t have the massive community backing and vast open source libraries that Python does, so it can be quite limiting. Torch uses a static graph paradigm like Caffe’s at the time. Also like Caffe, it does not have any auto-differentiation capabilities, meaning that if you want to implement new tensor operations for your deep net you have to write the backwards gradient calculations. On the plus side, it has a pretty substantial model zoo of pre-trained models.

PyTorch was made with the goal of fixing or modernizing various issues with Torch, to create probably one of the best currently available libraries for doing research and development. PyTorch, as the name suggests, has a very well designed Python API. It supports both dynamic graph programming and auto differentiation, which makes it easy to debug and prototype with. PyTorch also has its own visualization dashboard called Visdom, which, while more limited than TensorBoard (more on this later), is still very helpful for development.

So should you use Torch or PyTorch? Specifically for research and development of new models, PyTorch is probably the best option at the moment. Even though PyTorch is still very new, most people in the deep learning field would agree that you should use it over classic Torch. That is not to say Torch does not have its advantages. Because of its age, it has a much larger backlog of research citing its use, and it is more stable than PyTorch, but both of these advantages will be lost over time. If you are looking for a library to deploy into any kind of production environment, then you should probably look elsewhere.

TensorFlow
Python, C++, Java, Go
Google
Watches: 7,632 | Stars: 93,376
Forks: 59,923
Avg Issue Resolution: 8 Days
Open issues: 16%^*
Works with Keras
Research Citations: 866
Model zoo

TensorFlow, without a doubt, is currently the biggest player in the deep learning field and for good reason. TensorFlow is Google’s attempt to build a single deep learning framework for everything deep learning related. There is very little that TensorFlow does not do well. Because it was created by Google, it was built with massive distributed computing in mind, but it also has mobile development capabilities in the form of TensorFlow Mobile and TensorFlow Lite. Its documentation is also considered one of the best. It covers the multiple API languages that TensorFlow supports, and if you consider the interfaces made by 3rd parties in the community, it even has APIs for C#, Haskell, Julia, Ruby, Rust, and Scala. Speaking of that community, TensorFlow has the largest community of any of the deep learning libraries and currently has the most research activity.

From the beginning, TensorFlow was made with a clear static graph API that was easy to use, but as interests and needs are changing in the machine learning field, it recently gained support for dynamic graph functionality in the form of TensorFlow Fold. TensorFlow has Keras support, making it very easy for beginners and even has its own custom version built into the Python API.

When Google first released TensorFlow, they also released TensorBoard, a data visualization tool created to help you understand the flow of tensors through your model for debugging, optimization, and just understanding the complex and confusing nature of deep learning models. You can use TensorBoard to visualize your TensorFlow model, plot summary metrics about the execution of your model, and show additional data like images that pass through it.

Now what about deploying your models once you have finished training them? Well, Google also has a solution for that in TensorFlow Serving, a flexible, high-performance serving system for ML models, designed for production environments. It comes in the form of modular C++ libraries, binaries, and Docker/Kubernetes containers that can be used as an RPC server or a set of libraries. There are even Google CloudML services set up with it to get your model up in production in no time. TensorFlow Serving’s main goal is to optimize for throughput with little to no down time. It includes a built-in scheduler that aims for the efficiency of mini-batching requests through the model and can manage multiple models at once running on shared hardware. Currently the API only supports prediction, but it will support regression, classification, and multi-inference soon.

Now TensorFlow is not perfect. Both Serving and Fold are still in their early days of development, so they might not be something you want to rely on. None of the APIs outside of the Python API are covered by the API stability promises. But the biggest issue with TensorFlow when compared to the other libraries is performance.

There is no real way to get around the issue; TensorFlow is just slower and more of a resource hog when compared to the other libraries. Looking at performance across your typical deep net architectures you can expect to see other libraries perform up to twice as fast as TensorFlow at similar batch sizes. You should avoid TensorFlow in general if you need performant Recurrent nets (RNNs) or Long Short Term Memory nets (LSTMs). TensorFlow is even the worst at scaling efficiency when compared to the other libraries despite its focus on distributed computing.

So should you use TensorFlow? We wouldn’t blame you if you did, and we would probably suggest it for 80% of the possible use cases out there, especially if you are new to the deep learning field and want to work with a library and ecosystem that has solutions for almost everything you could possibly need. But, if you are willing to put in the extra time and effort, you can find a much more performant and equally-featured experience with other libraries.

CNTK
Python, C#, C++, R
Microsoft
Watches: 1,334 | Stars: 14,057
Forks: 3,727
Avg Issue Resolution: 21 Days
Open issues: 12%^*
Works with Keras
Research Citations: 21
Model zoo

CNTK, the Microsoft Cognitive Toolkit, was originally created by MSR Speech researchers several years ago but has evolved into much more. It is a unified framework for building Deep nets, Recurrent nets (RNNs), Long Short Term Memory nets (LSTMs), Convolution nets (CNNs), and Deep Structured Semantic Models (DSSMs). It can pretty much work for all types of deep learning applications from speech/text to vision.

CNTK supports distributed training like TensorFlow and Torch. It even supports a proprietary, commercially-licensed, 1-bit Stochastic Gradient Descent algorithm that significantly improves distributed performance. Thanks to CNTK’s early focus on language models, it is 5-10 times faster than the other libraries when running dynamic network structures such as RNNs and LSTMs.

The biggest reason to use CNTK is if you or your company traditionally works with Microsoft software and products. CNTK is one of the few libraries to have first class support for running on Windows with additional support for running on Linux and NO support for OSX. It has direct support for deploying to a Microsoft Azure production environment and APIs that properly support Microsoft’s languages of choice. Its model zoo is even set up in a very “MSDN documentation” fashion.

The main downside to CNTK is that it lacks support from both the general research and software dev community. Microsoft may be using it internally for a lot of their services and probably has the reliability to support it, but it is just having trouble gaining market share (like many of Microsoft’s recent endeavors).

So should you use CNTK? If you are used to developing in Visual Studio and need an API for your .NET application, there probably is no better fit. But there are better options out there for most OSX/Linux devs with better all-around support. Also, if you are trying to do research and development that is not specific to LSTMs or RNNs, there are more appropriate libraries.

MXNet
Python, Scala, R, Julia, C++, Perl, Go, Javascript, Matlab
Apache, Amazon
Watches: 1,135 | Stars: 13,425
Forks: 4,950
Avg Issue Resolution: 53 Days
Open issues: 11%^*
Works with Keras
Research Citations: 319
Model zoo

MXNet is one of the newest players in the deep learning field but has been gaining ground fast. Originally created at the University of Washington and Carnegie Mellon University, it has been adopted by both The Apache Foundation and Amazon Web Services as their deep learning library of choice, and both have put their development efforts behind it.

MXNet supports almost all of the features that the other libraries support. It has the largest selection of officially supported languages for its APIs, and it can run on everything from a web browser or a mobile phone to a massive distributed server farm. In fact, Amazon has found that you can get up to an 85% scaling efficiency with MXNet. In most other cases, MXNet has some of the best performance when running with typical deep learning architectures.
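
As a rough sketch of what a scaling efficiency figure means, the snippet below uses the common definition: measured multi-GPU throughput divided by the ideal throughput of perfect linear scaling. The function name and the throughput numbers are made up for illustration and are not Amazon's measurements.

```python
def scaling_efficiency(single_gpu_throughput, n_gpus, multi_gpu_throughput):
    """Fraction of ideal linear speedup actually achieved on n_gpus."""
    ideal_throughput = n_gpus * single_gpu_throughput
    return multi_gpu_throughput / ideal_throughput


# Illustrative numbers only: 100 images/s on one GPU, 1,360 images/s on 16 GPUs.
efficiency = scaling_efficiency(100.0, 16, 1360.0)
print(f"{efficiency:.0%}")  # 85%
```

At 85% efficiency, 16 GPUs do the work of roughly 13.6 ideal ones, which is the number to weigh against what those 16 instances cost.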

MXNet supports both static graph programming and dynamic graph programming with the raw MXNet and Gluon APIs respectively. The Gluon API is also MXNet’s clear, concise, and simple API for deep learning created in collaboration with AWS and Microsoft in the same spirit as Keras, but MXNet also supports Keras if you prefer it. MXNet also has its own serving framework for getting your trained MXNet models into production and has extra support for running on AWS. It even has its own TensorBoard implementation that provides much of the same functionality as the TensorFlow equivalent.

MXNet does have notable weaknesses that make working with it a little more annoying. The documentation could be much better. The APIs went through a few changes before the first 1.0 release and the documentation reflects this, which can get a little confusing in some places. In terms of community support, it’s not the worst or the best, but somewhere in the middle. A notable number of people use it, including for research, and there are plenty of usage examples for different net types along with their model zoo.

So should you use MXNet? If you are willing to put the time in and deal with some of the pain points from it being a younger deep learning library, it is probably the best option for 80% of use cases along with TensorFlow. We would especially suggest it over TensorFlow if performance is a big concern of yours. If you are looking for the most flexible library to give you as many options as possible in your train-to-production pipeline with a native API for your production code, it’s probably the best option.

The Other Libraries

Now, the previous 6 deep learning libraries covered are by no means the only options available to you. They are just the biggest players and arguably the most relevant for 2018. There are many more to choose from that may better fit your specific needs (deployment destination, non-English documentation/community, hardware, etc.). We will try to briefly cover them here and provide a jumping-off point if you want to dig deeper into one of them.

Theano

Python API
University of Montreal
Future work on the project has stopped; may it rest in peace
Watches: 573, Stars: 8041, Forks: 2426, Median Issue Resolution: 12 days, Open issues: 19%^*
Research Citations: 1,080
Makes you do a lot of things from scratch, which leads to more verbose code.
Single GPU support only
Numerous open-source deep learning libraries have been built on top of Theano, including Keras, Lasagne and Blocks
No real reason to use over TensorFlow unless you are working with old code.

Caffe2

C++, Python APIs
Facebook
Watches: 552, Stars: 7631, Forks: 1821, Median Issue Resolution: 55 Days, Open issues: 33%^*
Caffe2 is Facebook’s second entry into the deep learning library ecosystem.
It is built with a focus more on mobile and industrial-strength production applications over development and research.
Where Caffe only supported single GPU training, Caffe2 is built to run utilizing both multiple GPUs on a single host and multiple hosts with single to multiple GPUs.

CoreML

Swift, Objective-C APIs
Apple
Closed source
Not a full DL library (you cannot use it to train models at the moment), but mainly focused on deploying pre-trained models optimized for Apple devices
- If you need to train your own model, you will need to use one of the above libraries
- Model converters available for Keras, Caffe, Scikit-learn, libSVM, XGBoost, MXNet, and TensorFlow

Paddle

Python API
Baidu
Watches: 558, Stars: 6580, Forks: 1756, Median Issue Resolution: 7 days, Open issues: 24%^*
One of the newest libraries available
Chinese documentation with an English translation
Has the potential to become a big player in the market

Neon

Python API
Intel
Watches: 351, Stars: 3437, Forks: 778, Median Issue Resolution Time: 28 days, Open issues: 16%^*
Written with Intel MKL accelerated hardware in mind (Intel Xeon and Phi processors)

Chainer

Python API
Preferred Networks
Watches: 310, Stars: 3595, Forks: 949, Median Issue Resolution Time: 31 days, Open issues: 13%^*
Research Citations: 207
Dynamic computation graph
Smaller company effort with a Japanese and English community

Deeplearning4j

Java, Scala APIs
Skymind
Watches: 792, Stars: 8527, Forks: 4120, Median Issue Resolution Time: 19 days, Open issues: 21%^*
Written with Java and the JVM in mind
Keras Support (Python API)
DL4J can take advantage of distributed computing frameworks including Hadoop and Apache Spark.
On multi-GPUs, it is equal to Caffe in performance.
Can import models from TensorFlow
Uses ND4J (Numpy for the JVM)

DyNet

C++, Python APIs
Carnegie Mellon University
Watches: 178, Stars: 2189, Forks: 527, Median Issue Resolution Time: 4 days, Open issues: 16%^*
Dynamic computation graph
Small user community

MatConvNet

Matlab APIs
Watches: 113, Stars: 959, Forks: 633, Median Issue Resolution Time: 96 days, Open issues: 53%^*
A MATLAB toolbox implementing Convolutional Neural Networks (CNNs) for computer vision applications

Darknet

Python, C APIs
Watches: 520, Stars: 6276, Forks: 3072, Median Issue Resolution Time: 55 days, Open issues: 78%^*
Very small open source effort with a laid back dev group
Not useful for production environments

Leaf

Rust API
autumnai
Watches: 195, Stars: 5229, Forks: 265, Median Issue Resolution Time: 131 days, Open issues: 58%^*
Support for the lib looks to be dead

TLDR

Choose either TensorFlow or MXNet for probably about 80% of use cases (TensorFlow if you prioritize community support and documentation, MXNet if you need performance). Look at PyTorch if you are mainly looking for something to develop/train new models. If you love Microsoft and are developing for a .NET environment in Windows and Visual Studio, try out CNTK. Look into CoreML if you just need to deploy models to Apple devices, and Deeplearning4j if you really like to keep things JVM focused.

^* Numbers taken at time of writing, expected to change.

R&D At Curalate: A Case Study of Deep Metric Embedding

Thu, 01 Feb 2018 10:11:36 +0000

At Curalate, we make social sell for hundreds of the world’s largest brands and retailers. Our Fanreel product is a good example of this; it empowers brands to collect, curate, and publish social user-generated photos to their e-commerce site. A vital step in this pipeline is connecting the user generated content (UGC) to the product on our client’s web site. Automating this process requires cutting edge computer vision techniques whose implementation details are not always clear, especially for production use cases. In this post, I review how we leveraged Curalate’s R&D principles to build a visual search engine that identifies which of our clients’ products are in user generated photos. The resulting system allows our clients to quickly connect user generated content to their e-comm site, enabling the UGC to generate revenue immediately upon distribution.

Step 1: Do Your Homework

We start every R&D project by hitting the books and catching up on the relevant research. This lets us understand what is feasible, the (rough) computational costs, and any pitfalls of various techniques. In this case, our goal is to find which products are in any UGC image using only the product images from the client’s e-comm site. This is extremely difficult: UGC photos have dramatic lighting conditions, generally contain multiple objects or clutter, and may have undergone non-rigid transformations (especially if it’s a garment). Knowing we had a difficult problem on our hands, we did an extensive literature review on papers from leading computer vision conferences, journals, and even arXiv to ensure we had a good understanding of the state of the art.

One approach stood out in the literature review: deep metric learning. Deep metric learning is a deep learning technique that learns an embedding function that, when applied to images of the same product, produces feature vectors that are close together in Euclidean space. This technique is perfect for our use case: we can train the system from existing pairs of UGC and product images in our platform to understand the complex transformations products undergo in UGC photos.

The figure above (from Song et al.) shows a t-SNE visualization of a learned embedding of the Stanford Online Products dataset. Notice that images of similar products are close together: wooden furniture zoomed in on the upper left, and bike parts on the lower right. Once we’ve learned this embedding function, identifying the products in a UGC image can be achieved by finding which embedding vectors from the client’s product photos are closest to that of the UGC.
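
The lookup described above can be sketched as a brute-force nearest-neighbor search. This is a minimal illustration in plain Python, not our production system; the product IDs and the two-dimensional vectors are made up, and real embeddings would come from the trained network and have many more dimensions.

```python
import math


def euclidean(u, v):
    # Straight-line distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_products(ugc_embedding, product_embeddings, k=1):
    """Return the IDs of the k products whose embeddings are closest to the UGC."""
    ranked = sorted(product_embeddings.items(),
                    key=lambda item: euclidean(ugc_embedding, item[1]))
    return [product_id for product_id, _ in ranked[:k]]


# Hypothetical catalog: product ID -> embedding of its product photo.
catalog = {
    "sku-chair": [0.9, 0.1],
    "sku-bike": [0.1, 0.8],
    "sku-lamp": [0.5, 0.5],
}
matches = nearest_products([0.2, 0.9], catalog, k=2)
print(matches)  # ['sku-bike', 'sku-lamp']
```

A linear scan like this is fine for a small catalog; larger ones would likely call for an approximate nearest-neighbor index instead.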

Most techniques for deep metric learning start with a deep convolutional neural network trained on ImageNet (i.e., a basenet), remove the final classification layer, add a new layer that performs a projection to the n-dimensional embedding space, and fine-tune it with an appropriate loss function. One highly cited work is FaceNet by Schroff et al., who propose a loss function that uses triplets of images. Each triplet contains an anchor image, a positive example that is the same class as the anchor, and a negative example that is a different class than the anchor image. Though more recent work has surpassed FaceNet, in the interest of speed (we are a startup!) we decided to take it for a spin since a TensorFlow implementation was available online.
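
For a single triplet, the loss pushes the anchor-positive distance to be smaller than the anchor-negative distance by at least a margin. Below is a minimal plain-Python sketch of that hinge formulation on squared Euclidean distances; the function names, the default margin, and the toy vectors are illustrative assumptions, and a real implementation would operate on batches inside the deep learning library.

```python
def squared_distance(u, v):
    # Squared Euclidean distance between two embedding vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is at least `margin` farther away than the positive."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


anchor = [0.0, 0.0]
positive = [0.1, 0.0]

# Easy triplet: the negative is already far away, so it contributes no loss.
easy = triplet_loss(anchor, positive, negative=[1.0, 0.0])

# Hard triplet: the negative is nearly as close as the positive.
hard = triplet_loss(anchor, positive, negative=[0.2, 0.0])

print(easy, round(hard, 2))  # 0.0 0.17
```

Because easy triplets contribute zero loss, how triplets are selected during training matters a great deal in practice.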

Step 2: Prototype and Experiment

The second phase of an R&D project at Curalate is the prototype phase. In this phase, we implement our chosen approach as fast as possible, and evaluate it on publicly available data as well as our own. As with many things in a startup, speed is key here. Specifically, we need answers as fast as possible so we know what we need to build. This phase is designed to answer the question: will it work and, if so, how well? In addition, this phase is when we experiment with different implementation details of techniques we wish to implement. Hyperparameter tuning, architecture components, and comparing different algorithms all occur in this phase of R&D.

The big question we want to answer for our deep metric embedding project is: which basenet should we use? The FaceNet paper used GoogLeNet inception models, but there have been many improvements since its publication. To compare different networks, we measure the performance of each on the Stanford Online Products dataset. We implemented FaceNet’s triplet loss in MXNet so we can easily swap out the underlying basenet.

We compared the following networks from the MXNet Model Zoo:

A secondary question we wished to answer with this experiment was how efficiently we could compute the embeddings. To explore this, we also evaluated two smaller, faster networks:

The figure above shows the recall-at-1 accuracy for all basenets. Not surprisingly, the more computationally expensive networks (i.e., Resnet-152 and SENet) have the highest accuracy. SENet, in particular, achieved a recall-at-1 of 71.6%, which is only two percentage points less than the current state of the art.
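
For readers unfamiliar with the metric, recall-at-1 asks, for each image, whether its single nearest neighbor in the embedding space (excluding itself) has the same class. The sketch below is a brute-force plain-Python version on made-up toy data, not the evaluation code we ran.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def recall_at_1(embeddings, labels):
    """Fraction of items whose nearest other item shares their label."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Nearest neighbor among all other items (the query itself is excluded).
        nearest = min((j for j in range(len(embeddings)) if j != i),
                      key=lambda j: squared_distance(query, embeddings[j]))
        if labels[nearest] == labels[i]:
            hits += 1
    return hits / len(embeddings)


# Toy data: the last "b" item sits closer to the "a" cluster, so it is a miss.
toy_embeddings = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.2, 0.1]]
toy_labels = ["a", "a", "b", "b", "b"]
score = recall_at_1(toy_embeddings, toy_labels)
print(score)  # 0.8
```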

One of the exciting results for us was SqueezeNet. Though it only achieved 60% accuracy, this network is extremely small (< 5MB) and computationally fast enough to run on a mobile phone. Thus we could sacrifice some accuracy for a huge savings in computational cost if we require it.

Step 3: Ship It

The final phase of an R&D project at Curalate is productization. In this phase, we leverage our findings from the prototype and literature phases to design and build a reliable and efficient production system. All code from the prototype phase is discarded or heavily refactored to be more efficient, testable, and maintainable. With deep learning systems, we also build a data pipeline for extracting, versioning, and snapshotting datasets from our current production systems.

For this project, we train the model on a P6000 GPU rented from Paperspace. We again use MXNet so the resulting model can be deployed directly to our production web services (which are written in Scala). We opted to use Resnet-152 as a basenet to get a high accuracy result, and deployed the learned network to g2.2xlarge instances on AWS.

The visual search system we built powers our Intelligent Product Tagging feature, which you can see in the video below. Using deep metric embedding, we vastly increased the accuracy of intelligent product tagging compared to non-embedded deep features.

Load Testing for Expected Increases in Traffic with Vegeta

Thu, 21 Dec 2017 13:00:00 +0000

At Curalate, our service and API traffic is fairly tightly coupled to e-commerce traffic, so any increase is reasonably predictable. We expect an increase in request rate towards the beginning of November each year, with traffic peaking at 10x our steady rate on Black Friday and Cyber Monday.

Why Load Test?

Curalate works directly with retail brands to drive traffic to their sites. The holiday shopping period is the most important time of the year for most of them, and we need to ensure that our experiences continue to operate at a high standard throughout.

More generally, though, load testing is critical for services and APIs, especially in cases where load is expected to increase. It uncovers potential points of failure during business hours and hopefully prevents people from needing to wake up at 2 a.m. on a weekend.

Creating a Test Plan

In cases of expected load increases, it’s important to understand as much as possible before diving into it. There are a few questions to ask:

Is there any data available so I can understand the expected load? Is it a yearly increase - are previous years a good indication? If it’s a brand new launch, what are the expectations?
What are the hard and soft dependencies of the service or API that I’m testing? What sort of caching is in place? Does a 10x increase on my service cause a 10x increase on everything downstream, as well?
Should we test against the active production environment, or is it feasible to spin up a staging environment with the same scaling behavior?
Depending on the breadth of dependencies, it may not be possible to spin up a new duplicated environment.
If I test against production, how can I ensure I don’t negatively affect live traffic?
Am I expecting an increase in load across services? If there are any core dependencies, what does the combined load look like at peak?
How much of a buffer do I provide against the expected peak?
Does my service have any rate limiting that I need to bypass or keep in mind? How do I simulate live traffic without being throttled?
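
Two of the questions above reduce to simple arithmetic, sketched below in Python. The function names, the 90% cache hit ratio, and the 20% buffer are illustrative assumptions rather than figures from our systems.

```python
def downstream_rps(incoming_rps, cache_hit_ratio, calls_per_request=1.0):
    """Requests per second that actually reach a dependency behind a cache."""
    # Only cache misses are forwarded downstream.
    return incoming_rps * calls_per_request * (1.0 - cache_hit_ratio)


def target_rps(expected_peak_rps, buffer_ratio):
    """Rate to test up to: the expected peak plus a safety buffer."""
    return expected_peak_rps * (1.0 + buffer_ratio)


peak = 10 * 100                        # 10x a hypothetical 100 RPS steady rate
cached = downstream_rps(peak, 0.9)     # dependency sees about 100 RPS
uncached = downstream_rps(peak, 0.0)   # worst case: every request is a miss
goal = target_rps(peak, 0.2)           # test to about 1200 RPS

print(round(cached), round(uncached), round(goal))  # 100 1000 1200
```

The gap between the cached and uncached numbers is why it is worth testing both a single hot target and many distinct targets.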

Getting Right to It

In our case, there were four main services that we were interested in testing against expected load, separated into on-site (APIs and services that are called directly from our clients’ sites) and off-site (our custom-built and Curalate-hosted services). This distinction works well for us, because we expected a 10x increase in on-site experiences, but only a 2-3x increase to off-site ones - brands focus on driving traffic to their own e-commerce site.

Now, there are many tools out there for load testing. For our purposes, I used Vegeta, for its robust set of options and extensibility. It was easy to script around to allow a steadily increasing request rate to either a single target or lazily generated targets. The output functionality is also well thought out. It supports top line latency stats along with some basic charting capabilities.

Let’s assume we had a service that we wanted to test up to 1000 RPS, both against a single target, and against multiple targets - to work around any caching in place.

1000 RPS Single and Multi-Target

The setup was fairly simple:

Spin up a couple of AWS EC2 m3.2xlarge instances.

SSH to the instances and create a load_testing folder, and fetch the Vegeta binary.

wget "https://github.com/tsenart/vegeta/releases/download/v6.3.0/vegeta-v6.3.0-linux-386.tar.gz"
tar -xzf vegeta-v6.3.0-linux-386.tar.gz

Put together a simple, quick script to handle steadily increasing the request rate, and then hold steady at the max rate.

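#!/bin/bash
# Arguments: "<target>" <maxRate> <rateInc> <incDuration> <startAt> <hitType>
# <target> is a Vegeta target line such as "GET http://<host>/<path>" (quote it).
target=$1
maxRate=$2
rateInc=$3
incDuration=$4
startAt=$5
currentRate=$startAt
hitType=$6

# Step the request rate up from startAt to maxRate in rateInc increments.
while [ "$currentRate" -le "$maxRate" ]
do
  outFile="reel-$maxRate-$currentRate-$hitType-test.bin"
  if [ "$currentRate" -eq "$maxRate" ]
  then
    # At the max rate, omit -duration so the attack runs until manually killed.
    echo "$target" | ./vegeta attack -rate="$currentRate" > "$outFile"
  else
    # Below the max rate, attack for incDuration, then move to the next rate.
    echo "$target" | ./vegeta attack -rate="$currentRate" -duration="$incDuration" > "$outFile"
  fi
  currentRate=$((currentRate+rateInc))
done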

Basically, if it hasn’t yet hit the max rate, run vegeta at the current rate for the specified duration, then increase the rate by the increment, and loop again. If the max rate is hit, don’t specify a duration - run until manually killed. The multi-targets script is similar, but reads from a targets.txt file.

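#!/bin/bash
# Arguments: <maxRate> <rateInc> <incDuration> <startAt> <hitType>
# Targets are read from targets.txt in the current directory.
maxRate=$1
rateInc=$2
incDuration=$3
startAt=$4
currentRate=$startAt
hitType=$5

# Step the request rate up from startAt to maxRate in rateInc increments.
while [ "$currentRate" -le "$maxRate" ]
do
  outFile="reel-$maxRate-$currentRate-$hitType-test.bin"
  if [ "$currentRate" -eq "$maxRate" ]
  then
    # At the max rate, omit -duration so the attack runs until manually killed.
    ./vegeta attack -rate="$currentRate" -targets=targets.txt > "$outFile"
  else
    # Below the max rate, attack for incDuration, then move to the next rate.
    ./vegeta attack -rate="$currentRate" -duration="$incDuration" -targets=targets.txt > "$outFile"
  fi
  currentRate=$((currentRate+rateInc))
done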

Aside: I was unable to get the -lazy flag to work properly with Vegeta, so I went with brute force and just generated a ton of targets to a file. I’m convinced it could have been more elegant, but sometimes the easy solution works just as well.
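
The brute-force target generation can be sketched in a few lines of Python. This is an illustration rather than the script I used: the base URL and the `cb` query parameter are hypothetical, and it assumes that a unique query string is enough to miss whatever caching sits in front of the service.

```python
import os
import tempfile


def build_targets(base_url, count, param="cb"):
    """One Vegeta target line ("METHOD URL") per request, each with a unique query value."""
    return [f"GET {base_url}?{param}={i}" for i in range(count)]


def write_targets(path, targets):
    # One target per line, ready to pass to vegeta via -targets.
    with open(path, "w") as handle:
        handle.write("\n".join(targets) + "\n")


# Hypothetical endpoint; replace with the service under test.
targets = build_targets("http://localhost:8080/api/items", 3)

# Written to a temporary directory here; a real run would write ./targets.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "targets.txt")
write_targets(demo_path, targets)
print(targets[0])  # GET http://localhost:8080/api/items?cb=0
```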

With the setup complete, it’s as simple as setting up whatever monitoring you want on a display or two, and firing off the scripts.

sh ./rate_increasing_multi.sh 1000 50 120s 50 uncached

This says to increase up to 1000 RPS, 50 at a time, for 2 minutes at each rate, starting at 50 RPS.
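
To see how long that ramp takes, the loop logic of the scripts can be mirrored in a few lines of Python. The function below is a sketch for working out the schedule, not part of the test tooling.

```python
def ramp_schedule(max_rate, rate_inc, inc_seconds, start_at):
    """List of (rate, duration) steps; a duration of None means "hold until killed"."""
    steps = []
    rate = start_at
    while rate <= max_rate:
        duration = None if rate == max_rate else inc_seconds
        steps.append((rate, duration))
        rate += rate_inc
    return steps


steps = ramp_schedule(max_rate=1000, rate_inc=50, inc_seconds=120, start_at=50)
ramp_seconds = sum(d for _, d in steps if d is not None)
print(len(steps), ramp_seconds / 60)  # 20 38.0
```

So the command above spends 38 minutes ramping through 19 rates before holding at 1000 RPS. One caveat the sketch makes visible: if the increments never land exactly on the max rate (for example, starting at 50 and stepping by 300), the loop ends without ever reaching the hold step.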

For each results file generated, running ./vegeta report -inputs "<results file>" (for example, reel-1000-250-uncached-test.bin) will output something like the following (this example is for 250 RPS):

Requests      [total, rate]            66177, 249.98
Duration      [total, attack, wait]    4m24.783548697s, 4m24.731999487s, 51.54921ms
Latencies     [mean, 50, 95, 99, max]  64.885905ms, 57.516245ms, 107.88721ms, 730.867162ms, 2.309337436s
Bytes In      [total, mean]            943011144, 14249.83
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:66177
Error Set:

Load Testing and Results

As different tests are kicked off and rates increase, it’s necessary to keep an eye on any monitoring dashboards, or alerts that may fire, and bail out of the test early. From there, logging should help in diagnosing what failed, and tickets can be filed each step of the way. After those issues are resolved, you can pick back up testing until you hit your goal, and maintain it for long enough to be comfortable with the test.

It should go without saying, but when testing against a live, production environment, it’s always nice to give the current on-call engineers a heads up, and keep them in the loop the entire way through.

As for Curalate’s load testing, on Cyber Monday we experienced record-breaking traffic numbers - even exceeding our 10x estimates slightly - to our services, and the on-call engineers slept soundly through Thanksgiving weekend.

Tracing High Volume Services

Tue, 26 Sep 2017 12:11:36 +0000

We like to think that building a service ecosystem is like stacking building blocks. You start with a function in your code. That function is hosted in a class. That class in a service. That service is hosted in a cluster. That cluster in a region. That region in a data center, etc. At each level there’s a myriad of challenges.

From the start, developers tend to use things like logging and metrics to debug their systems, but a certain class of problems crops up when you need to debug across services. From a debugging perspective, you’d like a higher-level projection of the system: a linearized view of what requests are doing. That is, you want to be able to see that service A called service B and service C called service D at the granularity of single requests.

Cross Service Logging

The simplest solution to this is to require that every call from service to service comes with some sort of trace identifier. Every incoming request into the system, whether from a public API, a client side request, or an async daemon invoked by a timer or schedule, generates a trace. This trace then gets propagated through the entire system. If you use this trace in all your log statements you can now correlate cross service calls.

How is this accomplished at Curalate? For the most part we use Finagle based services and the Twitter ecosystem has done a good job of providing the concept of a thread local TraceId and automatically propagating it to all other twitter-* components (yet another reason we like Finatra!).

All of our service clients automatically pull this thread local trace id out and populate a known HTTP header field that services then pick up and re-assume. For Finagle based clients this is auto-magick’d for you. For other clients that we use, like OkHttp, we had to add custom interceptors that pulled the trace from the thread local and set it on the request.

Here is an example of the header being sent automatically as part of Zipkin based headers (which we re-use as our internal trace identifiers):

Notice the X-B3-TraceId header. When a service receives this request it’ll re-assume the trace id and set its SLF4J MDC field of traceId to that value. We can now reference the trace id in our logback.xml configuration, as in our STDOUT log configuration below:

<appender name="STDOUT-COLOR" class="ch.qos.logback.core.ConsoleAppender">
    <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
        <level>TRACE</level>
    </filter>
    <encoder>
        <pattern>%yellow(%d) [%magenta(%X{traceId})] [%thread] %highlight(%-5level) %cyan(%logger{36}) %marker - %msg%n</pattern>
    </encoder>
</appender>

And we can also send the trace id as a structured JSON field to Loggly.

Let’s look at an example from our own logs:

What we’re seeing here is a system called media-api made a query to a system called networkinformationsvc. The underlying request carried a correlating trace id across the service boundaries and both systems logged to Loggly with the json.tid (transaction id) field populated. Now we can query our logs and get a linear time based view of what’s happening.
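The hand-off described in this section boils down to three steps: re-assume the incoming trace id (or start a new trace), keep it in request-scoped storage, and stamp it on every outgoing call and log line. Our services do this on the JVM with Finagle and OkHttp interceptors; the sketch below only illustrates the same flow in Node.js, using AsyncLocalStorage as a stand-in for the thread local. The function names and the generated id format are assumptions made for illustration.

```javascript
const { AsyncLocalStorage } = require("async_hooks");
const { randomBytes } = require("crypto");

// Stand-in for the thread-local TraceId (assumption: Node.js, not the JVM).
const traceStore = new AsyncLocalStorage();

// Server side: re-assume the caller's trace id, or start a new trace.
function withTrace(incomingHeaders, handler) {
  const traceId = incomingHeaders["x-b3-traceid"] || randomBytes(8).toString("hex");
  return traceStore.run({ traceId: traceId }, handler);
}

// Client side: what an interceptor would do before sending a request.
function outgoingHeaders(headers) {
  const store = traceStore.getStore();
  if (!store) return Object.assign({}, headers);
  return Object.assign({}, headers, { "X-B3-TraceId": store.traceId });
}

// Logging: every line carries the trace id, like %X{traceId} in logback.
function logLine(message) {
  const store = traceStore.getStore();
  return "[" + (store ? store.traceId : "-") + "] " + message;
}

const result = withTrace({ "x-b3-traceid": "463ac35c9f6413ad" }, function () {
  return {
    headers: outgoingHeaders({ Accept: "application/json" }),
    log: logLine("calling downstream")
  };
});

console.log(result.headers["X-B3-TraceId"]); // 463ac35c9f6413ad
console.log(result.log);                     // [463ac35c9f6413ad] calling downstream
```

Nothing inside the handler had to pass the trace id along explicitly, which is the property we want from the real clients as well.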

Thread local tracing

The trick here is to make sure that this implicit trace id that is pinned to the thread local of the initiating request properly moves from thread to thread as you make async calls. We don’t want anyone to ever have to remember to set the trace. It should just gracefully flow from thread to thread implicitly.

To make sure that traces hop properly between systems we had to enforce that everybody uses an ExecutionContext that safely captures the caller’s thread locals before executing. This is critical, otherwise you can make an async call and the trace id gets dropped. In that case, bye bye go the logs! It’s hyper important to always take an execution context and to never pin an execution context when it comes to async Scala code. Thankfully, we can make any execution context safe by wrapping it up in a delegate:

import com.twitter.util.Local
import scala.concurrent.ExecutionContext

/**
 * Wrapper around an existing ExecutionContext that makes it propagate MDC information.
 */
class PropagatingExecutionContextWrapper(wrapped: ExecutionContext)
  extends ProvidedExecutionContext { self =>

   override def prepare(): ExecutionContext = new ExecutionContext {
     // Save the call-site state
     private val context = Local.save()

     def execute(r: Runnable): Unit = self.execute(new Runnable {
       def run(): Unit = {
         // re-assume the captured call site thread locals
         Local.let(context) {
           r.run()
         }
       }
     })

     def reportFailure(t: Throwable): Unit = self.reportFailure(t)
   }

  override def execute(r: Runnable): Unit = wrapped.execute(r)

  override def reportFailure(t: Throwable): Unit = wrapped.reportFailure(t)
}

class TwitterExecutionContextProvider extends ExecutionContextProvider {
  /**
   * Safely wrap any execution context into one that properly passes context
   *
   * @param executionContext
   * @return
   */
  override def of(executionContext: ExecutionContext) = new PropagatingExecutionContextWrapper(executionContext)
}

We’ve taken this trace wrapping concept and applied it to all kinds of executors, like ExecutorService and ScheduledExecutorService. Technically we don’t really want to expose the internals of how we wrap traces, so we load an ExecutionContextProvider via the Java service loading mechanism and provide an API contract so that people can wrap executors without caring how they are wrapped:

import java.util.ServiceLoader
import scala.collection.JavaConverters._

/**
 * A provider that loads from the java service mechanism
 */
object ExecutionContextProvider {
  lazy val provider: ExecutionContextProvider = {
    Option(ServiceLoader.load(classOf[ExecutionContextProvider])).
      map(_.asScala).
      getOrElse(Nil).
      headOption.
      getOrElse(throw new MissingExecutionContextException)
  }
}

/**
 * Marker interfaces to provide contexts with custom logic. This
 * forces users to make sure to use the execution context providers that support request tracing
 * and maybe other tooling
 */
trait ProvidedExecutionContext extends ExecutionContext

/**
 * A context provider contract
 */
trait ExecutionContextProvider {
  def of(context: ExecutionContext): ProvidedExecutionContext

  ...
}

From a caller’s perspective they now do:

implicit val execContext = ExecutionContextProvider.provider.of(scala.concurrent.ExecutionContext.Implicits.global)

Which would wrap, in this example, the default Scala context.

Service to Service dependency and performance tracing

Well that’s great! We have a way to safely and easily pass trace ids, and we’ve tooled through our clients to all pass this trace id automatically, but this only gives us logging information. We’d really like to be able to leverage the trace information to get more interesting statistics such as service to service dependencies, performance across service hops, etc. Correlated logs are just the beginning of what we can do.

Zipkin is an open source tool that we’ve discussed here before so we won’t go too much into it, but suffice it to say that Zipkin hinges on us having proper trace identifiers. It samples incoming requests to determine whether they should be traced (i.e. sent to Zipkin). By default, we have all our services send 0.1% of their requests to Zipkin to minimize impact on the service.

Let’s look at an example:

In this Zipkin trace we can see that this batch call made a call to Dynamo. The whole call took 6 milliseconds and 4 of those milliseconds were spent calling Dynamo. We’ve tooled through all our external client dependencies with Zipkin trace information automatically using Java dynamic proxies, so that as we upgrade our external dependencies we get tracing on new functions as well.

If we dig further into the trace:

We can now see (highlighted) the trace ID and search our logs for entries related to this trace.

Finding needles in the haystack

We have a way to correlate logs, and get sampled performance and dependency information between services via Zipkin. What we still can’t do yet is trace an individual piece of data flowing through high volume queues and streams.

Some of our services at Curalate process 5 to 10 thousand items a second. It’s just not fiscally prudent to log all that information to Loggly or emit unique metrics to our metrics system (DataDog). Still, we want to know at the event level where things are in the system, where they passed through, where they got dropped, etc. We want to answer the question:

Where is identifier XYZ.123 in the system and where did it go and come from?

This is difficult to answer with the current tools we’ve discussed.

To solve this problem we have one more system in play. This is our high volume auditing system that lets us write and filter audit events at a large scale (100k req/s+). The basic architecture here is we have services write audit events via an Audit API which are funneled to Kinesis Firehose. The firehose stream buffers data for either 5 minutes or 128 MB (whichever comes first). When the buffer limit is reached, firehose dumps newline separated JSON in a flat file into S3. We have a lambda function that waits for S3 create events on the bucket, reads the JSON, then transforms the JSON events into Parquet which is an efficient columnar storage format. The Parquet file is written back into S3 into a new folder with the naming scheme of

year=YYYY/month=MM/day=DD/hour=HH/minute=mm/<uuid>.parquet

Where the minutes are grouped in 5 minute intervals. This partition is then added to Athena, a managed query service built on Presto that lets you query large datasets in S3.
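As an illustration of the naming scheme, here is a minimal sketch of how an event timestamp could map to its partition folder. It assumes UTC timestamps, zero-padded fields, and that each 5 minute group is labeled by the minute that starts it; partitionPrefix is a hypothetical name, not the actual code in our Lambda.

```javascript
// Zero-pad a number to two digits, e.g. 9 -> "09".
function pad2(n) {
  return String(n).padStart(2, "0");
}

// Hypothetical helper: map an event time to its partition folder.
function partitionPrefix(date) {
  // Assumption: minutes are floored to the start of their 5 minute group.
  const minuteBucket = Math.floor(date.getUTCMinutes() / 5) * 5;
  return [
    "year=" + date.getUTCFullYear(),
    "month=" + pad2(date.getUTCMonth() + 1), // getUTCMonth() is zero-based
    "day=" + pad2(date.getUTCDate()),
    "hour=" + pad2(date.getUTCHours()),
    "minute=" + pad2(minuteBucket)
  ].join("/");
}

// An event at 22:13:42 UTC lands in the 22:10 group.
const prefix = partitionPrefix(new Date(Date.UTC(2017, 8, 18, 22, 13, 42)));
console.log(prefix); // year=2017/month=09/day=18/hour=22/minute=10
```

Because the time fields are part of the folder path, a query that pins day and hour only has to touch the files under those folders.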

What does this have to do with trace ids? Each event emitted comes with a trace id that we can use to query back to logs or Zipkin or other correlating identifiers. This means that even if services aren’t logging to Loggly due to volume restrictions, we can still see how events trace through the system. Let’s look at an example where we find a specific network identifier from Instagram and see when it was data mined and when we added semantic image tags to it (via our vision APIs):

SELECT minute, app, message, timestamp, context
FROM curalateauditevents."audit_events"
WHERE context['network_id'] = '1584258444344170009_249075471' and context['network']='instagram'
and day=18 and hour=22
order by timestamp desc
limit 100

This is the Athena query. We’ve included the specific network ID and network we are looking for, as well as a limited partition scope.

Notice the two highlights.

Starting at the second highlight there is a message that we augmented the piece of data. In our particular pipe we only augment data under specific circumstances (not every image is analyzed) and so it was important to see that some images were dropped and this one was augmented. Now we can definitely say “yes, item ABC was augmented but item DEF was not and here is why”. Awesome.

Moving upwards, the first highlight is how much data was scanned. This particular partition we looked through has 100MB of data, but we only searched through 2MB to find what we wanted (this is due to the optimization of Parquet). Athena is priced by how much data you scan, at a cost of $5 per terabyte, so this query was pretty much free at a cost of roughly $0.00001. The total set of files across all the partitions for the past week is roughly 21GB spanning about 3.5B records. So even if we queried all the data, we’d only pay about $0.10. In fact, the biggest cost here isn’t in storage or query or Lambda, it’s in Firehose! Firehose charges you $0.029 per GB transferred. At this rate we pay 60 cents a week. The boss is going to be ok with that.

However, there are still some issues here. Remember the target scale is upwards of 100k req/s. At that scale we’re dealing with a LOT of data through Kinesis Firehose. That’s a lot of data into S3, a lot of IO reads to transform to Parquet, and a lot of opportunities to accidentally scan through tons of data in our Athena partitions with poorly written queries that loop over repeated data (even though we limit partitions to a 2 week TTL). We also now run into rate limiting with Kinesis Firehose.

On top of that, some services just pump so much repeated data that it’s not worth seeing it all the time. To that end we need some sort of way to do live filters on the streams. What we’ve done to solve this problem is leverage dynamically invoked Nashorn JavaScript filters. We load up filters from a known remote location at an interval of 30 seconds, and if a service is marked for filtering (i.e. it has a really high load and needs to be filtered) then it’ll run all of its audit events through the filter before they actually get sent to the downstream firehose. If an event fails the filter it’s discarded. If it passes, the event is annotated with which filter name it passed through and sent through the stream.

Filters are just YML files for us:

name: "Filter name"
expiration: <Optional DateTime. Epoch or string datetime of ISO formats parseable by JODA>
js: |
    function filter(event) {
        // javascript that returns a boolean
    }

And an example filter may look like

name: "anton_client_filter"
js: |
    function filter(event) {
      var client = event.context.get("client_id")

      return client != null && client == "3136"
    }

In this filter only events that are marked with the client id of my client will pass through. Some systems don’t need to be filtered so all their events pass through anyway.

Now we can write queries like

SELECT minute, app, message, timestamp, context
FROM curalateauditevents."audit_events"
WHERE contains(trace_names, 'anton_client_filter')
and day=18 and hour=22
limit 100

To get events that were tagged with my filter in the current partition. From there, we now can do other exploratory queries to find related data (either by trace id or by other identifiers related to the data we care about).
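The filter step itself is small. Below is a minimal sketch of the pass, discard, and annotate logic described above, written in plain JavaScript rather than our JVM code. applyFilters and the event shape are assumptions for illustration: context is modeled as a Map so that event.context.get(...) behaves as in the filters above, trace_names is taken from the query, and an event is kept if any filter matches it.

```javascript
// A filter as it might look after loading the YML: a name plus a function.
const antonClientFilter = {
  name: "anton_client_filter",
  filter: function (event) {
    var client = event.context.get("client_id");
    return client != null && client == "3136";
  }
};

// Hypothetical helper: returns the annotated event, or null if it is discarded.
function applyFilters(event, filters) {
  const passed = [];
  for (const f of filters) {
    try {
      if (f.filter(event) === true) passed.push(f.name);
    } catch (err) {
      // Assumption: a filter that throws counts as "did not match".
    }
  }
  if (passed.length === 0) return null;
  // Annotate a copy of the event with the names of the filters it passed.
  return Object.assign({}, event, { trace_names: passed });
}

const kept = applyFilters(
  { message: "mined", context: new Map([["client_id", "3136"]]) },
  [antonClientFilter]
);
const dropped = applyFilters(
  { message: "mined", context: new Map([["client_id", "42"]]) },
  [antonClientFilter]
);

console.log(kept.trace_names.join(",")); // anton_client_filter
console.log(dropped);                    // null
```

Treating a throwing filter as a non-match keeps a bad upload from taking down the calling service, at the cost of silently dropping events until someone notices.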

Let’s look at some graphs that show how dramatic this filtering can be

Here the purple line is one of our data mining ingestion endpoints. It’s pumping a lot of data to firehose, most of which is repeated over time and so isn’t super useful to get all the input from. The moment the graph drops is when the yml file was uploaded with a filter to add filtering to the service. The blue line is a downstream service that gets data after debouncing and other processing. Given its load is a lot less we don’t care so much that it is sending all its data downstream. You can see the purple line slow to a trickle later on when the filter kicks in and data starts matching it.

Caveats with Nashorn

While building out the system we ran into a few interesting caveats around using Nashorn in a high volume pipeline like this.

The first was that subtle differences in JavaScript can have massive performance impacts. Let’s look at some examples and benchmark them.

function filter(event) {
  var anton = {
    "136742": true,
    "153353": true
  }

  var mineable = event.context.get("mineable_id")

  return mineable != null && anton[mineable]
}

The JMH benchmarks of running this code are:

[info] FiltersBenchmark.testInvoke  thrpt   20     1027.409 ±      29.922  ops/s
[info] FiltersBenchmark.testInvoke   avgt   20  1484234.075 ± 1783689.007  ns/op

What?? Only about 1,000 ops/second.

Let’s make some adjustments to the filter, given our internal system loads the javascript into an isolated scope per filter and then re-invokes just the function filter each time (letting us safely create global objects and pay heavy prices for things once):

var anton = {
  "136742": true,
  "153353": true
}

function filter(event) {
  var mineable = event.context.get("mineable_id")

  return mineable != null && anton[mineable]
}

[info] FiltersBenchmark.testInvoke  thrpt   20  7391161.402 ± 206020.703  ops/s
[info] FiltersBenchmark.testInvoke   avgt   20    14879.890 ±   8087.179  ns/op

Ah, much better! Roughly 7.4M ops/sec by the throughput score.

If we use Java constructs:

function filter(event) {
  var anton = new java.util.HashSet();
  anton.add("136742")
  anton.add("153353")

  var mineable = event.context.get("mineable_id")

  return mineable != null && anton.contains(mineable)
}

[info] FiltersBenchmark.testInvoke  thrpt   20  5662799.317 ± 301113.837  ops/s
[info] FiltersBenchmark.testInvoke   avgt   20    41963.710 ±  11349.277  ns/op

Also fast! Roughly 5.7M ops/sec by the throughput score, even with the HashSet rebuilt on every call.

Something is clearly up with the anonymous object creation in Nashorn. Needless to say, benchmarking is important, especially when these filters are going to be dynamically injected into every single service we have. We need them to be performant, sandboxed, and safe to fail.

For that we make sure everything runs in its own engine scope, in a separate execution context isolated from the main running code, and is fired off asynchronously so as not to block the main calling thread. This is also where we have monitoring and alerting on when someone uploads a non-performant filter so we can investigate and mitigate quickly.

For example, the discovery of the poorly performing json object came from this alert:

Conclusion

Tracing is hard and it’s incredibly difficult to tool through after the fact if you start to build service architectures without this in mind from the get go. Tooling trace identifiers through the system from the beginning sets you up for success in building more interesting debugging infrastructure that isn’t always possible without that. When building larger service ecosystems it’s important to keep in mind how to inspect things at varying granularity levels. Sometimes building custom tools to help inspect the systems is worth the effort, especially if they help debug complicated escalations or data inconsistencies.
