<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Curalate Engineering Blog</title>
    <description>A blog about developing at &lt;a href=&quot;http://curalate.com&quot;&gt;Curalate&lt;/a&gt;. How we handle big data architecture, design for the consumer web, and help our customers get the most out of their imagery.
</description>
    <link>http://engineering.curalate.com/</link>
    <atom:link href="http://engineering.curalate.com/feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Thu, 30 May 2019 14:26:53 +0000</pubDate>
    <lastBuildDate>Thu, 30 May 2019 14:26:53 +0000</lastBuildDate>
    <generator>Jekyll v3.8.5</generator>
    
      <item>
        <title>Safely Modifying Your Hosts File with Gas Mask</title>
        <description>&lt;p&gt;Sometimes a specific domain needs to resolve to a different address on your machine. At Curalate, we test changes to microservices locally before shipping them, which can require redirecting requests to that local instance. One way to do this is by adding an entry like &lt;code class=&quot;highlighter-rouge&quot;&gt;127.0.0.1 some.service.curalate.com&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/hosts&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;why-use-a-hosts-file-manager&quot;&gt;Why use a hosts file manager?&lt;/h3&gt;
&lt;p&gt;In most cases, it’s not advised to directly modify &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/hosts&lt;/code&gt;. Because it’s buried deep in the filesystem, it’s easy to forget you’ve modified it, which can lead to problems ranging from annoying to dangerous.
Also, danger aside, it can begin to get messy and complex if you have a lot of entries to manage. Think of even just fifteen lines you’re constantly commenting/uncommenting to represent the configuration you need at a given moment. This would be insanity.
Gas Mask, a simple UI-based hosts file manager for macOS, lets you set up multiple hosts files and makes it obvious, via the menu bar, which one is currently active on your system.&lt;/p&gt;

&lt;h3 id=&quot;installation-instructions&quot;&gt;Installation instructions&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Go to &lt;a href=&quot;https://github.com/2ndalpha/gasmask&quot;&gt;https://github.com/2ndalpha/gasmask&lt;/a&gt; and download the latest version.&lt;/li&gt;
  &lt;li&gt;Unpack and install.&lt;/li&gt;
  &lt;li&gt;On first run, the only hosts file listed will be &lt;code class=&quot;highlighter-rouge&quot;&gt;Original File&lt;/code&gt;, which is the &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/hosts&lt;/code&gt; file you’ll no longer be modifying directly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;creating-your-first-host-file&quot;&gt;Creating your first hosts file&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Create a new hosts file and name it something that makes sense.&lt;/li&gt;
  &lt;li&gt;Add the test entry &lt;code class=&quot;highlighter-rouge&quot;&gt;127.0.0.1 google.com&lt;/code&gt; and save. The format of these entries is &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;target IP address&amp;gt; &amp;lt;hostname to redirect&amp;gt;&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Activate that hosts file. Gas Mask swaps it in as &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/hosts&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;You may need to flush your DNS cache or restart the browser for the change to take effect.&lt;/li&gt;
  &lt;li&gt;To test it out, go to google.com in the browser.&lt;/li&gt;
  &lt;li&gt;When the browser asks for the IP of google.com, the OS finds the matching entry in your hosts file and returns &lt;code class=&quot;highlighter-rouge&quot;&gt;127.0.0.1&lt;/code&gt; (your local computer), so the request goes to your own machine and fails unless something is listening there.&lt;/li&gt;
  &lt;li&gt;Go ahead and reactivate your Original File, restart the browser, and you should be able to access google.com as expected.&lt;/li&gt;
&lt;/ul&gt;
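
&lt;p&gt;To check which address a given hosts file maps a name to, a small shell function can scan the file directly. This is a minimal sketch, not part of Gas Mask: &lt;code class=&quot;highlighter-rouge&quot;&gt;lookup_host&lt;/code&gt; is a hypothetical helper, and it only reads the file you pass in, so it says nothing about what the OS resolver or a DNS cache currently holds.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/bin/bash

# Hypothetical helper: print the IP that a hosts-format file maps a hostname to.
# usage: lookup_host &amp;lt;hosts file&amp;gt; &amp;lt;hostname&amp;gt;
lookup_host() {
  awk -v name=&quot;$2&quot; '
    /^[[:space:]]*#/ { next }          # skip comment lines
    {
      sub(/#.*/, &quot;&quot;)                   # drop trailing comments
      for (i = 2; i &amp;lt;= NF; i++)        # fields after the IP are hostnames
        if ($i == name) { print $1; exit }
    }
  ' &quot;$1&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For example, &lt;code class=&quot;highlighter-rouge&quot;&gt;lookup_host /etc/hosts google.com&lt;/code&gt; prints &lt;code class=&quot;highlighter-rouge&quot;&gt;127.0.0.1&lt;/code&gt; while the test hosts file from the steps above is active.&lt;/p&gt;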

&lt;h3 id=&quot;use-the-menubar-icon&quot;&gt;Use the menu bar icon&lt;/h3&gt;
&lt;p&gt;It’s easy to forget to flip your hosts file back. You’ll end up spending 45 minutes on what you think is a bug that doesn’t reproduce for anyone else, only to realize your hosts file is sending requests someplace else. The menu bar icon shows which hosts file is active, so check it first.&lt;/p&gt;

&lt;p&gt;Next time you reach to modify that hosts file, consider integrating Gas Mask into your development workflow to keep it maintained and safe from unintended state.&lt;/p&gt;
</description>
        <pubDate>Thu, 30 May 2019 00:00:00 +0000</pubDate>
        <link>http://engineering.curalate.com/2019/05/30/gas-mask.html</link>
        <guid isPermaLink="true">http://engineering.curalate.com/2019/05/30/gas-mask.html</guid>
        
        
      </item>
    
      <item>
        <title>How to Set Up a Scheduled Scala Spark Job</title>
        <description>&lt;p&gt;Have you written a Scala Spark job that processes a massive amount of data on an intimidating amount of RAM and you want to run it daily/weekly/monthly on a schedule on &lt;a href=&quot;https://aws.amazon.com&quot;&gt;AWS&lt;/a&gt;?
I had to do this recently, and couldn’t find a good tutorial on the full process of getting the Spark job running.
Included in this article and accompanying repository is everything you need to get your Scala Spark job running on AWS &lt;a href=&quot;https://aws.amazon.com/datapipeline/&quot;&gt;Data Pipeline&lt;/a&gt; and &lt;a href=&quot;https://aws.amazon.com/emr/&quot;&gt;EMR&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;code-repo&quot;&gt;Code Repo&lt;/h2&gt;

&lt;p&gt;This tutorial is not going to walk you through the process of actually writing your specific Scala Spark job to do whatever number crunching you need. 
There are already plenty of resources available (&lt;a href=&quot;https://www.analyticsvidhya.com/blog/2017/01/scala/&quot;&gt;1&lt;/a&gt;, &lt;a href=&quot;https://spark.apache.org/docs/latest/quick-start.html&quot;&gt;2&lt;/a&gt;, &lt;a href=&quot;https://www.coursera.org/learn/scala-spark-big-data&quot;&gt;3&lt;/a&gt;) to get you started on that. 
The code template for setting up a Spark Scala job is available in &lt;a href=&quot;https://github.com/jessebrizzi/spark-fat-jar-template&quot;&gt;this GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Assuming that you have already written your Spark Job and are only using the &lt;a href=&quot;https://aws.amazon.com/sdk-for-java/&quot;&gt;AWS Java SDK&lt;/a&gt; to connect to your AWS data stores, drop your code in the &lt;a href=&quot;https://github.com/jessebrizzi/spark-fat-jar-template/blob/master/src/main/scala/SparkJob.scala#L4&quot;&gt;Main&lt;/a&gt; function of &lt;a href=&quot;https://github.com/jessebrizzi/spark-fat-jar-template/blob/master/src/main/scala/SparkJob.scala&quot;&gt;SparkJob.scala&lt;/a&gt; and run the &lt;a href=&quot;https://github.com/jessebrizzi/spark-fat-jar-template/blob/master/deploy.sh&quot;&gt;deploy.sh&lt;/a&gt; script to upload the fat jar to your S3 bucket.&lt;/p&gt;

&lt;p&gt;If you do take other dependencies, then it may take some extra work on your part. 
To run a Scala Spark job on AWS you need to compile a &lt;a href=&quot;https://stackoverflow.com/questions/19150811/what-is-a-fat-jar&quot;&gt;fat jar&lt;/a&gt; that contains the byte code for your job and all of the libraries it needs to run.
This project already has the &lt;a href=&quot;https://github.com/sbt/sbt-assembly&quot;&gt;sbt-assembly&lt;/a&gt; plugin set up, along with an &lt;a href=&quot;https://github.com/jessebrizzi/spark-fat-jar-template/blob/master/build.sbt#L13&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;assemblyMergeStrategy&lt;/code&gt;&lt;/a&gt; that packages Spark, Hadoop, and the AWS SDK together in the fat jar.
If you need to add other libraries that do not play well with each other, or are using a version of Spark that is incompatible with this repo, there are a few good &lt;a href=&quot;http://queirozf.com/entries/creating-scala-fat-jars-for-spark-on-sbt-with-sbt-assembly-plugin&quot;&gt;resources&lt;/a&gt; available to help you through the needed &lt;code class=&quot;highlighter-rouge&quot;&gt;build.sbt&lt;/code&gt; modifications.&lt;/p&gt;

&lt;p&gt;Beyond those changes, you need to set a few parameters in the &lt;a href=&quot;https://github.com/jessebrizzi/spark-fat-jar-template/blob/master/deploy.sh&quot;&gt;deploy.sh&lt;/a&gt; script:
set the &lt;a href=&quot;https://github.com/jessebrizzi/spark-fat-jar-template/blob/master/deploy.sh#L2&quot;&gt;deploymentPath&lt;/a&gt; to your S3 bucket, add a profile to the &lt;a href=&quot;https://github.com/jessebrizzi/spark-fat-jar-template/blob/master/deploy.sh#L14&quot;&gt;AWS CLI command&lt;/a&gt; if the bucket is private, and change the resulting fat jar &lt;a href=&quot;https://github.com/jessebrizzi/spark-fat-jar-template/blob/master/deploy.sh#L3&quot;&gt;name&lt;/a&gt; if you like.
The deploy script uses the &lt;a href=&quot;https://aws.amazon.com/cli/&quot;&gt;AWS CLI&lt;/a&gt; to upload the fat jar to S3, so if you do not have it installed and configured you will need to go through the &lt;a href=&quot;https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html&quot;&gt;steps to do that&lt;/a&gt; before using the &lt;code class=&quot;highlighter-rouge&quot;&gt;deploy.sh&lt;/code&gt; script.&lt;/p&gt;
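
&lt;p&gt;For orientation, a deploy script of this kind usually boils down to two steps: build the fat jar with sbt-assembly, then copy it to S3 with the AWS CLI. The sketch below is not the repo’s &lt;code class=&quot;highlighter-rouge&quot;&gt;deploy.sh&lt;/code&gt;. The bucket, jar name, Scala version directory, and profile are placeholder assumptions, and &lt;code class=&quot;highlighter-rouge&quot;&gt;run&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;deploy&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/bin/bash

# All values below are placeholders (assumptions), not taken from the repo.
deploymentPath=&quot;s3://my-bucket/spark-jobs&quot;     # hypothetical bucket and prefix
jarName=&quot;spark-job-assembly.jar&quot;               # hypothetical fat jar name
jarPath=&quot;target/scala-2.11/$jarName&quot;           # depends on your Scala version
awsProfile=&quot;default&quot;                           # AWS CLI profile to use

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ &quot;${DRY_RUN:-0}&quot; = &quot;1&quot; ]; then echo &quot;$*&quot;; else &quot;$@&quot;; fi
}

deploy() {
  run sbt assembly || return 1                 # build the fat jar
  run aws s3 cp &quot;$jarPath&quot; &quot;$deploymentPath/$jarName&quot; --profile &quot;$awsProfile&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Calling &lt;code class=&quot;highlighter-rouge&quot;&gt;DRY_RUN=1 deploy&lt;/code&gt; prints the two commands without running them, which is a cheap way to check the paths before uploading anything.&lt;/p&gt;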

&lt;h2 id=&quot;setting-up-the-job-on-aws&quot;&gt;Setting Up The Job On AWS&lt;/h2&gt;

&lt;p&gt;Once your fat jar is uploaded to S3 it’s time to set up the scheduled job on AWS Data Pipeline.&lt;/p&gt;

&lt;h4 id=&quot;create-aws-data-pipeline&quot;&gt;Create AWS Data Pipeline&lt;/h4&gt;

&lt;p&gt;Log in to AWS, navigate to the &lt;a href=&quot;https://console.aws.amazon.com/datapipeline&quot;&gt;AWS Data Pipeline&lt;/a&gt; console, and click the &lt;code class=&quot;highlighter-rouge&quot;&gt;Create new pipeline&lt;/code&gt; button.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/o9omjFr.png&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;load-the-spark-job-template-definition&quot;&gt;Load The Spark Job Template Definition&lt;/h4&gt;

&lt;p&gt;Add a name for your pipeline and select the &lt;code class=&quot;highlighter-rouge&quot;&gt;Import a definition&lt;/code&gt; source option. 
The &lt;a href=&quot;https://github.com/jessebrizzi/spark-fat-jar-template&quot;&gt;GitHub repo&lt;/a&gt; for the Spark template includes a &lt;a href=&quot;https://github.com/jessebrizzi/spark-fat-jar-template/blob/master/datapipeline.json&quot;&gt;datapipeline.json&lt;/a&gt; file that you can import. It contains a pre-defined data pipeline for your Spark job that should simplify the setup process.
The definition contains the node configuration to run the pipeline on a schedule and notify you via an &lt;a href=&quot;https://aws.amazon.com/sns/&quot;&gt;AWS SNS&lt;/a&gt; alarm when a run succeeds or fails.
All of your needed configuration options have been parameterized to simplify this process.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/K1EXQRx.png&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;set-your-parameters&quot;&gt;Set Your Parameters&lt;/h4&gt;

&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/6JqVHi3.png&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Set the min parameter to the &lt;code class=&quot;highlighter-rouge&quot;&gt;EMR step(s)&lt;/code&gt; option.
Update the S3 path and fat jar name to point to the fat jar uploaded earlier with the &lt;code class=&quot;highlighter-rouge&quot;&gt;deploy.sh&lt;/code&gt; script.
Set &lt;code class=&quot;highlighter-rouge&quot;&gt;EC2KeyPair&lt;/code&gt; so that the data pipeline has access to your EC2 instances.
Select the master node instance type and the number and type of core nodes for the Spark cluster.
Finally, set the ARN for both the success and failure notifications so you will know if something goes wrong with your scheduled job.&lt;/p&gt;

&lt;h4 id=&quot;create-aws-sns&quot;&gt;Create AWS SNS&lt;/h4&gt;

&lt;p&gt;Open a new tab and navigate to the &lt;a href=&quot;https://console.aws.amazon.com/sns/v2/&quot;&gt;AWS SNS&lt;/a&gt; console.
Click the &lt;code class=&quot;highlighter-rouge&quot;&gt;Create Topic&lt;/code&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/wMlXUuO.png&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the pop-up window set a topic name and display name for your success alarm and hit the &lt;code class=&quot;highlighter-rouge&quot;&gt;create topic&lt;/code&gt; button.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/QtmsevM.png&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You will be taken to the topic details page for your new SNS Topic.
Click the &lt;code class=&quot;highlighter-rouge&quot;&gt;create subscription&lt;/code&gt; button.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/x7t8yUT.png&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here you can choose the type of notification you want. For this example, we will set up an email notification for this alarm.
Place your email address in the &lt;code class=&quot;highlighter-rouge&quot;&gt;Endpoint&lt;/code&gt; field and hit &lt;code class=&quot;highlighter-rouge&quot;&gt;Create subscription&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/C13ikXf.png&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Check your email inbox and confirm the subscription to your SNS alarm.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/2Wksf2l.png&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Once you have successfully confirmed your subscription in your email, you should see it listed in the &lt;code class=&quot;highlighter-rouge&quot;&gt;Subscriptions&lt;/code&gt; table on the &lt;code class=&quot;highlighter-rouge&quot;&gt;Topic details&lt;/code&gt; page for your SNS Alarm.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/W6SAfep.png&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You will need to repeat this process for both your success and failure alarms.
On the &lt;code class=&quot;highlighter-rouge&quot;&gt;Topic details&lt;/code&gt; page for each alarm, copy the &lt;code class=&quot;highlighter-rouge&quot;&gt;Topic ARN&lt;/code&gt; value and paste it into its respective parameter field in your Data Pipeline setup.&lt;/p&gt;
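
&lt;p&gt;If you prefer the command line, the same topics and subscriptions can be created with the AWS CLI’s &lt;code class=&quot;highlighter-rouge&quot;&gt;sns create-topic&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;sns subscribe&lt;/code&gt; commands. This is a minimal sketch that assumes the CLI is already configured. The topic names and the &lt;code class=&quot;highlighter-rouge&quot;&gt;create_alarm_topics&lt;/code&gt; function are hypothetical, and you still have to confirm each subscription from your inbox.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/bin/bash

# usage: create_alarm_topics &amp;lt;email address&amp;gt;
# Creates one topic per outcome (placeholder names) and subscribes the email to each.
create_alarm_topics() {
  local email=$1 name arn
  for name in spark-job-success spark-job-failure; do
    # create-topic returns the topic ARN; --query and --output extract it as plain text
    arn=$(aws sns create-topic --name &quot;$name&quot; --query TopicArn --output text) || return 1
    aws sns subscribe --topic-arn &quot;$arn&quot; --protocol email --notification-endpoint &quot;$email&quot; || return 1
    echo &quot;$name: $arn&quot;    # paste this ARN into the matching pipeline parameter
  done
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;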

&lt;h4 id=&quot;schedule-your-job&quot;&gt;Schedule your Job&lt;/h4&gt;

&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/C9R631G.png&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Once your ARN parameters are set for your alarms you can move on to scheduling when this pipeline will run.
Set the cadence for when the job will run and make sure you set your starting and ending dates to something in the future.
You can also set the job to run only on pipeline activation if you would rather manually start your job.&lt;/p&gt;

&lt;h4 id=&quot;finish-the-pipeline-setup&quot;&gt;Finish The Pipeline Setup&lt;/h4&gt;

&lt;p&gt;Once your parameters and schedule are set, all that remains is to choose an S3 bucket location for the logs from the pipeline executions and click &lt;code class=&quot;highlighter-rouge&quot;&gt;Activate&lt;/code&gt; at the bottom of the page.
This will run a check on all of your settings and confirm that everything should work.
Common problems here include invalid instance types for your cluster (note: not all EC2 types work in EMR and Data Pipeline) and schedule settings that don’t make sense.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/NV3OIlc.png&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;test-your-job&quot;&gt;Test Your Job&lt;/h4&gt;

&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/UQ5w5oE.png&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Once your job activates successfully you should be done!
If you are doing a test run first, wait until the pipeline’s scheduled start time; it should run and email you the result.
If you want to view the console/standard output of your job, follow the &lt;code class=&quot;highlighter-rouge&quot;&gt;Emr Step Logs&lt;/code&gt; link and click &lt;code class=&quot;highlighter-rouge&quot;&gt;stdout.gz&lt;/code&gt; to view or download the output.
If something goes wrong, you will find the error output there as well, in &lt;code class=&quot;highlighter-rouge&quot;&gt;stderr.gz&lt;/code&gt;.
The logs from the Spark job in your pipeline are in the &lt;code class=&quot;highlighter-rouge&quot;&gt;controller.gz&lt;/code&gt; file if you need to do some debugging with your pipeline.&lt;/p&gt;
</description>
        <pubDate>Wed, 27 Mar 2019 00:00:00 +0000</pubDate>
        <link>http://engineering.curalate.com/2019/03/27/scheduled-scala-spark-job.html</link>
        <guid isPermaLink="true">http://engineering.curalate.com/2019/03/27/scheduled-scala-spark-job.html</guid>
        
        <category>spark</category>
        
        <category>scala</category>
        
        <category>aws</category>
        
        <category>emr</category>
        
        <category>ec2</category>
        
        <category>data</category>
        
        <category>pipeline</category>
        
        <category>scheduled</category>
        
        <category>job</category>
        
        <category>daily</category>
        
        <category>weekly</category>
        
        
      </item>
    
      <item>
        <title>Hey, you busy? I have thousands of questions to ask you.</title>
        <description>&lt;p&gt;If you’re a brick-and-mortar business owner, you quickly identify patterns in your customer foot traffic,
especially around the holidays when achieving your sales goals depends on both timely and quality service.
If you’re an e-commerce business owner,
like many of Curalate’s 1,000+ customers,
it’s really no different: holiday sales are crucial to success and
they depend heavily on your site’s reliability.&lt;/p&gt;

&lt;p&gt;At Curalate, we take great pride in maintaining
high availability and low latency for our client integrations throughout the year.
But over the “Black Fiveday” period—Thanksgiving through Cyber Monday—and
the week leading up to and including the day after Christmas,
we see a roughly 5x increase in network requests to our APIs from our clients’ sites.
Therefore it’s crucial that we both design our systems to handle that increased load
and perform load testing on the systems ahead of time to prove that our designs work.&lt;/p&gt;

&lt;p&gt;This post describes how we carried out load tests of our infrastructure to prepare for the holiday traffic increase on our APIs.
Additionally, it highlights how our approach towards dynamic scalability reduces costs by avoiding over-provisioning.&lt;/p&gt;

&lt;h1 id=&quot;load-test-planning&quot;&gt;Load Test Planning&lt;/h1&gt;

&lt;p&gt;Our first question was: what volume of traffic can we expect?
To answer that, we consulted the last several years of data describing our holiday traffic load pattern.
Second: what are the important dimensions of that traffic?
For example, do we expect a majority of the traffic to be cached or uncached? 
Do total requests matter or only instantaneous load?
Since a previous &lt;a href=&quot;http://engineering.curalate.com/2017/12/21/expected-traffic-load-testing.html&quot;&gt;blog post&lt;/a&gt;
discussed cached versus uncached testing,
this post focuses on API request rate.&lt;/p&gt;

&lt;p&gt;Total requests in a day is interesting,
but only suggests an &lt;em&gt;average&lt;/em&gt; requests-per-second (RPS) rate.
The metric we’re mostly interested in is the daily &lt;em&gt;peak&lt;/em&gt; RPS rate.
This gives us an idea of the busiest moment in our day, and
if we can handle that rate,
we should have confidence that we can handle lesser request rates at other times.&lt;/p&gt;
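
&lt;p&gt;To make the difference between the two rates concrete, both can be computed from a request log. The sketch below assumes a hypothetical input file with one epoch-second timestamp per request, one per line. &lt;code class=&quot;highlighter-rouge&quot;&gt;rps_stats&lt;/code&gt; is an illustrative name, not one of our tools.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/bin/bash

# usage: rps_stats &amp;lt;file with one epoch-second timestamp per line&amp;gt;
# Prints the average and peak requests-per-second over the observed window.
rps_stats() {
  sort -n &quot;$1&quot; | awk '
    NR == 1 { first = $1 }
    {
      last = $1
      n = ++count[$1]                # requests seen so far in this second
      if (n &amp;gt; peak) peak = n
    }
    END {
      if (NR == 0) exit
      window = last - first + 1      # seconds covered, inclusive
      printf &quot;average=%.2f peak=%d\n&quot;, NR / window, peak
    }
  '
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For five requests at seconds 100, 100, 100, 101, and 103, this prints &lt;code class=&quot;highlighter-rouge&quot;&gt;average=1.25 peak=3&lt;/code&gt;: the four-second window averages 1.25 RPS, but the busiest second saw 3 requests.&lt;/p&gt;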

&lt;h2 id=&quot;peak-requests-per-second&quot;&gt;Peak requests-per-second&lt;/h2&gt;
&lt;p&gt;Consulting our historical data, we observed that Black Fiveday typically
has a daily peak RPS that is 5x higher than the normal peak rate during September of the same year.
With this in mind, we calculated the 5x peak RPS rate from our September 2018 data, and
then used some simple bash scripts (described in the next section)
and other tools to steadily increase our API request load to the predicted maximum rate.&lt;/p&gt;

&lt;p&gt;The graph below shows the daily peak RPS rate on our API from September through December 2018.
November 5th shows the internal load generated using the script below targeting our expected 5x traffic increase.
November 23rd (Black Friday) and December 5th were our two highest peaks for publicly generated traffic to our API by end-users.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/2018-12-31-holiday/peak_rps.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;So, how’d we do? Well, we were spot-on with our prediction! Next year we’re predicting winning lottery numbers 😉.
In truth, we didn’t expect our prediction to be that close to the exact peak traffic load, and
it’s a reasonable question to ask if we should have tested a little higher given the actual load.
I can understand arguments on both sides.&lt;/p&gt;

&lt;h1 id=&quot;load-test-execution&quot;&gt;Load Test Execution&lt;/h1&gt;
&lt;p&gt;Our &lt;a href=&quot;http://engineering.curalate.com/2017/12/21/expected-traffic-load-testing.html&quot;&gt;2017 holiday load testing blog post&lt;/a&gt;
already shared details on the different load tests we performed (cached &amp;amp; uncached, various API endpoints, etc.),
as well as some example output of the tool,
so this post reproduces just one of those scripts here for context.
To execute the actual load test, we once again used &lt;a href=&quot;https://github.com/tsenart/vegeta&quot;&gt;Vegeta&lt;/a&gt; this year,
as we enjoyed our experience using it last year.&lt;/p&gt;

&lt;p&gt;The following bash script shows how we use Vegeta to slowly increase the request rate to a specified API endpoint:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/bin/bash&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#  File: load-test.sh&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# usage: ./load-test.sh &amp;lt;api endpoint&amp;gt; &amp;lt;start rate&amp;gt; &amp;lt;rate step increment&amp;gt; &amp;lt;step duration&amp;gt; &amp;lt;max rate&amp;gt; &amp;lt;test tag&amp;gt;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$# &lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;-ne&lt;/span&gt; 6 &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then
  &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;usage: &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$0&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; &amp;lt;api endpoint&amp;gt; &amp;lt;start rate&amp;gt; &amp;lt;rate step increment&amp;gt; &amp;lt;step duration&amp;gt; &amp;lt;max rate&amp;gt; &amp;lt;test tag&amp;gt;&quot;&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;exit &lt;/span&gt;1
&lt;span class=&quot;k&quot;&gt;fi

&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;Target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$1&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;StartRate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$2&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;RateStep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$3&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;StepDuration&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$4&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;MaxRate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$5&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;Tag&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$6&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;CurrentRate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$StartRate&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CurrentRate&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-le&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$MaxRate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do
  &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;OutputFile&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;test-&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$Tag&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$MaxRate&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CurrentRate&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;.bin&quot;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CurrentRate&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-lt&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$MaxRate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then
    &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$Target&lt;/span&gt; | ./vegeta attack &lt;span class=&quot;nt&quot;&gt;-rate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CurrentRate&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-duration&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$StepDuration&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$OutputFile&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;else
    &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$Target&lt;/span&gt; | ./vegeta attack &lt;span class=&quot;nt&quot;&gt;-rate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$MaxRate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$OutputFile&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;fi
  &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CurrentRate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;$((&lt;/span&gt;CurrentRate+RateStep&lt;span class=&quot;k&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The above script will repeatedly execute Vegeta with an increasing request rate until it reaches the specified max rate.
Each Vegeta execution will run for the specified duration and the final execution, with the maximum rate, will run until manually killed.
Additionally, each execution will write a distinct Vegeta output file.&lt;/p&gt;
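
&lt;p&gt;One detail worth checking before a run is which rates the loop will actually visit. The sketch below mirrors only the loop arithmetic of the script above and sends no requests. &lt;code class=&quot;highlighter-rouge&quot;&gt;step_rates&lt;/code&gt; is a hypothetical helper. If the max rate cannot be reached from the start rate in whole steps, the loop exits without ever running at the max rate, so the open-ended final run never starts.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/bin/bash

# usage: step_rates &amp;lt;start rate&amp;gt; &amp;lt;rate step increment&amp;gt; &amp;lt;max rate&amp;gt;
# Prints, one per line, each rate the load-test loop would run at.
step_rates() {
  local rate=$1 step=$2 max=$3
  while [ &quot;$rate&quot; -le &quot;$max&quot; ]; do
    echo &quot;$rate&quot;
    rate=$((rate + step))
  done
}

step_rates 100 100 500   # 100 200 300 400 500 (one per line)
step_rates 100 150 500   # 100 250 400: the max rate of 500 is never reached
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Choosing a max rate equal to the start rate plus a whole number of steps avoids the second case.&lt;/p&gt;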

&lt;h1 id=&quot;elastic-scaling-example&quot;&gt;Elastic scaling example&lt;/h1&gt;
&lt;p&gt;Handling a 5x peak in your daily traffic load is great, and it’s really easy, right?&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;InstanceCount=$(echo &quot;$InstanceCount * 5&quot; | bc)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;😛 Okay, so it’s really easy if you don’t care about your costs.
For those of us who do care about costs,
our goal is to dynamically scale up our computing resources as demand for our services increases,
and then gracefully scale down our resources as demand wanes.&lt;/p&gt;

&lt;h2 id=&quot;dynamic-scaling&quot;&gt;Dynamic Scaling&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://aws.amazon.com/&quot;&gt;AWS&lt;/a&gt; makes this straightforward as long as you have things configured properly.
While some of our legacy services still run as AMIs deployed within auto-scaling groups on EC2 instances,
most of our microservices now run as containers on &lt;a href=&quot;https://aws.amazon.com/ecs/&quot;&gt;Amazon ECS&lt;/a&gt;.
We wrote a &lt;a href=&quot;http://engineering.curalate.com/2018/05/16/productionalizing-ecs.html&quot;&gt;blog post&lt;/a&gt; earlier this year detailing
all the various aspects that come into play when running our production systems on ECS.&lt;/p&gt;

&lt;p&gt;Since that post already explains how we run our systems on ECS,
this post simply provides an example of that dynamic scaling in action.
The graphs below show the request rate on one of our services called “Media API” (“media-api-service”)
and the corresponding number of containers powering that service (“media-api-service_v3”)
during the three-day period from Christmas through December 27th.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/2018-12-31-holiday/ecs_scaling.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As you can see, we were quick to add more containers to meet increased request demand,
and efficient in removing them soon after the request rate dropped.
While the bottom graph is not a direct representation of the underlying EC2 instances powering the ECS cluster,
it’s a good estimate for how dynamic our EC2 capacity (and costs) would be if all services in the cluster employ a similar scaling approach.&lt;/p&gt;

&lt;h2 id=&quot;scaling-metric&quot;&gt;Scaling metric&lt;/h2&gt;
&lt;p&gt;Speaking of scaling, what metric do we scale on?
There are several options including CPU and memory utilization, as well as latency and queue size.
In our experience, scaling on CPU utilization works well enough for container-based services.
When a given service’s containers have an average CPU utilization above some threshold, say, 75%,
we add a new container and re-evaluate after some time.
When CPU utilization drops below, say, 60%, we stop a container and re-evaluate after some time.
We scale some of our services on memory utilization in the same way.&lt;/p&gt;
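
&lt;p&gt;The rule described above can be sketched as a small decision function. This illustrates the threshold logic only and is not our ECS configuration. The function name is hypothetical, and the 75% and 60% thresholds are the example values from the text.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/bin/bash

# usage: scale_decision &amp;lt;average CPU utilization, whole percent&amp;gt;
# Prints &quot;add&quot; above the upper threshold, &quot;remove&quot; below the lower one,
# and &quot;hold&quot; in the band between them.
scale_decision() {
  local cpu=$1 upper=75 lower=60
  if [ &quot;$cpu&quot; -gt &quot;$upper&quot; ]; then
    echo add
  elif [ &quot;$cpu&quot; -lt &quot;$lower&quot; ]; then
    echo remove
  else
    echo hold
  fi
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The gap between the two thresholds matters: with a single threshold, a service hovering near it would add and remove containers on every evaluation.&lt;/p&gt;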

&lt;h1 id=&quot;discussion-and-related-notes&quot;&gt;Discussion and related notes&lt;/h1&gt;
&lt;p&gt;For brevity and clarity, this post skipped over a few points related to various aspects above.
Let me address a few lingering questions and relevant details that are worth noting but not central to the theme of this post.&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;While total requests doesn’t directly describe peak requests-per-second,
it can impact other resources sensitive to total data size such as event queues and logging/diagnostic data.
Keep those types of resources in mind both when executing the load test and facing increased live traffic.&lt;/li&gt;
  &lt;li&gt;We talked about scaling up compute resources to meet the increased server request rate,
but if your traffic is bursty in the extreme (e.g., it goes from 1x to 10x or higher in a few seconds)
then you should consider using a CDN rather than trying to dynamically scale to meet that load.
A CDN introduces some delay in pushing out new data
but it can withstand orders of magnitude higher request rates
because the response is statically determined by the request input.&lt;/li&gt;
  &lt;li&gt;Synthetic load tests on production infrastructure could impact live production traffic
through dependent resources such as databases, caches and event queues.
For example, if you run a “cache-missing” test on live infrastructure,
you may end up evicting all of your real customers’ data and
drastically reducing performance for your legitimate production traffic.&lt;/li&gt;
  &lt;li&gt;The beginning of this post mentions the importance of reliability during the holiday period.
To see how Curalate’s reliability compared to several of our competitors during Black Fiveday 2018,
check out this &lt;a href=&quot;https://www.curalate.com/blog/speed-reliability-ugc-provider/&quot;&gt;blog post&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;hr /&gt;
&lt;p&gt;Hopefully this post gives you some insight into how we prepare for the holiday traffic load spike at Curalate.
If you value high availability and low latency in your production systems as we do,
check out our &lt;a href=&quot;https://www.themuse.com/profiles/curalate&quot;&gt;jobs&lt;/a&gt; page,
click the Join Us link on the top right of the page,
or shoot us an email at hello@curalate.com.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;
</description>
        <pubDate>Mon, 31 Dec 2018 17:00:00 +0000</pubDate>
        <link>http://engineering.curalate.com/2018/12/31/holiday-load-prep.html</link>
        <guid isPermaLink="true">http://engineering.curalate.com/2018/12/31/holiday-load-prep.html</guid>
        
        <category>infrastructure</category>
        
        <category>holiday</category>
        
        <category>forecasting</category>
        
        
      </item>
    
      <item>
        <title>Uploading EC2 Logs to S3 on Shutdown</title>
        <description>&lt;p&gt;If you’ve ever used an auto scaling group (ASG) on AWS, you’ve probably had an EC2 instance fail and get removed from the ASG.
While great for redundancy (the ASG launches a new instance to start handling requests), it makes debugging the failure difficult since the ASG &lt;em&gt;terminates&lt;/em&gt; the bad instance, erasing any evidence of what went wrong.
Below, I present a script that uploads relevant files to S3 after an instance is told to shut down but before it terminates.&lt;/p&gt;

&lt;p&gt;To achieve this, we make use of Linux’s runlevel scripts. The instructions below are for Ubuntu, but it should be straightforward to migrate to a different distro.&lt;/p&gt;

&lt;p&gt;First, we make a script &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/rc0.d/K01upload-logs&lt;/code&gt;. This script will run when the system is shutting down. You should change &lt;code class=&quot;highlighter-rouge&quot;&gt;BUCKET&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;S3_PATH&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;LOG_FILE&lt;/code&gt; to match your needs.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/bin/bash&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;source&lt;/span&gt; /etc/environment
&lt;span class=&quot;c&quot;&gt;# get strict after sourcing environment since we don't trust it...&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;set&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-euo&lt;/span&gt; pipefail
&lt;span class=&quot;nv&quot;&gt;IFS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;$'&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n\t&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# what logs should I upload and where to?&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;LOG_FILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/var/log/tomcat7/catalina.out&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;BUCKET&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;my-logs-bucket&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# below we include the instance id in the path. That way it's easily findable.&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# the variable is named S3_PATH rather than PATH so it doesn't clobber the shell's command search path.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;HOST&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;$(&lt;/span&gt;/usr/bin/curl http://169.254.169.254/latest/meta-data/instance-id&lt;span class=&quot;k&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;S3_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;services/logs/&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOST&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# upload the logs&lt;/span&gt;
/bin/echo &lt;span class=&quot;s2&quot;&gt;&quot;Uploading logs to s3://&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$BUCKET&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$S3_PATH&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; | /usr/bin/wall
/usr/local/bin/aws s3 cp &lt;span class=&quot;nv&quot;&gt;$LOG_FILE&lt;/span&gt; s3://&lt;span class=&quot;nv&quot;&gt;$BUCKET&lt;/span&gt;/&lt;span class=&quot;nv&quot;&gt;$S3_PATH&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After installing the script, you need to set the permissions:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;chown root:root /etc/rc0.d/K01upload-logs
chmod +x /etc/rc0.d/K01upload-logs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And that’s it! The script will upload the logs to your S3 bucket when the ASG terminates an instance. We’ve found this extremely helpful for our &lt;a href=&quot;http://engineering.curalate.com/2018/08/01/mxnet-case-study.html&quot;&gt;deep learning infrastructure&lt;/a&gt;, which can often hit errors in C++ code (errors that aren’t handled by the JVM or sent to our logging services).&lt;/p&gt;
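
&lt;p&gt;Since the upload only runs at shutdown, it can be handy to check the logic beforehand. The sketch below is one possible way to do that: it splits the destination and the upload into two small functions and adds a dry-run switch. The function names and the &lt;code class=&quot;highlighter-rouge&quot;&gt;DRY_RUN&lt;/code&gt; variable are assumptions made for this example and are not part of the script above.&lt;/p&gt;

```shell
#!/bin/bash
# Illustrative helpers; the function names and the DRY_RUN switch are
# assumptions for this sketch, not part of the shutdown script.

# s3_destination BUCKET INSTANCE_ID: print the upload prefix for one instance.
s3_destination() {
  local bucket="$1" instance_id="$2"
  printf 's3://%s/services/logs/%s/\n' "$bucket" "$instance_id"
}

# upload_log LOG_FILE DEST: upload the file, or only print the command
# when DRY_RUN=1 so the logic can be checked without AWS credentials.
upload_log() {
  local log_file="$1" dest="$2"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "aws s3 cp $log_file $dest"
  else
    /usr/local/bin/aws s3 cp "$log_file" "$dest"
  fi
}
```

&lt;p&gt;With &lt;code class=&quot;highlighter-rouge&quot;&gt;DRY_RUN=1&lt;/code&gt; set, &lt;code class=&quot;highlighter-rouge&quot;&gt;upload_log&lt;/code&gt; prints the command it would run, which makes it easy to confirm the bucket and path on a live instance without shutting it down.&lt;/p&gt;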
</description>
        <pubDate>Tue, 04 Sep 2018 08:56:00 +0000</pubDate>
        <link>http://engineering.curalate.com/2018/09/04/ec2-log-upload.html</link>
        <guid isPermaLink="true">http://engineering.curalate.com/2018/09/04/ec2-log-upload.html</guid>
        
        <category>aws</category>
        
        <category>bash</category>
        
        
      </item>
    
      <item>
        <title>How Curalate uses MXNet on AWS for Deep Learning Magic</title>
        <description>&lt;p&gt;&lt;em&gt;This post was simultaneously published to &lt;a href=&quot;https://medium.com/apache-mxnet/how-curalate-uses-mxnet-on-aws-for-some-deep-learning-magic-d9139d84be1c&quot;&gt;Medium&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At &lt;a href=&quot;https://www.curalate.com&quot;&gt;Curalate&lt;/a&gt;, we use state-of-the-art deep learning and computer vision to add a layer of magic to our products.
&lt;a href=&quot;https://www.curalate.com/blog/intelligent-product-tagging/&quot;&gt;Intelligent Product Tagging&lt;/a&gt;, for example, identifies our clients’ products in user-generated photos.
Being a startup, we need to build these deep learning and computer vision systems the same way we do the rest of our products: quickly.&lt;/p&gt;

&lt;p&gt;Our computer vision systems are built in &lt;a href=&quot;https://www.curalate.com/blog/how-Curalate-built-a-kick-ass-research-development-team/&quot;&gt;two phases&lt;/a&gt;, research and productization, and we require a deep learning framework that accelerates both.
During the research phase, we need a framework that’s quick to get started with and is flexible enough to experiment with new ideas.
Once we have a solution, we need a framework that can easily be integrated into a microservice and deployed to multiple production environments.&lt;/p&gt;

&lt;p&gt;In the past, we used &lt;a href=&quot;http://caffe.berkeleyvision.org/&quot;&gt;Caffe&lt;/a&gt; for experimentation and our own custom inference interface to deploy the trained models to production.
Experimentation was slow due to Caffe’s dated Python API, lack of automatic differentiation, unreliable build/install process, and clunky support for advanced layers which required us to maintain our own custom fork. 
Productization of Caffe was challenging since we had to maintain our own &lt;a href=&quot;http://engineering.curalate.com/2016/04/29/bridging-scala-to-c++-with-bridj.html&quot;&gt;JNI interface&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We needed a new, modern framework that fulfilled all of our needs while saving us from the shortcomings of Caffe.
After a &lt;a href=&quot;http://engineering.curalate.com/2018/03/23/DL-lib-for-app-dev-and-prod.html&quot;&gt;review of all the available options&lt;/a&gt;, we decided to move to &lt;a href=&quot;https://mxnet.incubator.apache.org/&quot;&gt;MXNet&lt;/a&gt;.
In this post, we’ll discuss why we migrated to MXNet as our deep learning framework of choice to facilitate our speed of experimentation, development, and deployment.&lt;/p&gt;

&lt;h1 id=&quot;training-and-experimentation&quot;&gt;Training and Experimentation&lt;/h1&gt;

&lt;p&gt;Whenever we are faced with a new computer vision problem, we start by looking at existing state-of-the-art implementations.
If we are lucky, the functionality of the service we are implementing is similar to an existing pre-trained model for MXNet.
MXNet has a fairly fleshed out and maintained &lt;a href=&quot;https://mxnet.incubator.apache.org/model_zoo/&quot;&gt;Model Zoo&lt;/a&gt; that contains all of the standard pre-trained models that we would expect from any deep learning framework (ImageNet, PascalVOC, …).
If we cannot find what we need there, MXNet is also one of the frameworks that &lt;a href=&quot;https://aws.amazon.com/blogs/machine-learning/announcing-onnx-support-for-apache-mxnet/&quot;&gt;supports&lt;/a&gt; &lt;a href=&quot;https://onnx.ai/&quot;&gt;ONNX&lt;/a&gt; and its cross-framework model format, giving us access to many more pre-trained models.&lt;/p&gt;

&lt;p&gt;Converting and reusing old models and code is also possible with MXNet.
&lt;a href=&quot;https://github.com/Microsoft/MMdnn&quot;&gt;MMDNN&lt;/a&gt; provides support for converting our old models into the MXNet model format if retraining is not needed.
MXNet is also supported as a &lt;a href=&quot;https://github.com/awslabs/keras-apache-mxnet&quot;&gt;backend for Keras&lt;/a&gt;, the high-level neural net API, allowing us to run exactly the same Keras code developed with other frameworks on the quick MXNet backend instead.&lt;/p&gt;

&lt;p&gt;When we have a more domain specific service in mind, where the data we are training/predicting is new but the model/task is well researched (e.g. image classification, object detection and instance segmentation), ideally we can avoid implementing a research paper from scratch and find an open source implementation to use and contribute to instead.
MXNet maintains a long list of &lt;a href=&quot;https://github.com/apache/incubator-mxnet/tree/master/example&quot;&gt;example projects&lt;/a&gt; in its own code base for most of the popular deep learning applications/models (neural style transfer, Faster R-CNN, speech-to-text).
Outside of this, MXNet has a fairly large community of open source developers that maintain MXNet versions of most of the popular and state of the art research papers in the machine learning community.&lt;/p&gt;

&lt;p&gt;When the state of the art is just not good enough, MXNet is still an excellent choice for researching novel deep learning models.
MXNet offers both symbolic (static graph) and imperative (dynamic graph) APIs allowing us to work with whichever paradigm is the most appropriate for the task.
MXNet’s high-level imperative API, called &lt;a href=&quot;https://mxnet.incubator.apache.org/gluon/index.html&quot;&gt;Gluon&lt;/a&gt;, offers a full set of plug-and-play neural network building blocks including data loaders, predefined layers and losses.
This saves us from implementing common layers and methods ourselves, so we can spend more of our development time writing the new state-of-the-art secret sauce with natural Pythonic control flow.
If we are working specifically with a computer vision or natural language processing task, Gluon has model toolkits that provide implementations of state-of-the-art deep learning algorithms in &lt;a href=&quot;https://gluon-cv.mxnet.io/&quot;&gt;GluonCV&lt;/a&gt; and &lt;a href=&quot;http://gluon-nlp.mxnet.io/&quot;&gt;GluonNLP&lt;/a&gt; respectively.
When it comes to debugging our models during training, Gluon allows us to set breakpoints to help us analyze the internals and output of our deep models. On top of that, MXNet has its own (early in development) support for writing logs out for &lt;a href=&quot;https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard&quot;&gt;TensorBoard&lt;/a&gt; with &lt;a href=&quot;https://github.com/awslabs/mxboard&quot;&gt;MXBoard&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If Python is not your data science/experimentation language of choice, MXNet also provides APIs for training in C++, Scala, Julia, Perl and R.
The Scala interface also includes early support for training in a distributed &lt;a href=&quot;https://github.com/apache/incubator-mxnet/tree/master/scala-package/spark&quot;&gt;Apache Spark&lt;/a&gt; cluster for your big data needs.&lt;/p&gt;

&lt;h1 id=&quot;inference-api-and-scala&quot;&gt;Inference API and Scala&lt;/h1&gt;

&lt;p&gt;After training, our models get deployed to microservices that power various Curalate applications.
At Curalate, we write our microservices in Scala using the &lt;a href=&quot;https://twitter.github.io/finatra/&quot;&gt;Finatra&lt;/a&gt; framework.
Before we switched to MXNet, we had to maintain JNA/Bridj borders between our Scala code and deep learning frameworks. This is where MXNet’s Scala API really sped up our development: any model we train in Python can be immediately loaded into our production web service framework.&lt;/p&gt;

&lt;p&gt;Most of our deep learning microservices have a very similar data flow:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Load an input for the deep net (in our case, typically a set of images)&lt;/li&gt;
  &lt;li&gt;Pre-process the input data (e.g., scaling, cropping)&lt;/li&gt;
  &lt;li&gt;Perform inference on the deep net model&lt;/li&gt;
  &lt;li&gt;Return the results to the caller&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When we first started using MXNet (version 0.11.0), the Scala API only had support for NDArrays and Modules.
This was great, but we needed more functionality for higher level operations such as:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Loading images from a network data store (in our case S3)&lt;/li&gt;
  &lt;li&gt;Pre-processing images to match what is expected by the net (e.g., scaling, cropping, converting to raw RGB)&lt;/li&gt;
  &lt;li&gt;Mutex locking the GPU to avoid contention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We implemented this functionality in an easy to use, high-level inference API (As of MXNet 1.2.0, the Scala API has &lt;a href=&quot;https://mxnet.incubator.apache.org/api/scala/infer.html&quot;&gt;added support&lt;/a&gt; for inference on images and thread management).
Though our actual deployment has some Curalate-specific logic, its general design is:&lt;/p&gt;

&lt;div class=&quot;language-scala highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;trait&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;InferenceAPI&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;

    &lt;span class=&quot;cm&quot;&gt;/**
     * Performs inference on the provided input images, and returns the results.
     */&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;predict&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;images&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Iterable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;BufferedImage&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Future&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Iterable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]]]&lt;/span&gt;

    &lt;span class=&quot;cm&quot;&gt;/**
     * Loads the images from s3 pointed to by s3Bucket and s3Keys, performs inference, and returns the results.
     */&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loadAndPredict&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s3Bucket&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s3Keys&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Iterable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Future&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Iterable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]]]&lt;/span&gt;

    &lt;span class=&quot;cm&quot;&gt;/**
     * Pre-processes the provided image (i.e., scaling, center crop, etc), 
     * and converts it to a 32-bit floating point NDArray.
     */&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;preprocess&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;BufferedImage&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;NDArray&lt;/span&gt;
    
    &lt;span class=&quot;cm&quot;&gt;/**
     * Performs the inference on a batch of images in the provided NDArray.
     */&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;predict&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batchArray&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;NDArray&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Future&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]]&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;One particularly interesting item is the mutex locking on the GPU.
We achieved this by creating a singleton actor with &lt;a href=&quot;https://akka.io/&quot;&gt;Akka&lt;/a&gt; that acts as a gate keeper to the GPU.
The actor looks like:&lt;/p&gt;

&lt;div class=&quot;language-scala highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;InferenceActor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batchSize&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;deepNet&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;module.Module&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;akka&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;actor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Actor&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;receive&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Receive&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;NDArray&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;currentTimeMillis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;DataBatch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
          &lt;span class=&quot;nc&quot;&gt;IndexedSeq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt;
          &lt;span class=&quot;nc&quot;&gt;IndexedSeq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;empty&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;NDArray&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt;
          &lt;span class=&quot;nc&quot;&gt;IndexedSeq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batchSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0L&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt;
          &lt;span class=&quot;n&quot;&gt;pad&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;prediction&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;deepNet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predict&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;head&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// only using one batch at a time.
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;prediction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toArray&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;prediction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dispose&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;duration&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;currentTimeMillis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;sender&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;duration&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Throwable&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
          &lt;span class=&quot;n&quot;&gt;logger&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;error&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Could not run prediction!!!&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
          &lt;span class=&quot;n&quot;&gt;sender&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We then use Akka’s &lt;code class=&quot;highlighter-rouge&quot;&gt;ask&lt;/code&gt; pattern to pass images to the &lt;code class=&quot;highlighter-rouge&quot;&gt;InferenceActor&lt;/code&gt;.
Not only does this let us mutex lock the GPU in a nice way, it also returns a Scala &lt;code class=&quot;highlighter-rouge&quot;&gt;Future&lt;/code&gt;, which integrates well with our async Finatra patterns.
The call to the actor looks like:&lt;/p&gt;

&lt;div class=&quot;language-scala highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;predict&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batchArray&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;NDArray&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Future&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;actor&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batchArray&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;duration&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;// record the GPU execution time
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;logger&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;debug&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;GPU Execution took $duration milliseconds for a batch size of $batchSize&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Though we built a custom image-centric API on top of the MXNet Scala interface for our needs, MXNet treating Scala as a first class language made this extremely easy.
The resulting code was minimal and elegant.&lt;/p&gt;

&lt;h1 id=&quot;deployment&quot;&gt;Deployment&lt;/h1&gt;

&lt;p&gt;Continuous integration and deployment of microservices is inherently challenging, but deep learning systems add another layer of complexity due to their dependence on specialized hardware (i.e., GPUs) and a native software stack (i.e., CUDA, cuDNN, and MXNet’s binary library).
While our Scala-only microservices can be compiled into a WAR or JAR and dropped into a Docker container, the story is not as simple for our deep learning services.
Ideally, we’d like to deploy to these specialized environments while maintaining the agility and speed offered by continuous integration.
In addition, we sometimes want to deploy to non-GPU environments (such as our development boxes, or applications where latency is OK) to save costs.&lt;/p&gt;

&lt;p&gt;MXNet helps us solve all of these challenges because it separates its core library from its Scala bindings.&lt;/p&gt;

&lt;h2 id=&quot;mxnet-binary-packages&quot;&gt;MXNet Binary Packages&lt;/h2&gt;

&lt;p&gt;First, let’s look at how we can enable fast, continuous integration and deployment.
MXNet’s Scala API requires two binary files be present in the environment: the core library &lt;code class=&quot;highlighter-rouge&quot;&gt;libmxnet.so&lt;/code&gt; and the Java/Scala bindings &lt;code class=&quot;highlighter-rouge&quot;&gt;libmxnet-scala.so&lt;/code&gt;.
At Curalate, we use Jenkins to build both Docker images and AMIs for deployment in AWS.
Building MXNet or its Scala bindings from source each time we need an image (for, say, a new service or a different version) is time-consuming and puts a serious drag on our deployment process.&lt;/p&gt;

&lt;p&gt;To alleviate this, we build custom MXNet Debian packages that contain both the core library &lt;code class=&quot;highlighter-rouge&quot;&gt;libmxnet.so&lt;/code&gt; and the Java/Scala bindings &lt;code class=&quot;highlighter-rouge&quot;&gt;libmxnet-scala.so&lt;/code&gt;.
Specifically, we maintain a bash script that:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Checks out a specific version of MXNet (provided by the user)&lt;/li&gt;
  &lt;li&gt;Compiles the MXNet base library &lt;code class=&quot;highlighter-rouge&quot;&gt;libmxnet.so&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Compiles the MXNet Scala bindings &lt;code class=&quot;highlighter-rouge&quot;&gt;libmxnet-scala.so&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Packages all compiled binaries into a Debian file using &lt;a href=&quot;https://help.ubuntu.com/community/CheckInstall&quot;&gt;checkinstall&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Uploads the Debian file to our S3-based repository using &lt;a href=&quot;https://github.com/krobertson/deb-s3&quot;&gt;deb-s3&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We run this script inside a Docker container, so the resulting artifact is binary-compatible with the architecture of the container we compiled it in.
We maintain separate repositories for different versions of Ubuntu and Debian, all of which are populated by the MXNet build script.&lt;/p&gt;

&lt;p&gt;By compiling MXNet Debian files and maintaining our own repository, we can quickly install MXNet on any image that has access to our internal deb repository.
When we set up a new service or update an existing one, our build process simply runs &lt;code class=&quot;highlighter-rouge&quot;&gt;apt-get install libmxnet&lt;/code&gt;.
In addition, this makes updating MXNet pretty simple: we just pass the version into our bash script and it becomes available in our repository.&lt;/p&gt;

&lt;h2 id=&quot;varying-environments&quot;&gt;Varying Environments&lt;/h2&gt;

&lt;p&gt;Our second challenge is how to build a proper abstraction over the different hardware environments we run our deep learning services on.
In production we want to run our models on GPUs for speed.
This requires the host OS to have CUDA installed, and MXNet to be properly compiled against it.
Often, the version of CUDA is different depending on the operating system we’re using (e.g., CUDA 8 on Ubuntu 14, CUDA 9 on Ubuntu 16).
To complicate matters further, we develop on laptops without GPUs and would like to test our code in our IDEs as we build things.&lt;/p&gt;

&lt;p&gt;To solve this, we again take advantage of MXNet’s separation of the Scala package and its core libraries.
Specifically, the MXNet Scala API gets packaged into &lt;code class=&quot;highlighter-rouge&quot;&gt;mxnet-core.jar&lt;/code&gt;, which depends on the library &lt;code class=&quot;highlighter-rouge&quot;&gt;libmxnet-scala.so&lt;/code&gt; being installed on the host operating system.
So long as the MXNet version is maintained, we can package &lt;em&gt;just&lt;/em&gt; the &lt;code class=&quot;highlighter-rouge&quot;&gt;mxnet-core.jar&lt;/code&gt; with our services, and let the host operating system decide what stack is compiled in &lt;code class=&quot;highlighter-rouge&quot;&gt;libmxnet-scala.so&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;libmxnet.so&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In other words, we again leverage our bash script from above to build multiple MXNet Debian files: one that is compiled for a GPU environment, and one for a CPU environment.
We also compile a CPU build for OSX, which we install on our dev boxes via &lt;a href=&quot;https://brew.sh/&quot;&gt;Homebrew&lt;/a&gt;.
The &lt;code class=&quot;highlighter-rouge&quot;&gt;build.sbt&lt;/code&gt; files in our Scala projects only bring in &lt;code class=&quot;highlighter-rouge&quot;&gt;mxnet-core&lt;/code&gt;, so no binary code is included in our service builds.&lt;/p&gt;

&lt;p&gt;This flexibility allows us to create one service artifact in Scala, and run it in different environments.&lt;/p&gt;
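
&lt;p&gt;As a rough illustration of that idea, the sketch below shows how a single artifact might pick its device at startup. It is not our production code: the &lt;code class=&quot;highlighter-rouge&quot;&gt;Device&lt;/code&gt; type, the &lt;code class=&quot;highlighter-rouge&quot;&gt;DeviceSelector&lt;/code&gt; object, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;MXNET_GPU_INDEX&lt;/code&gt; environment variable are all hypothetical, and in a real service the result would be mapped onto whatever context the installed MXNet build exposes.&lt;/p&gt;

&lt;div class=&quot;language-scala highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import scala.util.Try

// Hypothetical types: nothing here is part of MXNet or of our services.
sealed trait Device
case object Cpu extends Device
final case class Gpu(index: Int) extends Device

object DeviceSelector {
  // Assumption: GPU host images export MXNET_GPU_INDEX; laptops and CPU hosts do not.
  def fromEnv(env: Map[String, String]): Device =
    env.get(&quot;MXNET_GPU_INDEX&quot;).flatMap(v =&amp;gt; Try(v.trim.toInt).toOption) match {
      case Some(index) if index &amp;gt;= 0 =&amp;gt; Gpu(index)
      case _ =&amp;gt; Cpu
    }
}

object DeviceSelectorExample {
  def main(args: Array[String]): Unit = {
    // The same jar resolves to a different device depending on the host environment.
    println(DeviceSelector.fromEnv(sys.env))
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Falling back to the CPU whenever the variable is missing or malformed keeps the default safe for development machines.&lt;/p&gt;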

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;MXNet provides Curalate with the flexibility needed to research and build cutting edge deep learning systems extremely quickly.
The Python interface has the flexibility necessary to explore research ideas while executing extremely quickly on modern hardware.
For productization, Scala is treated as a first class language, providing us with an API that enables us to use MXNet in our already existing microservice architecture.
Finally, the separation between the Scala API and the binary packages allows us to deploy to multiple CPU and GPU environments without recompiling or repackaging our Scala code.&lt;/p&gt;
</description>
        <pubDate>Wed, 01 Aug 2018 00:00:00 +0000</pubDate>
        <link>http://engineering.curalate.com/2018/08/01/mxnet-case-study.html</link>
        <guid isPermaLink="true">http://engineering.curalate.com/2018/08/01/mxnet-case-study.html</guid>
        
        <category>mxnet</category>
        
        <category>scala</category>
        
        <category>Python</category>
        
        <category>deeplearning</category>
        
        
      </item>
    
      <item>
        <title>Productionalizing ECS</title>
        <description>&lt;p&gt;In January of last year we decided as a company to move towards containerization and began a migration onto &lt;a href=&quot;https://aws.amazon.com/ecs/&quot;&gt;AWS ECS&lt;/a&gt;.  We pushed to move to containers, and off of AMI-based VM deployments, in order to speed up our deployments, simplify our build tooling (since it only has to work on containers), run our production code in a sandbox even locally on our dev machines (something you can’t easily do with AMIs), and lower our costs by getting more out of the resources we’re already paying for.&lt;/p&gt;

&lt;p&gt;However, making ECS production ready was actually quite the challenge. In this post I’ll discuss:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Scaling the underlying ECS cluster&lt;/li&gt;
  &lt;li&gt;Upgrading the backing cluster images&lt;/li&gt;
  &lt;li&gt;Monitoring our containers&lt;/li&gt;
  &lt;li&gt;Cleanup of images, container artifacts&lt;/li&gt;
  &lt;li&gt;Remote debugging of our JVM processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s a short summary of the problems we encountered and our solutions, which finally made ECS a set-it-and-forget-it system.&lt;/p&gt;

&lt;h2 id=&quot;scaling-the-cluster&quot;&gt;Scaling the cluster&lt;/h2&gt;

&lt;p&gt;The first thing we struggled with was how to scale our cluster.  ECS is a container orchestrator, analogous to Kubernetes or Rancher, but you still need to have a set of EC2 machines to run as a cluster. The machines all need to have the ECS Docker agent installed on them, and ECS doesn’t provide a way to automatically scale and manage your cluster for you.  While this has changed recently with the announcement of Fargate, Fargate’s pricing makes it cost-prohibitive for organizations with a lot of containers.&lt;/p&gt;

&lt;p&gt;The general recommendation that AWS gave with ECS was to scale based on CPU reservation limit OR memory limit. There’s no clear way to scale with a combination of the two, since auto scaling rules need to apply to a single CloudWatch metric or you face potential thrashing.&lt;/p&gt;

&lt;p&gt;Our first attempt at scaling was to scale on container placement failures.  ECS logs a message when containers are unable to be placed due to constraints (not enough memory on the cluster, or not enough CPU reservation left), but there is no way to actually capture that event programmatically (see &lt;a href=&quot;https://github.com/aws/amazon-ecs-agent/issues/1221&quot;&gt;this GitHub issue&lt;/a&gt;).  The goal here was to not preemptively scale, but instead scale on actual pressure. This way we wouldn’t be overpaying for machines in the cluster that aren’t heavily used. However, we had to discard this idea because the API limitations made it impossible.&lt;/p&gt;

&lt;p&gt;Our second attempt, and one that we have been using now in production, is to use an AWS Lambda function to monitor the memory and CPU reservation of the cluster and emit a compound metric to CloudWatch that we can scale on.  We set a compound threshold with the logic of:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;If either memory or CPU is above the max threshold, scale up&lt;/li&gt;
  &lt;li&gt;Else if both memory and CPU are below the min, scale down.&lt;/li&gt;
  &lt;li&gt;Else stay the same&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We represent a scale up event with a CloudWatch value of &lt;code class=&quot;highlighter-rouge&quot;&gt;2&lt;/code&gt;, a scale down as value &lt;code class=&quot;highlighter-rouge&quot;&gt;0&lt;/code&gt; and otherwise the nominal state as value &lt;code class=&quot;highlighter-rouge&quot;&gt;1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The code for that is shown below:&lt;/p&gt;

&lt;div class=&quot;language-scala highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;package&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;com.curalate.lambdas.ecs_scaling&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;com.amazonaws.services.cloudwatch.AmazonCloudWatch&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;com.amazonaws.services.cloudwatch.model._&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;com.curalate.lambdas.ecs_scaling.ScaleMetric.ScaleMetric&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;com.curalate.lambdas.ecs_scaling.config.ClusterScaling&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.joda.time.DateTime&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;scala.collection.JavaConverters._&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;object&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ScaleMetric&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Enumeration&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ScaleMetric&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Value&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ScaleDown&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StaySame&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ScaleUp&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ClusterMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;clusterName&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;scaleMetric&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ScaleMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;periodSeconds&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MetricResolver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clusterName&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cloudWatch&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;AmazonCloudWatch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lazy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;now&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;DateTime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;now&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lazy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;now&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minusMinutes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dimension&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Dimension&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;withName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ClusterName&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;withValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clusterName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;periodSeconds&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;protected&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;baseRequest&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;GetMetricStatisticsRequest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;withDimensions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dimension&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;cloudWatch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getMetricStatistics&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;baseRequest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;withMetricName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;withNamespace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;AWS/ECS&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;withStartTime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toDate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;withEndTime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;now&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toDate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;withPeriod&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;periodSeconds&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;withStatistics&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Statistic&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Maximum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getDatapoints&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;asScala&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;head&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getMaximum&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;lazy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;currentCpuReservation&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;CPUReservation&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;lazy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;currentMemoryReservation&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;MemoryReservation&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ClusterStatus&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;scaling&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ClusterScaling&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;metricResolver&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;MetricResolver&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;protected&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;logger&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;org&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;slf4j&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;LoggerFactory&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getLogger&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getClass&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getCompositeMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ClusterMetric&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;logger&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;CPU: ${metricResolver.currentCpuReservation}, memory: ${metricResolver.currentMemoryReservation}&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;state&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metricResolver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;currentCpuReservation&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scaling&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CPU&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metricResolver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;currentMemoryReservation&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scaling&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memory&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;nc&quot;&gt;ScaleMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;ScaleUp&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metricResolver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;currentCpuReservation&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scaling&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CPU&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metricResolver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;currentMemoryReservation&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scaling&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memory&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;nc&quot;&gt;ScaleMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;ScaleDown&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;nc&quot;&gt;ScaleMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;StaySame&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;nc&quot;&gt;ClusterMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scaling&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metricResolver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;periodSeconds&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;CloudwatchEmitter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cloudWatch&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;AmazonCloudWatch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;writeMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metric&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ClusterMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Unit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cluster&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Dimension&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;withName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ClusterName&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;withValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clusterName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;cloudWatch&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;putMetricData&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PutMetricDataRequest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;withMetricData&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MetricDatum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;withMetricName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ScaleStatus&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;withDimensions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cluster&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;withTimestamp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;DateTime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;now&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toDate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;withStorageResolution&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;periodSeconds&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;withValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scaleMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toDouble&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;withNamespace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Curalate/AutoScaling&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Wiring our ECS cluster to autoscale on this metric value in our Terraform configuration looks like this:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;resource &quot;aws_cloudwatch_metric_alarm&quot; &quot;cluster_scale_status_high_blue&quot; {
  count               = &quot;${var.autoscale_enabled}&quot;
  alarm_name          = &quot;${var.cluster_name}_ScaleStatus_high_blue&quot;
  comparison_operator = &quot;${lookup(var.alarm_high, &quot;comparison_operator&quot;)}&quot;
  evaluation_periods  = &quot;${lookup(var.alarm_high, &quot;evaluation_periods&quot;)}&quot;
  period              = &quot;${lookup(var.alarm_high, &quot;period&quot;)}&quot;
  statistic           = &quot;${lookup(var.alarm_high, &quot;statistic&quot;)}&quot;
  threshold           = &quot;${lookup(var.alarm_high, &quot;threshold&quot;)}&quot;
  metric_name         = &quot;ScaleStatus&quot;
  namespace           = &quot;Curalate/AutoScaling&quot;

  dimensions {
    ClusterName = &quot;${var.cluster_name}&quot;
  }

  alarm_description = &quot;High cluster resource usage&quot;
  alarm_actions     = [&quot;${aws_autoscaling_policy.scale_up_blue.arn}&quot;]
}

resource &quot;aws_cloudwatch_metric_alarm&quot; &quot;cluster_scale_status_low_blue&quot; {
  count               = &quot;${var.autoscale_enabled}&quot;
  alarm_name          = &quot;${var.cluster_name}_ScaleStatus_low_blue&quot;
  comparison_operator = &quot;${lookup(var.alarm_low, &quot;comparison_operator&quot;)}&quot;
  evaluation_periods  = &quot;${lookup(var.alarm_low, &quot;evaluation_periods&quot;)}&quot;
  period              = &quot;${lookup(var.alarm_low, &quot;period&quot;)}&quot;
  statistic           = &quot;${lookup(var.alarm_low, &quot;statistic&quot;)}&quot;
  threshold           = &quot;${lookup(var.alarm_low, &quot;threshold&quot;)}&quot;
  metric_name         = &quot;ScaleStatus&quot;
  namespace           = &quot;Curalate/AutoScaling&quot;

  dimensions {
    ClusterName = &quot;${var.cluster_name}&quot;
  }

  alarm_description = &quot;Low cluster resource usage&quot;
  alarm_actions     = [&quot;${aws_autoscaling_policy.scale_down_blue.arn}&quot;]
}

variable &quot;alarm_high&quot; {
  type = &quot;map&quot;

  default = {
    comparison_operator = &quot;GreaterThanThreshold&quot;
    evaluation_periods  = 4
    period              = 60
    statistic           = &quot;Maximum&quot;
    threshold           = 1
  }
}

variable &quot;alarm_low&quot; {
  type = &quot;map&quot;

  default = {
    comparison_operator = &quot;LessThanThreshold&quot;
    evaluation_periods  = 10
    period              = 60
    statistic           = &quot;Maximum&quot;
    threshold           = 1
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
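
&lt;p&gt;To make the alarm settings above concrete, here is a minimal Ruby sketch of how we would expect them to behave. It is a simplified model, not CloudWatch itself: it assumes one datapoint per 60 second period, assumes an alarm fires only when every one of the last &lt;code class=&quot;highlighter-rouge&quot;&gt;evaluation_periods&lt;/code&gt; datapoints breaches the threshold, and ignores missing data. The helper names and the sample &lt;code class=&quot;highlighter-rouge&quot;&gt;ScaleStatus&lt;/code&gt; values are made up for illustration.&lt;/p&gt;

```ruby
# Simplified, hypothetical model of the two alarms (not the CloudWatch API).
# Only the comparison, evaluation_periods and threshold values come from the
# Terraform variables above; everything else is an assumption.
ALARM_HIGH = { comparison: :greater_than, evaluation_periods: 4, threshold: 1 }.freeze
ALARM_LOW  = { comparison: :less_than, evaluation_periods: 10, threshold: 1 }.freeze

# True when a single per-minute datapoint is on the wrong side of the threshold.
def breaching?(alarm, value)
  case alarm[:comparison]
  when :greater_than then value > alarm[:threshold]
  when :less_than then alarm[:threshold] > value
  else raise ArgumentError, "unknown comparison: #{alarm[:comparison]}"
  end
end

# True when each of the most recent evaluation_periods datapoints is breaching.
def alarm_fires?(alarm, datapoints)
  window = datapoints.last(alarm[:evaluation_periods])
  return false unless window.length == alarm[:evaluation_periods]

  window.all? { |value| breaching?(alarm, value) }
end

# Sample per-minute maximums of ScaleStatus (made-up values).
puts alarm_fires?(ALARM_HIGH, [1, 2, 2, 2, 2])
puts alarm_fires?(ALARM_HIGH, [2, 2, 1, 2])
```

&lt;p&gt;With these defaults, scaling up needs 4 consecutive breaching minutes while scaling down needs 10, so the cluster grows quickly and shrinks cautiously.&lt;/p&gt;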

&lt;p&gt;We made our Lambda dynamically configurable: it loads its settings from our configuration system, which lets us onboard new clusters to monitor and tune the threshold values on the fly.&lt;/p&gt;

&lt;p&gt;You can see this in effect here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/2018-05-16-ecs/ecs_cloudwatch.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;host-draining-and-ecs-rescheduling&quot;&gt;Host draining and ECS rescheduling&lt;/h2&gt;

&lt;p&gt;This leads us to another problem. When the ASG scales down in response to a CloudWatch alarm, it puts the boxes into DRAINING. However, draining doesn’t necessarily mean that existing services have been rescheduled on other boxes! It just means that connections are drained from the existing hosts and that the ECS scheduler now needs to move the containers elsewhere. This is a problem if you are removing the 2 hosts that serve both of your HA containers: your service can end up at 0 instances! To solve this, we wired up a custom ASG lifecycle hook that polls the draining machines and makes sure that their containers are fully stopped and that the active cluster contains at least the minimum number of running instances of each service (each service defines its minimum acceptable capacity threshold and its minimum allowed running instances). For example, if a service can run at 50% capacity and its min is set to 20, then we need to verify that at &lt;em&gt;least&lt;/em&gt; 10 instances (50% of 20) are active before we allow the box to be removed, giving the ECS scheduler time to move things around.&lt;/p&gt;
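
&lt;p&gt;The capacity check the lifecycle hook performs can be sketched roughly as follows. This is an illustration under assumptions, not the actual hook code: the &lt;code class=&quot;highlighter-rouge&quot;&gt;ServiceLimits&lt;/code&gt; struct, its field names, and the shape of the running-task counts are all hypothetical.&lt;/p&gt;

```ruby
# Hypothetical per-service settings; the field names are assumptions.
ServiceLimits = Struct.new(:name, :min_instances, :min_capacity_fraction)

# Smallest number of running instances we tolerate while hosts are draining,
# e.g. a min of 20 at 50% capacity gives 10.
def required_running(limits)
  (limits.min_instances * limits.min_capacity_fraction).ceil
end

# Names of the services that would drop below their floor if the draining
# hosts were terminated right now. running_on_active_hosts maps a service
# name to the number of its instances running on non-draining hosts.
def services_blocking_termination(all_limits, running_on_active_hosts)
  blocked = all_limits.reject do |limits|
    running_on_active_hosts.fetch(limits.name, 0) >= required_running(limits)
  end
  blocked.map { |limits| limits.name }
end

limits = [ServiceLimits.new("api", 20, 0.5)]
blocking = services_blocking_termination(limits, { "api" => 7 })
puts blocking.empty? ? "safe to terminate" : "keep waiting on: #{blocking.join(', ')}"
```

&lt;p&gt;A hook built this way would keep polling, and only let the termination continue once the list of blocking services is empty and the draining hosts have no containers left.&lt;/p&gt;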

&lt;h2 id=&quot;cluster-upgrades&quot;&gt;Cluster upgrades&lt;/h2&gt;

&lt;p&gt;Solving cluster scaling and draining just introduced the next question: how do we do zero downtime cluster upgrades?  Because we now have many services running on the cluster, the blast radius for failure is much higher. If we fail a cluster upgrade we could take many of the services at Curalate down with us.&lt;/p&gt;

&lt;p&gt;Our solution, while not fully automated, is beautiful in its simplicity. Leveraging the draining Lambda, we keep all our clusters grouped into ASGs labeled &lt;code class=&quot;highlighter-rouge&quot;&gt;blue&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;green&lt;/code&gt;. To upgrade, we spin up the unused cluster with new backing AMIs and wait for it to reach a steady state. Then we tear down the old cluster and rely on the draining Lambda to prevent any race issues with the ECS scheduler.&lt;/p&gt;

&lt;p&gt;Each time we need to do a cluster upgrade, whether for security updates, base AMI changes, or other sweeping infrastructure-wide changes, we do a manual toggle using Terraform to drive the base changes.&lt;/p&gt;

&lt;p&gt;For example, our Terraform ECS cluster module in QA looks like this:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;module &quot;ecs_cluster_default_bg&quot; {
  source = &quot;github.com/curalate/infra-modules.git//aws-ecs-cluster-blue-green?ref=2018.03.07.20.09.12&quot;

  cluster_name                       = &quot;${aws_ecs_cluster.default.name}&quot;
  availability_zones                 = &quot;${data.terraform_remote_state.remote_env_state.availability_zones}&quot;
  environment                        = &quot;${var.environment}&quot;
  region                             = &quot;${var.region}&quot;
  iam_instance_profile               = &quot;${data.terraform_remote_state.remote_env_state.iam_instance_profile}&quot;
  key_name                           = &quot;${data.terraform_remote_state.remote_env_state.key_name}&quot;
  security_group_ids                 = &quot;${data.terraform_remote_state.remote_env_state.ecs_security_groups}&quot;
  subnet_ids                         = &quot;${data.terraform_remote_state.remote_env_state.public_subnet_ids}&quot;
  drain_hook_notification_target_arn = &quot;${module.drain_hook.notification_target_arn}&quot;
  drain_hook_role_arn                = &quot;${module.drain_hook.role_arn}&quot;
  autoscale_enabled                  = true

  root_volume_size = 50
  team             = &quot;devops&quot;

  blue = {
    image_id      = &quot;ami-5ac76b27&quot;
    instance_type = &quot;c4.2xlarge&quot;

    min_size         = 2
    max_size         = 15
    desired_capacity = 5
  }

  green = {
    image_id      = &quot;ami-c868b6b5&quot;
    instance_type = &quot;c3.2xlarge&quot;

    min_size         = 0
    max_size         = 0
    desired_capacity = 0
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;monitoring-with-statsd&quot;&gt;Monitoring with statsd&lt;/h2&gt;

&lt;p&gt;Curalate uses Datadog as our graph visualization tool, and we send metrics to Datadog via the dogstatsd agent installed on our boxes. Applications emit UDP events to the dogstatsd agent, which aggregates them and sends them to Datadog over TCP.&lt;/p&gt;

&lt;p&gt;In the containerized world we had 3 options for sending metrics:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Send directly from the app&lt;/li&gt;
  &lt;li&gt;Deploy all containers with a sidecar of statsd&lt;/li&gt;
  &lt;li&gt;Proxy messages to the host box and leave dogstatsd on the host&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We went with option 3, since option 1 makes sweeping upgrades difficult and option 2 uses extra resources on ECS that we didn’t want to spend.&lt;/p&gt;

&lt;p&gt;However, we needed a way to deterministically write messages from a Docker container to its host. To do this we leveraged the docker0 bridge network:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# returns the x.x.x.1 IP of the docker0 bridge&lt;/span&gt;
get_data_dog_host&lt;span class=&quot;o&quot;&gt;(){&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;# extracts the ip address from eth0 of 1.2.3.4 and splits off the last octet (returning 1.2.3)&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;BASE_IP&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;ifconfig | &lt;span class=&quot;nb&quot;&gt;grep &lt;/span&gt;eth0 &lt;span class=&quot;nt&quot;&gt;-A&lt;/span&gt; 1 | &lt;span class=&quot;nb&quot;&gt;grep &lt;/span&gt;inet | awk &lt;span class=&quot;s1&quot;&gt;'{print $2}'&lt;/span&gt; | sed &lt;span class=&quot;s2&quot;&gt;&quot;s/addr://&quot;&lt;/span&gt; | cut &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f1-3&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;

    &lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;BASE_IP&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;.1&quot;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We then configure our apps to send their messages to this IP.&lt;/p&gt;
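
&lt;p&gt;For illustration, here is roughly what the application side could look like in Ruby. It is a sketch under assumptions: the function names are hypothetical, the container address is passed in rather than read from eth0, and the metric uses the plain statsd counter format sent to the default dogstatsd UDP port, 8125.&lt;/p&gt;

```ruby
require 'socket'

# Swap the last octet of the container's own address for 1, mirroring the
# shell helper above. Assumes the docker0 bridge gateway sits at x.x.x.1.
def bridge_gateway_for(container_ip)
  octets = container_ip.split('.')
  raise ArgumentError, "not an IPv4 address: #{container_ip}" unless octets.length == 4

  (octets.first(3) + ['1']).join('.')
end

# Format a counter in the plain statsd wire format, e.g. "app.requests:1|c".
def counter_payload(metric, value = 1)
  "#{metric}:#{value}|c"
end

# Fire-and-forget UDP send to the agent on the host.
def send_counter(host, metric, value = 1, port = 8125)
  socket = UDPSocket.new
  socket.send(counter_payload(metric, value), 0, host, port)
ensure
  socket.close if socket
end

# 172.17.0.5 is a made-up container address.
puts bridge_gateway_for('172.17.0.5')
puts counter_payload('app.requests')
```

&lt;p&gt;Because the transport is UDP, a send does not block on the agent and does not fail the request if the agent is briefly unavailable; the metric is simply dropped.&lt;/p&gt;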

&lt;h2 id=&quot;cleanup&quot;&gt;Cleanup&lt;/h2&gt;

&lt;p&gt;One thing we chose to do was to volume mount our log folders to the host system for semi-archival reasons.  By mounting our application logs to the host, if the container crashed or was removed from Docker, we’d still have a record of what happened.&lt;/p&gt;

&lt;p&gt;That said, containers are transient; they come and go all the time. The first question we had was “where do logs go?”. What folder do we mount them to? For us, we chose to mount all container logs in the following schema:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;/var/log/curalate/&amp;lt;service-name&amp;gt;/containers/&amp;lt;container-sha&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This way we can correlate the logs in a particular folder back to a particular container. If multiple instances of the same image are running on a host, their logs don’t stomp on each other.&lt;/p&gt;

&lt;p&gt;We normally have a log rotator on our AMI boxes that handles long-running log files. However, in our AMI-based deployments, machines and clusters are immutable and singular: only one service is allowed to sit on an EC2 host, and as we do new deploys the old logs are removed along with the box.&lt;/p&gt;

&lt;p&gt;In the new world the infrastructure is immutable at the container level, not the VM level. The base VM still has a log rotator for all the container logs, but we didn’t account for the fact that services start, stop, and deploy hundreds of times daily, leaving hundreds of rotated log files in stale folders.&lt;/p&gt;

&lt;p&gt;After the first disk alert, though, we added the following cron jobs:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ubuntu@ip-172-17-50-242:~&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;crontab &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Chef Name: container-log-prune&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;/10 &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; /opt/curalate/docker/prune.rb
&lt;span class=&quot;c&quot;&gt;# Chef Name: volume-and-image-prune&lt;/span&gt;
0 0 &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; docker images &lt;span class=&quot;nt&quot;&gt;-q&lt;/span&gt; | xargs docker rmi &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker system prune &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We have 2 things happening here. The first is a Ruby script that lists the running containers and then deletes every container log folder matched by the recursive glob whose container is no longer active. Per the crontab above, it runs every ten minutes.&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;#!/usr/bin/env ruby&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'set'&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'fileutils'&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'English'&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;containers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sb&quot;&gt;`docker ps --format '{{.ID}}'`&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;to_set&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;unless&lt;/span&gt; &lt;span class=&quot;vg&quot;&gt;$CHILD_STATUS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;success?&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;puts&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Failed to query docker'&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;exit&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;dirs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;Dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;glob&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'/var/log/**/containers/*'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;to_delete&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dirs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;reject&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;include?&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;File&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;basename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;to_delete&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;each&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;puts&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Deleting &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;

  &lt;span class=&quot;no&quot;&gt;FileUtils&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;rm_rf&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The second job is pretty straightforward: we leverage the Docker system prune command to remove old volume overlay data, unused images, and other leftover system data. We run it only once a day because we want to reuse the images already downloaded on a box to speed up autoscaling. We’re OK with taking a once-daily hit to re-download the image base layers if a scaling event happens after the midnight prune.&lt;/p&gt;

&lt;h2 id=&quot;jmxmp&quot;&gt;JMXMP&lt;/h2&gt;
&lt;p&gt;JMX is a critical tool in our toolbox here at Curalate, as nearly all of our services and applications are written in Scala on the JVM. Normally in our AMI deployments we fully control which ports are open, and they are deterministic: if we open port 5555, it’s always open on that box. However, when many services run on the same host, we need to leverage Docker’s dynamic port mapping, which makes knowing &lt;em&gt;which&lt;/em&gt; port maps to what more difficult.&lt;/p&gt;

&lt;p&gt;Normally this isn’t really an issue, as services that do need to expose ports to either each other or the public are routed through an ALB that manages that for us. But JMX is a different beast.  JMX, in its ultimate wisdom, requires a 2 port handshake in order to connect. What this means is that the port you connect to on JMX is not the ultimate port you &lt;em&gt;communicate&lt;/em&gt; over in JMX. When you make a JMX connection to the connection port it replies back with the communication port and you then communicate on that.&lt;/p&gt;

&lt;p&gt;In the world of dynamic port mappings, we can find the first port from the mapping, but there is no way for us to know the &lt;em&gt;second&lt;/em&gt; port. This is because the container itself has no information about what its port mapping is; for all it knows, its port is what it says it is!&lt;/p&gt;

&lt;p&gt;Thankfully there is a solution: JMXMP, a JMX connector protocol that communicates over a single port. With some research, starting from &lt;a href=&quot;https://github.com/frankgrimes97/jmxmp-java-agent&quot;&gt;this jmxmp-java-agent project&lt;/a&gt;, we rolled our own JMXMP Java agent:&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;package&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;com&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;curalate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;agents&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;jmxmp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;javax.management.MBeanServer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;javax.management.remote.JMXConnectorServer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;javax.management.remote.JMXConnectorServerFactory&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;javax.management.remote.JMXServiceURL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.lang.instrument.Instrumentation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.lang.management.ManagementFactory&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.net.Inet4Address&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.net.InetAddress&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.net.NetworkInterface&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.net.UnknownHostException&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.util.Collections&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.util.HashMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.util.Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Agent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;premain&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;agentArgs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Instrumentation&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Boolean&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;enableLogging&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Boolean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;valueOf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getProperty&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;javax.management.remote.jmx.enable_logging&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;false&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;Boolean&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bindEth0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Boolean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;valueOf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getProperty&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;javax.management.remote.jmx.bind_eth0&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jmxEnvironment&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HashMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;();&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;jmxEnvironment&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;put&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;jmx.remote.server.address.wildcard&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;false&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

            &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;defaultHostAddress&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bindEth0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getEth0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getLocalHost&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;JMXServiceURL&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jmxUrl&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JMXServiceURL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getProperty&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;javax.management.remote.jmxmp.url&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;service:jmx:jmxmp://&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;defaultHostAddress&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;:5555&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;MBeanServer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mbs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ManagementFactory&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getPlatformMBeanServer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;JMXConnectorServer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jmxRemoteServer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JMXConnectorServerFactory&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;newJMXConnectorServer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;jmxUrl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jmxEnvironment&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mbs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;enableLogging&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Starting jmxmp agent on '&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jmxUrl&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;'&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;jmxRemoteServer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Throwable&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;enableLogging&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;printStackTrace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getEth0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Collections&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NetworkInterface&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getByName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;eth0&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getInetAddresses&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt;
                              &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;stream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
                              &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;isLoopbackAddress&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;instanceof&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Inet4Address&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
                              &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;findFirst&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
                              &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;Object:&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
                              &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;orElse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;127.0.0.1&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;127.0.0.1&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getLocalHost&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;InetAddress&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getLocalHost&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getHostName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UnknownHostException&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;127.0.0.1&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That we bundle in all our service startups:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;exec java -agentpath:/usr/local/lib/libheapster.so -javaagent:agents/jmxmp-agent.jar $JVM_OPTS $JVM_ARGS -jar
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;JMXMP does basically the same thing as JMX, except it only requires &lt;em&gt;one&lt;/em&gt; port to be open.  By standardizing our ports on port 5555 we can look up the 5555 port mapping in ECS and connect to it via JMXMP and get all our “favorite” JMX goodies (if you’re doing JMX you’re already in a bad place).&lt;/p&gt;

&lt;p&gt;For full reference, all our Dockerized Java apps have a main entrypoint that Docker executes, shown below. This gives us some sane default JVM settings but also exposes a way for us to manually override any of the settings via the &lt;code class=&quot;highlighter-rouge&quot;&gt;JVM_ARGS&lt;/code&gt; env var (which we can set during our Terraform deployments).&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env bash&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;HOST_IP&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;localhost&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Entrypoint for service start&lt;/span&gt;
main&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    set_host_ip

    &lt;span class=&quot;nv&quot;&gt;DATADOG_HOST&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;get_data_dog_host&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;# locate the fat jar (first match only)&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;BIN_JAR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;ls&lt;/span&gt; /app/bin/&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.jar | head &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; 1&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;

    &lt;span class=&quot;nv&quot;&gt;LOG_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/var/log/&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;HOSTNAME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;

    mkdir &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LOG_PATH&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;
    mkdir &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; /heap_dumps

    &lt;span class=&quot;nv&quot;&gt;JVM_OPTS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&quot;
        -server &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/heap_dumps &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -Xmx512m -Xms512m &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -Dsun.net.inetaddr.ttl=5 &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -Dcom.sun.management.jmxremote.port=1045 &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -Dcom.sun.management.jmxremote.authenticate=false &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -Dcom.sun.management.jmxremote.ssl=false &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -Dcontainer.id=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;HOSTNAME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -Dhostname=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;HOST_IP&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -Dlog.service.output=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LOG_PATH&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/service.log &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -Dlog.access.output=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LOG_PATH&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/access.log &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -Dloggly.enabled=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LOGGLY_ENABLED&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -Ddatadog.host=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;DATADOG_HOST&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
        -Ddatadog.defaultTags=application:&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CLOUD_APP&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;
    &quot;&quot;&quot;&lt;/span&gt;

    &lt;span class=&quot;nb&quot;&gt;exec &lt;/span&gt;java &lt;span class=&quot;nt&quot;&gt;-agentpath&lt;/span&gt;:/usr/local/lib/libheapster.so &lt;span class=&quot;nt&quot;&gt;-javaagent&lt;/span&gt;:agents/jmxmp-agent.jar &lt;span class=&quot;nv&quot;&gt;$JVM_OPTS&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$JVM_ARGS&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-jar&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;BIN_JAR&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$@&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Set the host IP variable of the EC2 host instance&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# by querying the EC2 metadata api&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# if there is no response after the timeout we'll default to localhost&lt;/span&gt;
set_host_ip &lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;200&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;$(&lt;/span&gt;/usr/bin/curl &lt;span class=&quot;nt&quot;&gt;--connect-timeout&lt;/span&gt; 2 &lt;span class=&quot;nt&quot;&gt;--max-time&lt;/span&gt; 2 &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; /dev/null &lt;span class=&quot;nt&quot;&gt;-w&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;%{http_code}&quot;&lt;/span&gt; 169.254.169.254/latest/meta-data/local-ipv4&lt;span class=&quot;k&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;then
        &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;HOST_IP&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;$(&lt;/span&gt;curl 169.254.169.254/latest/meta-data/local-ipv4&lt;span class=&quot;k&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;else
        &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;HOST_IP&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;$(&lt;/span&gt;hostname&lt;span class=&quot;k&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;HOST_IP&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;HOSTNAME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;then
            &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;HOST_IP&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;localhost&quot;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;fi
    fi&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# returns x.x.x.1 base ip of the docker0 bridge IP&lt;/span&gt;
get_data_dog_host&lt;span class=&quot;o&quot;&gt;(){&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;# extracts the ip address from eth0 of 1.2.3.4 and splits off the last octet (returning 1.2.3)&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;BASE_IP&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;ifconfig | &lt;span class=&quot;nb&quot;&gt;grep &lt;/span&gt;eth0 &lt;span class=&quot;nt&quot;&gt;-A&lt;/span&gt; 1 | &lt;span class=&quot;nb&quot;&gt;grep &lt;/span&gt;inet | awk &lt;span class=&quot;s1&quot;&gt;'{print $2}'&lt;/span&gt; | sed &lt;span class=&quot;s2&quot;&gt;&quot;s/addr://&quot;&lt;/span&gt; | cut &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f1-3&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;

    &lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;BASE_IP&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;.1&quot;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# execute main&lt;/span&gt;
main &lt;span class=&quot;nv&quot;&gt;$@&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;choosing-how-to-group-a-cluster&quot;&gt;Choosing how to group a cluster&lt;/h2&gt;
&lt;p&gt;One thing we wrestled with is how to choose &lt;em&gt;where&lt;/em&gt; a service will go.  For the most part we have a &lt;code class=&quot;highlighter-rouge&quot;&gt;default&lt;/code&gt; cluster comprised of &lt;code class=&quot;highlighter-rouge&quot;&gt;c4.2xl&lt;/code&gt;’s that everyone is allowed to deploy to.&lt;/p&gt;

&lt;p&gt;I wanted to call out that choosing what service goes on what cluster and what machine types comprise a cluster can be tricky. For our GPU-based services, the choice is obvious in that they go onto a cluster that has GPU acceleration.  For other clusters we tried smaller machines with fewer containers, and we tried larger machines with more containers.  We found that we preferred fewer, larger machines since most of our services are not running at full capacity, so they get the benefit of extra IO and memory without overloading the host system. With smaller boxes we had less headroom and it was more difficult to pack services with varying memory/CPU reservation needs.&lt;/p&gt;

&lt;p&gt;On that note however, we’ve also chosen to segment some high priority applications onto their own clusters. These are services that under no circumstances can fail, or require more than average resources (whether IO or otherwise), or are particularly unstable.  While we don’t get the cost savings by binpacking services onto that cluster, we still get the fast deploy/rollback/scaling properties with containers so we still consider it a net win.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;ECS was really easy to get started on, but as with any production system there are always gotchas.  Overall we’re really pleased with the experience, even though it wasn’t pain-free.  In the end, we can now deploy in seconds, roll back in seconds, and still enjoy a pseudo-immutable infrastructure that is simple to reason about as well as locally reproducible!&lt;/p&gt;
</description>
        <pubDate>Wed, 16 May 2018 09:45:39 +0000</pubDate>
        <link>http://engineering.curalate.com/2018/05/16/productionalizing-ecs.html</link>
        <guid isPermaLink="true">http://engineering.curalate.com/2018/05/16/productionalizing-ecs.html</guid>
        
        <category>ecs</category>
        
        <category>aws</category>
        
        
      </item>
    
      <item>
        <title>Choosing a Deep Learning library for developing and deploying your App/Service</title>
        <description>&lt;p&gt;Interest in deep learning is &lt;a href=&quot;https://medium.com/intuitionmachine/8-exponential-hockey-stick-charts-for-deep-learning-74bba7a0284c&quot;&gt;growing&lt;/a&gt; and &lt;a href=&quot;https://trends.google.com/trends/explore?date=all&amp;amp;q=deep%20learning,machine%20learning&quot;&gt;growing&lt;/a&gt; and, with it at &lt;a href=&quot;https://www.gartner.com/smarterwithgartner/top-trends-in-the-gartner-hype-cycle-for-emerging-technologies-2017/&quot;&gt;peak hype&lt;/a&gt; right now, a lot of people are looking to find the best deep learning library to build their new app or bring their company into the modern age.
There are many deep learning toolkits to choose from ranging from the long used, supported, and robust academic libraries to the new state-of-the-art, industry backed platforms.&lt;/p&gt;

&lt;p&gt;At Curalate, we’ve been working on deep learning problems since 2014, meaning we’ve had the chance to watch the deep learning community and its open source libraries grow.
We have also had the fortunate (unfortunate?) experience of using a few of the deep learning libraries in our production services and applications, and along the way, we have learned a lot about what to look for in a deep learning library to build reliable, production-ready applications and services.
In this post, I’ll share the lessons we’ve learned in hopes they will help you in your search for the perfect deep learning library match.
You might even find that your best fit is using more than one!&lt;/p&gt;

&lt;h1 id=&quot;important-factors&quot;&gt;Important factors&lt;/h1&gt;

&lt;h2 id=&quot;the-specifics-needs-of-your-applicationservice&quot;&gt;The specific needs of your application/service&lt;/h2&gt;
&lt;h3 id=&quot;the-platform-you-are-developing-on-and-deploying-to&quot;&gt;The platform you are developing on and deploying to.&lt;/h3&gt;

&lt;p&gt;Develop in OSX? Linux? Windows? Plan on having your application run in a web browser? A smart phone? A massive multi-node GPU cluster?
It’s not surprising that each of the libraries has prioritized different environments, and some will work much better for your specific situation.&lt;/p&gt;

&lt;h3 id=&quot;the-specific-deep-net-architecture-you-are-trying-to-implement&quot;&gt;The specific deep net architecture you are trying to implement&lt;/h3&gt;

&lt;p&gt;If you are just trying to implement a typical, pre-trained classification net, this factor may not be as important for you.
Some libraries are more performant and appropriate for certain types of deep nets (LSTMs, RNNs), but more on this later.&lt;/p&gt;

&lt;h3 id=&quot;api-language-requirements&quot;&gt;API language requirements&lt;/h3&gt;

&lt;p&gt;If you already have a code base written in language A, you probably would like to keep it that way without having to figure out some &lt;a href=&quot;/2016/04/29/bridging-scala-to-c++-with-bridj.html&quot;&gt;convoluted way&lt;/a&gt; to fit a deep net interface in language B into it.
Luckily, it seems that most of the common languages are covered at this point in at least one of the libraries, or in an external community project.&lt;/p&gt;

&lt;h2 id=&quot;codebase-quality&quot;&gt;Codebase Quality&lt;/h2&gt;
&lt;h3 id=&quot;is-the-code-base-actively-maintained&quot;&gt;Is the code base actively maintained?&lt;/h3&gt;

&lt;p&gt;How healthy is the project in terms of maintainers? Is there a large group/company committing time and resources to the library’s development?
If you find a bug or issue with the library, how long is it going to take for it to get addressed?&lt;/p&gt;

&lt;h3 id=&quot;release-status-of-the-library-itself&quot;&gt;Release status of the library itself&lt;/h3&gt;

&lt;p&gt;Is the library or a certain feature/API you are going to need still considered to be in an Alpha or Beta state?
Has the library been used enough to have most of the kinks ironed out?&lt;/p&gt;

&lt;h2 id=&quot;ease-of-use&quot;&gt;Ease of Use&lt;/h2&gt;
&lt;h3 id=&quot;train-to-production-pipeline&quot;&gt;Train to production pipeline&lt;/h3&gt;

&lt;p&gt;Your model training code and production code do not have to run in the same environments or even the same language.
Can you train your model with a quick-to-prototype language in a documented, version-controlled, repeatable way so you can research new and different models for your application?
Then can you deploy your saved model in a fairly quick and painless fashion?
That may be through the same library with a different language API, using a library’s prebuilt production-serving framework, or even &lt;a href=&quot;https://github.com/ysh329/deep-learning-model-convertor&quot;&gt;converting your model&lt;/a&gt; from one library to another that is better suited for your target platform.&lt;/p&gt;

&lt;h3 id=&quot;keras-support&quot;&gt;Keras support&lt;/h3&gt;

&lt;p&gt;Does the library have support for being used as a backend for &lt;a href=&quot;https://keras.io/&quot;&gt;Keras&lt;/a&gt;?
Keras is not a deep learning library per se, but a library that sits on top of other deep learning libraries and provides a single, easy to use, high-level interface to write and train deep learning models.
What it lacks in optimizations, it makes up for with great documentation and a modular, object-oriented design, which makes it great for beginners.&lt;/p&gt;

&lt;h3 id=&quot;dynamic-vs-static-computation&quot;&gt;Dynamic vs Static computation&lt;/h3&gt;

&lt;p&gt;Now we could write a whole blog post on this topic alone, but to keep it brief, do you want to work with a static computation graph API that follows a symbolic programming paradigm?
Or do you want a dynamic computation graph API that follows an imperative programming paradigm?&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Static Computation Graphing
    &lt;ul&gt;
      &lt;li&gt;You define the deep net once, and use a session to execute ops in the net many times.&lt;/li&gt;
      &lt;li&gt;The library can optimize the net before you use it, so the nets end up being more efficient with memory and speed.&lt;/li&gt;
      &lt;li&gt;Good for fixed-size nets (feed-forward, CNNs)&lt;/li&gt;
      &lt;li&gt;Leads to the API being a more verbose, harder-to-debug domain-specific language (DSL)&lt;/li&gt;
      &lt;li&gt;Offers better loading and model management with regard to system resources.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Dynamic Computation Graphing
    &lt;ul&gt;
      &lt;li&gt;Nets are built and rebuilt at runtime, and executed line by line how you define them. This lets you use standard imperative language (think Python) statements, features, and control structures.&lt;/li&gt;
      &lt;li&gt;Tends to be more flexible and useful for when the net structure needs to change at runtime, like in RNNs&lt;/li&gt;
      &lt;li&gt;Makes debugging easy since an error is not thrown in a single call to execute the net after it’s compiled, but at the specific line in the dynamic graph at runtime.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
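
&lt;p&gt;To make the distinction concrete, here is a toy sketch in plain Python. It uses no deep learning library at all: &lt;code class=&quot;highlighter-rouge&quot;&gt;Placeholder&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;Op&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;run&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;dynamic_forward&lt;/code&gt; are hypothetical names made up for this illustration, and real libraries do far more (optimization, GPU placement, gradients). The static half describes a computation once and then executes it many times; the dynamic half lets ordinary Python control flow decide the shape of the computation per input.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
```python
# Toy illustration only: these classes are hypothetical, not any library's API.

class Placeholder:
    '''A named input whose value is supplied when the graph is run.'''
    def __init__(self, name):
        self.name = name

    def evaluate(self, feed):
        return feed[self.name]


class Op:
    '''A graph node that applies a function to the results of its input nodes.'''
    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def evaluate(self, feed):
        return self.fn(*(node.evaluate(feed) for node in self.inputs))


# Static style: describe y = (x * w) + b once, up front. Nothing is computed yet.
x = Placeholder('x')
w = Placeholder('w')
b = Placeholder('b')
static_graph = Op(lambda p, q: p + q, Op(lambda p, q: p * q, x, w), b)


def run(graph, feed):
    '''Stand-in for a session: execute the prebuilt graph with concrete values.'''
    return graph.evaluate(feed)


# Dynamic style: the computation is whatever Python code path this input takes.
def dynamic_forward(sequence, w, b):
    state = 0.0
    for value in sequence:  # the number of steps depends on the input itself
        state = state * w + value + b
    return state


# The same static graph is reused for every input.
static_results = [run(static_graph, {'x': v, 'w': 2.0, 'b': 1.0}) for v in (1.0, 2.0, 3.0)]

# The dynamic version unrolls to a different depth for each sequence length.
dynamic_short = dynamic_forward([1.0, 2.0], w=0.5, b=0.0)
dynamic_long = dynamic_forward([1.0, 2.0, 3.0], w=0.5, b=0.0)
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that an error inside &lt;code class=&quot;highlighter-rouge&quot;&gt;dynamic_forward&lt;/code&gt; would surface at the offending line, while an error in the static version would only surface inside the call to &lt;code class=&quot;highlighter-rouge&quot;&gt;run&lt;/code&gt;, which is the debugging trade-off described above.&lt;/p&gt;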

&lt;h2 id=&quot;support&quot;&gt;Support&lt;/h2&gt;
&lt;h3 id=&quot;documentation&quot;&gt;Documentation&lt;/h3&gt;

&lt;p&gt;How good is the documentation?
Are there coding examples that cover most of the use cases you need?
Are you used to getting your documentation in a certain style from a specific company?&lt;/p&gt;

&lt;h3 id=&quot;community-support&quot;&gt;Community support&lt;/h3&gt;

&lt;p&gt;How large is the community?
Just because a deep learning library is really good does not mean people are actually using it.
Are you going to be able to find 3rd party blog posts, code samples, and tutorials using the library?
If you run into a problem, what is the chance you are going to find someone on Stack Overflow with the answer to your problem?&lt;/p&gt;

&lt;h3 id=&quot;research&quot;&gt;Research&lt;/h3&gt;

&lt;p&gt;Does the research community actively use the library to develop state-of-the-art deep learning models and solutions?
A lot of state-of-the-art discoveries made by the academic community require modification to the deep learning libraries themselves and it’s pretty common for research groups to release their source code for conference papers to the public.
Most of these new models will be released as pretrained models and listed in a Model Zoo specific to the library.
Porting these solutions between libraries is not a trivial task if you are not comfortable reimplementing the research paper.&lt;/p&gt;

&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
&lt;h3 id=&quot;performance-with-specific-network-structures&quot;&gt;Performance with specific network structures&lt;/h3&gt;

&lt;p&gt;How fast does your planned network structure run on each of the deep learning libraries?
Will you be able to train and prototype your models faster on one vs another?
If you are deploying to a service, how many requests per second can you expect to run through the library?&lt;/p&gt;

&lt;h3 id=&quot;scalability&quot;&gt;Scalability&lt;/h3&gt;

&lt;p&gt;How well does the library scale when you start providing it with more resources to meet your production load?
Can you save money by using a more efficient scaling library over another? (Cloud GPU instances can be &lt;a href=&quot;https://aws.amazon.com/ec2/instance-types/p2/&quot;&gt;really&lt;/a&gt; &lt;a href=&quot;https://aws.amazon.com/ec2/instance-types/p3/&quot;&gt;expensive&lt;/a&gt;)&lt;/p&gt;

&lt;h1 id=&quot;the-libraries&quot;&gt;The Libraries&lt;/h1&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;div class=&quot;column left&quot;&gt;
  &lt;ul class=&quot;scorecard&quot;&gt;
    &lt;li class=&quot;header&quot;&gt;&lt;a href=&quot;https://github.com/BVLC/caffe&quot;&gt;Caffe&lt;/a&gt;&lt;/li&gt;
    &lt;li class=&quot;grey&quot;&gt;C++, Python, Matlab&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;http://caffe.berkeleyvision.org/&quot;&gt;UC Berkeley&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;Watches: 2,161 | Stars: 23,338&lt;br /&gt;Forks: 14,247&lt;br /&gt;Avg Issue Resolution: 61 Days&lt;br /&gt;Open issues: 15%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;No Keras support&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://scholar.google.com/scholar?cites=1739257544589912763&amp;amp;as_sdt=40005&amp;amp;sciodt=0,10&amp;amp;hl=en&quot;&gt;Research Citations:&lt;/a&gt; 7081&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://github.com/BVLC/caffe/wiki/Model-Zoo&quot;&gt;Model zoo&lt;/a&gt;&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/BVLC/caffe&quot;&gt;Caffe&lt;/a&gt;, with its unparalleled performance and well-tested C++ codebase, was basically the first mainstream, production-grade deep learning library.
Caffe is good for implementing CNNs, image processing, and for fine-tuning pre-trained nets.
In fact, you can do all of these things while writing little to no code.
You just place your training/validation data (mainly pictures) in a specific folder, set up config files for the deep net and its training parameters, and then call a precompiled Caffe binary that trains your net.&lt;/p&gt;

&lt;p&gt;Being first to market means that a lot of early research and models were written with Caffe, and the research that built off of that forked and continued to use the same code base.
Because of this, you will find a lot of state-of-the-art work, even to this day, still using Caffe despite its limitations.
A lot of these models can be found in the &lt;a href=&quot;https://github.com/BVLC/caffe/wiki/Model-Zoo&quot;&gt;Caffe Model Zoo&lt;/a&gt;, which is one of the first and largest (if not &lt;strong&gt;the&lt;/strong&gt; largest) model zoos.&lt;/p&gt;

&lt;p&gt;But now we have to start talking about its limitations.
Caffe was built and designed around an original intended use case: conventional CNN applications.
Because of this, Caffe is not very flexible.
Overall, it’s not very good for RNNs and LSTM networks.
Even with its adoption of CMake, building the library can still be a pain (especially for non-Linux environments).
It has little support for multiple GPUs (training only) and can only be deployed to a server environment.
The configuration files to define the deep net structure are very cumbersome.
The prototxt for ResNet-152 is 6775 lines long!&lt;/p&gt;

&lt;p&gt;In Caffe, the deep net is treated as a collection of layers, as opposed to nodes of single tensor operations.
Layers can be thought of as a composition of multiple tensor operations.
These layers are not very flexible and there are &lt;a href=&quot;https://github.com/BVLC/caffe/tree/master/src/caffe/layers&quot;&gt;a lot&lt;/a&gt; of them that duplicate similar logic internally.
Because Caffe does not support auto differentiation, if you want to develop new layer types, you have to define the full forward and backwards gradient updates.
You can define these layers in Caffe’s Python interface, but unlike other libraries where the Python interface is accelerated by their underlying C implementations, Caffe Python layers run in Python.&lt;/p&gt;
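
&lt;p&gt;As a rough idea of what “defining the full forward and backward pass” means without auto differentiation, here is a stand-alone sketch in plain Python. It does not import or use Caffe: &lt;code class=&quot;highlighter-rouge&quot;&gt;SquareLayer&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;numeric_grad&lt;/code&gt; are hypothetical names chosen for this example, and the method signatures are simplified assumptions rather than Caffe’s actual layer interface. The point is only that the gradient has to be derived and written by hand, and that a numerical check is a cheap way to catch mistakes in it.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
```python
# Hypothetical stand-alone sketch; it does not import or use Caffe.

class SquareLayer:
    '''Computes y = x * x elementwise, with a hand-derived gradient.'''

    def forward(self, bottom):
        # Remember the input: the backward pass needs it.
        self.bottom = list(bottom)
        return [v * v for v in self.bottom]

    def backward(self, top_diff):
        # Chain rule written out by hand: dL/dx = dL/dy * dy/dx = top_diff * 2x.
        return [g * 2.0 * v for g, v in zip(top_diff, self.bottom)]


def numeric_grad(layer, bottom, index, eps=1e-6):
    '''Central-difference estimate of d(sum of outputs) / d(bottom[index]).'''
    plus = list(bottom)
    minus = list(bottom)
    plus[index] += eps
    minus[index] -= eps
    return (sum(layer.forward(plus)) - sum(layer.forward(minus))) / (2.0 * eps)


layer = SquareLayer()
inputs = [1.0, -2.0, 3.0]
outputs = layer.forward(inputs)

# The loss here is the sum of the outputs, so the incoming gradient is all ones.
analytic = layer.backward([1.0] * len(inputs))

# Compare the hand-written gradient against a numerical estimate.
numeric = [numeric_grad(layer, inputs, i) for i in range(len(inputs))]
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In a library with auto differentiation you would write only the forward computation and the backward pass would be derived for you.&lt;/p&gt;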

&lt;p&gt;So should you use Caffe?
If you are looking to reimplement some specific model from a research paper from 2015 using existing, open source code, it is not a bad library.
If you are looking for raw performance and not opposed to using a C++ library and API on a GPU server for your service/app, Caffe is still one of the &lt;a href=&quot;https://arxiv.org/pdf/1608.07249.pdf&quot;&gt;fastest&lt;/a&gt; libraries around for fully connected networks.&lt;/p&gt;

&lt;p&gt;But because of its limitations and technical debt, a lot of the community and its efforts have moved on from Caffe in some form or another.
Caffe is a special case when it comes to &lt;a href=&quot;https://github.com/ysh329/deep-learning-model-convertor&quot;&gt;model converters&lt;/a&gt;: it is the best-supported library, with converters to almost all other deep learning libraries, making it easier to move your work off of it.
The &lt;a href=&quot;http://daggerfs.com/&quot;&gt;creator of Caffe&lt;/a&gt; has been hired by Google to work on their deep learning library TensorFlow, and now by Facebook to create a successor to Caffe in the appropriately named Caffe2.&lt;/p&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;div class=&quot;author-blurb&quot;&gt;
&lt;/div&gt;

&lt;div class=&quot;column right&quot;&gt;
  &lt;ul class=&quot;scorecard&quot;&gt;
    &lt;li class=&quot;header&quot;&gt;&lt;a href=&quot;https://github.com/torch/torch7&quot;&gt;Torch&lt;/a&gt;&lt;/li&gt;
    &lt;li class=&quot;grey&quot;&gt;Lua, C++&lt;/li&gt;
    &lt;li&gt;Deepmind, NYU, IDIAP&lt;/li&gt;
    &lt;li&gt;Watches: 675 | Stars: 7,761&lt;br /&gt;Forks: 2,254&lt;br /&gt;Avg Issue Resolution: 55 Days&lt;br /&gt;Open issues: 33%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;No Keras support&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://scholar.google.com/scholar?cites=5370776646193960982&amp;amp;as_sdt=40005&amp;amp;sciodt=0,10&amp;amp;hl=en&quot;&gt;Research Citations:&lt;/a&gt; 955&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://github.com/carpedm20/awesome-torch&quot;&gt;Model zoo&lt;/a&gt;&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;div class=&quot;column right&quot;&gt;
  &lt;ul class=&quot;scorecard&quot;&gt;
    &lt;li class=&quot;header&quot;&gt;&lt;a href=&quot;https://github.com/pytorch/pytorch&quot;&gt;PyTorch&lt;/a&gt;&lt;/li&gt;
    &lt;li class=&quot;grey&quot;&gt;Python&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://research.fb.com/&quot;&gt;Facebook&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;Watches: 690 | Stars: 13,111&lt;br /&gt;Forks: 2,795&lt;br /&gt;Avg Issue Resolution: 2 Days&lt;br /&gt;Open issues: 18%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;No Keras support&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://scholar.google.com/scholar?cites=7333966966476165372&amp;amp;as_sdt=5,33&amp;amp;sciodt=0,33&amp;amp;hl=en&quot;&gt;Research Citations:&lt;/a&gt; 16&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://github.com/pytorch/vision&quot;&gt;Model zoo&lt;/a&gt;&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/torch/torch7&quot;&gt;Torch&lt;/a&gt; and &lt;a href=&quot;https://github.com/pytorch/pytorch&quot;&gt;PyTorch&lt;/a&gt; are related by much more than just their name.
Torch was one of the original, academic-created deep learning libraries. While it may not be cited in as much published research as some of the other libraries, it still has a very large community around it.
Many of the researchers who originally worked on Torch moved to Facebook. 
Unsurprisingly, Facebook has since developed the successor to Torch in the form of PyTorch.
PyTorch and Torch use the same underlying C libraries (TH, THC, THNN, and THCUNN), which give them very similar performance characteristics.
When it comes to typical deep learning architectures, Torch offers some of the fastest, but not the &lt;a href=&quot;https://arxiv.org/pdf/1608.07249.pdf&quot;&gt;fastest&lt;/a&gt;, performance around with GPU scaling efficiency that matches the best.&lt;/p&gt;

&lt;p&gt;Where Torch and PyTorch differ is in their interface, API, and graph paradigms.
Torch was written with a Lua API, which can be a major barrier to entry for most people.
While you can do research and development in Lua, it doesn’t have the massive community backing and vast open source libraries that Python does, so it can be quite limiting.
Torch uses a static graph paradigm like Caffe’s at the time.
Also like Caffe, it does not have any auto-differentiation capabilities, meaning that if you want to implement new tensor operations for your deep net, you have to write the backward gradient calculations yourself.
It does have a pretty substantial &lt;a href=&quot;https://github.com/carpedm20/awesome-torch&quot;&gt;model zoo&lt;/a&gt; of pre-trained models.&lt;/p&gt;
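
&lt;p&gt;To make “writing the backward gradient calculations” concrete, here is a minimal sketch in plain Python rather than Lua or Torch. The &lt;code&gt;Square&lt;/code&gt; op and the &lt;code&gt;numerical_grad&lt;/code&gt; helper are hypothetical names made up for illustration and are not part of any library’s API; the point is only that, without auto-differentiation, every new operation needs a hand-derived backward pass, which is worth checking against a finite-difference estimate.&lt;/p&gt;

```python
class Square:
    """A hypothetical custom op, y = x * x, with a hand-written backward pass."""

    def forward(self, x):
        self.x = x  # cache the input; the backward pass needs it
        return x * x

    def backward(self, grad_output):
        # Chain rule: dL/dx = dL/dy * dy/dx, and dy/dx = 2x.
        return grad_output * 2.0 * self.x


def numerical_grad(f, x, eps=1e-6):
    # Central finite difference, used only to sanity-check the manual gradient.
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


op = Square()
y = op.forward(3.0)
manual = op.backward(1.0)
approx = numerical_grad(lambda v: v * v, 3.0)
print(y, manual, round(approx, 4))
```

&lt;p&gt;If the manual and numerical gradients disagree, the hand-written backward pass is usually where the bug is.&lt;/p&gt;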

&lt;p&gt;PyTorch was made with the goal of fixing or modernizing various issues with Torch, to create probably one of the best currently available libraries for doing research and development.
PyTorch, as the name suggests, has a very well designed Python API.
It supports both dynamic graph programming and auto-differentiation, for all of the easy-to-debug and easy-to-prototype goodness.
PyTorch also has its own visualization dashboard called &lt;a href=&quot;https://github.com/facebookresearch/visdom&quot;&gt;Visdom&lt;/a&gt;, which, while more limited than TensorBoard (more on this later), is still very helpful for development.&lt;/p&gt;
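
&lt;p&gt;The combination of a dynamic graph and auto-differentiation can be illustrated with a toy example in plain Python. This is only a sketch of the idea and not PyTorch’s implementation or API: the &lt;code&gt;Var&lt;/code&gt; class and &lt;code&gt;model&lt;/code&gt; function are hypothetical, and the backward pass simply walks every path through the graph instead of using the topological ordering a real library would.&lt;/p&gt;

```python
class Var:
    """A scalar that records how it was computed (hypothetical, for illustration)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # pairs of (parent Var, local derivative)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self, upstream=1.0):
        # Accumulate the gradient, then push it to the parents via the chain rule.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)


def model(x, w, steps):
    # Ordinary Python control flow decides the shape of the graph at run time.
    out = x
    for _ in range(steps):
        out = out * w + x
    return out


x, w = Var(2.0), Var(3.0)
y = model(x, w, steps=2)
y.backward()
print(y.value, x.grad, w.grad)  # 26.0 13.0 14.0
```

&lt;p&gt;Changing &lt;code&gt;steps&lt;/code&gt; changes the graph on the next call with no separate compile step, and no backward code had to be written for &lt;code&gt;model&lt;/code&gt; itself.&lt;/p&gt;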

&lt;p&gt;So should you use Torch or PyTorch?
Specifically for research and the development of new models, PyTorch is probably the best option currently available.
Even though PyTorch is still very new, most people in the deep learning field would agree that you should use it over classic Torch.
That is not to say Torch has no advantages.
Because of its age, it has a much larger backlog of research citing it for its use, and is more stable than PyTorch, but both of these advantages will be lost over time.
If you are looking for a library to deploy into any kind of production environment, then you should probably look &lt;a href=&quot;https://www.reddit.com/r/MachineLearning/comments/664ufi/n_facebook_releases_new_deep_learning_framework/dgfm6z6/&quot;&gt;elsewhere&lt;/a&gt;.&lt;/p&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;div class=&quot;author-blurb&quot;&gt;
&lt;/div&gt;

&lt;div class=&quot;column left&quot;&gt;
  &lt;ul class=&quot;scorecard&quot;&gt;
    &lt;li class=&quot;header&quot;&gt;&lt;a href=&quot;https://github.com/tensorflow/tensorflow&quot;&gt;TensorFlow&lt;/a&gt;&lt;/li&gt;
    &lt;li class=&quot;grey&quot;&gt;Python, C++, Java, Go&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://opensource.google.com/projects/tensorflow&quot;&gt;Google&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;Watches: 7,632 | Stars: 93,376&lt;br /&gt;Forks: 59,923&lt;br /&gt;Avg Issue Resolution: 8 Days&lt;br /&gt;Open issues: 16%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;Works with Keras&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://scholar.google.com/scholar?cites=4870469586968585222&amp;amp;as_sdt=40005&amp;amp;sciodt=0,10&amp;amp;hl=en&quot;&gt;Research Citations:&lt;/a&gt; 866&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://github.com/tensorflow/models&quot;&gt;Model zoo&lt;/a&gt;&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/tensorflow/tensorflow&quot;&gt;TensorFlow&lt;/a&gt;, without a doubt, is currently the biggest player in the deep learning field and for good reason.
TensorFlow is Google’s attempt to build a single deep learning framework for everything deep learning related.
There is very little that TensorFlow does not do well.
Because it was created by Google, it was built with massive distributed computing in mind, but it also has mobile development capabilities in the form of &lt;a href=&quot;https://www.tensorflow.org/mobile/mobile_intro&quot;&gt;TensorFlow Mobile&lt;/a&gt; and &lt;a href=&quot;https://www.tensorflow.org/mobile/tflite/&quot;&gt;TensorFlow Lite&lt;/a&gt;.
Its documentation is also considered one of the best.
It covers the multiple API languages that TensorFlow supports, and if you count the interfaces made by third parties in the community, TensorFlow even has APIs for C#, Haskell, Julia, Ruby, Rust, and Scala.
Speaking of that community, TensorFlow has the largest community out of any of the deep learning libraries and currently has the most research activity.&lt;/p&gt;

&lt;p&gt;From the beginning, TensorFlow was made with a clear static graph API that was easy to use, but as interests and needs are changing in the machine learning field, it recently gained support for dynamic graph functionality in the form of &lt;a href=&quot;https://github.com/tensorflow/fold&quot;&gt;TensorFlow Fold&lt;/a&gt;.
TensorFlow has Keras support, which makes it very easy for beginners, and it even has its own custom version of Keras built into the Python API.&lt;/p&gt;

&lt;p&gt;When Google first released TensorFlow, they also released &lt;a href=&quot;https://www.tensorflow.org/get_started/summaries_and_tensorboard&quot;&gt;TensorBoard&lt;/a&gt;,
a data visualization tool created to help you understand the flow of tensors through your model for debugging, optimization, and just understanding the complex and confusing nature of deep learning models.
You can use TensorBoard to visualize your TensorFlow model, plot summary metrics about the execution of your model, and show additional data like images that pass through it.&lt;/p&gt;

&lt;p&gt;Now what about deploying your models once you have finished training them?
Well Google also has a solution for that in &lt;a href=&quot;https://github.com/tensorflow/serving&quot;&gt;TensorFlow Serving&lt;/a&gt;, a flexible, high-performance serving system for ML models, designed for production environments.
It comes in the form of modular C++ libraries, binaries, and Docker/Kubernetes containers that can be used as an RPC server or a set of libraries.
There are even Google CloudML services set up with it to get your model up in production in no time.
TensorFlow Serving’s main goal is to optimize for throughput with little to no down time.
It includes a built-in scheduler that aims for the efficiency of mini-batching requests through the model and can manage multiple models at once running on shared hardware.
Currently the API only supports prediction, but it will support regression, classification, and multi-inference soon.&lt;/p&gt;
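
&lt;p&gt;The throughput benefit of mini-batching comes from running the model once per group of requests instead of once per request. The sketch below shows only that grouping idea in plain Python; &lt;code&gt;mini_batches&lt;/code&gt; and &lt;code&gt;fake_model&lt;/code&gt; are hypothetical names and not TensorFlow Serving’s API, and a real scheduler would also have to flush batches on a timeout so that a lone request is not left waiting.&lt;/p&gt;

```python
def mini_batches(requests, max_batch_size):
    """Group pending requests into batches of at most max_batch_size."""
    batch = []
    for request in requests:
        batch.append(request)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, partially filled batch


def fake_model(batch):
    # Stand-in for a real model call: one invocation handles the whole batch.
    return [value * 2 for value in batch]


pending = [1, 2, 3, 4, 5]
results = []
for batch in mini_batches(pending, max_batch_size=2):
    results.extend(fake_model(batch))
print(results)  # [2, 4, 6, 8, 10]
```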

&lt;p&gt;Now TensorFlow is not perfect.
Both Serving and Fold are still in their early days of development, so they might not be something you want to rely on.
None of the APIs outside of the Python API are covered by their &lt;a href=&quot;https://www.tensorflow.org/programmers_guide/version_semantics&quot;&gt;API stability promises&lt;/a&gt;.
But the biggest issue with TensorFlow compared to the other libraries is performance.&lt;/p&gt;

&lt;p&gt;There is no real way to get around the issue; TensorFlow is just slower and more of a resource &lt;a href=&quot;https://medium.com/@julsimon/keras-shoot-out-part-2-a-deeper-look-at-memory-usage-8a2dd997de81&quot;&gt;hog&lt;/a&gt; when compared to the other libraries.
&lt;a href=&quot;https://arxiv.org/pdf/1608.07249.pdf&quot;&gt;Looking&lt;/a&gt; at performance across your typical deep net architectures you can expect to see other libraries perform &lt;em&gt;up to&lt;/em&gt; twice as fast as TensorFlow at similar batch sizes.
You should avoid TensorFlow in general if you need performant Recurrent nets (RNNs) or Long Short Term Memory nets (LSTMs).
TensorFlow is even the worst at scaling efficiency when compared to the other libraries despite its focus on distributed computing.&lt;/p&gt;

&lt;p&gt;So should you use TensorFlow?
We wouldn’t blame you if you did and would probably suggest it for 80% of the possible use cases out there.
Especially if you are new to the deep learning field and want to work with a library and ecosystem that has solutions for almost everything you could possibly need.
But, if you are willing to put in the extra time and effort, you can find a much more performant and equally-featured experience with other libraries.&lt;/p&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;div class=&quot;author-blurb&quot;&gt;
&lt;/div&gt;

&lt;div class=&quot;column right&quot;&gt;
  &lt;ul class=&quot;scorecard&quot;&gt;
    &lt;li class=&quot;header&quot;&gt;&lt;a href=&quot;https://github.com/Microsoft/CNTK&quot;&gt;CNTK&lt;/a&gt;&lt;/li&gt;
    &lt;li class=&quot;grey&quot;&gt;Python, C#, C++, R&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.microsoft.com/en-us/cognitive-toolkit/&quot;&gt;Microsoft&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;Watches: 1,334 | Stars: 14,057&lt;br /&gt;Forks: 3,727&lt;br /&gt;Avg Issue Resolution: 21 Days&lt;br /&gt;Open issues: 12%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;Works with Keras&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://scholar.google.com/scholar?cites=14941870274579355971&amp;amp;as_sdt=40005&amp;amp;sciodt=0,10&amp;amp;hl=en&quot;&gt;Research Citations:&lt;/a&gt; 21&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.microsoft.com/en-us/cognitive-toolkit/features/model-gallery/&quot;&gt;Model zoo&lt;/a&gt;&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/Microsoft/CNTK&quot;&gt;CNTK&lt;/a&gt;, the Microsoft Cognitive Toolkit, was originally created by MSR Speech researchers several years ago but has evolved into much more.
It is a unified framework for building Deep nets, Recurrent nets (RNNs), Long Short Term Memory nets (LSTMs), Convolution nets (CNNs), and Deep Structured Semantic Models (DSSMs).
It can pretty much work for all types of deep learning applications from speech/text to vision.&lt;/p&gt;

&lt;p&gt;CNTK supports distributed training like TensorFlow and Torch. 
It even supports a proprietary, commercially-licensed, 1-bit Stochastic Gradient Descent algorithm that significantly improves distributed performance.
Thanks to CNTK’s early focus on language models, it is &lt;a href=&quot;https://arxiv.org/pdf/1608.07249.pdf&quot;&gt;5-10 times better&lt;/a&gt; than the other libraries when running dynamic network structures such as RNNs and LSTMs.&lt;/p&gt;
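
&lt;p&gt;The general idea behind 1-bit SGD is to shrink what workers exchange by sending only one bit per gradient value, while carrying the quantization error forward so it is not lost. The sketch below is a simplified illustration of that idea in plain Python and not CNTK’s implementation: the function name is hypothetical, and using a single shared scale (the mean absolute value) to reconstruct the values is an assumption made to keep the example short.&lt;/p&gt;

```python
def one_bit_quantize(gradient, residual):
    """Quantize a gradient to one bit per value, carrying the error forward."""
    # Add back the quantization error left over from the previous step.
    corrected = [g + r for g, r in zip(gradient, residual)]
    # One shared magnitude for the whole vector (a simplification).
    scale = sum(abs(c) for c in corrected) / len(corrected)
    # Each value is sent as a single sign bit plus the shared scale.
    quantized = [scale if c >= 0 else -scale for c in corrected]
    # Remember what was lost so the next step can compensate.
    new_residual = [c - q for c, q in zip(corrected, quantized)]
    return quantized, new_residual


gradient = [0.5, -0.1, 0.2, -0.6]
residual = [0.0] * len(gradient)
quantized, residual = one_bit_quantize(gradient, residual)
print([round(q, 2) for q in quantized])
print([round(r, 2) for r in residual])
```

&lt;p&gt;Each worker would transmit only the signs and the scale, and feed &lt;code&gt;residual&lt;/code&gt; back in on the next step.&lt;/p&gt;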

&lt;p&gt;The biggest reason to use CNTK is if you or your company traditionally works with Microsoft software and products.
CNTK is one of the few libraries to have first class support for running on Windows with additional support for running on Linux and &lt;a href=&quot;https://github.com/Microsoft/CNTK/issues/43&quot;&gt;NO&lt;/a&gt; support for OSX.
It has direct support for &lt;a href=&quot;https://docs.microsoft.com/en-us/cognitive-toolkit/Deploy-Model-to-AKS&quot;&gt;deploying to a Microsoft Azure production environment&lt;/a&gt; and APIs that properly support Microsoft’s languages of choice.
Its model zoo is even set up in a very “MSDN documentation” fashion.&lt;/p&gt;

&lt;p&gt;The main downside to CNTK is that it lacks support from both the general research and software dev community.
Microsoft may be using it internally for a lot of their &lt;a href=&quot;https://blogs.technet.microsoft.com/machinelearning/tag/bing/&quot;&gt;services&lt;/a&gt; and probably has the reliability to support it, but it is just having trouble gaining market share (like many of Microsoft’s recent &lt;a href=&quot;https://www.theverge.com/2017/10/9/16446280/microsoft-finally-admits-windows-phone-is-dead&quot;&gt;endeavors&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;So should you use CNTK?
If you are used to developing in Visual Studio and need an API for your .NET application, there probably is no better fit.
But there are better options out there for most OSX/Linux devs with better all-around support.
Also, if you are trying to do research and development that is not specific to LSTMs or RNNs, there are more appropriate libraries.&lt;/p&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;div class=&quot;author-blurb&quot;&gt;
&lt;/div&gt;

&lt;div class=&quot;column left&quot;&gt;
  &lt;ul class=&quot;scorecard&quot;&gt;
    &lt;li class=&quot;header&quot;&gt;&lt;a href=&quot;https://github.com/apache/incubator-mxnet&quot;&gt;MXNet&lt;/a&gt;&lt;/li&gt;
    &lt;li class=&quot;grey&quot;&gt;Python, Scala, R, Julia, C++, Perl, Go, JavaScript, MATLAB&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.apache.org/&quot;&gt;Apache&lt;/a&gt;, &lt;a href=&quot;https://aws.amazon.com/mxnet/&quot;&gt;Amazon&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;Watches: 1,135 | Stars: 13,425&lt;br /&gt;Forks: 4,950&lt;br /&gt;Avg Issue Resolution: 53 Days&lt;br /&gt;Open issues: 11%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;Works with Keras&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://scholar.google.com/scholar?cites=3990509978676884239&amp;amp;as_sdt=40005&amp;amp;sciodt=0,10&amp;amp;hl=en&quot;&gt;Research Citations:&lt;/a&gt; 319&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://mxnet.apache.org/model_zoo/index.html&quot;&gt;Model zoo&lt;/a&gt;&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/apache/incubator-mxnet&quot;&gt;MXNet&lt;/a&gt; is one of the newest players in the deep learning field but has been gaining ground fast.
Originally created at the University of Washington and Carnegie Mellon University, it has been adopted by both The Apache Foundation and Amazon Web Services as their deep learning library of choice, and both have put their development efforts &lt;a href=&quot;http://www.allthingsdistributed.com/2016/11/mxnet-default-framework-deep-learning-aws.html&quot;&gt;behind it&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;MXNet supports almost all of the features the other libraries support.
It has the largest selection of officially supported languages for its APIs, and it can run on everything from a &lt;a href=&quot;http://rupeshs.github.io/machineye/&quot;&gt;web browser&lt;/a&gt; or a &lt;a href=&quot;https://mxnet.incubator.apache.org/how_to/smart_device.html&quot;&gt;mobile phone&lt;/a&gt; to a massive distributed server farm.
In fact, Amazon has &lt;a href=&quot;http://www.allthingsdistributed.com/2016/11/mxnet-default-framework-deep-learning-aws.html&quot;&gt;found&lt;/a&gt; that you can get up to an 85% scaling efficiency with MXNet.
In most other cases, MXNet has some of the best performance when running with typical deep learning architectures.&lt;/p&gt;
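
&lt;p&gt;For reference, scaling efficiency is usually understood as the fraction of ideal linear speedup you actually get when adding hardware. The small sketch below assumes that definition; the function name and the throughput numbers are hypothetical and were chosen only so the result lands on 85%. They are not Amazon’s measurements.&lt;/p&gt;

```python
def scaling_efficiency(single_gpu_throughput, multi_gpu_throughput, num_gpus):
    """Fraction of ideal linear speedup actually achieved."""
    ideal = single_gpu_throughput * num_gpus
    return multi_gpu_throughput / ideal


# Hypothetical numbers: 100 images/sec on 1 GPU, 1360 images/sec on 16 GPUs.
efficiency = scaling_efficiency(100.0, 1360.0, 16)
print(f"{efficiency:.0%}")  # 85%
```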

&lt;p&gt;MXNet supports both static graph programming and dynamic graph programming with the raw MXNet and Gluon APIs respectively.
The Gluon API is also MXNet’s clear, concise, and simple API for deep learning created in collaboration with AWS and Microsoft in the same spirit as Keras, but MXNet also supports Keras if you prefer it.
MXNet also has its own &lt;a href=&quot;https://github.com/awslabs/mxnet-model-server&quot;&gt;serving framework&lt;/a&gt; for getting your trained MXNet models into production and has extra support for running on &lt;a href=&quot;https://aws.amazon.com/blogs/ai/introducing-model-server-for-apache-mxnet/&quot;&gt;AWS&lt;/a&gt;.
It even has its own &lt;a href=&quot;https://github.com/dmlc/tensorboard&quot;&gt;TensorBoard implementation&lt;/a&gt; that provides much of the same functionality as the TensorFlow equivalent.&lt;/p&gt;

&lt;p&gt;MXNet does have notable weaknesses that make working with it a little more annoying.
The documentation could be much better. The APIs went through a few changes before the first 1.0 release and the documentation reflects this, which can get a little confusing in some places.
In terms of community support, it’s not the worst or the best, but somewhere in the middle.
A notable number of people use it for development and research, and there are plenty of usage examples for different net types along with their &lt;a href=&quot;https://mxnet.apache.org/model_zoo/index.html&quot;&gt;model zoo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So should you use MXNet?
If you are willing to put the time in and deal with some of the pain points from it being a younger deep learning library, it is probably the best option for 80% of use cases along with TensorFlow.
We would especially suggest it over TensorFlow if performance is a big concern of yours.
If you are looking for the most flexible library to give you as many options as possible in your train to production pipeline with a native API for your production code, it’s probably the best option.&lt;/p&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h1 id=&quot;the-other-libraries&quot;&gt;The Other Libraries&lt;/h1&gt;

&lt;p&gt;Now the previous 6 deep learning libraries covered are by no means the only options available to you.
They are just the biggest players and arguably the most relevant for 2018.
There are many more to choose from that may better fit your specific needs (deployment destination, non-English documentation/community, hardware, etc.).
We will try to briefly cover them here and provide a jumping off point if you want to dig into one of them deeper.&lt;/p&gt;

&lt;h3 id=&quot;theano&quot;&gt;&lt;a href=&quot;https://github.com/Theano/Theano&quot;&gt;Theano&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Python API&lt;/li&gt;
  &lt;li&gt;University of Montreal&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://groups.google.com/forum/#!msg/theano-users/7Poq8BZutbY/rNCIfvAEAwAJ&quot;&gt;Future work on the project has stopped, May it rest in peace&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Watches: 573, Stars: 8041, Forks: 2426, Median Issue Resolution: 12 days, Open issues: 19%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://scholar.google.com/scholar?cites=6483663616580699907&amp;amp;as_sdt=5,33&amp;amp;sciodt=0,33&amp;amp;hl=en&quot;&gt;Research Citations&lt;/a&gt;: 1,080&lt;/li&gt;
  &lt;li&gt;Makes you do a lot of things from scratch, which leads to more verbose code.&lt;/li&gt;
  &lt;li&gt;Single GPU support only&lt;/li&gt;
  &lt;li&gt;Numerous open-source deep learning libraries have been built on top of Theano, including &lt;a href=&quot;https://github.com/fchollet/keras&quot;&gt;Keras&lt;/a&gt;, &lt;a href=&quot;https://lasagne.readthedocs.org/en/latest/&quot;&gt;Lasagne&lt;/a&gt; and &lt;a href=&quot;https://github.com/mila-udem/blocks&quot;&gt;Blocks&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;No real reason to use over TensorFlow unless you are working with old code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;caffe2&quot;&gt;&lt;a href=&quot;https://github.com/caffe2/caffe2&quot;&gt;Caffe2&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;C++, Python APIs&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://research.fb.com/&quot;&gt;Facebook&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Watches: 552, Stars: 7631, Forks: 1821, Median Issue Resolution: 55 Days, Open issues: 33%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Caffe2 is Facebook’s second entry into the deep learning library ecosystem.&lt;/li&gt;
  &lt;li&gt;It is built with a focus more on mobile and industrial-strength production applications over development and research.&lt;/li&gt;
  &lt;li&gt;Where Caffe only supported single GPU training, Caffe2 is built to run utilizing both multiple GPUs on a single host and multiple hosts with single to multiple GPUs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;coreml&quot;&gt;&lt;a href=&quot;https://developer.apple.com/documentation/coreml&quot;&gt;CoreML&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Swift, Objective-C APIs&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://developer.apple.com/machine-learning/&quot;&gt;Apple&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Closed source&lt;/li&gt;
  &lt;li&gt;Not a full DL library (you cannot use it to train models at the moment), but mainly focused on deploying pre-trained models optimized for Apple devices
    &lt;ul&gt;
      &lt;li&gt;If you need to train your own model, you will need to use one of the above libraries&lt;/li&gt;
      &lt;li&gt;Model converters available for Keras, Caffe, Scikit-learn, libSVM, XGBoost, MXNet, and TensorFlow&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;paddle&quot;&gt;&lt;a href=&quot;https://github.com/PaddlePaddle/Paddle&quot;&gt;Paddle&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Python API&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://research.baidu.com/&quot;&gt;Baidu&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Watches: 558, Stars: 6580, Forks: 1756, Median Issue Resolution: 7 days, Open issues: 24%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;One of the newest libraries available&lt;/li&gt;
  &lt;li&gt;Chinese documentation with an English translation&lt;/li&gt;
  &lt;li&gt;Has the potential to become a big player in the market&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;neon&quot;&gt;&lt;a href=&quot;https://github.com/NervanaSystems/neon&quot;&gt;Neon&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Python API&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://ai.intel.com/&quot;&gt;Intel&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Watches: 351, Stars: 3437, Forks: 778, Median Issue Resolution Time: 28 days, Open issues: 16%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Written with Intel MKL accelerated hardware in mind (Intel Xeon and Phi processors)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;chainer&quot;&gt;&lt;a href=&quot;https://github.com/chainer/chainer&quot;&gt;Chainer&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Python API&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.preferred-networks.jp/&quot;&gt;Preferred Networks&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Watches: 310, Stars: 3595, Forks: 949, Median Issue Resolution Time: 31 days, Open issues: 13%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://scholar.google.com/scholar?cites=2313195456264875252&amp;amp;as_sdt=5,33&amp;amp;sciodt=0,33&amp;amp;hl=en&quot;&gt;Research Citations&lt;/a&gt;: 207&lt;/li&gt;
  &lt;li&gt;Dynamic computation graph&lt;/li&gt;
  &lt;li&gt;Smaller company effort with a Japanese and English community&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;deeplearning4j&quot;&gt;&lt;a href=&quot;https://github.com/deeplearning4j/deeplearning4j&quot;&gt;Deeplearning4j&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Java, Scala APIs&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://skymind.ai/&quot;&gt;Skymind&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Watches: 792, Stars: 8527, Forks: 4120, Median Issue Resolution Time: 19 days, Open issues: 21%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Written with Java and the JVM in mind&lt;/li&gt;
  &lt;li&gt;Keras Support (Python API)&lt;/li&gt;
  &lt;li&gt;DL4J can take advantage of distributed computing frameworks including Hadoop and Apache Spark.&lt;/li&gt;
  &lt;li&gt;On multi-GPUs, it is equal to Caffe in performance.&lt;/li&gt;
  &lt;li&gt;Can import models from TensorFlow&lt;/li&gt;
  &lt;li&gt;Uses &lt;a href=&quot;https://nd4j.org/&quot;&gt;ND4J&lt;/a&gt; (Numpy for the JVM)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;dynet&quot;&gt;&lt;a href=&quot;https://github.com/clab/dynet&quot;&gt;DyNet&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;C++, Python APIs&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.casos.cs.cmu.edu/projects/DyNet/index.php&quot;&gt;Carnegie Mellon University&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Watches: 178, Stars: 2189, Forks: 527, Median Issue Resolution Time: 4 days, Open issues: 16%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Dynamic computation graph&lt;/li&gt;
  &lt;li&gt;Small user community&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;matconvnet&quot;&gt;&lt;a href=&quot;https://github.com/vlfeat/matconvnet&quot;&gt;MatConvNet&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;MATLAB API&lt;/li&gt;
  &lt;li&gt;Watches: 113, Stars: 959, Forks: 633, Median Issue Resolution Time: 96 days, Open issues: 53%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;A MATLAB toolbox implementing Convolutional Neural Networks (CNNs) for computer vision applications&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;darknet&quot;&gt;&lt;a href=&quot;https://github.com/pjreddie/darknet&quot;&gt;Darknet&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Python, C APIs&lt;/li&gt;
  &lt;li&gt;Watches: 520, Stars: 6276, Forks: 3072, Median Issue Resolution Time: 55 days, Open issues: 78%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Very small open source effort with a laid back dev group&lt;/li&gt;
  &lt;li&gt;Not useful for production environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;leaf&quot;&gt;&lt;a href=&quot;https://github.com/autumnai/leaf&quot;&gt;Leaf&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Rust API&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/autumnai&quot;&gt;autumnai&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Watches: 195, Stars: 5229, Forks: 265, Median Issue Resolution Time: 131 days, Open issues: 58%&lt;a href=&quot;#note1&quot; id=&quot;note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Support for the lib looks to be &lt;a href=&quot;https://github.com/autumnai/leaf/issues/108&quot;&gt;dead&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h1 id=&quot;tldr&quot;&gt;TLDR&lt;/h1&gt;

&lt;p&gt;Choose either &lt;a href=&quot;https://github.com/tensorflow/tensorflow&quot;&gt;TensorFlow&lt;/a&gt; or &lt;a href=&quot;https://github.com/apache/incubator-mxnet&quot;&gt;MXNet&lt;/a&gt; for probably about 80% of use cases (TensorFlow if you prioritize community support and documentation, MXNet if you need performance).
Look at PyTorch if you are mainly looking for something to develop/train new models.
If you love Microsoft and are developing for a .NET environment in Windows and Visual Studio, try out &lt;a href=&quot;https://github.com/Microsoft/CNTK&quot;&gt;CNTK&lt;/a&gt;.
Look into &lt;a href=&quot;https://developer.apple.com/documentation/coreml&quot;&gt;CoreML&lt;/a&gt; for just deploying models to Apple devices specifically and &lt;a href=&quot;https://github.com/deeplearning4j/deeplearning4j&quot;&gt;Deeplearning4j&lt;/a&gt; if you &lt;em&gt;really&lt;/em&gt; like to keep things JVM focused.&lt;/p&gt;

&lt;p&gt;&lt;a id=&quot;note1&quot; href=&quot;#note1ref&quot;&gt;&lt;sup&gt;*&lt;/sup&gt;&lt;/a&gt; Numbers taken at time of writing, expected to change.&lt;/p&gt;

&lt;!--- CSS FOR HTML SCORECARDS --&gt;

&lt;style&gt;

.customHR {
  color: black;
}

/* Create three columns of equal width */
.column {
    width: 33.3%;
    padding: 8px;
}

.right {
  float: right;
}

.left {
  float: left;
}

/* Style the list */
.scorecard {
    list-style-type: none;
    border: 1px solid #073642;
    margin: 0;
    padding: 0;
    -webkit-transition: 0.3s;
    transition: 0.3s;
}

/* Scorecard header */
.scorecard .header {
    background-color: #ef8644;
    color: white;
    font-size: 20px;
}

.scorecard .header a {
  color: white;
}

/* List items */
.scorecard li {
    border-bottom: 1px solid #073642;
    padding: 10px;
    text-align: center;
}

/* Grey list item */
.scorecard .grey {
    background-color: #073642;
}

/* Change the width of the three columns to 100%
(to stack vertically on small screens) */
@media only screen and (max-width: 600px) {
    .column {
        width: 100%;
    }
}
&lt;/style&gt;

</description>
        <pubDate>Fri, 23 Mar 2018 10:11:36 +0000</pubDate>
        <link>http://engineering.curalate.com/2018/03/23/DL-lib-for-app-dev-and-prod.html</link>
        <guid isPermaLink="true">http://engineering.curalate.com/2018/03/23/DL-lib-for-app-dev-and-prod.html</guid>
        
        <category>deep</category>
        
        <category>learning</category>
        
        <category>neural</category>
        
        <category>network</category>
        
        <category>tensorflow</category>
        
        <category>mxnet</category>
        
        <category>caffe</category>
        
        <category>cntk</category>
        
        <category>pytorch</category>
        
        <category>vs</category>
        
        <category>comparison</category>
        
        
      </item>
    
      <item>
        <title>R&amp;D At Curalate: A Case Study of Deep Metric Embedding</title>
        <description>&lt;p&gt;At Curalate, we make social sell for hundreds of the world’s largest brands and retailers. Our
&lt;a href=&quot;https://www.curalate.com/solution/fanreel/&quot;&gt;Fanreel&lt;/a&gt; product is a good example of this; it empowers brands to collect, curate, and publish social user-generated photos to their e-commerce site. A vital step in this pipeline is connecting the user generated content (UGC) to
the product on our client’s web site. Automating this process requires cutting edge computer vision techniques whose implementation details are not always clear, especially for production use cases.
In this post, I review how we leveraged &lt;a href=&quot;https://www.curalate.com/blog/how-Curalate-built-a-kick-ass-research-development-team/&quot;&gt;Curalate’s R&amp;amp;D principles&lt;/a&gt; to build a visual search engine that
identifies which of our clients’ products are in user generated photos.
The resulting system allows our clients to quickly connect user generated content to their e-comm site, enabling the UGC to generate revenue immediately upon distribution.&lt;/p&gt;

&lt;h1 id=&quot;step-1-do-your-homework&quot;&gt;Step 1: Do Your Homework&lt;/h1&gt;

&lt;p&gt;We start every R&amp;amp;D project by hitting the books and catching up on the relevant research. 
This lets us understand what is feasible, the (rough) computational costs, and any pitfalls of various techniques.
In this case, our goal is to find which products are in any UGC image using only the product images from the client’s e-comm site. 
This is extremely difficult: UGC photos have dramatic lighting conditions, generally contain multiple objects or clutter, and often show
products that have undergone non-rigid transformations (especially garments).
Knowing we had a difficult problem on our hands, we did an extensive review of papers from leading computer vision conferences, journals, and even &lt;a href=&quot;https://arxiv.org/&quot;&gt;arXiv&lt;/a&gt; to ensure we had a good understanding of the state of the art.&lt;/p&gt;

&lt;p&gt;One approach stood out in the literature review: &lt;em&gt;deep metric
learning&lt;/em&gt;. Deep metric learning is a deep learning technique that learns an embedding function that, when applied to images of the same
product, produces feature vectors that are close together in Euclidean space.
This technique is perfect for our use case: we can train the system from existing pairs of UGC and product images in our platform to understand the complex transformations products undergo in UGC photos.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/2018-02-01-dme/tSNE.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The figure above (from &lt;a href=&quot;https://arxiv.org/abs/1612.01213&quot;&gt;Song et al.&lt;/a&gt;) shows a
 &lt;a href=&quot;https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding&quot;&gt;t-SNE&lt;/a&gt; visualization of a learned embedding of the &lt;a href=&quot;http://cvgl.stanford.edu/projects/lifted_struct/&quot;&gt;Stanford Online Products&lt;/a&gt; dataset.
Notice that images of similar products are close together: wooden furniture zoomed in on the upper left, and bike parts on the lower right.
Once we’ve learned this embedding function, identifying the products in a UGC image reduces to finding the embedding vectors from
the client’s product photos that are closest to that of the UGC image.&lt;/p&gt;
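&lt;p&gt;To make the lookup concrete, here is a minimal sketch of a nearest-neighbor search in embedding space. The function name, the product ids, and the 2-d vectors are all made up for illustration; real embeddings would come from the trained network, and a production system would likely use an approximate nearest-neighbor index rather than a full sort.&lt;/p&gt;

```python
import math

def nearest_products(ugc_embedding, product_embeddings, k=1):
    """Return the k product ids whose embeddings are closest (Euclidean) to the UGC embedding."""
    ranked = sorted(
        product_embeddings,
        key=lambda pid: math.dist(ugc_embedding, product_embeddings[pid]),
    )
    return ranked[:k]

# Hypothetical 2-d embeddings, purely for illustration.
products = {"chair": [0.9, 0.1], "bike": [0.1, 0.9], "lamp": [0.5, 0.5]}
print(nearest_products([0.8, 0.2], products, k=2))  # ['chair', 'lamp']
```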

&lt;p&gt;Most techniques for deep metric learning start with a deep convolutional neural network trained on &lt;a href=&quot;http://www.image-net.org/&quot;&gt;ImageNet&lt;/a&gt;
(i.e., a &lt;em&gt;basenet&lt;/em&gt;), remove the final classification layer, add a new layer that performs a projection to the n-dimensional embedding
space, and fine-tune it with an appropriate loss function.
One highly cited work is &lt;a href=&quot;https://arxiv.org/abs/1503.03832&quot;&gt;Facenet&lt;/a&gt; by Schroff et al., who propose a loss function that uses triplets
of images. Each triplet contains an &lt;em&gt;anchor&lt;/em&gt; image, a &lt;em&gt;positive&lt;/em&gt; example that is the same class as the anchor, and a &lt;em&gt;negative&lt;/em&gt; example
that is a different class than the anchor image. Though &lt;a href=&quot;https://arxiv.org/abs/1703.07464&quot;&gt;more recent work&lt;/a&gt; has surpassed Facenet, in
the interest of speed (&lt;a href=&quot;/2016/01/14/welcome.html&quot;&gt;we are a startup!&lt;/a&gt;) we decided to take it for a spin since a
&lt;a href=&quot;https://github.com/davidsandberg/facenet&quot;&gt;tensorflow implementation&lt;/a&gt; was available online.&lt;/p&gt;
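&lt;p&gt;As a rough illustration of the triplet idea, the sketch below computes a hinge-style triplet loss for a single triplet in plain Python. It follows the general form described in the Facenet paper (squared Euclidean distances plus a margin); the function names and the margin value are illustrative assumptions, and this is not our training code.&lt;/p&gt;

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is at least `margin` farther from the anchor
    (in squared distance) than the positive is; positive otherwise."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)

# An easy triplet (negative far away) contributes no loss; a hard one does.
easy = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])
hard = triplet_loss([0.0, 0.0], [0.1, 0.0], [0.2, 0.0])
print(easy, hard)
```

&lt;p&gt;During training, the gradient of this loss pulls the anchor and positive together and pushes the negative away until the margin is satisfied.&lt;/p&gt;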

&lt;h1 id=&quot;step-2-prototype-and-experiment&quot;&gt;Step 2: Prototype and Experiment&lt;/h1&gt;

&lt;p&gt;The second phase of an R&amp;amp;D project at Curalate is the prototype phase. 
In this phase, we implement our chosen approach as fast as possible, and evaluate it on publicly available data as well as our own.
As with many things in a startup, speed is key here. 
Specifically, we need &lt;em&gt;answers&lt;/em&gt; as fast as possible so we know what we need to build.
This phase is designed to answer the question: will it work and, if so, how well?
In addition, this phase is when we experiment with different implementation details of techniques we wish to implement.
Hyperparameter tuning, choosing architecture components, and comparing different algorithms all occur in this phase of R&amp;amp;D.&lt;/p&gt;

&lt;p&gt;The big question we want to answer for our deep metric embedding project is: which basenet should we use? The Facenet paper used GoogLeNet inception models, but there have been many
improvements since its publication. To compare different networks, we measure the performance of each on the 
&lt;a href=&quot;http://cvgl.stanford.edu/projects/lifted_struct/&quot;&gt;Stanford Online Products&lt;/a&gt;
dataset. We implemented Facenet’s triplet loss in &lt;a href=&quot;https://mxnet.apache.org/&quot;&gt;MXNet&lt;/a&gt; so we could easily swap out the underlying basenet.&lt;/p&gt;

&lt;p&gt;We compared the following networks from the &lt;a href=&quot;https://mxnet.apache.org/model_zoo/index.html&quot;&gt;MXNet Model Zoo&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://data.dmlc.ml/models/imagenet/vgg/&quot;&gt;VGG-19&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://data.dmlc.ml/models/imagenet/inception-bn/&quot;&gt;Inception v3&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://data.dmlc.ml/models/imagenet/resnet/&quot;&gt;Resnet (multiple sizes)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/bruinxiong/SENet.mxnet&quot;&gt;SENet-ResNext-50&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A secondary question we wished to answer with this experiment was how efficiently we could compute the embeddings. To explore this, we also evaluated two smaller, faster networks:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://data.dmlc.ml/models/imagenet/resnet/&quot;&gt;Resnet 18&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://data.dmlc.ml/models/imagenet/squeezenet&quot;&gt;SqueezeNet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;iframe width=&quot;600&quot; height=&quot;371&quot; seamless=&quot;&quot; frameborder=&quot;0&quot; scrolling=&quot;no&quot; src=&quot;https://docs.google.com/spreadsheets/d/e/2PACX-1vRDhNwpd9skkcn4svSk56Fidz0zGL5ObqSscbAPLyA-ji4zvub7YECMb7MiIJRlQAFBkiskTlNjKgHO/pubchart?oid=812950348&amp;amp;format=interactive&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;The figure above shows the recall-at-1 accuracy for all basenets. 
Not surprisingly, the more computationally expensive networks (i.e., Resnet-152 and SENet) have the highest accuracy.
SENet, in particular, achieved a recall-at-1 of 71.6%, which is only two percentage points below the &lt;a href=&quot;https://arxiv.org/pdf/1703.07464.pdf&quot;&gt;current state of the art&lt;/a&gt;.&lt;/p&gt;
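&lt;p&gt;For readers unfamiliar with the metric, recall-at-1 is the fraction of query images whose single nearest neighbor in embedding space has the same label. The sketch below assumes a separate query set and gallery with hypothetical labels and 2-d embeddings; the exact evaluation protocol for a given benchmark may differ (for example, by searching within a single set and excluding the query itself).&lt;/p&gt;

```python
import math

def recall_at_1(queries, gallery):
    """Fraction of queries whose nearest gallery item shares their label.
    Both arguments are lists of (label, embedding) pairs."""
    hits = 0
    for label, embedding in queries:
        nearest_label, _ = min(gallery, key=lambda item: math.dist(embedding, item[1]))
        hits += int(nearest_label == label)
    return hits / len(queries)

# Hypothetical data: the second query lands closer to the wrong product.
gallery = [("chair", [0.0, 0.0]), ("bike", [1.0, 1.0])]
queries = [("chair", [0.1, 0.0]), ("bike", [0.2, 0.1])]
print(recall_at_1(queries, gallery))  # 0.5
```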

&lt;p&gt;One of the exciting results for us was SqueezeNet. Though it only achieved 60% accuracy, this network is extremely small (&amp;lt; 5MB) and
computationally fast enough to run on a mobile phone. Thus we could trade some accuracy for huge savings in computational cost if we need to.&lt;/p&gt;

&lt;h1 id=&quot;step-3-ship-it&quot;&gt;Step 3: Ship It&lt;/h1&gt;

&lt;p&gt;The final phase of an R&amp;amp;D project at Curalate is productization.
In this phase, we leverage our findings from the prototype and literature phases to design and build a reliable and efficient production system.
All code from the prototype phase is discarded or heavily refactored to be more efficient, testable, and maintainable.
With deep learning systems, we also build a data pipeline for extracting, versioning, and snapshotting datasets from our current production systems.&lt;/p&gt;

&lt;p&gt;For this project, we train the model on a P6000 GPU rented from &lt;a href=&quot;https://www.paperspace.com/&quot;&gt;Paperspace&lt;/a&gt;.
We again use &lt;a href=&quot;https://mxnet.apache.org/&quot;&gt;MXNet&lt;/a&gt; so the resulting model can be deployed directly to our production web services (which are 
&lt;a href=&quot;http://engineering.curalate.com/2016/02/16/build-deploy-at-Curalate.html&quot;&gt;written in Scala&lt;/a&gt;).
We opted to use Resnet-152 as a basenet to get a high accuracy result, and deployed the learned network to &lt;code class=&quot;highlighter-rouge&quot;&gt;g2.2xlarge&lt;/code&gt; instances on AWS.&lt;/p&gt;

&lt;p&gt;The visual search system we built powers our &lt;a href=&quot;https://www.curalate.com/blog/intelligent-product-tagging/&quot;&gt;Intelligent Product Tagging&lt;/a&gt; feature, which you
can see in the video below. Using deep metric embedding, we vastly increased the accuracy of intelligent product tagging compared to non-embedded deep features.&lt;/p&gt;

&lt;video width=&quot;715px&quot; loop=&quot;&quot; controls=&quot;&quot;&gt;
	&lt;source type=&quot;video/mp4&quot; src=&quot;https://www.curalate.com/wp-content/uploads/2016/12/apt-demo-v2-for-blog-and-PR-video1.mp4&quot; /&gt;
&lt;/video&gt;

</description>
        <pubDate>Thu, 01 Feb 2018 10:11:36 +0000</pubDate>
        <link>http://engineering.curalate.com/2018/02/01/deep-metric-embedding.html</link>
        <guid isPermaLink="true">http://engineering.curalate.com/2018/02/01/deep-metric-embedding.html</guid>
        
        <category>computer</category>
        
        <category>vision</category>
        
        <category>deep</category>
        
        <category>metric</category>
        
        <category>machine</category>
        
        <category>learning</category>
        
        
      </item>
    
      <item>
        <title>Load Testing for Expected Increases in Traffic with Vegeta</title>
        <description>&lt;p&gt;At Curalate, our service and API traffic is fairly tightly coupled to e-commerce traffic, so any increase is reasonably predictable. We expect an increase in request rate towards the beginning of November each year, with traffic peaking at 10x our steady rate on Black Friday and Cyber Monday.&lt;/p&gt;

&lt;h2 id=&quot;why-load-test&quot;&gt;Why Load Test?&lt;/h2&gt;
&lt;p&gt;Curalate works directly with retail brands to drive traffic to their sites. The holiday shopping period is the most important time of the year for most of them, and we need to ensure that our experiences continue to operate at a high standard throughout.&lt;/p&gt;

&lt;p&gt;More generally, though, load testing is critical for services and APIs, especially in cases where load is expected to increase. It uncovers potential points of failure during business hours, and hopefully saves people from waking up at 2 a.m. on a weekend.&lt;/p&gt;

&lt;h2 id=&quot;creating-a-test-plan&quot;&gt;Creating a Test Plan&lt;/h2&gt;
&lt;p&gt;In cases of expected load increases, it’s important to understand as much as possible before diving into it. There are a few questions to ask:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Is there any data available so I can understand the expected load? Is it a yearly increase - are previous years a good indication? If it’s a brand new launch, what are the expectations?&lt;/li&gt;
  &lt;li&gt;What are the hard and soft dependencies of the service or API that I’m testing? What sort of caching is in place? Does a 10x increase on my service cause a 10x increase on everything downstream, as well?&lt;/li&gt;
  &lt;li&gt;Should we test against the active production environment, or is it feasible to spin up a staging environment with the same scaling behavior? Depending on the breadth of dependencies, it may not be possible to spin up a duplicate environment.&lt;/li&gt;
  &lt;li&gt;If I test against production, how can I ensure I don’t negatively affect live traffic?&lt;/li&gt;
  &lt;li&gt;Am I expecting an increase in load across services? If there are any core dependencies, what does the combined load look like at peak?&lt;/li&gt;
  &lt;li&gt;How much of a buffer do I provide against the expected peak?&lt;/li&gt;
  &lt;li&gt;Does my service have any rate limiting that I need to bypass or keep in mind? How do I simulate live traffic without being throttled?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;getting-right-to-it&quot;&gt;Getting Right to It&lt;/h2&gt;
&lt;p&gt;In our case, there were four main services that we were interested in testing against expected load, separated into on-site (APIs and services that are called directly from our clients’ sites) and off-site (our custom-built, Curalate-hosted services). This distinction works well for us, because we expected a 10x increase in on-site experiences, but only a 2-3x increase in off-site ones - brands focus on driving traffic to their own e-commerce sites.&lt;/p&gt;

&lt;p&gt;Now, there are many tools out there for load testing. For our purposes, I used &lt;a href=&quot;https://github.com/tsenart/vegeta&quot;&gt;Vegeta&lt;/a&gt;, for its robust set of options and extensibility. It was easy to script around to allow a steadily increasing request rate to either a single target or lazily generated targets. The output functionality is also well thought out. It supports top line latency stats along with some basic charting capabilities.&lt;/p&gt;

&lt;p&gt;Let’s assume we had a service that we wanted to test up to 1000 RPS, both against a single target, and against multiple targets - to work around any caching in place.&lt;/p&gt;

&lt;h4 id=&quot;1000-rps-single-and-multi-target&quot;&gt;1000 RPS Single and Multi-Target&lt;/h4&gt;
&lt;p&gt;The setup was fairly simple:&lt;/p&gt;

&lt;p&gt;Spin up a couple of AWS EC2 m3.2xlarge instances.&lt;/p&gt;

&lt;p&gt;SSH to the instances and create a &lt;code class=&quot;highlighter-rouge&quot;&gt;load_testing&lt;/code&gt; folder, and fetch the Vegeta binary.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;wget &quot;https://github.com/tsenart/vegeta/releases/download/v6.3.0/vegeta-v6.3.0-linux-386.tar.gz&quot;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Put together a simple, quick script to handle steadily increasing the request rate, and then hold steady at the max rate.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/bin/bash&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$1&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;maxRate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$2&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;rateInc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$3&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;incDuration&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$4&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;startAt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$5&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;currentRate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$startAt&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;hitType&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$6&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$currentRate&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-le&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$maxRate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;do
  if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$currentRate&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-eq&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$maxRate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;then
    &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$target&lt;/span&gt; | ./vegeta attack &lt;span class=&quot;nt&quot;&gt;-rate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$currentRate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; reel-&lt;span class=&quot;nv&quot;&gt;$maxRate&lt;/span&gt;-&lt;span class=&quot;nv&quot;&gt;$currentRate&lt;/span&gt;-&lt;span class=&quot;nv&quot;&gt;$hitType&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;-test&lt;/span&gt;.bin
  &lt;span class=&quot;k&quot;&gt;else
    &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$target&lt;/span&gt; | ./vegeta attack &lt;span class=&quot;nt&quot;&gt;-rate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$currentRate&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-duration&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$incDuration&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; reel-&lt;span class=&quot;nv&quot;&gt;$maxRate&lt;/span&gt;-&lt;span class=&quot;nv&quot;&gt;$currentRate&lt;/span&gt;-&lt;span class=&quot;nv&quot;&gt;$hitType&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;-test&lt;/span&gt;.bin
  &lt;span class=&quot;k&quot;&gt;fi
  &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;currentRate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;$((&lt;/span&gt;currentRate+rateInc&lt;span class=&quot;k&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Basically, if it hasn’t yet hit the max rate, run vegeta at the current rate for the specified duration, then increase the rate by the increment, and loop again. If the max rate is hit, don’t specify a duration - run until manually killed. The multi-targets script is similar, but reads from a &lt;code class=&quot;highlighter-rouge&quot;&gt;targets.txt&lt;/code&gt; file.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/bin/bash&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;maxRate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$1&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;rateInc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$2&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;incDuration&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$3&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;startAt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$4&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;currentRate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$startAt&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;hitType&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$5&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$currentRate&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-le&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$maxRate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;do
  if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$currentRate&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-eq&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$maxRate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt;
    ./vegeta attack &lt;span class=&quot;nt&quot;&gt;-rate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$currentRate&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-targets&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;targets.txt &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; reel-&lt;span class=&quot;nv&quot;&gt;$maxRate&lt;/span&gt;-&lt;span class=&quot;nv&quot;&gt;$currentRate&lt;/span&gt;-&lt;span class=&quot;nv&quot;&gt;$hitType&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;-test&lt;/span&gt;.bin
  &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
    ./vegeta attack &lt;span class=&quot;nt&quot;&gt;-rate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$currentRate&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-duration&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$incDuration&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-targets&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;targets.txt &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; reel-&lt;span class=&quot;nv&quot;&gt;$maxRate&lt;/span&gt;-&lt;span class=&quot;nv&quot;&gt;$currentRate&lt;/span&gt;-&lt;span class=&quot;nv&quot;&gt;$hitType&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;-test&lt;/span&gt;.bin
  &lt;span class=&quot;k&quot;&gt;fi
  &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;currentRate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;$((&lt;/span&gt;currentRate+rateInc&lt;span class=&quot;k&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Aside: I was unable to get the &lt;code class=&quot;highlighter-rouge&quot;&gt;-lazy&lt;/code&gt; flag to work properly with Vegeta, so I went with brute force and just generated a ton of targets to a file. I’m convinced it could have been more elegant, but sometimes the easy solution works just as well.&lt;/p&gt;
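&lt;p&gt;For what it’s worth, the brute-force targets file can be generated with a few lines of script. The sketch below assumes Vegeta’s plain-text targets format of one &lt;code class=&quot;highlighter-rouge&quot;&gt;GET url&lt;/code&gt; entry per line; the endpoint and the &lt;code class=&quot;highlighter-rouge&quot;&gt;id&lt;/code&gt; query parameter are hypothetical stand-ins for whatever makes each request miss the cache on your service.&lt;/p&gt;

```python
def build_targets(base_url, count):
    """One 'GET url' line per target; the id parameter makes every URL unique."""
    return [f"GET {base_url}?id={i}" for i in range(count)]

def write_targets(path, base_url, count):
    """Write the generated targets to a file that can be passed via -targets."""
    with open(path, "w") as handle:
        handle.write("\n".join(build_targets(base_url, count)) + "\n")

# Hypothetical endpoint, for illustration only:
# write_targets("targets.txt", "http://localhost:8080/items", 100000)
print(build_targets("http://localhost:8080/items", 2))
```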

&lt;p&gt;With the setup complete, it’s as simple as setting up whatever monitoring you want on a display or two and firing off the scripts.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;sh ./rate_increasing_multi.sh 1000 50 120s 50 uncached&lt;/code&gt;&lt;/p&gt;

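&lt;p&gt;This says to ramp up to 1000 RPS in steps of 50, holding each rate for 2 minutes and starting at 50 RPS. The final argument (&lt;code class=&quot;highlighter-rouge&quot;&gt;uncached&lt;/code&gt;) only labels the results files.&lt;/p&gt;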
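&lt;p&gt;It can help to work out how long the ramp takes before kicking it off. The sketch below mirrors the loop in the scripts above: every rate below the max runs for the increment duration, and the max rate runs until killed. The function name is mine, not part of Vegeta.&lt;/p&gt;

```python
def ramp_schedule(max_rate, rate_inc, start_at):
    """Rates the loop visits: start_at, start_at + rate_inc, ... up to and including max_rate."""
    return list(range(start_at, max_rate + 1, rate_inc))

rates = ramp_schedule(max_rate=1000, rate_inc=50, start_at=50)
timed_rates = [rate for rate in rates if rate != 1000]  # the max rate has no duration
ramp_minutes = len(timed_rates) * 120 / 60               # 120s per timed step
print(len(rates), ramp_minutes)  # 20 38.0
```

&lt;p&gt;So the command above steps through 20 rates and takes 38 minutes to reach the 1000 RPS hold.&lt;/p&gt;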

&lt;p&gt;For each results file generated, &lt;code class=&quot;highlighter-rouge&quot;&gt;./vegeta report -inputs &quot;out.txt&quot;&lt;/code&gt; will output something like (this example is for 250 RPS)&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Requests      [total, rate]            66177, 249.98
Duration      [total, attack, wait]    4m24.783548697s, 4m24.731999487s, 51.54921ms
Latencies     [mean, 50, 95, 99, max]  64.885905ms, 57.516245ms, 107.88721ms, 730.867162ms, 2.309337436s
Bytes In      [total, mean]            943011144, 14249.83
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:66177
Error Set:
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;load-testing-and-results&quot;&gt;Load Testing and Results&lt;/h2&gt;
&lt;p&gt;As different tests are kicked off and rates increase, it’s necessary to keep an eye on any monitoring dashboards, or alerts that may fire, and bail out of the test early if something breaks. From there, logging should help in diagnosing what failed, and tickets can be filed each step of the way. After those issues are resolved, you can pick back up testing until you hit your goal, and maintain it for long enough to be comfortable with the test.&lt;/p&gt;

&lt;p&gt;It should go without saying, but when testing against a live, production environment, it’s always nice to give the current on-call engineers a heads up, and keep them in the loop the entire way through.&lt;/p&gt;

&lt;p&gt;As for Curalate’s load testing, on Cyber Monday we experienced record-breaking traffic numbers - even exceeding our 10x estimates slightly - to our services, and the on-call engineers slept soundly through Thanksgiving weekend.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://media.giphy.com/media/vMnuZGHJfFSTe/giphy.gif&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
</description>
        <pubDate>Thu, 21 Dec 2017 13:00:00 +0000</pubDate>
        <link>http://engineering.curalate.com/2017/12/21/expected-traffic-load-testing.html</link>
        <guid isPermaLink="true">http://engineering.curalate.com/2017/12/21/expected-traffic-load-testing.html</guid>
        
        <category>load-testing</category>
        
        <category>vegeta</category>
        
        
      </item>
    
      <item>
        <title>Tracing High Volume Services</title>
        <description>&lt;p&gt;We like to think that building a service ecosystem is like stacking building blocks.  You start with a function in your code. That function is hosted in a class. That class in a service. That service is hosted in a cluster. That cluster in a region. That region in a data center, etc. At each level there’s a myriad of challenges.&lt;/p&gt;

&lt;p&gt;From the start, developers tend to use things like logging and metrics to debug their systems, but a certain class of problems crops up when you need to debug &lt;em&gt;across&lt;/em&gt; services. From a debugging perspective, you’d like a higher-level view of the system: a linearized view of what requests are doing. That is, you want to be able to see that &lt;code class=&quot;highlighter-rouge&quot;&gt;service A&lt;/code&gt; called &lt;code class=&quot;highlighter-rouge&quot;&gt;service B&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;service C&lt;/code&gt; called &lt;code class=&quot;highlighter-rouge&quot;&gt;service D&lt;/code&gt; at the granularity of single requests.&lt;/p&gt;

&lt;h1 id=&quot;cross-service-logging&quot;&gt;Cross Service Logging&lt;/h1&gt;

&lt;p&gt;The simplest solution to this is to require that every call from service to service comes with some sort of trace identifier. Every request coming into the system, whether from public APIs, client-side requests, or even async daemon-invoked timers and schedules, generates a trace. This trace then gets propagated through the entire system. If you use this trace in all your log statements you can now correlate &lt;em&gt;cross service&lt;/em&gt; calls.&lt;/p&gt;

&lt;p&gt;How is this accomplished at Curalate? For the most part we use Finagle based services and the Twitter ecosystem has done a good job of providing the concept of a &lt;a href=&quot;https://docs.oracle.com/javase/7/docs/api/java/lang/ThreadLocal.html&quot;&gt;thread local&lt;/a&gt; &lt;a href=&quot;https://twitter.github.io/finagle/docs/com/twitter/finagle/tracing/Trace$.html&quot;&gt;TraceId&lt;/a&gt; and automatically propagating it to all other twitter-* components (yet another reason we like &lt;a href=&quot;/2017/07/05/from-thrift-to-finatra.html&quot;&gt;Finatra&lt;/a&gt;!).&lt;/p&gt;

&lt;p&gt;All of our service clients automatically pull this thread local trace id out and populate a known HTTP header field that services then pick up and re-assume.  For Finagle based clients this is auto-magick’d for you. For other clients that we use, like &lt;a href=&quot;http://square.github.io/okhttp/&quot;&gt;OkHttp&lt;/a&gt;, we had to add custom interceptors that pulled the trace from the thread local and set it on the request.&lt;/p&gt;
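&lt;p&gt;Our services are Scala, but the idea is language-agnostic. As a rough Python analogue, the sketch below uses a context variable in place of Finagle’s thread local: an incoming request re-assumes the trace id from the Zipkin-style &lt;code class=&quot;highlighter-rouge&quot;&gt;X-B3-TraceId&lt;/code&gt; header (or mints one), and outgoing calls copy it back onto their headers. The function names are hypothetical and this is not how Finagle implements it.&lt;/p&gt;

```python
import contextvars
import uuid

# Holds the trace id for the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

def handle_incoming(headers):
    """Re-assume the caller's trace id, or start a new trace at the system edge."""
    trace_id = headers.get("X-B3-TraceId") or uuid.uuid4().hex[:16]
    current_trace_id.set(trace_id)
    return trace_id

def outgoing_headers():
    """Headers a client interceptor would add to every downstream request."""
    return {"X-B3-TraceId": current_trace_id.get()}

handle_incoming({"X-B3-TraceId": "463ac35c9f6413ad"})
print(outgoing_headers())
```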

&lt;p&gt;Here is an example of the header being sent automatically as part of Zipkin based headers (which we re-use as our internal trace identifiers):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/2017-09-26-high-volume-tracing/finagle_trace_id.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Notice the &lt;code class=&quot;highlighter-rouge&quot;&gt;X-B3-TraceId&lt;/code&gt; header. When a service receives this request it’ll re-assume the trace id and set its SLF4J &lt;a href=&quot;https://logback.qos.ch/manual/mdc.html&quot;&gt;MDC&lt;/a&gt; field of &lt;code class=&quot;highlighter-rouge&quot;&gt;traceId&lt;/code&gt; to be that value. We can now reference the trace id in our logback.xml configuration, as in the STDOUT appender below:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;appender&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;STDOUT-COLOR&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ch.qos.logback.core.ConsoleAppender&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;filter&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ch.qos.logback.classic.filter.ThresholdFilter&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;level&amp;gt;&lt;/span&gt;TRACE&lt;span class=&quot;nt&quot;&gt;&amp;lt;/level&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/filter&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;encoder&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;pattern&amp;gt;&lt;/span&gt;%yellow(%d) [%magenta(%X{traceId})] [%thread] %highlight(%-5level) %cyan(%logger{36}) %marker - %msg%n&lt;span class=&quot;nt&quot;&gt;&amp;lt;/pattern&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/encoder&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/appender&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And we can also send the trace id as a structured JSON field to Loggly.&lt;/p&gt;

&lt;p&gt;Let’s look at an example from our own logs:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/2017-09-26-high-volume-tracing/tid_example.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;What we’re seeing here is a system called &lt;code class=&quot;highlighter-rouge&quot;&gt;media-api&lt;/code&gt; made a query to a system called &lt;code class=&quot;highlighter-rouge&quot;&gt;networkinformationsvc&lt;/code&gt;. The underlying request carried a correlating trace id across the service boundaries and both systems logged to Loggly with the &lt;code class=&quot;highlighter-rouge&quot;&gt;json.tid&lt;/code&gt; (transaction id) field populated. Now we can query our logs and get a linear time based view of what’s happening.&lt;/p&gt;

&lt;h1 id=&quot;thread-local-tracing&quot;&gt;Thread local tracing&lt;/h1&gt;

&lt;p&gt;The trick here is to make sure that this implicit trace id that is pinned to the thread local of the initiating request properly &lt;em&gt;moves&lt;/em&gt; from thread to thread as you make async calls.  We don’t want anyone to have to ever &lt;em&gt;remember&lt;/em&gt; to set the trace. It should just gracefully flow from thread to thread &lt;em&gt;implicitly&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;To make sure that traces hop properly between threads we had to enforce that everybody uses an &lt;code class=&quot;highlighter-rouge&quot;&gt;ExecutionContext&lt;/code&gt; that safely captures the caller’s thread locals before executing. This is critical, otherwise you can make an async call and the trace id gets dropped. In that case, bye bye go the logs!  It’s hyper important to always &lt;em&gt;take an execution context&lt;/em&gt; and to never &lt;em&gt;pin an execution context&lt;/em&gt; when it comes to async Scala code. Thankfully, we can make any execution context &lt;em&gt;safe&lt;/em&gt; by wrapping it up in a delegate:&lt;/p&gt;

&lt;div class=&quot;language-scala highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cm&quot;&gt;/**
 * Wrapper around an existing ExecutionContext that makes it propagate MDC information.
 */&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PropagatingExecutionContextWrapper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wrapped&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ExecutionContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ProvidedExecutionContext&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;

   &lt;span class=&quot;k&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;prepare&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ExecutionContext&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ExecutionContext&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
     &lt;span class=&quot;c1&quot;&gt;// Save the call-site state
&lt;/span&gt;     &lt;span class=&quot;k&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Local&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;save&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;

     &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Runnable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Unit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Runnable&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
       &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Unit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
         &lt;span class=&quot;c1&quot;&gt;// re-assume the captured call site thread locals
&lt;/span&gt;         &lt;span class=&quot;nc&quot;&gt;Local&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;let&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
           &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
         &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
       &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
     &lt;span class=&quot;o&quot;&gt;})&lt;/span&gt;

     &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reportFailure&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Throwable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Unit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reportFailure&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
   &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Runnable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Unit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wrapped&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reportFailure&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Throwable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Unit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wrapped&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reportFailure&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;TwitterExecutionContextProvider&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ExecutionContextProvider&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;cm&quot;&gt;/**
   * Safely wrap any execution context into one that properly passes context
   *
   * @param executionContext
   * @return
   */&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;executionContext&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ExecutionContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PropagatingExecutionContextWrapper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;executionContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We’ve taken this trace wrapping concept and applied it to all kinds of executors, like &lt;code class=&quot;highlighter-rouge&quot;&gt;ExecutorService&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;ScheduledExecutorService&lt;/code&gt;. We don’t really want to expose the internals of how we wrap traces, so we load an &lt;code class=&quot;highlighter-rouge&quot;&gt;ExecutionContextProvider&lt;/code&gt; via the Java &lt;a href=&quot;https://docs.oracle.com/javase/7/docs/api/java/util/ServiceLoader.html&quot;&gt;service loading&lt;/a&gt; mechanism and provide an API contract so that people can wrap executors without caring how they are wrapped:&lt;/p&gt;

&lt;div class=&quot;language-scala highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cm&quot;&gt;/**
 * A provider that loads from the java service mechanism
 */&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;object&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ExecutionContextProvider&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;lazy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;provider&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ExecutionContextProvider&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nc&quot;&gt;Option&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;ServiceLoader&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;classOf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;ExecutionContextProvider&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])).&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;asScala&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;getOrElse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Nil&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;headOption&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;getOrElse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;throw&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MissingExecutionContextException&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;cm&quot;&gt;/**
 * Marker interfaces to provide contexts with custom logic. This
 * forces users to make sure to use the execution context providers that support request tracing
 * and maybe other tooling
 */&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;trait&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ProvidedExecutionContext&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ExecutionContext&lt;/span&gt;

&lt;span class=&quot;cm&quot;&gt;/**
 * A context provider contract
 */&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;trait&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ExecutionContextProvider&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ExecutionContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ProvidedExecutionContext&lt;/span&gt;

  &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From a caller’s perspective, they now do:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;implicit val execContext = ExecutionContextProvider.provider.of(scala.concurrent.ExecutionContext.Implicits.global)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This wraps, in this example, the default Scala context.&lt;/p&gt;
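
&lt;p&gt;To see the propagation end to end, here is a minimal, self-contained sketch. It is not our production wrapper. It assumes a plain &lt;code class=&quot;highlighter-rouge&quot;&gt;ThreadLocal&lt;/code&gt; in place of Twitter’s &lt;code class=&quot;highlighter-rouge&quot;&gt;Local&lt;/code&gt;, it captures the trace id when &lt;code class=&quot;highlighter-rouge&quot;&gt;execute&lt;/code&gt; is called rather than in &lt;code class=&quot;highlighter-rouge&quot;&gt;prepare&lt;/code&gt;, and the names &lt;code class=&quot;highlighter-rouge&quot;&gt;TraceId&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;TracePropagatingContext&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;TraceDemo&lt;/code&gt; are made up for illustration:&lt;/p&gt;

&lt;div class=&quot;language-scala highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for the trace-carrying thread local.
object TraceId {
  private val current = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }
  def get: Option[String] = current.get()
  def set(id: Option[String]): Unit = current.set(id)
}

// Captures the submitting thread's trace id and restores it around the task.
class TracePropagatingContext(wrapped: ExecutionContext) extends ExecutionContext {
  override def execute(r: Runnable): Unit = {
    val captured = TraceId.get // read on the thread that submits the work
    wrapped.execute(new Runnable {
      def run(): Unit = {
        val previous = TraceId.get
        TraceId.set(captured)
        // always put the pool thread's old value back, even if the task throws
        try r.run() finally TraceId.set(previous)
      }
    })
  }

  override def reportFailure(t: Throwable): Unit = wrapped.reportFailure(t)
}

object TraceDemo extends App {
  implicit val ec: ExecutionContext = new TracePropagatingContext(ExecutionContext.global)

  TraceId.set(Some(&quot;abc-123&quot;))
  // The body runs on a pool thread but should still observe the caller's trace id.
  val seen = Await.result(Future(TraceId.get), 5.seconds)
  println(seen)
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Without the wrapper, the same &lt;code class=&quot;highlighter-rouge&quot;&gt;Future&lt;/code&gt; would see whatever value happened to be sitting on the pool thread, which is exactly how trace ids get dropped.&lt;/p&gt;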

&lt;h1 id=&quot;service-to-service-dependency-and-performance-tracing&quot;&gt;Service to Service dependency and performance tracing&lt;/h1&gt;

&lt;p&gt;Well that’s great! We have a way to safely and easily pass trace ids, and we’ve tooled through our clients to all pass this trace id automatically, but this only gives us &lt;em&gt;logging&lt;/em&gt; information. We’d really like to be able to leverage the trace information to get more interesting statistics such as service to service dependencies, performance across service hops, etc. Correlated logs are just the beginning of what we can do.&lt;/p&gt;

&lt;p&gt;Zipkin is an open source tool that we’ve discussed here &lt;a href=&quot;/2016/09/12/zipkin-at-curalate.html&quot;&gt;before&lt;/a&gt;, so we won’t go too much into it, but suffice it to say that Zipkin hinges on us having proper trace identifiers. It samples incoming requests to determine whether they should be traced (i.e. sent to Zipkin). By default, we have all our services send 0.1% of their requests to Zipkin to minimize impact on the service.&lt;/p&gt;

&lt;p&gt;Let’s look at an example:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/2017-09-26-high-volume-tracing/zipkin.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In this Zipkin trace we can see that this batch call made a call to Dynamo. The whole call took 6 milliseconds and 4 of those milliseconds were spent calling Dynamo. We’ve instrumented all our external client dependencies with Zipkin trace information automatically using Java dynamic proxies, so that as we upgrade our external dependencies we get tracing on new functions as well.&lt;/p&gt;

&lt;p&gt;If we dig further into the trace:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/2017-09-26-high-volume-tracing/zipkin_w_trace.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can now see (highlighted) the trace ID and search our logs for entries related to this trace.&lt;/p&gt;

&lt;h1 id=&quot;finding-needles-in-the-haystack&quot;&gt;Finding needles in the haystack&lt;/h1&gt;

&lt;p&gt;We have a way to correlate logs, and get sampled performance and dependency information between services via Zipkin. What we still can’t do is trace an individual piece of data flowing through high volume queues and streams.&lt;/p&gt;

&lt;p&gt;Some of our services at Curalate process 5 to 10 thousand items a second. It’s just not fiscally prudent to log all that information to Loggly or emit unique metrics to our metrics system (DataDog). Still, we want to know at the event level where things are in the system, where they passed through, where they got dropped, etc. We want to answer the question:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Where is identifier XYZ.123 in the system and where did it go and come from?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is difficult to answer with the current tools we’ve discussed.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://media.giphy.com/media/3o7aTskHEUdgCQAXde/giphy.gif&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To solve this problem we have one more system in play. This is our high volume auditing system that lets us write and filter audit events at a large scale (100k req/s+). The basic architecture here is that services write audit events via an Audit API, which are funneled to Kinesis Firehose. The Firehose stream buffers data for either 5 minutes or 128 MB (whichever comes first). When the buffer limit is reached, Firehose dumps newline separated JSON in a flat file into S3. We have a lambda function that waits for S3 create events on the bucket, reads the JSON, then transforms the JSON events into &lt;a href=&quot;https://parquet.apache.org/&quot;&gt;Parquet&lt;/a&gt;, which is an efficient columnar storage format. The Parquet file is written back into S3 into a new folder with the naming scheme of&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;year=YYYY/month=MM/day=DD/hour=HH/minute=mm/&amp;lt;uuid&amp;gt;.parquet
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Where the minutes are grouped in 5 minute intervals. This partition is then added to Athena, a managed query service built on Presto that lets you query large datasets in S3 with SQL.&lt;/p&gt;
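
&lt;p&gt;As a rough illustration of that naming scheme, the sketch below builds a key from an event time. The UTC time zone, the zero padding, and rounding the minute &lt;em&gt;down&lt;/em&gt; to its 5 minute bucket are assumptions made for the example, and &lt;code class=&quot;highlighter-rouge&quot;&gt;AuditPartitions&lt;/code&gt; is a hypothetical name:&lt;/p&gt;

&lt;div class=&quot;language-scala highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import java.time.{Instant, ZoneOffset}
import java.util.UUID

object AuditPartitions {
  // Builds the S3 key for a Parquet file holding events from `eventTime`.
  // Assumption: partitions are in UTC and minutes are floored to a 5 minute bucket.
  def keyFor(eventTime: Instant, fileId: UUID): String = {
    val t = eventTime.atZone(ZoneOffset.UTC)
    val bucket = (t.getMinute / 5) * 5 // integer division floors 13 to 10, 59 to 55
    f&quot;year=${t.getYear}%04d/month=${t.getMonthValue}%02d/day=${t.getDayOfMonth}%02d/&quot; +
      f&quot;hour=${t.getHour}%02d/minute=$bucket%02d/$fileId.parquet&quot;
  }
}

object AuditPartitionsDemo extends App {
  // An event at 22:13:45 UTC lands in the hour=22/minute=10 partition:
  // year=2017/month=09/day=18/hour=22/minute=10/&amp;lt;uuid&amp;gt;.parquet
  println(AuditPartitions.keyFor(Instant.parse(&quot;2017-09-18T22:13:45Z&quot;), UUID.randomUUID()))
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Because the time components are part of the key, a query that pins &lt;code class=&quot;highlighter-rouge&quot;&gt;day&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;hour&lt;/code&gt; only has to read the files under those prefixes.&lt;/p&gt;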

&lt;p&gt;&lt;img src=&quot;/assets/2017-09-26-high-volume-tracing/auditing_arch.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;What does this have to do with trace id’s?  Each event emitted comes with a trace id that we can use to query back to logs or Zipkin or other correlating identifiers.  This means that even if services aren’t logging to Loggly due to volume restrictions, we can still see how events trace through the system. Let’s look at an example where we find a specific network identifier from Instagram and see when it was data mined and when we added semantic image tags to it (via our vision APIs):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;minute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;message&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;curalateauditevents&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;audit_events&quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'network_id'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'1584258444344170009_249075471'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'network'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'instagram'&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;day&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;18&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hour&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;22&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;order&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;by&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;timestamp&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;desc&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;limit&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is the Athena query.  We’ve included the specific network ID and network we are looking for, as well as a limited partition scope.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/2017-09-26-high-volume-tracing/athena_query.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Notice the two highlights.&lt;/p&gt;

&lt;p&gt;Starting at the second highlight there is a message that we augmented the piece of data. In our particular pipe we only augment data under specific circumstances (not every image is analyzed) and so it was important to see that some images were dropped and this one was augmented. Now we can definitely say “yes, item ABC was augmented but item DEF was not and here is why”.  Awesome.&lt;/p&gt;

&lt;p&gt;Moving upwards, the first highlight is how much data was scanned.  This particular partition we looked through has 100MB of data, but we only searched through 2MB to find what we wanted (this is due to the optimization of Parquet). Athena is priced by how much data you scan at a cost of $5 per terabyte. So this query was pretty much free at a cost of $0.000004. The total set of files across all the partitions for the past week is roughly 21GB spanning about 3.5B records.  So even if we queried &lt;em&gt;all&lt;/em&gt; the data, we’d only pay $.04.  In fact, the biggest cost here isn’t in storage or query or lambda, it’s in firehose! Firehose charges you $0.029 per GB transferred.  At this rate we pay 60 cents a week. The boss is going to be ok with that.&lt;/p&gt;

&lt;p&gt;However, there are &lt;em&gt;still&lt;/em&gt; some issues here. Remember the target scale is upwards of 100k req/s. At that scale we’re dealing with a LOT of data through Kinesis Firehose. That’s a lot of data into S3, a lot of IO reads to transform to Parquet, and a lot of opportunities to accidentally scan through tons of data in our Athena partitions with poorly written queries that loop over repeated data (even though we limit partitions to a 2 week TTL). We also now have issues of rate limiting with Kinesis Firehose.&lt;/p&gt;

&lt;p&gt;On top of that, some services just pump so much repeated data that it’s not worth seeing it all the time. To that end we need some sort of way to do live filters on the streams. What we’ve done to solve this problem is leverage dynamically invoked &lt;a href=&quot;https://www.javaworld.com/article/2144908/scripting-jvm-languages/nashorn--javascript-made-great-in-java-8.html&quot;&gt;Nashorn JavaScript&lt;/a&gt; filters. We load up filters from a known remote location at an interval of 30 seconds, and if a service is marked for filtering (i.e. it has a really high load and needs to be filtered) then it’ll run all of its audit events through the filter &lt;em&gt;before&lt;/em&gt; they actually get sent to the downstream Firehose. If an event fails the filter, it’s discarded. If it passes, the event is annotated with the name of the filter it passed through and sent through the stream.&lt;/p&gt;

&lt;p&gt;Filters are just YML files for us:&lt;/p&gt;

&lt;div class=&quot;language-yml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Filter&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;name&quot;&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;expiration&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;lt;Optional DateTime. Epoch or string datetime of ISO formats parseable by JODA&amp;gt;&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;js&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;|&lt;/span&gt;
    &lt;span class=&quot;no&quot;&gt;function filter(event) {&lt;/span&gt;
        &lt;span class=&quot;no&quot;&gt;// javascript that returns a boolean&lt;/span&gt;
    &lt;span class=&quot;no&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And an example filter may look like:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;anton_client_filter&quot;&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;js&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;|&lt;/span&gt;
    &lt;span class=&quot;no&quot;&gt;function filter(event) {&lt;/span&gt;
      &lt;span class=&quot;no&quot;&gt;var client = event.context.get(&quot;client_id&quot;)&lt;/span&gt;

      &lt;span class=&quot;no&quot;&gt;return client != null &amp;amp;&amp;amp; client == &quot;3136&quot;&lt;/span&gt;
    &lt;span class=&quot;no&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this filter only events that are marked with the client id of my client will pass through. Some systems don’t need to be filtered so all their events pass through anyway.&lt;/p&gt;
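
&lt;p&gt;On the JVM side, running such a filter could look roughly like the sketch below. It is not our actual implementation: &lt;code class=&quot;highlighter-rouge&quot;&gt;AuditEvent&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;JsFilter&lt;/code&gt; are hypothetical names, the event is reduced to a single context map, and it assumes a JVM that still ships the Nashorn engine (it was removed from the JDK in Java 15). It uses only the standard &lt;code class=&quot;highlighter-rouge&quot;&gt;javax.script&lt;/code&gt; API: the script is evaluated once, and afterwards only the &lt;code class=&quot;highlighter-rouge&quot;&gt;filter&lt;/code&gt; function is invoked per event.&lt;/p&gt;

&lt;div class=&quot;language-scala highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import javax.script.{Invocable, ScriptEngineManager}

// Hypothetical event shape: just a context map. The JavaBean-style getter is
// what lets a script read it as `event.context`.
final class AuditEvent(ctx: java.util.Map[String, String]) {
  def getContext: java.util.Map[String, String] = ctx
}

// Evaluates the script once, then re-invokes only `filter` for each event.
final class JsFilter(val name: String, js: String) {
  private val engine = new ScriptEngineManager().getEngineByName(&quot;nashorn&quot;)
  require(engine != null, &quot;Nashorn is not available on this JVM&quot;)
  engine.eval(js) // parse the script and build any globals a single time

  def passes(event: AuditEvent): Boolean =
    engine.asInstanceOf[Invocable].invokeFunction(&quot;filter&quot;, event) match {
      case b: java.lang.Boolean =&amp;gt; b.booleanValue()
      case _                    =&amp;gt; false // null, undefined, or non-boolean: drop the event
    }
}

object JsFilterDemo extends App {
  val js =
    &quot;&quot;&quot;function filter(event) {
      |  var client = event.context.get(&quot;client_id&quot;)
      |  return client != null &amp;amp;&amp;amp; client == &quot;3136&quot;
      |}&quot;&quot;&quot;.stripMargin

  val filter = new JsFilter(&quot;anton_client_filter&quot;, js)

  val ctx = new java.util.HashMap[String, String]()
  ctx.put(&quot;client_id&quot;, &quot;3136&quot;)

  // Events that pass would be annotated with filter.name before being sent on.
  println(filter.passes(new AuditEvent(ctx)))
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Treating anything other than an explicit boolean as a failed match is a design choice for the sketch: a filter that returns &lt;code class=&quot;highlighter-rouge&quot;&gt;undefined&lt;/code&gt; by accident then drops events instead of flooding the stream.&lt;/p&gt;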

&lt;p&gt;Now we can write queries like:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;minute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;message&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;curalateauditevents&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;&quot;audit_events&quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;contains&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;trace_names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'anton_client_filter'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;day&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;18&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hour&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;22&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;limit&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To get events that were tagged with my filter in the current partition. From there, we now can do other exploratory queries to find related data (either by trace id or by other identifiers related to the data we care about).&lt;/p&gt;

&lt;p&gt;Let’s look at some graphs that show how dramatic this filtering can be:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/2017-09-26-high-volume-tracing/filtering.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here the purple line is one of our data mining ingestion endpoints. It’s pumping a lot of data to firehose, most of which is repeated over time and so isn’t super useful to get all the input from. The moment the graph drops is when the yml file was uploaded with a filter to add filtering to the service. The blue line is a downstream service that gets data after debouncing and other processing. Given its load is a lot less we don’t care so much that it is sending all its data downstream. You can see the purple line slow to a trickle later on when the filter kicks in and data starts matching it.&lt;/p&gt;

&lt;h2 id=&quot;caveats-with-nashorn&quot;&gt;Caveats with Nashorn&lt;/h2&gt;

&lt;p&gt;While building the system out, we ran into a few interesting caveats when using Nashorn in a high volume pipeline like this.&lt;/p&gt;

&lt;p&gt;The first was that subtle differences in JavaScript can have &lt;em&gt;massive&lt;/em&gt; performance impacts. Let’s look at some examples and benchmark them.&lt;/p&gt;

&lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;anton&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;136742&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;153353&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;mineable&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;mineable_id&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;mineable&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;anton&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;mineable&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;a href=&quot;http://openjdk.java.net/projects/code-tools/jmh/&quot;&gt;JMH&lt;/a&gt; benchmark results from running this code are:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[info] FiltersBenchmark.testInvoke  thrpt   20     1027.409 ±      29.922  ops/s
[info] FiltersBenchmark.testInvoke   avgt   20  1484234.075 ± 1783689.007  ns/op
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What?? Only about 1,027 ops/second.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://media.giphy.com/media/3ohhwH6yMO7ED5xc7S/giphy.gif&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Let’s make some adjustments to the filter. Our internal system loads the JavaScript into an isolated scope per filter and then re-invokes just the function &lt;code class=&quot;highlighter-rouge&quot;&gt;filter&lt;/code&gt; each time, which lets us safely create global objects and pay heavy prices for things only once:&lt;/p&gt;

&lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;anton&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;136742&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;153353&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;mineable&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;mineable_id&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;mineable&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;anton&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;mineable&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[info] FiltersBenchmark.testInvoke  thrpt   20  7391161.402 ± 206020.703  ops/s
[info] FiltersBenchmark.testInvoke   avgt   20    14879.890 ±   8087.179  ns/op
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Ah, much better! About 7.4M ops/sec.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://media.giphy.com/media/Tud8FymnIZtW8/giphy.gif&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If we use Java constructs instead:&lt;/p&gt;

&lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;anton&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;java&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;util&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;HashSet&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;anton&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;136742&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;anton&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;153353&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;mineable&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;mineable_id&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;mineable&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;anton&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;contains&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;mineable&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[info] FiltersBenchmark.testInvoke  thrpt   20  5662799.317 ± 301113.837  ops/s
[info] FiltersBenchmark.testInvoke   avgt   20    41963.710 ±  11349.277  ns/op
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Still fast: about 5.7M ops/sec, even though the set is rebuilt on every call.&lt;/p&gt;
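
&lt;p&gt;When reading the JMH rows above, note that the number before ± is the score and the number after it is the error margin. Here is a minimal sketch that splits a row into its columns and compares the first two throughput scores. The &lt;code class=&quot;highlighter-rouge&quot;&gt;parseJmhRow&lt;/code&gt; helper and its field names are hypothetical and not part of JMH; it assumes rows shaped exactly like the sbt output shown here.&lt;/p&gt;

```javascript
// Hypothetical helper (not part of JMH): split one sbt-printed JMH result row
// into its columns. The number before "±" is the score, the one after is the error.
function parseJmhRow(row) {
  var pattern = /^\[info\]\s+(\S+)\s+(\S+)\s+(\d+)\s+([\d.]+)\s+±\s+([\d.]+)\s+(\S+)\s*$/;
  var m = pattern.exec(row);
  if (m === null) {
    throw new Error("Unrecognized JMH row: " + row);
  }
  return {
    benchmark: m[1],
    mode: m[2],
    cnt: parseInt(m[3], 10),
    score: parseFloat(m[4]),
    error: parseFloat(m[5]),
    units: m[6]
  };
}

// Throughput rows copied from the first two benchmark runs above.
var firstRun = parseJmhRow("[info] FiltersBenchmark.testInvoke  thrpt   20     1027.409 ±      29.922  ops/s");
var secondRun = parseJmhRow("[info] FiltersBenchmark.testInvoke  thrpt   20  7391161.402 ± 206020.703  ops/s");

// Ratio of the two throughput scores.
var speedup = secondRun.score / firstRun.score;
console.log(firstRun.score + " " + firstRun.units + " vs " + secondRun.score + " " + secondRun.units);
console.log("speedup: about " + Math.round(speedup) + "x");
```

&lt;p&gt;Read this way, the second run is roughly 7,194 times the throughput of the first.&lt;/p&gt;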

&lt;p&gt;Something is clearly up with the anonymous object creation in Nashorn.  Needless to say, benchmarking is important, especially when these filters are going to be dynamically injected into every single service we have.  We need them to be performant, sandboxed, and safe to fail.&lt;/p&gt;

&lt;p&gt;For that we make sure every filter runs in its own engine scope, in a separate execution context isolated from the main running code, and is fired off asynchronously so it doesn’t block the main calling thread.  This is also where we monitor and alert when someone uploads a non-performant filter so we can investigate and mitigate quickly.&lt;/p&gt;
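
&lt;p&gt;The safe-to-fail and slow-filter reporting parts of this can be sketched roughly as below. This is plain JavaScript rather than the JVM-side code described here, and it leaves out the isolated engine scope and the asynchronous dispatch. The &lt;code class=&quot;highlighter-rouge&quot;&gt;invokeSafely&lt;/code&gt; name and its &lt;code class=&quot;highlighter-rouge&quot;&gt;budgetMs&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;onError&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;onSlow&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;now&lt;/code&gt; options are all assumptions made for illustration.&lt;/p&gt;

```javascript
// Hypothetical wrapper around a single filter call. A throwing filter counts
// as "no match", and calls that exceed the time budget are reported.
function invokeSafely(filterFn, event, options) {
  var now = options.now || Date.now; // injectable clock so the sketch is testable
  var started = now();
  var matched = false;
  try {
    matched = Boolean(filterFn(event));
  } catch (err) {
    // Safe to fail: report the error instead of propagating it to the caller.
    if (options.onError) {
      options.onError(err);
    }
  }
  var elapsedMs = now() - started;
  if (elapsedMs > options.budgetMs) {
    // Hook for monitoring and alerting on slow filters.
    if (options.onSlow) {
      options.onSlow(elapsedMs);
    }
  }
  return matched;
}

// Usage: a filter that throws on a malformed event is contained and reported.
var errors = [];
var result = invokeSafely(
  function (event) { return event.context.get("mineable_id") !== null; },
  {}, // malformed event with no context, so the filter throws
  { budgetMs: 5, onError: function (err) { errors.push(err); } }
);
console.log(result, errors.length);
```

&lt;p&gt;Passing the clock in as an option keeps the slow-filter path easy to exercise without actually waiting.&lt;/p&gt;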

&lt;p&gt;For example, the discovery of the poorly performing JSON object came from this alert:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/2017-09-26-high-volume-tracing/high_cpu.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Tracing is hard, and it’s incredibly difficult to retrofit if you start building service architectures without it in mind from the get-go.  Threading trace identifiers through the system from the beginning sets you up to build more interesting debugging infrastructure that isn’t always possible otherwise.  When building larger service ecosystems, it’s important to keep in mind how to inspect things at varying levels of granularity.  Sometimes building custom tools to help inspect the systems is worth the effort, especially if they help debug complicated escalations or data inconsistencies.&lt;/p&gt;
</description>
        <pubDate>Tue, 26 Sep 2017 12:11:36 +0000</pubDate>
        <link>http://engineering.curalate.com/2017/09/26/tracing-services.html</link>
        <guid isPermaLink="true">http://engineering.curalate.com/2017/09/26/tracing-services.html</guid>
        
        <category>tracing</category>
        
        <category>soa</category>
        
        <category>services</category>
        
        <category>logging</category>
        
        <category>scala</category>
        
        <category>devops</category>
        
        
      </item>
    
  </channel>
</rss>
