At Curalate, our service and API traffic is fairly tightly coupled to e-commerce traffic, so any increase is reasonably predictable. We expect an increase in request rate towards the beginning of November each year, with traffic peaking at 10x our steady rate on Black Friday and Cyber Monday.

Why Load Test?

Curalate works directly with retail brands to drive traffic to their sites. The holiday shopping period is the most important time of the year for most of them, and we need to ensure that our experiences continue to operate at a high standard throughout.

More generally, though, load testing is critical for services and APIs, especially when load is expected to increase. It uncovers potential points of failure during business hours, and hopefully prevents anyone from needing to wake up at 2 a.m. on a weekend.

Creating a Test Plan

In cases of expected load increases, it’s important to understand as much as possible before diving in. There are a few questions to ask:

  • Is there any data available so I can understand the expected load? Is it a yearly increase - are previous years a good indication? If it’s a brand new launch, what are the expectations?
  • What are the hard and soft dependencies of the service or API that I’m testing? What sort of caching is in place? Does a 10x increase on my service cause a 10x increase on everything downstream, as well?
  • Should we test against the active production environment, or is it feasible to spin up a staging environment with the same scaling behavior? Depending on the breadth of dependencies, it may not be possible to spin up a duplicated environment.
  • If I test against production, how can I ensure I don’t negatively affect live traffic?
  • Am I expecting an increase in load across services? If there are any core dependencies, what does the combined load look like at peak?
  • How much of a buffer do I provide against the expected peak?
  • Does my service have any rate limiting that I need to bypass or keep in mind? How do I simulate live traffic without being throttled?

Getting Right to It

In our case, there were four main services we were interested in testing against expected load, separated into on-site (APIs and services called directly from our clients’ sites) and off-site (our custom-built, Curalate-hosted services). This distinction works well for us because we expected a 10x increase for on-site experiences, but only a 2-3x increase for off-site ones - brands focus on driving traffic to their own e-commerce sites.

Now, there are many tools out there for load testing. For our purposes, I used Vegeta for its robust set of options and extensibility. It was easy to script around, allowing a steadily increasing request rate against either a single target or lazily generated targets. The output functionality is also well thought out: it supports top-line latency stats along with some basic charting capabilities.

Let’s assume we have a service that we want to test up to 1000 RPS, both against a single target and against multiple targets - the latter to work around any caching in place.

1000 RPS Single and Multi-Target

The setup was fairly simple:

Spin up a couple of AWS EC2 m3.2xlarge instances.
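If you’d rather script this than use the console, something like the following AWS CLI sketch works - the AMI, key pair, and security group IDs are placeholders, not values from our setup:

# Launch two m3.2xlarge load generators; substitute your own AMI,
# key pair, and security group.
aws ec2 run-instances \
  --image-id ami-12345678 \
  --count 2 \
  --instance-type m3.2xlarge \
  --key-name my-keypair \
  --security-group-ids sg-12345678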

SSH to the instances, create a load_testing folder, and fetch the Vegeta binary.

wget "https://github.com/tsenart/vegeta/releases/download/v6.3.0/vegeta-v6.3.0-linux-386.tar.gz"

Put together a simple, quick script to handle steadily increasing the request rate, and then hold steady at the max rate.

#!/bin/bash
# Ramp the request rate against a single target, then hold at the max rate.
# Arguments: target maxRate rateInc incDuration startAt hitType
target=$1       # vegeta target, e.g. "GET https://host/path"
maxRate=$2      # peak request rate (RPS)
rateInc=$3      # RPS added at each step
incDuration=$4  # how long to hold each step, e.g. "120s"
startAt=$5      # initial request rate (RPS)
hitType=$6      # label for the output files, e.g. "cached"
currentRate=$startAt

while [ "$currentRate" -le "$maxRate" ]
do
  if [ "$currentRate" -eq "$maxRate" ]
  then
    # At the peak, run with no duration - attack until manually killed.
    echo "$target" | ./vegeta attack -rate=$currentRate > reel-$maxRate-$currentRate-$hitType-test.bin
  else
    echo "$target" | ./vegeta attack -rate=$currentRate -duration=$incDuration > reel-$maxRate-$currentRate-$hitType-test.bin
  fi
  currentRate=$((currentRate+rateInc))
done

Basically, if the rate hasn’t yet hit the max, run vegeta at the current rate for the specified duration, increase the rate by the increment, and loop again. Once the max rate is hit, don’t specify a duration - run until manually killed. The multi-target script is similar, but reads from a targets.txt file.

#!/bin/bash
# Same ramp-and-hold loop, but vegeta reads its targets from targets.txt.
# Arguments: maxRate rateInc incDuration startAt hitType
maxRate=$1      # peak request rate (RPS)
rateInc=$2      # RPS added at each step
incDuration=$3  # how long to hold each step, e.g. "120s"
startAt=$4      # initial request rate (RPS)
hitType=$5      # label for the output files, e.g. "uncached"
currentRate=$startAt

while [ "$currentRate" -le "$maxRate" ]
do
  if [ "$currentRate" -eq "$maxRate" ]
  then
    # At the peak, run with no duration - attack until manually killed.
    ./vegeta attack -rate=$currentRate -targets=targets.txt > reel-$maxRate-$currentRate-$hitType-test.bin
  else
    ./vegeta attack -rate=$currentRate -duration=$incDuration -targets=targets.txt > reel-$maxRate-$currentRate-$hitType-test.bin
  fi
  currentRate=$((currentRate+rateInc))
done

Aside: I was unable to get the -lazy flag to work properly with Vegeta, so I went with brute force and just generated a ton of targets into a file. I’m convinced it could have been more elegant, but sometimes the easy solution works just as well.
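For reference, here’s a minimal sketch of that brute-force generation - the endpoint and cache-busting query parameter are hypothetical stand-ins:

#!/bin/bash
# Emit one vegeta target per line, each with a unique query string
# so repeated requests don't all hit the same cache entry.
for i in $(seq 1 10000)
do
  echo "GET https://api.example.com/v1/products?cb=$i" >> targets.txt
done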

With the setup complete, it’s as simple as setting up whatever monitoring you want on a display or two and firing off the scripts.

sh ./rate_increasing_multi.sh 1000 50 120s 50 uncached

Which says to increase up to 1000 RPS, 50 at a time, for 2 minutes at each rate, starting at 50 RPS.
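The single-target script is invoked the same way, with the target passed as the first argument - the script name and URL below are hypothetical:

sh ./rate_increasing_single.sh "GET https://api.example.com/v1/products" 1000 50 120s 50 cached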

For each results file generated, ./vegeta report -inputs "reel-1000-250-uncached-test.bin" will output something like the following (this example is for 250 RPS):

Requests      [total, rate]            66177, 249.98
Duration      [total, attack, wait]    4m24.783548697s, 4m24.731999487s, 51.54921ms
Latencies     [mean, 50, 95, 99, max]  64.885905ms, 57.516245ms, 107.88721ms, 730.867162ms, 2.309337436s
Bytes In      [total, mean]            943011144, 14249.83
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:66177
Error Set:
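The same binary result files also feed the charting mentioned earlier. A minimal sketch, assuming the plot reporter available in this version of Vegeta, which writes an HTML latency chart you can open in a browser:

./vegeta report -inputs "reel-1000-250-uncached-test.bin" -reporter plot > plot-250.html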

Load Testing and Results

As different tests are kicked off and rates increase, it’s necessary to keep an eye on monitoring dashboards and any alerts that may fire, and to bail out of the test early if something goes wrong. From there, logging should help in diagnosing what failed, and tickets can be filed each step of the way. After those issues are resolved, you can pick testing back up until you hit your goal, then maintain that rate long enough to be comfortable with the results.

It should go without saying, but when testing against a live production environment, it’s always nice to give the current on-call engineers a heads-up, and to keep them in the loop the entire way through.

As for Curalate’s load testing: on Cyber Monday our services saw record-breaking traffic - slightly exceeding even our 10x estimates - and the on-call engineers slept soundly through Thanksgiving weekend.