Tips for Starting With Redshift

When beginning to use Amazon Redshift there are always some growing pains. In this blog post we’ll go through 3 tips and tricks we learned from starting up our own Redshift pipeline!

Why We Use Redshift

At Curalate we serve a lot of images. In July of 2016 we served billions of images throughout our Fanreel, Like2Buy and Reveal products, and that number is steadily increasing. We don’t just serve plain images, though. All of these images are “productized”, containing information about the products within them or any associated information, and are used to drive interactions and sales for our customers. It’s easier to show what I mean than to describe it.

Say you stumbled across this picture of my dog and thought, “Wow I bet Fluffy would look great in that harness, and I wonder how much that leash costs”. Lucky for you, this image is productized and served through Reveal! You could easily hover over the image and answer those questions…

The value we add is not just how many images are served but how effective they are for our clients. We need to be able to answer questions with the analytics we provide, such as:

How well were they received by the public?
How many people clicked through to the products and came back for more?
Do people convert more often after interacting with these productized images?
Which images drove the most traffic to the product page?

To help answer these questions we’ve built a custom data pipeline which captures and stores usage metrics from our client-facing products. All of these metrics end up in an Amazon Redshift cluster, a columnar data warehouse from Amazon, which we use for daily rollups, client reports, and our own internal investigations to help make our products better.

Here’s a quick overview of our pipeline:

Our data pipeline has been running with near 100% uptime for well over a year. It is structured like this:

Curalate products send usage metrics like impressions, clicks and hovers and we convert them to a standardized JSON format.
Using an Apache Kafka queue and Pinterest’s Secor we send batches of metrics to S3 for storage. Were we to build this system today we would likely use Amazon Kinesis with Amazon Firehose due to Firehose into Redshift being available and Kinesis being more fully featured than it was during initial development.
We run a nightly job, which we’ll walk through below, to safely load our data from S3 into Redshift.

Learning to work with Redshift had a lot of interesting challenges and plenty of lessons learned. Here are a few of them!

Tip #1 - Design For Deduplication

TL;DR - Add a GUID and use a staging table to guarantee you only load new data

Redshift does not enforce a primary key, which means we need to deduplicate our own data. Duplication will confuse our analytic reports, much like this cute baby was confused by his duplicate dad.

Since we’re loading automatically from S3 we need to make sure the process is repeatable without introducing duplicates so that any engineer can re-run the load at any time without worrying about the state of the cluster. We designed this into our metrics system from the beginning and you should too. This will be especially useful if you’re working with time-series data and can rely on a strict ordering.

Step 1: Add a GUID

Make sure that every piece of information recorded into your pipeline has a GUID attached to it. It’s a relatively cheap operation to do per metric and is required for easy deduplication. You could also use Redshift’s string concatenation (e.g col1 || col2) if you have a tuple primary key, but that will be a bit slower due to the computation requirements.

Step 2: Use a server side timestamp

It may seem obvious, but users do weird things. Sometimes they like to pretend they’re living in the future so they set their clock days, months, even years into the future (or past). Never rely on a timestamp that came from something you don’t control. A server side timestamp should also be the sortkey for your cluster (we’ll get back to choosing that in Tip #2 below).

Step 3: Load into a staging table

Redshift’s COPY operation is a fantastic tool, but it’s fairly inflexible and can’t handle the logic we need. Instead of loading directly into your primary table(s), create a temporary table with the same layout as your production table:

CREATE TEMP TABLE stagingTableForLoad (
  server_timestamp BIGINT NOT NULL,
  guid             VARCHAR(128) NOT NULL,
  data             VARCHAR(128) NOT NULL,
  ...
) DISTKEY(guid) SORTKEY(server_timestamp);

It’s important to give a sortkey and distkey for this table, especially if your daily loads can be large. Once you’ve created your table, run the COPY command.

COPY stagingTableForLoad
FROM 's3://your-s3-bucket/todaysDate'
CREDENTIALS '...'
JSON 's3://your-s3-bucket/jsonpaths'
GZIP

Step 4: Insert any new rows (efficiently)

Using our GUID and our timestamp, we can efficiently check for any rows which may have already been loaded. This takes two calls. First, get the minimum and maximum timestamps from the data to be loaded. Since the server_timestamp column is the sortkey and our staging table is relatively small, this is a fast operation:

SELECT MIN(server_timestamp) AS min_timestamp,
       MAX(server_timestamp) AS max_timestamp
FROM stagingTableForLoad;

Using your new values, insert by comparing against a set of the GUIDs already present in the data’s time range.

INSERT INTO productionTable
SELECT * FROM stagingTableForLoad
WHERE guid NOT IN (
  SELECT guid
  FROM productionTable
  WHERE server_timestamp BETWEEN min_timestamp AND max_timestamp
)

We use this method for our daily load and it runs very quickly, adding only a few seconds on top of the transmission time from S3.

Tip #2 - Pick Your Sort Key and Dist Key Carefully

TL;DR - Choose your keys carefully or you’re in for slow queries and a very annoying weekend trying to rebalance a skewed cluster.

The sortkey and distkey (or partition key) of your cluster are important choices. They are set when your table is created and cannot be changed without rebuilding the full table. When you start dealing with huge tables, a poorly chosen distribution key or an under-utilized sortkey can make your experience using Redshift very painful.

Sort Key Selection

If you haven’t yet, read Amazon’s documents about choosing a sort key. If you are using time-series data (such as usage metrics or sales reports), set your SORTKEY column to be a timestamp generated by a machine that you control. By inserting data daily and running regular VACUUM operations, the sort key will help to keep your queries running fast.

If you aren’t working with time-series data you can use another column as your sort key to help make scans more efficient. Think of a few queries you’ll likely be writing, and if there’s a column where you frequently include range boundaries then use that as your sort key. Make sure to run frequent VACUUM operations if you aren’t adding new data in sortkey order.

In general you should include your sort key on every query you write unless you have a good reason not to. If you frequently run unbounded queries without a range component, especially for a query that is frequently run, think really hard if Redshift is the right choice for you.

Distribution Key Selection

The distribution key is a bit trickier. Read up on the documents here. The distribution key (or distkey) sets up which column of your table will be hashed to choose the cluster partitioning for your data.

The enemy you’re constantly fighting with the dist key is data skew. Skew is the term for uneven resource distribution, where the resources of your cluster aren’t uniformly distributed causing a few nodes to be responsible for more than their share of the cluster’s load. No matter what you pick for your distkey make sure that the data skew of your cluster is as close to 0 so that you use everything you’re paying for and be sure to monitor skew as your cluster grows in case your assumptions about your data are incorrect. If you already have a cluster set up, Amazon has a very useful script you can use to measure your table’s skew and more.

This arm wrestler has a very high arm-muscle skew

Given that skew should always be a concern, there are two sides to balance when choosing your key: data co-location and parallelization.

You can choose to co-locate similar data onto the same machine by choosing a distkey that is shared by multiple events, which will help speed up processing by limiting communication between nodes in your cluster, but can slow performance by putting potentially more processing onto fewer nodes depending on your query. You have to be careful here, as this is likely to introduce some amount of skew into your disk usage.
You can choose a purely random distkey. The query load and data will be evenly split across all nodes and skew will be 0, but you could incur some extra network delays due to network communication to move data between nodes.

Currently our distribution key is purely random, but it wasn’t always that way. Migrating a distribution key was definitely a big growing pain for us.

A Story of Skew

When setting up our first cluster we chose a bad distribution key. We leaned too far towards co-locating data for query speed and we didn’t check our per-node Cloudwatch metrics often enough to detect skew. After collecting a few months of data our cluster became inoperable without firing any alerts first. A quick investigation found that some of our largest clients were sharing a single node. That node’s disk had filled up and our cluster became unusable despite the reported overall disk usage being only around 20%. We couldn’t run any queries without freeing up disk space on the single full node first, so we needed to free up space before we could try to rebalance the table.

We doubled the number of nodes in the cluster hoping to make room on that single node. This was ineffective because of the poor distkey choice. We didn’t have any control over the hash function used and so with doubling our cluster our full node only gained 2-3% more free storage. We ended up having to grow the cluster by 8x to free up that node by around 15% so that we could run a staggered copy onto a new, well-balanced table.

Tip #3 - Constant Functions Aren’t Cheap

The Redshift query planner does a great job with the hard stuff, but it has some blind spots as far as simple optimization goes. As an example, at Curalate we use epoch timestamps throughout our platform, which Redshift does not handle well. It has no native datatype for them and somewhat limited methods. If you want to query against our cluster you need to use epoch timestamps, and we frequently want to have an easy way to use real dates within our SQL workbench instead of using an external converter.

For example, the below will calculate the epoch timestamp for midnight on April 1st, 2016:

EXTRACT(epoch FROM timestamp '2016-04-01 00:00:00')

This does exactly what you’d expect, getting the epoch timestamp in seconds for April 1st. It returns a constant, there is no column or variable here, and it is a fast query. Redshift doesn’t currently support variables, so if you want to use this data you may include this function in your query.

Say we want to find the total number of metrics recorded way back in June of last year (remembering that server_timestamp is our sortkey and it contains an epoch). If we use our easily-readable function, it looks like this:

SELECT COUNT(*)
FROM productionTable
WHERE server_timestamp >= EXTRACT(epoch FROM timestamp '2015-06-01 00:00:00')
AND server_timestamp < EXTRACT(epoch FROM timestamp '2015-07-01 00:00:00')

Looks good, right? Easy to modify, simple to run against a variety of dates. Sadly, this query takes a very, very long time to run against a cluster of any meaningful size. The EXTRACT function is not treated as a constant, and is evaluated against every row in your table. This query performs a scan of the entire productionTable, loading and decompressing every timestamp and reevaluating the EXTRACT function for every row. In practice, the above query on our 19 billion row cluster took over 10 minutes before I got tired of waiting and killed it. It likely would’ve taken hours.

However, if you precompute the value, you’ll end up with this:

SELECT COUNT(*)
FROM productionTable
WHERE server_timestamp >= 1456804800
AND server_timestamp < 1459483200

This query finished in 6 seconds! When in doubt, simplify all of your constants.

And for what it’s worth, the answer for us is hundreds of millions of events in June 2015. For comparison, in July 2016 we recorded nearly 8x that many. We’ve grown so much!

I hope some of these tips have been useful! If you have any questions or feedback please leave a comment.