Organizing Redshift

Now that we understand the basics of creating primary keys, the concept of foreign keys, and the basics of OLAP, we will dive into how a distributed database like Redshift is best optimized.

Distributed Compute Engines

In a world driven by data, businesses generate lots and lots of it. We have already discussed the difference between OLAP and OLTP engines; here I will focus on how to scale (increase capacity, by either creating a bigger computer or adding more computers) Redshift specifically, which is an OLAP database.

Redshift is an analytical database that functions as a single platform made up of several computers (from here on referred to as nodes) that can both store and process data. The main question is: why would I want to create several new nodes instead of just increasing the size of the node my data currently resides on? Two main reasons:

  1. Some tables are so large that they cannot fit on a single node and thus need to be stored across several nodes.
  2. Distributing your data across several nodes can reduce the amount of disk each query has to scan.

So what does this mean? Let's take a look at how Redshift operates.

Simon Says

Redshift's group of nodes (which we will now refer to as a cluster) normally consists of one leader node and one or more worker nodes. The worker nodes listen to the instructions of the leader node and return to it the data requested by the query. The leader node generates a plan for the worker nodes to follow that best optimizes the query it was given. Most of this abstraction takes place via Redshift's built-in operations, but there are some ways that users can help make queries more efficient.

Distribution (Dist) Keys

As you may recall from earlier, data in a Redshift cluster is stored across several nodes. So how does Redshift decide how that data gets stored? Normally, if the data is small enough, it will just store a copy of it on every single node, which is quicker to retrieve than searching for bits of a small dataset across several nodes. If the dataset is larger, Redshift will normally distribute the data in an even fashion, giving equally sized chunks to each node in the cluster. Let's assume that we have the following table:

site      product   employee   value    date
seattle   car       greg       45,000   2022-05-21
portland  scooter   claire     850      2022-06-01
seattle   boat      carl       25,000   2022-06-08
portland  scooter   claire     850      2022-06-15
seattle   boat      carl      25,000   2022-06-04

Now also assume that we have a 3-node cluster with 1 leader node and 2 worker nodes. Normally a table this small would just be copied to every node, but for example's sake, let's assume that it will be distributed across the cluster like so:

node   site      product   employee   value    date
2      seattle   car       greg       45,000   2022-05-21
2      portland  scooter   claire     850      2022-06-01
3      seattle   boat      carl      25,000   2022-06-08
3      portland  scooter   claire     850      2022-06-15
3      seattle   boat      carl      25,000   2022-06-04

In this instance, no data is assigned to the leader node, but 2 of the rows get stored on node 2 and 3 rows get stored on node 3.
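These two behaviors correspond to Redshift's distribution styles, which you can also set explicitly when creating a table. A minimal sketch using this example's schema (the `sales_all` and `sales_even` names are just for illustration):

```
-- Copy the full table to every node: good for small, frequently joined tables.
CREATE TABLE sales_all (
  site     VARCHAR,
  product  VARCHAR,
  employee VARCHAR,
  value    NUMERIC,
  date     TIMESTAMP
)
DISTSTYLE ALL;

-- Spread rows in equally sized chunks across the worker nodes.
CREATE TABLE sales_even (
  site     VARCHAR,
  product  VARCHAR,
  employee VARCHAR,
  value    NUMERIC,
  date     TIMESTAMP
)
DISTSTYLE EVEN;
```

If you do not specify a distribution style, Redshift chooses one automatically based on the table's size.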

OLAP databases actually store data in columnar blocks (1 MB each in Redshift's case), but to make the point without getting too far down in the weeds, we are just going to use rows for simplicity.

This is great, our data is distributed! But now we are running into issues because query performance is very slow. When we ask our analysts how they are querying the data, they send over the following query:

SELECT
  *
FROM sales
WHERE site = 'seattle'
  AND date > '2022-06-01';

We realize what is happening: the analysts are filtering on site, and because each site's rows are split across two nodes, Redshift has to scan two separate nodes to find 3 rows. We can mitigate this inefficiency by telling Redshift to distribute the data based on the site column. We do that by providing a distkey when creating the table.

CREATE TABLE sales (
  site     VARCHAR,
  product  VARCHAR,
  employee VARCHAR,
  value    NUMERIC,
  date     TIMESTAMP
)
distkey(site);

Now when we take a look at the data it will be organized as follows:

node   site      product   employee   value    date
2      seattle   car       greg       45,000   2022-05-21
3      portland  scooter   claire     850      2022-06-01
2      seattle   boat      carl      25,000   2022-06-08
3      portland  scooter   claire     850      2022-06-15
2      seattle   boat      carl      25,000   2022-06-04

Notice that rows are now co-located on the same node based on their site. When the query above is sent to the leader node, it will know to only check node 2 for the data.
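If you want to see how the leader node plans to execute a query, you can ask it directly with EXPLAIN (the exact plan text depends on your cluster and data):

```
EXPLAIN
SELECT
  *
FROM sales
WHERE site = 'seattle'
  AND date > '2022-06-01';
```

EXPLAIN output is especially useful for joins, where it flags data-redistribution steps (such as DS_BCAST_INNER) that a well-chosen dist key can eliminate.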

Our analysts are super happy with how much the query speed has improved, but they still feel like it takes longer than necessary. There is another optimization we can try to improve query speed and efficiency. We have told Redshift where to store our data (the dist key); now let's tell it in what order to store our data, with sort keys.

Sort Keys

Sort keys are a great way to prevent unnecessary scanning of data. We used the dist key to reduce the number of nodes scanned from 2 to 1; we can now use a sort key to reduce the number of rows scanned. If we take a look back at our query, we can see that the analyst only wants rows for the site seattle with dates greater than June 1st, 2022. Let's assume that it is common to filter based on date. If we check the order in which the Seattle data is stored on node 2, we can see the dates are out of order. (The sort index below is not an actual column, but a reference to the order in which Redshift will read the data.)

sort index   node   site      product   employee   value    date
1            2      seattle   car       greg       45,000   2022-05-21
2            2      seattle   boat      carl      25,000   2022-06-08
3            2      seattle   boat      carl      25,000   2022-06-04

This is not optimal, because we are going to scan all 3 rows instead of just the 2 rows we actually need. We can use a sort key to prevent this from happening in the future.

ALTER TABLE sales ALTER SORTKEY (date);

Redshift always stores sort key data in ascending order (the ALTER SORTKEY syntax does not accept DESC), so the data is now laid out in a more efficient manner for the leader node to scan:

sort index   node   site      product   employee   value    date
1            2      seattle   car       greg       45,000   2022-05-21
3            2      seattle   boat      carl      25,000   2022-06-08
2            2      seattle   boat      carl      25,000   2022-06-04

When we run our query now, we scan only 1 node and 2 rows, versus the 2 nodes and 5 rows we scanned before distributing and sorting properly. That's a 60% reduction in data scanned! While this is a very simplistic example, those numbers are not far off from what properly distributing and sorting your data can achieve in the real world.
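If you know your query patterns up front, you can declare both optimizations when the table is first created rather than altering it later. A sketch using this example's schema:

```
CREATE TABLE sales (
  site     VARCHAR,
  product  VARCHAR,
  employee VARCHAR,
  value    NUMERIC,
  date     TIMESTAMP
)
DISTKEY (site)
SORTKEY (date);
```

Since our query filters on both site and date, a compound sort key such as SORTKEY (site, date) might prune even more blocks; which choice is better depends on your actual query mix.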


Revision #3
Created 14 June 2022 20:23:54 by Trevor Barnes
Updated 14 June 2022 23:09:50 by Trevor Barnes