Amazon Redshift materialized views enable you to significantly improve performance of complex queries that are frequently run as part of your extract, load, and transform (ELT), business intelligence (BI), or dashboarding applications. Materialized views precompute and store the result sets of the SQL query in the view definition. Materialized views speed up data access, because the query doesn’t need to rerun the computation each time the query runs, which also reduces the resource consumption.
Amazon Redshift has the ability to automatically rewrite your SQL queries that don’t explicitly reference existing materialized views to use an existing materialized view if it will improve performance. This feature is valuable and, in some cases, the only option for performance optimization. Consider packaged ISV apps or even just reports— users often don’t have access to the SQL to optimize. In some cases, even if they do have access, the code or script is so old that nobody is familiar with it and you don’t know what regressions even a small change might introduce.
In this post, we describe how the automatic query rewrite feature works and some scenarios where you could take advantage of this feature. For information about the materialized view feature itself, refer to Speed up your ELT and BI queries with Amazon Redshift materialized views and Creating materialized views in Amazon Redshift.
All examples in this post are run on an 8 node ra3.4xlarge cluster with the 3 TB TPC-DS cloud benchmark dataset.
Let’s look at three different scenarios where the automatic query rewrite feature could help: optimizing joins between two large tables, optimizing joins for tables that have multiple join paths, and optimizing table scans.
Optimize joins between two large tables
There are many situations where you have two large tables that are joined frequently. In this case, creating a materialized view that joins these two tables could help improve the performance of those queries. Materialized views precompute the join and store the results so subsequent runs only need to retrieve the saved results; no need to run the expensive JOINs each time. With automatic query rewrite, none of the end-user queries have to be modified to refer to the materialized view. When creating the explain plan for the query, Amazon Redshift replaces the join between the two tables with the materialized view.
By default, the automatic query rewrite uses a materialized view only if it’s up to date and reflects all changes from its base tables. This means that the query isn’t rewritten to use the materialized view if the base tables have more recent updates that aren’t yet reflected in the materialized view.
For example, consider the following SQL query. The query joins two tables:
store_sales (8,639,936,081 rows) and
customer (30,000,000 rows):
The query runs in 545,520 milliseconds; the following is the explain plan for the query:
Let’s create a materialized view that pre-computes the join between the
customer tables using the following SQL statement:
Let’s now rerun the original query:
The query runs much faster (46,493 milliseconds). This is because of the automatic query rewrite feature, which has rewritten the preceding query to use the newly created materialized view instead of joining both tables. The explain plan for this query shows this change:
The original query run also consumed 1,263 CPU seconds and read 45,013 blocks of data, whereas the query that ran after the creation of the materialized view only consumed 898 CPU seconds and read 29,256 blocks. That is a reduction of 29% in CPU consumption and 35% in blocks read.
The optimizer can also rewrite the following query to use the previously created materialized view, which includes the additional join to the
Optimize joins for tables that have multiple join paths
For large tables on Amazon Redshift, the ideal distribution style would be ‘KEY’, with the distribution key being the column that is used most frequently in the JOIN clause. There are situations where some large tables have multiple join paths. 50% of the queries may use a particular column to join to the table, and the other 50% of the queries may use a different column to join to the table. Both types of queries are important and have stringent performance requirements. In this case, you could pick one column as the distribution key for the table and then create a materialized view with the second column as the distribution key. This is possible because materialized views can have their own distribution and sort keys.
Here’s an example to illustrate how this works.
web_sales table (2,159,968,881 rows) has the distribution key
ws_order_number. This helps optimize a majority of the queries (70% of the joins to this table use
ws_order_number as the join column). The remaining 30% use the column
ws_bill_customer_sk to join to the table, as shown in the following SQL statement. This query took 12,790 milliseconds to run.
We can create the materialized view to help improve the performance of the remaining 30% of the queries. Note the DISTKEY keyword in the following code. We have defined a new distribution key for the materialized view (
Rerunning the following query returns rows much faster than before (7,715 milliseconds vs. 12,790 milliseconds):
Again, the explain plan of the query has changed; it now references the materialized view even though the SQL statement doesn’t explicitly reference the materialized view:
Optimize table scans
Table scans on Amazon Redshift are made efficient through the use of sort keys. Sort keys determine the order in which the columns are stored in the data blocks. Picking a column that appears frequently in your filtering conditions as a sort key can improve query performance significantly.
Compound sort keys with multiple columns can be defined on your table in case multiple columns are good candidates for sort keys. But in some situations where two or more high cardinality columns are sort key candidates, the compound sort key may not provide adequate performance. In these cases, a materialized view could be created with a different sort key to maintain that data in an alternate sorted order to help cater to a subset of the queries.
In the following example query, the
web_sales table uses the column
ws_sold_date_sk for the sort key, because this is the column that is used commonly for filtering rows. A smaller set of queries use
ws_sales_price for filtering rows. Given that both
ws_sales_price are high cardinality columns with lots of unique values, a compound sort key with both columns may not be performant for all query patterns.
Let’s create the following materialized view and see how it can help improve the performance of the preceding query:
Running the following query returns rows much faster (5 milliseconds vs. 3,548 milliseconds) because the automatic query rewrite is using the materialized view:
The following is the new explain plan:
Materialized views on Amazon Redshift can be a powerful optimization tool if used appropriately. With automatic query rewrite, you can optimize queries without any impact to end-users or their queries. This allows you to create materialized views after the application has gone live. Some customers plan this as part of their performance-optimization strategy when building new apps. The real value is that you can optimize queries and workloads without needing to modify the source code or scripts, and you can benefit even with a partial match.
About the Authors
Harshida Patel is a Specialist Sr. Solutions Architect, Analytics with AWS.
Jeetesh Srivastva is a Sr. Manager, Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to implement scalable solutions using Amazon Redshift and other AWS Analytic services. He has worked to deliver on-premises and cloud-based analytic solutions for customers in banking and finance and hospitality industry verticals.
Sain Das is an Analytics Specialist Solutions Architect at AWS and helps customers build scalable cloud solutions that help turn data into actionable insights.