At work we’re in the midst of a big project to move our data analysis out of relational databases and into a data lake. We’re an AWS customer, so instead of running queries in RDS or Redshift, analysts will be using Athena.
Whilst Athena provides an analyst-friendly SQL interface for running queries, it has a “pay-as-you-go” model which charges for the amount of data scanned per query. That can really add up!
Athena in AWS Console
After running an Athena query, the AWS console displays summary data about the query in the results bar.
It does report the amount of data scanned (“33.42GB”), but users may not realize that data scanned means dollars spent.
So I created a Tampermonkey script to display the estimated cost per query. If you don’t know, Tampermonkey is a browser extension (available for Chrome, among others) which runs user scripts on websites. This allows users to customize (improve) websites as they see fit.
The script calculates the query cost by multiplying the Athena price for the region by the amount of data scanned, with a 10MB minimum per query. Because the query results bar is updated dynamically, the script uses a mutation observer to watch for changes and refresh the estimate.
After installing the script, running the query again displays an “Est. Cost” in the results bar:
Looks like I spent 17¢ ($5 / 1000GB * 33.42GB).
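The arithmetic above can be sketched in a few lines of Python. This is not the Tampermonkey script itself (that runs as JavaScript in the browser); it only illustrates the calculation. The function names are hypothetical, and it assumes a flat $5 per TB rate and decimal units (1TB = 1000GB) to match the example above. Actual pricing varies by region.

```python
import re

# Assumptions for illustration only: a flat $5 per TB rate and decimal units,
# chosen to match the arithmetic in the article. Check the Athena pricing
# page for the rate in your region.
PRICE_PER_TB_USD = 5.00
UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
MIN_BILLED_BYTES = 10 * UNITS["MB"]  # 10MB minimum per query


def parse_data_scanned(text):
    """Convert a console-style size string such as '33.42GB' into bytes."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([KMGT]?B)\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognized size: {text!r}")
    value, unit = match.groups()
    return float(value) * UNITS[unit.upper()]


def estimate_cost_usd(data_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimate the cost of one query from its 'data scanned' figure."""
    billed_bytes = max(parse_data_scanned(data_scanned), MIN_BILLED_BYTES)
    return billed_bytes / UNITS["TB"] * price_per_tb


if __name__ == "__main__":
    cost = estimate_cost_usd("33.42GB")
    print(f"Est. Cost: {cost * 100:.0f}¢")
```

Under these assumptions, `estimate_cost_usd("33.42GB")` comes to roughly $0.167, which rounds to the 17¢ shown above, and any scan smaller than 10MB is billed as if it were 10MB.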
Controlling Athena Spend
Showing spend data to users is one way to help them avoid costly queries. But installing Tampermonkey scripts carries a security risk, and the technically savvy users who would install one often aren’t the ones who need to be reminded of Athena’s cost model.
The most effective way to control Athena spend is via the workgroups feature. You can create a workgroup for each set of users and assign a “Per query data usage control”, which sets the maximum amount of data a single query may scan; a query that exceeds the limit is cancelled. This is a good fail-safe against runaway scans, such as when a query doesn’t filter a table by its partition columns.
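As a rough sketch, a workgroup with a per-query limit could also be created from code rather than the console. The example below assumes the boto3 library and configured AWS credentials. It uses the Athena `create_work_group` call with the `BytesScannedCutoffPerQuery` setting. The helper names and the "analysts" workgroup name are hypothetical.

```python
def build_workgroup_request(name, max_scan_gb):
    """Build the arguments for Athena's create_work_group call.

    Hypothetical helper: max_scan_gb is the per-query scan limit in
    decimal gigabytes.
    """
    cutoff_bytes = int(max_scan_gb * 10**9)
    if cutoff_bytes < 10_000_000:
        # Athena does not accept a cutoff below 10MB.
        raise ValueError("Per-query limit must be at least 10MB")
    return {
        "Name": name,
        "Description": f"Queries limited to {max_scan_gb}GB scanned",
        "Configuration": {
            # The "Per query data usage control" from the console.
            "BytesScannedCutoffPerQuery": cutoff_bytes,
        },
    }


def create_workgroup(name, max_scan_gb):
    """Create the workgroup. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the helper above works without it

    athena = boto3.client("athena")
    athena.create_work_group(**build_workgroup_request(name, max_scan_gb))


if __name__ == "__main__":
    # Example: cap each query from the (hypothetical) analysts group at 100GB.
    create_workgroup("analysts", 100)
```

Keeping the request builder separate from the API call makes it easy to review the limit being applied before anything is created in the account.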