Google BigQuery: How to reduce storage and processing costs
Optimizing data models in Google BigQuery helps keep storage and processing costs from escalating unexpectedly.
Beyond improving your models and avoiding billing surprises from this data warehouse, knowing these optimization practices is a path to more performant and reliable deliveries.
Learn how costs in BQ work and the main resources available for optimizing models and queries. Delve into topics such as:
- sampling features;
- partitioning;
- clustering;
- incremental configuration strategies;
- cached queries;
- and best practices with SQL.
Enjoy your reading!
BigQuery: Google's Modern Data Warehouse
BigQuery is a self-managing, auto-scaling data warehouse service for data analysis.
In BQ, a table is not stored as a single file in the filesystem. Instead, each column of the table is stored as its own file, and these column files are distributed across the geographic region where the project is configured.
This allows for parallelism, subpartitioning, and massive data reading, which is the main reason for BigQuery's high performance.
BigQuery: Cost per Stored Byte
First, it's necessary to understand how costs in BigQuery work to define the best ways to optimize these costs.
The default (active) storage cost in BQ is currently 2 cents per GB per month, and it can drop to 1 cent.
This happens when a table goes 90 consecutive days without being modified.
In this case, BQ automatically moves the table from active storage to long-term storage.
With this reclassification, the cost drops to 1 cent per GB per month.
A table that is constantly written to may never reach the 90 days required to move to long-term storage. Note that simply querying a table does not reset this timer; only operations that modify its data do.
However, if a table is partitioned, each partition follows the rule above individually: when a partition goes 90 days or more without being modified, it is reclassified and its cost is optimized.
Another important point about BQ storage cost is that it varies by region, but the reduction from active to long-term storage always follows a 50% proportion.
BigQuery: Cost per processed byte
When a query is written in BQ, it shows the execution plan and the number of bytes to be scanned.
In summary, BQ works out how much of the query it can parallelize and allocates the work to slots, which are units of compute combining Central Processing Unit (CPU) and memory.
The slots read from the distributed column storage and execute the query in parallel, and each query is billed by the number of bytes it processes.
The default on-demand cost per terabyte processed in BigQuery is around 5 US dollars, varying between regions.
Alternative Option
There is an alternative processing cost model that is not the default service: slot reservations, also known as flat-rate or capacity-based pricing.
This pricing can be negotiated in contracts between Google and the contracting company and serves specific workloads.
From this, you can see that BQ offers various optimization levers, and reducing storage and processing costs is an attainable goal for companies that use this resource.
BigQuery: Essential Optimizations for Your Models
The more optimized your table and queries are, the lower the processing and storage costs in BigQuery.
Here are some strategies to optimize BQ costs in both data processing and data storage.
BigQuery: Data sampling
When testing business logic and running models in the development environment, most of the time the goal is not to process all the data in a table.
In this sense, applying data sampling logic is a very efficient strategy and ensures storage and processing savings in BigQuery.
This data warehouse has some functions that support data sampling, depending on what you need.
Here are two distinct ways to perform data sampling. Check them out!
TABLESAMPLE Clause
To work with a random sample of data, the TABLESAMPLE SYSTEM clause is the ideal choice, as it reads only a percentage of a table's data, determined by the user.
In the example below, we randomly select about 10% of a table's data.
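Here is a minimal sketch (the dataset and table names are illustrative placeholders):

```sql
-- Randomly samples roughly 10% of the table's storage blocks.
-- `mydataset.mytable` is an illustrative name; replace it with your own table.
SELECT *
FROM `mydataset.mytable` TABLESAMPLE SYSTEM (10 PERCENT);
```

Note that TABLESAMPLE SYSTEM samples storage blocks rather than individual rows, so the exact number of rows returned varies between runs.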
MOD and FARM_FINGERPRINT() Approach
This approach is better suited for those who want customized control over the sampling process, such as selecting rows at specific intervals or implementing their own logic.
However, it is worth noting that this approach is more complex than TABLESAMPLE. See the second example:
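A sketch of deterministic sampling with MOD and FARM_FINGERPRINT, assuming an `id` key column (the table and column names are illustrative):

```sql
-- FARM_FINGERPRINT hashes the key column into an INT64, and MOD keeps one
-- out of every ten hash buckets, giving a stable, repeatable ~10% sample.
SELECT *
FROM `mydataset.mytable`
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) = 0;
```

Because the hash is deterministic, the same rows are selected on every run, which makes this approach convenient for comparing model outputs during development.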
From the data sample, you will be able to execute your models, create a sample table, and perform validation without processing the entire table.
BigQuery: Data Partitioning
Data partitioning in a table occurs when it is divided into segments, called partitions, based on a specific criterion in a column.
Here is a practical case to help you understand how partitioning works in BigQuery.
Imagine a table with over 1 billion rows, but you usually only query the data from the last month of this table.
Without partitioning, BQ reads your entire table. That means all 1 billion rows are scanned just to return the data from the last month.
This query processes a large number of bytes, generating a higher cost than a partitioned query would. Right?
With monthly partitioning, for example, BQ only accesses the partition where your filter date is stored, reducing the processing cost of the query.
In other words, the smaller the amount of processed bytes, the higher the performance and the lower the cost.
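A minimal sketch of a partitioned table and a query that benefits from partition pruning (all names are illustrative):

```sql
-- Create a table partitioned by day on the order_date column.
CREATE TABLE `mydataset.sales`
(
  order_id    INT64,
  customer_id INT64,
  order_date  DATE,
  amount      NUMERIC
)
PARTITION BY order_date;

-- The filter on the partitioning column lets BQ scan only the matching
-- partitions instead of the whole table, reducing the bytes billed.
SELECT order_id, amount
FROM `mydataset.sales`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH);
```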
What not to do in partitioning
- Policy tags
Masked fields, that is, columns protected by policy tags (typically PII), should not be used for partitioning. Partitioning should be based on columns whose data distribution and query frequency make pruning effective.
Access to PII columns is restricted, so filters on them rarely align with how business areas query the data. Furthermore, these columns can change in nature over time, with a variable mutation rate, which prevents partitions from ever qualifying for long-term storage.
In BigQuery, it is generally recommended to choose columns that are stable and have a relatively uniform distribution of data.
- Partition limit
BigQuery limits the number of partitions per table (currently 10,000; formerly 4,000), so it is important to plan a strategy for cases where partitions would exceed this limit.
If a table is partitioned by day and spans more days than the limit allows, it will exceed it. In this case, it is worth switching the partitioning to monthly.
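A sketch of the monthly alternative (illustrative names), which reduces the partition count by roughly a factor of 30 compared to daily partitioning:

```sql
-- DATE_TRUNC with MONTH granularity creates one partition per month.
CREATE TABLE `mydataset.sales_by_month`
(
  order_id   INT64,
  order_date DATE,
  amount     NUMERIC
)
PARTITION BY DATE_TRUNC(order_date, MONTH);
```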
BigQuery: Clustering
In Google BigQuery, it's possible to physically structure data in a table based on one or more specific columns.
This results in organizing records so that similar data is stored close to each other on disk, providing optimization in accessing data during queries.
Linked to partitioning, these are powerful configurations that enhance queries, reducing costs and increasing performance.
Clustering is conceptually similar to a GROUP BY whose result is persisted in storage: rows with similar values in the clustered columns are stored in the same blocks, and the storage blocks are sized adaptively based on the table size.
Just as with partitioning, queries that filter or aggregate on clustered columns read only the blocks relevant to that query, not the entire table.
This feature automatically reduces the processed bytes and decreases processing costs.
In other words, it's as if BQ knows the direct path of those data and doesn't need to scan all the data in the table.
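A minimal sketch of a clustered table (illustrative names):

```sql
-- Rows are physically co-located by customer_id first, then order_id,
-- so filters on these columns read only the relevant blocks.
CREATE TABLE `mydataset.sales_clustered`
(
  order_id    INT64,
  customer_id INT64,
  order_date  DATE,
  amount      NUMERIC
)
CLUSTER BY customer_id, order_id;
```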
BigQuery: Columns in Clustering
Unlike partitioning, clustering can be created with up to four columns, and the order of the columns determines the precedence of grouping and sorting in BQ.
In other words, the order of columns is important for clustering. Therefore, consider:
- Using the first column in clustering based on the most used filter in the table. This can be analyzed by the most applied filters on dashboards that consume this table, for example.
- Making use of other columns that make sense as filters in the tables for clustering.
- To optimize results in clustering, you need to filter using the clustered columns, following these recommendations (see the sketch after this list):
- When querying a clustered table, always use the first column in the clustering order as a filter whenever possible, and ideally, the other clustered columns.
- If a query doesn't take into account the first clustered column, your query won't be optimized.
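A sketch of both cases, assuming the illustrative table above, clustered by customer_id and then order_id:

```sql
-- Optimized: the filter starts with the first clustered column,
-- so BQ can prune storage blocks.
SELECT order_id, amount
FROM `mydataset.sales_clustered`
WHERE customer_id = 1234;

-- Not optimized: the filter skips the first clustered column,
-- so clustering cannot prune blocks for this query.
SELECT amount
FROM `mydataset.sales_clustered`
WHERE order_id = 98765;
```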
Performance and cost reduction: partitioning and clustering
Thinking in terms of performance and cost reduction in processing, partitioning and clustering are the most powerful configurations that BigQuery offers.
When both are combined in your table, you will have the greatest allies in processing optimization.
Now compare an example query executed without partitioning and clustering against the same query executed with both.
Without partitioning or clustering, the query processes a total of 1.47 MB when executed, even with a date filter and other conditions.
This is because, regardless of the filter, the query will process all rows of the table to retrieve the requested data.
With the table partitioned by day on the orderdate column and clustered by the salesorderid and customerid columns, the same query processes a total of only 1.08 KB.
When executed, the query will scan data only from the partition applied in the WHERE filter.
Additionally, clustering helps to find data that is clustered in that specific partition, thus reducing processing.
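A sketch of the table described above (the column names follow the text; the dataset name and the totaldue column are illustrative):

```sql
-- Partitioned by day on orderdate and clustered by salesorderid, customerid.
CREATE TABLE `mydataset.salesorder`
(
  salesorderid INT64,
  customerid   INT64,
  orderdate    DATE,
  totaldue     NUMERIC
)
PARTITION BY orderdate
CLUSTER BY salesorderid, customerid;

-- The orderdate filter prunes partitions; the salesorderid filter lets
-- clustering narrow the scan inside the surviving partition.
SELECT salesorderid, customerid, totaldue
FROM `mydataset.salesorder`
WHERE orderdate = '2024-01-15'
  AND salesorderid = 43659;
```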
BigQuery: Incremental Table
Incremental tables process less data because they insert only the new or changed records since the last execution, so the table is not reprocessed in its entirety.
Because BQ is a columnar database, as mentioned in the first part of this article, if a record is updated in only a few specific columns, BQ can apply the modification to those columns without reprocessing the entire record.
In BigQuery, there are two strategies for incremental models (using dbt's terminology): merge, which is the default, and insert_overwrite.
- Merge
The merge strategy uses a unique key to update data: if the key already exists in the destination table, the existing row is updated.
If the unique key does not exist in the table, the data is simply inserted, behaving similarly to an append.
However, it's worth noting that the append strategy is not available for BQ.
With the merge strategy, you ensure that the data is deduplicated thanks to the use of the unique key.
To check if the unique key exists in the table, this strategy needs to scan the entire table, making it still quite costly in terms of processing.
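As a sketch, here is roughly what a merge-strategy incremental model looks like in dbt (model, source, and column names are illustrative):

```sql
{{ config(
    materialized='incremental',
    unique_key='salesorderid'
) }}

SELECT salesorderid, customerid, orderdate, totaldue
FROM {{ source('sales', 'salesorder') }}
{% if is_incremental() %}
  -- On incremental runs, pick up only rows newer than the target table.
  WHERE orderdate > (SELECT MAX(orderdate) FROM {{ this }})
{% endif %}
```

dbt compiles this into a MERGE statement that matches on salesorderid, updating existing rows and inserting new ones.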
- Insert overwrite
On the other hand, the insert overwrite strategy is more complex and optimized for incremental processing.
This option solves the problem of complete scanning.
The solution used by insert_overwrite is to work with partitions: it deletes the selected partitions from the destination table and inserts the transformed rows selected by the logic implemented in the model.
Apart from operating on partitions, this process resembles the delete+insert strategy, which issues two separate statements (a delete followed by an insert).
insert_overwrite, by contrast, issues a single merge statement written so that BQ performs only the cheap operation of replacing whole partitions.
It is a high-complexity strategy, and if not configured correctly, it can generate duplicate data.
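A sketch of the insert_overwrite configuration in dbt (illustrative names; the three-day window is just an example of reprocessing recent partitions):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
      'field': 'orderdate',
      'data_type': 'date',
      'granularity': 'day'
    }
) }}

SELECT salesorderid, customerid, orderdate, totaldue
FROM {{ source('sales', 'salesorder') }}
{% if is_incremental() %}
  -- Rebuild only the most recent partitions; dbt replaces exactly the
  -- partitions produced by this select.
  WHERE orderdate >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}
```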
Merge and Insert Overwrite: Comparison of strategies
The table below compares the merge and insert_overwrite strategies.

| | Merge | Insert overwrite |
|---|---|---|
| Default in BQ | Yes | No |
| Deduplication | Guaranteed by the unique key | Must be handled by the model logic, or duplicates can occur |
| Scan of the destination table | Full scan to match the unique key | Only the affected partitions |
| Complexity | Lower | Higher |
| Processing cost | Higher | Lower |
Query Cache and BI Engine
BigQuery has a native function that temporarily caches query results for a certain period of time.
This feature saves computational resources, but the cached results are discarded after an expiration time, which is short (approximately 24 hours).
Cached queries incur no additional cost and are a great help in scenarios with few distinct queries in a short period of time, or when the same queries are executed repeatedly.
However, it is often necessary to go beyond cached queries. For this, BigQuery offers BI Engine, a solution that boosts performance through in-memory acceleration of interactive queries.
BI Engine is a paid add-on to the service and keeps the accelerated data continuously and automatically up to date.
It is autoscalable and integrates with Business Intelligence (BI) tools for a better user experience.
SQL: Best practices
To wrap up the topic on optimizations, here are some tips for performing your queries with better performance:
- Avoid using select *. Select only the columns necessary for your query.
- Use filters like where, especially with the partitioned and clustered columns of the table.
- Avoid using the LIMIT clause without any associated filter. Used alone, it still scans the entire table and only limits the number of rows the BQ interface shows to the user.
- Use joins with partitioned and clustered columns.
- Avoid subselects (subqueries); prefer Common Table Expressions (CTEs) instead. This way, BQ can parallelize the query better.
- Take advantage of BQ's native functions instead of recreating multiple SQL functions.
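A small sketch contrasting these tips (illustrative names, reusing the partitioned and clustered table from earlier):

```sql
-- Avoid: SELECT * with a bare LIMIT still scans every column of every
-- partition; LIMIT only caps the rows returned.
SELECT *
FROM `mydataset.salesorder`
LIMIT 100;

-- Prefer: explicit columns plus a filter on the partitioning column,
-- so BQ bills only the bytes actually needed.
SELECT salesorderid, totaldue
FROM `mydataset.salesorder`
WHERE orderdate >= '2024-01-01';
```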
BigQuery: Monitoring
Monitoring queries and models running in production is one of the most effective ways to rethink strategies that have already been applied or need to be applied.
Monitoring can be done through BI tools with metadata produced by BQ or by the tool being used to compile the code, such as dbt.
BQ's execution plan helps to find possible bottlenecks and provides insights when well analyzed.
Additionally, constant monitoring can identify unused models that are consuming processing and storage resources of the data warehouse.
Conclusion on cost reduction in Google BigQuery
Exploring different methods for cost reduction in Google BigQuery (BQ) makes it clear that efficiency in data management is essential to avoid unwanted financial surprises.
The careful implementation of strategies such as:
- data sampling,
- partitioning,
- clustering,
- incremental configurations,
- and cached queries.
These resources, combined with best practices in SQL, not only enhance performance but also directly contribute to cost reduction.
It is up to the data team to understand and apply these resources strategically in BQ, ensuring efficient use of resources and, consequently, more effective financial management in projects.
Indicium can help you get the best solutions for your data
Indicium is a reference in creating solutions in data science, analytics, and artificial intelligence for national and international companies.
And we want to help you achieve maximum usability of your data with the most modern tools, promoting innovation based on data with speed, security, and governance.
Want to include Google BigQuery or another data warehouse in your data team?
Talk to our team of experts here and discover customized data solutions for your case.