On November 5th, 2019, Google added support for integer column partitioning in BigQuery tables. This is an upgrade from the existing partitioning functionality, which has been limited to date partitioning — with ingestion time and column— only. It is a significant optimization of how BigQuery stores and reads the data that will be processed and will help improve query speed and cost.
Which columns can be used?
To use the feature, select a top-level integer (integer field within a record will not work) that is either nullable or required, and not a repeated field. As per semantics, the suggested practice would be to select a column that tends to be used in queries as a filter.
- Selecting the partitioning column
I will demonstrate the effectiveness of this new feature on a publicly available dataset natality (bigquery-public-data.samples.natality) using attribute year as the partitioning column. Attribute year is a nullable top-level integer.
- Creating a partitioned table
At the time of writing this article, it is not possible to set integer partitioning up within the UI, however, this can be done using data definition language, command line or the BQ API.To create a table using DDL, we will select all the data from the public natality dataset and set a destination dataset and table within our project. In the DDL syntax, we will specify the range of the selected integer (year). Bucket ranges from minimum year (1969) to the maximum (2008) found in the natality data.
CREATE TABLE dataset_id.natality_ip PARTITION BY RANGE_BUCKET(year, GENERATE_ARRAY(1969, 2008, 1)) OPTIONS( description="Natality dataset partitioned by YEAR") AS SELECT * FROM `bigquery-public-data.samples.natality`;
- Querying Partitioned TableIn the “partition by range bucket” we specify the partitioning column (year) and an array of values that will be used. The array is created with the GENERATE_ARRAY function, which automatically builds an array between a minimum (1969) and a maximum (2008) value with a desired step (1). Any rows with a year value that falls outside of this range will be in the “unpartitioned” partition.Queries on partitioned tables are written in the same manner as if the table was not partitioned.|
SELECT AVG(weight_pounds) FROM `dataset_id.integer_partitioning.natality_ip` WHERE year = 2000
Based on year filtering within the where clause, BigQuery automatically knows to take advantage of partitioning.
- Performance increase
The amount of data read (~ processing cost in BigQuery) in the partitioned query is 62MB compared to 2.1 GB when natality table is not partitioned. That makes it 34 times less expensive than when the same query is run on the original table. If we look at it from a high level, that makes a lot of sense. We divided our table into 40 partitions, one for each year between 1969 and 2008.
In terms of processing, the job requires fewer workers and reduces scheduling time.
As you can see, our wait time (time the average worker spent waiting to be scheduled) decreased drastically and since our stage00 output only had a single row (a single worker), the stage01 executed a lot quicker as well.
Because of the overall efficiency of BigQuery when it comes to problem parallelization, the total response time will not be as big of an improvement as we saw with cost, however, it will still be noticeable, especially for problems or projects that require closer to 2,000 workers deployed. 2,000 is a limit for the number of slots (workers) for on-demand pricing.
In the example above, the “before” query was using 69 whereas the “after” query was using a single worker. If we were running 30 queries simultaneously, before would hit the 2,000 slots limit and the partitioned solution would have stayed at a safe number of 30 workers.
With a problem this small, any precise measurements come with a lot of noise, but on average the non-partitioned table took about 1.5s and the partitioned table took about 1.1s (elapsed time). A far more significant difference was in the “slot time consumed”, where the difference was, on average, about 14s compared to 1s when data was partitioned.
And for anyone interested in the actual result, the average weight of a baby born in the US in the year 2000 was 7.31lbs.
Adding data to integer-partitioned tables
Because we often work with “live” data in BigQuery, it is important for the solution to support data additions. Partitioning supports streaming, DDL inserts and load jobs for adding data to an existing partitioned table. Any data added to the table will be automatically partitioned. For rows that are streamed, the data will initially stay in an “unpartitioned” partition and will later be assigned to its specific location.
Intro to Partitioned Tables by Google
Query Plan Explanation by Google
Adswerve’s $500 Monthly BigQuery Credit by Adswerve
Please reach out with any questions or feedback to @lukaslo and @adswerve on Twitter.