Using Spark SQL to convert a CSV file to Parquet

After downloading data from the “Food and Agriculture Organization of the United Nations”, I got many CSV files. One of the files is named “Trade_Crops_Livestock_E_All_Data_(Normalized).csv” and it looks like this:

To load this CSV file into Spark and dump it out in Parquet format, I wrote the code below:
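It was roughly the following sketch (the file paths are placeholders and the CSV parsing is deliberately simplified):

// CsvToParquet.scala - a minimal sketch; paths and schema handling are simplified
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CsvToParquet").getOrCreate()

    // Read the raw CSV as text; a naive comma split ignores quoted fields
    val lines = spark.sparkContext.textFile("Trade_Crops_Livestock_E_All_Data_(Normalized).csv")
    val header = lines.first()
    val rows = lines.filter(_ != header).map(_.split(",")).map(Row.fromSeq(_))

    // Treat every column as a string for simplicity
    val schema = StructType(header.split(",").map(name => StructField(name, StringType, nullable = true)))

    // createDataFrame() is the call that needs the spark-sql dependency
    val df = spark.createDataFrame(rows, schema)
    df.write.parquet("trade_crops_livestock.parquet")

    spark.stop()
  }
}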

The build.sbt is
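A minimal sketch (the Spark and Scala versions here are only examples and may differ from what you use):

name := "CsvToParquet"

version := "1.0"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided"
)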

Always remember to add the dependency for “spark-sql”, or else it will report “createDataFrame() is not a member of spark”.
And finally, the submit script is:
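Something along these lines (the master, deploy mode and jar path are placeholders for my environment):

spark-submit \
  --class CsvToParquet \
  --master yarn \
  --deploy-mode cluster \
  target/scala-2.11/csvtoparquet_2.11-1.0.jar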

Data Join in AWS Redshift

In “Amazon Redshift Database Developer Guide“, there is an explanation for data join:
“HASH JOIN and HASH are used when joining tables where the join columns are not both distribution keys and sort keys.
MERGE JOIN is used when joining tables where the join columns are both distribution keys and sort keys, and when less than 20 percent of the joining tables are unsorted.”

Let’s take the ‘salary’ and ’employee’ tables as an example.

Firstly, we EXPLAIN the join of ‘salary’ and ’employee’, and it shows “Hash Join”:
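The query was a plain equality join on employee_id, roughly like this (the selected columns are just for illustration):

EXPLAIN
SELECT e.employee_id, s.salary
FROM employee e
JOIN salary s ON e.employee_id = s.employee_id;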



Then we create two new tables:
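The new tables have the join column as both DISTKEY and SORTKEY; a sketch (the table names and column types are my own choices):

CREATE TABLE employee_dist (
  employee_id INTEGER,
  name        VARCHAR(64)
)
DISTKEY(employee_id)
SORTKEY(employee_id);

CREATE TABLE salary_dist (
  employee_id INTEGER,
  salary      INTEGER,
  start_date  DATE
)
DISTKEY(employee_id)
SORTKEY(employee_id);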

Now the join column is both the distkey and the sortkey, so EXPLAIN shows “Merge Join”:
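The same join against the new tables from the sketch above:

EXPLAIN
SELECT e.employee_id, s.salary
FROM employee_dist e
JOIN salary_dist s ON e.employee_id = s.employee_id;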



Some tips about “Amazon Redshift Database Developer Guide”

Show diststyle of tables

Details about distribution styles: http://docs.aws.amazon.com/redshift/latest/dg/viewing-distribution-styles.html
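For example, SVV_TABLE_INFO has a diststyle column:

SELECT "schema", "table", diststyle
FROM svv_table_info
ORDER BY "schema", "table";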

How to COPY multiple files into Redshift from S3
http://docs.aws.amazon.com/redshift/latest/dg/t_loading-tables-from-s3.html
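COPY loads every object that matches the S3 prefix, so one command can load many files. A sketch (the bucket name and IAM role are placeholders):

COPY sales
FROM 's3://my-bucket/sales/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER ',';
-- Alternatively, list the objects in a manifest file and add the MANIFEST option.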

GROUP BY (or ORDER BY) can use column position numbers instead of column names
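For example, these two statements are equivalent:

SELECT employee_id, SUM(salary) FROM salary GROUP BY employee_id ORDER BY employee_id;
SELECT employee_id, SUM(salary) FROM salary GROUP BY 1 ORDER BY 1;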

Change current environment in SQL Editor

Primary key and foreign key
Amazon Redshift does not enforce primary key and foreign key constraints, but the query optimizer uses them when it generates query plans. If you set primary keys and foreign keys, your application must maintain the validity of the keys. 


Distribution info in EXPLAIN
DS_DIST_NONE
No redistribution is required, because corresponding slices are collocated on the compute nodes. You will typically have only one DS_DIST_NONE step, the join between the fact table and one dimension table.
DS_DIST_ALL_NONE
No redistribution is required, because the inner join table used DISTSTYLE ALL. The entire table is located on every node.
DS_DIST_INNER
The inner table is redistributed.
DS_DIST_OUTER
The outer table is redistributed.
DS_BCAST_INNER
A copy of the entire inner table is broadcast to all the compute nodes.
DS_DIST_ALL_INNER
The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.
DS_DIST_BOTH
Both tables are redistributed.

Create Like
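For example (the new table name is arbitrary):

CREATE TABLE salary_backup (LIKE salary);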

Interleaved skew

The value for interleaved_skew is a ratio that indicates the amount of skew. A value of 1 means there is no skew. If the skew is greater than 1.4, a VACUUM REINDEX will usually improve performance unless the skew is inherent in the underlying set.

About interleaved sort key: http://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html#t_Sorting_data-interleaved
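The skew can be checked in SVV_INTERLEAVED_COLUMNS, for example:

SELECT tbl, col, interleaved_skew, last_reindex
FROM svv_interleaved_columns
ORDER BY interleaved_skew DESC;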

Concurrent write
Concurrent write operations are supported in Amazon Redshift in a protective way, using write locks
on tables and the principle of serializable isolation. 


UNLOAD
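A basic example (bucket and role are placeholders):

UNLOAD ('SELECT * FROM salary')
TO 's3://my-bucket/unload/salary_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER ',';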

Redshift UDF
In addition to the Python Standard Library, the following modules are part of the Amazon Redshift implementation:
* numpy 1.8.2 

* pandas 0.14.1 

* python-dateutil 2.2 

* pytz 2015.7 

* scipy 0.12.1 

* six 1.3.0 

* wsgiref 0.1.2
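A small scalar Python UDF that uses one of these modules, as a sketch (the function itself is only an illustration):

CREATE OR REPLACE FUNCTION f_workdays_between (d1 DATE, d2 DATE)
RETURNS INTEGER
STABLE
AS $$
    import numpy as np
    # Count working days between the two dates; str() turns the dates into 'YYYY-MM-DD'
    return int(np.busday_count(str(d1), str(d2)))
$$ LANGUAGE plpythonu;

SELECT employee_id, f_workdays_between(start_date, CURRENT_DATE) FROM salary LIMIT 10;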

Data Join
* Nested Loop
The least optimal join, a nested loop is used mainly for cross-joins (Cartesian products) and some inequality joins.

* Hash Join and Hash
Typically faster than a nested loop join, a hash join and hash are used for inner joins and left and right outer joins. These operators are used when joining tables where the join columns are not both distribution keys and sort keys. The hash operator creates the hash table for the inner table in the join; the hash join operator reads the outer table, hashes the joining column, and finds matches in the inner hash table.

* Merge Join
Typically the fastest join, a merge join is used for inner joins and outer joins. The merge join is not used for full joins. This operator is used when joining tables where the join columns are both distribution keys and sort keys, and when less than 20 percent of the joining tables are unsorted. It reads two sorted tables in order and finds the matching rows. To view the percent of unsorted rows, query the SVV_TABLE_INFO system table.


wlm_query_slot_count
You can temporarily override the amount of memory assigned to a query by setting the wlm_query_slot_count parameter to specify the number of slots allocated to the query. 
By default, WLM queues have a concurrency level of 5 
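For example, to let one session take 3 of the 5 slots for a heavy statement and then give them back:

SET wlm_query_slot_count TO 3;
VACUUM FULL salary TO 100 PERCENT;
SET wlm_query_slot_count TO 1;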


VARCHAR
A VARCHAR(12) column can contain 12 single-byte characters, 6 two-byte characters, 4 three-byte characters, or 3 four-byte characters. 


Tuple in Redshift SQL

SIMILAR TO
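SIMILAR TO mixes LIKE-style wildcards with regular-expression alternation, for example (the name column is just an illustration):

SELECT DISTINCT employee_id
FROM employee
WHERE name SIMILAR TO '(Rob|Bob)%';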

analyze_threshold_percent
To reduce processing time and improve overall system performance, Amazon Redshift skips analyzing a table if the percentage of rows that have changed since the last ANALYZE command run is lower
than the analyze threshold specified by the analyze_threshold_percent parameter. By default, analyze_threshold_percent is 10.
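The threshold can be changed per session, for example:

SET analyze_threshold_percent TO 20;
ANALYZE salary;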

COPY from DynamoDB
Setting READRATIO to 100 or higher will enable Amazon Redshift to consume the entirety of the DynamoDB table’s provisioned throughput, which will seriously degrade the performance of concurrent read operations against the same table during the COPY session. Write traffic will be unaffected.
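A sketch (the DynamoDB table name and role are placeholders); READRATIO caps how much of the provisioned read throughput the COPY may consume:

COPY salary
FROM 'dynamodb://my-dynamodb-table'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
READRATIO 50;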

Different databases in Redshift
After you have created the TICKIT database, you can connect to the new database from your SQL client. Use the same connection parameters as you used for your current connection, but change the database name to tickit.
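For example:

CREATE DATABASE tickit;
-- Then reconnect with the same host, port and user, but database name "tickit".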

Interleaved Sort Key
A maximum of eight columns can be specified for an interleaved sort key.
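For example (the column choice is only an illustration):

CREATE TABLE salary_interleaved (
  employee_id INTEGER,
  start_date  DATE,
  salary      INTEGER
)
INTERLEAVED SORTKEY (employee_id, start_date);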

Concatenate in SQL
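Both the || operator and the CONCAT function work, for example:

SELECT 'employee-' || CAST(employee_id AS VARCHAR) AS tag FROM employee LIMIT 5;
SELECT CONCAT('employee-', CAST(employee_id AS VARCHAR)) AS tag FROM employee LIMIT 5;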

INSERT INTO from SELECT
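For example, filling the distkey table from the earlier sketch out of the original one:

INSERT INTO salary_dist
SELECT employee_id, salary, start_date FROM salary;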

Prepare and execute PLAN
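For example (the id value is just one from my dataset):

PREPARE get_salary (INTEGER) AS
  SELECT salary FROM salary WHERE employee_id = $1;

EXECUTE get_salary (10001);

DEALLOCATE get_salary;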

Powerful ‘WITH’ for sub-query in SQL
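For example, computing each employee’s first working day once and reusing it:

WITH first_day AS (
  SELECT employee_id, MIN(start_date) AS min_start_date
  FROM salary
  GROUP BY employee_id
)
SELECT f.min_start_date, AVG(s.salary) AS avg_salary
FROM first_day f
JOIN salary s ON s.employee_id = f.employee_id
GROUP BY f.min_start_date
ORDER BY f.min_start_date;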

UNLOAD with compression
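Same as a plain UNLOAD, with the GZIP option added (bucket and role are placeholders):

UNLOAD ('SELECT * FROM salary')
TO 's3://my-bucket/unload/salary_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER ','
GZIP;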

VACUUM
VACUUM FULL salary TO 100 PERCENT;

‘OVER’ in SQL
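For example, ranking salaries inside each starting year with a window function:

SELECT employee_id, salary,
       RANK() OVER (PARTITION BY EXTRACT(YEAR FROM start_date) ORDER BY salary DESC) AS rank_in_year
FROM salary;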

Show and set current settings
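For example:

SHOW search_path;
SET search_path TO '$user', public;
SHOW datestyle;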

Show blocks (1 MB each) allocated to each column in the ‘salary’ table
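This can be read from STV_BLOCKLIST joined with STV_TBL_PERM:

SELECT col, COUNT(*) AS blocks
FROM stv_blocklist b, stv_tbl_perm p
WHERE b.tbl = p.id
  AND b.slice = p.slice
  AND p.name = 'salary'
GROUP BY col
ORDER BY col;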

Slice and Col
slice: Node slice
col: Every table you create has three hidden columns appended to it: INSERT_XID, DELETE_XID, ROW_ID

In STV_SLICES, we can see the mapping between slices and nodes. A single node has two slices: 0 and 1.
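For example:

SELECT node, slice FROM stv_slices ORDER BY node, slice;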

Commonly used tables for meta information
pg_table_def
svv_table_info
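For example (pg_table_def only shows tables whose schema is in the search_path):

SELECT * FROM pg_table_def WHERE tablename = 'salary';
SELECT "table", diststyle, sortkey1, size, tbl_rows FROM svv_table_info;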

Data Preprocessing in Tableau

In a previous article, I created two tables in my Redshift cluster. Now I want to find out the relation between the salary of every employee and their working age. Tableau is the best choice for visualizing this data analysis (SAS is too expensive and has no trial version for learning).
First, we connect to Redshift in Tableau and double-click “New Custom SQL”. In the popup window, type in our SQL to query the first-year salary of every employee:
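The custom SQL is essentially the first start_date per employee, roughly:

SELECT employee_id, MIN(start_date) AS min_start_date
FROM salary
GROUP BY employee_id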




Now we have the table “Custom SQL Query”. Drag in the table “salary”, and choose “inner join” on employee_id and start_date:



Click into “Sheet 1”. Drag “salary” to “Rows”, “min_start_date” to “Columns”, and “employee_id” to “Color” in the “Marks” panel.



Now we can see the “expensive employees” (those with the highest salary among employees who started in the same first year) at the top of the graph:



Instead of adding custom SQL in the Tableau data source panel, we can also create a view in Redshift and let Tableau show views under “Tables”.
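A sketch of such a view (the view name is my own choice):

CREATE VIEW first_year_salary AS
SELECT s.employee_id, f.min_start_date, s.salary
FROM salary s
JOIN (
  SELECT employee_id, MIN(start_date) AS min_start_date
  FROM salary
  GROUP BY employee_id
) f ON f.employee_id = s.employee_id AND f.min_start_date = s.start_date;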

Or use a “WITH” clause:
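which keeps the whole query in one place, for example:

WITH first_day AS (
  SELECT employee_id, MIN(start_date) AS min_start_date
  FROM salary
  GROUP BY employee_id
)
SELECT s.employee_id, f.min_start_date, s.salary
FROM salary s
JOIN first_day f
  ON f.employee_id = s.employee_id AND f.min_start_date = s.start_date;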



Example datasets for Amazon RedShift

Last year, I imported two datasets into Hive. Now I will load these two datasets into Amazon Redshift instead.
After creating a Redshift cluster in my VPC, I couldn’t connect to it even with an Elastic IP. Then I compared the parameters of my VPC with those of AWS’s default VPC and eventually saw the vital differences. First, set the “Network ACL” in the “VPC” service of AWS:




Then, add a rule to the “Route Table” that lets nodes access anywhere (0.0.0.0/0) through the “Internet Gateway” (also created in the “VPC” service):



Now I could connect to my RedShift cluster.

Create an S3 bucket with the AWS CLI:
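(the bucket name is a placeholder)

aws s3 mb s3://my-redshift-example-data --region us-west-2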

Upload the two CSV files into the bucket:
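(the file names are placeholders for the employee and salary datasets)

aws s3 cp employee.csv s3://my-redshift-example-data/
aws s3 cp salary.csv s3://my-redshift-example-data/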

Create tables in Redshift by using SQL-Bench:
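Roughly like this (the column types, and the name column, are my guesses for these datasets):

CREATE TABLE employee (
  employee_id INTEGER,
  name        VARCHAR(64)
);

CREATE TABLE salary (
  employee_id INTEGER,
  salary      INTEGER,
  start_date  DATE
);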

Don’t put a blank space or tab (‘\t’) before column names when creating the table, or else Redshift will treat the column names as
”     employee_id”
”     salary”

Load data from S3 into Redshift with COPY, the powerful ETL tool in AWS.
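A sketch (bucket, file names and role are placeholders):

COPY employee
FROM 's3://my-redshift-example-data/employee.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER ',';

COPY salary
FROM 's3://my-redshift-example-data/salary.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER ',';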

We could see the success report like this:

There are “Warnings” but also “successfully”, which is a little weird. But don’t worry, it’s OK for SQL-Bench.

Now we can run the script that was written last year (but we need to change ‘==’ to ‘=’ because of a compatibility problem):

The result is



Enable audit log for AWS Redshift

When I was trying to enable the audit log for AWS Redshift, I chose to use an existing bucket in S3. But it reported an error:

"Cannot read ACLs of bucket redshift-robin. Please ensure that your IAM permissions are set up correctly."
"Service: AmazonRedshift; Status Code: 400; Error Code: InsufficientS3BucketPolicyFault ...."




According to this document, I needed to change the permissions of the bucket "redshift-robin". So I entered the AWS Console of S3, clicked the bucket name "redshift-robin" in the left panel, and saw the description of its permissions:



Press "Add Bucket Policy", and in the pop-up window, press "AWS Policy Generator". Here comes the generator, which is easy to use for creating policies.
Add two policies for "redshift-robin":


The "902366379725" is the Amazon Redshift system account ID for the us-west-2 region (Oregon), which needs permission to write the audit logs.
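The two statements allow that account to read the bucket ACL and to put log objects. The generated JSON looked roughly like this (the principal ARN follows the AWS documentation for us-west-2 and may differ in other regions):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::902366379725:user/logs" },
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::redshift-robin"
    },
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::902366379725:user/logs" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::redshift-robin/*"
    }
  ]
}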

Click "Generate Policy", and copy the generated JSON to "Bucket Policy Editor":



Press "Save". Now we can enable the audit log of Redshift for the bucket "redshift-robin":


Read paper “iShuffle: Improving Hadoop Performance with Shuffle-on-Write”

Paper reference: iShuffle: Improving Hadoop Performance with Shuffle-on-Write

Background:
A job in Hadoop consists of three main stages: map, shuffle, and reduce (actually the shuffle stage is contained within the reduce stage).


What is the problem?
The shuffle phase needs to migrate a large amount of data from the nodes running map tasks to the nodes that will run reduce tasks. This causes shuffle latency, which is usually significant. The reasons are:

  • Partitioning skew: Hadoop uses a hash algorithm to organize the output data of map tasks; if too many keys fall into the same hash bucket, the partitions end up with uneven sizes
  • Coupling of shuffle and reduce: data shuffling can’t be overlapped with map tasks


Solution: iShuffle

    • “Shuffler”: collects the intermediate data generated by every map task and predicts the size of each partition
    • “Shuffler Manager”: collects information from the “Shufflers” and decides the placement of partitions


    • Shuffle-on-Write: while a map task writes a spill to the local filesystem, it will (through a modification of the Hadoop code) also write the spill to the corresponding node where the reduce task will be launched
    • Automated map output placement: iShuffle decides the placement of every partition by “map selectivity”, which is the ratio between map input size and map output size. After predicting the “map selectivity” and knowing the total input size, iShuffle can choose the best node for every partition’s data


  • Flexible reduce scheduling: when a node requests a reduce task, the Task Manager (a modification of Hadoop’s FIFO scheduler) finds the list of partitions residing on that node and launches reduce tasks only for those partitions (this makes sure a reduce task only reads shuffled data from the local filesystem, which reduces network bandwidth in the reduce stage)

In my opinion
Using prediction to proactively move map output to suitable nodes, which avoids partition skew, is the most intelligent part of this paper. This technique could also be applied to other intermediate-data-movement scenarios, such as OLAP in a data warehouse.
But I also suspect that in real production not many organizations will use iShuffle, as they usually run multi-user applications on their Hadoop clusters. When a lot of jobs run in one Hadoop cluster simultaneously, a dip in CPU usage caused by the long reduce latency of one job will be compensated by other compute-intensive jobs. Therefore, from the view of all users, no hardware resources are wasted.

Build dataflow to get monthly top price of Land Trading in UK

The dataset is downloaded from the UK government data website (the total data size is more than 3 GB). I am using Apache Oozie to run the Hive and Sqoop jobs periodically.

The Hive script “land_price.hql”:
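A rough sketch of it (the real dataset has more columns; the column names, table names and quote handling here are my own simplification):

-- land_price.hql (sketch)
SET mapred.job.queue.name=root.default;

-- Raw CSV loaded as strings; every field in the file is wrapped in double quotes
CREATE TABLE IF NOT EXISTS land_price_raw (
  trans_id      STRING,
  price         STRING,
  transfer_date STRING,
  postcode      STRING,
  property_type STRING,
  town          STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

CREATE TABLE IF NOT EXISTS land_price_month (
  trade_month STRING,
  town        STRING,
  top_price   BIGINT
);

-- Monthly top price per town; SUBSTR() strips the surrounding quote characters
INSERT OVERWRITE TABLE land_price_month
SELECT SUBSTR(transfer_date, 2, 7)                              AS trade_month,
       SUBSTR(town, 2, LENGTH(town) - 2)                        AS town,
       MAX(CAST(SUBSTR(price, 2, LENGTH(price) - 2) AS BIGINT)) AS top_price
FROM land_price_raw
GROUP BY SUBSTR(transfer_date, 2, 7), SUBSTR(town, 2, LENGTH(town) - 2);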

We want the Hive job to run on the queue “root.default” in YARN (and other jobs on “root.mr”), so we set “mapred.job.queue.name” to “root.default”.

Remember to use SUBSTR() in Hive to strip the quote character ‘"’ when importing data from the raw CSV file.

The “coordinator.xml” for Apache Oozie:
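A minimal sketch (frequency, dates and the application path are placeholders):

<coordinator-app name="land-price-coord" frequency="${coord:days(1)}"
                 start="2017-01-01T00:00Z" end="2018-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${nameNode}/user/oozie/land_price/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>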

The “workflow.xml” for Apache Oozie:
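A sketch of its shape: a fork running the Hive job and TeraSort in parallel, then a Sqoop export to MySQL (paths, hosts and table names are placeholders):

<workflow-app name="land-price-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="fork-jobs"/>

  <fork name="fork-jobs">
    <path start="hive-land-price"/>
    <path start="terasort"/>
  </fork>

  <action name="hive-land-price">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>root.default</value>
        </property>
      </configuration>
      <script>land_price.hql</script>
    </hive>
    <ok to="join-jobs"/>
    <error to="fail"/>
  </action>

  <action name="terasort">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <main-class>org.apache.hadoop.examples.terasort.TeraSort</main-class>
      <arg>/user/test/terasort-input</arg>
      <arg>/user/test/terasort-output</arg>
    </java>
    <ok to="join-jobs"/>
    <error to="fail"/>
  </action>

  <join name="join-jobs" to="sqoop-export"/>

  <action name="sqoop-export">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>export --connect jdbc:mysql://mysql-host/land --username land --password-file /user/oozie/mysql.password --table land_price_month --export-dir /user/hive/warehouse/land_price_month</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>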

We run two jobs parallelly here: Hive and TeraSort (TeraSort is not useful in real productive environment, but it could be a good substitute for real private job in my company).

The sqoop once report error “javax.xml.parsers.ParserConfigurationException: Feature ‘http://apache.org/xml/features/xinclude’ is not recognized”.
The solution is change file “/usr/lib/hadoop/bin/hadoop” like:

“job.properties” for Oozie:
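Roughly like this (hosts and paths are placeholders):

nameNode=hdfs://namenode-host:8020
jobTracker=resourcemanager-host:8032
queueName=root.mr
oozie.use.system.libpath=true
oozie.coord.application.path=${nameNode}/user/oozie/land_price/coordinator.xml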

Remember to set “oozie.use.system.libpath=true” so that Oozie can run the Hive and Sqoop jobs correctly.

The script to create the MySQL table:
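A sketch matching the columns of the Hive result table above:

CREATE TABLE land_price_month (
  trade_month VARCHAR(7),
  town        VARCHAR(64),
  top_price   BIGINT
);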

After launching the Oozie coordinator, it will eventually put the resulting data into the MySQL table:



Looks like the land price in “WOKINGHAM” in October 2015 was extremely high.

Some tips about using Apache Flume

Question 1: The Flume process reports “Expected timestamp in the Flume event headers, but it was null”
Solution 1: The Flume process expects to receive events with a timestamp, but these events don’t have one. To send plain text events to Flume, we need to tell it to generate a timestamp for every event by itself. Put the lines below into the configuration:
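(assuming the agent is “a1” and the source is “r1”, as in my configuration)

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp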

Question 2: The HDFS sink generates a tremendous number of small files at high frequency even though we have set “a1.sinks.k2.hdfs.rollInterval=600”
Solution 2: We still need to set “rollCount” and “rollSize”, as Flume will roll the file as soon as any of the “rollInterval”, “rollCount”, or “rollSize” conditions is fulfilled.
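For example, disabling the other two triggers so that only the time-based roll applies:

a1.sinks.k2.hdfs.rollInterval = 600
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollSize = 0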

Question 3: The Flume process exits and reports “Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.OutOfMemoryError: GC overhead limit exceeded”
Solution 3: Simply add JAVA_OPTS="-Xms12g -Xmx12g" (my server has more than 16 GB of physical memory) into “/usr/lib/flume-ng/bin/flume-ng”

—— My configuration file for Flume ——
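I no longer have the exact file, so here is a sketch along the same lines (the source type, port and HDFS path are placeholders; the agent, sink and interceptor names match the snippets above):

a1.sources = r1
a1.channels = c1
a1.sinks = k2

# Source: plain text lines over a TCP port (placeholder choice)
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000

# HDFS sink, rolling a new file every 10 minutes
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /flume/events/%Y%m%d
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.rollInterval = 600
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollSize = 0

# Wiring
a1.sources.r1.channels = c1
a1.sinks.k2.channel = c1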

The startup command for Cloudera Environment:
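(the config directory and file path are the CDH defaults on my machines and may differ for you)

flume-ng agent --name a1 --conf /etc/flume-ng/conf --conf-file /etc/flume-ng/conf/flume.conf -Dflume.root.logger=INFO,console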

Use Oozie to run terasort

The better choice of “Action” for running the TeraSort test case in Oozie is a “Java Action” instead of a “Mapreduce Action”, because TeraSort needs to sample its input and write a partition file first, and then load that ‘partitionFile’ with “TotalOrderPartitioner”. It’s not a simple MapReduce job that needs merely a few properties.

The directory of this “TerasortApp”, which uses the “Java Action” of Oozie, looks like:
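(the jar name is a placeholder for whichever jar contains the TeraSort classes)

TerasortApp/
  workflow.xml
  lib/
    hadoop-mapreduce-examples.jar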

The core of this App is “workflow.xml”:
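A sketch of it (input/output paths and the split size are placeholders); the Java Action calls the TeraSort driver class directly:

<workflow-app name="terasort-app" xmlns="uri:oozie:workflow:0.4">
  <start to="terasort"/>

  <action name="terasort">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapreduce.input.fileinputformat.split.minsize</name>
          <value>1073741824</value>
        </property>
      </configuration>
      <main-class>org.apache.hadoop.examples.terasort.TeraSort</main-class>
      <arg>/user/test/terasort-input</arg>
      <arg>/user/test/terasort-output</arg>
    </java>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>TeraSort failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>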

Note 1. In a Cloudera environment, the Web UI will fail at the last step of creating the sharelib for the Oozie service. To fix this problem:

Note 2. We can’t use the ‘mapred.map.tasks’ property to change the number of mappers in TeraSort, because the number is actually decided by the ‘TotalOrderPartitioner’ class. Therefore I use the ‘mapreduce.input.fileinputformat.split.minsize’ property to limit the number of mappers.