Using Spark SQL to convert a CSV file to Parquet
After downloading data from the “Food and Agriculture Organization of the United Nations”, I got many CSV files. One of the files is named “Trade_Crops_Livestock_E_All_Data_(Normalized).csv” and it looks like this:
Area Code,Area,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag
"2","Afghanistan","231","Almonds shelled","5910","Export Quantity","1961","1961","tonnes","0.000000",""
"2","Afghanistan","231","Almonds shelled","5910","Export Quantity","1962","1962","tonnes","0.000000",""
"2","Afghanistan","231","Almonds shelled","5910","Export Quantity","1963","1963","tonnes","0.000000",""
......
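Every field in this dump is wrapped in double quotes, so instead of pulling in a real CSV parser I simply split each record on a double quote followed by a comma and strip the leftover quotes afterwards; that is what the scrub() and toTrade() helpers in the code below do. Here is a standalone sketch of the idea on one sample row (no Spark needed, the object name is just for illustration):

object SplitDemo {
  def main(args: Array[String]): Unit = {
    // Rebuild the first data row from the sample above
    val line = Seq("2", "Afghanistan", "231", "Almonds shelled", "5910",
                   "Export Quantity", "1961", "1961", "tonnes", "0.000000", "")
                 .map(f => "\"" + f + "\"").mkString(",")

    // Splitting on the quote-comma sequence yields 11 tokens, each still carrying a leading quote
    val fields = line.split("\",")
    println(fields.length)                                   // 11
    println(fields.map(_.replace("\"", "")).mkString("|"))   // the unquoted field values
  }
}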
To load this CSV file into Spark and dump it to Parquet format, I wrote the following code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd._

/* Area Code,Area,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag */
case class Trade(area_code: Int, area: String,
                 item_code: Int, item: String,
                 element_code: Int, element: String,
                 year: Int, unit: String,
                 value: Double, flag: String)

object TradeCrops {
  // Strip the double quotes that wrap every field
  def scrub(str: String): String = {
    return str.replace("\"", "")
  }

  def toInt(str: String): Int = {
    try {
      return scrub(str).toInt
    } catch {
      case e: Throwable => return 0
    }
  }

  def toDouble(str: String): Double = {
    try {
      return scrub(str).toDouble
    } catch {
      case e: Throwable => return 0
    }
  }

  // Turn one raw CSV line into a Trade record; the "Year Code" column (index 6) is skipped
  def toTrade(line: String): Trade = {
    val fields = line.split("\",")
    Trade(toInt(fields(0)), scrub(fields(1)),
          toInt(fields(2)), scrub(fields(3)),
          toInt(fields(4)), scrub(fields(5)),
          toInt(fields(7)), scrub(fields(8)),
          toDouble(fields(9)), scrub(fields(10)))
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Trade Crops Application")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder()
      .appName("Spark SQL Trade Crops")
      .getOrCreate()

    val file = sc.textFile("hdfs:///FAO/Trade_Crops.csv")
    // Keep only well-formed records (exactly 11 fields); this also drops the header line
    val tradeRDD = file.filter(_.split("\",").length == 11).map(toTrade(_))
    // createDataFrame() infers the schema from the Trade case class by reflection
    val tradeDF = spark.createDataFrame(tradeRDD)
    tradeDF.write.parquet("hdfs:///FAO/Trade_Crops.parquet")
  }
}
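As a side note, Spark 2.x also ships a built-in CSV data source, so the same conversion could be done without the hand-rolled splitting. The sketch below is untested and only illustrates the idea; the output path is a placeholder, and the renames and casts mirror the Trade case class (the columns with spaces in their names have to be renamed anyway, since the Parquet writer does not accept them):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object TradeCropsCsv {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Trade Crops CSV Reader")
      .getOrCreate()

    // The built-in reader handles the quoting and the header row for us
    val raw = spark.read
      .option("header", "true")
      .csv("hdfs:///FAO/Trade_Crops.csv")

    // Rename/cast the columns to match the Trade case class ("Year Code" is dropped, as before)
    raw.select(
        col("Area Code").cast("int").as("area_code"),
        col("Area").as("area"),
        col("Item Code").cast("int").as("item_code"),
        col("Item").as("item"),
        col("Element Code").cast("int").as("element_code"),
        col("Element").as("element"),
        col("Year").cast("int").as("year"),
        col("Unit").as("unit"),
        col("Value").cast("double").as("value"),
        col("Flag").as("flag"))
      .write.parquet("hdfs:///FAO/Trade_Crops_from_csv.parquet")
  }
}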
The build.sbt is:
lazy val root = (project in file("."))
  .settings(
    name := "FAO",
    version := "1.0",
    scalaVersion := "2.11.7",
    unmanagedJars in Compile += file("/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.6.jar"),
    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.11" % "2.1.1",
      "org.apache.spark" % "spark-sql_2.11" % "2.1.1",
      "org.apache.hadoop" % "hadoop-client" % "2.6.0"
    )
  )
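After packaging the project with sbt and submitting the jar with spark-submit, the Parquet output can be sanity-checked from spark-shell with something like this (the filter values just reuse the sample rows above):

// spark-shell provides the `spark` session and the $-syntax for columns out of the box
val trade = spark.read.parquet("hdfs:///FAO/Trade_Crops.parquet")

trade.printSchema()   // area_code, area, item_code, ... as declared in the Trade case class
trade.filter($"area" === "Afghanistan" && $"item" === "Almonds shelled").show(3)
trade.count()         // number of rows that survived the 11-field filter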