Monthly Archives: June 2017

Using Spark-SQL to transfer CSV file to Parquet

After downloading data from “Food and Agriculture Organization of United Nations”, I get many CSV files. One of the file is named “Trade_Crops_Livestock_E_All_Data_(Normalized).csv” and it looks like:

Area Code,Area,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag
"2","Afghanistan","231","Almonds shelled","5910","Export Quantity","1961","1961","tonnes","0.000000",""
"2","Afghanistan","231","Almonds shelled","5910","Export Quantity","1962","1962","tonnes","0.000000",""
"2","Afghanistan","231","Almonds shelled","5910","Export Quantity","1963","1963","tonnes","0.000000",""
......

To load this CSV file into Spark and dump it to Parquet format, I wrote these codes:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd._
/* Area Code,Area,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag */
case class Trade(area_code:Int, area:String, item_code:Int, item:String, element_code:Int,
                 element:String, year:Int, unit:String, value:Double, flag:String)
object TradeCrops {
  def scrub(str:String):String = {
    return str.replace("\"", "")
  }
  def toInt(str:String):Int = {
    try {
      return scrub(str).toInt
    } catch {
      case e:Throwable => {
        return 0
      }
    }
  }
  def toDouble(str:String):Double = {
    try {
      return scrub(str).toDouble
    } catch {
      case e:Throwable => {
        return 0
      }
    }
  }
  def toTrade(line:String):Trade = {
    val fields = line.split("\",")
    Trade(
      toInt(fields(0)),
      scrub(fields(1)),
      toInt(fields(2)),
      scrub(fields(3)),
      toInt(fields(4)),
      scrub(fields(5)),
      toInt(fields(7)),
      scrub(fields(8)),
      toDouble(fields(9)),
      scrub(fields(10))
    )
  }
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Trade Crops Application")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder()
        .appName("Spark SQL Trade Crops")
        .getOrCreate()
    val file = sc.textFile("hdfs:///FAO/Trade_Crops.csv")
    val tradeRDD = file.filter(_.split("\",").length == 11).map(toTrade(_))
    val tradeDF = spark.createDataFrame(tradeRDD)
    tradeDF.write.parquet("hdfs:///FAO/Trade_Crops.parquet")
  }
}

The build.sbt is

lazy val root = (project in file("."))
  .settings(
    name := "FAO",
    version := "1.0",
    scalaVersion := "2.11.7",
    unmanagedJars in Compile += file("/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.6.jar"),
    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.11" % "2.1.1",
      "org.apache.spark" % "spark-sql_2.11" % "2.1.1",
      "org.apache.hadoop" % "hadoop-client" % "2.6.0"
    )
  )

Always remember to add dependency for “spark-sql” or else it will report “createDataFrame() if not a member of spark”.
And finally, the submit script is:

/disk1/spark-2.1.1-bin-hadoop2.6/bin/spark-submit --class TradeCrops \
  --master yarn \
  --driver-memory 2G \
  --executor-memory 2G \
  --executor-cores 1 \
  --num-executors 64 \
  ./target/scala-2.11/FAO_2.11-1.0.jar

Importance of function’s return Type in Scala

The Scala code is below:

abstract class Human {
        def call
}
class Student(val name: String) extends Human {
        override def call = "\"" + name + "\""
}
object MyApp extends App {
        var s:Student = new Student("robin")
        println(s.call)
}

But the output of this section of code is:

()

Nothing, just a pair of parenthesis. What happen? Why doesn’t Scala use the ‘call’ function in subclass ‘Student’?
The answer is: we forget to define the return Type of function ‘call’, so its default return Type is ‘Unit’. The value of ‘Unit’ is only ‘()’. Hence no matter what value we give to ‘call’, it only return ‘()’.
We just to add return Type:

abstract class Human {
        def call: String
}
......

It print “robin” now.

Data Join in AWS Redshift

In “Amazon Redshift Database Developer Guide“, there is an explanation for data join:
“HASH JOIN and HASH are used when joining tables where the join columns are not both distribution keys and sort keys.
MERGE JOIN is used when joining tables where the join columns are both distribution keys and sort keys, and when less than 20 percent of the joining tables are unsorted.”
Let’s take ‘salary’ and ’employee’ for example.
Firstly, we EXPLAIN the join of ‘salary’ and ’employee’, and it shows “Hash Join”:

Then we create two new tables:

CREATE TABLE salary_new
distkey (employee_id)
sortkey (employee_id)
AS SELECT * FROM salary;
CREATE TABLE employee_new
distkey (employee_id)
sortkey (employee_id)
AS SELECT * FROM employee;

Currently, the join column is both distkey and sortkey. Hence EXPLAIN shows “Merge Join”:

Some tips about “Amazon Redshift Database Developer Guide”

Show diststyle of tables

SELECT relname, reldiststyle
FROM   pg_class
WHERE  relname='salary' OR relname='employee';

Details about distribution styles: http://docs.aws.amazon.com/redshift/latest/dg/viewing-distribution-styles.html
How to COPY multiple files into Redshift from S3
http://docs.aws.amazon.com/redshift/latest/dg/t_loading-tables-from-s3.html
Could “Group” (or “Order”) by number, not column name

SELECT listing.sellerid, sum(sales.qtysold)
FROM   sales, listing
WHERE  sales.salesid = listing.listid
AND    listing.listtime > '2008-12-01'
AND    sales.saletime > '2008-12-01'
GROUP BY 1
ORDER BY 1;

COPY with automatical compression
To apply automatic compression to an empty table, regardless of its current compression encodings, run the COPY command with the COMPUPDATE option set to ON. To disable automatic compression, run the COPY command with the COMPUPDATE option set to OFF.
Change diststyle of table

CREATE TALBLE userseven DISTSTYLE EVEN AS
SELECT * FROM users;

Show storage space of columns

select col, max(blocknum)
from stv_blocklist b, stv_tbl_perm p
where (b.tbl=p.id) and name ='lineorder'
and col < 17
group by name, col
order by col;

Change current environment in SQL Editor

set query_group to test;
set session authorization 'adminwlm';
set wlm_query_slot_count to 3; /* override current level */

Primary key and foreign key
Amazon Redshift does not enforce primary key and foreign key constraints, but the query optimizer uses them when it generates query plans. If you set primary keys and foreign keys, your application must maintain the validity of the keys.  
Distribution info in EXPLAIN
DS_DIST_NONE
No redistribution is required, because corresponding slices are collocated on the compute nodes. You will typically have only one DS_DIST_NONE step, the join between the fact table and one dimension table.
DS_DIST_ALL_NONE
No redistribution is required, because the inner join table used DISTSTYLE ALL. The entire table is located on every node.
DS_DIST_INNER
The inner table is redistributed.
DS_DIST_OUTER
The outer table is redistributed.
DS_BCAST_INNER
A copy of the entire inner table is broadcast to all the compute nodes.
DS_DIST_ALL_INNER
The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.
DS_DIST_BOTH
Both tables are redistributed.
Create Like

create table likesales (like sales);
insert into likesales (select * from sales);
drop table sales;
alter table likesales rename to sales;

Interleaved skew

select tbl as tbl_id, stv_tbl_perm.name as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;

The value for interleaved_skew is a ratio that indicates the amount of skew. A value of 1 means there is no skew. If the skew is greater than 1.4, a VACUUM REINDEX will usually improve performance unless the skew is inherent in the underlying set.
About interleaved sort key: http://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html#t_Sorting_data-interleaved
Concurrent write
Concurrent write operations are supported in Amazon Redshift in a protective way, using write locks on tables and the principle of serializable isolation.  
UNLOAD

unload  ('select * from venue order by venueid')
to 's3://mybucket/tickit/venue/reload_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest
delimiter '|';
truncate venue;
copy venue
from 's3://mybucket/tickit/venue/reload_manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest
delimiter '|';

Redshift UDF
In addition to the Python Standard Library, the following modules are part of the Amazon Redshift implementation:
* numpy 1.8.2  
* pandas 0.14.1  
* python-dateutil 2.2  
* pytz 2015.7  
* scipy 0.12.1  
* six 1.3.0  
* wsgiref 0.1.2

CREATE FUNCTION f_within_range (x1 float, y1 float, x2 float, y2 float) RETURNS bool
 IMMUTABLE as $$
    def distance(x1, y1, x2, y2):
        import math
        return math.sqrt((y2 - y1) ** 2 + (x2 - x1) ** 2)
    return distance(x1, y1, x2, y2) < 20
$$ LANGUAGE plpythonu;

Data Join
* Nested Loop : The least optimal join, a nested loop is used mainly for cross-joins (Cartesian products) and some  inequality joins.  
* Hash Join and Hash  Typically faster than a nested loop join, a hash join and hash are used for inner joins and left and  right outer joins. These operators are used when joining tables where the join columns are not both distribution keys and sort keys. The hash operator creates the hash table for the inner table in the join; the hash join operator reads the outer table, hashes the joining column, and finds matches in the inner hash table.  
* Merge Join  Typically the fastest join, a merge join is used for inner joins and outer joins. The merge join is not used for full joins. This operator is used when joining tables where the join columns are both distribution keys and sort keys, and when less than 20 percent of the joining tables are unsorted. It reads two sorted tables in order and finds the matching rows. To view the percent of unsorted rows, query the SVV_TABLE_INFO (p. 786) system table.  
wlm_query_slot_count
You can temporarily override the amount of memory assigned to a query by setting the wlm_query_slot_count parameter to specify the number of slots allocated to the query.  By default, WLM queues have a concurrency level of 5  
VARCHAR
A VARCHAR(12) column can contain 12 single-byte characters, 6 two-byte characters, 4 three- byte characters, or 3 four-byte characters.  
Tuple in Redshift SQL

select * from venue
where (venuecity, venuestate) in (('Miami', 'FL'), ('Tampa', 'FL'))
order by venueid;

SIMILAR TO

SELECT gender, COUNT(gender) FROM employee WHERE first_name SIMILAR TO '%ein%' GROUP BY gender;

analyze_threshold_percent
To reduce processing time and improve overall system performance, Amazon Redshift skips analyzing a table if the percentage of rows that have changed since the last ANALYZE command run is lower than the analyze threshold specified by the analyze_threshold_percent parameter. By default, analyze_threshold_percent is 10
COPY from DynamoDB
Setting READRATIO to 100 or higher will enable Amazon Redshift to consume the entirety of the DynamoDB table's provisioned throughput, which will seriously degrade the performance of concurrent read operations against the same table during the COPY session. Write traffic will be unaffected.
Different databases in Redshift
After you have created the TICKIT database, you can connect to the new database from your SQL client. Use the same connection parameters as you used for your current connection, but change the database name to tickit.
Interleaved Sort Key
A maximum of eight columns can be specified for an interleaved sort key.
Concatenate in SQL

insert into t1(col1) values('Incomplete'::char(3));

INSERT INTO from SELECT

INSERT INTO hello
( SELECT employee_id, first_name
  FROM employee ORDER BY 2)

Prepare and execute PLAN

DROP TABLE IF EXISTS prep1;
CREATE TABLE prep1 (c1 int, c2 char(20));
PREPARE prep_insert_plan (int, char)
AS insert into prep1 values ($1, $2);
EXECUTE prep_insert_plan (1, 'one');
EXECUTE prep_insert_plan (2, 'two');
EXECUTE prep_insert_plan (3, 'three');
DEALLOCATE prep_insert_plan;

Powerful 'WITH' for sub-query in SQL

WITH workage AS (
SELECT employee_id, datediff(day, birthday, work_day)/365 AS work_age FROM employee)
SELECT COUNT(employee_id), work_age FROM workage GROUP BY work_age ORDER BY 1 DESC;

UNLOAD with compression

unload ('select * from employee')
TO 's3://robin-data-023/employee_'
iam_role 'arn:aws:iam::589631040421:role/fullRedshift'
gzip;

VACCUM

VACUUM FULL salary TO 100 PERCENT;

'OVER' in SQL

SELECT *,
100*salary/FIRST_VALUE(salary)
OVER (PARTITION BY employee_id ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS most_salary_percent
FROM salary
WHERE employee_id = 10001 OR employee_id = 10002;

Show and set current settings

select current_setting('query_group');
select set_config('query_group', 'test', true);

Show blocks(1MB) allocated to each column in the 'salary' table

select col, count(*)
from stv_blocklist, stv_tbl_perm
where stv_blocklist.tbl = stv_tbl_perm.id
and stv_blocklist.slice = stv_tbl_perm.slice
and stv_tbl_perm.name = 'salary'
group by col
order by col;

Slice and Col
slice: Node slice
col: Every table you create has three hidden columns appended to it: INSERT_XID, DELETE_XID, ROW_ID
In STV_SLICES , we can see relations between slice and node. Single node have two slices: 0 and 1
Common used tables for meta information
pg_table_def
svv_table_info

Robin on Linux

Monthly Archives: June 2017