Yearly Archives: 2016

The type of variables in Python

Haven’t written python code for more than one year, I met this simple problem:

Even the code have print out the value of “a” and “b” as 2 and 1, the condition check “if a >= b:” is false! Spending more than 10 minutes, I eventually get the reason:… Read more »

My understanding of CNN (Convolutional Neural Network)

The classic Neural Network of Machine Learning usually use fully-connection, which will cost too much computing resource to get final result if the inputs are high-resolution images. So comes the Convolutional Neural Network. CNN (Convolutional Neural Network) splits the whole big image into small pieces (called Receptive Fields), and do… Read more »

Build dataflow to get monthly top price of Land Trading in UK

The dataset is downloaded from UK government data web(The total data size is more than 3GB). And, I am using Apache Oozie to run Hive and Sqoop job periodically. The Hive script “land_price.hql”:

We want Hive job to run on queue “root.default” in YARN (and other jobs in “root.mr”),… Read more »

Use Oozie to run terasort

      No Comments on Use Oozie to run terasort

The better choice of “Action” for running terasort test case in Oozie is “Java Action” instead of “Mapreduce Action” because terasort need to run

first and then load ‘partitonFile’ by “TotalOrderPartitioner”. It’s not a simple Mapreduce job which need merely a few propertyies. The directory of this”TerasortApp” which using… Read more »

Using “sysbench” to test memory performance

Sysbench is a powerful testing tool for CPU / Memory / Mysql etc. Three years ago, I used to test performance of MYSQL by using it. Yesterday, I used Sysbench to test memory bandwidth of my server. By using command:

It reported the memory bandwidth could reach 8.4GB/s, which… Read more »

Install CDH(Cloudera Distribution Hadoop) by Cloudera Manager

These days I was trying to install Cloudera-5.8.3 on my centos-7 machines, and here are some steps for operation and tips for trouble shooting: 0. If you are not in USA, the speed of network for accessing Cloudera Repository of RPMS(or Parcels) is desperately slow, thus we need to move… Read more »

Using Pig to join two tables and sort it

Having two tables: salary and employee,we can use Pig to find the most high-salary employees:

The result is:

Problem about running Hive-2.0.1 on Spark-1.6.2

When I launched Hive-2.0.1 on Spark-1.6.2, it report errors:

After changed “spark.master” from “yarn-cluster” to “local” and add “–hiveconf hive.root.logger=DEBUG,console” to hive command, it printed out details like:

This article suggest replacing fasterxml.jackson package with newer version, but the problem remained the same even after I completed the… Read more »