Build a dataflow to get the monthly top price of land trading in the UK

The dataset is downloaded from the UK government data website (the total data size is more than 3GB). I am using Apache Oozie to run the Hive and Sqoop jobs periodically.

The Hive script “land_price.hql”:

We want the Hive job to run in the YARN queue “root.default” (and other jobs in “root.mr”), so we set “mapred.job.queue.name” to “root.default”.

Remember to use SUBSTR() in Hive to strip the quote character “\”” when importing data from the raw CSV file.
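As a rough illustration of what such a SUBSTR() call does, the Python sketch below mirrors a Hive expression of the form SUBSTR(col, 2, LENGTH(col) - 2), which keeps everything between the first and the last character. This is an assumption about the shape of the expression, not a copy of “land_price.hql”; the function name and the sample values are hypothetical.

```python
def strip_quotes(field: str) -> str:
    """Drop one leading and one trailing double quote, if both are present."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        # Same slice as Hive's 1-based SUBSTR(field, 2, LENGTH(field) - 2).
        return field[1:-1]
    # Unquoted (or half-quoted) values are returned unchanged.
    return field


# Hypothetical raw values as they might appear in a fully quoted CSV row.
raw = ['"2015-10-01 00:00"', '"WOKINGHAM"', 'unquoted']
print([strip_quotes(v) for v in raw])
```

Unlike a bare SUBSTR(), the sketch checks that both quotes are really there before slicing, so an unquoted field does not lose its first and last characters.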

The “coordinator.xml” for Apache Oozie:

The “workflow.xml” for Apache Oozie:

We run two jobs in parallel here: Hive and TeraSort (TeraSort is not useful in a real production environment, but it is a good stand-in for the real private jobs at my company).

Sqoop once reported the error “javax.xml.parsers.ParserConfigurationException: Feature ‘http://apache.org/xml/features/xinclude’ is not recognized”.
The solution is to change the file “/usr/lib/hadoop/bin/hadoop” like this:

“job.properties” for Oozie:

Remember to set “oozie.use.system.libpath=true” so that Oozie can run the Hive and Sqoop jobs correctly.

The script to create the MySQL table:

After the Oozie coordinator is launched, it eventually puts the resulting data into the MySQL table:


[Figure: result rows in the MySQL table]

It looks like the land price in “WOKINGHAM” in October 2015 was extremely high.
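For readers who want the aggregation spelled out, here is a small Python sketch of a “monthly top price” computation: group the sales by month and keep the highest price in each month. It is not the Hive script from this post. The function name, the (price, date, district) row layout and the sample rows are all made up for illustration.

```python
from typing import Dict, Iterable, Tuple

# Assumed row layout: (price, "YYYY-MM-DD" date, district)
Row = Tuple[int, str, str]


def monthly_top_price(rows: Iterable[Row]) -> Dict[str, Tuple[int, str]]:
    """Return {"YYYY-MM": (highest price, district)} for the given rows."""
    top: Dict[str, Tuple[int, str]] = {}
    for price, date, district in rows:
        month = date[:7]  # "2015-10-03" -> "2015-10"
        # Keep the row only if it beats the current maximum for that month.
        if month not in top or price > top[month][0]:
            top[month] = (price, district)
    return top


# Made-up rows, not taken from the real dataset.
sample = [
    (300000, "2015-09-14", "DISTRICT_A"),
    (450000, "2015-09-20", "DISTRICT_B"),
    (900000, "2015-10-03", "DISTRICT_A"),
    (120000, "2015-10-28", "DISTRICT_C"),
]
for month, (price, district) in sorted(monthly_top_price(sample).items()):
    print(month, price, district)
```

In Hive the same idea would normally be a GROUP BY on the month with MAX(price); the loop above only makes the logic explicit.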

Books I read in 2015

Illustrated Network Hardware
Famous Cases for DataCenter
Mysql Internal: Innodb Engine
The Martian
Antifragile

The first book is about network hardware, such as routers and switches. As a coder, I usually use servers in the cloud, so I haven’t seen real high-performance routers (I have only seen bare servers and 1Gb switches). This book opened my eyes.

The second book is about how to build a datacenter. It’s really a job for architects, not IT guys.

About two years ago, I worked with the MySQL team at my company as a kernel developer. We used PCIe NAND cards and flashcache as our solution for MySQL to handle high-throughput pressure. But it was not until this year that I read through the architecture of the InnoDB engine, which is the most powerful and effective engine in MySQL. Actually, it’s not so difficult to get an overview of the InnoDB engine from a book. But it is still very hard to understand its code 🙂

I haven’t gone to the cinema to watch “The Martian” because I had already read it on my Kindle during my daily commute. It is really a sci-fi story for geeks who like doing research in computing, chemistry, physics, etc. The only question I want to ask the author is: “How could you invent so many troubles on Mars to torture Mark Watney?”