hdfs

Problems about using DistCp on Hadoop

After installing all Hadoop environment, I used DistCp to copy large files in distributed cluster. But it report error:

#hadoop distcp
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/Job
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
        at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
        at java.lang.Class.getMethod0(Class.java:3018)
        at java.lang.Class.getMethod(Class.java:1784)
        at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.Job
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 7 more

Seems it can’t even find the basic MapReduce class. Then I checked CLASSPATH for Hadoop:

#hadoop classpath
/usr/lib/hadoop-2.6.0/etc/hadoop/:/usr/lib/hadoop-2.6.0/share/hadoop/common/lib/*:/usr/lib/hadoop-2.6.0/share/hadoop/common/*:/usr/lib/hadoop-2.6.0//share/hadoop/hdfs:/usr/lib/hadoop-2.6.0//share/hadoop/hdfs/lib/*:/usr/lib/hadoop-2.6.0//share/hadoop/hdfs/*:/usr/lib/hadoop-2.6.0/share/hadoop/yarn/lib/*:/usr/lib/hadoop-2.6.0/share/hadoop/yarn/*:/usr/lib/hadoop-mapreduce/share/hadoop/mapreduce/*

Pretty strange, the HADOOP_CLASSPATH contains ‘mapreduce’ directories. It supposed to be able to find ‘Job’ class, unless the MapReduce jar package is in other directories.
Finally, I found the real MapReduce jar is actually in other position. Therefore I add these directories into HADOOP_CLASSPATH: edit ~/.bashrc and add following line

#.bashrc
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:/usr/lib/hadoop-2.6.0/share/hadoop/mapreduce/*:/usr/lib/hadoop-2.6.0/share/hadoop/mapreduce/lib/*

DistCp could work now.

Some tips about using Apache Flume

Question1: Flume process report “Expected timestamp in the Flume event headers, but it was null”
Solution1: The flume process expect to receive events with timestamp, but the event doesn’t have. For sending normal text event to flume, we need to tell it to generate timestamp with every events by itself. Put below line into configuration:

a1.sinks.k1.hdfs.useLocalTimeStamp=true

Question2: HDFS Sink generates tremendous small files with high frequency even we have set “a1.sinks.k2.hdfs.rollInterval=600”
Solution2: We still need to set “rollCount” and “rollSize”, as Flume will roll file if any condition of “rollInterval”, “rollCOunt”, or “rollSize” been fulfilled.

a1.sinks.k1.hdfs.rollInterval=600
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.rollSize=0

Question3: Flume process exit and report “Exception in thread “SinkRunner-PollingRunner-DefaultSinkProcessor” java.lang.OutOfMemoryError: GC overhead limit exceeded”
Solution3: Simply add “JAVA_OPTS=”-Xms12g -Xmx12g” (My server has more than 16G physical memory) into “/usr/lib/flume-ng/bin/flume-ng”
—— My configuration file for Flume ——

a1.sources=r1 r2
a1.sinks=k1 k2
a1.channels=c1 c2
a1.sources.r1.type=netcat
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=44444
a1.sources.r1.channels=c1
a1.sources.r2.type=netcat
a1.sources.r2.bind=0.0.0.0
a1.sources.r2.port=55555
a1.sources.r2.channels=c2
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=/user/realestates/CN/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix=re-
a1.sinks.k1.hdfs.rollInterval=600
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.rollSize=0
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.channel=c1
a1.sinks.k2.type=hdfs
a1.sinks.k2.hdfs.path=/user/realestates/AU/%Y-%m-%d/
a1.sinks.k2.hdfs.filePrefix=re-
a1.sinks.k2.hdfs.rollInterval=600
a1.sinks.k2.hdfs.rollCount=0
a1.sinks.k2.hdfs.rollSize=0
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.roundValue = 1
a1.sinks.k2.hdfs.roundUnit = hour
a1.sinks.k2.hdfs.writeFormat=Text
a1.sinks.k2.hdfs.useLocalTimeStamp=true
a1.sinks.k2.channel=c2
a1.channels.c1.type=memory
a1.channels.c1.capacity=23456789
a1.channels.c1.transactionCapacity=23456789
a1.channels.c2.type=memory
a1.channels.c2.capacity=23456789
a1.channels.c2.transactionCapacity=23456789

The startup command for Cloudera Environment:

sudo -u hdfs flume-ng agent --conf ./ --conf-file example.conf \
     -name a1 -Dflume.root.logger=INFO,console \
     -Dorg.apache.flume.log.rawdata=true

Why my Spark job hangs?

After running my small application for Spark of Machine Learning , the job hangs and the Spark UI for it display nothing for more than 5 minutes.
That is weird and I see some logs in UI of yarn:

16/09/30 17:05:33 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 0 time(s); maxRetries=45
16/09/30 17:05:53 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 1 time(s); maxRetries=45
16/09/30 17:06:13 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 2 time(s); maxRetries=45
16/09/30 17:06:33 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 3 time(s); maxRetries=45
16/09/30 17:06:53 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 4 time(s); maxRetries=45
16/09/30 17:07:13 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 5 time(s); maxRetries=45
16/09/30 17:07:33 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 6 time(s); maxRetries=45

I haven’t any IP looks like “110.75.x.x”. Why is the Spark job trying to connect it ?
After reviewing the code carefully, I find out the problem:

    val conf = new SparkConf().setAppName("Simple Regression")
    val sc = new SparkContext(conf)
    val smsData = sc.textFile("hdfs://user/sanbai/SMSSpamCollection")

It is me who forget to add IP to URI of HDFS. Thus, the correct code should be:

    val smsData = sc.textFile("hdfs://127.0.0.1/user/sanbai/SMSSpamCollection")

Now the application runs correctly.

Usage limitations of HDFS’s C API

I have to change a program which is written by c language from writing local files to writing on HDFS. After learning the example of C API in libhdfs, I complete the modification of open()/write()/read() to hdfsOpenFile()/hdfsWriteFile()/hdfsReadFile() and so on. But when running the new program, many problems occured. The first is: after fork(), I can’t open files of HDFS anymore. And the problem looks very common in community and haven’t any solution yet.
So I have to try the hdfs-fuse tool. According to the steps of this article, I successfully build and run the hdfs-fuse:

./fuse_dfs_wrapper.sh -d dfs://x.x.x.x:8020 /data/ -obig_writes

But something weird happened:

fd = open("my.db", "w");
write(fd, "hello", 5);
fsync(fd);
....

After fsync(), the size of file “my.db” is still zero by “ls” command on mountpoint “/data”! It cause the program report error and can’t continue to process.
The reason is fuse-dfs haven’t implement fuse_fsync() interface. After adding the implementation of fuse_fsync() by hdfsHSync(), it works now. But the performance is too bad: about 10~20MB/s in network.
Consequently, I decided to use glusterfs instead of HDFS because it totally don’t need any modification for user program and support erasure-code since version 3.6 (this will dramatically reduce occupation of storage space).

Robin on Linux

hdfs