A problem about using DataFrame in Apache Spark

Here is the code for loading a CSV file (the employee table) into an Apache Spark DataFrame:

But after I ran the jar in Spark, it reported:

It seems the data hadn’t been loaded correctly.
After carefully reviewing the documentation for the CSV format, I noticed that the quote character in my CSV file is ' (single quote) instead of " (double quote). So I added an option in my code to let Spark recognise the single quote:

This time the CSV was read properly.
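
The effect of the quote character can be reproduced without Spark. The sketch below uses Python's standard csv module on a made-up employee row (the sample data and the helper name are assumptions, not the original file). In Spark, the CSV reader's quote option plays the same role as quotechar here.

```python
import csv
import io

# Made-up sample: fields that contain commas are wrapped in single quotes.
SAMPLE = "id,name,address\n1,'Smith, John','12 High St, Sydney'\n"


def parse(text, quotechar):
    """Parse CSV text into a list of rows using the given quote character."""
    return list(csv.reader(io.StringIO(text), quotechar=quotechar))


# With the default double quote, the single quotes are ordinary characters,
# so every comma splits the data row and it ends up with too many fields.
default_rows = parse(SAMPLE, '"')

# With the single quote declared as the quote character, the row parses
# into the three intended fields.
single_rows = parse(SAMPLE, "'")

print(len(default_rows[1]), default_rows[1])
print(len(single_rows[1]), single_rows[1])
```

A wrong field count on the data rows is usually the first visible symptom of a mismatched quote character.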

A convenient environment to write LaTeX

More than a year ago, I wrote a paper about how to accelerate Deep Learning training for sparse features and dense features (images). To write it, I installed a bunch of tools and plugins on my MacBook and fixed a lot of their errors by searching Google. Preparing a LaTeX environment on a local computer is really a pain in the neck.
Fortunately, I found a more convenient way today.
First, download your favourite template. For me the best one is CVPR-2020, which anyone can download. The template is a zip file.
Second, go to overleaf.com and sign up for a new account. Then, at the top-left of the page, click “New Project”, then “Upload Project”, and choose the zip file above.
Third, you will now see a beautiful IDE for writing LaTeX.


Using Single Shot Detection to detect birds (Episode four)

In the previous article, I reached mAP 0.770 for VOC2007 test.
Four months have passed. After trying a lot of interesting ideas from different papers, such as FPN, CELU, and RFBNet, I finally realised that the data is more important than the network structure. So I used COCO2017+VOC instead of only VOC to train my model, and the mAP on the VOC2007 test set eventually reached 0.797.
But another strange thing happened: for the 16-birds image, the model produced a big bounding box around the whole image. Even after adding dropout and changing the augmentation policies, the big box was still there.
I suspected that the bird images in COCO2017 are not general enough, so I decided to use a more abundant dataset: Open Images Dataset V5. After retrieving all the bird images from Open Images Dataset V5, I got 18525 images with corresponding annotations. Training on them finally gave a more promising bird detection result for that 16-birds image (using threshold 0.65):

It seems the bird images in Open Images Dataset V5 are more general than those in COCO2017. However, the model trained on Open Images gets a lower mAP in the COCO evaluation than the model trained on COCO2017, so it looks like I need a more comprehensive evaluation metric now.
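
Retrieving the bird images from Open Images mostly means filtering the box annotation CSV by class label. Below is a minimal sketch of that step. The sample rows are made up, only a subset of the real annotation columns is shown, and the function names are hypothetical. The real files are much larger and would be read from disk.

```python
import csv
import io

# Illustrative stand-ins for the class description file (label id, name)
# and the box annotation file (assumed subset of columns, made-up rows).
CLASS_DESCRIPTIONS = "/m/015p6,Bird\n/m/01yrx,Cat\n"
ANNOTATIONS = (
    "ImageID,Source,LabelName,Confidence,XMin,XMax,YMin,YMax\n"
    "img001,xclick,/m/015p6,1,0.10,0.40,0.20,0.60\n"
    "img001,xclick,/m/015p6,1,0.50,0.90,0.30,0.70\n"
    "img002,xclick,/m/01yrx,1,0.05,0.95,0.05,0.95\n"
    "img003,xclick,/m/015p6,1,0.25,0.75,0.25,0.75\n"
)


def label_id_for(class_csv, display_name):
    """Look up the label id for a human-readable class name."""
    for label_id, name in csv.reader(io.StringIO(class_csv)):
        if name == display_name:
            return label_id
    raise KeyError(display_name)


def boxes_for_label(annotation_csv, label_id):
    """Map image id -> list of (xmin, ymin, xmax, ymax) boxes for one label."""
    boxes = {}
    for row in csv.DictReader(io.StringIO(annotation_csv)):
        if row["LabelName"] == label_id:
            box = tuple(float(row[k]) for k in ("XMin", "YMin", "XMax", "YMax"))
            boxes.setdefault(row["ImageID"], []).append(box)
    return boxes


bird_boxes = boxes_for_label(ANNOTATIONS, label_id_for(CLASS_DESCRIPTIONS, "Bird"))
print(sorted(bird_boxes))
```

The keys of the resulting dictionary are the images to download, and the values are the normalised boxes to convert into the training format.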

The MySQL master-slave drift problem in AWS

About one month ago, we met a problem with a MySQL master-slave architecture on AWS EC2. The master ran very fast, but the slave lagged behind: it only had the data from about two or three hours earlier.
We first suspected that the master or slave instance was short of resources, so we upgraded the instance type to get more CPU cores and memory. But the lag remained.
Only after we set binlog_group_commit_sync_delay=10000 did the drift disappear.
Here is the description of binlog_group_commit_sync_delay:

binlog_group_commit_sync_delay Controls how many microseconds the binary log commit waits before synchronizing the binary log file to disk. By default binlog_group_commit_sync_delay is set to 0, meaning that there is no delay. Setting binlog_group_commit_sync_delay to a microsecond delay enables more transactions to be synchronized together to disk at once, reducing the overall time to commit a group of transactions because the larger groups require fewer time units per group.
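
To get a feeling for the grouping described above, here is a toy model in Python. It is only a sketch under a simplifying assumption: the first commit of a group starts a timer, and every commit arriving within the delay joins the same disk sync. It is not how MySQL is implemented, and the function name and workload are made up.

```python
def count_syncs(commit_times_us, delay_us):
    """Count disk syncs when each sync waits delay_us after the first commit of its group."""
    syncs = 0
    group_deadline = None
    for t in sorted(commit_times_us):
        # A commit arriving after the current group's deadline opens a new group.
        if group_deadline is None or t > group_deadline:
            syncs += 1
            group_deadline = t + delay_us
    return syncs


# Hypothetical workload: one commit every millisecond, 100 commits in total.
commits = [i * 1_000 for i in range(100)]

print(count_syncs(commits, 0))       # no delay: every commit is synced on its own
print(count_syncs(commits, 10_000))  # 10000 microseconds: commits are synced in groups
```

In this toy workload, a delay of 10000 microseconds puts eleven commits into each full group, so the number of syncs drops from 100 to 10.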

An example of using Spark Structured Streaming

This snippet monitors two directories and joins the data from them whenever a new CSV file appears in either directory.

The join operation is implemented with Spark SQL, which is easy to use (for DBAs) and also easy to maintain.

Some articles say that if the Spark process restarts after a failure, the ‘checkpoint’ helps it continue from the last uncompleted position. I tried this on my local computer and noticed that it does produce some duplicated rows after a restart. This is a severe problem for a production environment, so I will look into it in further tests.

A problem of using PySpark SQL

Here is the code:

It reports an error when run with ‘cat xxx.py|bin/pyspark’:

At first I thought it was because ‘2’ is a string, so I changed ‘row’ to ‘[2, 29, 29, 29]’. But the error just changed to:

Then I searched on Google and found this article. It looked like I had forgotten to convert the Python ‘list’ to an Apache Spark ‘RDD’.
But in the end I found the real reason: I just needed to add ‘[]’ around my ‘list’, so that it becomes a list of rows!
The right code is here:
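
The point about the extra brackets can be illustrated in plain Python. The sketch below is a stand-in, not PySpark itself: the column names are hypothetical, and to_records is a made-up helper that mimics how a "list of rows" API treats its input. The actual Spark call appears only as a comment.

```python
row = [2, 29, 29, 29]             # one record as a flat list
columns = ["id", "a", "b", "c"]   # hypothetical column names


def to_records(rows, columns):
    """Pair every row with the column names; each element of rows must itself be a row."""
    records = []
    for r in rows:
        if not isinstance(r, (list, tuple)):
            raise TypeError(
                f"expected a row (list or tuple), got {type(r).__name__}: {r!r}"
            )
        records.append(dict(zip(columns, r)))
    return records


# Passing the flat list: each integer is taken as a whole row, which fails.
try:
    to_records(row, columns)
    flat_error = None
except TypeError as exc:
    flat_error = str(exc)

# Wrapping it in [] gives one row with four fields.
records = to_records([row], columns)

# With a SparkSession named `spark`, the corresponding call would be:
#   spark.createDataFrame([row], columns)
print(flat_error)
print(records)
```

The failure on the flat list is the same kind of type error as in the post: the API iterates over the outer list and expects each element to be a complete row.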

Some problems about using AWS DMS

AWS DMS is a new type of service for migrating data between different types of databases and data warehouses. I met some problems when trying to use it in a production environment.

Problem 1. When I used a MySQL server on AWS RDS as the source of a replication task, it reported errors after the task started:

The failure message looks terrible, but at least I could find this doc to follow. After I changed the configuration as below:

binlog_format ROW
binlog_checksum NONE
binlog_row_image FULL

the error was still there.
The real answer is in here, since I used RDS instead of self-managed MySQL. After I added one line of Terraform code to enable “automatic backups”:

the replication task began to work without the error.

Problem 2. After the replication task had run for a while, exporting data from MySQL to AWS Redshift, a new error appeared in the Redshift load logs:

Why is masteruser not authorized? The answer is here. Below is the Terraform code:

Then I gave “dms_assume_role” two trusted entities.

Problem 3. There was still an error in the Redshift load log (so many errors in AWS DMS…):

Error | Type | Raw Field Value
Invalid timestamp format or value [YYYY-MM-DD HH24:MI:SS] | timestamp | 0000-00-00 00:00:00

It seemed the answer is here, so I added “acceptanydate=true;timeformat=auto” to the “extra connection settings” of the Redshift endpoint. But the error just changed to:

Error | Type | Raw Field Value
Invalid data | timestamp | 0000-00-00 00:00:00

After searching for almost two days, I found that the cause is the Redshift schema, which is created automatically by the AWS DMS replication task.

Since the schema doesn’t allow the “mydate” column to be null, but “acceptanydate=true” tries to convert “0000-00-00 00:00:00” to null, the final error in Redshift is “Invalid data”.
The solution is to create the Redshift table manually so that the “mydate” column is nullable, and to change the working mode of the replication task to “TRUNCATE_BEFORE_LOAD”.

Processing date and time in AWS Redshift

Since AWS Redshift doesn’t have a function like FROM_UNIXTIME(), it is more awkward to get a formatted time from a UNIX timestamp (called ‘epoch’ in Redshift):

Ref: https://stackoverflow.com/questions/39815425/how-to-convert-epoch-to-datetime-redshift

If we want to see the statistics grouped by hour:
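
As a rough illustration of the arithmetic behind both steps, here it is in Python rather than Redshift SQL: the formatted time is the epoch origin plus the timestamp in seconds, and grouping by hour truncates that time to the hour. The sample values and function names are assumptions for the sketch.

```python
from collections import Counter
from datetime import datetime, timedelta

# The UNIX epoch origin; a timestamp counts seconds from this moment (UTC).
EPOCH = datetime(1970, 1, 1)


def epoch_to_datetime(seconds):
    """Convert a UNIX timestamp in seconds to a naive UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


def count_by_hour(epoch_values):
    """Count events per hour by truncating each timestamp to the hour."""
    return Counter(
        epoch_to_datetime(s).replace(minute=0, second=0, microsecond=0)
        for s in epoch_values
    )


# Made-up sample timestamps.
events = [1560000000, 1560000030, 1560003600]

print(epoch_to_datetime(events[0]).strftime("%Y-%m-%d %H:%M:%S"))
for hour, count in sorted(count_by_hour(events).items()):
    print(hour.strftime("%Y-%m-%d %H:00"), count)
```

The first two sample events fall into the same hour and the third into the next one, so the hourly counts are 2 and 1.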

Some tips about using AWS Glue

Configuring the data format
To use AWS Glue, I wrote a ‘catalog table’ in my Terraform script:

But when I used a PySpark script to access this table, it reported:

It seems we can’t use ‘OpenCSVSerde’. Actually, the correct answer is:

The version of Zeppelin
When I used Zeppelin to run a PySpark script, it reported an error:

According to the document:

The latest release of Apache Zeppelin, 0.8.x, is not supported. Download the older release named zeppelin-0.7.3-bin-all.tgz from the download page and follow the installation instructions.

Google Cloud Summit 2019


Yesterday I attended the Google Cloud Summit 2019 in Sydney.

The venue was huge, and there were a lot of booths from different Google Cloud partners.

The keynote was quite abstract and a little boring, so I chose to walk around different booths for fun. Here are some useful conversations and information I collected:
[HashiCorp: company for Terraform and Nomad]
Q: How quickly can Terraform support a new service from a cloud provider, such as Lake Formation in AWS?
A: It depends on requirements from users.
Q: Only users who pay for the enterprise version of Terraform?
A: No. All users, including those who use the open source version, are considered when we decide to support a new cloud service. We publish a new version of Terraform quarterly, although we can’t guarantee that every new service from that quarter will be included.

[Confluent: company for Apache Kafka]
Q: Could I use ksql in Kafka as standard SQL?
A: Currently, no. ksql only supports self-defined syntax in Kafka. It looks really like SQL, but it’s actually another language.
Q: Could I use ksql to access a table in MySQL?
A: Yes. You can export a table in MySQL as a ‘kstream’. Then you can use ksql to access this ‘kstream’.

[Tableau: you know what I mean…]
Q: What has been new in Tableau over the past year?
A: We published a new function called ‘Ask Data’. You can type in a query in natural language, and it translates it into a Tableau query using state-of-the-art NLP technologies.

When I typed in a query like ‘avg price in Manly’, it worked very well. But when I typed in a query like ‘top 5 price near Chatswood’, Tableau failed to get the right answer.

A: You know, NLP is really hard, so we only support a range of synonyms for query words.

[elastic: company for ElasticSearch and Kibana]
Q: What’s the biggest cluster of ElasticSearch in production?
A: Well, a lot of big companies use quite big ElasticSearch clusters, such as Netflix and eBay. But we don’t know which one is the biggest, because they won’t tell us every detail of their clusters 🙂

[Google Cloud]
Q: Is there a product in Google Cloud that could continually import data from MySQL and export them to Cloud Storage or BigTable?
A: Yes. Cloud Data Fusion will be your best choice.

At the ASUS booth (ASUS has produced a lot of Chromebooks for Google), I noticed the Dev Board, which contains an Edge TPU.

The demo uses “mobilenet_ssd v2” as the backbone for object detection, the same choice I made.