machine learning

Use repeated dataset correctly with timm’s data loader

For an experiment of metaformer, I was trying to add CIFAR100 dataset into the training script. Since CIFAR100 is too small, I need to let it repeat mulitple times in one epoch. Therefore I add a new type of dataset:

class RepeatDataset(Dataset):
    def __init__(self, dataset, repeats):
        self.dataset = dataset
        self.repeats = repeats
        self.length = len(dataset) * repeats

    def __getitem__(self, idx):
        return self.dataset[idx % len(self.dataset)]

    def __len__(self): 
        return self.length

But the training will report error:

Traceback (most recent call last):                                                                                    
  File "/home/robin/code/metaformer/train.py", line 970, in <module>                                                  
    main()                                                                                                            
  File "/home/robin/code/metaformer/train.py", line 732, in main                                                      
    train_metrics = train_one_epoch(                       
                    ^^^^^^^^^^^^^^^^                                                                                  
  File "/home/robin/code/metaformer/train.py", line 798, in train_one_epoch                                           
    for batch_idx, (input, target) in enumerate(loader):                                                              
                                      ^^^^^^^^^^^^^^^^^                                                               
  File "/home/robin/miniconda3/envs/poolformer/lib/python3.12/site-packages/timm/data/loader.py", line 131, in __iter__                                                                                                                     
    for next_input, next_target in self.loader:                                                                       
                                   ^^^^^^^^^^^                                                                        
  File "/home/robin/miniconda3/envs/poolformer/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 733, in __next__                                                                                                          
    data = self._next_data()                                                                                                                                                                                                                
           ^^^^^^^^^^^^^^^^^                                                                                          
  File "/home/robin/miniconda3/envs/poolformer/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1515, in _next_data                                                                                                       
    return self._process_data(data, worker_id)                                                                        
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                        
  File "/home/robin/miniconda3/envs/poolformer/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1550, in _process_data                                                                                                    
    data.reraise()                                         
  File "/home/robin/miniconda3/envs/poolformer/lib/python3.12/site-packages/torch/_utils.py", line 750, in reraise                                                                                                                          
    raise exception                                        
AttributeError: Caught AttributeError in DataLoader worker process 0.                                                 
Original Traceback (most recent call last):                                                                           
  File "/home/robin/miniconda3/envs/poolformer/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop                                                                                                   
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]                                                   
           ^^^^^^^^^^^^^^^^^^^^                            
  File "/home/robin/miniconda3/envs/poolformer/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch                                                                                                            
    return self.collate_fn(data)                           
           ^^^^^^^^^^^^^^^^^^^^^                           
  File "/home/robin/miniconda3/envs/poolformer/lib/python3.12/site-packages/timm/data/mixup.py", line 305, in __call__                                                                                                                      
    output = torch.zeros((batch_size, *batch[0][0].shape), dtype=torch.uint8)                                         
                                       ^^^^^^^^^^^^^^^^^                                                              
AttributeError: 'Image' object has no attribute 'shape'. Did you mean: 'save'?

It cost me a quite long time to solve it. The key is in the implementation of “timm.data.create_loader”: https://github.com/huggingface/pytorch-image-models/blob/main/timm/data/loader.py#L291. In it, it will set “dataset.transform” to a new value, and in “timm.data.dataset” https://github.com/huggingface/pytorch-image-models/blob/main/timm/data/dataset.py#L66-L67, it will check and use this new set “transform”:

...
        if self.transform is not None:
            img = self.transform(img)     
...

Since the class RepeatDataset is created by myself and it will not handle the “dataset.transform = create_transform()”, it failed when calling the non-existed “transform()”.

The fix comes from ChatGPT and I think it’s not bad:

class RepeatDataset(Dataset):
    def __init__(self, dataset, repeats):
        self.dataset = dataset
        self.repeats = repeats
        self.length = len(dataset) * repeats

    @property
    def transform(self):
        return self.dataset.transform

    @transform.setter
    def transform(self, value):
        self.dataset.transform = value

    def __getitem__(self, idx):
        return self.dataset[idx % len(self.dataset)]

    def __len__(self):
        return self.length

Experiments about ‘accelerate’ library of HuggingFace

If you want to run your training code with ‘accelerate‘ fp8, you need to install ‘transformer_engine‘ or ‘MS-AMP‘. But these two packages are hard to install beccause they depends on specific CUDA/CUDNN versions. After one afternoon’s efforet, I finally gave up and started to directly using docker image ‘nvcr.io/nvidia/pytorch:24.04-py3’.

docker run \
  --gpus all \
  -it \
  --rm \
  --shm-size="16g" \
  --network host \
  nvcr.io/nvidia/pytorch:24.04-py3

After enter the container by using above command, I still need to install ‘accelerate’ directly by using ‘python3 -m pip install accelerate’. In the ‘accelerate config’, I set to use ‘fp8’ with ‘E4M3’. But the training process reported error about LayerNorm. Then I manually modify the code (may not be correct but it works):

# transformer_engine/pytorch/module/layernorm.py

class _LayerNorm(torch.autograd.Function):
    """functional LayerNorm"""

    @staticmethod
    def forward(
        ctx,
        inp: torch.Tensor,
        ln_weight: torch.Tensor,
        ln_bias: torch.Tensor,
        eps: float,
        fwd_ln_sm_margin: int,
        bwd_ln_sm_margin: int,
        zero_centered_gamma: bool,
        is_grad_enabled: bool,
        activation_dtype: torch.dtype,
    ) -> torch.Tensor:
        # Make sure input dimensions are compatible
        in_features = ln_weight.numel()
        assert inp.is_cuda, "TransformerEngine needs CUDA."
        permute = False
        if inp.shape[-1] != in_features:
            inp = inp.permute(0, 2, 3, 1)
            permute = True
        assert inp.shape[-1] == in_features, "LayerNorm not possible"
        if permute:
            inp = inp.permute(0, 3, 1, 2)
        inputmat = inp.reshape((-1, in_features))

        # Cast for native AMP
        inputmat = cast_if_needed(inputmat, activation_dtype)
        ln_weight = cast_if_needed(ln_weight, activation_dtype)
        ln_bias = cast_if_needed(ln_bias, activation_dtype)

        if is_grad_enabled:
            ln_out, mu, rsigma = tex.layernorm_fwd(inputmat, ln_weight,
                ln_bias, eps, fwd_ln_sm_margin, zero_centered_gamma)
            ctx.save_for_backward(inputmat, ln_weight, mu, rsigma)
            ctx.inp_shape = inp.shape
            ctx.bwd_ln_sm_margin = bwd_ln_sm_margin
            ctx.zero_centered_gamma = zero_centered_gamma
        else:
            ln_out, mu, rsigma = layernorm_fwd_inf(inputmat, ln_weight,
                ln_bias, eps, zero_centered_gamma), None, None
        return ln_out.view_as(inp)

Finally the training could work properly. But the speed is the same with bf16…

Experiments about ‘torchao’

‘torchao‘ is a python library that support PyTorch native quantization and sparsity for training and inference. I just finished some experiments/tests with it for my image-classification project, which use CNN model by PyTorch. Below are some conclusions.

My project already used Automatic Mixed Precision of ‘bfloat16’, but the convert_to_float8_training still easily reduce about 60% of the VRAM (on my RTX 4090 GPU):

from torchao.float8 import convert_to_float8_training

def module_filter_fn(mod: torch.nn.Module, fqn: str) -> bool:
    # Example: Exclude the output layer from float8 conversion
    if fqn == "output":
        return False
    # Example: Exclude linear layers with dimensions not divisible by 16
    if isinstance(mod, torch.nn.Linear):
        if mod.in_features % 16 != 0 or mod.out_features % 16 != 0:
            return False
    return True

convert_to_float8_training(m, module_filter_fn=module_filter_fn)



AdamW8bit could decrease the VRAM from 22.6GB to 22.4GB, not too much.

Didn't see any VRAM difference after using CPUOffloadOptimizer. Since it couldn't work well with learning-rate-scheduler. I tend to give up it.

Try to understand Variational Autoencoders

ELBO as the Loss Function

Note: “p(x|z)” means True Posterior, “q(z|x)” means Approximate Posterior

What if only use first term of the Loss?

What’s the meaning of “GAN tend to lack full support over the data”?

Why using Gaussian Distribution for latent variable?

An experiment about my stupid idea

After training both image classification and sound classification deep learning models. I found out that the image training is much slower than the sound training, although the sound dataset is much bigger than image dataset.

The first idea that jumped out of my mind is that the image has 3 channels (RGB) but the sound spectrogram just have 1 channel. Therefore if I compress RGB into 1 channel (such as using gray image), the training speed of image classification will become 3 times faster.

A few days ago, I started to train the image classification with gray image. But the speed of training is almost the same with RGB image. Until then, I realized how stupid I am.

Let’s see below graph cut from residual network paper. Yes, the first layer is a 7×7 64 filters convoluation, and it will map no matter how many channels just to 64 filters. If the image is a gray one, it maps it to 64 filters; if the image is a RGB one, it also maps it to 64 filters. The computing cost will only reduce 3 times for first layer if I reduce the 3-channels to 1-channel. Compare to total computing cost, this change is quite minor.

That’s why there is no body mentioned about this “accelerating” technology before 🙂

Notes and experiences from Audio Classification research

All the code is here.

The baseline of training balanced data of AudioSet is 0.27 mAP. Using TimeMasking and FrequentMasking could slightly push it to 0.28 mAP.

I tried mixup of raw sounds like AST but it didn’t improve the mAP totally (the reason is still a myth for me). But, the mixup of fbank filters could push metric to 0.293 mAP.

Until then, the fbank filter will be resized to (384, 384) for model deit_distilled. After I recovered the size of fbank filter to (128, 998), it reached 0.323 mAP.

The most recent (hope it’s not the last) change is copied wholly from AST: use the pretrained parameters of Conv2D from deit_distilled but change the stride size — also expand the position embeddings since the sequence length has changed. The result is 0.333 mAP.

It is worth noting that this is the first time I feel the power of pretrained model by my hand. If I re-initialized parameters of position embedings instead of “bilinear” interpolating it, the result will be far away from 0.333 mAP. Also if I used new initialized parameters of the Conv2D (first layer for Vision Transformer), the result is as bad as before.

I will take care of whehter pretrained model also works well for unbalanced data of AduioSet

Augmentation helps ALBEF a lot

I was trying to implement ALBEF by myself for practice. After finishing all the parts (Vision part, BERT part, including Masked Language Model), I trained the model on COCO-Captions/SBU-Captions/CC3M/CC12M dataset (actually more data than the original paper). But the result is quite weird. An old steam train was recognised as a building, and a few fish were recognised as statues.

To solve these weird mistakes, I reviewed the code many times and finally noticed a sentence in the paper:

Although it’s just a normal sentence in the paper, the augmentation could improve the ALBEF model significantly. After randomly cropping the 256×256 raw image to 224×224 and also using the RandAugment, I finally got a more stable and suitable model. Let’s see some examples:

Previously, the fish had been recognised as “shoes”, and the bedroom as “city”. They all become very well after augmentation.

But there are still some interesting bad cases:

Adding a prefix of “A picture of” could help the ALBEF model improve its recognition capability, or actually, there is a lot of text like “A picture of XXX” in the CC3M or CC12m dataset.

Anyhow, I finally implemented and trained a workable ALBEF model by myself, and my RTX-3080Ti card.

Multimodal trials: solve the Masked Language problem about my tiny ALBEF implementation (episode 3)

I just wrote my implementation of ALBEF in my own way. But when evaluated with some masked sentences, it failed.

I am using this image:

When I asked “This is a chocolate <|mask|>”, it generated “This is a chocolate urn”. Quite strange

Then I asked “This is a <|mask|> cake, it generated “This is a iph cake”. Totally wrong.

After checking my implementation of the dataset, and training on a small part of CC3M, a week passed and I finally got the reason today: the tiktoken is a BPE tokenizer that will use sub-words as tokens and these sub-words severely hurt the model. For example, sub-words “urn” and “iph” appear too many times and the model would use them to replace the masked word in prediction.

By replacing tiktoken with BertTokenizerFast (from “transformers” package), the model correctly generates “This is a chocolate cake”.

Multimodal trials: my tiny CLIP implementation (episode 2)

Three weeks passed since the previous article. Here are the answers to the previous three questions:

Q1: The original paper of CLIP uses L2 normalization on multimodal embeddings. But in my test, the model will not converge with it.

Answer 1: The reason the model didn’t converge is that the learning rate is too large. After reducing the learning rate a bit and adding the L2 normalization, the model could get above 80% validation accuracy. The L2 normalization essentially projects embeddings to a high-dimensional sphere with a one-unit radius, which by intuitive could regularize the model.

Q2: Adding the learnable temperature parameter will cause a training error, which requires a “retain_graph=True” argument for “backward()”

Answer 2: If I use the code

def __init__(self):
  self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07)).exp()

def forward(self):
  ...
  logits_per_image = self.logit_scale * img_embds @ txt_embds.T

it will report the error

Traceback (most recent call last):
  File "/home/robin/code/try_multimodal/train.py", line 196, in <module>
    trainer.train(args)
  File "/home/robin/code/try_multimodal/train.py", line 149, in train
    train_result = self.train_loop(cmodel, optimizer)
  File "/home/robin/code/try_multimodal/train.py", line 81, in train_loop
    self.scaler.scale(loss).backward()
  File "/home/robin/miniconda3/envs/nanoGPT/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/robin/miniconda3/envs/nanoGPT/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

But if I moved the “exp()” to “forward()”:

def __init__(self):
  self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

def forward(self):
  ...
  logits_per_image = self.logit_scale.exp() * img_embds @ txt_embds.T

It works well. The reason is that “exp()” would bring the gradient to “logit_scale” so we’d better to let it in “forward()” to avoid graph edge duplication.

Q3: When using “torch.compile()”, it will report a Triton error after the first epoch

Answer 3: Seems the “torch.compile()” needs the input batch to be fixed shape, which means you’d better not change the batch_size at the training step. To avoid this, I dropped the last batch of the dataset since the last batch usually wouldn’t have enough samples for BATCH_SIZE.

There is a new discovery for “torch.compile()”. Yesterday I was trying to compile the model (a self-implement ALBEF) but came an error:

...
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
InvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++')

After tumbling in the debug document of PyTorch compiling, I finally found out that the solution is just installed “g++” on my computer…

Previously, the evaluation of 50000 val images of ImageNet1K shows that the top-5 accuracy is just 6.66%. After I added CC12M with CC3M as training dataset, the top-5 evaluation accuracy raised to 23.76%. My tiny CLIP model checkpoint is here.

Does sinusoid Positional Embeddings actually work well?

The GPT part of my Multimodal trials mainly comes from nanoGPT. In the nanoGPT, the Positional Encoding is just a learnable tensor (“wpe” means “weights of positional embedding”):

self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, bias=config.bias),
        ))

It’s different from the implementation of the original paper. The original paper mentioned:

We also experimented with using learned positional embeddings instead, and found that the two versions produced nearly identical results.

The “vanilla” Positional Embeddings for the transformer are two functions:

$PE_(pos,2i) = sin(pos/10000^{2i/d_{model}})$

$PE_(pos,2i+1) = cos(pos/10000^{2i/d_{model}})$

Which ones work better in the model training? Let me try running “python train.py config/train_shakespeare_char.py” in nanoGPT and get the best validation loss as metrics.

I wrote my own sinusoid Positional Embeddings for testing:

class GPT(nn.Module):
  def __init__(self, config):
	...
    # Position Embedding from original Transformer paper
    divisor = torch.pow(
        10000, 2 * torch.arange(1, config.n_embd + 1) / config.n_embd
    )
    pe = []
    for pos in range(1, config.block_size + 1):
        if pos % 2 == 0:
            pe.append(torch.sin(pos / divisor).unsqueeze(0))
        else:
            pe.append(torch.cos(pos / divisor).unsqueeze(0))
    self.register_buffer("pos_emb", torch.cat(pe, 0))

The “10000” (let’s call it “base number” for convenience) looks too big for a shorter sequence length, so I do experiments by changing it to “block_size” “2*block_size” etc.

The testing result:

	validation loss
Original nanoGPT	1.4754
Base number: 10000	1.4959
Base number: 4 * block_size	1.4916
Base number: 2 * block_size	1.4995
Base number: 3.14/2 * block_size	1.4870
Base number: block_size	1.4947

From my simple tests, the learnable Positional Embeddings has the best effort. nanoGPT wins this round.

I have a guess about why the author of Transformer chose “10000”. The smallest “pos” is 1 and the biggest $2i/d_{model}$ is 2. Therefore the smallest value in sin() is $1/10000^2=1e-8$ , which is very close to the minimal value of FLOAT16 $5.96e-8$