Models and data I/O#

Here we look into our siamese network architecture. The siamese network takes the outputs of two different branches and projects them onto a common embedding space, following a metric learning scheme with a distance-based loss function. In our case, it is composed of an audio encoder branch and a side-information embedding branch.

1) Audio encoder#

For our two experiments (word-audio and image-audio), we use the same structure for the audio encoder. The audio encoder is a 2D convolutional network that takes mel-spectrograms as inputs. It is a common architecture used for music classification or tagging tasks.

import torch
import torch.nn as nn


class MelCNN(nn.Module):
    """2D CNN audio encoder that maps a mel-spectrogram to an emb_dim-dimensional embedding."""
    def __init__(self, emb_dim):
        super(MelCNN, self).__init__()

        # Spectrogram normalization
        self.spec_bn = nn.BatchNorm2d(1)

        # CNN : input (1, 63 * N, 80) / kernel size (3x3)
        self.layer1 = Conv_2d(1, 64, pooling=(1,2))
        self.layer2 = Conv_2d(64, 128, pooling=(3,4))
        self.layer3 = Conv_2d(128, 128, pooling=(7,5))
        self.layer4 = Conv_2d(128, emb_dim, pooling=(3,2))   # last block outputs emb_dim channels
        self.pool = torch.nn.AdaptiveAvgPool2d(1)            # global pooling -> (batch, emb_dim, 1, 1)

    def forward(self, x):
        x = self.spec_bn(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.pool(x)
        return x


class Conv_2d(nn.Module):
    """Convolution -> batch norm -> ReLU -> max pooling block used by MelCNN."""
    def __init__(self, input_channels, output_channels, shape=3, stride=1, pooling=2):
        super(Conv_2d, self).__init__()
        self.conv2d = nn.Conv2d(input_channels, output_channels, shape, stride=stride, padding=shape//2)
        self.bn = nn.BatchNorm2d(output_channels)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(pooling)

    def forward(self, x):
        out = self.pool(self.relu(self.bn(self.conv2d(x))))
        return out
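
As a quick sanity check (this usage snippet is ours, not part of the tutorial code), a batch of mel-spectrogram excerpts of shape (batch, 1, 63 * N, 80) is mapped to a (batch, emb_dim, 1, 1) tensor, which the siamese networks below squeeze into a (batch, emb_dim) embedding:

import torch

# Hypothetical sanity check with random input (N = 10 chunks of 63 frames, 80 mel bins).
encoder = MelCNN(emb_dim=128)
dummy_mel = torch.randn(4, 1, 63 * 10, 80)   # (batch, channel, time, mel)
embedding = encoder(dummy_mel)
print(embedding.shape)                       # torch.Size([4, 128, 1, 1])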

2) Word embedding model#

For the word embedding model, we simply take the precomputed vector for each class, so no extra pretrained model is needed. We directly feed in the 300-dimensional word embedding vectors that were pretrained with the GloVe model.
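
For illustration only, such 300-dimensional GloVe vectors could be looked up per tag word with torchtext's GloVe wrapper, as sketched below; this loader is an assumption on our part, since the experiment itself already uses precomputed vectors.

import torch
from torchtext.vocab import GloVe   # assumed loader; any GloVe lookup works

# Hypothetical lookup of 300-dimensional GloVe vectors for a few tag words.
glove = GloVe(name="6B", dim=300)
tags = ["rock", "jazz", "piano"]
word_vectors = torch.stack([glove[tag] for tag in tags])   # shape: (3, 300)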

3) Word-audio siamese network#

For the word-audio experiment, we build a siamese network that takes the output vectors of the audio encoder and the word embedding vectors. We add a projection layer, a single fully connected layer followed by a sigmoid, on top of both the audio encoder output and the word embeddings.

class WordAudioSiameseNetwork(nn.Module):
    def __init__(self) -> None:
        super().__init__()

        self.audio_model = MelCNN(128)
        self.audio_projection = nn.Linear(in_features=128, out_features=128, bias=True)
        self.word_projection = nn.Linear(in_features=300, out_features=128, bias=True)

    def forward(self, x_audio, pos_word, neg_word):
        # Audio branch: encode, drop the (1, 1) spatial dims, then project with a sigmoid.
        x_audio = self.audio_model(x_audio)
        x_audio = torch.squeeze(x_audio, dim=-1)
        x_audio = torch.squeeze(x_audio, dim=-1)
        x_audio = torch.sigmoid(self.audio_projection(x_audio))

        # Word branch: project the positive and negative word vectors with the same layer.
        x_word_pos = torch.sigmoid(self.word_projection(pos_word))
        x_word_neg = torch.sigmoid(self.word_projection(neg_word))

        return x_audio, x_word_pos, x_word_neg

The triplet loss is then computed between the outputs of the projection layers to guide the model to learn the common embedding space. For that reason, we feed one audio embedding vector, one positive word embedding that the input audio is annotated with, and one negative word embedding sampled uniformly at random from the unrelated labels.

class TripletLoss(nn.Module):
    """Cosine-similarity triplet loss: the anchor should be closer to the positive than to the negative by at least `margin`."""
    def __init__(self, margin):
        super(TripletLoss, self).__init__()
        self.margin = margin
        self.relu = nn.ReLU()

    def forward(self, anchor, positive, negative, size_average=True):
        cosine_positive = nn.CosineSimilarity(dim=-1)(anchor, positive)
        cosine_negative = nn.CosineSimilarity(dim=-1)(anchor, negative)
        # Hinge on the similarity gap: zero loss once cos(anchor, positive) exceeds cos(anchor, negative) by the margin.
        losses = self.relu(self.margin - cosine_positive + cosine_negative)
        return losses.mean() if size_average else losses.sum()
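
Putting the pieces together, a single training step could look like the sketch below; the dummy tensors, the margin of 0.4, and the Adam learning rate are illustrative assumptions, not the tutorial's exact settings.

import torch

# Hypothetical training step with dummy tensors in place of a real dataloader.
model = WordAudioSiameseNetwork()
criterion = TripletLoss(margin=0.4)                      # example margin
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

audio = torch.randn(8, 1, 63 * 10, 80)                   # mel-spectrogram batch
pos_word = torch.randn(8, 300)                           # word vector of an annotated tag
neg_word = torch.randn(8, 300)                           # word vector of a random unrelated tag

emb_audio, emb_pos, emb_neg = model(audio, pos_word, neg_word)
loss = criterion(emb_audio, emb_pos, emb_neg)

optimizer.zero_grad()
loss.backward()
optimizer.step()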

4) Image encoder#

The image encoder we use for the image-audio ZSL experiment is a ResNet-101 architecture pretrained on the large-scale ImageNet image classification dataset.

We remove the last (fully connected) layer of the pretrained model and use the outputs of the second-to-last layer, i.e. the 2048-dimensional globally pooled features.
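
Concretely, dropping the final fully connected layer leaves ResNet-101's globally pooled 2048-dimensional features; the short check below (our own snippet) verifies the output shape.

import torch
import torch.nn as nn
import torchvision

# Hypothetical check of the truncated backbone's output shape.
backbone = torchvision.models.resnet101(pretrained=True)
backbone = nn.Sequential(*list(backbone.children())[:-1])   # drop the final fc layer
backbone.eval()

with torch.no_grad():
    features = backbone(torch.randn(2, 3, 224, 224))
print(features.shape)                                       # torch.Size([2, 2048, 1, 1])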

5) Image-audio siamese network#

The only difference between the image-audio siamese network and the word-audio model is that it runs the images through the pretrained image classification backbone instead of directly taking precomputed embedding vectors.

import torchvision


class ImageAudioSiameseNetwork(nn.Module):
    def __init__(self) -> None:
        super().__init__()

        self.audio_model = MelCNN(128)
        self.audio_projection = nn.Linear(in_features=128, out_features=128, bias=True)

        # Using a pretrained resnet101 image classification model as a backbone.
        visual_model = torchvision.models.resnet101(pretrained=True)
        layers = list(visual_model.children())
        self.visual_model = nn.Sequential(*layers[:-1])   # drop the final fc layer
        # Freeze the backbone; only the audio encoder and the projection layers are trained.
        for param in self.visual_model.parameters():
            param.requires_grad = False
        self.visual_projection = nn.Linear(in_features=2048, out_features=128, bias=True)

    def forward(self, x_audio, pos_img, neg_img):
        # Audio branch: encode, drop the (1, 1) spatial dims, then project with a sigmoid.
        x_audio = torch.sigmoid(self.audio_model(x_audio))
        x_audio = torch.squeeze(x_audio, dim=-1)
        x_audio = torch.squeeze(x_audio, dim=-1)
        x_audio = torch.sigmoid(self.audio_projection(x_audio))

        # Visual branch: frozen ResNet features -> squeeze -> projection, for both positive and negative images.
        pos_img = torch.sigmoid(self.visual_model(pos_img))
        pos_img = torch.squeeze(pos_img, dim=-1)
        pos_img = torch.squeeze(pos_img, dim=-1)
        x_img_pos = torch.sigmoid(self.visual_projection(pos_img))

        neg_img = torch.sigmoid(self.visual_model(neg_img))
        neg_img = torch.squeeze(neg_img, dim=-1)
        neg_img = torch.squeeze(neg_img, dim=-1)
        x_img_neg = torch.sigmoid(self.visual_projection(neg_img))

        return x_audio, x_img_pos, x_img_neg
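
As with the word-audio model, a forward pass with dummy tensors (our own shape check, assuming ImageNet-sized 224x224 inputs) illustrates the expected output shapes:

import torch

# Hypothetical shape check for the image-audio siamese network.
model = ImageAudioSiameseNetwork()
audio = torch.randn(4, 1, 63 * 10, 80)       # mel-spectrogram batch
pos_img = torch.randn(4, 3, 224, 224)        # image paired with the audio
neg_img = torch.randn(4, 3, 224, 224)        # randomly sampled negative image

emb_audio, emb_pos, emb_neg = model(audio, pos_img, neg_img)
print(emb_audio.shape, emb_pos.shape, emb_neg.shape)   # each torch.Size([4, 128])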

Data transform#

The audio and image data need some transformation before being fed to the siamese network. For the image data, we apply basic augmentation and preprocessing such as cropping, flipping, and normalization. For the audio data, we apply the mel-spectrogram transformation on the fly.

from torchvision import transforms
from torchvision.transforms import Compose, Lambda
from torchaudio.transforms import MelSpectrogram, AmplitudeToDB


def get_transforms():
    # ImageNet-style preprocessing: random crops/flips for training, center crops for testing.
    img_transforms = {
        'train': transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),
        'test': transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),
    }

    # On-the-fly mel-spectrogram extraction, converted to dB (80 dB range) and scaled by 1/80.
    SAMPLING_RATE = 16000
    N_FFT = 512
    HOP_LENGTH = 256
    N_MELS = 80
    EPS = 1e-10
    mel_transform = Compose(
        [
            MelSpectrogram(
                sample_rate=SAMPLING_RATE,
                n_fft=N_FFT,
                hop_length=HOP_LENGTH,
                n_mels=N_MELS,
            ),
            Lambda(lambda x: x.clamp(min=EPS)),        # avoid log(0)
            AmplitudeToDB(stype='power', top_db=80.),
            Lambda(lambda x: x / 80.),
            Lambda(lambda x: x.transpose(1, 0)),  # (F, T) -> (T, F)
        ]
    )

    return img_transforms, mel_transform
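
A quick usage sketch (ours, with dummy inputs; the real pipeline reads actual image files and audio excerpts):

import torch
from PIL import Image

# Hypothetical usage with dummy inputs.
img_transforms, mel_transform = get_transforms()

dummy_img = Image.new("RGB", (320, 240))
img_tensor = img_transforms["train"](dummy_img)   # shape: (3, 224, 224)

dummy_wave = torch.randn(16000)                   # 1 second of mono audio at 16 kHz
mel = mel_transform(dummy_wave)                   # shape: (63, 80), i.e. (T, F)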

All classes are contained in

zsl/
    model.py
    loss.py
    transforms.py

for quick usage.