keras_ensemble_cifar10

This repository is supported by Huawei (HCNA-AI Certification Course) and Student Innovation Center of SJTU.
Thanks to the teachers for their contributions.

cifar10

I use Keras and TensorFlow to implement all of these models and run some ensemble experiments based on BIGBALLON’s work.

Requirements

Architectures and papers

Documents & tutorials

You can also read the articles if you can read Chinese.

Accuracy of all single models

In particular
Change the batch size according to your GPU’s memory.
Modifying the learning rate schedule may improve the accuracy!
Thanks to the GPU computing platform (composed of 100 1080Ti cards) provided by the Student Innovation Center.
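
As an illustration of what changing the learning rate schedule can look like, here is a minimal step-decay sketch. The base rate and the epoch boundaries are hypothetical values chosen for a 250-epoch run; the README does not state which schedule was actually used.

```python
# Hypothetical values: the README does not state the schedule that was used.
BASE_LR = 0.1
# (first epoch of the stage, multiplier applied to BASE_LR)
STAGES = [(0, 1.0), (100, 0.1), (175, 0.01)]


def scheduler(epoch):
    """Return the learning rate for a zero-based epoch index."""
    lr = BASE_LR
    for first_epoch, factor in STAGES:
        # Later stages overwrite earlier ones once their first epoch is reached.
        if epoch >= first_epoch:
            lr = BASE_LR * factor
    return lr


for epoch in (0, 99, 100, 174, 175, 249):
    print(epoch, scheduler(epoch))
```

A one-argument function of this shape is the kind of callable that `keras.callbacks.LearningRateScheduler` accepts, so the boundaries can be tuned without touching the training loop.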

| network | GPU | model size | batch size | epoch | loss function | training time | val_acc(%) |
|:--------|:---:|:----------:|:----------:|:-----:|:-------------:|:-------------:|:----------:|
| Wide-resnet 28x10 | GTX1080TI x 2 | 139M | 128 | 250 | crossentropy | 4 h 55 min | 96.50 |
| Wide-resnet 28x10 | GTX1080TI x 2 | 139M | 128 | 250 | focal_loss | 6 h 34 min | 95.50 |
| DenseNet-160x24 | GTX1080TI x 2 | 30.2M | 64 | 250 | crossentropy | 24 h 22 min | 95.70 |
| DenseNet-160x24 | GTX1080TI x 2 | 30.2M | 64 | 250 | focal_loss | 25 h 21 min | 95.60 |
| ResNeXt-8x64d | GTX1080TI x 2 | 142M | 120 | 250 | crossentropy | 26 h 07 min | 94.40 |
| ResNeXt-8x64d | GTX1080TI x 2 | 142M | 120 | 250 | focal_loss | 35 h 10 min | 94.60 |
| SENet(ResNeXt-4x64d) | GTX1080TI x 2 | 80.2M | 120 | 250 | crossentropy | 25 h 38 min | 94.27 |

I didn’t calculate the accuracy on the test set. As the author of Keras said, every time you use feedback from your validation process to tune your model, you leak information about the validation process into the model. Repeated just a few times, this is innocuous; but done systematically over many iterations, it will eventually cause your model to overfit to the validation process (even though no model is directly trained on any of the validation data). This makes the evaluation process less reliable.
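
To keep tuning away from the test set, part of the training data can be held out as a validation set. The sketch below only splits indices; the validation size of 5000 and the helper name `split_indices` are assumptions for illustration, not necessarily what `get_CIFAR10_data()` in this repository does.

```python
import random


def split_indices(n_samples, val_size, seed=0):
    """Shuffle sample indices and return (train_indices, val_indices)."""
    indices = list(range(n_samples))
    # A fixed seed keeps the split identical across runs.
    random.Random(seed).shuffle(indices)
    return indices[val_size:], indices[:val_size]


# CIFAR-10 ships 50000 training images; val_size=5000 is an assumed choice.
train_idx, val_idx = split_indices(n_samples=50000, val_size=5000)
print(len(train_idx), len(val_idx))
```

The 10000 official test images stay untouched until the final evaluation.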

Accuracy of all ensemble models

In particular
I first tune on the validation set to determine the parameters of the models.

Voting

| Models | test_acc(%) |
|:-------|:-----------:|
| DenseNet-160x24 + Wide-ResNet 28x10 | 96.10 |
| DenseNet-160x24 + Wide-ResNet 28x10 + SENet(ResNeXt-4x64d) | 96.38 |
| DenseNet-160x24 + Wide-ResNet 28x10 + ResNeXt-29(8x64d) with focal loss + SENet(ResNeXt-4x64d) | 96.38 |
| DenseNet-160x24 + Wide-ResNet 28x10 + ResNeXt-29(8x64d) with focal loss | 96.52 |

Weighted Mean

| Models | test_acc(%) |
|:-------|:-----------:|
| 0.6×Wide-ResNet 28x10 + 0.4×DenseNet-160x24 | 96.38 |
| 0.8×Wide-ResNet 28x10 + 0.8×DenseNet-160x24 + 0.4×ResNeXt-29(8x64d) with focal loss | 96.53 |
| 0.9×Wide-ResNet 28x10 + 0.9×DenseNet-160x24 + 0.2×SENet(ResNeXt-4x64d) | 96.47 |
| Wide-ResNet 28x10 + DenseNet-160x24 + ResNeXt-29(8x64d) with focal loss + 0×SENet(ResNeXt-4x64d) | 96.15 |

About Focal Loss and Cross Entropy

Reference to paper: Focal Loss for Dense Object Detection
Code: multi-class focal loss implemented in Keras

In addition to addressing an extreme imbalance between positive and negative samples, focal loss can also reduce the dominance of easy examples in the loss. That’s why I ran the following experiments.
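
To make the "easy example" effect concrete, here is a minimal single-sample sketch of the focal loss from the paper, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), next to plain cross entropy. It is written in plain Python for illustration and is not the Keras implementation linked above; `gamma=2.0` and `alpha=1.0` are assumed values.

```python
import math


def cross_entropy(probs, true_class):
    """Cross entropy of one sample given its predicted class probabilities."""
    return -math.log(probs[true_class])


def focal_loss(probs, true_class, gamma=2.0, alpha=1.0):
    """Focal loss of one sample: cross entropy scaled by alpha * (1 - p_t)^gamma."""
    p_t = probs[true_class]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Toy predictions for two samples whose true class is 0.
easy = [0.95, 0.03, 0.02]  # confidently correct
hard = [0.30, 0.50, 0.20]  # misclassified

for name, probs in (("easy", easy), ("hard", hard)):
    print(name, cross_entropy(probs, 0), focal_loss(probs, 0))
```

With `gamma=2.0`, the easy sample keeps only 0.05² = 0.25% of its cross-entropy loss while the hard sample keeps 49%, so hard examples contribute relatively more to the gradient.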

Wide-resnet 28x10

| network | GPU | model size | batch size | epoch | loss function | training time | val_acc(%) |
|:--------|:---:|:----------:|:----------:|:-----:|:-------------:|:-------------:|:----------:|
| Wide-resnet 28x10 | GTX1080TI x 2 | 139M | 128 | 250 | crossentropy | 4 h 55 min | 96.50 |
| Wide-resnet 28x10 | GTX1080TI x 2 | 139M | 128 | 250 | focal_loss | 6 h 34 min | 95.50 |

DenseNet-160x24

| network | GPU | model size | batch size | epoch | loss function | training time | val_acc(%) |
|:--------|:---:|:----------:|:----------:|:-----:|:-------------:|:-------------:|:----------:|
| DenseNet-160x24 | GTX1080TI x 2 | 30.2M | 64 | 250 | crossentropy | 24 h 22 min | 95.70 |
| DenseNet-160x24 | GTX1080TI x 2 | 30.2M | 64 | 250 | focal_loss | 25 h 21 min | 95.60 |

ResNeXt-8x64d

| network | GPU | model size | batch size | epoch | loss function | training time | val_acc(%) |
|:--------|:---:|:----------:|:----------:|:-----:|:-------------:|:-------------:|:----------:|
| ResNeXt-8x64d | GTX1080TI x 2 | 142M | 120 | 250 | crossentropy | 26 h 07 min | 94.40 |
| ResNeXt-8x64d | GTX1080TI x 2 | 142M | 120 | 250 | focal_loss | 35 h 10 min | 94.60 |

As the tables above show, focal loss improves the accuracy of ResNeXt-8x64d but reduces the accuracy of the other models.

About Ensemble Methods

Voting

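    import numpy as np
    from scipy import stats

    # wresnet, densenet, resnext and senet are the trained Keras models
    models = [wresnet, densenet, resnext, senet]
    labels = []
    for m in models:
        # Predicted class index of every test image for this model
        predicts = np.argmax(m.predict(x_test), axis=1)
        labels.append(predicts)

    # Ensemble with voting: (n_models, n_samples) -> (n_samples, n_models)
    labels = np.array(labels)
    labels = np.transpose(labels, (1, 0))
    # Most frequent label among the models for each sample
    labels = stats.mode(labels, axis=-1)[0]
    labels = np.squeeze(labels)
    # y_test1 is assumed to hold the integer (not one-hot) test labels
    error = np.sum(np.not_equal(labels, y_test1)) / y_test1.shape[0]
    print('The accuracy on test : ', 1 - error)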
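
The voting step can be tried without any trained model. The following sketch uses plain Python and made-up predictions from three hypothetical models; ties are broken toward the smallest label, a choice made here only to keep the result deterministic.

```python
from collections import Counter


def majority_vote(per_model_labels):
    """Return the most frequent label per sample; ties go to the smallest label."""
    voted = []
    # zip(*...) regroups the labels so that each tuple holds one sample's votes.
    for sample_votes in zip(*per_model_labels):
        counts = Counter(sample_votes)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


def accuracy(pred, truth):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


# Made-up predictions for four samples.
model_a = [3, 5, 1, 0]
model_b = [3, 2, 1, 7]
model_c = [8, 5, 1, 4]
y_true = [3, 5, 1, 4]

voted = majority_vote([model_a, model_b, model_c])
print(voted, accuracy(voted, y_true))
```

The last sample shows the weakness of hard voting: when all models disagree, the vote carries no information and the tie-break decides the label.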

Weighted Mean

    import numpy as np
    from keras.models import Model

    # Build sub-models that expose the output of each network's 'dense_1' layer
    dense_layer_model1 = Model(inputs=wresnet.input,
                                         outputs=wresnet.get_layer('dense_1').output)
    dense_layer_model2 = Model(inputs=densenet.input,
                                         outputs=densenet.get_layer('dense_1').output)
    dense_layer_model3 = Model(inputs=resnext.input,
                                         outputs=resnext.get_layer('dense_1').output)

    dense_output1 = dense_layer_model1.predict(x_val)
    dense_output2 = dense_layer_model2.predict(x_val)
    dense_output3 = dense_layer_model3.predict(x_val)

    # Sentinel start values; any real error rate (at most 1) replaces them
    best_error = 888
    best_renpin1 = 666
    best_renpin2 = 999

    # Grid-search the first two weights on the validation set.
    # The third weight is 2 - renpin1 - renpin2, so the three weights sum to 2.
    # y_val1 and y_test1 are assumed to hold the integer (not one-hot) labels.
    weights = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    for renpin1 in weights:
        for renpin2 in weights:
            ams = renpin1 * dense_output1 + renpin2 * dense_output2 + (2 - renpin1 - renpin2) * dense_output3
            predicts = np.argmax(ams, axis=1)
            error = np.sum(np.not_equal(predicts, y_val1)) / y_val1.shape[0]
            print(" Accuracy: {} , renpin1: {} , renpin2: {}".format(1 - error, renpin1, renpin2))
            if error < best_error:
                best_error = error
                best_renpin1 = renpin1
                best_renpin2 = renpin2
    print("====================================================")
    print("Best accuracy: {} , renpin1:  {} , renpin2: {} ".format(1 - best_error, best_renpin1, best_renpin2))
    print("====================================================")

    # Apply the weights chosen on the validation set to the test set
    test_output1 = dense_layer_model1.predict(x_test)
    test_output2 = dense_layer_model2.predict(x_test)
    test_output3 = dense_layer_model3.predict(x_test)
    ams1 = best_renpin1 * test_output1 + best_renpin2 * test_output2 + (2 - best_renpin1 - best_renpin2) * test_output3
    predicts1 = np.argmax(ams1, axis=1)
    error1 = np.sum(np.not_equal(predicts1, y_test1)) / y_test1.shape[0]
    print("Accuracy on test: {} , renpin1:  {} , renpin2: {} ".format(1 - error1, best_renpin1, best_renpin2))
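
The same weight search can be illustrated on toy data. The sketch below uses plain Python, two hypothetical models and made-up validation scores; the weights here sum to 1 rather than 2, which does not change the argmax.

```python
def weighted_predict(outputs, weights):
    """Argmax of the weighted sum of per-model class scores, for each sample."""
    n_samples = len(outputs[0])
    n_classes = len(outputs[0][0])
    preds = []
    for i in range(n_samples):
        scores = [
            sum(w * out[i][c] for w, out in zip(weights, outputs))
            for c in range(n_classes)
        ]
        preds.append(scores.index(max(scores)))
    return preds


def accuracy(preds, truth):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


# Made-up class scores of two models on three validation samples.
model_a = [[0.6, 0.4], [0.45, 0.55], [0.2, 0.8]]
model_b = [[0.3, 0.7], [0.9, 0.1], [0.4, 0.6]]
y_val = [0, 0, 1]

best = None  # (accuracy, weight of model_a)
for step in range(11):
    w = step / 10
    acc = accuracy(weighted_predict([model_a, model_b], [w, 1 - w]), y_val)
    # Strict comparison keeps the first weight that reaches the best accuracy.
    if best is None or acc > best[0]:
        best = (acc, w)
print(best)
```

In this toy case each model alone gets two of three samples right, while a weight of 0.7 on `model_a` gets all three, which is the effect the grid search looks for.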

About Multiple GPUs Training

Since the latest version of Keras already supports keras.utils.multi_gpu_model, you can simply use the following code to train your model with multiple GPUs:

    from keras.utils import multi_gpu_model
    from keras.applications.resnet50 import ResNet50

    model = ResNet50()

    # Replicates `model` on 8 GPUs.
    parallel_model = multi_gpu_model(model, gpus=8)
    parallel_model.compile(loss='categorical_crossentropy', optimizer='adam')

    # This `fit` call will be distributed on 8 GPUs.
    # Since the batch size is 256, each GPU will process 32 samples.
    parallel_model.fit(x, y, epochs=20, batch_size=256)

About Transfer Learning

Keras Applications are deep learning models that are made available alongside pre-trained weights. These models can be used for prediction, feature extraction, and fine-tuning.

Fine-tune InceptionV3 on CIFAR-10

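    import time

    import keras
    import tensorflow as tf
    from keras import optimizers
    from keras.applications.inception_v3 import InceptionV3
    from keras.callbacks import TensorBoard
    from keras.layers import Dense, GlobalAveragePooling2D, Input
    from keras.models import Model
    from keras.preprocessing.image import ImageDataGenerator
    from keras.utils import multi_gpu_model

    # get_CIFAR10_data, num_classes, mean, std, sess, batch_size, iterations,
    # epochs and epochs1 are assumed to be defined elsewhere in the script.
    x_train, y_train, x_val, y_val, x_test, y_test = get_CIFAR10_data()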
    
    y_train = keras.utils.to_categorical(y_train, num_classes)
    y_val = keras.utils.to_categorical(y_val, num_classes)
    y_test  = keras.utils.to_categorical(y_test, num_classes)
    x_train = x_train.astype('float32')
    x_val = x_val.astype('float32')
    x_test  = x_test.astype('float32')
    
    # normalize each channel: subtract the mean, then divide by the std
    for i in range(3):
        x_train[:,:,:,i] = (x_train[:,:,:,i] - mean[i]) / std[i]
        x_test[:,:,:,i] = (x_test[:,:,:,i] - mean[i]) / std[i]
        x_val[:,:,:,i] = (x_val[:,:,:,i] - mean[i]) / std[i]


    print('Train data shape before: ', x_train.shape)
    print('Validation data shape before: ', x_val.shape)
    print('Test data shape before: ', x_test.shape)
    print('Type train: ', type(x_train))
    
    x_train = tf.image.resize_images(x_train, [96, 96], method=0).eval(session = sess)
    x_test = tf.image.resize_images(x_test, [96, 96], method=0).eval(session = sess)
    x_val = tf.image.resize_images(x_val, [96, 96], method=0).eval(session = sess)
    
    print('Train data shape: ', x_train.shape)
    print('Validation data shape: ', x_val.shape)
    print('Test data shape: ', x_test.shape)
    print('Type train: ', type(x_train))
    # setting input pic
    input_img = Input(shape=(96, 96, 3)) 
    # create the base pre-trained model
    base_model = InceptionV3(input_tensor=input_img, weights='imagenet', include_top=False)

    # add a global spatial average pooling layer
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    # let's add a fully-connected layer
    x = Dense(1024, activation='relu')(x)
    # and a softmax layer for the 10 CIFAR-10 classes
    predictions = Dense(10, activation='softmax')(x)

    # this is the model we will train
    model = Model(inputs=base_model.input, outputs=predictions)

    print(model.summary())

    # first: train only the top layers (which were randomly initialized)
    # i.e. freeze all convolutional InceptionV3 layers
    for layer in base_model.layers:
        layer.trainable = False

    # set optimizer
    parallel_model = multi_gpu_model(model, gpus=2)
    parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
   

    # set callback
    tb_cb     = TensorBoard(log_dir='./Inception/', histogram_freq=0)                                   # tensorboard log
    # change_lr = LearningRateScheduler(scheduler)                                                    # learning rate scheduler
    # ckpt      = ModelCheckpoint('./ckpt_inception.h5', save_best_only=True, mode='auto', period=1)    # checkpoint 
    cbks      = [tb_cb]                   

    # set data augmentation
    print('Using real-time data augmentation.')

    datagen   = ImageDataGenerator(horizontal_flip=True,
            width_shift_range=0.125,height_shift_range=0.125,fill_mode='reflect')
    datagen.fit(x_train)

    # start training
    start = time.time()
    parallel_model.fit_generator(datagen.flow(x_train, y_train,batch_size=batch_size), steps_per_epoch=iterations, epochs=epochs, callbacks=cbks,validation_data=(x_val, y_val))

    loss, accuracy = parallel_model.evaluate(x_test,y_test)
    print('\ntest loss',loss)
    print('accuracy',accuracy)
    end = time.time()
    print('transfer learning time',end-start)  
    model.save('transfer_inceptionV3.h5')


    # let's visualize layer names and layer indices to see how many layers
    # we should freeze:
    for i, layer in enumerate(base_model.layers):
       print(i, layer.name)

    # we chose to train the top 2 inception blocks, i.e. we will freeze
    # the first 249 layers and unfreeze the rest:
    for layer in model.layers[:249]:
       layer.trainable = False
    for layer in model.layers[249:]:
       layer.trainable = True

    # set optimizer
    # we need to recompile the model for these modifications to take effect
    # we use SGD with a low learning rate

    sgd = optimizers.SGD(lr=0.0001, momentum=0.9, nesterov=True)
    parallel_model = multi_gpu_model(model, gpus=2)
    parallel_model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
   

    # set callback
    tb_cb     = TensorBoard(log_dir='./Inception_finetune/', histogram_freq=0)                                   # tensorboard log
    cbks      = [tb_cb]                   

    # set data augmentation
    print('Using real-time data augmentation.')

    datagen   = ImageDataGenerator(horizontal_flip=True,
            width_shift_range=0.125,height_shift_range=0.125,fill_mode='reflect')
    datagen.fit(x_train)

    # start training
    start = time.time()
    parallel_model.fit_generator(datagen.flow(x_train, y_train,batch_size=batch_size), steps_per_epoch=iterations, epochs=epochs1, callbacks=cbks,validation_data=(x_val, y_val))

    loss, accuracy = parallel_model.evaluate(x_test,y_test)
    print('\ntest loss',loss)
    print('accuracy',accuracy)
    end = time.time()
    print('fine tune time',end-start)  
    model.save('finetune_inceptionV3.h5')

Because the input size for InceptionV3 should be no smaller than 75×75, I resize the pictures from 32×32 to 96×96.

| Models | test_acc(%) |
|:-------|:-----------:|
| VGG19 (Pre-trained on ImageNet) | 75.24 |
| InceptionV3 (Pre-trained on ImageNet) | No time to run |

Frankly, I didn’t get good results because I didn’t have enough time to fine tune.

About Hard Examples

Here are some images that our model does not correctly predict.

cifar10

About Cutout & AutoAugment

Contributors

Please feel free to contact me if you have any questions!