keras_ensemble_cifar10

This repository is supported by Huawei (HCNA-AI Certification Course) and Student Innovation Center of SJTU.
Thanks to the teachers for their contributions.

cifar10

I use Keras and TensorFlow to implement all of these models and run some ensemble experiments based on BIGBALLON’s work.

Requirements

Python (3.5)
keras (>= 2.1.5)
tensorflow-gpu (>= 1.4.1)

Architectures and papers

VGG19 Network
- Very Deep Convolutional Networks for Large-Scale Image Recognition
- 1st place in the ILSVRC 2014 localization task
- 2nd place in the ILSVRC 2014 classification task
Residual Network
- Deep Residual Learning for Image Recognition
- Identity Mappings in Deep Residual Networks
- CVPR 2016 Best Paper Award
- 1st place in all five main tracks:
  - ILSVRC 2015 Classification: “Ultra-deep” 152-layer nets
  - ILSVRC 2015 Detection: 16% better than 2nd
  - ILSVRC 2015 Localization: 27% better than 2nd
  - COCO Detection: 11% better than 2nd
  - COCO Segmentation: 12% better than 2nd
Wide Residual Network
- Wide Residual Networks
ResNeXt
- Aggregated Residual Transformations for Deep Neural Networks
- Used in Mask-RCNN
DenseNet
- Densely Connected Convolutional Networks
- CVPR 2017 Best Paper Award
SENet
- Squeeze-and-Excitation Networks
- 1st place in the ILSVRC 2017 classification task

Documents & tutorials

You can also read the articles if you can read Chinese.

Accuracy of all single models

In particular:
Change the batch size according to your GPU’s memory.
Modifying the learning rate schedule may improve the accuracy.
Thanks to the GPU computing platform (100 GTX 1080 Ti GPUs) provided by the Student Innovation Center.
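As a rough illustration of the learning rate schedule mentioned above, here is a minimal step-decay sketch in the form that Keras's `LearningRateScheduler` callback expects (a function from epoch index to learning rate). The boundaries and rates are hypothetical values chosen for a 250-epoch run, not the settings used for the results in this README.

```python
def scheduler(epoch):
    """Step decay: map a 0-based epoch index to a learning rate.

    The boundaries (100, 175, 225) and the rates are illustrative assumptions.
    """
    if epoch < 100:
        return 0.1
    if epoch < 175:
        return 0.01
    if epoch < 225:
        return 0.001
    return 0.0001


# Would be passed to training as: LearningRateScheduler(scheduler)
for epoch in (0, 100, 175, 249):
    print(epoch, scheduler(epoch))
```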

network	GPU	model size	batch size	epoch	loss function	training time	val_acc(%)
Wide-resnet 28x10	GTX1080TI x 2	139M	128	250	crossentropy	4 h 55 min	96.50
Wide-resnet 28x10	GTX1080TI x 2	139M	128	250	focal_loss	6 h 34 min	95.50
DenseNet-160x24	GTX1080TI x 2	30.2M	64	250	crossentropy	24 h 22 min	95.70
DenseNet-160x24	GTX1080TI x 2	30.2M	64	250	focal_loss	25 h 21 min	95.60
ResNeXt-8x64d	GTX1080TI x 2	142M	120	250	crossentropy	26 h 07 min	94.40
ResNeXt-8x64d	GTX1080TI x 2	142M	120	250	focal_loss	35 h 10 min	94.60
SENet(ResNeXt-4x64d)	GTX1080TI x 2	80.2M	120	250	crossentropy	25 h 38 min	94.27

I didn’t calculate the accuracy on the test set for the single models. As the author of Keras said, every time you use feedback from your validation process to tune your model, you leak information about the validation process into the model. Repeated just a few times, this is innocuous; but done systematically over many iterations, it will eventually cause your model to overfit to the validation process (even though no model is directly trained on any of the validation data). This makes the evaluation process less reliable.
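The protocol described above (tune on a validation split, touch the test set only at the very end) depends on a held-out validation set. A minimal sketch of carving one out of the training indices follows; the 45,000/5,000 split and the `split_indices` helper are assumptions for illustration, not necessarily what this repository's data loader does.

```python
import random


def split_indices(n_samples, n_val, seed=0):
    """Shuffle indices 0..n_samples-1 and hold out n_val of them for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_val:], indices[:n_val]


# CIFAR-10 ships 50,000 training images; holding out 5,000 is a hypothetical choice.
train_idx, val_idx = split_indices(50000, 5000)
print(len(train_idx), len(val_idx))  # 45000 5000
```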

Accuracy of all ensemble models

In particular:
I first tune on the validation set to determine the ensemble parameters, then report the accuracy on the test set.

Voting

Models	test_acc(%)
DenseNet-160x24 + Wide-ResNet 28x10	96.10
DenseNet-160x24 + Wide-ResNet 28x10 + SENet(ResNeXt-4x64d)	96.38
DenseNet-160x24 + Wide-ResNet 28x10 + ResNeXt-29(8x64d) with focal loss + SENet(ResNeXt-4x64d)	96.38
DenseNet-160x24 + Wide-ResNet 28x10 + ResNeXt-29(8x64d) with focal loss	96.52

Weighted Mean

Models	test_acc(%)
0.6×Wide-ResNet 28x10 + 0.4×DenseNet-160x24	96.38
0.8×Wide-ResNet 28x10 + 0.8×DenseNet-160x24 + 0.4×ResNeXt-29(8x64d) with focal loss	96.53
0.9×Wide-ResNet 28x10 + 0.9×DenseNet-160x24 + 0.2×SENet(ResNeXt-4x64d)	96.47
Wide-ResNet 28x10 + DenseNet-160x24 + ResNeXt-29(8x64d) with focal loss + 0×SENet(ResNeXt-4x64d)	96.15

About Focal Loss and Cross Entropy

Reference to paper: Focal Loss for Dense Object Detection
Code: multi-class focal loss implemented in Keras

In addition to addressing an extreme imbalance between positive and negative samples, focal loss also keeps easy examples from dominating the loss. That’s why I ran the following experiment.
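To make the effect concrete: for the predicted probability p_t of the true class, cross entropy is -log(p_t), while the focal loss from the referenced paper is -α (1 - p_t)^γ log(p_t). The sketch below compares the two for a single sample in plain Python. γ = 2 is the default suggested in the paper and α = 1 is an assumption made here for simplicity; neither is necessarily the value used in these experiments.

```python
import math


def cross_entropy(probs, label):
    """Cross entropy for one sample: -log(p_t)."""
    return -math.log(probs[label])


def focal_loss(probs, label, gamma=2.0, alpha=1.0):
    """Multi-class focal loss for one sample: -alpha * (1 - p_t)**gamma * log(p_t)."""
    p_t = probs[label]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = [0.9, 0.05, 0.05]   # confident and correct (true class is 0)
hard = [0.1, 0.6, 0.3]     # wrong (true class is 0)

for name, probs in (("easy", easy), ("hard", hard)):
    print(name, cross_entropy(probs, 0), focal_loss(probs, 0))
```

With γ = 2 the easy example's loss is scaled by (1 - 0.9)^2 = 0.01 and the hard example's by (1 - 0.1)^2 = 0.81, so easy examples contribute far less to the total loss.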

Wide-resnet 28x10

network	GPU	model size	batch size	epoch	loss function	training time	val_acc(%)
Wide-resnet 28x10	GTX1080TI x 2	139M	128	250	crossentropy	4 h 55 min	96.50
Wide-resnet 28x10	GTX1080TI x 2	139M	128	250	focal_loss	6 h 34 min	95.50

DenseNet-160x24

network	GPU	model size	batch size	epoch	loss function	training time	val_acc(%)
DenseNet-160x24	GTX1080TI x 2	30.2M	64	250	crossentropy	24 h 22 min	95.70
DenseNet-160x24	GTX1080TI x 2	30.2M	64	250	focal_loss	25 h 21 min	95.60

ResNeXt-8x64d

network	GPU	model size	batch size	epoch	loss function	training time	val_acc(%)
ResNeXt-8x64d	GTX1080TI x 2	142M	120	250	crossentropy	26 h 07 min	94.40
ResNeXt-8x64d	GTX1080TI x 2	142M	120	250	focal_loss	35 h 10 min	94.60

As the tables above show, focal loss improves the accuracy of ResNeXt-8x64d (from 94.40% to 94.60%) but reduces the accuracy of the other two models.

About Ensemble Methods

Voting

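    import numpy as np
    from scipy import stats

    # Trained Keras models; each predict() call returns class probabilities
    models = [wresnet, densenet, resnext, senet]
    labels = []
    for m in models:
        # Hard label predicted by this model for every test image
        predicts = np.argmax(m.predict(x_test), axis=1)
        labels.append(predicts)

    # Ensemble with voting: take the most frequent label per image
    labels = np.array(labels)                 # shape (n_models, n_samples)
    labels = np.transpose(labels, (1, 0))     # shape (n_samples, n_models)
    labels = stats.mode(labels, axis=-1)[0]
    labels = np.squeeze(labels)
    # y_test1 is assumed to hold the integer (not one-hot) test labels
    error = np.sum(np.not_equal(labels, y_test1)) / y_test1.shape[0]
    print('The accuracy on test : ', 1 - error)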
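For intuition, the same majority vote can be traced by hand on a toy example without any trained models. This plain-Python sketch uses made-up predictions; ties go to the smallest label, which matches how `scipy.stats.mode` resolves them.

```python
from collections import Counter


def majority_vote(per_model_labels):
    """per_model_labels[m][i] is model m's predicted label for sample i."""
    voted = []
    for votes in zip(*per_model_labels):
        counts = Counter(votes)
        top = max(counts.values())
        # Break ties by taking the smallest label among the most frequent ones
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Hypothetical predictions from three models on four images
preds = [
    [3, 5, 1, 9],
    [3, 2, 1, 8],
    [4, 2, 0, 7],
]
print(majority_vote(preds))  # [3, 2, 1, 7]
```

The last image shows the weakness of hard voting: when all models disagree, the result is decided by the tie-break rule rather than by model confidence.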

Weighted Mean

    import numpy as np
    from keras.models import Model

    # Build sub-models that expose the output of each network's 'dense_1' layer
    dense_layer_model1 = Model(inputs=wresnet.input,
                                         outputs=wresnet.get_layer('dense_1').output)
    dense_layer_model2 = Model(inputs=densenet.input,
                                         outputs=densenet.get_layer('dense_1').output)
    dense_layer_model3 = Model(inputs=resnext.input,
                                         outputs=resnext.get_layer('dense_1').output)

    dense_output1 = dense_layer_model1.predict(x_val)
    dense_output2 = dense_layer_model2.predict(x_val)
    dense_output3 = dense_layer_model3.predict(x_val)

    # Sentinel start values; any real error rate (at most 1) replaces them
    best_error = 888
    best_renpin1 = 666
    best_renpin2 = 999

    # Grid-search the ensemble weights on the validation set (the three weights sum to 2).
    # y_val1 and y_test1 are assumed to hold integer (not one-hot) labels.
    for renpin1 in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
        for renpin2 in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
            ams = renpin1 * dense_output1 + renpin2 * dense_output2 + (2 - renpin1 - renpin2) * dense_output3
            predicts = np.argmax(ams, axis=1)
            error = np.sum(np.not_equal(predicts, y_val1)) / y_val1.shape[0]
            print(" Accuracy: {} , renpin1: {} , renpin2: {}".format(1 - error, renpin1, renpin2))
            if error < best_error:
                best_error = error
                best_renpin1 = renpin1
                best_renpin2 = renpin2
    print("====================================================")
    print("Best accuracy: {} , renpin1:  {} , renpin2: {} ".format(1 - best_error, best_renpin1, best_renpin2))
    print("====================================================")

    # Apply the best validation weights to the test set once
    test_output1 = dense_layer_model1.predict(x_test)
    test_output2 = dense_layer_model2.predict(x_test)
    test_output3 = dense_layer_model3.predict(x_test)
    ams1 = best_renpin1 * test_output1 + best_renpin2 * test_output2 + (2 - best_renpin1 - best_renpin2) * test_output3
    predicts1 = np.argmax(ams1, axis=1)
    error1 = np.sum(np.not_equal(predicts1, y_test1)) / y_test1.shape[0]
    print("Accuracy on test: {} , renpin1:  {} , renpin2: {} ".format(1 - error1, best_renpin1, best_renpin2))
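The grid search above can be exercised on toy data to see what it does. The sketch below is a plain-Python variant under two assumptions that differ from the code above: the weights are normalized to sum to 1 (only their ratios affect the argmax), and the grid is enumerated with integers to avoid floating-point drift. The probabilities and labels are made up.

```python
from itertools import product


def weighted_predict(outputs, weights):
    """outputs[m][i][c] is model m's score for class c on sample i."""
    n_samples = len(outputs[0])
    n_classes = len(outputs[0][0])
    preds = []
    for i in range(n_samples):
        scores = [sum(w * out[i][c] for w, out in zip(weights, outputs))
                  for c in range(n_classes)]
        preds.append(max(range(n_classes), key=scores.__getitem__))
    return preds


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def grid_search(outputs, labels, steps=10):
    """Try every weight triple (a, b, c) / steps with a + b + c == steps."""
    best_acc, best_weights = -1.0, None
    for a, b in product(range(steps + 1), repeat=2):
        if a + b > steps:
            continue
        weights = (a / steps, b / steps, (steps - a - b) / steps)
        acc = accuracy(weighted_predict(outputs, weights), labels)
        if acc > best_acc:
            best_acc, best_weights = acc, weights
    return best_weights, best_acc


# Made-up validation outputs: 3 models, 3 samples, 2 classes
val_outputs = [
    [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4]],
    [[0.2, 0.8], [0.3, 0.7], [0.1, 0.9]],
    [[0.7, 0.3], [0.8, 0.2], [0.3, 0.7]],
]
val_labels = [0, 1, 1]
print(grid_search(val_outputs, val_labels))
```

In this toy setup each model alone gets two of the three samples right, while a suitable weighting gets all three, which is the effect the weighted mean is meant to capture.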

About Multiple GPUs Training

Recent versions of Keras provide keras.utils.multi_gpu_model, so you can train your model on multiple GPUs with the following code:

    from keras.utils import multi_gpu_model
    from keras.applications.resnet50 import ResNet50

    model = ResNet50()

    # Replicates `model` on 8 GPUs.
    parallel_model = multi_gpu_model(model, gpus=8)
    parallel_model.compile(loss='categorical_crossentropy', optimizer='adam')

    # This `fit` call will be distributed on 8 GPUs.
    # Since the batch size is 256, each GPU will process 32 samples.
    parallel_model.fit(x, y, epochs=20, batch_size=256)
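The arithmetic in the comment above (256 samples over 8 GPUs gives 32 each) generalizes as sketched below. The handling of a batch that does not divide evenly, with the last replica taking the remainder, is an assumption for illustration; check the behavior of the Keras version you use.

```python
def split_batch(batch_size, n_gpus):
    """Return the sub-batch size each GPU replica would process."""
    step = batch_size // n_gpus
    sizes = [step] * (n_gpus - 1)
    # Assumption: the last replica takes whatever is left over
    sizes.append(batch_size - step * (n_gpus - 1))
    return sizes


print(split_batch(256, 8))  # [32, 32, 32, 32, 32, 32, 32, 32]
print(split_batch(128, 2))  # [64, 64]
```

The second call corresponds to the Wide-resnet setup in the tables above (batch size 128 on two GPUs).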

About Transfer Learning

Keras Applications are deep learning models that are made available alongside pre-trained weights. These models can be used for prediction, feature extraction, and fine-tuning.

Fine-tune InceptionV3 on CIFAR-10

    x_train, y_train, x_val, y_val, x_test, y_test = get_CIFAR10_data()
    
    y_train = keras.utils.to_categorical(y_train, num_classes)
    y_val = keras.utils.to_categorical(y_val, num_classes)
    y_test  = keras.utils.to_categorical(y_test, num_classes)
    x_train = x_train.astype('float32')
    x_val = x_val.astype('float32')
    x_test  = x_test.astype('float32')
    
    # - mean / std
    for i in range(3):
        x_train[:,:,:,i] = (x_train[:,:,:,i] - mean[i]) / std[i]
        x_test[:,:,:,i] = (x_test[:,:,:,i] - mean[i]) / std[i]
        x_val[:,:,:,i] = (x_val[:,:,:,i] - mean[i]) / std[i]


    print('Train data shape before: ', x_train.shape)
    print('Validation data shape before: ', x_val.shape)
    print('Test data shape before: ', x_test.shape)
    print('Type train: ', type(x_train))
    
    x_train = tf.image.resize_images(x_train, [96, 96], method=0).eval(session = sess)
    x_test = tf.image.resize_images(x_test, [96, 96], method=0).eval(session = sess)
    x_val = tf.image.resize_images(x_val, [96, 96], method=0).eval(session = sess)
    
    print('Train data shape: ', x_train.shape)
    print('Validation data shape: ', x_val.shape)
    print('Test data shape: ', x_test.shape)
    print('Type train: ', type(x_train))
    # setting input pic
    input_img = Input(shape=(96, 96, 3)) 
    # create the base pre-trained model
    base_model = InceptionV3(input_tensor=input_img, weights='imagenet', include_top=False)

    # add a global spatial average pooling layer
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    # let's add a fully-connected layer
    x = Dense(1024, activation='relu')(x)
    # and a softmax classification layer for the 10 CIFAR-10 classes
    predictions = Dense(10, activation='softmax')(x)

    # this is the model we will train
    model = Model(inputs=base_model.input, outputs=predictions)

    print(model.summary())

    # first: train only the top layers (which were randomly initialized)
    # i.e. freeze all convolutional InceptionV3 layers
    for layer in base_model.layers:
        layer.trainable = False

    # set optimizer
    parallel_model = multi_gpu_model(model, gpus=2)
    parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
   

    # set callback
    tb_cb     = TensorBoard(log_dir='./Inception/', histogram_freq=0)                                   # tensorboard log
    # change_lr = LearningRateScheduler(scheduler)                                                    # learning rate scheduler
    # ckpt      = ModelCheckpoint('./ckpt_inception.h5', save_best_only=True, mode='auto', period=1)    # checkpoint 
    cbks      = [tb_cb]                   

    # set data augmentation
    print('Using real-time data augmentation.')

    datagen   = ImageDataGenerator(horizontal_flip=True,
            width_shift_range=0.125,height_shift_range=0.125,fill_mode='reflect')
    datagen.fit(x_train)

    # start training
    start = time.time()
    parallel_model.fit_generator(datagen.flow(x_train, y_train,batch_size=batch_size), steps_per_epoch=iterations, epochs=epochs, callbacks=cbks,validation_data=(x_val, y_val))

    loss, accuracy = parallel_model.evaluate(x_test,y_test)
    print('\ntest loss',loss)
    print('accuracy',accuracy)
    end = time.time()
    print('transfer learning time',end-start)  
    model.save('transfer_inceptionV3.h5')


    # let's visualize layer names and layer indices to see how many layers
    # we should freeze:
    for i, layer in enumerate(base_model.layers):
       print(i, layer.name)

    # we chose to train the top 2 inception blocks, i.e. we will freeze
    # the first 249 layers and unfreeze the rest:
    for layer in model.layers[:249]:
       layer.trainable = False
    for layer in model.layers[249:]:
       layer.trainable = True

    # set optimizer
    # we need to recompile the model for these modifications to take effect
    # we use SGD with a low learning rate

    sgd = optimizers.SGD(lr=0.0001, momentum=0.9, nesterov=True)
    parallel_model = multi_gpu_model(model, gpus=2)
    parallel_model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
   

    # set callback
    tb_cb     = TensorBoard(log_dir='./Inception_finetune/', histogram_freq=0)                                   # tensorboard log
    cbks      = [tb_cb]                   

    # set data augmentation
    print('Using real-time data augmentation.')

    datagen   = ImageDataGenerator(horizontal_flip=True,
            width_shift_range=0.125,height_shift_range=0.125,fill_mode='reflect')
    datagen.fit(x_train)

    # start training
    start = time.time()
    parallel_model.fit_generator(datagen.flow(x_train, y_train,batch_size=batch_size), steps_per_epoch=iterations, epochs=epochs1, callbacks=cbks,validation_data=(x_val, y_val))

    loss, accuracy = parallel_model.evaluate(x_test,y_test)
    print('\ntest loss',loss)
    print('accuracy',accuracy)
    end = time.time()
    print('fine tune time',end-start)  
    model.save('finetune_inceptionV3.h5')

Because the input size for InceptionV3 should be no smaller than 75×75, I resize the pictures from 32×32 to 96×96.

Models	test_acc(%)
VGG19 (Pre-trained on ImageNet)	75.24
InceptionV3 (Pre-trained on ImageNet)	No time to run

Frankly, I didn’t get good results because I didn’t have enough time to fine tune.

About Hard Examples

Here are some images that our model does not correctly predict.

cifar10

About Cutout & AutoAugment

Model of the second-place team (Test acc: 97.1%)
- Reference to paper: Improved Regularization of Convolutional Neural Networks with Cutout
- Code: Cutout (PyTorch)
Model of the first-place team (Test acc: 97.7%)
- Reference to paper: AutoAugment: Learning Augmentation Policies from Data
- Code: AutoAugment (TensorFlow)

Contributors

Please feel free to contact me if you have any questions!

3.47% on CIFAR-10
