
Our experiments showed that self-training with Noisy Student and EfficientNet can achieve an accuracy of 87.4%, which is 1.9% higher than without Noisy Student. In contrast, changing architectures or training with weakly labeled data gives only modest gains in accuracy, from 4.7% to 16.6%. On robustness test sets, the method improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2. These significant gains in robustness on ImageNet-C and ImageNet-P are surprising because our models were not deliberately optimized for robustness (e.g., via data augmentation). EfficientNet with Noisy Student also produces correct top-1 predictions on the qualitative examples shown in the figures. The total gain of 2.4% comes from two sources: making the model larger (+0.5%) and Noisy Student (+1.9%). Models are available at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.

However, state-of-the-art vision models are still trained with supervised learning, which requires a large corpus of labeled images to work well. Also related to our work is Data Distillation [52], which ensembled predictions for an image under different transformations to teach a student network. Works based on pseudo labels [37, 31, 60, 1] are similar to self-training, but they suffer the same problem as consistency training, since they rely on a model that is still being trained, rather than a converged model with high accuracy, to generate pseudo labels. Work on the train-test resolution discrepancy (FixRes) experimentally validated that, for a target test resolution, using a lower train resolution offers better classification at test time, and proposed a simple yet effective strategy to optimize classifier performance when the train and test resolutions differ. Lastly, we follow the idea of compound scaling [69] and scale all dimensions to obtain EfficientNet-L2.

For the ablation on label types, we use EfficientNet-B0 as both the teacher model and the student model and compare Noisy Student with soft pseudo labels against hard pseudo labels. Since a teacher model's confidence on an image can be a good indicator of whether it is an out-of-domain image, we consider the high-confidence images as in-domain images and the low-confidence images as out-of-domain images. During the generation of the pseudo labels, the teacher is not noised so that the pseudo labels are as accurate as possible; this step is sketched below.
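As a concrete illustration of this pseudo-labeling step, here is a minimal PyTorch-style sketch; it is not the paper's TensorFlow code, and the model object and data loader are assumed to exist.

```python
# Sketch of pseudo-label generation: the teacher is not noised (eval mode),
# and its confidence is recorded as an in-domain indicator for later filtering.
import torch

@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_loader):
    teacher.eval()  # disables dropout / stochastic depth, so labels are as accurate as possible
    all_probs, all_confidences = [], []
    for images in unlabeled_loader:
        probs = torch.softmax(teacher(images), dim=-1)    # soft pseudo labels
        all_probs.append(probs)
        all_confidences.append(probs.max(dim=-1).values)  # high confidence ~ in-domain
    return torch.cat(all_probs), torch.cat(all_confidences)
```

The soft distributions can later be converted to hard labels with an argmax; the confidence values feed the filtering step described further down.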
We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant. It extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. Compared to consistency training [45, 5, 74], the self-training / teacher-student framework is better suited for ImageNet because we can train a good teacher on ImageNet using labeled data. To achieve this result, we first train an EfficientNet model on labeled ImageNet images and use it as a teacher to generate pseudo labels on 300M unlabeled images. The abundance of data on the internet is vast, and for this purpose we use a much larger corpus of unlabeled images, where some images may not belong to any category in ImageNet. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment into the student so that the student generalizes better than the teacher. Finally, we iterate the process by putting the student back as a teacher to generate new pseudo labels and train a new student.

However, an important requirement for Noisy Student to work well is that the student model needs to be sufficiently large to fit more data (labeled and pseudo labeled). We also list EfficientNet-B7 as a reference. For the robustness evaluations, the top-1 and top-5 accuracy are measured on the 200 classes that ImageNet-A includes, and the score is normalized by AlexNet's error rate so that corruptions with different difficulties lead to scores of a similar scale. Qualitatively, with Noisy Student the model correctly predicts dragonfly for the example image. The pseudo labels given to the student can be soft or hard: with out-of-domain unlabeled images, hard pseudo labels can hurt performance, while soft pseudo labels lead to robust performance. A sketch of the noised student update with both label types follows.
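The sketch below illustrates one student update with input noise and soft or hard pseudo labels, in the same hedged PyTorch style as above; the `randaug` transform and the assumption that `student` was built with dropout and stochastic depth enabled are ours, not taken from the paper's code.

```python
# One student training step with noise and soft or hard pseudo labels.
import torch.nn.functional as F

def student_step(student, optimizer, randaug, images, pseudo_probs, soft=True):
    student.train()                    # keep dropout / stochastic depth active (model noise)
    logits = student(randaug(images))  # input noise via RandAugment-style augmentation
    if soft:
        # soft labels: cross entropy against the teacher's full distribution
        loss = -(pseudo_probs * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    else:
        # hard labels: argmax of the teacher's distribution
        loss = F.cross_entropy(logits, pseudo_probs.argmax(dim=-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```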
Deep learning has shown remarkable successes in image recognition in recent years [35, 66, 62, 23, 69]. Here we introduce "Noisy Student Training" (2020), a state-of-the-art model. The idea is to extend self-training and distillation: by adding three kinds of noise and distilling multiple times, the student model ends up generalizing better than the teacher model. The algorithm is basically self-training, a method in semi-supervised learning. Self-training first uses labeled data to train a good teacher model, then uses the teacher model to label unlabeled data, and finally uses the labeled and unlabeled data to jointly train a student model. In Noisy Student, we combine these two steps into one because it simplifies the algorithm and leads to better performance in our preliminary experiments. When dropout and stochastic depth are used, the teacher model behaves like an ensemble of models (dropout is not used when it generates the pseudo labels), whereas the student behaves like a single model.

For RandAugment, we apply two random operations with the magnitude set to 27. We duplicate images in classes where there are not enough images. EfficientNet-L1 approximately doubles the training time of EfficientNet-L0. In the experiments above, iterative training was used to optimize the accuracy of EfficientNet-L2, but here we skip it as it is difficult to use iterative training for many experiments.

The ImageNet-A test set [25] consists of difficult images that cause significant drops in accuracy for state-of-the-art models; we used the version from [47], which filtered the validation set of ImageNet. The FGSM attack we evaluate against performs one gradient step on the input image [20], with the update on each pixel set to ε.

The paper is: Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le, "Self-Training With Noisy Student Improves ImageNet Classification," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10687-10698, 2020 (https://arxiv.org/abs/1911.04252). The repository provides the architecture specifications for the EfficientNets used in the paper, scripts for the ImageNet experiments, and similar scripts to run predictions on unlabeled data, filter and balance the data, and train using the filtered data; it also shows an implementation of Noisy Student Training on SVHN. Our data procedure went as follows: run the teacher over the unlabeled data, filter and balance the resulting set, and then train the student on the filtered data. A sketch of the filtering and balancing step is given below.
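A rough, framework-agnostic sketch of the filtering and balancing step follows; the data layout (a list of `(image_path, probs)` pairs) and the exact duplication rule are assumptions, while the 0.3 confidence threshold and the 130K per-class cap are the numbers quoted later in this article.

```python
# Filter low-confidence pseudo-labeled images, then balance classes by
# capping large classes and duplicating images in small classes.
import collections
import random

def filter_and_balance(pseudo_labeled, threshold=0.3, max_per_class=130_000):
    per_class = collections.defaultdict(list)
    for path, probs in pseudo_labeled:
        confidence = max(probs)
        if confidence >= threshold:                  # drop low-confidence (out-of-domain) images
            per_class[probs.index(confidence)].append((path, probs))

    balanced = []
    for examples in per_class.values():
        examples.sort(key=lambda ex: max(ex[1]), reverse=True)
        examples = examples[:max_per_class]          # keep only the most confident images
        while len(examples) < max_per_class:         # duplicate images in small classes
            examples.append(random.choice(examples))
        balanced.extend(examples)
    return balanced
```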
Hence, a question that naturally arises is why the student can outperform the teacher with soft pseudo labels. While removing noise leads to a much lower training loss for labeled images, we observe that, for unlabeled images, removing noise leads to a smaller drop in training loss. We hypothesize that the improvement can be attributed to SGD, which introduces stochasticity into the training process. Different kinds of noise, however, may have different effects. The main difference between Data Distillation and our method is that we use the noise to weaken the student, which is the opposite of their approach of strengthening the teacher by ensembling.

We use EfficientNets [69] as our baseline models because they provide better capacity for more data. For a small student model, using our best model Noisy Student (EfficientNet-L2) as the teacher leads to more improvement than using the same model as the teacher, which shows that it is helpful to push the performance with our method when small models are needed for deployment. We obtain unlabeled images from the JFT dataset [26, 11], which has around 300M images. As can be seen from Table 8, the performance stays similar when we reduce the data to 1/16 of the total, which amounts to 8.1M images after duplicating; the performance drops when we reduce the data further. The mapping from the 200 ImageNet-A classes to the original ImageNet classes is available online at https://github.com/hendrycks/natural-adv-examples/blob/master/eval.py.
Since we use soft pseudo labels generated from the teacher model, when the student is trained to be exactly the same as the teacher model, the cross entropy loss on unlabeled data would be zero and the training signal would vanish. We call the method self-training with Noisy Student to emphasize the role that noise plays in the method and results. An important contribution of our work was to show that Noisy Student can potentially help address the lack of robustness in computer vision models. For example, without Noisy Student the model predicts bullfrog for the image shown on the left of the second row of the figure, which might result from the black lotus leaf on the water. mFR (mean flip rate) is the weighted average of flip probability over different perturbations, with AlexNet's flip probability as a baseline. This is why "Self-training with Noisy Student improves ImageNet classification" by Qizhe Xie et al. makes me very happy: it implements semi-supervised learning with noise to build an image classifier.

Noisy Student Training is a semi-supervised learning method that achieves 88.4% top-1 accuracy on ImageNet (state of the art) together with surprising gains on robustness and adversarial benchmarks; this is 2.0% better than the previous state-of-the-art model, which requires 3.5B weakly labeled Instagram images. The baseline model achieves an accuracy of 83.2%. By comparison, the additional hyperparameters introduced by the ramp-up schedules and entropy minimization of other semi-supervised methods make them more difficult to use at scale. Code is available at https://github.com/google-research/noisystudent.

To achieve strong results on ImageNet, the student model also needs to be large, typically larger than common vision models, so that it can leverage a large number of unlabeled images. In all previous experiments, the student's capacity is as large as or larger than the capacity of the teacher model, and during this process we kept increasing the size of the student model to improve the performance. Whether the model benefits from more unlabeled data depends on the capacity of the model: a small model can easily saturate, while a larger model can benefit from more data, and Noisy Student's performance indeed improves with more unlabeled data. Even with 130M unlabeled images and the noise function removed, the performance still improves to 84.3% from the 84.0% supervised baseline. First, a teacher model is trained in a supervised fashion; we then train a larger EfficientNet as a student model on the combination of labeled and pseudo-labeled images, as sketched below.
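Training on the combination of labeled and pseudo-labeled images can be pictured with the following hedged PyTorch-style sketch; the two loaders, the 1:1 batch mixing, and the unweighted sum of the two losses are illustrative assumptions rather than the paper's exact recipe.

```python
# Jointly train the noised student on labeled and pseudo-labeled batches.
import itertools
import torch.nn.functional as F

def train_epoch(student, optimizer, labeled_loader, pseudo_loader, randaug):
    student.train()  # dropout and stochastic depth stay on (model noise)
    for (x_l, y_l), (x_u, p_u) in zip(labeled_loader, itertools.cycle(pseudo_loader)):
        loss_labeled = F.cross_entropy(student(randaug(x_l)), y_l)
        log_probs_u = F.log_softmax(student(randaug(x_u)), dim=-1)
        loss_pseudo = -(p_u * log_probs_u).sum(dim=-1).mean()  # soft pseudo labels
        loss = loss_labeled + loss_pseudo
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```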
Noisy Student Training is based on the self-training framework and is trained with four simple steps: (1) train a classifier on labeled data (the teacher); (2) infer labels on a much larger unlabeled dataset; (3) train a larger classifier (the student) on the combination of labeled and pseudo-labeled images, adding noise to the student; and (4) iterate by putting the student back as the teacher. To noise the student, we use dropout [63], data augmentation [14] and stochastic depth [29] during its training. When preparing the unlabeled data, we select images whose label confidence is higher than 0.3, and for each class we keep at most 130K images with the highest confidence. Our largest model, EfficientNet-L2, needs to be trained for 3.5 days on a Cloud TPU v3 Pod, which has 2048 cores. Similar to [71], we fix the shallow layers during finetuning.

Our main results are shown in Table 1, and we summarize the key results against previous state-of-the-art models. Notably, EfficientNet-B7 achieves an accuracy of 86.8%, which is 1.8% better than the supervised model. Lastly, we show the results of benchmarking our model on the robustness datasets ImageNet-A, C and P, and on adversarial robustness. We evaluate the best model, which achieves 87.4% top-1 accuracy, on three robustness test sets: ImageNet-A, ImageNet-C and ImageNet-P. The ImageNet-C and ImageNet-P test sets [24] include images with common corruptions and perturbations such as blurring, fogging, rotation and scaling, and Figure 1(c) shows images from ImageNet-P with the corresponding predictions. As shown in Tables 3, 4 and 5, when compared with the previous state-of-the-art model ResNeXt-101 WSL [44, 48] trained on 3.5B weakly labeled images, Noisy Student yields substantial gains on the robustness datasets. Our experiments showed that our model significantly improves accuracy on ImageNet-A, C and P without the need for deliberate data augmentation.

After testing our model's robustness to common corruptions and perturbations, we also study its performance under adversarial perturbations. Noisy Student improves adversarial robustness against an FGSM attack even though the model is not optimized for adversarial robustness. Probably due to the same reason, at ε=16 EfficientNet-L2 achieves an accuracy of only 1.1% under the stronger PGD attack with 10 iterations [43], which is far from state-of-the-art results; the comparison is shown in Table 9. A sketch of the FGSM evaluation is given below.
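For readers unfamiliar with FGSM, here is a minimal PyTorch-style sketch of the attack described above (one gradient step on the input image, with each pixel moved by ε); the model, loader, and the assumption that pixels are scaled to [0, 1] are ours, not the paper's evaluation code.

```python
# Minimal FGSM evaluation: perturb each pixel by eps in the direction that
# increases the loss, then measure top-1 accuracy on the perturbed images.
import torch
import torch.nn.functional as F

def fgsm_accuracy(model, loader, eps=16 / 255.0):  # eps=16 on a 0-255 pixel scale
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images = images.clone().requires_grad_(True)
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        adv = (images + eps * images.grad.sign()).clamp(0.0, 1.0).detach()
        with torch.no_grad():
            preds = model(adv).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```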
In this work, we showed that it is possible to use unlabeled images to significantly advance both the accuracy and robustness of state-of-the-art ImageNet models. We found that self-training is a simple and effective algorithm to leverage unlabeled data at scale; in particular, unlabeled images are plentiful and can be collected with ease. Although noise may appear to be limited and uninteresting, when it is applied to unlabeled data it has the compound benefit of enforcing local smoothness in the decision function on both labeled and unlabeled data. The previous state of the art, by contrast, came from a study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images, which showed improvements on several image classification and object detection tasks and reported the highest ImageNet-1k single-crop top-1 accuracy at the time.

In the following, we first describe the experiment details and then show our results on ImageNet, comparing them with state-of-the-art models. For iterative training, with EfficientNet-L0 as the teacher we trained a student model EfficientNet-L1, a wider model than L0; afterward, we further increased the student model size to EfficientNet-L2, with EfficientNet-L1 as the teacher, each time iterating by putting the student back as the teacher. On the robustness benchmarks, the ImageNet-A dataset comes from work that introduced two challenging datasets that reliably cause model performance to substantially degrade and that also curated ImageNet-O, the first out-of-distribution detection dataset created for ImageNet models. mCE (mean corruption error) is the weighted average of error rate over the different corruptions, with AlexNet's error rate as a baseline; together with the mFR metric mentioned earlier, it can be written out as below.
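Following the standard ImageNet-C/P definitions as we understand them (the notation here is ours, not the paper's), the two metrics are:

```latex
% mCE: corruption error normalized per corruption by AlexNet's error, then averaged.
% E^f_{s,c} is the top-1 error of model f on corruption c at severity s.
\mathrm{CE}^{f}_{c} = \frac{\sum_{s=1}^{5} E^{f}_{s,c}}{\sum_{s=1}^{5} E^{\mathrm{AlexNet}}_{s,c}},
\qquad
\mathrm{mCE}^{f} = \frac{1}{|C|} \sum_{c \in C} \mathrm{CE}^{f}_{c}

% mFR: flip probability normalized by AlexNet's flip probability, then averaged.
% FP^f_{p} is the probability that model f's prediction flips along perturbation sequence p.
\mathrm{FR}^{f}_{p} = \frac{FP^{f}_{p}}{FP^{\mathrm{AlexNet}}_{p}},
\qquad
\mathrm{mFR}^{f} = \frac{1}{|P|} \sum_{p \in P} \mathrm{FR}^{f}_{p}
```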
Two ingredients from prior work are central to the setup. Stochastic depth is a simple yet ingenious idea for adding noise to the model by bypassing residual transformations through skip connections. EfficientNet introduced a scaling method that uniformly scales all dimensions of depth, width and resolution using a simple yet highly effective compound coefficient, demonstrated by scaling up MobileNets and ResNets; the architecture specifications of EfficientNet-L0, L1 and L2 are listed in Table 7.

Self-training is a form of semi-supervised learning [10] which attempts to leverage unlabeled data to improve classification performance in the limited-data regime. By showing the models only labeled images, we limit ourselves from making use of unlabeled images, which are available in much larger quantities, to improve the accuracy and robustness of state-of-the-art models. Using self-training with Noisy Student, together with 300M unlabeled images, we improve EfficientNet's [69] ImageNet top-1 accuracy to 87.4%. Our model also has approximately half as many parameters as FixRes ResNeXt-101 WSL. In contrast to the baseline, the predictions of the model with Noisy Student remain quite stable under perturbations, and the accuracy is improved by about 10% in most settings.

The pseudo labels can be soft (a continuous distribution) or hard (a one-hot distribution). In both cases, we gradually remove augmentation, stochastic depth and dropout for unlabeled images, while keeping them for labeled images. In one ablation we use EfficientNet-B4 as both the teacher and the student, and we also study the effects of using different amounts of unlabeled data. To recap the full pipeline: on ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images; next, a larger student model is trained on the combination of all the data and achieves better performance than the teacher by itself. A sketch of the full loop follows.
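Putting the pieces together, the iterative loop can be sketched as below; the helper callables stand in for the steps sketched earlier, and the L0 → L1 → L2 size schedule follows the progression described above (the initial teacher is simply trained in a supervised fashion).

```python
# High-level sketch of the iterative Noisy Student loop:
# teacher -> pseudo labels -> larger noised student -> student becomes teacher.
def noisy_student_pipeline(build_model, train_supervised, generate_pseudo_labels,
                           train_noisy_student, labeled_data, unlabeled_data,
                           model_sizes=("L0", "L1", "L2")):
    # Step 1: the first teacher is trained in a supervised fashion on labeled data.
    teacher = train_supervised(build_model(model_sizes[0]), labeled_data)
    for size in model_sizes[1:]:
        # Step 2: the un-noised teacher infers pseudo labels on the unlabeled data,
        # which are then filtered and balanced.
        pseudo_data = generate_pseudo_labels(teacher, unlabeled_data)
        # Step 3: an equal-or-larger student is trained with noise on the
        # combination of labeled and pseudo-labeled images.
        student = train_noisy_student(build_model(size), labeled_data, pseudo_data)
        # Step 4: the student becomes the teacher for the next round.
        teacher = student
    return teacher
```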
Although they have produced promising results, in our preliminary experiments consistency-regularization methods work less well on ImageNet, because consistency regularization in the early phase of ImageNet training regularizes the model towards high-entropy predictions and prevents it from achieving good accuracy. Amongst other components, Noisy Student implements self-training in the context of semi-supervised learning, and it offers a new way of incorporating unlabeled data into a supervised learning pipeline. Some related approaches have a different focus: in one line of work the main goal is to find a small and fast model for deployment, and in another the noise model is video specific and not relevant for image classification. Chowdhury et al. [2] show that self-training is superior to pre-training with ImageNet supervised learning on a few computer vision tasks.

In our setup, we run the teacher over the JFT dataset to predict a label for each image. Soft pseudo labels lead to better performance for low-confidence data. Secondly, to enable the student to learn a more powerful model, we also make the student model larger than the teacher model, and because of the injected noise the student is forced to learn harder from the pseudo labels. Noisy Student (B7, L2) means using EfficientNet-B7 as the student and our best model with 87.4% accuracy as the teacher. We verify that the model does not overfit the unlabeled set when we use 130M unlabeled images, as can be seen from the training loss. This result is also a new state of the art and 1% better than the previous best method, which used an order of magnitude more weakly labeled data [44, 71]. For the corruption benchmark, the top-1 accuracy is simply the average top-1 accuracy over all corruptions and all severity degrees.