GS Lab Wins Silver at Kaggle’s “Human Protein Atlas – Single Cell Classification” Competition
Kaggle hosts some of the biggest and most competitive data science competitions, drawing hundreds of teams from across the globe. We are therefore glad to share how two data scientists from GS Lab won a silver medal in the recently concluded Kaggle competition titled “Human Protein Atlas – Single Cell Classification”. Against stiff competition from over 700 participating teams of data scientists, GS Lab’s team ranked 50th. Kudos to the team of Mugdha Hardikar and Shubham Joshi who represented GS Lab on this platform.
This post provides a brief background about why GS Lab participates in Kaggle competitions, followed by some high-level details of the competition and the medal-winning approach.
Why are Kaggle competitions important?
Kaggle is a vibrant, worldwide community of data scientists, where practitioners from major technology companies and reputed academic institutions sponsor competitions, compete, collaborate, and learn. Competitions with various prizes are regularly held on the platform, and they have been instrumental in advancing the field of data science: many techniques and algorithms became popular after consistently winning Kaggle competitions.
At GS Lab, one of our fundamental principles is to have applied expertise in cutting-edge and emerging technologies that help us solve problems for our customers and partners in the most effective way.
Given that data science – particularly deep learning – is a rapidly evolving discipline, participation in Kaggle competitions enables us to:
- Continuously up-skill our data scientists in applying the latest developments in data science on real-world problems with real data. This is especially crucial for deep-learning techniques.
- Collaborate with, and learn from, leading data scientists around the world in solving some of the most challenging data science problems.
- Objectively gauge our expertise, and identify areas where we need to amplify our skills.
Deconstructing the Problem Statement
The Kaggle competition titled “Human Protein Atlas – Single Cell Classification” posed the following problem to the participants: given microscope images of cells, with labels assigned collectively to all the cells in each image, develop an approach capable of segmenting and classifying each individual cell with precise labels.
This is a weakly supervised multi-label classification problem: only image-level labels are provided, while the task is to predict cell-level labels.
As specified in the competition, the dataset comprises 17 different cell types of highly different morphology, which affect the protein patterns of the different organelles. All image samples are represented by four filters (stored as individual files), the protein of interest (green) plus three cellular landmarks: nucleus (blue), microtubules (red), endoplasmic reticulum (yellow). The green filter should be used to predict the label, and the other filters are used as references.
[Figure: samples from the training data]
Evaluation is done by computing mAP (mean Average Precision), with the mean taken over the 19 segmentable classes of the challenge, at a segmentation IoU >= 0.6.
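As a rough illustration, the per-class Average Precision part of this metric can be sketched in plain NumPy. The toy labels and scores below are invented, and the sketch deliberately omits the IoU >= 0.6 segmentation-matching step of the real metric:

```python
import numpy as np

def average_precision(y_true, scores):
    """AP for one class: precision averaged at each true-positive rank."""
    order = np.argsort(-scores)           # rank predictions by descending score
    hits = y_true[order]                  # 1 where a ranked prediction is correct
    cum_tp = np.cumsum(hits)
    precision = cum_tp / (np.arange(len(hits)) + 1)
    return float((precision * hits).sum() / max(hits.sum(), 1))

# Toy ground truth and scores for 3 cells x 2 classes (invented numbers)
y_true = np.array([[1, 0], [0, 1], [1, 0]])
scores = np.array([[0.9, 0.2], [0.8, 0.7], [0.7, 0.1]])

# mAP = mean of the per-class APs
mAP = float(np.mean([average_precision(y_true[:, c], scores[:, c])
                     for c in range(y_true.shape[1])]))
```

In the real challenge a predicted cell only counts as a true positive if its mask also overlaps the ground-truth cell at IoU >= 0.6; that matching step happens before the ranking shown here.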
High-level approach taken by GS Lab team
To segment individual cells from each image, we used pretrained models from “HPA-Cell-Segmentation”, one of the libraries suggested by the competition hosts. According to the hosts, HPA-Cell-Segmentation matches roughly 90% of the cells in the test set, so we focused on developing an efficient model for single-cell classification rather than on improving the segmentation step.
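HPA-Cell-Segmentation itself is a learned segmenter, but the overall segment-then-classify shape of the pipeline can be illustrated with a toy connected-components sketch. This uses SciPy on a synthetic two-blob image, not the real library or data:

```python
import numpy as np
from scipy import ndimage

# Synthetic nucleus-channel image: two bright blobs on a dark background
img = np.zeros((10, 10))
img[1:4, 1:4] = 1.0
img[6:9, 5:9] = 1.0

# Connected-component labelling as a toy stand-in for a real cell segmenter
mask = img > 0.5
labels, n_cells = ndimage.label(mask)

# Each segmented cell can then be cropped and passed to the
# single-cell classifier, which assigns it per-cell labels.
cells = [np.argwhere(labels == i + 1) for i in range(n_cells)]
```

A real segmenter returns per-cell masks in the same labelled-array form, so the downstream crop-and-classify loop looks the same regardless of how the masks were produced.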
Data selection for training:
- For the image-level model, all the provided images were used.
- For the initial cell-level models, only cells from images with a single label were used, because propagating labels from the image level to the cell level introduces less noise in that case.
- For further iterations, data was selected using a strategy that is a variant of noisy student.
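A minimal sketch of that selection idea, assuming hypothetical per-cell class probabilities from a teacher model; it simplifies the multi-label problem to single-label pseudo-labels for illustration:

```python
import numpy as np

# Hypothetical teacher outputs: class probabilities for 4 cells x 3 classes
teacher_probs = np.array([[0.90, 0.05, 0.05],
                          [0.40, 0.30, 0.30],
                          [0.10, 0.85, 0.05],
                          [0.30, 0.30, 0.40]])

# Noisy-student-style selection: keep only cells whose most confident
# prediction clears a threshold, and pseudo-label them with that class.
CONF_THRESHOLD = 0.8
confidence = teacher_probs.max(axis=1)
keep = confidence >= CONF_THRESHOLD
pseudo_labels = teacher_probs.argmax(axis=1)[keep]
selected_cells = np.flatnonzero(keep)

# selected_cells and pseudo_labels would feed the next (student) training round
```

Each iteration then retrains on the enlarged, pseudo-labelled set, so the training data grows as the model becomes more confident.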
Expertise and skills
- Efficient use of GPUs via tf.data input pipelines resulted in faster training and cost savings.
- This was our first experience applying a variation of the noisy-student technique, and it resulted in improved model performance.
- Tradeoff between using the fast segmentator to fit more models into the ensemble and using the high-precision segmentator: although the fast segmentator typically reduces a model’s individual score, it allowed us to add an extra model to the ensemble, improving the overall score.
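The tf.data point above can be sketched as a minimal input pipeline; the file paths and the decoding function here are placeholders, not the actual competition code:

```python
import tensorflow as tf

# Placeholder file list; in practice these are paths to four-channel images
paths = tf.data.Dataset.from_tensor_slices(["a.png", "b.png", "c.png"])

def load_and_preprocess(path):
    # Stand-in for real decoding/augmentation of a four-channel sample
    image = tf.zeros((128, 128, 4))
    label = tf.zeros((19,))          # 19 segmentable classes
    return image, label

# Parallel decoding, batching, and prefetching keep the GPU fed
# instead of waiting on the input pipeline between batches.
ds = (paths
      .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(2)
      .prefetch(tf.data.AUTOTUNE))
```

Overlapping preprocessing with training in this way is what reduces GPU idle time, which translates directly into shorter (and cheaper) training runs.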
False paths, what didn’t work
- Application of clustering techniques to determine which labels might be incorrect did not yield benefits.
- Experiments using weighted averages of the model predictions in the ensembles were inconclusive.
What the team didn’t have time for
- Building stacked ensemble of the models.
- Pruning of models for better execution time.
Candidate areas to build skills in
Based on discussions in the competition forums and the approaches taken by other winners, we have identified a few promising areas in which to focus on building skills.
We are also seeing the ecosystem around PyTorch develop more rapidly, especially for computer vision, medical imaging, and related areas. This observation, combined with several issues we have faced with TensorFlow, leads us to consider building skills in PyTorch as well.
We have participated in a few competitions, and it has always been a great learning experience for our data scientists. It feels good to have done this well while competing with the best in the world, but given the rapidly evolving field of data science, we also realize that there is ample scope for growth and for continuing to hone our skills.
If you have more questions, do reach out to us at email@example.com
To know more about our data science work, do visit https://www.gslab.com/data-machine-learning-artificial-intelligence