Silver Medal at Kaggle’s “U.S. Patent Phrase to Phrase Matching” Competition

Kaggle competitions are among the biggest and most competitive data science events, drawing hundreds of teams from across the globe. Data scientists at GS Lab participate in Kaggle competitions to continually enhance their expertise in applying cutting-edge and emerging technologies to real-world applications. In this vein, Piyush Chauhan and Krushnakant Mardikar worked on the Kaggle problem statement elaborated below, guided by Vineet Raina, Chief Data Scientist at GS Lab. Piyush Chauhan's solution ranked 42nd out of 1,889 teams, earning a silver medal. This blog provides a brief description of the problem statement followed by some details of the medal-winning approach.

Deconstructing the problem statement

Motivation

Document matching is a common class of problem in text analytics. For example, in the field of patent processing, matching documents can involve determining the semantic similarity between phrases to check whether a similar invention already exists. Examples of semantic challenges in such a task, as described in the competition details, include identifying the similarity between “TV set” and “television set”, and the potential similarity between “strong material” and “steel”, where the latter depends on the additional context of the scientific domain.

Problem Statement

The problem statement of the Kaggle competition “U.S. Patent Phrase to Phrase Matching” was as follows:

“To train models on a novel semantic similarity dataset to extract relevant information by matching key phrases in patent documents”

This similarity between two phrases has to be determined in a specific context, i.e., the research domain/subdomain of the two phrases.

Data

As specified in the competition, the dataset comprises pairs of phrases (an anchor and a target phrase) along with the patent’s context code. The similarity score takes a value from the set {0, 0.25, 0.5, 0.75, 1} and represents how similar the anchor and target are in the given context. For training purposes, around 36k anchor-target pairs were provided, whereas the test set of about 12k pairs was kept hidden (as this was a code competition) and was used for evaluation as explained below.

Samples from the training data:

Competition Evaluation

Evaluation is done by computing the Pearson correlation coefficient between the predicted and actual similarity scores. The public leaderboard results are published on only 24% of the test set, and the final evaluation is done on the remaining 76%.

Overview of Final Approach

Training 

Our approach to this competition was to perform a lot of initial experiments with various pre-trained architectures to quickly find out what works best, and then apply more advanced techniques such as fine-tuning with different heads and schedulers, along with tuning other hyper-parameters like learning rate, weight decay, etc. We also decoded the context codes to their textual descriptions using a code-to-text mapping from the official USPTO website. This textual mapping provides more detailed information about the context and helps the model learn better.
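As a minimal sketch of this idea (the section titles below are a small illustrative subset, and the `[SEP]`-joined input format is our assumption, not the exact competition pipeline), decoding a context code into text and joining it with the phrase pair could look like:

```python
# Illustrative subset of CPC section titles; the full code-to-text mapping
# was taken from the official USPTO website.
CPC_SECTION_TITLES = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "G": "Physics",
    "H": "Electricity",
}

def build_model_input(context_code: str, anchor: str, target: str) -> str:
    """Replace the opaque context code with its textual description and
    join it with the anchor and target phrases into one model input."""
    # Fall back to the raw code if the section is not in our subset.
    section_text = CPC_SECTION_TITLES.get(context_code[0], context_code)
    return f"{section_text} [SEP] {anchor} [SEP] {target}"
```

With this mapping, a context code such as "G06" contributes the text "Physics" to the model input, giving the model domain information instead of an opaque code.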

The four best architectures chosen from these experiments were trained with 5 different loss functions, resulting in 20 models. A weighted ensemble (optimized using Nelder-Mead) was created to combine the predictions of the 20 models into the final prediction.
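A sketch of how such a weight search might be done with SciPy (the function and variable names are our own; this is an illustration of the technique, not the actual competition code):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import pearsonr

def fit_ensemble_weights(oof_preds: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Find per-model blending weights that maximize the Pearson correlation
    of the weighted out-of-fold predictions with the true scores.
    oof_preds has shape (n_models, n_samples)."""
    n_models = oof_preds.shape[0]

    def neg_pearson(weights: np.ndarray) -> float:
        # Negate because scipy minimizes, but we want maximum correlation.
        blended = weights @ oof_preds
        return -pearsonr(blended, targets)[0]

    # Start from a simple average; Pearson correlation is scale-invariant,
    # so the returned weights matter only up to a common factor.
    result = minimize(neg_pearson, x0=np.full(n_models, 1.0 / n_models),
                      method="Nelder-Mead")
    return result.x
```

Because Nelder-Mead is derivative-free, it works directly on the non-differentiable objective of leaderboard-style correlation, and the search space (one weight per model) is small enough for it to converge quickly.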

Cross-Validation 

The hidden test set did not have any anchors (Phrase1) that were present in the train set. Hence, the cross-validation folds were prepared using StratifiedGroupKFold so that the folds were grouped by anchor and stratified by score. This mimics the test setting, where the validation fold doesn’t contain anchors present in the training folds.

Testing

The final predictions on the test set were obtained by taking a weighted ensemble across the different models, with the weights computed using Nelder-Mead optimization.

Expertise and skills

Applied Techniques
  • The use of Automatic Mixed Precision resulted in faster training and allowed for larger batch sizes.
  • Proper selection of the cross-validation strategy was key in this competition and ensured that the cross-validation score had a high correlation with the leaderboard score.
  • As Pearson correlation was the evaluation metric, using custom losses like MSE + Pearson loss and BCE + Pearson loss helped achieve a higher score on the competition metric.
  • Differential learning rates permit larger learning rates for the pooling heads while using smaller learning rates for the transformer base model. Combining this with one-cycle schedulers (cosine & linear) resulted in significant score improvements for almost all the models.
  • A weighted ensemble using Nelder-Mead optimization worked better than a simple average since the models were not of similar strength; the weighted average allowed assigning higher weights to the stronger models.
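As an illustration of the custom-loss idea above (a sketch under assumptions — the exact mixing weight used in the competition is not shown), an MSE + Pearson loss in PyTorch could look like:

```python
import torch

def mse_pearson_loss(preds: torch.Tensor, targets: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Blend mean squared error with (1 - Pearson correlation) so that
    training directly optimizes toward the competition metric.
    alpha is an assumed mixing weight."""
    mse = torch.mean((preds - targets) ** 2)
    # Pearson correlation of the batch; epsilon guards against division by zero.
    preds_c = preds - preds.mean()
    targets_c = targets - targets.mean()
    pearson = (preds_c * targets_c).sum() / (preds_c.norm() * targets_c.norm() + 1e-8)
    return alpha * mse + (1.0 - alpha) * (1.0 - pearson)
```

MSE alone only penalizes pointwise error, while the (1 - Pearson) term rewards predictions whose ranking and spread match the targets, which is what the competition actually scores.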
False paths: what didn’t work
  • Training experiments using Stochastic Weight Averaging did not show conclusive performance improvements.
  • The Siamese network approach using the sentence-similarity model Patent-SBERTa didn’t yield good results.
  • Pseudo-labeling of the test set and removal of noisy labels from the train set didn’t improve model performance.
Candidate areas to build skills in

Based on discussions in the competition forums and the approaches taken by other winners, we identified a few promising areas to focus on for building skills:

Conclusion

As with all Kaggle competitions we’ve participated in so far, this one too has been a great learning experience. The eventual result amidst tough competition validated our expertise in creating generalizable NLP models using the latest technologies. Nevertheless, NLP is one of the more rapidly evolving fields within data science and there continues to be ample scope to hone our skills further.

Further reading:
Read about another silver-medal-winning approach by our data scientists, who worked on the Kaggle problem statement “Human Protein Atlas – Single Cell Classification”. If you have more questions, do reach out to us at info@gslab.com. To know more about our data science work, do visit https://www.gslab.com/data-machine-learning-artificial-intelligence
Author
Srinath Krishnamurthy | Principal Architect (TOGAF 9 certified)

Srinath has 16+ years of professional experience designing data mining, predictive modeling, and analytics solutions in varied areas such as CRM (retail/finance), life sciences, healthcare, video conferencing, industrial IoT, and smart cities. He is a TOGAF 9-certified architect specializing in aligning business goals to the technical roadmap.

Srinath heads the data science practice and has been instrumental in building a multidisciplinary team of data scientists, data engineers, and software engineers geared towards providing end-to-end data science solutions.