Machine Learning Makes Migration from Bangalore to Pune Simpler

Inspiration

A friend of mine is a native of Bangalore, Karnataka. Recently, for personal reasons, she decided to move to Pune. She called me to ask which neighborhoods in Pune are like her current neighborhood in Bangalore, which she loves. She went on and on about the amenities, facilities, and life in her neighborhood. She would like the new one to be close to her office when it starts. While I know Pune reasonably well, I quickly realized that no one person would have all the details. Everyone has biases and preferences. Wouldn’t it be great if she could find the neighborhoods in Pune that are exactly like her current neighborhood in Bangalore? She could then pick a neighborhood that resembles her current one and is very close to her new workplace.

I knew the data for such information exists. The data scientist in me got excited. Some of the machine learning enthusiasts from GS Lab and I discussed this problem, and we came up with a neighborhood recommender. With it, anyone moving between the cities can choose a neighborhood of their liking.

How does it work?

It is simple. Pick the cities and neighborhoods, decide the number of venues to compare, and let machine learning recommend matching neighborhoods. It shows you the map and lists all the points of comparison. Isn’t it cool?

If you are curious to know what is under the hood of this recommender, read on.

Data Description

Methodology

We use the Foursquare APIs to get a list of venues near these neighborhoods. We collected data for categories such as Food, Parks, Schools, Libraries, Theatres, Nightlife, Shops, and Grocery Stores. Here is a sample table of such venue information:

[Image: sample table of venue information]
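Since the analysis relies on how often each venue category occurs in a neighborhood, venue rows can be turned into one frequency profile per neighborhood. The sketch below is only an illustration of that idea: the function name `category_frequencies`, the `(neighborhood, category)` row shape, and the toy rows are assumptions, not the data or code used for this article.

```python
from collections import Counter, defaultdict


def category_frequencies(venues):
    """Convert (neighborhood, category) rows into per-neighborhood category shares."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1

    profiles = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        # Share of each category among all venues found for this neighborhood.
        profiles[neighborhood] = {cat: n / total for cat, n in counter.items()}
    return profiles


# Toy rows, invented for illustration only (not the article's data).
venues = [
    ("Neighborhood A", "Food"),
    ("Neighborhood A", "Food"),
    ("Neighborhood A", "Park"),
    ("Neighborhood A", "Grocery Store"),
    ("Neighborhood B", "School"),
    ("Neighborhood B", "Food"),
]

profiles = category_frequencies(venues)
print(profiles["Neighborhood A"])
```

Using shares rather than raw counts keeps neighborhoods with many listed venues comparable to those with few, which is one reasonable choice among several.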

We performed exploratory data analysis, using visual methods to summarize the main characteristics of the data sets. We did this on the combined data (Pune and Bangalore) for comparison purposes.

Statistical Analysis Technique

We used the K-means clustering technique to analyze the data. K-means is an iterative algorithm that partitions the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to exactly one cluster. It tries to make the data points within a cluster as similar as possible while keeping the clusters as far apart as possible.
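To make the iteration concrete, here is a minimal plain-Python sketch of the standard K-means loop (assign each point to its nearest center, then move each center to the mean of its points). It is an illustration under assumptions, not the code behind this article: the function names, the random initialization, and the toy points are all made up for the example. In practice a library implementation such as scikit-learn's `KMeans` would normally be used.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:  # nothing moved, so the loop has converged
            break
        centers = new_centers
    return labels, centers


# Toy 2-D points forming two well-separated groups (illustrative only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

Which group receives label 0 or 1 depends on the random start; only the grouping itself is meaningful.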

Results

We applied K-means clustering separately to the Pune and Bangalore neighborhood data, using Python.

Elbow Method:

K-means is somewhat naive: it clusters the data into k clusters even if k is not the right number of clusters to use. With clustering, it is hard to know in advance how many clusters are optimal.

One method to validate the number of clusters is the elbow method. The idea is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to 10) and, for each value of k, calculate the Sum of Squared Errors (SSE), that is, the sum of squared distances from each data point to its cluster center.

When we plot SSE against k, the curve levels off after three clusters. This implies that adding more clusters will not help us much.
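The elbow procedure can be sketched in plain Python as follows. This is a hedged illustration, not the article's code: the helper names, the number of random restarts, and the toy points are assumptions made for the example. With scikit-learn, the same SSE value is available as the `inertia_` attribute of a fitted `KMeans` estimator.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Compact K-means; returns (labels, centers)."""
    centers = random.Random(seed).sample(points, k)
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[j]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def sse(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


def elbow_curve(points, k_values, restarts=5):
    """For each k, keep the lowest SSE found over a few random restarts."""
    return {
        k: min(sse(points, *kmeans(points, k, seed=s)) for s in range(restarts))
        for k in k_values
    }


# Toy points forming three loose groups (illustrative only).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 0.0), (11.0, 0.0), (10.0, 1.0),
          (0.0, 10.0), (1.0, 10.0), (0.0, 11.0)]

curve = elbow_curve(points, range(1, 6))
for k, value in curve.items():
    print(k, round(value, 2))
```

Plotting these values against k and looking for the point where the curve stops dropping sharply gives the "elbow".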

Scatter Graph:

Let us visualize the results by plotting the data colored by these labels. We will also plot the cluster centers as determined by the k-means estimator.

In the scatter plot, we have used three clusters for each city, as suggested by the elbow curve.

Cluster Analysis:

We have created three clusters for each city.

Bangalore neighborhood clustering indicates that there are three clusters, namely

  • Cluster – 0: Neighborhoods in this cluster are characterized by Food, Parks, Grocery Stores, and Theatres.
  • Cluster – 1: Neighborhoods in this cluster are characterized by Grocery Stores, Food, Parks, and Libraries.
  • Cluster – 2: Neighborhoods in this cluster are characterized by Schools, Food, and Grocery Stores.

Pune neighborhood clustering indicates that there are three clusters, namely

  • Cluster – 0: Neighborhoods in this cluster are characterized by Food, Theatres, and Schools.
  • Cluster – 1: Neighborhoods in this cluster are characterized by Food, Grocery Stores, and Libraries.
  • Cluster – 2: Neighborhoods in this cluster are characterized by Grocery Stores, Food, and Libraries.

Mapping Clusters:

We can map these neighborhood clusters as:

[Figure: Cluster analysis-1]

Based on the above plots, we can match clusters for Pune and Bangalore as follows:

[Figure: Cluster analysis-2]

We can conclude that a person living in Cluster – 0 of Bangalore should live in Cluster – 2 of Pune. For example, a person living in Jayanagar, Bangalore, should live in Aundh, Pune, because both clusters have nearly the same number of Grocery Stores, Food venues, and Parks. Both also have a significant number of Parks and Schools. So, based on the analysis, we can suggest that the person move from Jayanagar, Bangalore to Aundh, Pune.
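The matching here was read off the plots. One way such a match could be automated is to compare the average category profile of each cluster in one city with those in the other and pick the most similar. The sketch below uses cosine similarity for that; this is an assumed technique for illustration, and the function names, cluster labels, and profile numbers are invented, not results from this analysis.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_clusters(source_centroids, target_centroids):
    """Map each source cluster label to the most similar target cluster label."""
    matches = {}
    for src_label, src_vec in source_centroids.items():
        matches[src_label] = max(
            target_centroids,
            key=lambda t: cosine_similarity(src_vec, target_centroids[t]),
        )
    return matches


# Hypothetical cluster profiles: shares of (Food, Parks, Grocery Stores, Schools).
# These numbers are made up for the example.
city_a = {0: (0.5, 0.3, 0.2, 0.0), 1: (0.1, 0.1, 0.2, 0.6)}
city_b = {0: (0.1, 0.0, 0.3, 0.6), 1: (0.6, 0.2, 0.2, 0.0)}

matches = match_clusters(city_a, city_b)
print(matches)
```

Cosine similarity compares the mix of categories rather than the absolute venue counts, so two clusters with the same proportions match even if one city has more venues overall.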

Based on the top 10 venues, we recommend Aundh, Pune, whose top 10 venues are:

Aundh, Pune (latitude 18.5602, longitude 73.80310): Grocery Stores, School, Library, Department Store, Parks, Coffee Shop, Chinese Restaurant, Bar, Shopping Mall, Gas Station

These are easily seen to be comparable with the top venues of Jayanagar, Bangalore:

Jayanagar, Bangalore (latitude 12.9308, longitude 77.5838): Grocery Store, Library, Park, Indian Restaurant, Multiplex


With venues like Grocery Stores, Libraries, Parks, and Restaurants in common, the transition for my friend is going to be a very smooth one!
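The overlap between the two venue lists above can be checked with a few lines of Python. This is only a sketch: the `normalize` helper and its rules (lower-casing, treating every kind of restaurant as "restaurant", and naively stripping a trailing "s") are assumptions introduced so that "Parks" and "Park" compare as equal. They are not part of the recommender.

```python
def normalize(category):
    """Reduce a venue category to a rough comparable form."""
    name = category.strip().lower()
    if "restaurant" in name:
        return "restaurant"  # group all cuisines together
    if name.endswith("s"):
        name = name[:-1]  # naive singular form: "parks" -> "park"
    return name


# Venue lists as given for the two neighborhoods.
aundh = ["Grocery Stores", "School", "Library", "Department Store", "Parks",
         "Coffee Shop", "Chinese Restaurant", "Bar", "Shopping Mall", "Gas Station"]
jayanagar = ["Grocery Store", "Library", "Park", "Indian Restaurant", "Multiplex"]

aundh_set = {normalize(c) for c in aundh}
jayanagar_set = {normalize(c) for c in jayanagar}

common = sorted(aundh_set & jayanagar_set)
# Jaccard index: shared categories divided by all distinct categories.
jaccard = len(aundh_set & jayanagar_set) / len(aundh_set | jayanagar_set)

print(common)
print(round(jaccard, 2))
```

The shared set comes out as grocery store, library, park, and restaurant, which agrees with the venues named as common above.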

Discussions

Bangalore has 76 unique venue categories. The number of venues varies by neighborhood, with some neighborhoods reaching the limit of 50 venues. There are quite a few Coffee Shops and a variety of restaurants, parks, stores, etc. Pune has 76 unique venue categories. Most of its neighborhoods have reached the limit of 50 venues, and the remaining ones also have a high count. There are various Libraries, Parks, Grocery Stores, Cineplexes, Schools, theatres, hotels, and plazas.

Limitations and Further Work

In this project, we have considered only the frequency of venues of a specific category to categorize neighborhoods. Many other factors can augment it (e.g., other statistical measures of this number such as variance and various percentiles, cost of living, family factors, availability of housing, and public transportation). We also rely on data from Foursquare for our analysis and use the free developer account, which imposes some restrictions on the data we obtain. The study can be enhanced further with respect to both of these factors.

Conclusion

Bangalore and Pune, both large metropolitan cities in India, are similar in their neighborhoods. A person hailing from Jayanagar is advised to pick Aundh for settling down when moving from Bangalore to Pune. You can carry out a similar analysis for any neighborhood in these cities, or extend it to migration between any other cities or countries by gathering similar data for them.

Credits: Sagar Nangare and Vijay Singh

Author
Sameer Mahajan | Principal Architect

Sameer Mahajan has 25 years of experience in the software industry. He has worked for companies like Microsoft and Symantec across areas like machine learning, storage, cloud, big data, networking and analytics in the United States & India.

Sameer holds 9 US patents and is an alumnus of IIT Bombay and Georgia Tech. He not only conducts hands-on workshops and seminars but also participates in panel discussions on upcoming technologies like machine learning and big data. Sameer is one of the mentors for the Machine Learning Foundations course at Coursera.