A friend of mine is a native of Bangalore, Karnataka. Recently, for personal reasons, she decided to move to Pune. She called me to ask which neighborhoods in Pune are like her current neighborhood in Bangalore, which she loves. She went on and on about the amenities, facilities, and life there. She would also like the new neighborhood to be close to her office when it starts. While I know Pune reasonably well, I quickly realized that no one person would have all the details, and everyone has biases and preferences. Wouldn’t it be great if she could find the neighborhoods in Pune that most closely resemble her current neighborhood in Bangalore? She could then pick one that is both like her current neighborhood and very close to her new workplace.
I knew that the data for this exists. The data scientist in me got excited. Some of the machine learning enthusiasts from GS Lab and I discussed the problem, and together we came up with a neighborhood recommender. With it, anyone moving between these cities can choose a neighborhood to their liking.
How does it work?
It is simple. Pick cities, neighborhoods, decide the number of venues to be compared, and let Machine Learning kick in to recommend matching communities. It shows you the map. It lists all the points of comparison. Isn’t it cool?
If you are curious to know what is under the hood of this recommender, read on.
- Information about venues (with categories like restaurants, parks, schools, etc.) around these neighborhoods: we registered on https://foursquare.com/ and used the Foursquare API to get information about the venues around each neighborhood.
- Data scraped from the Foursquare API using the stevesie platform: https://stevesie.com/accounts/login/?next=/cloud/apis/foursquare/venue-search
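A venue request for one neighborhood could be assembled roughly as follows. This is only a sketch: it assumes the legacy Foursquare v2 `venues/search` endpoint, and the function name, the 1000 m radius, and the version date are illustrative choices, not details from this project. The coordinates are the ones listed for Aundh, Pune, later in this post. Check the current Foursquare documentation before relying on any of it.

```python
from urllib.parse import urlencode

# Assumption: the legacy Foursquare v2 venue search endpoint.
SEARCH_URL = "https://api.foursquare.com/v2/venues/search"


def build_venue_search_url(lat, lng, client_id, client_secret,
                           radius_m=1000, limit=50, version="20200101"):
    """Build a venue-search URL for the area around one neighborhood.

    build_venue_search_url is a hypothetical helper; radius_m and version
    are illustrative defaults, not values taken from the original project.
    """
    params = {
        "client_id": client_id,          # issued when you register an app
        "client_secret": client_secret,
        "v": version,                    # API version date, YYYYMMDD
        "ll": f"{lat},{lng}",            # latitude,longitude of the neighborhood
        "radius": radius_m,              # search radius in meters
        "limit": limit,                  # maximum venues per request
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


# Placeholder credentials; no network request is made here.
url = build_venue_search_url(18.5602, 73.8031,
                             "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

The sketch only builds the URL, so the credentials stay placeholders and nothing is sent over the network.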
We use the Foursquare APIs to get a list of venues near these neighborhoods. We scraped data for categories such as Food, Parks, Schools, Libraries, Theatres, Nightlife, Shops, and Grocery stores. Here is a sample table of such venue information:
We then carried out exploratory data analysis, using visual methods to summarize the main characteristics of the data. We did this on the combined data (Pune and Bangalore) so that the two cities could be compared.
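The analysis in this post works on how often each venue category appears in a neighborhood. Here is a minimal sketch of turning venue rows into per-neighborhood category frequencies. It assumes each row is a (neighborhood, venue name, category) tuple. The rows and the helper names are invented placeholders, not the scraped data or the project's code.

```python
from collections import Counter, defaultdict

# Hypothetical venue rows: (neighborhood, venue name, venue category).
venues = [
    ("Hood A", "Venue 1", "Grocery Store"),
    ("Hood A", "Venue 2", "Park"),
    ("Hood A", "Venue 3", "Grocery Store"),
    ("Hood B", "Venue 4", "Library"),
    ("Hood B", "Venue 5", "Grocery Store"),
]


def category_frequencies(rows):
    """Return {neighborhood: {category: share of that neighborhood's venues}}."""
    counts = defaultdict(Counter)
    for neighborhood, _name, category in rows:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(c.values()) for cat, n in c.items()}
        for hood, c in counts.items()
    }


def top_categories(freqs, n=10):
    """Rank each neighborhood's categories, most frequent first (ties by name)."""
    return {
        hood: [cat for cat, _ in
               sorted(f.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for hood, f in freqs.items()
    }


freqs = category_frequencies(venues)
print(top_categories(freqs, n=2))
```

Using shares rather than raw counts keeps neighborhoods with many venues comparable to those with few.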
Statistical Analysis Technique
We used the K-means clustering technique to analyze the data. K-means is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible.
We applied K-means separately to the Pune and Bangalore neighborhood data in Python.
K-means is somewhat naive: it partitions the data into k clusters even if k is not the right number of clusters to use. With clustering, it is hard to know in advance how many clusters are optimal.
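To make the iteration concrete, here is a small pure-Python sketch of K-means (Lloyd's algorithm). It is not the code used for this project, which may well have relied on a library. The function name, the random seeding, and the two-feature sample points are assumptions for illustration only.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Seed the centroids with k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


# Hypothetical neighborhoods described by two category shares (food, parks).
points = [(0.90, 0.10), (0.80, 0.20), (0.85, 0.15),
          (0.10, 0.90), (0.20, 0.80), (0.15, 0.85)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The first three sample points end up in one cluster and the last three in the other; which cluster gets label 0 depends on the random seeding.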
One method to validate the number of clusters is the elbow method. The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to 10), and for each value of k, calculate the Sum of Squared Errors (SSE).
When we plot SSE against k, the curve levels off after three clusters. This implies that adding more clusters will not help much.
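The elbow computation can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: the K-means loop is a compact pure-Python version, the centroids are seeded deterministically with a farthest-first rule (swapped in here so the run is reproducible), and the nine sample points are invented to form three obvious groups.

```python
import math


def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then keep adding
    the point that is farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def sse_for_k(points, k, iterations=50):
    """Run K-means for a given k and return the Sum of Squared Errors."""
    centroids = farthest_first(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    # SSE: squared distance from every point to its nearest centroid.
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)


# Hypothetical data with three well-separated groups.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (9.0, 1.0), (9.1, 1.2), (8.9, 0.8)]

sse = {k: sse_for_k(points, k) for k in range(1, 7)}
for k, value in sse.items():
    print(k, round(value, 3))
```

On data like this, the SSE drops sharply up to k = 3 and changes very little afterwards, which is the "elbow" to look for.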
Let us visualize the results by plotting the data colored by these labels. We will also plot the cluster centers as determined by the k-means estimator.
In the scatter plot, we have used three clusters for both cities, as suggested by the elbow curve.
We created three clusters for each city.
Bangalore neighborhood clustering indicates that there are three clusters, namely
- Cluster – 0: neighborhoods characterized by Food, Parks, Grocery Stores, and Theatres.
- Cluster – 1: neighborhoods characterized by Grocery Stores, Food, Parks, and Libraries.
- Cluster – 2: neighborhoods characterized by Schools, Food, and Grocery Stores.
Pune neighborhood clustering indicates that there are three clusters, namely
- Cluster – 0: neighborhoods characterized by Food, Theatres, and Schools.
- Cluster – 1: neighborhoods characterized by Food, Grocery Stores, and Libraries.
- Cluster – 2: neighborhoods characterized by Grocery Stores, Food, and Libraries.
We can map these neighborhood clusters as:
Based on the above plots we can match clusters for Pune and Bangalore as
We can conclude that a person living in Cluster – 0 of Bangalore should look at Cluster – 2 of Pune. For example, a person living in Jayanagar, Bangalore, should consider Aundh, Pune, because both clusters have nearly the same number of Grocery Stores, Food venues, and Parks, and both also have a significant number of Schools. So, based on the analysis, we can suggest a move from Jayanagar, Bangalore, to Aundh, Pune.
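In this post the clusters were matched by looking at the plots. One way the matching could be automated, offered only as a sketch, is to compare cluster centroids directly and pair each Bangalore cluster with the nearest Pune cluster. The centroid values below are invented for illustration and are not the project's results. The `match_clusters` helper and the choice of Euclidean distance are also assumptions.

```python
import math

CATEGORIES = ["Food", "Parks", "Grocery Stores",
              "Schools", "Libraries", "Theatres"]

# Hypothetical centroids: mean share of each category per cluster.
# These numbers are made up; they are NOT the values from the analysis.
bangalore = {
    0: [0.30, 0.25, 0.25, 0.05, 0.05, 0.10],
    1: [0.25, 0.20, 0.30, 0.05, 0.20, 0.00],
    2: [0.30, 0.05, 0.25, 0.35, 0.05, 0.00],
}
pune = {
    0: [0.35, 0.05, 0.05, 0.25, 0.05, 0.25],
    1: [0.30, 0.10, 0.30, 0.05, 0.25, 0.00],
    2: [0.28, 0.25, 0.27, 0.05, 0.03, 0.12],
}


def match_clusters(source, target):
    """For each source cluster, pick the target cluster whose centroid
    is closest in Euclidean distance."""
    return {
        s: min(target, key=lambda t: math.dist(vec, target[t]))
        for s, vec in source.items()
    }


matches = match_clusters(bangalore, pune)
for b, p in matches.items():
    print(f"Bangalore cluster {b} -> Pune cluster {p}")
```

Nearest-centroid matching is not guaranteed to be one-to-one; two source clusters can map to the same target cluster, so the result still deserves a visual check against the plots.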
Based on the top 10 venues, we recommend Aundh, Pune, whose top 10 venues are:

| Neighborhood | Latitude | Longitude | Top venues |
|---|---|---|---|
| Aundh, Pune | 18.5602 | 73.80310 | Grocery Stores, School, Library, Department Store, Parks, Coffee Shop, Chinese Restaurant, Bar, Shopping Mall, Gas Station |

These are easily seen to be comparable with the top venues of Jayanagar, Bangalore:

| Neighborhood | Latitude | Longitude | Top venues |
|---|---|---|---|
| Jayanagar, Bangalore | 12.9308 | 77.5838 | Grocery Store, Library, Park, Indian Restaurant, Multiplex |
With venues like Grocery Stores, Libraries, Parks, and Restaurants in common, the transition for my friend is going to be a very smooth one!
Bangalore has 76 unique venue categories. Its neighborhoods differ in the number of venues, and some of them reach the limit of 50 venues. There are quite a few Coffee Shops and a variety of restaurants, parks, stores, etc. Pune has 76 unique venue categories. Most of its neighborhoods reach the limit of 50 venues, and the remaining ones still have a very high count. There are various Libraries, Parks, Grocery Stores, Cineplexes, Schools, theatres, hotels, and plazas.
Limitations and Further Work
In this project, we have considered only the frequency of venues of a specific category to categorize neighborhoods. Many other factors can augment it (e.g., other statistical measures of this number, such as variance and various percentiles, as well as cost of living, family factors, availability of housing, public transportation, etc.). We also rely on data from Foursquare alone, and we use the free developer’s account, which imposes some restrictions on the data we can obtain. The study can be enhanced further with respect to both of these factors.
Bangalore and Pune City are similar in their neighborhoods, being large metropolitan cities in India. A person hailing from Jayanagar is advised to pick Aundh for settling down while moving from Bangalore to Pune City. You can carry out a similar analysis for any neighborhood in these cities or extend it for migration between any other cities/countries by gathering similar data for those two cities.
Credits: Sagar Nangare and Vijay Singh
Sameer Mahajan | Principal Architect
Sameer Mahajan has 25 years of experience in the software industry. He has worked for companies like Microsoft and Symantec across areas like machine learning, storage, cloud, big data, networking and analytics in the United States & India.
Sameer holds 9 US patents and is an alumnus of IIT Bombay and Georgia Tech. He not only conducts hands-on workshops and seminars but also participates in panel discussions on emerging technologies like machine learning and big data. Sameer is one of the mentors for the Machine Learning Foundations course at Coursera.