Capacitated Clustering with improved K-means in R

Often in supply chain design, we need to divide the whole customer space into segments/polygons/clusters such that each segment does not cross a given threshold of demand. This is achieved with a special type of clustering: capacitated clustering.

Capacitated Clustering refers to a type of clustering where the following two conditions are generally met:

  1. The points of a cluster are in the same neighborhood
  2. The total demand of a cluster does not exceed a given upper limit

Condition 1 can easily be achieved with K-means or hierarchical clustering, but neither of these can restrain the demand of the clusters to a specific limit.

In general, the capacitated clustering problem can be expressed as a linear program, but solving it that way is time-consuming.
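For reference, one common integer-programming formulation is sketched below, in the style of the capacitated p-median problem (the notation is mine; the exact formulation in the literature may differ). Here d_ij is the distance from customer i to candidate centre j, q_i is the demand of customer i, Q is the capacity limit, x_ij indicates assignment and y_j indicates whether centre j is opened:

```latex
\begin{aligned}
\min \;& \sum_{i}\sum_{j} d_{ij}\, x_{ij} \\
\text{s.t. } & \sum_{j} x_{ij} = 1 && \forall i \quad \text{(each customer in exactly one cluster)} \\
& \sum_{i} q_i\, x_{ij} \le Q\, y_j && \forall j \quad \text{(cluster demand within capacity)} \\
& x_{ij} \in \{0,1\},\; y_j \in \{0,1\}
\end{aligned}
```

The number of binary variables grows with customers times candidate centres, which is why solving this exactly becomes slow for large instances.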

I came across a paper (Eetha, S. (2009). Improved K-Means Algorithm for Capacitated Clustering Problem.) where capacitated clustering was done with a variant of K-means. The logic adopted there seemed pretty straightforward and easy to use. I have modified it in some aspects and implemented it in R. The results I am getting are pretty encouraging.

The function takes 4 inputs:

  1. Input: a data frame with the details of the customers, including their respective latitude, longitude and demand.
  2. Capacity constraint: the upper limit of capacity for each cluster.
  3. relax_c: the relaxation constant, ideally between 0.7 and 1. This factor is multiplied by the capacity constraint so that the number of clusters comes out slightly larger than the optimal number, giving us some flexibility. The default value is 0.8.
  4. iterations: the number of iterations used to find a proper set of centroids for the clusters.
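To give a feel for how the capacity constraint enters the assignment step, here is a rough sketch of a capacity-aware assignment in R. This is my own simplified illustration, not the function from the repository or the exact algorithm from the paper: the function name, the greedy visiting order and the plain Euclidean distance are all assumptions made for brevity.

```r
# Sketch only: greedy capacity-aware assignment of customers to fixed centroids.
# A centroid stops accepting customers once its accumulated demand would
# exceed the relaxed capacity (capacity * relax_c).
assign_with_capacity <- function(df, centroids, capacity, relax_c = 0.8) {
  limit <- capacity * relax_c
  load <- rep(0, nrow(centroids))        # demand already assigned to each centroid
  cluster <- integer(nrow(df))
  for (i in sample(nrow(df))) {          # visit customers in random order
    # Euclidean distance to every centroid (a real model would use geo-distance)
    d <- sqrt((centroids$lat - df$lat[i])^2 + (centroids$long - df$long[i])^2)
    for (k in order(d)) {                # try centroids from nearest to farthest
      if (load[k] + df$demand[i] <= limit) {
        cluster[i] <- k
        load[k] <- load[k] + df$demand[i]
        break
      }
    }
  }
  cluster                                # 0 means no feasible centroid was found
}
```

In the full algorithm this assignment step would alternate with recomputing centroids, just as in ordinary K-means.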

The Github link:


# Creating a data frame of dummy customers
# (the original snippet was truncated after the lat column; the long and
# demand columns below are reconstructed placeholders)
customer_details <- data.frame(lat = runif(50, 77, 78),
                               long = runif(50, 22, 23),
                               demand = round(runif(50, 10, 50)))

# Calling the function
cap <- capacitated_clustering(customer_details, 500, 0.8, 15)

# Converting the result into a data.table, then computing the
# convex hull of each cluster for plotting
library(data.table)
cap_dt <- data.table(cap)
hulls <- cap_dt[, .SD[chull(long, lat)], by = .(cluster)]

library(ggplot2)
ggplot() +
  geom_point(data = cap_dt, aes(x = long, y = lat, color = as.factor(cluster))) +
  # alpha is a constant, so it belongs outside aes()
  geom_polygon(data = hulls, aes(x = long, y = lat, fill = as.factor(cluster)),
               alpha = 0.5) +
  theme(legend.position = 'none')

There is some overlap, but overall it looks good!

If this can be improved in any way, then please let me know by leaving a comment.


Solving a capacitated VRP with the nearest neighbor heuristic in R

Google OR-Tools has dedicated functions and packages for Python, C++, Java and C#, but nothing is available for R. So, in this post, I tackle the capacitated VRP in R with the help of the nearest neighbor heuristic. The results obtained are encouraging, which is why I am sharing them in this blog post. The main aim is to improve this approach further through open-source collaboration. The GitHub link is shared at the end of this post. Before proceeding further, let's have a look at some details of the VRP function:

  1. The function accepts a list of vehicles with distinct ids and capacities. It is assumed that the total capacity of the vehicles is enough to meet the customer demand for the given day.
  2. The nearest neighbor heuristic is a greedy algorithm, which means it searches for a local optimum. To find a better result, a user-defined parameter (either the total running time or the number of iterations) is introduced, and the best solution found across these iterations is kept.
  3. To compare solutions, a cost function needs to be introduced. Here the cost function is designed with a simple assumption: the cost of each vehicle depends on the duration for which it has been hired, and if two vehicles are hired for the same duration, the one with more capacity costs more.
  4. One advantage of this approach is that both homogeneous and heterogeneous fleets can be accommodated in the model without any extra effort.
  5. The model can be built using either distance or travel time, whichever the designer prefers. For distance the geodist package can be used; for travel time the osrm package, or any model built for a specific problem statement, can be utilised. In this example travel time has been computed assuming a speed of 9 km/hr.
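The core loop described in the points above can be sketched as follows. This is my own minimal variant, not the function from the repository: the names, the depot argument and the plain Euclidean distance (instead of OSRM travel times) are simplifying assumptions.

```r
# Minimal nearest-neighbour CVRP sketch. Each vehicle starts at the depot and
# repeatedly visits the nearest unserved customer that still fits its
# remaining capacity; when nothing fits, the next vehicle takes over.
nn_cvrp <- function(customers, vehicles, depot) {
  routes <- vector("list", nrow(vehicles))
  unserved <- rep(TRUE, nrow(customers))
  for (v in seq_len(nrow(vehicles))) {
    cap <- vehicles$capacity[v]
    pos <- depot                          # c(lon, lat) of the current position
    route <- integer(0)
    repeat {
      ok <- which(unserved & customers$demand <= cap)
      if (length(ok) == 0) break          # nothing left that fits this vehicle
      d <- sqrt((customers$lon[ok] - pos[1])^2 + (customers$lat[ok] - pos[2])^2)
      nxt <- ok[which.min(d)]             # nearest feasible customer
      route <- c(route, nxt)
      cap <- cap - customers$demand[nxt]
      unserved[nxt] <- FALSE
      pos <- c(customers$lon[nxt], customers$lat[nxt])
    }
    routes[[v]] <- route
  }
  routes
}
```

Randomising the customer or vehicle order and repeating this for a user-defined number of iterations, keeping the cheapest solution under the cost function, gives the iterative scheme described in point 2.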

The code is available here:

I have been listening: Analysing my stats

The site I use to track my listening is great for anyone who listens to music (which, I guess, is everyone) and loves data. I have been using it to track what I have been listening to since May 2020. So, after almost a year of using the site, I decided to decode how I have been listening. This article is mostly focused on the question of 'how' rather than 'what', because the website pretty much tells every user what they listen to. If anyone is interested in extracting insights from their own listening habits, refer to the link provided below; I am sharing the link to my R script file at the bottom. The only thing you have to do is download your data from the website. The timeframe considered for this study is June 2020 to May 2021.

Distribution of songs per day:

As can be seen, the distribution of the number of songs listened to per day is right-skewed. Right-skewed data generally has many low values and a long tail of high values. In this case, that translates into something like this: there have been many days when I listened to very few songs, where 'very few' means a number well below the daily mean and median.
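As an aside, the daily counts behind a plot like this can be derived from the raw timestamps in one line; the shape of the input (a vector of Scrobble timestamps) is an assumption about the exported data.

```r
# Count the number of Scrobbles per calendar day from a vector of timestamps.
daily_counts <- function(times) as.vector(table(as.Date(times)))
```

For a right-skewed variable like this, the mean of `daily_counts(...)` will sit noticeably above its median, which is a quick numerical check of what the histogram shows.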

Distribution of songs per month:

This distribution is also right-skewed, but it has an added feature: it is multimodal. One peak is larger than the other two, which are similar in size. This suggests that the number of Scrobbles per month falls into two or three groups: one where the number is high, one medium, and one low (high, medium and low with respect to the whole dataset, of course). This can be summarised better by looking at the number of songs per month over one year:

As is evident, the number has varied hugely: from 2000 in June to less than 750 in February.

Listening habits based on the time of day:

Next up, I have gone a bit deeper and tried to decipher whether I have listened to more songs during the day or at night. For that, I have classified day and night in the following way:

  1. If the time is after 5 am and before 5 pm, it is day
  2. If it is after 5 pm and before 5 am, it is night
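The two rules above amount to a simple function of the hour of the timestamp; here is a base-R sketch (the function name is mine, and the actual script may organise this differently):

```r
# Classify each timestamp as "day" (05:00-16:59) or "night" (17:00-04:59),
# matching the 5 am / 5 pm cut-offs described above.
classify_day_night <- function(timestamps) {
  h <- as.integer(format(timestamps, "%H"))
  ifelse(h >= 5 & h < 17, "day", "night")
}
```

Applying this to every Scrobble and tabulating by month gives the day/night proportions plotted below.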

Based on this assumption, this is how I have listened to songs with respect to each month:

Nothing is very obvious from this plot but a few points:

  1. For most months, the proportion of songs Scrobbled during the day is higher.
  2. The proportion of songs at night is higher for June and July. That makes sense, because during the lockdown in 2020 I used to stay awake late!
  3. From February to March, the proportion at night is lower than in other months. During this time I started going to the university, then had Covid, and then got a job. So that is also expected! I try to follow a routine, you see.

From the box plots for day and night, it can be seen that the median number of songs listened to is higher for the day than for the night. There are also many more outliers for night than for day, which suggests there have been many nights when I listened to more songs than I usually do at night. Don't we all have those nights when we immerse ourselves in the melodies of our favourite artists and lose track of time?

GitHub link for the file:

Implementing Sweep Algorithm for Constrained Clustering in R

Constraint-based clustering is a type of clustering where we take into account not only the similarity of the points but also specific constraints, whether inherent to the data or manually imposed.
The sweep algorithm is a popular algorithm for constrained clustering and vehicle routing problems. The basic idea is that we 'sweep' the destinations in angular order around a centre and cluster nearby ones together. The objective looks something like this:

Here we can see that the depot serves each cluster. Now, to implement this in real life, I have made a modification: not only will each point belong to a cluster as in the image above, but each cluster will also have a specific capacity that it must not exceed. This can be useful if someone plans to use a single vehicle for each cluster.
First, I ordered the points clockwise with respect to the centre. After that, the clusters were formed with respect to the points' demands. The R code implementing this is below. Let's first create a dummy data frame and load the tidyverse package.


library(tidyverse)

# Dummy customers around (88.5, 22.4) with normally distributed demands
id <- 1:600
lon <- rnorm(600, 88.5, 0.125)
lat <- rnorm(600, 22.4, 0.15)
demand <- round(rnorm(600, 40, 20))

df <- data.frame(id, lon, lat, demand)

Now let’s create a cluster function directly. It will do both the sorting and clustering:

constrained_cluster <- function(df, truck_load = 170,
                                lat_median = median(df$lat),
                                lon_median = median(df$lon)){
  # Order the points clockwise around the centre
  df$angle <- atan2(df$lat - lat_median, df$lon - lon_median)
  df <- df[order(df$angle, decreasing = TRUE), ]
  cluster_list <- c()
  i <- 1; d <- 0; cluster_number <- 1
  while (i <= length(df$angle)){
    d <- d + df$demand[i]
    if (d <= truck_load){            # point fits: assign it to the current cluster
      cluster_list[i] <- cluster_number
      i <- i + 1
    } else {                         # capacity exceeded: open a new cluster
      cluster_number <- cluster_number + 1
      d <- 0
    }
  }
  df$cluster_list <- cluster_list
  df
}

Next step, let’s call the function and visualize the clusters. Let’s see if they resemble the initial picture:

df_with_cluster<- constrained_cluster(df)

ggplot(data=df_with_cluster) +
  geom_point(aes(x=lon, y=lat, colour=factor(cluster_list)))+
  theme(legend.position = 'none')

Voila! It is done!