Blog

Uber H3 in R: functions and use cases

This is an extension of the earlier post on h3jsr library functionality. I have been working extensively with the library over the past two weeks and have come up with use cases that can come in handy while working with real-life datasets.

I wrote a detailed post about this on Medium: https://shibaprasadb.medium.com/uber-h3-in-r-functions-and-use-cases-3675c5113ce2

Connecting the dots with heuristics and empirics

Warning: This is kind of an opinionated article, based on the little experience I have. The world is Bayesian, and I may update my beliefs as I come across more intriguing problems in the near future.

One of my professors at Jadavpur University always used to emphasize "connecting the dots". He used to say that to become an asset to an organisation, one needs to master this art. Over the last 16 months of working, heuristics and empirics have been two of my biggest 'friends' when it boiled down to connecting the dots.

When I first started working, I was given a variant of a Facility Location Problem. My manager said something at the time: "Try coming up with your own heuristics. They are more flexible." At first, I found it hard to understand, but over time I realised how right he was.

When it comes to mathematical modelling (and if you're using an open-source solver), there are a few trade-offs that you need to take care of. If the problem size is huge and there are too many real-life constraints, then you're better off using your own heuristics than aiming for an exact solution. I remember one time I wrote a MILP, and it took almost 2 hours to build the formulation and for the solver to solve it. Then I thought of changing the method; it took me almost 3 days to come up with my own heuristic. The result? The solution time went down to ~100 seconds. Yes! Not even kidding! The solution cost was higher than that of the exact solution, but if one wanted to put the whole thing into production, the heuristic would have been the far better choice. I am planning to write an academic paper on this, so I am not going into the details of the method right now.

Next, I remember I had data at the Pincode level for different Indian cities. I had to convert them to latitude and longitude points so that every point would lie approximately around the given Pincode. This has to be one of the most interesting data pre-processing tasks I have done so far. So what did I do? I used numerical methods! More precisely: I formulated the problem through an approximate equation and then used the Newton-Raphson method (I learned this in my 3rd semester of Engineering and never thought it would come in handy like this). Bam! Solved!
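
For reference, the Newton-Raphson iteration itself is only a few lines of R. The snippet below is a generic, minimal sketch: the function f and its derivative here are placeholders, not the actual equation I used for the Pincode problem.

# Generic Newton-Raphson root finder: a minimal sketch, not the actual
# Pincode equation (f and f_prime below are placeholders).
newton_raphson <- function(f, f_prime, x0, tol = 1e-8, max_iter = 100) {
  x <- x0
  for (i in seq_len(max_iter)) {
    step <- f(x) / f_prime(x)
    x <- x - step
    if (abs(step) < tol) break   # stop once the update becomes tiny
  }
  x
}

# Toy usage: solve x^2 - 2 = 0
newton_raphson(function(x) x^2 - 2, function(x) 2 * x, x0 = 1)  # ~1.414214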

Now, as I said earlier, it is about those trade-offs. There will be situations when building a solution based on your own heuristic is not worth it. For me, in 15+ months so far, it has happened once. I had to solve a knapsack problem in R. The knapsack function in R was erratic, and writing my own logic for it didn't seem worth it, as it was just a sub-problem of a bigger problem. So, I just formulated it as a small mathematical program and solved it (a toy version of this formulation is sketched below). The solution time was very short and the results were satisfactory!
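
To illustrate, this is roughly how a small knapsack sub-problem can be handed straight to a solver in R. This is a hedged sketch: the choice of the lpSolve package and the toy numbers are my assumptions here, not necessarily what was used in the original problem.

# Toy knapsack written as a small integer program with lpSolve
# (an illustrative sketch; the package and the numbers are assumptions).
library(lpSolve)

value    <- c(60, 100, 120, 40)   # value of each item
weight   <- c(10, 20, 30, 15)     # weight of each item
capacity <- 50

sol <- lp(direction    = "max",
          objective.in = value,
          const.mat    = matrix(weight, nrow = 1),
          const.dir    = "<=",
          const.rhs    = capacity,
          all.bin      = TRUE)    # binary decision: take the item or not

sol$solution   # which items to pick
sol$objval     # total value of the selected items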

Nowadays, the term "connecting the dots" seems very pertinent to me for these reasons. And two of the best tools for doing that have been heuristics and empirics. The only downside is that the art of connecting the dots isn't taught proactively; it is something that you need to acquire. And yes, one more important thing that I strongly feel: before anything, aligning oneself with the business is the most crucial thing. If you're not doing that religiously, it will be impossible to take a little help from these two friends!

Capacitated Clustering with improved K-means in R

Often in supply chain design, we need to divide the whole customer space into segments/polygons/clusters so that each segment does not cross a given threshold of demand. This is achieved with a special type of clustering: capacitated clustering.

Capacitated Clustering refers to a type of clustering where the following two conditions are generally met:

  1. The points within a cluster will be in the same neighborhood
  2. The total demand of a cluster will not exceed a given upper limit

Condition 1 can easily be achieved with K-means or hierarchical clustering, but it is impossible to constrain the demand of the clusters to a specific limit with these two.

In general, the capacitated clustering problem can be expressed as a mathematical program (Page 3: http://www.din.uem.br/~ademir/sbpo/sbpo2017/pdf/168777.pdf), but solving it exactly is time-consuming.
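
For context, a generic model along these lines (a sketch in standard notation, not necessarily the exact formulation on page 3 of the linked paper) uses x_ij to assign customer i to cluster j, y_j to open cluster j, q_i for demand, Q for the cluster capacity and p for the number of clusters:

\begin{aligned}
\min\ & \sum_{i}\sum_{j} d_{ij}\, x_{ij} \\
\text{s.t.}\ & \sum_{j} x_{ij} = 1 && \forall i \\
& \sum_{i} q_i\, x_{ij} \le Q\, y_j && \forall j \\
& \sum_{j} y_j = p \\
& x_{ij},\, y_j \in \{0,1\}
\end{aligned}

Solving a model like this exactly with an open-source solver can take a long time for realistically sized instances, which is what motivates the K-means-based approach below.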

I came across a paper (Eetha, S. (2009). Improved K-Means Algorithm for Capacitated Clustering Problem.) where capacitated clustering was done with a variant of K-means. The logic adopted there seemed pretty straightforward and easy to use. I have modified it in some aspects and implemented it in R. The results I am getting are pretty encouraging.
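
The general idea of the capacity-aware step can be sketched as follows: each customer is assigned to the nearest centroid that still has spare capacity. This is a simplified illustration only, not the actual code from the repository linked below (which also re-computes the centroids over the iterations); the column names lat, long and demand match the example further down.

# Simplified sketch of one capacity-aware assignment pass: every customer is
# assigned to the nearest centroid that still has spare capacity. An
# illustration of the general idea, not the code from the linked repository.
assign_with_capacity <- function(customers, centroids, capacity) {
  remaining <- rep(capacity, nrow(centroids))
  cluster   <- rep(NA_integer_, nrow(customers))
  # serve customers with larger demand first so they are harder to strand
  for (i in order(-customers$demand)) {
    dists <- sqrt((centroids$lat - customers$lat[i])^2 +
                  (centroids$long - customers$long[i])^2)
    for (j in order(dists)) {
      if (remaining[j] >= customers$demand[i]) {
        cluster[i]   <- j
        remaining[j] <- remaining[j] - customers$demand[i]
        break
      }
    }
  }
  cluster   # NA means no centroid had enough spare capacity for that customer
}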

The function takes 4 inputs:

  1. Input: a data frame with the details of the customers, i.e. their respective latitude, longitude and demand.
  2. Capacity constraint: the upper limit of capacity for each cluster.
  3. relax_c: the relaxation constant. Ideally this should be between 0.7 and 1. This factor is multiplied by the capacity constraint so that the number of clusters comes out larger than the optimal number, which gives us some flexibility. The default value is 0.8.
  4. iterations: the number of iterations used to arrive at a proper set of centroids for the clusters.

The Github link: https://github.com/shibaprasadb/capacitated_clustering_improved_kmeans

Example:

library(data.table)
library(ggplot2)

set.seed(4)
# Creating a dummy data frame (coordinates roughly around central India)
customer_details <- data.frame(lat = runif(50, 23, 24),
                               long = runif(50, 77, 78),
                               demand = runif(50, 50, 100))

# Calling the function
cap <- capacitated_clustering(customer_details, 500, 0.8, 15)

# Converting the result to a data.table and computing each cluster's convex hull
cap_dt <- data.table(cap)
hulls <- cap_dt[, .SD[chull(long, lat)], by = .(cluster)]

# Visualising
ggplot() +
  geom_point(data = cap_dt, aes(x = long, y = lat, color = as.factor(cluster))) +
  geom_polygon(data = hulls, aes(x = long, y = lat, fill = as.factor(cluster)), alpha = 0.5) +
  theme_bw() +
  coord_equal() +
  theme(legend.position = 'none')

There is a little overlap, but overall it looks good!

If this can be improved in any way, then please let me know by leaving a comment.

Solving a capacitated VRP with the nearest neighbor heuristic in R

Google OR-Tools has dedicated functions and packages for Python, C++, Java and C#, but there is nothing available for R. So, in this post, I aim to tackle the capacitated VRP in R with the help of the nearest neighbor heuristic. The result obtained here is encouraging, which is why I am sharing it in this blog post. The main aim is to improve this approach further through open-source collaboration; the GitHub link is shared at the end of this post. Before proceeding further, let's have a look at some of the details of the VRP function:

  1. The function accepts a list of vehicles with distinct ids and capacities. It is assumed that the capacities of the vehicles are enough to meet the customer demand for the given day.
  2. The nearest neighbor heuristic is an example of a greedy algorithm, which means it searches for a locally optimal solution. To find a better result, a user-defined parameter (either the total run time or the number of iterations) is introduced, and the best solution across these iterations is kept (a minimal sketch of the route-construction step is shown after this list).
  3. To pick the best solution, a cost function needs to be introduced. Here the cost function is designed with a simple assumption: the cost of each vehicle depends on the duration for which it has been hired, and if two vehicles are hired for the same duration, the one with the larger capacity costs more.
  4. One advantage of this approach is that both homogeneous and heterogeneous fleets can be accommodated in the model without any extra effort.
  5. The model can be built using either distance or travel time, whichever the designer prefers. For distance, Geodist can be used; for travel time, the OSRM package or any model built for the specific problem statement can be used. In this example, travel time has been computed assuming a speed of 9 km/hr.
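
To make point 2 a bit more concrete, here is a minimal sketch of a single nearest-neighbor construction pass for one vehicle. The column names, the straight-line distance and the depot handling are simplifying assumptions of mine; the full multi-vehicle, iterated version lives in the repository linked below.

# Minimal sketch of one nearest-neighbor pass for a single vehicle:
# starting from the depot, repeatedly visit the nearest unserved customer
# that still fits within the remaining capacity. Straight-line distance is
# used purely for illustration; the repository version works with travel time.
nearest_neighbor_route <- function(customers, depot, capacity) {
  route     <- integer(0)
  remaining <- capacity
  current   <- depot                      # c(lat, long) of the starting point
  unserved  <- rep(TRUE, nrow(customers))
  repeat {
    feasible <- which(unserved & customers$demand <= remaining)
    if (length(feasible) == 0) break      # vehicle is full or everyone is served
    dists <- sqrt((customers$lat[feasible]  - current[1])^2 +
                  (customers$long[feasible] - current[2])^2)
    nxt <- feasible[which.min(dists)]     # greedy choice: nearest feasible customer
    route     <- c(route, nxt)
    remaining <- remaining - customers$demand[nxt]
    current   <- c(customers$lat[nxt], customers$long[nxt])
    unserved[nxt] <- FALSE
  }
  route                                   # customer row indices in visiting order
}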

The code is available here: https://github.com/shibaprasadb/Capacitated_VRP_with_nearest_neighbor

I have been listening: Analysing my Last.fm stats

Last.fm is a great website for anyone who listens to music (which I guess is everyone) and loves data. I have been using the site to track what I have been listening to since May 2020. So, after using the site for almost a year, I decided to decode how I have been listening. This article is mostly focused on the question of 'how' rather than 'what', because the website pretty much tells you what any user listens to. If anyone is interested in extracting insights from their own listening habits, refer to the link provided below; I am sharing the link to my R script at the bottom. The only thing you have to do is download your data from the website. The timeframe considered for this study is June 2020 to May 2021.
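
For reference, this is roughly how the downloaded scrobble file can be read and aggregated per day. The file name, the absence of a header row and the column names (artist, album, track, timestamp) are assumptions about the export format and may need tweaking; the full script is in the GitHub link at the bottom.

# Rough sketch of reading the Last.fm export and counting songs per day.
# File name, header layout and timestamp format are assumptions.
library(tidyverse)
library(lubridate)

scrobbles <- read_csv("lastfm_export.csv",
                      col_names = c("artist", "album", "track", "timestamp")) %>%
  mutate(played_at = dmy_hm(timestamp))   # assumes e.g. "01 Jun 2020 21:15"

songs_per_day <- scrobbles %>%
  mutate(date = as_date(played_at)) %>%
  count(date, name = "songs")

head(songs_per_day)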

Distribution of songs per day:

As can be seen, the distribution of the number of songs listened to per day is right-skewed. Right-skewed data generally means that most values sit towards the lower end, with a long tail of high values. For this case, it translates into something like this: there have been many days when I listened to very few songs, where 'very few' means a number well below the daily mean and median.

Distribution of songs per month:

This graph is also right-skewed, but it has an added feature: it is multimodal. There is one peak that is larger than the other two, which are similar to each other. This suggests that the number of scrobbles per month falls into two or three groups: one where the number is high, one medium, and one low. These high, medium and low distinctions are, of course, relative to the whole dataset. It can be summarised better if we take a look at the number of songs per month over one year:

As is evident, the number has varied hugely: from around 2000 in June to less than 750 in February.

Listening habits based on the time of day:

Next, I went a bit deeper and tried to decipher whether I listen to more songs during the day or at night (a short R sketch of this split follows the list below). For that, I classified day and night in the following way:

  1. If the time is after 5 am and before 5 pm then it is day
  2. If it is after 5 pm and before 5 am then it is night
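
In R, this split is essentially a one-liner on the scrobble timestamps. A small sketch, assuming the played_at column from the read-in sketch earlier:

# Day/night split based on the 5 am / 5 pm cut-off. Assumes `scrobbles`
# has a parsed `played_at` timestamp column (see the earlier sketch).
library(tidyverse)
library(lubridate)

scrobbles <- scrobbles %>%
  mutate(period = if_else(hour(played_at) >= 5 & hour(played_at) < 17,
                          "Day", "Night"))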

Based on this assumption, this is how I have listened to songs with respect to each month:

Nothing very obvious stands out from this plot, but a few points:

  1. For most of the months, the proportion of songs scrobbled during the day is higher.
  2. The proportion of songs scrobbled at night is higher for June and July. That makes sense, because during the lockdown in 2020, I used to stay awake late!
  3. From February to March, the proportion for night is lower than in other months. During this time, I started going to the university, then had Covid, and then got the job. So that is also expected! I try to follow a routine, you see.

From the box plot for day and night, it can be seen that the median number of songs listened to is higher for the day than for the night. There are also many more outliers for night than for day, which suggests there have been many nights when I listened to far more songs than I usually do at night. Don't we all have those nights when we immerse ourselves in the melodies of our favourite artists and lose track of time?

GitHub link for the file: https://github.com/shibaprasadb/LastFM

Implementing Sweep Algorithm for Constrained Clustering in R

Constraint-based clustering is a unique way of clustering where we take into account not only the similarity of the points but also specific constraints that they might have or that we impose manually.
The Sweep algorithm is a popular algorithm for constrained clustering and vehicle routing problems. The basic idea is that we 'sweep' the destinations around a central point and cluster them together as we go. The basic objective looks something like this:

Here we can see that the depot will serve each cluster. Now, to implement it in real life, I have made a modification: not only will each point belong to a cluster as in the image above, but each cluster will also have a specific capacity that it will not exceed. This may be useful if someone is planning to use a single vehicle for a single cluster.
First, the points are ordered in a clockwise manner with respect to the centre. After that, the clusters are formed with respect to the demands. The R code for implementing this is below. Let's first create a dummy data frame and load the tidyverse package.

library(tidyverse)

set.seed(123)
# Dummy customer data: ids, coordinates and demands
id <- 1:600
lon <- rnorm(600, 88.5, 0.125)
lat <- rnorm(600, 22.4, 0.15)
demand <- round(rnorm(600, 40, 20))

df <- data.frame(id, lon, lat, demand)

Now let's write the clustering function. It does both the sorting and the clustering:

constrained_cluster <- function(df, truck_load = 170,
                                lat_median = median(df$lat),
                                lon_median = median(df$lon)) {
  # Sweep step: order the points by their angle around the centre point
  df$angle <- atan2(df$lat - lat_median, df$lon - lon_median)
  df <- df[order(df$angle, decreasing = TRUE), ]

  # Clustering step: keep adding points to the current cluster until the
  # truck load would be exceeded, then start a new cluster
  # (assumes no single demand exceeds truck_load)
  d <- 0
  cluster_number <- 1
  cluster_list <- integer(nrow(df))
  i <- 1
  while (i <= nrow(df)) {
    if (d + df$demand[i] <= truck_load) {
      d <- d + df$demand[i]
      cluster_list[i] <- cluster_number
      i <- i + 1
    } else {
      # start a new cluster and reconsider the same point
      cluster_number <- cluster_number + 1
      d <- 0
    }
  }
  cbind(df, cluster_list)
}

Next, let's call the function and visualise the clusters to see if they resemble the initial picture:

df_with_cluster<- constrained_cluster(df)

ggplot(data=df_with_cluster) +
  geom_point(aes(x=lon, y=lat, colour=factor(cluster_list)))+
  theme_bw()+
  theme(legend.position = 'none')

Voila! It is done!