Portcast provides predictions for arrival/departure time of container vessels at ports. To be able to generate such results, we need to know the routes between different ports. Each route is defined by a sequence of points and each point is represented as a (latitude, longitude) coordinate.

The route between 2 ports is obtained by analysing the historical AIS (think of it like GPS but for container vessels) data of vessels that travelled between those 2 ports. We cannot simply compute the theoretical shortest route because there are practical and political concerns for container vessels and optimum route is not always the shortest route.

However, historical AIS data includes a huge amount of points throughout the journey between any 2 ports. How do we decide which points are critical and which are redundant so that we save only the necessary points to form the route in our database? This is where Ramer–Douglas–Peucker algorithm comes in handy.

Ramer–Douglas–Peucker (RDP)

Given a path composed of line segments, RDP finds a simplified path with a subset of points from the original one such that the difference between the simplified path and the original is within a tolerable range. The algorithm defines 'difference' based on the maximum distance between the original path and the simplified path.

Intuitively, this algorithm removes the points on a straight line but keeps all the turning points. The simplified path should preserve the rough shape of the original one but consist only of a subset of the points.

Algorithm

Suppose we have $n$ points $P_1,\dots,P_n$. Let $C$ be a path created from the $n$ points consisting of $n-1$ line segments $\{\overline{P_1P_2},\overline{P_2P_3},\dots,\overline{P_{n-1}P_n}\}$. Let $d(P_k,\overline{P_iP_j})$ be the distance between $P_k$ and $\overline{P_{i}P_{j}}$. Let $C'$ be a simplified path which uses a subset of points from the original one. The difference $d_H(C, C')$ between $C$ and $C'$ is defined as (simplified Hausdorff distance):

$$ d_H(C, C') = \max_{i\leq k\leq j}d(P_k,\overline{P_i P_j}), \overline{P_iP_j}\in C' $$

The goal is to find the $C'$ which has the minimum number of line segments that satisfies $d_H(C, C') \leq\epsilon$, where $\epsilon>0$ is the maximum difference that we deem tolerable.

Illustration of RDP:

  1. Consider the path $\tilde{C}$ which only contains the line segment from start to end:

    $$ \tilde{C} = \{\overline{P_1P_n}\} $$

  2. Find $d_H(C,\tilde{C})$

    $$ d_H(C,\tilde{C}) = \max_{i=2,\dots,n-1}d(P_{i}, \overline{P_{1}P_{n}})

    $$

  3. If $d_H(C,\tilde{C})\leq\epsilon$, return the path $\tilde{C}$

  4. Else, suppose $\max_{i=2,\dots,n-1}d(P_{i}, \overline{P_{1}P_{n}})=d(P_{m}, \overline{P_{1}P_{n}})$, split $C$ into two paths

    $$ C_{1} = \{\overline{P_1P_2},\overline{P_2P_3},\dots,\overline{P_{m-1}P_m}\} $$

    $$ C_{2} = \{\overline{P_mP_{m+1}},\overline{P_{m+1}P_{m+2}},\dots,\overline{P_{n-1}P_n}\} $$