Say you reside in England. Consider taking a train from point A to B. You have an idea of how long the trip will take based on the current train schedule but have uncertainty on its latency (i.e. possible time delay). The operators of the trains only tell you at best the fraction that arrive within a few minutes of their scheduled time. In other words, you have no idea of the fatness of the tails, and whether you will get delayed by tens of minutes on a seemingly short run.
How would you derive the probability of a specific lateness of time dt based only on the knowledge of the distance X between point A and B, the maximum train speed Vm, and an average train speed (same as distance X divided by the average trip duration t )?
I won't solve this problem in anyway remotely that I would consider a classical approach. Instead I will make an assumption based on the principle of maximum entropy (which I tend to use for everything I run across these days).
Let us see how close we can come to the empirical distribution of train delay times observed based on assuming very limited infomation.
First of all, I have rather limited experiences travelling by train in England. I do know that the speed of a train can vary quite a bit as I have experienced the crawl from Heathrow to Victoria Station. I also know that the train has some maximum speed that it won't exceed. You realize this when you notice that the train rarely arrives early. So the average train speed and maximum train speed provides a pair of constraints that we can use for estimating a train velocity probability density function (PDF).
I assert via maximum entropy (MaxEnt) arguments that the train velocity (or speed for purists) PDF likely looks like this:
 In accordance with the constraints, MaxEnt predicts an exponential profile up to the maximum value, which ends at 1.44 miles/minute in the histogram.  I would describe it as a dispersive velocity profile following a reverse damped exponential (the damping takes place for smaller velocities, see this post for a similar behavior describing TCP statistics):
In accordance with the constraints, MaxEnt predicts an exponential profile up to the maximum value, which ends at 1.44 miles/minute in the histogram.  I would describe it as a dispersive velocity profile following a reverse damped exponential (the damping takes place for smaller velocities, see this post for a similar behavior describing TCP statistics):p(V) = 1/dv * exp ((V-Vm)/dv)where V ranges from 0 to Vm. The factor dv relates to the average speed by
dv = Vm - X/tThe smaller that dv becomes, the closer that the average speed approaches the maximum speed. Since we don't know anything more about the distributions of speeds, MaxEnt suggests that this exponential "approximation" will work just fine.
The only remaining step we need to do involves some probability theory to relate this velocity distribution to a time delay distribution. Fortunately, once you get the hang of it, one can just about do this in your sleep.
As the easiest approach, figure out the cumulative probability of velocities that will reach the destination in time T. This becomes the integral over p(v) from a just-in-time speed X/T to the maximum speed Vm:
Integral of p(v) from v=X/T to VmThis results in the cumulative probability distribution function (upper-case P):
P(T) = 1 - exp ((X/T-Vm)/dv)To turn this into a probability density function, we simply need to take the derivative with respect to T.
p(T) = Vm*Tm/(dv*T2) * exp(-dt*Vm/(dv*T))We can also cast it in terms of the time delay by
T = dt + X/Vmso
p(dt) = (X/dv) exp (-dt*Vm/(dv*(dt+X/Vm)) / (dt+X/Vm)2This might seem like a complicated expression but all of the parameters are well known but one, dv. And even in this case, we can estimate dv from some data. The following plot illustrates how the PDF changes with dv. Note that as dv becomes small, the probability of arriving on time becomes closer to 1 (e.g. the Swiss or German train system)

By definition, we have a fat-tail probability distribution because the density follows off as an inverse power law of exponent 2.
So we need some data to check how good this distribution works. My interest in the topic of train scheduling actually came from a paper by an expert in the field of superstatistics [1]:
"Modeling train delays with q-exponential functions"Superstatistics may work as an alternate technique to solve the problem but as used by Briggs and Beck, it requires a couple of arbitrary parameters to fit the distribution to. This actually points to the difference between solving a problem as I am trying to do, versus blindly employing arbitrary statistical functions as Briggs and Beck attempt.
Keith Briggs and Christian Beck
Physica A: Statistical Mechanics and its Applications
Volume 378, Issue 2, 15 May 2007, Pages 498-504
arxiv.org
In any case, the authors did the hard work of collating statistics of over 2 million train departures covering 23 train stations in the British Rail Network.
Most of the statistics look generally similar but I take the route from Reading to London Paddington station as a trial run.
 Both the superstatistics approach and my entropic dispersion solution fit the data remarkably well.  One can see clearly that the profile neither follows Poisson statistics (which would give a straight line) nor does it follow normal Gaussian statistics (which would show an upside-down parabolic shape on a semi-log graph).
Both the superstatistics approach and my entropic dispersion solution fit the data remarkably well.  One can see clearly that the profile neither follows Poisson statistics (which would give a straight line) nor does it follow normal Gaussian statistics (which would show an upside-down parabolic shape on a semi-log graph).The shape of the tail points to the real fat-tail probabilities that occur on British rail lines -- as we can see long delays do occur and the power law likely comes from the principle of entropic dispersion.
So to recap, the values I used stem from real observables:
X = 36 miles (exact)
Vm = 1.44 miles/minute (from the schedule, see below)
dv = 0.26 miles/minute (from the curve fit or alternatively from the average latency)

I stated earlier that most people would approach this problem from the perspective of Poisson statistics using time as the varying parameter. Briggs and Beck do this as well, but they use another layer of probability, called superstatistics [1] to "fatten" the tail. Although I appreciate the idea of superstatistics and have used before without realizing it had a name (see my previous post on hyperbolic discounting of an example of a doubly integrated distribution function ), I believe that entropic dispersion of velocities gives a more parsimonious explanation.
Chalk up another success story for entropic dispersion, especially as it shows consistency with the data on the entropic dispersion for all forms of human travel.
Yet, why does no one else use this approach? Does no one understand how to do word problems any longer? I swear that a smart high school student could derive the solution if given the premise.
[1] "Superstatistics", Beck C.; Cohen E.G.D. Physica A, Volume 322, 1 May 2003, pp. 267-275(9)