Calibrating physical process models on concentration data can be surprisingly tricky, because multiple things need to be considered, some of which are not always obvious:
- Measurement data is non-negative, which requires a distribution constrained to the non-negative interval.
- Concentration measurements often vary multiplicatively. This is because samples are usually diluted or enriched to bring them into a measurable concentration range before measurement.
- Thus, the measurement errors of a physical process that spans several orders of magnitude during its temporal evolution increase multiplicatively.
- Different error scales have consequences for calculating the probability of a datapoint given a distribution. Using the log-normal distribution, high-concentration observations automatically have a lower probability density than low-concentration observations (see the short sketch below). More on that in Dangerous lognormal distributions over time
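As a small illustration of the last point (just a sketch, assuming a log-normal whose median equals the model prediction and whose multiplicative spread is fixed), the density at the predicted value shrinks as the prediction grows:
from scipy.stats import lognorm
sigma = 0.1  # assumed fixed multiplicative error scale, for illustration only
for prediction in [10.0, 1000.0]:
    # scipy parameterization: s is the log-space standard deviation, scale the median
    print(prediction, lognorm.pdf(prediction, s=sigma, scale=prediction))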
Simulating the accumulation of a chemical in a physical system
Below is a physical simulation of a system characterized by increasing concentrations of a molecule, its sampling, and its measurement (including an error simulation of the measurement process).
import numpy as np
import matplotlib as mpl
from scipy.stats import norm, lognorm
from scipy.integrate import solve_ivp
from matplotlib import pyplot as plt
# set up a random number generator
rng = np.random.default_rng(seed=2)
# set rc parameters for plotting
# mpl.rcParams["figure.facecolor"] = (1.0, 1.0, 1.0, 0.0)
# mpl.rcParams["axes.facecolor"] = (1.0, 1.0, 1.0, 0.0)
To simulate the system I use the uptake of a chemical from an external medium (e.g. water) with a constant concentration $C_e = 10$ g/L into an organism, where the internal concentration $C_i$ can be measured. This scenario is frequently encountered in aquatic toxicology.
To describe the process we have the uptake rate constant $k_1$ and the removal rate constant $k_2$.
Mathematically the process is described as
$$\frac{dC_i}{dt} = k_1~C_e - k_2~C_i$$
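For a constant external concentration $C_e$ and an initial internal concentration $C_i(0) = 0$, the equation also has a closed-form solution, which is useful for sanity-checking the numerical solution below:
$$C_i(t) = C_e\,\frac{k_1}{k_2}\left(1 - e^{-k_2 t}\right)$$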
The corresponding Python function is:
def accumulation(t, y, k1, k2):
    """First order differential equation model for the uptake of
    a chemical from the environment into an organism"""
    Ce, Ci = y
    dCi_dt = k1 * Ce - k2 * Ci
    # the external concentration is assumed constant, so dCe/dt = 0
    return 0, dCi_dt
We can solve the equation over a given time interval $T$ with scipy’s differential equation solver solve_ivp:
# parameters
t_max = 100
k1 = 50
k2 = 0.08
# initial values (y_0)
Ci_0 = 0.0
Ce_0 = 10.0
# define points for which the solution should be computed
T = np.linspace(start=0, stop=t_max, num=1001)
# solve
sol = solve_ivp(accumulation, t_span=(0, t_max), y0=(Ce_0, Ci_0), args=(k1, k2), t_eval=T)
Y = sol.y[1]
# plot
plt.plot(T, Y, color="black")
plt.xlabel("Time (h)")
plt.ylabel("Internal concentration $C_i$ (g/L)")
plt.xlim(0, None)
_ = plt.ylim(0, None)
The above figure shows that at an environmental concentration of 10 g/L, slow depuration in the organism leads to a large bioaccumulation. The steady state concentration can be calculated as
$$C_i = C_e \cdot \frac{k_1}{k_2} = 10 \cdot \frac{50}{0.08} = 6250$$
where $\frac{k_1}{k_2}$ is also known as the partitioning coefficient.
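As a quick sanity check, the last point of the numerical solution should already be close to this value:
# steady state from the rate constants vs. the end of the numerical solution
print(Ce_0 * k1 / k2)  # 6250.0
print(Y[-1])           # after 100 h the solution should be within a few g/L of 6250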
Measurement errors
So far no problem. We have a description of a physical process. When we have to measure this process, things start to get messy.
We have different sources of errors:
- When the organisms are small, such as Daphnia magna (the water flea) or Danio rerio (zebrafish) embryos, the same organism cannot be measured repeatedly, because it is simply too small. This means each measurement of the physical process already includes biological variation.
- Measurement instruments cannot simply chew a sample and spit out a value. Instead, the sample must be pre-processed and, if needed, diluted so that the concentrations are approximately on a similar scale.
Let’s break this down in a simulation:
Biological variability
# sampling times (more often initially to capture the dynamic)
T_sampling = [0.1, 1.5, 3, 6, 12, 24, 36, 48, 72, 96]
# T_sampling = np.linspace(0.0, 100., 1001)[1:]
# we look up the Y values at the sampling times
Y_true = Y[np.isin(T, T_sampling)]
plt.plot(T, Y, color="black")
plt.plot(T_sampling, Y_true, ls="", marker="o", color="tab:blue", label="sample")
plt.legend()
plt.xlabel("Time (h)")
plt.ylabel("Internal concentration $C_i$ (g/L)")
plt.xlim(0, None)
_ = plt.ylim(0, None)
These are the sampling times, but as mentioned earlier, sampling from the same organism is not possible. So we need to consider that organisms have slightly different kinetic parameters.
For simplicity we assume that all organisms are in the same medium, so we can ignore pipetting errors for the external medium.
For the biological variation we assume
$$k_{1,i} \sim \text{Normal}\left(k_1,~0.05~k_1\right)$$ $$k_{2,i} \sim \text{Normal}\left(k_2,~0.05~k_2\right)$$
This means the parameters of individuals in a population have a standard deviation of 5% of the population mean.
deviation = 0.05
# create probability distributions
K1 = norm(loc=k1, scale=deviation * k1)
K2 = norm(loc=k2, scale=deviation * k2)
x = np.linspace(0, 100, 1000)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.plot(x, K1.pdf(x), color="black")
ax1.set_xlim(0, None)
ax1.set_ylim(0, None)
ax1.set_xlabel("$k_1$")
ax1.set_ylabel("probability density")
x = np.linspace(0, 0.2, 1000)
ax2.plot(x, K2.pdf(x), color="black")
ax2.set_xlim(0, None)
ax2.set_ylim(0, None)
_ = ax2.set_xlabel("$k_2$")
The probability density curves of the two distributions show that parameters far from the population mean are very unlikely, so we can go on with our model and simulate all timepoints with randomly varying rate constants.
plt.plot(T, Y, color="black")
plt.xlabel("Time (h)")
plt.ylabel("Internal concentration $C_i$ (g/L)")
plt.xlim(0, None)
Y_sampling = []
for i, t in enumerate(T_sampling):
    # draw individual rate constants from the seeded rng to make them reproducible
    k1_i = float(K1.rvs(1, random_state=rng))
    k2_i = float(K2.rvs(1, random_state=rng))
    # solve the model until the sampling time t
    T_i = np.linspace(0, t, 100)
    sol = solve_ivp(accumulation, t_span=(0, t), y0=(Ce_0, Ci_0), args=(k1_i, k2_i), t_eval=T_i)
    Y_i = sol.y[1]
    # the last value is the concentration at the sampling time
    y_i = Y_i[-1]
    Y_sampling.append(y_i)
    plt.plot(T_i, Y_i, color="grey", lw=0.5)
    _ = plt.plot(t, y_i, ls="", color="tab:orange", marker="o")
_ = plt.ylim(0, None)
This figure helps a lot to understand the biological source of variation when repeatedly sampling over time. We can see that at low concentrations the errors have not yet propagated enough to cause problems, but toward the steady state approximately 10% deviation to either side emerges from 5% variation in each rate constant. This range is no surprise if we remember the formula for the partition coefficient that yields the steady state concentration. The interval that comes about from 5% variation in either direction follows from the best and worst case of the rate coefficients w.r.t. their standard deviations: $\left[\frac{0.95}{1.05} \approx 0.9,~\frac{1.05}{0.95}\approx 1.1\right]$.
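A quick back-of-the-envelope check of this interval, using the population values from above (just a sketch):
# steady states with both rate constants shifted one standard deviation (5 %)
# in the favourable / unfavourable direction, relative to the population steady state
ss = Ce_0 * k1 / k2
ss_low = Ce_0 * (k1 * 0.95) / (k2 * 1.05)
ss_high = Ce_0 * (k1 * 1.05) / (k2 * 0.95)
print(ss_low / ss, ss_high / ss)  # roughly 0.90 and 1.11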
Sample processing and measurement errors
Let’s look at the table of measurements
import pandas as pd
data = pd.DataFrame(data=np.array([T_sampling, Y_sampling]).T, columns=["Time", "C_i"])
print(data)
Time C_i
0 0.1 50.276529
1 1.5 697.143018
2 3.0 1444.040931
3 6.0 2323.923307
4 12.0 3955.080821
5 24.0 5652.516719
6 36.0 5997.241976
7 48.0 6282.754174
8 72.0 6595.700654
9 96.0 6578.585079
Pipetting errors
We can see that our samples span two orders of magnitude. To measure the samples we have to bring them down to the same order of magnitude. Assume we are super accurate and bring them all down to the level of the first sample.
Let’s spin up a calculation for dilution of a chemical.
def dilution(C, C_target, vol_sample=1, pipetting_error=0.0, rng: np.random.Generator=None, printformula=False):
    """Calculates the dilution factor and the volumes to be added for
    diluting a sample to the target concentration.
    """
    dilution_factor = C / C_target
    total_volume = dilution_factor * vol_sample
    vol_water = (total_volume - vol_sample)

    if pipetting_error == 0.0:
        C_diluted = (C * vol_sample + 0 * vol_water) / total_volume
    else:
        # perturb the pipetted volumes with a relative pipetting error
        vol_sample = rng.normal(vol_sample, vol_sample * pipetting_error)
        vol_water = rng.normal(vol_water, vol_water * pipetting_error)
        total_volume = vol_water + vol_sample
        C_diluted = (C * vol_sample + 0 * vol_water) / total_volume

    if printformula:
        print(f"C_target = ({vol_sample} ml C + {vol_water} ml H2O) ÷ {total_volume} ml Mix = {C_diluted} mg/L")

    return C_diluted, dilution_factor, total_volume, vol_water
for key, row in data.iterrows():
    dilution(C=row.C_i, C_target=1, vol_sample=1, printformula=row.Time % 1.5 == 0 or row.Time == 0.1)
C_target = (1 ml C + 49.27652942490094 ml H2O) ÷ 50.27652942490094 ml Mix = 1.0 mg/L
C_target = (1 ml C + 696.1430181055296 ml H2O) ÷ 697.1430181055296 ml Mix = 1.0 mg/L
C_target = (1 ml C + 1443.0409306159568 ml H2O) ÷ 1444.0409306159568 ml Mix = 1.0 mg/L
C_target = (1 ml C + 2322.9233072125085 ml H2O) ÷ 2323.9233072125085 ml Mix = 1.0 mg/L
C_target = (1 ml C + 3954.0808214331596 ml H2O) ÷ 3955.0808214331596 ml Mix = 1.0 mg/L
C_target = (1 ml C + 5651.516719070092 ml H2O) ÷ 5652.516719070092 ml Mix = 1.0 mg/L
C_target = (1 ml C + 5996.241975785413 ml H2O) ÷ 5997.241975785413 ml Mix = 1.0 mg/L
C_target = (1 ml C + 6281.7541738614855 ml H2O) ÷ 6282.7541738614855 ml Mix = 1.0 mg/L
C_target = (1 ml C + 6594.700654162078 ml H2O) ÷ 6595.700654162078 ml Mix = 1.0 mg/L
C_target = (1 ml C + 6577.5850791395615 ml H2O) ÷ 6578.5850791395615 ml Mix = 1.0 mg/L
We quickly see that this is not feasible, because we can hardly dilute with 6 L of ultra-pure H2O in a normal laboratory context. This is why we have to conduct a series of dilutions for some samples but not for others. Can you see the errors building up already?
Let’s say a range of 1 to 100 mg/L can be measured. We start by diluting each sample to 10% of its concentration and repeat the step if needed.
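Before running the dilution loop below, we can estimate how many 1:10 steps each sample will need (a small sketch; the loop afterwards does the same thing step by step, with pipetting noise):
# number of 1:10 dilution steps needed to bring each sample below 100 mg/L
n_estimated = np.ceil(np.log10(data.C_i / 100)).clip(lower=0).astype(int)
print(n_estimated.values)  # should match the n_dilutions column computed below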
dilution_factor = 0.1
vol_sample = 1
pipetting_error = 0.05

n_dilutions = []
Ci_diluted_final = []

for key, row in data.iterrows():
    C_i = row.C_i
    if row.Time % 1.5 == 0 or row.Time == 0.1:
        print(f"Dilution sample {key}")
        print(f"---------------------")

    n = 0
    while C_i > 100:
        C_diluted, _, vol, vol_h2o = dilution(
            C=C_i, C_target=C_i * dilution_factor, vol_sample=vol_sample,
            pipetting_error=pipetting_error, rng=rng,
            printformula=row.Time % 1.5 == 0 or row.Time == 0.1
        )
        C_i = C_diluted
        n += 1

    Ci_diluted_final.append(C_i)
    n_dilutions.append(n)

data["n_dilutions"] = n_dilutions
data["Ci_diluted"] = Ci_diluted_final
data["diultion_factor"] = dilution_factor
Dilution sample 0
---------------------
Dilution sample 1
---------------------
C_target = (1.0420732486185071 ml C + 9.084615789141308 ml H2O) ÷ 10.126689037759816 ml Mix = 71.73856004861068 mg/L
Dilution sample 2
---------------------
C_target = (1.0165285504067663 ml C + 9.184726760836618 ml H2O) ÷ 10.201255311243385 ml Mix = 143.8949216680432 mg/L
C_target = (0.9494621249923333 ml C + 9.352431448264834 ml H2O) ÷ 10.301893573257168 ml Mix = 13.261909291822489 mg/L
Dilution sample 3
---------------------
C_target = (1.1028351409171184 ml C + 8.262700873544013 ml H2O) ÷ 9.365536014461131 ml Mix = 273.65270754743295 mg/L
C_target = (0.9135294266422759 ml C + 8.322825863761055 ml H2O) ÷ 9.236355290403331 ml Mix = 27.065849370763683 mg/L
Dilution sample 4
---------------------
C_target = (1.0420729446727 ml C + 9.05792204586333 ml H2O) ÷ 10.09999499053603 ml Mix = 408.0677982386441 mg/L
C_target = (1.053917122036965 ml C + 9.325093892538373 ml H2O) ÷ 10.379011014575338 ml Mix = 41.43647587536828 mg/L
Dilution sample 5
---------------------
C_target = (1.010528590618764 ml C + 9.127817165362666 ml H2O) ÷ 10.13834575598143 ml Mix = 563.4084584460845 mg/L
C_target = (0.9915119751138433 ml C + 9.39080709504518 ml H2O) ÷ 10.382319070159022 ml Mix = 53.8055351270539 mg/L
Dilution sample 6
---------------------
C_target = (0.9435142019109622 ml C + 8.810163528219757 ml H2O) ÷ 9.75367773013072 ml Mix = 580.1383983571765 mg/L
C_target = (1.0121469426549368 ml C + 9.8106393863022 ml H2O) ÷ 10.822786328957136 ml Mix = 54.25454114740229 mg/L
Dilution sample 7
---------------------
C_target = (0.96177679421398 ml C + 8.514422793388375 ml H2O) ÷ 9.476199587602355 ml Mix = 637.661449857652 mg/L
C_target = (0.9718356400747981 ml C + 9.436172490512174 ml H2O) ÷ 10.408008130586971 ml Mix = 59.540895385377276 mg/L
Dilution sample 8
---------------------
C_target = (0.9882497246554087 ml C + 9.595956158656628 ml H2O) ÷ 10.584205883312038 ml Mix = 615.8420789662002 mg/L
C_target = (0.9063735241196988 ml C + 9.507835413855213 ml H2O) ÷ 10.414208937974912 ml Mix = 53.598209786094216 mg/L
Dilution sample 9
---------------------
C_target = (1.051743311996477 ml C + 8.361600439589331 ml H2O) ÷ 9.413343751585808 ml Mix = 735.018612086619 mg/L
C_target = (1.007679094326798 ml C + 9.54709093583754 ml H2O) ÷ 10.554770030164338 ml Mix = 70.17328536993733 mg/L
Now we can see the dilution cycles with added noise from the pipetting errors. Let’s look at the new table with the diluted samples. The diluted concentration is the true concentration that we hand to the measurement instrument.
print(data)
Time C_i n_dilutions Ci_diluted diultion_factor
0 0.1 50.276529 0 50.276529 0.1
1 1.5 697.143018 1 71.738560 0.1
2 3.0 1444.040931 2 13.261909 0.1
3 6.0 2323.923307 2 27.065849 0.1
4 12.0 3955.080821 2 41.436476 0.1
5 24.0 5652.516719 2 53.805535 0.1
6 36.0 5997.241976 2 54.254541 0.1
7 48.0 6282.754174 2 59.540895 0.1
8 72.0 6595.700654 2 53.598210 0.1
9 96.0 6578.585079 2 70.173285 0.1
Aha! So we can see that there is roughly 10% noise, and I suspect up to 20% due to 4 pipetting actions in 2 cycles with a 5% pipetting error. Usually it is less, but especially with multiple cycles the error can become quite severe. Note that we eventually scale back by dividing the measured diluted concentration by the dilution factor raised to the number of dilutions, $C_i = C_{i,dil} / 0.1^n$, where $n$ is the number of dilutions.
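A rough back-of-the-envelope check of that suspicion (a sketch, not a full error propagation): assume that in each dilution cycle the 1 ml sample volume is pipetted 5% too high and the 9 ml water volume 5% too low, i.e. one standard deviation each in the worst direction.
# worst-case deviation of one 1:10 dilution step and of two stacked steps
vs, vw = 1 * 1.05, 9 * 0.95
per_cycle = (vs / (vs + vw)) / 0.1   # realised vs. nominal dilution factor
print((per_cycle - 1) * 100)         # ≈ 9 % deviation per cycle
print((per_cycle**2 - 1) * 100)      # ≈ 20 % deviation after two cycles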
Measurement error
The final source of error before we arrive at our concentration is the measurement error of the device itself. Let’s give this another 5%:
measurement_error = 0.05
Ci_measured = rng.normal(loc=data.Ci_diluted, scale=data.Ci_diluted * measurement_error)
data["Ci_measured"] = Ci_measured
data["Ci_measured_unscaled"] = Ci_measured / data.diultion_factor ** data.n_dilutions
print(data)
Time C_i n_dilutions Ci_diluted diultion_factor Ci_measured \
0 0.1 50.276529 0 50.276529 0.1 50.497543
1 1.5 697.143018 1 71.738560 0.1 75.324417
2 3.0 1444.040931 2 13.261909 0.1 14.836686
3 6.0 2323.923307 2 27.065849 0.1 27.436560
4 12.0 3955.080821 2 41.436476 0.1 40.855573
5 24.0 5652.516719 2 53.805535 0.1 51.731191
6 36.0 5997.241976 2 54.254541 0.1 56.012564
7 48.0 6282.754174 2 59.540895 0.1 58.955221
8 72.0 6595.700654 2 53.598210 0.1 53.119186
9 96.0 6578.585079 2 70.173285 0.1 69.803964
Ci_measured_unscaled
0 50.497543
1 753.244171
2 1483.668603
3 2743.655983
4 4085.557313
5 5173.119142
6 5601.256353
7 5895.522117
8 5311.918551
9 6980.396415
Finally we can put it all together and plot our concentrations against the data. We reuse our plot from earlier and add the back-scaled measured concentrations.
data["Ci_measured"] = Ci_measured
fig, ax = plt.subplots(1,1,figsize=(6,3))
ax.plot(T, Y, color="black", label="population mean")
ax.set_xlabel("Time (h)")
ax.set_ylabel("Internal concentration $C_i$ (g/L)")
ax.set_xlim(0, None)
Y_sampling = []
for i, t in enumerate(T_sampling):
    # the random samples are seeded in order to make them reproducible
    k1_i = float(K1.rvs(1, random_state=1+i))
    k2_i = float(K2.rvs(1, random_state=2+i))
    # solve the model until t
    T_i = np.linspace(0, t, 100)
    sol = solve_ivp(accumulation, t_span=(0, t), y0=(Ce_0, Ci_0), args=(k1_i, k2_i), t_eval=T_i)
    Y_i = sol.y[1]
    # the last value is the sampling time
    y_i = Y_i[-1]
    y_obs_i = data.loc[i, "Ci_measured_unscaled"]
    Y_sampling.append(y_i)
    ax.plot(T_i, Y_i, color="grey", lw=0.5, label="y trajectory")
    _ = ax.plot(t, y_i, ls="", color="tab:orange", marker="o", label="y true")
    _ = ax.plot(t, y_obs_i, ls="", color="tab:orange", alpha=.4, marker="o", label="y measured")
    _ = ax.vlines(t, min(y_i, y_obs_i), max(y_i, y_obs_i), color="tab:orange", alpha=.4, lw=.5)
_ = ax.set_ylim(10, None)
handles, labels = ax.get_legend_handles_labels()
by_label = dict(zip(labels, handles))
_ = ax.legend(by_label.values(), by_label.keys())
fig.patch.set_alpha(0.0)
ax.patch.set_alpha(0.0)
There we are! We have much larger noise in the data at high concentrations, because the errors act multiplicatively, originating from the processes described above. The sizes of the errors are of course not fixed, and 5% error on the pipetting and the measurement may be quite high. This information should be acquired when conducting the research to see what you are getting.
Multiplicative error models are a good choice for concentration data
The first step of the analysis showed how the sources of noise build up, and it separated the individual sources of error. Next, we’ll look at the error function and show why a lognormal error model is a good model for concentrations.
error = data["Ci_measured_unscaled"] / Y_sampling
error_biased = data["Ci_measured_unscaled"] / Y_true
fig, ax = plt.subplots(1,1, figsize=(6,3))
ax.plot(T_sampling, np.log(error), ls="", marker="o", color="tab:orange", label=r"$\epsilon_{true} = \frac{C_{i,obs}}{C_{i,indiv}}$")
ax.vlines(T_sampling, 0, np.log(error), color="tab:orange", lw=0.5)
ax.plot(T_sampling, np.log(error_biased), ls="", marker="x", color="tab:orange", label=r"$\epsilon_{biased} = \frac{C_{i,obs}}{C_{i,mean}}$")
ax.hlines(0, T.min()-5, T.max()+5, color="black")
ax.set_ylabel(r"log product error ($\epsilon$)")
ax.set_xlabel("Time (h)")
ax.set_xlim(-5,T.max()+5)
_ = ax.legend()
fig.patch.set_alpha(0.0)
ax.patch.set_alpha(0.0)
These residuals look promising. At first glance, they don’t seem to follow a pattern of increasing errors with increasing concentrations. I’m currently uncertain whether the increasing number of dilution steps is already handled in this error model; I would not know which mechanism would have handled this effect.
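One quick way to eyeball this question with the data at hand (just a sketch, and with only ten samples at best indicative) is to compare the absolute log residuals grouped by the number of dilution steps:
# mean absolute log residual per number of dilution cycles
print(np.log(error).abs().groupby(data["n_dilutions"]).mean())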
When simulating many more of these data, it becomes apparent that the error is slightly smaller initially. It will be interesting to see whether this affects inference enough to require a more elaborate error function.
Note that the biased log errors, which are the measured concentrations divided by the population mean trajectory (thick black line in the previous figures), also look okay.
One can probably conclude that pipetting errors can be neglected when they are not extreme (no excessive serial dilution, relatively small pipetting error) and that multiplicative error distributions are the right model for the residual errors.
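As a minimal sketch of how one could check that conclusion on the residuals computed above: under a multiplicative (lognormal) error model the log residuals should be roughly normally distributed around zero.
from scipy.stats import shapiro
log_res = np.log(error)
print(log_res.mean(), log_res.std())  # should be close to 0 with a modest spread
print(shapiro(log_res))               # normality test; with 10 points only indicative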
In the next article, Dangerous lognormal distributions over time, we will naively apply a lognormal distribution to the previously modelled concentration time series and see how such a distribution runs into problems due to increasing variance with increasing mean values, and how this can be cured by looking only at the residual error.