---
title: "Calibration and internal validation methodology with OptirrigCORE"
output:
  rmarkdown::html_vignette:
    css: styles/vignette.css
vignette: >
  %\VignetteIndexEntry{Calibration and internal validation methodology with OptirrigCORE}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE, file = 'styles/setup.R'}
rm(list = ls())
gc()
```

## Introduction

This vignette describes a possible workflow for calibrating and internally
validating an **OptirrigCORE** model. It is not intended as the only
valid way to calibrate the model. Other strategies may be more appropriate
depending on the crop, the dataset, and the calibration objective.

The aim is not to document one specific crop or one specific experiment, but to
explain how a calibration can be organized when several parameters, several
observed variables, and several trials are used together.

The general logic presented here is adapted from previous applied calibration
and internal validation workflows, but is rewritten here in a more generic
OptirrigCORE framework.

The workflow combines four steps:

1. a clear definition of the calibration problem;
2. a broad exploration of the parameter space;
3. a local refinement of promising parameter sets with **Nelder-Mead**;
4. an internal split-sample evaluation to assess how stable the calibration is
   across groups of trials.

The main objective is to obtain one final parameter set that is not only good
on one calibration subset, but also reasonably stable when transferred to
held-out trials.

## Overview of the workflow

The same sequence is repeated for each split scenario:

1. define the calibrated parameters, their bounds, and the observations included
   in the objective function;
2. divide the available trials into coherent groups;
3. hold out one group for internal validation and use the remaining groups for
   calibration;
4. sample candidate parameter sets within the predefined bounds;
5. run the model for each candidate on the calibration subset;
6. score each candidate with one weighted `nRMSE` criterion computed on all
  matched observation-simulation pairs in the calibration subset;
7. retain the best candidates while avoiding near-duplicate parameter sets;
8. initialize a local **Nelder-Mead** optimization from the selected candidates;
9. evaluate the optimized parameter set on both the calibration subset and the
   held-out group;
10. repeat the procedure until each group has been used once as the held-out
    validation group;
11. compare all split-specific optima and derive one final common parameter set.

This structure separates the search into two complementary phases. The
screening phase explores the parameter space broadly. The local optimization
phase refines a promising region identified during screening. The split-sample
phase then checks whether the result depends too strongly on one particular
subset of the data.

Again, this workflow should be read as a proposed methodology, not as a
mandatory OptirrigCORE calibration protocol.

## Defining the calibration problem

Before running any optimization, the calibration target must be defined
explicitly. This includes:

- the parameters allowed to vary;
- the lower and upper bounds of each parameter;
- the observed variables retained for calibration;
- the normalization rule used to compare variables with different units;
- the weighting rule used in the objective function;
- the construction of the calibration and validation groups.

This step is essential. An optimizer can always return a numerical solution, but
that solution is only meaningful if the calibration problem has been defined in
a scientifically interpretable way.

### Calibrated parameters

The list of calibrated parameters depends on the crop, the model version, and
the observations available. In practice, the parameters may describe canopy
development, biomass production, yield formation, assimilate partitioning, water
stress response, rooting depth, soil exploration, or phenology.

The list should remain parsimonious. If too many parameters are calibrated with
respect to the amount of information available in the observations, several very
different parameter sets may produce similar scores. This equifinality makes the
final calibration harder to interpret and less reliable outside the calibration
context.

### Observations used for calibration

The objective function can combine **dynamic observations**, such as `LAI` or
biomass time series, and **terminal observations**, such as final yield or
other end-of-cycle production variables.

These two types of observations constrain different parts of the model:

- dynamic variables help constrain the simulated trajectory during the season;
- terminal variables help constrain the final outcome at harvest or at the end
  of the simulated cycle.

Using both types of information is generally more informative than calibrating
only on final values. It reduces the risk of obtaining a parameter set that
reproduces the final yield for the wrong physiological reasons.

## Objective function

### Rationale

In some applied calibration workflows, the performance criterion has been
simplified relative to a previous sequential multi-metric score. Instead of combining
several indicators and ranking trials one by one, the workflow uses one main
criterion: a weighted `nRMSE`.

The same criterion is used during screening, local optimization, and held-out
evaluation. This makes the successive steps easier to compare and easier to
interpret.

In practical terms, dynamic canopy variables and end-of-cycle production
variables may need to remain jointly present in the score because they inform
different parts of the system, whereas a sparsely observed auxiliary variable
may be too weakly documented to drive the optimization on its own.

### Mathematical formulation

For a candidate parameter set indexed by $r$ and a retained variable $v$, let
$I_v$ be the set of matched observation-simulation pairs available in the
current calibration scenario. A weighted root mean square error is first
computed for each variable:

$$
RMSE_r^{(v)} =
\sqrt{
\frac{
\sum_{i \in I_v} \omega_i^{(v)} \left(S_{i,r}^{(v)} - O_i^{(v)}\right)^2
}{
\sum_{i \in I_v} \omega_i^{(v)}
}
}
$$

This quantity is then normalized:

$$
nRMSE_r^{(v)} = \frac{RMSE_r^{(v)}}{s_v}
$$

The final score can then be written more explicitly. For a three-variable
implementation:

$$
\mathrm{Score}_r =
\frac{p_1 \, nRMSE_r^{(1)} + p_2 \, nRMSE_r^{(2)} + p_3 \, nRMSE_r^{(3)}}{\sum_{i=1}^{3} p_i}
$$

where:

- $r$ indexes one candidate parameter set;
- $I_v$ is the set of matched observation-simulation pairs for variable $v$ in
  the current calibration scenario;
- $O_i^{(v)}$ is the observed value of pair $i$ for variable $v$;
- $S_{i,r}^{(v)}$ is the simulated value of pair $i$ for variable $v$ under
  candidate parameter set $r$;
- $\omega_i^{(v)}$ is the observation-level weight assigned to pair $i$ for
  variable $v$;
- $s_v$ is the normalization factor used for variable $v$;
- $p_1$, $p_2$, and $p_3$ are the variable-level weights in the explicit
  three-variable formulation;
- $nRMSE_r^{(1)}$, $nRMSE_r^{(2)}$, and $nRMSE_r^{(3)}$ are the normalized
  errors of the three retained variables.

Lower values of $\mathrm{Score}_r$ indicate a better agreement between
simulations and observations after normalization and weighting.

A key methodological change in this type of workflow is that the score is no
longer computed trial by trial before aggregation. Instead, it is
computed directly on the full set of matched observation-simulation pairs
available in the calibration scenario.

### Normalization

The normalization factor $s_v$ must be defined once for each observed variable.
Typical choices are:

- the mean of the observed values;
- the observed range;
- a fixed reference value based on expert knowledge or previous studies.

The same normalization rule must be used in all split scenarios. Otherwise,
calibration and validation scores cannot be compared consistently across splits.

The weighted `nRMSE` is useful here because it remains interpretable when
variables have very different units and magnitudes.

The selected rule should also avoid numerical instability. For example, if a
variable can have values close to zero, using its mean as the denominator may
produce excessively large normalized residuals.

### Weighting

Weights can be applied at two levels.

First, **variable weights** $p_v$ define the relative importance of each
retained variable in the final score. If no specific calibration objective is
targeted, a simple default is to use equal weights, that is $p_v = 1$ for every
retained variable.

Second, **observation-level weights** $\omega_i^{(v)}$ adjust the influence of
individual matched observation-simulation pairs inside the variable-specific
`RMSE`. This is useful when some variables or trials contain denser observation
series than others.

In this formulation, the two levels remain separate: $\omega_i^{(v)}$ acts
within each variable-specific error term, whereas $p_v$ is used only when
combining normalized errors across variables. Whatever weighting scheme is
chosen should then be kept consistent across all splits.

### Handling sparse and dense observations

A trial with one isolated observation does not provide the same information as a
trial with a complete time series. Both can be included, but their contribution
should be controlled.

One simple option is to assign a lower observation-level weight to isolated
measurements. Another option is to normalize the contribution by trial or by
variable so that dense time series do not dominate the objective function only
because they contain more points.

The chosen approach depends on the scientific objective. If the seasonal
dynamics are the main calibration target, dense trajectories should naturally
carry more information. If each trial should contribute more equally, then a
trial-level or variable-level balancing rule is preferable.

In this workflow, observations are pooled across trials and dates within the
same calibration subset instead of computing one score per trial and then
averaging those trial scores. This is mainly a practical strategy for sparse
datasets. In many experiments, collecting these observations is logistically
demanding, so the amount of observed data per trial can be limited. Pooling all
available observations helps use that limited information more efficiently and
produces a more stable global score.

This is not the only valid choice. Trial-by-trial scoring can also be relevant,
especially when dense observations are available and when equal weighting of
trials is part of the calibration objective.

## Internal split-sample validation

### Principle

The split-sample procedure used here is an internal robustness check. It should
not be interpreted as a fully independent external validation.

The available trials are divided into $G$ coherent groups, for example `A`,
`B`, `C`, `D`, and so on. Then $G$ split scenarios are created. In split $k$,
one group is excluded from calibration and used as the held-out validation
group, while all other groups are merged and used together as the calibration
subset.

Each group is therefore used:

- once as a validation group;
- $G - 1$ times as part of a calibration subset.

The held-out group is then moved until each group has played the validation
role once. Across these rotations, all available data are used in calibration,
and all available data are also used once as held-out validation.

The number of groups is not fixed. It depends on the size of the dataset, the
heterogeneity of the trials, the experimental contexts that must remain intact,
and the amount of information required in each calibration subset.

For example, with four groups `A`, `B`, `C`, and `D`, the split structure is:

| Validation group | Calibration groups |
|:--|:--|
| `A` | `B + C + D` |
| `B` | `A + C + D` |
| `C` | `A + B + D` |
| `D` | `A + B + C` |

This is only an example. The method applies to any number of groups, provided
each group is coherent and each calibration subset contains enough information
to support the optimization.

### Constructing the groups

Groups should preserve the structure of the data. Experimental contexts should
not be split into artificial fragments only to obtain equal group sizes.

The practical difficulty is to fill these groups as evenly as possible with the
available trials while keeping each trial and its modalities together.

When constructing groups, the following elements should be considered:

- the number of trials;
- the number of trial-treatment combinations;
- the number of observations for each retained variable;
- the timing of the observations;
- the diversity of climatic, agronomic, and water stress conditions;
- the presence or absence of dynamic and terminal measurements.

Two groups with the same number of trials can have very different information
content. For example, one group may contain detailed `LAI` and biomass time
series, while another may contain mainly final yield values.

### Interpretation of one split

Within one split, only the calibration subset is used to rank screened parameter
sets and to run the local optimization. The held-out group is not used during
screening or optimization.

After optimization, the selected parameter set is evaluated on the held-out
group. This score measures how well the calibration transfers to data that were
not used to estimate the parameters in that split.

This separation must remain strict. If the held-out group influences the choice
of candidates, the local optimization, or the stopping decision, it is no longer
a validation subset for that split.

## Screening the parameter space

### Parameter bounds

The screening stage starts from lower and upper bounds for each calibrated
parameter. These bounds define the search domain.

They should be broad enough to allow real exploration, but not so broad that a
large fraction of candidate parameter sets becomes biologically implausible or
numerically useless.

Bounds can be based on:

- previously calibrated values;
- published parameter ranges;
- expert knowledge;
- sensitivity analyses;
- biological and agronomic constraints;
- numerical stability observed in previous simulations.

The role of the bounds is not to guess the final answer. Their role is to define
a credible domain where the optimizer is allowed to search.

### Candidate generation

Candidate parameter sets are sampled within the predefined bounds. This can be
done with random sampling or with a space-filling design such as Latin hypercube
sampling.

A full factorial grid is usually avoided because the number of combinations
increases rapidly with the number of parameters. Screening should explore the
space broadly without making the computation impossible.

The number of screened candidates, denoted here as $n_{\mathrm{screen}}$, is a
compromise. Too few candidates make the search superficial. Too many candidates
increase runtime without necessarily improving the final calibration.

Invalid combinations can be filtered after sampling or rejected during model
evaluation.

### Candidate evaluation

For each sampled parameter set and for each split, the workflow is:

1. insert the parameter values into the model inputs;
2. run the model on every trial in the calibration subset of the current split;
3. align simulated outputs with the retained observations;
4. compute normalized residuals;
5. apply variable and observation weights;
6. aggregate the variable-specific weighted errors into the final score
  $\mathrm{Score}_r$.

A screened parameter set is therefore evaluated against the full calibration
subset of the current split, not independently for each trial before averaging
trial-level scores.

The screened candidates are then ranked from the lowest score to the highest
score. The best candidates identify promising regions of the parameter space.

Screening is not expected to provide the final parameter set. Its role is to
identify useful starting points for local optimization and to reveal whether
several distinct regions of the parameter space produce similar scores.

## From screening to local optimization

### Selecting distinct starting candidates

The local optimizer should not be initialized from several nearly identical
parameter sets. After screening, the workflow therefore retains the best
candidates that are also sufficiently distinct from one another.

A practical approach is:

1. rank all screened candidates by their calibration score;
2. retain the best candidate;
3. move down the ranking and retain a new candidate only if it is far enough
   from all previously retained candidates;
4. continue until the required number of candidates has been selected.

The distance should be computed in normalized parameter space. For parameter
$j$, the normalized coordinate is:

$$
z_j = \frac{x_j - l_j}{u_j - l_j}
$$

where $x_j$ is the parameter value, and $l_j$ and $u_j$ are its lower and
upper bounds.

Computing distances after normalization prevents parameters with large numerical
ranges from dominating the distance calculation. An Euclidean distance can then
be used in this normalized space.

The minimum distance threshold should be fixed before examining the final
results, because this choice can influence which local regions are explored.

### Initializing Nelder-Mead

For a problem with $p$ calibrated parameters, a Nelder-Mead simplex contains
$p + 1$ vertices.

If the implementation allows a user-defined initial simplex, the selected
distinct candidates can be used directly as the simplex vertices. The best
screened candidate is used as the main vertex, and the next `p` sufficiently
distinct candidates form the remaining vertices.

If the implementation accepts only one starting point, the best screened
candidate is used as the starting point, and the simplex is generated internally
by the optimizer or by a controlled perturbation of that starting point.

This distinction is important for reproducibility, because not all
implementations of Nelder-Mead expose the initial simplex in the same way.

## Local optimization with Nelder-Mead

For each split scenario, the local optimization is run only on the calibration
subset of that split. Each split has its own screening scores, its own selected
starting candidates, and its own optimized parameter set. The local refinement
step uses the Nelder-Mead simplex method (Nelder and Mead, 1965).

This is necessary because the best region of the parameter space may change when
a different group is held out.

### Role of Nelder-Mead

In this workflow, Nelder-Mead is used for local refinement. It does not replace
the broad screening stage.

Its role is to:

- start from a promising region found during screening;
- adjust parameter values continuously rather than only choosing among sampled
  candidates;
- improve the pooled weighted score on the calibration subset.

A poor screening step cannot be reliably repaired by Nelder-Mead alone. If the
screening misses the relevant region of the parameter space, the local optimizer
may converge to a suboptimal solution.

### Bounds during local optimization

Standard Nelder-Mead is not naturally a bounded optimization method. Therefore,
parameter bounds must be enforced explicitly.

This can be done by one of the following approaches:

- transforming bounded parameters into an unconstrained scale before
  optimization;
- adding a strong penalty when the optimizer proposes values outside the bounds;
- clipping or rejecting invalid proposals;
- using a bounded variant of the algorithm, if available.

The chosen approach should be documented, because it affects both the final
parameter values and the reproducibility of the calibration.

### Outputs retained for each split

For each split scenario, the following outputs should be retained:

- the validation group and calibration groups;
- the parameter bounds;
- the screened candidates and their calibration scores;
- the selected candidates used to initialize the local search;
- the local optimization settings;
- the final optimized parameter set;
- the final calibration score;
- the validation score on the held-out group;
- simulated-versus-observed outputs for diagnostic plots.

Keeping these outputs separately for each split makes it possible to identify
whether one group systematically destabilizes the calibration.

## Selecting the final parameter set

At the end of the split-sample procedure, one optimized parameter set is
available for each held-out group.

The final objective is usually not to keep several independent parameterizations.
The objective is to derive one common final parameter set that performs well
across the dataset and remains biologically plausible.

The final selection should consider:

- the average calibration performance across splits;
- the average held-out validation performance across splits;
- the variability of scores across validation groups;
- the biological plausibility of the parameter values;
- the stability of each parameter across split-specific optima;
- the consistency between dynamic variables and terminal variables.

### Why the best split-specific optimum is not necessarily the final choice

The optimum with the lowest score in one split is not automatically the best
final parameter set. It may be too adapted to the calibration subset used in
that split.

A typical warning sign is a parameter set with an excellent calibration score in
one split but weaker or unstable performance when evaluated across other split
configurations.

The final parameter set should therefore be selected as a cross-split
compromise, not as the winner of one isolated optimization run.

### Practical final-selection strategies

Several strategies can be used to define the final parameter set.

One option is to retain the split-specific optimum that gives the best average
performance when it is recalculated across all calibration and validation
subsets.

Another option is to compute a robust centre of the split-specific optima, for
example a median parameter vector, and then evaluate this candidate explicitly
on every split.

A third option is to identify parameters that are stable across splits and focus
additional analysis on parameters that remain unstable.

Whatever the strategy, the final candidate should be re-evaluated on:

- each calibration subset;
- each held-out validation group;
- optionally, the full pooled dataset.

Once held-out split scores are used to choose the final common parameter set,
these scores become part of internal model selection. They should no longer be
presented as a fully independent validation of the final model.

If a strict final validation is required, an additional independent dataset must
be kept aside and used only once, after the final parameter set has been fixed.

## Methodological reference

The explicit local optimization step described in this vignette relies on the
Nelder-Mead simplex method:

- Nelder, J. A. and Mead, R. (1965). A simplex method for function
  minimization. *The Computer Journal*, 7(4), 308-313.

## Conclusion

The proposed workflow follows a clear numerical hierarchy. First, the
calibration target is defined. Then the parameter space is explored broadly with
screening. Promising regions are refined with Nelder-Mead. Finally,
split-specific results are compared to select one stable and interpretable final
parameter set.

This workflow does not remove all calibration difficulties. In particular, it
does not eliminate equifinality, uncertainty in observations, or structural
model error. However, it makes the calibration process clearer, easier to
reproduce, and easier to diagnose.

It should also be read for what it is: one proposed calibration and internal
validation strategy among several possible ones, not the only path available in
OptirrigCORE.