This vignette describes a possible workflow for calibrating and internally validating an OptirrigCORE model. It is not intended as the only valid way to calibrate the model. Other strategies may be more appropriate depending on the crop, the dataset, and the calibration objective.
The aim is not to document one specific crop or one specific experiment, but to explain how a calibration can be organized when several parameters, several observed variables, and several trials are used together.
The general logic presented here is adapted from previous applied calibration and internal validation workflows, but is rewritten here in a more generic OptirrigCORE framework.
The workflow combines four steps:
The main objective is to obtain one final parameter set that is not only good on one calibration subset, but also reasonably stable when transferred to held-out trials.
The same sequence is repeated for each split scenario:
nRMSE criterion
computed on all matched observation-simulation pairs in the calibration
subset;This structure separates the search into two complementary phases. The screening phase explores the parameter space broadly. The local optimization phase refines a promising region identified during screening. The split-sample phase then checks whether the result depends too strongly on one particular subset of the data.
Again, this workflow should be read as a proposed methodology, not as a mandatory OptirrigCORE calibration protocol.
Before running any optimization, the calibration target must be defined explicitly. This includes:
This step is essential. An optimizer can always return a numerical solution, but that solution is only meaningful if the calibration problem has been defined in a scientifically interpretable way.
The list of calibrated parameters depends on the crop, the model version, and the observations available. In practice, the parameters may describe canopy development, biomass production, yield formation, assimilate partitioning, water stress response, rooting depth, soil exploration, or phenology.
The list should remain parsimonious. If too many parameters are calibrated with respect to the amount of information available in the observations, several very different parameter sets may produce similar scores. This equifinality makes the final calibration harder to interpret and less reliable outside the calibration context.
The objective function can combine dynamic
observations, such as LAI or biomass time series,
and terminal observations, such as final yield or other
end-of-cycle production variables.
These two types of observations constrain different parts of the model:
Using both types of information is generally more informative than calibrating only on final values. It reduces the risk of obtaining a parameter set that reproduces the final yield for the wrong physiological reasons.
In some applied calibration workflows, the performance criterion has
been simplified relative to a previous sequential multi-metric score.
Instead of combining several indicators and ranking trials one by one,
the workflow uses one main criterion: a weighted nRMSE.
The same criterion is used during screening, local optimization, and held-out evaluation. This makes the successive steps easier to compare and easier to interpret.
In practical terms, dynamic canopy variables and end-of-cycle production variables may need to remain jointly present in the score because they inform different parts of the system, whereas a sparsely observed auxiliary variable may be too weakly documented to drive the optimization on its own.
For a candidate parameter set indexed by \(r\) and a retained variable \(v\), let \(I_v\) be the set of matched observation-simulation pairs available in the current calibration scenario. A weighted root mean square error is first computed for each variable:
\[ RMSE_r^{(v)} = \sqrt{ \frac{ \sum_{i \in I_v} \omega_i^{(v)} \left(S_{i,r}^{(v)} - O_i^{(v)}\right)^2 }{ \sum_{i \in I_v} \omega_i^{(v)} } } \]
This quantity is then normalized:
\[ nRMSE_r^{(v)} = \frac{RMSE_r^{(v)}}{s_v} \]
The final score can then be written more explicitly. For a three-variable implementation:
\[ \mathrm{Score}_r = \frac{p_1 \, nRMSE_r^{(1)} + p_2 \, nRMSE_r^{(2)} + p_3 \, nRMSE_r^{(3)}}{\sum_{i=1}^{3} p_i} \]
where:
Lower values of \(\mathrm{Score}_r\) indicate a better agreement between simulations and observations after normalization and weighting.
A key methodological change in this type of workflow is that the score is no longer computed trial by trial before aggregation. Instead, it is computed directly on the full set of matched observation-simulation pairs available in the calibration scenario.
The normalization factor \(s_v\) must be defined once for each observed variable. Typical choices are:
The same normalization rule must be used in all split scenarios. Otherwise, calibration and validation scores cannot be compared consistently across splits.
The weighted nRMSE is useful here because it remains
interpretable when variables have very different units and
magnitudes.
The selected rule should also avoid numerical instability. For example, if a variable can have values close to zero, using its mean as the denominator may produce excessively large normalized residuals.
Weights can be applied at two levels.
First, variable weights \(p_v\) define the relative importance of each retained variable in the final score. If no specific calibration objective is targeted, a simple default is to use equal weights, that is \(p_v = 1\) for every retained variable.
Second, observation-level weights \(\omega_i^{(v)}\) adjust the influence of
individual matched observation-simulation pairs inside the
variable-specific RMSE. This is useful when some variables
or trials contain denser observation series than others.
In this formulation, the two levels remain separate: \(\omega_i^{(v)}\) acts within each variable-specific error term, whereas \(p_v\) is used only when combining normalized errors across variables. Whatever weighting scheme is chosen should then be kept consistent across all splits.
A trial with one isolated observation does not provide the same information as a trial with a complete time series. Both can be included, but their contribution should be controlled.
One simple option is to assign a lower observation-level weight to isolated measurements. Another option is to normalize the contribution by trial or by variable so that dense time series do not dominate the objective function only because they contain more points.
The chosen approach depends on the scientific objective. If the seasonal dynamics are the main calibration target, dense trajectories should naturally carry more information. If each trial should contribute more equally, then a trial-level or variable-level balancing rule is preferable.
In this workflow, observations are pooled across trials and dates within the same calibration subset instead of computing one score per trial and then averaging those trial scores. This is mainly a practical strategy for sparse datasets. In many experiments, collecting these observations is logistically demanding, so the amount of observed data per trial can be limited. Pooling all available observations helps use that limited information more efficiently and produces a more stable global score.
This is not the only valid choice. Trial-by-trial scoring can also be relevant, especially when dense observations are available and when equal weighting of trials is part of the calibration objective.
The split-sample procedure used here is an internal robustness check. It should not be interpreted as a fully independent external validation.
The available trials are divided into \(G\) coherent groups, for example
A, B, C, D, and so
on. Then \(G\) split scenarios are
created. In split \(k\), one group is
excluded from calibration and used as the held-out validation group,
while all other groups are merged and used together as the calibration
subset.
Each group is therefore used:
The held-out group is then moved until each group has played the validation role once. Across these rotations, all available data are used in calibration, and all available data are also used once as held-out validation.
The number of groups is not fixed. It depends on the size of the dataset, the heterogeneity of the trials, the experimental contexts that must remain intact, and the amount of information required in each calibration subset.
For example, with four groups A, B,
C, and D, the split structure is:
| Validation group | Calibration groups |
|---|---|
A |
B + C + D |
B |
A + C + D |
C |
A + B + D |
D |
A + B + C |
This is only an example. The method applies to any number of groups, provided each group is coherent and each calibration subset contains enough information to support the optimization.
Groups should preserve the structure of the data. Experimental contexts should not be split into artificial fragments only to obtain equal group sizes.
The practical difficulty is to fill these groups as evenly as possible with the available trials while keeping each trial and its modalities together.
When constructing groups, the following elements should be considered:
Two groups with the same number of trials can have very different
information content. For example, one group may contain detailed
LAI and biomass time series, while another may contain
mainly final yield values.
Within one split, only the calibration subset is used to rank screened parameter sets and to run the local optimization. The held-out group is not used during screening or optimization.
After optimization, the selected parameter set is evaluated on the held-out group. This score measures how well the calibration transfers to data that were not used to estimate the parameters in that split.
This separation must remain strict. If the held-out group influences the choice of candidates, the local optimization, or the stopping decision, it is no longer a validation subset for that split.
The screening stage starts from lower and upper bounds for each calibrated parameter. These bounds define the search domain.
They should be broad enough to allow real exploration, but not so broad that a large fraction of candidate parameter sets becomes biologically implausible or numerically useless.
Bounds can be based on:
The role of the bounds is not to guess the final answer. Their role is to define a credible domain where the optimizer is allowed to search.
Candidate parameter sets are sampled within the predefined bounds. This can be done with random sampling or with a space-filling design such as Latin hypercube sampling.
A full factorial grid is usually avoided because the number of combinations increases rapidly with the number of parameters. Screening should explore the space broadly without making the computation impossible.
The number of screened candidates, denoted here as \(n_{\mathrm{screen}}\), is a compromise. Too few candidates make the search superficial. Too many candidates increase runtime without necessarily improving the final calibration.
Invalid combinations can be filtered after sampling or rejected during model evaluation.
For each sampled parameter set and for each split, the workflow is:
A screened parameter set is therefore evaluated against the full calibration subset of the current split, not independently for each trial before averaging trial-level scores.
The screened candidates are then ranked from the lowest score to the highest score. The best candidates identify promising regions of the parameter space.
Screening is not expected to provide the final parameter set. Its role is to identify useful starting points for local optimization and to reveal whether several distinct regions of the parameter space produce similar scores.
The local optimizer should not be initialized from several nearly identical parameter sets. After screening, the workflow therefore retains the best candidates that are also sufficiently distinct from one another.
A practical approach is:
The distance should be computed in normalized parameter space. For parameter \(j\), the normalized coordinate is:
\[ z_j = \frac{x_j - l_j}{u_j - l_j} \]
where \(x_j\) is the parameter value, and \(l_j\) and \(u_j\) are its lower and upper bounds.
Computing distances after normalization prevents parameters with large numerical ranges from dominating the distance calculation. An Euclidean distance can then be used in this normalized space.
The minimum distance threshold should be fixed before examining the final results, because this choice can influence which local regions are explored.
For a problem with \(p\) calibrated parameters, a Nelder-Mead simplex contains \(p + 1\) vertices.
If the implementation allows a user-defined initial simplex, the
selected distinct candidates can be used directly as the simplex
vertices. The best screened candidate is used as the main vertex, and
the next p sufficiently distinct candidates form the
remaining vertices.
If the implementation accepts only one starting point, the best screened candidate is used as the starting point, and the simplex is generated internally by the optimizer or by a controlled perturbation of that starting point.
This distinction is important for reproducibility, because not all implementations of Nelder-Mead expose the initial simplex in the same way.
For each split scenario, the local optimization is run only on the calibration subset of that split. Each split has its own screening scores, its own selected starting candidates, and its own optimized parameter set. The local refinement step uses the Nelder-Mead simplex method (Nelder and Mead, 1965).
This is necessary because the best region of the parameter space may change when a different group is held out.
In this workflow, Nelder-Mead is used for local refinement. It does not replace the broad screening stage.
Its role is to:
A poor screening step cannot be reliably repaired by Nelder-Mead alone. If the screening misses the relevant region of the parameter space, the local optimizer may converge to a suboptimal solution.
Standard Nelder-Mead is not naturally a bounded optimization method. Therefore, parameter bounds must be enforced explicitly.
This can be done by one of the following approaches:
The chosen approach should be documented, because it affects both the final parameter values and the reproducibility of the calibration.
For each split scenario, the following outputs should be retained:
Keeping these outputs separately for each split makes it possible to identify whether one group systematically destabilizes the calibration.
At the end of the split-sample procedure, one optimized parameter set is available for each held-out group.
The final objective is usually not to keep several independent parameterizations. The objective is to derive one common final parameter set that performs well across the dataset and remains biologically plausible.
The final selection should consider:
The optimum with the lowest score in one split is not automatically the best final parameter set. It may be too adapted to the calibration subset used in that split.
A typical warning sign is a parameter set with an excellent calibration score in one split but weaker or unstable performance when evaluated across other split configurations.
The final parameter set should therefore be selected as a cross-split compromise, not as the winner of one isolated optimization run.
Several strategies can be used to define the final parameter set.
One option is to retain the split-specific optimum that gives the best average performance when it is recalculated across all calibration and validation subsets.
Another option is to compute a robust centre of the split-specific optima, for example a median parameter vector, and then evaluate this candidate explicitly on every split.
A third option is to identify parameters that are stable across splits and focus additional analysis on parameters that remain unstable.
Whatever the strategy, the final candidate should be re-evaluated on:
Once held-out split scores are used to choose the final common parameter set, these scores become part of internal model selection. They should no longer be presented as a fully independent validation of the final model.
If a strict final validation is required, an additional independent dataset must be kept aside and used only once, after the final parameter set has been fixed.
The explicit local optimization step described in this vignette relies on the Nelder-Mead simplex method:
The proposed workflow follows a clear numerical hierarchy. First, the calibration target is defined. Then the parameter space is explored broadly with screening. Promising regions are refined with Nelder-Mead. Finally, split-specific results are compared to select one stable and interpretable final parameter set.
This workflow does not remove all calibration difficulties. In particular, it does not eliminate equifinality, uncertainty in observations, or structural model error. However, it makes the calibration process clearer, easier to reproduce, and easier to diagnose.
It should also be read for what it is: one proposed calibration and internal validation strategy among several possible ones, not the only path available in OptirrigCORE.