Calibration and internal validation methodology with OptirrigCORE

Introduction

This vignette describes a possible workflow for calibrating and internally validating an OptirrigCORE model. It is not intended as the only valid way to calibrate the model. Other strategies may be more appropriate depending on the crop, the dataset, and the calibration objective.

The aim is not to document one specific crop or one specific experiment, but to explain how a calibration can be organized when several parameters, several observed variables, and several trials are used together.

The general logic is adapted from previous applied calibration and internal validation workflows and is restated here in a more generic OptirrigCORE framework.

The workflow combines four steps:

a clear definition of the calibration problem;
a broad exploration of the parameter space;
a local refinement of promising parameter sets with Nelder-Mead;
an internal split-sample evaluation to assess how stable the calibration is across groups of trials.

The main objective is to obtain one final parameter set that is not only good on one calibration subset, but also reasonably stable when transferred to held-out trials.

Overview of the workflow

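The overall sequence is listed below. The first two steps are carried out once. The steps from holding out a group to evaluating the optimized parameter set are repeated for each split scenario: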

define the calibrated parameters, their bounds, and the observations included in the objective function;
divide the available trials into coherent groups;
hold out one group for internal validation and use the remaining groups for calibration;
sample candidate parameter sets within the predefined bounds;
run the model for each candidate on the calibration subset;
score each candidate with one weighted nRMSE criterion computed on all matched observation-simulation pairs in the calibration subset;
retain the best candidates while avoiding near-duplicate parameter sets;
initialize a local Nelder-Mead optimization from the selected candidates;
evaluate the optimized parameter set on both the calibration subset and the held-out group;
repeat the procedure until each group has been used once as the held-out validation group;
compare all split-specific optima and derive one final common parameter set.

This structure separates the search into two complementary phases, followed by a robustness check. The screening phase explores the parameter space broadly. The local optimization phase refines a promising region identified during screening. The split-sample evaluation then checks whether the result depends too strongly on one particular subset of the data.

Again, this workflow should be read as a proposed methodology, not as a mandatory OptirrigCORE calibration protocol.

Defining the calibration problem

Before running any optimization, the calibration target must be defined explicitly. This includes:

the parameters allowed to vary;
the lower and upper bounds of each parameter;
the observed variables retained for calibration;
the normalization rule used to compare variables with different units;
the weighting rule used in the objective function;
the construction of the calibration and validation groups.

This step is essential. An optimizer can always return a numerical solution, but that solution is only meaningful if the calibration problem has been defined in a scientifically interpretable way.

Calibrated parameters

The list of calibrated parameters depends on the crop, the model version, and the observations available. In practice, the parameters may describe canopy development, biomass production, yield formation, assimilate partitioning, water stress response, rooting depth, soil exploration, or phenology.

The list should remain parsimonious. If too many parameters are calibrated with respect to the amount of information available in the observations, several very different parameter sets may produce similar scores. This equifinality makes the final calibration harder to interpret and less reliable outside the calibration context.

Observations used for calibration

The objective function can combine dynamic observations, such as LAI or biomass time series, and terminal observations, such as final yield or other end-of-cycle production variables.

These two types of observations constrain different parts of the model:

dynamic variables help constrain the simulated trajectory during the season;
terminal variables help constrain the final outcome at harvest or at the end of the simulated cycle.

Using both types of information is generally more informative than calibrating only on final values. It reduces the risk of obtaining a parameter set that reproduces the final yield for the wrong physiological reasons.

Objective function

Rationale

Some applied calibration workflows have replaced an earlier sequential multi-metric score with a simpler performance criterion. Instead of combining several indicators and ranking trials one by one, the workflow described here uses one main criterion: a weighted nRMSE (normalized root mean square error).

The same criterion is used during screening, local optimization, and held-out evaluation. This makes the successive steps easier to compare and easier to interpret.

In practical terms, dynamic canopy variables and end-of-cycle production variables may need to remain jointly present in the score because they inform different parts of the system, whereas a sparsely observed auxiliary variable may be too weakly documented to drive the optimization on its own.

Mathematical formulation

For a candidate parameter set indexed by \(r\) and a retained variable \(v\), let \(I_v\) be the set of matched observation-simulation pairs available in the current calibration scenario. A weighted root mean square error is first computed for each variable:

\[ RMSE_r^{(v)} = \sqrt{ \frac{ \sum_{i \in I_v} \omega_i^{(v)} \left(S_{i,r}^{(v)} - O_i^{(v)}\right)^2 }{ \sum_{i \in I_v} \omega_i^{(v)} } } \]

This quantity is then normalized:

\[ nRMSE_r^{(v)} = \frac{RMSE_r^{(v)}}{s_v} \]

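The variable-specific normalized errors are then combined into the final score. For a set \(V\) of retained variables with variable-level weights \(p_v\), the score is the weighted mean of the normalized errors:

\[ \mathrm{Score}_r = \frac{\sum_{v \in V} p_v \, nRMSE_r^{(v)}}{\sum_{v \in V} p_v} \]

For a three-variable implementation, this becomes:

\[ \mathrm{Score}_r = \frac{p_1 \, nRMSE_r^{(1)} + p_2 \, nRMSE_r^{(2)} + p_3 \, nRMSE_r^{(3)}}{p_1 + p_2 + p_3} \]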

where:

\(r\) indexes one candidate parameter set;
\(I_v\) is the set of matched observation-simulation pairs for variable \(v\) in the current calibration scenario;
\(O_i^{(v)}\) is the observed value of pair \(i\) for variable \(v\);
\(S_{i,r}^{(v)}\) is the simulated value of pair \(i\) for variable \(v\) under candidate parameter set \(r\);
\(\omega_i^{(v)}\) is the observation-level weight assigned to pair \(i\) for variable \(v\);
\(s_v\) is the normalization factor used for variable \(v\);
\(p_1\), \(p_2\), and \(p_3\) are the variable-level weights in the explicit three-variable formulation;
\(nRMSE_r^{(1)}\), \(nRMSE_r^{(2)}\), and \(nRMSE_r^{(3)}\) are the normalized errors of the three retained variables.

Lower values of \(\mathrm{Score}_r\) indicate a better agreement between simulations and observations after normalization and weighting.
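As a minimal sketch of the formulation above, the score can be computed as follows in Python. The function names, the dictionary layout, and the numerical values are illustrative assumptions and are not part of the OptirrigCORE interface.

```python
import math

def weighted_rmse(sim, obs, weights):
    """Weighted RMSE over the matched pairs of one variable."""
    num = sum(w * (s - o) ** 2 for s, o, w in zip(sim, obs, weights))
    return math.sqrt(num / sum(weights))

def score(pairs, norm, var_weights):
    """Weighted mean of the per-variable nRMSE values.

    pairs:       {variable: (simulated, observed, observation_weights)}
    norm:        {variable: s_v}
    var_weights: {variable: p_v}
    """
    total = 0.0
    for v, (sim, obs, w) in pairs.items():
        total += var_weights[v] * weighted_rmse(sim, obs, w) / norm[v]
    return total / sum(var_weights[v] for v in pairs)

# Hypothetical matched pairs for two variables (values are made up).
pairs = {
    "lai": ([2.0, 4.0], [1.0, 5.0], [1.0, 1.0]),
    "yield": ([9.0], [10.0], [1.0]),
}
norm = {"lai": 3.0, "yield": 10.0}
p = {"lai": 1.0, "yield": 1.0}
example_score = score(pairs, norm, p)
```

With these made-up values, the LAI nRMSE is 1/3 and the yield nRMSE is 0.1, so the equal-weight score is about 0.217.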

A key methodological change in this type of workflow is that the score is no longer computed trial by trial before aggregation. Instead, it is computed directly on the full set of matched observation-simulation pairs available in the calibration scenario.

Normalization

The normalization factor \(s_v\) must be defined once for each observed variable. Typical choices are:

the mean of the observed values;
the observed range;
a fixed reference value based on expert knowledge or previous studies.

The same normalization rule must be used in all split scenarios. Otherwise, calibration and validation scores cannot be compared consistently across splits.

The weighted nRMSE is useful here because it remains interpretable when variables have very different units and magnitudes.

The selected rule should also avoid numerical instability. For example, if a variable can have values close to zero, using its mean as the denominator may produce excessively large normalized residuals.
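The three typical choices can be sketched as a single helper that is called once per variable on the full set of observations, so that the same \(s_v\) is reused in every split. The function name, the rule labels, and the `min_value` guard are assumptions made for illustration.

```python
def normalization_factor(observed, rule="mean", reference=None, min_value=1e-9):
    """Return s_v for one variable under a chosen rule.

    rule: "mean" (mean of observations), "range" (max - min),
          or "fixed" (expert or literature reference value).
    """
    if rule == "mean":
        s = sum(observed) / len(observed)
    elif rule == "range":
        s = max(observed) - min(observed)
    elif rule == "fixed":
        if reference is None:
            raise ValueError("a reference value is required for rule='fixed'")
        s = reference
    else:
        raise ValueError(f"unknown rule: {rule}")
    # Guard against a denominator that is zero, negative, or nearly zero.
    if s < min_value:
        raise ValueError("normalization factor is too close to zero")
    return s

# Hypothetical LAI observations pooled over all trials.
lai_obs = [2.0, 4.0, 6.0]
s_mean = normalization_factor(lai_obs, "mean")
s_range = normalization_factor(lai_obs, "range")
```

Raising an error instead of silently substituting a floor value is one possible design choice. It forces the normalization rule to be reconsidered when a variable has values close to zero.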

Weighting

Weights can be applied at two levels.

First, variable weights \(p_v\) define the relative importance of each retained variable in the final score. If no specific calibration objective is targeted, a simple default is to use equal weights, that is \(p_v = 1\) for every retained variable.

Second, observation-level weights \(\omega_i^{(v)}\) adjust the influence of individual matched observation-simulation pairs inside the variable-specific RMSE. This is useful when, for a given variable, some trials contain denser observation series than others.

In this formulation, the two levels remain separate: \(\omega_i^{(v)}\) acts within each variable-specific error term, whereas \(p_v\) is used only when combining normalized errors across variables. Whatever weighting scheme is chosen should then be kept consistent across all splits.

Handling sparse and dense observations

A trial with one isolated observation does not provide the same information as a trial with a complete time series. Both can be included, but their contribution should be controlled.

One simple option is to assign a lower observation-level weight to isolated measurements. Another option is to normalize the contribution by trial or by variable so that dense time series do not dominate the objective function only because they contain more points.

The chosen approach depends on the scientific objective. If the seasonal dynamics are the main calibration target, dense trajectories should naturally carry more information. If each trial should contribute more equally, then a trial-level or variable-level balancing rule is preferable.
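One possible trial-level balancing rule is sketched below: each pair receives a weight equal to the inverse of the number of pairs from the same trial, so that every trial contributes the same total weight to the variable-specific RMSE. This rule and the trial identifiers are assumptions for illustration, not a prescribed OptirrigCORE scheme.

```python
from collections import Counter

def trial_balanced_weights(trial_ids):
    """Observation-level weights that sum to 1 within each trial."""
    counts = Counter(trial_ids)
    return [1.0 / counts[t] for t in trial_ids]

# Hypothetical case: trial T1 has a four-point series, T2 a single measurement.
trial_ids = ["T1", "T1", "T1", "T1", "T2"]
weights = trial_balanced_weights(trial_ids)
```

Here each point of T1 receives a weight of 0.25 and the single point of T2 receives a weight of 1, so the dense series no longer dominates only because it contains more points.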

In this workflow, observations are pooled across trials and dates within the same calibration subset instead of computing one score per trial and then averaging those trial scores. This is mainly a practical strategy for sparse datasets. In many experiments, collecting these observations is logistically demanding, so the amount of observed data per trial can be limited. Pooling all available observations helps use that limited information more efficiently and produces a more stable global score.

This is not the only valid choice. Trial-by-trial scoring can also be relevant, especially when dense observations are available and when equal weighting of trials is part of the calibration objective.

Internal split-sample validation

Principle

The split-sample procedure used here is an internal robustness check. It should not be interpreted as a fully independent external validation.

The available trials are divided into \(G\) coherent groups, for example A, B, C, D, and so on. Then \(G\) split scenarios are created. In split \(k\), one group is excluded from calibration and used as the held-out validation group, while all other groups are merged and used together as the calibration subset.

Each group is therefore used:

once as a validation group;
\(G - 1\) times as part of a calibration subset.

The held-out group is then moved until each group has played the validation role once. Across these rotations, all available data are used in calibration, and all available data are also used once as held-out validation.

The number of groups is not fixed. It depends on the size of the dataset, the heterogeneity of the trials, the experimental contexts that must remain intact, and the amount of information required in each calibration subset.

For example, with four groups A, B, C, and D, the split structure is:

Validation group	Calibration groups
`A`	`B + C + D`
`B`	`A + C + D`
`C`	`A + B + D`
`D`	`A + B + C`

This is only an example. The method applies to any number of groups, provided each group is coherent and each calibration subset contains enough information to support the optimization.
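The rotation of the held-out group can be sketched as follows. The group names match the example above. The trial identifiers and the dictionary layout are hypothetical.

```python
def split_scenarios(groups):
    """Build one scenario per group, holding that group out.

    groups: {group name: [trial identifiers]}
    """
    scenarios = []
    for held_out in groups:
        calibration_groups = [g for g in groups if g != held_out]
        scenarios.append({
            "validation_group": held_out,
            "validation_trials": list(groups[held_out]),
            "calibration_groups": calibration_groups,
            # All remaining groups are merged into one calibration subset.
            "calibration_trials": [t for g in calibration_groups for t in groups[g]],
        })
    return scenarios

# Hypothetical grouping of seven trials into four groups.
groups = {"A": ["t1", "t2"], "B": ["t3"], "C": ["t4", "t5"], "D": ["t6", "t7"]}
scenarios = split_scenarios(groups)
```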

Constructing the groups

Groups should preserve the structure of the data. Experimental contexts should not be split into artificial fragments only to obtain equal group sizes.

The practical difficulty is to fill these groups as evenly as possible with the available trials while keeping each trial and its modalities together.

When constructing groups, the following elements should be considered:

the number of trials;
the number of trial-treatment combinations;
the number of observations for each retained variable;
the timing of the observations;
the diversity of climatic, agronomic, and water stress conditions;
the presence or absence of dynamic and terminal measurements.

Two groups with the same number of trials can have very different information content. For example, one group may contain detailed LAI and biomass time series, while another may contain mainly final yield values.

Interpretation of one split

Within one split, only the calibration subset is used to rank screened parameter sets and to run the local optimization. The held-out group is not used during screening or optimization.

After optimization, the selected parameter set is evaluated on the held-out group. This score measures how well the calibration transfers to data that were not used to estimate the parameters in that split.

This separation must remain strict. If the held-out group influences the choice of candidates, the local optimization, or the stopping decision, it is no longer a validation subset for that split.

Screening the parameter space

Parameter bounds

The screening stage starts from lower and upper bounds for each calibrated parameter. These bounds define the search domain.

They should be broad enough to allow real exploration, but not so broad that a large fraction of candidate parameter sets becomes biologically implausible or numerically useless.

Bounds can be based on:

previously calibrated values;
published parameter ranges;
expert knowledge;
sensitivity analyses;
biological and agronomic constraints;
numerical stability observed in previous simulations.

The role of the bounds is not to guess the final answer. Their role is to define a credible domain where the optimizer is allowed to search.

Candidate generation

Candidate parameter sets are sampled within the predefined bounds. This can be done with random sampling or with a space-filling design such as Latin hypercube sampling.

A full factorial grid is usually avoided because the number of combinations increases rapidly with the number of parameters. Screening should explore the space broadly without making the computation impossible.

The number of screened candidates, denoted here as \(n_{\mathrm{screen}}\), is a compromise. Too few candidates make the search superficial. Too many candidates increase runtime without necessarily improving the final calibration.

Invalid combinations can be filtered after sampling or rejected during model evaluation.
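A basic Latin hypercube design can be sketched with the standard library alone. This is a simplified illustration under the assumption of independent parameters with simple box bounds. Established library implementations would normally be preferred in practice, and the function name and the bounds are hypothetical.

```python
import random

def latin_hypercube(bounds, n_screen, seed=0):
    """Sample n_screen candidates with one point per stratum and per parameter.

    bounds: list of (lower, upper) tuples, one per calibrated parameter.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    columns = []
    for lower, upper in bounds:
        # One uniform draw inside each of the n_screen equal-width strata.
        strata = [(k + rng.random()) / n_screen for k in range(n_screen)]
        # Shuffle so that strata are paired randomly across parameters.
        rng.shuffle(strata)
        columns.append([lower + u * (upper - lower) for u in strata])
    return [list(row) for row in zip(*columns)]

# Hypothetical bounds for two parameters.
bounds = [(0.0, 1.0), (10.0, 20.0)]
candidates = latin_hypercube(bounds, n_screen=5, seed=1)
```

Each parameter range is cut into \(n_{\mathrm{screen}}\) strata and each stratum is sampled exactly once, which gives a more even coverage than the same number of purely random draws.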

Candidate evaluation

For each sampled parameter set and for each split, the workflow is:

insert the parameter values into the model inputs;
run the model on every trial in the calibration subset of the current split;
align simulated outputs with the retained observations;
compute normalized residuals;
apply variable and observation weights;
aggregate the variable-specific weighted errors into the final score \(\mathrm{Score}_r\).

A screened parameter set is therefore evaluated against the full calibration subset of the current split, not independently for each trial before averaging trial-level scores.

The screened candidates are then ranked from the lowest score to the highest score. The best candidates identify promising regions of the parameter space.

Screening is not expected to provide the final parameter set. Its role is to identify useful starting points for local optimization and to reveal whether several distinct regions of the parameter space produce similar scores.
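The ranking step can be sketched as a short loop in which a failed or invalid model evaluation is assigned the worst possible score instead of stopping the screening. The `toy_score` function is a hypothetical stand-in for running the model on the calibration subset and computing the pooled score.

```python
import math

def rank_candidates(candidates, score_fn):
    """Return (score, params) pairs sorted from lowest to highest score."""
    scored = []
    for params in candidates:
        try:
            value = score_fn(params)
        except (ValueError, ArithmeticError):
            # A rejected parameter combination is ranked last.
            value = math.inf
        if math.isnan(value):
            value = math.inf
        scored.append((value, params))
    scored.sort(key=lambda item: item[0])
    return scored

def toy_score(params):
    """Hypothetical score: rejects negative values, minimum at 2.0."""
    if params[0] < 0:
        raise ValueError("invalid parameter combination")
    return (params[0] - 2.0) ** 2

ranked = rank_candidates([[5.0], [-1.0], [2.5], [1.0]], toy_score)
```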

From screening to local optimization

Selecting distinct starting candidates

The local optimizer should not be initialized from several nearly identical parameter sets. After screening, the workflow therefore retains the best candidates that are also sufficiently distinct from one another.

A practical approach is:

rank all screened candidates by their calibration score;
retain the best candidate;
move down the ranking and retain a new candidate only if it is far enough from all previously retained candidates;
continue until the required number of candidates has been selected.

The distance should be computed in normalized parameter space. For parameter \(j\), the normalized coordinate is:

\[ z_j = \frac{x_j - l_j}{u_j - l_j} \]

where \(x_j\) is the parameter value, and \(l_j\) and \(u_j\) are its lower and upper bounds.

Computing distances after normalization prevents parameters with large numerical ranges from dominating the distance calculation. A Euclidean distance can then be used in this normalized space.

The minimum distance threshold should be fixed before examining the final results, because this choice can influence which local regions are explored.
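The greedy selection described above can be sketched as follows. The function names, the bounds, and the threshold value are assumptions chosen for illustration.

```python
import math

def normalize(x, bounds):
    """Map each parameter to [0, 1] using its lower and upper bounds."""
    return [(xj - l) / (u - l) for xj, (l, u) in zip(x, bounds)]

def select_distinct(ranked, bounds, n_select, min_distance):
    """Keep the best candidates that are far enough from those already kept.

    ranked: parameter vectors sorted from best to worst calibration score.
    """
    selected = []
    for x in ranked:
        z = normalize(x, bounds)
        if all(math.dist(z, normalize(y, bounds)) >= min_distance for y in selected):
            selected.append(x)
        if len(selected) == n_select:
            break
    return selected

# Hypothetical ranked candidates for two parameters.
bounds = [(0.0, 10.0), (0.0, 1.0)]
ranked = [[5.0, 0.5], [5.1, 0.5], [9.0, 0.9], [1.0, 0.1]]
starts = select_distinct(ranked, bounds, n_select=3, min_distance=0.2)
```

In this example the second candidate is skipped because its normalized distance to the best candidate is only 0.01, while the third and fourth candidates are retained.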

Initializing Nelder-Mead

For a problem with \(p\) calibrated parameters, a Nelder-Mead simplex contains \(p + 1\) vertices.

If the implementation allows a user-defined initial simplex, the selected distinct candidates can be used directly as the simplex vertices. The best screened candidate is used as the main vertex, and the next \(p\) sufficiently distinct candidates form the remaining vertices.

If the implementation accepts only one starting point, the best screened candidate is used as the starting point, and the simplex is generated internally by the optimizer or by a controlled perturbation of that starting point.

This distinction is important for reproducibility, because not all implementations of Nelder-Mead expose the initial simplex in the same way.
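Both situations can be sketched with one helper that returns \(p + 1\) vertices. If enough distinct candidates are available they are used directly. Otherwise the best candidate is perturbed along each parameter axis. The step size of 5 % of the parameter range and the inward step near a bound are assumptions for illustration.

```python
def initial_simplex(candidates, bounds, step=0.05):
    """Return p + 1 vertices for a problem with p parameters.

    candidates: parameter vectors sorted best first (at least one).
    bounds:     list of (lower, upper) tuples, one per parameter.
    """
    best = list(candidates[0])
    p = len(best)
    if len(candidates) >= p + 1:
        return [list(c) for c in candidates[:p + 1]]
    simplex = [best]
    for j, (lower, upper) in enumerate(bounds):
        vertex = list(best)
        delta = step * (upper - lower)
        # Step inward when the perturbation would cross the upper bound.
        vertex[j] = best[j] + delta if best[j] + delta <= upper else best[j] - delta
        simplex.append(vertex)
    return simplex

bounds = [(0.0, 10.0), (0.0, 1.0)]
from_candidates = initial_simplex([[5.0, 0.5], [9.0, 0.9], [1.0, 0.1]], bounds)
from_one_point = initial_simplex([[5.0, 0.98]], bounds)
```

When candidates are used directly, it is worth checking that they do not all lie on a common line or plane, because a degenerate simplex cannot explore every direction of the parameter space.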

Local optimization with Nelder-Mead

The local refinement step uses the Nelder-Mead simplex method (Nelder and Mead, 1965). For each split scenario, the local optimization is run only on the calibration subset of that split. Each split has its own screening scores, its own selected starting candidates, and its own optimized parameter set.

This per-split treatment is necessary because the best region of the parameter space may change when a different group is held out.

Role of Nelder-Mead

In this workflow, Nelder-Mead is used for local refinement. It does not replace the broad screening stage.

Its role is to:

start from a promising region found during screening;
adjust parameter values continuously rather than only choosing among sampled candidates;
improve the pooled weighted score on the calibration subset.

A poor screening step cannot be reliably repaired by Nelder-Mead alone. If the screening misses the relevant region of the parameter space, the local optimizer may converge to a suboptimal solution.
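To make the refinement step concrete, a compact Nelder-Mead iteration is sketched below with the usual coefficients (reflection 1, expansion 2, contraction 0.5, shrink 0.5). It is a teaching sketch, not the routine used by OptirrigCORE. An established library implementation would normally be preferred, and the stopping tolerances and the `toy_score` function are assumptions.

```python
def nelder_mead(f, simplex, max_iter=1000, f_tol=1e-10, x_tol=1e-6):
    """Minimize f from a user-defined initial simplex of p + 1 vertices."""
    n = len(simplex) - 1                      # number of parameters
    pts = [(f(x), list(x)) for x in simplex]  # (score, vertex) pairs
    for _ in range(max_iter):
        pts.sort(key=lambda t: t[0])
        best_f, best = pts[0]
        worst_f, worst = pts[-1]
        size = max(abs(x[j] - best[j]) for _, x in pts[1:] for j in range(n))
        if worst_f - best_f < f_tol and size < x_tol:
            break
        # Centroid of all vertices except the worst one.
        centroid = [sum(x[j] for _, x in pts[:-1]) / n for j in range(n)]

        def along(t):
            # Point at centroid + t * (centroid - worst).
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        xr = along(1.0)                       # reflection
        fr = f(xr)
        if fr < best_f:
            xe = along(2.0)                   # expansion
            fe = f(xe)
            pts[-1] = (fe, xe) if fe < fr else (fr, xr)
        elif fr < pts[-2][0]:
            pts[-1] = (fr, xr)
        else:
            # Outside contraction if the reflection improved on the worst vertex,
            # inside contraction otherwise.
            xc = along(0.5) if fr < worst_f else along(-0.5)
            fc = f(xc)
            if fc < min(fr, worst_f):
                pts[-1] = (fc, xc)
            else:
                # Shrink every vertex halfway towards the best one.
                shrunk = [[b + 0.5 * (xj - b) for b, xj in zip(best, x)]
                          for _, x in pts[1:]]
                pts = [pts[0]] + [(f(x), x) for x in shrunk]
    pts.sort(key=lambda t: t[0])
    return pts[0][1], pts[0][0]

def toy_score(x):
    """Hypothetical smooth score with its minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

x_opt, f_opt = nelder_mead(toy_score, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

The sketch only moves downhill from the simplex it is given, which illustrates why the quality of the starting region matters.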

Bounds during local optimization

Standard Nelder-Mead is an unconstrained method: it has no built-in mechanism for respecting parameter bounds. The bounds must therefore be enforced explicitly.

This can be done by one of the following approaches:

transforming bounded parameters into an unconstrained scale before optimization;
adding a strong penalty when the optimizer proposes values outside the bounds;
clipping or rejecting invalid proposals;
using a bounded variant of the algorithm, if available.

The chosen approach should be documented, because it affects both the final parameter values and the reproducibility of the calibration.
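Two of these approaches are sketched below: a logit transformation to an unconstrained scale, and a penalty wrapper around the score function. The function names and the penalty magnitude are assumptions for illustration.

```python
import math

def to_unbounded(x, lower, upper):
    """Map a value strictly inside (lower, upper) to the real line (logit)."""
    z = (x - lower) / (upper - lower)
    return math.log(z / (1.0 - z))

def to_bounded(y, lower, upper):
    """Inverse map: any real y gives a value within [lower, upper]."""
    # Numerically stable logistic function.
    if y >= 0:
        z = 1.0 / (1.0 + math.exp(-y))
    else:
        e = math.exp(y)
        z = e / (1.0 + e)
    return lower + (upper - lower) * z

def with_penalty(score_fn, bounds, penalty=1e6):
    """Wrap score_fn so that out-of-bounds proposals get a large score."""
    def wrapped(x):
        # Total violation, expressed as a fraction of each parameter range.
        excess = sum(max(l - xj, xj - u, 0.0) / (u - l)
                     for xj, (l, u) in zip(x, bounds))
        if excess > 0.0:
            return penalty * (1.0 + excess)  # the model is not run at all
        return score_fn(x)
    return wrapped

# Hypothetical one-parameter score with bounds (0, 1).
safe_score = with_penalty(lambda x: (x[0] - 0.3) ** 2, [(0.0, 1.0)])
inside = safe_score([0.5])
outside = safe_score([2.0])
```

With the transformation, the optimizer works on the unconstrained variable and the score function converts it back before running the model. With the penalty, the score surface becomes discontinuous at the bounds, which can slow down or distort the simplex near the edges of the domain.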

Outputs retained for each split

For each split scenario, the following outputs should be retained:

the validation group and calibration groups;
the parameter bounds;
the screened candidates and their calibration scores;
the selected candidates used to initialize the local search;
the local optimization settings;
the final optimized parameter set;
the final calibration score;
the validation score on the held-out group;
simulated-versus-observed outputs for diagnostic plots.

Keeping these outputs separately for each split makes it possible to identify whether one group systematically destabilizes the calibration.
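One way to keep these outputs together is a small record per split that can be written to disk. The class name, the field names, and all values below are hypothetical and only mirror the list above.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SplitResult:
    validation_group: str
    calibration_groups: list
    bounds: dict              # {parameter name: [lower, upper]}
    screened: list            # [[score, {parameter name: value}], ...]
    start_candidates: list    # candidates used to initialize the local search
    optimizer_settings: dict
    optimum: dict             # final optimized parameter set
    calibration_score: float
    validation_score: float
    sim_obs_pairs: list       # matched pairs kept for diagnostic plots

# Made-up values for one split.
result = SplitResult(
    validation_group="A",
    calibration_groups=["B", "C", "D"],
    bounds={"param_a": [0.4, 1.0]},
    screened=[[0.21, {"param_a": 0.62}], [0.35, {"param_a": 0.90}]],
    start_candidates=[{"param_a": 0.62}, {"param_a": 0.90}],
    optimizer_settings={"method": "Nelder-Mead", "max_iter": 1000},
    optimum={"param_a": 0.64},
    calibration_score=0.19,
    validation_score=0.27,
    sim_obs_pairs=[{"trial": "t1", "variable": "lai", "sim": 2.0, "obs": 1.8}],
)
serialized = json.dumps(asdict(result), indent=2)
```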

Selecting the final parameter set

At the end of the split-sample procedure, one optimized parameter set is available for each held-out group.

The final objective is usually not to keep several independent parameterizations. The objective is to derive one common final parameter set that performs well across the dataset and remains biologically plausible.

The final selection should consider:

the average calibration performance across splits;
the average held-out validation performance across splits;
the variability of scores across validation groups;
the biological plausibility of the parameter values;
the stability of each parameter across split-specific optima;
the consistency between dynamic variables and terminal variables.

Why the best split-specific optimum is not necessarily the final choice

The optimum with the lowest score in one split is not automatically the best final parameter set. It may be overfitted to the calibration subset used in that split.

A typical warning sign is a parameter set with an excellent calibration score in one split but weaker or unstable performance when evaluated across other split configurations.

The final parameter set should therefore be selected as a cross-split compromise, not as the winner of one isolated optimization run.

Practical final-selection strategies

Several strategies can be used to define the final parameter set.

One option is to retain the split-specific optimum that gives the best average performance when it is recalculated across all calibration and validation subsets.

Another option is to compute a robust centre of the split-specific optima, for example a median parameter vector, and then evaluate this candidate explicitly on every split.

A third option is to identify parameters that are stable across splits and focus additional analysis on parameters that remain unstable.
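The second and third strategies can be sketched together: a component-wise median of the split-specific optima, and a per-parameter spread expressed as a fraction of the parameter range. The parameter names, bounds, and values are hypothetical.

```python
from statistics import median

def median_parameter_vector(optima):
    """Component-wise median of split-specific optima (dicts with equal keys)."""
    return {name: median(o[name] for o in optima) for name in optima[0]}

def relative_spread(optima, bounds):
    """Range of each parameter across splits, as a fraction of its bounds."""
    spread = {}
    for name in optima[0]:
        values = [o[name] for o in optima]
        lower, upper = bounds[name]
        spread[name] = (max(values) - min(values)) / (upper - lower)
    return spread

# Made-up optima from four splits.
optima = [
    {"param_a": 0.60, "param_b": 3.1},
    {"param_a": 0.65, "param_b": 2.9},
    {"param_a": 0.62, "param_b": 3.4},
    {"param_a": 0.90, "param_b": 3.0},
]
bounds = {"param_a": (0.4, 1.0), "param_b": (2.0, 4.0)}
centre = median_parameter_vector(optima)
spread = relative_spread(optima, bounds)
```

In this made-up case `param_a` spans half of its allowed range across splits, which would flag it as unstable. A component-wise median is not guaranteed to be a good parameter set in itself, because parameters can compensate for one another.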

Whatever the strategy, the final candidate should be re-evaluated on:

each calibration subset;
each held-out validation group;
optionally, the full pooled dataset.

Once held-out split scores are used to choose the final common parameter set, these scores become part of internal model selection. They should no longer be presented as a fully independent validation of the final model.

If a strict final validation is required, an additional independent dataset must be kept aside and used only once, after the final parameter set has been fixed.

Methodological reference

The explicit local optimization step described in this vignette relies on the Nelder-Mead simplex method:

Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. The Computer Journal, 7(4), 308-313.

Conclusion

The proposed workflow follows a clear numerical hierarchy. First, the calibration target is defined. Then the parameter space is explored broadly with screening. Promising regions are refined with Nelder-Mead. Finally, split-specific results are compared to select one stable and interpretable final parameter set.

This workflow does not remove all calibration difficulties. In particular, it does not eliminate equifinality, uncertainty in observations, or structural model error. However, it makes the calibration process clearer, easier to reproduce, and easier to diagnose.

It should also be read for what it is: one proposed calibration and internal validation strategy among several possible ones, not the only path available in OptirrigCORE.