Calibrating the MATILDA framework#
In this notebook we will
… set up a glacio-hydrological model with all the data we have collected,
… run the model for the calibration period with default parameters and check the results,
… use a statistical parameter optimization routine to calibrate the model,
… and store the calibrated parameter set for the scenario runs in the next notebook.
We will use the glacio-hydrological modeling library [MATILDA](cryotools/matilda), which has been developed for use in this workflow. It is based on the widely used HBV hydrological model, extended by a simple temperature-index melt model roughly following the code of Seguinot (2019). Glacier evolution over time is modeled using the Δh approach following Seibert et al. (2018).
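As a rough illustration of the temperature-index (degree-day) idea at the core of the melt routine, the sketch below computes daily melt from the excess of air temperature over a threshold, scaled by a melt factor. This is a minimal sketch, not MATILDA's actual code; the names `tt` and `cfmax` simply mirror the HBV-style parameters (e.g. `TT_snow`, `CFMAX_ice`) listed later in this notebook.

```python
# Minimal degree-day melt sketch (illustrative only, not MATILDA's implementation).
# tt: threshold temperature (deg C), cfmax: degree-day factor (mm w.e. per deg C per day).
import numpy as np

def degree_day_melt(temp_c, tt=0.0, cfmax=5.0):
    """Return daily melt (mm w.e.) for daily mean temperatures (deg C)."""
    return cfmax * np.maximum(np.asarray(temp_c) - tt, 0.0)

print(degree_day_melt([-2.0, 0.5, 3.0]))  # melt only on days above the threshold
```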
Let’s start by importing some helper functions to work with `yaml` and `pickle` files and reading the required data from the config file.
Show code cell source
from tools.helpers import update_yaml, read_yaml, write_yaml
import configparser
import ast
# read local config.ini file
config = configparser.ConfigParser()
config.read('config.ini')
# get output dir and date range from config.ini
dir_input = config['FILE_SETTINGS']['DIR_INPUT']
dir_output = config['FILE_SETTINGS']['DIR_OUTPUT']
date_range = ast.literal_eval(config['CONFIG']['DATE_RANGE'])
print('MATILDA will be calibrated on the period ' + date_range[0] + ' to ' + date_range[1])
MATILDA will be calibrated on the period 1998-01-01 to 2020-12-31
The model requires a minimum setup period of one year. By default, the first two years are considered setup. We derive the respective dates from the defined time period accordingly.
Show code cell source
import pandas as pd
length_of_setup_period = 2
sim_start = pd.to_datetime(date_range[0]) + pd.DateOffset(years = length_of_setup_period)
set_up_end = sim_start - pd.DateOffset(days = 1)
dates = {'set_up_start': date_range[0],
'set_up_end': str(set_up_end).split(' ')[0], # remove hh:mm:ss
'sim_start': str(sim_start).split(' ')[0], # remove hh:mm:ss
'sim_end': date_range[1]}
for key in dates.keys(): print(key + ': ' + dates[key])
set_up_start: 1998-01-01
set_up_end: 1999-12-31
sim_start: 2000-01-01
sim_end: 2020-12-31
Many MATILDA parameters have been calculated in previous notebooks and stored in `settings.yml`. We can easily add the modeling periods using a helper function. The calculated glacier profile from Notebook 1 can be imported as a pandas `DataFrame` and added to the settings dictionary as well.
Finally, we will also add some optional settings that control the aggregation frequency of the outputs, the choice of graphical outputs, and more.
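Conceptually, the `update_yaml()` helper used below just does a read-merge-write on the settings file. The following is a minimal sketch of that idea, not the actual implementation in `tools.helpers`:

```python
# Sketch of a read-merge-write YAML update (illustrative only; the real helper
# lives in tools.helpers and may differ in detail).
import yaml

def update_yaml_sketch(path, new_entries):
    """Read a YAML file, merge in new entries, and write it back."""
    with open(path) as f:
        data = yaml.safe_load(f) or {}
    data.update(new_entries)
    with open(path, 'w') as f:
        yaml.safe_dump(data, f)
```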
Show code cell source
update_yaml(dir_output + 'settings.yml', dates)
remaining_settings = {"freq": "M", # aggregation level of model outputs (D, M, Y)
"warn": False, # show warnings of subpackages?
"plot_type": "all", # interactive and/or non-interactive plots ('print', 'interactive', 'all')
"elev_rescaling": True} # treat mean glacier elevation as constant or change with glacier evolution
update_yaml(dir_output + 'settings.yml', remaining_settings)
settings = read_yaml(dir_output + 'settings.yml')
glacier_profile = pd.read_csv(dir_output + 'glacier_profile.csv')
settings['glacier_profile'] = glacier_profile
print('MATILDA settings:\n\n')
for key in settings.keys(): print(key + ': ' + str(settings[key]))
MATILDA settings:
area_cat: 295.2763476500336
area_glac: 31.81370047643339
ele_cat: 3295.4765625
ele_dat: 3337.7120334796778
ele_glac: 4001.8798828125
elev_rescaling: True
freq: M
lat: 42.1831077450328
plot_type: all
plots: True
set_up_end: 1999-12-31
set_up_start: 1998-01-01
sim_end: 2020-12-31
sim_start: 2000-01-01
warn: False
glacier_profile: Elevation Area WE EleZone
0 1970.0 0.000000 0.0000 1900
1 2000.0 0.000000 0.0000 2000
2 2100.0 0.000000 0.0000 2100
3 2200.0 0.000000 0.0000 2200
4 2300.0 0.000000 0.0000 2300
.. ... ... ... ...
156 4730.0 0.000023 20721.3700 4700
157 4740.0 0.000013 14450.2180 4700
158 4750.0 0.000006 10551.4730 4700
159 4760.0 0.000000 0.0000 4700
160 4780.0 0.000002 6084.7456 4700
[161 rows x 4 columns]
Run MATILDA with default parameters#
We will force MATILDA with the pre-processed ERA5-Land data from Notebook 2. Although MATILDA can run without calibration on observations, the results would carry extreme uncertainties. Therefore, we recommend using at least runoff observations at your selected gauging point to evaluate the simulations against. Here, we load runoff observations for the example catchment from 1982 to 2020 (with gaps).
Show code cell source
era5 = pd.read_csv(dir_output + 'ERA5L.csv', usecols=['dt', 'temp', 'prec'])
era5.columns = ['TIMESTAMP', 'T2', 'RRR']
# remove HH:MM:SS from 'TIMESTAMP' column
era5['TIMESTAMP'] = pd.to_datetime(era5['TIMESTAMP'])
era5['TIMESTAMP'] = era5['TIMESTAMP'].dt.date
print('ERA5 Data:')
display(era5)
obs = pd.read_csv(dir_input + 'obs_runoff_example.csv')
print('Observations:')
display(obs)
ERA5 Data:
|  | TIMESTAMP | T2 | RRR |
| --- | --- | --- | --- |
| 0 | 1979-01-01 | 257.392996 | 0.023400 |
| 1 | 1979-01-02 | 256.435964 | 0.004129 |
| 2 | 1979-01-03 | 257.867342 | 0.001601 |
| 3 | 1979-01-04 | 258.322419 | 0.334886 |
| 4 | 1979-01-05 | 258.056697 | 0.057215 |
| ... | ... | ... | ... |
| 16066 | 2022-12-27 | 253.990888 | 0.001670 |
| 16067 | 2022-12-28 | 257.364969 | 0.223425 |
| 16068 | 2022-12-29 | 255.975972 | 1.357699 |
| 16069 | 2022-12-30 | 256.748078 | 3.906910 |
| 16070 | 2022-12-31 | 255.077631 | 1.239511 |

16071 rows × 3 columns
Observations:
|  | Date | Qobs |
| --- | --- | --- |
| 0 | 01/01/1982 | 1.31 |
| 1 | 01/02/1982 | 1.19 |
| 2 | 01/03/1982 | 1.31 |
| 3 | 01/04/1982 | 1.31 |
| 4 | 01/05/1982 | 1.19 |
| ... | ... | ... |
| 14240 | 12/27/2020 | 3.25 |
| 14241 | 12/28/2020 | 3.23 |
| 14242 | 12/29/2020 | 3.08 |
| 14243 | 12/30/2020 | 2.93 |
| 14244 | 12/31/2020 | 2.93 |

14245 rows × 2 columns
First, we run MATILDA with default parameters.
Show code cell source
from matilda.core import matilda_simulation
output_matilda = matilda_simulation(era5, obs, **settings)
---
MATILDA framework
Reading parameters for MATILDA simulation
Parameter set:
set_up_start 1998-01-01
set_up_end 1999-12-31
sim_start 2000-01-01
sim_end 2020-12-31
freq M
freq_long Monthly
lat 42.183108
area_cat 295.276348
area_glac 31.8137
ele_dat 3337.712033
ele_cat 3295.476562
ele_glac 4001.879883
ele_non_glac 3210.176791
hydro_year 10
soi None
warn False
pfilter 0
lr_temp -0.006
lr_prec 0
TT_snow 0
TT_rain 2
TT_diff 2
CFMAX_snow 2.5
CFMAX_ice 5
CFMAX_rel 2
BETA 1.0
CET 0.15
FC 250
K0 0.055
K1 0.055
K2 0.04
LP 0.7
MAXBAS 3.0
PERC 1.5
UZL 120
PCORR 1.0
SFCF 0.7
CWH 0.1
AG 0.7
CFR_ice 0.01
RFS 0.15
dtype: object
*-------------------*
Reading data
Set up period from 1998-01-01 to 1999-12-31 to set initial values
Simulation period from 2000-01-01 to 2020-12-31
Input data preprocessing successful
---
Initiating MATILDA simulation
Recalculating initial elevations based on glacier profile
>> Prior glacier elevation: 4001.8798828125m a.s.l.
>> Recalculated glacier elevation: 3951m a.s.l.
>> Prior non-glacierized elevation: 3210m a.s.l.
>> Recalculated non-glacierized elevation: 3216m a.s.l.
Calculating glacier evolution
*-------------------*
Running HBV routine
Finished spin up for initial HBV parameters
Finished HBV routine
*-------------------*
** Model efficiency based on Monthly aggregates **
KGE coefficient: 0.36
NSE coefficient: -0.95
RMSE: 70.39
MARE coefficient: 0.76
*-------------------*
avg_temp_catchment avg_temp_glaciers evap_off_glaciers \
count 6086.000 6086.000 6086.000
mean -3.487 -7.898 0.790
std 9.213 9.213 1.006
min -25.392 -29.803 0.000
25% -11.675 -16.086 0.000
50% -2.805 -7.216 0.233
75% 5.045 0.634 1.481
max 13.692 9.280 5.625
sum -21222.062 -48065.519 4808.873
prec_off_glaciers prec_on_glaciers rain_off_glaciers \
count 6086.000 6086.000 6086.000
mean 3.182 0.302 2.230
std 4.895 0.469 4.876
min 0.000 0.000 0.000
25% 0.154 0.015 0.000
50% 1.244 0.122 0.000
75% 4.214 0.396 2.118
max 54.582 5.699 54.582
sum 19364.307 1836.093 13570.670
snow_off_glaciers rain_on_glaciers snow_on_glaciers \
count 6086.000 6086.000 6086.000
mean 0.952 0.144 0.157
std 1.966 0.427 0.267
min 0.000 0.000 0.000
25% 0.000 0.000 0.000
50% 0.020 0.000 0.034
75% 0.970 0.014 0.207
max 19.128 5.699 2.503
sum 5793.637 877.758 958.335
snowpack_off_glaciers ... total_refreezing SMB \
count 6086.000 ... 6086.000 6086.000
mean 114.010 ... 0.023 -1.134
std 104.481 ... 0.055 7.104
min 0.000 ... 0.000 -41.167
25% 0.000 ... 0.000 0.000
50% 105.096 ... 0.000 0.218
75% 194.911 ... 0.012 1.839
max 439.324 ... 0.365 23.603
sum 693865.892 ... 141.768 -6898.623
actual_evaporation total_precipitation total_melt \
count 6086.000 6086.000 6086.000
mean 0.779 3.483 3.487
std 0.994 5.359 8.634
min 0.000 0.000 0.000
25% 0.000 0.169 0.000
50% 0.229 1.366 0.000
75% 1.462 4.619 1.323
max 5.625 60.281 51.193
sum 4742.754 21200.400 21221.885
runoff_without_glaciers runoff_from_glaciers runoff_ratio \
count 6086.000 6086.000 6086.000
mean 2.436 0.418 111.059
std 3.119 0.908 2128.207
min 0.000 0.000 0.000
25% 0.035 0.000 0.079
50% 0.683 0.000 0.681
75% 4.264 0.264 3.375
max 17.804 7.418 129567.627
sum 14824.523 2542.915 675905.254
total_runoff observed_runoff
count 6086.000 6086.000
mean 2.854 2.006
std 3.635 1.721
min 0.000 0.234
25% 0.035 0.702
50% 0.686 1.127
75% 5.437 3.102
max 20.787 8.983
sum 17367.438 12206.778
[9 rows x 29 columns]
End of the MATILDA simulation
---
The result is obviously far from reality and greatly overestimates runoff. Therefore, the model needs calibration.
Calibrate MATILDA#
To adjust all model parameters to the catchment characteristics, we will perform an automated calibration using SPOTPY, the Statistical Parameter Optimization Tool for Python. Since large uncertainties in the input data (especially precipitation) can lead to an overestimation of melt when the model is calibrated to the hydrograph only, we will additionally include glacier mass balance data for a multi-objective calibration.
Such an optimization requires a large number of model runs, far more than are feasible in this environment. To calibrate the model properly, you should either
… run this notebook locally on a computer with more cores (ideally a high performance cluster) or …
… reduce the number of calibration parameters based on a global sensitivity analysis. We will return to this topic later in this notebook.
Here we will demonstrate the use of the SPOT functions and then continue with a parameter set from a large HPC optimization run. If you need assistance implementing the routine on an HPC, consult the SPOTPY documentation and contact us if you run into problems.
Add glacier mass balance data#
In addition to runoff, we will use glacier mass balance as a second calibration variable. Shean et al. (2020) calculated robust geodetic mass balances for all glaciers in High Mountain Asia from 2000 to 2018. For this example (and all other catchments in HMA), we can use their data set to derive a target average annual mass balance for the calibration period. If your catchment is located outside HMA, you need to consult other sources.
We pick all individual mass balances that match the glacier IDs in our catchment and calculate the mean. In addition, we use the uncertainty measures listed in the dataset to derive an uncertainty range.
Show code cell source
import pandas as pd
mass_balances = pd.read_csv(dir_input + '/hma_mb_20190215_0815_rmse.csv', usecols=['RGIId', 'mb_mwea', 'mb_mwea_sigma'])
ids = pd.read_csv(dir_output + '/RGI/Glaciers_in_catchment.csv')
merged = pd.merge(mass_balances, ids, on='RGIId')
mean_mb = round(merged['mb_mwea'].mean() * 1000, 3) # Mean catchment MB in mm w.e.
mean_sigma = round(merged['mb_mwea_sigma'].mean() * abs(mean_mb), 3) # Mean uncertainty of catchment MB in mm w.e.
target_mb = [mean_mb - mean_sigma, mean_mb + mean_sigma]
print('Target glacier mass balance for calibration: ' + str(mean_mb) + ' +-' + str(mean_sigma) + 'mm w.e.')
Target glacier mass balance for calibration: -155.474 +-50.3mm w.e.
The MATILDA framework provides an interface to SPOTPY. Here we will use the `psample()` function to run MATILDA with the same settings as before. To do this, we will remove redundant settings and add some new ones specific to the function. Be sure to choose the number of repetitions carefully.
Show code cell source
from tools.helpers import drop_keys
psample_settings = drop_keys(settings, ['warn', 'plots', 'plot_type'])
additional_settings = {'rep': 10,                  # Number of model runs. For advice check the documentation of the algorithms.
                       'glacier_only': False,      # True when calibrating an entirely glacierized catchment
                       'obj_dir': 'maximize',      # Should your objective function be maximized (e.g. NSE) or minimized (e.g. RMSE)?
                       'target_mb': mean_mb,       # Average annual glacier mass balance to target
                       'dbformat': None,           # Write the results to a file ('csv', 'hdf5', 'ram', 'sql')
                       'output': None,             # Choose where to store the files
                       'algorithm': 'lhs',         # Choose algorithm (for parallelization: mc, lhs, fast, rope, sceua or demcz)
                       'dbname': 'era5_matilda_example',  # Choose a name for the database
                       'parallel': False,          # Distribute the calculation on multiple cores or not
                       # 'cores': 20               # Set number of cores when running in parallel
                       }
psample_settings.update(additional_settings)
print('Settings for calibration runs:\n\n')
for key in psample_settings.keys(): print(key + ': ' + str(psample_settings[key]))
Settings for calibration runs:
area_cat: 295.2763476500336
area_glac: 31.81370047643339
ele_cat: 3295.4765625
ele_dat: 3337.7120334796778
ele_glac: 4001.8798828125
elev_rescaling: True
freq: M
lat: 42.1831077450328
set_up_end: 1999-12-31
set_up_start: 1998-01-01
sim_end: 2020-12-31
sim_start: 2000-01-01
glacier_profile: Elevation Area WE EleZone norm_elevation delta_h
0 1970.0 0.000000 0.0000 1900 1.000000 1.003442
1 2000.0 0.000000 0.0000 2000 0.989324 0.945813
2 2100.0 0.000000 0.0000 2100 0.953737 0.774793
3 2200.0 0.000000 0.0000 2200 0.918149 0.632696
4 2300.0 0.000000 0.0000 2300 0.882562 0.515361
.. ... ... ... ... ... ...
156 4730.0 0.000023 20721.3700 4700 0.017794 -0.000265
157 4740.0 0.000013 14450.2180 4700 0.014235 -0.000692
158 4750.0 0.000006 10551.4730 4700 0.010676 -0.001119
159 4760.0 0.000000 0.0000 4700 0.007117 -0.001546
160 4780.0 0.000002 6084.7456 4700 0.000000 -0.002400
[161 rows x 6 columns]
rep: 10
glacier_only: False
obj_dir: maximize
target_mb: -155.474
dbformat: None
output: None
algorithm: lhs
dbname: era5_matilda_example
parallel: False
With these settings we can start `psample()` to run our model with various parameter combinations. The default parameter boundaries can be found in the MATILDA parameter documentation. If you want to narrow down the parameter space, you can do so using the following syntax. Here, we define custom ranges for the temperature lapse rate and the precipitation correction factor.
Show code cell source
from matilda.mspot_glacier import psample
lim_dict = {'lr_temp_lo': -0.007, 'lr_temp_up': -0.005, 'PCORR_lo': 0.5, 'PCORR_up': 1.5}
best_summary = psample(df=era5, obs=obs, **psample_settings, **lim_dict)
Initializing the Latin Hypercube Sampling (LHS) with 10 repetitions
The objective function will be maximized
Starting the LHS algotrithm with 10 repetitions...
Creating LatinHyperCube Matrix
1 of 10, maximal objective function=0.0142974, time remaining: 00:00:11
Initialize database...
['csv', 'hdf5', 'ram', 'sql', 'custom', 'noData']
2 of 10, maximal objective function=0.390807, time remaining: 00:00:12
3 of 10, maximal objective function=0.390807, time remaining: 00:00:12
4 of 10, maximal objective function=0.390807, time remaining: 00:00:11
5 of 10, maximal objective function=0.390807, time remaining: 00:00:09
6 of 10, maximal objective function=0.390807, time remaining: 00:00:07
7 of 10, maximal objective function=0.390807, time remaining: 00:00:05
8 of 10, maximal objective function=0.390807, time remaining: 00:00:03
9 of 10, maximal objective function=0.488346, time remaining: 00:00:00
10 of 10, maximal objective function=0.488346, time remaining: 23:59:57
*** Final SPOTPY summary ***
Total Duration: 28.98 seconds
Total Repetitions: 10
Maximal objective value: 0.488346
Corresponding parameter setting:
lr_temp: -0.0061268
lr_prec: 0.000873701
BETA: 2.36071
CET: 0.188038
FC: 246.639
K0: 0.1165
K1: 0.0192281
K2: 0.132066
LP: 0.604746
MAXBAS: 4.18863
PERC: 0.0719596
UZL: 439.892
PCORR: 1.05211
TT_snow: 1.30233
TT_diff: 1.60421
CFMAX_ice: 9.04271
CFMAX_rel: 1.76758
SFCF: 0.486815
CWH: 0.110354
AG: 0.0240491
RFS: 0.238042
******************************
WARNING: The selected algorithm lhs can either maximize or minimize the objective function. You can specify the direction by passing obj_dir to analyze_results(). The default is 'maximize'.
Best parameter set:
lr_temp=-0.006126801758888213, lr_prec=0.0008737007946962512, BETA=2.360712742997325, CET=0.18803804995862727, FC=246.63862451398535, K0=0.1164995226794916, K1=0.019228080119798686, K2=0.1320661598275716, LP=0.6047461861480137, MAXBAS=4.188629611573909, PERC=0.07195961727035, UZL=439.8921536505491, PCORR=1.0521061414626551, TT_snow=1.3023283455933508, TT_diff=1.6042122829253105, CFMAX_ice=9.042708157072404, CFMAX_rel=1.7675788567226616, SFCF=0.48681527221872145, CWH=0.11035420113352071, AG=0.024049120558842185, RFS=0.23804156103842117
Run number 8 has the highest objectivefunction with: 0.4883
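A run with only 10 repetitions is, of course, just a demonstration. On an HPC, the same call can be scaled up by increasing the repetitions and distributing the model runs over several cores. The following is an illustrative sketch only: it rearranges options that already appear in this notebook (`parallel`, the commented-out `cores` option, the parallelizable algorithms, the 55k repetitions mentioned below); how such a job is actually launched depends on your cluster and the SPOTPY documentation.

```python
# Sketch of a scaled-up psample() setup for an HPC run (illustrative only).
hpc_settings = psample_settings.copy()
hpc_settings.update({'rep': 55000,                          # full-scale optimization instead of a 10-run demo
                     'algorithm': 'demcz',                  # one of the parallelizable algorithms
                     'parallel': True,                      # distribute model runs over several cores
                     'cores': 20,                           # number of cores to use
                     'dbformat': 'csv',                     # write all runs to disk for later analysis
                     'dbname': dir_output + 'era5_matilda_hpc'})
# best_summary = psample(df=era5, obs=obs, **hpc_settings, **lim_dict)
```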
Run MATILDA with calibrated parameters#
The following parameter set was computed using an updated version of the Differential Evolution Markov Chain (DE-MCz) algorithm with 55k iterations on an HPC cluster. The parameters were optimized for runoff using the Kling-Gupta model efficiency coefficient and the results were filtered to match the target mass balance range.
Show code cell source
param = {'lr_temp': -0.006472598,
'lr_prec': 0.00010296448,
'BETA': 4.625306,
'CET': 0.2875196,
'FC': 364.81818,
'K0': 0.28723368,
'K1': 0.015692418,
'K2': 0.004580627,
'LP': 0.587188,
'MAXBAS': 6.730105,
'PERC': 1.1140852,
'UZL': 198.82584,
'PCORR': 0.74768984,
'TT_snow': -1.3534238,
'TT_diff': 0.70977557,
'CFMAX_ice': 2.782649,
'CFMAX_rel': 1.2481626,
'SFCF': 0.879982,
'CWH': 0.0020890352,
'AG': 0.8640329,
'RFS': 0.21825151}
print('Calibrated parameter set:\n\n')
for key in param.keys(): print(key + ': ' + str(param[key]))
Calibrated parameter set:
lr_temp: -0.006472598
lr_prec: 0.00010296448
BETA: 4.625306
CET: 0.2875196
FC: 364.81818
K0: 0.28723368
K1: 0.015692418
K2: 0.004580627
LP: 0.587188
MAXBAS: 6.730105
PERC: 1.1140852
UZL: 198.82584
PCORR: 0.74768984
TT_snow: -1.3534238
TT_diff: 0.70977557
CFMAX_ice: 2.782649
CFMAX_rel: 1.2481626
SFCF: 0.879982
CWH: 0.0020890352
AG: 0.8640329
RFS: 0.21825151
Show code cell source
output_matilda = matilda_simulation(era5, obs, **settings, parameter_set=param)
---
MATILDA framework
Reading parameters for MATILDA simulation
Parameter set:
set_up_start 1998-01-01
set_up_end 1999-12-31
sim_start 2000-01-01
sim_end 2020-12-31
freq M
freq_long Monthly
lat 42.183108
area_cat 295.276348
area_glac 31.8137
ele_dat 3337.712033
ele_cat 3295.476562
ele_glac 4001.879883
ele_non_glac 3210.176791
hydro_year 10
soi None
warn False
pfilter 0
lr_temp -0.006473
lr_prec 0.000103
TT_snow -1.353424
TT_rain -0.643648
TT_diff 0.709776
CFMAX_snow 2.229396
CFMAX_ice 2.782649
CFMAX_rel 1.248163
BETA 4.625306
CET 0.28752
FC 364.81818
K0 0.287234
K1 0.015692
K2 0.004581
LP 0.587188
MAXBAS 6.730105
PERC 1.114085
UZL 198.82584
PCORR 0.74769
SFCF 0.879982
CWH 0.002089
AG 0.864033
CFR_ice 0.01
RFS 0.218252
dtype: object
*-------------------*
Reading data
Set up period from 1998-01-01 to 1999-12-31 to set initial values
Simulation period from 2000-01-01 to 2020-12-31
Input data preprocessing successful
---
Initiating MATILDA simulation
Recalculating initial elevations based on glacier profile
>> Prior glacier elevation: 4001.8798828125m a.s.l.
>> Recalculated glacier elevation: 3951m a.s.l.
>> Prior non-glacierized elevation: 3210m a.s.l.
>> Recalculated non-glacierized elevation: 3216m a.s.l.
Calculating glacier evolution
*-------------------*
Running HBV routine
Finished spin up for initial HBV parameters
Finished HBV routine
*-------------------*
** Model efficiency based on Monthly aggregates **
KGE coefficient: 0.86
NSE coefficient: 0.76
RMSE: 24.72
MARE coefficient: 0.28
*-------------------*
avg_temp_catchment avg_temp_glaciers evap_off_glaciers \
count 6086.000 6086.000 6086.000
mean -3.417 -8.174 0.853
std 9.213 9.213 1.206
min -25.322 -30.075 0.000
25% -11.609 -16.365 0.000
50% -2.739 -7.494 0.131
75% 5.129 0.369 1.471
max 13.765 9.007 8.028
sum -20795.365 -49748.927 5191.372
prec_off_glaciers prec_on_glaciers rain_off_glaciers \
count 6086.000 6086.000 6086.000
mean 2.573 0.274 1.840
std 3.778 0.393 3.741
min 0.000 0.000 0.000
25% 0.125 0.020 0.000
50% 1.081 0.120 0.000
75% 3.499 0.371 2.157
max 40.798 4.357 40.798
sum 15658.660 1665.787 11200.029
snow_off_glaciers rain_on_glaciers snow_on_glaciers \
count 6086.000 6086.000 6086.000
mean 0.733 0.143 0.131
std 1.694 0.363 0.239
min 0.000 0.000 0.000
25% 0.000 0.000 0.000
50% 0.000 0.000 0.016
75% 0.557 0.070 0.150
max 17.968 4.357 2.232
sum 4458.631 869.777 796.010
snowpack_off_glaciers ... total_refreezing SMB \
count 6086.000 ... 6086.000 6086.000
mean 82.106 ... 0.027 -0.411
std 82.329 ... 0.070 4.629
min 0.000 ... 0.000 -22.191
25% 0.000 ... 0.000 -0.692
50% 67.578 ... 0.000 0.149
75% 149.772 ... 0.004 1.415
max 371.410 ... 0.466 20.937
sum 499698.902 ... 167.076 -2503.047
actual_evaporation total_precipitation total_melt \
count 6086.000 6086.000 6086.000
mean 0.853 2.847 0.993
std 1.206 4.170 2.605
min 0.000 0.000 0.000
25% 0.000 0.145 0.000
50% 0.131 1.203 0.000
75% 1.471 3.867 0.681
max 8.028 45.155 18.479
sum 5191.372 17324.448 6040.563
runoff_without_glaciers runoff_from_glaciers runoff_ratio \
count 6086.000 6086.000 6086.000
mean 1.713 0.317 29.748
std 1.295 0.630 80.744
min 0.000 0.000 0.000
25% 0.684 0.000 0.433
50% 1.341 0.000 1.247
75% 2.598 0.388 7.748
max 13.236 5.202 1351.906
sum 10427.172 1931.082 181048.141
total_runoff observed_runoff
count 6086.000 6086.000
mean 2.031 2.006
std 1.699 1.721
min 0.000 0.234
25% 0.684 0.702
50% 1.364 1.127
75% 3.237 3.102
max 13.829 8.983
sum 12358.253 12206.778
[9 rows x 29 columns]
End of the MATILDA simulation
---
In addition to the standard plots, we can explore the results in interactive plotly plots. Go ahead and zoom in as you like or select/deselect individual curves.
Show code cell source
output_matilda[9].show()
The same works for the long-term seasonal cycle.
Show code cell source
output_matilda[10].show()
Reducing the parameter space - Sensitivity analysis with FAST#
To reduce the computation time of the calibration procedure, we need to reduce the number of parameters to be optimized. Therefore, we will perform a global sensitivity analysis to identify the most important parameters and set the others to default values. The algorithm of choice will be the Fourier Amplitude Sensitivity Test (FAST), available through the SPOTPY library. As before, we will show the general procedure with a few iterations, but display results from extensive runs on an HPC. You can use the results as a guide for your parameter choices, but keep in mind that they depend strongly on your catchment properties, such as elevation range and glacier coverage.
First, we calculate the required number of iterations using a formula from Henkel et al. (2012). We choose the SPOTPY default frequency step of 2 and set the interference factor to a maximum of 4, since we assume a high intercorrelation of the model parameters. The total number of parameters in MATILDA is 21.
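Written out, the formula implemented in `fast_iter()` below gives the required number of model runs \(N\) for \(k\) parameters, an interference factor \(M\) and a frequency step \(d\):

\[N = \bigl[1 + 4 M^2 \bigl(1 + (k - 2)\,d\bigr)\bigr] \cdot k\]

With \(k = 21\), \(M = 4\) and \(d = 2\) this amounts to \(2497 \cdot 21 = 52437\) runs.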
Show code cell source
def fast_iter(param, interf=4, freqst=2):
"""
Calculates the number of parameter iterations needed for parameterization and sensitivity analysis using FAST.
Parameters
----------
param : int
The number of input parameters being analyzed.
interf : int
The interference factor, which determines the degree of correlation between the input parameters.
freqst : int
The frequency step, which specifies the size of the intervals between each frequency at which the Fourier transform is calculated.
Returns
-------
int
The total number of parameter iterations needed for FAST.
"""
return (1 + 4 * interf ** 2 * (1 + (param - 2) * freqst)) * param
print('Needed number of iterations for FAST: ' + str(fast_iter(21)))
Needed number of iterations for FAST: 52437
That is a lot of iterations! Running this routine on a single core would take about two days, but can be sped up significantly with each additional core. The setup would look exactly like the parameter optimization before.
Note: No matter what number of repetitions you define, SPOTPY runs the FAST sampler in whole multiples of the number of model parameters \(k\), i.e. at least once per parameter. So even if we set `rep=10`, the algorithm will run at least 21 times.
Show code cell source
from matilda.mspot_glacier import psample
fast_settings = {'rep': 52437,                        # Choose wisely before running
                 'target_mb': None,
                 'algorithm': 'fast',
                 'dbname': dir_output + 'fast_example',
                 'dbformat': 'csv'
                 }
psample_settings.update(fast_settings)
print('Settings for FAST:\n\n')
for key in psample_settings.keys(): print(key + ': ' + str(psample_settings[key]))
# fast_results = psample(df=era5, obs=obs, **psample_settings)
Settings for FAST:
area_cat: 295.2763476500336
area_glac: 31.81370047643339
ele_cat: 3295.4765625
ele_dat: 3337.7120334796778
ele_glac: 4001.8798828125
elev_rescaling: True
freq: M
lat: 42.1831077450328
set_up_end: 1999-12-31
set_up_start: 1998-01-01
sim_end: 2020-12-31
sim_start: 2000-01-01
glacier_profile: Elevation Area WE EleZone norm_elevation delta_h
0 1970.0 0.000000 0.0000 1900 1.000000 1.003442
1 2000.0 0.000000 0.0000 2000 0.989324 0.945813
2 2100.0 0.000000 0.0000 2100 0.953737 0.774793
3 2200.0 0.000000 0.0000 2200 0.918149 0.632696
4 2300.0 0.000000 0.0000 2300 0.882562 0.515361
.. ... ... ... ... ... ...
156 4730.0 0.000023 20721.3700 4700 0.017794 -0.000265
157 4740.0 0.000013 14450.2180 4700 0.014235 -0.000692
158 4750.0 0.000006 10551.4730 4700 0.010676 -0.001119
159 4760.0 0.000000 0.0000 4700 0.007117 -0.001546
160 4780.0 0.000002 6084.7456 4700 0.000000 -0.002400
[161 rows x 6 columns]
rep: 52437
glacier_only: False
obj_dir: maximize
target_mb: None
dbformat: csv
output: None
algorithm: fast
dbname: output/fast_example
parallel: False
We ran FAST analyses for the example catchment with the full number of iterations required. The results are saved in CSV files. We can use the `spotpy.analyser` library to create easy-to-read data frames from the databases. The summary shows the first-order (`S1`) and the total-order sensitivity index (`ST`) for each parameter. `S1` refers to the variance of the model output explained by the parameter alone, holding all other parameters constant. `ST` also takes the interactions between parameters into account and is therefore a good measure of the impact of individual parameters on the model output.
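For reference, these correspond to the standard variance-based (Sobol) sensitivity indices:

\[S_1^{(i)} = \frac{\mathrm{Var}\bigl(\mathrm{E}[Y \mid X_i]\bigr)}{\mathrm{Var}(Y)}, \qquad S_T^{(i)} = 1 - \frac{\mathrm{Var}\bigl(\mathrm{E}[Y \mid X_{\sim i}]\bigr)}{\mathrm{Var}(Y)},\]

where \(Y\) is the model output, \(X_i\) is the parameter of interest and \(X_{\sim i}\) denotes all parameters except \(X_i\).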
Show code cell source
import spotpy
import os
import contextlib
def get_si(fast_results: str, to_csv: bool = False) -> pd.DataFrame:
"""
Computes the sensitivity indices of a given FAST simulation results file.
Parameters
----------
fast_results : str
The path of the FAST simulation results file.
to_csv : bool, optional
If True, the sensitivity indices are saved to a CSV file with the same
name as fast_results, but with '_sensitivity_indices.csv' appended to
the end (default is False).
Returns
-------
pd.DataFrame
A pandas DataFrame containing the sensitivity indices and parameter
names.
"""
if fast_results.endswith(".csv"):
fast_results = fast_results[:-4] # strip .csv
results = spotpy.analyser.load_csv_results(fast_results)
# Suppress prints
with contextlib.redirect_stdout(open(os.devnull, 'w')):
SI = spotpy.analyser.get_sensitivity_of_fast(results, print_to_console=False)
parnames = spotpy.analyser.get_parameternames(results)
sens = pd.DataFrame(SI)
sens['param'] = parnames
sens.set_index('param', inplace=True)
if to_csv:
        sens.to_csv(os.path.basename(fast_results) + '_sensitivity_indices.csv')
return sens
display(get_si(dir_input + 'FAST/' + 'example_fast_nolim.csv'))
| param | S1 | ST |
| --- | --- | --- |
| lr_temp | 0.001027 | 0.028119 |
| lr_prec | 0.000068 | 0.020323 |
| BETA | 0.000406 | 0.044880 |
| CET | 0.000183 | 0.061767 |
| FC | 0.000177 | 0.070427 |
| K0 | 0.000128 | 0.056534 |
| K1 | 0.000611 | 0.096560 |
| K2 | 0.000564 | 0.073538 |
| LP | 0.000296 | 0.087176 |
| MAXBAS | 0.000166 | 0.045120 |
| PERC | 0.000079 | 0.051002 |
| UZL | 0.000129 | 0.056398 |
| PCORR | 0.011253 | 0.857075 |
| TT_snow | 0.000339 | 0.087251 |
| TT_diff | 0.000221 | 0.064408 |
| CFMAX_ice | 0.001746 | 0.121897 |
| CFMAX_rel | 0.000108 | 0.083392 |
| SFCF | 0.007414 | 0.137477 |
| CWH | 0.000612 | 0.063503 |
| AG | 0.000275 | 0.116503 |
| RFS | 0.000089 | 0.098914 |
If you have additional information on certain parameters, limiting their bounds can have a large impact on sensitivity. For our example catchment, field observations showed that the temperature lapse rate and precipitation correction were unlikely to exceed a certain range, so we limited the parameter space for both and ran a FAST again.
Show code cell source
lim_dict = {'lr_temp_lo': -0.007, 'lr_temp_up': -0.005, 'PCORR_lo': 0.5, 'PCORR_up': 1.5}
# fast_results = psample(df=era5, obs=obs, **psample_settings, **lim_dict)
To see the effect of parameter restrictions on sensitivity, we can plot the indices of both runs. Feel free to explore further by adding more FAST outputs to the plot function.
Show code cell source
import plotly.graph_objs as go
import pandas as pd
def plot_sensitivity_bars(*dfs, labels=None, show=False, bar_width=0.3, bar_gap=0.6):
"""
Plots a horizontal bar chart showing the total sensitivity index for each parameter in a MATILDA model.
Parameters
----------
*dfs : pandas.DataFrame
Multiple dataframes containing the sensitivity indices for each parameter.
labels : list of str, optional
Labels to use for the different steps in the sensitivity analysis. If not provided, the default
labels will be 'No Limits', 'Step 1', 'Step 2', etc.
    show : bool, optional
        If True, the figure is displayed immediately (default is False).
    bar_width : float, optional
        Width of each bar in the chart.
    bar_gap : float, optional
        Space between bars.
"""
traces = []
colors = ['darkblue', 'orange', 'purple', 'cyan'] # add more colors if needed
    for i, df in enumerate(dfs):
        df = get_si(df)
        # Use custom labels if provided, otherwise fall back to the defaults
        if labels is not None:
            label = labels[i]
        elif i == 0:
            label = 'No Limits'
        else:
            label = 'Step ' + str(i)
trace = go.Bar(y=df.index,
x=df['ST'],
name=label,
orientation='h',
marker=dict(color=colors[i]),
width=bar_width)
traces.append(trace)
    layout = go.Layout(title=dict(text='<b>Total Sensitivity Index for MATILDA Parameters</b>', font=dict(size=24)),
xaxis_title='Total Sensitivity Index',
yaxis_title='Parameter',
yaxis=dict(automargin=True),
bargap=bar_gap,
height=700)
fig = go.Figure(data=traces, layout=layout)
if show:
fig.show()
step1 = dir_input + 'FAST/' + 'example_fast_nolim.csv'
step2 = dir_input + 'FAST/' + 'era5_ipynb_fast_step1.csv'
plot_sensitivity_bars(step1, step2, labels=['No Limits', 'lr_temp and PCORR limited'], show=True)
From this figure, we can easily identify the most important parameters. Depending on your desired accuracy and computational resources, you can choose a sensitivity threshold. Let’s say we want to fix all parameters with a sensitivity index below 0.1. For our example, this will reduce the calibration parameters to 7.
Show code cell source
si_df = get_si(step2)
sensitive = si_df.index[si_df['ST'] > 0.10].values
print('Parameters with a sensitivity index > 0.1:')
for i in sensitive: print(i)
Parameters with a sensitivity index > 0.1:
K1
K2
LP
PCORR
TT_snow
CFMAX_ice
SFCF
To give you an idea of how much this procedure can reduce calibration time, we run the `fast_iter()` function again.
Show code cell source
print('Needed number of iterations for FAST with all parameters: ' + str(fast_iter(21)))
print('Needed number of iterations for FAST with the 7 most sensitive parameters: ' + str(fast_iter(7)))
Needed number of iterations for FAST with all parameters: 52437
Needed number of iterations for FAST with the 7 most sensitive parameters: 4935
Since the run time scales roughly linearly with the number of iterations, on a single core this translates into a reduction from about 44 h to 4 h.
To run the calibration with the reduced parameter space, we simply need to define values for the fixed parameters and use a helper function to translate them into boundary arguments. Here we use the values from the calibrated parameter set above (`param`).
Show code cell source
from matilda.mspot_glacier import dict2bounds
fixed_param = {key: val for key, val in param.items() if key not in sensitive}
fixed_param_bounds = dict2bounds(fixed_param)
print('Fixed bounds:\n')
for key in fixed_param_bounds.keys(): print(key + ': ' + str(fixed_param_bounds[key]))
Fixed bounds:
lr_temp_lo: -0.006472598
lr_prec_lo: 0.00010296448
BETA_lo: 4.625306
CET_lo: 0.2875196
FC_lo: 364.81818
K0_lo: 0.28723368
MAXBAS_lo: 6.730105
PERC_lo: 1.1140852
UZL_lo: 198.82584
TT_diff_lo: 0.70977557
CFMAX_rel_lo: 1.2481626
CWH_lo: 0.0020890352
AG_lo: 0.8640329
RFS_lo: 0.21825151
lr_temp_up: -0.006472598
lr_prec_up: 0.00010296448
BETA_up: 4.625306
CET_up: 0.2875196
FC_up: 364.81818
K0_up: 0.28723368
MAXBAS_up: 6.730105
PERC_up: 1.1140852
UZL_up: 198.82584
TT_diff_up: 0.70977557
CFMAX_rel_up: 1.2481626
CWH_up: 0.0020890352
AG_up: 0.8640329
RFS_up: 0.21825151
The `psample()` setup is then as simple as before.
Show code cell source
new_settings = {'rep': 10,                  # Number of model runs. For advice check the documentation of the algorithms.
                'glacier_only': False,      # True when calibrating an entirely glacierized catchment
                'obj_dir': 'maximize',      # Should your objective function be maximized (e.g. NSE) or minimized (e.g. RMSE)?
                'target_mb': -156,          # Average annual glacier mass balance to target
                'dbformat': None,           # Write the results to a file ('csv', 'hdf5', 'ram', 'sql')
                'output': None,             # Choose where to store the files
                'algorithm': 'lhs',         # Choose algorithm (for parallelization: mc, lhs, fast, rope, sceua or demcz)
                'dbname': 'era5_matilda_example',  # Choose a name for the database
                'parallel': False,          # Distribute the calculation on multiple cores or not
                # 'cores': 20               # Set number of cores when running in parallel
                }
psample_settings.update(new_settings)
best_summary = psample(df=era5, obs=obs, **psample_settings, **fixed_param_bounds)
Initializing the Latin Hypercube Sampling (LHS) with 10 repetitions
The objective function will be maximized
Starting the LHS algotrithm with 10 repetitions...
Creating LatinHyperCube Matrix
1 of 10, maximal objective function=0.337516, time remaining: 00:00:17
Initialize database...
['csv', 'hdf5', 'ram', 'sql', 'custom', 'noData']
2 of 10, maximal objective function=0.337516, time remaining: 00:00:19
3 of 10, maximal objective function=0.337516, time remaining: 00:00:18
4 of 10, maximal objective function=0.337516, time remaining: 00:00:15
5 of 10, maximal objective function=0.339114, time remaining: 00:00:12
6 of 10, maximal objective function=0.339114, time remaining: 00:00:10
7 of 10, maximal objective function=0.339114, time remaining: 00:00:06
8 of 10, maximal objective function=0.339114, time remaining: 00:00:03
9 of 10, maximal objective function=0.339114, time remaining: 00:00:00
10 of 10, maximal objective function=0.339114, time remaining: 23:59:57
*** Final SPOTPY summary ***
Total Duration: 35.08 seconds
Total Repetitions: 10
Maximal objective value: 0.339114
Corresponding parameter setting:
lr_temp: -0.00647
lr_prec: 0.000103
BETA: 4.63
CET: 0.288
FC: 365
K0: 0.287
K1: 0.0939361
K2: 0.00689148
LP: 0.450759
MAXBAS: 6.73
PERC: 1.11
UZL: 199
PCORR: 0.824625
TT_snow: -0.853888
TT_diff: 0.71
CFMAX_ice: 10.0464
CFMAX_rel: 1.25
SFCF: 0.997235
CWH: 0.00209
AG: 0.864
RFS: 0.218
******************************
WARNING: The selected algorithm lhs can either maximize or minimize the objective function. You can specify the direction by passing obj_dir to analyze_results(). The default is 'maximize'.
Best parameter set:
lr_temp=-0.00647, lr_prec=0.000103, BETA=4.63, CET=0.288, FC=365.0, K0=0.287, K1=0.09393606219053328, K2=0.006891478196480864, LP=0.45075937122781196, MAXBAS=6.73, PERC=1.11, UZL=199.0, PCORR=0.8246250544275175, TT_snow=-0.8538884090236629, TT_diff=0.71, CFMAX_ice=10.046444567755756, CFMAX_rel=1.25, SFCF=0.9972351393811484, CWH=0.00209, AG=0.864, RFS=0.218
Run number 4 has the highest objectivefunction with: 0.3391
Now, we save the best parameter set for use in the next notebook. The set is stored in the `best_summary` variable.
Note: In our example we will skip this step and use the result from the calibration on an HPC cluster.
Show code cell source
# param = best_summary['best_param']
Finally, we store the parameter set in a `.yml` file.
Show code cell source
write_yaml(param, dir_output + 'parameters.yml')
print(f"Parameter set stored in '{dir_output}parameters.yml'")
Parameter set stored in 'output/parameters.yml'