GAMs
GAM model types from pymgcv.gam
.
GAM
GAM(
predictors: Mapping[str, Iterable[AbstractTerm] | AbstractTerm],
*,
family: AbstractFamily | None = None,
add_intercepts: bool = True,
)
Standard GAM Model.
Initialize the model.
Parameters:
-
predictors
(Mapping[str, Iterable[AbstractTerm] | AbstractTerm]
) –Dictionary mapping target variable names to an iterable of
AbstractTerm
objects used to predict \(g([\mathbb{E}[Y])\).- For simple models, this will usually be a single key-value pair:
- For multivariate models, e.g.
MVN
, the dictionary will have multiple pairs: - For multiparameter models, such as LSS-type models (e.g.
GauLSS
), the first key-value pair must correspond to the variable name in the data (usually modelling the location), and the subsequent dictionary elements model the other parameters in the order as defined by the family (e.g. scale and shape). The names of these extra parameters, can be anything, and are used as column names for prediction outputs.
-
family
(AbstractFamily | None
, default:None
) –Distribution family to use. See Families for available options.
-
add_intercepts
(bool
, default:True
) –If False, intercept terms must be manually added to the formulae using
Intercept
. If True, automatically adds an intercept term to each formula. Intercepts are added as needed by methods, such thatgam.predictors
reflect the model as constructed (i.e. before adding intercepts).
fit
fit(
data: DataFrame | Mapping[str, ndarray | Series],
*,
method: Literal[
"GCV.Cp", "GACV.Cp", "QNCV", "REML", "P-REML", "ML", "P-ML", "NCV"
] = "REML",
weights: str | ndarray | Series | None = None,
optimizer: str | tuple[str, str] = ("outer", "newton"),
scale: Union[Literal["unknown"], float, int, NoneType] = None,
select: bool = False,
gamma: float | int = 1,
knots: dict[str, ndarray] | None = None,
n_threads: int = 1,
) -> typing.Self
Fit the GAM.
Parameters:
-
data
(DataFrame | Mapping[str, ndarray | Series]
) –DataFrame or dictionary containing all variables referenced in the model. Note, using a dictionary is required when passing matrix-valued variables.
-
method
(Literal['GCV.Cp', 'GACV.Cp', 'QNCV', 'REML', 'P-REML', 'ML', 'P-ML', 'NCV']
, default:'REML'
) –Method for smoothing parameter estimation, matching the mgcv options.
-
weights
(str | ndarray | Series | None
, default:None
) –Observation weights. Either a string, matching a column name, or an array/series with length equal to the number of observations.
-
optimizer
(str | tuple[str, str]
, default:('outer', 'newton')
) –An string or length 2 tuple, specifying the numerical optimization method to use to optimize the smoothing parameter estimation criterion (given by method). "outer" for the direct nested optimization approach. "outer" can use several alternative optimizers, specified in the second element: "newton" (default), "bfgs", "optim" or "nlm". "efs" for the extended Fellner Schall method of Wood and Fasiolo (2017).
-
scale
(Union[Literal['unknown'], float, int, NoneType]
, default:None
) –If a number is provided, it is treated as a known scale parameter. If left to None, the scale parameter is 1 for Poisson and binomial and unknown otherwise. Note that (RE)ML methods can only work with scale parameter 1 for the Poisson and binomial cases.
-
select
(bool
, default:False
) –If set to True then gam can add an extra penalty to each term so that it can be penalized to zero. This means that the smoothing parameter estimation during fitting can completely remove terms from the model. If the corresponding smoothing parameter is estimated as zero then the extra penalty has no effect. Use gamma to increase level of penalization.
-
gamma
(float | int
, default:1
) –Increase this beyond 1 to produce smoother models. gamma multiplies the effective degrees of freedom in the GCV or UBRE/AIC. gamma can be viewed as an effective sample size in the GCV score, and this also enables it to be used with REML/ML. Ignored with P-RE/ML or the efs optimizer.
-
knots
(dict[str, ndarray] | None
, default:None
) –Dictionary mapping covariate names to knot locations. For most bases, the length of the knot locations should match with a user supplied
k
value. E.g. forS("x", k=64)
, you could passknots={"x": np.linspace(0, 1, 64)}
. For multidimensional smooths, e.g.S("x", "z", k=64)
, you could create a grid of coordinates:Example
Note if using
ThinPlateSpline
, this will avoid the eigen-decomposition used to find the basis, which although fast often leads to worse results. Different terms can use different numbers of knots, unless they share covariates. -
n_threads
(int
, default:1
) –Number of threads to use for fitting the GAM.
predict
predict(
data: DataFrame | Mapping[str, ndarray | Series] | None = None,
*,
compute_se: bool = False,
type: Literal["response", "link"] = "link",
block_size: int | None = None,
) -> dict[str, numpy.ndarray] | dict[str, pymgcv.custom_types.FitAndSE[numpy.ndarray]]
Compute model predictions with (optionally) uncertainty estimates.
Makes predictions for new data using the fitted GAM model. Predictions are returned on the link scale (linear predictor scale), not the response scale. For response scale predictions, apply the appropriate inverse link function to the results.
Parameters:
-
data
(DataFrame | Mapping[str, ndarray | Series] | None
, default:None
) –A dictionary or DataFrame containing all variables referenced in the model. Defaults to the data used to fit the model.
-
compute_se
(bool
, default:False
) –Whether to compute standard errors for predictions.
-
type
(Literal['response', 'link']
, default:'link'
) –Type of prediction to compute. Either "link" for linear predictor scale or "response" for response scale.
-
block_size
(int | None
, default:None
) –Number of rows to process at a time. If None then block size is 1000 if data supplied, and the number of rows in the model frame otherwise.
Returns:
partial_effects
partial_effects(
data: DataFrame | Mapping[str, Series | ndarray] | None = None,
*,
compute_se: bool = False,
block_size: int | None = None,
) -> (
dict[str, pandas.core.frame.DataFrame]
| dict[str, pymgcv.custom_types.FitAndSE[pandas.core.frame.DataFrame]]
)
Compute partial effects for all model terms.
Calculates the contribution of each model term to the overall prediction on the link scale. The sum of all fit columns equals the total prediction (link scale).
Parameters:
-
data
(DataFrame | Mapping[str, Series | ndarray] | None
, default:None
) –A dictionary or DataFrame containing all variables referenced in the model. Defaults to the data used to fit the model.
-
compute_se
(bool
, default:False
) –Whether to compute and return standard errors.
-
block_size
(int | None
, default:None
) –Number of rows to process at a time. If None then block size is 1000 if data supplied, and the number of rows in the model frame otherwise.
BAM
BAM(
predictors: Mapping[str, Iterable[AbstractTerm] | AbstractTerm],
*,
family: AbstractFamily | None = None,
add_intercepts: bool = True,
)
A big-data GAM (BAM) model.
Initialize the model.
Parameters:
-
predictors
(Mapping[str, Iterable[AbstractTerm] | AbstractTerm]
) –Dictionary mapping target variable names to an iterable of
AbstractTerm
objects used to predict \(g([\mathbb{E}[Y])\).- For simple models, this will usually be a single key-value pair:
- For multivariate models, e.g.
MVN
, the dictionary will have multiple pairs: - For multiparameter models, such as LSS-type models (e.g.
GauLSS
), the first key-value pair must correspond to the variable name in the data (usually modelling the location), and the subsequent dictionary elements model the other parameters in the order as defined by the family (e.g. scale and shape). The names of these extra parameters, can be anything, and are used as column names for prediction outputs.
-
family
(AbstractFamily | None
, default:None
) –Distribution family to use. See Families for available options.
-
add_intercepts
(bool
, default:True
) –If False, intercept terms must be manually added to the formulae using
Intercept
. If True, automatically adds an intercept term to each formula. Intercepts are added as needed by methods, such thatgam.predictors
reflect the model as constructed (i.e. before adding intercepts).
fit
fit(
data: DataFrame | Mapping[str, ndarray | Series],
*,
method: Literal[
"fREML", "GCV.Cp", "GACV.Cp", "REML", "P-REML", "ML", "P-ML", "NCV"
] = "fREML",
weights: str | ndarray | Series | None = None,
scale: Union[Literal["unknown"], float, int, NoneType] = None,
select: bool = False,
gamma: float | int = 1,
knots: dict[str, ndarray] | None = None,
chunk_size: int = 10000,
discrete: bool = False,
samfrac: float | int = 1,
n_threads: int = 1,
gc_level: Literal[0, 1, 2] = 0,
) -> typing.Self
Fit the GAM.
Parameters:
-
data
(DataFrame | Mapping[str, ndarray | Series]
) –DataFrame or dictionary containing all variables referenced in the model. Note, using a dictionary is required when passing matrix-valued variables.
-
method
(Literal['fREML', 'GCV.Cp', 'GACV.Cp', 'REML', 'P-REML', 'ML', 'P-ML', 'NCV']
, default:'fREML'
) –Method for smoothing parameter estimation, matching the mgcv, options.
-
weights
(str | ndarray | Series | None
, default:None
) –Observation weights. Either a string, matching a column name, or a array/series with length equal to the number of observations.
-
scale
(Union[Literal['unknown'], float, int, NoneType]
, default:None
) –If a number is provided, it is treated as a known scale parameter. If left to None, the scale parameter is 1 for Poisson and binomial and unknown otherwise. Note that (RE)ML methods can only work with scale parameter 1 for the Poisson and binomial cases.
-
select
(bool
, default:False
) –If set to True then gam can add an extra penalty to each term so that it can be penalized to zero. This means that the smoothing parameter estimation during fitting can completely remove terms from the model. If the corresponding smoothing parameter is estimated as zero then the extra penalty has no effect. Use gamma to increase level of penalization.
-
gamma
(float | int
, default:1
) –Increase this beyond 1 to produce smoother models. gamma multiplies the effective degrees of freedom in the GCV or UBRE/AIC. gamma can be viewed as an effective sample size in the GCV score, and this also enables it to be used with REML/ML. Ignored with P-RE/ML or the efs optimizer.
-
knots
(dict[str, ndarray] | None
, default:None
) –Dictionary mapping covariate names to knot locations. For most bases, the length of the knot locations should match with a user supplied
k
value. E.g. forS("x", k=64)
, you could passknots={"x": np.linspace(0, 1, 64)}
. For multidimensional smooths, e.g.S("x", "z", k=64)
, you could create a grid of coordinates:Example
Note if using
ThinPlateSpline
, this will avoid the eigen-decomposition used to find the basis, which although fast often leads to worse results. Different terms can use different numbers of knots, unless they share covariates. -
chunk_size
(int
, default:10000
) –The model matrix is created in chunks of this size, rather than ever being formed whole. Reset to 4p if chunk.size < 4p where p is the number of coefficients.
-
discrete
(bool
, default:False
) –if True and using method="fREML", discretizes covariates for storage and efficiency reasons.
-
samfrac
(float | int
, default:1
) –If
0<samfrac<1
, performs a fast preliminary fitting step using a subsample of the data to improve convergence speed. -
n_threads
(int
, default:1
) –Number of threads to use for fitting the GAM.
-
gc_level
(Literal[0, 1, 2]
, default:0
) –0 uses R's garbage collector, 1 and 2 use progressively more frequent garbage collection, which takes time but reduces memory requirements.
predict
predict(
data: DataFrame | Mapping[str, ndarray | Series] | None = None,
*,
compute_se: bool = False,
type: Literal["link", "response"] = "link",
block_size: int = 50000,
discrete: bool = True,
n_threads: int = 1,
gc_level: Literal[0, 1, 2] = 0,
) -> dict[str, pymgcv.custom_types.FitAndSE[numpy.ndarray]] | dict[str, numpy.ndarray]
Compute model predictions with uncertainty estimates.
Makes predictions for new data using the fitted GAM model. Predictions are returned on the link scale (linear predictor scale), not the response scale. For response scale predictions, apply the appropriate inverse link function to the results.
Parameters:
-
data
(DataFrame | Mapping[str, ndarray | Series] | None
, default:None
) –A dictionary or DataFrame containing all variables referenced in the model. Defaults to the data used to fit the model.
-
compute_se
(bool
, default:False
) –Whether to compute and return standard errors.
-
type
(Literal['link', 'response']
, default:'link'
) –Type of prediction to compute. Either "link" for linear predictor scale or "response" for response scale.
-
block_size
(int
, default:50000
) –Number of rows to process at a time.
-
n_threads
(int
, default:1
) –Number of threads to use for computation.
-
discrete
(bool
, default:True
) –If True and the model was fitted with discrete=True, then uses discrete prediction methods in which covariates are discretized for efficiency for storage and efficiency reasons.
-
gc_level
(Literal[0, 1, 2]
, default:0
) –0 uses R's garbage collector, 1 and 2 use progressively more frequent garbage collection, which takes time but reduces memory requirements.
partial_effects
partial_effects(
data: DataFrame | Mapping[str, ndarray | Series] | None = None,
*,
compute_se: bool = False,
block_size: int = 50000,
n_threads: int = 1,
discrete: bool = True,
gc_level: Literal[0, 1, 2] = 0,
) -> (
dict[str, pandas.core.frame.DataFrame]
| dict[str, pymgcv.custom_types.FitAndSE[pandas.core.frame.DataFrame]]
)
Compute partial effects for all model terms.
Calculates the contribution of each model term to the overall prediction. This decomposition is useful for understanding which terms contribute most to predictions and for creating partial effect plots. The sum of all fit columns equals the total prediction.
Parameters:
-
data
(DataFrame | Mapping[str, ndarray | Series] | None
, default:None
) –A dictionary or DataFrame containing all variables referenced in the model. Defaults to the data used to fit the model.
-
compute_se
(bool
, default:False
) –Whether to compute and return standard errors.
-
block_size
(int
, default:50000
) –Number of rows to process at a time. Higher is faster but more memory intensive.
-
n_threads
(int
, default:1
) –Number of threads to use for computation.
-
discrete
(bool
, default:True
) –If True and the model was fitted with discrete=True, then uses discrete prediction methods in which covariates are discretized for efficiency for storage and efficiency reasons.
-
gc_level
(Literal[0, 1, 2]
, default:0
) –0 uses R's garbage collector, 1 and 2 use progressively more frequent garbage collection, which takes time but reduces memory requirements.
AbstractGAM
Abstract base class for GAM models.
This class cannot be initialized but provides a common interface for fitting and predicting using different types of GAM models.
referenced_variables
property
List of variables referenced by the model required to be present in data.
check_k
Checking basis dimension choices (k).
The default choices for k
are relatively arbitrary. This function aids in
assessing whether the chosen basis dimensions are appropriate. A low p-value can
indicate that the chosen basis dimension is too low.
The function works by constrasting a residual variance estimate based on near
neighbour points (based on the covariates of a term), to the overall residual
variance. The k_index
is the ratio of the near neighbour estimate to the
overall variance. The further below 1 the k_index
is, the more likely it is
that there exists missed patterns in the residuals. The p-value is generated
using a randomization test to obtain the null distribution.
For details, see section 5.9 of:
Wood S.N. (2017) Generalized Additive Models: An Introduction with R (2nd
edition). Chapman and Hall/CRC Press.
Parameters:
-
subsample
(int
, default:5000
) –The maximum number of points to use, above which a random subsample is used.
-
n_rep
(int
, default:400
) –The number of re-shuffles to do to get the p-value.
Returns:
-
DataFrame
–A dataframe with the following columns:
term
: The mgcv-style name of the smooth term.max_edf
: The maximum possible edf (oftenk-1
).k_index
: The ratio between the nearest neighbour variance residual variance estimate and the overall variance.p_value
: The p-value of the randomization test.max_edf
: The maximum effective degrees of freedom.
coefficients
Extract model coefficients from the fitted GAM.
Returns a series where the index if the mgcv-style name of the parameter.
covariance
covariance(
*, sandwich: bool = False, freq: bool = False, unconditional: bool = False
) -> DataFrame
Extract the covariance matrix from the fitted GAM.
Extracts the Bayesian posterior covariance matrix of the parameters or frequentist covariance matrix of the parameter estimators from the fitted GAM.
Parameters:
-
sandwich
(bool
, default:False
) –If True, compute sandwich estimate of covariance matrix. Currently expensive for discrete bam fits.
-
freq
(bool
, default:False
) –If True, return the frequentist covariance matrix of the parameter estimators. If False, return the Bayesian posterior covariance matrix of the parameters. The latter option includes the expected squared bias according to the Bayesian smoothing prior.
-
unconditional
(bool
, default:False
) –If True (and freq=False), return the Bayesian smoothing parameter uncertainty corrected covariance matrix, if available.
Returns:
-
DataFrame
–The covariance matrix as a pandas dataframe where the column names and index
-
DataFrame
–are the mgcv-style parameter names.
partial_effect
partial_effect(
term: AbstractTerm | int,
target: str | None = None,
data: DataFrame | Mapping[str, Series | ndarray] | None = None,
*,
compute_se: bool = False,
) -> typing.Union[numpy.ndarray, pymgcv.custom_types.FitAndSE[numpy.ndarray]]
Compute the partial effect for a single model term.
This method efficiently computes the contribution of one specific term to the model predictions.
Parameters:
-
term
(AbstractTerm | int
) –The specific term to evaluate (must match a term used in the original model specification) or an integer index representing the position of the term in the target's predictor list
-
target
(str | None
, default:None
) –Name of the target variable from the keys of
gam.predictors
. If set to None, the single predictor is used if only one is present, otherwise an error is raised. -
data
(DataFrame | Mapping[str, Series | ndarray] | None
, default:None
) –DataFrame or dictionary containing the variables needed to compute the partial effect for the term.
-
compute_se
(bool
, default:False
) –Whether to compute and return standard errors
edf
Compute the effective degrees of freedom (EDF) for the model coefficients.
Returns:
-
Series
–A series of EDF values, with the mgcv-style coefficient names as the index.
penalty_edf
Computed the effective degrees of freedom (EDF) associated with each penalty.
Returns:
-
–
A series of EDF values, with the index being the mgcv-style name of the
-
–
penalty.
partial_residuals
partial_residuals(
term: AbstractTerm | int,
target: str | None = None,
data: DataFrame | Mapping[str, Series | ndarray] | None = None,
*,
avoid_scaling: bool = False,
) -> ndarray
Compute partial residuals for model diagnostic plots.
Partial residuals combine the fitted values from a specific term with the overall model residuals. They're useful for assessing whether the chosen smooth function adequately captures the relationship, or if a different functional form might be more appropriate.
Parameters:
-
term
(AbstractTerm | int
) –The model term to compute partial residuals for. If an integer, it is interpreted as the index of the term in the predictor of
target
. -
target
(str | None
, default:None
) –Name of the target variable (response variable or family parameter name from the model specification). If set to None, an error is raised when multiple predictors are present; otherwise, the sole available target is used.
-
data
(DataFrame | Mapping[str, Series | ndarray] | None
, default:None
) –A dictionary or DataFrame containing all variables referenced in the model. Defaults to the data used to fit the model.
-
avoid_scaling
(bool
, default:False
) –If True, and the term has a numeric by variable, the scaling by the by variable is not included in the term effect. This facilitates plotting the residuals, as the plots only show the smooth component (unscaled by the by variable).
Returns:
-
ndarray
–Series containing the partial residuals for the specified term
aic
Calculate Akaike's Information Criterion for fitted GAM models.
Where possible (fitting GAM
/BAM
models with "ML" or "REML"), this uses the approach of Wood, Pya & Saefken 2016,
which accounts for smoothing parameter uncertainty, without favouring
overly simple models.
Parameters:
-
k
(float
, default:2
) –Penalty per parameter (default 2 for classical AIC).
residuals
residuals(
type: Literal[
"deviance", "pearson", "scaled.pearson", "working", "response"
] = "deviance",
)
Compute the residuals for a fitted model.
Parameters:
-
type
(Literal['deviance', 'pearson', 'scaled.pearson', 'working', 'response']
, default:'deviance'
) –Type of residuals to compute, one of:
- response: Raw residuals \(y - \mu\), where \(y\) is the observed data and \(\mu\) is the model fitted value.
- pearson: Pearson residuals — raw residuals divided by the square root of the model's mean-variance relationship. $$ \frac{y - \mu}{\sqrt{V(\mu)}} $$
- scaled.pearson: Raw residuals divided by the standard deviation of the data according to the model mean variance relationship and estimated scale parameter.
- deviance: Deviance residuals as defined by the model’s family.
- working: Working residuals are the residuals returned from model fitting at convergence.
FitState
The mgcv gam, and the data used for fitting.
This gets set as an attribute fit_state on the AbstractGAM object after fitting.
Attributes:
-
rgam
–The fitted mgcv gam object.
-
data
–The data used for fitting.