Feature Selector

pybasin.feature_selector.default_feature_selector.DefaultFeatureSelector

Bases: Pipeline

Feature selector combining variance threshold and correlation filtering.

This class extends sklearn's Pipeline with two steps:

1. VarianceThreshold: Removes features whose variance falls below the threshold.
2. CorrelationSelector: Removes highly correlated features (|corr| > threshold).

The correlation threshold is applied to absolute correlation values, so a strongly negative correlation triggers removal just as a strongly positive one does.
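The two filtering criteria can be illustrated with plain NumPy. This is a minimal sketch and not pybasin's implementation: the synthetic data, the variable names, and the strict `>` comparison in the variance step are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)

X = np.column_stack([
    a,                   # informative feature
    -a + 0.01 * b,       # almost perfectly anti-correlated with column 0
    b,                   # independent feature
    np.full(200, 3.0),   # constant feature, variance 0
])

# Step 1: variance filter (assumed here: keep columns with variance > threshold).
variance_threshold = 0.01
keep_var = X.var(axis=0) > variance_threshold

# Step 2: absolute correlation between the surviving columns.
abs_corr = np.abs(np.corrcoef(X[:, keep_var], rowvar=False))
correlation_threshold = 0.95
flagged = abs_corr[0, 1] > correlation_threshold
```

The constant column fails the variance check, and the pair (0, 1) is flagged even though its raw correlation is negative, because only the magnitude is compared against the threshold.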

As a Pipeline subclass, this implements the full sklearn transformer API: fit(), transform(), fit_transform(), get_params(), set_params(), etc.

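from pybasin.feature_selector.default_feature_selector import DefaultFeatureSelector
# features: array of shape (n_samples, n_features)
selector = DefaultFeatureSelector(variance_threshold=0.01, correlation_threshold=0.95)
features_filtered = selector.fit_transform(features)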

Attributes:

Name	Type	Description
`variance_threshold`	`float`	Minimum variance required to keep a feature.
`correlation_threshold`	`float`	Maximum absolute correlation allowed between features. For each pair of features with |correlation| > threshold, one feature is removed.
`min_features`	`int`	Minimum number of features to keep.

Functions

get_support

get_support(indices: bool = False) -> np.ndarray

Get mask or indices of features that passed the filter.

Parameters:

Name	Type	Description	Default
`indices`	`bool`	If True, returns indices. If False, returns boolean mask.	`False`

Returns:

Type	Description
`ndarray`	Boolean mask or integer indices of selected features.
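Because the correlation step only sees the features that survived the variance step, a mask over the original features has to be composed from the two per-step masks. The sketch below shows one way this could be done. The mask values are hypothetical and the composition logic is an assumption, not a description of pybasin's internals.

```python
import numpy as np

# Hypothetical per-step masks: step 1 sees 6 input features and keeps 4;
# step 2 sees only those 4 survivors and keeps 3 of them.
variance_mask = np.array([True, False, True, True, False, True])
correlation_mask = np.array([True, True, False, True])

# Map the second-stage mask back onto the original feature positions.
combined = np.zeros(variance_mask.shape, dtype=bool)
combined[np.flatnonzero(variance_mask)[correlation_mask]] = True

mask = combined                     # analogous to get_support(indices=False)
indices = np.flatnonzero(combined)  # analogous to get_support(indices=True)
```

Here the original features 0, 2 and 5 survive both steps. The boolean mask and the index array carry the same information in two forms.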

pybasin.feature_selector.correlation_selector.CorrelationSelector

Bases: BaseEstimator, TransformerMixin

Scikit-learn transformer to remove highly correlated features.

When a pair of features has absolute correlation above threshold, the feature with the higher mean absolute correlation against all remaining features is removed. This follows the algorithm described by Kuhn & Johnson (2013) and implemented in R's caret::findCorrelation().

The mean absolute correlation is recomputed after each removal so that subsequent decisions reflect the current feature set.
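The removal rule described above can be sketched in NumPy as follows. This is an illustrative approximation, not the library's code: the function name `select_uncorrelated`, the choice to visit the most strongly correlated pair first, and the handling of `min_features` are assumptions. The sketch also assumes at least two columns and no constant columns, since `np.corrcoef` yields NaN for zero-variance features.

```python
import numpy as np


def select_uncorrelated(X, threshold=0.95, min_features=1):
    """Return a boolean mask of columns to keep (illustrative sketch)."""
    abs_corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(abs_corr, 0.0)  # ignore self-correlation
    keep = np.ones(abs_corr.shape[0], dtype=bool)

    while keep.sum() > min_features:
        active = np.flatnonzero(keep)
        sub = abs_corr[np.ix_(active, active)]
        if sub.max() <= threshold:
            break  # no remaining pair exceeds the threshold
        # Most strongly correlated pair among the remaining features.
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        # Mean absolute correlation against the remaining features,
        # recomputed on every pass because `sub` shrinks after each removal.
        mean_corr = sub.sum(axis=1) / max(len(active) - 1, 1)
        drop = i if mean_corr[i] >= mean_corr[j] else j
        keep[active[drop]] = False
    return keep


rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 500))
X = np.column_stack([a, a + 0.01 * b, b, c])  # columns 0 and 1 nearly identical
mask = select_uncorrelated(X, threshold=0.95)
```

In this example exactly one of the two near-duplicate columns is dropped. Which one depends on their mean absolute correlation with the other features.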

Attributes:

Name	Type	Description
`threshold`	`float`	Correlation threshold. Feature pairs with absolute correlation above this value trigger a removal decision.
`min_features`	`int`	Minimum number of features to keep.
`support_`	`ndarray`	Boolean mask of selected features (set after `fit`).
`n_features_in_`	`int`	Number of input features (set after `fit`).

Functions

fit

fit(X: ndarray, y: ndarray | None = None)

Compute which features to keep using mean absolute correlation ranking.

For each pair exceeding the threshold, the feature with the higher mean absolute correlation across all remaining features is dropped. The correlation statistics are recomputed after each removal.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Training data of shape (n_samples, n_features).	required
`y`	`ndarray | None`	Not used, present for API consistency.	`None`

Returns:

Type	Description
`CorrelationSelector`	The fitted transformer.

transform

transform(X: ndarray) -> np.ndarray

Remove correlated features.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Input data of shape (n_samples, n_features).	required

Returns:

Type	Description
`ndarray`	Data with correlated features removed, shape (n_samples, n_features_out).
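Conceptually, this kind of column removal amounts to indexing the input with a boolean mask. The snippet below is a minimal sketch with a hypothetical mask standing in for the fitted `support_`. It is not taken from pybasin's source.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(3, 4)   # 3 samples, 4 features
support = np.array([True, False, True, True])  # hypothetical fitted mask

X_out = X[:, support]  # keeps columns 0, 2 and 3 in their original order
```

The number of rows is unchanged, and n_features_out equals the number of `True` entries in the mask.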

get_support

get_support(indices: bool = False)

Get a mask or indices of selected features.

Parameters:

Name	Type	Description	Default
`indices`	`bool`	If True, return feature indices. Otherwise, return boolean mask.	`False`

Returns:

Type	Description
`ndarray`	Boolean mask or integer indices of selected features.