Feature Selector

pybasin.feature_selector.default_feature_selector.DefaultFeatureSelector

Bases: Pipeline

Feature selector combining variance threshold and correlation filtering.

This class extends sklearn's Pipeline with two steps:

  1. VarianceThreshold: removes features whose variance falls below variance_threshold
  2. CorrelationSelector: removes highly correlated features (|corr| > correlation_threshold)

The correlation filter compares absolute correlation values against the threshold, so a strongly negative correlation (for example -0.97 with a threshold of 0.95) triggers removal just as a strongly positive one does.
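To illustrate why the absolute value matters, here is a small NumPy sketch. It does not call pybasin, and the array names are made up for the example. A feature and its mirror image have a Pearson correlation of about -1, so a signed comparison would keep both, while the absolute comparison flags the pair.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Hypothetical feature matrix: column 1 is the exact negative of column 0,
# column 2 is independent noise.
features = np.column_stack([x, -x, rng.normal(size=200)])

corr = np.corrcoef(features, rowvar=False)  # Pearson correlation, shape (3, 3)
threshold = 0.95

signed_hit = corr[0, 1] > threshold         # False: the correlation is about -1
absolute_hit = abs(corr[0, 1]) > threshold  # True: the magnitude is about 1
```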

As a Pipeline subclass, this implements the full sklearn transformer API: fit(), transform(), fit_transform(), get_params(), set_params(), etc.

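from pybasin.feature_selector.default_feature_selector import DefaultFeatureSelector
# features: array of shape (n_samples, n_features)
selector = DefaultFeatureSelector(variance_threshold=0.01, correlation_threshold=0.95)
features_filtered = selector.fit_transform(features)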

Attributes:

variance_threshold (float): Minimum variance required to keep a feature.

correlation_threshold (float): Maximum absolute correlation allowed between features. Features with |correlation| > threshold will be removed.

min_features (int): Minimum number of features to keep.

Functions

get_support

get_support(indices: bool = False) -> np.ndarray

Get mask or indices of features that passed the filter.

Parameters:

indices (bool, default False): If True, returns indices. If False, returns boolean mask.

Returns:

ndarray: Boolean mask or integer indices of selected features.
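The two return forms carry the same information. The NumPy sketch below uses a made-up mask and made-up feature names to show how they relate; it does not call pybasin.

```python
import numpy as np

# Hypothetical mask, shaped like what get_support() might return for five inputs.
mask = np.array([True, False, True, True, False])
feature_names = np.array(["mean", "std", "min", "max", "range"])  # made-up names

indices = np.flatnonzero(mask)    # integer positions of the kept features
kept_names = feature_names[mask]  # the same selection expressed through the mask
```

The mask form is convenient for slicing arrays of the original width, while the index form is easier to log or store.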


pybasin.feature_selector.correlation_selector.CorrelationSelector

Bases: BaseEstimator, TransformerMixin

Scikit-learn transformer to remove highly correlated features.

When a pair of features has absolute correlation above threshold, the feature with the higher mean absolute correlation against all remaining features is removed. This follows the algorithm described by Kuhn & Johnson (2013) and implemented in R's caret::findCorrelation().

The mean absolute correlation is recomputed after each removal so that subsequent decisions reflect the current feature set.

Attributes:

threshold (float): Correlation threshold. Feature pairs with absolute correlation above this value trigger a removal decision.

min_features (int): Minimum number of features to keep.

support_: Boolean mask of selected features (set after fit).

n_features_in_: Number of input features (set after fit).

Functions

fit

fit(X: ndarray, y: ndarray | None = None)

Compute which features to keep using mean absolute correlation ranking.

For each pair exceeding the threshold, the feature with the higher mean absolute correlation across all remaining features is dropped. The correlation statistics are recomputed after each removal.
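The procedure can be sketched in plain NumPy as follows. This is an illustration under stated assumptions, not pybasin's implementation: the function name `select_uncorrelated` is hypothetical, pairs are visited in order of strongest correlation first (the source does not specify the order), `min_features` is assumed to stop removals once that many features remain, and constant columns are assumed to have been removed already, since they produce NaN correlations.

```python
import numpy as np


def select_uncorrelated(X, threshold=0.95, min_features=1):
    """Return a boolean mask of columns to keep (illustrative sketch only)."""
    abs_corr = np.atleast_2d(np.abs(np.corrcoef(X, rowvar=False)))
    np.fill_diagonal(abs_corr, 0.0)  # ignore each feature's self-correlation
    keep = np.ones(X.shape[1], dtype=bool)

    while keep.sum() > min_features:
        active = np.flatnonzero(keep)
        sub = abs_corr[np.ix_(active, active)]
        # Most strongly correlated pair among the remaining features.
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        if sub[i, j] <= threshold:
            break  # no remaining pair exceeds the threshold
        # Mean absolute correlation, recomputed on the current feature set.
        mean_abs = sub.sum(axis=1) / max(len(active) - 1, 1)
        drop = i if mean_abs[i] >= mean_abs[j] else j
        keep[active[drop]] = False

    return keep


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# Column 1 is a near-duplicate of column 0; column 2 is independent.
X = np.column_stack([a, a + 0.01 * rng.normal(size=500), b])

keep = select_uncorrelated(X, threshold=0.95)
X_reduced = X[:, keep]  # one of the near-duplicates is dropped, b is kept
```

Recomputing `mean_abs` inside the loop is what distinguishes this from a one-shot ranking: once a feature is dropped, it no longer inflates the scores of the features it was correlated with.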

Parameters:

X (ndarray, required): Training data of shape (n_samples, n_features).

y (ndarray | None, default None): Not used, present for API consistency.

Returns:

Fitted transformer.

transform

transform(X: ndarray) -> np.ndarray

Remove correlated features.

Parameters:

X (ndarray, required): Input data of shape (n_samples, n_features).

Returns:

ndarray: Data with correlated features removed, shape (n_samples, n_features_out).
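Assuming transform applies the fitted support_ mask column-wise (the usual pattern for sklearn selectors, though the source does not show the implementation), the shape change can be sketched with NumPy alone. The array and mask below are made up for the example.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(3, 4)   # 3 samples, 4 features
support = np.array([True, False, True, True])  # hypothetical fitted mask

# Boolean column indexing keeps every sample and drops the masked-out features.
X_out = X[:, support]
```

Here n_features_out equals the number of True entries in the mask, so the result has shape (3, 3).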

get_support

get_support(indices: bool = False)

Get a mask or indices of selected features.

Parameters:

indices (bool, default False): If True, return feature indices. Otherwise, return boolean mask.

Returns:

Boolean mask or integer indices of selected features.