Feature Selector
pybasin.feature_selector.default_feature_selector.DefaultFeatureSelector
Bases: Pipeline
Feature selector combining variance threshold and correlation filtering.
This class extends sklearn's Pipeline with two steps:
- VarianceThreshold: Removes features with variance below threshold
- CorrelationSelector: Removes highly correlated features (|corr| > threshold)
The correlation threshold is applied to absolute correlation values, so a strong negative correlation triggers removal just as a strong positive one does.
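To illustrate why the absolute value matters, here is a minimal NumPy sketch (not pybasin code; the synthetic data and variable names are assumptions made for the example). Two features that are near mirror images of each other have a signed correlation close to -1, which a signed comparison would miss.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([
    a,                                  # feature 0
    -a + 0.01 * rng.normal(size=200),   # feature 1: almost exactly -feature 0
    rng.normal(size=200),               # feature 2: independent noise
])

corr = np.corrcoef(X, rowvar=False)     # signed Pearson correlation matrix
threshold = 0.95

signed_hit = corr[0, 1] > threshold         # False: the correlation is near -1
absolute_hit = abs(corr[0, 1]) > threshold  # True: |corr| exceeds the threshold
```

Under an absolute-value rule, features 0 and 1 count as redundant even though their signed correlation is far below the threshold.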
As a Pipeline subclass, this implements the full sklearn transformer API: fit(), transform(), fit_transform(), get_params(), set_params(), etc.
```python
from pybasin.feature_selector.default_feature_selector import DefaultFeatureSelector

selector = DefaultFeatureSelector(variance_threshold=0.01, correlation_threshold=0.95)
features_filtered = selector.fit_transform(features)
```
Attributes:
| Name | Type | Description |
|---|---|---|
| `variance_threshold` | `float` | Minimum variance required to keep a feature. |
| `correlation_threshold` | `float` | Maximum absolute correlation allowed between features. Features with absolute correlation above this threshold will be removed. |
| `min_features` | `int` | Minimum number of features to keep. |
Functions
get_support
Get mask or indices of features that passed the filter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `indices` | `bool` | If True, returns indices. If False, returns boolean mask. | `False` |
Returns:
| Type | Description |
|---|---|
| `ndarray` | Boolean mask or integer indices of selected features. |
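The two return forms carry the same information. The sketch below uses plain NumPy with a hypothetical mask and hypothetical feature names (neither comes from pybasin) to show how a boolean mask relates to integer indices and how either can be used to look up the retained feature names.

```python
import numpy as np

# Hypothetical mask for five input features; True marks a kept feature.
mask = np.array([True, False, True, True, False])

# Integer positions of the kept features, equivalent to the mask.
indices = np.flatnonzero(mask)

# Hypothetical feature names, used only to show the lookup.
names = np.array(["mean", "std", "max", "min", "median"])
kept_names = names[mask].tolist()
```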
pybasin.feature_selector.correlation_selector.CorrelationSelector
Bases: BaseEstimator, TransformerMixin
Scikit-learn transformer to remove highly correlated features.
When a pair of features has absolute correlation above threshold,
the feature with the higher mean absolute correlation against all
remaining features is removed. This follows the algorithm described
by Kuhn & Johnson (2013) and implemented in R's
caret::findCorrelation().
The mean absolute correlation is recomputed after each removal so that subsequent decisions reflect the current feature set.
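The procedure described above can be sketched in NumPy as follows. This is one plausible reading of the description, not pybasin's actual implementation: the function name `correlation_support`, the choice to process the most correlated remaining pair first, and the tie-breaking rule are all assumptions. It also assumes no constant columns, since those produce NaN correlations.

```python
import numpy as np

def correlation_support(X, threshold=0.95, min_features=1):
    """Return a boolean mask of features to keep (illustrative sketch only)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0.0)                 # ignore self-correlation
    keep = np.ones(corr.shape[0], dtype=bool)

    # Stop once removing another feature would violate min_features.
    while keep.sum() > max(min_features, 1):
        idx = np.flatnonzero(keep)
        sub = corr[np.ix_(idx, idx)]            # correlations among remaining features
        if sub.max() <= threshold:
            break                               # no pair exceeds the threshold
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        # Mean absolute correlation against the other remaining features,
        # recomputed on every pass because `sub` reflects the current set.
        mean_abs = sub.sum(axis=1) / (len(idx) - 1)
        drop = i if mean_abs[i] >= mean_abs[j] else j
        keep[idx[drop]] = False
    return keep

rng = np.random.default_rng(0)
a = rng.normal(size=300)
X = np.column_stack([a, a + 0.01 * rng.normal(size=300), rng.normal(size=300)])

keep = correlation_support(X, threshold=0.95)
keep_all = correlation_support(X, threshold=0.95, min_features=3)
```

In this example, features 0 and 1 are nearly identical, so exactly one of them is dropped while the independent feature 2 is kept. With `min_features=3` the loop never runs and all three features survive.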
Attributes:
| Name | Type | Description |
|---|---|---|
| `threshold` | `float` | Correlation threshold. Feature pairs with absolute correlation above this value trigger a removal decision. |
| `min_features` | `int` | Minimum number of features to keep. |
| `support_` | | Boolean mask of selected features (set after fitting). |
| `n_features_in_` | | Number of input features (set after fitting). |
Functions
fit
Compute which features to keep using mean absolute correlation ranking.
For each pair exceeding the threshold, the feature with the higher mean absolute correlation across all remaining features is dropped. The correlation statistics are recomputed after each removal.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `ndarray` | Training data of shape (n_samples, n_features). | required |
| `y` | `ndarray \| None` | Not used, present for API consistency. | `None` |
Returns:
| Type | Description |
|---|---|
| | Fitted transformer. |
transform
Remove correlated features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `ndarray` | Input data of shape (n_samples, n_features). | required |
Returns:
| Type | Description |
|---|---|
| `ndarray` | Data with correlated features removed, shape (n_samples, n_features_out). |
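Conceptually, this kind of transform amounts to column selection with the fitted mask. The NumPy sketch below shows the shape change under that assumption. The mask is hypothetical and the snippet does not call pybasin.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(3, 4)   # 3 samples, 4 features
support = np.array([True, False, True, True])  # hypothetical fitted mask

# Keep only the selected columns; the number of rows is unchanged.
X_out = X[:, support]
```

Here `n_features_out` equals the number of True entries in the mask, so the result has shape (3, 3).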