As a data scientist working with Python, it’s crucial to understand the importance of feature selection when building a machine learning model. In real-life data science problems, it is rare that all the variables in the dataset are useful for building a model. Adding redundant variables reduces the model’s generalization capability and may also reduce the overall accuracy of a classifier. Furthermore, adding more variables increases the overall complexity of the model.

As per the **Law of Parsimony** of ‘

**Learning Objectives:**

- Understanding the importance of feature selection.
- Familiarizing with different feature selection techniques.
- Applying feature selection techniques in practice and evaluating performance.

The goal of feature selection techniques in machine learning is to find the best set of features that allows one to build optimized models of studied phenomena.

The techniques for feature selection in machine learning can be broadly classified into the following categories:

**Supervised techniques:** These techniques can be used for labeled data to identify the relevant features for increasing the efficiency of supervised models such as classification and regression, for example, linear regression, decision trees, and SVM.

**Unsupervised techniques:** These techniques can be used for unlabeled data, for example, K-Means Clustering, Principal Component Analysis, and Hierarchical Clustering.

From a taxonomic point of view, these techniques are classified into filter, wrapper, embedded, and hybrid methods.

Now, let’s discuss some of these popular machine learning feature selection methods in detail.

Filter methods pick up the intrinsic properties of the features measured via univariate statistics instead of cross-validation performance. These methods are faster and less computationally expensive than wrapper methods. When dealing with high-dimensional data, it is computationally cheaper to use filter methods.

Let’s discuss some of these techniques:

Information gain calculates the reduction in entropy from the transformation of a dataset. It can be used for feature selection by evaluating the Information gain of each variable in the context of the target variable.

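```
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_classif
# In a Jupyter notebook, run `%matplotlib inline` first to render plots inline
# X (features), Y (target) and dataframe (target in the last column) are assumed to exist
importances = mutual_info_classif(X, Y)
# Label each score with its feature name (every column except the last one)
feat_importances = pd.Series(importances, index=dataframe.columns[:-1])
feat_importances.plot(kind='barh', color='teal')
plt.show()
```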
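To make the idea concrete without relying on a particular dataset, here is a minimal, self-contained sketch on synthetic data. The dataset, the feature names `f0` to `f5`, and the `random_state` values are assumptions chosen only for illustration; with `shuffle=False`, `make_classification` places the informative features in the first columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic data (an assumption for illustration): 6 features, of which
# the first 2 are informative because shuffle=False keeps them in front.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=6, n_informative=2,
    n_redundant=0, n_repeated=0, shuffle=False, random_state=0,
)
names = [f"f{i}" for i in range(X_syn.shape[1])]  # hypothetical feature names

# Estimate the mutual information between each feature and the target
mi = mutual_info_classif(X_syn, y_syn, random_state=0)

# Rank the features from most to least informative
mi_series = pd.Series(mi, index=names).sort_values(ascending=False)
print(mi_series)
```

Features near the top of the printed ranking share the most information with the target; how many of them to keep remains a judgment call.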

The Chi-square test is used for categorical features in a dataset. We calculate Chi-square between each feature and the target and select the desired number of features with the best Chi-square scores. To correctly apply the chi-squared test to the relation between the features in the dataset and the target variable, the following conditions have to be met: the variables have to be *categorical*, sampled *independently*, and values should have an *expected frequency greater than 5*.

```
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Convert to categorical data by converting data to integers
X_cat = X.astype(int)
# Three features with highest chi-squared statistics are selected
chi2_features = SelectKBest(chi2, k=3)
X_kbest_features = chi2_features.fit_transform(X_cat, Y)
# Reduced features
print('Original feature number:', X_cat.shape[1])
print('Reduced feature number:', X_kbest_features.shape[1])
```
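The expected-frequency condition mentioned above can be checked directly. The sketch below uses `scipy.stats.chi2_contingency` on a small, made-up contingency table (the counts are hypothetical) and inspects the expected counts it returns.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are the categories of one feature,
# columns are the two classes of the target.
observed = np.array([[30, 10],
                     [20, 40]])

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the table of expected counts under independence.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Rule of thumb from the text: every expected count should exceed 5
assumption_ok = bool((expected > 5).all())

print("Expected counts:\n", expected)
print("All expected counts > 5:", assumption_ok)
print("Chi-square statistic:", chi2_stat, "p-value:", p_value)
```

If some expected counts fall below 5, merging rare categories before running the test is one common workaround.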

Fisher score is one of the most widely used supervised feature selection methods. The algorithm we will use returns the ranks of the variables based on Fisher’s score in descending order. We can then select the variables as per the case.

```
import pandas as pd
import matplotlib.pyplot as plt
from skfeature.function.similarity_based import fisher_score
# In a Jupyter notebook, run `%matplotlib inline` first to render plots inline
# Calculating scores
ranks = fisher_score.fisher_score(X, Y)
# Plotting the ranks, labeled with every column name except the last (target) one
feat_importances = pd.Series(ranks, index=dataframe.columns[:-1])
feat_importances.plot(kind='barh', color='teal')
plt.show()
```

Correlation is a measure of the linear relationship between 2 or more variables. Through correlation, we can predict one variable from the other. The logic behind using correlation for feature selection is that good variables correlate highly with the target. Furthermore, variables should be correlated with the target but uncorrelated among themselves.

If two variables are correlated, we can predict one from the other. Therefore, if two features are correlated, the model only needs one, as the second does not add additional information. We will use the Pearson Correlation here.

```
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Correlation matrix
cor = dataframe.corr()
# Plotting Heatmap
plt.figure(figsize=(10,6))
sns.heatmap(cor, annot=True)
plt.show()
```

We need to set an absolute value, say 0.5, as the threshold for selecting the variables. If we find that two predictor variables are correlated with each other, we can drop the one that has the lower correlation coefficient with the target variable. We can also compute multiple correlation coefficients to check whether more than two variables are correlated with each other. This phenomenon is known as multicollinearity.
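The thresholding rule described above can be sketched as follows. The data frame, its column names, and the use of the same 0.5 threshold for both feature-target and feature-feature correlation are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a_copy" is almost a duplicate of "a"
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + rng.normal(scale=0.01, size=n),
    "b": rng.normal(size=n),
})
df["target"] = df["a"] + df["b"] + rng.normal(scale=0.1, size=n)

threshold = 0.5
corr = df.corr().abs()                       # absolute Pearson correlations
target_corr = corr["target"].drop("target")  # feature-vs-target correlations

# Step 1: keep features whose correlation with the target exceeds the threshold
candidates = target_corr[target_corr > threshold].index.tolist()

# Step 2: for every pair of correlated candidates, drop the one that is
# less correlated with the target
to_drop = set()
for i, f1 in enumerate(candidates):
    for f2 in candidates[i + 1:]:
        if f1 in to_drop or f2 in to_drop:
            continue
        if corr.loc[f1, f2] > threshold:
            weaker = f1 if target_corr[f1] < target_corr[f2] else f2
            to_drop.add(weaker)

selected = [f for f in candidates if f not in to_drop]
print("Selected features:", selected)
```

Only one of the two near-duplicate columns survives, while the independent feature `b` is kept.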

The variance threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e., features with the same value in all samples. We assume that features with a higher variance may contain more useful information, but note that we are not taking the relationship between feature variables or feature and target variables into account, which is one of the drawbacks of filter methods.

```
from sklearn.feature_selection import VarianceThreshold
# Resetting the value of X to make it non-categorical
# (array is assumed to hold the dataset as a NumPy array, e.g. dataframe.values)
X = array[:, 0:8]
v_threshold = VarianceThreshold(threshold=0)
v_threshold.fit(X) # fit finds the features with zero variance
v_threshold.get_support()
```

`get_support()` returns a Boolean vector in which True means the variable does not have zero variance.
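As a small illustration, the sketch below applies `VarianceThreshold` to a toy array (the values are made up) whose first column is constant, and then uses the fitted selector to drop that column.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical): the first column is constant
X_demo = np.array([[0.0, 2.0, 1.0],
                   [0.0, 1.0, 4.0],
                   [0.0, 3.0, 1.0],
                   [0.0, 2.0, 3.0]])

selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X_demo)  # drops zero-variance columns

mask = selector.get_support()  # True for every column that was kept
print("Variances:", selector.variances_)
print("Kept columns:", mask)
print("Shape before and after:", X_demo.shape, X_reduced.shape)
```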

‘The mean absolute difference (MAD) computes the absolute difference from the mean value. The main difference between the variance and MAD measures is the absence of the square in the latter. The MAD, like the variance, is also a scaled variant.’ [1] This means that the higher the MAD, the higher the discriminatory power.

```
import numpy as np
import matplotlib.pyplot as plt
# Calculate MAD
mean_abs_diff = np.sum(np.abs(X - np.mean(X, axis=0)), axis=0) / X.shape[0]
# Plot the bar chart
plt.bar(np.arange(X.shape[1]), mean_abs_diff, color='teal')
plt.xlabel("Feature Index")
plt.ylabel("Mean Absolute Deviation")
plt.title("MAD for Each Feature")
plt.show()
```

**Dispersion Ratio**

‘Another measure of dispersion applies the arithmetic mean (AM) and the geometric mean (GM). For a given (positive) feature $X_i$ on $n$ patterns, the AM and GM are given by

$$\mathrm{AM}_i = \bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_{ij}, \qquad \mathrm{GM}_i = \left(\prod_{j=1}^{n} X_{ij}\right)^{1/n},$$

respectively; since $\mathrm{AM}_i \geq \mathrm{GM}_i$, with equality holding if and only if $X_{i1} = X_{i2} = \dots = X_{in}$, then the ratio

$$R_i = \frac{\mathrm{AM}_i}{\mathrm{GM}_i} \in [1, +\infty)$$

can be used as a dispersion measure. Higher dispersion implies a higher value of $R_i$, thus a more relevant feature. Conversely, when all the feature samples have (roughly) the same value, $R_i$ is close to 1, indicating a low relevance feature.’ [1]

```
import numpy as np
import matplotlib.pyplot as plt
# Given data (replace with your actual data)
X = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# Calculate arithmetic mean
am = np.mean(X, axis=0)
# Calculate geometric mean
gm = np.power(np.prod(X, axis=0), 1 / X.shape[0])
# Calculate ratio of arithmetic mean and geometric mean
disp_ratio = am / gm
# Plot the bar chart
plt.bar(np.arange(X.shape[1]), disp_ratio, color='teal')
plt.xlabel("Index")
plt.ylabel("Arithmetic Mean / Geometric Mean")
plt.title("Ratio of Arithmetic Mean and Geometric Mean")
plt.show()
```
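Multiplying many values together, as in the snippet above, can overflow for larger datasets. A hedged alternative is to compute the geometric mean through logarithms; the helper name `dispersion_ratio` and the toy data below are hypothetical.

```python
import numpy as np

def dispersion_ratio(X):
    """Return AM / GM for each column of a strictly positive matrix."""
    X = np.asarray(X, dtype=float)
    if np.any(X <= 0):
        raise ValueError("The geometric mean requires strictly positive values.")
    am = X.mean(axis=0)
    # exp(mean(log(x))) equals the geometric mean and avoids a huge product
    gm = np.exp(np.log(X).mean(axis=0))
    return am / gm

# Toy data (hypothetical): the first column is spread out, the second is constant
X_demo = np.array([[1.0, 5.0],
                   [10.0, 5.0],
                   [100.0, 5.0]])
ratios = dispersion_ratio(X_demo)
print(ratios)
```

The constant column yields a ratio of 1, the lowest possible value, while the spread-out column yields a clearly larger ratio.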

Wrappers require some method to search the space of all possible subsets of features, assessing their quality by learning and evaluating a classifier with that feature subset. The feature selection process is based on a specific machine learning algorithm we are trying to fit on a given dataset. It follows a greedy search approach by evaluating all the possible combinations of features against the evaluation criterion. The wrapper methods usually result in better predictive accuracy than filter methods.

Let’s discuss some of these techniques:

Forward feature selection is an iterative method wherein we start with the single best-performing feature against the target. Next, we select another variable that gives the best performance in combination with the first selected variable. This process continues until the preset criterion is achieved.

```
# Import necessary libraries
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector
# Logistic regression model used to evaluate each candidate feature subset
lr = LogisticRegression(class_weight='balanced', solver='lbfgs', random_state=42, n_jobs=-1, max_iter=500)
# Create a Sequential Feature Selector object using the logistic regression model
# k_features='best' selects the optimal number of features
# forward=True indicates forward feature selection
# n_jobs=-1 uses all available cores for parallel processing
ffs = SequentialFeatureSelector(lr, k_features='best', forward=True, n_jobs=-1)
# Fit the Sequential Feature Selector to the data
ffs.fit(X, y)
# Get the column indices of the selected features
features = list(ffs.k_feature_idx_)
# Fit the logistic regression model using only the selected features
# (X_train, y_train are assumed to be NumPy arrays holding a training split of X, y)
lr.fit(X_train[:, features], y_train)
# Make predictions on the training data using the selected features
y_pred = lr.predict(X_train[:, features])
```

Backward feature elimination works in exactly the opposite way to the Forward Feature Selection method. Here, we start with all the features available and build a model. Next, we remove the variable whose removal gives the best evaluation measure value. This process is continued until the preset criterion is achieved.

```
# Import necessary libraries
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector
# Create a logistic regression model with class weight balancing
lr = LogisticRegression(class_weight='balanced', solver='lbfgs', random_state=42, n_jobs=-1, max_iter=500)
# Fit the logistic regression model to the data
lr.fit(X, y)
# Create a Sequential Feature Selector object using the logistic regression model
# k_features='best' selects the optimal number of features
# forward=False indicates backward feature selection
# n_jobs=-1 uses all available cores for parallel processing
bfs = SequentialFeatureSelector(lr, k_features='best', forward=False, n_jobs=-1)
# Fit the Sequential Feature Selector to the data
bfs.fit(X, y)
# Get the column indices of the selected features
features = list(bfs.k_feature_idx_)
# Fit the logistic regression model using only the selected features
lr.fit(X_train[:, features], y_train)
# Make predictions on the training data using the selected features
y_pred = lr.predict(X_train[:, features])
```

This method, along with the one discussed above, is also known as the Sequential Feature Selection method.
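For readers who prefer to stay within scikit-learn, the same forward and backward strategies are also available through its own `SequentialFeatureSelector` class, which is a different implementation from the mlxtend one used above. The sketch below assumes the Iris dataset and a target of two features, both chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# Forward selection: start from no features and add one at a time
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="forward", cv=5
)
forward.fit(X_iris, y_iris)

# Backward elimination: start from all features and remove one at a time
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="backward", cv=5
)
backward.fit(X_iris, y_iris)

print("Forward selection keeps columns:", forward.get_support(indices=True))
print("Backward elimination keeps columns:", backward.get_support(indices=True))
```

The two directions do not necessarily agree on the final subset, since each makes greedy choices from a different starting point.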

Exhaustive feature selection is the most robust feature selection method covered so far. It is a brute-force evaluation of each feature subset: it tries every possible combination of the variables and returns the best-performing subset, which makes it computationally expensive as the number of features grows.

```
# Import necessary libraries
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.ensemble import RandomForestClassifier
# Create an Exhaustive Feature Selector object using a Random Forest classifier
# min_features=4 sets the minimum number of features to consider
# max_features=8 sets the maximum number of features to consider
# scoring='roc_auc' specifies the scoring metric
# cv=2 specifies the number of cross-validation folds
efs = ExhaustiveFeatureSelector(RandomForestClassifier(), min_features=4, max_features=8, scoring='roc_auc', cv=2)
# Fit the Exhaustive Feature Selector to the data
efs.fit(X, y)
# Print the selected features
# (assumes X_train is a pandas DataFrame with the same columns as X)
selected_features = X_train.columns[list(efs.best_idx_)]
print(selected_features)
# Print the final prediction score
print(efs.best_score_)
```

‘*Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and each feature’s importance is obtained either through a coef_ attribute or a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.*’

```
# Import necessary libraries
from sklearn.feature_selection import RFE
# Create a Recursive Feature Elimination object using a logistic regression model
# n_features_to_select=7 specifies the number of features to select
rfe = RFE(lr, n_features_to_select=7)
# Fit the Recursive Feature Elimination object to the data
rfe.fit(X_train, y_train)
# Make predictions on the training data using the selected features
y_pred = rfe.predict(X_train)
```
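After fitting, it is often useful to see which features RFE kept and in what order the others were eliminated. The sketch below uses synthetic data (an assumption for illustration) and reads the `support_` and `ranking_` attributes of the fitted selector.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): 8 features, 3 of them informative
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)

# Remove one feature per iteration until 3 remain
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe_demo.fit(X_syn, y_syn)

# support_ is True for selected features; ranking_ is 1 for selected
# features, and larger values mean the feature was eliminated earlier
print("Selected mask:", rfe_demo.support_)
print("Ranking:", rfe_demo.ranking_)
```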

Embedded methods encompass the benefits of both the wrapper and filter methods by including interactions of features while maintaining reasonable computational costs. Embedded methods are iterative in the sense that they take care of each iteration of the model training process and carefully extract the features that contribute the most to the training for that iteration.

Let’s discuss some of these techniques here:

Regularization consists of adding a penalty to the different parameters of the machine learning model to reduce the freedom of the model, i.e., to avoid over-fitting. In linear model regularization, the penalty is applied over the coefficients that multiply each predictor. Among the different types of regularization, Lasso (L1) has the property that it can shrink some of the coefficients to zero. The corresponding features can therefore be removed from the model.

```
# Import necessary libraries
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
# Create a logistic regression model with L1 regularization
logistic = LogisticRegression(C=1, penalty='l1', solver='liblinear', random_state=7).fit(X, y)
# Create a SelectFromModel object using the logistic regression model
model = SelectFromModel(logistic, prefit=True)
# Transform the feature matrix using the SelectFromModel object
X_new = model.transform(X)
# Get the indices of the features kept by SelectFromModel
selected_columns = model.get_support(indices=True)
# Print the selected features
print(selected_columns)
```

Random Forest is a kind of bagging algorithm that aggregates a specified number of decision trees. The tree-based strategies used by random forests naturally rank features by how well they improve the purity of the node, or in other words, by the decrease in impurity (**Gini impurity**) over all trees. Nodes with the greatest decrease in impurity happen at the start of the trees, while nodes with the least decrease in impurity occur at the end of the trees. Thus, by pruning trees below a particular node, we can create a subset of the most important features.

```
# Import necessary libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# Create a Random Forest classifier with your hyperparameters
model = RandomForestClassifier(n_estimators=340)
# Fit the model to the data
model.fit(X, y)
# Get the importance of the resulting features
importances = model.feature_importances_
# Create a data frame for visualization, indexed by feature so the bars are labeled
final_df = pd.DataFrame({'Features': pd.DataFrame(X).columns, 'Importances': importances})
final_df = final_df.set_index('Features')
# Sort in ascending order for better visualization
final_df = final_df.sort_values('Importances')
# Plot the feature importances in bars
final_df.plot.bar(color='teal')
```

We have discussed a few techniques for feature selection. We have purposely left out feature extraction techniques like Principal Component Analysis, Singular Value Decomposition, Linear Discriminant Analysis, etc. These methods help to reduce the dimensionality of the data or reduce the number of variables while preserving the variance of the data.

Apart from the methods discussed above, there are many other feature selection methods. There are hybrid methods, too, that use both filtering and wrapping techniques. If you wish to explore more about feature selection techniques, great comprehensive reading material, in my opinion, would be ‘*Feature Selection for Data and Pattern Recognition*’ by Urszula Stańczyk and Lakhmi C. Jain.

**Key Takeaways**

- Understanding the importance of feature selection and feature engineering in building a machine learning model.
- Familiarizing with different feature selection techniques, including supervised techniques (Information Gain, Chi-square Test, Fisher’s Score, Correlation Coefficient), unsupervised techniques (Variance Threshold, Mean Absolute Difference, Dispersion Ratio), and their classifications (Filter methods, Wrapper methods, Embedded methods, Hybrid methods).
- Evaluating the performance of feature selection techniques in practice through implementation.

A. A feature selection method is a technique in machine learning that involves choosing a subset of relevant features from the original set to enhance model performance, interpretability, and efficiency.

A. The feature selection principle centers on identifying and retaining the most informative features while eliminating redundant or irrelevant ones. This optimization aims to enhance model accuracy and efficiency.

A. No, PCA (Principal Component Analysis) is primarily a dimensionality reduction technique, not a feature selection method. While it reduces features, it doesn’t consider the individual importance of features for prediction.

A. Typical steps in feature selection include understanding the dataset and problem, choosing a relevant feature selection method, evaluating feature importance, selecting a subset of features, and assessing and validating model performance with the chosen features.
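The typical steps listed above can be sketched end to end with a scikit-learn `Pipeline`, so that the selection step is refitted inside every cross-validation fold. The synthetic dataset, the choice of `k=5`, and the use of mutual information as the scoring function are assumptions made for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (hypothetical): 20 features, only a few of them useful
X_syn, y_syn = make_classification(
    n_samples=400, n_features=20, n_informative=4,
    n_redundant=2, random_state=0,
)

# Fix the random state so the mutual information estimates are repeatable
score_func = partial(mutual_info_classif, random_state=0)

# Selection happens inside the pipeline, so each fold selects features
# using only its own training portion
pipe = Pipeline([
    ("select", SelectKBest(score_func=score_func, k=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores_selected = cross_val_score(pipe, X_syn, y_syn, cv=5)
scores_all = cross_val_score(LogisticRegression(max_iter=1000), X_syn, y_syn, cv=5)

print("Mean accuracy with 5 selected features:", scores_selected.mean())
print("Mean accuracy with all 20 features:", scores_all.mean())
```

Comparing the two scores gives a rough sense of whether the reduced feature set costs or gains predictive performance on this particular data.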


Kindly share dataset of above code

Hello I was just wondering if you know anyone who can help me with one of my Applied Machine Learning assignment. Thanks Bijal

If two distinct methods give different sets of features as the most important, say 5 features from the dataset, how do we determine which are actually the most important features in this case of contradiction?

Which is the best method in case we have 7-8 numerical variables and 7-8 categorical variables for binary classification?

Use filter methods, which are typically statistical: Chi-square for categorical features, Information Gain for relevance, Fisher's Score for ranking, and the Correlation Coefficient for relationships. You can also experiment with wrapper methods like Forward Selection or Recursive Feature Elimination (RFE), and with embedded methods like LASSO and tree-based models, e.g. Random Forest. Embedded methods are usually the most commonly used.