In this article, we will analyze a flight fare dataset using essential exploratory data analysis techniques, and then predict the price of a flight based on features such as the type of airline, arrival time, departure time, flight duration, source, destination, and more.

- Learn the complete process of EDA on a machine learning dataset.
- Learn to draw insights from the dataset, both mathematically and through visualization.
- Visualize the data to get better insight from it.
- See what can be done in the feature engineering stage.

*This article was published as a part of the Data Science Blogathon.*

- **Airline:** The airline operating the flight, such as IndiGo, Jet Airways, Air India, and many more.
- **Date_of_Journey:** The date on which the passenger's journey starts.
- **Source:** The place from which the passenger's journey starts.
- **Destination:** The place to which the passenger is travelling.
- **Route:** The route the flight takes from the source to the destination.
- **Dep_Time:** The time at which the flight departs from the source.
- **Arrival_Time:** The time at which the passenger reaches the destination.
- **Duration:** The total time the flight takes to complete its journey from source to destination.
- **Total_Stops:** The number of stops the flight makes during the whole journey.
- **Additional_Info:** Information about food and other amenities.
- **Price:** The price of the flight for the complete journey, including all expenses before boarding.

By employing machine learning algorithms, particularly regression techniques, we aim to predict flight ticket prices accurately. Leveraging Python for data analysis (regression analysis) and utilizing various machine learning models, including linear regression, will allow us to conduct comprehensive flight price prediction analyses. Additionally, with a focus on the Indian aviation market, we can tailor our predictive models to suit the specific dynamics of this region.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from math import sqrt
from prettytable import PrettyTable
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score
from sklearn.linear_model import Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
```
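Note that `train_df` is used throughout but its loading step is not shown. The dataset ships as an Excel file; assuming it is named `Data_Train.xlsx` (an assumption based on the commonly distributed version of this dataset), it would be loaded with `train_df = pd.read_excel("Data_Train.xlsx")`. For illustration, here is a tiny stand-in frame with the same schema:

```python
import pandas as pd

# The real loading step would be (filename is an assumption):
#   train_df = pd.read_excel("Data_Train.xlsx")
# A small sample frame mirroring the dataset's 11 columns:
sample = pd.DataFrame({
    "Airline": ["IndiGo", "Air India"],
    "Date_of_Journey": ["24/03/2019", "1/05/2019"],
    "Source": ["Banglore", "Kolkata"],
    "Destination": ["New Delhi", "Banglore"],
    "Route": ["BLR → DEL", "CCU → BLR"],
    "Dep_Time": ["22:20", "05:50"],
    "Arrival_Time": ["01:10 22 Mar", "13:15"],
    "Duration": ["2h 50m", "7h 25m"],
    "Total_Stops": ["non-stop", "2 stops"],
    "Additional_Info": ["No info", "No info"],
    "Price": [3897, 7662],
})
print(sample.shape)  # (2, 11)
```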

**Now let's look at the columns our dataset has.**

`train_df.columns`

**Output:**

```
Index(['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route',
       'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
       'Additional_Info', 'Price'],
      dtype='object')
```

**Here we can get more information about our dataset**

`train_df.info()`

**Output:**

**To know more about the dataset**

`train_df.describe()`

**Output:**

**Using the isnull function, we can see which values in our dataset are null**

`train_df.isnull().head()`

**Output:**

**Using the isnull and sum functions together, we can count the null values in each column**

`train_df.isnull().sum()`

**Output:**

```
Airline 0
Date_of_Journey 0
Source 0
Destination 0
Route 1
Dep_Time 0
Arrival_Time 0
Duration 0
Total_Stops 1
Additional_Info 0
Price 0
dtype: int64
```

**Dropping NaN values**

`train_df.dropna(inplace = True)`

**Duplicate values**

`train_df[train_df.duplicated()].head()`

**Output:**

**Here we remove the repeated rows from the dataset, keeping the first occurrence; with the inplace attribute set to true, the DataFrame is modified in place.**

```
train_df.drop_duplicates(keep='first',inplace=True)
train_df.head()
```

**Output:**

`train_df.shape`

**Output:**

`(10462, 11)`

**Checking the Additional_Info column and counting its unique values.**

`train_df["Additional_Info"].value_counts()`

**Output:**

```
No info 8182
In-flight meal not included 1926
No check-in baggage included 318
1 Long layover 19
Change airports 7
Business class 4
No Info 3
1 Short layover 1
2 Long layover 1
Red-eye flight 1
Name: Additional_Info, dtype: int64
```

**Checking the different Airlines**

`train_df["Airline"].unique()`

**Output:**

```
array(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet',
       'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia',
       'Vistara Premium economy', 'Jet Airways Business',
       'Multiple carriers Premium economy', 'Trujet'], dtype=object)
```

**Checking the different Airline Routes**

`train_df["Route"].unique()`

**Output:** See the code.

**Now let's look at our testing dataset**

```
test_df = pd.read_excel("Test_set.xlsx")
test_df.head(10)
```

**Output:**

**Now let's look at the columns our testing data has.**

`test_df.columns`

**Output:**

```
Index(['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route',
       'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
       'Additional_Info'],
      dtype='object')
```

**Information about the dataset**

`test_df.info()`

**Output:**

**To know more about the testing dataset**

`test_df.describe()`

**Output:**

**Using the isnull and sum functions together, we can count the null values in our testing data**

`test_df.isnull().sum()`

**Output:**

```
Airline 0
Date_of_Journey 0
Source 0
Destination 0
Route 0
Dep_Time 0
Arrival_Time 0
Duration 0
Total_Stops 0
Additional_Info 0
dtype: int64
```

**Plotting Price vs Airline plot**

```
sns.catplot(y = "Price", x = "Airline", data = train_df.sort_values("Price", ascending = False), kind="boxen", height = 8, aspect = 3)
plt.show()
```

**Output:**

**Inference:** Here, with the help of catplot, we plot a boxen plot between the price of the flight and the airline, and we can conclude that **Jet Airways has the most outliers in terms of price**.

**Plotting Violin plot for Price vs Source**

```
sns.catplot(y = "Price", x = "Source", data = train_df.sort_values("Price", ascending = False), kind="violin", height = 4, aspect = 3)
plt.show()
```

**Output:**

**Inference:** Again using catplot, we plot a violin plot between the price of the flight and the source, i.e. **the place from which passengers travel to the destination, and we can see that Bangalore as the source location has the most outliers while Chennai has the least.**

**Plotting Box plot for Price vs Destination**

```
sns.catplot(y = "Price", x = "Destination", data = train_df.sort_values("Price", ascending = False), kind="box", height = 4, aspect = 3)
plt.show()
```

**Output:**

**Inference:** Here we plot a box plot (via catplot) between the price of the flight and the destination to which the passenger is travelling, and find that **New Delhi has the most outliers and Kolkata has the least.**

**Let's see our processed data first**

`train_df.head()`

**Output:**

**Here we convert the Duration column from strings like '2h 50m' into total minutes.**

```
train_df['Duration'] = train_df['Duration'].str.replace("h", '*60').str.replace(' ','+').str.replace('m','*1').apply(eval)
test_df['Duration'] = test_df['Duration'].str.replace("h", '*60').str.replace(' ','+').str.replace('m','*1').apply(eval)
```
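The replace-and-`eval` trick above works for this dataset's `'2h 50m'` format, but calling `eval` on data strings is fragile. A sketch of an equivalent parser without `eval` (assuming durations only ever contain hour and minute components, as here) could be applied with `train_df['Duration'].apply(duration_to_minutes)`:

```python
import re

def duration_to_minutes(s):
    """Parse strings like '2h 50m', '19h', or '5m' into total minutes."""
    hours = re.search(r"(\d+)h", s)
    mins = re.search(r"(\d+)m", s)
    return (int(hours.group(1)) if hours else 0) * 60 + (int(mins.group(1)) if mins else 0)

print(duration_to_minutes("2h 50m"))  # 170
```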

**Date_of_Journey:** Here we split the date of journey into separate day and month columns so they can be used as features in the model stage.

```
train_df["Journey_day"] = train_df['Date_of_Journey'].str.split('/').str[0].astype(int)
train_df["Journey_month"] = train_df['Date_of_Journey'].str.split('/').str[1].astype(int)
train_df.drop(["Date_of_Journey"], axis = 1, inplace = True)
```

**Dep_Time:** Here we are converting departure time into hours and minutes

```
train_df["Dep_hour"] = pd.to_datetime(train_df["Dep_Time"]).dt.hour
train_df["Dep_min"] = pd.to_datetime(train_df["Dep_Time"]).dt.minute
train_df.drop(["Dep_Time"], axis = 1, inplace = True)
```

**Arrival_Time:** Similarly, we convert the arrival time into hours and minutes.

```
train_df["Arrival_hour"] = pd.to_datetime(train_df.Arrival_Time).dt.hour
train_df["Arrival_min"] = pd.to_datetime(train_df.Arrival_Time).dt.minute
train_df.drop(["Arrival_Time"], axis = 1, inplace = True)
```

**Now, after the final preprocessing, let's see our dataset**

`train_df.head()`

**Output:**

**Plotting Bar chart for Months (Duration) vs Number of Flights**

```
plt.figure(figsize = (10, 5))
plt.title('Count of flights month wise')
ax=sns.countplot(x = 'Journey_month', data = train_df)
plt.xlabel('Month')
plt.ylabel('Count of flights')
for p in ax.patches:
    ax.annotate(int(p.get_height()), (p.get_x()+0.25, p.get_height()+1), va='bottom', color= 'black')
```

**Output:**

**Inference:** In the above graph we plot the number of flights per journey month and see that **May has the most flights.**

**Plotting Bar chart for Types of Airline vs Number of Flights**

```
plt.figure(figsize = (20,5))
plt.title('Count of flights with different Airlines')
ax=sns.countplot(x = 'Airline', data =train_df)
plt.xlabel('Airline')
plt.ylabel('Count of flights')
plt.xticks(rotation = 45)
for p in ax.patches:
    ax.annotate(int(p.get_height()), (p.get_x()+0.25, p.get_height()+1), va='bottom', color= 'black')
```

**Output:**

**Inference:** From the above graph of airline type vs **count of flights, we can see that Jet Airways has the most flights boarded.**

**Plotting Ticket Prices vs Airlines**

```
plt.figure(figsize = (15,4))
plt.title('Price VS Airlines')
plt.scatter(train_df['Airline'], train_df['Price'])
plt.xlabel('Airline')
plt.ylabel('Price of ticket')
plt.xticks(rotation = 90)
```

**Output:**

**Plotting Correlation**

```
plt.figure(figsize = (15,15))
sns.heatmap(train_df.corr(), annot = True, cmap = "RdYlGn")
plt.show()
```

**Output:**

**Dropping the Price column from the features, since it is the target we want to predict**

`data = train_df.drop(["Price"], axis=1)`

**Dealing with Categorical Data and Numerical Data**

```
train_categorical_data = data.select_dtypes(exclude=['int64', 'float','int32'])
train_numerical_data = data.select_dtypes(include=['int64', 'float','int32'])
test_categorical_data = test_df.select_dtypes(exclude=['int64', 'float','int32'])
test_numerical_data = test_df.select_dtypes(include=['int64', 'float','int32'])
train_categorical_data.head()
```

**Output:**

**Label Encoding the Categorical Columns**

```
train_categorical_data = train_categorical_data.apply(LabelEncoder().fit_transform)
test_categorical_data = test_categorical_data.apply(LabelEncoder().fit_transform)
train_categorical_data.head()
```
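One caveat: fitting a fresh `LabelEncoder` on the train and test frames separately, as above, can assign different integer codes to the same category in the two sets. A minimal sketch of one way to keep the mapping consistent, shown here on toy frames (`train_cat` and `test_cat` are hypothetical stand-ins for the real categorical frames):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for train_categorical_data / test_categorical_data.
train_cat = pd.DataFrame({"Airline": ["IndiGo", "Air India", "IndiGo"]})
test_cat = pd.DataFrame({"Airline": ["Air India", "SpiceJet"]})

for col in train_cat.columns:
    le = LabelEncoder()
    # Fit on the union of categories so both sets share one mapping.
    le.fit(pd.concat([train_cat[col], test_cat[col]]))
    train_cat[col] = le.transform(train_cat[col])
    test_cat[col] = le.transform(test_cat[col])

print(train_cat["Airline"].tolist())  # [1, 0, 1]: Air India=0, IndiGo=1, SpiceJet=2
```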

**Output:**

**Concatenating both Categorical Data and Numerical Data**

```
X = pd.concat([train_categorical_data, train_numerical_data], axis=1)
y = train_df['Price']
test_set = pd.concat([test_categorical_data, test_numerical_data], axis=1)
X.head()
```

**Output:**

`y.head()`

**Output:**

```
0 3897
1 7662
2 13882
3 6218
4 13302
Name: Price, dtype: int64
```

```
# Calculating Mean Absolute Percentage Error
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
```
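As a quick sanity check, the metric behaves as expected on toy values (the function is repeated here so the snippet is self-contained):

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Each prediction is off by 10% of its true value, so MAPE is 10.
print(round(mean_absolute_percentage_error([100, 200], [110, 180]), 6))  # 10.0
```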

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
```

```
print("The size of training input is", X_train.shape)
print("The size of training output is", y_train.shape)
print("The size of testing input is", X_test.shape)
print("The size of testing output is", y_test.shape)
```

**Output:**

The size of training input is (7323, 13)

The size of training output is (7323,)

The size of testing input is (3139, 13)

The size of testing output is (3139,)

```
# Performing GridSearchCV on Ridge Regression
params = {'alpha' : [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]}
ridge_regressor = GridSearchCV(Ridge(), params, cv = 5, scoring = 'neg_mean_absolute_error', n_jobs = -1)
ridge_regressor.fit(X_train, y_train)
```

**Output:**

```
GridSearchCV(cv=5, estimator=Ridge(), n_jobs=-1,
             param_grid={'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]},
             scoring='neg_mean_absolute_error')
```

```
# Predicting train and test results
y_train_pred = ridge_regressor.predict(X_train)
y_test_pred = ridge_regressor.predict(X_test)
```

```
print("Train Results for Ridge Regressor Model:")
print("Root Mean Squared Error: ", sqrt(mse(y_train.values, y_train_pred)))
print("Mean Absolute % Error: ", round(mean_absolute_percentage_error(y_train.values, y_train_pred)))
print("R-Squared: ", r2_score(y_train.values, y_train_pred))
```

**Output:**

Train Results for Ridge Regressor Model:

Root Mean Squared Error: 3558.667750232805

Mean Absolute % Error: 32

R-Squared: 0.4150529285926381

```
print("Test Results for Ridge Regressor Model:")
print("Root Mean Squared Error: ", sqrt(mse(y_test, y_test_pred)))
print("Mean Absolute % Error: ", round(mean_absolute_percentage_error(y_test, y_test_pred)))
print("R-Squared: ", r2_score(y_test, y_test_pred))
```

**Output:**

Test Results for Ridge Regressor Model:

Root Mean Squared Error: 3457.5985597925214

Mean Absolute % Error: 32

R-Squared: 0.42437171409958274

```
# Performing GridSearchCV on Lasso Regression
params = {'alpha' : [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]}
lasso_regressor = GridSearchCV(Lasso(), params ,cv = 15,scoring = 'neg_mean_absolute_error', n_jobs = -1)
lasso_regressor.fit(X_train, y_train)
```

**Output:**

```
GridSearchCV(cv=15, estimator=Lasso(), n_jobs=-1,
             param_grid={'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]},
             scoring='neg_mean_absolute_error')
```

```
# Predicting train and test results
y_train_pred = lasso_regressor.predict(X_train)
y_test_pred = lasso_regressor.predict(X_test)
```

```
print("Train Results for Lasso Regressor Model:")
print("Root Mean Squared Error: ", sqrt(mse(y_train.values, y_train_pred)))
print("Mean Absolute % Error: ", round(mean_absolute_percentage_error(y_train.values, y_train_pred)))
print("R-Squared: ", r2_score(y_train.values, y_train_pred))
```

**Output:**

Train Results for Lasso Regressor Model:

Root Mean Squared Error: 3560.853987663486

Mean Absolute % Error: 32

R-Squared: 0.4143339932536655

```
print("Test Results for Lasso Regressor Model:")
print("Root Mean squared Error: ", sqrt(mse(y_test, y_test_pred)))
print("Mean Absolute % Error: ", round(mean_absolute_percentage_error(y_test, y_test_pred)))
print("R-Squared: ", r2_score(y_test, y_test_pred))
```

**Output:**

Test Results for Lasso Regressor Model:

Root Mean squared Error: 3459.384927631988

Mean Absolute % Error: 32

R-Squared: 0.4237767638929625

```
# Performing GridSearchCV on Decision Tree Regression
depth = list(range(3,30))
param_grid = dict(max_depth = depth)
tree = GridSearchCV(DecisionTreeRegressor(), param_grid, cv = 10)
tree.fit(X_train,y_train)
```

**Output:**

```
GridSearchCV(cv=10, estimator=DecisionTreeRegressor(),
             param_grid={'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]})
```

```
# Predicting train and test results
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
```

```
print("Train Results for Decision Tree Regressor Model:")
print("Root Mean squared Error: ", sqrt(mse(y_train.values, y_train_pred)))
print("Mean Absolute % Error: ", round(mean_absolute_percentage_error(y_train.values, y_train_pred)))
print("R-Squared: ", r2_score(y_train.values, y_train_pred))
```

**Output:**

Train Results for Decision Tree Regressor Model:

Root Mean squared Error: 560.9099093439073

Mean Absolute % Error: 3

R-Squared: 0.9854679156224377

```
print("Test Results for Decision Tree Regressor Model:")
print("Root Mean Squared Error: ", sqrt(mse(y_test, y_test_pred)))
print("Mean Absolute % Error: ", round(mean_absolute_percentage_error(y_test, y_test_pred)))
print("R-Squared: ", r2_score(y_test, y_test_pred))
```

**Output:**

Test Results for Decision Tree Regressor Model:

Root Mean Squared Error: 1871.5387049259973

Mean Absolute % Error: 9

R-Squared: 0.8313483417949448

```
ridge_score = round(ridge_regressor.score(X_train, y_train) * 100, 2)
ridge_score_test = round(ridge_regressor.score(X_test, y_test) * 100, 2)
lasso_score = round(lasso_regressor.score(X_train, y_train) * 100, 2)
lasso_score_test = round(lasso_regressor.score(X_test, y_test) * 100, 2)
decision_score = round(tree.score(X_train, y_train) * 100, 2)
decision_score_test = round(tree.score(X_test, y_test) * 100, 2)
```

```
# Comparing all the models
models = pd.DataFrame({
    'Model': ['Ridge Regression', 'Lasso Regression', 'Decision Tree Regressor'],
    'Score': [ridge_score, lasso_score, decision_score],
    'Test Score': [ridge_score_test, lasso_score_test, decision_score_test]})
models.sort_values(by='Test Score', ascending=False)
```

**Output:**

```
Model                     Score       Test Score
Decision Tree Regressor   98.55       83.13
Lasso Regression          -252062.50  -248119.29
Ridge Regression          -252539.70  -248538.03
```

```
# Training = Tr.
# Testing = Te.
x = PrettyTable()
x.field_names = ["Model Name", "Tr. RMSE", "Tr. MA%E", "Tr. R-Squared", "Te. RMSE", "Te. MA%E", "Te. R-Squared",]
x.add_row(['Ridge Regression','3558.67','32','0.42','3457.60','32','0.42'])
x.add_row(['Lasso Regression','3560.85','32','0.41','3459.38','32','0.42'])
x.add_row(['Decision Tree Regressor','853.54','06','0.97','1857.68','10','0.83'])
print(x)
```

**Output:**

```
+-------------------------+----------+----------+---------------+----------+----------+---------------+
|        Model Name       | Tr. RMSE | Tr. MA%E | Tr. R-Squared | Te. RMSE | Te. MA%E | Te. R-Squared |
+-------------------------+----------+----------+---------------+----------+----------+---------------+
|     Ridge Regression    | 3558.67  |    32    |      0.42     | 3457.60  |    32    |      0.42     |
|     Lasso Regression    | 3560.85  |    32    |      0.41     | 3459.38  |    32    |      0.42     |
| Decision Tree Regressor |  853.54  |    06    |      0.97     | 1857.68  |    10    |      0.83     |
+-------------------------+----------+----------+---------------+----------+----------+---------------+
```

With this, we come to an end of our article – flight price prediction using machine learning. Our regression models have successfully forecasted airline ticket prices with notable accuracy. Through rigorous feature engineering and optimization, particularly in decision tree regression, we’ve gained valuable insights into market dynamics.

As AI continues to evolve, machine learning techniques play a crucial role in accurately predicting airfare prices. Ensemble methods like random forest hold promise for further improving prediction accuracy, ensuring robustness in our models.
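`RandomForestRegressor` was imported earlier but not fitted above; a sketch of how it would slot into the same train/test workflow, run here on synthetic stand-in data rather than the article's encoded frames:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix and ticket prices.
rng = np.random.RandomState(42)
X = rng.rand(500, 13)
y = 1000 + 5000 * X[:, 0] + 200 * rng.randn(500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)
score = rf.score(X_te, y_te)  # R-squared on the held-out split
print(round(score, 2))
```

On the real frames, `X_train`, `y_train`, etc. from the split above would replace the synthetic arrays.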

In summary, our study demonstrates the effectiveness of machine learning in forecasting airfare prices. Continued advancements in deep learning techniques will likely lead to even more precise predictions, benefiting travelers and industry stakeholders alike.

Here's the repo link to this article. I hope you liked my article on flight fare prediction using machine learning. If you have any opinions or questions, comment below.

Frequently Asked Questions

**Q. What is the role of hyperparameter tuning in flight price prediction?**

A. Hyperparameter tuning optimizes the performance of machine learning algorithms by adjusting parameters like alpha in Ridge Regression or max_depth in Decision Tree Regression. It enhances the model's accuracy and generalization by finding the best parameter values through techniques like GridSearchCV.

**Q. How does artificial intelligence predict flight prices?**

A. Artificial intelligence, especially machine learning algorithms, analyzes historical flight data to learn patterns and make accurate predictions. By leveraging regression algorithms, AI captures complex relationships between features like airline, departure time, and destination, leading to more precise forecasts.

**Q. What is bagging, and how is it used in predicting ticket prices?**

A. Bagging (Bootstrap Aggregating) combines multiple models trained on different subsets of the training data. In predicting ticket prices, it involves training multiple decision tree regressors on different data subsets and averaging their predictions to reduce variance and enhance accuracy.
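The idea in this answer can be sketched with scikit-learn's `BaggingRegressor`, whose default base learner is a decision tree (toy data, illustrative only):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

# Toy regression data standing in for the encoded flight features.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 10 * X[:, 0] + rng.randn(200)

# 50 models, each fit on a bootstrap sample of the rows; the default
# base estimator is a decision tree, and the final prediction averages
# the 50 individual predictions.
bag = BaggingRegressor(n_estimators=50, random_state=0)
bag.fit(X, y)
print(bag.predict(X[:2]).shape)  # (2,)
```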

**Q. What validation techniques were used to evaluate the models?**

A. Validation techniques such as cross-validation and GridSearchCV were employed. These methods assess model performance by splitting the dataset into subsets for training and testing, and by searching for the best hyperparameters. Metrics like RMSE and R-squared were used to evaluate performance.

**Q. How does feature selection improve ticket price forecasting?**

A. Feature selection identifies the most relevant features impacting ticket prices, reducing model complexity and improving generalization. By focusing on influential factors and eliminating irrelevant ones, feature selection enhances model accuracy and interpretability, leading to better forecasts.

*The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.*
