What are the odds that you know about the odds?
This article was published as a part of the Data Science Blogathon
If the title made you guess that this article is all about odds, you are right. It covers odds and their variants: log of odds, odds ratio, and log of odds ratio. I am not sure about you, but I was always confused by these terms, which are usually taught through lengthy formulas. If you are as confused as I was, this article should clear things up.
You might have come across the term odds while studying probability or statistics. The term is also common in betting and horse racing. You might have heard sentences like “What are the odds that my horse will win the race?” or “What are the odds that I will win the lottery?”. Odds are nothing but chance: when someone asks “What are the odds?”, one can interpret it as “What are the chances?”.
Odds are the ratio of the number of times an event happens to the number of times it does not happen. Equivalently, odds are the ratio of the probability of an event happening to the probability of it not happening. Odds can be expressed as a ratio or a fraction.
Odds should not be confused with probability. Probability is the ratio of the number of times an event happens to the total number of outcomes (event happening + event not happening). Odds can be derived from probabilities and vice versa, but the two are not the same quantity.
An example demonstrates the difference between odds and probability. Consider a team that played 100 matches, winning 25 and losing 75. The odds and the probabilities can both be calculated from these counts.
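Worked out for this example, the odds in favor of winning are the matches won divided by the matches lost. This is a reconstruction from the counts stated above, written out as a formula:

```latex
\[
\text{Odds in favor of winning}
  = \frac{\text{matches won}}{\text{matches lost}}
  = \frac{25}{75}
  = \frac{1}{3}
  \approx 0.333
\]
```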
The odds in favor of the team winning are 1:3, or 1/3, or about 0.333. There are also odds against the team winning, which are the “multiplicative inverse” of the odds in favor. As you might have worked out, the odds against the team winning are 3:1, or 3/1, or 3.
The probabilities of the team winning and losing come from the same counts, this time dividing by the total number of matches rather than by the number of opposite outcomes.
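For the 100-match example, the two probabilities work out as follows (a reconstruction from the counts given earlier; the notation P(win) and P(lose) is an assumption of this sketch):

```latex
\[
\begin{aligned}
P(\text{win})  &= \frac{25}{25 + 75} = \frac{25}{100} = 0.25 \\
P(\text{lose}) &= \frac{75}{25 + 75} = \frac{75}{100} = 0.75
\end{aligned}
\]
```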
As mentioned, odds can also be calculated from the probabilities, by dividing the probability of the event by the probability of its complement.
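Applied to the probabilities of the 100-match example, this gives the same odds as the count-based calculation (again a reconstruction, using the assumed notation P(win) and P(lose)):

```latex
\[
\begin{aligned}
\text{Odds in favor of winning} &= \frac{P(\text{win})}{1 - P(\text{win})} = \frac{0.25}{0.75} = \frac{1}{3} \approx 0.333 \\
\text{Odds against winning}     &= \frac{P(\text{lose})}{1 - P(\text{lose})} = \frac{0.75}{0.25} = 3
\end{aligned}
\]
```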
A few things to keep in mind about odds:
- Odds are the ratio of the count of events happening to the count of events not happening.
- Odds can range from 0 to infinity, whereas probabilities range from 0 to 1.
- Odds are not probabilities.
- Odds can be calculated from Probabilities and vice versa.
- Odds can be in favor of an event happening or against the event happening.
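The points above can be sketched in a few lines of Python. The function names here are hypothetical helpers chosen for illustration, not part of any library, and the numbers are the 100-match example from earlier.

```python
def odds_from_counts(happened: int, did_not_happen: int) -> float:
    """Odds as the count of events happening over the count not happening."""
    return happened / did_not_happen


def odds_from_probability(p: float) -> float:
    """Convert a probability (0 <= p < 1) into odds."""
    return p / (1 - p)


def probability_from_odds(odds: float) -> float:
    """Convert odds (0 <= odds < infinity) back into a probability."""
    return odds / (1 + odds)


# 100 matches: 25 won, 75 lost
odds_win = odds_from_counts(25, 75)    # odds in favor of winning
odds_lose = odds_from_counts(75, 25)   # odds against winning

# Round trip: odds -> probability -> odds
p_win = probability_from_odds(odds_win)
odds_win_again = odds_from_probability(p_win)

print("odds in favor:", odds_win)
print("odds against:", odds_lose)
print("probability of winning:", p_win)
```

Note that the odds in favor and the odds against multiply to 1, while the two probabilities add up to 1.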
As we have seen in the example above, the odds in favor of the team winning were 0.33 and the odds against were 3. That was a simple case with 100 matches. Now consider a hypothetical situation where 1000 matches are played and a team wins only 25 of them and loses 975. The odds in favor of the team winning are then about 0.0256, and the odds against the team winning (the odds in favor of the team losing) are 39. Odds below even are squeezed between 0 and 1, while odds above even can run from 1 to infinity, so the two values differ hugely in magnitude although they describe the same situation. This is the main reason we log transform the odds. A further advantage is that the log-transformed odds are symmetric around zero, which is helpful in binary classification problems.
For the example above, taking the log of both odds gives two values of equal magnitude and opposite sign.
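Written out for the 1000-match example, and assuming the natural logarithm (an assumption of this sketch, chosen because it matches `np.log` in the Python code later in the article):

```latex
\[
\begin{aligned}
\ln(\text{odds in favor of winning}) &= \ln\!\left(\frac{25}{975}\right) \approx \ln(0.0256) \approx -3.664 \\
\ln(\text{odds against winning})     &= \ln\!\left(\frac{975}{25}\right) = \ln(39) \approx 3.664
\end{aligned}
\]
```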
The same values can also be calculated from the probabilities.
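For the 1000-match example the probability of winning is 25/1000 = 0.025 and the probability of losing is 0.975, so the calculation would look like this (natural logarithm assumed, as a sketch):

```latex
\[
\begin{aligned}
\ln(\text{odds in favor of winning}) &= \ln\!\left(\frac{0.025}{0.975}\right) \approx -3.664 \\
\ln(\text{odds against winning})     &= \ln\!\left(\frac{0.975}{0.025}\right) \approx 3.664
\end{aligned}
\]
```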
It is also interesting to note that the calculation of the log of odds from probabilities can be written as a general formula.
If that formula looks familiar, it is because it is the logit transformation that we use in logistic regression. The need to log transform the odds is illustrated further in the example that follows.
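A standard way to write this, with p denoting the probability of the event (the symbol p is an assumption here, since the text does not name it), is:

```latex
\[
\log(\text{odds}) = \ln\!\left(\frac{p}{1 - p}\right) = \ln(p) - \ln(1 - p)
\]
```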
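As a minimal sketch, the logit transformation and its inverse can be written with the standard library alone. The function names `logit` and `inverse_logit` are chosen for illustration, and the probability is the one from the 1000-match example.

```python
import math


def logit(p: float) -> float:
    """Log of odds for a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def inverse_logit(log_odds: float) -> float:
    """Map a log of odds back to a probability (the sigmoid function)."""
    return 1 / (1 + math.exp(-log_odds))


p_win = 25 / 1000                  # 25 wins out of 1000 matches
log_odds_win = logit(p_win)        # negative: winning is unlikely
log_odds_lose = logit(1 - p_win)   # same magnitude, opposite sign

print("log odds of winning:", log_odds_win)
print("log odds of losing:", log_odds_lose)
print("recovered probability:", inverse_logit(log_odds_win))
```

Even odds (a probability of 0.5) map to a log of odds of exactly 0, which is why the transformed scale is centered there.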
Consider a team that played 1000 matches. If we enumerate every possible split of wins and losses adding up to 1000 (from 1 win and 999 losses to 999 wins and 1 loss), calculate the corresponding odds, and plot a histogram of the log odds, the result is a symmetric, bell-shaped curve that resembles the normal distribution. This is implemented in the Python code below.
# importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline  # Jupyter notebook magic; remove when running as a plain script
import seaborn as sns

# initializing the Win and Lose lists
Win = list(range(1,1000,1))
Lose = list(range(999,0,-1))

# creating empty data frame
df = pd.DataFrame()

# initializing the columns of the data frame with the two lists
df['Win'] = Win
df['Lose'] = Lose

# calculating the odds of winning and losing
df['Odds_Win'] = df['Win']/df['Lose']
df['Odds_Lose'] = df['Lose']/df['Win']

# calculating the log of odds of winning and losing
df['Log_Odds_Win'] = np.log(df['Odds_Win'])
df['Log_Odds_Lose'] = np.log(df['Odds_Lose'])

# plotting the log odds of winning
sns.displot(df['Log_Odds_Win'],kde=True);
plt.title("Distribution of Log of Odds in Favor of Winning");
plt.xlabel("Log Odds");
plt.ylabel("Count");
The generated histogram is bell-shaped and symmetric around zero.
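The symmetry can also be checked numerically, without a plot. The following is a small sketch that rebuilds the same win and loss counts with NumPy only; the variable names are chosen for illustration.

```python
import numpy as np

# Every split of 1000 matches, from 1 win / 999 losses to 999 wins / 1 loss
wins = np.arange(1, 1000)
losses = 1000 - wins

odds_win = wins / losses
log_odds_win = np.log(odds_win)

# Raw odds are lopsided: half the values sit below 1, the rest stretch up to 999
print("raw odds range:", odds_win.min(), "to", odds_win.max())

# Log odds mirror around zero: reversing the order flips the sign
is_symmetric = np.allclose(log_odds_win, -log_odds_win[::-1])
print("log odds range:", log_odds_win.min(), "to", log_odds_win.max())
print("symmetric around zero:", is_symmetric)
```

The raw odds have a median of 1 but a much larger mean, whereas the log odds are centered on zero.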
As the name suggests, the odds ratio is just a “ratio of two odds”. Although odds are themselves a ratio, “odds” and “odds ratio” are not the same thing: odds are the ratio of an event happening to the event not happening, whereas the odds ratio is the ratio of two odds (Odds1 and Odds2). The odds ratio measures the association between two events and is useful when interpreting the output of a logistic regression model. It can be expressed in terms of the probabilities behind the two odds.
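Written as a formula, with p1 and p2 denoting the probabilities behind Odds1 and Odds2 (these symbols are assumptions of this sketch), the odds ratio is:

```latex
\[
\text{Odds Ratio}
  = \frac{\text{Odds}_1}{\text{Odds}_2}
  = \frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)}
\]
```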
Log of Odds Ratio:
As with odds, values of the odds ratio range from 0 to infinity. When the numerator of an odds ratio is smaller than the denominator, the value is less than 1; when the numerator is greater than the denominator, the value is greater than 1 (up to infinity). As with odds, the two cases sit on very different scales, so it is convenient to log transform the ratio. Once we do that, the values become symmetric around zero. The log of the odds ratio is simply the log of the ratio of the two odds.
To illustrate the log of odds ratios with the win and lose problem described above, we can calculate and plot the log of the odds ratio in Python as below:
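In formula form, using the same Odds1 and Odds2 as before and assuming the natural logarithm:

```latex
\[
\ln(\text{Odds Ratio})
  = \ln\!\left(\frac{\text{Odds}_1}{\text{Odds}_2}\right)
  = \ln(\text{Odds}_1) - \ln(\text{Odds}_2)
\]
```

An odds ratio of 1 (equal odds) therefore maps to a log of odds ratio of 0.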
# calculating odds ratios
df['Odds_Ratio_Win_Lose'] = df['Odds_Win']/df['Odds_Lose']
df['Odds_Ratio_Lose_Win'] = df['Odds_Lose']/df['Odds_Win']

# calculating log of odds ratios
df['Log_Odds_Ratio_Win_Lose'] = np.log(df['Odds_Ratio_Win_Lose'])
df['Log_Odds_Ratio_Lose_Win'] = np.log(df['Odds_Ratio_Lose_Win'])

# plotting the log of the odds ratio (win over lose)
sns.displot(df['Log_Odds_Ratio_Win_Lose'],kde=True);
plt.title("Distribution of Log of Odds Ratio Win Over Lose");
plt.xlabel("Log Odds Ratio");
plt.ylabel("Count");
The histogram of the log of odds ratios is again symmetric around zero.
Use cases of Odds Ratios:
- In medicine, the odds ratio is used to quantify the relationship between an exposure and an outcome, for example smoking and lung cancer. In this case, the odds ratio compares the odds that a person suffers from lung cancer given that they smoke with the odds that a person suffers from lung cancer given that they do not smoke.
- The odds ratio also commonly appears when interpreting the output of a logistic regression model. There, the regression coefficient of a variable represents the change in the log odds of the dependent variable for a one-unit increase in the independent variable, so exponentiating the coefficient gives the corresponding odds ratio.
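The link between a logistic regression coefficient and an odds ratio can be checked with a small numeric sketch. The intercept and coefficient below are hypothetical values picked for illustration, not the output of any fitted model.

```python
import math

# Hypothetical logistic regression parameters (assumed for illustration only)
intercept = -2.0
coefficient = 0.7


def predicted_probability(x: float) -> float:
    """Probability from the logistic model: sigmoid(intercept + coefficient * x)."""
    log_odds = intercept + coefficient * x
    return 1 / (1 + math.exp(-log_odds))


def odds(p: float) -> float:
    """Convert a probability into odds."""
    return p / (1 - p)


x = 3.0
odds_at_x = odds(predicted_probability(x))
odds_at_x_plus_1 = odds(predicted_probability(x + 1))

# Ratio of the odds after and before a one-unit increase in x
odds_ratio = odds_at_x_plus_1 / odds_at_x

print("odds ratio for a one-unit increase:", odds_ratio)
print("exp(coefficient):", math.exp(coefficient))
```

The two printed values agree for any starting value of x, because a one-unit increase adds the coefficient to the log odds regardless of where it starts.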
Consider the example of smoking and lung cancer. A two-by-two table of smoking status against cancer status would look something like this:

| | Cancer | No cancer | Total |
|---|---|---|---|
| Smokers | 100 | 60 | 160 |
| Non-smokers | 34 | 125 | 159 |
| Total | 134 | 185 | 319 |
Out of the 319 patients,
- 160 are smokers and 159 are non-smokers.
- 134 have cancer and 185 do not have cancer.
- 100 smoke and have cancer, 60 smoke and do not have cancer.
- 34 do not smoke and have cancer and 125 do not smoke and do not have cancer.
From the above information, the odds ratio is calculated by cross-multiplying the cells of the table and taking the ratio: the product of smokers with cancer and non-smokers without cancer, divided by the product of smokers without cancer and non-smokers with cancer.
This odds ratio means that the odds of having cancer among smokers are 6.127 times the odds of having cancer among non-smokers. To find out whether this odds ratio is statistically significant, we need to calculate its confidence interval, which is beyond the scope of this article. However, you can see a complete example of it in the link here.
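With the counts listed above, the cross-multiplication works out as follows (a reconstruction from the stated counts):

```latex
\[
\text{Odds Ratio}
  = \frac{100 \times 125}{60 \times 34}
  = \frac{12500}{2040}
  \approx 6.127
\]
```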
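The interpretation can be checked by computing the odds of cancer within each group separately and then dividing them. This sketch uses the counts from the smoking example; the variable names are chosen for illustration, and it does not attempt the confidence interval.

```python
# Counts from the smoking and lung cancer example
smokers_with_cancer = 100
smokers_without_cancer = 60
non_smokers_with_cancer = 34
non_smokers_without_cancer = 125

# Odds of having cancer within each group
odds_cancer_smokers = smokers_with_cancer / smokers_without_cancer
odds_cancer_non_smokers = non_smokers_with_cancer / non_smokers_without_cancer

# Odds ratio as the ratio of the two group odds
odds_ratio = odds_cancer_smokers / odds_cancer_non_smokers

# The cross-multiplication shortcut gives the same number
cross_product = (smokers_with_cancer * non_smokers_without_cancer) / (
    smokers_without_cancer * non_smokers_with_cancer
)

print("odds of cancer, smokers:", round(odds_cancer_smokers, 3))
print("odds of cancer, non-smokers:", round(odds_cancer_non_smokers, 3))
print("odds ratio:", round(odds_ratio, 3))
```

Dividing the two group odds and cross-multiplying the table cells are the same calculation arranged differently.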
Odds and odds ratios play a very important role in the medical domain, the betting industry, and elsewhere. In the medical domain they are especially important for checking the effect of certain exposures on certain outcomes, and the gambling industry relies heavily on odds and probabilities. I hope I have been able to explain the intuition behind odds, odds ratios, and log transforming the two effectively.
As always, improvement tips and suggestions are welcome.
Another article on Data Scraping using Python and Selenium can be found here.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.