Naive Bayes Algorithm from scratch

Swapnil Singh
Published in Analytics Vidhya
7 min read · Oct 21, 2020

Naive Bayes is a simple machine learning algorithm. It is an eager learner, which means it does most of its work at training time, so prediction is fast. The algorithm works on the assumption that all the features are independent of each other.

It is based on Bayes' theorem and is easy to build. Rather than a single model, it is a family of classifiers that share the same underlying principle.

Bayes Theorem

P(A|B) = P(B|A) · P(A) / P(B)

Here,

P(A|B): posterior probability of class given predictor

P(A): prior probability of the class

P(B|A): likelihood which is the probability of predictor given the class

P(B): prior probability of the predictor

Naive Bayes classifiers come in three types: Gaussian, Multinomial, and Bernoulli. We will develop the Gaussian Naive Bayes classifier from scratch. In Gaussian Naive Bayes, the distribution of each feature is assumed to be Gaussian (normal). When plotted, it gives a bell-shaped curve that is symmetric about the mean of the feature values.

Starting with the theory

Bayes Theorem for our case:

P(y | x₁, …, xₙ) = P(y) · P(x₁ | y) · P(x₂ | y) · … · P(xₙ | y) / P(x₁, …, xₙ)

After assuming the features to be mutually independent, we can multiply the per-feature conditional probabilities directly with the prior probability of the class. To get the class of a datapoint, we pick the target class with the maximum probability: if class A has the highest probability, then the datapoint belongs to class A. Since the denominator P(x₁, …, xₙ) is the same for every class, it can be neglected and only the numerator matters. Because the conditional probabilities are very small, multiplying them risks numerical underflow, so we take the log of the entire numerator, turning the product into a summation:

y = argmax over y of [ log P(y) + Σᵢ log P(xᵢ | y) ]

These conditional probabilities are found using the Gaussian density function:

P(xᵢ | y) = exp(−(xᵢ − μ_y)² / (2σ_y²)) / √(2πσ_y²)

Python Code

Importing the basic libraries for execution

We have imported pandas and numpy libraries.
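Since the original post shows the code as screenshots, here is a minimal sketch of this cell:

```python
# numpy for numerical work, pandas for reading and handling the dataset
import numpy as np
import pandas as pd
```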

Defining the Naive Bayes Class

Here we see four functions in the class. In the fit function, we get the number of samples, the number of features, and the unique classes. We also declare numpy arrays of zeros for the mean and variance, each of shape (number of classes, number of features), along with a numpy array for the prior probabilities of length equal to the number of classes. The mean, variance, and prior probability of each class are then computed and stored.

The predict function is used to predict the class of each datapoint by calling the helper _predict function, which returns the class with the maximum posterior probability. Inside _predict, prior stores the natural log of the class's prior probability, while class_conditional stores the sum of the logs of the conditional probabilities returned by the _pdf function, which evaluates the Gaussian density. posterior stores the sum of class_conditional and prior and is appended to the posteriors list. Finally, the class with the maximum posterior is returned to the predict function.
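The original code again appears as images; the sketch below is consistent with the description above (the method names follow the text, everything else is an assumption):

```python
class NaiveBayes:
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        n_classes = len(self._classes)

        # per-class mean, variance, and prior probability
        self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
        self._var = np.zeros((n_classes, n_features), dtype=np.float64)
        self._priors = np.zeros(n_classes, dtype=np.float64)

        for idx, c in enumerate(self._classes):
            X_c = X[y == c]
            self._mean[idx, :] = X_c.mean(axis=0)
            self._var[idx, :] = X_c.var(axis=0)
            self._priors[idx] = X_c.shape[0] / float(n_samples)

    def predict(self, X):
        # classify every row of X
        return np.array([self._predict(x) for x in X])

    def _predict(self, x):
        posteriors = []
        for idx in range(len(self._classes)):
            prior = np.log(self._priors[idx])                      # log P(y)
            class_conditional = np.sum(np.log(self._pdf(idx, x)))  # sum of log P(x_i | y)
            posteriors.append(prior + class_conditional)
        # class with the highest log-posterior
        return self._classes[np.argmax(posteriors)]

    def _pdf(self, class_idx, x):
        # Gaussian density: exp(-(x - mean)^2 / (2 var)) / sqrt(2 pi var)
        mean = self._mean[class_idx]
        var = self._var[class_idx]
        return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
```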

We will apply this Naive Bayes algorithm to the UCI Heart Disease dataset. Before applying Naive Bayes, we need to preprocess the dataset, which we do with a StandardScaler; we will also learn how to code StandardScaler from scratch.

StandardScaler

We see that the StandardScaler class contains three functions: fit, transform, and fit_transform. In the fit function we find the mean and standard deviation of each column in the dataset. The transform function standardizes the columns; each value in a column is transformed as (x − mean) / standard deviation. fit_transform fits and transforms the dataset together in one step.
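A sketch of such a class, assuming the attribute names mean_ and std_:

```python
class StandardScaler:
    def fit(self, X):
        # column-wise mean and standard deviation of the data
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        return self

    def transform(self, X):
        # standardize each value: (x - mean) / standard deviation
        return (X - self.mean_) / self.std_

    def fit_transform(self, X):
        # fit and transform in one step
        return self.fit(X).transform(X)
```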

Reading the dataset and applying StandardScaler

We read the UCI dataset and see that it includes 303 datapoints, with 13 features and 1 column for the target class. The target column indicates whether the patient has heart disease (indicated by 1) or not (indicated by 0). The dataset contains the data of 164 patients who have heart disease and 139 patients who do not. The age and sex attributes are for identification purposes, and the other 11 attributes contain vital clinical records.[1]

The dataset used here contains the following features

  • age: age in years of the patient
  • sex: sex (1 = male, 0 = female)
  • cp: chest pain type (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)
  • trestbps: resting blood pressure (in mm Hg on admission to hospital)
  • chol: serum cholesterol in mg/dl
  • fbs: fasting blood sugar > 120 mg/dl (0 = False, 1 = True)
  • restecg: resting electrocardiographic results (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
  • thalach: maximum heart rate achieved
  • exang: exercise induced angina (1 = yes, 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest
  • slope: the slope of the peak exercise ST segment (1= upsloping, 2 = flat, 3 = downsloping)
  • ca: number of major vessels (0–3) coloured by fluoroscopy
  • thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

We then apply the StandardScaler to all columns except the target column. We also print the mean and standard deviation of each column.
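A sketch of these steps, assuming the CSV is named heart.csv and the label column is named target (both are assumptions; adjust to your copy of the dataset):

```python
df = pd.read_csv('heart.csv')  # assumed file name
print(df.shape)                # (303, 14): 13 features plus the target column

scaler = StandardScaler()
feature_cols = [c for c in df.columns if c != 'target']
df[feature_cols] = scaler.fit_transform(df[feature_cols])

# the column means and standard deviations computed by fit
print(scaler.mean_)
print(scaler.std_)
```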

Splitting the dataset and applying Naive Bayes

We import the train_test_split function to split the dataset into training and testing sets. We assign all columns except the target column to the variable x and the target to the variable y. We then use train_test_split to split the data, with the testing data being 20% of the dataset. Next we instantiate the Naive Bayes model and fit it using the training data. The testing data is used to evaluate the model via the predict function; before calling predict, we need to convert the x_test dataframe to a numpy array. Finally, we calculate the accuracy of the model using the accuracy_score function shown below.
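A sketch of the split, fit, and predict steps (the random_state value is an assumption; the exact accuracy depends on the split):

```python
from sklearn.model_selection import train_test_split

x = df.drop(columns=['target'])
y = df['target']

# hold out 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

model = NaiveBayes()
model.fit(x_train.to_numpy(), y_train.to_numpy())

# predict expects a numpy array, so convert the test dataframe first
predictions = model.predict(x_test.to_numpy())
```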

Accuracy is the percentage of the predictions that are correct.
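A from-scratch accuracy_score is a one-liner:

```python
def accuracy_score(y_true, y_pred):
    # fraction of predictions that match the true labels
    return np.sum(y_true == y_pred) / len(y_true)

print(accuracy_score(y_test.to_numpy(), predictions))
```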

We have also built a GUI that takes input from the user and predicts whether the user has heart disease or not.

GUI predicting that the user has heart disease
GUI predicting that the user doesn't have heart disease

Application of Naive Bayes Algorithm

  • When the dataset is small with many parameters
  • When the classifier needs to be easy to interpret

Advantages of Naive Bayes Algorithm

  • It is easy to implement and has fast processing speed
  • When the independence assumption holds, it gives very good results
  • Performs well on categorical data

Disadvantages of Naive Bayes Algorithm

  • If a target class is not present in the training data, the model won't be able to predict that class.
  • Features are usually dependent in real world datasets.

StandardScaler

StandardScaler standardizes data by subtracting the mean of each feature from the data and then scaling to unit variance; unit variance means dividing all the values by the standard deviation. StandardScaler results in a distribution with mean equal to 0 and standard deviation equal to 1, with most of the values lying between -1 and 1. For reference, suppose a dataset with the following values: [[1, -1, 2], [2, 0, 0], [0, 1, -1]]. After applying StandardScaler, the values are converted to [[0, -1.22, 1.34], [1.22, 0, -0.27], [-1.22, 1.22, -1.07]]. We can clearly see the means of the processed dataset coming out to be (0, 0, 0) and the standard deviations being (1, 1, 1). Even though it eases processing of the data, StandardScaler has its own limitations: it does not perform well in the presence of outliers. Outliers influence the empirical mean and standard deviation, which shrinks the range of the ordinary data values. Since the outliers of each feature can have different magnitudes, the spread of the transformed data on each feature can be very different. Hence StandardScaler cannot guarantee balanced feature scales in the presence of outliers.
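Running the from-scratch StandardScaler on this example reproduces those numbers:

```python
X = np.array([[1, -1, 2],
              [2, 0, 0],
              [0, 1, -1]], dtype=np.float64)

X_scaled = StandardScaler().fit_transform(X)
print(np.round(X_scaled, 2))
# [[ 0.   -1.22  1.34]
#  [ 1.22  0.   -0.27]
#  [-1.22  1.22 -1.07]]
print(X_scaled.mean(axis=0))  # ~ (0, 0, 0)
print(X_scaled.std(axis=0))   # ~ (1, 1, 1)
```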

Conclusion

  • We created the Naive Bayes, StandardScaler, and accuracy_score functions from scratch.
  • We applied them to the UCI Heart Disease dataset and achieved an accuracy of 90.16%.
  • We also created a GUI for the model and predicted the class of user-entered data.

Reference

[1] Swapnil Singh and Ameyaa Biwalkar (2020). Impact of Machine Learning Algorithms on Heart Disease Datasets. International Journal of Advanced Science and Technology, 29(3), 11786 -. Retrieved from http://sersc.org/journals/index.php/IJAST/article/view/29849

For the code for the GUI, visit https://github.com/The-Swapster/NaiveBayes
