Machine Learning Algorithms

Which machine learning algorithm to use depends on the underlying problem (e.g., classification, regression, or clustering) and on the type of available data. This chapter gives an overview of different machine learning algorithms, along with some advantages and disadvantages that should be considered.

Supervised Machine Learning

Supervised machine learning models are trained with labelled data. The algorithms for supervised machine learning problems can be divided into classification and regression algorithms. Classification means assigning samples to distinct classes, whereas regression predicts continuous values, for example housing prices. Some popular algorithms are listed in the table below.
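
To make the difference between classification and regression concrete, here is a minimal sketch that uses the same k-nearest-neighbor idea for both tasks. The data set, the feature choice (living area and number of rooms), and all function names are hypothetical and chosen only for illustration. It uses only the Python standard library.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def k_nearest(train, query, k):
    """Return the labels/targets of the k training points closest to query."""
    ranked = sorted(train, key=lambda pair: dist(pair[0], query))
    return [target for _, target in ranked[:k]]


def knn_classify(train, query, k=3):
    """Classification: majority vote over the neighbours' class labels."""
    return Counter(k_nearest(train, query, k)).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Regression: mean of the neighbours' numeric targets."""
    neighbours = k_nearest(train, query, k)
    return sum(neighbours) / len(neighbours)


# Hypothetical labelled data: (features, label),
# where features = (living area in square metres, number of rooms).
labelled_classes = [((30, 1), "flat"), ((45, 2), "flat"), ((50, 2), "flat"),
                    ((120, 5), "house"), ((150, 6), "house"), ((135, 5), "house")]
# Same features, but the label is now a continuous price (arbitrary units).
labelled_prices = [((30, 1), 100.0), ((45, 2), 150.0), ((50, 2), 170.0),
                   ((120, 5), 400.0), ((150, 6), 500.0), ((135, 5), 450.0)]

predicted_class = knn_classify(labelled_classes, (48, 2))  # a discrete class
predicted_price = knn_regress(labelled_prices, (48, 2))    # a continuous value
```

The only difference between the two tasks in this sketch is how the neighbours' labels are combined: a vote yields a class, an average yields a number.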

Algorithm	Description	Task	Pros & Cons
Decision Trees	can be imagined as a tree that splits from the root into leaves by making decisions based on feature thresholds	Classification and regression	+ can work with various data types + easy to interpret + missing values can be interpolated + high performance + efficient - tends to overfit
Random forests	combine many decision trees and aggregate their predictions	Classification and regression	+ less prone to overfitting than a single decision tree + efficient + noise can be handled - a large number of trees can increase computation time
K-nearest neighbor	classifies a data point by choosing the majority class among its k nearest neighbors	Classification and regression	+ simple to use + can be used for multi-modal classification - prediction becomes slow with large amounts of training data - noise and irrelevant features decrease accuracy
Linear regression	fits a line to the data	Regression	+ simple to use + overfitting can be avoided - can only model linear relationships - may be too simple for real problems
Logistic regression	fits a logistic curve to the data	Classification	+ simple to use + noise can be tolerated + efficient - can overfit when there are many features - requires large amounts of training data
Naïve Bayes	classifies objects by using conditional probability	Classification and clustering (unsupervised)	+ simple to use + fast training + little data needed + can be applied to binary and multiclass classification + data can be discrete or continuous - assumes independent features, so accuracy suffers if features are dependent (e.g., time series)
Support Vector Machines	classify objects with the help of hyperplanes that create margins between the classes	Classification and regression	+ high accuracy + works well with high-dimensional data + rarely overfits - performance depends on parameter selection - noise decreases accuracy - difficult to interpret
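
As an illustration of the "fits a line to the data" entry for linear regression, the following sketch computes an ordinary least-squares fit for a single feature in closed form. The data values and the names `fit_line` and `predict` are hypothetical. In practice a library implementation would normally be used.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical, roughly linear data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)


def predict(x):
    """Predict a continuous value for a new input using the fitted line."""
    return slope * x + intercept
```

The sketch also shows the limitation listed in the table: whatever the data look like, the model can only ever return a straight line.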

Unsupervised Machine Learning

In contrast to supervised machine learning, unsupervised machine learning uses unlabelled data. Unsupervised learning can be subdivided in different ways; the following is one example:

Clustering:
Clusters of data points are created by finding similarities and patterns among them. Clustering can be divided into different techniques, such as hierarchical and partitional clustering.
Dimensionality reduction:
High-dimensional data is reduced to its most important information.
Association rule learning:
This technique is often used on large datasets, e.g. for data mining. It is based on finding associations between different features.
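
To show what "finding associations" can mean in practice, the sketch below computes the two measures commonly used to rate an association rule such as "bread implies butter": support (how often the items occur together) and confidence (how often the rule holds when its left-hand side is present). The shopping baskets and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= set(transaction))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market-basket data: one set of items per purchase.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]

# 3 of the 5 baskets contain both bread and butter.
rule_support = support(baskets, {"bread", "butter"})
# 3 of the 4 baskets that contain bread also contain butter.
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
```

Association rule learning algorithms search for all rules whose support and confidence exceed chosen thresholds, instead of checking a single hand-picked rule as done here.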

Here you can see a few examples of unsupervised machine learning algorithms:

Algorithm	Description	Task	Pros & Cons
K-Means	creates a chosen number of clusters by repeatedly adjusting the cluster centroids	Partitional clustering	+ efficient + simple to interpret and use + fast - the number of clusters must be chosen in advance (even if it is unknown) - sensitive to the amount of data
Agglomerative clustering	clusters objects based on the distance between them	Hierarchical clustering	+ the number of clusters does not have to be given - high computational complexity (less efficient)
Principal component analysis	reduces dimensionality by computing new variables (principal components) from the data without losing too much information	Dimensionality reduction	+ fast calculation + lowers dimensionality to increase performance of other algorithms - can lead to information loss - difficult to interpret
Apriori algorithm	identifies frequently occurring itemsets in a database by scanning it multiple times	Association rule learning	+ easy to implement - may have low performance because of the repeated scans
Frequent pattern growth algorithm	improves on the Apriori algorithm by needing to scan the database only twice	Association rule learning	+ efficient and fast - harder to implement
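
The K-Means entry describes centroids being adjusted repeatedly. A minimal sketch of that loop (often called Lloyd's algorithm) is given below. The points, the starting centroids, and the function name `k_means` are hypothetical. Taking the starting centroids as an argument is a simplification for reproducibility, since real implementations usually pick them randomly or with a seeding heuristic.

```python
from math import dist  # Euclidean distance, Python 3.8+


def k_means(points, centroids, max_iterations=100):
    """Alternate between assigning points and updating centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for centroid, cluster in zip(centroids, clusters):
            if cluster:
                updated.append(tuple(sum(axis) / len(cluster)
                                     for axis in zip(*cluster)))
            else:
                updated.append(centroid)  # leave an empty cluster's centroid as is
        if updated == centroids:  # nothing moved, so the result is stable
            break
        centroids = updated
    return centroids, clusters


# Hypothetical 2D data with two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.6),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# The number of clusters (here 2) is fixed by how many start centroids are given.
centroids, clusters = k_means(points, centroids=[points[0], points[1]])
```

Note that the number of clusters is an input, which is the drawback named in the table: the algorithm will return exactly that many clusters whether or not the data actually contain them.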

