Machine Learning Algorithms

Which machine learning algorithm should be used depends on the underlying problem (e.g., classification, regression, or clustering) and on the type of available data. This chapter gives an overview of different machine learning algorithms as well as some advantages and disadvantages that should be considered.

Supervised Machine Learning

Supervised machine learning models are trained with labeled data. The algorithms for supervised machine learning problems can be divided into classification and regression algorithms. Classification means assigning samples to distinct classes, whereas regression is used for predicting continuous variables, for example the trend of housing prices. The following table shows some examples of popular algorithms.

| Algorithm | Description | Task | Pros & Cons |
| --- | --- | --- | --- |
| Decision Trees | can be imagined as a tree that splits from the root into leaves by making decisions based on feature thresholds | Classification and regression | + works with various types of data; + easy to interpret; + missing values can be interpolated; + high performance; + efficient; - tends to overfit |
| Random forests | consist of multiple decision trees | Classification | + less prone to overfitting; + efficient; + can handle noise; - a large number of trees can increase computation time |
| K-nearest neighbor | classifies a data point by choosing the class of its nearest neighbors | Classification and regression | + simple to use; + can be used for multi-modal classification; - large amounts of training data lower performance; - noise and irrelevant features decrease accuracy |
| Linear regression | fits a line to the data | Regression | + simple to use; + overfitting can be avoided; - can only be used for linear problems; - may be too simple for real-world problems |
| Logistic regression | fits a logistic curve to the data | Classification | + simple to use; + tolerates noise; + efficient; - tends to overfit; - requires large amounts of training data |
| Naïve Bayes | classifies objects using conditional probability | Classification and clustering (unsupervised) | + simple to use; + fast training; + little data needed; + applicable to binary and multiclass classification; + data can be discrete or continuous; - cannot be used if features are dependent (e.g., time) |
| Support Vector Machines | classify objects with the help of hyperplanes, creating margins between the classes | Classification and regression | + high accuracy; + works well with high-dimensional data; + rarely overfits; - performance depends on parameter selection; - noise decreases accuracy; - difficult to interpret |
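
To make the use of such models more concrete, the following minimal sketch trains two of the algorithms from the table, a decision tree for classification and a linear regression for regression. It assumes scikit-learn and its built-in iris and diabetes toy datasets, which are illustrative choices rather than part of this chapter.

```python
# Minimal sketch: training two supervised models from the table above
# (a decision tree for classification, linear regression for regression)
# using scikit-learn's built-in toy datasets (illustrative choice only).
from sklearn.datasets import load_iris, load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict the iris species from labeled flower measurements.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=3)  # limiting depth reduces overfitting
clf.fit(X_train, y_train)
print("Decision tree accuracy:", clf.score(X_test, y_test))

# Regression: predict a continuous disease-progression value.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression()
reg.fit(X_train, y_train)
print("Linear regression R^2:", reg.score(X_test, y_test))
```

Limiting the tree depth (max_depth) is one simple way to counteract the overfitting tendency noted in the table.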

Unsupervised Machine Learning

In contrast to supervised machine learning, unsupervised machine learning uses unlabeled data. Unsupervised learning can be categorized in different ways; the following is one possible division:

  • Clustering:
    Clusters of data points are created by finding similarities and patterns among them. Clustering can be divided into different techniques such as hierarchical and partitional clustering.
  • Dimensionality reduction:
    High-dimensional data is reduced to the most important information.
  • Association rule learning:
    This technique is often used on large datasets, e.g., for data mining. It is based on finding associations between different features (see the sketch after this list).
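
To illustrate the idea behind association rule learning, here is a small sketch in plain Python that computes support and confidence for a handful of hypothetical shopping transactions; the transactions, the support threshold, and the example rule are made up purely for demonstration.

```python
# Toy sketch of the idea behind association rule learning (Apriori-style),
# using a few hypothetical shopping transactions. Support counts how often
# an itemset occurs; confidence estimates P(B | A) for a rule A -> B.
from itertools import combinations

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
    {"bread", "milk"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Frequent pairs: itemsets of size 2 whose support exceeds a chosen threshold.
items = set().union(*transactions)
frequent_pairs = [
    set(pair) for pair in combinations(sorted(items), 2) if support(set(pair)) >= 0.4
]
print("Frequent pairs:", frequent_pairs)

# Confidence of the rule {bread} -> {butter}: support(bread, butter) / support(bread).
conf = support({"bread", "butter"}) / support({"bread"})
print("confidence(bread -> butter) =", round(conf, 2))
```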

Here you can see a few examples of unsupervised machine learning algorithms:

| Algorithm | Description | Task | Pros & Cons |
| --- | --- | --- | --- |
| K-Means | creates a chosen number of clusters by repeatedly adjusting the cluster centroids | Partitional clustering | + efficient; + simple to interpret and use; + fast; - the number of clusters must be chosen in advance (even if it is unknown); - sensitive to the amount of data |
| Agglomerative clustering | clusters objects based on the distance between them | Hierarchical clustering | + the number of clusters does not have to be given; - high complexity (less efficient) |
| Principal component analysis | reduces dimensionality by computing new variables from the data without losing too much information | Dimensionality reduction | + fast computation; + lowers dimensionality to increase the performance of other algorithms; - can lead to information loss; - difficult to interpret |
| Apriori algorithm | identifies frequently occurring itemsets in a database by scanning it multiple times | Association rule learning | + easy to implement; - may have low performance |
| Frequent pattern growth algorithm | an improvement of the Apriori algorithm that only needs to scan the database twice | Association rule learning | + efficient and fast; - harder to implement |
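
As with the supervised algorithms, a short sketch can make the table above more concrete. It applies K-Means and principal component analysis from scikit-learn to unlabeled synthetic data; the dataset, the number of clusters, and the number of components are assumptions chosen only for illustration.

```python
# Minimal sketch of two unsupervised algorithms from the table above:
# K-Means clustering and PCA for dimensionality reduction, using scikit-learn
# on unlabeled synthetic data (the labels returned by make_blobs are ignored).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

# K-Means: the number of clusters must be chosen in advance (here 3).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print("Cluster sizes:", [int((labels == k).sum()) for k in range(3)])

# PCA: project the 5-dimensional data onto its 2 main directions of variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

As noted in the table, the number of clusters has to be fixed before running K-Means, even if the true number is unknown.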
