There are dozens of algorithms available including general-purpose classifiers and regressors, deep learning classifiers and regressors, and classifiers for NLP.


If an algorithm is only for classification it is listed as “algorithm name” Classifier. If it is only for regression it is listed as “algorithm name” Regressor. If it is available for both classification and regression it is listed as “algorithm name” Classifier/Regressor.


All algorithms have parameters that can be tuned. To activate and include a parameter, make sure the box beside that parameter is checked. The following illustration shows some of the parameters of the KNN classifier.



The N_neighbors parameter is active because the checkbox beside it is checked. This means that when the number of neighbours is changed from 1 to something else, the new value will be used for that parameter when training the KNN model.


Since the checkbox beside weights is not checked, neither the default value of that parameter nor any change to it will affect the parameters used in training the KNN algorithm.


Take note of the data type of each parameter so that you tune it with values of the appropriate type.


General Purpose Classifiers and Regressors


KNearestNeighbors Classifier/Regressor



Image courtesy https://fr.m.wikipedia.org/wiki/Fichier:KnnClassification.svg


Classification - KNN is an algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). Classification is computed from a simple majority vote of the nearest neighbours of each point: a query point is assigned the data class which has the most representatives within the nearest neighbours of the point.


Regression - KNN is an algorithm that stores all available cases and predicts the numerical target based on a similarity measure (e.g., distance functions).


K is an integer value specified by the user.
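
Below is a minimal scikit-learn sketch of the classifier and regressor described above, using made-up toy data; the checkbox UI exposes the same parameters, and the code is only meant to illustrate what they do.

# Illustrative only: KNN classification and regression with scikit-learn.
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = [[0, 0], [1, 1], [2, 2], [3, 3]]      # toy feature matrix
y_class = [0, 0, 1, 1]                    # class labels
y_value = [0.1, 0.9, 2.1, 3.2]            # numerical targets

# Classification: majority vote among the 3 nearest neighbours, weighted by distance.
clf = KNeighborsClassifier(n_neighbors=3, weights="distance", metric="minkowski", p=2)
clf.fit(X, y_class)
print(clf.predict([[1.5, 1.5]]))

# Regression: average of the 3 nearest neighbours' target values.
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, y_value)
print(reg.predict([[1.5, 1.5]]))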


Parameters


N_neighbors - int, optional (default = 1)

Number of neighbours to use by default for kneighbors queries.


Weights - str, optional (default = ‘uniform’)

weight function used in prediction. Possible values:

  • ‘uniform’: uniform weights. All points in each neighbourhood are weighted equally.

  • ‘distance’: weight points by the inverse of their distance. In this case, closer neighbours of a query point will have a greater influence than neighbours which are further away.


Algorithm - {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional (default = ‘auto’)

The algorithm used to compute the nearest neighbours:

  • ‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to the fit method.

  • ‘ball_tree’ will use BallTree

  • ‘kd_tree’ will use KDTree

  • ‘brute’ will use a brute-force search

Note: Fitting on sparse input overrides setting of this parameter, using brute force.


Leaf_size - int, optional (default = 30)

Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.


P - int, optional (default = 2)

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.


Metric - string, default ‘minkowski’

the distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. Other distance metrics:


  • “euclidean” (EuclideanDistance): sqrt(sum((x - y)^2))

  • “manhattan” (ManhattanDistance): sum(|x - y|)

  • “chebyshev” (ChebyshevDistance): max(|x - y|)

  • “minkowski” (MinkowskiDistance): sum(|x - y|^p)^(1/p)

  • “wminkowski” (WMinkowskiDistance): sum(|w * (x - y)|^p)^(1/p)

  • “seuclidean” (SEuclideanDistance): sqrt(sum((x - y)^2 / V))

  • “mahalanobis” (MahalanobisDistance): sqrt((x - y)' V^-1 (x - y))

  • “hamming” (HammingDistance): N_unequal(x, y) / N_tot

  • “canberra” (CanberraDistance): sum(|x - y| / (|x| + |y|))

  • “braycurtis” (BrayCurtisDistance): sum(|x - y|) / (sum(|x|) + sum(|y|))


Metric_params - dict, optional (default = None)

Additional keyword arguments for the metric function.


N_jobs - int or None, optional (default=None)

The number of parallel jobs to run for the neighbours search. It specifies how many concurrent processes or threads should be used for routines that are parallelized. None means 1; -1 means using all processors.


Extra Tree Classifier/Regressor

Also known as an Extremely Randomized Tree Classifier, it is a type of ensemble learning technique that aggregates the results of multiple de-correlated decision trees collected in a “forest” to output its classification result. In concept, it is very similar to a Random Forest Classifier and only differs from it in the manner of construction of the decision trees in the forest. 


It implements a meta estimator that fits a number of randomized decision trees (extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
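
As a rough illustration, this is what the equivalent scikit-learn estimator looks like with a few of the parameters listed below (toy data only, not the platform's own interface):

from sklearn.ensemble import ExtraTreesClassifier

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

# A forest of 100 extremely randomized trees; split thresholds are drawn at random.
clf = ExtraTreesClassifier(n_estimators=100, criterion="gini", max_depth=None, random_state=0)
clf.fit(X, y)
print(clf.predict([[2.5, 2.5]]))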


Parameters

N_estimators - integer, optional (default=100) 

The number of trees in the forest.


criterion - string, optional (default=”gini”)

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.


Max_depth - integer or None, optional (default=None)

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.


Min_samples_split - int, float, optional (default=2)

The minimum number of samples required to split an internal node:

  • If int, then consider min_samples_split as the minimum number.

  • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

Min_samples_leaf - int, float, optional (default=1)

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

  • If int, then consider min_samples_leaf as the minimum number.

  • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

Min_weight_fraction_leaf - float, optional (default=0.)

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.


Max_features - int, float, string or None, optional (default=”auto”)

The number of features to consider when looking for the best split:

  • If int, then consider “max_features” features at each split.

  • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

  • If “auto”, then max_features=sqrt(n_features).

  • If “sqrt”, then max_features=sqrt(n_features).

  • If “log2”, then max_features=log2(n_features).

  • If None, then max_features=n_features.


Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.


Max_leaf_nodes - int or None, optional (default=None)

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then an unlimited number of leaf nodes.


Min_impurity_decrease - float, optional (default=0.)

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.


Min_impurity_split - float, (default=None)

The threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise, it is a leaf.


Bootstrap - boolean, optional (default=True)

Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.


oob_score - bool, optional (default=False)

Whether to use out-of-bag samples to estimate the generalization accuracy.


n_jobs - int or None, optional (default=1)

The number of jobs to run in parallel. fit, predict, decision_path and apply are all parallelized over the trees. None means 1; -1 means using all processors.


random_state - int, RandomState instance or None, optional (default=None)

Controls 3 sources of randomness:

  • the bootstrapping of the samples used when building trees (if bootstrap=True)

  • the sampling of the features to consider when looking for the best split at each node (if max_features < n_features)

  • the draw of the splits for each of the max_features


verbose - int, optional (default=0)

Controls the verbosity when fitting and predicting.


warm_start - bool, optional (default=False)

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. 


class_weight - dict, list of dicts, “balanced”, “balanced_subsample” or None, optional (default=None)

Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.


Note that for multi-output (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].


The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))


The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.


For multi-output, the weights of each column of y will be multiplied.


Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.



Linear Discriminant Analysis Classifier

A classifier with a linear decision boundary generated by fitting class conditional densities to the data and using Bayes’ rule.


The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix.

The fitted model can also be used to reduce the dimensionality of the input by projecting it to the most discriminative directions.
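
A minimal scikit-learn sketch of both uses (prediction and dimensionality reduction), with made-up data:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
y = [0, 0, 0, 1, 1, 1]

# Fit a Gaussian density per class with a shared covariance matrix.
lda = LinearDiscriminantAnalysis(solver="svd", n_components=1)
lda.fit(X, y)
print(lda.predict([[-0.8, -1.0]]))

# Project onto the single most discriminative direction (transform needs 'svd' or 'eigen').
print(lda.transform(X).shape)   # (6, 1)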


Parameters

Solver - string, optional (default = lsqr)

Solver to use, possible values:

  • ‘svd’: Singular value decomposition. Does not compute the covariance matrix, therefore this solver is recommended for data with a large number of features.

  • ‘lsqr’: Least squares solution, can be combined with shrinkage.

  • ‘eigen’: Eigenvalue decomposition, can be combined with shrinkage.


shrinkage - string or float, optional (default = auto)

Shrinkage parameter, possible values:

  • None: no shrinkage.

  • ‘auto’: automatic shrinkage using the Ledoit-Wolf lemma.


Note that shrinkage works only with ‘lsqr’ and ‘eigen’ solvers.


n_components - int, optional (default=10)

Number of components (<= min(n_classes - 1, n_features)) for dimensionality reduction. If None, will be set to min(n_classes - 1, n_features).


Logistic Regression Classifier

Applies the logistic (sigmoid) function to a linear combination of the input features to produce discrete binary outputs.
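
For illustration, a minimal scikit-learn sketch with made-up one-dimensional data, showing both the discrete predictions and the underlying class probabilities:

from sklearn.linear_model import LogisticRegression

X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression(penalty="l2", solver="lbfgs")
clf.fit(X, y)
print(clf.predict([[2.0]]))        # discrete class label
print(clf.predict_proba([[2.0]]))  # probability of each class from the logistic function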


Parameters

penalty - {‘l1’, ‘l2’}, default=’l2’

Used to specify the norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers support only l2 penalties.


solver - {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’

Algorithm to use in the optimization problem.

  • For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.

  • For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.

  • ‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty

  • ‘liblinear’ and ‘saga’ also handle L1 penalty


Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approximately the same scale.



Decision Tree Classifier/Regressor


Image Courtesy: https://commons.wikimedia.org/wiki/File:Decision_tree_for_3-clique_no_arrowheads.svg


Creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. The classifier is capable of performing multi-class classification on a dataset. This non-parametric supervised learning algorithm can be used for both classification and regression.
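
A minimal scikit-learn sketch of the classifier and regressor, assuming the parameters below map directly onto the estimators (toy data only):

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y_class = [0, 0, 1, 1]
y_value = [0.0, 1.0, 2.0, 3.0]

# Learn simple if/else decision rules from the features.
clf = DecisionTreeClassifier(criterion="gini", splitter="best", max_depth=3, random_state=0)
clf.fit(X, y_class)
print(clf.predict([[1.6, 1.6]]))

reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X, y_value)
print(reg.predict([[1.6, 1.6]]))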


Parameters

splitter - {“best”, “random”}, default=”best”

The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.


criterion - {“gini”, “entropy”}, default=”gini”

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.


max_depth - int, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.


min_samples_split - int or float, default=2

The minimum number of samples required to split an internal node:

  • If int, then consider min_samples_split as the minimum number.

  • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.


min_samples_leaf - int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

  • If int, then consider min_samples_leaf as the minimum number.

  • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.


min_weight_fraction_leaf - float, default=0.0

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.


max_features - int, float or {“auto”, “sqrt”, “log2”, None}, default = auto

The number of features to consider when looking for the best split:

  • If int, then consider “max_features” features at each split.

  • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

  • If “auto”, then max_features=sqrt(n_features).

  • If “sqrt”, then max_features=sqrt(n_features).

  • If “log2”, then max_features=log2(n_features).

  • If None, then max_features=n_features.


The search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.


max_leaf_nodes - int, default=None

Grow a tree with max_leaf_nodes in the best-first fashion. Best nodes are defined as relative reduction in impurity. If None then an unlimited number of leaf nodes.


Min_impurity_decrease - float, default=0.0

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.


Random_state - int or RandomState, default = None

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.



Support Vector Machine Classifier/Regressor

(Image courtesy https://en.wikipedia.org/wiki/File:Support_vector_machine.jpg)


SVM marks each observation in a training set as belonging to one or the other of two categories, and then builds a model that assigns new observations to one category or the other. An SVM model is a representation of the observations as points in space, mapped so that the observations of the separate categories are divided by a clear gap that is as wide as possible. New observations are then mapped into that same space and predicted to belong to a category based on the side of the gap on which they fall.
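
For reference, a minimal scikit-learn sketch of the kernelized classifier (SVC) and regressor (SVR) with made-up data; the parameters below correspond to keyword arguments of these estimators:

from sklearn.svm import SVC, SVR

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y_class = [0, 0, 1, 1]
y_value = [0.1, 0.9, 2.1, 3.2]

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # maximum-margin classifier with an RBF kernel
clf.fit(X, y_class)
print(clf.predict([[2.5, 2.5]]))

reg = SVR(kernel="rbf", C=1.0, epsilon=0.1)     # epsilon-insensitive regression
reg.fit(X, y_value)
print(reg.predict([[2.5, 2.5]]))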


Parameters

kernel - string, optional (default=’rbf’)

Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, or ‘precomputed’. If none is given, ‘rbf’ will be used.


degree - int, optional (default=3)

Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.


gamma - {‘scale’, ‘auto’} or float, optional (default = ‘auto’)

Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.

  • if gamma=‘scale’ is passed, then it uses 1 / (n_features * X.var()) as the value of gamma,

  • if ‘auto’, uses 1 / n_features.


coef0 - float, optional (default=0.0)

Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.


tol - float, optional (default=1e-3)

Tolerance for stopping criterion.


C - float, optional (default = 1.0)

Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.


epsilon - float, optional (default = 0.1)

Epsilon in the epsilon-insensitive loss function (used for regression); it specifies the tube within which no penalty is associated in the training loss, and should be non-negative.


shrinking - boolean, optional (default = False)

Whether to use the shrinking heuristic.


cache_size - float, optional

Specify the size of the kernel cache (in MB).


max_iter - int, optional (default = -1)

Hard limit on iterations within solver, or -1 for no limit.


LinearSVC/SVR

Image courtesy https://commons.wikimedia.org/wiki/File:Svm_separating_hyperplanes_(SVG).svg


The linear support vector classifier/regressor fits the data provided and returns a “best fit” hyperplane that divides or categorizes the data. LinearSVC and LinearSVR support only a linear kernel, are faster, and scale much better to large numbers of samples. They support both dense and sparse input.
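
A minimal scikit-learn sketch, assuming the parameters below map onto LinearSVC and LinearSVR (toy data only):

from sklearn.svm import LinearSVC, LinearSVR

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y_class = [0, 0, 1, 1]
y_value = [0.1, 0.9, 2.1, 3.2]

# dual=False is preferred when n_samples > n_features.
clf = LinearSVC(penalty="l2", loss="squared_hinge", C=1.0, dual=False, max_iter=1000)
clf.fit(X, y_class)
print(clf.predict([[2.5, 2.5]]))

reg = LinearSVR(epsilon=0.0, C=1.0, max_iter=1000)
reg.fit(X, y_value)
print(reg.predict([[2.5, 2.5]]))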


Parameters

Penalty - str, ‘l1’ or ‘l2’ (default=’l2’) - [available for classification]

Specifies the norm used in the penalization. The ‘l2’ penalty is the standard used in SVC. The ‘l1’ leads to “coef_” vectors that are sparse.


epsilon - float, default=0.0 [available for regression]

Epsilon parameter in the epsilon-insensitive loss function. Note that the value of this parameter depends on the scale of the target variable y. If unsure, set epsilon=0.


loss - str, ‘hinge’ or ‘squared_hinge’ (default = ’squared_hinge’)

Specifies the loss function. ‘hinge’ is the standard SVM loss (used e.g. by the SVC class) while ‘squared_hinge’ is the square of the hinge loss.


dual - bool, (default = False)

Select the algorithm to either solve the dual or primal optimization problem. Prefer dual=False when n_samples > n_features.


tol - float, optional (default = 1e-4)

Tolerance for stopping criteria.


C - float, optional (default = 1.0)

Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive.


multi_class - str, ‘ovr’ or ‘crammer_singer’ (default=’ovr’)

Determines the multi-class strategy if y contains more than two classes. "ovr" trains n_classes one-vs-rest classifiers, while "crammer_singer" optimizes a joint objective over all classes. While crammer_singer is interesting from a theoretical perspective as it is consistent, it is seldom used in practice as it rarely leads to better accuracy and is more expensive to compute. If "crammer_singer" is chosen, the options loss, penalty and dual will be ignored.


fit_intercept - bool, optional (default = False)

Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be already centered).


intercept_scaling - float, optional (default =1)

When self.fit_intercept is True, the instance vector x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with a constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic feature weight. Note: the synthetic feature weight is subject to l1/l2 regularization like all other features. To lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept), intercept_scaling has to be increased.


random_state - int, RandomState instance or None, optional (default = None)

The seed of the pseudo-random number generator to use when shuffling the data for the dual coordinate descent (if dual=True). When dual=False the underlying implementation of LinearSVC is not random and random_state has no effect on the results. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used. 


max_iter - int, (default = 1000)

The maximum number of iterations to be run.


Random Forest Classifier/Regressor

Image courtesy https://commons.wikimedia.org/wiki/File:1_i0o8mjFfCn-uD79-F1Cqkw.png


Fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
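
For illustration, a minimal scikit-learn sketch with toy data; the parameters listed below are the keyword arguments of the estimator:

from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

# 100 bootstrapped trees; predictions are averaged across the forest.
clf = RandomForestClassifier(n_estimators=100, max_depth=None, class_weight="balanced", random_state=0)
clf.fit(X, y)
print(clf.predict([[2.5, 2.5]]))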


Parameters


n_estimators - integer, optional (default=100)

The number of trees in the forest.


criterion - string, optional (default = ”gini”)

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific.


max_depth - integer or None, optional (default = None)

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.


min_samples_split - int, float, optional (default = 2)

The minimum number of samples required to split an internal node:

  • If int, then consider min_samples_split as the minimum number.

  • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.


min_samples_leaf - int, float, optional (default = 1)

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

  • If int, then considers min_samples_leaf as the minimum number.

  • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.


min_weight_fraction_leaf - float, optional (default = 0.0)

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.


max_features - int, float, string or None, optional (default = ”auto”)

The number of features to consider when looking for the best split:

  • If int, then considers “max_features” features at each split.

  • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

  • If “auto”, then max_features=sqrt(n_features).

  • If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).

  • If “log2”, then max_features=log2(n_features).

  • If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.


max_leaf_nodes - int or None, optional (default = None)

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then an unlimited number of leaf nodes.


min_impurity_decrease - float, optional (default = 0.0)

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.


min_impurity_split - float, (default=1e-7)

The threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise, it is a leaf.


bootstrap - boolean, optional (default = True)

Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.


oob_score - bool (default=False)

Whether to use out-of-bag samples to estimate the generalization accuracy.


n_jobs - int or None, optional (default = 1)

The number of jobs to run in parallel. None means 1. -1 means using all processors. 


random_state - int, RandomState instance or None, optional (default = None)

Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). 


verbose - int, optional (default = 0)

Controls the verbosity when fitting and predicting.


warm_start - bool, optional (default = False)

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.


class_weight - dict, list of dicts, “balanced”, “balanced_subsample” or None, optional (default = None)

weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.


Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].


The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))


The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.


For multi-output, the weights of each column of y will be multiplied.


Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.



MultiLayer Perceptron Neural Network (MLPNN) Classifier/Regressor


Image courtesy https://commons.wikimedia.org/wiki/File:Multilayer_Neural_Network.png


An MLP is a deep artificial neural network composed of more than one perceptron. It consists of an input layer to receive the signal, an output layer that makes a decision or prediction about the input, and, in between those two, an arbitrary number of hidden layers that are the true computational engine of the MLP. An MLP with a single hidden layer is capable of approximating any continuous function.
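
A minimal scikit-learn sketch of the classifier, with synthetic data generated just for the example:

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Two hidden layers (50 and 25 neurons), ReLU activations, Adam optimizer,
# early stopping on an internal 10% validation split.
clf = MLPClassifier(hidden_layer_sizes=(50, 25), activation="relu", solver="adam",
                    alpha=1e-4, learning_rate_init=0.001, max_iter=300,
                    early_stopping=True, validation_fraction=0.1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))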


Parameters

hidden_layer_sizes - tuple, length = n_layers - 2, default = (100,)

The ith element represents the number of neurons in the ith hidden layer.


activation - {‘identity’, ‘logistic’, ‘tanh’, ‘relu’}, default = ’relu’

Activation function for the hidden layer.

  • ‘identity’, no-op activation, useful to implement linear bottleneck, returns f(x) = x

  • ‘logistic’, the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)).

  • ‘tanh’, the hyperbolic tan function, returns f(x) = tanh(x).

  • ‘relu’, the rectified linear unit function, returns f(x) = max(0, x)


solver - {‘lbfgs’, ‘sgd’, ‘adam’}, default = ’adam’

The solver for weight optimization.

  • ‘lbfgs’ is an optimizer in the family of quasi-Newton methods.

  • ‘sgd’ refers to stochastic gradient descent.

  • ‘adam’ refers to a stochastic gradient-based optimizer proposed by Kingma, Diederik, and Jimmy Ba

Note: The default solver ‘adam’ works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, ‘lbfgs’ can converge faster and perform better.


alpha - float, default=0.0001

L2 penalty (regularization term) parameter.


batch_size - int, default = ’auto’

Size of minibatches for stochastic optimizers. If the solver is ‘lbfgs’, the classifier will not use minibatch. When set to “auto”, batch_size=min(200, n_samples)


learning_rate - {‘constant’, ‘invscaling’, ‘adaptive’}, default = ’constant’

Learning rate schedule for weight updates.

  • ‘constant’ is a constant learning rate given by ‘learning_rate_init’.

  • ‘invscaling’ gradually decreases the learning rate at each time step ‘t’ using an inverse scaling exponent of ‘power_t’. effective_learning_rate = learning_rate_init / pow(t, power_t)

  • ‘adaptive’ keeps the learning rate constant to ‘learning_rate_init’ as long as training loss keeps decreasing. Each time two consecutive epochs fail to decrease training loss by at least tol, or fail to increase validation score by at least tol if ‘early_stopping’ is on, the current learning rate is divided by 5.

Only used when solver='sgd'.


learning_rate_init - double, default = 0.001

The initial learning rate used. It controls the step-size in updating the weights. Only used when solver=’sgd’ or ‘adam’.


power_t - double, default = 0.5

The exponent for inverse scaling learning rate. It is used in updating effective learning rate when the learning_rate is set to ‘invscaling’. Only used when solver=’sgd’.


max_iter - int, default = 200

Maximum number of iterations. The solver iterates until convergence (determined by ‘tol’) or this number of iterations. For stochastic solvers (‘sgd’, ‘adam’), note that this determines the number of epochs (how many times each data point will be used), not the number of gradient steps.


shuffle - bool, default = True

Whether to shuffle samples in each iteration. Only used when solver=’sgd’ or ‘adam’.


random_state - int, RandomState instance or None, default = None

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used.


tol - float, default = 1e-4

Tolerance for the optimization. When the loss or score is not improving by at least tol for n_iter_no_change consecutive iterations, unless learning_rate is set to ‘adaptive’, convergence is considered to be reached and training stops.


warm_start - bool, default = False

When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.


momentum - float, default = 0.9

Momentum for gradient descent update. Should be between 0 and 1. Only used when solver=’sgd’.


nesterovs_momentum - boolean, default = True

Whether to use Nesterov’s momentum. Only used when solver=’sgd’ and momentum > 0.


early_stopping - bool, default = False

Whether to use early stopping to terminate training when validation score is not improving. If set to true, it will automatically set aside 10% of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs. The split is stratified, except in a multilabel setting. Only effective when solver=’sgd’ or ‘adam’


validation_fraction - float, default = 0.1

The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True


beta_1 - float, default = 0.9

The exponential decay rate for estimates of the first-moment vector in adam, should be in [0, 1). Only used when solver=’adam’


beta_2 - float, default = 0.999

The exponential decay rate for estimates of second-moment vector in adam, should be in [0, 1). Only used when solver=’adam’.


epsilon - float, default = 1e-8

Value for numerical stability in adam. Only used when solver=’adam’


n_iter_no_change - int, default = 10

The maximum number of epochs to not meet tol improvement. Only effective when solver=’sgd’ or ‘adam’



SGDClassifier/Regressor


Image courtesy Joe pharos at the English language Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=42498187


Stochastic gradient descent (SGD) implements regularized linear models with stochastic gradient descent learning: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (known as the learning rate). SGD allows minibatch (online/out-of-core) learning. For best results using the default learning rate schedule, the data should have zero mean and unit variance.
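
A minimal scikit-learn sketch with synthetic data; the features are standardized first, since the default learning rate schedule assumes roughly zero mean and unit variance:

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# hinge loss + l2 penalty gives a linear SVM fitted one sample at a time.
model = make_pipeline(StandardScaler(),
                      SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4,
                                    max_iter=1000, tol=1e-3, random_state=0))
model.fit(X, y)
print(model.score(X, y))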


Parameters

loss - str, default = ’hinge’

The loss function to be used. Defaults to ‘hinge’, which gives a linear SVM.


The possible options are ‘hinge’, ‘log’, ‘modified_huber’, ‘squared_hinge’, ‘perceptron’, or a regression loss: ‘squared_loss’, ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’.


The ‘log’ loss gives logistic regression, a probabilistic classifier. ‘modified_huber’ is another smooth loss that brings tolerance to outliers as well as probability estimates. ‘squared_hinge’ is like hinge but is quadratically penalized. ‘perceptron’ is the linear loss used by the perceptron algorithm. The other losses are designed for regression but can be useful in classification as well; see SGDRegressor for a description.


penalty - {‘l2’, ‘l1’, ‘elasticnet’}, default=’l2’

The penalty (aka regularization term) to be used. Defaults to ‘l2’ which is the standard regularizer for linear SVM models. ‘l1’ and ‘elasticnet’ might bring sparsity to the model (feature selection) not achievable with ‘l2’.


alpha - float, default = 0.0001

Constant that multiplies the regularization term. Defaults to 0.0001. Also used to compute learning_rate when set to ‘optimal’.


l1_ratio - float, default = 0.15

The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Defaults to 0.15.


fit_intercept - bool, default = True

Whether the intercept should be estimated or not. If False, the data is assumed to be already centered. Defaults to True.


max_iter - int, default = 1000

The maximum number of passes over the training data (aka epochs). 


tol - float, default = 1e-3

The stopping criterion. If it is not None, the iterations will stop when (loss > best_loss - tol) for n_iter_no_change consecutive epochs.


shuffle - bool, default = True

Whether or not the training data should be shuffled after each epoch.


verbose - int, default = 0

The verbosity level.


epsilon - float, default = 0.1

Epsilon in the epsilon-insensitive loss functions; only if loss is ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’. For ‘huber’, determines the threshold at which it becomes less important to get the prediction exactly right. For epsilon-insensitive, any differences between the current prediction and the correct label are ignored if they are less than this threshold.


n_jobs - int, default = None

The number of CPUs to use to do the OVA (One Versus All, for multi-class problems) computation. None means 1. -1 means using all processors.


random_state - int, RandomState instance, default = None

The seed of the pseudo-random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used.


learning_rate - str, default = ’optimal’

The learning rate schedule:


‘constant’:

eta = eta0


‘optimal’: [default]

eta = 1.0 / (alpha * (t + t0)) where t0 is chosen by a heuristic proposed by Leon Bottou.


‘invscaling’:

eta = eta0 / pow(t, power_t)


‘adaptive’:

eta = eta0, as long as the training loss keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to increase validation score by tol if early_stopping is True, the current learning rate is divided by 5.


eta0 - double, default = 0.0

The initial learning rate for the ‘constant’, ‘invscaling’ or ‘adaptive’ schedules. The default value is 0.0 as eta0 is not used by the default schedule ‘optimal’.


power_t - double, default = 0.5

The exponent for inverse scaling learning rate [default 0.5].


early_stopping - bool, default = False

Whether to use early stopping to terminate training when the validation score is not improving. If set to True, it will automatically set aside a stratified fraction of training data as validation and terminate training when the validation score is not improving by at least tol for n_iter_no_change consecutive epochs.


validation_fraction - float, default = 0.1

The proportion of training data to set aside as a validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True.


n_iter_no_change - int, default = 5

The number of iterations with no improvement to wait before early stopping.


class_weight - dict, {class_label: weight} or “balanced”, default=None

Preset for the class_weight fit parameter.


Weights associated with classes. If not given, all classes are supposed to have weight one.


The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).


warm_start - bool, default = False

When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. 


average - bool or int, default = False

When set to True, computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average. So average=10 will begin averaging after seeing 10 samples.


Dummy Regressor

A regressor that makes predictions using simple rules. It is useful as a simple baseline to compare with other (real) regressors.
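
For illustration, a minimal scikit-learn sketch showing that the baseline ignores the input features entirely (toy data):

from sklearn.dummy import DummyRegressor

X = [[1], [2], [3], [4]]
y = [2.0, 3.0, 5.0, 10.0]

baseline = DummyRegressor(strategy="mean")
baseline.fit(X, y)
print(baseline.predict([[100]]))   # always the training mean (5.0), whatever the input

q90 = DummyRegressor(strategy="quantile", quantile=0.9)
q90.fit(X, y)
print(q90.predict([[100]]))        # always the 0.9 quantile of the training targets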


Parameters

strategy - str, default = mean 

Strategy to use to generate predictions.

  • “mean”: always predicts the mean of the training set

  • “median”: always predicts the median of the training set

  • “quantile”: always predicts a specified quantile of the training set, provided with the quantile parameter.

  • “constant”: always predicts a constant value that is provided by the user.


Constant - int or float or array-like of shape (n_outputs,), default = 10

The explicit constant as predicted by the “constant” strategy. This parameter is useful only for the “constant” strategy.


quantile - float in [0.0, 1.0], default = 

The quantile to predict using the “quantile” strategy. A quantile of 0.5 corresponds to the median, while 0.0 to the minimum and 1.0 to the maximum.


Lasso Regression


Image courtesy https://commons.wikimedia.org/wiki/File:PQSQ2.png


Linear Model trained with L1 prior as a regularizer (aka the Lasso). Technically the Lasso model is optimizing the same objective function as the Elastic Net with l1_ratio=1.0 (no L2 penalty).
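
A minimal scikit-learn sketch with synthetic data, illustrating how the L1 penalty drives the coefficients of uninformative features to exactly zero:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=1.0, random_state=0)

lasso = Lasso(alpha=0.5, max_iter=1000, selection="cyclic")
lasso.fit(X, y)
print(lasso.coef_)   # most coefficients are exactly 0; only the informative features survive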


Parameters


alpha - float, default=1.0

Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is equivalent to an ordinary least squares fit, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised; in that case, use the LinearRegression object instead.


fit_intercept - bool, default=True

Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centred).


normalize - bool, default=False

This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. 


precompute - str, ‘auto’, bool or array-like of shape (n_features, n_features), default=False

Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto', the choice is made automatically. The Gram matrix can also be passed as an argument. For sparse input, this option is always True to preserve sparsity.


copy_X - bool, default=True

If True, X will be copied; else, it may be overwritten.


max_iter - int, default=1000

The maximum number of iterations


tol - float, default =1e-4

The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.


warm_start - bool, default=False

When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.


positive - bool, default=False

When set to True, forces the coefficients to be positive.


random_state - int, RandomState instance, default=None

The seed of the pseudo-random number generator that selects a random feature to update. Used when selection == ‘random’. Pass an int for reproducible output across multiple function calls. 


selection - str, {‘cyclic’, ‘random’}, default=’cyclic’

If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.



Deep Learning Classifiers, Regressors and Parameters


Convolutional Neural Networks (CNN)



Image courtesy https://commons.wikimedia.org/wiki/File:Convolutional_Layers_of_a_Convolutional_Neural_Network.svg

A CNN is a neural network that has one or more convolutional layers; it uses convolution in place of general matrix multiplication in at least one of its layers. A CNN is essentially a sliding filter over the input. Simply put, when a CNN is used for image classification, rather than looking at an entire image at once to find certain features, it looks at smaller portions of the image.

Although typically used for image processing and classification, CNNs are also useful in NLP and for other autocorrelated data.
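
As a rough illustration (assuming a tf.keras backend, with random data standing in for real images), a small CNN for 10-class image classification might look like this:

import numpy as np
import tensorflow as tf

# Toy data: 32 grayscale "images" of 28x28 pixels, 10 classes.
X = np.random.rand(32, 28, 28, 1)
y = np.random.randint(0, 10, size=32)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, kernel_size=3, activation="relu", input_shape=(28, 28, 1)),  # 3x3 sliding filters
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=8, verbose=0)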

 

Parameters

See Deep Learning parameters

 

TF_MultilayerPerceptronNNClassifier - Multilayer perceptron neural network using TensorFlow


TensorFlow, in simple terms, is a deep learning library used for building neural networks. It works by flowing tensors through a computational graph. Tensors are basically multidimensional arrays. A computational graph is a network of nodes, each node performing an operation such as addition, multiplication or evaluating some multivariate equation.


Deep Learning Parameters


loss - str, default = categorical_crossentropy


mean_squared_error: Computes the mean of squares of errors between labels and predictions.

categorical_crossentropy: Computes the cross-entropy loss between the labels and predictions.

mean_absolute_error: Computes the mean of the absolute difference between labels and predictions.

mean_absolute_percentage_error: Computes the mean absolute percentage error between y_true and y_pred.

mean_squared_logarithmic_error: Computes the mean squared logarithmic error between y_true and y_pred.

squared_hinge: Computes the squared hinge loss between y_true and y_pred.

hinge: Computes the hinge loss between y_true and y_pred.

categorical_hinge: Computes the categorical hinge loss between y_true and y_pred.

logcosh: Computes the logarithm of the hyperbolic cosine of the prediction error.

sparse_categorical_crossentropy: Computes the cross-entropy loss between the labels and predictions.

binary_crossentropy: Computes the cross-entropy loss between true labels and predicted labels.

kullback_leibler_divergence: Computes Kullback-Leibler divergence loss between y_true and y_pred.

poisson: Computes the Poisson loss between y_true and y_pred.

cosine_similarity or cosine_proximity: Computes the cosine similarity between y_true and y_pred.


optimizer - str, default = sgd


sgd: Stochastic gradient descent and momentum optimizer.

rmsprop: Optimizer that implements the RMSprop algorithm.

Adagrad: Optimizer that implements the Adagrad algorithm.

Adadelta: Optimizer that implements the Adadelta algorithm.

Adam: Optimizer that implements the Adam algorithm.

Adamax: Optimizer that implements the Adamax algorithm.

Nadam: Optimizer that implements the Nadam algorithm.

 

activation - function name(# of neurons), default = relu(#)


elu(#): Exponential linear unit.

softmax(#): Softmax converts a real vector to a vector of categorical probabilities.

selu(#): Scaled Exponential Linear Unit (SELU).

softplus(#): Softplus activation function.

softsign(#): Softsign activation function.

relu(#): Applies the rectified linear unit activation function.

tanh(#): Hyperbolic tangent activation function.

sigmoid(#): Sigmoid activation function.

hard_sigmoid(#): Hard sigmoid activation function.

exponential(#): Exponential activation function.

linear(#): Linear activation function. 


metrics - default = accuracy


accuracy: Calculates how often predictions equal labels.

mae: Computes the mean absolute error between labels and predictions.


epochs - int, default = 100

Number of epochs (how many times each data point will be used), i.e. the maximum number of passes over the training data.

batch_size - int, default = 5

Size of minibatches for stochastic optimizers.
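
These parameter names mirror the Keras API, so, purely as an illustration (random data, assuming a tf.keras backend), they fit together as follows:

import numpy as np
import tensorflow as tf

# Toy data: 100 samples, 20 features, 3 one-hot-encoded classes (for categorical_crossentropy).
X = np.random.rand(100, 20)
y = tf.keras.utils.to_categorical(np.random.randint(0, 3, size=100), num_classes=3)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),   # activation: relu(64)
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
model.fit(X, y, epochs=100, batch_size=5, verbose=0)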



PyTorch Conv Neural Network - CNN in PyTorch (see Convolutional Neural Network)


Parameters - see deep learning parameters


CNN_LSTM_NNClassifier - Convolutional Neural Network and Long Short-Term Memory neural network


CNN is a neural network that has one or more convolutional layers. It uses convolution in place of general matrix multiplication in at least one of its layers. CNN is essentially a sliding filter over the input. Simply put, when a CNN is used for image classification, rather than looking at an entire image at once to find certain features, it looks at smaller portions of the image. Although typically used for image processing and classification, CNNs are also useful in NLP and for other autocorrelated data.


Long Short-Term Memory networks (LSTMs) are a special kind of Recurrent Neural Network (RNN), capable of learning long-term dependencies.


LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behaviour, not something they struggle to learn! The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through”.


CNN_LSTM NN is a combination or a hybrid of both CNN and LSTM NN for deep learning. 

Parameters - see deep learning parameters


LSTM_NeuralNetwork


Long Short-Term Memory networks (LSTMs) are a special kind of Recurrent Neural Network (RNN), capable of learning long-term dependencies.


LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn! The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through”.
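
As a rough sketch (assuming a tf.keras backend, with random sequences standing in for real data), an LSTM classifier over sequences of 10 time steps with 8 features each could look like this:

import numpy as np
import tensorflow as tf

X = np.random.rand(64, 10, 8)               # 64 sequences, 10 time steps, 8 features
y = np.random.randint(0, 2, size=64)        # one binary label per sequence

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(10, 8)),      # gated memory cells over the sequence
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=8, verbose=0)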


Parameters - see deep learning parameters


ConvLSTM_NNClassifier


 A convolutional Long Short-Term Memory neural network classifier. 


Parameters - see deep learning parameters 





NLP Algorithms


FastTextClassifier


A library for efficient learning of word representations and sentence classification.


Parameters


lr - float, default = 0.1

Learning rate [0.1]


dim - int, default = 100

size of word vectors [100]


ws - int, default = 5

Size of the context window [5]


epoch - int, default = 5

Number of Epochs[5]


minCount - int, default = 1

Minimal number of word occurrences [1]


minCountLabel - int, default = 1

Minimal number of label occurrences [1]


minn - int, default = 0

Minimum length of char ngram [0]


maxn - int, default = 0

Maximum length of char ngram [0]


neg - int, default = 5

Number of negatives sampled [5]


wordNGrams - int, default = 1

Max length of word ngrams [1]


loss - str, default = ns

loss function {ns, hs, softmax} 


bucket - int, default = 2000000

Number of buckets [2000000]


thread - int, default = 1

Number of threads[1]


lrUpdateRate - int, default = 100

change the rate of updates for the learning rate [100]


t - float, default = 0.0001

Sampling threshold 


label - str, default = __label__

labels prefix [__label__]


verbose - int, default = 2

Verbosity level [2]


pretrainedVectors - str, default = [file name]

pretrained word vectors for supervised learning [ ]
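
For illustration, a minimal sketch using the fastText Python bindings; train.txt is a hypothetical training file with one labelled example per line:

import fasttext

# Each line of train.txt looks like: "__label__positive I really enjoyed this product"
model = fasttext.train_supervised(input="train.txt", lr=0.1, dim=100, ws=5,
                                  epoch=5, minCount=1, wordNgrams=2, loss="softmax")
print(model.predict("a sentence to classify"))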




Custom Algorithms

A user can add custom algorithms for regression or classification modelling. This could be a unique algorithm, a variation of the algorithms in mlOS, or any of the numerous open-source algorithms.


Custom algorithms are added through the Algorithm Manager. Click on the Menu icon on the top left corner of the taskbar and select Algorithm Manager.





To add an algorithm

  1. Click Classification menu to add a classification algorithm or Regression menu to add a regression algorithm.
  2. Click on the +New Algorithm menu to add a custom classifier or regressor.
  3. Provide a name for the algorithm you intend to add (e.g. myCustomAlgorithm) and click +Add Algorithm.



The newly added algorithm appears in the list of classifiers (if the algorithm added is a classification algorithm) or regressors (if the algorithm added is a regression algorithm). 

Check the box beside the newly added algorithm to select it (as shown in #3). Ensure that the box beside myCustomAlgorithm is checked.


Click Edit Algorithm 


This takes you to the advanced code editor with a directory named after the custom algorithm that was added, i.e. in our example the directory will be myCustomAlgorithm.


The directory will have two files:

  1. “algorithmName.json”, which is where the parameters for the algorithm will be added.
  2. “algorithmName.py”, which is the file with a framework where the algorithm code should be added.


Add code for the algorithm and click the save button. 



See advanced code editor for more information on how to work in the code editor.


Information about the new algorithm is also displayed, including the creator of the algorithm and the projects to which the algorithm has been specifically assigned.


If needed, share the custom algorithm to be used by a particular project. 

To do this


  1. Click to select that project from the list of projects
  2. Click Grant Access to give the project the privilege to access and use the algorithm. You can also delete the access by clicking the Delete Access button.


This shows a sample parameter list for the algorithm; the sample contains the parameters for the Random Forest algorithm. Click the Edit Configuration button to edit the “algorithmName.json” file. This file contains the code that shows how the parameters for the Random Forest algorithm were set. Edit it to suit the parameters of your custom algorithm and click the save button in the code editor.