This section explains how to prepare wrangled, machine-learning-ready data for modelling.


Training & Target Features

Navigate to Data Engine > Define Dataset > Training & Target Features. This is where input and target features are selected for preprocessing.


 

To select input and target features:

  1. From the list of data in "My Data", select a data source.

  2. A list of all the features or columns in the selected data is shown. If there are many columns, you can search for a feature or column by its column name.

  3. Check the box beside a feature or column to select it as an input feature for modelling.

  4. Click the radio button beside a feature to select it as the target or output feature.


The next step is to preprocess the selected features so they are ready to be fed to machine learning algorithms for modelling. 


Feature Pre-processing

This section explains how feature preprocessing algorithms can be applied to features in the data for modelling. 


Before you begin 


Feature preprocessing simply means transforming the features or columns into formats that can be easily understood by algorithms that learn from data. This way, the algorithms are able to learn from the data and use it to make decisions.

For instance, some algorithms cannot learn from text categories. If a feature called "gender" contains the values Male and Female, preprocessing means converting the text data type to numeric, e.g. Male becomes 0 and Female becomes 1.


The steps and algorithms for pre-processing data are shown and explained in this section. The algorithms for preprocessing Tabular, text (NLP), and time series data are different from those used for Vision data.


To get here: Data Engine -> Define Dataset -> Feature Pre-processing


Tabular, NLP & Time Series:

To preprocess features for Tabular, NLP & Time Series data, the platform makes recommendations, as seen in (#6).

  1. Select "Feature(s) for preprocessing" (#1).

  2. Select whether to apply preprocessing algorithms to specific feature(s) at a time or to all the features in the dataset at once (#2).

     "On single feature" means the selected preprocessing algorithm is applied only to the feature(s) selected in step 1.

     "On Dataset" means the preprocessing algorithm is applied to all the features in the entire dataset.

  3. A list of the features selected in step 1 for preprocessing is shown (#3).

Note: Whenever features are selected and preprocessed with a particular algorithm, clear the features, then select other features and apply a different preprocessing algorithm to them. Repeat until all the features in the data are preprocessed.

  4. Select a preprocessing algorithm (#4) to be applied to the feature(s) listed in step 3, and click "Add to Step" (#5).

  5. Provide a name and edit code for custom preprocessing algorithms, or fill in any information required by the selected algorithm (#5).

  6. A list of the features and the corresponding preprocessing algorithms that will be applied once defined and saved is shown (#6).


The data pre-processing algorithms for Tabular, NLP, and time-series data, and how they work, are described below (follow steps 1 to 6 above).


Operation

How it works

New Algorithms

New Feature Extractor - add custom feature preprocessing algorithm or function

  1. Select "New Feature Extractor".
  2. Provide a name for your custom feature extraction or preprocessing algorithm and click "Save Code Block".
  3. The custom feature extractor or preprocessor (e.g. myFeatureExtractor as shown in the image) is saved under "Custom feature preprocessors" in #4.
  4. Select it, and click "Edit Code".
  5. This opens the advanced code editor with the code and framework for you to add code for your custom feature extractor or preprocessor. Read the comments in the code and customize it; a minimal sketch is shown after this list.
  6. Once the custom code is added and saved, follow steps 1 to 6 under Feature Pre-processing to apply the custom algorithm to feature(s) in the data.
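As a rough illustration, a custom feature extractor usually follows a fit/transform pattern. The sketch below assumes a scikit-learn-style transformer interface; the platform's generated template may differ, and the class name and transform here are hypothetical.

    # Hypothetical custom feature extractor written as a scikit-learn-style
    # transformer; adapt the logic to the platform's generated template.
    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    class MyFeatureExtractor(BaseEstimator, TransformerMixin):
        """Example: compress skewed numeric features with a log transform."""

        def fit(self, X, y=None):
            # Remember each column's minimum so transform can shift values
            # to be non-negative before taking the log.
            self.min_ = np.asarray(X, dtype=float).min(axis=0)
            return self

        def transform(self, X):
            # log1p(x - min) maps the smallest value to 0 and compresses
            # large values, which many estimators handle better.
            return np.log1p(np.asarray(X, dtype=float) - self.min_)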

Default MLP Preprocessing

The platform examines the features selected for preprocessing and then applies the appropriate preprocessing algorithms to the selected features.

Zero-One Normalization

Scale the values in a feature or column between 0 and 1. Applied to features with a Num data type.

Scaling

Standardize data along any axis, center to the mean and component-wise scale to unit variance.

The range of values of raw data may vary widely and that may not be suitable for some machine learning algorithms. For example, many classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance

Standard Scaling

Standardize features by removing the mean and scaling to unit variance.

Scale down the values of a feature such that it has the properties of a standard normal distribution with a mean of zero and a standard deviation of 1.

Scale features if you intend to use algorithms that rely on Euclidean distances or gradient descent to find a global minimum quickly. This means that you have to scale features when using algorithms such as KNN, K-means clustering, linear and logistic regression, and all deep learning and artificial neural network algorithms such as CNNs.

You do not have to scale when using algorithms such as decision trees, random forests, XGBoost, and all the bagging and boosting algorithms, because the values in a feature are used to create branches based on conditions and rules.

Standardization of a dataset is a common requirement for many machine learning estimators. They might behave poorly if the individual features do not more or less look like standard normally distributed data.
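For illustration, this is roughly what standard scaling does, shown here with scikit-learn's StandardScaler (the platform applies an equivalent step internally; the data values are made up):

    # Standard scaling: subtract each feature's mean, divide by its std.
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[170.0, 60.0],
                  [180.0, 90.0],
                  [160.0, 75.0]])          # e.g. height (cm), weight (kg)
    X_scaled = StandardScaler().fit_transform(X)
    print(X_scaled.mean(axis=0))           # approximately [0, 0]
    print(X_scaled.std(axis=0))            # approximately [1, 1]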

Min-Max Normalization

Transform features by scaling each feature to a given range. It scales and translates each feature individually such that it falls within the given range on the training set, e.g. rescale the range of features to between 0 and 1, or between -1 and 1.

For example, suppose that we have the students' weight data, and the students' weights span [160 pounds, 200 pounds]. To rescale this data, we first subtract 160 from each student's weight and divide the result by 40 (the difference between the maximum and minimum weights).
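The students' weight example corresponds to the following, shown with scikit-learn's MinMaxScaler for illustration (the weights are example values):

    # Min-max scaling of the weight example: (x - 160) / (200 - 160).
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    weights = np.array([[160.0], [170.0], [200.0]])    # pounds
    scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(weights)
    print(scaled.ravel())                              # [0.0, 0.25, 1.0]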

Absolute Scaling

Scale each feature by its maximum absolute value. Each feature is scaled and translated such that the maximal absolute value of each feature in the training set is 1. This feature preprocessing technique does not shift or center the data, and therefore does not destroy any sparsity. 
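For illustration, absolute scaling corresponds to scikit-learn's MaxAbsScaler (example values):

    # Max-abs scaling: divide each feature by its maximum absolute value.
    import numpy as np
    from sklearn.preprocessing import MaxAbsScaler

    X = np.array([[-2.0, 10.0], [1.0, 40.0], [4.0, -20.0]])
    print(MaxAbsScaler().fit_transform(X))
    # Each column now lies in [-1, 1]; zeros stay zero, so sparsity is kept.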

L1 and L2 Normalization


Note: Normalization works only on rows, not on columns.

  • Normalize samples (rows or observations) so that the values in each row have a unit norm.
  • Each sample with at least one non-zero value is rescaled independently of the other samples so that its L1 or L2 norm equals 1.
  • L2, also known as the Euclidean norm, calculates the distance of the vector coordinates from the origin of the vector space; it is computed as the Euclidean distance from the origin and is always a positive value. A unit L2 norm means that if each element were squared and summed, the total would equal 1.
  • Alternatively, L1 normalization can be applied instead of L2 normalization. With either norm, the features are transformed to values between -1 and 1.
  • Scaling inputs to unit norms L1 or L2 is a common operation for text classification or clustering.
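For illustration, this corresponds to scikit-learn's Normalizer (example values):

    # Row-wise normalization: each sample is rescaled to unit norm.
    import numpy as np
    from sklearn.preprocessing import Normalizer

    X = np.array([[3.0, 4.0],        # L2 norm of this row is 5
                  [1.0, -1.0]])
    print(Normalizer(norm="l2").fit_transform(X))  # first row -> [0.6, 0.8]
    print(Normalizer(norm="l1").fit_transform(X))  # |values| in a row sum to 1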

Quantile Transformer

  • Transform features using quantiles information. 
  • Transform features to follow a uniform or normal distribution. For a given feature, a quantile transformer will spread out the most frequent values and reduce the impact of outliers. 
  • The transformation is applied to each feature independently. 
  • First, an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function. 
  • Feature values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. 
Note: This transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
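For illustration, this corresponds to scikit-learn's QuantileTransformer (the example data is random and heavily skewed):

    # Quantile transform: map a skewed feature to a uniform distribution.
    import numpy as np
    from sklearn.preprocessing import QuantileTransformer

    rng = np.random.default_rng(0)
    X = rng.lognormal(size=(1000, 1))             # right-skewed feature
    qt = QuantileTransformer(output_distribution="uniform", n_quantiles=100)
    X_t = qt.fit_transform(X)                     # values now spread over [0, 1]
    print(X_t.min(), X_t.max())                   # ~0.0 ~1.0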

Categorical Features

Categorical to numeric - transform non-numerical categories into numerical categories

Encode categorical data to numbers with values between 0 and the number of classes (n_class)-1. 

To put it simply, this transforms character or text categories in a feature into numerical categories, e.g. male, female, and transgender become 0, 1, and 2.
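For illustration, this corresponds to scikit-learn's LabelEncoder:

    # Encode text categories as integers from 0 to n_classes - 1.
    from sklearn.preprocessing import LabelEncoder

    enc = LabelEncoder()
    codes = enc.fit_transform(["male", "female", "transgender", "female"])
    print(codes)         # [1 0 2 0] -- classes are sorted alphabetically
    print(enc.classes_)  # ['female' 'male' 'transgender']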



Vision:

This section explains how vision data is preprocessed for modelling.

 


Feature preprocessing for vision data works as follows.

  1. Select the image feature.

  2. Select the feature pre-processing algorithm to be applied to the images.

  3. Set parameters, if any are needed, then click "Add to Step".

  4. A list of all the feature preprocessing algorithms, and their respective parameters, that will be applied to the feature selected in step 1 is shown once saved.


The data pre-processing algorithms for Vision data and how they work are described below.


Operation

How it works

New Algorithm

New Feature Extractor - add custom algorithm or function for preprocessing vision data

Refer to how to add New Feature Extractor.

Vision Feature Extractor

Categorical to numeric - transform non-numerical categories to numerical categories

Encode categorical data to numbers with values between 0 and the number of classes (n_class)-1. For instance, red, green, and blue become 0, 1, and 2.

Resize Image - resize image

Select "Resize Image" (Figure 21.2 #2)

Parameters (#3):
  • "Height" - height of resized image
  • "Width" - width of resized image

The raw image is not modified; rather, a new image with the new dimensions is returned.

Work on RGB image

By default, all images loaded into the platform are converted to grayscale, so this command converts the images back to RGB.
 

Note that working on RGB is at least three times more computationally expensive. 

Convert to Grayscale image

Convert all images to grayscale

Convert to BW image - Convert images to black and white

Select "Convert to BW image" 

Parameters:
  • "Threshold" - set the threshold for black and white image 

Use only R from RGB

Use only the red colour in an RGB image

Use only G from RGB

Use only the green colour in an RGB image

Use only B from RGB

Use only the blue colour in an RGB image
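For illustration, keeping a single channel of an RGB image stored as an H x W x 3 array looks like this (the array name is hypothetical):

    # Keep one colour channel of an RGB image.
    import numpy as np

    rgb = np.zeros((64, 64, 3), dtype=np.uint8)   # example image array
    red_only = rgb[:, :, 0]    # index 0 = R; use 1 for G, 2 for B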

Pytorch RandomResizedCrop - crop a given image to random size and aspect ratio.

Select "Pytorch RandomResizedCrop"

Parameter:
  • "RandomResizedCrop" - size of the resized image

Note that a crop of random size (relative to the original size) and random aspect ratio (relative to the original aspect ratio) is created. This crop is finally resized to the given size (the parameter).

Pytorch RandomHorizontalFlip

  • Horizontally flips the given image randomly with a default probability of 0.5.

Pytorch Normalize - Normalize a tensor image with a mean and standard deviation. 

Select "Pytorch Normalize"

Parameters:
  • "Mean[Channel1,…]" - sequence of mean for each channel
  • "STD[Channel1,…]" - sequence of standard deviation for each channel
     

Note: Given mean [M1,...,Mn] and std [S1,...,Sn] for n channels, each channel of the input is normalized as output[channel] = (input[channel] - mean[channel]) / std[channel].  
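For illustration, the PyTorch operations above correspond to a torchvision transform pipeline like the following (the mean/std values are the common ImageNet statistics, used here only as an example):

    # Sketch of an equivalent torchvision preprocessing pipeline.
    import torchvision.transforms as T

    transform = T.Compose([
        T.Resize((256, 256)),            # "Resize Image": Height x Width
        T.RandomResizedCrop(224),        # random size/aspect-ratio crop, resized to 224
        T.RandomHorizontalFlip(p=0.5),   # flip left-right with probability 0.5
        T.ToTensor(),                    # PIL image -> float tensor in [0, 1]
        T.Normalize(mean=[0.485, 0.456, 0.406],   # per-channel means (example values)
                    std=[0.229, 0.224, 0.225]),   # per-channel stds (example values)
    ])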

Principal Component Analysis (PCA) - project data to a lower-dimensional space or, put simply, reduce the dimensions of a feature set by maximizing the variance of the data points.


 

#Linear Dimensionality Reduction 

Select "Principal Component Analysis" (PCA)

Parameter:
  • "Number of Components" - number of components to keep


Note that PCA rotates and projects data along the direction of increasing variance. The features with maximum variance are the principal components. 
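For illustration, this corresponds to scikit-learn's PCA (the data here is random):

    # PCA: project data onto the directions of maximum variance.
    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(100, 64)           # e.g. 100 flattened 8x8 images
    pca = PCA(n_components=10)            # "Number of Components" parameter
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)                # (100, 10)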

Linear Discriminant Analysis (LDA) - reduce the dimensions of the feature set by projecting data in a way that class separability is maximised.


 

#Linear Dimensionality Reduction  

Select "Linear Discriminant Analysis" (LDA)

Parameter:
  • "Number of Components"- number of components for dimensionality reduction


Note that variables from the same class are put closely together by the projection while variables from different classes are placed far apart by the projection.
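Unlike PCA, LDA is supervised and needs the class labels. For illustration, it corresponds to scikit-learn's LinearDiscriminantAnalysis (random example data):

    # LDA: project data so that class separability is maximised.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X = np.random.rand(100, 64)
    y = np.random.randint(0, 3, size=100)    # 3 classes -> at most 2 components
    lda = LinearDiscriminantAnalysis(n_components=2)
    X_reduced = lda.fit_transform(X, y)
    print(X_reduced.shape)                   # (100, 2)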

Independent Component Analysis (ICA) - separate independent sources from mixed signals. Unlike PCA which focuses on maximizing the variance of the data points, ICA focuses on independence. 


#Linear Dimensionality Reduction 

Select "Independent Component Analysis" (ICA)

Parameter:
  • "Number of Components" - number of components to use, if none is specified all are used. 

Singular Value Decomposition (SVD) - reduce the dimensions of data using truncated SVD.


#Linear Dimensionality Reduction 

Select "Singular Value Decomposition" (SVD)

Parameter:
  • "Number of Components" - the desired dimensionality of output data. It must strictly be less than the number of features. 

Kernel Principal Component Analysis (KPCA) - use kernels to reduce the dimensions of data


#Non-Linear Dimensionality Reduction 

Select "Kernel Principal Component Analysis" (KPCA)


Parameter:
  • "Number of Components" - number of components. If none is specified all non-zero components are kept.


Review and Save Dataset


Once data has been preprocessed, the preprocessed data has to be saved and then split into training and validation sets. This split is what is used for modelling.  



  1. Click on "Review and Save Dataset"

  2. Dataset Name: provide a name for the preprocessed dataset

  3. Click "Define Dataset"


Cross-Validation Dataset


Now that data has been wrangled, preprocessed, and saved, we have to split the data into training and validation sets for modelling.


This section explains how to split data into training and validation sets.


Note: Cross validation is done automatically in the background when data is split.



To split data for modelling

  1. Click the "Cross-Validation Dataset" tab. You can toggle between this tab and the "Explore Dataset" tab. The Cross-Validation Dataset tab allows you to split data and the Explore Dataset tab allows you to explore and review everything that has been done to the data up until this point. 

  2. Select the data you have been working on

  3. All preprocessed data sets saved under the selected data in step 2 show up here.

    • Check the box beside the dataset you would like to split. By default, the data is set to split: 80% for training and 20% for validation

  4. Move the slider to select percentages for the training and validation sets.

  5. Click "Generate Dataset".

  6. Click the data icon in #3 to view the split datasets.
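For illustration, the default 80/20 split corresponds to scikit-learn's train_test_split; the platform performs the split for you, so this sketch (with made-up data) only shows what the split means:

    # 80/20 split into training and validation sets.
    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(100, 5)
    y = np.random.randint(0, 2, size=100)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    print(len(X_train), len(X_val))    # 80 20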


You are ready for modelling. Check Machine Learning Engine for details on how to apply ML algorithms to the data.


Explore Dataset


Once data has been wrangled, preprocessed, saved, and split into training and validation sets, the Explore Dataset tab allows you to review the dataset. The review includes:

  • Training and validation datasets

  • Training and target features

  • Feature preprocessing steps and algorithms applied to every feature

  • How missing values, if any, were treated

  • The configuration of the entire dataset and its features, viewable in .json format


If any changes are needed, click "Edit Dataset" to go back to Define Dataset and make them.  


Preprocessing Recipes

Feature preprocessing steps applied to any preprocessed and saved dataset can be saved as a recipe and re-used, that is, applied to the same or a different dataset at any time.  


To save a feature preprocessing recipe

  1. Click "Save Recipe" (Figure 22 #5)
  2. Provide a name for the "Feature preprocessing recipe" and click Save.
  3. The recipe is saved and ready to be re-used at any time.

To reuse a feature preprocessing recipe: 

  1. Click "Open Recipe" under feature preprocessing (refer to Figure 21.1). The path is Data Engine -> Define Data -> Feature Preprocessing -> Open Recipe.
  2. Select the appropriate recipe from the list of saved recipes to apply to the data.