The Data Wrangling Algorithms and Functions are shown in the region highlighted in the figure below.



These can be understood from the tables below:


Basic Operations


OperationSteps

New Algorithm

New Code Block - add or write custom code to wrangle the data.   

  1. Select “New Code Block”.
  2. Fill in BlockName (name of your custom operation), On Value, and By Value.
  3. Click on “save code block”.
  4. Find “CustomWrangling”.
  5. Click “CustomWrangling” and then click on “Edit Code”.
  6. The file opens in the code editor.
  7. A framework for coding in python with examples on how to add your own functions and algorithms is automatically generated. Add code and save when you finish.
  8. Your “CustomWrangling” operation is saved and ready to be applied to any data.

General

Replace By - replace a unique value in a feature by another value.

  1. Select “Replace By” 
  2. Parameters:
    • "Value": The existing value you want to replace.
    • "Using": The new value you want to use to replace the existing one.
  3. Under Value (Count), check the box to the left of the unique value to replace. 
    • Once checked, that value automatically fills the Value text box in step 2.
    • Select another value under Using to fill in the Using text box in step 2. Or You may type in a value. 

Format Date time - change the date-time format of a column 

  1. Select “Format Date Time
    • "Using": The new date/time format you enter replaces the existing
  2. Select a date-time format under Using to fill in the using text box in step 2. Or You may type in a format of choice.
  3. An auto-generated command for using %m-%d-%Y to replace the existing date-time format in the column.

TimeStamp to Date Time - change timestamp of date-time.

  1. Select “TimeStamp to Date Time” (no parameters required)

On Columns

Rename Column - rename an existing column

  1. Select “Rename Column
  2. Parameter 
    • Using” - Enter the new column name here.
  3. Auto-generated command: Rename column “ColumnName” to “New Column Name”. 

Copy Column - copy an entire column to another column with a new name

  1. Select “Copy Column
  2. Parameters: 
    • Using” - The name of the new Column Name
  3. Auto-generated command: Copy column “existing column name” as column “new column name” For example, copy column sepal_length as column sepal_length_new.

Change Data Type - change the data type of a column from text to num and vice versa. 

  1. Select “Change Data Type
  2. Parameters: 
    • Using” - Use this new data type to change a column's data type.  
  3. Auto-generated command: change column data type of column “column_name” to “new data type

Split Column - use any character to split the values in a column and save the resulting values in separate columns. 

  1. Select “Split Column”
  2. Parameters: 
    • Using” - Use this character to split the values in a column. This keeps the original column and its values and creates two new columns with the split values. 
    • For example, sepal_length has values 4.3, 4.4, 4.5 and so on.  
    • Split Using “.”
    • Generates two new columns - sepal_length0 with values 4,4,4 and sepal_length1 with values 3,4,5 
  3. Auto-generated command: Split column “column_name” using “character”.

Compute New Column - apply formulas to a combination of columns to generate a new column

  1. Select “Compute New Column”.
  2. Under variables, select at least one variable.
  3. Every selected variable is listed under “Variables for formula”.
  4. No parameters are required.
  5. Click on "Write Formula
  6. Fill the required information in the Write Formula page shown below and click "Done".  
  1. Type a name for the new column. 
  2. A list of all the columns selected to be used in the formula. Each column is represented by a variable X. The list starts from X1 to Xn. 
  3. Write a formula. You can click on a variable in b and it will show up in c. For instance in Figure 20.3 we have written a formula (X1+X2)*2 meaning (sepal_length + sepal_width)*2.


Note: Auto generated command for the operation is about to be performed. The command shown in Figure 20.3 means - compute a new column called “myformula” with the formula (sepal_length + sepal_width)*2


The auto-generated command for the new column that just computed is shown.


     7. Click "Add to Step"

     8. Click "Apply

On Rows

Delete Rows - delete rows that fit a certain criteria 

  1. Select “Delete Row
  2. Parameters: 
    • Value” - Deletes rows where the value is “Value”. Ie the criteria for deleting rows in the data. [Not clear]
  3. You can select a value under Value(Count) to fill in the value parameter.  
  4. Auto-generated command: Delete rows where value = “Value” on column “column_name”.

Randomly Resample X% Rows - random sampling

  1. Select “Randomly Resample X% Rows”
  2. Parameters: 
    • Values” 
    •  “Using

Missing Values

Delete Missing - delete all rows where values are missing from a particular column

  1. Select “Delete Missing
  2. No Parameter is required.
  3. Auto-generated command: Delete all rows where value is missing on column “column_name”. 

Replace Missing by a value - use a new value to replace missing values in a column 

  1. Select “Replace Missing by a value
  2. Parameters: 
    • Using” - Use a new value to replace all missing values in a particular column.
  3. Auto-generated command: Replace missing values using “parameter” on column “column_name”.

Replace Missing by Average - use the average of all the values in a column to replace missing values in that column 

  1. Select “Replace Missing by Average
  2. No Parameter is required
  3. Auto-generated command: Replace missing values using average on column “column_name



NLP / Text Data


Operations for text preprocessing

How to

Convert to uppercase - convert all text in a column to upper cases

  1. Select "Convert to upperCase"
  2. No Parameter is required
  3. Auto-generated command: Convert all text to uppercase on column “column_name” 

Convert to lowercase - convert all text in a column to lower cases 

  1. Select "Convert to lowercase"
  2. No Parameter is required
  3. Auto-generated command: Convert all text to lowercase on column “column_name

Convert to FastText Label - use fasttext to generate a label from a text data.


  1. Select "Convert to FastText Label"
  2. No Parameter is required
  3. Auto-generated command: convert to FastText label on column “column_name

Add prefix - add a word(s), letter(s), character(s) or a number(s) at the beginning of values in a column as a “prefix” 

  1. Select "Add prefix"
  2. Parameters: 
    • Using” - Use this prefix to begin the values in a selected column. 
    • For example: db, _db, . , prefix etc.  
  3. Auto-generated command: Concatenate “your prefix” on “column_name” as prefix

Add suffix - append a word(s), letter(s), character(s) or a number(s) at the end of values in a column as a “suffix”

  1.  Select "Add Suffix"
  2.  Parameters: 
    • Using” - Use this suffix to append the end of the values in a selected column. 
    • For example: _end, .com
  3. Auto-generated command: concatenate “your suffix” on “column_name” as suffix

Replace substring - replace part of a text in a column with another text or character(s)

  1. Select "Replace sub-string"
  2. Parameters:                           
    • Value” is the text or sequence of characters that you will replace.
    • Using” is the new text or sequence of characters you want to use to replace Value.
  3. Auto-generated command: Replace sub-string “current text or character(s)” using “new text or character(s)” on “column_name”. 

Delete substring - delete part of a text from a text 

  1. Select "Delete sub-string"
  2. Parameters:               
    • "Value" is the text or sequence of characters you want to delete.
  3. Auto-generated command: delete sub-string “value” on “column_name”.

Remove Punctuations - remove punctuations from text

  1. Select "Remove Punctuations"
  2. No Parameters required
  3. Auto-generated command: "Remove Punctuations" from Text on “column_name”.

Concatenate Columns - generate a third (new) column from a combination of existing columns

See "Compute New Column"

Custom - these are the custom data wrangling functions or algorithms that a user has written and included in the platform

See "New Code Block"



Time Series - for data indexed in time order


Operation

How To

Univariate Time Series to ML Ready Data - sample and create a new time series data for a certain period with a moving window size for prediction. This is usually for one sensor.

  1. Select "Univariate Time Series to ML Ready Data"
  2. Parameters:
    • Value” is the time lag
    • Using” is the window size
  3. Auto-generated command: create data set where Time Lag is Value and Window Size is Using samples on “column_name”.

Multivariate Time Series to ML Ready Data - sample and create a new time series data for a certain period with a moving window size for prediction. This is usually for one or more sensors collecting data at the same time.

  1. Select "Univariate Time Series to ML Ready Data"
  2. Parameters:
    • Value” is the time lag
    • Using” is the window size
  3. Auto-generated command: create data set where Time Lag is Value and Window Size is Using samples on “column_name”.



Vision / Images 


Operations for image processing

How to







































Data Wrangling Recipes

 

Completed data wrangling steps can be saved as recipes for data wrangling and reused on the same or different data files later.

To save recipe

  1. Click on the Save button.
  2. Provide a recipe name in the text box and click Save.

To reuse recipe

  1. Click on the folder button.
  2. Select from the list of saved recipes.
  3. Apply that recipe to data.