Data Wrangling Algorithms
Print
Created by: Avery Isidore
Modified on: Tue, 11 Aug, 2020 at 12:05 AM
The Data Wrangling Algorithms and Functions are shown in the region highlighted in the figure below.
These can be understood from the tables below:
Basic Operations
Operation | Steps |
New Algorithm |
New Code Block - add or write custom code to wrangle the data. | - Select “New Code Block”.
- Fill in BlockName (name of your custom operation), On Value, and By Value.
- Click on “save code block”.
- Find “CustomWrangling”.
- Click “CustomWrangling” and then click on “Edit Code”.
- The file opens in the code editor.
- A framework for coding in python with examples on how to add your own functions and algorithms is automatically generated. Add code and save when you finish.
- Your “CustomWrangling” operation is saved and ready to be applied to any data.
|
General |
Replace By - replace a unique value in a feature by another value. | - Select “Replace By”
- Parameters:
- "Value": The existing value you want to replace.
- "Using": The new value you want to use to replace the existing one.
- Under Value (Count), check the box to the left of the unique value to replace.
- Once checked, that value automatically fills the Value text box in step 2.
- Select another value under Using to fill in the Using text box in step 2. Or You may type in a value.
|
Format Date time - change the date-time format of a column | - Select “Format Date Time”
- "Using": The new date/time format you enter replaces the existing
- Select a date-time format under Using to fill in the using text box in step 2. Or You may type in a format of choice.
- An auto-generated command for using %m-%d-%Y to replace the existing date-time format in the column.
|
TimeStamp to Date Time - change timestamp of date-time. | - Select “TimeStamp to Date Time” (no parameters required)
|
On Columns |
Rename Column - rename an existing column | - Select “Rename Column”
- Parameter
- “Using” - Enter the new column name here.
- Auto-generated command: Rename column “ColumnName” to “New Column Name”.
|
Copy Column - copy an entire column to another column with a new name | - Select “Copy Column”
- Parameters:
- “Using” - The name of the new Column Name
- Auto-generated command: Copy column “existing column name” as column “new column name” For example, copy column sepal_length as column sepal_length_new.
|
Change Data Type - change the data type of a column from text to num and vice versa. | - Select “Change Data Type”
- Parameters:
- “Using” - Use this new data type to change a column's data type.
- Auto-generated command: change column data type of column “column_name” to “new data type”
|
Split Column - use any character to split the values in a column and save the resulting values in separate columns. | - Select “Split Column”
- Parameters:
- “Using” - Use this character to split the values in a column. This keeps the original column and its values and creates two new columns with the split values.
- For example, sepal_length has values 4.3, 4.4, 4.5 and so on.
- Split Using “.”
- Generates two new columns - sepal_length0 with values 4,4,4 and sepal_length1 with values 3,4,5
- Auto-generated command: Split column “column_name” using “character”.
|
Compute New Column - apply formulas to a combination of columns to generate a new column | - Select “Compute New Column”.
- Under variables, select at least one variable.
- Every selected variable is listed under “Variables for formula”.
- No parameters are required.
- Click on "Write Formula”
- Fill the required information in the Write Formula page shown below and click "Done".
- Type a name for the new column.
- A list of all the columns selected to be used in the formula. Each column is represented by a variable X. The list starts from X1 to Xn.
- Write a formula. You can click on a variable in b and it will show up in c. For instance in Figure 20.3 we have written a formula (X1+X2)*2 meaning (sepal_length + sepal_width)*2.
Note: Auto generated command for the operation is about to be performed. The command shown in Figure 20.3 means - compute a new column called “myformula” with the formula (sepal_length + sepal_width)*2
The auto-generated command for the new column that just computed is shown.
7. Click "Add to Step" 8. Click "Apply" |
On Rows |
Delete Rows - delete rows that fit a certain criteria | - Select “Delete Row”
- Parameters:
- “Value” - Deletes rows where the value is “Value”. Ie the criteria for deleting rows in the data. [Not clear]
- You can select a value under Value(Count) to fill in the value parameter.
- Auto-generated command: Delete rows where value = “Value” on column “column_name”.
|
Randomly Resample X% Rows - random sampling | - Select “Randomly Resample X% Rows”
- Parameters:
|
Missing Values |
Delete Missing - delete all rows where values are missing from a particular column | - Select “Delete Missing”
- No Parameter is required.
- Auto-generated command: Delete all rows where value is missing on column “column_name”.
|
Replace Missing by a value - use a new value to replace missing values in a column | - Select “Replace Missing by a value”
- Parameters:
- “Using” - Use a new value to replace all missing values in a particular column.
- Auto-generated command: Replace missing values using “parameter” on column “column_name”.
|
Replace Missing by Average - use the average of all the values in a column to replace missing values in that column | - Select “Replace Missing by Average”
- No Parameter is required
- Auto-generated command: Replace missing values using average on column “column_name”
|
NLP / Text Data
Operations for text preprocessing | How to |
Convert to uppercase - convert all text in a column to upper cases | - Select "Convert to upperCase"
- No Parameter is required
- Auto-generated command: Convert all text to uppercase on column “column_name”
|
Convert to lowercase - convert all text in a column to lower cases | - Select "Convert to lowercase"
- No Parameter is required
- Auto-generated command: Convert all text to lowercase on column “column_name”
|
Convert to FastText Label - use fasttext to generate a label from a text data.
| - Select "Convert to FastText Label"
- No Parameter is required
- Auto-generated command: convert to FastText label on column “column_name”
|
Add prefix - add a word(s), letter(s), character(s) or a number(s) at the beginning of values in a column as a “prefix” | - Select "Add prefix"
- Parameters:
- “Using” - Use this prefix to begin the values in a selected column.
- For example: db, _db, . , prefix etc.
- Auto-generated command: Concatenate “your prefix” on “column_name” as prefix
|
Add suffix - append a word(s), letter(s), character(s) or a number(s) at the end of values in a column as a “suffix” | - Select "Add Suffix"
- Parameters:
- “Using” - Use this suffix to append the end of the values in a selected column.
- For example: _end, .com
- Auto-generated command: concatenate “your suffix” on “column_name” as suffix
|
Replace substring - replace part of a text in a column with another text or character(s) | - Select "Replace sub-string"
- Parameters:
- “Value” is the text or sequence of characters that you will replace.
- “Using” is the new text or sequence of characters you want to use to replace Value.
- Auto-generated command: Replace sub-string “current text or character(s)” using “new text or character(s)” on “column_name”.
|
Delete substring - delete part of a text from a text | - Select "Delete sub-string"
- Parameters:
- "Value" is the text or sequence of characters you want to delete.
- Auto-generated command: delete sub-string “value” on “column_name”.
|
Remove Punctuations - remove punctuations from text | - Select "Remove Punctuations"
- No Parameters required
- Auto-generated command: "Remove Punctuations" from Text on “column_name”.
|
Concatenate Columns - generate a third (new) column from a combination of existing columns | See "Compute New Column" |
Custom - these are the custom data wrangling functions or algorithms that a user has written and included in the platform | See "New Code Block" |
Time Series - for data indexed in time order
Operation | How To |
Univariate Time Series to ML Ready Data - sample and create a new time series data for a certain period with a moving window size for prediction. This is usually for one sensor. | - Select "Univariate Time Series to ML Ready Data"
- Parameters:
- “Value” is the time lag
- “Using” is the window size
- Auto-generated command: create data set where Time Lag is Value and Window Size is Using samples on “column_name”.
|
Multivariate Time Series to ML Ready Data - sample and create a new time series data for a certain period with a moving window size for prediction. This is usually for one or more sensors collecting data at the same time. | - Select "Univariate Time Series to ML Ready Data"
- Parameters:
- “Value” is the time lag
- “Using” is the window size
- Auto-generated command: create data set where Time Lag is Value and Window Size is Using samples on “column_name”.
|
Vision / Images
Operations for image processing | How to |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Data Wrangling Recipes
Completed data wrangling steps can be saved as recipes for data wrangling and reused on the same or different data files later.
To save recipe | - Click on the Save button.
- Provide a recipe name in the text box and click Save.
|
To reuse recipe | - Click on the folder button.
- Select from the list of saved recipes.
- Apply that recipe to data.
|
Avery is the author of this solution article.
Did you find it helpful?
Yes
No
Send feedback Sorry we couldn't be helpful. Help us improve this article with your feedback.