TLDR
It seems absurd to add more columns to an already large dataset, right? See how procedurally adding data derived from existing columns helps the model gain further insight when predicting.
Glossary
What is it?
The significance of boolean values in your dataset
Ways to implement in code! 👩‍💻
Magical no-code solution ✨🔮
What is it?
Adding a new column with a conditional is an operation in a machine learning analysis that takes existing data, possibly several columns of varying ranges, and performs a conditional comparison to condense those columns into a single value (often boolean) for each row.
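As a minimal sketch of this idea, the snippet below condenses two columns into one boolean column. The DataFrame, its column names (exam, homework, on_track), and the cutoffs are made up for illustration; they are not part of the dataset used later in this post.

```python
import pandas as pd

# Hypothetical data: column names and cutoffs are assumptions for illustration.
students = pd.DataFrame({
    "exam": [82, 55, 91, 60],
    "homework": [70, 95, 40, 65],
})

# One conditional over two columns, evaluated for every row:
# True only when both scores reach their cutoff.
students["on_track"] = (students["exam"] >= 60) & (students["homework"] >= 50)

print(students)
```

Each row now carries a single True/False value that summarizes both scores.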
The significance of boolean values in your dataset
To understand the importance of generalizing some of your features/columns, consider what boolean values represent: true or false values that clearly indicate to the model what the positive or negative outcomes are in this prediction.
A column of boolean values can improve model training accuracy because it summarizes a range of values into a true or false metric that simply indicates whether certain conditions are fulfilled. It’s like a test study guide: a summary/indicator that is helpful in predicting the likelihood of an outcome (like passing or not passing a test).
Some examples
If the definition was a little abstract, here are some examples! The following are problems that are simplified by adding new columns using a conditional:
Classify a range and give it a value -> Assign “Pass/No pass” to a numerical grade.
Simplify column data -> Reduce the complexity of store records by using a boolean value to indicate whether a line of seasonal clothing sales was a net gain or loss.
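The store-records example can be sketched roughly as follows. The table and its column names (item, revenue, cost, net_gain) are hypothetical, chosen only to show the shape of the operation.

```python
import pandas as pd

# Hypothetical store records; "revenue" and "cost" are assumed column names.
sales = pd.DataFrame({
    "item": ["scarf", "swimsuit", "raincoat"],
    "revenue": [1200.0, 450.0, 800.0],
    "cost": [900.0, 600.0, 800.0],
})

# True when the clothing line earned more than it cost, False otherwise
# (breaking even counts as False in this sketch).
sales["net_gain"] = sales["revenue"] > sales["cost"]

print(sales)
```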
When ranking student satisfaction with their colleges, academic performance is only one aspect of the student experience (alongside extracurriculars, leadership, and other life responsibilities). Thus, we don’t need to be as specific as the typical 0-100 grades, since a wider range of values in a column generates more noise. By adding a “Pass/No pass” feature, we can increase the accuracy of the model, because a numerical range of grades is much harder to generalize (and predict) than “Pass/No pass.”
UCLA student life (Source: Daily Bruin)
Ways to implement in code! 👩‍💻
Using the grades example discussed earlier, we can simplify a numerical grade like “95%” or “72%” into “pass” or “no pass” values. We have a dataset of grades (or “marks”) that we’ll go through to determine whether each student passed the exam.
import pandas as pd
df = pd.read_csv('Marks10.csv')
df
From scratch
In regular Python, we’re simply interested in the Exampoints column that denotes the grades. Thus, we’d simply save that column as a list of fifty exam scores named “grades.”
grades = df['Exampoints']
print(list(grades))
Next, we’d loop through all the scores and use a conditional comparison to check whether each grade is passing (which is anything at 60% or higher).
# included full list for readability
grades = [31, 45, 23, 69, 78, 45, 23, 89, 100, 97,
          56, 11, 9, 55, 43, 44, 45, 46, 47, 48,
          49, 23, 24, 25, 26, 69, 70, 71, 72, 73,
          74, 75, 76, 34, 35, 36, 37, 38, 39, 40,
          41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
pass_no_pass = []
for grade in grades:
    if grade >= 60:
        pass_no_pass.append('Pass')
    else:
        pass_no_pass.append('No pass')
Our list of “pass or no pass” values would then contain all of the records that we need to make a new column of data!
print(pass_no_pass)
Now that you know the logic behind performing a conditional operation on an existing column to determine values in a new one, we’ll show you how to add the additional column to the dataset for a side-by-side comparison of grades with pass/no pass:
df['pass_no_pass'] = pass_no_pass
df
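If you want to reuse the loop above with a different cutoff, one option is to wrap it in a small function. This is only a sketch: the name label_grades and its threshold parameter are our own additions, not part of the walkthrough above, and it runs without the CSV file.

```python
def label_grades(grades, threshold=60):
    """Return 'Pass' or 'No pass' for each grade, in the same order."""
    labels = []
    for grade in grades:
        # Same conditional as before, with the cutoff as a parameter.
        if grade >= threshold:
            labels.append("Pass")
        else:
            labels.append("No pass")
    return labels


sample = [31, 69, 60, 59, 100]
print(label_grades(sample))
```

Because the result has one label per grade, it can be assigned to a new DataFrame column in the same way as pass_no_pass.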
“Best practice”: list comprehension
This method is a clean way for a Python developer to add a column of data, and it is fast compared to the row-wise Pandas functions that many developers dislike. It’s a no-nonsense approach that extracts the column we’re analyzing as a list and only looks at that. In a way, it’s the method that’s most similar to our first “From scratch” example.
First, we write a function that takes in the exam score as an input and determines whether the grade is passing or not:
import pandas as pd
df = pd.read_csv('Marks10.csv')

def passed(grade):
    # A score of 60 or higher counts as passing.
    if grade >= 60:
        return True
    else:
        return False
Then, while iterating through the rows and looking only at the values in the “Exampoints” column, the list comprehension passes each value into the function we created earlier. We take the returned values and save them in our new column, “Passed.”
pass_no_pass = [passed(score) for score in df['Exampoints']]
df['Passed'] = pass_no_pass
df
Although this method is fast, we still need to write a helper function to calculate the boolean values. In the next section, we will show you how to use a NumPy function to do the same thing!
Using numpy.where()
Now that you know the logic behind performing a conditional operation, let’s see how to do this using a Pandas dataframe and the NumPy function np.where() to add a “Passed” column with boolean values!
We use boolean values here instead of the “Pass/No pass” labels from the “From scratch” example because a True value in a column named “Passed” clearly indicates to the model that the outcome did occur.
The function, np.where(), takes three parameters:
The condition, which in our case is whether the student scored 60 or higher.
The value to store where the condition is true; if the student passed, we save this row’s value as True.
The value to store otherwise; here, False.
import pandas as pd
import numpy as np
df = pd.read_csv('Marks10.csv')
df['Passed'] = np.where(df['Exampoints'] >= 60, True, False)
df
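Two variations may be worth knowing, sketched here on a small made-up stand-in for the “Exampoints” column rather than the real CSV. When the outputs are simply True and False, the comparison by itself already produces a boolean column; np.where() becomes more useful when you want custom values such as text labels.

```python
import numpy as np
import pandas as pd

# Small stand-in for the "Exampoints" column; the values are made up.
df = pd.DataFrame({"Exampoints": [31, 60, 78, 59]})

# The comparison alone already yields a boolean Series.
df["Passed"] = df["Exampoints"] >= 60

# np.where lets you choose what to store for each branch of the condition.
df["pass_no_pass"] = np.where(df["Exampoints"] >= 60, "Pass", "No pass")

print(df)
```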
With this true or false column, our model now knows what “good” grades look like! However, all of the outcomes recorded in “Passed” are useful for helping the model make predictions about student performance, including the scores of students who didn’t pass!
What’s fascinating about a data-driven analysis is that we can learn something valuable about student experience, performance, etc. from everyone, not just the high-achieving students. A model can only be accurate and generalizable with a high volume of diverse data points.
Therefore, even though a D or below may not satisfy you or your parents, know that to our model, your worth is immeasurable.
Source: Bugcat Capoo
Okay, technically we can measure how influential a row of data is to training results thanks to ML, but let’s not be so cerebral all the time.
Magical no-code solution ✨🔮
Speaking of not thinking so hard: sometimes, when experimenting with data, we’d rather face data-related challenges than coding ones. Mage can alleviate the coding-related burdens for you!
To add a new column using a conditional operation to improve your model’s accuracy and generality, first go to:
Edit data > Add column
Fill in “conditional” for logic
Then, fill in the logic for the columns you’re comparing (e.g., “Exampoints >= 60”)
Set the first outcome of the comparison evaluation as “True,” and the second as “False”
Happy experimenting! Hope Mage can give you a magical experience working with data.