Pandas: Data Analysis Essentials
A significant portion of working on real-world projects, around 70-80%, is actually spent on preparing and cleaning the data. This is where Pandas comes in handy: Pandas is a powerful tool that empowers analysts to efficiently manipulate, analyze, and visualize data. Here’s a brief overview of how you can perform data analysis with Pandas.
Introduction
- Pandas: Python library for data manipulation and analysis.
- Installation: Run
pip install pandas
in your Python environment. - DataFrames: Two-dimensional labeled data structures in Pandas, used for storing and analyzing tabular data efficiently.
- Series: One-dimensional labeled data structure in Pandas, representing a single column or row of data, capable of holding various data types.
Data
The UCI Adult Income dataset is used for this article. It contains information about individuals from the 1994 United States Census, and the goal is to predict whether a person’s income exceeds $50,000 per year based on various attributes. It consists of 14 attributes, including features such as age, education level, work-class, marital status, occupation, race, sex, capital gain, capital loss, hours per week, native country, and the target variable, which indicates whether the income exceeds $50,000 or not.
Setup
- import numpy and pandas
- setup precision for two decimal places
- read csv from URL into a dataframe
import numpy as np
import pandas as pd
pd.set_option("display.precision", 2)
DATA_URL = "https://vmlverse.github.io/assets/datasets/adult.data.csv"
df = pd.read_csv(DATA_URL)
Exploring the data
Once data loaded into a DataFrame, you can examine its contents and structure.
1. df.head(n)
Returns the first n rows of the DataFrame. (Default:5):
df.head()
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
2. df.tail(n)
Returns the last n rows of the DataFrame.(Default:5)
df.tail()
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |
3. df.shape
Returns the dimensions (rows, columns) of the DataFrame.
df.shape
(32561, 15)
4. df.info()
Provides information about the DataFrame, including the data - types of columns and missing values.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 32561 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education-num 32561 non-null int64
5 marital-status 32561 non-null object
6 occupation 32561 non-null object
7 relationship 32561 non-null object
8 race 32561 non-null object
9 sex 32561 non-null object
10 capital-gain 32561 non-null int64
11 capital-loss 32561 non-null int64
12 hours-per-week 32561 non-null int64
13 native-country 32561 non-null object
14 salary 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
5. df.describe()
Generates descriptive statistics for numerical columns:
- descriptive statistics (count, mean, min, max, etc.)
- numerical columns (int64 and float64 types)
- In the below code, we can observe the metrics for numerical columns like age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week.
df.describe()
age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | |
---|---|---|---|---|---|---|
count | 32561.00 | 3.26e+04 | 32561.00 | 32561.00 | 32561.00 | 32561.00 |
mean | 38.58 | 1.90e+05 | 10.08 | 1077.65 | 87.30 | 40.44 |
std | 13.64 | 1.06e+05 | 2.57 | 7385.29 | 402.96 | 12.35 |
min | 17.00 | 1.23e+04 | 1.00 | 0.00 | 0.00 | 1.00 |
25% | 28.00 | 1.18e+05 | 9.00 | 0.00 | 0.00 | 40.00 |
50% | 37.00 | 1.78e+05 | 10.00 | 0.00 | 0.00 | 40.00 |
75% | 48.00 | 2.37e+05 | 12.00 | 0.00 | 0.00 | 45.00 |
max | 90.00 | 1.48e+06 | 16.00 | 99999.00 | 4356.00 | 99.00 |
- df.describe(include=[]): to see statistics on non-numerical features:
One has to explicitly indicate data types of interest in the include parameter.
In the below code, we can observe the metrics for non-numerical columns like workclass, education, marital-status, occupation, relationship, race, sex, native-country & salary.
However, note that unlike the stats for numerical columns, we can see statss like count, number of unique values, top occuring value and frequently occuring value.
df.describe(include=["object", "bool"])
workclass | education | marital-status | occupation | relationship | race | sex | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|
count | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 |
unique | 9 | 16 | 7 | 15 | 6 | 5 | 2 | 42 | 2 |
top | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States | <=50K |
freq | 22696 | 10501 | 14976 | 4140 | 13193 | 27816 | 21790 | 29170 | 24720 |
6. df.columns
prints the column names of a DataFrame
df.columns
Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
'marital-status', 'occupation', 'relationship', 'race', 'sex',
'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
'salary'],
dtype='object')
7. df.astype()
To change column data type
For instance to change age column which is a int64
to float
datatype, we can do this operation.
df["age"] = df["age"].astype("float")
we can now run the df.info()
describe column to verify if the conversion happened on the age
column.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null float64
1 workclass 32561 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education-num 32561 non-null int64
5 marital-status 32561 non-null object
6 occupation 32561 non-null object
7 relationship 32561 non-null object
8 race 32561 non-null object
9 sex 32561 non-null object
10 capital-gain 32561 non-null int64
11 capital-loss 32561 non-null int64
12 hours-per-week 32561 non-null int64
13 native-country 32561 non-null object
14 salary 32561 non-null object
dtypes: float64(1), int64(5), object(9)
memory usage: 3.7+ MB
8. df.value_counts()
To analyze the distribution of values in your dataset:
df["sex"].value_counts()
Male 21790
Female 10771
Name: sex, dtype: int64
To calculate fractions, pass normalize=True to the value_counts function. You can use this to determine the percentage distribution.
df["sex"].value_counts(normalize=True)
Male 0.67
Female 0.33
Name: sex, dtype: float64
df.value_counts()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
25 Private 195994 1st-4th 2 Never-married Priv-house-serv Not-in-family White Female 0 0 40 Guatemala <=50K 3
23 Private 240137 5th-6th 3 Never-married Handlers-cleaners Not-in-family White Male 0 0 55 Mexico <=50K 2
38 Private 207202 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 48 United-States >50K 2
30 Private 144593 HS-grad 9 Never-married Other-service Not-in-family Black Male 0 0 40 ? <=50K 2
49 Self-emp-not-inc 43479 Some-college 10 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States <=50K 2
..
31 Private 128567 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States <=50K 1
128493 HS-grad 9 Divorced Other-service Not-in-family White Female 0 0 25 United-States <=50K 1
128220 7th-8th 4 Widowed Adm-clerical Not-in-family White Female 0 0 35 United-States <=50K 1
127610 Bachelors 13 Married-civ-spouse Prof-specialty Wife White Female 0 0 40 United-States >50K 1
90 Self-emp-not-inc 282095 Some-college 10 Married-civ-spouse Farming-fishing Husband White Male 0 0 40 United-States <=50K 1
Length: 32537, dtype: int64
Sorting
- DataFrame can be sorted by the value of one of the variables (i.e columns).
- For example, we can sort by age (use ascending=False to sort in descending order):
df.sort_values(by="age", ascending=False).head()
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5406 | 90.0 | Private | 51744 | Masters | 14 | Never-married | Exec-managerial | Not-in-family | Black | Male | 0 | 0 | 50 | United-States | >50K |
6624 | 90.0 | Private | 313986 | 11th | 7 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
20610 | 90.0 | Private | 206667 | Masters | 14 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 0 | 0 | 40 | United-States | >50K |
1040 | 90.0 | Private | 137018 | HS-grad | 9 | Never-married | Other-service | Not-in-family | White | Female | 0 | 0 | 40 | United-States | <=50K |
1935 | 90.0 | Private | 221832 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | <=50K |
We can also sort by multiple columns:
df.sort_values(by=["age","education-num"], ascending=[False,True]).head()
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
24238 | 90.0 | ? | 166343 | 1st-4th | 2 | Widowed | ? | Not-in-family | Black | Female | 0 | 0 | 40 | United-States | <=50K |
19747 | 90.0 | Private | 226968 | 7th-8th | 4 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
25303 | 90.0 | ? | 175444 | 7th-8th | 4 | Separated | ? | Not-in-family | White | Female | 0 | 0 | 15 | United-States | <=50K |
32367 | 90.0 | Local-gov | 214594 | 7th-8th | 4 | Married-civ-spouse | Protective-serv | Husband | White | Male | 2653 | 0 | 40 | United-States | <=50K |
5272 | 90.0 | Private | 141758 | 9th | 5 | Never-married | Adm-clerical | Not-in-family | White | Female | 0 | 0 | 40 | United-States | <=50K |
Indexing and retrieving data
1. Calculate mean
df['hours-per-week'].mean()
40.437455852092995
2. Column Indexing
- DataFrames can be indexed by column name (label) or row name (index).
- The loc method is used for indexing by name, while iloc() is used for indexing by number.
loc() | iloc() |
---|---|
used for indexing by name | used for indexing by number |
df.loc[‘row_index’] will return the row with the specified index label | df.iloc[row_number] will return the row at the specified integer position |
df.loc[row_label, column_label] will return the value at the specified row and column labels | df.iloc[row_number, column_number] will return the value at the specified row and column positions |
Fetch rows 2 to 4 (inclusive)
df.loc[2:4]
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
Fetch rows 2 to 4 (not inclusive). Can use iloc[] as well.
df.iloc[2:4]
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
Display rows 2 to 4 (inclusive) and columns from education to relationship.
df.loc[2:4, "education":"relationship"]
education | education-num | marital-status | occupation | relationship | |
---|---|---|---|---|---|
2 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family |
3 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband |
4 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife |
Display rows 2 to 4 (inclusive) and first three columns.
df.iloc[2:4, 0:3]
age | workclass | fnlwgt | |
---|---|---|---|
2 | 38 | Private | 215646 |
3 | 53 | Private | 234721 |
Note: You can cross the methods. For example, cannot use numerical representation on loc() or use alphabetic representation on iloc()
If we need the first or the last line of the data frame, we can use the df[:1] or df[-1:] construction:
- df[:1]: start at 0. end at 1 (non-inclusive).
- Result: displays first row alone.
df[:1]
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
Indexing can be done in loops as well. i.e., you can use negative numbers as well. e.g., df[-7:-2]
- df[-1:] : start at -1(last row). end at 0 (non-inclusive)
- Result: displays last row alone.
df[-1:]
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |
3. Boolean indexing
- Boolean indexing with one column is also very convenient.
- The syntax is df[P(df[‘Name’])], where P is some logical condition that is checked for each element of the Name column.
- The result of such indexing is the DataFrame consisting only of the rows that satisfy the P condition on the Name column.
What are the average values of numerical features for male users?
Here we’ll resort to an additional method select_dtypes to select all numeric columns.
-
df.select_dtypes(include=np.number) selects columns in the DataFrame that have numeric data types. It filters out columns with non-numeric data types.
-
[df[“sex”] == “male”] filters the DataFrame rows based on a condition. It selects only the rows where the “sex” column has a value of “male”.
-
.mean() calculates the mean (average) value for each column in the filtered DataFrame.
-
Therefore, the code calculates the mean value for each numeric column in the DataFrame (df) but only for the rows where the “sex” column is “male”. This provides the average values of the numeric columns specifically for the male subset of the data.
df.select_dtypes(include=np.number)[df["sex"] == "Male"].mean()
age 39.43
fnlwgt 191771.45
education-num 10.10
capital-gain 1329.37
capital-loss 100.21
hours-per-week 42.43
dtype: float64
- What is the mean age of Females in the survey?
- Boolean condtion P => df[“sex”] == “Female”
- df[P][“age”]
- using the method .mean() to calculate mean in the column of interest.
df[df["sex"] == "Female"]["age"].mean()
36.85823043357163
- What is the max age of males who have a capital gain greater than 15,000?
- Boolean condtion P => (df[“sex”] == “Male”) & (df[“capital-gain”] > 15000 )
- df[P][“age”]
- using the method .max() to calculate max in the column of interest.
df[(df["sex"] == "Male") & (df["capital-gain"] > 15000 )]["age"].max()
90.0
Example: Finding The proportion of German citizens (native-country feature):
- df.shape[0] gives the number of rows in DataFrame
- Dividing the sum of True values by the total number of rows gives the percentage of individuals with Germany as their native country.
float((df["native-country"] == "Germany").sum()) / df.shape[0]
0.004207487485028101
Example: What are mean value and standard deviation of the age of those who receive more than 50K per year (salary feature) and those who receive less than 50K per year?
ages1 = df[df["salary"] == ">50K"]["age"]
ages2 = df[df["salary"] == "<=50K"]["age"]
print(
"The average age of the rich: {0} +- {1} years, poor - {2} +- {3} years.".format(
round(ages1.mean()),
round(ages1.std(), 1),
round(ages2.mean()),
round(ages2.std(), 1),
)
)
The average age of the rich: 44 +- 10.5 years, poor - 37 +- 14.0 years.
Example: Is it true that people who earn more than 50K have at least high school education? (education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate feature)
df[df["salary"] == ">50K"]["education"].unique() # No
array(['HS-grad', 'Masters', 'Bachelors', 'Some-college', 'Assoc-voc',
'Doctorate', 'Prof-school', 'Assoc-acdm', '7th-8th', '12th',
'10th', '11th', '9th', '5th-6th', '1st-4th'], dtype=object)
Example: Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (marital-status feature)? Consider as married those who have a marital-status starting with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.
# married men
df[(df["sex"] == "Male")
& (df["marital-status"].str.startswith("Married"))][
"salary"
].value_counts(normalize=True)
<=50K 0.56
>50K 0.44
Name: salary, dtype: float64
# single men
df[
(df["sex"] == "Male")
& ~(df["marital-status"].str.startswith("Married"))
]["salary"].value_counts(normalize=True)
<=50K 0.92
>50K 0.08
Name: salary, dtype: float64
4. Apply Method
- df.apply() method can be used to apply functions on Cells, Rows or Columns
For example, find the max value across all columns.
df.apply(np.max)
age 90
workclass Without-pay
fnlwgt 1484705
education Some-college
education-num 16
marital-status Widowed
occupation Transport-moving
relationship Wife
race White
sex Male
capital-gain 99999
capital-loss 4356
hours-per-week 99
native-country Yugoslavia
salary >50K
dtype: object
Lambda functions are very convenient to apply using apply() method.
- For eg, select the native-countries starting with Y.
- To break this down:
df["native-country"]
selects the native-country.df["native-country"].apply(lambda state: state[0] == "Y")
applies lambda function to include only the rows where the “native-country” column starts with the letter “Y”df[].head()
- then returns the first few rows of the filtered DataFrame.
df[df["native-country"].apply(lambda country: country[0] == "Y")].head()
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1025 | 56 | Private | 169133 | HS-grad | 9 | Married-civ-spouse | Other-service | Husband | White | Male | 0 | 0 | 50 | Yugoslavia | <=50K |
4447 | 25 | Private | 191230 | Some-college | 10 | Never-married | Exec-managerial | Own-child | White | Female | 0 | 0 | 40 | Yugoslavia | <=50K |
6328 | 20 | Private | 175069 | Some-college | 10 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 40 | Yugoslavia | <=50K |
7287 | 35 | Private | 164526 | HS-grad | 9 | Married-civ-spouse | Other-service | Husband | White | Male | 0 | 0 | 40 | Yugoslavia | >50K |
12506 | 40 | Local-gov | 183096 | 9th | 5 | Married-civ-spouse | Other-service | Wife | White | Female | 0 | 0 | 40 | Yugoslavia | >50K |
5. Map & Replace Method
map method can be used to replace values in a column by passing a dictionary of the form {old_value: new_value} as its argument:
d = {"Male": 1, "Female": 2}
df["sex_1"] = df["sex"].map(d)
df.head()
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | NaN | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | NaN | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | NaN | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | NaN | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | NaN | 0 | 0 | 40 | Cuba | <=50K |
same thing can be done with the replace method.
df['sex_2'] = df['sex'].replace(d)
df.head()
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | sex_2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | 1 |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | 1 |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | 1 |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | 1 |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | 2 |
However, note replace
method will not do anything with values not found in the mapping dictionary, while while map
will change them to NaNs.
# Create a Series
s = pd.Series(['A', 'B', 'C', 'D'])
# Create a mapping dictionary
mapping = {'A': 'Apple', 'B': 'Banana'}
# Using replace method
replaced = s.replace(mapping)
print('replace method will not do anything with values not found in the mapping dictionary')
print(replaced)
# Using map method
mapped = s.map(mapping)
print('map method will change values not found in the mapping dictionary to NaNs')
print(mapped)
replace method will not do anything with values not found in the mapping dictionary
0 Apple
1 Banana
2 C
3 D
dtype: object
map method wwill change values not found in the mapping dictionary to NaNs
0 Apple
1 Banana
2 NaN
3 NaN
dtype: object
6. Grouping
Grouping data in Pandas works as follows:
df.groupby(by=grouping_columns)[columns_to_show].function()
-
First, the groupby method divides the grouping_columns by their values. They become a new index in the resulting dataframe.
-
Then, columns of interest are selected (columns_to_show). If columns_to_show is not included, all non groupby clauses will be included.
-
Finally, one or several functions are applied to the obtained groups per selected columns.
To understand this more intuitively, we can start with this small example.
- We can group by
education
. By doing this we are creating a new dataframe with theeducation
column as index. - However, this dataframe as no series values. For this we can apply as simple function like
count()
- The resulting dataframe shows the count each
education
value across all other columns. The data is repeating because,count()
is simply counting the number of entries for each education type in age, workclass, etc which obviously will be same.
result = df.groupby('education').count()
print(result)
age workclass fnlwgt education-num marital-status \
education
10th 933 933 933 933 933
11th 1175 1175 1175 1175 1175
12th 433 433 433 433 433
1st-4th 168 168 168 168 168
5th-6th 333 333 333 333 333
7th-8th 646 646 646 646 646
9th 514 514 514 514 514
Assoc-acdm 1067 1067 1067 1067 1067
Assoc-voc 1382 1382 1382 1382 1382
Bachelors 5355 5355 5355 5355 5355
Doctorate 413 413 413 413 413
HS-grad 10501 10501 10501 10501 10501
Masters 1723 1723 1723 1723 1723
Preschool 51 51 51 51 51
Prof-school 576 576 576 576 576
Some-college 7291 7291 7291 7291 7291
occupation relationship race sex capital-gain \
education
10th 933 933 933 933 933
11th 1175 1175 1175 1175 1175
12th 433 433 433 433 433
1st-4th 168 168 168 168 168
5th-6th 333 333 333 333 333
7th-8th 646 646 646 646 646
9th 514 514 514 514 514
Assoc-acdm 1067 1067 1067 1067 1067
Assoc-voc 1382 1382 1382 1382 1382
Bachelors 5355 5355 5355 5355 5355
Doctorate 413 413 413 413 413
HS-grad 10501 10501 10501 10501 10501
Masters 1723 1723 1723 1723 1723
Preschool 51 51 51 51 51
Prof-school 576 576 576 576 576
Some-college 7291 7291 7291 7291 7291
capital-loss hours-per-week native-country salary sex_2
education
10th 933 933 933 933 933
11th 1175 1175 1175 1175 1175
12th 433 433 433 433 433
1st-4th 168 168 168 168 168
5th-6th 333 333 333 333 333
7th-8th 646 646 646 646 646
9th 514 514 514 514 514
Assoc-acdm 1067 1067 1067 1067 1067
Assoc-voc 1382 1382 1382 1382 1382
Bachelors 5355 5355 5355 5355 5355
Doctorate 413 413 413 413 413
HS-grad 10501 10501 10501 10501 10501
Masters 1723 1723 1723 1723 1723
Preschool 51 51 51 51 51
Prof-school 576 576 576 576 576
Some-college 7291 7291 7291 7291 7291
This will start to make more sense by using a more meaningful function like .mean() which calculates the mean value for all numerical columns under each education
type.
result = df.groupby('education').mean()
print(result)
age fnlwgt education-num capital-gain capital-loss \
education
10th 37.43 196832.47 6.0 404.57 56.85
11th 32.36 194928.08 7.0 215.10 50.08
12th 32.00 199097.51 8.0 284.09 32.34
1st-4th 46.14 239303.00 2.0 125.88 48.33
5th-6th 42.89 232448.33 3.0 176.02 68.25
7th-8th 48.45 188079.17 4.0 233.94 65.67
9th 41.06 202485.07 5.0 342.09 29.00
Assoc-acdm 37.38 193424.09 12.0 640.40 93.42
Assoc-voc 38.55 181936.02 11.0 715.05 72.75
Bachelors 38.90 188055.91 13.0 1756.30 118.35
Doctorate 47.70 186698.76 16.0 4770.15 262.85
HS-grad 38.97 189538.74 9.0 576.80 70.47
Masters 44.05 179852.36 14.0 2562.56 166.72
Preschool 42.76 235889.37 1.0 898.39 66.49
Prof-school 44.75 185663.71 15.0 10414.42 231.20
Some-college 35.76 188742.92 10.0 598.82 71.64
hours-per-week sex_2
education
10th 37.05 1.32
11th 33.93 1.37
12th 35.78 1.33
1st-4th 38.26 1.27
5th-6th 38.90 1.25
7th-8th 39.37 1.25
9th 38.04 1.28
Assoc-acdm 40.50 1.39
Assoc-voc 41.61 1.36
Bachelors 42.61 1.30
Doctorate 46.97 1.21
HS-grad 40.58 1.32
Masters 43.84 1.31
Preschool 36.65 1.31
Prof-school 47.43 1.16
Some-college 38.85 1.38
<ipython-input-28-a0bc15c64f52>:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
result = df.groupby('education').mean()
In the below example:
- we grouped the DataFrame df by the ‘education’ column using the groupby method.
- Then, we selected the ‘age’ column to calculate the average age for each education level.
- The result is a Series that shows the average age for each education category.
# result = df.groupby('education')['age'].mean()
result = df.groupby('education')['age'].mean()
print(result)
education
10th 37.43
11th 32.36
12th 32.00
1st-4th 46.14
5th-6th 42.89
7th-8th 48.45
9th 41.06
Assoc-acdm 37.38
Assoc-voc 38.55
Bachelors 38.90
Doctorate 47.70
HS-grad 38.97
Masters 44.05
Preschool 42.76
Prof-school 44.75
Some-college 35.76
Name: age, dtype: float64
Here is another example. Suppose we want to groupby Salary and determine how people of various occupations are collecting salary, we can do the following:
- we can group the DataFrame df by both the ‘occupation’ and ‘salary’ columns using the groupby method. This is like first creating an index for
occupation
and then create indexsalary
under each occupation type. - Then, we applied the size function to count the number of occurrences for each combination of occupation and salary.
- The result is a Series that shows the count for each occupation and salary category.
# Grouping by 'occupation' and 'salary' columns and counting the number of occurrences
result = df.groupby(['occupation', 'salary']).size()
print(result)
occupation salary
? <=50K 1652
>50K 191
Adm-clerical <=50K 3263
>50K 507
Armed-Forces <=50K 8
>50K 1
Craft-repair <=50K 3170
>50K 929
Exec-managerial <=50K 2098
>50K 1968
Farming-fishing <=50K 879
>50K 115
Handlers-cleaners <=50K 1284
>50K 86
Machine-op-inspct <=50K 1752
>50K 250
Other-service <=50K 3158
>50K 137
Priv-house-serv <=50K 148
>50K 1
Prof-specialty <=50K 2281
>50K 1859
Protective-serv <=50K 438
>50K 211
Sales <=50K 2667
>50K 983
Tech-support <=50K 645
>50K 283
Transport-moving <=50K 1277
>50K 320
dtype: int64
We can extend this to show percentage by doing the following:
# Grouping by 'occupation' and 'salary' columns and calculating the percentage
result = df.groupby(['occupation', 'salary']).size() / df.groupby(['occupation']).size() * 100
print(result)
occupation salary
? <=50K 89.64
>50K 10.36
Adm-clerical <=50K 86.55
>50K 13.45
Armed-Forces <=50K 88.89
>50K 11.11
Craft-repair <=50K 77.34
>50K 22.66
Exec-managerial <=50K 51.60
>50K 48.40
Farming-fishing <=50K 88.43
>50K 11.57
Handlers-cleaners <=50K 93.72
>50K 6.28
Machine-op-inspct <=50K 87.51
>50K 12.49
Other-service <=50K 95.84
>50K 4.16
Priv-house-serv <=50K 99.33
>50K 0.67
Prof-specialty <=50K 55.10
>50K 44.90
Protective-serv <=50K 67.49
>50K 32.51
Sales <=50K 73.07
>50K 26.93
Tech-support <=50K 69.50
>50K 30.50
Transport-moving <=50K 79.96
>50K 20.04
dtype: float64
In this example, we can compare the normal describe method vs the describe method with groupby. As we can see, the group by method is indexed on occupation, where as the regular df is indexed on 0 to N row index.
df.describe(percentiles=[])
age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | sex_2 | |
---|---|---|---|---|---|---|---|
count | 32561.00 | 3.26e+04 | 32561.00 | 32561.00 | 32561.00 | 32561.00 | 32561.00 |
mean | 38.58 | 1.90e+05 | 10.08 | 1077.65 | 87.30 | 40.44 | 1.33 |
std | 13.64 | 1.06e+05 | 2.57 | 7385.29 | 402.96 | 12.35 | 0.47 |
min | 17.00 | 1.23e+04 | 1.00 | 0.00 | 0.00 | 1.00 | 1.00 |
50% | 37.00 | 1.78e+05 | 10.00 | 0.00 | 0.00 | 40.00 | 1.00 |
max | 90.00 | 1.48e+06 | 16.00 | 99999.00 | 4356.00 | 99.00 | 2.00 |
df.groupby(["occupation"]).describe(percentiles=[])
age | fnlwgt | ... | hours-per-week | sex_2 | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 50% | max | count | mean | std | min | ... | std | min | 50% | max | count | mean | std | min | 50% | max | |
occupation | |||||||||||||||||||||
? | 1843.0 | 40.88 | 20.34 | 17.0 | 35.0 | 90.0 | 1843.0 | 188658.67 | 107089.08 | 12285.0 | ... | 14.91 | 1.0 | 36.0 | 99.0 | 1843.0 | 1.46 | 0.50 | 1.0 | 1.0 | 2.0 |
Adm-clerical | 3770.0 | 36.96 | 13.36 | 17.0 | 35.0 | 90.0 | 3770.0 | 192043.40 | 103163.89 | 19302.0 | ... | 9.59 | 1.0 | 40.0 | 80.0 | 3770.0 | 1.67 | 0.47 | 1.0 | 2.0 | 2.0 |
Armed-Forces | 9.0 | 30.22 | 8.09 | 23.0 | 29.0 | 46.0 | 9.0 | 215425.89 | 83315.89 | 76313.0 | ... | 14.07 | 8.0 | 40.0 | 60.0 | 9.0 | 1.00 | 0.00 | 1.0 | 1.0 | 1.0 |
Craft-repair | 4099.0 | 39.03 | 11.61 | 17.0 | 38.0 | 90.0 | 4099.0 | 192132.60 | 107434.09 | 19491.0 | ... | 9.05 | 1.0 | 40.0 | 99.0 | 4099.0 | 1.05 | 0.23 | 1.0 | 1.0 | 2.0 |
Exec-managerial | 4066.0 | 42.17 | 11.97 | 17.0 | 41.0 | 90.0 | 4066.0 | 184414.01 | 103314.96 | 19914.0 | ... | 11.11 | 1.0 | 40.0 | 99.0 | 4066.0 | 1.29 | 0.45 | 1.0 | 1.0 | 2.0 |
Farming-fishing | 994.0 | 41.21 | 15.07 | 17.0 | 39.0 | 90.0 | 994.0 | 170190.18 | 116925.90 | 20795.0 | ... | 17.32 | 2.0 | 40.0 | 99.0 | 994.0 | 1.07 | 0.25 | 1.0 | 1.0 | 2.0 |
Handlers-cleaners | 1370.0 | 32.17 | 12.37 | 17.0 | 29.0 | 90.0 | 1370.0 | 204391.01 | 111934.10 | 19214.0 | ... | 10.58 | 2.0 | 40.0 | 95.0 | 1370.0 | 1.12 | 0.32 | 1.0 | 1.0 | 2.0 |
Machine-op-inspct | 2002.0 | 37.72 | 12.07 | 17.0 | 36.0 | 90.0 | 2002.0 | 195040.88 | 98159.93 | 13769.0 | ... | 7.59 | 1.0 | 40.0 | 96.0 | 2002.0 | 1.27 | 0.45 | 1.0 | 1.0 | 2.0 |
Other-service | 3295.0 | 34.95 | 14.52 | 17.0 | 32.0 | 90.0 | 3295.0 | 188608.45 | 109452.80 | 19752.0 | ... | 12.71 | 1.0 | 40.0 | 99.0 | 3295.0 | 1.55 | 0.50 | 1.0 | 2.0 | 2.0 |
Priv-house-serv | 149.0 | 41.72 | 18.63 | 17.0 | 40.0 | 81.0 | 149.0 | 201107.52 | 102595.40 | 24384.0 | ... | 16.18 | 4.0 | 35.0 | 99.0 | 149.0 | 1.95 | 0.23 | 1.0 | 2.0 | 2.0 |
Prof-specialty | 4140.0 | 40.52 | 12.02 | 17.0 | 40.0 | 90.0 | 4140.0 | 185296.61 | 100135.45 | 14878.0 | ... | 12.54 | 1.0 | 40.0 | 99.0 | 4140.0 | 1.37 | 0.48 | 1.0 | 1.0 | 2.0 |
Protective-serv | 649.0 | 38.95 | 12.82 | 17.0 | 36.0 | 90.0 | 649.0 | 202039.95 | 101910.29 | 19302.0 | ... | 12.33 | 3.0 | 40.0 | 99.0 | 649.0 | 1.12 | 0.32 | 1.0 | 1.0 | 2.0 |
Sales | 3650.0 | 37.35 | 14.19 | 17.0 | 35.0 | 90.0 | 3650.0 | 190885.89 | 103779.87 | 19410.0 | ... | 13.24 | 2.0 | 40.0 | 99.0 | 3650.0 | 1.35 | 0.48 | 1.0 | 1.0 | 2.0 |
Tech-support | 928.0 | 37.02 | 11.32 | 17.0 | 36.0 | 73.0 | 928.0 | 192098.30 | 113987.28 | 19847.0 | ... | 10.58 | 3.0 | 40.0 | 99.0 | 928.0 | 1.38 | 0.48 | 1.0 | 1.0 | 2.0 |
Transport-moving | 1597.0 | 40.20 | 12.45 | 17.0 | 39.0 | 90.0 | 1597.0 | 190366.36 | 109240.30 | 18827.0 | ... | 12.72 | 5.0 | 40.0 | 99.0 | 1597.0 | 1.06 | 0.23 | 1.0 | 1.0 | 2.0 |
15 rows × 42 columns
We can do the same thing using agg()
function.
df.groupby(["occupation"]).agg([np.mean, np.std, np.min, np.max])
<ipython-input-32-f3544e70bb41>:1: FutureWarning: ['workclass', 'education', 'marital-status', 'relationship', 'race', 'sex', 'native-country', 'salary'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
df.groupby(["occupation"]).agg([np.mean, np.std, np.min, np.max])
age | fnlwgt | education-num | ... | capital-loss | hours-per-week | sex_2 | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | std | amin | amax | mean | std | amin | amax | mean | std | ... | amin | amax | mean | std | amin | amax | mean | std | amin | amax | |
occupation | |||||||||||||||||||||
? | 40.88 | 20.34 | 17 | 90 | 188658.67 | 107089.08 | 12285 | 981628 | 9.25 | 2.60 | ... | 0 | 4356 | 31.91 | 14.91 | 1 | 99 | 1.46 | 0.50 | 1 | 2 |
Adm-clerical | 36.96 | 13.36 | 17 | 90 | 192043.40 | 103163.89 | 19302 | 1033222 | 10.11 | 1.70 | ... | 0 | 3770 | 37.56 | 9.59 | 1 | 80 | 1.67 | 0.47 | 1 | 2 |
Armed-Forces | 30.22 | 8.09 | 23 | 46 | 215425.89 | 83315.89 | 76313 | 344415 | 10.11 | 2.03 | ... | 0 | 1887 | 40.67 | 14.07 | 8 | 60 | 1.00 | 0.00 | 1 | 1 |
Craft-repair | 39.03 | 11.61 | 17 | 90 | 192132.60 | 107434.09 | 19491 | 1455435 | 9.11 | 2.04 | ... | 0 | 3004 | 42.30 | 9.05 | 1 | 99 | 1.05 | 0.23 | 1 | 2 |
Exec-managerial | 42.17 | 11.97 | 17 | 90 | 184414.01 | 103314.96 | 19914 | 1484705 | 11.45 | 2.14 | ... | 0 | 4356 | 44.99 | 11.11 | 1 | 99 | 1.29 | 0.45 | 1 | 2 |
Farming-fishing | 41.21 | 15.07 | 17 | 90 | 170190.18 | 116925.90 | 20795 | 663394 | 8.61 | 2.76 | ... | 0 | 2457 | 46.99 | 17.32 | 2 | 99 | 1.07 | 0.25 | 1 | 2 |
Handlers-cleaners | 32.17 | 12.37 | 17 | 90 | 204391.01 | 111934.10 | 19214 | 889965 | 8.51 | 2.20 | ... | 0 | 2824 | 37.95 | 10.58 | 2 | 95 | 1.12 | 0.32 | 1 | 2 |
Machine-op-inspct | 37.72 | 12.07 | 17 | 90 | 195040.88 | 98159.93 | 13769 | 1033222 | 8.49 | 2.29 | ... | 0 | 3900 | 40.76 | 7.59 | 1 | 96 | 1.27 | 0.45 | 1 | 2 |
Other-service | 34.95 | 14.52 | 17 | 90 | 188608.45 | 109452.80 | 19752 | 1366120 | 8.78 | 2.30 | ... | 0 | 3770 | 34.70 | 12.71 | 1 | 99 | 1.55 | 0.50 | 1 | 2 |
Priv-house-serv | 41.72 | 18.63 | 17 | 81 | 201107.52 | 102595.40 | 24384 | 549430 | 7.36 | 3.11 | ... | 0 | 1602 | 32.89 | 16.18 | 4 | 99 | 1.95 | 0.23 | 1 | 2 |
Prof-specialty | 40.52 | 12.02 | 17 | 90 | 185296.61 | 100135.45 | 14878 | 747719 | 12.91 | 2.03 | ... | 0 | 3900 | 42.39 | 12.54 | 1 | 99 | 1.37 | 0.48 | 1 | 2 |
Protective-serv | 38.95 | 12.82 | 17 | 90 | 202039.95 | 101910.29 | 19302 | 857532 | 10.18 | 1.87 | ... | 0 | 2444 | 42.87 | 12.33 | 3 | 99 | 1.12 | 0.32 | 1 | 2 |
Sales | 37.35 | 14.19 | 17 | 90 | 190885.89 | 103779.87 | 19410 | 1226583 | 10.30 | 2.18 | ... | 0 | 2824 | 40.78 | 13.24 | 2 | 99 | 1.35 | 0.48 | 1 | 2 |
Tech-support | 37.02 | 11.32 | 17 | 73 | 192098.30 | 113987.28 | 19847 | 1268339 | 10.99 | 1.80 | ... | 0 | 2444 | 39.43 | 10.58 | 3 | 99 | 1.38 | 0.48 | 1 | 2 |
Transport-moving | 40.20 | 12.45 | 17 | 90 | 190366.36 | 109240.30 | 18827 | 1184622 | 8.77 | 2.04 | ... | 0 | 2824 | 44.66 | 12.72 | 5 | 99 | 1.06 | 0.23 | 1 | 2 |
15 rows × 28 columns
Example: Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo race
for (race, sex), sub_df in df.groupby(["race", "sex"]):
print("Race: {0}, sex: {1}".format(race, sex))
print(sub_df["age"].describe())
Race: Amer-Indian-Eskimo, sex: Female
count 119.00
mean 37.12
std 13.11
min 17.00
25% 27.00
50% 36.00
75% 46.00
max 80.00
Name: age, dtype: float64
Race: Amer-Indian-Eskimo, sex: Male
count 192.00
mean 37.21
std 12.05
min 17.00
25% 28.00
50% 35.00
75% 45.00
max 82.00
Name: age, dtype: float64
Race: Asian-Pac-Islander, sex: Female
count 346.00
mean 35.09
std 12.30
min 17.00
25% 25.00
50% 33.00
75% 43.75
max 75.00
Name: age, dtype: float64
Race: Asian-Pac-Islander, sex: Male
count 693.00
mean 39.07
std 12.88
min 18.00
25% 29.00
50% 37.00
75% 46.00
max 90.00
Name: age, dtype: float64
Race: Black, sex: Female
count 1555.00
mean 37.85
std 12.64
min 17.00
25% 28.00
50% 37.00
75% 46.00
max 90.00
Name: age, dtype: float64
Race: Black, sex: Male
count 1569.00
mean 37.68
std 12.88
min 17.00
25% 27.00
50% 36.00
75% 46.00
max 90.00
Name: age, dtype: float64
Race: Other, sex: Female
count 109.00
mean 31.68
std 11.63
min 17.00
25% 23.00
50% 29.00
75% 39.00
max 74.00
Name: age, dtype: float64
Race: Other, sex: Male
count 162.00
mean 34.65
std 11.36
min 17.00
25% 26.00
50% 32.00
75% 42.00
max 77.00
Name: age, dtype: float64
Race: White, sex: Female
count 8642.00
mean 36.81
std 14.33
min 17.00
25% 25.00
50% 35.00
75% 46.00
max 90.00
Name: age, dtype: float64
Race: White, sex: Male
count 19174.00
mean 39.65
std 13.44
min 17.00
25% 29.00
50% 38.00
75% 49.00
max 90.00
Name: age, dtype: float64
Example:Count the average time of work (hours-per-week) those who earning a little and a lot (salary) for each country (native-country).
for (country, salary), sub_df in df.groupby(["native-country", "salary"]):
print(country, salary, round(sub_df["hours-per-week"].mean(), 2))
? <=50K 40.16
? >50K 45.55
Cambodia <=50K 41.42
Cambodia >50K 40.0
Canada <=50K 37.91
Canada >50K 45.64
China <=50K 37.38
China >50K 38.9
Columbia <=50K 38.68
Columbia >50K 50.0
Cuba <=50K 37.99
Cuba >50K 42.44
Dominican-Republic <=50K 42.34
Dominican-Republic >50K 47.0
Ecuador <=50K 38.04
Ecuador >50K 48.75
El-Salvador <=50K 36.03
El-Salvador >50K 45.0
England <=50K 40.48
England >50K 44.53
France <=50K 41.06
France >50K 50.75
Germany <=50K 39.14
Germany >50K 44.98
Greece <=50K 41.81
Greece >50K 50.62
Guatemala <=50K 39.36
Guatemala >50K 36.67
Haiti <=50K 36.33
Haiti >50K 42.75
Holand-Netherlands <=50K 40.0
Honduras <=50K 34.33
Honduras >50K 60.0
Hong <=50K 39.14
Hong >50K 45.0
Hungary <=50K 31.3
Hungary >50K 50.0
India <=50K 38.23
India >50K 46.48
Iran <=50K 41.44
Iran >50K 47.5
Ireland <=50K 40.95
Ireland >50K 48.0
Italy <=50K 39.62
Italy >50K 45.4
Jamaica <=50K 38.24
Jamaica >50K 41.1
Japan <=50K 41.0
Japan >50K 47.96
Laos <=50K 40.38
Laos >50K 40.0
Mexico <=50K 40.0
Mexico >50K 46.58
Nicaragua <=50K 36.09
Nicaragua >50K 37.5
Outlying-US(Guam-USVI-etc) <=50K 41.86
Peru <=50K 35.07
Peru >50K 40.0
Philippines <=50K 38.07
Philippines >50K 43.03
Poland <=50K 38.17
Poland >50K 39.0
Portugal <=50K 41.94
Portugal >50K 41.5
Puerto-Rico <=50K 38.47
Puerto-Rico >50K 39.42
Scotland <=50K 39.44
Scotland >50K 46.67
South <=50K 40.16
South >50K 51.44
Taiwan <=50K 33.77
Taiwan >50K 46.8
Thailand <=50K 42.87
Thailand >50K 58.33
Trinadad&Tobago <=50K 37.06
Trinadad&Tobago >50K 40.0
United-States <=50K 38.8
United-States >50K 45.51
Vietnam <=50K 37.19
Vietnam >50K 39.2
Yugoslavia <=50K 41.6
Yugoslavia >50K 49.5
Better way to do the same, using crosstab
pd.crosstab(
df["native-country"],
df["salary"],
values=df["hours-per-week"],
aggfunc=np.mean,
).T
native-country | ? | Cambodia | Canada | China | Columbia | Cuba | Dominican-Republic | Ecuador | El-Salvador | England | ... | Portugal | Puerto-Rico | Scotland | South | Taiwan | Thailand | Trinadad&Tobago | United-States | Vietnam | Yugoslavia |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
salary | |||||||||||||||||||||
<=50K | 40.16 | 41.42 | 37.91 | 37.38 | 38.68 | 37.99 | 42.34 | 38.04 | 36.03 | 40.48 | ... | 41.94 | 38.47 | 39.44 | 40.16 | 33.77 | 42.87 | 37.06 | 38.80 | 37.19 | 41.6 |
>50K | 45.55 | 40.00 | 45.64 | 38.90 | 50.00 | 42.44 | 47.00 | 48.75 | 45.00 | 44.53 | ... | 41.50 | 39.42 | 46.67 | 51.44 | 46.80 | 58.33 | 40.00 | 45.51 | 39.20 | 49.5 |
2 rows × 42 columns
Summary tables
1. crosstab()
Panda’s crosstab()
-allows you to compute a cross-tabulation of two or more factors, also known as a contingency table. It provides a convenient way to analyze the relationship between multiple categorical variables in a tabular format.
When you have two or more categorical variables in your DataFrame, crosstab helps you create a table that shows the frequency distribution of each variable combination. It counts the occurrences of different categories from each variable and arranges them in a grid-like structure.
# Compute the cross-tabulation
result = pd.crosstab(df['sex'], df['salary'])
print(result)
salary <=50K >50K
sex
Female 9592 1179
Male 15128 6662
margins=True
parameter in pd.crosstab() adds row and column totals to the resulting cross-tabulation table.
# Compute the cross-tabulation
result = pd.crosstab(df['sex'], df['salary'], margins=True)
print(result)
salary <=50K >50K All
sex
Female 9592 1179 10771
Male 15128 6662 21790
All 24720 7841 32561
You can also add more then one value to compare against. The resulting table shows the counts for each combination of ‘sex’, ‘race’, and ‘salary’.
# Compute the cross-tabulation
result = pd.crosstab(df['sex'], [df['race'],df['salary']])
print(result)
race Amer-Indian-Eskimo Asian-Pac-Islander Black Other \
salary <=50K >50K <=50K >50K <=50K >50K <=50K >50K
sex
Female 107 12 303 43 1465 90 103 6
Male 168 24 460 233 1272 297 143 19
race White
salary <=50K >50K
sex
Female 7614 1028
Male 13085 6089
The normalize=True parameter is passed to compute the cross-tabulation as percentages. The resulting table displays the percentage distribution of each salary level for each gender category.
# Compute the cross-tabulation as percentage
result = pd.crosstab(df['sex'], df['salary'], normalize=True, margins=True)*100
print(result)
salary <=50K >50K All
sex
Female 29.46 3.62 33.08
Male 46.46 20.46 66.92
All 75.92 24.08 100.00
From this we can interpret:
- almost 30% females have a salary less than or equal to 50K, while only around 4% have a salary greater than 50K.
- a higher proportion of males (around 46%) have a salary less than or equal to 50K, while a significant proportion (around 20%) have a salary greater than 50K.
2. pivot_table()
- pivot_table() allows you to create a Excel-style pivot table from a DataFrame
- way to summarize and analyze data by grouping and aggregating values based on one or more variables.
Here is how to make pivot tables work:
index
parameter - a list of variables to group data by,- Specify the columns using the
columns
parameter. - Choose the values to compute summary statistics.
- Select the aggregation function with
aggfunc
- what statistics we need to calculate for groups, e.g. sum, mean, maximum, minimum or something else. - Customize with additional parameters like margins and fill_value.
Here is how we can calculate the average capital-gain, capital-loss and hours-per-week by sex.
# Create the pivot table
pivot_table = pd.pivot_table(df, index='sex', values=['capital-gain', 'capital-loss', 'hours-per-week'], aggfunc='mean')
print(pivot_table)
capital-gain capital-loss hours-per-week
sex
Female 568.41 61.19 36.41
Male 1329.37 100.21 42.43
On average, males have higher values for ‘capital-gain’, ‘capital-loss’, and ‘hours-per-week’ compared to females.
Here is a more complex example. Suppose we want to provide a comprehensive view of the average hours worked per week for various education levels and marital statuses, broken down by different races.
We want to:
- Set the index to [‘education’, ‘marital-status’]. Required to group and organize the data based on different education levels and marital statuses. Note, education becomes the primary index while marital status becomes the secondary index.
- Specify the columns of the pivot table to ‘race’. Required so we can create separate columns in the pivot table for each unique race value.
- Choose the ‘hours-per-week’ column as the values to be analyzed using the values parameter.
- Select the ‘mean’ as the aggregation function using the aggfunc parameter.
- Fill any missing values in the pivot table with 0 using the fill_value parameter.
# Create the pivot table
pivot_table = pd.pivot_table(df, index=['education', 'marital-status'], columns='race', values='hours-per-week', aggfunc='mean', fill_value=0)
print(pivot_table)
race Amer-Indian-Eskimo Asian-Pac-Islander \
education marital-status
10th Divorced 42.00 40.00
Married-civ-spouse 35.00 43.00
Married-spouse-absent 40.00 38.50
Never-married 46.00 35.00
Separated 0.00 0.00
... ... ...
Some-college Married-civ-spouse 45.37 45.41
Married-spouse-absent 40.00 35.75
Never-married 37.14 32.52
Separated 35.50 46.67
Widowed 24.50 5.00
race Black Other White
education marital-status
10th Divorced 37.43 42.00 41.51
Married-civ-spouse 41.07 45.50 41.81
Married-spouse-absent 37.33 0.00 44.33
Never-married 34.33 31.67 30.31
Separated 45.79 0.00 38.37
... ... ... ...
Some-college Married-civ-spouse 42.28 40.00 43.53
Married-spouse-absent 36.05 29.00 37.88
Never-married 35.82 33.50 33.74
Separated 38.44 46.00 40.74
Widowed 38.90 0.00 32.34
[101 rows x 5 columns]
Data Transformations tables
1. Adding Columns
The df.insert()
function is to insert a new column into a DataFrame at a specific position. It allows you to specify the index location where you want to insert the new column and provide the column name and its corresponding values.
Here’s the syntax for df.insert()
:
df.insert(loc, column, value, allow_duplicates=False)
loc
: The index location where you want to insert the new column.column
: The name of the new column.value
: The values to be assigned to the new column.allow_duplicates
(optional): If set toTrue
, allows duplicate column names; default isFalse
.
For example, we can create a new age group column (pandas series) based on pd.cut() and insert at postion 2.
# Create the age_group serie/column based on age values
age_group = pd.cut(df['age'], bins=[0, 18, 30, 50, 100], labels=['<18', '18-30', '31-50', '51+'])
# Insert the age_group column at position 2
df.insert(2, 'age_group', age_group)
df.head()
age | workclass | age_group | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 31-50 | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 31-50 | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 31-50 | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 51+ | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 18-30 | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
If we dont need inserting at the a specific position, we simply add a new column by specifying it. The new column gets added at the end.
df['education-num-group'] = pd.cut(df['education-num'], bins=[0, 5, 10, 15, 20], labels=['<5', '5-10', '15', '15+'])
df.head()
age | workclass | age_group | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | education-num-group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 31-50 | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | 15 |
1 | 50 | Self-emp-not-inc | 31-50 | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | 15 |
2 | 38 | Private | 31-50 | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | 5-10 |
3 | 53 | Private | 51+ | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | 5-10 |
4 | 28 | Private | 18-30 | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | 15 |
2. Deleting Columns or Rows
- To delete columns or rows, use the drop method, passing the required indexes and the axis parameter (1 if you delete columns, and nothing or 0 if you delete rows).
- The inplace argument tells whether to change the original DataFrame.
- With inplace=False, the drop method doesn’t change the existing DataFrame and returns a new one with dropped rows or columns.
- With inplace=True, it alters the DataFrame.
# get rid of just created columns
df.drop(["education-num-group", "age_group"], axis=1, inplace=True)
df.head()
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
# and here’s how you can delete rows
print('Before change:{}'.format(df.shape))
df.drop([1, 2], inplace=True)
print('After change:{}'.format(df.shape))
Before change:(32561, 15)
After change:(32559, 15)
References
Twitter Facebook LinkedIn
Comments