Pandas: Data Analysis Essentials

35 minute read

A significant portion of working on real-world projects, around 70-80%, is actually spent on preparing and cleaning the data. This is where Pandas comes in handy: Pandas is a powerful tool that empowers analysts to efficiently manipulate, analyze, and visualize data. Here’s a brief overview of how you can perform data analysis with Pandas.

Introduction

Pandas: Python library for data manipulation and analysis.
Installation: Run pip install pandas in your Python environment.
DataFrames: Two-dimensional labeled data structures in Pandas, used for storing and analyzing tabular data efficiently.
Series: One-dimensional labeled data structure in Pandas, representing a single column or row of data, capable of holding various data types.

Data

The UCI Adult Income dataset is used for this article. It contains information about individuals from the 1994 United States Census, and the goal is to predict whether a person’s income exceeds $50,000 per year based on various attributes. It consists of 14 attributes, including features such as age, education level, work-class, marital status, occupation, race, sex, capital gain, capital loss, hours per week, native country, and the target variable, which indicates whether the income exceeds $50,000 or not.

Setup

import numpy and pandas
setup precision for two decimal places
read csv from URL into a dataframe

import numpy as np
import pandas as pd

pd.set_option("display.precision", 2)

DATA_URL = "https://vmlverse.github.io/assets/datasets/adult.data.csv"
df = pd.read_csv(DATA_URL)

Exploring the data

Once data loaded into a DataFrame, you can examine its contents and structure.

1. df.head(n)

Returns the first n rows of the DataFrame. (Default:5):

df.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

2. df.tail(n)

Returns the last n rows of the DataFrame.(Default:5)

df.tail()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
32556	27	Private	257302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	38	United-States	<=50K
32557	40	Private	154374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	40	United-States	>50K
32558	58	Private	151910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	40	United-States	<=50K
32559	22	Private	201490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	20	United-States	<=50K
32560	52	Self-emp-inc	287927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15024	40	United-States	>50K

3. df.shape

Returns the dimensions (rows, columns) of the DataFrame.

df.shape

(32561, 15)

4. df.info()

Provides information about the DataFrame, including the data - types of columns and missing values.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

5. df.describe()

Generates descriptive statistics for numerical columns:

descriptive statistics (count, mean, min, max, etc.)
numerical columns (int64 and float64 types)
In the below code, we can observe the metrics for numerical columns like age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week.

df.describe()

	age	fnlwgt	education-num	capital-gain	capital-loss	hours-per-week
count	32561.00	3.26e+04	32561.00	32561.00	32561.00	32561.00
mean	38.58	1.90e+05	10.08	1077.65	87.30	40.44
std	13.64	1.06e+05	2.57	7385.29	402.96	12.35
min	17.00	1.23e+04	1.00	0.00	0.00	1.00
25%	28.00	1.18e+05	9.00	0.00	0.00	40.00
50%	37.00	1.78e+05	10.00	0.00	0.00	40.00
75%	48.00	2.37e+05	12.00	0.00	0.00	45.00
max	90.00	1.48e+06	16.00	99999.00	4356.00	99.00

df.describe(include=[]): to see statistics on non-numerical features:

One has to explicitly indicate data types of interest in the include parameter.
In the below code, we can observe the metrics for non-numerical columns like workclass, education, marital-status, occupation, relationship, race, sex, native-country & salary.
However, note that unlike the stats for numerical columns, we can see statss like count, number of unique values, top occuring value and frequently occuring value.

df.describe(include=["object", "bool"])

	workclass	education	marital-status	occupation	relationship	race	sex	native-country	salary
count	32561	32561	32561	32561	32561	32561	32561	32561	32561
unique	9	16	7	15	6	5	2	42	2
top	Private	HS-grad	Married-civ-spouse	Prof-specialty	Husband	White	Male	United-States	<=50K
freq	22696	10501	14976	4140	13193	27816	21790	29170	24720

6. df.columns

prints the column names of a DataFrame

df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'salary'],
      dtype='object')

7. df.astype()

To change column data type
For instance to change age column which is a int64 to float datatype, we can do this operation.

df["age"] = df["age"].astype("float")

we can now run the df.info() describe column to verify if the conversion happened on the age column.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             32561 non-null  float64
 1   workclass       32561 non-null  object 
 2   fnlwgt          32561 non-null  int64  
 3   education       32561 non-null  object 
 4   education-num   32561 non-null  int64  
 5   marital-status  32561 non-null  object 
 6   occupation      32561 non-null  object 
 7   relationship    32561 non-null  object 
 8   race            32561 non-null  object 
 9   sex             32561 non-null  object 
 10  capital-gain    32561 non-null  int64  
 11  capital-loss    32561 non-null  int64  
 12  hours-per-week  32561 non-null  int64  
 13  native-country  32561 non-null  object 
 14  salary          32561 non-null  object 
dtypes: float64(1), int64(5), object(9)
memory usage: 3.7+ MB

8. df.value_counts()

To analyze the distribution of values in your dataset:

df["sex"].value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

To calculate fractions, pass normalize=True to the value_counts function. You can use this to determine the percentage distribution.

df["sex"].value_counts(normalize=True)

Male      0.67
Female    0.33
Name: sex, dtype: float64

df.value_counts()

age  workclass         fnlwgt  education     education-num  marital-status      occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country  salary
25   Private           195994  1st-4th       2              Never-married       Priv-house-serv    Not-in-family  White  Female  0             0             40              Guatemala       <=50K     3
23   Private           240137  5th-6th       3              Never-married       Handlers-cleaners  Not-in-family  White  Male    0             0             55              Mexico          <=50K     2
38   Private           207202  HS-grad       9              Married-civ-spouse  Machine-op-inspct  Husband        White  Male    0             0             48              United-States   >50K      2
30   Private           144593  HS-grad       9              Never-married       Other-service      Not-in-family  Black  Male    0             0             40              ?               <=50K     2
49   Self-emp-not-inc  43479   Some-college  10             Married-civ-spouse  Craft-repair       Husband        White  Male    0             0             40              United-States   <=50K     2
                                                                                                                                                                                                      ..
31   Private           128567  HS-grad       9              Married-civ-spouse  Craft-repair       Husband        White  Male    0             0             40              United-States   <=50K     1
                       128493  HS-grad       9              Divorced            Other-service      Not-in-family  White  Female  0             0             25              United-States   <=50K     1
                       128220  7th-8th       4              Widowed             Adm-clerical       Not-in-family  White  Female  0             0             35              United-States   <=50K     1
                       127610  Bachelors     13             Married-civ-spouse  Prof-specialty     Wife           White  Female  0             0             40              United-States   >50K      1
90   Self-emp-not-inc  282095  Some-college  10             Married-civ-spouse  Farming-fishing    Husband        White  Male    0             0             40              United-States   <=50K     1
Length: 32537, dtype: int64

Sorting

DataFrame can be sorted by the value of one of the variables (i.e columns).
For example, we can sort by age (use ascending=False to sort in descending order):

df.sort_values(by="age", ascending=False).head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	hours-per-week	native-country	salary
5406	90.0	Private	51744	Masters	14	Never-married	Exec-managerial	Not-in-family	Black	Male	50	United-States	>50K
6624	90.0	Private	313986	11th	7	Married-civ-spouse	Craft-repair	Husband	White	Male	40	United-States	<=50K
20610	90.0	Private	206667	Masters	14	Married-civ-spouse	Prof-specialty	Wife	White	Female	40	United-States	>50K
1040	90.0	Private	137018	HS-grad	9	Never-married	Other-service	Not-in-family	White	Female	40	United-States	<=50K
1935	90.0	Private	221832	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	45	United-States	<=50K

We can also sort by multiple columns:

df.sort_values(by=["age","education-num"], ascending=[False,True]).head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
24238	90.0	?	166343	1st-4th	2	Widowed	?	Not-in-family	Black	Female	0	40	United-States	<=50K
19747	90.0	Private	226968	7th-8th	4	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	40	United-States	<=50K
25303	90.0	?	175444	7th-8th	4	Separated	?	Not-in-family	White	Female	0	15	United-States	<=50K
32367	90.0	Local-gov	214594	7th-8th	4	Married-civ-spouse	Protective-serv	Husband	White	Male	2653	40	United-States	<=50K
5272	90.0	Private	141758	9th	5	Never-married	Adm-clerical	Not-in-family	White	Female	0	40	United-States	<=50K

Indexing and retrieving data

1. Calculate mean

df['hours-per-week'].mean()

40.437455852092995

2. Column Indexing

DataFrames can be indexed by column name (label) or row name (index).
The loc method is used for indexing by name, while iloc() is used for indexing by number.

loc()	iloc()
used for indexing by name	used for indexing by number
df.loc[‘row_index’] will return the row with the specified index label	df.iloc[row_number] will return the row at the specified integer position
df.loc[row_label, column_label] will return the value at the specified row and column labels	df.iloc[row_number, column_number] will return the value at the specified row and column positions

Fetch rows 2 to 4 (inclusive)

df.loc[2:4]

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	hours-per-week	native-country	salary
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	40	Cuba	<=50K

Fetch rows 2 to 4 (not inclusive). Can use iloc[] as well.

df.iloc[2:4]

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	0	40	United-States	<=50K

Display rows 2 to 4 (inclusive) and columns from education to relationship.

df.loc[2:4, "education":"relationship"]

	education	education-num	marital-status	occupation	relationship
2	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family
3	11th	7	Married-civ-spouse	Handlers-cleaners	Husband
4	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife

Display rows 2 to 4 (inclusive) and first three columns.

df.iloc[2:4, 0:3]

	age	workclass	fnlwgt
2	38	Private	215646
3	53	Private	234721

Note: You can cross the methods. For example, cannot use numerical representation on loc() or use alphabetic representation on iloc()

If we need the first or the last line of the data frame, we can use the df[:1] or df[-1:] construction:

df[:1]: start at 0. end at 1 (non-inclusive).
Result: displays first row alone.

df[:1]

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	0	40	United-States	<=50K

Indexing can be done in loops as well. i.e., you can use negative numbers as well. e.g., df[-7:-2]

df[-1:] : start at -1(last row). end at 0 (non-inclusive)
Result: displays last row alone.

df[-1:]

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
32560	52	Self-emp-inc	287927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15024	0	40	United-States	>50K

3. Boolean indexing

Boolean indexing with one column is also very convenient.
The syntax is df[P(df[‘Name’])], where P is some logical condition that is checked for each element of the Name column.
The result of such indexing is the DataFrame consisting only of the rows that satisfy the P condition on the Name column.

What are the average values of numerical features for male users?

Here we’ll resort to an additional method select_dtypes to select all numeric columns.

df.select_dtypes(include=np.number) selects columns in the DataFrame that have numeric data types. It filters out columns with non-numeric data types.
[df[“sex”] == “male”] filters the DataFrame rows based on a condition. It selects only the rows where the “sex” column has a value of “male”.
.mean() calculates the mean (average) value for each column in the filtered DataFrame.
Therefore, the code calculates the mean value for each numeric column in the DataFrame (df) but only for the rows where the “sex” column is “male”. This provides the average values of the numeric columns specifically for the male subset of the data.

df.select_dtypes(include=np.number)[df["sex"] == "Male"].mean()

age                   39.43
fnlwgt            191771.45
education-num         10.10
capital-gain        1329.37
capital-loss         100.21
hours-per-week        42.43
dtype: float64

What is the mean age of Females in the survey?
- Boolean condtion P => df[“sex”] == “Female”
- df[P][“age”]
- using the method .mean() to calculate mean in the column of interest.

df[df["sex"] == "Female"]["age"].mean()

36.85823043357163

What is the max age of males who have a capital gain greater than 15,000?
- Boolean condtion P => (df[“sex”] == “Male”) & (df[“capital-gain”] > 15000 )
- df[P][“age”]
- using the method .max() to calculate max in the column of interest.

df[(df["sex"] == "Male") & (df["capital-gain"] > 15000 )]["age"].max()

90.0

Example: Finding The proportion of German citizens (native-country feature):

df.shape[0] gives the number of rows in DataFrame
Dividing the sum of True values by the total number of rows gives the percentage of individuals with Germany as their native country.

float((df["native-country"] == "Germany").sum()) / df.shape[0]

0.004207487485028101

Example: What are mean value and standard deviation of the age of those who receive more than 50K per year (salary feature) and those who receive less than 50K per year?

ages1 = df[df["salary"] == ">50K"]["age"]
ages2 = df[df["salary"] == "<=50K"]["age"]
print(
    "The average age of the rich: {0} +- {1} years, poor - {2} +- {3} years.".format(
        round(ages1.mean()),
        round(ages1.std(), 1),
        round(ages2.mean()),
        round(ages2.std(), 1),
    )
)

The average age of the rich: 44 +- 10.5 years, poor - 37 +- 14.0 years.

Example: Is it true that people who earn more than 50K have at least high school education? (education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate feature)

df[df["salary"] == ">50K"]["education"].unique()  # No

array(['HS-grad', 'Masters', 'Bachelors', 'Some-college', 'Assoc-voc',
       'Doctorate', 'Prof-school', 'Assoc-acdm', '7th-8th', '12th',
       '10th', '11th', '9th', '5th-6th', '1st-4th'], dtype=object)

Example: Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (marital-status feature)? Consider as married those who have a marital-status starting with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.

# married men
df[(df["sex"] == "Male")
     & (df["marital-status"].str.startswith("Married"))][
    "salary"
].value_counts(normalize=True)

<=50K    0.56
>50K     0.44
Name: salary, dtype: float64

# single men
df[
    (df["sex"] == "Male")
    & ~(df["marital-status"].str.startswith("Married"))
]["salary"].value_counts(normalize=True)

<=50K    0.92
>50K     0.08
Name: salary, dtype: float64

4. Apply Method

df.apply() method can be used to apply functions on Cells, Rows or Columns

For example, find the max value across all columns.

df.apply(np.max)

age                             90
workclass              Without-pay
fnlwgt                     1484705
education             Some-college
education-num                   16
marital-status             Widowed
occupation        Transport-moving
relationship                  Wife
race                         White
sex                           Male
capital-gain                 99999
capital-loss                  4356
hours-per-week                  99
native-country          Yugoslavia
salary                        >50K
dtype: object

Lambda functions are very convenient to apply using apply() method.

For eg, select the native-countries starting with Y.
To break this down:
- df["native-country"] selects the native-country.
- df["native-country"].apply(lambda state: state[0] == "Y") applies lambda function to include only the rows where the “native-country” column starts with the letter “Y”
- df[].head() - then returns the first few rows of the filtered DataFrame.

df[df["native-country"].apply(lambda country: country[0] == "Y")].head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	hours-per-week	native-country	salary
1025	56	Private	169133	HS-grad	9	Married-civ-spouse	Other-service	Husband	White	Male	50	Yugoslavia	<=50K
4447	25	Private	191230	Some-college	10	Never-married	Exec-managerial	Own-child	White	Female	40	Yugoslavia	<=50K
6328	20	Private	175069	Some-college	10	Never-married	Adm-clerical	Own-child	White	Male	40	Yugoslavia	<=50K
7287	35	Private	164526	HS-grad	9	Married-civ-spouse	Other-service	Husband	White	Male	40	Yugoslavia	>50K
12506	40	Local-gov	183096	9th	5	Married-civ-spouse	Other-service	Wife	White	Female	40	Yugoslavia	>50K

5. Map & Replace Method

map method can be used to replace values in a column by passing a dictionary of the form {old_value: new_value} as its argument:

d = {"Male": 1, "Female": 2}
df["sex_1"] = df["sex"].map(d)
df.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	NaN	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	NaN	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	NaN	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	NaN	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	NaN	0	40	Cuba	<=50K

same thing can be done with the replace method.

df['sex_2'] = df['sex'].replace(d)
df.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary	sex_2
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K	1
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K	1
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K	1
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K	1
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K	2

However, note replace method will not do anything with values not found in the mapping dictionary, while while map will change them to NaNs.

# Create a Series
s = pd.Series(['A', 'B', 'C', 'D'])

# Create a mapping dictionary
mapping = {'A': 'Apple', 'B': 'Banana'}

# Using replace method
replaced = s.replace(mapping)
print('replace method will not do anything with values not found in the mapping dictionary')
print(replaced)

# Using map method
mapped = s.map(mapping)
print('map method will change values not found in the mapping dictionary to NaNs')
print(mapped)

replace method will not do anything with values not found in the mapping dictionary
0     Apple
1    Banana
2         C
3         D
dtype: object
map method wwill change values not found in the mapping dictionary to NaNs
0     Apple
1    Banana
2       NaN
3       NaN
dtype: object

6. Grouping

Grouping data in Pandas works as follows: df.groupby(by=grouping_columns)[columns_to_show].function()

First, the groupby method divides the grouping_columns by their values. They become a new index in the resulting dataframe.
Then, columns of interest are selected (columns_to_show). If columns_to_show is not included, all non groupby clauses will be included.
Finally, one or several functions are applied to the obtained groups per selected columns.

To understand this more intuitively, we can start with this small example.

We can group by education. By doing this we are creating a new dataframe with the education column as index.
However, this dataframe as no series values. For this we can apply as simple function like count()
The resulting dataframe shows the count each education value across all other columns. The data is repeating because, count() is simply counting the number of entries for each education type in age, workclass, etc which obviously will be same.

result = df.groupby('education').count()
print(result)

                age  workclass  fnlwgt  education-num  marital-status  \
education                                                               
10th            933        933     933            933             933   
11th           1175       1175    1175           1175            1175   
12th            433        433     433            433             433   
1st-4th         168        168     168            168             168   
5th-6th         333        333     333            333             333   
7th-8th         646        646     646            646             646   
9th             514        514     514            514             514   
Assoc-acdm     1067       1067    1067           1067            1067   
Assoc-voc      1382       1382    1382           1382            1382   
Bachelors      5355       5355    5355           5355            5355   
Doctorate       413        413     413            413             413   
HS-grad       10501      10501   10501          10501           10501   
Masters        1723       1723    1723           1723            1723   
Preschool        51         51      51             51              51   
Prof-school     576        576     576            576             576   
Some-college   7291       7291    7291           7291            7291   

              occupation  relationship   race    sex  capital-gain  \
education                                                            
10th                 933           933    933    933           933   
11th                1175          1175   1175   1175          1175   
12th                 433           433    433    433           433   
1st-4th              168           168    168    168           168   
5th-6th              333           333    333    333           333   
7th-8th              646           646    646    646           646   
9th                  514           514    514    514           514   
Assoc-acdm          1067          1067   1067   1067          1067   
Assoc-voc           1382          1382   1382   1382          1382   
Bachelors           5355          5355   5355   5355          5355   
Doctorate            413           413    413    413           413   
HS-grad            10501         10501  10501  10501         10501   
Masters             1723          1723   1723   1723          1723   
Preschool             51            51     51     51            51   
Prof-school          576           576    576    576           576   
Some-college        7291          7291   7291   7291          7291   

              capital-loss  hours-per-week  native-country  salary  sex_2  
education                                                                  
10th                   933             933             933     933    933  
11th                  1175            1175            1175    1175   1175  
12th                   433             433             433     433    433  
1st-4th                168             168             168     168    168  
5th-6th                333             333             333     333    333  
7th-8th                646             646             646     646    646  
9th                    514             514             514     514    514  
Assoc-acdm            1067            1067            1067    1067   1067  
Assoc-voc             1382            1382            1382    1382   1382  
Bachelors             5355            5355            5355    5355   5355  
Doctorate              413             413             413     413    413  
HS-grad              10501           10501           10501   10501  10501  
Masters               1723            1723            1723    1723   1723  
Preschool               51              51              51      51     51  
Prof-school            576             576             576     576    576  
Some-college          7291            7291            7291    7291   7291  

This will start to make more sense by using a more meaningful function like .mean() which calculates the mean value for all numerical columns under each education type.

result = df.groupby('education').mean()
print(result)

                age     fnlwgt  education-num  capital-gain  capital-loss  \
education                                                                   
10th          37.43  196832.47            6.0        404.57         56.85   
11th          32.36  194928.08            7.0        215.10         50.08   
12th          32.00  199097.51            8.0        284.09         32.34   
1st-4th       46.14  239303.00            2.0        125.88         48.33   
5th-6th       42.89  232448.33            3.0        176.02         68.25   
7th-8th       48.45  188079.17            4.0        233.94         65.67   
9th           41.06  202485.07            5.0        342.09         29.00   
Assoc-acdm    37.38  193424.09           12.0        640.40         93.42   
Assoc-voc     38.55  181936.02           11.0        715.05         72.75   
Bachelors     38.90  188055.91           13.0       1756.30        118.35   
Doctorate     47.70  186698.76           16.0       4770.15        262.85   
HS-grad       38.97  189538.74            9.0        576.80         70.47   
Masters       44.05  179852.36           14.0       2562.56        166.72   
Preschool     42.76  235889.37            1.0        898.39         66.49   
Prof-school   44.75  185663.71           15.0      10414.42        231.20   
Some-college  35.76  188742.92           10.0        598.82         71.64   

              hours-per-week  sex_2  
education                            
10th                   37.05   1.32  
11th                   33.93   1.37  
12th                   35.78   1.33  
1st-4th                38.26   1.27  
5th-6th                38.90   1.25  
7th-8th                39.37   1.25  
9th                    38.04   1.28  
Assoc-acdm             40.50   1.39  
Assoc-voc              41.61   1.36  
Bachelors              42.61   1.30  
Doctorate              46.97   1.21  
HS-grad                40.58   1.32  
Masters                43.84   1.31  
Preschool              36.65   1.31  
Prof-school            47.43   1.16  
Some-college           38.85   1.38  


<ipython-input-28-a0bc15c64f52>:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  result = df.groupby('education').mean()

In the below example:

we grouped the DataFrame df by the ‘education’ column using the groupby method.
Then, we selected the ‘age’ column to calculate the average age for each education level.
The result is a Series that shows the average age for each education category.

# result = df.groupby('education')['age'].mean()
result = df.groupby('education')['age'].mean()
print(result)

education
10th            37.43
11th            32.36
12th            32.00
1st-4th         46.14
5th-6th         42.89
7th-8th         48.45
9th             41.06
Assoc-acdm      37.38
Assoc-voc       38.55
Bachelors       38.90
Doctorate       47.70
HS-grad         38.97
Masters         44.05
Preschool       42.76
Prof-school     44.75
Some-college    35.76
Name: age, dtype: float64

Here is another example. Suppose we want to groupby Salary and determine how people of various occupations are collecting salary, we can do the following:

we can group the DataFrame df by both the ‘occupation’ and ‘salary’ columns using the groupby method. This is like first creating an index for occupation and then create index salary under each occupation type.
Then, we applied the size function to count the number of occurrences for each combination of occupation and salary.
The result is a Series that shows the count for each occupation and salary category.

# Grouping by 'occupation' and 'salary' columns and counting the number of occurrences
result = df.groupby(['occupation', 'salary']).size()
print(result)

occupation         salary
?                  <=50K     1652
                   >50K       191
Adm-clerical       <=50K     3263
                   >50K       507
Armed-Forces       <=50K        8
                   >50K         1
Craft-repair       <=50K     3170
                   >50K       929
Exec-managerial    <=50K     2098
                   >50K      1968
Farming-fishing    <=50K      879
                   >50K       115
Handlers-cleaners  <=50K     1284
                   >50K        86
Machine-op-inspct  <=50K     1752
                   >50K       250
Other-service      <=50K     3158
                   >50K       137
Priv-house-serv    <=50K      148
                   >50K         1
Prof-specialty     <=50K     2281
                   >50K      1859
Protective-serv    <=50K      438
                   >50K       211
Sales              <=50K     2667
                   >50K       983
Tech-support       <=50K      645
                   >50K       283
Transport-moving   <=50K     1277
                   >50K       320
dtype: int64

We can extend this to show percentage by doing the following:

# Grouping by 'occupation' and 'salary' columns and calculating the percentage
result = df.groupby(['occupation', 'salary']).size() / df.groupby(['occupation']).size() * 100
print(result)

occupation         salary
?                  <=50K     89.64
                   >50K      10.36
Adm-clerical       <=50K     86.55
                   >50K      13.45
Armed-Forces       <=50K     88.89
                   >50K      11.11
Craft-repair       <=50K     77.34
                   >50K      22.66
Exec-managerial    <=50K     51.60
                   >50K      48.40
Farming-fishing    <=50K     88.43
                   >50K      11.57
Handlers-cleaners  <=50K     93.72
                   >50K       6.28
Machine-op-inspct  <=50K     87.51
                   >50K      12.49
Other-service      <=50K     95.84
                   >50K       4.16
Priv-house-serv    <=50K     99.33
                   >50K       0.67
Prof-specialty     <=50K     55.10
                   >50K      44.90
Protective-serv    <=50K     67.49
                   >50K      32.51
Sales              <=50K     73.07
                   >50K      26.93
Tech-support       <=50K     69.50
                   >50K      30.50
Transport-moving   <=50K     79.96
                   >50K      20.04
dtype: float64

In this example, we can compare the normal describe method vs the describe method with groupby. As we can see, the group by method is indexed on occupation, where as the regular df is indexed on 0 to N row index.

df.describe(percentiles=[])

	age	fnlwgt	education-num	capital-gain	capital-loss	hours-per-week	sex_2
count	32561.00	3.26e+04	32561.00	32561.00	32561.00	32561.00	32561.00
mean	38.58	1.90e+05	10.08	1077.65	87.30	40.44	1.33
std	13.64	1.06e+05	2.57	7385.29	402.96	12.35	0.47
min	17.00	1.23e+04	1.00	0.00	0.00	1.00	1.00
50%	37.00	1.78e+05	10.00	0.00	0.00	40.00	1.00
max	90.00	1.48e+06	16.00	99999.00	4356.00	99.00	2.00

df.groupby(["occupation"]).describe(percentiles=[])

	age						fnlwgt				...	hours-per-week				sex_2
	count	mean	std	min	50%	max	count	mean	std	min	...	std	min	50%	max	count	mean	std	min	50%	max
occupation
?	1843.0	40.88	20.34	17.0	35.0	90.0	1843.0	188658.67	107089.08	12285.0	...	14.91	1.0	36.0	99.0	1843.0	1.46	0.50	1.0	1.0	2.0
Adm-clerical	3770.0	36.96	13.36	17.0	35.0	90.0	3770.0	192043.40	103163.89	19302.0	...	9.59	1.0	40.0	80.0	3770.0	1.67	0.47	1.0	2.0	2.0
Armed-Forces	9.0	30.22	8.09	23.0	29.0	46.0	9.0	215425.89	83315.89	76313.0	...	14.07	8.0	40.0	60.0	9.0	1.00	0.00	1.0	1.0	1.0
Craft-repair	4099.0	39.03	11.61	17.0	38.0	90.0	4099.0	192132.60	107434.09	19491.0	...	9.05	1.0	40.0	99.0	4099.0	1.05	0.23	1.0	1.0	2.0
Exec-managerial	4066.0	42.17	11.97	17.0	41.0	90.0	4066.0	184414.01	103314.96	19914.0	...	11.11	1.0	40.0	99.0	4066.0	1.29	0.45	1.0	1.0	2.0
Farming-fishing	994.0	41.21	15.07	17.0	39.0	90.0	994.0	170190.18	116925.90	20795.0	...	17.32	2.0	40.0	99.0	994.0	1.07	0.25	1.0	1.0	2.0
Handlers-cleaners	1370.0	32.17	12.37	17.0	29.0	90.0	1370.0	204391.01	111934.10	19214.0	...	10.58	2.0	40.0	95.0	1370.0	1.12	0.32	1.0	1.0	2.0
Machine-op-inspct	2002.0	37.72	12.07	17.0	36.0	90.0	2002.0	195040.88	98159.93	13769.0	...	7.59	1.0	40.0	96.0	2002.0	1.27	0.45	1.0	1.0	2.0
Other-service	3295.0	34.95	14.52	17.0	32.0	90.0	3295.0	188608.45	109452.80	19752.0	...	12.71	1.0	40.0	99.0	3295.0	1.55	0.50	1.0	2.0	2.0
Priv-house-serv	149.0	41.72	18.63	17.0	40.0	81.0	149.0	201107.52	102595.40	24384.0	...	16.18	4.0	35.0	99.0	149.0	1.95	0.23	1.0	2.0	2.0
Prof-specialty	4140.0	40.52	12.02	17.0	40.0	90.0	4140.0	185296.61	100135.45	14878.0	...	12.54	1.0	40.0	99.0	4140.0	1.37	0.48	1.0	1.0	2.0
Protective-serv	649.0	38.95	12.82	17.0	36.0	90.0	649.0	202039.95	101910.29	19302.0	...	12.33	3.0	40.0	99.0	649.0	1.12	0.32	1.0	1.0	2.0
Sales	3650.0	37.35	14.19	17.0	35.0	90.0	3650.0	190885.89	103779.87	19410.0	...	13.24	2.0	40.0	99.0	3650.0	1.35	0.48	1.0	1.0	2.0
Tech-support	928.0	37.02	11.32	17.0	36.0	73.0	928.0	192098.30	113987.28	19847.0	...	10.58	3.0	40.0	99.0	928.0	1.38	0.48	1.0	1.0	2.0
Transport-moving	1597.0	40.20	12.45	17.0	39.0	90.0	1597.0	190366.36	109240.30	18827.0	...	12.72	5.0	40.0	99.0	1597.0	1.06	0.23	1.0	1.0	2.0

15 rows × 42 columns

We can do the same thing using agg() function.

df.groupby(["occupation"]).agg([np.mean, np.std, np.min, np.max])

<ipython-input-32-f3544e70bb41>:1: FutureWarning: ['workclass', 'education', 'marital-status', 'relationship', 'race', 'sex', 'native-country', 'salary'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
  df.groupby(["occupation"]).agg([np.mean, np.std, np.min, np.max])

	age				fnlwgt				education-num		...	capital-loss		hours-per-week				sex_2
	mean	std	amin	amax	mean	std	amin	amax	mean	std	...	amin	amax	mean	std	amin	amax	mean	std	amin	amax
occupation
?	40.88	20.34	17	90	188658.67	107089.08	12285	981628	9.25	2.60	...	0	4356	31.91	14.91	1	99	1.46	0.50	1	2
Adm-clerical	36.96	13.36	17	90	192043.40	103163.89	19302	1033222	10.11	1.70	...	0	3770	37.56	9.59	1	80	1.67	0.47	1	2
Armed-Forces	30.22	8.09	23	46	215425.89	83315.89	76313	344415	10.11	2.03	...	0	1887	40.67	14.07	8	60	1.00	0.00	1	1
Craft-repair	39.03	11.61	17	90	192132.60	107434.09	19491	1455435	9.11	2.04	...	0	3004	42.30	9.05	1	99	1.05	0.23	1	2
Exec-managerial	42.17	11.97	17	90	184414.01	103314.96	19914	1484705	11.45	2.14	...	0	4356	44.99	11.11	1	99	1.29	0.45	1	2
Farming-fishing	41.21	15.07	17	90	170190.18	116925.90	20795	663394	8.61	2.76	...	0	2457	46.99	17.32	2	99	1.07	0.25	1	2
Handlers-cleaners	32.17	12.37	17	90	204391.01	111934.10	19214	889965	8.51	2.20	...	0	2824	37.95	10.58	2	95	1.12	0.32	1	2
Machine-op-inspct	37.72	12.07	17	90	195040.88	98159.93	13769	1033222	8.49	2.29	...	0	3900	40.76	7.59	1	96	1.27	0.45	1	2
Other-service	34.95	14.52	17	90	188608.45	109452.80	19752	1366120	8.78	2.30	...	0	3770	34.70	12.71	1	99	1.55	0.50	1	2
Priv-house-serv	41.72	18.63	17	81	201107.52	102595.40	24384	549430	7.36	3.11	...	0	1602	32.89	16.18	4	99	1.95	0.23	1	2
Prof-specialty	40.52	12.02	17	90	185296.61	100135.45	14878	747719	12.91	2.03	...	0	3900	42.39	12.54	1	99	1.37	0.48	1	2
Protective-serv	38.95	12.82	17	90	202039.95	101910.29	19302	857532	10.18	1.87	...	0	2444	42.87	12.33	3	99	1.12	0.32	1	2
Sales	37.35	14.19	17	90	190885.89	103779.87	19410	1226583	10.30	2.18	...	0	2824	40.78	13.24	2	99	1.35	0.48	1	2
Tech-support	37.02	11.32	17	73	192098.30	113987.28	19847	1268339	10.99	1.80	...	0	2444	39.43	10.58	3	99	1.38	0.48	1	2
Transport-moving	40.20	12.45	17	90	190366.36	109240.30	18827	1184622	8.77	2.04	...	0	2824	44.66	12.72	5	99	1.06	0.23	1	2

15 rows × 28 columns

Example: Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo race

for (race, sex), sub_df in df.groupby(["race", "sex"]):
    print("Race: {0}, sex: {1}".format(race, sex))
    print(sub_df["age"].describe())

Race: Amer-Indian-Eskimo, sex: Female
count    119.00
mean      37.12
std       13.11
min       17.00
25%       27.00
50%       36.00
75%       46.00
max       80.00
Name: age, dtype: float64
Race: Amer-Indian-Eskimo, sex: Male
count    192.00
mean      37.21
std       12.05
min       17.00
25%       28.00
50%       35.00
75%       45.00
max       82.00
Name: age, dtype: float64
Race: Asian-Pac-Islander, sex: Female
count    346.00
mean      35.09
std       12.30
min       17.00
25%       25.00
50%       33.00
75%       43.75
max       75.00
Name: age, dtype: float64
Race: Asian-Pac-Islander, sex: Male
count    693.00
mean      39.07
std       12.88
min       18.00
25%       29.00
50%       37.00
75%       46.00
max       90.00
Name: age, dtype: float64
Race: Black, sex: Female
count    1555.00
mean       37.85
std        12.64
min        17.00
25%        28.00
50%        37.00
75%        46.00
max        90.00
Name: age, dtype: float64
Race: Black, sex: Male
count    1569.00
mean       37.68
std        12.88
min        17.00
25%        27.00
50%        36.00
75%        46.00
max        90.00
Name: age, dtype: float64
Race: Other, sex: Female
count    109.00
mean      31.68
std       11.63
min       17.00
25%       23.00
50%       29.00
75%       39.00
max       74.00
Name: age, dtype: float64
Race: Other, sex: Male
count    162.00
mean      34.65
std       11.36
min       17.00
25%       26.00
50%       32.00
75%       42.00
max       77.00
Name: age, dtype: float64
Race: White, sex: Female
count    8642.00
mean       36.81
std        14.33
min        17.00
25%        25.00
50%        35.00
75%        46.00
max        90.00
Name: age, dtype: float64
Race: White, sex: Male
count    19174.00
mean        39.65
std         13.44
min         17.00
25%         29.00
50%         38.00
75%         49.00
max         90.00
Name: age, dtype: float64

Example:Count the average time of work (hours-per-week) those who earning a little and a lot (salary) for each country (native-country).

for (country, salary), sub_df in df.groupby(["native-country", "salary"]):
    print(country, salary, round(sub_df["hours-per-week"].mean(), 2))

? <=50K 40.16
? >50K 45.55
Cambodia <=50K 41.42
Cambodia >50K 40.0
Canada <=50K 37.91
Canada >50K 45.64
China <=50K 37.38
China >50K 38.9
Columbia <=50K 38.68
Columbia >50K 50.0
Cuba <=50K 37.99
Cuba >50K 42.44
Dominican-Republic <=50K 42.34
Dominican-Republic >50K 47.0
Ecuador <=50K 38.04
Ecuador >50K 48.75
El-Salvador <=50K 36.03
El-Salvador >50K 45.0
England <=50K 40.48
England >50K 44.53
France <=50K 41.06
France >50K 50.75
Germany <=50K 39.14
Germany >50K 44.98
Greece <=50K 41.81
Greece >50K 50.62
Guatemala <=50K 39.36
Guatemala >50K 36.67
Haiti <=50K 36.33
Haiti >50K 42.75
Holand-Netherlands <=50K 40.0
Honduras <=50K 34.33
Honduras >50K 60.0
Hong <=50K 39.14
Hong >50K 45.0
Hungary <=50K 31.3
Hungary >50K 50.0
India <=50K 38.23
India >50K 46.48
Iran <=50K 41.44
Iran >50K 47.5
Ireland <=50K 40.95
Ireland >50K 48.0
Italy <=50K 39.62
Italy >50K 45.4
Jamaica <=50K 38.24
Jamaica >50K 41.1
Japan <=50K 41.0
Japan >50K 47.96
Laos <=50K 40.38
Laos >50K 40.0
Mexico <=50K 40.0
Mexico >50K 46.58
Nicaragua <=50K 36.09
Nicaragua >50K 37.5
Outlying-US(Guam-USVI-etc) <=50K 41.86
Peru <=50K 35.07
Peru >50K 40.0
Philippines <=50K 38.07
Philippines >50K 43.03
Poland <=50K 38.17
Poland >50K 39.0
Portugal <=50K 41.94
Portugal >50K 41.5
Puerto-Rico <=50K 38.47
Puerto-Rico >50K 39.42
Scotland <=50K 39.44
Scotland >50K 46.67
South <=50K 40.16
South >50K 51.44
Taiwan <=50K 33.77
Taiwan >50K 46.8
Thailand <=50K 42.87
Thailand >50K 58.33
Trinadad&Tobago <=50K 37.06
Trinadad&Tobago >50K 40.0
United-States <=50K 38.8
United-States >50K 45.51
Vietnam <=50K 37.19
Vietnam >50K 39.2
Yugoslavia <=50K 41.6
Yugoslavia >50K 49.5

Better way to do the same, using crosstab

pd.crosstab(
    df["native-country"],
    df["salary"],
    values=df["hours-per-week"],
    aggfunc=np.mean,
).T

native-country	?	Cambodia	Canada	China	Columbia	Cuba	Dominican-Republic	Ecuador	El-Salvador	England	...	Portugal	Puerto-Rico	Scotland	South	Taiwan	Thailand	Trinadad&Tobago	United-States	Vietnam	Yugoslavia
salary
<=50K	40.16	41.42	37.91	37.38	38.68	37.99	42.34	38.04	36.03	40.48	...	41.94	38.47	39.44	40.16	33.77	42.87	37.06	38.80	37.19	41.6
>50K	45.55	40.00	45.64	38.90	50.00	42.44	47.00	48.75	45.00	44.53	...	41.50	39.42	46.67	51.44	46.80	58.33	40.00	45.51	39.20	49.5

2 rows × 42 columns

Summary tables

1. crosstab()

Panda’s crosstab() -allows you to compute a cross-tabulation of two or more factors, also known as a contingency table. It provides a convenient way to analyze the relationship between multiple categorical variables in a tabular format.

When you have two or more categorical variables in your DataFrame, crosstab helps you create a table that shows the frequency distribution of each variable combination. It counts the occurrences of different categories from each variable and arranges them in a grid-like structure.

# Compute the cross-tabulation
result = pd.crosstab(df['sex'], df['salary'])
print(result)

salary  <=50K  >50K
sex                
Female   9592  1179
Male    15128  6662

margins=True parameter in pd.crosstab() adds row and column totals to the resulting cross-tabulation table.

# Compute the cross-tabulation
result = pd.crosstab(df['sex'], df['salary'], margins=True)
print(result)

salary  <=50K  >50K    All
sex                       
Female   9592  1179  10771
Male    15128  6662  21790
All     24720  7841  32561

You can also add more then one value to compare against. The resulting table shows the counts for each combination of ‘sex’, ‘race’, and ‘salary’.

# Compute the cross-tabulation
result = pd.crosstab(df['sex'], [df['race'],df['salary']])
print(result)

race   Amer-Indian-Eskimo      Asian-Pac-Islander      Black      Other       \
salary              <=50K >50K              <=50K >50K <=50K >50K <=50K >50K   
sex                                                                            
Female                107   12                303   43  1465   90   103    6   
Male                  168   24                460  233  1272  297   143   19   

race    White        
salary  <=50K  >50K  
sex                  
Female   7614  1028  
Male    13085  6089  

The normalize=True parameter is passed to compute the cross-tabulation as percentages. The resulting table displays the percentage distribution of each salary level for each gender category.

# Compute the cross-tabulation as percentage
result = pd.crosstab(df['sex'], df['salary'], normalize=True, margins=True)*100
print(result)

salary  <=50K   >50K     All
sex                         
Female  29.46   3.62   33.08
Male    46.46  20.46   66.92
All     75.92  24.08  100.00

From this we can interpret:

almost 30% females have a salary less than or equal to 50K, while only around 4% have a salary greater than 50K.
a higher proportion of males (around 46%) have a salary less than or equal to 50K, while a significant proportion (around 20%) have a salary greater than 50K.

2. pivot_table()

pivot_table() allows you to create a Excel-style pivot table from a DataFrame
way to summarize and analyze data by grouping and aggregating values based on one or more variables.

Here is how to make pivot tables work:

index parameter - a list of variables to group data by,
Specify the columns using the columns parameter.
Choose the values to compute summary statistics.
Select the aggregation function with aggfunc - what statistics we need to calculate for groups, e.g. sum, mean, maximum, minimum or something else.
Customize with additional parameters like margins and fill_value.

Here is how we can calculate the average capital-gain, capital-loss and hours-per-week by sex.

# Create the pivot table
pivot_table = pd.pivot_table(df, index='sex', values=['capital-gain', 'capital-loss', 'hours-per-week'], aggfunc='mean')

print(pivot_table)

        capital-gain  capital-loss  hours-per-week
sex                                               
Female        568.41         61.19           36.41
Male         1329.37        100.21           42.43

On average, males have higher values for ‘capital-gain’, ‘capital-loss’, and ‘hours-per-week’ compared to females.

Here is a more complex example. Suppose we want to provide a comprehensive view of the average hours worked per week for various education levels and marital statuses, broken down by different races.

We want to:

Set the index to [‘education’, ‘marital-status’]. Required to group and organize the data based on different education levels and marital statuses. Note, education becomes the primary index while marital status becomes the secondary index.
Specify the columns of the pivot table to ‘race’. Required so we can create separate columns in the pivot table for each unique race value.
Choose the ‘hours-per-week’ column as the values to be analyzed using the values parameter.
Select the ‘mean’ as the aggregation function using the aggfunc parameter.
Fill any missing values in the pivot table with 0 using the fill_value parameter.

# Create the pivot table
pivot_table = pd.pivot_table(df, index=['education', 'marital-status'], columns='race', values='hours-per-week', aggfunc='mean', fill_value=0)

print(pivot_table)

race                                Amer-Indian-Eskimo  Asian-Pac-Islander  \
education    marital-status                                                  
10th         Divorced                            42.00               40.00   
             Married-civ-spouse                  35.00               43.00   
             Married-spouse-absent               40.00               38.50   
             Never-married                       46.00               35.00   
             Separated                            0.00                0.00   
...                                                ...                 ...   
Some-college Married-civ-spouse                  45.37               45.41   
             Married-spouse-absent               40.00               35.75   
             Never-married                       37.14               32.52   
             Separated                           35.50               46.67   
             Widowed                             24.50                5.00   

race                                Black  Other  White  
education    marital-status                              
10th         Divorced               37.43  42.00  41.51  
             Married-civ-spouse     41.07  45.50  41.81  
             Married-spouse-absent  37.33   0.00  44.33  
             Never-married          34.33  31.67  30.31  
             Separated              45.79   0.00  38.37  
...                                   ...    ...    ...  
Some-college Married-civ-spouse     42.28  40.00  43.53  
             Married-spouse-absent  36.05  29.00  37.88  
             Never-married          35.82  33.50  33.74  
             Separated              38.44  46.00  40.74  
             Widowed                38.90   0.00  32.34  

[101 rows x 5 columns]

Data Transformations tables

1. Adding Columns

The df.insert() function is to insert a new column into a DataFrame at a specific position. It allows you to specify the index location where you want to insert the new column and provide the column name and its corresponding values.

Here’s the syntax for df.insert():

df.insert(loc, column, value, allow_duplicates=False)

loc: The index location where you want to insert the new column.
column: The name of the new column.
value: The values to be assigned to the new column.
allow_duplicates (optional): If set to True, allows duplicate column names; default is False.

For example, we can create a new age group column (pandas series) based on pd.cut() and insert at postion 2.

# Create the age_group serie/column based on age values
age_group = pd.cut(df['age'], bins=[0, 18, 30, 50, 100], labels=['<18', '18-30', '31-50', '51+'])

# Insert the age_group column at position 2
df.insert(2, 'age_group', age_group)
df.head()

	age	workclass	age_group	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
0	39	State-gov	31-50	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	31-50	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	31-50	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	51+	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	18-30	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

If we dont need inserting at the a specific position, we simply add a new column by specifying it. The new column gets added at the end.

df['education-num-group'] = pd.cut(df['education-num'], bins=[0, 5, 10, 15, 20], labels=['<5', '5-10', '15', '15+'])
df.head()

	age	workclass	age_group	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary	education-num-group
0	39	State-gov	31-50	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K	15
1	50	Self-emp-not-inc	31-50	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K	15
2	38	Private	31-50	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K	5-10
3	53	Private	51+	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K	5-10
4	28	Private	18-30	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K	15

2. Deleting Columns or Rows

To delete columns or rows, use the drop method, passing the required indexes and the axis parameter (1 if you delete columns, and nothing or 0 if you delete rows).
The inplace argument tells whether to change the original DataFrame.
With inplace=False, the drop method doesn’t change the existing DataFrame and returns a new one with dropped rows or columns.
With inplace=True, it alters the DataFrame.

# get rid of just created columns
df.drop(["education-num-group", "age_group"], axis=1, inplace=True)
df.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

# and here’s how you can delete rows
print('Before change:{}'.format(df.shape))
df.drop([1, 2], inplace=True)
print('After change:{}'.format(df.shape))

Before change:(32561, 15)
After change:(32559, 15)

References

Share On
Twitter Facebook LinkedIn

Vimal Venugopal