Pandas: Data Analysis Essentials

35 minute read

A significant portion of working on real-world projects, around 70-80%, is actually spent on preparing and cleaning the data. This is where Pandas comes in handy: Pandas is a powerful tool that empowers analysts to efficiently manipulate, analyze, and visualize data. Here’s a brief overview of how you can perform data analysis with Pandas.

Introduction

  • Pandas: Python library for data manipulation and analysis.
  • Installation: Run pip install pandas in your Python environment.
  • DataFrames: Two-dimensional labeled data structures in Pandas, used for storing and analyzing tabular data efficiently.
  • Series: One-dimensional labeled data structure in Pandas, representing a single column or row of data, capable of holding various data types.

Data

The UCI Adult Income dataset is used for this article. It contains information about individuals from the 1994 United States Census, and the goal is to predict whether a person’s income exceeds $50,000 per year based on various attributes. It consists of 14 attributes, including features such as age, education level, work-class, marital status, occupation, race, sex, capital gain, capital loss, hours per week, native country, and the target variable, which indicates whether the income exceeds $50,000 or not.

Setup

  • import numpy and pandas
  • setup precision for two decimal places
  • read csv from URL into a dataframe
import numpy as np
import pandas as pd

pd.set_option("display.precision", 2)

DATA_URL = "https://vmlverse.github.io/assets/datasets/adult.data.csv"
df = pd.read_csv(DATA_URL)

Exploring the data

Once data loaded into a DataFrame, you can examine its contents and structure.

1. df.head(n)

Returns the first n rows of the DataFrame. (Default:5):

df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

2. df.tail(n)

Returns the last n rows of the DataFrame.(Default:5)

df.tail()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
32556 27 Private 257302 Assoc-acdm 12 Married-civ-spouse Tech-support Wife White Female 0 0 38 United-States <=50K
32557 40 Private 154374 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 40 United-States >50K
32558 58 Private 151910 HS-grad 9 Widowed Adm-clerical Unmarried White Female 0 0 40 United-States <=50K
32559 22 Private 201490 HS-grad 9 Never-married Adm-clerical Own-child White Male 0 0 20 United-States <=50K
32560 52 Self-emp-inc 287927 HS-grad 9 Married-civ-spouse Exec-managerial Wife White Female 15024 0 40 United-States >50K

3. df.shape

Returns the dimensions (rows, columns) of the DataFrame.

df.shape
(32561, 15)

4. df.info()

Provides information about the DataFrame, including the data - types of columns and missing values.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

5. df.describe()

Generates descriptive statistics for numerical columns:

  • descriptive statistics (count, mean, min, max, etc.)
  • numerical columns (int64 and float64 types)
  • In the below code, we can observe the metrics for numerical columns like age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week.
df.describe()
age fnlwgt education-num capital-gain capital-loss hours-per-week
count 32561.00 3.26e+04 32561.00 32561.00 32561.00 32561.00
mean 38.58 1.90e+05 10.08 1077.65 87.30 40.44
std 13.64 1.06e+05 2.57 7385.29 402.96 12.35
min 17.00 1.23e+04 1.00 0.00 0.00 1.00
25% 28.00 1.18e+05 9.00 0.00 0.00 40.00
50% 37.00 1.78e+05 10.00 0.00 0.00 40.00
75% 48.00 2.37e+05 12.00 0.00 0.00 45.00
max 90.00 1.48e+06 16.00 99999.00 4356.00 99.00
  • df.describe(include=[]): to see statistics on non-numerical features:

One has to explicitly indicate data types of interest in the include parameter.
In the below code, we can observe the metrics for non-numerical columns like workclass, education, marital-status, occupation, relationship, race, sex, native-country & salary.
However, note that unlike the stats for numerical columns, we can see statss like count, number of unique values, top occuring value and frequently occuring value.

df.describe(include=["object", "bool"])
workclass education marital-status occupation relationship race sex native-country salary
count 32561 32561 32561 32561 32561 32561 32561 32561 32561
unique 9 16 7 15 6 5 2 42 2
top Private HS-grad Married-civ-spouse Prof-specialty Husband White Male United-States <=50K
freq 22696 10501 14976 4140 13193 27816 21790 29170 24720

6. df.columns

prints the column names of a DataFrame

df.columns
Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'salary'],
      dtype='object')

7. df.astype()

To change column data type
For instance to change age column which is a int64 to float datatype, we can do this operation.

df["age"] = df["age"].astype("float")

we can now run the df.info() describe column to verify if the conversion happened on the age column.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             32561 non-null  float64
 1   workclass       32561 non-null  object 
 2   fnlwgt          32561 non-null  int64  
 3   education       32561 non-null  object 
 4   education-num   32561 non-null  int64  
 5   marital-status  32561 non-null  object 
 6   occupation      32561 non-null  object 
 7   relationship    32561 non-null  object 
 8   race            32561 non-null  object 
 9   sex             32561 non-null  object 
 10  capital-gain    32561 non-null  int64  
 11  capital-loss    32561 non-null  int64  
 12  hours-per-week  32561 non-null  int64  
 13  native-country  32561 non-null  object 
 14  salary          32561 non-null  object 
dtypes: float64(1), int64(5), object(9)
memory usage: 3.7+ MB

8. df.value_counts()

To analyze the distribution of values in your dataset:

df["sex"].value_counts()
Male      21790
Female    10771
Name: sex, dtype: int64

To calculate fractions, pass normalize=True to the value_counts function. You can use this to determine the percentage distribution.

df["sex"].value_counts(normalize=True)
Male      0.67
Female    0.33
Name: sex, dtype: float64
df.value_counts()
age  workclass         fnlwgt  education     education-num  marital-status      occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country  salary
25   Private           195994  1st-4th       2              Never-married       Priv-house-serv    Not-in-family  White  Female  0             0             40              Guatemala       <=50K     3
23   Private           240137  5th-6th       3              Never-married       Handlers-cleaners  Not-in-family  White  Male    0             0             55              Mexico          <=50K     2
38   Private           207202  HS-grad       9              Married-civ-spouse  Machine-op-inspct  Husband        White  Male    0             0             48              United-States   >50K      2
30   Private           144593  HS-grad       9              Never-married       Other-service      Not-in-family  Black  Male    0             0             40              ?               <=50K     2
49   Self-emp-not-inc  43479   Some-college  10             Married-civ-spouse  Craft-repair       Husband        White  Male    0             0             40              United-States   <=50K     2
                                                                                                                                                                                                      ..
31   Private           128567  HS-grad       9              Married-civ-spouse  Craft-repair       Husband        White  Male    0             0             40              United-States   <=50K     1
                       128493  HS-grad       9              Divorced            Other-service      Not-in-family  White  Female  0             0             25              United-States   <=50K     1
                       128220  7th-8th       4              Widowed             Adm-clerical       Not-in-family  White  Female  0             0             35              United-States   <=50K     1
                       127610  Bachelors     13             Married-civ-spouse  Prof-specialty     Wife           White  Female  0             0             40              United-States   >50K      1
90   Self-emp-not-inc  282095  Some-college  10             Married-civ-spouse  Farming-fishing    Husband        White  Male    0             0             40              United-States   <=50K     1
Length: 32537, dtype: int64

Sorting

  • DataFrame can be sorted by the value of one of the variables (i.e columns).
  • For example, we can sort by age (use ascending=False to sort in descending order):
df.sort_values(by="age", ascending=False).head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
5406 90.0 Private 51744 Masters 14 Never-married Exec-managerial Not-in-family Black Male 0 0 50 United-States >50K
6624 90.0 Private 313986 11th 7 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States <=50K
20610 90.0 Private 206667 Masters 14 Married-civ-spouse Prof-specialty Wife White Female 0 0 40 United-States >50K
1040 90.0 Private 137018 HS-grad 9 Never-married Other-service Not-in-family White Female 0 0 40 United-States <=50K
1935 90.0 Private 221832 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States <=50K

We can also sort by multiple columns:

df.sort_values(by=["age","education-num"], ascending=[False,True]).head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
24238 90.0 ? 166343 1st-4th 2 Widowed ? Not-in-family Black Female 0 0 40 United-States <=50K
19747 90.0 Private 226968 7th-8th 4 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 40 United-States <=50K
25303 90.0 ? 175444 7th-8th 4 Separated ? Not-in-family White Female 0 0 15 United-States <=50K
32367 90.0 Local-gov 214594 7th-8th 4 Married-civ-spouse Protective-serv Husband White Male 2653 0 40 United-States <=50K
5272 90.0 Private 141758 9th 5 Never-married Adm-clerical Not-in-family White Female 0 0 40 United-States <=50K

Indexing and retrieving data

1. Calculate mean

df['hours-per-week'].mean()
40.437455852092995

2. Column Indexing

  • DataFrames can be indexed by column name (label) or row name (index).
  • The loc method is used for indexing by name, while iloc() is used for indexing by number.
loc() iloc()
used for indexing by name used for indexing by number
df.loc[‘row_index’] will return the row with the specified index label df.iloc[row_number] will return the row at the specified integer position
df.loc[row_label, column_label] will return the value at the specified row and column labels df.iloc[row_number, column_number] will return the value at the specified row and column positions

Fetch rows 2 to 4 (inclusive)

df.loc[2:4]
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

Fetch rows 2 to 4 (not inclusive). Can use iloc[] as well.

df.iloc[2:4]
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K

Display rows 2 to 4 (inclusive) and columns from education to relationship.

df.loc[2:4, "education":"relationship"]
education education-num marital-status occupation relationship
2 HS-grad 9 Divorced Handlers-cleaners Not-in-family
3 11th 7 Married-civ-spouse Handlers-cleaners Husband
4 Bachelors 13 Married-civ-spouse Prof-specialty Wife

Display rows 2 to 4 (inclusive) and first three columns.

df.iloc[2:4, 0:3]
age workclass fnlwgt
2 38 Private 215646
3 53 Private 234721

Note: You can cross the methods. For example, cannot use numerical representation on loc() or use alphabetic representation on iloc()

If we need the first or the last line of the data frame, we can use the df[:1] or df[-1:] construction:

  • df[:1]: start at 0. end at 1 (non-inclusive).
  • Result: displays first row alone.
df[:1]
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K

Indexing can be done in loops as well. i.e., you can use negative numbers as well. e.g., df[-7:-2]

  • df[-1:] : start at -1(last row). end at 0 (non-inclusive)
  • Result: displays last row alone.
df[-1:]
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
32560 52 Self-emp-inc 287927 HS-grad 9 Married-civ-spouse Exec-managerial Wife White Female 15024 0 40 United-States >50K

3. Boolean indexing

  • Boolean indexing with one column is also very convenient.
  • The syntax is df[P(df[‘Name’])], where P is some logical condition that is checked for each element of the Name column.
  • The result of such indexing is the DataFrame consisting only of the rows that satisfy the P condition on the Name column.

What are the average values of numerical features for male users?

Here we’ll resort to an additional method select_dtypes to select all numeric columns.

  • df.select_dtypes(include=np.number) selects columns in the DataFrame that have numeric data types. It filters out columns with non-numeric data types.

  • [df[“sex”] == “male”] filters the DataFrame rows based on a condition. It selects only the rows where the “sex” column has a value of “male”.

  • .mean() calculates the mean (average) value for each column in the filtered DataFrame.

  • Therefore, the code calculates the mean value for each numeric column in the DataFrame (df) but only for the rows where the “sex” column is “male”. This provides the average values of the numeric columns specifically for the male subset of the data.

df.select_dtypes(include=np.number)[df["sex"] == "Male"].mean()
age                   39.43
fnlwgt            191771.45
education-num         10.10
capital-gain        1329.37
capital-loss         100.21
hours-per-week        42.43
dtype: float64
  • What is the mean age of Females in the survey?
    • Boolean condtion P => df[“sex”] == “Female”
    • df[P][“age”]
    • using the method .mean() to calculate mean in the column of interest.
df[df["sex"] == "Female"]["age"].mean()
36.85823043357163
  • What is the max age of males who have a capital gain greater than 15,000?
    • Boolean condtion P => (df[“sex”] == “Male”) & (df[“capital-gain”] > 15000 )
    • df[P][“age”]
    • using the method .max() to calculate max in the column of interest.
df[(df["sex"] == "Male") & (df["capital-gain"] > 15000 )]["age"].max()
90.0

Example: Finding The proportion of German citizens (native-country feature):

  • df.shape[0] gives the number of rows in DataFrame
  • Dividing the sum of True values by the total number of rows gives the percentage of individuals with Germany as their native country.
float((df["native-country"] == "Germany").sum()) / df.shape[0]
0.004207487485028101

Example: What are mean value and standard deviation of the age of those who receive more than 50K per year (salary feature) and those who receive less than 50K per year?

ages1 = df[df["salary"] == ">50K"]["age"]
ages2 = df[df["salary"] == "<=50K"]["age"]
print(
    "The average age of the rich: {0} +- {1} years, poor - {2} +- {3} years.".format(
        round(ages1.mean()),
        round(ages1.std(), 1),
        round(ages2.mean()),
        round(ages2.std(), 1),
    )
)
The average age of the rich: 44 +- 10.5 years, poor - 37 +- 14.0 years.

Example: Is it true that people who earn more than 50K have at least high school education? (education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate feature)

df[df["salary"] == ">50K"]["education"].unique()  # No
array(['HS-grad', 'Masters', 'Bachelors', 'Some-college', 'Assoc-voc',
       'Doctorate', 'Prof-school', 'Assoc-acdm', '7th-8th', '12th',
       '10th', '11th', '9th', '5th-6th', '1st-4th'], dtype=object)

Example: Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (marital-status feature)? Consider as married those who have a marital-status starting with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.

# married men
df[(df["sex"] == "Male")
     & (df["marital-status"].str.startswith("Married"))][
    "salary"
].value_counts(normalize=True)
<=50K    0.56
>50K     0.44
Name: salary, dtype: float64
# single men
df[
    (df["sex"] == "Male")
    & ~(df["marital-status"].str.startswith("Married"))
]["salary"].value_counts(normalize=True)
<=50K    0.92
>50K     0.08
Name: salary, dtype: float64

4. Apply Method

  • df.apply() method can be used to apply functions on Cells, Rows or Columns

For example, find the max value across all columns.

df.apply(np.max)
age                             90
workclass              Without-pay
fnlwgt                     1484705
education             Some-college
education-num                   16
marital-status             Widowed
occupation        Transport-moving
relationship                  Wife
race                         White
sex                           Male
capital-gain                 99999
capital-loss                  4356
hours-per-week                  99
native-country          Yugoslavia
salary                        >50K
dtype: object

Lambda functions are very convenient to apply using apply() method.

  • For eg, select the native-countries starting with Y.
  • To break this down:
    • df["native-country"] selects the native-country.
    • df["native-country"].apply(lambda state: state[0] == "Y") applies lambda function to include only the rows where the “native-country” column starts with the letter “Y”
    • df[].head() - then returns the first few rows of the filtered DataFrame.
df[df["native-country"].apply(lambda country: country[0] == "Y")].head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
1025 56 Private 169133 HS-grad 9 Married-civ-spouse Other-service Husband White Male 0 0 50 Yugoslavia <=50K
4447 25 Private 191230 Some-college 10 Never-married Exec-managerial Own-child White Female 0 0 40 Yugoslavia <=50K
6328 20 Private 175069 Some-college 10 Never-married Adm-clerical Own-child White Male 0 0 40 Yugoslavia <=50K
7287 35 Private 164526 HS-grad 9 Married-civ-spouse Other-service Husband White Male 0 0 40 Yugoslavia >50K
12506 40 Local-gov 183096 9th 5 Married-civ-spouse Other-service Wife White Female 0 0 40 Yugoslavia >50K

5. Map & Replace Method

map method can be used to replace values in a column by passing a dictionary of the form {old_value: new_value} as its argument:

d = {"Male": 1, "Female": 2}
df["sex_1"] = df["sex"].map(d)
df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White NaN 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White NaN 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White NaN 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black NaN 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black NaN 0 0 40 Cuba <=50K

same thing can be done with the replace method.

df['sex_2'] = df['sex'].replace(d)
df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary sex_2
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K 1
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K 1
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K 1
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K 1
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K 2

However, note replace method will not do anything with values not found in the mapping dictionary, while while map will change them to NaNs.

# Create a Series
s = pd.Series(['A', 'B', 'C', 'D'])

# Create a mapping dictionary
mapping = {'A': 'Apple', 'B': 'Banana'}

# Using replace method
replaced = s.replace(mapping)
print('replace method will not do anything with values not found in the mapping dictionary')
print(replaced)

# Using map method
mapped = s.map(mapping)
print('map method will change values not found in the mapping dictionary to NaNs')
print(mapped)

replace method will not do anything with values not found in the mapping dictionary
0     Apple
1    Banana
2         C
3         D
dtype: object
map method wwill change values not found in the mapping dictionary to NaNs
0     Apple
1    Banana
2       NaN
3       NaN
dtype: object

6. Grouping

Grouping data in Pandas works as follows: df.groupby(by=grouping_columns)[columns_to_show].function()

  1. First, the groupby method divides the grouping_columns by their values. They become a new index in the resulting dataframe.

  2. Then, columns of interest are selected (columns_to_show). If columns_to_show is not included, all non groupby clauses will be included.

  3. Finally, one or several functions are applied to the obtained groups per selected columns.

To understand this more intuitively, we can start with this small example.

  • We can group by education. By doing this we are creating a new dataframe with the education column as index.
  • However, this dataframe as no series values. For this we can apply as simple function like count()
  • The resulting dataframe shows the count each education value across all other columns. The data is repeating because, count() is simply counting the number of entries for each education type in age, workclass, etc which obviously will be same.
result = df.groupby('education').count()
print(result)
                age  workclass  fnlwgt  education-num  marital-status  \
education                                                               
10th            933        933     933            933             933   
11th           1175       1175    1175           1175            1175   
12th            433        433     433            433             433   
1st-4th         168        168     168            168             168   
5th-6th         333        333     333            333             333   
7th-8th         646        646     646            646             646   
9th             514        514     514            514             514   
Assoc-acdm     1067       1067    1067           1067            1067   
Assoc-voc      1382       1382    1382           1382            1382   
Bachelors      5355       5355    5355           5355            5355   
Doctorate       413        413     413            413             413   
HS-grad       10501      10501   10501          10501           10501   
Masters        1723       1723    1723           1723            1723   
Preschool        51         51      51             51              51   
Prof-school     576        576     576            576             576   
Some-college   7291       7291    7291           7291            7291   

              occupation  relationship   race    sex  capital-gain  \
education                                                            
10th                 933           933    933    933           933   
11th                1175          1175   1175   1175          1175   
12th                 433           433    433    433           433   
1st-4th              168           168    168    168           168   
5th-6th              333           333    333    333           333   
7th-8th              646           646    646    646           646   
9th                  514           514    514    514           514   
Assoc-acdm          1067          1067   1067   1067          1067   
Assoc-voc           1382          1382   1382   1382          1382   
Bachelors           5355          5355   5355   5355          5355   
Doctorate            413           413    413    413           413   
HS-grad            10501         10501  10501  10501         10501   
Masters             1723          1723   1723   1723          1723   
Preschool             51            51     51     51            51   
Prof-school          576           576    576    576           576   
Some-college        7291          7291   7291   7291          7291   

              capital-loss  hours-per-week  native-country  salary  sex_2  
education                                                                  
10th                   933             933             933     933    933  
11th                  1175            1175            1175    1175   1175  
12th                   433             433             433     433    433  
1st-4th                168             168             168     168    168  
5th-6th                333             333             333     333    333  
7th-8th                646             646             646     646    646  
9th                    514             514             514     514    514  
Assoc-acdm            1067            1067            1067    1067   1067  
Assoc-voc             1382            1382            1382    1382   1382  
Bachelors             5355            5355            5355    5355   5355  
Doctorate              413             413             413     413    413  
HS-grad              10501           10501           10501   10501  10501  
Masters               1723            1723            1723    1723   1723  
Preschool               51              51              51      51     51  
Prof-school            576             576             576     576    576  
Some-college          7291            7291            7291    7291   7291  

This will start to make more sense by using a more meaningful function like .mean() which calculates the mean value for all numerical columns under each education type.

result = df.groupby('education').mean()
print(result)
                age     fnlwgt  education-num  capital-gain  capital-loss  \
education                                                                   
10th          37.43  196832.47            6.0        404.57         56.85   
11th          32.36  194928.08            7.0        215.10         50.08   
12th          32.00  199097.51            8.0        284.09         32.34   
1st-4th       46.14  239303.00            2.0        125.88         48.33   
5th-6th       42.89  232448.33            3.0        176.02         68.25   
7th-8th       48.45  188079.17            4.0        233.94         65.67   
9th           41.06  202485.07            5.0        342.09         29.00   
Assoc-acdm    37.38  193424.09           12.0        640.40         93.42   
Assoc-voc     38.55  181936.02           11.0        715.05         72.75   
Bachelors     38.90  188055.91           13.0       1756.30        118.35   
Doctorate     47.70  186698.76           16.0       4770.15        262.85   
HS-grad       38.97  189538.74            9.0        576.80         70.47   
Masters       44.05  179852.36           14.0       2562.56        166.72   
Preschool     42.76  235889.37            1.0        898.39         66.49   
Prof-school   44.75  185663.71           15.0      10414.42        231.20   
Some-college  35.76  188742.92           10.0        598.82         71.64   

              hours-per-week  sex_2  
education                            
10th                   37.05   1.32  
11th                   33.93   1.37  
12th                   35.78   1.33  
1st-4th                38.26   1.27  
5th-6th                38.90   1.25  
7th-8th                39.37   1.25  
9th                    38.04   1.28  
Assoc-acdm             40.50   1.39  
Assoc-voc              41.61   1.36  
Bachelors              42.61   1.30  
Doctorate              46.97   1.21  
HS-grad                40.58   1.32  
Masters                43.84   1.31  
Preschool              36.65   1.31  
Prof-school            47.43   1.16  
Some-college           38.85   1.38  


<ipython-input-28-a0bc15c64f52>:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  result = df.groupby('education').mean()

In the below example:

  • we grouped the DataFrame df by the ‘education’ column using the groupby method.
  • Then, we selected the ‘age’ column to calculate the average age for each education level.
  • The result is a Series that shows the average age for each education category.
# result = df.groupby('education')['age'].mean()
result = df.groupby('education')['age'].mean()
print(result)
education
10th            37.43
11th            32.36
12th            32.00
1st-4th         46.14
5th-6th         42.89
7th-8th         48.45
9th             41.06
Assoc-acdm      37.38
Assoc-voc       38.55
Bachelors       38.90
Doctorate       47.70
HS-grad         38.97
Masters         44.05
Preschool       42.76
Prof-school     44.75
Some-college    35.76
Name: age, dtype: float64

Here is another example. Suppose we want to groupby Salary and determine how people of various occupations are collecting salary, we can do the following:

  • we can group the DataFrame df by both the ‘occupation’ and ‘salary’ columns using the groupby method. This is like first creating an index for occupation and then create index salary under each occupation type.
  • Then, we applied the size function to count the number of occurrences for each combination of occupation and salary.
  • The result is a Series that shows the count for each occupation and salary category.
# Grouping by 'occupation' and 'salary' columns and counting the number of occurrences
result = df.groupby(['occupation', 'salary']).size()
print(result)
occupation         salary
?                  <=50K     1652
                   >50K       191
Adm-clerical       <=50K     3263
                   >50K       507
Armed-Forces       <=50K        8
                   >50K         1
Craft-repair       <=50K     3170
                   >50K       929
Exec-managerial    <=50K     2098
                   >50K      1968
Farming-fishing    <=50K      879
                   >50K       115
Handlers-cleaners  <=50K     1284
                   >50K        86
Machine-op-inspct  <=50K     1752
                   >50K       250
Other-service      <=50K     3158
                   >50K       137
Priv-house-serv    <=50K      148
                   >50K         1
Prof-specialty     <=50K     2281
                   >50K      1859
Protective-serv    <=50K      438
                   >50K       211
Sales              <=50K     2667
                   >50K       983
Tech-support       <=50K      645
                   >50K       283
Transport-moving   <=50K     1277
                   >50K       320
dtype: int64

We can extend this to show percentage by doing the following:

# Grouping by 'occupation' and 'salary' columns and calculating the percentage
result = df.groupby(['occupation', 'salary']).size() / df.groupby(['occupation']).size() * 100
print(result)
occupation         salary
?                  <=50K     89.64
                   >50K      10.36
Adm-clerical       <=50K     86.55
                   >50K      13.45
Armed-Forces       <=50K     88.89
                   >50K      11.11
Craft-repair       <=50K     77.34
                   >50K      22.66
Exec-managerial    <=50K     51.60
                   >50K      48.40
Farming-fishing    <=50K     88.43
                   >50K      11.57
Handlers-cleaners  <=50K     93.72
                   >50K       6.28
Machine-op-inspct  <=50K     87.51
                   >50K      12.49
Other-service      <=50K     95.84
                   >50K       4.16
Priv-house-serv    <=50K     99.33
                   >50K       0.67
Prof-specialty     <=50K     55.10
                   >50K      44.90
Protective-serv    <=50K     67.49
                   >50K      32.51
Sales              <=50K     73.07
                   >50K      26.93
Tech-support       <=50K     69.50
                   >50K      30.50
Transport-moving   <=50K     79.96
                   >50K      20.04
dtype: float64

In this example, we can compare the normal describe method vs the describe method with groupby. As we can see, the group by method is indexed on occupation, where as the regular df is indexed on 0 to N row index.

df.describe(percentiles=[])
age fnlwgt education-num capital-gain capital-loss hours-per-week sex_2
count 32561.00 3.26e+04 32561.00 32561.00 32561.00 32561.00 32561.00
mean 38.58 1.90e+05 10.08 1077.65 87.30 40.44 1.33
std 13.64 1.06e+05 2.57 7385.29 402.96 12.35 0.47
min 17.00 1.23e+04 1.00 0.00 0.00 1.00 1.00
50% 37.00 1.78e+05 10.00 0.00 0.00 40.00 1.00
max 90.00 1.48e+06 16.00 99999.00 4356.00 99.00 2.00
df.groupby(["occupation"]).describe(percentiles=[])
age fnlwgt ... hours-per-week sex_2
count mean std min 50% max count mean std min ... std min 50% max count mean std min 50% max
occupation
? 1843.0 40.88 20.34 17.0 35.0 90.0 1843.0 188658.67 107089.08 12285.0 ... 14.91 1.0 36.0 99.0 1843.0 1.46 0.50 1.0 1.0 2.0
Adm-clerical 3770.0 36.96 13.36 17.0 35.0 90.0 3770.0 192043.40 103163.89 19302.0 ... 9.59 1.0 40.0 80.0 3770.0 1.67 0.47 1.0 2.0 2.0
Armed-Forces 9.0 30.22 8.09 23.0 29.0 46.0 9.0 215425.89 83315.89 76313.0 ... 14.07 8.0 40.0 60.0 9.0 1.00 0.00 1.0 1.0 1.0
Craft-repair 4099.0 39.03 11.61 17.0 38.0 90.0 4099.0 192132.60 107434.09 19491.0 ... 9.05 1.0 40.0 99.0 4099.0 1.05 0.23 1.0 1.0 2.0
Exec-managerial 4066.0 42.17 11.97 17.0 41.0 90.0 4066.0 184414.01 103314.96 19914.0 ... 11.11 1.0 40.0 99.0 4066.0 1.29 0.45 1.0 1.0 2.0
Farming-fishing 994.0 41.21 15.07 17.0 39.0 90.0 994.0 170190.18 116925.90 20795.0 ... 17.32 2.0 40.0 99.0 994.0 1.07 0.25 1.0 1.0 2.0
Handlers-cleaners 1370.0 32.17 12.37 17.0 29.0 90.0 1370.0 204391.01 111934.10 19214.0 ... 10.58 2.0 40.0 95.0 1370.0 1.12 0.32 1.0 1.0 2.0
Machine-op-inspct 2002.0 37.72 12.07 17.0 36.0 90.0 2002.0 195040.88 98159.93 13769.0 ... 7.59 1.0 40.0 96.0 2002.0 1.27 0.45 1.0 1.0 2.0
Other-service 3295.0 34.95 14.52 17.0 32.0 90.0 3295.0 188608.45 109452.80 19752.0 ... 12.71 1.0 40.0 99.0 3295.0 1.55 0.50 1.0 2.0 2.0
Priv-house-serv 149.0 41.72 18.63 17.0 40.0 81.0 149.0 201107.52 102595.40 24384.0 ... 16.18 4.0 35.0 99.0 149.0 1.95 0.23 1.0 2.0 2.0
Prof-specialty 4140.0 40.52 12.02 17.0 40.0 90.0 4140.0 185296.61 100135.45 14878.0 ... 12.54 1.0 40.0 99.0 4140.0 1.37 0.48 1.0 1.0 2.0
Protective-serv 649.0 38.95 12.82 17.0 36.0 90.0 649.0 202039.95 101910.29 19302.0 ... 12.33 3.0 40.0 99.0 649.0 1.12 0.32 1.0 1.0 2.0
Sales 3650.0 37.35 14.19 17.0 35.0 90.0 3650.0 190885.89 103779.87 19410.0 ... 13.24 2.0 40.0 99.0 3650.0 1.35 0.48 1.0 1.0 2.0
Tech-support 928.0 37.02 11.32 17.0 36.0 73.0 928.0 192098.30 113987.28 19847.0 ... 10.58 3.0 40.0 99.0 928.0 1.38 0.48 1.0 1.0 2.0
Transport-moving 1597.0 40.20 12.45 17.0 39.0 90.0 1597.0 190366.36 109240.30 18827.0 ... 12.72 5.0 40.0 99.0 1597.0 1.06 0.23 1.0 1.0 2.0

15 rows × 42 columns

We can do the same thing using agg() function.

df.groupby(["occupation"]).agg([np.mean, np.std, np.min, np.max])
<ipython-input-32-f3544e70bb41>:1: FutureWarning: ['workclass', 'education', 'marital-status', 'relationship', 'race', 'sex', 'native-country', 'salary'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
  df.groupby(["occupation"]).agg([np.mean, np.std, np.min, np.max])
age fnlwgt education-num ... capital-loss hours-per-week sex_2
mean std amin amax mean std amin amax mean std ... amin amax mean std amin amax mean std amin amax
occupation
? 40.88 20.34 17 90 188658.67 107089.08 12285 981628 9.25 2.60 ... 0 4356 31.91 14.91 1 99 1.46 0.50 1 2
Adm-clerical 36.96 13.36 17 90 192043.40 103163.89 19302 1033222 10.11 1.70 ... 0 3770 37.56 9.59 1 80 1.67 0.47 1 2
Armed-Forces 30.22 8.09 23 46 215425.89 83315.89 76313 344415 10.11 2.03 ... 0 1887 40.67 14.07 8 60 1.00 0.00 1 1
Craft-repair 39.03 11.61 17 90 192132.60 107434.09 19491 1455435 9.11 2.04 ... 0 3004 42.30 9.05 1 99 1.05 0.23 1 2
Exec-managerial 42.17 11.97 17 90 184414.01 103314.96 19914 1484705 11.45 2.14 ... 0 4356 44.99 11.11 1 99 1.29 0.45 1 2
Farming-fishing 41.21 15.07 17 90 170190.18 116925.90 20795 663394 8.61 2.76 ... 0 2457 46.99 17.32 2 99 1.07 0.25 1 2
Handlers-cleaners 32.17 12.37 17 90 204391.01 111934.10 19214 889965 8.51 2.20 ... 0 2824 37.95 10.58 2 95 1.12 0.32 1 2
Machine-op-inspct 37.72 12.07 17 90 195040.88 98159.93 13769 1033222 8.49 2.29 ... 0 3900 40.76 7.59 1 96 1.27 0.45 1 2
Other-service 34.95 14.52 17 90 188608.45 109452.80 19752 1366120 8.78 2.30 ... 0 3770 34.70 12.71 1 99 1.55 0.50 1 2
Priv-house-serv 41.72 18.63 17 81 201107.52 102595.40 24384 549430 7.36 3.11 ... 0 1602 32.89 16.18 4 99 1.95 0.23 1 2
Prof-specialty 40.52 12.02 17 90 185296.61 100135.45 14878 747719 12.91 2.03 ... 0 3900 42.39 12.54 1 99 1.37 0.48 1 2
Protective-serv 38.95 12.82 17 90 202039.95 101910.29 19302 857532 10.18 1.87 ... 0 2444 42.87 12.33 3 99 1.12 0.32 1 2
Sales 37.35 14.19 17 90 190885.89 103779.87 19410 1226583 10.30 2.18 ... 0 2824 40.78 13.24 2 99 1.35 0.48 1 2
Tech-support 37.02 11.32 17 73 192098.30 113987.28 19847 1268339 10.99 1.80 ... 0 2444 39.43 10.58 3 99 1.38 0.48 1 2
Transport-moving 40.20 12.45 17 90 190366.36 109240.30 18827 1184622 8.77 2.04 ... 0 2824 44.66 12.72 5 99 1.06 0.23 1 2

15 rows × 28 columns

Example: Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo race

for (race, sex), sub_df in df.groupby(["race", "sex"]):
    print("Race: {0}, sex: {1}".format(race, sex))
    print(sub_df["age"].describe())
Race: Amer-Indian-Eskimo, sex: Female
count    119.00
mean      37.12
std       13.11
min       17.00
25%       27.00
50%       36.00
75%       46.00
max       80.00
Name: age, dtype: float64
Race: Amer-Indian-Eskimo, sex: Male
count    192.00
mean      37.21
std       12.05
min       17.00
25%       28.00
50%       35.00
75%       45.00
max       82.00
Name: age, dtype: float64
Race: Asian-Pac-Islander, sex: Female
count    346.00
mean      35.09
std       12.30
min       17.00
25%       25.00
50%       33.00
75%       43.75
max       75.00
Name: age, dtype: float64
Race: Asian-Pac-Islander, sex: Male
count    693.00
mean      39.07
std       12.88
min       18.00
25%       29.00
50%       37.00
75%       46.00
max       90.00
Name: age, dtype: float64
Race: Black, sex: Female
count    1555.00
mean       37.85
std        12.64
min        17.00
25%        28.00
50%        37.00
75%        46.00
max        90.00
Name: age, dtype: float64
Race: Black, sex: Male
count    1569.00
mean       37.68
std        12.88
min        17.00
25%        27.00
50%        36.00
75%        46.00
max        90.00
Name: age, dtype: float64
Race: Other, sex: Female
count    109.00
mean      31.68
std       11.63
min       17.00
25%       23.00
50%       29.00
75%       39.00
max       74.00
Name: age, dtype: float64
Race: Other, sex: Male
count    162.00
mean      34.65
std       11.36
min       17.00
25%       26.00
50%       32.00
75%       42.00
max       77.00
Name: age, dtype: float64
Race: White, sex: Female
count    8642.00
mean       36.81
std        14.33
min        17.00
25%        25.00
50%        35.00
75%        46.00
max        90.00
Name: age, dtype: float64
Race: White, sex: Male
count    19174.00
mean        39.65
std         13.44
min         17.00
25%         29.00
50%         38.00
75%         49.00
max         90.00
Name: age, dtype: float64

Example:Count the average time of work (hours-per-week) those who earning a little and a lot (salary) for each country (native-country).

for (country, salary), sub_df in df.groupby(["native-country", "salary"]):
    print(country, salary, round(sub_df["hours-per-week"].mean(), 2))
? <=50K 40.16
? >50K 45.55
Cambodia <=50K 41.42
Cambodia >50K 40.0
Canada <=50K 37.91
Canada >50K 45.64
China <=50K 37.38
China >50K 38.9
Columbia <=50K 38.68
Columbia >50K 50.0
Cuba <=50K 37.99
Cuba >50K 42.44
Dominican-Republic <=50K 42.34
Dominican-Republic >50K 47.0
Ecuador <=50K 38.04
Ecuador >50K 48.75
El-Salvador <=50K 36.03
El-Salvador >50K 45.0
England <=50K 40.48
England >50K 44.53
France <=50K 41.06
France >50K 50.75
Germany <=50K 39.14
Germany >50K 44.98
Greece <=50K 41.81
Greece >50K 50.62
Guatemala <=50K 39.36
Guatemala >50K 36.67
Haiti <=50K 36.33
Haiti >50K 42.75
Holand-Netherlands <=50K 40.0
Honduras <=50K 34.33
Honduras >50K 60.0
Hong <=50K 39.14
Hong >50K 45.0
Hungary <=50K 31.3
Hungary >50K 50.0
India <=50K 38.23
India >50K 46.48
Iran <=50K 41.44
Iran >50K 47.5
Ireland <=50K 40.95
Ireland >50K 48.0
Italy <=50K 39.62
Italy >50K 45.4
Jamaica <=50K 38.24
Jamaica >50K 41.1
Japan <=50K 41.0
Japan >50K 47.96
Laos <=50K 40.38
Laos >50K 40.0
Mexico <=50K 40.0
Mexico >50K 46.58
Nicaragua <=50K 36.09
Nicaragua >50K 37.5
Outlying-US(Guam-USVI-etc) <=50K 41.86
Peru <=50K 35.07
Peru >50K 40.0
Philippines <=50K 38.07
Philippines >50K 43.03
Poland <=50K 38.17
Poland >50K 39.0
Portugal <=50K 41.94
Portugal >50K 41.5
Puerto-Rico <=50K 38.47
Puerto-Rico >50K 39.42
Scotland <=50K 39.44
Scotland >50K 46.67
South <=50K 40.16
South >50K 51.44
Taiwan <=50K 33.77
Taiwan >50K 46.8
Thailand <=50K 42.87
Thailand >50K 58.33
Trinadad&Tobago <=50K 37.06
Trinadad&Tobago >50K 40.0
United-States <=50K 38.8
United-States >50K 45.51
Vietnam <=50K 37.19
Vietnam >50K 39.2
Yugoslavia <=50K 41.6
Yugoslavia >50K 49.5

Better way to do the same, using crosstab

pd.crosstab(
    df["native-country"],
    df["salary"],
    values=df["hours-per-week"],
    aggfunc=np.mean,
).T
native-country ? Cambodia Canada China Columbia Cuba Dominican-Republic Ecuador El-Salvador England ... Portugal Puerto-Rico Scotland South Taiwan Thailand Trinadad&Tobago United-States Vietnam Yugoslavia
salary
<=50K 40.16 41.42 37.91 37.38 38.68 37.99 42.34 38.04 36.03 40.48 ... 41.94 38.47 39.44 40.16 33.77 42.87 37.06 38.80 37.19 41.6
>50K 45.55 40.00 45.64 38.90 50.00 42.44 47.00 48.75 45.00 44.53 ... 41.50 39.42 46.67 51.44 46.80 58.33 40.00 45.51 39.20 49.5

2 rows × 42 columns

Summary tables

1. crosstab()

Panda’s crosstab() -allows you to compute a cross-tabulation of two or more factors, also known as a contingency table. It provides a convenient way to analyze the relationship between multiple categorical variables in a tabular format.

When you have two or more categorical variables in your DataFrame, crosstab helps you create a table that shows the frequency distribution of each variable combination. It counts the occurrences of different categories from each variable and arranges them in a grid-like structure.

# Compute the cross-tabulation
result = pd.crosstab(df['sex'], df['salary'])
print(result)
salary  <=50K  >50K
sex                
Female   9592  1179
Male    15128  6662

margins=True parameter in pd.crosstab() adds row and column totals to the resulting cross-tabulation table.

# Compute the cross-tabulation
result = pd.crosstab(df['sex'], df['salary'], margins=True)
print(result)
salary  <=50K  >50K    All
sex                       
Female   9592  1179  10771
Male    15128  6662  21790
All     24720  7841  32561

You can also add more then one value to compare against. The resulting table shows the counts for each combination of ‘sex’, ‘race’, and ‘salary’.

# Compute the cross-tabulation
result = pd.crosstab(df['sex'], [df['race'],df['salary']])
print(result)
race   Amer-Indian-Eskimo      Asian-Pac-Islander      Black      Other       \
salary              <=50K >50K              <=50K >50K <=50K >50K <=50K >50K   
sex                                                                            
Female                107   12                303   43  1465   90   103    6   
Male                  168   24                460  233  1272  297   143   19   

race    White        
salary  <=50K  >50K  
sex                  
Female   7614  1028  
Male    13085  6089  

The normalize=True parameter is passed to compute the cross-tabulation as percentages. The resulting table displays the percentage distribution of each salary level for each gender category.

# Compute the cross-tabulation as percentage
result = pd.crosstab(df['sex'], df['salary'], normalize=True, margins=True)*100
print(result)
salary  <=50K   >50K     All
sex                         
Female  29.46   3.62   33.08
Male    46.46  20.46   66.92
All     75.92  24.08  100.00

From this we can interpret:

  • almost 30% females have a salary less than or equal to 50K, while only around 4% have a salary greater than 50K.
  • a higher proportion of males (around 46%) have a salary less than or equal to 50K, while a significant proportion (around 20%) have a salary greater than 50K.

2. pivot_table()

  • pivot_table() allows you to create a Excel-style pivot table from a DataFrame
  • way to summarize and analyze data by grouping and aggregating values based on one or more variables.

Here is how to make pivot tables work:

  • index parameter - a list of variables to group data by,
  • Specify the columns using the columns parameter.
  • Choose the values to compute summary statistics.
  • Select the aggregation function with aggfunc - what statistics we need to calculate for groups, e.g. sum, mean, maximum, minimum or something else.
  • Customize with additional parameters like margins and fill_value.

Here is how we can calculate the average capital-gain, capital-loss and hours-per-week by sex.

# Create the pivot table
pivot_table = pd.pivot_table(df, index='sex', values=['capital-gain', 'capital-loss', 'hours-per-week'], aggfunc='mean')

print(pivot_table)
        capital-gain  capital-loss  hours-per-week
sex                                               
Female        568.41         61.19           36.41
Male         1329.37        100.21           42.43

On average, males have higher values for ‘capital-gain’, ‘capital-loss’, and ‘hours-per-week’ compared to females.

Here is a more complex example. Suppose we want to provide a comprehensive view of the average hours worked per week for various education levels and marital statuses, broken down by different races.

We want to:

  • Set the index to [‘education’, ‘marital-status’]. Required to group and organize the data based on different education levels and marital statuses. Note, education becomes the primary index while marital status becomes the secondary index.
  • Specify the columns of the pivot table to ‘race’. Required so we can create separate columns in the pivot table for each unique race value.
  • Choose the ‘hours-per-week’ column as the values to be analyzed using the values parameter.
  • Select the ‘mean’ as the aggregation function using the aggfunc parameter.
  • Fill any missing values in the pivot table with 0 using the fill_value parameter.
# Create the pivot table
pivot_table = pd.pivot_table(df, index=['education', 'marital-status'], columns='race', values='hours-per-week', aggfunc='mean', fill_value=0)

print(pivot_table)
race                                Amer-Indian-Eskimo  Asian-Pac-Islander  \
education    marital-status                                                  
10th         Divorced                            42.00               40.00   
             Married-civ-spouse                  35.00               43.00   
             Married-spouse-absent               40.00               38.50   
             Never-married                       46.00               35.00   
             Separated                            0.00                0.00   
...                                                ...                 ...   
Some-college Married-civ-spouse                  45.37               45.41   
             Married-spouse-absent               40.00               35.75   
             Never-married                       37.14               32.52   
             Separated                           35.50               46.67   
             Widowed                             24.50                5.00   

race                                Black  Other  White  
education    marital-status                              
10th         Divorced               37.43  42.00  41.51  
             Married-civ-spouse     41.07  45.50  41.81  
             Married-spouse-absent  37.33   0.00  44.33  
             Never-married          34.33  31.67  30.31  
             Separated              45.79   0.00  38.37  
...                                   ...    ...    ...  
Some-college Married-civ-spouse     42.28  40.00  43.53  
             Married-spouse-absent  36.05  29.00  37.88  
             Never-married          35.82  33.50  33.74  
             Separated              38.44  46.00  40.74  
             Widowed                38.90   0.00  32.34  

[101 rows x 5 columns]

Data Transformations tables

1. Adding Columns

The df.insert() function is to insert a new column into a DataFrame at a specific position. It allows you to specify the index location where you want to insert the new column and provide the column name and its corresponding values.

Here’s the syntax for df.insert():

df.insert(loc, column, value, allow_duplicates=False)
  • loc: The index location where you want to insert the new column.
  • column: The name of the new column.
  • value: The values to be assigned to the new column.
  • allow_duplicates (optional): If set to True, allows duplicate column names; default is False.

For example, we can create a new age group column (pandas series) based on pd.cut() and insert at postion 2.

# Create the age_group serie/column based on age values
age_group = pd.cut(df['age'], bins=[0, 18, 30, 50, 100], labels=['<18', '18-30', '31-50', '51+'])

# Insert the age_group column at position 2
df.insert(2, 'age_group', age_group)
df.head()
age workclass age_group fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 39 State-gov 31-50 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 31-50 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 31-50 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 51+ 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 18-30 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

If we dont need inserting at the a specific position, we simply add a new column by specifying it. The new column gets added at the end.

df['education-num-group'] = pd.cut(df['education-num'], bins=[0, 5, 10, 15, 20], labels=['<5', '5-10', '15', '15+'])
df.head()
age workclass age_group fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary education-num-group
0 39 State-gov 31-50 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K 15
1 50 Self-emp-not-inc 31-50 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K 15
2 38 Private 31-50 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K 5-10
3 53 Private 51+ 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K 5-10
4 28 Private 18-30 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K 15

2. Deleting Columns or Rows

  • To delete columns or rows, use the drop method, passing the required indexes and the axis parameter (1 if you delete columns, and nothing or 0 if you delete rows).
  • The inplace argument tells whether to change the original DataFrame.
  • With inplace=False, the drop method doesn’t change the existing DataFrame and returns a new one with dropped rows or columns.
  • With inplace=True, it alters the DataFrame.
# get rid of just created columns
df.drop(["education-num-group", "age_group"], axis=1, inplace=True)
df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
# and here’s how you can delete rows
print('Before change:{}'.format(df.shape))
df.drop([1, 2], inplace=True)
print('After change:{}'.format(df.shape))
Before change:(32561, 15)
After change:(32559, 15)

References

Comments