Python Pandas and DataFrame: The Power Couple for Modern Data Analysis

Piotr Ludwinek — Thu, 24 Aug 2023 03:30:00 +0000

Pandas is a free Python library that greatly expands your data analysis and processing capabilities. This library is one of the most important tools in the Python environment, and is widely used as a form of support in various industries. In this article, you will learn how to load, process and analyze data easily in Python. This competence is valued in many areas, from finance to engineering.

Jump To:

1. Python – Pandas Library in Data Science. Why is it worth finding out more?
2. Introduction to Pandas in Python
3. When will the Pandas library work best?
4. How to install the Pandas library
5. How to import Python Pandas into your project
6. Pandas DataFrames and series
7. Loading data from different sources
8. Pandas – operations on data
9. Selecting, filtering and sorting data
10. Using apply and map functions
11. Pandas: data cleaning and repairing missing data
12. Basic statistical operations and data grouping
13. Analyzing data using Pandas
14. Pandas, tips and the most popular features (cheat sheet)
15. Fundamentals of Pandas – summary

Python – Pandas Library in Data Science. Why is it worth finding out more?

The Pandas library was created to make it possible to work with various types of data that are not always complete or require proper processing. Pandas provides flexible and easy-to-use data structures and tools (albeit not always efficient… but later in the article I will also tell you how to deal with it). In addition to such libraries as NumPy, Matplotlib, Seaborn or Scikit–Learn, it has earned popularity and recognition among academics, analysts, engineers, and data enthusiasts.

Introduction to Pandas in Python

It is an ideal tool for managing and analyzing data (with the help of extra libraries) in small and medium-sized collections. In the case of large collections, which are popular in areas related to Big Data, processing is also possible. But as the amount of data increases, the likelihood of memory and performance problems increases too.

When will the Pandas library work best?

The Pandas library is well suited to working with different types and data sources:

Array data with columns of various types (e.g. Excel, SQL).
Time-series data.
Data with labels of rows and columns (labeled data).

Examples of applications of the Pandas library.

Here are some examples of applications of the Pandas library:

Loading various data formats (CSV, Excel, SQL, flat files, etc.).
Filtering, sorting and other operations on data.
Clearing data (deleting NaN values, that is: Not a Number data), averaging, replacing values, etc.).
Quickly and efficiently calculating statistics and performing operations on data.
Visualization of data with charts.

Before we start using the Pandas library, let’s make sure that it is installed in the Python environment.

How to install the Pandas library

Installing Pandas is very simple and can be done with the pip tool, which is the default Python package manager. I assume that Python is already installed on your computer, and if not – here’s how to do it: BeginnersGuide/Download – Python Wiki. Then just open the terminal and type the following command:

pip install pandas

You may be required to grant permissions to execute this command (e.g., using sudo on Unix systems or running a terminal with administrator rights on Windows), and if you are using a specific Python virtual environment (e.g., venv or conda), you will need to activate this environment before installing the package. You can read more about virtual environments in the documentation.

If you are using Anaconda, you can install Pandas using the command:

conda install pandas

How to import Python Pandas into your project

Once the Pandas library is installed, you can start using it. The first step is to import the library into our script or project. Importing a Pandas library is no different than any other library in Python.

We can do so using the code below:

import pandas as pd

In Python, Pandas is typically imported under the alias ‘pd’, which is a short and commonly accepted abbreviation. Now, when we want to use the Pandas library functions, instead of typing the full word “pandas“, using the “pd” abbreviation is sufficient. It is similar in the case of the NumPy library (‘np’), which I wrote about in a previous article.

For example, if we wanted to create a DataFrame (one of the key data structures in Pandas), the code would look like this:

import pandas as pd 

data = { 

'column_1': [3, 2, 0, 1],  

'column_2': [0, 3, 7, 2] 

} 

example_df = pd.DataFrame(data) 

print(example_df)

As you can see in the example, I used the DataFrame class, the basic structure provided by Pandas. In the next paragraph, we will discuss the other two basic data structures – Series and DataFrame.

Pandas DataFrames and series

The main purpose of the Pandas library is to facilitate working with data, which is why Pandas introduces two data structures: Series and DataFrame. Understanding these structures is key to using this library effectively.

Pandas series

A series is a one-dimensional data structure, or rather an array (ndarray), similar to a list or column in a table. Each element (e.g. integers, lists, objects, tuples) in the series is assigned to an identifier called an index. The series stores data of one type.

Here is an example of creating a series that contains a list of items:

import pandas as pd 

vals_sr = pd.Series(["Val_1", "Val_2", "Val_3", "Val_4", "Val_5"]) 

print(vals_sr)

When it comes to the index, by default, these are integers, starting from zero. The index can be changed, e.g. by labeling. In this case, we should extend our code that is responsible for creating the series. For this we use the index parameter. The code looks like this:

import pandas as pd 

vals_sr = pd.Series(["Val_1", "Val_2", "Val_3", "Val_4", "Val_5"], index=["A", "B", "C", "D", "E"]) 

print(vals_sr)

It is worth remembering that the number of “labels” should correspond to the number of elements in a series. Otherwise, the Python interpreter will return an error (ValueError). If you do not want to display the entire series, but only check which indexes have been assigned or display only values without them, you can use the following code snippet:

import pandas as pd 

vals_sr = pd.Series(["Val_1", "Val_2", "Val_3", "Val_4", "Val_5"], index=["A", "B", "C", "D", "E"]) 

print(vals_sr) 

print(vals_sr.index) # -- returns objects of Index type 

print(vals_sr.values) # -- returns an ndarray object

Pandas DataFrame

A DataFrame is a two-dimensional data structure similar to a table in a database or Excel spreadsheet. A DataFrame consists of rows and columns – each column in a DataFrame is a series. As you probably guess, even though a given column contains only one data type, a DataFrame can contain many columns, each of which includes a different type of data. An example would be creating a DataFrame from data on transactions made by customers, identified by ID.

import pandas as pd 

ct_data = { 

'client_id': ['C34P', 'C35S', 'C35P', 'C97S', 'C58S'], 

'card_transactions': [11, 244, 31, 458, 63] 

} 

client_transaction_df = pd.DataFrame(ct_data) 

print(client_transaction_df)

Loading data from different sources

One of the most important benefits of Pandas is the ease with which you can load data from a variety of sources and file formats. The most popular ones include:

CSV
Excel (.xlsx)
SQL
Flat files (e.g., text files)

Data from the CSV file can be loaded into the DataFrame with the pd.read_csv() function.

import pandas as pd 

df = pd.read_csv('path_to_your_file.csv') 

print(df)

Similarly, we can load an Excel file using the pd.read_excel() function.

import pandas as pd 

df = pd.read_excel('path_to_your_file.xlsx') 

print(df)

To load the result of the SQL query, we must first create a connection with the database. Using the example of the SQLite database, we can do it this way:

import pandas as pd 

import sqlite3 

# Establishing a database connection 

conn = sqlite3.connect("database_name.db") 

# Execution of the query 

df = pd.read_sql_query("SELECT * FROM my_table", conn) 

print(df)

Assuming you’re using a different database, you’ll need to install the right Python driver and replace the sqlite3.connect with the right connection. If you want to learn more about sqlite, I encourage you to access the sqlite3documentation.

In the next paragraph, we will discuss the basic operations on data using series and DataFrame.

Pandas – operations on data

Once you know how to load data using the Pandas library, we will now focus on selecting, filtering and sorting data and using the apply and map functions.

Selecting, filtering and sorting data

Selecting specific data from the DataFrame is one of the basic and most commonly used operations. Pandas allows you to select data in many ways:

Column selection: df[‘columnname’]
Selecting rows using index numbers: df.iloc[index]
Selecting rows using index labels: df.loc[label]

See a sample snippet of the code below:

import pandas as pd 

ct_data = { 

'client_id': ['C34P', 'C35S', 'C35P', 'C97S', 'C58S'], 

'count': [11, 244, 31, 458, 63] 

} 

df = pd.DataFrame(ct_data) 

print(df) 

# Selecting the 'client_id' column 

print(df['client_id']) 

 

# Selecting the first line 

print(df.iloc[0]) 

# Selecting the row with the zero index label 

print(df.loc[0])

Filtering means the process of selecting a subset of data based on given criteria. For example, we may wish to select only those transactions that are available in a quantity greater than 60:

filtered_df = df[df['count'] > 60] 

print(filtered_df)

Data sorting is a simple process that we can perform using the sort_values() method:

sorted_df = df.sort_values('count') 

print(sorted_df)

Column operations: adding, deleting, renaming

To add a new column to the DataFrame, we can simply assign the data to the new column, as in the example below:

df['amount'] = [1200, 4500, 3000, 28000, 700] # -- we add a column with the sum of the amounts for which transactions were made 

print(df)

To delete a column, we will use the drop() method:

df = df.drop('amount', axis=1) 

print(df)

You can rename a column using the rename() method:

df = df.rename(columns={'client_id': 'client_code', 'count': 'quantity'}) 

print(df)

Using apply and map functions

The apply and map functions allow you to apply the selected function to each element saved in the series or DataFrame. For example, using functions from the NumPy library for the ‘quantity’ column.

import numpy as np 

df['log_quantity'] = df['quantity'].apply(np.log) 

print(df)

Applying the map method is similar, but it only works on the series. It is often used to replace values based on a dictionary. For example, for a list of customers, we may wish to add an additional digit in the identifier.

code_map = { 

'C34P': '0C34P', 

'C35S': '1C35S', 

'C35P': '1C35P', 

'C97S': '0C97S', 

'C58S': '0C58S' 

} 

df['client_code'] = df['client_code'].map(code_map) 

print(df)

Pandas: data cleaning and repairing missing data

Working with data that comes from real sources practically always involves the need to clean or correct it. The data often contains gaps, duplicates, or data types that are not suitable for the analysis. In this part, I will discuss problems and show you how to deal with them using Pandas library tools.

Handling missing data (NaN)

Missing data is marked as NaN (Not a Number). Pandas offers several methods to handle it, such as:

Filling in missing data with a specific value
Deleting rows of missing data

The fillna() method allows you to fill in the missing data with a specific value or using a specific method (e.g. ‘forward fill’ – ffill, ‘backward fill’ – bfill):

import numpy as np 

data = { 

    'A': [1, 2, np.nan], 

    'B': [5, np.nan, np.nan], 

C, 1, 2, 3 

} 

df = pd.DataFrame(data) 

df_filled_zeros = df.fillna (value=0)  # --We fill in the missing data with the value 0 

print(df_filled_zeros)

It is also possible to delete rows with missing data. In the case of a large set and a small number of “broken” rows, this should not have a large impact on the quality of the data. But with a small set, deleting several rows can significantly affect the next analysis. However, If you decide to delete the selected rows, you can use the dropna() method:

df_dropped = df.dropna()  # Removing rows with missing data 

print(df_dropped)

Sometimes, in our set, there are numerous unnecessary duplicates from the analysis perspective. Removing duplicates is especially useful when they make up the majority of our set. Getting rid of them will allow you to unburden the library and perform more efficient operations, e.g. on columns in the DataFrame. Pandas provides a drop_duplicates() method that allows you to easily remove duplicates:

data = { 

'client_id': ['C34P', 'C35S', 'C35P', 'C97S', 'C58S'], 

'count': [11, 244, 31, 458, 63] 

} 

df = pd.DataFrame(data) 

df = df.drop_duplicates()  # We remove duplicates 

print(df)

You can learn more about working with missing data from the extensive documentation on Pydata.org: Working with missing data — pandas 2.0.3 documentation (pydata.org).

If you work on data provided by other people or companies, you may encounter numeric data (integers, digits, floating point numbers), which are presented in the form of strings of characters. So, for example, the integer 200 in the DataFrame is written as a string with the value ‘200’. The interpreter will treat this as text, not a number. If you want to perform statistical or mathematical operations on such data, it is necessary to change the data type of the column (in this case from ‘str‘ to ‘int’). You can do this using the astype() method:

import pandas as pd 

data = { 

'client_id': ['C34P', 'C35S', 'C35P', 'C97S', 'C58S'], 

'count': [11, 244, 31, 458, 63] 

} 

df = pd.DataFrame(data) 

df['count'] = df['count'] .astype (int)  # We are changing the data type of the column 'count' to int 

print(df)

In conclusion, cleaning data is usually a necessary step in the process of data processing and analysis. Pandas comes with many tools to make it easier. If you want to learn more about working with text files, here is a link to the documentation: Working with text data — pandas 2.0.3 documentation (pydata.org). In the next part, I will move on to the topic of aggregation and grouping data.

Streamline Your Application Maintenance

Leszek Jaros, our Head of Telco and AMS Practice, is here to help you navigate the complexities of Application Maintenance Services. Book a consultation to boost your application's efficiency

Schedule a meeting

Basic statistical operations and data grouping

Once we have clean and properly formatted data, we can proceed to the analysis. In this paragraph, I will show you some basic statistics, data grouping, and operations on indexes and multi-indexes.

Basic statistics calculations

Pandas offers functions to perform calculation of basic statistics, such as:

mean – the average value,
median – the median value in the data set,
mode – the most common value in the data set,
standard deviation.

Based on the example of data samples generated with the Gaussian distribution, we will create a DataFrame and determine the above statistics.

import pandas as pd 

import numpy as np 

data = { 

'A': e.g.random.normal(0, 1, 100), 

'B': np.random.normal(1, 2, 100), 

'C': np.random.normal(-1, 0.5, 100) 

} 

df = pd.DataFrame(data) 

print(df.mean())  # average 

print(df.median())  # median 

print(df.mode())  # Moda 

print(df.std())  # standard deviation

Of course, based on the example of this data, I wanted to show you how tools in the Pandas library work. I leave testing its full capabilities on real data to you. To familiarize yourself with the specification of the discussed functions and many others that can be used on DataFrame, I encourage you to go through the documentation: pandas.DataFrame.mean — pandas 2.0.3 documentation (pydata.org)

Grouping data with the groupby function

The groupby function allows you to group data based on specific columns. If you use SQL language on a daily basis, the issue of grouping should not be new to you. We will use the groupby() function for grouping. After grouping the data, we can calculate statistics for each group:

import pandas as pd 

data = { 

'product_id': ['product_34P', 'product_34P', 'store_35S', 'product_35P', 'store_97S', 'product_35P', 'product_34P', 'store_58S'], 

'count': [12, 24, 36, 60, 18, 48, 20, 72], 

'price': [1.2, 0.5, 0.75, 1.25, 2.0, 1.3, 0.55, 0.8] 

'store': ['EU1', 'UK1', 'EU2', 'EU2', 'UK2', 'EU1', 'UK2', 'EU1'] 

} 

df = pd.DataFrame(data) 

grouped = df.groupby('product_id') 

print(grouped.mean())  #average price and quantity for each product

Indexing in Pandas: operations on indexes and multi-indexes

Indexes are an important part of the discussed data structures in the Pandas library. They allow quick access to data. Basic operations include, for example:

resetting the index,
setting a new index,
sorting by index.

Multi-indexes allow you to index across many levels, which is especially useful in the case of hierarchical data. Multi-indexes allow you to analyze data at different hierarchical levels. For example, if we have a dataset of product sales in different countries and regions. It allows you to notice global sales trends, as well as more detailed trends at the level of selected countries or regions – see the example code below. I used a previously created DataFrame with product sales data.

# Set ‘product_id’ as index 

df = df.set_index('product_id') 

print(df) 

# Resetting the index 

df = df.reset_index() 

print(df) 

# Setting up a multi-index 

df = df.set_index(['product_id', 'count']) 

print(df)

Data visualization using Pandas, NumPy and Matplotlib libraries

Data visualization is a key element of any data analysis. It helps to better understand the data structure and makes it easier to present results. Pandas offers built-in data visualization tools that are based on the Matplotlib library. An example code for creating several charts is below:

 import pandas as pd import pandas as pd 

import numpy as np 

import matplotlib.pyplot as plt 

# Sample data 

df = pd.DataFrame({ 

'A': np.random.randn(1000), 

'B': np.random.randn(1000), 

'C': np.random.randn(1000) 

}) 

# Line chart  

df['A'].plot(kind='line') 

plt.show() 

#Histogram 

df['B'].plot(kind='hist', bins=20) 

plt.show() 

# dot diagram 

df.plot(kind='scatter', x='A', y='B') 

plt.show() 

# Bar chart  

df['C'].value_counts().sort_index().plot(kind='bar') 

plt.show()

Analyzing data using Pandas

After discussing the basic features and operations that Pandas offers, we will move on to a practical example of analyzing a real dataset. For this purpose, we will use the public data set on the survival of passengers on the Titanic. The process can be split into the following steps:

Loading and pre-screening data
Data cleansing
Data analysis
Data visualization

You can download the data CSV FILE titanic.csv here:

titanic Download

We will start by loading the data and examining its structure:

import pandas as pd 

# Loading data 

df = pd.read_csv('titanic.csv') 

# Displaying the first 5 rows 

print(df.head()) 

# Data basics 

print(df.info())

Then we will perform a basic data cleansing – for example, removing the missing values:

# Removing rows with missing data 

df = df.dropna()

In this case, after deleting the rows with missing data, we can assume that the data is ready for analysis. Let’s now see how the ticket class influenced the chances of survival:

# Grouping data by ticket class and calculating average survival 

print(df.groupby('Pclass')['Survived'].mean())

We can also visualize our results to better understand the dependencies in the data:

import matplotlib.pyplot as plt 

# Survival bar graph by ticket class 

df.groupby('Pclass')['Survived'].mean().plot(kind='bar') 

plt.ylabel('Survival Rate') 

plt.show()

To sum up, this simple example shows how easy it is to use Pandas to load, clean, analyze, and visualize your data. I encourage you to familiarize yourself with the sample dataset (CSV file attached) and analyze selected dependencies on your own.

Pandas, tips and the most popular features (cheat sheet)

When working on a DataFrame, a “view” and “copy” of the original DataFrame may draw your attention. This is important because when we try to assign new values to a dataset:

When using the view – the changes will affect the original DataFrame,
When we use a copy – we will not change the original DataFrame.

It is worth using the.loc [] or .iloc[] method to select data and assign values to avoid unnecessary problems.

Furthermore, Pandas can “deduce” data types automatically when they are loaded, e.g. from a file or database. However, this is not always accurate or in accordance with the user’s assumptions. Therefore, it is worth using the .dtypes method and checking the data types in the created DataFrame. In the event of a mismatch, use the .astype () method that you are already familiar with and convert the selected data types.

When it comes to missing data (i.e. NaN values), performing operations on missing data may lead to incorrect or unexpected results. For example, the sum of the number and value of NaN gives the value of NaN. So remember to handle missing data, e.g. using the .dropna() and .fillna() methods, to delete or fill in the missing data accordingly.

Here are some of the most popular features in the Pandas library:

Loading and saving data:

read_csv(), to_csv(), read_excel(), to_excel(), read_sql(), to_sql()

Data selection:

.loc[], .iloc[]

Data manipulation:

drop(), rename(), set_index(), reset_index(), pivot(), melt()

Data cleansing:

dropna(), fillna(), replace(), duplicated(), drop_duplicates()

Data analysis:

describe(), value_counts(), groupby(), corr()

Stats:

mean(), median(), min(), max(), std(), quantile()

Operations on strings:

str.lower(), str.upper(), str.contains(), str.replace(), str.split(), str.join()

Data visualization:

fence(), hist(), boxplot()

Of course, the abovementioned functions have significant parameterization capabilities, so I encourage you to go through the documentation to better understand and adapt their use to your individual needs (see the documentation: General functions).

Fundamentals of Pandas – summary

Pandas is a popular library which is indispensable when working with data and analyzing it in the Python environment. The above introduction definitely does not cover the entire topic, but shows the basic capabilities and indicates possible directions for exploring the library further. The Pandas library is particularly efficient when combined with NumPy, Matplotlib, Seaborn and other libraries, depending on the given task.

Consult your project directly with a specialist

Book a meeting

Python – Nearshore Software Development Company – IT Outsourcing Services