<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	 xmlns:media="http://search.yahoo.com/mrss/" >

<channel>
	<title>Python &#8211; Nearshore Software Development Company &#8211; IT Outsourcing Services</title>
	<atom:link href="https://nearshore-it.eu/tag/python/feed/" rel="self" type="application/rss+xml" />
	<link>https://nearshore-it.eu</link>
	<description>We are a Nearshore Software Development Company with 14 years of experience in delivering large-scale IT projects in the areas of PHP, Java, .NET, BI and MDM.</description>
	<lastBuildDate>Wed, 21 Aug 2024 14:01:50 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.8.3</generator>

<image>
	<url>https://nearshore-it.eu/wp-content/uploads/2023/01/cropped-inetum-favicon-300x300-1-32x32.png</url>
	<title>Python &#8211; Nearshore Software Development Company &#8211; IT Outsourcing Services</title>
	<link>https://nearshore-it.eu</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Python Pandas and DataFrame: The Power Couple for Modern Data Analysis </title>
		<link>https://nearshore-it.eu/technologies/python-pandas-tutorial-check-our-complete-introduction-to-pandas/</link>
					<comments>https://nearshore-it.eu/technologies/python-pandas-tutorial-check-our-complete-introduction-to-pandas/#respond</comments>
		
		<dc:creator><![CDATA[Piotr Ludwinek]]></dc:creator>
		<pubDate>Thu, 24 Aug 2023 03:30:00 +0000</pubDate>
				<category><![CDATA[Technologies]]></category>
		<category><![CDATA[Articles]]></category>
		<category><![CDATA[Python]]></category>
		<guid isPermaLink="false">https://nearshore-it.eu/?p=24861</guid>

					<description><![CDATA[How to easily read, recognize and analyze data in Python? Explore the free Python library: Pandas.]]></description>
										<content:encoded><![CDATA[
<p><strong>Pandas is a free Python library</strong> that greatly expands your data analysis and processing capabilities. It is one of the most important tools in the Python ecosystem and is widely used across industries. In this article, you will learn <strong>how to load</strong>, <strong>process </strong>and <strong>analyze data </strong>easily in Python. These skills are valued in many areas, from finance to engineering.</p>



<div class="table-of-contents">
    <p class="title">Jump To:</p>
    <ol>
                    <li><a href="#thon-–-Pandas-Library-in-Data-Science.">1.  Python – Pandas Library in Data Science. Why is it worth finding out more? </a></li>
                    <li><a href="#Introduction-to-Pandas-in-Python-">2.  Introduction to Pandas in Python</a></li>
                    <li><a href="#When-will-the-Pandas-library-work-best?-">3.  When will the Pandas library work best? </a></li>
                    <li><a href="#How-to-install-the-Pandas-library-">4.  How to install the Pandas library </a></li>
                    <li><a href="#How-to-import-Python-Pandas-into-your-project-">5.  How to import Python Pandas into your project </a></li>
                    <li><a href="#Pandas-DataFrames-and-series-">6.  Pandas DataFrames and series </a></li>
                    <li><a href="#Loading-data-from-different-sources-">7.  Loading data from different sources </a></li>
                    <li><a href="#Pandas-–-operations-on-data-">8.  Pandas – operations on data </a></li>
                    <li><a href="#Selecting,-filtering-and-sorting-data-">9.  Selecting, filtering and sorting data</a></li>
                    <li><a href="#Using-apply-and-map-functions-">10.  Using apply and map functions</a></li>
                    <li><a href="#Pandas:-data-cleaning-and-repairing-missing-data-">11.  Pandas: data cleaning and repairing missing data </a></li>
                    <li><a href="#Basic-statistical-operations-and-data-grouping">12.  Basic statistical operations and data grouping  </a></li>
                    <li><a href="#Analyzing-data-using-Pandas">13.  Analyzing data using Pandas </a></li>
                    <li><a href="#Pandas,-tips-and-the-most-popular-features-(cheat-sheet)">14.  Pandas, tips and the most popular features (cheat sheet) </a></li>
                    <li><a href="#Fundamentals-of-Pandas-–-summary-">15.  Fundamentals of Pandas – summary </a></li>
            </ol>
</div>


<h2 class="wp-block-heading" id="thon-–-Pandas-Library-in-Data-Science.">Python – Pandas Library in Data Science. Why is it worth finding out more?</h2>



<p>The Pandas library was created to make it possible to work with various types of data that are not always complete or require proper processing. Pandas provides flexible and easy-to-use data structures and tools (albeit not always efficient ones&#8230; later in the article I will also tell you how to deal with that). Alongside libraries such as <strong>NumPy</strong>, <strong>Matplotlib</strong>, <strong>Seaborn </strong>and <strong>scikit-learn</strong>, it has earned popularity and recognition among academics, analysts, engineers, and data enthusiasts.&nbsp;</p>



<h2 class="wp-block-heading" id="Introduction-to-Pandas-in-Python-">Introduction to Pandas in Python</h2>



<p>Pandas is an ideal tool for managing and analyzing small and medium-sized datasets (with the help of extra libraries where needed). Large datasets, which are common in areas related to <strong>Big Data</strong>, can be processed too, but as the amount of data grows, so does the likelihood of memory and performance problems.&nbsp;</p>



<h2 class="wp-block-heading" id="When-will-the-Pandas-library-work-best?-">When will the Pandas library work best?</h2>



<p>The Pandas library is well suited to working with different data types and sources:&nbsp;</p>



<ul class="wp-block-list">
<li>Tabular data with columns of various types (e.g. Excel, SQL). </li>



<li>Time-series data. </li>



<li>Data with labels of rows and columns (labeled data). </li>
</ul>



<p><strong>Examples of applications of the Pandas library.&nbsp;</strong></p>



<p>Typical tasks include:&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Loading various data formats</strong> (CSV, Excel, SQL, flat files, etc.). </li>



<li><strong>Filtering</strong>, <strong>sorting </strong>and other operations on data. </li>



<li><strong>Cleaning data </strong>(deleting NaN values, that is, Not a Number values), <strong>averaging</strong>, <strong>replacing values</strong>, etc. </li>



<li>Quickly and efficiently <strong>calculating statistics </strong>and <strong>performing operations </strong>on data. </li>



<li><strong>Visualization of data</strong> with charts. </li>
</ul>



<p>Before we start using the Pandas library, let&#8217;s make sure that it is installed in the Python environment.&nbsp;</p>



<p>Also read: <a href="https://nearshore-it.eu/articles/technologies/nosql-vs-sql-differences-and-use-cases/" target="_blank" rel="noreferrer noopener">SQL vs NoSQL databases</a>&nbsp;</p>



<h2 class="wp-block-heading" id="How-to-install-the-Pandas-library-">How to install the Pandas library</h2>



<p>Installing Pandas is very simple and can be done with pip, the default Python package manager. I assume that Python is already installed on your computer; if not, here&#8217;s how to do it: <a href="https://wiki.python.org/moin/BeginnersGuide/Download" target="_blank" rel="noreferrer noopener">BeginnersGuide/Download &#8211; Python Wiki</a>. Then just open the terminal and type the following command:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install pandas </pre>



<p>You may be required to grant permissions to execute this command (e.g., using sudo on Unix systems or running a terminal with administrator rights on <strong>Windows</strong>), and if you are using a specific Python virtual environment (e.g., venv or conda), you will need to activate this environment before installing the package. You can read more about virtual environments in the documentation.&nbsp;&nbsp;</p>



<p>If you are using Anaconda, you can install Pandas using the command:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">conda install pandas </pre>



<h2 class="wp-block-heading" id="How-to-import-Python-Pandas-into-your-project-">How to import Python Pandas into your project</h2>



<p>Once the Pandas library is installed, you can start using it. The first step is to import the library into our script or project. Importing Pandas is no different from importing any other library in Python.&nbsp;</p>



<p>We can do so using the code below:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd </pre>



<p>In Python, Pandas is typically imported under the alias ‘pd’, which is a short and commonly accepted abbreviation. From then on, when we want to use Pandas functions, we type &#8220;<strong>pd</strong>&#8221; instead of the full &#8220;<strong>pandas</strong>&#8221;. The same convention applies to the NumPy library (‘np’), which I wrote about in a previous article.&nbsp;</p>



<p>For example, if we wanted to create a DataFrame (one of the key data structures in Pandas), the code would look like this:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd 

data = { 

'column_1': [3, 2, 0, 1],  

'column_2': [0, 3, 7, 2] 

} 

example_df = pd.DataFrame(data) 

print(example_df) </pre>



<p>As you can see in the example, I used the DataFrame class, one of the basic structures provided by Pandas. In the next section, we will discuss the two basic data structures – Series and DataFrame.&nbsp;</p>



<h2 class="wp-block-heading" id="Pandas-DataFrames-and-series-">Pandas DataFrames and series</h2>



<p id="Pandas-DataFrames-and-series-">The main purpose of the Pandas library is to facilitate working with data, which is why Pandas introduces two data structures: <strong>Series </strong>and <strong>DataFrame</strong>. Understanding these structures is key to using this library effectively.&nbsp;</p>



<h3 class="wp-block-heading">Pandas series&nbsp;</h3>



<p>A series is a one-dimensional labeled array, similar to a list or a column in a table; its values are typically stored in a NumPy <strong>ndarray</strong>. Each element is associated with a label, and together these labels form the index. A series has a single data type; if you mix values such as integers, lists, or tuples, they are stored under the generic object type.&nbsp;</p>



<p>Here is an example of creating a series that contains a list of items:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd 

vals_sr = pd.Series(["Val_1", "Val_2", "Val_3", "Val_4", "Val_5"]) 

print(vals_sr) </pre>



<p>By default, the index consists of integers starting from zero. You can replace it with your own labels by passing the index parameter when creating the series. The code looks like this:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd 

vals_sr = pd.Series(["Val_1", "Val_2", "Val_3", "Val_4", "Val_5"], index=["A", "B", "C", "D", "E"]) 

print(vals_sr) </pre>



<p>It is worth remembering that the number of &#8220;labels&#8221; must match the number of elements in the series. Otherwise, Pandas raises an error (<strong>ValueError</strong>). If you do not want to display the entire series, but only check which labels have been assigned, or display only the values without them, you can use the following code snippet:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd 

vals_sr = pd.Series(["Val_1", "Val_2", "Val_3", "Val_4", "Val_5"], index=["A", "B", "C", "D", "E"]) 

print(vals_sr) 

print(vals_sr.index) # -- returns objects of Index type 

print(vals_sr.values) # -- returns an ndarray object </pre>
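
<p>As a minimal sketch of the two points above, the snippet below reads a single element by its label and then deliberately passes too few labels to trigger the ValueError. The shortened series and the variable name mismatch_raised are only illustrative.&nbsp;</p>

<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

vals_sr = pd.Series(["Val_1", "Val_2", "Val_3"], index=["A", "B", "C"])

# Access a single element by its label
print(vals_sr["B"])

# Five values but only two labels: the lengths do not match
try:
    pd.Series(["Val_1", "Val_2", "Val_3", "Val_4", "Val_5"], index=["A", "B"])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err) </pre>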



<h3 class="wp-block-heading">Pandas DataFrame</h3>



<p>A DataFrame is a two-dimensional data structure similar to a table in a database or an Excel spreadsheet. A DataFrame consists of rows and columns, and each column in a DataFrame is a series. As you have probably guessed, although a given column holds only one data type, a DataFrame can contain many columns, each with a different type of data. As an example, let&#8217;s create a DataFrame from data on transactions made by customers, identified by ID.&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd 

ct_data = { 

'client_id': ['C34P', 'C35S', 'C35P', 'C97S', 'C58S'], 

'card_transactions': [11, 244, 31, 458, 63] 

} 

client_transaction_df = pd.DataFrame(ct_data) 

print(client_transaction_df) </pre>
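
<p>To see that each column is a series with its own data type, you can inspect the dtypes attribute and pull out a single column. This is only a small sketch that reuses the sample data from the example above; the variable name transactions is my own.&nbsp;</p>

<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

ct_data = {
    'client_id': ['C34P', 'C35S', 'C35P', 'C97S', 'C58S'],
    'card_transactions': [11, 244, 31, 458, 63]
}
client_transaction_df = pd.DataFrame(ct_data)

# Each column reports its own data type
print(client_transaction_df.dtypes)

# Selecting a single column returns a Series
transactions = client_transaction_df['card_transactions']
print(type(transactions)) </pre>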



<h2 class="wp-block-heading" id="Loading-data-from-different-sources-">Loading data from different sources</h2>



<p>One of the most important benefits of Pandas is the ease with which you can load data from a variety of sources and file formats. The most popular ones include:&nbsp;</p>



<ul class="wp-block-list">
<li><strong>CSV </strong></li>



<li><strong>Excel</strong> (.xlsx) </li>



<li><strong>SQL </strong></li>



<li><strong>Flat files</strong> (e.g., text files) </li>
</ul>



<p>Data from the CSV file can be loaded into the DataFrame with the <strong>pd.read_csv() function.</strong>&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd 

df = pd.read_csv('path_to_your_file.csv') 

print(df) </pre>
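
<p>Real files are rarely perfectly formatted, so read_csv() accepts parameters such as sep (the separator) and usecols (the columns to keep). The sketch below uses a made-up, semicolon-separated text held in memory with io.StringIO instead of a file on disk, so that it can be run as is.&nbsp;</p>

<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "client_id;count;note\nC34P;11;a\nC35S;244;b\nC35P;31;c\n"

# sep sets the separator, usecols limits the columns that are loaded
df = pd.read_csv(io.StringIO(csv_text), sep=';', usecols=['client_id', 'count'])
print(df) </pre>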



<p>Similarly, we can load an Excel file using the <strong>pd.read_excel() function.</strong>&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd 

df = pd.read_excel('path_to_your_file.xlsx') 

print(df) </pre>



<p>To load the result of the SQL query, we must first create a connection with the database. Using the example of the SQLite database, we can do it this way:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd 

import sqlite3 

# Establishing a database connection 

conn = sqlite3.connect("database_name.db") 

# Execution of the query 

df = pd.read_sql_query("SELECT * FROM my_table", conn) 

print(df) </pre>
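
<p>If you do not have a database file at hand, you can try the same idea with an in-memory SQLite database. In this sketch, the table layout and the tx_count column are assumptions made only for the example; the params argument passes the query parameter safely instead of pasting it into the SQL text.&nbsp;</p>

<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import sqlite3
import pandas as pd

# In-memory database with a small, hypothetical table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (client_id TEXT, tx_count INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("C34P", 11), ("C35S", 244)]
)
conn.commit()

# Parameterized query: the value 100 is bound to the ? placeholder
df = pd.read_sql_query(
    "SELECT * FROM my_table WHERE tx_count > ?", conn, params=(100,)
)
print(df)

conn.close() </pre>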



<p>If you&#8217;re using a different database, you&#8217;ll need to install the appropriate Python driver and replace sqlite3.connect with the matching connection. If you want to learn more about SQLite, I encourage you to read <a href="https://docs.python.org/3/library/sqlite3.html#module-sqlite3" target="_blank" rel="noreferrer noopener">the sqlite3 documentation</a>.&nbsp;</p>



<p>In the next section, we will discuss the basic operations on data using series and DataFrame.&nbsp;</p>



<h2 class="wp-block-heading" id="Pandas-–-operations-on-data-">Pandas – operations on data</h2>



<p>Now that you know how to load data using the Pandas library, we will focus on selecting, filtering and sorting data and on using the<strong> apply and map functions.</strong>&nbsp;</p>



<h2 class="wp-block-heading" id="Selecting,-filtering-and-sorting-data-">Selecting, filtering and sorting data</h2>



<p>Selecting specific data from the DataFrame is one of the basic and most commonly used operations. Pandas allows you to select data in many ways:&nbsp;</p>



<ul class="wp-block-list">
<li>Column selection: <strong>df[&#8216;columnname&#8217;]</strong> </li>



<li>Selecting rows using index numbers: <strong>df.iloc[index]</strong> </li>



<li>Selecting rows using index labels: <strong>df.loc[label]</strong> </li>
</ul>



<p>See a sample snippet of the code below:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd 

ct_data = { 

'client_id': ['C34P', 'C35S', 'C35P', 'C97S', 'C58S'], 

'count': [11, 244, 31, 458, 63] 

} 

df = pd.DataFrame(ct_data) 

print(df) 

# Selecting the 'client_id' column 

print(df['client_id']) 

 

# Selecting the first row (by position) 

print(df.iloc[0]) 

# Selecting the row with the zero index label 

print(df.loc[0]) </pre>



<p>Filtering is the process of selecting a subset of data based on given criteria. For example, we may wish to select only those clients who made more than 60 transactions:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">filtered_df = df[df['count'] > 60] 

print(filtered_df) </pre>
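
<p>Other conditions work in the same way, because any Boolean series can serve as a filter. As a sketch on the same sample data, between() keeps values within a range (both ends included by default) and isin() keeps rows whose value belongs to a given list. The thresholds and client codes below are chosen only for illustration.&nbsp;</p>

<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

df = pd.DataFrame({
    'client_id': ['C34P', 'C35S', 'C35P', 'C97S', 'C58S'],
    'count': [11, 244, 31, 458, 63]
})

# Rows whose count lies between 60 and 300
mid_range_df = df[df['count'].between(60, 300)]
print(mid_range_df)

# Rows for a chosen set of clients
selected_df = df[df['client_id'].isin(['C34P', 'C97S'])]
print(selected_df) </pre>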



<p>Data sorting is a simple process that we can perform using the <strong>sort_values() </strong>method:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">sorted_df = df.sort_values('count') 

print(sorted_df) </pre>
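
<p>By default, sort_values() sorts in ascending order and keeps the original index labels. A short sketch of sorting from the largest value to the smallest and renumbering the rows afterwards could look like this (the sample data is repeated so that the snippet runs on its own):&nbsp;</p>

<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

df = pd.DataFrame({
    'client_id': ['C34P', 'C35S', 'C35P', 'C97S', 'C58S'],
    'count': [11, 244, 31, 458, 63]
})

# ascending=False sorts from largest to smallest;
# reset_index(drop=True) renumbers the rows from zero
sorted_desc_df = df.sort_values('count', ascending=False).reset_index(drop=True)
print(sorted_desc_df) </pre>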



<h3 class="wp-block-heading">Column operations: adding, deleting, renaming</h3>



<p>To add a new column to the DataFrame, we can simply assign the data to the new column, as in the example below:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df['amount'] = [1200, 4500, 3000, 28000, 700] # -- we add a column with the sum of the amounts for which transactions were made 

print(df) </pre>



<p>To delete a column, we will use the drop() method:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df = df.drop('amount', axis=1) 

print(df) </pre>



<p>You can rename a column using the rename() method:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df = df.rename(columns={'client_id': 'client_code', 'count': 'quantity'}) 

print(df) </pre>



<h2 class="wp-block-heading" id="Using-apply-and-map-functions-">Using apply and map functions</h2>



<p>The apply and map functions allow you to apply a selected function to the data stored in a series or DataFrame. For example, we can apply a function from the NumPy library to each value in the ‘quantity’ column:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np 

df['log_quantity'] = df['quantity'].apply(np.log) 

print(df) </pre>



<p>The map method works in a similar way on a series, element by element. It is often used to replace values based on a dictionary; values without a matching key become NaN. For example, for a list of customers, we may wish to prefix each identifier with an additional digit.&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">code_map = { 

'C34P': '0C34P', 

'C35S': '1C35S', 

'C35P': '1C35P', 

'C97S': '0C97S', 

'C58S': '0C58S' 

} 

df['client_code'] = df['client_code'].map(code_map) 

print(df) </pre>
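
<p>One thing to watch out for when mapping with a dictionary: any value that has no key in the dictionary turns into NaN. The sketch below shows this on a small, made-up series (the code ‘C99X’ is intentionally missing from the dictionary) and one possible way to fall back to the original value.&nbsp;</p>

<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

codes = pd.Series(['C34P', 'C35S', 'C99X'])
code_map = {'C34P': '0C34P', 'C35S': '1C35S'}

# 'C99X' has no entry in code_map, so it becomes NaN
mapped = codes.map(code_map)
print(mapped)

# Keep the original code wherever no mapping was found
mapped_safe = mapped.fillna(codes)
print(mapped_safe) </pre>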



<h2 class="wp-block-heading" id="Pandas:-data-cleaning-and-repairing-missing-data-">Pandas: data cleaning and repairing missing data</h2>



<p>Working with data that comes from real sources practically always involves the need to clean or correct it. The data often contains gaps, duplicates, or data types that are not suitable for the analysis. In this part, I will discuss problems and show you how to deal with them using Pandas library tools.&nbsp;</p>



<h3 class="wp-block-heading">Handling missing data (NaN)&nbsp;</h3>



<p>Missing data is marked as<strong> NaN (Not a Number)</strong>. Pandas offers several methods to handle it, such as:&nbsp;</p>



<ul class="wp-block-list">
<li>Filling in missing data with a specific value </li>



<li>Deleting rows of missing data </li>
</ul>



<p>The&nbsp;<strong> fillna() </strong>method allows you to fill in the missing data with a specific value or using a specific method (e.g. &#8216;forward fill&#8217; – ffill, &#8216;backward fill&#8217; – bfill):&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd 

import numpy as np 

data = { 

    'A': [1, 2, np.nan], 

    'B': [5, np.nan, np.nan], 

    'C': [1, 2, 3] 

} 

df = pd.DataFrame(data) 

df_filled_zeros = df.fillna(value=0)  # -- We fill in the missing data with the value 0 

print(df_filled_zeros) </pre>
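
<p>Forward and backward filling can be sketched with the ffill() and bfill() methods of the DataFrame. Forward fill copies the last known value downwards, backward fill copies the next known value upwards; a gap with no neighbor on the relevant side stays empty. The small DataFrame below is made up for the example.&nbsp;</p>

<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, 3],
    'B': [np.nan, 5, np.nan]
})

# Forward fill: propagate the previous value down each column
df_ffill = df.ffill()
print(df_ffill)

# Backward fill: propagate the next value up each column
df_bfill = df.bfill()
print(df_bfill) </pre>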



<p>It is also possible to delete rows with missing data. In a large dataset with only a few &#8220;broken&#8221; rows, this should not have much impact on the quality of the data. In a small dataset, however, deleting several rows can significantly affect the subsequent analysis. If you decide to delete such rows, you can use the <strong>dropna() method:</strong>&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df_dropped = df.dropna()  # Removing rows with missing data 

print(df_dropped) </pre>



<p>Sometimes a dataset contains many duplicates that are unnecessary from the analysis perspective. Removing them is especially useful when they make up a large part of the dataset: it reduces memory use and makes later operations, e.g. on columns in the DataFrame, more efficient. Pandas provides a<strong> drop_duplicates() </strong>method that allows you to easily remove duplicates:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd 

data = { 

'client_id': ['C34P', 'C35S', 'C35S', 'C97S', 'C58S'],  # 'C35S' appears twice 

'count': [11, 244, 244, 458, 63] 

} 

df = pd.DataFrame(data) 

df = df.drop_duplicates()  # We remove duplicates: the repeated ('C35S', 244) row is dropped 

print(df) </pre>



<p>You can learn more about working with missing data from the extensive documentation on Pydata.org: <a href="https://pandas.pydata.org/docs/user_guide/missing_data.html" target="_blank" rel="noreferrer noopener">Working with missing data — pandas 2.0.3 documentation (pydata.org)</a>.&nbsp;</p>



<p>If you work on data provided by other people or companies, you may encounter numeric data (<strong>integers</strong> or <strong>floating point numbers</strong>) stored as strings of characters. For example, the integer 200 may be written in the DataFrame as a string with the value ‘200’. Pandas will treat this as text, not a number. If you want to perform statistical or mathematical operations on such data, it is necessary to change the data type of the column (in this case from ‘str’ to ‘int’). You can do this using the astype() method:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd 

data = { 

'client_id': ['C34P', 'C35S', 'C35P', 'C97S', 'C58S'], 

'count': ['11', '244', '31', '458', '63']  # numbers stored as text 

} 

df = pd.DataFrame(data) 

df['count'] = df['count'].astype(int)  # We are changing the data type of the column 'count' to int 

print(df.dtypes) </pre>
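
<p>The astype() method fails if even one value cannot be converted. For messier columns, one option is pd.to_numeric() with errors='coerce', which turns unparseable entries into NaN so that they can be handled like any other missing data. A minimal sketch, with a made-up ‘n/a’ entry:&nbsp;</p>

<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

raw = pd.Series(['11', '244', 'n/a', '458'])

# Values that cannot be parsed as numbers become NaN
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric)

# How many entries could not be converted
print(numeric.isna().sum()) </pre>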



<p>In conclusion, cleaning data is usually a necessary step in the process of data processing and analysis. Pandas comes with many tools to make it easier. If you want to learn more about working with text data, here is a link to the documentation: <a href="https://pandas.pydata.org/docs/user_guide/text.html" target="_blank" rel="noreferrer noopener">Working with text data — pandas 2.0.3 documentation (pydata.org)</a>. In the next part, I will move on to the topic of aggregation and grouping data.&nbsp;</p>





<h2 class="wp-block-heading" id="Basic-statistical-operations-and-data-grouping">Basic statistical operations and data grouping</h2>



<p>Once we have clean and properly formatted data, we can proceed to the analysis. In this section, I will show you some basic statistics, data grouping, and operations on indexes and multi-indexes.&nbsp;</p>



<h3 class="wp-block-heading">Basic statistics calculations&nbsp;</h3>



<p>Pandas offers functions to perform calculation of basic statistics, such as:&nbsp;&nbsp;</p>



<ul class="wp-block-list">
<li><strong>mean </strong>&#8211; the average value,  </li>



<li><strong>median </strong>– the median value in the data set,  </li>



<li><strong>mode </strong>– the most common value in the data set, </li>



<li><strong>standard deviation </strong>– a measure of how spread out the values are. </li>
</ul>



<p>Using samples drawn from a Gaussian (normal) distribution, we will create a DataFrame and compute the statistics listed above.&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd 

import numpy as np 

# 100 samples per column: normal(mean, standard deviation, size) 

data = { 

'A': np.random.normal(0, 1, 100), 

'B': np.random.normal(1, 2, 100), 

'C': np.random.normal(-1, 0.5, 100) 

} 

df = pd.DataFrame(data) 

print(df.mean())  # average 

print(df.median())  # median 

print(df.mode())  # mode 

print(df.std())  # standard deviation </pre>



<p>Of course, this generated data only serves to show how the tools in the Pandas library work. I leave testing its full capabilities on real data to you. To familiarize yourself with the specification of the discussed functions and many others that can be used on a DataFrame, I encourage you to go through the documentation: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html" target="_blank" rel="noreferrer noopener">pandas.DataFrame.mean — pandas 2.0.3 documentation (pydata.org)</a>&nbsp;</p>



<h3 class="wp-block-heading">Grouping data with the groupby function&nbsp;</h3>



<p>The <strong>groupby()</strong> function allows you to group data based on specific columns. If you use SQL on a daily basis, grouping should not be new to you. After grouping the data, we can calculate statistics for each group:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd 

data = { 

'product_id': ['product_34P', 'product_34P', 'store_35S', 'product_35P', 'store_97S', 'product_35P', 'product_34P', 'store_58S'], 

'count': [12, 24, 36, 60, 18, 48, 20, 72], 

'price': [1.2, 0.5, 0.75, 1.25, 2.0, 1.3, 0.55, 0.8], 

'store': ['EU1', 'UK1', 'EU2', 'EU2', 'UK2', 'EU1', 'UK2', 'EU1'] 

} 

df = pd.DataFrame(data) 

grouped = df.groupby('product_id') 

# Only numeric columns can be averaged, so we select them explicitly 

print(grouped[['count', 'price']].mean())  # average quantity and price for each product </pre>
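
<p>If you need different statistics for different columns, the agg() method with named aggregations is one way to do it. The sketch below groups the same sales data by store and computes the total quantity and the average price; the output column names total_count and avg_price are my own choice.&nbsp;</p>

<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

df = pd.DataFrame({
    'store': ['EU1', 'UK1', 'EU2', 'EU2', 'UK2', 'EU1', 'UK2', 'EU1'],
    'count': [12, 24, 36, 60, 18, 48, 20, 72],
    'price': [1.2, 0.5, 0.75, 1.25, 2.0, 1.3, 0.55, 0.8]
})

# Named aggregation: new_column=(source_column, function)
summary = df.groupby('store').agg(
    total_count=('count', 'sum'),
    avg_price=('price', 'mean')
)
print(summary) </pre>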



<h3 class="wp-block-heading">Indexing in Pandas: operations on indexes and multi-indexes&nbsp;</h3>



<p>Indexes are an important part of the discussed data structures in the Pandas library. They allow quick access to data. Basic operations include, for example:&nbsp;</p>



<ul class="wp-block-list">
<li><strong>resetting </strong>the index,  </li>



<li><strong>setting </strong>a new index, </li>



<li><strong>sorting </strong>by index. </li>
</ul>



<p>Multi-indexes allow you to index across many levels, which is especially useful for hierarchical data, because you can analyze it at different levels of the hierarchy. For example, with a dataset of product sales in different countries and regions, a multi-index lets you spot global sales trends as well as more detailed trends at the level of selected countries or regions. See the example code below, which uses the previously created DataFrame with product sales data.&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Setting 'product_id' as the index
df = df.set_index('product_id')
print(df)

# Resetting the index
df = df.reset_index()
print(df)

# Setting up a multi-index
df = df.set_index(['product_id', 'count'])
print(df)</pre>
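
<p>To illustrate working at different hierarchical levels, here is a minimal sketch that rebuilds the sample sales data and indexes it by store and product. Treating store as the outer level is an assumption made for this example; the variable names (indexed, eu1, p34, per_store) are illustrative only.</p>

```python
import pandas as pd

# Same sample data as in the grouping example
df = pd.DataFrame({
    'product_id': ['product_34P', 'product_34P', 'store_35S', 'product_35P',
                   'store_97S', 'product_35P', 'product_34P', 'store_58S'],
    'count': [12, 24, 36, 60, 18, 48, 20, 72],
    'price': [1.2, 0.5, 0.75, 1.25, 2.0, 1.3, 0.55, 0.8],
    'store': ['EU1', 'UK1', 'EU2', 'EU2', 'UK2', 'EU1', 'UK2', 'EU1'],
})

# Two-level index: store first, then product; sorting by index keeps lookups predictable
indexed = df.set_index(['store', 'product_id']).sort_index()

# All rows for one store (outer level)
eu1 = indexed.loc['EU1']

# One product across all stores (inner level)
p34 = indexed.xs('product_34P', level='product_id')

# Total quantity per store, aggregated on the outer level
per_store = indexed.groupby(level='store')['count'].sum()

print(eu1, p34, per_store, sep='\n\n')
```

<p>The same DataFrame answers both the store-level and the product-level question without being reshaped in between.</p>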



<h3 class="wp-block-heading">Data visualization using Pandas, NumPy and Matplotlib libraries&nbsp;</h3>



<p>Data visualization is a key element of any data analysis. It helps you understand the structure of the data and makes it easier to present results. Pandas offers built-in plotting tools that are based on the Matplotlib library. Example code for creating several charts is shown below:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sample data
df = pd.DataFrame({
    'A': np.random.randn(1000),
    'B': np.random.randn(1000),
    'C': np.random.randn(1000)
})

# Line chart
df['A'].plot(kind='line')
plt.show()

# Histogram
df['B'].plot(kind='hist', bins=20)
plt.show()

# Scatter plot
df.plot(kind='scatter', x='A', y='B')
plt.show()

# Bar chart: bin the continuous values first so that each bar counts one interval
pd.cut(df['C'], bins=10).value_counts().sort_index().plot(kind='bar')
plt.show()</pre>



<h2 class="wp-block-heading" id="Analyzing-data-using-Pandas">Analyzing data using Pandas</h2>



<p>After discussing the basic features and operations that Pandas offers, we will move on to a practical example of analyzing a real dataset. For this purpose, we will use the public data set on the survival of passengers on the Titanic. The process can be split into the following steps:&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Loading and pre-screening data </strong></li>



<li><strong>Data cleansing </strong></li>



<li><strong>Data analysis </strong></li>



<li><strong>Data visualization </strong></li>
</ul>



<p>You can download the titanic.csv data file here:&nbsp;&nbsp;</p>



<div class="wp-block-file"><a id="wp-block-file--media-2aee124f-ffe2-4dc4-bcd9-000dadebd05e" href="https://nearshore-it.eu/wp-content/uploads/2023/08/titanic.csv">titanic</a><a href="https://nearshore-it.eu/wp-content/uploads/2023/08/titanic.csv" class="wp-block-file__button wp-element-button" aria-describedby="wp-block-file--media-2aee124f-ffe2-4dc4-bcd9-000dadebd05e" download>Download</a></div>



<div style="height:30px" aria-hidden="true" class="wp-block-spacer"></div>



<p>We will start by loading the data and examining its structure:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd

# Loading data
df = pd.read_csv('titanic.csv')

# Displaying the first 5 rows
print(df.head())

# Columns, data types and non-null counts (info() prints its report itself and returns None)
df.info()</pre>



<p>Next, we will perform basic data cleansing – for example, removing rows with missing values:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Removing rows with missing data 

df = df.dropna() </pre>



<p>In this case, after deleting the rows with missing data, we can assume that the data is ready for analysis. Let&#8217;s now see how the ticket class influenced the chances of survival:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Grouping data by ticket class and calculating average survival 

print(df.groupby('Pclass')['Survived'].mean())</pre>
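
<p>The same pattern extends to grouping by more than one column. The sketch below assumes the dataset also has a Sex column, which is common in Titanic datasets but worth confirming with df.columns. To stay self-contained it uses a few made-up toy rows, so the numbers it prints are not results from titanic.csv.</p>

```python
import pandas as pd

# Toy rows for illustration only; these values are NOT taken from titanic.csv
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3],
    'Sex':      ['female', 'male', 'male', 'female', 'male', 'female', 'male', 'male'],
    'Survived': [1, 0, 1, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of survivors in each (class, sex) group
rates = toy.groupby(['Pclass', 'Sex'])['Survived'].mean()

# unstack() moves the inner index level (Sex) into columns for easier reading
table = rates.unstack('Sex')

print(table)
```

<p>With the real file loaded into df, replacing toy with df should produce the corresponding table for the actual passengers.</p>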



<p>We can also visualize our results to better understand the dependencies in the data:&nbsp;</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import matplotlib.pyplot as plt 

# Survival bar graph by ticket class 

df.groupby('Pclass')['Survived'].mean().plot(kind='bar') 

plt.ylabel('Survival Rate') 

plt.show() </pre>



<p>To sum up, this simple example shows how easy it is to use Pandas to load, clean, analyze, and visualize your data. I encourage you to familiarize yourself with the sample dataset (<strong>CSV file attached</strong>) and analyze selected dependencies on your own.&nbsp;</p>



<h2 class="wp-block-heading" id="Pandas,-tips-and-the-most-popular-features-(cheat-sheet)">Pandas, tips and the most popular features (cheat sheet)</h2>



<p>When working with a DataFrame, you will sooner or later run into the distinction between a “view” and a “copy” of the original DataFrame. It matters when we assign new values to a selection of the data:&nbsp;</p>



<ul class="wp-block-list">
<li><strong>When using the view</strong> – the changes will affect the original DataFrame, </li>



<li><strong>When we use a copy</strong> – we will not change the original DataFrame. </li>
</ul>



<p>It is worth using the <strong>.loc[]</strong> or <strong>.iloc[]</strong> indexer to select data and assign values in a single step to avoid unnecessary problems.&nbsp;</p>
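
<p>A minimal sketch of both cases follows: assigning through a single .loc call changes the original DataFrame, while an explicit .copy() gives an independent object. The small DataFrame and the threshold values are made up for this example.</p>

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': ['product_34P', 'product_35P', 'store_97S'],
    'price': [1.2, 1.25, 2.0],
})

# Select rows and column in one .loc call, then assign: this modifies df itself
df.loc[df['price'] > 1.22, 'price'] = 1.0

# An explicit copy is independent of the original
cheap = df[df['price'] <= 1.0].copy()
cheap['price'] = 0.0

print(df)
print(cheap)
```

<p>Calling .copy() makes the intent explicit: later changes to cheap are not meant to reach df.</p>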



<p>Furthermore, Pandas can infer data types automatically when data is loaded, e.g. from a file or database. However, the result is not always accurate or in line with the user&#8217;s expectations. Therefore, it is worth checking the data types in the created DataFrame with the <strong>.dtypes</strong> attribute. In the event of a mismatch, use the <strong>.astype()</strong> method that you are already familiar with to convert the selected columns.&nbsp;</p>
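
<p>The following sketch shows one way such a mismatch can appear and be handled. The CSV content is hypothetical: the code column looks numeric but is really an identifier with leading zeros.</p>

```python
import io
import pandas as pd

# Hypothetical CSV content: 'code' looks numeric but is really an identifier
csv_text = "code,count,price\n007,12,1.20\n042,24,0.50\n"

inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)  # 'code' is inferred as an integer, so leading zeros are lost

# Option 1: declare the type while loading
typed = pd.read_csv(io.StringIO(csv_text), dtype={'code': str})

# Option 2: convert a column after loading
typed['count'] = typed['count'].astype('float64')
print(typed.dtypes)
```

<p>Declaring the type at load time is the safer of the two options here, because information lost during inference (such as the leading zeros) cannot be recovered by a later .astype() call.</p>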



<p>When it comes to missing data (i.e. NaN values), performing operations on it may lead to incorrect or unexpected results. For example, adding a number to NaN gives NaN. So remember to handle missing data, e.g. using the <strong>.dropna()</strong> and <strong>.fillna()</strong> methods to delete or fill in the missing values, respectively.&nbsp;</p>
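
<p>A short sketch of these points on a made-up DataFrame follows. Filling with the column median is only one possible strategy and is used here as an assumption for illustration.</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'count': [12, np.nan, 36],
    'price': [1.2, 0.5, np.nan],
})

# Element-wise arithmetic propagates NaN
df['value'] = df['count'] * df['price']

# How many values are missing in each column?
missing = df.isna().sum()

# Either drop incomplete rows...
complete = df.dropna()

# ...or fill the gaps, here with each column's median (one possible choice)
filled = df.fillna(df.median())

print(missing, complete, filled, sep='\n\n')
```

<p>Note that reductions such as sum() and mean() skip NaN by default, so it is the element-wise operations where missing values propagate most visibly.</p>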



<p>Here are some of the most popular features in the Pandas library:&nbsp;</p>



<p><strong>Loading and saving data:&nbsp;</strong>&nbsp;</p>



<ul class="wp-block-list">
<li>read_csv(), to_csv(), read_excel(), to_excel(), read_sql(), to_sql() </li>
</ul>



<p><strong>Data selection:&nbsp;</strong>&nbsp;</p>



<ul class="wp-block-list">
<li>.loc[], .iloc[] </li>
</ul>



<p><strong>Data manipulation:</strong>&nbsp;&nbsp;</p>



<ul class="wp-block-list">
<li>drop(), rename(), set_index(), reset_index(), pivot(), melt() </li>
</ul>
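
<p>As a brief illustration of two of the reshaping functions listed above, the sketch below pivots a small, made-up long-format table into a wide one and melts it back. The column names follow the earlier sales example.</p>

```python
import pandas as pd

long_df = pd.DataFrame({
    'store': ['EU1', 'EU1', 'UK1', 'UK1'],
    'product_id': ['product_34P', 'product_35P', 'product_34P', 'product_35P'],
    'count': [12, 48, 24, 60],
})

# Long -> wide: one row per store, one column per product
wide = long_df.pivot(index='store', columns='product_id', values='count')

# Wide -> long again: product columns become rows
back = wide.reset_index().melt(id_vars='store', var_name='product_id', value_name='count')

print(wide)
print(back)
```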



<p><strong>Data cleansing:</strong></p>



<ul class="wp-block-list">
<li>dropna(), fillna(), replace(), duplicated(), drop_duplicates() </li>
</ul>



<p><strong>Data analysis:</strong>&nbsp;</p>



<ul class="wp-block-list">
<li>describe(), value_counts(), groupby(), corr() </li>
</ul>



<p><strong>Stats:</strong>&nbsp;</p>



<ul class="wp-block-list">
<li>mean(), median(), min(), max(), std(), quantile() </li>
</ul>



<p><strong>Operations on strings:</strong>&nbsp;</p>



<ul class="wp-block-list">
<li>str.lower(), str.upper(), str.contains(), str.replace(), str.split(), str.join() </li>
</ul>
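
<p>The string methods are available through the .str accessor of a Series. Here is a minimal sketch on made-up product identifiers; the replacement of "store" with "shop" is arbitrary.</p>

```python
import pandas as pd

products = pd.Series(['product_34P', 'store_35S', 'Product_35P'])

# Normalize case before comparing text
lowered = products.str.lower()

# Boolean mask: which identifiers contain the word 'product'?
is_product = lowered.str.contains('product')

# Split on '_' into separate columns: column 0 = prefix, column 1 = code
parts = lowered.str.split('_', expand=True)

# Plain (non-regex) text replacement
renamed = lowered.str.replace('store', 'shop', regex=False)

print(is_product, parts, renamed, sep='\n\n')
```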



<p><strong>Data visualization:</strong></p>



<ul class="wp-block-list">
<li>plot(), hist(), boxplot() </li>
</ul>



<p>Of course, the functions listed above accept many parameters, so I encourage you to go through the documentation to better understand them and adapt their use to your individual needs (<a href="https://pandas.pydata.org/docs/dev/reference/general_functions.html" target="_blank" rel="noreferrer noopener">see the documentation: General functions</a>).&nbsp;&nbsp;</p>



<h2 class="wp-block-heading" id="Fundamentals-of-Pandas-–-summary-">Fundamentals of Pandas – summary&nbsp;</h2>



<p>Pandas is a popular library that is indispensable for working with and analyzing data in the Python environment. The above introduction does not cover the entire topic, but it shows the basic capabilities and indicates possible directions for exploring the library further. The Pandas library is particularly effective when combined with NumPy, Matplotlib, Seaborn and other libraries, depending on the given task. </p>


</style>
]]></content:encoded>
					
					<wfw:commentRss>https://nearshore-it.eu/technologies/python-pandas-tutorial-check-our-complete-introduction-to-pandas/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
