
Who be me? The Ivy first name Betty, call me daughter Irie. The ship alive, Wolven does breathe. Travel through the retrograde; into the ether see what we see.

A Quick Intro to Betty Ivy


**The main goals are:**

  • Working with local data using the pandas, matplotlib, numpy libraries in Python:
    • Loading and exploring datasets (e.g., the California housing dataset)
    • Basic data manipulation and analysis
    • Creating visualizations (scatter plots, histograms) using Matplotlib
  • Converting query results to a Pandas DataFrame for further analysis

The notebook demonstrates how to load local data files (CSV) into Pandas DataFrames, perform basic operations like printing the data, examining column information, and selecting specific columns. It also shows how to create visualizations like scatter plots and histograms using Matplotlib.

# Working with local data

#First we are going to import the Python data science package called Pandas https://pandas.pydata.org/ and assign it the name 'pd' for easy reference
import pandas as pd
#Second we will import Matplotlib, a Python package for creating static, animated, and interactive graphics with Python https://matplotlib.org/
import matplotlib.pyplot as plt
#NumPy is a package that enables scientific computing and stores large data arrays in a way that facilitates larger / intensive calculations. https://numpy.org/doc/stable/user/absolute_beginners.html
import numpy as np
#Now we must first know where we are
!ls
#This folder contains sample data provided by colab
#First we are going to pull in one of the stock datasets provided by colab to examine it

calidf = pd.read_csv('/content/sample_data/california_housing_test.csv')

This simple Python code, written using the Pandas library, reads a CSV (Comma-Separated Values) file named “california_housing_test.csv” from the specified file path “/content/sample_data/” and turns it into a DataFrame available for further analysis.

Say What?

  • pd is an alias (shorthand) for the Pandas library, which is a powerful data manipulation and analysis library in Python. It is a common practice to import Pandas using import pandas as pd at the beginning of your Python script.
  • read_csv() is a function provided by the Pandas library that reads a CSV file and creates a DataFrame object from it. A DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet or SQL table, with rows and columns.
  • /content/sample_data/california_housing_test.csv is the path to the CSV file within the “/content/sample_data/” directory.

The entire line:

calidf = pd.read_csv('/content/sample_data/california_housing_test.csv')

assigns the resulting DataFrame object (created from reading the CSV file) to a variable named calidf.

After execution the variable calidf will contain a Pandas DataFrame representing the data from the “california_housing_test.csv” file. You can then perform various data manipulation, analysis, and visualization operations on this DataFrame using the rich functionality provided by the Pandas library. This is a common way to load tabular data (like CSV files) into a Python environment for further processing and analysis using the Pandas library.
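To make the idea of a two-dimensional labeled structure concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The column names echo the housing dataset, but the numbers are made up for illustration, not taken from the file:

```python
import pandas as pd

# A DataFrame is like a spreadsheet: labeled columns, numbered rows
toy = pd.DataFrame({
    'total_rooms': [880, 7099, 1467],
    'median_house_value': [452600.0, 358500.0, 352100.0],
})

print(toy.shape)          # (3, 2) -> 3 rows, 2 columns
print(list(toy.columns))  # the column labels
```

read_csv() builds exactly this kind of object, just with the rows and columns coming from the file instead of a dictionary.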

#If you receive no errors you can quickly print the dataframe to see what the data looks like
print(calidf)
#Examine the top 5 rows of the dataframe
calidf.head()


# Here we examine the information about our data and then take a peek at two columns
calidf.info()
#calidf[['housing_median_age', 'median_house_value']]
#maybe you are interested in median_house_value and median_income instead?
calidf[['total_rooms', 'median_house_value']]
#now it is clear I need to know more about my data and how it is structured
from matplotlib import pyplot as plt
#I could use the histogram feature to see distribution within a single variable
calidf['median_house_value'].plot(kind='hist', bins=20, title='median house value')
plt.gca().spines[['top', 'right']].set_visible(False)
#First I am just going to look at two of my measures to get a sense of distribution.
from matplotlib import pyplot as plt
_df_2.plot(kind='scatter', x='median_income', y='median_house_value', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

#Oh no wait, I have an error! The interpreter tells me where the error is. It looks like I am calling something '_df_2' which doesn't exist
#In order to address this error I must make a change in the line above, replacing '_df_2' with 'calidf'
#calidf does exist and is the pandas dataframe you imported from the CSV file
#once you have made the change rerun the code block
from matplotlib import pyplot as plt
#the output looks different depending on what I am plotting
calidf.plot(kind='scatter', x='population', y='', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)
#Oops, looks like there is also an error here. What is different between this block of code and the one directly above it?
#It looks like I have not defined a y value for my scatter plot. Make a choice on which column of data you would like to see plotted alongside population and re-run this block
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3801             try:
-> 3802                 return self._engine.get_loc(casted_key)
   3803             except KeyError as err:

8 frames
/usr/local/lib/python3.10/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/usr/local/lib/python3.10/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ''

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-11-14cf506e8985> in <cell line: 3>()
      1 from matplotlib import pyplot as plt
      2 #the output looks different depending on what I am plotting
----> 3 calidf.plot(kind='scatter', x='population', y='', s=32, alpha=.8)
      4 plt.gca().spines[['top', 'right',]].set_visible(False)

/usr/local/lib/python3.10/dist-packages/pandas/plotting/_core.py in __call__(self, *args, **kwargs)
    943         if kind in self._dataframe_kinds:
    944             if isinstance(data, ABCDataFrame):
--> 945                 return plot_backend.plot(data, x=x, y=y, kind=kind, **kwargs)
    946             else:
    947                 raise ValueError(f"plot kind {kind} can only be used for data frames")

/usr/local/lib/python3.10/dist-packages/pandas/plotting/_matplotlib/__init__.py in plot(data, kind, **kwargs)
     69             kwargs["ax"] = getattr(ax, "left_ax", ax)
     70     plot_obj = PLOT_CLASSES[kind](data, **kwargs)
---> 71     plot_obj.generate()
     72     plot_obj.draw()
     73     return plot_obj.result

/usr/local/lib/python3.10/dist-packages/pandas/plotting/_matplotlib/core.py in generate(self)
    450         self._compute_plot_data()
    451         self._setup_subplots()
--> 452         self._make_plot()
    453         self._add_table()
    454         self._make_legend()

/usr/local/lib/python3.10/dist-packages/pandas/plotting/_matplotlib/core.py in _make_plot(self)
   1259         scatter = ax.scatter(
   1260             data[x].values,
-> 1261             data[y].values,
   1262             c=c_values,
   1263             label=label,

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   3805             if self.columns.nlevels > 1:
   3806                 return self._getitem_multilevel(key)
-> 3807             indexer = self.columns.get_loc(key)
   3808             if is_integer(indexer):
   3809                 indexer = [indexer]

/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3802                 return self._engine.get_loc(casted_key)
   3803             except KeyError as err:
-> 3804                 raise KeyError(key) from err
   3805             except TypeError:
   3806                 # If we have a listlike key, _check_indexing_error will raise

KeyError: ''

Amazing, you have now successfully worked with CSV data uploaded to our local instance. Now we are going to import a CSV file from an online repository. When you share your personal Colab version with me, all blocks must run correctly.

Next we are going to import our own data from Kaggle

Go to Kaggle.com and create an account using the Google Account associated with your CCV email that you created for this class.

Click on Datasets and then search for interstellar travel.

Spend some time reading the information provided about the data to become familiar with what is available to you within the dataset.

Now download the data. It will land as a compressed file. You will need to extract the CSV file locally first before uploading it to the colab instance.

Now come back to this Colab: on the left-hand side panel, click on the folder icon, then double click to open it.

Now click on the document with the up arrow. When you hover over it, it will say ‘Upload to session storage’. This allows us to load data temporarily for this instance. You will get a warning to ensure the file is saved elsewhere, since it will be deleted after you close the session.

Once you have selected the data to upload, Google will upload it (depending on the size, this may take a moment) and let you know what level of disk storage you are working with. You will find this information at the bottom left side of your panel.

import pandas as pd
interstellar_travel = pd.read_csv('/content/sample_data/interstellar_travel.csv')
interstellar_travel.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 547568 entries, 0 to 547567
Data columns (total 18 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   Age                                    547568 non-null  int64  
 1   Gender                                 547568 non-null  object 
 2   Occupation                             547568 non-null  object 
 3   Travel Class                           547568 non-null  object 
 4   Destination                            547568 non-null  object 
 5   Star System                            547568 non-null  object 
 6   Distance to Destination (Light-Years)  547568 non-null  float64
 7   Duration of Stay (Earth Days)          547568 non-null  float64
 8   Number of Companions                   547568 non-null  int64  
 9   Purpose of Travel                      547568 non-null  object 
 10  Transportation Type                    547568 non-null  object 
 11  Price (Galactic Credits)               547568 non-null  float64
 12  Booking Date                           547568 non-null  object 
 13  Departure Date                         547568 non-null  object 
 14  Special Requests                       547568 non-null  object 
 15  Loyalty Program Member                 547568 non-null  object 
 16  Month                                  547568 non-null  int64  
 17  Customer Satisfaction Score            547568 non-null  float64
dtypes: float64(4), int64(3), object(11)
memory usage: 75.2+ MB

Awesome, you have successfully called the data from CSV into a dataframe. To get a sense of the column headers you use the df.info() command, and you can see that some of the names for the column headers are going to be tricky to call.

To address this we can reassign column names when loading our dataframe, which makes calling our variables a bit easier.

#There are MANY MANY options available within the read_csv command - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas-read-csv
#To start for today we are just going to focus on re-naming our columns.
galactictravel = pd.read_csv('/content/sample_data/interstellar_travel.csv'
                   , header=0 #we MUST designate the header row as 0 in order to use the 'names=()' flag
                   , names=('age','gender','occupation','travelclass', 'destination', 'starsystem', 'distance2dest' , 'stayduration'
                            , 'numbercompanions', 'travelpurpose', 'transporttype', 'price', 'bookingdate', 'departuredate'
                            , 'specialrequest', 'loyaltyprogrammember', 'month', 'customersatisfactionscore')
                )
galactictravel.info()
#that looks better to me.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 547568 entries, 0 to 547567
Data columns (total 18 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   age                        547568 non-null  int64  
 1   gender                     547568 non-null  object 
 2   occupation                 547568 non-null  object 
 3   travelclass                547568 non-null  object 
 4   destination                547568 non-null  object 
 5   starsystem                 547568 non-null  object 
 6   distance2dest              547568 non-null  float64
 7   stayduration               547568 non-null  float64
 8   numbercompanions           547568 non-null  int64  
 9   travelpurpose              547568 non-null  object 
 10  transporttype              547568 non-null  object 
 11  price                      547568 non-null  float64
 12  bookingdate                547568 non-null  object 
 13  departuredate              547568 non-null  object 
 14  specialrequest             547568 non-null  object 
 15  loyaltyprogrammember       547568 non-null  object 
 16  month                      547568 non-null  int64  
 17  customersatisfactionscore  547568 non-null  float64
dtypes: float64(4), int64(3), object(11)
memory usage: 75.2+ MB
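If you have already loaded a dataframe with the original headers, another option is `DataFrame.rename`, which maps old column names to new ones after the fact. A sketch on a toy frame (made-up values, not the full Kaggle file):

```python
import pandas as pd

# Stand-in for a freshly loaded frame with awkward headers
toy = pd.DataFrame({'Travel Class': ['Luxury'], 'Star System': ['Alpha Centauri']})

# rename(columns={old: new}) returns a new dataframe with cleaner headers
toy = toy.rename(columns={'Travel Class': 'travelclass', 'Star System': 'starsystem'})
print(list(toy.columns))  # ['travelclass', 'starsystem']
```

Renaming at load time (as above) and renaming afterwards end in the same place; load-time renaming just keeps it to one step.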
galactictravel.head()
#let's inspect the first 5 rows of data for our renamed dataframe. Notice the column names

All of the following code blocks provide different options to inspect our data.

galactictravel.tail()
#Granted, the column headers in our original df are a bit easier to read on our end, but passing characters like '(' or ')' in the correct syntax for Python adds unnecessary complexity
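To see why the original headers add complexity, compare calling a column whose name contains spaces and parentheses against a simplified name. This is a toy frame with made-up distances, not the Kaggle data:

```python
import pandas as pd

toy = pd.DataFrame({'Distance to Destination (Light-Years)': [4.2, 39.5]})

# The exact string, spaces and parentheses included, must be quoted every time
print(toy['Distance to Destination (Light-Years)'].max())  # 39.5

# After simplifying, the name is shorter and less error-prone to type
toy.columns = ['distance2dest']
print(toy['distance2dest'].max())  # 39.5
```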

galactictravel.dtypes
#This tells me valuable information about what my data is. This impacts what I can use each aspect (column) for.
galactictravel.isnull().sum()
#let's also check to make sure the data is relatively complete. This is of course an arbitrary data set so it is complete
galactictravel.describe()
#I can use the describe function to show me summary information about my data that is quantitative in nature. This must be float or int.
galactictravel['distance2dest'].value_counts()
#You can call a specific variable and then request '.value_counts()' to understand categorical data
galactictravel['month'].value_counts().sort_index()
#Here I am requesting the same information, just requiring the output to be sorted in a particular way, which is also logical: 1 = January through 12 = December
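The difference between the two calls above shows up clearly on a toy month column (made-up values): `value_counts()` orders by frequency, while adding `.sort_index()` reorders by the category itself:

```python
import pandas as pd

months = pd.Series([3, 1, 1, 2, 3, 3])  # toy month values

print(months.value_counts())               # ordered by count: month 3 first
print(months.value_counts().sort_index())  # ordered by month: 1, 2, 3
```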
galactictravel.shape
#The shape of my data is basically telling you the length and width.
#The first number, n=547,568, is the number of records (rows) in my dataset
#The second number after the ',' comma, n=18, is the number of columns in my dataset
interstellar_travel.columns
#Here we can quickly get a list of all the column names or header names for the 18 columns of data we have available to us
galactictravel.index
#We can understand here how the index is applied to our data frame
galactictravel.sample(5)
#This can be used to show five random records to get a sense of how the data would look if I was viewing it in a spreadsheet.
#Here I am just looking to represent a subset of my data. I could, at this step, write it to CSV or make it its own data frame if I wanted to
galactictravel[['destination', 'travelpurpose', 'starsystem', 'travelclass', 'stayduration']]
galactictravel.columns
#Okay, what if we want to know all the destinations folks are going to, grouped by star system, so we are only looking at the furthest point folks are traveling within a specific system

max_dist = galactictravel.groupby(['starsystem', 'destination'])['distance2dest'].max()
max_dist.to_csv('max_distance.csv')
max_dist1000  = max_dist[max_dist > 1000]
# Now let's say we want to make a bar chart that shows only those destinations that are 1,000 or more light years away
#to do this I filter the grouped series into a new object and assign the plot to 'ax'
ax = max_dist1000.plot(kind='bar', title='Maximum Distance')
ax.set_xlabel('Star System - Destination')
ax.set_ylabel('Light Years')

plt.show()

This bar graph shows us all the destinations that are over 1,000 light years away. We could reduce this to 500 or even 250.
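Swapping the cutoff is just a matter of re-filtering the grouped series. A small helper (hypothetical, with made-up distances standing in for the groupby result) makes the threshold explicit:

```python
import pandas as pd

# Toy max-distance series standing in for the grouped result above
max_dist = pd.Series({'Alpha-Prime': 1200.0, 'Beta-Minor': 430.0, 'Gamma-Far': 760.0})

def farther_than(series, light_years):
    """Keep only destinations beyond the given distance."""
    return series[series > light_years]

print(len(farther_than(max_dist, 1000)))  # 1
print(len(farther_than(max_dist, 500)))   # 2
print(len(farther_than(max_dist, 250)))   # 3
```

Each filtered series can be plotted with the same `.plot(kind='bar')` call used above.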

#Here is what it looks like if we try to plot all the max distance destinations. This data has a long right tail
from matplotlib import pyplot as plt
galactictravel['distance2dest'].plot(kind='hist', bins=200, title='Distance to Destination')
plt.gca().spines[['top', 'right',]].set_visible(False)

This histogram could tell us more about our data if we dealt with those outliers on the long right tail. Let’s make a scatter plot of our ‘distance2dest’ variable and ‘price’ just to get a sense of how to plot two dimensions and see the distribution. One thing we can tell from this histogram is that to make meaningful visual representations of this information we might need to clear up a few outliers.

from matplotlib import pyplot as plt
#So to give me a sense of a different slice of the data I decide to look at price of travel in relation to distance to destination
#not surprisingly it looks like the furthest distances are the most expensive to get to.
galactictravel.plot(kind='scatter', x='distance2dest', y='price', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# so now let us draw on statistical methods to identify and remove outliers for the 'distance2dest' column
#here I will use the interquartile range approach
q1 = galactictravel['distance2dest'].quantile(0.25) #define the break point for the lower quartile
q3 = galactictravel['distance2dest'].quantile(0.75) #define the break point for the upper quartile
iqr = q3 - q1 #the IQR is the space in between these two thresholds.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
#get rid of all the data points that lie outside the lower or upper bound
df3 = galactictravel[(galactictravel['distance2dest'] >= lower_bound) & (galactictravel['distance2dest'] <= upper_bound)]

# Plot the histogram without outliers to have an easier-to-digest representation of the distribution of this data point.
plt.figure(figsize=(10, 6))
df3['distance2dest'].plot(kind='hist', bins=100, title='Distance to Destination', density=True)
plt.gca().spines[['top', 'right']].set_visible(False)
plt.show()

To identify and remove outliers from the ‘distance2dest’ column using the interquartile range (IQR) method:

  1. Calculate the first quartile (Q1) and third quartile (Q3) of the ‘distance2dest’ column.
  2. Calculate the interquartile range (IQR) as Q3 – Q1. Define the lower bound as Q1 – 1.5 * IQR and the upper bound as Q3 + 1.5 * IQR.
  3. Create a new DataFrame df3 by filtering out rows where ‘distance2dest’ is outside the lower and upper bounds.
  4. Plot the histogram of the ‘distance2dest’ column from the new DataFrame df3. Set the figure size using plt.figure(figsize=(10, 6)) for better visualization, then plot using df3['distance2dest'].plot(kind='hist', bins=100, title='Distance to Destination', density=True).
  5. Set the number of bins for the histogram to 100 using bins=100.
  6. Set the title of the plot using title='Distance to Destination'.
  7. Normalize the histogram by setting density=True to display the probability density function.
  8. Remove the top and right spines of the plot using plt.gca().spines[['top', 'right']].set_visible(False).
  9. Display the plot using plt.show().
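The arithmetic in steps 1 and 2 can be checked on a tiny series with one obvious outlier (made-up numbers, not the travel data):

```python
import pandas as pd

s = pd.Series([10, 12, 14, 16, 18, 500])  # 500 is the outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)   # 12.5 and 17.5
iqr = q3 - q1                                  # 5.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 5.0 and 25.0

kept = s[(s >= lower) & (s <= upper)]
print(len(s) - len(kept))  # 1 outlier removed: the 500
```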

A lot of what I am doing here is styling. Depending on what you are trying to communicate and the distribution of your data the type of graphical representation you choose will differ.

#Now what about those outliers?

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Read the CSV file into a DataFrame
galactictravel = pd.read_csv('/content/sample_data/interstellar_travel.csv'
                   , header=0 #we MUST designate the header row as 0 in order to use the 'names=()' flag
                   , names=('age','gender','occupation','travelclass', 'destination', 'starsystem', 'distance2dest' , 'stayduration'
                            , 'numbercompanions', 'travelpurpose', 'transporttype', 'price', 'bookingdate', 'departuredate'
                            , 'specialrequest', 'loyaltyprogrammember', 'month', 'customersatisfactionscore')
                )

# Create a copy of the original DataFrame
galactictravel2 = galactictravel.copy()

# Identify and remove outliers for the 'distance2dest' column
q1 = galactictravel2['distance2dest'].quantile(0.25)
q3 = galactictravel2['distance2dest'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Create a new DataFrame without outliers
df3 = galactictravel2[(galactictravel2['distance2dest'] >= lower_bound) & (galactictravel2['distance2dest'] <= upper_bound)]

# Calculate the number of outliers removed
outliers_removed = len(galactictravel) - len(df3)

# Print the number of outliers removed
print(f"Number of outliers removed: {outliers_removed}")

# Plot the histogram without outliers
plt.figure(figsize=(10, 6))
df3['distance2dest'].plot(kind='hist', bins=100, title='Distance to Destination - Ivy', density=True)
plt.gca().spines[['top', 'right']].set_visible(False)
plt.show()

Number of outliers removed: 20204

Excellent work so far. You have come to the end of the journey, where you have profiled a random set of data off the internet. Uploading data to Colab in this way can be really useful in ad hoc situations, and not so helpful in ones where live data feeds require monitoring or refresh.

To address this we will look to our permanent database. Since we’ve already created a DATABASE in Snowflake, we will make a connection to it and query the data we wrote to the TABLE in our SCHEMA.
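As a preview, query results typically come back as rows plus column names, and pandas can wrap them directly into a DataFrame. The Snowflake connection itself is only sketched in a comment here (the connection parameters are placeholders); the runnable part below shows just the conversion step with stand-in rows:

```python
import pandas as pd

# With the Snowflake connector it would look roughly like this (not run here):
#   import snowflake.connector
#   conn = snowflake.connector.connect(user=..., password=..., account=...)
#   cur = conn.cursor()
#   cur.execute("SELECT * FROM MY_TABLE")
#   rows = cur.fetchall()
#   cols = [c[0] for c in cur.description]

# Stand-in rows/columns to show the conversion step itself
rows = [(34, 'Luxury'), (52, 'Economy')]
cols = ['age', 'travelclass']

querydf = pd.DataFrame.from_records(rows, columns=cols)
print(querydf.shape)  # (2, 2)
```

Once the results are in a DataFrame, everything we did above (info(), describe(), plotting) applies unchanged.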
