Who be me? The Ivy first name Betty, call me daughter Irie. The ship alive, Wolven does breath. Travel the through retrograde; into the ether see what we see.
The notebook demonstrates how to load local data files (CSV) into Pandas DataFrames, perform basic operations like printing the data, examining column information, and selecting specific columns. It also shows how to create visualizations like scatter plots and histograms using Matplotlib.
# Working with local data
#First we are going to import the Python data science package called Pandas https://pandas.pydata.org/ and assign it the name 'pd' for easy reference
import pandas as pd
#Second we will import Matplotlib, a Python package for creating static, animated, and interactive visualizations https://matplotlib.org/
import matplotlib.pyplot as plt
#NumPy is a package that enables scientific computing and stores large data arrays in a way that facilitates larger / intensive calculations. https://numpy.org/doc/stable/user/absolute_beginners.html
import numpy as np
#Now we must first know where we are
!ls
#This folder contains sample data provided by colab
#First we are going to pull in one of the stock datasets provided by colab to examine it
calidf = pd.read_csv('/content/sample_data/california_housing_test.csv')
This simple Python code, written using the Pandas library, reads a CSV (Comma-Separated Values) file named “california_housing_test.csv” from the path “/content/sample_data/” and turns it into a DataFrame available for further analysis.
The entire line:
calidf = pd.read_csv('/content/sample_data/california_housing_test.csv')
assigns the resulting DataFrame object (created from reading the CSV file) to a variable named calidf.
After execution the variable calidf will contain a Pandas DataFrame representing the data from the “california_housing_test.csv” file. You can then perform various data manipulation, analysis, and visualization operations on this DataFrame using the rich functionality provided by the Pandas library. This is a common way to load tabular data (like CSV files) into a Python environment for further processing and analysis using the Pandas library.
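After loading any CSV, it is worth running a few quick sanity checks before analysis. A minimal, self-contained sketch of those checks, using a tiny in-memory stand-in for the housing file (the real file has more rows and columns):

```python
import io
import pandas as pd

# A tiny stand-in for california_housing_test.csv so this sketch is
# self-contained; column names mirror the real file, values are illustrative.
csv_text = """housing_median_age,total_rooms,median_house_value
27,3885,344700
43,1510,176500
27,3589,270500
"""

df = pd.read_csv(io.StringIO(csv_text))

# Quick sanity checks after any read_csv call:
print(df.shape)           # (rows, columns) -> (3, 3)
print(list(df.columns))   # header names as parsed
print(df.dtypes)          # the types Pandas inferred per column
```

The same three checks (`shape`, `columns`, `dtypes`) apply unchanged to `calidf` once the real file is loaded.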
#If you receive no errors you can quickly print the dataframe to see what the data looks like
print(calidf)

#Examine the top 5 rows of the dataframe
calidf.head()













# Here we examine the information about our data and then take a peek at two columns
calidf.info()
#calidf[['housing_median_age', 'median_house_value']]
#maybe you are interested in median_house_value and median_income instead?
calidf[['total_rooms', 'median_house_value']]
#now it is clear I need to know more about my data and how it is structured
from matplotlib import pyplot as plt
#I could use the histogram feature to see distribution within a single variable
calidf['median_house_value'].plot(kind='hist', bins=20, title='median house value')
plt.gca().spines[['top', 'right']].set_visible(False)


#First I am just going to look at two of my measures to get a sense of distribution.
from matplotlib import pyplot as plt
_df_2.plot(kind='scatter', x='median_income', y='median_house_value', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)
#Oh no wait I have an error! The interpreter tells me where the error is. It looks like I am calling something '_df_2' which doesn't exist
#In order to address this error I must make a change in line above replacing '_df_2' with 'calidf'
#calidf does exist and is the pandas dataframe you imported from the CSV file
#once you have made the change rerun the code block

from matplotlib import pyplot as plt
#the output looks different depending on what I am plotting
calidf.plot(kind='scatter', x='population', y='', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)
#Oops, looks like there is also an error here. What is different between this block of code and the one directly above it?
#It looks like I have not defined a y value for my scatter plot. Choose which column of data you would like to see plotted alongside population and re-run this block
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3801 try:
-> 3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:
8 frames
/usr/local/lib/python3.10/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
/usr/local/lib/python3.10/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: ''
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-11-14cf506e8985> in <cell line: 3>()
1 from matplotlib import pyplot as plt
2 #the output looks different depending on what I am plotting
----> 3 calidf.plot(kind='scatter', x='population', y='', s=32, alpha=.8)
4 plt.gca().spines[['top', 'right',]].set_visible(False)
/usr/local/lib/python3.10/dist-packages/pandas/plotting/_core.py in __call__(self, *args, **kwargs)
943 if kind in self._dataframe_kinds:
944 if isinstance(data, ABCDataFrame):
--> 945 return plot_backend.plot(data, x=x, y=y, kind=kind, **kwargs)
946 else:
947 raise ValueError(f"plot kind {kind} can only be used for data frames")
/usr/local/lib/python3.10/dist-packages/pandas/plotting/_matplotlib/__init__.py in plot(data, kind, **kwargs)
69 kwargs["ax"] = getattr(ax, "left_ax", ax)
70 plot_obj = PLOT_CLASSES[kind](data, **kwargs)
---> 71 plot_obj.generate()
72 plot_obj.draw()
73 return plot_obj.result
/usr/local/lib/python3.10/dist-packages/pandas/plotting/_matplotlib/core.py in generate(self)
450 self._compute_plot_data()
451 self._setup_subplots()
--> 452 self._make_plot()
453 self._add_table()
454 self._make_legend()
/usr/local/lib/python3.10/dist-packages/pandas/plotting/_matplotlib/core.py in _make_plot(self)
1259 scatter = ax.scatter(
1260 data[x].values,
-> 1261 data[y].values,
1262 c=c_values,
1263 label=label,
/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in __getitem__(self, key)
3805 if self.columns.nlevels > 1:
3806 return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
3808 if is_integer(indexer):
3809 indexer = [indexer]
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:
-> 3804 raise KeyError(key) from err
3805 except TypeError:
3806 # If we have a listlike key, _check_indexing_error will raise
KeyError: ''
Amazing, you have now successfully worked with CSV data in our local instance. Now we are going to import a CSV file from an online repository. When you share your personal Colab version with me, all blocks must run correctly.
Go to Kaggle.com and create an account using the Google Account associated with your CCV email that you created for this class.
Click on Datasets and then search for interstellar travel.

Spend some time reading the information provided about the data to become familiar with what is available to you within the dataset.

Now download the data. It will land as a compressed file. You will need to extract the CSV file locally first before uploading it to the colab instance.
Now come back to this Colab: on the left-hand side panel, click on the folder icon, then double-click to open it.

Now click on the document icon with the up arrow; when you hover over it, it will say ‘Upload to session storage’. This allows us to load data temporarily for this instance. You will get a warning to ensure the file is saved elsewhere, since it will be deleted after you close the session.
Once you have selected the data to upload, Google will upload it (how long this takes depends on the size) and let you know what level of disk storage you are working with. You will find this information at the bottom left side of your panel.
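The same upload can also be triggered from code with Colab's `files.upload()` helper, which opens the browser file picker for you. This API only exists inside Colab, so the sketch below guards the import and simply does nothing when run elsewhere:

```python
# files.upload() opens a browser file picker and returns a dict mapping
# each uploaded filename to its raw bytes. The google.colab module only
# exists inside Colab, so we guard the import for other environments.
try:
    from google.colab import files
    uploaded = files.upload()
    for name, data in uploaded.items():
        print(f"{name}: {len(data)} bytes")
except ImportError:
    uploaded = {}  # not running in Colab; nothing to upload
```

Files uploaded this way land in `/content/`, the same session storage the folder-icon upload uses.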
import pandas as pd
interstellar_travel = pd.read_csv('/content/sample_data/interstellar_travel.csv')
interstellar_travel.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 547568 entries, 0 to 547567
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 547568 non-null int64
1 Gender 547568 non-null object
2 Occupation 547568 non-null object
3 Travel Class 547568 non-null object
4 Destination 547568 non-null object
5 Star System 547568 non-null object
6 Distance to Destination (Light-Years) 547568 non-null float64
7 Duration of Stay (Earth Days) 547568 non-null float64
8 Number of Companions 547568 non-null int64
9 Purpose of Travel 547568 non-null object
10 Transportation Type 547568 non-null object
11 Price (Galactic Credits) 547568 non-null float64
12 Booking Date 547568 non-null object
13 Departure Date 547568 non-null object
14 Special Requests 547568 non-null object
15 Loyalty Program Member 547568 non-null object
16 Month 547568 non-null int64
17 Customer Satisfaction Score 547568 non-null float64
dtypes: float64(4), int64(3), object(11)
memory usage: 75.2+ MB
Awesome, you have successfully read the CSV data into a dataframe. Using the df.info() command to get a sense of the column headers, you can see that some of the column names are going to be tricky to call.
To address this we can reassign column names when loading our dataframe, which makes calling our variables a bit easier.
#There are MANY MANY options available within the read_csv command - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas-read-csv
#To start for today we are just going to focus on re-naming our columns.
galactictravel = pd.read_csv('/content/sample_data/interstellar_travel.csv'
, header=0 #we MUST designate the header row as 0 so the original header row is replaced by our 'names=' list instead of being read as data
, names=('age','gender','occupation','travelclass', 'destination', 'starsystem', 'distance2dest' , 'stayduration'
, 'numbercompanions', 'travelpurpose', 'transporttype', 'price', 'bookingdate', 'departuredate'
, 'specialrequest', 'loyaltyprogrammember', 'month', 'customersatisfactionscore')
);
galactictravel.info()
#that looks better to me.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 547568 entries, 0 to 547567
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 547568 non-null int64
1 gender 547568 non-null object
2 occupation 547568 non-null object
3 travelclass 547568 non-null object
4 destination 547568 non-null object
5 starsystem 547568 non-null object
6 distance2dest 547568 non-null float64
7 stayduration 547568 non-null float64
8 numbercompanions 547568 non-null int64
9 travelpurpose 547568 non-null object
10 transporttype 547568 non-null object
11 price 547568 non-null float64
12 bookingdate 547568 non-null object
13 departuredate 547568 non-null object
14 specialrequest 547568 non-null object
15 loyaltyprogrammember 547568 non-null object
16 month 547568 non-null int64
17 customersatisfactionscore 547568 non-null float64
dtypes: float64(4), int64(3), object(11)
memory usage: 75.2+ MB
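If the data is already loaded, `DataFrame.rename` offers an alternative to renaming at read time. A small sketch with a toy frame (the two columns here are illustrative stand-ins for the real headers):

```python
import pandas as pd

# Toy frame with two of the awkward original header names
df = pd.DataFrame({'Travel Class': ['Luxury', 'Economy'],
                   'Distance to Destination (Light-Years)': [120.5, 43.2]})

# Map the awkward headers to easier-to-type names without re-reading the CSV
df = df.rename(columns={
    'Travel Class': 'travelclass',
    'Distance to Destination (Light-Years)': 'distance2dest',
})
print(list(df.columns))  # -> ['travelclass', 'distance2dest']
```

Renaming at load time (as above) and renaming after the fact produce the same result; `rename` is handy when you only want to fix a few columns.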
galactictravel.head()
#let's inspect the first 5 rows of data for our dataframe. Notice the new column names

galactictravel.tail()
#Granted, the column headers in our original df are a bit easier to read on our end, but calling characters like ( or ) with the correct syntax to pass to Python adds unnecessary complexity

galactictravel.dtypes
#This tells me valuable information about what my data is. This impacts what I can use each aspect (column) for.
galactictravel.isnull().sum()
#let's also check to make sure the data is relatively complete. This is of course a synthetic data set, so it is complete
galactictravel.describe()
#I can use the describe function to show summary statistics for the parts of my data that are quantitative in nature. These columns must be float or int.
galactictravel['distance2dest'].value_counts()
#You can call a specific variable and then request '.value_counts()' to see how often each value occurs
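`value_counts()` is most informative on genuinely categorical columns; on a continuous measure like distance, almost every count will be 1. A self-contained sketch with toy data (the values here are illustrative, not drawn from the Kaggle file):

```python
import pandas as pd

# A toy categorical column, standing in for something like 'travelclass'
travelclass = pd.Series(['Economy', 'Luxury', 'Economy', 'Business', 'Economy'])

# value_counts tallies each distinct value, most frequent first
counts = travelclass.value_counts()
print(counts)
# Economy appears 3 times; Business and Luxury once each
```

Running the same call on a continuous column mostly tells you how many distinct values it has, which is why a histogram is the better tool there.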
galactictravel['month'].value_counts().sort_index()
#Here I am requesting the same information, just requiring the output to be sorted by index -- which is also chronological: 1 = January through 12 = December
galactictravel.shape
#The shape of my data basically tells you the length and width.
#The first number, n=547,568, is the number of records in my dataset
#The second number after the ',' comma, n=18, is the number of columns in my dataset
#If we had filtered out any rows, the first number here would differ from the number of rows in the original CSV file
interstellar_travel.columns
#Here we can quickly get a list of all the column names or header names for the 18 columns of data we have available to us
galactictravel.index
#We can understand here how the index is applied to our data frame
galactictravel.sample(5)
#This can be used to show five random records, to get a sense of how the data would look if I was viewing it in a spreadsheet.
#Here I am just looking to represent a subset of my data. I could, at this step, write it to CSV or make it its own dataframe if I wanted to
galactictravel[['destination', 'travelpurpose', 'starsystem', 'travelclass', 'stayduration']]
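Writing such a subset out to its own CSV only takes a couple of lines. A sketch with toy data, since the Kaggle file is not available here (the values are invented; the column names follow the renamed dataframe above):

```python
import pandas as pd

# Toy rows standing in for the interstellar travel data
df = pd.DataFrame({'destination': ['Zeta II', 'Kepler-22b'],
                   'travelpurpose': ['Tourism', 'Research'],
                   'price': [5400.0, 12100.0]})

# Select just the columns of interest into a new dataframe...
subset = df[['destination', 'travelpurpose']]

# ...and serialize it. With no path argument, to_csv returns the CSV text;
# pass a filename (e.g. 'subset.csv') to write a file instead.
csv_text = subset.to_csv(index=False)
print(csv_text)
```

The `index=False` flag drops the row index from the output, which is usually what you want when the index is just 0, 1, 2, ….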
galactictravel.columns
#Okay, what if we want to know the farthest point people are traveling to within each star system? We can group by star system and destination and take the maximum distance
max_dist = galactictravel.groupby(['starsystem', 'destination'])['distance2dest'].max()
max_dist.to_csv('max_distance.csv')
max_dist1000 = max_dist[max_dist > 1000]
# Now let's say we want to make a bar chart that shows only those destinations that are more than 1,000 light-years away
#to do this I filter the grouped data and assign the plot to 'ax'
ax = max_dist1000.plot(kind='bar', title='Maximum Distance')
ax.set_xlabel('Star System - Destination')
ax.set_ylabel('Light Years')
plt.show()
This bar graph shows us all the destinations that are more than 1,000 light-years away. We could reduce this threshold to 500 or even 250.
#Here is what it looks like if we try to plot all the max distance destinations. This data has a long right tail
from matplotlib import pyplot as plt
galactictravel['distance2dest'].plot(kind='hist', bins=200, title='Distance to Destination')
plt.gca().spines[['top', 'right',]].set_visible(False)
This histogram could tell us more about our data if we could detect the outliers on its long right tail. One thing we can already tell is that to make meaningful visual representations of this information we might need to clear up a few outliers. Let's make a scatter plot of our ‘distance2dest’ variable against ‘price’, just to get a sense of how to plot two dimensions and see the distribution.
from matplotlib import pyplot as plt
#So to give me a sense of a different slice of the data I decide to look at price of travel in relation to distance to destination
#not surprisingly it looks like the furthest distances are the most expensive to get to.
galactictravel.plot(kind='scatter', x='distance2dest', y='price', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# so now let us draw on statistical methods to identify and remove outliers for the 'distance2dest' column
#here I will use the interquartile range approach
q1 = galactictravel['distance2dest'].quantile(0.25) #define the break point for the lower quartile
q3 = galactictravel['distance2dest'].quantile(0.75) #define the break point for the upper quartile
iqr = q3 - q1 #the IQR is the space in between these two thresholds.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
#keep only the data points that lie within the lower and upper bounds
df3 = galactictravel[(galactictravel['distance2dest'] >= lower_bound) & (galactictravel['distance2dest'] <= upper_bound)]
# Plot the histogram without outliers to have an easier-to-digest representation of the distribution of this variable.
plt.figure(figsize=(10, 6))
df3['distance2dest'].plot(kind='hist', bins=100, title='Distance to Destination', density=True)
plt.gca().spines[['top', 'right']].set_visible(False)
plt.show()
The code above identifies and removes outliers from the ‘distance2dest’ column using the interquartile range (IQR) method.
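The IQR steps generalize into a small reusable helper. A sketch with synthetic data (the function name `remove_iqr_outliers` is my own, not part of Pandas):

```python
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Keep only rows whose `column` value lies within k*IQR of the quartiles."""
    q1 = df[column].quantile(0.25)   # lower quartile break point
    q3 = df[column].quantile(0.75)   # upper quartile break point
    iqr = q3 - q1                    # spread of the middle 50% of the data
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Synthetic distances: 99 ordinary values plus one extreme outlier
df = pd.DataFrame({'distance2dest': list(range(1, 100)) + [10_000]})
trimmed = remove_iqr_outliers(df, 'distance2dest')
print(len(df) - len(trimmed))  # number of outliers removed -> 1
```

The multiplier `k=1.5` is the conventional Tukey fence; widening it to 3 flags only the most extreme points.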
A lot of what I am doing here is styling. Depending on what you are trying to communicate, and the distribution of your data, the type of graphical representation you choose will differ.
#Now what about those outliers?
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Read the CSV file into a DataFrame
galactictravel = pd.read_csv('/content/sample_data/interstellar_travel.csv'
, header=0 #we MUST designate the header row as 0 in order to use the 'names=()'
, names=('age','gender','occupation','travelclass', 'destination', 'starsystem', 'distance2dest' , 'stayduration'
, 'numbercompanions', 'travelpurpose', 'transporttype', 'price', 'bookingdate', 'departuredate'
, 'specialrequest', 'loyaltyprogrammember', 'month', 'customersatisfactionscore')
)
# Create a copy of the original DataFrame
galactictravel2 = galactictravel.copy()
# Identify and remove outliers for the 'distance2dest' column
q1 = galactictravel2['distance2dest'].quantile(0.25)
q3 = galactictravel2['distance2dest'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# Create a new DataFrame without outliers
df3 = galactictravel[(galactictravel['distance2dest'] >= lower_bound) & (galactictravel['distance2dest'] <= upper_bound)]
# Calculate the number of outliers removed
outliers_removed = len(galactictravel) - len(df3)
# Print the number of outliers removed
print(f"Number of outliers removed: {outliers_removed}")
# Plot the histogram without outliers
plt.figure(figsize=(10, 6))
df3['distance2dest'].plot(kind='hist', bins=100, title='Distance to Destination - Ivy', density=True)
plt.gca().spines[['top', 'right']].set_visible(False)
plt.show()
Number of outliers removed: 20204

Excellent work so far. You have come to the end of the journey, where you have profiled a random dataset from the internet. Uploading data to Colab in this way can be really useful in ad hoc situations, but not so helpful when live data feeds require monitoring or refresh.
To address this we will look to our permanent database. Since we’ve already created a DATABASE in Snowflake, we will make a connection to it and query the data we wrote to the TABLE in our SCHEMA.