Cleaning Up the Galaxy: Handling Outliers in Pandas

When analyzing interstellar travel data, sometimes you get bad readings. A sensor glitches, a solar flare hits the relay, or a pilot forgets to log their arrival time. Back on the Wolven, we call this “space junk.” On Earth, you call these readings outliers.

In this tutorial, we are looking at 500,000 space travel records from a Kaggle dataset titled “Interstellar Travel Customer Satisfaction Analysis.” It contains data on travel class, destination, and, crucially, Distance to Destination.

When I visualized this data using a histogram, I noticed something strange: extreme values in the long right tail. Some trips were logged with impossibly long distances for the price paid. To build an accurate pricing model for “Galactic Credits,” we need to filter out this noise.

The Technique: Interquartile Range (IQR)

We will use a statistical method called the IQR to define what is “normal” and what is an anomaly.

Step 1: Calculate the Quartiles

Think of your data as a long line of travelers, sorted by distance. We cut the line at the 25% mark (Q1) and the 75% mark (Q3).

q1 = galactictravel['distance2dest'].quantile(0.25)
q3 = galactictravel['distance2dest'].quantile(0.75)
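To see what those two calls return, here is a minimal sketch on a toy column. The nine distances are made up for illustration and are not taken from the Kaggle dataset; the names toy, q1_toy, median_toy, and q3_toy are hypothetical.

```python
import pandas as pd

# Toy distances (invented values, NOT from the Kaggle dataset).
# The 100 at the end plays the role of a glitched sensor reading.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], name="distance2dest")

# quantile() sorts the values and, by default, interpolates linearly
# between neighbors when the cut falls between two rows.
q1_toy = toy.quantile(0.25)      # a quarter of the way along the line
median_toy = toy.quantile(0.50)  # the middle of the line
q3_toy = toy.quantile(0.75)      # three quarters of the way along

print(q1_toy, median_toy, q3_toy)  # prints: 3.0 5.0 7.0
```

Notice that the glitched 100 does not drag Q1 or Q3 around. Quartiles depend on position in the line, not on how extreme the last traveler is, which is why they make a sturdy yardstick for spotting space junk.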

Step 2: Define the Bounds

The IQR is the distance between Q1 and Q3. We then define a “fence” on each side. Any data point that sits more than 1.5 times the IQR below Q1 or above Q3 is considered an outlier.

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
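The fence arithmetic can be wrapped in a small helper and checked by hand. This is only a sketch under stated assumptions: iqr_bounds and its k parameter are hypothetical names introduced here, and the toy values are invented rather than drawn from the dataset.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences for a numeric Series.

    Hypothetical helper: k is the fence multiplier, 1.5 by convention.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1  # width of the middle 50% of the data
    return float(q1 - k * iqr), float(q3 + k * iqr)


# Toy distances (invented values, NOT from the Kaggle dataset).
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Q1 = 3 and Q3 = 7, so IQR = 4 and the fences sit 1.5 * 4 = 6 beyond each quartile.
lower, upper = iqr_bounds(toy)
print(lower, upper)  # prints: -3.0 13.0
```

In this toy case the lower fence lands below zero. Since a distance cannot be negative, only the upper fence does any work here, which is typical for data with a long right tail.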

Step 3: Filter the Data

Now we create a new view of the universe: a DataFrame that keeps only the rows whose distance falls inside the fences.

clean_df = galactictravel[
    (galactictravel['distance2dest'] >= lower_bound) &
    (galactictravel['distance2dest'] <= upper_bound)
]
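If you need the same cleanup on more than one column, the three steps can be bundled into one function. The sketch below is one possible way to do it, not the only one: remove_iqr_outliers is a hypothetical helper, the toy DataFrame and its price column are invented, and Series.between (inclusive on both ends by default) stands in for the paired >= and <= comparisons.

```python
import pandas as pd


def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df without rows whose `column` value lies outside the IQR fences.

    Hypothetical helper; k is the fence multiplier (1.5 by convention).
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # between() includes both endpoints, matching the >= and <= filter.
    keep = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep].copy()


# Toy data (invented values, NOT from the Kaggle dataset).
toy_df = pd.DataFrame({
    "distance2dest": [1, 2, 3, 4, 5, 6, 7, 8, 100],
    "price": [10, 20, 30, 40, 50, 60, 70, 80, 90],
})

clean_toy = remove_iqr_outliers(toy_df, "distance2dest")
print(len(toy_df), "rows before,", len(clean_toy), "rows after")
```

Two things to keep in mind. The function returns a copy, so the original DataFrame is left untouched and you can always compare before and after. Also, rows with a missing distance are dropped along with the outliers, because a missing value never passes the comparison; if you want to keep them, handle them separately first.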

Visual Proof: Before cleaning, our scatter plot looked like a shotgun blast. After cleaning, we can clearly see the relationship between distance and price.

Data cleaning isn’t the glamorous part of the job. It’s the janitorial work of the data scientist. But without it, your models will fail, your ships will get lost, and your predictions will be worthless. We must scrub the data until it shines like the salt flats of Barrata.
