Using Data Science to: Investigate UFO sightings.

Introduction

As a kid growing up with Star Trek and The X-Files, I was convinced that one day I’d meet aliens. But just as those shows faded into memory, so did my hopes of meeting visitors from another world.

But in the back of my mind, I’ve always wondered if any UFO sightings could be legitimate. Until recently, I had no way to investigate it, but about a year ago I came across the National UFO Reporting Center (http://www.nuforc.org/) and realized that with Data Science, I finally can.

The Data

The National UFO Reporting Center, or NUFORC, contains data on over 100,000 UFO sightings over the last 100 years, with the majority coming from the year 2000 and beyond. 

The reports are filled out by the people who witnessed the sighting, who can include their name or post anonymously. Each report contains the time, date, city, shape, duration, and a summary. Not every field is available for every sighting, but overall the data is surprisingly complete.

The first step was to scrape and clean the data.  Luckily for me, I had help on this project. A colleague named Alex Heaton scraped and cleaned the data, while two others, Navya Sandadi and Imran Manji, worked with me on the analysis.

(I’ve also made the data available here: https://www.kaggle.com/thaddeussegura/ufo-sightings)

Exploring the Data

With the data scraped and cleaned, we could jump right into looking for things of interest. A natural place to start was days with an unusually high number of sightings. However, this revealed the first questionable finding.

Looking at the top 20 days of all time, 65% of them fall on the 4th of July. That is 23,725% of the rate expected if the top days were spread evenly across the calendar (65% versus 1/365, or about 0.27%), roughly 237 times the baseline. This raises the question: is it really more likely that aliens are visiting to watch the fireworks, or that some of the fireworks are being mistaken for UFOs?
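The July 4 comparison can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the baseline is a uniform spread over a 365-day year (an assumption, though it reproduces the 23,725 figure); the variable names are illustrative only.

```python
# Share of the top-20 days that fall on July 4 (figure taken from the text).
observed_share = 0.65

# Assumed baseline: any single calendar date holds 1/365 of the top days.
expected_share = 1 / 365

ratio = observed_share / expected_share      # multiple of the expected rate
percent_of_expected = ratio * 100            # same ratio, as a percentage
percent_above_expected = (ratio - 1) * 100   # increase over the baseline

print(f"{ratio:.2f} times the expected rate")
print(f"{percent_of_expected:,.0f}% of expected")
print(f"{percent_above_expected:,.0f}% above expected")
```

Note the difference between "percent of" and "percent above" the baseline: the two differ by exactly 100 percentage points.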

Next we checked out the breakdown across the days of the month, just looking for anything unusual. Sure enough, there was another issue.

Looking at the data, the 1st and 15th appear to be disproportionately high. Investigating further, we discovered that when the date was undefined, it sometimes defaulted to the 1st or the 15th.

Based on this, we decided to avoid digging into any high-sighting date that fell on the 1st or 15th of a month, or on the 4th of July.
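A day-of-month check like the one described above could look roughly like this. It is a sketch rather than the script used for the analysis: the sample dates are made up, and the "more than twice the median" cutoff is an arbitrary assumption chosen for illustration.

```python
from collections import Counter
from datetime import date
from statistics import median

def day_of_month_counts(dates):
    """Count sightings for each day of the month that appears in the data."""
    return Counter(d.day for d in dates)

def heavy_days(counts, factor=2.0):
    """Days whose count is more than `factor` times the median observed count."""
    cutoff = factor * median(counts.values())
    return sorted(day for day, n in counts.items() if n > cutoff)

# Hypothetical sample: a few ordinary days plus pile-ups on the 1st and 15th.
sample = (
    [date(2010, 3, 1)] * 6
    + [date(2010, 3, 15)] * 5
    + [date(2010, 3, d) for d in (2, 3, 7, 9, 9, 12, 20, 20, 22, 28)]
)
counts = day_of_month_counts(sample)
print(heavy_days(counts))
```

Days that stand out this way are worth checking against the raw reports before treating them as real spikes.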

Key Dates

Avoiding the dates listed above, we set out to identify other dates with disproportionately high sightings. Because we only wanted to look at the most extreme dates, we pulled out only the dates whose sighting counts were more than 4 standard deviations above the mean.

Basically, if daily sighting counts followed a normal distribution, about 99.994% of days would fall within 4 standard deviations of the mean, so spikes this large are very unlikely to be attributable to chance.
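The 4-standard-deviation filter can be sketched as follows. The daily counts here are invented, the helper names are hypothetical, and the choice to compute the mean over all days before dropping the suspect dates is an assumption. The last lines compute the share of a normal distribution that lies within 4 standard deviations of its mean.

```python
import math
from datetime import date, timedelta
from statistics import mean, pstdev

def is_suspect(d):
    """Dates set aside earlier: the 1st, the 15th, and July 4."""
    return d.day in (1, 15) or (d.month == 7 and d.day == 4)

def spike_dates(daily_counts, n_sigma=4.0):
    """Non-suspect dates whose count exceeds mean + n_sigma * std deviation."""
    values = list(daily_counts.values())
    threshold = mean(values) + n_sigma * pstdev(values)
    return sorted(
        d for d, count in daily_counts.items()
        if count > threshold and not is_suspect(d)
    )

# Hypothetical data: 40 consecutive days at 10 sightings, one raised to 200.
start = date(1999, 11, 2)
daily_counts = {start + timedelta(days=i): 10 for i in range(40)}
daily_counts[date(1999, 11, 16)] = 200

print(spike_dates(daily_counts))

# Share of a normal distribution within 4 standard deviations of the mean.
within_4_sigma = math.erf(4 / math.sqrt(2))
print(f"{within_4_sigma:.4%}")
```

Sighting counts are skewed rather than normal, so the percentage is best read as a rough guide to how extreme the cutoff is, not as an exact probability.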

Once I had the key dates, the next step was to visualize them on a map to look for trends.  I wrote a script to extract the longitude and latitude for each sighting and then plotted those results.

There were some sightings that appeared to be clustered, so I manually circled them in yellow. But I wondered what kind of clusters an algorithm would find on its own.

Clustering

I decided to do some cluster analysis using the K-Means algorithm. If you want to learn more, I wrote a 2-minute post explaining it here: https://thaddeus-segura.com/k-means/

Below is one of the days broken out using 8 clusters. (5 are shown for the US, but you can zoom out to see sightings across the world.)

Using these clusters, we can identify the projected “centers” of each. Taking the centers of the blue and red clusters, I searched Google to see if there was any other explanation.
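To make the idea of cluster "centers" concrete, here is a minimal pure-Python version of K-Means (Lloyd's algorithm). It is an illustrative stand-in, not the code used for the analysis. It treats latitude and longitude as flat coordinates, which is a simplifying assumption, and the sample points are hypothetical locations near Los Angeles and Salt Lake City.

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    """Minimal Lloyd's algorithm on (lat, lon) pairs; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(
                range(k),
                key=lambda j: (p[0] - centers[j][0]) ** 2
                + (p[1] - centers[j][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                lat = sum(m[0] for m in members) / len(members)
                lon = sum(m[1] for m in members) / len(members)
                new_centers.append((lat, lon))
            else:
                new_centers.append(centers[j])  # leave an empty cluster in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels

# Hypothetical sightings: three near Los Angeles, three near Salt Lake City.
points = [
    (34.0, -118.2), (34.2, -118.0), (33.8, -118.4),
    (40.7, -111.9), (40.9, -111.7), (40.5, -112.1),
]
centers, labels = kmeans(points, k=2)
print(sorted(centers))
```

Each returned center is the average position of the sightings in its cluster, which gives a single location to search for news reports.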

A quick search revealed two separate phenomena. First, for the blue cluster over California, the sighting was apparently a Navy missile test.

For the red cluster around Salt Lake, this was documented as some “space junk” burning up in the atmosphere.

Repeating this for the other key days revealed that they had also been explained by worldly phenomena.

11/7/2015 California: Navy missile test.
11/16/1999 Ohio/Indiana/Michigan: Meteor.
9/19/2009 East Coast: NASA Black Brant XII rocket test.

For every date I checked, there appeared to be a well-documented explanation, so we had to keep digging.

Comments Analysis

The next step then was to dig into the comments and see if we could find anything else of interest. In reviewing the comments, we noticed that some had been manually tagged with the term “hoax”, so we decided to use this to dig into some of the sightings more deeply.

Starting by looking at hoax vs. non-hoax rates by state, we see some variation in relative rates, but for the most part they are all pretty near the 50% mark. We did not feel that any of these merited further analysis, so we moved on.

Next we revisited some of our top dates to see if their hoax rates were higher or lower than average. We found that some of the key dates, such as 11/7/2015 and 7/27/2016, were clearly a “hoax” by the National UFO Reporting Center’s standards, while other key dates, such as 11/16/1999, had extremely low hoax rates despite a high number of sightings.
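A per-date hoax rate can be computed along these lines. This is a sketch under stated assumptions: the tag is detected as the substring "hoax" in the summary text, and the sample reports are invented for illustration. The same function works for states if the first element of each pair is a state instead of a date.

```python
from collections import defaultdict

def hoax_rate_by_key(reports):
    """Map each key (date or state) to the share of its reports tagged hoax."""
    totals = defaultdict(int)
    hoaxes = defaultdict(int)
    for key, summary in reports:
        totals[key] += 1
        if "hoax" in summary.lower():  # assumed tagging convention
            hoaxes[key] += 1
    return {key: hoaxes[key] / totals[key] for key in totals}

# Hypothetical reports as (date, summary) pairs.
reports = [
    ("11/7/2015", "Bright cone of light over the coast (hoax)"),
    ("11/7/2015", "Blue glow moving west (HOAX)"),
    ("11/7/2015", "Expanding light seen from the freeway"),
    ("11/16/1999", "Green fireball lasting several seconds"),
    ("11/16/1999", "Fireball breaking into pieces"),
]
rates = hoax_rate_by_key(reports)
for key, rate in rates.items():
    print(key, f"{rate:.0%}")
```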

Digging into 11/16/1999 revealed 195 total sightings across nearly half the states, with 40% of them concentrated around Ohio and Michigan. This had been explained as a “meteor”, but given the range of times over which it was observed, this explanation may fall short.

Shape Analysis

The final step was to dig into the “shape” category listed in the sightings. Our theory was that if we could find dates/times when a specific shape occurred much more than usual, then maybe this could add some weight to those sightings.

Step one was to establish the normal occurrence rates of each shape. Here is a snapshot of the breakdown.

Next we looked into dates that had “over indexed” on specific shapes, and this is where we ran into problems. First, we discovered a lack of consistency in the descriptions from one user to another. Second, the daily data was highly skewed by some of the major sightings, such as those discussed above. Finally, for the remaining outliers, we again found explanations such as meteors and rocket tests.
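One way to express "over indexing" is the ratio of a shape's share on a given day to its share across all sightings. The sketch below assumes that definition, which the analysis may not have used exactly, and runs on invented shape labels. Lowercasing and trimming the labels is a small, partial answer to the consistency problem noted above.

```python
from collections import Counter

def shape_shares(shapes):
    """Share of each shape after lowercasing and trimming the labels."""
    counts = Counter(s.strip().lower() for s in shapes)
    total = sum(counts.values())
    return {shape: n / total for shape, n in counts.items()}

def over_index(day_shapes, baseline):
    """Ratio of each shape's share on one day to its baseline share."""
    day = shape_shares(day_shapes)
    return {
        shape: share / baseline[shape]
        for shape, share in day.items()
        if shape in baseline  # skip shapes with no baseline to compare to
    }

# Hypothetical data: all sightings, then the sightings from a single day.
all_shapes = ["Light"] * 5 + ["circle"] * 3 + ["fireball"] * 2
one_day = ["Fireball "] * 4 + ["light"]

baseline = shape_shares(all_shapes)
index = over_index(one_day, baseline)
for shape, value in sorted(index.items()):
    print(shape, round(value, 2))
```

A value well above 1 means the shape was reported far more often than usual on that day, and a value below 1 means less often.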

Conclusion

The question we set out to answer here was whether any UFO sightings were legitimate. Taking everything we found into consideration, we were left with two possible conclusions.

1. The evidence suggests that most sightings are either inaccurate or well-documented, often as military or government testing. Therefore, there is no compelling evidence of high-veracity sightings that can be identified in the data.

--OR--

2. There is a massive conspiracy to explain legitimate sightings away by citing government and military testing. The dates that came up over and over in our analysis were legitimate sightings, but the truth is being hidden from us.

So are we alone? We obviously cannot say for sure, but based on the data, it seems like it… for now.