We analyzed almost 30 million rows of Pennsylvania voter registration data. Here's how

The now months-long legal battle for personal information about Pennsylvania voters prompted this news organization to analyze almost 30 million rows of registration data collected over the past two years.

A GOP-led Senate committee has requested voter data from the Pennsylvania Department of State for a “forensic investigation” to uncover duplicate voters, fake registrations and a number of other possible errors.

The data in the requests includes partial Social Security numbers and driver’s license numbers, two pieces of personal information not contained in the state’s voter rolls available online.

Yet much of the information being requested already is available publicly.

This news organization downloaded a copy of the state’s registration data as of Dec. 13, 2021, the same week as a Commonwealth Court hearing on a motion to quash the subpoena.


A Python script was used to clean and merge the individual county files.

Each county has four files that make up the voter rolls: one with the bulk of the data and three containing column names and the full text of some data values that are coded in the main file.

The main voter files often use alphanumeric codes in place of municipality or school district names; this news organization’s Python script replaces those codes with the corresponding names in each county.
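The code replacement described above can be sketched with pandas. This is a minimal illustration, not this news organization's actual script: the column names, the codes and the place names are all hypothetical, and the real county files may be laid out differently.

```python
import pandas as pd

# Hypothetical main voter file for one county: municipalities stored as codes.
voters = pd.DataFrame({
    "voter_id": ["001", "002", "003"],
    "municipality_code": ["M1", "M2", "M1"],
})

# Hypothetical lookup file mapping each code to its full name.
lookup = pd.DataFrame({
    "code": ["M1", "M2"],
    "name": ["Township A", "Borough B"],
})

def decode_column(df, lookup_df, code_col, new_col):
    """Return a copy of df with code_col translated into readable names."""
    mapping = lookup_df.set_index("code")["name"]
    out = df.copy()
    # Codes missing from the lookup become NaN rather than raising an error.
    out[new_col] = out[code_col].map(mapping)
    return out

decoded = decode_column(voters, lookup, "municipality_code", "municipality")
```

Because codes can differ from county to county, a lookup like this would have to be applied to each county's file before the counties are combined.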

Python and a widely used data analysis library called pandas were used to analyze the data once it was compiled into a single file. A program like Excel can’t hold more than about 1.04 million rows in a worksheet, making a spreadsheet impractical for reviewing a dataset of this size.
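To show why size matters, here is a small sketch of counting rows by streaming a file in pieces rather than loading it all at once. The comma-separated layout, the column names and the `count_rows` helper are assumptions made for illustration; the tiny in-memory sample stands in for the compiled file.

```python
import io
import pandas as pd

EXCEL_MAX_ROWS = 1_048_576  # row limit of a single Excel worksheet

# Stand-in for the compiled file; a real run would pass a file path instead.
compiled = io.StringIO("voter_id,county\n1,BUCKS\n2,BUCKS\n3,ERIE\n")

def count_rows(source, chunksize=2):
    """Count data rows by reading the file in chunks of `chunksize` rows."""
    total = 0
    for chunk in pd.read_csv(source, dtype=str, chunksize=chunksize):
        total += len(chunk)
    return total

total_rows = count_rows(compiled)
fits_in_excel = total_rows <= EXCEL_MAX_ROWS
```

Reading every column as text (`dtype=str`) is a cautious choice for records like these, since it keeps leading zeros in identifiers and ZIP codes.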

The analysis began by joining the first and last names with the birth dates for each voter, and then counting the repeating values.

To avoid false duplicates among voters with common names, middle names were then added to those identifiers.

If any of those fields was empty, the identifier was left blank so that empty values would not be counted as repeats and inflate the overall duplicate totals.
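The matching steps above can be sketched roughly as follows. The column names, the sample records and the `build_key` helper are hypothetical, chosen only to illustrate the approach: join the fields into one identifier, leave it blank when any field is missing, then count repeats.

```python
import pandas as pd

# Hypothetical sample records; the fourth row is missing a first name.
voters = pd.DataFrame({
    "first": ["ANN", "ANN", "BOB", "", "ANN"],
    "middle": ["M", "M", "J", "K", "L"],
    "last": ["LEE", "LEE", "RAY", "FOX", "LEE"],
    "dob": ["01/02/1970", "01/02/1970", "03/04/1980",
            "05/06/1990", "01/02/1970"],
})

def build_key(df, fields):
    """Join fields into one identifier; leave it blank if any field is empty."""
    parts = df[fields].fillna("").apply(lambda col: col.str.strip())
    complete = (parts != "").all(axis=1)
    key = parts.agg("|".join, axis=1)
    return key.where(complete)  # incomplete rows become NaN

def repeated_keys(keys):
    """Return identifiers that occur more than once, with their counts."""
    counts = keys.value_counts()  # NaN keys are excluded by default
    return counts[counts > 1]

# First pass: first name, last name and birth date only.
basic = repeated_keys(build_key(voters, ["first", "last", "dob"]))

# Second pass: adding the middle name separates people who share a name.
strict = repeated_keys(build_key(voters, ["first", "middle", "last", "dob"]))
```

In this sample the first pass groups three "ANN LEE" records together, while the second pass keeps only the two that also share a middle name. The record with the missing first name is never counted.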

The analysis couldn’t reach conclusive findings without access to more personal information, but the goal was to see whether a reasonable review of the public data would show potentially widespread errors worth investigating further.

After this news organization discussed the initial findings with multiple experts, another review was done on voter exports taken just before the November 2020 election and May 2021 primary.

Those exports returned results similar to those seen in December for total duplicate voters and other potential issues.

The earlier data also showed that only several of the individual identifiers flagged in December appeared in all three exports, supporting the idea that repeated voters were the result of address changes that hadn’t been fully updated when an export was released.
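A comparison across exports like the one described above can be expressed as a set intersection. The identifiers below are invented and the `persistent_ids` helper is hypothetical; the sketch only shows how to find identifiers flagged as repeats in every export.

```python
def persistent_ids(*exports):
    """Return the identifiers that appear in every export supplied."""
    sets = [set(export) for export in exports]
    return set.intersection(*sets) if sets else set()

# Hypothetical repeated identifiers found in each of the three exports.
nov_2020 = {"A|LEE|1970", "B|RAY|1980"}
may_2021 = {"A|LEE|1970", "C|FOX|1990"}
dec_2021 = {"A|LEE|1970", "C|FOX|1990", "D|KIM|1985"}

in_all_three = persistent_ids(nov_2020, may_2021, dec_2021)
```

An identifier that shows up in only one export is consistent with a temporary overlap, such as a move between counties, while one that persists across all three would merit a closer look.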


This article originally appeared on Bucks County Courier Times: Explaining our analysis of Pennsylvania voter registration data