At Cortico, we are making talk radio searchable in order to surface local voices and the range of issues and opinions being discussed across the country. With that comes a host of problems, including lots of duplicate content ranging from syndicated shows to repeated commercials.
This talk walks through how we adapted the technology used to identify popular songs to automatically detect duplicate content in the roughly 4,000 hours of audio we collect each day from nearly 200 radio stations. Using audio fingerprinting, we encode and compare subsequences of audio to identify near duplicates.
To do this at scale, we set up an ephemeral Spark cluster within Kubernetes to find duplicates once a day. From this data, we can begin to map out the space of American talk radio.
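As a rough illustration of the fingerprint-and-compare idea, here is a minimal Python sketch. It is not the pipeline described in the talk: it assumes each clip has already been reduced to a sequence of small integers (for example, one quantized spectral feature per frame), and it swaps in a simple technique, overlapping windows compared by Jaccard similarity, for a production fingerprinting scheme. The function names, the window size, and the threshold are all hypothetical.

```python
from typing import Sequence, Set, Tuple

Fingerprint = Set[Tuple[int, ...]]


def fingerprint(frames: Sequence[int], window: int = 4) -> Fingerprint:
    """Encode a clip as the set of its overlapping length-`window` subsequences.

    `frames` is assumed to be a precomputed sequence of quantized audio
    features; extracting those from raw audio is out of scope here.
    """
    return {
        tuple(frames[i:i + window])
        for i in range(len(frames) - window + 1)
    }


def similarity(a: Fingerprint, b: Fingerprint) -> float:
    """Jaccard similarity: shared subsequences divided by all subsequences."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(
    x: Sequence[int],
    y: Sequence[int],
    window: int = 4,
    threshold: float = 0.5,  # hypothetical cutoff, chosen for illustration
) -> bool:
    """Treat two clips as near duplicates if enough subsequences match."""
    return similarity(fingerprint(x, window), fingerprint(y, window)) >= threshold


# Toy data: a commercial, a rebroadcast that differs in its last frame,
# and an unrelated clip.
commercial = [3, 7, 7, 2, 9, 4, 1, 8, 5, 6]
rebroadcast = [3, 7, 7, 2, 9, 4, 1, 8, 5, 0]
other = [1, 1, 2, 3, 5, 8, 4, 7, 2, 0]

print(is_near_duplicate(commercial, rebroadcast))  # True
print(is_near_duplicate(commercial, other))        # False
```

Comparing sets of short subsequences, rather than whole clips, is what lets a small difference (a changed frame, a trimmed ending) lower the score only slightly instead of breaking the match outright.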
Allison is a software engineer at Cortico, a nonprofit that helps elevate local voices and share stories from communities all over the United States. Previously, she worked at MIT Lincoln Laboratory, researching secret stuff for the Department of Defense.