BBC News... without the crap
2024-03-09
Did I mention recently that I love RSS? That it brings me great joy? That I start and finish almost every day in my feed reader? Probably.
I used to have a single minor niggle with the BBC News RSS feed: that it included sports news, which I didn't care about. So I wrote a script that downloaded it, stripped sports news, and re-exported the feed for me to subscribe to. Magic.
But lately - presumably as a result of technical changes at the Beeb's side - this feed has found two fresh ways to annoy me:
- The feed now re-publishes a story if it gets re-promoted to the front page... but with a different <guid> (it appears to get a #0 after it when first published, a #1 the second time, and so on). In a typical day the feed reader might scoop up new stories about once an hour, and by the time I get to reading them the exact same story might appear in my reader multiple times. Ugh.
- They've started adding iPlayer and BBC Sounds content to the BBC News feed. I don't follow BBC News in my feed reader because I want to watch or listen to things. If you do, that's fine, but I don't, and I'd rather filter this content out.
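To make the first annoyance concrete, here's a minimal sketch of why the suffix causes duplicates. It assumes a feed reader that decides whether an item is new by comparing raw <guid> values; not every reader works that way, and the GUIDs below are made up for illustration.

```ruby
require 'set'

# Hypothetical GUIDs for one story, as seen on three successive polls.
# Real feed values will differ; only the "#n" suffix pattern matters here.
polled_guids = [
  'https://www.bbc.co.uk/news/uk-00000001#0',
  'https://www.bbc.co.uk/news/uk-00000001#1',
  'https://www.bbc.co.uk/news/uk-00000001#2'
]

# A reader that keys items on the raw GUID treats each one as a new story.
seen_raw = Set.new(polled_guids)

# Keying on the GUID without its "#n" suffix collapses them into one story.
seen_normalised = Set.new(polled_guids.map { |guid| guid.sub(/#.*\z/, '') })

puts "Keyed on raw GUID: #{seen_raw.size} items"
puts "Keyed on GUID without suffix: #{seen_normalised.size} item"
```

Three entries for one story in the first case, one in the second, which is the behaviour I actually want.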
Luckily, I already have a recipe for improving this feed, thanks to my prior work. Let's look at my newly-revised script (also available on GitHub):
#!/usr/bin/env ruby
require 'bundler/inline'
# # Sample crontab:
# # Every 20 minutes, run the script and log the results
# */20 * * * * ~/bbc-news-rss-filter-sport-out.rb > ~/bbc-news-rss-filter-sport-out.log 2>&1
# Dependencies:
# * open-uri - load remote URL content easily
# * nokogiri - parse/filter XML
gemfile do
  source 'https://rubygems.org'
  gem 'nokogiri'
end
require 'open-uri'
# Regular expression describing the GUIDs to reject from the resulting RSS feed
# We want to drop everything from the "sport" section of the website, also any iPlayer/Sounds links
REJECT_GUIDS_MATCHING = /^https:\/\/www\.bbc\.co\.uk\/(sport|iplayer|sounds)\//
# Load and filter the original RSS
# (URI.open rather than bare open: Kernel#open no longer fetches URLs in Ruby 3.0+)
rss = Nokogiri::XML(URI.open('https://feeds.bbci.co.uk/news/rss.xml?edition=uk'))
rss.css('item').select{|item| item.css('guid').text =~ REJECT_GUIDS_MATCHING }.each(&:unlink)
# Strip the anchors off the <guid>s: BBC News "republishes" stories by using guids with #0, #1, #2 etc, which results in duplicates in feed readers
rss.css('guid').each{|g|g.content=g.content.gsub(/#.*$/,'')}
File.open( '/www/bbc-news-no-sport.xml', 'w' ){ |f| f.puts(rss.to_s) }
It's amazing what you can do with Nokogiri and a half dozen lines of Ruby.
That revised script removes from the feed anything whose <guid> suggests it's sports news or from BBC Sounds or iPlayer, and also strips any "anchor" part of the <guid> before re-exporting the feed. Much better.
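If you want to check the two rules without fetching the live feed, here's a rough sketch that applies them to plain strings. The regular expression matches the one in the script; the sample GUIDs and the clean_guids helper are hypothetical, invented just for this illustration.

```ruby
# Same pattern as the script: sport, iPlayer and Sounds URLs are rejected.
REJECT_GUIDS_MATCHING = %r{^https://www\.bbc\.co\.uk/(sport|iplayer|sounds)/}

# Hypothetical GUIDs, made up for illustration; real feed values will differ.
SAMPLE_GUIDS = [
  'https://www.bbc.co.uk/news/uk-00000001#0',
  'https://www.bbc.co.uk/news/uk-00000001#1',
  'https://www.bbc.co.uk/sport/football/00000002#0',
  'https://www.bbc.co.uk/iplayer/episode/x0000003#0',
  'https://www.bbc.co.uk/sounds/play/x0000004#0',
  'https://www.bbc.co.uk/news/world-00000005#2'
].freeze

# Hypothetical helper: drop rejected GUIDs, then strip the "#n" anchor
# from whatever is left (the same two steps the script performs on the XML).
def clean_guids(guids)
  guids
    .reject { |guid| guid.match?(REJECT_GUIDS_MATCHING) }
    .map { |guid| guid.sub(/#.*$/, '') }
end

cleaned = clean_guids(SAMPLE_GUIDS)
puts cleaned
```

The sport, iPlayer and Sounds entries disappear, and the two copies of the first news story end up with identical GUIDs, so a feed reader that keys on <guid> should treat them as the same item.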
You're free to take and adapt the script to your own needs, or - if you don't mind being tied to my opinions about what should be in BBC News' RSS feed - just subscribe to my copy: link below -