Screen Scrape an RSS feed with PHP: A Guide

I was looking for a guide to making an RSS feed for a site that didn’t have one* and found the internet very much lacking. It turns out there is far less to choose from than a Google search would lead you to believe. Dennis Pallett seems to be one of the only people to have written a guide, and it is featured on many, many different sites. While it was very helpful, I still thought its explanation was lacking.

So I’m going to take Dennis’ code, show how I altered it, and go over what I did and what things mean, and hopefully paint a clearer picture of how to turn the world into an RSS feed.

Screen scraping an RSS feed is based on some simple concepts: grab each individual post into an array, then pull the title, permalink, and full text out of each one and throw them in their respective RSS tags.

<?php

$url = "http://www.gailgauthier.com/blogger.html";

$data = getUrlTEXT($url);

// Get content items
preg_match_all ("/<div class=\"posts\">([^`]*?)<\/div>/", $data, $matches);

getUrlTEXT is a quick function, defined at the bottom of the script, that uses cURL to get the full text of any URL.
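Since the helper itself isn’t shown in this post, here is a minimal sketch of what such a function could look like. The body is my assumption, not the code from the downloadable script: the redirect and timeout options are choices I made for the sketch, and returning an empty string on failure is just one way to handle errors.

```php
<?php
// Hypothetical stand-in for getUrlTEXT(); the version in the full script may differ.
function getUrlTEXT($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // give up after 15 seconds (arbitrary choice)
    $text = curl_exec($ch);

    // curl_exec() returns false on failure; hand back an empty string
    // so a later preg_match_all() call simply finds nothing.
    return ($text === false) ? '' : $text;
}
```

Returning an empty string keeps the rest of the script simple: with nothing to match, the feed just comes out with no items.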

preg_match_all and preg_match are both functions that I don’t fully understand, but they use strings of characters called ‘regular expressions’ that tell PHP what text to include and what to leave out (regular expressions guide here).

While I don’t completely understand it, I can point some things out. preg_match takes 3 arguments. The first is the pattern you are looking for, the second is the full text you are searching, and the third is an array that the found strings are put in. preg_match_all takes the same arguments but collects every match instead of stopping at the first.

The first argument is wrapped in double quotes and, inside those, in forward slashes (ex. "/REGULAR EXPRESSION HERE/"). Any double quote or forward slash inside the pattern has to be escaped with a backslash. You also see the start and end of a div as part of the expression. This tells preg_match what to look in between to find the text you are looking for. The regular expression between the div tags, ([^`]*?), roughly means ‘capture everything up to the next closing tag’; you’ll have to read up for the finer points. Suffice it to say it works in finding whatever is in between whatever you put on either side of it.
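To make that a bit more concrete, here is a small sketch that runs the same pattern against a made-up page. The sample markup is my own invention for illustration. The comments give my reading of each piece of the pattern; presumably the backtick trick is there because a plain dot would not match across line breaks.

```php
<?php
// Made-up sample page; the real markup of the scraped site may differ.
$data = "<div class=\"posts\">\n<h3>First post</h3>\nHello\n</div>\n"
      . "<div class=\"posts\">\n<h3>Second post</h3>\nWorld\n</div>";

// ( )    capture group: whatever matches inside is stored separately
// [^`]   any single character except a backtick, newlines included
// *?     zero or more of them, as few as possible (lazy), so the match
//        stops at the first closing </div> rather than the last one
preg_match_all("/<div class=\"posts\">([^`]*?)<\/div>/", $data, $matches);

echo count($matches[0]), "\n";   // number of posts found
echo $matches[0][1], "\n";       // whole second match, div tags included
echo trim($matches[1][0]), "\n"; // captured inner text of the first post
```

$matches[0] holds the full matches (tags and all) and $matches[1] holds only what the parentheses captured, which is why the script can loop over either one.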

Next you can just set up the RSS header information.

// Begin feed
header ("Content-Type: text/xml; charset=ISO-8859-1");
echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
?>

<rss version="2.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<title>Original Content—A Gail Gauthier Blog—Latest Content</title>
<description>The latest content from Gail Gauthier (http://www.gailgauthier.com/blogger.html), screen scraped! </description>
<link>http://www.gailgauthier.com/blogger.html</link>
<language>en-us</language>

Nothing hard here, just change the title, description and link. The rest of the information is standard for the RSS file.

Next we loop through the ‘matches’ array we made to extract the title, permalink, full text and author name.

<?php
// Loop through each content item
foreach ($matches[0] as $match) {

// First, get title
preg_match ("/<h3>([^`]*?)<\/h3>/", $match, $temp);
$title = $temp['1'];
$title = strip_tags($title);
$title = trim($title);

// Second, get url
preg_match ("/<span class=\"byline\">posted by gail at <a href=\"([^`]*?)\">/", $match, $temp);
$url = $temp['1'];
$url = trim($url);

// Third, get text
preg_match ("/<\/h3>([^`]*?)<span class=\"byline\">/", $match, $temp);
$text = $temp['1'];
$text = trim($text);
// As written, this swaps '<br />' for itself, so it changes nothing
$text = str_replace('<br />', '<br />', $text);

// Fourth, and finally, get author
preg_match ("/<span class=\"byline\">By ([^`]*?)<\/span>/", $match, $temp);
$author = $temp['1'];
$author = trim($author);

As you can see, getting the title is simple enough: it’s the only thing inside <h3> tags. The permalink was much harder, though, since it was not between any specific tags. Instead I used the string ‘<span class="byline">posted by gail at <a href="’ as the left-hand marker. This is obviously a very bad hack, and it will break as soon as the byline wording changes, but not every site will be nicely set up for you to turn it into an RSS feed.
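One thing the extraction code doesn’t do is check whether a pattern matched at all. Here is a sketch of how the same three lookups could be wrapped in a small helper that falls back to an empty string. The helper name firstCapture and the sample post are my own inventions for illustration, not part of Dennis’ code or the downloadable script.

```php
<?php
// Made-up post in the shape the regular expressions above expect.
$match = "<h3>Hello world</h3>\nSome text.<br />\n"
       . "<span class=\"byline\">posted by gail at "
       . "<a href=\"http://example.com/post1\">9:00 AM</a></span>";

// Hypothetical helper: return the first capture group, trimmed,
// or '' when the pattern does not match.
function firstCapture($pattern, $subject)
{
    return preg_match($pattern, $subject, $temp) ? trim($temp[1]) : '';
}

$title = strip_tags(firstCapture("/<h3>([^`]*?)<\/h3>/", $match));
$url   = firstCapture("/posted by gail at <a href=\"([^`]*?)\">/", $match);
$text  = firstCapture("/<\/h3>([^`]*?)<span class=\"byline\">/", $match);

echo $title, "\n", $url, "\n", $text, "\n";
```

With a fallback like this, a post that is missing its byline ends up with an empty URL instead of triggering a PHP warning in the middle of the XML output.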

if (!($title == '') && !($text == ''))
{
// Echo RSS XML
echo "<item>\n";
echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
echo "\t\t\t<link>" . strip_tags($url) . "</link>\n";
echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
echo "\t\t\t<content:encoded><![CDATA[ \n";
echo $text . "\n";
echo " ]]></content:encoded>\n";
echo "\t\t\t<dc:creator>" . "Gail Gauthier" . "</dc:creator>\n";
echo "\t\t</item>\n";
}//end if
}//end foreach

After we find all our information we just need to print it out in RSS format. First I do a quick check that the post actually has a title and text, and then everything is output into the tags.
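Two caveats are worth a sketch. strip_tags removes markup but leaves characters like & alone, and a bare & in a title or link makes the XML invalid. Also, the excerpts in this post stop at the end of the loop, and a complete feed still needs closing </channel> and </rss> tags after it. Below is one way the item output could be made safer. The function name rssItem is hypothetical and this is not how the original script does it.

```php
<?php
// Hypothetical helper, not part of the original script: builds one <item>.
function rssItem($title, $url, $text, $author)
{
    // Strip tags, then escape &, <, > and quotes. The charset matches the
    // ISO-8859-1 encoding the feed declares in its header.
    $esc = function ($s) {
        return htmlspecialchars(strip_tags($s), ENT_QUOTES, 'ISO-8859-1');
    };
    // A literal "]]>" would end the CDATA section early, so split it in two.
    $cdata = str_replace(']]>', ']]]]><![CDATA[>', $text);

    return "\t\t<item>\n"
         . "\t\t\t<title>" . $esc($title) . "</title>\n"
         . "\t\t\t<link>" . $esc($url) . "</link>\n"
         . "\t\t\t<description>" . $esc($text) . "</description>\n"
         . "\t\t\t<content:encoded><![CDATA[" . $cdata . "]]></content:encoded>\n"
         . "\t\t\t<dc:creator>" . $esc($author) . "</dc:creator>\n"
         . "\t\t</item>\n";
}

echo rssItem('Tom & Jerry', 'http://example.com/?a=1&b=2',
             'Cat <b>chases</b> mouse', 'Gail Gauthier');
```

Inside the loop, the block of echo statements could then shrink to a single echo rssItem($title, $url, $text, 'Gail Gauthier'); call.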

That’s about it for this method of screen scraping an RSS feed. I’ve heard Kottke mention ways of using the DOM from PHP5, but I am still working on learning DOM in general. You can download this full PHP script here.
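For the curious, here is a rough sketch of what the DOM route might look like, using PHP’s DOMDocument and DOMXPath classes. I have not run this against the real site. The markup below is made up to resemble the structure the regular expressions assume, and the XPath queries are my guess at how to express the same lookups.

```php
<?php
// Made-up markup; a real page would come from getUrlTEXT() or similar.
$html = '<html><body>'
      . '<div class="posts"><h3>First post</h3>Hello'
      . '<span class="byline">posted by gail at <a href="http://example.com/1">9:00 AM</a></span></div>'
      . '<div class="posts"><h3>Second post</h3>World'
      . '<span class="byline">posted by gail at <a href="http://example.com/2">10:00 AM</a></span></div>'
      . '</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid, so collect parser warnings quietly.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$items = array();
foreach ($xpath->query('//div[@class="posts"]') as $post) {
    // Queries starting with "." are relative to the current post.
    $title = $xpath->query('.//h3', $post)->item(0);
    $link  = $xpath->query('.//span[@class="byline"]/a', $post)->item(0);
    if ($title === null || $link === null) {
        continue; // skip posts missing either piece
    }
    $items[] = array(
        'title' => trim($title->textContent),
        'url'   => $link->getAttribute('href'),
    );
}

print_r($items);
```

The appeal is that the lookups follow the structure of the page rather than the exact wording of the byline, so a change like ‘posted by gail at’ becoming ‘posted by Gail on’ would not break the permalink lookup.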

* It turns out that the HTML Blogger was FTPing to Gail Gauthier’s site just didn’t list the Atom feed. I later found the feed and had to just be happy that I learned something.