Hudzilla Coding Academy: Project Five


What do you do if you like reading the BBC News site, but hate having to browse there every hour to see what's changed? The answer lies in RSS, which lets people subscribe to the news that interests them, and have their computer automatically check for updates. Most good news sites have had RSS feeds for a while, which means that people can draw up a list of their favourite sites and bring all the news they care about into one place.

Previously we looked at working with files, and now we're going to take that a bit further by working with a specific type of file: XML files. XML is very similar to HTML in that it marks blocks of text as having a specific meaning, eg <strong> is used in HTML to say "this part of the text should be shown in a bold font." But unlike HTML, XML doesn't use this markup to denote how things should look - it doesn't have any style information attached to it. Instead, XML says what things mean, which makes it perfect for sending data around.

XML comes in many different types depending on what kind of data it is carrying, and the particular flavour we're interested in is RSS: Really Simple Syndication. This stores news items that get updated whenever the main site gets updated, so that people can tell their computer to download the RSS feed once every ten minutes, then read the headlines without needing to touch a web browser.

Mono comes with lots of tools for working with XML, so we don't have to worry about how to read the files - you just feed the raw RSS XML into Mono and then call various methods on it to see what's inside. That means we can focus on what we intend to do with it. We're going to:

  1. Build a program to download RSS feeds and print them neatly.
  2. Make the program track the feed so that it only shows things that have appeared since we last checked.
  3. Get the program to remember the feeds that users were interested in.

Before we start

If you haven't already completed Hudzilla Coding Academy Projects 1, 2, 3 and 4 you may find this one a bit tricky because it builds upon many of the techniques already taught in previous projects.

First steps

The first thing you need to do is understand exactly what XML - and, more specifically, RSS - looks like. Below is an example RSS feed, and you'll see that the channel (the news feed) has a title, description and link. This is all meta-information - you can safely ignore this if you just want the news, but smarter programs will use this information to describe the news in a user interface.

You'll also see that there are two <item> elements, but there could easily be hundreds depending on how big the news site is. These are the actual news items, and again contain title, description and link fields, but this time they are specific to each individual news story. Here it is:

<?xml version="1.0" ?>
<rss version="2.0">
   <channel>
      <title>My Excellent Site</title>
      <description>There's lots of great content here - please subscribe!</description>
      <link>http://www.example.com</link>

      <item>
         <title>Mono rocks!</title>
         <description>Free .NET takes over world</description>
         <link>http://www.example.com/news/mono</link>
         <guid>http://www.example.com/news/mono</guid>
      </item>

      <item>
         <title>Mono beats PHP</title>
         <description>Consistent function rules</description>
         <link>http://www.example.com/news/monovsphp</link>
         <guid>http://www.example.com/news/monovsphp</guid>
      </item>
   </channel>
</rss>

You'll notice that each <item> has identical <link> and <guid> elements. 'GUID' is short for globally unique identifier, and is any value that is unique to that exact story across the whole internet. This is required for RSS feeds, as it's used to let RSS programs know if they've seen that news story before or not.

You need to be careful to choose GUIDs that are unique - not just unique to your site, but unique to all other sites too. The easiest (and most common) way to do this is just to use the link to the story as the GUID, because it's guaranteed to be unique.

Now let's try a real example. Start a new console project in MonoDevelop and call it TermFeed. In the 'using' lines at the top, add this:

using System.IO;
using System.Xml;

Next, right-click on References in the Solution pane on the left of MonoDevelop's window, then select Edit References from the menu that appears. Make sure System.Xml is selected from that list, and click OK.

You need to add a reference to System.Xml so that Mono knows to pull that library in.


Now change the Main() code to this:

XmlDocument doc = new XmlDocument();
doc.Load("http://tinyurl.com/8mwkm");
Console.Write(doc.InnerXml);

The TinyURL link is there to save space - you can use the full URL if you want to: http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml.

That code uses the XmlDocument class to read a URL, then print it to the screen. This is a new class and can do all sorts of clever things, but right now we're just interested in the fact that if you create a new XmlDocument object and call Load() on it with a URL, Mono will download the XML and parse it ready for you to use. By "parse" we mean "extract the structure from the text", which means that once Mono has parsed the XML it knows how many news items there are and what they contain - it's no longer just a lot of plain text.

We're not doing anything fancy with this XmlDocument object just yet, we're just printing out its InnerXml property to the console. If you were wondering, InnerXml contains the plain-text version of the XML feed - ie, what Mono downloaded to parse - so by printing it out we are just making sure that the XML feed is being read correctly.
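
If you want to experiment with parsing without hitting the network every time, here's a minimal sketch that feeds XmlDocument a string instead of a URL. It uses LoadXml() rather than Load(), and the LoadDemo class, the RootName() helper and the cut-down sample feed are all made up for illustration rather than being part of TermFeed:

```csharp
using System;
using System.Xml;

class LoadDemo {
    // A tiny hand-written feed (an assumption for this example) so that
    // the program runs without a network connection.
    public const string SampleFeed =
        "<?xml version=\"1.0\" ?>" +
        "<rss version=\"2.0\"><channel>" +
        "<title>My Excellent Site</title>" +
        "<item><title>Mono rocks!</title></item>" +
        "</channel></rss>";

    // Hypothetical helper: parses the XML text and returns the name
    // of the outermost element.
    public static string RootName(string xml) {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);   // LoadXml() parses a string; Load() takes a URL or filename
        return doc.DocumentElement.Name;
    }

    static void Main() {
        Console.WriteLine(RootName(SampleFeed));   // prints "rss"

        // XML that isn't well-formed makes the parser throw an XmlException
        try {
            RootName("<rss><channel></rss>");
        } catch (XmlException e) {
            Console.WriteLine("Not well-formed: " + e.Message);
        }
    }
}
```

The second call shows what "parsing" buys you: because Mono has to work out the structure, it notices straight away when a tag is never closed.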

Hit F5 to compile and run the program, and you should see a large chunk of text printed out in the Application Output pane in MonoDevelop. This is our RSS - it may look like a complex beast, but we're going to tame it!

This first version of our program simply prints a text dump into the Application Output pane in MonoDevelop.


Just the headlines

We can whip our RSS into shape using two simple Mono methods: SelectSingleNode() and SelectNodes(). These let you search through XML for the exact data you're interested in, and either return just one XML node (the name for an XML element such as <item> once it has been read by our program) or return all the matching nodes. Mono calls them nodes because XML can be viewed like a tree: <rss> contains a <channel>, which in turn contains one or more <item>s.

So, what we want v2 of our program to do is to read all the news items, then print the headline and description information from each story. Here's my recipe for the TermFeed v2:

  1. Preheat your RSS by passing it through XmlDocument.Load().
  2. Peel the skin to reveal only the <item> elements we care about.
  3. Gently sift through the <item>s, sprinkling their data over Console.Write() as necessary.
  4. Season with salt, and serve.

Or in the more conventional C#...

XmlDocument doc = new XmlDocument();
doc.Load("http://tinyurl.com/8mwkm");
XmlNodeList items = doc.SelectNodes("//item");

foreach (XmlNode item in items) {
	Console.WriteLine(item.SelectSingleNode("title").InnerText);
	Console.WriteLine(" " + item.SelectSingleNode("description").InnerText);
	Console.WriteLine("");
}

That should replace the three lines of code you already have inside Main(). The parameter that gets passed into SelectNodes - //item - is known as XPath. This is the special way of searching for things inside XML, and our example means 'get any <item> element, anywhere in the XML'. That's what the // means: 'get any'. Take a look at this XML:

<stuff>
	<clothing>
		<item>Trousers</item>
		<item>Socks</item>
	</clothing>

	<news>
		<item>Wii released</item>
		<item>Xbox 360 sucks!</item>
	</news>
</stuff>

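If we use the XPath //item to get news items from that XML, we'll be disappointed: it will pull out items of clothing and items of news in the same search! Rather than using the // 'get any' search on its own, you would need to be more specific and say you only want <item> elements that are part of <news> elements. In XPath, you would use //news/item, or spell out the full path from the root as /stuff/news/item. A path that starts with a single / is matched from the very top of the document, so /news/item on its own would find nothing here.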
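
To see the difference between these searches for yourself, here's a small sketch that runs a few XPath expressions against the clothing-and-news XML and counts the matches. The XPathDemo class and CountMatches() helper are names I've invented for this example:

```csharp
using System;
using System.Xml;

class XPathDemo {
    // The clothing-and-news XML from the text, squeezed onto a few lines
    public const string Stuff =
        "<stuff>" +
        "<clothing><item>Trousers</item><item>Socks</item></clothing>" +
        "<news><item>Wii released</item><item>Xbox 360 sucks!</item></news>" +
        "</stuff>";

    // Hypothetical helper: returns how many nodes the XPath matches
    public static int CountMatches(string xpath) {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(Stuff);
        return doc.SelectNodes(xpath).Count;
    }

    static void Main() {
        Console.WriteLine(CountMatches("//item"));            // 4: clothing and news together
        Console.WriteLine(CountMatches("//news/item"));       // 2: only items inside <news>
        Console.WriteLine(CountMatches("/stuff/news/item"));  // 2: the full path from the root
        Console.WriteLine(CountMatches("/news/item"));        // 0: the root element is <stuff>, not <news>
    }
}
```

The last line is the one to watch out for: a path that starts with a single / has to begin at the root element, so leaving out <stuff> means nothing matches.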

Our RSS feed only uses <item> when it's referring to news items, so using //item is safe enough for now. This search gives us back an object of the type XmlNodeList. If an XML node contains one XML element, then an XmlNodeList contains several XML elements, right? Right. I just wanted to make sure you hadn't lost the plot while we were discussing XPath!

Once we have a list of all the news items, it's just a matter of printing them out. Previously you learned how to use the foreach loop, and now it's back - and working with XmlNodes rather than plain old strings. This loop goes through each news item that was returned from SelectNodes(), and puts it into the item variable ready for us to read.

Each <item> in our XML contains several interesting children of its own: the title of the news, the description, the link and so on. To extract each of these, we need to use the SelectSingleNode() method on our item, which gives us an XmlNode. So to get the title of a news item, we'd do item.SelectSingleNode("title"). But that just gives us an XML node, which is Mono's internal representation of the <title> XML element as opposed to the actual text inside it. That's what the InnerText part does: it retrieves the textual content from an XmlNode object.

With all that in mind, here's one of those code lines again:

Console.WriteLine(item.SelectSingleNode("title").InnerText);

That uses the current item, gets its title node (which contains the title of the current news item), gets the text of that title node and prints it out to the console. After the headline and description are printed out for a news story, Console.WriteLine() is called with an empty string so that it prints a blank line between stories.
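
One thing worth knowing is that SelectSingleNode() returns null when nothing matches, so asking for InnerText on a missing element will crash the program. Real feeds don't always include every field, so here's a hedged sketch of a small wrapper that falls back to a default string. ChildText() and FirstItem() are hypothetical helpers of my own, not part of Mono:

```csharp
using System;
using System.Xml;

class FieldDemo {
    // Hypothetical helper: returns the text of the named child element,
    // or the fallback string if that element isn't there.
    public static string ChildText(XmlNode parent, string name, string fallback) {
        XmlNode child = parent.SelectSingleNode(name);
        return child == null ? fallback : child.InnerText;
    }

    // Hypothetical helper: parses a snippet of XML and returns its first <item>
    public static XmlNode FirstItem(string xml) {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectSingleNode("//item");
    }

    static void Main() {
        // this made-up item has a title but no description
        XmlNode item = FirstItem(
            "<rss><channel><item><title>Mono rocks!</title></item></channel></rss>");

        Console.WriteLine(ChildText(item, "title", "(no title)"));              // prints "Mono rocks!"
        Console.WriteLine(ChildText(item, "description", "(no description)"));  // prints "(no description)"
    }
}
```

For the BBC feed the plain item.SelectSingleNode("title").InnerText works fine, so treat this as an optional safety net for less tidy feeds.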

That's it: compile and run your program with F5, and be amazed at how your wonderful culinary skills have transformed the raw ingredients of an RSS feed into a printout on your screen!

With just a little bit of processing, our RSS feed parser is starting to become useful!


What's new, pussycat?

Our program has a problem, which is that RSS feeds can be long, and really people only care about what's changed since they last checked the feed. This is a real issue: how can we track which RSS news items people have read already, and only show the ones they haven't seen? Well, cast your mind back into the dark mists of a thousand words ago, and you'll remember globally unique identifiers.

Here's what I said: "This is required for RSS feeds, as it's used to let RSS programs know if they've seen that news story before or not." Each RSS news item needs a GUID so that it's absolutely unique on the web, and we can use that to know whether we've seen something before or not.

Here's how it should work:

  1. Get the RSS feed.
  2. Store all the GUIDs, one per line, in a file.
  3. Next time the RSS feed is loaded, only show news items if they don't appear in our list of cached GUIDs.

It's only three steps, but actually programming the thing is a bit harder. In fact, this requires what is probably the longest piece of code we've looked at to date! Here's how the new Main() method ought to look - I've added comments throughout to explain how it all works:

XmlDocument doc = new XmlDocument();
doc.Load("http://tinyurl.com/8mwkm");

// this string array will store the contents of our cache file
string[] guidcache;

// to check whether a file exists or not, we use the File.Exists() method
if (File.Exists("guidcache.txt")) {
   // we have a cache file - go ahead and read it in!
   guidcache = File.ReadAllLines("guidcache.txt");
} else {
   // we don't have a cache file - create a new string array
   // with 0 elements (ie, it's empty)
   guidcache = new string[0];
}

// grab all the news items as per usual...
XmlNodeList items = doc.SelectNodes("//item");

// now loop over all those items
foreach (XmlNode item in items) {
   // presume we're going to show this news item by default
   bool showthisitem = true;

   // now go through each GUID in our cache...
   foreach (string guid in guidcache) {
      // ... and compare it against the GUID of this news item
      if (guid == item.SelectSingleNode("guid").InnerText) {
         // if we're here, we've got a match - don't show this item!
         showthisitem = false;

         // now tell Mono to exit the loop - we've matched the GUID, 
         // and so don't need to check against other GUIDs in the cache
         break;
      }
   }

   if (showthisitem) {
      // we can only get here if the GUID isn't in our cache - print it out!
      Console.WriteLine(item.SelectSingleNode("title").InnerText);
      Console.WriteLine(" " + item.SelectSingleNode("description").InnerText);
      Console.WriteLine("");

      // ... now add the GUID to our cache file so it is ignored next time.
      File.AppendAllText("guidcache.txt", item.SelectSingleNode("guid").InnerText + "\n");
   }
}

The key with long code chunks such as that one is to take your time and not skip over bits you don't understand. All the code is neatly indented to help you follow the foreach loops and if statements, and there's nothing there you haven't seen before - all that file reading and writing, for example, is straight from Project One.

Believe it or not, that's the easiest way to write the code, but if you're looking for something that runs a bit faster, I suggest you insert this just after the bool showthisitem line:

string thisguid = item.SelectSingleNode("guid").InnerText;

Rather than having to call SelectSingleNode() for every GUID in the cache and for every news item, that line caches the GUID in a string variable that you can use instead of the other SelectSingleNode("guid") calls.
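
If the cache grows to thousands of GUIDs, the inner foreach loop becomes the slow part. As an alternative to the array-and-loop approach (and not what the code above does), here's a sketch that keeps the GUIDs in a HashSet<string>, which can check membership without looping. The SeenDemo class, NewTitles() method and sample feed are assumptions made for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

class SeenDemo {
    // A made-up two-item feed so the example runs offline
    public const string SampleFeed =
        "<rss version=\"2.0\"><channel>" +
        "<item><title>Mono rocks!</title><guid>http://www.example.com/news/mono</guid></item>" +
        "<item><title>Mono beats PHP</title><guid>http://www.example.com/news/monovsphp</guid></item>" +
        "</channel></rss>";

    // Hypothetical helper: returns the titles of items whose GUID isn't
    // in 'seen' yet, and adds those GUIDs to 'seen' as it goes.
    public static List<string> NewTitles(XmlDocument doc, HashSet<string> seen) {
        List<string> titles = new List<string>();

        foreach (XmlNode item in doc.SelectNodes("//item")) {
            string thisguid = item.SelectSingleNode("guid").InnerText;

            // Add() returns false if the value was already in the set
            if (seen.Add(thisguid)) {
                titles.Add(item.SelectSingleNode("title").InnerText);
            }
        }

        return titles;
    }

    static void Main() {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(SampleFeed);

        HashSet<string> seen = new HashSet<string>();
        Console.WriteLine(NewTitles(doc, seen).Count);   // 2: everything is new
        Console.WriteLine(NewTitles(doc, seen).Count);   // 0: we've seen it all now
    }
}
```

To plug this into TermFeed you could build the set from the lines of guidcache.txt, since HashSet<string> has a constructor that accepts a string array. You would still need to append new GUIDs to the file as before.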

Subscribe today!

Let's take our program to warp speed: right now we have the BBC URL right in our source code, which is known as being "hard coded". But what if people want to read a different news source? Or what if they want to read several news sources and update them all simultaneously? This requires more advanced coding, but it does start to make our program useful at last.

To be able to work with multiple news feeds, our program has to be able to do the following:

  1. When provided with the parameter sub followed by a URL, it should subscribe to that feed.
  2. When provided with the parameter unsub followed by a URL, it should unsubscribe from that feed.
  3. When provided with no parameters at all, it should refresh all the RSS feeds and show all the new entries.
  4. When provided with the parameter reset it should clear the GUID list and refresh the feeds, showing all entries in all the feeds.

That's nothing too far above our current code, particularly if you've already been through Project One and Project Three, but there is one subtle change here from our existing code: actions 3 and 4 both need to print out the RSS feeds. Now, the coarse way to solve this is to select all the RSS printing code we have already, then copy and paste it so the same code is in our program twice; but a much better solution is to create our own method that can be called from anywhere, and centralises all the code in one place.

But first, we need to write the code to subscribe to and unsubscribe from our feeds. The smart way to do this is to use a switch/case block to check the number of arguments that were passed into the program. For example, our basic code to check what operation the program should perform would look like this:

switch (args.Length) {
	case 0:
		// refresh the feeds!
		break;	
	case 1:
		// reset the feeds!
		break;
	case 2:
		// sub or unsub to a feed!
		break;
}

If no parameters are provided, we need the program to refresh all the feeds. If one parameter is provided, we can just go ahead and reset the feeds - we don't need to check what that parameter is, because the only reason our program would be called with just one parameter is to reset the feeds. Finally, if two parameters are provided we need to check whether it's a sub or an unsub, then take the appropriate action.
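
If you'd rather the program didn't wipe the GUID cache when someone mistypes a parameter, a stricter variant is possible. The sketch below is not the approach the article takes: it works out which action was asked for and returns it as a string, with "usage" standing in for "print some help text". The ArgsDemo class, the Action() method and the "usage" result are all my own assumptions:

```csharp
using System;

class ArgsDemo {
    // Hypothetical helper: maps the command-line arguments to the name
    // of the action the program should perform.
    public static string Action(string[] args) {
        switch (args.Length) {
            case 0:
                return "refresh";
            case 1:
                // only accept the word "reset" here
                return args[0] == "reset" ? "reset" : "usage";
            case 2:
                if (args[0] == "sub" || args[0] == "unsub") return args[0];
                return "usage";
            default:
                // three or more parameters: we don't know what to do
                return "usage";
        }
    }

    static void Main(string[] args) {
        Console.WriteLine(Action(args));
    }
}
```

Separating "decide what to do" from "do it" like this also makes the decision easy to check by hand, because you can call Action() with any array of strings you like.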

Put that skeleton switch/case block at the top of your Main() method for now - we need to fill it in bit by bit.

Make an executive decision

We're going to deal with the subscribing and unsubscribing first. This needs to check whether sub or unsub was provided, then either add the feed to the subscription list or remove it. Here goes:

case 2:
	// sub or unsub to a feed!

	// ignore the second parameter if it's empty
	if (args[1] == "") return;

	if (args[0] == "sub") {
		// add the site to the existing list
		File.AppendAllText("sitelist.txt", args[1] + "\n");
	} else {
		if (File.Exists("sitelist.txt")) {
			// remove site from the list - this works in a very similar way
			// to the deleting method used in Project One
			string[] sitelist = File.ReadAllLines("sitelist.txt");

			File.Delete("sitelist.txt");

			foreach (string site in sitelist) {
				if (site == args[1]) {
					// aha! this is the site we need to drop; ignore it
				} else {
					File.AppendAllText("sitelist.txt", site + "\n");
				}
			}
		}
	}

	break;

Subscribing is pretty simple, but unsubscribing is a little trickier. In the code above, it works by reading in the site list file then deleting it. It then goes over every site that the user currently subscribes to and writes it back out line-by-line to the site list file. But when it finds the site that they want to unsubscribe from, it skips over it. Yes, that's exactly how we solved the problem in Project One - your skills are starting to come in useful already!
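
There is more than one way to do the unsubscribe step. As a sketch of an alternative, you could collect the sites you want to keep in a list and write them back in one go with File.WriteAllLines(), which replaces the file's contents and so avoids the delete-then-append dance. The UnsubDemo class, the Unsubscribe() method and the example URLs here are invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class UnsubDemo {
    // Hypothetical helper: removes every line equal to 'url' from the list file
    public static void Unsubscribe(string listfile, string url) {
        // nothing to do if there is no list yet
        if (!File.Exists(listfile)) return;

        List<string> keep = new List<string>();

        foreach (string site in File.ReadAllLines(listfile)) {
            if (site != url) keep.Add(site);
        }

        // WriteAllLines() overwrites the file with just the lines we kept
        File.WriteAllLines(listfile, keep.ToArray());
    }

    static void Main() {
        // use a temporary file so we don't touch a real sitelist.txt
        string listfile = Path.GetTempFileName();
        File.WriteAllLines(listfile, new string[] {
            "http://a.example/rss",
            "http://b.example/rss"
        });

        Unsubscribe(listfile, "http://a.example/rss");

        foreach (string site in File.ReadAllLines(listfile)) {
            Console.WriteLine(site);   // only http://b.example/rss is left
        }

        File.Delete(listfile);
    }
}
```

Both approaches end up with the same file; this one just does the writing in a single step.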

The other two cases are easier, and ought to look like this:

switch (args.Length) {
	case 0:
		// refresh the feeds!
		ReadFeeds();
		break;

	case 1:
		// reset the feeds!
		File.Delete("guidcache.txt");
		ReadFeeds();
		break;

The ReadFeeds() method is what I meant about code reuse: we could paste all the code needed to read feeds directly into both case statements, but it's faster to just create our own method: ReadFeeds(). So, when the program is called without any parameters, ReadFeeds() is called immediately. When it's called with a single parameter, we clear the GUID cache then call ReadFeeds().

The ReadFeeds() method itself is largely the same as the old RSS reading code, but we need to modify it if we want it to be able to read from multiple sites. First, create the empty method:

static void ReadFeeds() {

}

Now you need to copy all our RSS code into that method. So select everything from here:

XmlDocument doc = new XmlDocument();
doc.Load("http://tinyurl.com/8mwkm");

...all the way down to here:

      File.AppendAllText("guidcache.txt", item.SelectSingleNode("guid").InnerText + "\n");
   }
}

And press Ctrl+X to remove it from the file and place it on your clipboard, then paste it into the new ReadFeeds() method. Now we need to top and tail it with some code to read the list of RSS feeds from a file, then execute all that code for each one. So, put this code at the very beginning of ReadFeeds():

string[] sitelist;

if (File.Exists("sitelist.txt")) {
   // if a site list exists, load it
   sitelist = File.ReadAllLines("sitelist.txt");
} else {
   // if not, bail out - there's nothing to do here
   return;
}

// now loop over each site in the list
foreach (string site in sitelist) {
	// these two replace the previous "XmlDocument doc" / "doc.Load()" lines
	// because we need to load the site URL now!
	XmlDocument doc = new XmlDocument();
	doc.Load(site);

Then at the end of the method, add another closing brace (that's a "}", remember) to finish the foreach loop. None of that code should be surprising, but it does finish off our program perfectly.
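
Once you're reading several feeds, one of them being offline or broken will stop the whole program, because doc.Load() throws an exception when it can't fetch or parse something. Here's a hedged sketch of how you might skip a bad feed and carry on with the rest. To keep it runnable offline it loads from local files (Load() accepts filenames as well as URLs) and just counts items rather than printing them; the SafeLoadDemo class and CountItems() method are my own inventions:

```csharp
using System;
using System.IO;
using System.Xml;

class SafeLoadDemo {
    // Hypothetical helper: loads each site in turn and returns the total
    // number of <item> elements found, skipping any site that fails.
    public static int CountItems(string[] sites) {
        int total = 0;

        foreach (string site in sites) {
            try {
                XmlDocument doc = new XmlDocument();
                doc.Load(site);
                total += doc.SelectNodes("//item").Count;
            } catch (Exception e) {
                // catching every exception is crude, but it keeps the sketch short
                Console.WriteLine("Skipping " + site + ": " + e.Message);
            }
        }

        return total;
    }

    static void Main() {
        // write a made-up feed to a temporary file to stand in for a real site
        string good = Path.GetTempFileName();
        File.WriteAllText(good,
            "<rss><channel><item><title>A</title></item>" +
            "<item><title>B</title></item></channel></rss>");

        // this file doesn't exist, so loading it will fail
        string missing = good + ".missing";

        Console.WriteLine(CountItems(new string[] { missing, good }));   // 2

        File.Delete(good);
    }
}
```

In ReadFeeds() the same idea would mean wrapping the body of the foreach loop in try/catch, so a dead feed produces a one-line warning instead of a crash.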

Hit F8 to compile the program without running it, then open up a terminal and browse to the location of your MonoDevelop project. From there, look for the bin/Debug directory, and you should find an executable waiting for you. Give it a try - I think you'll agree it's actually very useful!

The finished product remembers what has been seen already and handles multiple feeds smoothly.


Let's wrap up

At the end of this fifth project, you've now been introduced to XML. And if you're thinking, "wait... I put in all that work, and only learned one thing?" relax: XML is really important. You see, XML is a format that computers and humans can read, can be transferred over the web, and can be used for all sorts of tasks. We'll be making extensive use of XML (particularly the SelectSingleNode() and SelectNodes() methods) in future projects, so please don't just forget about it!

Homework

If you're following this with a tutor, you will be required to complete the following homework before continuing. If you're working by yourself, I strongly recommend you find someone who can help check your work and provide feedback.

The homework for this project is made up of just one coding problem, but it's not easy. What you need to do is to make the program save and load settings from a file, .termfeed, which is itself an XML file. The settings you should load, save and use are: whether the contents of the <link> element should be shown after stories, whether to show only new stories or all stories (in essence, whether to use or ignore the GUID cache) and whether the program should be silent if none of the news feeds have any new items or whether it should print out the message "No news updates - be patient!"

Keep in mind that XML is just plain text. Yes, reading it in is best done using XmlDocument, but writing out an XML document is really just a matter of writing out a string using something like File.AppendAllText().

If you have problems, try to solve them yourself - you might not succeed, but you'll learn a lot by trying! If you're still having problems, drop your tutor an email and ask for help.

The small print

This work was produced for TuxRadar.com as part of the Hudzilla Coding Academy series. All source code for this project is licensed under the GNU General Public License v3 or later. All other rights are reserved.



Your comments

is this different from the rss reader from lxf?

Just curious if this is different from the RSS Reader Mono program in LXF a few months to a year ago.

Thanks

Re: is this different from the rss reader from lxf?

If you imagine that same article, just without having to fit into the same word count, this is what you get.

I don't understand any of this.

but i think you need to look me up, by my real name, on facebook.

Error in code

I think the statement "using System.XML;" is wrong.
It should be "using System.Xml;" (else you get a compile error).
