Easily retrieve HTML content from websites using C# and HTML Agility Pack (HtmlAgilityPack)

I needed to retrieve a chunk of content (a product price) from a certain DIV (<div class="price">) on a website for a large number of search terms. The quickest way to do this was to use the HtmlAgilityPack library for .NET to request the website’s search page (with custom search terms) and then pull out the content I needed using the SelectNodes() method. SelectNodes() uses XPath to search across the document, which sounds tricky but is easy once you know some basic rules. XPath is described well on w3schools if you need to know the syntax. You will need to download the Html Agility Pack DLL and add a reference to it in your project.
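
To make the basic XPath rules concrete, here is a minimal sketch that runs a few expressions against an in-memory HTML string instead of a live website. The sample markup, the XPathDemo class and its Select helper are all made up for illustration; only HtmlDocument.LoadHtml(), SelectNodes() and InnerText come from Html Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathDemo
{
    // Hypothetical markup standing in for a real search results page.
    public const string SampleHtml =
        "<html><body>" +
        "<div id='results'>" +
        "<div class='item'><a href='/p/1'>Widget</a><div class='price'>£9.99</div></div>" +
        "<div class='item'><a href='/p/2'>Gadget</a><div class='price'>£24.50</div></div>" +
        "</div>" +
        "<div class='price'>£1.00</div>" +
        "</body></html>";

    // Returns the trimmed InnerText of every node matching the XPath expression.
    public static List<string> Select(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return results; // SelectNodes() returns null when nothing matches

        foreach (HtmlNode node in nodes) results.Add(node.InnerText.Trim());
        return results;
    }

    private static void Print(string xpath)
    {
        Console.WriteLine(xpath + " -> " + string.Join(" | ", Select(SampleHtml, xpath)));
    }

    public static void Main()
    {
        // "//" searches at any depth; [@class='price'] filters on an attribute value
        Print("//div[@class='price']");
        // Chaining steps restricts the search to prices inside the results container
        Print("//div[@id='results']//div[@class='price']");
        // "/" selects direct children only: the link inside each item
        Print("//div[@class='item']/a");
    }
}
```

Trying expressions against a saved copy of the page like this is a quick way to get the XPath right before running it across a long list of search terms.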

The following is a simplified version of my code. It uses a file reader and writer to read search terms line by line from a text file and write the results to another file. The exact code will change depending on the website you are using, especially the XPath passed to SelectNodes():

using System;
using System.IO;
using HtmlAgilityPack;

namespace webget
{
  class Program
  {
    static void Main(string[] args)
    {
      string line;
      StreamReader infile = new StreamReader("input.csv");
      StreamWriter outfile = new StreamWriter("output.csv");

      // Create the Html Agility Pack object
      HtmlWeb hw = new HtmlWeb();

      while ((line = infile.ReadLine()) != null)
      {
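        try
        {
          // Load the search results page for this term into an HtmlDocument (the term is URL-encoded)
          HtmlDocument doc = hw.Load("http://www.website.com/search.php?keywords=" + Uri.EscapeDataString(line.Trim()));

          // Loop through every DIV with class "price" on the page and extract its content.
          // SelectNodes() returns null when nothing matches, so a page without prices
          // throws here and is reported as ERROR by the catch below.
          foreach (HtmlNode priceNode in doc.DocumentNode.SelectNodes("//div[@class='price']"))
          {
            // Clean up the content by removing the £ symbol and any carriage returns, tabs and newlines
            string price = priceNode.InnerText.Replace("£", "").Replace("\r", "").Replace("\t", "").Replace("\n", "").Trim();
            line += "," + price;
          }
        }
        catch
        {
          // Covers failed page loads as well as pages with no matching DIV
          line += ",ERROR";
        }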

        Console.WriteLine(line);
        outfile.WriteLine(line);
      }

      infile.Close();
      outfile.Close();
    }
  }
}
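
If you would rather tell "no prices on the page" apart from a genuine failure, the extraction step can be pulled into its own method and tested offline. The following is only a sketch under that assumption: PriceExtractor, ExtractPrices and the sample HTML are hypothetical names of mine, not part of Html Agility Pack. It also decodes entities with HtmlEntity.DeEntitize() first, since as far as I know InnerText leaves entities such as &pound; as they appear in the source.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PriceExtractor
{
    // Hypothetical helper: returns the cleaned prices found in the HTML,
    // or an empty list when no <div class="price"> is present.
    public static List<string> ExtractPrices(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var prices = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class='price']");
        if (nodes == null) return prices; // nothing matched: not an error, just no prices

        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &pound; before stripping the currency symbol
            string text = HtmlEntity.DeEntitize(node.InnerText);
            prices.Add(text.Replace("£", "").Replace("\r", "").Replace("\t", "").Replace("\n", "").Trim());
        }
        return prices;
    }

    public static void Main()
    {
        // Made-up fragment with stray whitespace and an encoded pound sign
        string html = "<div class='price'>\r\n\t&pound;12.99\r\n</div><div class='price'>£5.00</div>";
        Console.WriteLine(string.Join(",", ExtractPrices(html)));

        // A page without any price DIV yields an empty list instead of an exception
        Console.WriteLine(ExtractPrices("<p>No results</p>").Count);
    }
}
```

In the main loop you could then pass doc.DocumentNode.OuterHtml (or select directly on the loaded document) and write an empty column for no results, keeping ERROR for pages that failed to load.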
