Canarys | IT Services

HTML Parser

Are you looking for an HTML parser?

Want to traverse HTML DOM elements?

Want to read the attributes of an HTML element and their values?

Want to add attributes to HTML elements dynamically?

Want to modify an HTML form at runtime?

jQuery is one of the best solutions for doing all of this on the client side. But what about doing the same on the server side?

One fast and reliable option is Html Agility Pack.

Html Agility Pack is an open source .NET library whose API closely resembles the XmlDocument class: you select nodes by passing XPath expressions to it. The library was originally published on CodePlex, which has since shut down; it is now available as the HtmlAgilityPack package on NuGet.

Sample HTML

Here are a few examples based on my sample HTML form, shown in the screenshot above.

Loading HTML Form:

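The HtmlDocument class is used to load the HTML form, as shown below:

//Namespaces used by the examples in this post
using System.IO;
using HtmlAgilityPack;

//Load HTML Document into DOM
HtmlDocument htmDoc = new HtmlDocument();
//Verbatim string (@) so the backslash in the path is not treated as an escape
string html = File.ReadAllText(@"E:\TestSri.html");
htmDoc.LoadHtml(html);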

Read HTML Node:

To read any HTML node/tag, pass the XPath of that node to SelectSingleNode. The result is an HtmlNode object. The HtmlNode class offers the core features of the parser: traversing child nodes, reading the attributes of a tag, and so on.

//Read HTML Node
//Get Title of HTML Document
HtmlNode titleNode = htmDoc.DocumentNode.SelectSingleNode("//title");
string title = titleNode.InnerText;

Here htmDoc.DocumentNode is the root node of the HTML document.

Read HTML Nodes with same type:

To read several nodes of the same type, pass an XPath expression as in the example above, but call SelectNodes instead of SelectSingleNode. SelectNodes returns an HtmlNodeCollection containing every matching HtmlNode.

//Read HTML Nodes
//Get all SPAN tags
HtmlNodeCollection spanCollection = htmDoc.DocumentNode.SelectNodes("//span");

//Get all Anchor tags
HtmlNodeCollection anchorCollection = htmDoc.DocumentNode.SelectNodes("//a");

//Get all DIV tags
HtmlNodeCollection divCollection = htmDoc.DocumentNode.SelectNodes("//div");
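
One detail worth guarding against: by default, SelectNodes returns null rather than an empty collection when the XPath matches nothing, so iterating the result directly can throw. A minimal sketch of a wrapper follows. The class and method names (SelectNodesSketch, SelectNodesOrEmpty, CountSpans) are hypothetical, not part of the library, and the code assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class SelectNodesSketch
{
    // Hypothetical helper: returns an empty sequence instead of null
    // when the XPath expression matches no nodes.
    public static IEnumerable<HtmlNode> SelectNodesOrEmpty(HtmlNode root, string xpath)
    {
        HtmlNodeCollection nodes = root.SelectNodes(xpath);
        if (nodes == null)
        {
            return Enumerable.Empty<HtmlNode>();
        }
        return nodes;
    }

    // Counts the SPAN tags in an HTML string; safe even when there are none.
    public static int CountSpans(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return SelectNodesOrEmpty(doc.DocumentNode, "//span").Count();
    }
}
```

With a helper like this, a foreach over the result simply runs zero times when the page has no matching tags.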

Find Node:

You can get any node from the HTML form based on its id. For that, use the GetElementbyId() method of the HtmlDocument class.

//Find Node
//Get the tag which has the id dvLogin
HtmlNode ndLogin = htmDoc.GetElementbyId("dvLogin");
//Get the tag which has the id guestName and read its text
HtmlNode ndGuestName = htmDoc.GetElementbyId("guestName");
string guestName = ndGuestName.InnerText;

Read Attributes:

To read the attributes of an HTML node, first find the node. The HtmlNode class exposes an Attributes collection; index it by attribute name to get an HtmlAttribute object, whose Name and Value properties hold the data you need.

//Read Attributes
//Find URL of forgot password
HtmlNodeCollection anchorsCollection = htmDoc.DocumentNode.SelectNodes("//a");
string forgotURL = string.Empty;

//SelectNodes returns null when nothing matches, so guard before iterating
if (anchorsCollection != null)
{
    foreach (HtmlNode node in anchorsCollection)
    {
        if (node.InnerText.ToLower().Contains("forgot"))
        {
            //The indexer returns null if the anchor has no href attribute
            HtmlAttribute attrib = node.Attributes["href"];
            if (attrib != null)
            {
                forgotURL = attrib.Value;
            }
        }
    }
}

Add Attributes:

To add attributes to an HTML node, first find the node. Once you have the HtmlNode object, there are two ways to add an attribute: create an HtmlAttribute object and add it to the collection, or call the Add() method on the node's Attributes collection with a name and a value. The example below uses the second way.

//Add Attributes
//Add the target attribute to the forgot password anchor tag
HtmlNodeCollection anchorsCollection1 = htmDoc.DocumentNode.SelectNodes("//a");

//SelectNodes returns null when nothing matches, so guard before iterating
if (anchorsCollection1 != null)
{
    foreach (HtmlNode node in anchorsCollection1)
    {
        if (node.InnerText.ToLower().Contains("forgot"))
        {
            node.Attributes.Add("target", "_blank");
        }
    }
}
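
Both ways of adding an attribute can be sketched side by side. This is an illustration under assumptions rather than code from the sample form: the class and method names (AddAttributeSketch, AddBothWays) are hypothetical, the rel attribute is an arbitrary choice for the second call, and the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class AddAttributeSketch
{
    // Hypothetical helper: adds two attributes to the first anchor in the
    // HTML, one per approach, and returns that anchor (null if none exists).
    public static HtmlNode AddBothWays(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode link = doc.DocumentNode.SelectSingleNode("//a");
        if (link == null)
        {
            return null;
        }

        // Way 1: build an HtmlAttribute object, then add it to the collection.
        HtmlAttribute target = doc.CreateAttribute("target", "_blank");
        link.Attributes.Add(target);

        // Way 2: pass the name and value straight to Add().
        link.Attributes.Add("rel", "noopener");

        return link;
    }
}
```

The first form is handy when the attribute object is built elsewhere and passed around; the second is shorter when the name and value are already at hand.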

Traverse through HTML Document/Nodes:

As described above, the HtmlNode class provides the core features of the parser. Its ChildNodes property returns an HtmlNodeCollection, which you can iterate or query to visit each child node.

//Traverse HTML Nodes
//Go through dvLogin (div) tag and get the user name from loginName field
//Single() needs "using System.Linq;" and throws unless exactly one node matches;
//ChildNodes holds only the direct children of dvLogin
HtmlNode ndDivLogin = htmDoc.GetElementbyId("dvLogin");
HtmlNode ndUsername = ndDivLogin.ChildNodes.Single(node => node.Id == "loginName");
string userName = ndUsername.Attributes["value"].Value;
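
ChildNodes only covers direct children, so the lookup above fails if the field is wrapped in another tag. When the nesting depth is unknown, the Descendants() method walks the whole subtree instead. A minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class and method names (TraverseSketch, FindInputValue) are hypothetical.

```csharp
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class TraverseSketch
{
    // Hypothetical helper: finds an <input> with the given id anywhere
    // below the container and returns its value attribute.
    // Returns null if the container or the input is missing.
    public static string FindInputValue(string html, string containerId, string inputId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode container = doc.GetElementbyId(containerId);
        if (container == null)
        {
            return null;
        }

        // Descendants("input") yields every <input> in the subtree, not just direct children.
        HtmlNode input = container.Descendants("input")
                                  .FirstOrDefault(n => n.Id == inputId);
        if (input == null)
        {
            return null;
        }

        // GetAttributeValue falls back to the default when the attribute is absent.
        return input.GetAttributeValue("value", string.Empty);
    }
}
```

FirstOrDefault is used here instead of Single so that a missing field yields null rather than an exception; which behavior is preferable depends on how strict the caller wants to be.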

Save Modified HTML:

After modifying the HTML form inside the parser, you can save it wherever you want.

//Save modified HTML
//Verbatim string (@) so the backslash in the path is not treated as an escape
htmDoc.Save(@"E:\Test Sri1.html");

Beyond these, the parser supports many other operations, such as inserting tags, removing tags and applying styles.
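
As a rough illustration of those three operations, the sketch below appends a new tag, removes an existing one and sets an inline style. The element ids hint and banner, the style value, and the class and method names (ModifyDomSketch, Modify) are hypothetical; only dvLogin comes from the sample form. The code assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class ModifyDomSketch
{
    // Hypothetical helper: applies three edits to the login div and
    // returns the resulting HTML as a string.
    public static string Modify(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode login = doc.GetElementbyId("dvLogin");
        if (login == null)
        {
            return doc.DocumentNode.OuterHtml;
        }

        // Insert a tag: parse an HTML fragment into a node and append it.
        HtmlNode hint = HtmlNode.CreateNode("<p id=\"hint\">Passwords are case sensitive.</p>");
        login.AppendChild(hint);

        // Remove a tag: detach the node from its parent, if it exists.
        HtmlNode banner = doc.GetElementbyId("banner");
        if (banner != null)
        {
            banner.Remove();
        }

        // Apply a style: set (or overwrite) the inline style attribute.
        login.SetAttributeValue("style", "border: 1px solid #ccc;");

        return doc.DocumentNode.OuterHtml;
    }
}
```

Returning DocumentNode.OuterHtml keeps the result in memory; calling Save() on the document would write the same markup to disk.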

Other well-known parsers: MsHtml

Hope this helps anyone looking for a server-side HTML parser!
