How to search HTML Page by specific text using html agility pack?

Introduction

The previous session, How to Get elements by class in Html Agility Pack C#, showed how to use Html Agility Pack to collect the elements that share a CSS class. Beyond that, you will often need to extract specific text from the HTML content of a web page. If you are new to Html Agility Pack, start with Install HTML agility pack and Load an HTML Document.

How to search HTML Page by specific text using html agility pack

Hard to do with Regex!

When it comes to text extraction, regular expressions are often the first method that comes to mind. Although their purpose is to extract content from text based on a pattern, they have several shortfalls for novices.
  • Arriving at the right Regex pattern is tricky. Unless you are an expert, it is difficult to tell whether a pattern is correct and efficient.
  • Regex is also a separate pattern language with its own syntax, so writing and debugging patterns may slow down the work.
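
To make the point about fragile patterns concrete, here is a minimal sketch. The pattern and the two HTML snippets are illustrative assumptions, not taken from any real page. A naive pattern finds a bare paragraph but stops matching as soon as the tag gains an attribute, while a parser-based lookup handles both.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class RegexVersusParser
{
    // Naive pattern (illustrative assumption): matches only a bare <p> tag.
    private static readonly Regex Paragraph = new Regex("<p>(.*?)</p>");

    // Returns the captured paragraph text, or null when the pattern does not match.
    public static string WithRegex(string html)
    {
        Match m = Paragraph.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Returns the text of the first <p> element, or null when there is none.
    public static string WithParser(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode p = doc.DocumentNode.SelectSingleNode("//p");
        return p?.InnerText;
    }

    public static void Main()
    {
        string plain = "<p>Working Sample</p>";
        string styled = "<p class=\"note\">Working Sample</p>";

        Console.WriteLine(WithRegex(plain) ?? "(no match)");
        Console.WriteLine(WithRegex(styled) ?? "(no match)");
        Console.WriteLine(WithParser(plain) ?? "(no match)");
        Console.WriteLine(WithParser(styled) ?? "(no match)");
    }
}
```

The pattern could of course be extended to tolerate attributes, but every such extension is one more thing to get right, which is the difficulty the list above describes.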

Alternate yet native and efficient method!

Html Agility Pack has most of the utilities needed to get the job done swiftly and without hassle. It lets you traverse the entire HTML content of a web page; see HTML Traversing using Agility Pack to get comfortable with the topic.
Reading the InnerText property of an HTML node is an easy way to extract specific text, so web scraping need not be an ordeal.
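
As a small illustration of InnerText, the sketch below parses an in-memory HTML string (an assumption made so the example runs without network access) and reads the text of the first paragraph. The class and method names are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class InnerTextSketch
{
    // Returns the text of the first <p> element, or null when there is none.
    public static string FirstParagraphText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode p = doc.DocumentNode.SelectSingleNode("//p");
        return p?.InnerText;
    }

    public static void Main()
    {
        string html = "<html><body><p>Working <b>Sample</b></p></body></html>";

        // InnerText concatenates the text of the node and all of its descendants
        Console.WriteLine(FirstParagraphText(html));
    }
}
```

Because InnerText gathers the text of nested elements as well, the markup inside the paragraph (the b tag here) does not get in the way.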

Step #1

Declare an HtmlWeb variable and an HtmlAgilityPack.HtmlDocument variable.

Step #2

Load the web page into the HtmlDocument variable.
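
Steps #1 and #2 might look roughly like the sketch below. The helper names LoadFromUrl and LoadFromString are hypothetical; the second one is an offline alternative (an addition, not part of the steps above) that parses markup already held in a string.

```csharp
using System;
using HtmlAgilityPack;

public static class LoadSketch
{
    // Steps #1 and #2 over the network: HtmlWeb downloads the page and parses it.
    public static HtmlDocument LoadFromUrl(string url)
    {
        var web = new HtmlWeb();
        return web.Load(url);
    }

    // Offline alternative: parse markup that is already in memory.
    public static HtmlDocument LoadFromString(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    public static void Main()
    {
        HtmlDocument doc = LoadFromString(
            "<html><head><title>Demo</title></head><body></body></html>");

        // DocumentNode is the root from which all queries start
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//title").InnerText);
    }
}
```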

Step #3

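Filter the HTML elements by class name into an IEnumerable<HtmlNode>, using the technique shown below (Where requires using System.Linq).
doc.DocumentNode.Descendants().Where(n => n.HasClass("mw-jump-link"))
To filter by text rather than by class, use an XPath query with contains(), as the complete sample further down does.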
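
The class filter above can be sketched as a complete program. The HTML string and the NodesWithClass helper are assumptions for illustration; the extra NodeType check simply skips text and comment nodes, which can never carry a class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ClassFilterSketch
{
    // Returns every element whose class attribute contains className.
    public static List<HtmlNode> NodesWithClass(string html, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element && n.HasClass(className))
            .ToList();
    }

    public static void Main()
    {
        string html =
            "<div><a class=\"mw-jump-link\" href=\"#nav\">Jump to navigation</a>" +
            "<a class=\"other\" href=\"#x\">Something else</a></div>";

        foreach (HtmlNode node in NodesWithClass(html, "mw-jump-link"))
        {
            Console.WriteLine(node.Name + ": " + node.InnerText);
        }
    }
}
```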

Step #4

Iterate through the nodes with a foreach loop and read InnerText on each item.
Once you have extracted the text, you may also want to change the HTML content; to learn how, see the session HTML Manipulation using html agility pack.
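
A minimal sketch of Step #4, again on an in-memory HTML string assumed for illustration. TextsForClass is a hypothetical helper; HtmlEntity.DeEntitize is used because InnerText leaves entities such as &amp;amp; undecoded.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class IterateSketch
{
    // Collects the trimmed, entity-decoded text of every element with the given class.
    public static List<string> TextsForClass(string html, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var texts = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants()
                     .Where(n => n.NodeType == HtmlNodeType.Element && n.HasClass(className)))
        {
            // DeEntitize turns entities back into characters; Trim drops stray whitespace
            texts.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return texts;
    }

    public static void Main()
    {
        string html =
            "<div><a class=\"mw-jump-link\"> Jump to navigation </a>" +
            "<a class=\"mw-jump-link\">Jump to search</a></div>";

        foreach (string text in TextsForClass(html, "mw-jump-link"))
        {
            Console.WriteLine(text);
        }
    }
}
```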

The following sample loads a page and prints the text of the first element whose text contains the word 'Working':

using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        // HtmlWeb downloads the page and parses it into an HtmlDocument
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://www.technologycrowds.com/2019/06/sha-512-hash-using-c-sharp.html");

        // find the first element with a text node containing 'Working' (case-sensitive)
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[text()[contains(., 'Working')]]");

        // SelectSingleNode returns null when nothing matches, so guard before reading InnerText
        Console.WriteLine(node != null ? node.InnerText : "No match found");
    }
}

Output 

Working Sample
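
If every match is needed rather than only the first, SelectNodes can be used with the same XPath. The sketch below is one possible variant under stated assumptions: FindTexts is a hypothetical helper, the HTML is an in-memory string, and the search term is pasted directly into the XPath literal, so it must not contain a single quote.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SearchAllSketch
{
    // Returns the text of every element that has a text node containing term.
    public static List<string> FindTexts(string html, string term)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: term contains no single quote, since it is embedded in the XPath literal
        string xpath = "//*[text()[contains(., '" + term + "')]]";
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);

        var results = new List<string>();
        if (nodes == null)
        {
            // SelectNodes returns null, not an empty collection, when nothing matches
            return results;
        }
        foreach (HtmlNode node in nodes)
        {
            results.Add(node.InnerText.Trim());
        }
        return results;
    }

    public static void Main()
    {
        string html =
            "<html><body><h2>Working Sample</h2><p>Not here</p><p>Working notes</p></body></html>";

        foreach (string text in FindTexts(html, "Working"))
        {
            Console.WriteLine(text);
        }
    }
}
```

The XPath contains() function is case-sensitive, so searching for 'working' would not match 'Working'.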

Conclusion

You can learn a lot more about Html Agility Pack from the wide variety of tutorials listed under HTML Agility Pack.

Anjan kant

An outstanding journey in Microsoft technologies (ASP.NET, C#, SQL programming, WPF, Silverlight, WCF, etc.) and client-side technologies (AngularJS, KnockoutJS, JavaScript, Ajax calls, JSON, hybrid apps, etc.). I love to devote my free time to writing, blogging, social networking and an adventurous life.
