Parsing HTML using Python

Parsing HTML is one of the most common tasks performed today to collect information from websites and mine it for various purposes, such as tracking the price of a product over time or gathering the reviews of a book. Python libraries such as BeautifulSoup abstract away many of the painful points of parsing HTML, but it is worth knowing how parsing actually works beneath that layer of abstraction.

In this lesson, that is what we intend to do. We will find out how the values of different HTML tags can be extracted, and we will override the default behavior of the parser to add some logic of our own. We will do this using the HTMLParser class in Python's html.parser module. Let's see the code in action.

Looking at the HTMLParser class

To parse HTML text in Python, we can make use of the HTMLParser class in the html.parser module. Let's look at the class definition for the HTMLParser class:

class html.parser.HTMLParser(*, convert_charrefs=True)

The convert_charrefs parameter, if set to True (the default), makes the parser convert all character references to their Unicode equivalents automatically, except those inside script and style elements. Next, we will look at the handler methods of this class to understand what each one does.
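As a rough illustration of what convert_charrefs changes, the sketch below records which handlers fire for the same input under both settings. The EventRecorder class, the record helper, and the sample string are assumptions made up for this example.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):  # hypothetical helper, not part of html.parser
    """Record (handler, value) pairs instead of printing them."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Named reference such as &amp; (name is "amp")
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # Numeric reference such as &#62; (name is "62")
        self.events.append(("charref", name))


def record(html, convert_charrefs):
    """Parse html with the given setting and return the recorded events."""
    parser = EventRecorder(convert_charrefs=convert_charrefs)
    parser.feed(html)
    parser.close()
    return parser.events


sample = "<p>Tom &amp; Jerry &#62; 3</p>"
converted = record(sample, True)   # references are decoded inside the data
raw = record(sample, False)        # references go to their own handlers
print(converted)
print(raw)
```

With the default setting the references arrive already decoded in handle_data; with convert_charrefs=False the text is split around them and handle_entityref and handle_charref are called instead.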

Subclassing the HTMLParser class

In this section, we will subclass the HTMLParser class and take a look at some of the methods that are called when HTML data is passed to an instance of the class. Let's write a simple script which does all of this:

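from html.parser import HTMLParser

class LinuxHTMLParser(HTMLParser):
    # Called for every opening tag
    def handle_starttag(self, tag, attrs):
        print("Start tag encountered:", tag)

    # Called for every closing tag
    def handle_endtag(self, tag):
        print("End tag encountered :", tag)

    # Called for the text found between tags
    def handle_data(self, data):
        print("Data found :", data)

parser = LinuxHTMLParser()
# Sample markup; the <h1> wrapper is an assumed example
parser.feed('<h1>Python HTML parsing module</h1>')

Here is what we get back with this command:

Start tag encountered: h1
Data found : Python HTML parsing module
End tag encountered : h1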
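Printing is handy for exploring, but adding logic of our own usually means keeping some state on the parser instance. The following is a minimal sketch of that idea, assuming we want to collect the text of every heading; the HeadingCollector name and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):  # hypothetical name for this sketch
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []    # finished heading texts
        self._current = None  # chunks of the heading being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._current = []

    def handle_data(self, data):
        # Only keep text while we are inside a heading
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._current is not None:
            self.headings.append("".join(self._current).strip())
            self._current = None


collector = HeadingCollector()
collector.feed("<h1>Python HTML parsing module</h1><p>Body text</p><h2>Usage</h2>")
collector.close()
print(collector.headings)
```

The handlers themselves stay small; the result is read from the headings attribute once parsing is finished.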

HTMLParser functions

In this section, we will work with the various handler methods of the HTMLParser class and look at what each of them does:

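from html.parser import HTMLParser
from html.entities import name2codepoint

class LinuxHint_Parse(HTMLParser):
    # Opening tag, with its attributes as a list of (name, value) tuples
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print(" attr:", attr)

    # Closing tag
    def handle_endtag(self, tag):
        print("End tag :", tag)

    # Text between tags, and the raw content of script/style elements
    def handle_data(self, data):
        print("Data :", data)

    # Comment body, without the <!-- and --> delimiters
    def handle_comment(self, data):
        print("Comment :", data)

    # Named reference such as &gt;; only called when convert_charrefs is False
    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    # Numeric reference such as &#62; or &#x3E;; only called when convert_charrefs is False
    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent :", c)

    # Declaration such as <!DOCTYPE html>
    def handle_decl(self, data):
        print("Decl :", data)

parser = LinuxHint_Parse()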

With various calls, let us feed separate pieces of HTML to this instance and see what output each call generates. We will start with a simple DOCTYPE string:

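# HTML 4.01 Strict DOCTYPE; the public identifier is assumed from the DTD URL
parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
            '"http://www.w3.org/TR/html4/strict.dtd">')

Here is what we get back with this call:

Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"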

Let us now try an image tag and see what data it extracts:

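# The src file name is an assumed example; the alt text is the one from the call
parser.feed('<img src="python-logo.png" alt="The Python logo">')

Here is what we get back with this call:

Start tag: img
 attr: ('src', 'python-logo.png')
 attr: ('alt', 'The Python logo')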
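Since attrs arrives as a list of (name, value) tuples, it is often convenient to turn it into a dictionary. The sketch below assumes we want to collect the attributes of every image on a page; the ImageCollector name, the file names, and the markup are made up for illustration.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):  # hypothetical name for this sketch
    def __init__(self):
        super().__init__()
        self.images = []  # one dict of attributes per <img> tag

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs looks like [('src', '...'), ('alt', '...')]
            self.images.append(dict(attrs))


images = ImageCollector()
images.feed('<p><img src="python-logo.png" alt="The Python logo">'
            '<img src="banner.png"></p>')
images.close()

for image in images.images:
    # .get() avoids a KeyError when an attribute such as alt is missing
    print(image.get("src"), "|", image.get("alt"))
```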

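Next, let's see how the parser handles script and style tags:

# The script body is an assumed example; the style rule is the one from the call
parser.feed('<script type="text/javascript">'
            'alert("<strong>hello!</strong>");</script>')
parser.feed('<style type="text/css">#python { color: green }</style>')

Here is what we get back with this call:

Start tag: script
 attr: ('type', 'text/javascript')
Data : alert("<strong>hello!</strong>");
End tag : script
Start tag: style
 attr: ('type', 'text/css')
Data : #python { color: green }
End tag : style

Note that the content of script and style elements is passed to handle_data as-is, so the strong tag inside the script is not parsed as markup.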
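Because script and style content reaches handle_data like any other text, a parser that extracts readable text usually needs to skip it. Here is one possible sketch under that assumption; the VisibleText class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):  # hypothetical name for this sketch
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skipping = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skipping = True

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skipping = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skipping:
            self.parts.append(text)


page = ('<style type="text/css">#python { color: green }</style>'
        '<p>Hello</p>'
        '<script>alert("<b>hi</b>");</script>'
        '<p>world</p>')

extractor = VisibleText()
extractor.feed(page)
extractor.close()
visible = " ".join(extractor.parts)
print(visible)
```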

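Finally, we pass comments to the parser as well:

# Both comments are assumed examples
parser.feed('<!--a comment-->'
            '<!--[if IE 9]>IE-specific content<![endif]-->')

Here is what we get back with this call:

Comment : a comment
Comment : [if IE 9]>IE-specific content<![endif]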
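The calls above feed the same parser instance several times. feed() can also receive a single document in pieces, for example while it is being read from a file or a socket, and close() tells the parser that no more input is coming. A small sketch of that pattern, with a hypothetical TextJoiner class and made-up chunks:

```python
from html.parser import HTMLParser


class TextJoiner(HTMLParser):  # hypothetical name for this sketch
    def __init__(self):
        super().__init__()
        self.tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # One run of text may arrive in several calls, so collect the pieces
        self.chunks.append(data)


joiner = TextJoiner()
for piece in ("<p>Hel", "lo, wor", "ld</p>"):
    joiner.feed(piece)
joiner.close()  # process anything that is still buffered

text = "".join(joiner.chunks)
print(joiner.tags, text)
```

Joining the collected chunks at the end keeps the result independent of how the input happened to be split.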

Conclusion

In this lesson, we looked at how we can parse HTML using Python's own HTMLParser class without any other library. We can easily modify the code so that the HTML data comes from an HTTP client instead of a hard-coded string.
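As a sketch of that last point, the parser can be combined with urllib.request from the standard library. The TitleParser class and the parse_title and fetch_title helpers are hypothetical names, and the UTF-8 fallback is an assumption for pages that do not declare a charset.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):  # hypothetical name for this sketch
    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_title(html):
    """Return the text of the <title> element found in an HTML string."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def fetch_title(url, timeout=10):
    """Download a page and return its title (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return parse_title(html)


# Offline check of the parsing step; fetch_title(...) would need network access
print(parse_title("<html><head><title> Example page </title></head></html>"))
```

Keeping the download and the parsing in separate functions makes the parsing step easy to try on a plain string first.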
