Extracting text from HTML in Python: a very fast approach
When working on NLP problems, sometimes you need to obtain a large corpus of text. The internet is the biggest source of text, but unfortunately extracting text from arbitrary HTML pages is a hard and painful task.
Let's suppose we need to extract the full text from various web pages, stripping all HTML tags. Typically, the default solution is to use the get_text method from the BeautifulSoup package, which internally uses lxml. It's a well-tested solution, but it can be very slow when working with hundreds of thousands of HTML documents.
By replacing BeautifulSoup with selectolax, you can get a 5-30x speedup almost for free!
Here is a simple benchmark which parses 10,000 HTML pages from Common Crawl:
# coding: utf-8
from time import time

import warc
from bs4 import BeautifulSoup
from selectolax.parser import HTMLParser


def get_text_bs(html):
    # Extract text with BeautifulSoup (lxml backend), dropping script/style tags
    tree = BeautifulSoup(html, 'lxml')

    body = tree.body
    if body is None:
        return None

    for tag in body.select('script'):
        tag.decompose()
    for tag in body.select('style'):
        tag.decompose()

    text = body.get_text(separator='\n')
    return text


def get_text_selectolax(html):
    # Extract text with selectolax, dropping script/style tags
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='\n')
    return text


def read_doc(record, parser=get_text_selectolax):
    # Split the HTTP response payload into headers and HTML, then extract text
    url = record.url
    text = None

    if url:
        payload = record.payload.read()
        header, html = payload.split(b'\r\n\r\n', maxsplit=1)
        html = html.strip()
        if len(html) > 0:
            text = parser(html)

    return url, text


def process_warc(file_name, parser, limit=10000):
    # Iterate over WARC records and time how long text extraction takes
    warc_file = warc.open(file_name, 'rb')
    t0 = time()
    n_documents = 0
    for i, record in enumerate(warc_file):
        url, doc = read_doc(record, parser)
        if not doc or not url:
            continue
        n_documents += 1
        if i > limit:
            break
    warc_file.close()
    print('Parser: %s' % parser.__name__)
    print('Parsing took %s seconds and produced %s documents\n' % (time() - t0, n_documents))
>>> ! wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz
>>> file_name = "CC-MAIN-20180116070444-20180116090444-00000.warc.gz"
>>> process_warc(file_name, get_text_selectolax, 10000)
Parser: get_text_selectolax
Parsing took 16.170367002487183 seconds and produced 3317 documents
>>> process_warc(file_name, get_text_bs, 10000)
Parser: get_text_bs
Parsing took 432.6902508735657 seconds and produced 3283 documents
Clearly, this is not the most rigorous way to benchmark, but it gives an idea that selectolax can sometimes be around 30 times faster than lxml.
I wrote selectolax half a year ago when I was looking for a fast HTML parser in Python. Basically, it is a Cython wrapper around the Modest engine. The engine itself is a very powerful and fast HTML5 parser written in pure C by lexborisov.
Selectolax is not limited to this single use case: it supports CSS selectors as well as other HTML traversal functions. Any feedback and feature requests are appreciated, so you should definitely give it a try ;).
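For instance, here is a minimal sketch of pulling specific elements out of a page with CSS selectors (the HTML snippet and class name are made up purely for illustration):

from selectolax.parser import HTMLParser

html = """
<html>
  <body>
    <h1>Hello</h1>
    <p class="intro">First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

tree = HTMLParser(html)

# Select all <p> nodes and read their text content
for node in tree.css('p'):
    print(node.text())

# Select a single node by class; css_first returns None if nothing matches
intro = tree.css_first('p.intro')
if intro is not None:
    print(intro.text())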
Comments
- Gaurav Sahu 3 years, 8 months ago (from disqus) #
It's important to use the revived version of the warc library here: https://github.com/erroneousboat/warc3
- Haider 1 year, 10 months ago #
Yep, pip install warc will not work: it is an old package that only works with Python 2.
Use instead: pip install git+https://github.com/erroneousboat/warc3
- Om #
Nice blog. Can anyone tell me how to extract URLs using selectolax?
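For anyone with the same question, here is a rough sketch (not part of the original post) of one way to collect the href attributes of <a> tags with selectolax:

from selectolax.parser import HTMLParser

def extract_links(html):
    # Collect the href attribute of every <a> tag, skipping anchors without one
    tree = HTMLParser(html)
    return [node.attributes.get('href')
            for node in tree.css('a')
            if node.attributes.get('href')]

links = extract_links('<a href="https://example.com">example</a> <a>no href</a>')
print(links)  # ['https://example.com']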