Extracting text from HTML in Python: a very fast approach
When working on NLP problems, sometimes you need to obtain a large corpus of text. The internet is the biggest source of text, but unfortunately extracting text from arbitrary HTML pages is a hard and painful task.
Let's suppose we need to extract the full text from various web pages and want to strip all HTML tags. Typically, the default solution is the get_text method from the BeautifulSoup package, used here with lxml as the underlying parser. It's a well-tested solution, but it can be very slow when working with hundreds of thousands of HTML documents.
By replacing BeautifulSoup with selectolax, you can get a 5-30x speedup almost for free!
Here is a simple benchmark which reads the first 10,000 records of a Common Crawl WARC file and extracts text from every record that contains an HTML document:
# coding: utf-8
from time import time

import warc
from bs4 import BeautifulSoup
from selectolax.parser import HTMLParser


def get_text_bs(html):
    tree = BeautifulSoup(html, 'lxml')

    body = tree.body
    if body is None:
        return None

    for tag in body.select('script'):
        tag.decompose()
    for tag in body.select('style'):
        tag.decompose()

    text = body.get_text(separator='\n')
    return text


def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='\n')
    return text


def read_doc(record, parser=get_text_selectolax):
    url = record.url
    text = None

    if url:
        payload = record.payload.read()
        header, html = payload.split(b'\r\n\r\n', maxsplit=1)
        html = html.strip()

        if len(html) > 0:
            text = parser(html)

    return url, text


def process_warc(file_name, parser, limit=10000):
    warc_file = warc.open(file_name, 'rb')
    t0 = time()
    n_documents = 0
    for i, record in enumerate(warc_file):
        url, doc = read_doc(record, parser)

        if not doc or not url:
            continue

        n_documents += 1

        if i > limit:
            break

    warc_file.close()
    print('Parser: %s' % parser.__name__)
    print('Parsing took %s seconds and produced %s documents\n' % (time() - t0, n_documents))
>>> ! wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz
>>> file_name = "CC-MAIN-20180116070444-20180116090444-00000.warc.gz"
>>> process_warc(file_name, get_text_selectolax, 10000)
Parser: get_text_selectolax
Parsing took 16.170367002487183 seconds and produced 3317 documents

>>> process_warc(file_name, get_text_bs, 10000)
Parser: get_text_bs
Parsing took 432.6902508735657 seconds and produced 3283 documents
Clearly, it's not the best way to benchmark something, but it gives an idea of the difference: in this run selectolax was about 27 times faster than BeautifulSoup with the lxml parser (roughly 16 seconds versus 433 seconds).
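For a somewhat fairer comparison, one option is to load the documents into memory first, so that reading and decompressing the WARC file is not counted, and to keep the best of several passes. The sketch below is only an illustration under those assumptions: benchmark and speedup are hypothetical helper names, not part of the original script, and parser stands for any function that takes one HTML document.

```python
from time import perf_counter


def benchmark(parser, documents, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` passes.

    `documents` is a list of HTML documents already held in memory,
    so file reading and gzip decompression are not part of the timing.
    """
    best = float('inf')
    for _ in range(repeats):
        t0 = perf_counter()
        for html in documents:
            parser(html)
        best = min(best, perf_counter() - t0)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Ratio of the two timings printed by the run in this post.
ratio = speedup(432.6902508735657, 16.170367002487183)
print('speedup: %.1fx' % ratio)
```

With the functions from the benchmark script, this would be called as benchmark(get_text_selectolax, docs) and benchmark(get_text_bs, docs) on the same list of documents.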
I wrote selectolax half a year ago when I was looking for a fast HTML parser in Python. Basically, it is a Cython wrapper to the Modest engine. The engine itself is a very powerful and fast HTML5 parser written in pure C by lexborisov.
Selectolax is not limited to this one use case: it also supports CSS selectors and other HTML traversal functions. Any feedback and feature requests are appreciated, so you should definitely give it a try ;).
- Gaurav Sahu 3 years, 8 months ago (from disqus) #
Important to use the reanimated version of warc library here: https://github.com/erroneousboat/warc3
- Haider 1 year, 10 months ago #
Yep. pip install warc will not work. It is an old repo that only works with Python 2.
Use instead: pip install git+https://github.com/erroneousboat/warc3
Nice blog. Can anyone tell me how to extract URLs using selectolax?