🐥 Stripping HTML from strings in Python using only the standard library
yellowduck.be·8h
When processing text scraped from the web or user-generated content, you’ll often need to remove HTML tags while keeping the readable text. Many developers reach for external packages like BeautifulSoup, but you can achieve the same goal using only Python’s standard library.

The following snippet is a minimal and dependency-free solution for stripping HTML tags from a string. It’s based on a community contribution by Eloff on Stack Overflow.

from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self...
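
The snippet above is cut off at get_data, so here is a minimal, self-contained sketch of how a stripper along these lines could be completed. The class name TagStripper, the body of get_data, and the strip_tags helper are assumptions made for illustration, not necessarily what the original post or the Stack Overflow answer contains.

```python
from io import StringIO
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical completion: collects only the text nodes of a document."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.text.write(data)

    def get_data(self):
        # Assumed behavior: return everything collected so far.
        return self.text.getvalue()


def strip_tags(html):
    """Hypothetical helper: return html with its tags removed."""
    stripper = TagStripper()
    stripper.feed(html)
    # close() flushes any text the parser is still buffering.
    stripper.close()
    return stripper.get_data()


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b> &amp; friends</p>"))
```

Note that HTMLParser also passes the contents of script and style elements to handle_data, so a sketch like this keeps that text unless you track those tags and skip them yourself.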