🐥 Stripping HTML from strings in Python using only the standard library

247 words, 2 min read

When processing text scraped from the web or user-generated content, you’ll often need to remove HTML tags while keeping the readable text. Many developers reach for external packages like BeautifulSoup, but you can achieve the same goal using only Python’s standard library.

The following snippet is a minimal and dependency-free solution for stripping HTML tags from a string. It’s based on a community contribution by Eloff on Stack Overflow.

from io import StringIO
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False  # legacy attribute; modern Python versions ignore it
        self.convert_charrefs = True  # decode entities such as &amp; before handle_data()
        self.text = StringIO()

    def handle_data(self, d):
        # Called for each run of text found between tags
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

How it works

HTMLParser is part of Python’s standard library and can process HTML input incrementally.
The custom subclass MLStripper overrides the handle_data() method to capture only text content.
StringIO efficiently collects the output as the parser processes the HTML.
The helper function strip_tags() simply feeds the HTML input and returns the collected text.
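
Because the parser accepts input incrementally, the same idea should also work when the HTML arrives in pieces, for example while reading a large file or a network response. The sketch below illustrates this; the helper name strip_tags_chunked and the way the input is split are assumptions made for the example, not part of the original snippet.

```python
from io import StringIO
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Same idea as the snippet above: keep text, ignore tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags_chunked(chunks):
    """Hypothetical helper: feed an iterable of HTML fragments one by one."""
    s = MLStripper()
    for chunk in chunks:
        # A tag may be split across two chunks; the parser buffers the
        # incomplete part until the rest arrives.
        s.feed(chunk)
    s.close()  # flush anything still held in the parser's buffer
    return s.get_data()


if __name__ == "__main__":
    parts = ["<p>Hello <str", "ong>world</strong>", "!</p>"]
    print(strip_tags_chunked(parts))
```

Calling close() after the last chunk matters here: it tells the parser that no more input is coming, so any text it was still holding back gets processed.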

Example usage

html = "<p>Hello <strong>world</strong>! This is <a href='#'>a link</a>.</p>"
text = strip_tags(html)
print(text)

Output:

Hello world! This is a link.

Why this approach?

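No dependencies: uses only Python’s built-in modules.
Lightweight and fast: suitable for small to medium HTML snippets.
Safe and controlled: the parser only tokenizes the markup and never executes scripts or loads external libraries. Note that it is not a sanitizer, and the text inside script and style elements is still passed through as ordinary data.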
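
One caveat worth illustrating: HTMLParser hands the contents of script and style elements to handle_data() like any other text, so they end up in the output of the basic stripper. If that is unwanted, a variant along the following lines could skip them. This is a minimal sketch rather than part of the original snippet; the names TextOnlyStripper and strip_tags_and_scripts are made up for the example.

```python
from io import StringIO
from html.parser import HTMLParser


class TextOnlyStripper(HTMLParser):
    """Hypothetical variant that also drops <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element
        if self._skip_depth == 0:
            self.text.write(data)


def strip_tags_and_scripts(html):
    parser = TextOnlyStripper()
    parser.feed(html)
    parser.close()
    return parser.text.getvalue()


if __name__ == "__main__":
    sample = "<p>Hi</p><script>alert(1)</script><style>p{color:red}</style> there"
    print(strip_tags_and_scripts(sample))
```

Even with this change the result is plain text extraction, not sanitization; output that will be placed back into an HTML page should still be escaped, for instance with html.escape().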

For larger or malformed HTML documents, you might still prefer robust parsers like BeautifulSoup or lxml. But for most basic HTML cleanup tasks, this standard-library solution is elegant and effective.
