As a best-selling author, I invite you to explore my books on Amazon. Don’t forget to follow me on Medium and show your support. Thank you! Your support means the world!
I want to talk about making computers understand human language. It sounds complex, but with Python, we can start with simple steps and build up to impressive applications. Over the years, I’ve used these methods to analyze customer feedback, automate support, and even generate content. Let me show you how you can do the same.
First, we need to prepare our text. Raw text is messy—full of URLs, odd punctuation, and variations. Think of this like washing vegetables before you cook. We clean it to get consistent results. In Python, libraries like spacy help with this intelligent cleaning, called preprocessing and tokenization.
import spacy
import re
nlp = spacy.load("en_core_web_sm")
def preprocess_text(text):
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Normalize extra spaces
    text = ' '.join(text.split())
    return text
def advanced_tokenization(text):
    doc = nlp(text)
    results = {'tokens': [], 'lemmas': [], 'pos_tags': []}
    for token in doc:
        if not token.is_punct and not token.is_space:
            results['tokens'].append(token.text)
            results['lemmas'].append(token.lemma_)
            results['pos_tags'].append(token.pos_)
    return results
# Let's see it work
sample = "Apple Inc. is planning a new store in San Francisco next month."
cleaned = preprocess_text(sample)
tokens = advanced_tokenization(cleaned)
print(f"Clean text: {cleaned}")
print(f"Base words (lemmas): {tokens['lemmas']}")
This code turns a sentence into clean, standard parts. The lemma is the base word—"planning" becomes "plan". This consistency is crucial for the next steps.
Once text is clean, we can find the important names and places in it. This is called Named Entity Recognition (NER). It’s like a highlighter for text, picking out companies, people, and locations automatically. I use this to quickly scan news articles or legal documents for key players.
from collections import Counter
def extract_entities(text):
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        entities.append({'text': ent.text, 'type': ent.label_})
    return entities
# Analyzing multiple documents
documents = [
    "Microsoft announced new features in Seattle.",
    "Amazon Web Services reported record earnings.",
    "Google's CEO Sundar Pichai spoke in California."
]
all_entities = []
for doc in documents:
    all_entities.extend(extract_entities(doc))
# Count how often each entity type appears
type_counter = Counter([e['type'] for e in all_entities])
print("Entity types found:", dict(type_counter))
Running this shows that "Microsoft" and "Google" are organizations (ORG), while "Seattle" and "California" are geographic locations (GPE). This automatic tagging saves hours of manual review.
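Building on those counts, you can also tally which specific organizations come up most often, which is how I scan a batch of articles for key players. This short sketch reuses the all_entities list and Counter import from above:
org_counts = Counter(e['text'] for e in all_entities if e['type'] == 'ORG')
print("Most mentioned organizations:", org_counts.most_common(3))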
Now, let’s gauge feeling or opinion in text, which is sentiment analysis. Early tools just classified text as positive or negative. Now, we can detect nuance, like frustration or mild satisfaction. I’ve built systems that track brand sentiment from social media using these techniques.
from textblob import TextBlob
def analyze_sentiment(text):
    analysis = TextBlob(text)
    # Polarity: -1 (negative) to +1 (positive)
    # Subjectivity: 0 (factual) to 1 (opinionated)
    polarity = analysis.sentiment.polarity
    subjectivity = analysis.sentiment.subjectivity
    # Simple categorization
    if polarity > 0.2:
        sentiment = "Positive"
    elif polarity < -0.2:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    return {
        'sentiment': sentiment,
        'polarity_score': round(polarity, 3),
        'subjectivity_score': round(subjectivity, 3)
    }
# Test it
reviews = [
    "This product is fantastic and works perfectly!",
    "It's okay, does the job but nothing special.",
    "Terrible quality, broke immediately. Very disappointed."
]
for rev in reviews:
    result = analyze_sentiment(rev)
    print(f"Review: {rev[:40]}...")
    print(f"  Sentiment: {result['sentiment']}")
    print(f"  Polarity Score: {result['polarity_score']}")
This gives a measurable score for emotion. For more advanced needs, pre-trained transformer models from libraries like transformers can detect sarcasm or mixed feelings, which I often integrate for customer service analysis.
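As a quick illustration of that upgrade path, here is a minimal sketch using the transformers sentiment pipeline; it pulls down the library's default English sentiment model on first use, so treat it as a starting point rather than my production setup:
from transformers import pipeline

sentiment_model = pipeline("sentiment-analysis")
results = sentiment_model([
    "This product is fantastic and works perfectly!",
    "Terrible quality, broke immediately. Very disappointed."
])
for res in results:
    print(res['label'], round(res['score'], 3))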
When you have thousands of documents, you need to find the common themes without reading each one. This is topic modeling. I think of it as a sorting machine that reads all your documents and groups them by hidden topics. LDA is a classic algorithm for this.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
# Example document collection
documents = [
    "Stock markets hit record highs as tech stocks surge.",
    "Climate conference agrees on new emission reduction goals.",
    "AI makes a breakthrough in medical image analysis.",
    "Investments in solar and wind power have tripled."
]
# Convert text to numbers
vectorizer = CountVectorizer(max_df=0.95, min_df=1, stop_words='english')  # min_df=1 so rare words survive in this tiny demo corpus
doc_term_matrix = vectorizer.fit_transform(documents)
# Create the topic model
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(doc_term_matrix)
# Show the main words for each topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words_idx = topic.argsort()[:-6:-1]  # Top 5 words
    top_words = [feature_names[i] for i in top_words_idx]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
This might output Topic 0: stocks, markets, tech, highs, surge and Topic 1: climate, emission, goals, reduction, conference. It instantly reveals the main themes: finance and environment.
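If you also want to know which topic each document leans toward, the fitted model can return a per-document topic distribution. A short sketch, reusing the doc_term_matrix from above:
doc_topics = lda.transform(doc_term_matrix)  # shape: (n_documents, n_topics)
for i, dist in enumerate(doc_topics):
    print(f"Document {i} -> Topic {dist.argmax()} (weight {dist.max():.2f})")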
For tasks like spam detection or categorizing support tickets, we use text classification. We teach a model by showing it many labeled examples. Today, fine-tuning pre-trained transformer models gives remarkable accuracy, even with modest amounts of your own data.
# Simulated training data
train_texts = ["Great product!", "Worst experience.", "It works fine."]
train_labels = [1, 0, 1]  # 1=Positive, 0=Negative
# In a real project, you would:
# 1. Load a pre-trained model (like DistilBERT)
# 2. Add a new classification layer
# 3. Train it on your labeled texts
# 4. Save it and use it to predict new texts
# This is a conceptual outline:
def train_classifier(texts, labels):
    print("Step 1: Convert texts to model inputs (tokenization).")
    print("Step 2: Adjust model weights based on our labels.")
    print("Step 3: Validate the model on held-out data.")
    print("Step 4: Use the trained model for predictions.")
    return "Trained Model"
model = train_classifier(train_texts, train_labels)
# A simple rule-based simulation for demonstration
def predict_simple(text):
    positive_words = ['great', 'good', 'excellent', 'perfect']
    if any(word in text.lower() for word in positive_words):
        return "Positive"
    else:
        return "Negative"
print(predict_simple("This is a great solution!"))
For production, you’d use the transformers library by Hugging Face, which handles the complex steps. I’ve used this to build classifiers that route customer emails to the correct department with over 95% accuracy.
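If you want something runnable before you collect any labeled data, a zero-shot classification pipeline gets you surprisingly far. This is a hedged sketch, not my production router; the model name and department labels are only illustrative:
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
email = "My invoice shows a charge I don't recognize."
departments = ["billing", "technical support", "sales"]
result = classifier(email, candidate_labels=departments)
print("Route to:", result["labels"][0], "| confidence:", round(result["scores"][0], 3))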
Creating conversational agents or generating text requires sequence-to-sequence models. These are the engines behind many chatbots. They read an input sequence (like a user’s question) and generate an output sequence (the response). I’ll show a simplified concept.
# Conceptual structure of a sequence model
class SimpleSeq2Seq:
    def __init__(self):
        # In reality, this holds a complex neural network
        self.knowledge_base = {
            "hello": "Hello! How can I assist you today?",
            "weather": "I can't check real-time weather, but I hope it's nice!",
            "name": "I'm a Python-based language model."
        }

    def respond(self, input_text):
        input_lower = input_text.lower()
        for key, response in self.knowledge_base.items():
            if key in input_lower:
                return response
        return "I'm not sure how to answer that. Could you rephrase?"
bot = SimpleSeq2Seq()
print(bot.respond("Hello there!"))
print(bot.respond("What's your name?"))
Real models, like GPT or DialoGPT, are trained on massive dialogues and generate far more coherent and varied responses. The key is they understand context; they remember what was said earlier in the conversation.
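To see that in practice, here is a minimal sketch of a single conversational turn with DialoGPT through the transformers library; it assumes torch is installed and just generates one reply to one message:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

# Encode the user's message, ending with the end-of-sequence token
input_ids = tokenizer.encode("Hello there!" + tokenizer.eos_token, return_tensors="pt")
# Generate a continuation, which serves as the bot's reply
reply_ids = model.generate(input_ids, max_length=100, pad_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens
print(tokenizer.decode(reply_ids[:, input_ids.shape[-1]:][0], skip_special_tokens=True))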
Long documents need summaries. There are two main ways: extractive and abstractive. Extractive summarization picks the most important existing sentences. It’s like highlighting. Abstractive summarization writes new sentences to convey the core meaning, like a human would.
# Simulating extractive summarization
def extractive_summary(text, num_sentences=2):
    # Split into sentences (simple split for demo)
    sentences = text.split('. ')
    # In reality, you'd score sentences by importance
    # using factors like word frequency and position
    selected = sentences[:num_sentences]  # Simple: pick first ones
    return '. '.join(selected) + '.'
article = """
Artificial intelligence is changing many industries. Machine learning powers recommendations and fraud detection. New models appear regularly. Ethical issues like bias are important. The future may bring even more integration with human processes.
"""
summary = extractive_summary(article)
print("Original length:", len(article.split()))
print("Summary length:", len(summary.split()))
print("Summary:", summary)
For abstractive summarization, I often use the pipeline feature from the transformers library with a model like facebook/bart-large-cnn. It can take a long article and produce a concise, well-written paragraph.
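A minimal sketch of that call, reusing the article text from above; the length limits are only illustrative, and the model downloads on first use:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
result = summarizer(article, max_length=60, min_length=15, do_sample=False)
print(result[0]['summary_text'])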
Finally, we have machine translation. Modern neural translation models understand context much better than old word-for-word systems. They can handle idioms and technical terms. Python makes it straightforward to access state-of-the-art models.
# This outlines the process using a pre-trained model
def translate_text(text, target_lang='es'):
    # In practice, use: from transformers import MarianMTModel, MarianTokenizer
    # model_name = f'Helsinki-NLP/opus-mt-en-{target_lang}'
    # This loads a model trained specifically for English to target language
    print(f"[Concept] Translating to {target_lang.upper()}: '{text}'")
    # The model encodes the English sentence, then decodes it into Spanish.
    return "[Translated Text Would Appear Here]"
# Example
print(translate_text("Hello, how are you?", 'es'))
The real magic is that these models, such as the MarianMT models, have been trained on millions of sentence pairs. They don’t just swap words; they rephrase ideas to sound natural in the target language.
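For completeness, here is what the real call looks like with a MarianMT model; this sketch assumes transformers and sentencepiece are installed, and the model name follows the Helsinki-NLP naming scheme from the comments above:
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["Hello, how are you?"], return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])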
Each of these eight techniques is a tool. You start with preprocessing to clean your data. Then, you might extract entities to find key information. Sentiment analysis tells you how people feel. Topic modeling helps you organize large collections of text. Classification automates sorting. Sequence models enable conversation and generation. Summarization condenses information. Translation breaks down language barriers.
I often combine them. For instance, I might translate foreign social media posts, analyze their sentiment, extract mentioned company names, and summarize the main topics—all in an automated pipeline. Python’s ecosystem, with libraries like spacy, nltk, transformers, and scikit-learn, makes this integration possible. The best approach is to start simple, get one technique working, and then gradually add more complexity as your needs grow.
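As a conceptual sketch of such a pipeline, here is how the helper functions from earlier in this article chain together on a single post; the translation and summarization steps would slot in the same way, the company name is made up, and the entity tags depend on the spaCy model you load:
def analyze_post(post):
    cleaned = preprocess_text(post)         # step 1: clean the raw text
    entities = extract_entities(cleaned)    # step 2: pull out named entities
    sentiment = analyze_sentiment(cleaned)  # step 3: gauge the feeling
    return {
        'companies': [e['text'] for e in entities if e['type'] == 'ORG'],
        'sentiment': sentiment['sentiment']
    }

print(analyze_post("Acme Corp's new support portal is fantastic!"))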
📘 Check out my latest ebook for free on my channel!
Be sure to like, share, comment, and subscribe to the channel!
101 Books
101 Books is an AI-driven publishing company co-founded by author Aarav Joshi. By leveraging advanced AI technology, we keep our publishing costs incredibly low—some books are priced as low as $4—making quality knowledge accessible to everyone.
Check out our book Golang Clean Code available on Amazon.
Stay tuned for updates and exciting news. When shopping for books, search for Aarav Joshi to find more of our titles. Use the provided link to enjoy special discounts!
Our Creations
Be sure to check out our creations:
Investor Central | Investor Central Spanish | Investor Central German | Smart Living | Epochs & Echoes | Puzzling Mysteries | Hindutva | Elite Dev | Java Elite Dev | Golang Elite Dev | Python Elite Dev | JS Elite Dev | JS Schools
We are on Medium
Tech Koala Insights | Epochs & Echoes World | Investor Central Medium | Puzzling Mysteries Medium | Science & Epochs Medium | Modern Hindutva