NLP often uses machine learning algorithms to train models for various tasks. These algorithms learn patterns from large amounts of text data. For example, a Naive Bayes classifier can be used for text classification tasks by learning the probability of a document belonging to a certain class based on the occurrence of words.
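To make the Naive Bayes idea concrete, here is a minimal from-scratch sketch using only the standard library. The training sentences, the function names `train_naive_bayes` and `predict`, and the whitespace tokenization are assumptions made for illustration; a real project would more likely rely on a library implementation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def predict(text, class_counts, word_counts, vocabulary):
    """Return the class with the highest log-posterior for `text`."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        # Start from the log prior, log P(class)
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add log P(word | class) with add-one (Laplace) smoothing
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data
train_docs = [
    "the team won the match",
    "a great goal in the final game",
    "new phone with a faster chip",
    "the laptop has a new chip",
]
train_labels = ["sports", "sports", "technology", "technology"]

model = train_naive_bayes(train_docs, train_labels)
print(predict("the team scored a goal", *model))
print(predict("a faster laptop", *model))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in one class from driving that class's probability to zero.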
Text preprocessing is the first step in most NLP tasks. It involves cleaning and normalizing the text data.
import re
text = "Hello! This is a sample text with some punctuation 123."
# Remove punctuation
text = re.sub(r'[^\w\s]', '', text)
# Convert to lowercase
text = text.lower()
print(text)
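The same steps can be wrapped in a reusable function. The sketch below is one possible arrangement: the name `preprocess` and the `remove_digits` flag are assumptions for illustration, and the whitespace cleanup is an extra normalization step not shown in the snippet above.

```python
import re

def preprocess(text, remove_digits=False):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    # Drop every character that is neither a word character nor whitespace
    text = re.sub(r"[^\w\s]", "", text)
    if remove_digits:
        text = re.sub(r"\d+", "", text)
    # Collapse runs of whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

sample = "Hello!   This is a sample text, with some punctuation 123."
print(preprocess(sample))
print(preprocess(sample, remove_digits=True))
```

Note that `\w` also matches digits and underscores, which is why the number survives unless `remove_digits` is set.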
Tokenization is the process of splitting text into individual words or tokens.
import nltk
# Tokenizer data; newer NLTK releases (3.9+) may need 'punkt_tab' instead
nltk.download('punkt')
from nltk.tokenize import word_tokenize
text = "Natural Language Processing is fascinating."
tokens = word_tokenize(text)
print(tokens)
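When downloading NLTK data is not an option, a regular expression can serve as a rough stand-in. This is a simplified sketch, not a replacement for `word_tokenize`: the helper name `simple_tokenize` is hypothetical, and it treats contractions and abbreviations differently from NLTK.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Natural Language Processing is fascinating."))
# ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.']
```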
Stemming reduces a word to its stem by stripping suffixes according to a fixed set of rules.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
word = "running"
stemmed_word = stemmer.stem(word)
print(stemmed_word)
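Lemmatization maps a word to its dictionary form (lemma), using vocabulary and part-of-speech information.
nltk.download('wordnet')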
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
word = "better"
lemmatized_word = lemmatizer.lemmatize(word, pos='a') # 'a' for adjective
print(lemmatized_word)
Part-of-Speech (POS) tagging assigns a part of speech (such as noun, verb, or adjective) to each word in a sentence.
# Tagger model; newer NLTK releases (3.9+) may need 'averaged_perceptron_tagger_eng' instead
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
pos_tags = pos_tag(tokens)
print(pos_tags)
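The tagger returns a list of `(token, tag)` pairs, which are easy to post-process. The sketch below groups tokens by tag prefix. The `tagged` list is hand-written in the Penn Treebank style for illustration and is not captured tagger output; `group_by_tag_prefix` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written (token, tag) pairs in the Penn Treebank style (illustrative only)
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
    ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"), (".", "."),
]

def group_by_tag_prefix(pairs, prefixes=("NN", "VB", "JJ")):
    """Collect tokens whose tag starts with one of the given prefixes."""
    groups = defaultdict(list)
    for token, tag in pairs:
        for prefix in prefixes:
            if tag.startswith(prefix):
                groups[prefix].append(token)
    return dict(groups)

print(group_by_tag_prefix(tagged))
```

Matching on the prefix means that `NN` also catches plural and proper nouns (`NNS`, `NNP`), and `VB` catches every verb form.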
Named Entity Recognition (NER) identifies and classifies named entities in text, such as persons, organizations, and locations.
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)
for ent in doc.ents:
    # Print the entity text, its character offsets, and its label
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Sentiment analysis determines the sentiment (positive, negative, or neutral) of a text.
from textblob import TextBlob
text = "This movie is amazing!"
blob = TextBlob(text)
sentiment = blob.sentiment.polarity
# Polarity ranges from -1.0 (negative) to 1.0 (positive)
if sentiment > 0:
    print("Positive sentiment")
elif sentiment < 0:
    print("Negative sentiment")
else:
    print("Neutral sentiment")
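To see how a polarity score can come about, here is a toy lexicon-based scorer. It is a sketch of the general idea only, not TextBlob's implementation: the `LEXICON` entries, the averaging rule, and the `threshold` parameter are all assumptions chosen for illustration.

```python
import re

# Hypothetical, hand-picked lexicon: word -> polarity in [-1.0, 1.0]
LEXICON = {"amazing": 1.0, "good": 0.5, "boring": -0.5, "terrible": -1.0}

def polarity(text):
    """Average the polarity of the known words; 0.0 if none are known."""
    words = re.findall(r"\w+", text.lower())
    scores = [LEXICON[word] for word in words if word in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def label(score, threshold=0.1):
    """Map a polarity score to a label, with a neutral band around zero."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

for sentence in [
    "This movie is amazing!",
    "The plot was boring and the acting terrible.",
    "I watched it on Tuesday.",
]:
    print(sentence, "->", label(polarity(sentence)))
```

Using a small neutral band instead of comparing against exactly zero keeps barely positive or barely negative scores from being reported as a clear sentiment.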
Text classification is the task of assigning a text document to one or more predefined categories.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Sample data
documents = ["This is a sports article", "This is a technology news"]
labels = ["sports", "technology"]
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),  # turn text into TF-IDF feature vectors
    ('clf', MultinomialNB())       # Naive Bayes classifier on those vectors
])
pipeline.fit(documents, labels)
new_document = ["New smartphone released"]
predicted_label = pipeline.predict(new_document)
print(predicted_label)
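The TF-IDF step in the pipeline turns each document into a weighted word vector. The sketch below computes such weights by hand, using the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches scikit-learn's documented defaults. The function name `tfidf_vectors` and the whitespace tokenization are assumptions, so the numbers will not match `TfidfVectorizer` exactly (its default tokenizer drops single-character tokens such as "a").

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {word: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for words in tokenized for word in set(words))
    vectors = []
    for words in tokenized:
        tf = Counter(words)
        # Term frequency times smoothed inverse document frequency
        weights = {
            word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
            for word, count in tf.items()
        }
        norm = math.sqrt(sum(value * value for value in weights.values()))
        vectors.append({word: value / norm for word, value in weights.items()})
    return vectors

documents = ["This is a sports article", "This is a technology news"]
for vector in tfidf_vectors(documents):
    print({word: round(weight, 3) for word, weight in vector.items()})
```

Words shared by both documents ("this", "is", "a") receive a lower weight than words unique to one of them ("sports", "technology"), which is what lets the classifier focus on the distinguishing terms.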
Natural Language Processing with Python offers a wide range of possibilities for working with human language data. By understanding the fundamental concepts, choosing the right libraries, and following best practices, you can implement a variety of NLP tasks effectively. Whether it’s sentiment analysis, text classification, or named entity recognition, Python provides the tools and flexibility to build powerful NLP applications.