In this tutorial I will show you how to remove all special characters and punctuation, except spaces, from a string in Python.
The following program extracts data from a URL using the BeautifulSoup package. If the title tag contains special characters, I want to remove them.
import string
import urllib.request

from bs4 import BeautifulSoup
from docx import Document


def remove_symbols(title):
    # Delete every ASCII punctuation character listed in string.punctuation.
    trans = str.maketrans("", "", string.punctuation)
    cleaned_title = title.translate(trans)
    return cleaned_title


# Fetch the page with a custom User-Agent header.
hdr = {"User-Agent": "My Agent"}
request = urllib.request.Request(
    url="https://tensix.com/oracle-bi-publisher-installation-error-inst-05058-a-lookup-of-the-address-for-this-machine/",
    headers=hdr,
)
f = urllib.request.urlopen(request)
myfile = f.read()

# Parse the HTML and read the text of the title tag.
soup = BeautifulSoup(myfile, "html.parser")
title = soup.title.text.strip()

# Add the original title to a Word document as a level-1 heading.
doc = Document()
doc.add_heading(title, 1)

cleaned_title = remove_symbols(title)
print(cleaned_title)
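To see what the translate-based approach does without fetching a page, here is a small offline check. The sample title is made up for illustration (it is not the real page title) and mixes ASCII punctuation, digits, a typographic apostrophe, and an en dash.

```python
import string


def remove_symbols(title):
    # Delete every character listed in string.punctuation (ASCII only).
    trans = str.maketrans("", "", string.punctuation)
    return title.translate(trans)


# Hypothetical sample title, assumed for this demonstration only.
sample = "Error INST-05058: it\u2019s a lookup \u2013 v2.0!"
cleaned = remove_symbols(sample)

# ASCII punctuation is deleted; digits and non-ASCII symbols stay.
print(cleaned)
```

Note that deleted characters are not replaced by a space, so words separated only by punctuation are joined together.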
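However, the translate approach does not remove everything I want: string.punctuation contains only ASCII punctuation, so digits and non-ASCII symbols (for example typographic dashes and quotes) are left in place. I am going to use a regex to remove the unwanted characters instead.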
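import re

def remove_symbols(title):
    # Replace each run of characters other than ASCII letters and digits with one space.
    # Digits are kept; drop 0-9 from the character class to strip them too.
    return re.sub(r"[^a-zA-Z0-9]+", " ", title).strip()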
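The same pattern can be tried on a made-up title without any network access. This is a minimal sketch that applies the regex to the whole string at once; the sample text and the helper remove_symbols_and_digits are assumptions added for illustration, not part of the original program.

```python
import re


def remove_symbols(title):
    # Collapse each run of characters other than ASCII letters/digits into one space.
    return re.sub(r"[^a-zA-Z0-9]+", " ", title).strip()


def remove_symbols_and_digits(title):
    # Hypothetical variant: 0-9 is dropped from the class, so digits go too.
    return re.sub(r"[^a-zA-Z]+", " ", title).strip()


# Hypothetical sample title, assumed for this demonstration only.
sample = "Error INST-05058: it\u2019s a lookup \u2013 v2.0!"

keep_digits = remove_symbols(sample)
no_digits = remove_symbols_and_digits(sample)

print(keep_digits)
print(no_digits)
```

Because every removed run becomes a single space, words stay separated. The character class is ASCII-only, so accented letters are also treated as symbols and removed.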