I’ve always been happy with the results of BeautifulSoup 3.0.x (has anyone tried 3.2.x yet?). It’s one of those chunks of code that yields significant productivity increases.
However, today I’m using BeautifulSoup in a place where performance is an issue (Google App Engine). Here are two lines of code which take ~1.1 seconds to run given ~90 kilobytes of HTML on my (slow) workstation:
soup = BeautifulSoup(html)
price = soup.find('td', {'class': 'price'}).text
I tried SoupStrainer two or three different ways, which gave a moderate performance improvement. Adding SoupStrainer reduced run time to ~0.65 seconds (roughly a 40% improvement):
soup = BeautifulSoup(html, SoupStrainer('table', {'class': 'product_prices'}))
price = soup.find('td', {'class': 'price'}).text
So, before sitting down for a couple of hours to write some good ol’ regular expressions, I tried one more idea, and it’s working great! The idea depends on BeautifulSoup’s very high tolerance for poorly-formatted HTML. By using a simpler method (string.find() for example) to find something close to the information you’re looking for, you can just get a slice of the HTML string that should include the information of interest and send that to BeautifulSoup.
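price_index = html.find('class="price"')  # returns -1 if the text is not found
start = max(price_index - 100, 0)  # a negative start would count from the end of the string
soup = BeautifulSoup(html[start:price_index+100])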
price = soup.find('td', {'class': 'price'}).text
I realize that html.find() wouldn’t work for everyone; using a regular expression would be more reliable. This takes ~0.010 seconds:
match = re.search(r'class\s*=\s*"price"', html, re.I)
price_index = match.start()  # re.search returns a match object, not an index
soup = BeautifulSoup(html[max(price_index-100, 0):price_index+100])
price = soup.find('td', {'class': 'price'}).text
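For anyone who wants to reuse the slicing idea, here is a minimal sketch of it wrapped in a helper. The name slice_around, its parameters, and the sample HTML are my own inventions for illustration, not part of BeautifulSoup. It only does the string work: it returns None when the pattern is missing and clamps the start of the window so it never goes negative.

```python
import re

# Hypothetical helper, not part of BeautifulSoup: it only cuts a window out of
# the HTML string so that a much smaller chunk is handed to the parser.
PRICE_ATTR = re.compile(r'class\s*=\s*"price"', re.I)

def slice_around(html, pattern=PRICE_ATTR, before=100, after=100):
    """Return the HTML around the first match of pattern, or None if no match."""
    match = pattern.search(html)
    if match is None:
        return None
    # Clamp at 0: a negative start index would count from the end of the string.
    start = max(match.start() - before, 0)
    end = match.end() + after  # slicing past the end of a string is safe
    return html[start:end]

# Tiny made-up document, with a window just wide enough for the whole cell.
sample = '<table><tr><td class="price">$9.99</td></tr></table>'
snippet = slice_around(sample, before=4, after=11)
print(snippet)
```

The returned string can then be passed to BeautifulSoup in the same way as in the snippets above. The window sizes are a guess in every case; they need to be wide enough to contain the whole element you are after.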