June 1, 2022
The bigger the page, the longer it takes to build this model.
One option to tune the performance a bit is to tell Beautiful Soup which part of the whole page you will need, and it will create the object model from that relevant part. To do this, you can use a SoupStrainer object.
A SoupStrainer tells Beautiful Soup which parts to extract, and the parse tree will consist only of these elements. This speeds up the process a bit if you can narrow down the required information to a smaller portion of the HTML.
from bs4 import BeautifulSoup, SoupStrainer

# Keep only the <ul> elements with the class 'productLister gridView'
strainer = SoupStrainer(name='ul', attrs={'class': 'productLister gridView'})
soup = BeautifulSoup(content, 'html.parser', parse_only=strainer)
The preceding code creates a simple SoupStrainer that limits the parse tree to unordered lists having the class attribute 'productLister gridView' - which reduces the document to the required parts - and it uses this strainer to create the soup.
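If you want to see how much the strainer helps on a given page, you can time both variants; a minimal sketch, assuming content already holds the downloaded HTML and strainer is the one defined above:

import timeit

# Parse the same document ten times with and without the strainer
full = timeit.timeit(lambda: BeautifulSoup(content, 'html.parser'), number=10)
strained = timeit.timeit(
    lambda: BeautifulSoup(content, 'html.parser', parse_only=strainer), number=10)
print(f'full parse: {full:.2f}s, strained parse: {strained:.2f}s')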
Because you already have a working scraper, you can replace its soup calls with strained versions to speed things up.
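One way to do this without editing every call site is to route soup creation through a single helper; a sketch, where get_soup is a hypothetical name, not part of the original scraper:

def get_soup(content, strainer=None):
    # Parse the whole page by default, or only the strained subset
    # when a SoupStrainer is passed in.
    return BeautifulSoup(content, 'html.parser', parse_only=strainer)

soup = get_soup(content, strainer)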
The following piece of information is hard to find on the Internet: you can match multiple attribute values with a single strainer. For example, if you extract the links to product pages, the unordered list holding them carries a different class depending on the level of the current department link. In this case, you have three different classes but want to create the soup if any of them is present. You can do something like this:
strainer = SoupStrainer(name='ul', attrs={'class': ['productLister gridView', 'categories shelf', 'categories aisles']})  # 'categories aisles' is an assumed third class; the source names only two
soup = BeautifulSoup(content, 'html.parser', parse_only=strainer)
Here, you have listed all three versions of the list that can occur, and the soup contains all the relevant information.
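To verify that the strained soup still holds everything you need, you can print the links it contains; a minimal sketch, assuming the links are plain anchor tags inside those lists:

# Every tag in the strained soup comes from one of the three lists,
# so the anchors found here are the product or department links.
for link in soup.find_all('a'):
    href = link.get('href')
    if href:
        print(href)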