June 1, 2022
The bigger the page, the longer it takes to build this model.
One option to tune the performance a bit is to tell Beautiful Soup which part of the whole page you will need, and it will create the object model from that relevant part. To do this, you can use a SoupStrainer object.
A SoupStrainer tells Beautiful Soup which parts to extract, and the parse tree will consist only of these elements. This speeds up the process a bit if you can narrow down the required information to a smaller portion of the HTML.
from bs4 import BeautifulSoup, SoupStrainer

# Keep only the <ul> elements with the class 'productLister gridView'
strainer = SoupStrainer(name='ul', attrs={'class': 'productLister gridView'})
soup = BeautifulSoup(content, 'html.parser', parse_only=strainer)
The preceding code creates a simple SoupStrainer that limits the parse tree to unordered lists having the class attribute 'productLister gridView' - which reduces the document to the required parts - and it uses this strainer to create the soup.
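If you want to see how much the strainer helps on a given page, you can time both variants; a minimal sketch, assuming content already holds the downloaded HTML and strainer is the one defined above:

import timeit

# Parse the same document ten times with and without the strainer
full = timeit.timeit(lambda: BeautifulSoup(content, 'html.parser'), number=10)
strained = timeit.timeit(
    lambda: BeautifulSoup(content, 'html.parser', parse_only=strainer), number=10)
print(f'full parse: {full:.2f}s, strained parse: {strained:.2f}s')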
Because you already have a working scraper, you can replace its soup calls with strained versions to speed things up.
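One way to do this without editing every call site is to route soup creation through a single helper; a sketch, where get_soup is a hypothetical name, not part of the original scraper:

def get_soup(content, strainer=None):
    # Parse the whole page by default, or only the strained subset
    # when a SoupStrainer is passed in.
    return BeautifulSoup(content, 'html.parser', parse_only=strainer)

soup = get_soup(content, strainer)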
The following piece of information is hard to find on the Internet: you can match multiple attribute values with a single strainer. For example, if you extract the links to product pages, the unordered list holding them carries a different class depending on the level of the current department link. In this case, you have three different classes but want to create the soup if any of them is present. You can do something like this:
strainer = SoupStrainer(name='ul', attrs={'class': ['productLister gridView', 'categories shelf', 'categories aisles']})  # 'categories aisles' is an assumed third class; the source names only two
soup = BeautifulSoup(content, 'html.parser', parse_only=strainer)
Here, you have listed all three versions of the list that can occur, and the soup contains all the relevant information.
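To verify that the strained soup still holds everything you need, you can print the links it contains; a minimal sketch, assuming the links are plain anchor tags inside those lists:

# Every tag in the strained soup comes from one of the three lists,
# so the anchors found here are the product or department links.
for link in soup.find_all('a'):
    href = link.get('href')
    if href:
        print(href)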