Find specific words on web pages with Scrapy

Have you ever needed to scan your website for specific words, for example to find restricted terms? This is a perfect job for Python and Scrapy: with only a few lines of code you can automate the task.

Installation

First, install Scrapy and its required dependencies. I used pip to do the job.

$ pip install Scrapy
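
To check that the installation worked, you can print the installed version; Scrapy ships with a version subcommand:

$ scrapy version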

Create a Scrapy Project

Before you can start spidering your website you have to create a Scrapy project. Change to the directory where you want to store the project and run the following command:

$ scrapy startproject wordlist_scrapper
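
The startproject command generates the usual Scrapy skeleton, roughly the layout below; the spider we write in the next step goes into the "spiders" folder:

wordlist_scrapper/
    scrapy.cfg            # deploy configuration file
    wordlist_scrapper/    # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # our spider script goes here
            __init__.py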

The Spider Script

Open the project in your preferred editor and create a new file "spider.py" in the "spiders" folder. Put in the following code:

import re

from scrapy.item import Item
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

def find_all_substrings(string, sub):
    # Return the start index of every literal occurrence of sub in string.
    return [match.start() for match in re.finditer(re.escape(sub), string)]

class WebsiteSpider(CrawlSpider):

    name = "webcrawler"
    allowed_domains = ["www.phooky.com"]
    start_urls = ["http://www.phooky.com"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    # Class-level counters, shared across all callback invocations.
    crawl_count = 0
    words_found = 0

    def check_buzzwords(self, response):

        self.__class__.crawl_count += 1

        wordlist = [
            "Lorem",
            "dolores",
            "feugiat",
        ]

        url = response.url
        # response.text decodes the body with the encoding Scrapy
        # detected, so non-UTF-8 pages work as well.
        data = response.text

        for word in wordlist:
            substrings = find_all_substrings(data, word)
            for pos in substrings:
                self.__class__.words_found += 1
                print(word + ";" + url + ";")
        # The matches are printed above; an empty Item is returned so
        # Scrapy treats the callback result as valid output.
        return Item()

    def _requests_to_follow(self, response):
        # Only text responses carry an encoding attribute; skip binary
        # responses (images, PDFs, ...) instead of crashing on them.
        if getattr(response, "encoding", None) is not None:
            return CrawlSpider._requests_to_follow(self, response)
        return []

Replace "allowed_domains", "start_urls" and "wordlist" with your own data. 

Crawl your Website

Now you can test the spider script and store the output in a CSV file:

$ scrapy crawl webcrawler > wordlist.csv
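
Scrapy writes its log to stderr, so the redirect above only captures the printed matches. If you want to suppress the log entirely, scrapy crawl also accepts the --nolog flag:

$ scrapy crawl webcrawler --nolog > wordlist.csv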

The result is a CSV file with two columns showing the word and the URL on your website where it appears.
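
With the example wordlist above, the file looks roughly like this (the URLs will of course be pages of your own site):

Lorem;http://www.phooky.com/;
dolores;http://www.phooky.com/about;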

Of course this is a very simple script, but I'm sure you can imagine how powerful Scrapy is. You can find more information about Scrapy in the official documentation:

Scrapy 1.5 documentation: https://docs.scrapy.org/