Have you ever needed to scan your website for specific words, for example to find restricted words? This is a perfect job for Python and Scrapy. With only a few lines of code you can automate the task.
First, install Scrapy and its dependencies. I used pip to do the job.
$ pip install Scrapy
Create a Scrapy Project
Before you can start to spider your website, you have to create a Scrapy project. Change to the directory where you want to store the project and run the following command:
$ scrapy startproject wordlist_scrapper
The Spider Script
Open the project with your preferred editor and create a new file "spider.py" in the "spiders" folder. Put in the following code:
import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item


def find_all_substrings(string, sub):
    """Return the start positions of every occurrence of sub in string."""
    return [match.start() for match in re.finditer(re.escape(sub), string)]


class WebsiteSpider(CrawlSpider):
    name = "webcrawler"
    allowed_domains = ["www.phooky.com"]
    start_urls = ["http://www.phooky.com"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    crawl_count = 0
    words_found = 0

    def check_buzzwords(self, response):
        self.__class__.crawl_count += 1

        wordlist = [
            "Lorem",
            "dolores",
            "feugiat",
        ]

        url = response.url
        data = response.body.decode("utf-8", errors="ignore")

        for word in wordlist:
            # Print one line per occurrence of the word on this page.
            for pos in find_all_substrings(data, word):
                self.__class__.words_found += 1
                print(word + ";" + url + ";")
        return Item()

    def _requests_to_follow(self, response):
        # Only follow links from text responses; binary responses
        # (images, PDFs, ...) have no encoding attribute.
        if getattr(response, "encoding", None) is not None:
            return CrawlSpider._requests_to_follow(self, response)
        return []
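The helper find_all_substrings returns the start index of every occurrence of a word, so a page that mentions a word several times produces several output lines. A standalone sketch of how it behaves:

```python
import re

def find_all_substrings(string, sub):
    # re.escape makes the search literal, so words containing
    # regex metacharacters are matched as plain text.
    return [match.start() for match in re.finditer(re.escape(sub), string)]

print(find_all_substrings("Lorem ipsum Lorem", "Lorem"))  # → [0, 12]
```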
Replace "allowed_domains", "start_urls" and "wordlist" with your own data.
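If your wordlist grows, you may prefer to keep it in a plain text file, one word per line, instead of hardcoding it in the spider. A minimal sketch; the helper name load_wordlist and the filename are my own choices, not part of the project above:

```python
def load_wordlist(path):
    """Read a wordlist file with one word per line, skipping blanks."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Inside check_buzzwords you could then use:
# wordlist = load_wordlist("wordlist.txt")
```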
Crawl your Website
Now you can test the spider and write its output to a CSV file:
$ scrapy crawl webcrawler > wordlist.csv
The result is a CSV file with two columns: the word and the URL of the page on your website where it appears.
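Since the spider prints semicolon-separated lines, you can load the result back into Python with the standard csv module for further processing. A sketch, assuming the word;url; layout produced above (the function name read_matches is my own):

```python
import csv

def read_matches(path):
    """Read the word;url pairs written by the crawler."""
    matches = []
    with open(path, newline="", encoding="utf-8") as f:
        # The trailing semicolon produces an empty third column,
        # so we only keep the first two fields of each row.
        for row in csv.reader(f, delimiter=";"):
            if len(row) >= 2:
                matches.append((row[0], row[1]))
    return matches
```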
Of course this is a very simple script, but I'm sure you can imagine how powerful Scrapy is. You can find more information in the official Scrapy documentation:
Cover image by Lorenzo Cafaro | close-up-code-coding-computer-23989 | CC0 License