Monitor how article titles are changed over time on news websites. You can find the scraped data here: https://git.nolog.cz/NoLog.cz/headline-exports

Headline

Monitor how article titles are changed over time on news websites.


This tool is probably not production-ready because it was written in two afternoons by an amateur (I'm not a professional programmer). If you want to run it, at least put a reverse proxy between it and the public network, or run it locally.

I didn't do any research on the legality of analysing RSS feeds, and it's possible you could get into legal issues by presenting the outcomes publicly.


Architecture

The "processor" script fetches the RSS feeds configured in processor/config.yaml every 5 minutes (the interval is set in processor/crontab), stores the articles in Redis, and compares new and old articles to find changes in the title. When a change is found, it generates a visual diff and stores it with other information (detection time, article link, new/old title, etc.) in a permanent database (sqlite3 for now).
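
The compare-and-store step can be sketched roughly as follows. This is not the actual processor code: a plain dict stands in for Redis, the diffs table layout is an assumption, and the names title_diff, init_db and process_entry are hypothetical. The diff is computed word by word with Python's standard difflib module.

```python
import difflib
import sqlite3
from datetime import datetime, timezone


def title_diff(old: str, new: str) -> str:
    """Return a word-level diff using [-removed-] and {+added+} markers."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    parts = []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "equal":
            parts.append(" ".join(old_words[a1:a2]))
            continue
        if a2 > a1:  # words present only in the old title
            parts.append("[-" + " ".join(old_words[a1:a2]) + "-]")
        if b2 > b1:  # words present only in the new title
            parts.append("{+" + " ".join(new_words[b1:b2]) + "+}")
    return " ".join(parts)


def init_db(db: sqlite3.Connection) -> None:
    """Create the (assumed) table that holds detected title changes."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS diffs ("
        "id INTEGER PRIMARY KEY, detected_at TEXT, link TEXT, "
        "old_title TEXT, new_title TEXT, diff TEXT)"
    )


def process_entry(cache: dict, db: sqlite3.Connection, link: str, title: str) -> bool:
    """Remember the latest title per link; record a diff when it changes."""
    old_title = cache.get(link)  # the dict stands in for Redis here
    cache[link] = title
    if old_title is None or old_title == title:
        return False  # first sighting or unchanged title
    db.execute(
        "INSERT INTO diffs (detected_at, link, old_title, new_title, diff) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            link,
            old_title,
            title,
            title_diff(old_title, title),
        ),
    )
    db.commit()
    return True
```

With these helpers, changing "Mayor resigns after scandal" to "Mayor denies scandal" would store the diff "Mayor [-resigns after-] {+denies+} scandal".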

The "view" script reads data from the permanent database (sqlite3) and presents it to the user.
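
As an illustration of the read side, here is a minimal sketch of how the newest changes could be pulled from sqlite3 for display. The table name diffs, its columns, and the function latest_changes are assumptions made for this example; the real view code and schema may look different.

```python
import sqlite3

# Assumed table layout, for illustration only.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS diffs ("
    "id INTEGER PRIMARY KEY, detected_at TEXT, link TEXT, "
    "old_title TEXT, new_title TEXT, diff TEXT)"
)


def latest_changes(db: sqlite3.Connection, limit: int = 20) -> list:
    """Return the most recently detected title changes as plain dicts."""
    db.row_factory = sqlite3.Row  # lets each row be converted to a dict
    rows = db.execute(
        "SELECT detected_at, link, old_title, new_title, diff "
        "FROM diffs ORDER BY detected_at DESC, id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [dict(row) for row in rows]
```

The returned list of dicts can be handed to whatever template or page renderer the view uses.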

Installation

Run docker-compose up -d and everything should start. You can change ./processor/config.yaml to edit the RSS sources. After the first start, you have to wait about 5 minutes for the "processor" to create the first empty database. The webserver will throw an error until then.

to-do

  • Collect the creation time of the orig/new article, write it to permanent storage (sqlite3 for now) and display it.
  • Write a better readme and a little more docs.
  • Create a view with some more info and stats (list of feeds, articles in Redis, etc.)
  • IDEA: Figure out how to monitor changes in article descriptions (maybe just compare hashes?) and how to present them. (Right now, the code can store descriptions in Redis, but does nothing else with them.)
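
The hash comparison idea from the last item could look something like the sketch below. Nothing like this exists in the code yet: the function names are hypothetical, a dict again stands in for Redis, and collapsing whitespace before hashing is just one possible choice to avoid flagging purely cosmetic changes.

```python
import hashlib


def description_fingerprint(description: str) -> str:
    """Hash a description after collapsing runs of whitespace."""
    normalized = " ".join(description.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def description_changed(cache: dict, link: str, description: str) -> bool:
    """Store the latest fingerprint per link and report whether it changed."""
    new_hash = description_fingerprint(description)
    old_hash = cache.get(link)  # the dict stands in for Redis here
    cache[link] = new_hash
    return old_hash is not None and old_hash != new_hash
```

Storing only the hash keeps the cache small, but it says nothing about what changed, so presenting the change would still require keeping the old description text somewhere.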