NoLog.cz/headline

Monitor how article titles are changed over time on news websites. You can find the scraped data here: https://git.nolog.cz/NoLog.cz/headline-exports

Ondřej Nývlt 64e2f4d622 Implement expand diff togle with AlpineJS		2023-08-21 13:39:06 +02:00
.vscode	Migrate code style to Black	2023-08-21 10:38:13 +02:00
data	update feeds	2022-08-27 12:46:18 +02:00
misc	Migrate code style to Black	2023-08-21 10:38:13 +02:00
processor	Migrate code style to Black	2023-08-21 10:38:13 +02:00
view	Implement expand diff togle with AlpineJS	2023-08-21 13:39:06 +02:00
.editorconfig	Migrate code style to Black	2023-08-21 10:38:13 +02:00
.gitignore	Ignore sqlite temporary files	2023-08-19 11:09:43 +02:00
docker-compose.yml	Update docker compose	2023-08-19 11:07:37 +02:00
README.md	add expire value to redis keys (7 days)	2022-08-27 13:45:25 +02:00

README.md

Headline

Monitor how article titles are changed over time on news websites.

This tool is probably not production-ready because it was written in two afternoons by an amateur (I'm not a professional programmer). If you want to run it, at least put a reverse proxy between it and the public network, or run it locally.

I didn't do any research on the legality of analysing RSS feeds, and it's possible you could get into legal issues by presenting the outcomes publicly.

Architecture

The "processor" script fetches the RSS feeds configured in processor/config.yaml every 5 minutes (the interval is set in processor/crontab), stores the articles in Redis, and compares new and old articles to find changes in the title. When a change is found, it generates a visual diff and stores it with other information (detection time, article link, new/old title, etc.) in the permanent database (sqlite3 for now).
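The compare-and-store step described above could look roughly like the sketch below. It is only an illustration, not the actual processor code: a plain dict stands in for Redis, the diff is built with the standard library's difflib, and the `changes` table, `title_diff` and `process_entry` are hypothetical names chosen for this example.

```python
import difflib
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema; the real table layout may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS changes (
    detected_at TEXT, link TEXT, old_title TEXT, new_title TEXT, diff TEXT
)
"""


def title_diff(old: str, new: str) -> str:
    """Word-level diff: removed words as [-x-], added words as {+x+}."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    parts = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            parts.extend(old_words[i1:i2])
        if op in ("delete", "replace"):
            parts.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if op in ("insert", "replace"):
            parts.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(parts)


def process_entry(cache: dict, db: sqlite3.Connection, link: str, title: str) -> bool:
    """Compare a fetched title with the cached one and record any change.

    `cache` is a stand-in for Redis (article link -> last seen title).
    Returns True when a title change was stored.
    """
    old = cache.get(link)
    cache[link] = title  # always remember the latest title
    if old is None or old == title:
        return False  # first sighting or unchanged title
    db.execute(
        "INSERT INTO changes (detected_at, link, old_title, new_title, diff) "
        "VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), link, old, title, title_diff(old, title)),
    )
    db.commit()
    return True
```

Calling `process_entry` once per feed entry on every run is enough: the first sighting of a link only fills the cache, and later runs write a row only when the title differs.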

The "view" script reads data from the permanent database (sqlite3) and presents it to the user.
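As a rough sketch of the read side, the function below pulls the most recent changes out of sqlite3 so that a web page could render them. The `changes` table, its columns and the `latest_changes` name are assumptions made for this example, and the web framework is left out on purpose.

```python
import sqlite3

# Hypothetical schema, assumed for this example only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS changes (
    detected_at TEXT, link TEXT, old_title TEXT, new_title TEXT, diff TEXT
)
"""


def latest_changes(db: sqlite3.Connection, limit: int = 20) -> list:
    """Return the newest detected title changes as a list of dicts."""
    db.row_factory = sqlite3.Row  # allows access to columns by name
    rows = db.execute(
        "SELECT detected_at, link, old_title, new_title, diff "
        "FROM changes ORDER BY detected_at DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [dict(row) for row in rows]
```

Sorting by `detected_at` works here because the timestamps are assumed to be stored as ISO 8601 strings, which sort chronologically as text.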

Installation

Run docker-compose up -d and everything should start. You can edit ./processor/config.yaml to change the RSS sources. After the first start, you have to wait about 5 minutes for the "processor" to create the first empty database. The web server will throw an error until then.

To-do

Collect the creation time of the original/new article, write it to permanent storage (sqlite3 for now) and display it.
Write a better readme and a little more documentation.
Create a view with more info and stats (list of feeds, articles in Redis, etc.).
IDEA: Figure out how to monitor changes in article descriptions (maybe just compare hashes?) and how to present them. (Right now, the code can store descriptions in Redis, but nothing else.)
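The "compare hashes" idea from the last item could be sketched as follows. This is only one possible approach, not existing code: a dict again stands in for Redis, and `description_fingerprint` and `description_changed` are hypothetical names.

```python
import hashlib


def description_fingerprint(description: str) -> str:
    """Hash a description after collapsing whitespace, so pure
    formatting changes do not count as edits."""
    normalized = " ".join(description.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def description_changed(cache: dict, link: str, description: str) -> bool:
    """Return True if the description differs from the last one seen.

    `cache` maps article link -> last fingerprint (stand-in for Redis).
    """
    new_hash = description_fingerprint(description)
    old_hash = cache.get(link)
    cache[link] = new_hash
    return old_hash is not None and old_hash != new_hash
```

Storing only the hash keeps the cache small, but it can only say that a description changed, not how. Presenting a diff would still require keeping the previous text.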