A Simple File Integrity Monitor in Python with hashlib

A simple file integrity monitor in Python answers a narrow but valuable question: which files in a directory have changed, been added, or been deleted since I last looked? This Python Quick Guide builds a baseline of SHA-256 hashes, stores it as JSON, and diffs the current state against it — a dependency-free file integrity monitoring approach for watching the handful of directories that should rarely change, like a web root or a config folder. 

Simple file integrity monitor in Python building a SHA-256 baseline and diffing directory changes 

Key Takeaways

  • A simple file integrity monitor records a baseline of file hashes and reports added, modified, and deleted files on each subsequent run.
  • SHA-256 over the file contents detects modification reliably, where a timestamp check alone can be trivially reset by an attacker.
  • The baseline is just a dictionary of path-to-hash, which serialises cleanly to JSON with no database or external dependency.
  • File integrity monitoring is best aimed at directories that should be stable — a web root, a startup folder, a config directory — not at paths that change constantly.
  • This is a detective control: it tells you something changed after the fact, so it pairs with, rather than replaces, controls that prevent the change.

Environment

  • Python 3.9+, standard library only — hashlib, os, and json.
  • A directory worth watching. For testing I pointed it at a copy of a static site's web root.
  • Somewhere to store the baseline JSON outside the monitored directory, so a change to the tree does not rewrite its own baseline.
  • Tested on Windows 11; the code runs unchanged on Linux or macOS.

The Problem

Some directories are supposed to be boring. A web root, the contents of a startup folder, a set of configuration files — these change when I deploy or reconfigure, and at no other time. So an unexpected change to one of them is a signal worth having. The trouble is that Windows does not shout when a file in a watched folder quietly changes, and checking by hand is not a plan.

The naive version watches modification timestamps, but a timestamp is attacker-controllable — timestomping a file back to its original mtime is a well-known trick, and it is exactly why the registry and on-disk forensics work harder than just trusting metadata. Hashing the contents sidesteps that: change a single byte and the hash changes, regardless of what the timestamp claims. A file integrity monitor is just a hash baseline plus a diff, and the standard library has everything needed to build one.

The Solution — Build a File Integrity Monitor in Python

Step 1 — Build a baseline of content hashes

The baseline is a dictionary mapping each file's path to its SHA-256 digest. I reuse the chunked hashing approach so the monitor handles large files without loading them whole, then walk the tree and record every file:

import hashlib
import os
import json

def sha256_file(path, chunk_size=65536):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_baseline(root):
    baseline = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            try:
                baseline[full] = sha256_file(full)
            except (PermissionError, FileNotFoundError, OSError):
                continue   # skip what we cannot read
    return baseline

Hashing contents rather than reading timestamps is the whole point — it is the same reasoning behind the SHA-256 scanner in the companion IOC-matching post, applied to change detection instead of known-bad matching.

Step 2 — Save and load the baseline as JSON

A dictionary of strings serialises to JSON with no effort, which means the baseline needs no database. The one rule: store it outside the directory you are watching, or the act of saving it changes the tree you are monitoring:

def save_baseline(baseline, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(baseline, f, indent=2)

def load_baseline(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

Keeping the baseline elsewhere also means an attacker who modifies a watched file does not automatically find and rewrite the record of what it used to be — though a baseline stored on the same host is never fully tamper-proof, which is a limitation worth being honest about.

Step 3 — Diff the current state against the baseline

Detection is a set comparison plus a hash comparison. Files in the baseline but missing now were deleted; files present now but not in the baseline were added; files in both whose hash differs were modified. Python's set operations make the first two trivial:

def compare(baseline, root):
    current = build_baseline(root)
    base_paths = set(baseline)
    curr_paths = set(current)

    added = curr_paths - base_paths
    deleted = base_paths - curr_paths
    modified = {p for p in base_paths & curr_paths if baseline[p] != current[p]}

    return {"added": sorted(added),
            "deleted": sorted(deleted),
            "modified": sorted(modified)}

The base_paths & curr_paths intersection is the set of files that existed both times, and comparing hashes only over that intersection is what isolates a genuine content change from an add or a delete. Three categories, one pass, no false precision.

Step 4 — Run it and report only when something changed

A monitor that prints "nothing changed" every run trains you to ignore it. I report only when there is something to report, which makes the output something you actually read — the same restraint that keeps the event-log retention I covered in managing event log sizes from drowning the signal:

def run(root, baseline_path):
    if not os.path.exists(baseline_path):
        save_baseline(build_baseline(root), baseline_path)
        print(f"Baseline created for {root}")
        return

    changes = compare(load_baseline(baseline_path), root)
    if not any(changes.values()):
        return   # silent when nothing changed

    for category, paths in changes.items():
        for p in paths:
            print(f"[{category.upper()}] {p}")

run(r"C:\inetpub\wwwroot", r"C:\fim\wwwroot_baseline.json")

Run it on a schedule — Task Scheduler on Windows, cron elsewhere — and after a legitimate deployment, regenerate the baseline by deleting the JSON so the next run records the new known-good state. Routing the output to a log or an alert turns this into a feed your monitoring can act on, much like the events in my essential Windows Event IDs guide.

Frequently Asked Questions

Why hash file contents instead of checking modification time?

Because modification timestamps are attacker-controllable — timestomping resets a file's mtime to hide a change. A content hash changes whenever a single byte changes, regardless of what the timestamp says, so it detects modifications that a timestamp check would miss.

Where should I store the baseline file?

Outside the directory being monitored. If the baseline lives inside the watched tree, saving it alters the tree, and an attacker who edits a watched file can find and rewrite its baseline entry. A separate location reduces both problems, though any on-host baseline is still only as trustworthy as the host.

Can this file integrity monitor run continuously in real time?

This design is a point-in-time scan you run on a schedule, which is simple and dependency-free. For real-time alerts you would watch the filesystem with OS notifications (the watchdog library wraps these), at the cost of an external dependency and more moving parts. Scheduled scans are usually enough for directories that change rarely.

What kinds of directories is file integrity monitoring good for?

Directories that are supposed to be stable: a web root, configuration folders, startup locations, and binaries that change only on deployment. Pointing it at paths that change constantly — user profiles, temp directories, logs — produces noise that buries any real signal.

Conclusion

A file integrity monitor is one of those tools that sounds like it needs a product and actually needs about forty lines of standard-library Python: hash a baseline, store it as JSON, diff the current state with set operations. Hashing contents rather than trusting timestamps is what makes it resistant to the obvious evasion, and reporting only on change is what makes it something you keep running.

The honest limitations are that it is a detective control, not a preventive one — it tells you a change happened, after it happened — and that a baseline stored on the monitored host is never fully tamper-proof. Aimed at the right directories, the ones that should be boring, it turns "did anything change here?" from a manual chore into a scheduled answer.

Related Posts

Editorial note: posts on this blog are drafted with AI assistance and then reviewed, edited, and tested against a real environment before publishing. Commands, output, and screenshots come from systems I actually ran the work on.

Detection Engineering File Integrity Monitoring Incident Response Python Sysadmin Windows Security
SecurityScriptographer author

About the author

SecurityScriptographer is written and maintained by one person — a defender who builds and tests the detections, scripts, and Microsoft 365 workflows here before publishing them. More about me · @twi_nox

0 comments:

Post a Comment