File hashing for IOC matching is one of the most basic and most useful things you can do with Python on the defensive side: compute a cryptographic hash of every file in a directory and compare it against a list of known-bad hashes from a threat feed. This Python Quick Guide builds a small SHA-256 scanner with the standard-library hashlib module, handles large files without exhausting memory, and is honest about what hash matching can and cannot catch.
Key Takeaways
- File hashing for IOC matching means computing a SHA-256 (or SHA-1/MD5) digest per file and checking it against a set of known-bad hashes from a threat-intelligence feed.
- The standard-library hashlib module does this with no external dependency, and reading the file in chunks keeps memory flat even on large files.
- Loading the IOC list into a Python set makes each lookup constant-time, so the scan cost is dominated by reading files, not matching.
- Hash matching only catches files you already have a hash for: a single changed byte produces a completely different hash, which is the technique's built-in limitation.
- SHA-256 is the right default; MD5 and SHA-1 still appear in feeds, so a practical scanner often computes more than one algorithm.
Environment
- Python 3.9+ on Windows 11; the code is platform-neutral and runs the same on Linux or macOS.
- Standard-library modules only: hashlib, os, and pathlib. No pip install required.
- A list of known-bad hashes. For testing I used a small text file of SHA-256 values, the same shape you get from most open threat feeds.
The Problem
A threat feed hands you indicators of compromise as file hashes. That is genuinely useful intelligence, but on its own it is just a text file of hex strings. To turn it into a check, I need to walk a filesystem, hash every file, and see whether any of those hashes appear in the bad list. Sysmon already records process image hashes on Event ID 1, which I covered in the Sysmon configuration walkthrough, but that only covers things that executed. To sweep files at rest — a download folder, a web root, a USB drive handed to me for triage — I want my own scanner.
The obvious naive approach, reading each whole file into memory and hashing it, falls over the moment you hit a multi-gigabyte file. So the scanner has to stream each file in chunks, and the matching has to be fast enough that the hashing, not the lookup, is the bottleneck. Both are a few lines once you know the shape.
The Solution — File Hashing for IOC Matching in Python
Step 1 — Hash a single file without loading it into memory
The core of the scanner is one function: take a path, return its SHA-256 hex digest. The important detail is reading the file in fixed-size chunks and feeding each chunk to the hash object with update(), so a 10 GB file uses the same trivial amount of memory as a 10 KB one:
import hashlib

def sha256_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() keeps calling f.read() until it returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
Note that the file is opened in binary mode ("rb"): hashing operates on bytes, not text. In Python 3 a text-mode read returns str, which update() rejects with a TypeError, and decoding arbitrary binary data can fail before it even gets that far. The iter(callable, sentinel) form keeps calling f.read() until it returns an empty bytes object, which is a clean way to loop over a file in chunks.
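On Python 3.11 and later, the standard library also offers hashlib.file_digest(), which performs the chunked read itself. The sketch below is one way to use it where it exists and fall back to the manual loop on older interpreters. The name sha256_file_compat is hypothetical, chosen only for this illustration.

```python
import hashlib

def sha256_file_compat(path, chunk_size=65536):
    """SHA-256 of a file, using hashlib.file_digest() where it exists."""
    with open(path, "rb") as f:
        if hasattr(hashlib, "file_digest"):
            # Python 3.11+: the standard library handles the chunked read.
            return hashlib.file_digest(f, "sha256").hexdigest()
        # Older interpreters: the same manual loop as sha256_file.
        h = hashlib.sha256()
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
        return h.hexdigest()
```

Both branches should produce the same digest, so the wrapper can stand in for sha256_file without changing the rest of the scanner.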
Step 2 — Load the IOC list into a set for fast lookups
The known-bad hashes go into a Python set, not a list. Membership testing in a set is constant-time on average, so even with hundreds of thousands of indicators each lookup is effectively free, whereas a list would be searched element by element. I normalise to lowercase so the comparison does not fail purely on casing, which is a common difference between feeds:
def load_iocs(ioc_path):
    """Load known-bad hashes into a lowercase set for O(1) lookups."""
    with open(ioc_path, "r", encoding="utf-8") as f:
        # Strip whitespace, lowercase, and drop blank lines.
        return {line.strip().lower() for line in f if line.strip()}
If the feed ships one hash per line, that is all you need. Feeds that ship CSV or JSON need a parsing step first, which is the kind of regex and field extraction I cover in the companion post on turning logs into threats — the same discipline of pulling structured indicators out of messy input.
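For feeds that are not one hash per line, a rough first pass is to pull out anything shaped like a SHA-256 value with a regular expression. This is a minimal sketch, assuming the feed writes SHA-256 values as bare 64-character hex strings. The names SHA256_RE and extract_sha256 are hypothetical, and a real CSV or JSON feed may be better served by the csv or json modules.

```python
import re

# Assumption: SHA-256 values appear as standalone 64-character hex tokens.
SHA256_RE = re.compile(r"\b[0-9a-fA-F]{64}\b")

def extract_sha256(text):
    """Return every SHA-256-shaped token in text as a lowercase set."""
    return {match.lower() for match in SHA256_RE.findall(text)}
```

The word boundaries keep the pattern from matching the first 64 characters of a longer hex string, such as a SHA-512 value. The returned set can be used anywhere the output of load_iocs is expected.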
Step 3 — Walk a directory and flag matches
With the two helpers in place, the scan is a walk over a directory tree, hashing each file and checking the set. I wrap the hashing in a try/except because a real filesystem always contains files you cannot read — locked, permission-denied, or vanished between listing and opening — and one unreadable file should not abort the whole scan:
import os

def scan(root, iocs):
    """Walk root, hash every file, and print any hash found in iocs."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            try:
                digest = sha256_file(full)
            except OSError as exc:
                # OSError also covers PermissionError and FileNotFoundError.
                print(f"[skip] {full}: {exc}")
                continue
            if digest in iocs:
                print(f"[MATCH] {full} {digest}")

iocs = load_iocs("bad_hashes.txt")
scan(r"C:\Users\Public\Downloads", iocs)
Every match prints the path and the hash so you can pivot — pull the file for analysis, check where it came from, and look at what ran around the same time. Pairing a disk-level match with a process-level view is where process investigation picks up the thread.
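If the matches need to feed another tool rather than a console, one option is a generator that yields them instead of printing. The sketch below is an illustrative variant of scan(), not a replacement for it. The name find_matches is hypothetical, and sha256_file is repeated only so the block runs on its own.

```python
import hashlib
import os

def sha256_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_matches(root, iocs):
    """Yield (path, digest) for every readable file whose hash is in iocs."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            try:
                digest = sha256_file(full)
            except OSError:
                continue  # unreadable file: skip it, as scan() does
            if digest in iocs:
                yield full, digest
```

A caller can then loop with for path, digest in find_matches(root, iocs) and write each pair to a CSV file or a ticket.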
Step 4 — Compute more than one algorithm when the feed demands it
SHA-256 is the right default, but plenty of feeds still publish MD5 or SHA-1. Rather than read each file three times, hash it once and update all three digests in the same pass over the bytes:
def multi_hash(path, chunk_size=65536):
    """Return MD5, SHA-1, and SHA-256 hex digests from one pass over the file."""
    hashers = {
        "md5": hashlib.md5(),
        "sha1": hashlib.sha1(),
        "sha256": hashlib.sha256(),
    }
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            # Feed the same chunk to every hash object.
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
MD5 and SHA-1 are cryptographically broken for collision resistance and you should never rely on them for integrity guarantees, but for matching against a feed that only provides those hashes they are still fit for purpose — you are comparing against a known value, not trusting the algorithm to be collision-free. Compute SHA-256 as well and prefer it whenever the feed offers it.
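To match the output of multi_hash against a feed that mixes algorithms, the indicators can be bucketed by hex-digest length: 32 characters for MD5, 40 for SHA-1, 64 for SHA-256. The following is a minimal sketch under that assumption. The names group_iocs and match_any are hypothetical, and a feed that labels its hash types explicitly should be trusted over a length guess.

```python
HEX_DIGITS = set("0123456789abcdef")

# Hex-digest length identifies the algorithm for these three hashes.
ALGO_BY_LENGTH = {32: "md5", 40: "sha1", 64: "sha256"}

def group_iocs(hashes):
    """Bucket hex strings by algorithm; ignore anything that does not fit."""
    grouped = {name: set() for name in ALGO_BY_LENGTH.values()}
    for value in hashes:
        value = value.strip().lower()
        algo = ALGO_BY_LENGTH.get(len(value))
        if algo and set(value) <= HEX_DIGITS:
            grouped[algo].add(value)
    return grouped

def match_any(digests, grouped):
    """Return the algorithm names whose digest appears in the matching bucket."""
    return [name for name, digest in digests.items()
            if digest in grouped.get(name, ())]
```

Here digests is the dictionary returned by multi_hash. An empty list from match_any means the file matched nothing in the feed.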
Frequently Asked Questions
Why hash files in chunks instead of reading the whole file?
Reading an entire file into memory to hash it works until you meet a file larger than available RAM, at which point it fails or thrashes. Reading fixed-size chunks and calling update() on each keeps memory use constant regardless of file size, so the same code handles a 10 KB and a 10 GB file.
Is MD5 safe to use for IOC matching?
For matching against a feed that publishes MD5 values, yes — you are checking whether a file equals a specific known-bad hash, not relying on the algorithm's collision resistance. For any integrity or signing purpose, MD5 and SHA-1 are broken and should not be used. Prefer SHA-256 whenever the feed provides it.
What is the main limitation of hash-based detection?
It only catches files you already have a hash for. Cryptographic hashes are designed so that a single changed byte produces a completely different digest, so any trivial modification — a recompiled build, a different packer, a one-byte patch — defeats the match. Hash matching is precise but brittle; it complements behavioural detection rather than replacing it.
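The brittleness is easy to see directly. The snippet below is a small demonstration using made-up stand-in bytes rather than a real sample: it hashes two buffers that differ only in their final byte and counts how many hex positions of the two digests disagree.

```python
import hashlib

original = b"MZ" + b"\x00" * 1022      # stand-in bytes, not a real sample
patched = original[:-1] + b"\x01"      # identical except for the final byte

d1 = hashlib.sha256(original).hexdigest()
d2 = hashlib.sha256(patched).hexdigest()

# Count the hex positions where the two digests differ.
differing = sum(a != b for a, b in zip(d1, d2))
print(d1 != d2, differing)
```

The two digests share no useful structure, so there is no notion of a "near match" to fall back on with a cryptographic hash.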
How do I make the scan faster on large directories?
Most of the time is spent reading bytes off disk, so increasing the chunk size helps a little, and hashing files in parallel with a thread or process pool helps more on multi-core machines with fast storage. You can also skip files by size or extension before hashing if your IOC set only contains, say, executables.
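One way to sketch the parallel approach is with concurrent.futures.ThreadPoolExecutor. CPython's hashlib releases the GIL while hashing larger buffers, so threads can overlap reading and hashing. The function names parallel_scan and _safe_hash, the max_size filter, and the default of four workers are all illustrative assumptions, and whether this is actually faster depends on the storage underneath.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

def sha256_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def _safe_hash(path):
    """Hash one file; return (path, None) if it cannot be read."""
    try:
        return path, sha256_file(path)
    except OSError:
        return path, None

def parallel_scan(root, iocs, max_size=None, workers=4):
    """Hash files under root on a thread pool; return matching (path, digest) pairs."""
    paths = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            try:
                # Optional pre-filter: skip files larger than max_size bytes.
                if max_size is not None and os.path.getsize(full) > max_size:
                    continue
            except OSError:
                continue
            paths.append(full)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(_safe_hash, paths)
        return [(p, d) for p, d in results if d in iocs]
```

A size filter like max_size trades coverage for speed, so it only makes sense when the indicators are known to describe files below that size.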
Conclusion
A SHA-256 scanner is maybe twenty lines of standard-library Python, and it turns a flat list of threat-feed hashes into something you can point at a filesystem. The two details that matter are streaming each file in chunks so memory stays flat, and loading the indicators into a set so matching stays fast.
The limitation is the honest part: hash matching catches only exact, previously-seen files. A recompiled or repacked sample sails straight past it, which is why this sits alongside behavioural and signature detection rather than standing in for them. Within that boundary it is fast, precise, and dependency-free — a good first pass over anything handed to you for triage.
Related Posts
- Sysmon Configuration for Windows Security Monitoring — Sysmon already records process image hashes on Event ID 1.
- Windows Security: Registry Forensics — Where Attackers Hide — another at-rest artifact to sweep during triage.
- PowerShell Quick Guide: Process Investigation — pivot from a file match to what was running.
Editorial note: posts on this blog are drafted with AI assistance and then reviewed, edited, and tested against a real environment before publishing. Commands, output, and screenshots come from systems I actually ran the work on.