Python for threat hunting shines at the unglamorous middle of an investigation: turning a wall of raw text into indicators, checking files against those indicators, asking whether the wider world has seen them, and noticing when something that should be stable quietly changes. This guide builds that whole toolkit out of (mostly) the standard library — a regex IOC extractor, a SHA-256 file scanner, a VirusTotal hash lookup, and a file integrity monitor — as one connected workflow rather than four disconnected scripts. Everything here is read-only, defensive, and meant for indicators and systems you are authorised to investigate.
Key Takeaways
- Python for threat hunting follows an indicator lifecycle: extract IOCs from text, match files against them by hash, enrich unknowns via VirusTotal, and monitor integrity over time.
- Regex extraction matches the shape of an indicator, so it must be paired with refanging beforehand and validation (for example the ipaddress module) afterward.
- The standard-library hashlib module hashes files with no dependency; reading in chunks keeps memory flat on multi-gigabyte files, and a set of known-bad hashes makes matching effectively free.
- A VirusTotal hash lookup enriches an indicator without ever uploading the file, but you must pace requests under the free tier's 4-per-minute limit and treat a 404 as "unknown," not "clean."
- A file integrity monitor is a hash baseline plus a diff: a detective control that catches changes a timestamp check would miss, best aimed at directories that should be boring.
Environment
- Python 3.9+ on Windows 11; every example is platform-neutral and runs the same on Linux or macOS.
- Standard library for most of it: hashlib, os, re, ipaddress, json. Only the VirusTotal step needs requests (pip install requests).
- A free VirusTotal API key for the enrichment section, kept in an environment variable rather than hard-coded.
- A list of known-bad hashes from a threat feed, and a directory or two worth watching.
The Problem
Indicators rarely arrive in a tidy CSV. They show up embedded in a SIEM alert, buried in an email header, or scattered through the prose of a threat-intel write-up — and once you have them, a flat list of hex strings is not yet a check. To make it useful I need to pull indicators out of messy text, sweep a filesystem for files that match known-bad hashes, ask VirusTotal about the ones I do not recognise, and keep an eye on the directories that are supposed to never change. Each of those is a few dozen lines of Python, and they connect: the output of one is the input of the next. The four sections below build them in that order.
Part 1 — Extract IOCs from Text with Regex
Regex is the right tool for pulling IPs, hashes, domains, and URLs out of free text, with two caveats that trip people up. First, a regex matches a shape, not a meaning — an IPv4 pattern matches 999.0.0.1 as eagerly as a real address. Second, threat reports deliberately defang indicators so they are not clickable, writing hxxps://evil[.]com. Handle both or your list is quietly wrong.
Refang first, then extract hashes
Refanging reverses the defang substitutions and must run before anything else, or every defanged indicator slips through. Hashes are then the easiest win — fixed lengths and a fixed alphabet:
import re

def refang(text):
    """Reverse common defang substitutions so patterns can match."""
    # Either prefix order works: replacing "hxxp" alone also turns "hxxps" into "https".
    for bad, good in {"[.]": ".", "[dot]": ".", "hxxps": "https",
                      "hxxp": "http", "[at]": "@", "[:]": ":"}.items():
        text = text.replace(bad, good)
    return text

HASH_PATTERNS = {
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "sha1": re.compile(r"\b[a-fA-F0-9]{40}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
}

def extract_hashes(text):
    """Return a dict of hash type to the set of matching strings."""
    return {name: set(pat.findall(text)) for name, pat in HASH_PATTERNS.items()}
Collecting into a set deduplicates as you go, which matters because the same hash often appears many times in one report. Those extracted hashes are exactly what Part 2 matches against the filesystem.
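To make the ordering concrete, here is a minimal sketch that runs a defanged snippet through the same two steps. The report text is invented for the example, and the hash in it is the well-known SHA-256 of empty input; refang is a compact restatement of the function above so the block runs on its own. Lowercasing before building the set is an addition of this sketch, not something the original function does.

```python
import re

# Compact restatement of the refang step, so this block is self-contained.
DEFANG_MAP = {"[.]": ".", "[dot]": ".", "hxxp": "http", "[at]": "@", "[:]": ":"}

def refang(text):
    for bad, good in DEFANG_MAP.items():
        text = text.replace(bad, good)
    return text

SHA256 = re.compile(r"\b[a-fA-F0-9]{64}\b")

# Hypothetical report excerpt; the hash is the SHA-256 of empty input.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
report = (
    "Payload fetched from hxxps://evil[.]com/drop.bin, "
    f"sha256 {EMPTY_SHA256}. Seen again as {EMPTY_SHA256.upper()}"
)

clean = refang(report)
# Normalise case before deduplicating, so upper- and lowercase copies collapse.
hashes = {h.lower() for h in SHA256.findall(clean)}

print(clean)
print(hashes)
```

Without the lowercase step the set would hold two entries for the same file, which matters later when the hashes are compared against a lowercase feed.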
Extract IPs and validate them, accept fuzziness on domains
This is where regex alone is not enough: the pattern matches the shape of an IPv4 address but cannot tell 8.8.8.8 from 999.0.0.1. Let regex find candidates and let the ipaddress module decide which are real:
import ipaddress

IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def extract_ips(text):
    valid = set()
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            ipaddress.ip_address(candidate)  # raises ValueError if invalid
            valid.add(candidate)
        except ValueError:
            pass
    return valid
ipaddress.ip_address() rejects any octet over 255, and the same object exposes .is_private and .is_loopback if you want to drop internal ranges. Domains and URLs are fuzzier — a pattern strict enough to reject every false positive also rejects valid edge cases — so treat https?://\S+ matches and domain-shaped strings as candidates to review, not facts. Regex extraction is triage, not parsing: it surfaces things to verify fast, which is most of the value.
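The two follow-ups mentioned above, dropping internal ranges and treating URL matches as candidates, might look something like the sketch below. The function names extract_public_ips and extract_url_candidates and the sample text are hypothetical, and trimming trailing punctuation from URL matches is an assumption about what a reviewer would want rather than a rule.

```python
import ipaddress
import re

IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
URL_CANDIDATE = re.compile(r"https?://\S+")

def extract_public_ips(text):
    """Valid IPv4 addresses from text, minus private and loopback ranges."""
    public = set()
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            addr = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # right shape, not a real address (e.g. 999.0.0.1)
        if addr.is_private or addr.is_loopback:
            continue  # internal ranges are rarely useful as external IOCs
        public.add(str(addr))
    return public

def extract_url_candidates(text):
    """URL-shaped strings for review, with trailing punctuation trimmed."""
    return {m.rstrip(".,;)") for m in URL_CANDIDATE.findall(text)}

# Hypothetical alert text for illustration.
sample = ("Beacon to 8.8.8.8 and 10.0.0.5, bogus 999.0.0.1, local 127.0.0.1. "
          "Stage two at https://evil.com/a.bin.")

print(extract_public_ips(sample))
print(extract_url_candidates(sample))
```

Whether to drop private addresses depends on the hunt: for lateral-movement work the internal ranges are exactly what you want, so the filter is better as an option than a default.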
Part 2 — Match Files Against a Known-Bad Feed
A threat feed hands you indicators as file hashes. To turn that into a check, walk a filesystem, hash every file, and see whether any digest appears in the bad list. The core is one function — and the important detail is reading the file in fixed-size chunks so a 10 GB file uses the same trivial memory as a 10 KB one:
import hashlib, os

def sha256_file(path, chunk_size=65536):
    """SHA-256 hex digest of a file, read in chunks to keep memory flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:  # binary mode matters
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def load_iocs(ioc_path):
    """Known-bad hashes into a lowercase set for O(1) lookups."""
    with open(ioc_path, "r", encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def scan(root, iocs):
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            try:
                digest = sha256_file(full)
            except OSError as exc:  # covers PermissionError and FileNotFoundError
                print(f"[skip] {full}: {exc}")
                continue
            if digest in iocs:
                print(f"[MATCH] {full} {digest}")
Opening in binary mode ("rb") is essential — hashing operates on bytes, and text mode would corrupt the digest on any non-text file. The set makes membership testing constant-time, so the scan cost is dominated by reading files, not matching. The try/except matters because a real filesystem always contains files you cannot read. SHA-256 is the right default, but if a feed only publishes MD5 or SHA-1 you can update all three digests in a single pass over the bytes rather than reading the file three times. The honest limitation: hash matching only catches files you already have a hash for — one changed byte produces a completely different digest, which is why it complements behavioural detection rather than replacing it.
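The single-pass idea for feeds that publish MD5 or SHA-1 can be sketched as follows. The name multi_hash_file is hypothetical, and the temporary file exists only to make the example runnable; note that hashlib.new("md5") can be unavailable on FIPS-restricted Python builds.

```python
import hashlib
import os
import tempfile

def multi_hash_file(path, chunk_size=65536):
    """MD5, SHA-1 and SHA-256 hex digests from one read of the file."""
    hashers = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for h in hashers.values():
                h.update(chunk)  # every digest sees the same chunk once
    return {name: h.hexdigest() for name, h in hashers.items()}

# Demo on a throwaway file larger than one chunk.
data = b"threat hunting" * 10000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
digests = multi_hash_file(tmp.name)
os.remove(tmp.name)

print(digests)
```

With all three digests in hand, a scan can test each against its own set of known-bad hashes, so a feed in any of the three formats costs no extra disk reads.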
Part 3 — Enrich Unknowns with VirusTotal
Local matching only knows about your own list. When the scan turns up a file you do not recognise, VirusTotal can tell you whether the wider world has classified it — and the privacy-safe way to ask is to submit only the hash, never the file. Uploading a file makes it retrievable by other users and premium customers, so a hash lookup is both safer and faster:
import os, time, requests

API_KEY = os.environ["VT_API_KEY"]  # set in your shell, never in the file
BASE_URL = "https://www.virustotal.com/api/v3/files/"
HEADERS = {"x-apikey": API_KEY}

def lookup_hash(file_hash):
    resp = requests.get(BASE_URL + file_hash, headers=HEADERS, timeout=30)
    if resp.status_code == 404:
        return {"hash": file_hash, "status": "not_found"}
    if resp.status_code == 429:
        return {"hash": file_hash, "status": "rate_limited"}
    resp.raise_for_status()
    stats = resp.json()["data"]["attributes"]["last_analysis_stats"]
    return {"hash": file_hash, "status": "found",
            "malicious": stats.get("malicious", 0)}

def lookup_many(hashes, delay=15):  # ~4 requests/minute on the free tier
    out = []
    for h in hashes:
        out.append(lookup_hash(h))
        time.sleep(delay)
    return out
The three response codes are the part that matters. A 404 is not a failure — it means VirusTotal has never seen that hash, which for an internal build is expected and for an unknown executable is itself a mild signal. A 429 means you have hit the rate limit and should back off; the free tier allows 4 requests per minute and 500 per day, so it is built for enrichment on a handful of indicators, not bulk scanning. Only a found result carries a verdict, and even then the engine count is context, not gospel — a single detection is often a false positive while five or more from reputable engines is a strong signal. Keep the count visible and let an analyst overrule it.
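One way to keep the count visible while still sorting results is a small triage step over the dictionaries that lookup_hash returns. This is a sketch only: the function name triage, the label strings, and the default threshold of five are assumptions drawn from the rule of thumb above, not anything VirusTotal defines.

```python
def triage(result, strong_threshold=5):
    """Map a lookup_hash-style result dict to a coarse, reviewable label."""
    status = result.get("status")
    if status == "not_found":
        return "unknown"        # never seen by VirusTotal; not the same as clean
    if status == "rate_limited":
        return "retry_later"    # back off and query again
    malicious = result.get("malicious", 0)
    if malicious >= strong_threshold:
        return "likely_malicious"
    if malicious >= 1:
        return "review"         # one or a few detections: possible false positive
    return "no_detections"

# Hypothetical results in the shape lookup_hash produces.
samples = [
    {"hash": "a" * 64, "status": "not_found"},
    {"hash": "b" * 64, "status": "rate_limited"},
    {"hash": "c" * 64, "status": "found", "malicious": 1},
    {"hash": "d" * 64, "status": "found", "malicious": 12},
    {"hash": "e" * 64, "status": "found", "malicious": 0},
]
labels = [triage(r) for r in samples]

for r, label in zip(samples, labels):
    print(r["hash"][:8], r.get("malicious", "-"), label)
```

The label is a sorting aid for the analyst, which is why the raw count is printed alongside it rather than replaced by it.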
Part 4 — Monitor File Integrity Over Time
Some directories are supposed to be boring: a web root, a startup folder, or a config directory should change when you deploy and at no other time. An unexpected change is a signal worth having, and hashing contents (rather than trusting a timestamp, which an attacker can reset by timestomping) is what makes it reliable. A file integrity monitor is just a hash baseline plus a diff, reusing the same chunked sha256_file from Part 2:
import json, os

# sha256_file is the chunked helper defined in Part 2.

def build_baseline(root):
    baseline = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            try:
                baseline[full] = sha256_file(full)
            except OSError:  # covers PermissionError and FileNotFoundError
                continue
    return baseline

def compare(baseline, root):
    current = build_baseline(root)
    base, curr = set(baseline), set(current)
    return {
        "added": sorted(curr - base),
        "deleted": sorted(base - curr),
        "modified": sorted(p for p in base & curr if baseline[p] != current[p]),
    }

def run(root, baseline_path):
    if not os.path.exists(baseline_path):
        with open(baseline_path, "w") as f:
            json.dump(build_baseline(root), f, indent=2)
        print(f"Baseline created for {root}")
        return
    with open(baseline_path) as f:
        changes = compare(json.load(f), root)
    if not any(changes.values()):
        return  # silent when nothing changed
    for category, paths in changes.items():
        for p in paths:
            print(f"[{category.upper()}] {p}")
The baseline is a dictionary of path-to-hash that serialises cleanly to JSON — no database needed — but store it outside the watched directory, or the act of saving it changes the tree you are monitoring. The set operations isolate three categories in one pass: curr - base is added, base - curr is deleted, and a hash mismatch over the intersection is modified. Reporting only on change is what keeps the monitor something you actually read; run it on a schedule, and regenerate the baseline after a legitimate deployment. It is a detective control — it tells you a change happened after the fact — and a baseline on the monitored host is never fully tamper-proof, but aimed at the right directories it turns "did anything change here?" into a scheduled answer.
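The advice to keep the baseline outside the watched tree is easy to get wrong with relative paths, so a small guard can help. The sketch below is one possible check; the name is_inside is hypothetical, and the temporary directories stand in for a real watched root and a separate state directory.

```python
import os
import tempfile

def is_inside(path, root):
    """True if path resolves to a location at or under root."""
    path = os.path.realpath(path)
    root = os.path.realpath(root)
    try:
        return os.path.commonpath([path, root]) == root
    except ValueError:  # e.g. paths on different drives on Windows
        return False

# Stand-ins for a watched directory and a separate place to keep state.
watched = tempfile.mkdtemp()
state_dir = tempfile.mkdtemp()

bad_choice = is_inside(os.path.join(watched, "baseline.json"), watched)
good_choice = is_inside(os.path.join(state_dir, "baseline.json"), watched)

print(bad_choice, good_choice)

os.rmdir(watched)
os.rmdir(state_dir)
```

Calling a check like this before writing the baseline, and refusing to continue when it returns True, avoids a monitor that reports its own baseline file as added or modified on every run.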
Frequently Asked Questions
What does "refang" mean and why is it needed?
Threat reports defang indicators so they cannot be clicked, writing hxxp for http and evil[.]com for evil.com. Refanging reverses those substitutions. If you skip it, your regex never matches the defanged indicators and you silently miss most of the report's IOCs.
Why validate IP addresses if the regex already matched them?
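Because the regex matches the shape of an address, not a valid one: 999.1.1.1 passes a naive pattern, and the same pattern will lift 1.2.3.4 out of a longer dotted string such as 1.2.3.4.5. The ipaddress module enforces the actual range rules, so combining regex for candidates with ipaddress for validation discards impossible addresses. Shape-valid matches pulled from version strings still deserve a glance.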
Why hash files in chunks instead of reading the whole file?
Reading an entire file into memory works until you meet a file larger than available RAM, at which point it fails or thrashes. Reading fixed-size chunks and calling update() on each keeps memory use constant regardless of file size, so the same code handles a 10 KB and a 10 GB file.
What is the main limitation of hash-based detection?
It only catches files you already have a hash for. Cryptographic hashes are designed so a single changed byte produces a completely different digest, so any trivial modification — a recompiled build, a different packer — defeats the match. Hash matching is precise but brittle, and complements behavioural detection rather than replacing it.
Does looking up a hash upload my file to VirusTotal?
No. A hash lookup sends only the hash string, never the file. This is the privacy-safe way to query VirusTotal — uploading a file makes it retrievable by other users and premium customers, so never upload confidential files. Even a hash, though, signals your interest in that indicator.
What are the VirusTotal public API rate limits?
The free public API allows 4 requests per minute and 500 per day. Exceed either and the API responds with HTTP 429. Roughly one request every 15 seconds keeps you under the per-minute cap; higher volume requires a premium key.
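The pacing arithmetic is simple enough to sketch for planning a batch. The constants below are the free-tier figures quoted above; the function names and the returned keys are hypothetical, and the estimate ignores when the daily quota actually resets.

```python
import math

PER_MINUTE = 4   # free-tier limits quoted above
PER_DAY = 500

def min_delay_seconds(per_minute=PER_MINUTE):
    """Smallest even spacing that stays under the per-minute cap."""
    return 60 / per_minute

def batch_estimate(n_hashes, per_minute=PER_MINUTE, per_day=PER_DAY):
    """Rough planning numbers for a batch of lookups on one key."""
    delay = min_delay_seconds(per_minute)
    return {
        "delay_seconds": delay,
        "days_needed": math.ceil(n_hashes / per_day),
        # waiting happens between requests, so n lookups need n - 1 pauses
        "pacing_minutes": max(n_hashes - 1, 0) * delay / 60,
    }

est = batch_estimate(500)
print(est)
print(batch_estimate(1200)["days_needed"])
```

A full daily quota of 500 lookups at one every 15 seconds takes a little over two hours of wall-clock time, which is a useful sanity check before queuing a long list.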
Why hash file contents for integrity instead of checking modification time?
Modification timestamps are attacker-controllable — timestomping resets a file's mtime to hide a change. A content hash changes whenever a single byte changes, regardless of what the timestamp claims, so it detects modifications a timestamp check would miss.
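A small experiment on a throwaway file shows the difference. This is a sketch under simple assumptions: the file name and contents are invented, both versions are deliberately the same length so a size check would not help either, and os.utime is used to put the original timestamps back after the edit.

```python
import hashlib
import os
import tempfile

def sha256_of(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()  # tiny demo file, no chunking needed

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "config.ini")
    with open(path, "w") as f:
        f.write("mode = safe\n")
    st = os.stat(path)
    before = (st.st_mtime_ns, sha256_of(path))

    # Change the content, then restore the original access and modify times.
    with open(path, "w") as f:
        f.write("mode = evil\n")
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))

    after = (os.stat(path).st_mtime_ns, sha256_of(path))

mtime_unchanged = before[0] == after[0]
hash_changed = before[1] != after[1]
print(mtime_unchanged, hash_changed)
```

A monitor keyed on mtime would see nothing here, while the hash comparison flags the file as modified.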
Conclusion
These four tools are one workflow, not four scripts. Regex extraction turns unstructured text into a candidate indicator list; the SHA-256 scanner matches files at rest against known-bad hashes; the VirusTotal lookup enriches the unknowns without ever exposing a file; and the integrity monitor watches the directories that should never change. Each is well under a hundred lines, mostly standard library, and the output of one feeds the next.
The honest framing runs through all of them: regex extraction is triage, not parsing; hash matching is precise but brittle; VirusTotal counts inform rather than conclude; and a file integrity monitor is detective, not preventive. Within those boundaries this is a fast, dependency-light toolkit for the file-and-indicator side of threat hunting — the part you reach for when someone hands you a folder, a log, or an alert and asks "is any of this bad?"
Related Posts
- From Logs to Threats: SIEM Correlation Rules for Real Attacks — what to do with indicators once you have extracted and matched them.
- Sysmon Configuration for Windows Security Monitoring — Sysmon records process image hashes on Event ID 1, a source for the hashes you match.
- PowerShell Threat Hunting: Windows Endpoint Triage Guide — the live-system counterpart for pivoting from a file match to what was running.
Authoritative references: Python hashlib documentation and the VirusTotal public vs premium API documentation.
Editorial note: posts on this blog are drafted with AI assistance and then reviewed, edited, and tested against a real environment before publishing. Commands, output, and screenshots come from systems I actually ran the work on.