Extract IOCs from Logs with Python Regex (re module)

Extracting IOCs from logs with Python regex is the glue work between a wall of raw text and a usable indicator list: pull the IP addresses, file hashes, domains, and URLs out of an alert, an email header, or a log file so you can pivot on them. This Python Quick Guide builds a small, honest IOC extractor with the standard-library re module, validates what regex alone gets wrong, and handles the defanged indicators that threat reports love to use.

Key Takeaways

Extracting IOCs from logs with Python regex means matching the predictable shapes of IPs, hashes, domains, and URLs, then validating the matches that regex cannot get right on its own.
File hashes have fixed lengths (32 hex characters for MD5, 40 for SHA-1, 64 for SHA-256), which makes them the most reliable indicator to extract with a regular expression.
A regex for IPv4 will happily match impossible addresses like 999.1.1.1, so validate matches with the standard-library ipaddress module rather than trusting the pattern.
Threat reports defang indicators (hxxp, 1.2.3[.]4) to make them unclickable, so a refang step has to run before extraction or you miss every defanged indicator.
Regex extraction is a triage tool, not a parser — it produces candidates to verify, not a guaranteed-correct indicator feed.

Environment

Python 3.9+, standard library only — re and ipaddress. No external packages.
Any text source: a saved alert, an email .eml, a proxy log, or text pasted from a threat report.
Tested on Windows 11, though nothing here is platform-specific.

The Problem

Indicators rarely arrive in a tidy CSV. They show up embedded in a SIEM alert, buried in an email header, or scattered through the prose of a threat-intel write-up. Reading them out by hand is slow and error-prone, and the indicators I miss are exactly the ones I needed. I want to point a script at any blob of text and get back a deduplicated list of IPs, hashes, domains, and URLs to feed into the kind of matching I covered in monitoring the events that matter.

Regex is the obvious tool, and it is the right one — with two caveats that trip people up. First, a regex matches a shape, not a meaning: an IPv4 pattern matches 300.300.300.300 just as eagerly as a real address. Second, threat reports deliberately break their indicators so they are not clickable, writing hxxps://evil[.]com instead of the real thing. Both are easy to handle once you know they are there, and ignoring either gives you a list that is quietly wrong.
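Both caveats are easy to reproduce in a few lines. The sketch below is illustrative only: the sample strings and the names NAIVE_IPV4 and NAIVE_URL are assumptions made up for the demonstration, not part of the extractor built in the steps that follow.

```python
import re

# Deliberately naive patterns (hypothetical names) to show the two failure modes.
NAIVE_IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
NAIVE_URL = re.compile(r"https?://\S+")

# Caveat 1: the pattern matches a shape, so an impossible address passes.
ip_hits = NAIVE_IPV4.findall("src=300.300.300.300 dst=8.8.8.8")

# Caveat 2: a defanged URL no longer has the shape the pattern expects.
url_hits = NAIVE_URL.findall("beacon to hxxps://evil[.]com/gate.php")

print(ip_hits)   # both candidates come back, valid or not
print(url_hits)  # empty list: the defanged URL was missed
```

The first result contains a false positive and the second is a false negative, which is why the extractor needs both a validation step and a refang step.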

The Solution — Extract IOCs from Logs with Python Regex

Step 1 — Refang the text before you match anything

Defanging replaces the parts that make an indicator live: http becomes hxxp, dots become [.], and @ becomes [at]. If you extract before reversing this, every defanged indicator slips through. Refanging is a handful of literal replacements and should always run first:

def refang(text):
    """Reverse common defang styles so indicators match cleanly."""
    replacements = {
        "[.]": ".", "(.)": ".", "[dot]": ".",
        "[:]": ":", "hxxps": "https", "hxxp": "http",
        "[at]": "@", "[@]": "@",
    }
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    return text

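Order does not matter for these rules: hxxp is a prefix of hxxps, so the hxxp rule on its own already turns hxxps into https, and the explicit hxxps entry is there only for readability. What does matter is case. str.replace is case-sensitive, so HXXP or [DOT] pass through untouched unless you add those variants to the table. Small thing, easy to miss.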
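The literal replacements in refang() are case-sensitive, so a report that writes HXXP or [DOT] would pass through unchanged. Here is a minimal sketch of a case-insensitive variant, assuming those uppercase styles show up in your sources. The names refang_ci and _REFANG_RULES are hypothetical and not part of the original script.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement) pairs.
# The set of defang styles covered here is an assumption; extend it as needed.
_REFANG_RULES = [
    (re.compile(r"hxxp", re.IGNORECASE), "http"),
    (re.compile(r"\[\.\]|\(\.\)|\[dot\]", re.IGNORECASE), "."),
    (re.compile(r"\[:\]"), ":"),
    (re.compile(r"\[at\]|\[@\]", re.IGNORECASE), "@"),
]

def refang_ci(text):
    """Reverse common defang styles regardless of letter case."""
    for pattern, replacement in _REFANG_RULES:
        text = pattern.sub(replacement, text)
    return text

print(refang_ci("HXXP[:]//1.2.3(.)4 and user[AT]evil[DOT]com"))
```

The replacement scheme is always written in lowercase, which is harmless because URL schemes are case-insensitive.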

Step 2 — Extract file hashes, the easy win

Hashes are the most reliable indicator to pull with regex because they have fixed lengths and a fixed alphabet. The only subtlety is that a SHA-256 contains substrings that look like MD5s. The \b word boundaries take care of this: there is no boundary inside an unbroken run of hex characters, so the 32-character pattern cannot match part of a 64-character hash, and the order in which the patterns run does not matter:

import re

HASH_PATTERNS = {
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "sha1":   re.compile(r"\b[a-fA-F0-9]{40}\b"),
    "md5":    re.compile(r"\b[a-fA-F0-9]{32}\b"),
}

def extract_hashes(text):
    return {name: set(pat.findall(text)) for name, pat in HASH_PATTERNS.items()}

Using a set deduplicates as you go, which matters because the same hash often appears many times in one report. These are exactly the values you would feed into the SHA-256 scanner from the companion post on turning logs into detections.
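One thing a set does not do is treat ABCD... and abcd... as the same hash. A possible variant, sketched here under the assumption that you want case-normalised output, finds hex runs in a single pass and classifies them by length. The names HEX_RUN, LENGTH_TO_TYPE, and classify_hashes are hypothetical.

```python
import re

# One pattern for any standalone hex run between 32 and 64 characters.
HEX_RUN = re.compile(r"\b[a-fA-F0-9]{32,64}\b")

# Only these exact lengths are treated as hashes; other lengths are ignored.
LENGTH_TO_TYPE = {32: "md5", 40: "sha1", 64: "sha256"}

def classify_hashes(text):
    """Return {hash_type: set_of_lowercased_hashes} for MD5, SHA-1, SHA-256."""
    found = {name: set() for name in LENGTH_TO_TYPE.values()}
    for run in HEX_RUN.findall(text):
        kind = LENGTH_TO_TYPE.get(len(run))
        if kind is not None:
            found[kind].add(run.lower())  # normalise case so duplicates collapse
    return found

sample = f"md5 {'a' * 32} again {'A' * 32} sha256 {'c' * 64} noise {'d' * 50}"
print({kind: len(values) for kind, values in classify_hashes(sample).items()})
```

In the sample, the two spellings of the same 32-character value collapse into one MD5 entry, and the 50-character run is discarded because its length matches no known hash type.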

Step 3 — Extract IPs, then validate them

This is where regex alone is not enough. The pattern below matches the shape of an IPv4 address, but it cannot tell 8.8.8.8 from 999.0.0.1. The fix is to let regex find candidates and let the ipaddress module decide which are real:

import ipaddress

IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def extract_ips(text):
    valid = set()
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            ipaddress.ip_address(candidate)   # raises ValueError if invalid
            valid.add(candidate)
        except ValueError:
            pass
    return valid

ipaddress.ip_address() rejects anything with an octet over 255, so the impossible matches fall away. If you want to drop private and loopback ranges as well — usually you do, since 10.0.0.1 is rarely the indicator you care about — the same object exposes .is_private and .is_loopback to filter on.
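A minimal sketch of that filter, assuming you only want routable addresses in the output. The function name extract_public_ips is hypothetical, and the exact ranges that .is_private covers are defined by the ipaddress module and may differ slightly between Python versions.

```python
import ipaddress
import re

IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def extract_public_ips(text):
    """Return valid IPv4 addresses that are neither private nor loopback."""
    public = set()
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            addr = ipaddress.ip_address(candidate)  # ValueError if not a real address
        except ValueError:
            continue
        if addr.is_private or addr.is_loopback:
            continue  # internal or local address, rarely a useful indicator
        public.add(str(addr))
    return public

print(extract_public_ips("8.8.8.8 10.0.0.1 127.0.0.1 999.0.0.1 192.168.1.5"))
```

Keeping the address object instead of discarding it is the only structural change from extract_ips(); the filtering itself is two attribute checks.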

Step 4 — Extract domains and URLs, and accept the fuzziness

URLs and domains are where regex stops being precise. A pattern strict enough to reject every false positive also rejects valid edge cases, and a pattern loose enough to catch everything pulls in noise. I aim for a pragmatic middle and treat the output as candidates to review:

URL = re.compile(r"\bhttps?://[^\s\"'<>\])]+", re.IGNORECASE)
DOMAIN = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)

def extract_urls_domains(text):
    return {
        "urls": set(URL.findall(text)),
        "domains": set(DOMAIN.findall(text)),
    }

The domain pattern will match things that are domain-shaped but not real domains: file.txt or update.exe can slip through, because nothing in the pattern knows which suffixes are actual top-level domains. That is the nature of regex on free text, and it is why this is a triage step. For authoritative parsing of structured formats you would use a real parser; here the goal is to surface candidates fast, then verify the handful that matter, much like the manual-then-automated rhythm in process investigation.
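One cheap way to cut that noise is to drop candidates whose last label is a common file extension. The sketch below assumes a small, hand-picked deny-list; FILE_EXTENSION_SUFFIXES and filter_domains are hypothetical names, and the list is illustrative rather than complete. It deliberately leaves out extensions that are also real top-level domains, such as zip.

```python
# Illustrative deny-list (an assumption, not an authoritative source).
FILE_EXTENSION_SUFFIXES = {
    "txt", "log", "exe", "dll", "json", "xml", "tmp", "bat", "docx", "pdf",
}

def filter_domains(candidates):
    """Drop domain-shaped strings that end in a known file extension."""
    kept = set()
    for candidate in candidates:
        suffix = candidate.rsplit(".", 1)[-1].lower()
        if suffix not in FILE_EXTENSION_SUFFIXES:
            kept.add(candidate.lower())  # domains are case-insensitive
    return kept

print(sorted(filter_domains({"evil.com", "file.txt", "Payload.EXE", "c2.example.org"})))
```

A deny-list only removes the obvious misses. Checking candidates against a maintained list of real top-level domains would be stricter, at the cost of a dependency or a data file.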

Step 5 — Put it together into one extractor

The full pass refangs first, then runs each extractor over the cleaned text and returns a single structure you can serialise to JSON or hand to the next tool:

def extract_iocs(raw_text):
    text = refang(raw_text)
    iocs = {"ips": extract_ips(text)}
    iocs.update(extract_hashes(text))
    iocs.update(extract_urls_domains(text))
    # sets are not JSON-serialisable; convert when exporting
    return {k: sorted(v) for k, v in iocs.items()}

with open("alert.txt", "r", encoding="utf-8", errors="ignore") as f:
    for indicator_type, values in extract_iocs(f.read()).items():
        print(f"{indicator_type}: {len(values)} found")

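Opening with errors="ignore" matters on real log files, which routinely contain bytes that are not valid UTF-8 and would otherwise abort the read with a UnicodeDecodeError on the first bad byte. The trade-off is that those bytes are dropped silently, which is acceptable for triage.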
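To hand the result to another tool, the dictionary can be written out as JSON. This is a minimal sketch: export_iocs and the output path are hypothetical, and the function sorts each value again so it also accepts raw sets.

```python
import json

def export_iocs(iocs, path):
    """Write an indicator dict to disk as JSON and return what was written."""
    # sorted() accepts sets and lists alike and yields a JSON-friendly list
    serialisable = {kind: sorted(values) for kind, values in iocs.items()}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(serialisable, handle, indent=2, sort_keys=True)
    return serialisable
```

Sorting both the keys and the values keeps the file stable between runs, which makes it easy to diff two extractions of the same source.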

Frequently Asked Questions

Why validate IP addresses if the regex already matched them?

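Because the regex matches the shape of an address, not a valid one: 999.1.1.1 passes the pattern as easily as 8.8.8.8. The ipaddress module enforces the actual rules (four octets, each 0–255), so combining regex for candidates with ipaddress for validation removes the impossible addresses. Validation only sees the candidate string, though. From a version number like 1.2.3.4.5 the pattern still extracts 1.2.3.4, which is a perfectly valid address, so treat the results as candidates to review.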
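One gap that validation does not close is the dotted version string: the \b-based pattern pulls 1.2.3.4 out of 1.2.3.4.5, and ipaddress accepts it. A possible tightening, sketched here with lookarounds, refuses candidates that are glued to further digits or dotted groups. STRICT_IPV4 and strict_ipv4 are hypothetical names, and the rule is a heuristic rather than a guarantee.

```python
import ipaddress
import re

# Not preceded by a digit or dot, and not followed by a digit or ".digit".
# A trailing sentence full stop ("... from 8.8.8.8.") is still allowed.
STRICT_IPV4 = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\.?\d)")

def strict_ipv4(text):
    """Return valid IPv4 addresses that are not part of a longer dotted run."""
    valid = set()
    for candidate in STRICT_IPV4.findall(text):
        try:
            ipaddress.ip_address(candidate)  # still needed for the octet range
        except ValueError:
            continue
        valid.add(candidate)
    return valid

print(strict_ipv4("agent 1.2.3.4.5 resolved via 8.8.8.8. bogus 999.1.1.1"))
```

The lookarounds handle the structural problem and ipaddress handles the numeric one; neither check replaces the other.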

What does "refang" mean and why is it needed?

Threat reports defang indicators so they cannot be clicked or auto-resolved, writing hxxp for http and evil[.]com for evil.com. Refanging reverses those substitutions. If you skip it, your regex never matches the defanged indicators and you silently miss most of the report's IOCs.

Can regex reliably extract every domain from text?

No. Domains in free text are ambiguous — a strict pattern rejects valid edge cases and a loose one matches non-domains like file.txt. Regex extraction is a triage step that produces candidates to verify, not a guaranteed-correct list. For structured formats, use a proper parser instead.

How do I avoid duplicate indicators in the output?

Collect matches into a set rather than a list. The same hash or IP usually appears many times in one source, and a set deduplicates automatically. Convert to a sorted list only at the end, when you need a JSON-serialisable, stable order for output.

Conclusion

A regex IOC extractor turns unstructured text into a deduplicated indicator list in well under a hundred lines, with nothing outside the standard library. The two things that separate a useful extractor from a misleading one are refanging before you match and validating IPs after you match — skip either and the output looks fine while being quietly wrong.

The honest framing is that regex extraction is triage, not parsing. It is fast, dependency-free, and good at surfacing candidates from any blob of text, but it produces things to verify rather than facts to trust. Hashes it gets right; IPs it gets right once validated; domains and URLs it gets approximately right, which is usually enough to start the pivot.

From Logs to Threats: SIEM Correlation Rules for Real Attacks — what to do with indicators once you have extracted them.
Essential Windows Event IDs for Security Monitoring — the logs you will most often be extracting indicators from.
PowerShell Quick Guide: Process Investigation — the manual-then-automated rhythm of verifying what you find.

Editorial note: posts on this blog are drafted with AI assistance and then reviewed, edited, and tested against a real environment before publishing. Commands, output, and screenshots come from systems I actually ran the work on.

Security Scriptographer — PowerShell & Threat Hunting

Through Security Scriptographer, I transform complex security concepts into practical scripts and tutorials. Proficient in PowerShell, Python and various security frameworks, I'm here to help others enhance their security toolkit. Simple code, serious security. 🛡️
