// DOCUMENTATION · v0.1

How FindPy works.

Overview

An analyst's question is decomposed by a Planner LLM into a DAG of tasks. Each task is dispatched to a specialized Agent. Agents read and write a shared Evidence Graph. Every source is content-hashed and signed at ingest; every claim cites its sources. A Synthesizer rolls findings into an analyst brief — every claim ends with an evidence ID. A WebSocket hub streams progress to the dashboard live.
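
As a rough illustration of the planning step, the sketch below models a plan as a mapping from each task to its prerequisites and groups the tasks into waves that could be dispatched in parallel. The task names and the `plan` layout are hypothetical; the real Planner output format is whatever the agent contract specifies.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the set of tasks it depends on.
plan = {
    "web_crawler": set(),
    "news_rss": set(),
    "image_analyzer": {"web_crawler"},
    "credibility": {"web_crawler", "news_rss"},
    "narrative": {"credibility"},
}

def execution_waves(dag):
    """Group tasks into waves; every task in a wave has all prerequisites done."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = execution_waves(plan)
```

Each inner list is one batch of tasks with no unmet dependencies, which is the property an orchestrator needs before fanning work out to agents.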

Architecture

Analyst UI (Next.js)
  │ HTTP + WS
  ▼
FastAPI ── Orchestrator ──┬─► Planner agent
                          │
                          ├─► Specialized agents (parallel)
                          │     web · news · telegram · image · c2pa
                          │     deepfake · geo · sat · credibility
                          │     narrative · cib
                          │
                          └─► Evidence graph (SQLite / Neo4j)
                                + content-addressable artifact store
                                + signed envelopes

See ARCHITECTURE.md for component diagrams and the agent contract.

The 11 agents

planner

Decomposes the analyst query into a DAG of tasks. The only agent that *chooses* — sub-agents do what they are told.

web_crawler

DuckDuckGo HTML search + trafilatura article extraction. Air-gap fallback: local demo corpus.

news_rss

Sweeps curated RSS feeds and ranks items by semantic relevance to the query.

Telegram channel monitor built on Telethon (MTProto). Demo mode serves canned channel posts.

image_analyzer

EXIF + pHash + object hints; links each carrying Source to the Image node via MENTIONS edges.

c2pa

JUMBF box scanner. Reads Content Credentials manifests when present; absence is itself a signal.

deepfake_jury

Five-detector vote with weighted aggregation + explainable rationale per detector.

geolocator

EXIF-GPS first, gazetteer fallback (Hotan, Kashgar, Leh, Pangong, etc.). Production: sun-shadow inversion + landmark matching.

sat_imagery

Sentinel-2 STAC search via Earth Search v1 (Element 84's public STAC API over Copernicus data; no auth needed). Queries a bounding box over the last 180 days, filtered by cloud cover.
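
A minimal sketch of the kind of STAC item-search body such a query might send. The endpoint URL, the `sentinel-2-l2a` collection name, the cloud-cover threshold, and the example bounding box are assumptions for illustration, not values taken from the FindPy source.

```python
import json
from datetime import datetime, timedelta, timezone

# ASSUMPTION: the public Earth Search v1 STAC endpoint.
STAC_SEARCH_URL = "https://earth-search.aws.element84.com/v1/search"

def build_search_body(bbox, now, days=180, max_cloud=20, limit=10):
    """Build a STAC item-search body: bbox + time window + cloud-cover filter."""
    start = now - timedelta(days=days)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "collections": ["sentinel-2-l2a"],
        "bbox": list(bbox),  # [west, south, east, north] in degrees
        "datetime": f"{start.strftime(fmt)}/{now.strftime(fmt)}",
        "query": {"eo:cloud_cover": {"lt": max_cloud}},  # STAC query extension
        "limit": limit,
    }

body = build_search_body(
    bbox=(78.3, 33.6, 79.0, 34.0),  # illustrative coordinates only
    now=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
payload = json.dumps(body).encode("utf-8")  # POST as application/json
```

Passing `now` explicitly keeps the time window reproducible, which helps when an audit needs to replay the same search.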

credibility

Four-factor score per source: domain reputation × channel morphology × recency × corroboration count.
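
One way to read the four-factor product, sketched under stated assumptions: each factor is mapped into [0, 1] before multiplying. The decay half-life, the saturation constant, and the function name are hypothetical; the source only says the four factors are multiplied.

```python
import math

def credibility(domain_rep, channel_morph, age_hours, corroborations,
                half_life_hours=72.0, saturation=3.0):
    """Multiply four factors, each in [0, 1]. All constants are illustrative."""
    # Recency halves every half_life_hours.
    recency = 0.5 ** (age_hours / half_life_hours)
    # Corroboration count saturates toward 1 as independent sources accumulate.
    corroboration = 1.0 - math.exp(-corroborations / saturation)
    return domain_rep * channel_morph * recency * corroboration

score = credibility(domain_rep=0.8, channel_morph=0.5, age_hours=72, corroborations=3)
```

With a pure product, any factor at zero (for example an uncorroborated source under this mapping) zeroes the whole score, so a real implementation may well floor or smooth individual factors.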

narrative

Embedding-clusters sources; orders chronologically; flags credible-origin → low-cred-amplification threads.

cib

Coordinated inauthentic behavior: temporal co-posting · semantic near-duplication · burst-handle morphology · synthetic-content carriage.
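
Of the four CIB signals, temporal co-posting is the simplest to sketch. The version below counts how often two distinct handles post within a short window of each other; the window size, the hit threshold, and the `(handle, unix_ts)` input shape are assumptions for illustration.

```python
from collections import Counter

def co_posting_pairs(posts, window_s=60, min_hits=2):
    """posts: iterable of (handle, unix_ts).

    Return {(handle_a, handle_b): count} for pairs of distinct handles that
    posted within window_s seconds of each other at least min_hits times.
    """
    ordered = sorted(posts, key=lambda p: p[1])
    hits = Counter()
    for i, (h1, t1) in enumerate(ordered):
        for h2, t2 in ordered[i + 1:]:
            if t2 - t1 > window_s:
                break  # posts are time-sorted, so later ones are further away
            if h1 != h2:
                hits[tuple(sorted((h1, h2)))] += 1
    return {pair: n for pair, n in hits.items() if n >= min_hits}

posts = [("a", 0), ("b", 10), ("c", 500), ("a", 1000), ("b", 1030), ("c", 5000)]
pairs = co_posting_pairs(posts)
```

Requiring repeated hits rather than a single coincidence keeps two accounts that happened to post in the same minute once from being flagged.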

The forensic jury

The deepfake jury votes across five real algorithms with weighted aggregation and per-detector rationale. Confidence-gating prevents low-signal detectors from dominating; a single strong detector (≥ 0.80) is enough to flip a verdict — real OSINT triage behaviour.

DETECTOR	w	SIGNAL
ELA	0.15	Error Level Analysis — re-save at known JPEG quality, diff against original; splices leave higher residual energy.
JPEG ghost	0.15	Per-region recompression-quality minima. Splices show inconsistent ghost minima across regions.
shadow physics	0.20	Measured shadow azimuth (Sobel gradients) vs NOAA solar position at claimed lat/lon/time. Confidence-gated.
GAN fingerprint	0.20	Horizontal/vertical Laplacian variance imbalance — proxy for real GAN-trace CNN. Drop-in slot for ONNX DFDC model.
amplification pattern	0.30	OSINT-grade signal: pHash-duplicate carriers + credibility profile; flags credible-origin → low-cred bloom.

Hybrid verdict rule: synthetic if weighted score ≥ 0.55 OR any single detector ≥ 0.80 OR (≥ 2 detectors firing AND weighted ≥ 0.40); else suspect if ≥ 0.30 or any detector firing; else authentic.
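
The rule above can be sketched as a small function. The weights come from the detector table; the dictionary keys and the 0.50 "firing" threshold are assumptions, since the source does not define when a detector counts as firing, and confidence-gating is not modeled here.

```python
WEIGHTS = {  # weights from the detector table; key names are hypothetical
    "ela": 0.15,
    "jpeg_ghost": 0.15,
    "shadow_physics": 0.20,
    "gan_fingerprint": 0.20,
    "amplification_pattern": 0.30,
}
FIRING = 0.50  # ASSUMPTION: score at which a detector counts as "firing"

def verdict(scores):
    """scores: detector name -> score in [0, 1]. Returns (label, weighted score)."""
    weighted = sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
    strongest = max(scores.values(), default=0.0)
    firing = sum(1 for s in scores.values() if s >= FIRING)
    if weighted >= 0.55 or strongest >= 0.80 or (firing >= 2 and weighted >= 0.40):
        return "synthetic", weighted
    if weighted >= 0.30 or firing >= 1:
        return "suspect", weighted
    return "authentic", weighted

label, weighted = verdict({"ela": 0.9})  # one strong detector, low weighted score
```

In this example a lone ELA score of 0.9 yields a weighted score of only 0.135, yet the single-strong-detector clause still returns "synthetic", which is the override behaviour the section describes.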

Evidence graph

The graph is a property graph (SQLite in dev, Neo4j-ready in production):

Nodes:
  Investigation · Task · Source · Entity · Claim · Image · Finding

Edges:
  SUPPORTS · CONTRADICTS · CITES · MENTIONS · DERIVED_FROM
  PRODUCED_BY · DEPENDS_ON · LOCATED_AT
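
A property graph of this shape fits in two SQLite tables. The sketch below is one possible layout, not the project's actual schema: the table and column names, the JSON `props` column, and the Claim → Source direction of the CITES edge are all assumptions.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE nodes (
    id    TEXT PRIMARY KEY,
    kind  TEXT NOT NULL,               -- e.g. Source, Claim, Image
    props TEXT NOT NULL DEFAULT '{}'   -- JSON-encoded properties
);
CREATE TABLE edges (
    src  TEXT NOT NULL REFERENCES nodes(id),
    dst  TEXT NOT NULL REFERENCES nodes(id),
    kind TEXT NOT NULL,                -- e.g. CITES, SUPPORTS
    PRIMARY KEY (src, dst, kind)
);
"""

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.execute("INSERT INTO nodes VALUES (?, ?, ?)",
           ("src_1", "Source", json.dumps({"url": "https://example.org/a"})))
db.execute("INSERT INTO nodes VALUES (?, ?, ?)",
           ("clm_1", "Claim", json.dumps({"text": "example claim"})))
db.execute("INSERT INTO edges VALUES (?, ?, ?)", ("clm_1", "src_1", "CITES"))

# Which sources does a given claim cite?
cited = [row[0] for row in db.execute(
    "SELECT dst FROM edges WHERE src = ? AND kind = 'CITES'", ("clm_1",))]
```

Keeping node and edge kinds as plain strings means the same rows map directly onto Neo4j labels and relationship types if the store is swapped later.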

Provenance

Every Source carries a SHA-256 content hash, a producer-agent label, a retrieval timestamp, and an HMAC-SHA256 signature over the canonical JSON envelope. The signing key is generated on first run at data/.signing_key (mode 0600). Production swap: Ed25519. The Signer interface is identical.

# verify the entire investigation
curl https://api.findpy.com/api/audit/<investigation_id>

# expected:
{
  "summary": { "sources": 11, "signatures_verified": 11, "artifact_hashes_ok": 11 },
  "sources": [ { "id": "src_...", "signature_verified": true, ... }, ... ]
}
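
For readers who want to see what signing an envelope involves, here is a minimal sketch using only the standard library. The envelope field names and the canonical form (sorted keys, compact separators) are assumptions; the source states only that the signature is HMAC-SHA256 over a canonical JSON envelope.

```python
import hashlib
import hmac
import json

def canonical(envelope):
    """One possible canonical form: sorted keys, no insignificant whitespace."""
    return json.dumps(envelope, sort_keys=True, separators=(",", ":")).encode("utf-8")

def sign(envelope, key):
    """Hex HMAC-SHA256 over the canonical bytes of the envelope."""
    return hmac.new(key, canonical(envelope), hashlib.sha256).hexdigest()

def verify(envelope, signature, key):
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(sign(envelope, key), signature)

key = b"demo-key-not-for-production"  # stand-in for the key in data/.signing_key
content = b"example article body"
envelope = {  # hypothetical field names
    "content_sha256": hashlib.sha256(content).hexdigest(),
    "producer": "web_crawler",
    "retrieved_at": "2025-01-01T00:00:00Z",
}
signature = sign(envelope, key)
```

Canonicalizing before signing matters because two JSON encodings of the same envelope (different key order or spacing) would otherwise produce different signatures.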

API reference

method	path	description
POST	/api/investigate	Start an investigation. Returns {investigation_id}.
GET	/api/investigations	List all investigations.
GET	/api/investigations/{id}	Full snapshot: investigation, all nodes, all edges.
GET	/api/evidence/{node_id}	Single evidence node by id.
GET	/api/artifacts/{sha256}	Serve content-addressable artifact bytes.
GET	/api/audit/{id}	Re-verify every signed envelope + re-hash every artifact byte.
WS	/api/ws/investigation/{id}	Live event stream for an investigation.
GET	/api/health	Liveness + active LLM provider name.

Live OpenAPI spec: findpy-api.fly.dev/docs
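
A hedged client sketch for the two most common calls. It builds the requests without sending them. The `query` field name in the request body is an assumption (check the OpenAPI spec for the real schema), as is serving the API from the same host as the spec link.

```python
import json
import urllib.request

# ASSUMPTION: the API is served from the same host as the OpenAPI docs link.
BASE_URL = "https://findpy-api.fly.dev"

def start_request(query):
    """Build (but do not send) the POST that starts an investigation."""
    body = json.dumps({"query": query}).encode("utf-8")  # ASSUMPTION: field name
    return urllib.request.Request(
        f"{BASE_URL}/api/investigate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def snapshot_request(investigation_id):
    """Build the GET for the full snapshot of one investigation."""
    return urllib.request.Request(f"{BASE_URL}/api/investigations/{investigation_id}")

req = start_request("example question")
# To send: json.load(urllib.request.urlopen(req))["investigation_id"]
```

Separating request construction from sending keeps the sketch testable offline; pass the returned object to `urllib.request.urlopen` to actually call the service.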

Deployment

The recommended split:

Frontend → Vercel (free hobby tier; native Next.js).
Backend → Fly.io (free allowance covers the demo; supports WebSockets + persistent volume).
Optional: Cloudflare Tunnel for showing a local backend with a public URL.

Full deploy guide: DEPLOY.md

Stack & swap-in

Every dev component has a labeled production swap-in. The agent contract does not change — only the layer underneath.

layer	dev → production
LLM	Ollama qwen2.5:7b → Qwen2.5-72B on vLLM
evidence graph	SQLite → Neo4j cluster
embeddings	hashing-trick → BGE-M3 in Qdrant
deepfake CNNs	pure-PIL detectors → ONNX DFDC + AIGenImageDetector
signing	HMAC-SHA256 → Ed25519
Telegram	demo corpus → Telethon multi-account rotation
sat imagery	STAC discovery → STAC + band download + NDBI delta
multi-tenancy	none → OIDC + RBAC + structured audit log

// SOURCE CODE

github.com/pushanjain/findpy