FindPy v0.1
// DOCUMENTATION · v0.1

How FindPy works.

Overview

An analyst's question is decomposed by a Planner LLM into a DAG of tasks. Each task is dispatched to a specialized Agent. Agents read and write a shared Evidence Graph. Every source is content-hashed and signed at ingest; every claim cites its sources. A Synthesizer rolls findings into an analyst brief — every claim ends with an evidence ID. A WebSocket hub streams progress to the dashboard live.
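As a rough illustration of how a planned DAG could be turned into dispatch batches, here is a minimal sketch using Python's standard `graphlib`. The task names and the `dispatch_order` helper are hypothetical and are not taken from the FindPy codebase; the real orchestrator may schedule tasks differently.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: task id -> set of task ids it depends on.
plan = {
    "web": set(),
    "news": set(),
    "credibility": {"web", "news"},
    "narrative": {"credibility"},
}

def dispatch_order(dag):
    """Return batches of task ids; every task in a batch has all dependencies met."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that could run in parallel
        batches.append(ready)
        sorter.done(*ready)                 # mark the batch finished
    return batches

batches = dispatch_order(plan)
```

Tasks inside one batch have no dependencies on each other, so they are the natural candidates for parallel dispatch to specialized agents.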

Architecture

Analyst UI (Next.js)
  │ HTTP + WS
  ▼
FastAPI ── Orchestrator ──┬─► Planner agent
                          │
                          ├─► Specialized agents (parallel)
                          │     web · news · telegram · image · c2pa
                          │     deepfake · geo · sat · credibility
                          │     narrative · cib
                          │
                          └─► Evidence graph (SQLite / Neo4j)
                                + content-addressable artifact store
                                + signed envelopes

See ARCHITECTURE.md for component diagrams and the agent contract.

The agents (planner + 11 specialized)

planner
Decomposes the analyst query into a DAG of tasks. The only agent that *chooses* — sub-agents do what they are told.
web_crawler
DuckDuckGo HTML search + trafilatura article extraction. Air-gap fallback: local demo corpus.
news_rss
Sweeps curated RSS feeds, semantic-ranks items by relevance to the query.
telegram
Telethon (MTProto) channel monitor. Demo mode serves canned channel posts.
image_analyzer
EXIF + pHash + object hints; links each Source that carries the image to its Image node via MENTIONS edges.
c2pa
JUMBF box scanner. Reads Content Credentials manifests when present; absence is itself a signal.
deepfake_jury
Five-detector vote with weighted aggregation + explainable rationale per detector.
geolocator
EXIF-GPS first, gazetteer fallback (Hotan, Kashgar, Leh, Pangong, etc.). Production: sun-shadow inversion + landmark matching.
sat_imagery
Sentinel-2 (Copernicus) STAC search via Earth Search v1 (no auth needed); bounding-box query over the last 180 days, filtered by cloud cover.
credibility
Four-factor score per source: domain reputation × channel morphology × recency × corroboration count.
narrative
Embedding-clusters sources; orders chronologically; flags credible-origin → low-cred-amplification threads.
cib
Coordinated inauthentic behavior: temporal co-posting · semantic near-duplication · burst-handle morphology · synthetic-content carriage.
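The credibility agent's four-factor product can be sketched as follows. This is an illustration under stated assumptions, not the shipped scoring code: the first three factors are assumed to be pre-normalised to [0, 1], and `corroboration_factor` with its `saturation` parameter is a hypothetical way to map a count onto the same range.

```python
from dataclasses import dataclass

@dataclass
class SourceSignals:
    domain_reputation: float   # assumed to lie in [0, 1]
    channel_morphology: float  # assumed to lie in [0, 1]
    recency: float             # assumed to lie in [0, 1]
    corroboration_count: int   # number of other sources backing the same claim

def corroboration_factor(count, saturation=3):
    # Hypothetical normalisation: 0 corroborations -> 0.25, capped at 1.0
    # once `saturation` corroborating sources exist.
    return min(1.0, (count + 1) / (saturation + 1))

def credibility(signals):
    """Multiply the four factors; one weak factor drags the whole score down."""
    return (
        signals.domain_reputation
        * signals.channel_morphology
        * signals.recency
        * corroboration_factor(signals.corroboration_count)
    )
```

Because the factors are multiplied rather than averaged, a source with an excellent domain reputation but no corroboration still ends up with a low score.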

The forensic jury

The deepfake jury votes across five real algorithms with weighted aggregation and per-detector rationale. Confidence-gating prevents low-signal detectors from dominating; a single strong detector (≥ 0.80) is enough to flip a verdict — real OSINT triage behaviour.

DETECTOR               w     SIGNAL
ELA                    0.15  Error Level Analysis: re-save at known JPEG quality, diff against original; splices leave higher residual energy.
JPEG ghost             0.15  Per-region recompression-quality minima. Splices show inconsistent ghost minima across regions.
shadow physics         0.20  Measured shadow azimuth (Sobel gradients) vs NOAA solar position at claimed lat/lon/time. Confidence-gated.
GAN fingerprint        0.20  Horizontal/vertical Laplacian variance imbalance; proxy for a real GAN-trace CNN. Drop-in slot for ONNX DFDC model.
amplification pattern  0.30  OSINT-grade signal: pHash-duplicate carriers + credibility profile; flags credible-origin → low-cred bloom.

Hybrid verdict rule: synthetic if weighted score ≥ 0.55 OR any single detector ≥ 0.80 OR (≥ 2 detectors firing AND weighted ≥ 0.40); else suspect if ≥ 0.30 or any detector firing; else authentic.
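The hybrid rule above can be written out as a short function. This is a sketch, not the project's implementation: the weights come from the detector table, but `FIRE_THRESHOLD` is an assumption (the documentation does not define when a detector counts as "firing"), and confidence-gated detectors are assumed to be omitted from the input or passed as 0.

```python
# Weights from the detector table; they sum to 1.0.
WEIGHTS = {
    "ela": 0.15,
    "jpeg_ghost": 0.15,
    "shadow_physics": 0.20,
    "gan_fingerprint": 0.20,
    "amplification_pattern": 0.30,
}
FIRE_THRESHOLD = 0.5  # assumption: a detector "fires" at or above this score

def verdict(scores):
    """scores maps detector name -> score in [0, 1]; missing detectors count as 0."""
    values = {name: scores.get(name, 0.0) for name in WEIGHTS}
    weighted = sum(WEIGHTS[name] * value for name, value in values.items())
    strongest = max(values.values())
    firing = sum(1 for value in values.values() if value >= FIRE_THRESHOLD)

    if weighted >= 0.55 or strongest >= 0.80 or (firing >= 2 and weighted >= 0.40):
        return "synthetic"
    if weighted >= 0.30 or firing >= 1:
        return "suspect"
    return "authentic"
```

For example, `verdict({"ela": 0.9})` yields "synthetic" through the single-strong-detector clause even though the weighted score is only 0.135, which is the behaviour the rule is designed to allow.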

Evidence graph

The graph is a property graph (SQLite in dev, Neo4j-ready in production):

Nodes:
  Investigation · Task · Source · Entity · Claim · Image · Finding

Edges:
  SUPPORTS · CONTRADICTS · CITES · MENTIONS · DERIVED_FROM
  PRODUCED_BY · DEPENDS_ON · LOCATED_AT
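To make the node and edge vocabulary concrete, here is one possible way to hold such a property graph in SQLite. The two-table schema, the helper names, and the Claim → Source direction of CITES are assumptions for illustration; the actual FindPy schema may differ.

```python
import json
import sqlite3

NODE_TYPES = {"Investigation", "Task", "Source", "Entity", "Claim", "Image", "Finding"}
EDGE_TYPES = {"SUPPORTS", "CONTRADICTS", "CITES", "MENTIONS", "DERIVED_FROM",
              "PRODUCED_BY", "DEPENDS_ON", "LOCATED_AT"}

def open_graph(path=":memory:"):
    """Create (or open) a graph database with hypothetical nodes/edges tables."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS nodes (
            id TEXT PRIMARY KEY, type TEXT NOT NULL, props TEXT NOT NULL);
        CREATE TABLE IF NOT EXISTS edges (
            src TEXT NOT NULL, dst TEXT NOT NULL, type TEXT NOT NULL,
            PRIMARY KEY (src, dst, type));
    """)
    return db

def add_node(db, node_id, node_type, **props):
    if node_type not in NODE_TYPES:
        raise ValueError(f"unknown node type: {node_type}")
    db.execute("INSERT INTO nodes VALUES (?, ?, ?)",
               (node_id, node_type, json.dumps(props)))

def add_edge(db, src, dst, edge_type):
    if edge_type not in EDGE_TYPES:
        raise ValueError(f"unknown edge type: {edge_type}")
    db.execute("INSERT INTO edges VALUES (?, ?, ?)", (src, dst, edge_type))

def cited_sources(db, claim_id):
    """Return ids of the nodes a claim cites (assumes CITES points Claim -> Source)."""
    rows = db.execute(
        "SELECT dst FROM edges WHERE src = ? AND type = 'CITES' ORDER BY dst",
        (claim_id,))
    return [row[0] for row in rows]
```

Keeping properties as a JSON blob leaves the node shape open, which is one way to stay close to what a later Neo4j migration would store.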

Provenance

Every Source carries a SHA-256 content hash, a producer-agent label, a retrieval timestamp, and an HMAC-SHA256 signature over the canonical JSON envelope. The signing key is generated on first run at data/.signing_key (mode 0600). Production swap: Ed25519; the Signer interface is identical.

# verify the entire investigation
curl https://api.findpy.com/api/audit/<investigation_id>

# expected:
{
  "summary": { "sources": 11, "signatures_verified": 11, "artifact_hashes_ok": 11 },
  "sources": [ { "id": "src_...", "signature_verified": true, ... }, ... ]
}
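The signing and re-verification described in this section can be approximated with the Python standard library. This is a minimal sketch under assumptions: the envelope field names are hypothetical, and "canonical JSON" is taken here to mean sorted keys with compact separators, which may not match FindPy's exact canonical form.

```python
import hashlib
import hmac
import json

def canonical_json(envelope):
    # Assumption: canonical form = sorted keys, compact separators, UTF-8 bytes.
    return json.dumps(envelope, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def sign(envelope, key):
    """Return the hex HMAC-SHA256 of the canonical envelope."""
    return hmac.new(key, canonical_json(envelope), hashlib.sha256).hexdigest()

def verify(envelope, signature, key):
    """Constant-time comparison of the recomputed signature with the stored one."""
    return hmac.compare_digest(sign(envelope, key), signature)

# Demo values only; a real key would be read from data/.signing_key.
key = b"demo-key-not-for-production"
content = b"example article body"
envelope = {
    "content_sha256": hashlib.sha256(content).hexdigest(),  # hypothetical field name
    "producer": "web_crawler",                              # hypothetical field name
    "retrieved_at": "2025-01-01T00:00:00Z",                 # hypothetical field name
}
signature = sign(envelope, key)
```

An audit pass would then recompute the content hash from the stored artifact bytes and call `verify` on each envelope; changing any field, or the key order-independent content, invalidates the signature.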

API reference

method  path                          description
POST    /api/investigate              Start an investigation. Returns {investigation_id}.
GET     /api/investigations           List all investigations.
GET     /api/investigations/{id}      Full snapshot: investigation, all nodes, all edges.
GET     /api/evidence/{node_id}       Single evidence node by id.
GET     /api/artifacts/{sha256}       Serve content-addressable artifact bytes.
GET     /api/audit/{id}               Re-verify every signed envelope + re-hash every artifact byte.
WS      /api/ws/investigation/{id}    Live event stream for an investigation.
GET     /api/health                   Liveness + active LLM provider name.

Live OpenAPI spec: findpy-api.fly.dev/docs

Deployment

The recommended split:

  • Frontend → Vercel (free hobby tier; native Next.js).
  • Backend → Fly.io (free allowance covers the demo; supports WebSockets + persistent volume).
  • Optional: Cloudflare Tunnel for showing a local backend with a public URL.

Full deploy guide: DEPLOY.md

Stack & swap-in

Every dev component has a labeled production swap-in. The agent contract does not change — only the layer underneath.

layer           dev → production
LLM             Ollama qwen2.5:7b → Qwen2.5-72B on vLLM
evidence graph  SQLite → Neo4j cluster
embeddings      hashing-trick → BGE-M3 in Qdrant
deepfake CNNs   pure-PIL detectors → ONNX DFDC + AIGenImageDetector
signing         HMAC-SHA256 → Ed25519
Telegram        demo corpus → Telethon multi-account rotation
sat imagery     STAC discovery → STAC + band download + NDBI delta
multi-tenancy   none → OIDC + RBAC + structured audit log