Every firm that talks to a prime broker has run into this: SFTP ingestion was working fine, then suddenly stopped, then your account got temporarily blocked because something kept retrying every minute. This post is the runbook we wished existed.
The principles are the same whether you're using NiFi, Airflow, custom Python, or a vendor ingestion tool. The mistakes are universal.
Rule 1: Failures should back off, not hammer
The single most common mistake: scheduling a poll every minute with no exponential backoff on failure. This is fine while things work. When auth fails (wrong key, rotated credentials, server-side change), your scheduler quietly turns into a brute-force attack.
Prime broker SSH gateways have aggressive perimeter security. They're configured to assume that hundreds of failed auth attempts from the same source IP are hostile. They're not wrong.
Use exponential backoff with jitter. Something like:
delay = base * (2 ** attempt) + jitter
delay = min(delay, max_delay) # cap at 30 minutes
And (this is the part people skip) after N consecutive failures, stop entirely and alert. Don't keep retrying forever. The fact that you've failed 50 times in a row is information, not noise.
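A minimal Python sketch of the two ideas together: capped exponential backoff with jitter, plus a hard stop after N consecutive failures. The names `backoff_delay`, `poll_with_backoff`, and `TooManyFailures` are hypothetical, and the 60-second base, 30-minute cap, and limit of 5 failures are assumptions to tune, not recommendations from any PB.

```python
import random
import time
from typing import Callable, Optional


class TooManyFailures(Exception):
    """Raised when the consecutive-failure limit is reached."""


def backoff_delay(attempt: int, base: float = 60.0, max_delay: float = 1800.0) -> float:
    """Delay in seconds for a 0-indexed attempt: exponential plus jitter, capped."""
    delay = base * (2 ** attempt)
    jitter = random.uniform(0, base)  # assumption: up to one base interval of jitter
    return min(delay + jitter, max_delay)  # cap at 30 minutes by default


def poll_with_backoff(
    poll: Callable[[], None],
    max_failures: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    alert: Callable[[str], None] = print,
) -> None:
    """Call poll() until it succeeds; after max_failures in a row, alert and stop."""
    last_error: Optional[BaseException] = None
    for attempt in range(max_failures):
        try:
            poll()
            return
        except Exception as exc:  # narrow this to your SFTP client's exceptions
            last_error = exc
            if attempt < max_failures - 1:
                sleep(backoff_delay(attempt))
    alert(f"SFTP poll failed {max_failures} times in a row: {last_error!r}")
    raise TooManyFailures(repr(last_error))
```

Passing `sleep` and `alert` in as parameters keeps the retry policy testable without real waiting, and lets the alert go to whatever paging path you already use.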
Rule 2: Distinguish "auth failed" from "couldn't connect"
These look similar in logs but they require very different actions.
| Symptom | Likely cause | Action |
|---|---|---|
| Connection timeout | IP not allowlisted, network change, firewall | Check egress IP, contact PB |
| Connection refused | Server-side firewall block, possibly auto-block | Stop retrying, contact PB |
| Auth methods exhausted | Wrong key, wrong username, key rotation | Verify config, do not retry blindly |
| Host key mismatch | PB rotated server host keys | Update known_hosts, do not auto-accept |
| No matching algorithm | Client outdated, PB upgraded crypto | Upgrade SSH client library |
Different exceptions should trigger different alerting paths. Auth-exhausted at 3am should page someone. Connection timeout for 30 seconds shouldn't.
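One way to wire the table into code is a small classifier that maps each exception to an action. This is a sketch under assumptions: the `Action` enum and `classify` function are hypothetical, and the SSH-level class names in `SSH_FAILURES` are modeled on paramiko's exception names, so check them against whatever client library you actually use.

```python
from enum import Enum


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry with backoff; alert (no page) if it persists"
    STOP_AND_CONTACT_PB = "stop retrying; contact PB"
    STOP_AND_PAGE = "stop retrying; page on-call to verify config"
    STOP_AND_VERIFY_HOST_KEY = "stop; verify the new host key, then update known_hosts"
    STOP_AND_UPGRADE_CLIENT = "stop; upgrade the SSH client library"


# SSH-level failures keyed by exception class name.
# Assumed names (paramiko-style); adjust to your library.
SSH_FAILURES = {
    "AuthenticationException": Action.STOP_AND_PAGE,
    "BadHostKeyException": Action.STOP_AND_VERIFY_HOST_KEY,
    "IncompatiblePeer": Action.STOP_AND_UPGRADE_CLIENT,
}


def classify(exc: BaseException) -> Action:
    """Map an exception from a failed SFTP poll to the action it calls for."""
    # Walk the class hierarchy so subclasses of a known failure also match.
    for cls in type(exc).__mro__:
        if cls.__name__ in SSH_FAILURES:
            return SSH_FAILURES[cls.__name__]
    if isinstance(exc, ConnectionRefusedError):
        return Action.STOP_AND_CONTACT_PB
    if isinstance(exc, TimeoutError):  # socket.timeout is an alias on Python 3.10+
        return Action.RETRY_WITH_BACKOFF
    # Anything unrecognized: fail safe rather than retry blindly.
    return Action.STOP_AND_PAGE
```

Matching on class names keeps the sketch free of third-party imports; in real code you would more likely catch the library's exception classes directly.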
Rule 3: Allowlists are forever (until they aren't)
Most PB SFTPs allowlist by source IP. If your ingestion runs from a cloud VM, your egress IP is stable until it isn't. Things that quietly change egress IPs:
- Auto-scaling groups (your traffic egresses from whichever instance happens to handle it)
- NAT gateway replacement during a VPC migration
- Kubernetes node rotation if you don't pin egress to a NAT
- VPN provider changing endpoints
Mitigation: route your SFTP egress through a NAT with a fixed Elastic IP (or equivalent on GCP/Azure). Reserve the IP. Document it in your PB onboarding so you can request changes formally rather than emergency-calling support.
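A cheap pre-flight guard is to compare the egress IP you currently observe against what the PB has on file before you attempt a connection. The sketch below assumes you can obtain the observed IP from your cloud provider or network tooling (how you do that is out of scope here), and `egress_is_allowlisted` is a hypothetical helper name.

```python
import ipaddress
from typing import List


def egress_is_allowlisted(observed_ip: str, allowlisted: List[str]) -> bool:
    """True if observed_ip matches any allowlisted address or CIDR range."""
    addr = ipaddress.ip_address(observed_ip)
    return any(
        # strict=False tolerates entries like "198.51.100.7/24" with host bits set
        addr in ipaddress.ip_network(entry, strict=False)
        for entry in allowlisted
    )
```

If this returns `False`, skip the poll and alert: a connection attempt from the wrong IP will only time out, and you already know why.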
Rule 4: Keys belong in a secrets manager, not on disk
Long-form opinion: storing SSH private keys as files on the ingestion host is a security and operational risk. Files get accidentally committed, copied to backups, embedded in container images. When you rotate the key, you have to find every copy.
Better pattern: store the key in AWS Secrets Manager / GCP Secret Manager / HashiCorp Vault. Load it at runtime and never write it to durable storage. Rotation then means updating one secret instead of hunting down every copy.
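A rough sketch of the runtime-load pattern using AWS Secrets Manager via boto3. The function names and the idea of injecting a `fetch` callable are illustrative assumptions, and the secret is assumed to hold the private key as a plain string. The key is returned as an in-memory file object, which SSH libraries that accept file-like key sources (paramiko's `from_private_key`, for example) can parse without the key touching disk.

```python
import io
from typing import Callable


def fetch_from_aws(secret_id: str) -> str:
    """Fetch a secret string from AWS Secrets Manager (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_id)["SecretString"]


def load_key_in_memory(
    secret_id: str, fetch: Callable[[str], str] = fetch_from_aws
) -> io.StringIO:
    """Return the private key as an in-memory file object; nothing is written to disk."""
    key_text = fetch(secret_id)
    # Cheap sanity check: PEM and OpenSSH private keys both carry this marker.
    if "PRIVATE KEY" not in key_text:
        raise ValueError(f"secret {secret_id!r} does not look like a private key")
    return io.StringIO(key_text)
```

Swapping `fetch` for a GCP or Vault equivalent leaves the rest of the ingestion code unchanged.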
Rule 5: Monitor for "success" as well as "failure"
Most monitoring alerts on failures. Equally important: alert when expected files don't arrive. Silent failure modes:
- SFTP poll succeeds, but the PB's process that writes the file is broken, so you cheerfully list an empty directory every minute
- File arrives but is zero bytes
- File arrives with yesterday's date because their job ran but pointed at stale data
Layer a "did the expected file arrive by 7am?" check on top of the connection check. Same alerting path, different question.
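The three silent failure modes above can be checked in one pass over a directory listing. This is a sketch under assumptions: the `RemoteFile` shape, the `positions_YYYYMMDD.csv` naming pattern, and the `as_of` field (the data date you parse out of the file's contents) are all hypothetical and will differ per PB.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class RemoteFile:
    name: str
    size: int                      # bytes, as reported by the SFTP listing
    as_of: Optional[date] = None   # assumption: data date parsed from the file itself


def check_arrival(
    listing: List[RemoteFile],
    business_date: date,
    pattern: str = "positions_{:%Y%m%d}.csv",  # hypothetical naming convention
) -> Optional[str]:
    """Return a problem description, or None if the expected file looks healthy."""
    expected = pattern.format(business_date)
    by_name = {f.name: f for f in listing}
    if expected not in by_name:
        return f"missing: {expected}"
    found = by_name[expected]
    if found.size == 0:
        return f"empty: {expected}"
    if found.as_of is not None and found.as_of != business_date:
        return f"stale: {expected} contains data for {found.as_of}"
    return None
```

Run it at the deadline (7am in the example above) and send any non-`None` result down the same alerting path as connection failures.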
Rule 6: When you get locked out, stop, then call
If your ingestion is in a failure loop and your IP gets blocked, the worst thing you can do is keep retrying while you also call support. Every additional attempt extends the block window and risks escalating from auto-clear to manual-clear.
The sequence:
- Stop the scheduler. Disable the job, don't just pause it.
- Confirm nothing else on the host is also retrying (other connectors, monitoring probes).
- Try a single manual `ssh -vvv` from a different host if possible, to confirm whether the block is IP-scoped or account-scoped.
- Call PB support with: your account ID, source IP, username, approximate start/stop times of the failure storm.
- Wait for confirmation before re-enabling.
Most PBs will reset the block within an hour during business hours, once they verify it was you. Outside business hours, you're waiting until morning. Don't make it worse.
ForeStrat's DataStrat module ships these patterns by default: retry policy, monitoring, secret management. If you're tired of writing this yourself, email demo@forestrat.ai.