Every firm that talks to a prime broker has run into this: SFTP ingestion was working fine, then suddenly stopped, then your account got temporarily blocked because something kept retrying every minute. This post is the runbook we wished existed.

The principles are the same whether you're using NiFi, Airflow, custom Python, or a vendor ingestion tool. The mistakes are universal.

Rule 1: Failures should back off, not hammer

The single most common mistake: scheduling a poll every minute, with no exponential backoff on failure. This works fine when things work. When auth fails (wrong key, rotated credentials, server-side change), your scheduler quietly turns into a brute-force attack.

Prime broker SSH gateways have aggressive perimeter security. They're configured to assume that hundreds of failed auth attempts from the same source IP are hostile. They're not wrong.

Use exponential backoff with jitter. Something like:

delay = base * (2 ** attempt) + jitter
delay = min(delay, max_delay)   # cap at 30 minutes

And (this is the part people skip) after N consecutive failures, stop entirely and alert. Don't keep retrying forever. The fact that you've failed 50 times in a row is information, not noise.
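A minimal Python sketch of both ideas together, capped backoff plus a hard stop after N consecutive failures. The names `poll_with_backoff`, `TooManyFailures`, and the default values (60-second base, 5 failures) are illustrative assumptions, not settings from any particular tool; `poll_once` stands in for whatever function performs your SFTP pull.

```python
import random
import time


class TooManyFailures(Exception):
    """Raised when the retry budget is exhausted; the caller should alert."""


def backoff_delay(attempt, base=60.0, max_delay=1800.0, rng=random):
    """Exponential backoff with jitter, capped at max_delay seconds (30 min)."""
    # Cap the exponential term, add up to `base` seconds of random jitter,
    # then cap again so the result never exceeds max_delay.
    exponential = min(base * (2 ** attempt), max_delay)
    return min(exponential + rng.uniform(0, base), max_delay)


def poll_with_backoff(poll_once, max_failures=5, sleep=time.sleep):
    """Call poll_once until it succeeds; give up after max_failures in a row."""
    last_error = None
    for attempt in range(max_failures):
        try:
            return poll_once()
        except Exception as exc:  # assumption: narrow this to your SSH library's errors
            last_error = exc
            if attempt < max_failures - 1:
                sleep(backoff_delay(attempt))
    # Stop entirely: the caller decides how to alert and when to re-enable.
    raise TooManyFailures(f"{max_failures} consecutive failures") from last_error
```

The scheduler should treat `TooManyFailures` as a reason to disable the job and alert, not as one more failure to retry on the next tick.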

Rule 2: Distinguish "auth failed" from "couldn't connect"

These look similar in logs but they require very different actions.

| Symptom                | Likely cause                                    | Action                                  |
|------------------------|-------------------------------------------------|-----------------------------------------|
| Connection timeout     | IP not allowlisted, network change, firewall    | Check egress IP, contact PB             |
| Connection refused     | Server-side firewall block, possibly auto-block | Stop retrying, contact PB               |
| Auth methods exhausted | Wrong key, wrong username, key rotation         | Verify config, do not retry blindly     |
| Host key mismatch      | PB rotated server host keys                     | Update known_hosts, do not auto-accept  |
| No matching algorithm  | Client outdated, PB upgraded crypto             | Upgrade SSH client library              |

Different exceptions should trigger different alerting paths. Auth-exhausted at 3am should page someone. Connection timeout for 30 seconds shouldn't.
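One way to wire this up is a small classifier that maps each exception type to an action, sketched here in Python. The three SSH exception classes are hypothetical stand-ins for whatever your SSH library actually raises, and treating unknown errors as "stop and page" is an assumption on the cautious side, not a rule from any broker.

```python
import enum
import socket


class Action(enum.Enum):
    RETRY_WITH_BACKOFF = enum.auto()        # transient; check egress IP if it persists
    STOP_AND_CONTACT_PB = enum.auto()       # likely blocked server-side
    STOP_AND_PAGE = enum.auto()             # verify config, do not retry blindly
    STOP_AND_UPDATE_HOST_KEY = enum.auto()  # update known_hosts by hand
    STOP_AND_UPGRADE_CLIENT = enum.auto()   # upgrade SSH client library


# Hypothetical stand-ins: replace with your SSH library's exception classes.
class AuthExhausted(Exception):
    pass


class HostKeyMismatch(Exception):
    pass


class NoMatchingAlgorithm(Exception):
    pass


# Order matters: the first matching rule wins.
_RULES = [
    (ConnectionRefusedError, Action.STOP_AND_CONTACT_PB),
    ((socket.timeout, TimeoutError), Action.RETRY_WITH_BACKOFF),
    (AuthExhausted, Action.STOP_AND_PAGE),
    (HostKeyMismatch, Action.STOP_AND_UPDATE_HOST_KEY),
    (NoMatchingAlgorithm, Action.STOP_AND_UPGRADE_CLIENT),
]


def classify(exc):
    """Map a connection failure to the action from the table above."""
    for exc_types, action in _RULES:
        if isinstance(exc, exc_types):
            return action
    # Assumption: anything unrecognised is treated as serious, not retried.
    return Action.STOP_AND_PAGE
```

Only `RETRY_WITH_BACKOFF` should ever feed back into the retry loop; every other action ends the run.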

Rule 3: Allowlists are forever (until they aren't)

Most PB SFTPs allowlist by source IP. If your ingestion runs from a cloud VM, your egress IP is stable until it isn't. Things that quietly change egress IPs:

Mitigation: route your SFTP egress through a NAT with a fixed Elastic IP (or equivalent on GCP/Azure). Reserve the IP. Document it in your PB onboarding so you can request changes formally rather than emergency-calling support.

Architecture note: For multi-region or HA setups, ask your PB to allowlist multiple IPs at onboarding rather than after an outage. Most PBs accept 2–3 source IPs per environment. It's a 10-minute conversation upfront and prevents a 10-hour outage later.
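A pre-flight check can catch a changed egress IP before the first connection attempt, instead of discovering it through timeouts. The Python sketch below assumes you already have the observed egress IP from somewhere (how you obtain it is left out and depends on your cloud setup) and that you keep the documented allowlist in config. The addresses are from the reserved documentation ranges, not real broker values.

```python
import ipaddress


def egress_ip_allowed(observed_ip, allowlisted):
    """Return True if observed_ip matches any allowlisted address or CIDR range."""
    observed = ipaddress.ip_address(observed_ip)
    return any(
        # A bare address such as "203.0.113.10" is treated as a /32 network.
        observed in ipaddress.ip_network(entry, strict=False)
        for entry in allowlisted
    )


# Hypothetical allowlist, as documented at PB onboarding.
ALLOWLISTED = ["203.0.113.10", "198.51.100.0/29"]

if not egress_ip_allowed("203.0.113.10", ALLOWLISTED):
    raise SystemExit("egress IP is not on the PB allowlist; not connecting")
```

Refusing to connect from an unexpected IP costs one missed poll; connecting anyway costs a string of timeouts that look like an outage.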

Rule 4: Keys belong in a secrets manager, not on disk

Long-form opinion: storing SSH private keys as files on the ingestion host is a security and operational risk. Files get accidentally committed, copied to backups, embedded in container images. When you rotate the key, you have to find every copy.

Better pattern: store the key in AWS Secrets Manager / GCP Secret Manager / HashiCorp Vault. Mount it at runtime, never write it to durable storage. Rotation becomes a single API call.
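A minimal sketch of the in-memory pattern in Python. `fetch_secret` is a hypothetical callable wrapping whichever secrets manager client you use, and the secret name is made up; the point is only that the key goes from the API response into a file-like object without touching disk.

```python
import io

PEM_HEADER = "-----BEGIN"


def load_private_key(fetch_secret, secret_name):
    """Return the private key as an in-memory file object; nothing is written to disk."""
    key_text = fetch_secret(secret_name)
    if not key_text.lstrip().startswith(PEM_HEADER):
        # Report the secret's name only; never put key material in an error message.
        raise ValueError(f"secret {secret_name!r} does not look like a PEM/OpenSSH key")
    return io.StringIO(key_text)
```

Some SSH libraries can load a key straight from a file-like object (Paramiko's `from_private_key`, for example), so the returned `StringIO` can be handed over as-is. Because the key is fetched on each run, a rotated key is picked up on the next run with no host changes.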

Rule 5: Monitor for "success" as well as "failure"

Most monitoring alerts on failures. Equally important: alert when expected files don't arrive. Silent failure modes:

Layer a "did the expected file arrive by 7am?" check on top of the connection check. Same alerting path, different question.
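The arrival check can be a few lines of Python. This is a sketch under assumptions: the file names are hypothetical, `arrived` is whatever listing your ingestion already produces, and `now` is a naive datetime expressed in the same timezone as the delivery deadline.

```python
import datetime as dt


def missing_files(expected, arrived, now, deadline=dt.time(7, 0)):
    """Return expected file names that have not arrived once the deadline has passed."""
    if now.time() < deadline:
        return []  # too early to judge; no alert yet
    return sorted(set(expected) - set(arrived))


# Hypothetical daily deliverables from the PB.
expected = ["positions.csv", "trades.csv"]
late = missing_files(expected, ["trades.csv"], dt.datetime(2024, 1, 2, 7, 5))
if late:
    print("ALERT: expected files missing:", late)
```

A non-empty result goes to the same alerting path as a connection failure, since a healthy connection that delivers nothing is still an incident.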

Rule 6: When you get locked out, stop. Then call

If your ingestion is in a failure loop and your IP gets blocked, the worst thing you can do is keep retrying while you also call support. Every additional attempt extends the block window and risks escalating from auto-clear to manual-clear.

The sequence:

  1. Stop the scheduler. Disable the job, don't just pause it.
  2. Confirm nothing else on the host is also retrying (other connectors, monitoring probes).
  3. Try a single manual ssh -vvv from a different host if possible, to confirm whether the block is IP-scoped or account-scoped.
  4. Call PB support with: your account ID, source IP, username, approximate start/stop times of the failure storm.
  5. Wait for confirmation before re-enabling.
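Step 4 goes faster if the details are assembled before you pick up the phone. A small Python sketch, assuming your ingestion records a timestamp for each failed attempt; the function name and field names are illustrative, and the example values are placeholders.

```python
import datetime as dt


def summarize_failure_storm(attempt_times, account_id, source_ip, username):
    """Collect the details PB support will ask for from recorded failed attempts."""
    if not attempt_times:
        raise ValueError("no failed attempts recorded")
    return {
        "account_id": account_id,
        "source_ip": source_ip,
        "username": username,
        "first_failure": min(attempt_times).isoformat(),
        "last_failure": max(attempt_times).isoformat(),
        "failed_attempts": len(attempt_times),
    }


# Placeholder values: one failed attempt per minute for 45 minutes.
start = dt.datetime(2024, 1, 2, 3, 0)
attempts = [start + dt.timedelta(minutes=i) for i in range(45)]
summary = summarize_failure_storm(attempts, "ACCT-EXAMPLE", "203.0.113.10", "svc_ingest")
```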

Most PBs will reset the block within an hour during business hours, once they verify it was you. Outside business hours, you're waiting until morning. Don't make it worse.


ForeStrat's DataStrat module ships these patterns by default: retry policy, monitoring, secret management. If you're tired of writing this yourself, email demo@forestrat.ai.