Data Sources

This page lists every threat intelligence and reputation data source used by ScamVerify™, including what each provides, how often it is updated, and what it covers.

ScamVerify™ aggregates data from 18+ independent sources to produce risk assessments. No single source is trusted in isolation. The scoring engine cross-references multiple signals to reduce false positives and catch threats that any individual source might miss.

FTC Do Not Call Complaints

What it provides: Consumer complaint records filed with the Federal Trade Commission against phone numbers that violated the National Do Not Call Registry. Each record includes the complaint date, subject category (robocall, live caller, etc.), and whether the consumer reported a robocall.

Coverage: 2.79 million+ complaint records covering U.S. phone numbers.

Update frequency: Hourly sync via automated pipeline. New complaints are typically available within 1 to 2 hours of FTC publication.

Used by: Phone channel. Complaint count, recency, and robocall percentage are primary inputs to the phone scoring engine.
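
As an illustration, the three inputs named above (complaint count, recency, and robocall percentage) could be derived from a list of complaint records roughly as follows. This is a minimal sketch: the `Complaint` shape and the `summarize_complaints` helper are hypothetical and are not part of the ScamVerify™ API or the FTC data format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Complaint:
    # Hypothetical record shape; real FTC field names may differ.
    filed_on: date
    subject: str
    is_robocall: bool


def summarize_complaints(complaints: list[Complaint], today: date) -> dict:
    """Derive complaint count, recency, and robocall percentage."""
    count = len(complaints)
    if count == 0:
        return {"count": 0, "days_since_last": None, "robocall_pct": 0.0}
    latest = max(c.filed_on for c in complaints)
    robocalls = sum(1 for c in complaints if c.is_robocall)
    return {
        "count": count,
        "days_since_last": (today - latest).days,
        "robocall_pct": round(100.0 * robocalls / count, 1),
    }
```

How these values are weighted inside the scoring engine is not specified here, so the sketch stops at computing them.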

FCC Consumer Complaints

What it provides: Telecom-specific consumer complaints filed with the Federal Communications Commission. These cover a broader range of issues than FTC data, including unwanted calls, cramming, slamming, and accessibility violations.

Coverage: U.S. telecom complaints. Overlaps with FTC data but provides independent corroboration and catches complaints not filed with the FTC.

Update frequency: Synced via automated pipeline on a regular schedule.

Used by: Phone channel. FCC complaint counts contribute to the base score independently of FTC data.

Twilio Lookup

What it provides: Real-time telecom infrastructure data for phone numbers, including:

Carrier name (e.g., T-Mobile, Bandwidth.com)
Line type (mobile, landline, VoIP, non-fixed VoIP, toll-free)
CNAM (Caller Name, the registered caller ID)
Number validity (whether the number exists in the telecom network)
Network status (whether the number is currently active, inactive, reachable, or unreachable on the carrier network)

Coverage: U.S. phone numbers. International coverage varies by region.

Update frequency: Real-time lookup on each request (cached after first lookup).

Used by: Phone channel. Line type and carrier information are critical scoring signals. VoIP numbers, especially non-fixed VoIP, carry higher baseline risk. Invalid numbers are automatically scored at 100. Inactive numbers (not assigned to any subscriber) are strong spoofing indicators.
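
The rules in the paragraph above can be sketched as a small function. Only the fixed score of 100 for invalid numbers comes from the text. The parameter names, the normalized line type strings (`voip`, `non_fixed_voip`), and the flag names are assumptions made for illustration, and the sketch does not attempt to reproduce the actual score adjustments.

```python
from typing import Optional

INVALID_NUMBER_SCORE = 100  # stated above: invalid numbers score 100


def telecom_flags(valid: bool, line_type: Optional[str], active: Optional[bool]) -> dict:
    """Map carrier lookup results to illustrative risk flags."""
    if not valid:
        # An invalid number short-circuits everything else.
        return {"forced_score": INVALID_NUMBER_SCORE, "flags": ["invalid_number"]}
    flags = []
    if line_type in ("voip", "non_fixed_voip"):
        flags.append("voip")
    if line_type == "non_fixed_voip":
        flags.append("non_fixed_voip")
    if active is False:
        # Inactive numbers are described above as strong spoofing indicators.
        flags.append("inactive")
    return {"forced_score": None, "flags": flags}
```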

Robocall Detection Database

What it provides: Real-time identification of phone numbers associated with robocall activity. Numbers are flagged based on call pattern analysis across carrier networks.

Coverage: U.S. phone numbers with active robocall patterns.

Update frequency: Real-time on each lookup.

Used by: Phone channel. A robocall flag enforces a minimum score floor of 65, ensuring flagged numbers are never rated below high_risk.
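
A score floor of this kind amounts to taking the maximum of the computed score and the floor. The sketch below assumes an integer score on a 0 to 100 scale; the function name is hypothetical.

```python
ROBOCALL_SCORE_FLOOR = 65  # floor value stated above


def apply_robocall_floor(score: int, robocall_flagged: bool) -> int:
    """Raise the score to the floor when the number is flagged; never lower it."""
    if robocall_flagged:
        return max(score, ROBOCALL_SCORE_FLOOR)
    return score
```

A flagged number that already scores above 65 keeps its higher score, and an unflagged number is left untouched.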

IPQS (IP Quality Score)

What it provides: Reputation scores for both phone numbers and URLs. The phone reputation score evaluates fraud risk based on patterns across financial services, e-commerce, and telecom. The URL reputation score evaluates domain risk based on hosting infrastructure, traffic patterns, and known associations with malicious activity.

Coverage: Global. Covers phone numbers and URLs/domains.

Update frequency: Real-time on each lookup.

Used by: Phone channel (phone reputation) and URL channel (URL/domain reputation). IPQS scores feed into the rules engine as one of many signals. The IPQS score is included in the signals object as ipqs_risk_score for URL lookups.

URLhaus (abuse.ch)

What it provides: A database of URLs and domains associated with malware distribution. Maintained by abuse.ch, a Swiss security research project. Entries are contributed by security researchers worldwide and verified before inclusion.

Coverage: 2,374+ malware-distributing domains. Global coverage with a focus on actively distributing threats.

Update frequency: Automated sync via scheduled pipeline. New entries are typically available within hours of publication.

Used by: URL channel. A URLhaus listing is a high-confidence signal that the domain is distributing malware. Indicated in the signals object as urlhaus_listed: true.

ThreatFox (abuse.ch)

What it provides: An Indicator of Compromise (IOC) database covering domains, IPs, and URLs associated with malware command-and-control servers, phishing infrastructure, and botnet activity. Broader than URLhaus, covering threat infrastructure beyond just malware distribution URLs.

Coverage: 54,377+ domains. Global coverage across multiple threat categories.

Update frequency: Automated sync via scheduled pipeline.

Used by: URL channel. A ThreatFox listing indicates the domain is associated with malicious infrastructure. Indicated in the signals object as threatfox_listed: true.

Google Web Risk

What it provides: Google's classification of URLs into threat categories: malware, social engineering (which includes phishing), and unwanted software. This is the same technology that powers Safe Browsing warnings in Chrome.

Coverage: Global. Covers billions of URLs based on Google's web crawling infrastructure.

Update frequency: Real-time on each lookup.

Used by: URL channel. A Google Web Risk flag is a high-confidence signal. The specific threat type (e.g., SOCIAL_ENGINEERING, MALWARE) is included in the signals object as google_web_risk.
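
A client could collect the blocklist-style signals described so far (`urlhaus_listed`, `threatfox_listed`, and `google_web_risk`) from a signals object along these lines. This is a sketch under assumptions: it treats `google_web_risk` as a threat type string that is absent or null when the URL is not flagged, and the `listing_hits` helper is hypothetical.

```python
def listing_hits(signals: dict) -> list[str]:
    """Return the threat-list sources that flagged a URL."""
    hits = []
    if signals.get("urlhaus_listed") is True:
        hits.append("urlhaus")
    if signals.get("threatfox_listed") is True:
        hits.append("threatfox")
    # Assumed shape: a threat type string such as "MALWARE", or None.
    threat_type = signals.get("google_web_risk")
    if threat_type:
        hits.append(f"google_web_risk:{threat_type}")
    return hits
```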

WHOIS / RDAP

What it provides: Domain registration data, including:

Registration date (used to calculate domain age)
Expiration date
Registrar name (e.g., GoDaddy, Namecheap)
Registrant information (when not redacted by privacy services)

Coverage: All registered domains with public WHOIS/RDAP records.

Update frequency: Real-time on each lookup (cached after first lookup).

Used by: URL channel and email channel (sender domain analysis). Domain age is a significant scoring factor: domains registered within the last 30 days are treated as higher risk. The signals object includes domain_age_days and registrar.
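
Domain age in days can be computed from a registration timestamp as shown below. The sketch assumes the timestamp is an RFC 3339 string with an explicit UTC offset or trailing `Z`, as RDAP event dates commonly are. Whether a domain that is exactly 30 days old counts as "new" is not specified, so the inclusive comparison is an assumption, and both function names are hypothetical.

```python
from datetime import datetime, timezone

NEW_DOMAIN_THRESHOLD_DAYS = 30  # threshold stated above


def domain_age_days(registered_at: str, now: datetime) -> int:
    """Whole days between the registration timestamp and `now` (timezone-aware)."""
    # Replace a trailing "Z" so older Python versions can parse the offset.
    registered = datetime.fromisoformat(registered_at.replace("Z", "+00:00"))
    return (now - registered).days


def is_newly_registered(age_days: int) -> bool:
    # Assumption: the 30-day boundary is inclusive.
    return age_days <= NEW_DOMAIN_THRESHOLD_DAYS
```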

SSL Certificate Analysis

What it provides: SSL/TLS certificate details for the target domain, including:

Certificate issuer (e.g., Let's Encrypt, DigiCert)
Validation type (DV, OV, EV)
Certificate age (when it was issued)
Expiration status

Coverage: Any domain serving HTTPS traffic.

Update frequency: Real-time on each lookup.

Used by: URL channel. Recently issued certificates on new domains are a risk signal. Missing or expired certificates contribute to the score. The signals object includes ssl_issuer and ssl_age_days.

Brand Impersonation Detection

What it provides: Detection of domains that mimic well-known brands. The system checks for typosquatting (e.g., arnazon.com), lookalike domains, and keyword-based impersonation targeting banks, tech companies, government agencies, and other commonly impersonated entities.

Coverage: Proprietary brand database covering major financial institutions, tech companies, government agencies, and popular consumer brands.

Update frequency: Brand database is maintained and updated regularly.

Used by: URL channel and email channel (sender domain analysis). Detected impersonation is included in the signals object with the matched brand name and confidence level.
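
To make the typosquatting idea concrete, the sketch below catches lookalikes such as arnazon.com by collapsing visually confusable character sequences before comparing against a brand list, then falls back to a fuzzy string match. It is not the proprietary detector: the three-entry brand list, the confusable pairs, the 0.85 threshold, and the `high`/`medium` confidence labels are all assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional

BRANDS = ["amazon", "paypal", "microsoft"]  # hypothetical sample list
# Character sequences that visually resemble another character.
CONFUSABLES = [("rn", "m"), ("vv", "w"), ("0", "o"), ("1", "l")]


def registrable_label(domain: str) -> str:
    # Simplification: second-to-last label; ignores suffixes like "co.uk".
    parts = domain.lower().rstrip(".").split(".")
    return parts[-2] if len(parts) >= 2 else parts[0]


def normalize(label: str) -> str:
    for fake, real in CONFUSABLES:
        label = label.replace(fake, real)
    return label


def match_brand(domain: str, threshold: float = 0.85) -> Optional[tuple[str, str]]:
    """Return (brand, confidence) for a suspected lookalike, else None."""
    label = registrable_label(domain)
    if label in BRANDS:
        return None  # exact brand label, not a lookalike
    if normalize(label) in BRANDS:
        return (normalize(label), "high")
    for brand in BRANDS:
        if SequenceMatcher(None, label, brand).ratio() >= threshold:
            return (brand, "medium")
    return None
```

For example, `match_brand("arnazon.com")` returns `("amazon", "high")` because `rn` collapses to `m`, while `match_brand("amazon.com")` returns `None`.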

Community Reports

What it provides: User-submitted reports indicating whether a phone number or URL is a scam or legitimate. Reports include a classification (scam, legitimate, robocall, telemarketer, debt collector, wrong number) and optional comments.

Coverage: Grows over time as users contribute. Coverage is strongest for numbers and URLs that receive high search volume.

Update frequency: Real-time. New reports immediately influence scoring.

Used by: Phone and URL channels. Community consensus can shift verdicts in either direction. Once 3 or more different users submit consistent reports, community data begins to significantly influence the final score. Once 10 or more do, community consensus can override AI analysis.
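
The two thresholds can be sketched as follows. What counts as "consistent" is not defined in this section, so the sketch assumes it means the number of distinct users who agree on the most common classification, counting only each user's most recent report. The function name and the returned level names are hypothetical.

```python
from collections import Counter
from typing import Optional

SIGNIFICANT_THRESHOLD = 3   # stated above
OVERRIDE_THRESHOLD = 10     # stated above


def community_consensus(reports: list[tuple[str, str]]) -> tuple[Optional[str], str]:
    """reports is a list of (user_id, classification) pairs, oldest first."""
    latest_by_user: dict[str, str] = {}
    for user_id, classification in reports:
        latest_by_user[user_id] = classification  # one vote per user
    if not latest_by_user:
        return (None, "none")
    label, count = Counter(latest_by_user.values()).most_common(1)[0]
    if count >= OVERRIDE_THRESHOLD:
        return (label, "can_override_ai")
    if count >= SIGNIFICANT_THRESHOLD:
        return (label, "significant")
    return (label, "minimal")
```

Deduplicating by user before counting reflects the "from different users" requirement: three reports from one account count as a single vote.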

High-Risk VoIP Carriers

What it provides: A proprietary list of 18 VoIP carriers that are disproportionately associated with scam calls. These carriers provide cheap, disposable phone numbers that are favored by fraudsters.

Coverage: 18 identified carriers. The list is maintained based on FTC complaint patterns, carrier abuse reports, and industry intelligence.

Update frequency: Updated as new high-risk carriers are identified.

Used by: Phone channel. Numbers on these carriers receive a score boost. When such a number also has no caller ID, a minimum score floor is enforced.

Document Verification Sources

The document channel uses a specialized set of data sources to verify entities extracted from uploaded documents. These sources are only used by the document pipeline.

Smarty US Street API

What it provides: Address validation and classification for U.S. street addresses extracted from documents. Determines whether an address is real, deliverable, and whether it belongs to a Commercial Mail Receiving Agency (CMRA) such as a UPS Store or private mailbox service. Scam documents frequently use CMRA addresses to appear legitimate while hiding behind a mail forwarding service.

Key fields returned:

Field	Type	Description
`valid`	boolean	Whether the address is a real, recognized U.S. address
`deliverable`	boolean	Whether mail can be delivered to this address
`is_cmra`	boolean	Whether the address is a Commercial Mail Receiving Agency (mailbox store)
`is_vacant`	boolean	Whether the address is flagged as vacant
`rdi`	string or null	Residential Delivery Indicator: `residential` or `commercial`
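
One way a client might turn these fields into human-readable warnings is sketched below. The field names come from the table above; the `address_red_flags` helper and the flag strings are hypothetical, and the choice to stop at an invalid address is an assumption.

```python
def address_red_flags(result: dict) -> list[str]:
    """Translate address validation fields into illustrative warning labels."""
    if not result.get("valid"):
        # Nothing else is meaningful if the address is not recognized.
        return ["address_not_recognized"]
    flags = []
    if not result.get("deliverable"):
        flags.append("undeliverable")
    if result.get("is_cmra"):
        flags.append("mailbox_service")
    if result.get("is_vacant"):
        flags.append("vacant")
    return flags
```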

Google Places Text Search

What it provides: Institution verification for organizations claimed in documents. When a document claims to be from "IRS Office, 1234 Federal Plaza," this source checks whether a matching government office, bank, courthouse, or other institution actually exists at or near that address.

Key fields returned:

Field	Type	Description
`found`	boolean	Whether a matching place was found
`place_name`	string or null	Name of the matched place
`matches_claimed`	boolean	Whether the found place matches the document's claimed issuer
`is_government`	boolean	Whether the place is classified as a government office
`place_types`	string[]	Google place type classifications (e.g., `courthouse`, `local_government_office`)

CourtListener People API

What it provides: Verification of judges and officials named in legal documents. CourtListener maintains a comprehensive database of federal and state judges. When a document claims to be signed by "Judge Robert Williams, District Court," this source checks whether that person exists in the judicial record.

Key fields returned:

Field	Type	Description
`found`	boolean	Whether a matching judge or official was found
`matched_name`	string or null	The name as it appears in CourtListener records
`match_confidence`	string	Confidence level: `exact`, `partial`, or `none`

CourtListener Citation Lookup

What it provides: Verification of case law citations referenced in legal documents. Scam documents often include real-sounding but fabricated citations to appear legitimate. This source checks whether cited cases (e.g., "Smith v. Jones, 542 F.3d 117") actually exist in published legal records.

Key fields returned:

Field	Type	Description
`verified`	boolean or null	Whether the citation was found in legal records (`null` if lookup failed)
`citation_type`	string or null	Type of citation (e.g., `case_law`, `statute`)

GovInfo API

What it provides: Verification of federal statutes and regulations cited in documents. When a document references "26 USC 6331" or "42 CFR 482.12," this source checks whether those citations exist in the United States Code or Code of Federal Regulations.

Key fields returned:

Field	Type	Description
`verified`	boolean	Whether the statute or regulation exists
`collection`	string or null	The legal collection: `USCODE` (United States Code) or `CFR` (Code of Federal Regulations)
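
Before a citation can be looked up, it has to be recognized as a United States Code or Code of Federal Regulations reference. The sketch below parses the two example formats given above with regular expressions and reports which collection each would belong to. It does not call the GovInfo API, the patterns cover only simple "title, code, section" forms, and the `classify_citation` helper is hypothetical.

```python
import re
from typing import Optional

# Matches forms such as "26 USC 6331" or "26 U.S.C. § 6331".
USC_RE = re.compile(r"^\s*(\d+)\s+U\.?S\.?C\.?\s+(?:§\s*)?(\d+[A-Za-z0-9\-]*)", re.I)
# Matches forms such as "42 CFR 482.12" or "42 C.F.R. § 482.12".
CFR_RE = re.compile(r"^\s*(\d+)\s+C\.?F\.?R\.?\s+(?:§\s*)?(\d+(?:\.\d+)*)", re.I)


def classify_citation(text: str) -> Optional[dict]:
    """Return the collection, title, and section for a statutory citation."""
    match = USC_RE.match(text)
    if match:
        return {"collection": "USCODE", "title": int(match.group(1)), "section": match.group(2)}
    match = CFR_RE.match(text)
    if match:
        return {"collection": "CFR", "title": int(match.group(1)), "section": match.group(2)}
    return None  # not a recognized USC or CFR citation
```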

GPT-4o Vision AI

What it provides: Intelligent entity extraction from document images. Analyzes the visual content of uploaded documents to identify the document type, claimed issuer, and all verifiable entities. Also detects red flags (urgency language, threats, unusual formatting) and matches against known scam patterns.

Entities extracted:

Entity	Description
`document_type`	Classification of the document (e.g., `government_notice`, `invoice`, `court_order`, `collection_letter`)
`claimed_issuer`	The organization or agency the document claims to be from
`phone_numbers`	Phone numbers found in the document
`urls`	URLs and web addresses found in the document
`addresses`	Physical addresses found in the document
`officials`	Names of officials, judges, or agents referenced
`citations`	Legal citations, case numbers, and statute references
`dollar_amounts`	Monetary amounts referenced
`dates`	Dates and deadlines referenced
`red_flags`	Language patterns and formatting anomalies that indicate fraud
`scam_pattern`	The identified scam pattern, if any (e.g., `government_impersonation`, `fake_invoice`, `debt_collection_fraud`)

Source Availability in API Responses

Every API response includes a sources_checked array that tells you exactly which data sources contributed to the assessment. This is useful for understanding the depth of analysis and for handling cases where certain sources were unavailable.

For phone lookups:

{
  "sources_checked": [
    "ftc",
    "fcc",
    "twilio",
    "nomorobo",
    "community_reports",
    "ai_analysis"
  ]
}

For URL lookups:

{
  "sources_checked": [
    "google_web_risk",
    "rdap",
    "ssl",
    "redirects",
    "brand_detection",
    "urlhaus",
    "threatfox",
    "ipqs",
    "ai_analysis"
  ]
}

For document lookups:

{
  "sources_checked": [
    "gpt4o_vision",
    "smarty_address",
    "google_places",
    "courtlistener_people",
    "courtlistener_citation",
    "govinfo",
    "ai_analysis"
  ]
}

If a source is missing from sources_checked, it means either the source did not return data for that specific input or the source was temporarily unavailable. The scoring engine handles missing sources gracefully by relying on the remaining signals.
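
A client that wants to notice reduced coverage can compare sources_checked against the sources it expects for a channel. The sketch below assumes the three example arrays above are the complete expected lists for each channel, which this page does not guarantee. The `EXPECTED_SOURCES` mapping, the channel keys, and the `missing_sources` helper are hypothetical.

```python
# Assumption: copied from the example responses above; may not be exhaustive.
EXPECTED_SOURCES = {
    "phone": ["ftc", "fcc", "twilio", "nomorobo", "community_reports", "ai_analysis"],
    "url": [
        "google_web_risk", "rdap", "ssl", "redirects", "brand_detection",
        "urlhaus", "threatfox", "ipqs", "ai_analysis",
    ],
    "document": [
        "gpt4o_vision", "smarty_address", "google_places",
        "courtlistener_people", "courtlistener_citation", "govinfo", "ai_analysis",
    ],
}


def missing_sources(channel: str, response: dict) -> list[str]:
    """List expected sources that did not contribute to this response."""
    checked = set(response.get("sources_checked", []))
    return [name for name in EXPECTED_SOURCES[channel] if name not in checked]
```

A non-empty result does not indicate an error. It only signals that the assessment rests on fewer inputs, which a caller may want to log or surface to the user.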
