Content Deduplication
Purpose
The same resource can appear on multiple Exchanges. An AP wire story might be available from Exchange A and Exchange B. A syndicated article might be offered by the original provider’s Exchange and a resource aggregator. The Broker deduplicates these offers so the agent pays the lowest price for each unique resource.
ResourceIdentity Matching
Each Offer carries an optional ResourceIdentity that the Broker uses for deduplication. The identity key is derived from a priority-ordered set of fields:
```go
// contentIdentityKey produces a deduplication key from layered identity fields.
// Matching priority: iptc_guid > doi > content_hash > canonical_url
func contentIdentityKey(id *rampv1.ResourceIdentity) string {
	if id == nil {
		return uuid.NewString() // no identity = unique (no dedup)
	}
	if id.IptcGuid != nil && *id.IptcGuid != "" {
		return "iptc:" + *id.IptcGuid
	}
	if id.Doi != nil && *id.Doi != "" {
		return "doi:" + *id.Doi
	}
	if id.ContentHash != nil && *id.ContentHash != "" {
		return "hash:" + *id.ContentHash
	}
	if id.CanonicalUrl != nil && *id.CanonicalUrl != "" {
		return "url:" + *id.CanonicalUrl
	}
	return uuid.NewString()
}
```
Identity Field Priority
| Priority | Field | Description | Example |
|---|---|---|---|
| 1 (highest) | iptc_guid | IPTC globally unique identifier for news content | urn:newsml:apnews.com:20260315:article-123 |
| 2 | doi | Digital Object Identifier for academic/scientific content | 10.1038/s41586-026-01234-5 |
| 3 | content_hash | Content fingerprint (method depends on hash_method: simhash-v1 for fuzzy dedup, sha256 for exact match) | sha256:abc123... |
| 4 (lowest) | canonical_url | Canonical URL of the content | https://apnews.com/article/ai-regulation |
When no ResourceIdentity is present on an offer, the offer is treated as unique and is never deduplicated. This is the conservative default — it is better to pay twice for the same content than to accidentally merge two different articles.
SimHash Comparison
For content available from multiple sources that may have minor editorial differences (e.g., AP wire stories republished with small modifications), strict hash matching (content_hash) may miss duplicates. The Broker supports SimHash comparison for fuzzy matching:
- Level 2 verification uses SHA-256 exact hash — content is byte-for-byte identical
- Level 1 verification uses SimHash similarity — content is structurally similar (e.g., 95%+ similarity)
- Level 0 — no content verification
SimHash comparison is applied when:
- Two offers share the same canonical_url but have different content_hash values
- Two offers from different Exchanges reference the same article by title/author metadata but have no shared identifier
The SimHash comparison operates on the content identity metadata provided by the Exchange, not on the content itself (which is not yet fetched at selection time).
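A minimal sketch of how two SimHash fingerprints might be compared. It assumes 64-bit fingerprints reported by the Exchange and measures similarity as the fraction of matching bits; the fingerprint width, the function names, and the 0.95 threshold (taken from the "95%+" example above) are illustrative assumptions, not the Broker's confirmed implementation.

```go
package main

import (
	"fmt"
	"math/bits"
)

// simHashSimilarity returns the fraction of matching bits between two
// SimHash fingerprints (1.0 means the fingerprints are identical).
// Assumption: fingerprints are 64 bits wide; the source does not say so.
func simHashSimilarity(a, b uint64) float64 {
	differing := bits.OnesCount64(a ^ b) // Hamming distance
	return 1.0 - float64(differing)/64.0
}

// isFuzzyDuplicate is a hypothetical helper that applies a similarity threshold.
func isFuzzyDuplicate(a, b uint64, threshold float64) bool {
	return simHashSimilarity(a, b) >= threshold
}

func main() {
	a := uint64(0xF0F0F0F0F0F0F0F0)
	b := a ^ 0x3 // flip the two lowest bits
	fmt.Printf("similarity=%.4f duplicate=%v\n",
		simHashSimilarity(a, b), isFuzzyDuplicate(a, b, 0.95))
}
```

With 64-bit fingerprints, a 0.95 threshold tolerates at most three differing bits, so the threshold would need tuning against real content.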
Canonical URL Grouping
When higher-priority identifiers are not available, the Broker falls back to canonical_url grouping. This catches the common case where the same article is offered by multiple Exchanges that all reference the same provider URL.
Canonical URL matching is exact string comparison after normalization:
- Strip trailing slashes
- Lowercase the hostname
- Sort query parameters
- Remove tracking parameters (utm_*, ref, etc.)
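The normalization steps above can be sketched with Go's standard net/url package. The function names are hypothetical, and the tracking-parameter rule covers only utm_* and ref; whatever the "etc." in the list includes is left open here.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isTrackingParam reports whether a query key should be dropped.
// Assumption: only utm_* and ref are treated as tracking parameters.
func isTrackingParam(key string) bool {
	k := strings.ToLower(key)
	return strings.HasPrefix(k, "utm_") || k == "ref"
}

// normalizeCanonicalURL applies the four normalization steps listed above.
func normalizeCanonicalURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Host = strings.ToLower(u.Host)        // lowercase the hostname
	u.Path = strings.TrimRight(u.Path, "/") // strip trailing slashes
	u.RawPath = ""                          // let String() re-encode Path

	q := u.Query()
	for key := range q {
		if isTrackingParam(key) {
			q.Del(key) // remove tracking parameters
		}
	}
	u.RawQuery = q.Encode() // Encode emits keys in sorted order
	return u.String(), nil
}

func main() {
	got, err := normalizeCanonicalURL(
		"https://APNews.com/article/ai-regulation/?utm_source=x&b=2&a=1&ref=tw")
	if err != nil {
		panic(err)
	}
	fmt.Println(got) // prints https://apnews.com/article/ai-regulation?a=1&b=2
}
```

Two offers whose canonical_url values normalize to the same string would then receive the same url: identity key.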
Cross-Exchange Dedup Strategy
The full deduplication pipeline operates across all Exchange responses:
```go
type contentGroup struct {
	Key    string
	Offers []rankedOffer
}

func deduplicateOffers(offers []rankedOffer) []rankedOffer {
	groups := make(map[string][]rankedOffer)

	for _, o := range offers {
		key := contentIdentityKey(o.Offer.Identity)
		groups[key] = append(groups[key], o)
	}

	var deduped []rankedOffer
	for _, group := range groups {
		// Sort by unit_cost ascending within group
		sort.Slice(group, func(i, j int) bool {
			return unitCostOf(group[i]) < unitCostOf(group[j])
		})
		deduped = append(deduped, group[0]) // cheapest wins
	}
	return deduped
}
```
How It Works
- Collect offers from all Exchange responses.
- Compute identity key for each offer using the priority-ordered fields.
- Group offers by identity key. Offers with the same key are considered duplicates.
- Within each group, sort by effective cost per unit (unit_cost) ascending.
- Keep only the cheapest offer from each group.
Example
An agent requests content at https://apnews.com/article/ai-regulation. Two Exchanges respond:
| Exchange | Offer ID | unit_cost | ResourceIdentity |
|---|---|---|---|
| Exchange Alpha | offer-a1 | $0.0003/token | iptc:urn:newsml:apnews.com:20260315:ai-reg |
| Exchange Beta | offer-b1 | $0.0005/token | iptc:urn:newsml:apnews.com:20260315:ai-reg |
Both offers share the same iptc_guid. The deduplicator groups them and selects offer-a1 (lower unit_cost). The agent pays $0.0003/token instead of $0.0005/token.
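The example can be reproduced with a small self-contained sketch. The offer type below is a simplified, hypothetical stand-in for rankedOffer with a precomputed identity key, and dedupe tracks the cheapest offer per key directly instead of sorting each group. Unlike the pipeline above, it sorts the survivors by key so the output order does not depend on map iteration.

```go
package main

import (
	"fmt"
	"sort"
)

// offer is a simplified, hypothetical stand-in for rankedOffer.
type offer struct {
	Exchange string
	ID       string
	UnitCost float64 // USD per token
	Key      string  // precomputed identity key, e.g. "iptc:..."
}

// dedupe keeps the cheapest offer per identity key and returns the
// survivors sorted by key so the output order is deterministic.
func dedupe(offers []offer) []offer {
	cheapest := make(map[string]offer)
	for _, o := range offers {
		cur, seen := cheapest[o.Key]
		if !seen || o.UnitCost < cur.UnitCost {
			cheapest[o.Key] = o
		}
	}
	out := make([]offer, 0, len(cheapest))
	for _, o := range cheapest {
		out = append(out, o)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}

func main() {
	key := "iptc:urn:newsml:apnews.com:20260315:ai-reg"
	winners := dedupe([]offer{
		{"Exchange Alpha", "offer-a1", 0.0003, key},
		{"Exchange Beta", "offer-b1", 0.0005, key},
	})
	for _, w := range winners {
		fmt.Println(w.ID, w.Exchange) // prints offer-a1 Exchange Alpha
	}
}
```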
Deduplication in Batch Queries
For batch (multi-URI) queries, deduplication runs per URI. Each URI’s offers are deduplicated independently:
- For each URI in the batch, collect all offers from all Exchange responses that match that URI.
- Within each URI’s offer set, run the standard deduplication pipeline.
- Select the best (cheapest) offer per URI after dedup.
Cross-URI deduplication (e.g., “the agent requested two URLs that happen to be the same article”) is handled by the batch selection engine, which checks ResourceIdentity across URIs before executing transactions.
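One way the per-URI selection and the cross-URI check could fit together is sketched below. Every name here is hypothetical: the offer type is a simplified stand-in, the identity keys are shortened for readability, and the second URL is a made-up example. The batch selection engine's actual logic is not shown in this section, so treat this as an illustration of the idea only.

```go
package main

import (
	"fmt"
	"sort"
)

// offer is a simplified, hypothetical stand-in for the Broker's offer type.
type offer struct {
	ID       string
	UnitCost float64
	Key      string // identity key for the offered resource
}

// cheapestPerURI picks the lowest-cost offer for each URI independently.
// Picking the overall minimum yields the same winner as deduplicating
// by identity key first and then choosing the cheapest survivor.
func cheapestPerURI(byURI map[string][]offer) map[string]offer {
	best := make(map[string]offer, len(byURI))
	for uri, offers := range byURI {
		for _, o := range offers {
			cur, seen := best[uri]
			if !seen || o.UnitCost < cur.UnitCost {
				best[uri] = o
			}
		}
	}
	return best
}

// crossURIDuplicates returns, per identity key, the URIs whose selected
// offers resolve to the same resource (only keys shared by 2+ URIs).
func crossURIDuplicates(best map[string]offer) map[string][]string {
	urisByKey := make(map[string][]string)
	for uri, o := range best {
		urisByKey[o.Key] = append(urisByKey[o.Key], uri)
	}
	dups := make(map[string][]string)
	for key, uris := range urisByKey {
		if len(uris) > 1 {
			sort.Strings(uris)
			dups[key] = uris
		}
	}
	return dups
}

func main() {
	best := cheapestPerURI(map[string][]offer{
		"https://apnews.com/article/ai-regulation": {
			{"offer-a1", 0.0003, "iptc:ai-reg"},
			{"offer-b1", 0.0005, "iptc:ai-reg"},
		},
		"https://example.com/syndicated/ai-regulation": {
			{"offer-c1", 0.0004, "iptc:ai-reg"},
		},
	})
	for key, uris := range crossURIDuplicates(best) {
		fmt.Println(key, uris)
	}
}
```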
Deduplication Metrics
The Broker tracks deduplication effectiveness through Prometheus metrics:
```go
var (
	// How often deduplication found the same content from multiple Exchanges
	deduplicationRate = promauto.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "broker_dedup_sets",
			Help:    "Number of content groups with offers from multiple Exchanges",
			Buckets: []float64{0, 1, 2, 3, 5, 10},
		},
	)
)
```
The SelectionRationale returned with every selection decision includes DeduplicatedSets, the count of content groups that had offers from multiple Exchanges. This provides visibility into how often deduplication produces savings.
Content Attestation Levels (v1.0)
In v1.0, content attestation level replaces the former content verification concept. Attestation level is both a deduplication tiebreaker and a selection ranking factor:
| Level | Attestation | Guarantee |
|---|---|---|
| 0 | None | No attestation. Content may carry identifiers (DOI, IPTC GUID) for dedup, but nothing is cryptographically verifiable. |
| 1 | Self-attested (provider) | Provider signed claims with Ed25519. Agent can recompute content hash from delivered bytes. |
| 2 | Third-party (verification vendor) | Independent verification. Vendor crawled and measured the content. Agent trusts the attestation without re-verifying. |
Higher attestation levels rank higher in the Selection Engine because they provide stronger guarantees. When two offers for the same content have different attestation levels, the higher level is preferred even at slightly higher cost, because it enables automated dispute resolution and reduces the risk of receiving content that does not match what was advertised.
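The preference for higher attestation "even at slightly higher cost" could be expressed as a comparator with a price tolerance. The sketch below is an assumption throughout: the offer type is a simplified stand-in, and maxPremium is a hypothetical parameter, since this section does not define how much extra cost counts as "slightly higher".

```go
package main

import "fmt"

// offer is a simplified, hypothetical stand-in for the Broker's offer type.
type offer struct {
	ID          string
	UnitCost    float64
	Attestation int // 0 = none, 1 = self-attested, 2 = third-party
}

// prefer reports whether a should be chosen over b for the same content.
// maxPremium is a hypothetical tolerance: the largest relative price
// increase accepted in exchange for a higher attestation level.
func prefer(a, b offer, maxPremium float64) bool {
	if a.Attestation > b.Attestation {
		return a.UnitCost <= b.UnitCost*(1+maxPremium)
	}
	if a.Attestation < b.Attestation {
		return a.UnitCost*(1+maxPremium) < b.UnitCost
	}
	return a.UnitCost < b.UnitCost // same level: cheaper wins
}

func main() {
	attested := offer{"offer-x", 0.00032, 2}
	plain := offer{"offer-y", 0.00030, 0}
	// With a 10% tolerance, the attested offer wins despite costing more.
	fmt.Println(prefer(attested, plain, 0.10))
}
```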
Content identifiers (DOI, ISBN, IPTC GUID) used for deduplication are orthogonal to attestation level. A Level 0 article may carry a DOI — that identifies WHAT content it is but does NOT verify that the delivered bytes match the identified content.