Skip to content

Cross-Source Joins in the Gold Layer

Status: Implemented (risk_sentiment_gold.py) Last updated: 2026-05-12


Problem Statement

Individual silver tables are single-source (FRED, Yahoo, ECB, BIS). Many useful quantitative signals require combining data across sources that have different frequencies, calendars, and schemas. The gold layer must solve temporal alignment, missing-data handling, and multi-source lineage tracking.


Design Principles

  1. Date is the universal join key — all silver tables expose a date column (timestamp, UTC-normalized).
  2. Outer join first, filter after — capture the full date range from all sources, then trim to the intersection where the anchor source has data.
  3. Anchor source — one source defines the output calendar. Typically the highest-frequency daily source (e.g., Yahoo market data defines trading days).
  4. Forward-fill for lower-frequency series — monthly FRED data (UMCSENT, FEDFUNDS) is carried forward to fill daily rows until the next observation.
  5. No interpolation — we forward-fill (last observation carried forward), never interpolate between points. This avoids look-ahead bias.
  6. Null propagation during warm-up — derived columns (z-scores, momentum) that require N days of history will be null for the first N rows. These rows are kept in storage but excluded from scoring.

Temporal Alignment Strategy

Source A (daily): |--x--x--x--x--x--x--x--x--x--|
Source B (monthly): |--x-----------x-----------x---|
↓ forward-fill
Aligned: |--x--x--x--x--x--x--x--x--x--|
A A+B A+B A+B A+B

Implementation (PySpark)

# Outer join on date
joined = source_a.join(source_b, on="date", how="outer")
# Forward-fill monthly columns
w_ffill = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 0)
for col_name in monthly_columns:
joined = joined.withColumn(
col_name,
F.last(F.col(col_name), ignorenulls=True).over(w_ffill),
)
# Filter to anchor source calendar (rows where anchor has data)
aligned = joined.filter(F.col("anchor_col").isNotNull())

Calendar Considerations

SourceCalendarGaps
Yahoo FinanceNYSE trading daysWeekends, US holidays
FRED daily series (VIX, DGS10)Business daysWeekends, federal holidays
FRED monthly (CPI, UNRATE)First of monthN/A (forward-filled)
ECBTARGET2 calendarEU holidays
BISQuarterlyN/A (forward-filled)

When joining FRED daily + Yahoo daily, the calendars are nearly identical (both skip weekends). Minor holiday mismatches (e.g., Presidents’ Day — FRED closed, NYSE open) are handled by the forward-fill: the previous FRED value carries into the holiday gap.


Normalization Before Scoring

Cross-source signals combine values with wildly different scales (VIX ~ 12-80, SPY ~ 300-600, UMCSENT ~ 50-110). Direct comparison is meaningless.

Approach: Rolling z-score normalization

window = 252 # ~1 year of trading days
z = (value - rolling_mean(window)) / rolling_stddev(window)
  • Each component is converted to a z-score relative to its own trailing 1-year distribution.
  • Z-scores are dimensionless and comparable across sources.
  • Rolling window avoids regime bias from using the full history.

Sign convention:

  • Positive z-score = component indicates risk-on
  • Invert naturally risk-off indicators (VIX, USD strength) by negating their z-score

Composite Scoring

The composite is a simple equally-weighted average of component z-scores:

score = (z1 + z2 + ... + zN) / N

Where some components are negated per their risk direction:

risk_score = (-VIX_z - USD_z + UMCSENT_z + SPY_momentum_z + HYG_SPY_ratio_z) / 5

Regime classification:

  • risk_score > 0.5 → risk_on
  • risk_score < -0.5 → risk_off
  • otherwise → neutral

The thresholds (0.5 / -0.5) are initial values; they can be tuned via backtesting in notebooks.


Multi-Source Lineage

Standard lineage records track a single source_id. Cross-source gold signals use a composite identifier:

{
"source_id": "fred+yahoo",
"inputs": [
"s3://{silver}/source=fred/",
"s3://{silver}/source=yahoo/"
],
"stage": "gold"
}

Lineage Graph

fred_ingestion (bronze) ──→ fred_silver ──┐
├──→ risk_sentiment_gold
yahoo_ingestion (bronze) ──→ yahoo_silver ──┘

The lineage record captures both input paths, enabling full DAG traversal from gold back to both original bronze sources.

Schema Hash

The schema hash covers the output gold table only. Input schemas are recorded in their respective silver lineage records. This keeps each record self-contained while the DAG provides the full picture.


Adding a New Cross-Source Signal

  1. Identify which silver tables you need and their frequencies.
  2. Choose the anchor source (highest frequency, defines output calendar).
  3. Create etl/external/{domain}_gold.py following the pattern:
    • read_*_silver() — read and filter each source
    • pivot_*() — pivot series/tickers to columns
    • align_sources() — outer join + forward-fill + anchor filter
    • compute_*() — domain-specific scoring
    • run() — orchestrate, write output, record lineage
  4. Use composite source_id (e.g., "fred+yahoo+ecb") in the lineage record.
  5. Register in infrastructure construct for Step Functions orchestration.
  6. Add to the docs navigation table.

Current Cross-Source Signals

SignalDomainSourcesFrequencyOutput Path
risk_compositesentimentFRED (VIXCLS, DTWEXBGS, UMCSENT) + Yahoo (HYG, SPY)dailydomain=sentiment/signal=risk_composite/

Future Candidates

SignalSourcesRationale
credit_stressFRED (DFF, T10Y2Y) + Yahoo (HYG, LQD) + BIS (credit gaps)Multi-dimensional credit condition index
global_liquidityFRED (M2SL) + ECB (M3) + BIS (cross-border credit)Central bank liquidity composite
equity_macro_momentumYahoo (SPY, EEM) + FRED (INDPRO, PAYEMS)Macro-confirmed equity momentum

SDK / Notebook Usage

Once deployed, consume via stratum-sdk:

import stratum
client = stratum.connect(api_key="sk_...")
# Pre-computed composite score
risk = client.signal("risk_composite", start="2020-01-01")
# Underlying components for custom weighting
vix = client.dataset("fred_rates", columns=["VIXCLS"], start="2020-01-01")
spy = client.dataset("yahoo_prices", columns=["SPY"], start="2020-01-01")
# Custom composite with different weights
my_score = -2 * risk["VIXCLS_zscore"] + 3 * risk["spy_momentum_20d_zscore"]