Cross-Source Joins in the Gold Layer
Status: Implemented (risk_sentiment_gold.py) Last updated: 2026-05-12
Problem Statement
Individual silver tables are single-source (FRED, Yahoo, ECB, BIS). Many useful quantitative signals require combining data across sources that have different frequencies, calendars, and schemas. The gold layer must solve temporal alignment, missing-data handling, and multi-source lineage tracking.
Design Principles
- Date is the universal join key — all silver tables expose a
datecolumn (timestamp, UTC-normalized). - Outer join first, filter after — capture the full date range from all sources, then trim to the intersection where the anchor source has data.
- Anchor source — one source defines the output calendar. Typically the highest-frequency daily source (e.g., Yahoo market data defines trading days).
- Forward-fill for lower-frequency series — monthly FRED data (UMCSENT, FEDFUNDS) is carried forward to fill daily rows until the next observation.
- No interpolation — we forward-fill (last observation carried forward), never interpolate between points. This avoids look-ahead bias.
- Null propagation during warm-up — derived columns (z-scores, momentum) that require N days of history will be null for the first N rows. These rows are kept in storage but excluded from scoring.
Temporal Alignment Strategy
Source A (daily): |--x--x--x--x--x--x--x--x--x--|Source B (monthly): |--x-----------x-----------x---| ↓ forward-fillAligned: |--x--x--x--x--x--x--x--x--x--| A A+B A+B A+B A+BImplementation (PySpark)
# Outer join on datejoined = source_a.join(source_b, on="date", how="outer")
# Forward-fill monthly columnsw_ffill = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 0)for col_name in monthly_columns: joined = joined.withColumn( col_name, F.last(F.col(col_name), ignorenulls=True).over(w_ffill), )
# Filter to anchor source calendar (rows where anchor has data)aligned = joined.filter(F.col("anchor_col").isNotNull())Calendar Considerations
| Source | Calendar | Gaps |
|---|---|---|
| Yahoo Finance | NYSE trading days | Weekends, US holidays |
| FRED daily series (VIX, DGS10) | Business days | Weekends, federal holidays |
| FRED monthly (CPI, UNRATE) | First of month | N/A (forward-filled) |
| ECB | TARGET2 calendar | EU holidays |
| BIS | Quarterly | N/A (forward-filled) |
When joining FRED daily + Yahoo daily, the calendars are nearly identical (both skip weekends). Minor holiday mismatches (e.g., Presidents’ Day — FRED closed, NYSE open) are handled by the forward-fill: the previous FRED value carries into the holiday gap.
Normalization Before Scoring
Cross-source signals combine values with wildly different scales (VIX ~ 12-80, SPY ~ 300-600, UMCSENT ~ 50-110). Direct comparison is meaningless.
Approach: Rolling z-score normalization
window = 252 # ~1 year of trading daysz = (value - rolling_mean(window)) / rolling_stddev(window)- Each component is converted to a z-score relative to its own trailing 1-year distribution.
- Z-scores are dimensionless and comparable across sources.
- Rolling window avoids regime bias from using the full history.
Sign convention:
- Positive z-score = component indicates risk-on
- Invert naturally risk-off indicators (VIX, USD strength) by negating their z-score
Composite Scoring
The composite is a simple equally-weighted average of component z-scores:
score = (z1 + z2 + ... + zN) / NWhere some components are negated per their risk direction:
risk_score = (-VIX_z - USD_z + UMCSENT_z + SPY_momentum_z + HYG_SPY_ratio_z) / 5Regime classification:
risk_score > 0.5→ risk_onrisk_score < -0.5→ risk_off- otherwise → neutral
The thresholds (0.5 / -0.5) are initial values; they can be tuned via backtesting in notebooks.
Multi-Source Lineage
Standard lineage records track a single source_id. Cross-source gold signals use a composite identifier:
{ "source_id": "fred+yahoo", "inputs": [ "s3://{silver}/source=fred/", "s3://{silver}/source=yahoo/" ], "stage": "gold"}Lineage Graph
fred_ingestion (bronze) ──→ fred_silver ──┐ ├──→ risk_sentiment_goldyahoo_ingestion (bronze) ──→ yahoo_silver ──┘The lineage record captures both input paths, enabling full DAG traversal from gold back to both original bronze sources.
Schema Hash
The schema hash covers the output gold table only. Input schemas are recorded in their respective silver lineage records. This keeps each record self-contained while the DAG provides the full picture.
Adding a New Cross-Source Signal
- Identify which silver tables you need and their frequencies.
- Choose the anchor source (highest frequency, defines output calendar).
- Create
etl/external/{domain}_gold.pyfollowing the pattern:read_*_silver()— read and filter each sourcepivot_*()— pivot series/tickers to columnsalign_sources()— outer join + forward-fill + anchor filtercompute_*()— domain-specific scoringrun()— orchestrate, write output, record lineage
- Use composite
source_id(e.g.,"fred+yahoo+ecb") in the lineage record. - Register in infrastructure construct for Step Functions orchestration.
- Add to the docs navigation table.
Current Cross-Source Signals
| Signal | Domain | Sources | Frequency | Output Path |
|---|---|---|---|---|
| risk_composite | sentiment | FRED (VIXCLS, DTWEXBGS, UMCSENT) + Yahoo (HYG, SPY) | daily | domain=sentiment/signal=risk_composite/ |
Future Candidates
| Signal | Sources | Rationale |
|---|---|---|
| credit_stress | FRED (DFF, T10Y2Y) + Yahoo (HYG, LQD) + BIS (credit gaps) | Multi-dimensional credit condition index |
| global_liquidity | FRED (M2SL) + ECB (M3) + BIS (cross-border credit) | Central bank liquidity composite |
| equity_macro_momentum | Yahoo (SPY, EEM) + FRED (INDPRO, PAYEMS) | Macro-confirmed equity momentum |
SDK / Notebook Usage
Once deployed, consume via stratum-sdk:
import stratum
client = stratum.connect(api_key="sk_...")
# Pre-computed composite scorerisk = client.signal("risk_composite", start="2020-01-01")
# Underlying components for custom weightingvix = client.dataset("fred_rates", columns=["VIXCLS"], start="2020-01-01")spy = client.dataset("yahoo_prices", columns=["SPY"], start="2020-01-01")
# Custom composite with different weightsmy_score = -2 * risk["VIXCLS_zscore"] + 3 * risk["spy_momentum_20d_zscore"]