feat: Add SQLAlchemy Offline Store Support#5993
feat: Add SQLAlchemy Offline Store Support#5993aniketpalu wants to merge 6 commits intofeast-dev:masterfrom
Conversation
…greSQL & SQLite Signed-off-by: Aniket Paluskar <apaluska@redhat.com>
Signed-off-by: Aniket Paluskar <apaluska@redhat.com>
| #} | ||
|
|
||
| {% macro table_alias(alias) -%} | ||
| {% if dialect == 'oracle' %}{{ alias }}{% else %}AS {{ alias }}{% endif %} |
There was a problem hiding this comment.
Oracle doesn't use AS keyword for table aliases.
Signed-off-by: Aniket Paluskar <apaluska@redhat.com>
|
@aniketpalu While I fully share the desire to simplify and somehow consolidate offline store implementations, the idea of using sqlalchemy has come up before and was (imho rightfully) rejected. sqlalchemy data transfer performance is simply unacceptable for anything other than a toy pipeline. To get the rows to arrow, you first go through row-by-row sqlalchemy python tuples, then use pandas as an intermediate step. Previously I was trying to use ibis to accomplish something similar (both duckdb and mssql stores share same ibis implementation for example), but even if ibis sounds too complex, I'd strongly suggest exploring different technologies for this. For example you could generate sql in any dialect, transpile to target dialect with something like sqlglot and then use adbc for transport. you would avoid extra heavy dependencies and get probably the best performance one could realistically get with python. |
Signed-off-by: Aniket Paluskar <apaluska@redhat.com>
| WHERE fv_sub_{{ outer_loop_index }}.event_timestamp <= base.event_timestamp | ||
| {% if featureview.ttl != 0 %} | ||
| AND fv_sub_{{ outer_loop_index }}.event_timestamp >= {{ timestamp_minus_seconds('base.event_timestamp', featureview.ttl) }} | ||
| {% endif %} | ||
| {% for entity in featureview.entities %} | ||
| AND fv_sub_{{ outer_loop_index }}."{{ entity }}" = base."{{ entity }}" | ||
| {% endfor %} |
There was a problem hiding this comment.
🔴 PostgreSQL non-entity multi-FV path: correlated LEFT JOIN subquery without LATERAL keyword
When get_historical_features is called without an entity_df (non-entity retrieval mode) and there are multiple feature views, the Jinja2 template generates a LEFT JOIN with a derived table that references columns from the outer base_entities alias. In PostgreSQL, this is a correlated subquery pattern that requires the LATERAL keyword, but it is not included.
Root cause and generated SQL
The template at lines 807-825 generates SQL like:
FROM base_entities base
LEFT JOIN (
SELECT DISTINCT ON ("entity_col")
event_timestamp, "entity_col", "feature1"
FROM "fv_0__data" AS fv_sub_0
WHERE fv_sub_0.event_timestamp <= base.event_timestamp -- references outer 'base'
AND fv_sub_0."entity_col" = base."entity_col" -- references outer 'base'
ORDER BY "entity_col", event_timestamp DESC
) AS fv_0 ON TRUEPostgreSQL prohibits referencing other FROM-clause entries inside a non-LATERAL derived table. This will fail at runtime with:
ERROR: invalid reference to FROM-clause entry for table "base"
The fix would require adding LATERAL to the LEFT JOIN (LEFT JOIN LATERAL (...)) when producing a correlated subquery for PostgreSQL.
Impact: All multi-feature-view non-entity historical feature retrievals on PostgreSQL will fail with a SQL error.
Was this helpful? React with 👍 or 👎 to provide feedback.
| ) {{ table_alias('ranked') }} | ||
| WHERE ranked.rn = 1 | ||
| {% endif %} | ||
| ) {{ table_alias('fv_' ~ outer_loop_index) }} {{ on_true() }} |
There was a problem hiding this comment.
🔴 Non-PostgreSQL non-entity multi-FV path: LEFT JOIN ON 1=1 with no entity matching produces cross join
For non-PostgreSQL dialects (Oracle, MySQL, SQLite, etc.) in the non-entity multi-feature-view path, the ROW_NUMBER subquery selects the globally latest row per entity, then joins to base_entities using only ON 1=1 (a cross join condition) with no entity key matching. This silently produces incorrect results.
Root cause and generated SQL
The template at lines 826-845 generates SQL like:
FROM base_entities base
LEFT JOIN (
SELECT event_timestamp, "entity_col", "feature1"
FROM (
SELECT fv_sub_0.*,
ROW_NUMBER() OVER(PARTITION BY "entity_col" ORDER BY event_timestamp DESC) AS rn
FROM "fv_0__data" AS fv_sub_0
) AS ranked
WHERE ranked.rn = 1
) AS fv_0 ON 1=1 -- no entity matching!The inner subquery cannot reference base (since it's not a LATERAL join), so it returns only the single globally-latest row per entity. The outer ON 1=1 then cross-joins this with every row in base_entities. If base_entities has 10 entity+timestamp combinations, each will get the same feature values (the latest per entity, ignoring timestamp), which is both a cross-join explosion and semantically wrong.
Contrast this with the PostgreSQL path which at least attempts entity matching (but fails due to the missing LATERAL keyword, see BUG-0001).
Impact: Multi-feature-view non-entity historical retrievals on non-PostgreSQL databases will silently return incorrect, cross-joined data.
Prompt for agents
In sdk/python/feast/infra/offline_stores/contrib/sqlalchemy_offline_store/sqlalchemy.py, the Jinja2 template MULTIPLE_FEATURE_VIEW_POINT_IN_TIME_JOIN (lines 804-847) has broken LEFT JOIN logic in the non-entity multi-feature-view path (the path taken when start_date and end_date are set and there are multiple featureviews).
Two issues need to be fixed:
1. PostgreSQL path (lines 807-825): The DISTINCT ON subquery references base.event_timestamp and base.entity columns, but the derived table is not declared as LATERAL. Change LEFT JOIN to LEFT JOIN LATERAL for this path.
2. Non-PostgreSQL path (lines 826-844): The ROW_NUMBER subquery cannot correlate with the outer base table. Instead of using ON 1=1, add entity-matching conditions to the ON clause, e.g.: ON base.entity_col = fv_0.entity_col. The subquery should also be filtered to only include relevant rows (not just the globally latest), or the ON clause should handle the matching.
A clean approach would be:
- For PostgreSQL: Use LEFT JOIN LATERAL with the existing correlated DISTINCT ON subquery.
- For all other dialects: Either use a similar LATERAL pattern (if supported), or restructure the join to put entity matching conditions in the ON clause instead of ON 1=1. For example, change line 845 from ON 1=1 to ON base.entity_col = fv_N.entity_col for each entity column.
Was this helpful? React with 👍 or 👎 to provide feedback.
Signed-off-by: Aniket Paluskar <apaluska@redhat.com>
|
Thank you for your comment. It has been eye opening. I agree that I understand that |
What this PR does / why we need it:
This PR introduces a universal SQLAlchemy-based Offline Store for Feast, enabling historical feature retrieval from any SQL database supported by SQLAlchemy. Instead of maintaining separate implementations for each database (PostgreSQL, MySQL, Oracle, SQLite, etc.), this single implementation uses dialect-aware SQL generation to support multiple databases through one codebase.
Key Features
Configure feature_store.yaml
PostgreSQL Example
Oracle Example
MySQL Example
SQLite Example
Define Data Source
Which issue(s) this PR fixes:
Misc