BQ time travel with Analytics Hub linked datasets

Hi team,

I've set up a data exchange and created a listing in it. I subscribed to the listing, which created a linked dataset in my target project (different from the source project). When I try to run time travel queries against tables in the linked dataset (SELECT * FROM target_project.linked_dataset.standard_table FOR SYSTEM_TIME AS OF TIMESTAMP(...)), they fail with the error Not found: Table standard_table:linked_dataset.standard_table@1713797390587 was not found in location US at [1:1], even though the timestamp used in the query is less than 2 days old (the minimum time travel retention period). I've verified that both my source dataset (a.k.a. the shared dataset, in Analytics Hub terms) and the linked dataset are in the same region (US).
My question: how can we use time travel on standard tables in a linked dataset to retrieve data from some point in the recent past?
Thanks

ACCEPTED SOLUTION

The error you're getting suggests there's a disconnect between how time travel is normally handled in BigQuery and how it interacts with linked datasets from Analytics Hub. While time travel works seamlessly on native BigQuery tables, linked datasets present a unique scenario.

Why Standard Time Travel Doesn't Work as Expected

  • Linked Data Nature: Linked datasets in Analytics Hub act as references or pointers to the original data in the source project. They are not physical copies of the data.
  • Time Travel Implementation: Time travel relies on historical snapshots of table data. Since linked datasets are references, these snapshots aren't automatically created for them.

Refined Approach for Time Travel with Linked Datasets

  1. Identify Source Tables: Determine the exact tables in the source project's dataset that correspond to the linked dataset tables you're trying to query.

  2. Direct Time Travel on Source: Instead of querying the linked dataset directly with time travel, construct your queries to target the source tables in the original dataset.

    • Example: If your linked dataset is target_project.linked_dataset.standard_table, find its corresponding source table (e.g., source_project.source_dataset.standard_table) and use it in your time travel query (a programmatic Java sketch follows after this list):

       
      SELECT *
      FROM source_project.source_dataset.standard_table FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
      
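If you issue these queries programmatically (the later replies in this thread are about the Java client), here is a minimal sketch of running the same time travel query against the source table with the google-cloud-bigquery library. It assumes application default credentials, and the project, dataset, and table names are just the placeholders used above.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class SourceTimeTravelQuery {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Time travel must target the SOURCE table, not the linked dataset
        // in the subscriber project (placeholder names from this thread).
        String sql =
            "SELECT * "
                + "FROM `source_project.source_dataset.standard_table` "
                + "FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)";

        TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
        result.iterateAll().forEach(row -> System.out.println(row));
    }
}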

Additional Considerations

  • Permissions: Ensure that you have the necessary permissions to access both the source project and the linked dataset.
  • Egress Restrictions (If Applicable): Double-check if any data egress restrictions have been inadvertently applied to the listing. If so, you'll need to adjust them to allow time travel queries on the source data.
  • Alternative (Data Copies): If time travel is a critical requirement and direct querying of the source is not feasible, you might consider creating copies of the relevant tables within your target project. These copies would then allow standard time travel operations going forward (see the sketch after this list).
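
For the data-copy alternative, one straightforward option is a CREATE TABLE AS SELECT from the linked dataset into a regular (non-linked) dataset in the target project; from that point on, the copy accumulates its own time travel history. Below is a minimal sketch using the Java client; target_project.local_dataset.standard_table_copy is a hypothetical destination, so substitute a native dataset you own.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class MaterializeLinkedTableCopy {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Copy the linked table into a native dataset in the target project.
        // Time travel on the copy only covers changes made after this runs.
        String ctas =
            "CREATE OR REPLACE TABLE `target_project.local_dataset.standard_table_copy` AS "
                + "SELECT * FROM `target_project.linked_dataset.standard_table`";

        bigquery.query(QueryJobConfiguration.newBuilder(ctas).build());
        System.out.println("Copy created; standard time travel now applies to the copy.");
    }
}

If the copy needs to stay current, the same statement can be re-run on a schedule (for example, as a scheduled query), at the cost of extra storage in the target project.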

Example (Complete Flow)

  1. Listing in Data Exchange (Source Project): source_project.source_dataset (shared as source_dataset)

  2. Linked Dataset (Target Project): target_project.linked_dataset

  3. Query:

    SELECT *
    FROM source_project.source_dataset.standard_table FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    

 


7 REPLIES

One more point. I've not set any data egress restriction on the listing in the data exchange

@ms4446 Thanks for your reply. I have a follow-up question: how can we detect in Java code whether a dataset is an original source dataset (one that actually contains data) or a linked dataset?

Based on my debugging, I could make use of Dataset.getDescription(). If it contains something like Source listing: <listing_name>, we can say it's a linked dataset. But this approach can fail in some edge cases, so I'd like to know a better, more concrete approach.

The most reliable way to distinguish between linked datasets and source datasets is to examine specific metadata properties associated with BigQuery datasets.

Here's an improved Java code snippet leveraging the BigQuery API:

 
import com.google.cloud.bigquery.*;
// ... (other imports)

public boolean isLinkedDataset(BigQuery bigquery, DatasetId datasetId) {
    Dataset dataset = bigquery.getDataset(datasetId);

    if (dataset == null) {
        // Dataset not found, handle this case appropriately
        return false; 
    }

    DatasetInfo datasetInfo = dataset.toBuilder().build();

    // Check for the presence of the linked dataset metadata
    if (datasetInfo.getLinkedDataset() != null && !datasetInfo.getLinkedDataset().isEmpty()) {
        return true; // It's a linked dataset
    }

    return false; // It's not a linked dataset
}
  • BigQuery API: We use the com.google.cloud.bigquery library to interact with BigQuery.
  • isLinkedDataset Function:
    • Accepts a BigQuery client (bigquery) and a DatasetId object (specifying the project ID and dataset ID).
    • Retrieves the dataset using bigquery.getDataset().
    • Core Check: The key is to examine the DatasetInfo.getLinkedDataset() method. If this returns a non-empty value, it indicates that the dataset is linked to another dataset in a different project.
  • Error Handling: The code includes a null check for the dataset retrieval, as it's a good practice to handle the case where a dataset might not exist.

Explanation:

  1. Dataset Retrieval: We fetch the Dataset object representing the target dataset.
  2. Dataset Metadata: We obtain the DatasetInfo object, which contains detailed metadata about the dataset.
  3. Linked Dataset Check: We directly examine the DatasetInfo.getLinkedDataset() property. If it has a value (a list of strings representing linked dataset information), we know it's a linked dataset.

Why this is Superior to Description Parsing:

  • Reliability: This approach is more reliable because it relies on specific metadata properties designed to indicate linked datasets. It doesn't depend on parsing potentially variable descriptions.
  • Clarity: The code explicitly checks for linked dataset metadata, making the logic clearer and easier to maintain.
  • Edge Cases: It avoids the potential for false positives or negatives that could arise from parsing descriptions that might be edited or changed.

Example:

 
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
DatasetId datasetId = DatasetId.of("your-project-id", "your-dataset-id");

if (isLinkedDataset(bigquery, datasetId)) {
    System.out.println("Dataset is a linked dataset.");
} else {
    System.out.println("Dataset is an original source dataset.");
}

Important Note: Make sure you have the necessary BigQuery permissions to access dataset metadata.

@ms4446 I'm not able to find the DatasetInfo.getLinkedDataset() method. I'm using com.google.cloud:google-cloud-bigquery:2.5.1. I also checked the DatasetInfo class in the latest version, 2.39.1, but couldn't find a getLinkedDataset() method there either.

My apologies for the confusion. The getLinkedDataset() method was recently introduced in the BigQuery Java client library (version 2.13.0 and later), and it's not present in older versions like 2.5.1.

Alternative Approach for Older Versions

Since you are using version 2.5.1, you'll need a slightly different approach to detect linked datasets. You can still use the description parsing technique you mentioned earlier, but with a more robust check:

 
import com.google.cloud.bigquery.*;
// ... (other imports)

public boolean isLinkedDataset(BigQuery bigquery, DatasetId datasetId) {
    Dataset dataset = bigquery.getDataset(datasetId);

    if (dataset == null) {
        // Dataset not found
        return false; 
    }

    String description = dataset.getDescription();
    
    // Check for specific linked dataset description pattern
    return description != null && description.matches("Source listing: .*"); 
}

Explanation:

  1. Get Dataset Description: We retrieve the dataset description using dataset.getDescription().
  2. Regular Expression Match: We use the regular expression Source listing: .* to check whether the description matches the pattern typically found in linked dataset descriptions. Note that String.matches() requires the whole string to match, so this effectively checks that the description begins with "Source listing:". That pattern indicates the dataset is linked to a listing in a data exchange.

Important Considerations:

  • Reliability: While this approach is less ideal than using the dedicated getLinkedDataset() method, it is still fairly reliable as the description pattern for linked datasets is consistent.
  • Edge Cases: It's theoretically possible that a user could manually modify the description of a linked dataset to remove the "Source listing:" part. However, this is unlikely in most scenarios (a slightly more defensive variant of the check is sketched below).
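
If you stay on the description heuristic, a slightly more defensive variant (trimming whitespace and ignoring case so minor formatting changes don't break the match) could look like the sketch below. This is still only a heuristic, not an authoritative check; the method name looksLikeLinkedDataset is just illustrative.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.Dataset;
import com.google.cloud.bigquery.DatasetId;

public class LinkedDatasetHeuristic {

    // Heuristic only: relies on the auto-generated "Source listing: ..." description
    // of linked datasets, since this client version exposes no dedicated field for it.
    public static boolean looksLikeLinkedDataset(BigQuery bigquery, DatasetId datasetId) {
        Dataset dataset = bigquery.getDataset(datasetId);
        if (dataset == null || dataset.getDescription() == null) {
            return false; // dataset missing or has no description
        }
        // Trim and lowercase so stray whitespace or casing changes don't break the match.
        return dataset.getDescription().trim().toLowerCase().startsWith("source listing:");
    }
}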

Upgrading Library Version (Recommended):

If possible, I strongly recommend upgrading your BigQuery Java client library to a newer version (2.13.0 or later). This will give you access to the more reliable getLinkedDataset() method, eliminating the need for description parsing.

Example (Upgraded Version):

Assuming you've upgraded your library to a compatible version, here's how you'd use the getLinkedDataset() method:

 
public boolean isLinkedDataset(BigQuery bigquery, DatasetId datasetId) {
    Dataset dataset = bigquery.getDataset(datasetId);
    
    if (dataset == null) {
        // Dataset not found
        return false;
    }
    
    DatasetInfo datasetInfo = dataset.toBuilder().build();
    
    // Use the dedicated method (available in version 2.13.0+)
    return datasetInfo.getLinkedDataset() != null && !datasetInfo.getLinkedDataset().isEmpty(); 
}