How to find duplicate events ingested into Chronicle

Hi All,

Is there any way to find duplicate events ingested into Chronicle? If yes, could you please share more information?

With Regards,
Shaik Shaheer


@jstoner @mikewilusz @manthavish - Could you please help me identify the duplicate logs in Chronicle?

This should be possible using a pivot table, provided the log contains a unique identifier (Global Event ID, Event ID, Log ID, etc.). In the following example we are using Google Chronicle's demo instance and the 'Crowdstrike Falcon' log source, where the UDM field containing an event's unique identifier is "metadata.product_log_id".

[1] - First, search for the log type you want; in this case 'Crowdstrike Falcon' is: metadata.log_type = "CS_EDR"

[2] - Navigate to 'Pivot'

[3] - Apply Pivot settings as in the screenshot below (grouping by the unique identifier)

[Screenshot AymanC_0-1707327642792.png: Pivot settings grouping by the unique identifier]

[4] - Click the ⋮ (more options) menu, export the data to a .csv, and remove all rows with a count equal to "1" (which, if you order by descending count, will be at the bottom) :).

This should show you the event count per value of the grouped UDM field (here we are assuming that metadata.product_log_id is a unique identifier for each 'CS_EDR' log). Depending on how often you need this, a dashboard may be better suited.
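
If you'd prefer not to clean the exported CSV by hand, a short script can do the filtering from step [4]. This is a minimal Python sketch, assuming the export is named pivot_export.csv with columns named "metadata.product_log_id" and "Event Count" (the actual header names in your export may differ):

    import csv

    # Minimal sketch: keep only pivot rows whose event count is greater than 1,
    # i.e. product_log_id values that were ingested more than once.
    # File name and column headers are assumptions -- adjust to match your export.
    with open("pivot_export.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    duplicates = [r for r in rows if int(r["Event Count"]) > 1]

    for r in duplicates:
        print(r["metadata.product_log_id"], r["Event Count"])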

Hope this helps!

Hi Ayman C,

Greetings...!!!

Thank you for your suggestion; we attempted to implement this method. However, it makes the analyst's job tedious, as they have to manually export the data and check the logs individually. Is there an alternative, automated process available?

With Regards,
Shaik Shaheer

Hi Shaik,

Google Chronicle SIEM customers can leverage several automation strategies to check for duplicate ingested data. Here's a breakdown:

1. Hash-Based Deduplication

  • Mechanism:

    • Calculate a cryptographic hash (e.g., MD5, SHA-256) of the essential components of each event (consider a combination of timestamp, source IP, key fields).
    • Store the hash values in a fast-access data structure (like a bloom filter or a hash table).
    • Before ingesting a new event, check its hash against the stored values. If a match is found, it's likely a duplicate (see the sketch after this list).
  • Pros:

    • Reliable for detecting exact duplicates.
    • Can be implemented at the pipeline level.
  • Cons:

    • Minor changes to an event will produce a different hash, potentially leading to false negatives.
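
To make option 1 more concrete, here is a minimal Python sketch of the hashing and lookup step. The field names (timestamp, src_ip, message) are illustrative placeholders rather than real UDM paths, and the in-memory set stands in for a bloom filter or external key-value store:

    import hashlib

    seen_hashes = set()  # in practice: a bloom filter or external key-value store

    def event_hash(event: dict) -> str:
        # Hash the essential components of the event. The field names below
        # (timestamp, src_ip, message) are illustrative placeholders.
        key = "|".join(str(event.get(f, "")) for f in ("timestamp", "src_ip", "message"))
        return hashlib.sha256(key.encode("utf-8")).hexdigest()

    def is_duplicate(event: dict) -> bool:
        h = event_hash(event)
        if h in seen_hashes:
            return True  # likely a duplicate of an already-seen event
        seen_hashes.add(h)
        return False

    # Example: the second, identical event is flagged as a duplicate.
    e = {"timestamp": "2024-02-07T12:00:00Z", "src_ip": "10.0.0.5", "message": "login"}
    print(is_duplicate(e))   # False
    print(is_duplicate(e))   # True

In a real pipeline this check would sit in front of your Chronicle forwarder or ingestion API, and the hash store would need to be persistent and bounded (for example, time-windowed).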

2. Similarity Detection with Chronicle Rules

  • Mechanism:

    • Create Chronicle detection rules that compare essential event fields using similarity matching thresholds. Consider features like:
      • Near-matching timestamps
      • Similar IP addresses (perhaps within the same subnet)
      • Matching key fields (e.g., usernames, file names)
    • Optionally, use fuzzy matching or string-comparison algorithms (e.g., Levenshtein distance) for more flexible comparisons (a rough sketch follows this list).
  • Pros:

    • Detects duplicates even with slight modifications.
    • Leverages Chronicle's built-in rule engine, making it accessible to security analysts.
  • Cons:

    • Can be resource-intensive with large data volumes.
    • Requires careful rule tuning to avoid false positives.
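
The fuzzy-comparison part of option 2 can be prototyped outside the rule engine, for example over events exported from Chronicle. This is a minimal Python sketch, with placeholder field names and thresholds, combining a timestamp tolerance with difflib's SequenceMatcher as a stand-in for Levenshtein distance:

    from datetime import datetime
    from difflib import SequenceMatcher

    # Minimal sketch of the similarity checks described above, run over two
    # already-normalized events. Field names and thresholds are illustrative.
    def similar(a: dict, b: dict, text_threshold: float = 0.85, time_tolerance_s: int = 5) -> bool:
        # Near-matching timestamps (within a few seconds of each other).
        t_a = datetime.fromisoformat(a["timestamp"])
        t_b = datetime.fromisoformat(b["timestamp"])
        if abs((t_a - t_b).total_seconds()) > time_tolerance_s:
            return False

        # Matching key fields (e.g., usernames), compared with a fuzzy ratio
        # as a stand-in for Levenshtein distance.
        ratio = SequenceMatcher(None, a["principal_user"], b["principal_user"]).ratio()
        return a["src_ip"] == b["src_ip"] and ratio >= text_threshold

    e1 = {"timestamp": "2024-02-07T12:00:00", "src_ip": "10.0.0.5", "principal_user": "jdoe"}
    e2 = {"timestamp": "2024-02-07T12:00:03", "src_ip": "10.0.0.5", "principal_user": "jdoe "}
    print(similar(e1, e2))  # True: same IP, near-identical user, 3 seconds apart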

3. External Data Deduplication

  • Mechanism:

    • Send a stream of normalized events or pre-calculated hashes to a dedicated deduplication service, or use a log management/SIEM platform with native deduplication features (see the sketch below).
  • Pros:

    • Offloads computation from Chronicle.
    • Potentially more advanced deduplication algorithms and centralized management.
  • Cons:

    • Adds complexity to the data pipeline.
    • Might introduce latency.
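
As a rough illustration of option 3, the sketch below ships a pre-calculated hash to an external deduplication endpoint before forwarding the log. The service URL and its JSON contract are entirely hypothetical; substitute whatever deduplication service or platform you actually use:

    import hashlib
    import requests

    # Hypothetical external deduplication service: the URL and the
    # {"hash": ..., "duplicate": ...} contract are illustrative only.
    DEDUP_URL = "https://dedup.example.internal/check"

    def forward_if_new(raw_log: str) -> bool:
        """Return True if the log should be forwarded (the service has not seen its hash)."""
        h = hashlib.sha256(raw_log.encode("utf-8")).hexdigest()
        resp = requests.post(DEDUP_URL, json={"hash": h}, timeout=5)
        resp.raise_for_status()
        if resp.json().get("duplicate"):
            return False  # drop: the service has already seen this hash
        # ...otherwise hand the raw log to your Chronicle forwarder / ingestion API...
        return True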