Skip to content

Restore and send gaps for Windows events #1747

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

duhminick
Copy link
Contributor

@duhminick duhminick commented Jun 28, 2025

Description of the issue

Similar to #1721 but instead for Windows. With multi-threaded support, it is possible that certain events were not sent and the agent was shutdown: gaps in event delivery. We want to send those events on agent startup.

We also did not have any mocking for the sys calls to the Windows event API. This results in events being written to Windows directly since the actual sys calls were used for the tests.

Description of changes

Core Feature: Windows Event Log Gap Detection and Recovery

  1. Enhanced wineventlog.go with gap processing capabilities

    1. Added gapsToRead field to track missing event ranges during initialization and resubscription. gapsToRead is set by inverting the ranges from the restored state file.
    2. Implemented gap prioritization logic in the main event loop - gaps are processed before new events
    3. Added readGaps() method that opens separate query handles for each gap range and processes missing events
  2. New range-based querying functionality

    1. Added openAtRange() method to create targeted queries for specific event ID ranges
    2. Implemented CreateRangeQuery() utility to build filter queries based on RecordID using the filter (EventRecordID > X and EventRecordID <= Y)
  3. Architecture improvements for testability

    1. Refactored Windows API calls into WindowsEventAPI interface to enable mocking
    2. Consolidated system call functionality from sys_call.go into windows_event_api.go
    3. Added SetEventAPI() method to inject mock implementations during testing
    4. All Windows API calls now go through the interface (EvtSubscribe, EvtQuery, EvtNext, EvtClose, etc.)
    5. Created mock_windows_event_api.go with full mock implementation of Windows Event Log APIs

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  1. Unit tests
  2. Tested locally

Manual Tests

No state

Given no state file and Windows event logs having records and the Windows event logs having records

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z (26 records in total, records 1-26).
image

Results in
image

The state file after update:

26
Windows-AlphabetLog_i-0be8d9fb5653d32a1_TwoAlphaLog
0-26

Gaps in range

Given the state file having the content
24
Windows-AlphabetLog_i-0be8d9fb5653d32a1_TwoAlphaLog
2-5,10-15,20-24

and the Windows event logs having records
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z AA AB AC (29 records in total).

Results in
image

Notice that the "already sent" events are not delivered again.

The state file after update:

29
Windows-AlphabetLog_i-0be8d9fb5653d32a1_TwoAlphaLog
0-29

Confirm tailing of event logs after agent stop

This is a continuation of the `Gaps in range` test. We want to verify that offset still works. So given the state of the `Gaps in range` test, we will stop the agent, add a new Windows event, then start the agent.

Results in
image

and the state file:

30
Windows-AlphabetLog_i-0be8d9fb5653d32a1_TwoAlphaLog
0-30

Confirm tailing of event logs while agent still running

Given the state file having the content
24
Windows-AlphabetLog_i-0be8d9fb5653d32a1_TwoAlphaLog
2-5,10-15,20-24

and the Windows event logs having records
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z (26 records in total).

results in
image

then once adding a Windows event using Write-EventLog -LogName $LogName -Source $Source -EventId 1027 -EntryType Information -Message "Alphabet Entry: Letter AA"

results in
image

with the state file having content:

27
Windows-AlphabetLog3_i-0be8d9fb5653d32a1_ThreeAlphaLog
0-27

Requirements

Before commiting your code, please do the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

Integration Tests

To run integration tests against this PR, add the ready for testing label.

@duhminick duhminick changed the title Dominic gap windows pr Restore and send gaps for Windows events Jun 28, 2025
@duhminick duhminick added the ready for testing Indicates this PR is ready for integration tests to run label Jun 30, 2025
@duhminick duhminick marked this pull request as ready for review June 30, 2025 14:48
@duhminick duhminick requested a review from a team as a code owner June 30, 2025 14:48
@duhminick
Copy link
Contributor Author

duhminick commented Jun 30, 2025

Seems like the Windows range-based offset is inclusive unlike the Linux one. Going to make an update for that.

Fix assertion in wineventlog test

Fix mocks
@duhminick duhminick force-pushed the dominic-gap-windows-pr branch from 15061cd to 8bd9113 Compare June 30, 2025 21:32
@@ -219,3 +238,17 @@ func insertPlaceholderValues(rawMessage string, evtDataValues []Datum) string {
}
return sb.String()
}

func utf16PtrToString(ptr *uint16) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Seems like this is only used in the mock/tests. Consider moving it there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using it in two separate tests, utils_test.go and wineventlog_test.go - I figured it would be okay to have it here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's only used in tests, then IMO it should be in a test file.

Comment on lines +81 to +87
var levelsFilter string
formattedLevels := make([]string, len(levels))
for i, level := range levels {
formattedLevels[i] = fmt.Sprintf(eventLogLevelFilter, level)
}
if len(formattedLevels) > 0 {
levelsFilter = fmt.Sprintf("(%s)", strings.Join(formattedLevels, " or "))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could use a string.Builder instead of creating the slice and string. Not a big deal since this isn't called that frequently and there aren't that many levels/filters.

eventLogLevelFilter = "Level='%s'"
eventIgnoreOldFilter = "TimeCreated[timediff(@SystemTime) &lt;= %d]"
eventRangeFilter = "EventRecordID &gt; %d and EventRecordID &lt; %d"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this doesn't include the start or end record IDs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it won't include the start or end IDs. I used to have it include the start, but it would've led to one duplicate event

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the upper bound be lte instead? From what I can tell, if the state file is

0-10,20-30

It means that 10 was sent, but 20 was not. So when we invert it to get the gaps, it would be (10, 20].

Comment on lines 200 to 205
// Need to shift by 1 because the range logic assumes [start, end) whereas Windows events is [start, end]
if readingFromGap {
r.Shift(recordNumber + 1)
} else {
r.Shift(recordNumber)
}
Copy link
Contributor

@jefchien jefchien Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is just setting the number for the record that was just read, that means the end has to be inclusive. So each range that's sent is really just 1-wide and is actually just (start, end] and the equivalent of [end, end]. For example, if we just read ID 100, the range would look like (99, 100]. It represents knowing that 100 was processed, but not knowing it 99 was. This is fine when subscribing because of the EvtSubscribeStartAfterBookmark flag as mentioned above.

I'm not sure I understand why it won't work the same for the gaps. Isn't it going to try to restore from the state file that was generated during the subscribe? Is this just a matter of the RangeQuery not being upper bound inclusive?

@duhminick duhminick force-pushed the dominic-gap-windows-pr branch from a024adc to e4321ab Compare July 1, 2025 06:32
Comment on lines +305 to +312
handle, err := w.openAtRange(r)
defer func() {
winEventAPI.EvtClose(handle)
}()
if err != nil {
continue
}
readRecords := w.readFromHandle(handle)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Right now, the defer call will happen at the end of readGaps. Consider moving this into a separate function so the close happens after each read.

func (w *windowsEventLog) readGap(r state.Range) []*windowsEventLogRecord

@@ -219,3 +238,17 @@ func insertPlaceholderValues(rawMessage string, evtDataValues []Datum) string {
}
return sb.String()
}

func utf16PtrToString(ptr *uint16) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's only used in tests, then IMO it should be in a test file.

Comment on lines +329 to +341
func marshalRangeList(rl state.RangeList) string {
var marshalledRanges []string
for _, r := range rl {
text, _ := r.MarshalText()
marshalledRanges = append(marshalledRanges, string(text))
}
return strings.Join(marshalledRanges, ",")
}

func assertStateFileRange(t *testing.T, fileName string, rl state.RangeList) {
content, _ := os.ReadFile(fileName)
assert.Contains(t, string(content), marshalRangeList(rl))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Instead of marshaling, we could also just call the Restore function of the state manager.

Comment on lines +184 to +186
// This is per EvtHandle hence the necessity to break up these calls
// 0, 1, 4 were "sent" previously (should be skipped)
mockAPI.AddMockEventsForQuery(createMockEventRecords(0, 2, 5))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Think this is incorrect. Should be 1, 2, and 5 were previously sent and should be skipped.

elog.Init()

// Simulate new subscription events arriving
mockAPI.SimulateSubscriptionEvents(createMockEventRecords(5, 6, 7, 8))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The subscription event should not have 5 in it if we follow our logic and 5 was already sent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready for testing Indicates this PR is ready for integration tests to run
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants