Add parser for unsaved Windows Notepad tabs #540

Merged
merged 39 commits into fox-it:main on
Aug 16, 2024

joost-j
Contributor

@joost-j joost-j commented Feb 15, 2024

Just like Notepad++, the Windows Notepad application for Windows 11 is now able to restore unsaved tabs when you re-open the application. This blog post explains where the tab data is stored and suggests that the contents should be viewable in some way.

It turned out that there was a bit more to it than running strings or grep on it. The application stores the tabs in different formats, depending on the size and some other unknown factors. I've even encountered a file where 26 characters of text were encoded as 34(?!) separate blocks; each block containing a length field, a single character and a CRC32 checksum of that very small block. Using grep or strings on that file would not have yielded any results. Information stored in these Notepad tabs may be helpful during forensic investigations and/or incident response cases.

The file format uses LEB128 variable-integer encoding for the block sizes, which is not yet present in the dissect framework. Therefore, this PR depends on fox-it/dissect.cstruct#69, so in the end it depends on a new major release of dissect.cstruct.
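For readers unfamiliar with LEB128, here is a minimal Python sketch of the unsigned variant (ULEB128). It is an illustration only and not the dissect.cstruct implementation; the function names `encode_uleb128` and `decode_uleb128` are made up for this example.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("ULEB128 encodes non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_uleb128(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one ULEB128 value from buf at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

Values below 128 take a single byte, which is why a short block length looks like a plain byte in a hex dump; larger values simply grow by one byte per 7 bits.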

The dissect/target/plugins/apps/texteditor folder was created, with corresponding new record types, so unsaved tabs from other text editors can also be added in the future.

@joost-j joost-j force-pushed the feature/windows_notepad_tabs branch from 0158683 to 7d5d5ce Compare February 15, 2024 22:47
@joost-j joost-j marked this pull request as ready for review February 19, 2024 14:49
@Horofic Horofic self-requested a review February 20, 2024 09:21
@joost-j joost-j requested a review from Horofic February 26, 2024 15:47
@joost-j joost-j requested a review from Schamper March 4, 2024 12:10
@joost-j
Contributor Author

joost-j commented Mar 4, 2024

Refactored the code to work with cstruct v3, now that fox-it/dissect.cstruct#73 has been merged.

@Nordgaren

Heyo! You guys still working on this? I have gotten quite a bit labeled in my tabstate util. I found that you guys mentioned this on John Hammond's tweet about the video going over this format.

I know there is a crc32 after the weird timestamp, but I have not implemented that.

I was wondering, because looking at the code you guys have, so far, there might be a few things you guys are missing. I don't really understand how this package interprets the C code, but if it's just like normal C structures, I think the magic and header start might be off, here.

This format seems to change a lot, depending on the "state" it is in (new tab, unsaved tab and saved tab). I was wondering if maybe @joost-j might have some more work done on this, and maybe we can collaborate to fill out the rest? Here is my tabstate util for Rust.

@joost-j
Contributor Author

joost-j commented Mar 5, 2024

Hi @Nordgaren, I think it's a great idea to collaborate on this! I have observed different storage formats as well, but could not yet link them to any action with regard to the "state", so I'm really interested in that as well. Let's try to fill out the rest indeed!

As mentioned in your repo, I'll contact you on Discord.

@daddycocoaman

daddycocoaman commented Mar 9, 2024

Hi,

I recently caught on to the Notepad tab hype and this plugin seems premature, IMO. It appears that Notepad tab files are a single format with three states:

  • Unsaved file
  • Saved file
  • Saved file with unsaved data

The file extension can also slightly change depending on if the file is open somewhere else or not (.tmp files vs just bin).

As it's written, this plugin will attempt to parse unsaved and saved files and will miss a lot of things, even in the unsaved state. For example, these structs can't be right at all:

struct multi_block_entry {
    uint16    offset;
    uleb128   len;
    wchar     data[len];
    char      crc32[4];
};
struct single_block_entry {
    uint16    offset;
    uleb128   len;
    wchar     data[len];
    char      unk1;
    char      crc32[4];
};

In a large file, the offset cannot be a uint16 (65535 max size) because files can have more than 65535 characters. It's more than likely an uleb128. Additionally, these structs don't address what happens when a character is deleted, because the structure is different. If I add the letter r at offset 2 in the unsaved state, the file will append: 02 00 01 72 00 A0 29 D5 3E

Where 02 is the offset, 00 is "unknown" (not really unknown, read below), 01 is the uleb128 length, 72 00 is the wchar, then the crc32.

But if I remove that r, the file appends: 02 01 00 E5 DE 3C 3D. Same offset "02" but the next value is a "01" which is the length of the item I removed (if you highlight and delete multiple characters, it'll be whatever the length of the deleted characters was).

What this all equates to is that if the second uleb128 value is 00, then a character has been added. If it's anything other than zero, characters have been removed by that length.
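The two byte sequences above can be read with a small parser. The following is only a sketch under the assumptions stated in this comment: each appended record is taken to be `offset`, `deleted length`, `added length` (all ULEB128), followed by the added UTF-16LE characters and a 4-byte CRC32. The names `Edit`, `parse_edit` and `apply_edit` are hypothetical, and the CRC32 is returned as raw bytes without being verified, since the exact input range of the checksum is not stated here.

```python
from dataclasses import dataclass


def read_uleb128(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one unsigned LEB128 value; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


@dataclass
class Edit:
    offset: int    # position of the change in the text
    deleted: int   # number of characters removed at offset (0 for a pure insert)
    inserted: str  # characters added at offset ("" for a pure delete)
    crc32: bytes   # raw checksum bytes, NOT verified in this sketch


def parse_edit(buf: bytes, pos: int = 0) -> tuple[Edit, int]:
    """Parse one record, assuming the layout described in the lead-in."""
    offset, pos = read_uleb128(buf, pos)
    deleted, pos = read_uleb128(buf, pos)
    added, pos = read_uleb128(buf, pos)
    end = pos + 2 * added                      # wchar = 2 bytes each
    inserted = buf[pos:end].decode("utf-16-le")
    crc32 = buf[end:end + 4]
    return Edit(offset, deleted, inserted, crc32), end + 4


def apply_edit(text: str, edit: Edit) -> str:
    """Replay an edit on a string (offsets treated as character indices)."""
    return text[:edit.offset] + edit.inserted + text[edit.offset + edit.deleted:]


insert_r, _ = parse_edit(bytes.fromhex("02 00 01 72 00 A0 29 D5 3E"))
delete_r, _ = parse_edit(bytes.fromhex("02 01 00 E5 DE 3C 3D"))
```

With this reading, the first record is an insertion of "r" at offset 2 and the second removes one character at the same offset, so replaying both on any text returns the original text.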

With all that said, I've actually managed to figure out the notepad tab format (with the exception of a single u8 field) in all three states. I'll be posting about it soon and I'd recommend waiting before adding this plugin.

@Schamper
Member

Schamper commented Mar 9, 2024

@daddycocoaman will you be contributing your findings?

@Nordgaren

@daddycocoaman will you be contributing your findings?

I gave all of this data to Joost myself a while ago. I will be making a full write-up shortly. I didn't have time when I told him I would, but everything listed here is in my util, except the CRC. I just haven't taken the time to put it in yet. Unexpected life events. haha.

So you guys should be able to implement what I have so far, and the markdown file should follow this weekend. Maybe even a .bt file if I can manage!

@Nordgaren

Nordgaren commented Mar 10, 2024

I can confirm most of what you said here. It is uleb128, as Joost identified. You got the saved states, pretty much. There is the new tab state (which I believe you are calling unsaved) and additionally something like a "soft save", which happens when Notepad closes but the tab stays open without a filepath. This will write all of the buffer contents to the file, instead of the weird keystroke meme that it seems to be in the new tab state.

Did you get the entire metadata structure in the saved file state? That is pretty much the only data I haven't figured out all the way, but I know its size.

You can check out the tabstate-util crate on my GitHub if you want to cross-reference your findings. Might be good. There are some weird curveballs in this format.

I believe there still might be some structures that only appear in special conditions. Those are kinda the hardest to work out.

I think a good idea now would be to consolidate test files so we can get all of the possible structures available for testing/parsing.

@Nordgaren

There is also this issue on my repo which has a lot of good information

@daddycocoaman

There is also this issue on my repo which has a lot of good information

That is a great thread. To answer your question, yes I have all the fields of the saved state, including the cursor locations, timestamp, and a few other things in the format. There's no special delimiters in the format. Everything has a specific value.

To answer @Schamper, I hadn't actually heard of dissect before the comments on the John Hammond video. I think I might just put out the format and let people adopt it however they liked. Personally, as a red teamer, I have my own reasons. 😂

@Nordgaren

What did you find the single byte and then 4 byte int that mirrors it around the cursor start and end, to be?

@Paradoxis
Contributor

@daddycocoaman

Personally, as a red teamer, I have my own reasons. 😂

Heya, fellow red teamer here: Could you maybe elaborate on this part? I'm genuinely curious why you wouldn't just share the information if you have it.

Dissect isn't just a blue-team tool. I use it as a red teamer myself all the time, as it ships with some pretty sick file parsing capabilities (especially nice if you don't have 4TB of bandwidth to spare when all you need is like 200 bytes or something). Hell, I'd like to see this plugin get implemented as well; I can think of some pretty cool stuff you could do with it.

If it's losing out on credit you're worried about, external contributors are always credited (as I was here and here) :)

@JustArion

JustArion commented Mar 10, 2024

Hi there, @ogmini and I both have implementations of the tab format that are near completion. ogmini has the C# implementation here and I have the ImHex pattern implementation here. Neither is 100% complete yet, but most of the format is covered.

@Nordgaren

Nordgaren commented Mar 10, 2024

My .bt is uploaded, now.

It's missing new tab state files (the files with keystrokes instead of just characters) and tabstate files with extra buffers after the main one, for now, but it should be otherwise accurate.

@daddycocoaman

daddycocoaman commented Mar 11, 2024

Heya, fellow red teamer here: Could you maybe elaborate on this part? I'm genuinely curious why you wouldn't just share the information if you have it.

I'm putting out a blog post through my employer (hopefully reviewed and posted this week once I submit it today). It's less about credit on this repo and more like making sure the work I put in to RE the format is written up and distributed more formally.

What did you find the single byte and then 4-byte int that mirrors it around the cursor start and end, to be?

Sorry, I misspoke when I said "all" (I was trying to be careful about not saying that). At any rate, the bytes after the cursor end are not a single field, but 4 separate byte fields that represent different boolean options. I've labeled them as wordWrapEnabled, rightToLeftEnabled, showUnicodeControlChars, and I'm missing the last one at the moment.
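As a rough illustration of the four option bytes described above, a parser could map them onto named fields. This is only a sketch: the field order follows the order in which the labels are listed in this comment, which is an assumption, and the `ViewOptions` and `parse_view_options` names are hypothetical. The fourth byte is kept as a raw integer because its meaning is unknown.

```python
from dataclasses import dataclass


@dataclass
class ViewOptions:
    word_wrap: bool
    right_to_left: bool
    show_unicode_control_chars: bool
    unknown: int  # fourth byte, meaning not yet identified


def parse_view_options(raw: bytes) -> ViewOptions:
    """Interpret the 4 bytes after the cursor end as separate one-byte options."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 option bytes")
    return ViewOptions(bool(raw[0]), bool(raw[1]), bool(raw[2]), raw[3])
```

Treating the bytes individually, rather than as one 4-byte integer, keeps the unknown byte visible in the parsed output until someone works out what it does.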

The byte before the cursor start always appears to be `01` and is the field I mentioned earlier that I hadn't figured out. There are a couple of other boolean options related specifically to the Notepad UWF app, but I have no idea how to configure them in either Notepad or the registry (like GhostFile or ClassicEditor, which I thought would mean using the old Notepad). I even tried turning off the integrated Copilot 🥲

@ogmini

ogmini commented Mar 11, 2024

Ah! More Notepad options... Hadn't thought to even test those yet. I hope you'll link the blog post when it's published. This has been an interesting exercise and learning experience for me.

Not to change topics, have you looked at the Windowstate files? I started to take a stab at those to stop from fixating on the Tabstate files too much. It stores window size and position as one would guess. I've only started trying to figure out the rest of the file. https://github.com/ogmini/Notepad-Windowstate-Buffer

@Nordgaren

Amazing. Thank you! So now we just need the byte before the cursors and the byte at the end of those bools, then! Thank you for sharing! I will go label the 3 bools. The last byte could also be padding, maybe? Also, did you find anything for the 0x00 that comes after the sha256 hash and before the unknown 0x01 before the cursors?

@daddycocoaman

I need to reconfirm but I think that first one is a null terminator for the SHA256. I remember seeing a comparison for 0x00 when it was comparing hashes but that could have been the data.

Also, unfortunately, it's not the only thing left. The third byte in the format after "NP" is not part of the magic header. It's used by the .0 and .1 temp state files for unsaved files.

@ogmini

ogmini commented Mar 13, 2024

Also, unfortunately, it's not the only thing left. The third byte in the format after "NP" is not part of the magic header. It's used by the .0 and .1 temp state files for unsaved files.

I've tagged that as the sequence number for the .0 and .1 files. It goes up incrementally. It is also a uLEB128.

https://github.com/ogmini/Notepad-Tabstate-Buffer?tab=readme-ov-file#0bin--1bin

*Edit

Oh, I think I misread your post. You already know about the .0 and .1 file. I've still assumed it to be a sequence number for the bin file. Just that it always appears to be 0x00.
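To make the start of the file concrete, here is a hedged sketch of reading the two-byte "NP" magic followed by the sequence number as a ULEB128 value. The function names are hypothetical, and everything after the sequence number is deliberately left unparsed because the following fields are still being discussed in this thread.

```python
MAGIC = b"NP"


def read_uleb128(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one unsigned LEB128 value; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_header_start(buf: bytes) -> tuple[int, int]:
    """Return (sequence_number, offset_of_next_field) for a tab state file."""
    if buf[:2] != MAGIC:
        raise ValueError("bad magic, expected b'NP'")
    # Assumption: the sequence number directly follows the magic as ULEB128.
    return read_uleb128(buf, 2)
```

For a regular .bin file this would be expected to return a sequence number of 0, while .0.bin and .1.bin files would carry an incrementing value, if the observation above holds.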

@Nordgaren

I have actually noticed that the 4th byte in the file, which is supposed to be the saved state, is also the count of remaining characters in a file, I think? Could be a uleb128, too, as you mentioned. The 3rd byte in the file does seem to change, and I think I have seen it change with the correct magic, as well. I can't remember for certain. Will try to dig a bit, shortly!

@Nordgaren

Nordgaren commented Mar 13, 2024

I need to reconfirm but I think that first one is a null terminator for the SHA256. I remember seeing a comparison for 0x00 when it was comparing hashes but that could have been the data.

Also, unfortunately, it's not the only thing left. The third byte in the format after "NP" is not part of the magic header. It's used by the .0 and .1 temp state files for unsaved files.

It would be weird for it to be a null terminator for the SHA-256 hash, though, as it is a fixed-size hash in the file: 32 bytes, or 256 bits. But I also wouldn't put it past Microsoft. It could also be padding. One thing I find weird is that they are using a varint for a fixed-size int. I assume that is for future-proofing, though. So maybe they just have a fixed-size varint (lol) and some extra padding to compensate? That also seems like an out-there idea, but I'm just throwing it out there in case it helps anyone else see something different.

So far it doesn't look like they care about padding or even alignment, tbh, although I haven't actually sat down to check. Plus, they could be reading the individual bytes for the sha 256 hash, anyhow, so thus no real alignment issues. Should only be an issue if they are reading the type as multiple u32/u64/u128s or a single u256

@joost-j
Contributor Author

joost-j commented Mar 25, 2024

Hi @ogmini, @daddycocoaman, @JustArion, thanks for all the suggestions and tips! The PR was aimed at getting initial support for unsaved tabs into dissect. Then the John Hammond video was published, after which more and more people started researching this file format. I had indeed not yet uncovered/reversed all of the fields and structures that were suggested above, which e.g. also take into account the state of the application (closed/opened). It looks like I can include some more test files covering all the different states, to make this plugin more complete. Obviously, I will also incorporate the suggestions that were made in this thread into my code.

Unfortunately I have not been able to work on this project for the past few weeks, but I'm planning to pick up the research again in the near future.

@joost-j
Contributor Author

joost-j commented May 8, 2024

I added the optionsVersion field in the latest commit. Although we might not be 100% sure that this is correct, we can always fill in these gaps later on. At least the dissect parser now seems to be able to parse the newer format as well. The parser is now also able to handle variable-length data buffers that follow a fixed-length data buffer.

@joost-j joost-j requested a review from Horofic May 8, 2024 13:00
@joost-j
Contributor Author

joost-j commented Jun 25, 2024

Although the file format is still changing and we might not have uncovered 100% of the file format, I feel this is a great example of online collaboration to get things done. I therefore submitted a proposal for a talk at the SANS summit in Prague in September 2024, which got accepted last week. 🎉

The goal of the talk is not to explain the file format in full detail, nor to take full credit of uncovering the file format, but to encourage and inspire people to take a (deep) dive into unknown file formats and share the findings with the entire DFIR community! Of course I'll be sure to mention your contributions @Nordgaren @daddycocoaman @ogmini @JustArion, as well as John Hammond's video. And of course, if one of you happens to be there in person, let's discuss this irl!

@ogmini

ogmini commented Jun 25, 2024

Very cool, I hope you can share your slides or talk after the summit. I'm also going to be submitting a talk to BSides-NYC mainly talking about my experience, what I've learned, and my process. Hopefully it is accepted. I'll also make sure to mention you all.

I haven't had much of a chance to update this thread since this isn't remotely my day job. I've been working on trying to automate generation of test state files with known parameters to more easily detect changes from future updates. I've also been poking at the application hive which is stored as the settings.dat file in the Settings folder. Nothing particularly mind-blowing in that file.

I continue to update my notes here: https://github.com/ogmini/Notepad-State-Library

@Horofic
Contributor

Horofic commented Jul 29, 2024

@daddycocoaman did you make any progress in the meantime, or do you have an estimate for a possible release? I'm eager to see how we can improve the implementation further!

@Horofic Horofic self-assigned this Aug 13, 2024
@joost-j joost-j requested a review from Horofic August 14, 2024 10:31

codecov bot commented Aug 16, 2024

Codecov Report

Attention: Patch coverage is 95.09804% with 5 lines in your changes missing coverage. Please review.

Project coverage is 75.54%. Comparing base (1123da0) to head (a9b32eb).
Report is 1 commits behind head on main.

Files Patch % Lines
...t/target/plugins/apps/texteditor/windowsnotepad.py 94.73% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #540      +/-   ##
==========================================
+ Coverage   75.46%   75.54%   +0.07%     
==========================================
  Files         303      305       +2     
  Lines       26229    26331     +102     
==========================================
+ Hits        19794    19891      +97     
- Misses       6435     6440       +5     
Flag Coverage Δ
unittests 75.54% <95.09%> (+0.07%) ⬆️


Horofic
Horofic previously approved these changes Aug 16, 2024
@Horofic Horofic merged commit a78c7f4 into fox-it:main Aug 16, 2024
18 checks passed
@JSCU-CNI
Contributor

The test files of this PR should be LFS objects. Could someone fix that? :)

Encountered 17 files that should have been pointers, but weren't:
        tests/_data/plugins/apps/texteditor/windowsnotepad/3d0cc86e-dfc9-4f16-b74a-918c2c24188c.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/3f915e17-cf6c-462b-9bd1-2f23314cb979.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/85167c9d-aac2-4469-ae44-db5dccf8f7f4.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/appclosed_saved_and_deletions.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/appclosed_unsaved.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/ba291ccd-f1c3-4ca8-949c-c01f6633789d.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/c515e86f-08b3-4d76-844a-cddfcd43fcbb.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/cfe38135-9dca-4480-944f-d5ea0e1e589f.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/dae80df8-e1e5-4996-87fe-b453f63fcb19.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/e609218e-94f2-45fa-84e2-f29df2190b26.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/lots-of-deletions.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/new-format.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/saved.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/stored_unsaved_with_new_data.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/unsaved-with-deletions.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/unsaved.bin
        tests/_data/plugins/apps/texteditor/windowsnotepad/wrong-checksum.bin

9 participants