Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publish v0.1.0 to crates.io #45

Closed
puffyCid opened this issue Dec 16, 2024 · 6 comments
Closed

Publish v0.1.0 to crates.io #45

puffyCid opened this issue Dec 16, 2024 · 6 comments

Comments

@puffyCid
Copy link
Collaborator

Now that this project has been tested on lots of log files and receiving feedback (and additional contributions).

It makes sense to start publishing to crates.io, this will hopefully make it easier to use in projects.

@jrouaix @dgmcdona you both have been making alot of contributions, anything either of you want to get merged before first version gets published to crates?

Otherwise, i will try to publish on 2024-12-19 🥳

@jrouaix
Copy link
Contributor

jrouaix commented Dec 16, 2024

🎉

We use a very close fork of this repo, the only mandatory thing we'd need from a published crate to use it is some public modules :

image

shindan-io#2

@dgmcdona
Copy link
Contributor

Hey, thanks for asking! I have a few thoughts on things that might be good to add before an initial release in order to prevent major breaking changes for consumers down the road, and also make feature additions and enhancements a little easier as well.

Decouple filesystem implementation from parsing functions

In parser.rs, many of the methods take &str path arguments, conduct filesystem operations under the hood, and make the assumption that logs will be in the logarchive format produced by log collect. This is probably true for many users, but lots of forensic collection tools aren't designed to produce that directory structure. What could help in this case is to define a trait, and then implement a UnifiedLogsParser struct (or something like it) that takes something that implements that trait as its input. I'm thinking something like this:

// traits.rs
use std::io::Read;

pub trait FileProvider {
    fn tracev3_files(&self) -> Box<dyn Iterator<Item=Box<dyn Read>>>;
    fn uuidtext_files(&self) -> Box<dyn Iterator<Item=Box<dyn Read>>>;
    fn dsc_files(&self) -> Box<dyn Iterator<Item=Box<dyn Read>>>;
    fn timesync_files(&self) -> Box<dyn Iterator<Item=Box<dyn Read>>>;
}

This could also be written to allow for static dispatch, but would require consumers to use rustc > 1.75:

pub trait FileProvider {
    fn tracev3_files(&self) -> impl Iterator<Item=impl Read>;
    fn uuidtext_files(&self) -> impl Iterator<Item=impl Read>;
    fn dsc_files(&self) -> impl Iterator<Item=impl Read>;
    fn timesync_files(&self) -> impl Iterator<Item=impl Read>;
}

I lean towards the first option since I view the cost of using dynamic dispatch with Box in this context as fairly negligible.

Then, a type can be defined that implements that trait, providing users with an easy path to the common case while allowing loose coupling for users to implement the trait in whatever way works best for them. If we want to allow for, for instance, on-demand lookups of strings down the road (I believe @jrouaix proposed a more memory friendly implementation of LogData where the format strings aren't fully materialized until asked), or other operations that might require reading from a specific offset, it may also be a good idea to require Seek as in addition to Read, or ReadSeek as seen here.

This could be as simple as wrapping PathBuf using the Newtype pattern:

use std::path::PathBuf;

struct LogArchive(PathBuf);

impl FileProvider for LogArchive {
    // implement trait methods
}

Allow for variation in behavior for certain aspects of the parser

The big one here that I'm thinking of is strings_data/shared_strings lookups. A scenario I'm imagining is this: You might have multiple collections taken from the same machine over time. It's likely that a lot of that string/shared_string data overlaps, and you may want a way to deduplicate that by implementing a custom storage of some kind (likely a database). Defining another trait here could allow for these custom implementations. It would also allow software operating in memory-constrained environments, as requested in #14 , to define their own implementations without changing the current in-memory storage implementation.

I have far less understanding of this part of the code than you, but my reading of the source suggests that this should be possible.

Decide on public API + add docs

What would really help this release go off with a bang is if we updated the docs and decided exactly what should be public and why. It may also be a good idea to re-export some things with pub use statements to create a more intuitive API. As it stands now, the cargo docs for this crate require a bit of digging into examples to get a handle on how things should be done.

All of that is a quite a lot in order for this to be released to crates.io on the 19th, so I totally understand if you don't feel like trying to hit all of those items. I could also be off target in my understanding of some things, so if anyone has better ideas for how these things could be acheived I'd love to hear them. For my part, I'd be happy to implement at least the first point in the next few days, though, and hopefully contribute to the docs as well.

@jrouaix
Copy link
Contributor

jrouaix commented Dec 17, 2024

Wow @dgmcdona very nice backlog !

It will need a really long time to achieve all this.
Perhaps it's ok to publish the crate now and handle breaking changes with semver.

Some more thoughts :

Decouple filesystem implementation from parsing functions

Yes ! We (https://github.com/shindan-io) have a little tool we use as file access layer : https://github.com/shindan-io/scnr. If this change land on unified logs we could implement a unified logs plugin 🎉

About the memory friendly implementation, my idea was to parse only the types not needing allocations.

  • Some low level parser would iterate over some LogData<'a> { integer_value, bool_value, ipadress ...., message: MessageStruct<'a> { slice: &'a[u8], other_parts ... } }
  • From this low level structs holding lifetime to memory slices we could iterate very fast, filter on some already parsed information, and only then call some faillible function to materialize the full fleshed log : fn allocate_message(&self) -> Result<FullFleshedMessage, SecondStageError>

It really is a cloudy plan for now, I didn't do any work on this yet but my current work on the lib to have nicer parsers and avoid allocations is a way for me to discover the lib, maturate the idea and prepare for this future work.

Allow for variation in behavior for certain aspects of the parser

Ho, I never thought about this. I think it could be another level of possibility after the previous one (low level no alloc parsers). And perhaps also it's not the duty of this lib but could be achieved by users if the api allow this kind of usage.

Decide on public API + add docs

I cannot say no, but is it better to publish now and polish then, or polish now holding back on an important feature : "having a crate to use this lib easily" ? I'd vote "shot now".

@puffyCid
Copy link
Collaborator Author

puffyCid commented Dec 18, 2024

thanks for the comments all. The 19th date was deadline if no comments were posted on issue.

I'm fine adjusting date if needed, but I think since the library is an ok working state, im inclined to release a 0.1.0 version on 19th (or this weekend).

IMO releasing a version on crates.io now would allow users who are comfortable using the library in its current state to have a stable version to use (instead of specific a github repo commit in Cargo.toml). While major changes are being done in preparation for version 1.0 (or 0.2.0).

Regarding comments so far:

  1. Im fine marking more parts of the library pub. I think that could be done before its published

  2. Decouple filesystem implementation from parsing functions. I like the idea. If you want to open a dedicated issue and own it that would awesome! Since holidays are coming very soon, i will defer to you on level of effort and time u think it will take. It sounds like though something like this will probably be ready in Jan 2025?

  3. Allow for variation in behavior for certain aspects of the parser. I like this idea alot, it kind of reminds me of the NSA WELM project and Velocidex evtx-data. It sounds like the idea is to store some of the string data in another format (ex: sqlite). Is that accurate? I would be interested in researching this a bit more. It might be do able.

  4. Decide on public API + add docs. Yes agree, docs for this library are not good. It sounds like completing Decouple filesystem implementation from parsing functions would make a good starting to really start making heavy documentation changes?

@dgmcdona
Copy link
Contributor

Thanks for the feedback! I have a working implementation of the filesystem decoupling that I can open an MR for (at least a draft one) to get feedback on tomorrow evening, and I’d be happy to flesh out the docs as well once we finalize a design that we’re happy with. Releasing to crates.io so users can pin a stable version makes sense to me as well, especially since it’s still only v0.1.0 and people should expect breaking changes as the library matures anyway. I don’t feel like that should necessarily block on any of those issues I raised, but it may be worth taking a look at the aforementioned PR to see if it’s something you’d want to include, since it does a lot for usability but will definitely be a breaking change to the most public-facing part of the API.

@puffyCid
Copy link
Collaborator Author

v0.1.0 published!
thanks all who contributed! 🥳

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants