This project aims to take any raw email message (RFC 5322, 2822, or 822) and transform it into a simple data structure that can be output as JSON or MsgPack (or anything else, if you implement an encoder).
The data structures output from this library simplify email parsing for the most common use cases including:
- Getting the HTML and/or text body of an email
- Reading attachments
- Reading and decoding headers like From, To, Subject, Date, etc.
Most of the heavy lifting is handled by Apache James Mime4j. All we're doing here is specifying a simple and opinionated data structure.
Why not use Mime4j directly?
Mime4j is the kitchen sink. You still need to put the pieces together and handle error scenarios. The goal of this library is to give reasonable output for any given email source, so you can write apps that consume email as easily as you write apps to consume, say, a REST API.
What do you mean by an "opinionated" data structure?
Email is very flexible, and that makes it very complicated. There are many things which are technically possible to do with email but are never done in practice. This project errs on the side of practical.
Here are a few examples:
- An email may technically be From multiple addrs, but we simplify this and assume there's only one.
- According to the spec, there is a difference between mailboxes, mailbox lists, addrs, and addr lists; and there are esoteric features of mailbox lists that are seldom used in practice (e.g. grouping). We do away with these complications. An email addr is always the same shape no matter what it is (From, To, Sender, Reply-To, whatever).
- We try to normalise and parse common headers into usable data structures. For example, dates are parsed into UTC ISO 8601 dates. Another example: if there is no Date header, we attempt to detect the date from the Received headers.
- We simplify extracting the HTML/text body, even in convoluted multipart email messages.
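As a rough illustration of the date handling described above, here is a minimal sketch using only `java.time`. The helper names `toUtcIso8601` and `detectDate` are hypothetical and not part of this library's API. The choice to fall back to the last parseable Received header is also an assumption, and the library's own output format may differ from what this sketch produces.

```kotlin
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import java.time.format.DateTimeParseException

// Hypothetical helpers for illustration; these are not this library's API.

// Parses an RFC 5322 style date and returns it as a UTC ISO 8601 string,
// or null if the input cannot be parsed.
fun toUtcIso8601(raw: String): String? {
    // Drop parenthesised comments such as "(PDT)" before parsing.
    val cleaned = raw.replace(Regex("\\([^)]*\\)"), "").trim()
    return try {
        ZonedDateTime.parse(cleaned, DateTimeFormatter.RFC_1123_DATE_TIME)
            .toInstant()
            .toString()
    } catch (e: DateTimeParseException) {
        null
    }
}

// Uses the Date header if it parses. Otherwise falls back to the timestamp
// that follows the last ';' in a Received header (assumption: the last
// parseable Received header in the list is used).
fun detectDate(dateHeader: String?, receivedHeaders: List<String>): String? {
    dateHeader?.let { toUtcIso8601(it) }?.let { return it }
    return receivedHeaders
        .mapNotNull { toUtcIso8601(it.substringAfterLast(';')) }
        .lastOrNull()
}
```

With this sketch, `toUtcIso8601("Thu, 17 Aug 2017 13:18:33 +0100")` returns `2017-08-17T12:18:33Z`, and `detectDate(null, listOf("by 192.168.1.1 with HTTP; Thu, 17 Aug 2017 05:18:33 -0700 (PDT)"))` returns the same instant.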
The library handles parsing emails and then sorts the various pieces into a structure you can use in real-world applications.
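To give a feel for what sorting the pieces of a nested multipart message involves, the sketch below walks a simplified part tree and picks out the first text and HTML bodies. The `Part` class and `firstBody` function are hypothetical stand-ins, not Mime4j's or this library's types, and the sketch ignores details such as attachment dispositions and character sets.

```kotlin
// Hypothetical, simplified MIME part tree for illustration only.
data class Part(
    val contentType: String,
    val text: String? = null,
    val children: List<Part> = emptyList()
)

// Depth-first search for the first leaf part with the given content type.
fun firstBody(part: Part, contentType: String): String? {
    if (part.children.isEmpty()) {
        return if (part.contentType.equals(contentType, ignoreCase = true)) part.text else null
    }
    for (child in part.children) {
        firstBody(child, contentType)?.let { return it }
    }
    return null
}

// Example tree: multipart/mixed wrapping a multipart/alternative body.
val exampleMessage = Part(
    contentType = "multipart/mixed",
    children = listOf(
        Part(
            contentType = "multipart/alternative",
            children = listOf(
                Part("text/plain", "Hello, world!"),
                Part("text/html", "<p>Hello, world!</p>")
            )
        )
    )
)
```

Here `firstBody(exampleMessage, "text/plain")` yields the text body and `firstBody(exampleMessage, "text/html")` yields the HTML body, regardless of how deeply they are nested.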
The project comes with a CLI command to demonstrate an example implementation. Build the project (./gradlew build), then use the fat jar at cli/build/libs/parse-email.jar:
$ java -jar cli/build/libs/parse-email.jar \
--file lib/src/test/resources/email-sources/simple-multi.eml \
--format json
See example output below.
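import java.io.File

// Parse a raw email file into the simplified message structure.
val file = File("my-email.eml")
val parser = Parser()
val message = file.inputStream().use { parser.parse(it) }

// Encode the parsed message as JSON; `use` closes each stream when done.
val outFile = File("email.json")
val encoder = JsonEncoder.create()
outFile.outputStream().use { encoder.writeToStream(message, it) }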
{
"subject": "My Subject",
"messageId": "\u003cCAOZrWnfF9pMZY4k9stM07W8nCkCoixwEFGK\u003dn710e7xw\u003d[email protected]\u003e",
"from": {
"name": "John Doe",
"email": "[email protected]"
},
"sender": null,
"replyTo": [],
"returnPath": null,
"tos": [
{
"name": "Jane Doe",
"email": "[email protected]"
}
],
"ccs": [],
"date": "2017-08-17 13:18:33",
"references": [],
"bodyText": "Hello, world!",
"bodyHtml": "<p>Hello, world!</p>",
"headers": [
{
"name": "MIME-Version",
"value": "1.0"
},
{
"name": "Received",
"value": "by 192.168.1.1 with HTTP; Thu, 17 Aug 2017 05:18:33 -0700 (PDT)"
},
{
"name": "X-Originating-IP",
"value": "[10.0.1.1]"
},
{
"name": "Date",
"value": "Thu, 17 Aug 2017 13:18:33 +0100",
"date": "2017-08-17 13:18:33"
},
{
"name": "Delivered-To",
"value": "[email protected]"
},
{
"name": "Message-ID",
"value": "\u003cCAOZrWnfF9pMZY4k9stM07W8nCkCoixwEFGK\u003dn710e7xw\u003d[email protected]\u003e"
},
{
"name": "Subject",
"value": "My Subject"
},
{
"name": "From",
"value": "John Doe \u003c[email protected]\u003e",
"addrs": [
{
"name": "John Doe",
"email": "[email protected]"
}
]
},
{
"name": "To",
"value": "Jane Doe \u003c[email protected]\u003e",
"addrs": [
{
"name": "Jane Doe",
"email": "[email protected]"
}
]
},
{
"name": "Content-Type",
"value": "multipart/mixed; boundary\u003d\"001a113eb2a6dace510556f2022f\""
}
]
}
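If you consume this JSON from Kotlin, the output above maps naturally onto a few small data classes. The sketch below is inferred only from the sample output and covers a subset of its fields; the class names, nullability choices, and the `headerValues` helper are assumptions, not the library's actual types.

```kotlin
// Hypothetical shapes inferred from the sample JSON; not the library's real classes.
data class Addr(val name: String?, val email: String)

data class Header(
    val name: String,
    val value: String,
    val addrs: List<Addr>? = null, // present on address headers such as From and To
    val date: String? = null       // present on the Date header
)

data class Message(
    val subject: String?,
    val from: Addr?,
    val tos: List<Addr>,
    val bodyText: String?,
    val bodyHtml: String?,
    val headers: List<Header>
)

// Header names are case-insensitive, so compare them that way.
fun Message.headerValues(name: String): List<String> =
    headers.filter { it.name.equals(name, ignoreCase = true) }.map { it.value }

// A small instance mirroring part of the sample output (addresses are placeholders).
val sample = Message(
    subject = "My Subject",
    from = Addr("John Doe", "john@example.com"),
    tos = listOf(Addr("Jane Doe", "jane@example.com")),
    bodyText = "Hello, world!",
    bodyHtml = "<p>Hello, world!</p>",
    headers = listOf(
        Header("MIME-Version", "1.0"),
        Header("X-Originating-IP", "[10.0.1.1]")
    )
)
```

For example, `sample.headerValues("mime-version")` returns a list containing `1.0`, since the lookup ignores case.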