-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compile each schema only once #198
Compile each schema only once #198
Conversation
} | ||
|
||
implicit class LineEndingAmbiguousMatcherString(s: String) extends ExpectationsCreation{ | ||
def ===/(other: String): MatchResult[String] = createExpectable(s).applyMatcher(new LineEndingAmbiguousMatcher(other)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test resource files contain LF endings and they are generated using system line endings. In case of windows CRLF, which breaks all the tests. This should compare files regardless of the line endings.
val contents: String = scala.io.Source.fromFile(fileName).mkString | ||
val source = scala.io.Source.fromFile(fileName) | ||
val contents: String = source.mkString | ||
source.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stream wasn't closed and in this a rec function, which breaks on the 2nd execution
7f16f3c
to
1f681c0
Compare
Ok, sounds great. Thanks very much for the contributions. main/trunk is kept up-to-date with what's published, so I can offer the following options:
|
It is up to you. I don't have anything ready, it will take me a bit of time to make all of it review worthy. I don't plan to break the API, at most extend it a little. Would you rather I push everything here and create one giant PR or a set of smaller ones? |
restrictedFields: Boolean, | ||
targetScalaPartialVersion: String): Unit = { | ||
inFiles.flatMap(fileParser.getSchemaOrProtocols(_, format, classStore, classLoader)) | ||
.distinct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the biggest change. This compiles each avro schema only once for a set a given input source files.
Doing this speeds up compilation by 65% in my case, but requires also a small change in build plugins (sbt-avrohugger etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.distinct
In the scripted
tests in sbt-avrohugger, distinct
is throwing an exception when the schema has no namespace (java classes, wreckless with null
)
[info] [info] Compiling Avro IDL to /private/var/folders/1m/4vqhbccd6l7dkz3t8q766fdr0000gn/T/sbt_dc4d7335/target/scala-2.12/src_managed/main/compiled_avro
[info] [error] java.lang.NullPointerException: Cannot invoke "String.hashCode()" because "this.namespace" is null
[info] [error] at org.apache.avro.Protocol.hashCode(Protocol.java:400)
[info] [error] at scala.runtime.Statics.anyHash(Statics.java:122)
[info] [error] at scala.util.hashing.MurmurHash3.productHash(MurmurHash3.scala:76)
[info] [error] at scala.util.hashing.MurmurHash3$.productHash(MurmurHash3.scala:246)
[info] [error] at scala.runtime.ScalaRunTime$._hashCode(ScalaRunTime.scala:167)
[info] [error] at scala.util.Right.hashCode(Either.scala:438)
[info] [error] at scala.collection.mutable.FlatHashTable.addEntry(FlatHashTable.scala:153)
[info] [error] at scala.collection.mutable.FlatHashTable.addEntry$(FlatHashTable.scala:152)
[info] [error] at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41)
[info] [error] at scala.collection.mutable.FlatHashTable.addElem(FlatHashTable.scala:144)
[info] [error] at scala.collection.mutable.FlatHashTable.addElem$(FlatHashTable.scala:143)
[info] [error] at scala.collection.mutable.HashSet.addElem(HashSet.scala:41)
[info] [error] at scala.collection.mutable.HashSet.add(HashSet.scala:66)
[info] [error] at scala.collection.SeqLike.distinct(SeqLike.scala:519)
[info] [error] at scala.collection.SeqLike.distinct$(SeqLike.scala:509)
[info] [error] at scala.collection.AbstractSeq.distinct(Seq.scala:45)
[info] [error] at avrohugger.generators.FileGenerator.filesToFiles(FileGenerator.scala:116)
[info] [error] at avrohugger.Generator.filesToFiles(Generator.scala:98)
[info] [error] at sbtavrohugger.FileWriter$.generateCaseClasses(FileWriter.scala:40)
[info] [error] at sbtavrohugger.SbtAvrohugger$.$anonfun$avroSettings$8(SbtAvrohugger.scala:82)
[info] [error] at sbt.util.FileFunction$.$anonfun$cached$1(FileFunction.scala:81)
[info] [error] at sbt.util.FileFunction$.$anonfun$cached$4(FileFunction.scala:154)
[info] [error] at sbt.util.Difference.apply(Tracked.scala:415)
[info] [error] at sbt.util.Difference.apply(Tracked.scala:395)
[info] [error] at sbt.util.FileFunction$.$anonfun$cached$3(FileFunction.scala:150)
[info] [error] at sbt.util.Difference.apply(Tracked.scala:415)
[info] [error] at sbt.util.Difference.apply(Tracked.scala:390)
[info] [error] at sbt.util.FileFunction$.$anonfun$cached$2(FileFunction.scala:149)
[info] [error] at sbtavrohugger.SbtAvrohugger$.$anonfun$avroSettings$7(SbtAvrohugger.scala:84)
[info] [error] at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[info] [error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:63)
[info] [error] at sbt.std.Transform$$anon$4.work(Transform.scala:69)
[info] [error] at sbt.Execute.$anonfun$submit$2(Execute.scala:283)
[info] [error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:24)
[info] [error] at sbt.Execute.work(Execute.scala:292)
[info] [error] at sbt.Execute.$anonfun$submit$1(Execute.scala:283)
[info] [error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[info] [error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:65)
[info] [error] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
[info] [error] at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
[info] [error] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
[info] [error] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
[info] [error] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
[info] [error] at java.base/java.lang.Thread.run(Thread.java:1583)
[info] [error] (Compile / avroScalaGenerate) java.lang.NullPointerException: Cannot invoke "String.hashCode()" because "this.namespace" is null
[info] [error] Total time: 1 s, completed Dec 21, 2024, 1:18:10 PM
[error] x avrohugger/GenericSerializationTests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using distinctBy
, but that appeared in scala 2.13, while avrohugger needs to publish against 2.12 (in order to support the sbt plugin). the following is basically copy-paste from stack overflow, you may prefer a different solution.
def distinctByFullName(list: List[Either[Schema, Protocol]]): List[Either[Schema, Protocol]] =
list.filterNot{
var set = Set[String]()
e =>
val property = e match {
case Left(value) => value.getFullName()
case Right(value) => Option(value.getNamespace).fold(value.getName())(ns => s"${ns}.${value.getName()}")
}
val b = set(property);
set += property;
b
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the sbt-plugin, I think the needed changes will be fine. E.g., Filewriter
can be like, etc.
...
// for (inFile <- AvscFileSorter.sortSchemaFiles(getSrcFiles(srcDirs, "avsc"))) {
// log.info("Compiling AVSC %s to %s".format(inFile, target.getPath))
// generator.fileToFile(inFile, target.getPath)
// }
val avscFiles: Seq[File] = getSrcFiles(srcDirs, "avsc")
if (avscFiles.isEmpty) ()
else
log.info("Compiling AVSC to %s".format(target.getPath))
generator.filesToFiles(AvscFileSorter.sortSchemaFiles(avscFiles).toList, target.getPath)
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if these snippets aren't setbacks to your performance improvements, then I can publish this round as version 2.10.0
, and we can build from there at our convenience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check improvements in new commits.
We have to preserve order of compilation, so we can't turn the list into a set. We can only use it to lookup if a schema was already processed.
I retested with my avro libs and it is still as fast and works as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
julianpeeters/sbt-avrohugger#105 and here is matching sbt-avrohugger PR in case you decide to publish 2.10.0
This is as far I would like to go in first round of improvements. It already makes a huge difference in my build pipeline. Please review and let me know how you would like to proceed. |
Looking good. A couple issues, but, once resolved, this can be |
Good stuff, thanks for the speed-ups. avrohugger 2.10.0 and sbt-avrohugger 2.10.0 are available now on maven central. |
#199 I have a big avro library that I'm compiling with avrohugger. It has been great, but the performance is a bit lacking. On the CI it can take a few hours and I have done some experiments and I was able to get it down to 20 mins (around 5% of the original compilation time). I'd like to contribute my improvements, but to make sure that I don't break anything I need a stable environment. I'm running windows and unfortunately none of the tests pass.