Compile each schema only once #198

mkljakubowski · 2024-12-19T16:08:01Z

#199 I have a big avro library that I'm compiling with avrohugger. It has been great, but the performance is a bit lacking. On the CI it can take a few hours and I have done some experiments and I was able to get it down to 20 mins (around 5% of the original compilation time). I'd like to contribute my improvements, but to make sure that I don't break anything I need a stable environment. I'm running windows and unfortunately none of the tests pass.

mkljakubowski · 2024-12-19T16:09:33Z

avrohugger-tools/src/test/scala/Util.scala

+  }
+
+  implicit class LineEndingAmbiguousMatcherString(s: String) extends ExpectationsCreation{
+    def ===/(other: String): MatchResult[String] = createExpectable(s).applyMatcher(new LineEndingAmbiguousMatcher(other))


Test resource files contain LF endings and they are generated using system line endings. In case of windows CRLF, which breaks all the tests. This should compare files regardless of the line endings.

mkljakubowski · 2024-12-19T16:10:16Z

avrohugger-tools/src/test/scala/Util.scala

-        val contents: String = scala.io.Source.fromFile(fileName).mkString
+        val source = scala.io.Source.fromFile(fileName)
+        val contents: String = source.mkString
+        source.close()


stream wasn't closed and in this a rec function, which breaks on the 2nd execution

julianpeeters · 2024-12-20T02:44:08Z

I'd like to contribute my improvements in a few PRs

Ok, sounds great. Thanks very much for the contributions.

main/trunk is kept up-to-date with what's published, so I can offer the following options:

I can publish milestones for each round/PR, especially if you've got multiple ready to go. E.g., 2.10.0-M1
Or I can simply publish 2.10.0, if you prefer a real release sooner, and more PRs delayed

mkljakubowski · 2024-12-20T08:54:44Z

It is up to you. I don't have anything ready, it will take me a bit of time to make all of it review worthy. I don't plan to break the API, at most extend it a little. Would you rather I push everything here and create one giant PR or a set of smaller ones?

mkljakubowski · 2024-12-20T10:23:31Z

avrohugger-core/src/main/scala/generators/FileGenerator.scala

+    restrictedFields: Boolean,
+    targetScalaPartialVersion: String): Unit = {
+    inFiles.flatMap(fileParser.getSchemaOrProtocols(_, format, classStore, classLoader))
+      .distinct


This is the biggest change. This compiles each avro schema only once for a set a given input source files.
Doing this speeds up compilation by 65% in my case, but requires also a small change in build plugins (sbt-avrohugger etc)

.distinct

In the scripted tests in sbt-avrohugger, distinct is throwing an exception when the schema has no namespace (java classes, wreckless with null)

[info] [info] Compiling Avro IDL to /private/var/folders/1m/4vqhbccd6l7dkz3t8q766fdr0000gn/T/sbt_dc4d7335/target/scala-2.12/src_managed/main/compiled_avro [info] [error] java.lang.NullPointerException: Cannot invoke "String.hashCode()" because "this.namespace" is null [info] [error] at org.apache.avro.Protocol.hashCode(Protocol.java:400) [info] [error] at scala.runtime.Statics.anyHash(Statics.java:122) [info] [error] at scala.util.hashing.MurmurHash3.productHash(MurmurHash3.scala:76) [info] [error] at scala.util.hashing.MurmurHash3$.productHash(MurmurHash3.scala:246) [info] [error] at scala.runtime.ScalaRunTime$._hashCode(ScalaRunTime.scala:167) [info] [error] at scala.util.Right.hashCode(Either.scala:438) [info] [error] at scala.collection.mutable.FlatHashTable.addEntry(FlatHashTable.scala:153) [info] [error] at scala.collection.mutable.FlatHashTable.addEntry$(FlatHashTable.scala:152) [info] [error] at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41) [info] [error] at scala.collection.mutable.FlatHashTable.addElem(FlatHashTable.scala:144) [info] [error] at scala.collection.mutable.FlatHashTable.addElem$(FlatHashTable.scala:143) [info] [error] at scala.collection.mutable.HashSet.addElem(HashSet.scala:41) [info] [error] at scala.collection.mutable.HashSet.add(HashSet.scala:66) [info] [error] at scala.collection.SeqLike.distinct(SeqLike.scala:519) [info] [error] at scala.collection.SeqLike.distinct$(SeqLike.scala:509) [info] [error] at scala.collection.AbstractSeq.distinct(Seq.scala:45) [info] [error] at avrohugger.generators.FileGenerator.filesToFiles(FileGenerator.scala:116) [info] [error] at avrohugger.Generator.filesToFiles(Generator.scala:98) [info] [error] at sbtavrohugger.FileWriter$.generateCaseClasses(FileWriter.scala:40) [info] [error] at sbtavrohugger.SbtAvrohugger$.$anonfun$avroSettings$8(SbtAvrohugger.scala:82) [info] [error] at sbt.util.FileFunction$.$anonfun$cached$1(FileFunction.scala:81) [info] [error] at sbt.util.FileFunction$.$anonfun$cached$4(FileFunction.scala:154) [info] [error] at sbt.util.Difference.apply(Tracked.scala:415) [info] [error] at sbt.util.Difference.apply(Tracked.scala:395) [info] [error] at sbt.util.FileFunction$.$anonfun$cached$3(FileFunction.scala:150) [info] [error] at sbt.util.Difference.apply(Tracked.scala:415) [info] [error] at sbt.util.Difference.apply(Tracked.scala:390) [info] [error] at sbt.util.FileFunction$.$anonfun$cached$2(FileFunction.scala:149) [info] [error] at sbtavrohugger.SbtAvrohugger$.$anonfun$avroSettings$7(SbtAvrohugger.scala:84) [info] [error] at scala.Function1.$anonfun$compose$1(Function1.scala:49) [info] [error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:63) [info] [error] at sbt.std.Transform$$anon$4.work(Transform.scala:69) [info] [error] at sbt.Execute.$anonfun$submit$2(Execute.scala:283) [info] [error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:24) [info] [error] at sbt.Execute.work(Execute.scala:292) [info] [error] at sbt.Execute.$anonfun$submit$1(Execute.scala:283) [info] [error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265) [info] [error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:65) [info] [error] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317) [info] [error] at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) [info] [error] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317) [info] [error] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [info] [error] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [info] [error] at java.base/java.lang.Thread.run(Thread.java:1583) [info] [error] (Compile / avroScalaGenerate) java.lang.NullPointerException: Cannot invoke "String.hashCode()" because "this.namespace" is null [info] [error] Total time: 1 s, completed Dec 21, 2024, 1:18:10 PM [error] x avrohugger/GenericSerializationTests

I tried using distinctBy, but that appeared in scala 2.13, while avrohugger needs to publish against 2.12 (in order to support the sbt plugin). the following is basically copy-paste from stack overflow, you may prefer a different solution.

def distinctByFullName(list: List[Either[Schema, Protocol]]): List[Either[Schema, Protocol]] = list.filterNot{ var set = Set[String]() e => val property = e match { case Left(value) => value.getFullName() case Right(value) => Option(value.getNamespace).fold(value.getName())(ns => s"${ns}.${value.getName()}") } val b = set(property); set += property; b }

for the sbt-plugin, I think the needed changes will be fine. E.g., Filewriter can be like, etc.

... // for (inFile <- AvscFileSorter.sortSchemaFiles(getSrcFiles(srcDirs, "avsc"))) { // log.info("Compiling AVSC %s to %s".format(inFile, target.getPath)) // generator.fileToFile(inFile, target.getPath) // } val avscFiles: Seq[File] = getSrcFiles(srcDirs, "avsc") if (avscFiles.isEmpty) () else log.info("Compiling AVSC to %s".format(target.getPath)) generator.filesToFiles(AvscFileSorter.sortSchemaFiles(avscFiles).toList, target.getPath) ...

if these snippets aren't setbacks to your performance improvements, then I can publish this round as version 2.10.0, and we can build from there at our convenience.

Please check improvements in new commits.
We have to preserve order of compilation, so we can't turn the list into a set. We can only use it to lookup if a schema was already processed.
I retested with my avro libs and it is still as fast and works as expected.

julianpeeters/sbt-avrohugger#105 and here is matching sbt-avrohugger PR in case you decide to publish 2.10.0

mkljakubowski · 2024-12-20T10:26:05Z

This is as far I would like to go in first round of improvements. It already makes a huge difference in my build pipeline. Please review and let me know how you would like to proceed.

julianpeeters · 2024-12-21T21:26:13Z

Looking good. A couple issues, but, once resolved, this can be 2.10.0.

julianpeeters · 2025-01-01T02:18:29Z

Good stuff, thanks for the speed-ups. avrohugger 2.10.0 and sbt-avrohugger 2.10.0 are available now on maven central.

mkljakubowski commented Dec 19, 2024

View reviewed changes

fix tests on windows

1f681c0

mkljakubowski force-pushed the speedup-round1 branch from 7f16f3c to 1f681c0 Compare December 19, 2024 16:38

mkljakubowski added 2 commits December 19, 2024 17:53

format

fb08fd0

optimize cache lookup

73b4a39

mkljakubowski added 4 commits December 20, 2024 10:21

optimize cache lookup

5d98bf2

fix flaky tests

cd1460e

compile schema only once per file

7845dcc

rename

ce6db2a

mkljakubowski commented Dec 20, 2024

View reviewed changes

mkljakubowski changed the title ~~Fix tests on windows~~ Compile each schema only once Dec 20, 2024

rename

b8bbe54

mkljakubowski added 2 commits December 24, 2024 11:42

review

1b9ff46

review

b0e0cbd

julianpeeters merged commit 8af3cc4 into julianpeeters:main Jan 1, 2025
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compile each schema only once #198

Compile each schema only once #198

mkljakubowski commented Dec 19, 2024 •

edited

Loading

mkljakubowski Dec 19, 2024

mkljakubowski Dec 19, 2024

julianpeeters commented Dec 20, 2024

mkljakubowski commented Dec 20, 2024

mkljakubowski Dec 20, 2024

julianpeeters Dec 21, 2024

julianpeeters Dec 21, 2024

julianpeeters Dec 21, 2024

julianpeeters Dec 21, 2024

mkljakubowski Dec 24, 2024 •

edited

Loading

mkljakubowski Dec 24, 2024

mkljakubowski commented Dec 20, 2024

julianpeeters commented Dec 21, 2024

julianpeeters commented Jan 1, 2025

Compile each schema only once #198

Compile each schema only once #198

Conversation

mkljakubowski commented Dec 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

julianpeeters commented Dec 20, 2024

mkljakubowski commented Dec 20, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mkljakubowski Dec 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mkljakubowski commented Dec 20, 2024

julianpeeters commented Dec 21, 2024

julianpeeters commented Jan 1, 2025

mkljakubowski commented Dec 19, 2024 •

edited

Loading

mkljakubowski Dec 24, 2024 •

edited

Loading