Cleanup and speedup
Both by a lot
Scripter17 committed Dec 31, 2024
1 parent 47fc276 commit adca506
Showing 18 changed files with 234 additions and 192 deletions.
27 changes: 20 additions & 7 deletions README.md
@@ -13,7 +13,7 @@ There are several non-obvious privacy concerns you should keep in mind while usi
- While this does prevent the redirect website from putting cookies on your browser and possibly gives it the false impression you clicked the link, it gives the website certainty you viewed the link.
- In the hopefully never going to happen case of someone hijacking a supported redirect site, this could allow an attacker to reliably grab your IP by sending it in an email/DM.
- While you can configure URL Cleaner to use a proxy to avoid the IP grabbing, it would still let them know when you're online.
- For some websites URL Cleaner strips out more than just tracking stuff. I'm still not sure if or when this ever becomes a security issue.
- For some websites, URL Cleaner strips out more than just tracking stuff. I'm still not sure if or when this ever becomes a security issue.

If you are in any way using URL Cleaner in a life-or-death scenario, PLEASE always use the `no-network` flag and be extremely careful of people you even remotely don't trust sending you URLs.

@@ -180,15 +180,28 @@ On a mostly stock lenovo thinkpad T460S (Intel i5-6300U (4) @ 3.000GHz) running
```

In practice, when using [URL Cleaner Site and its userscript](https://github.com/Scripter17/url-cleaner-site), performance is significantly (but not severely) worse.
Often the first few cleanings will take a few hundred milliseconds each because the page is still loading. Subsequent cleanings should generally be in the 10ms-50ms range.
Often the first few cleanings will take a few hundred milliseconds each because the page is still loading.
However, because of the overhead of using HTTP (even if it's just to localhost), subsequent cleanings, for me, are basically always at least 10ms.

Mileage varies wildly, but as long as you're not spawning a new instance of URL Cleaner for each URL, it should be fast enough.

There is (currently still experimental) support for multithreading.
In its default configuration, it's able to do 10k of the above amazon URL in 51 milliseconds on the same laptop, an almost 2x speedup on a computer with only 2 cores.
On a i5-8500 (6) @ 4.100GHz, times can get as low as 17 milliseconds. If anyone wants to test this on 32+ cores I would be quite interested in the result.
Additionally, spawning more threads than you have cores can be helpful in network-latency-bound jobs, AKA redirects. What exactly the limits and side effects of that are is likely website-dependent.
Also its effects on caching are yet to be figured out.
Also startup time varies wildly. My laptop takes 5-6ms to start it but every other computer I've tested takes 10ms. Really not sure why because the other computers are massively faster.

##### Parallelization

There is (currently still experimental) support for parallelization.

On the same laptop as the above benchmarks, the default settings make 10k of the amazon URL go from 95ms to 51ms.
On my desktop with an Intel i5-8500 (6) @ 4.100GHz, that benchmark gets around 17ms, and one *hundred* thousand of the URL takes about 138ms.
On my friend's desktop with an AMD Ryzen 9 7950X3D (32) @ 5.759GHz, doing the same 100k amazon URL benchmark takes about (TODO: REBENCH).

Network requests and interacting with the cache have effects on performance that I haven't yet properly looked into.

Please note that at this time parallelization has no effect on the library's API.
It's not obvious how I would design it, so I'm waiting for inspiration to strike.

Also please note that compiling with parallelization and then setting the thread count to 1 gives worse performance than not compiling with parallelization.
Through very basic testing, 2 threads seem to be about the same as not compiling with parallelization.
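
As an illustration of the general pattern (and explicitly *not* URL Cleaner's actual implementation or API), a minimal sketch of fanning independent URL jobs out over a fixed number of threads in plain Rust could look like the following; `clean_one_url` is a hypothetical stand-in for the per-URL work.

```rust
use std::thread;

// Hypothetical stand-in for the per-URL cleaning work; NOT URL Cleaner's API.
fn clean_one_url(url: &str) -> String {
    // Toy "cleaning": drop the query string entirely.
    url.split('?').next().unwrap_or(url).to_string()
}

/// Clean `urls` on up to `n_threads` worker threads, preserving input order.
fn clean_in_parallel(urls: &[String], n_threads: usize) -> Vec<String> {
    let chunk_size = urls.len().div_ceil(n_threads.max(1)).max(1);
    thread::scope(|scope| {
        // Spawn one worker per chunk; scoped threads may borrow `urls`.
        let handles: Vec<_> = urls
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || {
                chunk.iter().map(|u| clean_one_url(u)).collect::<Vec<_>>()
            }))
            .collect();
        // Join in spawn order so the output lines up with the input.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}

fn main() {
    let urls: Vec<String> = (0..10_000)
        .map(|i| format!("https://example.com/item/{i}?utm_source=somewhere"))
        .collect();
    let cleaned = clean_in_parallel(&urls, 4);
    println!("cleaned {} URLs", cleaned.len());
}
```

Chunking like this keeps the output order without any locking, but it also means one slow URL (say, one that needs a network request to resolve a redirect) can stall its whole chunk; a real design would more likely use a shared work queue.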

#### Credits

2 changes: 1 addition & 1 deletion benchmarking/benchmark.sh
@@ -45,8 +45,8 @@ for arg in "$@"; do
--only-massif) if [ $an_only_is_set -eq 0 ]; then an_only_is_set=1; hyperfine=0 ; callgrind=0; cachegrind=0 ; dhat=0; memcheck=0; else echo "Error: Multiple --only- flags were set."; exit 1; fi ;;
--only-dhat) if [ $an_only_is_set -eq 0 ]; then an_only_is_set=1; hyperfine=0 ; callgrind=0; cachegrind=0; massif=0 ; memcheck=0; else echo "Error: Multiple --only- flags were set."; exit 1; fi ;;
--only-memcheck) if [ $an_only_is_set -eq 0 ]; then an_only_is_set=1; hyperfine=0 ; callgrind=0; cachegrind=0; massif=0; dhat=0 ; else echo "Error: Multiple --only- flags were set."; exit 1; fi ;;
--nums) mode=nums ; just_set_mode=1 ;;
--urls) mode=urls ; just_set_mode=1 ;;
--nums) mode=nums ; just_set_mode=1 ;;
--features) mode=features; just_set_mode=1 ;;
--out-file) mode=out_file; just_set_mode=1 ;;
--) break ;;
9 changes: 8 additions & 1 deletion build.rs
@@ -3,12 +3,19 @@
use std::io::Write;

fn main() {
let default_config = serde_json::from_str::<serde_json::Value>(&std::fs::read_to_string("default-config.json").expect("Reading the default config to work.")).expect("Deserializing the default config to work.");

if std::fs::exists("default-config.minified.json").expect("Checking the existence of default-config.minified.json to work") {
let maybe_old_minified_default_config = serde_json::from_str::<serde_json::Value>(&std::fs::read_to_string("default-config.minified.json").expect("Reading the minified default config to work.")).expect("Deserializing the minified default config to work.");
if default_config == maybe_old_minified_default_config {return;}
}

std::fs::OpenOptions::new()
.create(true)
.write(true)
.truncate(true)
.open("default-config.minified.json")
.expect("Opening default-config.minified.json to work.")
.write_all(serde_json::to_string(&serde_json::from_str::<serde_json::Value>(&std::fs::read_to_string("default-config.json").expect("Reading the default config to work.")).expect("Deserializing the default config to work.")).expect("Serializing the default config to work.").as_bytes())
.write_all(serde_json::to_string(&default_config).expect("Serializing the default config to work.").as_bytes())
.expect("Writing the minified default config to work.");
}
33 changes: 16 additions & 17 deletions default-config.json
@@ -76,22 +76,21 @@
"nerd.whatever.social", "z.opnxng.com"
],
"redirect-host-without-www-dot-prefixes": [
"2kgam.es", "4.nbcla.com", "a.co", "ab.co", "abc7.la", "abc7ne.ws", "adobe.ly", "aje.io", "aje.io", "amzn.asia", "amzn.ew",
"amzn.to", "api.link.agorapulse.com", "apple.co", "b23.tv", "bbc.in", "bit.ly", "bitly.com", "bitly.com", "bityl.co", "blizz.ly",
"blockclubchi.co", "bloom.bg", "boxd.it", "buff.ly", "bzfd.it", "cbsn.ws", "cfl.re", "chn.ge", "chng.it", "clckhl.co", "cnb.cx",
"cnn.it", "cons.lv", "cos.lv", "cutt.ly", "db.tt", "dcdr.me", "depop.app.link", "dis.gd", "dlvr.it", "econ.st", "etsy.me", "fal.cn",
"fanga.me", "fb.me", "fdip.fr", "flip.it", "forms.gle", "g.co", "glo.bo", "go.bsky.app", "go.forbes.com", "go.microsoft.com",
"go.nasa.gov", "gofund.me", "goo.gl", "goo.su", "gum.co", "hmstr.fr", "hulu.tv", "ift.tt", "intel.ly", "interc.pt", "is.gd",
"iwe.one", "j.mp", "jbgm.es", "k00.fr", "katy.to", "kck.st", "kre.pe", "kre.pe", "l.leparisien.fr", "l.leparisien.fr", "lin.ee",
"link.animaapp.com", "linkr.it", "lnk.to", "loom.ly", "loom.ly", "lpc.ca", "m.sesame.org", "msft.it", "mzl.la", "n.pr", "nas.cr",
"nbc4i.co", "ninten.do", "ntdo.co.uk", "nvda.ws", "ny.ti", "nyer.cm", "nyp.st", "nyti.ms", "nyto.ms", "on.forbes.com", "on.ft.com",
"on.ft.com", "on.msnbc.com", "on.nyc.gov", "onl.bz", "onl.la", "onl.sc", "operagx.gg", "orlo.uk", "ow.ly", "peoplem.ag", "perfht.ml",
"pin.it", "pixiv.me", "play.st", "politi.co", "prn.to", "propub.li", "pulse.ly", "py.pl", "qr1.be", "rb.gy", "rb.gy", "rblx.co",
"rdbl.co", "redd.it", "reurl.cc", "reut.rs", "rzr.to", "s.goodsmile.link", "s.team", "s76.co", "share.firefox.dev", "shor.tf",
"shorturl.at", "sonic.frack.deals", "spoti.fi", "spr.ly", "spr.ly", "spr.ly", "sqex.to", "t.co", "t.ly", "theatln.tc", "thecut.io",
"thef.pub", "thr.cm", "thrn.co", "tiny.cc", "tmz.me", "to.pbs.org", "tps.to", "tr.ee", "trib.al", "u.jd.com", "unes.co", "unf.pa",
"uni.cf", "uniceflink.org", "visitlink.me", "w.wiki", "wlgrn.com", "wlo.link", "wn.nr", "wwdc.io", "x.gd", "xbx.ly", "xhslink.com",
"yrp.ca"
"2kgam.es", "4.nbcla.com", "a.co", "ab.co", "abc7.la", "abc7ne.ws", "adobe.ly", "aje.io", "aje.io", "amzn.asia", "amzn.ew", "amzn.to",
"api.link.agorapulse.com", "apple.co", "b23.tv", "bbc.in", "bit.ly", "bitly.com", "bitly.com", "bityl.co", "blizz.ly", "blockclubchi.co",
"bloom.bg", "boxd.it", "buff.ly", "bzfd.it", "cbsn.ws", "cfl.re", "chn.ge", "chng.it", "clckhl.co", "cnb.cx", "cnn.it", "cons.lv",
"cos.lv", "cutt.ly", "db.tt", "dcdr.me", "depop.app.link", "dis.gd", "dlvr.it", "econ.st", "etsy.me", "fal.cn", "fanga.me", "fb.me",
"fdip.fr", "flip.it", "forms.gle", "g.co", "glo.bo", "go.bsky.app", "go.forbes.com", "go.microsoft.com", "go.nasa.gov", "gofund.me",
"goo.gl", "goo.su", "gum.co", "hmstr.fr", "hulu.tv", "ift.tt", "intel.ly", "interc.pt", "is.gd", "iwe.one", "j.mp", "jbgm.es",
"k00.fr", "katy.to", "kck.st", "kre.pe", "kre.pe", "kre.pe", "l.leparisien.fr", "l.leparisien.fr", "lin.ee", "link.animaapp.com",
"linkr.it", "lnk.to", "loom.ly", "loom.ly", "lpc.ca", "m.sesame.org", "msft.it", "mzl.la", "n.pr", "nas.cr", "nbc4i.co", "ninten.do",
"ntdo.co.uk", "nvda.ws", "ny.ti", "nyer.cm", "nyp.st", "nyti.ms", "nyto.ms", "on.forbes.com", "on.ft.com", "on.ft.com", "on.msnbc.com",
"on.nyc.gov", "onl.bz", "onl.la", "onl.sc", "operagx.gg", "orlo.uk", "ow.ly", "peoplem.ag", "perfht.ml", "pin.it", "pixiv.me", "play.st",
"politi.co", "prn.to", "propub.li", "pulse.ly", "py.pl", "qr1.be", "rb.gy", "rb.gy", "rblx.co", "rdbl.co", "redd.it", "reurl.cc",
"reut.rs", "rzr.to", "s.goodsmile.link", "s.team", "s76.co", "share.firefox.dev", "shor.tf", "shorturl.at", "sonic.frack.deals",
"spoti.fi", "spr.ly", "spr.ly", "spr.ly", "sqex.to", "t.co", "t.ly", "theatln.tc", "thecut.io", "thef.pub", "thr.cm", "thrn.co",
"tiny.cc", "tmz.me", "to.pbs.org", "tps.to", "tr.ee", "trib.al", "u.jd.com", "unes.co", "unf.pa", "uni.cf", "uniceflink.org",
"visitlink.me", "w.wiki", "wlgrn.com", "wlo.link", "wn.nr", "wwdc.io", "x.gd", "xbx.ly", "xhslink.com", "yrp.ca"
],
"redirect-not-subdomains": [
"lnk.to", "visitlink.me", "goo.gl", "o93x.net", "pusle.ly"
@@ -1416,7 +1415,7 @@
"PartMap": {
"part": "Path",
"map": {
"/search": {"AllowQueryParams": ["hl", "q", "tbm", "p", "udm", "filter"]},
"/search": {"AllowQueryParams": ["hl", "q", "tbm", "p", "udm", "filter", "vsrid", "vsdim", "vsint", "ins_vfs"]},
"/setprefs": {"RemoveQueryParams": ["sa", "ved"]}
}
}
2 changes: 1 addition & 1 deletion src/glue/advanced_http.rs
@@ -8,7 +8,7 @@ use url::Url;
use serde::{Deserialize, Serialize};
use reqwest::{Method, header::{HeaderName, HeaderValue, HeaderMap}};
use thiserror::Error;
#[allow(unused_imports, reason = "Used in a doc comment.")]
#[expect(unused_imports, reason = "Used in a doc comment.")]
use reqwest::cookie::Cookie;

use crate::types::*;
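
For context on the `#[allow]` → `#[expect]` swaps repeated throughout this commit: unlike `#[allow]`, `#[expect]` (stable since Rust 1.81) emits a warning if the expected lint never actually fires, so suppressions that go stale get flagged. A minimal standalone illustration, not code from this repo:

```rust
// `#[expect]` behaves like `#[allow]` while the lint fires, but emits an
// "unfulfilled lint expectation" warning if the lint stops firing, e.g. if
// this import later becomes genuinely used, prompting cleanup.
#[expect(unused_imports, reason = "Used only in a doc comment.")]
use std::collections::HashMap;

fn main() {
    println!("expect vs allow demo");
}
```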
2 changes: 1 addition & 1 deletion src/glue/caching.rs
@@ -259,7 +259,7 @@ impl InnerCache {
///
/// If unconnected, connect to the path then return the connection.
///
/// If the path is a file and doesn't exist, writes [`EMPTY_CACHE`] to the path.
/// If the path is a file and doesn't exist, makes the file.
///
/// If the path is `:memory:`, the database is stored ephemerally in RAM and not saved to disk.
/// # Errors
1 change: 0 additions & 1 deletion src/glue/command.rs
@@ -15,7 +15,6 @@ use thiserror::Error;
use serde::{Serialize, Deserialize};
use which::which;

#[allow(unused_imports, reason = "Used in a doc comment.")]
use crate::types::*;
use crate::util::*;

2 changes: 1 addition & 1 deletion src/glue/headermap.rs
@@ -4,7 +4,7 @@ use std::collections::HashMap;

use serde::{Deserialize, ser::{Serializer, Error as _}, de::{Deserializer, Error as _}};
use reqwest::header::HeaderMap;
#[allow(unused_imports, reason = "Used in a doc comment.")] // [`HeaderValue`] is imported for [`serialize`]'s documentation.
#[expect(unused_imports, reason = "Used in a doc comment.")] // [`HeaderValue`] is imported for [`serialize`]'s documentation.
use reqwest::header::HeaderValue;

/// Deserializes a [`HeaderMap`]
1 change: 0 additions & 1 deletion src/glue/headervalue.rs
@@ -1,7 +1,6 @@
//! Provides serialization and deserialization functions for [`HeaderValue`].
use serde::{Deserialize, ser::{Serializer, Error as _}, de::{Deserializer, Error as _}};
#[allow(unused_imports, reason = "Used in a doc comment.")]
use reqwest::header::HeaderValue;

/// Deserializes a [`HeaderValue`]
2 changes: 1 addition & 1 deletion src/glue/proxy.rs
@@ -11,7 +11,7 @@ use reqwest::Proxy;

use crate::util::is_default;

#[allow(unused_imports, reason = "Used in a doc comment.")]
#[expect(unused_imports, reason = "Used in a doc comment.")]
use crate::glue::HttpClientConfig;

/// Used by [`HttpClientConfig`] to detail how a [`reqwest::Proxy`] should be made.
2 changes: 1 addition & 1 deletion src/glue/regex/regex_parts.rs
@@ -9,7 +9,7 @@ use std::str::FromStr;
use serde::{Serialize, Deserialize};
use regex::{Regex, RegexBuilder};
use regex_syntax::{ParserBuilder, Parser, Error as RegexSyntaxError};
#[allow(unused_imports, reason = "Used in a doc comment.")]
#[expect(unused_imports, reason = "Used in a doc comment.")]
use super::RegexWrapper;

use crate::util::*;
5 changes: 0 additions & 5 deletions src/lib.rs
@@ -43,11 +43,6 @@
//! }
//! ```
#[allow(unused_imports, reason = "Used in the module's doc comment.")]
use std::str::FromStr;
#[allow(unused_imports, reason = "Used in the module's doc comment.")]
use serde::Deserialize;

pub mod glue;
pub mod types;
pub(crate) mod util;