From 70c602e576680d839a09ad2774b683a42f57fab8 Mon Sep 17 00:00:00 2001
From: ummel
Date: Tue, 1 Oct 2024 18:36:35 +0000
Subject: [PATCH] =?UTF-8?q?Deploying=20to=20gh-pages=20from=20@=20ummel/fu?=
 =?UTF-8?q?sionModel@8fa33529befe6323ab829316838678f2988dbade=20?=
 =?UTF-8?q?=F0=9F=9A=80?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 404.html | 2 +-
 LICENSE.html | 2 +-
 authors.html | 2 +-
 index.html | 2 +-
 pkgdown.js | 8 ++
 pkgdown.yml | 4 +-
 reference/analyze.html | 2 +-
 reference/analyze2.html | 2 +-
 reference/fuse.html | 2 +-
 reference/fusionModel-package.html | 2 +-
 reference/importance.html | 2 +-
 reference/impute.html | 2 +-
 reference/index.html | 8 +-
 reference/monotonic.html | 2 +-
 reference/plot_valid.html | 2 +-
 reference/prepXY.html | 2 +-
 reference/read_fsd.html | 12 ++-
 reference/read_up.html | 130 +++++++++++++++++++++++++++++
 reference/recs.html | 2 +-
 reference/train.html | 2 +-
 reference/validate.html | 2 +-
 search.json | 2 +-
 sitemap.xml | 1 +
 23 files changed, 173 insertions(+), 24 deletions(-)
 create mode 100644 reference/read_up.html

diff --git a/404.html b/404.html
index 61b86a5..0f83be1 100644
--- a/404.html
+++ b/404.html
@@ -59,7 +59,7 @@

Page not found (404)

diff --git a/LICENSE.html b/LICENSE.html
index 2df9a11..adbaa0c 100644
--- a/LICENSE.html
+++ b/LICENSE.html
@@ -236,7 +236,7 @@

How to Apply These Terms
diff --git a/authors.html b/authors.html
index 231bdd8..4f1f37b 100644
--- a/authors.html
+++ b/authors.html
@@ -72,7 +72,7 @@

Citation

diff --git a/index.html b/index.html
index dbea09e..4bbd145 100644
--- a/index.html
+++ b/index.html
@@ -512,7 +512,7 @@

Developers

diff --git a/pkgdown.js b/pkgdown.js
index 9757bf9..1a99c65 100644
--- a/pkgdown.js
+++ b/pkgdown.js
@@ -152,3 +152,11 @@ async function searchFuse(query, callback) {
     });
   });
 })(window.jQuery || window.$)
+
+document.addEventListener('keydown', function(event) {
+  // Check if the pressed key is '/'
+  if (event.key === '/') {
+    event.preventDefault(); // Prevent any default action associated with the '/' key
+    document.getElementById('search-input').focus(); // Set focus to the search input
+  }
+});
diff --git a/pkgdown.yml b/pkgdown.yml
index bd87fb6..fff3bec 100644
--- a/pkgdown.yml
+++ b/pkgdown.yml
@@ -1,8 +1,8 @@
 pandoc: 3.1.11
-pkgdown: 2.1.0
+pkgdown: 2.1.1
 pkgdown_sha: ~
 articles: {}
-last_built: 2024-09-16T22:10Z
+last_built: 2024-10-01T18:36Z
 urls:
   reference: https://ummel.github.io/fusionModel/reference
   article: https://ummel.github.io/fusionModel/articles
diff --git a/reference/analyze.html b/reference/analyze.html
index b959a3c..4eca019 100644
--- a/reference/analyze.html
+++ b/reference/analyze.html
@@ -219,7 +219,7 @@

Examples -

Site built with pkgdown 2.1.0.

+

Site built with pkgdown 2.1.1.

diff --git a/reference/analyze2.html b/reference/analyze2.html
index 54c8dc8..3ed8376 100644
--- a/reference/analyze2.html
+++ b/reference/analyze2.html
@@ -206,7 +206,7 @@

Examples -

Site built with pkgdown 2.1.0.

+

Site built with pkgdown 2.1.1.

diff --git a/reference/fuse.html b/reference/fuse.html
index ec872fa..b57fac1 100644
--- a/reference/fuse.html
+++ b/reference/fuse.html
@@ -135,7 +135,7 @@

Examples -

Site built with pkgdown 2.1.0.

+

Site built with pkgdown 2.1.1.

diff --git a/reference/fusionModel-package.html b/reference/fusionModel-package.html
index 201950d..f95036f 100644
--- a/reference/fusionModel-package.html
+++ b/reference/fusionModel-package.html
@@ -57,7 +57,7 @@

Author
diff --git a/reference/importance.html b/reference/importance.html
index f487830..38fac45 100644
--- a/reference/importance.html
+++ b/reference/importance.html
@@ -88,7 +88,7 @@

Examples -

Site built with pkgdown 2.1.0.

+

Site built with pkgdown 2.1.1.

diff --git a/reference/impute.html b/reference/impute.html
index feedb71..e4294d9 100644
--- a/reference/impute.html
+++ b/reference/impute.html
@@ -92,7 +92,7 @@

Examples -

Site built with pkgdown 2.1.0.

+

Site built with pkgdown 2.1.1.

diff --git a/reference/index.html b/reference/index.html
index 1cc4f50..91e3b4a 100644
--- a/reference/index.html
+++ b/reference/index.html
@@ -96,6 +96,12 @@

All functions
read_up() + + +
Read ORNL UrbanPop data from disk
+
+ recs
@@ -121,7 +127,7 @@

All functions -

Site built with pkgdown 2.1.0.

+

Site built with pkgdown 2.1.1.

diff --git a/reference/monotonic.html b/reference/monotonic.html
index c227a62..c8e5bd4 100644
--- a/reference/monotonic.html
+++ b/reference/monotonic.html
@@ -88,7 +88,7 @@

Examples -

Site built with pkgdown 2.1.0.

+

Site built with pkgdown 2.1.1.

diff --git a/reference/plot_valid.html b/reference/plot_valid.html
index 97e477d..33a7db2 100644
--- a/reference/plot_valid.html
+++ b/reference/plot_valid.html
@@ -128,7 +128,7 @@

Examples -

Site built with pkgdown 2.1.0.

+

Site built with pkgdown 2.1.1.

diff --git a/reference/prepXY.html b/reference/prepXY.html
index d1d0180..35f9d4e 100644
--- a/reference/prepXY.html
+++ b/reference/prepXY.html
@@ -124,7 +124,7 @@

Examples -

Site built with pkgdown 2.1.0.

+

Site built with pkgdown 2.1.1.

diff --git a/reference/read_fsd.html b/reference/read_fsd.html
index 25e5ecd..50c3765 100644
--- a/reference/read_fsd.html
+++ b/reference/read_fsd.html
@@ -37,7 +37,11 @@

Read fusion output from disk

Usage

-
read_fsd(fsd, columns = NULL, cores = data.table::getDTthreads())
+
read_fsd(
+  fsd,
+  columns = NULL,
+  cores = max(1, parallel::detectCores(logical = FALSE) - 1)
+)
@@ -53,7 +57,7 @@

Arguments
cores -

Integer. Number of cores used by fread.

+

Integer. Number of cores used by read_fst.

@@ -62,7 +66,7 @@

Value

Details

-

As of version 3.0, this is simply a convenient wrapper around read_fst, since fusion output data files (.fsd) produced by fuse are actually native fst files under the hood.

+

As of version 2.3.0, this is simply a convenient wrapper around read_fst, since fusion output data files (.fsd) are actually native fst files under the hood.
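A minimal sketch of what such a wrapper could look like, assuming the fst and data.table packages are installed. The name `read_fsd_sketch` is hypothetical and this is not the package's actual source; it only illustrates passing `columns` through to `fst::read_fst()` and applying `cores` via `fst::threads_fst()`.

```r
# Hypothetical sketch, not fusionModel's actual read_fsd() implementation.
read_fsd_sketch <- function(fsd, columns = NULL, cores = 1L) {
  stopifnot(is.character(fsd), length(fsd) == 1, file.exists(fsd))
  # fst sets its thread count globally: remember the current value and restore it on exit
  old <- fst::threads_fst()
  on.exit(fst::threads_fst(old), add = TRUE)
  fst::threads_fst(cores)
  # .fsd files are described as native fst files, so read_fst() can open them directly
  fst::read_fst(fsd, columns = columns, as.data.table = TRUE)
}

# Usage: a temporary file stands in for real fusion output
tmp <- tempfile(fileext = ".fsd")
fst::write_fst(data.frame(a = 1:3, b = letters[1:3]), tmp)
out <- read_fsd_sketch(tmp, columns = "a")
```

Restoring the previous thread setting on exit keeps the sketch from changing fst behaviour elsewhere in the session.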

@@ -93,7 +97,7 @@

Examples -

Site built with pkgdown 2.1.0.

+

Site built with pkgdown 2.1.1.

diff --git a/reference/read_up.html b/reference/read_up.html
new file mode 100644
index 0000000..b206989
--- /dev/null
+++ b/reference/read_up.html
@@ -0,0 +1,130 @@
+
+Read ORNL UrbanPop data from disk — read_up • fusionModel
+ Skip to contents + + +
+
+
+ +
+

NOTE: For fusionACS internal use only! Efficiently read pre-processed ORNL UrbanPop data from disk. Function arguments only make sense if you are familiar with the structure of the processed UrbanPop data.

+
+ +
+

Usage

+
read_up(
+  path,
+  year = NULL,
+  state = NULL,
+  county = NULL,
+  tract_bg = NULL,
+  hid = NULL,
+  df = NULL,
+  cores = max(1, parallel::detectCores(logical = FALSE) - 1)
+)
+
+ +
+

Arguments

+ + +
path
+

Character. File path to fst file containing UrbanPop data.

+ + +
year
+

Integer. Year(s) to select.

+ + +
state
+

Integer. State FIPS code(s) to select.

+ + +
county
+

Integer. County FIPS code(s) to select.

+ + +
tract_bg
+

Integer. Tract and block group FIPS code(s) to select.

+ + +
hid
+

Integer. ACS-PUMS household ID(s) to select.

+ + +
df
+

Data frame containing at least 'year' and/or 'state' columns. Provides unique combinations of the above argument values to return. df is used to perform an inner merge on an initial subset of data based on 'year' or 'state'.

+ + +
cores
+

Integer. Number of cores used by read_fst.

+ +
+
+

Value

+

A keyed data.table.

+
+
+

Details

+

Provides an efficient and fast way to load a subset of UrbanPop data into memory. An initial subset is read using state or year to restrict rows. Then data.table operations (subset or merge) are used to efficiently reduce to the final returned subset.
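The two-stage idea described above can be illustrated on a toy in-memory table. This is a sketch under stated assumptions: the table `up`, its values, and the name `combos` are hypothetical stand-ins, and the real function reads its initial subset from an fst file rather than from memory.

```r
library(data.table)

# Hypothetical stand-in for processed UrbanPop data
up <- data.table(
  year   = c(2015L, 2015L, 2018L, 2018L, 2016L),
  state  = c(8L, 8L, 5L, 5L, 8L),
  county = c(1L, 3L, 3L, 7L, 1L),
  hid    = 1:5
)

# Requested combinations, playing the role of the `df` argument
combos <- data.table(year = c(2015L, 2018L), state = c(8L, 5L), county = c(1L, 3L))

# Stage 1: coarse row subset using only year and state
init <- up[year %in% unique(combos$year) & state %in% unique(combos$state)]

# Stage 2: inner merge keeps only the exact requested combinations
out <- merge(init, combos, by = c("year", "state", "county"))
setkey(out, year, state, county)
```

Stage 1 drops the 2016 row and stage 2 drops the county values that were not requested, leaving a keyed table with one row per combination in `combos`.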

+
+ +
+

Examples

+
up.path <- "~/Documents/Projects/fusionData/urbanpop/Processed national UrbanPop.fst"
+
+out <- read_up(path = up.path, year = 2015, state = 4)
+unique(dplyr::select(out, year, state))
+
+out <- read_up(path = up.path, year = 2015:2016, state = c(2, 12, 15))
+unique(dplyr::select(out, year, state))
+
+up.df <- data.frame(year = c(2015, 2018), state = c(8, 5), county = c(1, 3))
+out <- read_up(path = up.path, df = up.df)
+unique(dplyr::select(out, year, state, county))
+
+
+ + +
+ + + +
+ + + + + + +
diff --git a/reference/recs.html b/reference/recs.html
index 5d5406e..8d3bcf4 100644
--- a/reference/recs.html
+++ b/reference/recs.html
@@ -160,7 +160,7 @@

Source
diff --git a/reference/train.html b/reference/train.html
index 4e1d4be..b0ab8d6 100644
--- a/reference/train.html
+++ b/reference/train.html
@@ -173,7 +173,7 @@

Examples -

Site built with pkgdown 2.1.0.

+

Site built with pkgdown 2.1.1.

diff --git a/reference/validate.html b/reference/validate.html
index e849515..930ec39 100644
--- a/reference/validate.html
+++ b/reference/validate.html
@@ -122,7 +122,7 @@

Examples -

Site built with pkgdown 2.1.0.

+

Site built with pkgdown 2.1.1.

diff --git a/search.json b/search.json index a77ec27..1fe7ac5 100644 --- a/search.json +++ b/search.json @@ -1 +1 @@ -[{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":null,"dir":"","previous_headings":"","what":"GNU General Public License","title":"GNU General Public License","text":"Version 3, 29 June 2007Copyright © 2007 Free Software Foundation, Inc.  Everyone permitted copy distribute verbatim copies license document, changing allowed.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"preamble","dir":"","previous_headings":"","what":"Preamble","title":"GNU General Public License","text":"GNU General Public License free, copyleft license software kinds works. licenses software practical works designed take away freedom share change works. contrast, GNU General Public License intended guarantee freedom share change versions program–make sure remains free software users. , Free Software Foundation, use GNU General Public License software; applies also work released way authors. can apply programs, . speak free software, referring freedom, price. General Public Licenses designed make sure freedom distribute copies free software (charge wish), receive source code can get want , can change software use pieces new free programs, know can things. protect rights, need prevent others denying rights asking surrender rights. Therefore, certain responsibilities distribute copies software, modify : responsibilities respect freedom others. example, distribute copies program, whether gratis fee, must pass recipients freedoms received. must make sure , , receive can get source code. must show terms know rights. Developers use GNU GPL protect rights two steps: (1) assert copyright software, (2) offer License giving legal permission copy, distribute /modify . developers’ authors’ protection, GPL clearly explains warranty free software. 
users’ authors’ sake, GPL requires modified versions marked changed, problems attributed erroneously authors previous versions. devices designed deny users access install run modified versions software inside , although manufacturer can . fundamentally incompatible aim protecting users’ freedom change software. systematic pattern abuse occurs area products individuals use, precisely unacceptable. Therefore, designed version GPL prohibit practice products. problems arise substantially domains, stand ready extend provision domains future versions GPL, needed protect freedom users. Finally, every program threatened constantly software patents. States allow patents restrict development use software general-purpose computers, , wish avoid special danger patents applied free program make effectively proprietary. prevent , GPL assures patents used render program non-free. precise terms conditions copying, distribution modification follow.","code":""},{"path":[]},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_0-definitions","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"0. Definitions","title":"GNU General Public License","text":"“License” refers version 3 GNU General Public License. “Copyright” also means copyright-like laws apply kinds works, semiconductor masks. “Program” refers copyrightable work licensed License. licensee addressed “”. “Licensees” “recipients” may individuals organizations. “modify” work means copy adapt part work fashion requiring copyright permission, making exact copy. resulting work called “modified version” earlier work work “based ” earlier work. “covered work” means either unmodified Program work based Program. “propagate” work means anything , without permission, make directly secondarily liable infringement applicable copyright law, except executing computer modifying private copy. Propagation includes copying, distribution (without modification), making available public, countries activities well. 
“convey” work means kind propagation enables parties make receive copies. Mere interaction user computer network, transfer copy, conveying. interactive user interface displays “Appropriate Legal Notices” extent includes convenient prominently visible feature (1) displays appropriate copyright notice, (2) tells user warranty work (except extent warranties provided), licensees may convey work License, view copy License. interface presents list user commands options, menu, prominent item list meets criterion.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_1-source-code","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"1. Source Code","title":"GNU General Public License","text":"“source code” work means preferred form work making modifications . “Object code” means non-source form work. “Standard Interface” means interface either official standard defined recognized standards body, , case interfaces specified particular programming language, one widely used among developers working language. “System Libraries” executable work include anything, work whole, () included normal form packaging Major Component, part Major Component, (b) serves enable use work Major Component, implement Standard Interface implementation available public source code form. “Major Component”, context, means major essential component (kernel, window system, ) specific operating system () executable work runs, compiler used produce work, object code interpreter used run . “Corresponding Source” work object code form means source code needed generate, install, (executable work) run object code modify work, including scripts control activities. However, include work’s System Libraries, general-purpose tools generally available free programs used unmodified performing activities part work. 
example, Corresponding Source includes interface definition files associated source files work, source code shared libraries dynamically linked subprograms work specifically designed require, intimate data communication control flow subprograms parts work. Corresponding Source need include anything users can regenerate automatically parts Corresponding Source. Corresponding Source work source code form work.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_2-basic-permissions","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"2. Basic Permissions","title":"GNU General Public License","text":"rights granted License granted term copyright Program, irrevocable provided stated conditions met. License explicitly affirms unlimited permission run unmodified Program. output running covered work covered License output, given content, constitutes covered work. License acknowledges rights fair use equivalent, provided copyright law. may make, run propagate covered works convey, without conditions long license otherwise remains force. may convey covered works others sole purpose make modifications exclusively , provide facilities running works, provided comply terms License conveying material control copyright. thus making running covered works must exclusively behalf, direction control, terms prohibit making copies copyrighted material outside relationship . Conveying circumstances permitted solely conditions stated . Sublicensing allowed; section 10 makes unnecessary.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_3-protecting-users-legal-rights-from-anti-circumvention-law","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"3. 
Protecting Users’ Legal Rights From Anti-Circumvention Law","title":"GNU General Public License","text":"covered work shall deemed part effective technological measure applicable law fulfilling obligations article 11 WIPO copyright treaty adopted 20 December 1996, similar laws prohibiting restricting circumvention measures. convey covered work, waive legal power forbid circumvention technological measures extent circumvention effected exercising rights License respect covered work, disclaim intention limit operation modification work means enforcing, work’s users, third parties’ legal rights forbid circumvention technological measures.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_4-conveying-verbatim-copies","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"4. Conveying Verbatim Copies","title":"GNU General Public License","text":"may convey verbatim copies Program’s source code receive , medium, provided conspicuously appropriately publish copy appropriate copyright notice; keep intact notices stating License non-permissive terms added accord section 7 apply code; keep intact notices absence warranty; give recipients copy License along Program. may charge price price copy convey, may offer support warranty protection fee.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_5-conveying-modified-source-versions","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"5. Conveying Modified Source Versions","title":"GNU General Public License","text":"may convey work based Program, modifications produce Program, form source code terms section 4, provided also meet conditions: ) work must carry prominent notices stating modified , giving relevant date. b) work must carry prominent notices stating released License conditions added section 7. requirement modifies requirement section 4 “keep intact notices”. c) must license entire work, whole, License anyone comes possession copy. 
License therefore apply, along applicable section 7 additional terms, whole work, parts, regardless packaged. License gives permission license work way, invalidate permission separately received . d) work interactive user interfaces, must display Appropriate Legal Notices; however, Program interactive interfaces display Appropriate Legal Notices, work need make . compilation covered work separate independent works, nature extensions covered work, combined form larger program, volume storage distribution medium, called “aggregate” compilation resulting copyright used limit access legal rights compilation’s users beyond individual works permit. Inclusion covered work aggregate cause License apply parts aggregate.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_6-conveying-non-source-forms","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"6. Conveying Non-Source Forms","title":"GNU General Public License","text":"may convey covered work object code form terms sections 4 5, provided also convey machine-readable Corresponding Source terms License, one ways: ) Convey object code , embodied , physical product (including physical distribution medium), accompanied Corresponding Source fixed durable physical medium customarily used software interchange. b) Convey object code , embodied , physical product (including physical distribution medium), accompanied written offer, valid least three years valid long offer spare parts customer support product model, give anyone possesses object code either (1) copy Corresponding Source software product covered License, durable physical medium customarily used software interchange, price reasonable cost physically performing conveying source, (2) access copy Corresponding Source network server charge. c) Convey individual copies object code copy written offer provide Corresponding Source. alternative allowed occasionally noncommercially, received object code offer, accord subsection 6b. 
d) Convey object code offering access designated place (gratis charge), offer equivalent access Corresponding Source way place charge. need require recipients copy Corresponding Source along object code. place copy object code network server, Corresponding Source may different server (operated third party) supports equivalent copying facilities, provided maintain clear directions next object code saying find Corresponding Source. Regardless server hosts Corresponding Source, remain obligated ensure available long needed satisfy requirements. e) Convey object code using peer--peer transmission, provided inform peers object code Corresponding Source work offered general public charge subsection 6d. separable portion object code, whose source code excluded Corresponding Source System Library, need included conveying object code work. “User Product” either (1) “consumer product”, means tangible personal property normally used personal, family, household purposes, (2) anything designed sold incorporation dwelling. determining whether product consumer product, doubtful cases shall resolved favor coverage. particular product received particular user, “normally used” refers typical common use class product, regardless status particular user way particular user actually uses, expects expected use, product. product consumer product regardless whether product substantial commercial, industrial non-consumer uses, unless uses represent significant mode use product. “Installation Information” User Product means methods, procedures, authorization keys, information required install execute modified versions covered work User Product modified version Corresponding Source. information must suffice ensure continued functioning modified object code case prevented interfered solely modification made. 
convey object code work section , , specifically use , User Product, conveying occurs part transaction right possession use User Product transferred recipient perpetuity fixed term (regardless transaction characterized), Corresponding Source conveyed section must accompanied Installation Information. requirement apply neither third party retains ability install modified object code User Product (example, work installed ROM). requirement provide Installation Information include requirement continue provide support service, warranty, updates work modified installed recipient, User Product modified installed. Access network may denied modification materially adversely affects operation network violates rules protocols communication across network. Corresponding Source conveyed, Installation Information provided, accord section must format publicly documented (implementation available public source code form), must require special password key unpacking, reading copying.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_7-additional-terms","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"7. Additional Terms","title":"GNU General Public License","text":"“Additional permissions” terms supplement terms License making exceptions one conditions. Additional permissions applicable entire Program shall treated though included License, extent valid applicable law. additional permissions apply part Program, part may used separately permissions, entire Program remains governed License without regard additional permissions. convey copy covered work, may option remove additional permissions copy, part . (Additional permissions may written require removal certain cases modify work.) may place additional permissions material, added covered work, can give appropriate copyright permission. 
Notwithstanding provision License, material add covered work, may (authorized copyright holders material) supplement terms License terms: ) Disclaiming warranty limiting liability differently terms sections 15 16 License; b) Requiring preservation specified reasonable legal notices author attributions material Appropriate Legal Notices displayed works containing ; c) Prohibiting misrepresentation origin material, requiring modified versions material marked reasonable ways different original version; d) Limiting use publicity purposes names licensors authors material; e) Declining grant rights trademark law use trade names, trademarks, service marks; f) Requiring indemnification licensors authors material anyone conveys material (modified versions ) contractual assumptions liability recipient, liability contractual assumptions directly impose licensors authors. non-permissive additional terms considered “restrictions” within meaning section 10. Program received , part , contains notice stating governed License along term restriction, may remove term. license document contains restriction permits relicensing conveying License, may add covered work material governed terms license document, provided restriction survive relicensing conveying. add terms covered work accord section, must place, relevant source files, statement additional terms apply files, notice indicating find applicable terms. Additional terms, permissive non-permissive, may stated form separately written license, stated exceptions; requirements apply either way.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_8-termination","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"8. Termination","title":"GNU General Public License","text":"may propagate modify covered work except expressly provided License. attempt otherwise propagate modify void, automatically terminate rights License (including patent licenses granted third paragraph section 11). 
However, cease violation License, license particular copyright holder reinstated () provisionally, unless copyright holder explicitly finally terminates license, (b) permanently, copyright holder fails notify violation reasonable means prior 60 days cessation. Moreover, license particular copyright holder reinstated permanently copyright holder notifies violation reasonable means, first time received notice violation License (work) copyright holder, cure violation prior 30 days receipt notice. Termination rights section terminate licenses parties received copies rights License. rights terminated permanently reinstated, qualify receive new licenses material section 10.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_9-acceptance-not-required-for-having-copies","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"9. Acceptance Not Required for Having Copies","title":"GNU General Public License","text":"required accept License order receive run copy Program. Ancillary propagation covered work occurring solely consequence using peer--peer transmission receive copy likewise require acceptance. However, nothing License grants permission propagate modify covered work. actions infringe copyright accept License. Therefore, modifying propagating covered work, indicate acceptance License .","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_10-automatic-licensing-of-downstream-recipients","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"10. Automatic Licensing of Downstream Recipients","title":"GNU General Public License","text":"time convey covered work, recipient automatically receives license original licensors, run, modify propagate work, subject License. responsible enforcing compliance third parties License. “entity transaction” transaction transferring control organization, substantially assets one, subdividing organization, merging organizations. 
propagation covered work results entity transaction, party transaction receives copy work also receives whatever licenses work party’s predecessor interest give previous paragraph, plus right possession Corresponding Source work predecessor interest, predecessor can get reasonable efforts. may impose restrictions exercise rights granted affirmed License. example, may impose license fee, royalty, charge exercise rights granted License, may initiate litigation (including cross-claim counterclaim lawsuit) alleging patent claim infringed making, using, selling, offering sale, importing Program portion .","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_11-patents","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"11. Patents","title":"GNU General Public License","text":"“contributor” copyright holder authorizes use License Program work Program based. work thus licensed called contributor’s “contributor version”. contributor’s “essential patent claims” patent claims owned controlled contributor, whether already acquired hereafter acquired, infringed manner, permitted License, making, using, selling contributor version, include claims infringed consequence modification contributor version. purposes definition, “control” includes right grant patent sublicenses manner consistent requirements License. contributor grants non-exclusive, worldwide, royalty-free patent license contributor’s essential patent claims, make, use, sell, offer sale, import otherwise run, modify propagate contents contributor version. following three paragraphs, “patent license” express agreement commitment, however denominated, enforce patent (express permission practice patent covenant sue patent infringement). “grant” patent license party means make agreement commitment enforce patent party. 
convey covered work, knowingly relying patent license, Corresponding Source work available anyone copy, free charge terms License, publicly available network server readily accessible means, must either (1) cause Corresponding Source available, (2) arrange deprive benefit patent license particular work, (3) arrange, manner consistent requirements License, extend patent license downstream recipients. “Knowingly relying” means actual knowledge , patent license, conveying covered work country, recipient’s use covered work country, infringe one identifiable patents country reason believe valid. , pursuant connection single transaction arrangement, convey, propagate procuring conveyance , covered work, grant patent license parties receiving covered work authorizing use, propagate, modify convey specific copy covered work, patent license grant automatically extended recipients covered work works based . patent license “discriminatory” include within scope coverage, prohibits exercise , conditioned non-exercise one rights specifically granted License. may convey covered work party arrangement third party business distributing software, make payment third party based extent activity conveying work, third party grants, parties receive covered work , discriminatory patent license () connection copies covered work conveyed (copies made copies), (b) primarily connection specific products compilations contain covered work, unless entered arrangement, patent license granted, prior 28 March 2007. Nothing License shall construed excluding limiting implied license defenses infringement may otherwise available applicable patent law.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_12-no-surrender-of-others-freedom","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"12. 
No Surrender of Others’ Freedom","title":"GNU General Public License","text":"conditions imposed (whether court order, agreement otherwise) contradict conditions License, excuse conditions License. convey covered work satisfy simultaneously obligations License pertinent obligations, consequence may convey . example, agree terms obligate collect royalty conveying convey Program, way satisfy terms License refrain entirely conveying Program.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_13-use-with-the-gnu-affero-general-public-license","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"13. Use with the GNU Affero General Public License","title":"GNU General Public License","text":"Notwithstanding provision License, permission link combine covered work work licensed version 3 GNU Affero General Public License single combined work, convey resulting work. terms License continue apply part covered work, special requirements GNU Affero General Public License, section 13, concerning interaction network apply combination .","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_14-revised-versions-of-this-license","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"14. Revised Versions of this License","title":"GNU General Public License","text":"Free Software Foundation may publish revised /new versions GNU General Public License time time. new versions similar spirit present version, may differ detail address new problems concerns. version given distinguishing version number. Program specifies certain numbered version GNU General Public License “later version” applies , option following terms conditions either numbered version later version published Free Software Foundation. Program specify version number GNU General Public License, may choose version ever published Free Software Foundation. 
Program specifies proxy can decide future versions GNU General Public License can used, proxy’s public statement acceptance version permanently authorizes choose version Program. Later license versions may give additional different permissions. However, additional obligations imposed author copyright holder result choosing follow later version.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_15-disclaimer-of-warranty","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"15. Disclaimer of Warranty","title":"GNU General Public License","text":"WARRANTY PROGRAM, EXTENT PERMITTED APPLICABLE LAW. EXCEPT OTHERWISE STATED WRITING COPYRIGHT HOLDERS /PARTIES PROVIDE PROGRAM “” WITHOUT WARRANTY KIND, EITHER EXPRESSED IMPLIED, INCLUDING, LIMITED , IMPLIED WARRANTIES MERCHANTABILITY FITNESS PARTICULAR PURPOSE. ENTIRE RISK QUALITY PERFORMANCE PROGRAM . PROGRAM PROVE DEFECTIVE, ASSUME COST NECESSARY SERVICING, REPAIR CORRECTION.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_16-limitation-of-liability","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"16. Limitation of Liability","title":"GNU General Public License","text":"EVENT UNLESS REQUIRED APPLICABLE LAW AGREED WRITING COPYRIGHT HOLDER, PARTY MODIFIES /CONVEYS PROGRAM PERMITTED , LIABLE DAMAGES, INCLUDING GENERAL, SPECIAL, INCIDENTAL CONSEQUENTIAL DAMAGES ARISING USE INABILITY USE PROGRAM (INCLUDING LIMITED LOSS DATA DATA RENDERED INACCURATE LOSSES SUSTAINED THIRD PARTIES FAILURE PROGRAM OPERATE PROGRAMS), EVEN HOLDER PARTY ADVISED POSSIBILITY DAMAGES.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_17-interpretation-of-sections-15-and-16","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"17. 
Interpretation of Sections 15 and 16","title":"GNU General Public License","text":"disclaimer warranty limitation liability provided given local legal effect according terms, reviewing courts shall apply local law closely approximates absolute waiver civil liability connection Program, unless warranty assumption liability accompanies copy Program return fee. END TERMS CONDITIONS","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"how-to-apply-these-terms-to-your-new-programs","dir":"","previous_headings":"","what":"How to Apply These Terms to Your New Programs","title":"GNU General Public License","text":"develop new program, want greatest possible use public, best way achieve make free software everyone can redistribute change terms. , attach following notices program. safest attach start source file effectively state exclusion warranty; file least “copyright” line pointer full notice found. Also add information contact electronic paper mail. program terminal interaction, make output short notice like starts interactive mode: hypothetical commands show w show c show appropriate parts General Public License. course, program’s commands might different; GUI interface, use “box”. also get employer (work programmer) school, , sign “copyright disclaimer” program, necessary. information , apply follow GNU GPL, see . GNU General Public License permit incorporating program proprietary programs. program subroutine library, may consider useful permit linking proprietary applications library. want , use GNU Lesser General Public License instead License. first, please read .","code":" Copyright (C) This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. 
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . Copyright (C) This program comes with ABSOLUTELY NO WARRANTY; for details type 'show w'. This is free software, and you are welcome to redistribute it under certain conditions; type 'show c' for details."},{"path":"https://ummel.github.io/fusionModel/authors.html","id":null,"dir":"","previous_headings":"","what":"Authors","title":"Authors and Citation","text":"Kevin Ummel. Author, maintainer. Karthik Akkiraju. Contributor. Miguel Poblete Cazenave. Contributor.","code":""},{"path":"https://ummel.github.io/fusionModel/authors.html","id":"citation","dir":"","previous_headings":"","what":"Citation","title":"Authors and Citation","text":"Ummel K (2024). fusionModel: Data fusion analysis synthetic data R. R package version 2.3.0, https://ummel.github.io/fusionModel/.","code":"@Manual{, title = {fusionModel: Data fusion and analysis of synthetic data in R}, author = {Kevin Ummel}, year = {2024}, note = {R package version 2.3.0}, url = {https://ummel.github.io/fusionModel/}, }"},{"path":"https://ummel.github.io/fusionModel/index.html","id":"fusionmodel","dir":"","previous_headings":"","what":"Data fusion and analysis of synthetic data in R","title":"Data fusion and analysis of synthetic data in R","text":"Kevin Ummel (ummel@berkeley.edu) Overview Motivation Methodology Installation Simple fusion Advanced fusion Analyzing fused data Validating fusion models","code":""},{"path":"https://ummel.github.io/fusionModel/index.html","id":"overview","dir":"","previous_headings":"","what":"Overview","title":"Data fusion and analysis of synthetic data in R","text":"fusionModel enables variables unique “donor” dataset statistically simulated (.e. 
fused ) “recipient” dataset. Variables common donor recipient used model simulate fused variables. package provides simple efficient interface general data fusion R, leveraging state---art machine learning algorithms Microsoft’s LightGBM framework. also provides tools analyzing synthetic/simulated data, calculating uncertainty, validating fusion output. fusionModel developed allow statistical integration microdata disparate social surveys. data fusion workhorse underpinning larger fusionACS data platform development Socio-Spatial Climate Collaborative. context, fusionModel used fuse variables range social surveys onto microdata American Community Survey, allowing analysis spatial resolution otherwise impossible.","code":""},{"path":"https://ummel.github.io/fusionModel/index.html","id":"motivation","dir":"","previous_headings":"","what":"Motivation","title":"Data fusion and analysis of synthetic data in R","text":"desire “fuse” otherwise integrate independent datasets long history, dating least early 1970’s (Ruggles Ruggles 1974; Alter 1974). Social scientists long recognized large amounts unconnected data “” – usually concerning characteristics households individuals (.e. microdata) – , ideally, like integrate analyze whole. aim falls general heading “Statistical Data Integration” (SDI) (Lewaa et al. 2021). prominent examples data fusion involved administrative record linkage. consists exact matching probabilistic linking independent datasets, using observable information like social security numbers, names, birth dates individuals. Record linkage gold standard can yield incredibly important insights high levels statistical confidence, evidenced pioneering work Raj Chetty colleagues. However, record linkage rarely feasible kinds microdata researchers use day--day (nevermind difficulty accessing administrative data). explosion online tracking social network data undoubtedly offer new lines analysis, time , least, social survey microdata remain indispensable. 
challenge promise recognized 50 years ago Nancy Richard Ruggles remains true today: Unfortunately, single microdata set contains different kinds information required problems economist wishes analyze. Different microdata sets contain different kinds information…great deal information collected sample basis. two samples involved probability individual appearing may small, exact matching impossible. methods combining types information contained two different samples one microdata set required. (Ruggles Ruggles 1974; 353-354) Practitioners regularly impute otherwise predict variable two one dataset another. Piecemeal, ad hoc data fusion common necessity quantitative research. Proper data fusion, hand, seeks systematically combine “two different samples one microdata set”. size nature samples involved intended analyses strongly influence choice data integration technique structure output. led relevant literature diverse convoluted, practitioners take different data “setups” objectives. context fusionACS, interested following problem: microdata two independent surveys, B, sample underlying population time period (e.g. occupied U.S. households nationwide 2018). specify “recipient” dataset B “donor”. goal generate new dataset, C, original survey responses plus realistic representation respondent might answered questionnaire survey B. , identify set common/shared variables X surveys solicit. attempt fuse set variables unique B – call Z, “fusion variables” – onto original microdata , conditional X.","code":""},{"path":"https://ummel.github.io/fusionModel/index.html","id":"methodology","dir":"","previous_headings":"","what":"Methodology","title":"Data fusion and analysis of synthetic data in R","text":"fusion strategy implemented fusionModel package borrows expands upon ideas statistical matching (D’Orazio et al. 2006), imputation (Little Rubin 2019), data synthesis (Drechsler 2011) literatures create flexible data fusion tool. 
employs variable-k, conditional expectation matching leverages high-performance gradient boosting algorithms. package accommodates fusion many variables, individually blocks, efficient computation recipient large relative donor. Specifically, goal create data fusion tool meets following requirements: Accommodate donor recipient datasets divergent sample sizes Handle continuous, categorical, semi-continuous (zero-inflated) variable types Ensure realistic values fused variables Scale efficiently larger datasets Fuse variables “one--one” “blocks” Employ data modeling approach : Makes distributional assumptions (.e. non-parametric) Automatically detects non-linear interaction effects Automatically selects predictor variables potentially large set Ability prevent overfitting (e.g. cross-validation) Complete methodological details available fusionACS Guidebook (INSERT LINK).","code":""},{"path":"https://ummel.github.io/fusionModel/index.html","id":"installation","dir":"","previous_headings":"","what":"Installation","title":"Data fusion and analysis of synthetic data in R","text":"","code":"devtools::install_github(\"ummel/fusionModel\") library(fusionModel) fusionModel v2.2.1 | https://github.com/ummel/fusionModel"},{"path":"https://ummel.github.io/fusionModel/index.html","id":"simple-fusion","dir":"","previous_headings":"","what":"Simple fusion","title":"Data fusion and analysis of synthetic data in R","text":"package includes example microdata 2015 Residential Energy Consumption Survey (see ?recs details). real-world use cases, donor recipient data typically independent vary sample size. illustrative purposes, randomly split recs microdata separate “donor” “recipient” datasets equal number observations. recipient dataset contains 13 variables shared donor. shared “predictor” variables provide statistical link two datasets. fusionModel exploits information shared variables. 5 “fusion variables” unique donor. variables fused recipient. 
includes mix continuous categorical (factor) variables. create fusion model using train() function. minimal usage shown . See ?train additional function arguments options. default, results “.fsn” (fusion) object saved “fusion_model.fsn” current working directory. fuse variables recipient, simply pass recipient data path .fsn model fuse() function. variable specified fusion.vars fused order provided. default, fuse() generates single implicate (version) synthetic outcomes. Later, ’ll work multiple implicates perform proper analysis uncertainty estimation. Let’s look recipient dataset’s fused/simulated variables. Note results look different, call fuse() generates unique, probabilistic set outcomes. can quick sanity checks compare distribution fusion variables donor sim. , least, confirms fusion output obviously wrong. Later, ’ll perform formal internal validation exercise using multiple implicates. can look kernel density plots non-zero values continuous variables see univariate distributions donor generally similar sim.","code":"# Rows to use for donor dataset d <- seq(from = 1, to = nrow(recs), by = 2) # Create donor and recipient datasets donor <- recs[d, c(2:16, 20:22)] recipient <- recs[-d, 2:14] # Specify fusion and shared/common predictor variables predictor.vars <- names(recipient) fusion.vars <- setdiff(names(donor), predictor.vars) predictor.vars [1] \"income\" \"age\" \"race\" \"education\" \"employment\" [6] \"hh_size\" \"division\" \"urban_rural\" \"climate\" \"renter\" [11] \"home_type\" \"year_built\" \"heat_type\" # The variables to be fused sapply(donor[fusion.vars], class) $insulation [1] \"ordered\" \"factor\" $aircon [1] \"factor\" $square_feet [1] \"integer\" $electricity [1] \"numeric\" $natural_gas [1] \"numeric\" # Train a fusion model fsn.model <- train(data = donor, y = fusion.vars, x = predictor.vars) 5 fusion variables 13 initial predictor variables 2843 observations Using all available predictors for each fusion variable Training step 1 of 
5: insulation Training step 2 of 5: aircon Training step 3 of 5: square_feet -- R-squared of cluster means: 0.967 -- Number of neighbors in each cluster: Min. 1st Qu. Median Mean 3rd Qu. Max. 10.0 49.0 141.0 200.6 357.0 498.0 Training step 4 of 5: electricity -- R-squared of cluster means: 0.966 -- Number of neighbors in each cluster: Min. 1st Qu. Median Mean 3rd Qu. Max. 10.00 51.75 115.00 176.02 281.25 498.00 Training step 5 of 5: natural_gas -- R-squared of cluster means: 0.968 -- Number of neighbors in each cluster: Min. 1st Qu. Median Mean 3rd Qu. Max. 10.0 54.0 129.0 170.1 247.0 499.0 Fusion model saved to: /home/kevin/Documents/Projects/fusionModel/fusion_model.fsn Total processing time: 8.53 secs # Fuse 'fusion.vars' to the recipient sim <- fuse(data = recipient, fsn = fsn.model) 5 fusion variables 13 initial predictor variables 2843 observations Generating 1 implicate Fusion step 1 of 5: insulation -- Predicting LightGBM models -- Simulating fused values Fusion step 2 of 5: aircon -- Predicting LightGBM models -- Simulating fused values Fusion step 3 of 5: square_feet -- Predicting LightGBM models -- Simulating fused values Fusion step 4 of 5: electricity -- Predicting LightGBM models -- Simulating fused values Fusion step 5 of 5: natural_gas -- Predicting LightGBM models -- Simulating fused values Total processing time: 0.8 secs head(sim) M insulation aircon square_feet 1: 1 Well insulated Central air conditioning system 1956 2: 1 Well insulated Central air conditioning system 1621 3: 1 Adequately insulated No air conditioning 558 4: 1 Adequately insulated Central air conditioning system 3072 5: 1 Adequately insulated Central air conditioning system 1010 6: 1 Adequately insulated No air conditioning 1910 electricity natural_gas 1: 17000 0.0 2: 6070 146.2 3: 1334 0.0 4: 9620 797.0 5: 37500 0.0 6: 11240 223.0 sim <- data.frame(sim) # Compare means of the continuous variables cbind(donor = colMeans(donor[fusion.vars[3:5]]), sim = 
colMeans(sim[fusion.vars[3:5]])) donor sim square_feet 2070.784 2012.7306 electricity 10994.517 10675.6508 natural_gas 338.154 323.5523 # Compare frequencies of categorical variable classes cbind(donor = table(donor$insulation), sim = table(sim$insulation)) donor sim Not insulated 40 40 Poorly insulated 459 443 Adequately insulated 1401 1419 Well insulated 943 941 cbind(donor = table(donor$aircon), sim = table(sim$aircon)) donor sim Central air conditioning system 1788 1818 Individual window/wall or portable units 545 545 Both a central system and individual units 125 119 No air conditioning 385 361"},{"path":"https://ummel.github.io/fusionModel/index.html","id":"advanced-fusion","dir":"","previous_headings":"","what":"Advanced fusion","title":"Data fusion and analysis of synthetic data in R","text":"call train(), specify set hyperparameters search training LightGBM gradient boosting model (see ?train details). hyperparameters can used tune underlying GBM models better cross-validated performance. also set nfolds = 10 (default 5) indicate number cross-validation folds use. Since requires additional computation, cores argument used enable parallel processing. generally want create multiple versions simulated fusion variables – called implicates – order reduce bias point estimates calculate associated uncertainty. can using M argument within fuse(). generate 10 implicates; .e. 10 unique, probabilistic representations recipient records might look like respect fusion variables. 
Note implicate sim10 identified “M” variable/column.","code":"# Train a fusion model with variable blocks fsn.model <- train(data = donor, y = fusion.vars, x = predictor.vars, nfolds = 10, hyper = list(boosting = c(\"gbdt\", \"goss\"), num_leaves = c(10, 30), feature_fraction = c(0.7, 0.9)), cores = 2) 5 fusion variables 13 initial predictor variables 2843 observations Using all available predictors for each fusion variable Using OpenMP multithreading within LightGBM (2 cores) Training step 1 of 5: insulation Training step 2 of 5: aircon Training step 3 of 5: square_feet -- R-squared of cluster means: 0.971 -- Number of neighbors in each cluster: Min. 1st Qu. Median Mean 3rd Qu. Max. 10.0 61.0 142.0 198.4 330.0 499.0 Training step 4 of 5: electricity -- R-squared of cluster means: 0.971 -- Number of neighbors in each cluster: Min. 1st Qu. Median Mean 3rd Qu. Max. 10.0 61.0 179.0 220.7 388.0 499.0 Training step 5 of 5: natural_gas -- R-squared of cluster means: 0.959 -- Number of neighbors in each cluster: Min. 1st Qu. Median Mean 3rd Qu. Max. 
10.0 70.0 193.0 217.6 340.0 499.0 Fusion model saved to: /home/kevin/Documents/Projects/fusionModel/fusion_model.fsn Total processing time: 44.9 secs # Fuse multiple implicates to the recipient sim10 <- fuse(data = recipient, fsn = fsn.model, M = 10) 5 fusion variables 13 initial predictor variables 2843 observations Generating 10 implicates Fusion step 1 of 5: insulation -- Predicting LightGBM models -- Simulating fused values Fusion step 2 of 5: aircon -- Predicting LightGBM models -- Simulating fused values Fusion step 3 of 5: square_feet -- Predicting LightGBM models -- Simulating fused values Fusion step 4 of 5: electricity -- Predicting LightGBM models -- Simulating fused values Fusion step 5 of 5: natural_gas -- Predicting LightGBM models -- Simulating fused values Total processing time: 2.38 secs head(sim10) M insulation aircon square_feet 1: 1 Well insulated Central air conditioning system 1728 2: 1 Adequately insulated Central air conditioning system 1492 3: 1 Well insulated No air conditioning 636 4: 1 Well insulated Central air conditioning system 2948 5: 1 Well insulated Individual window/wall or portable units 1140 6: 1 Adequately insulated No air conditioning 1579 electricity natural_gas 1: 13700 0.0 2: 15200 0.0 3: 1880 89.7 4: 8670 651.0 5: 14460 0.0 6: 14220 0.0 table(sim10$M) 1 2 3 4 5 6 7 8 9 10 2843 2843 2843 2843 2843 2843 2843 2843 2843 2843"},{"path":"https://ummel.github.io/fusionModel/index.html","id":"analyzing-fused-data","dir":"","previous_headings":"","what":"Analyzing fused data","title":"Data fusion and analysis of synthetic data in R","text":"fused values inherently probabilistic, reflecting uncertainty underlying statistical models. Multiple implicates needed calculate unbiased point estimates associated uncertainty particular analysis data. general, implicates preferable requires computation. 
Since proper analysis multiple implicates can rather cumbersome – coding mathematical standpoint – analyze() function provides convenient way calculate point estimates associated uncertainty common analyses. Potential analyses currently include variable means, proportions, sums, counts, medians, (optionally) calculated population subgroups. example, calculate mean value “electricity” variable across observations recipient dataset, following. response variable categorical, analyze() automatically returns proportions associated factor level. want perform analysis across subsets recipient population – example, calculate mean value “electricity” household “income” – can use static arguments. see mean electricity consumption increases household income. also possible multiple kinds analyses single call analyze(). example, following call calculates mean value “natural_gas” “square_feet”, median value “square_feet”, sum “electricity” (.e. total consumption) “insulation” (.e. total count level). estimates calculated population subgroup defined intersection “race” “urban_rural” status. can (example) isolate results white households rural areas. Notice mean estimate “square_feet” exceeds median, reflecting skewed distribution. complicated analyses can performed using custom fun argument analyze(). 
See Examples section ?analyze.","code":"analyze(x = list(mean = \"electricity\"), implicates = sim10) Using 10 implicates Assuming uniform sample weights Total processing time: 0.0344 secs N y level type est moe 1: 2843 electricity NA mean 10808.86 232.2422 analyze(x = list(mean = \"aircon\"), implicates = sim10) Using 10 implicates Assuming uniform sample weights Total processing time: 0.0903 secs N y level type est 1: 2843 aircon Central air conditioning system proportion 0.62866690 2: 2843 aircon Individual window/wall or portable units proportion 0.19212100 3: 2843 aircon Both a central system and individual units proportion 0.04400281 4: 2843 aircon No air conditioning proportion 0.13520929 moe 1: 0.017603881 2: 0.015537772 3: 0.007524075 4: 0.014413665 analyze(x = list(mean = \"electricity\"), implicates = sim10, static = recipient, by = \"income\") Using 10 implicates Assuming uniform sample weights Total processing time: 0.0229 secs income N y level type est moe 1: Less than $20,000 471 electricity NA mean 8884.312 478.0797 2: $20,000 - $39,999 645 electricity NA mean 9719.018 456.4553 3: $40,000 - $59,999 464 electricity NA mean 10537.228 538.0130 4: $60,000 to $79,999 372 electricity NA mean 11314.790 572.3587 5: $80,000 to $99,999 248 electricity NA mean 11550.401 693.7652 6: $100,000 to $119,999 222 electricity NA mean 12138.498 850.3994 7: $120,000 to $139,999 119 electricity NA mean 12490.518 1113.5339 8: $140,000 or more 302 electricity NA mean 13683.182 728.6027 result <- analyze(x = list(mean = c(\"natural_gas\", \"square_feet\"), median = \"square_feet\", sum = c(\"electricity\", \"insulation\")), implicates = sim10, static = recipient, by = c(\"race\", \"urban_rural\")) Using 10 implicates Assuming uniform sample weights Total processing time: 1.63 secs subset(result, race == \"White\" & urban_rural == \"Rural\") race urban_rural N y level type est 1: White Rural 503 electricity sum 6906696.6100 2: White Rural 503 insulation Not insulated count 
5.6000 3: White Rural 503 insulation Poorly insulated count 68.4000 4: White Rural 503 insulation Adequately insulated count 209.7000 5: White Rural 503 insulation Well insulated count 219.3000 6: White Rural 503 natural_gas mean 157.5723 7: White Rural 503 square_feet mean 2387.3509 8: White Rural 503 square_feet median 2159.4000 moe 1: 3.115890e+05 2: 5.653837e+00 3: 2.178315e+01 4: 3.144372e+01 5: 3.613723e+01 6: 2.579220e+01 7: 1.164896e+02 8: 1.573095e+02"},{"path":"https://ummel.github.io/fusionModel/index.html","id":"validating-fusion-models","dir":"","previous_headings":"","what":"Validating fusion models","title":"Data fusion and analysis of synthetic data in R","text":"validate() function provides convenient way perform internal validation tests synthetic variables fused back onto original donor data. allows us assess quality underlying fusion model; analogous assessing model skill comparing predictions observed training data. validate() compares analytical results derived using multiple-implicate fusion output derived using original donor microdata. performing analyses population subsets varying size, validate() estimates synthetic variables perform analyses varying difficulty/complexity. computes fusion variable means proportions subsets full sample – separately observed fused data – compares results. First, fuse multiple implicates fusion.vars using original donor data – recipient data, previously. Next, pass sim results validate(). argument subset_vars specifies want validation exercise compare observed (donor) simulated point estimates across population subsets defined “income”, “age”, “race”, “education”. See ?validate details. validate() output includes ggplot2 graphics helpfully summarize validation results. example, plot shows observed simulated point estimates compare, using median absolute percent error performance metric. see synthetic data good job reproducing point estimates fusion variables population subset question reasonably large. 
smaller subsets – .e. difficult analyses due small sample size – “square_feet”, “natural_gas”, “electricity” remain well modeled, error increases rapidly “aircon” “insulation”. information useful understanding kind reliability can expect particular variables types analyses, given underlying fusion model data. Happy fusing!","code":"sim <- fuse(data = donor, fsn = fsn.model, M = 40) 5 fusion variables 13 initial predictor variables 2843 observations Generating 40 implicates Fusion step 1 of 5: insulation -- Predicting LightGBM models -- Simulating fused values Fusion step 2 of 5: aircon -- Predicting LightGBM models -- Simulating fused values Fusion step 3 of 5: square_feet -- Predicting LightGBM models -- Simulating fused values Fusion step 4 of 5: electricity -- Predicting LightGBM models -- Simulating fused values Fusion step 5 of 5: natural_gas -- Predicting LightGBM models -- Simulating fused values Total processing time: 7.8 secs valid <- validate(observed = donor, implicates = sim, subset_vars = c(\"income\", \"age\", \"race\", \"education\")) Assuming uniform sample weights One-hot encoding categorical fusion variables Correlation between observed and fused values: Min. 1st Qu. Median Mean 3rd Qu. Max. 0.075 0.109 0.255 0.303 0.431 0.774 Processing validation analyses for 5 fusion variables Performed 1430 analyses across 130 subsets Smoothing validation metrics Average smoothed performance metrics across subset range: y est vad moe 1 aircon 0.0323 0.689 1.37 2 electricity 0.0225 0.419 1.06 3 insulation 0.0390 0.492 1.41 4 natural_gas 0.0277 0.500 1.06 5 square_feet 0.0146 0.788 1.18 Creating ggplot2 graphics Total processing time: 3.14 secs valid$plots$est"},{"path":"https://ummel.github.io/fusionModel/reference/analyze.html","id":null,"dir":"Reference","previous_headings":"","what":"Analyze fusion output — analyze","title":"Analyze fusion output — analyze","text":"Calculation point estimates associated margin error analyses using fused/synthetic microdata. 
Can calculate means, proportions, sums, counts, medians, optionally across population subgroups.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Analyze fusion output — analyze","text":"","code":"analyze( x, implicates, static = NULL, weight = NULL, rep_weights = NULL, by = NULL, fun = NULL, var_scale = 4, cores = 1 )"},{"path":"https://ummel.github.io/fusionModel/reference/analyze.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Analyze fusion output — analyze","text":"x List. Named list specifying desired analysis type(s) associated target variable(s). Example: x = list(mean = c(\"v1\", \"v2\"), median = \"v3\") translates : \"Return mean value variables v1 v2 median v3\". Supported analysis types include mean, sum, median. Mean sum automatically return proportions counts, respectively, target variable factor. Target variables must implicates, static, data.frame returned custom fun. implicates Data frame. Implicates synthetic (fused) variables. Typically generated fuse. implicates row-stacked identified integer column \"M\". static Data frame. Optional static (non-synthetic) variables vary across implicates. Note nrow(static) = nrow(implicates) / max(implicates$M) row-ordering assumed consistent static implicates. weight Character. Name observation weights column static. NULL (default), uniform weights assumed. rep_weights Character. Optional vector replicate weight columns static. provided, returned margin errors reflect additional variance due uncertainty sample weights. Character. Optional column name(s) implicates static (typically factors) collectively define set population subgroups analysis executed. NULL, analysis done whole sample. fun Function. Optional function applied input data prior executing analyses. Can used non-conventional/custom analyses. var_scale Scalar. 
Factor scale unadjusted replicate weight variance. determined survey design. default (var_scale = 4) appropriate ACS RECS. cores Integer. Number cores used. applicable Unix systems.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Analyze fusion output — analyze","text":"data.table reporting analysis results, possibly across subgroups defined . returned quantities include: N Number observations used analysis. y Target variable. level Levels factor target variables. type Type estimate returned: mean, proportion, sum, count, median. est Point estimate. moe Margin error associated 90% confidence interval.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Analyze fusion output — analyze","text":"minimum, user must supply synthetic implicates (typically generated fuse). Inputs checked consistent dimensions. implicates contains single implicate rep_weights = NULL, \"typical\" standard error returned warning make sure user aware situation. Estimates standard errors requested analysis calculated separately implicate. final point estimate mean estimate across implicates. final standard error pooled SE across implicates, calculated using Rubin's pooling rules (1987) finite population adjustment degrees freedom (Barnard Rubin 1999). replicate weights provided, standard errors implicate calculated via variance estimates across replicates. Calculations leverage data.table operations speed memory efficiency. within-implicate variance calculated around point estimate (rather around mean replicates). equivalent mse = TRUE svrepdesign. seems appropriate method surveys. replicate weights provided, standard errors implicate calculated using variance within implicate. 
means, ratio variance approximation Cochran (1977) used, known good approximation bootstrapped SE's weighted means (Gatz Smith 1995). proportions, generalization unweighted SE formula used (see ). regression coefficients, standard error calculated summary.glm.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Analyze fusion output — analyze","text":"Barnard, J., & Rubin, D.B. (1999). Small-sample degrees freedom multiple imputation. Biometrika, 86, 948-955. Cochran, W. G. (1977). Sampling Techniques (3rd Edition). Wiley, New York. Gatz, D.F., Smith, L. (1995). Standard Error Weighted Mean Concentration — . Bootstrapping vs Methods. Atmospheric Environment, vol. 29, . 11, 1185–1193. Rubin, D.B. (1987). Multiple imputation nonresponse surveys. Hoboken, NJ: Wiley.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Analyze fusion output — analyze","text":"","code":"# Build a fusion model using RECS microdata fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars) # Generate 30 implicates of the 'fusion.vars' using original RECS as the recipient sim <- fuse(data = recs, fsn = fsn.path, M = 30) head(sim) #--------- # Multiple types of analyses can be done at once # This calculates estimates using the full sample result <- analyze(x = list(mean = c(\"natural_gas\", \"aircon\"), median = \"electricity\", sum = c(\"electricity\", \"aircon\")), implicates = sim, weight = \"weight\") View(result) #----- # Mean electricity consumption, by climate zone and urban/rural status result1 <- analyze(x = list(mean = \"electricity\"), implicates = sim, static = recs, weight = \"weight\", by = c(\"climate\", \"urban_rural\")) # Same as above but 
including sample weight uncertainty # Note that only the first 30 replicate weights are used internally result2 <- analyze(x = list(mean = \"electricity\"), implicates = sim, static = recs, weight = \"weight\", rep_weights = paste0(\"rep_\", 1:96), by = c(\"climate\", \"urban_rural\")) # Helper function for comparison plots pfun <- function(x, y) {plot(x, y); abline(0, 1, lty = 2)} # Inclusion of replicate weights does not affect estimates, but it does # increase margin of error due to uncertainty in RECS sample weights pfun(result1$est, result2$est) pfun(result1$moe, result2$moe) # Notice that relative uncertainty declines with subset size plot(result1$N, result1$moe / result1$est) #----- # Use a custom function to perform more complex analyses # Custom function should return a data frame with non-standard target variables my_fun <- function(data) { # Manipulate 'data' as desired # All variables in 'implicates' and 'static' are available # Construct electricity consumption per square foot kwh_per_ft2 <- data$electricity / data$square_feet # Binary (T/F) indicator if household uses natural gas use_natural_gas <- data$natural_gas > 0 # Return data.frame of custom variables to be analyzed data.frame(kwh_per_ft2, use_natural_gas) } # Do analysis using variables produced by custom function # Can include non-custom target variables as well result <- analyze(x = list(mean = c(\"kwh_per_ft2\", \"use_natural_gas\", \"electricity\")), implicates = sim, static = recs, weight = \"weight\", fun = my_fun)"},{"path":"https://ummel.github.io/fusionModel/reference/analyze2.html","id":null,"dir":"Reference","previous_headings":"","what":"Analyze fusion output — analyze2","title":"Analyze fusion output — analyze2","text":"Calculation point estimates associated margin error analyses using fused/synthetic microdata replicate weights. Efficiently computes means, proportions, sums, counts, medians, standard deviations, variances, optionally across population subgroups. 
differs analyze requires replicate weights calculates uncertainty using full replicate weight variance (approximation).","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze2.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Analyze fusion output — analyze2","text":"","code":"analyze2( analyses, implicates, static, weight, rep_weights, by = NULL, var_scale = 4, cores = 1 )"},{"path":"https://ummel.github.io/fusionModel/reference/analyze2.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Analyze fusion output — analyze2","text":"analyses List. Specifies desired analyses. See Details Examples. Variables referenced analyses must implicates static. implicates Data frame file path. Implicates synthetic (fused) variables; typically output fuse. implicates row-stacked identified integer column \"M\". file path \".fst\" file, necessary columns read memory. static Data frame file path. Static variables vary across implicates; typically \"recipient\" microdata passed fuse. minimum, static must contain weight rep_weights. file path \".fst\" file, necessary columns read memory. Note nrow(static) = nrow(implicates) / max(implicates$M) row-ordering assumed consistent static implicates. weight Character. Name primary observation weights column static. rep_weights Character. Vector replicate weight columns static. Character. Optional column name(s) implicates static (typically factors) collectively define set population subgroups analysis executed. NULL, analysis done whole sample. var_scale Scalar. Factor scale unadjusted replicate weight variance. determined survey design. default (var_scale = 4) appropriate ACS RECS. cores Integer. 
Number cores used multithreading collapse-package functions.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze2.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Analyze fusion output — analyze2","text":"tibble reporting analysis results, possibly across subgroups defined . returned quantities include: lhs Optional analysis name; \"left hand side\" analysis formula. rhs \"right hand side\" analysis formula. type Type analysis: sum, mean, median, prop(ortion) count. level Factor levels categorical analyses; NA omitted otherwise. est Point estimate; mean estimate across implicates. moe Margin error associated 90% confidence interval. rshare Share MOE attributable replicate weights (opposed variance across implicates).","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze2.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Analyze fusion output — analyze2","text":"final point estimates mean estimates across implicates. final margin error derived pooled standard error across implicates, calculated using Rubin's pooling rules (1987). within-implicate standard error's calculated using replicate weights var_scale. entry analyses list formula format Z ~ F(E), Z optional, user-friendly name analysis, F allowable “outer function”, E “inner expression” containing one microdata variables. example: mysum ~ mean(Var1 + Var2) case, outer function mean(). Allowable outer functions : mean(), sum(), median(), sd(), var(). inner expression contains one variable, first evaluated F() applied result. case, internal variable X = Var1 + Var2 generated across observations, mean(X) computed. inner expression desired, analyses list can use following convenient syntax apply single outer function multiple variables: mean = c(\"Var1\", \"Var2\") inner expression can also utilize function takes variable names arguments returns vector length inputs. 
useful defining complex operations separate function (e.g. microsimulation). example: myfun = function(Var1, Var2) {Var1 + Var2} mysum ~ mean(myfun(Var1, Var2)) use sum() mean() inner expression returns categorical vector automatically results category-wise weighted counts proportions, respectively. example, following analysis fail evaluated literally, since mean() expects numeric input inner expression returns character. interpreted request return weighted proportions categorical outcome. myprop ~ mean(ifelse(Var1 > 10 , 'Yes', '')) analyze2() uses \"fast\" versions allowable outer functions, provided fast-statistical-functions collapse package. functions highly optimized weighted, grouped calculations. addition, outer functions mean(), sum(), median() enjoy use platform-independent multithreading across columns cores > 1. Analyses numerical inner expressions processed using series calls collap unique observation weights. Analyses categorical inner expressions utilize series calls fsum.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze2.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Analyze fusion output — analyze2","text":"Rubin, D.B. (1987). Multiple imputation nonresponse surveys. 
Hoboken, NJ: Wiley.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze2.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Analyze fusion output — analyze2","text":"","code":"# Build a fusion model using RECS microdata fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\", \"insulation\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars) # Generate 30 implicates of the 'fusion.vars' using original RECS as the recipient recipient <- recs[c(predictor.vars, \"weight\", paste0(\"rep_\", 1:96))] sim <- fuse(data = recipient, fsn = fsn.path, M = 30) head(sim) #----- # Example of custom pre-processing function myfun <- function(v1, v2, v3) v1 + v2 + v3 # Various ways to specify analyses... my.analyses <- list( # Return means for 'electricity' and proportions for 'aircon' mean = c(\"electricity\", \"aircon\"), # Identical to mean = \"electricity\"; duplicate analyses automatically removed electricity ~ mean(electricity), # Simple addition in the inner expression mysum ~ sum(electricity + natural_gas), # Standard deviation of electricity sd = \"electricity\", # Unnamed analyses (no left-hand side in formula) ~ var(electricity + natural_gas), ~ mean(insulation), # Proportions ~ sum(insulation), # Counts # Proportions involving manipulation of >1 variable myprop ~ mean(aircon != \"No air conditioning\" & insulation < \"Adequately insulated\"), # Custom inner function mycustom ~ median(myfun(electricity, natural_gas, v3 = 100)) ) # Do the requested analyses, by \"division\" result <- analyze2( analyses = my.analyses, implicates = sim, static = recipient, weight = \"weight\", rep_weights = paste0(\"rep_\", 1:96), by = \"division\" ) head(result) #----- # To calculate a conditional estimate, set unused/ignored observations to NA # All outer functions execute with 'na.rm = TRUE' # Example: mean natural_gas conditional on natural_gas > 0 # 
data.table::fifelse() is much faster than base::ifelse() for large data result <- analyze2( analyses = ~mean(data.table::fifelse(natural_gas > 0, natural_gas, NA_real_)), implicates = sim, static = recipient, weight = \"weight\", rep_weights = paste0(\"rep_\", 1:96), by = \"division\" )"},{"path":"https://ummel.github.io/fusionModel/reference/fuse.html","id":null,"dir":"Reference","previous_headings":"","what":"Fuse variables to a recipient dataset — fuse","title":"Fuse variables to a recipient dataset — fuse","text":"Fuse variables recipient dataset using .fsn model produced train. Output can passed analyze validate.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/fuse.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Fuse variables to a recipient dataset — fuse","text":"","code":"fuse( data, fsn, fsd = NULL, M = 1, retain = NULL, kblock = 10, margin = 2, cores = 1 )"},{"path":"https://ummel.github.io/fusionModel/reference/fuse.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Fuse variables to a recipient dataset — fuse","text":"data Data frame. Recipient dataset. categorical variables factors ordered whenever possible. Data types levels strictly validated predictor variables defined fsn. fsn Character. Path fusion model file (.fsn) generated train. fsd Character. Optional fusion output file created ending .fsd (.e. \"fused data\"). compressed binary file can read using fst package. fsd = NULL (default), fusion results returned data.table. M Integer. Number implicates simulate. retain Character. Names columns data retained output; .e. repeated across implicates. Useful retaining ID weight variables use subsequent analysis fusion output. kblock Integer. Fixed number nearest neighbors use fusing variables block. Must >= 5 <= 30. applicable variables fused (.e. block). margin Numeric. Safety margin used estimating many implicates can processed memory . 
Set higher fuse() experiences memory shortfall. Alternatively, can set negative value manually specify number chunks use. example, margin = -3 splits M implicates three chunks approximately equal size. cores Integer. Number cores used. LightGBM prediction parallel-enabled systems OpenMP available.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/fuse.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Fuse variables to a recipient dataset — fuse","text":"fsd = NULL, data.table number rows equal M * nrow(data). Integer column \"M\" indicates implicate assignment observation. Note ordering recipient observations consistent within implicates, change row order using analyze. fsd specified, path .fsd file results written. Metadata column classes factor levels stored column names. read_fsd used load files saved via fsd argument.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/fuse.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Fuse variables to a recipient dataset — fuse","text":"UPDATE.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/fuse.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Fuse variables to a recipient dataset — fuse","text":"","code":"# Build a fusion model using RECS microdata # Note that \"fusion_model.fsn\" will be written to working directory ?recs fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars) # Generate single implicate of synthetic 'fusion.vars', # using original RECS data as the recipient recipient <- recs[predictor.vars] sim <- fuse(data = recipient, fsn = fsn.path) head(sim) # Calling fuse() again produces different results sim <- fuse(data = recipient, fsn = fsn.path) head(sim) # Generate multiple implicates sim <- fuse(data = recipient, fsn = fsn.path, M 
= 5) head(sim) table(sim$M) # Optionally, write results directly to disk # Note that \"results.fsd\" will be written to working directory sim <- fuse(data = recipient, fsn = fsn.path, M = 5, fsd = \"results.fsd\") sim <- read_fsd(sim) head(sim)"},{"path":"https://ummel.github.io/fusionModel/reference/fusionModel-package.html","id":null,"dir":"Reference","previous_headings":"","what":"fusionModel: Data fusion and analysis of synthetic data in R — fusionModel-package","title":"fusionModel: Data fusion and analysis of synthetic data in R — fusionModel-package","text":"Data fusion analysis synthetic data R.","code":""},{"path":[]},{"path":"https://ummel.github.io/fusionModel/reference/fusionModel-package.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"fusionModel: Data fusion and analysis of synthetic data in R — fusionModel-package","text":"Maintainer: Kevin Ummel ummel@berkeley.edu contributors: Karthik Akkiraju [contributor] Miguel Poblete Cazenave [contributor]","code":""},{"path":"https://ummel.github.io/fusionModel/reference/importance.html","id":null,"dir":"Reference","previous_headings":"","what":"Extract predictor variable importance from a fusion model — importance","title":"Extract predictor variable importance from a fusion model — importance","text":"Returns predictor variable (feature) importance underlying LightGBM models stored fusion model file (.fsn) disk.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/importance.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Extract predictor variable importance from a fusion model — importance","text":"","code":"importance(fsn)"},{"path":"https://ummel.github.io/fusionModel/reference/importance.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Extract predictor variable importance from a fusion model — importance","text":"fsn Character. 
Path fusion model file (.fsn) generated train.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/importance.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Extract predictor variable importance from a fusion model — importance","text":"named list containing detailed summary importance results. summary results useful, return average importance predictor across potentially multiple underlying LightGBM models; .e. zero (\"z\"), mean (\"m\"), quantile (\"q\") models. See Examples suggested plotting results.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/importance.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Extract predictor variable importance from a fusion model — importance","text":"Importance metrics computed via lgb.importance. Three types measures returned; \"gain\" typically preferred measure.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/importance.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Extract predictor variable importance from a fusion model — importance","text":"","code":"# Build a fusion model using RECS microdata # Note that \"fusion_model.fsn\" will be written to working directory ?recs fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars) # Extract predictor variable importance ximp <- importance(fsn.path) # Plot summary results library(ggplot2) ggplot(ximp$summary, aes(x = x, y = gain)) + geom_bar(stat = \"identity\") + facet_grid(~ y) + coord_flip() # View detailed results View(ximp$detailed)"},{"path":"https://ummel.github.io/fusionModel/reference/impute.html","id":null,"dir":"Reference","previous_headings":"","what":"Impute missing data via fusion — impute","title":"Impute missing data via fusion — impute","text":"universal missing data imputation 
tool wraps successive calls train fuse hood. Designed simplicity ease use.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/impute.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Impute missing data via fusion — impute","text":"","code":"impute(data, weight = NULL, ignore = NULL, cores = 1)"},{"path":"https://ummel.github.io/fusionModel/reference/impute.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Impute missing data via fusion — impute","text":"data data frame missing values. weight Optional name observation weights column data. ignore Optional names columns data ignore predictor variables. cores Number physical CPU cores used lightgbm. LightGBM parallel-enabled platforms OpenMP available.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/impute.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Impute missing data via fusion — impute","text":"data frame missing values imputed.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/impute.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Impute missing data via fusion — impute","text":"Variables missing values imputed sequentially, beginning variable fewest missing values. Since LightGBM models accommodate NA values predictor set, available variables used potential predictors (excluding ignore variables). call train, 80% observations randomly selected training remaining 20% used validation set determine appropriate number tree learners. LightGBM model parameters kept sensible default values train. 
Since lightgbm uses OpenMP multithreading, advisable use impute inside forked/parallel process cores > 1.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/impute.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Impute missing data via fusion — impute","text":"","code":"# Create data frame with random NA values ?recs data <- recs[, 2:7] miss <- replicate(ncol(data), runif(nrow(data)) < runif(1, 0.01, 0.3)) data[miss] <- NA colSums(is.na(data)) # Impute the missing values result <- impute(data) anyNA(result)"},{"path":"https://ummel.github.io/fusionModel/reference/monotonic.html","id":null,"dir":"Reference","previous_headings":"","what":"Ensure a monotonic relationship between two variables — monotonic","title":"Ensure a monotonic relationship between two variables — monotonic","text":"monotonic() returns modified values input vector y smoothed, monotonic, consistent across values input x. designed used post-fusion one wants ensure plausible relationship consumption (x) expenditure (y), assumption consumers face identical, monotonic pricing structure. default, mean returned values forced equal original mean y (preserve = TRUE). direction monotonicity (increasing decreasing) detected automatically, use cases limited consumption expenditure variables.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/monotonic.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Ensure a monotonic relationship between two variables — monotonic","text":"","code":"monotonic(x, y, w = NULL, preserve = TRUE, plot = FALSE)"},{"path":"https://ummel.github.io/fusionModel/reference/monotonic.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Ensure a monotonic relationship between two variables — monotonic","text":"x Numeric. y Numeric. w Numeric. Optional observation weights. preserve Logical. 
Preserve original mean y values returned values? plot Logical. Plot (sampled) data points derived monotonic relationship?","code":""},{"path":"https://ummel.github.io/fusionModel/reference/monotonic.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Ensure a monotonic relationship between two variables — monotonic","text":"numeric vector modified y values. Optionally, plot showing returned monotonic relationship.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/monotonic.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Ensure a monotonic relationship between two variables — monotonic","text":"initial smoothing accomplished via supsmu result coerced monotone. coercion step modifies values much, second smooth attempted via scam model either monotone increasing decreasing constraint. SCAM fails fit, function falls back lm simple linear predictions. y = 0 x = 0 (typical consumption-expenditure variables), outcome enforced result. input data randomly sampled 10,000 observations, necessary, speed.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/monotonic.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Ensure a monotonic relationship between two variables — monotonic","text":"","code":"y <- monotonic(x = recs$propane_btu, y = recs$propane_expend, plot = TRUE) mean(recs$propane_expend) mean(y)"},{"path":"https://ummel.github.io/fusionModel/reference/plot_valid.html","id":null,"dir":"Reference","previous_headings":"","what":"Plot validation results — plot_valid","title":"Plot validation results — plot_valid","text":"Creates optionally saves disk representative plots validation results returned validate. Requires suggested ggplot2 package. function (default) called within validate. 
Can useful save graphics disk generate plots subset fusion variables.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/plot_valid.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Plot validation results — plot_valid","text":"","code":"plot_valid(valid, y = NULL, path = NULL, cores = 1, ...)"},{"path":"https://ummel.github.io/fusionModel/reference/plot_valid.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Plot validation results — plot_valid","text":"valid Object returned validate. y Character. Fusion variables use validation graphics. Useful plotting partial validation results. Default use fusion variables present valid. path Character. Path directory .png graphics saved. Directory created necessary. NULL (default), files saved disk. cores Integer. Number cores used. applicable Unix systems. ... Arguments passed ggsave control .png graphics saved disk.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/plot_valid.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Plot validation results — plot_valid","text":"list \"plots\", \"smooth\", \"data\" slots. \"plots\" slot contains following ggplot objects: est: Comparison point estimates (median absolute percent error). moe: Comparison 90% margin error (median ratio simulated--observed MOE). Additional named slots (one fusion variables) contain plots described scatterplot results. \"smooth\" data frame plotting values used produce smoothed median plots. \"data\" data frame complete validation results returned original call validate.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/plot_valid.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Plot validation results — plot_valid","text":"Validation results visualized convey expected, typical (median) performance fusion variables. 
, well simulated data match observed data respect point estimates confidence intervals population subsets various size? Plausible error metrics derived input validation data plotting. comparison point estimates, error metric absolute percent error continuous variables; categorical case absolute error scaled maximum possible error 1. Since metrics strictly comparable, -variable plots denote categorical fusion variables dotted lines. given fusion variable, error metric exhibit variation (often quite skewed) even subsets comparable size, due fact subset looks unique partition data. order convey expected, typical performance varies subset size, smoothed median error conditional subset size approximated plotted.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/plot_valid.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Plot validation results — plot_valid","text":"","code":"# Build a fusion model using RECS microdata # Note that \"fusion_model.fsn\" will be written to working directory fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars, weight = \"weight\") # Fuse back onto the donor data (multiple implicates) sim <- fuse(data = recs, fsn = fsn.path, M = 30) # Calculate validation results but do not generate plots valid <- validate(observed = recs, implicates = sim, subset_vars = c(\"income\", \"education\", \"race\", \"urban_rural\"), weight = \"weight\", plot = FALSE) # Create validation plots valid <- plot_valid(valid) # View some of the plots valid$plots$est valid$plots$moe valid$plots$electricity$bias # Can also save the plots to disk at creation # Will save .png files to 'valid_plots' folder in working directory # Note that it is fine to pass a 'valid' object with existing $plots slot # In that case, the plots are simply re-generated vplots <- plot_valid(valid, path = file.path(getwd(), 
\"valid_plots\"), width = 8, height = 6)"},{"path":"https://ummel.github.io/fusionModel/reference/prepXY.html","id":null,"dir":"Reference","previous_headings":"","what":"Prepare the 'x' and 'y' inputs — prepXY","title":"Prepare the 'x' and 'y' inputs — prepXY","text":"Optional--useful function : 1) provide plausible ordering 'y' (fusion) variables 2) identify subset 'x' (predictor) variables likely consequential subsequent model training. Output can passed directly train. useful large datasets many /highly-correlated predictors. Employs absolute Spearman rank correlation screen LASSO models (via glmnet) return plausible ordering 'y' preferred subset 'x' variables associated .","code":""},{"path":"https://ummel.github.io/fusionModel/reference/prepXY.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Prepare the 'x' and 'y' inputs — prepXY","text":"","code":"prepXY( data, y, x, weight = NULL, cor_thresh = 0.05, lasso_thresh = 0.95, xmax = 100, xforce = NULL, fraction = 1, cores = 1 )"},{"path":"https://ummel.github.io/fusionModel/reference/prepXY.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Prepare the 'x' and 'y' inputs — prepXY","text":"data Data frame. Training dataset. categorical variables factors ordered whenever possible. y Character list. Variables data eventually fuse recipient dataset. y list, entry character vector possibly indicating multiple variables fuse block. x Character. Predictor variables data common donor eventual recipient. weight Character. Name observation weights column data. NULL (default), uniform weights assumed. cor_thresh Numeric. Predictors exhibit less cor_thresh absolute Spearman (rank) correlation y variable screened prior LASSO step. Fast exclusion predictors LASSO step probably need consider. lasso_thresh Numeric. Controls aggressively LASSO step screens predictors. Lower value aggressive. 
lasso_thresh = 0.95, example, retains predictors collectively explain least 95% deviance explained \"full\" model. xmax Integer. Maximum number predictors returned LASSO step. strictly control number final predictors returned (especially categorical y variables), useful setting () soft upper bound. Lower xmax can help control computation time large number x pass correlation screen. xmax = Inf imposes restriction. xforce Character. Subset x variables \"force\" included predictors results. fraction Numeric. Fraction observations data randomly sample. larger datasets, sampling often minimal effect results speeds computation. cores Integer. Number cores used. applicable Unix systems.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/prepXY.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Prepare the 'x' and 'y' inputs — prepXY","text":"List named slots \"y\" \"x\". list length. Former gives preferred fusion order. Latter gives preferred sets predictor variables.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/prepXY.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Prepare the 'x' and 'y' inputs — prepXY","text":"","code":"y <- names(recs)[c(14:16, 20:22)] x <- names(recs)[2:13] # Fusion variable \"blocks\" are respected by prepXY() y <- c(list(y[1:2]), y[-c(1:2)]) # Do the prep work... 
prep <- prepXY(data = recs, y = y, x = x) # The result can be passed to train() train(data = recs, y = prep$y, x = prep$x)"},{"path":"https://ummel.github.io/fusionModel/reference/read_fsd.html","id":null,"dir":"Reference","previous_headings":"","what":"Read fusion output from disk — read_fsd","title":"Read fusion output from disk — read_fsd","text":"Read fusion output written directly disk via fuse.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/read_fsd.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Read fusion output from disk — read_fsd","text":"","code":"read_fsd(fsd, columns = NULL, cores = data.table::getDTthreads())"},{"path":"https://ummel.github.io/fusionModel/reference/read_fsd.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Read fusion output from disk — read_fsd","text":"fsd Character. File path ending .fsd produced call fuse. columns Character. Column names read. default read columns. cores Integer. Number cores used fread.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/read_fsd.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Read fusion output from disk — read_fsd","text":"data.table integer column \"M\" indicating implicate assignment observation. 
Note ordering recipient observations consistent within implicates, change row order using analyze validate.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/read_fsd.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Read fusion output from disk — read_fsd","text":"version 3.0, simply convenient wrapper around read_fst, since fusion output data files (.fsd) produced fuse actually native fst files hood.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/read_fsd.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Read fusion output from disk — read_fsd","text":"","code":"# Build a fusion model using RECS microdata # Note that \"fusion_model.fsn\" will be written to working directory ?recs fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars) # Write fusion output directly to disk # Note that \"results.fsd\" will be written to working directory recipient <- recs[predictor.vars] sim <- fuse(data = recipient, fsn = fsn.path, M = 5, fsd = \"results.fsd\") # Read the fusion output saved to disk sim <- read_fsd(sim) head(sim)"},{"path":"https://ummel.github.io/fusionModel/reference/recs.html","id":null,"dir":"Reference","previous_headings":"","what":"Data from the 2015 Residential Energy Consumption Survey (RECS) — recs","title":"Data from the 2015 Residential Energy Consumption Survey (RECS) — recs","text":"Pre-processed, household-level microdata containing selection 31 variables derived 2015 RECS, plus survey replicate weights. variety data types included. missing values. 
Variable names altered original.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/recs.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Data from the 2015 Residential Energy Consumption Survey (RECS) — recs","text":"","code":"recs"},{"path":"https://ummel.github.io/fusionModel/reference/recs.html","id":"format","dir":"Reference","previous_headings":"","what":"Format","title":"Data from the 2015 Residential Energy Consumption Survey (RECS) — recs","text":"tibble 5,686 rows 124 variables: weight Primary sampling weight income Annual gross household income last year age Respondent age race Respondent race education Highest education completed respondent employment Respondent employment status hh_size Number household members division Census Division urban_rural Census 2010 Urban Type climate IECC Climate Code renter household renting home? home_type Type housing unit year_built Range housing unit built square_feet Total square footage insulation Level insulation heating Main space heating fuel aircon Type air conditioning equipment used centralac_age Age central air conditioner televisions Number televisions used disconnect Frequency receiving disconnect notice electricity Total annual electricity usage, kilowatthours natural_gas Total annual natural gas usage, hundred cubic feet fuel_oil Total annual fuel oil/kerosene usage, gallons propane Total annual propane usage, gallons propane_btu Total annual propane usage, thousand Btu propane_expend Total annual propane expenditure, dollars heating_share Share total energy used space heating heating_share Share total energy used cooling (AC fans) other_share Share total energy used end-uses use_ng Logical indicating household uses natural gas have_ac Logical indicating household air conditioning rep_1:rep_96 Replicate weights uncertainty 
estimation","code":""},{"path":"https://ummel.github.io/fusionModel/reference/recs.html","id":"source","dir":"Reference","previous_headings":"","what":"Source","title":"Data from the 2015 Residential Energy Consumption Survey (RECS) — recs","text":"https://www.eia.gov/consumption/residential/data/2015/","code":""},{"path":"https://ummel.github.io/fusionModel/reference/train.html","id":null,"dir":"Reference","previous_headings":"","what":"Train a fusion model — train","title":"Train a fusion model — train","text":"Train fusion model \"donor\" data using sequential LightGBM models model conditional distributions. resulting fusion model (.fsn file) can used fuse simulate outcomes \"recipient\" dataset.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/train.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Train a fusion model — train","text":"","code":"train( data, y, x, fsn = \"fusion_model.fsn\", weight = NULL, nfolds = 5, nquantiles = 2, nclusters = 2000, krange = c(10, 500), hyper = NULL, fork = FALSE, cores = 1 )"},{"path":"https://ummel.github.io/fusionModel/reference/train.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Train a fusion model — train","text":"data Data frame. Donor dataset. Categorical variables must factors ordered whenever possible. y Character list. Variables data eventually fuse recipient dataset. Variables fused order provided. y list, entry character vector possibly indicating multiple variables fuse block. x Character list. Predictor variables data common donor eventual recipient. list, slot specifies x predictors use y. fsn Character. File path fusion model saved. Must use .fsn suffix. weight Character. Name observation weights column data. NULL (default), uniform weights assumed. nfolds Numeric. Number cross-validation folds used LightGBM model training. 
, nfolds < 1, fraction observations use training set; remainder used validation (faster cross-validation). nquantiles Numeric. Number quantile models train continuous y variables, addition conditional mean. nquantiles evenly-distributed percentiles used. example, default nquantiles = 2 yields quantile models 25th 75th percentiles. Higher values may produce accurate conditional distributions expense computation time. Even nquantiles recommended since conditional mean tends capture central tendency, making median model superfluous. nclusters Numeric. Maximum number k-means clusters use. Higher better computational cost. nclusters = 0 nclusters = Inf turn clustering. krange Numeric. Minimum maximum number nearest neighbors use construction continuous conditional distributions. Higher max(krange) better computational cost. hyper List. LightGBM hyperparameters used model training. NULL, default values used. See Details Examples. fork Logical. parallel processing via forking used, possible? See Details. cores Integer. Number physical CPU cores used parallel computation. fork = FALSE Windows platform (since forking possible), fusion variables/blocks processed serially LightGBM uses cores internal multithreading via OpenMP. Unix system, fork = TRUE, cores > 1, cores <= length(y) fusion variables/blocks processed parallel via mclapply.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/train.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Train a fusion model — train","text":"fusion model object (.fsn) saved fsn.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/train.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Train a fusion model — train","text":"y list, slot indicates either single variable , alternatively, multiple variables fuse block. Variables within block sampled jointly original donor data fusion. See Examples. 
y variables exhibit variance continuous y variables less 10 * nfolds non-zero observations (minimum required cross-validation) automatically removed warning. fusion model written fsn zipped archive created zip containing models data required fuse. hyper argument can used specify LightGBM hyperparameter values perform \"grid search\" model training. See full list parameters. combination hyperparameters, nfolds cross-validation performed using lgb.cv early stopping condition. parameter combination lowest loss function value used fit final model via lgb.train. candidate parameter values specified hyper, longer processing time. hyper = NULL, single set parameters used following default values: boosting = \"gbdt\" data_sample_strategy = \"goss\" num_leaves = 31 feature_fraction = 0.8 max_depth = 5 min_data_in_leaf = max(10, round(0.001 * nrow(data))) num_iterations = 2500 learning_rate= 0.1 max_bin = 255 min_data_in_bin = 3 max_cat_threshold = 32 Typical users reason modify hyperparameters listed . Note num_iterations imposes ceiling, since early stopping typically result models lower number iterations. See Examples. Testing small--medium size datasets suggests forking typically faster OpenMP multithreading (default). However, forking sometimes \"hang\" (continue run CPU usage error message) OpenMP process previously used session. issue appears related Intel's OpenMP implementation (see ). can triggered operations called train() use data.table fst multithread mode. 
experience hanged forking, try calling data.table::setDTthreads(1) fst::threads_fst(1) immediately library(fusionModel) new session.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/train.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Train a fusion model — train","text":"","code":"# Build a fusion model using RECS microdata # Note that \"fusion_model.fsn\" will be written to working directory ?recs fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars) # When 'y' is a list, it can specify variables to fuse as a block fusion.vars <- list(\"electricity\", \"natural_gas\", c(\"heating_share\", \"cooling_share\", \"other_share\")) fusion.vars train(data = recs, y = fusion.vars, x = predictor.vars) # When 'x' is a list, it specifies which predictor variables to use for each 'y' xlist <- list(predictor.vars[1:4], predictor.vars[2:8], predictor.vars) xlist train(data = recs, y = fusion.vars, x = xlist) # Specify a single set of LightGBM hyperparameters # Here we use Random Forests instead of the default Gradient Boosting Decision Trees train(data = recs, y = fusion.vars, x = predictor.vars, hyper = list(boosting = \"rf\", feature_fraction = 0.6, max_depth = 10 )) # Specify a range of LightGBM hyperparameters to search over # This takes longer, because there are more models to test train(data = recs, y = fusion.vars, x = predictor.vars, hyper = list(max_depth = c(5, 10), feature_fraction = c(0.7, 0.9) ))"},{"path":"https://ummel.github.io/fusionModel/reference/validate.html","id":null,"dir":"Reference","previous_headings":"","what":"Validate fusion output — validate","title":"Validate fusion output — validate","text":"Performs internal validation analyses fused microdata estimate well simulated variables reflect patterns dataset used train underlying fusion model (.e. observed/donor data). 
provides standard approach validating fusion output associated models. See Examples recommended usage.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/validate.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Validate fusion output — validate","text":"","code":"validate( observed, implicates, subset_vars, weight = NULL, min_size = 30, plot = TRUE, cores = 1 )"},{"path":"https://ummel.github.io/fusionModel/reference/validate.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Validate fusion output — validate","text":"observed Data frame. Observed data validate simulated variables. Typically dataset used train fusion model used generate simulated. implicates Data frame. Implicates synthetic (fused) variables. Typically generated fuse. implicates row-stacked identified integer column \"M\". subset_vars Character. Vector columns observed used define population subsets across fusion variables validated. levels subset_vars (including two-way interactions subset_vars) define population subsets. Continuous subset_vars converted five-level ordered factor based univariate k-means clustering. weight Character. Name observation weights column observed. NULL (default), uniform weights assumed. min_size Integer. Subsets less min_size observations excluded. Since subsets observations unlikely give reliable estimates, make sense consider validation purposes. plot Logical. TRUE (default), plot_valid called internally summary plots returned along complete validation results. Requires ggplot2 package. cores Integer. Number cores used. applicable Unix systems.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/validate.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Validate fusion output — validate","text":"plot = FALSE, data frame containing complete validation results. 
plot = FALSE, list containing full results well additional lot objects described plot_valid.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/validate.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Validate fusion output — validate","text":"objective validate confirm fusion output sensible help establish utility synthetic data across myriad analyses. Utility based comparison point estimates confidence intervals derived using multiple-implicate synthetic data derived using original donor data. specific analyses tested include variable levels (means proportions) across population subsets varying size. allows estimates synthetic variables perform analyses real-world relevance, varying levels complexity. effect, validate() performs large number analyses kind analyze function designed one--one basis. users want use default setting plot = TRUE simultaneously return visualization (plots) validation results. Plot creation detailed plot_valid.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/validate.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Validate fusion output — validate","text":"","code":"# Build a fusion model using RECS microdata # Note that \"fusion_model.fsn\" will be written to working directory fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars, weight = \"weight\") # Fuse back onto the donor data (multiple implicates) sim <- fuse(data = recs, fsn = fsn.path, M = 20) # Calculate validation results valid <- validate(observed = recs, implicates = sim, subset_vars = c(\"income\", \"education\", \"race\", \"urban_rural\"))"}] +[{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":null,"dir":"","previous_headings":"","what":"GNU General Public License","title":"GNU General Public License","text":"Version 3, 29 June 
2007Copyright © 2007 Free Software Foundation, Inc.  Everyone permitted copy distribute verbatim copies license document, changing allowed.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"preamble","dir":"","previous_headings":"","what":"Preamble","title":"GNU General Public License","text":"GNU General Public License free, copyleft license software kinds works. licenses software practical works designed take away freedom share change works. contrast, GNU General Public License intended guarantee freedom share change versions program–make sure remains free software users. , Free Software Foundation, use GNU General Public License software; applies also work released way authors. can apply programs, . speak free software, referring freedom, price. General Public Licenses designed make sure freedom distribute copies free software (charge wish), receive source code can get want , can change software use pieces new free programs, know can things. protect rights, need prevent others denying rights asking surrender rights. Therefore, certain responsibilities distribute copies software, modify : responsibilities respect freedom others. example, distribute copies program, whether gratis fee, must pass recipients freedoms received. must make sure , , receive can get source code. must show terms know rights. Developers use GNU GPL protect rights two steps: (1) assert copyright software, (2) offer License giving legal permission copy, distribute /modify . developers’ authors’ protection, GPL clearly explains warranty free software. users’ authors’ sake, GPL requires modified versions marked changed, problems attributed erroneously authors previous versions. devices designed deny users access install run modified versions software inside , although manufacturer can . fundamentally incompatible aim protecting users’ freedom change software. systematic pattern abuse occurs area products individuals use, precisely unacceptable. 
Therefore, designed version GPL prohibit practice products. problems arise substantially domains, stand ready extend provision domains future versions GPL, needed protect freedom users. Finally, every program threatened constantly software patents. States allow patents restrict development use software general-purpose computers, , wish avoid special danger patents applied free program make effectively proprietary. prevent , GPL assures patents used render program non-free. precise terms conditions copying, distribution modification follow.","code":""},{"path":[]},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_0-definitions","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"0. Definitions","title":"GNU General Public License","text":"“License” refers version 3 GNU General Public License. “Copyright” also means copyright-like laws apply kinds works, semiconductor masks. “Program” refers copyrightable work licensed License. licensee addressed “”. “Licensees” “recipients” may individuals organizations. “modify” work means copy adapt part work fashion requiring copyright permission, making exact copy. resulting work called “modified version” earlier work work “based ” earlier work. “covered work” means either unmodified Program work based Program. “propagate” work means anything , without permission, make directly secondarily liable infringement applicable copyright law, except executing computer modifying private copy. Propagation includes copying, distribution (without modification), making available public, countries activities well. “convey” work means kind propagation enables parties make receive copies. Mere interaction user computer network, transfer copy, conveying. 
interactive user interface displays “Appropriate Legal Notices” extent includes convenient prominently visible feature (1) displays appropriate copyright notice, (2) tells user warranty work (except extent warranties provided), licensees may convey work License, view copy License. interface presents list user commands options, menu, prominent item list meets criterion.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_1-source-code","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"1. Source Code","title":"GNU General Public License","text":"“source code” work means preferred form work making modifications . “Object code” means non-source form work. “Standard Interface” means interface either official standard defined recognized standards body, , case interfaces specified particular programming language, one widely used among developers working language. “System Libraries” executable work include anything, work whole, () included normal form packaging Major Component, part Major Component, (b) serves enable use work Major Component, implement Standard Interface implementation available public source code form. “Major Component”, context, means major essential component (kernel, window system, ) specific operating system () executable work runs, compiler used produce work, object code interpreter used run . “Corresponding Source” work object code form means source code needed generate, install, (executable work) run object code modify work, including scripts control activities. However, include work’s System Libraries, general-purpose tools generally available free programs used unmodified performing activities part work. example, Corresponding Source includes interface definition files associated source files work, source code shared libraries dynamically linked subprograms work specifically designed require, intimate data communication control flow subprograms parts work. 
Corresponding Source need include anything users can regenerate automatically parts Corresponding Source. Corresponding Source work source code form work.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_2-basic-permissions","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"2. Basic Permissions","title":"GNU General Public License","text":"rights granted License granted term copyright Program, irrevocable provided stated conditions met. License explicitly affirms unlimited permission run unmodified Program. output running covered work covered License output, given content, constitutes covered work. License acknowledges rights fair use equivalent, provided copyright law. may make, run propagate covered works convey, without conditions long license otherwise remains force. may convey covered works others sole purpose make modifications exclusively , provide facilities running works, provided comply terms License conveying material control copyright. thus making running covered works must exclusively behalf, direction control, terms prohibit making copies copyrighted material outside relationship . Conveying circumstances permitted solely conditions stated . Sublicensing allowed; section 10 makes unnecessary.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_3-protecting-users-legal-rights-from-anti-circumvention-law","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"3. Protecting Users’ Legal Rights From Anti-Circumvention Law","title":"GNU General Public License","text":"covered work shall deemed part effective technological measure applicable law fulfilling obligations article 11 WIPO copyright treaty adopted 20 December 1996, similar laws prohibiting restricting circumvention measures. 
convey covered work, waive legal power forbid circumvention technological measures extent circumvention effected exercising rights License respect covered work, disclaim intention limit operation modification work means enforcing, work’s users, third parties’ legal rights forbid circumvention technological measures.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_4-conveying-verbatim-copies","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"4. Conveying Verbatim Copies","title":"GNU General Public License","text":"may convey verbatim copies Program’s source code receive , medium, provided conspicuously appropriately publish copy appropriate copyright notice; keep intact notices stating License non-permissive terms added accord section 7 apply code; keep intact notices absence warranty; give recipients copy License along Program. may charge price price copy convey, may offer support warranty protection fee.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_5-conveying-modified-source-versions","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"5. Conveying Modified Source Versions","title":"GNU General Public License","text":"may convey work based Program, modifications produce Program, form source code terms section 4, provided also meet conditions: ) work must carry prominent notices stating modified , giving relevant date. b) work must carry prominent notices stating released License conditions added section 7. requirement modifies requirement section 4 “keep intact notices”. c) must license entire work, whole, License anyone comes possession copy. License therefore apply, along applicable section 7 additional terms, whole work, parts, regardless packaged. License gives permission license work way, invalidate permission separately received . 
d) work interactive user interfaces, must display Appropriate Legal Notices; however, Program interactive interfaces display Appropriate Legal Notices, work need make . compilation covered work separate independent works, nature extensions covered work, combined form larger program, volume storage distribution medium, called “aggregate” compilation resulting copyright used limit access legal rights compilation’s users beyond individual works permit. Inclusion covered work aggregate cause License apply parts aggregate.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_6-conveying-non-source-forms","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"6. Conveying Non-Source Forms","title":"GNU General Public License","text":"may convey covered work object code form terms sections 4 5, provided also convey machine-readable Corresponding Source terms License, one ways: ) Convey object code , embodied , physical product (including physical distribution medium), accompanied Corresponding Source fixed durable physical medium customarily used software interchange. b) Convey object code , embodied , physical product (including physical distribution medium), accompanied written offer, valid least three years valid long offer spare parts customer support product model, give anyone possesses object code either (1) copy Corresponding Source software product covered License, durable physical medium customarily used software interchange, price reasonable cost physically performing conveying source, (2) access copy Corresponding Source network server charge. c) Convey individual copies object code copy written offer provide Corresponding Source. alternative allowed occasionally noncommercially, received object code offer, accord subsection 6b. d) Convey object code offering access designated place (gratis charge), offer equivalent access Corresponding Source way place charge. need require recipients copy Corresponding Source along object code. 
place copy object code network server, Corresponding Source may different server (operated third party) supports equivalent copying facilities, provided maintain clear directions next object code saying find Corresponding Source. Regardless server hosts Corresponding Source, remain obligated ensure available long needed satisfy requirements. e) Convey object code using peer--peer transmission, provided inform peers object code Corresponding Source work offered general public charge subsection 6d. separable portion object code, whose source code excluded Corresponding Source System Library, need included conveying object code work. “User Product” either (1) “consumer product”, means tangible personal property normally used personal, family, household purposes, (2) anything designed sold incorporation dwelling. determining whether product consumer product, doubtful cases shall resolved favor coverage. particular product received particular user, “normally used” refers typical common use class product, regardless status particular user way particular user actually uses, expects expected use, product. product consumer product regardless whether product substantial commercial, industrial non-consumer uses, unless uses represent significant mode use product. “Installation Information” User Product means methods, procedures, authorization keys, information required install execute modified versions covered work User Product modified version Corresponding Source. information must suffice ensure continued functioning modified object code case prevented interfered solely modification made. convey object code work section , , specifically use , User Product, conveying occurs part transaction right possession use User Product transferred recipient perpetuity fixed term (regardless transaction characterized), Corresponding Source conveyed section must accompanied Installation Information. 
requirement apply neither third party retains ability install modified object code User Product (example, work installed ROM). requirement provide Installation Information include requirement continue provide support service, warranty, updates work modified installed recipient, User Product modified installed. Access network may denied modification materially adversely affects operation network violates rules protocols communication across network. Corresponding Source conveyed, Installation Information provided, accord section must format publicly documented (implementation available public source code form), must require special password key unpacking, reading copying.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_7-additional-terms","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"7. Additional Terms","title":"GNU General Public License","text":"“Additional permissions” terms supplement terms License making exceptions one conditions. Additional permissions applicable entire Program shall treated though included License, extent valid applicable law. additional permissions apply part Program, part may used separately permissions, entire Program remains governed License without regard additional permissions. convey copy covered work, may option remove additional permissions copy, part . (Additional permissions may written require removal certain cases modify work.) may place additional permissions material, added covered work, can give appropriate copyright permission. 
Notwithstanding provision License, material add covered work, may (authorized copyright holders material) supplement terms License terms: ) Disclaiming warranty limiting liability differently terms sections 15 16 License; b) Requiring preservation specified reasonable legal notices author attributions material Appropriate Legal Notices displayed works containing ; c) Prohibiting misrepresentation origin material, requiring modified versions material marked reasonable ways different original version; d) Limiting use publicity purposes names licensors authors material; e) Declining grant rights trademark law use trade names, trademarks, service marks; f) Requiring indemnification licensors authors material anyone conveys material (modified versions ) contractual assumptions liability recipient, liability contractual assumptions directly impose licensors authors. non-permissive additional terms considered “restrictions” within meaning section 10. Program received , part , contains notice stating governed License along term restriction, may remove term. license document contains restriction permits relicensing conveying License, may add covered work material governed terms license document, provided restriction survive relicensing conveying. add terms covered work accord section, must place, relevant source files, statement additional terms apply files, notice indicating find applicable terms. Additional terms, permissive non-permissive, may stated form separately written license, stated exceptions; requirements apply either way.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_8-termination","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"8. Termination","title":"GNU General Public License","text":"may propagate modify covered work except expressly provided License. attempt otherwise propagate modify void, automatically terminate rights License (including patent licenses granted third paragraph section 11). 
However, cease violation License, license particular copyright holder reinstated () provisionally, unless copyright holder explicitly finally terminates license, (b) permanently, copyright holder fails notify violation reasonable means prior 60 days cessation. Moreover, license particular copyright holder reinstated permanently copyright holder notifies violation reasonable means, first time received notice violation License (work) copyright holder, cure violation prior 30 days receipt notice. Termination rights section terminate licenses parties received copies rights License. rights terminated permanently reinstated, qualify receive new licenses material section 10.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_9-acceptance-not-required-for-having-copies","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"9. Acceptance Not Required for Having Copies","title":"GNU General Public License","text":"required accept License order receive run copy Program. Ancillary propagation covered work occurring solely consequence using peer--peer transmission receive copy likewise require acceptance. However, nothing License grants permission propagate modify covered work. actions infringe copyright accept License. Therefore, modifying propagating covered work, indicate acceptance License .","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_10-automatic-licensing-of-downstream-recipients","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"10. Automatic Licensing of Downstream Recipients","title":"GNU General Public License","text":"time convey covered work, recipient automatically receives license original licensors, run, modify propagate work, subject License. responsible enforcing compliance third parties License. “entity transaction” transaction transferring control organization, substantially assets one, subdividing organization, merging organizations. 
propagation covered work results entity transaction, party transaction receives copy work also receives whatever licenses work party’s predecessor interest give previous paragraph, plus right possession Corresponding Source work predecessor interest, predecessor can get reasonable efforts. may impose restrictions exercise rights granted affirmed License. example, may impose license fee, royalty, charge exercise rights granted License, may initiate litigation (including cross-claim counterclaim lawsuit) alleging patent claim infringed making, using, selling, offering sale, importing Program portion .","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_11-patents","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"11. Patents","title":"GNU General Public License","text":"“contributor” copyright holder authorizes use License Program work Program based. work thus licensed called contributor’s “contributor version”. contributor’s “essential patent claims” patent claims owned controlled contributor, whether already acquired hereafter acquired, infringed manner, permitted License, making, using, selling contributor version, include claims infringed consequence modification contributor version. purposes definition, “control” includes right grant patent sublicenses manner consistent requirements License. contributor grants non-exclusive, worldwide, royalty-free patent license contributor’s essential patent claims, make, use, sell, offer sale, import otherwise run, modify propagate contents contributor version. following three paragraphs, “patent license” express agreement commitment, however denominated, enforce patent (express permission practice patent covenant sue patent infringement). “grant” patent license party means make agreement commitment enforce patent party. 
convey covered work, knowingly relying patent license, Corresponding Source work available anyone copy, free charge terms License, publicly available network server readily accessible means, must either (1) cause Corresponding Source available, (2) arrange deprive benefit patent license particular work, (3) arrange, manner consistent requirements License, extend patent license downstream recipients. “Knowingly relying” means actual knowledge , patent license, conveying covered work country, recipient’s use covered work country, infringe one identifiable patents country reason believe valid. , pursuant connection single transaction arrangement, convey, propagate procuring conveyance , covered work, grant patent license parties receiving covered work authorizing use, propagate, modify convey specific copy covered work, patent license grant automatically extended recipients covered work works based . patent license “discriminatory” include within scope coverage, prohibits exercise , conditioned non-exercise one rights specifically granted License. may convey covered work party arrangement third party business distributing software, make payment third party based extent activity conveying work, third party grants, parties receive covered work , discriminatory patent license () connection copies covered work conveyed (copies made copies), (b) primarily connection specific products compilations contain covered work, unless entered arrangement, patent license granted, prior 28 March 2007. Nothing License shall construed excluding limiting implied license defenses infringement may otherwise available applicable patent law.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_12-no-surrender-of-others-freedom","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"12. 
No Surrender of Others’ Freedom","title":"GNU General Public License","text":"conditions imposed (whether court order, agreement otherwise) contradict conditions License, excuse conditions License. convey covered work satisfy simultaneously obligations License pertinent obligations, consequence may convey . example, agree terms obligate collect royalty conveying convey Program, way satisfy terms License refrain entirely conveying Program.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_13-use-with-the-gnu-affero-general-public-license","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"13. Use with the GNU Affero General Public License","title":"GNU General Public License","text":"Notwithstanding provision License, permission link combine covered work work licensed version 3 GNU Affero General Public License single combined work, convey resulting work. terms License continue apply part covered work, special requirements GNU Affero General Public License, section 13, concerning interaction network apply combination .","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_14-revised-versions-of-this-license","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"14. Revised Versions of this License","title":"GNU General Public License","text":"Free Software Foundation may publish revised /new versions GNU General Public License time time. new versions similar spirit present version, may differ detail address new problems concerns. version given distinguishing version number. Program specifies certain numbered version GNU General Public License “later version” applies , option following terms conditions either numbered version later version published Free Software Foundation. Program specify version number GNU General Public License, may choose version ever published Free Software Foundation. 
Program specifies proxy can decide future versions GNU General Public License can used, proxy’s public statement acceptance version permanently authorizes choose version Program. Later license versions may give additional different permissions. However, additional obligations imposed author copyright holder result choosing follow later version.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_15-disclaimer-of-warranty","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"15. Disclaimer of Warranty","title":"GNU General Public License","text":"WARRANTY PROGRAM, EXTENT PERMITTED APPLICABLE LAW. EXCEPT OTHERWISE STATED WRITING COPYRIGHT HOLDERS /PARTIES PROVIDE PROGRAM “” WITHOUT WARRANTY KIND, EITHER EXPRESSED IMPLIED, INCLUDING, LIMITED , IMPLIED WARRANTIES MERCHANTABILITY FITNESS PARTICULAR PURPOSE. ENTIRE RISK QUALITY PERFORMANCE PROGRAM . PROGRAM PROVE DEFECTIVE, ASSUME COST NECESSARY SERVICING, REPAIR CORRECTION.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_16-limitation-of-liability","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"16. Limitation of Liability","title":"GNU General Public License","text":"EVENT UNLESS REQUIRED APPLICABLE LAW AGREED WRITING COPYRIGHT HOLDER, PARTY MODIFIES /CONVEYS PROGRAM PERMITTED , LIABLE DAMAGES, INCLUDING GENERAL, SPECIAL, INCIDENTAL CONSEQUENTIAL DAMAGES ARISING USE INABILITY USE PROGRAM (INCLUDING LIMITED LOSS DATA DATA RENDERED INACCURATE LOSSES SUSTAINED THIRD PARTIES FAILURE PROGRAM OPERATE PROGRAMS), EVEN HOLDER PARTY ADVISED POSSIBILITY DAMAGES.","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"id_17-interpretation-of-sections-15-and-16","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"17. 
Interpretation of Sections 15 and 16","title":"GNU General Public License","text":"disclaimer warranty limitation liability provided given local legal effect according terms, reviewing courts shall apply local law closely approximates absolute waiver civil liability connection Program, unless warranty assumption liability accompanies copy Program return fee. END TERMS CONDITIONS","code":""},{"path":"https://ummel.github.io/fusionModel/LICENSE.html","id":"how-to-apply-these-terms-to-your-new-programs","dir":"","previous_headings":"","what":"How to Apply These Terms to Your New Programs","title":"GNU General Public License","text":"develop new program, want greatest possible use public, best way achieve make free software everyone can redistribute change terms. , attach following notices program. safest attach start source file effectively state exclusion warranty; file least “copyright” line pointer full notice found. Also add information contact electronic paper mail. program terminal interaction, make output short notice like starts interactive mode: hypothetical commands show w show c show appropriate parts General Public License. course, program’s commands might different; GUI interface, use “box”. also get employer (work programmer) school, , sign “copyright disclaimer” program, necessary. information , apply follow GNU GPL, see . GNU General Public License permit incorporating program proprietary programs. program subroutine library, may consider useful permit linking proprietary applications library. want , use GNU Lesser General Public License instead License. first, please read .","code":" Copyright (C) This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. 
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . Copyright (C) This program comes with ABSOLUTELY NO WARRANTY; for details type 'show w'. This is free software, and you are welcome to redistribute it under certain conditions; type 'show c' for details."},{"path":"https://ummel.github.io/fusionModel/authors.html","id":null,"dir":"","previous_headings":"","what":"Authors","title":"Authors and Citation","text":"Kevin Ummel. Author, maintainer. Karthik Akkiraju. Contributor. Miguel Poblete Cazenave. Contributor.","code":""},{"path":"https://ummel.github.io/fusionModel/authors.html","id":"citation","dir":"","previous_headings":"","what":"Citation","title":"Authors and Citation","text":"Ummel K (2024). fusionModel: Data fusion analysis synthetic data R. R package version 2.3.0, https://ummel.github.io/fusionModel/.","code":"@Manual{, title = {fusionModel: Data fusion and analysis of synthetic data in R}, author = {Kevin Ummel}, year = {2024}, note = {R package version 2.3.0}, url = {https://ummel.github.io/fusionModel/}, }"},{"path":"https://ummel.github.io/fusionModel/index.html","id":"fusionmodel","dir":"","previous_headings":"","what":"Data fusion and analysis of synthetic data in R","title":"Data fusion and analysis of synthetic data in R","text":"Kevin Ummel (ummel@berkeley.edu) Overview Motivation Methodology Installation Simple fusion Advanced fusion Analyzing fused data Validating fusion models","code":""},{"path":"https://ummel.github.io/fusionModel/index.html","id":"overview","dir":"","previous_headings":"","what":"Overview","title":"Data fusion and analysis of synthetic data in R","text":"fusionModel enables variables unique “donor” dataset statistically simulated (.e. 
fused ) “recipient” dataset. Variables common donor recipient used model simulate fused variables. package provides simple efficient interface general data fusion R, leveraging state---art machine learning algorithms Microsoft’s LightGBM framework. also provides tools analyzing synthetic/simulated data, calculating uncertainty, validating fusion output. fusionModel developed allow statistical integration microdata disparate social surveys. data fusion workhorse underpinning larger fusionACS data platform development Socio-Spatial Climate Collaborative. context, fusionModel used fuse variables range social surveys onto microdata American Community Survey, allowing analysis spatial resolution otherwise impossible.","code":""},{"path":"https://ummel.github.io/fusionModel/index.html","id":"motivation","dir":"","previous_headings":"","what":"Motivation","title":"Data fusion and analysis of synthetic data in R","text":"desire “fuse” otherwise integrate independent datasets long history, dating least early 1970’s (Ruggles Ruggles 1974; Alter 1974). Social scientists long recognized large amounts unconnected data “” – usually concerning characteristics households individuals (.e. microdata) – , ideally, like integrate analyze whole. aim falls general heading “Statistical Data Integration” (SDI) (Lewaa et al. 2021). prominent examples data fusion involved administrative record linkage. consists exact matching probabilistic linking independent datasets, using observable information like social security numbers, names, birth dates individuals. Record linkage gold standard can yield incredibly important insights high levels statistical confidence, evidenced pioneering work Raj Chetty colleagues. However, record linkage rarely feasible kinds microdata researchers use day--day (nevermind difficulty accessing administrative data). explosion online tracking social network data undoubtedly offer new lines analysis, time , least, social survey microdata remain indispensable. 
challenge promise recognized 50 years ago Nancy Richard Ruggles remains true today: Unfortunately, single microdata set contains different kinds information required problems economist wishes analyze. Different microdata sets contain different kinds information…great deal information collected sample basis. two samples involved probability individual appearing may small, exact matching impossible. methods combining types information contained two different samples one microdata set required. (Ruggles Ruggles 1974; 353-354) Practitioners regularly impute otherwise predict variable two one dataset another. Piecemeal, ad hoc data fusion common necessity quantitative research. Proper data fusion, hand, seeks systematically combine “two different samples one microdata set”. size nature samples involved intended analyses strongly influence choice data integration technique structure output. led relevant literature diverse convoluted, practitioners take different data “setups” objectives. context fusionACS, interested following problem: microdata two independent surveys, B, sample underlying population time period (e.g. occupied U.S. households nationwide 2018). specify “recipient” dataset B “donor”. goal generate new dataset, C, original survey responses plus realistic representation respondent might answered questionnaire survey B. , identify set common/shared variables X surveys solicit. attempt fuse set variables unique B – call Z, “fusion variables” – onto original microdata , conditional X.","code":""},{"path":"https://ummel.github.io/fusionModel/index.html","id":"methodology","dir":"","previous_headings":"","what":"Methodology","title":"Data fusion and analysis of synthetic data in R","text":"fusion strategy implemented fusionModel package borrows expands upon ideas statistical matching (D’Orazio et al. 2006), imputation (Little Rubin 2019), data synthesis (Drechsler 2011) literatures create flexible data fusion tool. 
employs variable-k, conditional expectation matching leverages high-performance gradient boosting algorithms. package accommodates fusion many variables, individually blocks, efficient computation recipient large relative donor. Specifically, goal create data fusion tool meets following requirements: Accommodate donor recipient datasets divergent sample sizes Handle continuous, categorical, semi-continuous (zero-inflated) variable types Ensure realistic values fused variables Scale efficiently larger datasets Fuse variables “one--one” “blocks” Employ data modeling approach : Makes distributional assumptions (.e. non-parametric) Automatically detects non-linear interaction effects Automatically selects predictor variables potentially large set Ability prevent overfitting (e.g. cross-validation) Complete methodological details available fusionACS Guidebook (INSERT LINK).","code":""},{"path":"https://ummel.github.io/fusionModel/index.html","id":"installation","dir":"","previous_headings":"","what":"Installation","title":"Data fusion and analysis of synthetic data in R","text":"","code":"devtools::install_github(\"ummel/fusionModel\") library(fusionModel) fusionModel v2.2.1 | https://github.com/ummel/fusionModel"},{"path":"https://ummel.github.io/fusionModel/index.html","id":"simple-fusion","dir":"","previous_headings":"","what":"Simple fusion","title":"Data fusion and analysis of synthetic data in R","text":"package includes example microdata 2015 Residential Energy Consumption Survey (see ?recs details). real-world use cases, donor recipient data typically independent vary sample size. illustrative purposes, randomly split recs microdata separate “donor” “recipient” datasets equal number observations. recipient dataset contains 13 variables shared donor. shared “predictor” variables provide statistical link two datasets. fusionModel exploits information shared variables. 5 “fusion variables” unique donor. variables fused recipient. 
includes mix continuous categorical (factor) variables. create fusion model using train() function. minimal usage shown . See ?train additional function arguments options. default, results “.fsn” (fusion) object saved “fusion_model.fsn” current working directory. fuse variables recipient, simply pass recipient data path .fsn model fuse() function. variable specified fusion.vars fused order provided. default, fuse() generates single implicate (version) synthetic outcomes. Later, ’ll work multiple implicates perform proper analysis uncertainty estimation. Let’s look recipient dataset’s fused/simulated variables. Note results look different, call fuse() generates unique, probabilistic set outcomes. can quick sanity checks compare distribution fusion variables donor sim. , least, confirms fusion output obviously wrong. Later, ’ll perform formal internal validation exercise using multiple implicates. can look kernel density plots non-zero values continuous variables see univariate distributions donor generally similar sim.","code":"# Rows to use for donor dataset d <- seq(from = 1, to = nrow(recs), by = 2) # Create donor and recipient datasets donor <- recs[d, c(2:16, 20:22)] recipient <- recs[-d, 2:14] # Specify fusion and shared/common predictor variables predictor.vars <- names(recipient) fusion.vars <- setdiff(names(donor), predictor.vars) predictor.vars [1] \"income\" \"age\" \"race\" \"education\" \"employment\" [6] \"hh_size\" \"division\" \"urban_rural\" \"climate\" \"renter\" [11] \"home_type\" \"year_built\" \"heat_type\" # The variables to be fused sapply(donor[fusion.vars], class) $insulation [1] \"ordered\" \"factor\" $aircon [1] \"factor\" $square_feet [1] \"integer\" $electricity [1] \"numeric\" $natural_gas [1] \"numeric\" # Train a fusion model fsn.model <- train(data = donor, y = fusion.vars, x = predictor.vars) 5 fusion variables 13 initial predictor variables 2843 observations Using all available predictors for each fusion variable Training step 1 of 
5: insulation Training step 2 of 5: aircon Training step 3 of 5: square_feet -- R-squared of cluster means: 0.967 -- Number of neighbors in each cluster: Min. 1st Qu. Median Mean 3rd Qu. Max. 10.0 49.0 141.0 200.6 357.0 498.0 Training step 4 of 5: electricity -- R-squared of cluster means: 0.966 -- Number of neighbors in each cluster: Min. 1st Qu. Median Mean 3rd Qu. Max. 10.00 51.75 115.00 176.02 281.25 498.00 Training step 5 of 5: natural_gas -- R-squared of cluster means: 0.968 -- Number of neighbors in each cluster: Min. 1st Qu. Median Mean 3rd Qu. Max. 10.0 54.0 129.0 170.1 247.0 499.0 Fusion model saved to: /home/kevin/Documents/Projects/fusionModel/fusion_model.fsn Total processing time: 8.53 secs # Fuse 'fusion.vars' to the recipient sim <- fuse(data = recipient, fsn = fsn.model) 5 fusion variables 13 initial predictor variables 2843 observations Generating 1 implicate Fusion step 1 of 5: insulation -- Predicting LightGBM models -- Simulating fused values Fusion step 2 of 5: aircon -- Predicting LightGBM models -- Simulating fused values Fusion step 3 of 5: square_feet -- Predicting LightGBM models -- Simulating fused values Fusion step 4 of 5: electricity -- Predicting LightGBM models -- Simulating fused values Fusion step 5 of 5: natural_gas -- Predicting LightGBM models -- Simulating fused values Total processing time: 0.8 secs head(sim) M insulation aircon square_feet 1: 1 Well insulated Central air conditioning system 1956 2: 1 Well insulated Central air conditioning system 1621 3: 1 Adequately insulated No air conditioning 558 4: 1 Adequately insulated Central air conditioning system 3072 5: 1 Adequately insulated Central air conditioning system 1010 6: 1 Adequately insulated No air conditioning 1910 electricity natural_gas 1: 17000 0.0 2: 6070 146.2 3: 1334 0.0 4: 9620 797.0 5: 37500 0.0 6: 11240 223.0 sim <- data.frame(sim) # Compare means of the continuous variables cbind(donor = colMeans(donor[fusion.vars[3:5]]), sim = 
colMeans(sim[fusion.vars[3:5]])) donor sim square_feet 2070.784 2012.7306 electricity 10994.517 10675.6508 natural_gas 338.154 323.5523 # Compare frequencies of categorical variable classes cbind(donor = table(donor$insulation), sim = table(sim$insulation)) donor sim Not insulated 40 40 Poorly insulated 459 443 Adequately insulated 1401 1419 Well insulated 943 941 cbind(donor = table(donor$aircon), sim = table(sim$aircon)) donor sim Central air conditioning system 1788 1818 Individual window/wall or portable units 545 545 Both a central system and individual units 125 119 No air conditioning 385 361"},{"path":"https://ummel.github.io/fusionModel/index.html","id":"advanced-fusion","dir":"","previous_headings":"","what":"Advanced fusion","title":"Data fusion and analysis of synthetic data in R","text":"call train(), specify set hyperparameters search training LightGBM gradient boosting model (see ?train details). hyperparameters can used tune underlying GBM models better cross-validated performance. also set nfolds = 10 (default 5) indicate number cross-validation folds use. Since requires additional computation, cores argument used enable parallel processing. generally want create multiple versions simulated fusion variables – called implicates – order reduce bias point estimates calculate associated uncertainty. can using M argument within fuse(). generate 10 implicates; .e. 10 unique, probabilistic representations recipient records might look like respect fusion variables. 
Note implicate sim10 identified “M” variable/column.","code":"# Train a fusion model with variable blocks fsn.model <- train(data = donor, y = fusion.vars, x = predictor.vars, nfolds = 10, hyper = list(boosting = c(\"gbdt\", \"goss\"), num_leaves = c(10, 30), feature_fraction = c(0.7, 0.9)), cores = 2) 5 fusion variables 13 initial predictor variables 2843 observations Using all available predictors for each fusion variable Using OpenMP multithreading within LightGBM (2 cores) Training step 1 of 5: insulation Training step 2 of 5: aircon Training step 3 of 5: square_feet -- R-squared of cluster means: 0.971 -- Number of neighbors in each cluster: Min. 1st Qu. Median Mean 3rd Qu. Max. 10.0 61.0 142.0 198.4 330.0 499.0 Training step 4 of 5: electricity -- R-squared of cluster means: 0.971 -- Number of neighbors in each cluster: Min. 1st Qu. Median Mean 3rd Qu. Max. 10.0 61.0 179.0 220.7 388.0 499.0 Training step 5 of 5: natural_gas -- R-squared of cluster means: 0.959 -- Number of neighbors in each cluster: Min. 1st Qu. Median Mean 3rd Qu. Max. 
10.0 70.0 193.0 217.6 340.0 499.0 Fusion model saved to: /home/kevin/Documents/Projects/fusionModel/fusion_model.fsn Total processing time: 44.9 secs # Fuse multiple implicates to the recipient sim10 <- fuse(data = recipient, fsn = fsn.model, M = 10) 5 fusion variables 13 initial predictor variables 2843 observations Generating 10 implicates Fusion step 1 of 5: insulation -- Predicting LightGBM models -- Simulating fused values Fusion step 2 of 5: aircon -- Predicting LightGBM models -- Simulating fused values Fusion step 3 of 5: square_feet -- Predicting LightGBM models -- Simulating fused values Fusion step 4 of 5: electricity -- Predicting LightGBM models -- Simulating fused values Fusion step 5 of 5: natural_gas -- Predicting LightGBM models -- Simulating fused values Total processing time: 2.38 secs head(sim10) M insulation aircon square_feet 1: 1 Well insulated Central air conditioning system 1728 2: 1 Adequately insulated Central air conditioning system 1492 3: 1 Well insulated No air conditioning 636 4: 1 Well insulated Central air conditioning system 2948 5: 1 Well insulated Individual window/wall or portable units 1140 6: 1 Adequately insulated No air conditioning 1579 electricity natural_gas 1: 13700 0.0 2: 15200 0.0 3: 1880 89.7 4: 8670 651.0 5: 14460 0.0 6: 14220 0.0 table(sim10$M) 1 2 3 4 5 6 7 8 9 10 2843 2843 2843 2843 2843 2843 2843 2843 2843 2843"},{"path":"https://ummel.github.io/fusionModel/index.html","id":"analyzing-fused-data","dir":"","previous_headings":"","what":"Analyzing fused data","title":"Data fusion and analysis of synthetic data in R","text":"fused values inherently probabilistic, reflecting uncertainty underlying statistical models. Multiple implicates needed calculate unbiased point estimates associated uncertainty particular analysis data. general, implicates preferable requires computation. 
Since proper analysis multiple implicates can rather cumbersome – coding mathematical standpoint – analyze() function provides convenient way calculate point estimates associated uncertainty common analyses. Potential analyses currently include variable means, proportions, sums, counts, medians, (optionally) calculated population subgroups. example, calculate mean value “electricity” variable across observations recipient dataset, following. response variable categorical, analyze() automatically returns proportions associated factor level. want perform analysis across subsets recipient population – example, calculate mean value “electricity” household “income” – can use static arguments. see mean electricity consumption increases household income. also possible multiple kinds analyses single call analyze(). example, following call calculates mean value “natural_gas” “square_feet”, median value “square_feet”, sum “electricity” (.e. total consumption) “insulation” (.e. total count level). estimates calculated population subgroup defined intersection “race” “urban_rural” status. can (example) isolate results white households rural areas. Notice mean estimate “square_feet” exceeds median, reflecting skewed distribution. complicated analyses can performed using custom fun argument analyze(). 
See Examples section ?analyze.","code":"analyze(x = list(mean = \"electricity\"), implicates = sim10) Using 10 implicates Assuming uniform sample weights Total processing time: 0.0344 secs N y level type est moe 1: 2843 electricity NA mean 10808.86 232.2422 analyze(x = list(mean = \"aircon\"), implicates = sim10) Using 10 implicates Assuming uniform sample weights Total processing time: 0.0903 secs N y level type est 1: 2843 aircon Central air conditioning system proportion 0.62866690 2: 2843 aircon Individual window/wall or portable units proportion 0.19212100 3: 2843 aircon Both a central system and individual units proportion 0.04400281 4: 2843 aircon No air conditioning proportion 0.13520929 moe 1: 0.017603881 2: 0.015537772 3: 0.007524075 4: 0.014413665 analyze(x = list(mean = \"electricity\"), implicates = sim10, static = recipient, by = \"income\") Using 10 implicates Assuming uniform sample weights Total processing time: 0.0229 secs income N y level type est moe 1: Less than $20,000 471 electricity NA mean 8884.312 478.0797 2: $20,000 - $39,999 645 electricity NA mean 9719.018 456.4553 3: $40,000 - $59,999 464 electricity NA mean 10537.228 538.0130 4: $60,000 to $79,999 372 electricity NA mean 11314.790 572.3587 5: $80,000 to $99,999 248 electricity NA mean 11550.401 693.7652 6: $100,000 to $119,999 222 electricity NA mean 12138.498 850.3994 7: $120,000 to $139,999 119 electricity NA mean 12490.518 1113.5339 8: $140,000 or more 302 electricity NA mean 13683.182 728.6027 result <- analyze(x = list(mean = c(\"natural_gas\", \"square_feet\"), median = \"square_feet\", sum = c(\"electricity\", \"insulation\")), implicates = sim10, static = recipient, by = c(\"race\", \"urban_rural\")) Using 10 implicates Assuming uniform sample weights Total processing time: 1.63 secs subset(result, race == \"White\" & urban_rural == \"Rural\") race urban_rural N y level type est 1: White Rural 503 electricity sum 6906696.6100 2: White Rural 503 insulation Not insulated count 
5.6000 3: White Rural 503 insulation Poorly insulated count 68.4000 4: White Rural 503 insulation Adequately insulated count 209.7000 5: White Rural 503 insulation Well insulated count 219.3000 6: White Rural 503 natural_gas mean 157.5723 7: White Rural 503 square_feet mean 2387.3509 8: White Rural 503 square_feet median 2159.4000 moe 1: 3.115890e+05 2: 5.653837e+00 3: 2.178315e+01 4: 3.144372e+01 5: 3.613723e+01 6: 2.579220e+01 7: 1.164896e+02 8: 1.573095e+02"},{"path":"https://ummel.github.io/fusionModel/index.html","id":"validating-fusion-models","dir":"","previous_headings":"","what":"Validating fusion models","title":"Data fusion and analysis of synthetic data in R","text":"validate() function provides convenient way perform internal validation tests synthetic variables fused back onto original donor data. allows us assess quality underlying fusion model; analogous assessing model skill comparing predictions observed training data. validate() compares analytical results derived using multiple-implicate fusion output derived using original donor microdata. performing analyses population subsets varying size, validate() estimates synthetic variables perform analyses varying difficulty/complexity. computes fusion variable means proportions subsets full sample – separately observed fused data – compares results. First, fuse multiple implicates fusion.vars using original donor data – recipient data, previously. Next, pass sim results validate(). argument subset_vars specifies want validation exercise compare observed (donor) simulated point estimates across population subsets defined “income”, “age”, “race”, “education”. See ?validate details. validate() output includes ggplot2 graphics helpfully summarize validation results. example, plot shows observed simulated point estimates compare, using median absolute percent error performance metric. see synthetic data good job reproducing point estimates fusion variables population subset question reasonably large. 
smaller subsets – .e. difficult analyses due small sample size – “square_feet”, “natural_gas”, “electricity” remain well modeled, error increases rapidly “aircon” “insulation”. information useful understanding kind reliability can expect particular variables types analyses, given underlying fusion model data. Happy fusing!","code":"sim <- fuse(data = donor, fsn = fsn.model, M = 40) 5 fusion variables 13 initial predictor variables 2843 observations Generating 40 implicates Fusion step 1 of 5: insulation -- Predicting LightGBM models -- Simulating fused values Fusion step 2 of 5: aircon -- Predicting LightGBM models -- Simulating fused values Fusion step 3 of 5: square_feet -- Predicting LightGBM models -- Simulating fused values Fusion step 4 of 5: electricity -- Predicting LightGBM models -- Simulating fused values Fusion step 5 of 5: natural_gas -- Predicting LightGBM models -- Simulating fused values Total processing time: 7.8 secs valid <- validate(observed = donor, implicates = sim, subset_vars = c(\"income\", \"age\", \"race\", \"education\")) Assuming uniform sample weights One-hot encoding categorical fusion variables Correlation between observed and fused values: Min. 1st Qu. Median Mean 3rd Qu. Max. 0.075 0.109 0.255 0.303 0.431 0.774 Processing validation analyses for 5 fusion variables Performed 1430 analyses across 130 subsets Smoothing validation metrics Average smoothed performance metrics across subset range: y est vad moe 1 aircon 0.0323 0.689 1.37 2 electricity 0.0225 0.419 1.06 3 insulation 0.0390 0.492 1.41 4 natural_gas 0.0277 0.500 1.06 5 square_feet 0.0146 0.788 1.18 Creating ggplot2 graphics Total processing time: 3.14 secs valid$plots$est"},{"path":"https://ummel.github.io/fusionModel/reference/analyze.html","id":null,"dir":"Reference","previous_headings":"","what":"Analyze fusion output — analyze","title":"Analyze fusion output — analyze","text":"Calculation point estimates associated margin error analyses using fused/synthetic microdata. 
Can calculate means, proportions, sums, counts, medians, optionally across population subgroups.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Analyze fusion output — analyze","text":"","code":"analyze( x, implicates, static = NULL, weight = NULL, rep_weights = NULL, by = NULL, fun = NULL, var_scale = 4, cores = 1 )"},{"path":"https://ummel.github.io/fusionModel/reference/analyze.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Analyze fusion output — analyze","text":"x List. Named list specifying desired analysis type(s) associated target variable(s). Example: x = list(mean = c(\"v1\", \"v2\"), median = \"v3\") translates : \"Return mean value variables v1 v2 median v3\". Supported analysis types include mean, sum, median. Mean sum automatically return proportions counts, respectively, target variable factor. Target variables must implicates, static, data.frame returned custom fun. implicates Data frame. Implicates synthetic (fused) variables. Typically generated fuse. implicates row-stacked identified integer column \"M\". static Data frame. Optional static (non-synthetic) variables vary across implicates. Note nrow(static) = nrow(implicates) / max(implicates$M) row-ordering assumed consistent static implicates. weight Character. Name observation weights column static. NULL (default), uniform weights assumed. rep_weights Character. Optional vector replicate weight columns static. provided, returned margin errors reflect additional variance due uncertainty sample weights. Character. Optional column name(s) implicates static (typically factors) collectively define set population subgroups analysis executed. NULL, analysis done whole sample. fun Function. Optional function applied input data prior executing analyses. Can used non-conventional/custom analyses. var_scale Scalar. 
Factor scale unadjusted replicate weight variance. determined survey design. default (var_scale = 4) appropriate ACS RECS. cores Integer. Number cores used. applicable Unix systems.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Analyze fusion output — analyze","text":"data.table reporting analysis results, possibly across subgroups defined . returned quantities include: N Number observations used analysis. y Target variable. level Levels factor target variables. type Type estimate returned: mean, proportion, sum, count, median. est Point estimate. moe Margin error associated 90% confidence interval.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Analyze fusion output — analyze","text":"minimum, user must supply synthetic implicates (typically generated fuse). Inputs checked consistent dimensions. implicates contains single implicate rep_weights = NULL, \"typical\" standard error returned warning make sure user aware situation. Estimates standard errors requested analysis calculated separately implicate. final point estimate mean estimate across implicates. final standard error pooled SE across implicates, calculated using Rubin's pooling rules (1987) finite population adjustment degrees freedom (Barnard Rubin 1999). replicate weights provided, standard errors implicate calculated via variance estimates across replicates. Calculations leverage data.table operations speed memory efficiency. within-implicate variance calculated around point estimate (rather around mean replicates). equivalent mse = TRUE svrepdesign. seems appropriate method surveys. replicate weights provided, standard errors implicate calculated using variance within implicate. 
means, ratio variance approximation Cochran (1977) used, known good approximation bootstrapped SE's weighted means (Gatz Smith 1995). proportions, generalization unweighted SE formula used (see ). regression coefficients, standard error calculated summary.glm.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Analyze fusion output — analyze","text":"Barnard, J., & Rubin, D.B. (1999). Small-sample degrees freedom multiple imputation. Biometrika, 86, 948-955. Cochran, W. G. (1977). Sampling Techniques (3rd Edition). Wiley, New York. Gatz, D.F., Smith, L. (1995). Standard Error Weighted Mean Concentration — . Bootstrapping vs Methods. Atmospheric Environment, vol. 29, . 11, 1185–1193. Rubin, D.B. (1987). Multiple imputation nonresponse surveys. Hoboken, NJ: Wiley.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Analyze fusion output — analyze","text":"","code":"# Build a fusion model using RECS microdata fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars) # Generate 30 implicates of the 'fusion.vars' using original RECS as the recipient sim <- fuse(data = recs, fsn = fsn.path, M = 30) head(sim) #--------- # Multiple types of analyses can be done at once # This calculates estimates using the full sample result <- analyze(x = list(mean = c(\"natural_gas\", \"aircon\"), median = \"electricity\", sum = c(\"electricity\", \"aircon\")), implicates = sim, weight = \"weight\") View(result) #----- # Mean electricity consumption, by climate zone and urban/rural status result1 <- analyze(x = list(mean = \"electricity\"), implicates = sim, static = recs, weight = \"weight\", by = c(\"climate\", \"urban_rural\")) # Same as above but 
including sample weight uncertainty # Note that only the first 30 replicate weights are used internally result2 <- analyze(x = list(mean = \"electricity\"), implicates = sim, static = recs, weight = \"weight\", rep_weights = paste0(\"rep_\", 1:96), by = c(\"climate\", \"urban_rural\")) # Helper function for comparison plots pfun <- function(x, y) {plot(x, y); abline(0, 1, lty = 2)} # Inclusion of replicate weights does not affect estimates, but it does # increase margin of error due to uncertainty in RECS sample weights pfun(result1$est, result2$est) pfun(result1$moe, result2$moe) # Notice that relative uncertainty declines with subset size plot(result1$N, result1$moe / result1$est) #----- # Use a custom function to perform more complex analyses # Custom function should return a data frame with non-standard target variables my_fun <- function(data) { # Manipulate 'data' as desired # All variables in 'implicates' and 'static' are available # Construct electricity consumption per square foot kwh_per_ft2 <- data$electricity / data$square_feet # Binary (T/F) indicator if household uses natural gas use_natural_gas <- data$natural_gas > 0 # Return data.frame of custom variables to be analyzed data.frame(kwh_per_ft2, use_natural_gas) } # Do analysis using variables produced by custom function # Can include non-custom target variables as well result <- analyze(x = list(mean = c(\"kwh_per_ft2\", \"use_natural_gas\", \"electricity\")), implicates = sim, static = recs, weight = \"weight\", fun = my_fun)"},{"path":"https://ummel.github.io/fusionModel/reference/analyze2.html","id":null,"dir":"Reference","previous_headings":"","what":"Analyze fusion output — analyze2","title":"Analyze fusion output — analyze2","text":"Calculation point estimates associated margin error analyses using fused/synthetic microdata replicate weights. Efficiently computes means, proportions, sums, counts, medians, standard deviations, variances, optionally across population subgroups. 
differs analyze requires replicate weights calculates uncertainty using full replicate weight variance (approximation).","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze2.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Analyze fusion output — analyze2","text":"","code":"analyze2( analyses, implicates, static, weight, rep_weights, by = NULL, var_scale = 4, cores = 1 )"},{"path":"https://ummel.github.io/fusionModel/reference/analyze2.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Analyze fusion output — analyze2","text":"analyses List. Specifies desired analyses. See Details Examples. Variables referenced analyses must implicates static. implicates Data frame file path. Implicates synthetic (fused) variables; typically output fuse. implicates row-stacked identified integer column \"M\". file path \".fst\" file, necessary columns read memory. static Data frame file path. Static variables vary across implicates; typically \"recipient\" microdata passed fuse. minimum, static must contain weight rep_weights. file path \".fst\" file, necessary columns read memory. Note nrow(static) = nrow(implicates) / max(implicates$M) row-ordering assumed consistent static implicates. weight Character. Name primary observation weights column static. rep_weights Character. Vector replicate weight columns static. Character. Optional column name(s) implicates static (typically factors) collectively define set population subgroups analysis executed. NULL, analysis done whole sample. var_scale Scalar. Factor scale unadjusted replicate weight variance. determined survey design. default (var_scale = 4) appropriate ACS RECS. cores Integer. 
Number cores used multithreading collapse-package functions.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze2.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Analyze fusion output — analyze2","text":"tibble reporting analysis results, possibly across subgroups defined . returned quantities include: lhs Optional analysis name; \"left hand side\" analysis formula. rhs \"right hand side\" analysis formula. type Type analysis: sum, mean, median, prop(ortion) count. level Factor levels categorical analyses; NA omitted otherwise. est Point estimate; mean estimate across implicates. moe Margin error associated 90% confidence interval. rshare Share MOE attributable replicate weights (opposed variance across implicates).","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze2.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Analyze fusion output — analyze2","text":"final point estimates mean estimates across implicates. final margin error derived pooled standard error across implicates, calculated using Rubin's pooling rules (1987). within-implicate standard error's calculated using replicate weights var_scale. entry analyses list formula format Z ~ F(E), Z optional, user-friendly name analysis, F allowable “outer function”, E “inner expression” containing one microdata variables. example: mysum ~ mean(Var1 + Var2) case, outer function mean(). Allowable outer functions : mean(), sum(), median(), sd(), var(). inner expression contains one variable, first evaluated F() applied result. case, internal variable X = Var1 + Var2 generated across observations, mean(X) computed. inner expression desired, analyses list can use following convenient syntax apply single outer function multiple variables: mean = c(\"Var1\", \"Var2\") inner expression can also utilize function takes variable names arguments returns vector length inputs. 
useful defining complex operations separate function (e.g. microsimulation). example: myfun = function(Var1, Var2) {Var1 + Var2} mysum ~ mean(myfun(Var1, Var2)) use sum() mean() inner expression returns categorical vector automatically results category-wise weighted counts proportions, respectively. example, following analysis fail evaluated literally, since mean() expects numeric input inner expression returns character. interpreted request return weighted proportions categorical outcome. myprop ~ mean(ifelse(Var1 > 10 , 'Yes', '')) analyze2() uses \"fast\" versions allowable outer functions, provided fast-statistical-functions collapse package. functions highly optimized weighted, grouped calculations. addition, outer functions mean(), sum(), median() enjoy use platform-independent multithreading across columns cores > 1. Analyses numerical inner expressions processed using series calls collap unique observation weights. Analyses categorical inner expressions utilize series calls fsum.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze2.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Analyze fusion output — analyze2","text":"Rubin, D.B. (1987). Multiple imputation nonresponse surveys. 
Hoboken, NJ: Wiley.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/analyze2.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Analyze fusion output — analyze2","text":"","code":"# Build a fusion model using RECS microdata fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\", \"insulation\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars) # Generate 30 implicates of the 'fusion.vars' using original RECS as the recipient recipient <- recs[c(predictor.vars, \"weight\", paste0(\"rep_\", 1:96))] sim <- fuse(data = recipient, fsn = fsn.path, M = 30) head(sim) #----- # Example of custom pre-processing function myfun <- function(v1, v2, v3) v1 + v2 + v3 # Various ways to specify analyses... my.analyses <- list( # Return means for 'electricity' and proportions for 'aircon' mean = c(\"electricity\", \"aircon\"), # Identical to mean = \"electricity\"; duplicate analyses automatically removed electricity ~ mean(electricity), # Simple addition in the inner expression mysum ~ sum(electricity + natural_gas), # Standard deviation of electricity sd = \"electricity\", # Unnamed analyses (no left-hand side in formula) ~ var(electricity + natural_gas), ~ mean(insulation), # Proportions ~ sum(insulation), # Counts # Proportions involving manipulation of >1 variable myprop ~ mean(aircon != \"No air conditioning\" & insulation < \"Adequately insulated\"), # Custom inner function mycustom ~ median(myfun(electricity, natural_gas, v3 = 100)) ) # Do the requested analyses, by \"division\" result <- analyze2( analyses = my.analyses, implicates = sim, static = recipient, weight = \"weight\", rep_weights = paste0(\"rep_\", 1:96), by = \"division\" ) head(result) #----- # To calculate a conditional estimate, set unused/ignored observations to NA # All outer functions execute with 'na.rm = TRUE' # Example: mean natural_gas conditional on natural_gas > 0 # 
data.table::fifelse() is much faster than base::ifelse() for large data result <- analyze2( analyses = ~mean(data.table::fifelse(natural_gas > 0, natural_gas, NA_real_)), implicates = sim, static = recipient, weight = \"weight\", rep_weights = paste0(\"rep_\", 1:96), by = \"division\" )"},{"path":"https://ummel.github.io/fusionModel/reference/fuse.html","id":null,"dir":"Reference","previous_headings":"","what":"Fuse variables to a recipient dataset — fuse","title":"Fuse variables to a recipient dataset — fuse","text":"Fuse variables recipient dataset using .fsn model produced train. Output can passed analyze validate.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/fuse.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Fuse variables to a recipient dataset — fuse","text":"","code":"fuse( data, fsn, fsd = NULL, M = 1, retain = NULL, kblock = 10, margin = 2, cores = 1 )"},{"path":"https://ummel.github.io/fusionModel/reference/fuse.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Fuse variables to a recipient dataset — fuse","text":"data Data frame. Recipient dataset. categorical variables factors ordered whenever possible. Data types levels strictly validated predictor variables defined fsn. fsn Character. Path fusion model file (.fsn) generated train. fsd Character. Optional fusion output file created ending .fsd (.e. \"fused data\"). compressed binary file can read using fst package. fsd = NULL (default), fusion results returned data.table. M Integer. Number implicates simulate. retain Character. Names columns data retained output; .e. repeated across implicates. Useful retaining ID weight variables use subsequent analysis fusion output. kblock Integer. Fixed number nearest neighbors use fusing variables block. Must >= 5 <= 30. applicable variables fused (.e. block). margin Numeric. Safety margin used estimating many implicates can processed memory . 
Set higher fuse() experiences memory shortfall. Alternatively, can set negative value manually specify number chunks use. example, margin = -3 splits M implicates three chunks approximately equal size. cores Integer. Number cores used. LightGBM prediction parallel-enabled systems OpenMP available.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/fuse.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Fuse variables to a recipient dataset — fuse","text":"fsd = NULL, data.table number rows equal M * nrow(data). Integer column \"M\" indicates implicate assignment observation. Note ordering recipient observations consistent within implicates, change row order using analyze. fsd specified, path .fsd file results written. Metadata column classes factor levels stored column names. read_fsd used load files saved via fsd argument.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/fuse.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Fuse variables to a recipient dataset — fuse","text":"UPDATE.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/fuse.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Fuse variables to a recipient dataset — fuse","text":"","code":"# Build a fusion model using RECS microdata # Note that \"fusion_model.fsn\" will be written to working directory ?recs fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars) # Generate single implicate of synthetic 'fusion.vars', # using original RECS data as the recipient recipient <- recs[predictor.vars] sim <- fuse(data = recipient, fsn = fsn.path) head(sim) # Calling fuse() again produces different results sim <- fuse(data = recipient, fsn = fsn.path) head(sim) # Generate multiple implicates sim <- fuse(data = recipient, fsn = fsn.path, M 
= 5) head(sim) table(sim$M) # Optionally, write results directly to disk # Note that \"results.fsd\" will be written to working directory sim <- fuse(data = recipient, fsn = fsn.path, M = 5, fsd = \"results.fsd\") sim <- read_fsd(sim) head(sim)"},{"path":"https://ummel.github.io/fusionModel/reference/fusionModel-package.html","id":null,"dir":"Reference","previous_headings":"","what":"fusionModel: Data fusion and analysis of synthetic data in R — fusionModel-package","title":"fusionModel: Data fusion and analysis of synthetic data in R — fusionModel-package","text":"Data fusion analysis synthetic data R.","code":""},{"path":[]},{"path":"https://ummel.github.io/fusionModel/reference/fusionModel-package.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"fusionModel: Data fusion and analysis of synthetic data in R — fusionModel-package","text":"Maintainer: Kevin Ummel ummel@berkeley.edu contributors: Karthik Akkiraju [contributor] Miguel Poblete Cazenave [contributor]","code":""},{"path":"https://ummel.github.io/fusionModel/reference/importance.html","id":null,"dir":"Reference","previous_headings":"","what":"Extract predictor variable importance from a fusion model — importance","title":"Extract predictor variable importance from a fusion model — importance","text":"Returns predictor variable (feature) importance underlying LightGBM models stored fusion model file (.fsn) disk.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/importance.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Extract predictor variable importance from a fusion model — importance","text":"","code":"importance(fsn)"},{"path":"https://ummel.github.io/fusionModel/reference/importance.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Extract predictor variable importance from a fusion model — importance","text":"fsn Character. 
Path fusion model file (.fsn) generated train.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/importance.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Extract predictor variable importance from a fusion model — importance","text":"named list containing detailed summary importance results. summary results useful, return average importance predictor across potentially multiple underlying LightGBM models; .e. zero (\"z\"), mean (\"m\"), quantile (\"q\") models. See Examples suggested plotting results.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/importance.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Extract predictor variable importance from a fusion model — importance","text":"Importance metrics computed via lgb.importance. Three types measures returned; \"gain\" typically preferred measure.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/importance.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Extract predictor variable importance from a fusion model — importance","text":"","code":"# Build a fusion model using RECS microdata # Note that \"fusion_model.fsn\" will be written to working directory ?recs fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars) # Extract predictor variable importance ximp <- importance(fsn.path) # Plot summary results library(ggplot2) ggplot(ximp$summary, aes(x = x, y = gain)) + geom_bar(stat = \"identity\") + facet_grid(~ y) + coord_flip() # View detailed results View(ximp$detailed)"},{"path":"https://ummel.github.io/fusionModel/reference/impute.html","id":null,"dir":"Reference","previous_headings":"","what":"Impute missing data via fusion — impute","title":"Impute missing data via fusion — impute","text":"universal missing data imputation 
tool wraps successive calls train fuse hood. Designed simplicity ease use.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/impute.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Impute missing data via fusion — impute","text":"","code":"impute(data, weight = NULL, ignore = NULL, cores = 1)"},{"path":"https://ummel.github.io/fusionModel/reference/impute.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Impute missing data via fusion — impute","text":"data data frame missing values. weight Optional name observation weights column data. ignore Optional names columns data ignore predictor variables. cores Number physical CPU cores used lightgbm. LightGBM parallel-enabled platforms OpenMP available.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/impute.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Impute missing data via fusion — impute","text":"data frame missing values imputed.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/impute.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Impute missing data via fusion — impute","text":"Variables missing values imputed sequentially, beginning variable fewest missing values. Since LightGBM models accommodate NA values predictor set, available variables used potential predictors (excluding ignore variables). call train, 80% observations randomly selected training remaining 20% used validation set determine appropriate number tree learners. LightGBM model parameters kept sensible default values train. 
Since lightgbm uses OpenMP multithreading, advisable use impute inside forked/parallel process cores > 1.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/impute.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Impute missing data via fusion — impute","text":"","code":"# Create data frame with random NA values ?recs data <- recs[, 2:7] miss <- replicate(ncol(data), runif(nrow(data)) < runif(1, 0.01, 0.3)) data[miss] <- NA colSums(is.na(data)) # Impute the missing values result <- impute(data) anyNA(result)"},{"path":"https://ummel.github.io/fusionModel/reference/monotonic.html","id":null,"dir":"Reference","previous_headings":"","what":"Ensure a monotonic relationship between two variables — monotonic","title":"Ensure a monotonic relationship between two variables — monotonic","text":"monotonic() returns modified values input vector y smoothed, monotonic, consistent across values input x. designed used post-fusion one wants ensure plausible relationship consumption (x) expenditure (y), assumption consumers face identical, monotonic pricing structure. default, mean returned values forced equal original mean y (preserve = TRUE). direction monotonicity (increasing decreasing) detected automatically, use cases limited consumption expenditure variables.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/monotonic.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Ensure a monotonic relationship between two variables — monotonic","text":"","code":"monotonic(x, y, w = NULL, preserve = TRUE, plot = FALSE)"},{"path":"https://ummel.github.io/fusionModel/reference/monotonic.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Ensure a monotonic relationship between two variables — monotonic","text":"x Numeric. y Numeric. w Numeric. Optional observation weights. preserve Logical. 
Preserve original mean y values returned values? plot Logical. Plot (sampled) data points derived monotonic relationship?","code":""},{"path":"https://ummel.github.io/fusionModel/reference/monotonic.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Ensure a monotonic relationship between two variables — monotonic","text":"numeric vector modified y values. Optionally, plot showing returned monotonic relationship.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/monotonic.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Ensure a monotonic relationship between two variables — monotonic","text":"initial smoothing accomplished via supsmu result coerced monotone. coercion step modifies values much, second smooth attempted via scam model either monotone increasing decreasing constraint. SCAM fails fit, function falls back lm simple linear predictions. y = 0 x = 0 (typical consumption-expenditure variables), outcome enforced result. input data randomly sampled 10,000 observations, necessary, speed.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/monotonic.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Ensure a monotonic relationship between two variables — monotonic","text":"","code":"y <- monotonic(x = recs$propane_btu, y = recs$propane_expend, plot = TRUE) mean(recs$propane_expend) mean(y)"},{"path":"https://ummel.github.io/fusionModel/reference/plot_valid.html","id":null,"dir":"Reference","previous_headings":"","what":"Plot validation results — plot_valid","title":"Plot validation results — plot_valid","text":"Creates optionally saves disk representative plots validation results returned validate. Requires suggested ggplot2 package. function (default) called within validate. 
Can useful save graphics disk generate plots subset fusion variables.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/plot_valid.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Plot validation results — plot_valid","text":"","code":"plot_valid(valid, y = NULL, path = NULL, cores = 1, ...)"},{"path":"https://ummel.github.io/fusionModel/reference/plot_valid.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Plot validation results — plot_valid","text":"valid Object returned validate. y Character. Fusion variables use validation graphics. Useful plotting partial validation results. Default use fusion variables present valid. path Character. Path directory .png graphics saved. Directory created necessary. NULL (default), files saved disk. cores Integer. Number cores used. applicable Unix systems. ... Arguments passed ggsave control .png graphics saved disk.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/plot_valid.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Plot validation results — plot_valid","text":"list \"plots\", \"smooth\", \"data\" slots. \"plots\" slot contains following ggplot objects: est: Comparison point estimates (median absolute percent error). moe: Comparison 90% margin error (median ratio simulated--observed MOE). Additional named slots (one fusion variables) contain plots described scatterplot results. \"smooth\" data frame plotting values used produce smoothed median plots. \"data\" data frame complete validation results returned original call validate.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/plot_valid.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Plot validation results — plot_valid","text":"Validation results visualized convey expected, typical (median) performance fusion variables. 
, well simulated data match observed data respect point estimates confidence intervals population subsets various size? Plausible error metrics derived input validation data plotting. comparison point estimates, error metric absolute percent error continuous variables; categorical case absolute error scaled maximum possible error 1. Since metrics strictly comparable, -variable plots denote categorical fusion variables dotted lines. given fusion variable, error metric exhibit variation (often quite skewed) even subsets comparable size, due fact subset looks unique partition data. order convey expected, typical performance varies subset size, smoothed median error conditional subset size approximated plotted.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/plot_valid.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Plot validation results — plot_valid","text":"","code":"# Build a fusion model using RECS microdata # Note that \"fusion_model.fsn\" will be written to working directory fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars, weight = \"weight\") # Fuse back onto the donor data (multiple implicates) sim <- fuse(data = recs, fsn = fsn.path, M = 30) # Calculate validation results but do not generate plots valid <- validate(observed = recs, implicates = sim, subset_vars = c(\"income\", \"education\", \"race\", \"urban_rural\"), weight = \"weight\", plot = FALSE) # Create validation plots valid <- plot_valid(valid) # View some of the plots valid$plots$est valid$plots$moe valid$plots$electricity$bias # Can also save the plots to disk at creation # Will save .png files to 'valid_plots' folder in working directory # Note that it is fine to pass a 'valid' object with existing $plots slot # In that case, the plots are simply re-generated vplots <- plot_valid(valid, path = file.path(getwd(), 
\"valid_plots\"), width = 8, height = 6)"},{"path":"https://ummel.github.io/fusionModel/reference/prepXY.html","id":null,"dir":"Reference","previous_headings":"","what":"Prepare the 'x' and 'y' inputs — prepXY","title":"Prepare the 'x' and 'y' inputs — prepXY","text":"Optional--useful function : 1) provide plausible ordering 'y' (fusion) variables 2) identify subset 'x' (predictor) variables likely consequential subsequent model training. Output can passed directly train. useful large datasets many /highly-correlated predictors. Employs absolute Spearman rank correlation screen LASSO models (via glmnet) return plausible ordering 'y' preferred subset 'x' variables associated .","code":""},{"path":"https://ummel.github.io/fusionModel/reference/prepXY.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Prepare the 'x' and 'y' inputs — prepXY","text":"","code":"prepXY( data, y, x, weight = NULL, cor_thresh = 0.05, lasso_thresh = 0.95, xmax = 100, xforce = NULL, fraction = 1, cores = 1 )"},{"path":"https://ummel.github.io/fusionModel/reference/prepXY.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Prepare the 'x' and 'y' inputs — prepXY","text":"data Data frame. Training dataset. categorical variables factors ordered whenever possible. y Character list. Variables data eventually fuse recipient dataset. y list, entry character vector possibly indicating multiple variables fuse block. x Character. Predictor variables data common donor eventual recipient. weight Character. Name observation weights column data. NULL (default), uniform weights assumed. cor_thresh Numeric. Predictors exhibit less cor_thresh absolute Spearman (rank) correlation y variable screened prior LASSO step. Fast exclusion predictors LASSO step probably need consider. lasso_thresh Numeric. Controls aggressively LASSO step screens predictors. Lower value aggressive. 
lasso_thresh = 0.95, example, retains predictors collectively explain least 95% deviance explained \"full\" model. xmax Integer. Maximum number predictors returned LASSO step. strictly control number final predictors returned (especially categorical y variables), useful setting () soft upper bound. Lower xmax can help control computation time large number x pass correlation screen. xmax = Inf imposes restriction. xforce Character. Subset x variables \"force\" included predictors results. fraction Numeric. Fraction observations data randomly sample. larger datasets, sampling often minimal effect results speeds computation. cores Integer. Number cores used. applicable Unix systems.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/prepXY.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Prepare the 'x' and 'y' inputs — prepXY","text":"List named slots \"y\" \"x\". list length. Former gives preferred fusion order. Latter gives preferred sets predictor variables.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/prepXY.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Prepare the 'x' and 'y' inputs — prepXY","text":"","code":"y <- names(recs)[c(14:16, 20:22)] x <- names(recs)[2:13] # Fusion variable \"blocks\" are respected by prepXY() y <- c(list(y[1:2]), y[-c(1:2)]) # Do the prep work... 
prep <- prepXY(data = recs, y = y, x = x) # The result can be passed to train() train(data = recs, y = prep$y, x = prep$x)"},{"path":"https://ummel.github.io/fusionModel/reference/read_fsd.html","id":null,"dir":"Reference","previous_headings":"","what":"Read fusion output from disk — read_fsd","title":"Read fusion output from disk — read_fsd","text":"Read fusion output written directly disk via fuse.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/read_fsd.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Read fusion output from disk — read_fsd","text":"","code":"read_fsd( fsd, columns = NULL, cores = max(1, parallel::detectCores(logical = FALSE) - 1) )"},{"path":"https://ummel.github.io/fusionModel/reference/read_fsd.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Read fusion output from disk — read_fsd","text":"fsd Character. File path ending .fsd produced call fuse. columns Character. Column names read. default read columns. cores Integer. Number cores used read_fst.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/read_fsd.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Read fusion output from disk — read_fsd","text":"data.table integer column \"M\" indicating implicate assignment observation. 
Note ordering recipient observations consistent within implicates, change row order using analyze validate.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/read_fsd.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Read fusion output from disk — read_fsd","text":"version 2.3.0, simply convenient wrapper around read_fst, since fusion output data files (.fsd) actually native fst files hood.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/read_fsd.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Read fusion output from disk — read_fsd","text":"","code":"# Build a fusion model using RECS microdata # Note that \"fusion_model.fsn\" will be written to working directory ?recs fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars) # Write fusion output directly to disk # Note that \"results.fsd\" will be written to working directory recipient <- recs[predictor.vars] sim <- fuse(data = recipient, fsn = fsn.path, M = 5, fsd = \"results.fsd\") # Read the fusion output saved to disk sim <- read_fsd(sim) head(sim)"},{"path":"https://ummel.github.io/fusionModel/reference/read_up.html","id":null,"dir":"Reference","previous_headings":"","what":"Read ORNL UrbanPop data from disk — read_up","title":"Read ORNL UrbanPop data from disk — read_up","text":"NOTE: fusionACS internal use ! Efficiently read pre-processed ORNL UrbanPop data disk. 
Function arguments make sense familiar structure processed UrbanPop data.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/read_up.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Read ORNL UrbanPop data from disk — read_up","text":"","code":"read_up( path, year = NULL, state = NULL, county = NULL, tract_bg = NULL, hid = NULL, df = NULL, cores = max(1, parallel::detectCores(logical = FALSE) - 1) )"},{"path":"https://ummel.github.io/fusionModel/reference/read_up.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Read ORNL UrbanPop data from disk — read_up","text":"path Character. File path fst file containing UrbanPop data. year Integer. Year(s) select. state Integer. State FIPS code(s) select. county Integer. County FIPS code(s) select. tract_bg Integer. Tract block group FIPS code(s) select. hid Integer. ACS-PUMS household ID(s) select. df Data frame containing least 'year' /'state' columns. Provides unique combinations argument values return. df used perform inner merge initial subset data based 'year' 'state'. cores Integer. Number cores used read_fst.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/read_up.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Read ORNL UrbanPop data from disk — read_up","text":"keyed data.table.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/read_up.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Read ORNL UrbanPop data from disk — read_up","text":"Provides efficient fast way load subset UrbanPop data memory. initial subset read using state year restrict rows. 
data.table operations (subset merge) used efficiently reduce final returned subset.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/read_up.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Read ORNL UrbanPop data from disk — read_up","text":"","code":"up.path <- \"~/Documents/Projects/fusionData/urbanpop/Processed national UrbanPop.fst\" out <- read_up(path = up.path, year = 2015, state = 4) unique(dplyr::select(out, year, state)) out <- read_up(path = up.path, year = 2015:2016, state = c(2, 12, 15)) unique(dplyr::select(out, year, state)) up.df <- data.frame(year = c(2015, 2018), state = c(8, 5), county = c(1, 3)) out <- read_up(path = up.path, df = up.df) unique(dplyr::select(out, year, state, county))"},{"path":"https://ummel.github.io/fusionModel/reference/recs.html","id":null,"dir":"Reference","previous_headings":"","what":"Data from the 2015 Residential Energy Consumption Survey (RECS) — recs","title":"Data from the 2015 Residential Energy Consumption Survey (RECS) — recs","text":"Pre-processed, household-level microdata containing selection 31 variables derived 2015 RECS, plus survey replicate weights. variety data types included. missing values. 
Variable names altered original.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/recs.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Data from the 2015 Residential Energy Consumption Survey (RECS) — recs","text":"","code":"recs"},{"path":"https://ummel.github.io/fusionModel/reference/recs.html","id":"format","dir":"Reference","previous_headings":"","what":"Format","title":"Data from the 2015 Residential Energy Consumption Survey (RECS) — recs","text":"tibble 5,686 rows 124 variables: weight Primary sampling weight income Annual gross household income last year age Respondent age race Respondent race education Highest education completed respondent employment Respondent employment status hh_size Number household members division Census Division urban_rural Census 2010 Urban Type climate IECC Climate Code renter household renting home? home_type Type housing unit year_built Range housing unit built square_feet Total square footage insulation Level insulation heating Main space heating fuel aircon Type air conditioning equipment used centralac_age Age central air conditioner televisions Number televisions used disconnect Frequency receiving disconnect notice electricity Total annual electricity usage, kilowatthours natural_gas Total annual natural gas usage, hundred cubic feet fuel_oil Total annual fuel oil/kerosene usage, gallons propane Total annual propane usage, gallons propane_btu Total annual propane usage, thousand Btu propane_expend Total annual propane expenditure, dollars heating_share Share total energy used space heating cooling_share Share total energy used cooling (AC fans) other_share Share total energy used end-uses use_ng Logical indicating household uses natural gas have_ac Logical indicating household air conditioning rep_1:rep_96 Replicate weights uncertainty 
estimation","code":""},{"path":"https://ummel.github.io/fusionModel/reference/recs.html","id":"source","dir":"Reference","previous_headings":"","what":"Source","title":"Data from the 2015 Residential Energy Consumption Survey (RECS) — recs","text":"https://www.eia.gov/consumption/residential/data/2015/","code":""},{"path":"https://ummel.github.io/fusionModel/reference/train.html","id":null,"dir":"Reference","previous_headings":"","what":"Train a fusion model — train","title":"Train a fusion model — train","text":"Train fusion model \"donor\" data using sequential LightGBM models model conditional distributions. resulting fusion model (.fsn file) can used fuse simulate outcomes \"recipient\" dataset.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/train.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Train a fusion model — train","text":"","code":"train( data, y, x, fsn = \"fusion_model.fsn\", weight = NULL, nfolds = 5, nquantiles = 2, nclusters = 2000, krange = c(10, 500), hyper = NULL, fork = FALSE, cores = 1 )"},{"path":"https://ummel.github.io/fusionModel/reference/train.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Train a fusion model — train","text":"data Data frame. Donor dataset. Categorical variables must factors ordered whenever possible. y Character list. Variables data eventually fuse recipient dataset. Variables fused order provided. y list, entry character vector possibly indicating multiple variables fuse block. x Character list. Predictor variables data common donor eventual recipient. list, slot specifies x predictors use y. fsn Character. File path fusion model saved. Must use .fsn suffix. weight Character. Name observation weights column data. NULL (default), uniform weights assumed. nfolds Numeric. Number cross-validation folds used LightGBM model training. 
, nfolds < 1, fraction observations use training set; remainder used validation (faster cross-validation). nquantiles Numeric. Number quantile models train continuous y variables, addition conditional mean. nquantiles evenly-distributed percentiles used. example, default nquantiles = 2 yields quantile models 25th 75th percentiles. Higher values may produce accurate conditional distributions expense computation time. Even nquantiles recommended since conditional mean tends capture central tendency, making median model superfluous. nclusters Numeric. Maximum number k-means clusters use. Higher better computational cost. nclusters = 0 nclusters = Inf turn clustering. krange Numeric. Minimum maximum number nearest neighbors use construction continuous conditional distributions. Higher max(krange) better computational cost. hyper List. LightGBM hyperparameters used model training. NULL, default values used. See Details Examples. fork Logical. parallel processing via forking used, possible? See Details. cores Integer. Number physical CPU cores used parallel computation. fork = FALSE Windows platform (since forking possible), fusion variables/blocks processed serially LightGBM uses cores internal multithreading via OpenMP. Unix system, fork = TRUE, cores > 1, cores <= length(y) fusion variables/blocks processed parallel via mclapply.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/train.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Train a fusion model — train","text":"fusion model object (.fsn) saved fsn.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/train.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Train a fusion model — train","text":"y list, slot indicates either single variable , alternatively, multiple variables fuse block. Variables within block sampled jointly original donor data fusion. See Examples. 
y variables exhibit variance continuous y variables less 10 * nfolds non-zero observations (minimum required cross-validation) automatically removed warning. fusion model written fsn zipped archive created zip containing models data required fuse. hyper argument can used specify LightGBM hyperparameter values perform \"grid search\" model training. See full list parameters. combination hyperparameters, nfolds cross-validation performed using lgb.cv early stopping condition. parameter combination lowest loss function value used fit final model via lgb.train. candidate parameter values specified hyper, longer processing time. hyper = NULL, single set parameters used following default values: boosting = \"gbdt\" data_sample_strategy = \"goss\" num_leaves = 31 feature_fraction = 0.8 max_depth = 5 min_data_in_leaf = max(10, round(0.001 * nrow(data))) num_iterations = 2500 learning_rate= 0.1 max_bin = 255 min_data_in_bin = 3 max_cat_threshold = 32 Typical users reason modify hyperparameters listed . Note num_iterations imposes ceiling, since early stopping typically result models lower number iterations. See Examples. Testing small--medium size datasets suggests forking typically faster OpenMP multithreading (default). However, forking sometimes \"hang\" (continue run CPU usage error message) OpenMP process previously used session. issue appears related Intel's OpenMP implementation (see ). can triggered operations called train() use data.table fst multithread mode. 
experience hanged forking, try calling data.table::setDTthreads(1) fst::threads_fst(1) immediately library(fusionModel) new session.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/train.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Train a fusion model — train","text":"","code":"# Build a fusion model using RECS microdata # Note that \"fusion_model.fsn\" will be written to working directory ?recs fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars) # When 'y' is a list, it can specify variables to fuse as a block fusion.vars <- list(\"electricity\", \"natural_gas\", c(\"heating_share\", \"cooling_share\", \"other_share\")) fusion.vars train(data = recs, y = fusion.vars, x = predictor.vars) # When 'x' is a list, it specifies which predictor variables to use for each 'y' xlist <- list(predictor.vars[1:4], predictor.vars[2:8], predictor.vars) xlist train(data = recs, y = fusion.vars, x = xlist) # Specify a single set of LightGBM hyperparameters # Here we use Random Forests instead of the default Gradient Boosting Decision Trees train(data = recs, y = fusion.vars, x = predictor.vars, hyper = list(boosting = \"rf\", feature_fraction = 0.6, max_depth = 10 )) # Specify a range of LightGBM hyperparameters to search over # This takes longer, because there are more models to test train(data = recs, y = fusion.vars, x = predictor.vars, hyper = list(max_depth = c(5, 10), feature_fraction = c(0.7, 0.9) ))"},{"path":"https://ummel.github.io/fusionModel/reference/validate.html","id":null,"dir":"Reference","previous_headings":"","what":"Validate fusion output — validate","title":"Validate fusion output — validate","text":"Performs internal validation analyses fused microdata estimate well simulated variables reflect patterns dataset used train underlying fusion model (.e. observed/donor data). 
provides standard approach validating fusion output associated models. See Examples recommended usage.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/validate.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Validate fusion output — validate","text":"","code":"validate( observed, implicates, subset_vars, weight = NULL, min_size = 30, plot = TRUE, cores = 1 )"},{"path":"https://ummel.github.io/fusionModel/reference/validate.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Validate fusion output — validate","text":"observed Data frame. Observed data validate simulated variables. Typically dataset used train fusion model used generate simulated. implicates Data frame. Implicates synthetic (fused) variables. Typically generated fuse. implicates row-stacked identified integer column \"M\". subset_vars Character. Vector columns observed used define population subsets across fusion variables validated. levels subset_vars (including two-way interactions subset_vars) define population subsets. Continuous subset_vars converted five-level ordered factor based univariate k-means clustering. weight Character. Name observation weights column observed. NULL (default), uniform weights assumed. min_size Integer. Subsets less min_size observations excluded. Since subsets observations unlikely give reliable estimates, make sense consider validation purposes. plot Logical. TRUE (default), plot_valid called internally summary plots returned along complete validation results. Requires ggplot2 package. cores Integer. Number cores used. applicable Unix systems.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/validate.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Validate fusion output — validate","text":"plot = FALSE, data frame containing complete validation results. 
plot = TRUE, list containing full results well additional plot objects described plot_valid.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/validate.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Validate fusion output — validate","text":"objective validate confirm fusion output sensible help establish utility synthetic data across myriad analyses. Utility based comparison point estimates confidence intervals derived using multiple-implicate synthetic data derived using original donor data. specific analyses tested include variable levels (means proportions) across population subsets varying size. allows estimates synthetic variables perform analyses real-world relevance, varying levels complexity. effect, validate() performs large number analyses kind analyze function designed one--one basis. users want use default setting plot = TRUE simultaneously return visualization (plots) validation results. Plot creation detailed plot_valid.","code":""},{"path":"https://ummel.github.io/fusionModel/reference/validate.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Validate fusion output — validate","text":"","code":"# Build a fusion model using RECS microdata # Note that \"fusion_model.fsn\" will be written to working directory fusion.vars <- c(\"electricity\", \"natural_gas\", \"aircon\") predictor.vars <- names(recs)[2:12] fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars, weight = \"weight\") # Fuse back onto the donor data (multiple implicates) sim <- fuse(data = recs, fsn = fsn.path, M = 20) # Calculate validation results valid <- validate(observed = recs, implicates = sim, subset_vars = c(\"income\", \"education\", \"race\", \"urban_rural\"))"}] diff --git a/sitemap.xml b/sitemap.xml index 702ccd0..8996dce 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -14,6 +14,7 @@ https://ummel.github.io/fusionModel/reference/plot_valid.html 
https://ummel.github.io/fusionModel/reference/prepXY.html https://ummel.github.io/fusionModel/reference/read_fsd.html +https://ummel.github.io/fusionModel/reference/read_up.html https://ummel.github.io/fusionModel/reference/recs.html https://ummel.github.io/fusionModel/reference/train.html https://ummel.github.io/fusionModel/reference/validate.html