brief description and some questions about evil example #149

xychem · 2023-07-27T01:42:53Z

xychem
Jul 27, 2023
Collaborator

Here is a brief description of our project, to make sure I haven't misunderstood what Paul is saying.

Brief description:

We want to compare the performance of diverse-selector and random sampling on a nonuniform database, because the databases we get are sometimes nonuniformly distributed. In this subproject, we want to verify that our method is better than random sampling when the database is nonuniformly distributed. (The most important question is whether we can use the fewest points to represent the properties of the whole database.)

In the first step, I should find an explicit function (with as many features as possible) and sample it with some nonuniform sampling method. Once we have the nonuniform database, we can use random sampling or diverse-selector to select some points and rebuild a new explicit function (maybe we can just fit the function to the points, or use some other method, e.g. a neural network). Once we have the new explicit function, we can calculate the mean squared error (MSE) or mean squared deviation (MSD) to compare the performance of diverse-selector and random sampling.
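The first step could be sketched roughly as follows. This is only an illustration under assumptions: the target function `f`, the corner-heavy sampler, the greedy max-min selection (a simple stand-in for a diversity-based selector, not the diverse-selector implementation), and the 1-nearest-neighbour surrogate are all hypothetical choices. The error is measured on uniformly drawn test points, which is also an assumption about what "representing the whole database" should mean.

```python
import math
import random


def f(x1, x2, x3):
    # Arbitrary illustrative target function (an assumption, not a project choice).
    return x1 * (x2 + x3)


def nonuniform_points(n, rng):
    # About 90% of the points crowd into one corner of the unit cube,
    # the rest are spread over the whole cube.
    pts = []
    for _ in range(n):
        hi = 0.2 if rng.random() < 0.9 else 1.0
        pts.append(tuple(rng.uniform(0.0, hi) for _ in range(3)))
    return pts


def random_select(points, k, rng):
    # Plain random sampling without replacement.
    return rng.sample(points, k)


def maxmin_select(points, k):
    # Greedy farthest-point selection: repeatedly add the point whose
    # distance to the already selected set is largest.
    selected = [points[0]]
    dist = [math.dist(p, points[0]) for p in points]
    while len(selected) < k:
        i = max(range(len(points)), key=dist.__getitem__)
        selected.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return selected


def nn_predict(train, x):
    # 1-nearest-neighbour surrogate: return f at the closest training point.
    nearest = min(train, key=lambda p: math.dist(p, x))
    return f(*nearest)


def mse(train, test):
    # Mean squared error of the surrogate on the test points.
    return sum((nn_predict(train, x) - f(*x)) ** 2 for x in test) / len(test)


rng = random.Random(0)
database = nonuniform_points(500, rng)
test_points = [tuple(rng.uniform(0.0, 1.0) for _ in range(3)) for _ in range(300)]
k = 20
mse_random = mse(random_select(database, k, rng), test_points)
mse_diverse = mse(maxmin_select(database, k), test_points)
print(f"random: {mse_random:.4f}  max-min: {mse_diverse:.4f}")
```

A single seed proves nothing either way; a fair comparison would repeat this over many seeds and several values of `k`.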

In the second step, I should use the cancer_risk website to generate many nonuniform data points, then sample them with random sampling and with diverse-selector. Finally, I use the selected points to predict the targets of the database and compare the performance of diverse-selector and random sampling. (This is similar to the first step.)

Questions:

I have read the paper A constraint on historic growth in global photosynthesis due to increasing $CO_2$ (which has been retracted). It seems they want to find a relationship between atmospheric $CO_2$ concentration and photosynthesis, but they use some variables that I haven't heard of before. However, I only found some functions for assessing $CO_2$ sensitivity. Also, the model has only four variables: do we need to find a new one with more features, or can we just use it?
Can I use a dataset that is commonly used in machine learning (like the Iris dataset) instead of the cancer_risk database? I think diverse-selector is like a method for solving classification problems. We can consider diverse-selector as a classifier that partitions the database according to diversity and the number of classes we need *(which is actually also equal to the number of points we need)*. So I think the datasets used in machine learning could be used to verify our model.
I have looked at Shania's GitHub and the B3DB work she did before. I don't see any connection between the cancer_risk database and B3DB (which is a database about the blood-brain barrier, and whose content relates to compounds). Or maybe you want me to do another sampling example based on B3DB?

Supplement:

Some details about the paper A constraint on historic growth in global photosynthesis due to increasing $CO_2$ (which has been retracted).
paper address: https://www.nature.com/articles/s41586-021-04096-9#Sec1
github address: https://github.com/trevorkeenan/gpp-co2

Background:

Global photosynthesis cannot be observed directly, however, and must instead be either predicted by terrestrial biosphere models (TBMs) or inferred from proxies. However, there is no consensus regarding the expected historic global change due to increasing $CO_2$ levels. Satellite-based estimates suggest that TBMs overestimate the sensitivity of global photosynthesis to $CO_2$. By contrast, observation-based proxies, based on ice-core records of carbonyl sulfide (COS) and herbarium and field-based deuterium isotopomers, suggest that TBMs may underestimate the sensitivity of global photosynthesis to $CO_2$. So the authors combine TBMs and estimates of the terrestrial carbon cycle to constrain the historic response of photosynthesis to rising $CO_2$, and use the constraint in combination with biophysical theory to assess and reconcile differences in previous reports.

Variables:

LUE = light-use efficiency
GPP = gross primary production (global photosynthesis)
TBM = terrestrial biosphere model (review: https://www.annualreviews.org/doi/10.1146/annurev-environ-012913-093456)

$S_{land}$ means the mean TBM modelled global terrestrial residual carbon sink
$\beta_{R}^{GPP}$ means the sensitivity of GPP
$\beta_{R}^{Reco}$ means the sensitivity of total global ecosystem respiration
$\gamma$ means the non-respired flux

PaulWAyers · 2023-07-27T17:58:17Z

PaulWAyers
Jul 27, 2023
Maintainer

I wouldn't use the photosynthesis case; it isn't worth the effort it would take.

Shania was working on B3DB, which isn't related to cancer risk. But she wasn't trying for an evil example, which is a different problem/task.

The advantage of using the risk-calculators or an analytical model for risk (rather than a database) is that we can manually construct a biased dataset/sample. With a standard ML dataset, usually it has already been curated to be well-behaved/balanced. We want exactly the opposite. Sometimes real-world data would be relevant, but finding the real-world data is harder than just constructing a model. The point is not so much the model itself, but just that we purposely generate an evil example.
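As a rough illustration of constructing a biased sample from an analytical risk model, the sketch below uses a made-up logistic model. The function `risk`, its coefficients, and the features (`age`, `smoker`) are hypothetical and are not taken from any real risk calculator; the only point is that the sampler is deliberately skewed.

```python
import math
import random


def risk(age, smoker):
    # Hypothetical logistic risk model; the coefficients are invented
    # purely for illustration.
    z = -6.0 + 0.07 * age + 1.2 * smoker
    return 1.0 / (1.0 + math.exp(-z))


def biased_sample(n, rng):
    # Deliberately over-represent young non-smokers: about 95% of rows
    # come from a narrow region, the rest cover the full feature range.
    rows = []
    for _ in range(n):
        if rng.random() < 0.95:
            age, smoker = rng.uniform(20.0, 35.0), 0
        else:
            age, smoker = rng.uniform(20.0, 90.0), rng.randint(0, 1)
        rows.append((age, smoker, risk(age, smoker)))
    return rows


rng = random.Random(1)
data = biased_sample(1000, rng)
young = sum(1 for age, _, _ in data if age <= 35.0)
print(f"{young} of {len(data)} rows have age <= 35")
```

The high-risk region (older smokers) is then covered by only a handful of rows, which is the kind of sample where a selection method has a chance to differ from random sampling.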


xychem · 2023-07-28T00:49:16Z

xychem
Jul 28, 2023
Collaborator Author

Yeah, our project targets the evil example. The most important thing is how to obtain a nonuniform database that also has real-world meaning.

One of my questions is whether we can manually construct a biased dataset from a standard ML dataset by deleting some data. There may not be enough points left to sample after we process the database, but it could be an alternative plan.
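A minimal sketch of the deletion idea, assuming a synthetic three-class dataset as a stand-in for a curated ML dataset. The helper names `make_balanced` and `delete_to_bias` and the keep fractions are hypothetical, chosen only to show how quickly the surviving point count shrinks.

```python
import random


def make_balanced(n_per_class, rng):
    # Synthetic stand-in for a curated dataset: three equally sized
    # Gaussian clusters, each row is (x, y, label).
    centers = {0: (0.0, 0.0), 1: (3.0, 0.0), 2: (0.0, 3.0)}
    return [
        (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5), label)
        for label, (cx, cy) in centers.items()
        for _ in range(n_per_class)
    ]


def delete_to_bias(rows, keep_fraction, rng):
    # keep_fraction maps a class label to the probability of keeping a row,
    # so classes with a small fraction are mostly deleted.
    return [row for row in rows if rng.random() < keep_fraction[row[2]]]


rng = random.Random(2)
balanced = make_balanced(200, rng)
biased = delete_to_bias(balanced, {0: 1.0, 1: 0.1, 2: 0.02}, rng)
counts = {c: sum(1 for r in biased if r[2] == c) for c in (0, 1, 2)}
print(counts)
```

With small datasets such as Iris (150 rows), deleting this aggressively would leave very few points in the rare classes, which is the concern raised above.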

I'm trying to find an explicit multivariable function in Covid-19, finance, phase transitions, and other fields that are influenced by many factors, but it seems that researchers prefer to use neural networks to solve their problems. I found a book, Nonuniform Sampling Theory and Practice, which may be useful. I have just started reading it, and I keep looking for literature about multivariable functions.

Two easier ways of getting an explicit function (but not good enough):

One way is to use a linear regression function, but the drawback is that it isn't general enough.
The other easy way is to use an arbitrary function (like $f(x_1,x_2,x_3)=x_1*(x_2+x_3)$), but it lacks real-world meaning.

The above is how to generate the biased database using an explicit function. I'll also do nonuniform sampling on the cancer risk database and B3DB to obtain a biased database.

