Shared input sample sizes for sex = 0 and sex = 3 rows #151

Open
okenk opened this issue Feb 12, 2025 · 6 comments

@okenk
Contributor

okenk commented Feb 12, 2025

Describe the bug
If you provide both sexed and unsexed biological data, the default settings in the analysis pipeline assign the same input sample size to the sex = 0 and sex = 3 rows, rather than sample sizes specific to the unsexed and sexed groups.

To Reproduce
For any species with sexed and unsexed biological data run:

clean_PacFIN(bds.pacfin) |>
   get_pacfin_expansions(catch = catch, [other generic settings that should not matter]) |>
   getComps(Comps = "LEN", weightid = "Final_Sample_Size_L") |>
   writeComps(fname, len_bin)

Expected behavior
Input sample sizes for the rows with sex = 0 should differ from those in the rows with sex = 3.

Additional context
A sufficient but somewhat inefficient solution to this is to run everything from get_pacfin_expansions() down separately for sexed and unsexed data, and then put the data back together later when inputting it into the SS3 files.
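A minimal sketch of that split-and-recombine workaround, using a toy stand-in for the real pipeline. Everything here is an assumption for illustration: the helper `run_by_sex_group()`, the `toy_pipeline()` function (which just counts distinct tows per year), and the column names `SEX`, `year`, and `tow_id` are hypothetical and are not part of pacfintools.

```r
# Hypothetical helper: run any comp-building function separately on the
# unsexed and sexed records, then stack the two results.
# `pipeline` stands in for the get_pacfin_expansions() |> getComps() chain.
# Assumes the sex column has no missing values.
run_by_sex_group <- function(bds, pipeline, sex_col = "SEX") {
  is_unsexed <- bds[[sex_col]] == "U"
  groups <- list(
    unsexed = bds[is_unsexed, , drop = FALSE],
    sexed   = bds[!is_unsexed, , drop = FALSE]
  )
  out <- lapply(names(groups), function(g) {
    res <- pipeline(groups[[g]])
    res$sex_group <- g
    res
  })
  do.call(rbind, out)
}

# Toy stand-in pipeline: "input sample size" = number of distinct tows per year.
toy_pipeline <- function(d) {
  n <- tapply(d$tow_id, d$year, function(x) length(unique(x)))
  data.frame(year = as.numeric(names(n)), n_tows = as.vector(n))
}

toy_bds <- data.frame(
  year   = c(2000, 2000, 2000, 2000, 2001, 2001),
  tow_id = c("a", "a", "b", "c", "d", "e"),
  SEX    = c("F", "M", "U", "U", "F", "U")
)

by_group <- run_by_sex_group(toy_bds, toy_pipeline)
print(by_group)
```

Because each group goes through the pipeline on its own, the sexed and unsexed rows end up with their own sample sizes; the cost is running everything twice and merging the results by hand.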

Also, I totally may have missed an argument somewhere (probably would be in getComps()?) that would avoid this behavior-- it was not immediately obvious to me though!

@iantaylor-NOAA
Contributor

iantaylor-NOAA commented Feb 12, 2025

There's a long discussion about these calculations in #29.
I had proposed a complex solution but agree with the decision that Kelli made to not embed a bunch of complexity into the code as was done in the past in ways that were hard to understand or follow.

Quoting from that thread:

The number of fish can then easily be added together to create the number of sexed fish. This is all returned from getComps(), which is passed to writeComps() so if a user wants to create a different input sample size based off of the number of tows and the number of fish one could.

[edit: add line break to start new paragraph missing in original comment]
However, the package doesn't provide any guidance on how one would do this. The help page for getComps() says "The documentation for these sample size columns is sparse because this function is set to be deprecated next cycle and replaced with a simplified path to writing composition data." That suggests to me that we're in an awkward in-between state with this package and its documentation.

I can try to explore some ways to apply a multiplier to the input sample size for unsexed fish based on the proportion unsexed. However, I think a reasonable approach is to just discard the unsexed fish for all years where they represent a small fraction of the total, and make all fish unsexed in the years where they represent the majority, which is another way to resolve the problem (though again it requires more work from the user).
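The discard-or-pool rule in the last paragraph could be sketched roughly as below. The column names (`year`, `sex`) and the single 0.5 cutoff are illustrative assumptions; the comment only says "small fraction" versus "majority", so years in between would need a judgment call that this sketch does not make.

```r
# Toy data: 1995 is mostly unsexed, 2010 is mostly sexed.
fish <- data.frame(
  year = c(rep(1995, 10), rep(2010, 10)),
  sex  = c(rep("U", 8), "F", "M",          # 1995: 8 of 10 unsexed
           "U", rep("F", 5), rep("M", 4))  # 2010: 1 of 10 unsexed
)

# Proportion of unsexed fish in each year.
prop_unsexed <- tapply(fish$sex == "U", fish$year, mean)

# Assumed cutoff: a year is "majority unsexed" when more than half are U.
majority_years <- as.numeric(names(prop_unsexed)[prop_unsexed > 0.5])
in_majority_year <- fish$year %in% majority_years

# Majority-unsexed years: treat every fish as unsexed.
fish$sex[in_majority_year] <- "U"

# All other years: discard the unsexed fish.
resolved <- fish[in_majority_year | fish$sex != "U", ]

table(resolved$year, resolved$sex)
```

After this step every year is either all unsexed or all sexed, so no year needs both a sex = 0 and a sex = 3 row.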

@okenk
Contributor Author

okenk commented Feb 12, 2025

Hmm, I am not sure I agree with the discussion about sample sizes. It does not make sense to me.

I fully agree that {pacfintools} should only need to output sex = 0 and sex = 3, no need to do fancy stuff allowing for sex = 1 and 2.

HOWEVER, if you have unsexed fish from 3 tows and sexed fish from 5 tows, then you have more information (weight) from sexed fish than you do from unsexed fish. Why would we want to weight those two multinomial draws similarly?

@chantelwetzel-noaa says:

I think sex should be ignored when counting the number of tows. I think if we calculate sexed vs. unsexed separately we end up calculating a higher input sample size for unsexed vs. sexed fish.

Yes. Why would you not want to do this? You have more data from sexed fish.

This is where it really matters since all of our data weighting methods apply a simple multiplier by fleet and composition type, making it really important that you have not inadvertently over-weighted one of the sex inputs relative to the other.

Again, I totally agree with this statement, but I come to the opposite conclusion. If you have sexed samples from more tows, they should be weighted more than the unsexed samples. Why would you weight them similarly?

I should add, my understanding is there are two reasons for unsexed fish: 1) they are too small to sex (this is why we have unsexed fish in survey data) and 2) there is not time/capacity to sex them. My logic assumes (2) is much more common in fishery-dependent data.
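To make the 3-tow versus 5-tow example concrete, here is a small sketch contrasting a tow count that ignores sex with tow counts computed separately for each group. The data and column names (`year`, `tow_id`, `sex`) are made up, and the sketch assumes no tow contains both sexed and unsexed fish.

```r
# Toy data for one year: 5 tows with sexed fish, 3 tows with unsexed fish.
samples <- data.frame(
  year   = 1990,
  tow_id = c(rep(paste0("s", 1:5), each = 2), rep(paste0("u", 1:3), each = 2)),
  sex    = c(rep(c("F", "M"), 5), rep("U", 6))
)

n_distinct <- function(x) length(unique(x))

# Ignoring sex: one shared tow count that both rows would inherit.
shared_tows <- n_distinct(samples$tow_id)                       # 8

# Counting separately: each row gets the tows that actually contributed to it.
sexed_tows   <- n_distinct(samples$tow_id[samples$sex != "U"])  # 5
unsexed_tows <- n_distinct(samples$tow_id[samples$sex == "U"])  # 3

c(shared = shared_tows, sexed = sexed_tows, unsexed = unsexed_tows)
```

With the shared count, both rows carry the same weight; with separate counts, the sexed row carries more weight than the unsexed row, which is the behavior argued for above.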

@okenk
Contributor Author

okenk commented Feb 12, 2025

Hold on. Does {pacfintools} include ALL fish when it makes the line for sex = 0? And then include only sexed fish when it makes the line for sex = 3? I had assumed the line for sex = 0 ONLY included unsexed fish.

@shcaba

shcaba commented Feb 12, 2025 via email

@okenk
Contributor Author

okenk commented Feb 12, 2025

Why would we want to count an individual fish in two separate rows of the comp data? I feel like each fish should only appear once (either in sex = 3 OR sex = 0).

It seems kind of fundamental to the multinomial distribution, that each entry is iid. If you are including the same individual fish from 1990 under sex = 0 and sex = 3, the sex = 0 and sex = 3 entries would not be independent, as the likelihood assumes.

EDIT: wait, I am realizing @shcaba and I said the same thing.
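A quick sketch of the "each fish appears once" idea, with made-up data and hypothetical column names (`fish_id`, `sex`, `ss3_sex`). It contrasts a partition of the fish with the overlapping scheme asked about above; it does not claim to show which of the two {pacfintools} actually implements.

```r
fish <- data.frame(
  fish_id = 1:6,
  sex     = c("F", "M", "U", "U", "F", "U")
)

# Partition: every fish is assigned to exactly one SS3 sex code.
fish$ss3_sex <- ifelse(fish$sex == "U", 0L, 3L)
n_by_row <- table(fish$ss3_sex)
stopifnot(sum(n_by_row) == nrow(fish))  # no fish is counted twice

# Overlapping alternative: the sex = 0 row pools ALL fish,
# while the sex = 3 row keeps only the sexed fish.
n_overlap <- c("0" = nrow(fish), "3" = sum(fish$sex != "U"))
n_double_counted <- sum(n_overlap) - nrow(fish)
n_double_counted
```

Under the partition the row totals add up to the number of fish; under the overlapping scheme every sexed fish contributes to both rows, which is the independence problem described above.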

@iantaylor-NOAA
Contributor

iantaylor-NOAA commented Feb 12, 2025

There are too many possible use cases for the generalized tools to create good defaults in every case, but here are a few that I can think of and some ideas on what to do:

  1. you have a single-sex model (like yelloweye), in which case you can assign all fish to "U" and not worry about this (although better documentation of this approach is needed, as discussed in "better document how to create unexpanded and unsexed comps", #150)
  2. unsexed fish are a majority of the samples for some years, in which case all fish could be assigned "U" for those years
  3. unsexed fish are unsexed at random and have similar size comps to the sexed fish in which case you can exclude these fish before you even process the comps as we've been doing for yellowtail* in https://github.com/pfmc-assessments/yellowtail_2025/blame/main/Rscripts/commercial_comps.R#L42.
  4. unsexed fish are too small to sex and shouldn't be excluded because that would bias selectivity. In this case, either an equal sex ratio could be applied or they could all be assigned to a single sex, but either way the "combine_M_F" option in SS3 could be used to combine the observed and expected proportions within the multinomial so the apportionment (along with any inaccuracies in determining the sex of small fish that were assigned to "F" or "M") wouldn't impact the model.
  5. unsexed fish are a significant proportion of the total and are a mix of small and large but have a different length distribution than the sexed fish, so you want to add them as an additional vector into the model. We did this for the two lingcod assessments in 2021 (as shown in the bubble plot below)

Image

In the first 3 cases, we don't have to worry about separate sample sizes for sexed and unsexed.
In case 4 the user is making some assumptions about reassigning samples and there's no generalized code to do this automatically, so they will have to sort out the sample sizes.
Case 5 (separate vector as in lingcod) is the one case where I think the default sample size calculations could lead you astray. We did sensitivity analyses to test the impact of removing all the unsexed fish (as well as using the combine_M_F option) and found a pretty small but non-zero impact (figure below). My current thinking is that if you were to include those separate vectors of unsexed fish, you should calculate the ratio of the number of sampled fish that were unsexed vs. sexed in each year and apply this ratio to the sample size of the sexed fish coming out of pacfintools to get the input sample size for the unsexed vectors.

Image
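The ratio adjustment suggested for case 5 might look something like the sketch below. The numbers and the column names (`n_sexed`, `n_unsexed`, `nsamp_sexed`) are invented for illustration; in practice the fish counts and the sexed input sample size would come from the pacfintools output, whose exact column names are not assumed here.

```r
# Toy per-year summary.
comps <- data.frame(
  year        = c(1995, 1996, 2012),
  n_sexed     = c(200, 150, 400),  # number of sexed fish sampled
  n_unsexed   = c(100, 15, 300),   # number of unsexed fish sampled
  nsamp_sexed = c(40, 30, 60)      # input sample size for the sexed (sex = 3) row
)

# Ratio of unsexed to sexed fish in each year.
comps$ratio_unsexed <- comps$n_unsexed / comps$n_sexed

# Scale the sexed input sample size by that ratio to get the
# input sample size for the separate unsexed (sex = 0) vector.
comps$nsamp_unsexed <- comps$nsamp_sexed * comps$ratio_unsexed

comps[, c("year", "nsamp_sexed", "ratio_unsexed", "nsamp_unsexed")]
```

In the toy data, the year with half as many unsexed as sexed fish gets half the sexed input sample size, so the unsexed vector is down-weighted in proportion to how few fish it contains.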

* with regard to yellowtail, it does look like there are some periods in the 90s and 2010s where we may want to include the unsexed fish as a separate vector (case 5), which I assume is why @okenk posted this issue in the first place
Image
