Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remap shared services for edx-edxapp in DD #737

Open
3 of 7 tasks
robrap opened this issue Jul 26, 2024 · 17 comments
Open
3 of 7 tasks

Remap shared services for edx-edxapp in DD #737

robrap opened this issue Jul 26, 2024 · 17 comments
Assignees

Comments

@robrap
Copy link
Contributor

robrap commented Jul 26, 2024

The problem is that default DD configuration for mysql, django (cache), defaultdb, requests, and potentially other libraries has each of these going to a shared service of the same name that is used across all of our IDA services.

Problems with this default approach:

  • It is difficult to determine what span data belongs to which IDA/service.
  • There is some metric data that may be impossible to separate between IDAs/services.

Acceptance Criteria

Notes:

  • DD support ticket https://help.datadoghq.com/hc/en-us/requests/1781912 asks a variety of questions about this topic, specifically targeting django (cache) and requests services. Questions and answers should ultimately be copied out of this ticket once it is complete.
  • DD support ticket https://help.datadoghq.com/hc/en-us/requests/1762193 asks similar questions about mysql, but has some additional information about metrics and multiple service tags (which should be avoided). Questions and answers should ultimately be copied out of this ticket once it is complete.
  • DD support ticket https://help.datadoghq.com/hc/en-us/requests/1873842 summarizes some of the above as well as other options we've considered, and asks for guidance.
  • There is an open (support) question about whether using DD_SERVICE_MAPPING is a simpler method of remapping, rather than using a variety of different settings.
    • However, remapping django to cache might be risky, because django could be used for something else in the future, so in this particular case we might want to use DD_DJANGO_CACHE_SERVICE_NAME instead.
  • For each service (e.g. mysql, django, etc.) we decide to remap, we need to choose between the IDA service service:edx-edxapp-lms (spans would still have a different operation_name), or service:edx-edxapp-lms-cache (a new DD service catalog service).
    • If everything went to the same service, we may need to adjust the primary operation_name for edx-edxapp-lms if it no longer defaults to operation_name:django.requests. See these DD docs for configuring the primary operation.
  • If we go with separate DD service names:
    • We would probably use the following naming convention:
      • service:edx-edxapp-lms
      • service:edx-edxapp-lms-cache (was django)
      • service:edx-edxapp-lms-defaultdb
      • Etc.
    • Unfortunately, I chose to go with service:edx-edxapp-lms-workers rather than service:edx-edxapp-workers-lms.
      • We might need to change this, because presumably we'd want service:edx-edxapp-workers-lms-cache, etc., and if we used a search like service:edx-edxapp-lms* to pick up all services related to service:edx-edxapp-lms, we would not want that to also pick up the worker service and all its sub-services.
      • Should we do this as a separate, pre-emptive ticket? We could use expand/contract to ensure monitors (listed in DD) and dashboards, etc. are updated before the name is changed. Communications will be required.
  • ADR should state that we are rejecting the status quo of a shared DD service across all of our IDAs.
  • What is the full list of services to address? DD has a dependency graph somewhere, and some other possible services may includes elasticsearch, redis, read_replicadb. Please review in DD to get the full list of affect dependencies. Note that some general shared services (e.g. aws.s3) probably should remain shared, but this could be discussed.

ADRish page: https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1265598591/Datadog+Service+mapping

@robrap robrap added this to Arch-BOM Jul 26, 2024
@robrap robrap converted this from a draft issue Jul 26, 2024
@jristau1984 jristau1984 moved this to Groomed in Arch-BOM Jul 29, 2024
@jristau1984 jristau1984 moved this from Ready For Development to Backlog in Arch-BOM Aug 5, 2024
@robrap
Copy link
Contributor Author

robrap commented Aug 5, 2024

Answers from DD Support ticket:

Coming back from engineering to let you know answers to your queries:

a. Are there any downsides of using DD_SERVICE_MAPPING, over the individual settings like DD_DJANGO_CACHE_SERVICE_NAME? Which is preferred? Which takes precedence?

DD_SERVICE_MAPPING allows you to remap service names in a more generalized way. It is useful when you have multiple integrations or need a consistent naming scheme across various services.
Individual settings like DD_DJANGO_CACHE_SERVICE_NAME are more specific and used for fine-grained control over the naming of particular integrations.
Precedence: Typically, DD_SERVICE_MAPPING would take precedence over individual settings because it applies a broader remapping rule. However, specific integration settings can override this if explicitly defined.

Downsides:

Using DD_SERVICE_MAPPING can be less granular, potentially leading to confusion if not documented properly.
Individual settings provide more control but can be cumbersome to manage for large-scale applications with multiple integrations.

Preferred Option:

The preferred option depends on the use case. For broad consistency, DD_SERVICE_MAPPING is preferred. For fine-grained control, individual settings are preferred.

b. If we were to set DD_TRACE_SAMPLING_RULES with service-specific rules, would we use the original service name (e.g. django), or the mapped service name? Is the answer different for each option from question 1a?

When setting DD_TRACE_SAMPLING_RULES, use the mapped service name if you are using DD_SERVICE_MAPPING. This ensures that the rules apply correctly to the spans after they have been remapped.
The same applies to individual settings. If you are changing the service name using specific integration settings, use the new service name in the sampling rules.

Regarding service Mapping and operation names:

If you map services to a single service (e.g., edx-edxapp-lms), you can differentiate by operation_name.
However, if there is a risk that cache calls might outnumber Django requests, it might change the primary operation name in the Datadog UI.

In this case I would recommend:

To maintain differentiation and control, consider mapping services to distinct names (e.g., edx-edxapp-lms-cache for cache operations).
This approach allows for clearer visibility and avoids the potential issue of the UI selecting the most popular operation name.
You can also select primary operation manually in the UI

@robrap
Copy link
Contributor Author

robrap commented Aug 6, 2024

The following ticket is related to this ticket, because it would drop some of the services that need to be remapped (e.g. read_replicadb, defaultdb, django (cache)).

@robrap
Copy link
Contributor Author

robrap commented Aug 8, 2024

@timmc-edx: Should this wait on the trace concatenation fix so we don't bring in a lot of new spans? Not sure how this will affect things.

@timmc-edx
Copy link
Member

Hmm... I don't know that the two will interact that much.

@jristau1984 jristau1984 moved this from Backlog to Ready For Development in Arch-BOM Aug 8, 2024
@timmc-edx timmc-edx self-assigned this Aug 8, 2024
@timmc-edx timmc-edx moved this from Ready For Development to In Progress in Arch-BOM Aug 8, 2024
@timmc-edx
Copy link
Member

I'll be using https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1265598591/Datadog+Service+mapping to build out something ADR-like and track testing in edx-platform.

@robrap
Copy link
Contributor Author

robrap commented Aug 20, 2024

@timmc-edx: I noticed the following chart on the Service Catalog will have varying usefulness depending on the direction we choose. - - The default view is for Downstream Services.

  • There is a view by Type, but it is not the default.
  • There is also a Dependency Map view. I wonder if its numbers are misleading, and are for the downstream service overall, and not filtered based on the incoming service?

Image

@timmc-edx
Copy link
Member

OK, I think I've come around to suffixing—both because of how things will work in DD's UI and because I realized I'd made an incorrect assumption earlier that edxapp was like other services. :-P It wouldn't be edx-edxapp-lms-cache, because LMS and CMS (as far as I'm aware) share a cache. We'd instead use edx-edxapp-cache. Similarly, the workers share a DB, so the string -workers wouldn't get stuck in there anywhere either. This makes that option for service names feel much more manageable to me, and now is my preferred option.

@timmc-edx
Copy link
Member

...actually no, that doesn't make sense at all, as these are client side spans and we want to distinguish LMS vs. CMS client-side traffic.

None of the options we have really make sense, so I'm going to open a ticket with Datadog to see how they would recommend solving this.

@timmc-edx
Copy link
Member

Opened https://help.datadoghq.com/hc/en-us/requests/1873842 with options and pros/cons as we understand them, asking for guidance.

@timmc-edx
Copy link
Member

I've indicated interest in the Inferred Services Beta using the form linked at the top of https://docs.datadoghq.com/tracing/guide/inferred-service-opt-in/

@robrap
Copy link
Contributor Author

robrap commented Nov 7, 2024

@timmc-edx: Should this be marked as blocked while waiting in DD? Also, should we follow-up with Shannon from DD (when she is back) to see if we can get help moving this along?

@timmc-edx timmc-edx moved this from In Progress to Blocked in Arch-BOM Nov 7, 2024
@timmc-edx
Copy link
Member

Yes, moved to blocked. Agreed that we should ask Shannon to look into why we haven't gotten a response after I requested access to the beta.

@robrap
Copy link
Contributor Author

robrap commented Nov 7, 2024

I sent an email to DD, and we'll see what happens. Also added the waiting-on-other-team label.

@timmc-edx
Copy link
Member

We should be in the beta program now.

timmc-edx added a commit to edx/configuration that referenced this issue Dec 3, 2024
Combined with agent changes, this will allow for unified service naming
so that we can query DB metrics for a service.

When enabled for an environment, this change will change all spans in
edxapp to use the same service name rather than integration-specific names
like `service:defaultdb`.

See edx/edx-arch-experiments#737 and
https://docs.datadoghq.com/tracing/guide/inferred-service-opt-in/?tab=python
timmc-edx added a commit to edx/configuration that referenced this issue Dec 3, 2024
Combined with agent changes, this will allow for unified service naming
so that we can query DB metrics for a service.

When enabled for an environment, this change will change all spans in
edxapp to use the same service name rather than integration-specific names
like `service:defaultdb`.

See edx/edx-arch-experiments#737 and
https://docs.datadoghq.com/tracing/guide/inferred-service-opt-in/?tab=python
@timmc-edx
Copy link
Member

timmc-edx commented Dec 6, 2024

We've configured Inferred Services on stage. Before moving to edge and prod I want to check if anyone has monitors or dashboards that are using service: queries for something generic (like mongo) rather than an IDA.

Method: I used datadog_search.py to find all monitors and dashboards matching .*service:.* and then got a list of all services in edx via a table query for env:prod aws_account_name:_edx-prod by service. I then removed IDA names from the latter and built a regex to look for service: terms that maybe didn't involve an IDA (like grep 'service:' ~/tmp/dd-search-service.out | grep 'service:[^:]*(django|elasticsearch|...|requests)' -P) and pruned that.

The only edX thing I could find was monitor 145569239 "prod elasticsearch cluster latency" from the Cosmonauts team, referencing service:elasticsearch. I'll check in with them before proceeding.

Edit: Monitor has been adjusted to remove the service tag, so it should be compatible with the change now.

@timmc-edx timmc-edx moved this from In Progress to Blocked in Arch-BOM Dec 12, 2024
@timmc-edx
Copy link
Member

Moving to blocked for the moment; I found a bug in ddtrace that makes Inferred Services somewhat unappealing for production use (django.cache spans are still getting service:django instead of the base service name). I've reported it to Datadog in https://help.datadoghq.com/hc/en-us/requests/1958899 along with a likely fix, and hopefully they can release the fix soon.

@timmc-edx
Copy link
Member

We should be unblocked from deploying to prod once edx/configuration#131 rolls out. It's a temporary fix, though, and I've added reverting that to the A/C.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Blocked
Development

No branches or pull requests

2 participants