Remap shared services for edx-edxapp in DD #737

robrap · 2024-07-26T14:41:53Z

The problem is that default DD configuration for mysql, django (cache), defaultdb, requests, and potentially other libraries has each of these going to a shared service of the same name that is used across all of our IDA services.

Problems with this default approach:

It is difficult to determine what span data belongs to which IDA/service.
There is some metric data that may be impossible to separate between IDAs/services.

Acceptance Criteria

Decide on and build a Proof of Concept for a service naming scheme that associates other spans (e.g. mysql, django (cache), defaultdb, mysql, requests, kafka, etc.) with the edxapp service.
- Ensure spans do not have multiple service tags (may require SRE support)
- Ensure we've addressed the complete list of dependent services.
- Ensure the primary operation of the edxapp services remain as they are (e.g. django.requests, etc.).
Document (possibly in a wiki ADR) why we made the choices we made, and communicate to other service owners. Ideally they can participate in ADR review, so we get a consistent solution across all of our services.
implement the decided upon service naming scheme for edxapp
- Approach: Inferred Services
Clean up
- Revert fix: Fix django.cache service tags for Inferred Services config configuration#131 once https://help.datadoghq.com/hc/en-us/requests/1958899 is resolved

Notes:

DD support ticket https://help.datadoghq.com/hc/en-us/requests/1781912 asks a variety of questions about this topic, specifically targeting django (cache) and requests services. Questions and answers should ultimately be copied out of this ticket once it is complete.
DD support ticket https://help.datadoghq.com/hc/en-us/requests/1762193 asks similar questions about mysql, but has some additional information about metrics and multiple service tags (which should be avoided). Questions and answers should ultimately be copied out of this ticket once it is complete.
DD support ticket https://help.datadoghq.com/hc/en-us/requests/1873842 summarizes some of the above as well as other options we've considered, and asks for guidance.
There is an open (support) question about whether using DD_SERVICE_MAPPING is a simpler method of remapping, rather than using a variety of different settings.
- However, remapping django to cache might be risky, because django could be used for something else in the future, so in this particular case we might want to use DD_DJANGO_CACHE_SERVICE_NAME instead.
For each service (e.g. mysql, django, etc.) we decide to remap, we need to choose between the IDA service service:edx-edxapp-lms (spans would still have a different operation_name), or service:edx-edxapp-lms-cache (a new DD service catalog service).
- If everything went to the same service, we may need to adjust the primary operation_name for edx-edxapp-lms if it no longer defaults to operation_name:django.requests. See these DD docs for configuring the primary operation.
If we go with separate DD service names:
- We would probably use the following naming convention:
  - service:edx-edxapp-lms
  - service:edx-edxapp-lms-cache (was django)
  - service:edx-edxapp-lms-defaultdb
  - Etc.
- Unfortunately, I chose to go with service:edx-edxapp-lms-workers rather than service:edx-edxapp-workers-lms.
  - We might need to change this, because presumably we'd want service:edx-edxapp-workers-lms-cache, etc., and if we used a search like service:edx-edxapp-lms* to pick up all services related to service:edx-edxapp-lms, we would not want that to also pick up the worker service and all its sub-services.
  - Should we do this as a separate, pre-emptive ticket? We could use expand/contract to ensure monitors (listed in DD) and dashboards, etc. are updated before the name is changed. Communications will be required.
ADR should state that we are rejecting the status quo of a shared DD service across all of our IDAs.
What is the full list of services to address? DD has a dependency graph somewhere, and some other possible services may includes elasticsearch, redis, read_replicadb. Please review in DD to get the full list of affect dependencies. Note that some general shared services (e.g. aws.s3) probably should remain shared, but this could be discussed.

ADRish page: https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1265598591/Datadog+Service+mapping

The text was updated successfully, but these errors were encountered:

robrap · 2024-08-05T21:46:34Z

Answers from DD Support ticket:

Coming back from engineering to let you know answers to your queries:

a. Are there any downsides of using DD_SERVICE_MAPPING, over the individual settings like DD_DJANGO_CACHE_SERVICE_NAME? Which is preferred? Which takes precedence?

DD_SERVICE_MAPPING allows you to remap service names in a more generalized way. It is useful when you have multiple integrations or need a consistent naming scheme across various services.
Individual settings like DD_DJANGO_CACHE_SERVICE_NAME are more specific and used for fine-grained control over the naming of particular integrations.
Precedence: Typically, DD_SERVICE_MAPPING would take precedence over individual settings because it applies a broader remapping rule. However, specific integration settings can override this if explicitly defined.

Downsides:

Using DD_SERVICE_MAPPING can be less granular, potentially leading to confusion if not documented properly.
Individual settings provide more control but can be cumbersome to manage for large-scale applications with multiple integrations.

Preferred Option:

The preferred option depends on the use case. For broad consistency, DD_SERVICE_MAPPING is preferred. For fine-grained control, individual settings are preferred.

b. If we were to set DD_TRACE_SAMPLING_RULES with service-specific rules, would we use the original service name (e.g. django), or the mapped service name? Is the answer different for each option from question 1a?

When setting DD_TRACE_SAMPLING_RULES, use the mapped service name if you are using DD_SERVICE_MAPPING. This ensures that the rules apply correctly to the spans after they have been remapped.
The same applies to individual settings. If you are changing the service name using specific integration settings, use the new service name in the sampling rules.

Regarding service Mapping and operation names:

If you map services to a single service (e.g., edx-edxapp-lms), you can differentiate by operation_name.
However, if there is a risk that cache calls might outnumber Django requests, it might change the primary operation name in the Datadog UI.

In this case I would recommend:

To maintain differentiation and control, consider mapping services to distinct names (e.g., edx-edxapp-lms-cache for cache operations).
This approach allows for clearer visibility and avoids the potential issue of the UI selecting the most popular operation name.
You can also select primary operation manually in the UI

robrap · 2024-08-06T20:42:40Z

The following ticket is related to this ticket, because it would drop some of the services that need to be remapped (e.g. read_replicadb, defaultdb, django (cache)).

Disable django db and cache instrumentation for edxapp #761

robrap · 2024-08-08T12:58:59Z

@timmc-edx: Should this wait on the trace concatenation fix so we don't bring in a lot of new spans? Not sure how this will affect things.

timmc-edx · 2024-08-08T13:17:40Z

Hmm... I don't know that the two will interact that much.

timmc-edx · 2024-08-15T21:33:42Z

I'll be using https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1265598591/Datadog+Service+mapping to build out something ADR-like and track testing in edx-platform.

robrap · 2024-08-20T17:55:50Z

@timmc-edx: I noticed the following chart on the Service Catalog will have varying usefulness depending on the direction we choose. - - The default view is for Downstream Services.

There is a view by Type, but it is not the default.
There is also a Dependency Map view. I wonder if its numbers are misleading, and are for the downstream service overall, and not filtered based on the incoming service?

timmc-edx · 2024-10-01T18:36:35Z

OK, I think I've come around to suffixing—both because of how things will work in DD's UI and because I realized I'd made an incorrect assumption earlier that edxapp was like other services. :-P It wouldn't be edx-edxapp-lms-cache, because LMS and CMS (as far as I'm aware) share a cache. We'd instead use edx-edxapp-cache. Similarly, the workers share a DB, so the string -workers wouldn't get stuck in there anywhere either. This makes that option for service names feel much more manageable to me, and now is my preferred option.

timmc-edx · 2024-10-02T14:58:09Z

...actually no, that doesn't make sense at all, as these are client side spans and we want to distinguish LMS vs. CMS client-side traffic.

None of the options we have really make sense, so I'm going to open a ticket with Datadog to see how they would recommend solving this.

timmc-edx · 2024-10-04T20:52:22Z

Opened https://help.datadoghq.com/hc/en-us/requests/1873842 with options and pros/cons as we understand them, asking for guidance.

timmc-edx · 2024-10-28T16:40:24Z

I've indicated interest in the Inferred Services Beta using the form linked at the top of https://docs.datadoghq.com/tracing/guide/inferred-service-opt-in/

robrap · 2024-11-07T19:29:17Z

@timmc-edx: Should this be marked as blocked while waiting in DD? Also, should we follow-up with Shannon from DD (when she is back) to see if we can get help moving this along?

timmc-edx · 2024-11-07T20:38:15Z

Yes, moved to blocked. Agreed that we should ask Shannon to look into why we haven't gotten a response after I requested access to the beta.

robrap · 2024-11-07T21:12:05Z

I sent an email to DD, and we'll see what happens. Also added the waiting-on-other-team label.

timmc-edx · 2024-11-19T14:55:27Z

We should be in the beta program now.

Combined with agent changes, this will allow for unified service naming so that we can query DB metrics for a service. When enabled for an environment, this change will change all spans in edxapp to use the same service name rather than integration-specific names like `service:defaultdb`. See edx/edx-arch-experiments#737 and https://docs.datadoghq.com/tracing/guide/inferred-service-opt-in/?tab=python

timmc-edx · 2024-12-06T22:26:15Z

We've configured Inferred Services on stage. Before moving to edge and prod I want to check if anyone has monitors or dashboards that are using service: queries for something generic (like mongo) rather than an IDA.

Method: I used datadog_search.py to find all monitors and dashboards matching .*service:.* and then got a list of all services in edx via a table query for env:prod aws_account_name:_edx-prod by service. I then removed IDA names from the latter and built a regex to look for service: terms that maybe didn't involve an IDA (like grep 'service:' ~/tmp/dd-search-service.out | grep 'service:[^:]*(django|elasticsearch|...|requests)' -P) and pruned that.

The only edX thing I could find was monitor 145569239 "prod elasticsearch cluster latency" from the Cosmonauts team, referencing service:elasticsearch. I'll check in with them before proceeding.

Edit: Monitor has been adjusted to remove the service tag, so it should be compatible with the change now.

timmc-edx · 2024-12-12T16:58:46Z

Moving to blocked for the moment; I found a bug in ddtrace that makes Inferred Services somewhat unappealing for production use (django.cache spans are still getting service:django instead of the base service name). I've reported it to Datadog in https://help.datadoghq.com/hc/en-us/requests/1958899 along with a likely fix, and hopefully they can release the fix soon.

timmc-edx · 2024-12-12T22:00:13Z

We should be unblocked from deploying to prod once edx/configuration#131 rolls out. It's a temporary fix, though, and I've added reverting that to the A/C.

robrap added this to Arch-BOM Jul 26, 2024

robrap converted this from a draft issue Jul 26, 2024

This was referenced Jul 26, 2024

Ensure Kafka Event Bus APM monitoring #658

Closed

Kafka consumer service naming conventions in DD #740

Closed

jristau1984 moved this to Groomed in Arch-BOM Jul 29, 2024

robrap mentioned this issue Jul 29, 2024

edxapp hosts are not tagged correctly #724

Open

jristau1984 moved this from Ready For Development to Backlog in Arch-BOM Aug 5, 2024

jristau1984 moved this from Backlog to Ready For Development in Arch-BOM Aug 8, 2024

timmc-edx self-assigned this Aug 8, 2024

timmc-edx moved this from Ready For Development to In Progress in Arch-BOM Aug 8, 2024

timmc-edx moved this from In Progress to Blocked in Arch-BOM Nov 7, 2024

robrap added the waiting-on-other-team label Nov 7, 2024

timmc-edx removed the waiting-on-other-team label Nov 19, 2024

timmc-edx mentioned this issue Dec 3, 2024

feat: Support for Datadog Inferred Services at the library level edx/configuration#122

Merged

3 tasks

timmc-edx moved this from In Progress to Blocked in Arch-BOM Dec 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remap shared services for edx-edxapp in DD #737

Remap shared services for edx-edxapp in DD #737

robrap commented Jul 26, 2024 •

edited by timmc-edx

Loading

robrap commented Aug 5, 2024

robrap commented Aug 6, 2024

robrap commented Aug 8, 2024

timmc-edx commented Aug 8, 2024

timmc-edx commented Aug 15, 2024

robrap commented Aug 20, 2024

timmc-edx commented Oct 1, 2024

timmc-edx commented Oct 2, 2024

timmc-edx commented Oct 4, 2024

timmc-edx commented Oct 28, 2024

robrap commented Nov 7, 2024

timmc-edx commented Nov 7, 2024

robrap commented Nov 7, 2024

timmc-edx commented Nov 19, 2024

timmc-edx commented Dec 6, 2024 •

edited

Loading

timmc-edx commented Dec 12, 2024

timmc-edx commented Dec 12, 2024

Remap shared services for edx-edxapp in DD #737

Remap shared services for edx-edxapp in DD #737

Comments

robrap commented Jul 26, 2024 • edited by timmc-edx Loading

robrap commented Aug 5, 2024

robrap commented Aug 6, 2024

robrap commented Aug 8, 2024

timmc-edx commented Aug 8, 2024

timmc-edx commented Aug 15, 2024

robrap commented Aug 20, 2024

timmc-edx commented Oct 1, 2024

timmc-edx commented Oct 2, 2024

timmc-edx commented Oct 4, 2024

timmc-edx commented Oct 28, 2024

robrap commented Nov 7, 2024

timmc-edx commented Nov 7, 2024

robrap commented Nov 7, 2024

timmc-edx commented Nov 19, 2024

timmc-edx commented Dec 6, 2024 • edited Loading

timmc-edx commented Dec 12, 2024

timmc-edx commented Dec 12, 2024

robrap commented Jul 26, 2024 •

edited by timmc-edx

Loading

timmc-edx commented Dec 6, 2024 •

edited

Loading