On a nice May day, we spent a whole day going through Elastic APM Go Agent code, to try to figure out why it doesn't log Spans, and all Transactions of a certain kind are just always not sampled. Once again, because wrote our own Elastic APM PHP Agent, we got bit in the behind and spent a better part of the day WTF-ing through all the code, just to switch ONE BIT in the end, to solve the issue.
So. We have this large PHP system with smaller bits of Go micro services around it. We have been using Elastic APM to increase overall application visibility, logging errors, timings etc. One of the sweet things about APM traces is that they show correlated requests across applications. Here's one example:
In this case I just extended some PHP code (blue - running our APM PHP Agent), that in turn calls Go (green - running Elasic APM Go Agent) over HTTP.
The issue was, Go trace was not sampled. No Spans were included, no detailed Context data!
However other transactions in this Go service DID! Well there must be something wrong with the Go code!
We went over all configuration, initialization code, all HTTP routing and mounting code, all Elastic APM Go Agent code and could not find what the difference was between this HTTP handler and other ones in the same service. None!
Until my college suggested I do a direct CURL request, not one from the PHP service.
Lo and behold, Spans and Context data was there!!!
Well as it turns out, there was a bug in our PHP implementation. The value of the Traceparent Flag field was not properly populated. As described in the W3C Trace Context specification there is just one single flag defined that has some meaning:
static final byte FLAG_SAMPLED = 1; // 00000001 ... boolean sampled = (traceFlags & FLAG_SAMPLED) == FLAG_SAMPLED;
So the code was changed. One bit was changed:
- $_SERVER['last_TP'] = '00-'.$this->trace_id.'-'.$this->id.'-00'; + $_SERVER['last_TP'] = '00-'.$this->trace_id.'-'.$this->id.'-01';
After this was fixed, even nested traces were complete: