Detecting and Mitigating "Mercurial Cores" on Linux (C++ Focused)¶
Based on: Cores That Don't Count โ Hochschild et al., HotOS '21 https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf
๐ง Overview¶
Some CPU cores in the field silently produce wrong results instead of crashing โ they "lie". These mercurial cores exhibit fail-silent Corruption Execution Errors (CEEs): computation errors that produce incorrect results without triggering hardware fault detection mechanisms.
Key Characteristics¶
- Rare but catastrophic: Occurring in a few cores per several thousand machines, but a single defective execution can corrupt databases, indexes, or results without warning
- Silent failures: Violations of the fundamental assumption that hardware errors are fail-stop
- Non-reproducible: CEEs manifest non-deterministically, requiring specific environmental conditions (voltage, temperature, frequency, aging)
- Data-pattern dependent: Corruption rates vary with specific data patterns
- Resistant to fixes: Cannot always be mitigated by microcode updates
Real-world Symptoms from Google's Fleet¶
- Lock semantics violations causing application crashes
- Vector and coherence operation corruption
- Deterministic AES computation errors (one core corrupted encryption such that only it could decrypt what it had wrongly encrypted)
- Garbage collection failures leading to data loss
- Non-deterministic database query results across replicas
- Repeated bit-flips at specific positions
- Kernel state corruption
Why This Matters for C++ Development¶
CEEs are significantly harder to detect than storage/network corruption because:
- Storage/network errors amortise checking costs across large data chunks
- Computational errors require checking at instruction granularity
- Per-instruction overhead makes detection expensive
๐งฉ 1. Detecting Mercurial Cores on Linux¶
The Detection Challenge¶
Mercurial cores are fundamentally difficult to detect because:
- They don't trigger hardware exceptions โ no MCE (Machine Check Exception) events
- Errors are non-deterministic โ may only appear under specific workloads or environmental conditions
- No single test catches them โ they passed manufacturing tests and operate "within spec"
๐ง System-level Monitoring¶
Enable and monitor MCE logging (though CEEs often bypass these):
sudo modprobe mce
sudo modprobe mce_inject # optional for testing
dmesg | grep -i mce
journalctl -k | grep "Machine check"
Install and configure rasdaemon (RAS = Reliability, Availability, Serviceability):
# Install hardware error monitoring daemon
sudo apt install rasdaemon
# Enable and start the service
sudo systemctl enable rasdaemon
sudo systemctl start rasdaemon
# Check for logged errors
sudo ras-mc-ctl --summary
sudo ras-mc-ctl --errors
# Monitor logs
journalctl -u rasdaemon -f
rasdaemon logs hardware errors including memory errors, PCIe errors, and MCE events. While it won't catch all mercurial core behaviour (CEEs are silent), it provides valuable baseline monitoring for hardware reliability.
๐งช C++ Test Harness for Core Validation¶
Comprehensive per-core test โ run identical workloads on all cores and compare results:
#include <iostream>
#include <vector>
#include <thread>
#include <cmath>
#include <format>
#include <ranges>
#include <sched.h>
double workload(int id) {
double sum = 0;
for (int i = 0; i < 5'000'000; ++i)
sum += std::sin(i * 0.001) * std::cos(i * 0.002 + id);
return sum;
}
void test_core(int cpu) {
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(cpu, &set);
sched_setaffinity(0, sizeof(set), &set);
double result = workload(cpu);
std::println("CPU {:>2}: result = {:.12f}", cpu, result);
}
int main() {
auto n = std::thread::hardware_concurrency();
std::println("Detected {} logical CPUs", n);
auto threads = std::vector<std::thread>{};
for (auto i : std::views::iota(0u, n))
threads.emplace_back(test_core, i);
for (auto &t : threads)
t.join();
std::println("โ
All cores tested โ compare results for anomalies.");
}
Cross-core verification โ compute on one core, verify on another:
#include <thread>
#include <cassert>
#include <format>
#include <iostream>
#include <sched.h>
auto compute() {
// Deterministic computation
double s = 0;
for (int i = 0; i < 1'000'000; ++i)
s += std::sin(i) * std::cos(i);
return s;
}
int main() {
auto result1 = compute();
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(1, &set); // Switch to core 1 for verification
std::thread verifier([&]() {
sched_setaffinity(0, sizeof(set), &set);
auto result2 = compute();
if (std::abs(result1 - result2) > 1e-9)
std::println("โ Cross-core mismatch detected!");
else
std::println("โ
Cross-core verification passed.");
});
verifier.join();
}
Check which CPU core is running โ minimal example for debugging:
#include <iostream>
#include <format>
#include <sched.h>
int main() {
int cpu = sched_getcpu();
std::println("Running on CPU: {}", cpu);
}
Usage recommendations: - Run these tests periodically as part of system health checks - Integrate into CI/CD pipelines for bare-metal deployments - Log results and monitor for drift over time - Consider temperature and load variations when interpreting results - Combine with stress testing to trigger environmental conditions
๐ฅ Stress Testing with stress-ng¶
Use stress-ng to trigger environmental conditions that may expose mercurial core behaviour:
# Install stress-ng
sudo apt install stress-ng
# CPU stress test with verification
# Uses all CPUs with built-in result verification
stress-ng --cpu 0 --verify --timeout 2m
# Floating-point heavy workload (matrix operations)
# More likely to expose numerical computation errors
stress-ng --matrix 0 --timeout 2m
# Combined stress test
stress-ng --cpu 0 --matrix 0 --verify --timeout 5m
# Target specific CPU cores
stress-ng --cpu 4 --taskset 0,1,2,3 --verify --timeout 2m
Why stress-ng helps:
- Increases temperature and power draw, potentially triggering marginal hardware
- High instruction throughput exercises more silicon pathways
- --verify flag checks computational correctness
- Floating-point operations (matrix) particularly sensitive to CEEs
Recommended approach:
1. Run stress-ng with --verify as baseline health check
2. Monitor for verification failures or unexpected results
3. Correlate failures with specific cores using taskset
4. Cross-reference with rasdaemon logs and system temperature
Application-level Detection Strategies¶
1. Cross-replica consistency checks (for distributed systems)
// Compare results across multiple replicas
auto validate_computation(std::span<const Result> replicas) -> std::optional<Result> {
if (replicas.size() < 2)
return {};
auto reference = replicas[0];
for (auto i = 1uz; i < replicas.size(); ++i)
if (replicas[i] != reference)
return {}; // Inconsistency detected
return reference;
}
2. Redundant computation (factor of 2ร overhead)
// Compute twice and compare
auto checked_compute(const Input& data) -> std::optional<Result> {
auto result1 = compute(data);
auto result2 = compute(data);
return (result1 == result2) ? std::optional{result1} : std::nullopt;
}
3. Invariant checking (exploit domain knowledge)
// Use mathematical properties to validate results
auto validate_fft(std::span<const Complex> input,
std::span<const Complex> fft_output) -> bool {
// Parseval's theorem: energy preserved in frequency domain
auto time_energy = std::transform_reduce(
input.begin(), input.end(), 0.0,
std::plus{}, [](auto c) { return std::norm(c); });
auto freq_energy = std::transform_reduce(
fft_output.begin(), fft_output.end(), 0.0,
std::plus{}, [](auto c) { return std::norm(c); });
auto tolerance = 1e-6 * time_energy;
return std::abs(time_energy - freq_energy / input.size()) < tolerance;
}
4. Monitoring non-determinism¶
// Track unexpected variations in deterministic computations
struct ComputationMonitor {
std::unordered_map<Input, Result> expected_results;
auto check_determinism(const Input& in, const Result& out) -> bool {
if (auto it = expected_results.find(in); it != expected_results.end())
return it->second == out;
expected_results[in] = out;
return true; // First time seeing this input
}
};
๐ก๏ธ 2. Mitigation Strategies¶
Hardware-level Approaches¶
Triple Modular Redundancy (TMR) โ requires 3ร computation with voting: - Three parallel executions on different cores - Majority voting determines correct result - Can correct single-core errors - Substantial overhead (3ร compute, synchronisation cost)
Core isolation โ remove suspect cores from scheduling:
# Disable CPU core 7
echo 0 | sudo tee /sys/devices/system/cpu/cpu7/online
# Verify
lscpu --extended
Software-level Strategies¶
1. Architecture shift: Prefer simpler processor designs - ARM-based processors (AWS Graviton) over complex x86 - Energy efficiency gains may correlate with reduced CEE exposure - Simpler pipelines and execution units = fewer failure modes
2. Critical path protection: Apply expensive checks selectively
// Only validate critical operations
enum class ValidationLevel { None, Basic, Paranoid };
template<ValidationLevel Level = ValidationLevel::None>
auto encrypt_data(std::span<const std::byte> plaintext) -> std::vector<std::byte> {
auto ciphertext = aes_encrypt(plaintext);
if constexpr (Level == ValidationLevel::Paranoid) {
// Decrypt and verify for critical operations
auto decrypted = aes_decrypt(ciphertext);
if (decrypted != plaintext)
throw std::runtime_error("Encryption validation failed");
}
return ciphertext;
}
3. Checksums and error detection codes¶
// Add CRC32 to computational results
struct ValidatedResult {
Result data;
std::uint32_t checksum;
static auto compute(const Input& in) -> ValidatedResult {
auto result = expensive_computation(in);
auto crc = compute_crc32(std::as_bytes(std::span{&result, 1}));
return {result, crc};
}
auto validate() const -> bool {
auto expected = compute_crc32(std::as_bytes(std::span{&data, 1}));
return checksum == expected;
}
};
๐ 3. Measurement and Focus Areas for C++ on Linux¶
What to Measure¶
1. Result consistency across runs and cores
# Benchmark deterministic computation across all cores
for core in $(seq 0 $(nproc --all)); do
taskset -c $core ./benchmark > results_core_${core}.txt
done
# Compare outputs
md5sum results_core_*.txt | sort | uniq -c
2. Core-specific failure rates¶
// Track errors per core
struct CoreMetrics {
std::atomic<std::size_t> computations{0};
std::atomic<std::size_t> validation_failures{0};
auto failure_rate() const -> double {
auto total = computations.load();
return total > 0 ?
static_cast<double>(validation_failures.load()) / total : 0.0;
}
};
inline auto per_core_metrics =
std::array<CoreMetrics, 128>{}; // Adjust size as needed
void record_computation_result(bool validation_passed) {
auto core = sched_getcpu();
if (core >= 0 and core < per_core_metrics.size()) {
per_core_metrics[core].computations.fetch_add(1);
if (not validation_passed)
per_core_metrics[core].validation_failures.fetch_add(1);
}
}
3. Environmental correlation¶
# Monitor CPU temperature and frequency during suspect behaviour
watch -n 1 'paste <(cat /sys/class/thermal/thermal_zone*/type) \
<(cat /sys/class/thermal/thermal_zone*/temp) | \
column -s $"\t" -t'
# CPU frequency
watch -n 1 'grep "cpu MHz" /proc/cpuinfo'
Focus Areas for C++ Development¶
1. Lock-free data structures โ particularly vulnerable to CEEs - Lock semantics violations are a documented symptom - Consider adding validation to critical sections - Test on diverse hardware configurations
2. Cryptographic operations¶
- CEEs can cause deterministic encryption errors
- Always validate critical crypto operations
- Consider hardware alternatives (AES-NI) with TMR for high-security contexts
3. Numerical computations¶
- Vector operations are vulnerable
- Use invariant checking (conservation laws, symmetries)
- Cross-validate critical results
4. Database and index operations¶
- Non-deterministic query results are a red flag
- Implement consistency checks across replicas
- Hash-based validation for critical index operations
๐ฏ Practical Recommendations¶
For Production C++ Systems¶
- Assume hardware can lie โ design validation into critical paths
- Cross-validate across replicas when possible
- Monitor for non-determinism in deterministic computations
- Track per-core metrics to identify suspect hardware
- Use invariant checking to detect corruption cheaply
- Consider architecture tradeoffs โ simpler CPUs may be more reliable
When to Worry¶
- Large-scale deployments (thousands of servers)
- Critical correctness requirements (finance, cryptography, databases)
- Long-running computations where silent corruption compounds
- Lock-free concurrent algorithms
- Numerically intensive workloads
When Not to Worry¶
- Small deployments (few machines)
- Short-lived processes with external validation
- Applications with strong end-to-end checksums
- Systems with comprehensive integration testing