The first idea is the problem of transmission-line effects showing up in on-chip interconnect. What is that? Chips are so fast now that they have to model microscopic wires across silicon as "long-distance", like the 10,000+ volt power lines that go cross-country. Similarly, there's an approach to computer architecture called Network on a Chip that turns your CPU into its own LAN. Weird. This sounds like "Powers of 10" or a fractal-based approach to solving scaling issues: If you can't make the big thing smaller, make the small thing act like a bunch of the big things (computer:ethernet::processor:network-on-a-chip).
The second idea is transactions across distributed systems. App Engine uses distributed transactions for Datastore transactions. To do this it relies on Megastore, a lesser-known piece of infrastructure at Google (though published). Megastore does inter-row transactions on Bigtable using transaction logs. It has the semantics of optimistic concurrency that are similar to STM in Haskell, like how the developer can
retry
the current transaction or give up. These two ideas connect when you consider that STM is kind of modeling a single running application as a distributed system across many cores; instead of committing transactions to disk it commits them to RAM. For me this brings up a ton of performance concerns at different scales. Here's where I'd like some help in understanding some details.
At the macro scale, I know that distributed transactions have a large coordination overhead that limits their throughput significantly. A rule of thumb for App Engine is don't rely on a single entity group (the primitive of transactionality) exceeding 1 transaction per second for 1MB rows (this is different than cross-entity-group writes, which are effectively boundless). What is the practical limit for writes/second of 1MB of data to a single variable in STM? Awful ballpark: If a distributed transaction commit takes 1 second due to disk seeks/network latency and RAM is 100x faster than disk, then naively single-variable STM-writes under gross contention should max out at ~100MB writes/sec (I bet that's way off).
At the micro scale, I know that accessing the same part of memory from multiple cores simultaneously can be extremely expensive. The reason is that each core has its own caches; writing to shared memory regions requires cross-core cache invalidations to keep the independent caches coherent. As I understand it, the reason why tcmalloc and other thread-local libraries have such huge performance gains is because they avoid cache performance issues. With multiple threads/cores writing to the same data locations, how can STM work around cache invalidation issues to achieve good memory performance?
So far it seems like STM is mostly used for lock-free/low-overhead collection datastructures (linked-lists, binary trees, hashtables), which may have significantly lower coordination overhead (though still significant) due to their size and shape. I'd love to see an apples-to-apples comparison of locked memory throughput of STM versus traditional mutexes.