Details on Dynamic Parallelism were hard to find after the feature was introduced during yesterday's GTC 2012 keynote. Nvidia has now followed up with a short whitepaper that describes how it works.
Dynamic Parallelism in CUDA is supported via an extension to the CUDA programming model that enables a CUDA kernel to create and synchronize new nested work. Basically, a parent CUDA kernel can launch a child CUDA kernel from the GPU and then optionally synchronize on the completion of that child kernel. The parent kernel can then consume the output produced by the child kernel, all without CPU involvement.
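To make the nested launch concrete, here is a minimal sketch of a parent kernel launching a child kernel from device code. The kernel names (`parentKernel`, `childKernel`), the array size, and the launch dimensions are hypothetical and not taken from the whitepaper. The sketch relies on the rule that a parent grid is not considered complete until its child grids have finished, so the host sees the child's output after a single host-side synchronize. It deliberately leaves out the optional explicit synchronization inside the parent: the original model exposed that as a device-side `cudaDeviceSynchronize()` call, which later CUDA releases deprecated, so it may not build on current toolkits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel (hypothetical): each thread doubles one element.
__global__ void childKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2;
    }
}

// Parent kernel (hypothetical): one thread launches the child grid
// directly from the GPU, with no round trip to the CPU.
__global__ void parentKernel(int *data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        childKernel<<<blocks, threads>>>(data, n);
    }
}

int main() {
    const int n = 256;
    int host[n];
    for (int i = 0; i < n; ++i) {
        host[i] = i;
    }

    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    // The host launches only the parent; the child is created on the device.
    parentKernel<<<1, 32>>>(dev, n);

    // The parent grid completes only after its child grid has completed,
    // so this host-side wait also covers the child's work.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(dev);
        return 1;
    }

    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[10] = %d\n", host[10]);
    return 0;
}
```

Building this requires relocatable device code and the device runtime library, for example `nvcc -rdc=true example.cu -lcudadevrt` with an `-arch` flag for a GPU of compute capability 3.5 or higher; the file name is a placeholder.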