CUDA

+-----------------------------------+ +-----------------------------------+
|             CPU CORE              | |             GPU CORES             |
| [ALU] [ALU] [ Control Unit ]      | | [A][A][A][A] [A][A][A][A] [Ctrl]  |
|                                   | | [A][A][A][A] [A][A][A][A] [Ctrl]  |
| [ L1/L2/L3 Cache ]                | | [A][A][A][A] [A][A][A][A] [Ctrl]  |
+-----------------------------------+ +-----------------------------------+
  Optimized for Sequential Latency      Optimized for Parallel Throughput

The CPU: Latency Oriented

Thread: The smallest unit of execution. Each thread runs one instance of a CUDA kernel, typically on a single data element.
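To make the one-thread-per-element idea concrete, here is a minimal sketch. The kernel name `scale`, the helper `runScale`, and the launch configuration of 256 threads per block are illustrative assumptions, not taken from the text; managed memory is used only to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread runs this same function; the computed index i selects
// the single element that this particular thread is responsible for.
__global__ void scale(const float* in, float* out, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // extra threads in the last block do nothing
        out[i] = in[i] * factor;
    }
}

// Hypothetical helper: returns the number of mismatches, or -1 on a CUDA error.
int runScale(int n) {
    float *in = nullptr, *out = nullptr;
    // Managed (unified) memory is accessible from both CPU and GPU code.
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(in);
        return -1;
    }
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);

    const int threadsPerBlock = 256;                                 // assumed block size
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    scale<<<blocks, threadsPerBlock>>>(in, out, 2.0f, n);

    int mismatches = -1;
    if (cudaDeviceSynchronize() == cudaSuccess) {  // wait for the kernel to finish
        mismatches = 0;
        for (int i = 0; i < n; ++i) {
            if (out[i] != 2.0f * in[i]) ++mismatches;
        }
    }
    cudaFree(in);
    cudaFree(out);
    return mismatches;
}

int main() {
    int result = runScale(1000);
    printf("mismatches: %d\n", result);
    return result == 0 ? 0 : 1;
}
```

With 1000 elements and 256 threads per block, four blocks are launched, so 24 threads in the last block fall outside the array; the `i < n` guard is what keeps them from writing out of bounds.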

The CUDA programming model relies on hardware-software co-design that splits execution between the Host (CPU) and the Device (GPU).

1. Heterogeneous Computing
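The split between CPU and GPU can be sketched as a typical round trip: the CPU side allocates GPU memory, copies the inputs over, launches a kernel, and copies the result back. The names `vectorAdd` and `runVectorAdd` are hypothetical, and the block size of 256 is an assumption.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Device code: runs on the GPU, one thread per output element.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host code: runs on the CPU and orchestrates the GPU.
// Returns the number of wrong results, or -1 on a CUDA error.
int runVectorAdd(int n) {
    std::vector<float> hA(n, 1.0f), hB(n, 2.0f), hC(n, 0.0f);  // CPU-side buffers
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;          // GPU-side buffers
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess;

    if (ok) {
        // Step 1: copy inputs from CPU memory to GPU memory.
        ok = cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
             cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    }
    if (ok) {
        // Step 2: launch the kernel; the call returns to the CPU immediately.
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        vectorAdd<<<blocks, threads>>>(dA, dB, dC, n);
        ok = cudaGetLastError() == cudaSuccess;
    }
    if (ok) {
        // Step 3: copy the result back; this blocking copy also waits for the kernel.
        ok = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dA);  // cudaFree(nullptr) is a harmless no-op
    cudaFree(dB);
    cudaFree(dC);
    if (!ok) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        if (hC[i] != 3.0f) ++wrong;  // 1.0f + 2.0f is exactly 3.0f
    }
    return wrong;
}

int main() {
    int wrong = runVectorAdd(1 << 20);
    printf("wrong results: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

The explicit copies are the visible cost of this model: for small workloads the transfer time can outweigh the speedup from parallel execution.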

Thrust: A parallel C++ template library similar to the Standard Template Library (STL).

2. CUDA in Artificial Intelligence

cuBLAS: A complete implementation of the Basic Linear Algebra Subprograms (BLAS) for matrix operations.
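As an illustration of a BLAS matrix operation on the GPU, the sketch below multiplies two 2x2 matrices with `cublasSgemm`. It assumes the library described above is NVIDIA's cuBLAS, which the text does not name explicitly; the helper `runSgemmIdentity` is hypothetical. cuBLAS expects column-major storage, and the program must be linked with `-lcublas`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Computes C = A * I on the GPU and checks that C equals A.
// Returns the number of mismatches, or -1 on a CUDA/cuBLAS error.
int runSgemmIdentity() {
    constexpr int n = 2;
    // Column-major storage: A = [[1, 2], [3, 4]] is laid out as {1, 3, 2, 4}.
    const float hA[n * n] = {1.0f, 3.0f, 2.0f, 4.0f};
    const float hB[n * n] = {1.0f, 0.0f, 0.0f, 1.0f};  // identity matrix
    float hC[n * n] = {0.0f, 0.0f, 0.0f, 0.0f};

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cublasHandle_t handle = nullptr;
    bool ok = cudaMalloc(&dA, sizeof(hA)) == cudaSuccess &&
              cudaMalloc(&dB, sizeof(hB)) == cudaSuccess &&
              cudaMalloc(&dC, sizeof(hC)) == cudaSuccess &&
              cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS;

    if (ok) {
        ok = cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice) == cudaSuccess &&
             cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice) == cudaSuccess;
    }
    if (ok) {
        // SGEMM computes C = alpha * A * B + beta * C.
        const float alpha = 1.0f, beta = 0.0f;
        ok = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                         n, n, n,         // dimensions m, n, k
                         &alpha, dA, n,   // A and its leading dimension
                         dB, n,           // B and its leading dimension
                         &beta, dC, n)    // C and its leading dimension
             == CUBLAS_STATUS_SUCCESS;
    }
    if (ok) {
        ok = cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    if (handle) cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (!ok) return -1;

    int mismatches = 0;
    for (int i = 0; i < n * n; ++i) {
        if (hC[i] != hA[i]) ++mismatches;  // A * I must reproduce A
    }
    return mismatches;
}

int main() {
    int mismatches = runSgemmIdentity();
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Multiplying by the identity keeps the check trivial; in practice the same call is used for the large dense products that dominate neural-network workloads.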