CUDA Tuning Instruction-Level Primitives (2): Standard Functions vs. Intrinsic Functions


Standard functions and intrinsic functions differ in both numerical accuracy and performance. Standard functions cover most common mathematical operations. Many of them, however, have intrinsic equivalents that implement the same functionality with fewer instructions and better performance, at the cost of lower numerical accuracy.

For example, consider the following program:

#include "freshman.h" // provides the CHECK error macro and the cpuSecond() timer
#include <stdio.h>
#include <stdlib.h>
#include <cmath>
/**
 * This example demonstrates the relative performance and accuracy of CUDA
 * standard and intrinsic functions.
 *
 * The computational kernel of this example is the iterative calculation of a
 * value squared. This computation is done on the host, on the device with a
 * standard function, and on the device with an intrinsic function. The results
 * from all three are compared for numerical accuracy (with the host as the
 * baseline), and the performance of standard and intrinsic functions is also
 * compared.
 **/

 /**
  * Perform iters power operations using the standard powf function.
  **/
__global__ void standard_kernel(float a, float* out, int iters)
{
    int i;
    int tid = (blockDim.x * blockIdx.x) + threadIdx.x;

    if (tid == 0)
    {
        float tmp;

        for (i = 0; i < iters; i++)
        {
            tmp = powf(a, 2.0f);
        }

        *out = tmp;
    }
}

/**
 * Perform iters power operations using the intrinsic __powf function.
 **/
__global__ void intrinsic_kernel(float a, float* out, int iters)
{
    int i;
    int tid = (blockDim.x * blockIdx.x) + threadIdx.x;

    if (tid == 0)
    {
        float tmp;

        for (i = 0; i < iters; i++)
        {
            tmp = __powf(a, 2.0f);
        }

        *out = tmp;
    }
}

int main(int argc, char** argv)
{
    int i;
    int runs = 30;
    int iters = 1000;

    float *d_standard_out;  // device buffer for the standard-function result
    float h_standard_out;   // host copy of that result
    CHECK(cudaMalloc((void**)&d_standard_out, sizeof(float)));

    float *d_intrinsic_out; // device buffer for the intrinsic-function result
    float h_intrinsic_out;  // host copy of that result
    CHECK(cudaMalloc((void**)&d_intrinsic_out, sizeof(float)));

    float input_value = 8181.25;

    double mean_intrinsic_time = 0.0;
    double mean_standard_time = 0.0;

    for (i = 0; i < runs; i++)
    {
        double start_standard = cpuSecond();
        standard_kernel<<<1, 32>>>(input_value, d_standard_out, iters);
        CHECK(cudaDeviceSynchronize());
        mean_standard_time += cpuSecond() - start_standard;

        double start_intrinsic = cpuSecond();
        intrinsic_kernel<<<1, 32>>>(input_value, d_intrinsic_out, iters);
        CHECK(cudaDeviceSynchronize());
        mean_intrinsic_time += cpuSecond() - start_intrinsic;
    }

    CHECK(cudaMemcpy(&h_standard_out, d_standard_out, sizeof(float),
        cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(&h_intrinsic_out, d_intrinsic_out, sizeof(float),
        cudaMemcpyDeviceToHost));
    float host_value = powf(input_value, 2.0f);

    printf("Host calculated\t\t\t%f\n", host_value);
    printf("Standard Device calculated\t%f\n", h_standard_out);
    printf("Intrinsic Device calculated\t%f\n", h_intrinsic_out);
    printf("Host equals Standard?\t\t%s diff=%e\n",
        host_value == h_standard_out ? "Yes" : "No",
        fabs(host_value - h_standard_out));
    printf("Host equals Intrinsic?\t\t%s diff=%e\n",
        host_value == h_intrinsic_out ? "Yes" : "No",
        fabs(host_value - h_intrinsic_out));
    printf("Standard equals Intrinsic?\t%s diff=%e\n",
        h_standard_out == h_intrinsic_out ? "Yes" : "No",
        fabs(h_standard_out - h_intrinsic_out));
    printf("\n");
    printf("Mean execution time for standard function powf:    %f s\n",
        mean_standard_time);
    printf("Mean execution time for intrinsic function __powf: %f s\n",
        mean_intrinsic_time);

    CHECK(cudaFree(d_standard_out));
    CHECK(cudaFree(d_intrinsic_out));

    return 0;
}
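
Assuming the file is saved as intrinsic.cu and that freshman.h is on the include path, it can be built and run with something like this (the architecture flag is an assumption; match it to your GPU):

nvcc -O2 -arch=sm_70 intrinsic.cu -o intrinsic
./intrinsic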

The results are shown in the figure below:

As the figure shows, both the standard function powf and the intrinsic __powf produce results that differ from the host-side powf. The standard function's error is noticeably smaller, but the intrinsic computes faster.

Even when using numerically stable CUDA functions, results computed on the GPU will not necessarily match those of a traditional CPU-only application. Because floating-point arithmetic is inherently imprecise on both the host and the device, it is sometimes hard to say which of two outputs is the more accurate one.

When porting an application from the CPU to the GPU, you therefore also need to state the error tolerance you are willing to accept.
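
As a minimal sketch of what such a tolerance check might look like (the helper name within_tolerance is hypothetical, not part of the original program):

#include <math.h>

/* Accept a GPU result if it lies within a relative tolerance of the CPU
 * baseline; the absolute tolerance guards against a baseline near zero. */
int within_tolerance(float host, float device, float rel_tol, float abs_tol)
{
    float diff = fabsf(host - device);
    return diff <= abs_tol || diff <= rel_tol * fabsf(host);
}

A call such as within_tolerance(host_value, h_intrinsic_out, 1e-5f, 1e-9f) could then replace the exact == comparisons used in the program above.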

Manipulating instruction generation:

In many cases you can improve performance simply by swapping a standard function for its intrinsic counterpart, as in the sketch below.
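
A minimal sketch of such a rewrite (the kernel name and the choice of functions are illustrative assumptions; __sinf, __expf, and __fdividef are real CUDA intrinsics):

__global__ void intrinsic_rewrite(const float *in, float *out, int n)
{
    int i = (blockDim.x * blockIdx.x) + threadIdx.x;

    if (i < n)
    {
        float v = in[i];
        // Standard functions: more accurate, but slower.
        // out[i] = sinf(v) / expf(v);
        // Intrinsic equivalents: fewer instructions, lower accuracy.
        out[i] = __fdividef(__sinf(v), __expf(v));
    }
}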


Rewriting calls by hand this way does improve performance, but doing it across a whole code base is slow and tedious, so you can instead use nvcc's own flags to tune how instructions are generated.

For example:

Compile with nvcc --ptx --fmad=true to generate a PTX file, and you get a single arithmetic instruction:
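
A sketch of the relevant PTX, assuming a simple kernel whose body computes out = a * x + y (register numbers are illustrative):

fma.rn.f32 %f4, %f1, %f2, %f3;    // one fused multiply-add instruction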

If instead you compile with nvcc --ptx --fmad=false, the result is:
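
For the same assumed kernel, the multiply and the add now stay separate:

mul.rn.f32 %f4, %f1, %f2;
add.rn.f32 %f5, %f4, %f3;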

The difference is plain to see: with --fmad=true the PTX needs one fewer instruction than with --fmad=false.

The compiler flags used to control instruction generation are listed below:
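
From the nvcc documentation, the flags in question are:

--fmad={true|false}       allow (or forbid) contracting multiplies and adds into FMAD/FMA instructions
--ftz={true|false}        flush single-precision denormals to zero, or preserve them
--prec-div={true|false}   use IEEE-rounded single-precision division, or a faster approximation
--prec-sqrt={true|false}  use IEEE-rounded single-precision square root, or a faster approximation
--use_fast_math           implies --ftz=true --prec-div=false --prec-sqrt=false --fmad=true, and also swaps some standard functions for intrinsics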

For comparison with the original book's example, I replaced the plain multiplication with __fmul_rn here; as the output shows, the result still deviates from the baseline. A sketch of the two variants follows.
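
A minimal sketch of the two variants being compared (the kernel names are assumptions; __fmul_rn is a real CUDA intrinsic that multiplies with round-to-nearest-even rounding and is never contracted into an FMA):

__global__ void mul_kernel(float a, float *out, int iters)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        float tmp;

        for (int i = 0; i < iters; i++)
            tmp = a * a;            // plain multiply: the compiler may contract it

        *out = tmp;
    }
}

__global__ void fmul_rn_kernel(float a, float *out, int iters)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        float tmp;

        for (int i = 0; i < iters; i++)
            tmp = __fmul_rn(a, a);  // explicit round-to-nearest-even multiply

        *out = tmp;
    }
}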

Profiling with Nsight Systems:

Without __fmul_rn:

With __fmul_rn:

Why did this instruction-level optimization make the time go up instead? Perhaps the __fmul_rn path involves extra floating-point conversion work. It may also matter that the _rn intrinsics are documented to prevent the compiler from contracting the multiply into an FMA, so the "optimized" version can actually execute more instructions.

Not convinced, I added a loop:

Plain multiplication (*):

__fmul_rn:

Well, it seems the conversion really is time-consuming.

The experiment failed; I'll dig into the cause after class.


Finally, the summary figure:



Copyright notice: this is an original article by zhuangtu1999, released under the CC 4.0 BY-SA license. Please include a link to the original source and this notice when reposting.