Commit 7c03f7d7 authored by Shilei Tian's avatar Shilei Tian
Browse files

[OpenMP][deviceRTLs] Build the deviceRTLs with OpenMP instead of target dependent language

From this patch (plus some landed patches), `deviceRTLs` is taken as a regular OpenMP program with just `declare target` regions. In this way, ideally, `deviceRTLs` can be written in OpenMP directly. No CUDA, no HIP anymore. (Well, AMD is still working on getting it work. For now AMDGCN still uses original way to compile) However, some target specific functions are still required, but they're no longer written in target specific language. For example, CUDA parts have all refined by replacing CUDA intrinsic and builtins with LLVM/Clang/NVVM intrinsics.
Here're a list of changes in this patch.
1. For NVPTX, `DEVICE` is defined empty in order to make the common parts still work with AMDGCN. Later once AMDGCN is also available, we will completely remove `DEVICE` or probably some other macros.
2. Shared variable is implemented with OpenMP allocator, which is defined in `allocator.h`. Again, this feature is not available on AMDGCN, so two macros are redefined properly.
3. CUDA header `cuda.h` is dropped in the source code. In order to deal with code difference in various CUDA versions, we build one bitcode library for each supported CUDA version. For each CUDA version, the highest PTX version it supports will be used, just as what we currently use for CUDA compilation.
4. Correspondingly, compiler driver is also updated to support CUDA version encoded in the name of bitcode library. Now the bitcode library for NVPTX is named as `libomptarget-nvptx-cuda_[cuda_version]-sm_[sm_number].bc`, such as `libomptarget-nvptx-cuda_80-sm_20.bc`.

With this change, there are also multiple features to be expected in the near future:
1. CUDA will be completely dropped when compiling OpenMP. By the time, we also build bitcode libraries for all supported SM, multiplied by all supported CUDA version.
2. Atomic operations used in `deviceRTLs` can be replaced by `omp atomic` if OpenMP 5.1 feature is fully supported. For now, the IR generated is totally wrong.
3. Target specific parts will be wrapped into `declare variant` with `isa` selector if it can work properly. No target specific macro is needed anymore.
4. (Maybe more...)

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D94745
parent 90b8ae01
Loading
Loading
Loading
Loading
+20 −22
Original line number Diff line number Diff line
@@ -712,33 +712,30 @@ void CudaToolChain::addClangTargetOptions(
  CC1Args.push_back("-mlink-builtin-bitcode");
  CC1Args.push_back(DriverArgs.MakeArgString(LibDeviceFile));

  std::string CudaVersionStr;

  // New CUDA versions often introduce new instructions that are only supported
  // by new PTX version, so we need to raise PTX level to enable them in NVPTX
  // back-end.
  const char *PtxFeature = nullptr;
  switch (CudaInstallation.version()) {
  case CudaVersion::CUDA_110:
    PtxFeature = "+ptx70";
    break;
  case CudaVersion::CUDA_102:
    PtxFeature = "+ptx65";
    break;
  case CudaVersion::CUDA_101:
    PtxFeature = "+ptx64";
    break;
  case CudaVersion::CUDA_100:
    PtxFeature = "+ptx63";
    break;
  case CudaVersion::CUDA_92:
    PtxFeature = "+ptx61";
    break;
  case CudaVersion::CUDA_91:
    PtxFeature = "+ptx61";
    break;
  case CudaVersion::CUDA_90:
    PtxFeature = "+ptx60";
#define CASE_CUDA_VERSION(CUDA_VER, PTX_VER)                                   \
  case CudaVersion::CUDA_##CUDA_VER:                                           \
    CudaVersionStr = #CUDA_VER;                                                \
    PtxFeature = "+ptx" #PTX_VER;                                              \
    break;
    CASE_CUDA_VERSION(110, 70);
    CASE_CUDA_VERSION(102, 65);
    CASE_CUDA_VERSION(101, 64);
    CASE_CUDA_VERSION(100, 63);
    CASE_CUDA_VERSION(92, 61);
    CASE_CUDA_VERSION(91, 61);
    CASE_CUDA_VERSION(90, 60);
#undef CASE_CUDA_VERSION
  default:
    // If unknown CUDA version, we take it as CUDA 8.0. Same assumption is also
    // made in libomptarget/deviceRTLs.
    CudaVersionStr = "80";
    PtxFeature = "+ptx42";
  }
  CC1Args.append({"-target-feature", PtxFeature});
@@ -784,8 +781,9 @@ void CudaToolChain::addClangTargetOptions(
    } else {
      bool FoundBCLibrary = false;

      std::string LibOmpTargetName =
          "libomptarget-nvptx-" + GpuArch.str() + ".bc";
      std::string LibOmpTargetName = "libomptarget-nvptx-cuda_" +
                                     CudaVersionStr + "-" + GpuArch.str() +
                                     ".bc";

      for (StringRef LibraryPath : LibraryPaths) {
        SmallString<128> LibOmpTargetFile(LibraryPath);
+2 −2
Original line number Diff line number Diff line
@@ -164,7 +164,7 @@
// RUN:   -fopenmp-relocatable-target -save-temps -no-canonical-prefixes %s 2>&1 \
// RUN:   | FileCheck -check-prefix=CHK-BCLIB-USER %s

// CHK-BCLIB: clang{{.*}}-triple{{.*}}nvptx64-nvidia-cuda{{.*}}-mlink-builtin-bitcode{{.*}}libomptarget-nvptx-sm_20.bc
// CHK-BCLIB: clang{{.*}}-triple{{.*}}nvptx64-nvidia-cuda{{.*}}-mlink-builtin-bitcode{{.*}}libomptarget-nvptx-cuda_80-sm_20.bc
// CHK-BCLIB-USER: clang{{.*}}-triple{{.*}}nvptx64-nvidia-cuda{{.*}}-mlink-builtin-bitcode{{.*}}libomptarget-nvptx-test.bc
// CHK-BCLIB-NOT: {{error:|warning:}}

@@ -177,7 +177,7 @@
// RUN:   -fopenmp-relocatable-target -save-temps -no-canonical-prefixes %s 2>&1 \
// RUN:   | FileCheck -check-prefix=CHK-BCLIB-WARN %s

// CHK-BCLIB-WARN: No library 'libomptarget-nvptx-sm_20.bc' found in the default clang lib directory or in LIBRARY_PATH. Please use --libomptarget-nvptx-bc-path to specify nvptx bitcode library.
// CHK-BCLIB-WARN: No library 'libomptarget-nvptx-cuda_80-sm_20.bc' found in the default clang lib directory or in LIBRARY_PATH. Please use --libomptarget-nvptx-bc-path to specify nvptx bitcode library.

/// ###########################################################################

+2 −1
Original line number Diff line number Diff line
@@ -26,7 +26,8 @@
#define DEVICE __attribute__((device))
#define INLINE inline DEVICE
#define NOINLINE __attribute__((noinline)) DEVICE
#define SHARED __attribute__((shared))
#define SHARED(NAME) __attribute__((shared)) NAME
#define EXTERN_SHARED(NAME) __attribute__((shared)) NAME
#define ALIGN(N) __attribute__((aligned(N)))

////////////////////////////////////////////////////////////////////////////////
+44 −0
Original line number Diff line number Diff line
//===--------- allocator.h - OpenMP target memory allocator ------- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// Macros for allocating variables in different address spaces.
//
//===----------------------------------------------------------------------===//

#ifndef OMPTARGET_ALLOCATOR_H
#define OMPTARGET_ALLOCATOR_H

#if _OPENMP
// Follows the pattern in interface.h
// Clang sema checks this type carefully, needs to closely match that from omp.h
typedef enum omp_allocator_handle_t {
  omp_null_allocator = 0,
  omp_default_mem_alloc = 1,
  omp_large_cap_mem_alloc = 2,
  omp_const_mem_alloc = 3,
  omp_high_bw_mem_alloc = 4,
  omp_low_lat_mem_alloc = 5,
  omp_cgroup_mem_alloc = 6,
  omp_pteam_mem_alloc = 7,
  omp_thread_mem_alloc = 8,
  KMP_ALLOCATOR_MAX_HANDLE = ~(0U)
} omp_allocator_handle_t;

#define __PRAGMA(STR) _Pragma(#STR)
#define OMP_PRAGMA(STR) __PRAGMA(omp STR)

#define SHARED(NAME)                                                           \
  NAME [[clang::loader_uninitialized]];                                        \
  OMP_PRAGMA(allocate(NAME) allocator(omp_pteam_mem_alloc))

#define EXTERN_SHARED(NAME)                                                    \
  NAME;                                                                        \
  OMP_PRAGMA(allocate(NAME) allocator(omp_pteam_mem_alloc))
#endif

#endif // OMPTARGET_ALLOCATOR_H
Loading