Playing Beat Saber in the browser with body movements using PoseNet and TensorFlow.js

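I haven't played many VR games because I don't own the gear, but one that I tried and loved is Beat Saber.

If you're not familiar with it, it's a Tron-looking game where you use your controllers to hit beats to the rhythm of a song. It's really fun, but it requires an HTC Vive, an Oculus Rift, or a PlayStation VR.

These headsets can be expensive, so not everyone can afford them.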

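A few months ago, I came across this repo by Supermedium. It's a clone of Beat Saber made with web technologies, using the A-Frame framework, and I found it really impressive! You can start a song, see the beats being generated, and look around the scene, but it didn't look like you could actually play, or at least not without a VR device.

I really wanted to see if I could do something about that, so I decided to add PoseNet, a pose detection model that runs with TensorFlow.js, to be able to play the game in the browser with my hands... and it works!! 🤩🎉

OK, it's not very performant, because camera tracking is not as accurate as using controllers, but to be honest, my main goal was to see if it was possible.

I'm super happy it works, and the "only" thing people need is a (modern) laptop!

The end result looks like this:

If you're not interested in the details of how it was built, you can check out the live demo, or you can find all the code in the GitHub repo.

Otherwise, now that you're hopefully as excited about this as I am, let's talk about how it works!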

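Step 1. Reverse engineering

Most of the codebase relies on the BeatSaver Viewer open-source project.

Usually, in my side projects, I start everything from scratch. I know exactly where things are, which lets me make changes quickly. This time, however, the idea came from finding the existing BeatSaver repo, so I started from their codebase. It would have been pointless to spend time recreating the game when other people had already done such a great job.

I quickly ran into some issues, though. I didn't really know where to start. If you inspect a 3D scene in the browser with the normal dev tools to figure out which component you should change, the only thing you get is... the canvas; you can't inspect the different 3D elements inside the scene.
With A-Frame, you can use CTRL + Option + i to toggle the inspector, but it still didn't help me find the element I was looking for.

What I had to do was deep dive into the codebase and try to figure out what was going on. I don't have much experience with A-Frame, so I was confused by the names of some mixins, where some components came from, how they were rendered in the scene, and so on.

Eventually, I found the beat component I was looking for, which has a destroyBeat method, so that looked promising!

To test that I had found what I needed, I made a quick change in the beat component to trigger the destroyBeat function every time I clicked on the page body, so it looked something like this: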

document.body.onclick = () => this.destroyBeat();

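After reloading the page, I started the game, waited for a beat to be displayed, clicked anywhere on the page, and saw the beat explode. That was a good first step!

Now that I had a better idea of where to make changes in the code, I started looking into PoseNet to see what kind of data I would be able to use.

Step 2. Body tracking with the PoseNet model

The PoseNet model with TensorFlow.js lets you do pose estimation in the browser and get information about a few "keypoints", such as the position of the shoulders, arms, wrists, and so on.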
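To give an idea of the data involved, here is a minimal sketch of picking the wrists out of a pose. It assumes the pose has the shape PoseNet returns (an overall score plus a keypoints array whose entries have a part, a score and a position); the getWrists helper and the sample values are made up for illustration.

```javascript
// Hypothetical helper: returns the wrist positions whose confidence is high enough.
function getWrists(pose, minPartConfidence = 0.5) {
  const wrists = {};
  for (const keypoint of pose.keypoints) {
    const isWrist = keypoint.part === 'leftWrist' || keypoint.part === 'rightWrist';
    if (isWrist && keypoint.score >= minPartConfidence) {
      wrists[keypoint.part] = keypoint.position;
    }
  }
  return wrists;
}

// Made-up sample shaped like a PoseNet result (a real pose has 17 keypoints).
const samplePose = {
  score: 0.9,
  keypoints: [
    { part: 'nose', score: 0.99, position: { x: 300, y: 120 } },
    { part: 'leftWrist', score: 0.85, position: { x: 420, y: 310 } },
    { part: 'rightWrist', score: 0.2, position: { x: 180, y: 305 } },
  ],
};

// Only leftWrist passes the default 0.5 threshold here.
console.log(getWrists(samplePose));
```

Filtering on the per-part score matters because PoseNet always returns every keypoint, even the ones it can barely see.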

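Before implementing it in the game, I tested it separately to see how it worked.

A basic implementation looks like this:

In an HTML file, start by importing TensorFlow.js and the PoseNet model: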

<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/posenet"></script>

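We can also display the webcam feed along with markers on the body parts we are tracking (in my case, the wrists).

To do so, we start by adding a video tag and a canvas that is positioned over the video: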

    <video id="video" playsinline style=" -moz-transform: scaleX(-1);
    -o-transform: scaleX(-1);
    -webkit-transform: scaleX(-1);
    transform: scaleX(-1);
    ">
    </video>
    <canvas id="output" style="position: absolute; top: 0; left: 0; z-index: 1;"></canvas>

The JavaScript part of the pose detection involves a few steps.

First, we need to set up PoseNet.

// We create an object with the parameters that we want for the model. 
const poseNetState = {
  algorithm: 'single-pose',
  input: {
    architecture: 'MobileNetV1',
    outputStride: 16,
    inputResolution: 513,
    multiplier: 0.75,
    quantBytes: 2
  },
  singlePoseDetection: {
    minPoseConfidence: 0.1,
    minPartConfidence: 0.5,
  },
  output: {
    showVideo: true,
    showPoints: true,
  },
};

// We load the model.
let poseNetModel = await posenet.load({
    architecture: poseNetState.input.architecture,
    outputStride: poseNetState.input.outputStride,
    inputResolution: poseNetState.input.inputResolution,
    multiplier: poseNetState.input.multiplier,
    quantBytes: poseNetState.input.quantBytes
});

Once the model is loaded, we instantiate a video stream:

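// videoWidth and videoHeight are not defined in this snippet;
// they are assumed to be declared elsewhere.
let video;

try {
  video = await setupCamera();
  video.play();
  // Start the detection loop (defined further down) once the stream is ready.
  detectPoseInRealTime(video);
} catch (e) {
  throw e;
}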

async function setupCamera() {
  const video = document.getElementById('video');
  video.width = videoWidth;
  video.height = videoHeight;

  const stream = await navigator.mediaDevices.getUserMedia({
    'audio': false,
    'video': {
      width: videoWidth,
      height: videoHeight,
    },
  });
  video.srcObject = stream;

  return new Promise((resolve) => {
    video.onloadedmetadata = () => resolve(video);
  });
}

Once the video stream is ready, we start detecting poses:

function detectPoseInRealTime(video) {
  const canvas = document.getElementById('output');
  const ctx = canvas.getContext('2d');
  const flipPoseHorizontal = true;

  canvas.width = videoWidth;
  canvas.height = videoHeight;

  async function poseDetectionFrame() {
    let poses = [];
    let minPoseConfidence;
    let minPartConfidence;

    switch (poseNetState.algorithm) {
      case 'single-pose':
        const pose = await poseNetModel.estimatePoses(video, {
          flipHorizontal: flipPoseHorizontal,
          decodingMethod: 'single-person'
        });
        poses = poses.concat(pose);
        minPoseConfidence = +poseNetState.singlePoseDetection.minPoseConfidence;
        minPartConfidence = +poseNetState.singlePoseDetection.minPartConfidence;
        break;
    }

    ctx.clearRect(0, 0, videoWidth, videoHeight);

    if (poseNetState.output.showVideo) {
      ctx.save();
      ctx.scale(-1, 1);
      ctx.translate(-videoWidth, 0);
      ctx.restore();
    }

    poses.forEach(({score, keypoints}) => {
      if (score >= minPoseConfidence) {
        if (poseNetState.output.showPoints) {
          drawKeypoints(keypoints, minPartConfidence, ctx);
        }
      }
    });
    requestAnimationFrame(poseDetectionFrame);
  }

  poseDetectionFrame();
}

In the sample above, we call the drawKeypoints function to draw the dots over the hands on the canvas. The code for it is:

// Draws a dot on each wrist. colorLeft and colorRight are assumed to be defined elsewhere.
function drawKeypoints(keypoints, minConfidence, ctx, scale = 1) {
    let leftWrist = keypoints.find(point => point.part === 'leftWrist');
    let rightWrist = keypoints.find(point => point.part === 'rightWrist');

    if (leftWrist.score > minConfidence) {
        const {y, x} = leftWrist.position;
        drawPoint(ctx, y * scale, x * scale, 10, colorLeft);
    }

    if (rightWrist.score > minConfidence) {
        const {y, x} = rightWrist.position;
        drawPoint(ctx, y * scale, x * scale, 10, colorRight);
    }
}

function drawPoint(ctx, y, x, r, color) {
  ctx.beginPath();
  ctx.arc(x, y, r, 0, 2 * Math.PI);
  ctx.fillStyle = color;
  ctx.fill();
}

Here's the result:

Now that the tracking works on its own, let's move on to adding it to the BeatSaver codebase.

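Step 3. Adding pose tracking to BeatSaver

To start adding our pose detection to the 3D game, we need to take the code we wrote above and implement it inside the BeatSaver code.

All we have to do is add our video tag to the main HTML file, and create a new JS file, imported at the top of it, that contains our JS code above.

At this stage, we should get something like this:

It's a good start, but we're not quite there yet. We're now getting to the trickier parts of this project. The position tracking from PoseNet is in 2D, whereas the A-Frame game is in 3D, so our blue and red dots from the hand tracking are not actually added to the scene. However, to be able to destroy beats, we need everything to be part of the game.

To do this, we need to switch from displaying the hands as circles on a canvas to creating actual 3D objects placed at the correct coordinates, but that's not so simple...

Coordinates work differently in these two environments. The (x, y) coordinates of your left hand on the canvas do not translate to the same (x, y) coordinates of an object in 3D.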
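As a rough illustration of the mismatch: canvas positions are in pixels, with the origin at the top left and y pointing down, whereas a Three.js camera works with normalized device coordinates that run from -1 to 1 with y pointing up. Here is a minimal sketch of that conversion; the helper name and the example sizes are made up.

```javascript
// Hypothetical helper: converts a pixel position to normalized device coordinates.
function toNormalizedDeviceCoords(x, y, width, height) {
  return {
    x: (x / width) * 2 - 1,   // left edge -> -1, right edge -> +1
    y: -(y / height) * 2 + 1, // top edge -> +1, bottom edge -> -1 (y is flipped)
  };
}

// The center of an 800 x 600 canvas maps to the center of the view.
console.log(toNormalizedDeviceCoords(400, 300, 800, 600)); // { x: 0, y: 0 }
```

This only gets us into the camera's 2D view space; it still says nothing about depth, which is why more work is needed to place an object in the scene.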

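So the next step is to find a way to map positions between our 2D and 3D worlds.

Mapping 2D and 3D coordinates

As said above, coordinates in a 2D and a 3D world work differently.

Before being able to map them, we need to create a new 3D object that is going to represent our hand in the game.

In A-Frame, we can create what is called an entity component, a custom placeholder object that we can add to our scene.

1. Creating a custom 3D object

In our case, we want to create a simple cube, and we can do it like this: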

let el, self;

AFRAME.registerComponent('right-hand-controller', {
    schema: {
        width: {type: 'number', default: 1},
        height: {type: 'number', default: 1},
        depth: {type: 'number', default: 1},
        color: {type: 'color', default: '#AAA'},
    },
    init: function () {
        var data = this.data;
        el = this.el;
        self = this;

        this.geometry = new THREE.BoxGeometry(data.width, data.height, data.depth);
        this.material = new THREE.MeshStandardMaterial({color: data.color});
        this.mesh = new THREE.Mesh(this.geometry, this.material);
        el.setObject3D('mesh', this.mesh);
    }
});

Then, to be able to see our custom entity on the screen, we need to import this file in our HTML and use the a-entity tag.

<a-entity id="right-hand" right-hand-controller="width: 0.1; height: 0.1; depth: 0.1; color: #036657" position="1 1 -0.2"></a-entity>

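In the code above, we create a new entity of type right-hand-controller and give it some properties.

We should now see a cube on the page.

To change its position, we can use the data we get from PoseNet. In our entity component, we need to add a few functions: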

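// rightHandPosition and previousRightHandPosition are assumed to be shared variables updated by the PoseNet code.
// This function runs when the component is initialised AND when a property updates.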
update: function(){
  this.checkHands();
},
checkHands: function getHandsPosition() {
  // if we get the right hand position from PoseNet and it's different from the previous one, trigger the `onHandMove` function.
  if(rightHandPosition && rightHandPosition !== previousRightHandPosition){
    self.onHandMove();
    previousRightHandPosition = rightHandPosition;
  }
  window.requestAnimationFrame(getHandsPosition);
},
onHandMove: function(){
  //First, we create a 3-dimensional vector to hold the values of our PoseNet hand detection, mapped to the dimension of the screen.
  const handVector = new THREE.Vector3();
  handVector.x = (rightHandPosition.x / window.innerWidth) * 2 - 1;
  handVector.y = - (rightHandPosition.y / window.innerHeight) * 2 + 1; 
  handVector.z = 0; // that z value can be set to 0 because we don't get depth from the webcam.

  // We get the camera element and 'unproject' our hand vector with the camera's projection matrix (some magic I can't explain).
  const camera = self.el.sceneEl.camera;
  handVector.unproject(camera);

  // We get the position of our camera object.
  const cameraObjectPosition = camera.el.object3D.position;
  // The next 3 lines are what allows us to map between the position of our hand on the screen to a position in the 3D world. 
  const dir = handVector.sub(cameraObjectPosition).normalize();
  const distance = - cameraObjectPosition.z / dir.z;
  const pos = cameraObjectPosition.clone().add(dir.multiplyScalar(distance));
  // We use this new position to determine the position of our 'right-hand-controller' cube in the 3D scene. 
  el.object3D.position.copy(pos);
  el.object3D.position.z = -0.2;
}

At this stage, we can move our hand in front of the camera and see the 3D cube move.
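The three mapping lines in onHandMove amount to a ray-plane intersection: a ray goes from the camera through the unprojected hand point, and the cube is placed where that ray crosses the plane z = 0. Here is a minimal sketch of the same arithmetic without Three.js, using plain { x, y, z } objects; the helper name and the numbers are made up for illustration.

```javascript
// Hypothetical helper: where does a ray starting at `origin` and passing through
// `target` cross the plane z = planeZ? Returns null if there is no single answer.
function intersectRayWithZPlane(origin, target, planeZ = 0) {
  const dir = { x: target.x - origin.x, y: target.y - origin.y, z: target.z - origin.z };
  const length = Math.hypot(dir.x, dir.y, dir.z);
  if (length === 0 || dir.z === 0) return null; // no direction, or parallel to the plane

  // Normalize the direction, as handVector.sub(...).normalize() does.
  dir.x /= length;
  dir.y /= length;
  dir.z /= length;

  // Distance to travel along the ray until z reaches planeZ.
  const distance = (planeZ - origin.z) / dir.z;
  return {
    x: origin.x + dir.x * distance,
    y: origin.y + dir.y * distance,
    z: origin.z + dir.z * distance,
  };
}

// A camera 2 units in front of the plane, with a ray leaning to the right.
const hit = intersectRayWithZPlane({ x: 0, y: 1.6, z: 2 }, { x: 0.5, y: 1.6, z: 1 });
console.log(hit); // roughly { x: 1, y: 1.6, z: 0 }, up to floating-point rounding
```

With planeZ set to 0, the distance becomes -origin.z / dir.z, which matches the expression used for distance in the component.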

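The final thing we need to do is what is called raycasting, to be able to destroy beats.

Raycasting

In Three.js, raycasting is usually used for mouse picking, meaning figuring out which objects in the 3D space the mouse is over. It can also be used for collision detection.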
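Conceptually, a raycaster shoots a ray from a point in a given direction and reports which objects it crosses. The sketch below shows that idea on its simplest case, a ray against a sphere, in plain JavaScript. It is not how the Three.js Raycaster is implemented (that one tests against actual meshes), and the helper name and values are made up.

```javascript
// Hypothetical helper: does a ray hit a sphere? `direction` must be normalized.
// Simplification: a sphere whose center is behind the origin counts as a miss.
function rayHitsSphere(origin, direction, center, radius) {
  // Vector from the ray origin to the sphere center.
  const toCenter = {
    x: center.x - origin.x,
    y: center.y - origin.y,
    z: center.z - origin.z,
  };
  // How far along the ray the point closest to the center lies.
  const along =
    toCenter.x * direction.x + toCenter.y * direction.y + toCenter.z * direction.z;
  if (along < 0) return false;

  // Squared distance between the sphere center and that closest point.
  const distanceSq =
    toCenter.x ** 2 + toCenter.y ** 2 + toCenter.z ** 2 - along ** 2;
  return distanceSq <= radius ** 2;
}

const rayOrigin = { x: 0, y: 0, z: 0 };
const forward = { x: 0, y: 0, z: -1 }; // looking down the negative z axis

console.log(rayHitsSphere(rayOrigin, forward, { x: 0, y: 0, z: -5 }, 0.5)); // true
console.log(rayHitsSphere(rayOrigin, forward, { x: 2, y: 0, z: -5 }, 0.5)); // false
```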

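In our case, it's not the mouse we care about, but our "cube hands".

To check which objects our hands are over, we need to add the following code to our onHandMove function: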

// Create a raycaster with our hand vector.
const raycaster = new THREE.Raycaster();
raycaster.setFromCamera(handVector, camera);

// Get all the <a-entity beatObject> elements.
const entities = document.querySelectorAll('[beatObject]'); 
const entitiesObjects = [];

if(Array.from(entities).length){
  // If there are beats entities, get the actual beat mesh and push it into an array.
  for(var i = 0; i < Array.from(entities).length; i++){
    const beatMesh = entities[i].object3D.el.object3D.el.object3D.el.object3D.children[0].children[1];
    entitiesObjects.push(beatMesh);
  }

  // From the raycaster, check if we intersect with any beat mesh. 
  let intersects = raycaster.intersectObjects(entitiesObjects, true);
    if(intersects.length){
      // If we collide, get the entity, its color and type.
      const beat = intersects[0].object.el.attributes[0].ownerElement.parentEl.components.beat;
      const beatColor = beat.attrValue.color;
      const beatType = beat.attrValue.type;
      // If the beat is blue and not a mine, destroy it!
      if(beatColor === "blue"){
        if(beatType === "arrow" || beatType === "dot"){
          beat.destroyBeat();
        } 
      }
    }
}

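And we're done!

We used PoseNet and TensorFlow.js to detect the hands and their positions, drew them on a canvas, mapped them to 3D coordinates, and used a raycaster to detect collisions with beats and destroy them! 🎉 🎉 🎉

It definitely took me a few more steps to figure all of this out, but it was a very interesting challenge!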

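Limitations

Of course, as always, there are limitations that need to be mentioned.

Latency and accuracy

If you've tried the demo, you probably noticed some latency between the moment you move your hand and the moment it is reflected on the screen.
In my opinion this is expected, but I'm actually quite impressed by how fast it can recognise my wrists and calculate where they should be placed on the screen.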

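Lighting

I think, as is common with computer vision, any experience you build won't be very performant or usable if the lighting in the room is not good enough. The model only uses the stream from the webcam to find what looks closest to a body shape, so if there isn't enough light, it won't be able to do that and the game won't work.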

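User experience

In the real Beat Saber game, I believe the controllers react to the collision with a beat? If not, they really should, so the user gets some haptic feedback about what happened.

In this particular project, however, the feedback is only visual, and in a way it feels a bit weird: you want to "feel" the explosion of the beats when you hit them.

This could be solved by connecting an Arduino and some vibration sensors via Web Bluetooth, but that's for another day... 😂

That's pretty much it!

Hope you like it! ❤️✌️

Source: https://dev.to/devdevcharlie/playing-beat-saber-in-the-browser-with-body-movements-using-posenet-tensorflow-js-36km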