Detecting Objects in Images (Darknet YOLOv4-tiny model)

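Introduction

I had a Raspberry Pi 4 and a camera module lying around, so I decided to build something around object detection.

As a prerequisite, this article evaluates techniques that could be used to detect objects in images.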

Final goal of the overall system

Building a simple traffic-volume monitoring system

(This is a hobby project, so accuracy and speed are not a concern.)

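What this article verifies

Investigate and verify a lightweight method that can detect objects in still images and also runs on a Raspberry Pi 4B.

What was tested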

GPT4-Vision

I tried to get this done quickly with OpenAI's vision model, but it turned out to be unsuitable, so I stopped the evaluation there.

https://openai.com/pricing

Pros

◎ Very easy to set up, since you just send the image to the API

⚪︎ The cost is reasonable too: about $0.002 per request (roughly 0.3 yen) for a 512x512 px image.

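Cons

❌ It will not count objects

Perhaps as a CAPTCHA countermeasure, it returns a refusal even when instructed to count the objects.

△ The response cannot be pinned to JSON format

You can request JSON in the system prompt, but it does not seem to be strictly enforced (it complies about 95% of the time?).

△ Cheap as it is, every request costs money, so it is not suited to a large volume of requests

At one request per hour, it costs about 220 yen per month.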
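As a rough check of that monthly figure, the arithmetic can be sketched as follows. The 30-day month and the function name are assumptions made for illustration; the 0.3 yen per request is the estimate quoted above.

```python
def monthly_cost_yen(requests_per_hour: float, yen_per_request: float, days: int = 30) -> float:
    """Estimate the monthly API cost in yen for a fixed request rate."""
    requests_per_month = requests_per_hour * 24 * days
    return requests_per_month * yen_per_request


# One request per hour at roughly 0.3 yen each (the estimate from this article).
cost = monthly_cost_yen(requests_per_hour=1, yen_per_request=0.3)
print(f"about {cost:.0f} yen per month")
```

That is 720 requests a month, or roughly 216 yen, which is in line with the "about 220 yen" figure. At one request per minute the same rate would be 60 times higher.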

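TensorFlow Lite MobileNet + SSD model

Going in order of ease, I evaluated TensorFlow next, but I could not get it to work properly, so I did not adopt it.

I tested it on Colab, but the object detection part behaved oddly. I looked into it from several angles, but it seemed likely to take a while, so I skipped it.

・Model used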

lite-model_ssd_mobilenet_v1_1_metadata_2.tflite

github.com

・Source code (Google Colab)

!pip install tflite-runtime
import os
import time
from datetime import datetime

import numpy as np
import tflite_runtime.interpreter as tflite
from PIL import Image
from google.colab import drive
drive.mount('/content/drive')

# Paths to the model and label files
model_path = '/content/drive/My Drive/Colab Notebooks/detect.tflite'
label_path = '/content/drive/My Drive/Colab Notebooks/labelmap.txt'

# Load the label file
with open(label_path, 'r') as file:
    labels = [line.strip() for line in file.readlines()]
print(labels)

# Load the TensorFlow Lite model
interpreter = tflite.Interpreter(model_path=model_path)
interpreter.allocate_tensors()

# Get the input and output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
input_shape = input_details[0]['shape']

# Load and preprocess the image
image = Image.open('/content/drive/My Drive/Colab Notebooks/car_2.jpg')
image = image.resize((input_shape[1], input_shape[2]))
input_data = np.expand_dims(image, axis=0)

# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()

# Get the results
detections = interpreter.get_tensor(output_details[0]['index'])[0]

# Print the label and confidence of each detected object
for det in detections:
    label_index = int(det[0])
    print(det)
    score = float(det[1])
    if label_index in range(len(labels)) and score > 0.5:  # show only detections with confidence above 50%
        print(f"Detected object: {labels[label_index]}, confidence: {score:.2f}")

# Count the cars
car_count = sum(1 for det in detections if det[0] == labels.index('car'))
print(f'Number of cars in the image: {car_count}')

['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', '???', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', '???', 'backpack', 'umbrella', '???', '???', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', '???', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', '???', 'dining table', '???', '???', 'toilet', '???', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', '???', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']
[0.6060126 0.4098847 0.81949747 0.74221253]
Detected object: person, confidence: 0.41
[0.6511592 0.78893995 0.788709 0.97133744]
Detected object: person, confidence: 0.79
[0.63011193 0.58784795 0.658774 0.6115582 ]
Detected object: person, confidence: 0.59
[0.67801225 0.96616584 0.75084054 0.98877805]
Detected object: person, confidence: 0.97
[0.6713592 0.80989736 0.76514316 0.9385486 ]
Detected object: person, confidence: 0.81
[0.6352516 0.84297746 0.65822405 0.8611011 ]
Detected object: person, confidence: 0.84
[0.631746 0.6318665 0.6647895 0.6600792]
Detected object: person, confidence: 0.63
[0.674443 0.8473915 0.7712389 0.9538163]
Detected object: person, confidence: 0.85
[0.62252915 0.5297136 0.6571772 0.5649137 ]
Detected object: person, confidence: 0.53
[0.59755045 0.9395489 0.6974551 0.9893865 ]
Detected object: person, confidence: 0.94
Number of cars in the image: 0
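One possible explanation for the odd result: each row printed above is a 4-element array with values between 0 and 1, which looks like a normalized bounding box rather than a [label, score] pair. SSD MobileNet TFLite exports commonly expose four output tensors (boxes, class indices, scores, number of detections), but that ordering is an assumption here and should be checked against output_details for the actual model. Under that assumption, a minimal sketch of reading the results might look like this (summarize_detections is a hypothetical helper, not part of tflite-runtime):

```python
from collections import Counter


def summarize_detections(boxes, classes, scores, count, labels, threshold=0.5):
    """Return (label, score, box) tuples for detections at or above the threshold.

    Assumes the four SSD outputs are boxes, class indices, scores and the
    number of valid detections, already stripped of their batch dimension.
    """
    results = []
    for i in range(int(count)):
        class_index = int(classes[i])
        score = float(scores[i])
        if score >= threshold and 0 <= class_index < len(labels):
            results.append((labels[class_index], score, list(boxes[i])))
    return results


# Tiny hand-made example standing in for real model outputs (not real data).
labels = ['person', 'bicycle', 'car']
boxes = [[0.1, 0.1, 0.5, 0.5], [0.2, 0.3, 0.6, 0.9], [0.0, 0.0, 0.1, 0.1]]
classes = [2.0, 2.0, 0.0]
scores = [0.9, 0.6, 0.3]

found = summarize_detections(boxes, classes, scores, count=3, labels=labels)
print(Counter(label for label, _, _ in found))
```

With the interpreter from the code above, the four arguments would come from interpreter.get_tensor(output_details[k]['index'])[0] for k = 0 to 3, provided the ordering assumption holds for the model in use.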

YOLOv4-tiny model

This is the model featured in this article. It ran without any particular problems, so I adopted it.

It is surprising that it works this easily.

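・Brief explanation (by GPT-4)

Features: YOLO (You Only Look Once) is very fast, and a wide range of pretrained models for detecting common objects is available.
Usage: Download a pretrained YOLO model (for example YOLOv3-tiny or YOLOv4-tiny) and use a YOLO framework to detect vehicles in images.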

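Pros

⚪︎ Runs locally, so there are no API fees

⚪︎ Execution takes only a few seconds (this depends on the resolution of the target image, but it is about right for HD-quality images)

Cons

△ Requires setup and threshold tuning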
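To give a sense of what the threshold tuning involves: there are typically two knobs, a confidence threshold that drops weak detections and an NMS (non-maximum suppression) threshold that decides how much two boxes may overlap before the weaker one is discarded. The following is a simplified pure-Python sketch of greedy NMS for illustration only. It is not OpenCV's implementation, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, confidences, score_threshold=0.5, nms_threshold=0.4):
    """Return indices of boxes kept after confidence filtering and greedy NMS."""
    # Keep only confident boxes, strongest first.
    order = sorted(
        (i for i, c in enumerate(confidences) if c > score_threshold),
        key=lambda i: confidences[i],
        reverse=True,
    )
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= nms_threshold for k in kept):
            kept.append(i)
    return kept


# Two heavily overlapping boxes and one separate box (made-up values).
boxes = [[10, 10, 100, 100], [15, 12, 100, 100], [300, 300, 50, 50]]
confidences = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, confidences))
```

Raising nms_threshold keeps more overlapping boxes, which risks counting one vehicle twice. Raising score_threshold drops more low-confidence detections, which risks missing vehicles.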

・Pretrained model used (the external files can be downloaded from here)

github.com

・Source code (Colab)

import os
import time
from datetime import datetime
import cv2
import numpy as np
from PIL import Image
from google.colab import drive
from google.colab.patches import cv2_imshow
from collections import Counter
drive.mount('/content/drive')

# Common base path
base_path = '/content/drive/My Drive/Colab Notebooks/'

# Path settings
config_path = base_path + 'yolov4-tiny.cfg'
weights_path = base_path + 'yolov4-tiny.weights'
labels_path = base_path + 'coco.names'
image_path = base_path + 'car_4.jpg'

def load_labels(path):
    # Read one class name per line
    with open(path, 'r') as f:
        return [line.strip() for line in f.readlines()]

def load_network(config_path, weights_path):
    return cv2.dnn.readNetFromDarknet(config_path, weights_path)

def get_output_layers(net):
    # getUnconnectedOutLayers() returns 1-based layer indices
    layer_names = net.getLayerNames()
    return [layer_names[i - 1] for i in net.getUnconnectedOutLayers().flatten()]


def detect_objects(net, image, output_layers):
    # Scale pixels to [0, 1], resize to 416x416, and swap BGR to RGB
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    return net.forward(output_layers)

def draw_predictions(image, detections, labels):
    H, W = image.shape[:2]
    boxes = []
    confidences = []
    class_ids = []
    detected_labels = []

    for detection in detections:
        for output in detection:
            # Elements 0-3 are the box, 4 is objectness, 5 onward are class scores
            scores = output[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]

            if confidence > 0.5:
                # Convert the normalized center/size box to pixel coordinates
                box = output[:4] * np.array([W, H, W, H])
                centerX, centerY, width, height = box.astype('int')
                x, y = int(centerX - width / 2), int(centerY - height / 2)

                boxes.append([x, y, int(width), int(height)])
                confidences.append(float(confidence))
                class_ids.append(class_id)

    # Remove overlapping boxes with non-maximum suppression
    indices = cv2.dnn.NMSBoxes(boxes, confidences, score_threshold=0.5, nms_threshold=0.4)

    if len(indices) > 0:
        for i in indices.flatten():
            x, y, w, h = boxes[i]
            color = [int(c) for c in np.random.randint(0, 255, size=(3,))]
            cv2.rectangle(image, (x, y), (x + w, y + h), color, 2)
            text = "{}: {:.4f}".format(labels[class_ids[i]], confidences[i])
            cv2.putText(image, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
            detected_labels.append(labels[class_ids[i]])

    return Counter(detected_labels)

# Main execution
labels = load_labels(labels_path)
net = load_network(config_path, weights_path)
output_layers = get_output_layers(net)
image = cv2.imread(image_path)

detections = detect_objects(net, image, output_layers)
label_counts = draw_predictions(image, detections, labels)

for label, count in label_counts.items():
    print(f"{label}: {count}")

cv2_imshow(image)

car: 5
truck: 1

That is mostly correct.
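For the traffic-monitoring goal, the per-label counts could then be folded into a single vehicle total. This is only a minimal sketch: which labels count as a vehicle is an assumption made here, count_vehicles is a hypothetical helper, and the label spellings should be checked against the label file actually in use.

```python
from collections import Counter

# Assumed set of class names treated as road vehicles (adjust as needed).
# Label spelling varies between label files, e.g. 'motorbike' vs 'motorcycle'.
VEHICLE_LABELS = {'car', 'truck', 'bus', 'motorbike', 'motorcycle'}


def count_vehicles(label_counts):
    """Sum the counts of all labels that are considered vehicles."""
    return sum(n for label, n in label_counts.items() if label in VEHICLE_LABELS)


# The counts reported for this article's test image.
label_counts = Counter({'car': 5, 'truck': 1})
print(count_vehicles(label_counts))  # prints 6
```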

This is not one of the LLMs that are all the rage right now, but with this level of accuracy, replacing human traffic surveyors may not be far off.

*1: Argument passed to image.resize in the TensorFlow Lite code: (input_shape[1], input_shape[2])

mochimochi000's Diary

Notes to self