578 changes: 386 additions & 192 deletions xllm/core/framework/request/mm_codec.cpp
I think you should use AI to refactor all classes and functions in this file and add comments.

Large diffs are not rendered by default.

10 changes: 10 additions & 0 deletions xllm/core/framework/request/mm_codec.h
@@ -52,4 +52,14 @@ class OpenCVVideoDecoder {
              torch::Tensor& t,
              VideoMetadata& meta);
};
+
+class FFmpegAudioDecoder {
+ public:
+  FFmpegAudioDecoder() = default;
+  ~FFmpegAudioDecoder() = default;
+
+  bool decode(const std::string& raw_data,
+              torch::Tensor& t,
+              AudioMetadata& meta);
+};
}  // namespace xllm
43 changes: 40 additions & 3 deletions xllm/core/framework/request/mm_handler.cpp
@@ -107,7 +107,7 @@ bool ImageHandler::load(const MMContent& content,

bool ImageHandler::decode(MMInputItem& input) {
  OpenCVImageDecoder decoder;
-  return decoder.decode(input.raw_data_, input.decode_data_);
+  return decoder.decode(input.raw_data_, input.decode_image_);
}

bool VideoHandler::load(const MMContent& content,
@@ -135,14 +135,51 @@ bool VideoHandler::load(const MMContent& content,
}

bool VideoHandler::decode(MMInputItem& input) {
+  FFmpegAudioDecoder audio_decoder;
+  if (audio_decoder.decode(
+          input.raw_data_, input.decode_audio_, input.audio_meta_)) {
+    input.type_ |= MMType::AUDIO;
+  }
+
  OpenCVVideoDecoder decoder;
-  return decoder.decode(input.raw_data_, input.decode_data_, input.video_meta_);
+  return decoder.decode(
+      input.raw_data_, input.decode_video_, input.video_meta_);
}
Comment on lines 137 to +147

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The return value of VideoHandler::decode depends solely on the success of video decoding, ignoring the result of audio decoding. This means if a video file contains a valid audio stream but an unsupported or corrupted video stream, the function will return false even though audio was successfully extracted. This will cause the processing of the item to fail and the extracted audio data to be discarded. The function should return true if at least one of the media types (audio or video) is successfully decoded.

bool VideoHandler::decode(MMInputItem& input) {
  bool audio_decoded = false;
  FFmpegAudioDecoder audio_decoder;
  if (audio_decoder.decode(
          input.raw_data_, input.decode_audio_, input.audio_meta_)) {
    input.type_ |= MMType::AUDIO;
    audio_decoded = true;
  }

  OpenCVVideoDecoder video_decoder;
  bool video_decoded = video_decoder.decode(
      input.raw_data_, input.decode_video_, input.video_meta_);

  return audio_decoded || video_decoded;
}


+bool AudioHandler::load(const MMContent& content,
+                        MMInputItem& input,
+                        MMPayload& payload) {
+  input.clear();
+
+  const auto& audio_url = content.audio_url;
+  const auto& url = audio_url.url;
+
+  if (url.compare(0, dataurl_prefix_.size(), dataurl_prefix_) ==
+      0) {  // data url
+
+    input.type_ = MMType::AUDIO;
+    return this->load_from_dataurl(url, input.raw_data_, payload);
+  } else if (url.compare(0, httpurl_prefix_.size(), httpurl_prefix_) ==
+             0) {  // http url
+
+    input.type_ = MMType::AUDIO;
+    return this->load_from_http(url, input.raw_data_);
+  } else {
+    LOG(ERROR) << " audio url is invalid, url is " << url;
+    return false;
+  }
+}
Comment on lines +149 to +171

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The AudioHandler::load function (and similarly ImageHandler::load and VideoHandler::load) retrieves data from a user-provided URL. The code checks if the URL starts with "http" before passing it to the load_from_http function, which in turn makes a request to that URL. This validation is insufficient to prevent Server-Side Request Forgery (SSRF) attacks. An attacker can provide a URL pointing to internal services or cloud metadata endpoints (e.g., http://127.0.0.1/admin, http://169.254.169.254/latest/meta-data/). This could allow an attacker to scan the internal network, access sensitive internal services, or steal cloud infrastructure credentials.

Remediation:
Implement a strict allow-list of trusted domains and IP ranges that the server is permitted to request. Requests to URLs outside of this allow-list should be blocked.
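
A minimal sketch of such an allow-list check (illustrative only — the helper name, placeholder hosts, and URL parsing below are assumptions, not code from this PR):

#include <array>
#include <string>
#include <string_view>

// Sketch of an allow-list check. A real implementation would load trusted
// hosts from configuration and should also resolve the host and reject
// private / link-local ranges (127.0.0.0/8, 169.254.0.0/16, ...) to guard
// against DNS rebinding.
inline bool is_allowed_media_host(const std::string& url) {
  // Placeholder hosts; replace with the deployment's trusted domains.
  static const std::array<std::string_view, 2> kAllowedHosts = {
      "media.example.com", "cdn.example.com"};

  // Extract the host part: skip the scheme, stop at ':' or '/' or end.
  const auto scheme_end = url.find("://");
  if (scheme_end == std::string::npos) return false;
  const auto host_begin = scheme_end + 3;
  const auto host_end = url.find_first_of(":/", host_begin);
  const std::string_view host =
      std::string_view(url).substr(host_begin, host_end - host_begin);

  for (const auto allowed : kAllowedHosts) {
    if (host == allowed) return true;
  }
  return false;
}

// Possible use inside AudioHandler::load before the http branch:
//   if (!is_allowed_media_host(url)) {
//     LOG(ERROR) << "audio url host is not on the allow-list: " << url;
//     return false;
//   }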


+bool AudioHandler::decode(MMInputItem& input) {
+  FFmpegAudioDecoder decoder;
+  return decoder.decode(
+      input.raw_data_, input.decode_audio_, input.audio_meta_);
+}

MMHandlerSet::MMHandlerSet() {
  handlers_["image_url"] = std::make_unique<ImageHandler>();
  handlers_["video_url"] = std::make_unique<VideoHandler>();
-  // handlers_["audio_url"] = std::make_unique<AudioHandler>();
+  handlers_["audio_url"] = std::make_unique<AudioHandler>();
  handlers_["image_embedding"] =
      std::make_unique<MMEmbeddingHandler>(MMType::IMAGE);
  handlers_["video_embedding"] =
14 changes: 14 additions & 0 deletions xllm/core/framework/request/mm_handler.h
@@ -83,6 +83,20 @@ class VideoHandler : public MMHandlerBase {
  std::string dataurl_prefix_{"data:video"};
};

+class AudioHandler : public MMHandlerBase {
+ public:
+  AudioHandler() = default;
+  ~AudioHandler() = default;
+
+  virtual bool load(const MMContent& content,
+                    MMInputItem& input,
+                    MMPayload& payload) override;
+  virtual bool decode(MMInputItem& input) override;
+
+ private:
+  std::string dataurl_prefix_{"data:audio"};
+};
+
class MMHandlerSet {
 public:
  MMHandlerSet();
27 changes: 22 additions & 5 deletions xllm/core/framework/request/mm_input.h
@@ -31,13 +31,29 @@ struct MMInputItem {
    raw_data_.clear();
  }

-  MMType type_ = MMType::NONE;
+  std::optional<torch::Tensor> get_decode_data(MMType type) const {
+    if (type == MMType::IMAGE)
+      return decode_image_;
+    else if (type == MMType::VIDEO)
+      return decode_video_;
+    else if (type == MMType::AUDIO)
+      return decode_audio_;
+    else
+      return std::nullopt;
+  }
+
+  uint32_t type_ = MMType::NONE;
+
+  bool has_type(MMType type) const { return (type_ & type) != 0; }

  std::string raw_data_;  // binary

-  torch::Tensor decode_data_;  // image: rgb, [c,h,w], uint8
+  torch::Tensor decode_image_;  // image: rgb, [c,h,w], uint8
+  torch::Tensor decode_video_;  // video: rgb, [t,c,h,w], uint8
+  torch::Tensor decode_audio_;  // audio: mono, [t], float32

  VideoMetadata video_meta_;
+  AudioMetadata audio_meta_;

  EmbeddingOutput embedding_;
};
@@ -95,8 +111,9 @@ struct MMInput {
    std::vector<torch::Tensor> vec;

    for (const auto& item : items_) {
-      if (item.type_ == type) {
-        vec.emplace_back(item.decode_data_);
+      if (item.has_type(type)) {
+        auto t = item.get_decode_data(type);
+        if (t) vec.emplace_back(*t);
      }
    }
    return std::move(vec);
@@ -106,7 +123,7 @@
    std::vector<VideoMetadata> metas;
    metas.reserve(items_.size());
    for (auto& item : items_) {
-      if (item.type_ == MMType::VIDEO) {
+      if (item.has_type(MMType::VIDEO)) {
        metas.push_back(item.video_meta_);
      }
    }
1 change: 1 addition & 0 deletions xllm/models/vlm/npu/glm4v.h
@@ -952,6 +952,7 @@ class Glm4vForConditionalGenerationImpl : public torch::nn::Module {
    auto t = video_input->video_grid_thw.index({torch::indexing::Slice(), 0});
    auto video_tokens =
        ((video_input->video_grid_thw.prod(-1) / merge_size / merge_size) / t)
+            .cpu()
            .contiguous()
            .to(torch::kLong);
    std::vector<int64_t> video_tokens_vec(
1 change: 1 addition & 0 deletions xllm/models/vlm/npu/glm4v_moe.h
@@ -121,6 +121,7 @@ class Glm4vMoeForConditionalGenerationImpl : public torch::nn::Module {
    auto t = video_input->video_grid_thw.index({torch::indexing::Slice(), 0});
    auto video_tokens =
        ((video_input->video_grid_thw.prod(-1) / merge_size / merge_size) / t)
+            .cpu()
            .contiguous()
            .to(torch::kLong);
    std::vector<int64_t> video_tokens_vec(
1 change: 1 addition & 0 deletions xllm/models/vlm/npu/qwen2_5_vl.h
@@ -783,6 +783,7 @@ class Qwen2_5_VLForConditionalGenerationImpl : public torch::nn::Module {
        input_params);
    auto video_tokens =
        (video_input->video_grid_thw.prod(-1) / merge_size / merge_size)
+            .cpu()
            .contiguous()
            .to(torch::kLong);
    std::vector<int64_t> video_tokens_vec(
1 change: 1 addition & 0 deletions xllm/models/vlm/qwen2_5_vl.h
@@ -738,6 +738,7 @@ class Qwen2_5_VLForConditionalGenerationImpl : public torch::nn::Module {
        input_params);
    auto video_tokens =
        (video_input->video_grid_thw.prod(-1) / merge_size / merge_size)
+            .cpu()
            .contiguous()
            .to(torch::kLong);
    std::vector<int64_t> video_tokens_vec(
24 changes: 16 additions & 8 deletions xllm/processors/qwen2_vl_image_processor.cpp
@@ -162,23 +162,31 @@ bool Qwen2VLImageProcessor::process(const MMInput& inputs, MMData& datas) {
    std::vector<EmbeddingOutput> images_embedding;
    std::vector<torch::Tensor> videos;
    std::vector<VideoMetadata> video_meta_list;
+    std::vector<torch::Tensor> audios;
+    std::vector<AudioMetadata> audio_meta_list;

-    if (input_item.type_ == MMType::IMAGE) {
-      if (input_item.decode_data_.defined()) {
-        images.push_back(input_item.decode_data_);
+    if (input_item.has_type(MMType::IMAGE)) {
+      if (input_item.decode_image_.defined()) {
+        images.push_back(input_item.decode_image_);
      } else if (input_item.embedding_.embedding.defined()) {
        images_embedding.push_back(input_item.embedding_);
      }
-    } else if (input_item.type_ == MMType::VIDEO) {
-      if (input_item.decode_data_.defined()) {
-        videos.push_back(input_item.decode_data_);
+    } else if (input_item.has_type(MMType::VIDEO)) {
+      if (input_item.decode_video_.defined()) {
+        videos.push_back(input_item.decode_video_);
      }
      video_meta_list.push_back(input_item.video_meta_);
+    } else if (input_item.has_type(MMType::AUDIO)) {
+      if (input_item.decode_audio_.defined()) {
+        audios.push_back(input_item.decode_audio_);
+      }
+      audio_meta_list.push_back(input_item.audio_meta_);
    }
Comment on lines +168 to 184

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The if-else if structure for checking the media type of an MMInputItem is incorrect for handling items that contain multiple modalities, such as a video with an audio track. Since VideoHandler can produce an item with type_ being MMType::VIDEO | MMType::AUDIO, this logic will only process the video part and silently ignore the audio part. This will lead to data loss. You should use separate if statements for each modality to handle combined types correctly.

    if (input_item.has_type(MMType::IMAGE)) {
      if (input_item.decode_image_.defined()) {
        images.push_back(input_item.decode_image_);
      } else if (input_item.embedding_.embedding.defined()) {
        images_embedding.push_back(input_item.embedding_);
      }
    }
    if (input_item.has_type(MMType::VIDEO)) {
      if (input_item.decode_video_.defined()) {
        videos.push_back(input_item.decode_video_);
      }
      video_meta_list.push_back(input_item.video_meta_);
    }
    if (input_item.has_type(MMType::AUDIO)) {
      if (input_item.decode_audio_.defined()) {
        audios.push_back(input_item.decode_audio_);
      }
      audio_meta_list.push_back(input_item.audio_meta_);
    }


    if (images_embedding.empty() && images.empty() &&
-        (videos.empty() || video_meta_list.empty())) {
-      LOG(ERROR) << "no image/video tensor or embedding found.";
+        (videos.empty() || video_meta_list.empty()) &&
+        (audios.empty() || audio_meta_list.empty())) {
+      LOG(ERROR) << "no image/video/audio tensor or embedding found.";
      return false;
    }
