In automated video content production (such as meeting recordings, online courses, medical interview transcriptions, etc.), Automatic Speech Recognition (ASR) systems often segment audio streams into semantic fragments with timestamps and generate preliminary subtitles. However, due to the limitations of acoustic and language models, ASR outputs commonly suffer from deletion, substitution, and insertion errors—for example, recognizing "您好" (hello) as "你好" (hello), or omitting the character "以" in "可以" (can).
In the subsequent video synthesis phase (such as subtitle burning, highlight review, multimodal quality inspection), a core challenge is:
How to accurately map each timestamped ASR subtitle segment back to its corresponding interval in the original reference text, and restore the correct content that was omitted or erroneously replaced?
This not only relates to the semantic completeness and professionalism of subtitles (such as in medical or judicial scenarios) but also directly affects the quality and compliance of the final video product. This article proposes a lightweight alignment method that relies only on Python's standard library difflib, achieving high-precision synchronization and restoration of subtitle text against the reference original text without requiring additional models or external tools, making it particularly suitable for post-processing of Chinese video subtitles.
I Problem Modeling: Subtitle Segments vs Reference Full Text
Let the authoritative reference text corresponding to the video (such as a script, medical record summary, meeting minutes) be:
ref = "您好,请问有什么可以帮您?"
The subtitle segments output by the ASR system (usually with timestamps, but here we focus on text alignment) are:
asr_segments = ["你好", "请问有什么可帮您"]
Observations:
- "您好" → "你好" (substitution error);
- "可以" → "可" (deletion of "以");
- The ending punctuation "?" was not recognized.
The ideal subtitle alignment result should be (for burning or highlighting during video synthesis):
["您好,", "请问有什么可以帮您?"]
That is:
- Each subtitle segment corresponds to a continuous, semantically complete substring in the original text;
- Both misrecognized words ("好") and omitted content ("以", "?") are correctly restored;
- Subtitle boundaries are reasonable, without overlap or omission, facilitating direct use in video rendering after binding with timestamps.
II Technical Approach: Character-Level Alignment + Forward Anchoring
2.1 Why Choose Character-Level Alignment?
Chinese has no explicit word boundaries, and ASR errors often occur at the character level (such as "帮" vs "邦", or a missing "以"). Character-level comparison preserves contextual continuity as much as possible and avoids additional noise introduced by word segmentation errors.
2.2 Using difflib.SequenceMatcher to Build Mapping
Perform global comparison between ref and the concatenated ASR character sequence to generate editing operations (opcodes), identifying four types of relationships: equal, replace, delete, and insert.
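As a quick illustration, running SequenceMatcher (standard library only) on the example from Section I produces an opcode sequence in which the substitution and the omissions each surface as their own operation:

```python
import difflib

# Example from Section I: character-level diff between the reference
# text and the concatenated ASR output
ref = "您好,请问有什么可以帮您?"
asr = "你好" + "请问有什么可帮您"

matcher = difflib.SequenceMatcher(None, list(ref), list(asr))
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(tag, repr(ref[i1:i2]), '->', repr(asr[j1:j2]))
# The opcodes report a replace ("您" -> "你") and deletes for the
# characters ASR missed (",", "以", "?")
```

Each opcode carries the index ranges `[i1:i2]` in the reference and `[j1:j2]` in the ASR text, which is exactly the information needed to build the character mapping in the next step.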
2.3 Segment Boundary Strategy: Forward Anchoring
- The start position of the i-th subtitle segment = the index, within ref, of that segment's first valid ASR character;
- The end position = the start position of the (i+1)-th segment (or, for the last segment, the end of ref).
This strategy ensures:
- Deleted function words, punctuation, and connectives (such as "的", ",", "最终") are naturally included in the previous segment;
- Subtitle intervals continuously cover the full text, avoiding "gaps";
- After alignment with video timestamps, they can be directly used for subtitle rendering or segment editing.
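The anchoring rule can be sketched on the Section I example with a hand-built character mapping (the index values below were worked out by hand for illustration; the full implementation in the next section computes them automatically):

```python
ref = "您好,请问有什么可以帮您?"
# ref index of each ASR character, worked out by hand:
# "你好" -> [0, 1]; "请问有什么可帮您" -> [3, 4, 5, 6, 7, 8, 10, 11]
asr_to_ref_index = [0, 1, 3, 4, 5, 6, 7, 8, 10, 11]
seg_lens = [2, 8]  # len("你好"), len("请问有什么可帮您")

# Forward anchoring: each segment starts at its first character's ref
# index; it ends where the next segment starts (or at the end of ref)
starts, ptr = [], 0
for n in seg_lens:
    starts.append(asr_to_ref_index[ptr])
    ptr += n
ends = starts[1:] + [len(ref)]
print([ref[s:e] for s, e in zip(starts, ends)])
# -> ['您好,', '请问有什么可以帮您?']
```

Note how the unmatched characters ",", "以", and "?" fall between anchors and are therefore swept into the covering segments, which is what restores the omitted content.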
III Implementation Code
import difflib  # Python standard library


def align_asr_segments_to_ref(ref: str, asr_segments: list) -> list:
    """
    Align a list of ASR segments to the reference text, restoring deleted
    or replaced original content.

    Parameters:
        ref (str): Reference text (the authoritative original).
        asr_segments (list[str]): Segments output by ASR (may contain
            recognition errors).

    Returns:
        list[str]: Same length as asr_segments; each element is the
        corresponding substring of the original text.
    """
    asr_full = ''.join(asr_segments)
    if not asr_full:
        return [""] * len(asr_segments)

    # Character lists
    ref_chars = list(ref)
    asr_chars = list(asr_full)

    # Character-level alignment
    matcher = difflib.SequenceMatcher(None, ref_chars, asr_chars)
    opcodes = matcher.get_opcodes()

    # Build the ref index for each asr character (sequential mapping)
    asr_to_ref_index = [-1] * len(asr_chars)
    ref_i = 0
    for op_idx, (tag, i1, i2, j1, j2) in enumerate(opcodes):
        if tag == 'equal':
            for k in range(j1, j2):
                asr_to_ref_index[k] = ref_i
                ref_i += 1
        elif tag == 'replace':
            # asr[j1:j2] replaces ref[i1:i2]: map each asr character to
            # one ref character in order, as far as possible
            for k in range(j1, j2):
                if ref_i < i2:
                    asr_to_ref_index[k] = ref_i
                    ref_i += 1
                else:
                    # asr side is longer; map the surplus to the last ref position
                    asr_to_ref_index[k] = i2 - 1 if i2 > i1 else i1
            ref_i = i2  # ref[i1:i2] fully consumed
        elif tag == 'delete':
            # ref[i1:i2] has no asr counterpart. For a deletion at the
            # very start, keep ref_i at 0 so the missing prefix falls into
            # the first segment; otherwise advance past the deleted span.
            if op_idx != 0:
                ref_i = i2
        elif tag == 'insert':
            # asr has extra characters with no ref counterpart; anchor
            # them at the current ref position (ref_i does not advance)
            for k in range(j1, j2):
                asr_to_ref_index[k] = ref_i

    # Start position of each ASR segment in ref: the ref index of its
    # first mapped character
    segment_starts = []
    char_ptr = 0
    for seg in asr_segments:
        start = None
        for k in range(char_ptr, char_ptr + len(seg)):
            if asr_to_ref_index[k] != -1:
                start = asr_to_ref_index[k]
                break
        if start is None:
            # Empty segment, or every character was an insert:
            # reuse the previous segment's start (or 0)
            start = segment_starts[-1] if segment_starts else 0
        segment_starts.append(start)
        char_ptr += len(seg)

    # End position of each segment: start of the next segment, or end of ref
    segment_ends = segment_starts[1:] + [len(ref)]

    # Generate the result, clamping indices to avoid going out of bounds
    result = []
    for start, end in zip(segment_starts, segment_ends):
        start = max(0, min(start, len(ref)))
        end = max(start, min(end, len(ref)))
        result.append(ref[start:end])
    return result
The code can be tested quickly with the Online Python Runner at https://toolshu.com/en/python3.
IV Application Scenario Examples
Example 1: Customer Service Dialogue Scenario
ref = "您好,请问有什么可以帮您?"
asr_segments = ["你好", "请问有什么可帮您"]
aligned = align_asr_segments_to_ref(ref, asr_segments)
print(aligned)
# Output: ['您好,', '请问有什么可以帮您?']
Example 2: Technical Meeting Minutes Scenario
ref = "我们在 Q3 的模型训练中使用了 LoRA 微调策略,结合 8 卡 A100 集群,最终在 72 小时内完成了 130 亿参数大模型的全量训练,验证集准确率达到 897%。"
asr_segments = ["Q3模型训练用了LoRA微调", "8卡A100集群", "72小时完成130亿参数训练", "验证准确率897"]
aligned = align_asr_segments_to_ref(ref, asr_segments)
print(aligned)
# Output: ['我们在 Q3 的模型训练中使用了 LoRA 微调策略,结合 ', '8 卡 A100 集群,最终在 ', '72 小时内完成了 130 亿参数大模型的全量训练,', '验证集准确率达到 89.7%。']
This alignment result successfully restored key elements that were missed by ASR in the original text, including:
- Subject "我们" (we) and connectives such as "在……中" (in...), "结合" (combined with), "最终" (finally) and other logical cohesive structures;
- Terminology completeness: "LoRA 微调策略" (LoRA fine-tuning strategy), "130 亿参数大模型" (13 billion parameter large model), "全量训练" (full training);
- Punctuation and tone: "," (comma), "。" (period) to maintain sentence boundaries;
- Expression standardization: "验证集准确率" (validation set accuracy) rather than the more colloquial "验证准确率" (validation accuracy).
V Advantages and Applicable Boundaries
✅ Advantages
- Zero dependencies: Only uses Python standard library, easy to integrate into video synthesis pipelines;
- High-fidelity restoration: Precisely restores missed punctuation, function words, and terminology, enhancing credibility in professional scenarios;
- Timestamp-friendly: Alignment results correspond one-to-one with original ASR segments, can be directly bound to timestamps for video rendering;
- Language-universal: Applicable to any Unicode text, and especially suitable for Chinese, Japanese, and other languages written without spaces.
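The "timestamp-friendly" point can be sketched as follows. This is a minimal illustration, not part of the alignment method itself: the segment time spans are hypothetical placeholder values, and SRT is just one possible rendering target.

```python
def sec_to_srt(t: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 1.5 -> '00:00:01,500'."""
    ms = int(round(t * 1000))
    h, ms = divmod(ms, 3600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(aligned: list, spans: list) -> str:
    """Render aligned text segments plus (start, end) spans as SRT cues."""
    cues = []
    for i, (text, (start, end)) in enumerate(zip(aligned, spans), 1):
        cues.append(f"{i}\n{sec_to_srt(start)} --> {sec_to_srt(end)}\n{text}\n")
    return "\n".join(cues)

# Aligned text from Example 1, bound to hypothetical timestamps
aligned = ["您好,", "请问有什么可以帮您?"]
spans = [(0.0, 1.2), (1.2, 3.5)]  # illustrative (start_sec, end_sec) values
print(to_srt(aligned, spans))
```

Because the aligned list has the same length and order as the original ASR segments, the i-th time span attaches to the i-th restored subtitle with no extra bookkeeping.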
⚠️ Applicable Boundaries
- Requires ASR segments to be in correct order (does not handle out-of-order segments).
VI Appendix
After character alignment using the above code, punctuation in the original text is preserved by default. If you need to remove it, you can refer to the following code:
1. Remove leading and trailing punctuation
import string
text = "您好,请问有什么可以帮您?"
text = text.strip(string.punctuation + "!?。;:、()【】")
print(text)
# Output: 您好,请问有什么可以帮您
2. Remove all punctuation in the text (including spaces)
text = "您好,请问有什么可以帮您?"
text = text.translate(str.maketrans('', '', ' ,!?。;:、()【】“”')).strip()
print(text)
# Output: 您好请问有什么可以帮您
VII Conclusion
As automated video production becomes increasingly widespread, subtitles are not only carriers of information but also embodiments of professionalism and user experience. Through simple character-level alignment logic, this solution effectively addresses the loss of content integrity in ASR subtitles during the video synthesis phase, and its engineering value has been verified in medical interviews, academic lectures, customer service, and other scenarios.
Note: The complete code has been tested with Python 3.8+ and can be directly integrated into ASR post-processing pipelines. Boundary strategies can be adjusted, or multimodal alignment capabilities expanded, according to specific business needs.
Article URL: https://toolshu.com/en/article/bdk89v7b
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.