Text Preprocessing in qhChina
The qhChina package provides preprocessing utilities designed for Chinese text, with a focus on segmentation (word tokenization). The preprocessing module offers flexible tools for tokenizing Chinese text through several interchangeable backends.
Text Segmentation
Chinese text segmentation (word tokenization) is a critical preprocessing step for computational text analysis, since Chinese text does not use spaces to separate words. The qhchina.preprocessing module provides tools for segmenting Chinese text.
Basic Usage
from qhchina.preprocessing.segmentation import create_segmenter
# Create a segmenter with default settings (uses spaCy)
segmenter = create_segmenter(backend="spacy")
# Segment a single text into tokens
text = "量子计算将改变密码学的未来"
tokens = segmenter.segment(text)
print(tokens) # Output: ['量子', '计算', '将', '改变', '密码学', '的', '未来']
# To process multiple texts, iterate over them
texts = ["深度学习模型理解复杂语境", "太空探索发现新的系外行星"]
results = []
for text in texts:
    tokens = segmenter.segment(text)
    results.append(tokens)
print(results) # Output: [['深度', '学习', '模型', '理解', '复杂', '语境'], ['太空', '探索', '发现', '新的', '系外', '行星']]
# Create a segmenter with line-by-line processing strategy
line_segmenter = create_segmenter(backend="spacy", strategy="line")
long_text = """古代文明的天文观测记录。
量子纠缠现象的神奇特性。
人类意识的哲学讨论。"""
result = line_segmenter.segment(long_text) # Process each line separately
print(result) # Output: [['古代', '文明', '的', '天文', '观测', '记录', '。'], ['量子', '纠缠', '现象', '的', '神奇', '特性', '。'], ...]
Segmentation Strategies
The segmenters support different processing strategies through the strategy parameter:
from qhchina.preprocessing.segmentation import create_segmenter
# Create a segmenter with sentence-by-sentence processing
segmenter = create_segmenter(
    backend="spacy",
    strategy="sentence"  # Process text sentence by sentence
)
# For LLM-based segmentation with API calls, you might want to process in chunks
llm_segmenter = create_segmenter(
    backend="llm",
    strategy="chunk",                      # Process text in chunks
    chunk_size=1000,                       # Size of each chunk in characters
    api_key="your-key",
    model="gpt-3.5-turbo",
    endpoint="https://api.openai.com/v1"
)
The available strategies are:
Strategy | Description | Best Use Cases |
---|---|---|
whole | Process the entire text at once | Small to medium texts, when you need context across the entire document |
line | Split by line breaks and process each line separately | Large documents, log files, or structured text with natural line breaks |
sentence | Split into sentences and process each sentence | NLP tasks like word2vec that need sentence boundaries, or when sentence context is important |
chunk | Split into fixed-size chunks and process each chunk | Very large documents, or when using API-based backends like LLM to avoid token limits |
Choosing the right strategy depends on:
- Performance considerations: Batched processing can be much faster
- Memory constraints: Processing large documents in smaller units reduces memory usage
- API costs: For LLM backends, processing in chunks can reduce API costs
- Context requirements: Some tasks need sentence boundaries preserved
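Putting these considerations together, a small helper along the lines of the sketch below can choose a strategy from rough heuristics. The helper name, the length threshold, and the heuristics themselves are illustrative assumptions, not part of qhChina:

from qhchina.preprocessing.segmentation import create_segmenter

def choose_segmenter(text_length, backend="jieba", **kwargs):
    """Hypothetical helper: pick a processing strategy from rough heuristics."""
    if backend == "llm":
        strategy = "chunk"       # keep each API request within token limits
    elif text_length > 100_000:
        strategy = "line"        # very large documents: process in smaller units
    else:
        strategy = "whole"       # smaller texts: keep full document context
    return create_segmenter(backend=backend, strategy=strategy, **kwargs)

segmenter = choose_segmenter(500, backend="jieba")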
Creating a Segmenter
The create_segmenter() function is the main entry point for creating segmentation tools:
from qhchina.preprocessing.segmentation import create_segmenter
# Create a segmenter with spaCy backend
segmenter = create_segmenter(
    backend="spacy",                    # Segmentation backend
    model_name="zh_core_web_sm",        # Small Chinese spaCy model
    batch_size=100,                     # Batch size for processing
    strategy="chunk",                   # Process text in fixed-size chunks
    chunk_size=512,                     # Chunk size (for "chunk" strategy)
    filters={
        "min_length": 2,                # Minimum token length to include
        "excluded_pos": ["NUM", "SYM"], # POS tags to exclude
        "stopwords": ["的", "了"]        # Stopwords to exclude
    }
)
# Or create a segmenter with Jieba backend
jieba_segmenter = create_segmenter(
    backend="jieba",                # Use Jieba backend
    pos_tagging=True,               # Enable POS tagging
    strategy="line",                # Process line by line
    filters={
        "min_length": 2,            # Minimum token length to include
        "excluded_pos": ["m", "x"], # POS tags to exclude
        "stopwords": ["的", "了"]    # Stopwords to exclude
    }
)
Available Backends
Currently, the following segmentation backends are supported:
- spaCy: A powerful NLP library with Chinese language support
  - Requires installing spaCy and a Chinese model: pip install spacy && python -m spacy download zh_core_web_sm
  - Supports POS filtering and other advanced features
  - Slower but more accurate for complex NLP tasks
  - Efficient batch processing using nlp.pipe() internally
- Jieba: A popular Chinese text segmentation library
  - Requires installing Jieba: pip install jieba
  - Faster processing speed, especially for large volumes of text
  - Simpler to use with good accuracy for most use cases
- BERT: Neural-based Chinese word segmentation using BERT models
  - Requires installing transformers and torch: pip install transformers torch
  - Offers high accuracy for complex texts using deep learning models
  - Supports various tagging schemes and pre-trained models
- LLM: Large Language Model-based segmentation using API services like OpenAI
  - Requires installing openai: pip install openai
  - Leverages state-of-the-art LLMs for accurate segmentation
  - Customizable through prompts and system messages
  - Using "chunk" or "sentence" strategy is recommended to reduce API costs
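If you are unsure which of these backends is installed in a given environment, a fallback pattern such as the sketch below can help. It is a minimal illustration that assumes only the create_segmenter API shown above; the try/except fallback is not a qhChina feature:

from qhchina.preprocessing.segmentation import create_segmenter

def build_segmenter():
    """Prefer spaCy for its POS filtering; fall back to Jieba if spaCy or its model is missing."""
    try:
        return create_segmenter(backend="spacy", model_name="zh_core_web_sm")
    except Exception:
        # spaCy (or zh_core_web_sm) is not available; Jieba is lighter to set up
        return create_segmenter(backend="jieba")

segmenter = build_segmenter()
print(segmenter.segment("数据科学帮助解决复杂社会问题"))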
SpacySegmenter
The SpacySegmenter class provides Chinese text segmentation using spaCy models:
from qhchina.preprocessing.segmentation import SpacySegmenter
# Create a spaCy-based segmenter
segmenter = SpacySegmenter(
    model_name="zh_core_web_sm",                 # spaCy model to use
    disable=["ner", "lemmatizer"],               # Disable components for speed
    batch_size=100,                              # Batch size for processing
    user_dict=["量子物理", "深度学习"],            # Custom user dictionary
    strategy="sentence",                         # Process sentence by sentence
    filters={
        "min_length": 2,                         # Min token length to keep
        "excluded_pos": ["NUM", "SYM", "SPACE"], # POS tags to exclude
        "stopwords": ["的", "了"]                 # Stopwords to exclude
    }
)
Available spaCy Models for Chinese
Model | Size | Description |
---|---|---|
zh_core_web_sm | Small | Basic POS tagging and dependency parsing |
zh_core_web_md | Medium | Includes word vectors |
zh_core_web_lg | Large | Larger vocabulary and word vectors |
Install with: python -m spacy download zh_core_web_sm
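Switching models only requires changing model_name. As a brief sketch (assuming the medium model has already been downloaded):

from qhchina.preprocessing.segmentation import SpacySegmenter

# Assumes `python -m spacy download zh_core_web_md` has been run
segmenter = SpacySegmenter(model_name="zh_core_web_md")
tokens = segmenter.segment("量子计算将改变密码学的未来")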
JiebaSegmenter
The JiebaSegmenter class provides Chinese text segmentation using the Jieba library:
from qhchina.preprocessing.segmentation import JiebaSegmenter
# Create a Jieba-based segmenter
segmenter = JiebaSegmenter(
    pos_tagging=True,                  # Enable POS tagging
    user_dict_path="path/to/dict.txt", # Custom user dictionary
    strategy="line",                   # Process line by line
    filters={
        "min_length": 2,               # Min token length to keep
        "excluded_pos": ["m", "x"],    # POS tags to exclude (Jieba's POS tags)
        "stopwords": ["的", "了"],      # Words to exclude
    }
)
BertSegmenter
The BertSegmenter class provides neural-based Chinese text segmentation using BERT models:
from qhchina.preprocessing.segmentation import BertSegmenter
# Create a BERT-based segmenter with a fine-tuned model
segmenter = BertSegmenter(
    model_name="bert-modern-chinese-segmentation", # BERT model to use
    tagging_scheme="bmes",        # Tagging scheme: "be", "bme", or "bmes"
    batch_size=16,                # Batch size for processing
    device="cuda",                # Use GPU if available
    strategy="chunk",             # Process in fixed-size chunks
    chunk_size=512,               # Max sequence length for BERT
    filters={
        "min_length": 2,          # Min token length to keep
        "stopwords": ["的", "了"], # Words to exclude
    }
)
# Segment text
text = "量子纠缠实验验证了非局域性原理。"
tokens = segmenter.segment(text)
print(tokens) # Output: [['量子', '纠缠', '实验', '验证', '非局域性', '原理']] (stopwords and single-character tokens removed by the filters)
Tagging Schemes
The BertSegmenter supports several tagging schemes:
Scheme | Tags | Description |
---|---|---|
be | B, E | Beginning and End of words |
bme | B, M, E | Beginning, Middle, and End of words |
bmes | B, M, E, S | Beginning, Middle, End, and Single-character words |
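To make these schemes concrete, the following standalone sketch (not part of qhChina) shows how character-level BMES tags are typically decoded back into words:

def decode_bmes(chars, tags):
    """Decode character-level BMES tags into words (illustrative only)."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # single-character word
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":          # beginning of a multi-character word
            if current:
                words.append(current)
            current = ch
        elif tag == "M":          # middle of a word
            current += ch
        else:                     # "E": end of a word
            words.append(current + ch)
            current = ""
    if current:
        words.append(current)
    return words

print(decode_bmes(list("量子纠缠实验"), ["B", "E", "B", "E", "B", "E"]))
# Output: ['量子', '纠缠', '实验']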
LLMSegmenter
The LLMSegmenter class provides Chinese text segmentation using Large Language Models via API services:
from qhchina.preprocessing.segmentation import LLMSegmenter
# Create an LLM-based segmenter using OpenAI
segmenter = LLMSegmenter(
    api_key="your-openai-api-key",        # API key for the service
    model="gpt-3.5-turbo",                # Model to use
    endpoint="https://api.openai.com/v1", # API endpoint URL - required parameter
    system_message="你是语言学专家。",      # System message for better segmentation
    temperature=0.1,                      # Lower temperature for more consistent results
    strategy="chunk",                     # Process in chunks to reduce API costs
    chunk_size=1000,                      # Size of each chunk in characters
    filters={
        "min_length": 1,                  # Min token length to keep
        "stopwords": ["的", "了"]          # Words to exclude
    }
)
# Segment text
text = "脑机接口技术将改变人类与数字世界的交互方式。"
tokens = segmenter.segment(text)
print(tokens) # Output: [['脑机', '接口', '技术', '将', '改变', '人类', '与', '数字', '世界', '交互', '方式', '。']] ('的' removed by the stopword filter)
# Process multiple texts one by one
texts = ["虚拟现实创造沉浸式体验", "基因编辑技术引发伦理讨论"]
results = []
for text in texts:
    results.append(segmenter.segment(text))
print(results)
Custom Prompts
You can customize the segmentation prompt for different segmentation styles:
# Custom prompt for academic-style segmentation
custom_prompt = """
请将以下中文文本分词,按照学术标准分词。请用JSON格式回答。
示例:
输入: "深度学习模型识别复杂图像"
输出: ["深度学习", "模型", "识别", "复杂", "图像"]
输入: "{text}"
输出:
"""
segmenter = LLMSegmenter(
    api_key="your-openai-api-key",
    model="gpt-4",
    endpoint="https://api.openai.com/v1", # API endpoint URL - required parameter
    prompt=custom_prompt,
    temperature=0,
    strategy="sentence"                   # Process sentence by sentence
)
Filtering Options
All segmenters support filtering options that can be passed during initialization:
Filter | Description |
---|---|
min_length | Minimum length of tokens to include (default: 1) |
excluded_pos | Set of POS tags to exclude (requires POS tagging support) |
stopwords | List of words to exclude from results |
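For example, the configuration below keeps only tokens of two or more characters, drops numerals and symbols, and removes the listed stopwords. It is a minimal sketch assuming the spaCy backend and its Universal POS tag set:

from qhchina.preprocessing.segmentation import create_segmenter

segmenter = create_segmenter(
    backend="spacy",
    filters={
        "min_length": 2,                # drop single-character tokens
        "excluded_pos": ["NUM", "SYM"], # drop numerals and symbols
        "stopwords": ["的", "了"],       # drop common function words
    },
)
print(segmenter.segment("人工智能正在革新医疗诊断技术"))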
Common Workflows
Basic Text Processing Pipeline
from qhchina.preprocessing.segmentation import create_segmenter
from qhchina.helpers import load_texts, load_stopwords
# Load stopwords
stopwords = load_stopwords("zh_sim")
# Create segmenter with filters and strategy
segmenter = create_segmenter(
    backend="jieba",      # Using Jieba for faster processing
    strategy="sentence",  # Process sentence by sentence
    filters={
        "stopwords": stopwords,
        "min_length": 2
    }
)
# Load and process texts
raw_texts = load_texts(["path/to/file1.txt", "path/to/file2.txt"])
processed_sentences = []
for text in raw_texts:
    # Segment text based on the strategy
    sentences = segmenter.segment(text)
    processed_sentences.extend(sentences)
# Now processed_sentences is ready for analytics tasks
Integration with Analytics
from qhchina.preprocessing.segmentation import create_segmenter
from qhchina.analytics.topicmodels import LDAGibbsSampler
# Create segmenter with sentence strategy for word2vec or topic modeling
segmenter = create_segmenter(backend="spacy", strategy="sentence")
# Process text
text = """
宇宙起源理论存在多种可能性。暗物质构成宇宙的大部分质量。
量子力学和相对论难以统一。人类意识的本质仍是未解之谜。
"""
sentences = segmenter.segment(text) # Returns a list of tokenized sentences
# Use processed text in analytics
lda = LDAGibbsSampler(n_topics=5)
lda.fit(sentences)
# Get topics
topics = lda.get_topics(n_words=5)
for i, topic in enumerate(topics):
    print(f"Topic {i}: {[word for word, _ in topic]}")
Using Original Libraries Directly
Using spaCy Directly
While qhChina’s segmentation wrappers provide convenience and enhanced features, you can also use the underlying libraries directly. Here’s how to use spaCy directly for Chinese text segmentation:
import spacy
import re
# Download the Chinese model if not already installed
try:
    nlp = spacy.load("zh_core_web_sm")
except OSError:
    spacy.cli.download("zh_core_web_sm")
    nlp = spacy.load("zh_core_web_sm")
# Define a function to split text into sentences
def split_into_sentences(text, pattern=r"([。!?\.!?……]+)"):
    # Split by sentence-ending punctuation, but keep the punctuation
    raw_splits = re.split(pattern, text)
    # Combine sentence content with its ending punctuation
    sentences = []
    i = 0
    while i < len(raw_splits):
        if i + 1 < len(raw_splits) and re.match(pattern, raw_splits[i+1]):
            sentences.append(raw_splits[i] + raw_splits[i+1])
            i += 2
        else:
            if raw_splits[i].strip():
                sentences.append(raw_splits[i])
            i += 1
    return sentences
# Process text with different strategies
text = """量子计算将改变密码学的未来。
人工智能正在革新医疗诊断技术。
数据科学帮助解决复杂社会问题。"""
# Strategy 1: Process whole text at once
doc = nlp(text)
tokens_whole = [token.text for token in doc if not token.is_space]
print("Whole text processing:", tokens_whole)
# Strategy 2: Process line by line
lines = [line.strip() for line in text.split('\n') if line.strip()]
tokens_by_line = []
for line in lines:
    doc = nlp(line)
    tokens_by_line.append([token.text for token in doc if not token.is_space])
print("Line-by-line processing:", tokens_by_line)
# Strategy 3: Process sentence by sentence
sentences = split_into_sentences(text)
tokens_by_sentence = []
for sentence in sentences:
    doc = nlp(sentence)
    tokens_by_sentence.append([token.text for token in doc if not token.is_space])
print("Sentence-by-sentence processing:", tokens_by_sentence)
# Strategy 4: Batch processing with nlp.pipe() for efficiency
batch_texts = ["量子计算研究进展", "自然语言处理应用", "机器学习模型评估"]
tokens_batch = []
for doc in nlp.pipe(batch_texts, batch_size=50):
    tokens_batch.append([token.text for token in doc if not token.is_space])
print("Batch processing:", tokens_batch)
# Advanced: Filter by POS and length
filtered_tokens = []
stopwords = {"的", "了", "和"}
for doc in nlp.pipe(batch_texts):
    doc_tokens = [token.text for token in doc
                  if not token.is_space
                  and token.pos_ not in {"PUNCT", "NUM", "SYM"}
                  and len(token.text) >= 2
                  and token.text not in stopwords]
    filtered_tokens.append(doc_tokens)
print("Filtered tokens:", filtered_tokens)
Using Jieba Directly
Similarly, you can use the Jieba library directly for word segmentation:
import jieba
import jieba.posseg as pseg
import re
# Optional: Add custom words to the dictionary
jieba.add_word("量子计算")
jieba.add_word("人工智能")
# Define a function to split text into sentences (same as above)
def split_into_sentences(text, pattern=r"([。!?\.!?……]+)"):
    raw_splits = re.split(pattern, text)
    sentences = []
    i = 0
    while i < len(raw_splits):
        if i + 1 < len(raw_splits) and re.match(pattern, raw_splits[i+1]):
            sentences.append(raw_splits[i] + raw_splits[i+1])
            i += 2
        else:
            if raw_splits[i].strip():
                sentences.append(raw_splits[i])
            i += 1
    return sentences
# Process text with different strategies
text = """量子计算将改变密码学的未来。
人工智能正在革新医疗诊断技术。
数据科学帮助解决复杂社会问题。"""
# Strategy 1: Process whole text at once
tokens_whole = list(jieba.cut(text))
print("Whole text processing:", tokens_whole)
# Strategy 2: Process line by line
lines = [line.strip() for line in text.split('\n') if line.strip()]
tokens_by_line = []
for line in lines:
    tokens_by_line.append(list(jieba.cut(line)))
print("Line-by-line processing:", tokens_by_line)
# Strategy 3: Process sentence by sentence
sentences = split_into_sentences(text)
tokens_by_sentence = []
for sentence in sentences:
    tokens_by_sentence.append(list(jieba.cut(sentence)))
print("Sentence-by-sentence processing:", tokens_by_sentence)
# With POS tagging
pos_tagged = []
stopwords = {"的", "了", "和"}
excluded_pos = {"x", "m", "r"} # Exclude certain POS tags
for sentence in sentences:
    # Get tokens with POS tags
    tokens = pseg.cut(sentence)
    # Filter by POS and length
    filtered = [word for word, flag in tokens
                if len(word) >= 2
                and word not in stopwords
                and flag not in excluded_pos]
    pos_tagged.append(filtered)
print("POS-filtered tokens:", pos_tagged)
# Load a user dictionary from file
# jieba.load_userdict("path/to/dict.txt")
These examples demonstrate how to use spaCy and Jieba directly, but they require more manual coding compared to qhChina’s wrappers, especially for handling different processing strategies, filtering, and batching.