Compare commits
2 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | 0098977172 | | |
| | 5231e995dd | | |
@@ -0,0 +1,3 @@
data/*
output/*
sentiment_output/*
@@ -0,0 +1,163 @@
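# Guba Data Crawling and Sentiment Analysis System

A crawler system built on Guba (stock forum) data from Eastmoney (东方财富网). It supports data crawling, sentiment analysis, and keyword mining.

## Features

- 🕷️ **Guba data crawling** - Automatically crawls Guba posts for the specified stocks
- 😊 **Sentiment analysis** - Computes emotion scores based on the Dalian University of Technology (DUT) Chinese affective lexicon ontology
- 🔍 **Keyword mining** - Extracts trending topics with the TF-IDF algorithm
- 📊 **Visualization output** - Generates word clouds, emotion distribution charts, and other visualizations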

## Project Structure

```
guba2vec/
├── spider.py                          # Guba data crawler
├── sentiment_analysis.py              # Sentiment analysis module
├── analyze.py                         # TF-IDF keyword analysis
├── requirements.txt                   # Dependency list
├── 大连理工大学中文情感词汇本体.xlsx  # Sentiment lexicon (DUT Chinese affective lexicon ontology)
└── data/                              # Storage directory for crawled data
    └── guba_*.json/xlsx
```

## Installing Dependencies

```bash
pip install -r requirements.txt
```

## Usage

### 1. Crawl Guba data

```bash
python spider.py
```

By default, the following gaming-industry stocks are crawled:
- 完美世界 (002624)
- 三七互娱 (002555)
- 巨人网络 (002558)
- 世纪华通 (002602)
- 昆仑万维 (300418)
- 游族网络 (002174)
- 掌趣科技 (300315)
- 吉比特 (603444)

Crawl results are saved in the `data/` directory in both JSON and Excel formats.

### 2. Sentiment analysis

```bash
python sentiment_analysis.py
```

Emotion analysis is based on the DUT Chinese affective lexicon ontology (大连理工大学中文情感词汇本体) and supports 7 emotion types:
- Positive emotions: joy (快乐), approval (好评), surprise (惊讶)
- Negative emotions: anger (愤怒), sadness (悲伤), fear (恐惧), disgust (厌恶)

Analysis results are saved in the `sentiment_output/` directory and include:
- Detailed sentiment data for each stock (CSV)
- Summary emotion statistics (CSV)
- Visualization charts (PNG)

### 3. Keyword analysis

```bash
python analyze.py
```

Keywords are extracted with the TF-IDF algorithm and rendered as word clouds. Results are saved in the `output/` directory.
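
The TF-IDF idea can be illustrated with a minimal sketch that uses only the standard library. It is not the project's implementation: `analyze.py` relies on scikit-learn's `TfidfVectorizer` with jieba tokenization, which applies smoothing and normalization that this sketch omits. The function name `tfidf_scores` and the pre-tokenized sample posts are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: tf-idf} dict per tokenized document.

    Uses the textbook weighting tf * log(N / df), with no smoothing
    and no normalization.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical, already-tokenized posts.
docs = [
    ["游戏", "版号", "利好"],
    ["游戏", "业绩", "下滑"],
    ["游戏", "版号", "发放"],
]
scores = tfidf_scores(docs)
```

A term that occurs in every post (here `游戏`) gets a weight of zero, while terms confined to a single post score highest, which is why TF-IDF tends to surface topic-specific words.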

## Core Modules

### spider.py

Main functions:
- `fetch_guba_data(code, page, page_size, sort_type)` - Crawls a single page of data
- `fetch_stock_posts(code, name, pages, page_size)` - Crawls multiple pages of data
- `save_to_json(data, name, filename)` - Saves data as JSON
- `save_to_excel(data, name, filename)` - Saves data as Excel

### sentiment_analysis.py

Main functions:
- `build_sentiment_dictionary()` - Builds the sentiment lexicon
- `emotion_caculate(text, sentiment_dict)` - Computes the emotions of a text
- `load_and_analyze_data(data_dir, output_dir)` - Analyzes data in batch
- `generate_visualizations()` - Generates visualization charts

### analyze.py

Main functions:
- `clean_text(text)` - Cleans text
- `tokenize(text)` - Segments Chinese text into words
- `calculate_tfidf(texts)` - Computes TF-IDF
- `get_top_keywords(tfidf_matrix, feature_names, top_n=20)` - Returns the top keywords
- `generate_wordcloud(keywords, stock_name, output_dir='output')` - Generates a word cloud

## Sentiment Lexicon

Uses the **DUT Chinese affective lexicon ontology (大连理工大学中文情感词汇本体)**, which you must obtain yourself. It contains:
- 27469 sentiment words
- 7 emotion categories
- 3 intensity levels
- 2 polarities (positive/negative)

Fallback: a built-in simplified sentiment lexicon with about 200 common sentiment words.

## Data Format

### Crawled data (JSON)
```json
{
  "stock_code": "002624",
  "stock_name": "完美世界",
  "total_pages": 10,
  "total_posts": 200,
  "crawl_time": "2024-01-01T12:00:00",
  "posts": [
    {
      "post_id": "123456",
      "post_title": "Title",
      "post_content": "Content",
      "post_user": {"user_nickname": "Username"},
      "post_publish_time": "2024-01-01 10:00",
      "post_click_count": 100,
      "post_comment_count": 10,
      "post_like_count": 5
    }
  ]
}
```
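
As a rough illustration of consuming this format, the sketch below parses one record and joins each post's title and content, much as `load_data` in `analyze.py` does. The helper name `iter_post_texts` and the inline sample record are hypothetical. A real run would read the files under `data/` instead.

```python
import json

# Hypothetical sample record following the documented layout.
SAMPLE = """
{
  "stock_code": "002624",
  "stock_name": "完美世界",
  "posts": [
    {"post_id": "123456", "post_title": "Title", "post_content": "Content"},
    {"post_id": "123457", "post_title": "", "post_content": ""}
  ]
}
"""


def iter_post_texts(record):
    """Yield (post_id, text) for every post with a non-empty title or content."""
    for post in record.get("posts", []):
        title = post.get("post_title", "")
        content = post.get("post_content", "")
        text = f"{title} {content}".strip()
        if text:
            yield post.get("post_id"), text


record = json.loads(SAMPLE)
texts = list(iter_post_texts(record))
```

Using `dict.get` with defaults keeps the loop tolerant of posts that lack a field, and posts that are entirely empty are skipped.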

### Sentiment analysis results (CSV)
| 帖子ID (post ID) | 标题 (title) | 内容 (content) | positive | negative | sentiment_score |
|------------------|--------------|----------------|----------|----------|-----------------|
| 123456 | ... | ... | 5 | 2 | 3 |
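
One reading of the sample row is that `sentiment_score` equals the positive count minus the negative count (5 - 2 = 3). The sketch below assumes that relation. It is not confirmed by `sentiment_analysis.py`, whose diff is not shown here, and both the tiny lexicon and the `score_tokens` name are hypothetical.

```python
# Hypothetical mini-lexicon mapping a word to its polarity; the real lexicon
# would be built from the DUT ontology spreadsheet.
LEXICON = {
    "利好": "positive",
    "上涨": "positive",
    "暴跌": "negative",
    "亏损": "negative",
}


def score_tokens(tokens, lexicon=LEXICON):
    """Count positive and negative hits and return their difference as the score."""
    positive = sum(1 for t in tokens if lexicon.get(t) == "positive")
    negative = sum(1 for t in tokens if lexicon.get(t) == "negative")
    return {
        "positive": positive,
        "negative": negative,
        "sentiment_score": positive - negative,
    }


result = score_tokens(["利好", "上涨", "暴跌", "今天"])
```

Words missing from the lexicon (here `今天`) contribute to neither count, so the score reflects only recognized sentiment words.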
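
## Notes

1. The crawler simulates mobile-client requests; keep the crawl rate reasonable
2. The sentiment lexicon file must be placed in the project root directory
3. The first run may take longer while jieba loads its segmentation dictionary and builds a cache
4. Generating word clouds requires a Chinese font installed on the system (`analyze.py` uses SimHei at `C:/Windows/Fonts/simhei.ttf` by default)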
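
Note 1 asks for a reasonable crawl rate. One common way to achieve this is to pause for a randomized interval between page requests. The sketch below assumes a caller-supplied `fetch_page` callable (a stand-in for something like `fetch_guba_data`, whose real behavior is not shown here). The function name `crawl_pages` and the delay bounds are hypothetical.

```python
import random
import time


def crawl_pages(fetch_page, pages, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch_page(page) for pages 1..pages, pausing between requests."""
    results = []
    for page in range(1, pages + 1):
        results.append(fetch_page(page))
        if page < pages:
            # Randomized pause so requests are not sent at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
    return results


# Demo with a fake fetcher and a recording sleep function (no network, no waiting).
delays = []
fetched = crawl_pages(lambda p: f"page-{p}", 3, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the pacing logic testable without real waiting. In normal use the default `time.sleep` applies.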

## Dependencies

| Library | Version | Purpose |
|---------|---------|---------|
| requests | >=2.28.0 | HTTP requests |
| pandas | >=2.0.0 | Data processing |
| openpyxl | >=3.1.0 | Excel reading and writing |
| jieba | >=0.42.1 | Chinese word segmentation |
| scikit-learn | >=1.3.0 | TF-IDF computation |
| numpy | >=1.24.0 | Numerical computation |
| matplotlib | >=3.7.0 | Visualization |
| wordcloud | >=1.9.0 | Word cloud generation |

## License

MIT License
+375
@@ -0,0 +1,375 @@
import json
import os
import re

import jieba
import matplotlib
import numpy as np
import pandas as pd

matplotlib.use('Agg')  # Select the non-interactive backend before pyplot is imported

import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud


def load_stopwords(filepath='stopwords.txt'):
    """Load stop words from a file, falling back to a built-in default set."""
    stopwords = set()
    if os.path.exists(filepath):
        with open(filepath, 'r', encoding='utf-8') as f:
            for line in f:
                word = line.strip()
                if word:
                    stopwords.add(word)
        print(f"Loaded {len(stopwords)} stop words")
    else:
        print(f"Warning: stop word file {filepath} does not exist; using default stop words")
        stopwords = {
            '的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一', '一个', '上', '也', '很', '到', '说', '要',
            '去', '你', '会', '着', '没有', '看', '好', '自己', '这', '那', '吗', '吧', '呢', '啊', '呀', '什么', '怎么',
            '为什么', '哪里', '谁', '多少', '几', '个', '只', '条', '把', '本', '篇', '次', '天', '今天', '明天', '昨天', '又',
            '再', '还', '已经', '还是', '但是', '可是', '不过', '只是', '只有', '就是', '或者', '跟', '与', '及', '或',
            '股吧', '东方财富', '帖子', '发表', '回复', '点击', '查看', '更多', '原文', '转发', '分享', '收藏', '评论', '点赞',
            'http', 'https', 'com', 'cn', 'www', 'net', 'org'
        }
    return stopwords


# Load the stop words once at import time
STOPWORDS = load_stopwords()


def clean_text(text):
    """Clean raw post text."""
    if not text:
        return ""
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove bracketed emoticon codes
    text = re.sub(r'\[.*?\]', '', text)
    # Remove mixed letter-digit tokens (e.g. sh123, abc456)
    text = re.sub(r'\b[a-zA-Z]+\d+\b', '', text)
    text = re.sub(r'\b\d+[a-zA-Z]+\b', '', text)
    # Replace special characters with spaces (keeps Chinese, letters, digits)
    text = re.sub(r'[^\w\s]', ' ', text)
    # Collapse redundant whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text


def tokenize(text):
    """Segment Chinese text with jieba and drop uninformative tokens."""
    words = jieba.lcut(text)
    # Filter out stop words, short tokens, and English-only tokens
    filtered_words = []
    for w in words:
        # Skip stop words and single-character tokens
        if w in STOPWORDS or len(w) <= 1:
            continue
        # Skip tokens made up only of ASCII letters and whitespace
        # (usually forum markup or meaningless abbreviations)
        if re.fullmatch(r'[a-zA-Z\s]+', w):
            continue
        filtered_words.append(w)
    return filtered_words


def load_data(data_dir='data'):
    """Load the posts of every stock from the JSON files in data_dir."""
    all_data = []
    stock_info = {}

    if not os.path.exists(data_dir):
        print(f'Data directory {data_dir} does not exist')
        return all_data, stock_info

    for filename in os.listdir(data_dir):
        if filename.endswith('.json'):
            filepath = os.path.join(data_dir, filename)
            try:
                with open(filepath, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                # '未知' ("unknown") is the placeholder for missing fields
                stock_name = data.get('stock_name', '未知')
                stock_code = data.get('stock_code', '未知')
                posts = data.get('posts', [])

                stock_info[stock_code] = {
                    'name': stock_name,
                    'post_count': len(posts)
                }

                for post in posts:
                    content = post.get('post_content', '')
                    title = post.get('post_title', '')
                    publish_time = post.get('post_publish_time', '')
                    full_text = f"{title} {content}".strip()

                    if full_text:
                        all_data.append({
                            'stock_code': stock_code,
                            'stock_name': stock_name,
                            'post_id': post.get('post_id'),
                            'post_publish_time': publish_time,
                            'text': full_text,
                            'clean_text': clean_text(full_text)
                        })
            except Exception as e:
                print(f'Failed to load file {filename}: {e}')

    return all_data, stock_info


def calculate_tfidf(texts):
    """Fit a TF-IDF model on texts and return the matrix, feature names, and vectorizer."""
    vectorizer = TfidfVectorizer(
        tokenizer=tokenize,
        token_pattern=None,  # Unused because a custom tokenizer is supplied
        max_features=1000,
        ngram_range=(1, 2)
    )

    tfidf_matrix = vectorizer.fit_transform(texts)
    feature_names = vectorizer.get_feature_names_out()

    return tfidf_matrix, feature_names, vectorizer


def get_top_keywords(tfidf_matrix, feature_names, top_n=20):
    """Return the top_n keywords ranked by mean TF-IDF."""
    avg_tfidf = np.array(tfidf_matrix.mean(axis=0)).flatten()
    # Take extra candidates so enough remain after deduplication
    top_indices = avg_tfidf.argsort()[-top_n*4:][::-1]

    # Collect candidate terms first
    candidates = []
    for idx in top_indices:
        word = feature_names[idx]
        if len(word.strip()) > 0:
            candidates.append({
                'word': word,
                'tfidf': avg_tfidf[idx],
                'length': len(word.split())  # Number of tokens in the term (2 for a bigram)
            })

    # Sort by token count, descending, so multi-word terms are considered first
    candidates.sort(key=lambda x: (-x['length'], -x['tfidf']))

    # Deduplicate, preferring multi-word terms
    keywords = []
    seen_words = set()
    seen_parts = set()

    for candidate in candidates:
        word = candidate['word']
        word_parts = word.split()

        # Decide whether this term should be added
        should_add = True

        # Reject the term if any of its parts is already used by another term
        for part in word_parts:
            if part in seen_parts:
                should_add = False
                break

        if should_add and word not in seen_words:
            seen_words.add(word)
            # Record every part that has been used
            for part in word_parts:
                seen_parts.add(part)
            keywords.append({
                'word': word,
                'tfidf': candidate['tfidf']
            })
            if len(keywords) >= top_n:
                break

    # Re-sort by TF-IDF
    keywords.sort(key=lambda x: -x['tfidf'])
    return keywords


def get_stock_specific_keywords(all_data, stock_code, top_n=20):
    """Return keywords that distinguish one stock's posts from all other posts."""
    stock_texts = [d['clean_text'] for d in all_data if d['stock_code'] == stock_code]
    other_texts = [d['clean_text'] for d in all_data if d['stock_code'] != stock_code]

    if len(stock_texts) < 5:
        return []

    all_texts = stock_texts + other_texts
    tfidf_matrix, feature_names, vectorizer = calculate_tfidf(all_texts)

    # Mean TF-IDF over this stock's posts
    stock_matrix = tfidf_matrix[:len(stock_texts)]
    avg_tfidf = np.array(stock_matrix.mean(axis=0)).flatten()

    # Mean TF-IDF over the other stocks' posts
    if other_texts:
        other_matrix = tfidf_matrix[len(stock_texts):]
        other_avg = np.array(other_matrix.mean(axis=0)).flatten()
        # Difference between this stock and the rest
        diff = avg_tfidf - other_avg
    else:
        diff = avg_tfidf

    # Take extra candidates so enough remain after deduplication
    top_indices = diff.argsort()[-top_n*4:][::-1]

    # Collect candidate terms first
    candidates = []
    for idx in top_indices:
        word = feature_names[idx]
        if len(word.strip()) > 0:
            candidates.append({
                'word': word,
                'tfidf': avg_tfidf[idx],
                'diff': diff[idx],
                'length': len(word.split())  # Number of tokens in the term
            })

    # Sort by token count, descending, so multi-word terms are considered first
    candidates.sort(key=lambda x: (-x['length'], -x['diff']))

    # Deduplicate, preferring multi-word terms
    keywords = []
    seen_words = set()
    seen_parts = set()

    for candidate in candidates:
        word = candidate['word']
        word_parts = word.split()

        # Decide whether this term should be added
        should_add = True

        # Reject the term if any of its parts is already used by another term
        for part in word_parts:
            if part in seen_parts:
                should_add = False
                break

        if should_add and word not in seen_words:
            seen_words.add(word)
            # Record every part that has been used
            for part in word_parts:
                seen_parts.add(part)
            keywords.append({
                'word': word,
                'tfidf': candidate['tfidf'],
                'diff': candidate['diff']
            })
            if len(keywords) >= top_n:
                break

    # Re-sort by diff
    keywords.sort(key=lambda x: -x['diff'])
    return keywords


def generate_wordcloud(keywords, stock_name, output_dir='output'):
    """Render keywords as a word cloud image and return the output path."""
    os.makedirs(output_dir, exist_ok=True)

    word_freq = {k['word']: k['tfidf'] for k in keywords}

    wc = WordCloud(
        font_path='C:/Windows/Fonts/simhei.ttf',  # Chinese font path on Windows; adjust on other systems
        width=800,
        height=600,
        background_color='white',
        max_words=100
    )

    wc.generate_from_frequencies(word_freq)

    output_path = os.path.join(output_dir, f'wordcloud_{stock_name}.png')
    wc.to_file(output_path)
    print(f'Word cloud saved to: {output_path}')

    return output_path


def analyze_all():
    """Run the full analysis pipeline."""
    print('='*60)
    print('Guba data TF-IDF analysis')
    print('='*60)

    # Create the output directory
    os.makedirs('output', exist_ok=True)

    # Load data
    print('\n[1/5] Loading data...')
    all_data, stock_info = load_data()

    if not all_data:
        print('No data found; run the crawler first')
        return

    print(f' Loaded {len(all_data)} posts')
    print(f' Covering {len(stock_info)} stocks:')
    for code, info in stock_info.items():
        print(f' - {info["name"]} ({code}): {info["post_count"]} posts')

    # Overall analysis
    print('\n[2/5] Overall keyword analysis...')
    all_texts = [d['clean_text'] for d in all_data]
    tfidf_matrix, feature_names, vectorizer = calculate_tfidf(all_texts)
    overall_keywords = get_top_keywords(tfidf_matrix, feature_names, top_n=30)

    print('\n Overall top 20 keywords:')
    for i, kw in enumerate(overall_keywords[:20], 1):
        print(f' {i:2d}. {kw["word"]:10s} (TF-IDF: {kw["tfidf"]:.4f})')

    # Save the overall keywords
    overall_df = pd.DataFrame(overall_keywords)
    overall_df.to_csv('output/overall_keywords.csv', index=False, encoding='utf-8-sig')

    # Generate the overall word cloud
    generate_wordcloud(overall_keywords, 'overall')

    # Per-stock analysis
    print('\n[3/5] Per-stock keyword analysis...')
    stock_keywords = {}

    for stock_code in stock_info.keys():
        stock_name = stock_info[stock_code]['name']
        print(f'\n Analyzing {stock_name} ({stock_code})...')

        keywords = get_stock_specific_keywords(all_data, stock_code, top_n=20)
        stock_keywords[stock_code] = keywords

        if keywords:
            print(' Top 10 keywords:')
            for i, kw in enumerate(keywords[:10], 1):
                print(f' {i:2d}. {kw["word"]:10s} (TF-IDF: {kw["tfidf"]:.4f}, diff: {kw["diff"]:.4f})')

            # Generate the word cloud
            generate_wordcloud(keywords, stock_name)

            # Save the keywords
            df = pd.DataFrame(keywords)
            df.to_csv(f'output/keywords_{stock_name}.csv', index=False, encoding='utf-8-sig')

    # Generate the summary report
    print('\n[4/5] Generating summary report...')
    report_data = []
    for stock_code, keywords in stock_keywords.items():
        stock_name = stock_info[stock_code]['name']
        top_words = ', '.join([k['word'] for k in keywords[:5]])
        report_data.append({
            '股票代码': stock_code,  # Stock code
            '股票名称': stock_name,  # Stock name
            '帖子数量': stock_info[stock_code]['post_count'],  # Post count
            'Top5关键词': top_words  # Top 5 keywords
        })

    report_df = pd.DataFrame(report_data)
    report_df.to_csv('output/summary_report.csv', index=False, encoding='utf-8-sig')
    print(' Summary report saved to: output/summary_report.csv')

    # Save all preprocessed text data
    print('\n[5/5] Saving preprocessed data...')
    all_df = pd.DataFrame(all_data)
    all_df.to_csv('output/all_posts.csv', index=False, encoding='utf-8-sig')
    print(' All posts saved to: output/all_posts.csv')

    print('\n' + '='*60)
    print('Analysis complete! Results are saved in the output/ directory')
    print('='*60)


if __name__ == '__main__':
    analyze_all()
@@ -1,34 +0,0 @@
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Referer': 'https://guba.eastmoney.com/',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
}

post_id = '1708066915'
url = f'https://guba.eastmoney.com/news,002624,{post_id}.html'

print(f'Requesting: {url}')
response = requests.get(url, headers=headers, timeout=15)
response.encoding = 'utf-8'
print(f'Status code: {response.status_code}')
print(f'Page length: {len(response.text)}')

# Check for key strings
print('\nKey strings found in the page:')
print(f'post_article: {"post_article" in response.text}')
print(f'comment_list: {"comment_list" in response.text}')
print(f'news_content: {"news_content" in response.text}')

# Save the page
with open('current_page.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
print('\nPage saved to current_page.html')

# Show the beginning of the page
print('\nStart of page:')
print(response.text[:500])
File diff suppressed because one or more lines are too long
@@ -1,241 +0,0 @@
|
|||||||
page,title,author,time,reply_count,click_count
|
|
||||||
1,无论是人才还是技术还是渠道,希望这次管理成的变动能够让这个曾经行业龙头走向正轨,雪球小鲁班,2026/5/14 12:07,24,71
|
|
||||||
1,【今日股市】指数午后低位震荡,资源股跌幅居前,雪球小鲁班,2026/5/14 10:30,31,62
|
|
||||||
1,该跌了吧,沉心静气扬帆起航,2026/5/14 11:45,18,79
|
|
||||||
1,感觉像是有组织的散户进场了,就是所谓的老鼠,回本困难户第N位,2026/5/13 22:21,35,89
|
|
||||||
1,终于涨了,昨天跌那么狠,裤衩子都没有了,花火飞鸟,2026/5/14 12:04,47,83
|
|
||||||
1,算不算放量暴跌?!,股友63F0o88663,2026/5/14 12:03,22,78
|
|
||||||
1,今天爆跌,明天一大堆利好又红红火火长阳,TCL一贯套路,心碎股票人,2026/5/14 12:02,25,81
|
|
||||||
1,没事 拿住 今天主力动用资金拉券商股 明天活埋券商再来拉你,雪球小鲁班,2026/5/14 10:37,32,64
|
|
||||||
1,大神解释一下,为什么大单疯狂出货,都出了一个亿了还是能这么涨,月茨星,2026/5/14 11:52,43,66
|
|
||||||
1,为什么还在卖,要停牌了,股友36E919X121,2026/5/14 11:17,24,63
|
|
||||||
1,涨一天跌一周,就这尿性。。。,股友7715N807H3,2026/5/14 11:24,20,76
|
|
||||||
1,今天应该有榜,瑜佳不佳,2026/5/14 11:56,4,17
|
|
||||||
1,大盘涨跌工具,厉害的小散,2026/5/14 11:56,12,78
|
|
||||||
1,眼光看长远的几十个点,不纠结每天的几个点,股友3Y063588A1,2026/5/14 11:52,20,66
|
|
||||||
1,8.15箱底满仓干满仓干,半导体与智能消费最优质龙头TCL,SEO神话,2026/5/14 11:51,44,3
|
|
||||||
1,中信还在加仓,要命哟,金炫宇1,2026/5/14 11:50,1,5
|
|
||||||
1,这压单这么低的价格你出货 出完你买啥呢?想不开,是但哥,2026/5/14 11:34,38,93
|
|
||||||
1,了,股友01zC725523,2026/5/14 10:40,43,76
|
|
||||||
1,温水煮青蛙,一路闷杀,股友38M080U658,2026/5/14 11:48,20,58
|
|
||||||
1,1,谷神布斯,2026/5/14 11:26,34,0
|
|
||||||
1,!,飞驰股生的牛马,2026/5/14 10:24,35,63
|
|
||||||
1,同时加快梳理现有业务板块,去弱留强,对环球易购业务进行项目制改革,九溪你要赢,2026/5/14 11:27,1,21
|
|
||||||
1,外盘木材大跌,纸浆机会来了!,哥伦比娅,2026/5/14 11:44,33,61
|
|
||||||
1,今天收盘10.05,从头再来。,花火飞鸟,2026/5/14 10:20,4,41
|
|
||||||
1,9409+60070中签,隔壁的兄弟立高食品已经94倍市盈率了,红红的嘴唇,2026/5/14 11:42,35,4
|
|
||||||
1,太便宜了,唯一一次翻倍的机会,就在面前,不要再犹豫了,坚韧的武桦16,2026/5/14 11:18,26,5
|
|
||||||
1,7728手一口吃掉,欣锐80等你来,2026/5/14 11:40,27,93
|
|
||||||
1,又到箱底,准备干了,股友995O308r73,2026/5/14 11:40,6,78
|
|
||||||
1,地天,盯紧五日线,2026/5/14 11:39,7,70
|
|
||||||
1,纽威的管理层会招报应的,你们这代不报,你们的下一代也跑不掉!!!!!!!!!!!,一一路長虹,2026/5/14 10:04,37,73
|
|
||||||
1,国务院日前发布生物产业发展规划,7500亿产业链呼之欲出,意味着生物医药概念股面,股友965al72890,2026/5/14 10:21,30,39
|
|
||||||
1,看来跨境通的要回暖了,一念上塔山,2026/5/14 11:36,3,38
|
|
||||||
1,跌到3个点再补,好些,股友01zC725523,2026/5/14 10:15,23,11
|
|
||||||
1,火箭弹都救不了中兵红箭嘛,魍魉灬,2026/5/14 11:35,25,85
|
|
||||||
1,槽,单刃剪钳,2026/5/14 1:34,17,20
|
|
||||||
1,大盘涨狗垃圾东西还在跌,欣锐80等你来,2026/5/14 11:31,33,15
|
|
||||||
1,还会升回去吗?,小韭菜误入高端局,2026/5/14 11:28,26,78
|
|
||||||
1,啊,狐狸叫的猫,2026/5/14 11:28,32,48
|
|
||||||
1,我一定是脑子抽了,昨天盈利9个点今早盘卖了,没多久又接了回来,梦想是在股市买房,2026/5/14 10:49,26,71
|
|
||||||
1,果断卖出,不玩了,这股没戏。今天大阴线 明天大跌,炒股2年半,2026/5/14 11:22,0,54
|
|
||||||
1,没大哥一堆压单,雨文和文,2026/5/14 11:21,24,80
|
|
||||||
1,哎,赚点狗粮猫粮,2026/5/14 11:21,1,26
|
|
||||||
1,调仓换股啦!太弱了,买入券商,VINN,2026/5/14 10:31,12,90
|
|
||||||
1,现在该股是死猪不怕开水烫,阿月姐,2026/5/14 11:20,26,24
|
|
||||||
1,这股算稳吧,感觉大部分都在绿,不绿就好,高大的金青槐4,2026/5/14 11:19,9,87
|
|
||||||
1,主力说 ,马上拉升10个板,快进,5.81就是不锈钢底,鸣潮牛比,2026/5/14 11:18,45,82
|
|
||||||
1,川王挂了,市场只有一条龙了,津荣你是金龙,孤独者AA,2026/5/14 11:18,23,84
|
|
||||||
1,狗狗币 比特 “炒币”和炒股一样都有大起大落,这首歌唱出了你的心声吗?,股友68c80F8191,2026/5/14 11:16,4,29
|
|
||||||
1,啊,国运长牛632,2026/5/14 11:17,7,29
|
|
||||||
1,河钢已经跌无可跌了,再也没有一丝下降的空间了,我可以回本,2026/5/14 11:16,9,52
|
|
||||||
1,说好的3元呢?砸下来啊,gou装,仓又加錯,2026/5/14 11:16,21,74
|
|
||||||
1,出,昔涟丶爱莉希雅,2026/5/14 11:15,16,92
|
|
||||||
1,从168开始人人中签,从168开始一路发。,住在异环的别墅里,2026/5/14 11:14,1,28
|
|
||||||
1,所以之前费劲巴力的涨上来是为了个什么,买廖就涨,2026/5/14 11:14,46,53
|
|
||||||
1,who这些鬼佬办事效率真是低,印度人民生命分秒必争阿,面朝阳光追梦,2026/5/14 11:13,22,24
|
|
||||||
1,这条横线画得漂亮,futuregs,2026/5/14 11:12,16,92
|
|
||||||
1,跌停给筹码 涨停给筹码 这样要死不活不会给你筹码的,股友0f56Mr,2026/5/14 11:11,0,39
|
|
||||||
1,一堆出货的不亏钱吗,关注acgn,2026/5/14 10:40,26,35
|
|
||||||
1,又是一个美好的周末,金元马,2026/5/14 11:09,35,92
|
|
||||||
1,你不得不佩服老邓,他去年就判断到现在的行情了,3季度多卖猪仔,今年二季度少出栏。,油手好闲,2026/5/13 16:31,47,83
|
|
||||||
1,收盘价13.14,和4月28日一样,B哥sama,2026/5/14 11:01,29,85
|
|
||||||
1,减,买股不宜慌,2026/5/14 11:03,31,89
|
|
||||||
1,趋势大时代 指数中位线选择方向,如林的岑缈,2026/5/14 10:41,20,17
|
|
||||||
1,要跌停,君月00后实盘,2026/5/14 10:46,7,90
|
|
||||||
1,尾盘跌停,股民222,2026/5/14 10:36,21,11
|
|
||||||
1,今天都有哪些吃了这碗大面的?来说说后面走势,Cialloo,2026/5/14 10:23,24,11
|
|
||||||
1,大家一起去骂股市老兵这个死庄托,大智若愚量在價先,2026/5/14 10:51,22,36
|
|
||||||
1,短线小赚一波赶紧走,下一波跌至18,东方嘉木,2026/5/14 10:50,28,2
|
|
||||||
1,跨境通ZAFUL拥有成功、成熟的体系和打法,其后续发展充满期待,心中有财,2026/5/14 10:49,4,86
|
|
||||||
1,各位好啊,我来抄底了,拉萨帮二当家,2026/5/14 10:31,32,97
|
|
||||||
1,5月7日前有九阳,也必将有后九阳!,做t糕手,2026/5/14 10:46,46,19
|
|
||||||
1,还能跌破6吗?感觉都快垫底了,梦绮紫,2026/5/14 10:48,31,88
|
|
||||||
1,跑了,和这股耗不起,涨两天一天能降回去,少赔点撤了,飞逸天,2026/5/14 10:47,38,91
|
|
||||||
1,雨后彩莲死庄托还敢唱多吗,脸被打的啪啪响吧,股友Q1713355G8,2026/5/14 10:40,3,66
|
|
||||||
1,天天高抛低吸,股友C08120l209,2026/5/14 10:46,37,89
|
|
||||||
1,圣龙股份,好是好,公司发展方向无疑是好!但是盘子太小,业绩越来越好就没有大机构要!千万不要大买,谷神布斯,2026/5/14 10:45,20,93
|
|
||||||
1,几点出公告停牌,花火飞鸟,2026/5/14 10:25,28,63
|
|
||||||
1,这是已经谈拢了,杀跌两天就该停牌了.不用等到21号股东大会了。,股友61057ac363,2026/5/14 10:42,9,97
|
|
||||||
1,昨天是谁说的今天会复制3.5号的行情?简直神准,股友639537p9w3,2026/5/14 9:20,17,69
|
|
||||||
1,济南高新,东大东大红,2026/5/14 10:39,25,89
|
|
||||||
2,无论是人才还是技术还是渠道,希望这次管理成的变动能够让这个曾经行业龙头走向正轨,雪球小鲁班,2026/5/14 12:07,24,71
|
|
||||||
2,【今日股市】指数午后低位震荡,资源股跌幅居前,雪球小鲁班,2026/5/14 10:30,31,62
|
|
||||||
2,该跌了吧,沉心静气扬帆起航,2026/5/14 11:45,18,79
|
|
||||||
2,感觉像是有组织的散户进场了,就是所谓的老鼠,回本困难户第N位,2026/5/13 22:21,35,89
|
|
||||||
2,终于涨了,昨天跌那么狠,裤衩子都没有了,花火飞鸟,2026/5/14 12:04,47,83
|
|
||||||
2,算不算放量暴跌?!,股友63F0o88663,2026/5/14 12:03,22,78
|
|
||||||
2,今天爆跌,明天一大堆利好又红红火火长阳,TCL一贯套路,心碎股票人,2026/5/14 12:02,25,81
|
|
||||||
2,没事 拿住 今天主力动用资金拉券商股 明天活埋券商再来拉你,雪球小鲁班,2026/5/14 10:37,32,64
|
|
||||||
2,大神解释一下,为什么大单疯狂出货,都出了一个亿了还是能这么涨,月茨星,2026/5/14 11:52,43,66
|
|
||||||
2,为什么还在卖,要停牌了,股友36E919X121,2026/5/14 11:17,24,63
|
|
||||||
2,涨一天跌一周,就这尿性。。。,股友7715N807H3,2026/5/14 11:24,20,76
|
|
||||||
2,今天应该有榜,瑜佳不佳,2026/5/14 11:56,4,17
|
|
||||||
2,大盘涨跌工具,厉害的小散,2026/5/14 11:56,12,78
|
|
||||||
2,眼光看长远的几十个点,不纠结每天的几个点,股友3Y063588A1,2026/5/14 11:52,20,66
|
|
||||||
2,8.15箱底满仓干满仓干,半导体与智能消费最优质龙头TCL,SEO神话,2026/5/14 11:51,44,3
|
|
||||||
2,中信还在加仓,要命哟,金炫宇1,2026/5/14 11:50,1,5
|
|
||||||
2,这压单这么低的价格你出货 出完你买啥呢?想不开,是但哥,2026/5/14 11:34,38,93
|
|
||||||
2,了,股友01zC725523,2026/5/14 10:40,43,76
|
|
||||||
2,温水煮青蛙,一路闷杀,股友38M080U658,2026/5/14 11:48,20,58
|
|
||||||
2,1,谷神布斯,2026/5/14 11:26,34,0
|
|
||||||
2,!,飞驰股生的牛马,2026/5/14 10:24,35,63
|
|
||||||
2,同时加快梳理现有业务板块,去弱留强,对环球易购业务进行项目制改革,九溪你要赢,2026/5/14 11:27,1,21
|
|
||||||
2,外盘木材大跌,纸浆机会来了!,哥伦比娅,2026/5/14 11:44,33,61
|
|
||||||
2,今天收盘10.05,从头再来。,花火飞鸟,2026/5/14 10:20,4,41
|
|
||||||
2,9409+60070中签,隔壁的兄弟立高食品已经94倍市盈率了,红红的嘴唇,2026/5/14 11:42,35,4
|
|
||||||
2,太便宜了,唯一一次翻倍的机会,就在面前,不要再犹豫了,坚韧的武桦16,2026/5/14 11:18,26,5
|
|
||||||
2,7728手一口吃掉,欣锐80等你来,2026/5/14 11:40,27,93
|
|
||||||
2,又到箱底,准备干了,股友995O308r73,2026/5/14 11:40,6,78
|
|
||||||
2,地天,盯紧五日线,2026/5/14 11:39,7,70
|
|
||||||
2,纽威的管理层会招报应的,你们这代不报,你们的下一代也跑不掉!!!!!!!!!!!,一一路長虹,2026/5/14 10:04,37,73
|
|
||||||
2,国务院日前发布生物产业发展规划,7500亿产业链呼之欲出,意味着生物医药概念股面,股友965al72890,2026/5/14 10:21,30,39
|
|
||||||
2,看来跨境通的要回暖了,一念上塔山,2026/5/14 11:36,3,38
|
|
||||||
2,跌到3个点再补,好些,股友01zC725523,2026/5/14 10:15,23,11
|
|
||||||
2,火箭弹都救不了中兵红箭嘛,魍魉灬,2026/5/14 11:35,25,85
|
|
||||||
2,槽,单刃剪钳,2026/5/14 1:34,17,20
|
|
||||||
2,大盘涨狗垃圾东西还在跌,欣锐80等你来,2026/5/14 11:31,33,15
|
|
||||||
2,还会升回去吗?,小韭菜误入高端局,2026/5/14 11:28,26,78
|
|
||||||
2,啊,狐狸叫的猫,2026/5/14 11:28,32,48
|
|
||||||
2,我一定是脑子抽了,昨天盈利9个点今早盘卖了,没多久又接了回来,梦想是在股市买房,2026/5/14 10:49,26,71
|
|
||||||
2,果断卖出,不玩了,这股没戏。今天大阴线 明天大跌,炒股2年半,2026/5/14 11:22,0,54
|
|
||||||
2,没大哥一堆压单,雨文和文,2026/5/14 11:21,24,80
|
|
||||||
2,哎,赚点狗粮猫粮,2026/5/14 11:21,1,26
|
|
||||||
2,调仓换股啦!太弱了,买入券商,VINN,2026/5/14 10:31,12,90
|
|
||||||
2,现在该股是死猪不怕开水烫,阿月姐,2026/5/14 11:20,26,24
|
|
||||||
2,这股算稳吧,感觉大部分都在绿,不绿就好,高大的金青槐4,2026/5/14 11:19,9,87
|
|
||||||
2,主力说 ,马上拉升10个板,快进,5.81就是不锈钢底,鸣潮牛比,2026/5/14 11:18,45,82
|
|
||||||
2,川王挂了,市场只有一条龙了,津荣你是金龙,孤独者AA,2026/5/14 11:18,23,84
|
|
||||||
2,狗狗币 比特 “炒币”和炒股一样都有大起大落,这首歌唱出了你的心声吗?,股友68c80F8191,2026/5/14 11:16,4,29
|
|
||||||
2,啊,国运长牛632,2026/5/14 11:17,7,29
|
|
||||||
2,河钢已经跌无可跌了,再也没有一丝下降的空间了,我可以回本,2026/5/14 11:16,9,52
|
|
||||||
2,说好的3元呢?砸下来啊,gou装,仓又加錯,2026/5/14 11:16,21,74
|
|
||||||
2,出,昔涟丶爱莉希雅,2026/5/14 11:15,16,92
|
|
||||||
2,从168开始人人中签,从168开始一路发。,住在异环的别墅里,2026/5/14 11:14,1,28
|
|
||||||
2,所以之前费劲巴力的涨上来是为了个什么,买廖就涨,2026/5/14 11:14,46,53
|
|
||||||
2,who这些鬼佬办事效率真是低,印度人民生命分秒必争阿,面朝阳光追梦,2026/5/14 11:13,22,24
|
|
||||||
2,这条横线画得漂亮,futuregs,2026/5/14 11:12,16,92
|
|
||||||
2,跌停给筹码 涨停给筹码 这样要死不活不会给你筹码的,股友0f56Mr,2026/5/14 11:11,0,39
|
|
||||||
2,一堆出货的不亏钱吗,关注acgn,2026/5/14 10:40,26,35
|
|
||||||
2,又是一个美好的周末,金元马,2026/5/14 11:09,35,92
|
|
||||||
2,你不得不佩服老邓,他去年就判断到现在的行情了,3季度多卖猪仔,今年二季度少出栏。,油手好闲,2026/5/13 16:31,47,83
|
|
||||||
2,收盘价13.14,和4月28日一样,B哥sama,2026/5/14 11:01,29,85
|
|
||||||
2,减,买股不宜慌,2026/5/14 11:03,31,89
|
|
||||||
2,趋势大时代 指数中位线选择方向,如林的岑缈,2026/5/14 10:41,20,17
|
|
||||||
2,要跌停,君月00后实盘,2026/5/14 10:46,7,90
|
|
||||||
2,尾盘跌停,股民222,2026/5/14 10:36,21,11
|
|
||||||
2,今天都有哪些吃了这碗大面的?来说说后面走势,Cialloo,2026/5/14 10:23,24,11
|
|
||||||
2,大家一起去骂股市老兵这个死庄托,大智若愚量在價先,2026/5/14 10:51,22,36
|
|
||||||
2,短线小赚一波赶紧走,下一波跌至18,东方嘉木,2026/5/14 10:50,28,2
|
|
||||||
2,跨境通ZAFUL拥有成功、成熟的体系和打法,其后续发展充满期待,心中有财,2026/5/14 10:49,4,86
|
|
||||||
2,各位好啊,我来抄底了,拉萨帮二当家,2026/5/14 10:31,32,97
|
|
||||||
2,5月7日前有九阳,也必将有后九阳!,做t糕手,2026/5/14 10:46,46,19
|
|
||||||
2,还能跌破6吗?感觉都快垫底了,梦绮紫,2026/5/14 10:48,31,88
|
|
||||||
2,跑了,和这股耗不起,涨两天一天能降回去,少赔点撤了,飞逸天,2026/5/14 10:47,38,91
|
|
||||||
2,雨后彩莲死庄托还敢唱多吗,脸被打的啪啪响吧,股友Q1713355G8,2026/5/14 10:40,3,66
|
|
||||||
2,天天高抛低吸,股友C08120l209,2026/5/14 10:46,37,89
|
|
||||||
2,圣龙股份,好是好,公司发展方向无疑是好!但是盘子太小,业绩越来越好就没有大机构要!千万不要大买,谷神布斯,2026/5/14 10:45,20,93
|
|
||||||
2,几点出公告停牌,花火飞鸟,2026/5/14 10:25,28,63
|
|
||||||
2,这是已经谈拢了,杀跌两天就该停牌了.不用等到21号股东大会了。,股友61057ac363,2026/5/14 10:42,9,97
|
|
||||||
2,昨天是谁说的今天会复制3.5号的行情?简直神准,股友639537p9w3,2026/5/14 9:20,17,69
|
|
||||||
2,济南高新,东大东大红,2026/5/14 10:39,25,89
|
|
||||||
3,无论是人才还是技术还是渠道,希望这次管理成的变动能够让这个曾经行业龙头走向正轨,雪球小鲁班,2026/5/14 12:07,24,71
|
|
||||||
3,【今日股市】指数午后低位震荡,资源股跌幅居前,雪球小鲁班,2026/5/14 10:30,31,62
|
|
||||||
3,该跌了吧,沉心静气扬帆起航,2026/5/14 11:45,18,79
|
|
||||||
3,感觉像是有组织的散户进场了,就是所谓的老鼠,回本困难户第N位,2026/5/13 22:21,35,89
|
|
||||||
3,终于涨了,昨天跌那么狠,裤衩子都没有了,花火飞鸟,2026/5/14 12:04,47,83
|
|
||||||
3,算不算放量暴跌?!,股友63F0o88663,2026/5/14 12:03,22,78
|
|
||||||
3,今天爆跌,明天一大堆利好又红红火火长阳,TCL一贯套路,心碎股票人,2026/5/14 12:02,25,81
|
|
||||||
3,没事 拿住 今天主力动用资金拉券商股 明天活埋券商再来拉你,雪球小鲁班,2026/5/14 10:37,32,64
|
|
||||||
3,大神解释一下,为什么大单疯狂出货,都出了一个亿了还是能这么涨,月茨星,2026/5/14 11:52,43,66
|
|
||||||
3,为什么还在卖,要停牌了,股友36E919X121,2026/5/14 11:17,24,63
|
|
||||||
3,涨一天跌一周,就这尿性。。。,股友7715N807H3,2026/5/14 11:24,20,76
|
|
||||||
3,今天应该有榜,瑜佳不佳,2026/5/14 11:56,4,17
|
|
||||||
3,大盘涨跌工具,厉害的小散,2026/5/14 11:56,12,78
|
|
||||||
3,眼光看长远的几十个点,不纠结每天的几个点,股友3Y063588A1,2026/5/14 11:52,20,66
|
|
||||||
3,8.15箱底满仓干满仓干,半导体与智能消费最优质龙头TCL,SEO神话,2026/5/14 11:51,44,3
|
|
||||||
3,中信还在加仓,要命哟,金炫宇1,2026/5/14 11:50,1,5
|
|
||||||
3,这压单这么低的价格你出货 出完你买啥呢?想不开,是但哥,2026/5/14 11:34,38,93
|
|
||||||
3,了,股友01zC725523,2026/5/14 10:40,43,76
|
|
||||||
3,温水煮青蛙,一路闷杀,股友38M080U658,2026/5/14 11:48,20,58
|
|
||||||
3,1,谷神布斯,2026/5/14 11:26,34,0
|
|
||||||
3,!,飞驰股生的牛马,2026/5/14 10:24,35,63
|
|
||||||
3,同时加快梳理现有业务板块,去弱留强,对环球易购业务进行项目制改革,九溪你要赢,2026/5/14 11:27,1,21
|
|
||||||
3,外盘木材大跌,纸浆机会来了!,哥伦比娅,2026/5/14 11:44,33,61
|
|
||||||
3,今天收盘10.05,从头再来。,花火飞鸟,2026/5/14 10:20,4,41
|
|
||||||
3,9409+60070中签,隔壁的兄弟立高食品已经94倍市盈率了,红红的嘴唇,2026/5/14 11:42,35,4
|
|
||||||
3,太便宜了,唯一一次翻倍的机会,就在面前,不要再犹豫了,坚韧的武桦16,2026/5/14 11:18,26,5
|
|
||||||
3,7728手一口吃掉,欣锐80等你来,2026/5/14 11:40,27,93
|
|
||||||
3,又到箱底,准备干了,股友995O308r73,2026/5/14 11:40,6,78
|
|
||||||
3,地天,盯紧五日线,2026/5/14 11:39,7,70
|
|
||||||
3,纽威的管理层会招报应的,你们这代不报,你们的下一代也跑不掉!!!!!!!!!!!,一一路長虹,2026/5/14 10:04,37,73
|
|
||||||
3,国务院日前发布生物产业发展规划,7500亿产业链呼之欲出,意味着生物医药概念股面,股友965al72890,2026/5/14 10:21,30,39
|
|
||||||
3,看来跨境通的要回暖了,一念上塔山,2026/5/14 11:36,3,38
|
|
||||||
3,跌到3个点再补,好些,股友01zC725523,2026/5/14 10:15,23,11
|
|
||||||
3,火箭弹都救不了中兵红箭嘛,魍魉灬,2026/5/14 11:35,25,85
|
|
||||||
3,槽,单刃剪钳,2026/5/14 1:34,17,20
|
|
||||||
3,大盘涨狗垃圾东西还在跌,欣锐80等你来,2026/5/14 11:31,33,15
|
|
||||||
3,还会升回去吗?,小韭菜误入高端局,2026/5/14 11:28,26,78
|
|
||||||
3,啊,狐狸叫的猫,2026/5/14 11:28,32,48
|
|
||||||
3,我一定是脑子抽了,昨天盈利9个点今早盘卖了,没多久又接了回来,梦想是在股市买房,2026/5/14 10:49,26,71
|
|
||||||
3,果断卖出,不玩了,这股没戏。今天大阴线 明天大跌,炒股2年半,2026/5/14 11:22,0,54
|
|
||||||
3,没大哥一堆压单,雨文和文,2026/5/14 11:21,24,80
|
|
||||||
3,哎,赚点狗粮猫粮,2026/5/14 11:21,1,26
|
|
||||||
3,调仓换股啦!太弱了,买入券商,VINN,2026/5/14 10:31,12,90
|
|
||||||
3,现在该股是死猪不怕开水烫,阿月姐,2026/5/14 11:20,26,24
|
|
||||||
3,这股算稳吧,感觉大部分都在绿,不绿就好,高大的金青槐4,2026/5/14 11:19,9,87
|
|
||||||
3,主力说 ,马上拉升10个板,快进,5.81就是不锈钢底,鸣潮牛比,2026/5/14 11:18,45,82
|
|
||||||
3,川王挂了,市场只有一条龙了,津荣你是金龙,孤独者AA,2026/5/14 11:18,23,84
|
|
||||||
3,狗狗币 比特 “炒币”和炒股一样都有大起大落,这首歌唱出了你的心声吗?,股友68c80F8191,2026/5/14 11:16,4,29
|
|
||||||
3,啊,国运长牛632,2026/5/14 11:17,7,29
|
|
||||||
3,河钢已经跌无可跌了,再也没有一丝下降的空间了,我可以回本,2026/5/14 11:16,9,52
|
|
||||||
3,说好的3元呢?砸下来啊,gou装,仓又加錯,2026/5/14 11:16,21,74
|
|
||||||
3,出,昔涟丶爱莉希雅,2026/5/14 11:15,16,92
|
|
||||||
3,从168开始人人中签,从168开始一路发。,住在异环的别墅里,2026/5/14 11:14,1,28
|
|
||||||
3,所以之前费劲巴力的涨上来是为了个什么,买廖就涨,2026/5/14 11:14,46,53
|
|
||||||
3,who这些鬼佬办事效率真是低,印度人民生命分秒必争阿,面朝阳光追梦,2026/5/14 11:13,22,24
|
|
||||||
3,这条横线画得漂亮,futuregs,2026/5/14 11:12,16,92
|
|
||||||
3,跌停给筹码 涨停给筹码 这样要死不活不会给你筹码的,股友0f56Mr,2026/5/14 11:11,0,39
|
|
||||||
3,一堆出货的不亏钱吗,关注acgn,2026/5/14 10:40,26,35
|
|
||||||
3,又是一个美好的周末,金元马,2026/5/14 11:09,35,92
|
|
||||||
3,你不得不佩服老邓,他去年就判断到现在的行情了,3季度多卖猪仔,今年二季度少出栏。,油手好闲,2026/5/13 16:31,47,83
|
|
||||||
3,收盘价13.14,和4月28日一样,B哥sama,2026/5/14 11:01,29,85
|
|
||||||
3,减,买股不宜慌,2026/5/14 11:03,31,89
|
|
||||||
3,趋势大时代 指数中位线选择方向,如林的岑缈,2026/5/14 10:41,20,17
|
|
||||||
3,要跌停,君月00后实盘,2026/5/14 10:46,7,90
|
|
||||||
3,尾盘跌停,股民222,2026/5/14 10:36,21,11
|
|
||||||
3,今天都有哪些吃了这碗大面的?来说说后面走势,Cialloo,2026/5/14 10:23,24,11
|
|
||||||
3,大家一起去骂股市老兵这个死庄托,大智若愚量在價先,2026/5/14 10:51,22,36
|
|
||||||
3,短线小赚一波赶紧走,下一波跌至18,东方嘉木,2026/5/14 10:50,28,2
|
|
||||||
3,跨境通ZAFUL拥有成功、成熟的体系和打法,其后续发展充满期待,心中有财,2026/5/14 10:49,4,86
|
|
||||||
3,各位好啊,我来抄底了,拉萨帮二当家,2026/5/14 10:31,32,97
|
|
||||||
3,5月7日前有九阳,也必将有后九阳!,做t糕手,2026/5/14 10:46,46,19
|
|
||||||
3,还能跌破6吗?感觉都快垫底了,梦绮紫,2026/5/14 10:48,31,88
|
|
||||||
3,跑了,和这股耗不起,涨两天一天能降回去,少赔点撤了,飞逸天,2026/5/14 10:47,38,91
|
|
||||||
3,雨后彩莲死庄托还敢唱多吗,脸被打的啪啪响吧,股友Q1713355G8,2026/5/14 10:40,3,66
|
|
||||||
3,天天高抛低吸,股友C08120l209,2026/5/14 10:46,37,89
|
|
||||||
3,圣龙股份,好是好,公司发展方向无疑是好!但是盘子太小,业绩越来越好就没有大机构要!千万不要大买,谷神布斯,2026/5/14 10:45,20,93
|
|
||||||
3,几点出公告停牌,花火飞鸟,2026/5/14 10:25,28,63
|
|
||||||
3,这是已经谈拢了,杀跌两天就该停牌了.不用等到21号股东大会了。,股友61057ac363,2026/5/14 10:42,9,97
|
|
||||||
3,昨天是谁说的今天会复制3.5号的行情?简直神准,股友639537p9w3,2026/5/14 9:20,17,69
|
|
||||||
3,济南高新,东大东大红,2026/5/14 10:39,25,89
|
|
||||||
|
-2609
File diff suppressed because it is too large
@@ -0,0 +1,12 @@
|
|||||||
|
requests>=2.28.0
|
||||||
|
pandas>=2.0.0
|
||||||
|
openpyxl>=3.1.0
|
||||||
|
jieba>=0.42.1
|
||||||
|
scikit-learn>=1.3.0
|
||||||
|
numpy>=1.24.0
|
||||||
|
matplotlib>=3.7.0
|
||||||
|
seaborn>=0.12.0
|
||||||
|
wordcloud>=1.9.0
|
||||||
|
gensim>=4.3.0
|
||||||
|
tensorflow>=2.10.0
|
||||||
|
keras>=2.10.0
|
||||||
@@ -1,12 +0,0 @@
|
|||||||
import asyncio
|
|
||||||
import aiohttp
|
|
||||||
import json
|
|
||||||
import re
|
|
||||||
import sys
|
|
||||||
from datetime import datetime
|
|
||||||
|
|
||||||
headers = {
|
|
||||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36',
|
|
||||||
'Referer': 'https://guba.eastmoney.com/',
|
|
||||||
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
|
|
||||||
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.
|
|
||||||
-343
@@ -1,343 +0,0 @@
|
|||||||
import json
|
|
||||||
import re
|
|
||||||
import time
|
|
||||||
import urllib.request
|
|
||||||
import urllib.parse
|
|
||||||
from datetime import datetime
|
|
||||||
|
|
||||||
headers = {
|
|
||||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36',
|
|
||||||
'Referer': 'https://guba.eastmoney.com/',
|
|
||||||
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
|
|
||||||
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
|
|
||||||
'Accept-Encoding': 'gzip, deflate, br',
|
|
||||||
'Connection': 'keep-alive',
|
|
||||||
'Cache-Control': 'max-age=0',
|
|
||||||
'Upgrade-Insecure-Requests': '1',
|
|
||||||
'Sec-Ch-Ua': '"Chromium";v="148", "Not;A=Brand";v="24", "Microsoft Edge";v="148"',
|
|
||||||
'Sec-Ch-Ua-Mobile': '?0',
|
|
||||||
'Sec-Ch-Ua-Platform': '"Windows"',
|
|
||||||
'Sec-Fetch-Dest': 'document',
|
|
||||||
'Sec-Fetch-Mode': 'navigate',
|
|
||||||
'Sec-Fetch-Site': 'same-origin',
|
|
||||||
'Sec-Fetch-User': '?1',
|
|
||||||
'Cookie': 'qgqp_b_id=30059d8839ad5c045fa8856e38013e9c; st_nvi=XwpSfYXGjCxfCdbgapK5_cac4; nid18=0daec1df8064f04edd20b4e69250a8f5; nid18_create_time=1776263017375; gviem=UrMH_tSu1UpW8B_TKmytl803f; gviem_create_time=1776263017375; fullscreengg=1; fullscreengg2=1; st_si=63999118594852; wsc_checkuser_ok=1; st_asi=delete; st_pvi=26838250597806; st_sp=2026-04-15%2022%3A23%3A37; st_inirUrl=https%3A%2F%2Fcn.bing.com%2F; st_sn=30; st_psi=20260520214901287-117001354293-0422265952',
|
|
||||||
}
|
|
||||||
|
|
||||||
comment_headers = {
|
|
||||||
'Accept': '*/*',
|
|
||||||
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
|
|
||||||
'Cache-Control': 'no-cache',
|
|
||||||
'Connection': 'keep-alive',
|
|
||||||
'Content-Type': 'application/x-www-form-urlencoded',
|
|
||||||
'Origin': 'https://guba.eastmoney.com',
|
|
||||||
'Pragma': 'no-cache',
|
|
||||||
'Referer': 'https://guba.eastmoney.com/',
|
|
||||||
'Sec-Fetch-Dest': 'empty',
|
|
||||||
'Sec-Fetch-Mode': 'cors',
|
|
||||||
'Sec-Fetch-Site': 'same-origin',
|
|
||||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36',
|
|
||||||
'X-Requested-With': 'XMLHttpRequest',
|
|
||||||
'Cookie': 'qgqp_b_id=30059d8839ad5c045fa8856e38013e9c; st_nvi=XwpSfYXGjCxfCdbgapK5_cac4; nid18=0daec1df8064f04edd20b4e69250a8f5; nid18_create_time=1776263017375; gviem=UrMH_tSu1UpW8B_TKmytl803f; gviem_create_time=1776263017375; fullscreengg=1; fullscreengg2=1; st_si=63999118594852; wsc_checkuser_ok=1; st_asi=delete; st_pvi=26838250597806; st_sp=2026-04-15%2022%3A23%3A37; st_inirUrl=https%3A%2F%2Fcn.bing.com%2F; st_sn=30; st_psi=20260520214901287-117001354293-0422265952',
|
|
||||||
}
|
|
||||||
|
|
||||||
MAX_RETRIES = 3
|
|
||||||
DELAY_BETWEEN_REQUESTS = 2.0
|
|
||||||
DELAY_BETWEEN_PAGES = 5.0
|
|
||||||
OUTPUT_FILE = 'guba_data.json'
|
|
||||||
|
|
||||||
|
|
||||||
def fetch(url, headers, method='GET', data=None, timeout=15):
|
|
||||||
for attempt in range(MAX_RETRIES):
|
|
||||||
try:
|
|
||||||
req = urllib.request.Request(url, headers=headers, method=method, data=data)
|
|
||||||
with urllib.request.urlopen(req, timeout=timeout) as response:
|
|
||||||
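# Note: urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so the 429/403 checks below are normally unreachable; such responses land in the except clause instead.
if response.status == 429: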
|
|
||||||
print(f' 请求过于频繁,等待10秒后重试...')
|
|
||||||
time.sleep(10)
|
|
||||||
continue
|
|
||||||
|
|
||||||
if response.status == 403:
|
|
||||||
print(f' 请求被拒绝,第{attempt+1}次重试...')
|
|
||||||
time.sleep(5)
|
|
||||||
continue
|
|
||||||
|
|
||||||
if response.status != 200:
|
|
||||||
print(f' 请求失败,状态码: {response.status}')
|
|
||||||
return None
|
|
||||||
|
|
||||||
content = response.read().decode('utf-8', errors='ignore')
|
|
||||||
return content
|
|
||||||
|
|
||||||
except urllib.error.URLError as e:
|
|
||||||
print(f' Request failed ({e}), retry {attempt+1}...')
|
|
||||||
time.sleep(5)
|
|
||||||
except Exception as e:
|
|
||||||
print(f' 请求异常: {str(e)}')
|
|
||||||
if attempt < MAX_RETRIES - 1:
|
|
||||||
time.sleep(5)
|
|
||||||
|
|
||||||
return None
|
|
||||||
|
|
||||||
|
|
||||||
def initialize_session():
|
|
||||||
print('正在初始化会话...')
|
|
||||||
fetch('https://guba.eastmoney.com/', headers)
|
|
||||||
time.sleep(2)
|
|
||||||
print('会话初始化完成')
|
|
||||||
|
|
||||||
|
|
||||||
def get_post_list(stock_code='002624', page=1):
|
|
||||||
if page == 1:
|
|
||||||
url = f'https://guba.eastmoney.com/list,{stock_code},f.html'
|
|
||||||
else:
|
|
||||||
url = f'https://guba.eastmoney.com/list,{stock_code},f{page}.html'
|
|
||||||
|
|
||||||
html = fetch(url, headers)
|
|
||||||
|
|
||||||
if not html:
|
|
||||||
return []
|
|
||||||
|
|
||||||
posts = []
|
|
||||||
pattern = r'var article_list=\s*({"re":.*?});'
|
|
||||||
match = re.search(pattern, html, re.DOTALL)
|
|
||||||
|
|
||||||
if match:
|
|
||||||
try:
|
|
||||||
data = json.loads(match.group(1))
|
|
||||||
for item in data.get('re', []):
|
|
||||||
post_id = item.get('post_id', '')
|
|
||||||
title = item.get('post_title', '').strip()
|
|
||||||
author = item.get('user_nickname', '').strip()
|
|
||||||
post_time = item.get('post_display_time', '')
|
|
||||||
comment_count = item.get('post_comment_count', 0)
|
|
||||||
click_count = item.get('post_click_count', 0)
|
|
||||||
forward_count = item.get('post_forward_count', 0)
|
|
||||||
like_count = item.get('post_like_count', 0)
|
|
||||||
|
|
||||||
if post_id and title:
|
|
||||||
posts.append({
|
|
||||||
'post_id': post_id,
|
|
||||||
'title': title,
|
|
||||||
'author': author,
|
|
||||||
'post_time': post_time,
|
|
||||||
'comment_count': comment_count,
|
|
||||||
'click_count': click_count,
|
|
||||||
'forward_count': forward_count,
|
|
||||||
'like_count': like_count,
|
|
||||||
'url': f'https://guba.eastmoney.com/news,{stock_code},{post_id}.html'
|
|
||||||
})
|
|
||||||
except json.JSONDecodeError:
|
|
||||||
pass
|
|
||||||
|
|
||||||
return posts
|
|
||||||
|
|
||||||
|
|
||||||
def get_comments(stock_code, post_id, page=1, page_size=30):
|
|
||||||
url = f'https://guba.eastmoney.com/api/getData?code={stock_code}&path=reply/api/Reply/ArticleNewReplyList'
|
|
||||||
|
|
||||||
payload = {
|
|
||||||
'param': f'postid={post_id}&sort=1&sorttype=1&p={page}&ps={page_size}',
|
|
||||||
'plat': 'Web',
|
|
||||||
'path': 'reply/api/Reply/ArticleNewReplyList',
|
|
||||||
'env': '2',
|
|
||||||
'origin': '',
|
|
||||||
'version': '2022',
|
|
||||||
'product': 'Guba'
|
|
||||||
}
|
|
||||||
|
|
||||||
data = urllib.parse.urlencode(payload).encode('utf-8')
|
|
||||||
response_text = fetch(url, comment_headers, method='POST', data=data)
|
|
||||||
|
|
||||||
if not response_text:
|
|
||||||
return []
|
|
||||||
|
|
||||||
try:
|
|
||||||
data = json.loads(response_text)
|
|
||||||
|
|
||||||
if 're' in data:
|
|
||||||
reply_list = data.get('re', [])
|
|
||||||
elif 'data' in data and 'reply_list' in data['data']:
|
|
||||||
reply_list = data['data'].get('reply_list', [])
|
|
||||||
else:
|
|
||||||
print(f' 未知的响应结构: {list(data.keys())}')
|
|
||||||
return []
|
|
||||||
|
|
||||||
if not isinstance(reply_list, list) or len(reply_list) == 0:
|
|
||||||
return []
|
|
||||||
|
|
||||||
comments = []
|
|
||||||
for item in reply_list:
|
|
||||||
reply_user = item.get('reply_user', {})
|
|
||||||
comment = {
|
|
||||||
'reply_id': str(item.get('reply_id', '')),
|
|
||||||
'user_nickname': reply_user.get('user_nickname', '').strip(),
|
|
||||||
'reply_content': item.get('reply_text', '').strip(),
|
|
||||||
'reply_time': item.get('reply_time', ''),
|
|
||||||
'reply_like_count': item.get('reply_like_count', 0),
|
|
||||||
'reply_against_count': item.get('reply_against_count', 0),
|
|
||||||
}
|
|
||||||
if comment['reply_content']:
|
|
||||||
comments.append(comment)
|
|
||||||
|
|
||||||
return comments
|
|
||||||
except json.JSONDecodeError as e:
|
|
||||||
print(f' JSON解析失败: {str(e)}')
|
|
||||||
return []
|
|
||||||
|
|
||||||
|
|
||||||
def get_all_comments(stock_code, post_id, total_comments):
|
|
||||||
all_comments = []
|
|
||||||
page_size = 30
|
|
||||||
page = 1
|
|
||||||
|
|
||||||
while True:
|
|
||||||
comments = get_comments(stock_code, post_id, page, page_size)
|
|
||||||
|
|
||||||
if not comments:
|
|
||||||
break
|
|
||||||
|
|
||||||
all_comments.extend(comments)
|
|
||||||
print(f' 第{page}页评论获取完成,累计{len(all_comments)}条')
|
|
||||||
|
|
||||||
if len(comments) < page_size:
|
|
||||||
break
|
|
||||||
|
|
||||||
page += 1
|
|
||||||
time.sleep(DELAY_BETWEEN_REQUESTS)
|
|
||||||
|
|
||||||
return all_comments
|
|
||||||
|
|
||||||
|
|
||||||
def process_post(stock_code, post):
|
|
||||||
post_id = post['post_id']
|
|
||||||
title = post['title']
|
|
||||||
print(f' 获取帖子: {title[:40]}... (评论:{post["comment_count"]})')
|
|
||||||
|
|
||||||
post_data = {
|
|
||||||
'post_id': post_id,
|
|
||||||
'title': title,
|
|
||||||
'author': post.get('author', ''),
|
|
||||||
'post_time': post.get('post_time', ''),
|
|
||||||
'url': post['url'],
|
|
||||||
'comment_count': post.get('comment_count', 0),
|
|
||||||
'click_count': post.get('click_count', 0),
|
|
||||||
'forward_count': post.get('forward_count', 0),
|
|
||||||
'like_count': post.get('like_count', 0),
|
|
||||||
'comments': []
|
|
||||||
}
|
|
||||||
|
|
||||||
if post['comment_count'] > 0:
|
|
||||||
print(f' 正在获取评论...')
|
|
||||||
comments = get_all_comments(stock_code, post_id, post['comment_count'])
|
|
||||||
post_data['comments'] = comments
|
|
||||||
print(f' 评论获取完成,共{len(comments)}条')
|
|
||||||
|
|
||||||
time.sleep(DELAY_BETWEEN_REQUESTS)
|
|
||||||
return post_data
|
|
||||||
|
|
||||||
|
|
||||||
def scrape_guba(stock_code='002624', stock_name='完美世界', total_pages=3, min_comment_count=0):
|
|
||||||
all_posts = []
|
|
||||||
seen_post_ids = set()
|
|
||||||
|
|
||||||
print(f'开始爬取{stock_name}({stock_code})股吧前{total_pages}页帖子...')
|
|
||||||
print(f'爬取时间: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}')
|
|
||||||
if min_comment_count > 0:
|
|
||||||
print(f'筛选条件: 评论数 >= {min_comment_count}')
|
|
||||||
print('-' * 60)
|
|
||||||
|
|
||||||
initialize_session()
|
|
||||||
|
|
||||||
for page in range(1, total_pages + 1):
|
|
||||||
print(f'\n正在爬取第{page}/{total_pages}页...')
|
|
||||||
|
|
||||||
posts = get_post_list(stock_code, page)
|
|
||||||
|
|
||||||
if not posts:
|
|
||||||
print(f' 第{page}页未找到数据')
|
|
||||||
continue
|
|
||||||
|
|
||||||
print(f' 找到{len(posts)}个帖子')
|
|
||||||
|
|
||||||
filtered_posts = []
|
|
||||||
for post in posts:
|
|
||||||
post_id = post['post_id']
|
|
||||||
if post_id in seen_post_ids:
|
|
||||||
continue
|
|
||||||
seen_post_ids.add(post_id)
|
|
||||||
|
|
||||||
if min_comment_count > 0 and post['comment_count'] < min_comment_count:
|
|
||||||
continue
|
|
||||||
|
|
||||||
filtered_posts.append(post)
|
|
||||||
|
|
||||||
if not filtered_posts:
|
|
||||||
print(f' 第{page}页没有符合条件的帖子')
|
|
||||||
continue
|
|
||||||
|
|
||||||
for post in filtered_posts:
|
|
||||||
post_data = process_post(stock_code, post)
|
|
||||||
all_posts.append(post_data)
|
|
||||||
|
|
||||||
print(f' 第{page}页完成,已获取{len(all_posts)}个帖子')
|
|
||||||
|
|
||||||
if page < total_pages:
|
|
||||||
time.sleep(DELAY_BETWEEN_PAGES)
|
|
||||||
|
|
||||||
return all_posts
|
|
||||||
|
|
||||||
|
|
||||||
def save_to_json(data, filename):
|
|
||||||
output = {
|
|
||||||
'scrape_time': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
|
|
||||||
'total_posts': len(data),
|
|
||||||
'posts': data
|
|
||||||
}
|
|
||||||
|
|
||||||
with open(filename, 'w', encoding='utf-8') as f:
|
|
||||||
json.dump(output, f, ensure_ascii=False, indent=2)
|
|
||||||
|
|
||||||
return output
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
|
||||||
stock_code = '002624'
|
|
||||||
stock_name = '完美世界'
|
|
||||||
total_pages = 3
|
|
||||||
min_comment_count = 0
|
|
||||||
|
|
||||||
print(f'使用 Python: {__import__("sys").version}')
|
|
||||||
print(f'脚本路径: {__file__}')
|
|
||||||
print(f'工作目录: {__import__("os").getcwd()}')
|
|
||||||
|
|
||||||
start_time = datetime.now()
|
|
||||||
|
|
||||||
all_posts = scrape_guba(stock_code, stock_name, total_pages, min_comment_count)
|
|
||||||
|
|
||||||
end_time = datetime.now()
|
|
||||||
|
|
||||||
print('\n' + '=' * 60)
|
|
||||||
|
|
||||||
if all_posts:
|
|
||||||
output = save_to_json(all_posts, OUTPUT_FILE)
|
|
||||||
|
|
||||||
print(f'爬取完成!')
|
|
||||||
print(f' - 帖子数量: {output["total_posts"]}')
|
|
||||||
print(f' - 数据已保存到: {OUTPUT_FILE}')
|
|
||||||
print(f' - 耗时: {(end_time - start_time).total_seconds():.2f} 秒')
|
|
||||||
|
|
||||||
print('\n前3个帖子预览:')
|
|
||||||
for i, post in enumerate(all_posts[:3], 1):
|
|
||||||
print(f'\n--- 帖子{i} ---')
|
|
||||||
print(f'标题: {post["title"]}')
|
|
||||||
print(f'作者: {post["author"]}')
|
|
||||||
print(f'时间: {post["post_time"]}')
|
|
||||||
print(f'URL: {post["url"]}')
|
|
||||||
print(f'评论数: {post["comment_count"]}')
|
|
||||||
print(f'实际获取评论数: {len(post["comments"])}')
|
|
||||||
if post.get('comments'):
|
|
||||||
print(f'第一条评论: {post["comments"][0]["reply_content"][:30]}...')
|
|
||||||
else:
|
|
||||||
print('未获取到任何数据')
|
|
||||||
print(f'耗时: {(end_time - start_time).total_seconds():.2f} 秒')
|
|
||||||
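The removed `fetch` helper above tests `response.status` for 429 and 403 inside the `with` block. With `urllib.request.urlopen`, non-2xx responses are raised as `urllib.error.HTTPError` by the default handlers, so those branches would not normally run. The following is a minimal sketch of status-aware retries, assuming the same wait times as the original script (10 s for 429, 5 s for 403 and network errors). The names `fetch_text` and `backoff_for`, and the injectable `opener` and `sleep` parameters, are hypothetical and introduced only for illustration; they are not part of the repository.

```python
import time
import urllib.error
import urllib.request


def backoff_for(code):
    """Return the wait in seconds before retrying an HTTP status, or None if not retryable."""
    if code == 429:   # too many requests
        return 10
    if code == 403:   # request refused
        return 5
    return None


def fetch_text(url, headers=None, max_retries=3, timeout=15,
               opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL as text, retrying on 429/403 and on network errors.

    `opener` and `sleep` are injectable so the retry logic can be exercised
    without network access (an assumption of this sketch).
    """
    for _ in range(max_retries):
        try:
            req = urllib.request.Request(url, headers=headers or {})
            with opener(req, timeout=timeout) as response:
                return response.read().decode('utf-8', errors='ignore')
        except urllib.error.HTTPError as e:
            # HTTPError must be caught before URLError (it is a subclass).
            wait = backoff_for(e.code)
            if wait is None:
                return None
            sleep(wait)
        except urllib.error.URLError:
            # DNS failure, refused connection, timeout, and similar.
            sleep(5)
    return None
```

Catching `HTTPError` ahead of `URLError` keeps the per-status waits separate from generic network failures, which the original single `except urllib.error.URLError` clause reported as timeouts.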
@@ -0,0 +1,409 @@
|
|||||||
|
import pandas as pd
|
||||||
|
import jieba
|
||||||
|
import time
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
from collections import defaultdict
|
||||||
|
import matplotlib.pyplot as plt
|
||||||
|
import matplotlib
|
||||||
|
matplotlib.use('Agg')
|
||||||
|
|
||||||
|
# Configure fonts that can render Chinese text
|
||||||
|
plt.rcParams['font.sans-serif'] = ['SimHei', 'Microsoft YaHei', 'SimSun', 'Arial Unicode MS']
|
||||||
|
plt.rcParams['axes.unicode_minus'] = False # Render the minus sign correctly with these fonts
|
||||||
|
|
||||||
|
# ============================================================
|
||||||
|
# Part 1: Build the sentiment dictionary
|
||||||
|
# ============================================================
|
||||||
|
|
||||||
|
def build_sentiment_dictionary():
|
||||||
|
"""Build the sentiment dictionary from the Dalian University of Technology (DUT) Chinese affective lexicon ontology."""
|
||||||
|
|
||||||
|
dict_path = '大连理工大学中文情感词汇本体.xlsx'
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Load the DUT affective lexicon
|
||||||
|
df = pd.read_excel(dict_path)
|
||||||
|
|
||||||
|
# Keep only the columns we need
|
||||||
|
df = df[['词语', '词性种类', '词义数', '词义序号', '情感分类', '强度', '极性']]
|
||||||
|
|
||||||
|
# Group words into the seven emotion categories
|
||||||
|
Happy = []
|
||||||
|
Good = []
|
||||||
|
Surprise = []
|
||||||
|
Anger = []
|
||||||
|
Sad = []
|
||||||
|
Fear = []
|
||||||
|
Disgust = []
|
||||||
|
|
||||||
|
for idx, row in df.iterrows():
|
||||||
|
if row['情感分类'] in ['PA', 'PE']:
|
||||||
|
Happy.append(row['词语'])
|
||||||
|
if row['情感分类'] in ['PD', 'PH', 'PG', 'PB', 'PK']:
|
||||||
|
Good.append(row['词语'])
|
||||||
|
if row['情感分类'] in ['PC']:
|
||||||
|
Surprise.append(row['词语'])
|
||||||
|
if row['情感分类'] in ['NA']:
|
||||||
|
Anger.append(row['词语'])
|
||||||
|
if row['情感分类'] in ['NB', 'NJ', 'NH', 'PF']:
|
||||||
|
Sad.append(row['词语'])
|
||||||
|
if row['情感分类'] in ['NI', 'NC', 'NG']:
|
||||||
|
Fear.append(row['词语'])
|
||||||
|
if row['情感分类'] in ['NE', 'ND', 'NN', 'NK', 'NL']:
|
||||||
|
Disgust.append(row['词语'])
|
||||||
|
|
||||||
|
# Add supplementary stock-market vocabulary
|
||||||
|
stock_positive = ['涨', '上涨', '暴涨', '拉升', '涨停', '盈利', '收益', '赚钱', '赚',
|
||||||
|
'利好', '增长', '上升', '增加', '发展', '进步', '提升', '改善', '突破',
|
||||||
|
'创新', '优势', '超预期', '亮眼', '惊艳', '奇迹']
|
||||||
|
stock_negative = ['跌', '下跌', '暴跌', '跳水', '跌停', '亏损', '亏钱', '赔', '损失',
|
||||||
|
'套牢', '垃圾', '恶心', '坑爹', '骗局', '雷', '爆雷', '崩盘', '退市']
|
||||||
|
|
||||||
|
Good.extend(stock_positive)
|
||||||
|
Disgust.extend(stock_negative)
|
||||||
|
|
||||||
|
# Merge into overall positive and negative lists
|
||||||
|
Positive = Happy + Good + Surprise
|
||||||
|
Negative = Anger + Sad + Fear + Disgust
|
||||||
|
|
||||||
|
print('大连理工大学情感词典加载完成')
|
||||||
|
print(f'正面情感词: {len(Positive)}个')
|
||||||
|
print(f'负面情感词: {len(Negative)}个')
|
||||||
|
|
||||||
|
return {
|
||||||
|
'Happy': Happy,
|
||||||
|
'Good': Good,
|
||||||
|
'Surprise': Surprise,
|
||||||
|
'Anger': Anger,
|
||||||
|
'Sad': Sad,
|
||||||
|
'Fear': Fear,
|
||||||
|
'Disgust': Disgust,
|
||||||
|
'Positive': Positive,
|
||||||
|
'Negative': Negative
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f'加载大连理工大学情感词典失败: {e}')
|
||||||
|
print('使用简化版情感词典')
|
||||||
|
return build_simplified_dictionary()
|
||||||
|
|
||||||
|
def build_simplified_dictionary():
|
||||||
|
"""Build a simplified Chinese sentiment dictionary (fallback)."""
|
||||||
|
|
||||||
|
# Positive emotion words
|
||||||
|
Happy = [
|
||||||
|
'开心', '快乐', '高兴', '喜悦', '愉快', '欣喜', '欢乐', '欢喜', '幸福',
|
||||||
|
'满意', '满足', '欣慰', '愉悦', '畅快', '乐观', '积极', '美好', '成功'
|
||||||
|
]
|
||||||
|
|
||||||
|
Good = [
|
||||||
|
'好', '优秀', '出色', '精彩', '卓越', '杰出', '优良', '良好', '完美', '不错',
|
||||||
|
'涨', '上涨', '暴涨', '拉升', '涨停', '盈利', '收益', '赚钱', '赚', '利好',
|
||||||
|
'增长', '上升', '增加', '发展', '进步', '提升', '改善', '突破', '创新', '优势'
|
||||||
|
]
|
||||||
|
|
||||||
|
Surprise = [
|
||||||
|
'惊喜', '意外', '震惊', '惊讶', '震撼', '神奇', '奇迹', '惊艳', '亮眼', '超预期'
|
||||||
|
]
|
||||||
|
|
||||||
|
# Negative emotion words
|
||||||
|
Anger = [
|
||||||
|
'愤怒', '生气', '恼火', '气愤', '暴怒', '愤慨', '愤恨', '震怒', '发怒',
|
||||||
|
'骂', '垃圾', '恶心', '坑爹', '骗局', '欺骗', '欺诈', '造假', '腐败', '黑暗'
|
||||||
|
]
|
||||||
|
|
||||||
|
Sad = [
|
||||||
|
'伤心', '难过', '悲伤', '痛苦', '悲哀', '沮丧', '失望', '绝望', '低落', '悲观',
|
||||||
|
'跌', '下跌', '暴跌', '跳水', '跌停', '亏损', '亏钱', '赔', '损失', '套牢'
|
||||||
|
]
|
||||||
|
|
||||||
|
Fear = [
|
||||||
|
'害怕', '恐惧', '担心', '担忧', '恐慌', '不安', '焦虑', '忧虑', '紧张', '恐怖',
|
||||||
|
'风险', '危机', '危险', '下跌', '暴跌', '崩盘', '退市', '爆雷', '雷', '怕'
|
||||||
|
]
|
||||||
|
|
||||||
|
Disgust = [
|
||||||
|
'厌恶', '恶心', '反感', '讨厌', '鄙视', '唾弃', '不屑', '蔑视', '嫌弃',
|
||||||
|
'垃圾', '废物', '不行', '差劲', '差', '烂', '渣', '骗局'
|
||||||
|
]
|
||||||
|
|
||||||
|
# Merge into overall positive and negative lists
|
||||||
|
Positive = Happy + Good + Surprise
|
||||||
|
Negative = Anger + Sad + Fear + Disgust
|
||||||
|
|
||||||
|
print('简化版情感词典构建完成')
|
||||||
|
print(f'正面情感词: {len(Positive)}个')
|
||||||
|
print(f'负面情感词: {len(Negative)}个')
|
||||||
|
|
||||||
|
return {
|
||||||
|
'Happy': Happy,
|
||||||
|
'Good': Good,
|
||||||
|
'Surprise': Surprise,
|
||||||
|
'Anger': Anger,
|
||||||
|
'Sad': Sad,
|
||||||
|
'Fear': Fear,
|
||||||
|
'Disgust': Disgust,
|
||||||
|
'Positive': Positive,
|
||||||
|
'Negative': Negative
|
||||||
|
}
|
||||||
|
|
||||||
|
# ============================================================
|
||||||
|
# Part 2: Emotion scoring function
|
||||||
|
# ============================================================
|
||||||
|
|
||||||
|
def emotion_caculate(text, sentiment_dict):
|
||||||
|
"""Compute emotion counts for a single text."""
|
||||||
|
|
||||||
|
if not text or pd.isna(text):
|
||||||
|
text = ''
|
||||||
|
|
||||||
|
positive = 0
|
||||||
|
negative = 0
|
||||||
|
anger = 0
|
||||||
|
disgust = 0
|
||||||
|
fear = 0
|
||||||
|
sad = 0
|
||||||
|
surprise = 0
|
||||||
|
good = 0
|
||||||
|
happy = 0
|
||||||
|
|
||||||
|
wordlist = jieba.lcut(text)
|
||||||
|
wordset = set(wordlist)
|
||||||
|
|
||||||
|
for word in wordset:
|
||||||
|
freq = wordlist.count(word)
|
||||||
|
|
||||||
|
if word in sentiment_dict['Positive']:
|
||||||
|
positive += freq
|
||||||
|
if word in sentiment_dict['Negative']:
|
||||||
|
negative += freq
|
||||||
|
if word in sentiment_dict['Anger']:
|
||||||
|
anger += freq
|
||||||
|
if word in sentiment_dict['Disgust']:
|
||||||
|
disgust += freq
|
||||||
|
if word in sentiment_dict['Fear']:
|
||||||
|
fear += freq
|
||||||
|
if word in sentiment_dict['Sad']:
|
||||||
|
sad += freq
|
||||||
|
if word in sentiment_dict['Surprise']:
|
||||||
|
surprise += freq
|
||||||
|
if word in sentiment_dict['Good']:
|
||||||
|
good += freq
|
||||||
|
if word in sentiment_dict['Happy']:
|
||||||
|
happy += freq
|
||||||
|
|
||||||
|
emotion_info = {
|
||||||
|
'length': len(wordlist),
|
||||||
|
'positive': positive,
|
||||||
|
'negative': negative,
|
||||||
|
'anger': anger,
|
||||||
|
'disgust': disgust,
|
||||||
|
'fear': fear,
|
||||||
|
'sadness': sad,
|
||||||
|
'surprise': surprise,
|
||||||
|
'good': good,
|
||||||
|
'happy': happy,
|
||||||
|
'sentiment_score': positive - negative if (positive + negative) > 0 else 0
|
||||||
|
}
|
||||||
|
|
||||||
|
indexs = ['length', 'positive', 'negative', 'anger', 'disgust', 'fear',
|
||||||
|
'sadness', 'surprise', 'good', 'happy', 'sentiment_score']
|
||||||
|
|
||||||
|
return pd.Series(emotion_info, index=indexs)
|
||||||
|
|
||||||
|
# ============================================================
|
||||||
|
# Part 3: Data loading and analysis
|
||||||
|
# ============================================================
|
||||||
|
|
||||||
|
def load_and_analyze_data(data_dir='data', output_dir='sentiment_output'):
|
||||||
|
"""Load the scraped data and run emotion analysis."""
|
||||||
|
|
||||||
|
os.makedirs(output_dir, exist_ok=True)
|
||||||
|
|
||||||
|
# Build the sentiment dictionary
|
||||||
|
sentiment_dict = build_sentiment_dictionary()
|
||||||
|
|
||||||
|
# Iterate over all JSON files
|
||||||
|
all_results = []
|
||||||
|
stock_emotions = {}
|
||||||
|
|
||||||
|
for filename in os.listdir(data_dir):
|
||||||
|
if filename.endswith('.json') and filename.startswith('guba_'):
|
||||||
|
filepath = os.path.join(data_dir, filename)
|
||||||
|
|
||||||
|
print(f'\n正在分析: {filename}')
|
||||||
|
|
||||||
|
try:
|
||||||
|
with open(filepath, 'r', encoding='utf-8') as f:
|
||||||
|
data = json.load(f)
|
||||||
|
|
||||||
|
stock_name = data.get('stock_name', '未知')
|
||||||
|
stock_code = data.get('stock_code', '未知')
|
||||||
|
posts = data.get('posts', [])
|
||||||
|
|
||||||
|
if not posts:
|
||||||
|
print(f' 无数据,跳过')
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Convert to a DataFrame
|
||||||
|
df = pd.DataFrame(posts)
|
||||||
|
|
||||||
|
# Concatenate title and content
|
||||||
|
df['full_text'] = df.apply(
|
||||||
|
lambda x: f"{x.get('post_title', '')} {x.get('post_content', '')}",
|
||||||
|
axis=1
|
||||||
|
)
|
||||||
|
|
||||||
|
# Run emotion analysis
|
||||||
|
print(f' 开始分析 {len(df)} 条帖子...')
|
||||||
|
start = time.time()
|
||||||
|
|
||||||
|
emotion_df = df['full_text'].apply(
|
||||||
|
lambda x: emotion_caculate(x, sentiment_dict)
|
||||||
|
)
|
||||||
|
|
||||||
|
end = time.time()
|
||||||
|
print(f' 分析完成,耗时: {end - start:.2f}秒')
|
||||||
|
|
||||||
|
# Merge results
|
||||||
|
result_df = pd.concat([df, emotion_df], axis=1)
|
||||||
|
|
||||||
|
# Save results
|
||||||
|
output_file = os.path.join(output_dir, f'sentiment_{stock_name}_{stock_code}.csv')
|
||||||
|
result_df.to_csv(output_file, index=False, encoding='utf-8-sig')
|
||||||
|
print(f' 结果已保存到: {output_file}')
|
||||||
|
|
||||||
|
# Aggregate overall emotion statistics
|
||||||
|
stock_stats = {
|
||||||
|
'stock_code': stock_code,
|
||||||
|
'stock_name': stock_name,
|
||||||
|
'total_posts': len(result_df),
|
||||||
|
'avg_positive': result_df['positive'].mean(),
|
||||||
|
'avg_negative': result_df['negative'].mean(),
|
||||||
|
'avg_sentiment_score': result_df['sentiment_score'].mean(),
|
||||||
|
'positive_posts': (result_df['sentiment_score'] > 0).sum(),
|
||||||
|
'negative_posts': (result_df['sentiment_score'] < 0).sum(),
|
||||||
|
'neutral_posts': (result_df['sentiment_score'] == 0).sum(),
|
||||||
|
'total_anger': result_df['anger'].sum(),
|
||||||
|
'total_sadness': result_df['sadness'].sum(),
|
||||||
|
'total_fear': result_df['fear'].sum(),
|
||||||
|
'total_disgust': result_df['disgust'].sum(),
|
||||||
|
'total_good': result_df['good'].sum(),
|
||||||
|
'total_happy': result_df['happy'].sum(),
|
||||||
|
'total_surprise': result_df['surprise'].sum()
|
||||||
|
}
|
||||||
|
|
||||||
|
stock_emotions[stock_code] = stock_stats
|
||||||
|
all_results.append(result_df)
|
||||||
|
|
||||||
|
# Print this stock's statistics and its most positive / most negative posts
|
||||||
|
print(f'\n {stock_name} 情绪分析统计:')
|
||||||
|
print(f' 平均情绪得分: {stock_stats["avg_sentiment_score"]:.2f}')
|
||||||
|
print(f' 正面帖子: {stock_stats["positive_posts"]}')
|
||||||
|
print(f' 负面帖子: {stock_stats["negative_posts"]}')
|
||||||
|
print(f' 中性帖子: {stock_stats["neutral_posts"]}')
|
||||||
|
|
||||||
|
# Most positive post
|
||||||
|
top_positive = result_df.nlargest(1, 'sentiment_score').iloc[0]
|
||||||
|
print(f' 最正面帖子: {top_positive["full_text"][:50]}...')
|
||||||
|
|
||||||
|
# Most negative post
|
||||||
|
top_negative = result_df.nsmallest(1, 'sentiment_score').iloc[0]
|
||||||
|
print(f' 最负面帖子: {top_negative["full_text"][:50]}...')
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f' 分析失败: {e}')
|
||||||
|
|
||||||
|
# Save the overall summary
|
||||||
|
if stock_emotions:
|
||||||
|
summary_df = pd.DataFrame(list(stock_emotions.values()))
|
||||||
|
summary_file = os.path.join(output_dir, 'sentiment_summary.csv')
|
||||||
|
summary_df.to_csv(summary_file, index=False, encoding='utf-8-sig')
|
||||||
|
print(f'\n总体统计已保存到: {summary_file}')
|
||||||
|
|
||||||
|
# Generate visualizations
|
||||||
|
generate_visualizations(summary_df, stock_emotions, output_dir)
|
||||||
|
|
||||||
|
return all_results, stock_emotions
|
||||||
|
|
||||||
|
# ============================================================
|
||||||
|
# Part 4: Visualization
|
||||||
|
# ============================================================
|
||||||
|
|
||||||
|
def generate_visualizations(summary_df, stock_emotions, output_dir):
|
||||||
|
"""Generate the emotion-analysis charts."""
|
||||||
|
|
||||||
|
# 1. Average sentiment score per stock
|
||||||
|
plt.figure(figsize=(12, 6))
|
||||||
|
colors = ['green' if x >= 0 else 'red' for x in summary_df['avg_sentiment_score']]
|
||||||
|
plt.bar(summary_df['stock_name'], summary_df['avg_sentiment_score'], color=colors, alpha=0.7)
|
||||||
|
plt.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
|
||||||
|
plt.title('各股票平均情绪得分对比', fontsize=14)
|
||||||
|
plt.xlabel('股票名称', fontsize=12)
|
||||||
|
plt.ylabel('平均情绪得分', fontsize=12)
|
||||||
|
plt.xticks(rotation=45)
|
||||||
|
plt.tight_layout()
|
||||||
|
plt.savefig(os.path.join(output_dir, 'sentiment_score_comparison.png'), dpi=300)
|
||||||
|
plt.close()
|
||||||
|
|
||||||
|
# 2. Distribution of positive / negative / neutral posts
|
||||||
|
fig, axes = plt.subplots(2, 4, figsize=(16, 10))
|
||||||
|
axes = axes.flatten()
|
||||||
|
|
||||||
|
for idx, (stock_code, stats) in enumerate(stock_emotions.items()):
|
||||||
|
if idx >= 8:
|
||||||
|
break
|
||||||
|
labels = ['正面', '负面', '中性']
|
||||||
|
sizes = [stats['positive_posts'], stats['negative_posts'], stats['neutral_posts']]
|
||||||
|
colors = ['green', 'red', 'gray']
|
||||||
|
|
||||||
|
axes[idx].pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
|
||||||
|
axes[idx].set_title(f'{stats["stock_name"]} 情绪分布')
|
||||||
|
|
||||||
|
plt.tight_layout()
|
||||||
|
plt.savefig(os.path.join(output_dir, 'sentiment_distribution.png'), dpi=300)
|
||||||
|
plt.close()
|
||||||
|
|
||||||
|
# 3. Share of each emotion type
|
||||||
|
plt.figure(figsize=(14, 7))
|
||||||
|
emotions = ['total_good', 'total_happy', 'total_surprise',
|
||||||
|
'total_anger', 'total_sadness', 'total_fear', 'total_disgust']
|
||||||
|
emotion_names = ['好评', '快乐', '惊讶', '愤怒', '悲伤', '恐惧', '厌恶']
|
||||||
|
|
||||||
|
x = range(len(emotion_names))
|
||||||
|
width = 0.1
|
||||||
|
|
||||||
|
for idx, (stock_code, stats) in enumerate(stock_emotions.items()):
|
||||||
|
values = [stats[e] for e in emotions]
|
||||||
|
total = sum(values)
|
||||||
|
if total > 0:
|
||||||
|
values = [v / total * 100 for v in values]
|
||||||
|
plt.bar([xi + width * idx for xi in x], values, width, label=stats['stock_name'])
|
||||||
|
|
||||||
|
plt.xlabel('情绪类型', fontsize=12)
|
||||||
|
plt.ylabel('占比 (%)', fontsize=12)
|
||||||
|
plt.title('各股票情绪类型分布', fontsize=14)
|
||||||
|
plt.xticks([xi + width * 3.5 for xi in x], emotion_names)
|
||||||
|
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
|
||||||
|
plt.tight_layout()
|
||||||
|
plt.savefig(os.path.join(output_dir, 'emotion_types.png'), dpi=300, bbox_inches='tight')
|
||||||
|
plt.close()
|
||||||
|
|
||||||
|
print(f'可视化图表已生成到 {output_dir}')
|
||||||
|
|
||||||
|
# ============================================================
|
||||||
|
# Main program
|
||||||
|
# ============================================================
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
print('=' * 60)
|
||||||
|
print('股吧数据情绪分析')
|
||||||
|
print('=' * 60)
|
||||||
|
|
||||||
|
# Run the analysis
|
||||||
|
all_results, stock_emotions = load_and_analyze_data()
|
||||||
|
|
||||||
|
print('\n' + '=' * 60)
|
||||||
|
print('情绪分析完成!')
|
||||||
|
print('=' * 60)
|
||||||
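`emotion_caculate` above tests each word against Python lists (`word in sentiment_dict['Positive']`), which scans the list on every lookup, and calls `wordlist.count(word)` once per unique word. A minimal sketch of the same counting with set lookups and a single frequency pass is shown below. The names `build_lookup` and `count_emotions` are hypothetical, and the function takes an already-segmented token list (in the script this would come from `jieba.lcut(text)`) so that the sketch runs without extra dependencies.

```python
from collections import Counter


def build_lookup(sentiment_dict):
    """Convert each category's word list into a frozenset for fast membership tests."""
    return {name: frozenset(words) for name, words in sentiment_dict.items()}


def count_emotions(tokens, lookup):
    """Sum token frequencies per category and derive a positive-minus-negative score."""
    freq = Counter(tokens)  # one pass over the tokens
    counts = {
        name: sum(n for word, n in freq.items() if word in words)
        for name, words in lookup.items()
    }
    # Same score definition as the original: positive count minus negative count.
    counts['sentiment_score'] = counts.get('Positive', 0) - counts.get('Negative', 0)
    counts['length'] = len(tokens)
    return counts


# Toy categories for illustration only; the real lists come from the lexicon.
lookup = build_lookup({'Positive': ['涨', '利好'], 'Negative': ['跌', '套牢']})
result = count_emotions(['涨', '涨', '跌', '了'], lookup)
print(result)
```

Building the lookup once and reusing it across posts avoids repeated list scans, which matters when the lexicon holds tens of thousands of entries.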
@@ -0,0 +1,297 @@
|
|||||||
|
import os
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
import numpy as np
|
||||||
|
import pandas as pd
|
||||||
|
from datetime import datetime
|
||||||
|
import matplotlib.pyplot as plt
|
||||||
|
import matplotlib
|
||||||
|
matplotlib.use('Agg')
|
||||||
|
import jieba
|
||||||
|
|
||||||
|
# Configure fonts that can render Chinese text
|
||||||
|
plt.rcParams['font.sans-serif'] = ['SimHei', 'Microsoft YaHei', 'SimSun', 'Arial Unicode MS']
|
||||||
|
plt.rcParams['axes.unicode_minus'] = False
|
||||||
|
|
||||||
|
# Load stopwords
|
||||||
|
def load_stopwords(filepath='stopwords.txt'):
|
||||||
|
stopwords = set()
|
||||||
|
if os.path.exists(filepath):
|
||||||
|
with open(filepath, 'r', encoding='utf-8') as f:
|
||||||
|
for line in f:
|
||||||
|
word = line.strip()
|
||||||
|
if word:
|
||||||
|
stopwords.add(word)
|
||||||
|
return stopwords
|
||||||
|
|
||||||
|
STOPWORDS = load_stopwords()
|
||||||
|
|
||||||
|
# ============================================================
|
||||||
|
# Build the sentiment dictionary (mirrors sentiment_analysis.py)
|
||||||
|
# ============================================================
|
||||||
|
def build_sentiment_dictionary():
|
||||||
|
"""Build the sentiment dictionary from the Dalian University of Technology (DUT) Chinese affective lexicon ontology."""
|
||||||
|
|
||||||
|
dict_path = '大连理工大学中文情感词汇本体.xlsx'
|
||||||
|
|
||||||
|
try:
|
||||||
|
df = pd.read_excel(dict_path)
|
||||||
|
df = df[['词语', '词性种类', '词义数', '词义序号', '情感分类', '强度', '极性']]
|
||||||
|
|
||||||
|
Happy = []
|
||||||
|
Good = []
|
||||||
|
Surprise = []
|
||||||
|
Anger = []
|
||||||
|
Sad = []
|
||||||
|
Fear = []
|
||||||
|
Disgust = []
|
||||||
|
|
||||||
|
for idx, row in df.iterrows():
|
||||||
|
if row['情感分类'] in ['PA', 'PE']:
|
||||||
|
Happy.append(row['词语'])
|
||||||
|
if row['情感分类'] in ['PD', 'PH', 'PG', 'PB', 'PK']:
|
||||||
|
Good.append(row['词语'])
|
||||||
|
if row['情感分类'] in ['PC']:
|
||||||
|
Surprise.append(row['词语'])
|
||||||
|
if row['情感分类'] in ['NA']:
|
||||||
|
Anger.append(row['词语'])
|
||||||
|
if row['情感分类'] in ['NB', 'NJ', 'NH', 'PF']:
|
||||||
|
Sad.append(row['词语'])
|
||||||
|
if row['情感分类'] in ['NI', 'NC', 'NG']:
|
||||||
|
Fear.append(row['词语'])
|
||||||
|
if row['情感分类'] in ['NE', 'ND', 'NN', 'NK', 'NL']:
|
||||||
|
Disgust.append(row['词语'])
|
||||||
|
|
||||||
|
# 添加股票相关词汇
|
||||||
|
stock_positive = ['涨', '上涨', '暴涨', '拉升', '涨停', '盈利', '收益', '赚钱', '赚',
|
||||||
|
'利好', '增长', '上升', '增加', '发展', '进步', '提升', '改善', '突破',
|
||||||
|
'创新', '优势', '超预期', '亮眼', '惊艳', '奇迹']
|
||||||
|
stock_negative = ['跌', '下跌', '暴跌', '跳水', '跌停', '亏损', '亏钱', '赔', '损失',
|
||||||
|
'套牢', '垃圾', '恶心', '坑爹', '骗局', '雷', '爆雷', '崩盘', '退市']
|
||||||
|
|
||||||
|
Good.extend(stock_positive)
|
||||||
|
Disgust.extend(stock_negative)
|
||||||
|
|
||||||
|
Positive = Happy + Good + Surprise
|
||||||
|
Negative = Anger + Sad + Fear + Disgust
|
||||||
|
|
||||||
|
print(f'大连理工大学情感词典加载完成')
|
||||||
|
print(f' 正面情感词: {len(Positive)}个')
|
||||||
|
print(f' 负面情感词: {len(Negative)}个')
|
||||||
|
|
||||||
|
return {
|
||||||
|
'Happy': Happy,
|
||||||
|
'Good': Good,
|
||||||
|
'Surprise': Surprise,
|
||||||
|
'Anger': Anger,
|
||||||
|
'Sad': Sad,
|
||||||
|
'Fear': Fear,
|
||||||
|
'Disgust': Disgust,
|
||||||
|
'Positive': Positive,
|
||||||
|
'Negative': Negative
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f'加载大连理工大学情感词典失败: {e}')
|
||||||
|
print('使用简化版情感词典')
|
||||||
|
return build_simplified_dictionary()
|
||||||
|
|
||||||
|
def build_simplified_dictionary():
|
||||||
|
"""构建简化的中文情感词典(备用方案)"""
|
||||||
|
|
||||||
|
Happy = ['开心', '快乐', '高兴', '喜悦', '愉快', '欣喜', '欢乐', '欢喜', '幸福',
|
||||||
|
'满意', '满足', '欣慰', '愉悦', '畅快', '乐观', '积极', '美好', '成功']
|
||||||
|
|
||||||
|
Good = ['好', '优秀', '出色', '精彩', '卓越', '杰出', '优良', '良好', '完美', '不错',
|
||||||
|
'涨', '上涨', '暴涨', '拉升', '涨停', '盈利', '收益', '赚钱', '赚', '利好',
|
||||||
|
'增长', '上升', '增加', '发展', '进步', '提升', '改善', '突破', '创新', '优势']
|
||||||
|
|
||||||
|
Surprise = ['惊喜', '意外', '震惊', '惊讶', '震撼', '神奇', '奇迹', '惊艳', '亮眼', '超预期']
|
||||||
|
|
||||||
|
Anger = ['愤怒', '生气', '恼火', '气愤', '暴怒', '愤慨', '愤恨', '震怒', '发怒',
|
||||||
|
'骂', '垃圾', '恶心', '坑爹', '骗局', '欺骗', '欺诈', '造假', '腐败', '黑暗']
|
||||||
|
|
||||||
|
Sad = ['伤心', '难过', '悲伤', '痛苦', '悲哀', '沮丧', '失望', '绝望', '低落', '悲观',
|
||||||
|
'跌', '下跌', '暴跌', '跳水', '跌停', '亏损', '亏钱', '赔', '损失', '套牢']
|
||||||
|
|
||||||
|
Fear = ['害怕', '恐惧', '担心', '担忧', '恐慌', '不安', '焦虑', '忧虑', '紧张', '恐怖',
|
||||||
|
'风险', '危机', '危险', '下跌', '暴跌', '崩盘', '退市', '爆雷', '雷', '怕']
|
||||||
|
|
||||||
|
Disgust = ['厌恶', '恶心', '反感', '讨厌', '鄙视', '唾弃', '不屑', '蔑视', '嫌弃',
|
||||||
|
'垃圾', '废物', '不行', '差劲', '差', '烂', '渣', '骗局']
|
||||||
|
|
||||||
|
Positive = Happy + Good + Surprise
|
||||||
|
Negative = Anger + Sad + Fear + Disgust
|
||||||
|
|
||||||
|
print(f'简化版情感词典构建完成')
|
||||||
|
print(f' 正面情感词: {len(Positive)}个')
|
||||||
|
print(f' 负面情感词: {len(Negative)}个')
|
||||||
|
|
||||||
|
return {
|
||||||
|
'Happy': Happy,
|
||||||
|
'Good': Good,
|
||||||
|
'Surprise': Surprise,
|
||||||
|
'Anger': Anger,
|
||||||
|
'Sad': Sad,
|
||||||
|
'Fear': Fear,
|
||||||
|
'Disgust': Disgust,
|
||||||
|
'Positive': Positive,
|
||||||
|
'Negative': Negative
|
||||||
|
}
|
||||||
|
|
||||||
|
# ============================================================
|
||||||
|
# 情绪计算函数(参照 sentiment_analysis.py)
|
||||||
|
# ============================================================
|
||||||
|
def emotion_caculate(text, sentiment_dict):
|
||||||
|
"""计算单条文本的情绪"""
|
||||||
|
|
||||||
|
if not text or pd.isna(text):
|
||||||
|
return 0
|
||||||
|
|
||||||
|
positive = 0
|
||||||
|
negative = 0
|
||||||
|
|
||||||
|
wordlist = jieba.lcut(text)
|
||||||
|
|
||||||
|
for word in wordlist:
|
||||||
|
# 跳过停用词和短词
|
||||||
|
if word in STOPWORDS or len(word) <= 1:
|
||||||
|
continue
|
||||||
|
|
||||||
|
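freq = 1  # count each occurrence once; the loop already visits every token, so wordlist.count(word) would square repeated words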
|
||||||
|
|
||||||
|
if word in sentiment_dict['Positive']:
|
||||||
|
positive += freq
|
||||||
|
if word in sentiment_dict['Negative']:
|
||||||
|
negative += freq
|
||||||
|
|
||||||
|
sentiment_score = positive - negative
|
||||||
|
return sentiment_score
|
||||||
|
|
||||||
|
# ============================================================
|
||||||
|
# 时间序列分析
|
||||||
|
# ============================================================
|
||||||
|
def analyze_sentiment_trend():
|
||||||
|
"""分析情绪时间序列趋势(使用情感词典)"""
|
||||||
|
print("="*60)
|
||||||
|
print("情绪时间序列分析(基于情感词典)")
|
||||||
|
print("="*60)
|
||||||
|
|
||||||
|
# 构建情感词典
|
||||||
|
print("\n[1/5] 构建情感词典...")
|
||||||
|
sentiment_dict = build_sentiment_dictionary()
|
||||||
|
|
||||||
|
# 加载数据
|
||||||
|
print("\n[2/5] 加载数据...")
|
||||||
|
df = pd.read_csv('output/all_posts.csv', encoding='utf-8-sig')
|
||||||
|
|
||||||
|
# 检查是否有 post_publish_time 字段
|
||||||
|
if 'post_publish_time' not in df.columns:
|
||||||
|
print("警告:数据中没有 post_publish_time 字段,请先运行 analyze.py")
|
||||||
|
return
|
||||||
|
|
||||||
|
# 转换时间戳
|
||||||
|
print("\n[3/5] 转换时间戳...")
|
||||||
|
df['timestamp'] = pd.to_datetime(df['post_publish_time'], errors='coerce')
|
||||||
|
df = df.dropna(subset=['timestamp'])
|
||||||
|
df['date'] = df['timestamp'].dt.date
|
||||||
|
|
||||||
|
# 计算情绪得分
|
||||||
|
print("\n[4/5] 计算情绪得分...")
|
||||||
|
df['sentiment_score'] = df['clean_text'].apply(
|
||||||
|
lambda x: emotion_caculate(x, sentiment_dict)
|
||||||
|
)
|
||||||
|
|
||||||
|
# 保存结果
|
||||||
|
df.to_csv('output/sentiment_analysis_result.csv', index=False, encoding='utf-8-sig')
|
||||||
|
print(" 情绪分析结果已保存到: output/sentiment_analysis_result.csv")
|
||||||
|
|
||||||
|
# 按股票分组分析
|
||||||
|
stock_groups = df.groupby('stock_code')
|
||||||
|
os.makedirs('output/plots', exist_ok=True)
|
||||||
|
|
||||||
|
print("\n[5/5] 生成时间序列图表...")
|
||||||
|
for stock_code, group in stock_groups:
|
||||||
|
stock_name = group['stock_name'].iloc[0]
|
||||||
|
print(f"\n 分析 {stock_name} ({stock_code})...")
|
||||||
|
|
||||||
|
# 按日期分组计算平均情绪
|
||||||
|
daily_sentiment = group.groupby('date')['sentiment_score'].agg(['mean', 'count']).reset_index()
|
||||||
|
daily_sentiment.columns = ['date', 'avg_sentiment', 'post_count']
|
||||||
|
|
||||||
|
if len(daily_sentiment) < 2:
|
||||||
|
print(f" 数据不足,跳过")
|
||||||
|
continue
|
||||||
|
|
||||||
|
# 绘制时间序列图
|
||||||
|
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8), sharex=True)
|
||||||
|
|
||||||
|
# 情绪趋势
|
||||||
|
ax1.plot(daily_sentiment['date'], daily_sentiment['avg_sentiment'],
|
||||||
|
marker='o', linestyle='-', color='b', label='日均情绪')
|
||||||
|
|
||||||
|
# 添加移动平均线
|
||||||
|
daily_sentiment['MA3'] = daily_sentiment['avg_sentiment'].rolling(window=3).mean()
|
||||||
|
ax1.plot(daily_sentiment['date'], daily_sentiment['MA3'],
|
||||||
|
marker='', linestyle='--', color='r', label='3日移动平均')
|
||||||
|
|
||||||
|
ax1.set_title(f'{stock_name} ({stock_code}) 情绪时间序列趋势', fontsize=14)
|
||||||
|
ax1.set_ylabel('情绪分数', fontsize=12)
|
||||||
|
ax1.axhline(y=0, color='gray', linestyle='-', linewidth=0.5)
|
||||||
|
ax1.grid(True)
|
||||||
|
ax1.legend()
|
||||||
|
|
||||||
|
# 发帖量
|
||||||
|
ax2.bar(daily_sentiment['date'], daily_sentiment['post_count'], color='g', alpha=0.7)
|
||||||
|
ax2.set_xlabel('日期', fontsize=12)
|
||||||
|
ax2.set_ylabel('发帖数量', fontsize=12)
|
||||||
|
ax2.grid(True)
|
||||||
|
|
||||||
|
plt.xticks(rotation=45)
|
||||||
|
plt.tight_layout()
|
||||||
|
|
||||||
|
# 保存图表
|
||||||
|
plot_path = f'output/plots/sentiment_trend_{stock_name}.png'
|
||||||
|
plt.savefig(plot_path, dpi=100)
|
||||||
|
plt.close()
|
||||||
|
print(f" 图表已保存到: {plot_path}")
|
||||||
|
|
||||||
|
# 输出统计信息
|
||||||
|
avg_sentiment = group['sentiment_score'].mean()
|
||||||
|
pos_count = (group['sentiment_score'] > 0).sum()
|
||||||
|
neg_count = (group['sentiment_score'] < 0).sum()
|
||||||
|
neu_count = (group['sentiment_score'] == 0).sum()
|
||||||
|
print(f" 平均情绪: {avg_sentiment:.4f}")
|
||||||
|
print(f" 正面帖子: {pos_count}, 负面帖子: {neg_count}, 中性帖子: {neu_count}")
|
||||||
|
|
||||||
|
# 生成汇总报告
|
||||||
|
print("\n生成汇总报告...")
|
||||||
|
summary_data = []
|
||||||
|
for stock_code, group in stock_groups:
|
||||||
|
stock_name = group['stock_name'].iloc[0]
|
||||||
|
avg_sentiment = group['sentiment_score'].mean()
|
||||||
|
post_count = len(group)
|
||||||
|
pos_count = (group['sentiment_score'] > 0).sum()
|
||||||
|
neg_count = (group['sentiment_score'] < 0).sum()
|
||||||
|
neu_count = (group['sentiment_score'] == 0).sum()
|
||||||
|
|
||||||
|
summary_data.append({
|
||||||
|
'股票代码': stock_code,
|
||||||
|
'股票名称': stock_name,
|
||||||
|
'帖子数量': post_count,
|
||||||
|
'平均情绪': round(avg_sentiment, 4),
|
||||||
|
'正面帖子': pos_count,
|
||||||
|
'负面帖子': neg_count,
|
||||||
|
'中性帖子': neu_count
|
||||||
|
})
|
||||||
|
|
||||||
|
summary_df = pd.DataFrame(summary_data)
|
||||||
|
summary_df.to_csv('output/sentiment_summary.csv', index=False, encoding='utf-8-sig')
|
||||||
|
print("汇总报告已保存到: output/sentiment_summary.csv")
|
||||||
|
|
||||||
|
print("\n" + "="*60)
|
||||||
|
print("情绪时间序列分析完成!")
|
||||||
|
print("="*60)
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
analyze_sentiment_trend()
|
||||||
@@ -0,0 +1,187 @@
|
|||||||
|
import requests
|
||||||
|
import pandas as pd
|
||||||
|
import json
|
||||||
|
import time
|
||||||
|
from datetime import datetime
|
||||||
|
import os
|
||||||
|
|
||||||
|
def fetch_guba_data(code='gssz', page=1, page_size=20, sort_type=1):
|
||||||
|
url = 'https://mguba.eastmoney.com/mguba2020/interface/GetData.aspx'
|
||||||
|
|
||||||
|
headers = {
|
||||||
|
'Accept': '*/*',
|
||||||
|
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
|
||||||
|
'Cache-Control': 'no-cache',
|
||||||
|
'Connection': 'keep-alive',
|
||||||
|
'Content-Type': 'application/x-www-form-urlencoded',
|
||||||
|
'DNT': '1',
|
||||||
|
'Origin': 'https://mguba.eastmoney.com',
|
||||||
|
'Pragma': 'no-cache',
|
||||||
|
'Referer': f'https://mguba.eastmoney.com/mguba/list/{code}_{page}',
|
||||||
|
'Sec-Fetch-Dest': 'empty',
|
||||||
|
'Sec-Fetch-Mode': 'cors',
|
||||||
|
'Sec-Fetch-Site': 'same-origin',
|
||||||
|
'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Mobile Safari/537.36 Edg/148.0.0.0',
|
||||||
|
'sec-ch-ua': '"Chromium";v="148", "Microsoft Edge";v="148", "Not/A)Brand";v="99"',
|
||||||
|
'sec-ch-ua-mobile': '?1',
|
||||||
|
'sec-ch-ua-platform': '"Android"'
|
||||||
|
}
|
||||||
|
|
||||||
|
cookies = {
|
||||||
|
'qgqp_b_id': '30059d8839ad5c045fa8856e38013e9c',
|
||||||
|
'st_nvi': 'XwpSfYXGjCxfCdbgapK5_cac4',
|
||||||
|
'nid18': '0daec1df8064f04edd20b4e69250a8f5',
|
||||||
|
'nid18_create_time': '1776263017375',
|
||||||
|
'gviem': 'UrMH_tSu1UpW8B_TKmytl803f',
|
||||||
|
'gviem_create_time': '1776263017375',
|
||||||
|
'fullscreengg': '1',
|
||||||
|
'fullscreengg2': '1',
|
||||||
|
'st_si': '17952715731426',
|
||||||
|
'show_app_box_time': '1779903756410',
|
||||||
|
'st_pvi': '26838250597806',
|
||||||
|
'st_sp': '2026-04-15 22:23:37',
|
||||||
|
'st_inirUrl': 'https://cn.bing.com/',
|
||||||
|
'st_sn': '30',
|
||||||
|
'st_psi': '20260528025236177-117016304298-3040545697',
|
||||||
|
'ad_tc_load_num': '3',
|
||||||
|
'st_asi': '20260528025236177-117016304298-3040545697-ad.djxd-1'
|
||||||
|
}
|
||||||
|
|
||||||
|
param = f'code={code}&p={page}&ps={page_size}&sorttype={sort_type}'
|
||||||
|
data = {
|
||||||
|
'param': param,
|
||||||
|
'plat': 'wap',
|
||||||
|
'version': '200',
|
||||||
|
'path': '/webarticlelist/api/Article/WebArticleList',
|
||||||
|
'env': '1',
|
||||||
|
'origin': '',
|
||||||
|
'ctoken': '',
|
||||||
|
'utoken': ''
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
response = requests.post(url, headers=headers, cookies=cookies, data=data, timeout=10)  # timeout so a stalled request cannot hang the crawl
|
||||||
|
response.raise_for_status()
|
||||||
|
return response.json()
|
||||||
|
except requests.exceptions.RequestException as e:
|
||||||
|
print(f'请求失败: {e}')
|
||||||
|
return None
|
||||||
|
|
||||||
|
def fetch_stock_posts(code, name, pages=10, page_size=20):
|
||||||
|
"""爬取指定股票的多页数据"""
|
||||||
|
all_posts = []
|
||||||
|
|
||||||
|
for page in range(1, pages + 1):
|
||||||
|
print(f'正在爬取 {name} ({code}) - 第 {page}/{pages} 页')
|
||||||
|
result = fetch_guba_data(code=code, page=page, page_size=page_size)
|
||||||
|
|
||||||
|
if result and 're' in result:
|
||||||
|
posts = result['re']
|
||||||
|
all_posts.extend(posts)
|
||||||
|
print(f' 成功获取 {len(posts)} 条帖子')
|
||||||
|
else:
|
||||||
|
print(f' 第 {page} 页获取失败或无数据')
|
||||||
|
|
||||||
|
# 添加延迟避免请求过快
|
||||||
|
if page < pages:
|
||||||
|
time.sleep(1)
|
||||||
|
|
||||||
|
# 整理数据
|
||||||
|
data = {
|
||||||
|
'stock_code': code,
|
||||||
|
'stock_name': name,
|
||||||
|
'total_pages': pages,
|
||||||
|
'total_posts': len(all_posts),
|
||||||
|
'crawl_time': datetime.now().isoformat(),
|
||||||
|
'posts': all_posts
|
||||||
|
}
|
||||||
|
|
||||||
|
return data
|
||||||
|
|
||||||
|
def save_to_json(data, name="", filename=None):
|
||||||
|
if not data:
|
||||||
|
print('数据为空,无法保存')
|
||||||
|
return None
|
||||||
|
|
||||||
|
if not filename:
|
||||||
|
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
|
||||||
|
filename = f'guba_{name}_{timestamp}.json'
|
||||||
|
|
||||||
|
with open(filename, 'w', encoding='utf-8') as f:
|
||||||
|
json.dump(data, f, ensure_ascii=False, indent=2)
|
||||||
|
|
||||||
|
print(f'JSON数据已保存到: {filename}')
|
||||||
|
return filename
|
||||||
|
|
||||||
|
def save_to_excel(data, name="", filename=None):
|
||||||
|
if not data or 'posts' not in data:
|
||||||
|
print('数据格式不正确,无法保存')
|
||||||
|
return None
|
||||||
|
|
||||||
|
posts = data['posts']
|
||||||
|
records = []
|
||||||
|
|
||||||
|
for post in posts:
|
||||||
|
record = {
|
||||||
|
'帖子ID': post.get('post_id'),
|
||||||
|
'标题': post.get('post_title'),
|
||||||
|
'内容': post.get('post_content'),
|
||||||
|
'作者': post.get('post_user', {}).get('user_nickname'),
|
||||||
|
'发布时间': post.get('post_publish_time'),
|
||||||
|
'最后更新': post.get('post_last_time'),
|
||||||
|
'阅读数': post.get('post_click_count'),
|
||||||
|
'评论数': post.get('post_comment_count'),
|
||||||
|
'点赞数': post.get('post_like_count'),
|
||||||
|
'股吧': post.get('post_guba', {}).get('stockbar_name'),
|
||||||
|
'来源': post.get('post_from')
|
||||||
|
}
|
||||||
|
records.append(record)
|
||||||
|
|
||||||
|
df = pd.DataFrame(records)
|
||||||
|
|
||||||
|
if not filename:
|
||||||
|
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
|
||||||
|
filename = f'guba_{name}_{timestamp}.xlsx'
|
||||||
|
|
||||||
|
df.to_excel(filename, index=False, engine='openpyxl')
|
||||||
|
print(f'Excel数据已保存到: {filename}')
|
||||||
|
return filename
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
GAME_STOCKS = {
|
||||||
|
'002624': '完美世界',
|
||||||
|
'002555': '三七互娱',
|
||||||
|
'002558': '巨人网络',
|
||||||
|
'002602': '世纪华通',
|
||||||
|
'300418': '昆仑万维',
|
||||||
|
'002174': '游族网络',
|
||||||
|
'300315': '掌趣科技',
|
||||||
|
'603444': '吉比特',
|
||||||
|
}
|
||||||
|
|
||||||
|
# 创建数据目录
|
||||||
|
os.makedirs('data', exist_ok=True)
|
||||||
|
|
||||||
|
for code, name in GAME_STOCKS.items():
|
||||||
|
print(f'\n{"="*50}')
|
||||||
|
print(f'开始爬取 {name} ({code})')
|
||||||
|
print(f'{"="*50}')
|
||||||
|
|
||||||
|
# Crawl 30 pages per stock (matches pages=30 below)
|
||||||
|
data = fetch_stock_posts(code, name, pages=30)
|
||||||
|
|
||||||
|
if data and data['total_posts'] > 0:
|
||||||
|
print(f'\n共获取 {data["total_posts"]} 条帖子')
|
||||||
|
|
||||||
|
# 保存JSON
|
||||||
|
json_filename = os.path.join('data', f'guba_{name}_{code}.json')
|
||||||
|
save_to_json(data, name, json_filename)
|
||||||
|
|
||||||
|
# 保存Excel
|
||||||
|
excel_filename = os.path.join('data', f'guba_{name}_{code}.xlsx')
|
||||||
|
save_to_excel(data, name, excel_filename)
|
||||||
|
else:
|
||||||
|
print(f'{name} 爬取失败或无数据')
|
||||||
|
|
||||||
|
# 股票之间的延迟
|
||||||
|
time.sleep(2)
|
||||||
+1426
File diff suppressed because it is too large
-38
@@ -1,38 +0,0 @@
|
|||||||
import asyncio
|
|
||||||
import aiohttp
|
|
||||||
|
|
||||||
comment_headers = {
|
|
||||||
'Accept': '*/*',
|
|
||||||
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
|
|
||||||
'Cache-Control': 'no-cache',
|
|
||||||
'Connection': 'keep-alive',
|
|
||||||
'Content-Type': 'application/x-www-form-urlencoded',
|
|
||||||
'Origin': 'https://guba.eastmoney.com',
|
|
||||||
'Pragma': 'no-cache',
|
|
||||||
'Referer': 'https://guba.eastmoney.com/news,002624,1711407668.html',
|
|
||||||
'Sec-Fetch-Dest': 'empty',
|
|
||||||
'Sec-Fetch-Mode': 'cors',
|
|
||||||
'Sec-Fetch-Site': 'same-origin',
|
|
||||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36',
|
|
||||||
}
|
|
||||||
|
|
||||||
async def test_comment_api():
|
|
||||||
url = 'https://guba.eastmoney.com/api/getData?code=002624&path=reply/api/Reply/ArticleNewReplyList'
|
|
||||||
|
|
||||||
payload = {
|
|
||||||
'param': 'postid=1711407668&sort=1&sorttype=1&p=1&ps=30',
|
|
||||||
'plat': 'Web',
|
|
||||||
'path': 'reply/api/Reply/ArticleNewReplyList',
|
|
||||||
'env': '2',
|
|
||||||
'origin': '',
|
|
||||||
'version': '2022',
|
|
||||||
'product': 'Guba'
|
|
||||||
}
|
|
||||||
|
|
||||||
async with aiohttp.ClientSession() as session:
|
|
||||||
async with session.post(url, headers=comment_headers, data=payload) as response:
|
|
||||||
print(f'状态码: {response.status}')
|
|
||||||
text = await response.text()
|
|
||||||
print(f'响应内容:\n{text}')
|
|
||||||
|
|
||||||
asyncio.run(test_comment_api())
|
|
||||||
@@ -0,0 +1,229 @@
|
|||||||
|
import os
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
import numpy as np
|
||||||
|
import pandas as pd
|
||||||
|
from sklearn.model_selection import train_test_split
|
||||||
|
from sklearn.metrics import classification_report, accuracy_score
|
||||||
|
from gensim.models import Word2Vec
|
||||||
|
from tensorflow.keras.preprocessing.text import Tokenizer
|
||||||
|
from tensorflow.keras.preprocessing.sequence import pad_sequences
|
||||||
|
from tensorflow.keras.models import Sequential
|
||||||
|
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
|
||||||
|
from tensorflow.keras.utils import to_categorical
|
||||||
|
import jieba
|
||||||
|
|
||||||
|
def load_stopwords(filepath='stopwords.txt'):
|
||||||
|
"""从文件加载停用词"""
|
||||||
|
stopwords = set()
|
||||||
|
if os.path.exists(filepath):
|
||||||
|
with open(filepath, 'r', encoding='utf-8') as f:
|
||||||
|
for line in f:
|
||||||
|
word = line.strip()
|
||||||
|
if word:
|
||||||
|
stopwords.add(word)
|
||||||
|
print(f"已加载 {len(stopwords)} 个停用词")
|
||||||
|
else:
|
||||||
|
print(f"警告:停用词文件 {filepath} 不存在,使用默认停用词")
|
||||||
|
stopwords = {
|
||||||
|
'的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一', '一个', '上', '也', '很', '到', '说', '要',
|
||||||
|
'去', '你', '会', '着', '没有', '看', '好', '自己', '这', '那', '有', '吗', '吧', '呢', '啊', '呀', '什么', '怎么',
|
||||||
|
'为什么', '哪里', '谁', '多少', '几', '个', '只', '条', '把', '本', '篇', '次', '天', '今天', '明天', '昨天', '又',
|
||||||
|
'再', '还', '已经', '还是', '但是', '可是', '不过', '只是', '只有', '就是', '或者', '跟', '和', '与', '及', '或',
|
||||||
|
'股吧', '东方财富', '帖子', '发表', '回复', '点击', '查看', '更多', '原文', '转发', '分享', '收藏', '评论', '点赞',
|
||||||
|
'http', 'https', 'com', 'cn', 'www', 'net', 'org'
|
||||||
|
}
|
||||||
|
return stopwords
|
||||||
|
|
||||||
|
# 加载停用词
|
||||||
|
STOPWORDS = load_stopwords()
|
||||||
|
|
||||||
|
def clean_text(text):
|
||||||
|
"""清洗文本"""
|
||||||
|
if not text or pd.isna(text):
|
||||||
|
return ""
|
||||||
|
text = str(text)
|
||||||
|
text = re.sub(r'https?://\S+|www\.\S+', '', text)
|
||||||
|
text = re.sub(r'<.*?>', '', text)
|
||||||
|
text = re.sub(r'\[.*?\]', '', text)
|
||||||
|
text = re.sub(r'\b[a-zA-Z]+\d+\b', '', text)
|
||||||
|
text = re.sub(r'\b\d+[a-zA-Z]+\b', '', text)
|
||||||
|
text = re.sub(r'[^\w\s]', ' ', text)
|
||||||
|
text = re.sub(r'\s+', ' ', text).strip()
|
||||||
|
return text
|
||||||
|
|
||||||
|
def tokenize(text):
|
||||||
|
"""中文分词"""
|
||||||
|
words = jieba.lcut(text)
|
||||||
|
filtered_words = []
|
||||||
|
for w in words:
|
||||||
|
if w in STOPWORDS or len(w) <= 1:
|
||||||
|
continue
|
||||||
|
if re.match(r'^[a-zA-Z]+$', w):
|
||||||
|
continue
|
||||||
|
if re.match(r'^[a-zA-Z\s]+$', w):
|
||||||
|
continue
|
||||||
|
filtered_words.append(w)
|
||||||
|
return filtered_words
|
||||||
|
|
||||||
|
def load_and_preprocess_data(filepath='output/all_posts.csv'):
|
||||||
|
"""加载并预处理数据"""
|
||||||
|
df = pd.read_csv(filepath, encoding='utf-8-sig')
|
||||||
|
|
||||||
|
print(f"原始数据: {len(df)} 条")
|
||||||
|
|
||||||
|
df = df.dropna(subset=['clean_text', 'label'])
|
||||||
|
df = df[df['clean_text'].str.strip() != '']
|
||||||
|
|
||||||
|
print(f"有效数据: {len(df)} 条")
|
||||||
|
print(f"标签分布:")
|
||||||
|
print(df['label'].value_counts())
|
||||||
|
|
||||||
|
df['tokens'] = df['clean_text'].apply(tokenize)
|
||||||
|
df = df[df['tokens'].apply(len) > 0]
|
||||||
|
|
||||||
|
print(f"分词后有效数据: {len(df)} 条")
|
||||||
|
|
||||||
|
return df
|
||||||
|
|
||||||
|
def train_word2vec_model(sentences, vector_size=100, window=5, min_count=5):
|
||||||
|
"""训练 Word2Vec 模型"""
|
||||||
|
print(f"\n训练 Word2Vec 模型...")
|
||||||
|
model = Word2Vec(
|
||||||
|
sentences=sentences,
|
||||||
|
vector_size=vector_size,
|
||||||
|
window=window,
|
||||||
|
min_count=min_count,
|
||||||
|
workers=4,
|
||||||
|
epochs=10
|
||||||
|
)
|
||||||
|
print(f"Word2Vec 词汇表大小: {len(model.wv)}")
|
||||||
|
return model
|
||||||
|
|
||||||
|
def build_cnn_model(vocab_size, embedding_dim, max_seq_len, embedding_matrix, num_classes=3):
|
||||||
|
"""构建 CNN 模型"""
|
||||||
|
model = Sequential()
|
||||||
|
|
||||||
|
model.add(Embedding(
|
||||||
|
input_dim=vocab_size,
|
||||||
|
output_dim=embedding_dim,
|
||||||
|
input_length=max_seq_len,
|
||||||
|
weights=[embedding_matrix],
|
||||||
|
trainable=False
|
||||||
|
))
|
||||||
|
|
||||||
|
model.add(Conv1D(128, 5, activation='relu'))
|
||||||
|
model.add(GlobalMaxPooling1D())
|
||||||
|
model.add(Dense(64, activation='relu'))
|
||||||
|
model.add(Dropout(0.5))
|
||||||
|
model.add(Dense(num_classes, activation='softmax'))
|
||||||
|
|
||||||
|
model.compile(
|
||||||
|
optimizer='adam',
|
||||||
|
loss='categorical_crossentropy',
|
||||||
|
metrics=['accuracy']
|
||||||
|
)
|
||||||
|
|
||||||
|
return model
|
||||||
|
|
||||||
|
def main():
|
||||||
|
print("="*60)
|
||||||
|
print("Word2Vec + CNN 情绪感知模型训练")
|
||||||
|
print("="*60)
|
||||||
|
|
||||||
|
# 加载数据
|
||||||
|
print("\n[1/5] 加载数据...")
|
||||||
|
df = load_and_preprocess_data()
|
||||||
|
|
||||||
|
if len(df) < 10:
|
||||||
|
print("数据不足,无法训练")
|
||||||
|
return
|
||||||
|
|
||||||
|
# 准备 Word2Vec 训练数据
|
||||||
|
sentences = df['tokens'].tolist()
|
||||||
|
|
||||||
|
# 训练 Word2Vec
|
||||||
|
print("\n[2/5] 训练 Word2Vec 词向量...")
|
||||||
|
w2v_model = train_word2vec_model(sentences)
|
||||||
|
|
||||||
|
# 构建词汇表
|
||||||
|
print("\n[3/5] 构建词汇表...")
|
||||||
|
tokenizer = Tokenizer()
|
||||||
|
tokenizer.fit_on_texts(sentences)
|
||||||
|
vocab_size = len(tokenizer.word_index) + 1
|
||||||
|
print(f"词汇表大小: {vocab_size}")
|
||||||
|
|
||||||
|
# 转换文本为序列
|
||||||
|
max_seq_len = max(len(s) for s in sentences)
|
||||||
|
print(f"最大序列长度: {max_seq_len}")
|
||||||
|
sequences = tokenizer.texts_to_sequences(sentences)
|
||||||
|
X = pad_sequences(sequences, maxlen=max_seq_len)
|
||||||
|
|
||||||
|
# 准备标签
|
||||||
|
label_mapping = {-1: 0, 0: 1, 1: 2}
|
||||||
|
y = df['label'].map(label_mapping).values
|
||||||
|
y = to_categorical(y, num_classes=3)
|
||||||
|
|
||||||
|
# 创建嵌入矩阵
|
||||||
|
print("\n[4/5] 创建嵌入矩阵...")
|
||||||
|
embedding_dim = w2v_model.vector_size
|
||||||
|
embedding_matrix = np.zeros((vocab_size, embedding_dim))
|
||||||
|
|
||||||
|
for word, i in tokenizer.word_index.items():
|
||||||
|
if word in w2v_model.wv:
|
||||||
|
embedding_matrix[i] = w2v_model.wv[word]
|
||||||
|
|
||||||
|
# 划分训练集和测试集
|
||||||
|
X_train, X_test, y_train, y_test = train_test_split(
|
||||||
|
X, y, test_size=0.2, random_state=42, stratify=y
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"训练集: {len(X_train)} 条")
|
||||||
|
print(f"测试集: {len(X_test)} 条")
|
||||||
|
|
||||||
|
# 构建并训练 CNN 模型
|
||||||
|
print("\n[5/5] 训练 CNN 模型...")
|
||||||
|
model = build_cnn_model(vocab_size, embedding_dim, max_seq_len, embedding_matrix)
|
||||||
|
model.summary()  # summary() prints the table itself and returns None
|
||||||
|
|
||||||
|
history = model.fit(
|
||||||
|
X_train, y_train,
|
||||||
|
batch_size=32,
|
||||||
|
epochs=10,
|
||||||
|
validation_split=0.1,
|
||||||
|
verbose=1
|
||||||
|
)
|
||||||
|
|
||||||
|
# 评估模型
|
||||||
|
print("\nEvaluating the model...")
|
||||||
|
y_pred = model.predict(X_test)
|
||||||
|
y_pred_classes = np.argmax(y_pred, axis=1)
|
||||||
|
y_true_classes = np.argmax(y_test, axis=1)
|
||||||
|
|
||||||
|
print("\n分类报告:")
|
||||||
|
print(classification_report(y_true_classes, y_pred_classes, target_names=['负面', '中性', '正面']))
|
||||||
|
print(f"准确率: {accuracy_score(y_true_classes, y_pred_classes):.4f}")
|
||||||
|
|
||||||
|
# 保存模型
|
||||||
|
print("\n保存模型...")
|
||||||
|
os.makedirs('models', exist_ok=True)
|
||||||
|
|
||||||
|
# 保存 Word2Vec 模型
|
||||||
|
w2v_model.save('models/word2vec.model')
|
||||||
|
print("Word2Vec 模型已保存到: models/word2vec.model")
|
||||||
|
|
||||||
|
# 保存 CNN 模型
|
||||||
|
model.save('models/cnn_sentiment.h5')
|
||||||
|
print("CNN 模型已保存到: models/cnn_sentiment.h5")
|
||||||
|
|
||||||
|
# 保存 tokenizer
|
||||||
|
with open('models/tokenizer.json', 'w', encoding='utf-8') as f:
|
||||||
|
f.write(tokenizer.to_json())
|
||||||
|
print("Tokenizer 已保存到: models/tokenizer.json")
|
||||||
|
|
||||||
|
print("\n" + "="*60)
|
||||||
|
print("训练完成!")
|
||||||
|
print("="*60)
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
main()
|
||||||
@@ -0,0 +1,296 @@
|
|||||||
|
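# Sentiment and Topic Analysis Report for Gaming-Stock Guba Forums

**Report date**: 2026-05-28
**Scope**: 完美世界, 三七互娱, 巨人网络, 世纪华通, 昆仑万维, 游族网络, 掌趣科技, 吉比特
**Data source**: Eastmoney Guba (东方财富网股吧)

---

## I. Data Overview

This analysis collected Guba forum data for 8 gaming stocks, 200 posts per stock, for a total of 1,600 valid records.

### Data Collection Method
- Posts were fetched from Eastmoney Guba with a web crawler
- Each record includes the post title, content, publish time, and related fields
- Sentiment was analyzed with the Dalian University of Technology Chinese Affective Lexicon Ontology (大连理工大学中文情感词汇本体)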
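
The lexicon-based scoring listed above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the tiny `POSITIVE` and `NEGATIVE` sets and the `score_tokens` and `label` helpers are assumptions made for the example, and the input is assumed to be already tokenized (the project uses jieba for that step). The score is the number of positive-lexicon hits minus the number of negative-lexicon hits, and its sign decides whether a post counts as positive, negative, or neutral.

```python
# Hypothetical mini-lexicons for illustration only; the real analysis loads
# a much larger lexicon from an Excel file.
POSITIVE = {"涨停", "利好", "盈利"}
NEGATIVE = {"跌停", "亏损", "套牢"}


def score_tokens(tokens):
    """Return (#positive hits) - (#negative hits) for a list of tokens."""
    positive = sum(1 for t in tokens if t in POSITIVE)
    negative = sum(1 for t in tokens if t in NEGATIVE)
    return positive - negative


def label(score):
    """Map a score to positive / negative / neutral by its sign."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    # A pre-tokenized post; each token is counted once per occurrence.
    post = ["今天", "涨停", "利好", "不断", "但是", "套牢"]
    s = score_tokens(post)
    print(s, label(s))
```

Using sets for the lexicons keeps each membership test constant-time, which matters once the lexicon holds tens of thousands of entries.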
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## II. Overall Topic Analysis

### Overall Word Cloud
![Overall word cloud](output/wordcloud_overall.png)

### Overall Topic Keywords
| Rank | Keyword | TF-IDF score |
|------|---------|--------------|
| 1 | 网络 sz | 0.0341 |
| 2 | 巨人 | 0.0235 |
| 3 | 世纪 华通 | 0.0215 |
| 4 | 昆仑 万维 | 0.0215 |
| 5 | 游族 | 0.0215 |
| 6 | 三七 互娱 | 0.0201 |
| 7 | 游戏 | 0.0199 |
| 8 | 掌趣 科技 | 0.0187 |
| 9 | 比特 sh | 0.0183 |
| 10 | 完美 世界 | 0.0174 |
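
For readers unfamiliar with how TF-IDF scores like those in the table arise, here is a minimal pure-Python sketch. It is not the project's `analyze.py`, which is not shown here; the `tf_idf` function, the smoothed IDF formula, and the toy documents are assumptions for illustration. Each document is a list of tokens, and a term's score in a document is its relative frequency multiplied by a smoothed inverse document frequency.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF per document.

    docs: list of token lists. Returns a list of {term: score} dicts.
    IDF uses the smoothed form log((1 + N) / (1 + df)) + 1, an assumption
    chosen here so that terms present in every document still score > 0.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        doc_scores = {}
        for term, count in counts.items():
            tf = count / total
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores


if __name__ == "__main__":
    # Toy tokenized documents, not data from this report.
    docs = [
        ["游戏", "流水", "版本"],
        ["游戏", "分红", "分红"],
        ["游戏", "退市"],
    ]
    for doc_scores in tf_idf(docs):
        top = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
        print(top[:2])
```

Terms that appear in every document (such as the stock names in a single stock's forum) receive the lowest IDF, which is why the rarer, more distinctive words tend to rise to the top of a keyword list.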
|
||||||
|
|
||||||
|
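### Overall Hot Topics
The overall word cloud shows that forum discussion centers on:
1. **Stock names**: the names of the individual stocks are the most frequent topic
2. **Trading activity**: 主力 (major players), 涨停 (limit-up), 下跌 (decline), 出货 (selling off), 股价 (share price), and similar terms
3. **Market sentiment**: 散户 (retail investors), 大盘 (the broad market), 投资 (investment), and similar terms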
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## III. Per-Stock Analysis

### 1. 完美世界 (002624)

#### Word Cloud
![完美世界 word cloud](output/wordcloud_完美世界.png)

#### Keywords
- **异环**: discussion of the game 《异环》
- **流水**: game revenue
- **版本**: game version updates
- **安魂曲**: refers to the game character 《安魂曲》

#### Sentiment
- **Average sentiment score**: 0.99 (second highest)
- **Positive posts**: 110
- **Negative posts**: 21
- **Neutral posts**: 69

**Sentiment tendency**: Very positive. 完美世界 has the second-highest average sentiment score in this analysis, behind 巨人网络 (1.11).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 2. 巨人网络 (002558)

#### Word Cloud
![巨人网络 word cloud](output/wordcloud_巨人网络.png)

#### Keywords
- **补仓**: investors averaging down
- **腰斩**: a steep drop in the share price (roughly halved)
- **跳水**: a rapid fall in the share price
- **兄弟**: "brothers", a common form of address on Guba

#### Sentiment
- **Average sentiment score**: 1.11 (highest)
- **Positive posts**: 115
- **Negative posts**: 20
- **Neutral posts**: 65

**Sentiment tendency**: Very positive. Although negative terms such as "腰斩" and "跳水" appear, overall sentiment remains high.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 3. 三七互娱 (002555)

#### Word Cloud
![三七互娱 word cloud](output/wordcloud_三七互娱.png)

#### Keywords
- **分红**: discussion of dividends
- **投资**: discussion of investment strategy
- **智谱**: possibly refers to AI-related business
- **AI**: artificial intelligence topics

#### Sentiment
- **Average sentiment score**: 0.77
- **Positive posts**: 72
- **Negative posts**: 39
- **Neutral posts**: 89

**Sentiment tendency**: Positive.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 4. 游族网络 (002174)

#### Word Cloud
![游族网络 word cloud](output/wordcloud_游族网络.png)

#### Keywords
- **三体**: discussion of the 《三体》 IP
- **死刑**, **执行**: discussion related to the poisoning case
- **CEO**, **林奇**: related to company executives
- **投毒**: recollection of a past event

#### Sentiment
- **Average sentiment score**: 0.68
- **Positive posts**: 73
- **Negative posts**: 28
- **Neutral posts**: 99

**Sentiment tendency**: Positive. Despite the negative past event, current sentiment is fairly good.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 5. 世纪华通 (002602)

#### Word Cloud
![世纪华通 word cloud](output/wordcloud_世纪华通.png)

#### Keywords
- **调整**: share price correction
- **拉升**: share price rally
- **索赔**: possibly refers to investor compensation claims
- **看好**: a bullish market view

#### Sentiment
- **Average sentiment score**: 0.48
- **Positive posts**: 63
- **Negative posts**: 36
- **Neutral posts**: 101

**Sentiment tendency**: Neutral, leaning positive.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 6. 昆仑万维 (300418)

#### Word Cloud
![昆仑万维 word cloud](output/wordcloud_昆仑万维.png)

#### Keywords
- **解禁**: expiry of share lock-ups
- **员工**: employee shareholding and related topics
- **短剧**: the short-drama business
- **模型**: AI models

#### Sentiment
- **Average sentiment score**: 0.30
- **Positive posts**: 61
- **Negative posts**: 49
- **Neutral posts**: 90

**Sentiment tendency**: Neutral, leaning positive.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 7. 掌趣科技 (300315)

#### Word Cloud
![掌趣科技 word cloud](output/wordcloud_掌趣科技.png)

#### Keywords
- **创业板**: related to the ChiNext board
- **退市**: discussion of delisting risk
- **垃圾**: a negative judgment ("garbage")
- **解套**: investors hoping to recover losses on trapped positions

#### Sentiment
- **Average sentiment score**: 0.05
- **Positive posts**: 44
- **Negative posts**: 47
- **Neutral posts**: 109

**Sentiment tendency**: Neutral. Positive and negative sentiment are roughly balanced.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 8. 吉比特 (603444)

#### Word Cloud
![吉比特 word cloud](output/wordcloud_吉比特.png)

#### Keywords
- **分红**: discussion of dividends
- **业绩**: discussion of earnings
- **价值投资**: value investing as a philosophy
- **恶心**: an expression of negative sentiment ("disgusting")

#### Sentiment
- **Average sentiment score**: 0.05
- **Positive posts**: 50
- **Negative posts**: 65
- **Neutral posts**: 85

**Sentiment tendency**: Neutral, leaning negative. Negative posts outnumber positive posts.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## IV. Sentiment Analysis Summary

### Sentiment Score Comparison
![Sentiment score comparison](sentiment_output/sentiment_scores.png)

### Sentiment Distribution
![Sentiment distribution](sentiment_output/sentiment_distribution.png)

### Emotion Type Distribution
![Emotion type distribution](sentiment_output/emotion_types.png)

### Ranking of Stocks by Sentiment Score

| Rank | Stock | Code | Average sentiment score | Sentiment tendency |
|------|-------|------|-------------------------|--------------------|
| 1 | 巨人网络 | 002558 | 1.11 | 🔵 Very positive |
| 2 | 完美世界 | 002624 | 0.99 | 🔵 Very positive |
| 3 | 三七互娱 | 002555 | 0.77 | 🟢 Positive |
| 4 | 游族网络 | 002174 | 0.68 | 🟢 Positive |
| 5 | 世纪华通 | 002602 | 0.48 | 🟡 Neutral, leaning positive |
| 6 | 昆仑万维 | 300418 | 0.30 | 🟡 Neutral, leaning positive |
| 7 | 掌趣科技 | 300315 | 0.05 | 🟡 Neutral |
| 8 | 吉比特 | 603444 | 0.05 | 🟡 Neutral, leaning negative |
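
A ranking like the one above can be produced by averaging the per-post scores for each stock and sorting. The sketch below is illustrative only: the `summarize` helper and the sample scores are made up for the example, and it does not attempt to reproduce the tendency labels, whose thresholds are not stated in this report.

```python
from statistics import mean


def summarize(scores_by_stock):
    """Build per-stock summaries sorted by average score, highest first.

    scores_by_stock: {stock_name: [per-post sentiment scores]}.
    Assumes every stock has at least one score (mean() rejects empty lists).
    """
    rows = []
    for name, scores in scores_by_stock.items():
        rows.append({
            "stock": name,
            "posts": len(scores),
            "avg": round(mean(scores), 4),
            # A post is positive / negative / neutral by the sign of its score.
            "positive": sum(1 for s in scores if s > 0),
            "negative": sum(1 for s in scores if s < 0),
            "neutral": sum(1 for s in scores if s == 0),
        })
    rows.sort(key=lambda r: r["avg"], reverse=True)
    return rows


if __name__ == "__main__":
    # Made-up sample scores, not data from this report.
    sample = {
        "stock_a": [2, 0, -1, 3],
        "stock_b": [0, 0, 1, -2],
    }
    for rank, row in enumerate(summarize(sample), start=1):
        print(rank, row)
```

Note that the average and the post counts can point in different directions: a stock may have more negative posts than positive ones and still show a slightly positive average if its positive posts carry larger scores.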
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
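## V. Conclusions and Recommendations

### Key Findings

1. **Sentiment distribution**:
   - Overall, sentiment in the gaming-stock forums is mostly neutral or positive
   - 巨人网络 and 完美世界 show the most positive sentiment
   - 吉比特 and 掌趣科技 show comparatively low sentiment

2. **Topic characteristics**:
   - Discussion for each stock revolves mainly around its own business and share price
   - Discussion of 完美世界 and 巨人网络 contains more game-related content
   - 游族网络 still attracts considerable discussion of past events

3. **Hot topics**:
   - Price action: 涨停 (limit-up), 下跌 (decline), 出货 (selling off), 拉升 (rally)
   - Investor behavior: 补仓 (averaging down), 解套 (recovering losses), 分红 (dividends)
   - Industry themes: AI, 短剧 (short dramas), 游戏流水 (game revenue)

### Investment Suggestions (for reference only)

1. **Sentiment leaders**:
   - Forum sentiment for 完美世界 and 巨人网络 is the most positive, so these two merit close attention
   - Follow the progress of their game businesses and their earnings

2. **Risk notes**:
   - Sentiment for 吉比特 and 掌趣科技 is comparatively low, so watch for risk
   - Past events still weigh on 游族网络 to some extent

3. **Continue to monitor**:
   - 昆仑万维's AI and short-drama businesses
   - 三七互娱's dividend and investment strategy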
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Appendix

### Data Files

- `data/`: raw crawled data (JSON and Excel formats)
- `output/`: TF-IDF analysis results and word cloud images
- `sentiment_output/`: sentiment analysis results and visualization images

### Analysis Tools

- **Crawler**: Python + Requests
- **Tokenization**: jieba
- **Sentiment lexicon**: Dalian University of Technology Chinese Affective Lexicon Ontology (大连理工大学中文情感词汇本体)
- **Visualization**: Matplotlib + WordCloud

---

**Report generated**: 2026-05-28
**Analysis tooling**: custom Python scripts
|
||||||
Binary file not shown.