Data Processing with Pandas

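Pandas is a core Python library for data analysis. It provides two main data structures, Series and DataFrame, and is one of the most widely used tools in quantitative investing.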

Series: a one-dimensional data structure

import pandas as pd
import numpy as np

# Create a Series
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Data alignment (one of Pandas' core features)
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
print(s1 + s2)  # Aligned on the index automatically; labels without a match become NaN
# a    NaN
# b    6.0
# c    8.0
# d    NaN

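# Handling missing values
s = pd.Series([1, None, 3, np.nan, 5])
print(s.isnull())    # Detect missing values
print(s.dropna())    # Drop missing values
print(s.fillna(0))   # Fill missing values with 0
print(s.ffill())     # Forward fill (fillna(method='ffill') is deprecated since pandas 2.1)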
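
Index alignment produces NaN wherever a label exists on only one side. As a minimal sketch of one way to handle that, the block below uses `Series.add` with `fill_value=0`, and also shows that forward filling cannot repair a leading gap. The variable names (`plain`, `filled`, `gappy`, `forward`) are illustrative and not part of the original tutorial.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

# Plain + leaves NaN wherever a label exists on only one side.
plain = s1 + s2

# add(fill_value=0) treats a label missing on one side as 0 before adding.
filled = s1.add(s2, fill_value=0)
print(filled)

# ffill() carries the last valid observation forward;
# a leading NaN has nothing before it, so it stays NaN.
gappy = pd.Series([None, 1.0, None, 3.0])
forward = gappy.ffill()
print(forward)
```

Whether 0 is a sensible fill value depends on the data: it may suit missing cash flows or volumes, but it is rarely appropriate for prices.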

DataFrame: a two-dimensional data structure

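The DataFrame is the central data container in quantitative investing: each column holds one field (such as a price or a volume) and each row holds one point in time.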

# Create a DataFrame of stock data
df = pd.DataFrame({
    'open':   [100, 101, 102, 103],
    'high':   [105, 106, 107, 108],
    'low':    [98,  99,  100, 101],
    'close':  [103, 104, 105, 106],
    'volume': [10000, 12000, 11000, 9500]
}, index=pd.date_range('2024-01-01', periods=4))

# Common attributes
print(df.shape)      # (4, 5): 4 rows, 5 columns
print(df.columns)    # Column names
print(df.index)      # Index
print(df.dtypes)     # Data type of each column
print(df.describe()) # Summary statistics
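
Because each row is one bar and each column one field, row-wise checks and column arithmetic are both one-liners. The sketch below rebuilds the sample frame, verifies that every bar is internally consistent, and adds a derived column. The names `valid` and `range` are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'open':   [100, 101, 102, 103],
    'high':   [105, 106, 107, 108],
    'low':    [98,  99,  100, 101],
    'close':  [103, 104, 105, 106],
    'volume': [10000, 12000, 11000, 9500]
}, index=pd.date_range('2024-01-01', periods=4))

# Row-wise sanity check: the high must be the largest price of the bar
# and the low the smallest.
prices = df[['open', 'high', 'low', 'close']]
valid = (df['high'] == prices.max(axis=1)) & (df['low'] == prices.min(axis=1))
print(valid.all())

# Column arithmetic is vectorized: one expression computes every row.
df['range'] = df['high'] - df['low']   # intraday range per bar
print(df[['high', 'low', 'range']])
```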

Indexing and slicing

# Select by column name
print(df['close'])           # Series
print(df[['open', 'close']]) # DataFrame

# Label-based indexing with .loc[]
print(df.loc['2024-01-01'])                    # Single row
print(df.loc['2024-01-01':'2024-01-03'])       # Row range (end label included)
print(df.loc[:, ['open', 'close']])             # All rows, selected columns

# Position-based indexing with .iloc[]
print(df.iloc[0])       # First row
print(df.iloc[:2])      # First two rows
print(df.iloc[:, 3])    # Fourth column (close)

# Boolean indexing
print(df[df['close'] > 104])    # Close above 104
print(df[df['volume'] > 10000]) # Volume above 10000
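
Two details of the selectors above are easy to get wrong: combining several boolean conditions, and the different end-point rules of `.loc` and `.iloc`. The following is a small sketch on the same sample data; `mask`, `busy_up`, `by_label`, and `by_position` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'open':   [100, 101, 102, 103],
    'high':   [105, 106, 107, 108],
    'low':    [98,  99,  100, 101],
    'close':  [103, 104, 105, 106],
    'volume': [10000, 12000, 11000, 9500]
}, index=pd.date_range('2024-01-01', periods=4))

# Combine conditions with & (and) or | (or); each condition needs its own
# parentheses because & binds more tightly than the comparison operators.
mask = (df['close'] > 103) & (df['volume'] > 10000)
busy_up = df.loc[mask, ['close', 'volume']]
print(busy_up)

# .loc slices by label and includes the end label;
# .iloc slices by position and excludes the end position.
by_label = df.loc['2024-01-01':'2024-01-03']
by_position = df.iloc[0:3]
print(by_label.equals(by_position))
```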

Time series processing

# Create a time series
dates = pd.date_range('2024-01-01', periods=100, freq='B')  # B = business day (weekdays only; exchange holidays are not excluded)
ts = pd.Series(np.random.randn(100).cumsum(), index=dates)

# Filter by time
print(ts['2024-01'])             # Data for January 2024
print(ts['2024-01':'2024-03'])   # Q1 2024

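# Resampling
weekly = ts.resample('W').last()  # Weekly series (last value of each week)
monthly = ts.resample('ME').last()  # Monthly series ('ME' = month end, pandas 2.2+; use 'M' on older versions)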

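# Rolling windows (a staple of technical analysis)
ma5 = ts.rolling(window=5).mean()     # 5-day moving average
ma20 = ts.rolling(window=20).mean()   # 20-day moving average
vol = ts.rolling(window=20).std()     # 20-day rolling standard deviation (a volatility proxy; usually computed on returns)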
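
Resampling with `.last()` keeps only one value per period. As a sketch of a common alternative, the block below aggregates daily closes into weekly bars with several statistics at once, and shows how `min_periods` changes the warm-up behaviour of a rolling window. The seeded random walk and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
dates = pd.date_range('2024-01-01', periods=20, freq='B')
close = pd.Series(100 + rng.standard_normal(20).cumsum(), index=dates)

# Turn daily closes into weekly bars: first, max, min, and last value of each week.
weekly = close.resample('W').agg(['first', 'max', 'min', 'last'])
print(weekly)

# A rolling window yields NaN until it holds `window` observations;
# min_periods=1 lets it start with whatever data is available.
ma5 = close.rolling(window=5).mean()
ma5_partial = close.rolling(window=5, min_periods=1).mean()
print(ma5.isna().sum(), ma5_partial.isna().sum())
```

Partial windows avoid the leading NaN values, but the early averages are based on fewer observations and are therefore noisier.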

Financial data operations

# Compute returns
df['return'] = df['close'].pct_change()  # Daily simple return
df['log_return'] = np.log(df['close'] / df['close'].shift(1))  # Daily log return

# Compute the cumulative return (compounded growth since the first bar, minus 1)
df['cum_return'] = (1 + df['return']).cumprod() - 1

# Rolling statistics (NaN until 20 rows are available, so the 4-row sample above yields only NaN)
df['ma20'] = df['close'].rolling(20).mean()
df['std20'] = df['close'].rolling(20).std()
df['max20'] = df['close'].rolling(20).max()
df['min20'] = df['close'].rolling(20).min()

# Rank returns (1 = largest gain)
df['return_rank'] = df['return'].rank(ascending=False)

# shift (fetch the previous day's value)
df['prev_close'] = df['close'].shift(1)
df['price_change'] = df['close'] - df['prev_close']
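
The return columns above are the usual starting point for risk measures. The sketch below, on a small hand-made price series, checks that log returns add up to the log of the total growth factor, computes a return-based rolling volatility, and derives the maximum drawdown. The prices, the 3-day window, and the 252-trading-day annualization factor are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

close = pd.Series([100.0, 102.0, 99.0, 104.0, 101.0, 107.0],
                  index=pd.date_range('2024-01-01', periods=6, freq='B'))

ret = close.pct_change()
log_ret = np.log(close / close.shift(1))

# Log returns are additive: their sum equals the log of the total growth factor.
total_growth = close.iloc[-1] / close.iloc[0]
print(np.isclose(log_ret.sum(), np.log(total_growth)))

# Rolling volatility computed on returns, annualized with sqrt(252)
# (252 trading days per year is a convention, assumed here).
ann_vol = ret.rolling(window=3).std() * np.sqrt(252)

# Drawdown: distance of each close from the running peak;
# the most negative value is the maximum drawdown.
drawdown = close / close.cummax() - 1
max_drawdown = drawdown.min()
print(max_drawdown)
```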

File I/O

# Read data
df = pd.read_csv('stock_data.csv', index_col=0, parse_dates=True)
df = pd.read_excel('stock_data.xlsx', index_col=0)  # Requires an Excel engine such as openpyxl

# Save data
df.to_csv('output.csv')
df.to_excel('output.xlsx')       # Requires an Excel engine such as openpyxl
df.to_parquet('output.parquet')  # Efficient binary format (requires pyarrow or fastparquet)
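
To see what `index_col=0` and `parse_dates=True` do on the way back in, the sketch below round-trips a small frame through CSV. It writes to an in-memory `io.StringIO` buffer rather than a file, so it runs without touching disk; the sample frame and the index name `date` are assumptions.

```python
import io
import pandas as pd

df = pd.DataFrame(
    {'close': [103.0, 104.0, 105.0], 'volume': [10000, 12000, 11000]},
    index=pd.date_range('2024-01-01', periods=3, name='date'),
)

# Write to an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the first column as the index;
# parse_dates=True parses that index as datetimes instead of plain strings.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(type(restored.index).__name__)
print(restored)
```

CSV is plain text, so column types are inferred again on every read. A binary format such as Parquet stores the types alongside the data.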

