Making Money with Machine Learning

2025-06-08

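You have $1,000 to play with. You decide to invest it in the stock market, specifically in Tesla.
Let's see if we can use machine learning to optimize our returns.

Download TSLA.csv from here: https://www.kaggle.com/timoboz/tesla-stock-data-from-2010-to-2020/data

Let's get started.

Let's start with some exploratory data analysis (EDA) on the file we got.

The file is a comma-separated values (CSV) file with 7 columns.

The columns are:

  • Date
  • Open (opening price)
  • High (the day's high)
  • Low (the day's low)
  • Close (closing price)
  • Adj Close (adjusted close, which accounts for splits and the like)
  • Volume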
# Importing pandas. "pandas is a fast, powerful, flexible and easy to use open source data
# analysis and manipulation tool, built on top of the Python programming language."
import pandas as pd
pd.options.display.max_rows = 30
# Read in the CSV, save it to a pandas dataframe variable called 'tsla_data'.
tsla_data = pd.read_csv("TSLA.csv")
# .head() gives us the first 5 rows of the data frame.
# You can also pass .head() a parameter to return any number of rows. Like .head(10) for 10 rows.
tsla_data.head()
         Date       Open   High        Low      Close  Adj Close    Volume
0  2010-06-29  19.000000  25.00  17.540001  23.889999  23.889999  18766300
1  2010-06-30  25.790001  30.42  23.299999  23.830000  23.830000  17187100
2  2010-07-01  25.000000  25.92  20.270000  21.959999  21.959999   8218800
3  2010-07-02  23.000000  23.10  18.709999  19.200001  19.200001   5139800
4  2010-07-06  20.000000  20.00  15.830000  16.110001  16.110001   6866900
# .shape tells us the number of rows, and the number of columns.
# This dataset has 2416 rows, and 7 columns.
# The NYSE and NASDAQ average about 253 trading days a year. 
# This is from 365.25 (days on average per year) * 5/7 (proportion work days per week) 
# - 6 (weekday holidays) - 3*5/7 (fixed date holidays) = 252.75 ≈ 253.
# 10 * 253 = 2530, this dataset is pretty close. Let's assume it's not missing any days.
tsla_data.shape
(2416, 7)
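
The trading-day estimate in the comment above can be checked with a few lines of arithmetic. This is only a sketch of that back-of-the-envelope calculation; the variable names are made up for illustration, and the holiday counts are the ones stated in the comment.

```python
# Rough estimate of US trading days per year, following the comment above.
days_per_year = 365.25        # average calendar days per year
weekday_share = 5 / 7         # share of days that are weekdays
weekday_holidays = 6          # holidays that always fall on a weekday
fixed_date_holidays = 3       # fixed-date holidays, which sometimes land on a weekend

trading_days = (days_per_year * weekday_share
                - weekday_holidays
                - fixed_date_holidays * weekday_share)
print(round(trading_days, 2))  # 252.75

# How many trading years do the file's 2416 rows correspond to?
years_covered = 2416 / trading_days
print(round(years_covered, 2))  # 9.56
```

The file starts on 2010-06-29, so the first year is only partial, which is why the row count lands a little under ten full trading years.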

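This is about 10 years of data, with stock information starting in 2010.

To save time, let's make some assumptions. We're not hedge fund managers yet.

Assumptions

  • We can only place one order per day (buy or sell), and the order is for everything we hold.
  • If we place an order, we assume it fills at that day's adjusted close.
  • We start with $1,000.

We'll track a couple of key pieces of information.

  • Money in wallet
  • Number of stocks held

Let's start with 2010 and see how much money we'd make if we started with $1,000 on the first day in this file.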

# We're going to pull just the 2010 data. I like keeping the years in a variable, as a list,
# because we'll likely do this again, and for multiple years.
years_to_pull = [2010]

# Let's tell pandas to treat the 'Date' column as a date.
tsla_data['Date'] = pd.to_datetime(tsla_data['Date'])

# Let's make a function for re-use
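def pull_data_by_year(tsla_data, years_to_pull):
  # .copy() returns an independent dataframe, so adding columns to it later
  # does not trigger pandas' SettingWithCopyWarning.
  tsla_data_by_year = tsla_data[tsla_data['Date'].dt.year.isin(years_to_pull)].copy()
  return tsla_data_by_year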

tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)
tsla_data_by_year.shape
(130, 7)
# Sort by date ASC
tsla_data_by_year = tsla_data_by_year.sort_values(by = 'Date')

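Let's add a few columns to help us work with the data. I want to see tomorrow's adjusted close and know whether it's higher than today's adjusted close.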

# .shift(-1) brings the next row into the equation, so that we can add a column that  
# shows tomorrow's adjusted close.
tsla_data_by_year["Adj Close Tomorrow"] = tsla_data_by_year["Adj Close"].shift(-1)
# This adds another column as a bool to quickly show if the stock goes up or down tomorrow.
tsla_data_by_year["Stock Goes Up Tomorrow"] = tsla_data_by_year["Adj Close"] < tsla_data_by_year["Adj Close Tomorrow"]
# Let's look at the first 10 rows to see if this looks correct.
tsla_data_by_year.head(10)
         Date       Open       High        Low      Close  Adj Close    Volume  Adj Close Tomorrow  Stock Goes Up Tomorrow
0  2010-06-29  19.000000  25.000000  17.540001  23.889999  23.889999  18766300           23.830000                   False
1  2010-06-30  25.790001  30.420000  23.299999  23.830000  23.830000  17187100           21.959999                   False
2  2010-07-01  25.000000  25.920000  20.270000  21.959999  21.959999   8218800           19.200001                   False
3  2010-07-02  23.000000  23.100000  18.709999  19.200001  19.200001   5139800           16.110001                   False
4  2010-07-06  20.000000  20.000000  15.830000  16.110001  16.110001   6866900           15.800000                   False
5  2010-07-07  16.400000  16.629999  14.980000  15.800000  15.800000   6921700           17.459999                    True
6  2010-07-08  16.139999  17.520000  15.570000  17.459999  17.459999   7711400           17.400000                   False
7  2010-07-09  17.580000  17.900000  16.549999  17.400000  17.400000   4050600           17.049999                   False
8  2010-07-12  17.950001  18.070000  17.000000  17.049999  17.049999   2202500           18.139999                    True
9  2010-07-13  17.389999  18.639999  16.900000  18.139999  18.139999   2680100           19.840000                    True
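
To see what `.shift(-1)` does in isolation, here is a small sketch on a toy dataframe. The four prices are the adjusted closes of rows 4 to 7 in the table above; the dataframe itself is made up for illustration.

```python
import pandas as pd

# Adjusted closes of rows 4 to 7 from the table above.
df = pd.DataFrame({"Adj Close": [16.110001, 15.800000, 17.459999, 17.400000]})

# shift(-1) moves every value up one row, so each row sees the next day's close.
df["Adj Close Tomorrow"] = df["Adj Close"].shift(-1)
df["Stock Goes Up Tomorrow"] = df["Adj Close"] < df["Adj Close Tomorrow"]

print(df)
# The last row has no "tomorrow": its Adj Close Tomorrow is NaN,
# and any comparison with NaN evaluates to False.
```

Note the consequence for the last row of any dataframe built this way: it is always labeled False, simply because there is no next day to compare against.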

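Following the rule of buy low, sell high, and looking at the historical data, we can work out the following.

Start by picking the first day whose next-day close is higher, and buy $1,000 worth of stock that day.

We'll have 3 positions.

Buy
Sell
Hold

In code:

haveNoStock && !goesUpTomorrow = Hold

haveNoStock && goesUpTomorrow = Buy

haveStock && !goesUpTomorrow = Sell

haveStock && goesUpTomorrow = Hold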
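
The four rules above can be written as one small function. This is a minimal sketch; `decide_position` is a hypothetical helper name and is not used in the rest of the article.

```python
def decide_position(have_stock: bool, goes_up_tomorrow: bool) -> str:
    """Return 'Buy', 'Sell' or 'Hold' for one day, following the four rules above."""
    if not have_stock and goes_up_tomorrow:
        return "Buy"
    if have_stock and not goes_up_tomorrow:
        return "Sell"
    # Either we hold nothing and the price falls, or we hold stock and it rises.
    return "Hold"


# Print the full decision table.
for have_stock in (False, True):
    for goes_up in (False, True):
        print(have_stock, goes_up, decide_position(have_stock, goes_up))
```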

# Setting some default values of the new columns. 
# Position can be Hold/Sell/Buy
tsla_data_by_year['Position'] = 'Hold'
tsla_data_by_year['Number Of Stocks Held'] = 0
tsla_data_by_year['Money In Wallet'] = 0
# .at says at row 0, column 'Money in Wallet', save $1000
tsla_data_by_year.at[0, 'Money In Wallet'] = 1000
tsla_data_by_year.head()
         Date       Open   High        Low      Close  Adj Close    Volume  Adj Close Tomorrow  Stock Goes Up Tomorrow  Position  Number Of Stocks Held  Money In Wallet
0  2010-06-29  19.000000  25.00  17.540001  23.889999  23.889999  18766300           23.830000                   False      Hold                      0             1000
1  2010-06-30  25.790001  30.42  23.299999  23.830000  23.830000  17187100           21.959999                   False      Hold                      0                0
2  2010-07-01  25.000000  25.92  20.270000  21.959999  21.959999   8218800           19.200001                   False      Hold                      0                0
3  2010-07-02  23.000000  23.10  18.709999  19.200001  19.200001   5139800           16.110001                   False      Hold                      0                0
4  2010-07-06  20.000000  20.00  15.830000  16.110001  16.110001   6866900           15.800000                   False      Hold                      0                0
# Here's my code for determining if I should buy/sell/hold. 
# We'll put this in a function down the line.
previousRow = ''
for index, row in tsla_data_by_year.iterrows():
  if(index > 0):
    row['Money In Wallet'] = previousRow['Money In Wallet']
    row['Number Of Stocks Held'] = previousRow['Number Of Stocks Held']
  if(row['Number Of Stocks Held'] == 0 and not row['Stock Goes Up Tomorrow']):
    row['Position'] = 'Hold'
    # print(1)
  elif(row['Number Of Stocks Held'] == 0 and row['Stock Goes Up Tomorrow']):
    row['Position'] = 'Buy'
    row['Number Of Stocks Held'] = row['Money In Wallet'] / row['Adj Close']
    row['Money In Wallet'] -= row['Number Of Stocks Held'] * row['Adj Close']
    # print(2)
  elif(row['Number Of Stocks Held'] > 0 and not row['Stock Goes Up Tomorrow']):
    row['Position'] = 'Sell'
    row['Money In Wallet'] += row['Number Of Stocks Held'] * row['Adj Close']
    row['Number Of Stocks Held'] = 0
    # print(3)
  elif(row['Number Of Stocks Held'] > 0 and row['Stock Goes Up Tomorrow']):
    row['Position'] = 'Hold'
    # print(4)
  previousRow = row
  # Write the updated row back into the dataframe (.loc assigns a whole row by label).
  tsla_data_by_year.loc[index] = row
# Round each number to 2 decimal places.
tsla_data_by_year = tsla_data_by_year.round(2)
# Let's look at the last row to see how much money or stock we have at the end of the year.
tsla_data_by_year.tail(1)
           Date   Open   High   Low  Close  Adj Close   Volume  Adj Close Tomorrow  Stock Goes Up Tomorrow  Position  Number Of Stocks Held  Money In Wallet
129  2010-12-31  26.57  27.25  26.5  26.63      26.63  1417900                 NaN                   False                              0.0          8645.73

If we could see the future, we'd have $8,645 by the end of 2010.
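
Another way to look at this perfect-foresight result: being fully invested on every day that is followed by a higher close means the starting cash is multiplied by the price ratio of each "up" day. The sketch below shows that on made-up prices. It assumes fractional shares and no trading costs, as the simulation above does, and `perfect_foresight_value` is a hypothetical helper name.

```python
def perfect_foresight_value(prices, start_cash=1000.0):
    """Final value when fully invested on every day followed by a higher close."""
    value = start_cash
    for today, tomorrow in zip(prices, prices[1:]):
        if tomorrow > today:
            # We hold stock over this step, so our value grows with the price.
            value *= tomorrow / today
    return value


# Toy prices: two "up" steps, 8 -> 12 (x1.5) and 9 -> 18 (x2.0).
toy_prices = [10.0, 8.0, 12.0, 9.0, 18.0]
print(perfect_foresight_value(toy_prices))  # 3000.0
```

This is an upper bound for the buy/sell/hold rules under these assumptions, since no prediction can do better than always being right.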

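Now let's tackle this with machine learning. The key is to remember that this is a science, and this is an experiment. We need to follow the scientific method.

Our hypothesis: we will try to predict whether the stock goes up or down tomorrow.

When we're trying to predict one of two possible outcomes, it's called binary classification.

Our goal is, for a given row, to predict the 'Stock Goes Up Tomorrow' column. But all we have is today's high/low and prices. That doesn't tell us much, so we need to analyze trends in the historical prices. For that, we use technical indicators.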

## Add bollinger bands
## To learn more about bollinger bands: https://www.investopedia.com/terms/b/bollingerbands.asp
import matplotlib.pyplot as plt

tsla_data_by_year['30 Day MA'] = tsla_data_by_year['Adj Close'].rolling(window=30).mean()
tsla_data_by_year['30 Day STD'] = tsla_data_by_year['Adj Close'].rolling(window=30).std() 

tsla_data_by_year['Upper Band'] = tsla_data_by_year['30 Day MA'] + (tsla_data_by_year['30 Day STD'] * 2)
tsla_data_by_year['Lower Band'] = tsla_data_by_year['30 Day MA'] - (tsla_data_by_year['30 Day STD'] * 2)

# Simple 30 Day Bollinger Band for Tesla
tsla_data_by_year[['Adj Close', '30 Day MA', 'Upper Band', 'Lower Band']].plot(figsize=(12,6))
plt.title('30 Day Bollinger Band for Tesla')
plt.ylabel('Price (USD)')
plt.show()

# This plot will show us the adjusted close, the rolling average, and the upper  
# and lower bands of the TSLA stock.

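[Figure: 30 Day Bollinger Band for Tesla, showing Adj Close, 30 Day MA, Upper Band and Lower Band; y-axis: Price (USD)]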

# Since we used a 30 day moving average, the first 29 rows do not have  
# bollinger bands information.
# We use dropna() to drop the nulls. This also drops the last row,
# whose 'Adj Close Tomorrow' is NaN.
tsla_data_by_year = tsla_data_by_year.dropna()
tsla_data_by_year.head()
          Date   Open   High    Low  Close  Adj Close   Volume  Adj Close Tomorrow  Stock Goes Up Tomorrow  Position  Number Of Stocks Held  Money In Wallet  30 Day MA  30 Day STD  Upper Band  Lower Band
29  2010-08-10  19.65  19.65  18.82  19.03      19.03  1281300               17.90                   False      Hold                   0.00          1660.38  20.041333    1.937226   23.915786   16.166880
30  2010-08-11  18.69  18.88  17.85  17.90      17.90   797600               17.60                   False      Hold                   0.00          1660.38  19.841667    1.832744   23.507156   16.176178
31  2010-08-12  17.80  17.90  17.39  17.60      17.60   691000               18.32                    True       Buy                  94.34             0.00  19.634000    1.714383   23.062765   16.205235
32  2010-08-13  18.18  18.45  17.66  18.32      18.32   634000               18.78                    True      Hold                  94.34             0.00  19.512667    1.672380   22.857427   16.167907
33  2010-08-16  18.45  18.80  18.26  18.78      18.78   485800               19.15                    True      Hold                  94.34             0.00  19.498667    1.676840   22.852346   16.144987
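
The table above starts at index 29 because a rolling window of 30 needs 30 values before it can produce a result. The same effect is easier to see with a smaller window. This is a sketch on made-up prices with `window=3`; only the pandas calls match the article's code.

```python
import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# With window=3, the first 2 rows do not have a full window yet, so they are NaN.
ma = prices.rolling(window=3).mean()
std = prices.rolling(window=3).std()

# Bollinger bands: moving average plus/minus two rolling standard deviations.
upper = ma + 2 * std
lower = ma - 2 * std

print(pd.DataFrame({"price": prices, "ma": ma, "upper": upper, "lower": lower}))
print(int(ma.isna().sum()))  # 2
```

In general a window of size n leaves the first n - 1 rows empty, which is 29 rows for the 30-day window used here.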
# Some fantastical python. 

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
import numpy as np
from matplotlib import pyplot, dates

# Here we are saying we want to predict the column 'Stock Goes Up Tomorrow' by 
#  storing the column name in a variable.
predict = 'Stock Goes Up Tomorrow'
X = tsla_data_by_year
# Treat the date as a number (e.g. 20100930).
X['Date'] = X['Date'].dt.strftime('%Y%m%d').astype(int)

# For each column, apply a LabelEncoder. Regression models need numerical inputs,
#  so text columns must be encoded as numbers.
# With columns like 'Position', we need to apply a LabelEncoding  
# to set 1 = Hold, 2 = Buy, 3 = Sell  
# (this is an example, the LabelEncoder will determine  
# the numerical values of the categories at runtime.)
for column in X.columns:
  if column != 'Date':
    # Only text (object) columns need encoding.
    if X[column].dtype == object:
        le = LabelEncoder()
        X[column] = le.fit_transform(X[column])

# Set the y dataset to just the single column we want to predict.
y = tsla_data_by_year[predict]

# Set the X dataset (what we will use to predict), to all the columns mentioned.
# .copy() makes X an independent dataframe, so columns can be added to it later.
X = tsla_data_by_year[['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', '30 Day MA', '30 Day STD', 'Upper Band', 'Lower Band']].copy()

# This is used to stratify. Learn more here: https://en.wikipedia.org/wiki/Stratified_sampling
targets = tsla_data_by_year[predict]

# This splits the dataset into training and testing. 60% of the data will be  
# used to train, 40% will be used to test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=101, stratify=targets)
X_test.head()
         Date   Open   High    Low  Close  Adj Close   Volume  30 Day MA  30 Day STD  Upper Band  Lower Band
 65  20100930  22.00  22.15  20.19  20.41      20.41  2195800  20.382333    0.786847   21.956028   18.808639
106  20101129  35.41  35.95  33.33  34.33      34.33  1145600  26.106333    5.340324   36.786982   15.425684
 57  20100920  20.67  21.35  20.16  21.06      21.06   947500  19.866667    1.058318   21.983302   17.750031
 99  20101117  30.20  30.75  28.61  29.49      29.49   750000  23.079667    3.586195   30.252057   15.907276
 90  20101104  22.60  25.33  22.15  24.90      24.90  1874000  20.957667    0.907603   22.772872   19.142461
from sklearn import ensemble

# Hyperparameters for the GradientBoostingRegressor algorithm. More on this much later.
# 'squared_error' is the least-squares loss (named 'ls' in older scikit-learn releases).
params = {'n_estimators': 100, 'max_depth': 7, 'min_samples_split': 3,
          'learning_rate': 0.01, 'loss': 'squared_error'}
clf = ensemble.GradientBoostingRegressor(**params)

# Fit the model with the training data.
clf.fit(X_train, y_train)

# Use the trained model to predict the testing dataset.
y_pred_original = clf.predict(X_test)
from sklearn.metrics import (confusion_matrix, precision_score, recall_score, f1_score, classification_report)
y_pred = y_pred_original > .5
y_pred = pd.DataFrame(y_pred)
y_pred = y_pred * 1
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# The confusion matrix will show us True/False Positives, True/False Negatives.
# This dataset is really small to get an accurate reading of the score,  
# but so far it looks like we're close to 50% accurate.
[[12  7]
 [14  7]]
              precision    recall  f1-score   support

       False       0.46      0.63      0.53        19
        True       0.50      0.33      0.40        21

    accuracy                           0.48        40
   macro avg       0.48      0.48      0.47        40
weighted avg       0.48      0.47      0.46        40

We're doing a little worse than a coin flip! Let's see how that plays out.
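
The numbers in the classification report can be re-derived by hand from the confusion matrix above. The sketch below does that for the True class, using scikit-learn's layout in which rows are the actual labels and columns are the predicted labels; the variable names are made up for illustration.

```python
# Confusion matrix from the output above: rows are actual, columns are predicted.
tn, fp = 12, 7   # actual False: predicted False / predicted True
fn, tp = 14, 7   # actual True:  predicted False / predicted True

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_true = tp / (tp + fp)   # of the days we predicted "up", how many went up
recall_true = tp / (tp + fn)      # of the days that went up, how many we caught
f1_true = 2 * precision_true * recall_true / (precision_true + recall_true)

print(f"accuracy={accuracy:.3f}")  # accuracy=0.475
print(f"precision={precision_true:.2f} recall={recall_true:.2f} f1={f1_true:.2f}")
# precision=0.50 recall=0.33 f1=0.40
```

An accuracy of 19 correct out of 40 is where the "slightly worse than a coin flip" reading comes from.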

predictions = clf.predict(X)
# The values of predictions are stored as a value from 0.00 to 1.00, but we need them 
#  as a true/false to work with our algorithm to calculate $$. Here I compare to .5 (threshold)  
# to determine if the prediction is true or false.
# You can manually adjust the threshold to trade false positives against false negatives,  
# which can help when you're trying to improve a particular metric.
predictions = predictions > .5
X['Stock Goes Up Tomorrow'] = predictions
# Same code as above, functionized. Use a dataset to determine  
# how much money we'll have made with our trades.
def howMuchMoneyDidWeMake(X):
  if('Money In Wallet' not in X ):
    X['Position'] = 'Hold'
    X['Number Of Stocks Held'] = 0
    X['Money In Wallet'] = 0
    X.at[0, 'Money In Wallet'] = 1000

  previousRow = ''
  for index, row in X.iterrows():
    # print(row)
    if(index > 0):
      row['Money In Wallet'] = previousRow['Money In Wallet']
      row['Number Of Stocks Held'] = previousRow['Number Of Stocks Held']
    if(row['Number Of Stocks Held'] == 0 and not row['Stock Goes Up Tomorrow']):
      row['Position'] = 'Hold'
      # print(1)
    elif(row['Number Of Stocks Held'] == 0 and row['Stock Goes Up Tomorrow']):
      row['Position'] = 'Buy'
      row['Number Of Stocks Held'] = row['Money In Wallet'] / row['Adj Close']
      row['Money In Wallet'] -= row['Number Of Stocks Held'] * row['Adj Close']
      # print(2)
    elif(row['Number Of Stocks Held'] > 0 and not row['Stock Goes Up Tomorrow']):
      row['Position'] = 'Sell'
      row['Money In Wallet'] += row['Number Of Stocks Held'] * row['Adj Close']
      row['Number Of Stocks Held'] = 0
      # print(3)
    elif(row['Number Of Stocks Held'] > 0 and row['Stock Goes Up Tomorrow']):
      row['Position'] = 'Hold'
      # print(4)
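    previousRow = row
    # Write the updated row back into the dataframe.
    X.loc[index] = row

  # Round each number to 2 decimal places once, after the loop.
  X = X.round(2)
  return X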
X = X.reset_index()
X = howMuchMoneyDidWeMake(X)
X.tail(1)
    index      Date  Open  High    Low  Close  Adj Close   Volume  30 Day MA  30 Day STD  Upper Band  Lower Band  Stock Goes Up Tomorrow  Position  Number Of Stocks Held  Money In Wallet
99    128  20101230  27.7  27.9  26.38   26.5       26.5  2041100      31.28        2.63       36.55       26.02                   False      Hold                    0.0          2917.96

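Not bad! We nearly tripled our money. We ended up with $2,917.

But there's a rookie mistake here. We measured success on the same data we used to train the algorithm. We need a different dataset to test on. Let's try again with the next year.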


# Let's pull 2011 data.
years_to_pull = [2011]
tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)

def addBollingerBands(df):
  df['30 Day MA'] = df['Adj Close'].rolling(window=30).mean()
  df['30 Day STD'] = df['Adj Close'].rolling(window=30).std() 
  df['Upper Band'] = df['30 Day MA'] + (df['30 Day STD'] * 2)
  df['Lower Band'] = df['30 Day MA'] - (df['30 Day STD'] * 2)
  df = df.dropna()
  return df

tsla_data_by_year = addBollingerBands(tsla_data_by_year)
tsla_data_by_year.head()
           Date       Open       High        Low  Close  Adj Close   Volume  30 Day MA  30 Day STD  Upper Band  Lower Band
159  2011-02-14  23.639999  24.139999  23.049999  23.08      23.08  1283100  24.937333    1.720240   28.377814   21.496853
160  2011-02-15  23.010000  23.170000  22.559999  22.84      22.84   953700  24.811333    1.731142   28.273617   21.349049
161  2011-02-16  23.100000  24.969999  23.070000  24.73      24.73  4115100  24.746667    1.695178   28.137023   21.356310
162  2011-02-17  24.629999  25.490000  23.549999  23.60      23.60  2618400  24.639000    1.660516   27.960031   21.317969
163  2011-02-18  23.330000  23.490000  22.959999  23.18      23.18  2370700  24.482333    1.563047   27.608427   21.356240
# Function to add the predicted column to a dataset using a trained classifier
def addPredictedColumn(df, clf):
  df['Date'] = df['Date'].dt.strftime('%Y%m%d')
  # df["Adj Close Tomorrow"] = df["Adj Close"].shift(-1)

  df = df.dropna()

  for column in df.columns:
    if column != 'Date':
      # Only text (object) columns need encoding.
      if df[column].dtype == object:
          le = LabelEncoder()
          df[column] = le.fit_transform(df[column])


  predictions = clf.predict(df)
  predictions = predictions > .5
  df['Stock Goes Up Tomorrow'] = predictions
  return df

tsla_data_by_year = addPredictedColumn(tsla_data_by_year, clf)
tsla_data_by_year = tsla_data_by_year.reset_index()
tsla_data_by_year = howMuchMoneyDidWeMake(tsla_data_by_year)
tsla_data_by_year.tail(1)
     index      Date   Open   High    Low  Close  Adj Close  Volume  30 Day MA  30 Day STD  Upper Band  Lower Band  Stock Goes Up Tomorrow  Position  Number Of Stocks Held  Money In Wallet
222    381  20111230  28.49  28.98  28.25  28.56      28.56  339800      30.66        2.33       35.33       25.99                   False      Hold                    0.0           1652.3

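If we use the model trained on 2010 to predict the 2011 data, we end up with about $1,652. That's pretty bad. I think the problem is that we're not taking any of the new 2011 TSLA data into account; we're only using 2010 data to predict 2011. That just won't work.

What if we retrained the model every 30 days during 2011? That way, every 30 days the classifier gets "reset" with whatever new patterns it finds.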
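
One way to picture "retrain every 30 days" is as an expanding-window (walk-forward) loop: each window of 30 rows is predicted by a model trained only on the rows before it. The sketch below only generates the row ranges and is not the article's own code; `walk_forward_windows` and its parameters are hypothetical names. The 223 rows correspond to the 2011 dataframe after the Bollinger-band `dropna()` above (its last index is 222).

```python
def walk_forward_windows(n_rows, window=30, initial_train=30):
    """Yield (train_slice, test_slice) pairs for expanding-window retraining.

    Each test slice covers the next `window` rows. The matching train slice
    covers every row before it, so the model never sees the rows it is
    about to predict.
    """
    start = initial_train
    while start < n_rows:
        stop = min(start + window, n_rows)
        yield slice(0, start), slice(start, stop)
        start = stop


windows = list(walk_forward_windows(223, window=30, initial_train=30))
for train, test in windows:
    print(f"train rows [{train.start}, {train.stop})  ->  test rows [{test.start}, {test.stop})")
print(len(windows))  # 7
```

With a pandas dataframe `df`, `df.iloc[train]` and `df.iloc[test]` would select the matching rows. The first 30 rows have no earlier 2011 data to train on, so in this sketch they are assumed to be handled by the model trained on 2010.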

# existing_df is the training set. It starts as the 2011 dataframe
# produced in the previous step.
existing_df = tsla_data_by_year
years_to_pull = [2011]
tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)
tsla_data_by_year = addBollingerBands(tsla_data_by_year)
tsla_data_by_year = tsla_data_by_year.reset_index()
# Trade the first 30 rows of 2011 with the model trained on 2010.
working_df = tsla_data_by_year[:30]
working_df = working_df.drop(columns=['index'])
working_df = addPredictedColumn(working_df, clf)
working_df = howMuchMoneyDidWeMake(working_df)

all_the_money = pd.concat([working_df], sort=True)

# Seven more windows of up to 30 rows each cover the rest of 2011.
for i in range(1, 8):
  # Keep the last row of the previous window so its wallet and
  # stock count carry over into the next window.
  new_first_row = working_df[-1:]
  # Label-encode any text columns of the finished window.
  for column in working_df.columns:
    if column != 'Date':
      if working_df[column].dtype == object:
          le = LabelEncoder()
          working_df[column] = le.fit_transform(working_df[column])
  # Add the finished window to the training set, then pick the next window.
  existing_df = pd.concat([existing_df, working_df], sort=True)
  working_df = tsla_data_by_year[30*i:30*i+30]
  # Recompute the true labels of the training set from the actual prices.
  existing_df["Adj Close Tomorrow"] = existing_df["Adj Close"].shift(-1)
  existing_df = existing_df.dropna()
  existing_df["Stock Goes Up Tomorrow"] = existing_df["Adj Close"] < existing_df["Adj Close Tomorrow"]
  predict = 'Stock Goes Up Tomorrow'
  X = existing_df
  for column in X.columns:
    if column != 'Date':
      if X[column].dtype == object:
          le = LabelEncoder()
          X[column] = le.fit_transform(X[column])

  y = existing_df[predict]
  X = existing_df[['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
        '30 Day MA', '30 Day STD', 'Upper Band', 'Lower Band']]
  targets = existing_df[predict]
  # Train a fresh model on the training set.
  # 'squared_error' is the least-squares loss (named 'ls' in older scikit-learn releases).
  params = {'n_estimators': 100, 'max_depth': 7, 'min_samples_split': 3,
            'learning_rate': 0.01, 'loss': 'squared_error'}
  clf = ensemble.GradientBoostingRegressor(**params)
  clf.fit(X, y)
  # Predict the next window, prepend the carried-over row, and simulate the trades.
  working_df = working_df.drop(columns=['index'])
  working_df = addPredictedColumn(working_df, clf)
  working_df = pd.concat([new_first_row, working_df], sort=True)
  working_df = working_df.reset_index(drop=True)
  working_df = howMuchMoneyDidWeMake(working_df)
  all_the_money = pd.concat([all_the_money, working_df], sort=True)

all_the_money.tail(1)
    30 Day MA  30 Day STD  Adj Close  Close      Date   High    Low  Lower Band  Money In Wallet  Number Of Stocks Held   Open  Position  Stock Goes Up Tomorrow  Upper Band  Volume
13      30.66        2.33      28.56  28.56  20111230  28.98  28.25       25.99              0.0                 322.57  28.49                              True       35.33  339800

Retraining the model every 30 days during 2011, we end up with 322 shares at $28.56 each, for a total of $9,196.

That's great!

Calling my broker right now.


Source: https://dev.to/bababrownbear/making-with-ml-52dp