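Making Money with Machine Learning

You have $1,000 to burn. You decide to invest it in the stock market, specifically in Tesla.
Let's see whether we can use machine learning to optimize our returns.

Download TSLA.csv from here: https://www.kaggle.com/timoboz/tesla-stock-data-from-2010-to-2020/data

Let's get started.

Let's begin with some exploratory data analysis (EDA) on the file we got.

The file is a comma-separated values (CSV) file with 7 columns.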

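The columns are:

Date
Open
High (the day's high)
Low (the day's low)
Close
Adj Close (the closing price adjusted for splits and similar events)
Volume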

# Importing pandas. "pandas is a fast, powerful, flexible and easy to use open source data
# analysis and manipulation tool, built on top of the Python programming language."
import pandas as pd
pd.options.display.max_rows = 30
# Read in the CSV, save it to a pandas dataframe variable called 'tsla_data'.
tsla_data = pd.read_csv("TSLA.csv")

# .head() gives us the first 5 rows of the data frame.
# You can also pass .head() a parameter to return any number of rows. Like .head(10) for 10 rows.
tsla_data.head()

	Date	Open	High	Low	Close	Adj Close	Volume
0	2010-06-29	19.000000	25.00	17.540001	23.889999	23.889999	18766300
1	2010-06-30	25.790001	30.42	23.299999	23.830000	23.830000	17187100
2	2010-07-01	25.000000	25.92	20.270000	21.959999	21.959999	8218800
3	2010-07-02	23.000000	23.10	18.709999	19.200001	19.200001	5139800
4	2010-07-06	20.000000	20.00	15.830000	16.110001	16.110001	6866900

# .shape tells us the number of rows, and the number of columns.
# This dataset has 2416 rows, and 7 columns.
# The NYSE and NASDAQ average about 253 trading days a year. 
# This is from 365.25 (days on average per year) * 5/7 (proportion work days per week) 
# - 6 (weekday holidays) - 3*5/7 (fixed date holidays) = 252.75 ≈ 253.
# 10 * 253 = 2530, this dataset is pretty close. Let's assume it's not missing any days.
tsla_data.shape

(2416, 7)
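
The trading-day estimate in the comment above can be checked with a few lines of arithmetic. This is only a sketch of that back-of-the-envelope calculation; the constant names are made up for illustration, and the holiday counts are the ones quoted in the comment, not an official calendar.

```python
# Rough estimate of U.S. trading days per year, following the comment above.
DAYS_PER_YEAR = 365.25       # average calendar days, including leap years
WEEKDAY_SHARE = 5 / 7        # share of calendar days that are weekdays
WEEKDAY_HOLIDAYS = 6         # holidays that always land on a weekday
FIXED_DATE_HOLIDAYS = 3      # holidays on a fixed date, which land on a weekday 5/7 of the time

trading_days = (
    DAYS_PER_YEAR * WEEKDAY_SHARE
    - WEEKDAY_HOLIDAYS
    - FIXED_DATE_HOLIDAYS * WEEKDAY_SHARE
)

rows_in_file = 2416
years_covered = rows_in_file / trading_days

print(round(trading_days, 2))
print(round(years_covered, 2))
```

By this estimate, 2,416 rows correspond to a little over nine and a half years of trading days.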

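This is about 10 years of data, with stock information starting in 2010.

To save time, let's make some assumptions. We're not hedge fund managers yet.

Assumptions

We can only place one order per day (buy or sell), and the order is for the full amount we hold.
If we place an order, we assume it fills at that price.
We start with $1,000.

We'll track a couple of key pieces of information.

Money in wallet
Number of stocks held

Let's start with 2010 and see how much money we would make if we started with $1,000 on the first day in this file.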

# We're going to pull just the 2010 data. I like keeping the years in a variable, as a list,
# because we'll likely do this again, and for multiple years.
years_to_pull = [2010]

# Let's tell pandas to treat the 'Date' column as a date.
tsla_data['Date'] = pd.to_datetime(tsla_data['Date'])

# Let's make a function for re-use
def pull_data_by_year(tsla_data, years_to_pull):
  # .copy() returns an independent frame, so adding columns to it later
  # doesn't trigger pandas' SettingWithCopyWarning.
  tsla_data_by_year = tsla_data[tsla_data['Date'].dt.year.isin(years_to_pull)].copy()
  return tsla_data_by_year

tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)
tsla_data_by_year.shape

(130, 7)

# Sort by date ASC
tsla_data_by_year = tsla_data_by_year.sort_values(by = 'Date')

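Let's add a few columns to help us work with the data. I want to see tomorrow's adjusted close, and know whether it is higher than today's adjusted close.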

# .shift(-1) brings the next row into the equation, so that we can add a column that  
# shows tomorrow's adjusted close.
tsla_data_by_year["Adj Close Tomorrow"] = tsla_data_by_year["Adj Close"].shift(-1)
# This adds another column as a bool to quickly show if the stock goes up or down tomorrow.
tsla_data_by_year["Stock Goes Up Tomorrow"] = tsla_data_by_year["Adj Close"] < tsla_data_by_year["Adj Close Tomorrow"]
# Let's look at the first 10 rows to see if this looks correct.
tsla_data_by_year.head(10)

	Date	Open	High	Low	Close	Adj Close	Volume	Adj Close Tomorrow	Stock Goes Up Tomorrow
0	2010-06-29	19.000000	25.000000	17.540001	23.889999	23.889999	18766300	23.830000	False
1	2010-06-30	25.790001	30.420000	23.299999	23.830000	23.830000	17187100	21.959999	False
2	2010-07-01	25.000000	25.920000	20.270000	21.959999	21.959999	8218800	19.200001	False
3	2010-07-02	23.000000	23.100000	18.709999	19.200001	19.200001	5139800	16.110001	False
4	2010-07-06	20.000000	20.000000	15.830000	16.110001	16.110001	6866900	15.800000	False
5	2010-07-07	16.400000	16.629999	14.980000	15.800000	15.800000	6921700	17.459999	True
6	2010-07-08	16.139999	17.520000	15.570000	17.459999	17.459999	7711400	17.400000	False
7	2010-07-09	17.580000	17.900000	16.549999	17.400000	17.400000	4050600	17.049999	False
8	2010-07-12	17.950001	18.070000	17.000000	17.049999	17.049999	2202500	18.139999	True
9	2010-07-13	17.389999	18.639999	16.900000	18.139999	18.139999	2680100	19.840000	True
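
To see what .shift(-1) does in isolation, here is a tiny sketch on made-up prices (the `toy` frame is purely illustrative, not TSLA data). Note that the last row has no "tomorrow", so its shifted value is NaN and the comparison comes out False.

```python
import pandas as pd

# Four made-up closing prices, oldest first.
toy = pd.DataFrame({"Adj Close": [10.0, 9.5, 11.0, 11.0]})

# shift(-1) moves every value up one row, so each row sees the next row's price.
toy["Adj Close Tomorrow"] = toy["Adj Close"].shift(-1)

# Strictly "less than": a flat day does not count as going up,
# and any comparison against NaN is False.
toy["Stock Goes Up Tomorrow"] = toy["Adj Close"] < toy["Adj Close Tomorrow"]

print(toy)
```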

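Following the rule of buy low, sell high, we can work out the following by looking at the historical data.

Start by picking the first day where the next day's close goes up, and buy $1,000 worth of stock that day.

We'll have 3 positions.

Buy
Sell
Hold

In code:

haveNoStock && !goesUpTomorrow = Hold

haveNoStock && goesUpTomorrow = Buy

haveStock && !goesUpTomorrow = Sell

haveStock && goesUpTomorrow = Hold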
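
The four rules above collapse into a small function. This is a minimal sketch; `decide_position` is a name invented here for illustration and is not used elsewhere in the article.

```python
def decide_position(have_stock: bool, goes_up_tomorrow: bool) -> str:
    """Return 'Buy', 'Sell', or 'Hold' for one trading day."""
    if not have_stock and goes_up_tomorrow:
        return "Buy"    # we're in cash and the price rises tomorrow
    if have_stock and not goes_up_tomorrow:
        return "Sell"   # we're invested and the price won't rise tomorrow
    return "Hold"       # nothing to gain from trading today

# Print the full truth table.
for have_stock in (False, True):
    for goes_up in (False, True):
        print(have_stock, goes_up, decide_position(have_stock, goes_up))
```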

# Setting some default values of the new columns. 
# Position can be Hold/Sell/Buy
tsla_data_by_year['Position'] = 'Hold'
tsla_data_by_year['Number Of Stocks Held'] = 0
tsla_data_by_year['Money In Wallet'] = 0
# .at says at row 0, column 'Money in Wallet', save $1000
tsla_data_by_year.at[0, 'Money In Wallet'] = 1000
tsla_data_by_year.head()

	Date	Open	High	Low	Close	Adj Close	Volume	Adj Close Tomorrow	Stock Goes Up Tomorrow	Position	Money In Wallet
0	2010-06-29	19.000000	25.00	17.540001	23.889999	23.889999	18766300	23.830000	False	Hold	1000
1	2010-06-30	25.790001	30.42	23.299999	23.830000	23.830000	17187100	21.959999	False	Hold	0
2	2010-07-01	25.000000	25.92	20.270000	21.959999	21.959999	8218800	19.200001	False	Hold	0
3	2010-07-02	23.000000	23.10	18.709999	19.200001	19.200001	5139800	16.110001	False	Hold	0
4	2010-07-06	20.000000	20.00	15.830000	16.110001	16.110001	6866900	15.800000	False	Hold	0

# Here's my code for determining if I should buy/sell/hold. 
# We'll put this in a function down the line.
previousRow = ''
for index, row in tsla_data_by_year.iterrows():
  if(index > 0):
    row['Money In Wallet'] = previousRow['Money In Wallet']
    row['Number Of Stocks Held'] = previousRow['Number Of Stocks Held']
  if(row['Number Of Stocks Held'] == 0 and not row['Stock Goes Up Tomorrow']):
    row['Position'] = 'Hold'
    # print(1)
  elif(row['Number Of Stocks Held'] == 0 and row['Stock Goes Up Tomorrow']):
    row['Position'] = 'Buy'
    row['Number Of Stocks Held'] = row['Money In Wallet'] / row['Adj Close']
    row['Money In Wallet'] -= row['Number Of Stocks Held'] * row['Adj Close']
    # print(2)
  elif(row['Number Of Stocks Held'] > 0 and not row['Stock Goes Up Tomorrow']):
    row['Position'] = 'Sell'
    row['Money In Wallet'] += row['Number Of Stocks Held'] * row['Adj Close']
    row['Number Of Stocks Held'] = 0
    # print(3)
  elif(row['Number Of Stocks Held'] > 0 and row['Stock Goes Up Tomorrow']):
    row['Position'] = 'Hold'
    # print(4)
  previousRow = row
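  # .loc[index] = row writes the whole updated row back; .at only addresses a single (row, column) cell.
  tsla_data_by_year.loc[index] = row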

# Round each number to 2 decimal places.
tsla_data_by_year = tsla_data_by_year.round(2)
# Let's look at the last row to see how much money or stock we have at the end of the year.
tsla_data_by_year.tail(1)

	Date	Open	High	Low	Close	Adj Close	Volume	Adj Close Tomorrow	Stock Goes Up Tomorrow	Position	Number Of Stocks Held	Money In Wallet
129	2010-12-31	26.57	27.25	26.5	26.63	26.63	1417900	NaN	False	Sell	0.0	8645.73

If we could see the future, we would have $8,645 by the end of 2010.
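
Another way to read that number: under the assumptions above (fractional shares, every order fills at the adjusted close), the perfect-foresight strategy is invested exactly across the days on which the price rises, so the final wallet is the starting cash multiplied by every positive daily ratio. Here is a minimal sketch on made-up prices; `perfect_foresight_value` is a name invented for this illustration.

```python
import numpy as np

def perfect_foresight_value(prices, starting_cash=1000.0):
    """Compound only the day-to-day moves where the price goes up."""
    prices = np.asarray(prices, dtype=float)
    daily_ratio = prices[1:] / prices[:-1]
    # On a rising day we hold stock (ratio > 1); otherwise we sit in cash (ratio 1).
    growth = np.where(daily_ratio > 1.0, daily_ratio, 1.0)
    return starting_cash * growth.prod()

# Made-up prices: up 20%, down, flat, up 20%.
toy_prices = [10.0, 12.0, 11.0, 11.0, 13.2]
print(round(perfect_foresight_value(toy_prices), 2))
```

With these toy prices, the two 20% up days compound to 1.2 * 1.2 = 1.44 times the starting cash, while the down day and the flat day are skipped.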

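Now let's tackle this with machine learning. The key thing to remember is that this is a science, and this is an experiment. We need to follow the scientific method.

Our hypothesis: we'll try to predict whether the stock goes up or down tomorrow.

When we're trying to predict one of two possible outcomes, it's called binary classification.

Our goal is to predict the 'Stock Goes Up Tomorrow' column for any given row, but all we have is today's high, low, and prices. That doesn't give us much to go on; we need to analyze the trend in historical prices. For that, we use technical indicators.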

## Add bollinger bands
## To learn more about bollinger bands: https://www.investopedia.com/terms/b/bollingerbands.asp
import matplotlib.pyplot as plt

tsla_data_by_year['30 Day MA'] = tsla_data_by_year['Adj Close'].rolling(window=30).mean()
tsla_data_by_year['30 Day STD'] = tsla_data_by_year['Adj Close'].rolling(window=30).std() 

tsla_data_by_year['Upper Band'] = tsla_data_by_year['30 Day MA'] + (tsla_data_by_year['30 Day STD'] * 2)
tsla_data_by_year['Lower Band'] = tsla_data_by_year['30 Day MA'] - (tsla_data_by_year['30 Day STD'] * 2)

# Simple 30 Day Bollinger Band for Tesla
tsla_data_by_year[['Adj Close', '30 Day MA', 'Upper Band', 'Lower Band']].plot(figsize=(12,6))
plt.title('30 Day Bollinger Band for Tesla')
plt.ylabel('Price (USD)')
plt.show()

# This plot will show us the adjusted close, the rolling average, and the upper  
# and lower bands of the TSLA stock.

# Since we used a 30 day moving average, the first 29 rows do not have
# bollinger bands information.
# We use dropna() to drop the nulls. (This also drops the final row,
# whose 'Adj Close Tomorrow' is NaN.)
tsla_data_by_year = tsla_data_by_year.dropna()
tsla_data_by_year.head()

	Date	Open	High	Low	Close	Adj Close	Volume	Adj Close Tomorrow	Stock Goes Up Tomorrow	Position	Number Of Stocks Held	Money In Wallet	30 Day MA	30 Day STD	Upper Band	Lower Band
29	2010-08-10	19.65	19.65	18.82	19.03	19.03	1281300	17.90	False	Hold	0.00	1660.38	20.041333	1.937226	23.915786	16.166880
30	2010-08-11	18.69	18.88	17.85	17.90	17.90	797600	17.60	False	Hold	0.00	1660.38	19.841667	1.832744	23.507156	16.176178
31	2010-08-12	17.80	17.90	17.39	17.60	17.60	691000	18.32	True	Buy	94.34	0.00	19.634000	1.714383	23.062765	16.205235
32	2010-08-13	18.18	18.45	17.66	18.32	18.32	634000	18.78	True	Hold	94.34	0.00	19.512667	1.672380	22.857427	16.167907
33	2010-08-16	18.45	18.80	18.26	18.78	18.78	485800	19.15	True	Hold	94.34	0.00	19.498667	1.676840	22.852346	16.144987
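
The table starts at index 29 because a 30-row rolling window has nothing to report until it has seen 30 values. The same effect is easier to see with a window of 3 on made-up prices. This is a toy sketch, and the names `prices`, `window`, and `bands` are illustrative only.

```python
import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])  # made-up prices
window = 3

ma = prices.rolling(window=window).mean()   # rolling average
std = prices.rolling(window=window).std()   # rolling sample standard deviation

bands = pd.DataFrame({
    "price": prices,
    "ma": ma,
    "upper": ma + 2 * std,   # upper band: average plus two standard deviations
    "lower": ma - 2 * std,   # lower band: average minus two standard deviations
})

print(bands)                 # the first window - 1 rows are NaN
print(bands.dropna().shape)  # rows that survive dropna()
```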

# Some fantastical python. 

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
import numpy as np
from matplotlib import pyplot, dates

# Here we are saying we want to predict the column 'Stock Goes Up Tomorrow' by 
#  storing the column name in a variable.
predict = 'Stock Goes Up Tomorrow'
X = tsla_data_by_year
# Treat the date as a number
X['Date'] = X['Date'].dt.strftime('%Y%m%d')

# For each column, apply a LabelEncoder. Regression problems need numerical values 
#  or categorical values. 
# With columns like 'Position', we need to apply a LabelEncoding  
# to set 1 = Hold, 2 = Buy, 3 = Sell  
# (this is an example, the LabelEncoder will determine  
# the numerical values of the categories at runtime.)
for column in X.columns:
  if column != 'Date':
    if X[column].dtype == type(object):
        le = LabelEncoder()
        X[column] = le.fit_transform(X[column])

# Set the y dataset to just the single column we want to predict.
y = tsla_data_by_year[predict]

# Set the X dataset (what we will use to predict), to all the columns mentioned.
X = tsla_data_by_year[['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', '30 Day MA', '30 Day STD', 'Upper Band', 'Lower Band']]

# This is used to stratify. Learn more here: https://en.wikipedia.org/wiki/Stratified_sampling
targets = tsla_data_by_year[predict]

# This splits the dataset into training and testing. 60% of the data will be  
# used to train, 40% will be used to test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=101, stratify=targets)

X_test.head()

	Date	Open	High	Low	Close	Adj Close	Volume	30 Day MA	30 Day STD	Upper Band	Lower Band
65	20100930	22.00	22.15	20.19	20.41	20.41	2195800	20.382333	0.786847	21.956028	18.808639
106	20101129	35.41	35.95	33.33	34.33	34.33	1145600	26.106333	5.340324	36.786982	15.425684
57	20100920	20.67	21.35	20.16	21.06	21.06	947500	19.866667	1.058318	21.983302	17.750031
99	20101117	30.20	30.75	28.61	29.49	29.49	750000	23.079667	3.586195	30.252057	15.907276
90	20101104	22.60	25.33	22.15	24.90	24.90	1874000	20.957667	0.907603	22.772872	19.142461
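
Notice that the test rows come back shuffled (65, 106, 57, ...): train_test_split picks rows at random, and stratify keeps the share of True and False labels the same on both sides of the split. Here is a small sketch with made-up data; the `features` and `labels` arrays are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up samples with a balanced True/False label.
features = np.arange(40).reshape(20, 2)
labels = np.array([True] * 10 + [False] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.40, random_state=101, stratify=labels
)

print(len(y_tr), len(y_te))     # sizes of the 60% / 40% split
print(y_tr.mean(), y_te.mean()) # share of True labels on each side
```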

from sklearn import ensemble

# Hyperparameters for the GradientBoostingRegressor algorithm. More on this much later.
# Note: scikit-learn 1.0 renamed the 'ls' loss to 'squared_error', and 'ls' was removed in 1.2.
params = {'n_estimators': 100, 'max_depth': 7, 'min_samples_split': 3,
          'learning_rate': 0.01, 'loss': 'squared_error'}
clf = ensemble.GradientBoostingRegressor(**params)

# Fit the classifier with the training data.
clf.fit(X_train, y_train)

# Use the trained model to predict the testing dataset.
y_pred_original = clf.predict(X_test)

from sklearn.metrics import (confusion_matrix, precision_score, recall_score, f1_score, classification_report)
y_pred = y_pred_original > .5
y_pred = pd.DataFrame(y_pred)
y_pred = y_pred * 1
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# The confusion matrix will show us True/False Positives, True/False Negatives.
# This dataset is really small to get an accurate reading of the score,  
# but so far it looks like we're close to 50% accurate.

[[12  7]
 [14  7]]
              precision    recall  f1-score   support

       False       0.46      0.63      0.53        19
        True       0.50      0.33      0.40        21

    accuracy                           0.48        40
   macro avg       0.48      0.48      0.47        40
weighted avg       0.48      0.47      0.46        40

We're doing slightly worse than a coin flip! Let's see how this plays out.
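
The numbers in the report can be recomputed by hand from the confusion matrix printed above. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes. The variable names below are illustrative.

```python
import numpy as np

# Confusion matrix from the output above.
# Rows: actual False / actual True. Columns: predicted False / predicted True.
cm = np.array([[12, 7],
               [14, 7]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()    # share of all predictions that were right
precision_true = tp / (tp + fp)    # of the days predicted "up", how many went up
recall_true = tp / (tp + fn)       # of the days that went up, how many we caught
precision_false = tn / (tn + fn)   # same two measures for the "not up" class
recall_false = tn / (tn + fp)

print(round(accuracy, 3))
print(round(precision_true, 2), round(recall_true, 2))
print(round(precision_false, 2), round(recall_false, 2))
```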

predictions = clf.predict(X)
# The values of predictions are stored as a value from 0.00 to 1.00, but we need them 
#  as a true/false to work with our algorithm to calculate $$. Here I compare to .5 (threshold)  
# to determine if the prediction is true or false.
# You can manually adjust the threshold to get a better True Positive / True Negative rate,  
# sometimes it's beneficial if they're trying to reduce a particular metric.
predictions = predictions > .5
X['Stock Goes Up Tomorrow'] = predictions

# Same code as above, functionized. Use a dataset to determine  
# how much money we'll have made with our trades.
def howMuchMoneyDidWeMake(X):
  if('Money In Wallet' not in X ):
    X['Position'] = 'Hold'
    X['Number Of Stocks Held'] = 0
    X['Money In Wallet'] = 0
    X.at[0, 'Money In Wallet'] = 1000

  previousRow = ''
  for index, row in X.iterrows():
    # print(row)
    if(index > 0):
      row['Money In Wallet'] = previousRow['Money In Wallet']
      row['Number Of Stocks Held'] = previousRow['Number Of Stocks Held']
    if(row['Number Of Stocks Held'] == 0 and not row['Stock Goes Up Tomorrow']):
      row['Position'] = 'Hold'
      # print(1)
    elif(row['Number Of Stocks Held'] == 0 and row['Stock Goes Up Tomorrow']):
      row['Position'] = 'Buy'
      row['Number Of Stocks Held'] = row['Money In Wallet'] / row['Adj Close']
      row['Money In Wallet'] -= row['Number Of Stocks Held'] * row['Adj Close']
      # print(2)
    elif(row['Number Of Stocks Held'] > 0 and not row['Stock Goes Up Tomorrow']):
      row['Position'] = 'Sell'
      row['Money In Wallet'] += row['Number Of Stocks Held'] * row['Adj Close']
      row['Number Of Stocks Held'] = 0
      # print(3)
    elif(row['Number Of Stocks Held'] > 0 and row['Stock Goes Up Tomorrow']):
      row['Position'] = 'Hold'
      # print(4)
    previousRow = row
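    # .loc[index] = row writes the whole updated row back; .at only addresses a single cell.
    X.loc[index] = row
  # Round once, after the loop, instead of on every iteration.
  X = X.round(2)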

  return X

X = X.reset_index()
X = howMuchMoneyDidWeMake(X)
X.tail(1)

	index	Date	Open	High	Low	Close	Adj Close	Volume	30 Day MA	30 Day STD	Upper Band	Lower Band	Stock Goes Up Tomorrow	Position	Number Of Stocks Held	Money In Wallet
99	128	20101230	27.7	27.9	26.38	26.5	26.5	2041100	31.28	2.63	36.55	26.02	False	Hold	0.0	2917.96

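Not bad! We almost tripled our money. We ended up with $2,917.

But here's a common newbie mistake: we're measuring success on the same data we used to train the algorithm. We need a different dataset to test with. Let's try again with the next year's data.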
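
To see why scoring a model on its own training data is misleading, here is a small sketch that uses pure noise: random features and random labels, so there is nothing real to learn. It swaps in a plain DecisionTreeClassifier rather than the gradient boosting model used above, simply because an unpruned tree memorizes its training rows; all names and data here are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Pure noise: the labels have no relationship to the features.
features = rng.normal(size=(400, 5))
labels = rng.integers(0, 2, size=400).astype(bool)

train_X, test_X = features[:200], features[200:]
train_y, test_y = labels[:200], labels[200:]

# An unpruned tree keeps splitting until every training row is classified correctly.
model = DecisionTreeClassifier(random_state=0).fit(train_X, train_y)

train_accuracy = model.score(train_X, train_y)  # measured on rows the model has seen
test_accuracy = model.score(test_X, test_y)     # measured on rows it has never seen

print(train_accuracy, test_accuracy)
```

The training score looks perfect even though the model has learned nothing that carries over, which is why a separate dataset is needed to judge it.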

# Let's pull 2011 data.
years_to_pull = [2011]
tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)

def addBollingerBands(df):
  df['30 Day MA'] = df['Adj Close'].rolling(window=30).mean()
  df['30 Day STD'] = df['Adj Close'].rolling(window=30).std() 
  df['Upper Band'] = df['30 Day MA'] + (df['30 Day STD'] * 2)
  df['Lower Band'] = df['30 Day MA'] - (df['30 Day STD'] * 2)
  df = df.dropna()
  return df

tsla_data_by_year = addBollingerBands(tsla_data_by_year)
tsla_data_by_year.head()

	Date	Open	High	Low	Close	Adj Close	Volume	30 Day MA	30 Day STD	Upper Band	Lower Band
159	2011-02-14	23.639999	24.139999	23.049999	23.08	23.08	1283100	24.937333	1.720240	28.377814	21.496853
160	2011-02-15	23.010000	23.170000	22.559999	22.84	22.84	953700	24.811333	1.731142	28.273617	21.349049
161	2011-02-16	23.100000	24.969999	23.070000	24.73	24.73	4115100	24.746667	1.695178	28.137023	21.356310
162	2011-02-17	24.629999	25.490000	23.549999	23.60	23.60	2618400	24.639000	1.660516	27.960031	21.317969
163	2011-02-18	23.330000	23.490000	22.959999	23.18	23.18	2370700	24.482333	1.563047	27.608427	21.356240

# Function to add the predicted column to a dataset using a trained classifier
def addPredictedColumn(df, clf):
  df['Date'] = df['Date'].dt.strftime('%Y%m%d')
  # df["Adj Close Tomorrow"] = df["Adj Close"].shift(-1)

  df = df.dropna()

  for column in df.columns:
    if column != 'Date':
      if df[column].dtype == type(object):
          le = LabelEncoder()
          df[column] = le.fit_transform(df[column])


  predictions = clf.predict(df)
  predictions = predictions > .5
  df['Stock Goes Up Tomorrow'] = predictions
  return df

tsla_data_by_year = addPredictedColumn(tsla_data_by_year, clf)
tsla_data_by_year = tsla_data_by_year.reset_index()
tsla_data_by_year = howMuchMoneyDidWeMake(tsla_data_by_year)
tsla_data_by_year.tail(1)

	index	Date	Open	High	Low	Close	Adj Close	Volume	30 Day MA	30 Day STD	Upper Band	Lower Band	Stock Goes Up Tomorrow	Position	Number Of Stocks Held	Money In Wallet
222	381	20111230	28.49	28.98	28.25	28.56	28.56	339800	30.66	2.33	35.33	25.99	False	Hold	0.0	1652.3

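If we use the model trained on 2010 to predict the 2011 data, we end up with $1,652.30. That's pretty bad. I think the problem is that we aren't taking any of TSLA's new 2011 data into account; we're just using 2010 data to predict 2011. That's not going to cut it.

What if we retrained the model every 30 days during 2011? That way, the classifier gets "reset" every 30 days with whatever new patterns it has found.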
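
The general pattern behind periodic retraining is often called walk-forward validation: train on everything seen so far, predict the next block, then fold that block into the training data and repeat. The sketch below shows only the index bookkeeping, under the assumption that each model should see nothing but rows dated before the block it predicts. It is not the exact loop used in this article, and `walk_forward_blocks` and `block_size` are names invented here.

```python
def walk_forward_blocks(n_rows, block_size=30):
    """Yield (train_rows, test_rows) index ranges in chronological order.

    Each training range ends right where its test block begins, so a model
    fitted on train_rows never sees the rows it is asked to predict.
    """
    for start in range(block_size, n_rows, block_size):
        stop = min(start + block_size, n_rows)
        yield range(0, start), range(start, stop)

# Example: 100 rows sorted by date, retraining every 30 rows.
for train_rows, test_rows in walk_forward_blocks(100, block_size=30):
    print(f"train on rows {train_rows.start}-{train_rows.stop - 1}, "
          f"predict rows {test_rows.start}-{test_rows.stop - 1}")
```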

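# existing_df is the pool of rows the model is retrained on. It starts out as the
# 2011 frame produced above (tsla_data_by_year was last assigned in the 2011 run).
existing_df = tsla_data_by_year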
years_to_pull = [2011]
tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)
tsla_data_by_year = addBollingerBands(tsla_data_by_year)
tsla_data_by_year = tsla_data_by_year.reset_index()
working_df = tsla_data_by_year[:30]
working_df = working_df.drop(columns=['index'])
working_df = addPredictedColumn(working_df, clf)
working_df = howMuchMoneyDidWeMake(working_df)

all_the_money = pd.concat([working_df], sort=True)

for i in range(1, 8):
  new_first_row = working_df[-1:]
  for column in working_df.columns:
    if column != 'Date':
      if working_df[column].dtype == type(object):
          le = LabelEncoder()
          working_df[column] = le.fit_transform(working_df[column])
  existing_df = pd.concat([existing_df, working_df], sort=True)
  working_df = tsla_data_by_year[30*i:30*i+30]
  existing_df["Adj Close Tomorrow"] = existing_df["Adj Close"].shift(-1)
  existing_df = existing_df.dropna()
  existing_df["Stock Goes Up Tomorrow"] = existing_df["Adj Close"] < existing_df["Adj Close Tomorrow"]
  predict = 'Stock Goes Up Tomorrow'
  X = existing_df
  for column in X.columns:
    if column != 'Date':
      if X[column].dtype == type(object):
          le = LabelEncoder()
          X[column] = le.fit_transform(X[column])

  y = existing_df[predict]
  X = existing_df[['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
        '30 Day MA', '30 Day STD', 'Upper Band', 'Lower Band']]
  targets = existing_df[predict]
  # 'squared_error' is the scikit-learn 1.0+ name for the old 'ls' loss.
  params = {'n_estimators': 100, 'max_depth': 7, 'min_samples_split': 3,
            'learning_rate': 0.01, 'loss': 'squared_error'}
  clf = ensemble.GradientBoostingRegressor(**params)
  clf.fit(X, y)
  working_df = working_df.drop(columns=['index'])
  working_df = addPredictedColumn(working_df, clf)
  working_df = pd.concat([new_first_row, working_df], sort=True)
  working_df = working_df.reset_index(drop=True)
  working_df = howMuchMoneyDidWeMake(working_df)
  all_the_money = pd.concat([all_the_money, working_df], sort=True)

all_the_money.tail(1)

	30 Day MA	30 Day STD	Adj Close	Close	Date	High	Low	Lower Band	Money In Wallet	Number Of Stocks Held	Open	Position	Stock Goes Up Tomorrow	Upper Band	Volume
13	30.66	2.33	28.56	28.56	20111230	28.98	28.25	25.99	0.0	322.57	28.49	Buy	True	35.33	339800

Retraining the model every 30 days in 2011, we end up with 322.57 shares at $28.56 each, roughly $9,213 in total.

That's great!

Calling my broker right now.

Source: https://dev.to/bababrownbear/making-with-ml-52dp