Find minimum and maximum values in OHLC data
Question
I would like to find (in Python) the local minimum and maximum values in OHLC data, under the condition that the distance between these values is at least ±5%.
Temporal Condition
Note that:

- for an UP movement (close > open), the `low` price comes BEFORE the `high` price
- for a DOWN movement (close < open), the `low` price comes AFTER the `high` price (see the short sketch below)
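To make this ordering concrete, here is a minimal sketch added for illustration (`expand_candle` is a hypothetical helper, not part of the original question), which expands a single candle into the (date, price) sequence implied by the rule above:

```python
# Hypothetical helper, assuming prices are already floats: expand one OHLC
# candle into the price sequence implied by the temporal rule above
# (low before high on up candles, high before low on down candles).
def expand_candle(date, o, h, l, c):
    order = [o, l, h, c] if c > o else [o, h, l, c]
    return [(date, p) for p in order]

# First row of the sample data below (a down candle, close < open):
print(expand_candle("2023-07-02", 0.1280, 0.1280, 0.1209, 0.1239))
# [('2023-07-02', 0.128), ('2023-07-02', 0.128), ('2023-07-02', 0.1209), ('2023-07-02', 0.1239)]
```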
The best way to explain what I would like to achieve is with a graphical example (chart not reproduced here).
OHLC data is in this format:

```
open_time     open        high        low         close
2023-07-02    0.12800000  0.12800000  0.12090000  0.12390000
2023-07-03    0.12360000  0.13050000  0.12220000  0.12830000
2023-07-04    0.12830000  0.12830000  0.12320000  0.12410000
2023-07-05    0.12410000  0.12530000  0.11800000  0.11980000
2023-07-06    0.11990000  0.12270000  0.11470000  0.11500000
```
The result should be something like:

```
date1 val1 date2 val2 <--- up
date2 val2 date3 val3 <--- down
date3 val3 date4 val4 <--- up
date4 val4 date5 val5 <--- down
...
```
As for the data in the example, the result should be:

```
2023-07-02 0.1280 2023-07-02 0.1209 -5.55%
2023-07-02 0.1209 2023-07-03 0.1305 7.94%
2023-07-03 0.1305 2023-07-06 0.1147 -12.11%
```
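(For reference, the percentage column appears to be the relative change between the two values in each row; e.g. for the first row, (0.1209 - 0.1280) / 0.1280 ≈ -5.55%.)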
Is there a name for this task?
ADDENDUM
I have added a new example, with a different condition (±3%).
This is the data:
```
2022-02-25 38340.4200 39699.0000 38038.4600 39237.0600
2022-02-26 39237.0700 40300.0000 38600.4600 39138.1100
2022-02-27 39138.1100 39881.7700 37027.5500 37714.4300
2022-02-28 37714.4200 44200.0000 37468.2800 43181.2700
2022-03-01 43176.4100 44968.1300 42838.6800 44434.0900
```
And the final result should be:

```
2022-02-25 38038 2022-02-26 40300 5.95%
2022-02-26 40300 2022-02-26 38600 -4.22%
2022-02-26 38600 2022-02-27 39881 3.32%
2022-02-27 39881 2022-02-27 37027 -7.16%
2022-02-27 37027 2022-02-28 44200 19.37%
2022-02-28 44200 2022-03-01 42838 -3.08%
```
Answer 1
Score: 3
This is a straightforward solution by splitting each daily OHLC line into four (day, value) entries. Then we process each entry (order dependent on direction) while recording the local minima/maxima ("peaks"), merging continuous runs and skipping insignificant movements.
There are two `NamedTuple`s: `Entry` (for a (day, value) pair) and `Movement` (for each line of the results). I could have used plain tuples, but `NamedTuple`s give clear names for each field.
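(Purely for illustration, and not part of the answer itself, the class-based `NamedTuple` syntax below is equivalent to the functional definitions used in the code; putting `value` before `date` also means tuple comparisons such as `entry > peaks[-1]` order entries by price first.)

```python
from typing import NamedTuple

# Illustrative equivalent of the functional definitions in the answer's code.
class Entry(NamedTuple):
    value: float  # value first, so entries compare by price
    date: str

class Movement(NamedTuple):
    start: Entry
    end: Entry
    percentage: float
```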
It also doesn't depend on numpy, pandas, or any other library, and the type hints help catch mistakes at compile time if used with a static checker like mypy. It should also be fairly fast for a pure-Python solution, as it computes all movements in one pass.
```python
from typing import Iterator, NamedTuple

Entry = NamedTuple('Entry', [('value', float), ('date', str)])
Movement = NamedTuple('Movement', [('start', Entry), ('end', Entry), ('percentage', float)])

get_change = lambda a, b: (b.value - a.value) / a.value

def get_movements(data_str: str, min_change_percent: float = 0.05) -> Iterator[Movement]:
    """ Return all movements with changes above a threshold. """
    peaks: list[Entry] = []
    for line in data_str.strip().split('\n'):
        # Read lines from input and split into date and values.
        date, open, high, low, close = line.split()
        # Order values according to movement direction.
        values_str = [open, low, high, close] if close > open else [open, high, low, close]
        for value_str in values_str:
            entry = Entry(float(value_str), date)
            if len(peaks) >= 2 and (entry > peaks[-1]) == (peaks[-1] > peaks[-2]):
                # Continue movement of same direction by replacing last peak.
                peaks[-1] = entry
            elif not peaks or abs(get_change(peaks[-1], entry)) >= min_change_percent:
                # New peak is above minimum threshold.
                peaks.append(entry)

    # Convert every pair of remaining peaks to a `Movement`.
    for start, end in zip(peaks, peaks[1:]):
        yield Movement(start, end, percentage=get_change(start, end))
```
Usage for first example:
```python
data_str = """
2023-07-02 0.12800000 0.12800000 0.12090000 0.12390000
2023-07-03 0.12360000 0.13050000 0.12220000 0.12830000
2023-07-04 0.12830000 0.12830000 0.12320000 0.12410000
2023-07-05 0.12410000 0.12530000 0.11800000 0.11980000
2023-07-06 0.11990000 0.12270000 0.11470000 0.11500000
"""

for mov in get_movements(data_str, 0.05):
    print(f'{mov.start.date} {mov.start.value:.4f} {mov.end.date} {mov.end.value:.4f} {mov.percentage:.2%}')

# 2023-07-02 0.1280 2023-07-02 0.1209 -5.55%
# 2023-07-02 0.1209 2023-07-03 0.1305 7.94%
# 2023-07-03 0.1305 2023-07-06 0.1147 -12.11%
```
Usage for second example:
```python
data_str = """
2022-02-25 38340.4200 39699.0000 38038.4600 39237.0600
2022-02-26 39237.0700 40300.0000 38600.4600 39138.1100
2022-02-27 39138.1100 39881.7700 37027.5500 37714.4300
2022-02-28 37714.4200 44200.0000 37468.2800 43181.2700
2022-03-01 43176.4100 44968.1300 42838.6800 44434.0900
"""

for mov in get_movements(data_str, 0.03):
    print(f'{mov.start.date} {int(mov.start.value)} {mov.end.date} {int(mov.end.value)} {mov.percentage:.2%}')

# 2022-02-25 38340 2022-02-26 40300 5.11%
# 2022-02-26 40300 2022-02-26 38600 -4.22%
# 2022-02-26 38600 2022-02-27 39881 3.32%
# 2022-02-27 39881 2022-02-27 37027 -7.16%
# 2022-02-27 37027 2022-02-28 44200 19.37%
# 2022-02-28 44200 2022-03-01 42838 -3.08%
# 2022-03-01 42838 2022-03-01 44968 4.97%
```
The first result of the second example doesn't agree with the value you provided, but it's not clear to me why it started at `38038` instead of `38340`. All other values match perfectly.
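For reference, recomputing from the posted data: starting from that day's open gives (40300 - 38340.42) / 38340.42 ≈ 5.11%, while starting from the day's low gives (40300 - 38038.46) / 38038.46 ≈ 5.95%, which is the figure shown in the question.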
Answer 2
Score: 1
I set out determined to give this a go using as much `pandas` as possible. I couldn't figure out a better way than @BoppreH's to actually implement the business logic of the peak determination.
I create a configurable filter to be applied to the rows of the `DataFrame`, using a closure for state storage:
```python
def min_percent_change_filter(min_change_percent=0.05):
    peaks = []
    get_change = lambda a, b: (b - a) / a

    def add_entry(row):
        """By @BoppreH, with slight modifications.
        Update list of peaks with one new entry."""
        if len(peaks) >= 2 and (row["data"] > peaks[-1]["data"]) == (
            peaks[-1]["data"] > peaks[-2]["data"]
        ):
            # Continue movement of same direction by replacing last peak.
            peaks[-1] = row.copy()
            return peaks
        elif (
            not peaks
            or abs(get_change(peaks[-1]["data"], row["data"])) >= min_change_percent
        ):
            # New peak is above minimum threshold.
            peaks.append(row.copy())
            return peaks
        return peaks

    return add_entry
```
The `pandas` part requires quite some manipulation to get the data into the right shape. After it's in the right shape, we apply the filter across the rows. Finally, we put the `DataFrame` into the desired output format:
```python
import pandas as pd


def pandas_approach(data, min_pct_change):
    df = pd.DataFrame(data)
    df["open_time"] = pd.to_datetime(df["open_time"])
    # Respect the temporal aspect: create new columns first and second
    # and set them to the respective value depending on whether we're
    # moving down or up
    df["first"] = df["low"].where(df["open"] <= df["close"], df["high"])
    df["second"] = df["high"].where(df["open"] <= df["close"], df["low"])
    # Create a new representation of the data, by stacking first and second
    # on the index, then sorting by 'open_time' and whether it came first
    # or second (Note: assert 'first' < 'second')
    stacked_representation = (
        df.set_index("open_time")[["first", "second"]]
        .stack()
        .reset_index()
        .sort_values(["open_time", "level_1"])[["open_time", 0]]
    )
    stacked_representation.columns = ["open_time", "data"]
    # Now we can go to work with our filter
    results = pd.DataFrame(
        stacked_representation.apply(min_percent_change_filter(min_pct_change), axis=1)[
            0
        ]
    )
    # We reshape/rename/reorder our data to fit the desired output format
    results["begin"] = results["data"].shift()
    results["begin_date"] = results["open_time"].shift()
    results = results.dropna()[["begin_date", "begin", "open_time", "data"]]
    results.columns = ["begin_date", "begin", "end_date", "end"]
    # Lastly add the pct change
    results["pct_change"] = (results.end - results.begin) / results.begin

    # This returns the styler for output formatting purposes, but you can return the
    # DataFrame instead by commenting/deleting it
    def format_datetime(dt):
        return pd.to_datetime(dt).strftime("%Y-%m-%d")

    def price_formatter(value):
        return "{:.4f}".format(value) if abs(value) < 10000 else "{:.0f}".format(value)

    return results.style.format(
        {
            "pct_change": "{:,.2%}".format,
            "begin_date": format_datetime,
            "end_date": format_datetime,
            "begin": price_formatter,
            "end": price_formatter,
        }
    )
```
Output for the first example:
```python
import pandas as pd

data = {
    "open_time": ["2023-07-02", "2023-07-03", "2023-07-04", "2023-07-05", "2023-07-06"],
    "open": [0.12800000, 0.12360000, 0.12830000, 0.12410000, 0.11990000],
    "high": [0.12800000, 0.13050000, 0.12830000, 0.12530000, 0.12270000],
    "low": [0.12090000, 0.12220000, 0.12320000, 0.11800000, 0.11470000],
    "close": [0.12390000, 0.12830000, 0.12410000, 0.11980000, 0.11500000],
}

pandas_approach(data, 0.05)
```
```
  begin_date  begin   end_date    end     pct_change
1 2023-07-02  0.1280  2023-07-02  0.1209      -5.55%
3 2023-07-02  0.1209  2023-07-03  0.1305       7.94%
9 2023-07-03  0.1305  2023-07-06  0.1147     -12.11%
```
Output for the second example:
```python
data_2 = {
    "open_time": ["2022-02-25", "2022-02-26", "2022-02-27", "2022-02-28", "2022-03-01"],
    "open": [38340.4200, 39237.0700, 39138.1100, 37714.4200, 43176.4100],
    "high": [39699.0000, 40300.0000, 39881.7700, 44200.0000, 44968.1300],
    "low": [38038.4600, 38600.4600, 37027.5500, 37468.2800, 42838.6800],
    "close": [39237.0600, 39138.1100, 37714.4300, 43181.2700, 44434.0900],
}

pandas_approach(data_2, 0.03)
```
```
  begin_date  begin  end_date    end    pct_change
2 2022-02-25  38038  2022-02-26  40300       5.95%
3 2022-02-26  40300  2022-02-26  38600      -4.22%
4 2022-02-26  38600  2022-02-27  39882       3.32%
5 2022-02-27  39882  2022-02-27  37028      -7.16%
7 2022-02-27  37028  2022-02-28  44200      19.37%
8 2022-02-28  44200  2022-03-01  42839      -3.08%
9 2022-03-01  42839  2022-03-01  44968       4.97%
```