Find minimum and maximum values in OHLC data

Question



I would like to find (in Python) the local minimum and maximum values in OHLC data, under the condition that the distance between these values is at least ±5%.

Temporal Condition

Note that

  • for an UP movement (close>open), low price comes BEFORE high price
  • for a DOWN movement (close<open), low price comes AFTER high price

The best way to explain what I would like to achieve is by a graphical example:

[image: graphical example of the local minima/maxima to be found]
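
To make the temporal condition concrete, here is a minimal sketch (illustrative only, using the first row of the sample data below) of how a single candle expands into (date, value) points under that rule:

```python
# Illustrative sketch: expand one OHLC row into (date, value) points,
# ordering low/high according to the movement direction described above.
date, o, h, l, c = "2023-07-02", 0.1280, 0.1280, 0.1209, 0.1239
# close < open, so this is a DOWN day: the high comes before the low.
points = [(date, o), (date, h), (date, l), (date, c)] if c < o else [(date, o), (date, l), (date, h), (date, c)]
print(points)
# [('2023-07-02', 0.128), ('2023-07-02', 0.128), ('2023-07-02', 0.1209), ('2023-07-02', 0.1239)]
```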

OHLC data is in this format:

open_time	   open		   high		   low		   close
2023-07-02	0.12800000	0.12800000	0.12090000	0.12390000
2023-07-03	0.12360000	0.13050000	0.12220000	0.12830000
2023-07-04	0.12830000	0.12830000	0.12320000	0.12410000
2023-07-05	0.12410000	0.12530000	0.11800000	0.11980000
2023-07-06	0.11990000	0.12270000	0.11470000	0.11500000

The result should be something like:

date1 val1 date2 val2 <--- up
date2 val2 date3 val3 <--- down
date3 val3 date4 val4 <--- up
date4 val4 date5 val5 <--- down
.
.
.

For the data in the example, the result should be:

2023-07-02	0.1280	2023-07-02	0.1209	-5.55%
2023-07-02	0.1209	2023-07-03	0.1305	7.94%
2023-07-03	0.1305	2023-07-06	0.1147	-12.11%
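
For reference, the last column is the relative change between the two values; for example, the first row is (0.1209 - 0.1280) / 0.1280 ≈ -5.55%.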

Is there a name for this task?


ADDENDUM

I've added a new example with a different condition (±3%).

This is the data:

2022-02-25	38340.4200	39699.0000	38038.4600	39237.0600
2022-02-26	39237.0700	40300.0000	38600.4600	39138.1100
2022-02-27	39138.1100	39881.7700	37027.5500	37714.4300
2022-02-28	37714.4200	44200.0000	37468.2800	43181.2700
2022-03-01	43176.4100	44968.1300	42838.6800	44434.0900

And the final result should be:

2022-02-25	38038	2022-02-26	40300	5.95%
2022-02-26	40300	2022-02-26	38600	-4.22%
2022-02-26	38600	2022-02-27	39881	3.32%
2022-02-27	39881	2022-02-27	37027	-7.16%
2022-02-27	37027	2022-02-28	44200	19.37%
2022-02-28	44200	2022-03-01	42838	-3.08%

Answer 1

Score: 3


This is a straightforward solution: split each daily OHLC line into four (date, value) entries, then process the entries (in an order that depends on the movement direction) while recording the local minima/maxima ("peaks"), merging consecutive moves in the same direction and skipping insignificant ones.

There are two NamedTuples: Entry (for a single value with its date) and Movement (for each line of the results). I could have used plain tuples, but NamedTuples give each field a clear name.

It also doesn't depend on numpy, pandas, or any other library, and the type hints help catch mistakes before runtime if used with a static checker like mypy. It should also be fairly fast for a pure-Python solution, as it computes all movements in one pass.

from typing import Iterator, NamedTuple

Entry = NamedTuple('Entry', [('value', float), ('date', str)])
Movement = NamedTuple('Movement', [('start', Entry), ('end', Entry), ('percentage', float)])
get_change = lambda a, b: (b.value - a.value) / a.value

def get_movements(data_str: str, min_change_percent: float = 0.05) -> Iterator[Movement]:
    """ Return all movements with changes above a threshold. """
    peaks: list[Entry] = []
    for line in data_str.strip().split('\n'):
        # Read lines from input and split into date and values.
        date, open, high, low, close = line.split()
        # Order values according to movement direction.
        values_str = [open, low, high, close] if close > open else [open, high, low, close]
        for value_str in values_str:
            entry = Entry(float(value_str), date)
            if len(peaks) >= 2 and (entry > peaks[-1]) == (peaks[-1] > peaks[-2]):
                # Continue movement of same direction by replacing last peak.
                peaks[-1] = entry
            elif not peaks or abs(get_change(peaks[-1], entry)) >= min_change_percent:
                # New peak is above minimum threshold.
                peaks.append(entry)

    # Convert every pair of remaining peaks to a `Movement`.
    for start, end in zip(peaks, peaks[1:]):
        yield Movement(start, end, percentage=get_change(start, end))
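
A small side note on the ordering (my observation, not stated in the original answer): because value is the first field of Entry, comparing two entries with > compares their values first, exactly like a plain tuple. That is what the expression (entry > peaks[-1]) == (peaks[-1] > peaks[-2]) relies on to detect whether a move continues in the same direction.

```python
# Illustrative check, assuming the Entry definition above.
print(Entry(0.1305, '2023-07-03') > Entry(0.1209, '2023-07-02'))  # True: 0.1305 > 0.1209
```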

Usage for the first example:

data_str = """
2023-07-02  0.12800000  0.12800000  0.12090000  0.12390000
2023-07-03  0.12360000  0.13050000  0.12220000  0.12830000
2023-07-04  0.12830000  0.12830000  0.12320000  0.12410000
2023-07-05  0.12410000  0.12530000  0.11800000  0.11980000
2023-07-06  0.11990000  0.12270000  0.11470000  0.11500000
"""

for mov in get_movements(data_str, 0.05):
    print(f'{mov.start.date}  {mov.start.value:.4f}  {mov.end.date}  {mov.end.value:.4f}  {mov.percentage:.2%}')
# 2023-07-02  0.1280  2023-07-02  0.1209  -5.55%
# 2023-07-02  0.1209  2023-07-03  0.1305  7.94%
# 2023-07-03  0.1305  2023-07-06  0.1147  -12.11%

Usage for the second example:

data_str = """
2022-02-25  38340.4200  39699.0000  38038.4600  39237.0600
2022-02-26  39237.0700  40300.0000  38600.4600  39138.1100
2022-02-27  39138.1100  39881.7700  37027.5500  37714.4300
2022-02-28  37714.4200  44200.0000  37468.2800  43181.2700
2022-03-01  43176.4100  44968.1300  42838.6800  44434.0900
"""

for mov in get_movements(data_str, 0.03):
    print(f'{mov.start.date}  {int(mov.start.value)}  {mov.end.date}  {int(mov.end.value)}  {mov.percentage:.2%}')
# 2022-02-25  38340  2022-02-26  40300  5.11%
# 2022-02-26  40300  2022-02-26  38600  -4.22%
# 2022-02-26  38600  2022-02-27  39881  3.32%
# 2022-02-27  39881  2022-02-27  37027  -7.16%
# 2022-02-27  37027  2022-02-28  44200  19.37%
# 2022-02-28  44200  2022-03-01  42838  -3.08%
# 2022-03-01  42838  2022-03-01  44968  4.97%

The first result of the second example doesn't agree with the value you provided, but it's not clear to me why it started at 38038 instead of 38340. All other values match perfectly.
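
(One way to see how 38340 arises from the algorithm above: 2022-02-25 closes up, so its low is processed right after the open, and the change from the open (38340.42) to the low (38038.46) is only about -0.79%, below the 3% threshold; the low therefore never becomes a peak and the first movement is measured from the open.)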

Answer 2

Score: 1

I set out determined to give this a go using as much `pandas` as possible. I couldn't figure out a better way than @BoppreH's to actually implement the business logic of the peak determination.
I create a configurable filter to be applied to the rows of the `DataFrame`, using a closure for state storage:
```python
def min_percent_change_filter(min_change_percent=0.05):
    peaks = []
    get_change = lambda a, b: (b - a) / a

    def add_entry(row):
        """By @BoppreH, with slight modifications.
        Update list of peaks with one new entry."""
        if len(peaks) >= 2 and (row["data"] > peaks[-1]["data"]) == (
            peaks[-1]["data"] > peaks[-2]["data"]
        ):
            # Continue movement of same direction by replacing last peak.
            peaks[-1] = row.copy()
            return peaks
        elif (
            not peaks
            or abs(get_change(peaks[-1]["data"], row["data"])) >= min_change_percent
        ):
            # New peak is above minimum threshold.
            peaks.append(row.copy())
            return peaks
        return peaks

    return add_entry
```
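
To show how the closure keeps state between rows, here is a minimal sketch (not from the original answer; it assumes the min_percent_change_filter defined above):

```python
import pandas as pd

# Each call to add_entry sees the same enclosed `peaks` list, so applying it
# row by row accumulates the peaks (the date is repeated here just for brevity).
add_entry = min_percent_change_filter(0.05)
for value in [0.1280, 0.1209, 0.1305]:
    peaks = add_entry(pd.Series({"open_time": "2023-07-02", "data": value}))
print([p["data"] for p in peaks])
# [0.128, 0.1209, 0.1305] -- each value differs from the previous peak by at least 5%
```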

The pandas part requires quite a bit of manipulation to get the data into the right shape. Once it is, we apply the filter across the rows. Finally, we put the DataFrame into the desired output format:

import pandas as pd

def pandas_approach(data, min_pct_change):
    df = pd.DataFrame(data)
    df["open_time"] = pd.to_datetime(df["open_time"])

    # Respect the temporal aspect: create new columns first and second and
    # set them to the respective value depending on whether we're moving
    # down or up
    df["first"] = df["low"].where(df["open"] <= df["close"], df["high"])
    df["second"] = df["high"].where(df["open"] <= df["close"], df["low"])

    # Create a new representation of the data by stacking first and second
    # on the index, then sorting by 'open_time' and whether it came first
    # or second (Note: assert 'first' < 'second')
    stacked_representation = (
        df.set_index("open_time")[["first", "second"]]
        .stack()
        .reset_index()
        .sort_values(["open_time", "level_1"])[["open_time", 0]]
    )
    stacked_representation.columns = ["open_time", "data"]

    # Now we can go to work with our filter
    results = pd.DataFrame(
        stacked_representation.apply(min_percent_change_filter(min_pct_change), axis=1)[
            0
        ]
    )
    # We reshape/rename/reorder our data to fit the desired output format
    results["begin"] = results["data"].shift()
    results["begin_date"] = results["open_time"].shift()
    results = results.dropna()[["begin_date", "begin", "open_time", "data"]]
    results.columns = ["begin_date", "begin", "end_date", "end"]

    # Lastly add the pct change
    results["pct_change"] = (results.end - results.begin) / results.begin

    # This returns the styler for output formatting purposes, but you can return
    # the DataFrame instead by commenting/deleting the formatting below
    def format_datetime(dt):
        return pd.to_datetime(dt).strftime("%Y-%m-%d")

    def price_formatter(value):
        return "{:.4f}".format(value) if abs(value) < 10000 else "{:.0f}".format(value)

    return results.style.format(
        {
            "pct_change": "{:,.2%}".format,
            "begin_date": format_datetime,
            "end_date": format_datetime,
            "begin": price_formatter,
            "end": price_formatter,
        }
    )

Output for the first example:


import pandas as pd

data = {
    "open_time": ["2023-07-02", "2023-07-03", "2023-07-04", "2023-07-05", "2023-07-06"],
    "open": [0.12800000, 0.12360000, 0.12830000, 0.12410000, 0.11990000],
    "high": [0.12800000, 0.13050000, 0.12830000, 0.12530000, 0.12270000],
    "low": [0.12090000, 0.12220000, 0.12320000, 0.11800000, 0.11470000],
    "close": [0.12390000, 0.12830000, 0.12410000, 0.11980000, 0.11500000],
}
pandas_approach(data, 0.05)
 	begin_date	begin	end_date	end 	pct_change
1	2023-07-02	0.1280	2023-07-02	0.1209	-5.55%
3	2023-07-02	0.1209	2023-07-03	0.1305	7.94%
9	2023-07-03	0.1305	2023-07-06	0.1147	-12.11%

Output for the second example:

data_2 = {
    "open_time": ["2022-02-25", "2022-02-26", "2022-02-27", "2022-02-28", "2022-03-01"],
    "open": [38340.4200, 39237.0700, 39138.1100, 37714.4200, 43176.4100],
    "high": [39699.0000, 40300.0000, 39881.7700, 44200.0000, 44968.1300],
    "low": [38038.4600, 38600.4600, 37027.5500, 37468.2800, 42838.6800],
    "close": [39237.0600, 39138.1100, 37714.4300, 43181.2700, 44434.0900],
}
pandas_approach(data_2, 0.03)
 	begin_date	begin	end_date	end	    pct_change
2	2022-02-25	38038	2022-02-26	40300	5.95%
3	2022-02-26	40300	2022-02-26	38600	-4.22%
4	2022-02-26	38600	2022-02-27	39882	3.32%
5	2022-02-27	39882	2022-02-27	37028	-7.16%
7	2022-02-27	37028	2022-02-28	44200	19.37%
8	2022-02-28	44200	2022-03-01	42839	-3.08%
9	2022-03-01	42839	2022-03-01	44968	4.97%
