Make a 3D NumPy array using a for loop in Python

Question


I have two-dimensional training data: each file holds 200 rows of 4 features.

I tested 100 different applications with 10 repetitions each, resulting in 1000 CSV files.

I want to stack the results from each CSV file for machine learning, but I don't know how.

Each of my CSV files looks like the one below.

test1.csv converted to a NumPy array:

[[0 'crc32_pclmul' 445 0]
 [0 'crc32_pclmul' 270 4096]
 [0 'crc32_pclmul' 234 8192]
 ...
 [249 'intel_pmt' 272 4096]
 [249 'intel_pmt' 224 8192]
 [249 'intel_pmt' 268 12288]]

I tried the Python code below.

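import glob
import os

import numpy as np

path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))
cnt = 0
for f in csv_files:
    cnt += 1
    seperator = '_'
    # The application name is the part of the file name before the first '_'
    app = os.path.basename(f).split(seperator, 1)[0]

    if cnt == 1:
        a = np.array(preprocess(f))
        b = np.array(app)
    else:
        # vstack joins along the existing row axis
        a = np.vstack((a, np.array(preprocess(f))))
        b = np.append(b, app)
print(a)
print(b)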

The preprocess function returns the df.to_numpy() result for each CSV file.

My expectation was the following, with a.shape == (1000, 200, 4):

[[[0 'crc32_pclmul' 445 0]
 [0 'crc32_pclmul' 270 4096]
 [0 'crc32_pclmul' 234 8192]
 ...
 [249 'intel_pmt' 272 4096]
 [249 'intel_pmt' 224 8192]
 [249 'intel_pmt' 268 12288]],
[[0 'crc32_pclmul' 445 0]
 [0 'crc32_pclmul' 270 4096]
 [0 'crc32_pclmul' 234 8192]
 ...
 [249 'intel_pmt' 272 4096]
 [249 'intel_pmt' 224 8192]
 [249 'intel_pmt' 268 12288]],
...
[[0 'crc32_pclmul' 445 0]
 [0 'crc32_pclmul' 270 4096]
 [0 'crc32_pclmul' 234 8192]
 ...
 [249 'intel_pmt' 272 4096]
 [249 'intel_pmt' 224 8192]
 [249 'intel_pmt' 268 12288]]]

However, I'm getting this instead, with a.shape == (200000, 4):

[[0 'crc32_pclmul' 445 0]
 [0 'crc32_pclmul' 270 4096]
 [0 'crc32_pclmul' 234 8192]
 ...
 [249 'intel_pmt' 272 4096]
 [249 'intel_pmt' 224 8192]
 [249 'intel_pmt' 268 12288]]

I want to access each CSV file's results using a[0] to a[999], where each sub-array has shape (200, 4).
How can I solve this problem? I'm quite lost.

Answer 1

Score: 0


Make a new list (outside of the loop) and append each item to that new list after reading.
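
As a minimal sketch of this suggestion, the loop below collects the per-file arrays and labels in plain Python lists and combines them once at the end. The file names and `fake_preprocess` are hypothetical stand-ins for the real CSV files and the asker's `preprocess`; the stand-in simply returns a (200, 4) array. Sorting the file list is an addition here, since `glob.glob` does not guarantee any particular order.

```python
import os

import numpy as np


def fake_preprocess(path):
    # Hypothetical stand-in for preprocess(): the real function would
    # read the CSV at `path`; here we just return a (200, 4) array.
    return np.zeros((200, 4), dtype=np.int64)


# Hypothetical file names of the form "<app>_<repetition>.csv".
csv_files = ["appB_1.csv", "appA_2.csv", "appA_1.csv"]

arrays = []  # one (200, 4) array per file
labels = []  # application name per file, in the same order
for f in sorted(csv_files):  # sorted() keeps the order deterministic
    arrays.append(fake_preprocess(f))
    labels.append(os.path.basename(f).split("_", 1)[0])

a = np.stack(arrays)  # new leading axis indexes the files
b = np.array(labels)
print(a.shape)  # (3, 200, 4)
print(b)
```

With the real data, `a[i]` would then be the (200, 4) result of the i-th file and `b[i]` its application name.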

Answer 2

Score: 0


You have to switch from vstack to stack:

la = []
lb = []
for f in csv_files:
    seperator = '_'
    app = os.path.basename(f).split(seperator, 1)[0]

    la.append(preprocess(f))  # one (200, 4) array per file
    lb.append(app)
a = np.stack(la, axis=0)  # new leading axis: one entry per file
b = np.array(lb)

vstack can only join arrays along the existing row axis, whereas stack joins them along a new axis.
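
To make the difference concrete, here is a small sketch using three synthetic (200, 4) arrays in place of the real per-file results. It also shows that, assuming every file has exactly 200 rows, the vstack result can be reshaped into the same 3-D layout.

```python
import numpy as np

# Three stand-ins for per-file results, each of shape (200, 4).
parts = [np.full((200, 4), i) for i in range(3)]

v = np.vstack(parts)         # concatenates along the existing row axis
s = np.stack(parts, axis=0)  # inserts a new leading axis
print(v.shape)  # (600, 4)
print(s.shape)  # (3, 200, 4)

# If every file has exactly 200 rows, the vstack result can be
# reshaped into the same 3-D layout.
r = v.reshape(-1, 200, 4)
print(np.array_equal(r, s))  # True
```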

Answer 3

Score: 0


Well, yes, that is what vstack (and append) does: it joins the arrays along an existing axis (the row axis) rather than creating a new one.

a1 = np.arange(10).reshape(2, 5)
# [[0, 1, 2, 3, 4],
#  [5, 6, 7, 8, 9]]
a2 = np.arange(10, 20).reshape(2, 5)
# [[10, 11, 12, 13, 14],
#  [15, 16, 17, 18, 19]]
np.vstack((a1, a2))
# [[ 0,  1,  2,  3,  4],
#  [ 5,  6,  7,  8,  9],
#  [10, 11, 12, 13, 14],
#  [15, 16, 17, 18, 19]]

b1 = np.arange(5)
b2 = np.arange(5, 10)
np.append(b1, b2)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

If you expected (from those examples) to append along a new axis, you need to add that axis yourself, or use the more flexible stack.

np.vstack(([a1],[a2]))
#array([[[ 0,  1,  2,  3,  4],
#       [ 5,  6,  7,  8,  9]],
#
#      [[10, 11, 12, 13, 14],
#       [15, 16, 17, 18, 19]]])

Or, in the 1-D case, use vstack instead of append:

np.vstack((b1,b2))
#array([[0, 1, 2, 3, 4],
#       [5, 6, 7, 8, 9]])
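
For comparison, the "more flexible stack" route mentioned above might look like the sketch below. It reuses a1 and a2 from the earlier example and checks that wrapping in lists, adding the axis explicitly with np.newaxis, and calling np.stack all produce the same (2, 2, 5) array. The variable names are only illustrative.

```python
import numpy as np

a1 = np.arange(10).reshape(2, 5)
a2 = np.arange(10, 20).reshape(2, 5)

# Wrap each array in a list so vstack sees shape (1, 2, 5) inputs.
via_vstack = np.vstack(([a1], [a2]))
# Add the new leading axis explicitly, then concatenate along it.
via_newaxis = np.concatenate((a1[np.newaxis], a2[np.newaxis]))
# stack creates the new axis itself (axis=0 by default).
via_stack = np.stack((a1, a2))

print(via_stack.shape)  # (2, 2, 5)
print(np.array_equal(via_vstack, via_stack))   # True
print(np.array_equal(via_newaxis, via_stack))  # True

# stack can also put the new axis elsewhere, e.g. last.
print(np.stack((a1, a2), axis=-1).shape)  # (2, 5, 2)
```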

But more importantly, you shouldn't be doing this inside a loop in the first place. Each of those functions (stack, vstack, append) allocates a new array and copies the data on every call.

It would probably be more efficient to append each np.array(preprocess(f)) and each app to a plain Python list, and call stack (or vstack) only once, after you've read them all.

Or, even better, append preprocess(f) and app directly to Python lists, and call np.array only once after the loop, on the whole list.

So, something like:

la = []
lb = []
for f in csv_files:
    seperator = '_'
    app = os.path.basename(f).split(seperator, 1)[0]

    la.append(preprocess(f))  # collect per-file arrays in a list
    lb.append(app)
a = np.array(la)  # convert once, after the loop
b = np.array(lb)
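
Combining the list into one 3-D array only works cleanly if every file yields the same shape; np.stack, for instance, raises a ValueError otherwise without saying which file is at fault. The sketch below is one possible guard, not part of the original answer: `stack_checked` is a hypothetical helper, and the file names and synthetic arrays stand in for the real data.

```python
import numpy as np


def stack_checked(arrays, names, expected_shape=(200, 4)):
    """Stack equally shaped arrays; name the first offender otherwise."""
    for name, arr in zip(names, arrays):
        if np.shape(arr) != expected_shape:
            raise ValueError(
                f"{name}: shape {np.shape(arr)}, expected {expected_shape}"
            )
    return np.stack(arrays)


good = [np.zeros((200, 4)), np.ones((200, 4))]
result = stack_checked(good, ["appA_1.csv", "appA_2.csv"])
print(result.shape)  # (2, 200, 4)

bad = good + [np.zeros((199, 4))]  # one file is a row short
message = None
try:
    stack_checked(bad, ["appA_1.csv", "appA_2.csv", "appB_1.csv"])
except ValueError as err:
    message = str(err)
    print(message)  # appB_1.csv: shape (199, 4), expected (200, 4)
```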

huangapple
  • Posted on February 6, 2023 at 14:02:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75357819.html