Make a 3D NumPy array using a for loop in Python
Question
I have training data with 2 dimensions (200 results of 4 features).
I tested 100 different applications with 10 repetitions each, which produced 1000 CSV files.
I want to stack the results of each CSV file for machine learning, but I don't know how.
Each of my CSV files looks like the one below (test1.csv converted to a NumPy array):
[[0 'crc32_pclmul' 445 0]
[0 'crc32_pclmul' 270 4096]
[0 'crc32_pclmul' 234 8192]
...
[249 'intel_pmt' 272 4096]
[249 'intel_pmt' 224 8192]
[249 'intel_pmt' 268 12288]]
I tried the Python code below:
import glob
import os

import numpy as np

path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))
cnt = 0
for f in csv_files:
    cnt += 1
    seperator = '_'
    # application name is the part of the filename before the first '_'
    app = os.path.basename(f).split(seperator, 1)[0]
    if cnt == 1:
        a = np.array(preprocess(f))
        b = np.array(app)
    else:
        a = np.vstack((a, np.array(preprocess(f))))
        b = np.append(b, app)
print(a)
print(b)
The preprocess function returns the df.to_numpy() result for each CSV file.
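For reference, a plausible sketch of such a helper (the actual preprocess is not shown in the question), assuming it simply loads each CSV with pandas:

import pandas as pd

def preprocess(csv_path):
    # Hypothetical implementation: read the CSV and return the underlying array,
    # matching the statement that preprocess returns the df.to_numpy() result.
    df = pd.read_csv(csv_path)
    return df.to_numpy()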
My expectation was the following, with a of shape (1000, 200, 4):
[[[0 'crc32_pclmul' 445 0]
[0 'crc32_pclmul' 270 4096]
[0 'crc32_pclmul' 234 8192]
...
[249 'intel_pmt' 272 4096]
[249 'intel_pmt' 224 8192]
[249 'intel_pmt' 268 12288]],
[[0 'crc32_pclmul' 445 0]
[0 'crc32_pclmul' 270 4096]
[0 'crc32_pclmul' 234 8192]
...
[249 'intel_pmt' 272 4096]
[249 'intel_pmt' 224 8192]
[249 'intel_pmt' 268 12288]],
...
[[0 'crc32_pclmul' 445 0]
[0 'crc32_pclmul' 270 4096]
[0 'crc32_pclmul' 234 8192]
...
[249 'intel_pmt' 272 4096]
[249 'intel_pmt' 224 8192]
[249 'intel_pmt' 268 12288]]]
However, I'm getting this instead, with a of shape (200000, 4):
[[0 'crc32_pclmul' 445 0]
[0 'crc32_pclmul' 270 4096]
[0 'crc32_pclmul' 234 8192]
...
[249 'intel_pmt' 272 4096]
[249 'intel_pmt' 224 8192]
[249 'intel_pmt' 268 12288]]
I want to access each CSV file's results via a[0] through a[999], with each sub-array of shape (200, 4).
How can I solve this? I'm quite lost.
Answer 1
Score: 0
Make a new list (outside of the loop) and append each item to that new list after reading.
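A minimal sketch of this suggestion, reusing csv_files, os, np, and the asker's preprocess helper from the question:

results = []                       # new list created outside the loop
labels = []
for f in csv_files:
    results.append(preprocess(f))  # append each file's (200, 4) array after reading it
    labels.append(os.path.basename(f).split('_', 1)[0])

a = np.array(results)              # -> (1000, 200, 4) if every file has 200 rows
b = np.array(labels)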
Answer 2
Score: 0
You have to change from vstack to stack:
la = []
lb = []
for f in csv_files:
    cnt += 1
    seperator = '_'
    app = os.path.basename(f).split(seperator, 1)[0]
    la.append(preprocess(f))
    lb.append(app)
a = np.stack(la, axis=0)
b = np.array(lb)
vstack can only stack along the existing row axis, while stack can stack along a new axis.
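A minimal sketch of the shape difference, using zero arrays as stand-ins for the per-file (200, 4) arrays:

import numpy as np

x = np.zeros((200, 4))
y = np.zeros((200, 4))

print(np.vstack((x, y)).shape)         # (400, 4)    -- rows are concatenated
print(np.stack((x, y), axis=0).shape)  # (2, 200, 4) -- a new leading axis is added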
Answer 3
Score: 0
Well, yes, that is what vstack (and append) do: they merge arrays along the same axis (the rows axis).
a1 = np.arange(10).reshape(2, 5)
# [[0, 1, 2, 3, 4],
#  [5, 6, 7, 8, 9]]
a2 = np.arange(10, 20).reshape(2, 5)
# [[10, 11, 12, 13, 14],
#  [15, 16, 17, 18, 19]]
np.vstack((a1, a2))
# [[ 0,  1,  2,  3,  4],
#  [ 5,  6,  7,  8,  9],
#  [10, 11, 12, 13, 14],
#  [15, 16, 17, 18, 19]]
b1 = np.arange(5)
b2 = np.arange(5, 10)
np.append(b1, b2)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
If you expect (from those examples) to append along a new axis, then you need to add that axis yourself, or use the more flexible stack.
np.vstack(([a1], [a2]))
# array([[[ 0,  1,  2,  3,  4],
#         [ 5,  6,  7,  8,  9]],
#
#        [[10, 11, 12, 13, 14],
#         [15, 16, 17, 18, 19]]])
Or, in the 1D case, use vstack instead of append:
np.vstack((b1, b2))
# array([[0, 1, 2, 3, 4],
#        [5, 6, 7, 8, 9]])
But more importantly, you shouldn't be doing this inside a loop in the first place. Each of those functions (stack, vstack, append) creates a brand-new array on every call.
It would probably be more efficient to append all your np.array(preprocess(f)) and np.array(app) results to plain Python lists, and call stack and vstack only once you have read them all.
Or, even better, append preprocess(f) and app directly to Python lists, and call np.array on the whole thing only after the loop. So, something like:
la = []
lb = []
for f in csv_files:
    cnt += 1
    seperator = '_'
    app = os.path.basename(f).split(seperator, 1)[0]
    la.append(preprocess(f))
    lb.append(app)
a = np.array(la)
b = np.array(lb)
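A quick sanity check of the resulting shapes (assuming 1000 CSV files with 200 rows and 4 columns each, as described in the question):

print(a.shape)     # expected: (1000, 200, 4)
print(a[0].shape)  # expected: (200, 4) -- one CSV file's results
print(b.shape)     # expected: (1000,)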