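Question

I am researching the relationships between different communities on Reddit. As part of my data collection I have about 49,000 CSV files, one per sub that I scraped. From all the posts I could get, I recorded each commenter along with their total karma and number of comments (see pic for format). Each CSV contains all the commenters I have collected for that individual sub.

I want to take each sub, compare it to another sub, and count how many users they have in common. I don't need a list of which users they are, just the number the two share. I also want a karma threshold; the code currently keeps users with more than 30 karma.

With the code I have, I build two lists of users who are over the threshold, convert both lists to sets, and then "&" them together.

Since I am re-reading the same CSVs over and over to compare them with each other, can I preprocess them to make this more efficient? Would sorting the usernames alphabetically help? I plan to redo all the CSVs beforehand to strip out the "removal users", which are all bots. Should I sort by karma instead of by user, so that rather than checking whether karma is above a certain level, the code only reads up to a certain point? Memory usage is not a problem, only speed. I have billions of comparisons to make, so any reduction in time is appreciated.

Here is my code: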

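import datatable as dt
import pandas as pd

# Bot accounts to exclude; the first column of the file holds the usernames.
remove_list = dt.fread('D:/Spring 2023/red/c7/removal_users.csv')
drops = set(remove_list['C0'].to_list()[0])

def get_scores(current):
    # Read the sub's CSV once (the original read it with datatable,
    # then immediately overwrote the result with a pandas read).
    sub = pd.read_csv(current, sep=',', header=None)
    sub.columns = ['user', 'score']
    sub = sub[~sub['user'].isin(drops)]  # drop known bots
    sub = sub[sub['score'] > 30]         # keep users above the karma threshold
    return list(sub['user'])

def get_name(current):
    # 'path/to/name.csv' -> 'name'
    return current.split('/')[-1].split('.')[-2]

def common_users(list1, list2):
    # Number of users that appear in both lists.
    return len(set(list1) & set(list2))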


Answer 1

Score: 1

Have you tried the len() function? It returns the length of a list, and it accepts a sequence or a collection as its argument.
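Beyond len(), most of the time is probably going into re-reading and re-filtering the same CSVs for every comparison. Here is a minimal sketch of reading every file once, keeping one frozenset per sub, and then only intersecting the cached sets. It assumes each file has the username in the first column and the karma in the second (as in the question's code), no header row, and that everything fits in memory. The names `load_users`, `build_cache` and `common_counts` are hypothetical, and the standard-library csv module stands in for datatable/pandas only to keep the example self-contained.

```python
import csv
from itertools import combinations


def load_users(path, drops, threshold=30):
    """Read one sub's CSV once; return the users above the karma threshold."""
    users = set()
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            # Assumption: column 0 is the username, column 1 is the karma.
            user, score = row[0], row[1]
            if user not in drops and float(score) > threshold:
                users.add(user)
    return frozenset(users)


def build_cache(paths, drops, threshold=30):
    """Filter every file exactly once, keyed by its path."""
    return {path: load_users(path, drops, threshold) for path in paths}


def common_counts(cache):
    """Yield (path_a, path_b, shared_user_count) for every unordered pair."""
    for (a, users_a), (b, users_b) in combinations(cache.items(), 2):
        yield a, b, len(users_a & users_b)
```

Sorting the usernames alphabetically should not matter with this approach, because set membership is hash-based rather than order-based. The bot filter and the karma threshold are applied at load time, so changing the threshold means rebuilding the cache.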



Fastest way to get the size of the union of two lists/sets? Coding efficiency
