Is there an easy way to identify course codes and course names from universities' UI pages?
Question
I need to pull course codes and course names from university course catalogs. However, I need to do this for all universities, and writing code for each page of every university is a daunting task. One solution is to get the raw text from the HTML pages and then extract the course codes and course names from it. However, the format of course codes and course names keeps changing, so this cannot be done with regex. I thought of using NLP to train models, but again this would need a lot of training data, for which I would have to manually identify the course codes and course names. Is there a package or method I could use to get these course codes and course names easily from raw text?
Answer 1
Score: 2
You should probably look for the XHR requests that load the data if the site is a dynamic SPA or component-based system. If the sites use APIs, I would look for those and find the similarities; most likely their JSON data keys won't vary too much. However, it is what it is. You're scraping data, which is not generally viewed as the most ethical thing to do, so I think some legwork should be expected. If they wanted you to pull their data easily, they would offer an API you could interface with; if they don't, you are in something of a grey area and shouldn't expect a silver-bullet solution. You can still try NLP or OCR, but if, as you say, the sites are not similar to one another, then until you have collected training data a model won't guess much better than what you can already do with regex or HTML parsing on the text.
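To illustrate the "similar JSON keys" idea, here is a minimal sketch that walks an already-decoded JSON response and collects anything that looks like a course record. The key names in `CODE_KEYS` and `NAME_KEYS` and the two sample payloads are my own assumptions, not taken from any real university API; you would fill them in after looking at the XHR responses in the browser's network tab.

```python
import json
from typing import Any, Iterator, Tuple

# Hypothetical key names (compared after normalization); real APIs will
# differ, so extend these sets as you inspect more sites.
CODE_KEYS = {"code", "coursecode", "subjectcode"}
NAME_KEYS = {"title", "name", "coursename", "coursetitle"}


def _normalize(key: str) -> str:
    """Lowercase a key and drop separators so courseCode == course_code."""
    return key.replace("_", "").replace("-", "").lower()


def find_courses(node: Any) -> Iterator[Tuple[str, str]]:
    """Recursively walk decoded JSON and yield (code, name) pairs."""
    if isinstance(node, dict):
        keys = {_normalize(k): k for k in node}
        code_key = next((keys[k] for k in keys if k in CODE_KEYS), None)
        name_key = next((keys[k] for k in keys if k in NAME_KEYS), None)
        if code_key is not None and name_key is not None:
            code, name = node[code_key], node[name_key]
            if isinstance(code, str) and isinstance(name, str):
                yield code.strip(), name.strip()
        # Keep descending: course lists are often nested a few levels deep.
        for value in node.values():
            yield from find_courses(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_courses(item)


# Two made-up payloads with different shapes and key spellings.
payload_a = json.loads(
    '{"data": {"courses": [{"courseCode": "CS 101", "title": "Intro to Programming"}]}}'
)
payload_b = json.loads(
    '[{"course_code": "MATH-201", "course_name": "Linear Algebra", "credits": 3}]'
)

for payload in (payload_a, payload_b):
    print(list(find_courses(payload)))
```

The point of normalizing keys is that one small alias list can cover several sites, so the per-university work shrinks to finding the endpoint rather than writing a new parser each time.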
Potentially, you might be able to find models or spaces with helpful information on OCR, NLP, or other AI/ML approaches on sites like GitHub, Hugging Face, Kaggle, and more.
It seems hard to find a good dataset for course data, but I happened to come across this one on GitHub. However, I cannot guarantee the link will stay valid forever, so it may be worth grabbing the data while you can.
https://github.com/Siddharth1698/Coursera-Course-Dataset
@abhi, I came across this today, and it should be very helpful for you. You need to install something like Puppeteer or Selenium so that the JavaScript can execute. Your issue here is that you are only rendering the static content.
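A rough sketch of the Selenium route, assuming Chrome and the `selenium` package are installed. The `wait_css` selector, the timeout, and the helper names are placeholders of mine; pick a selector that only appears once the course list has actually rendered. The second helper just flattens the rendered HTML into text lines using the standard library, so you can feed it to whatever extraction step you settle on.

```python
from html.parser import HTMLParser
from typing import List


def fetch_rendered_html(url: str, wait_css: str = "body", timeout: float = 15.0) -> str:
    """Load url in headless Chrome and return the DOM after JavaScript has run."""
    # Imported lazily so visible_text() below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumes a recent Chrome build
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the chosen element exists, i.e. the page has rendered.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> List[str]:
    """Return the non-empty text fragments of an HTML document, in order."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.parts
```

Usage would be something like `visible_text(fetch_rendered_html(catalog_url, wait_css=".course"))`, where both `catalog_url` and `.course` stand in for values specific to the site you are looking at.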