Go Web应用程序+ MySql驱动程序停顿/超时

huangapple go评论98阅读模式
英文:

Go Web Application + MySql driver Stalling / Timing Out

问题

更新:经过进一步诊断,go-sql-driver/mysql I驱动程序包存在问题。事实证明,底层的tcp似乎无法检测到断开的tcp连接。完整的详细信息在下面的go-sql-driver/mysql项目的github问题中:

https://github.com/go-sql-driver/mysql/issues/257

--

我目前在经历一个在15到48分钟的空闲期后出现的停顿中断的网络应用程序问题。最关键的问题如下所述:

  • 访问网站上的任何URL,并完全加载页面(即,页面实际加载完成,并且日志显示页面已完全加载)。
  • 关闭浏览器,然后等待。

典型的请求日志如下所示:

2014/07/13 15:29:54 INFO template rendering: index
2014/07/13 15:29:54 METRIC, URL: /, HANDLER TIME: 7.2339ms, CTX TIME: 5.0894ms, TOTAL TIME: 12.3258ms

经过很长一段时间(从15分钟到48分钟不等),系统突然记录了下面这些行,没有任何交互-网络应用程序在整个时间内都处于空闲状态:

[MySQL] 2014/07/13 16:00:09 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out
[MySQL] 2014/07/13 16:00:09 packets.go:118: write tcp remote-mysql-server-address:3306: broken pipe
2014/07/13 16:00:10 INFO template rendering: index
2014/07/13 16:00:10 METRIC, URL: /, HANDLER TIME: 8.8574ms, CTX TIME: 31m19.2606723s, TOTAL TIME: 31m19.2695329s

注意"TOTAL TIME"是31分钟19秒吗?还注意到在同一时间记录的MySql驱动程序错误吗?

没有任何活动/网络请求。网络应用程序只是空闲。

最关键的问题是在这些日志消息之后发生的事情:紧接着的下一个网络请求完全停顿,永远不会返回响应

user@govm1:~$ wget http://localhost
--2014-07-13 17:11:18--  http://localhost/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... Read error (Connection timed out) in headers.
Retrying.

--2014-07-13 17:26:19--  (try: 2)  http://localhost/
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html.4'

    [ <=>                                                                                  ] 6,310       --.-K/s   in 0.001s

2014-07-13 17:26:20 (9.61 MB/s) - `index.html.4' saved [6310]

然后它处于空闲状态,没有响应,直到wget超时为止。

现在,如果我在此请求停顿时立即进行第二个或第三个请求,并且在其停顿期间的任何时间进行请求,go web应用程序将响应并返回其他请求的完整页面。没有问题。然后,循环从我最后发出的请求开始,并让它处于空闲状态。

在这15分钟后,你可以猜到接下来记录了什么:

[MySQL] 2014/07/13 17:26:57 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out
[MySQL] 2014/07/13 17:26:57 packets.go:118: write tcp remote-mysql-server-address:3306: broken pipe
2014/07/13 17:26:57 INFO template rendering: index
2014/07/13 17:26:57 METRIC, URL: /, HANDLER TIME: 6.8938ms, CTX TIME: 15m39.1718434s, TOTAL TIME: 15m39.1787398s

另外15分钟的等待时间。

我排除了Windows Azure、集群VIP和运行go web应用程序的防火墙/Linux VM作为问题,因为我在同一台机器上本地运行wget http://localhost,我得到了这个永远不会完成和不会返回任何内容的"停顿"请求。

--

我的Web应用程序中有许多因素,因此我将尝试相应地概述它们。

使用:

  • Go 1.3
  • go-sql-driver/mysql ## Version 1.2 (2014-06-03)
  • Ubuntu 12.04 LTS,~2014年6月更新
  • Windows Azure

请注意,运行MySql的Linux框是运行GoLang应用程序集群的不同Linux框-它们位于单独的专用云服务中。MySql虚拟机是单个虚拟机,没有集群。

这是一些相关的代码:

// 全局的DB处理程序
var db *sql.DB

// CLI参数
var dbdsn string

func init() {

	flag.StringVar(&amp;dbdsn, "dbdsn", "root:root@tcp(localhost:3306)/prod?timeout=5s&amp;tls=false&amp;autocommit=true", "Specifies the MySql DSN connection.")
	flag.Parse()

	var err error
	db, err = sql.Open("mysql", dbdsn)
	if err != nil {
		log.Printf("ERROR in sql.Open(): %v", err)
	}

	//db.SetMaxIdleConns(5)

	// 验证DSN是否正确设置
	err = db.Ping()
	if err != nil {
		panic("PANIC when pinging db: " + err.Error()) // 在您的应用程序中使用适当的错误处理而不是panic
	}
}

// **********
// * 省略了Gorilla MUX路由器和http处理程序的注册
// **********

func ArticleHandler(w http.ResponseWriter, r *http.Request, c *Context) (err error) {

	m := NewArticle(c)
	id := c.Vars["id"]

	var pid int
	var title, body, excerpt, date, slug, fi, fv, region, region_slug string
	err = db.QueryRow(
		"SELECT p.ID, p.post_title, p.post_content, p.post_excerpt, p.post_date, p.post_name, "+
			"(SELECT fpim.meta_value FROM wp_postmeta fpim WHERE fpim.meta_key = '_wp_attached_file' AND fpim.post_id = (SELECT fpim2.meta_value FROM wp_postmeta fpim2 WHERE fpim2.post_id = p.ID AND fpim2.meta_key = '_thumbnail_id' LIMIT 1) LIMIT 1) AS featured_image, "+
			"(SELECT fpim3.meta_value FROM wp_postmeta fpim3 WHERE fpim3.meta_key = 'fv_video' AND fpim3.post_id = p.ID LIMIT 1) AS featured_video, "+
			"t.name as region, t.slug as region_slug "+
			"FROM wp_posts p "+
			"JOIN wp_term_relationships tr ON tr.object_id=p.ID "+
			"JOIN wp_term_taxonomy tt ON tt.term_taxonomy_id=tr.term_taxonomy_id "+
			"JOIN wp_terms t ON t.term_id=tt.term_id "+
			"WHERE p.post_name=? AND p.post_type='post' AND p.post_status='publish' AND p.post_date <= UTC_TIMESTAMP()"+
			"AND tr.object_id=p.ID AND tt.parent = (SELECT t3.term_id FROM wp_terms t3 WHERE t3.name=? LIMIT 1) LIMIT 1",
		id, RegionsParentCategory).
		Scan(&amp;pid, &amp;title, &amp;body, &amp;excerpt, &amp;date, &amp;slug, &amp;fi, &amp;fv, &amp;region, &amp;region_slug)

	if err != nil {
		if err == sql.ErrNoRows {

			// 省略了重定向的代码

			// 文章未找到
			return handleNotFound(w, r, c)

		} else {
			log.Printf("ERROR in .Scan(): %v", err)
		}
	} else {
		m.Region = Region{
			Name: region,
			Slug: region_slug,
		}
		m.Id = pid
		m.Title = title
		m.Body = template.HTML(body) // 渲染原始HTML
		m.Excerpt = excerpt
		m.Datetime = date
		m.Slug = slug
		m.FeaturedImageUrl = fi
		m.FeaturedVideoUrl = fv
	}

	web.RenderTemplate(w, "article", m)
	return
}

每个请求还有5个以上的DB查询

除了这个查询之外,您在处理程序中看到的"Context"每次运行4到6个额外的SQL查询。因此,每个"article"处理程序运行大约5到7个SQL查询,最少使用完全相同的模式和您在上面看到的*db全局变量。

超时/错误总是发生在相同的DB查询上

这是一个"context"查询的示例:

rows2, err := db.Query(
	"SELECT p.post_title, p.post_name "+
		"FROM wp_posts p "+
		"WHERE p.post_type='page' AND p.post_status='publish' AND p.post_date <= UTC_TIMESTAMP() "+
		"AND p.post_parent = (SELECT p2.ID FROM wp_posts p2 WHERE p2.post_name=? LIMIT 1) "+
		"ORDER BY p.menu_order",
	FooterPagesParentNameSlug)
if err != nil {
	log.Printf("ERROR in AllPages .Query() : %v", err)
} else {
	defer rows2.Close()
	c.AllFooterPages = make([]FooterPage, 0)
	for rows2.Next() {
		var name, slug string
		err := rows2.Scan(&amp;name, &amp;slug)
		if err != nil {
			log.Printf("ERROR in AllPages row.Scan() : %v", err)
		} else {
			p := FooterPage{
				Page: Page{
					Title: name,
					Slug:  slug,
				},
			}
			c.AllFooterPages = append(c.AllFooterPages, p)
		}
	}
}

没有什么特别的。

我只在没有错误时调用defer rows2.Close()。也许这是问题的一部分?这个特定的SQL查询似乎在负载测试中记录为no response或mysql驱动程序超时的错误。

问题

为什么在空闲站点上会记录超过15到30分钟的请求超时?这似乎是我正在使用的mysql驱动程序的一个错误,可能是保持连接打开。但是,最后一个http请求成功并返回了一个完整的页面+模板。

我甚至在连接字符串中设置了超时,为5秒。即使是与mysql服务器有问题,为什么会记录15分钟的超时/请求?那个请求是从哪里来的?

它仍然可能是MySql驱动程序的问题,阻止请求完成-可能被MySql专用VM和那里的问题阻止。如果是这样,为什么没有记录任何内容?这个随机的15m到49m分钟的超时是什么?通常只记录15m或31m,但有时会记录48m。

在超时(@15m、31m和48m)中的"15m"倍数中非常有趣。

提前谢谢。

英文:

UPDATE: After further diagnosis, there is a problem with the go-sql-driver/mysql I driver package. It turns out that the underlying tcp can't seem to detect a broken tcp connection. The full details are in the github issue in the go-sql-driver/mysql project below:

https://github.com/go-sql-driver/mysql/issues/257

--

I am currently experiencing a stalling or broken web app after a period of idle between 15 to 48 minutes. The most critical issue is described below:

  • Visit a URL, any url on the site, and load the page completely (as in, the page actually loads and the logs show a complete page has loaded).
  • Close browser, and wait.

A typical request is logged like this:

2014/07/13 15:29:54 INFO template rendering: index
2014/07/13 15:29:54 METRIC, URL: /, HANDLER TIME: 7.2339ms, CTX TIME: 5.0894ms, TOTAL TIME: 12.3258ms

After a long period of time (ranging from 15m to 48m), the system all of a sudden logs these lines below with no interaction - the web app has been idle this entire time:

[MySQL] 2014/07/13 16:00:09 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out
[MySQL] 2014/07/13 16:00:09 packets.go:118: write tcp remote-mysql-server-address:3306: broken pipe
2014/07/13 16:00:10 INFO template rendering: index
2014/07/13 16:00:10 METRIC, URL: /, HANDLER TIME: 8.8574ms, CTX TIME: 31m19.2606723s, TOTAL TIME: 31m19.2695329s

Notice the "TOTAL TIME" is 31 minutes and 19 seconds? Also, notice the MySql driver error that is logged at the same time?

There was no activity / no web request made. The web app was simply idle.

The most critical issue is what comes next after these log messages: the very next web request is stalls completely, never returning a response:

user@govm1:~$ wget http://localhost
--2014-07-13 17:11:18--  http://localhost/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... Read error (Connection timed out) in headers.
Retrying.

--2014-07-13 17:26:19--  (try: 2)  http://localhost/
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html.4&#39;

    [ &lt;=&gt;                                                                                  ] 6,310       --.-K/s   in 0.001s

2014-07-13 17:26:20 (9.61 MB/s) - `index.html.4&#39; saved [6310]

And it sits idle, no response, for 15 minutes until wget times out.

Now, if I make a 2nd or 3rd request immediately after this one is stalled and anytime while it is stalled, the go web app responds and returns a full page for other requests. No issues. And then, the cycle starts over from the last request I make and let it site idle.

After this 15m, you can guess exactly what is logged next:

[MySQL] 2014/07/13 17:26:57 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out
[MySQL] 2014/07/13 17:26:57 packets.go:118: write tcp remote-mysql-server-address:3306: broken pipe
2014/07/13 17:26:57 INFO template rendering: index
2014/07/13 17:26:57 METRIC, URL: /, HANDLER TIME: 6.8938ms, CTX TIME: 15m39.1718434s, TOTAL TIME: 15m39.1787398s

Another 15m wait time.

I eliminated Windows Azure, the Cluster VIP and Firewall/Linux VM running the go web app as an issue because I ran wget http://localhost locally on the same box, and I get this "stalled" request that never completes and never sends back anything.

--

There are a number of factors in my web app so I will try to outline them accordingly.

Using:

  • Go 1.3
  • go-sql-driver/mysql ## Version 1.2 (2014-06-03)
  • Ubuntu 12.04 LTS, ~June 2014 Updates
  • Windows Azure

Do note that the Linux box running MySql is a different Linux box running the cluster of GoLang apps - and they are in separate dedicated Cloud Services. The MySql vm is a single VM, no cluserting.

Here is some related code:

// global handler for our DB
var db *sql.DB

// CLI parameter
var dbdsn string

func init() {

	flag.StringVar(&amp;dbdsn, &quot;dbdsn&quot;, &quot;root:root@tcp(localhost:3306)/prod?timeout=5s&amp;tls=false&amp;autocommit=true&quot;, &quot;Specifies the MySql DSN connection.&quot;)
	flag.Parse()

	var err error
	db, err = sql.Open(&quot;mysql&quot;, dbdsn)
	if err != nil {
		log.Printf(&quot;ERROR in sql.Open(): %v&quot;, err)
	}

	//db.SetMaxIdleConns(5)

	// verify the DSN is setup properly1
	err = db.Ping()
	if err != nil {
		panic(&quot;PANIC when pinging db: &quot; + err.Error()) // proper error handling instead of panic in your app
	}
}

// **********
// * omitted is the Gorilla MUX router and http handler registrations
// **********

func ArticleHandler(w http.ResponseWriter, r *http.Request, c *Context) (err error) {

	m := NewArticle(c)
	id := c.Vars[&quot;id&quot;]

	var pid int
	var title, body, excerpt, date, slug, fi, fv, region, region_slug string
	err = db.QueryRow(
		&quot;SELECT p.ID, p.post_title, p.post_content, p.post_excerpt, p.post_date, p.post_name, &quot;+
			&quot;(SELECT fpim.meta_value FROM wp_postmeta fpim WHERE fpim.meta_key = &#39;_wp_attached_file&#39; AND fpim.post_id = (SELECT fpim2.meta_value FROM wp_postmeta fpim2 WHERE fpim2.post_id = p.ID AND fpim2.meta_key = &#39;_thumbnail_id&#39; LIMIT 1) LIMIT 1) AS featured_image, &quot;+
			&quot;(SELECT fpim3.meta_value FROM wp_postmeta fpim3 WHERE fpim3.meta_key = &#39;fv_video&#39; AND fpim3.post_id = p.ID LIMIT 1) AS featured_video, &quot;+
			&quot;t.name as region, t.slug as region_slug &quot;+
			&quot;FROM wp_posts p &quot;+
			&quot;JOIN wp_term_relationships tr ON tr.object_id=p.ID &quot;+
			&quot;JOIN wp_term_taxonomy tt ON tt.term_taxonomy_id=tr.term_taxonomy_id &quot;+
			&quot;JOIN wp_terms t ON t.term_id=tt.term_id &quot;+
			&quot;WHERE p.post_name=? AND p.post_type=&#39;post&#39; AND p.post_status=&#39;publish&#39; AND p.post_date &lt;= UTC_TIMESTAMP()&quot;+
			&quot;AND tr.object_id=p.ID AND tt.parent = (SELECT t3.term_id FROM wp_terms t3 WHERE t3.name=? LIMIT 1) LIMIT 1&quot;,
		id, RegionsParentCategory).
		Scan(&amp;pid, &amp;title, &amp;body, &amp;excerpt, &amp;date, &amp;slug, &amp;fi, &amp;fv, &amp;region, &amp;region_slug)

	if err != nil {
		if err == sql.ErrNoRows {

			// snipped code for redirects

			// article was not found
			return handleNotFound(w, r, c)

		} else {
			log.Printf(&quot;ERROR in .Scan(): %v&quot;, err)
		}
	} else {
		m.Region = Region{
			Name: region,
			Slug: region_slug,
		}
		m.Id = pid
		m.Title = title
		m.Body = template.HTML(body) // render the raw html
		m.Excerpt = excerpt
		m.Datetime = date
		m.Slug = slug
		m.FeaturedImageUrl = fi
		m.FeaturedVideoUrl = fv
	}

	web.RenderTemplate(w, &quot;article&quot;, m)
	return
}

5 more DB queries, per request

In addition to this query, my "Context" you see being passed into the handler runs 4 to 6 additional SQL queries. Therefore, each "article" handler that loads runs about 5 to 7 SQL queries, minimal, using the exact same pattern and *db global variable you see above.

Timeouts / errors are always on the same DB query

Here's one of the "context" queries as a comparison:

rows2, err := db.Query(
	&quot;SELECT p.post_title, p.post_name &quot;+
		&quot;FROM wp_posts p &quot;+
		&quot;WHERE p.post_type=&#39;page&#39; AND p.post_status=&#39;publish&#39; AND p.post_date &lt;= UTC_TIMESTAMP() &quot;+
		&quot;AND p.post_parent = (SELECT p2.ID FROM wp_posts p2 WHERE p2.post_name=? LIMIT 1) &quot;+
		&quot;ORDER BY p.menu_order&quot;,
	FooterPagesParentNameSlug)
if err != nil {
	log.Printf(&quot;ERROR in AllPages .Query() : %v&quot;, err)
} else {
	defer rows2.Close()
	c.AllFooterPages = make([]FooterPage, 0)
	for rows2.Next() {
		var name, slug string
		err := rows2.Scan(&amp;name, &amp;slug)
		if err != nil {
			log.Printf(&quot;ERROR in AllPages row.Scan() : %v&quot;, err)
		} else {
			p := FooterPage{
				Page: Page{
					Title: name,
					Slug:  slug,
				},
			}
			c.AllFooterPages = append(c.AllFooterPages, p)
		}
	}
}

Nothing special there.

I do call defer rows2.Close() only if there was no error. Perhaps that is part of the issue? This particular SQL query seems to log errors under load tests as no response or mysql driver timing out.

Questions

Why am I getting request timeouts logged in excess of 15 to 30 minutes, from an idle site? That seems like a bug with the mysql driver I am using, possibly holding a connection open. But, the last http request was successful and returned a complete page + template.

I even have the Timeout set in the connection string, which is 5 seconds. Even if it is a problem with the mysql server, why the 15 minute timeout/request logged? Where did that request come from?

It still could be a MySql driver issue, blocking the request from completing - maybe being blocked by the MySql dedicated VM and an issue there. If that is the case, then how come nothing is logged? What is this random timeout of 15m to 49m minutes? It is usually only 15m or 31m, but sometimes 48m is logged.

It is very interesting on the "15m" multiples there in the timeouts (@15m, 31m and 48m), allotting for some padding there in seconds.

Thanks in advance.

答案1

得分: 1

在init函数中永远不要使用defer db.Close()。init函数在main函数执行之前结束,因此永远无法访问到一个打开的连接池。
你可以在main函数中调用defer db.Close()

这也可能是你的预处理语句的问题,它们属于连接池,在调用db.Close()时无效。

关于超时问题,这是一个驱动程序端的超时(从问题中得知)。

请参考我在https://github.com/go-sql-driver/mysql/issues/257的评论。

英文:

Never defer db.Close() in init. init ends before main is executed and will therefore never have access to an open connection pool.
You may call defer db.Close() in your main, though.

That's probably also the issue with your prepared statements, they belong to the connection pool and are invalid when db.Close() is called.

Regarding the timeout, it is a driver side timeout (got here from the issue).

See my comment at https://github.com/go-sql-driver/mysql/issues/257

答案2

得分: 0

我应该在每个网络请求/处理程序中“打开”并延迟 db.Close() 吗?

不需要。要么创建一个全局变量并不用担心关闭它(就像你现在做的那样),要么通过应用程序上下文传递连接池(*sql.DB),例如按照 https://medium.com/@benbjohnson/structuring-applications-in-go-3b04be4ff091 的方式,将处理程序作为嵌入了 *sql.DB 和其他可能需要的内容的上下文类型的方法。

你还可以考虑使用 http://jmoiron.github.io/sqlx/ 来帮助将数据库结果转换为结构体/映射,而无需自己编写代码。

此外(这将有助于调试问题,我敢打赌),我建议你修复第二个代码示例,使其与下面的示例匹配-因为当遇到错误时,你只是记录了错误,但代码继续执行而没有返回:

rows2, err := db.Query(`
    SELECT p.post_title, p.post_name
    FROM wp_posts p 
    WHERE p.post_type='page' AND p.post_status='publish' AND p.post_date <= UTC_TIMESTAMP()
    AND p.post_parent = (SELECT p2.ID FROM wp_posts p2 WHERE p2.post_name=? LIMIT 1)
    ORDER BY p.menu_order`, FooterPagesParentNameSlug)
if err != nil {
    log.Printf("ERROR in AllPages .Query() : %v", err)
    return err
}
defer rows2.Close()

c.AllFooterPages = make([]FooterPage, 0)

for rows2.Next() {
    var name, slug string
    err := rows2.Scan(&name, &slug)
    if err != nil {
        log.Printf("ERROR in AllPages row.Scan() : %v", err)
        return err // 这里也是一样的!
    }
    // 你的代码的其余部分
}

希望能对你有所帮助!

英文:

> Should I be "opening" and defer db.Close() on every single web request/handler?

No. Either create a global and don't worry about closing it (as you have now) OR pass the pool (*sql.DB) via an app context i.e. as per https://medium.com/@benbjohnson/structuring-applications-in-go-3b04be4ff091 by having your handlers as methods on a context type that embeds *sql.DB and anything else you might need.

You may also want to look at using http://jmoiron.github.io/sqlx/ to help marshal your database results into structs/maps without having to do the dance yourself.


As an aside (that will help debug the issue, I'd bet) I'd fix your second code example to match the below - because when you encounter an error you're only logging it, but your code continues as you don' return:

rows2, err := db.Query(`
    SELECT p.post_title, p.post_name
    FROM wp_posts p 
    WHERE p.post_type=&#39;page&#39; AND p.post_status=&#39;publish&#39; AND p.post_date &lt;= UTC_TIMESTAMP()
    AND p.post_parent = (SELECT p2.ID FROM wp_posts p2 WHERE p2.post_name=? LIMIT 1)
    ORDER BY p.menu_order`, FooterPagesParentNameSlug)
if err != nil {
    log.Printf(&quot;ERROR in AllPages .Query() : %v&quot;, err)
	return err
}
defer rows2.Close()

c.AllFooterPages = make([]FooterPage, 0)

for rows2.Next() {
	var name, slug string
	err := rows2.Scan(&amp;name, &amp;slug)
	if err != nil {
		log.Printf(&quot;ERROR in AllPages row.Scan() : %v&quot;, err)
		return err // Same here!
    }
// Rest of your code
}

huangapple
  • 本文由 发表于 2014年7月12日 08:18:33
  • 转载请务必保留本文链接:https://go.coder-hub.com/24708372.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定