上周碰到一个nginx报错, 在本地环境尝试进行还原, 得到以下输出:
2017/04/17 11:42:40 [emerg] 11380#0: pread() returned only 0 bytes instead of 4091 in /usr/local/nginx/conf/vhost/test.conf:1822
2017/04/17 11:42:40 [emerg] 11384#0: pread() returned only 0 bytes instead of 4095 in /usr/local/nginx/conf/vhost/test.conf:3543
2017/04/17 11:48:53 [emerg] 11796#0: pread() returned only 0 bytes instead of 3451 in /usr/local/nginx/conf/vhost/test.conf:2759
此时看到配置文件有报错, 然后进行配置文件的vim查看, 可以发现文件中没有错误, 再执行一次nginx -t的配置文件检测命令, 也一切正常:
[root@localhost vhost]# /usr/local/nginx/sbin/nginx -tthe configuration file /usr/local/nginx/conf/nginx.conf syntax is okconfiguration file /usr/local/nginx/conf/nginx.conf test is successful
很奇怪的问题, 不知道为什么会有这种情况发生。 通过报错信息来看, 是调用了linux c下的pread函数时返回了空的数据。
linux c下pread函数的说明:
通过阅读nginx源码, 定位错误在ngx_http_geo_module.c文件中的ngx_http_geo_include_binary_base方法中, 该方法中有一块定义如下:
size = (size_t) ngx_file_size(&fi); mtime = ngx_file_mtime(&fi); ch = name->data[name->len - 4]; name->data[name->len - 4] = '\0'; if (ngx_file_info(name->data, &fi) == NGX_FILE_ERROR) { ngx_conf_log_error(NGX_LOG_CRIT, cf, ngx_errno, ngx_file_info_n " \"%s\" failed", name->data); goto failed; } name->data[name->len - 4] = ch; if (mtime < ngx_file_mtime(&fi)) { ngx_conf_log_error(NGX_LOG_WARN, cf, 0, "stale binary geo range base \"%s\"", name->data); goto failed; } base = ngx_palloc(ctx->pool, size); if (base == NULL) { goto failed; } n = ngx_read_file(&file, base, size, 0); if (n == NGX_ERROR) { ngx_conf_log_error(NGX_LOG_CRIT, cf, ngx_errno, ngx_read_file_n " \"%s\" failed", name->data); goto failed; } if ((size_t) n != size) { ngx_conf_log_error(NGX_LOG_CRIT, cf, 0, ngx_read_file_n " \"%s\" returned only %z bytes instead of %z", name->data, n, size); goto failed; }
然后发现, 其调用的ngx_read_file函数, 在ngx_files.c文件中找到该函数的定义方法:
ssize_tngx_read_file(ngx_file_t *file, u_char *buf, size_t size, off_t offset){ ssize_t n; ngx_log_debug4(NGX_LOG_DEBUG_CORE, file->log, 0, "read: %d, %p, %uz, %O", file->fd, buf, size, offset);#if (NGX_HAVE_PREAD) n = pread(file->fd, buf, size, offset); if (n == -1) { ngx_log_error(NGX_LOG_CRIT, file->log, ngx_errno, "pread() \"%s\" failed", file->name.data); return NGX_ERROR; }#else if (file->sys_offset != offset) { if (lseek(file->fd, offset, SEEK_SET) == -1) { ngx_log_error(NGX_LOG_CRIT, file->log, ngx_errno, "lseek() \"%s\" failed", file->name.data); return NGX_ERROR; } file->sys_offset = offset; } n = read(file->fd, buf, size); if (n == -1) { ngx_log_error(NGX_LOG_CRIT, file->log, ngx_errno, "read() \"%s\" failed", file->name.data); return NGX_ERROR; } file->sys_offset += n;#endif file->offset += n; return n;}
可以发现, 该方法是调用pread去读取对应的文件, 然后在不影响position的情况下返回对应的bytes。
到这里问题可以确定下来, 是由于在读取文件时, 文件被更新导致发生的错误。
因为报错信息明确说明是配置文件的报错, 所以与nginx的open_file_cache参数无关, 同时也没有用到nginx的open_file_cache配置, 联想到实现的slb的发布机制,是采用的多进程的方式(无锁)去即时写入配置文件并在管理机上检测配置文件的语法正确性, 初步确认是因为"多进程"写文件导致的nginx调用pread方法读取文件报错。
于是找了线上的一个5000多行的配置文件down到本地, 写了一个脚本来进行模拟slb的操作(多线程):
#!/usr/bin/env python# encoding: utf-8from __future__ import print_functionimport timeimport sysimport multiprocessingfrom commands import getstatusoutputreload(sys)sys.setdefaultencoding("utf-8")def create_ngx_file(b): print("excute time->", time.time()) filename = "/tmp/test.conf" content = open(filename, "rb").read() ngx_path_file = "/usr/local/nginx/conf/vhost/test.conf" with open(ngx_path_file , 'wb') as fp: fp.write(content) check_nginx_synax() re_write()def re_write(): filename = "/tmp/abc.test" content = open(filename, "rb").read() ngx_path_file = "/usr/local/nginx/conf/vhost/test.conf" with open(ngx_path_file , 'wb') as fp: fp.write(content)def check_nginx_synax(): status, result = getstatusoutput('/usr/local/nginx/sbin/nginx -t') if status: print(u'配置文件测试失败: {0}'.format(', '.join(result.split('\n')))) raise u'[Config] The Ngx config file test failed, reason: {0}'.format(result) else: print(u'测试配置文件成功: {0}'.format(result.split('\n')[0])) print(u"end time->", time.time())if __name__ == "__main__": r = [] pool = multiprocessing.Pool(10) arg = ['a']*100 result = pool.imap(create_ngx_file,arg) while True: try: i = result.next() r.append(i) except StopIteration: pool.close() pool.join() break pool.close() pool.join()
abc.test文件是随便写的一个配置文件, 用来模拟更新操作。
然后执行调用时间发现报错:
excute time-> 1492403282.56配置文件测试失败: nginx: [emerg] pread() returned only 0 bytes instead of 4080 in /usr/local/nginx/conf/vhost/test.conf:509, configuration file /usr/local/nginx/conf/nginx.conf test failed
OK, 错误已经复现, 基本确定是这个问题, 后续进行了一下nginx读取配置文件的测试, 发现基本都是在10~30ms左右读取完成的配置文件, 多进程正好又在这个时间内更新了正在读取的文件.......
解决方法: 给slb的操作加上锁, 或者做一个发布队列就可以解决这个问题啦~