pept.utilities.read_csv_chunks
- pept.utilities.read_csv_chunks(filepath_or_buffer, chunksize, skiprows=None, nrows=None, dtype=<class 'float'>, sep='\s+', header=None, engine='c', na_filter=False, quoting=3, memory_map=True, **kwargs)[source]
Read chunks of data from a file lazily, returning numpy arrays of the values.
This function returns a generator: an object that can be iterated over once, creating data on demand. Chunks of data are read only when accessed, making this a more efficient alternative to read_csv for large files (> 1,000,000 lines).
A more convenient and feature-complete alternative is pept.utilities.ChunkReader which is more reusable and can access out-of-order chunks using subscript notation (i.e. data[0]).
This is a convenience function that’s simply a proxy to pandas.read_csv, configured with default parameters for fast reading and parsing of usual PEPT data.
Most importantly, it lazily reads chunks of size chunksize from a space-separated values file at filepath_or_buffer, optionally skipping skiprows lines and reading in nrows lines. It yields numpy.ndarray objects of float values.
The parameters below are passed to pandas.read_csv with no further parsing. Their descriptions are taken from the pandas documentation.
- Parameters
- filepath_or_buffer : str, path object or file-like object
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.
- chunksize : int
Number of lines read in a chunk of data. The underlying pandas.read_csv returns a TextFileReader object for iteration.
- skiprows : list-like, int or callable, optional
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
- nrows : int, optional
Number of rows of file to read. Useful for reading pieces of large files.
- dtype : type name, default float
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}.
- sep : str, default '\s+'
Delimiter to use. Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine.
- header : int, list of int, "infer", optional
Row number(s) to use as the column names, and the start of the data. By default assume there is no header present (i.e. header = None).
- engine : {'c', 'python'}, default 'c'
Parser engine to use. The C engine is faster, while the Python engine is currently more feature-complete.
- na_filter : bool, default True
Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
- quoting : int or csv.QUOTE_* instance, default csv.QUOTE_NONE
Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
- memory_map : bool, default True
If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.
- **kwargs : optional
Extra keyword arguments that will be passed to pandas.read_csv.
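Since the function is a thin proxy over pandas.read_csv, its lazy, chunked behaviour can be sketched as below. This is a minimal illustration under the defaults listed above, not the actual implementation; the name read_csv_chunks_sketch is made up for this example, and memory_map is omitted because it only applies to real file paths, not buffers.

```python
import io

import pandas as pd


def read_csv_chunks_sketch(filepath_or_buffer, chunksize, **kwargs):
    # Lazily yield numpy arrays of float values, one chunk at a time.
    # Passing `chunksize` makes pandas.read_csv return a TextFileReader,
    # which reads rows from the file only as it is iterated.
    reader = pd.read_csv(
        filepath_or_buffer,
        chunksize = chunksize,
        sep = r"\s+",       # space-separated values
        header = None,      # no header row assumed
        dtype = float,
        engine = "c",
        na_filter = False,  # faster when the data has no missing values
        quoting = 3,        # csv.QUOTE_NONE
        **kwargs,
    )
    for chunk in reader:
        yield chunk.values  # numpy.ndarray, shape (<= chunksize, n_columns)


# Usage: read four rows of space-separated data in chunks of two.
data = io.StringIO("1 2 3\n4 5 6\n7 8 9\n10 11 12\n")
chunks = list(read_csv_chunks_sketch(data, chunksize = 2))
```

Each element of chunks is a float numpy.ndarray of up to chunksize rows; because the generator is consumed on demand, only one chunk needs to be held in memory at a time.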