pept.utilities.read_csv_chunks

pept.utilities.read_csv_chunks(filepath_or_buffer, chunksize, skiprows=None, nrows=None, dtype=<class 'float'>, sep='\\s+', header=None, engine='c', na_filter=False, quoting=3, memory_map=True, **kwargs)

Read chunks of data from a file lazily, returning numpy arrays of the values.

This function returns a generator, an object that can be iterated over only once, creating data on demand. Chunks of data are therefore read only when accessed, making this a more efficient alternative to read_csv for large files (over 1,000,000 lines).

A more convenient and feature-complete alternative is pept.utilities.ChunkReader, which is reusable and can access out-of-order chunks using subscript notation (e.g. data[0]).
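As a brief sketch of the difference (the file name is illustrative, and the ChunkReader constructor is assumed to take the same filepath and chunksize arguments):

    from pept.utilities import ChunkReader, read_csv_chunks

    # Generator: chunks can only be consumed once, in order.
    chunks = read_csv_chunks("pept_data.csv", chunksize = 50_000)
    first = next(chunks)            # reads the first chunk from disk

    # ChunkReader: chunks are reusable and can be accessed out of order.
    data = ChunkReader("pept_data.csv", chunksize = 50_000)
    third = data[2]                 # reads the third chunk directly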

This is a convenience function that’s simply a proxy to pandas.read_csv, configured with default parameters for fast reading and parsing of usual PEPT data.

Most importantly, it lazily reads chunks of chunksize lines from a space-separated values file at filepath_or_buffer, optionally skipping skiprows lines and reading in at most nrows lines. It yields numpy.ndarray chunks containing float values.
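For example, a minimal sketch of reading a large PEPT data file in chunks (the file name, chunk size and number of skipped lines are illustrative):

    from pept.utilities import read_csv_chunks

    # Lazily read a space-separated file in chunks of 100,000 lines,
    # skipping the first 15 lines (e.g. a textual header).
    chunks = read_csv_chunks(
        "pept_data.csv",
        chunksize = 100_000,
        skiprows = 15,
    )

    # Each iteration reads one chunk from disk and yields a numpy.ndarray.
    for chunk in chunks:
        print(chunk.shape, chunk.dtype)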

The parameters below are passed to pandas.read_csv with no further parsing; their descriptions are taken from the pandas documentation.

Parameters
filepath_or_bufferstr, path object or file-like object

Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.

chunksizeint

Number of lines to read in each chunk of data.

skiprowslist-like, int or callable, optional

Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

nrowsint, optional

Number of rows of file to read. Useful for reading pieces of large files.

dtypeType name, default float

Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}.

sepstr, default "\s+"

Delimiter to use. Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine.

headerint, list of int, “infer”, optional

Row number(s) to use as the column names, and the start of the data. By default, no header is assumed to be present (i.e. header = None).

engine{‘c’, ‘python’}, default “c”

Parser engine to use. The C engine is faster, while the Python engine is currently more feature-complete.

na_filterbool, default False

Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.

quotingint or csv.QUOTE_* instance, default csv.QUOTE_NONE

Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

memory_mapbool, default True

If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

**kwargsoptional

Extra keyword arguments that will be passed to pandas.read_csv.
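For instance, a small sketch forwarding the standard pandas.read_csv keyword usecols to read only the first four columns of each chunk (the file name is illustrative):

    from pept.utilities import read_csv_chunks

    # `usecols` is forwarded to pandas.read_csv, so each yielded
    # numpy.ndarray contains only columns 0-3.
    chunks = read_csv_chunks(
        "pept_data.csv",
        chunksize = 100_000,
        usecols = [0, 1, 2, 3],
    )

    for chunk in chunks:
        print(chunk.shape)          # (n_lines_in_chunk, 4)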